Title
Machine Learning Models in Stock Market Prediction
Author
Abstract
This paper focuses on predicting the Nifty 50 Index using 8 supervised machine learning models. The techniques used for the empirical study are Adaptive Boosting (AdaBoost), k-Nearest Neighbors (kNN), Linear Regression (LR), Artificial Neural Network (ANN), Random Forest (RF), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), and Decision Tree (DT). Experiments are based on historical data of the Nifty 50 Index of the Indian stock market from 22 April 1996 to 16 April 2021, a time series spanning roughly 25 years. The period covers 6,220 trading days, excluding all non-trading days. The entire dataset was divided into 4 subsets of increasing size: 25%, 50%, 75%, and 100% of the data. Each subset was further split into two parts, training data and testing data. After applying three tests to each subset (a test on the training data, a test on the testing data, and a cross-validation test), the prediction performance of the models was compared, and the comparison yielded very interesting results. The evaluation results indicate that Adaptive Boosting, k-Nearest Neighbors, Random Forest, and Decision Tree underperformed as the size of the dataset increased. Linear Regression and Artificial Neural Network showed almost identical prediction results among all the models, but the Artificial Neural Network took more time to train and validate. After these, Support Vector Machine performed better than the remaining models, but as the dataset size increased, Stochastic Gradient Descent outperformed Support Vector Machine.
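The experimental protocol described above (eight regressors, four nested subsets, a chronological train/test split per subset) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code: it uses a synthetic random-walk price series in place of the Nifty 50 data, lagged closing values as features, an assumed 80/20 train/test split, and scikit-learn's default hyperparameters for each model.

```python
# Hedged sketch of the paper's comparison protocol using scikit-learn.
# Assumptions (not from the paper): synthetic prices, 5 lagged features,
# 80/20 chronological split, default hyperparameters, R^2 as the score.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Stand-in for ~25 years of Nifty 50 closes: a random walk with slight drift.
prices = 100 + np.cumsum(rng.normal(0.05, 1.0, 1000))

LAGS = 5  # predict the next close from the previous 5 closes
X = np.column_stack([prices[i:len(prices) - LAGS + i] for i in range(LAGS)])
y = prices[LAGS:]

# SGD, SVM, and the ANN are scale-sensitive, so they get a StandardScaler.
models = {
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "kNN": KNeighborsRegressor(),
    "LR": LinearRegression(),
    "ANN": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(16,),
                                      max_iter=2000, random_state=0)),
    "RF": RandomForestRegressor(random_state=0),
    "SGD": make_pipeline(StandardScaler(), SGDRegressor(random_state=0)),
    "SVM": make_pipeline(StandardScaler(), SVR()),
    "DT": DecisionTreeRegressor(random_state=0),
}

results = {}
for frac in (0.25, 0.50, 0.75, 1.00):   # the paper's four nested subsets
    n = int(len(X) * frac)
    split = int(n * 0.8)                 # assumed 80/20 chronological split
    for name, model in models.items():
        model.fit(X[:split], y[:split])
        results[(name, frac)] = model.score(X[split:n], y[split:n])  # R^2

for (name, frac), r2 in sorted(results.items()):
    print(f"{name:8s} frac={frac:.2f} R^2={r2:.3f}")
```

With real data, the cross-validation test mentioned in the abstract would use an order-preserving splitter such as `sklearn.model_selection.TimeSeriesSplit` rather than shuffled folds, since the observations are a time series.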