1. Introduction
Predicting stock prices accurately is a formidable challenge due to the inherent volatility and complexity of financial markets. Retail investors, in particular, face significant hurdles as they often lack access to advanced tools and the expertise necessary for sophisticated quantitative trading. Traditional methods such as fundamental and technical analysis have been the mainstay for these investors. However, the advent of machine learning has introduced new possibilities, promising enhanced accuracy and reliability in stock price prediction.
The JPX Tokyo Stock Exchange prediction project aims to leverage advanced machine learning techniques to assist retail investors in making better-informed trading decisions. This research utilizes data from January 4, 2017, to December 3, 2021, providing a comprehensive dataset for developing and testing predictive models. The goal is to create an ensemble model that integrates various machine learning algorithms to improve prediction performance.
Data preprocessing is crucial, especially with financial data, which can be noisy and incomplete. Techniques such as mean imputation for missing values and interquartile range (IQR) filtering for outliers help ensure the dataset is clean and reliable. Feature engineering then enhances model performance by creating new inputs that capture important market trends, such as technical indicators: moving averages, the relative strength index (RSI), and Bollinger Bands.
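As a concrete illustration, the sketch below shows how such a cleaning and feature step might look in pandas. The column name `Close` and the window lengths (20-day moving average, 14-day RSI, 2-sigma Bollinger Bands) are illustrative assumptions, not the exact settings used in this study.

```python
# Sketch of the preprocessing and feature-engineering steps (assumes a
# pandas DataFrame with a "Close" column; window lengths are illustrative).
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Mean-impute missing closes, then clip outliers by the 1.5*IQR rule."""
    df = df.copy()
    df["Close"] = df["Close"].fillna(df["Close"].mean())
    q1, q3 = df["Close"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["Close"] = df["Close"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 20-day moving average, a 14-day RSI, and 2-sigma Bollinger Bands."""
    df = df.copy()
    df["ma_20"] = df["Close"].rolling(20).mean()
    delta = df["Close"].diff()
    avg_gain = delta.clip(lower=0).rolling(14).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi_14"] = 100 - 100 / (1 + avg_gain / avg_loss)
    std_20 = df["Close"].rolling(20).std()
    df["bb_upper"] = df["ma_20"] + 2 * std_20
    df["bb_lower"] = df["ma_20"] - 2 * std_20
    return df
```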
We employ several algorithms: deep neural networks (DNNs) built with Keras, LightGBM, Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), and linear regression (LR). DNNs are known for learning complex patterns from large datasets; using the Keras library, we design DNN models for stock price prediction, experimenting with different architectures and hyperparameters. LightGBM, a gradient boosting framework, is highly effective on high-dimensional, complex data: by building an ensemble of weak learners, typically decision trees, it enhances predictive accuracy.
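A minimal sketch of these two model families follows. The layer widths, dropout rate, and LightGBM parameters are illustrative placeholders, not the tuned hyperparameters used in our experiments.

```python
# Illustrative model definitions; all sizes and parameters are assumptions.
import lightgbm as lgb
from tensorflow import keras

def build_dnn(n_features: int) -> keras.Model:
    """A small feed-forward regressor over tabular features."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1),  # single target, e.g. next-period return
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# LightGBM gradient-boosted trees over the same tabular features.
gbm = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05, num_leaves=63)
```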
Recurrent neural networks (RNNs), particularly LSTM and GRU, excel at modeling time series data because they capture temporal dependencies. These models learn long-term dependencies and sequential patterns, which are essential for predicting future stock prices. Linear regression serves as a fundamental technique, providing a baseline for comparison.
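The recurrent models can be sketched the same way. Below is a minimal single-layer recurrent regressor over a sliding window of past observations; the window length and hidden size are assumptions, and the GRU variant is obtained by swapping one layer.

```python
# Minimal recurrent regressor; input is a sliding window of past
# observations shaped (window, n_features). Sizes are illustrative.
from tensorflow import keras

def build_recurrent(window: int, n_features: int, cell: str = "lstm") -> keras.Model:
    """Single-layer LSTM (or GRU) followed by a linear output head."""
    rnn = keras.layers.LSTM(64) if cell == "lstm" else keras.layers.GRU(64)
    model = keras.Sequential([
        keras.layers.Input(shape=(window, n_features)),
        rnn,
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```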
The final stage integrates these models into an ensemble. Ensemble learning improves overall performance by combining multiple models, leveraging their strengths and mitigating their weaknesses. We draw on techniques such as stacking, bagging, and boosting to develop a robust predictive model. Stacking trains multiple base models and uses their predictions as inputs to a meta-model, typically a simple linear model. Bagging, or bootstrap aggregating, trains multiple models on different data subsets and averages their predictions. Boosting builds models sequentially, with each model correcting the errors of its predecessor.
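Using scikit-learn's StackingRegressor as a stand-in, a stacking step might look like the sketch below. The base learners and the ridge meta-model are illustrative choices, and the Keras models would need a scikit-learn-compatible wrapper to participate.

```python
# Stacking sketch; base learners and meta-model are illustrative choices.
import lightgbm as lgb
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

stack = StackingRegressor(
    estimators=[
        ("lgbm", lgb.LGBMRegressor(n_estimators=500)),
        ("lr", LinearRegression()),
    ],
    final_estimator=Ridge(alpha=1.0),  # simple linear meta-model
    cv=5,  # out-of-fold base predictions train the meta-model
)
# Usage: stack.fit(X_train, y_train); preds = stack.predict(X_test)
```

For time-ordered data, replacing `cv=5` with a `sklearn.model_selection.TimeSeriesSplit` keeps future observations out of the folds used to train the meta-model.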
This research’s contributions include a comprehensive data preprocessing pipeline, extensive feature engineering, multiple machine learning models, and an effective ensemble. Through rigorous preprocessing, feature engineering, and model optimization, we aim to create a robust predictive model surpassing existing approaches. Our findings demonstrate the practical applications of combining different machine learning techniques for retail investors. Future work will refine the model and explore additional features and techniques to enhance performance further.
2. Related Work
In the field of stock price prediction, a variety of machine learning techniques have been explored, each offering distinct advantages and facing specific limitations. Traditional methods such as linear regression (LR) have been widely adopted due to their simplicity and interpretability. However, these methods often struggle to capture the complex, non-linear relationships inherent in stock market data. More advanced approaches are needed to improve predictive accuracy [1].
Artificial Neural Networks (ANNs) represent one such advanced approach, capable of learning from large datasets to identify intricate patterns. Rather et al. [2] demonstrated the potential of ANNs in forecasting stock prices, showcasing their ability to handle non-linear relationships. Despite their strengths, ANNs require significant computational resources and are prone to overfitting, particularly when applied to noisy financial data.
Xu et al. [3] introduced a data preprocessing method that effectively handles missing values and outliers, which is central to our methodology for ensuring clean input data for stock price prediction. Smith et al. [4] provided insights into feature engineering techniques, which we applied to enhance the predictive power of our machine learning models by creating more informative features. Johnson et al. [5] emphasized the importance of cross-validation in model training, which we used to ensure the robustness and generalizability of our stock price prediction models. Gupta et al. [6] discussed parameter tuning for machine learning models, guiding our approach to optimizing the Keras DNN, LightGBM, LSTM, GRU, and LR models.
Liu et al. [7] highlighted the advantages of ensemble models for improved prediction accuracy, which inspired the integration of time series and deep learning models in our study. Chen et al. [8] demonstrated the application of deep learning techniques such as LSTM and GRU to financial data analysis, which we incorporated to capture temporal dependencies in stock price data. Zhou et al. [9] discussed the use of LightGBM for handling large datasets efficiently, which we employed to manage the extensive stock exchange data. Wang et al. [10] provided a framework for combining different machine learning models to enhance prediction accuracy, supporting our ensemble approach. Li et al. [11] explored the application of neural networks to financial prediction, which informed our use of a Keras DNN to model complex patterns in stock price movements.
Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have gained prominence for their ability to model time-dependent data. LSTMs are designed to capture temporal dependencies, making them well-suited for time series forecasting. Brownlee [12] highlighted the effectiveness of LSTMs in various applications, including stock price prediction. Fischer and Krauss [13] successfully applied LSTMs to stock price prediction, demonstrating their capability to capture sequential patterns. However, LSTMs can be computationally intensive and challenging to optimize.
Gradient Boosting Machines, such as LightGBM, offer another promising approach by combining the predictions of multiple weak learners to enhance overall performance. LightGBM has been praised for its efficiency and scalability, making it suitable for large and complex datasets. Ke et al. [14] demonstrated the efficacy of LightGBM in various predictive tasks, including financial predictions. The boosting technique iteratively improves the model by focusing on difficult cases, leading to significant performance gains.
In recent years, ensemble methods that integrate multiple models have been increasingly explored to leverage the strengths of different algorithms. Zhou [15] emphasized that ensemble methods can outperform individual models by reducing variance and bias, thus improving predictive accuracy. A recent study by Fernández-Delgado et al. [16] provided a comprehensive evaluation of different ensemble techniques, confirming their effectiveness in various prediction tasks.
Bagging, or bootstrap aggregating, is a popular ensemble technique that involves training multiple models on different subsets of the data and averaging their predictions. This approach reduces variance and enhances robustness. Barboza et al. [17] highlighted the effectiveness of bagging in improving model stability and accuracy. However, bagging may not fully capture complex interactions in the data, necessitating complementary techniques.
Boosting, another powerful ensemble method, builds models sequentially, with each model focusing on correcting the errors of its predecessors. This iterative improvement can lead to highly accurate predictions. Chen and Guestrin [18] demonstrated the strength of boosting in the form of XGBoost, a scalable and efficient implementation widely used in various applications, including stock price prediction.
Stacking, which involves training multiple base models and using their predictions as inputs to a meta-model, has also shown promise. This technique leverages the strengths of diverse models to improve overall performance. For example, Sagi and Rokach [19] provided a comprehensive survey of ensemble methods, highlighting the potential of stacking to improve predictive performance. However, careful selection of base models and meta-models is crucial to maximize the benefits of stacking.
Recent work by Livieris, Pintelas, and Pintelas [20] explored a CNN-LSTM model for time series forecasting, demonstrating the effectiveness of combining convolutional and recurrent neural networks for financial data. Additionally, Kingma and Ba [21] introduced the Adam optimization algorithm, which has become a standard for training deep learning models due to its efficiency and effectiveness. Another significant contribution is by S. Pawaskar [22], who examined various machine learning algorithms for stock price prediction, highlighting the benefits and limitations of different approaches. Moreover, a study by Nelson, Pereira, and de Oliveira [23] introduced a hybrid model combining LSTM networks and technical indicators for stock price prediction, demonstrating the effectiveness of integrating different neural network architectures. Finally, Qiu and Song [24] investigated the integration of sentiment analysis and machine learning techniques for stock price prediction, showcasing the added value of textual data in improving predictive performance.
Future work will focus on further refining the model by exploring additional features and techniques, as well as adapting the approach to different market conditions. The integration of alternative data sources, such as sentiment analysis from social media, may provide additional insights and improve predictive performance. By continually advancing the methodology, we aim to contribute to the ongoing development of robust and accurate stock price prediction models.