3.3. Data Preparation and Preprocessing
In this study, we used datasets from multiple cryptocurrencies for time series analysis and modeling, including market data for Bitcoin, Dogecoin, Ethereum, and Cardano. These datasets contain historical price information from 2015 to 2021, covering the key metrics Open, High, Low, Close, Adjusted Close (Adj Close), and Volume.
Table 1.
Datasets from multiple cryptocurrencies for time series analysis and modeling.
| | Date | Open | High | Low | Close | Adj Close | Volume |
| 0 | 2015/9/13 | 235.242004 | 235.934998 | 229.332001 | 230.511993 | 230.511993 | 18478800 |
| 1 | 2015/9/14 | 230.608994 | 232.440002 | 227.960999 | 230.643997 | 230.643997 | 20997800 |
| 2 | 2015/9/15 | 230.492004 | 259.182007 | 229.822006 | 230.304001 | 230.304001 | 19177800 |
| 3 | 2015/9/16 | 230.25 | 231.214996 | 227.401993 | 229.091003 | 229.091003 | 20144200 |
| 4 | 2015/9/17 | 229.076004 | 230.285004 | 228.925995 | 229.809998 | 229.809998 | 18935400 |
Data reading and preliminary examination: We first read the data file of each cryptocurrency with the pandas library and examine its structure. For example, the Bitcoin dataset (df_bitcoin) has 2193 records and 7 fields. The check revealed a small number of missing values in every field, about 0.182% per field (roughly 4 of the 2193 records).
Missing value processing: Every dataset requires missing-value handling. All fields in the dataset (Open, High, Low, Close, Adj Close, and Volume) had a small number of missing values. Although the proportion is small, we interpolate or delete the missing values in the subsequent processing steps to ensure the stability and accuracy of the model.
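The missing-value check and the two handling options mentioned above can be sketched as follows. This is a minimal illustration on a small hypothetical frame, not the study's data; the column subset, the values, and the choice of linear interpolation are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one price file; the real data would be read with pd.read_csv.
dates = pd.date_range("2015-09-13", periods=10, freq="D")
df = pd.DataFrame({
    "Date": dates,
    "Close": [230.5, 230.6, np.nan, 229.1, 229.8, 232.0, np.nan, np.nan, 233.5, 234.0],
    "Volume": [18.4e6, 20.9e6, 19.1e6, np.nan, 18.9e6, 19.5e6, 20.1e6, 19.8e6, 20.3e6, 21.0e6],
})

# Share of missing values per field, in percent.
missing_pct = df.isna().mean() * 100

# Option A (assumed method): linear interpolation between neighbouring days.
value_cols = ["Close", "Volume"]
df_interp = df.copy()
df_interp[value_cols] = df_interp[value_cols].interpolate(
    method="linear", limit_direction="both"
)

# Option B: drop every row that has at least one missing field.
df_dropped = df.dropna().reset_index(drop=True)
```

Interpolation keeps the daily index unbroken, which matters for windowed sequence models; deletion is simpler but leaves gaps in the calendar.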
Technical indicators calculation: To enrich the input data of the model, we calculate several common technical indicators and add them to the dataset: the 20-day simple moving average (SMA), the Relative Strength Index (RSI), the Moving Average Convergence Divergence (MACD), and the Average True Range (ATR). The calculation code is as follows:
import ta
# Calculate technical indicators (df holds the OHLCV data of one cryptocurrency)
df['SMA_20'] = ta.trend.sma_indicator(df['Close'], window=20)  # 20-day simple moving average
df['RSI'] = ta.momentum.RSIIndicator(df['Close']).rsi()  # RSI with the library's default window
df['MACD'] = ta.trend.macd(df['Close'])
df['ATR'] = ta.volatility.average_true_range(df['High'], df['Low'], df['Close'])
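To make explicit what the first two indicators measure, the following is a minimal pandas-only sketch of SMA and RSI. The function names `sma` and `rsi`, the 14-day RSI window, and the Wilder-style exponential smoothing are assumptions for illustration; the values may differ slightly from those produced by the ta library.

```python
import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average: mean of the last `window` closing prices."""
    return close.rolling(window=window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI using exponentially smoothed average gains and losses."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # upward moves, zero otherwise
    loss = (-delta).clip(lower=0)     # downward moves as positive numbers
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Hypothetical closing prices with alternating rises and dips.
close = pd.Series(
    [10, 11, 12, 11, 13, 14, 13, 15, 16, 15, 17, 18, 17, 19, 20, 19, 21, 22],
    dtype=float,
)
sma_5 = sma(close, window=5)
rsi_14 = rsi(close, window=14)
```

Both indicators are undefined for the first rows of a series (NaN until a full window is available), so those rows need to be dropped or filled before the data are passed to a model.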
Data consolidation and normalization: When dealing with data from multiple cryptocurrencies, we consolidate and normalize the data before training the model. We apply min-max normalization with MinMaxScaler, which scales each feature to the range [0, 1]. This helps model training, especially with deep learning algorithms. The normalization code is as follows:
from sklearn.preprocessing import MinMaxScaler
# Min-max normalization of the price, volume, and indicator columns to [0, 1]
features = ['Open', 'High', 'Low', 'Close', 'Volume', 'SMA_20', 'RSI', 'MACD', 'ATR']
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])
Data integration: To further enrich the dataset, we merge market sentiment data and macroeconomic data into the main dataset. The merge code is as follows:
import pandas as pd
sentiment_df = pd.read_csv('sentiment_data.csv')
macro_df = pd.read_csv('macro_data.csv')
# Merge market sentiment data (the Date columns must share the same dtype)
df = df.merge(sentiment_df, on='Date', how='left')
# Merge macroeconomic data
df = df.merge(macro_df, on='Date', how='left')
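A left merge on Date only matches rows when both key columns have the same type and format. The sketch below illustrates this on small hypothetical frames; the `sentiment` column, its values, and the ISO date format of the second frame are assumptions, since the structure of the sentiment and macroeconomic files is not given here.

```python
import pandas as pd

# Hypothetical price rows; dates use the "2015/9/13" style shown in Table 1.
df = pd.DataFrame({
    "Date": ["2015/9/13", "2015/9/14", "2015/9/15", "2015/9/16"],
    "Close": [230.511993, 230.643997, 230.304001, 229.091003],
})
# Hypothetical sentiment scores in ISO date format, with one day missing.
sentiment_df = pd.DataFrame({
    "Date": ["2015-09-13", "2015-09-14", "2015-09-16"],
    "sentiment": [0.10, -0.05, 0.20],
})

# Bring both keys to the same datetime dtype before merging.
df["Date"] = pd.to_datetime(df["Date"], format="%Y/%m/%d")
sentiment_df["Date"] = pd.to_datetime(sentiment_df["Date"], format="%Y-%m-%d")

# A left merge keeps every price row; validate raises if a date repeats on the right.
merged = df.merge(sentiment_df, on="Date", how="left", validate="many_to_one")

# Days without a sentiment record come back as NaN and need explicit handling.
unmatched = int(merged["sentiment"].isna().sum())
```

Counting the unmatched rows after each merge is a cheap check that the auxiliary data actually cover the price history.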
Data partitioning: To evaluate model performance, we divide the dataset into a training set and a test set. The training set is used for model training, and the test set is used for model evaluation and validation. We use the train_test_split function with shuffle=False, so the most recent 20% of the records form the test set and the model is evaluated on data later than anything it was trained on. The partition code is as follows:
from sklearn.model_selection import train_test_split
# Chronological split: the last 20% of rows form the test set
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=False)
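One design point worth noting is the order of scaling and splitting. The sketch below shows a possible alternative ordering, in which the min-max statistics are computed on the training rows only and then applied to the test rows, so that no information from the test period enters the scaling. It is an illustration on hypothetical data with plain pandas arithmetic, not the procedure reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame in chronological order.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Close": np.linspace(100.0, 200.0, 50) + rng.normal(0, 1, 50),
    "Volume": rng.uniform(1e6, 2e6, 50),
})
features = ["Close", "Volume"]

# Chronological 80/20 split, mirroring test_size=0.2 with shuffle=False.
split = int(len(df) * 0.8)
train_df, test_df = df.iloc[:split].copy(), df.iloc[split:].copy()

# Min-max statistics come from the training rows only.
col_min = train_df[features].min()
col_range = train_df[features].max() - col_min

# The same transformation is applied to both parts.
train_df[features] = (train_df[features] - col_min) / col_range
test_df[features] = (test_df[features] - col_min) / col_range
```

With this ordering, scaled test values can fall outside [0, 1] when prices exceed the training range, which is expected for a trending series.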
3.4. LSTM Model Construction and Training
We use the Long Short-Term Memory (LSTM) model to analyze and predict the time series data of the cryptocurrency market. This model is important in time series prediction because it handles long-term dependencies in sequential data well. The steps are as follows. First, the date column in the dataset is converted to datetime format, so that the time series data can be processed smoothly and interpreted correctly in subsequent analysis. Before constructing the LSTM model, we visualize market fluctuations by plotting the closing price trend of each cryptocurrency, so that the closing price changes of Bitcoin, Ethereum, and Cardano can be observed directly. These price trends provide valuable background information for subsequent modeling. The following figure shows the closing price trends of the cryptocurrencies:
Figure 1.
The Closing Price Trend for the four cryptocurrencies.
Building and training the LSTM model: After completing data processing and preparation, we move on to the LSTM model. The long short-term memory neural network is especially suitable for processing time series because it can effectively capture long-term dependencies. Before training, the input data are scaled to a reasonable range to improve model training. Next, we divide the data into training and test sets and train the model. By learning from the training data, the model gradually captures the patterns of market price changes, while the test set is used to verify and adjust the model [16].
Through the test data, we can assess the prediction accuracy of the model and make necessary adjustments according to the test results. To improve the predictive performance and reliability of the model, we continuously monitor its forecasts of the market price trend during training and adjust the model parameters accordingly. This may also involve selecting the most appropriate training strategy and optimization algorithm. The ultimate goal is to predict future market prices effectively in support of investment decisions.
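An LSTM consumes fixed-length input sequences, so the scaled feature table has to be cut into overlapping windows before training. The following is a minimal sketch of that step; the helper name `make_windows`, the 30-day lookback, and the choice of the first column as the prediction target are assumptions for illustration and are not specified in this section.

```python
import numpy as np

def make_windows(values: np.ndarray, lookback: int, target_col: int = 0):
    """Turn a (time, features) array into LSTM inputs X and next-step targets y."""
    X, y = [], []
    for end in range(lookback, len(values)):
        X.append(values[end - lookback:end])   # the previous `lookback` time steps
        y.append(values[end, target_col])      # the value right after the window
    return np.asarray(X), np.asarray(y)

# Hypothetical scaled data: 100 days, 3 features, column 0 playing the role of Close.
series = np.arange(300, dtype=float).reshape(100, 3)
X, y = make_windows(series, lookback=30)
```

The resulting X has shape (samples, time steps, features), the three-dimensional layout that recurrent layers in common deep learning libraries generally expect. Windows should be built separately for the training and test parts so that no training window reaches into the test period.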
Figure 2.
A multi-subgraph chart of the Volume trends of four cryptocurrencies.