2.2. Deep Learning Models
2.2.1. ANN
The Artificial Neural Network (ANN) is a powerful machine learning tool that is widely used for regression and classification problems. It imitates the operating principle of biological nerve cells by arranging artificial neurons into a layered, interconnected network structure. Through mathematical expressions, the signal transmission between neurons can be simulated, establishing a nonlinear mapping between inputs and outputs that can be visualized as a network; this is what we call an artificial neural network. In general, with a suitable network configuration an ANN can approximate any nonlinear function, so it can also be used to model nonlinear systems or black-box models whose internal behaviour is difficult to express explicitly.
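A minimal PyTorch sketch of such a feed-forward network is given below; the layer sizes and activation are illustrative assumptions, not the exact configuration used in this study.

```python
import torch
import torch.nn as nn

# A minimal feed-forward ANN (multilayer perceptron) sketch in PyTorch.
# Layer sizes and the ReLU activation are illustrative, not the exact
# network configuration used in this study.
class SimpleANN(nn.Module):
    def __init__(self, n_inputs: int, n_hidden: int = 32, n_outputs: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),   # input layer -> hidden layer
            nn.ReLU(),                       # nonlinear activation
            nn.Linear(n_hidden, n_hidden),   # hidden layer -> hidden layer
            nn.ReLU(),
            nn.Linear(n_hidden, n_outputs),  # hidden layer -> regression output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Example: map 8 input features to a single regression output.
model = SimpleANN(n_inputs=8)
y_hat = model(torch.randn(16, 8))   # batch of 16 samples
```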
2.2.2. Long Short-Term Memory
The LSTM network is a modified recurrent neural network proposed by Hochreiter and Schmidhuber. In recent years, research on sequence prediction has mainly focused on short sequences. In this study, an LSTM network is used to conduct experiments on a set of short time series (12 data points, 0.5 days of data) and long time series (480 data points, 20 days of data). The results show that as the sequence length increases, the prediction error grows significantly and the prediction speed drops sharply. The LSTM model also has several serious limitations: (1) high prediction error: when dealing with long time series, the prediction error of the LSTM model is high, which makes its performance unsatisfactory in some application scenarios; (2) slow prediction speed: the LSTM model predicts relatively slowly, mainly because of its complex internal computation; (3) many model parameters: the LSTM model has many parameters to train, so it requires more computing resources and time when dealing with large-scale data; (4) proneness to mode switching: when dealing with non-stationary time series, the LSTM model is prone to mode switching, which leads to unstable prediction results.
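A minimal PyTorch sketch of an LSTM forecaster of the kind used in these experiments is shown below; the hidden size and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal LSTM forecaster sketch: encode a window of past observations and
# regress the next value from the last hidden state. Hidden size and the
# single-layer configuration are illustrative assumptions.
class LSTMForecaster(nn.Module):
    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # one-step-ahead prediction from the last time step

# Example: a short window of 12 points vs. a long window of 480 points.
model = LSTMForecaster(n_features=8)
short = model(torch.randn(4, 12, 8))
long_ = model(torch.randn(4, 480, 8))    # sequential recurrence makes long windows slow
```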
2.2.3. Informer
Recent studies have shown that, compared with RNN-type models, Transformers show high potential for expressing long-range dependencies. However, the Transformer has the following three problems: (1) quadratic computational complexity of the self-attention mechanism: the dot-product operation of self-attention gives each layer a time complexity and memory usage of O(L²), where L is the input sequence length; (2) high memory usage: when stacking layers for long sequence inputs, a stack of J encoder/decoder layers brings the total memory usage to O(J·L²), which limits the scalability of the model when accepting long sequence inputs; (3) low efficiency in predicting long outputs: in the Transformer's dynamic decoding process, outputs are produced one step at a time and each output depends on the prediction of the previous time step, which makes inference very slow.
The authors of Informer target the Transformer with the following question: can Transformer models be improved to be more efficient in computation, memory, and architecture while maintaining high predictive power? To achieve this, they design an improved Transformer-based LSTF model, namely the Informer model, which has three notable features: a ProbSparse self-attention mechanism, which achieves low time complexity and memory usage; a self-attention distilling mechanism, in which a Conv1D is applied to the output of each attention layer and a max-pooling layer halves the output of each layer, highlighting the dominant attention and effectively handling overly long input sequences; and a parallel generative decoder, which outputs all prediction results of a long time sequence at once instead of predicting step by step, greatly improving the inference speed of long-sequence prediction, as shown in Figure 2.
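A sketch of the distilling operation described above (Conv1D over an attention layer's output followed by max pooling that halves the sequence length) is given below; the kernel sizes and the ELU activation follow common Informer implementations and are assumptions here, not details taken verbatim from this paper.

```python
import torch
import torch.nn as nn

# Sketch of the self-attention distilling step: a 1-D convolution over each
# attention layer's output, followed by max pooling with stride 2 so that the
# sequence length is halved before the next layer. Kernel sizes, normalization
# and ELU activation are assumptions based on common Informer implementations.
class DistillingLayer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, d_model, seq_len)
        x = self.conv(x.transpose(1, 2))
        x = self.pool(self.act(self.norm(x)))
        return x.transpose(1, 2)          # (batch, seq_len // 2, d_model)

x = torch.randn(2, 96, 512)
print(DistillingLayer(512)(x).shape)      # torch.Size([2, 48, 512])
```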
The model follows an Encoder-Decoder structure, in which the self-attention distilling mechanism is located in the Encoder. This operation essentially consists of several encoder branches and aims to extract stable long-term features. The network depth of each encoder branch gradually decreases, as does the length of its input data, and the features extracted by all branches are finally concatenated. Note that, to distinguish the encoder branches from each other, the authors gradually reduce the depth of each branch by setting the number of repetitions of its self-attention blocks. At the same time, to ensure the sizes of the features to be merged match, each branch takes only the second half of the previous branch's input as its own input.
The self-attention distillation mechanism is an efficient method that improves the efficiency and accuracy of the model by training a smaller model to follow the learning of a larger one. Its core idea is to use the attention distribution of the large model to guide the training of the small model. In principle, the mechanism captures the global characteristics of the input data by computing its attention distribution; these features are then passed to the small model so that it can better imitate the large model.
The generative decoder is located in the Decoder layer. In traditional time series prediction, multi-step prediction is often performed step by step to obtain the results for several future time points. However, as the prediction length grows, the accumulated error becomes larger and larger, so long-term prediction loses practical significance. The authors therefore propose a generative decoder that obtains the sequence output of the desired target length in a single forward pass, effectively avoiding the accumulation and spread of errors during multi-step prediction.
Another innovation is the probabilistic sparse (ProbSparse) self-attention mechanism, which arises from the authors' examination of the self-attention feature maps. The authors visualize Head 1 and Head 7 of the first self-attention layer and find only a few bright stripes in the feature maps; at the same time, only a small fraction of the scores in the two heads take large values, which is consistent with a long-tailed distribution, as shown in the figure above. The conclusion is that a small fraction of the dot-product pairs contribute most of the attention, while the others can be ignored. Based on this characteristic, the authors focus on the high-scoring dot-product pairs and compute only these in each self-attention operation, effectively reducing the time and space cost of the model. The mechanism still allows the model to automatically capture the relationships between the elements of the input sequence and thus better understand it, and, because all elements are processed in parallel, it avoids the sequential-computation dependence of RNN/LSTM models and is therefore more efficient.
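A simplified sketch of this idea is given below; it is an illustration only, not the full Informer implementation (which also samples keys when estimating the query sparsity measure).

```python
import torch

# Sketch of the ProbSparse idea: score each query by how far its attention
# distribution deviates from uniform (approximated by max minus mean of its
# dot products), then compute full attention only for the top-u "active"
# queries; the remaining "lazy" queries fall back to the mean of the values.
def probsparse_attention_sketch(q, k, v, u):
    # q, k, v: (batch, seq_len, d); u: number of active queries to keep
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (batch, L_q, L_k)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)
    top = sparsity.topk(u, dim=-1).indices                # dominant query indices
    out = v.mean(dim=1, keepdim=True).expand_as(q).clone()  # lazy queries -> mean of values
    for b in range(q.size(0)):
        sel = scores[b, top[b]].softmax(dim=-1)           # attention for active queries only
        out[b, top[b]] = sel @ v[b]
    return out

q = k = v = torch.randn(2, 96, 64)
y = probsparse_attention_sketch(q, k, v, u=16)            # (2, 96, 64)
```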
2.3. Model Application
The water flow data of Wan'an Reservoir is used as the data set for prediction. The dam site of Wan'an Reservoir is located 2 kilometers upstream of Furong Town, Wan'an County, Jiangxi Province, at 114°41′ east longitude and 26°33′ north latitude. It lies 90 kilometers downstream of Ganzhou City and 90 kilometers upstream of Ji'an City, and the controlled basin area is 36,900 square kilometers; see Figure 3.
There are several hydrological stations and rainfall stations in the basin of Wan'an Reservoir. In this paper, the rainfall information of five regions around the reservoir (PQJ1, PQJ2, PQJ3, PQJ4, PQJ5) and the flow information of the Xishan and Julongtan hydrological stations are selected to predict the flow of Wan'an Reservoir. Rainfall and discharge data were collected in the region from May 17, 2014 to June 14, 2020 at a 6-hour interval; see Figure 4.
Figure 4. Dataset of Wan'an reservoir.
In the above figure, FDNO is the flood number, TM is the record date, PQJ1, PQJ2, PQJ3, PQJ4, and PQJ5 are the rainfall information of the five rainfall stations in the reservoir area, xiashan and julongtan are the discharge information of the two discharge stations in the reservoir area, and OT is the actual discharge of Wan'an Reservoir.
Because the raw data contain noise and jitter, a smoothing function is applied: the processed value at time t is the average of the values at times t-1, t, and t+1. The flow at flood peaks must be recorded exactly, however, so peak flows are not smoothed.
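A minimal sketch of this smoothing step is shown below; it assumes peaks are identified as local maxima, since the paper does not specify the peak test.

```python
import numpy as np

# Sketch of the smoothing step: each value is replaced by the average of its
# neighbours at t-1, t and t+1, except at flood peaks, which are left
# unchanged so the recorded peak discharge is not attenuated. Treating any
# local maximum as a peak is an assumption made for illustration.
def smooth_preserve_peaks(flow: np.ndarray) -> np.ndarray:
    smoothed = flow.astype(float).copy()
    for t in range(1, len(flow) - 1):
        is_peak = flow[t] >= flow[t - 1] and flow[t] >= flow[t + 1]
        if not is_peak:
            smoothed[t] = (flow[t - 1] + flow[t] + flow[t + 1]) / 3.0
    return smoothed

flow = np.array([10.0, 12.0, 30.0, 80.0, 55.0, 40.0, 20.0, 15.0])
print(smooth_preserve_peaks(flow))   # the peak at 80.0 is left untouched
```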
In this experiment, considering the long sampling interval, the input step size is set to 4, 5, and 6, with an output length of 1. For example, when the step size is 4, four consecutive data points are used to predict one future data point, as illustrated below.
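The following sketch shows how the training samples are formed under this windowing scheme; it is illustrative and not the exact data loader used in the experiments.

```python
import numpy as np

# Sketch of sample construction: with an input window (step size) of 4 and an
# output length of 1, four consecutive 6-hour records predict the next record.
def make_windows(series: np.ndarray, step: int = 4, horizon: int = 1):
    X, y = [], []
    for i in range(len(series) - step - horizon + 1):
        X.append(series[i : i + step])                    # past `step` observations
        y.append(series[i + step : i + step + horizon])   # next `horizon` observation(s)
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)
X, y = make_windows(series, step=4, horizon=1)
print(X.shape, y.shape)   # (6, 4) (6, 1)
```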
The loss functions commonly used in machine learning regression problems include the mean absolute error (MAE), the mean squared error (MSE) and the Huber loss. MSE loss usually converges faster than MAE, but MAE loss is more robust to outliers, i.e., less affected by them. The Huber loss, also known as the smooth mean absolute error loss, combines MSE and MAE and takes the advantages of both. Its principle is simple: MSE is used when the error is close to 0 and MAE is used when the error is large.
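All three losses are available directly in PyTorch; a minimal comparison sketch is shown below. Note that nn.HuberLoss was added in PyTorch 1.9, while nn.SmoothL1Loss is the closely related form available in 1.8; the delta value is the library default and an assumption here.

```python
import torch
import torch.nn as nn

# The three loss functions compared in this experiment, as provided by PyTorch.
# HuberLoss behaves like MSE for errors smaller than `delta` and like MAE for
# larger errors; delta=1.0 is the library default, not a value from the paper.
mae = nn.L1Loss()
mse = nn.MSELoss()
huber = nn.HuberLoss(delta=1.0)

pred = torch.tensor([1.0, 2.0, 10.0])
true = torch.tensor([1.5, 2.0, 3.0])   # the last point acts as an outlier
print(mae(pred, true), mse(pred, true), huber(pred, true))
```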
In this paper, more attention is paid to the peak flood flow since it has the greatest impact on reality, and MAE, MSE and Huber Loss functions are used in this experiment to compare the prediction effect.
In the predictions of the two models, the common hyperparameters are the learning rate, number of training epochs, patience, and batch size, set to 0.0005, 100, 10, and 10, respectively.
All numerical experiments in this study were implemented on a Windows system (CPU: Intel i7-12700H; GPU: NVIDIA GeForce RTX 3070) using Python 3.9 and PyTorch 1.8.0.
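For reference, the shared training settings can be collected in a small configuration sketch; the optimizer shown (Adam) is an assumption for illustration, as the text does not state which optimizer was used.

```python
import torch

# Shared training settings for both models (values from the text); the Adam
# optimizer and the placeholder model are assumptions for illustration only.
config = {"learning_rate": 0.0005, "train_epochs": 100, "patience": 10, "batch_size": 10}

model = torch.nn.Linear(8, 1)   # placeholder standing in for the ANN/LSTM/Informer model
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```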
2.4. Model Performance Measures
Four statistical indicators are finally used to measure the prediction results, namely NSE (Nash-Sutcliffe efficiency coefficient), R² (coefficient of determination), RMSE (root mean square error), and RE (relative error).
Four hydrological measures are also used to evaluate the prediction results, namely the number of flood events whose peak-discharge error is less than 0.15, the number of events whose total-flood-volume error is less than 0.15, the number of events whose Nash coefficient is greater than 0.8, and the maximum flood-peak error. The calculation equations are as follows:
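The conventional forms of the four statistical indicators are written out below; the exact expressions in the original paper may differ slightly in notation, and RE is written here as a mean relative error, which is an assumption based on its description further down.

```latex
% Conventional definitions of the four indicators (x_i: observed value,
% y_i: simulated value, \bar{x}, \bar{y}: their means, n: sample count).
% RE is expressed as a mean relative error, an assumption for this paper.
\begin{align}
\mathrm{NSE}  &= 1 - \frac{\sum_{i=1}^{n} (x_i - y_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
R^2           &= \left( \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                 {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \right)^{2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2} \\
\mathrm{RE}   &= \frac{1}{n} \sum_{i=1}^{n} \frac{\lvert y_i - x_i \rvert}{x_i} \times 100\%
\end{align}
```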
where $x_i$ and $y_i$ are the observed and simulated values, respectively; $\bar{x}$ and $\bar{y}$ are the average observed and simulated values, respectively; and $n$ is the number of samples.
NSE (Nash-Sutcliffe efficiency coefficient): used to assess the quality of hydrological model simulation results. NSE ranges from negative infinity to 1; a value close to 1 indicates good model quality.
R² (coefficient of determination): the proportion of the total variation of the dependent variable that can be explained by the independent variable through the regression relationship. It ranges from 0 to 1; the larger the coefficient of determination, the better the prediction.
RMSE (root mean square error): the square root of the mean of the squared deviations between the predicted and true values. It measures how dispersed the prediction errors are.
RE (relative error): the ratio of the absolute error of a measurement to the (agreed) true value, multiplied by 100%, which reflects the reliability of the measurement.
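As a concrete illustration, the four statistical indicators can be computed as in the following NumPy sketch; R² is computed here as the squared Pearson correlation, and RE as the mean relative error assumed above.

```python
import numpy as np

# Sketch of the four statistical indicators for a pair of observed/simulated
# series; x is observed, y is simulated. The R^2 and RE forms shown here are
# common conventions and assumptions about the paper's exact definitions.
def nse(x, y):  return 1 - np.sum((x - y) ** 2) / np.sum((x - x.mean()) ** 2)
def r2(x, y):   return np.corrcoef(x, y)[0, 1] ** 2
def rmse(x, y): return np.sqrt(np.mean((x - y) ** 2))
def re(x, y):   return np.mean(np.abs(y - x) / x) * 100

x = np.array([100.0, 250.0, 800.0, 400.0, 150.0])   # observed discharge
y = np.array([110.0, 230.0, 760.0, 420.0, 160.0])   # simulated discharge
print(nse(x, y), r2(x, y), rmse(x, y), re(x, y))
```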
Peak discharge: the maximum instantaneous discharge during a flood, that is, the highest discharge on the flood hydrograph. It may be a measured value, or it may be calculated from the stage-discharge relationship curve or a hydrodynamic formula.
Total flood volume: the total volume of flood water passing the outlet section of the basin in a given period of time. The total volume of a flood caused by a rainfall event is often calculated in rainfall-runoff forecasting, and it can be obtained from the area under the flood hydrograph between the start time of the flood rise and the end time on its recession limb.
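For illustration, the two hydrological quantities can be computed from a single flood event sampled every 6 hours as in the sketch below; the trapezoidal integration used for the total volume is an assumption, not a method stated in the paper.

```python
import numpy as np

# Sketch: peak discharge is the maximum of the hydrograph; total flood volume
# is the discharge integrated over the event duration (trapezoidal rule,
# assumed here; discharge in m^3/s sampled at a 6-hour interval).
def peak_discharge(q: np.ndarray) -> float:
    return float(q.max())

def total_flood_volume(q: np.ndarray, dt_hours: float = 6.0) -> float:
    return float(np.trapz(q, dx=dt_hours * 3600.0))   # volume in m^3

q = np.array([120.0, 300.0, 950.0, 1400.0, 900.0, 500.0, 250.0])
print(peak_discharge(q), total_flood_volume(q))
```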