3.2. Hyperparameter Tuning and Model Interpretability
Given the Colombian streamflow dataset, the hydrological forecasting task corresponds to predicting the reservoir contributions at time instant $n+H$ from the contributions at day $n$, where $H$ stands for the forecasting horizon, covering short-term (up to a week ahead) and medium-term (up to one month ahead) predictions. To this end, the SVGP demands a fixed number of inducing points, $M$.
Figure 3 presents the grid search results for tuning $M$ by minimizing the MSE in a ten-fold cross-validation on the training subset at the fixed one-day-ahead forecasting horizon ($H=1$).
Results evidence that the median MSE generally decreases as the number of inducing points increases, up to an optimal value, after which the error grows. The standard deviation also decreases within that range, suggesting improved consistency. Such behavior arises because too few inducing points are overly sensitive to initialization and fail to capture the intrinsic distribution of the phenomenon, underfitting the regression task. On the contrary, too many inducing points monotonically increase the median and standard deviation of the error, indicating model overfitting. Consequently, the hyperparameter tuning designates the value of $M$ attaining the lowest and most consistent error within the grid search.
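As a minimal sketch of this tuning loop, the following snippet estimates the ten-fold cross-validation MSE for one candidate $M$ with a GPflow SVGP; the candidate grid, the iteration budget, and the `X_train`/`Y_train` arrays are illustrative assumptions, not the exact experimental setup.

```python
import numpy as np
import gpflow
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cv_mse(X, Y, M, n_splits=10, seed=0):
    """Ten-fold CV of a single-output SVGP for a candidate number of inducing points M."""
    rng = np.random.default_rng(seed)
    errors = []
    for tr, va in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        Z = X[tr][rng.choice(len(tr), size=M, replace=False)]  # initialize inducing points from data
        model = gpflow.models.SVGP(
            kernel=gpflow.kernels.SquaredExponential(lengthscales=np.ones(X.shape[1])),
            likelihood=gpflow.likelihoods.Gaussian(),
            inducing_variable=Z,
        )
        gpflow.optimizers.Scipy().minimize(
            model.training_loss_closure((X[tr], Y[tr])),
            model.trainable_variables, options={"maxiter": 200},
        )
        mean, _ = model.predict_y(X[va])
        errors.append(mean_squared_error(Y[va], mean.numpy()))
    return np.median(errors), np.std(errors)

# Hypothetical candidate grid; X_train, Y_train stand for the training subset at H = 1.
for M in (8, 16, 32, 64, 128):
    med, std = cv_mse(X_train, Y_train, M)
    print(f"M={M:3d}  median MSE={med:.4f}  std={std:.4f}")
```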
Using the tuned number of inducing points, Figure 4 presents the trained reservoir-wise output scales and the estimated noise variance at each horizon $H$. Overall, the longer the horizon, the smaller the output scale for all reservoirs. Such a fact suggests a reduction of the temporal correlation at long horizons, making the SVGP approximate the prior distribution with a time-uncorrelated function. Nonetheless, for reservoirs P and V, an output scale increment at long horizons follows the initial reduction, implying the existence of relevant periodicity components within such time series. Lastly, reservoir Q consistently renders a zero output scale across horizons, enforcing the expected zero streamflow.
Regarding the noise variance, the heightened unpredictability of the output data at long horizons yields an increment in the estimated noise variance, which also supports the positive definiteness of the covariance matrix in \eqref{sogpposteriormean}. For such horizons, the model uncertainty mainly depends on the noise variance rather than the output scale. Therefore, the SVGP model exploits the data correlations at short horizons, yielding a low uncertainty, while interpreting the lack of information at long horizons as a zero-mean isotropic Gaussian posterior.
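A brief, hypothetical sketch of how such hyperparameters can be read from the trained models, assuming one GPflow SVGP per horizon stored in a dict `models_per_horizon`, with a shared Gaussian likelihood and one kernel per task collected in a `SeparateIndependent` container:

```python
import numpy as np

# Illustrative inspection of the trained hyperparameters behind Figure 4.
for H, model in sorted(models_per_horizon.items()):
    scales = [float(np.sqrt(k.variance.numpy())) for k in model.kernel.kernels]  # per-reservoir output scales
    noise = float(model.likelihood.variance.numpy())                             # estimated noise variance
    print(f"H={H:2d}  output scales={np.round(scales, 3)}  noise variance={noise:.4f}")
```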
The lengthscale analysis relies on its capacity to deactivate nonessential features in a kernel machine according to the Automatic Relevance Determination (ARD) criterion [7,26]. Hence, a small lengthscale value implies a high relevance for predicting the future streamflow of a reservoir from the past of another [5].
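For reference, assuming the kernel in \eqref{squaredexponentialkernel} takes the standard ARD squared exponential form, the per-feature lengthscales $\ell_d$ act as inverse relevance weights:

```latex
k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\!\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d - x'_d)^2}{\ell_d^2}\right),
```

so that $\ell_d \to \infty$ removes the $d$-th feature from the similarity computation, whereas a small $\ell_d$ makes the kernel highly sensitive to variations in that feature.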
Figure 5 depicts the trained lengthscales of the kernel in \eqref{squaredexponentialkernel} for each input feature (columns) and each output task (rows) at prediction horizons of one day (left), two weeks (center), and one month (right). At first glance, reservoir Q exhibits notably low lengthscales, either for predicting other reservoirs (Q-th column) or as the target reservoir (Q-th row), which ARD would misinterpret as high relevance. As the target reservoir, the zero streamflow during the observation period tunes the output scale to zero, turning off the corresponding rows of the covariance matrix in \eqref{sogpposteriormean} and hence the influence of every reservoir on the posterior mean of reservoir Q, regardless of its lengthscale. As the predicting reservoir, reservoir Q holds $N$ zero terms within the kernel vector in \eqref{sogpposteriormean}, yielding a zero contribution to the posterior mean of every other reservoir. In either case, the SVGP identifies the lack of influence of reservoir Q and leaves the corresponding lengthscales (rows and columns) at their default value.
The shortest horizon (one-day-ahead prediction, $H=1$) concentrates the smallest lengthscale values over the main diagonal, contrary to the high off-diagonal values. This result evidences the relevance of short-term memory components within each streamflow. At the longer horizon of fourteen days ($H=14$), the lengthscales over the main diagonal become higher, losing relevance, while several off-diagonal ones become lower, gaining relevance, compared to the one-day horizon. Such a change implies that a reservoir provides less information to predict itself at longer horizons, leading the model to seek knowledge in other reservoirs. The longest horizon ($H=30$) exhibits even fewer relevant lengthscales, indicating that the SVGP restrains the number of relevant features. Thus, the lengthscale variations across prediction horizons reveal the dynamic interplay between reservoirs and highlight the Gaussian process's ability to select relevant features adaptively.
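A minimal plotting sketch for such lengthscale matrices, assuming the per-task kernels expose their ARD lengthscales as in GPflow (`models_per_horizon` and the figure layout are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative heatmaps of per-task ARD lengthscales at the three analyzed horizons.
horizons = (1, 14, 30)
fig, axes = plt.subplots(1, len(horizons), figsize=(13, 4), sharey=True)
for ax, H in zip(axes, horizons):
    model = models_per_horizon[H]  # assumed trained multi-output SVGP
    L = np.stack([k.lengthscales.numpy() for k in model.kernel.kernels])  # (tasks, features)
    im = ax.imshow(np.log10(L), cmap="viridis")  # log scale eases comparison
    ax.set(title=f"H = {H} days", xlabel="input reservoir")
axes[0].set_ylabel("target reservoir")
fig.colorbar(im, ax=axes, label=r"$\log_{10}$ lengthscale")
plt.show()
```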
For a visual interpretation of the SVGP inner workings, Figure 6 maps the reservoir data in 2D along with the optimized inducing points using the t-distributed Stochastic Neighbor Embedding (t-SNE) technique for each target reservoir. The input-space similarity measure demanded by t-SNE was tailored to the kernel in \eqref{squaredexponentialkernel} with the tuned lengthscales of each task, guaranteeing the mapping of the same latent space designed by the SVGP. The filled contours outline the density distribution of the training data, while the color-numbered scatter locates the optimized inducing points. Since the forecasting tasks share the inducing points, but not the kernel parameters, each target reservoir holds a different t-SNE mapping.
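A sketch of this kernel-tailored embedding for a single target task, assuming a trained GPflow `kernel` and optimized inducing points `Z` (variable names are illustrative; the distance follows from the kernel-induced metric):

```python
import numpy as np
from sklearn.manifold import TSNE

# Kernel-induced squared distance: d^2(x, x') = k(x,x) + k(x',x') - 2 k(x,x'),
# so t-SNE operates in the same latent geometry as the SVGP for this task.
P = np.vstack([X_train, Z])   # training inputs plus optimized inducing points
K = kernel(P).numpy()         # trained ARD kernel of the target task
diag = np.diag(K)
D = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))
emb = TSNE(metric="precomputed", init="random", random_state=0).fit_transform(D)
data_2d, inducing_2d = emb[: len(X_train)], emb[len(X_train):]
```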
The 2D plots exhibit multi-modal distributions for all tasks, indicating clustered information and varying dynamics within each reservoir. Hence, the SVGP trades off the location of the inducing points between the within-task centroids and the among-task relationships. For instance, the seventh inducing point emerges near a mode of all four distribution plots, representing global information shared by most tasks. The locations of points 12 and 16 also provide remarkable insight. On the one hand, the former mainly represents one mode of reservoirs E (Figure 6a) and U (Figure 6d) simultaneously, despite being located far from any cluster of I and L. On the other hand, the SVGP devotes the latter point to the most prominent mode of reservoirs I (Figure 6b) and L (Figure 6c), but misses E and U. Nonetheless, those two inducing points (12th and 16th) provide complementary data representativity that improves the performance of a few related tasks. Therefore, the shared inducing points capture the task-wise, group-wise, and global information about the streamflow dynamics.
3.3. Performance Analysis
For the performance analysis, the proposed SVGP-based forecasting is contrasted against the straightforward Linear AutoRegressive (LAR) model and the nonlinear Long Short-Term Memory (LSTM) network [17,19,22,32,34].
Figure 7 illustrates the testing time series of streamflow contributions and their corresponding one-day-ahead predictions generated by the contrasted models for three reservoirs with different forecasting complexities. The red dots correspond to test data, and the yellow and green lines denote the LAR and LSTM predictions, respectively. The blue line and shaded region represent the mean and 95% confidence interval of the SVGP predictive distribution.
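The per-reservoir panels can be reproduced along the following lines, assuming a trained SVGP and baseline prediction arrays (`y_test`, `y_lar`, and `y_lstm` are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative one-day-ahead test plot in the style of Figure 7.
mean, var = model.predict_y(X_test)  # SVGP predictive mean and variance
mean, std = mean.numpy().ravel(), np.sqrt(var.numpy().ravel())
t = np.arange(len(mean))
plt.scatter(t, y_test, c="red", s=8, label="test data")
plt.plot(t, y_lar, c="gold", label="LAR")
plt.plot(t, y_lstm, c="green", label="LSTM")
plt.plot(t, mean, c="blue", label="SVGP mean")
plt.fill_between(t, mean - 1.96 * std, mean + 1.96 * std,
                 color="blue", alpha=0.2, label="95% CI")
plt.xlabel("test day"); plt.ylabel("streamflow contribution"); plt.legend()
plt.show()
```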
Reservoir U in Figure 7a presents the challenge of changing dynamics within the time series, from a steady, low-variance regime to a rapidly evolving streamflow. For such a reservoir, the SVGP outperforms the baselines by better adapting to the forecasted dynamics. Reservoir E in Figure 7b portrays a scenario of reduced smoothness with a few abruptly changing segments. Even for those segments, the proposed model tracks the actual curve more closely than LAR and LSTM. Further, the SVGP effectively explains the sudden shifts through its intrinsic confidence intervals. In contrast, reservoir I in Figure 7c exhibits a heavily varying streamflow. Although the forecasts of all three models diverge from the actual peak values within the time series, the predictive distribution provided by the SVGP interprets those peaks as outliers. In general, the SVGP stands out as the most adaptive and suitable model to exploit the changing within- and between-reservoir time series, explaining their stochastic nature through its inherent predictive distribution.
Finally, Figure 8 displays the overall Mean Squared Error (MSE) scores achieved by the LAR and LSTM baselines and the proposed SVGP-based streamflow forecasting over a testing period of two years. Each row represents one target reservoir, while the columns indicate the forecasting horizon. The lower plot presents the grand-average MSE along the horizon. The colorbar on the right maps the resulting MSE to a logarithmic scale to ease visual interpretation.
As expected, the longer the horizon, the larger the error for all approaches, evidencing the increasing difficulty of forecasting further into the future. However, the prediction difficulty varies among reservoirs, with D, H, I, and N as the most demanding, suggesting a lack of information from external factors. In contrast, reservoirs A, K, and U attain a lower MSE, indicating less complex streamflow dynamics. A remarkable observation pertains to reservoir Q, for which the SVGP model sustains a zero error due to the null contribution of its corresponding time series, in contrast to the other models, which retain a residual error.
From one- to seven-day horizons, the proposed SVGP outperforms LAR and LSTM in reservoirs A, C, E, F, and K. Hence, the SVGP effectively captures short-term dependencies in the data. Particularly for reservoir E, the SVGP considerably reduces the error of the LSTM, indicating a benefit in capturing short-term dynamics compared to a nonlinear model. As the prediction horizon extends to 14 and 30 days, the SVGP remains competitive since its MSE increases more slowly than that of LAR and LSTM along the horizon, especially in reservoirs like L, T, and V. Such a fact is also evident in the grand-average MSE, where the SVGP error grows considerably more slowly than that of the baseline models. In conclusion, the Gaussian process excels at all tested horizons, mainly due to its capability to nonlinearly map the data into an infinitely high-dimensional feature space and its task-wise adaptation to varying dynamics.
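A compact sketch of the evaluation grid behind Figure 8, assuming horizon-specific test sets and one trained model per horizon (all names illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE per (reservoir, horizon) pair plus the grand average per horizon.
mse = np.zeros((n_reservoirs, len(horizons)))
for j, H in enumerate(horizons):
    pred = models_per_horizon[H].predict_y(X_test[H])[0].numpy()  # (samples, reservoirs)
    for i in range(n_reservoirs):
        mse[i, j] = mean_squared_error(Y_test[H][:, i], pred[:, i])
grand_average = mse.mean(axis=0)  # lower panel of Figure 8
```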