2.1. ML Tools
ML algorithms build a model based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so [14]. These ML models can then be used for representing, interpolating, and, to a limited extent, extrapolating the data. The set of ML tools includes a large variety of algorithms, such as various neural networks (NNs), different kinds of decision trees (e.g., random forest algorithms), kernel methods (e.g., support vector machines and principal component analysis), Bayesian algorithms, etc. (see Figure 1). Some of these algorithms are more universal (e.g., the generic multilayer perceptron or NNs), and some are more focused on a specific class of problems (e.g., convolutional NNs, which show impressive performance as image/pattern recognition algorithms).
There are many different types of NNs: shallow, deep, convolutional, recurrent, etc., as well as many types of tree algorithms (see Figure 1). Here we briefly discuss two major types of ML tools that have been applied to develop applications for numerical weather and climate prediction systems: (1) NNs, which have been applied in most studies (e.g., [19,20,21,22,23]), and (2) tree algorithms, which have been applied in a few works [24,25].
Most applications proposed in the aforementioned works are based on two assumptions. The first is that many NWCMS applications, from a mathematical point of view, may be considered as a mapping, $M$, that is, a relationship between two vectors or two sets of parameters, $X$ and $Y$:

$$Y = M(X), \qquad X \in \mathbb{R}^n, \; Y \in \mathbb{R}^m, \tag{1}$$

where $n$ and $m$ are the dimensionalities of the vectors $X$ and $Y$, respectively.
The second is that ML provides an all-purpose nonlinear fitting capability. NNs, the major ML tool used in applications, are "universal approximators" [26] for complex multidimensional nonlinear mappings [27,28,29,30,31]. Such tools can be used, and have already been used, to develop a large variety of applications for NWCMSs.
A generic NN that is used for modeling/approximating complex nonlinear multidimensional mappings is called the multilayer perceptron. It is composed of "neurons" that are arranged in "layers". A generic neuron can be expressed as

$$x_j^{n+1} = \phi\left(b_j^n + \sum_{i=1}^{k_n} a_{ji}^n \, x_i^n\right). \tag{2}$$

Eq. (2) represents the neuron number $j$ in the layer number $n$; $x_j^{n+1}$ is the output of the neuron that, at the same time, is an input to neurons of the layer number $n+1$. Here $x_i^n$ are inputs to neurons of the layer number $n$ (outputs of neurons of the layer number $n-1$; the input layer corresponds to $n = 0$), $a$ and $b$ are fitting parameters or NN weights and biases, $\phi$ is the so-called activation function, and $k_n$ is the number of neurons in the layer number $n$. The entire layer number $n$ can be represented by a matrix equation:

$$X^{n+1} = \Phi\left(B^n + A^n \cdot X^n\right), \tag{3}$$

where for $n = 0$, $X^0 = X$, a vector of the NN inputs. If the layer number $n+1$ is the output layer, linear neurons are often used for the output layer:

$$Y = B^n + A^n \cdot X^n, \tag{4}$$

where $Y$ is a vector of outputs.
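To make Eqs. (2)-(4) concrete, the following minimal sketch in Python/NumPy computes the forward pass of an SNN with a tanh hidden layer and a linear output layer. The layer sizes and the randomly initialized (i.e., untrained) weights are purely illustrative assumptions, not values from any of the cited works.

```python
# Minimal sketch of the multilayer perceptron forward pass, Eqs. (2)-(4).
# Weights are random and untrained; only the structure is illustrated.
import numpy as np

def layer(X_n, A_n, B_n, phi=np.tanh):
    """One NN layer, Eq. (3): X^{n+1} = Phi(B^n + A^n . X^n)."""
    return phi(B_n + A_n @ X_n)

def shallow_nn(X, A0, B0, A1, B1):
    """SNN: one tanh hidden layer (Eq. 3) plus a linear output layer (Eq. 4)."""
    X1 = layer(X, A0, B0)              # hidden layer of nonlinear (tanh) neurons
    return B1 + A1 @ X1                # linear output layer, Eq. (4)

rng = np.random.default_rng(0)
n, k, m = 5, 20, 3                     # input dim n, hidden neurons k, output dim m
A0, B0 = rng.normal(size=(k, n)), rng.normal(size=k)   # weights/biases, hidden layer
A1, B1 = rng.normal(size=(m, k)), rng.normal(size=m)   # weights/biases, output layer

X = rng.normal(size=n)                 # X^0 = X, the NN input vector
Y = shallow_nn(X, A0, B0, A1, B1)      # Y = M(X), the emulated mapping, Eq. (1)
print(Y.shape)                         # (3,)
```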
The activation function $\phi$ is a nonlinear function (see Figure 2), often specified as the hyperbolic tangent; however, the rectified linear unit (ReLU), softmax, leaky ReLU, Gaussian, trigonometric functions, etc. are also used in applications [14]. All layers of the multilayer perceptron NN between the input and output layers are called "hidden layers". NNs with multiple hidden layers are called "deep neural networks" (DNNs). The simplest multilayer perceptron NN, with one hidden layer, is called a "shallow" NN (SNN). The SNN is a generic analytical nonlinear approximation or model for the mapping (1) and a mathematical solution of the ML problem [27,28,29]. Multiple authors have shown, in a variety of contexts, that the SNN can approximate any continuous or almost continuous (with a finite number of finite discontinuities) mapping (1) [22,30,31,32]. The accuracy of the SNN approximation, or the ability of the SNN to resolve details of the mapping (1), is proportional to the number of neurons $k$ in the hidden layer [33].
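The dependence of approximation accuracy on $k$ can be illustrated with a short sketch. The example below uses scikit-learn's MLPRegressor on a synthetic one-dimensional nonlinear mapping; the target function, hidden-layer sizes, and training settings are illustrative assumptions, not a prescription from the cited literature.

```python
# Sketch: SNN approximation error shrinks as the hidden layer grows.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(2000, 1))        # inputs of a 1-D mapping
y = np.sin(2.0 * X[:, 0]) + 0.5 * X[:, 0] ** 2    # a smooth nonlinear target

for k in (2, 8, 32):                              # hidden-layer sizes to compare
    snn = MLPRegressor(hidden_layer_sizes=(k,), activation="tanh",
                       max_iter=5000, random_state=0).fit(X, y)
    rmse = np.sqrt(np.mean((snn.predict(X) - y) ** 2))
    print(f"k={k:3d}  RMSE={rmse:.4f}")           # error decreases as k grows
```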
Additional hidden layers and/or nonlinear neurons in the output layer can be introduced, and the resulting DNN can be applied either to mapping approximation problems or to problems of a different nature. DNNs have been extremely successful in many areas, including in applications for numerical weather/climate modeling systems. However, as pointed out by Vapnik [29], from the standpoint of statistical learning theory, only the SNN has been formally shown to be a solution to the mapping approximation problem (see also Figure 3). Successful approximation of the mapping (1) by a DNN cannot be guaranteed theoretically, and this specific application of DNNs should be considered a heuristic approach. Both SNNs and DNNs have been successfully applied to numerical weather/climate modeling system mappings by different authors (see the discussion in the following Sections).
NNs are very successful in solving complex nonlinear mapping problems. Once trained, they are fast to apply and easily parallelizable. They use the training data set only during training: a trained NN contains all the necessary information about the mapping in a set of NN weights and biases that is usually much smaller than the training set and does not require a lot of memory. However, NNs are difficult to interpret because information about the mapping is distributed over multiple weights and biases, which is typical of any nonlinear statistical model. Also, as with any nonlinear statistical model, an NN has a limited ability for prediction/extrapolation/generalization; however, a well-trained NN is capable of limited but accurate generalization.
A decision tree is a tree-like model of decisions and their consequences. Decision trees are widely used in statistics and ML for solving nonlinear classification and regression problems. They are easily interpretable; however, they are not robust to noise in the data. To avoid instabilities and improve the accuracy and robustness of the approach, an ensemble of decision trees, called a 'forest', has been developed. Introducing elements of randomness into the trees turned out to be beneficial; hence the approach is named "random forest" [34]. This algorithm has many advantages: it does not require complex pre-processing and normalization of data; it easily handles missing data; and it is robust to noisy data and outliers. However, random forests require more memory than other algorithms because the algorithm stores multiple trees, which can be a problem if the dataset is large; to apply a trained random forest, the entire ensemble of trees must be kept in memory. Also, a random forest cannot predict values outside the range of the training set, since averaging over trees, each of which is built on the training set, is central to random forest models. Thus, we cannot expect reliable predictions/extrapolations/generalizations from the random forest algorithm. For more detailed discussions of NNs, trees, and other ML tools, see [14].
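The inability of a random forest to extrapolate can be demonstrated with a minimal sketch. The example below, using scikit-learn's RandomForestRegressor on synthetic data (an illustrative assumption, not an experiment from the cited works), shows predictions saturating at the edge of the training range because the trees can only average target values seen during training.

```python
# Sketch: a random forest cannot predict outside the training range.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X_train = rng.uniform(0.0, 10.0, size=(1000, 1))   # training inputs in [0, 10]
y_train = 2.0 * X_train[:, 0]                      # simple linear target, y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

X_test = np.array([[5.0], [15.0], [25.0]])         # 15 and 25 lie outside [0, 10]
print(rf.predict(X_test))                          # ~10, then both clipped near ~20
```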
2.2. ML for NWCMS Specifics
It is critical to understand that the development of many ML applications for numerical weather and climate modeling systems differs essentially from the standard ML approach. First, a standard ML approach consists of two major steps: (1) training an ML tool (e.g., an NN) using training and test sets; and (2) validating the trained tool on an independent validation set. If the validation is successful, the tool is ready for use. In this sense, "Generative AI" (like ChatGPT), i.e., deep-learning models that can generate high-quality text, images, and other content based on the data they were trained on, can be considered a traditional ML application.
When an ML application is being developed for a numerical weather modeling system to work within the model, or in the model environment (e.g., a data assimilation system) in close connection with the model, a third and most important validation step must be included in the approach: (3) the trained application must be introduced into the model to check its coherence with the model and the model environment, to verify that it does not disrupt the stable functioning of the modeling system, and to confirm that the system keeps producing meaningful results.
Second, such applications usually do not use unstructured datasets (sets that consist of a mixture of numerical data, text, images, etc.) for training and validation. Usually, structured datasets consisting of matrices or tables of numerical observations or simulated data are used.
Third, generally, there are not enough observations for the training and validation of ML applications for NWCMSs. The observations in weather and climate systems are usually sparse and located close to the land and ocean surface. Thus, observations are very often augmented by data simulated by numerical models. Also, analyses and reanalyses, which thoroughly fuse observations with data simulated by numerical models, are often used.
It is noteworthy that the use of a relatively large number of mostly uninterpretable parameters led to the perception of ML as a “black box” approach, which created problems with its acceptance by weather and climate modelers. In essence, the trade-off between simple statistics and ML is mostly between interpretability and accuracy. With relatively few parameters and few predictors (often by using predictor selection methods to reduce the number of predictors), simple statistical models are generally much more interpretable than ML models.
Most ML tools are closely related to nonlinear nonparametric statistics. A limitation of the parametric approach is that the functional form for the statistical model is specified, which may not work well for some datasets. For example, a linear regression model may not work for data representing essentially nonlinear behavior. The alternative nonparametric modeling approach still has parameters, but the parameters are not used to control the specified functional form of the model; instead, they are used to control the model complexity. Thus, in principle, a nonparametric approach (and an ML approach as well) is more flexible, and a nonparametric/ML model can automatically adjust to/learn any nonlinear behavior exhibited by the data. On the other hand, parametric models (if they work well) may be easier to interpret. With nonparametric/ML models, such a straightforward interpretation is not possible.
For example, coefficients of linear regression models may be interpreted as contributions of the corresponding input variables to the output variable. In contrast, ML methods such as neural networks and random forests are often run as ensembles of models initialized with different random numbers, leading to a vast number of parameters that are largely uninterpretable. In this case, the contribution of an input parameter is distributed over multiple coefficients of the nonlinear nonparametric/ML model. Also, over time, datasets become increasingly larger and more complex, making good interpretability harder to achieve even with parametric statistical models. At the same time, the advantage in prediction accuracy attained by ML models makes them more and more attractive. Currently, many published works are devoted to the problem of the interpretability of ML models [35].
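The interpretability contrast can be illustrated with a short sketch. In the example below (synthetic data and scikit-learn tools chosen for illustration; none of this is prescribed by the cited works), linear-regression coefficients read directly as input contributions, whereas an ML model's input influence must be probed indirectly, here via permutation importance.

```python
# Sketch: interpretable coefficients vs. indirect ML importance estimates.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)  # x2 is irrelevant

lin = LinearRegression().fit(X, y)
print(lin.coef_)             # ~[3, -1, 0]: coefficients read as contributions

rf = RandomForestRegressor(random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, random_state=0)
print(imp.importances_mean)  # relative influence only, not direct contributions
```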