2.2.1. Supervised Learning
Supervised learning refers to an ML method that involves training a model on labeled data. The labeled data comprise input-output pairs, where the input is the data on which the model is trained and the output is the expected outcome [
45,
46]. The model learns to map inputs to outputs by reducing the error between the predicted and actual outputs during training. Once trained, the model can be applied to generate predictions on new, unlabeled data [
47,
48]. Regression and classification are the two basic sub-types of supervised learning algorithms (
Figure 1) [
45].
1.
Regression: Regression is a supervised learning approach that forecasts a continuous output variable based on one or more input variables. It aims to identify a mathematical function that maps the input variables to a continuous output, which may represent a single value or a range of values [49]. Linear regression, polynomial regression, and support vector regression (SVR) are the three main regression algorithms used in supervised learning [
50].
Linear and Polynomial Regression: Linear regression is a prevalent and straightforward approach used to forecast a continuous output variable utilizing one or multiple input variables. It uses a straight line to indicate the correlation between the input variables and the output variables [
51]. On the other hand, Polynomial regression, a type of linear regression, employs n
th-degree polynomial functions to depict the connection between input features and the outcome variable [
52]. This can enhance the accuracy of predictions by enabling the model to capture more intricate correlations between the input data and the target variable. In renewable energy forecasting, both linear and polynomial regression can be used to predict the power output of RES such as solar and wind power [
53,
54]. Weather variables such as temperature, humidity, and wind speed are frequently included among the input features, along with historical power output data. The target variable is the power output of the renewable energy source, which is predicted from these input features.
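As a minimal, hedged illustration of this setup (not drawn from the cited studies), the following Python sketch fits linear and polynomial regression models with scikit-learn on synthetic weather features; the feature names, data values, and degree-2 polynomial are illustrative assumptions.

```python
# Sketch of linear vs. polynomial regression for power-output forecasting.
# Synthetic data and feature names are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(5, 35, n),    # temperature (degC)
    rng.uniform(20, 90, n),   # relative humidity (%)
    rng.uniform(0, 15, n),    # wind speed (m/s)
])
# Synthetic power output with a mild non-linear term
y = 0.8 * X[:, 0] - 0.1 * X[:, 1] + 0.05 * X[:, 2] ** 2 + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)

print("linear R^2:", r2_score(y_test, linear.predict(X_test)))
print("poly   R^2:", r2_score(y_test, poly.predict(X_test)))
```

The polynomial pipeline simply expands the features before the same least-squares fit, which is why polynomial regression remains linear in its coefficients while capturing non-linear input-output relationships.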
For instance, Ibrahim et al. (2012) used data from a weather station collected over three years to create a linear regression model to predict solar radiation in Perlis. The model used three input variables and had a good fit with an R-squared value of 0.954. The authors concluded that their model could be a useful tool for estimating solar radiation in Perlis [
55]. Ekanayake et al. (2021) created artificial neural network (ANN), multiple linear regression (MLR), and power regression (PR) models to predict the wind power of a Sri Lankan wind farm. The models were developed using five years of power generation data and showed acceptable accuracy, with low RMSE, low bias, and high correlation coefficients. The ANN model was the most precise, but the MLR and PR models provide insights transferable to other wind farms in the same area [
56]. Mustafa et al. (2022) compared four regression models, linear regression, logistic regression, lasso regression, and elastic regression, for solar power prediction. The results showed that all four models are effective, but elastic regression outperformed the others in predicting maximum solar power output. Applying principal component analysis (PCA) further improved the results of the elastic regression model. The paper focuses on the strengths and weaknesses of each solar power prediction model [
57].
Support Vector Regression (SVR): The SVR algorithm is utilized for regression analysis within the field of ML [58]. It works by finding the hyperplane in a high-dimensional feature space that best fits the data: the hyperplane is chosen so that as many points as possible lie within a specified margin around it, while the discrepancy between the predicted and actual values is minimized. SVR is also a powerful model for predicting renewable energy potential at a specific location. For example, Yuan et al. (2022) proposed a Jellyfish Search algorithm-optimized SVR (IJS-SVR) model to predict wind power output and address grid connection and power dispatching issues. The SVR was optimized with the IJS technique, and the model was tested in both spring and winter. IJS-SVR outperformed other models in both seasons, providing an effective and economical method for wind power prediction [
59]. In addition, Li et al. (2022) created ML-based algorithms for short-term solar irradiance prediction, incorporating Hidden Markov Model and SVM regression techniques. Bureau of Meteorology, they demonstrated that their algorithms can effectively forecast solar irradiance for 5-30 minute intervals in various weather conditions [
60]. Another author Mwende et al. (2022) developed SVR and Random Forest Regression (RFR) models for real-time photovoltaic (PV) power output forecasting. On the validation dataset, SVR performed better than RFR with an RMSE of 43.16, adjusted R2 of 0.97, and MAE of 32.57, in contrast to RFR’s RMSE of 86, adjusted R
2 of 0.90, and MAE of 69 [
61].
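To make the SVR workflow concrete, the hedged sketch below fits a scikit-learn SVR with an RBF kernel to a synthetic wind-speed/power relationship; the kernel, C, and epsilon values are arbitrary illustrative choices rather than settings from the studies above.

```python
# SVR sketch for wind-power regression; synthetic data and hyperparameters
# are illustrative only.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
wind_speed = rng.uniform(0, 25, 400).reshape(-1, 1)      # m/s
power = np.clip(0.01 * wind_speed.ravel() ** 3, 0, 100)  # idealized power curve
power += rng.normal(0, 2, 400)                           # measurement noise

model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.5),  # points inside the eps-tube incur no loss
)
model.fit(wind_speed[:300], power[:300])

pred = model.predict(wind_speed[300:])
print("MAE on held-out data:", mean_absolute_error(power[300:], pred))
```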
2. Classification: Classification, a form of supervised learning, involves using one or more input variables to anticipate a categorical output variable [62]. Classification aims to find a function that maps the input variables to discrete output categories. The most widely used classification algorithms in RES forecasting include logistic regression, decision trees, random forests, and support vector machines.
Logistic Regression: Logistic regression is a classification method that utilizes one or more input variables to forecast a binary output variable [
63,
64]. It models the probability of the output variable being true or false using a sigmoid function. In renewable energy forecasting, logistic regression can be used to predict whether or not a specific event will occur, such as a solar or wind farm reaching a certain level of power output. For instance, Jagadeesh et al. (2020) developed an ML-based forecasting method for solar power output. They used a logistic regression model with data from an 11-month period, including plant output, solar radiation, and local temperature. They found that the right choice of solar variables is essential for precise forecasting, and their study examined both the algorithm's accuracy and the probability that a plant will generate power on a given future day [
65].
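A small illustrative sketch of this binary-event framing, using scikit-learn's LogisticRegression on synthetic features to estimate the probability that plant output exceeds a chosen threshold (features, threshold, and data are placeholder assumptions):

```python
# Logistic-regression sketch: probability that PV output exceeds a threshold.
# Features, threshold, and data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
n = 600
X = np.column_stack([
    rng.uniform(0, 1000, n),  # solar irradiance (W/m^2)
    rng.uniform(5, 40, n),    # ambient temperature (degC)
])
# Binary label: 1 if the (synthetic) plant output exceeds a chosen threshold
y = (0.05 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(0, 5, n) > 25).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("P(exceeds threshold):", clf.predict_proba(X_test[:3])[:, 1])  # sigmoid outputs
```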
Decision Trees: An alternative classification method is the decision tree, which divides the input space into smaller regions based on input variable values and then assigns a label or value to each region [
64]. Several studies have developed decision tree models to forecast power output from different renewable energy systems. Essama et al. (2018) developed models to predict the power output of a photovoltaic (PV) system in Cocoa, Florida (USA), using weather parameters obtained from the United States National Renewable Energy Laboratory (NREL). By comparing the performance of ANN, random forest (RF), decision tree (DT), extreme gradient boosting (XGB), and LSTM algorithms, they aimed to fill a research gap in the area. They concluded that, although all of the algorithms performed well, the ANN was the most accurate method for forecasting PV solar power generation.
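For illustration only (not the setup of the study above), the sketch below partitions synthetic weather inputs with a scikit-learn decision tree to classify PV output into discrete levels; the features, class thresholds, and tree depth are assumed values.

```python
# Decision-tree sketch: classify PV output into low/medium/high from weather
# inputs. Data, class thresholds, and max_depth are illustrative placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(3)
n = 800
X = np.column_stack([
    rng.uniform(0, 1000, n),  # irradiance (W/m^2)
    rng.uniform(0, 100, n),   # cloud cover (%)
])
power = 0.04 * X[:, 0] * (1 - X[:, 1] / 200) + rng.normal(0, 2, n)
y = np.digitize(power, bins=[10, 25])  # 0 = low, 1 = medium, 2 = high

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```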
Random Forest: Random forest is a supervised ML method that builds several decision trees and merges their predictions to produce an effective and reliable forecast [
66]. The bagging technique, which is employed by random forest, reduces the variance of the base algorithms. This technique is particularly useful for forecasting time series data [
67]. Random forest mitigates the correlation between trees by introducing randomization in two ways: bootstrap sampling from the training set and random selection of a feature subset at each split. Each of the N trees is grown independently, which enables parallel processing.
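The following hedged sketch shows how these two sources of randomization map onto scikit-learn's RandomForestRegressor; the synthetic data and hyperparameter values are illustrative assumptions, not recommendations from the studies discussed below.

```python
# Random-forest sketch: bagging plus per-split feature subsampling for
# wind-power regression. Synthetic data; hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([
    rng.uniform(0, 25, n),     # wind speed (m/s)
    rng.uniform(0, 360, n),    # wind direction (deg)
    rng.uniform(-10, 35, n),   # temperature (degC)
])
y = np.clip(0.01 * X[:, 0] ** 3, 0, 120) + rng.normal(0, 3, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(
    n_estimators=200,      # N trees grown independently
    max_features="sqrt",   # random feature subset considered at each split
    bootstrap=True,        # bagging: resample the training set for each tree
    n_jobs=-1,             # independent trees can be fitted in parallel
    random_state=0,
).fit(X_train, y_train)

print("test RMSE:", mean_squared_error(y_test, rf.predict(X_test)) ** 0.5)
```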
Vassallo et al. (2020) investigated optimal strategies for random forest (RF) modeling in wind speed/power forecasting. The study examined the use of RF as a corrective measure, compared direct versus recursive multi-step prediction, and assessed the impact of training data availability. Findings indicate that RF is more effective when deployed as an error-correction tool for the persistence approach and that the direct forecasting strategy performs slightly better than the recursive strategy. Increased data availability consistently improves forecasting accuracy [
68]. In addition, Shi et al. (2018) put forward a two-stage feature selection process coupled with a supervised random forest model to address the overfitting and the weak reasoning and generalization of neural network models in short-term wind power forecasting. The proposed methodology removes redundant features, selects relevant samples, and evaluates the performance of each decision tree. To address the inadequacies of the internal validation index, a new external validation index correlated with wind speed is introduced. Simulation examples and case studies demonstrate that the model outperforms other models in accuracy, efficiency, and robustness, especially for noisy data and wind power curtailment [
69]. Natarajan and Kumar (2015) also compared wind power forecasting methods. Physical methods rely on meteorological data and Numerical Weather Prediction (NWP), while statistical methods such as ANN and SVM depend on historical wind speed data. Their study experimented with the random forest algorithm and found it more accurate than ANN for predicting wind power at wind farms [
70].
Support Vector Machines (SVM): SVMs are classification algorithms that identify a hyperplane maximizing the margin between the hyperplane and the nearest data points of each class, akin to SVR [
71,
72]. SVM has been utilized in renewable energy forecasting to estimate the power output of wind and solar farms, using input features such as historical power output, weather data, and time of day. For instance, Zeng et al. (2022) proposed a 2D least-squares SVM (LS-SVM) model for short-term solar power prediction. The model uses atmospheric transmissivity and meteorological variables and outperforms the reference autoregressive model and a radial basis function neural network model in prediction accuracy [73]. Meenal and Selvakumar (2018) compared the accuracy of SVM, ANN, and empirical solar radiation models in forecasting the monthly mean daily global solar radiation (GSR) of several Indian cities using varying input parameters. Using WEKA software, the authors determined the most significant parameters and concluded that the SVM model with the most influential input parameters yields superior performance compared with the other models [74]. Generally, classification algorithms are used to predict categorical output variables, while regression techniques are used to predict continuous output variables; the particular task at hand and the properties of the data determine which method should be used.
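As a closing illustration of the classification setting (a hedged sketch, not reproducing any cited study), the snippet below trains a scikit-learn support vector classifier to label periods as high- or low-output from synthetic irradiance and time-of-day features; all names and values are assumptions.

```python
# SVM-classification sketch: label periods as high or low PV output.
# Synthetic features and labels; the RBF kernel and C=1.0 are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(5)
n = 700
X = np.column_stack([
    rng.uniform(0, 1000, n),  # irradiance (W/m^2)
    rng.uniform(0, 24, n),    # hour of day
])
y = ((X[:, 0] > 400) & (X[:, 1] > 7) & (X[:, 1] < 19)).astype(int)  # 1 = high output

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```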
2.2.2. Unsupervised Learning
Another form of ML is unsupervised learning, where an algorithm is trained on an unlabeled dataset lacking known output variables, with the objective of uncovering patterns, structures, or relationships within the data [
75,
76,
77]. Unsupervised learning algorithms can be primarily classified into two types, namely clustering and dimensionality reduction [
78].
Clustering: Clustering is an unsupervised learning method that groups related data points according to their proximity or similarity. Clustering algorithms such as K-means clustering, hierarchical clustering, and density-based clustering are commonly used in energy systems, where their primary objective is to discover the natural groupings or inherent patterns within the data [
75,
76]. K-means clustering is a widely used approach for dividing data into k clusters, where k is a user-defined number. The algorithm assigns each data point to the nearest cluster centroid and iteratively updates each centroid as the average of the data points assigned to its cluster [
75,
76]. Hierarchical clustering is a family of algorithms that recursively merge or split clusters based on their similarity or distance, creating a hierarchical, tree-like structure of clusters. Density-based clustering, the other main family, groups together data points that lie within a certain density threshold and separates them from regions of lower density [
75,
76,
77].
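A brief sketch of K-means applied to synthetic daily weather summaries, of the kind used to group days before forecasting; the feature set and the choice of k = 3 are illustrative assumptions.

```python
# K-means sketch: group days by weather profile. Synthetic daily summaries;
# the number of clusters (k = 3) is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
days = np.column_stack([
    rng.uniform(0, 8, 365),    # daily irradiation (kWh/m^2)
    rng.uniform(0, 100, 365),  # mean cloud cover (%)
    rng.uniform(0, 20, 365),   # mean wind speed (m/s)
])

X = StandardScaler().fit_transform(days)       # put features on a common scale
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_                        # cluster index for each day
centroids = kmeans.cluster_centers_            # mean standardized profile per cluster
print("days per cluster:", np.bincount(labels))
```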
Dimensionality Reduction: Dimensionality reduction is an unsupervised learning technique used to reduce the number of input variables or features while retaining the significant information or structure in the data [
75,
76,
77]. The purpose of dimensionality reduction is to find a lower-dimensional representation of the data that captures most of its variation or variance. Principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and autoencoders are dimensionality reduction algorithms used in renewable energy forecasting [78]. Principal component analysis (PCA) is a commonly used method for decreasing the dimensionality of a dataset. It identifies the principal components, i.e., the directions of greatest variability in the data, and then projects the data onto these components [
78]. t-SNE is a non-linear dimensionality reduction algorithm that is particularly useful for visualizing high-dimensional data in low-dimensional space. It uses a probabilistic approach to map similar data points to nearby points in the low-dimensional space. Autoencoders are a type of neural network that can learn to encode and decode high-dimensional data in a lower-dimensional space. The encoder network is trained to condense the input data into a representation with fewer dimensions, and the decoder network is trained to reconstruct the original data from this condensed representation [
78].
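To make the PCA step concrete, here is a short hedged sketch that projects synthetic, highly correlated multi-site irradiance measurements onto their first two principal components; the dimensions and data are placeholders.

```python
# PCA sketch: compress correlated multi-site irradiance measurements into two
# principal components. Synthetic data; the 2-component choice is illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
base = rng.uniform(0, 1000, (500, 1))        # common irradiance signal
X = base + rng.normal(0, 50, (500, 10))      # 10 correlated "sites"

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
X_reduced = pca.transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_reduced.shape)
```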
In general, unsupervised learning algorithms are particularly useful when there is a large amount of unstructured data that needs to be analyzed, and when it is not clear what the specific target variable should be. Unsupervised learning has found various applications in the field of renewable energy forecasting, and one of its commonly used applications is the clustering of meteorological data [
79]. For example, in a study by J. Varanasi and Tripathi, M. (2019), K-Means clustering was used to group days of the year, sunny days, cloudy days and rainy days into clusters based on similarity for short term PV power generation forecasting [
80]. The resulting clusters were then used to train separate ML models for each cluster, which resulted in improved PV power forecasting accuracy. Unsupervised learning has also been used for anomaly detection in renewable energy forecasting. Anomaly detection refers to the task of pinpointing data points that exhibit notable deviations from the remaining dataset. In the context of renewable energy forecasting, anomaly detection can aid in identifying exceptional weather patterns or uncommon circumstances that may impact renewable energy generation. For example, in a study by Xu et al. (2015), the K-Means algorithm was used to identify anomalous wond power output data, which were then employed to improve the accuracy of the wind power forecasting model [
81].
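One simple way to operationalize this idea (a hedged sketch, not the method of the cited study) is to flag records whose distance to the nearest K-means centroid exceeds a chosen percentile; the synthetic data, k, and cutoff are assumptions.

```python
# Anomaly-detection sketch: flag wind-power records far from every K-means
# centroid. Synthetic data; k = 5 and the 99th-percentile cutoff are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
wind_speed = rng.uniform(0, 25, 1000)
power = np.clip(0.01 * wind_speed ** 3, 0, 100) + rng.normal(0, 2, 1000)
wind_speed[:10] = rng.uniform(15, 25, 10)  # strong wind...
power[:10] = 0.0                           # ...but zero output: curtailment-like anomalies
X = np.column_stack([wind_speed, power])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.min(kmeans.transform(X), axis=1)  # distance to nearest centroid
threshold = np.percentile(dist_to_centroid, 99)
anomalies = np.where(dist_to_centroid > threshold)[0]
print("flagged records:", anomalies[:10])
```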
In the realm of renewable energy forecasting, unsupervised learning has also been utilized for feature selection, which involves choosing a smaller set of pertinent features from a larger set of input variables. Feature selection can reduce the computational complexity of ML models and improve the accuracy of renewable energy output predictions. For example, in a study by Scolari et al. (2015), K-means clustering was used to identify a representative subset of features for predicting solar power output [
82].
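A minimal sketch of clustering-based feature selection (an illustrative approach under assumed data, not necessarily the procedure of the cited study): cluster the candidate features by the similarity of their standardized profiles across samples, then keep one representative feature per cluster.

```python
# Feature-selection sketch: cluster candidate features and keep the feature
# closest to each cluster centroid. Synthetic data; k = 3 is illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
n_samples, n_features = 500, 12
X = rng.normal(size=(n_samples, n_features))
X[:, 6:] = X[:, :6] + rng.normal(0, 0.1, (n_samples, 6))  # redundant near-copies

# Cluster features (columns), not samples: each feature is a point in R^n_samples
feature_profiles = StandardScaler().fit_transform(X).T
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(feature_profiles)

selected = []
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    dists = np.linalg.norm(feature_profiles[members] - kmeans.cluster_centers_[c], axis=1)
    selected.append(members[np.argmin(dists)])  # representative feature per cluster
print("selected feature indices:", sorted(selected))
```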
Overall, unsupervised learning is a powerful tool for analyzing large amounts of unstructured data in renewable energy forecasting. Clustering, anomaly detection, and feature selection are just a few of the many applications of unsupervised learning in this field, and new techniques are continually being developed to address the unique challenges of renewable energy forecasting.