This section delineates the procedures for data collection, data preprocessing, and the implementation of machine learning algorithms.
Wind energy forecasting involves processing and interpreting large quantities of weather-related data to generate accurate estimates of future energy generation. These tasks employ a diverse range of analytical methodologies and strategies to enhance forecast precision. The meteorological data, consisting of wind speed, wind direction, temperature, humidity, wind gusts, and dew point, is accessible and can be used for power generation predictions, as Figure 8 shows [15]. Significant statistical relationships can be identified between different meteorological factors and energy production. These correlations can be used to construct models that accurately represent the influence of specific weather conditions on the efficiency of wind turbines. This examination determines the selection of features for the forecasting models. Selecting suitable machine learning techniques, such as regression models, for predictive modeling depends on the specific demands of the forecasting problem and the characteristics of the data.
3.1. Dataset and Preprocessing
The Wind Power Generation Data - Forecasting dataset was acquired from Kaggle (https://www.kaggle.com/datasets/mubashirrahim/wind-power-generation-data-forecasting/data), where it was uploaded by Mubashir Rahim. The data was gathered meticulously by meteorological equipment and IoT devices deployed at the site. The meteorological apparatus measured temperature, humidity, dew point, and wind properties at predetermined elevations of 2, 10, and 100 meters. Concurrently, sensors installed on the wind turbines monitored their efficiency and electricity production. The datasets consist of a detailed hourly log obtained from four distinct sites, spanning January 2, 2017, 00:00:00 to December 31, 2021, 23:00:00. The data underwent rigorous quality checks to detect and rectify anomalies and inconsistencies, ensuring a high level of data reliability, and regular equipment maintenance consistently ensured data quality over time.
The following are the columns and weather parameters in the data:
Time: The moment in the day when the measurements were made.
temperature_2m: The temperature in degrees Fahrenheit at two meters above the surface.
relativehumidity_2m: The proportion of relative humidity at two meters above the surface.
dewpoint_2m: Dew point, measured in degrees Fahrenheit at two meters above the surface.
windspeed_10m: The wind speed, expressed in meters per second, at 10 meters above the surface.
windspeed_100m: The wind speed at 100 meters above the surface, expressed in meters per second.
winddirection_10m: The wind direction at 10 meters above the surface, expressed in degrees (0–360).
winddirection_100m: The direction of the wind at 100 meters above the surface, expressed in degrees (0–360).
windgusts_10m: The wind gust speed at 10 meters above the surface; a gust is an abrupt, transient increase in wind speed.
Power: The normalized turbine output, expressed as a percentage of the turbine’s maximum potential output, and set between 0 and 1.
The normal or Gaussian distribution indicated in Figure 9 is the familiar bell-shaped curve, characterized by the arithmetic mean μ and the standard deviation σ. The normal distribution is the most frequently employed distribution in probability and statistics. Techniques such as linear regression, analysis of variance (ANOVA), and t-tests depend heavily on the assumption that the data follows a normal distribution. Because the dataset contains outliers, which reflect the accuracy of the sensors' measurements, the curves do not have a well-defined shape [16].
Libraries: Python’s libraries for data analysis, visualization, and scientific computing are extensively utilized. They provide a comprehensive range of tools and features that make it easier to explore data and generate insights [
17]. The libraries to be utilized in the preprocessing stage are as follows:
Pandas is a robust Python package utilized for the manipulation and analysis of data. The software provides data structures such as DataFrames and Series, which facilitate the manipulation and analysis of organized data.
NumPy, short for "Numerical Python," is an essential library for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, together with a collection of mathematical routines to operate on these arrays efficiently.
Matplotlib is a flexible library that enables the generation of static, interactive, and animated visualizations in the Python programming language. Its pyplot module offers a MATLAB-like interface for producing plots and visualizations, simplifying the creation of charts, histograms, scatter plots, and other graphical representations.
Seaborn is a data visualization package that builds on matplotlib and provides a higher-level interface for creating attractive and informative statistical graphics. It streamlines the creation of intricate visualizations and offers built-in themes and color palettes to improve the visual appeal of plots.
Dataset: There are four data frames, loc1, loc2, loc3, and loc4, all of equal size. All data in the datasets originates from IoT devices measuring meteorological conditions with consistent precision. Each data frame has 43,800 rows and 10 columns. This study centers on the analysis carried out on the loc1 dataset. The columns consist of time; temperature, relative humidity, and dew point at 2 meters; wind speed and direction at 10 and 100 meters; wind gusts at 10 meters; and power, as shown in the first 5 rows in Figure 10. There are six variables of float64 data type, three variables of int64 data type, and one variable of object data type.
Since time is an object, it is converted to datetime before being used for analysis. The conversion is performed with the pd.to_datetime function from the Pandas package [17]. The subsequent tables and figures originate from the exploratory data analysis conducted on loc1. Based on the corresponding time values, Table 1 organizes the power-generation data into separate columns for year, month, and day.
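As a sketch of this step, the snippet below converts an object-typed time column with pd.to_datetime and derives the year, month, and day columns; the frame here is a small synthetic stand-in, since the real loc1 data is loaded from the Kaggle CSV.

```python
import pandas as pd

# Synthetic stand-in for the loc1 dataset (hypothetical values; the real
# frame is loaded with pd.read_csv on the Kaggle file).
loc1 = pd.DataFrame({
    "Time": ["2017-01-02 00:00:00", "2017-01-02 01:00:00", "2017-01-02 02:00:00"],
    "Power": [0.12, 0.15, 0.10],
})

# 'Time' arrives as an object (string) column; convert it to datetime64.
loc1["Time"] = pd.to_datetime(loc1["Time"])

# Derive the year/month/day columns used to organize the power data.
loc1["year"] = loc1["Time"].dt.year
loc1["month"] = loc1["Time"].dt.month
loc1["day"] = loc1["Time"].dt.day
```

After conversion, the .dt accessor exposes the calendar components directly, which is what makes the Table 1 grouping straightforward.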
Null values: A non-null value refers to any numerical, textual, or other value that is not null [18]. The data frame has 43,800 non-null values in each column, corresponding to the index range 0-43799. Pandas' .isnull().sum() method returns the number of null values for each variable; as described in Table 2, the datasets contain no null values. Identifying influential and anomalous data points is essential, as it aids future data collection and the proper utilization of existing knowledge.
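The null-value check can be sketched as follows, using pandas' .isnull().sum() to count missing entries per column; the frame here is a small hypothetical example with one injected NaN (on loc1 every column reports 0).

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, one missing temperature).
df = pd.DataFrame({
    "temperature_2m": [55.1, 54.8, np.nan],
    "windspeed_10m": [3.2, 4.1, 3.9],
})

# Count null values per column.
null_counts = df.isnull().sum()
print(null_counts)
```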
Outliers: The degree to which a data point deviates from the mean in terms of standard deviations is measured statistically by the z-score [
18]. The z-score is calculated using the following formula:

z = (x − mean) / std

In this context, x represents a specific data point, mean represents the average value of the dataset, and std represents the standard deviation of the dataset.
It appears that the dataset contains some outliers, as illustrated in Figure 11. Consequently, Table 3 shows the results of removing outliers to achieve improved outcomes.
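A minimal sketch of z-score-based outlier removal on a synthetic wind-gust sample; the 3-standard-deviation cutoff is a common convention assumed here, not a threshold stated in the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical wind-gust sample with two injected extreme readings.
gusts = pd.Series(np.concatenate([rng.normal(8.0, 2.0, 1000), [40.0, -30.0]]))

# z-score of each point: how many standard deviations it lies from the mean.
z = (gusts - gusts.mean()) / gusts.std()

# Keep only points within 3 standard deviations of the mean.
cleaned = gusts[z.abs() < 3]
print(len(gusts) - len(cleaned), "outliers removed")
```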
The distribution of wind gusts in
Figure 12 may show significant skewness due to extreme values, which can obscure the true underlying patterns. After outlier removal, the distribution typically becomes more normal, allowing for clearer insights and more accurate analyses.
Correlation: The dataset is now ready to be used to derive significant insights. A correlation study among measurements such as temperature, relative humidity, and wind speed assists in selecting parameters for prediction [19]; highly correlated metrics aid in forecasting the power generated by wind turbines. A correlation heatmap is a visual representation that presents the correlation between many variables as a color-coded matrix. Using the Python programming language and the seaborn library, a helpful heatmap can be generated that displays the associations between variables, while a correlation table displays the corresponding correlation coefficients.
Conventional clustering and correlation analysis face difficulties with the vast volume and low density of valuable information in big data. To improve energy forecasting, big data-driven correlation analysis with clustering is recommended. Correlation analysis among measures such as temperature, dew point, relative humidity, wind direction, wind gusts, and wind speed aids in the selection of forecast parameters, and these strongly correlated parameters contribute to the accurate prediction of the electricity produced by wind turbines.
The heatmap in Figure 13 displays the relationships between every pairwise combination of variables. It is a potent tool for detecting and visualizing patterns in data and for condensing large amounts of information. A Python program using the Seaborn module can generate a heatmap that visually represents the associations between variables [16].
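A seaborn heatmap along these lines can be sketched as below. The data frame is synthetic, with the wind-speed and gust columns constructed to correlate, roughly mimicking the structure described for loc1.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
n = 500
base = rng.normal(8, 2, n)
# Hypothetical stand-in for loc1: wind speeds at both heights and gusts
# share a common driver, so they should correlate strongly.
df = pd.DataFrame({
    "windspeed_10m": base + rng.normal(0, 0.5, n),
    "windspeed_100m": 1.3 * base + rng.normal(0, 0.7, n),
    "windgusts_10m": 1.5 * base + rng.normal(0, 1.0, n),
    "temperature_2m": rng.normal(60, 10, n),
})

corr = df.corr()  # pairwise Pearson correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
```

On the real data, strongly correlated columns such as the wind-speed pair would stand out as dark cells, guiding the feature selection described above.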
When preparing a dataset for ML models, preprocessing stages include data standardization and splitting. Consequently, all three regression models (linear regression, random forest regression, and lasso regression) follow this preprocessing phase.
Standardization: This procedure guarantees that the dataset’s features (or variables) have a mean of 0 and a standard deviation of 1 [
20,
21]. This phase is essential as numerous ML algorithms exhibit enhanced performance when the data is normalized or standardized. The StandardScaler is a prevalent technique that subtracts the mean and normalizes data to unit variance.
The standardization formula for each feature x is as follows:

z = (x − μ) / σ

where μ is the feature's mean and σ its standard deviation.
Splitting the data: This refers to partitioning the dataset into training and testing (and occasionally validation) subsets [21]. The training set is used to develop the model, whereas the testing set is used to assess its performance on unseen data. A validation set may also be used to tune the model without touching the test set, particularly during hyperparameter optimization. This mitigates overfitting and helps the model generalize to new inputs. The data is typically divided 70-80% for training and 20-30% for testing; the precise ratio depends on the task at hand and the size of the dataset. After completing the data cleaning procedure, the dataset (loc1) is ready to be used for extracting meaningful insights.
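The two preprocessing steps can be sketched with scikit-learn's StandardScaler and train_test_split; the 80/20 split ratio is one of the options the text names, and the data is a random placeholder for the loc1 features and normalized power target.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Hypothetical feature matrix / target standing in for loc1's weather
# columns and the normalized Power column.
X = rng.normal(size=(1000, 6))
y = rng.uniform(0, 1, size=1000)

# 80/20 train/test split; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply it to both splits,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```

Fitting the scaler on the training split alone is the detail that keeps the test set truly unseen.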
3.2. Machine Learning and Wind Energy Forecast
ML algorithms have the capability to identify changes in their environment and adjust their behavior accordingly. Regression analysis is the subset of supervised learning employed here [22,23]. The objective of this part is to structure the forecasting in the wind energy domain. Following the data analysis, ML is employed to predict the power output. This extensive dataset provides valuable insights into the correlations between various weather patterns and the generation of wind energy; by applying predictive models to the meteorological data, it is feasible to forecast power output.
Regression models are used to determine how changes in one or more explanatory variables relate to changes in the dependent variable. To determine the regression model with the greatest efficiency and the lowest mean square error (MSE), three regression models are used and compared [24]. Comparing linear regression, random forest regression, and lasso regression yields valuable results and provides a more comprehensive understanding of each regression model and its capabilities.
Regression is extensively used in the field of big data to build predictive models. These models are designed to forecast certain outcomes for incoming data, rather than interpreting existing data. Regression analysis is a reliable method for identifying the variables that have an impact on a specific topic of interest [
23]. Regression analysis allows for the accurate identification of the key aspects, the ones that may be ignored, and the correlations between these elements.
Dependent Variable: The dependent variable is the primary factor that one seeks to anticipate or comprehend.
Independent Variables: These variables are postulated to exert an influence on the dependent variable.
The metrics included in the regression models are R2, Adjusted R2, MSE, RMSE, and MAE.
R2 (Coefficient of Determination): Assesses the model's efficacy in explaining the variance of the target variable. Ranges from 0 to 1, with values closer to 1 indicating a better fit [24].
Adjusted R2: Analogous to R2, but adjusted for the number of predictors in the model. Addresses overfitting; it increases only if additional predictors improve the model.
MSE: The mean of the squared deviations between predicted and actual values. Penalizes larger errors more heavily than smaller ones.
RMSE: The square root of the MSE. Expresses the mean error in the same units as the target variable.
MAE: The mean of the absolute differences between predicted and actual values. More robust to outliers than MSE or RMSE.
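For concreteness, the five metrics can be computed with scikit-learn as sketched below, on a small set of hypothetical actual/predicted values on the normalized 0-1 scale; the predictor count p used for Adjusted R2 is likewise hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual vs. predicted values on a normalized 0-1 scale.
y_true = np.array([0.20, 0.45, 0.70, 0.10, 0.55])
y_pred = np.array([0.25, 0.40, 0.65, 0.20, 0.50])

r2 = r2_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as the target
mae = mean_absolute_error(y_true, y_pred)

# Adjusted R2 corrects for the number of predictors p.
n, p = len(y_true), 3  # sample size and (hypothetical) predictor count
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```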
Overfitting in ML occurs when models are selected and hyperparameters are adjusted based on test loss, which challenges the assumption that the model’s performance is independent of the test set [
25]. The ultimate classifier may exhibit high performance only on a certain sample of examples within the test set, especially when method designers evaluate numerous models on the identical test set [
26]. K-fold cross-validation is used to assess the performance of predictive models. The dataset is partitioned into k folds, each of which is a subset. As shown in
Figure 14, for each of the k training and assessment cycles, the model uses a distinct fold as the validation set. The model’s generalization performance is measured by calculating the average of the performance metrics obtained from each fold.
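A minimal k-fold cross-validation sketch with scikit-learn, assuming k = 10 and synthetic linear data in place of loc1; each fold serves once as the validation set, and the scores are averaged as described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
# Hypothetical linear data standing in for the weather features / Power target.
X = rng.normal(size=(500, 4))
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(0, 0.1, 500)

# 10-fold CV: train on 9 folds, validate on the held-out fold, repeat.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

# Mean ± standard deviation across folds measures generalization.
print(f"R2: {scores.mean():.4f} +/- {scores.std():.4f}")
```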
3.2.1. Linear Regression
Linear regression, sometimes known as LR, is a prevalent and extensively utilized modeling technique in the fields of statistics and machine learning. The objective outcome is to establish a linear relationship between the input and target variables. The model postulates a linear amalgamation of the input features to predict the continuous output variable. In order to calculate the coefficients of these input variables, several optimization approaches, such as least squares, are utilized [
27]. LR is an excellent option when there are linear relationships between variables, due to its simplicity and ease of comprehension. Multiple linear regression (MLR) is the appropriate form of linear regression for this situation. MLR produces equations that relate several input variables (x1, x2, ..., xn) to a target variable (y):

y = β0 + β1x1 + β2x2 + ... + βnxn

Here, n represents the total number of input variables, βi denotes the coefficient for xi, and β0 refers to the intercept. Regularization approaches, such as adding a penalty term on the model's input variables, can restrict the freedom of the coefficients during learning, improving the accuracy of predictions on data not used for training.
MLR is a statistical technique that estimates the value of a dependent variable based on multiple independent variables. The objective of MLR is to construct a precise mathematical model that accurately depicts the linear relationship between the independent variables (x1, x2, ..., xn) and the dependent variable (y) being studied [28]. The primary MLR model is described as:

y = β0 + β1x1 + β2x2 + ... + βnxn

where y is the dependent variable, β0 is the intercept, and the coefficients β1, ..., βn represent the values assigned to the independent variables x1, ..., xn.
The intercept, often known as the "constant," in a regression model signifies the average value of the response variable when all predictor variables are set to zero. The intercept, denoted β0, is the estimated value of y when all xi are equal to zero; it establishes the baseline level of the dependent variable when the explanatory variables have no influence. The coefficients represent the weights of the independent variables, indicating the extent to which each variable contributes to the prediction.
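The MLR fit can be sketched with scikit-learn's LinearRegression (a least-squares solver, consistent with the optimization approach named above); the data is synthetic with known coefficients, so the recovered intercept (β0) and coefficients can be compared against the true values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
# Hypothetical features; y follows a known linear relationship plus noise,
# so the fitted parameters should recover the true ones.
X = rng.normal(size=(1000, 3))
true_beta = np.array([0.6, -0.3, 0.2])
y = 0.5 + X @ true_beta + rng.normal(0, 0.05, 1000)

# Ordinary least squares fit.
model = LinearRegression().fit(X, y)
print("intercept (beta_0):", model.intercept_)
print("coefficients (beta_1..beta_n):", model.coef_)
```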
The implementation of the linear regression technique, along with cross-validation, yielded the following metrics [24]:
R2 (0.6199): This means that about 61.99% of the variance in the target variable can be explained by the model’s features. This indicates a moderately strong fit, but there is still 38% of variability in the target that the model does not explain.
Adjusted R2 (0.6194): The Adjusted R2 is slightly lower than the R2 (0.6194 vs. 0.6199), which accounts for the number of predictors. It’s close to R2, suggesting that the added features are useful, but not overfitting.
MSE (0.0312): The low value of 0.0312 of MSE indicates that the model’s predictions are generally close to the actual values, though it’s harder to interpret MSE without comparing it to the scale of the data.
RMSE (0.1767): An RMSE of 0.1767 means that, on average, the model’s predictions are off by around 0.18 units from the actual values.
MAE (0.1389): An MAE of 0.1389 means that, on average, the model is off by 0.14 units, which is slightly lower than the RMSE. This suggests the model is performing well with relatively small errors.
Cross-Validation Results (Mean ± Std): These results give insight into how the model performs across multiple data splits during cross-validation. They help confirm the robustness of the model.
R2 (0.6299 ± 0.0082): The average R2 across cross-validation is 62.99%, slightly higher than the original R2. The standard deviation (±0.0082) indicates stable performance across different data splits.
Adjusted R2 (0.6290 ± 0.0082): The adjusted R2 is 62.90% with minimal variability, confirming that the model generalizes well without overfitting.
MSE (0.0303 ± 0.0007): The average error across cross-validation sets is 0.0303 with a small standard deviation (±0.0007), showing that the model is consistent.
RMSE (0.1741 ± 0.0021): The average RMSE is 0.1741, meaning the average prediction error is about 0.174 units, with slight variability (±0.0021).
MAE (0.1376 ± 0.0015): The average MAE is 0.1376, indicating that, on average, the model is 0.1376 units off. The small standard deviation (±0.0015) shows good consistency.
The cross-validation results validate the model’s stability, as the metrics consistently align across several data splits.
Figure 15 displays the actual and predicted values, with residuals representing the differences between the actual and predicted values in a model.
3.2.2. Random Forest Regression
Leo Breiman [
29] and Cutler Adele [
30] proposed the Random Forest Regression (RFR) algorithm in 2001 as an ML method for both regression and classification tasks. Classification and regression tree (CART) techniques fall into two categories depending on the nature of the output variables: regression decision trees and classification decision trees. RFR is a flexible ML technique employed for forecasting numerical values. To reduce overfitting and improve accuracy, it combines the predictions of many decision trees [
31]. Python’s machine-learning modules facilitate the efficient optimization and implementation of this method.
Random forest regression involves adjustable parameters, similar to other ML techniques. Some of the factors that influence a regression tree include the minimum number of observations at each terminal node, the fraction of data to sample in each regression tree, the number of trees, and the number of predictor variables randomly picked at each node [
32]. Cross-validation is employed to optimize these independent parameters. It is often recommended to set the number of decision trees to a high value in order to achieve a steady minimum for the prediction error, rather than making adjustments.
Following Breiman [29], the margin function and generalization error for a random forest are given as follows:

mg(X, Y) = av_k I(h_k(X) = Y) − max_{j ≠ Y} av_k I(h_k(X) = j)

PE* = P_{X,Y}(mg(X, Y) < 0)

Here X and Y are random vectors, and the margin function mg measures the extent to which the average vote for the correct output Y exceeds the average vote for any other output. The function I(.) is an indicator function, h_k represents the individual classifiers, and the averaging term av_k specifies the weighting of each tree's vote in determining the final classification or regression output.
In Random Forest regression, RandomizedSearchCV is employed to optimize the model’s hyperparameters by exploring a spectrum of potential parameter values [
33]. The get_param_grid method produces a dictionary of hyperparameters and their associated values for tuning in the Random Forest model. Each key in the dictionary signifies a hyperparameter (n_estimators: the number of trees in the forest, max_depth: the maximum depth of every tree in the forest, min_samples_split: the minimum number of samples required to split an internal node, min_samples_leaf: the minimum number of samples required to be at a leaf node) and the corresponding list comprises various values that RandomizedSearchCV will investigate.
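A sketch of this tuning step with scikit-learn's RandomizedSearchCV, using a hypothetical parameter grid covering the four hyperparameters named above; the candidate values and the synthetic data are illustrative, not the study's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(5)
# Hypothetical non-linear data in place of the loc1 weather features.
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.1, 300)

# Hypothetical equivalent of get_param_grid: each key is a hyperparameter,
# each list the candidate values RandomizedSearchCV will sample from.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=5,      # sample 5 random combinations (kept small for speed)
    cv=3,          # 3-fold cross-validation per combination
    scoring="r2",
    random_state=42,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

Unlike an exhaustive grid search, the randomized variant samples a fixed number of combinations, which keeps tuning tractable as the grid grows.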
The implementation of the RFR technique, along with cross-validation, yielded the following metrics [34]:
R2 (0.76087): An R2 of 0.76087 signifies that the model accounts for approximately 76.09% of the variance in the target variable, which is a strong result. The model effectively captures most patterns within the data.
Adjusted R2 (0.76060): Adjusted R2 is closely aligned with the R2, indicating that the model is appropriately fitted without superfluous complexity.
MSE (0.01976): The MSE is notably low, signifying that the model’s prediction errors are minimal.
RMSE (0.14057): An RMSE of 0.14057 implies that, on average, the model’s predictions diverge from the actual values by approximately 0.14 units, reflecting commendable performance, particularly relative to the data’s scale.
MAE (0.10439): An MAE of 0.10439 indicates that, on average, the model's predictions deviate by around 0.10 units. Since MAE is less sensitive to outliers than MSE, this suggests the model is consistently producing relatively small errors.
Mean Cross-Validation Results: Cross-validation results provide a better picture of how the model performs across several subsets of the data. In contrast to the single evaluation on the test set, the "mean" values show the average across several folds (splits) of the dataset.
Mean Cross-Validation R2 (0.64517): The average R2 over the cross-validation folds is 0.64517, lower than the test set R2 of 0.76087. This implies that, on average, during cross-validation the model explains roughly 64.5% of the variance, whereas on the test set it explains about 76%. Although this gap suggests some variation in model performance across data subsets, the result is still strong overall.
Mean Cross-Validation Adjusted R2 (0.64509): The average Adjusted R2 across cross-validation is 0.64509, lower than the Adjusted R2 on the test set (0.76060). As with the R2 score, this indicates that although the model may be slightly overfitting the test data relative to its performance on the validation sets, it still generalizes reasonably well.
Mean Cross-Validation MSE (0.02943): The average MSE across the cross-validation folds is 0.02943, higher than the test set MSE (0.01976). This implies the model performed somewhat better on the test data than on the average validation fold, but the difference is not large, suggesting fairly steady performance.
Mean Cross-Validation RMSE (0.17155): The average RMSE over the validation sets is 0.17155, higher than the test RMSE (0.14057). This suggests that the model's errors during cross-validation are somewhat larger than on the test set, though still within a reasonable range.
Mean Cross-Validation MAE (0.13200): The average MAE during cross-validation is 0.13200, higher than the test MAE of 0.10439. The model thus performs well across several data splits but makes somewhat larger errors on the cross-validation folds.
The cross-validation results reveal that the model's performance on different subsets of the data is consistent but somewhat lower. Although the lower cross-validation R2 (0.64517) points to some variation in the model's generalization ability, the test and cross-validation metrics differ only modestly.
Currently, the evaluation of the model using only the most correlated features, namely 'windspeed_10m', 'windspeed_100m', and 'windgusts_10m', yields the following results:
By picking solely the most correlated features, the model may forfeit significant interactions or information offered by less correlated variables. The optimal parameters chosen yielded a less intricate model (fewer trees, reduced depth), which may not adequately depict the previous degree of intricacy. A strong correlation may not necessarily reflect a feature’s complete impact on a model’s performance, particularly in non-linear models such as Random Forests.
3.2.3. Lasso Regression
LASSO regression, also known as Least Absolute Shrinkage and Selection Operator regression, is a commonly employed method for reducing the size of coefficients and choosing variables in regression models. The computationally demanding nature of statistical software is no longer concerning due to developments in processing power and integration. The objective of LASSO regression is to identify the variables and corresponding regression coefficients that minimize the prediction error of the model [
35]. A constraint is imposed on the model parameters to ensure that the total of the absolute values of the regression coefficients is smaller than a predetermined value (λ), hence causing the regression coefficients to be “shrunk” towards zero.
LASSO conducts regression analysis using the equation below:

minimize Σ_{i=1}^{N} (y_i − Σ_j x_ij β_j)^2 subject to Σ_j |β_j| ≤ λ

where N represents the sample size of the observations x_i and y_i, the β_j denote the parameter coefficients, and Σ_j x_ij β_j represents the prediction.
The provided formula can be compacted and represented in Lagrangian form, as illustrated in the equation below [36]:

β̂ = argmin_β { Σ_{i=1}^{N} (y_i − Σ_j x_ij β_j)^2 + λ Σ_j |β_j| }

The equation demonstrates that L1 regularization is the method used in LASSO: L1 regularization incorporates the absolute value of the feature coefficients as a penalty term to regulate the impact of the features.
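The shrinkage behavior can be sketched with scikit-learn's Lasso, where the alpha parameter plays the role of the penalty weight λ; the data is synthetic, with only two of six features relevant, so the L1 penalty should zero out the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
# Hypothetical data: only the first two of six features drive the target.
X = rng.normal(size=(500, 6))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

# alpha controls the L1 penalty: larger values shrink coefficients harder,
# driving the irrelevant ones exactly to zero (automatic feature selection).
lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", lasso.coef_)
```

The relevant coefficients survive (slightly shrunk toward zero), while the four irrelevant ones are eliminated outright, which is what distinguishes LASSO from plain least squares.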
The implementation of the Lasso regression, along with cross-validation, yielded the following metrics [37]:
R2 (0.6110): 61.10% of the variance in the target variable (y) is explained by the Lasso regression model. This is a moderate fit, demonstrating that the model captures a good share of the variability, though there is room for improvement.
Adjusted R2 (0.6108): The Adjusted R2 is quite close to the R2 score (0.6108 vs. 0.6110). This shows that the model's performance does not diminish when accounting for the number of predictors used; since the model is not overfitting with irrelevant variables, the adjusted R2 stays virtually the same as the regular R2.
MSE (0.0319): The low MSE (0.0319) shows the model's predictions are fairly close to the actual values, though some inaccuracies remain.
RMSE (0.1787): An RMSE of 0.1787 suggests that, on average, the predictions are off by around 0.1787 units of the target variable, a noticeable amount of error.
MAE (0.1410): With an MAE of 0.1410, the predictions deviate from the actual values by roughly 0.1410 units on average. This implies fairly small errors, although the RMSE (which penalizes larger errors more heavily) indicates somewhat more variability in the errors.
Cross-Validation Results for Lasso Regression:
Mean R2 (0.6132 ± 0.0533): The average R2 score over the 10 cross-validation folds is 0.6132, rather close to the test set R2 of 0.6110. Although the model's performance fluctuates somewhat across folds, the standard deviation (±0.0533) indicates only moderate variation, suggesting consistency.
Adjusted R2 (0.6108 ± 0.0533): With a mean of 0.6108, the adjusted R2 is also quite consistent; it indicates that the model generalizes effectively over the folds and is not overfitting.
Mean MSE (0.0313 ± 0.0052): The average MSE over the cross-validation folds is 0.0313, close to the test set MSE of 0.0319, with a small standard deviation (±0.0052). The model is therefore not unduly sensitive to particular subsets of the data and performs steadily.
Mean RMSE (0.1769 ± 0.0722): The RMSE from cross-validation (0.1769) is again near the test set RMSE of 0.1787, revealing a comparable average prediction error. The standard deviation (±0.0722), though still reasonable, indicates noticeably more variation in errors between folds than the MSE does.
Mean MAE (0.1399 ± 0.0105): The average cross-validation MAE (0.1399) is close to the test set MAE (0.1410), and the low standard deviation (±0.0105) shows consistent prediction accuracy across subsets.
The cross-validation outcomes closely align with the test set findings, indicating that the model generalizes effectively and is not overfitting the data. The minimal standard deviations for all measures indicate the model’s stability across various data splits.