1. Introduction
Reducing carbon emissions to mitigate climate change and reducing dependence on fossil fuels are priorities for many countries globally. Consequently, investment in the renewable energy sector, and in wind energy in particular, is rising significantly. The Global Wind Energy Council (GWEC) estimates that 680 GW of wind energy will be added to the global energy mix between 2023 and 2027 [1].
Ireland, situated in the Atlantic Ocean, is well placed in European terms of renewable energy capacity [2]. The Irish Government aims to generate around 7 GW through wind energy by 2030; however, estimates suggest that the wind potential in the North Atlantic around Ireland is such that between 30 and 70 GW of energy could be produced [3,4]. Ireland has entered a transition phase in which it aims to phase out fossil-based production in favour of more sustainable sources, and offshore wind is expected to be pivotal in that regard [3].
While Ireland is still in the early days of maritime planning, the recent designation of 8600 km² off the south-east coast [5] will provide ample opportunity to evaluate the procedures and plans set in place by the European Union (EU). This area alone will be used to set up an offshore wind farm with a power generation capacity of 900 MW [5].
Additionally, the government has auctioned off four additional wind farms with a combined capacity of 9 GW and a total investment of €9bn. The winning bids included wind farms off the coast of Dublin (with a capacity of 850 MW) called Dublin Array; off the west coast near Galway (with a capacity of 450 MW) named Sceirde Rocks; off the coasts of Dublin, Louth and Meath (with a capacity of 500 MW) called North Irish Sea Array; and finally, off the Wicklow coast, just south of Dublin (with a capacity of 1450 MW), named Codling Wind Farm.
These wind farms, located off the west, south and east coasts of Ireland, aim to leverage Ireland's strategically advantageous location for wind energy, with the aim of exporting excess energy to the EU [6]. However, a crucial factor in the siting of a wind farm is the estimated annual energy production and how that value compares with the energy produced by other sources [7]. Using accurate, long-term data to make these determinations is therefore critical. Data should be collected at the site of interest for at least two to three years, after which questions about long-term annual variability and annual energy production can be gauged [7]. Wind turbines accumulate damage and fatigue over time; it is therefore important that they are designed to last for the duration of their service life [8], making it vital that long-term complex loading data is available. This prevents over- or under-designing the wind turbines.
Long-term wind resource analysis is conducted using historical data. Studies based on historical data take more than three years' worth of data into account when carrying out long-term analysis, with most research groups leveraging at least 40 years of data [9,10,11,12]. However, such historical data is difficult to acquire for a potential site, since it would have been impossible to identify the site 40 years in advance, when renewable energy was still in its infancy.
Collecting data at potential wind sites for several years before deciding whether a site is suitable can be logistically and financially infeasible. Therefore, data from nearby sites or third parties is regularly sourced to make assessments [13]. This historical data can be real observed data from a neighbouring site or numerically simulated. Kim and Kim [13] used data from Yeosu Airport to carry out a pre-feasibility wind resource assessment for a 30 MW wind farm. The determination that historical data from one site is sufficient to make long-term assumptions about another site needs rigorous analysis [7]. Nelson and Starcher [7] state that, in order to use cross-site data to determine historical trends, the annual hourly linear correlation coefficient between the reference site and off-site data should be at least 0.90. If the two sites do not show similar trends in wind speed, directionality or topography, the correlations will be weak.
Instruments used for measuring wind speed can occasionally suffer from breakdowns and stop recording. These breakdowns cause missing values, which can extend from a few hours to days, as is the case with the data collected by the offshore buoys of the Marine Institute (MI) [14]. MI's buoys have been in service since the early 2000s and as such provide an excellent database of historical data [14] for Ireland's offshore met-ocean conditions. However, many of the deployed buoys record missing values. These missing values can be classified as missing completely at random (MCAR), meaning that the pattern of missing values is completely random and does not depend on any information contained within the dataset [15]. The assumption for MCAR is that the probability of encountering a missing value depends neither on the observed values nor on the unobserved values.
Reanalysis techniques, such as those employed by Copernicus, use data assimilation techniques [11,16] that rely on observed data. Reanalysis provides the most comprehensive climate data at regular intervals over long time periods, often decades. The quality of a reanalysis depends on the data assimilation system itself, which in turn relies on the observed data. Data assimilation is the science of combining different sources of data to determine the state of a system as it evolves over time [17].
Missing data presents a serious challenge to data assimilation [18]. Therefore, imputation to improve assimilation is imperative [19,20]. Sareen et al. [21] argue that short-term wind speed forecasts show poor results when the input has a high number of missing values. When the time series is first imputed and a bi-directional long short-term memory neural network is then used to make predictions, the results show a higher degree of accuracy. Kaur et al. [22] similarly show how artificial neural networks make improved predictions on avalanches when imputation is applied.
Standard imputation techniques such as averaging are not always the most accurate where time series data is concerned, as argued by Steffan et al. [23]. This is because of the nature of time series data, in which structural dependencies exist between past and future values. Steffan et al. used existing time-series packages in R to impute data and concluded that a seasonal Kalman filter and linear interpolation on seasonally (LOESS) decomposed data were the most effective.
Liu et al. [24] used Gaussian process regression (GPR) to develop short-term prediction models to impute wind speed time series. They compared this with mean substitution and k-nearest neighbours (KNN) and concluded that GPR outperforms them. Shukur et al. [25] developed a hybrid artificial neural network and auto-regressive model and reached similar conclusions. They showed that when the time series is nonlinear, the hybrid machine learning technique produces better imputation results than linear regression, KNN and state-space methods. Liao et al. [26] developed a model based on context encoders to handle highly non-linear data. To benchmark their model, they compared it against an auto-encoder, K-means, k-nearest neighbours, a back-propagation neural network, cubic interpolation, and a conditional generative adversarial network, and concluded that the context encoding technique gives better results. Liu et al. [27] used a hybrid convolutional neural network with a bi-directional recurrent neural network to impute spatio-temporal satellite-based aerosol optical depth, and their model is reported to impute missing data with low error.
The present study targets wind time series around Ireland, since Ireland is set to become a major wind energy producer. The study makes use of the deployed MI buoys and focuses on univariate time series imputation. This is particularly important, as the authors wanted to quantify the imputation error in the absence of additional variables such as sea-state conditions.
Buoys do not always have the sensors available to measure wave heights, currents and wave direction; wind speed is sometimes the only data being collected. Thus, wind speed is the only variable considered in this imputation study.
3. Methodology
While differencing the time series can reduce the seasonality and thus the ACF, a few problems arise when using the full time series to train any statistical technique for imputation or, for that matter, forecasting.
It is evident from Figure 3 and Figure A1 that the wind time series data has long-term trends and short-term seasonality that need to be appropriately captured. The ARIMA model is often termed a linear time series analysis tool due to its inability to capture complex non-linear patterns in the dataset [30,31]. Khashei and Bijari [32] also highlighted the failure of ARIMA to capture non-linear patterns and instead turned towards artificial neural networks. Wang et al. [33] echoed these findings and concluded that a hybrid ARIMA and metabolic grey model would be better at capturing long-term non-linear trends.
However, Dong et al. [34] argued that many of the failures of ARIMA can be avoided by using a sliding window approach. They concluded that ARIMA's lack of non-linearity capture is not a concern as long as the sliding training window is carefully selected. Sheoran and Pasari [35] reached the same conclusion, finding that daily and weekly sliding windows with ARIMA outperform modelling the entire time series with conventional ARIMA.
LSTMs suffer from similar, if not entirely the same, problems. While they can retain information about complex non-linear patterns, they struggle to retain this information over longer sequences. Miller and Hardt [36] considered RNNs and LSTMs as dynamical systems and concluded that LSTMs do not actually have long-term memory. Greaves-Tunnell and Harchaoui [37] applied LSTMs to music and language and reached the same conclusion: LSTMs struggle to fully represent the long-memory effect in the input and cannot generate long-memory sequences from white noise inputs. Zhao et al. [38], while reaching the same conclusion as other researchers, proposed a new definition of long memory. They argued that because a time series is not inherently i.i.d., it violates the primary assumption of an ANN that all inputs should be independent of each other. They go on to propose a new Memory-LSTM that attempts to retain long memory.
3.1. Training Dataset
Therefore, a 30-day period covering a full lunar cycle is considered in this study, instead of the entire 20+ years of available data, to avoid some of the problems discussed above. This study will attempt to establish whether LSTMs, with their computational expense and hyperparameter tuning, can outperform conventional ARIMA on a window of data for the imputation of missing values.
Training data was created from this 30-day wind time series with complete hourly data. Artificial hourly gaps were introduced completely at random [39], reducing the hourly availability of the dataset to 90%. Availability is calculated using Equation (1). Consequently, the models are trained on the 90% of available data and tested on the 10% of unseen data.
Additionally, since the raw data showed auto-correlation up to 500 lags (Figure 3), a differenced time series was used.
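A minimal sketch of this preprocessing is given below, assuming an hourly CSV export of the buoy record. The file name wind_speed.csv, the random seed and the 10% gap fraction are illustrative assumptions rather than the authors' exact setup; availability is computed simply as the share of non-missing hours, in the spirit of Equation (1).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical input: a 30-day hourly wind speed series (720 values).
series = pd.read_csv("wind_speed.csv", parse_dates=["time"], index_col="time")["speed"]
series = series.asfreq("h")  # enforce a regular hourly index

# Introduce gaps completely at random (MCAR): delete 10% of the hours.
GAP_FRACTION = 0.10
n_missing = int(round(GAP_FRACTION * len(series)))
missing_idx = rng.choice(series.index, size=n_missing, replace=False)

corrupted = series.copy()
corrupted.loc[missing_idx] = np.nan

# Availability = share of hours that still hold a valid observation.
availability = corrupted.notna().mean() * 100
print(f"Availability: {availability:.1f}%")  # ~90%

# First-difference the series to reduce the strong autocorrelation.
differenced = corrupted.diff()
```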
3.2. ARIMA
The ARIMA model is fitted to the time series data using the R package forecast [40]. ARIMA combines autoregressive features with those of moving averages. An AR(1), an autoregressive model of order one, signifies that the current value is determined only by the immediately preceding value, while an AR(2) means that the current value is based on the previous two values. The moving average component, on the other hand, analyses data points by smoothing different subsets of the dataset to remove the influence of outliers. An MA(1), a moving average of order one, truncates after a lag of one: the auto-covariance function drops to zero after lag one. An MA($q$) would therefore mean that the auto-covariance drops to zero after $q$ lags. The I in ARIMA stands for Integrated, which denotes the number of times the series has to be differenced to achieve stationarity. The auto-regressive (AR) model of order $p$, denoted AR($p$), is given by the equation:
$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \dots + \phi_p X_{t-p} + \varepsilon_t$$
where:
$X_t$ is the value of the time series at time $t$,
$c$ is a constant term (often omitted),
$\phi_1, \dots, \phi_p$ are the parameters of the model,
$X_{t-1}, \dots, X_{t-p}$ are the lagged values of the time series,
$\varepsilon_t$ is the white noise error term at time $t$.
The Moving Average (MA) model of order $q$, denoted MA($q$), is given by the equation:
$$X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q}$$
where:
$X_t$ is the value of the time series at time $t$,
$\mu$ is the mean of the time series (often assumed to be zero),
$\varepsilon_t$ is the white noise error term at time $t$,
$\theta_1, \dots, \theta_q$ are the parameters of the model,
$\varepsilon_{t-1}, \dots, \varepsilon_{t-q}$ are the lagged error terms.
auto.arima, another one of the utilities within the package, iterates over several different combinations of the p, d and q parameters and settles on the combination with the lowest AIC.
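The study itself performs this search with the R forecast package. Purely as a rough Python analogue, the sketch below uses pmdarima's auto_arima for the AIC-based order search and statsmodels' SARIMAX, whose Kalman filter tolerates missing observations, to produce in-sample predictions for the gapped hours; this imputation mechanism is an illustrative assumption, not the authors' exact procedure.

```python
import pmdarima as pm
from statsmodels.tsa.statespace.sarimax import SARIMAX

# 'corrupted' is the hourly series with NaN gaps from the previous sketch.
observed = corrupted.dropna()

# Search (p, d, q) combinations and keep the model with the lowest AIC,
# mirroring what auto.arima does in the R forecast package.
search = pm.auto_arima(observed, seasonal=False, stepwise=True,
                       information_criterion="aic", suppress_warnings=True)
p, d, q = search.order
print(f"Selected order: ({p}, {d}, {q})")

# Refit on the gappy series: SARIMAX's Kalman filter handles NaNs directly,
# so in-sample predictions can stand in as imputed values.
model = SARIMAX(corrupted, order=(p, d, q)).fit(disp=False)
imputed = corrupted.fillna(model.predict(start=corrupted.index[0],
                                         end=corrupted.index[-1]))
```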
3.3. LSTM
LSTMs [41] are a form of RNN. The Long Short-Term Memory (LSTM) architecture was created to solve the vanishing gradient issue of conventional RNNs. LSTMs are extensively employed in a wide range of sequence generation and prediction tasks, including speech recognition, time series forecasting, natural language processing, and more. LSTMs are designed to capture long-term dependencies in sequential data by keeping a memory cell that has the capacity to hold information for extended periods of time.
An LSTM’s memory cell is its central component; it enables LSTMs to selectively forget or retain knowledge over time. The cell state can be thought of as the network’s "memory". To regulate the flow of information through the memory cell, LSTMs employ three different gates. The forget gate decides what information to discard from the cell state; it takes as input the concatenation of the current input and the previous hidden state, and its activation function is the sigmoid function. The input gate determines which new information to store in the cell state; it also takes as input the concatenation of the current input and the previous hidden state. The input gate uses the sigmoid function to regulate which values will be updated, and the hyperbolic tangent function to create a vector of new candidate values. The output gate controls how much of the cell state is exposed as the hidden state.
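For reference, a commonly used formulation of the three gates and the cell state update is:
$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate values)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)}\\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$
Here $\sigma$ is the sigmoid function, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ denotes element-wise multiplication.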
LSTMs can be further enhanced by making the network stateful [42]. This is particularly helpful in time series modelling where long-term structures exist in the data. A stateful LSTM is a type of RNN architecture that is capable of capturing long-term dependencies in sequential data while also maintaining an internal state or memory. Unlike its counterpart, the stateless LSTM, which resets its internal state after processing each sequence, the stateful LSTM retains its state across multiple sequences within a given batch of data. This allows the model to remember information from previous sequences and use it to make predictions or generate outputs for subsequent sequences.
The LSTM architecture used for this study is shown in Figure 5. It is a deep neural network with five LSTM layers, followed by one FFN layer with eight neurons and a final FFN layer connecting the output of the LSTMs to the output layer. The activation function used for the penultimate layer is ReLU, with a linear function used in the final layer before the output. To reduce overfitting, dropout rates ranging from 0 to 0.05 were employed. Several hyperparameter simulations were performed to identify the optimum set of parameters.
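A sketch of such a network in Keras is given below. The text does not specify the LSTM width, batch size, window length or dropout placement, so the 32 units per layer, the batch size of one, the single dropout layer and the epoch loop are placeholder assumptions for illustration.

```python
from tensorflow.keras import layers, models

# Placeholder shapes: 32 units, a batch size of 1 and a single lag feature
# are illustrative assumptions, not values reported in the study.
BATCH_SIZE, TIMESTEPS, FEATURES = 1, 1, 1

model = models.Sequential([
    layers.Input(batch_shape=(BATCH_SIZE, TIMESTEPS, FEATURES)),
    # Five stacked stateful LSTM layers, following the description of Figure 5.
    layers.LSTM(32, stateful=True, return_sequences=True),
    layers.LSTM(32, stateful=True, return_sequences=True),
    layers.LSTM(32, stateful=True, return_sequences=True),
    layers.LSTM(32, stateful=True, return_sequences=True),
    layers.LSTM(32, stateful=True),          # last LSTM returns a vector
    layers.Dropout(0.05),                    # dropout within the stated 0-0.05 range
    layers.Dense(8, activation="relu"),      # penultimate FFN layer with eight neurons
    layers.Dense(1, activation="linear"),    # linear output layer
])
model.compile(optimizer="adam", loss="mse")

# Stateful training: the internal state carries across batches and must be
# reset manually between epochs, e.g. in tf.keras:
# for epoch in range(n_epochs):
#     model.fit(X, y, batch_size=BATCH_SIZE, epochs=1, shuffle=False)
#     model.reset_states()
```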
3.4. Feature Selection
Feature selection for the LSTMs was done by calculating the PACF of the dataset. Figure 4 shows the PACF of the time series. For LSTM modelling, one time step will be used as the primary feature. Mathematically, this means that the feature chosen is $x_{t-1}$ to predict or impute the $x_t$ value. However, to safely conclude that this is indeed the right decision, models with up to two lags will also be used to predict the output state. Mathematically, up to two features will be used as inputs to the LSTM model: $x_{t-1}$ and $x_{t-2}$.
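A minimal sketch of this lag-based feature construction, assuming the differenced series from Section 3.1 (the helper name and the reshaping for Keras are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def make_lagged_features(series: pd.Series, n_lags: int):
    """Build (X, y) pairs where X holds the previous n_lags values
    and y is the current value, dropping rows with missing entries."""
    frame = pd.DataFrame({"y": series})
    for lag in range(1, n_lags + 1):
        frame[f"lag_{lag}"] = series.shift(lag)
    frame = frame.dropna()
    X = frame[[f"lag_{lag}" for lag in range(1, n_lags + 1)]].to_numpy()
    y = frame["y"].to_numpy()
    return X, y

# One-lag model (x_{t-1} -> x_t) and, for comparison, a two-lag model.
X1, y1 = make_lagged_features(differenced, n_lags=1)
X2, y2 = make_lagged_features(differenced, n_lags=2)

# Keras LSTMs expect inputs shaped (samples, timesteps, features).
X1 = X1.reshape(-1, 1, 1)
X2 = X2.reshape(-1, 1, 2)
```

The two variants can then be trained and compared directly, since only the input width differs.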
3.5. Hyperparameter Tuning
The LSTMs were trained on several different configurations to identify the combination that produces the lowest MSE. The following parameters were adjusted and compared:
4. Results
The parameters selected for fitting ARIMA to the dataset were (2, 2, 0) for (p, d, q). It is worth noting that the lowest AIC was obtained with a twice-differenced model with AR(2). The MSE for the values imputed by this model is 0.64.
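The error metric used here is the mean squared error computed over the artificially deleted hours only. A minimal sketch, reusing the illustrative variable names from the earlier snippets (series, imputed and missing_idx):

```python
import numpy as np

# Evaluate imputation only at the artificially deleted hours.
true_vals = series.loc[missing_idx]   # ground-truth values before deletion
pred_vals = imputed.loc[missing_idx]  # values filled in by the model
mse = np.mean((true_vals - pred_vals) ** 2)
print(f"MSE on imputed hours: {mse:.2f}")
```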
Figure 6 shows the PDF of both the originally deleted values and the imputed values. It can be observed that the values were deleted at random, with the highest number of deletions near the middle of the distribution. Coincidentally, the mean of the first four weeks of the time series is 7.007 m/s. Since this was a random deletion, values were deleted on both sides of the mean. The lowest deleted speed was 2.05 m/s and the highest was 12.37 m/s. The ARIMA model's PDF is interesting: contrary to the PDF of the original series, it shows the highest density of imputations around the mean, even though the actual values are spread quite evenly across the entire range of the series.
Figure 7 confirms this hypothesis as well. Missing values near the mean were imputed rather accurately, as denoted by the line $y = x$. The nearer a data point is to this line, the smaller the error: points on the line show 100% accuracy, while points above or below it are inaccurate, with the vertical distance from the line indicating the degree of inaccuracy. It is worth noting that extreme values on either side of the mean show rather poor imputation accuracy. Values below the mean are over-predicted and values above the mean are under-predicted. This is because of the inherent tendency of the statistical model to stay true to the mean of the time series.
Figure 8 shows the PDF of the results obtained through LSTM imputation plotted against the true values. Interestingly, and unlike the ARIMA imputation, this is relatively more spread out across the range of the time series. While ARIMA formed a strong aggregation with a higher density around the mean, the LSTM imputation is spread far more evenly. This, however, is observed only up to 11 m/s; beyond this speed, the LSTM fails to predict values of an equivalent magnitude. On the other hand, the lower extreme shows a much higher density of predictions. It can be argued that the LSTM does a relatively good job of identifying the lower end of the range, better than it does at the opposite end of the series.
Figure 9 shows a visual comparison of the imputed values with the true values. The line $y = x$ shows where the true and predicted values are equal; the further away the marks are from this line, the greater the error. As with the PDF plot, near the median the cluster of points stays very close to the line, while on either side of the median there is greater divergence. Similar to the ARIMA imputation results in Figure 7, values below the mean are more often than not over-predicted, while values above the mean are under-predicted. This trend is observed quite regularly for values above 11 m/s, where all the values are under-predicted and lie significantly further from the true values, as indicated by the line.
5. Conclusions
The goal of reaching Net Zero emissions has raised interest in offshore wind energy and established it as a vital alternative energy source. Ireland aims to produce over 7 GW by 2030, with the capacity to produce more than 30 gigawatts of electricity from offshore wind alone. However, determining the long-term feasibility of these sites requires precise evaluations of the wind resources.
However, the buoys used to collect the data are not always reliable and can malfunction or provide inaccurate readings, leaving large gaps in the data. We used 20 years of wind time series data from Ireland's Marine Institute for our investigation, which revealed significant gaps in the dataset. Data imputation techniques were used to fill these gaps as accurately as feasible in order to remedy this issue.
We investigated and compared the data imputation performance of the ARIMA and LSTM approaches. LSTMs outperform ARIMA by a small margin, with a mean squared error of 0.45 as opposed to 0.60 for ARIMA. It can be argued that the long compute times and the hyperparameter tuning of LSTMs are not worth a marginal improvement in imputation accuracy, and therefore ARIMA should suffice for hourly imputation. The PDFs of the imputations through ARIMA and LSTMs are also inconclusive. While the LSTM does relatively better at predicting speeds further away from the mean, ARIMA registers a lower error for imputations near the mean. LSTMs can therefore be preferred if the goal is to better capture the extremes on either side of the mean.
Author Contributions
Conceptualization, V.P. and G.S.; methodology, G.S. and V.P.; software, G.S.; validation, G.S.; formal analysis, G.S.; investigation, G.S., V.P. and A.M.; resources, V.P. and G.S.; data curation, G.S. and V.P.; writing—original draft preparation, G.S.; writing—review and editing, G.S., V.P. and A.M.; visualization, G.S.; supervision, G.S., V.P. and A.M.; project administration, V.P. and A.M.; funding acquisition, V.P. and G.S. All authors have read and agreed to the published version of the manuscript.