This study applied several calibration methods to NOAA NCEP GEFSv12 reforecasts to improve the prediction of summer daily Tmax and the associated Tmax extremes over Taiwan. The performance of these methods was evaluated using standard skill metrics for deterministic and ensemble probabilistic forecasts. The results are discussed in the following subsections.
3.1. Prediction Skill of Raw, QQ, ANN, and Hybrid Post-Processing Methods for Summer Daily Tmax over Taiwan
The performance of the Raw forecasts and the three calibration methods (QQ, ANN, and Hybrid) in predicting summer (JJAS) daily Tmax over Taiwan for 2000-2019 was evaluated against ERA5 by analyzing the spatial patterns of the climatological mean at forecast lead times of Day-1, 5, 10, and 15 (Figure 2). Raw-GEFSv12 and ERA5 show similar spatial patterns of summer daily Tmax over Taiwan at all forecast lead times. However, GEFSv12 has a warm bias in most parts of the country. The daily Tmax from ERA5 is lower in the east and increases towards the west, which is also seen in GEFSv12. The highest summer Tmax is found in the southernmost region of Taiwan, and the GEFSv12 forecasts reproduce this at all lead times. All calibration methods notably reduced the warm bias in most parts of Taiwan, resulting in a Tmax climatological mean similar to ERA5 at all forecast lead times.
The spatial patterns of the IAV of summer Tmax over Taiwan from GEFSv12 and ERA5 are similar at all forecast lead times (Figure 3). GEFSv12 tends to overestimate the IAV of summer Tmax in most parts of the country at all forecast lead times. The IAV of Tmax is higher in the northeastern part of the country, which the GEFSv12 forecasts capture at all lead times. All three calibration methods reduced the overestimation of Tmax IAV over Taiwan, and the resulting spatial patterns are similar to those of ERA5 at all forecast lead times. The ANN method slightly underestimated the IAV of Tmax in most parts of the country, while the QQ and Hybrid methods captured its magnitude at all forecast lead times. The Hybrid method captures the Tmax IAV over Taiwan better than the QQ method, especially at longer lead times (Figure 3).
The QQ method has the advantage of adjusting the forecast Tmax probability distribution to that of the observed data, particularly in the extreme tails, which allows it to account for IAV. It improves the spatial patterns; however, the temporal patterns remain unchanged. Combining deep learning with the QQ method is effective in capturing temporal patterns, IAV, and climatological patterns, and the Hybrid method is more successful than either the QQ or the ANN method alone.
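To make the distribution adjustment concrete, the following is a minimal sketch of empirical quantile mapping on synthetic data. It is not the implementation used in this study: the function name `quantile_map`, the choice of 99 quantiles, and the synthetic warm-biased series are all assumptions made for illustration.

```python
import numpy as np

def quantile_map(values, fcst_train, ref_train, n_quantiles=99):
    """Empirical quantile mapping: move forecast values onto the reference distribution."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    fcst_q = np.quantile(fcst_train, probs)  # quantiles of the training forecasts
    ref_q = np.quantile(ref_train, probs)    # quantiles of the reference (e.g. ERA5)
    # Piecewise-linear, monotone transfer function between the two quantile sets.
    # np.interp clamps values outside the training quantile range to the end points.
    return np.interp(values, fcst_q, ref_q)

# Synthetic illustration only: a reference series and a forecast with a warm bias.
rng = np.random.default_rng(0)
ref_train = rng.normal(30.0, 1.5, 5000)
fcst_train = ref_train + 0.8 + rng.normal(0.0, 0.5, 5000)

corrected = quantile_map(fcst_train, fcst_train, ref_train)
raw_bias = fcst_train.mean() - ref_train.mean()
corrected_bias = corrected.mean() - ref_train.mean()
print(f"raw bias {raw_bias:.2f}, corrected bias {corrected_bias:.2f}")
```

Because the transfer function is monotone, the ranking of forecast days is left unchanged. This is one way to read the statement that QQ adjusts the distribution while the temporal patterns remain the same.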
The Raw-GEFSv12 forecasts show a high RMSE in predicting summer daily Tmax in the eastern parts of Taiwan at all forecast lead times. The RMSE patterns are similar to the IAV patterns, with higher values in high-IAV regions, and the RMSE increases with lead time. All three calibration methods reduced the RMSE in most parts of Taiwan at all forecast lead times. The RMSE of the QQ method is higher at longer lead times, while the ANN and Hybrid methods show significant improvements. Overall, the RMSE of the ANN and Hybrid methods is lower than that of the QQ method at all forecast lead times, particularly in the eastern parts of the country (Figure 4).
GEFSv12 shows a high Index of Agreement (IOA) (> 0.8) for predicting summer daily Tmax in northwestern Taiwan, with lower values (> 0.5) in the southeast (Figure 5). However, the IOA is lower in the central part of the country at all forecast lead times, and it generally decreases with increasing forecast lead time in most areas. The calibration methods significantly improved the IOA of predicted Tmax over Taiwan at all forecast lead times. The ANN method has an IOA range of 0.7 to 1, which is higher than the QQ range of 0.5 to 1. The accuracy of the ANN forecasts of Tmax is significantly higher in all parts of Taiwan at longer lead times, whereas the IOA from QQ decreases with increasing lead time, mainly due to larger forecast errors. The Hybrid method has a higher IOA (0.8-1) than the other two methods, making it the most reliable for predicting summer daily Tmax across the majority of the country at all forecast lead times (
Figure 5).
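For reference, the two deterministic metrics discussed above can be sketched as follows. The section does not state which form of the IOA was used, so the sketch assumes the original Willmott index of agreement; the function names and the five-value toy series are hypothetical.

```python
import numpy as np

def rmse(pred, obs):
    """Root-mean-square error between forecast and reference values."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def index_of_agreement(pred, obs):
    """Index of agreement (assumed Willmott form): 1 is perfect, 0 is no agreement."""
    pred, obs = np.asarray(pred, float), np.asarray(obs, float)
    obs_mean = obs.mean()
    squared_error = np.sum((pred - obs) ** 2)
    # Potential error: distances of forecast and reference from the reference mean.
    potential_error = np.sum((np.abs(pred - obs_mean) + np.abs(obs - obs_mean)) ** 2)
    return float(1.0 - squared_error / potential_error)

# Toy series (deg C): a forecast with a constant 1 deg C warm bias.
obs = np.array([30.0, 31.0, 32.0, 33.0, 34.0])
biased = obs + 1.0

rmse_biased = rmse(biased, obs)
ioa_biased = index_of_agreement(biased, obs)
ioa_perfect = index_of_agreement(obs, obs)
print(rmse_biased, round(ioa_biased, 3), ioa_perfect)
```

Unlike the correlation coefficient, this index is penalized by a constant bias, so a bias correction alone can raise the IOA without changing the correlation.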
The performance of the Raw forecasts and the three calibration methods in predicting Tmax over Taiwan for the reforecast period was evaluated using RMSE, Mean Bias, Correlation Coefficient, and Index of Agreement (Figure 6). The RMSE increased with increasing forecast lead time. The highest RMSE was observed for the Raw forecasts, ranging from 1.5 to 2.5℃. The calibration methods, QQ (0.8-1.2℃), ANN (0.6-1℃), and Hybrid (0.6-1℃), significantly reduced the RMSE at all forecast lead times (Figure 6a). ANN and Hybrid have similar RMSE values, which are much lower than those of QQ at all forecast lead times (Figure 6b). The warm bias of 0.6-1℃ over Taiwan during the summer season was reduced to nearly 0℃ by all calibration methods. The correlation of GEFSv12 with summer daily Tmax over Taiwan is strong for Day-1 forecasts (r > 0.8) and decreases to r = 0.4 with increasing lead time (Figure 6c). The QQ method showed no improvement in the correlation coefficient relative to the Raw forecasts. In contrast, the ANN and Hybrid calibration methods both significantly improved the correlation coefficient (r > 0.79) at all forecast lead times, with the Hybrid method yielding the same values as the ANN. The improvement from both methods is particularly notable at longer lead times (Figure 6c). The IOA of GEFSv12 in predicting Tmax over Taiwan is highest at shorter lead times (0.8) and decreases to 0.6 as the forecast lead time increases (Figure 6d). All the calibration methods improve the IOA at all forecast lead times. The Hybrid method shows the highest IOA (0.92), compared with the ANN (0.88) and QQ (0.9), and its IOA exceeds that of both the ANN and QQ at all forecast lead times (Figure 6d). This improvement in accuracy is especially beneficial for longer lead time forecasts, which can be immensely helpful for climate management in various sectors in regions such as Taiwan.
The probability density function (PDF) of summer daily Tmax over Taiwan was calculated by pooling the daily Tmax values from all 5 ensemble members and all grid points over Taiwan for ERA5, Raw, and each calibration method, over the study period and at selected forecast lead times (Day-1, 5, 10, and 15). The results are shown in Figure 7. The PDF of summer daily Tmax from Raw is skewed to the right relative to ERA5 at all forecast lead times. This indicates that the number of extreme days with higher Tmax is larger in the Raw data than in ERA5. The calibration methods adjusted the PDF of summer daily Tmax over Taiwan well towards that of ERA5 at all forecast lead times. The QQ method was more effective than the ANN, and the Hybrid method was the most effective, outperforming both QQ and ANN in matching the summer daily Tmax PDF to ERA5.
3.2. Statistical Categorical Skill Scores for Summer Daily Tmax Extremes over Taiwan from Raw, QQ, ANN, and Hybrid Methods
Statistical skill scores (e.g., POD, FAR, ACC, SR, TS, ETS) were computed for the 2000-2019 reforecast period for Taiwan's summer daily Tmax extreme days (Tmax > 90th percentile of annual Tmax) from Day-1 to Day-16. The ETS of GEFSv12 for summer daily Tmax extremes is higher in coastal areas than in interior regions of Taiwan (Figure 8). All calibration methods improved the ETS for summer daily Tmax extremes over Taiwan at all forecast lead times. For Raw and all three calibration methods, the ETS decreases with increasing forecast lead time. The ANN method yields a higher ETS than the QQ calibration method, and the Hybrid method yields a higher ETS than both ANN and QQ at all forecast lead times (
Figure 8).
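The categorical scores listed above all derive from a 2x2 contingency table of forecast and observed extreme days. The sketch below uses the standard definitions; the function name and the ten-day boolean series are hypothetical, and in practice the events would be obtained by thresholding Tmax at the 90th percentile.

```python
import numpy as np

def categorical_scores(fcst_event, obs_event):
    """Categorical skill scores from boolean forecast/observed event series."""
    f = np.asarray(fcst_event, bool)
    o = np.asarray(obs_event, bool)
    hits = np.sum(f & o)
    false_alarms = np.sum(f & ~o)
    misses = np.sum(~f & o)
    correct_neg = np.sum(~f & ~o)
    n = hits + false_alarms + misses + correct_neg
    # Hits expected by chance, used by the Equitable Threat Score.
    hits_random = (hits + false_alarms) * (hits + misses) / n
    far = false_alarms / (hits + false_alarms)
    # Note: scores are undefined if no event is forecast or observed.
    return {
        "POD": hits / (hits + misses),
        "FAR": far,
        "SR": 1.0 - far,
        "TS": hits / (hits + false_alarms + misses),
        "ETS": (hits - hits_random) / (hits + false_alarms + misses - hits_random),
        "BIAS": (hits + false_alarms) / (hits + misses),
        "ACC": (hits + correct_neg) / n,
    }

# Hypothetical ten-day sample: True marks an extreme day.
obs_event = np.array([1, 1, 0, 0, 0, 0, 1, 0, 0, 0], dtype=bool)
fcst_event = np.array([1, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=bool)

scores = categorical_scores(fcst_event, obs_event)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A frequency bias above 1, as in this toy sample, means that extreme days are forecast more often than they are observed.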
The ETS for the Week-1, Week-2, and Week-1 to 2 scales was further analyzed. The Hybrid method had the highest ETS at all forecast scales. The ETS for predicting summer daily Tmax extremes over Taiwan from GEFSv12 is higher for Week-1 than Week-2, as seen in Figure 9. The ETS from GEFSv12 for the two-week period (Week-1 to Week-2) is higher than the ETS of Week-1 and Week-2 for predicting summer daily Tmax extremes in Taiwan. All three calibration methods improve the ETS for summer daily Tmax extremes for Week-1, Week-2, and Week-1 to 2. The ETS of summer daily Tmax from all three methods is higher for Week-1 than for Week-2 and Week-1 to Week-2. In most parts of Taiwan, the ETS from ANN for Week-1, Week-2, and Week-1 to 2 is relatively higher than that from the QQ calibration method, and the Hybrid method yielded notably higher ETS than both the ANN and QQ calibration methods.
A performance diagram is a graphical representation of multiple skill scores, such as POD, Frequency Bias, TS, and SR (1-FAR), which can be used to compare and analyze performance [62]. Figure 10a shows that the GEFSv12 model overestimates the number of summer daily Tmax extreme days over Taiwan at all forecast lead times, with a Frequency Bias of more than 1.5 and a POD ranging from 0.6 to 0.8. The SR and TS of GEFSv12 decrease with increasing forecast lead time. The three calibration methods effectively reduce the overestimation of daily Tmax extremes over Taiwan at all forecast lead times. With all three calibration methods, the POD of summer daily Tmax extremes over Taiwan decreased at all forecast lead times; however, the QQ method showed higher POD values than ANN for longer lead time forecasts.
The ANN model yields higher SR and TS values than the QQ method for all forecast lead times. Both the QQ and Hybrid calibration methods are able to accurately reproduce the number of summer Tmax extreme days observed in ERA5. However, the Hybrid method outperforms the other two methods in terms of POD, SR, and TS skill scores. This suggests that the Hybrid method could be beneficial for extended-range time-scale predictions.
The comparison of GEFSv12 with the three calibration methods for Week-1, Week-2, and Week-1 to 2 revealed that GEFSv12 substantially overestimates the number of summer Tmax extreme days (Figure 10b). All three calibration methods reduced the overestimation, and the Hybrid method showed the highest statistical categorical skill scores. The skill scores from Raw, QQ, ANN, and Hybrid calibration methods were generally higher for Week-1 and Week-1 to 2 than for Week-2 (Figure 10b). This suggests that the GEFSv12 summer Tmax extreme-day forecasts are not reliable without calibration. The Hybrid method was the most effective in improving the skill scores at all forecast scales, which makes it a valuable tool for climate risk management in the region.
3.3. Probabilistic Prediction Skill Scores of Raw, QQ, ANN, and Hybrid Methods for Summer Daily Tmax Extremes
The ensemble probabilistic forecasts of summer Tmax extremes over Taiwan were assessed using Reliability, Resolution, Brier Score (BS), Brier Skill Score (BSS), and ROC curves. The GEFSv12 probabilistic forecast of summer Tmax extreme days over Taiwan has good reliability (< 0.15) at all forecast lead times, as shown in Figure 11a. This was further improved by the three calibration methods (< 0.05). For Raw and all three calibration methods, the reliability degrades with increasing lead time; the ANN and Hybrid methods showed the best reliability, particularly for longer lead time forecasts. The resolution of the GEFSv12 probabilistic forecasts of summer Tmax extreme days over Taiwan is higher at shorter lead times and decreases with increasing forecast lead time (Figure 11b). All three calibration techniques significantly improved the resolution of the ensemble probabilistic forecast of summer Tmax extreme days over Taiwan at all forecast lead times. The ANN and Hybrid methods showed the highest resolution, especially at longer lead times, and the Hybrid method has a relatively better resolution than the ANN at all forecast lead times (Figure 11b). The Brier Score (BS) measures the accuracy of probabilistic forecasts of a binary (yes/no) event; the ideal score is 0. According to Figure 11c, the accuracy of the GEFSv12 ensemble probabilistic forecasts of summer Tmax extreme days over Taiwan is low (BS > 0.25) at all forecast lead times. The calibration methods were highly effective in improving the accuracy of these forecasts (BS < 0.2). Specifically, the ANN and Hybrid calibration methods showed higher accuracy than the QQ method, and the Hybrid method produced results similar to those of the ANN at all forecast lead times (Figure 11c). With a BSS of less than -0.4, the GEFSv12 ensemble probabilistic forecasts of summer Tmax extreme days over Taiwan were less accurate than the climatological/random reference forecast at all forecast lead times (Figure 11d). The QQ, ANN, and Hybrid calibration methods improved the BSS remarkably at all forecast lead times. The QQ method was more accurate than the reference forecast for lead times of up to one week; after the first week, the QQ ensemble probabilistic forecasts of summer Tmax extreme days over Taiwan were less accurate than the reference forecast. The ANN and Hybrid methods outperformed both the reference forecast and QQ at all forecast lead times (Figure 11d), and the Hybrid method was more effective than the ANN for predicting extreme summer Tmax days over Taiwan at all forecast lead times.
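The relationship between the Brier Score, its reliability and resolution terms, and the Brier Skill Score can be illustrated with a short sketch. It assumes the standard Murphy decomposition and a climatological reference forecast, which the section does not spell out; the function name, the ten-day sample, and the member counts are hypothetical.

```python
import numpy as np

def brier_decomposition(prob, event):
    """Brier score with reliability, resolution, uncertainty, and skill score."""
    prob = np.asarray(prob, float)
    event = np.asarray(event, float)
    n = prob.size
    base_rate = event.mean()  # climatological event frequency
    reliability = 0.0
    resolution = 0.0
    # Bin by each distinct forecast probability (k/5 for a 5-member ensemble).
    for p in np.unique(prob):
        in_bin = prob == p
        obs_freq = event[in_bin].mean()
        reliability += in_bin.sum() * (p - obs_freq) ** 2
        resolution += in_bin.sum() * (obs_freq - base_rate) ** 2
    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1.0 - base_rate)
    bs = float(np.mean((prob - event) ** 2))
    # Skill relative to always forecasting the base rate; negative means worse.
    bss = 1.0 - bs / uncertainty
    return {"BS": bs, "REL": reliability, "RES": resolution,
            "UNC": uncertainty, "BSS": bss}

# Hypothetical sample: observed extreme days and the number of members
# (out of 5) that exceed the extreme threshold on each day.
event = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
members_exceeding = np.array([4, 1, 0, 5, 2, 0, 1, 3, 1, 0])
prob = members_exceeding / 5.0

result = brier_decomposition(prob, event)
climatology = brier_decomposition(np.full(event.size, event.mean()), event)
print({k: round(v, 3) for k, v in result.items()})
```

In this decomposition a smaller reliability term and a larger resolution term both lower the Brier Score, which matches the direction of improvement described for Figure 11a and Figure 11b.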
As a final diagnostic, we use the ROC curve to assess a model's ability to distinguish between events and non-events (Wilks, 2011). Here, the event is the occurrence of a summer Tmax extreme day. The ROC curve plots the true positive rate (the fraction of observed Tmax extreme days that were correctly predicted) against the false positive rate (the fraction of non-extreme days that were incorrectly predicted as Tmax extreme days). We calculate the true positive rate and false positive rate for probability thresholds ranging from 0% to 100%, in increments of 10%. A skillful forecasting model should have a higher true positive rate than false positive rate, resulting in a ROC curve that bows towards the top-left corner of the plot. Conversely, the ROC curve of a forecast system with no skill would be a straight line along the diagonal, indicating that the forecast is no better than a random guess. The AUC (Area Under the Curve) is a useful scalar measure for summarizing the performance of a model, with a score of 1 indicating the highest level of skill and a score of 0.5, corresponding to the diagonal, indicating no skill. The ROC curves for Raw, QQ, ANN, and Hybrid calibration methods for ensemble probabilistic forecasting of summer Tmax extreme days over Taiwan are all above the diagonal line at all forecast lead times, as shown in
Figure 12.
Raw and all three calibration methods for ensemble probabilistic forecasting of summer Tmax extreme days over Taiwan have a satisfactory AUC skill score (> 0.65) at all forecast lead times, although the AUC decreases with increasing forecast lead time. The Hybrid calibration method yielded the highest AUC skill score (0.79-0.85), followed by ANN (0.75-0.83), QQ (0.68-0.81), and Raw (0.65-0.74). Overall, the three calibration methods significantly improved the accuracy of GEFSv12 in forecasting extreme summer Tmax days in Taiwan, and the Hybrid calibration method was more effective than the QQ and ANN techniques for ensemble probabilistic forecasting of summer Tmax extreme days over Taiwan on the extended-range time scale.
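The ROC construction described above, with probability thresholds from 0% to 100% in 10% steps, can be sketched as follows. The function names, the trapezoidal integration, and the ten-day sample are assumptions for illustration and are not taken from the study.

```python
import numpy as np

def roc_curve_points(prob, event, thresholds=None):
    """False and true positive rates for a set of probability thresholds."""
    prob = np.asarray(prob, float)
    event = np.asarray(event, bool)
    if thresholds is None:
        thresholds = np.arange(11) / 10.0  # 0%, 10%, ..., 100%
    fpr, tpr = [0.0], [0.0]  # end point where a warning is never issued
    for t in thresholds:
        # Small tolerance so that e.g. a probability of 0.6 meets the 60% threshold.
        warn = prob >= t - 1e-9
        tpr.append(np.sum(warn & event) / np.sum(event))
        fpr.append(np.sum(warn & ~event) / np.sum(~event))
    fpr, tpr = np.array(fpr), np.array(tpr)
    order = np.lexsort((tpr, fpr))  # sort by FPR, then by TPR
    return fpr[order], tpr[order]

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical sample: observed extreme days and 5-member exceedance counts.
event = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=bool)
prob = np.array([4, 1, 0, 5, 2, 0, 1, 1, 3, 0]) / 5.0

fpr, tpr = roc_curve_points(prob, event)
auc = auc_trapezoid(fpr, tpr)
print(f"AUC = {auc:.3f}")
```

With this construction a constant forecast probability gives an AUC of 0.5 and a forecast that separates events from non-events perfectly gives an AUC of 1.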