Version 1: Received: 23 October 2024 / Approved: 23 October 2024 / Online: 23 October 2024 (15:46:06 CEST)
Version 2: Received: 24 October 2024 / Approved: 24 October 2024 / Online: 24 October 2024 (15:29:56 CEST)
How to cite:
Pan, P.; Srirenganathan Malarvizhi, A.; Yang, C. Data Augmentation Strategies for Improved PM2.5 Forecasting Using Transformer Architectures. Preprints 2024, 2024101853. https://doi.org/10.20944/preprints202410.1853.v2
APA Style
Pan, P., Srirenganathan Malarvizhi, A., & Yang, C. (2024). Data Augmentation Strategies for Improved PM2.5 Forecasting Using Transformer Architectures. Preprints. https://doi.org/10.20944/preprints202410.1853.v2
Chicago/Turabian Style
Pan, P., Anusha Srirenganathan Malarvizhi, and Chaowei Yang. 2024. "Data Augmentation Strategies for Improved PM2.5 Forecasting Using Transformer Architectures." Preprints. https://doi.org/10.20944/preprints202410.1853.v2
Abstract
Breathing in fine particulate matter with a diameter of less than 2.5 µm (PM2.5) greatly increases an individual’s risk of cardiovascular and respiratory diseases. As climate change progresses, extreme weather events, including wildfires, are expected to become more frequent, exacerbating air pollution. The 2023 Canadian wildfires highlighted the growing threat of PM2.5 as smoke spread across U.S. cities such as New York, Philadelphia, and Washington, D.C. This research investigates the application of data augmentation techniques to improve the accuracy of PM2.5 concentration forecasts in these urban environments. Models trained on imbalanced datasets often struggle to capture extreme pollution events and underestimate high PM2.5 levels because training is dominated by the more frequent, low-value samples. To address this, we implemented cluster-based undersampling and trained transformer models using two cutoff thresholds (12.1 µg/m³ and 35.5 µg/m³) and five partial sampling ratios (10/90, 20/80, 30/70, 40/60, 50/50). Our results demonstrate that the 35.5 µg/m³ threshold, coupled with a 20/80 partial sampling ratio, provides the best performance in terms of RMSE and R², particularly in capturing high PM2.5 events. Overall, models trained on augmented data significantly outperformed those trained on the original data, highlighting the importance of resampling techniques in improving air quality forecasting accuracy, especially for high-pollution scenarios. These insights contribute to a better understanding of PM2.5 pollution and can support more informed public health and environmental policies.
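The resampling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the features used, or how the ratio is defined. The sketch assumes that a ratio such as 20/80 means high-value samples (at or above the cutoff) make up 20% of the resampled set, that clustering is a simple 1-D k-means on the PM2.5 values themselves, and that low-value samples are drawn from each cluster in proportion to its size. The function names `kmeans_1d` and `cluster_undersample` and all parameters are hypothetical.

```python
import random


def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means (illustrative only); returns one cluster label per value."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # requires len(values) >= k
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def cluster_undersample(pm25, threshold=35.5, high_share=0.2, k=3, seed=0):
    """Return sorted indices to keep: every high sample plus a clustered subset of low ones.

    Assumption: `high_share` is the target fraction of high-value samples
    in the resampled set (0.2 for a 20/80 ratio).
    """
    rng = random.Random(seed)
    high = [i for i, v in enumerate(pm25) if v >= threshold]
    low = [i for i, v in enumerate(pm25) if v < threshold]
    if not low or not high:
        return sorted(high + low)  # nothing to rebalance
    # Low samples needed so that high samples form `high_share` of the result.
    n_low = min(len(low), round(len(high) * (1 - high_share) / high_share))
    labels = kmeans_1d([pm25[i] for i in low], min(k, len(low)), seed=seed)
    kept_low = []
    for c in set(labels):
        members = [i for i, lab in zip(low, labels) if lab == c]
        # Per-cluster quota proportional to cluster size (rounded, so totals are approximate).
        quota = round(n_low * len(members) / len(low))
        kept_low.extend(rng.sample(members, min(quota, len(members))))
    return sorted(high + kept_low)
```

For example, with 1,000 low-value readings and 20 readings above 35.5 µg/m³, calling `cluster_undersample(readings, threshold=35.5, high_share=0.2)` keeps all 20 high readings and roughly 80 low readings spread across the low-value clusters. Because per-cluster quotas are rounded, the resulting ratio is approximate rather than exact.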
Keywords
Air Quality; PM2.5 Forecasting; Data Augmentation; Cluster-Based Undersampling; Transformer Model; 2023 Canadian Wildfires
Subject
Environmental and Earth Sciences, Atmospheric Science and Meteorology
Copyright:
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.