Preprint Article Version 1 This version is not peer-reviewed

Data Augmentation Strategies for Improved PM2.5 Forecasting Using Transformer Architectures

Version 1 : Received: 23 October 2024 / Approved: 23 October 2024 / Online: 23 October 2024 (15:46:06 CEST)
Version 2 : Received: 24 October 2024 / Approved: 24 October 2024 / Online: 24 October 2024 (15:29:56 CEST)

How to cite: Pan, P.; Srirenganathan Malarvizhi, A.; Yang, C. Data Augmentation Strategies for Improved PM2.5 Forecasting Using Transformer Architectures. Preprints 2024, 2024101853. https://doi.org/10.20944/preprints202410.1853.v1

Abstract

Inhaling fine particulate matter with a diameter of less than 2.5 µm (PM2.5) greatly increases an individual’s risk of cardiovascular and respiratory disease. As climate change progresses, extreme weather events, including wildfires, are expected to become more frequent, exacerbating air pollution. The 2023 Canadian wildfires highlighted the growing threat of PM2.5 as smoke spread across U.S. cities such as New York, Philadelphia, and Washington, D.C. This research investigates the application of data augmentation techniques to improve the accuracy of PM2.5 concentration forecasts in these urban environments. Models trained on imbalanced datasets often struggle to capture extreme pollution events and underestimate high PM2.5 levels, because training is dominated by the more frequent low-value samples. To address this, we implemented cluster-based undersampling and trained transformer models using two cutoff thresholds (12.1 µg/m³ and 35.5 µg/m³) and five partial sampling ratios (10/90, 20/80, 30/70, 40/60, 50/50). Our results demonstrate that the 35.5 µg/m³ threshold, coupled with a 20/80 partial sampling ratio, provides the best performance in terms of RMSE and R², particularly in capturing high PM2.5 events. Overall, models trained on augmented data significantly outperformed those trained on the original data, highlighting the importance of resampling techniques for improving air quality forecasting accuracy, especially in high-pollution scenarios. These insights contribute to a better understanding of PM2.5 pollution and may support more informed public health and environmental policies.
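
As a rough illustration of the resampling idea described in the abstract, the following is a minimal sketch of cluster-based undersampling and not the authors' implementation. Several details are assumptions made only for this example: the low-value (below-threshold) samples are clustered with a small hand-written k-means, each cluster is thinned in proportion to its size, and a ratio such as 20/80 is read as 20% high-value to 80% low-value samples in the resampled set. The function names, the feature matrix `X`, and the number of clusters `k` are hypothetical.

```python
import numpy as np


def kmeans_labels(X, k, n_iter=20, seed=0):
    """Plain k-means (assumed clustering method); returns one label per row."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every row to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels


def cluster_undersample(X, y, threshold=35.5, high_share=0.2, k=5, seed=0):
    """Return sorted indices kept after thinning samples with y < threshold.

    All samples at or above `threshold` are kept. `high_share` is the
    assumed fraction of high-value samples in the result (0.2 for "20/80").
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    high = np.flatnonzero(y >= threshold)
    low = np.flatnonzero(y < threshold)
    # Number of low-value samples needed to reach the requested share.
    n_low = min(len(low), int(round(len(high) * (1 - high_share) / high_share)))
    if n_low == 0:
        return high
    labels = kmeans_labels(X[low], k=min(k, len(low)), seed=seed)
    kept = []
    for j in np.unique(labels):
        members = low[labels == j]
        # Each cluster contributes in proportion to its size.
        quota = min(len(members), int(round(n_low * len(members) / len(low))))
        kept.append(rng.choice(members, size=quota, replace=False))
    return np.sort(np.concatenate([high] + kept))
```

Calling `cluster_undersample(X, y, threshold=35.5, high_share=0.2)` returns the row indices of a resampled training set. Because each cluster's quota is rounded separately, the achieved ratio is only approximately the requested one.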

Keywords

Air Quality; PM2.5 Forecasting; Data Augmentation; Cluster-Based Undersampling; Transformer Model; 2023 Canadian Wildfires

Subject

Environmental and Earth Sciences, Atmospheric Science and Meteorology
