Abstract
Currently available nationwide prediction models for fine particulate matter (PM2.5) lack prediction confidence intervals and usually do not describe cross-validated model performance at different spatiotemporal resolutions and extents. We used 41 different spatiotemporal predictors, including data on land use, meteorology, aerosol optical density, emissions, wildfires, population, traffic, and spatiotemporal indicators, to train a machine learning model to predict daily average PM2.5 concentrations at 0.75 sq km resolution across the contiguous United States from 2000 through 2020. We used a generalized random forest model, which allowed us to generate asymptotically valid prediction confidence intervals, and took advantage of its properties as an ensemble learner to quickly and cheaply characterize leave-one-location-out cross-validated model performance for different temporal resolutions and geographic regions. Using a variable importance metric, we selected 8 predictors that were able to accurately predict daily PM2.5, with an overall leave-one-location-out cross-validated median absolute error of 1.20 µg/m³, an R² of 0.84, and a confidence interval coverage fraction of 95%. When considering aggregated temporal windows, the model achieved leave-one-location-out cross-validated median absolute errors of 0.99, 0.76, 0.63, and 0.60 µg/m³ for weekly, monthly, annual, and all-time exposure assessments, respectively. We further describe the model’s cross-validated performance in different geographic regions of the United States, finding that it performs worse in the western half of the country, where there are fewer monitors. The code and data used to create this model are publicly available, and we have developed software packages to be used for exposure assessment.
This accurate exposure assessment model will be useful for epidemiologists seeking to study the health effects of PM2.5 across the contiguous United States, while allowing them to consider exposure estimation accuracy and uncertainty specific to the spatiotemporal resolution and extent of their study design and population.
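
As an illustration of the performance measures named above, the following minimal sketch computes a median absolute error at the daily level and after monthly aggregation, together with a confidence interval coverage fraction. It is not the authors' code: the function names, the record layout, and the toy values are all hypothetical, and the predictions are simply assumed to have been made for monitors held out of training (leave-one-location-out).

```python
from collections import defaultdict
from statistics import median

# Hypothetical held-out records: (site, month, observed, predicted, ci_lower, ci_upper).
# Values are made up for illustration and are NOT results from the paper.
records = [
    ("A", "2020-01", 10.0, 11.0, 8.0, 14.0),
    ("A", "2020-01", 12.0, 11.0, 8.0, 14.0),
    ("A", "2020-02", 8.0, 9.5, 6.0, 12.0),
    ("B", "2020-01", 20.0, 16.0, 13.0, 19.0),
    ("B", "2020-01", 14.0, 16.0, 13.0, 19.0),
    ("B", "2020-02", 9.0, 9.0, 6.0, 12.0),
]


def median_absolute_error(pairs):
    """Median of |observed - predicted| over (observed, predicted) pairs."""
    return median(abs(obs - pred) for obs, pred in pairs)


def coverage_fraction(rows):
    """Share of observations that fall inside their prediction interval."""
    hits = sum(1 for _, _, obs, _, lo, hi in rows if lo <= obs <= hi)
    return hits / len(rows)


def aggregate_by_site_and_month(rows):
    """Average observed and predicted values within each (site, month) group."""
    groups = defaultdict(list)
    for site, month, obs, pred, _, _ in rows:
        groups[(site, month)].append((obs, pred))
    return [
        (sum(o for o, _ in g) / len(g), sum(p for _, p in g) / len(g))
        for g in groups.values()
    ]


daily_mae = median_absolute_error([(r[2], r[3]) for r in records])
monthly_mae = median_absolute_error(aggregate_by_site_and_month(records))
coverage = coverage_fraction(records)

print(f"daily median absolute error:   {daily_mae:.2f}")
print(f"monthly median absolute error: {monthly_mae:.2f}")
print(f"interval coverage fraction:    {coverage:.2f}")
```

In this toy example the error shrinks after aggregation because daily over- and under-predictions partly cancel within a month, which is the same direction of effect reported above for weekly, monthly, annual, and all-time windows.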