Title: A systematic study of AI/ML models to support air quality digital twin replica
Presenting Author: Chaowei (Phil) Yang
Organization: George Mason University
Co-Author(s): Cédric H. David, Sina Hasheminassab, Olga V. Kalashnikova, Stepheny Perez, Joe T. Roberts, Sujay V. Kumar, Nishan Kumar Biswas, Paul Stackhouse, David Borges, Simon Baillarin, Frederic Bretar, and Raquel Rodriguez-Suquet

Abstract:
One of the three objectives of a digital twin, as laid out by NASA AIST, is to replicate the physical condition of an object or system of our home planet. This research addresses such replication by leveraging AI/ML models and tuning them with different parameter configurations to find the best-performing model and its optimal configuration. Using air pollutant observations, specifically PM2.5 and AOD, as examples, we investigated how to use AI/ML to calibrate PM2.5 in-situ sensor readings and to fill data gaps in AOD. Data from low-cost PurpleAir PM2.5 sensors were calibrated against U.S. Environmental Protection Agency (EPA) sensor data using machine learning and geospatial analysis to enhance the accuracy of the air quality replica. Sixty-four pairs of data from PurpleAir sensors and EPA stations were preprocessed to eliminate outliers and address missing values. A specialized tool was developed to streamline data selection and filtering, laying the foundation for initial AI/ML model training. A series of more than 3,000 runs, aimed at optimizing the models on key performance indicators such as RMSE and R², indicated that humidity, temperature, and corrected particulate matter concentration (CF=1) are significant predictors of PM2.5 levels. A Long Short-Term Memory (LSTM) model, trained with a 70/30 training/testing split and never exposed to the test data during training, performed best, with an R² of 0.897 and an RMSE of 3.559. The model was packaged in a Docker container to facilitate deployment on cloud computing platforms, supporting the Air Quality Analytics Collaborative Framework (AQ ACF) and FireAlarm systems, and aligning with the Los Angeles (LA) Air Quality Prediction model at California State University, Los Angeles (CSU-LA).
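The evaluation protocol described above (a 70/30 split with the test portion held back, scored by RMSE and R²) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: the data are synthetic, the function names are hypothetical, and an ordinary least-squares fit stands in for the LSTM so that the example stays self-contained.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def split_70_30(X, y, train_fraction=0.7):
    """Split in order (no shuffling) so the test rows are never seen in training."""
    n_train = int(round(len(X) * train_fraction))
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

def fit_linear(X, y):
    """Least-squares fit with an intercept; a stand-in for the LSTM."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to new feature rows."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic stand-ins for the three predictors named in the abstract.
rng = np.random.default_rng(0)
n = 1000
pm_cf1 = rng.uniform(0, 60, n)       # low-cost sensor PM2.5 (CF=1)
temperature = rng.uniform(0, 35, n)
humidity = rng.uniform(10, 95, n)
# Assumed relationship, invented purely for illustration.
reference = (0.5 * pm_cf1 + 0.05 * temperature - 0.08 * humidity
             + 5.0 + rng.normal(0, 2, n))

X = np.column_stack([pm_cf1, temperature, humidity])
X_train, X_test, y_train, y_test = split_70_30(X, reference)
coef = fit_linear(X_train, y_train)
pred = predict_linear(coef, X_test)
print(f"RMSE = {rmse(y_test, pred):.3f}, R2 = {r2(y_test, pred):.3f}")
```

The scores printed here describe only the synthetic data and are not comparable to the reported R² of 0.897 and RMSE of 3.559.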
The study also addressed the imputation of missing Aerosol Optical Depth (AOD) data in the Moderate Resolution Imaging Spectroradiometer (MODIS) Multi-Angle Implementation of Atmospheric Correction (MAIAC) product at 1 km × 1 km resolution. Using a Generative Adversarial Imputation Nets (GAIN) model, the research tackled the challenge of AOD values that are missing not at random because of cloud cover by incorporating aerosol and meteorological covariates. This methodology improves the completeness and accuracy of AOD data and supports more refined environmental assessments. The imputed AOD data were validated against AErosol RObotic NETwork (AERONET) observations, with an R² of 0.87. This validation indicates that the imputation process offers a reliable approach for filling gaps in AOD datasets, thereby supporting a full-scale air quality replica. The project also fostered a knowledge transfer ecosystem, engaging researchers and students (from high school to Ph.D. level) in a collaborative learning process. The results of this research will be delivered as open-source software, including a Docker container deployment template for widespread adoption. Our ongoing efforts include data collection, model training across five platforms, development of data-fusion techniques for comprehensive datasets, and rigorous model evaluation to ensure predictive accuracy and generalizability.
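To make the gap-filling idea concrete, the sketch below shows the masking and recombination step that GAIN-style imputers rely on: observed entries are kept as they are, and generated values are used only where data are missing. It is a hedged illustration, not the study's model. The generator is a hypothetical placeholder that returns column means rather than a trained network, the data are synthetic, and the gaps are hidden at random, whereas cloud-driven gaps are not random.

```python
import numpy as np

def observed_mask(x):
    """Return 1.0 where a value is observed and 0.0 where it is missing (NaN)."""
    return (~np.isnan(x)).astype(float)

def placeholder_generator(x):
    """Hypothetical stand-in for a trained generator: each column's observed mean."""
    col_means = np.nanmean(x, axis=0)
    return np.tile(col_means, (x.shape[0], 1))

def combine(x, mask, generated):
    """Keep observed entries; use generated values only where data are missing."""
    x_filled = np.nan_to_num(x, nan=0.0)
    return mask * x_filled + (1.0 - mask) * generated

def holdout_rmse(x_true, x_imputed, held_out):
    """RMSE computed only on the entries that were deliberately hidden."""
    diff = (x_true - x_imputed)[held_out]
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic table: column 0 plays the role of AOD, the others of covariates.
rng = np.random.default_rng(42)
truth = rng.uniform(0.05, 0.6, size=(200, 4))

# Hide about 20% of the AOD column (random here; real cloud gaps are not).
held_out = np.zeros(truth.shape, dtype=bool)
held_out[:, 0] = rng.random(truth.shape[0]) < 0.2

x = truth.copy()
x[held_out] = np.nan

mask = observed_mask(x)
generated = placeholder_generator(x)
imputed = combine(x, mask, generated)
print(f"hold-out RMSE = {holdout_rmse(truth, imputed, held_out):.3f}")
```

Hiding values that are actually known and scoring only those entries gives an internal check that complements validation against independent observations such as AERONET.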