Title: Empowering Data Management, Diagnosis, and Visualization of Cloud-Resolving Models by Cloud Library upon Spark and Hadoop
Presenting Author: Wei-Kuo Tao
Organization: NASA GSFC

Co-Author(s): Shujia Zhou, Northrop Grumman Information Technology; Xian-He Sun, IIT; Toshihisa Matsui, ESSICI; Xiaowen Li, GESTAR

Abstract:
A cloud-resolving model (CRM) is an atmospheric numerical model that resolves clouds and cloud systems at very high spatial resolution. Its main advantage is that it explicitly represents interactive processes among microphysics, radiation, turbulence, the surface, and aerosols. Because of this fine resolution and the complexity of the physical processes, CRM simulations produce large data volumes (~10 TB), which makes it challenging for the CRM community to i) visualize and inter-compare CRM simulations, ii) diagnose key processes for cloud-precipitation formation and intensity, and iii) evaluate simulations against NASA’s field campaign data and L1/L2 satellite data products.

In this project, we are building the Super Cloud Library (SCL) on a Hadoop framework, capable of CRM database management (I/O control and compression), distribution, visualization, subsetting, and evaluation. Our progress is as follows: (1) we are developing a data model so that various CRM simulation outputs in NetCDF, including those of the NASA-Unified Weather Research and Forecasting (NU-WRF) and Goddard Cumulus Ensemble (GCE) models, can be accessed and processed by Hadoop, and we are extending a NetCDF-to-CSV converter to support NU-WRF and GCE model outputs; (2) we have developed a 3D visualization tool by wrapping IDL codes with Python and will test it in Hadoop; (3) we are developing a use case in which a data set in our data model can be subset with Hive queries via Hue’s Web interface; and (4) we are prototyping a portable Hadoop reader to access data in a parallel file system, where high performance computing (HPC) simulation outputs such as those of NU-WRF and GCE are located. In the coming months, we will also speed up SCL with Apache Spark.
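
As an illustration only, the idea behind converting gridded model output to CSV and then subsetting it can be sketched in a few lines of Python. This is not the SCL converter or its data model: the function names, the variable name `rain_rate`, and the sample values are hypothetical, and the sketch starts from in-memory lists rather than reading a NetCDF file. It flattens a small (time, lat, lon) field into one record per grid point, which is the row-oriented shape that tools such as Hive can query, and then filters rows by a coordinate range, roughly what a `WHERE` clause on that column would do.

```python
import csv
import io
from itertools import product


def grid_to_rows(var_name, coords, values):
    """Flatten a gridded variable into one record per grid point.

    coords: dict mapping dimension name -> ordered list of coordinate values
    values: nested lists indexed in the same dimension order as coords
    """
    dim_names = list(coords)
    header = dim_names + [var_name]
    rows = []
    # Visit every index combination, e.g. (time_i, lat_j, lon_k).
    for index in product(*(range(len(coords[d])) for d in dim_names)):
        cell = values
        for i in index:
            cell = cell[i]
        rows.append([coords[d][i] for d, i in zip(dim_names, index)] + [cell])
    return header, rows


def rows_to_csv(header, rows):
    """Serialize the flattened records as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def subset(header, rows, column, low, high):
    """Keep rows whose value in `column` lies in the closed range [low, high]."""
    col = header.index(column)
    return [r for r in rows if low <= r[col] <= high]


# Hypothetical 2 x 2 x 2 field (time, lat, lon); the values are made up.
coords = {"time": [0, 1], "lat": [10.0, 10.5], "lon": [120.0, 120.5]}
rain = [[[0.0, 1.2], [0.4, 0.0]], [[2.5, 0.1], [0.0, 3.3]]]

header, rows = grid_to_rows("rain_rate", coords, rain)
csv_text = rows_to_csv(header, rows)
north = subset(header, rows, "lat", 10.5, 10.5)
```

In a real workflow the arrays would come from a NetCDF reader and the subsetting would run inside Hadoop rather than in a single Python process; the sketch only shows the record layout and the selection logic.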

With the proposed capabilities, SCL users can conduct large-domain, on-demand tasks without downloading voluminous CRM datasets, NASA field campaign observations, and satellite data to a local computer.