Title: SciSpark: Interactive and Highly Scalable Climate Model Analytics
Presenting Author: Chris Mattmann
Organization: Jet Propulsion Laboratory

Abstract:
We present SciSpark, a lightning-fast cluster computing technology built on top of Apache Spark. Spark outperforms the de facto big data technology, Apache Hadoop, by factors of 100-1000x in memory and 10-100x on disk for iterative cluster algorithms. SciSpark enables data reuse between scientific workflows by natively deciding which data should be kept in memory and which data should periodically be moved to and from disk. SciSpark is being prototyped using a novel k-means clustering algorithm for climate variables and their probability distribution functions (PDFs), and a graph-based algorithm for discovering mesoscale convective complexes in satellite infrared (IR) data. To date we have prototyped SciSpark on a highly scalable 4-node Spark cluster with 256 GB of RAM and 96 processors per node. We have evaluated a preliminary version of SciSpark on a long-running climate model analytic that converts temperature data between units. In this talk we will also highlight the benefits of directly engaging with the Apache Software Foundation, the home of Apache Spark and a very large Big Data ecosystem.