Title: HDF5 Performance on OpenStack
Presenting Author: John Readey
Organization: The HDF Group
Co-Author(s): Joe Lee, Aleksandar Jelenak

Abstract:
Using an Earth Science dataset (130 GB of NCEP3 data), we evaluate the performance of HDF5 for typical problems when running on a 300-core OpenStack cluster. Various storage options were benchmarked and analyzed, including:
1. Different compression filters (GZIP, MAFISC, BLOSC)
2. Different ways of organizing data in the files (chunk layout)
3. Aggregated data (one 130 GB file) vs. non-aggregated data (7980 files of 16 MB each)
To utilize the capabilities of the cluster, we explore techniques to improve performance using multiple nodes. For some problems it is possible to provide performance that scales with the number of nodes, enabling users to interactively explore datasets that would otherwise require batch-style processing. We compare the pros and cons of this approach with the use of other toolsets such as Hadoop and Spark.
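To illustrate why chunk layout matters for the access patterns mentioned above, the following is a minimal sketch in plain Python. It is not the benchmark code. It counts how many chunks a read would touch under two layouts. The dataset shape (7980 time steps on a 720 x 1440 grid), both chunk shapes, and the helper name `chunks_touched` are hypothetical and chosen only for illustration.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks intersected by a rectangular selection.

    selection:   one (start, stop) index pair per dimension, stop exclusive.
    chunk_shape: chunk extent along each dimension.
    """
    count = 1
    for (start, stop), chunk in zip(selection, chunk_shape):
        first = start // chunk          # index of first chunk hit on this axis
        last = (stop - 1) // chunk      # index of last chunk hit on this axis
        count *= last - first + 1
    return count


# Hypothetical (time, lat, lon) dataset shape; an assumption, not the real data.
shape = (7980, 720, 1440)

# Two hypothetical chunk layouts.
map_layout = (1, 720, 1440)      # one chunk per time step (whole map)
series_layout = (420, 36, 72)    # blocks that are long in time, small in space

# Two typical access patterns.
point_series = [(0, 7980), (100, 101), (200, 201)]  # full time series at one grid cell
single_map = [(0, 1), (0, 720), (0, 1440)]          # full map at one time step

for name, layout in [("map_layout", map_layout), ("series_layout", series_layout)]:
    print(name,
          "time series:", chunks_touched(point_series, layout),
          "single map:", chunks_touched(single_map, layout))
```

Under these assumed shapes, the per-time-step layout favors map reads while the time-blocked layout favors time-series reads. Since a compressed chunk has to be read and decompressed as a whole, the number of chunks touched is a rough proxy for I/O and filter cost.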