Title: Scaling biodiversity data processing and analysis workflows with Apache Beam
Presenting Author: Jeremy Malczyk
Organization: Yale University
Co-Author(s): Walter Jetz, Robert Guralnick, Adam Wilson

Abstract:
As the volume of remote sensing imagery and derived products suitable for biodiversity research has grown beyond local compute and storage capacities, workflows to process and analyse these data must be developed and deployed on the platforms where the data reside. Likewise, biodiversity occurrence record collections already reach into the billions of records and continue to grow rapidly as citizen science, camera trapping, and satellite-acquired sensor network efforts expand. The Apache Beam programming model provides a unified method for defining batch and streaming data-parallel processing pipelines that may be deployed across a variety of distributed processing back-ends. Here we explore the suitability of processing pipelines defined in Beam for fusing biodiversity occurrence data of varying spatio-temporal uncertainty with global gridded environmental layers of fine spatial resolution and temporal cadence. Pipeline performance and capabilities are compared against existing platforms built specifically for planetary-scale environmental analysis (Google Earth Engine) and for biodiversity data fusion (Movebank.org ENV-Data). The Map of Life project seeks to expose these capabilities to the biodiversity research community via a suite of application programming and user interfaces, to aid scientists in defining species niches and to abstract data fusion requirements out of their local modelling workflows.
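
The fusion step described above can be pictured as a keyed join: each occurrence record is snapped to a grid cell and time step, then matched with the environmental values stored under the same key. The following is a minimal plain-Python sketch of that idea. It is not the project's pipeline and not Beam code, and it ignores spatio-temporal uncertainty entirely. The 1-degree cell size, monthly time step, record fields, sample values, and the function names `cell_key` and `annotate` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # assumed grid resolution in degrees (illustrative only)


def cell_key(lat, lon, date):
    """Key a point by grid row, grid column, and year-month.

    `date` is assumed to be an ISO string of the form "YYYY-MM-DD".
    """
    row = math.floor((lat + 90.0) / CELL_DEG)
    col = math.floor((lon + 180.0) / CELL_DEG)
    return (row, col, date[:7])


def annotate(occurrences, env_cells):
    """Attach the mean environmental value of the matching cell and month."""
    # Group environmental values by the same key used for occurrences.
    env_by_key = defaultdict(list)
    for cell in env_cells:
        key = (cell["row"], cell["col"], cell["month"])
        env_by_key[key].append(cell["value"])

    annotated = []
    for occ in occurrences:
        values = env_by_key.get(cell_key(occ["lat"], occ["lon"], occ["date"]), [])
        mean = sum(values) / len(values) if values else None
        annotated.append({**occ, "env_mean": mean})
    return annotated


# Hypothetical sample records, made up for this sketch.
occurrences = [
    {"species": "Puma concolor", "lat": 41.3, "lon": -72.9, "date": "2015-06-14"},
    {"species": "Puma concolor", "lat": -10.2, "lon": -60.5, "date": "2015-07-01"},
]
env_cells = [
    {"row": 131, "col": 107, "month": "2015-06", "value": 18.0},
    {"row": 131, "col": 107, "month": "2015-06", "value": 20.0},
]

result = annotate(occurrences, env_cells)
```

In a distributed pipeline, the same shape would typically be expressed as a keying step applied to both inputs followed by a group-by-key join, so that records sharing a cell and time step are brought together on the same worker.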