

# SpaceCubeX: Emulation Results of Hybrid On-board Processing Architectures

Matthew French, Andrew Schmidt, Gabe Weisz – USC / ISI

Tom Flatley, Gary Crum, Jonathan Bobblit – NASA GSFC

Carlos Villalpando, Robert Bocchino – NASA JPL

June 13th, 2017





## Motivation: Next Gen NASA Earth Science Missions

- New Instruments required to produce essential data to help scientists answer critical 21st century questions
  - Global climate change, air quality, ocean health, ecosystem dynamics, etc...
- Missions specifying instruments with significantly increased:
  - Temporal, spatial, and frequency resolutions  $\rightarrow$  to global, continuous observations
  - Current/near-term data at rates >10<sup>8</sup> to 10<sup>11</sup> bits/second
- On-board processing ~100-1,000x greater than previous missions (compression, storage, downlink)
- Adding new capabilities such as low-latency data products for extreme event warnings



Hybrid computing is a key cross-cutting technology directly applicable to missions recommended in the Decadal Survey





## **SpaceCubeX Project**

- SpaceCubeX Architecture Analysis Framework:
  - Enables selection of the most SWaP-efficient processing architecture, with direct impact on mission capabilities, cost, and risk.
  - Provides look-ahead performance estimates for new processors, such as the anticipated NASA High Performance Spaceflight Computer (HPSC).
  - Reduces risk due to supply chain disruptions by allowing a user to rapidly compare alternative component selections.
  - Leverages a suite of high performance Earth science scenarios to ensure robust architecture characterization.
  - Utilizes a proven programming model to facilitate interoperability between commercial compilers.









## **Onboard Computing Analysis Framework**





### **Framework Maturation**





## SpaceCubeX End-to-End Framework

#### **Architecture Development**

- Rapidly generate systems
- Simulate when hardware unavailable
- Emulate for increases in:
  - Data size, speed, and precision
- Leverages same board model for simulation and emulation

#### **Application Development and Testing**

- Analyze and partition application for system
- Run application on full Linux heterogeneous system
- Refine application or hardware for target metrics





## **Accelerator Peripheral and Emulation Manager**

#### **Accelerator Peripheral:**

- Connects processor simulators with FPGA and DSP co-simulation environments
- Enables a diverse set of vendor-specific simulators to work in concert for each generated architecture

#### **Emulation Manager:**

- Supplies realistic data transactions to kernels running in the emulation environment on accelerators (e.g., FPGAs)
- Provides fine-grain control of transactions between multi-cores and accelerators for high precision analysis
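
The idea of feeding recorded transactions to a kernel can be sketched in a few lines of Python. This is only an illustration: the record layout (`cycle`, `op`, `addr`, `data`), the text format, and the `parse_transactions` and `replay` helpers are hypothetical and are not the actual SpaceCubeX transaction file format or Emulation Manager interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Transaction:
    """One recorded bus transaction (hypothetical layout)."""
    cycle: int      # issue time, in cycles
    op: str         # "W" for write, "R" for read
    addr: int       # target address
    data: int = 0   # payload for writes, unused for reads


def parse_transactions(text: str) -> List[Transaction]:
    """Parse lines of the form '<cycle> <W|R> <hex addr> [<hex data>]'."""
    txns = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        cycle, op, addr = int(fields[0]), fields[1], int(fields[2], 16)
        data = int(fields[3], 16) if len(fields) > 3 else 0
        txns.append(Transaction(cycle, op, addr, data))
    # Replay in issue order, regardless of the order in the file.
    return sorted(txns, key=lambda t: t.cycle)


def replay(txns: List[Transaction], memory: Dict[int, int]) -> List[int]:
    """Apply transactions to a dict-based memory model; return values read."""
    reads = []
    for t in txns:
        if t.op == "W":
            memory[t.addr] = t.data
        elif t.op == "R":
            reads.append(memory.get(t.addr, 0))
        else:
            raise ValueError(f"unknown operation {t.op!r}")
    return reads
```

Because the replay depends only on the recorded file, the same file could in principle drive either a simulated or an emulated target, which is the property the slides attribute to transaction files.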

#### **Recent Enhancements:**

- Accelerator peripheral ported to all new architectures (i.e. ARM A53, Xilinx UltraScale+, Microsemi FPGAs)
- Transaction files now fully interoperable with emulation manager
- Now supports integrated simulation and emulation environments

Simulation Environment with Accelerator

Verified using a Raspberry Pi (ARM A53s) and a Virtex7, months before Xilinx MPSoC boards were available for emulation



Cross simulation/emulation compatibility with



- Simulation and Emulation Environment stack running on Host PC/HW
- SpaceCubeX's Hybrid architectures are effectively "virtual systems"
- Supports both bare-metal and OS-based operation in the system
- Benchmarks built with the compilation flow can run in instruction-accurate simulation and be debugged with conventional techniques
- Emulation Manager and Transaction Files unify the run-time and support unmodified benchmark/application binaries running on each platform





- Main architectures generated and evaluated by the SpaceCubeX framework:
  - SpaceCube 2.0:
    - Legacy Xilinx Virtex5 FPGAs with embedded PowerPC 440 cores
  - Zynq (A9 cores):
    - Single Xilinx Zynq 7045 FPGA with dual ARM A9 processors
  - Hybrid (A53 cores):
    - ARM A53 processors found on Hybrid-FPGA or Hybrid-DSP board, HPSC equivalent processor
  - Hybrid (DSP):
    - Quad ARM A53 with ClearSpeed CSX700 (simulated TI C6747 DSP core)
  - Hybrid (FPGA):
    - Two Quad ARM A53 processor clusters with Virtex 7 FPGA
  - Next-gen Hybrid (# FPGAs):
    - Configurable Multi-core (HPSC/Xilinx Zynq UltraScale+) with FPGAs
- Some "mutations" of these main branches were explored but pruned based on results





### **Emulation Platforms**



Emulation framework running at ISI, GSFC, and JPL for individual application development, testing, and to demonstrate SpaceCubeX portability




## **Application Benchmark Suite**

| Benchmark                               | Description |
|-----------------------------------------|-------------|
| Micro Benchmarks                        | Kernels to benchmark architecture subcomponents and measure system viability. |
| NAS Parallel Benchmarks                 | NASA-generated set of programs designed to help evaluate the performance of parallel supercomputers. Derived from computational fluid dynamics (CFD) applications; consists of five kernels and three pseudo-applications. |
| Packet Routing                          | Two kernels: packet generation and transmission, and packet reception and verification. |
| ATCORR                                  | Atmospheric correction algorithm commonly used in hyperspectral and other sensing applications. |
| Hyperspectral Classifiers               | Two classification kernels: Sulfur, Thermal. |
| Hyperspectral Compression               | Lossless compression algorithm tuned for hyperspectral data. |
| Image segmentation and segment analysis | Autonomous spacecraft tasking, geological feature identification, analysis, and data handling. (HPFEC-3) |
| Image Classification                    | Common image processing kernels including feature extraction, shape analysis, and surface analysis. (HPFEC-4) |




## **Application Mapping Process**

- Existing applications:
  - Port code, recompile
  - Existing FPGA extremely helpful

### New Applications:

- Utilize Redsharc to encapsulate kernels using common interface API to facilitate migrating kernels between heterogeneous elements (CPU, DSP, FPGA)
- **Redsharc**: Reconfigurable Datastream Software / Hardware ARChitecture
  - Redsharc infrastructure utilized to standardize simulation framework
  - Application developers target API and get infrastructure for 'free'
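
A minimal sketch of what targeting a common kernel interface can look like is shown here. The class and function names (`StreamKernel`, `SoftwareThreshold`, `MockHardwareThreshold`, `run_application`) are hypothetical illustrations and are not the Redsharc API; the "hardware" kernel is a pure-software stand-in that processes data in bursts.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List


class StreamKernel(ABC):
    """Hypothetical uniform interface: consume a stream, produce a stream."""

    @abstractmethod
    def process(self, stream: Iterable[int]) -> Iterator[int]:
        ...


class SoftwareThreshold(StreamKernel):
    """CPU implementation: thresholds one pixel at a time."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold

    def process(self, stream: Iterable[int]) -> Iterator[int]:
        for px in stream:
            yield 255 if px >= self.threshold else 0


class MockHardwareThreshold(StreamKernel):
    """Stand-in for an accelerator-backed kernel that works on fixed bursts."""

    def __init__(self, threshold: int, burst: int = 4) -> None:
        self.threshold = threshold
        self.burst = burst

    def _run_burst(self, pixels: List[int]) -> List[int]:
        return [255 if px >= self.threshold else 0 for px in pixels]

    def process(self, stream: Iterable[int]) -> Iterator[int]:
        pending: List[int] = []
        for px in stream:
            pending.append(px)
            if len(pending) == self.burst:
                yield from self._run_burst(pending)
                pending = []
        if pending:  # flush a final partial burst
            yield from self._run_burst(pending)


def run_application(kernel: StreamKernel, pixels: Iterable[int]) -> List[int]:
    """Application code depends only on the interface, not the backend."""
    return list(kernel.process(pixels))
```

With this shape, moving a kernel between processing elements means swapping the object passed to `run_application`, while the application code itself stays unchanged.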

### Optimization

- Reasonable effort level approach taken
- Goal to identify best board level architecture, not a mission level application optimization project



### **REDSHARC Uniform APIs**



#### Software API Instantiation



#### Hardware API Instantiation





### How Well Does the SpaceCubeX Framework Predict Performance?



# Achieving less than 6% error in Simulation vs. Emulation

- The benchmark includes 2 accelerators, which each process 1 pixel/cycle
- The application dynamically activates either or both accelerators
- When using both accelerators, data streams between the accelerators, saving a round-trip to memory
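
The benefit of chaining can be illustrated with a small software model. This is a sketch under stated assumptions: the two stages (`smooth`, `gradient`) are simplified 1-D stand-ins rather than the actual HPFEC-3 kernels, and `CountingMemory` is a hypothetical helper that only counts pixel transfers to and from memory.

```python
from typing import Iterable, Iterator, List


def smooth(pixels: Iterable[int]) -> Iterator[int]:
    """Stage 1 stand-in: average each pixel with its predecessor."""
    prev = None
    for px in pixels:
        yield px if prev is None else (px + prev) // 2
        prev = px


def gradient(pixels: Iterable[int]) -> Iterator[int]:
    """Stage 2 stand-in: absolute difference from the previous pixel."""
    prev = None
    for px in pixels:
        yield 0 if prev is None else abs(px - prev)
        prev = px


class CountingMemory:
    """Counts pixel reads and writes (hypothetical bookkeeping helper)."""

    def __init__(self) -> None:
        self.reads = 0
        self.writes = 0

    def load(self, data: List[int]) -> Iterator[int]:
        for px in data:
            self.reads += 1
            yield px

    def store(self, stream: Iterable[int]) -> List[int]:
        out = []
        for px in stream:
            self.writes += 1
            out.append(px)
        return out


def run_separately(image: List[int], mem: CountingMemory) -> List[int]:
    """Each stage reads from and writes back to memory."""
    intermediate = mem.store(smooth(mem.load(image)))
    return mem.store(gradient(mem.load(intermediate)))


def run_chained(image: List[int], mem: CountingMemory) -> List[int]:
    """Stage 1 streams directly into stage 2; one read and one write pass."""
    return mem.store(gradient(smooth(mem.load(image))))
```

Both paths produce the same output, but the chained path touches memory half as often in this model, which is the round-trip saving the bullets describe.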

- Simulated and emulated performance estimates reflect real-world performance
- Chaining two accelerators makes a huge difference in performance

HPFEC-3 Comparisons for Edge Detection

| Architecture | Configuration                   | Sim vs. Emu | Speedup |
|--------------|---------------------------------|-------------|---------|
| A9           | 1 Core (Both)                   | 1.92%       | 1.00    |
| A9           | 2 Core (Both)                   | 1.72%       | 1.89    |
| A9+FPGA      | Conv FPGA + 2 SW Edge           | 0.22%       | 4.71    |
| A9+FPGA      | 2 SW Conv + FPGA Grad (SW Rest) | 0.65%       | 2.47    |
| A9+FPGA      | FPGA Conv + FPGA Grad (SW Rest) | 3.22%       | 18.21   |
| A53          | 1 Core (Both)                   | 1.43%       | 1.20    |
| A53          | 2 Core (Both)                   | 1.95%       | 2.38    |
| A53          | 4 Core (Both)                   | 1.54%       | 4.37    |
| A53+FPGA     | Conv FPGA + 4 SW Edge           | 5.93%       | 9.40    |
| A53+FPGA     | 4 SW Conv + FPGA Grad (SW Rest) | 2.09%       | 5.11    |
| A53+FPGA     | FPGA Conv + FPGA Grad (SW Rest) | 0.29%       | 48.49   |
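
The two headline claims can be checked directly against the HPFEC-3 table values. The sketch below copies the numbers from the table; the `Result` type, the `chained` flag, and the helper names are assumptions introduced here for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    platform: str
    config: str
    sim_vs_emu_pct: float   # simulation vs. emulation difference, percent
    speedup: float          # relative to A9, 1 core
    chained: bool = False   # True when both accelerators run on the FPGA


# Values copied from the HPFEC-3 edge detection table.
RESULTS: List[Result] = [
    Result("A9", "1 Core", 1.92, 1.00),
    Result("A9", "2 Core", 1.72, 1.89),
    Result("A9+FPGA", "Conv FPGA + 2 SW Edge", 0.22, 4.71),
    Result("A9+FPGA", "2 SW Conv + FPGA Grad", 0.65, 2.47),
    Result("A9+FPGA", "FPGA Conv + FPGA Grad", 3.22, 18.21, chained=True),
    Result("A53", "1 Core", 1.43, 1.20),
    Result("A53", "2 Core", 1.95, 2.38),
    Result("A53", "4 Core", 1.54, 4.37),
    Result("A53+FPGA", "Conv FPGA + 4 SW Edge", 5.93, 9.40),
    Result("A53+FPGA", "4 SW Conv + FPGA Grad", 2.09, 5.11),
    Result("A53+FPGA", "FPGA Conv + FPGA Grad", 0.29, 48.49, chained=True),
]


def worst_error(results: List[Result]) -> float:
    """Largest simulation vs. emulation difference across all rows."""
    return max(r.sim_vs_emu_pct for r in results)


def chaining_gain(results: List[Result], platform: str) -> float:
    """Chained speedup divided by the best single-accelerator speedup."""
    rows = [r for r in results if r.platform == platform]
    chained = max(r.speedup for r in rows if r.chained)
    single = max(r.speedup for r in rows if not r.chained)
    return chained / single


print(f"worst sim vs. emu difference: {worst_error(RESULTS):.2f}%")
for platform in ("A9+FPGA", "A53+FPGA"):
    print(f"{platform}: chaining gain {chaining_gain(RESULTS, platform):.2f}x")
```

Under these numbers, every row stays below the 6% bound, and the chained configuration is several times faster than the best configuration that offloads only one stage.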

## **SpaceCubeX: Performance Comparison**



**Performance Analysis for SpaceCubeX Architectures** 

Hybrid Multi-Core/FPGA Architectures provide orders of magnitude higher performance





## **SpaceCubeX: Energy Comparison**







### **Application Development Productivity with Python**

- Xilinx PYNQ platform combines Python with Zynq SoC to attract orders of magnitude more developers
  - Significant shift for FPGA community
- Worked with Xilinx to integrate portions of SpaceCubeX framework into PYNQ platform and evaluate gains
  - Running full Edge Detection benchmarks from SpaceCubeX
  - Evaluation enabled design space exploration with highly tuned software
  - Performance results show that FPGAs still offer significant advantages
- Allows team to leverage huge Python development community for libraries, tools, and modularity to support more diverse applications
- Won Best Short Paper at FCCM 2017 in May
  - "Evaluating Rapid Application Development with Python for Heterogeneous Processor-based FPGAs"
- Work highlighted in Xilinx Xcell and The Next Platform publications



Xilinx PYNQ platform with SpaceCubeX's Edge Detector

<pre>
# Import added for completeness; Overlay is provided by the pynq package
from pynq import Overlay
## Create Overlay Bitstream Object
bit0 = Overlay("project.bit")
## Program Bitstream
bit0.download()
</pre>

Simple API for complex tasks like FPGA programming

Performance Comparison of Edge Detector on PYNQ

| Configuration               | Time (s) | Speedup |
|-----------------------------|----------|---------|
| C Version - 1 Thread        | 2.0516   | 1.00×   |
| C Version - 2 Threads       | 1.0660   | 1.93×   |
| OpenCV Version - 2 Threads  | 0.0896   | 22.91×  |
| HW Accelerated Version      | 0.0765   | 26.80×  |
| Python OpenCV Version       | 0.1795   | 11.43×  |
| PYNQ HW Accelerated Version | 0.0679   | 30.21×  |
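
The speedup column follows from the measured times as baseline time divided by configuration time, with the single-threaded C version as the baseline. The short sketch below recomputes it from the table; the dictionary and function names are introduced here for illustration, and small differences from the reported column are expected because the published times are rounded.

```python
# (time in seconds, reported speedup) copied from the PYNQ comparison table.
REPORTED = {
    "C Version - 1 Thread": (2.0516, 1.00),
    "C Version - 2 Threads": (1.0660, 1.93),
    "OpenCV Version - 2 Threads": (0.0896, 22.91),
    "HW Accelerated Version": (0.0765, 26.80),
    "Python OpenCV Version": (0.1795, 11.43),
    "PYNQ HW Accelerated Version": (0.0679, 30.21),
}


def speedup(baseline_s: float, time_s: float) -> float:
    """Speedup relative to the baseline: larger is faster."""
    return baseline_s / time_s


baseline = REPORTED["C Version - 1 Thread"][0]
computed = {name: speedup(baseline, t) for name, (t, _) in REPORTED.items()}

for name, (time_s, reported) in REPORTED.items():
    print(f"{name:30s} {time_s:7.4f} s  "
          f"computed {computed[name]:6.2f}x  reported {reported:6.2f}x")
```

One point the recomputation makes visible is that the hardware-accelerated version driven from Python is not slower than the hardware-accelerated C version in these measurements.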



- Ran over 425 benchmark application experiments on 15 different permuted architectures for processors, FPGAs, and DSPs in simulation and emulation
- Demonstrated agreement between simulation and emulation (less than 6% error)
- Demonstrated application portability directly from simulation to emulation
- Next-generation multi-core processors and FPGAs provide orders-of-magnitude improvement (100x - 50,000x) over existing technology
  - Recommending as SpaceCube 3.0 hardware architecture
- Modeling application performance to further improve developer efficiency

### General observations

- ARM A53 significantly outperforms A9
- ARM A53 with FPGAs dramatically improves performance of highly parallel applications
- Integrated multi-core and FPGAs (e.g., Xilinx Zynq UltraScale+) offer 2x - 10x speedup and energy efficiency vs. multiple discrete parts (e.g., ARM A53 + Virtex7)
- Multi-core architectures provide fast, scalable approach
- Hybrid DSP architecture lagging
- Hybrid architectures provide best performance / power

| Type            | Name                    | SpaceCube 2.0: 1 PPC | SpaceCube 2.0: 2 PPCs | Zynq: 1 ARM A9 | Zynq: 2 ARM A9s | Hybrid: 1 ARM A53 | Hybrid: 4 ARM A53s | Hybrid: 4 A53s + FPGA | Hybrid: 4 A53s + DSP | Next-gen Hybrid: 4 A53s + 1 FPGA | Next-gen Hybrid: 8 A53s + 2 FPGAs |
|-----------------|-------------------------|----------------------|-----------------------|----------------|-----------------|-------------------|--------------------|-----------------------|----------------------|----------------------------------|-----------------------------------|
| Diagnostic      | Memory Test             | Y | Y               | Y | Y               | Y | Y               | N/A | N/A | Y | Y |
| Diagnostic      | Interfaces Test         | Y | Y               | Y | Y               | Y | Y               | N/A | N/A | Y | Y |
| Micro-Benchmark | Dhrystone               | Y | N/A             | Y | N/A             | Y | N/A             | N/A | N/A | Y | Y |
| Micro-Benchmark | Whetstone               | Y | N/A             | Y | N/A             | Y | N/A             | N/A | N/A | Y | Y |
| Micro-Benchmark | Linpack                 | Y | N/A             | Y | N/A             | Y | N/A             | N/A | N/A | Y | Y |
| Micro-Benchmark | NAS Parallel Benchmarks | Y | N/A             | Y | Y               | Y | Y               | N/A | N/A | Y | Y |
| Application     | GSFC Packet Generation  | Y | Y               | Y | Y               | Y | Y               | N/A | N/A | Y | Y |
| Application     | GSFC Packet Validation  | Y | Y               | Y | Y               | Y | Y               | N/A | N/A | Y | Y |
| Application     | SAR                     | N | N (Single Core) | Y | N (Single Core) | Y | N (Single Core) | N/A | N/A | Y | Y |
| Application     | ATCORR                  | Y | N (Single Core) | Y | N (Single Core) | Y | N (Single Core) | Y   | Y   | Y | Y |
| Application     | SVM Sulfur              | Y | N (Single Core) | Y | N (Single Core) | Y | N (Single Core) | Y   | Y   | Y | Y |
| Application     | Hyper. Thermal          | Y | N (Single Core) | Y | N (Single Core) | Y | N (Single Core) | Y   | Y   | Y | Y |
| Application     | Hyper. Compression      | Y | Y               | Y | Y               | Y | Y               | Y   | Y   | Y | Y |
| Benchmark       | HPFEC-3                 | Y | N (Single Core) | Y | Y               | Y | Y               | Y   | Y   | Y | Y |
| Benchmark       | HPFEC-4                 | Y | N (Single Core) | Y | Y               | Y | Y               | Y   | Y   | Y | Y |







## Major Technical Accomplishments

- Fully functional emulation architecture
  - Supports migration from simulation to emulation with no recompile
  - Supports hybrid simulation and emulation
  - Well beyond commercial single core offerings
- On-board computing architecture trade space evaluation completed
  - 5 state-of-the-art computing architectures evaluated at the detailed emulation level in over 400 experiments
- Strong candidate SpaceCube 3.0 architecture selected
  - 60-28,000x increase in energy efficiency
  - Hybrid Xilinx UltraScale+ SoC and Microsemi RTG4 FPGA
  - Radiation hardened and high-performance radiation tolerant FPGAs
  - Ability to add additional processing elements
    - Additional Xilinx MPSoC, NASA/AFRL HPSC processor, or application specific A/D or D/A modules

#### SpaceCubeX Heterogeneous Hardware Development:



#### Recommended SpaceCube 3.0 Architecture







### **Future Research**

- SpaceCubeX AIST-14 effort generated 2 major technology thrusts:
  - On-board Computing Analysis Framework
  - SpaceCube 3.0 Architecture
- Utilize AIST-14 framework to aid in development of additional applications
  - Fluid Lensing 3D Reconstruction
  - MiDAR active multispectral imaging
  - Model Predictive Control Architecture for Optimizing Earth Science Data
  - Radio Frequency Interference Detection and Mitigation
- On-board Computing Analysis Framework Extensions
  - Multi-satellite extensions
  - Inclusion of airborne processors
- SpaceCube 3.0 Architecture
  - Develop prototype hardware



Fluid Lensing Image



Multi-Satellite, Distributed Sensing Mission Enabled by SpaceCubeX Extensions





# **QUESTIONS?**





### Recommended Prototype SpaceCube 3.0 Architecture: Next-gen Hybrid FPGA

- Highest overall performance vs state of practice onboard computing
  - Microbenchmarks: 150x - 400x
  - Application Benchmarks: 110x - 50,000x
  - Energy efficiency: 60x - 28,000x
  - With reserve capacity of up to 90% of FPGA resources
- Design concept includes:
  - Hybrid Xilinx UltraScale+ SoC and Microsemi RTG4 FPGA
  - Radiation hardened and high-performance radiation tolerant FPGAs
  - Coupled with high-speed interfaces and memory
  - Ability to add additional processing elements
    - Additional Xilinx MPSoC, NASA/AFRL HPSC processor, or application specific A/D or D/A modules

### Best feasible path to support new mission capabilities

- Autonomous instrument control
- Distributed measurement
- Multi-satellite missions

#### Proposed SpaceCube 3.0 Architecture



#### SpaceCube 3.0 Emulation Platform



Xilinx ZCU102 Development Board

**Competition Sensitive** 





### **Architecture Overview**

