

## SpaceCubeX: Hybrid Multi-core CPU / FPGA / DSP Flight Architecture

Matthew French, Andrew Schmidt, Gabe Weisz – USC / ISI Tom Flatley, Gary Crum, Jonathan Bobblit – NASA GSFC Carlos Villalpando, Robert Bocchino – NASA JPL June 14th, 2016





### Motivation: Next Gen NASA Earth Science Missions

- New Instruments required to produce essential data to help scientists answer critical 21st century questions
  - Global climate change, air quality, ocean health, ecosystem dynamics, etc...
- Missions specifying instruments with significantly increased:
  - Temporal, spatial, and frequency resolutions  $\rightarrow$  to global, continuous observations
  - Current/near-term data at rates >10<sup>8</sup> to 10<sup>11</sup> bits/second
- On-board processing ~100-1,000x than previous missions (compression, storage, downlink)
- Adding new capabilities such as low-latency data products for extreme event warnings



Hybrid computing is a key cross-cutting technology directly applicable to missions recommended in the Decadal Survey





- Recent publications show hybrid computing architectures can be highly efficient in highly constrained SWAP scenarios
  - Also tracks trends in HPC, mobile computing
- Can we identify a hybrid computing architecture of high utility across a wide suite of Earth Science applications?



- SpaceCubeX Architecture Analysis Framework:
  - Enables selection of the most SWAP efficient processing architecture, with direct impact on mission capabilities, cost, and risk.
  - Minimal extensions to integrate new processors, such as the anticipated NASA High Performance Spaceflight Computer (HPSC), reducing time to initiate benchmarking by months.
  - Reduces risk due to supply chain disruptions by allowing a user to rapidly compare alternative component selections, quantify the impact, and update the processing architecture.
  - Leverages a wide suite of high performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization.
  - Utilizes a proven inter-task programming model to facilitate benchmark compilation and experimentation, while being fully interoperable with commercial compilers.





- Multi-core processors more easily provide:
  - General OS support
  - High-level functions
  - Coarse grained application parallelism

### Co-processors then provide:

- Customized acceleration
- High throughput
- Fine-grained data parallelism

### • Sparked questions:

- What devices to use?
- How to connect the devices?
- How to program the system?
- (and many more...)











- Search space of hybrid combinations too complex for existing methods
- Create a development environment which enables design space exploration of architectures across a range of applications
- Components pulled from a database and architecture generated
- Same environment can be used as architecture increases TRL
- Performance can be tracked across architectures and across TRL development



#### SpaceCubeX Heterogeneous Hardware Development:







### Simulation Testbed

#### • ArchGen:

- Creates board models for evaluation in simulation/emulation environments
- Component Database:
  - Collection of components of interest for evaluation
- Performance Models:
  - Individual component parameters used by simulation environment and report generation
- Simulation Generator:
  - Builds simulation environment based on board model
- Compiler:
  - Sets up compilation environment based on board model
- Simulation Environment:
  - Run-time framework to evaluate benchmark applications on hybrid architecture model
- Report:
  - Details of system's metrics based on benchmark applications on candidate architecture model

Year 1: Development and evaluation of simulation testbed across hybrid architectures





- Python based script for rapid architecture prototyping
- Parse Excel spreadsheets for component database and user selections
- Modular capabilities for future and integration with Emulation Environment

MySQL

🟓 python

- Added hooks for GUI front-end for User Selection via TkInter
- Produces XML-based Board Model file for:
  - Use by Simulation Generator/Compilation & Emulation Framework





- Python based script to stitch together hybrid simulation
- Generate Simulation Environment from board model
  - Interoperability between multi-core processors, FPGAs, DSPs
  - Run-time platform to execute benchmark applications
  - Monitor and collect performance report for post-run analysis
- Added support for external co-simulator (e.g. DSP/FPGAs)
- Leveraging Imperas Open Virtual Platform Platform APIs and Libraries



Simulation Generation create platform executables for CPUs and FPGAs. DSP simulator already incorporates DSP simulator platform executable.

Earth Science Tec



- Connects processor simulator with FPGAs and DSPs co-simulation environments
- Motivated by limited DSP simulation SystemC & TLM integration support
- Recognized need for similar functionality with future Emulation Environment
- Accelerator peripheral is transaction-based
  - Supports multiple transactions to processor's memory
  - Working on adding more fine-grain delays
    - Accelerator Delay Time for accelerator to perform computation
    - Interface Delay Time for accelerator to transfer data to memory/CPU
- Processor issues commands / checks status via Bus transactions
- Integrates with Redsharc FPGA simulations and DSP TI simulator





### **Simulation Environment**





- Simulation environment stack running on Host PC
- SpaceCubeX's Hybrid architectures are virtual system
- Able to provide support for bare metal and OS support in system
- Benchmarks using compilation flow can run in instruction accurate simulation and provide conventional debugging techniques





- Microbenchmarks used to evaluate run-time performance
  - Hardware platform performance based on real SpaceCube 2.0 system (Xilinx ML510 development board)
  - SpaceCubeX Simulator generated SpaceCube 2.0 simulation platform
- Performance differences within simulator tolerance

|                      | Dhrystone Simulator Performance Evaluation Comparison |                           |          |             |            |  |  |
|----------------------|-------------------------------------------------------|---------------------------|----------|-------------|------------|--|--|
|                      | Runs                                                  | Clock Cycles/Instructions | Time (s) | DMIPS       | Difference |  |  |
|                      |                                                       |                           |          |             |            |  |  |
| Hardware Platform    | 20000                                                 | 23651950                  | 0.2365   | 48.12727607 | -          |  |  |
|                      |                                                       |                           |          |             |            |  |  |
| SpaceCubeX Simulator | 20000                                                 | 7980020                   | 0.2418   | 47.07260082 | 2.22%      |  |  |
|                      |                                                       |                           |          |             |            |  |  |
| Hardware Platform    | 200000                                                | 249200120                 | 2.4920   | 45.67830574 | -          |  |  |
|                      |                                                       |                           |          |             |            |  |  |
| SpaceCubeX Simulator | 200000                                                | 8000020                   | 2.4242   | 46.95502526 | 2.76%      |  |  |

|                      | Whetstone Simulator Performance Evaluation Comparison |          |            |  |  |
|----------------------|-------------------------------------------------------|----------|------------|--|--|
|                      | Clock Cycles/Instructions                             | Time (s) | Difference |  |  |
| Hardware Platform    | 13334671                                              | 0.1333   |            |  |  |
| SpaceCubeX Simulator | 4780862                                               | 0.1449   | 8.29%      |  |  |





- Architectures selected for initial 'seed' to begin trade-space exploration
  - Further permutation later in effort
- Baseline: SpaceCube 2.0 Architecture:
  - Existing Flight system One RadHard Aeroflex FPGA and two Virtex5





### **Application Benchmark Suite**

| Benchmark                               | Description                                                                                                                                                                                                                 |
|-----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Micro Benchmarks                        | Kernels to benchmark architecture subcomponents and measure system viability.                                                                                                                                               |
| NAS Parallel Benchmarks                 | NASA generated set of programs designed to help evaluate the performance of parallel supercomputers, derived from computational fluid dynamics (CFD) applications and consist of five kernels and three pseudo-applications |
| Packet Routing                          | 2 kernels: Packet generation and transmission & Packet reception and verification                                                                                                                                           |
| ATCORR                                  | Atmospheric correction algorithm commonly used in Hyperspectral and other sensing applications                                                                                                                              |
| Hyperspectral Classifiers               | Two classification kernels: Sulfur, Thermal                                                                                                                                                                                 |
| Hyperspectral Compression               | Lossless Compression algorithm tuned for hyperspectral data                                                                                                                                                                 |
| Image segmentation and segment analysis | Autonomous spacecraft tasking, geological feature identification, analysis, and data handling. (HPFEC-3)                                                                                                                    |

Image Classification

Common image processing kernels including feature extraction, shape







### **Application Mapping Process**

- Existing applications
  - Port code, recompile
  - Existing FPGA extremely helpful
- New Applications (HPFEC-3, HPFEC-4)
  - Utilize REDSHARC to encapsulate kernels using common interface API to facilitate migrating kernels between heterogeneous elements (CPU, DSP, FPGA)
- REDSHARC: Reconfigurable Datastream Software / Hardware ARChitecture
  - REDSHARC infrastructure utilized to standardize simulation framework
  - Application developers target API and get infrastructure for 'free'
- Optimization
  - Reasonable effort level approach taken
  - Goal to identify best board level architecture, not a mission level



### **REDSHARC Uniform APIs**

```
while(1) {
    // Pop element off of Stream
    temp = incoming.streamPop();
    //Increment by 1
    temp++;
    output.streamPush(temp);
}
```

#### Software API Instantiation



#### Hardware API Instantiation





### **NAS Parallel Benchmarks: Results**



Multi-core performance scales with the number of cores. Architectures enabling high performance (cluster in space) class computing capabilities.



### SpaceCubeX: Throughput Comparison



Hybrid DSP and FPGA Architectures provides orders of magnitude higher performance





Hybrid DSP and FPGA Architectures provide significant reduction in power usage



- Ran over 200 benchmark application experiments on 8 different permuted architectures for processors, FPGAs, and DSPs
- General observations
  - ARM 53 significantly outperforms A9
  - Multi-core architectures provide fast, scalable approach
  - Hybrid architectures provide best performance / power
  - Hybrid DSP architecture lagging

|                      |                                          | Space Cube 2.0 |                    |          | 'nq                | Hybrid    |                    |                          |                 |
|----------------------|------------------------------------------|----------------|--------------------|----------|--------------------|-----------|--------------------|--------------------------|-----------------|
| Туре                 | Name                                     | 1 PPC          | 2 PPCs             | 1 ARM A9 | 2 ARM A9s          | 1 ARM A53 | 4 ARM<br>A53s      | 4 A53s<br>+ FPG <u>A</u> | 4 A53s<br>+ DSP |
| Diagnostic           | Memory<br>Test                           | Y              | Y                  | Y        | Y                  | Y         | Y                  | N/A                      | N/A             |
| Diagnostic           | Interfaces<br>Test                       | Y              | Y                  | Y        | Y                  | Y         | Y                  | N/A                      | N/A             |
| Micro-<br>Benchmarks | Dhrystone<br>/<br>Whetstone<br>/ Linpack | Y              | N/A                | Y        | N/A                | Y         | N/A                | N/A                      | N/A             |
| Micro-<br>Benchmark  | NAS<br>Parallel<br>Benchmark<br>s        | Y              | N/A                | Y        | Y                  | Y         | Y                  | N/A                      | N/A             |
| Application          | GSFC<br>Packet<br>Generation             | Y              | Y                  | Y        | Y                  | Y         | Y                  | N/A                      | N/A             |
| Application          | GSFC<br>Packet<br>Validation             | Y              | Y                  | Y        | Y                  | Y         | Y                  | N/A                      | N/A             |
| Application          | SAR                                      | Ν              | N<br>(Single Core) | Y        | N<br>(Single Core) | Y         | N<br>(Single Core) | N/A                      | N/A             |
| Application          | ATCORR                                   | Y              | N<br>(Single Core) | Y        | N<br>(Single Core) | Y         | N<br>(Single Core) | Y                        | Ν               |
| Application          | SVM Sulfur                               | Y              | N<br>(Single Core) | Y        | N<br>(Single Core) | Y         | N<br>(Single Core) | Y                        | Ν               |
| Application          | Hyper.<br>Thermal                        | Y              | N<br>(Single Core) | Y        | N<br>(Single Core) | Y         | N<br>(Single Core) | Y                        | Y               |
| Application          | Hyper.<br>Compressio<br>n                | Y              | Y                  | Y        | Y                  | Y         | Y                  | Y                        | Y               |
| Benchmark            | HPFEC-3                                  | Ν              | Ν                  | Y        | Y                  | Y         | Y                  | Y                        | Y               |





- Achieved first spiral of system capable of performing board level heterogeneous design space evaluation
- System capable of supporting additional earth sensing applications
- Year 2
  - Additional architecture permutations
  - Migrating from Simulation to Emulation
  - Assessing additional application









### **Potential Future Extensions: HPC**

- Extensions to address High Performance Computation
  - GPU-based accelerators
  - Tightly coupled FPGAs
    - Intel HARP
  - More than Moore architectures
    - Micron Hybrid Memory cube
  - ISI has all three architectures in house
- Update framework to incorporate new architectures
- Application benchmarks representative of NASA HPC



Nvidia Graphic Processor Architecture





- SpaceCube-X Simulation Infrastructure mature enough to support broader community
- Example: Radio Frequency Interference Detection and Mitigation Applications
  - JPL Group (Livesey, Kocz, Misra) approached SpaceCubeX team after ESTF2015
  - New approach discriminates between RFI and natural thermal emissions signals in L and C bands





128 channel kurtosis spectrogram

300 400 500 600

Frequency (MHz)



Interested in how your application performs? Come talk to us! Matthew French mfrench@isi.edu



Time (ms)

100

200



# **QUESTIONS?**





### **FPGA Simulation**

#### **RE**configurable Data-stream Hardware/Software ARChitecture:

- Provides rapid construction of System-on-Chip platforms
- Enables design space exploration of heterogeneous resources
  - Seamless HW/SW task migration supports spiral development
- Combines situational awareness with high performance processing
  - Dynamic resource management and performance tuning
- System infrastructure (network/interfaces) pre-designed and verified
  - Focus shifts to design and verification of hardware accelerator cores
- SpaceCubeX leverages Redsharc for rapid IP integration of accelerator peripherals into Hybrid FPGA Simulation Platform
  - JPL: Edge Detection & Image Convolution Kernels
  - ISI: Hyperspectral Image Compression, Thermal & Sulfur Classification Kernels



**Redsharc's Build Infrastructure** 



Diagram of Redsharc System Components on an FPGA



#### Redsharc's Hardware Kernel Interface (HWKI)

Redsharc Philosophy: Interoperate and leverage current and emerging devices, tools, and technology



### **DSP** Simulation

- Limited availability of DSP simulation environments
- Selected TI's Code Composer Studio (CCS)
  - Support for multiple DSPs
  - Instruction/cycle-accurate simulation
  - Scriptable access to running simulator and extracting results
- Eclipse-based development environment Space in based of the primit optimization of TI's C6747 DSP:
  - Low power ½W @ 300 MHz
  - Built-in floating point support 4 SP adds and 2 SP multiplies/cycle
    - Also supports 4 16 bit x 16 bit integer multiply/adds per cycle
- Evaluation framework supports selection of different TI DSPs for hybrid platform analysis





![](_page_25_Picture_0.jpeg)

- Added support to compile, run, and collect co-simulator results
  - 1. Compile accelerator source code, test bench, and test vectors using standard tools (e.g. Synopsys VCS, TI's CCS)
  - 2. Run accelerator and collect output results (data and delays)
  - 3. Plug results into peripheral memory files
  - 4. Compile application (independent of co-simulator results/data)
  - 5. Run applications in simulator accessing memory files
- Peripheral can be integrated into different platform evaluation systems

![](_page_25_Figure_9.jpeg)