

# Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation

Gagandeep Singh, Alireza Khodamoradi, Kristof Denolf, Jack Lo, Juan Gómez-Luna, Joseph Melber, Andra Bisca, Henk Corporaal, and Onur Mutlu

> 37<sup>th</sup> International Conference on Supercomputing (ICS) Orlando, Florida

ETH Zürich, Eindhoven University of Technology (TU/e), AMD

### **Talk Outline**

#### **Background and Motivation**

#### **SPARTA: Design and Implementation**

#### **Evaluation of SPARTA and Key Results**

#### Summary

SAFARI

# **Stencil Computations and Applications**

- Stencil computations update values in a grid using a fixed pattern of grid points
- Stencils are used in ~30% of high-performance computing applications
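As a concrete illustration of a fixed-pattern update, here is a minimal Python sketch of a 3-point, 1D Jacobi-style stencil. The function name, the equal averaging weights, and the fixed boundary points are assumptions made for illustration; they are not taken from the talk.

```python
def jacobi_1d(values, steps=1):
    """Apply a 3-point averaging stencil `steps` times; boundary points stay fixed."""
    current = list(values)
    for _ in range(steps):
        nxt = list(current)  # copy so the two boundary points carry over unchanged
        for i in range(1, len(current) - 1):
            # Fixed access pattern: left neighbor, center, right neighbor.
            nxt[i] = (current[i - 1] + current[i] + current[i + 1]) / 3.0
        current = nxt
    return current


print(jacobi_1d([0.0, 0.0, 3.0, 0.0, 0.0]))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

Every output point reads the same small neighborhood of inputs, which is why such kernels tend to perform few arithmetic operations per byte moved.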





Example application domains: fluid dynamics, image processing, and climate/weather simulations.

Image sources: http://www.flometrics.com/fluid-dynamics/computational-fluid-dynamics; Naoe, Kensuke et al., "Secure Key Generation for Static Visual Watermarking by Machine Learning in Intelligent Systems and Services," IJSSOE, 2010

# **Stencil Computations in Weather Applications**

COSMO (Consortium for Small-Scale Modeling) weather prediction application [Thaler+, PASC'19]

- The essential part of the weather prediction models is called the **dynamical core (dycore)**
- Around **80 different** stencil compute motifs
- ~30 variables and ~70 temporary arrays
- Complex stencil computation





### **Fundamental Complex Stencil: Horizontal Diffusion**

- Compound stencil kernel consists of a collection of elementary stencil kernels
- Iterates over a 3D grid performing Laplacian and flux operations
- Complex memory access behavior with low arithmetic intensity
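The Laplacian and flux steps can be sketched for a single horizontal plane as follows. This follows the commonly published form of the COSMO horizontal diffusion kernel as best understood here. The scalar `coeff`, the untouched two-point halo, and all function names are simplifying assumptions, not the SPARTA implementation.

```python
def laplacian(f, i, j):
    """5-point Laplacian (sign convention: 4 * center minus the four neighbors)."""
    return 4.0 * f[i][j] - (f[i - 1][j] + f[i + 1][j] + f[i][j - 1] + f[i][j + 1])


def limited(flux, gradient):
    """Flux limiter: drop the flux when it has the same sign as the field gradient."""
    return 0.0 if flux * gradient > 0.0 else flux


def hdiff_plane(inp, coeff):
    """One horizontal-diffusion update on a 2D plane; a 2-point halo is left unchanged."""
    rows, cols = len(inp), len(inp[0])
    out = [row[:] for row in inp]
    for i in range(2, rows - 2):
        for j in range(2, cols - 2):
            lap_c = laplacian(inp, i, j)
            # Fluxes across the four faces of cell (i, j), each built from two Laplacians.
            flx = limited(laplacian(inp, i + 1, j) - lap_c, inp[i + 1][j] - inp[i][j])
            flx_m = limited(lap_c - laplacian(inp, i - 1, j), inp[i][j] - inp[i - 1][j])
            fly = limited(laplacian(inp, i, j + 1) - lap_c, inp[i][j + 1] - inp[i][j])
            fly_m = limited(lap_c - laplacian(inp, i, j - 1), inp[i][j] - inp[i][j - 1])
            out[i][j] = inp[i][j] - coeff * (flx - flx_m + fly - fly_m)
    return out


# A constant field has zero Laplacian everywhere, so it is left unchanged.
flat = [[1.0] * 6 for _ in range(6)]
print(hdiff_plane(flat, 0.5) == flat)  # True
```

Each output point touches inputs up to two cells away in both directions, which gives a sense of the irregular memory access the slide refers to.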





### **Roofline Analysis**



Horizontal diffusion reaches at most 13.5% of peak floating-point performance on current computing systems
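For readers unfamiliar with the roofline model, the bound it places on a kernel can be sketched as below. The machine parameters and arithmetic intensities are hypothetical values chosen only to show the two regimes; none of them are measurements from the talk.

```python
def attainable_gops(peak_gops, bandwidth_gbs, ops_per_byte):
    """Roofline bound: capped by the compute peak or by memory traffic, whichever is lower."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


def ridge_point(peak_gops, bandwidth_gbs):
    """Arithmetic intensity (ops/byte) at which a kernel stops being memory-bound."""
    return peak_gops / bandwidth_gbs


# Hypothetical machine: 1000 GOp/s peak compute, 100 GB/s memory bandwidth.
PEAK_GOPS = 1000.0
BANDWIDTH_GBS = 100.0

for intensity in (0.5, 5.0, 20.0):
    bound = attainable_gops(PEAK_GOPS, BANDWIDTH_GBS, intensity)
    regime = "memory-bound" if intensity < ridge_point(PEAK_GOPS, BANDWIDTH_GBS) else "compute-bound"
    print(f"{intensity:5.1f} ops/byte -> {bound:7.1f} GOp/s ({regime})")
```

A kernel with low arithmetic intensity sits on the sloped part of the roofline, so its attainable performance is a small fraction of the compute peak regardless of how well the arithmetic is optimized.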

# **Spatial Architecture: AI Engine**

- High compute density
- Allows for tailoring of the dataflow to optimize data movement




- The 2D layout of spatial architectures maps well to processing multidimensional stencil grids



# Our Goal

Mitigate the performance bottleneck of compound weather prediction kernels by taking advantage of the characteristics of spatial computing systems

### **Our Proposal**



SPARTA: a novel spatial accelerator for efficient and scalable horizontal diffusion weather stencil computation


# **SPARTA Design: Single AIE Core Mapping**




Imbalanced computation and memory demands: flux stencils have a lower compute-to-memory ratio than Laplacian stencils



# **SPARTA Design: Multi-AIE Core Mapping**

Divide computation over multiple AIE Cores: dual-AIE and tri-AIE design



Compute-bound work is distributed among multiple cores

Parallel execution of multiple AIE cores per shimDMA
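The idea of dividing a compound stencil across cores can be sketched as a two-stage pipeline. This is a 1D toy: assigning the Laplacian to one stage and the flux and output update to another is one plausible split assumed here for illustration, and the function names and the scalar `coeff` are hypothetical. The slides do not spell out the exact partitioning.

```python
def laplacian_stage(f):
    """Stage 1 (hypothetical 'Laplacian core'): 1D Laplacian at every interior point."""
    return {i: 2.0 * f[i] - f[i - 1] - f[i + 1] for i in range(1, len(f) - 1)}


def flux_stage(f, lap, coeff):
    """Stage 2 (hypothetical 'flux core'): limited fluxes and the final update."""
    def flux(i):
        raw = lap[i + 1] - lap[i]
        # Drop the flux when it has the same sign as the field gradient.
        return 0.0 if raw * (f[i + 1] - f[i]) > 0.0 else raw

    out = list(f)  # the 2-point halo on each side is left unchanged
    for i in range(2, len(f) - 2):
        out[i] = f[i] - coeff * (flux(i) - flux(i - 1))
    return out


field = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
lap = laplacian_stage(field)          # produced by the first stage
print(flux_stage(field, lap, 0.5))    # consumed by the second stage
```

In a spatial design the intermediate `lap` values would stream from one core to the next instead of being stored in a dictionary; the point of the sketch is only that each stage has its own, smaller compute and memory footprint.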


# **SPARTA Design: Scaling Challenges**

1. Balancing computation and memory resources
2. Limited external memory channels
3. **Gathering and ordering of calculated results** before sending them back to the external memory
4. **Placing input and output cores** close to the external memory interface to optimize data transfer





- **Balance compute and memory resources**
- Maximize the usage of shimDMA channels
- Efficient gathering and reordering of output
- **Exploit data reuse by broadcasting** input data to multiple cores
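The gather-and-reorder requirement can be illustrated with a toy 1D partitioning in Python. The 3-point averaging kernel, the chunking scheme, and the "core" numbering are all assumptions for illustration; this is not how SPARTA lays out its B-blocks.

```python
def smooth(values):
    """3-point average; returns results for interior points only."""
    return [(values[i - 1] + values[i] + values[i + 1]) / 3.0
            for i in range(1, len(values) - 1)]


def partitioned_smooth(values, num_cores):
    """Split interior points across hypothetical cores, then gather in grid order."""
    interior = len(values) - 2
    chunk = -(-interior // num_cores)  # ceiling division
    results = {}
    # Cores may finish in any order; iterating in reverse emulates that.
    for core in reversed(range(num_cores)):
        start = 1 + core * chunk
        stop = min(start + chunk, len(values) - 1)
        if start >= stop:
            continue  # more cores than work
        # Each core receives its chunk plus a one-point halo on each side,
        # so neighboring cores read some of the same input values.
        results[core] = smooth(values[start - 1:stop + 1])
    # Gather: concatenate per-core outputs by core index to restore grid order.
    gathered = []
    for core in sorted(results):
        gathered.extend(results[core])
    return gathered


data = [float(i * i) for i in range(10)]
print(partitioned_smooth(data, 3) == smooth(data))  # True
```

The overlapping halo reads are the kind of shared input that broadcasting can serve with a single external memory read, and the final concatenation stands in for the reordering step before results are written back.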




# **SPARTA Application Toolflow**

SPARTA uses MLIR (Multi-Level Intermediate Representation) to separate the optimization of AIE core computation from the management of dataflow



https://github.com/Xilinx/mlir-aie


# **Evaluation Methodology (1/2)**

#### **Real system evaluation**

#### **Versal AIE Configuration**

- Frequency: 1 GHz
- Cores: 400



# **Evaluation Methodology (2/2)**

- State-of-the-art baselines on all major computing platforms
  - CPU [Singh+, FPL'21]
  - GPU [Licht+, CGO'21]
  - FPGA [Singh+, FPL'21]

- Programming tools
  - MLIR-AIE
  - Vitis Chess Compiler v2022.2
- Elementary stencil benchmarks
  - jacobi-1d
  - jacobi-2d-3pt
  - Laplacian
  - jacobi-2d-9pt
  - seidel-2d

### SPARTA Performance: Single and Multi-AIE Design (1/2)





### **SPARTA Performance: Scaling Accelerator Design**




Each B-block has a dedicated shimDMA channel assigned

A single B-block provides 4.3x higher performance than a single tri-AIE design



**Performance scales linearly with the number of B-blocks** 

### **SPARTA Performance: Comparison to SOTA**

| Stencil | Work   | Year | Platform | Device               | Mem. Tech. | Peak Perf. (TFLOPS) | Peak B/W (GB/s) | Perf. (GOp/s) | Arch. Roof. (%) |
|---------|--------|------|----------|----------------------|------------|---------------------|-----------------|---------------|-----------------|
| hdiff   | [23]   | 2019 | FPGA     | XCVU3P [97]          | DDR4       | 0.97                | 25.6            | 129.9         | 13.4%           |
| hdiff   | [16]   | 2021 | CPU      | Xeon E5-2690V3 [98]  | DDR4       | 0.24                | 68.0            | 32.0          | 13.0%           |
| hdiff   | [24]   | 2021 | CPU      | POWER9 [31]          | DDR4       | 0.49                | 110.0           | 58.5          | 11.8%           |
| hdiff   | [16]   | 2021 | GPU      | V100 [32]            | HBM2       | 14.1                | 900.0           | 849.0         | 6.1%            |
| hdiff   | [16]   | 2021 | FPGA     | Stratix 10 [99]      | DDR4       | 9.2                 | 76.8            | 145.0         | 1.6%            |
| hdiff   | [24]   | 2021 | FPGA     | XCVU37P [97]         | HBM        | 3.6                 | 410.0           | 485.4         | 13.5%           |
| hdiff   | SPARTA | 2023 | AIE      | **XCVC1902** [83]    | DDR4       | 3.1                 | 25.6            | 995.7         | 32.2%           |

**SPARTA outperforms** state-of-the-art CPU, GPU, and FPGA implementations by 17.1x, 1.2x, and 2.1x, respectively
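The quoted speedups can be cross-checked against the "Perf. (GOp/s)" column of the table above. The sketch below assumes that "state-of-the-art" means the best-performing entry per platform; that reading is an assumption, as are the function names. With the rounded table values, the ratios come out at roughly 17.0x (CPU), 1.17x (GPU), and 2.05x (FPGA), close to the quoted 17.1x, 1.2x, and 2.1x; the small differences presumably come from rounding.

```python
# (platform, device, achieved GOp/s), copied from the comparison table above.
ROWS = [
    ("FPGA", "XCVU3P", 129.9),
    ("CPU", "Xeon E5-2690V3", 32.0),
    ("CPU", "POWER9", 58.5),
    ("GPU", "V100", 849.0),
    ("FPGA", "Stratix 10", 145.0),
    ("FPGA", "XCVU37P", 485.4),
]
SPARTA_GOPS = 995.7


def best_per_platform(rows):
    """Highest achieved GOp/s for each platform."""
    best = {}
    for platform, _device, gops in rows:
        best[platform] = max(best.get(platform, 0.0), gops)
    return best


def speedups(sparta_gops, rows):
    """Ratio of SPARTA's throughput to the best baseline on each platform."""
    return {platform: sparta_gops / gops
            for platform, gops in best_per_platform(rows).items()}


for platform, ratio in sorted(speedups(SPARTA_GOPS, ROWS).items()):
    print(f"{platform}: {ratio:.2f}x")
```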

State-of-the-art implementations achieve only 1.6%-13.5% of the peak theoretical performance of a platform

SPARTA achieves the highest peak roofline performance of 32.2%
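The "Arch. Roof. (%)" column appears to be close to achieved performance divided by peak performance. This is an inference from the table's numbers, not something the slides state, and the recomputed values differ from the published ones by a few tenths of a percentage point. A short Python sketch of that check, with hypothetical names:

```python
# (device, peak TFLOPS, achieved GOp/s, published Arch. Roof. %), from the table above.
ROWS = [
    ("XCVU3P", 0.97, 129.9, 13.4),
    ("Xeon E5-2690V3", 0.24, 32.0, 13.0),
    ("POWER9", 0.49, 58.5, 11.8),
    ("V100", 14.1, 849.0, 6.1),
    ("Stratix 10", 9.2, 145.0, 1.6),
    ("XCVU37P", 3.6, 485.4, 13.5),
    ("XCVC1902 (SPARTA)", 3.1, 995.7, 32.2),
]


def percent_of_peak(peak_tflops, achieved_gops):
    """Achieved throughput as a percentage of peak (1 TFLOPS = 1000 GOp/s)."""
    return 100.0 * achieved_gops / (peak_tflops * 1000.0)


for device, peak, achieved, published in ROWS:
    estimate = percent_of_peak(peak, achieved)
    print(f"{device:20s} estimated {estimate:5.1f}%  published {published:5.1f}%")
```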

At 42.2 GOp/s per watt, SPARTA is 2.4x more energy-efficient than the state-of-the-art FPGA design

# More in the Paper

- Results for elementary stencil benchmarks
- Analytical modeling for computation and memory requirements
- Implementation details for single and multi-AIE core mapping
- Managing data transfer using MLIR
- Discussion and key takeaways

Full paper: SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation

Gagandeep Singh<sup>a,b</sup>, Alireza Khodamoradi<sup>a</sup>, Kristof Denolf<sup>a</sup>, Jack Lo<sup>a</sup>, Juan Gómez-Luna<sup>b</sup>, Joseph Melber<sup>a</sup>, Andra Bisca<sup>a</sup>, Henk Corporaal<sup>c</sup>, Onur Mutlu<sup>b</sup>

<sup>a</sup>AMD Research, <sup>b</sup>ETH Zürich, <sup>c</sup>Eindhoven University of Technology

https://arxiv.org/pdf/2303.03509.pdf

### **SPARTA is Open Sourced**



https://github.com/CMU-SAFARI/SPARTA

https://github.com/Xilinx/mlir-aie



## Summary

- **Goal:** Mitigate the performance bottleneck of compound weather prediction kernels by taking advantage of the characteristics of spatial computing systems
- **Proposal:** SPARTA is a novel spatial accelerator for efficient and scalable horizontal diffusion weather stencil computation
- **Key result:** SPARTA outperforms state-of-the-art CPU, GPU, and FPGA horizontal diffusion implementations by 17.1x, 1.2x, and 2.1x, respectively


