#### Boyi: A Systematic Framework for Automatically Deciding the Right Execution Model of OpenCL Applications on FPGAs

Jiantong Jiang (Northeastern University, China), Zeke Wang (Zhejiang University, China), Xue Liu (Northeastern University, China), Juan Gómez-Luna (ETH Zürich, Switzerland), Nan Guan (Hong Kong Polytechnic University), Qingxu Deng (Northeastern University, China), Wei Zhang (Hong Kong University of Science and Technology), Onur Mutlu (ETH Zürich , Switzerland)

# Outline

- Background and Motivations
- Our Solution
- Experiment
- Conclusion

# What is OpenCL?

- OpenCL stands for *Open Computing Language*.
- OpenCL has been developed for heterogeneous computing environments with a host-accelerator execution model.

≻The CPU runs the control task.

≻The GPU/ FPGA runs the computing kernel.

## **OpenCL on FPGA**



- Hardware-centric → fine-gained parallelism
- Users need to program with HDL.
- Software-centric → FPGA as a parallel architecture.
- Users can program with OpenCL.

# **Conventional OpenCL: NDRange Kernel**



Explicit multi-threaded  $\rightarrow$  executing the same operation on multiple data concurrently

How conventional OpenCL performs on an FPGA?

# Issue of Conventional OpenCL

 Conventional OpenCL cannot always represent FPGA architecture in an efficient manner.



The optimal performance is enabled by two OpenCL features!

# **OpenCL Feature 1: SWI Kernel**

• The SWI model executes the kernel in only one CU that contains only one work-item.



Pipelined parallelism  $\rightarrow$  No conflict among work items.

# **OpenCL Feature 2: OpenCL Channel**

 OpenCL channel can be used to pass data between two OpenCL kernels (typically SWI).

Synchronizing the kernels

► Reducing the number of global memory accesses



# Challenges

- Two OpenCL features exponentially increase design space.
  - Enabled by two features, we have four OpenCL execution models:
  - NDRange SWI NDRange+Channel SWI+Channel
    For each execution model, we have at least six optimization methods:
    M MC PM UL SIMD CU
  - -• For each optimization method, we have different pragmas.

• The compilation time is extremely long!

# Effect of Four Execution Models

• Different execution models can significantly affect the performance.

NDRange+Channel

SWI+Channel

**NDR**ange



Can we explicitly determine the most suitable execution model (i.e., whether or not to use two OpenCL features) to optimize OpenCL programs on FPGAs?

We provide a systematic framework Boyi to automatically determine the most suitable execution model.

# Outline

- Background and Motivations
- Our Solution
  - OpenCL Pattern Recognition
  - Execution Model Prediction
- Experiment
- Conclusion

# Architecture of Boyi

- Boyi explicitly determines the most suitable execution model to optimize OpenCL programs on FPGAs.
  - OpenCL Pattern Recognition
  - Execution Model Prediction OpenCL Pattern Recognition

#### **Execution Model Prediction**



AO: Atomic Operation

MPS: Multi-Pass Scheme

KKC: Kernel-to-Kernel Communication

# Outline

- Background and Motivations
- Our Solution
  - OpenCL Pattern Recognition
  - Execution Model Prediction
- Experiment
- Conclusion

## OpenCL Pattern: Atomic Operation



• Issues on FPGAs:

Noticeable resource overheadLong latency and low bandwidthLow frequency→ AO is not a good fit on FPGAs.

#### OpenCL Pattern: Atomic Operation

• Potential on FPGAs:



#### OpenCL Pattern: Multi-Pass Scheme

in out local\_sum pre\_sum out 

#### Step 2: 4 work-groups

#### OpenCL Pattern: Multi-Pass Scheme

• Issues on FPGAs:

More memory traffic

 $\rightarrow$  MPS is not a good fit on FPGAs.

• Potential on FPGAs:



#### OpenCL Pattern: Kernel-to-Kernel Communication



• Issues on FPGAs:

The communication via global memory is expensive.

#### OpenCL Pattern: Kernel-to-Kernel Communication

• Potential on FPGAs

Reducing the number of memory accesses Inter-kernel parallelism (i.e., concurrent kernel execution)



**Global memory** 

# **OpenCL** Pattern Recognition

- We develop nine LLVM passes to recognize three OpenCL patterns.
  - AO recognition
  - KKC recognition
  - MPS recognition



The implementation details can be found in our paper!

# Outline

- Background and Motivations
- Our Solution
  - OpenCL Pattern Recognition
  - Execution Model Prediction
- Experiment
- Conclusion

# **Execution Model Prediction**

• Direct prediction

| AO | MPS | ККС | Direct prediction |
|----|-----|-----|-------------------|
| Ν  | Ν   | Ν   | NDR               |
| Y  | Ν   | N   | SWI               |
| N  | Y   | N   | SWI               |
| Y  | Y   | N   | SWI               |
| N  | Ν   | Y   | NDR+C             |
| Y  | Ν   | Y   | SWI+C             |
| N  | Y   | Y   | SWI+C             |
| Y  | Y   | Y   | SWI+C             |

Kernels with AO and MPS benefit from the SWI kernel. Kernels with KKC benefit from the OpenCL channel.

# **Execution Model Prediction**

Potential evolution of SWI

| AO | MPS | ККС | Direct prediction | Potential evolution  |  |  |
|----|-----|-----|-------------------|----------------------|--|--|
| Ν  | Ν   | Ν   | NDR               | NDR                  |  |  |
| Y  | Ν   | Ν   | SWI               | SWI+C                |  |  |
| Ν  | Y   | N   | SWI – –           | - <b>-&gt;</b> SWI+C |  |  |
| Y  | Y   | N   | SWI               | SWI+C                |  |  |
| Ν  | N   | Y   | NDR+C             | NDR+C                |  |  |
| Y  | Ν   | Y   | SWI+C             | SWI+C                |  |  |
| Ν  | Y   | Y   | SWI+C             | SWI+C                |  |  |
| Y  | Y   | Y   | SWI+C             | SWI+C                |  |  |

Conditions: Sufficient FPGA resource.

The SWI kernel is compute-bound.

# Outline

- Background and Motivations
- Our Solution
- Experiment
  - Experimental Setup
  - Effect of Execution Model
  - Prediction of Execution Model
- Conclusion

# **Experimental Setup**

- **Platform**: Terasic DE5a-Net board: Altera Arria 10 GX FPGA and 8GB 2-bank DDR3, with Altera OpenCL SDK version 16.1.
- Workloads:

| Benchmark | Source     | Description           | AO | MPS | ККС | Datasets                 |
|-----------|------------|-----------------------|----|-----|-----|--------------------------|
| BFS       | Chai       | Breadth-First Search  | Y  | Ν   | Ν   | NY, NE, UT               |
| RSCD      |            | RANSAC                | Y  | Ν   | Y   | 2000 iterations          |
| TQH       |            | Task Queue System     | Y  | Ν   | Ν   | Basket                   |
| HSTO      |            | Histogram             | Y  | Ν   | Ν   | 256bins                  |
| SC        |            | Stream Compaction     | Y  | Ν   | Ν   | 50%                      |
| PAD       |            | Padding               | Y  | Ν   | Ν   | 1000*999                 |
| CEDD      |            | Canny Edge Detection  | Ν  | Ν   | Y   | Peppa, Maradona, Paw     |
| KM        | Rodinia    | K-Means               | Ν  | Ν   | Ν   | 25600 points, 8 features |
| MM        | Intel demo | Matrix Multiplication | Ν  | Ν   | Ν   | A: 2k*1k, B: 1k*1k       |
| MS        |            | Mandelbrot Set        | Ν  | Ν   | Ν   | 640*800, 2000 iterations |
| PS        | CUDA demo  | Prefix Sum            | Ν  | Y   | Ν   | 262144 points            |

# **Comparison Methodology**

• Hypothesis 1: Different execution models lead to significant performance differences.

►Quantitative comparison among execution models

>Exploring optimization combinations

• Hypothesis 2: Boyi can predict the most suitable execution model for each OpenCL application.

# Hypothesis 1: Different execution model --> different performance

Quantitative comparison among execution models

| Арр  | Number of combinations |     |       |       | Maximum speedup |     |       |       |
|------|------------------------|-----|-------|-------|-----------------|-----|-------|-------|
|      | NDR                    | SWI | NDR+C | SWI+C | NDR             | SWI | NDR+C | SWI+C |
| BFS  | 17                     | 7   | 7     | 9     | 1.9             | 3.1 | 1.2   | 3.1   |
| RSCD | 25                     | 10  | 24    | 46    | 15.8            | 4.6 | 73.8  | 39.7  |
| ТQН  | 9                      | 15  |       | 23    | 1.1             | 1.3 |       | 21.0  |

It is critical to decide the most suitable execution model when optimizing OpenCL applications on FPGAs. models to achieve the best performance.

significant performance differences.

| KM | 33 | 11 | 10 | 10 | 14/./ | 52.0 | 130.4 | 51.4 |
|----|----|----|----|----|-------|------|-------|------|
| MM | 25 | 9  |    | 6  | 837.1 | 13.3 |       | 6.6  |
| MS | 7  | 6  |    | 7  | 34.7  | 0.02 |       | 3.2  |
| PS | 26 | 20 |    | 12 | 15.8  | 44.4 |       | 46.2 |

#### Hypothesis 1: Exploring optimization combinations for Each Execution Model

• We manually implement sufficient number of optimization combinations (subset) for KM, such that we reach the near-to-optimal optimization combination for each execution model.



# Hypothesis 2: Boyi Predicts the Right Execution Model

|     | Application | AO        | MPS       | ККС      | Actual    | Predicted  |       |
|-----|-------------|-----------|-----------|----------|-----------|------------|-------|
|     | BFS         | Y         | Ν         | Ν        | SWI       | SWI        |       |
|     | RSCD        | Y - →     | N N       | Y        | NDR+C     | SWI+C      | ► NDR |
|     | TQH         | Y         | Ν         | Ν        | SWI+C     | SWI+C ≫    |       |
|     | HSTI        | Y         | Ν         | Ν        | SWI+C     | SWI+C ≫    |       |
|     | SC          | Y         | Ν         | Ν        | SWI+C     | SWI+C ≫    |       |
|     | ΡΔΠ         | V         | N         | N        | S\\/I+C   | S\WI+C ※   |       |
| The | e actual a  | nd predie | cted exec | ution mo | dels roug | shly matcl | h.    |
|     | КМ          | Ν         | N         | N        | NDR       | NDR        |       |
|     | MM          | Ν         | Ν         | Ν        | NDR       | NDR        |       |
|     | MS          | Ν         | Ν         | Ν        | NDR       | NDR        |       |
|     | PS          | Ν         | Y         | N        | SWI       | SWI        |       |

SWI+C <sup>\*</sup> indicates the potential evolution of SWI

# End-to-end Performance Comparison

• Performance comparison to existing works

| Application | Ours (ms) | Existing work (ms) | Our/Existing Speedup |
|-------------|-----------|--------------------|----------------------|
| RSCD [1]    | 0.8       | 28.9               | 38.3                 |
| TQH [1]     | 66.9      | 150.6              | 2.3                  |
| HSTO [1]    | 38.8      | 487.9              | 12.6                 |
| CEDD [1]    | 161.9     | 237.8              | 1.5                  |
| MM [2]      | 9.1       | 34.3               | 3.8                  |
| MS [2]      | 27.2      | 944.1              | 34.7                 |

[1] S. Huang et al., "Analysis and modeling of collaborative execution strategies for heterogeneous cpu-fpga architectures", *ICPE*, 2019.

[2] Intel. Intel SDK for OpenCL Design Examples. 2018

# Outline

- Background and Motivations
- Our Solution
- Experiment
- Conclusion