# Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

#### **Ashutosh Pattnaik**

Xulong Tang, Adwait Jog, Onur Kayıran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Chita Das

#### **PACT '16**



Penn State, William & Mary, AMD, ETH Zürich





#### Era of Energy-Efficient Architectures



#### Future: 1 ExaFlop/s at 20 MW Peak Power

Both energy efficiency and performance need to improve greatly!

| Year | System            | Performance   | Power   | Energy efficiency |
|------|-------------------|---------------|---------|-------------------|
| 2010 | Tianhe-1A         | 4.7 PFlop/s   | 4 MW    | ~1.175 GFlops/W   |
| 2013 | Tianhe-2          | 54.9 PFlop/s  | 17.8 MW | ~3.084 GFlops/W   |
| 2016 | Sunway TaihuLight | 125.4 PFlop/s | 15.4 MW | ~8.143 GFlops/W   |
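
Dividing each system's performance by its power gives its energy efficiency in Flop/s per watt, which works out to single-digit GFlops/W for all three machines. The sketch below redoes that arithmetic from the figures on this slide and compares each system against the 1 ExaFlop/s at 20 MW goal. The helper name `gflops_per_watt` is ours, not something from the talk.

```python
def gflops_per_watt(pflops: float, megawatts: float) -> float:
    """Convert performance in PFlop/s and power in MW to GFlop/s per watt."""
    flops = pflops * 1e15   # PFlop/s -> Flop/s
    watts = megawatts * 1e6  # MW -> W
    return flops / watts / 1e9

# (performance in PFlop/s, power in MW), as listed on the slide
systems = {
    "Tianhe-1A (2010)": (4.7, 4.0),
    "Tianhe-2 (2013)": (54.9, 17.8),
    "Sunway TaihuLight (2016)": (125.4, 15.4),
}

# Goal from the slide: 1 ExaFlop/s (1000 PFlop/s) at 20 MW
target = gflops_per_watt(1000.0, 20.0)

for name, (pflops, megawatts) in systems.items():
    eff = gflops_per_watt(pflops, megawatts)
    gap = target / eff
    print(f"{name}: {eff:.3f} GFlops/W, {gap:.1f}x below the target")
```

Under these figures the target corresponds to 50 GFlops/W, so even the 2016 system is roughly 6x short of it.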



#### Bottleneck

- Continuous energy-efficiency and performance scaling is not easy.
- Energy consumed by a floating-point operation is scaling down with technology scaling.
- Energy consumption due to data transfer overhead is not scaling down!



#### Bottleneck



Data movement and system energy consumption caused by off-chip memory accesses.



#### Bottleneck



Main memory accesses lead to a 45% performance degradation!

Performance is normalized to a hypothetical GPU where all off-chip accesses hit in the last-level cache.



#### Outline

- Introduction and Motivation
- Background and Challenges
- Design of Kernel Offloading Mechanism
- Design of Concurrent Kernel Management
- Simulation Setup and Evaluation
- Conclusions



#### Revisiting Processing-In-Memory (PIM)

- PIM is a promising approach to minimize data movement.
- The concept dates back to the late 1960s.
- Technological limitations of integrating fast computational units in memory were a challenge.
- Significant advances in the adoption of 3D-stacked memory have
  - enabled tight integration of memory dies and a logic layer
  - brought computational units into the memory stack



- We integrate PIM units into a GPU-based system and call the result a "PIM-Assisted GPU architecture".
- At least one 3D-stacked memory is integrated with PIM units and is placed adjacent to a traditional GPU design.



• Traditional GPU architecture\*



\* Only a single DRAM partition is shown for illustration purposes



• GPU architecture with 3D-stacked memory on a silicon interposer





- Now we add a logic layer to the 3D-stacked memory and call this logic layer GPU-PIM.
- The traditional GPU logic is now called GPU-PIC.





- An application can now run on either GPU-PIC or GPU-PIM.
- Challenge: Where should the application execute?





#### **Application Offloading**

• We evaluate application execution on either GPU-PIC or GPU-PIM



Optimal application offloading scheme provides 16% and 28% improvements in performance and energy efficiency, respectively.





• Limitation 1: Lack of Fine-Grained Offloading





Running K1 on GPU-PIM, and K2 and K3 on GPU-PIC provides the optimal kernel placement for improved performance.



- Limitation 1: Lack of Fine-Grained Offloading
- Limitation 2: Lack of Concurrent Utilization of GPU-PIM and GPU-PIC



• From the application, we find that kernels K1 and K2 are independent of each other.




Scheduling kernels based on their affinity is very important to achieve higher performance.





#### **Our Goal**

To develop runtime mechanisms for

- automatically identifying architecture affinity of each kernel in an application
- scheduling kernels on GPU-PIC and GPU-PIM to maximize performance and utilization



#### Design of Kernel Offloading Mechanism



- Goal: Offload kernels to either GPU-PIC or GPU-PIM to maximize performance
- Challenge: Need to know the architecture affinity of the kernels
- We build an architecture affinity prediction model



• Metrics used to predict compute-engine affinity and the execution time on GPU-PIC and GPU-PIM.

| Category                                | Predictive Metric                   | Static/Dynamic |
|-----------------------------------------|-------------------------------------|----------------|
| I: Memory Intensity of Kernel           | Memory to Compute Ratio             | Static         |
| I: Memory Intensity of Kernel           | Number of Compute Inst.             | Static         |
| I: Memory Intensity of Kernel           | Number of Memory Inst.              | Static         |
| II: Available Parallelism in the Kernel | Number of CTAs                      | Dynamic        |
| II: Available Parallelism in the Kernel | Total Number of Threads             | Dynamic        |
| II: Available Parallelism in the Kernel | Number of Thread Inst.              | Dynamic        |
| III: Shared Memory Intensity of Kernel  | Total Number of Shared Memory Inst. | Static         |
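
To make the table concrete, here is one way the seven metrics could be collected into a feature vector for a kernel. This is only a sketch: the class name `KernelMetrics`, its field names, the `threads_per_cta` launch parameter, and the ordering of the features are all our assumptions, and the talk does not define how the memory-to-compute ratio or the thread instruction count are measured.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class KernelMetrics:
    # Static metrics (known from the kernel code)
    compute_insts: int
    memory_insts: int
    shared_mem_insts: int
    # Dynamic metrics (known at kernel launch)
    num_ctas: int
    threads_per_cta: int  # hypothetical launch parameter
    thread_insts: int     # supplied by the caller; its definition is not given here

    @property
    def memory_to_compute_ratio(self) -> float:
        # Assumption: ratio of memory to compute instruction counts
        return self.memory_insts / max(self.compute_insts, 1)

    @property
    def total_threads(self) -> int:
        return self.num_ctas * self.threads_per_cta

    def feature_vector(self) -> List[float]:
        # Assumed ordering: follows the rows of the table above
        return [
            self.memory_to_compute_ratio,
            float(self.compute_insts),
            float(self.memory_insts),
            float(self.num_ctas),
            float(self.total_threads),
            float(self.thread_insts),
            float(self.shared_mem_insts),
        ]

example = KernelMetrics(compute_insts=200, memory_insts=50, shared_mem_insts=10,
                        num_ctas=64, threads_per_cta=256, thread_insts=4_096_000)
print(example.feature_vector())
```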



• Logistic Regression Model for Affinity Prediction

$$\sigma(t) = \frac{e^t}{e^t + 1}$$

where:

- $\sigma(t)$ = model output ($\sigma(t) < 0.5 \Rightarrow$ GPU-PIC, $\sigma(t) \ge 0.5 \Rightarrow$ GPU-PIM)
- $t = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \alpha_4 x_4 + \alpha_5 x_5 + \alpha_6 x_6 + \alpha_7 x_7$
- $\alpha_i$ = Coefficients of the Regression Model
- $x_i$ = Predictive Metrics
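
The decision rule above can be sketched in a few lines of Python. The coefficient values and the two example metric vectors below are made up for illustration and assume the metrics have already been scaled to comparable ranges; they are not the trained model from the paper.

```python
import math
from typing import Sequence

def sigmoid(t: float) -> float:
    """Numerically stable evaluation of e^t / (e^t + 1)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (e + 1.0)

def predict_affinity(alpha: Sequence[float], x: Sequence[float]) -> str:
    """alpha holds alpha_0..alpha_7; x holds the seven predictive metrics."""
    if len(alpha) != len(x) + 1:
        raise ValueError("need one coefficient per metric plus an intercept")
    t = alpha[0] + sum(a * xi for a, xi in zip(alpha[1:], x))
    return "GPU-PIM" if sigmoid(t) >= 0.5 else "GPU-PIC"

# Hypothetical coefficients (alpha_0 first) and pre-scaled metrics
ALPHA = [-1.0, 4.0, -0.5, 0.5, 0.0, 0.0, 0.0, -2.0]
memory_bound = [0.8, 0.2, 0.6, 0.5, 0.5, 0.5, 0.1]
compute_bound = [0.1, 0.9, 0.1, 0.5, 0.5, 0.5, 0.1]

print(predict_affinity(ALPHA, memory_bound))   # t = 2.2, so GPU-PIM
print(predict_affinity(ALPHA, compute_bound))  # t = -1.2, so GPU-PIC
```

Since $\sigma(t) \ge 0.5$ exactly when $t \ge 0$, the classification only depends on the sign of $t$; the sigmoid matters when the output is read as a confidence.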



- Training Set: we randomly sample 60% (15) of the 25 GPGPU applications considered in the paper.
- These 15 applications consist of 82 unique kernels that are used for training the affinity prediction model.
- Test Set: the remaining 40% (10) of the applications are used as the test set for the model.
- Accuracy of the model on the test set: 83%
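
The split is done per application rather than per kernel, so kernels from one application never appear in both sets. A minimal sketch of such a split and of the accuracy computation follows; the application names are placeholders, the seed is arbitrary, and the function names are ours.

```python
import random
from typing import List, Sequence, Tuple

def split_applications(apps: Sequence[str], train_fraction: float = 0.6,
                       seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly split applications into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(apps)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of kernels whose predicted affinity matches the actual one."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

apps = [f"app{i:02d}" for i in range(25)]  # placeholder application names
train, test = split_applications(apps)
print(len(train), len(test))
```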



#### Design of Concurrent Kernel Management



- Goal: Efficiently manage the scheduling of concurrent kernels to improve performance and utilization of the PIM-Assisted GPU architecture

- For efficiently managing kernel execution on both GPU-PIM and GPU-PIC, we need
  - Kernel-level Dependence Information
    - Obtained through exhaustive analysis to find RAW dependence for all considered applications and input pairs
  - Architecture Affinity Information
    - Utilizes the affinity prediction model built for kernel offloading mechanism
  - Execution Time Information
    - We build linear regression models for execution time prediction on GPU-PIC and GPU-PIM
    - We use the same predictive metrics and training set as for the affinity prediction model
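
For the kernel-level dependence information, the talk only states that RAW dependences were found through exhaustive analysis. As an illustration of what a RAW (read-after-write) dependence between kernels means, and not as the authors' analysis, the sketch below assumes each kernel is described by hypothetical sets of buffer names it reads and writes.

```python
from typing import Dict, List, Set, Tuple

# (kernel name, buffers read, buffers written); all names are hypothetical
Kernel = Tuple[str, Set[str], Set[str]]

def raw_dependences(kernels: List[Kernel]) -> Dict[str, Set[str]]:
    """Map each kernel to the earlier kernels whose output it reads."""
    deps: Dict[str, Set[str]] = {name: set() for name, _, _ in kernels}
    for j, (consumer, reads, _) in enumerate(kernels):
        for producer, _, writes in kernels[:j]:
            if writes & reads:  # consumer reads something producer wrote
                deps[consumer].add(producer)
    return deps

launch_order = [
    ("K1", {"a"}, {"b"}),
    ("K2", {"c"}, {"d"}),
    ("K3", {"b", "d"}, {"e"}),
]
print(raw_dependences(launch_order))
```

In this made-up example K1 and K2 have no dependence on each other, so they could run concurrently on different compute engines, while K3 must wait for both.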



Linear Regression Model for Execution Time Prediction

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7$$

where:

- $y$ = model output (predicted execution time)
- $\beta_i$ = Coefficients of the Regression Model
- $x_i$ = Predictive Metrics
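
Because a separate model is built per compute engine, one metric vector yields two time estimates. The sketch below evaluates the linear model for both engines; the coefficients in `BETAS` and the example metrics are invented for illustration and assume pre-scaled inputs.

```python
from typing import Dict, Sequence

def predict_time(beta: Sequence[float], x: Sequence[float]) -> float:
    """Evaluate y = beta_0 + sum(beta_i * x_i) for one compute engine."""
    if len(beta) != len(x) + 1:
        raise ValueError("need one coefficient per metric plus an intercept")
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

# Hypothetical per-engine coefficients (beta_0 first), not values from the paper
BETAS: Dict[str, Sequence[float]] = {
    "GPU-PIC": [1.0, 6.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.2],
    "GPU-PIM": [2.0, 1.0, 2.0, 0.5, 0.0, 0.0, 0.0, 1.0],
}

def predict_times(x: Sequence[float]) -> Dict[str, float]:
    """Return the predicted execution time on each compute engine."""
    return {engine: predict_time(beta, x) for engine, beta in BETAS.items()}

memory_bound = [0.8, 0.2, 0.6, 0.5, 0.5, 0.5, 0.1]
times = predict_times(memory_bound)
print(times)
```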



• Let's run through an example





We can potentially pick any kernel from the GPU-PIC queue and schedule it onto GPU-PIM, assuming the candidate kernels have no data dependence among themselves or with K4.



• But which one to pick?



- We steal the first kernel that satisfies a given condition and schedule it onto the GPU-PIM queue.
- time(kernel, compute\_engine) returns the estimated execution time of "kernel" when executed on "compute\_engine".
- Pseudocode:
  - for each kernel X in GPU-PIC's queue:
    - if $time(X, \text{GPU-PIM}) \le time(K4, \text{GPU-PIC}) - time_{executed}(K4) + time(X, \text{GPU-PIC})$, then steal X, schedule it onto GPU-PIM, and break
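
The stealing condition compares running X on GPU-PIM right away against waiting for the rest of K4 and then running X on GPU-PIC. A runnable sketch is below. It assumes K4 is the kernel currently executing on GPU-PIC (as the $time_{executed}$ term suggests), that every queued kernel is already known to be free of dependences, and that the time estimates come from a lookup table with made-up values. The names `steal_for_pim`, `EST`, and the kernels K5 and K6 are hypothetical.

```python
from typing import Callable, List, Optional

def steal_for_pim(pic_queue: List[str], running: str, executed: float,
                  time: Callable[[str, str], float]) -> Optional[str]:
    """Remove and return the first kernel worth moving to GPU-PIM, if any."""
    # Time left before GPU-PIC finishes the kernel it is currently running
    remaining = time(running, "GPU-PIC") - executed
    for x in pic_queue:
        # Steal x if it finishes on GPU-PIM no later than it would on GPU-PIC
        if time(x, "GPU-PIM") <= remaining + time(x, "GPU-PIC"):
            pic_queue.remove(x)
            return x
    return None

# Hypothetical execution time estimates in arbitrary units
EST = {
    ("K4", "GPU-PIC"): 10.0,
    ("K5", "GPU-PIC"): 2.0, ("K5", "GPU-PIM"): 9.0,
    ("K6", "GPU-PIC"): 3.0, ("K6", "GPU-PIM"): 5.0,
}

queue = ["K5", "K6"]
stolen = steal_for_pim(queue, running="K4", executed=6.0,
                       time=lambda k, e: EST[(k, e)])
print(stolen, queue)
```

With these numbers K4 has 4 units left. K5 would take 9 units on GPU-PIM versus 4 + 2 = 6 by waiting, so it stays; K6 takes 5 versus 4 + 3 = 7, so it is stolen.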




#### **Simulation Setup**

- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline configuration
  - 40 SMs, 32 SIMT lanes, 32 threads/warp
  - 768 kB L2 cache
- GPU-PIM configuration
  - 8 SMs, 32 SIMT lanes, 32 threads/warp
  - No L2 cache
- GPU-PIC configuration
  - 32 SMs, 32 SIMT lanes, 32 threads/warp
  - 768 kB L2 cache
- 25 GPGPU Applications classified into 2 exclusive sets
  - Training Set: The kernels are used as input to build the regression models
  - Test Set: The regression models are only tested on these kernels



#### Performance (Normalized to Baseline)



- Performance improvement for Test Set applications
  - Kernel Offloading = 25%
  - Concurrent Kernel Management = 42%



#### Energy-Efficiency (Normalized to Baseline)



- Energy-Efficiency improvement for Test Set applications
  - Kernel Offloading = 28%
  - Concurrent Kernel Management = 27%

More results and detailed descriptions of our runtime mechanisms are in the paper.



#### Conclusions

- Processing-In-Memory is a key direction for achieving high performance within a lower power budget.
- Simply offloading applications completely onto PIM units is not optimal.
- For effective utilization of the PIM-Assisted GPU architecture, we need to
  - Identify code segments for offloading onto GPU-PIM
  - Efficiently distribute work between GPU-PIC and GPU-PIM
- Our kernel-level scheduling mechanisms can be an effective runtime solution for exploiting processing-in-memory in modern GPU-based architectures.






