DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

### Geraldo F. Oliveira

Juan Gómez-Luna Lois Orosa Saugata Ghose Nandita Vijaykumar Ivan Fernandez Mohammad Sadrosadati Onur Mutlu









## **Executive Summary**

- <u>Problem</u>: Data movement is a major bottleneck in modern systems. However, it is **unclear** how to identify:
  - different sources of data movement bottlenecks
  - the most suitable mitigation technique (e.g., caching, prefetching, near-data processing) for a given data movement bottleneck
- <u>Goals</u>:
  1. Design a methodology to **identify** sources of data movement bottlenecks
  2. **Compare** compute- and memory-centric data movement mitigation techniques
- <u>Key Approach</u>: Perform a large-scale application characterization to identify key metrics that reveal the sources of data movement bottlenecks
- <u>Key Contributions</u>:
  - Experimental characterization of 77K functions across 345 applications
  - A methodology to characterize applications based on data movement bottlenecks and their relation with different data movement mitigation techniques
  - DAMOV: a benchmark suite with 144 functions for data movement studies
  - Four case studies to highlight DAMOV's applicability to open research problems

#### DAMOV: <a href="https://github.com/CMU-SAFARI/DAMOV">https://github.com/CMU-SAFARI/DAMOV</a>

## Outline

1. Data Movement Bottlenecks
2. Methodology Overview
3. Application Profiling
4. Locality-Based Clustering
5. Memory Bottleneck Analysis
6. Case Studies


## Data Movement Bottlenecks (1/2)



### Data movement bottlenecks happen because of:

- Not enough data **locality**  $\rightarrow$  ineffective use of the cache hierarchy
- Not enough **memory bandwidth**
- High average memory access time

## **Data Movement Bottlenecks (2/2)**




## Near-Data Processing (1/2)

### **Compute-Centric Architecture**







## Near-Data Processing (2/2)

### **UPMEM (2019)**



Near-DRAM-banks processing for general-purpose computing

0.9 TOPS compute throughput<sup>1</sup>

### Samsung FIMDRAM (2021)

Near-DRAM-banks processing for neural networks

1.2 TFLOPS compute throughput<sup>2</sup>

### The goal of Near-Data Processing (NDP) is to mitigate data movement



[1] Devaux, "The True Processing In Memory Accelerator," HCS, 2019
 [2] Kwon+, "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications," ISSCC, 2021

## When to Employ Near-Data Processing?



[1] Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," ISCA, 2015

[2] Boroumand+, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks," ASPLOS, 2018

[3] Cali+, "GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis," MICRO, 2020

[4] Kim+, "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies," BMC Genomics, 2018

[5] Boroumand+, "Polynesia: Enabling Effective Hybrid Transactional/Analytical Databases with Specialized Hardware/Software Co-Design," arXiv:2103.00798 [cs.AR], 2021

[6] Fernandez+, "NATSA: A Near-Data Processing Accelerator for Time Series Analysis," ICCD, 2020

## **Identifying Memory Bottlenecks**

- Multiple approaches to identify applications that:
  - suffer from data movement bottlenecks
  - take advantage of NDP
- Existing approaches are not comprehensive enough



## Limitations of Prior Approaches (1/2)

Roofline model → identifies whether an application is bound by compute or by memory




### Roofline model **does not accurately account** for the **NDP suitability** of memory-bound applications



[Figure: roofline plot; x-axis: arithmetic intensity (OPS/byte)]
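For reference, the standard roofline bound discussed on this slide can be sketched as follows. This is a minimal illustration, not code from DAMOV: the function names and the machine numbers in the usage note are hypothetical.

```python
def attainable_performance(arithmetic_intensity, peak_compute, peak_bandwidth):
    """Roofline bound: min(peak compute, arithmetic intensity x peak bandwidth).

    arithmetic_intensity is in OPS/byte, peak_compute in GOPS,
    peak_bandwidth in GB/s; the result is in GOPS.
    """
    return min(peak_compute, arithmetic_intensity * peak_bandwidth)


def is_memory_bound(arithmetic_intensity, peak_compute, peak_bandwidth):
    """A function left of the ridge point is limited by memory bandwidth."""
    ridge_point = peak_compute / peak_bandwidth  # OPS/byte
    return arithmetic_intensity < ridge_point
```

With a hypothetical machine of 100 GOPS and 25 GB/s, the ridge point is 4 OPS/byte, so a function with 0.5 OPS/byte is memory-bound and capped at 12.5 GOPS. The model says nothing about why the function is memory-bound, which is the limitation the slide points out.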

## Limitations of Prior Approaches (2/2)

Prior heuristic: an application with a last-level cache (LLC) MPKI > 10
 → is considered memory intensive and is assumed to benefit from NDP




### LLC MPKI **does not accurately account** for the **NDP suitability** of memory-bound applications






## **The Problem**

- Multiple approaches to identify applications that:
  - suffer from data movement bottlenecks
  - take advantage of NDP

No available methodology can comprehensively:

- identify data movement bottlenecks
- correlate them with the most suitable data movement mitigation mechanism





## **Our Goal**

- **Our Goal:** develop a methodology to:
  - methodically identify sources of data movement bottlenecks
  - comprehensively compare compute- and memory-centric data movement mitigation techniques


## **Key Approach**

- New workload characterization methodology to analyze:
  - data movement bottlenecks
  - suitability of different data movement mitigation mechanisms
- Two main profiling strategies:
  - **Architecture-independent profiling:** characterizes the memory behavior independently of the underlying hardware
  - **Architecture-dependent profiling:** evaluates the impact of the system configuration on the memory behavior



## **Methodology Overview**






## **Step 1: Application Profiling**

### Goal: identify **application functions** that suffer from data movement bottlenecks






## **Step 2: Locality-Based Clustering**

### Goal: analyze application's memory characteristics












#### **Reuse Profile Histogram**






### Step 3: Memory Bottleneck Classification (1/2)

### **Arithmetic Intensity (AI)**

Floating-point/arithmetic operations per L1 cache line accessed
 → shows computational intensity per memory request

### **LLC Misses-per-Kilo-Instructions (MPKI)**

LLC misses per one thousand instructions
 → shows memory intensity

### **Last-to-First Miss Ratio (LFMR)**

LLC misses per L1 miss
 → shows whether an application benefits from the L2/L3 caches
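The three metrics above can be computed directly from raw event counts. The sketch below is illustrative only: the `Counters` type and its field names are hypothetical, and it assumes every denominator is nonzero.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw event counts for one function (hypothetical field names)."""
    arithmetic_ops: int     # floating-point/arithmetic operations executed
    l1_lines_accessed: int  # L1 cache lines accessed
    instructions: int       # instructions executed
    l1_misses: int          # L1 cache misses
    llc_misses: int         # last-level cache misses


def arithmetic_intensity(c: Counters) -> float:
    """AI: arithmetic operations per L1 cache line accessed."""
    return c.arithmetic_ops / c.l1_lines_accessed


def llc_mpki(c: Counters) -> float:
    """MPKI: LLC misses per one thousand instructions."""
    return 1000.0 * c.llc_misses / c.instructions


def lfmr(c: Counters) -> float:
    """LFMR: LLC misses per L1 miss."""
    return c.llc_misses / c.l1_misses
```

For example, `Counters(2000, 1000, 100000, 2000, 1500)` gives an AI of 2.0, an MPKI of 15.0, and an LFMR of 0.75. An LFMR close to 1 means most L1 misses also miss in the LLC, so the L2/L3 caches filter little traffic.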

### **Step 3: Memory Bottleneck Classification (2/2)**

- **Goal:** identify the specific sources of data movement bottlenecks
- Scalability analysis:
  - 1, 4, 16, 64, and 256 out-of-order/in-order host and NDP CPU cores
  - 3D-stacked memory as main memory

#### DAMOV-SIM: https://github.com/CMU-SAFARI/DAMOV


## **Step 1: Application Profiling**

- We analyze 345 applications from distinct domains:
  - Graph Processing
  - Deep Neural Networks
  - Physics
  - High-Performance Computing
  - Genomics
  - Machine Learning
  - Databases
  - Data Reorganization
  - Image Processing
  - Map-Reduce
  - Benchmarking
  - Linear Algebra



## **Memory Bound Functions**

- We analyze 345 applications from distinct domains
- Selection criteria: a function takes more than 3% of the clock cycles and has a Memory Bound value above 30%



- We find 144 functions from a total of 77K functions and select:
  - 44 functions → **apply steps 2 and 3**
  - 100 functions → **validation**
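The selection criteria amount to a two-threshold filter over per-function profiles. The sketch below assumes both values are available as fractions between 0 and 1; the `FunctionProfile` type, its field names, and the sample functions in the usage note are hypothetical.

```python
from typing import List, NamedTuple


class FunctionProfile(NamedTuple):
    """Per-function profile (hypothetical field names)."""
    name: str
    cycle_share: float   # fraction of the application's clock cycles
    memory_bound: float  # Memory Bound fraction reported by the profiler


def select_memory_bound(profiles: List[FunctionProfile],
                        min_cycle_share: float = 0.03,
                        min_memory_bound: float = 0.30) -> List[str]:
    """Keep functions that exceed both thresholds from the slide."""
    return [p.name for p in profiles
            if p.cycle_share > min_cycle_share
            and p.memory_bound > min_memory_bound]
```

Given profiles `("hot_loop", 0.40, 0.65)`, `("init", 0.01, 0.80)`, and `("compute", 0.25, 0.10)`, only `hot_loop` passes: `init` is memory-bound but too short, and `compute` is long-running but not memory-bound.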


## **Step 2: Locality-Based Clustering**





The closer a function is to the **bottom-left corner**, the less likely it is to **take advantage** of a deep cache hierarchy.

1. Low locality applications (in orange)
2. High locality applications (in blue)




#### **Memory Bottleneck Class**




### Class 1a: DRAM Bandwidth Bound (1/2)

- High MPKI → high memory pressure
- Host scales well until bandwidth saturates
- NDP performance keeps scaling with the attained bandwidth, without saturating

**DRAM bandwidth bound applications:** NDP does better because of the higher internal DRAM bandwidth

### Class 1a: DRAM Bandwidth Bound (2/2)

- High LFMR  $\rightarrow$  L2 and L3 caches are inefficient
- Host's energy consumption is dominated by cache look-ups and off-chip data transfers



- NDP provides a large system energy reduction since it does not access the L2 and L3 caches or the off-chip links



DRAM bandwidth bound applications: NDP does better because it eliminates off-chip I/O traffic




### **Methodology Validation**

- Goal: evaluate the accuracy of our workload characterization methodology on a large set of functions
- Two-phase validation:



**High accuracy:** our methodology accurately classifies 97% of functions into one of the six memory bottleneck classes



### More in the Paper

- Effect of the last-level cache size
  - Large L3 cache size (e.g., 512 MB) can mitigate some cache contention issues
- Summary of our workload characterization methodology
  - Including workload characterization using in-order host/NDP cores
- Limitations of our methodology
- Benchmark diversity


### DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

GERALDO F. OLIVEIRA<sup>1</sup>, JUAN GÓMEZ-LUNA<sup>1</sup>, LOIS OROSA<sup>1</sup>, SAUGATA GHOSE<sup>2</sup>, NANDITA VIJAYKUMAR<sup>3</sup>, IVAN FERNANDEZ<sup>1,4</sup>, MOHAMMAD SADROSADATI<sup>1</sup>, and ONUR MUTLU<sup>1</sup>

<sup>1</sup>ETH Zürich, Switzerland
<sup>2</sup>University of Illinois Urbana-Champaign, USA
<sup>3</sup>University of Toronto, Canada

<sup>4</sup>University of Malaga, Spain

Corresponding author: Geraldo F. Oliveira (e-mail: geraldod@inf.ethz.ch).



### **Case Studies**

- Many open questions related to NDP system designs<sup>8</sup>:
  - Interconnects
  - Data mapping and allocation
  - NDP core design (accelerators, general-purpose cores)
  - Offloading granularity
  - Programmability
  - Coherence
  - System integration
  - ...

### Goal: demonstrate how DAMOV is useful to study NDP system designs

[8] Mutlu+, "A Modern Primer on Processing in Memory," Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, 2021


### Case Studies (1/4)

#### Load Balance and Inter-Vault Communication on NDP

A portion of the memory requests that an NDP core issues goes to remote vaults → increases the memory access latency for the NDP core
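A first-order way to see this effect is to treat the average access latency as a weighted mix of local and remote vault latencies. The model below is an assumption made for illustration, not the simulator's model: it ignores queuing and interconnect contention, and the latency values in the usage note are hypothetical.

```python
def average_access_latency(remote_fraction, local_latency, remote_latency):
    """Weighted average of local-vault and remote-vault access latency.

    remote_fraction is the share of an NDP core's requests (0..1) that
    target a vault other than its own.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return ((1.0 - remote_fraction) * local_latency
            + remote_fraction * remote_latency)
```

With a hypothetical 40 ns local latency and 100 ns remote latency, sending half of the requests to remote vaults raises the average latency from 40 ns to 70 ns.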




### Case Studies (2/4)

#### NDP Accelerators and Our Methodology

The NDP accelerator is faster than the compute-centric accelerator for Class 1a and 1b applications, and slower for Class 2c

 → key observations hold for other NDP architectures



### Case Studies (3/4)

#### **Different Core Models on NDP Architectures**

Using in-order cores limits the performance of some applications → static instruction scheduling cannot exploit memory parallelism



### Case Studies (4/4)

#### **Fine-Grained NDP Offloading**

A few basic blocks are responsible for most of the LLC misses → offloading such basic blocks to NDP is enough to improve performance
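One simple way to pick such basic blocks is to rank them by LLC misses and select until a target share of all misses is covered. The slide does not say how blocks are chosen, so the sketch below is an assumption: the greedy rule, the 90% coverage target, and the block names in the usage note are hypothetical.

```python
def blocks_to_offload(llc_misses_per_block, coverage=0.9):
    """Greedily pick the basic blocks with the most LLC misses.

    Blocks are added in descending order of misses until the selected
    set accounts for at least `coverage` of all LLC misses.
    """
    total = sum(llc_misses_per_block.values())
    if total == 0:
        return []
    selected, covered = [], 0
    ranked = sorted(llc_misses_per_block.items(),
                    key=lambda item: item[1], reverse=True)
    for block, misses in ranked:
        if covered >= coverage * total:
            break
        selected.append(block)
        covered += misses
    return selected
```

For a miss profile of `{"bb0": 800, "bb1": 150, "bb2": 30, "bb3": 20}`, the first two blocks already cover 95% of the misses, so only `bb0` and `bb1` are selected.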

### **Case Studies**

#### **Load Balance and Inter-Vault Communication on NDP**

A portion of the memory requests that an NDP core issues goes to remote vaults → increases the memory access latency for the NDP core

#### **NDP Accelerators and Our Methodology**

The NDP accelerator is faster than the compute-centric accelerator for Class 1a and 1b applications, and slower for Class 2c

 → key observations hold for other NDP architectures

#### **Different Core Models on NDP Architectures**

Using in-order cores limits the performance of some applications → static instruction scheduling cannot exploit memory parallelism

#### **Fine-Grained NDP Offloading**

A few basic blocks are responsible for most of the LLC misses → offloading such basic blocks to NDP is enough to improve performance


### **DAMOV** is Open-Source

- We open-source our benchmark suite and our toolchain



#### DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

DAMOV is a benchmark suite and a methodical framework targeting the study of data movement bottlenecks in modern applications. It is intended to study new architectures, such as near-data processing.

The DAMOV benchmark suite is the first open-source benchmark suite for main memory data movement-related studies, based on our systematic characterization methodology. This suite consists of 144 functions representing different sources of data movement bottlenecks and can be used as a baseline benchmark set for future data-movement mitigation research. The applications in the DAMOV benchmark suite belong to popular benchmark suites, including BWA, Chai, Darknet, GASE, Hardware Effects, Hashjoin, HPCC, HPCG, Ligra, PARSEC, Parboil, PolyBench, Phoenix, Rodinia, SPLASH-2, STREAM.



#### **Get DAMOV at:**

### https://github.com/CMU-SAFARI/DAMOV




### Conclusion

- <u>Problem</u>: Data movement is a major bottleneck in modern systems. However, it is **unclear** how to identify:
  - different sources of data movement bottlenecks
  - the most suitable mitigation technique (e.g., caching, prefetching, near-data processing) for a given data movement bottleneck
- <u>Goals</u>:
  1. Design a methodology to **identify** sources of data movement bottlenecks
  2. **Compare** compute- and memory-centric data movement mitigation techniques
- <u>Key Approach</u>: Perform a large-scale application characterization to identify key metrics that reveal the sources of data movement bottlenecks
- <u>Key Contributions</u>:
  - Experimental characterization of 77K functions across 345 applications
  - A methodology to characterize applications based on data movement bottlenecks and their relation with different data movement mitigation techniques
  - DAMOV: a benchmark suite with 144 functions for data movement studies
  - Four case studies to highlight DAMOV's applicability to open research problems

#### DAMOV: <a href="https://github.com/CMU-SAFARI/DAMOV">https://github.com/CMU-SAFARI/DAMOV</a>










