# SAFARI EFCL Research Projects: Recent Results and Future Outlook

Onur Mutlu <u>omutlu@gmail.com</u> <u>https://people.inf.ethz.ch/omutlu</u> 23 May 2022 EFCL Mini-Conference





### Current Research Mission

Computer architecture, HW/SW, systems, bioinformatics, security



**Graphics and Vision Processing** 

### **Build fundamentally better architectures**

### SAFARI

## Four Key Current Directions

Fundamentally Secure/Reliable/Safe Architectures

Fundamentally Energy-Efficient Architectures
 Memory-centric (Data-centric) Architectures

Fundamentally Low-Latency and Predictable Architectures

Architectures for AI/ML, Genomics, Medicine, Health, ...

### Fundamentally Better Architectures

# **Data-centric**

# **Data-driven**

# **Data-aware**



## Onur Mutlu's SAFARI Research Group

### Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-january-2021/



### SAFARI Newsletter December 2021 Edition

### <u>https://safari.ethz.ch/safari-newsletter-december-2021/</u>



Think Big, Aim High






# Referenced Papers, Talks, Artifacts

All are available at

https://people.inf.ethz.ch/omutlu/projects.htm

https://www.youtube.com/onurmutlulectures

https://github.com/CMU-SAFARI/

# Open-Source Artifacts

https://github.com/CMU-SAFARI

# Open Source Tools: SAFARI GitHub



MQSim is a fast and accurate simulator modeling the performance of modern multi-queue (MQ) SSDs as well as traditional SATA based SSDs. MQSim faithfully models new high-bandwidth protocol implement...


Source code for testing the Row Hammer error mechanism in DRAM devices. Described in the ISCA 2014 paper by Kim et al. at http://users.ece.cmu.edu/~omutlu/pub/dram-row-hammer\_isca14.pdf.


### https://github.com/CMU-SAFARI/

### Onur Mutlu's SAFARI Research Group

# SAFARI Research Group safari.ethz.ch



# SAFARI Overview at EFCL Huawei Day

 Onur Mutlu, "SAFARI Research Group: Introduction & Research" Invited Talk at the ETH Future Computing Laboratory Huawei Day, Virtual, 19 October 2021. [Slides (pptx) (pdf)] [Talk Video (15 minutes)]

# SAFARI Overview at EFCL Huawei Day



SAFARI Research Group: Introduction & Research - ETH Future Computing Laboratory Event Talk - Onur Mutlu

### https://youtu.be/mSr1QQmYuX0

### Fundamentally Better Architectures

# **Data-centric**

# **Data-driven**

# **Data-aware**



### A Blueprint for Fundamentally Better Architectures

Onur Mutlu, "Intelligent Architectures for Intelligent Computing Systems" Invited Paper in Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Virtual, February 2021. [Slides (pptx) (pdf)] [IEDM Tutorial Slides (pptx) (pdf)] [Short DATE Talk Video (11 minutes)] [Longer IEDM Tutorial Video (1 hr 51 minutes)]

### Intelligent Architectures for Intelligent Computing Systems

Onur Mutlu ETH Zurich omutlu@gmail.com


# Current EFCL Projects

- "A New Methodology and Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks: A Processing-in-Memory Case Study"
  - Data-centric
- "Machine-Learning-Assisted Intelligent Microarchitectures to Reduce Memory Access Latency"
  - Data-driven
- "Cross-layer Hardware/Software Techniques to Enable Powerful Computation and Memory Optimizations"
  - Data-aware

A New Methodology and Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks: A Processing-in-Memory Case Study

> Juan Gómez Luna, Geraldo F. Oliveira, Mohammad Sadr, Lois Orosa, Onur Mutlu





# Goal: Processing Inside Memory



# Processing-in-Memory Landscape Today



This does not include many experimental chips and startups

# PIM Review and Open Problems

# A Modern Primer on Processing in Memory

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup>

SAFARI Research Group

<sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun, "A Modern Primer on Processing in Memory" Invited Book Chapter in <u>Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann</u>, Springer, to be published in 2021.

https://arxiv.org/pdf/1903.03988.pdf

# PIM Review and Open Problems (II)

### A Workload and Programming Ease Driven Perspective of Processing-in-Memory

Saugata Ghose†, Amirali Boroumand†, Jeremie S. Kim†§, Juan Gómez-Luna§, Onur Mutlu§†

†Carnegie Mellon University §ETH Zürich

Saugata Ghose, Amirali Boroumand, Jeremie S. Kim, Juan Gomez-Luna, and Onur Mutlu, "Processing-in-Memory: A Workload-Driven Perspective" *Invited Article in IBM Journal of Research & Development, Special Issue on Hardware for Artificial Intelligence*, to appear in November 2019. [Preliminary arXiv version]


https://arxiv.org/pdf/1907.12947.pdf

### A Modern Primer on Processing in Memory


### Abstract

Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today.

At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend.

This chapter discusses recent research that aims to practically enable computation close to data, an approach we call *processing-in-memory* (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) *processing using memory* by exploiting analog operational properties of DRAM chips to perform massively-parallel operations in memory, with low-cost changes, (2) *processing near memory* by exploiting 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM.
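As a rough illustration of the *processing using memory* idea, the sketch below models only the logical result of a bulk bitwise operation built from a three-input bitwise majority, which is one known way to obtain AND and OR over whole rows at once. The row width, function names, and use of Python integers as "rows" are assumptions made for illustration; nothing here models DRAM timing, circuits, or command sequences.

```python
# Functional sketch only: models the logical outcome of a bulk bitwise
# operation, not DRAM timing, sense amplifiers, or real memory commands.

ROW_BITS = 64  # hypothetical row width; real DRAM rows are far wider
MASK = (1 << ROW_BITS) - 1


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority: an output bit is 1 if at least two input bits are 1."""
    return ((a & b) | (b & c) | (a & c)) & MASK


def bulk_and(a: int, b: int) -> int:
    """AND of two rows, obtained as majority with an all-zeros control row."""
    return majority(a, b, 0)


def bulk_or(a: int, b: int) -> int:
    """OR of two rows, obtained as majority with an all-ones control row."""
    return majority(a, b, MASK)


row_a = 0b1100
row_b = 0b1010
print(bin(bulk_and(row_a, row_b)))  # 0b1000
print(bin(bulk_or(row_a, row_b)))   # 0b1110
```

The point of the sketch is that one primitive applied to every bit position of a row yields a massively parallel operation without moving the operands to the processor.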

*Keywords:* memory systems, data movement, main memory, processing-in-memory, near-data processing, computation-in-memory, processing using memory, processing near memory, 3D-stacked memory, non-volatile memory, energy efficiency, high-performance computing, computer architecture, computing paradigm, emerging technologies, memory scaling, technology scaling, dependable systems, robust systems, hardware security, system security, latency, low-latency computing

### Contents

1. Introduction
2. Major Trends Affecting Main Memory
3. The Need for Intelligent Memory Controllers to Enhance Memory Scaling
4. Perils of Processor-Centric Design
5. Processing-in-Memory (PIM): Technology Enablers and Two Approaches
   - 5.1 New Technology Enablers: 3D-Stacked Memory and Non-Volatile Memory
   - 5.2 Two Approaches: Processing Using Memory (PUM) vs. Processing Near Memory (PNM)
6. Processing Using Memory (PUM)
   - 6.1 RowClone
   - 6.2 Ambit
   - 6.3 Gather-Scatter DRAM
   - 6.4 In-DRAM Security Primitives
7. Processing Near Memory (PNM)
   - 7.1 Tesseract: Coarse-Grained Application-Level PNM Acceleration of Graph Processing
   - 7.2 Function-Level PNM Acceleration of Mobile Consumer Workloads
   - 7.3 Programmer-Transparent Function-Level PNM Acceleration of GPU Applications
   - 7.4 Instruction-Level PNM Acceleration with PIM-Enabled Instructions (PEI)
   - 7.5 Function-Level PNM Acceleration of Genome Analysis Workloads
   - 7.6 Application-Level PNM Acceleration of Time Series Analysis
8. Enabling the Adoption of PIM
   - 8.1 Programming Models and Code Generation for PIM
   - 8.2 PIM Runtime: Scheduling and Data Mapping
   - 8.3 Memory Coherence
   - 8.4 Virtual Memory Support
   - 8.5 Data Structures for PIM
   - 8.6 Benchmarks and Simulation Infrastructures
   - 8.7 Real PIM Hardware Systems and Prototypes
   - 8.8 Security Considerations
9. Conclusion and Future Outlook

### 1. Introduction

Main memory, built using the Dynamic Random Access Memory (DRAM) technology, is a major component in nearly all computing systems, including servers, cloud platforms, mobile/embedded devices, and sensor systems. Across all of these systems, the data working set sizes of modern applications are rapidly growing, while the need for fast analysis of such data is increasing. Thus, main memory is becoming an increasingly significant bottleneck across a wide variety of computing systems and applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Alleviating the main memory bottleneck requires the memory capacity, energy, cost, and performance to all scale in an efficient manner across technology generations. Unfortunately, it has become increasingly difficult in recent years, especially the past decade, to scale all of these dimensions [1, 2, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], and thus the main memory bottleneck has been worsening.

A major reason for the main memory bottleneck is the high energy and latency cost associated with data movement. In modern computers, to perform any operation on data that resides in main memory, the processor must retrieve the data from main memory. This requires the memory controller to issue commands to a DRAM module across a relatively slow and power-hungry off-chip bus (known as the memory channel). The DRAM module sends the requested data across the memory channel, after which the data is placed in the caches and registers. The CPU can perform computation on the data once the data is in its registers. Data movement from the DRAM to the CPU incurs long latency and consumes a significant amount of energy [7, 50, 51, 52, 53, 54]. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU [52, 53, 55, 56], providing little benefit in return for the high latency and energy cost.

The cost of data movement is a fundamental issue with the *processor-centric* nature of contemporary computer systems. The CPU is considered to be the master in the system, and computation is performed only in the processor (and accelerators). In contrast, data storage and communication units, including the main memory, are treated as unintelligent workers that are incapable of computation. As a result of this processor-centric design paradigm, data moves a lot in the system between the computation units and communication/storage units so that computation can be done on it. With the increasingly *data-centric* nature of contemporary and emerging appli-


## Eliminating the Adoption Barriers

# How to Enable Adoption of Processing in Memory

# Potential Barriers to Adoption of PIM

1. Applications & software for PIM

2. Ease of **programming** (interfaces and compiler/HW support)

3. **System** and **security** support: coherence, synchronization, virtual memory, isolation, communication interfaces, ...

4. **Runtime** and **compilation** systems for adaptive scheduling, data mapping, access/sharing control, ...

5. Infrastructures to assess benefits and feasibility

### All can be solved with a change of mindset

## Our Goal

- To enable adoption of Processing-in-Memory (PIM) systems by solving various key challenges
  - Identifying workloads and functions suitable for PIM
    - Fundamental (architecture-independent) characterization
    - Architecture-specific suitability
  - Overcoming the problem of lack of useful tools
    - Profiling tools
    - Analytical performance and energy models (or ML-based models)
    - Simulation tools
- The project is organized in two phases
  - Phase 1: Methodology and open-source benchmark suite(s)
  - Phase 2: Follow-up project to enable PIM adoption

# Results So Far (2021-2022)

- DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks [IEEE Access 2021]

- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access 2022]
- Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [PACT 2021]
- Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Cooperation [ICDE 2022]
- FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications [IEEE Micro 2021]


DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

### Geraldo F. Oliveira

Juan Gómez-Luna Lois Orosa Saugata Ghose Nandita Vijaykumar Ivan Fernandez Mohammad Sadrosadati Onur Mutlu









## **Executive Summary**

- <u>Problem</u>: Data movement is a major bottleneck in modern systems. However, it is **unclear** how to identify:
  - different sources of data movement bottlenecks
  - the most suitable mitigation technique (e.g., caching, prefetching, near-data processing) for a given data movement bottleneck
- <u>Goals</u>:
  1. Design a methodology to **identify** sources of data movement bottlenecks
  2. **Compare** compute- and memory-centric data movement mitigation techniques
- <u>Key Approach</u>: Perform a large-scale application characterization to identify key metrics that reveal the sources of data movement bottlenecks
- <u>Key Contributions</u>:
  - Experimental characterization of 77K functions across 345 applications
  - A methodology to characterize applications based on data movement bottlenecks and their relation with different data movement mitigation techniques
  - DAMOV: a benchmark suite with 144 functions for data movement studies
  - Four case-studies to highlight DAMOV's applicability to open research problems

### **DAMOV:** <u>https://github.com/CMU-SAFARI/DAMOV</u>

### **Near-Data Processing**

### **UPMEM (2019)**



Near-DRAM-banks processing for general-purpose computing

0.9 TOPS compute throughput<sup>1</sup>

### **Samsung FIMDRAM (2021)**

Near-DRAM-banks processing for neural networks

1.2 TFLOPS compute throughput<sup>2</sup>

### The goal of Near-Data Processing (NDP) is to mitigate data movement



<sup>1</sup> Devaux, "The True Processing In Memory Accelerator," HCS, 2019

<sup>2</sup> Kwon+, "A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications," ISSCC, 2021

### When to Employ Near-Data Processing?



[1] Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," ISCA, 2015

[2] Boroumand+, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks," ASPLOS, 2018

[3] Cali+, "GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis," MICRO, 2020

[4] Kim+, "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies," BMC Genomics, 2018

[5] Boroumand+, "Polynesia: Enabling Effective Hybrid Transactional/Analytical Databases with Specialized Hardware/Software Co-Design," arXiv:2103.00798 [cs.AR], 2021

[6] Fernandez+, "NATSA: A Near-Data Processing Accelerator for Time Series Analysis," ICCD, 2020


# **Key Approach**

- New workload characterization methodology to analyze:
  - data movement bottlenecks
  - suitability of different data movement mitigation mechanisms
- Two main profiling strategies:
  - **Architecture-independent profiling:** characterizes the memory behavior independently of the underlying hardware
  - **Architecture-dependent profiling:** evaluates the impact of the system configuration on the memory behavior
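To make the notion of architecture-independent profiling more concrete, here is a minimal sketch of one metric of that kind: the reuse (LRU stack) distance of each access in an address trace. It depends only on the trace and an assumed line granularity, not on any cache size or hierarchy. The choice of this metric, the function name, and the 64-byte line size are illustrative assumptions, not necessarily the metrics used in the methodology described here.

```python
from collections import OrderedDict


def reuse_distances(addresses, line_bytes=64):
    """For each access, return the number of distinct lines touched since the
    previous access to the same line, or None on the first touch of a line."""
    stack = OrderedDict()  # keys are line numbers; most recently used is last
    distances = []
    for addr in addresses:
        line = addr // line_bytes
        if line in stack:
            keys = list(stack.keys())
            # Lines that sit after `line` in recency order were touched since
            # its last use.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


# Hypothetical byte-address trace touching three cache lines.
trace = [0, 64, 128, 0, 64, 64]
print(reuse_distances(trace))  # [None, None, None, 2, 2, 0]
```

Small distances suggest data that a cache could retain; many first touches or large distances suggest streaming or irregular behavior that caching is less likely to help.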



### **Methodology Overview**



# **Step 1: Application Profiling**

- We analyze 345 applications from distinct domains:
  - Graph Processing
  - Deep Neural Networks
  - Physics
  - High-Performance Computing
  - Genomics
  - Machine Learning
  - Databases
  - Data Reorganization
  - Image Processing
  - Map-Reduce
  - Benchmarking
  - Linear Algebra




### **Step 3: Memory Bottleneck Analysis**



### **DAMOV** is Open Source

• We open-source our benchmark suite and our toolchain



### DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

DAMOV is a benchmark suite and a methodical framework targeting the study of data movement bottlenecks in modern applications. It is intended to study new architectures, such as near-data processing.

The DAMOV benchmark suite is the first open-source benchmark suite for main memory data movement-related studies, based on our systematic characterization methodology. This suite consists of 144 functions representing different sources of data movement bottlenecks and can be used as a baseline benchmark set for future data-movement mitigation research. The applications in the DAMOV benchmark suite belong to popular benchmark suites, including BWA, Chai, Darknet, GASE, Hardware Effects, Hashjoin, HPCC, HPCG, Ligra, PARSEC, Parboil, PolyBench, Phoenix, Rodinia, SPLASH-2, STREAM.


Get DAMOV at: https://github.com/CMU-SAFARI/DAMOV

### More on DAMOV Analysis Methodology & Workloads



https://www.youtube.com/watch?v=GWideVyo0nM&list=PL5Q2soXY2Zi_tOTAYm--dYByNPL7JhwR9&index=3

### More on DAMOV Methods & Benchmarks

 Geraldo F. Oliveira, Juan Gomez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, and Onur Mutlu, "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks"
 <u>IEEE Access</u>, 8 September 2021. Preprint in <u>arXiv</u>, 8 May 2021.
 [arXiv preprint]
 [IEEE Access version]
 [DAMOV Suite and Simulator Source Code]
 [SAFARI Live Seminar Video (2 hrs 40 mins)]
 [Short Talk Video (21 minutes)]

### DAMOV Analysis Methodology & Workloads

#### DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; LOIS OROSA, ETH Zürich, Switzerland; SAUGATA GHOSE, University of Illinois at Urbana–Champaign, USA; NANDITA VIJAYKUMAR, University of Toronto, Canada; IVAN FERNANDEZ, University of Malaga, Spain & ETH Zürich, Switzerland; MOHAMMAD SADROSADATI, Institute for Research in Fundamental Sciences (IPM), Iran & ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Prior NDP works investigate the root causes of data movement bottlenecks using different profiling methodologies and tools. However, there is still a lack of understanding about the key metrics that can identify different data movement bottlenecks and their relation to traditional and emerging data movement mitigation mechanisms. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques (e.g., caching and prefetching) to more memory-centric techniques (e.g., NDP), thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement.

With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
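The kind of per-function characterization described above can be pictured with a few counter-derived metrics. The sketch below computes three commonly used ones from hypothetical hardware-counter or simulator values: cache misses per kilo-instruction (MPKI), a ratio of last-level-cache misses to first-level-cache misses, and arithmetic intensity expressed as operations per cache line accessed. The metric selection, function names, and sample numbers are assumptions for illustration; the abstract itself does not list the metrics, and no classification thresholds are implied.

```python
def mpki(misses: int, instructions: int) -> float:
    """Cache misses per 1,000 executed instructions."""
    return misses * 1000.0 / instructions


def last_to_first_miss_ratio(llc_misses: int, l1_misses: int) -> float:
    """Fraction of first-level misses that also miss in the last-level cache.
    A value near 1 means the deeper cache levels filter few requests."""
    return llc_misses / l1_misses


def arithmetic_intensity(alu_ops: int, cache_lines_accessed: int) -> float:
    """Arithmetic/logic operations performed per cache line accessed."""
    return alu_ops / cache_lines_accessed


# Hypothetical counter values for one function (not measured data).
counters = {
    "instructions": 1_000_000,
    "alu_ops": 400_000,
    "cache_lines_accessed": 100_000,
    "l1_misses": 20_000,
    "llc_misses": 5_000,
}

print(mpki(counters["llc_misses"], counters["instructions"]))                   # 5.0
print(last_to_first_miss_ratio(counters["llc_misses"], counters["l1_misses"]))  # 0.25
print(arithmetic_intensity(counters["alu_ops"], counters["cache_lines_accessed"]))  # 4.0
```

Unlike a trace-only metric, all three values change with the cache configuration, which is what makes them architecture-dependent.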


#### https://arxiv.org/pdf/2105.03725.pdf

### Results So Far (2021-2022)

- DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks [IEEE Access 2021]
- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access 2022]
- Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [PACT 2021]
- Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Cooperation [ICDE 2022]
- FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications [IEEE Micro 2021]

### Eliminating the Adoption Barriers

# Processing-in-Memory in the Real World

### UPMEM Processing-in-DRAM Engine (2019)

### Processing in DRAM Engine

 Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

### Replaces standard DIMMs

- DDR4 R-DIMM modules
  - 8GB+128 DPUs (16 PIM chips)
  - Standard 2x-nm DRAM process



Large amounts of compute & memory bandwidth



https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/

## **UPMEM Memory Modules**

- E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips DIMM (2 ranks). DPUs @ 350 MHz
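The module figures above imply some simple arithmetic, sketched below. The 128 DPUs per 16-chip DIMM comes from the slides; the per-chip count derived from it, the assumption that 8-chip modules use the same chips, and the assumption that a 2,560-DPU system is built entirely from 16-chip modules are inferences made for illustration.

```python
# Figures stated on the slides.
DPUS_PER_16_CHIP_DIMM = 128
CHIPS_PER_16_CHIP_DIMM = 16

# Derived: DPUs per PIM chip.
dpus_per_chip = DPUS_PER_16_CHIP_DIMM // CHIPS_PER_16_CHIP_DIMM


def dpus_in_dimm(chips: int) -> int:
    """DPU count of a DIMM, assuming every chip carries the same number of DPUs."""
    return chips * dpus_per_chip


e19_dpus = dpus_in_dimm(8)    # 8-chip, 1-rank module
p21_dpus = dpus_in_dimm(16)   # 16-chip, 2-rank module

# Assumption: the 2,560-DPU system uses only 16-chip modules.
dimms_for_2560 = 2560 // p21_dpus

print(dpus_per_chip)   # 8
print(e19_dpus)        # 64
print(p21_dpus)        # 128
print(dimms_for_2560)  # 20
```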



## 2,560-DPU Processing-in-Memory System



#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; IZZAT EL HAJJ, American University of Beirut, Lebanon; IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain; CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece; GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory* (PIM).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.
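The abstract mentions microbenchmarks that probe limits such as memory bandwidth. As a loose analogy only, the sketch below times a buffer copy on the host CPU and reports an approximate copy bandwidth. It does not use the UPMEM SDK or run on DPUs; the function name, buffer size, and the "bytes read plus bytes written" accounting are assumptions, and Python overhead means the result is a coarse estimate rather than a hardware limit.

```python
import time


def copy_bandwidth_gbps(n_bytes: int = 8 * 1024 * 1024, repeats: int = 5) -> float:
    """Estimate sustained copy bandwidth (GB/s) as the best of several timed copies."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one pass reading src, one pass writing dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        assert len(dst) == n_bytes  # keep the copy observable
    best = max(best, 1e-9)  # guard against a zero reading on coarse timers
    # Count both the bytes read and the bytes written.
    return 2 * n_bytes / best / 1e9


print(f"approximate copy bandwidth: {copy_bandwidth_gbps():.2f} GB/s")
```

Taking the best of several runs reduces the influence of one-off effects such as page faults on the first touch of the destination buffer.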



https://arxiv.org/pdf/2105.03814.pdf

### More on the UPMEM PIM System



https://www.youtube.com/watch?v=Sscy1Wrr22A&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=26

### UPMEM PIM System Summary & Analysis

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu, "**Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware**" Invited Paper at Workshop on Computing with Unconventional Technologies (CUT), Virtual, October 2021. [arXiv version] [PrIM Benchmarks Source Code] [Slides (pptx) (pdf)] [Talk Video (37 minutes)] [Lightning Talk Video (3 minutes)]

### Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

Juan Gómez-Luna (ETH Zürich), Izzat El Hajj (American University of Beirut), Ivan Fernandez (University of Malaga), Christina Giannoula (National Technical University of Athens), Geraldo F. Oliveira (ETH Zürich), Onur Mutlu (ETH Zürich)

## Understanding a Modern Processing-in-Memory Architecture:

### **Benchmarking and Experimental Characterization**

<u>Juan Gómez Luna</u>, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

https://arxiv.org/pdf/2105.03814.pdf https://github.com/CMU-SAFARI/prim-benchmarks





## **Executive Summary**

- Data movement between memory/storage units and compute units is a major contributor to execution time and energy consumption
- Processing-in-Memory (PIM) is a paradigm that can tackle the data movement bottleneck
  - Though explored for 50+ years, technology challenges have prevented its successful materialization
- UPMEM has designed and fabricated the first publicly-available real-world PIM architecture
  - DDR4 chips embedding in-order multithreaded DRAM Processing Units (DPUs)
- Our work:
  - Introduction to UPMEM programming model and PIM architecture
  - Microbenchmark-based characterization of the DPU
  - Benchmarking and workload suitability study
- Main contributions:
  - Comprehensive characterization and analysis of the first commercially-available PIM architecture
  - PrIM (Processing-In-Memory) benchmarks:
    - 16 workloads that are memory-bound in conventional processor-centric systems
    - Strong and weak scaling characteristics
  - Comparison to state-of-the-art CPU and GPU
- Takeaways:
  - Workload characteristics for PIM suitability
  - Programming recommendations
  - Suggestions and hints for hardware and architecture designers of future PIM systems
  - PrIM: (a) programming samples, (b) evaluation and comparison of current and future PIM systems

## **PrIM Benchmarks: Application Domains**

| Domain                | Benchmark                     | Short name |
|-----------------------|-------------------------------|------------|
| Dense linear algebra  | Vector Addition               | VA         |
|                       | Matrix-Vector Multiply        | GEMV       |
| Sparse linear algebra | Sparse Matrix-Vector Multiply | SpMV       |
| Databases             | Select                        | SEL        |
|                       | Unique                        | UNI        |
| Data analytics        | Binary Search                 | BS         |
|                       | Time Series Analysis          | TS         |
| Graph processing      | Breadth-First Search          | BFS        |
| Neural networks       | Multilayer Perceptron         | MLP        |
| Bioinformatics        | Needleman-Wunsch              | NW         |
| Image processing      | Image histogram (short)       | HST-S      |
|                       | Image histogram (large)       | HST-L      |
| Parallel primitives   | Reduction                     | RED        |
|                       | Prefix sum (scan-scan-add)    | SCAN-SSA   |
|                       | Prefix sum (reduce-scan-scan) | SCAN-RSS   |
|                       | Matrix transposition          | TRNS       |

## **PrIM Benchmarks are Open Source**

- All microbenchmarks, benchmarks, and scripts
- <u>https://github.com/CMU-SAFARI/prim-benchmarks</u>

**PrIM (Processing-In-Memory Benchmarks)** (excerpt from the repository's `README.md`)

PrIM is the first benchmark suite for a real-world processing-in-memory (PIM) architecture. PrIM is developed to evaluate, analyze, and characterize the first publicly-available real-world processing-in-memory (PIM) architecture, the UPMEM PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

PrIM provides a common set of workloads to evaluate the UPMEM PIM architecture with and can be useful for programming, architecture and system researchers all alike to improve multiple aspects of future PIM hardware and software. The workloads have different characteristics, exhibiting heterogeneity in their memory access patterns, operations and data types, and [...]

### Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

JUAN GÓMEZ-LUNA<sup>1</sup>, IZZAT EL HAJJ<sup>2</sup>, IVAN FERNANDEZ<sup>1,3</sup>, CHRISTINA GIANNOULA<sup>1,4</sup>, GERALDO F. OLIVEIRA<sup>1</sup>, AND ONUR MUTLU<sup>1</sup>

<sup>1</sup>ETH Zürich

<sup>2</sup>American University of Beirut

<sup>3</sup>University of Malaga

<sup>4</sup>National Technical University of Athens

Corresponding author: Juan Gómez-Luna (e-mail: juang@ethz.ch).

https://doi.org/10.1109/ACCESS.2022.3174101 https://github.com/CMU-SAFARI/prim-benchmarks

### **Observations, Recommendations, Takeaways**

#### **GENERAL PROGRAMMING RECOMMENDATIONS**

- 1. Execute on the *DRAM Processing Units* (*DPUs*) **portions of parallel code** that are as long as possible.
- 2. Split the workload into **independent data blocks**, which the DPUs operate on independently.
- 3. Use **as many working DPUs** in the system as possible.
- 4. Launch at least **11** *tasklets* (i.e., software threads) per DPU.
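
Recommendations 2 to 4 can be illustrated with a small host-side partitioning sketch. It is not taken from the PrIM sources and does not use the UPMEM SDK; the function names `split_evenly` and `partition` are hypothetical, and the only value taken from the slide is the default of 11 tasklets per DPU.

```python
def split_evenly(n_items, n_parts):
    """Return n_parts contiguous half-open (start, end) ranges covering n_items."""
    base, extra = divmod(n_items, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts take one additional item each.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def partition(n_items, n_dpus, n_tasklets=11):
    """Map each (dpu, tasklet) pair to the independent element range it owns."""
    plan = {}
    for dpu, (d_start, d_end) in enumerate(split_evenly(n_items, n_dpus)):
        per_tasklet = split_evenly(d_end - d_start, n_tasklets)
        for tasklet, (t_start, t_end) in enumerate(per_tasklet):
            plan[(dpu, tasklet)] = (d_start + t_start, d_start + t_end)
    return plan


if __name__ == "__main__":
    plan = partition(n_items=1_000_000, n_dpus=64)
    print(len(plan), "independent blocks")
    print("first block:", plan[(0, 0)], "last block:", plan[(63, 10)])
```

Each block is disjoint from every other block, so no DPU or tasklet needs data owned by another one. A real implementation would additionally have to respect the transfer constraints of the target hardware, which this sketch ignores.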

#### **PROGRAMMING RECOMMENDATION 1**

For data movement between the DPU's MRAM bank and the WRAM, **use large DMA transfer sizes when all the accessed data is going to be used**.

#### **KEY OBSERVATION 7**

Larger CPU-DPU and DPU-CPU transfers between the host main memory and the DRAM Processing Unit's Main memory (MRAM) banks result in higher sustained bandwidth.
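
One simple way to see why larger transfers sustain higher bandwidth is a fixed-cost-plus-streaming model: every transfer pays a constant setup cost, and that cost is amortized over more bytes as the transfer grows. The sketch below assumes this model; the overhead and peak bandwidth values are hypothetical placeholders, not measurements of the UPMEM system.

```python
def sustained_bandwidth(size_bytes, fixed_overhead_s, peak_bytes_per_s):
    """Effective bandwidth of one transfer under a fixed-cost-plus-streaming model."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    transfer_time_s = fixed_overhead_s + size_bytes / peak_bytes_per_s
    return size_bytes / transfer_time_s


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the trend.
    OVERHEAD_S = 1e-6
    PEAK_BPS = 1e9
    for size in (8, 64, 512, 2048):
        bw = sustained_bandwidth(size, OVERHEAD_S, PEAK_BPS)
        print(f"{size:5d} B -> {bw / 1e6:8.1f} MB/s ({bw / PEAK_BPS:.0%} of peak)")
```

Under this model the sustained bandwidth rises with transfer size and approaches the peak without reaching it. The same reasoning explains the condition in Programming Recommendation 1: a large transfer only pays off when all the bytes it moves are actually used.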

#### **KEY TAKEAWAY 1**

The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound.

## Outline

- Introduction
  - Accelerator Model
  - UPMEM-based PIM System Overview
- UPMEM PIM Programming
  - Vector Addition
  - CPU-DPU Data Transfers
  - Inter-DPU Communication
  - CPU-DPU/DPU-CPU Transfer Bandwidth
- DRAM Processing Unit
  - Arithmetic Throughput
  - WRAM and MRAM Bandwidth
- PrIM Benchmarks
  - Roofline Model
  - Benchmark Diversity
- Evaluation
  - Strong and Weak Scaling
  - Comparison to CPU and GPU

- Key Takeaways

## Key Takeaway 1



Operational Intensity (OP/B)

The throughput saturation point is as low as ¼ OP/B, i.e., 1 integer addition per every 32-bit element fetched

#### **KEY TAKEAWAY 1**

**The UPMEM PIM architecture is fundamentally compute bound.** As a result, **the most suitable workloads are memory-bound.** 
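
The saturation point quoted above follows from the roofline model: attainable throughput is the smaller of the compute peak and the product of operational intensity and memory bandwidth. The sketch below works through that arithmetic. The peak values are hypothetical placeholders chosen so that the ridge point lands at 0.25 OP/B as on the slide; they are not the measured DPU numbers.

```python
def operational_intensity(ops, bytes_moved):
    """Operations performed per byte fetched (OP/B)."""
    return ops / bytes_moved


def ridge_point(peak_ops_per_s, peak_bytes_per_s):
    """Operational intensity at which a kernel stops being memory-bound."""
    return peak_ops_per_s / peak_bytes_per_s


def attainable_throughput(oi, peak_ops_per_s, peak_bytes_per_s):
    """Roofline bound: min(compute roof, bandwidth roof at this intensity)."""
    return min(peak_ops_per_s, oi * peak_bytes_per_s)


if __name__ == "__main__":
    # One integer addition per 32-bit (4-byte) element fetched.
    oi = operational_intensity(ops=1, bytes_moved=4)
    print("operational intensity:", oi, "OP/B")

    # Hypothetical peaks (assumptions, not measurements).
    PEAK_OPS = 50e6   # operations per second
    PEAK_BW = 200e6   # bytes per second
    print("ridge point:", ridge_point(PEAK_OPS, PEAK_BW), "OP/B")
    for test_oi in (0.125, 0.25, 4.0):
        print(test_oi, "OP/B ->", attainable_throughput(test_oi, PEAK_OPS, PEAK_BW), "OP/s")
```

A ridge point this low means that almost any kernel that does more than one simple operation per fetched element already sits under the compute roof, which is the sense in which the architecture is compute bound.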

## Key Takeaway 2



#### **KEY TAKEAWAY 2**

The most well-suited workloads for the UPMEM PIM architecture use no arithmetic operations or use only simple operations (e.g., bitwise operations and integer addition/subtraction).

## Key Takeaway 3



#### **KEY TAKEAWAY 3**

The most well-suited workloads for the UPMEM PIM architecture require little or no communication across DPUs (inter-DPU communication).

#### **KEY TAKEAWAY 4**

- UPMEM-based PIM systems **outperform state-of-the-art CPUs in terms of performance and energy efficiency on most PrIM benchmarks.**

- UPMEM-based PIM systems **outperform state-of-the-art GPUs on a majority of PrIM benchmarks**, and the outlook is even more positive for future PIM systems.

- UPMEM-based PIM systems are **more energy-efficient than state-of-the-art CPUs and GPUs** on workloads where they provide performance improvements over the CPUs and the GPUs.


## Understanding a Modern Processing-in-Memory Architecture:

### **Benchmarking and Experimental Characterization**

<u>Juan Gómez Luna</u>, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

el1goluj@gmail.com

<u>https://arxiv.org/pdf/2105.03814.pdf</u> <u>https://github.com/CMU-SAFARI/prim-benchmarks</u>





### Experimental Analysis of the UPMEM PIM Engine

### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland IZZAT EL HAJJ, American University of Beirut, Lebanon IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece GERALDO F. OLIVEIRA, ETH Zürich, Switzerland ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory* (*PIM*).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

#### https://arxiv.org/pdf/2105.03814.pdf

### More on Analysis of the UPMEM PIM Engine



#### https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi\_tOTAYm--dYByNPL7JhwR9

### More on Analysis of the UPMEM PIM Engine



#### Understanding a Modern Processing-in-Memory Arch: Benchmarking & Experimental Characterization; 21m



#### https://www.youtube.com/watch?v=Pp9jSU2b9oM&list=PL5Q2soXY2Zi8\_VVChACnON4sfh2bJ5IrD&index=159

### More on PrIM Benchmarks

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu, "**Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture**" Preprint in arXiv, 9 May 2021. [arXiv preprint] [PrIM Benchmarks Source Code] [Slides (pptx) (pdf)] [Long Talk Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [SAFARI Live Seminar Slides (pptx) (pdf)] [SAFARI Live Seminar Video (2 hrs 57 mins)] [Lightning Talk Video (3 minutes)]


### Results So Far (2021-2022)

- DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks [IEEE Access 2021]
- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access 2022]
- Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [PACT 2021]
- Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Cooperation [ICDE 2022]
- FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications [IEEE Micro 2021]

### Data-Centric Neural Network Inference

Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu, "Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks" Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), Virtual, September 2021. [Slides (pptx) (pdf)] [Talk Video (14 minutes)]

#### **Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks**

Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, Onur Mutlu

Carnegie Mellon Univ., Stanford Univ., Univ. of Illinois Urbana-Champaign, Google, ETH Zürich

**PACT 2021** 



## **Executive Summary**

#### <u>Context</u>: We extensively analyze a state-of-the-art edge ML accelerator (Google Edge TPU) using 24 Google edge models

- Wide range of models (CNNs, LSTMs, Transducers, RCNNs)

#### <u>Problem</u>: The Edge TPU accelerator suffers from three challenges:

- It operates significantly below its peak throughput
- It operates significantly below its theoretical energy efficiency
- It inefficiently handles memory accesses

#### <u>Key Insight</u>: These shortcomings arise from the monolithic design of the Edge TPU accelerator

- The Edge TPU accelerator design does not account for layer heterogeneity

#### <u>Key Mechanism</u>: A new framework called Mensa

- Mensa consists of heterogeneous accelerators whose dataflow and hardware are specialized for specific families of layers

#### <u>Key Results</u>: We design a version of Mensa for Google edge ML models

- Mensa improves performance and energy by 3.0X and 3.1X
- Mensa reduces cost and improves area efficiency

## **Google Edge NN Models**

### We analyze inference execution using 24 edge NN models



## **Diversity Across the Models**

### Insight 1: there is significant variation in terms of layer characteristics across the models



## **Diversity Within the Models**

### Insight 2: even within each model, layers exhibit significant variation in terms of layer characteristics

For example, our analysis of edge CNN models shows:



Variation in MAC intensity: up to 200x across layers

Variation in FLOP/Byte: up to 244x across layers
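
To make the source of this variation concrete, the following sketch compares a convolutional layer with a fully connected layer. It assumes that FLOP/Byte is measured against parameter bytes and that one MAC counts as two FLOPs; that is one possible reading and may differ from the exact definition used in the paper. The layer shapes are hypothetical.

```python
def conv2d_stats(h_out, w_out, c_in, c_out, k, bytes_per_param=1):
    """MAC count and parameter footprint of a k x k convolution (no bias)."""
    macs = h_out * w_out * c_out * k * k * c_in
    param_bytes = k * k * c_in * c_out * bytes_per_param
    return macs, param_bytes


def fully_connected_stats(n_in, n_out, bytes_per_param=1):
    """MAC count and parameter footprint of a dense layer (no bias)."""
    macs = n_in * n_out
    param_bytes = n_in * n_out * bytes_per_param
    return macs, param_bytes


def flop_per_byte(macs, param_bytes):
    """FLOPs per parameter byte, counting one MAC as two FLOPs."""
    return 2 * macs / param_bytes


if __name__ == "__main__":
    conv = flop_per_byte(*conv2d_stats(h_out=8, w_out=8, c_in=16, c_out=32, k=3))
    dense = flop_per_byte(*fully_connected_stats(n_in=1024, n_out=1024))
    print("conv  FLOP/Byte:", conv)
    print("dense FLOP/Byte:", dense)
    print("ratio:", conv / dense)
```

Under these assumptions a convolution reuses each weight once per output position, while a dense layer uses each weight exactly once, so the two layer types can differ by orders of magnitude even inside one model.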

### **Root Cause of Accelerator Challenges**

The key components of Google Edge TPU are completely oblivious to layer heterogeneity



Edge accelerators typically take a monolithic approach: equip the accelerator with an over-provisioned <u>PE array</u> and <u>on-chip buffer</u>, a rigid <u>dataflow</u>, and fixed <u>off-chip bandwidth</u>

While this approach might work for a specific group of layers, it fails to efficiently execute inference across a wide variety of edge models

# Mensa Framework

**Goal:** design an edge accelerator that can efficiently run inference across a wide range of different models and layers

### Instead of running the entire NN model on a monolithic accelerator:

Mensa: a new acceleration framework for edge NN inference





# Mensa Runtime Scheduler

The goal of Mensa's software runtime scheduler is to identify which accelerator each layer in an NN model should run on



Generated once during initial setup
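
The idea of assigning each layer to an accelerator can be sketched as a one-time table construction. This is not Mensa's actual scheduling algorithm: the cost model (a roofline-style time estimate), the class names, and all accelerator and layer numbers below are hypothetical and serve only to illustrate a layer-to-accelerator mapping computed once and reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    flops: float        # total floating-point operations in the layer
    bytes_moved: float  # bytes the layer must fetch from memory


@dataclass(frozen=True)
class Accelerator:
    name: str
    peak_flops: float   # FLOP/s
    bandwidth: float    # bytes/s


def estimated_time(layer, acc):
    """Roofline-style estimate: the slower of compute time and memory time."""
    return max(layer.flops / acc.peak_flops, layer.bytes_moved / acc.bandwidth)


def build_mapping(layers, accelerators):
    """Pick, for every layer, the accelerator with the lowest estimated time."""
    return {
        layer.name: min(accelerators, key=lambda a: estimated_time(layer, a)).name
        for layer in layers
    }


if __name__ == "__main__":
    # Hypothetical accelerators and layers (assumptions for illustration).
    accelerators = [
        Accelerator("compute_heavy", peak_flops=4e12, bandwidth=2e10),
        Accelerator("memory_heavy", peak_flops=5e11, bandwidth=2.56e11),
    ]
    layers = [
        Layer("conv", flops=2e9, bytes_moved=1e6),
        Layer("lstm", flops=2e7, bytes_moved=1e7),
    ]
    print(build_mapping(layers, accelerators))
```

Because the mapping depends only on static layer and accelerator characteristics in this sketch, it can be computed once and looked up at inference time.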



# Mensa: Highly-Efficient ML Inference

Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu, "Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks" Proceedings of the 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), Virtual, September 2021. [Slides (pptx) (pdf)] [Talk Video (14 minutes)]

### **Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks**

Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, Onur Mutlu

Carnegie Mellon Univ., Stanford Univ., Univ. of Illinois Urbana-Champaign, Google, ETH Zürich

# Results So Far (2021-2022)

- DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks [IEEE Access 2021]
- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access 2022]
- Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [PACT 2021]
- Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Cooperation [ICDE 2022]
- FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications [IEEE Micro 2021]

# Accelerating HTAP Database Systems

Appears in ICDE 2022

### **Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases** with Hardware/Software Co-Design

Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>⋄</sup> Geraldo F. Oliveira<sup>‡</sup> Onur Mutlu<sup>‡</sup>

<sup>†</sup>Google <sup>⋄</sup>Univ. of Illinois Urbana-Champaign <sup>‡</sup>ETH Zürich

https://arxiv.org/pdf/2204.11275.pdf

# FPGA-based Processing Near Memory

 Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu, "FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications" IEEE Micro (IEEE MICRO), 2021.

# FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications

Gagandeep Singh<sup>◊</sup> Mohammed Alser<sup>◊</sup> Damla Senol Cali<sup>⋈</sup> Dionysios Diamantopoulos<sup>∇</sup> Juan Gómez-Luna<sup>◊</sup> Henk Corporaal<sup>★</sup> Onur Mutlu<sup>◊ ⋈</sup>

<sup>◊</sup>ETH Zürich <sup>⋈</sup>Carnegie Mellon University <sup>★</sup>Eindhoven University of Technology <sup>∇</sup>IBM Research Europe

# Near-Memory Acceleration Using FPGAs



### **Near-HBM FPGA-based accelerator**

- Two communication technologies: CAPI2 and OCAPI
- Two memory technologies: DDR4 and HBM
- Two workloads: Weather Modeling and Genome Analysis

# Performance & Energy Greatly Improve



5-27× performance vs. a 16-core (64-thread) IBM POWER9 CPU

12-133× energy efficiency vs. a 16-core (64-thread) IBM POWER9 CPU

### **HBM alleviates memory bandwidth contention vs. DDR4**

### We Need to Revisit the Entire Stack

| Problem            |   |
|--------------------|---|
| Algorithm          |   |
| Program/Language   |   |
| System Software    |   |
| SW/HW Interface    |   |
| Micro-architecture |   |
| Logic              |   |
| Devices            |   |
| Electrons          |   |

### We can get there step by step

# Coming Up: SpMV on Real PIM Systems

- To appear in SIGMETRICS 2022

### **SparseP:** Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

CHRISTINA GIANNOULA, ETH Zürich, Switzerland and National Technical University of Athens, Greece

IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland

NECTARIOS KOZIRIS, National Technical University of Athens, Greece

GEORGIOS GOUMAS, National Technical University of Athens, Greece

ONUR MUTLU, ETH Zürich, Switzerland

https://arxiv.org/pdf/2201.05072.pdf

https://github.com/CMU-SAFARI/SparseP

# Coming Up: SpMV on Real PIM Systems



#### https://www.youtube.com/watch?v=5kaOsJKlGrE

# Coming Up: FPGA Framework for PuM

### PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

Ataberk Olgun<sup>§†</sup> Juan Gómez Luna<sup>§</sup> Konstantinos Kanellopoulos<sup>§</sup> Behzad Salami<sup>§\*</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

<sup>§</sup>ETH Zürich <sup>†</sup>TOBB ETÜ <sup>\*</sup>BSC

https://arxiv.org/pdf/2111.00082.pdf

https://github.com/cmu-safari/pidram

# Comp Arch (Fall'21)

- Fall 2021 Edition:
  - https://safari.ethz.ch/architecture/fall2021/doku.php?id=schedule
- Fall 2020 Edition:
  - https://safari.ethz.ch/architecture/fall2020/doku.php?id=schedule
- YouTube Livestream (2021):
  - https://www.youtube.com/watch?v=4yfkM\_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF
- YouTube Livestream (2020):
  - https://www.youtube.com/watch?v=c3mPdZA-Fmc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN
- Master's level course
  - Taken by Bachelor's/Master's/PhD students
  - Cutting-edge research topics + fundamentals in Computer Architecture
  - 5 Simulator-based Lab Assignments
  - Potential research exploration
  - Many research readings

### https://www.youtube.com/onurmutlulectures



#### Fall 2021 Lectures & Schedule

| Week | Date          | Livestream   | Lecture                                                          | Readings               | Lab          | HW          |
|------|---------------|--------------|------------------------------------------------------------------|------------------------|--------------|-------------|
| W1   | 30.09<br>Thu. | YouTube Live | L1: Introduction and Basics                                      | Required<br>Mentioned  | Lab 1<br>Out | HW 0<br>Out |
|      | 01.10<br>Fri. | YouTube Live | L2: Trends, Tradeoffs and Design Fundamentals<br>(PDF) (PPT)     | Required<br>Mentioned  |              |             |
| W2   | 07.10<br>Thu. | YouTube Live | L3a: Memory Systems: Challenges and Opportunities<br>(PDF) (PPT) | Described<br>Suggested |              | HW 1<br>Out |
|      |               |              | L3b: Course Info & Logistics                                     |                        |              |             |
|      |               |              | L3c: Memory Performance Attacks                                  | Described<br>Suggested |              |             |
|      | 08.10<br>Fri. |              | L4a: Memory Performance Attacks                                  | Described<br>Suggested | Lab 2<br>Out |             |
|      |               |              | L4b: Data Retention and Memory Refresh                           | Described<br>Suggested |              |             |
|      |               |              | L4c: RowHammer<br>(PDF) (PPT)                                    | Described<br>Suggested |              |             |

### PIM Course (Fall'21)

#### Fall 2021 Edition:

https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=processing\_in\_memory

#### Youtube Livestream:

https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX

#### Project course

- Taken by Bachelor's/Master's students
- Processing-in-Memory lectures
- Hands-on research exploration
- Many research readings

Lecture Video Playlist on YouTube



#### Fall 2021 Meetings/Schedule

| Week | Date          | Livestream   | Meeting                                                            | Learning Materials                          | Assignments |
|------|---------------|--------------|--------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 05.10<br>Tue. | YouTube Live | M1: P&S PIM Course Presentation<br>(PDF) (PPT)                     | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 12.10<br>Tue. | YouTube Live | M2: Real-World PIM Architectures                                   |                                             |             |
| W3   | 19.10<br>Tue. | YouTube Live | M3: Real-World PIM Architectures II<br>(PDF) (PDF)                 |                                             |             |
| W4   | 26.10<br>Tue. | YouTube Live | M4: Real-World PIM Architectures III<br>(PDF) (PPT)                |                                             |             |
| W5   | 02.11<br>Tue. | YouTube Live | M5: Real-World PIM Architectures IV                                |                                             |             |
| W6   | 09.11<br>Tue. | YouTube Live | M6: End-to-End Framework for Processing-using-Memory<br>(PDF) (PPT) |                                            |             |
| W7   | 16.11<br>Tue. | YouTube Live | M7: How to Evaluate Data Movement Bottlenecks<br>(PDF) (PPT)       |                                             |             |
| W8   | 23.11<br>Tue. | YouTube Live | M8: Programming PIM Architectures                                  |                                             |             |
| W9   | 30.11<br>Tue. | YouTube Live | M9: Benchmarking and Workload Suitability on PIM<br>(PDF) (PPT)    |                                             |             |
| W10  | 07.12<br>Tue. | YouTube Live | M10: Bit-Serial SIMD Processing using DRAM<br>(PDF) (PPT)          |                                             |             |

### **PIM Course (Current)**

#### Spring 2022 Edition:

https://safari.ethz.ch/projects\_and\_seminars/spring2022/doku.php?id=processing\_in\_memory

#### Youtube Livestream:

https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX

#### Project course

- Taken by Bachelor's/Master's students
- Processing-in-Memory lectures
- Hands-on research exploration
- Many research readings



#### Recorded Lecture Playlist



#### Spring 2022 Meetings/Schedule

| Week | Date          | Livestream       | Meeting                                                          | Learning Materials                          | Assignments |
|------|---------------|------------------|------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 10.03<br>Thu. | YouTube Live     | M1: P&S PIM Course Presentation<br>(PDF) (PPT)                   | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 15.03<br>Tue. |                  | Hands-on Project Proposals                                       |                                             |             |
|      | 17.03<br>Thu. | YouTube Premiere | M2: Real-world PIM: UPMEM PIM<br>(PDF) (PPT)                     |                                             |             |
| W3   | 24.03<br>Thu. | YouTube Live     | M3: Real-world PIM: Microbenchmarking of UPMEM PIM<br>(PDF) (PPT) |                                            |             |
| W4   | 31.03<br>Thu. | YouTube Live     | M4: Real-world PIM: Samsung HBM-PIM<br>(PDF) (PPT)               |                                             |             |
| W5   | 07.04<br>Thu. | YouTube Live     | M5: How to Evaluate Data Movement Bottlenecks<br>(PDF) (PPT)     |                                             |             |
| W6   | 14.04<br>Thu. | YouTube Live     | M6: Real-world PIM: SK Hynix AiM<br>(PDF) (PPT)                  |                                             |             |
| W7   | 21.04<br>Thu. | YouTube Premiere | M7: Programming PIM Architectures<br>(PDF) (PPT)                 |                                             |             |
| W8   | 28.04<br>Thu. | YouTube Premiere | M8: Benchmarking and Workload Suitability on PIM<br>(PDF) (PPT)  |                                             |             |
| W9   | 05.05<br>Thu. | YouTube Premiere | M9: Real-world PIM: Samsung AxDIMM<br>(PDF) (PPT)                |                                             |             |
| W10  | 12.05<br>Thu. |                  | M10: Real-world PIM: Alibaba HB-PNM                              |                                             |             |

## Current EFCL Projects

- "A New Methodology and Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks: A Processing-in-Memory Case Study"
  - Data-centric
- "Machine-Learning-Assisted Intelligent Microarchitectures to Reduce Memory Access Latency"
  - Data-driven
- "Cross-layer Hardware/Software Techniques to Enable Powerful Computation and Memory Optimizations"
  - Data-aware

Machine-Learning-Assisted Intelligent Microarchitectures to Reduce Memory Access Latency

Rahul Bera, Gagandeep Singh, Rakesh Nadig, Konstantinos Kanellopoulos, Onur Mutlu





# System Architecture Design Today

- Human-driven
  - Humans design the policies (how to do things)
- Many (too) simple, short-sighted policies all over the system
- No automatic data-driven policy learning
- (Almost) no learning: cannot take lessons from past actions

### Can we design fundamentally intelligent architectures?

# An Intelligent Architecture

- Data-driven
  - Machine learns the "best" policies (how to do things)
- Sophisticated, workload-driven, changing, far-sighted policies
- Automatic data-driven policy learning
- All controllers are intelligent data-driven agents

### How do we start?

Challenge and Opportunity for Future

# Data-Driven (Self-Optimizing) Computing Architectures

## Results So Far (2021-2022)

- Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning [MICRO 2021]
- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]

## Self-Optimizing Memory Prefetchers

 Rahul Bera, Konstantinos Kanellopoulos, Anant Nori, Taha Shahroodi, Sreenivas Subramoney, and Onur Mutlu,
 "Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning" Proceedings of the <u>54th International Symposium on Microarchitecture</u> (MICRO), Virtual, October 2021.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]
 [Lightning Talk Slides (pptx) (pdf)]
 [Lightning Talk Video (1.5 minutes)]
 [Pythia Source Code (Officially Artifact Evaluated with All Badges)]
 [arXiv version]

### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Anant V. Nori<sup>2</sup> Taha Shahroodi<sup>3,1</sup> Sreenivas Subramoney<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

#### https://arxiv.org/pdf/2109.12021.pdf



# Pythia

## A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

<u>Rahul Bera</u>, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu

https://github.com/CMU-SAFARI/Pythia





Why do prefetchers not perform well?

- Mainly use <u>one</u> program context info. for prediction
- Lack <u>system awareness</u>
- Lack in-silicon <u>customizability</u>







- Autonomously learns to prefetch using multiple program context information and system-level feedback
- <u>In-silicon customizable</u> to change program context information or prefetching objective <u>on the fly</u>





# **Basics of Reinforcement Learning (RL)**

 Algorithmic approach to learn to take an action in a given situation to maximize a numerical reward



- Agent stores Q-values for every state-action pair
  - **Expected return** for taking an action in a state
- Given a state, the agent selects the action that provides the highest Q-value
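A minimal sketch of the two bullets above: a tabular agent keeps one Q-value per state-action pair and picks the action with the highest value. The class name `QTable`, the SARSA-style update rule, and the `alpha` and `gamma` values are illustrative assumptions, not details stated on the slide.

```python
from collections import defaultdict


class QTable:
    """Toy tabular agent: one Q-value per (state, action) pair."""

    def __init__(self, actions, alpha=0.1, gamma=0.5):
        self.actions = list(actions)
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = defaultdict(float)  # unseen pairs start at 0.0

    def select(self, state):
        """Return the action with the highest Q-value in `state`."""
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_action):
        """SARSA-style update (assumed): move toward reward + discounted next Q."""
        target = reward + self.gamma * self.q[(next_state, next_action)]
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


agent = QTable(actions=[-1, 0, 1])
agent.update("s0", 1, reward=20.0, next_state="s1", next_action=0)
best = agent.select("s0")  # action 1 now has the largest Q-value in "s0"
```

A real hardware design would bound the table size; the dictionary here only keeps the example short.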

# **Brief Overview of Pythia**

Pythia formulates prefetching as a **reinforcement learning** problem

# **Basic Pythia Configuration**

- Derived from automatic design-space exploration
- State: 2 features
  - PC+Delta
  - Sequence of last-4 deltas
- Actions: 16 prefetch offsets
  - Ranging from -6 to +32, including 0
- Rewards:
  - $R_{AT} = +20$; $R_{AL} = +12$; $R_{NP}\text{-}H = -2$; $R_{NP}\text{-}L = -4$
  - $R_{IN}\text{-}H = -14$; $R_{IN}\text{-}L = -8$; $R_{CL} = -12$
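The configuration above can be written down as plain data, which is one way to picture what "customizable" means here. This is only a sketch: the class and field names are hypothetical, the reward values and offset range are copied from the slide, and because the slide does not list the 16 individual offsets the code checks only the range.

```python
from dataclasses import dataclass, field


@dataclass
class BasicPythiaConfig:
    """Values copied from the slide; class and field names are hypothetical."""

    state_features: tuple = ("PC+Delta", "Sequence of last-4 deltas")
    num_actions: int = 16   # number of prefetch offsets
    min_offset: int = -6    # smallest offset named on the slide
    max_offset: int = 32    # largest offset named on the slide
    rewards: dict = field(default_factory=lambda: {
        "R_AT": 20, "R_AL": 12,
        "R_NP-H": -2, "R_NP-L": -4,
        "R_IN-H": -14, "R_IN-L": -8,
        "R_CL": -12,
    })

    def offset_in_range(self, offset: int) -> bool:
        """Check an offset against the slide's range (0 is included)."""
        return self.min_offset <= offset <= self.max_offset


cfg = BasicPythiaConfig()
```

Changing the prefetching objective would then amount to swapping the `rewards` mapping, and changing the program context would amount to swapping `state_features`.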

# More Detailed Pythia Overview

- **Q-Value Store**: Records Q-values for *all* state-action pairs
- Evaluation Queue: A FIFO queue of recently-taken actions



# **Rigorous Evaluation Methodology**

- Champsim [3] trace-driven simulator
- **150** single-core memory-intensive workload traces
  - SPEC CPU2006 and CPU2017
  - PARSEC 2.1
  - Ligra
  - Cloudsuite
- Homogeneous and heterogeneous multi-core mixes

- Five state-of-the-art prefetchers
  - SPP [Kim+, MICRO'16]
  - Bingo [Bakhshalipour+, HPCA'19]
  - MLOP [Shakerinava+, 3<sup>rd</sup> Prefetching Championship, 2019]
  - SPP+DSPatch [Bera+, MICRO'19]
  - SPP+PPF [Bhatia+, ISCA'20]



# **Performance with Varying Core Count**



### **Performance with Varying DRAM Bandwidth**



### Pythia outperforms prior best prefetchers for a wide range of DRAM bandwidth configurations



# A Lot More in the Paper

- Performance comparison with unseen traces
  - Pythia provides equally high performance benefits
- Comparison against multi-level prefetchers
- Performance sensitivity to different features and hyperparameter values
- Detailed single-core and four-core performance

# **Pythia is Open Source**



### https://github.com/CMU-SAFARI/Pythia

- MICRO'21 artifact evaluated
- ChampSim source code + Chisel modeling code
- All traces used for evaluation







### An Intelligent Architecture

- Data-driven
  - Machine learns the "best" policies (how to do things)
- Sophisticated, workload-driven, changing, far-sighted policies
- Automatic data-driven policy learning
- All controllers are intelligent data-driven agents

### We need to rethink design (of all controllers)

### Self-Optimizing Hybrid Storage Systems

- To appear in ISCA 2022

#### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

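Gagandeep Singh<sup>1</sup> Rakesh Nadig<sup>1</sup> Jisung Park<sup>1</sup> Rahul Bera<sup>1</sup> Nastaran Hajinazar<sup>1</sup> David Novo<sup>3</sup> Juan Gómez-Luna<sup>1</sup> Sander Stuijk<sup>2</sup> Henk Corporaal<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Eindhoven University of Technology <sup>3</sup>LIRMM, Univ. Montpellier, CNRS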

#### https://arxiv.org/pdf/2205.07394.pdf

# **Sibyl:** Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez-Luna, Onur Mutlu




### **Executive Summary**

**Background:** Hybrid storage systems (HSSs) combine different storage technologies to extend the overall capacity and reduce the system cost with minimal effect on application performance

**Problem:** Accurately identifying the performance-critical data of an application and placing it in the "best-fit" storage device. Prior data placement policies for hybrid storage systems (heuristic-based and supervised learning-based) have three key shortcomings:

- Lack of adaptability
- Lack of device awareness (e.g., read/write latencies of each device)
- Lack of extensibility

**Goal:** Develop a new, efficient, and high-performance data-placement mechanism for hybrid storage systems that can:

- Dynamically derive an adaptive data-placement strategy by continuously learning and adapting to the application and underlying device characteristics
- Be easily extended to incorporate a wide range of hybrid storage configurations

**Key Idea:** Sibyl, an online reinforcement learning-based self-optimizing mechanism for data placement that:

- Dynamically learns from past experiences and continuously adapts its policy to improve long-term performance by interacting with the hybrid storage system
- Learns the asymmetry in the read/write latencies present in modern hybrid storage devices while taking into account the inherent characteristics of an application

**Key Results:** Sibyl is evaluated on a real system with multiple device configurations

- Evaluated using a wide range of workloads from MSR Cambridge and Filebench
- In a performance-optimized (cost-optimized) hybrid storage configuration, Sibyl provides up to 21.6% (19.9%) performance improvement compared to prior data placement policies
- On a tri-hybrid storage system, Sibyl outperforms a heuristic-based policy by 23.9%-48.2%
- Sibyl achieves 80% of the performance of an oracle policy with a storage overhead of 124.4 KiB

### **Hybrid Storage Systems**

Logical Block Space (Application/File-system View)



### **Key Shortcomings of Prior Data Placement Techniques**

We observe **three key shortcomings** that significantly limit performance benefits of data-placement techniques

- Lack of adaptability
- Lack of device awareness
- Lack of extensibility



# Lack of Adaptability (1/2)

- Prior heuristic-based techniques consider only a few characteristics (e.g., access frequency) to perform data placement
- Statically tuned characteristics (based on fixed thresholds) are ineffective when used on a wide range of applications and system configurations
- Supervised learning techniques need labeled data and frequent retraining to adapt to varying workloads and system conditions

#### Prior techniques offer 41.1% lower performance compared to an Oracle policy

# Lack of Adaptability (2/2)



### Lack of Device Awareness

Prior data placement techniques:

- **do not adapt** well to changes in underlying device characteristics (e.g., storage read latency)
- **do not consider the data migration cost** between storage devices while making a data placement decision
- **are highly inefficient** in hybrid storage systems that have devices with significantly different read/write latencies

### Lack of Extensibility

- Prior data placement techniques are typically **designed** for a hybrid storage system with **only two storage devices**
- **Significant effort** is required to extend the data placement policies for more than two devices

Compared to an RL-based solution, a heuristic-based policy provides 48.2% lower performance when extended from two to three devices

### **Our Goal**

#### A data-placement mechanism that can

- dynamically derive an adaptive data-placement strategy by continuously learning and adapting to the application and underlying device characteristics
- be easily extended to incorporate a wide range of hybrid storage configurations

### **Basics of Reinforcement Learning**



- RL is a framework for decision making
  - An autonomous agent observes the current state of the environment
  - It interacts with the environment by taking actions
  - Agent is **rewarded** or **penalized** based on the consequences of its actions
  - Agent tries to maximize the cumulative reward

# **Applying RL to Data Placement**

Key factors in applying RL for data placement in a hybrid storage system

- RL agent needs to be aware of:
  - asymmetry in read/write latencies of a storage device
  - differences in latencies across hybrid storage devices
  - application access patterns
- Data placement module should decide which actions to reward and penalize (credit assignment)
- Low implementation overhead

### **RL State**

- Feature selection is performed to select only the most correlated features that affect data placement
- Divide the states into a small number of bins to reduce the state space

| Feature           | Description                                         | # of bins | Encoding (bits) |
|-------------------|-----------------------------------------------------|-----------|-----------------|
| size<sub>t</sub>  | Size of the requested page (in pages)               | 8         | 8               |
| type<sub>t</sub>  | Type of the current request (read/write)            | 2         | 4               |
| intr<sub>t</sub>  | Access interval of the requested page               | 64        | 8               |
| cnt<sub>t</sub>   | Access count of the requested page                  | 64        | 8               |
| cap<sub>t</sub>   | Remaining capacity in the fast storage device       | 8         | 8               |
| curr<sub>t</sub>  | Current placement of the requested page (fast/slow) | 2         | 4               |
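To make the binning concrete, the sketch below quantizes a raw feature value into a bin index and packs the six bin indices into one integer using the bit widths from the table. The uniform bin boundaries, the packing order, and the function names `to_bin` and `encode_state` are assumptions made for illustration; the slide gives only the bin counts and encoding widths.

```python
# (feature name, number of bins, encoding width in bits), as in the table
FEATURES = [
    ("size", 8, 8),
    ("type", 2, 4),
    ("intr", 64, 8),
    ("cnt", 64, 8),
    ("cap", 8, 8),
    ("curr", 2, 4),
]


def to_bin(value, max_value, num_bins):
    """Map a raw value in [0, max_value] to a bin index (uniform bins: assumed)."""
    clipped = min(max(value, 0), max_value)
    return min(int(clipped * num_bins / (max_value + 1)), num_bins - 1)


def encode_state(bin_indices):
    """Pack one bin index per feature into a single integer, first feature on top."""
    packed = 0
    for (name, num_bins, bits), idx in zip(FEATURES, bin_indices):
        if not 0 <= idx < num_bins:
            raise ValueError(f"{name}: bin {idx} outside 0..{num_bins - 1}")
        packed = (packed << bits) | idx
    return packed


STATE_BITS = sum(bits for _, _, bits in FEATURES)  # 8 + 4 + 8 + 8 + 8 + 4 = 40
state = encode_state([3, 1, 10, 63, 7, 0])
```

With these widths a full state observation fits in 40 bits, which is why reducing each feature to a small number of bins keeps the state space manageable.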

### Reward

- For every action at time-step *t*, Sibyl gets a reward from the environment at time-step *t* + 1
- Reward acts as feedback on the agent's past action
- Request latency faithfully captures the status of the hybrid storage system
- Penalty value is chosen to prevent the agent from aggressively servicing all the requests from the faster device

$$R = \begin{cases} \dfrac{1}{L_t} & \text{if no eviction} \\ \max\left(0, \dfrac{1}{L_t} - R_p\right) & \text{if an eviction happens} \end{cases}$$

where $L_t$ is the latency of the request and $R_p$ is the eviction penalty.
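The reward definition above translates directly into a small function. This is a sketch: the function name `sibyl_reward`, the argument names, and the latency unit and penalty values in the usage lines are assumptions, since the slide does not specify them.

```python
def sibyl_reward(latency, evicted, eviction_penalty):
    """Reward from the slide: 1/L_t, reduced by R_p and floored at 0 on eviction."""
    if latency <= 0:
        raise ValueError("latency must be positive")
    base = 1.0 / latency
    if not evicted:
        return base
    return max(0.0, base - eviction_penalty)


# Hypothetical numbers, chosen only to exercise both branches
no_eviction = sibyl_reward(latency=4.0, evicted=False, eviction_penalty=0.1)
with_eviction = sibyl_reward(latency=4.0, evicted=True, eviction_penalty=0.1)
```

Because the reward is the inverse of the latency, faster requests earn more, and the penalty term lowers the reward whenever a placement forces an eviction.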

### **Overview of Sibyl**



The two threads run asynchronously to prevent training delay from affecting the inference time
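One way to picture this decoupling is a decision path that hands its experiences to a background training thread through a queue, so the decision path never waits for training. The sketch below is an assumption-laden illustration: the function names, the placeholder policy, and the use of Python's `threading` and `queue` modules are not taken from the slides.

```python
import queue
import threading

experiences = queue.Queue()  # hand-off buffer between the two threads
trained = []                 # stands in for "weights updated from this experience"


def training_loop():
    """Consume experiences in the background until a None sentinel arrives."""
    while True:
        item = experiences.get()
        if item is None:
            break
        trained.append(item)  # placeholder for a real training step


def place(request_id):
    """Inference path: decide immediately, then record the experience."""
    decision = "fast" if request_id % 2 == 0 else "slow"  # placeholder policy
    experiences.put((request_id, decision))  # does not wait for training
    return decision


trainer = threading.Thread(target=training_loop)
trainer.start()
decisions = [place(i) for i in range(4)]
experiences.put(None)  # tell the training thread to stop
trainer.join()
```

The decision for each request is returned before the training thread has necessarily processed it, which is the property the slide describes.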

### Hyper-parameter Tuning

 Different hyper-parameter configurations were chosen using the design of experiments (DoE) technique

| Hyper-parameter                   | Design Space       | Chosen Value |
|-----------------------------------|--------------------|--------------|
| Discount factor $(\gamma)$        | 0-1                | 0.9          |
| Learning rate $(\alpha)$          | $10^{-5}$ to $10^{0}$ | $10^{-4}$ |
| Exploration rate ( $\epsilon$ )   | 0-1                | 0.001        |
| Batch size                        | 64-256             | 128          |
| Experience buffer size $(e_{EB})$ | 10-10000           | 1000         |
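The chosen values can be collected in one place and used to show what a small exploration rate means in practice. The slide does not name the exploration strategy, so the epsilon-greedy rule below is an assumption, as are the dictionary keys and the two-action Q-value list.

```python
import random

# Chosen values from the table above
HYPERPARAMS = {
    "discount_factor": 0.9,
    "learning_rate": 1e-4,
    "exploration_rate": 0.001,
    "batch_size": 128,
    "experience_buffer_size": 1000,
}


def epsilon_greedy(q_values, epsilon, rng):
    """Explore with probability epsilon, otherwise take the best-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


rng = random.Random(0)  # fixed seed so the run is repeatable
picks = [epsilon_greedy([0.2, 0.7], HYPERPARAMS["exploration_rate"], rng)
         for _ in range(1000)]
greedy_share = picks.count(1) / len(picks)
```

With an exploration rate of 0.001, almost every decision follows the learned Q-values and only about one in a thousand is a random probe.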

# **Evaluation Methodology**

- Evaluated on a real system with different hybrid storage configurations
- Hybrid storage system constitutes one contiguous logical block address space
- A custom block driver was implemented to manage the I/O requests to the storage devices
- We evaluate three different hybrid storage configurations
  - Performance-optimized (H&M)
  - Cost-optimized (H&L)
  - Tri-hybrid storage system

### **Evaluation Methodology**

| Host System | Characteristics |
|-------------|-----------------|
| AMD Ryzen 7 2700G [146] | 8 cores @ 3.5 GHz, 8×64/32 KiB L1-I/D, 4 MiB L2, 8 MiB L3, 16 GiB RDIMM DDR4 2666 MHz |

| Storage Device | Characteristics |
|----------------|-----------------|
| H: Intel Optane SSD P4800X [94] | 375 GB, PCIe 3.0 NVMe, SLC, R/W: 2.4/2 GB/s, random R/W: 550000/500000 IOPS |
| M: Intel SSD D3-S4510 [96] | 1.92 TB, SATA TLC (3D), R/W: 550/510 MB/s, random R/W: 895000/21000 IOPS |
| L: Seagate HDD ST1000DM010 [98] | 1 TB, SATA 6 Gb/s, 7200 RPM, max. sustained transfer rate: 210 MB/s |
| L<sub>SSD</sub>: ADATA SU630 SSD [99] | 960 GB, SATA 6 Gb/s, TLC, max. R/W: 520/450 MB/s |

| HSS Configuration | Fast Device | Slow Device |
|-------------------|-------------|-------------|
| H&M (Performance-oriented) | high-end (H) | middle-end (M) |
| H&L (Cost-oriented) | high-end (H) | low-end (L) |

# **Evaluation Methodology**

- 18 different workloads from MSR Cambridge and FileBench suites
- Sibyl is compared against **four baselines** 
  - Heuristic-based policies
    - Cold data eviction (CDE) [Matsui et al., "Design of Hybrid SSDs With Storage Class Memory and NAND Flash Memory," IEEE 2017]
    - History Page Scheduler (HPS) [Meswani et al., "Heterogeneous Memory Architectures: A HW/SW Approach for Mixing Die-stacked and Off-package Memories," HPCA, 2015]
  - Supervised learning-based policies
    - Recurrent neural network (RNN)-based technique adapted from Kleio [Doudali et al., "Kleio: A Hybrid Memory Page Scheduler with Machine Intelligence," HPDC, 2019]
    - Neural network-based classifier based on Archivist [Ren et al., "Archivist: A Machine Learning Assisted Data Placement Mechanism for Hybrid Storage Systems," ICCD, 2019]

### Latency Improvement



| Configuration | CDE   | HPS   | Archivist | RNN-HSS |
|---------------|-------|-------|-----------|---------|
| H&M           | 28.1% | 23.2% | 36.1%     | 21.6%   |
| H&L           | 19.9% | 45.9% | 68.8%     | 34.1%   |

### **Throughput Improvement**



| Configuration | CDE   | HPS   | Archivist | RNN-HSS |
|---------------|-------|-------|-----------|---------|
| H&M           | 32.6% | 21.9% | 54.2%     | 22.7%   |
| H&L           | 22.8% | 49.1% | 86.9%     | 41.9%   |

### Latency in Tri-Hybrid System



Sibyl outperforms the heuristic-based data placement policy for the tri-hybrid system by 48.2% on average across all workloads

### Latency for Unseen Workloads



In the H&M (H&L) configuration, Sibyl outperforms RNN-HSS and Archivist by 46.1% (54.6%) and 8.5% (44.1%), respectively

### **Sensitivity to Fast Storage Capacity**





Sibyl consistently provides the highest performance by dynamically adapting its data-placement policy



# **Overhead Analysis**

- Performance Overhead
  - ~10 ns for every inference on the evaluated system; this is several orders of magnitude less than the I/O latency of a high-end SSD

### Implementation Overhead

- **124.4 KiB** of implementation overhead
- Metadata overhead
  - 0.1% of the total storage capacity when using a 4 KiB data placement granularity
  - 40-bit metadata overhead per data placement unit
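The two metadata figures above are consistent with each other, as a quick check shows. This sketch only uses the numbers stated on the slide (40 bits per placement unit, 4 KiB granularity); the variable names are illustrative.

```python
METADATA_BITS = 40                 # per data placement unit (from the slide)
PLACEMENT_UNIT_BYTES = 4 * 1024    # 4 KiB data placement granularity

# Metadata bytes per unit divided by the unit size.
overhead = (METADATA_BITS / 8) / PLACEMENT_UNIT_BYTES
print(f"{overhead:.4%}")  # prints 0.1221%
```

Five bytes of metadata per 4096-byte unit gives roughly 0.12%, which rounds to the 0.1% quoted above.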

• To appear in ISCA 2022

#### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

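Gagandeep Singh<sup>1</sup>, Rakesh Nadig<sup>1</sup>, Jisung Park<sup>1</sup>, Rahul Bera<sup>1</sup>, Nastaran Hajinazar<sup>1</sup>, David Novo<sup>3</sup>, Juan Gómez-Luna<sup>1</sup>, Sander Stuijk<sup>2</sup>, Henk Corporaal<sup>2</sup>, Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich, <sup>2</sup>Eindhoven University of Technology, <sup>3</sup>LIRMM, Univ. Montpellier, CNRS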

#### https://arxiv.org/pdf/2205.07394.pdf

### Current EFCL Projects

- "A New Methodology and Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks: A Processing-in-Memory Case Study"
  - Data-centric
- "Machine-Learning-Assisted Intelligent Microarchitectures to Reduce Memory Access Latency"
  - Data-driven

- "Cross-layer Hardware/Software Techniques to Enable Powerful Computation and Memory Optimizations"
  - Data-aware

Cross-layer Hardware/Software Techniques to Enable Powerful Computation and Memory Optimizations

Ataberk Olgun, Konstantinos Kanellopoulos, Nisa Bostanci, Onur Mutlu





### Data-Aware Architectures

- A data-aware architecture understands what it can do with and to each piece of data
- It makes use of different properties of data to improve performance, efficiency and other metrics
  - Compressibility
  - Approximability
  - Locality
  - Sparsity
  - Criticality for Computation X
  - Access Semantics
  - ...

### One Problem: Limited Expressiveness





### A Solution: More Expressive Interfaces



Challenge and Opportunity for Future

# Data-Aware (Expressive) Computing Architectures

### Results So Far (2021-2022)

MetaSys: A Practical Open-Source Metadata Management System to Implement and Evaluate Cross-Layer Optimizations [ACM TACO 2022]

### MetaSys Open Source Framework

### Appears in ACM TACO 2022

MetaSys: A Practical Open-Source Metadata Management System to Implement and Evaluate Cross-Layer Optimizations

NANDITA VIJAYKUMAR, University of Toronto, Canada; ATABERK OLGUN, ETH Zurich, TOBB ETU, Turkey; KONSTANTINOS KANELLOPOULOS, ETH Zurich, Switzerland; F. NISA BOSTANCI, ETH Zurich, TOBB ETU, Turkey; HASAN HASSAN, ETH Zurich, Switzerland; MEHRSHAD LOTFI, Max Planck Institute, Germany; PHILLIP B. GIBBONS, Carnegie Mellon University, USA; ONUR MUTLU, ETH Zurich, Switzerland

https://arxiv.org/pdf/2105.08123.pdf



### A Practical Open-Source Metadata Management System to Implement and Evaluate Cross-Layer Optimizations

Nandita Vijaykumar, Ataberk Olgun, Konstantinos Kanellopoulos, F. Nisa Bostanci, Hasan Hassan, Mehrshad Lotfi, Phillip B. Gibbons, Onur Mutlu






## **Executive Summary**

#### Problem

- Cross-layer techniques are challenging to implement because they require full-stack changes
- Existing open-source infrastructures for implementing cross-layer techniques are not designed to provide key features

Key Idea – Provide:

- Rich dynamic HW/SW interfaces
- Low-overhead metadata management
- Interfaces to key hardware components (e.g., prefetcher, caches)

#### Our goal is twofold:

1. Develop an efficient and flexible framework to enable rapid implementation of new cross-layer techniques
2. Perform a detailed limit study to quantify the overheads associated with general metadata systems

## **The Atom Abstraction**

### An abstraction to express data semantics



## **The Software Interface**

### **Three Atom operators**





### **MetaSys Key Structures**



## **FPGA** Prototype

# Prototype on Xilinx Zedboard within a real RISC-V system (Rocket Chip)



### **Rocket Chip**



## **MetaSys in Rocket Chip**

Implement two main components:

### 1. Atom Controller

- Manages the attribute table (CREATE, (DE)ACTIVATE)
- Performs atom mapping (MAP/UNMAP)
  - Physical address  $\rightarrow$  Atom ID

### 2. Metadata Lookup Unit

- Responds to clients:
  - Provides atom attributes
- Contains the metadata mapping cache
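The division of work between the two components can be sketched in software. This is a simplified model under loud assumptions, not the MetaSys hardware or its API: the class and method names, the 4 KiB mapping granularity (`PAGE_SIZE`), the attribute name `access_pattern`, and the cache invalidation scheme are all hypothetical. Only the operator names (CREATE, (DE)ACTIVATE, MAP/UNMAP) and the physical address to atom ID mapping come from the slides.

```python
PAGE_SIZE = 4096  # assumed mapping granularity (not stated on the slides)


class AtomController:
    """Manages the attribute table and the physical address -> atom ID mapping."""

    def __init__(self):
        self.attributes = {}     # atom ID -> attribute dict (attribute table)
        self.active = set()      # atoms whose attributes are currently in effect
        self.page_to_atom = {}   # physical page number -> atom ID
        self.listeners = []      # callbacks invoked when a page mapping changes
        self._next_id = 0

    def create(self, **attrs):
        atom_id = self._next_id
        self._next_id += 1
        self.attributes[atom_id] = attrs
        return atom_id

    def activate(self, atom_id):
        self.active.add(atom_id)

    def deactivate(self, atom_id):
        self.active.discard(atom_id)

    def _pages(self, start, size):
        return range(start // PAGE_SIZE, (start + size - 1) // PAGE_SIZE + 1)

    def map(self, atom_id, start, size):
        for page in self._pages(start, size):
            self.page_to_atom[page] = atom_id
            for notify in self.listeners:
                notify(page)

    def unmap(self, start, size):
        for page in self._pages(start, size):
            self.page_to_atom.pop(page, None)
            for notify in self.listeners:
                notify(page)


class MetadataLookupUnit:
    """Answers client queries (e.g., from a prefetcher) with atom attributes."""

    def __init__(self, controller):
        self.controller = controller
        self.cache = {}  # metadata mapping cache: page -> atom ID (or None)
        controller.listeners.append(self.invalidate)

    def invalidate(self, page):
        self.cache.pop(page, None)

    def lookup(self, paddr):
        page = paddr // PAGE_SIZE
        if page not in self.cache:
            self.cache[page] = self.controller.page_to_atom.get(page)
        atom_id = self.cache[page]
        if atom_id is None or atom_id not in self.controller.active:
            return None
        return self.controller.attributes[atom_id]
```

A client would call `lookup` with a physical address and receive the attributes of the active atom covering that address, or `None` if no active atom is mapped there.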

## **MetaSys is Open Source**

### https://github.com/CMU-SAFARI/MetaSys

Repository description (from the GitHub page): MetaSys is the first open-source FPGA-based infrastructure with a prototype in a RISC-V core, to enable the rapid implementation and evaluation of a wide range of cross-layer software/hardware cooperative techniques in real hardware. Described in the pre-print: https://arxiv.org/abs/2105.08123

Top-level contents: `common`, `riscv-tools`, `rocket-chip`, `testchipip`, `zedboard`, `README.md`

|       | metasys_readme.md                                        | Update metasys_readme.md                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                    | 7 months ago                        | ళి 2 forks                                                                     |
|       | ≅ README.md                                              |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                    |                                     | Releases                                                                       |
|       | MetaSys                                                  |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                    |                                     | No releases published                                                          |
|       |                                                          | Sys repository to metasys_readme.md, where a walkthrough of an implement a walkthrough of a walkth |                                     | Packages<br>No packages published                                              |

For more details, please read our preprint on arXiv.

### SAFARI

## **Performance Overhead**



Metadata lookups incur low performance overhead: 2.7% on average



### **Impact of Tagging Granularity on TLB misses**



### Fine tagging granularities increase TLB misses
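One way to build intuition for this trend is a back-of-the-envelope sketch. It is not the MetaSys implementation. The 1-byte tag per tagged region, the 4 KiB page size, and the 64 MiB working set are all illustrative assumptions. Under those assumptions, a finer granularity produces more tags, so the metadata spreads over more pages and needs more TLB entries to cover.

```python
import math

def metadata_pages(working_set_bytes, granularity_bytes,
                   tag_bytes=1, page_bytes=4096):
    """Pages of metadata needed to tag a working set.

    tag_bytes and page_bytes are illustrative assumptions,
    not parameters taken from MetaSys.
    """
    num_tags = math.ceil(working_set_bytes / granularity_bytes)
    return math.ceil(num_tags * tag_bytes / page_bytes)

WORKING_SET = 64 * 1024 * 1024  # hypothetical 64 MiB working set

for granularity in (4096, 512, 64):
    pages = metadata_pages(WORKING_SET, granularity)
    print(f"granularity {granularity:>4} B -> {pages} metadata pages")
```

In this toy model, each 8x reduction in granularity multiplies the number of metadata pages by 8, which a fixed-size TLB has to cover.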





### A Practical Open-Source Metadata Management System to Implement and Evaluate Cross-Layer Optimizations

### Nandita Vijaykumar

Ataberk Olgun, Konstantinos Kanellopoulos, F. Nisa Bostanci

Hasan Hassan, Mehrshad Lotfi, Phillip B. Gibbons, Onur Mutlu

## SAFARI



UNIVERSITY OF TORONTO





### Results So Far (2021-2022)

- Many Educational and Outreach Efforts
  - 11 Different Courses (livestreamed & recorded)
  - Tutorials
  - Keynote Talks
  - Invited Talks
  - Conference & Workshop Talks
  - ...

- All are freely available on our YouTube Channel
  - https://www.youtube.com/onurmutlulectures

### A Detailed Tutorial

Onur Mutlu, "Memory-Centric Computing" *Education Class at <u>Embedded Systems Week (ESWEEK)</u>, Virtual, 9 October 2021. [<u>Slides (pptx) (pdf)</u>] [<u>Abstract (pdf)</u>] [<u>Talk Video (2 hours, including Q&A)</u>] [<u>Invited Paper at DATE 2021</u>] [<u>"A Modern Primer on Processing in Memory" paper</u>]* 

https://www.youtube.com/watch?v=N1Ac1ov1JOM


### Online Courses & Lectures

### First Computer Architecture & Digital Design Course

- Digital Design and Computer Architecture
- Spring 2022 Livestream Edition: <u>https://www.youtube.com/watch?v=cpXdE3HwvK0&list=PL5Q2soXY2Zi97Ya5DEUpMpO2bbAoaG7c6</u>
- Spring 2021 Livestream Edition: https://www.youtube.com/watch?v=LbC0EZY8yw4&list=PL5Q2soXY2Zi\_uej3aY39YB5pfW4SJ7LIN

### Advanced Computer Architecture Course

- Computer Architecture
- Fall 2021 Livestream Edition: https://www.youtube.com/watch?v=4yfkM\_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF
- Fall 2020 Edition: <u>https://www.youtube.com/watch?v=c3mPdZA-Fmc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN</u>


### Comp Arch (Fall'21)

#### Fall 2021 Edition:

- https://safari.ethz.ch/architecture/fall2021/doku.php?id=schedule

#### Fall 2020 Edition:

- https://safari.ethz.ch/architecture/fall2020/doku.php?id=schedule

#### Youtube Livestream (2021):

- https://www.youtube.com/watch?v=4yfkM\_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF

#### Youtube Livestream (2020):

- https://www.youtube.com/watch?v=c3mPdZA-Fmc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN

#### Master's level course

- Taken by Bachelor's/Master's/PhD students
- Cutting-edge research topics + fundamentals in Computer Architecture
- 5 Simulator-based Lab Assignments
- Potential research exploration
- Many research readings




#### Fall 2021 Lectures & Schedule


| Week | Date          | Livestream   | Lecture                                                              | Readings               | Lab          | HW          |
|------|---------------|--------------|----------------------------------------------------------------------|------------------------|--------------|-------------|
| W1   | 30.09<br>Thu. | YouTube Live | L1: Introduction and Basics                                          | Required<br>Mentioned  | Lab 1<br>Out | HW 0<br>Out |
|      | 01.10<br>Fri. | YouTube Live | L2: Trends, Tradeoffs and Design Fundamentals<br>(PDF) (PPT)         | Required<br>Mentioned  |              |             |
| W2   | 07.10<br>Thu. | YouTube Live | L3a: Memory Systems: Challenges and Opportunities<br>(PDF) (PPT)     | Described<br>Suggested |              | HW 1<br>Out |
|      |               |              | L3b: Course Info & Logistics                                         |                        |              |             |
|      |               |              | L3c: Memory Performance Attacks                                      | Described<br>Suggested |              |             |
|      | 08.10<br>Fri. | YouTube Live | L4a: Memory Performance Attacks                                      | Described<br>Suggested | Lab 2<br>Out |             |
|      |               |              | L4b: Data Retention and Memory Refresh                               | Described<br>Suggested |              |             |
|      |               |              | L4c: RowHammer<br>(PDF) (PPT)                                        | Described<br>Suggested |              |             |

### **PIM Course (Current)**

#### Spring 2022 Edition:

https://safari.ethz.ch/projects\_and\_seminars/spring2022/doku.php?id=processing\_in\_memory

#### Youtube Livestream:

https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX

#### Project course

- Taken by Bachelor's/Master's students
- Processing-in-Memory lectures
- Hands-on research exploration
- Many research readings



#### Recorded Lecture Playlist



#### Spring 2022 Meetings/Schedule

| Week | Date          | Livestream       | Meeting                                                               | Learning Materials                          | Assignments |
|------|---------------|------------------|-----------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 10.03<br>Thu. | YouTube Live     | M1: P&S PIM Course Presentation<br>(PDF) (PPT)                        | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 15.03<br>Tue. |                  | Hands-on Project Proposals                                            |                                             |             |
|      | 17.03<br>Thu. | YouTube Premiere | M2: Real-world PIM: UPMEM PIM<br>(PDF) (PPT)                          |                                             |             |
| W3   | 24.03<br>Thu. | YouTube Live     | M3: Real-world PIM: Microbenchmarking of UPMEM PIM<br>(PDF) (PPT)     |                                             |             |
| W4   | 31.03<br>Thu. | YouTube Live     | M4: Real-world PIM: Samsung HBM-PIM<br>(PDF) (PPT)                    |                                             |             |
| W5   | 07.04<br>Thu. | YouTube Live     | M5: How to Evaluate Data Movement Bottlenecks<br>(PDF) (PPT)          |                                             |             |
| W6   | 14.04<br>Thu. | YouTube Live     | M6: Real-world PIM: SK Hynix AiM<br>(PDF) (PPT)                       |                                             |             |
| W7   | 21.04<br>Thu. | YouTube Premiere | M7: Programming PIM Architectures<br>(PDF) (PPT)                      |                                             |             |
| W8   | 28.04<br>Thu. | YouTube Premiere | M8: Benchmarking and Workload Suitability on PIM<br>(PDF) (PPT)       |                                             |             |
| W9   | 05.05<br>Thu. | YouTube Premiere | M9: Real-world PIM: Samsung AxDIMM<br>(PDF) (PPT)                     |                                             |             |
| W10  | 12.05<br>Thu. |                  | M10: Real-world PIM: Alibaba HB-PNM                                   |                                             |             |


### PIM Course (Fall'21)

#### Fall 2021 Edition:

https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=processing\_in\_memory

#### Youtube Livestream:

https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX

#### Project course

- Taken by Bachelor's/Master's students
- Processing-in-Memory lectures
- Hands-on research exploration
- Many research readings

Lecture Video Playlist on YouTube




#### Fall 2021 Meetings/Schedule

| Week | Date          | Livestream   | Meeting                                                              | Learning Materials                          | Assignments |
|------|---------------|--------------|----------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 05.10<br>Tue. | YouTube Live | M1: P&S PIM Course Presentation<br>(PDF) (PPT)                       | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 12.10<br>Tue. | YouTube Live | M2: Real-World PIM Architectures                                     |                                             |             |
| W3   | 19.10<br>Tue. | YouTube Live | M3: Real-World PIM Architectures II<br>(PDF) (PPT)                   |                                             |             |
| W4   | 26.10<br>Tue. | YouTube Live | M4: Real-World PIM Architectures III<br>(PDF) (PPT)                  |                                             |             |
| W5   | 02.11<br>Tue. | YouTube Live | M5: Real-World PIM Architectures IV                                  |                                             |             |
| W6   | 09.11<br>Tue. | YouTube Live | M6: End-to-End Framework for Processing-using-Memory<br>(PDF) (PPT)  |                                             |             |
| W7   | 16.11<br>Tue. | YouTube Live | M7: How to Evaluate Data Movement Bottlenecks<br>(PDF) (PPT)         |                                             |             |
| W8   | 23.11<br>Tue. | YouTube Live | M8: Programming PIM Architectures                                    |                                             |             |
| W9   | 30.11<br>Tue. | YouTube Live | M9: Benchmarking and Workload Suitability on PIM<br>(PDF) (PPT)      |                                             |             |
| W10  | 07.12<br>Tue. | YouTube Live | M10: Bit-Serial SIMD Processing using DRAM<br>(PDF) (PPT)            |                                             |             |

### Hetero. Systems (Fall'21)

#### Fall 2021 Edition:

https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=heterogeneous\_systems

#### Youtube Livestream:

https://www.youtube.com/watch?v=QY\_bjwzsfMM&list=PL5Q2soXY2Zi\_OwkTgEyA6tk3UsoPBH737

#### Project course

- Taken by Bachelor's/Master's students
- GPU and Parallelism lectures
- Hands-on research exploration
- Many research readings



#### Fall 2021 Meetings/Schedule

| Week | Date          | Livestream   | Meeting                                                 | Learning Materials                          | Assignments |
|------|---------------|--------------|---------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 07.10<br>Thu. | YouTube Live | M1: P&S Course Presentation<br>(PDF) (PPT)              | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 14.10<br>Thu. | YouTube Live | M2: SIMD Processing and GPUs<br>(PDF) (PPT)             |                                             |             |
| W3   | 21.10<br>Thu. | YouTube Live | M3: GPU Software Hierarchy                              |                                             |             |
| W4   | 28.10<br>Thu. | YouTube Live | M4: GPU Memory Hierarchy                                |                                             |             |
| W5   | 04.11<br>Thu. | YouTube Live | M5: GPU Performance Considerations<br>(PDF) (PPT)       |                                             |             |
| W6   | 11.11<br>Thu. | YouTube Live | M6: Parallel Patterns: Reduction<br>(PDF) (PPT)         |                                             |             |
| W7   | 18.11<br>Thu. | YouTube Live | M7: Parallel Patterns: Histogram                        |                                             |             |
| W8   | 25.11<br>Thu. | YouTube Live | M8: Parallel Patterns: Convolution<br>(PDF) (PPT)       |                                             |             |
| W9   | 02.12<br>Thu. | YouTube Live | M9: Parallel Patterns: Prefix Sum (Scan)<br>(PDF) (PPT) |                                             |             |
| W10  | 09.12<br>Thu. | YouTube Live | M10: Parallel Patterns: Sparse Matrices<br>(PDF) (PPT)  |                                             |             |
| W11  | 16.12<br>Thu. | YouTube Live | M11: Parallel Patterns: Graph Search<br>(PDF) (PPT)     |                                             |             |
| W12  | 22.12<br>Thu. | YouTube Live | M12: Dynamic Parallelism                                |                                             |             |
| W13  | 06.01<br>Thu. | YouTube Live | M13: Collaborative Computing                            |                                             |             |

### Genomics (Fall 2021)

#### Fall 2021 Edition:

<u>https://safari.ethz.ch/projects\_and\_seminars/fall2021/doku.php?id=bioinformatics</u>

#### Youtube Livestream:

https://www.youtube.com/watch?v=MnogTeMjY8k&list=PL5Q2soXY2Zi8sngH-TrNZnDhDkPq55J9J

#### Project course

- Taken by Bachelor's/Master's students
- Genomics lectures
- Hands-on research exploration
- Many research readings




#### Fall 2021 Meetings/Schedule

| Week | Date          | Livestream   | Meeting                                                                                        | Learning Materials                          | Assignments |
|------|---------------|--------------|------------------------------------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 5.10<br>Tue.  | YouTube Live | M1: P&S Accelerating Genomics Course Introduction & Project Proposals<br>(PDF) (PPT) (Video)   | Required Materials<br>Recommended Materials |             |
| W2   | 20.10<br>Wed. | YouTube Live | M2: Introduction to Sequencing<br>(PDF) (PPT)                                                  |                                             |             |
| W3   | 27.10<br>Wed. | YouTube Live | M3: Read Mapping                                                                               |                                             |             |
| W4   | 3.11<br>Wed.  | YouTube Live | M4: GateKeeper<br>(PDF)                                                                        |                                             |             |
| W5   | 10.11<br>Wed. | YouTube Live | M5: MAGNET & Shouji                                                                            |                                             |             |
| W6   | 17.11<br>Wed. |              | M6.1: SneakySnake<br>(PDF) (PPT) (Video)                                                       |                                             |             |
|      |               |              | M6.2: GRIM-Filter<br>(PDF) (PPT) (Video)                                                       |                                             |             |
| W7   | 24.11<br>Wed. |              | M7: GenASM<br>(PDF) (PPT) (Video)                                                              |                                             |             |
| W8   | 01.12<br>Wed. | YouTube Live | M8: Genome Assembly                                                                            |                                             |             |
| W9   | 13.12<br>Mon. | YouTube Live | M9: GRIM-Filter<br>(PDF) (PPT)                                                                 |                                             |             |
| W10  | 15.12<br>Wed. | YouTube Live | M10: Genomic Data Sharing Under Differential Privacy<br>(PDF) (PPT)                            |                                             |             |

### HW/SW Co-Design (Spring 2022)

#### Spring 2022 Edition:

https://safari.ethz.ch/projects\_and\_seminars/spring2022/doku.php?id=hw\_sw\_co\_design

#### Youtube Livestream:

<u>https://youtube.com/playlist?list=PL5Q2soXY2Zi8nH7un3ghD2nutKWWDk-NK</u>

#### Project course

- Taken by Bachelor's/Master's students
- HW/SW co-design lectures
- Hands-on research exploration
- Many research readings



#### 2022 Meetings/Schedule (Tentative)

| Week | Date  | Livestream    | Meeting                                 | Materials | Assignments |
|------|-------|---------------|-----------------------------------------|-----------|-------------|
| W0   | 16.03 | You Tube Live | Intro to HW/SW Co-Design                | Required  | HW 0 Out    |
| W1   | 23.03 |               | Project selection                       | Required  |             |
| W2   | 30.03 | You Tube Live | Virtual Memory (I)<br>(PPTX) (PDF)      |           |             |
| W3   | 13.04 | You Tube Live | Virtual Memory (II)<br>(PPTX) (PDF)     |           |             |

### SSD Course (Spring 2022)

#### Spring 2022 Edition:

https://safari.ethz.ch/projects\_and\_seminars/spring2022/doku.php?id=modern\_ssds

#### Youtube Livestream:

https://www.youtube.com/watch?v= g4r m71DsY4&list=PL5Q2soXY2Zi8vabcse1kL22DEcgMI2RAq

#### Project course

- Taken by Bachelor's/Master's students
- SSD Basics and Advanced Topics
- Hands-on research exploration
- Many research readings

Lecture video screenshots: "P&S Modern SSDs, Meeting 2: Basics of NAND Flash-Based SSDs" (Dr. Jisung Park, Prof. Onur Mutlu, ETH Zürich, Spring 2022; streamed live on 25 March 2022) and "P&S Modern SSDs: Introduction to MQSim" (Rakesh Nadig, Dr. Jisung Park, Prof. Onur Mutlu, ETH Zürich).

### Hands-On Projects & Seminars Courses

### <u>https://safari.ethz.ch/projects\_and\_seminars</u>



#### SAFARI Projects & Seminars Courses (Spring 2022)

Welcome to the wiki for Project and Seminar courses SAFARI offers.

Courses we offer:

- Understanding and Improving Modern DRAM Performance, Reliability, and Security with Hands-On Experiments
- Designing and Evaluating Memory Systems and Modern Software Workloads with Ramulator
- Accelerating Genome Analysis with FPGAs, GPUs, and New Execution Paradigms
- Genome Sequencing on Mobile Devices
- Exploring the Processing-in-Memory Paradigm for Future Computing Systems
- Hands-on Acceleration on Heterogeneous Computing Systems
- Understanding and Designing Modern NAND Flash-Based Solid-State Drives (SSDs)
- Intelligent Architectures using Hardware/Software Cooperative Techniques


#### Research Talks https://www.youtube.com/onurmutlulectures






Backup Slides (for More Detail)

### Brief Self Introduction

### Onur Mutlu

- Full Professor @ ETH Zurich ITET (INFK), since Sept 2015
- Strecker Professor @ Carnegie Mellon University ECE (CS), 2009-2016, 2016-...
- Started the Comp Arch Research Group @ Microsoft Research, 2006-2009
- Worked @ Google, VMware, Microsoft Research, Intel, AMD
- PhD in Computer Engineering from University of Texas at Austin in 2006
- BS in Computer Engineering & Psychology from University of Michigan in 2000
- <u>https://people.inf.ethz.ch/omutlu/</u> <u>omutlu@gmail.com</u>

### Research and Teaching in:

- **Computer architecture, systems, hardware security, bioinformatics**
- Memory and storage systems
- Robust & dependable hardware systems: security, safety, predictability, reliability
- Hardware/software cooperation
- New computing paradigms; architectures with emerging technologies/devices
- Architectures for bioinformatics, genomics, health, medicine, AI/ML



A New Methodology and Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks: A Processing-in-Memory Case Study

**Backup Slides** 





### Executive Summary

- Data movement between memory units and compute units is a major system bottleneck
- Many workloads suffer from the data movement bottleneck
  - Machine learning, computational biology, graph processing, databases, video analytics, real-time data analytics...
- The traditional processor-centric design should evolve to a more data-centric design where processing elements are closer to where the data resides
  - Near-data processing (NDP), processing-in-memory (PIM)
  - Processing-immersed-with-memory (monolithic 3D integration)
- Some main challenges for adoption
  - Identifying suitable workloads and functions
  - Lack of profiling tools, analytical models, simulators...

### Data Movement in Computing Systems

- Data movement dominates performance and is a major system energy bottleneck
- Total system energy: data movement accounts for
  - 62% in consumer applications\*
  - 40% in scientific applications\*\*
  - 35% in mobile applications\*\*\*

\* Boroumand et al., "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks," ASPLOS 2018

\*\* Kestor et al., "Quantifying the Energy Cost of Data Movement in Scientific Applications," IISWC 2013

\*\*\* Pandiyan and Wu, "Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms," IISWC 2014


### Our Goal

- To enable adoptable PIM systems by solving various key challenges
  - Identifying workloads and functions that are suitable for processing near/inside where the data resides
    - Fundamental (architecture-independent) characterization
    - Architecture-specific suitability
  - Lack of tools
    - Profiling tools
    - Analytical performance and energy models (or ML-based models)
    - Simulation tools
- The project is organized in two phases
  - Phase 1: Methodology and benchmark suite
  - Phase 2: Follow-up project to enable PIM adoption

## Phase 1: Methodology and Benchmark Suite

- We aim to develop an understanding of modern workloads, with the key goal of identifying the workload characteristics and portions that would benefit from offloading to a PIM engine
- We intend to develop:
  - New profiling tools
  - Analytical and simulation models
  - Benchmark suites

- Three anticipated steps:
  1. Application profiling on modern processors
  2. Application characterization
  3. Performance analysis and validation

## Phase 1: Three Anticipated Steps (I)

- Step 1: Application profiling on modern processors
  - We will use
    - Profiling tools: VTune, perf, nvprof
    - Relevant metrics (e.g., cache misses, DRAM accesses, data locality)
  - Outcomes
    - Motivate the need for PIM and type of PIM
    - Strong empirical understanding about workload characteristics and suitable portions
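The metrics named in Step 1 can be derived from raw hardware counter totals. The following is a minimal sketch, assuming the totals have already been collected with a profiler such as perf. The `Counters` type, its field names, and the sample numbers are hypothetical and are not part of the project's tooling.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw event totals for one function (hypothetical field names)."""
    instructions: int
    llc_accesses: int
    llc_misses: int
    dram_bytes: int
    arithmetic_ops: int


def llc_mpki(c: Counters) -> float:
    """Last-level cache misses per thousand instructions."""
    return 1000.0 * c.llc_misses / c.instructions


def llc_miss_ratio(c: Counters) -> float:
    """Fraction of last-level cache accesses that miss."""
    return c.llc_misses / c.llc_accesses


def arithmetic_intensity(c: Counters) -> float:
    """Arithmetic operations per byte moved to or from DRAM."""
    return c.arithmetic_ops / c.dram_bytes


# Made-up counter values, for illustration only.
sample = Counters(
    instructions=2_000_000,
    llc_accesses=100_000,
    llc_misses=40_000,
    dram_bytes=2_560_000,
    arithmetic_ops=640_000,
)
print(f"LLC MPKI: {llc_mpki(sample):.1f}")
print(f"LLC miss ratio: {llc_miss_ratio(sample):.2f}")
print(f"Arithmetic intensity: {arithmetic_intensity(sample):.2f} ops/byte")
```

A function with a high miss rate and low arithmetic intensity spends most of its time moving data, which is the kind of evidence Step 1 is meant to surface.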

## Phase 1: Three Anticipated Steps (II)

- Step 2: Application characterization
  - Analysis of metrics
    - Architecture-independent and architecture-dependent metrics
    - Identification of key metrics
  - Huge design space
    - Types of cores, accelerators, functional units
    - Processing models on host and PIM side

## Phase 1: Three Anticipated Steps (III)

- Step 3: Performance analysis and validation
  - Rigorous analysis and validation
  - Commonalities between functions
  - Development and evaluation of general-purpose and special-purpose PIM

# Phase 1: Target Applications

- We will apply our methodology to more than 500 applications
  - Benchmark suites, frameworks
  - Custom open-source version
- Major domains
  - Machine learning
  - Graph processing
  - Data analytics
  - Databases
  - Bioinformatics
  - Image/video processing
  - Physics simulation
  - Etc.

# Phase 1: Types of PIM

#### Where?

- Processing in logic layer of 3D stacked memories
- Processing in memory controllers on die
- In-DRAM processing à la Ambit
- In-DRAM processing à la UPMEM
- In-cache processing

#### What?

- General-purpose and special-purpose PIM
- Host: CPU, GPU, FPGA, accelerators

### Phase 1: Tools

- Profiling tools
  - VTune
  - perf
  - Pin
- Simulation tools
  - Ramulator
  - gem5
  - GPGPU-Sim
  - Ramulator-PIM

#### DAMOV Analysis Methodology & Workloads

#### DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; LOIS OROSA, ETH Zürich, Switzerland; SAUGATA GHOSE, University of Illinois at Urbana–Champaign, USA; NANDITA VIJAYKUMAR, University of Toronto, Canada; IVAN FERNANDEZ, University of Malaga, Spain & ETH Zürich, Switzerland; MOHAMMAD SADROSADATI, Institute for Research in Fundamental Sciences (IPM), Iran & ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Prior NDP works investigate the root causes of data movement bottlenecks using different profiling methodologies and tools. However, there is still a lack of understanding about the key metrics that can identify different data movement bottlenecks and their relation to traditional and emerging data movement mitigation mechanisms. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques (e.g., caching and prefetching) to more memory-centric techniques (e.g., NDP), thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement.

With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.


#### https://arxiv.org/pdf/2105.03725.pdf

# **Step 1: Application Profiling**

- We analyze 345 applications from distinct domains:
  - Graph Processing
  - Deep Neural Networks
  - Physics
  - High-Performance Computing
  - Genomics
  - Machine Learning
  - Databases
  - Data Reorganization
  - Image Processing
  - Map-Reduce
  - Benchmarking
  - Linear Algebra



## **Methodology Overview**



### **DAMOV** is Open Source

• We open-source our benchmark suite and our toolchain



https://github.com/CMU-SAFARI/DAMOV



#### More on DAMOV Analysis Methodology & Workloads



https://www.youtube.com/watch?v=GWideVyo0nM&list=PL5Q2soXY2Zi_tOTAYm--dYByNPL7JhwR9&index=3

#### Experimental Analysis of the UPMEM PIM Engine

#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; IZZAT EL HAJJ, American University of Beirut, Lebanon; IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain; CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece; GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory* (*PIM*).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.


#### https://arxiv.org/pdf/2105.03814.pdf

## 2,560-DPU Processing-in-Memory System




# PrIM Benchmarks: Application Domains

| Domain                | Benchmark                     | Short name |
|-----------------------|-------------------------------|------------|
| Dense linear algebra  | Vector Addition               | VA         |
| Dense linear algebra  | Matrix-Vector Multiply        | GEMV       |
| Sparse linear algebra | Sparse Matrix-Vector Multiply | SpMV       |
| Databases             | Select                        | SEL        |
| Databases             | Unique                        | UNI        |
| Data analytics        | Binary Search                 | BS         |
| Data analytics        | Time Series Analysis          | TS         |
| Graph processing      | Breadth-First Search          | BFS        |
| Neural networks       | Multilayer Perceptron         | MLP        |
| Bioinformatics        | Needleman-Wunsch              | NW         |
| Image processing      | Image histogram (short)       | HST-S      |
| Image processing      | Image histogram (large)       | HST-L      |
| Parallel primitives   | Reduction                     | RED        |
| Parallel primitives   | Prefix sum (scan-scan-add)    | SCAN-SSA   |
| Parallel primitives   | Prefix sum (reduce-scan-scan) | SCAN-RSS   |
| Parallel primitives   | Matrix transposition          | TRNS       |
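The two prefix-sum variants among the parallel primitives differ in what each DPU does before and after a host-side step. The sketch below emulates both in plain Python, with contiguous list chunks standing in for DPUs. It is an illustration of the general scan-scan-add and reduce-scan-scan strategies only: the function names are hypothetical, no UPMEM SDK calls are used, and it is not the PrIM implementation.

```python
from itertools import accumulate
from typing import List


def split(data: List[int], parts: int) -> List[List[int]]:
    """Split data into contiguous chunks, one per emulated DPU."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def exclusive_offsets(totals: List[int]) -> List[int]:
    """Host step: exclusive scan over the per-chunk totals."""
    return [0] + list(accumulate(totals))[:-1]


def scan_ssa(data: List[int], parts: int) -> List[int]:
    """Scan-scan-add: local scan, host scan of chunk totals, local add."""
    local = [list(accumulate(chunk)) for chunk in split(data, parts)]
    offsets = exclusive_offsets([chunk[-1] for chunk in local])
    return [x + off for chunk, off in zip(local, offsets) for x in chunk]


def scan_rss(data: List[int], parts: int) -> List[int]:
    """Reduce-scan-scan: local reduce, host scan of totals, local scan from offset."""
    chunks = split(data, parts)
    offsets = exclusive_offsets([sum(chunk) for chunk in chunks])
    out: List[int] = []
    for chunk, off in zip(chunks, offsets):
        running = off
        for x in chunk:
            running += x
            out.append(running)
    return out


values = list(range(1, 11))
reference = list(accumulate(values))
print(scan_ssa(values, 3) == reference, scan_rss(values, 3) == reference)
```

Both variants produce the same inclusive prefix sum. The difference lies in how many passes each emulated DPU makes over its chunk and how much intermediate data it writes.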


### PrIM Benchmarks are Open Source

- All microbenchmarks, benchmarks, and scripts
- <u>https://github.com/CMU-SAFARI/prim-benchmarks</u>


#### PrIM (Processing-In-Memory Benchmarks)

PrIM is the first benchmark suite for a real-world processing-in-memory (PIM) architecture. PrIM is developed to evaluate, analyze, and characterize the first publicly-available real-world processing-in-memory (PIM) architecture, the UPMEM PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

PrIM provides a common set of workloads to evaluate the UPMEM PIM architecture with and can be useful for programming, architecture, and system researchers alike to improve multiple aspects of future PIM hardware and software. The workloads have different characteristics, exhibiting heterogeneity in their memory access patterns, operations and data types, and communication patterns. This repository also contains baseline CPU and GPU implementations of PrIM benchmarks for comparison purposes.

PrIM also includes a set of microbenchmarks that can be used to assess various architecture limits such as compute throughput and memory bandwidth.


### Understanding a Modern PIM Architecture




## Phase 2: Projects to Enable PIM Adoption (I)

- Learnings and artifacts (e.g., benchmark suites) from Phase 1 will allow us to start the exploration of the following lines of research
- 1. Design space exploration for PIM hardware accelerators
  - Understand hardware requirements of a vast range of dataflow-based PIM accelerators
  - Near-memory hardware accelerators for PIM-friendly workloads
  - Pre-RTL power-performance accelerator simulator
  - PIM hardware library that workloads can instantiate

### Phase 2: Projects to Enable PIM Adoption (II)

- 2. Runtime scheduling
  - Runtime schedulers that can automatically identify when to offload a code segment to a PIM core
  - Considerations:
    - Data movement
    - Memory coherence
    - Memory interference between applications and PIM cores
    - Offloading to different levels of the memory hierarchy
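One way to picture how a runtime scheduler might weigh these considerations is a simple cost comparison. The sketch below is purely hypothetical: the `SegmentEstimate` fields, the `should_offload` function, and the safety margin are assumptions made for illustration, not a scheduler proposed by the project.

```python
from dataclasses import dataclass


@dataclass
class SegmentEstimate:
    """Estimated costs for one code segment (all fields are hypothetical)."""
    host_time_us: float       # time to run the segment on the host CPU
    pim_time_us: float        # time to run the segment on a PIM core
    transfer_time_us: float   # time to move inputs and outputs to the PIM side
    coherence_time_us: float  # time to flush or invalidate shared data


def should_offload(seg: SegmentEstimate, margin: float = 1.1) -> bool:
    """Offload only if the total PIM-side cost beats the host cost by `margin`."""
    pim_total = seg.pim_time_us + seg.transfer_time_us + seg.coherence_time_us
    return pim_total * margin < seg.host_time_us


# Made-up estimates, for illustration only.
memory_bound = SegmentEstimate(host_time_us=100, pim_time_us=40,
                               transfer_time_us=20, coherence_time_us=5)
compute_bound = SegmentEstimate(host_time_us=50, pim_time_us=60,
                                transfer_time_us=10, coherence_time_us=5)
print(should_offload(memory_bound), should_offload(compute_bound))
```

A real scheduler would also need to account for interference between host applications and PIM cores, which this sketch leaves out.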

### Phase 2: Projects to Enable PIM Adoption (III)

- 3. PIM API
  - Search for common computation patterns
  - PIM API can facilitate PIM programmability
  - Techniques to maximize performance and bandwidth utilization, and to reduce communication

### Phase 2: Projects to Enable PIM Adoption (IV)

- 4. Extending our methodology to other levels of the memory/storage hierarchy
  - Computation offloading to other parts of the memory hierarchy
    - Near-cache computing
    - Near-storage computing
  - Memory technologies other than DRAM

### Term / Timeline

#### The project plan spans 3 years

| Project plan for Phase 1            | Year 1 |    |    |    | Year 2 |    |    |    | Year 3 |    |    |    |
|-------------------------------------|--------|----|----|----|--------|----|----|----|--------|----|----|----|
|                                     | Q1     | Q2 | Q3 | Q4 | Q1     | Q2 | Q3 | Q4 | Q1     | Q2 | Q3 | Q4 |
| Application profiling               |        |    |    |    |        |    |    |    |        |    |    |    |
| Application characterization        |        |    |    |    |        |    |    |    |        |    |    |    |
| Performance analysis and validation |        |    |    |    |        |    |    |    |        |    |    |    |

| Project plan for Phase 2  | Year 1 |    |    |    | Year 2 |    |    |    | Year 3 |    |    |    |
|---------------------------|--------|----|----|----|--------|----|----|----|--------|----|----|----|
|                           | Q1     | Q2 | Q3 | Q4 | Q1     | Q2 | Q3 | Q4 | Q1     | Q2 | Q3 | Q4 |
| Design space exploration  |        |    |    |    |        |    |    |    |        |    |    |    |
| Runtime scheduling        |        |    |    |    |        |    |    |    |        |    |    |    |
| PIM API                   |        |    |    |    |        |    |    |    |        |    |    |    |
| Extending our methodology |        |    |    |    |        |    |    |    |        |    |    |    |

### Expected Outcomes

- At the completion of Phase 1 of this project
  1. Tools for identification of PIM suitability
  2. Models for understanding and evaluating PIM architectures
  3. PIM benchmark suites
- At the completion of Phase 2 of this project
  1. Design space exploration of PIM accelerators
  2. Runtime schedulers
  3. Tools for PIM programming
  4. Tools for PIM suitability across other levels of memory/storage

Machine-Learning-Assisted Intelligent Microarchitectures to Reduce Memory Access Latency

**Backup Slides** 







# Pythia

## A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

<u>Rahul Bera</u>, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu

https://github.com/CMU-SAFARI/Pythia





# **Executive Summary**

- **Background**: Prefetchers predict addresses of future memory requests by associating memory access patterns with program context (called a feature)
- **Problem**: Three key shortcomings of prior prefetchers:
  - Predict mainly using a single program feature
  - Lack inherent system awareness (e.g., memory bandwidth usage)
  - Lack in-silicon customizability
- **Goal**: Design a prefetching framework that:
  - Learns from multiple features and inherent system-level feedback
  - Can be customized in silicon to use different features and/or prefetching objectives
- **Contribution**: Pythia, which formulates prefetching as a reinforcement learning problem
  - Takes adaptive prefetch decisions using multiple features and system-level feedback
  - Can be customized in silicon for target workloads via simple configuration registers
  - Proposes a realistic and practical implementation of RL algorithm in hardware
- **Key Results**:
  - Evaluated using a wide range of workloads from SPEC CPU, PARSEC, Ligra, Cloudsuite
  - Outperforms the best prior prefetcher (in the 1-core configuration) by **3.4%**, **7.7%**, and **17%** in 1-core, 4-core, and bandwidth-constrained configurations, respectively
  - Up to 7.8% more performance over basic Pythia across Ligra workloads via simple customization

#### https://github.com/CMU-SAFARI/Pythia

# **Key Shortcomings in Prior Prefetchers**

• We observe three key shortcomings that significantly limit performance benefits of prior prefetchers





# **Our Goal**

# A prefetching framework that can:

1. Learn to prefetch using multiple features and inherent system-level feedback information

2. Be **easily customized in silicon** to use different features and/or change prefetcher's objectives

# **Our Proposal**



# Pythia

# Formulates prefetching as a reinforcement learning problem



Pythia is named after the oracle of Delphi, who is known for her accurate prophecies https://en.wikipedia.org/wiki/Pythia

# **Basics of Reinforcement Learning (RL)**

 Algorithmic approach to learn to take an action in a given situation to maximize a numerical reward




- Agent stores Q-values for every state-action pair
  - **Expected return** for taking an action in a state

- Given a state, the agent selects the action that provides the highest Q-value
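The Q-value bookkeeping described above can be sketched as a small table with greedy action selection. This is a generic tabular illustration, not Pythia's hardware design: the `QTable` class, the SARSA-style update rule, and the learning rate and discount values are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = int


class QTable:
    """Stores one Q-value per (state, action) pair; unseen pairs default to 0."""

    def __init__(self, actions: List[Action], alpha: float = 0.5, gamma: float = 0.5):
        self.actions = actions
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q: Dict[Tuple[State, Action], float] = defaultdict(float)

    def best_action(self, state: State) -> Action:
        """Greedy policy: pick the action with the highest Q-value in this state."""
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s: State, a: Action, reward: float,
               s_next: State, a_next: Action) -> None:
        """SARSA-style update toward reward plus discounted next Q-value."""
        target = reward + self.gamma * self.q[(s_next, a_next)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])


table = QTable(actions=[0, 1, 2])
table.update("s0", 2, reward=10.0, s_next="s1", a_next=0)
print(table.q[("s0", 2)], table.best_action("s0"))
```

After a single positive reward for action 2 in state `"s0"`, the greedy policy prefers that action the next time the same state is observed.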

# **Formulating Prefetching as RL**


# **Pythia Overview**

- **Q-Value Store**: Records Q-values for *all* state-action pairs
- Evaluation Queue: A FIFO queue of recently-taken actions
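A FIFO queue of recently taken actions can be sketched with a bounded deque. The slide only states that the Evaluation Queue is a FIFO of recent actions. How rewards are attached to entries and what is done with an evicted entry are assumptions made here for illustration, and the `EQEntry` and `EvaluationQueue` names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class EQEntry:
    state: tuple                   # feature values observed when the action was taken
    action: int                    # prefetch offset that was chosen
    prefetch_addr: int             # address the action prefetched
    reward: Optional[int] = None   # filled in once the outcome is known


class EvaluationQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: Deque[EQEntry] = deque()

    def insert(self, entry: EQEntry) -> Optional[EQEntry]:
        """Append a new action; return the evicted oldest entry if the queue was full."""
        evicted = self.entries.popleft() if len(self.entries) >= self.capacity else None
        self.entries.append(entry)
        return evicted

    def on_demand_access(self, addr: int, reward: int) -> bool:
        """Assign `reward` to the oldest unrewarded entry that prefetched `addr`."""
        for entry in self.entries:
            if entry.prefetch_addr == addr and entry.reward is None:
                entry.reward = reward
                return True
        return False


eq = EvaluationQueue(capacity=2)
first = EQEntry(state=(0x400, 1), action=1, prefetch_addr=0x100)
second = EQEntry(state=(0x404, 2), action=2, prefetch_addr=0x140)
eq.insert(first)
eq.insert(second)
hit = eq.on_demand_access(0x140, reward=20)  # a demand request matches a prefetch
evicted = eq.insert(EQEntry(state=(0x408, 1), action=1, prefetch_addr=0x180))
print(hit, evicted is first, second.reward)
```

In this sketch the evicted entry is handed back to the caller, which is a natural point to update the Q-value store with whatever reward the entry has accumulated.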



# **Simulation Methodology**

- ChampSim [3] trace-driven simulator
- **150** single-core memory-intensive workload traces
  - SPEC CPU2006 and CPU2017
  - PARSEC 2.1
  - Ligra
  - Cloudsuite
- Homogeneous and heterogeneous multi-core mixes

- Five state-of-the-art prefetchers
  - SPP [Kim+, MICRO'16]
  - Bingo [Bakhshalipour+, HPCA'19]
  - MLOP [Shakerinava+, 3<sup>rd</sup> Prefetching Championship, 2019]
  - SPP+DSPatch [Bera+, MICRO'19]
  - SPP+PPF [Bhatia+, ISCA'20]

# **Basic Pythia Configuration**

- Derived from automatic design-space exploration
- State: 2 features
  - PC+Delta
  - Sequence of last-4 deltas
- Actions: 16 prefetch offsets
  - Ranging from -6 to +32, including 0
- Rewards:
  - $R_{AT} = +20$; $R_{AL} = +12$; $R_{NP}\text{-H} = -2$; $R_{NP}\text{-L} = -4$
  - $R_{IN}\text{-H} = -14$; $R_{IN}\text{-L} = -8$; $R_{CL} = -12$
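The reward values listed above can be read as a lookup keyed by prefetch outcome and memory bandwidth usage. The sketch below encodes them that way. The expansion of the subscripts (AT as accurate and timely, AL as accurate but late, CL as loss of coverage, IN as inaccurate, NP as no prefetch, with H and L for high and low bandwidth usage) is an interpretation that the slide itself does not spell out, so treat it as an assumption. The `Outcome` enum and `reward` function are hypothetical names.

```python
from enum import Enum
from typing import Dict, Tuple


class Outcome(Enum):
    ACCURATE_TIMELY = "AT"   # assumed meaning of the AT subscript
    ACCURATE_LATE = "AL"     # assumed meaning of the AL subscript
    LOSS_OF_COVERAGE = "CL"  # assumed meaning of the CL subscript
    INACCURATE = "IN"        # assumed meaning of the IN subscript
    NO_PREFETCH = "NP"       # assumed meaning of the NP subscript


# (reward under high bandwidth usage, reward under low bandwidth usage),
# using the values from the basic configuration.
REWARDS: Dict[Outcome, Tuple[int, int]] = {
    Outcome.ACCURATE_TIMELY: (20, 20),
    Outcome.ACCURATE_LATE: (12, 12),
    Outcome.LOSS_OF_COVERAGE: (-12, -12),
    Outcome.INACCURATE: (-14, -8),
    Outcome.NO_PREFETCH: (-2, -4),
}


def reward(outcome: Outcome, high_bandwidth_usage: bool) -> int:
    """Return the reward for an outcome at the current bandwidth usage level."""
    high, low = REWARDS[outcome]
    return high if high_bandwidth_usage else low


print(reward(Outcome.INACCURATE, high_bandwidth_usage=True))
print(reward(Outcome.NO_PREFETCH, high_bandwidth_usage=False))
```

Under this reading, an inaccurate prefetch is penalized more heavily when bandwidth usage is high, while choosing not to prefetch is penalized less, which is one way system-level feedback can shape the learned policy.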


# **Performance with Varying Core Count**




#### **Performance with Varying DRAM Bandwidth**






#### Pythia outperforms prior best prefetchers for a wide range of DRAM bandwidth configurations





## **Pythia's Overhead**

- 25.5 KB of total metadata storage per core
  - Only simple tables
- We also model functionally-accurate Pythia with full complexity in Chisel [4] HDL
  - Overhead is reported relative to a desktop-class 4-core Skylake processor (Xeon D-2123IT, 60W)

[4] https://www.chisel-lang.org



## More in the Paper

- Performance comparison with unseen traces
  - Pythia provides equally high performance benefits
- Comparison against multi-level prefetchers
- Performance sensitivity towards different features and hyperparameter values
- Detailed single-core and four-core performance

## **Pythia is Open Source**



#### https://github.com/CMU-SAFARI/Pythia

- MICRO'21 artifact evaluated
- ChampSim source code + Chisel modeling code
- All traces used for evaluation




# Pythia

#### A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

<u>Rahul Bera</u>, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu

https://github.com/CMU-SAFARI/Pythia





#### Self-Optimizing Memory Prefetchers

 Rahul Bera, Konstantinos Kanellopoulos, Anant Nori, Taha Shahroodi, Sreenivas Subramoney, and Onur Mutlu,
 "Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning" Proceedings of the <u>54th International Symposium on Microarchitecture</u> (MICRO), Virtual, October 2021.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]
 [Lightning Talk Slides (pptx) (pdf)]
 [Lightning Talk Video (1.5 minutes)]
 [Pythia Source Code (Officially Artifact Evaluated with All Badges)]
 [arXiv version]

#### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup>, Konstantinos Kanellopoulos<sup>1</sup>, Anant V. Nori<sup>2</sup>, Taha Shahroodi<sup>3,1</sup>, Sreenivas Subramoney<sup>2</sup>, Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

#### https://arxiv.org/pdf/2109.12021.pdf

#### An Intelligent Architecture

- Data-driven
  - Machine learns the "best" policies (how to do things)
- Sophisticated, workload-driven, changing, far-sighted policies
- Automatic data-driven policy learning
- All controllers are intelligent data-driven agents

#### We need to rethink design (of all controllers)

#### Self-Optimizing Hybrid Storage Systems

• To appear in ISCA 2022

#### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

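Gagandeep Singh<sup>1</sup>, Rakesh Nadig<sup>1</sup>, Jisung Park<sup>1</sup>, Rahul Bera<sup>1</sup>, Nastaran Hajinazar<sup>1</sup>, David Novo<sup>3</sup>, Juan Gómez-Luna<sup>1</sup>, Sander Stuijk<sup>2</sup>, Henk Corporaal<sup>2</sup>, Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Eindhoven University of Technology <sup>3</sup>LIRMM, Univ. Montpellier, CNRS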

#### https://arxiv.org/pdf/2205.07394.pdf

## **Sibyl:** Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez-Luna, Onur Mutlu




### **Executive Summary**

**Background:** Hybrid storage systems (HSSs) combine different storage technologies to extend overall capacity and reduce system cost with minimal effect on application performance.

**Problem:** Accurately identifying the performance-critical data of an application and placing it in the "best-fit" storage device. Prior data placement policies for hybrid storage systems (heuristic-based and supervised learning-based) have three key shortcomings:

- Lack of adaptability
- Lack of device awareness (e.g., read/write latencies of each device)
- Lack of extensibility

**Goal:** Develop a new, efficient, and high-performance data-placement mechanism for hybrid storage systems that can:

- Dynamically derive an adaptive data-placement strategy by continuously learning and adapting to the application and underlying device characteristics
- Be easily extended to incorporate a wide range of hybrid storage configurations

**Key Idea:** Sibyl, an online reinforcement learning-based self-adaptable mechanism for data placement that:

- Dynamically learns from past experiences and continuously adapts its policy to improve long-term performance by interacting with the hybrid storage system
- Learns the asymmetry in the read/write latencies present in modern hybrid storage devices while taking into account the inherent characteristics of an application

**Key Results:** Sibyl is evaluated on a real system with multiple device configurations

- Evaluated using a wide range of workloads from MSR Cambridge and Filebench
- In a performance-optimized (cost-optimized) hybrid storage configuration, Sibyl provides up to 21.6% (19.9%) performance improvement compared to prior data placement policies
- On a tri-hybrid storage system, Sibyl outperforms a heuristic-based policy by 23.9%-48.2%
- Sibyl achieves 80% of the performance of an oracle policy with a storage overhead of 124.4 KiB

### Outline

- Background
- Formulating Data Placement as an RL Problem
- Sibyl: Overview
- Evaluation and Key Results
- Conclusion



### **Hybrid Storage Systems**

Logical Block Space (Application/File-system View)



#### **Key Shortcomings of Prior Data Placement Techniques**

We observe three key shortcomings that significantly limit the performance benefits of data-placement techniques:

- Lack of adaptability
- Lack of device awareness
- Lack of extensibility



## Lack of Adaptability (1/2)

- Prior heuristic-based techniques consider only a few characteristics (e.g., access frequency) to perform data placement
- Statically tuned characteristics (based on fixed thresholds) are ineffective when used on a wide range of applications and system configurations
- Supervised learning techniques need labeled data and frequent retraining to adapt to varying workloads and system conditions

Prior techniques offer 41.1% lower performance compared to an Oracle policy

### Lack of Adaptability (2/2)



### Lack of Device Awareness

Prior data placement techniques:

- **do not adapt** well to changes in underlying device characteristics (e.g., storage read latency)
- **do not consider the data migration cost** between storage devices while making a data placement decision
- **are highly inefficient** in hybrid storage systems that have devices with significantly different read/write latencies

### Lack of Extensibility

- Prior data placement techniques are typically **designed** for a hybrid storage system with **only two storage devices**
- **Significant effort** is required to extend the data placement policies for more than two devices

Compared to an RL-based solution, a heuristic-based policy provides 48.2% lower performance when extended from two to three devices

### **Our Goal**

#### A data-placement mechanism that can

- dynamically derive an adaptive data-placement strategy by continuously learning and adapting to the application and underlying device characteristics
- be easily extended to incorporate a wide range of hybrid storage configurations




### **Basics of Reinforcement Learning**



- RL is a framework for decision making
  - An autonomous agent observes the current state of the environment
  - It interacts with the environment by taking actions
  - Agent is **rewarded** or **penalized** based on the consequences of its actions
  - Agent tries to maximize the cumulative reward
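The loop described above (observe state, act, receive reward, maximize cumulative reward) can be sketched as follows. This is a minimal illustration only: the toy environment, its reward values, the class names, and the use of tabular Q-learning with epsilon-greedy exploration are all assumptions made for the example, not something taken from the slides.

```python
import random


class ToyEnvironment:
    """Hypothetical two-state, two-action environment (not from the slides)."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Assumed rewards: action 1 is always better than action 0.
        reward = 1.0 if action == 1 else 0.2
        self.state = action  # the next state simply remembers the last action
        return self.state, reward


class QAgent:
    """Tabular Q-learning agent with epsilon-greedy exploration."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, epsilon=0.2, seed=1):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # Explore with probability epsilon, otherwise pick the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_actions)
        row = self.q[state]
        return row.index(max(row))

    def learn(self, state, action, reward, next_state):
        # Move Q(s, a) toward reward + discounted best value of the next state.
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def run(steps=20000):
    env, agent = ToyEnvironment(), QAgent(n_states=2, n_actions=2)
    state, total_reward = env.state, 0.0
    for _ in range(steps):
        action = agent.act(state)                       # agent acts on the observed state
        next_state, reward = env.step(action)           # environment responds
        agent.learn(state, action, reward, next_state)  # reward or penalty feeds back
        total_reward += reward
        state = next_state
    return agent, total_reward


if __name__ == "__main__":
    agent, total_reward = run()
    print(agent.q, total_reward)
```

After enough steps the agent's estimates favor the higher-reward action in both states, which is the sense in which it "maximizes the cumulative reward".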

## **Applying RL to Data Placement**

Key factors in applying RL for data placement in a hybrid storage system

- RL agent needs to be aware of:
  - asymmetry in read/write latencies of a storage device
  - differences in latencies across hybrid storage devices
  - application access patterns
- Data placement module should decide which actions to reward and penalize (credit assignment)
- Low implementation overhead

## **RL Formulation**

- Sibyl observes multiple application/device features for every storage request to make a data placement decision
- Possible actions: placing data in the fast or the slow device
- For every action, Sibyl receives a reward that takes into account the data placement decision and state of the environment
- Sibyl finds an **optimal data placement policy** that increases the overall performance for any workload and system configuration

### **RL State**

- Feature selection is performed to select only the most correlated features that affect data placement
- Divide the states into a small number of bins to reduce the state space

| Feature           | Description                                         | # of bins | Encoding (bits) |
|-------------------|-----------------------------------------------------|-----------|-----------------|
| size<sub>t</sub>  | Size of the requested page (in pages)               | 8         | 8               |
| type<sub>t</sub>  | Type of the current request (read/write)            | 2         | 4               |
| intr<sub>t</sub>  | Access interval of the requested page               | 64        | 8               |
| cnt<sub>t</sub>   | Access count of the requested page                  | 64        | 8               |
| cap<sub>t</sub>   | Remaining capacity in the fast storage device       | 8         | 8               |
| curr<sub>t</sub>  | Current placement of the requested page (fast/slow) | 2         | 4               |
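As a rough illustration of the binning and encoding in the table, the sketch below quantizes raw feature values into bins and packs the bin indices into one state word. The bin counts and bit widths come from the table; the uniform bin boundaries, the field order, and the function names `to_bin` and `encode_state` are assumptions made for this example.

```python
# (feature name, number of bins, encoding width in bits), as listed in the table.
FEATURES = [
    ("size", 8, 8),
    ("type", 2, 4),
    ("intr", 64, 8),
    ("cnt", 64, 8),
    ("cap", 8, 8),
    ("curr", 2, 4),
]

# Total width of the packed state word.
STATE_BITS = sum(width for _, _, width in FEATURES)


def to_bin(value, max_value, n_bins):
    """Map an integer in [0, max_value] to a bin index in [0, n_bins - 1].

    Uniform bins are an assumption; the slides do not give bin boundaries.
    """
    value = min(max(value, 0), max_value)  # clamp out-of-range inputs
    return value * n_bins // (max_value + 1)


def encode_state(bins):
    """Pack per-feature bin indices into one integer, first feature in the top bits."""
    word = 0
    for name, n_bins, width in FEATURES:
        index = bins[name]
        if not 0 <= index < n_bins:
            raise ValueError(f"{name}: bin index {index} outside [0, {n_bins})")
        word = (word << width) | index
    return word


if __name__ == "__main__":
    example = {"size": 3, "type": 1, "intr": 10, "cnt": 5, "cap": 7, "curr": 1}
    print(STATE_BITS, hex(encode_state(example)))
```

The encoding widths in the table sum to 40 bits, so one state fits comfortably in a 64-bit word.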

#### Reward

- For every action at time-step *t*, Sibyl gets a reward from the environment at time-step *t* + 1
- The reward acts as feedback on the agent's past action
- Request latency faithfully captures the status of the hybrid storage system
- Penalty value is chosen to prevent the agent from aggressively servicing all the requests from the faster device

$$R = \begin{cases} \dfrac{1}{L_t} & \text{if no eviction} \\ \max\left(0, \dfrac{1}{L_t} - R_p\right) & \text{if an eviction happens} \end{cases}$$

where $L_t$ is the latency of the request and $R_p$ is the eviction penalty.
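The reward definition translates directly into a small function. The sketch below follows the two cases of the formula; the function and parameter names are hypothetical, and the latency unit is left to the caller because the slides do not specify one.

```python
def reward(latency, evicted, eviction_penalty):
    """Reward for one request: 1/L_t, reduced by R_p (floored at 0) on eviction.

    latency          -- L_t, latency of the served request (must be positive)
    evicted          -- True if serving the request triggered an eviction
    eviction_penalty -- R_p, in the same units as 1/latency
    """
    if latency <= 0:
        raise ValueError("latency must be positive")
    base = 1.0 / latency
    if not evicted:
        return base
    return max(0.0, base - eviction_penalty)


if __name__ == "__main__":
    print(reward(2.0, evicted=False, eviction_penalty=0.25))   # fast request, no eviction
    print(reward(2.0, evicted=True, eviction_penalty=0.25))    # same latency, with eviction
    print(reward(10.0, evicted=True, eviction_penalty=0.25))   # slow request, with eviction
```

Lower latency gives a higher reward, and the penalty term makes placements that force evictions from the fast device less attractive.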




### **Overview of Sibyl**



The two threads run asynchronously to prevent training delay from affecting the inference time


In the RL training thread, Sibyl uses explored experiences to autonomously update its decision-making policy

### Hyper-parameter Tuning

 Different hyper-parameter configurations were chosen using the design of experiments (DoE) technique

| Hyper-parameter                   | Design Space            | Chosen Value |
|-----------------------------------|-------------------------|--------------|
| Discount factor ($\gamma$)        | 0-1                     | 0.9          |
| Learning rate ($\alpha$)          | $10^{-5}$ to $10^{0}$   | $10^{-4}$    |
| Exploration rate ($\epsilon$)     | 0-1                     | 0.001        |
| Batch size                        | 64-256                  | 128          |
| Experience buffer size ($e_{EB}$) | 10-10000                | 1000         |
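To show where the chosen values would plug in, here is a minimal sketch of an experience buffer, epsilon-greedy action selection, and a one-step temporal-difference target. The constants come from the table; the class and function names are hypothetical, and the generic TD target is a stand-in, since this section does not describe Sibyl's actual learning algorithm.

```python
import random
from collections import deque

# Chosen values from the hyper-parameter table.
GAMMA = 0.9          # discount factor
ALPHA = 1e-4         # learning rate (would scale each update step)
EPSILON = 0.001      # exploration rate
BATCH_SIZE = 128
BUFFER_SIZE = 1000


class ExperienceBuffer:
    """Fixed-capacity buffer; the oldest experiences are dropped first."""

    def __init__(self, capacity=BUFFER_SIZE, seed=0):
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.items.append((state, action, reward, next_state))

    def sample(self, batch_size=BATCH_SIZE):
        # Return an empty batch until enough experiences have been collected.
        if len(self.items) < batch_size:
            return []
        return self.rng.sample(list(self.items), batch_size)

    def __len__(self):
        return len(self.items)


def choose_action(q_values, rng, epsilon=EPSILON):
    """Epsilon-greedy: explore with probability epsilon, else take the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def td_target(reward, next_q_values, gamma=GAMMA):
    """One-step target: immediate reward plus discounted best next value."""
    return reward + gamma * max(next_q_values)
```

With an exploration rate of 0.001, roughly one decision in a thousand is random, so nearly all requests follow the learned policy.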




### **Evaluation Methodology**

- Evaluated on a real system with different hybrid storage configurations
- Hybrid storage system constitutes one contiguous logical block address space
- A custom block driver was implemented to manage the I/O requests to the storage devices
- We evaluate three different hybrid storage configurations
  - Performance-optimized (H&M)
  - Cost-optimized (H&L)
  - Tri-hybrid storage system

### **Evaluation Methodology**

| Host System | AMD Ryzen 7 2700G [146], 8-cores@3.5 GHz, 8×64/32 KiB L1-I/D, 4 MiB L2, 8 MiB L3, 16 GiB RDIMM DDR4 2666 MHz |
|-------------|--------------------------------------------------------------------------------------------------------------|

| Storage Device                          | Characteristics                                                               |
|-----------------------------------------|-------------------------------------------------------------------------------|
| H: Intel Optane SSD P4800X [94]         | 375 GB, PCIe 3.0 NVMe, SLC, R/W: 2.4/2 GB/s, random R/W: 550000/500000 IOPS   |
| M: Intel SSD D3-S4510 [96]              | 1.92 TB, SATA TLC (3D), R/W: 550/510 MB/s, random R/W: 895000/21000 IOPS      |
| L: Seagate HDD ST1000DM010 [98]         | 1 TB, SATA 6 Gb/s, 7200 RPM, max. sustained transfer rate: 210 MB/s           |
| L<sub>SSD</sub>: ADATA SU630 SSD [99]   | 960 GB, SATA 6 Gb/s, TLC, max. R/W: 520/450 MB/s                              |

| HSS Configuration          | Fast Device  | Slow Device    |
|----------------------------|--------------|----------------|
| H&M (Performance-oriented) | high-end (H) | middle-end (M) |
| H&L (Cost-oriented)        | high-end (H) | low-end (L)    |

### **Evaluation Methodology**

- 18 different workloads from MSR Cambridge and FileBench suites
- Sibyl is compared against **four baselines**
  - Heuristic-based policies
    - Cold data eviction (CDE) [Matsui et al., "Design of Hybrid SSDs With Storage Class Memory and NAND Flash Memory," IEEE 2017]
    - History Page Scheduler (HPS) [Meswani et al., "Heterogeneous Memory Architectures: A HW/SW Approach for Mixing Die-stacked and Off-package Memories," HPCA, 2015]
  - Supervised learning-based policies
    - Recurrent neural network (RNN)-based technique adapted from Kleio [Doudali et al., "Kleio: A Hybrid Memory Page Scheduler with Machine Intelligence," HPDC, 2019]
    - Neural network-based classifier based on Archivist [Ren et al., "Archivist: A Machine Learning Assisted Data Placement Mechanism for Hybrid Storage Systems," ICCD, 2019]

### Latency for HSS Configurations



Sibyl's latency improvement over each baseline policy:

| Configuration | CDE   | HPS   | Archivist | RNN-HSS |
|---------------|-------|-------|-----------|---------|
| H&M           | 28.1% | 23.2% | 36.1%     | 21.6%   |
| H&L           | 19.9% | 45.9% | 68.8%     | 34.1%   |

### **Throughput for HSS Configurations**



Sibyl's throughput improvement over each baseline policy:

| Configuration | CDE   | HPS   | Archivist | RNN-HSS |
|---------------|-------|-------|-----------|---------|
| H&M           | 32.6% | 21.9% | 54.2%     | 22.7%   |
| H&L           | 22.8% | 49.1% | 86.9%     | 41.9%   |

### Latency in Tri-Hybrid System



Sibyl outperforms the heuristic-based data placement policy for a tri-hybrid system by 48.2% on average across all workloads

### Latency for Unseen Workloads



In H&M (H&L) configurations, Sibyl outperforms RNN-HSS and Archivist by 46.1% (54.6%) and 8.5% (44.1%), respectively

### **Sensitivity to Fast Storage Capacity**




Sibyl consistently provides the highest performance by dynamically adapting its data-placement policy



### **Overhead Analysis**

- Performance overhead
  - ~10 ns for every inference on the evaluated system; this is several orders of magnitude lower than the I/O latency of a high-end SSD
- Implementation overhead
  - **124.4 KiB** of implementation overhead
- Metadata overhead
  - 0.1% of the total storage capacity when using a 4 KiB data placement granularity
  - 40-bit metadata overhead per data placement unit
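The two metadata figures are consistent with each other, which a short calculation makes explicit. The sketch below only restates the arithmetic; the constant and function names are hypothetical.

```python
METADATA_BITS_PER_UNIT = 40        # metadata per data placement unit (from the slide)
PLACEMENT_UNIT_BYTES = 4 * 1024    # 4 KiB data placement granularity (from the slide)


def metadata_overhead_fraction(metadata_bits=METADATA_BITS_PER_UNIT,
                               unit_bytes=PLACEMENT_UNIT_BYTES):
    """Fraction of storage capacity consumed by per-unit metadata."""
    return metadata_bits / (unit_bytes * 8)


if __name__ == "__main__":
    fraction = metadata_overhead_fraction()
    # 40 bits / 32768 bits = 0.00122..., i.e. about 0.12% of capacity
    print(f"{fraction:.4%}")
```

So 40 bits of metadata for every 4 KiB unit comes to about 0.12% of capacity, in line with the roughly 0.1% quoted above.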


### Cross-layer Hardware/Software Techniques to Enable Powerful Computation and Memory Optimizations

**Backup Slides** 







#### A Practical Open-Source Metadata Management System to Implement and Evaluate Cross-Layer Optimizations

Nandita Vijaykumar, Ataberk Olgun, Konstantinos Kanellopoulos, F. Nisa Bostanci, Hasan Hassan, Mehrshad Lotfi, Phillip B. Gibbons, Onur Mutlu

# SAFARI



UNIVERSITY OF TORONTO





# **Executive Summary**

#### Problem

- Cross-layer techniques are challenging to implement because they require full-stack changes
- Existing open-source infrastructures for implementing cross-layer techniques are not designed to provide key features

Key Idea – Provide:

- Rich dynamic HW/SW interfaces
- Low-overhead metadata management
- Interfaces to key hardware components (e.g., prefetcher)

#### Our goal is twofold:

1. Develop an efficient and flexible framework to enable rapid implementation of new cross-layer techniques
2. Perform a detailed limit study to quantify the overheads associated with general metadata systems

### Outline

- Background on Expressive Memory
- MetaSys
  - Software Interface
  - Key Structures
- FPGA Implementation
- Evaluation



### Higher-level information is not visible to HW



### With a richer abstraction, SW can provide program information that can significantly help hardware








# **Metadata: Data Semantics**





# The ATOM

### An abstraction to express data semantics



# **The Software Interface**

### **Three Atom operators**





### **MetaSys Key Structures**






# **FPGA** Prototype

# Prototype on Xilinx Zedboard within a real RISC-V system (Rocket Chip)



### **Rocket Chip**



# **MetaSys in Rocket Chip**

Implement two main components:

#### 1. Atom Controller

- Manages the attribute table (CREATE, (DE)ACTIVATE)
- Performs atom mapping (MAP/UNMAP)
  - Physical address  $\rightarrow$  Atom ID

#### 2. Metadata Lookup Unit

- Responds to clients:
  - Provides atom attributes
- Contains the metadata mapping cache
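The division of labor between the two components can be modeled in software roughly as follows. This is a behavioral sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, the operator semantics are inferred only from the operator names on this slide, the 512 B tagging granularity is the value listed in the evaluation configuration, and the metadata mapping cache is omitted (lookups go straight to the mapping table).

```python
TAGGING_GRANULARITY = 512  # bytes per tagged block (value from the evaluation setup)


def _blocks(paddr, size):
    """Block numbers covered by the physical range [paddr, paddr + size)."""
    first = paddr // TAGGING_GRANULARITY
    last = (paddr + size - 1) // TAGGING_GRANULARITY
    return range(first, last + 1)


class AtomController:
    """Keeps the attribute table and the physical-address-to-atom mapping."""

    def __init__(self):
        self.attributes = {}  # atom ID -> attributes (the attribute table)
        self.active = set()   # atom IDs that are currently activated
        self.mapping = {}     # physical block number -> atom ID

    def create(self, atom_id, attributes):
        self.attributes[atom_id] = dict(attributes)

    def activate(self, atom_id):
        self.active.add(atom_id)

    def deactivate(self, atom_id):
        self.active.discard(atom_id)

    def map(self, atom_id, paddr, size):
        for block in _blocks(paddr, size):
            self.mapping[block] = atom_id

    def unmap(self, paddr, size):
        for block in _blocks(paddr, size):
            self.mapping.pop(block, None)


class MetadataLookupUnit:
    """Answers client queries: physical address -> attributes of the mapped atom."""

    def __init__(self, controller):
        self.controller = controller

    def lookup(self, paddr):
        atom_id = self.controller.mapping.get(paddr // TAGGING_GRANULARITY)
        if atom_id is None or atom_id not in self.controller.active:
            return None  # no atom mapped here, or the atom is deactivated
        return self.controller.attributes[atom_id]
```

A client such as a prefetcher would call `lookup` with the physical address of a memory request and adjust its behavior based on the returned attributes.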

# **Changes in Rocket Chip**



## **Source on Github**

#### https://github.com/CMU-SAFARI/MetaSys

The repository's About text describes MetaSys as the first open-source FPGA-based infrastructure with a prototype in a RISC-V core, built to enable the rapid implementation and evaluation of a wide range of cross-layer software/hardware cooperative techniques in real hardware. It is described in the preprint at https://arxiv.org/abs/2105.08123.

The README refers developers to metasys_readme.md, which describes the modifications to the existing rocket-chip code base and walks through an implementation of the prefetching use case described in the paper.

For more details, please read our preprint on arXiv.




### **Characterizing Metadata Management**

### Our second goal:

- Perform a detailed limit study to quantify the overheads associated with general metadata systems
  - Quantify the overheads of performing lookups in MetaSys



# **Evaluation Methodology**

Run workloads on MetaSys prototype (Zedboard):

- Microbenchmarks: Represent a variety of memory access patterns
- Polybench: Scientific computation kernels
- Ligra: Graph workloads

| Component              | Configuration                                                                                                                                                                                                                                     |
|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CPU                    | 25 MHz; in-order Rocket core [21]                                                                                                                                                                                                                 |
| TLB                    | 16-entry DTLB; LRU policy                                                                                                                                                                                                                         |
| L1 Data + Inst. Cache  | 16 KB, 4-way; 4-cycle; 64 B line; LRU policy; MSHR size: 2                                                                                                                                                                                        |
| MMC                    | NMRU policy; 128 entries; 38 bits/entry; tagging granularity: 512 B                                                                                                                                                                               |
| Private Metadata Table | 256 entries; 64 B/entry                                                                                                                                                                                                                           |
| DRAM                   | 533 MHz; V<sub>dd</sub>: 1.5 V                                                                                                                                                                                                                    |
| Workloads              | Ligra [36]: PageRank (PR), Shortest Path (SSSP), Collaborative Filtering (CF), Teenage Follower (TF), Triangle Counting (TC), Breadth-First Search (BFS), Radius Estimation (Radii), Connected Components (CC); Polybench [37]; μBenchmarks |



# **Performance Overhead**



Metadata lookups incur low performance overhead: 2.7% on average



### **MMC** hit rate



MMC can cover ~81% of all memory requests on average

MMC hit rate correlates with locality of application requests

### Impact of MMC size



Workloads with low temporal and spatial locality are not sensitive to MMC size



### **Impact of Tagging Granularity**



Performance impact increases with finer granularity

### **Impact of Tagging Granularity on TLB misses**



#### Fine tagging granularities increase TLB misses



# **Effect of Contention**

**One Client:** All memory requests originating from the Rocket core

**Two Clients:** One client + all memory requests originating from the page table walker



Multiple clients do not significantly affect performance (0.3% overhead on average)






