Memory System Design for AI/ML Accelerators & ML/AI Techniques for Memory System Design

> Onur Mutlu omutlu@gmail.com https://people.inf.ethz.ch/omutlu 30 August 2022 SRC AIHW Annual Review

SAFARI

**ETH** zürich





### Confidentiality

- By reviewing this presentation or participating in an SRC event, you are agreeing not to use the presented information for purposes unrelated to the event until approved by SRC;
- Material may be presented that represents current research, some of which has not been published or protected. This material is not for public disclosure and until potential IP rights have been protected, please treat all of the information presented as <u>confidential information</u> which is the property of the researcher and their university.



Problem and Background

Task Overview

Technical Challenges, Goals and Ideas

Ideas, Results and Papers from the Past Year



# Computing is Bottlenecked by Data



# Data is Key for AI, ML, Genomics, ...

Important workloads are all data intensive

 They require rapid and efficient processing of large amounts of data

- Data is increasing
  - We can generate more than we can process

# Data is Key for Future Workloads



### **In-memory Databases**

[Mao+, EuroSys'12; Clapp+ (**Intel**), IISWC'15]



### **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Graph/Tree Processing** [Xu+, IISWC'12; Umuroglu+, FPL'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

# Data Overwhelms Modern Machines





### **In-memory Databases**

### **Graph/Tree Processing**

# Data → performance & energy bottleneck



### In-Memory Data Analytics

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

# Data is Key for Future Workloads





**Google's web browser** 



### **TensorFlow Mobile**

Google's machine learning framework



**Google's video codec** 



# Data Overwhelms Modern Machines



# Data → performance & energy bottleneck



**Google's video codec** 



# Data is Key for Future Workloads



http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped



## Data → performance & energy bottleneck

| Read   | Sequence  |
|--------|-----------|
| read4: | COCITCCAT |
| read5: | CCATGACGC |
| read6: | TTCCATGAC |

### 3 Variant Calling



### **Scientific Discovery 4**

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017. Published: 02 April 2018



**Oxford Nanopore MinION** 

# Data → performance & energy bottleneck

# Data Overwhelms Modern Machines ...

Storage/memory capability

Communication capability

Computation capability

Greatly impacts robustness, energy, performance, cost




### Data Movement Overwhelms Modern Machines

 Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the <u>23rd International Conference on Architectural Support for Programming</u> <u>Languages and Operating Systems</u> (ASPLOS), Williamsburg, VA, USA, March 2018.

# 62.7% of the total system energy is spent on data movement

## Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand<sup>1</sup>, Saugata Ghose<sup>1</sup>, Youngsok Kim<sup>2</sup>, Rachata Ausavarungnirun<sup>1</sup>, Eric Shiu<sup>3</sup>, Rahul Thakur<sup>3</sup>, Daehyun Kim<sup>4,3</sup>, Aki Kuusela<sup>3</sup>, Allan Knies<sup>3</sup>, Parthasarathy Ranganathan<sup>3</sup>, Onur Mutlu<sup>5,1</sup>



# An Intelligent Architecture Handles Data Well



### **Ensure data does not overwhelm** the components

- via intelligent algorithms
- via intelligent architectures
- via whole system designs: algorithm-architecture-devices

# Take advantage of vast amounts of data and metadata to improve architectural & system-level decisions

### Understand and exploit properties of (different) data

to improve algorithms & architectures in various metrics

# Corollaries: Computing Systems Today ...

Are processor-centric vs. data-centric

Make designer-dictated decisions vs. data-driven

Make component-based myopic decisions vs. data-aware

# Architectures for Intelligent Machines

# **Data-centric**

# **Data-driven**

# **Data-aware**



### A Blueprint for Fundamentally Better Architectures

Onur Mutlu, <u>"Intelligent Architectures for Intelligent Computing Systems"</u> Invited Paper in Proceedings of the <u>Design, Automation, and Test in Europe Conference</u> (**DATE**), Virtual, February 2021. [Slides (pptx) (pdf)] [IEDM Tutorial Slides (pptx) (pdf)] [Short DATE Talk Video (11 minutes)] [Longer IEDM Tutorial Video (1 hr 51 minutes)]

### Intelligent Architectures for Intelligent Computing Systems

Onur Mutlu ETH Zurich omutlu@gmail.com



Problem and Background

Task Overview

Technical Challenges, Goals and Ideas

Ideas, Results and Papers from the Past Year



# In This Task... (Task #2946.001)

- We focus on designing memory systems to handle data well
- We aim to solve two different yet related and synergistic problems, both focusing on ML/AI and memory system design
- We explore (and exploit the synergy between)
  - Memory system design for AI/ML workloads/accelerators
  - AI/ML techniques for improving memory system designs
- Task Name: Memory System Design for AI/ML Accelerators & ML/AI Techniques for Memory System Design

# Our Goals in This Task

Two Major Goals:

1. Memory system design for AI/ML workloads/accelerators

→ in-depth exploration of memory system designs for cutting-edge and emerging machine learning accelerators

→ more efficient on-chip and off-chip memory systems

2. AI/ML techniques for improving memory system designs

→ take a comprehensive look at memory system design and make it data driven, i.e., based on machine learning

→ more effective cache/memory/prefetch/thread controllers and data/resource management/mapping/scheduling policies

# Anticipated Primary Results

- Realistic, practical and effective novel memory system designs for ML/AI accelerators
- New ML-based techniques to improve memory system efficiency and performance
- Open-source workloads, metrics, methodologies & infrastructures to analyze such designs and techniques

# Task Description

#### Description

Our major goals in this research are twofold. First, we aim to provide the first in-depth exploration of memory system designs for cutting-edge and emerging machine learning accelerators. To this end, we aim to develop much more efficient on-chip/on-die as well as off-chip memory system designs for such accelerators, along with open source models, metrics, simulators, prototypes & workload suites to evaluate existing and future ML/AI accelerators. Second, we would like to take a comprehensive look at memory system design and make it data driven, i.e., based on machine learning: we aim to design ML/AI techniques for on-chip cache/memory/prefetch/thread controllers and data/resource management/mapping/scheduling policies, to maximize efficiency, performance and QoS beyond the levels achievable by human-designed policies.

To this end, we will comprehensively examine a wide variety of key issues and bottlenecks across the entire memory system design of modern ML/AI accelerators as well as general purpose processors, spanning SRAM buffers/caches, DRAM main memory, cache and memory controllers, interconnects, non-volatile memory, hybrid memories, prefetching mechanisms, and near-data acceleration mechanisms, with a special focus on cutting-edge data-intensive production ML/AI workloads (for Problem 1) and with a broader focus on key data-intensive workloads (for Problem 2).

To solve Problem 1, based on our analysis of bottlenecks in state-of-the-art ML/AI accelerators and workloads, we aim to develop new on-chip and off-chip memory designs, data organization techniques, data movement reduction mechanisms, request scheduling, caching, prefetching schemes, near-data and in-memory acceleration mechanisms, customized SRAM, DRAM, NVM designs for demands of ML/AI acceleration, and various other innovative techniques across the entire memory hierarchy. To solve Problem 2, based on our analysis of each controller and major policy in the memory hierarchy, we aim to find and design new ML-based policies that are best fit for each controller and its optimization goals.
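To make the idea of an ML-based policy for a single controller decision more concrete, the following is a minimal sketch of a perceptron-style binary predictor that learns online from observed outcomes. It is an illustration only, not the design of any mechanism developed in this task: the class name `OffChipPredictor`, the feature encoding (small integers standing in for, e.g., a hashed load PC and a cacheline offset), the table size, and the weight limit are all assumptions.

```python
class OffChipPredictor:
    """Hypothetical perceptron-style predictor; names and sizes are illustrative."""

    def __init__(self, num_features, table_size=64, threshold=0, weight_limit=15):
        # One small table of saturating integer weights per input feature.
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.threshold = threshold
        self.weight_limit = weight_limit

    def _indices(self, features):
        # Map each integer feature value to a slot in its own table.
        return [f % self.table_size for f in features]

    def score(self, features):
        # Sum the weights selected by the current feature values.
        return sum(t[i] for t, i in zip(self.tables, self._indices(features)))

    def predict(self, features):
        # True means "predict the event" (e.g., this request will go off-chip).
        return self.score(features) >= self.threshold

    def train(self, features, outcome):
        # Update only on a misprediction, nudging weights toward the outcome.
        if self.predict(features) == outcome:
            return
        delta = 1 if outcome else -1
        for t, i in zip(self.tables, self._indices(features)):
            t[i] = max(-self.weight_limit, min(self.weight_limit, t[i] + delta))
```

In use, a controller would call `predict(features)` before acting and `train(features, outcome)` once the true outcome is known. The appeal of this kind of policy is that its behavior is derived from observed data rather than from thresholds fixed at design time.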

# Task Deliverables (2020)

#### **Deliverables**

Report on experimental performance and energy analysis & breakdown of ML/AI accelerator execution on key ML/AI workloads using rigorous evaluation metrics and methodologies

Original due date: 30-Jun-2020

Annual review presentation

Revised due date: 9-Sep-2020 (Original due date: 1-Sep-2020)

Report on description and analysis of new customized memory system designs for ML accelerators & complete ML accelerator designs with new data orchestration and memory management mechanisms

Original due date: 31-Dec-2020

# Task Deliverables (2021)

Report on performance and energy analysis of control and management policies in the memory hierarchy & potential of machine learning based techniques to replace them

Original due date: 28-Feb-2021

Report on description and analysis of new ML-based memory system policies and designs & specification and coordination of various on-chip ML-based agents

Original due date: 31-Aug-2021

Annual review presentation

Original due date: 1-Sep-2021

Report on analysis of various different memory types, new on-chip/off-chip near-data processing designs, and short-term & long-term options for near-data processing designs for ML/AI accelerators

Original due date: 31-Dec-2021

# Task Deliverables (2022)

Report analyzing various new ML-based memory/cache/interconnect/prefetcher control mechanisms along with ML-based data mapping, address mapping, thread scheduling policies across the memory system

Original due date: 30-Jun-2022

Report on open source release of new ML/AI accelerator simulation infrastructures, their evaluation metrics and methodologies, and their analysis

Original due date: 31-Oct-2022

Report on open source release of ML/AI-based memory system evaluation infrastructures, their evaluation metrics and methodologies, and their analysis

Original due date: 31-Oct-2022

Final report summarizing research accomplishments and future direction

Original due date: 31-Dec-2022

# Task Information #2946.001 (1)

- Thrust: AI Hardware
- Task Leader: Onur Mutlu
  - <u>https://people.inf.ethz.ch/omutlu/</u>
  - onur.mutlu@inf.ethz.ch
- Students
  - Rahul Bera (ETH)
  - Joao Ferreira (ETH)
  - Geraldo Francisco de Oliveira Junior (ETH)
  - Konstantinos Kanellopoulos (ETH)
  - Joel Lindegger (ETH)
  - Aditya Manglik (ETH)
  - Rakesh Nadig (ETH)















# Task Information #2946.001 (2)

- Senior Researchers
  - Juan Gómez-Luna (ETH)
  - Haiyu Mao (ETH)
  - Lois Orosa (ETH)
  - Jisung Park (ETH)
  - Gagandeep Singh (ETH)









More students/postdocs to be added as the task evolves

# Recent PhD Graduate

Minesh Patel


- October 2021
- Enabling Effective Error Mitigation in Memory Chips That Use On-Die Error-Correcting Codes
- **2022 William C. Carter PhD Dissertation Award in Dependability**
- Best Paper Awards at DSN 2019 & MICRO 2020
- https://www.youtube.com/watch?v=0c9bDr18jZE
- https://arxiv.org/abs/2204.10387
- <u>https://www.mineshp.com/</u>

### **Dissertation Overview**

#### **"Enabling Effective Error Mitigation in Modern Memory Chips that Use On-Die ECC"**

Defended Oct. 2021 (ETH Zürich)

Deposited Apr. 2022 (DOI 10.3929/ethz-b-000542542)

Advisor: Onur Mutlu (ETH Zürich)

Co-Examiners: Mattan Erez (UT Austin), Moinuddin Qureshi (Georgia Tech), Vilas Sridharan (AMD), Christian Weis (TU Kaiserslautern)



Award Speech - William C. Carter PhD Dissertation Award in Dependability - Minesh Patel

Premiered Jul 15, 2022


# Recent PostDoc Alumni

### Dr. Lois Orosa

- March 2022
- Director at the Galician Supercomputing Center



### Dr. Gagandeep Singh

- September 2022
- Joining AMD Research



### Dr. Jisung Park

- September 2022
- Joining POSTECH (South Korea) as Assistant Professor



# Soon to Finish PhD

- Hasan Hassan
  - PhD Defense date: September 29, 2022
  - Improving DRAM Performance, Reliability, and Security by Rigorously Understanding Intrinsic DRAM Operation
  - <u>https://drive.google.com/file/d/1E5mFYI\_SMjCP-7TQ8qt6kRALROGhZs9K/view</u>



# Recent Internships

- Dr. Gagandeep Singh
  - February-June 2022
  - Visit to AMD Research



# Upcoming TECHCON Presentation

### Dr. Juan Gómez-Luna

- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware
- Based on two major works
  - https://arxiv.org/pdf/2105.03814.pdf
  - https://arxiv.org/pdf/2207.07886.pdf



### Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-In-Memory Hardware

Year: 2021, Pages: 1-7, DOI: 10.1109/IGSC54211.2021.9651614

#### Authors

- Juan Gómez-Luna, ETH Zürich
- Izzat El Hajj, American University of Beirut
- Ivan Fernandez, University of Malaga
- Christina Giannoula, National Technical University of Athens
- Geraldo F. Oliveira, ETH Zürich
- Onur Mutlu, ETH Zürich



https://www.youtube.com/watch?v=nphV36SrysA

# Industry Liaisons

- Charles Augustine, Intel
- Pradip Bose, IBM
- Alper Buyuktosunoglu, IBM
- Rosario Cammarota, Intel
- Ramesh Chauhan, Qualcomm
- Prokash Ghosh, NXP
- Jose Joao, ARM
- Arun Joseph, IBM
- Preetham Lobo, IBM
- Nithyakalyani Sampath, TI
- Willem Sanberg, NXP
- Pushkar Sareen, NXP
- Sreenivas Subramoney, Intel
- Xin Zhang, IBM
- We hold, and will continue to hold, both regular and ad hoc meetings with all liaison companies
- Very open to other collaborators, feedback, internships, visits

## Industry Interactions (This Year I)

- Intel: Collaborative papers as part of this task
  - Sreenivas Subramoney, Gurpreet Kalsi, Anant Nori, Kamlesh Pillai, Shankar Balachandran, Bharathwaj Suresh
  - SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping [ISCA 2022]
  - pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables [MICRO 2022]
  - Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
  - ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis [arXiv 2022]
- IBM: Collaborative papers
  - Dionysios Diamantopoulos, Christoph Hagleitner
  - Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric [TRETS 2022]

## Industry Interactions (This Year II)

- IBM: Collaborative EU Horizon Project BioPIM
  - Abu Sebastian, Irem Boybat (IBM Research Zurich)
  - <u>http://www.biopim.eu/</u>
  - BioPIM project aims to leverage the emerging processing-in-memory (PIM) technologies to enable powerful edge computing.
  - Synergistic with this task
  - We will focus on co-designing algorithms and data structures commonly used in bioinformatics together with several types of PIM architectures to obtain the highest benefit in cost, energy, and time savings.
  - BioPIM will also impact other fields that employ similar algorithms.
  - Our designs and algorithms will not be limited to cheap hardware, and they will impact computation efficiency on all forms of computing environments including cloud platforms.
  - The targeted breakthrough of **BioPIM** is to invent and leverage in-memory computing architectures to fundamentally improve the performance and energy efficiency of various important bioinformatics algorithms to make mobile genomics a reality


## Industry Interactions (This Year III)

- Qualcomm: In-person Visit & Talk
  - Ramesh Chauhan
  - May 2022
- IBM Research: In-person Visit & Talk
  - Pradip Bose, Karthik Swaminathan, Alper Buyuktosunoglu, Krishnan Kailas
  - May 2022
- Intel: Keynote Talk at the Intel Interconnect & Connectivity Summit
  - Debendra Das Sharma
  - <u>"Memory-Centric Computing"</u>

*Keynote Talk at the Intel Interconnect & Connectivity Summit (IICS)*, Virtual, 9 February 2022. [<u>Slides (pptx) (pdf)</u>]

## Posters for Annual Review 2022



- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
  - Gagandeep Singh
- SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping [ISCA 2022]
  - Damla Senol Cali, Joel Lindegger
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
  - Rahul Bera
- Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design [ICDE 2022]
  - Geraldo Francisco de Oliveira Junior
- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware [IEEE Access 2022]
  - Juan Gómez-Luna
- Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [PACT 2021]
  - Geraldo Francisco de Oliveira Junior

## Special Research Sessions & Courses

### Special Session at ISVLSI 2022: 9 cutting-edge talks




https://www.youtube.com/watch?v=geukNs5XI3g


## Comp Arch (Fall'21)

- Fall 2021 Edition:
  - https://safari.ethz.ch/architecture/fall2021/doku.php?id=schedule
- Fall 2020 Edition:
  - https://safari.ethz.ch/architecture/fall2020/doku.php?id=schedule
- YouTube Livestream (2021):
  - https://www.youtube.com/watch?v=4yfkM\_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF
- YouTube Livestream (2020):
  - https://www.youtube.com/watch?v=c3mPdZA-Fmc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN
- Master's level course
  - Taken by Bachelor's/Master's/PhD students
  - Cutting-edge research topics + fundamentals in Computer Architecture
  - 5 Simulator-based Lab Assignments
  - Potential research exploration
  - Many research readings

### https://www.youtube.com/onurmutlulectures



#### Fall 2021 Lectures & Schedule


| Week | Date       | Livestream   | Lecture                                                       | Readings             | Lab       | HW       |
|------|------------|--------------|---------------------------------------------------------------|----------------------|-----------|----------|
| W1   | 30.09 Thu. | YouTube Live | L1: Introduction and Basics                                   | Required, Mentioned  | Lab 1 Out | HW 0 Out |
|      | 01.10 Fri. | YouTube Live | L2: Trends, Tradeoffs and Design Fundamentals (PDF) (PPT)     | Required, Mentioned  |           |          |
| W2   | 07.10 Thu. | YouTube Live | L3a: Memory Systems: Challenges and Opportunities (PDF) (PPT) | Described, Suggested |           | HW 1 Out |
|      |            |              | L3b: Course Info & Logistics                                  |                      |           |          |
|      |            |              | L3c: Memory Performance Attacks                               | Described, Suggested |           |          |
|      | 08.10 Fri. | YouTube Live | L4a: Memory Performance Attacks                               | Described, Suggested | Lab 2 Out |          |
|      |            |              | L4b: Data Retention and Memory Refresh                        | Described, Suggested |           |          |
|      |            |              | L4c: RowHammer (PDF) (PPT)                                    | Described, Suggested |           |          |

## DDCA (Spring 2022)

### Spring 2022 Edition:

https://safari.ethz.ch/digitaltechnik/spring2022/doku.php?id=schedule

### Spring 2021 Edition:

https://safari.ethz.ch/digitaltechnik/spring2021/doku.php?id=schedule

### Youtube Livestream (Spring 2022):

https://www.youtube.com/watch?v=cpXdE3HwvK0&list=PL5Q2soXY2Zi97Ya5DEUpMpO2bbAoaG7c6

### Youtube Livestream (Spring 2021):

- https://www.youtube.com/watch?v=LbC0EZY8yw4&list=PL5Q2soXY2Zi uej3aY39YB5pfW4SJ7LIN
- Bachelor's course
  - 2<sup>nd</sup> semester at ETH Zurich
  - Rigorous introduction into "How Computers Work"
  - Digital Design/Logic
  - Computer Architecture
  - 10 FPGA Lab Assignments

### https://www.youtube.com/onurmutlulectures




#### Spring 2021 Lectures/Schedule

| Week | Date       | Livestream   | Lecture                                              | Readings                       | Lab | HW |
|------|------------|--------------|------------------------------------------------------|--------------------------------|-----|----|
| W1   | 25.02 Thu. | YouTube Live | L1: Introduction and Basics                          | Required, Suggested, Mentioned |     |    |
|      | 26.02 Fri. | YouTube Live | L2a: Tradeoffs, Metrics, Mindset                     | Required                       |     |    |
|      |            |              | L2b: Mysteries in Computer Architecture (PDF) (PPT)  | Required, Mentioned            |     |    |
| W2   | 04.03 Thu. | YouTube Live | L3a: Mysteries in Computer Architecture II           | Required, Suggested            |     |    |


## PIM Course (Spring 2022)

### Spring 2022 Edition:

https://safari.ethz.ch/projects\_and\_seminars/spring2022/doku.php?id=processing\_in\_memory

#### Youtube Livestream:

https://www.youtube.com/watch?v=9e4Chnwdovo&list=PL5Q2soXY2Zi-841fUYYUK9EsXKhQKRPyX

#### Project course

- Taken by Bachelor's/Master's students
- Processing-in-Memory lectures
- Hands-on research exploration
- Many research readings

| P | A Modern Primer on Processing in Memory |
|---|---|
|   | Onur Mutlu <sup>a,b</sup>, Saugata Ghose <sup>b,c</sup>, Juan Gómez-Luna <sup>a</sup>, Rachata Ausavarungnirun <sup>d</sup> |
|   | SAFARI Research Group |
|   | Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun,<br>"A Modern Primer on Processing in Memory" |
|   | Invited Book Chapter in Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, Springer, to be published in 202 |

#### Spring 2022 Meetings/Schedule

| Week | Date          | Livestream       | Meeting                                                           | Learning Materials                          | Assignments |
|------|---------------|------------------|-------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 10.03<br>Thu. | YouTube Live     | M1: P&S PIM Course Presentation<br>(PDF) (PPT)                    | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 15.03<br>Tue. |                  | Hands-on Project Proposals                                        |                                             |             |
|      | 17.03<br>Thu. | YouTube Premiere | M2: Real-world PIM: UPMEM PIM                                     |                                             |             |
| W3   | 24.03<br>Thu. | YouTube Live     | M3: Real-world PIM: Microbenchmarking of UPMEM PIM<br>(PDF) (PPT) |                                             |             |
| W4   | 31.03<br>Thu. | YouTube Live     | M4: Real-world PIM: Samsung HBM-PIM<br>(PDF) (PPT)                |                                             |             |
| W5   | 07.04<br>Thu. | YouTube Live     | M5: How to Evaluate Data Movement Bottlenecks<br>(PDF) (PPT)      |                                             |             |
| W6   | 14.04<br>Thu. | YouTube Live     | M6: Real-world PIM: SK Hynix AiM<br>(PDF) (PPT)                   |                                             |             |
| W7   | 21.04<br>Thu. | YouTube Premiere | M7: Programming PIM Architectures<br>(PDF) (PPT)                  |                                             |             |
| W8   | 28.04<br>Thu. | YouTube Premiere | M8: Benchmarking and Workload Suitability on PIM<br>(PDF) (PPT)   |                                             |             |
| W9   | 05.05<br>Thu. | YouTube Premiere | M9: Real-world PIM: Samsung AxDIMM                                |                                             |             |
| W10  | 12.05<br>Thu. | YouTube Premiere | M10: Real-world PIM: Alibaba HB-PNM<br>(PDF) (PPT)                |                                             |             |
| W11  | 19.05<br>Thu. | YouTube Live     | M11: SpMV on a Real PIM Architecture<br>(PDF) (PPT)               |                                             |             |
| W12  | 26.05<br>Thu. | YouTube Live     | M12: End-to-End Framework for Processing-using-Memory<br>(PDF) (PPT) |                                          |             |
| W13  | 02.06<br>Thu. | YouTube Live     | M13: Bit-Serial SIMD Processing using DRAM<br>(PDF) (PPT)         |                                             |             |
| W14  | 09.06<br>Thu. | YouTube Live     | M14: Analyzing and Mitigating ML Inference Bottlenecks<br>(PDF) (PPT) |                                         |             |
| W15  | 15.06<br>Thu. | YouTube Live     | M15: In-Memory HTAP Databases with HW/SW Co-design<br>(PDF) (PPT) |                                             |             |
| W16  | 23.06<br>Thu. | YouTube Live     | M16: In-Storage Processing for Genome Analysis<br>(PDF) (PPT)     |                                             |             |
| W17  | 18.07<br>Mon. | YouTube Premiere | M17: How to Enable the Adoption of PIM?<br>(PDF) (PPT)            |                                             |             |
| W18  | 09.08<br>Tue. | YouTube Premiere | SS1: ISVLSI 2022 Special Session on PIM<br>(PDF & PPT)            |                                             |             |

## Genomics (Spring 2022)

### Spring 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/spring2022/doku.php?id=bioinformatics

#### Youtube Livestream:

- https://www.youtube.com/watch?v=DEL_5A_Y3TI&list=PL5Q2soXY2Zi8NrPDgOR1yRU_Cxxjw-u18
- Project course
  - Taken by Bachelor's/Master's students
  - Genomics lectures
  - Hands-on research exploration
  - Many research readings



#### Spring 2022 Meetings/Schedule

| Week | Date          | Livestream       | Meeting                                                                            | Learning Materials                          | Assignment |
|------|---------------|------------------|------------------------------------------------------------------------------------|---------------------------------------------|------------|
| W1   | 11.03<br>Fri. | YouTube Live     | M1: P&S Accelerating Genomics Course Introduction & Project Proposals<br>(PDF) (PPT) | Required Materials<br>Recommended Materials |            |
| W2   | 18.03<br>Fri. | YouTube Live     | M2: Introduction to Sequencing                                                     |                                             |            |
| W3   | 25.03<br>Fri. | YouTube Premiere | M3: Read Mapping                                                                   |                                             |            |
| W4   | 01.04<br>Fri. | YouTube Premiere | M4: GateKeeper                                                                     |                                             |            |
| W5   | 08.04<br>Fri. | YouTube Premiere | M5: MAGNET & Shouji                                                                |                                             |            |
| W6   | 15.04<br>Fri. | YouTube Premiere | M6: SneakySnake                                                                    |                                             |            |
| W7   | 29.04<br>Fri. | YouTube Premiere | M7: GenStore<br>(PDF)                                                              |                                             |            |
| W8   | 06.05<br>Fri. | YouTube Premiere | M8: GRIM-Filter<br>(PDF) (PPT)                                                     |                                             |            |
| W9   | 13.05<br>Fri. | YouTube Premiere | M9: Genome Assembly<br>(PDF) (PPT)                                                 |                                             |            |
| W10  | 20.05<br>Fri. | YouTube Live     | M10: Genomic Data Sharing Under Differential Privacy<br>(PDF) (PPT)                |                                             |            |
| W11  | 10.06<br>Fri. | YouTube Premiere | M11: Accelerating Genome Sequence Analysis                                         |                                             |            |

### Hetero. Systems (Spring'22)

### Spring 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/spring2022/doku.php?id=heterogeneous_systems

#### Youtube Livestream:

https://www.youtube.com/watch?v=oFO5fTrgFIY&list=PL5Q2soXY2Zi9XrgXR38IM_FTjmY6h7Gzm

#### Project course

- Taken by Bachelor's/Master's students
- GPU and Parallelism lectures
- Hands-on research exploration
- Many research readings



#### Spring 2022 Meetings/Schedule

| Week | Date          | Livestream       | Meeting                                                            | Learning Materials                          | Assignments |
|------|---------------|------------------|--------------------------------------------------------------------|---------------------------------------------|-------------|
| W1   | 15.03<br>Tue. | YouTube Premiere | M1: P&S Course Presentation<br>(PDF) (PPT)                         | Required Materials<br>Recommended Materials | HW 0 Out    |
| W2   | 22.03<br>Tue. | YouTube Premiere | M2: SIMD Processing and GPUs<br>(PDF) (PPT)                        |                                             |             |
| W3   | 29.03<br>Tue. | YouTube Premiere | M3: GPU Software Hierarchy<br>(PDF) (PPT)                          |                                             |             |
| W4   | 05.04<br>Tue. | YouTube Premiere | M4: GPU Memory Hierarchy<br>(PDF) (PPT)                            |                                             |             |
| W5   | 12.04<br>Tue. | YouTube Premiere | M5: GPU Performance Considerations<br>(PDF) (PPT)                  |                                             |             |
| W6   | 19.04<br>Tue. | YouTube Premiere | M6: Parallel Patterns: Reduction                                   |                                             |             |
| W7   | 26.04<br>Tue. | YouTube Premiere | M7: Parallel Patterns: Histogram                                   |                                             |             |
| W8   | 03.05<br>Tue. | YouTube Premiere | M8: Parallel Patterns: Convolution<br>(PDF)                        |                                             |             |
| W9   | 10.05<br>Tue. | YouTube Premiere | M9: Parallel Patterns: Prefix Sum (Scan)                           |                                             |             |
| W10  | 17.05<br>Tue. | YouTube Premiere | M10: Parallel Patterns: Sparse Matrices<br>(PDF) (PPT)             |                                             |             |
| W11  | 24.05<br>Tue. | YouTube Premiere | M11: Parallel Patterns: Graph Search<br>(PDF) (PPT)                |                                             |             |
| W12  | 01.06<br>Wed. | YouTube Premiere | M12: Parallel Patterns: Merge Sort<br>(PDF) (PPT)                  |                                             |             |
| W13  | 07.06<br>Tue. | YouTube Premiere | M13: Dynamic Parallelism                                           |                                             |             |
| W14  | 15.06<br>Wed. | YouTube Premiere | M14: Collaborative Computing<br>(PDF) (PPT)                        |                                             |             |
| W15  | 24.06<br>Fri. | YouTube Premiere | M15: GPU Acceleration of Genome Sequence Alignment<br>(PDF) (PPT)  |                                             |             |
| W16  | 14.07<br>Thu. | YouTube Premiere | M16: Accelerating Agent-based Simulations<br>(PDF) (ODP)           |                                             |             |

### HW/SW Co-Design (Spring 2022)

#### Spring 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/spring2022/doku.php?id=hw_sw_co_design

#### Youtube Livestream:

https://youtube.com/playlist?list=PL5Q2soXY2Zi8nH7un3ghD2nutKWWDk-NK

#### Project course

- Taken by Bachelor's/Master's students
- HW/SW co-design lectures
- Hands-on research exploration
- Many research readings



#### 2022 Meetings/Schedule (Tentative)

| Week | Date  | Livestream   | Meeting                             | Materials | Assignments |
|------|-------|--------------|-------------------------------------|-----------|-------------|
| W0   | 16.03 | YouTube Live | Intro to HW/SW Co-Design            | Required  | HW 0 Out    |
| W1   | 23.03 |              | Project selection                   | Required  |             |
| W2   | 30.03 | YouTube Live | Virtual Memory (I)<br>(PPTX) (PDF)  |           |             |
| W3   | 13.04 | YouTube Live | Virtual Memory (II)<br>(PPTX) (PDF) |           |             |

### SSD Course (Spring 2022)

### Spring 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/spring2022/doku.php?id=modern_ssds

#### Youtube Livestream:

- https://www.youtube.com/watch?v=_q4rm71DsY4&list=PL5Q2soXY2Zi8vabcse1kL22DEcgMl2RAq
- Project course
  - Taken by Bachelor's/Master's students
  - SSD Basics and Advanced Topics
  - Hands-on research exploration
  - Many research readings

| Channel | Lecture title slide | Video details |
|---|---|---|
| Onur Mutlu Lectures | P&S Modern SSDs<br>Basics of NAND Flash-Based SSDs<br>Dr. Jisung Park<br>Prof. Onur Mutlu<br>ETH Zürich<br>Spring 2022<br>25 March 2021 | Meeting 2: Basics of NAND Flash-Based SSDs (Spring 2022); streamed live on Mar 25, 2022 |
| Onur Mutlu Lectures                                                 | P&S Modern SSDs<br>Introduction to MQSim<br>Rakesh Nadig<br>Dr. Jisung Park<br>Prof. Onur Mutlu<br>ETH Zürich<br>Spring 2022                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                      | ANALYTICS EDIT |
| Onur Mutlu Lectures                                                 | P&S Modern SSDs<br>Introduction to MQSim<br>Rakesh Nadig<br>Dr. Jisung Park<br>Prof. Onur Mutlu<br>ETH Zürich                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                      | ANALYTICS EDIT |







Problem and Background

Task Overview

Technical Challenges, Goals and Ideas

Ideas, Results and Papers from the Past Year



1. Memory system design for AI/ML workloads/accelerators

2. AI/ML techniques for improving memory system designs

## Thrust 1 Exploration Ideas

1.1. Comprehensive Energy and Performance Analysis of ML/AI Accelerator Execution on Key ML/AI Workloads

1.2. Cache/Buffer, On-Chip Memory, Interconnect, Memory Controller Designs for ML Accelerators and Their Interfaces

1.3. Complete on-chip ML/AI accelerator designs with careful data orchestration and on-chip memory management.

1.4. On-chip & off-chip near-data processing (NDP) designs, interfaces, evaluation, programming for AI/ML workloads

1.5. Evaluation and understanding of both short-term and long-term options for NDP for AI/ML Workloads

1.6. Use of NVM devices, simple customized DRAM and 3D-stacked Memory+Logic for AI/ML Acceleration

1.7. High-Fidelity and Highly-Flexible Open Source Simulation & Modeling Infrastructures for ML/AI Memory Systems


1. Memory system design for AI/ML workloads/accelerators

2. AI/ML techniques for improving memory system designs

## Thrust 2 Exploration Ideas

2.1. Comprehensive performance and energy analysis of rigid policies in the memory hierarchy – how far are they from the ideal policies? What is the maximum potential ML techniques can achieve?

2.2. New caching, prefetching, mem. controller, runahead, compression policies that are directed with appropriate ML techniques

2.3. Rigorous specification and coordination of ML-based on-chip cache, prefetch, DRAM, NVM, and hybrid memory controllers

2.4. Design and evaluation of new ML-based techniques to manage hybrid memories consisting of multiple different technologies

2.5. Design and evaluation of new ML-based data mapping policies across on-chip caches and memory controllers

2.6. Design and evaluation of new ML-based thread scheduling policies in both SMT and memory controllers

2.7. High-Fidelity and Highly-Flexible Open Source Simulation & Modeling Infrastructures for ML-Based Controllers

## System Architecture Design Today

- Human-driven
  - Humans design the policies (how to do things)
- Many (too) simple, short-sighted policies all over the system
- No automatic data-driven policy learning
- (Almost) no learning: cannot take lessons from past actions

### Can we design fundamentally intelligent architectures?

## An Intelligent Architecture

- Data-driven
  - Machine learns the "best" policies (how to do things)
- Sophisticated, workload-driven, changing, far-sighted policies
- Automatic data-driven policy learning
- All controllers are intelligent data-driven agents

### How do we start?

## Two Major Thrusts & Their Synergies

1. Memory system design for AI/ML workloads/accelerators

2. AI/ML techniques for improving memory system designs



Problem and Background

Task Overview

Technical Challenges, Goals and Ideas

Ideas, Results and Papers from the Past Year



## Initial Results in Year I (2020 Review)

- GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis [MICRO 2020]
- NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling [FPL 2020]
- An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration [DSN 2020]
- NATSA: A Near-Data Processing Accelerator for Time Series Analysis [ICCD 2020]
- Robust Machine Learning Systems: Challenges, Current Trends, Perspectives, and the Road Ahead [IEEE D&T 2020]
- Accelerating Genome Analysis: A Primer on an Ongoing Journey [IEEE Micro 2020]
- SMASH Open Source Software Code Release [GitHub]

## Initial Results in Year I (2020 Ongoing)

- Efficiently Accelerating Edge ML Inference by Exploiting Layer Heterogeneity: An Empirical Study with Google Edge Models [Ongoing]
- A New Methodology and Open-Source Benchmark Suite for Evaluating Data Movement Bottlenecks: A Near-Data Processing Case Study [Ongoing]
- Accelerating Profile Hidden Markov Models in Computational Biology Applications [Ongoing]
- StenCache: A Near-Cache Accelerator for Stencil Computations [Ongoing]
- SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM [Ongoing]
- Polynesia: Enabling Effective Hybrid Transactional/Analytical Databases with Specialized Hardware/Software Co-Design [Ongoing]
- Reinforcement Learning based Prefetch Generation [Ongoing]
- Benchmarking a New Paradigm: Understanding a Modern Processing-in-Memory Architecture [Ongoing]

## Year II Results (2021 Annual Review I)

- Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks [PACT 2021]
- Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning [MICRO 2021]
- Refresh Triggered Computation: Improving the Energy Efficiency of Convolutional Neural Network Accelerators [TACO 2020]
- SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures [HPCA 2021]
- SIMDRAM: An End-to-End Framework for Bit-Serial SIMD Computing in DRAM [ASPLOS 2021]

## Year II Results (2021 Annual Review II)

- DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks [IEEE Access 2021]
- Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture [arXiv 2021]
- FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications [IEEE Micro 2021]
- A Modern Primer on Processing in Memory [arXiv 2020]
- Sibyl: A Reinforcement Learning Approach to Data Placement in Hybrid Storage Systems [Ongoing]

## Year III Results (2022 Annual Review 1)

- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access'22]
- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware [CUT 2021]
- An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System [arXiv 2022]
- SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures [SIGMETRICS 2022]
- High-throughput Pairwise Alignment with the Wavefront Algorithm using Processing-in-Memory [HICOMB 2022]
- PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM [arXiv 2021]

Part of Thrust 1: Real PIM Systems

## Year III Results (2022 Annual Review 2)

- SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping [ISCA 2022]
- GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis [ASPLOS 2022]
- Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm [HICOMB 2022]
- Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design [ICDE 2022]
- Flash-Cosmos: In-Flash Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory [MICRO 2022]

Part of Thrust 1

## Year III Results (2022 Annual Review 3)

- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]

Part of Thrust 2

- GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping [MICRO 2022]
- pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables [MICRO 2022]

Part of Thrust 1

- DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression [FAST 2022]
- A Modern Primer on Processing in Memory [arXiv, Updated 2022]

## Year III Results (2022 Annual Review 4)

- EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators [arXiv 2022] <u>https://arxiv.org/abs/2202.02310</u>
- ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis [arXiv 2022] <u>https://arxiv.org/abs/2207.09765</u>
- Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric [TRETS 2022] <u>https://arxiv.org/abs/2107.08716</u>

## Third Year Results: More Detail

## Year III Results (2022 Annual Review 1)

- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access'22]
- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware [CUT 2021]
- An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System [arXiv 2022]
- SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures [SIGMETRICS 2022]
- High-throughput Pairwise Alignment with the Wavefront Algorithm using Processing-in-Memory [HICOMB 2022]
- PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM [arXiv 2021]

## Eliminating the Adoption Barriers

# Processing-in-Memory in the Real World

## Processing-in-Memory Landscape Today



This does not include many experimental chips and startups

## UPMEM Processing-in-DRAM Engine (2019)

### Processing in DRAM Engine

 Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

### Replaces standard DIMMs

- DDR4 R-DIMM modules
  - 8GB+128 DPUs (16 PIM chips)
  - Standard 2x-nm DRAM process



Large amounts of compute & memory bandwidth



https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/

## **UPMEM Memory Modules**

- E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips DIMM (2 ranks). DPUs @ 350 MHz



## 2,560-DPU Processing-in-Memory System



#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; IZZAT EL HAJJ, American University of Beirut, Lebanon; IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain; CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece; GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland




https://arxiv.org/pdf/2105.03814.pdf

## More on the UPMEM PIM System



https://www.youtube.com/watch?v=Sscy1Wrr22A&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN&index=26

### Experimental Analysis of the UPMEM PIM Engine

## Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; IZZAT EL HAJJ, American University of Beirut, Lebanon; IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain; CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece; GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory* (*PIM*).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

https://arxiv.org/pdf/2105.03814.pdf

Understanding a Modern Processing-in-Memory Architecture:

**Benchmarking and Experimental Characterization** 

<u>Juan Gómez Luna</u>, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

https://arxiv.org/pdf/2105.03814.pdf https://github.com/CMU-SAFARI/prim-benchmarks





# **Executive Summary**

- Data movement between memory/storage units and compute units is a major contributor to execution time and energy consumption
- Processing-in-Memory (PIM) is a paradigm that can tackle the data movement bottleneck
  - Though explored for 50+ years, technology challenges prevented its successful materialization
- UPMEM has designed and fabricated the first publicly-available real-world PIM architecture
  - DDR4 chips embedding in-order multithreaded DRAM Processing Units (DPUs)
- Our work:
  - Introduction to UPMEM programming model and PIM architecture
  - Microbenchmark-based characterization of the DPU
  - Benchmarking and workload suitability study
- Main contributions:
  - Comprehensive characterization and analysis of the first commercially-available PIM architecture
  - **PrIM** (<u>Processing-In-Memory</u>) benchmarks:
    - 16 workloads that are memory-bound in conventional processor-centric systems
    - Strong and weak scaling characteristics
  - Comparison to state-of-the-art CPU and GPU
- Takeaways:
  - Workload characteristics for PIM suitability
  - Programming recommendations
  - Suggestions and hints for hardware and architecture designers of future PIM systems
  - PrIM: (a) programming samples, (b) evaluation and comparison of current and future PIM systems

# Upcoming TECHCON Presentation

### Dr. Juan Gomez-Luna

- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware
- Based on two major works
  - https://arxiv.org/pdf/2105.03814.pdf
  - https://arxiv.org/pdf/2207.07886.pdf



#### Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-In-Memory Hardware

Year: 2021, Pages: 1-7 DOI Bookmark: 10.1109/IGSC54211.2021.9651614

#### Authors

Juan Gómez-Luna, ETH Zürich; Izzat El Hajj, American University of Beirut; Ivan Fernandez, University of Malaga; Christina Giannoula, National Technical University of Athens; Geraldo F. Oliveira, ETH Zürich; Onur Mutlu, ETH Zürich



https://www.youtube.com/watch?v=nphV36SrysA

## **Observations, Recommendations, Takeaways**

#### **GENERAL PROGRAMMING RECOMMENDATIONS**

1. Execute on the *DRAM Processing Units* (*DPUs*) **portions of parallel code** that are as long as possible.
2. Split the workload into **independent data blocks**, which the DPUs operate on independently.
3. Use **as many working DPUs** in the system as possible.
4. Launch at least **11** *tasklets* (i.e., software threads) per DPU.
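Recommendations 2 to 4 amount to a two-level partitioning: first across DPUs, then across the tasklets of each DPU. The sketch below illustrates that idea in plain Python; it is not UPMEM SDK code. `split_evenly`, `N_DPUS`, and the element count are hypothetical names and values chosen for illustration. Only the figure of 11 tasklets comes from the recommendation above.

```python
def split_evenly(n_items, n_parts):
    """Return (start, end) pairs that cover range(n_items) in n_parts contiguous blocks."""
    base, extra = divmod(n_items, n_parts)
    bounds = []
    start = 0
    for part in range(n_parts):
        # The first `extra` blocks take one additional item so sizes differ by at most 1.
        size = base + (1 if part < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


N_DPUS = 4        # hypothetical; a real run would use as many working DPUs as available
N_TASKLETS = 11   # minimum number of tasklets per DPU from recommendation 4
n_elements = 1000

# Level 1: one independent block per DPU.
dpu_blocks = split_evenly(n_elements, N_DPUS)

# Level 2: each DPU's block is split again among its tasklets.
tasklet_blocks = {
    dpu: [(lo + s, lo + e) for s, e in split_evenly(hi - lo, N_TASKLETS)]
    for dpu, (lo, hi) in enumerate(dpu_blocks)
}

for dpu, (lo, hi) in enumerate(dpu_blocks):
    print(f"DPU {dpu}: elements [{lo}, {hi}), {len(tasklet_blocks[dpu])} tasklet blocks")
```

Because no block overlaps another, each DPU can work on its range without communicating with other DPUs.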

#### **PROGRAMMING RECOMMENDATION 1**

For data movement between the DPU's MRAM bank and the WRAM, **use large DMA transfer sizes when all the accessed data is going to be used**.
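One way to see why transfer size matters is a fixed-plus-per-byte cost model: each DMA transfer pays a setup cost once and then a streaming cost per byte. This model and the constants below are assumptions made for illustration, not measurements from the slides. The `useful_bytes` argument captures the caveat in the recommendation: bytes that are fetched but never used do not count toward effective bandwidth.

```python
def transfer_cycles(size_bytes, fixed_cycles, cycles_per_byte):
    """Cycles for one DMA transfer under a fixed-plus-per-byte cost model."""
    return fixed_cycles + cycles_per_byte * size_bytes


def effective_bytes_per_cycle(size_bytes, useful_bytes, fixed_cycles, cycles_per_byte):
    """Useful bytes delivered per cycle for a transfer of size_bytes."""
    return useful_bytes / transfer_cycles(size_bytes, fixed_cycles, cycles_per_byte)


FIXED_CYCLES = 64.0     # hypothetical per-transfer setup cost
CYCLES_PER_BYTE = 0.5   # hypothetical streaming cost

# When every fetched byte is used, larger transfers amortize the setup cost better.
for size in (8, 64, 512, 2048):
    bw = effective_bytes_per_cycle(size, size, FIXED_CYCLES, CYCLES_PER_BYTE)
    print(f"{size:5d} B transferred, all used: {bw:.2f} B/cycle")

# When only 8 of 2048 fetched bytes are used, the large transfer is counterproductive.
wasteful = effective_bytes_per_cycle(2048, 8, FIXED_CYCLES, CYCLES_PER_BYTE)
print(f" 2048 B transferred, 8 B used: {wasteful:.4f} B/cycle")
```

Under this model the effective bandwidth approaches `1 / CYCLES_PER_BYTE` as the transfer grows, provided the data is actually consumed.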

#### **KEY OBSERVATION 7**

Larger CPU-DPU and DPU-CPU transfers between the host main memory and the DRAM Processing Unit's Main memory (MRAM) banks result in higher sustained bandwidth.

#### **KEY TAKEAWAY 1**

The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound.

# Outline

- Introduction
  - Accelerator Model
  - UPMEM-based PIM System Overview
- UPMEM PIM Programming
  - Vector Addition
  - CPU-DPU Data Transfers
  - Inter-DPU Communication
  - CPU-DPU/DPU-CPU Transfer Bandwidth
- DRAM Processing Unit
  - Arithmetic Throughput
  - WRAM and MRAM Bandwidth
- PrIM Benchmarks
  - Roofline Model
  - Benchmark Diversity
- Evaluation
  - Strong and Weak Scaling
  - Comparison to CPU and GPU

- Key Takeaways

# Key Takeaway 1



[Figure: x-axis is Operational Intensity (OP/B)]

The throughput saturation point is as low as ¼ OP/B, i.e., one integer addition per 32-bit element fetched.

### **KEY TAKEAWAY 1**

**The UPMEM PIM architecture is fundamentally compute bound.** As a result, **the most suitable workloads are memory-bound.** 
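The ¼ OP/B figure follows from dividing operations by bytes fetched: one addition per 4-byte element. The sketch below works through that arithmetic and classifies a kernel against the saturation point. The helper names are hypothetical, and the classification rule (at or above the saturation point means the DPU pipeline is the limit) is a simplified reading of the slide, not a model taken from the paper.

```python
SATURATION_OP_PER_BYTE = 0.25  # throughput saturation point quoted on the slide


def operational_intensity(operations, bytes_fetched):
    """Operations performed per byte fetched from memory (OP/B)."""
    return operations / bytes_fetched


def limiting_resource(op_per_byte, saturation=SATURATION_OP_PER_BYTE):
    """Return which resource limits a kernel on the DPU under this simplified rule."""
    return "compute" if op_per_byte >= saturation else "memory"


# One integer addition per 32-bit (4-byte) element fetched.
one_add_per_int32 = operational_intensity(1, 4)

# A hypothetical kernel that does one addition per four 32-bit elements fetched.
one_add_per_four_int32 = operational_intensity(1, 16)

print(one_add_per_int32, limiting_resource(one_add_per_int32))
print(one_add_per_four_int32, limiting_resource(one_add_per_four_int32))
```

Even a kernel as light as one addition per element already sits at the saturation point, which is why workloads that are memory-bound on a CPU tend to be the better fit for the DPUs.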

# Key Takeaway 2

Table 4: Evaluated CPU, GPU, and UPMEM-based PIM Systems.

| System                          | Process Node | Total Cores           | Frequency | Peak Performance | Memory Capacity | Total Memory Bandwidth | TDP                |
|---------------------------------|--------------|-----------------------|-----------|------------------|-----------------|------------------------|--------------------|
| Intel Xeon E3-1225 v6 CPU [241] | 14 nm        | 4 (8 threads)         | 3.3 GHz   | 26.4 GFLOPS*     | 32 GB           | 37.5 GB/s              | 73 W               |
| NVIDIA Titan V GPU [277]        | 14 nm        | 80 (5,120 SIMD lanes) | 1.2 GHz   | 12,288.0 GFLOPS  | 12 GB           | 652.8 GB/s             | 250 W              |
| 2,556-DPU PIM System            | 2x nm        | 2,556                 | 350 MHz   | 894.6 GOPS       | 159.75 GB       | 1.7 TB/s               | 383 W <sup>†</sup> |
| 640-DPU PIM System              | 2x nm        | 640                   | 267 MHz   | 170.9 GOPS       | 40 GB           | 333.75 GB/s            | 96 W <sup>†</sup>  |

<sup>†</sup>Estimated TDP = $\frac{\text{Total DPUs}}{\text{DPUs/chip}} \times 1.2\ \text{W/chip}$ [199].
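The TDP footnote can be checked against the two PIM rows of Table 4. The sketch below assumes 8 DPUs per chip, which follows from the earlier module description (128 DPUs across 16 PIM chips); the function and constant names are hypothetical.

```python
DPUS_PER_CHIP = 128 // 16   # 8 DPUs per chip: 128 DPUs across 16 PIM chips per DIMM
WATTS_PER_CHIP = 1.2        # per-chip power used in the footnote


def estimated_tdp_watts(total_dpus):
    """Estimated TDP = (total DPUs / DPUs per chip) * 1.2 W per chip."""
    return total_dpus / DPUS_PER_CHIP * WATTS_PER_CHIP


for dpus in (2556, 640):
    print(f"{dpus}-DPU system: {estimated_tdp_watts(dpus):.1f} W")
```

The results, about 383.4 W and 96 W, agree with the 383 W and 96 W entries in the table.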



### **KEY TAKEAWAY 2**

The most well-suited workloads for the UPMEM PIM architecture use no arithmetic operations or use only simple operations (e.g., bitwise operations and integer addition/subtraction).

# Key Takeaway 3

Table 4: Evaluated CPU, GPU, and UPMEM-based PIM Systems.

| System                          | Process Node | Total Cores           | Frequency | Peak Performance | Memory Capacity | Total Memory Bandwidth | TDP                |
|---------------------------------|--------------|-----------------------|-----------|------------------|-----------------|------------------------|--------------------|
| Intel Xeon E3-1225 v6 CPU [241] | 14 nm        | 4 (8 threads)         | 3.3 GHz   | 26.4 GFLOPS*     | 32 GB           | 37.5 GB/s              | 73 W               |
| NVIDIA Titan V GPU [277]        | 14 nm        | 80 (5,120 SIMD lanes) | 1.2 GHz   | 12,288.0 GFLOPS  | 12 GB           | 652.8 GB/s             | 250 W              |
| 2,556-DPU PIM System            | 2x nm        | 2,556                 | 350 MHz   | 894.6 GOPS       | 159.75 GB       | 1.7 TB/s               | 383 W <sup>†</sup> |
| 640-DPU PIM System              | 2x nm        | 640                   | 267 MHz   | 170.9 GOPS       | 40 GB           | 333.75 GB/s            | 96 W <sup>†</sup>  |

<sup>†</sup>Estimated TDP = $\frac{\text{Total DPUs}}{\text{DPUs/chip}} \times 1.2\ \text{W/chip}$ [199].
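The peak-performance and capacity columns of the PIM rows scale linearly with the DPU count, which helps explain why per-DPU behavior and inter-DPU communication dominate the analysis. The sketch below reproduces those two columns under two assumptions that are not stated on the slide: each DPU retires at most one operation per cycle, and each DPU owns 64 MB of DRAM (8 GB per DIMM divided by 128 DPUs, per the module description). The names are hypothetical.

```python
MB_PER_DPU = 8 * 1024 // 128   # 64 MB: an 8 GB DIMM shared by 128 DPUs


def peak_gops(total_dpus, frequency_mhz, ops_per_cycle=1):
    """Aggregate peak throughput in GOPS, assuming ops_per_cycle per DPU."""
    return total_dpus * frequency_mhz * ops_per_cycle / 1000.0


def capacity_gb(total_dpus):
    """Aggregate DRAM capacity in GB (1 GB = 1024 MB)."""
    return total_dpus * MB_PER_DPU / 1024.0


for dpus, mhz in ((2556, 350), (640, 267)):
    print(f"{dpus} DPUs @ {mhz} MHz: {peak_gops(dpus, mhz):.1f} GOPS, {capacity_gb(dpus):.2f} GB")
```

Both systems match the table to the printed precision, which suggests the aggregate figures are simple sums of per-DPU resources.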



### **KEY TAKEAWAY 3**

The most well-suited workloads for the UPMEM PIM architecture require little or no communication across DPUs (inter-DPU communication).

### **KEY TAKEAWAY 4**

- UPMEM-based PIM systems **outperform state-of-the-art CPUs in terms of performance and energy efficiency on most PrIM benchmarks.**

- UPMEM-based PIM systems **outperform state-of-the-art GPUs on a majority of PrIM benchmarks**, and the outlook is even more positive for future PIM systems.

- UPMEM-based PIM systems are **more energy-efficient than state-of-the-art CPUs and GPUs on workloads for which they provide performance improvements** over the CPUs and the GPUs.

Understanding a Modern Processing-in-Memory Architecture:

**Benchmarking and Experimental Characterization** 

<u>Juan Gómez Luna</u>, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

el1goluj@gmail.com

<u>https://arxiv.org/pdf/2105.03814.pdf</u> <u>https://github.com/CMU-SAFARI/prim-benchmarks</u>





## UPMEM PIM System Summary & Analysis

Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu, "Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware," Invited Paper at Workshop on Computing with Unconventional Technologies (CUT), Virtual, October 2021. [arXiv version] [PrIM Benchmarks Source Code] [Slides (pptx) (pdf)] [Talk Video (37 minutes)] [Lightning Talk Video (3 minutes)]

## Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

Juan Gómez-Luna (ETH Zürich), Izzat El Hajj (American University of Beirut), Ivan Fernandez (University of Malaga), Christina Giannoula (National Technical University of Athens), Geraldo F. Oliveira (ETH Zürich), Onur Mutlu (ETH Zürich)

# **PrIM Benchmarks: Application Domains**

| Domain                | Benchmark                     | Short name |
|-----------------------|-------------------------------|------------|
| Dense linear algebra  | Vector Addition               | VA         |
| Dense linear algebra  | Matrix-Vector Multiply        | GEMV       |
| Sparse linear algebra | Sparse Matrix-Vector Multiply | SpMV       |
| Databases             | Select                        | SEL        |
| Databases             | Unique                        | UNI        |
| Data analytics        | Binary Search                 | BS         |
| Data analytics        | Time Series Analysis          | TS         |
| Graph processing      | Breadth-First Search          | BFS        |
| Neural networks       | Multilayer Perceptron         | MLP        |
| Bioinformatics        | Needleman-Wunsch              | NW         |
| Image processing      | Image histogram (short)       | HST-S      |
| Image processing      | Image histogram (large)       | HST-L      |
| Parallel primitives   | Reduction                     | RED        |
| Parallel primitives   | Prefix sum (scan-scan-add)    | SCAN-SSA   |
| Parallel primitives   | Prefix sum (reduce-scan-scan) | SCAN-RSS   |
| Parallel primitives   | Matrix transposition          | TRNS       |
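The two prefix-sum entries differ only in how the three phases are ordered. The sketch below shows the standard block-wise formulations that the names suggest, in plain Python with each inner list standing in for one DPU's data. This is an assumption about the structure, not the PrIM source code, and the choice of an inclusive scan is arbitrary.

```python
from itertools import accumulate


def block_offsets(block_totals):
    """Exclusive scan of per-block totals: the value each block must add to its results."""
    return [0] + list(accumulate(block_totals))[:-1]


def scan_ssa(blocks):
    """Scan-scan-add: local scan per block, scan of block totals, then add the offsets."""
    local = [list(accumulate(b)) for b in blocks]            # phase 1: scan
    totals = [scanned[-1] if scanned else 0 for scanned in local]
    offsets = block_offsets(totals)                          # phase 2: scan
    return [x + off for scanned, off in zip(local, offsets) for x in scanned]  # phase 3: add


def scan_rss(blocks):
    """Reduce-scan-scan: reduce per block, scan of block totals, then local scan with offset."""
    totals = [sum(b) for b in blocks]                        # phase 1: reduce
    offsets = block_offsets(totals)                          # phase 2: scan
    out = []
    for block, off in zip(blocks, offsets):                  # phase 3: scan
        running = off
        for x in block:
            running += x
            out.append(running)
    return out


blocks = [[3, 1, 4], [1, 5, 9], [2, 6]]
print(scan_ssa(blocks))
print(scan_rss(blocks))
```

Both variants produce the same result; they differ in how many passes each block makes over its own data and in what is exchanged between phases.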

## **PrIM Benchmarks are Open Source**

- All microbenchmarks, benchmarks, and scripts
- <u>https://github.com/CMU-SAFARI/prim-benchmarks</u>

| CMU-SAFARI / prim-benchmarks: README.md | | | | | | |
|---|---|---|---|---|---|---|
| PrIM (Processing-In-Memory benchmarks). PrIM is the first benchmark suite for a real-world processing-in-memory (PIM) architecture. PrIM is developed to evaluate, analyze, and characterize the first publicly-available real-world processing-in-memory (PIM) architecture, the UPMEM PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip. | | | | | | |
| PrIM provides a common set of workloads to eval<br>architecture and system researchers all alike to in<br>have different characteristics, exhibiting heteroge<br>communication patterns. This repository also con<br>comparison purposes.  | prove multiple as<br>neity in their mem                                     | pects of futu<br>nory access p               | re PIM hardwai<br>patterns, opera                     | re and softwar<br>tions and data | re. The workloads<br>a types, and  |         |
| PrIm also includes a set of microbenchmarks can                                                                                                                                                                                            | be used to assess                                                           | s various arch                               | itecture limits                                       | such as comp                     | oute throughput and                | b       |

memory bandwidth.

## Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

JUAN GÓMEZ-LUNA<sup>1</sup>, IZZAT EL HAJJ<sup>2</sup>, IVAN FERNANDEZ<sup>1,3</sup>, CHRISTINA GIANNOULA<sup>1,4</sup>, GERALDO F. OLIVEIRA<sup>1</sup>, AND ONUR MUTLU<sup>1</sup>

<sup>1</sup>ETH Zürich

<sup>2</sup>American University of Beirut

<sup>3</sup>University of Malaga

<sup>4</sup>National Technical University of Athens

Corresponding author: Juan Gómez-Luna (e-mail: juang@ethz.ch).

https://arxiv.org/pdf/2105.03814.pdf

https://github.com/CMU-SAFARI/prim-benchmarks

## Understanding a Modern PIM Architecture



https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi_tOTAYm--dYByNPL7JhwR9

## More on Analysis of the UPMEM PIM Engine



https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi_tOTAYm--dYByNPL7JhwR9

#### Understanding a Modern Processing-in-Memory Arch: Benchmarking & Experimental Characterization; 21m



https://www.youtube.com/watch?v=Pp9jSU2b9oM&list=PL5Q2soXY2Zi8\_VVChACnON4sfh2bJ5IrD&index=159

## More on PrIM Benchmarks

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu, **"Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture"** Preprint in arXiv, 9 May 2021. [arXiv preprint] [PrIM Benchmarks Source Code] [Slides (pptx) (pdf)] [Long Talk Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [SAFARI Live Seminar Slides (pptx) (pdf)] [SAFARI Live Seminar Video (2 hrs 57 mins)] [Lightning Talk Video (3 minutes)]

## UPMEM PIM System Summary & Analysis

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu, **"Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware"** Invited Paper at Workshop on Computing with Unconventional Technologies (CUT), Virtual, October 2021. [arXiv version] [PrIM Benchmarks Source Code] [Slides (pptx) (pdf)] [Talk Video (37 minutes)] [Lightning Talk Video (3 minutes)]

## Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

Juan Gómez-Luna, ETH Zürich

Izzat El Hajj, American University of Beirut

Ivan Fernandez, University of Malaga

Christina Giannoula, National Technical University of Athens

Geraldo F. Oliveira, ETH Zürich

Onur Mutlu, ETH Zürich

# Year III Results (2022 Annual Review 1)

- Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System [IEEE Access'22]
- Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware [CUT 2021]
- An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System [arXiv 2022]
- SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures [SIGMETRICS 2022]
- High-throughput Pairwise Alignment with the Wavefront Algorithm using Processing-in-Memory [HICOMB 2022]
- PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM [arXiv 2021]

## **ML Training on a Real PIM System**

### Machine Learning Training on a Real Processing-in-Memory System

Juan Gómez-Luna<sup>1</sup> Yuxin Guo<sup>1</sup> Sylvan Brocard<sup>2</sup> Julien Legriel<sup>2</sup> Remy Cimadomo<sup>2</sup> Geraldo F. Oliveira<sup>1</sup> Gagandeep Singh<sup>1</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>UPMEM

### An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

Juan Gómez-Luna<sup>1</sup> Yuxin Guo<sup>1</sup> Sylvan Brocard<sup>2</sup> Julien Legriel<sup>2</sup> Remy Cimadomo<sup>2</sup> Geraldo F. Oliveira<sup>1</sup> Gagandeep Singh<sup>1</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>UPMEM

Short version: https://arxiv.org/pdf/2206.06022.pdf Long version: https://arxiv.org/pdf/2207.07886.pdf https://www.youtube.com/watch?v=qeukNs5XI3g&t=11226s

# Machine Learning Training on a Real Processing-in-Memory System

<u>Juan Gómez Luna</u>, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

Short version: https://arxiv.org/pdf/2206.06022.pdf Long version: https://arxiv.org/pdf/2207.07886.pdf https://www.youtube.com/watch?v=qeukNs5XI3g&t=11226s

# **Executive Summary**

- Training machine learning (ML) algorithms is a computationally expensive process, frequently memory-bound due to repeatedly accessing large training datasets
- Memory-centric computing systems, i.e., with Processing-in-Memory (PIM) capabilities, can alleviate this data movement bottleneck
- Real-world PIM systems have only recently been manufactured and commercialized
  - UPMEM has designed and fabricated the first publicly-available real-world PIM architecture
- Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training
- Our main contributions:
  - PIM implementation of several classical machine learning algorithms: linear regression, logistic regression, decision tree, K-means clustering
  - Workload characterization in terms of accuracy, performance, and scaling
  - Comparison to their counterpart implementations on processor-centric systems (CPU and GPU)
- Experimental evaluation on a real-world PIM system with 2,524 PIM cores @ 425 MHz and 158 GB of DRAM memory
- New observations and insights:
  - ML training in PIM systems benefits from (1) fixed-point representation, (2) quantization, and (3) hybrid precision implementations
  - Complex activation functions (e.g., sigmoid) can take advantage of LUTs in PIM systems without native support for those activation functions
  - Data can be placed and laid out so that PIM cores access their nearby memory banks in a streaming manner, thus maximizing PIM memory bandwidth
  - ML training benefits from scaling the size of PIM-enabled memory with PIM cores attached to memory banks
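
As a rough illustration of the fixed-point observation above, the sketch below represents real values as scaled integers so that a dot product needs only integer multiplies, adds, and shifts. The number of fractional bits (`FRAC_BITS`) and the helper names are illustrative assumptions, not taken from the paper's implementation.

```python
FRAC_BITS = 10            # assumed number of fractional bits (illustrative)
SCALE = 1 << FRAC_BITS


def to_fixed(x: float) -> int:
    """Quantize a real value to a fixed-point integer."""
    return int(round(x * SCALE))


def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / SCALE


def fixed_dot(weights_q, features_q) -> int:
    """Dot product of two fixed-point vectors; the result is fixed-point too."""
    acc = 0
    for w, x in zip(weights_q, features_q):
        acc += w * x          # each product carries 2 * FRAC_BITS fractional bits
    return acc >> FRAC_BITS   # rescale back to FRAC_BITS fractional bits


weights = [0.5, -1.25, 2.0]
features = [1.0, 0.5, 0.25]
approx = to_float(fixed_dot([to_fixed(w) for w in weights],
                            [to_fixed(x) for x in features]))
exact = sum(w * x for w, x in zip(weights, features))
print(approx, exact)
```

Python integers do not overflow, so a real 32-bit implementation would additionally have to choose `FRAC_BITS` so that the accumulator stays in range.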



## ML Training on Real PIM Talk Video





https://www.youtube.com/watch?v=qeukNs5XI3g&t=11226s

## Outline

- Machine learning workloads
- Processing-in-memory
- PIM implementation of ML workloads
- Evaluation
- Key observations and insights

# **Machine Learning Workloads**

- Machine learning training with large amounts of data is a computationally expensive process, which requires many iterations to update an ML model's parameters



- Frequent data movement between memory and processing elements to access training data
- The amount of computation is not enough to amortize the cost of moving training data to the processing elements
  - Low arithmetic intensity
  - Low temporal locality
  - Irregular memory accesses

# Machine Learning Workloads: Our Goal

- Our goal is to study and analyze how real-world general-purpose PIM can accelerate ML training
- Four representative ML algorithms: linear regression, logistic regression, decision tree, K-means



- Roofline model to quantify the memory boundedness of CPU versions of the four workloads



All workloads fall in the memory-bound area of the Roofline
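
To make the Roofline argument concrete, here is a minimal sketch of the bound it expresses: attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The machine and kernel numbers are hypothetical placeholders, not measurements from the paper.

```python
def attainable_gflops(arith_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, arith_intensity * bandwidth_gbs)


def is_memory_bound(arith_intensity, peak_gflops, bandwidth_gbs):
    """A kernel is memory-bound when it sits left of the ridge point."""
    ridge = peak_gflops / bandwidth_gbs   # FLOP/byte where both limits meet
    return arith_intensity < ridge


# Hypothetical machine and kernel, for illustration only.
PEAK_GFLOPS = 500.0
BANDWIDTH_GBS = 25.0
KERNEL_AI = 0.25                          # FLOP per byte moved

bound = attainable_gflops(KERNEL_AI, PEAK_GFLOPS, BANDWIDTH_GBS)
memory_bound = is_memory_bound(KERNEL_AI, PEAK_GFLOPS, BANDWIDTH_GBS)
print(bound, memory_bound)
```

A kernel with low arithmetic intensity, as listed for the ML training workloads, stays on the bandwidth-limited slope no matter how much compute the processor offers.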

# Processing-in-Memory (PIM)

- PIM is a computing paradigm that advocates for memory-centric computing systems, where processing elements are placed near or inside the memory arrays
- Real-world PIM architectures are becoming a reality
  - UPMEM PIM, Samsung HBM-PIM, Samsung AxDIMM, SK Hynix AiM, Alibaba HB-PNM
- These PIM systems have some common characteristics:
  - 1. There is a host processor (CPU or GPU) with access to (1) standard main memory, and (2) PIM-enabled memory
  - 2. PIM-enabled memory contains multiple PIM processing elements (PEs) with high-bandwidth, low-latency memory access
  - 3. PIM PEs run only at a few hundred MHz and have a small number of registers and small (or no) cache/scratchpad
  - 4. PEs may need to communicate via the host processor

# A State-of-the-Art PIM System



- In our work, we use the UPMEM PIM architecture
  - General-purpose processing cores called DRAM Processing Units (DPUs)
    - Up to 24 PIM threads, called *tasklets*
    - 32-bit integer arithmetic, but multiplication/division are emulated, as well as floating-point operations
  - 64-MB DRAM bank (MRAM), 64-KB scratchpad (WRAM)

# **ML Training Workloads**



- Memory access patterns
- Operations and datatypes
- Communication/synchronization

| Learning approach | Application    | Algorithm           | Short name | Memory access: sequential | Memory access: strided | Memory access: random | Operations         | Datatype         | Intra PIM core sync. | Inter PIM core comm. |
|-------------------|----------------|---------------------|------------|---------------------------|------------------------|-----------------------|--------------------|------------------|----------------------|----------------------|
| Supervised        | Regression     | Linear Regression   | LIN        | Yes                       | No                     | No                    | mul, add           | float, int32_t   | barrier              | Yes                  |
| Supervised        | Classification | Logistic Regression | LOG        | Yes                       | No                     | No                    | mul, add, exp, div | float, int32_t   | barrier              | Yes                  |
| Supervised        | Classification | Decision Tree       | DTR        | Yes                       | No                     | No                    | compare, add       | float            | barrier, mutex       | Yes                  |
| Unsupervised      | Clustering     | K-Means             | KME        | Yes                       | No                     | No                    | mul, compare, add  | int16_t, int64_t | barrier, mutex       | Yes                  |

# **Evaluation Methodology**

- Synthetic and real datasets

| ML workload         | Strong scaling (1 PIM core)         | Strong scaling (256-2048 PIM cores)        | Weak scaling (per PIM core)         | Real dataset            |
|---------------------|-------------------------------------|--------------------------------------------|-------------------------------------|-------------------------|
| Linear regression   | 2,048 samples, 16 attr. (0.125 MB)  | 6,291,456 samples, 16 attr. (384 MB)       | 2,048 samples, 16 attr. (0.125 MB)  | SUSY [223, 224]         |
| Logistic regression | 2,048 samples, 16 attr. (0.125 MB)  | 6,291,456 samples, 16 attr. (384 MB)       | 2,048 samples, 16 attr. (0.125 MB)  | Skin segmentation [225] |
| Decision tree       | 60,000 samples, 16 attr. (3.84 MB)  | 153,600,000 samples, 16 attr. (9830 MB)    | 600,000 samples, 16 attr. (38.4 MB) | Higgs boson [223, 226]  |
| K-Means             | 10,000 samples, 16 attr. (0.64 MB)  | 25,600,000 samples, 16 attr. (1640 MB)     | 100,000 samples, 16 attr. (6.4 MB)  | Higgs boson [223, 226]  |



- Metrics
- Performance of PIM kernels
- Performance scaling
- Comparison to CPU and GPU

# 2,560-DPU System (I)

- UPMEM-based PIM system with 20 UPMEM DIMMs of 16 chips each (40 ranks)
  - P21 DIMMs
  - Dual x86 socket
    - UPMEM DIMMs coexist with regular DDR4 DIMMs
    - 2 memory controllers/socket (3 channels each)
    - 2 conventional DDR4 DIMMs on one channel of one controller
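
The system size follows from simple arithmetic, sketched below. The DIMM, chip, and rank counts come from this slide, and the 64-MB bank per DPU and the 2,524 PIM cores come from earlier slides; the value of 8 DPUs per chip is an assumption added here because the slide does not state it, chosen to be consistent with the 2,560-DPU total.

```python
DIMMS = 20
CHIPS_PER_DIMM = 16
RANKS = 40
DPUS_PER_CHIP = 8          # assumption: not stated on the slide
MRAM_MB_PER_DPU = 64       # 64-MB DRAM bank (MRAM) per DPU
WORKING_DPUS = 2524        # PIM cores reported for the evaluation

total_dpus = DIMMS * CHIPS_PER_DIMM * DPUS_PER_CHIP
dpus_per_rank = total_dpus // RANKS
usable_gib = WORKING_DPUS * MRAM_MB_PER_DPU / 1024

print(total_dpus, dpus_per_rank, usable_gib)
```

Under these assumptions the capacity attached to the 2,524 PIM cores rounds to the 158 GB quoted in the executive summary.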



# 2,560-DPU System (II)



# **Evaluation: Metrics**

- Linear regression
  - Training error rate of LIN-FP32 is the same as the CPU version
  - For integer versions, it remains low and close to that of LIN-FP32
- Logistic regression
  - LUT-based versions obtain lower training error rates than LOG-INT32, since they use exact values, not approximations
- Decision tree
  - Training accuracy only slightly lower than that of the CPU version
- K-means
  - PIM and CPU versions achieve the same Calinski-Harabasz score and adjusted Rand index

## Evaluation: Analysis of PIM Kernels (I)



LIN-HYB is 41% faster than LIN-INT32

LIN-BUI provides an additional 25% speedup

## **Evaluation: Analysis of PIM Kernels (II)**



than LOG-INT32-LUT

LOG-BUI-LUT provides an additional 43% speedup

## **Evaluation: Analysis of PIM Kernels (III)**

- Decision tree & K-means



## **Evaluation: Performance Scaling**

- Strong scaling: 256 to 2,048 PIM cores

## Comparison to CPU and GPU (I)

- Linear regression and logistic regression



PIM versions are heavily burdened when they use operations that are not natively supported by the hardware

Several optimizations reduce the execution time considerably and close the gap with GPU performance

## Comparison to CPU and GPU (II)

- Decision tree and K-means



PIM version of DTR is 27x faster than the CPU version and 1.34x faster than the GPU version

PIM version of KME is 2.8x faster than the CPU version and 3.2x faster than the GPU version

## **Key Observations and Insights**

- ML training workloads can greatly benefit from (1) fixed-point data representation, (2) quantization, and (3) hybrid precision implementation in PIM systems
- ML training workloads that require complex activation functions (e.g., sigmoid) can take advantage of lookup tables (LUTs) in PIM systems instead of function approximation
- Data can be placed and laid out such that memory accesses of PIM cores are streaming
- ML training workloads with large training datasets benefit from scaling the size of PIM-enabled memory with PIM cores attached to memory arrays
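
The lookup-table idea for activation functions can be sketched as follows. This is a minimal illustration, not the paper's implementation: the table size, the covered input range, and the use of floating-point table entries are all assumptions made for readability.

```python
import math

LUT_SIZE = 1024        # assumed number of table entries
X_MAX = 8.0            # assumed input range covered by the table: [0, X_MAX)
STEP = X_MAX / LUT_SIZE

# Precompute sigmoid for non-negative inputs (done once, e.g. on the host).
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-i * STEP)) for i in range(LUT_SIZE)]


def sigmoid_lut(x: float) -> float:
    """Look up sigmoid(x), using sigmoid(-x) = 1 - sigmoid(x) for x < 0."""
    idx = min(int(abs(x) / STEP), LUT_SIZE - 1)   # clamp inputs beyond the table
    y = SIGMOID_LUT[idx]
    return y if x >= 0 else 1.0 - y


# Largest deviation from the exact sigmoid over [-10, 10] in steps of 0.01.
max_err = max(abs(sigmoid_lut(x / 100) - 1 / (1 + math.exp(-x / 100)))
              for x in range(-1000, 1001))
print(max_err)
```

At run time the kernel performs one index computation and one memory read per activation, with no exponentiation or division.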

## SpMV Multiplication on Real PIM Systems

- Appears in SIGMETRICS 2022

#### **SparseP:** Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

CHRISTINA GIANNOULA, ETH Zürich, Switzerland and National Technical University of Athens, Greece

IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland

NECTARIOS KOZIRIS, National Technical University of Athens, Greece

GEORGIOS GOUMAS, National Technical University of Athens, Greece

ONUR MUTLU, ETH Zürich, Switzerland

https://arxiv.org/pdf/2201.05072.pdf

https://github.com/CMU-SAFARI/SparseP

https://www.youtube.com/watch?v=5kaOsJKIGrE



#### Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures

Christina Giannoula, Ivan Fernandez, Juan Gomez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu










## SparseP Summary

### Efficient Algorithmic Designs

The first open-source Sparse Matrix Vector Multiplication (SpMV) software package, SparseP, for real Processing-In-Memory (PIM) systems

SparseP is Open-Source

SparseP: https://github.com/CMU-SAFARI/SparseP

### **Extensive Characterization**

The first comprehensive analysis of SpMV on the first real commercial PIM architecture

Recommendations for Architects and Programmers

Full Paper: https://arxiv.org/pdf/2201.05072.pdf

## SparseP: SpMV Library for Real PIMs

Our Contributions:

- 1. Design efficient SpMV kernels for current and future PIM systems
  - 25 SpMV kernels
    - 4 compressed matrix formats (CSR, COO, BCSR, BCOO)
    - 6 data types
    - 4 data partitioning techniques
    - Various load balancing schemes among PIM cores/threads
    - 3 synchronization approaches
- 2. Provide a comprehensive analysis of SpMV on the first commercially-available real PIM system (UPMEM)
  - 26 sparse matrices
  - Comparisons to state-of-the-art CPU and GPU systems
  - Recommendations for software, system and hardware designers

## SparseP Talk Video

https://www.youtube.com/watch?v=5kaOsJKlGrE

## Sparse Matrix Vector Multiplication

Sparse Matrix Vector Multiplication (SpMV):

- Widely-used kernel in graph processing, machine learning, scientific computing ...
- A highly memory-bound kernel
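
For reference, a minimal SpMV over the CSR format is sketched below. It is a plain sequential version written for illustration, not one of the SparseP kernels; it shows why the kernel does so little work per byte: one multiply-add per stored non-zero, with an indirect access into the input vector.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # One multiply-add per non-zero; x is accessed irregularly.
            y[row] += values[k] * x[col_idx[k]]
    return y


# 3x3 example matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1, 2, 3, 4, 5]
x = [1, 1, 1]
y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [3, 3, 9]
```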

**Roofline Model** 



## Real Processing-In-Memory Systems

Real Near-Bank Processing-In-Memory (PIM) Systems:

- High levels of parallelism
- Low memory access latency
- Large aggregate memory bandwidth





## SpMV Execution on a PIM System



## SparseP Software Package

#### 25 SpMV kernels for PIM Systems

https://github.com/CMU-SAFARI/SparseP

| Partitioning                         | Matrix Format | Load-Balancing     |
|--------------------------------------|---------------|--------------------|
| 1D (9 kernels)                       | CSR           | rows, nnzs *       |
|                                      | COO ▲         | rows, nnzs *, nnzs |
|                                      | BCSR          | blocks ^, nnzs ^   |
|                                      | BCOO ▲        | blocks, nnzs       |
| 2D Equally-Sized Tiles (4 kernels)   | CSR           |                    |
|                                      | COO ▲         |                    |
|                                      | BCSR          |                    |
|                                      | BCOO ▲        |                    |
| 2D Equally-Wide Tiles (6 kernels)    | CSR           | nnzs *             |
|                                      | COO ▲         | nnzs               |
|                                      | BCSR          | blocks ^, nnzs ^   |
|                                      | BCOO ▲        | blocks, nnzs       |
| 2D Variable-Sized Tiles (6 kernels)  | CSR           | nnzs *             |
|                                      | COO ▲         | nnzs               |
|                                      | BCSR          | blocks ^, nnzs ^   |
|                                      | BCOO ▲        | blocks, nnzs       |

Load-balance across PIM cores/threads:

- \* row-granularity (CSR)
- ^ block-row-granularity (BCSR)
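
The difference between balancing rows and balancing non-zeros at row granularity can be sketched as below. The two partitioning functions are illustrative assumptions about how such schemes might be written, not code from SparseP; both only cut at row boundaries, as the `*` marker indicates for CSR.

```python
def partition_by_rows(row_ptr, n_cores):
    """Give each core (almost) the same number of rows."""
    n_rows = len(row_ptr) - 1
    bounds = [n_rows * c // n_cores for c in range(n_cores + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def partition_by_nnzs(row_ptr, n_cores):
    """Give each core roughly the same number of non-zeros, cutting at row boundaries."""
    n_rows, total = len(row_ptr) - 1, row_ptr[-1]
    bounds = [0]
    for c in range(1, n_cores):
        target = total * c / n_cores
        # Pick the row boundary whose prefix nnz count is closest to the target.
        cut = min(range(bounds[-1], n_rows + 1),
                  key=lambda i: abs(row_ptr[i] - target))
        bounds.append(cut)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


def nnzs_per_core(row_ptr, parts):
    """Non-zeros assigned to each core for a list of (start_row, end_row) ranges."""
    return [row_ptr[end] - row_ptr[start] for start, end in parts]


# Skewed example: row 0 holds 6 of the 12 non-zeros.
row_ptr = [0, 6, 7, 8, 9, 10, 12]
by_rows = nnzs_per_core(row_ptr, partition_by_rows(row_ptr, 2))
by_nnzs = nnzs_per_core(row_ptr, partition_by_nnzs(row_ptr, 2))
print(by_rows, by_nnzs)  # [8, 4] [6, 6]
```

With a skewed matrix, equal row counts leave one core with twice the work of the other, while balancing non-zeros evens out the load.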

#### Synchronization

Among threads of a PIM core:

lb-cg, lb-fg, lf (COO, BCOO)

#### Data Types:

- 8-bit integer
- 16-bit integer
- 32-bit integer
- 64-bit integer
- 32-bit float
- 64-bit float

## **Comparison of Compressed Formats**

2048 PIM Cores, 32-bit integer

[Figure: speedup of the compressed formats on regular and scale-free matrices; panels: 1D, 2D Equally-Sized, 2D Equally-Wide, 2D Variable-Sized]

#### Key Takeaway 1

The compressed matrix format used to store the input matrix determines the data partitioning across DRAM banks of PIM-enabled memory. As a result, it affects the load-balance across PIM cores (and threads of a PIM core) with corresponding performance implications.

#### **Recommendation 1**

Design compressed data structures that can be effectively partitioned across DRAM banks, with the goal of providing high computation balance across PIM cores (and threads of a PIM core).

## Scalability

COO format, 32-bit integer

#### Key Takeaway 2

The 1D-partitioned kernels are severely **bottlenecked** by the high data transfer costs to **broadcast** the whole input vector **into DRAM banks** of all PIM cores, through the narrow off-chip memory bus.



#### Recommendation 2

Optimize the broadcast collective operation in data transfers to PIM-enabled memory to efficiently copy the input data into DRAM banks in the PIM system.
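
A back-of-the-envelope model shows why broadcasting the input vector dominates 1D-partitioned kernels. The sketch assumes that with 2D partitioning each PIM core only needs the slice of the input vector that matches its tile's columns; the matrix width and the number of column partitions are hypothetical values chosen for illustration.

```python
def input_vector_bytes_1d(n_cols, elem_bytes, n_cores):
    """1D partitioning: every PIM core receives the whole input vector."""
    return n_cols * elem_bytes * n_cores


def input_vector_bytes_2d(n_cols, elem_bytes, n_cores, n_col_partitions):
    """2D partitioning: each core receives only the slice for its tile's columns."""
    slice_len = -(-n_cols // n_col_partitions)   # ceiling division
    return slice_len * elem_bytes * n_cores


N_COLS = 1_000_000        # hypothetical matrix width
ELEM_BYTES = 4            # 32-bit integer
N_CORES = 2048
N_COL_PARTITIONS = 32     # hypothetical number of vertical partitions

bytes_1d = input_vector_bytes_1d(N_COLS, ELEM_BYTES, N_CORES)
bytes_2d = input_vector_bytes_2d(N_COLS, ELEM_BYTES, N_CORES, N_COL_PARTITIONS)
print(bytes_1d, bytes_2d, bytes_1d / bytes_2d)
```

The 2D schemes trade this reduction in input traffic for partial results that must be gathered and merged by the host.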

## Scalability

COO format, 32-bit integer

#### Key Takeaway 3

The 2D equally-wide and variable-sized kernels need fine-grained parallel data transfers at DRAM bank granularity (zero padding) to be supported by the PIM system to achieve high performance.



#### Recommendation 3

Optimize the gather collective operation at DRAM bank granularity in data transfers from PIM-enabled memory to efficiently retrieve the output results to the host CPU.

## 1D vs 2D

#### Key Takeaway 4

Expensive data transfers to/from PIM-enabled memory performed via the narrow memory bus impose significant performance overhead on end-to-end SpMV execution. Thus, it is hard to fully exploit all available PIM cores of the system.



#### Recommendation 4

Design high-speed communication channels and optimized libraries for data transfers to/from PIM-enabled memory, provide hardware support to effectively overlap computation with data transfers in the PIM system, and/or integrate PIM-enabled memory as the main memory of the system.

## **CPU/GPU** Comparisons

- Kernel-Only (COO, 32-bit float):
  - CPU = 0.51% of Peak Perf.
  - GPU = 0.21% of Peak Perf.
  - PIM (1D) = 50.7% of Peak Perf.
- End-to-End (COO, 32-bit float):
  - CPU = **4.08 GFlop/s**
  - GPU = 1.92 GFlop/s
  - PIM (1D) = 0.11 GFlop/s

| System | Model                  | Peak Performance | Bandwidth | TDP    | Type              |
|--------|------------------------|------------------|-----------|--------|-------------------|
| CPU    | Intel Xeon Silver 4110 | 660 GFlops       | 23.1 GB/s | 2x85 W | Processor-Centric |
| GPU    | NVIDIA Tesla V100      | 14.13 TFlops     | 897 GB/s  | 300 W  | Processor-Centric |
| PIM    | UPMEM 1st Gen.         | 4.66 GFlops      | 1.77 TB/s | 379 W  | Memory-Centric    |
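
The kernel-only percentages can be turned into absolute throughput, and the table into Roofline ridge points, with a few lines of arithmetic. The sketch uses only the numbers on this slide; the derived values are implied by those numbers rather than reported separately.

```python
systems = {
    # name: (peak GFlop/s, bandwidth GB/s, kernel-only fraction of peak)
    "CPU": (660.0, 23.1, 0.0051),
    "GPU": (14130.0, 897.0, 0.0021),
    "PIM": (4.66, 1770.0, 0.507),
}

# Kernel-only GFlop/s implied by "fraction of peak".
achieved = {name: peak * frac for name, (peak, bw, frac) in systems.items()}

# Ridge point: FLOP/byte a kernel needs before compute becomes the limit.
ridge = {name: peak / bw for name, (peak, bw, frac) in systems.items()}

for name in systems:
    print(f"{name}: {achieved[name]:.2f} GFlop/s kernel-only, "
          f"ridge point {ridge[name]:.4f} FLOP/byte")
```

The PIM system's ridge point is orders of magnitude lower than that of the CPU or GPU, which is consistent with SpMV reaching about half of peak on PIM while using well under 1% of peak on the processor-centric systems.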

Many more results in the full paper: https://arxiv.org/pdf/2201.05072.pdf



## Real Processing Using Memory Prototype

### PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

Ataberk Olgun<sup>§†</sup> Juan Gómez Luna<sup>§</sup> Konstantinos Kanellopoulos<sup>§</sup> Behzad Salami<sup>§\*</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

<sup>§</sup>ETH Zürich <sup>†</sup>TOBB ETÜ <sup>\*</sup>BSC

https://arxiv.org/pdf/2111.00082.pdf https://github.com/cmu-safari/pidram https://www.youtube.com/watch?v=qeukNs5XI3g&t=4192s

## Real Processing Using Memory Prototype

#### README.md

#### Building a PiDRAM Prototype

To build PiDRAM's prototype on Xilinx ZC706 boards, developers need to use the two sub-projects in this directory. fpga-zynq is a repository branched off of UCB-BAR's fpga-zynq repository. We use fpga-zynq to generate rocket chip designs that support end-to-end DRAM PuM execution. controller-hardware is where we keep the main Vivado project and Verilog sources for PiDRAM's memory controller and the top level system design.

#### **Rebuilding Steps**

- 1. Navigate into fpga-zynq and read the README file to understand the overall workflow of the repository
  - Follow the readme in fpga-zynq/rocket-chip/riscv-tools to install dependencies
- 2. Create the Verilog source of the rocket chip design using the ZynqCopyFPGAConfig
  - Navigate into zc706, then run make rocket CONFIG=ZynqCopyFPGAConfig -j<number of cores>
- 3. Copy the generated Verilog file (should be under zc706/src) and overwrite the same file in controller-hardware/source/hdl/impl/rocket-chip
- 4. Open the Vivado project in controller-hardware/Vivado\_Project using Vivado 2016.2
- 5. Generate a bitstream
- 6. Copy the bitstream (system\_top.bit) to fpga-zynq/zc706
- 7. Use ./build\_script.sh to generate the new boot.bin under fpga-images-zc706; you can use this file to program the FPGA using the SD card
  - For details, follow the relevant instructions in fpga-zynq/README.md

You can run programs compiled with the RISC-V Toolchain supplied within the fpga-zynq repository. To install the toolchain, follow the instructions under fpga-zynq/rocket-chip/riscv-tools.

#### **Generating DDR3 Controller IP sources**

We cannot provide the sources for the Xilinx PHY IP we use in PiDRAM's memory controller due to licensing issues. We describe here how to regenerate them using Vivado 2016.2. First, you need to generate the IP RTL files:

1- Open IP Catalog

2- Find "Memory Interface Generator (MIG 7 Series)" IP and double click


# PiDRAM

An FPGA-based Framework for End-to-end Evaluation of Processing-in-DRAM Techniques

#### **Ataberk Olgun**

Juan Gomez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oğuz Ergin, Onur Mutlu







# **Executive Summary**

**Motivation:** Commodity DRAM based PiM techniques improve the performance and energy efficiency of computing systems at no additional DRAM hardware cost

**Problem:** Challenges of integrating these PiM techniques into real systems are not solved

- General-purpose computing systems, special-purpose testing platforms, and system simulators *cannot* be used to efficiently study system integration challenges

**Goal:** Design and implement a flexible framework that can be used to:

- Solve system integration challenges
- Analyze trade-offs of end-to-end implementations of commodity DRAM based PiM techniques

Key idea: PiDRAM, an FPGA-based framework that enables:

- System integration studies
- End-to-end evaluations

of commodity DRAM based PiM techniques using real unmodified DRAM chips

**Evaluation:** End-to-end integration of two PiM techniques on PiDRAM's FPGA prototype

Case Study #1 – RowClone: In-DRAM bulk data copy operations

- 119x speedup for copy operations compared to CPU-copy with system support
- 198 lines of Verilog and 565 lines of C++ code over PiDRAM's flexible codebase

Case Study #2 – D-RaNGe: DRAM-based random number generation technique

- 8.30 Mb/s true random number generator (TRNG) throughput, 220 ns TRNG latency
- 190 lines of Verilog and 78 lines of C++ code over PiDRAM's flexible codebase

## PiDRAM Talk Video



https://www.youtube.com/watch?v=qeukNs5XI3g&t=4243s

## **PiDRAM: Overview (I)**

- A flexible framework that can be used to:
  - Solve system integration challenges
  - Analyze trade-offs of end-to-end implementations of commodity DRAM based PiM techniques

Identify key components shared across PiM techniques

Implement customizable key components:

• Provide modularity, enhance extensibility of the framework

Common basis to enable system support for PiM techniques

### SAFARI @kasırga

# **PiDRAM: Overview (II)**

Identify and develop four key hardware and software components

#### Hardware



- Flexible PiM Ops. Controller
- Easy-to-extend Memory Controller

### Software





Custom Supervisor Software



## **PiDRAM: System Design**

Key components are attached to a real computing system

- PiM Ops. Controller and PiDRAM Memory Controller are implemented within the hardware system
- Custom supervisor software runs on the hardware system
- Extensible software library is used by the supervisor software





# **PiM Operations Controller (POC)**

Decode & execute PiDRAM instructions (e.g., in-DRAM copy)

Receive instructions over memory-mapped interface (portable to other systems with different CPU ISAs)

Simple interface to the PiDRAM memory controller: (i) send request, (ii) wait until completion, (iii) read results





# **PiDRAM Memory Controller**

Perform PiM operations by violating DRAM timing parameters

Support conventional memory operations (e.g., LOAD/STORE)

One state machine per operation (e.g., LOAD/STORE, in-DRAM copy)

Easily replicate a state machine to implement a new operation
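The "one state machine per operation" idea can be pictured with a small sketch. This is not PiDRAM's Verilog: the function names, the registry, and the exact command order are assumptions made for illustration, and the "reduced gap" notes only mark where a PiM operation would deliberately shorten the delay between DRAM commands.

```python
from typing import Callable, Dict, Iterator, Tuple

# (DRAM command, row address, timing note). Purely illustrative.
Command = Tuple[str, int, str]

def load_fsm(row: int) -> Iterator[Command]:
    """Conventional access: every command obeys nominal timings."""
    yield ("ACT", row, "nominal")
    yield ("RD", row, "nominal")
    yield ("PRE", row, "nominal")

def in_dram_copy_fsm(src: int, dst: int) -> Iterator[Command]:
    """Assumed command order for an in-DRAM copy; the commands marked
    'reduced gap' are issued earlier than the nominal timings allow."""
    yield ("ACT", src, "nominal")
    yield ("PRE", src, "reduced gap")
    yield ("ACT", dst, "reduced gap")
    yield ("PRE", dst, "nominal")

# Hypothetical registry: supporting a new operation means adding one entry.
STATE_MACHINES: Dict[str, Callable[..., Iterator[Command]]] = {
    "load": load_fsm,
    "copy": in_dram_copy_fsm,
}

def run(op: str, *args: int) -> list:
    """Return only the command names emitted by the chosen state machine."""
    return [cmd for cmd, _, _ in STATE_MACHINES[op](*args)]

print(run("copy", 3, 7))
```

In this sketch, a new operation is added by copying an existing generator, changing its command sequence, and registering it under a new name.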

**Controls the physical DDR3 interface** 

**Receives commands from command scheduler & operates DDR3 pins** 





# **PiM Operations Library (pimolib)**

**Contains customizable functions that interface with the POC**

**Software interface for performing PiM operations**





### **Custom Supervisor Software**

**Exposes PiM operations to the user application via system calls** 

Contains the necessary OS primitives to develop end-to-end PiM techniques (e.g., memory management and allocation for RowClone)





copy() function: called by the user to perform a RowClone-Copy operation in DRAM











[Figure labels: pimolib, copy (S, D), Scheduler]

**Step 10:** copy (S, D) periodically checks either the "Ack" or the "Fin." flag using LOAD instructions

copy (S, D) returns when the periodically checked flag is set



#### Data Register is not used in RowClone operations because the result is stored *in memory*

#### It is used to read true random numbers generated by D-RaNGe
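The request, poll, and return flow described on these slides can be mimicked in software. The sketch below is a toy model, not pimolib: the `FakePOC` class, the flag bit values, and the three-poll latency are all assumptions, since the slides do not give the register layout.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical flag bits; the real encoding is not given on the slides.
ACK = 0x1   # POC accepted the instruction
FIN = 0x2   # operation finished

@dataclass
class FakePOC:
    """Toy stand-in for the memory-mapped POC registers."""
    dram: dict                       # row address -> row contents
    flags: int = 0
    data: int = 0                    # data register (unused by copy)
    _busy_polls: int = 0
    _pending: Optional[tuple] = None

    def store_instruction(self, op: str, src: int, dst: int) -> None:
        # A STORE to the instruction register starts the operation.
        self._pending = (op, src, dst)
        self.flags = ACK
        self._busy_polls = 3         # pretend the operation takes 3 polls

    def load_flags(self) -> int:
        # A LOAD from the flag register; the operation completes
        # once the simulated latency has elapsed.
        if self._pending is not None:
            if self._busy_polls == 0:
                op, src, dst = self._pending
                if op == "copy":
                    self.dram[dst] = self.dram[src]  # result stays in memory
                self._pending = None
                self.flags |= FIN
            else:
                self._busy_polls -= 1
        return self.flags

def copy(poc: FakePOC, src: int, dst: int) -> int:
    """Send the request, then poll the 'Fin.' flag until it is set."""
    poc.store_instruction("copy", src, dst)   # (i) send request
    polls = 0
    while not (poc.load_flags() & FIN):       # (ii) wait until completion
        polls += 1
    return polls                              # (iii) copy has no result to read

poc = FakePOC(dram={0: "row-data", 1: None})
print(copy(poc, 0, 1), poc.dram[1])
```

The data register stays untouched here because the copy result lives in the simulated DRAM. A random-number operation would instead end with a read of `data`.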





# **PiDRAM Components Summary**

Four key components orchestrate PiM operation execution

#### Four key components provide an extensible basis for end-to-end integration of PiM techniques





# **PiDRAM's FPGA Prototype**

Full system prototype on Xilinx ZC706 FPGA board

- **RISC-V System:** In-order, pipelined RISC-V Rocket CPU core, L1D/I\$, TLB
- **PiM-Enabled DIMM:** Micron MT8JTF12864, 1 GiB, 8 banks





### **PiDRAM is Open Source**

#### https://github.com/CMU-SAFARI/PiDRAM

From the repository README:

> PiDRAM is the first flexible end-to-end framework that enables system integration studies and evaluation of real Processing-using-Memory (PuM) techniques. PiDRAM, at a high level, comprises a RISC-V system and a custom memory controller that can perform PuM operations in real DDR3 chips. This repository contains all sources required to build PiDRAM and develop its prototype on the Xilinx ZC706 FPGA boards.


### **Extended Version on ArXiv**


#### https://arxiv.org/abs/2111.00082

**PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM**

Ataberk Olgun, Juan Gómez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oğuz Ergin, Onur Mutlu

Computer Science > Hardware Architecture. Submitted on 29 Oct 2021 (v1), last revised 19 Dec 2021 (v3).

> Processing-using-memory (PuM) techniques leverage the analog operation of memory cells to perform computation. Several recent works have demonstrated PuM techniques in off-the-shelf DRAM devices. Since DRAM is the dominant memory technology as main memory in current computing systems, these PuM techniques represent an opportunity for alleviating the data movement bottleneck at very low cost. However, system integration of PuM techniques imposes non-trivial challenges that are yet to be solved. Design space exploration of potential solutions to the PuM integration challenges requires appropriate tools to develop necessary hardware and software components. Unfortunately, current specialized DRAM-testing platforms, or system simulators do not provide the flexibility and/or the holistic system view that is necessary to deal with PuM integration challenges.
>
> We design and develop PiDRAM, the first flexible end-to-end framework that enables system integration studies and evaluation of real PuM techniques. PiDRAM provides software and hardware components to rapidly integrate PuM techniques across the whole system software and hardware stack (e.g., necessary modifications in the operating system, memory controller). We implement PiDRAM on an FPGA-based platform along with an open-source system. Using PiDRAM, we implement and evaluate two state-of-the-art PuM techniques: in-DRAM (i) copy and initialization, (ii) true random number generation. Our results show that the in-memory copy and initialization techniques can improve the performance of bulk copy operations by 12.6x and bulk initialization operations by 14.6x on a real system. Implementing the true random number generator requires only 190 lines of Verilog and 74 lines of C using PiDRAM's software and hardware components.

Comments: 15 pages, 12 figures. Subjects: Hardware Architecture (cs.AR). Cite as: arXiv:2111.00082 [cs.AR] (arXiv:2111.00082v3 [cs.AR] for this version), https://doi.org/10.48550/arXiv.2111.00082



# Longer Talk + Tutorial on Youtube

#### https://youtu.be/s_z_S6FYpC8




#### Year III Results (2022 Annual Review 2)

- SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping [ISCA 2022]
- GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis [ASPLOS 2022]
- Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm [HICOMB 2022]
- Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design [ICDE 2022]
- Flash-Cosmos: In-Flash Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory [MICRO 2022]

### Accelerating Sequence-to-Graph Mapping

Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zulal Bingol, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gomez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu,
 "SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping"
 Proceedings of the <u>49th International Symposium on Computer Architecture</u> (ISCA), New York, June 2022.

[arXiv version]

#### SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

Damla Senol Cali<sup>1</sup> Konstantinos Kanellopoulos<sup>2</sup> Joël Lindegger<sup>2</sup> Zülal Bingöl<sup>3</sup> Gurpreet S. Kalsi<sup>4</sup> Ziyi Zuo<sup>5</sup> Can Firtina<sup>2</sup> Meryem Banu Cavlak<sup>2</sup> Jeremie Kim<sup>2</sup> Nika Mansouri Ghiasi<sup>2</sup> Gagandeep Singh<sup>2</sup> Juan Gómez-Luna<sup>2</sup> Nour Almadhoun Alserr<sup>2</sup> Mohammed Alser<sup>2</sup> Sreenivas Subramoney<sup>4</sup> Can Alkan<sup>3</sup> Saugata Ghose<sup>6</sup> Onur Mutlu<sup>2</sup>

<sup>1</sup>Bionano Genomics <sup>2</sup>ETH Zürich <sup>3</sup>Bilkent University <sup>4</sup>Intel Labs <sup>5</sup>Carnegie Mellon University <sup>6</sup>University of Illinois Urbana-Champaign

#### https://arxiv.org/pdf/2205.05883.pdf

#### SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

#### Damla Senol Cali, Ph.D.

damlasenolcali@gmail.com https://damlasenolcali.github.io/

Konstantinos Kanellopoulos, Joel Lindegger, Zulal Bingol, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie S. Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gomez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu







### Genome Sequence Analysis

Mapping the reads to a reference genome (i.e., *read mapping*) is a critical step in genome sequence analysis



Sequence-to-graph mapping results in notable quality improvements. However, it is a more difficult computational problem, with no prior hardware design.


#### SeGraM: First Graph Mapping Accelerator

#### Our Goal:

**Specialized, high-performance, scalable, and low-cost** algorithm/hardware co-design that alleviates bottlenecks in **multiple steps** of sequence-to-graph mapping

**SeGraM:** First universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support:

- Sequence-to-graph mapping
- Sequence-to-sequence mapping
- Both short and long reads



### Use Cases & Key Results

(1) Sequence-to-Graph (S2G) Mapping

- 5.9×/106× speedup, 4.1×/3.0× less power than GraphAligner for long and short reads, respectively (state-of-the-art SW)
- 3.9×/742× speedup, 4.4×/3.2× less power than vg for long and short reads, respectively (state-of-the-art SW)

(2) Sequence-to-Graph (S2G) Alignment

- **41**×–**539**× speedup over **PaSGAL** with AVX-512 support (state-of-the-art SW)

(3) Sequence-to-Sequence (S2S) Alignment

- 1.2×/4.8× higher throughput than GenASM and GACT of Darwin for long reads (state-of-the-art HW)
- 1.3×/2.4× higher throughput than GenASM and SillaX of GenAX for short reads (state-of-the-art HW)


#### SeGraM Talk Video


[Video frame from "SeGraM: A Universal HW Accelerator for Genomic Sequence-to-Graph Mapping - Damla Senol Cali (ISCA)", showing the sequence-to-graph mapping pipeline]

Pre-Processing Steps (Offline), starting from the linear reference genome and the known genetic variations:

- Genome Graph Construction (construct the graph using a linear reference genome and variations); produces the genome graph
- Indexing (index the nodes of the graph); produces the hash-table-based index

Seed-and-Extend Steps (Online), starting from the reads of the sequenced genome:

1. Seeding (query the index & find the seed matches); produces: candidate mapping location
2. Filtering/Chaining/Clustering (filter out dissimilar query read and subgraph pairs); produces: remaining candidate mapping
3. S2G Alignment (perform distance/score calculation & traceback); produces the optimal alignment between read & subgraph

#### https://www.youtube.com/watch?v=gyjqYoyDP9s

### **Genome Graphs**

Genome graphs:

- Combine the linear reference genome with the known genetic variations in the entire population as a graph-based data structure
- Enable us to move away from aligning with a single linear reference genome (reference bias) and more accurately express the genetic diversity in a population

- Sequence #1: ACGTACGT
- Sequence #2: ACGGACGT
- Sequence #3: ACGTTACGT
- Sequence #4: ACGACGT
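One way to see how a single graph can represent all four sequences is to build it by hand and enumerate its paths. The node numbering and edge set below are assumptions chosen for illustration (the original figure is not recoverable here): a substitution becomes an alternative node, an insertion an extra node, and a deletion an edge that skips a node.

```python
# Hypothetical genome graph: one character per node, nodes in topological order.
GRAPH_CHARS = ["A", "C", "G", "T", "G", "T", "A", "C", "G", "T"]
SUCCESSORS = {
    0: [1], 1: [2],
    2: [3, 4, 6],   # reference T, substituted G, or skip (deletion)
    3: [5, 6],      # optional inserted T
    4: [6], 5: [6],
    6: [7], 7: [8], 8: [9], 9: [],
}

def spell_paths(node=0, prefix=""):
    """Return the set of strings spelled by all paths from `node` to a sink."""
    prefix += GRAPH_CHARS[node]
    if not SUCCESSORS[node]:
        return {prefix}
    spelled = set()
    for nxt in SUCCESSORS[node]:
        spelled |= spell_paths(nxt, prefix)
    return spelled

for sequence in sorted(spell_paths()):
    print(sequence)
```

The ten nodes encode all four sequences, whereas a linear reference would hold only one of them.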




## Sequence-to-Graph Mapping Pipeline



### S2S vs. S2G Alignment



Sequence-to-Sequence (S2S) Alignment



### S2S vs. S2G Alignment



Sequence-to-Graph (S2G) Alignment

In contrast to S2S alignment,

S2G alignment must incorporate non-neighboring characters as well whenever there is an edge (i.e., *hop*) from the non-neighboring character to the current character
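The difference between S2S and S2G alignment can be made concrete with a textbook dynamic-programming sketch. This is not SeGraM's BitAlign algorithm; it is a plain edit-distance recurrence in which each graph node takes the minimum over all of its predecessors (the hops), so a linear sequence is simply the special case where node `i` has the single predecessor `i - 1`. The function name and the graph encoding (`chars`, `preds`) are assumptions for illustration, and nodes are assumed to be listed in topological order.

```python
def s2g_edit_distance(read, chars, preds):
    """Minimum edit distance between `read` and any source-to-sink path.

    chars[v] is the character of node v; preds[v] lists the predecessors of v.
    Nodes must be in topological order, so every predecessor index is < v.
    """
    m = len(read)
    start = list(range(m + 1))          # read prefix vs. empty path
    rows = []
    has_successor = [False] * len(chars)
    for v, ch in enumerate(chars):
        # Source nodes (no predecessors) extend the empty path.
        prev_rows = [rows[u] for u in preds[v]] or [start]
        for u in preds[v]:
            has_successor[u] = True
        row = [min(p[0] for p in prev_rows) + 1]
        for j in range(1, m + 1):
            match = min(p[j - 1] for p in prev_rows) + (read[j - 1] != ch)
            skip_graph_char = min(p[j] for p in prev_rows) + 1
            skip_read_char = row[j - 1] + 1
            row.append(min(match, skip_graph_char, skip_read_char))
        rows.append(row)
    sinks = [v for v in range(len(chars)) if not has_successor[v]]
    return min(rows[v][m] for v in sinks)

# S2S: a linear reference, every node has exactly one neighboring predecessor.
linear = list("sitting")
linear_preds = [[]] + [[i] for i in range(len(linear) - 1)]
print(s2g_edit_distance("kitten", linear, linear_preds))

# S2G: node 6 has hops from nodes 2, 3, 4, and 5 (hypothetical graph).
graph = ["A", "C", "G", "T", "G", "T", "A", "C", "G", "T"]
graph_preds = [[], [0], [1], [2], [2], [3], [2, 3, 4, 5], [6], [7], [8]]
print(s2g_edit_distance("ACGCACGT", graph, graph_preds))
```

The only change from the S2S recurrence is the `min(...)` over `prev_rows`. These hops give the irregular access pattern that the slides contrast with S2S alignment.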


# Analysis of State-of-the-Art Tools

Based on our analysis with **GraphAligner** and **vg**:

**Observation 1:** Alignment step is the bottleneck

**Observation 2:** Alignment suffers from high cache miss rates

**Observation 3:** Seeding suffers from the DRAM latency bottleneck

**Observation 4:** Baseline tools scale sublinearly

**Observation 5:** Existing S2S mapping accelerators are unsuitable for the S2G mapping problem

**Observation 6:** Existing graph accelerators are unable to handle S2G alignment


#### SeGraM: Universal Genomic Mapping Accelerator

- First universal genomic mapping accelerator that can support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads
- First algorithm/hardware co-design for accelerating sequence-to-graph mapping
  - We base SeGraM upon a minimizer-based seeding algorithm
  - We propose a novel bitvector-based alignment algorithm to perform approximate string matching between a read and a graph-based reference genome
  - We co-design both algorithms with high-performance, scalable, and efficient hardware accelerators
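To make "minimizer-based seeding" concrete, here is a minimal software sketch. It is not the MinSeed hardware design: the function names, the use of lexicographic order to pick the smallest k-mer (real tools typically use a hash), and the linear reference string are assumptions chosen to keep the example short.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer in every
    window of w consecutive k-mers (ties broken by leftmost position)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(reference, k, w):
    """Offline step: hash-table index from minimizer to reference positions."""
    index = defaultdict(list)
    for pos, kmer in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


def find_seeds(read, index, k, w):
    """Online step: look up the read's minimizers to get candidate locations."""
    return [(read_pos, ref_pos)
            for read_pos, kmer in minimizers(read, k, w)
            for ref_pos in index.get(kmer, [])]


index = build_index("ACGTACGT", k=3, w=3)
print(find_seeds("TACGT", index, k=3, w=3))
```

Only the minimizers are stored and queried, so the index and the number of lookups are smaller than with all k-mers, at the cost of possibly missing some seed matches.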


# SeGraM Hardware Design




# **Overall System Design of SeGraM**



#### Use Cases of SeGraM

- (1) Sequence-to-Graph Mapping
- (2) Sequence-to-Graph Alignment
- (3) Sequence-to-Sequence Alignment
- (4) Seeding









#### Key Results – Area & Power

Based on our synthesis of MinSeed and BitAlign accelerator datapaths using the Synopsys Design Compiler with a 28nm process (@ 1GHz):

| Component                                                                    | Area (mm<sup>2</sup>)   | Power (mW)    |
|------------------------------------------------------------------------------|-------------------------|---------------|
| MinSeed – Logic                                                              | 0.017                   | 10.8          |
| Read Scratchpad (6 kB)                                                       | 0.012                   | 7.9           |
| Minimizer Scratchpad (40 kB)                                                 | 0.055                   | 22.7          |
| Seed Scratchpad (4 kB)                                                       | 0.008                   | 6.4           |
| BitAlign – Edit Distance Calculation Logic with Hop Queue Registers (64 PEs) | 0.393                   | 378.0         |
| BitAlign – Traceback Logic                                                   | 0.020                   | 2.7           |
| Input Scratchpad (24 kB)                                                     | 0.033                   | 13.3          |
| Bitvector Scratchpads (128 kB)                                               | 0.329                   | 316.2         |
| Total – 1 SeGraM Accelerator                                                 | 0.867                   | 758.0 (0.8 W) |
| Total – 4 SeGraM Modules (32 SeGraM Accelerators)                            | 27.744                  | 24.3 W        |
| HBM2E (4 stacks)                                                             |                         | 3.8 W         |
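As a quick consistency check, the totals in the table can be recomputed from the per-component rows. The snippet below only re-adds the numbers listed above; the dictionary keys are shortened labels chosen for this sketch.

```python
# (area in mm^2, power in mW) per component, copied from the table above.
components = {
    "MinSeed logic": (0.017, 10.8),
    "Read scratchpad": (0.012, 7.9),
    "Minimizer scratchpad": (0.055, 22.7),
    "Seed scratchpad": (0.008, 6.4),
    "BitAlign edit distance logic": (0.393, 378.0),
    "BitAlign traceback logic": (0.020, 2.7),
    "Input scratchpad": (0.033, 13.3),
    "Bitvector scratchpads": (0.329, 316.2),
}
NUM_ACCELERATORS = 32  # 4 SeGraM modules, per the table

area_one = sum(area for area, _ in components.values())
power_one_mw = sum(power for _, power in components.values())
area_all = NUM_ACCELERATORS * area_one
power_all_w = NUM_ACCELERATORS * power_one_mw / 1000.0

print(f"1 accelerator:   {area_one:.3f} mm^2, {power_one_mw:.1f} mW")
print(f"32 accelerators: {area_all:.3f} mm^2, {power_all_w:.1f} W")
```

The recomputed values agree with the two "Total" rows; the HBM2E power is listed separately and is not included in these sums.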

### Key Results – SeGraM with Long Reads



SeGraM provides **5.9× and 3.9× throughput improvement** over GraphAligner and vg, respectively, while reducing the power consumption by 4.1× and 4.4×

#### Key Results – SeGraM with Short Reads



SeGraM provides **106× and 742× throughput improvement** over GraphAligner and vg, respectively, while reducing the power consumption by 3.0× and 3.2×

### Key Results – BitAlign (S2G Alignment)



BitAlign provides 41×-539× speedup over PaSGAL


# Key Results – BitAlign (S2S Alignment)

BitAlign can also be used for sequence-to-sequence alignment

- The cost of more functionality: extra hop queue registers
- We do not sacrifice any performance
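For readers unfamiliar with bitvector-based approximate string matching on linear sequences, the classic Bitap (Wu-Manber) formulation is sketched below. This is not BitAlign: it has no hop queue registers, no graph support, and no traceback, and the function name `bitap_search` is an assumption. It only illustrates how edit-distance states can be tracked with shifts and bitwise operations on one bitvector per allowed error count.

```python
def bitap_search(text, pattern, k):
    """Return end positions in `text` where `pattern` matches with <= k edits."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1
    # masks[c] has bit j set if pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    # state[d] bit j set: pattern[:j + 1] matches a text suffix with <= d edits.
    state = [((1 << d) - 1) & full for d in range(k + 1)]
    ends = []
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        old = state[:]
        state[0] = ((old[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            match = ((old[d] << 1) | 1) & mask
            extra_text_char = old[d - 1]              # text char not in pattern
            substitution = (old[d - 1] << 1) | 1
            skipped_pattern_char = (state[d - 1] << 1) | 1
            state[d] = (match | extra_text_char | substitution
                        | skipped_pattern_char) & full
        if state[k] & (1 << (m - 1)):
            ends.append(pos)
    return ends


print(bitap_search("xxabcdexx", "abcde", 0))
print(bitap_search("xxabXdexx", "abcde", 1))
```

Each text character updates all k + 1 bitvectors with a constant number of word-level operations, which is what makes this family of algorithms attractive for hardware.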

**For long reads (over GACT of Darwin and GenASM):** 

- 4.8× and 1.2× throughput improvement,
- 2.7× and 7.5× higher power consumption, and
- **1.5× and 2.6×** higher area overhead

**For short reads (over SillaX of GenAx and GenASM):**

- 2.4× and 1.3× throughput improvement

# Conclusion

SeGraM: First universal algorithm/hardware co-designed genomic mapping accelerator that supports:

- Sequence-to-graph (S2G) & sequence-to-sequence (S2S) mapping
- Short & long reads
- **MinSeed:** First minimizer-based seeding accelerator
- **BitAlign:** *First (bitvector-based)* S2G alignment accelerator
- SeGraM **supports multiple use cases**:
  - End-to-end S2G mapping
  - S2G alignment
  - S2S alignment
  - Seeding

- SeGraM outperforms state-of-the-art software & hardware solutions

### SeGraM Talk Video

[Video thumbnail: "SeGraM: A Universal HW Accelerator for Genomic Sequence-to-Graph Mapping", Damla Senol Cali (ISCA). The frame shows the sequence-to-graph mapping pipeline: offline pre-processing steps (genome graph construction, indexing) followed by online seed-and-extend steps (seeding, filtering/chaining/clustering, S2G alignment).]

#### https://www.youtube.com/watch?v=gyjqYoyDP9s

# Year III Results (2022 Annual Review 2)

- SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping [ISCA 2022]
- GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis [ASPLOS 2022]
- Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm [HICOMB 2022]
- Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design [ICDE 2022]
- Flash-Cosmos: In-Flash Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory [MICRO 2022]

## In-Storage Genomic Data Filtering [ASPLOS 2022]

Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu,
 "GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis"
 Proceedings of the <u>27th International Conference on Architectural Support for Programming Languages and Operating Systems</u> (ASPLOS), Virtual, February-March 2022.

[Lightning Talk Slides (pptx) (pdf)] [Lightning Talk Video (90 seconds)]

### GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis

Nika Mansouri Ghiasi<sup>1</sup> Jisung Park<sup>1</sup> Harun Mustafa<sup>1</sup> Jeremie Kim<sup>1</sup> Ataberk Olgun<sup>1</sup> Arvid Gollwitzer<sup>1</sup> Damla Senol Cali<sup>2</sup> Can Firtina<sup>1</sup> Haiyu Mao<sup>1</sup> Nour Almadhoun Alserr<sup>1</sup> Rachata Ausavarungnirun<sup>3</sup> Nandita Vijaykumar<sup>4</sup> Mohammed Alser<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Bionano Genomics <sup>3</sup>KMUTNB <sup>4</sup>University of Toronto







## **Accelerating Genome Sequence Analysis**



Data movement overhead





*Filter* reads that do *not* require alignment *inside the storage system* 



- **Exactly-matching reads**: do not need expensive approximate string matching during alignment
- **Non-matching reads**: do not have potential matching locations and can skip alignment
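The two filtering cases can be illustrated with a small host-side sketch. This is not GenStore's in-storage mechanism, which the slides do not detail: the set-based lookups, the fixed read length, the k-mer seed test, and all names below are assumptions made only to show which reads would be filtered out before alignment.

```python
def build_filters(reference, read_len, k):
    """Precompute (hypothetical) lookup sets from the reference."""
    exact = {reference[i:i + read_len]
             for i in range(len(reference) - read_len + 1)}
    kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    return exact, kmers


def classify_read(read, exact, kmers, k):
    """Decide whether a read still needs approximate string matching."""
    if read in exact:
        return "exact-match"        # no approximate matching needed
    if not any(read[i:i + k] in kmers for i in range(len(read) - k + 1)):
        return "non-matching"       # no candidate location, skip alignment
    return "needs-alignment"        # only these reads leave the filter


exact, kmers = build_filters("ACGTACGTTGCA", read_len=6, k=3)
for read in ("ACGTAC", "GGGGGG", "ACGTTT"):
    print(read, classify_read(read, exact, kmers, k=3))
```

Only reads classified as `needs-alignment` would have to be moved to the host for the expensive alignment step, which is the source of the data-movement savings.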

## Challenges

*Filter* reads that do *not* require alignment *inside the storage system* 



Read mapping workloads can exhibit different behavior

There are limited hardware resources in the storage system









### GenStore Talk Video




#### https://www.youtube.com/watch?v=bv7hgXOOMjk

# Year III Results (2022 Annual Review 2)

- SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping [ISCA 2022]
- GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis [ASPLOS 2022]
- Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm [HICOMB 2022]
- Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design [ICDE 2022]
- Flash-Cosmos: In-Flash Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory [MICRO 2022]

## Accelerating HTAP Database Systems

Amirali Boroumand, Saugata Ghose, Geraldo F. Oliveira, and Onur Mutlu, "Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design" Proceedings of the <u>38th International Conference on Data Engineering</u> (ICDE), Virtual, May 2022. [arXiv version] [Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)]

#### **Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases** with Hardware/Software Co-Design

Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>◊</sup> Geraldo F. Oliveira<sup>‡</sup> Onur Mutlu<sup>‡</sup>

<sup>†</sup>Google <sup>◊</sup>Univ. of Illinois Urbana-Champaign <sup>‡</sup>ETH Zürich

#### https://arxiv.org/pdf/2204.11275.pdf

## **Polynesia:**

Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design

Amirali Boroumand, Saugata Ghose, Geraldo F. Oliveira, Onur Mutlu

ICDE 2022



# **Executive Summary**

- <u>Context</u>: Many applications need to perform real-time data analysis using a <u>Hybrid Transactional/Analytical Processing (HTAP)</u> system
  - An ideal HTAP system should have three properties:
    - (1) data freshness and consistency, (2) workload-specific optimization,
    - (3) performance isolation
- <u>Problem</u>: Prior works cannot achieve all properties of an ideal HTAP system
- <u>Key Idea</u>: Divide the system into transactional and analytical processing islands
  - Enables workload-specific optimizations and performance isolation
- <u>Key Mechanism</u>: Polynesia, a novel hardware/software cooperative design for in-memory HTAP databases
  - Implements custom algorithms and hardware to reduce the costs of data freshness and consistency
  - Exploits **PIM** for analytical processing to alleviate data movement
- <u>Key Results</u>: Polynesia outperforms three state-of-the-art HTAP systems
  - Average transactional/analytical throughput improvements of 1.7x/3.7x
  - 48% reduction in energy consumption

## Polynesia Talk Video (I)

https://www.youtube.com/watch?v=geukNs5XI3g&t=5897s

## Polynesia Talk Video (II)

https://www.youtube.com/watch?v=1HkXy3g6FF4

# **Real-Time Analysis**

Increasing interest in many application domains in performing data analytics on the most recent version of data (real-time analysis)



For these applications, it is critical to analyze transactions in real time, as the data's value diminishes over time


# **HTAP: Supporting Real-Time Analysis**

Traditionally, new transactions (updates) are propagated to the analytical database using a periodic and costly process



To support real-time analysis, a single hybrid DBMS is used to execute both transactional and analytical workloads

# Ideal HTAP System Properties

An ideal HTAP system should have three properties:

### **1** Workload-Specific Optimizations

- Transactional and analytical workloads must benefit from their own specific optimizations

### **2** Data Freshness and Consistency Guarantees

- Guarantee access to the most recent version of data for analytics while ensuring that transactional and analytical workloads have a consistent view of data

### **3** Performance Isolation

- Latency and throughput of transactional and analytical workloads are the same as if they were run in isolation

Achieving all three properties at the same time is very challenging

# **Problem and Goal**

#### **Problems:**

- **1** State-of-the-art HTAP systems do not achieve all of the desired HTAP properties
- **2** Data freshness and consistency mechanisms are data-intensive and cause a drastic reduction in throughput
- **3** These systems fail to provide performance isolation because of high resource contention

### Goal:

Take advantage of custom algorithms and processing-in-memory (PIM) to address these challenges

# Polynesia

Key idea: partition computing resources into two types of isolated and specialized processing islands

Isolating transactional islands from analytical islands allows us to:

- **1** Apply workload-specific optimizations to each island
- **2** Avoid high resource contention
- **3** Design efficient data freshness and consistency mechanisms without incurring high data movement costs
  - Leverage processing-in-memory (PIM) to reduce data movement
  - PIM mitigates data movement overheads by placing computation units nearby or inside memory

# **Polynesia: High-Level Overview**

Each island includes (1) a replica of data, (2) an optimized execution engine, and (3) a set of hardware resources
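The island structure can be pictured with a toy software model. This is a conceptual sketch only, not Polynesia's hardware/software design: the class names, the row-oriented versus column-oriented layouts, and the explicit update log that is drained and applied are assumptions made for illustration of "separate replicas kept fresh by update propagation".

```python
from dataclasses import dataclass, field


@dataclass
class TransactionalIsland:
    """Holds its own replica (row-oriented here) and logs committed updates."""
    rows: dict = field(default_factory=dict)        # key -> {column: value}
    update_log: list = field(default_factory=list)

    def commit(self, key, row):
        self.rows[key] = dict(row)
        self.update_log.append((key, dict(row)))

    def drain_log(self):
        log, self.update_log = self.update_log, []
        return log


@dataclass
class AnalyticalIsland:
    """Holds its own replica (column-oriented here) for scan-heavy queries."""
    columns: dict = field(default_factory=dict)     # column -> {key: value}

    def apply_updates(self, log):
        # Update propagation: bring the analytical replica up to date.
        for key, row in log:
            for column, value in row.items():
                self.columns.setdefault(column, {})[key] = value

    def column_sum(self, column):
        return sum(self.columns.get(column, {}).values())


txn, ana = TransactionalIsland(), AnalyticalIsland()
txn.commit(1, {"amount": 10})
txn.commit(2, {"amount": 5})
txn.commit(1, {"amount": 7})          # later update to the same key
ana.apply_updates(txn.drain_log())
print(ana.column_sum("amount"))
```

Because each island works on its own replica, transactional commits and analytical scans do not contend for the same data structures; the cost moves into update propagation, which is the part the slides say Polynesia accelerates with custom logic.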



# **Key Results**

Polynesia achieves 91.6% of the transactional throughput of an ideal system by employing custom PIM logic for data freshness/consistency, which significantly reduces resource contention and data movement

Polynesia improves analytical throughput by 63.8% over an optimized multiple-instance system by eliminating data movement and using custom logic for update propagation and consistency

Overall, Polynesia achieves all three properties of an ideal HTAP system and has higher transactional/analytical throughput (1.7x/3.74x) than prior HTAP systems

# Conclusion

- <u>Context</u>: Many applications need to perform real-time data analysis using a <u>Hybrid Transactional/Analytical Processing (HTAP)</u> system
  - An ideal HTAP system should have three properties:
    - (1) data freshness and consistency, (2) workload-specific optimization,
    - (3) performance isolation
- <u>Problem</u>: Prior works cannot achieve all properties of an ideal HTAP system
- <u>Key Idea</u>: Divide the system into transactional and analytical processing islands
  - Enables workload-specific optimizations and performance isolation
- <u>Key Mechanism</u>: Polynesia, a novel hardware/software cooperative design for in-memory HTAP databases
  - Implements custom algorithms and hardware to reduce the costs of data freshness and consistency
  - Exploits **PIM** for analytical processing to alleviate data movement
- <u>Key Results</u>: Polynesia outperforms three state-of-the-art HTAP systems
  - Average transactional/analytical throughput improvements of 1.7x/3.7x
  - 48% reduction in energy consumption

# More in the Paper

#### **Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases** with Hardware/Software Co-Design

Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>◊</sup> Geraldo F. Oliveira<sup>‡</sup> Onur Mutlu<sup>‡</sup>

<sup>†</sup>Google <sup>◊</sup>Univ. of Illinois Urbana-Champaign <sup>‡</sup>ETH Zürich

## Year III Results (2022 Annual Review 3)

- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
- GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping [MICRO 2022]
- pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables [MICRO 2022]
- DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression [FAST 2022]
- A Modern Primer on Processing in Memory [arXiv, Updated 2022]

### Sibyl: Self-Optimizing Hybrid Storage Systems

 Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gomez-Luna, Sander Stuijk, Henk Corporaal, and Onur Mutlu, "Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning" Proceedings of the <u>49th International Symposium on Computer Architecture</u> (ISCA), New York, June 2022.
 [Slides (pptx) (pdf)]
 [arXiv version]
 [Sibyl Source Code]
 [Talk Video (16 minutes)]

#### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

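Gagandeep Singh<sup>1</sup> Rakesh Nadig<sup>1</sup> Jisung Park<sup>1</sup> Rahul Bera<sup>1</sup> Nastaran Hajinazar<sup>1</sup> David Novo<sup>3</sup> Juan Gómez-Luna<sup>1</sup> Sander Stuijk<sup>2</sup> Henk Corporaal<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Eindhoven University of Technology <sup>3</sup>LIRMM, Univ. Montpellier, CNRS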

#### https://arxiv.org/pdf/2205.07394.pdf



# Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

<u>Gagandeep Singh</u>, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu





# Sibyl Talk Video [ISCA'22]

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gomez-Luna, Sander Stuijk, Henk Corporaal, and Onur Mutlu, "Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online **Reinforcement Learning**" ISCA, New York, June 2022. [Sibyl Source Code]






https://www.youtube.com/watch?v=5-WedkiB000

# **Executive Summary**

- **Background:** A hybrid storage system (HSS) uses multiple different storage devices to provide high and scalable storage capacity at high performance
- **Problem**: Two key shortcomings of prior data placement policies:
  - Lack of adaptivity to:
    - Workload changes
    - Changes in device types and configurations
  - Lack of extensibility to more devices
- **Goal**: Design a data placement technique that provides:
  - Adaptivity, by continuously learning and adapting to the application and underlying device characteristics
  - *Easy extensibility* to incorporate a wide range of hybrid storage configurations
- **Contribution**: Sibyl, the first reinforcement learning-based data placement technique in hybrid storage systems that:
  - Provides adaptivity to changing workload demands and underlying device characteristics
  - Can easily extend to any number of storage devices
  - Provides ease of design and implementation that requires only a small computation overhead
- **Key Results**: Evaluate on real systems using a wide range of workloads
  - Sibyl improves performance by 21.6% compared to the best previous data placement technique in a dual-HSS configuration
  - In a tri-HSS configuration, Sibyl outperforms the state-of-the-art policy by 48.2%
  - Sibyl achieves 80% of the performance of an oracle policy with a storage overhead of only 124.4 KiB

# **Talk Outline**

**Key Shortcomings of Prior Data Placement Techniques** 

### Formulating Data Placement as Reinforcement Learning

Sibyl: Overview

**Evaluation of Sibyl and Key Results** 

Conclusion



# **Hybrid Storage System Basics**

### **Address Space (Application/File System View)**



# **Hybrid Storage System Basics**

Logical Address Space (Application/File System View)



Hybrid Storage System



## Key Shortcomings in Prior Techniques

We observe two key shortcomings that significantly limit the performance benefits of prior techniques

1. Lack of **adaptivity** to:
   - a) Workload changes
   - b) Changes in device types and configuration
2. Lack of **extensibility** to more devices



### **Our Goal**

A data-placement mechanism that can provide:

1. **Adaptivity**, by continuously learning and adapting to the application and underlying device characteristics
2. **Easy extensibility** to incorporate a wide range of hybrid storage configurations



### **Our Proposal**



### **Sibyl** formulates data placement in hybrid storage systems as a **reinforcement learning problem**



Sibyl is an oracle that makes accurate prophecies https://en.wikipedia.org/wiki/Sibyl

### **Talk Outline**

**Key Shortcomings of Prior Data Placement Techniques** 

### Formulating Data Placement as Reinforcement Learning

Sibyl: Overview

#### **Evaluation of Sibyl and Key Results**

Conclusion



### Basics of Reinforcement Learning (RL)



Environment

Agent learns to take an action in a given state to maximize a numerical reward



### **Formulating Data Placement as RL**



### What is State?

- Limited number of state features:
  - Reduce the implementation overhead
  - RL agent is more sensitive to reward
- 6-dimensional vector of state features

 $O_t = (size_t, type_t, intr_t, cnt_t, cap_t, curr_t)$ 

- We **quantize the state representation** into bins to reduce storage overhead
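A minimal sketch of what quantizing the state into bins could look like is given below. The slide does not define the six features or the bin boundaries, so everything here is an assumption: the reading of the names (request size, request type, access interval, access count, remaining fast-device capacity, current placement), the bin edges in `BIN_EDGES`, and the function names.

```python
from bisect import bisect_right

# Hypothetical bin edges; the real boundaries are not given on the slide.
BIN_EDGES = {
    "size": [8, 64, 512],        # request size
    "intr": [10, 100, 1000],     # accesses since the page was last touched
    "cnt": [1, 4, 16],           # access count of the page
    "cap": [0.25, 0.5, 0.75],    # fraction of fast-device capacity remaining
}


def quantize(value, edges):
    """Map a raw value to the index of the bin it falls into."""
    return bisect_right(edges, value)


def make_state(size, req_type, intr, cnt, cap, curr):
    """Build the 6-dimensional quantized state tuple.

    req_type (e.g., 0 = read, 1 = write) and curr (current device of the
    page) are already categorical, so they are passed through unchanged.
    """
    return (
        quantize(size, BIN_EDGES["size"]),
        req_type,
        quantize(intr, BIN_EDGES["intr"]),
        quantize(cnt, BIN_EDGES["cnt"]),
        quantize(cap, BIN_EDGES["cap"]),
        curr,
    )


print(make_state(size=16, req_type=1, intr=5, cnt=20, cap=0.6, curr=0))
```

With four bins per continuous feature, each feature needs only two bits, which is how binning keeps the per-page metadata small.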



## What is Reward?

- Defines the **objective** of Sibyl



- We formulate the reward as a function of the request latency
- Encapsulates three key aspects:
  - Internal state of the device (e.g., read/write latencies, the latency of garbage collection, queuing delays, ...)
  - Throughput
  - Evictions
- More details in the paper
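One simple way to turn request latency into a reward is sketched below. The exact reward function is in the paper and is not reproduced on the slide, so the inverse-latency form, the eviction penalty term, the `penalty_scale` constant, and the function name are assumptions used only to show the idea that lower latency and fewer evictions should yield a higher reward.

```python
def placement_reward(latency_us, evicted=False, eviction_latency_us=0.0,
                     penalty_scale=0.001):
    """Hypothetical reward: higher for lower served-request latency,
    reduced when the placement decision triggered an eviction."""
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    base = 1.0 / latency_us
    if not evicted:
        return base
    return max(0.0, base - penalty_scale * eviction_latency_us)


print(placement_reward(10.0))
print(placement_reward(10.0, evicted=True, eviction_latency_us=50.0))
print(placement_reward(10.0, evicted=True, eviction_latency_us=500.0))
```

Because the measured latency already includes queuing and garbage-collection delays, a latency-based reward implicitly reflects the internal state of the device without modeling it explicitly.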

## What is Action?

- At every new page request, the action is to select a storage device
- Action can be easily extended to any number of storage devices
- Sibyl learns to proactively evict or promote a page
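The state/action/reward loop can be sketched with a tabular Q-learning agent. This is a deliberately simplified stand-in, not Sibyl's learning algorithm or network: the class name `PlacementAgent`, the epsilon-greedy policy, and the hyperparameter values are assumptions. It does show why extending to more devices is cheap, since the number of actions is just a constructor argument.

```python
import random
from collections import defaultdict


class PlacementAgent:
    """Tabular Q-learning over (quantized state, device) pairs."""

    def __init__(self, num_devices, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.num_devices = num_devices
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.q = defaultdict(lambda: [0.0] * num_devices)
        self.rng = random.Random(seed)

    def select_device(self, state):
        """Action: choose the storage device for the requested page."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_devices)   # explore
        values = self.q[state]
        return max(range(self.num_devices), key=values.__getitem__)

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update after the request has been served."""
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


# A tri-hybrid system only changes num_devices from 2 to 3.
agent = PlacementAgent(num_devices=3, epsilon=0.0)
state = (1, 1, 0, 3, 2, 0)
agent.update(state, action=2, reward=1.0, next_state=state)
print(agent.select_device(state))
```

In a real system the agent would call `select_device` on every page request and `update` once the request latency (and hence the reward) is known.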





### **Talk Outline**

**Key Shortcomings of Prior Data Placement Techniques** 

### Formulating Data Placement as Reinforcement Learning

Sibyl: Overview

### **Evaluation of Sibyl and Key Results**

#### Conclusion



### **Sibyl Execution**



### **Sibyl Design: Overview**



### **Sibyl Design: Overview**



### **Talk Outline**

**Key Shortcomings of Prior Data Placement Techniques** 

#### Formulating Data Placement as Reinforcement Learning

Sibyl: Overview

#### **Evaluation of Sibyl and Key Results**

#### Conclusion



## **Evaluation Methodology (1/3)**

#### Real system with various HSS configurations

- Dual-hybrid and tri-hybrid systems



## **Evaluation Methodology (2/3)**

### **Cost-Oriented HSS Configuration**



High-end SSD

Low-end HDD

### **Performance-Oriented HSS Configuration**



## **Evaluation Methodology (3/3)**

- 18 different workloads from:
  - MSR Cambridge and Filebench suites
- Four state-of-the-art data placement baselines:





### **Cost-Oriented HSS Configuration**





### **Cost-Oriented HSS Configuration**



Sibyl consistently outperforms all the baselines for all the workloads





### **Performance-Oriented HSS Configuration**





### **Performance-Oriented HSS Configuration**



Sibyl provides 21.6% performance improvement by dynamically adapting its data placement policy

## **Performance on Tri-HSS**



#### Extending Sibyl for more devices:

- 1. Add a new action
- 2. Add the remaining capacity of the new device as a state feature



Sibyl outperforms the state-of-the-art data placement policy by 48.2% in a real tri-hybrid system

Sibyl reduces the system architect's burden by providing ease of extensibility

## Sibyl's Overhead

- 124.4 KiB of total storage cost
  - Experience buffer, inference and training network
- 40-bit metadata overhead per page for state features
- Inference latency of ~10 ns
- Training latency of ~2 µs





### More in the Paper

#### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

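Gagandeep Singh<sup>1</sup> Rakesh Nadig<sup>1</sup> Jisung Park<sup>1</sup> Rahul Bera<sup>1</sup> Nastaran Hajinazar<sup>1</sup> David Novo<sup>3</sup> Juan Gómez-Luna<sup>1</sup> Sander Stuijk<sup>2</sup> Henk Corporaal<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Eindhoven University of Technology <sup>3</sup>LIRMM, Univ. Montpellier, CNRS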

https://arxiv.org/pdf/2205.07394.pdf

https://github.com/CMU-SAFARI/Sibyl

https://www.youtube.com/watch?v=5-WedkiB000



### **Talk Outline**

**Key Shortcomings of Prior Data Placement Techniques** 

### Formulating Data Placement as Reinforcement Learning

Sibyl: Overview

#### **Evaluation of Sibyl and Key Results**

#### Conclusion





## Conclusion

- We introduced Sibyl, the first reinforcement learning-based data placement technique in hybrid storage systems that provides
  - Adaptivity
  - Easy extensibility
  - Ease of design and implementation
- We evaluated Sibyl on real systems using many different workloads
  - Sibyl **improves performance by 21.6%** compared to the best prior data placement policy in a dual-HSS configuration
  - In a tri-HSS configuration, Sibyl **outperforms** the state-of-the-art data placement policy by **48.2%**
  - Sibyl achieves 80% of the performance of an oracle policy with a storage overhead of only 124.4 KiB

### **Sibyl is Open-Source**

#### https://github.com/CMU-SAFARI/Sibyl






### Please Check Out Our Poster!





## Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

<u>Gagandeep Singh</u>, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu





### Year III Results (2022 Annual Review 3)

- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
- GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping [MICRO 2022]
- pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables [MICRO 2022]
- DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression [FAST 2022]
- A Modern Primer on Processing in Memory [arXiv, updated 2022]

### Hermes

#### To Appear in MICRO 2022



Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

License: MIT | Release: v1.0.1 | DOI: 10.5281/zenodo.6909799


#### What is Hermes?

Hermes is a speculative mechanism that accelerates long-latency off-chip load requests by removing on-chip cache access latency from their critical path.

The key idea behind Hermes is to: (1) accurately predict which load requests are likely to go off-chip, and (2) speculatively start fetching the data required by the predicted off-chip loads directly from main memory, in parallel with the cache accesses. Hermes proposes a lightweight, perceptron-based off-chip predictor that identifies off-chip load requests using multiple disparate program features. The predictor is implemented using only tables and simple arithmetic operations like increment and decrement.
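The table-and-counter structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Hermes implementation: the two features (a load PC and a cacheline offset), the table size, the weight range, and both thresholds are hypothetical values chosen for readability.

```python
class OffChipPredictor:
    """Hashed-perceptron-style predictor built from tables of small counters."""

    def __init__(self, num_features=2, table_size=1024,
                 weight_min=-16, weight_max=15,
                 activation_threshold=2, training_threshold=8):
        # One weight table per program feature (all sizes are assumed values).
        self.tables = [[0] * table_size for _ in range(num_features)]
        self.table_size = table_size
        self.weight_min = weight_min
        self.weight_max = weight_max
        self.activation_threshold = activation_threshold
        self.training_threshold = training_threshold

    def _weight_sum(self, features):
        # Each feature value indexes its own table; the weights are summed.
        return sum(table[f % self.table_size]
                   for table, f in zip(self.tables, features))

    def predict(self, features):
        """Return True if the load is predicted to go off-chip."""
        return self._weight_sum(features) >= self.activation_threshold

    def train(self, features, went_off_chip):
        """Update the weights once the true outcome of the load is known."""
        total = self._weight_sum(features)
        predicted = total >= self.activation_threshold
        # Train on a misprediction or while the sum is not yet confident.
        if predicted != went_off_chip or abs(total) < self.training_threshold:
            step = 1 if went_off_chip else -1
            for table, f in zip(self.tables, features):
                i = f % self.table_size
                # Saturating increment/decrement keeps weights in range.
                table[i] = max(self.weight_min,
                               min(self.weight_max, table[i] + step))


predictor = OffChipPredictor()
missing_load = (0x401, 3)   # hypothetical (PC, cacheline offset) pair
hitting_load = (0x802, 1)   # hypothetical (PC, cacheline offset) pair
for _ in range(10):
    predictor.train(missing_load, went_off_chip=True)
    predictor.train(hitting_load, went_off_chip=False)
```

After training, `predictor.predict(missing_load)` flags the first load as off-chip while `predictor.predict(hitting_load)` does not. Prediction needs only table reads and an addition, and training needs only increments and decrements, which is what keeps this style of predictor cheap in hardware.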


#### https://github.com/CMU-SAFARI/Hermes



# Accelerating Long-Latency Load Requests via Perceptron-based Off-chip Load Prediction

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu







### Long-latency off-chip requests significantly limit performance of a processor





### Increase size of on-chip caches



Nearly 50% of the off-chip requests in a no-prefetching system still go to main memory even in the presence of a state-of-the-art prefetcher





37.5% of the stall cycles caused by an off-chip load can be reduced by removing on-chip cache access latency from its critical path



- **Predicts** which load requests might go off-chip using multiple program features
- Starts fetching data **directly** from main memory while concurrently accessing the cache hierarchy
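A first-order latency model shows why overlapping the two accesses helps. This is only a sketch: the cycle counts below are hypothetical and are not measurements from the Hermes evaluation, and the model ignores queuing, bandwidth contention, and mispredictions.

```python
def off_chip_load_latency(cache_lookup_cycles, memory_cycles, predicted_off_chip):
    """Latency of a load that ultimately misses in every on-chip cache."""
    if predicted_off_chip:
        # The memory fetch starts alongside the cache lookup, so the two overlap.
        return max(cache_lookup_cycles, memory_cycles)
    # Without a prediction, memory is accessed only after the caches miss.
    return cache_lookup_cycles + memory_cycles


CACHE_LOOKUP_CYCLES = 40   # assumed time to traverse the on-chip cache hierarchy
MEMORY_CYCLES = 120        # assumed main memory access time

baseline = off_chip_load_latency(CACHE_LOOKUP_CYCLES, MEMORY_CYCLES,
                                 predicted_off_chip=False)
overlapped = off_chip_load_latency(CACHE_LOOKUP_CYCLES, MEMORY_CYCLES,
                                   predicted_off_chip=True)
saved_fraction = (baseline - overlapped) / baseline
```

With these assumed numbers, the serialized access takes 160 cycles and the overlapped access takes 120, so a quarter of the load latency is removed from the critical path.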

# **Hermes: Overview**





### **Perceptron-based Off-chip Predictor**



- We evaluate Hermes using a wide range of workloads
- Hermes improves performance over the baseline with the state-of-the-art prefetcher by
  - 5.4% in single-core
  - 5.1% in eight-core
  - 6.2% in memory bandwidth-constrained core
- **Consistent performance improvement in a wide range of configurations** with varying prefetchers and cache access latency
  - 5.1%, 6.2%, 7.7% performance improvement in single-core with SPP, Bingo, SMS prefetchers
- **Realistic**, practical implementation
  - Only 5.1 KB storage and 1.5% power overhead of a desktop-class processor

# **Hermes is Open Source**

### https://github.com/CMU-SAFARI/Hermes

- All 3 badges from MICRO'22 artifact evaluation
- ChampSim and McPAT source code
- All traces & scripts used for evaluation





### Year III Results (2022 Annual Review 3)

- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
- GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping [MICRO 2022]

- pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables [MICRO 2022]
- DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression [FAST 2022]
- A Modern Primer on Processing in Memory [arXiv, updated 2022]

### To Appear in MICRO 2022

### pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables

João Dinis Ferreira<sup>§</sup> Gabriel Falcao<sup>†</sup> Juan Gómez-Luna<sup>§</sup> Mohammed Alser<sup>§</sup> Lois Orosa<sup>§</sup> Mohammad Sadrosadati<sup>§</sup> Jeremie S. Kim<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Taha Shahroodi<sup>§‡</sup> Anant Nori<sup>\*</sup> Onur Mutlu<sup>§</sup> <sup>§</sup>ETH Zürich <sup>†</sup>Instituto de Telecomunicações, University of Coimbra <sup>‡</sup>TU Delft <sup>\*</sup>Intel Corporation

#### https://arxiv.org/pdf/2104.07699.pdf

pLUTo: In-DRAM Lookup Tables to Enable General-Purpose Massively Parallel Computations

João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Geraldo F. Oliveira, Jeremie S. Kim, Mohammad Sadrosadati, Lois Orosa, Taha Shahroodi, Anant Nori, Onur Mutlu





August 2022

#### **Executive Summary**

- **Problem.** Many workloads require significant data movement. Existing Processing-using-Memory solutions mitigate this data movement but lack support for complex operations.
- **Key Idea.** LUTs enable general-purpose computation: perform LUT-based computation inside memory subarrays to perform complex operations.
- **Mechanism Overview.** With the LUT query operation, the elements in a source memory row are queried simultaneously in a LUT. In this way, it is possible to perform bulk LUT queries in-DRAM.
- **Key Contributions.**
  - Introduce support for **bulk in-memory LUT querying** for general-purpose in-memory computing.
  - Three implementations of pLUTo with varying area/performance/efficiency trade-offs.
- **Key Results.**
  - **Compared to CPU:** up to 33x faster and 110x more energy-efficient.
  - **Compared to GPU:** up to 8x faster and 80x more energy-efficient.
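The LUT query operation in the summary above can be modeled functionally as follows. This is a software sketch of the semantics only, with no modeling of DRAM subarrays or timing; the helper names `build_lut` and `bulk_lut_query` and the 4-bit popcount example are illustrative assumptions, not part of pLUTo.

```python
def build_lut(func, input_bits):
    """Precompute func for every possible input of the given bit width."""
    return [func(x) for x in range(1 << input_bits)]


def bulk_lut_query(source_row, lut):
    """Look up every element of a source row in the same LUT.

    In hardware the lookups happen simultaneously across the row; this
    functional model only reproduces the result.
    """
    for element in source_row:
        if not 0 <= element < len(lut):
            raise ValueError(f"element {element} is outside the LUT range")
    return [lut[element] for element in source_row]


# Example: a 16-entry LUT that maps a 4-bit value to its number of set bits.
popcount_lut = build_lut(lambda x: bin(x).count("1"), input_bits=4)
source_row = [0b0000, 0b0101, 0b1111, 0b1000]
result = bulk_lut_query(source_row, popcount_lut)
```

Here `result` is `[0, 2, 4, 1]`. Any function over a small input domain can be swapped in for the popcount lambda, which is the sense in which LUTs enable general-purpose computation.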

### Year III Results (2022 Annual Review 3)

- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
- GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping [MICRO 2022]
- pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables [MICRO 2022]
- DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression [FAST 2022]
- A Modern Primer on Processing in Memory [arXiv, updated 2022]

### DeepSketch

Jisung Park, Jeonggyun Kim, Yeseong Kim, Sungjin Lee, and Onur Mutlu, "DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression", Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST), Santa Clara, CA, USA, February 2022.

#### DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression

Jisung Park<sup>1\*</sup> Jeonggyun Kim<sup>2\*</sup> Yeseong Kim<sup>2</sup> Sungjin Lee<sup>2</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>DGIST

#### https://arxiv.org/pdf/2202.10584.pdf

### **Executive Summary**

#### Motivation

- Data reduction: Effective at reducing the management cost of a data center by reducing the amount of data physically written to storage devices
- Post-deduplication delta compression: Maximizes the data-reduction ratio by applying delta compression along with deduplication and lossless compression
- <u>Problem</u>: Existing post-deduplication delta-compression techniques provide significantly lower data-reduction ratios than the optimal.
  - Due to the limited accuracy of reference search for delta compression
  - Cannot identify a good reference block for many incoming data blocks
- <u>Key Idea</u>: DeepSketch, a new machine learning-based reference search technique that uses the learning-to-hash method
  - Generates a given data block's signature (sketch) using a deep neural network
  - The higher the delta-compression benefit of two data blocks, the more similar their signatures are to each other
- <u>Evaluation Results</u>: DeepSketch reduces the amount of physically-written data
  - By up to 33% (21% on average) compared to a state-of-the-art baseline
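Once every block has a binary sketch, reference search reduces to finding the stored sketch closest to the incoming one. The sketch below assumes Hamming distance as the similarity measure and uses hand-written 8-bit integers in place of the signatures a trained neural network would produce; the block names, sketch values, and distance threshold are all hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two binary sketches differ."""
    return bin(a ^ b).count("1")


def find_reference(sketch, stored_sketches, max_distance):
    """Return the id of the closest stored block, or None if none is close enough."""
    best_id = None
    best_distance = max_distance + 1
    for block_id, stored in stored_sketches.items():
        distance = hamming_distance(sketch, stored)
        if distance < best_distance:
            best_id, best_distance = block_id, distance
    return best_id


# Hypothetical 8-bit sketches standing in for the network's output.
stored_sketches = {
    "blk_a": 0b10110010,
    "blk_b": 0b01001101,
    "blk_c": 0b10110100,
}
similar_block = find_reference(0b10110011, stored_sketches, max_distance=2)
unrelated_block = find_reference(0b00000000, stored_sketches, max_distance=2)
```

In this toy store, the first query differs from `blk_a` in a single bit, so `blk_a` is chosen as the delta-compression reference. The second query is at least four bits away from every stored sketch, so no reference is returned and the block would be compressed on its own.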

### DeepSketch Talk Video




#### https://www.youtube.com/watch?v=RFdGyAJCk9M

### Year III Results (2022 Annual Review 3)

- Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning [ISCA 2022]
- Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction [MICRO 2022]
- GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping [MICRO 2022]
- pLUTo: Enabling Massively Parallel Computation In DRAM via Lookup Tables [MICRO 2022]
- DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression [FAST 2022]
- A Modern Primer on Processing in Memory [arXiv, updated 2022]

### PIM Review and Open Problems

### A Modern Primer on Processing in Memory

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup>

SAFARI Research Group

<sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun, "A Modern Primer on Processing in Memory", *Invited Book Chapter in <u>Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann</u>*, Springer, to be published in 2021.


https://arxiv.org/pdf/1903.03988.pdf


#### Abstract

Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today.

At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend.

This chapter discusses recent research that aims to practically enable computation close to data, an approach we call *processing-in-memory* (PIM). PIM places computation mechanisms in or near where the data is stored (i.e., inside the memory chips, in the logic layer of 3D-stacked memory, or in the memory controllers), so that data movement between the computation units and memory is reduced or eliminated. While the general idea of PIM is not new, we discuss motivating trends in applications as well as memory circuits/technology that greatly exacerbate the need for enabling it in modern computing systems. We examine at least two promising new approaches to designing PIM systems to accelerate important data-intensive applications: (1) *processing using memory* by exploiting analog operational properties of DRAM chips to perform massively-parallel operations in memory, with low-cost changes, (2) *processing near memory* by exploiting 3D-stacked memory technology design to provide high memory bandwidth and low memory latency to in-memory logic. In both approaches, we describe and tackle relevant cross-layer research, design, and adoption challenges in devices, architecture, systems, and programming models. Our focus is on the development of in-memory processing designs that can be adopted in real computing platforms at low cost. We conclude by discussing work on solving key challenges to the practical adoption of PIM.

*Keywords:* memory systems, data movement, main memory, processing-in-memory, near-data processing, computation-in-memory, processing using memory, processing near memory, 3D-stacked memory, non-volatile memory, energy efficiency, high-performance computing, computer architecture, computing paradigm, emerging technologies, memory scaling, technology scaling, dependable systems, robust systems, hardware security, system security, latency, low-latency computing

#### Contents


| 1 | Introduction |  |  |  |  |
|---|---|---|---|---|---|
| 2 | Major Trends Affecting Main Memory |  |  |  |  |
| 3 | The Need for Intelligent Memory Controllers to Enhance Memory Scaling |  |  |  |  |
| 4 | Perils of Processor-Centric Design |  |  |  |  |
| 5 | Processing-in-Memory (PIM): Technology Enablers and Two Approaches |  |  |  |  |
|   | 5.1 New Technology Enablers: 3D-Stacked Memory and Non-Volatile Memory |  |  |  |  |
|   | 5.2 Two Approaches: Processing Using Memory (PUM) vs. Processing Near Memory (PNM) |  |  |  |  |
| 6 | Processing Using Memory (PUM) |  |  |  |  |
|   | 6.1 RowClone |  |  |  |  |
|   | 6.2 Ambit |  |  |  |  |
|   | 6.3 Gather-Scatter DRAM |  |  |  |  |
|   | 6.4 In-DRAM Security Primitives |  |  |  |  |
| 7 | Processing Near Memory (PNM) |  |  |  |  |
|   | 7.1 Tesseract: Coarse-Grained Application-Level PNM Acceleration of Graph Processing |  |  |  |  |
|   | 7.2 Function-Level PNM Acceleration of Mobile Consumer Workloads |  |  |  |  |
|   | 7.3 Programmer-Transparent Function-Level PNM Acceleration of GPU Applications |  |  |  |  |
|   | 7.4 Instruction-Level PNM Acceleration with PIM-Enabled Instructions (PEI) |  |  |  |  |
|   | 7.5 Function-Level PNM Acceleration of Genome Analysis Workloads |  |  |  |  |
|   | 7.6 Application-Level PNM Acceleration of Time Series Analysis |  |  |  |  |
| 8 | Enabling the Adoption of PIM |  |  |  |  |
|   | 8.1 Programming Models and Code Generation for PIM |  |  |  |  |
|   | 8.2 PIM Runtime: Scheduling and Data Mapping |  |  |  |  |
|   | 8.3 Memory Coherence |  |  |  |  |
|   | 8.4 Virtual Memory Support |  |  |  |  |
|   | 8.5 Data Structures for PIM |  |  |  |  |
|   | 8.6 Benchmarks and Simulation Infrastructures |  |  |  |  |
| _ | 8.7 Real PIM Hardware Systems and Proto-    |    |  |  |  |
|   | types                                       | 30 |  |  |  |
|   | 8.8 Security Considerations                 | 30 |  |  |  |
| 0 |                                             |    |  |  |  |
| 9 | Conclusion and Future Outlook               | 31 |  |  |  |

#### 1. Introduction

Main memory, built using the Dynamic Random Access Memory (DRAM) technology, is a major component in nearly all computing systems, including servers, cloud platforms, mobile/embedded devices, and sensor systems. Across all of these systems, the data working set sizes of modern applications are rapidly growing, while the need for fast analysis of such data is increasing. Thus, main memory is becoming an increasingly significant bottleneck across a wide variety of computing systems and applications [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Alleviating the main memory bottleneck requires the memory capacity, energy, cost, and performance to all scale in an efficient manner across technology generations. Unfortunately, it has become increasingly difficult in recent years, especially the past decade, to scale all of these dimensions [1, 2, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], and thus the main memory bottleneck has been worsening.

A major reason for the main memory bottleneck is the high energy and latency cost associated with data movement. In modern computers, to perform any operation on data that resides in main memory, the processor must retrieve the data from main memory. This requires the memory controller to issue commands to a DRAM module across a relatively slow and power-hungry off-chip bus (known as the memory channel). The DRAM module sends the requested data across the memory channel, after which the data is placed in the caches and registers. The CPU can perform computation on the data once the data is in its registers. Data movement from the DRAM to the CPU incurs long latency and consumes a significant amount of energy [7, 50, 51, 52, 53, 54]. These costs are often exacerbated by the fact that much of the data brought into the caches is not reused by the CPU [52, 53, 55, 56], providing little benefit in return for the high latency and energy cost.
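The data movement described above can be made concrete with a toy traffic model. The following is a minimal sketch, not a measurement: the 64-byte cache line size, the function names, and the choice to count only data bytes (ignoring command traffic, write-allocate effects, and prefetching) are all illustrative assumptions. It estimates how many bytes cross the memory channel when the CPU copies a buffer, and what fraction of each fetched cache line a strided scan actually uses.

```python
import math

CACHE_LINE_BYTES = 64  # assumption: a typical cache line size, not taken from the text


def lines_touched(num_bytes: int) -> int:
    """Number of cache lines needed to hold num_bytes of contiguous, aligned data."""
    return math.ceil(num_bytes / CACHE_LINE_BYTES)


def cpu_copy_channel_bytes(num_bytes: int) -> int:
    """Data bytes crossing the memory channel when the CPU performs a copy.

    Every source line is read into the CPU and every destination line is
    written back, so the data crosses the channel twice.
    """
    return 2 * lines_touched(num_bytes) * CACHE_LINE_BYTES


def in_memory_copy_channel_bytes(num_bytes: int) -> int:
    """Data bytes crossing the channel if the copy happens inside memory.

    Assumption: only commands cross the channel, and they are not counted here.
    """
    return 0


def useful_fraction(elem_bytes: int, stride_bytes: int) -> float:
    """Fraction of each fetched line that a strided scan actually uses.

    Assumes aligned elements that never straddle a cache line.
    """
    if not 0 < elem_bytes <= min(stride_bytes, CACHE_LINE_BYTES):
        raise ValueError("element must fit within both the stride and a cache line")
    if stride_bytes >= CACHE_LINE_BYTES:
        # At most one element is used per fetched line.
        return elem_bytes / CACHE_LINE_BYTES
    # Several elements share a line; the gaps between them are wasted.
    return elem_bytes / stride_bytes


if __name__ == "__main__":
    print(cpu_copy_channel_bytes(4096))        # channel traffic for a 4 KiB CPU copy
    print(in_memory_copy_channel_bytes(4096))  # same copy performed in memory
    print(useful_fraction(8, 64))              # one 8-byte field per 64-byte line
```

Under these assumptions, a processor-centric copy moves every byte across the channel twice, and a scan that reads one 8-byte field per cache line uses only one eighth of the data it fetches.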

The cost of data movement is a fundamental issue with the *processor-centric* nature of contemporary computer systems. The CPU is considered to be the master in the system, and computation is performed only in the processor (and accelerators). In contrast, data storage and communication units, including the main memory, are treated as unintelligent workers that are incapable of computation. As a result of this processor-centric design paradigm, large amounts of data move between the computation units and the communication/storage units so that computation can be done on the data. With the increasingly *data-centric* nature of contemporary and emerging appli-


### PIM Review and Open Problems (II)

#### A Workload and Programming Ease Driven Perspective of Processing-in-Memory

Saugata Ghose† Amirali Boroumand† Jeremie S. Kim†§ Juan Gómez-Luna§ Onur Mutlu§†

†Carnegie Mellon University §ETH Zürich

Saugata Ghose, Amirali Boroumand, Jeremie S. Kim, Juan Gomez-Luna, and Onur Mutlu, "Processing-in-Memory: A Workload-Driven Perspective" *Invited Article in <u>IBM Journal of Research & Development</u>, Special Issue on Hardware for Artificial Intelligence*, to appear in November 2019. [Preliminary arXiv version]


https://arxiv.org/pdf/1907.12947.pdf

### Year III Results (2022 Annual Review 4)

- EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators [arXiv 2022] <u>https://arxiv.org/abs/2202.02310</u>
- ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis [arXiv 2022] <u>https://arxiv.org/abs/2207.09765</u>
- Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric [TRETS 2022] <u>https://arxiv.org/abs/2107.08716</u>


