## Memory Systems and Memory-Centric Computing Topic 1: Trends, Challenges, Opportunities

Onur Mutlu omutlu@gmail.com https://people.inf.ethz.ch/omutlu 15 July 2024 HiPEAC ACACES Summer School 2024



**ETH** zürich

## Brief Self Introduction

#### Onur Mutlu

- Full Professor @ ETH Zurich ITET (INFK), since Sept 2015
- Strecker Professor @ Carnegie Mellon University ECE (CS), 2009-2016, 2016-...
- Started the Comp Arch Research Group @ Microsoft Research, 2006-2009
- Worked @ Google, VMware, Microsoft Research, Intel, AMD
- PhD in Computer Engineering from University of Texas at Austin in 2006
- BS in Computer Engineering & Psychology from University of Michigan in 2000
- <u>https://people.inf.ethz.ch/omutlu/ omutlu@gmail.com</u>

#### Research and Teaching in:

- **Computer architecture, systems, hardware security, bioinformatics**
- Memory and storage systems
- Robust & dependable hardware systems: security, safety, predictability, reliability
- Hardware/software cooperation
- New computing paradigms; architectures with emerging technologies/devices
- Architectures for bioinformatics, genomics, health, medicine, AI/ML



### SAFARI Research Group

#### Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-april-2020/



## SAFARI Newsletter January 2021 Edition

#### <u>https://safari.ethz.ch/safari-newsletter-january-2021/</u>

in 🖸 🖌 f



Think Big, Aim High, and Have a Wonderful 2021! Newsletter January 2021



Dear SAFARI friends,

Happy New Year! We are excited to share our group highlights with you in this second edition of the SAFARI newsletter (You can find the first edition from April 2020 here). 2020 has

### SAFARI Newsletter December 2021 Edition

#### <u>https://safari.ethz.ch/safari-newsletter-december-2021/</u>



Think Big, Aim High



f y in 🛛

View in your browser December 2021



### SAFARI Newsletter June 2023 Edition

#### <u>https://safari.ethz.ch/safari-newsletter-june-2023/</u>



Think Big, Aim High



y in D

View in your browser June 2023



#### SAFARI Newsletter July 2024 Edition

#### https://safari.ethz.ch/safari-newsletter-july-2024/



#### SAFARI Introduction & Research

#### Computer architecture, HW/SW, systems, bioinformatics, security, memory



Seminar in Computer Architecture - Lecture 5: Potpourri of Research Topics (Spring 2023)



## SAFARI PhD and Post-Doc Alumni

#### <u>https://safari.ethz.ch/safari-alumni/</u>

- Hasan Hassan (Rivos), EDAA Outstanding Dissertation Award 2023; S&P 2020 Best Paper Award, 2020 Pwnie Award, IEEE Micro TP HM 2020
- Christina Giannoula (Univ. of Toronto), NTUA Best Dissertation Award 2023
- Minesh Patel (Rutgers, Asst. Prof.), DSN Carter Award Best Thesis 2022; ETH Medal 2023; MICRO'20 & DSN'20 Best Paper Awards; ISCA HoF 2021
- Damla Senol Cali (Bionano Genomics), SRC TECHCON 2019 Best Student Presentation Award; RECOMB-Seq 2018 Best Poster Award
- Nastaran Hajinazar (Intel)
- Gagandeep Singh (AMD/Xilinx), FPL 2020 Best Paper Award Finalist
- Amirali Boroumand (Stanford Univ → Google), SRC TECHCON 2018 Best Presentation Award
- Jeremie Kim (Apple), EDAA Outstanding Dissertation Award 2020; IEEE Micro Top Picks 2019; ISCA/MICRO HoF 2021
- Nandita Vijaykumar (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021
- Kevin Hsieh (Microsoft Research, Senior Researcher)
- Justin Meza (Facebook), HiPEAC 2015 Best Student Presentation Award; ICCD 2012 Best Paper Award
- Mohammed Alser (ETH Zurich), IEEE Turkey Best PhD Thesis Award 2018
- Yixin Luo (Google), HPCA 2015 Best Paper Session
- Kevin Chang (Facebook), SRC TECHCON 2016 Best Student Presentation Award
- Rachata Ausavarungnirun (KMUNTB, Assistant Professor), NOCS 2015 and NOCS 2012 Best Paper Award Finalist
- Gennady Pekhimenko (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021; ASPLOS 2015 SRC Winner
- Vivek Seshadri (Microsoft Research)
- Donghyuk Lee (NVIDIA Research, Senior Researcher), HPCA Hall of Fame 2018
- Yoongu Kim (Software Robotics → Google), IFIP JCL Award'24, TCAD'19 Top Pick Award; IEEE Micro Top Picks'10; HPCA'10 Best Paper Session
- Lavanya Subramanian (Intel Labs → Facebook)
- Samira Khan (Univ. of Virginia, Assistant Professor), HPCA 2014 Best Paper Session
- Saugata Ghose (Univ. of Illinois, Assistant Professor), DFRWS-EU 2017 Best Paper Award
- Jawad Haj-Yahya (Huawei Research Zurich, Principal Researcher)
- Lois Orosa (Galicia Supercomputing Center, Director)
- Jisung Park (POSTECH, Assistant Professor)
- Gagandeep Singh (AMD/Xilinx, Researcher)
- Juan Gomez-Luna (NVIDIA, Researcher), ISPASS 2023 Best Paper Session

## An Interview on Computing Futures





SAFARI

ANALYTICS EDIT VIDEO

#### https://www.youtube.com/watch?v=8ffSEKZhmvo

#### Current Mission

Computer architecture, HW/SW, systems, bioinformatics, security



**Graphics and Vision Processing** 

#### **Build fundamentally better computers**

## Current Research Mission & Major Topics

#### **Build fundamentally better computers**



Broad research spanning apps, systems, logic with architecture at the center

- Data-centric arch. for low energy & high perf.
   Proc. in Mem/DRAM, NVM, unified mem/storage
- Low-latency & predictable architectures
  - □ Low-latency, low-energy yet low-cost memory
  - QoS-aware and predictable memory systems
- Fundamentally secure/reliable/safe arch.
   Tolerating all bit flips; patchable HW; secure mem
- Architectures for ML/AI/Genomics/Health/Med
   Algorithm/arch./logic co-design; full heterogeneity
- Data-driven and data-aware architectures
  - ML/AI-driven architectural controllers and design
  - Expressive memory and expressive systems

## Five Key Current Directions

- Fundamentally Robust (Secure/Reliable/Safe) Architectures
- Fundamentally Energy-Efficient Architectures
  - Memory-centric (Data-centric) Architectures
- Fundamentally Low-Latency and Predictable Architectures
- Fundamentally Intelligent and Evolving Architectures
   ML/AI-Assisted (Data-driven) and Data-aware Architectures

Architectures for ML/AI, Genomics, Medicine, Health, ...

### The Transformation Hierarchy

Computer Architecture (expanded view)



Computer Architecture (narrow view)



To achieve the highest efficiency, performance, robustness:

#### we must take the expanded view

of computer architecture



### Principle: Teaching and Research

## Teaching drives Research Research drives Teaching



## Open Source Tools: SAFARI GitHub



#### https://github.com/CMU-SAFARI/

## Referenced Papers, Talks, Artifacts

All are available at

https://people.inf.ethz.ch/omutlu/projects.htm

https://www.youtube.com/onurmutlulectures

https://github.com/CMU-SAFARI/

## Quick Course Overview

## What Will You Learn in This Course?

- Memory Systems and Memory-Centric Computing
   July 15-19, 2024
- Topic 1: Memory Trends, Challenges, Opportunities, Basics
- Topic 2: Memory-Centric Computing
- Topic 3: Memory Robustness: RowHammer, RowPress & Beyond
- Topic 4: Machine Learning Driven Memory Systems
- Topic 5 (another course): Architectures for Genomics and ML
- Topic 6 (unlikely): Non-Volatile Memories and Storage
- Topic 7 (unlikely): Memory Latency, Predictability & QoS
- Major Overview Reading:
  - Mutlu et al., "A Modern Primer on Processing in Memory," Book Chapter on Emerging Computing and Devices, 2022.

## Course Website & Some Study Materials

#### https://safari.ethz.ch/memory\_systems/ACACES2024/

- "A Modern Primer on Processing in Memory" (Emerging Computing, 2022) <u>https://arxiv.org/abs/2012.03112</u>
- "Fundamentally Understanding and Solving RowHammer" (ASP-DAC, 2023) <u>https://arxiv.org/abs/2211.07613</u>
- "Intelligent Architectures for Intelligent Computing Systems" (DATE, 2021) <u>https://arxiv.org/abs/2012.12381</u>
- "Accelerating Neural Network Inference With Processing-in-DRAM: From the Edge to the Cloud" (IEEE Micro, 2022) <u>https://arxiv.org/abs/2209.08938</u>
- "Accelerating Genome Analysis via Algorithm-Architecture Co-Design" (DAC, 2023) <u>https://arxiv.org/abs/2305.00492</u>
- "Memory-Centric Computing" (DAC, 2023) <u>https://arxiv.org/abs/2305.20000</u>
- "RowHammer: A Retrospective" (TCAD, 2019) <u>https://arxiv.org/abs/1904.09724</u>
- "Accelerating Genome Analysis: A Primer on an Ongoing Journey" (IEEE Micro, 2020) <u>https://arxiv.org/abs/2008.00961</u>

## Course Information

- My Contact Information
  - Onur Mutlu
  - omutlu@gmail.com
  - <u>https://people.inf.ethz.ch/omutlu</u>
  - +41-79-572-1444 (my cell phone)
  - □ Find me during breaks and/or email any time.
- Website for Course Slides, Papers, Updates
  - https://safari.ethz.ch/memory\_systems/ACACES2024/
- For the curious ACACES 2013 & 2018 courses:
  - https://people.inf.ethz.ch/omutlu/acaces2013-memory.html
  - https://people.inf.ethz.ch/omutlu/acaces2018.html

## This Course

- Will cover many problems and potential solutions related to the design of memory systems & memory-centric computers
- The design of memory systems poses many
  - Difficult research and engineering problems
  - Important fundamental problems
  - Industry-relevant problems
  - Problems whose solutions can revolutionize the world
- Many creative and insightful solutions are needed to solve these problems
- Goal: Acquire the basics to develop such solutions (by covering fundamentals and cutting-edge research)
   SAFARI

## How To Make the Best Out of This Course

- Be alert during lectures they will be fast paced
- Do the readings (and explore even more)
   I will provide many references
- Go back and reinforce fundamentals (as needed)
  - I will provide pointers to basic computer architecture materials (lecture videos, slides, readings, exams, ...)



Remember "Chance favors the prepared mind." (Pasteur)

## Unfortunately, No Time For:

- Memory Latency
- Memory Interference and QoS, Predictable Performance
   QoS-aware Memory Systems
- Emerging Memory Technologies and Hybrid Memories
- Interconnects
- Caching, Prefetching, Memory Hierarchy Design
- You can find many materials on these at my online lectures
   <u>https://people.inf.ethz.ch/omutlu/teaching.html</u>

#### Links for Basic Materials

- Digital Design & Computer Architecture Course (Spring 2023):
  - https://safari.ethz.ch/digitaltechnik/spring2023/
  - https://www.youtube.com/onurmutlulectures
  - https://www.youtube.com/playlist?list=PL5Q2soXY2Zi-EImKxYYY1SZuGiOAOBKaf



40K views Streamed 1 year ago Livestream - Digital Design and Computer Architecture - ETH Zürich (Spring 2023)

### Links for More Advanced Materials

- Computer Architecture Course (Fall 2021):
  - https://safari.ethz.ch/architecture/fall2021/
  - https://www.youtube.com/onurmutlulectures
  - https://www.youtube.com/playlist?list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF



## PIM Review and Open Problems

## A Modern Primer on Processing in Memory

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup>

SAFARI Research Group

<sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun, "A Modern Primer on Processing in Memory" *Invited Book Chapter in <u>Emerging Computing: From Devices to Systems -</u> <i>Looking Beyond Moore and Von Neumann*, Springer, to be published in 2023

## PIM Review and Open Problems (II)

#### A Workload and Programming Ease Driven Perspective of Processing-in-Memory

Saugata Ghose†Amirali Boroumand†Jeremie S. Kim†§Juan Gómez-Luna§Onur Mutlu§††Carnegie Mellon University§ETH Zürich

Saugata Ghose, Amirali Boroumand, Jeremie S. Kim, Juan Gomez-Luna, and Onur Mutlu, "Processing-in-Memory: A Workload-Driven Perspective" *Invited Article in <u>IBM Journal of Research & Development</u>, Special Issue on Hardware for Artificial Intelligence*, to appear in November 2019. [Preliminary arXiv version]

#### SAFARI

https://arxiv.org/pdf/1907.12947.pdf

Future Computing Platforms Challenges and Opportunities

## Why Do We Do Computing?



## **To Solve Problems**



## To Gain Insight

**SAFARI** Hamming, "Numerical Methods for Scientists and Engineers," 1962. <sup>33</sup>

## To Enable a Better Life & Future

## How Does a Computer Solve Problems?



# Orchestrating Electrons

In today's dominant technologies



# How Do Problems Get Solved by Electrons?

### The Transformation Hierarchy

Computer Architecture (expanded view)



Computer Architecture (narrow view)

# Computer Architecture

- is the science and art of designing computing platforms (hardware, interface, system SW, and programming model)
- to achieve a set of design goals
  - □ E.g., highest performance on earth on workloads X, Y, Z
  - E.g., longest battery life at a form factor that fits in your pocket with cost < \$\$\$ CHF</li>
  - E.g., best average performance across all known workloads at the best performance/cost ratio

• ...

Designing a supercomputer is different from designing a smartphone → But, many fundamental principles are similar







#### SAFARI Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg



Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu <u>"Accelerating Genome Analysis: A Primer on an Ongoing Journey"</u> IEEE Micro, August 2020.



#### FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

July-Aug. 2021, pp. 39-48, vol. 41 DOI Bookmark: 10.1109/MM.2021.3088396

MinION from ONT



### An Example System in Your Pocket



**SAFARI** 

https://www.gsmarena.com/apple\_announces\_m1\_ultra\_with\_20core\_cpu\_and\_64core\_gpu-news-53481.php









Control

**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

#### Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.



#### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3



https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.





Tesla Dojo Chip & System



D1 Chip

### 362 TFLOPs BF16/CFP8 22.6 TFLOPs FP32

10TBps/dir. On-Chip Bandwidth 4TBps/edge. Off-Chip Bandwidth

400W TDP





50 Billion Transistors

11+ Miles Of Wires

■ 1:53:07 / 3:03:20 • Dojo >



https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s

Tesla Dojo Chip & System





#### https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s

Tesla Dojo Chip & System







NVIDIA is claiming a **7x improvement** in dynamic programming algorithm (**DPX instructions**) performance on a single H100 versus naïve execution on an A100.

#### https://www.nvidia.com/en-us/data-center/h100/



# Cerebras's Wafer Scale Engine (2019)



 The largest ML accelerator chip

400,000 cores



Cerebras WSE 1.2 Trillion transistors 46,225 mm<sup>2</sup> Largest GPU 21.1 Billion transistors 815 mm<sup>2</sup>

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

# Cerebras's Wafer Scale Engine-2 (2021)



 The largest ML accelerator chip (2021)

850,000 cores



Cerebras WSE-2 2.6 Trillion transistors 46,225 mm<sup>2</sup> Largest GPU 54.2 Billion transistors 826 mm<sup>2</sup>

NVIDIA Ampere GA100

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

# Many (Other) (AI/ML) Chips

- Alibaba
- Amazon
- Facebook
- Google
- Huawei
- Intel
- Microsoft
- NVIDIA
- Tesla
- Many Others and Many Startups are Building Their Own Chips...

### Many More to Come...

# Many (Other) AI/ML Chips (2019)



All information contained within this infographic is gathered from the internet and periodically updated, no guarantee is given that the information provided is correct, complete, and up-to-date.

### SAFARI

#### https://basicmi.github.io/AI-Chip/

# Many (Other) AI/ML Chips (2021)



All information contained within this infographic is gathered from the internet and periodically updated, no guarantee is given that the information provided is correct, complete, and up-to-date.

### SAFARI

#### https://basicmi.github.io/AI-Chip/

# UPMEM Processing-in-DRAM Engine (2019)

### Processing in DRAM Engine

 Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

### Replaces standard DIMMs

- DDR4 R-DIMM modules
  - 8GB+128 DPUs (16 PIM chips)
  - Standard 2x-nm DRAM process



Large amounts of compute & memory bandwidth



https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/

# **UPMEM Memory Modules**

- E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips DIMM (2 ranks). DPUs @ 350 MHz



# 2,560-DPU Processing-in-Memory System



#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland IZZAT EL HAJJ, Amerian University of Betrut, Lebanon IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece GERALDO F. OLIVEIRA, ETH Zürich, Switzerland ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound for such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow busy with high latency and limited bandwidth, and the low data reuse in memory-bound workload is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement butlenced requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory (PMA)*.

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3Dstacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PM architecture. We make two evolution string, we conduct an experimental characterization of the UPMRM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwisht, yielding new insights. Second, we present PIM (*Coressing in-Homery benchmarks*), a benchmark suite of 16 worldoads from different application domains (e.g., dense/sparse linear algebra, dathases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and acaling characteristics of PIM benchmarks on the UPMRM PIM architecture, and compare their performance and energy consumption to their stateof-the-art CPU and CPU counterparts. Our extensive evaluation conducted on two real UPMRM-based PIM systems with 64 and 2550 PUP sproids new insights about suitability of different worldoads to the PIM systems reares of future PIM systems.



https://arxiv.org/pdf/2105.03814.pdf

### Experimental Analysis of the UPMEM PIM Engine

### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland IZZAT EL HAJJ, American University of Beirut, Lebanon IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece GERALDO F. OLIVEIRA, ETH Zürich, Switzerland ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory (PIM*).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3Dstacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their stateof-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

#### https://arxiv.org/pdf/2105.03814.pdf

### Understanding a Modern PIM Architecture



https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi tOTAYm--dYByNPL7JhwR9

### More on Analysis of the UPMEM PIM Engine



https://www.youtube.com/watch?v=D8Hjy2iU9I4&list=PL5Q2soXY2Zi tOTAYm--dYByNPL7JhwR9

### More on Analysis of the UPMEM PIM Engine



#### Understanding a Modern Processing-in-Memory Arch: Benchmarking & Experimental Characterization; 21m



https://www.youtube.com/watch?v=Pp9jSU2b9oM&list=PL5Q2soXY2Zi8\_VVChACnON4sfh2bJ5IrD&index=159

# FPGA-based Processing Near Memory

 Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu, "FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications" IEEE Micro (IEEE MICRO), to appear, 2021.

# FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications

Gagandeep Singh<sup>◊</sup> Mohammed Alser<sup>◊</sup> Damla Senol Cali<sup>⋈</sup>

Dionysios Diamantopoulos<sup>∇</sup> Juan Gómez-Luna<sup>◊</sup>

Henk Corporaal<sup>★</sup> Onur Mutlu<sup>◊ ⋈</sup>

◇ETH Zürich <sup>™</sup>Carnegie Mellon University
 \*Eindhoven University of Technology <sup>▽</sup>IBM Research Europe

### Samsung Function-in-Memory DRAM (2021)

Samsung Newsroom

CORPORATE | PRODUCTS | PRESS RESOURCES | VIEWS | ABOUT US

Audio

Share ( 🔊

### Samsung Develops Industry's First High Bandwidth Memory with AI Processing Power

Korea on February 17, 2021

#### The new architecture will deliver over twice the system performance and reduce energy consumption by more than 70%

Samsung Electronics, the world leader in advanced memory technology, today announced that it has developed the industry's first High Bandwidth Memory (HBM) integrated with artificial intelligence (AI) processing power – the HBM-PIM The new processing-in-memory (PIM) architecture brings powerful AI computing capabilities inside high-performance memory, to accelerate large-scale processing in data centers, high performance computing (HPC) systems and AI-enabled mobile applications.

Kwangil Park, senior vice president of Memory Product Planning at Samsung Electronics stated, "Our groundbreaking HBM-PIM is the industry's first programmable PIM solution tailored for diverse AI-driven workloads such as HPC, training and inference. We plan to build upon this breakthrough by further collaborating with AI solution providers for even more advanced PIM-powered applications."

### Samsung Function-in-Memory DRAM (2021)





#### [3D Chip Structure of HBM with FIMDRAM]

ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon', Suk Han Lee', Jaehoon Lee', Sang-Hyuk Kwon', Je Min Ryu', Jong-Pil Son', Seongil O', Hak-Soo Yu', Haesuk Lee', Soo Young Kim', Youngmin Cho', Jin Guk Kim', Jongyoon Choi', Hyun-Sung Shin', Jin Kim', BengSeng Phuah', HyoungMin Kim', Myeong Jun Song', Ahn Choi', Daeho Kim', SooYoung Kim', Eun-Bong Kim', David Wang', Shinhaeng Kang', Yuhwan Ro<sup>3</sup>, Seungwoo Seo<sup>3</sup>, JoonHo Song<sup>3</sup>, Jaeyoun Youn', Kyomin Sohn', Nam Sung Kim'

<sup>1</sup>Samsung Electronics, Hwaseong, Korea <sup>2</sup>Samsung Electronics, San Jose, CA <sup>3</sup>Samsung Electronics, Suwon, Korea

# Samsung AxDIMM (2021)

### DDRx-PIM

DLRM recommendation system





AxDIMM System





### SK Hynix Accelerator-in-Memory (2022)

#### **SK**hynix NEWSROOM

**SK hvnix STORY** 

INSIGHT

PRESS CENTER

MULTIMEDIA

Search

🌐 ENG 🗸

Q

#### SK hynix Develops PIM, Next-Generation AI Accelerator

February 16, 2022

#### Seoul, February 16, 2022

SK hynix (or "the Company", www.skhynix.com) announced on February 16 that it has developed PIM\*, a nextgeneration memory chip with computing capabilities.

\*PIM(Processing In Memory): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory

It has been generally accepted that memory chips store data and CPU or GPU, like human brain, process data. SK hynix, following its challenge to such notion and efforts to pursue innovation in the next-generation smart memory, has found a breakthrough solution with the development of the latest technology.

SK hynix plans to showcase its PIM development at the world's most prestigious semiconductor conference, 2022 ISSCC\*, in San Francisco at the end of this month. The company expects continued efforts for innovation of this technology to bring the memory-centric computing, in which semiconductor memory plays a central role, a step closer In Paper 11.1, SK Hynix describes an 1ynm, GDDR6-based accelerator-in-memory with a command set for deep-learning operation. The to the reality in devices such as smartphones.

\*ISSCC: The International Solid-State Circuits Conference will be held virtually from Feb. 20 to Feb. 24 this year with a theme of "Intelligent Silicon for a Sustainable World'

For the first product that adopts the PIM technology, SK hynix has developed a sample of GDDR6-AiM (Accelerator\* in memory). The GDDR6-AiM adds computational functions to GDDR6\* memory chips, which process data at 16Gbps. A combination of GDDR6-AiM with CPU or GPU instead of a typical DRAM makes certain computation speed 16 times faster. GDDR6-AiM is widely expected to be adopted for machine learning, high-performance computing, and big data computation and storage.



#### 11.1 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications

Seongiu Lee, SK hynix, Icheon, Korea

8Gb design achieves a peak throughput of 1TFLOPS with 1GHz MAC operations and supports major activation functions to improve accuracy.

https://news.skhynix.com/sk-hynix-develops-pim-next-generation-ai-accelerator/

### AliBaba PIM Recommendation System (2022)



Figure 29.1.2: Illustration of 3D-stacked chip, cross-illustration of package, DRAM array layout and design blocks on logic die.



Figure 29.1.7: Die micrographs of DRAM die, NE and ME. Detailed specifications of DRAM die and logic die.

55nm

16 per IF

1.2 V

INT8

602 22 mm

5.90 mm<sup>2</sup>

7.02 mm<sup>2</sup>

#### 184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding 29.1 with Process-Near-Memory Engine for Recommendation System

Dimin Niu<sup>1</sup>, Shuangchen Li<sup>1</sup>, Yuhao Wang<sup>1</sup>, Wei Han<sup>1</sup>, Zhe Zhang<sup>2</sup>, Yijin Guan<sup>2</sup>, Tianchan Guan<sup>3</sup>, Fei Sun<sup>1</sup>, Fei Xue<sup>1</sup>, Lide Duan<sup>1</sup>, Yuanwei Fang<sup>1</sup>, Hongzhong Zheng<sup>1</sup>, Xiping Jiang<sup>4</sup>, Song Wang<sup>4</sup>, Fengguo Zuo<sup>4</sup>, Yubing Wang<sup>4</sup>, Bing Yu<sup>4</sup>, Qiwei Ren<sup>4</sup>, Yuan Xie<sup>1</sup>

### SK Hynix CXL Processing Near Memory (2023)

IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 22, NO. 1, JANUARY-JUNE

#### Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications

Joonseop Sim<sup>®</sup>, Soohong Ahn<sup>®</sup>, Taeyoung Ahn<sup>®</sup>, Seungyong Lee<sup>®</sup>, Myunghyun Rhee, Jooyoung Kim<sup>®</sup>, Kwangsik Shin, Donguk Moon<sup>®</sup>, Euiseok Kim, and Kyoung Park<sup>®</sup>

Abstract—CXL interface is the up-to-date technology that enables effective memory expansion by providing a memory-sharing protocol in configuring heterogeneous devices. However, its limited physical bandwidth can be a significant bottleneck for emerging data-intensive applications. In this work, we propose a novel CXL-based memory disaggregation architecture with a real-world prototype demonstration, which overcomes the bandwidth limitation of the CXL interface using near-data processing. The experimental results demonstrate that our design achieves up to 1.9× better performance/power efficiency than the existing CPU system.

Index Terms—Compute express link (CXL), near-data-processing (NDP)



#### SAFARI

Fig. 6. FPGA prototype of proposed CMS card.

### Samsung CXL Processing Near Memory (2023)

#### Samsung Processing in Memory Technology at Hot Chips 2023

By Patrick Kennedy - August 28, 2023





Samsung PIM PNM For Transformer Based AI HC35\_Page\_24

### Processing-in-Memory Landscape (2022)



SAFARI

And, many other experimental chips and startups

### Future of Genome Sequencing & Analysis

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu <u>"Accelerating Genome Analysis: A Primer on an Ongoing Journey"</u> IEEE Micro, August 2020.



#### FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

July-Aug. 2021, pp. 39-48, vol. 41 DOI Bookmark: 10.1109/MM.2021.3088396

MinION from ONT

SmidgION from ONT



To achieve the highest efficiency, performance, robustness:

### we must take the expanded view

#### of computer architecture



#### SAFARI

## What Kind of a Future Do We Want?

### How Reliable/Secure/Safe is This Bridge?





### Collapse of the "Galloping Gertie"





### Another View





### How Secure Are These People?



#### Security is about preventing unforeseen consequences

#### Source: https://s-media-cache-ak0.pinimg.com/originals/48/09/54/4809543a9c7700246a0cf8acdae27abf.jpg

SAFARI

### How Safe & Secure Is This Platform?



#### SAFARI Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg

### How Robust Are These Platforms Really?









**SAFARI** https://www.kennedyspacecenter.com/explore-attractions/nasa-now https://www.cnet.com/pictures/nasas-wildest-rides-extreme-vehicles-for-earth-and-beyond/7/

### Challenge and Opportunity for Future

# Robust (Reliable, Secure, Safe)



### Do We Want This?



SAFARI Source

Source: V. Milutinovic

### Or This?



Challenge and Opportunity for Future

# Sustainable and Energy Efficient

### Many Difficult Problems: Climate



### Many Difficult Problems: Congestion



### Many Difficult Problems: Intelligence



SAFARI

### Many Difficult Problems: Public Health



#### SAFARI

Source: https://blog.wego.com/7-crowded-places-and-events-that-you-will-love/

### Many Difficult Problems: Genome Analysis



http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped

### We Need Faster & Scalable Genome Analysis



Understanding genetic variations, species, evolution, ...



Rapid surveillance of **disease outbreaks** 



Predicting the presence and relative abundance of **microbes** in a sample



Developing personalized medicine

#### SAFARI

#### And, many, many other applications ...

### New Genome Sequencing Technologies

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali 🖾, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017Published:02 April 2018Article history ▼



Oxford Nanopore MinION

Senol Cali+, "Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions," Briefings in Bioinformatics, 2018. [Open arxiv.org version]

#### SAFARI

### Accelerating Genome Analysis [DAC 2023]

 Onur Mutlu and Can Firtina, <u>"Accelerating Genome Analysis via Algorithm-Architecture</u> <u>Co-Design"</u> *Invited Special Session Paper in Proceedings of the <u>60th Design</u> <u>Automation Conference</u> (DAC), San Francisco, CA, USA, July 2023. [arXiv version]* 

### Accelerating Genome Analysis via Algorithm-Architecture Co-Design

Onur Mutlu Can Firtina ETH Zürich

#### SAFARI https://arxiv.org/pdf/2305.00492.pdf

### Accelerating Genome Analysis [IEEE MICRO 2020]

 Mohammed Alser, Zulal Bingol, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu,
 "Accelerating Genome Analysis: A Primer on an Ongoing Journey" IEEE Micro (IEEE MICRO), Vol. 40, No. 5, pages 65-75, September/October 2020.
 [Slides (pptx)(pdf)]
 [Talk Video (1 hour 2 minutes)]

## Accelerating Genome Analysis: A Primer on an Ongoing Journey

Mohammed Alser ETH Zürich

Zülal Bingöl Bilkent University

Damla Senol Cali Carnegie Mellon University

Jeremie Kim ETH Zurich and Carnegie Mellon University Saugata Ghose University of Illinois at Urbana–Champaign and Carnegie Mellon University

Can Alkan Bilkent University

**Onur Mutlu** ETH Zurich, Carnegie Mellon University, and Bilkent University

### Beginner Reading on Genome Analysis

Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu "From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis" Computational and Structural Biotechnology Journal, 2022 [Source code]



Review

From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures



Mohammed Alser\*, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu\*

ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland

### SAFARI https://arxiv.org/pdf/2205.07957.pdf

### Future of Genome Sequencing & Analysis

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu "Accelerating Genome Analysis: A Primer on an Ongoing Journey" IEEE Micro, August 2020.



Accelerating Genome Analysis: A Primer on an Ongoing Journey Sept.-Oct. 2020, pp. 65-75, vol. 40

#### FPGA-Based Near-Memory Acceleration of **Modern Data-Intensive Applications**

July-Aug. 2021, pp. 39-48, vol. 41 DOI Bookmark: 10.1109/MM.2021.3088396

MinION from ONT

SmidgION from ONT

### More on Fast & Efficient Genome Analysis ...





Analytics Edit video

#### https://www.youtube.com/watch?v=NCagwf0ivT0

57

**1**6

A Share

↓ Download

💥 Clip

=+ Save

• • •

### Genomics Course (Fall 2022)

#### Fall 2022 Edition:

https://safari.ethz.ch/projects and seminars/fall2022/do ku.php?id=bioinformatics

#### Spring 2022 Edition:

https://safari.ethz.ch/projects and seminars/spring2022 /doku.php?id=bioinformatics

#### Youtube Livestream (Fall 2022):

https://www.youtube.com/watch?v=nA41964-9r8&list=PL5Q2soXY2Zi8tFlQvdxOdizD\_EhVAMVQV

#### Youtube Livestream (Spring 2022):

- https://www.youtube.com/watch?v=DEL\_5A\_Y3TI&list= PL5Q2soXY2Zi8NrPDgOR1yRU\_Cxxjw-u18
- Project course
  - Taken by Bachelor's/Master's students
  - Genomics lectures
  - Hands-on research exploration
  - Many research readings

#### https://www.youtube.com/onurmutlulectures

#### SAFARI



#### Spring 2022 Meetings/Schedule

| Week | Date          | Livestream        | Meeting                                                                                    | Learning<br>Materials                             |
|------|---------------|-------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------|
| W1   | 11.3<br>Fri.  | You Tube Live     | M1: P&S Accelerating Genomics<br>Course Introduction & Project<br>Proposals<br>(PDF) (PPT) | Required<br>Materials<br>Recommended<br>Materials |
| W2   | 18.3<br>Fri.  | You Tube Live     | M2: Introduction to Sequencing                                                             |                                                   |
| W3   | 25.3<br>Fri.  | You Tube Premiere | M3: Read Mapping ma (PDF) ma (PPT)                                                         |                                                   |
| W4   | 01.04<br>Fri. | You Tube Premiere | M4: GateKeeper                                                                             |                                                   |
| W5   | 08.04<br>Fri. | You Tube Premiere | M5: MAGNET & Shouji                                                                        |                                                   |
| W6   | 15.4<br>Fri.  | You Tube Premiere | M6: SneakySnake                                                                            |                                                   |
| W7   | 29.4<br>Fri.  | You Tube Premiere | M7: GenStore                                                                               |                                                   |
| W8   | 06.05<br>Fri. | You Tube Premiere | M8: GRIM-Filter                                                                            |                                                   |
| W9   | 13.05<br>Fri. | You Tube Premiere | M9: Genome Assembly                                                                        |                                                   |
| W10  | 20.05<br>Fri. | You Tube Live     | M10: Genomic Data Sharing Under<br>Differential Privacy<br>@ (PDF) # (PPT)                 |                                                   |
| W11  | 10.06<br>Fri. | You Tube Premiere | M11: Accelerating Genome<br>Sequence Analysis<br>(PDF) (PPT)                               |                                                   |

### BIO-Arch Workshop at RECOMB 2023

#### April 14, 2023

#### BIO-Arch: Workshop on Hardware Acceleration of Bioinformatics Workloads

#### About

BIO-Arch is a new forum for presenting and discussing new ideas in accelerating bioinformatics workloads with the co-design of hardware & software and the use of new computer architectures. Our goal is to discuss new system designs tailored for bioinformatics. BIO-Arch aims to bring together researchers in the bioinformatics, computational biology, and computer architecture communities to strengthen the progress in accelerating bioinformatics analysis (e.g., genome analysis) with efficient system designs that include hardware acceleration and software systems tailored for new hardware technologies.

#### Venue

BIO-Arch will be held in The Social Facilities of İstanbul Technical University on April14. Detailed information about how to arrive at the venue location with various transportation options can be found on the RECOMB website.

Our panel discussion will be held in conjunction with the main RECOMB conference. The panel discussion will be held in Marriott Şişli on **April 17 at 17:00**. You can find



#### https://www.youtube.com/watch?v=2rCsb4-nLmg

#### https://safari.ethz.ch/recomb23-arch-workshop/

# Exponential Growth of Neural Networks



#### SAFARI

Source: https://youtu.be/Bh13Idwcb0Q?t=283

104



### Challenge and Opportunity for Future

# High Performance

## (to solve the **toughest** & **all** problems)

### Personalization: Medicine



SAFARI Source: Jane Ades, NHGRI

### Comparative Genomics & Medicine



Source: By Aaron E. Darling, István Miklós, Mark A. Ragan - Figure 1 from Darling AE, Miklós I, Ragan MA (2008). "Dynamics of Genome Rearrangement in Bacterial Populations". PLOS Genetics. DOI:10.1371/journal.pgen.1000128., CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=30550950

### Personalized Medical Technologies

#### Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali 🖾, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017 Published: 02 April 2018 Article history ▼



Oxford Nanopore MinION

Senol Cali+, "Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions," Briefings in Bioinformatics, 2018. [Preliminary arxiv.org version]

#### SAFARI

## Personalized Robotics



# Challenge and Opportunity for Future

# **Personalized and Private**

(in every aspect of life: health, medicine, spaces, devices, robotics, ...)

# What Limits Us in Computing Today?

Questioning what limits us in designing the best computing architectures for the future

Providing directions for fundamentally better designs

Advocating principled approaches

# Increasingly Demanding Applications

# Dream...

# and, they will come

As applications push boundaries, computing platforms become increasingly strained



Key Realization



# Modern Systems are Bottlenecked by

# Data Storage and Movement



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

# Memory & Storage

# Why Is Memory So Important? (Especially Today)

# Importance of Main Memory

Performance Perspective

Energy Perspective

 Scaling & Robustness (Reliability/Security/Safety) Perspective

Trends/Challenges/Opportunities in Main Memory

# Perils of Processor-Centric Design



#### Most of the system is dedicated to storing and moving data

### **SAFARI** Yet, system is still bottlenecked by memory



# Computing is Bottlenecked by Data



# Data is Key for AI, ML, Genomics, ...

Important workloads are all data intensive

 They require rapid and efficient processing of large amounts of data

- Data is increasing
  - We can generate more than we can process
  - We need to perform more sophisticated analyses on more data

# Memory Is Critical for Performance (I)



## **In-memory Databases**

[Mao+, EuroSys'12; Clapp+ (**Intel**), IISWC'15]



## **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Graph/Tree Processing** [Xu+, IISWC'12; Umuroglu+, FPL'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

# Memory Is Critical for Performance (I)





#### **In-memory Databases**

**Graph/Tree Processing** 

## Memory → bottleneck



## In-Memory Data Analytics

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

# Memory Is Critical for Performance (II)



Chrome

**Google's web browser** 



## **TensorFlow Mobile**

Google's machine learning framework



**Google's video codec** 



# Memory Is Critical for Performance (II)





## Chrome

## **TensorFlow Mobile**

## Memory → bottleneck



**Google's video codec** 







## Memory → bottleneck

| reference. | THATCOCHICCATOACOCAO |  |
|------------|----------------------|--|
| read1:     | ATCGCATCC            |  |
| read2:     | TATCGCATC            |  |
| read3:     | CATCCATGA            |  |
| read4:     | CGCTTCCAT            |  |
| read5:     | CCATGACGC            |  |
| read6:     | TTCCATGAC            |  |

## 3 Variant Calling



## **Scientific Discovery 4**

# New Genome Sequencing Technologies

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali 🖾, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017 Published: 02 April 2018 Article history ▼



Oxford Nanopore MinION

Senol Cali+, "Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions," Briefings in Bioinformatics, 2018. [Open arxiv.org version]

# New Genome Sequencing Technologies

## Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali 🖾, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017Published:02 April 2018Article history ▼



Oxford Nanopore MinION

## Memory → bottleneck

# State of the Main Memory System

- Recent technology, architecture, and application trends
  - lead to new requirements
  - exacerbate old requirements
- DRAM and memory controllers, as we know them today, are (will be) unlikely to satisfy all requirements
- Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging
- We need to rethink the main memory system
   to fix DRAM issues and enable emerging technologies
   to satisfy all requirements
  - to satisfy all requirements

# Major Trends Affecting Main Memory (I)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern

## DRAM technology scaling is ending

# Major Trends Affecting Main Memory (II)

- Need for main memory capacity, bandwidth, QoS increasing
  - Data-intensive applications: increasing demand/hunger for data
  - Multi-core: increasing number of cores/agents
  - Consolidation: cloud computing, GPUs, mobile, heterogeneity

• Main memory energy/power is a key system design concern

DRAM technology scaling is ending

# Consequence: The Memory Capacity Gap

Core count doubling ~ every 2 years DRAM DIMM capacity doubling ~ every 3 years



*Memory capacity per core* expected to drop by 30% every two years
Trends worse for *memory bandwidth per core!*

# DRAM Capacity, Bandwidth & Latency



# Memory Capacity Has Improved Greatly



https://arxiv.org/pdf/2204.10378

# Memory Latency Lags Behind



https://arxiv.org/pdf/2204.10378

# Memory Latency Lags Behind



https://arxiv.org/pdf/2204.10378

# Memory Is Critical for Performance



## **In-memory Databases**

[Mao+, EuroSys'12; Clapp+ (**Intel**), IISWC'15]



## **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Graph/Tree Processing** [Xu+, IISWC'12; Umuroglu+, FPL'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

## Memory Is Critical for Performance





#### **In-memory Databases**

### **Graph/Tree Processing**

# Memory → performance bottleneck



## **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

# Memory Is Critical for Performance



Chrome

**Google's web browser** 



## **TensorFlow Mobile**

Google's machine learning framework



**Google's video codec** 



#### Memory Is Critical for Performance



#### Memory → performance bottleneck



**Google's video codec** 



# It's the Memory, Stupid!

#### "It's the Memory, Stupid!" (Richard Sites, MPR, 1996)

#### **RICHARD SITES**

#### It's the Memory, Stupid!

When we started the Alpha architecture design in 1988, we estimated a 25-year lifetime and a relatively modest 32% per year compounded performance improvement of implementations over that lifetime (1,000× total). We guestimated about 10× would come from CPU clock improvement, 10× from multiple instruction issue, and 10× from multiple processors.

#### 5, 1996 MICROPROCESSOR REPORT

I expect that over the coming decade memory subsystem design will be the *only* important design issue for microprocessors.

#### The Performance Perspective



Mutlu+, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," HPCA 2003.

# The Performance Perspective

 Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt,
 "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors"
 Proceedings of the <u>9th International Symposium on High-Performance Computer</u> <u>Architecture</u> (HPCA), pages 129-140, Anaheim, CA, February 2003. <u>Slides (pdf)</u>
 One of the 15 computer arch. papers of 2003 selected as Top Picks by IEEE Micro. HPCA Test of Time Award (awarded in 2021).

#### **Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors**

Onur Mutlu § Jared Stark † Chris Wilkerson ‡ Yale N. Patt §

§ECE Department
The University of Texas at Austin
{onur,patt}@ece.utexas.edu

†Microprocessor Research Intel Labs jared.w.stark@intel.com

‡Desktop Platforms Group Intel Corporation chris.wilkerson@intel.com

# The Memory Bottleneck

 Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt, "Runahead Execution: An Effective Alternative to Large <u>Instruction Windows"</u>

 <u>Instruction Windows</u>

*IEEE Micro, Special Issue: Micro's Top Picks from Microarchitecture Conferences* (*MICRO TOP PICKS*), Vol. 23, No. 6, pages 20-25, November/December 2003.

# RUNAHEAD EXECUTION: AN EFFECTIVE ALTERNATIVE TO LARGE INSTRUCTION WINDOWS

### More on Runahead Execution (I)



EDIT VIDEO

ANALYTICS

### More on Runahead Execution (II)

#### Runahead Execution in NVIDIA Denver

Reducing the effects of long cache-miss penalties has been a major focus of the microarchitecture, using techniques like prefetching and run-ahead. An aggressive hardware prefetcher implementation detects L2 cache requests and tracks up to 32 streams, each with complex stride patterns.

Run-ahead uses the idle time that a CPU spends waiting on a long latency operation to discover cache and DTLB misses further down the instruction stream and generates prefetch requests for these misses.<sup>1</sup> These prefetch requests warm up the data cache and DTLB well before the actual execution of the instructions that require the data. Runahead complements the hardware prefetcher because it's better at prefetching nonstrided streams, and it trains the hardware prefetcher faster than normal execution to yield a combined benefit of 13 percent on SPECint2000 and up to 60 percent on SPECfp2000.

Boggs+, "Denver: NVIDIA's First 64-Bit ARM Processor,"

Gwennap, "NVIDIA's First CPU is a Winner," MPR 2014.

The core includes a hardware prefetch unit that Boggs describes as "aggressive" in preloading the data cache but less aggressive in preloading the instruction cache. It also implements a "run-ahead" feature that continues to execute microcode speculatively after a data-cache miss; this execution can trigger additional cache misses that resolve in the shadow of the first miss. Once the data from the original miss returns, the results of this speculative execution are discarded and execution restarts with the bundle containing the original miss, but run-ahead can preload subsequent data into the cache, thus avoiding a string of time-wasting cache misses. These and other features help Denver outscore Cortex-A15 by more than 2.6x on a memory-read test even when both use the same SoC framework (Tegra K1).





1,162 views • Premiered Mar 6, 2021

IEEE Micro 2015.



▶ 💦 6:18 / 14:27

EDIT VIDEO

📼 🧬 🖬 🖂 []

A SHARE =+ SAVE

ANALYTICS

50

### The Performance Perspective (Today)

All of Google's Data Center Workloads (2015):



Kanev+, "Profiling a Warehouse-Scale Computer," ISCA 2015.

#### The Memory Bottleneck

#### All of Google's Data Center Workloads (2015):



#### Figure 11: Half of cycles are spent stalled on caches.

### Memory in a Modern System



AMD Barcelona, 2006

#### A Large Fraction of Modern Systems is Memory



#### Intel Pentium Pro, 1995

#### A Large Fraction of Modern Systems is Memory



https://download.intel.com/newsroom/kits/40thanniversary/gallery/images/Pentium 4 6xx-die.jpg

Intel Pentium 4, 2000<sup>156</sup>



Apple M1, 2021

SAFARI

Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested



Core Count: 8 cores/16 threads

L1 Caches: 32 KB per core

L2 Caches: 512 KB per core

L3 Cache: 32 MB shared

#### AMD Ryzen 5000, 2020



IBM POWER10, 2020

Cores: 15-16 cores, 8 threads/core

L2 Caches: 2 MB per core

L3 Cache: 120 MB shared



#### Cores:

128 Streaming Multiprocessors

L1 Cache or Scratchpad: 192KB per SM Can be used as L1 Cache and/or Scratchpad

L2 Cache: 40 MB shared

### AMD's 3D Last Level Cache (2021)



https://community.microcenter.com/discussion/5 134/comparing-zen-3-to-zen-2 AMD increases the L3 size of their 8-core Zen 3 processors from 32 MB to 96 MB

Additional 64 MB L3 cache die stacked on top of the processor die

- Connected using Through Silicon Vias (TSVs)
- Total of 96 MB L3 cache



## Deeper and Larger Memory Hierarchies



#### Apple M1 Ultra System (2022)

#### Memory System: Most of the Platform



#### Most of the system is dedicated to storing and moving data

**SAFARI** Yet, system is still bottlenecked by memory

### Major Trends Affecting Main Memory (III)

Need for main memory capacity, bandwidth, QoS increasing

- Main memory energy/power is a key system design concern
  - ~40-50% energy spent in off-chip memory hierarchy [Lefurgy, IEEE Computer'03] >40% power in DRAM [Ware, HPCA'10][Paul,ISCA'15]
  - DRAM consumes power even when not used (periodic refresh)
- DRAM technology scaling is ending

# Data Movement vs. Computation Energy



### A memory access consumes ~100-1000X the energy of a complex addition

### Data Movement vs. Computation Energy



#### Data Movement vs. Computation Energy



A memory access consumes 6400X the energy of a simple integer addition

#### Energy Waste in Mobile Devices

 Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the <u>23rd International Conference on Architectural Support for Programming</u> <u>Languages and Operating Systems</u> (ASPLOS), Williamsburg, VA, USA, March 2018.

# 62.7% of the total system energy is spent on data movement

#### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand<sup>1</sup>Saugata Ghose<sup>1</sup>Youngsok Kim<sup>2</sup>Rachata Ausavarungnirun<sup>1</sup>Eric Shiu<sup>3</sup>Rahul Thakur<sup>3</sup>Daehyun Kim<sup>4,3</sup>Aki Kuusela<sup>3</sup>Allan Knies<sup>3</sup>Parthasarathy Ranganathan<sup>3</sup>Onur Mutlu<sup>5,1</sup>168

# Memory is Critical for Energy

 Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu,
 "Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks"
 Proceedings of the <u>30th International Conference on Parallel Architectures and Compilation</u> <u>Techniques</u> (PACT), Virtual, September 2021.
 [Slides (pptx) (pdf)]
 [Talk Video (14 minutes)]

#### > 90% of the total system energy is spent on memory in large ML models

#### **Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks**

Amirali Boroumand\*Saugata Ghose\*Berkin Akin\*Ravi Narayanaswami\*Geraldo F. Oliveira\*Xiaoyu Ma\*Eric Shiu\*Onur Mutlu\*\*

<sup>†</sup>Carnegie Mellon Univ. <sup>°</sup>Stanford Univ. <sup>‡</sup>Univ. of Illinois Urbana-Champaign <sup>§</sup>Google <sup>\*</sup>ETH Zürich

# **Example Energy Breakdowns**



In LSTMs and Transducers used by Google, >90% energy spent on off-chip interconnect and DRAM

https://arxiv.org/pdf/2109.14320

SAFARI

#### We Do Not Want to Move Data!



### A memory access consumes ~100-1000X the energy of a complex addition

### We Need A **Paradigm Shift** To ...

Enable computation with minimal data movement

Compute where it makes sense (where data resides)

Make computing architectures more data-centric

#### Process Data Where It Makes Sense



**SAFARI** 

https://www.gsmarena.com/apple\_announces\_m1\_ultra\_with\_20core\_cpu\_and\_64core\_gpu-news-53481.php

# Goal: Processing Inside Memory/Storage



# Major Trends Affecting Main Memory (IV)

Need for main memory capacity, bandwidth, QoS increasing

#### Main memory energy/power is a key system design concern

#### DRAM technology scaling is ending

- ITRS projects DRAM will not scale easily below X nm
- Scaling has provided many benefits:
  - higher capacity (density), lower cost, lower energy
- Difficiulties in scaling create robustness problems

#### SAFARI

#### An "Early" Position Paper [IMW'13]

 Onur Mutlu,
 <u>"Memory Scaling: A Systems Architecture Perspective"</u> *Proceedings of the <u>5th International Memory</u> <i>Workshop (IMW)*, Monterey, CA, May 2013. <u>Slides</u> (pptx) (pdf)
 <u>EETimes Reprint</u>

#### Memory Scaling: A Systems Architecture Perspective

Onur Mutlu Carnegie Mellon University onur@cmu.edu http://users.ece.cmu.edu/~omutlu/

#### https://people.inf.ethz.ch/omutlu/pub/memory-scaling\_memcon13.pdf

# The DRAM Scaling Problem

- DRAM stores charge in a capacitor (charge-based memory)
  - Capacitor must be large enough for reliable sensing
  - Access transistor should be large enough for low leakage and high retention time
  - Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]



DRAM capacity, cost, and energy/power hard to scale

#### SAFARI

# The DRAM Scaling Problem

- DRAM stores charge in a capacitor (charge-based memory)
  - Capacitor must be large enough for reliable sensing
  - Access transistor should be large enough for low leakage and high retention time
  - Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]



DRAM capacity, cost, and energy/power hard to scale

SAFARI

#### http://in4.iue.tuwien.ac.at/pdfs/sispad2021/P03.pdf

### Limits of Charge Memory

- Difficult charge placement and control
  - Flash: floating gate charge
  - DRAM: capacitor charge, transistor leakage
- Data retention and reliable sensing becomes difficult as charge storage unit size reduces



### As Memory Scales, It Becomes Unreliable

- Data from all of Facebook's servers worldwide
- Meza+, "Revisiting Memory Errors in Large-Scale Production Data Centers," DSN'15.



Chip density (Gb)

# Large-Scale Failure Analysis of DRAM Chips

- Analysis and modeling of memory errors found in all of Facebook's server fleet
- Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field" Proceedings of the <u>45th Annual IEEE/IFIP International Conference on</u> Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015. [Slides (pptx) (pdf)] [DRAM Error Model]

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Justin Meza Qiang Wu\* Sanjeev Kumar\* Onur Mutlu

Carnegie Mellon University \* Facebook, Inc.

# Infrastructures to Understand Such Issues



Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015) An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)

The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)



#### SAFARI

## Infrastructures to Understand Such Issues



#### **SAFARI**

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

# SoftMC: Open Source DRAM Infrastructure

 Hasan Hassan et al., "<u>SoftMC: A</u> <u>Flexible and Practical Open-</u> <u>Source Infrastructure for</u> <u>Enabling Experimental DRAM</u> <u>Studies</u>," HPCA 2017.

- Flexible
- Easy to Use (C++ API)
- Open-source

github.com/CMU-SAFARI/SoftMC



# SoftMC: Open Source DRAM Infrastructure

 Hasan Hassan, Nandita Vijaykumar, Samira Khan, Saugata Ghose, Kevin Chang, Gennady Pekhimenko, Donghyuk Lee, Oguz Ergin, and Onur Mutlu, "SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies" Proceedings of the 23rd International Symposium on High-Performance Computer Architecture (HPCA), Austin, TX, USA, February 2017.
 [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Full Talk Lecture (39 minutes)]

#### SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies

Hasan Hassan<sup>1,2,3</sup> Nandita Vijaykumar<sup>3</sup> Samira Khan<sup>4,3</sup> Saugata Ghose<sup>3</sup> Kevin Chang<sup>3</sup> Gennady Pekhimenko<sup>5,3</sup> Donghyuk Lee<sup>6,3</sup> Oguz Ergin<sup>2</sup> Onur Mutlu<sup>1,3</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>TOBB University of Economics & Technology <sup>3</sup>Carnegie Mellon University <sup>4</sup>University of Virginia <sup>5</sup>Microsoft Research <sup>6</sup>NVIDIA Research

#### ARI https://github.com/CMU-SAFARI/SoftMC

## DRAM Bender

 Ataberk Olgun, Hasan Hassan, A Giray Yağlıkçı, Yahya Can Tuğrul, Lois Orosa, Haocong Luo, Minesh Patel, Oğuz Ergin, and Onur Mutlu, "DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips" *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)*, 2023.
 [Extended arXiv version]
 [DRAM Bender Source Code]
 [DRAM Bender Tutorial Video (43 minutes)]

#### DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips

Ataberk Olgun<sup>§</sup> Hasan Hassan<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup> Yahya Can Tuğrul<sup>§†</sup> Lois Orosa<sup>§</sup><sup>•</sup> Haocong Luo<sup>§</sup> Minesh Patel<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup> <sup>§</sup>ETH Zürich <sup>†</sup>TOBB ETÜ <sup>•</sup>Galician Supercomputing Center

#### SAFARI https://github.com/CMU-SAFARI/DRAM-Bender

# **DRAM Bender: Prototypes**

| Testing Infrastructure            | Protocol Support | FPGA Support    |
|-----------------------------------|------------------|-----------------|
| SoftMC [134]                      | DDR3             | One Prototype   |
| LiteX RowHammer Tester (LRT) [17] | DDR3/4, LPDDR4   | Two Prototypes  |
| DRAM Bender (this work)           | DDR3/DDR4        | Five Prototypes |

#### Five out of the box FPGA-based prototypes



SAFARI



187

https://github.com/CMU-SAFARI/DRAM-Bender

# Data Retention in Memory [Liu et al., ISCA 2013]

Retention Time Profile of DRAM looks like this:

# 64-128ms >256ms **Location** dependent 128-256ms Stored value pattern dependent Time dependent

**SAFARI** Liu+, "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.

# RAIDR: Heterogeneous Refresh [ISCA'12]

 Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu, "RAIDR: Retention-Aware Intelligent DRAM Refresh" Proceedings of the <u>39th International Symposium on Computer</u> <u>Architecture</u> (ISCA), Portland, OR, June 2012. <u>Slides (pdf)</u> [Invited Retrospective at 50 Years of ISCA, 2023 (pdf)] Selected to the ISCA-50 25-Year Retrospective Issue covering 1996-2020 in 2023 (Retrospective (pdf) Full Issue).

#### **RAIDR: Retention-Aware Intelligent DRAM Refresh**

Jamie Liu Ben Jaiyen Richard Veras Onur Mutlu Carnegie Mellon University

# Analysis of Data Retention Failures [ISCA'13]

 Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu,
 <u>"An Experimental Study of Data Retention Behavior in Modern DRAM Devices:</u> <u>Implications for Retention Time Profiling Mechanisms"</u>

Proceedings of the <u>40th International Symposium on Computer Architecture</u> (**ISCA**), Tel-Aviv, Israel, June 2013. <u>Slides (ppt)</u> <u>Slides (pdf)</u> [Invited Retrospective at 50 Years of ISCA, 2023 (pdf)] Selected to the ISCA-50 25-Year Retrospective Issue covering 1996-2020 in 2023 (Retrospective (pdf) Full Issue).

#### An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms

Jamie Liu<sup>\*</sup> Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213 jamiel@alumni.cmu.edu Ben Jaiyen<sup>\*</sup> Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213 bjaiyen@alumni.cmu.edu

Yoongu Kim Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213 yoonguk@ece.cmu.edu

Chris Wilkerson Intel Corporation 2200 Mission College Blvd. Santa Clara, CA 95054 chris.wilkerson@intel.com

Onur Mutlu Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213 onur@cmu.edu

# A Curious Phenomenon

## A Curious Phenomenon [Kim et al., ISCA 2014]

# One can predictably induce errors in DRAM memory chips

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

# An Example: The RowHammer Problem

- One can predictably induce bit flips in commodity DRAM chips
   All recent DRAM chips are fundamentally vulnerable
- First example of how a simple hardware failure mechanism can create a widespread system security vulnerability



# First RowHammer Analysis

Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors" Proceedings of the <u>41st International Symposium on Computer Architecture</u> (**ISCA**), Minneapolis, MN, June 2014. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Source Code and Data] [Lecture Video (1 hr 49 mins), 25 September 2020] One of the 7 papers of 2012-2017 selected as Top Picks in Hardware and Embedded Security for IEEE TCAD (link). Selected to the ISCA-50 25-Year Retrospective Issue covering 1996-2020 in 2023 (Retrospective (pdf) Full Issue). Winner of the 2024 IFIP Jean-Claude Laprie Award in dependable computing (link).

## Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Yoongu Kim<sup>1</sup> Ross Daly<sup>\*</sup> Jeremie Kim<sup>1</sup> Chris Fallin<sup>\*</sup> Ji Hye Lee<sup>1</sup> Donghyuk Lee<sup>1</sup> Chris Wilkerson<sup>2</sup> Konrad Lai Onur Mutlu<sup>1</sup> <sup>1</sup>Carnegie Mellon University <sup>2</sup>Intel Labs



# The Robustness Perspective (I)

# Onur Mutlu, "The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser" Invited Paper in Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Lausanne, Switzerland, March 2017. [Slides (pptx) (pdf)]

#### The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser

Onur Mutlu ETH Zürich onur.mutlu@inf.ethz.ch https://people.inf.ethz.ch/omutlu

SAFARI https://people.inf.ethz.ch/omutlu/pub/rowhammer-and-other-memory-issues date17.pdf 196

# The Robustness Perspective (II)

Onur Mutlu and Jeremie Kim,
 "RowHammer: A Retrospective"
 *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* (TCAD) Special Issue on Top Picks in Hardware and
 *Embedded Security*, 2019.
 [Preliminary arXiv version]
 [Slides from COSADE 2019 (pptx)]
 [Slides from VLSI-SOC 2020 (pptx) (pdf)]
 [Talk Video (1 hr 15 minutes, with Q&A)]

# RowHammer: A Retrospective

Onur Mutlu<sup>§‡</sup> Jeremie S. Kim<sup>‡§</sup> <sup>§</sup>ETH Zürich <sup>‡</sup>Carnegie Mellon University

**SAFARI** 

https://arxiv.org/pdf/1904.09724.pdf

# Major Trends Affecting Main Memory (V)

- DRAM scaling has already become very difficult
  - Increasing cell leakage current, reduced cell reliability, increasing manufacturing difficulties [Kim+ ISCA 2014], [Liu+ ISCA 2013], [Mutlu IMW 2013], [Mutlu DATE 2017]
  - Difficult to significantly improve capacity, energy

#### Emerging memory technologies are promising

# Major Trends Affecting Main Memory (V)

- DRAM scaling has already become very difficult
  - Increasing cell leakage current, reduced cell reliability, increasing manufacturing difficulties [Kim+ ISCA 2014], [Liu+ ISCA 2013], [Mutlu IMW 2013], [Mutlu DATE 2017]
  - Difficult to significantly improve capacity, energy

#### Emerging memory technologies are promising

| 3D-Stacked DRAM                                                       | higher bandwidth | smaller capacity                                          |
|-----------------------------------------------------------------------|------------------|-----------------------------------------------------------|
| <b>Reduced-Latency DRAM</b><br>(e.g., RL/TL-DRAM, FLY-RAM)            | lower latency    | higher cost                                               |
| Low-Power DRAM<br>(e.g., LPDDR3, LPDDR4, Voltron)                     | lower power      | higher latency<br>higher cost                             |
| Non-Volatile Memory (NVM)<br>(e.g., PCM, STTRAM, ReRAM, 3D<br>Xpoint) | larger capacity  | higher latency<br>higher dynamic power<br>lower endurance |

# Major Trend: Hybrid Main Memory



#### Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012. Yoon+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

#### SAFARI



# Main Memory Needs Intelligent Controllers

# Industry Is Writing Papers About It, Too

#### **DRAM Process Scaling Challenges**

#### Refresh

- · Difficult to build high-aspect ratio cell capacitors decreasing cell capacitance
- · Leakage current of cell access transistors increasing

#### ✤ tWR

- · Contact resistance between the cell capacitor and access transistor increasing
- · On-current of the cell access transistor decreasing
- · Bit-line resistance increasing

#### VRT

· Occurring more frequently with cell capacitance decreasing



202

# Call for Intelligent Memory Controllers

#### **DRAM Process Scaling Challenges**

#### \* Refresh

Difficult to build high-aspect ratio cell capacitors decreasing cell capacitance
THE MEMORY FORUM 2014

# Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling

Uksong Kang, Hak-soo Yu, Churoo Park, \*Hongzhong Zheng, \*\*John Halbert, \*\*Kuljit Bains, SeongJin Jang, and Joo Sun Choi



Samsung Electronics, Hwasung, Korea / \*Samsung Electronics, San Jose / \*\*Intel

# An Orthogonal Issue: Memory Interference



Cores' interfere with each other when accessing shared main memory Uncontrolled interference leads to many problems (QoS, performance)

### Goal: Predictable Performance in Complex Systems



- Heterogeneous agents: CPUs, GPUs, and HWAs
- Main memory interference between CPUs, GPUs, HWAs

How to allocate resources to heterogeneous agents to mitigate interference and provide predictable performance?

#### Many goals, many constraints, many metrics ... <sup>205</sup>



# Memory Controllers are critical to research

# They will become even more important

SAFARI

## Memory Control w/ Machine Learning [ISCA'08]

 Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana,
 "Self Optimizing Memory Controllers: A Reinforcement Learning Approach"
 Proceedings of the <u>35th International Symposium on Computer Architecture</u> (ISCA), pages 39-50, Beijing, China, June 2008. <u>Slides (pptx)</u>

#### Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

Engin İpek<sup>1,2</sup> Onur Mutlu<sup>2</sup> José F. Martínez<sup>1</sup> Rich Caruana<sup>1</sup>

<sup>1</sup>Cornell University, Ithaca, NY 14850 USA

<sup>2</sup> Microsoft Research, Redmond, WA 98052 USA

# Solving the Memory Problem

# How Do We Solve The Memory Problem?

- Fix it: Make memory and controllers more intelligent
   New interfaces, functions, architectures: system-mem codesign
- Eliminate or minimize it: Replace or (more likely) augment DRAM with a different technology
  - New technologies and system-wide rethinking of memory & storage
- Embrace it: Design heterogeneous memories (none of which are perfect) and map data intelligently across them
   New models for data management and maybe usage

# How Do We Solve The Memory Problem?

- Fix it: Make memory and controllers more intelligent
   New interfaces, functions, architectures: system-mem codesign
- Eliminate or minimize it: Replace or (more likely) augment DRAM with a different technology
  - New technologies and system-wide rethinking of memory & storage
- Embrace it: Design heterogeneous memories (none of which are perfect) and map data intelligently across them
  - New models for data management and maybe usage

# Solutions (to memory scaling) require software/hardware/device cooperation

# How Do We Solve The Memory Problem?



software/hardware/device cooperation

# Solution 1: New Memory Architectures

- Overcome memory shortcomings with
  - Memory-centric system design
  - Novel memory architectures, interfaces, functions
  - Better waste management (efficient utilization)
- Key issues to tackle
  - Enable reliability at low cost  $\rightarrow$  high capacity
  - Reduce energy
  - Reduce latency
  - Improve bandwidth
  - Reduce waste (capacity, bandwidth, latency)
  - Enable computation close to data

# Solution 2: Emerging Memory Technologies

- Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile)
- Example: Phase Change Memory
  - Data stored by changing phase of material
  - Data read by detecting material's resistance
  - Expected to scale to 9nm (2022 [ITRS 2009])
  - Prototyped at 20nm (Raoux+, IBM JRD 2008)
  - Expected to be denser than DRAM: can store multiple bits/cell
- But, emerging technologies have (many) shortcomings
   Can they be enabled to replace/augment/surpass DRAM?



PCM

BL

SENSE

# Solution 2: Emerging Memory Technologies

- Lee+, "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA'09, CACM'10, IEEE Micro'10.
- Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters 2012.
- Yoon, Meza+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012.
- Kultursay+, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," ISPASS 2013.
- Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.
- Lu+, "Loose Ordering Consistency for Persistent Memory," ICCD 2014.
- Zhao+, "FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems," MICRO 2014.
- Yoon, Meza+, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories," TACO 2014.
- Ren+, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015.
- Chauhan+, "NVMove: Helping Programmers Move to Byte-Based Persistence," INFLOW 2016.
- Li+, "Utility-Based Hybrid Memory Management," CLUSTER 2017.
- Yu+, "Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation," MICRO 2017.
- Tavakkol+, "MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices," FAST 2018.
- Tavakkol+, "FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives," ISCA 2018.
- Sadrosadati+. "LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching," ASPLOS 2018.
- Salkhordeh+, "An Analytical Model for Performance and Lifetime Estimation of Hybrid DRAM-NVM Main Memories," TC 2019.
- Wang+, "Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories," PLDI 2019.
- Song+, "Enabling and Exploiting Partition-Level Parallelism (PALP) in Phase Change Memories," CASES 2019.
- Liu+, "Binary Star: Coordinated Reliability in Heterogeneous Memory Systems for High Performance and Scalability," MICRO'19.
- Song+, "Improving Phase Change Memory Performance with Data Content Aware Access," ISMM 2020.
- Yavits+, "WoLFRaM: Enhancing Wear-Leveling and Fault Tolerance in Resistive Memories using Programmable Address Decoders," ICCD 2020.
- Song+, "Aging-Aware Request Scheduling for Non-Volatile Main Memory," ASP-DAC 2021.

#### SAFARI

# Combination: Hybrid Memory Systems



#### Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012. Yoon, Meza et al., "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

#### **SAFARI**

# Exploiting Memory Error Tolerance with Hybrid Memory Systems



On Microsoft's Web Search workload Reduces server hardware cost by 4.7 % Achieves single server availability target of 99.90 % Heterogeneous-Reliability Memory [DSN 2014]

# Heterogeneous-Reliability Memory



# **Evaluation Results**



### Bigger area means better tradeoff

# More on Heterogeneous Reliability Memory

Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, and Onur Mutlu,
 "Characterizing Application Memory Error Vulnerability to Optimize
 Data Center Cost via Heterogeneous-Reliability Memory"
 Proceedings of the <u>44th Annual IEEE/IFIP International Conference on</u>
 Dependable Systems and Networks (DSN), Atlanta, GA, June 2014. [Summary]
 [Slides (pptx) (pdf)] [Coverage on ZDNet]

#### Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Yixin Luo Sriram Govindan<sup>\*</sup> Bikash Sharma<sup>\*</sup> Mark Santaniello<sup>\*</sup> Justin Meza Aman Kansal<sup>\*</sup> Jie Liu<sup>\*</sup> Badriddine Khessib<sup>\*</sup> Kushagra Vaid<sup>\*</sup> Onur Mutlu Carnegie Mellon University, yixinluo@cs.cmu.edu, {meza, onur}@cmu.edu \*Microsoft Corporation, {srgovin, bsharma, marksan, kansal, jie.liu, bkhessib, kvaid}@microsoft.com

## HRM is an Example of Our Axiom

To achieve the highest energy efficiency and performance:

## we must take the expanded view

of computer architecture



#### SAFARI

## Another Example: EDEN for DNNs

- Deep Neural Network evaluation is very DRAM-intensive (especially for large networks)
- 1. Some data and layers in DNNs are very tolerant to errors
- 2. Reduce DRAM latency and voltage on such data and layers

3. While still achieving a user-specified DNN accuracy target by making training DRAM-error-aware

### Data-aware management of DRAM latency and voltage for Deep Neural Network Inference

## **Example DNN Data Type to DRAM Mapping**

#### Mapping example of ResNet-50:



### Map more error-tolerant DNN layers to DRAM partitions with lower voltage/latency

4 DRAM partitions with different error rates

## **EDEN: Overview**

<u>Key idea</u>: Enable accurate, efficient DNN inference using approximate DRAM

### EDEN is an iterative process that has <u>3 key steps</u>



### **CPU: DRAM Energy Evaluation**



**Average 21% DRAM energy reduction** maintaining accuracy within 1% of original

## **CPU: Performance Evaluation**



Average 8% system speedup Some workloads achieve 17% speedup

EDEN achieves **close to the ideal** speedup possible via tRCD latency reduction

### **GPU, Eyeriss, and TPU: Energy Evaluation**

• **<u>GPU</u>**: average **37% energy reduction** 

**Everiss**: average **31% energy reduction** 

**TPU**: average **32% energy reduction** 

•

## EDEN: Data-Aware Efficient DNN Inference

 Skanda Koppula, Lois Orosa, A. Giray Yaglikci, Roknoddin Azizi, Taha Shahroodi, Konstantinos Kanellopoulos, and Onur Mutlu, "EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM" Proceedings of the 52nd International Symposium on Microarchitecture (MICRO), Columbus, OH, USA, October 2019.
 [Lightning Talk Slides (pptx) (pdf)]
 [Lightning Talk Video (90 seconds)]

### EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM

Skanda Koppula Lois Orosa A. Giray Yağlıkçı Roknoddin Azizi Taha Shahroodi Konstantinos Kanellopoulos Onur Mutlu ETH Zürich

# Memory Systems and Memory-Centric Computing Topic 1: Trends, Challenges, Opportunities

Onur Mutlu <u>omutlu@gmail.com</u> <u>https://people.inf.ethz.ch/omutlu</u> 15 July 2024 HiPEAC ACACES Summer School 2024



**ETH** zürich