# Future of Computer Architecture and Hardware Security

Onur Mutlu

omutlu@gmail.com

https://people.inf.ethz.ch/omutlu

6 March 2025

University of Southern California





### Agenda

- Computer Architecture Today
  - What it is and where it is going
- Three Major Hardware Issues That Affect Security
  - Technology scaling problems
  - Growing system complexity; old methods not keeping up
  - New architectures and technologies

# Why Do We Do Computing?

## To Solve Problems

# To Gain Insight

# To Enable a Better Life & Future

# How Does a Computer Solve Problems?

# Orchestrating Electrons

In today's dominant technologies

# How Do Problems Get Solved by Electrons?

#### The Transformation Hierarchy

Computer Architecture (expanded view)



Computer Architecture (narrow view)

#### Computer Architecture

- is the science and art of designing computing platforms (hardware, interface, system SW, and programming model)
- to achieve a set of design goals
  - E.g., highest performance on earth on workloads X, Y, Z
  - E.g., longest battery life at a form factor that fits in your pocket with cost < \$\$\$ CHF
  - E.g., best average performance across all known workloads at the best performance/cost ratio
  - ...
- Designing a supercomputer is different from designing a smartphone → But, many fundamental principles are similar











Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu, "Accelerating Genome Analysis: A Primer on an Ongoing Journey," IEEE Micro, August 2020.



### An Example System in Your Pocket



Apple M1 Ultra System (2022)











**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.



**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.



#### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3



1 ExaFLOPS per board

https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors,
   600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.





Tesla Dojo Chip & System

#### D1 Chip

362 TFLOPS BF16/CFP8
22.6 TFLOPS FP32

10 TB/s per direction on-chip bandwidth
4 TB/s per edge off-chip bandwidth

**400W TDP** 



645mm<sup>2</sup> 7nm Technology

**50 Billion** Transistors

11+ Miles Of Wires














NVIDIA is claiming a **7x improvement** in dynamic programming algorithm (**DPX instructions**) performance on a single H100 versus naïve execution on an A100.



#### Evolution of Recent GPUs (I)



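| GPU architecture | Transistors  | Die area               | Process       |
|------------------|--------------|------------------------|---------------|
| Volta            | >21 billion  | 815 mm<sup>2</sup>     | TSMC 12nm FFN |
| Ampere           | >54 billion  | 826 mm<sup>2</sup>     | TSMC N7       |
| Hopper           | >80 billion  | 814 mm<sup>2</sup>     | TSMC 4N       |
| Blackwell        | >208 billion | >1600 mm<sup>2</sup>   | TSMC 4NP      |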



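| GPU architecture | NVLink generation | NVLinks | Bandwidth per link | Lanes per link     |
|------------------|-------------------|---------|--------------------|--------------------|
| Ampere           | NVLink3           | 12      | 50 GB/s            | x4 @ 50 Gbps NRZ   |
| Hopper           | NVLink4           | 18      | 50 GB/s            | x2 @ 100 Gbps PAM4 |
| Blackwell        | NVLink5           | 18      | 100 GB/s           | x2 @ 200 Gbps PAM4 |

Ampere total: 600 GB/s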
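The per-link figures for the three NVLink generations follow from their lane configurations, which a few lines of arithmetic can check. The sketch below assumes that each quoted per-link number is bidirectional (both directions summed) and that every lane carries the stated signaling rate in each direction. The function names are illustrative, and only the 600 GB/s Ampere total appears on the slide.

```c
#include <assert.h>

/* Bidirectional bandwidth of one NVLink, in GB/s.
 * lanes: lanes per direction; gbps_per_lane: signaling rate per lane in Gb/s.
 * Assumption: the quoted per-link figure counts both directions. */
static double nvlink_link_gbytes(int lanes, double gbps_per_lane) {
    assert(lanes > 0 && gbps_per_lane > 0.0);
    double one_direction_gbits = lanes * gbps_per_lane;
    return 2.0 * one_direction_gbits / 8.0;   /* two directions, 8 bits per byte */
}

/* Aggregate bandwidth of one GPU across all of its links, in GB/s. */
static double nvlink_total_gbytes(int links, int lanes, double gbps_per_lane) {
    assert(links > 0);
    return links * nvlink_link_gbytes(lanes, gbps_per_lane);
}
```

Under these assumptions, `nvlink_total_gbytes(12, 4, 50.0)` reproduces the 600 GB/s Ampere total, and the same formula gives 900 GB/s for the Hopper row and 1800 GB/s for the Blackwell row.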

#### Multiple GPUs to Tackle Large Workloads

#### AI Models Growing Exponentially

Need for multi-GPU inference at scale



New Capabilities | Trillions of Parameters | 70,000X Growth in a Decade

#### Evolution of Recent GPUs (II)







**2016** Hybrid Cube Mesh NVLink technology

**2022** 3rd Gen NVLink Switch

All-to-all connection within an NVLink domain of 8 GPUs

**2024** 4th Gen NVLink Switch Chip

All-to-all connection within an NVLink domain of 72 GPUs

## Cerebras's Wafer Scale Engine (2019)



The largest ML accelerator chip

400,000 cores



#### **Cerebras WSE**

1.2 Trillion transistors 46,225 mm<sup>2</sup>

#### **Largest GPU**

21.1 Billion transistors 815 mm<sup>2</sup>

**NVIDIA** TITAN V

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

## Cerebras's Wafer Scale Engine-2 (2021)



 The largest ML accelerator chip (2021)

850,000 cores



#### **Cerebras WSE-2**

2.6 Trillion transistors 46,225 mm<sup>2</sup>

#### **Largest GPU**

54.2 Billion transistors 826 mm<sup>2</sup>

**NVIDIA** Ampere GA100

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

## Cerebras's Wafer Scale Engine-3 (2023)



#### **Cerebras Wafer-Scale Engine**

The largest chip ever produced

46,225 mm<sup>2</sup> silicon

4 trillion transistors

**900,000** AI cores

125 Petaflops of AI compute

**44 Gigabytes** of on-chip memory

21 PByte/s memory bandwidth

214 Pbit/s fabric bandwidth

5nm TSMC process

## Many (Other) AI/ML Chips (2021)



All information contained within this infographic is gathered from the internet and periodically updated, no guarantee is given that the information provided is correct, complete, and up-to-date.

#### Axiom

To achieve the highest efficiency, performance, robustness:

#### we must take the expanded view

of computer architecture



Co-design across the hierarchy:
Algorithms to devices

Specialize as much as possible within the design goals

# What Limits Us in Computing Today?

#### Increasingly Demanding Applications

Dream...

and, they will come

As applications push boundaries, computing platforms become increasingly strained

#### Many Metrics to Optimize for

- Performance
- Energy/Power
- Correctness
- Robustness (Safety, Security, Reliability, Availability)
- Cost
- Programming Ease
- Usability (Ease of Use)
- Scalability
- Simplicity (Complexity)
- Privacy
- **...**

Challenging especially with complex systems & hardware

#### Three Major Limiters to Computing

- Technology scaling is not going well
- System complexity is increasing; old methods not keeping up
- Processor-centric designs are not keeping up

- These affect all metrics we care about
- These have fundamental impact on security and how we build secure systems

## Technology Scaling

#### Technology Scaling Problems

- Reductions in circuit size and energy have enabled continuous innovation at all levels of the computing stack
- As circuits become smaller, they become less reliable
- Less reliable circuits are a problem for robust (reliable, safe, secure) operation
- If circuits produce wrong results, security can be affected (along with safety, reliability, availability)

#### How Reliable/Secure/Safe is This Bridge?



#### Collapse of the "Galloping Gertie"



#### Another View



#### How Secure Are These People?



Security is about preventing unforeseen consequences

#### How Safe & Secure Is **This** Platform?



#### How Robust Are **These** Platforms?











https://www.kennedyspacecenter.com/explore-attractions/nasa-now

https://www.cnet.com/pictures/nasas-wildest-rides-extreme-vehicles-for-earth-and-beyond/7/

#### Challenge and Opportunity for Future

## Robust (Reliable, Secure, Safe)

#### An Example: The RowHammer Problem

- One can predictably induce bit flips in commodity DRAM chips
  - All recent DRAM chips are fundamentally vulnerable
- First example of how a simple hardware failure mechanism can create a widespread system security vulnerability



Andy Greenberg, "Forget Software—Now Hackers Are Exploiting Physics," WIRED, 08.31.16.

#### A Curious Phenomenon [Kim et al., ISCA 2014]

# One can predictably induce errors in DRAM memory chips

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.



#### Modern Memory is Prone to Disturbance Errors



Repeatedly reading a row enough times (before memory gets refreshed) induces disturbance errors in adjacent rows in most real DRAM chips you can buy today

#### Recent DRAM Is More Vulnerable



All modules from 2012–2013 are vulnerable

#### Higher-Level Implications

This simple circuit-level failure mechanism has enormous implications for the upper layers of the transformation hierarchy

- **Problem**
- Algorithm
- Program/Language
- **Runtime System** (VM, OS, MM)
- ISA (Architecture)
- Microarchitecture
- Logic
- Devices
- Electrons









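```
loop:
  mov (X), %eax    # read address X (activates X's DRAM row)
  mov (Y), %ebx    # read address Y, which lies in another row
  clflush (X)      # evict X from the cache so the next read goes to DRAM
  clflush (Y)      # evict Y from the cache as well
  mfence           # order the flushes before the next iteration's reads
  jmp loop
```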









1. Avoid cache hits
   - Flush X from cache
2. Avoid *row hits* to X
   - Read Y in another row











#### Observed Errors in Real Systems

| CPU Architecture          | Errors | Access-Rate |
|---------------------------|--------|-------------|
| Intel Haswell (2013)      | 22.9K  | 12.3M/sec   |
| Intel Ivy Bridge (2012)   | 20.7K  | 11.7M/sec   |
| Intel Sandy Bridge (2011) | 16.1K  | 11.6M/sec   |
| AMD Piledriver (2012)     | 59     | 6.1M/sec    |

#### A real reliability, security, safety issue

#### One Can Take Over an Otherwise-Secure System

#### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Abstract. Memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology

### Project Zero

News and updates from the Project Zero team at Google

Monday, March 9, 2015

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)

#### Many RowHammer Security Exploits

- One can exploit RowHammer to
  - Take over a system
  - Read data they do not have access to
  - Break out of virtual machine sandboxes
  - Corrupt important data → render ML inference useless
  - Steal secret data (e.g., crypto keys & ML model parameters)



#### Security Implications



It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

#### Infrastructures to Understand Such Issues



- Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)
- Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)
- AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015)
- An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)
- The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)





#### SoftMC: Open Source DRAM Infrastructure

Hasan Hassan et al., "SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies," HPCA 2017.

- Flexible
- Easy to Use (C++ API)
- Open-source: github.com/CMU-SAFARI/SoftMC



#### SoftMC: Open Source DRAM Infrastructure

Hasan Hassan, Nandita Vijaykumar, Samira Khan, Saugata Ghose, Kevin Chang, Gennady Pekhimenko, Donghyuk Lee, Oguz Ergin, and Onur Mutlu, "SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies"
 Proceedings of the 23rd International Symposium on High-Performance Computer Architecture (HPCA), Austin, TX, USA, February 2017.

 [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]
 [Full Talk Lecture (39 minutes)]
 [Source Code]

## SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies

Hasan Hassan $^{1,2,3}$  Nandita Vijaykumar $^3$  Samira Khan $^{4,3}$  Saugata Ghose $^3$  Kevin Chang $^3$  Gennady Pekhimenko $^{5,3}$  Donghyuk Lee $^{6,3}$  Oguz Ergin $^2$  Onur Mutlu $^{1,3}$ 

<sup>1</sup>ETH Zürich <sup>2</sup>TOBB University of Economics & Technology <sup>3</sup>Carnegie Mellon University <sup>4</sup>University of Virginia <sup>5</sup>Microsoft Research <sup>6</sup>NVIDIA Research

#### DRAM Bender: New DRAM Infrastructure

Ataberk Olgun, Hasan Hassan, A Giray Yağlıkçı, Yahya Can Tuğrul, Lois Orosa,
Haocong Luo, Minesh Patel, Oğuz Ergin, and Onur Mutlu,
"DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure
to Easily Test State-of-the-art DRAM Chips"

<u>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</u> (**TCAD**), 2023.

[Extended arXiv version]

[DRAM Bender Source Code]

[DRAM Bender Tutorial Video (43 minutes)]

### DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips

Ataberk Olgun<sup>§</sup> Hasan Hassan<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup> Yahya Can Tuğrul<sup>§†</sup> Lois Orosa<sup>§⊙</sup> Haocong Luo<sup>§</sup> Minesh Patel<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup> <sup>§</sup>ETH Zürich <sup>†</sup>TOBB ETÜ <sup>⊙</sup>Galician Supercomputing Center

#### DRAM Bender: FPGA Prototypes

| Testing Infrastructure            | Protocol Support | FPGA Support    |
|-----------------------------------|------------------|-----------------|
| SoftMC [134]                      | DDR3             | One Prototype   |
| LiteX RowHammer Tester (LRT) [17] | DDR3/4, LPDDR4   | Two Prototypes  |
| DRAM Bender (this work)           | DDR3/DDR4        | Five Prototypes |

#### Five out-of-the-box FPGA-based prototypes











#### **HBM2 DRAM Testing Infrastructure**

DRAM Bender on a Bittware XUPVVH



Fine-grained control over DRAM commands, timing parameters (±1.67ns), and temperature (±0.5°C)



#### RowHammer [ISCA 2014]

Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu,

"Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors"

Proceedings of the <u>41st International Symposium on Computer Architecture</u> (**ISCA**), Minneapolis, MN, June 2014.

[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Source Code and Data] [Lecture Video (1 hr 49 mins), 25 September 2020]

One of the 7 papers of 2012-2017 selected as Top Picks in Hardware and Embedded Security for IEEE TCAD (<u>link</u>). Selected to the ISCA-50 25-Year Retrospective Issue covering 1996-2020 in 2023 (<u>Retrospective</u> (<u>pdf</u>) <u>Full Issue</u>). Winner of the 2024 IFIP Jean-Claude Laprie Award in dependable computing (link).

#### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Yoongu Kim<sup>1</sup> Ross Daly\* Jeremie Kim<sup>1</sup> Chris Fallin\* Ji Hye Lee<sup>1</sup> Donghyuk Lee<sup>1</sup> Chris Wilkerson<sup>2</sup> Konrad Lai Onur Mutlu<sup>1</sup>

<sup>1</sup>Carnegie Mellon University <sup>2</sup>Intel Labs

### Memory Scaling Issues Are Real

Onur Mutlu and Jeremie Kim,

"RowHammer: A Retrospective"

<u>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</u> (**TCAD**) Special Issue on Top Picks in Hardware and Embedded Security, 2019.

[Preliminary arXiv version]

[Slides from COSADE 2019 (pptx)]

[Slides from VLSI-SOC 2020 (pptx) (pdf)]

[Talk Video (1 hr 15 minutes, with Q&A)]

# RowHammer: A Retrospective

Onur Mutlu<sup>§‡</sup> Jeremie S. Kim<sup>‡§</sup> §ETH Zürich <sup>‡</sup>Carnegie Mellon University

### Memory Scaling Issues Are Real

Onur Mutlu, Ataberk Olgun, and A. Giray Yaglikci,
 "Fundamentally Understanding and Solving RowHammer"
 Invited Special Session Paper at the <u>28th Asia and South Pacific Design Automation Conference</u> (<u>ASP-DAC</u>), Tokyo, Japan, January 2023.
 [arXiv version]
 [Slides (pptx) (pdf)]
 [Talk Video (26 minutes)]

### Fundamentally Understanding and Solving RowHammer

Onur Mutlu (onur.mutlu@safari.ethz.ch), Ataberk Olgun (ataberk.olgun@safari.ethz.ch), A. Giray Yağlıkçı (giray.yaglikci@safari.ethz.ch)

ETH Zürich, Zürich, Switzerland

### A Recent PhD Thesis

A. Giray Yaglikci, "<u>Enabling Efficient and Scalable DRAM Read Disturbance Mitigation via New Experimental Insights into Modern DRAM Chips</u>," PhD Thesis, ETH Zürich, 2024.

[Slides (pdf) (pptx)]

[Thesis arXiv (abs) (pdf)]

[SAFARI News]

# ENABLING EFFICIENT AND SCALABLE DRAM READ DISTURBANCE MITIGATION VIA NEW EXPERIMENTAL INSIGHTS INTO MODERN DRAM CHIPS

ABDULLAH GİRAY YAĞLIKÇI

https://arxiv.org/pdf/2408.15044.pdf

# Main Memory Needs Intelligent Controllers

# Industry's Intelligent DRAM Controllers (I)

#### **ISSCC 2023 / SESSION 28 / HIGH-DENSITY MEMORIES /**

28.8 A 1.1V 16Gb DDR5 DRAM with Probabilistic-Aggressor Tracking, Refresh-Management Functionality, Per-Row Hammer Tracking, a Multi-Step Precharge, and Core-Bias Modulation for Security and Reliability Enhancement

Woongrae Kim, Chulmoon Jung, Seongnyuh Yoo, Duckhwa Hong, Jeongjin Hwang, Jungmin Yoon, Ohyong Jung, Joonwoo Choi, Sanga Hyun, Mankeun Kang, Sangho Lee, Dohong Kim, Sanghyun Ku, Donhyun Choi, Nogeun Joo, Sangwoo Yoon, Junseok Noh, Byeongyong Go, Cheolhoe Kim, Sunil Hwang, Mihyun Hwang, Seol-Min Yi, Hyungmin Kim, Sanghyuk Heo, Yeonsu Jang, Kyoungchul Jang, Shinho Chu, Yoonna Oh, Kwidong Kim, Junghyun Kim, Soohwan Kim, Jeongtae Hwang, Sangil Park, Junphyo Lee, Inchul Jeong, Joohwan Cho, Jonghwan Kim

SK hynix Semiconductor, Icheon, Korea



# Industry's Intelligent DRAM Controllers (II)

DRAM products have been recently adopted in a wide range of high-performance computing applications: such as in cloud computing, in big data systems, and IoT devices. This demand creates larger memory capacity requirements, thereby requiring aggressive DRAM technology node scaling to reduce the cost per bit [1,2]. However, DRAM manufacturers are facing technology scaling challenges due to row hammer and refresh retention time beyond 1a-nm [2]. Row hammer is a failure mechanism, where repeatedly activating a DRAM row disturbs data in adjacent rows. Scaling down severely threatens reliability since a reduction of DRAM cell size leads to a reduction in the intrinsic row hammer tolerance [2,3]. To improve row hammer tolerance, there is a need to probabilistically activate adjacent rows with carefully sampled active addresses and to improve intrinsic row hammer tolerance [2]. In this paper, row-hammer-protection and refresh-management schemes are presented to guarantee DRAM security and reliability despite the aggressive scaling from 1a-nm to sub 10-nm nodes. The probabilistic-aggressor-tracking scheme with a refresh-management function (RFM) and per-row hammer tracking (PRHT) improve DRAM resilience. A multi-step precharge reinforces intrinsic row-hammer tolerance and a core-bias modulation improves retention time: even in the face of cell-transistor degradation due to technology scaling. This comprehensive scheme leads to a reduced probability of failure, due to row hammer attacks, by 93.1% and an improvement in retention time by 17%.
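The abstract's idea of probabilistically activating adjacent rows can be illustrated with a small simulation. This is a minimal sketch of the general principle only; it is not SK hynix's circuit, and the function names, the random-number generator, and the parameters are all chosen for illustration. After each aggressor activation, the neighbors are refreshed with probability `p`, which bounds how long a victim row is likely to go unprotected.

```c
#include <assert.h>
#include <stdint.h>

/* Small linear congruential generator so the sketch is reproducible. */
static double next_uniform(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (double)(*state >> 8) / 16777216.0;   /* value in [0, 1) */
}

/* Hammers one aggressor row n_acts times. After each activation, the
 * neighbors are refreshed with probability p, which clears the disturbance
 * they have accumulated. Returns the largest number of activations a
 * victim endured between two refreshes. */
static unsigned long worst_case_exposure(unsigned long n_acts, double p,
                                         uint32_t seed) {
    assert(p >= 0.0 && p <= 1.0);
    unsigned long exposure = 0, worst = 0;
    uint32_t state = seed;
    for (unsigned long i = 0; i < n_acts; i++) {
        exposure++;
        if (exposure > worst) worst = exposure;
        if (next_uniform(&state) < p) exposure = 0;   /* neighbors refreshed */
    }
    return worst;
}
```

With `p = 0` the victim is exposed to every activation, and with `p = 1` it never sees more than one activation between refreshes. Intermediate values trade refresh overhead against the probability that a long unprotected run occurs.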

### Industry's Intelligent DRAM Controllers (III)




# Industry's Intelligent DRAM Controllers (IV)

# DSAC: Low-Cost Rowhammer Mitigation Using In-DRAM Stochastic and Approximate Counting Algorithm

Seungki Hong Dongha Kim Jaehyung Lee Reum Oh Changsik Yoo Sangjoon Hwang Jooyoung Lee

DRAM Design Team, Memory Division, Samsung Electronics

https://arxiv.org/pdf/2302.03591v1.pdf

### A Solution from Microsoft

# Panopticon: A Complete In-DRAM Rowhammer Mitigation

Tanj Bennett<sup>§</sup>, Stefan Saroiu, Alec Wolman, and Lucian Cojocar Microsoft, <sup>§</sup>Avant-Gray LLC

https://stefan.t8k2.com/publications/dramsec/2021/panopticon.pdf
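One way an in-DRAM mitigation can work is to count activations per row and act once a row's count reaches a threshold. The sketch below shows that general idea only, under stated assumptions; it is not taken from the Panopticon or DSAC papers, and the table size, type name, function names, and threshold are hypothetical.

```c
#include <assert.h>

#define PRC_ROWS 16   /* hypothetical number of rows tracked */

typedef struct {
    unsigned counters[PRC_ROWS];   /* one activation counter per row */
    unsigned threshold;            /* activations allowed before mitigation */
    unsigned long mitigations;     /* how often neighbors were refreshed */
} row_counters_t;

static void counters_init(row_counters_t *t, unsigned threshold) {
    assert(threshold > 0);
    for (int r = 0; r < PRC_ROWS; r++) t->counters[r] = 0;
    t->threshold = threshold;
    t->mitigations = 0;
}

/* Records one activation. Returns 1 when the row reached the threshold,
 * in which case its neighbors are (conceptually) refreshed and the counter
 * starts over; returns 0 otherwise. */
static int counters_activate(row_counters_t *t, int row) {
    assert(row >= 0 && row < PRC_ROWS);
    if (++t->counters[row] < t->threshold) return 0;
    t->counters[row] = 0;
    t->mitigations++;
    return 1;
}

/* Convenience wrapper: mitigations triggered by n_acts activations of one row. */
static unsigned long mitigations_after(unsigned threshold, int row,
                                       unsigned long n_acts) {
    row_counters_t t;
    counters_init(&t, threshold);
    for (unsigned long i = 0; i < n_acts; i++) counters_activate(&t, row);
    return t.mitigations;
}
```

Exact per-row counting is deterministic, unlike probabilistic schemes, but it needs storage for one counter per row; how to store and update those counters cheaply is the kind of design question the in-DRAM proposals address.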

# Recent Improvements in JEDEC (2024)



Version 1.30

This standard defines the DDR5 SDRAM specification, including features, functionalities, AC and DC characteristics, packages, and ball/signal assignments. The purpose of this Standard is to define the minimum set of requirements for JEDEC compliant 8 Gb through 32 Gb for x4, x8, and x16 DDR5 SDRAM devices. This standard was created based on the DDR4 standards (JESD79-4) and some aspects of the DDR, DDR2, DDR3, and LPDDR4 standards (JESD79, JESD79-2, JESD79-3, and JESD209-4).

Committee(s): JC-42, JC-42.3

### Evaluation of Industry's Recent Solutions

Appears at DRAMSec 2024

### Understanding the Security Benefits and Overheads of Emerging Industry Solutions to DRAM Read Disturbance

Oğuzhan Canpolat<sup>§†</sup> A. Giray Yağlıkçı<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Ataberk Olgun<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup> <sup>†</sup>TOBB University of Economics and Technology

https://arxiv.org/pdf/2406.19094

https://github.com/CMU-SAFARI/ramulator2

### Evaluation of Industry's Recent Solutions

 Oguzhan Canpolat, Abdullah Giray Yaglikci, Geraldo Francisco de Oliveira, Ataberk Olgun, Nisa Bostanci, Ismail Emir Yuksel, Haocong Luo, Oguz Ergin, and Onur Mutlu,
 "Chronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance"

Proceedings of the <u>31st International Symposium on High-Performance Computer</u> <u>Architecture</u> (**HPCA**), Las Vegas, NV, USA, March 2025.

[Chronus Source Code (Officially Artifact Evaluated with All Badges)]

Officially artifact evaluated as available, functional, and reproduced.

2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA)



# Chronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance

Oğuzhan Canpolat<sup>§†</sup> A. Giray Yağlıkçı<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Ataberk Olgun<sup>§</sup> Nisa Bostancı<sup>§</sup> Ismail Emir Yuksel<sup>§</sup> Haocong Luo<sup>§</sup> Oğuz Ergin<sup>‡†</sup> Onur Mutlu<sup>§</sup> ETH Zürich <sup>†</sup>TOBB University of Economics and Technology <sup>‡</sup>University of Sharjah

https://arxiv.org/pdf/2502.12650

### Are Solutions Good & Secure?



# Are we now BitFlip-free in 2024 and Beyond?

### Are We Now BitFlip Free?

Appears at ISCA 2023

What if there is another phenomenon that does NOT require high row activation count?

# RowPress: Amplifying Read-Disturbance in Modern DRAM Chips

Haocong Luo Ataberk Olgun A. Giray Yağlıkçı Yahya Can Tuğrul Steve Rhyner Meryem Banu Cavlak Joël Lindegger Mohammad Sadrosadati Onur Mutlu ETH Zürich

### RowPress [ISCA 2023]







Haocong Luo, Ataberk Olgun, Giray Yaglikci, Yahya Can Tugrul, Steve Rhyner,
 M. Banu Cavlak, Joel Lindegger, Mohammad Sadrosadati, and Onur Mutlu,
 "RowPress: Amplifying Read Disturbance in Modern DRAM Chips"

Proceedings of the <u>50th International Symposium on Computer</u> <u>Architecture</u> (**ISCA**), Orlando, FL, USA, June 2023.

[Slides (pptx) (pdf)]

[Lightning Talk Slides (pptx) (pdf)]

[<u>Lightning Talk Video</u> (3 minutes)]

[RowPress Source Code and Datasets (Officially Artifact Evaluated with All Badges)]

Officially artifact evaluated as available, reusable and reproducible. Best artifact award at ISCA 2023. IEEE Micro Top Pick in 2024.


### RowPress vs. RowHammer

Instead of using a high activation count, increase the time that the aggressor row stays open



We observe bitflips even with **ONLY ONE activation** in extreme cases where the row stays open for 30ms

### **Key Characteristics of RowPress (I)**

### RowPress Amplifies Read Disturbance in DRAM

- Reduces the minimum number of row activations needed to induce a bitflip (ACmin) by 1-2 orders of magnitude
- In extreme cases, activating a row only once induces bitflips





### Real-System Demonstration (I)



Intel Core i5-10400 (Comet Lake)



Samsung DDR4 Module M378A2K43CB1-CTD (Date Code: 20-10)

w/ TRR RowHammer Mitigation

**Key Idea:** A proof-of-concept RowPress program keeps a DRAM row open for a longer period by **repeatedly accessing different cache blocks in the row**

```
// Sync with refresh, then loop as below
for (k = 0; k < NUM_AGGR_ACTS; k++) {
  // Read NUM_READS cache blocks per aggressor row ACT
  // to keep the row open longer (NUM_READS=1 is RowHammer)
  for (j = 0; j < NUM_READS; j++) *AGGRESSOR1[j];
  for (j = 0; j < NUM_READS; j++) *AGGRESSOR2[j];
  // Flush the accessed cache blocks so the next reads go to DRAM
  for (j = 0; j < NUM_READS; j++) {
    clflushopt(AGGRESSOR1[j]);
    clflushopt(AGGRESSOR2[j]);
  }
  mfence();
}
activate_dummy_rows();
```

### Real-System Demonstration (II)

#### On 1500 victim rows



Leveraging RowPress, our user-level program induces bitflips when RowHammer cannot

### Combining RowHammer and RowPress

Appears at DSN Disrupt 2024

# An Experimental Characterization of Combined RowHammer and RowPress Read Disturbance in Modern DRAM Chips

Haocong Luo İsmail Emir Yüksel Ataberk Olgun A. Giray Yağlıkçı Mohammad Sadrosadati Onur Mutlu ETH Zürich

### Combining RowHammer and RowPress

#### Appears at DIMVA 2024

# Presshammer: Rowhammer and Rowpress without Physical Address Information

Jonas Juffinger<sup>1</sup>, Sudheendra Raghav Neela<sup>1</sup>, Martin Heckel<sup>2</sup>, Lukas Schwarz<sup>1</sup>, Florian Adamsky<sup>2</sup>, and Daniel Gruss<sup>1</sup>

<sup>1</sup>Graz University of Technology, Graz, Austria
<sup>2</sup>Hof University of Applied Sciences, Hof, Germany

# Understanding RowPress

Appears in IEEE TED, 2024



# Unveiling RowPress in Sub-20 nm DRAM Through Comparative Analysis With Row Hammer: From Leakage Mechanisms to Key Features

Longda Zhou, Sheng Ye, Runsheng Wang, *Member, IEEE*, and Zhigang Ji

# Read disturbance is a technology scaling problem

Finding a good solution to read disturbance is difficult (and will become more so)

# More to Come...

### RowHammer Becomes Worse with Aging

Preliminary data on aging via 68 days of continuous hammering

Aging can lead to read disturbance bitflips at smaller hammer counts



Minimum hammer count to induce the first bitflip HC<sub>first</sub> (before aging)


### RowHammer (Spatial Variation) Analysis (2024)

Appears at HPCA 2024

### Spatial Variation-Aware Read Disturbance Defenses: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions

Abdullah Giray Yağlıkçı Yahya Can Tuğrul Geraldo F. Oliveira İsmail Emir Yüksel Ataberk Olgun Haocong Luo Onur Mutlu ETH Zürich

https://arxiv.org/pdf/2402.18652

### Variable Read Disturbance (2025)

### Key Takeaway

The Read Disturbance Threshold (RDT) of a row changes randomly and unpredictably over time

Accurately identifying RDT is challenging
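The takeaway above matters in practice because a defense is usually configured from a measured threshold. The sketch below uses made-up measurements, not data from the paper, to show how a minimum taken over only a few measurements can overestimate a row's true RDT, and how a safety margin changes the configured value. All names and numbers are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up RDT measurements of one row, taken at different times. */
static const unsigned rdt_samples[] = {1200, 1150, 1300, 1180, 900, 1250};
#define RDT_SAMPLE_COUNT (sizeof rdt_samples / sizeof rdt_samples[0])

/* Smallest RDT seen in the first k measurements. */
static unsigned min_rdt_of_first(size_t k) {
    assert(k >= 1 && k <= RDT_SAMPLE_COUNT);
    unsigned lowest = rdt_samples[0];
    for (size_t i = 1; i < k; i++)
        if (rdt_samples[i] < lowest) lowest = rdt_samples[i];
    return lowest;
}

/* Threshold a defense would be configured with after subtracting a
 * safety margin (in percent) from the measured minimum. */
static unsigned guardbanded_threshold(unsigned measured_min, unsigned margin_pct) {
    assert(margin_pct < 100);
    return measured_min * (100 - margin_pct) / 100;
}
```

With these hypothetical samples, three measurements suggest an RDT of 1150, yet the row later shows 900. A 25% margin on the early estimate yields 862 and would still cover that case, while a 10% margin yields 1035 and would not.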

### Variable Read Disturbance (2025)

Appears at HPCA 2025

### Variable Read Disturbance:

An Experimental Analysis of Temporal Variation in DRAM Read Disturbance

Ataberk Olgun† F. Nisa Bostancı† İsmail Emir Yüksel† Oğuzhan Canpolat† Haocong Luo† Geraldo F. Oliveira† A. Giray Yağlıkçı† Minesh Patel‡ Onur Mutlu†

†ETH Zurich ‡Rutgers University

### Two Major Directions

#### Understanding Bitflips (Hardware errors in general)

- Many effects on bitflips still need to be rigorously examined
  - Aging of DRAM Chips
  - Environmental Conditions (e.g., Process, Voltage, Temperature)
  - Memory Access Patterns
  - Memory Controller & System Design Decisions
  - ...

#### Solving Bitflips (Hardware errors in general)

- Flexible and efficient solutions are necessary
  - In-field patchable / reconfigurable / programmable solutions
- Co-architecting across the system stack/components is important
  - To avoid performance and denial-of-service problems

### A Recent RowHammer Lecture



### Emerging Memories Also Need Intelligent Controllers

Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger,
"Architecting Phase Change Memory as a Scalable DRAM Alternative"
Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)

One of the 13 computer architecture papers of 2009 selected as Top Picks by IEEE Micro. Selected as a CACM Research Highlight. 2022 Persistent Impact Prize.

### Architecting Phase Change Memory as a Scalable DRAM Alternative

Benjamin C. Lee† Engin Ipek† Onur Mutlu‡ Doug Burger†

†Computer Architecture Group Microsoft Research Redmond, WA {blee, ipek, dburger}@microsoft.com ‡Computer Architecture Laboratory Carnegie Mellon University Pittsburgh, PA onur@cmu.edu

# Intelligent Memory Controllers Can Enhance Security & Enable Better Scaling

### Read Disturbance Sessions @ HPCA 2025

#### Session 7A (Acacia A and B): Hammering the Odds - 1

Session Chair: Gururaj Saileshwar (Toronto)

- Variable Read Disturbance: An Experimental Analysis of Temporal Variation in DRAM Read Disturbance
   Ataberk Olgun (ETH Zürich), Nisa Bostanci (ETH Zürich), Ismail Emir Yuksel (ETH Zürich), Giray Yaglikci (ETH Zürich),
   Geraldo F. Oliveira (ETH Zürich), Haocong Luo (ETH Zürich), Oguzhan Canpolat (ETH Zürich), Minesh Patel (Rutgers
   University), Onur Mutlu (ETH Zürich)
- Understanding RowHammer Under Reduced Refresh Latency: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions
   Yahya Can Tuğrul (TOBB ETÜ & ETH Zürich), Giray Yaglikci (ETH Zürich), Ismail Emir Yuksel (ETH Zürich), Ataberk Olgun (ETH Zürich), Oğuzhan Canpolat (TOBB ETÜ & ETH Zürich), Nisa Bostanci (ETH Zürich), Mohammad Sadrosadati (ETH Zürich), Oguz Ergin (TOBB ETÜ), Onur Mutlu (ETH Zürich)
- Chronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance
   Oğuzhan Canpolat (TOBB ETÜ & ETH Zürich), Giray Yaglikci (ETH Zürich), Geraldo Francisco de Oliveira (ETH
   Zürich), Ataberk Olgun (ETH Zürich), Nisa Bostanci (ETH Zürich), Ismail Emir Yuksel (ETH Zürich), Haocong Luo (ETH
   Zürich), Oğuz Ergin (TOBB ETÜ), Onur Mutlu (ETH Zürich)

#### Session 8A (Acacia A and B): Hammering the Odds - 2

Session Chair: Sudhanva Gurumurthi (AMD)

- AutoRFM: Scaling Low-Cost In-DRAM Trackers to Ultra-Low Rowhammer Thresholds
   Moinuddin Qureshi (Georgia Tech)
- DAPPER: A Performance-Attack-Resilient Tracker for RowHammer Defense
   Jeonghyun Woo (The University of British Columbia (UBC)), Prashant J. Nair (The University of British Columbia (UBC))
- QPRAC: Towards Secure and Practical PRAC-based Rowhammer Mitigation using Priority Queues
   Jeonghyun Woo (The University of British Columbia (UBC)), Shaopeng (Chris) Lin (University of Toronto), Prashant J.
   Nair (The University of British Columbia (UBC)), Aamer Jaleel (NVIDIA), Gururaj Saileshwar (University of Toronto)

# Data Corruption is in CPU Logic, Too

- Intermittent defects can cause silent data corruption
- They may be hard to detect or replicate
- They may be exploitable
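One generic way to surface such defects is a known-answer test: run a deterministic computation and compare the result with a value recorded on trusted hardware. The sketch below is illustrative only and is not the mechanism of any paper cited in this talk; `workload`, `self_test`, and `GOLDEN` are hypothetical names, and the golden values are recorded locally only to keep the example self-contained.

```python
def workload(seed, rounds=1000):
    """Deterministic integer workload whose result depends on every step."""
    x = seed & 0xFFFFFFFF
    for _ in range(rounds):
        x = (x * 1664525 + 1013904223) & 0xFFFFFFFF  # 32-bit linear congruential step
    return x


def self_test(golden):
    """Return the seeds whose recomputed result differs from the golden value."""
    return [seed for seed, expected in golden.items() if workload(seed) != expected]


# Assumption: in practice the golden values come from a trusted machine.
GOLDEN = {seed: workload(seed) for seed in (1, 2, 3)}

if __name__ == "__main__":
    mismatches = self_test(GOLDEN)
    print("suspect seeds:", mismatches)
```

Because intermittent defects may not fire on every run, a test like this would need to be repeated over time and across cores; a single pass says little.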

# Silent Data Corruption in Logic (2021)

#### **Silent Data Corruptions at Scale**

Harish Dattatraya Dixit Facebook, Inc. hdd@fb.com Sneha Pendharkar Facebook, Inc. spendharkar@fb.com Matt Beadon Facebook, Inc. mbeadon@fb.com Chris Mason Facebook, Inc. clm@fb.com

Tejasvi Chakravarthy Facebook, Inc. teju@fb.com Bharath Muthiah Facebook, Inc. bharathm@fb.com

Sriram Sankar Facebook Inc. sriramsankar@fb.com

### Cores that don't count

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, Amin Vahdat

Google, Sunnyvale, CA, US

#### Silent Data Corruption In-the-Field (2021)



HotOS 2021: Cores That Don't Count (Fun Hardware)

#### Silent Data Corruption in Logic (2023)

## Understanding Silent Data Corruptions in a Large Production CPU Population

Shaobu Wang Tsinghua University

Yang Wang The Ohio State University

Guangyan Zhang\* Tsinghua University

Jiesheng Wu Alibaba Cloud

Junyu Wei Tsinghua University

Qingchao Luo Alibaba Cloud

### Understanding and Mitigating Hardware Failures in Deep Learning Training Accelerator Systems

Yi He University of Chicago Chicago, IL, USA yiizy@uchicago.edu

Robert de Gruijl Google Sunnyvale, CA, USA rdegruijl@google.com

Mike Hutton Google Sunnyvale, CA, USA mdhutton@google.com

Rama Govindaraju Google Sunnyvale, CA, USA govindaraju@google.com

Yanjing Li University of Chicago Chicago, IL, USA yanjingl@uchicago.edu

Steven Chan Google Sunnyvale, CA, USA scchan@google.com

Nishant Patil Google Sunnyvale, CA, USA nishantpatil@google.com

#### How to Detect Hardware Errors? (I)

Kypros Constantinides, Onur Mutlu, Todd Austin, and Valeria Bertacco, "Software-Based Online Detection of Hardware Defects:
 Mechanisms, Architectural Support, and Evaluation"
 Proceedings of the 40th International Symposium on
 Microarchitecture (MICRO), pages 97-108, Chicago, IL, December 2007. Slides (ppt)

## Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation

Kypros Constantinides<sup>‡</sup>

Onur Mutlu†

Todd Austin<sup>‡</sup>

Valeria Bertacco<sup>‡</sup>

<sup>‡</sup>Advanced Computer Architecture Lab University of Michigan Ann Arbor, MI {kypros, austin, valeria}@umich.edu †Computer Architecture Group Microsoft Research Redmond, WA onur@microsoft.com

#### How to Detect Hardware Errors? (II)

Kypros Constantinides, Onur Mutlu, and Todd Austin,
 "Online Design Bug Detection: RTL Analysis, Flexible
 Mechanisms, and Evaluation"
 Proceedings of the <u>41st International Symposium on</u>
 Microarchitecture (MICRO), pages 282-293, Lake Como, Italy, November 2008. Slides (ppt)

#### Online Design Bug Detection: RTL Analysis, Flexible Mechanisms, and Evaluation

Kypros Constantinides‡ Onur Mutlu§ Todd Austin‡

‡Advanced Computer Architecture Lab University of Michigan {kypros, austin}@umich.edu

§Microsoft Research and Carnegie Mellon University onur@{microsoft.com,cmu.edu}

#### How to Detect Hardware Errors? (III)

Yanjing Li, Onur Mutlu, and Subhasish Mitra,
 "Operating System Scheduling for Efficient Online Self-Test in Robust Systems"

Proceedings of the <u>International Conference on Computer-Aided Design</u> (**ICCAD**), pages 201-208, San Jose, CA, November 2009. <u>Slides (ppt) (pdf)</u>

**Operating System Scheduling for Efficient Online Self-Test in Robust Systems** 

Yanjing Li Stanford University Onur Mutlu Carnegie Mellon University

Subhasish Mitra Stanford University

#### How to Detect Hardware Errors? (IV)

Yanjing Li, Onur Mutlu, Donald S. Gardner, and Subhasish Mitra,
 "Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips"

Proceedings of the <u>28th IEEE VLSI Test Symposium</u> (**VTS**), pages 232-237, Santa Cruz, CA, April 2010. <u>Slides (ppt)</u> **Best paper award at VTS 2010.** 

#### **Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips**

Yanjing Li Stanford University

Onur Mutlu Carnegie Mellon University Donald S. Gardner Intel Corporation

Subhasish Mitra Stanford University

#### How to Detect Hardware Errors? (V)

Kypros Constantinides, Onur Mutlu, Todd Austin, and Valeria Bertacco,
 "A Flexible Software-Based Framework for Online Detection of Hardware Defects"

<u>IEEE Transactions on Computers</u> (**TC**), Vol. 58, No. 8, pages 1063-1079, August 2009.

#### A Flexible Software-Based Framework for Online Detection of Hardware Defects

Kypros Constantinides, Student Member, IEEE, Onur Mutlu, Member, IEEE, Todd Austin, Member, IEEE, and Valeria Bertacco, Member, IEEE

#### Takeaways

 Both memory and logic errors will become worse with technology scaling

Hardware errors will create worse robustness problems

We cannot afford to ignore data corruption

## System Complexity

#### Complex Systems Cause Many Issues

- Many hardware components, complex components
- Harder to design & verify
- Harder to reason about operational behavior
  - Correctness, performance, energy, security, privacy, ...
- Harder to control interactions between components and avoid information leakage
- Old methods do not keep up with new trends and complexity
  - Virtual memory a prime example, also coherence & verification


#### Processor Complexity Is Growing

#### Moore's Law: The number of transistors on microchips doubles every two years



Moore's law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important for other aspects of technological progress in computing – such as processing speed or the price of computers.
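The doubling rule can be written as N(t) = N0 · 2^(t / T), with T ≈ 2 years. The following is a minimal sketch of that arithmetic; the function name and the baseline count are hypothetical and do not describe any specific chip.

```python
def projected_transistors(n0, years, doubling_period=2.0):
    """Transistor count after `years`, assuming one doubling per `doubling_period` years."""
    return n0 * 2 ** (years / doubling_period)


if __name__ == "__main__":
    baseline = 1_000  # hypothetical starting count
    for years in (2, 10, 20):
        print(years, "years:", projected_transistors(baseline, years))
```

Ten years at a two-year doubling period is five doublings, i.e., a 32x increase; twenty years is ten doublings, roughly 1000x.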





#### Complex CPUs and Memory Hierarchies



10nm ESF=Intel 7 Alder Lake die shot (~209mm²) from Intel: https://www.intel.com/content/www/us/en/newsroom/news/12th-gen-core-processors.html

Die shot interpretation by Locuza, October 2021

Intel Alder Lake, 2021

#### Complex CPUs and Memory Hierarchies



- **Core Count:** 8 cores/16 threads
- **L1 Caches:** 32 KB per core
- **L2 Caches:** 512 KB per core
- **L3 Cache:** 32 MB shared

AMD Ryzen 5000, 2020

#### Complexity Growing with 3D (2021)



34/comparing-zen-3-to-zen-2

AMD increases the L3 size of their 8-core Zen 3 processors from 32 MB to 96 MB

Additional 64 MB L3 cache die stacked on top of the processor die

- Connected using Through Silicon Vias (TSVs)
- Total of 96 MB L3 cache



#### Processor Complexity and Features

- Leads to many (endless) side and covert channels
  - Spectre and Meltdown are prime recent examples
  - These will not go away
- Leads to many bugs and unintended behavior
  - Especially with new features or complex interactions
  - Some can be exploitable
- How to tame processor complexity & resulting issues?

#### Access Control & Protection Mechanisms

- Are based on virtual memory (VM), invented in 1950s
- VM has not changed much even after decades of technology scaling and memory system improvements
- VM causes large performance problems and is responsible for large complexity, power, energy
- VM is poor for fine-grained security and access control
- VM hinders innovation in heterogeneous (e.g., accelerator) systems and new architectures (e.g., processing near data)
- It is time to rethink virtual memory

#### Virtual Memory: Parting Thoughts

- Virtual Memory is one of the most successful examples of
  - architectural support for programmers
  - how to partition work between hardware and software
  - hardware/software cooperation
  - programmer/architect tradeoff
- Going forward: How does virtual memory fare and scale into the future? Five key trends:
  - Increasing, huge physical memory sizes (local & remote)
  - Hybrid physical memory systems (DRAM + NVM + SSD)
  - Many accelerators in the system accessing physical memory
  - Virtualized systems (hypervisors, software virtualization, local and remote memories)
  - Processing-in-memory systems and near-data accelerators
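The first trend can be made concrete with a back-of-the-envelope calculation of TLB reach, i.e., how much memory the cached translations cover at once (entries × page size). This is a rough sketch under stated assumptions: the TLB size and the physical memory size below are hypothetical values chosen for illustration and are not taken from the talk.

```python
KiB, MiB, GiB, TiB = 2**10, 2**20, 2**30, 2**40


def tlb_reach(entries, page_size):
    """Bytes of memory whose translations fit in the TLB at once."""
    return entries * page_size


def covered_fraction(entries, page_size, phys_mem):
    """Fraction of physical memory reachable without a TLB miss."""
    return min(1.0, tlb_reach(entries, page_size) / phys_mem)


ENTRIES = 1536      # hypothetical TLB size
PHYS_MEM = 1 * TiB  # hypothetical machine

if __name__ == "__main__":
    for page in (4 * KiB, 2 * MiB):
        reach = tlb_reach(ENTRIES, page)
        frac = covered_fraction(ENTRIES, page, PHYS_MEM)
        print(f"{page // KiB} KiB pages: {reach // MiB} MiB reach, {frac:.6%} of memory")
```

Under these assumptions, even 2 MiB pages cover well under 1% of a 1 TiB memory, which is one way to see why growing memory sizes strain conventional address translation.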

#### Rethinking Virtual Memory

Nastaran Hajinazar, Pratyush Patel, Minesh Patel, Konstantinos Kanellopoulos, Saugata Ghose, Rachata Ausavarungnirun, Geraldo Francisco de Oliveira Jr., Jonathan Appavoo, Vivek Seshadri, and Onur Mutlu, "The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework"

Proceedings of the <u>47th International Symposium on Computer Architecture</u> (**ISCA**), Virtual, June 2020.

[Slides (pptx) (pdf)]

[<u>Lightning Talk Slides (pptx) (pdf)</u>]

[ARM Research Summit Poster (pptx) (pdf)]

[Talk Video (26 minutes)]

[Lightning Talk Video (3 minutes)]

[Lecture Video (43 minutes)]

## The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework

Nastaran Hajinazar\*<sup>†</sup> Pratyush Patel<sup>⋈</sup> Minesh Patel\* Konstantinos Kanellopoulos\* Saugata Ghose<sup>‡</sup> Rachata Ausavarungnirun<sup>⊙</sup> Geraldo F. Oliveira\* Jonathan Appavoo<sup>⋄</sup> Vivek Seshadri<sup>▽</sup> Onur Mutlu\*<sup>‡</sup>

\*ETH Zürich <sup>†</sup>Simon Fraser University <sup>⋈</sup>University of Washington <sup>‡</sup>Carnegie Mellon University <sup>⊙</sup>King Mongkut's University of Technology North Bangkok <sup>⋄</sup>Boston University <sup>▽</sup>Microsoft Research India

#### Better Virtual Memory (I)

Konstantinos Kanellopoulos, Hong Chul Nam, F. Nisa Bostanci, Rahul Bera, Mohammad Sadrosadati, Rakesh Kumar, Davide Basilio Bartolini, and Onur Mutlu,

"Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources"

Proceedings of the <u>56th International Symposium on Microarchitecture</u> (**MICRO**), Toronto, ON, Canada, November 2023.

[Slides (pptx) (pdf)]

[arXiv version]

[Victima Source Code (Officially Artifact Evaluated with All Badges)]

Officially artifact evaluated as available, functional, reusable and reproducible. Distinguished artifact award at MICRO 2023.

## Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

Konstantinos Kanellopoulos<sup>1</sup> Hong Chul Nam<sup>1</sup> F. Nisa Bostanci<sup>1</sup> Rahul Bera<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Rakesh Kumar<sup>2</sup> Davide Basilio Bartolini<sup>3</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Norwegian University of Science and Technology <sup>3</sup>Huawei Zurich Research Center

#### Better Virtual Memory (II)

Konstantinos Kanellopoulos, Rahul Bera, Kosta Stojiljkovic, Nisa Bostanci, Can Firtina, Rachata Ausavarungnirun, Rakesh Kumar, Nastaran Hajinazar, Mohammad Sadrosadati, Nandita Vijaykumar, and Onur Mutlu,

"Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings"

Proceedings of the <u>56th International Symposium on Microarchitecture</u> (**MICRO**), Toronto, ON, Canada, November 2023.

[Slides (pptx) (pdf)]

[arXiv version]

[<u>Utopia Source Code</u>]

## Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings

Konstantinos Kanellopoulos¹ Rahul Bera¹ Kosta Stojiljkovic¹ Nisa Bostanci¹ Can Firtina¹ Rachata Ausavarungnirun² Rakesh Kumar³ Nastaran Hajinazar⁴ Mohammad Sadrosadati¹ Nandita Vijaykumar⁵ Onur Mutlu¹

<sup>1</sup>ETH Zürich <sup>2</sup>King Mongkut's University of Technology North Bangkok <sup>3</sup>Norwegian University of Science and Technology <sup>4</sup>Intel Labs <sup>5</sup>University of Toronto

#### Even Better Virtual Memory

#### To Appear at ASPLOS 2025

#### Virtuoso: Enabling Fast and Accurate Virtual Memory Research via an Imitation-based Operating System Simulation Methodology

Konstantinos Kanellopoulos ETH Zürich Zürich, Switzerland

Andreas Kosmas Kakolyris ETH Zürich Zürich, Switzerland

Mohammad Sadrosadati ETH Zürich Zürich, Switzerland

Konstantinos Sgouras ETH Zürich Zürich, Switzerland

Berkin Kerim Konar ETH Zürich Zürich, Switzerland

Rakesh Kumar NTNU Trondheim, Norway

Onur Mutlu ETH Zürich Zürich, Switzerland

F. Nisa Bostanci ETH Zürich Zürich, Switzerland

Rahul Bera ETH Zürich Zürich, Switzerland

Nandita Vijaykumar University of Toronto Toronto, Canada

## New Architectures & Technologies

#### New Architectures & Technologies

- Can have large impact on security and robustness
  - Positive or negative
- They need to be designed with system security in mind
  - Ideally as a first-class design goal
- Multiple potentially paradigm-changing new technologies and architectures
  - Processing in memory
  - Accelerator-based computing
  - Quantum computing
  - ...

## Processing in Memory

## Computing is Bottlenecked by Data

#### Data is Key for AI, ML, Genomics, ...

Important workloads are all data intensive

 They require rapid and efficient processing of large amounts of data

- Data is increasing
  - We can generate more than we can process
  - We need to perform more sophisticated analyses on more data

#### Huge Demand for Performance & Efficiency



#### **Exponential Growth of Neural Networks**



#### Huge Demand for Performance & Efficiency



#### Do We Want This?






#### Or This?




Source: V. Milutinovic

High Performance, Energy Efficient, Sustainable (All at the Same Time)

#### The Problem

Data access is the major performance and energy bottleneck

# Our current design principles cause great energy waste

(and great performance loss)

#### Today's Computing Systems

- Processor centric
- All data processed in the processor → at great system cost



#### Processor-Centric System Performance

All of Google's Data Center Workloads (2015):



#### Data Movement vs. Computation Energy



A memory access consumes ~100-1000X the energy of a complex addition

#### Data Movement vs. Computation Energy



### Data Movement vs. Computation Energy



A memory access consumes 6400X the energy of a simple integer addition
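To see what such a ratio implies for a whole program, the sketch below computes the share of energy spent on memory accesses for a given mix of operations. It takes the 6400X ratio from the slide as given and normalizes the energy of one integer addition to 1; the operation mix is hypothetical, and the model ignores caches and every other energy component.

```python
def memory_energy_fraction(num_adds, num_mem_accesses, mem_to_add_ratio=6400.0):
    """Share of total energy spent on memory accesses (one addition costs 1 unit)."""
    mem_energy = num_mem_accesses * mem_to_add_ratio
    compute_energy = float(num_adds)
    return mem_energy / (mem_energy + compute_energy)


if __name__ == "__main__":
    # Hypothetical mix: one memory access per 1000 integer additions.
    print(f"{memory_energy_fraction(1000, 1):.1%} of energy goes to memory")
```

Under this simplified model, one memory access per thousand additions already accounts for most of the energy, since 6400 / (6400 + 1000) ≈ 86%.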

### Energy Waste in Mobile Devices

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the <u>23rd International Conference on Architectural Support for Programming</u> <u>Languages and Operating Systems</u> (ASPLOS), Williamsburg, VA, USA, March 2018.

### 62.7% of the total system energy is spent on data movement

### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand<sup>1</sup> Rachata Ausavarungnirun<sup>1</sup> Aki Kuusela<sup>3</sup> Allan Knies<sup>3</sup>

Saugata Ghose<sup>1</sup> Youngsok Kim<sup>2</sup>

Eric Shiu<sup>3</sup> Rahul Thakur<sup>3</sup> Daehyun Kim<sup>4,3</sup>

Parthasarathy Ranganathan<sup>3</sup> Onur Mutlu<sup>5,1</sup>

### Energy Waste in Accelerators

Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu,

"Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks"

Proceedings of the <u>30th International Conference on Parallel Architectures and Compilation Techniques</u> (**PACT**), Virtual, September 2021.

[Slides (pptx) (pdf)]

[Talk Video (14 minutes)]

### > 90% of the total system energy is spent on memory in large ML models

### **Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks**

Amirali Boroumand<sup>†◊</sup> Saugata Ghose<sup>‡</sup> Berkin Akin<sup>§</sup> Ravi Narayanaswami<sup>§</sup> Geraldo F. Oliveira\* Xiaoyu Ma<sup>§</sup> Eric Shiu<sup>§</sup> Onur Mutlu\*<sup>†</sup>

<sup>†</sup>Carnegie Mellon Univ. <sup>◊</sup>Stanford Univ. <sup>‡</sup>Univ. of Illinois Urbana-Champaign <sup>§</sup>Google \*ETH Zürich

### Energy Wasted on Data Movement



In LSTMs and Transducers used by Google, >90% energy spent on off-chip interconnect and DRAM

### Fundamental Problem

# Processing of data is performed far away from the data

### We Need A Paradigm Shift To ...

Enable computation with minimal data movement

Compute where it makes sense (where data resides)

Make computing architectures more data-centric

### Process Data Where It Makes Sense



Apple M1 Ultra System (2022)

### Memory as an Accelerator



Memory similar to a "conventional" accelerator

### Goal: Processing Inside Memory/Storage



- Many questions ... How do we design the:
  - compute-capable memory & controllers?
  - processors & communication units?
  - software & hardware interfaces?
  - system software, compilers, languages?
  - algorithms & theoretical foundations?

Transformation hierarchy: Problem → Algorithm → Program/Language → System Software → SW/HW Interface → Micro-architecture → Logic → Electrons

## Processing in Memory: Two Types

1. Processing **near** Memory
2. Processing **using** Memory

### Processing-in-Memory Landscape Today









[Samsung 2021]



[UPMEM 2019]



### Processing-in-Memory Landscape Today

IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 22, NO. 1, JANUARY-JUNE

### Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications

Joonseop Sim, Soohong Ahn, Taeyoung Ahn, Seungyong Lee, Myunghyun Rhee, Jooyoung Kim, Kwangsik Shin, Donguk Moon, Euiseok Kim, and Kyoung Park

Abstract—CXL interface is the up-to-date technology that enables effective memory expansion by providing a memory-sharing protocol in configuring heterogeneous devices. However, its limited physical bandwidth can be a significant bottleneck for emerging data-intensive applications. In this work, we propose a novel CXL-based memory disaggregation architecture with a real-world prototype demonstration, which overcomes the bandwidth limitation of the CXL interface using near-data processing. The experimental results demonstrate that our design achieves up to 1.9× better performance/power efficiency than the existing CPU system.

Index Terms—Compute express link (CXL), near-data-processing (NDP)





Fig. 6. FPGA prototype of proposed CMS card.

### Processing-in-Memory Landscape Today

### Samsung Processing in Memory Technology at Hot Chips 2023

By Patrick Kennedy - August 28, 2023



















Samsung PIM PNM For Transformer Based AI HC35\_Page\_24

### Opportunity: 3D-Stacked Logic+Memory





### Tesseract System for Graph Processing

Interconnected set of 3D-stacked memory+logic chips with simple cores



### Evaluated Systems



### Tesseract Graph Processing Performance



### Tesseract Graph Processing System Energy



### More on Tesseract

 Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi,

"A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing"

Proceedings of the <u>42nd International Symposium on Computer</u> Architecture (**ISCA**), Portland, OR, June 2015.

[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

Top Picks Honorable Mention by IEEE Micro. Selected to the ISCA-50 25-Year Retrospective Issue covering 1996-2020 in 2023 (<u>Retrospective (pdf)</u> <u>Full</u> <u>Issue</u>).

### A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn Sungpack Hong<sup>§</sup> Sungjoo Yoo Onur Mutlu<sup>†</sup> Kiyoung Choi junwhan@snu.ac.kr, sungpack.hong@oracle.com, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr Seoul National University <sup>§</sup>Oracle Labs <sup>†</sup>Carnegie Mellon University

### A Short Retrospective @ 50 Years of ISCA

### Retrospective: A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn<sup>†</sup> Sungpack Hong<sup>‡</sup> Sungjoo Yoo<sup>∇</sup> Onur Mutlu<sup>§</sup> Kiyoung Choi<sup>∇</sup>

<sup>†</sup>Google DeepMind <sup>‡</sup>Oracle Labs <sup>§</sup>ETH Zürich <sup>∇</sup>Seoul National University

Abstract—Our ISCA 2015 paper [1] provides a new programmable processing-in-memory (PIM) architecture and system design that can accelerate key data-intensive applications, with a focus on graph processing workloads. Our major idea was to completely rethink the system, including the programming model, data partitioning mechanisms, system support, instruction set architecture, along with near-memory execution units and their communication architecture, such that an important workload can be accelerated at a maximum level using a distributed system of well-connected near-memory accelerators. We built our accelerator system, Tesseract, using 3D-stacked memories with logic layers, where each logic layer contains general-purpose processing cores and cores communicate with each other using a message-passing programming model. Cores could be specialized for graph processing (or any other application to be accelerated). To our knowledge, our paper was the first to completely design a near-memory accelerator system from scratch such that it is both generally programmable and specifically customizable to accelerate important applications, with a case study on major graph processing workloads. Ensuing work in academia and industry showed that similar approaches to system design can greatly benefit both graph processing workloads and other applications, such as machine learning, for which ideas from Tesseract seem to have been influential.

This short retrospective provides a brief analysis of our ISCA 2015 paper and its impact. We briefly describe the major ideas and contributions of the work, discuss later works that built on it or were influenced by it, and make some educated guesses on what the future may bring on PIM and accelerator systems.

### I. BACKGROUND, APPROACH & MINDSET

We started our research when 3D-stacked memories (e.g., [2-4]) were viable and seemed to have promise for building effective and practical processing-near-memory systems. Such near-memory systems could lead to improvements, but there was little to no research that examined how an accelerator could be completely (re-)designed using such near-memory technology, from its hardware architecture to its programming model and software system, and what the performance and energy benefits could be of such a re-design. We set out to answer these questions in our ISCA 2015 paper [1].

We followed several major principles to design our accelerator from the ground up. We believe these principles are still important: a major contribution and influence of our work was in putting all of these together in a cohesive full-system design and demonstrating the large performance and energy benefits that can be obtained from such a design. We see a similar approach in many modern large-scale accelerator systems in machine learning today (e.g., [5-9]). Our principles are:

1. Near-memory execution to enable/exploit the high data access bandwidth modern workloads (e.g., graph processing) need and to reduce data movement and access latency.
2. General programmability so that the system can be easily adopted, extended, and customized for many workloads.
3. Maximal acceleration capability to maximize the performance and energy benefits. We set ourselves free from backward compatibility and cost constraints. We aimed to completely re-design the system stack. Our goal was to explore the maximal performance and energy efficiency benefits we can gain from a near-memory accelerator if we had complete freedom to change things as much as we needed. We contrast this approach to the minimal intrusion approach we also explored in a separate ISCA 2015 paper [10].
4. Customizable to specific workloads, such that we can maximize acceleration benefits. Our focus workload was graph analytics/processing, a key workload at the time and today. However, our design principles are not limited to graph processing and the system we built is customizable to other workloads as well, e.g., machine learning, genome analysis.
5. Memory-capacity-proportional performance, i.e., processing capability should proportionally grow (i.e., scale) as memory capacity increases and vice versa. This enables scaling of data-intensive workloads that need both memory and compute.
6. Exploit new technology (3D stacking) that enables tight integration of memory and logic and helps multiple above principles (e.g., enables customizable near-memory acceleration capability in the logic layer of a 3D-stacked memory chip).
7. Good communication and scaling capability to support scalability to large dataset sizes and to enable memory-capacity-proportional performance. To this end, we provided scalable communication mechanisms between execution cores and carefully interconnected small accelerator chips to form a large distributed system of accelerator chips.
8. Maximal and efficient use of memory bandwidth to supply the high-bandwidth data access that modern workloads need. To this end, we introduced new, specialized mechanisms for prefetching and a programming model that helps leverage application semantics for hardware optimization.

### II. CONTRIBUTIONS AND INFLUENCE

We believe the major contributions of our work were 1) complete rethinking of how an accelerator system should be designed to enable maximal acceleration capability, and 2) the design and analysis of such an accelerator with this mindset and using the aforementioned principles to demonstrate its effectiveness in an important class of workloads.

One can find examples of our approach in modern large-scale machine learning (ML) accelerators, which are perhaps the most successful incarnation of scalable near-memory execution architectures. ML infrastructure today (e.g., [5-9]) consists of accelerator chips, each containing compute units and high-bandwidth memory tightly packaged together, and features scale-up capability enabled by connecting thousands of such chips with high-bandwidth interconnection links. The system-wide rethinking that was done to enable such accelerators and many of the principles used in such accelerators resemble our ISCA 2015 paper's approach.

The "memory-capacity-proportional performance" principle we explored in the paper shares similarities with how ML workloads are scaled up today. Similar to how we carefully sharded graphs across our accelerator chips to greatly improve effective memory bandwidth in our paper, today's ML workloads are sharded across a large number of accelerators by leveraging data/model parallelism and optimizing the placement to balance communication overheads and compute scalability [11, 12]. With the advent of large generative models requiring high memory bandwidth for fast training and inference, the scaling behavior where capacity and bandwidth are scaled together has become an essential architectural property to support modern data-intensive workloads.
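The idea of sharding a graph across memory-side compute units that exchange messages can be illustrated with a toy sketch. This is not Tesseract's programming interface or its partitioning scheme: the modulo placement in `owner`, the function names, and the tiny graph are all hypothetical, and the shards are simulated sequentially in one process.

```python
from collections import defaultdict


def owner(vertex, num_shards):
    """Hypothetical placement: vertex v lives on shard v mod num_shards."""
    return vertex % num_shards


def one_round(edges, values, num_shards):
    """Push each vertex's value along its out-edges; return (sums, remote_msgs)."""
    inboxes = defaultdict(list)  # shard id -> list of (dst, value) messages
    remote_msgs = 0
    for src, dst in edges:  # conceptually executed by the shard that owns src
        target = owner(dst, num_shards)
        inboxes[target].append((dst, values[src]))
        if target != owner(src, num_shards):
            remote_msgs += 1  # this message crosses between shards
    sums = {v: 0 for v in values}
    for shard, messages in inboxes.items():  # each shard updates only its own vertices
        for dst, val in messages:
            assert owner(dst, num_shards) == shard
            sums[dst] += val
    return sums, remote_msgs


EDGES = [(0, 1), (1, 2), (2, 0), (0, 2)]
VALUES = {0: 1, 1: 10, 2: 100}

if __name__ == "__main__":
    print(one_round(EDGES, VALUES, num_shards=2))
```

The result is the same for any shard count; only the number of cross-shard messages changes, which is the cost that placement decisions try to balance against parallelism.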

The "maximal acceleration capability" principle we used in Tesseract provides much larger performance and energy improvements and better customization than the "minimalist" approach that our other ISCA 2015 paper on PIM-Enabled Instructions [10] explored: "minimally change" an existing system to incorporate (near-memory) acceleration capability to ease programming and keep costs low. So far, the industry has more widely adopted the maximal approach to overcome the pressing scaling bottlenecks of major workloads. The key enabler that bridges the programmability gap between the maximal approach favoring large performance & energy benefits and the minimal approach favoring ease of programming is compilation techniques. These techniques lower well-defined high-level constructs into lower-level primitives [12, 13]; our ISCA 2015 papers [1, 10] and a follow-up work [14] explore them lightly. We believe that a good programming model that enables large benefits coupled with support for it across the entire system stack (including compilers & hardware) will continue to be important for effective near-memory system and accelerator designs [14]. We also believe that the maximal versus minimal approaches that are initially explored in our two ISCA 2015 papers is a useful way of exploring emerging technologies (e.g., near-memory accelerators) to better understand the tradeoffs of system designs that exploit such technologies.

### III. INFLUENCE ON LATER WORKS

Our paper was at the beginning of a proliferation of scalable near-memory processing systems designed to accelerate key applications (see [15] for many works on the topic). Tesseract has inspired many near-memory system ideas (e.g., [16-28]) and served as the de facto comparison point for such systems, including near-memory graph processing accelerators that built on Tesseract and improved various aspects of Tesseract. Since machine learning accelerators that use high-bandwidth memory (e.g., [5, 29]) and industrial PIM prototypes (e.g., [30-41]) are now in the market, near-memory processing is no longer an "eccentric" architecture it used to be when Tesseract was originally published.

Graph processing & analytics workloads remain as an important and growing class of applications in various forms, ranging from large-scale industrial graph analysis engines (e.g., [42]) to graph neural networks [43]. Our focus on large-scale graph processing in our ISCA 2015 paper increased attention to this domain in the computer architecture community, resulting in subsequent research on efficient hardware architectures for graph processing (e.g., [44-46]).

### IV. SUMMARY AND FUTURE OUTLOOK

We believe that our ISCA 2015 paper's principled rethinking of system design to accelerate an important class of data-intensive workloads provided significant value and enabled/influenced a large body of follow-on works and ideas. We expect that such rethinking of system design for key workloads, especially with a focus on "maximal acceleration capability," will continue to be critical as pressing technology and application scaling challenges increasingly require us to think differently to substantially improve performance and energy (as well as other metrics). We believe the principles exploited in Tesseract are fundamental and they will remain useful and likely become even more important as systems become more constrained due to the continuously-increasing memory access and computation demands of future workloads. We also project that as hardware substrates for near-memory acceleration (e.g., 3D stacking, in-DRAM computation, NVM-based PIM, processing using memory [15]) evolve and mature, systems will take advantage of them even more, likely using principles similar to those used in the design of Tesseract.

- J. Ahn et al., "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing," in ISCA, 2015.
   Hybrid Memory Cube Consortium, "HMC Specification 1.1," 2013.
   J. Jeddeloh and B. Keeth, "Hybrid Memory Cube: New DRAM Architecture Increases Density and Performance," in VISTI, 2012.
   IEDEC, "High Bandwidth Memory (HBM) DRAM," Standard No. IESD235, 2013.

- [5] N. Jouppi et al., "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embedding," in ISCA. tor Machine Learning with Hardware Support for Embedding," in ISCA, 2023.


- J. Ahn et al., "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture," in ISCA, 2015.
- R. Pope et al., "Efficiently Scaling Transformer Inference," in MLSys,
- [12] D. Lepikhin et al., "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding," in ICLR, 2021.
- [13] S. Wang et al., "Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models," in ASPLOS, 2023.
- [14] J. Ahn et al., "AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy," ACM TACO, vol. 13, no. 4, 2016.
- [15] O. Mutlu et al., "A Modern Primer on Processing in Memory," Emerging Computing: From Devices to Systems, 2021, https://arxiv.org/abs/2012.03112.
- [16] M. Zhang et al., "GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition," in HPCA, 2018.
- [17] L. Song et al., "GraphR: Accelerating Graph Processing Using ReRAM," in HPCA, 2018.
- [18] Y. Zhuo et al., "GraphQ: Scalable PIM-Based Graph Processing," in MICRO, 2019.
- [19] G. Dai et al., "GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing," IEEE TCAD, 2018.
- [20] G. Li et al., "GraphIA: An In-situ Accelerator for Large-scale Graph Processing," in MEMSYS, 2018.
- [21] S. Rheindt et al., "NEMESYS: Near-Memory Graph Copy Enhanced System-Software," in MEMSYS, 2019.
- [22] L. Belayneh and V. Bertacco, "GraphVine: Exploiting Multicast for Scalable Graph Analytics," in DATE, 2020.
- [23] N. Challapalle et al., "GaaS-X: Graph Analytics Accelerator Supporting Sparse Data Representation using Crossbar Architectures," in ISCA,
- [24] M. Zhou et al., "Ultra Efficient Acceleration for De Novo Genome Assembly via Near-Memory Computing," in PACT, 2021.
- [25] J. Xie et al., "SpaceA: Sparse Matrix Vector Multiplication on Processing-in-Memory Accelerator," in HPCA, 2021.
- [26] M. Zhou et al., "HyGraph: Accelerating Graph Processing with Hybrid Memory-Centric Computing," in DATE, 2021.
- [27] M. Lenjani et al., "Gearbox: A Case for Supporting Accumulation Dispatching and Hybrid Partitioning in PIM-based Accelerators," in ISCA, 2022.
- [28] M. Orenes-Vera et al., "Dalorex: A Data-Local Program Execution and Architecture for Memory-Bound Applications," in HPCA, 2023.
- [29] J. Choquette, "Nvidia Hopper GPU: Scaling Performance," in Hot Chips, 2022.

- [30] F. Devaux, "The True Processing In Memory Accelerator," in Hot Chips, 2019.
- [31] J. Gómez-Luna et al., "Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System,"
- [32] J. Gómez-Luna et al., "Evaluating Machine Learning Workloads on Memory-Centric Computing Systems," in ISPASS, 2023.
- [33] S. Lee et al., "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product," in ISCA, 2021.
- [34] Y.-C. Kwon et al., "25.4 A 20nm 6GB Function-in-Memory DRAM, Based on HBM2 with a 1.2 TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications," in ISSCC, 2021.
- [35] L. Ke et al., "Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM," *IEEE Micro*, 2021.
- [36] D. Lee et al., "Improving In-Memory Databases Operations with Acceleration DIMM (AxDIMM)," in *DaMoN*, 2022.
- [37] S. Lee et al., "A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications," in *ISSCC*, 2022.
- [38] D. Niu et al., "184QPS/W 64Mb/mm² 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System," in ISSCC, 2022.
- [39] Y. Kwon, "System Architecture and Software Stack for GDDR6-AiM," in HCS, 2022.
- [40] G. Singh et al., "FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications," IEEE Micro, 2021.
- [41] G. Singh et al., "Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric," ACM TRETS, 2021.
- [42] S. Hong et al., "PGX.D: A Fast Distributed Graph Processing Engine," in SC, 2015.
- [43] T. N. Kipf and M. Welling, "Semi-Supervised Classification with Graph Convolutional Networks," in ICLR, 2017.
- [45] M. Besta et al., "SiSA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems," in MICRO, 2021.
- [46] T. J. Ham et al., "Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics," in MICRO, 2016.

### Processing using DRAM

### We can support

- Bulk bitwise AND, OR, NOT, MAJ
- Bulk bitwise COPY and INIT/ZERO
- True Random Number Generation; Physical Unclonable Functions
- More complex computation using Lookup Tables
- At low cost
- Using analog computation capability of DRAM
  - Idea: activating (multiple) rows performs computation
    - Even in commodity off-the-shelf DRAM chips!
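
As a rough software illustration of why activating multiple rows can compute, the sketch below models the bitwise majority (MAJ) of three rows and shows how fixing one row to all zeros or all ones yields AND or OR. The function names and the use of Python integers as "rows" are assumptions made for illustration; this is not how the hardware or any of the cited designs is implemented.

```python
def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width rows, modeled as Python ints."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    """AND via majority: the third (control) row holds all zeros."""
    return maj3(a, b, 0)


def bulk_or(a: int, b: int, width: int) -> int:
    """OR via majority: the third (control) row holds all ones."""
    all_ones = (1 << width) - 1
    return maj3(a, b, all_ones)


if __name__ == "__main__":
    row_a, row_b = 0b1100, 0b1010  # hypothetical 4-bit rows
    print(format(bulk_and(row_a, row_b), "04b"))
    print(format(bulk_or(row_a, row_b, 4), "04b"))
```

In a real DRAM row the same idea applies across thousands of bitlines at once, which is where the bulk throughput would come from.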

### 30X-257X performance and energy improvements

Seshadri+, "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data," MICRO 2013.

Seshadri+, "Fast Bulk Bitwise AND and OR in DRAM", IEEE CAL 2015.

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.

Hajinazar+, "SIMDRAM: A Framework for Bit-Serial SIMD Processing using DRAM," ASPLOS 2021.

Oliveira+, "MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing," HPCA 2024.

### Future Systems: In-Memory Copy



1046ns, 3.6uJ → 90ns, 0.04uJ
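
The improvement implied by these figures can be worked out directly. The short sketch below divides the baseline latency and energy by the in-memory copy values; because the inputs are the rounded numbers shown on the slide, the resulting ratios are approximate and may differ from the figures reported in the paper. Variable names are illustrative.

```python
# Rounded figures from the slide: baseline copy vs. in-memory copy.
baseline_latency_ns, baseline_energy_uj = 1046.0, 3.6
in_dram_latency_ns, in_dram_energy_uj = 90.0, 0.04

# Ratio of baseline to in-memory cost (higher means a larger reduction).
latency_reduction = baseline_latency_ns / in_dram_latency_ns
energy_reduction = baseline_energy_uj / in_dram_energy_uj

print(f"latency reduction: {latency_reduction:.1f}x")
print(f"energy reduction:  {energy_reduction:.0f}x")
```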

### More on RowClone

Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata
 Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A.
 Kozuch, Phillip B. Gibbons, and Todd C. Mowry,

"RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization"

Proceedings of the <u>46th International Symposium on Microarchitecture</u> (**MICRO**), Davis, CA, December 2013. [<u>Slides (pptx) (pdf)</u>] [<u>Lightning Session Slides (pptx) (pdf)</u>] [<u>Poster (pptx) (pdf)</u>]

### RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri Yoongu Kim Chris Fallin\* Donghyuk Lee vseshadr@cs.cmu.edu yoongukim@cmu.edu cfallin@c1f.net donghyuk1@cmu.edu

Rachata Ausavarungnirun Gennady Pekhimenko Yixin Luo rachata@cmu.edu gpekhime@cs.cmu.edu yixinluo@andrew.cmu.edu

Onur Mutlu Phillip B. Gibbons† Michael A. Kozuch† Todd C. Mowry onur@cmu.edu phillip.b.gibbons@intel.com michael.a.kozuch@intel.com tcm@cs.cmu.edu

Carnegie Mellon University †Intel Pittsburgh

### In-DRAM AND/OR: Triple Row Activation



### More on Ambit

 Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,

"Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology"

Proceedings of the <u>50th International Symposium on Microarchitecture</u> (**MICRO**), Boston, MA, USA, October 2017.

[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

### Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri<sup>1,5</sup> Donghyuk Lee<sup>2,5</sup> Thomas Mullins<sup>3,5</sup> Hasan Hassan<sup>4</sup> Amirali Boroumand<sup>5</sup> Jeremie Kim<sup>4,5</sup> Michael A. Kozuch<sup>3</sup> Onur Mutlu<sup>4,5</sup> Phillip B. Gibbons<sup>5</sup> Todd C. Mowry<sup>5</sup>

<sup>1</sup>Microsoft Research India <sup>2</sup>NVIDIA Research <sup>3</sup>Intel <sup>4</sup>ETH Zürich <sup>5</sup>Carnegie Mellon University

### Capabilities of Off-The-Shelf Memory

## Existing DRAM Chips Are Already Quite Capable

### Real Processing Using Memory Prototype

- End-to-end RowClone & TRNG using off-the-shelf DRAM chips
- Idea: Violate DRAM timing parameters to mimic RowClone

### PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM

Ataberk Olgun<sup>§†</sup> Juan Gómez Luna<sup>§</sup> Konstantinos Kanellopoulos<sup>§</sup> Behzad Salami<sup>§\*</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

§ETH Zürich <sup>†</sup>TOBB ETÜ \*BSC

https://arxiv.org/pdf/2111.00082.pdf

https://github.com/cmu-safari/pidram

https://www.youtube.com/watch?v=qeukNs5XI3g&t=4192s


### Microbenchmark Copy/Initialization Throughput



In-DRAM Copy and Initialization improve throughput by 119x and 89x



### More on PiDRAM

 Ataberk Olgun, Juan Gomez Luna, Konstantinos Kanellopoulos, Behzad Salami, Hasan Hassan, Oguz Ergin, and Onur Mutlu,

"PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM"

<u>ACM Transactions on Architecture and Code Optimization</u> (**TACO**), March 2023. [arXiv version]

Presented at the 18th HiPEAC Conference, Toulouse, France, January 2023.

[Slides (pptx) (pdf)]

[Longer Lecture Slides (pptx) (pdf)]

[Lecture Video (40 minutes)]

[PiDRAM Source Code]

### PiDRAM: A Holistic End-to-end FPGA-based Framework for <u>Processing-in-DRAM</u>

Ataberk Olgun§ Juan Gómez Luna§ Konstantinos Kanellopoulos§ Behzad Salami§ Hasan Hassan§ Oğuz Ergin† Onur Mutlu§

§ETH Zürich †TOBB University of Economics and Technology

### DRAM Chips Are Already (Quite) Capable!

Appears at HPCA 2024: https://arxiv.org/pdf/2402.18736.pdf

### Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis

İsmail Emir Yüksel Yahya Can Tuğrul Ataberk Olgun F. Nisa Bostancı A. Giray Yağlıkçı Geraldo F. Oliveira Haocong Luo Juan Gómez-Luna Mohammad Sadrosadati Onur Mutlu

ETH Zürich

We experimentally demonstrate that COTS DRAM chips are capable of performing 1) functionally-complete Boolean operations: NOT, NAND, and NOR and 2) many-input (i.e., more than two-input) AND and OR operations. We present an extensive characterization of new bulk bitwise operations in 256 off-the-shelf modern DDR4 DRAM chips. We evaluate the reliability of these operations using a metric called success rate: the fraction of correctly performed bitwise operations. Among our 19 new observations, we highlight four major results. First, we can perform the NOT operation on COTS DRAM chips with 98.37% success rate on average. Second, we can perform up to 16-input NAND, NOR, AND, and OR operations on COTS DRAM chips with high reliability (e.g., 16-input NAND, NOR, AND, and OR with average success rate of 94.94%, 95.87%, 94.94%, and 95.85%, respectively). Third, data pattern only slightly

### The Capability of COTS DRAM Chips

We demonstrate that COTS DRAM chips:

1. Can copy one row into up to 31 other rows with >99.98% success rate
2. Can perform NOT operation with up to 32 output operands
3. Can perform up to 16-input AND, NAND, OR, and NOR operations
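
The success rate metric mentioned above (the fraction of correctly performed bitwise operations) can be sketched as a simple comparison between expected and observed result bits. The helper names and the injected single-bit fault below are hypothetical; the paper's actual measurement methodology (number of trials, per-cell aggregation) is not reproduced here.

```python
def success_rate(expected: list[int], observed: list[int]) -> float:
    """Fraction of bit positions where the observed result matches the expected one."""
    if len(expected) != len(observed) or not expected:
        raise ValueError("expected and observed must be non-empty and equal length")
    correct = sum(e == o for e, o in zip(expected, observed))
    return correct / len(expected)


def bulk_not(bits: list[int]) -> list[int]:
    """Reference (error-free) NOT over a row of bits."""
    return [1 - b for b in bits]


if __name__ == "__main__":
    source_row = [1, 0, 1, 1, 0, 0, 1, 0]
    expected = bulk_not(source_row)
    observed = list(expected)
    observed[3] ^= 1  # hypothetical: one cell returns the wrong value
    print(f"{success_rate(expected, observed):.2%}")
```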

### In-DRAM Physical Unclonable Functions

Jeremie S. Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu,
 "The DRAM Latency PUF: Quickly Evaluating Physical Unclonable
 Functions by Exploiting the Latency-Reliability Tradeoff in Modern DRAM Devices"

Proceedings of the <u>24th International Symposium on High-Performance Computer</u> <u>Architecture</u> (**HPCA**), Vienna, Austria, February 2018.

[Lightning Talk Video]

[Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

[Full Talk Lecture Video (28 minutes)]

### The DRAM Latency PUF:

Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

Jeremie S. Kim<sup>†§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Onur Mutlu<sup>§†</sup>

<sup>†</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich

### In-DRAM True Random Number Generation

Jeremie S. Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu,
 "D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput"

Proceedings of the <u>25th International Symposium on High-Performance Computer</u> <u>Architecture</u> (**HPCA**), Washington, DC, USA, February 2019.

[Slides (pptx) (pdf)]

[Full Talk Video (21 minutes)]

[Full Talk Lecture Video (27 minutes)]

Top Picks Honorable Mention by IEEE Micro.

### D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers with Low Latency and High Throughput

Jeremie S. Kim<sup>‡§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Lois Orosa<sup>§</sup> Onur Mutlu<sup>§‡</sup>

<sup>‡</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich

### In-DRAM True Random Number Generation

 Ataberk Olgun, Minesh Patel, A. Giray Yaglikci, Haocong Luo, Jeremie S. Kim, F. Nisa Bostanci, Nandita Vijaykumar, Oguz Ergin, and Onur Mutlu,

"QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips"

Proceedings of the <u>48th International Symposium on Computer Architecture</u> (**ISCA**), Virtual, June 2021.

[Slides (pptx) (pdf)]

[Short Talk Slides (pptx) (pdf)]

[Talk Video (25 minutes)]

[SAFARI Live Seminar Video (1 hr 26 mins)]

### QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAM Chips

Ataberk Olgun<sup>§†</sup> Minesh Patel<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup> Haocong Luo<sup>§</sup> Jeremie S. Kim<sup>§</sup> F. Nisa Bostancı<sup>§†</sup> Nandita Vijaykumar<sup>§⊙</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

§ETH Zürich <sup>†</sup>TOBB University of Economics and Technology <sup>⊙</sup>University of Toronto

#### In-DRAM True Random Number Generation

F. Nisa Bostanci, Ataberk Olgun, Lois Orosa, A. Giray Yaglikci, Jeremie S. Kim, Hasan Hassan, Oguz Ergin, and Onur Mutlu,

"DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators"

Proceedings of the <u>28th International Symposium on High-Performance Computer Architecture</u> (**HPCA**), Virtual, April 2022.

[Slides (pptx) (pdf)]

[Short Talk Slides (pptx) (pdf)]

#### DR-STRaNGe: End-to-End System Design for DRAM-based True Random Number Generators

F. Nisa Bostancı<sup>†§</sup> Ataberk Olgun<sup>†§</sup> Lois Orosa<sup>§</sup> A. Giray Yağlıkçı<sup>§</sup> Jeremie S. Kim<sup>§</sup> Hasan Hassan<sup>§</sup> Oğuz Ergin<sup>†</sup> Onur Mutlu<sup>§</sup>

§ETH Zürich <sup>†</sup>TOBB University of Economics and Technology

#### In-Flash Bulk Bitwise Execution

Jisung Park, Roknoddin Azizi, Geraldo F. Oliveira, Mohammad Sadrosadati, Rakesh Nadig, David Novo, Juan Gómez-Luna, Myungsuk Kim, and Onur Mutlu,

"Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory"

Proceedings of the <u>55th International Symposium on Microarchitecture</u> (**MICRO**), Chicago, IL, USA, October 2022.

[Slides (pptx) (pdf)]

[Longer Lecture Slides (pptx) (pdf)]

[Lecture Video (44 minutes)]

[arXiv version]

#### Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory

Jisung Park<sup>§∇</sup> Roknoddin Azizi<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Mohammad Sadrosadati<sup>§</sup> Rakesh Nadig<sup>§</sup> David Novo<sup>†</sup> Juan Gómez-Luna<sup>§</sup> Myungsuk Kim<sup>‡</sup> Onur Mutlu<sup>§</sup>

§ETH Zürich <sup>∇</sup>POSTECH <sup>†</sup>LIRMM, Univ. Montpellier, CNRS <sup>‡</sup>Kyungpook National University

## PIM Review and Open Problems

## A Modern Primer on Processing in Memory

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup>

SAFARI Research Group

<sup>a</sup>ETH Zürich

<sup>b</sup>Carnegie Mellon University

<sup>c</sup>University of Illinois at Urbana-Champaign

<sup>d</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun,

"A Modern Primer on Processing in Memory"

Invited Book Chapter in Emerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, Springer, to be published in 2021.

## Eliminating the Adoption Barriers

# How to Enable Adoption of Processing in Memory

## Potential Barriers to Adoption of PIM

1. **Applications** & **software** for PIM
2. Ease of **programming** (interfaces and compiler/HW support)
3. **System** and **security** support: coherence, synchronization, virtual memory, isolation, communication interfaces, ...
4. **Runtime** and **compilation** systems for adaptive scheduling, data mapping, access/sharing control, ...
5. **Infrastructures** to assess benefits and feasibility

All can be solved with a change of mindset

#### We Need to Revisit the Entire Stack

With a memory-centric mindset



We can get there step by step

## Security Issues in Processing in Memory

- Does PIM make security better or easier?
- Does PIM make security worse?
- Many interesting & important questions here
- Some recent papers:
  - Evaluating Homomorphic Operations on a Real-World Processing-In-Memory System [IISWC 2023]
  - CIPHERMATCH: Accelerating Homomorphic Encryption based String Matching via Memory-Efficient Data Packing and In-Flash Processing [ASPLOS 2025]
  - Amplifying Main Memory-Based Timing Covert and Side Channels using Processing-in-Memory Operations [arxiv 2024]


### Potential Security Issues & Benefits (I)

#### Can PIM worsen security?

- Worsened or easier-to-induce physical issues (e.g., RowHammer)?
- Worsened or new side channels?
- Hardware bugs?
- New threat models?
- **...**

#### Can PIM enhance security?

- Less exposure of data (& keys?)
- In-memory (homomorphic) encryption & cryptographic hashing
- Execution of security functions; trusted execution in memory
- Support for security primitives (TRNGs, PUFs, encryption, ...)
- More or better isolation, virtualization, containerization?
- **...**

## Potential Security Issues & Benefits (II)

- Security analysis of PIM Systems
  - Different types of PIM: PnM vs. PuM
  - Different locations: cache, MC, DRAM, NVM, storage, remote, ...
  - General-purpose vs. special-purpose PIM?
  - Multi tenancy vs. single workload?
  - Concurrent host and PIM access?
  - Memory bus protection; memory wire(s) protection?
  - Robustness issues like RowHammer, RowPress, ...
  - **...**
- Can PIM support (more) secure execution of workloads?
  - What is needed to do so?
  - Secure PIM enclaves?
  - **...**

## PIM Helps Security: Many Examples (I)

Harshita Gupta, Mayank Kabra, Juan Gómez-Luna, Konstantinos Kanellopoulos, and Onur Mutlu, "Evaluating Homomorphic Operations on a Real-World Processing-In-Memory System" Proceedings of the 2023 IEEE International Symposium on Workload Characterization Poster Session (IISWC), Ghent, Belgium, October 2023.

[arXiv version]

[Lightning Talk Slides (pptx) (pdf)]

[Poster (pptx) (pdf)]

#### **Evaluating Homomorphic Operations on a Real-World Processing-In-Memory System**

Harshita Gupta\* Mayank Kabra\* Juan Gómez-Luna Konstantinos Kanellopoulos Onur Mutlu

ETH Zürich

## PIM Helps Security: Many Examples (II)

Mayank Kabra, Rakesh Nadig, Harshita Gupta, Rahul Bera, Manos Frouzakis, Vamanan Arulchelvan, Yu Liang, Haiyu Mao, Mohammad Sadrosadati, and Onur Mutlu, "CIPHERMATCH: Accelerating Homomorphic Encryption based String Matching via Memory-Efficient Data Packing and In-Flash Processing," Proceedings of the 30th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Rotterdam, Netherlands, April 2025.

#### **CIPHERMATCH:**

#### Accelerating Homomorphic Encryption-Based String Matching via Memory-Efficient Data Packing and In-Flash Processing

Mayank Kabra, ETH Zürich, Zürich, Switzerland

Rakesh Nadig, ETH Zürich, Zürich, Switzerland

Harshita Gupta, ETH Zürich, Zürich, Switzerland

Rahul Bera, ETH Zürich, Zürich, Switzerland

Manos Frouzakis, ETH Zürich, Zürich, Switzerland

Vamanan Arulchelvan, ETH Zürich, Zürich, Switzerland

Yu Liang, ETH Zürich, Zürich, Switzerland

Haiyu Mao, King's College London, London, United Kingdom

Mohammad Sadrosadati, ETH Zürich, Zürich, Switzerland

Onur Mutlu, ETH Zürich, Zürich, Switzerland

#### PIM Worsens Side Channels: An Example

#### Amplifying Main Memory-Based Timing Covert and Side Channels using Processing-in-Memory Operations

Konstantinos Kanellopoulos<sup>†\*</sup> F. Nisa Bostancı<sup>†\*</sup> Ataberk Olgun<sup>†</sup> A. Giray Yağlıkçı<sup>†</sup> İsmail Emir Yüksel<sup>†</sup> Nika Mansouri Ghiasi<sup>†</sup> Zülal Bingöl<sup>†‡</sup> Mohammad Sadrosadati<sup>†</sup> Onur Mutlu<sup>†</sup>

<sup>†</sup>ETH Zürich <sup>‡</sup>Bilkent University

## A Short Talk on Security of PIM Systems



# Concluding Remarks

## Summary: Three Major Limiters

- Technology scaling is not going well
- System complexity is increasing; old methods not keeping up
- Processor-centric designs are not keeping up
- These affect all metrics we care about
- These have fundamental impact on security and how we build secure systems
- We need to revisit how we build architectures and how we secure them


## Funding Acknowledgments

- Alibaba, AMD, ASML, Google, Facebook, Hi-Silicon, HP Labs, Huawei, IBM, Intel, Microsoft, Nvidia, Oracle, Qualcomm, Rambus, Samsung, Seagate, VMware, Xilinx
- NSF
- NIH
- GSRC
- SRC
- CyLab
- EFCL
- SNSF
- ACCESS

## Thank you!

## Acknowledgments



Think BIG, Aim HIGH!

https://safari.ethz.ch

#### SAFARI Newsletter June 2023 Edition

https://safari.ethz.ch/safari-newsletter-june-2023/






#### SAFARI Newsletter July 2024 Edition

https://safari.ethz.ch/safari-newsletter-july-2024/



## Referenced Papers, Talks, Artifacts

All are available at

https://people.inf.ethz.ch/omutlu/projects.htm

https://www.youtube.com/onurmutlulectures

https://github.com/CMU-SAFARI/

#### Open Source Tools: SAFARI GitHub



#### SAFARI Research Group at ETH Zurich and Carnegie Mellon University

Site for source code and tools distribution from SAFARI Research Group at ETH Zurich and Carnegie Mellon University.

https://safari.ethz.ch/

- prim-benchmarks (C): PrIM (Processing-In-Memory benchmarks) is the first benchmark suite for a real-world processing-in-memory (PIM) architecture. PrIM is developed to evaluate, analyze, and characterize the first publ...
- ramulator (C++): A Fast and Extensible DRAM Simulator, with built-in support for modeling many different DRAM technologies including DDRx, LPDDRx, GDDRx, WIOx, HBMx, and various academic proposals. Described in the...
- MQSim (C++): MQSim is a fast and accurate simulator modeling the performance of modern multi-queue (MQ) SSDs as well as traditional SATA based SSDs. MQSim faithfully models new high-bandwidth protocol implement...
- rowhammer (C): Source code for testing the Row Hammer error mechanism in DRAM devices. Described in the ISCA 2014 paper by Kim et al. at http://users.ece.cmu.edu/~omutlu/pub/dram-row-hammer\_isca14.pdf.
- SoftMC (Verilog): SoftMC is an experimental FPGA-based memory controller design that can be used to develop tests for DDR3 SODIMMs using a C++ based API. The design, the interface, and its capabilities and limitatio...
- Pythia: A customizable hardware prefetching framework using online reinforcement learning as described in the MICRO 2021 paper by Bera et al. (https://arxiv.org/pdf/2109.12021.pdf).






#### ISMM 2025

Tue 17 Jun 2025, Seoul, South Korea

co-located with PLDI 2025

#### International Symposium on Memory Management



Call for Papers

Update 2025-03-06: One week submission extension, submissions now due 2025-03-18 (AoE UTC-12h)

Welcome to the home page of the 2025 ACM SIGPLAN International Symposium on Memory Management (ISMM 2025)! ISMM is the premier forum dedicated to research in memory management, covering the areas of memory performance, allocator design, garbage collection, architectural support for memory management, persistent memories, emerging memory technologies, and more.

ISMM'25 will be held in-person as part of PLDI'25, sharing the venue and activities.

#### **Code of Conduct**

ISMM follows the **ACM Policy Against Harassment at ACM Activities**. Please familiarize yourself with the policy and guide for reporting unacceptable behavior.



#### DRAMSec 2025

## Fifth Workshop on DRAM Security (DRAMSec) June 21, 2025, co-located with ISCA 2025


DRAM is the most prevalent memory technology used in laptops, mobile phones, workstations and servers. As such, its security is paramount, yet DRAM attacks remain as viable as ever despite many attempts to resolve its security problems. The toolkit of DRAM disturb attacks has expanded with the introduction of new techniques such as Half-Double and RowPress, and it is likely that additional forms of disturbance errors (and reliability and security issues) will emerge as we scale DRAM devices to smaller feature sizes. DRAM is also plagued by additional forms of attack, including side-channel, Denial-of-Service (DoS), and cold-boot attacks.

Against this backdrop, the industry is introducing new DRAM security solutions that require independent scrutiny from academia. Academia continues to propose novel fixes for Rowhammer, often without the benefit of insight into constraints faced by the industry.

#### PIM Tutorial November 2024 Edition

#### MICRO 2024 - Tutorial on Memory-Centric Computing Systems

Saturday, November 2<sup>nd</sup>, Austin, Texas, USA

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati,

Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/micro24-memorycentric-tutorial/

- Overview of PIM | PIM taxonomy
- PIM in memory & storage
- Real-world PNM systems
- PUM for bulk bitwise operations
- Programming techniques & tools
- Infrastructures for PIM Research
- Research challenges & opportunities





https://www.youtube.com/watch?v=KV2MXvcBgb0

### PIM Tutorial @ PPoPP/HPCA/CGO/CC

#### PPoPP 2025 - Tutorial on Memory-Centric Computing Systems

March 1st, Las Vegas, Nevada, USA

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati,

Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/ppopp25-memorycentric-tutorial/

PPoPP 2025: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2025, March 1-5, 2025, Las Vegas, NV, USA




https://www.youtube.com/live/NkDY6osus6g

## Upcoming PIM Tutorials/Workshops (I)

#### ASPLOS 2025 - 1<sup>st</sup> Workshop on Memory-Centric Computing Systems

Sunday, March 30<sup>th</sup>, Rotterdam, The Netherlands

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati,

Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/asplos25-MCCSys/doku.php




## Upcoming PIM Tutorials/Workshops (II)

#### ICS 2025 - 2<sup>nd</sup> Workshop on Memory-Centric Computing Systems

Sunday, June 8th, Salt Lake City, USA

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati,

Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/ics25-MCCSys/doku.php




## Upcoming PIM Tutorials/Workshops (III)

#### ISCA 2025 - 3<sup>rd</sup> Workshop on Memory-Centric Computing Systems

Saturday, 21st June, 2025, Tokyo, Japan

Organizers: Geraldo F. Oliveira, Dr. Mohammad Sadrosadati,

Ataberk Olgun, Professor Onur Mutlu

Program: https://events.safari.ethz.ch/isca25-MCCSys/doku.php




# Backup Slides – Longer Version

#### How Do We Make Sure Solution is Good?

- Many challenges (some below)
- Security by obscurity (as done in JEDEC DDR5 spec) unhelpful
- How do we guarantee we use correct thresholds?
  - Determining RH threshold is not easy
  - Many factors & conditions: RowPress, temperature, voltage, spatial variation, aging, and the unknowns
- How do we guarantee the correct mitigating actions?
  - In the presence of blast radius, address remapping, ...
- How do we guarantee we perform accurate bookkeeping?
  - Updating counters properly? Row open time? When to reset? Worst-case access patterns? ...
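To make the bookkeeping questions above concrete, here is a minimal sketch of a counter-based tracker. It is not any specific published or JEDEC mechanism: the class name, the threshold value, and the reset policy are all assumptions chosen for illustration. It deliberately ignores row open time (RowPress), address remapping, and worst-case access patterns, which are exactly the open issues the slide raises.

```python
from collections import defaultdict


class ActivationTracker:
    """Toy per-row activation counter with preventive neighbor refresh.

    Hypothetical model: threshold, blast_radius and the reset policy are
    illustrative assumptions, not values from any specification.
    """

    def __init__(self, threshold, blast_radius, num_rows):
        self.threshold = threshold        # assumed read-disturbance threshold
        self.blast_radius = blast_radius  # how many rows away a victim can be
        self.num_rows = num_rows
        self.counts = defaultdict(int)    # activations per row in this window
        self.refreshed = []               # log of preventively refreshed rows

    def activate(self, row):
        """Record one activation and mitigate if the threshold is reached."""
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            # Refresh every in-range neighbor within the blast radius.
            for distance in range(1, self.blast_radius + 1):
                for victim in (row - distance, row + distance):
                    if 0 <= victim < self.num_rows:
                        self.refreshed.append(victim)
            self.counts[row] = 0          # assumed policy: reset after mitigating

    def end_of_refresh_window(self):
        """Assumed policy: all counters are cleared once per refresh window."""
        self.counts.clear()


tracker = ActivationTracker(threshold=4, blast_radius=2, num_rows=8)
for _ in range(4):
    tracker.activate(0)
print(tracker.refreshed)
```

Even this toy version shows why the threshold matters: if the real threshold is lower than the configured one (for example under long row open times), the victims are refreshed too late.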

#### Good at What Cost?

- Even if we magically assume the solution prevents bitflips...
- Does the problem turn into a memory performance attack?
- How do we avoid large performance and energy losses?
- Is there a smarter way of handling things more holistically?
  - Better partitioning of responsibilities between CPU and memory
  - Our memory interface is terrible
    - DRAM has "no real freedom" to do things internally
    - DRAM should be more self-managing; we need more open minds

#### What is RowPress?

# Keeping a DRAM row **open for a long time** causes bitflips in adjacent rows

These bitflips do **NOT** require many row activations

Only one activation is enough in some cases!

#### Real DRAM Chip Characterization (II)

#### DRAM chips tested

- 164 DDR4 chips from all 3 major DRAM manufacturers
- Covers different die densities and revisions

| Mfr.                 | #DIMMs | #Chips | Density | Die Rev. | Org. | Date  |
|----------------------|--------|--------|---------|----------|------|-------|
| Mfr. S<br>(Samsung)  | 2      | 8      | 8Gb     | B        | x8   | 20-53 |
|                      | 1      | 8      | 8Gb     | C        | x8   | N/A   |
|                      | 3      | 8      | 8Gb     | D        | x8   | 21-10 |
|                      | 2      | 8      | 4Gb     | F        | x8   | N/A   |
|                      | 1      | 8      | 4Gb     | A        | x8   | 19-46 |
| Mfr. H<br>(SK Hynix) | 1      | 8      | 4Gb     | X        | x8   | N/A   |
|                      | 2      | 8      | 16Gb    | A        | x8   | 20-51 |
|                      | 2      | 8      | 16Gb    | C        | x8   | 21-36 |
| Mfr. M<br>(Micron)   | 1      | 16     | 8Gb     | B        | x4   | N/A   |
|                      | 2      | 4      | 16Gb    | B        | x16  | 21-26 |
|                      | 1      | 16     | 16Gb    | E        | x4   | 20-14 |
|                      | 2      | 4      | 16Gb    | E        | x16  | 20-46 |
|                      | 1      | 4      | 16Gb    | F        | x16  | 21-50 |



## Major Takeaways from Real DRAM Chips

RowPress significantly **amplifies** DRAM's vulnerability to read disturbance

RowPress has a different underlying error mechanism from RowHammer

## Reported HC<sub>first</sub> Values (2012 - Now)



\*Not shown: Significant variance in HC<sub>first</sub> across vendors and die variations



### RowPress at $t_{AggON}$ = Refresh Interval

HC<sub>first</sub>: Number of hammers for the first bitflip\*

[Figure annotations: HC<sub>first</sub> = $\infty$ (all good); row on time = 7.2 us; DDR4-new @ 380] [Luo+, ISCA'23]

\*Not shown: Significant variance in HC<sub>first</sub> across vendors and die variations



### RowPress at $t_{AggON}$ = 9 × Refresh Interval



\*Not shown: Significant variance in HC<sub>first</sub> across vendors and die variations



### **Key Idea: NOT Operation**

Connect rows in neighboring subarrays through a NOT gate by simultaneously activating rows



### Key Idea: NAND, NOR, AND, OR

Manipulate the bitline voltage to express a wide variety of functions using multiple-row activation in neighboring subarrays
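As a purely functional illustration of what these operations compute (not of how the bitline voltage produces them), the sketch below models each DRAM row as a Python integer used as a bit-vector. The row width `ROW_BITS` and all function names are assumptions made for this example; the analog mechanism and its reliability are not modeled.

```python
from functools import reduce

ROW_BITS = 16                    # hypothetical row width, for illustration only
MASK = (1 << ROW_BITS) - 1       # keeps results within one row


def row_not(row):
    """Bitwise NOT of one row."""
    return ~row & MASK


def row_and(rows):
    """Many-input AND across a list of rows."""
    return reduce(lambda a, b: a & b, rows, MASK)


def row_or(rows):
    """Many-input OR across a list of rows."""
    return reduce(lambda a, b: a | b, rows, 0)


def row_nand(rows):
    """Many-input NAND: AND followed by NOT."""
    return row_not(row_and(rows))


def row_nor(rows):
    """Many-input NOR: OR followed by NOT."""
    return row_not(row_or(rows))


rows = [0b1100, 0b1010]
print(format(row_and(rows), "016b"))
print(format(row_nor(rows), "016b"))
```

The point of the bulk formulation is that one operation is applied to every bit position of the participating rows at once, so throughput scales with row width rather than with a processor's word size.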



### **DRAM Chips Tested**

- 256 DDR4 chips from two major DRAM manufacturers
- Covers different die revisions and chip densities

| Chip Mfr. | #Modules<br>(#Chips) | Die<br>Rev. | Mfr.<br>Date <sup>a</sup> | Chip<br>Density | Chip<br>Org. | Speed<br>Rate |
|-----------|----------------------|-------------|---------------------------|-----------------|--------------|---------------|
| SK Hynix  | 9 (72)               | M           | N/A                       | 4Gb             | x8           | 2666MT/s      |
|           | 5 (40)               | A           | N/A                       | 4Gb             | x8           | 2133MT/s      |
|           | 1 (16)               | A           | N/A                       | 8Gb             | x8           | 2666MT/s      |
|           | 1 (32)               | A           | 18-14                     | 4Gb             | x4           | 2400MT/s      |
|           | 1 (32)               | A           | 16-49                     | 8Gb             | x4           | 2400MT/s      |
|           | 1 (32)               | M           | 16-22                     | 8Gb             | x4           | 2666MT/s      |
| Samsung   | 1 (8)                | F           | 21-02                     | 4Gb             | x8           | 2666MT/s      |
|           | 2 (16)               | D           | 21-10                     | 8Gb             | x8           | 2133MT/s      |
|           | 1 (8)                | A           | 22-12                     | 8Gb             | x8           | 3200MT/s      |

### Performing AND, NAND, OR, and NOR



COTS DRAM chips can perform {2, 4, 8, 16}-input AND, NAND, OR, and NOR operations

### Performing AND, NAND, OR, and NOR



COTS DRAM chips can perform 16-input AND, NAND, OR, and NOR operations with very high success rate (>94%)
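One way to read a success-rate figure like the one above is as the fraction of bit positions where the result read back from the chip matches the expected result. The sketch below uses that definition as an assumption (the slide does not define the metric) and replaces real chip measurements with synthetic bit errors; `success_rate`, `inject_bit_errors`, and the 5% flip probability are all hypothetical.

```python
import random


def success_rate(expected, observed, width):
    """Fraction of the `width` bit positions where observed matches expected."""
    mask = (1 << width) - 1
    mismatches = bin((expected ^ observed) & mask).count("1")
    return 1.0 - mismatches / width


def inject_bit_errors(value, width, flip_prob, rng):
    """Flip each bit independently with probability flip_prob (synthetic noise)."""
    for bit in range(width):
        if rng.random() < flip_prob:
            value ^= 1 << bit
    return value


WIDTH = 64
rng = random.Random(0)                       # fixed seed for repeatability
operands = [rng.getrandbits(WIDTH) for _ in range(16)]

expected = operands[0]                       # reference 16-input AND
for row in operands[1:]:
    expected &= row

# Stand-in for the value read back from a real chip.
observed = inject_bit_errors(expected, WIDTH, flip_prob=0.05, rng=rng)
rate = success_rate(expected, observed, WIDTH)
print(f"success rate: {rate:.2%}")
```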

# Other Backup Slides

### Onur Mutlu's SAFARI Research Group

Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-january-2021/



Think BIG, Aim HIGH!

SAFARI

https://safari.ethz.ch

### SAFARI Newsletter December 2021 Edition

https://safari.ethz.ch/safari-newsletter-december-2021/



Think Big, Aim High








### SAFARI Newsletter June 2023 Edition

https://safari.ethz.ch/safari-newsletter-june-2023/



Think Big, Aim High





June 2023



### SAFARI Introduction & Research

Computer architecture, HW/SW, systems, bioinformatics, security, memory



Seminar in Computer Architecture - Lecture 5: Potpourri of Research Topics (Spring 2023)













SAFARI

https://www.youtube.com/watch?v=mV2OuB2djEs

### SAFARI PhD and Post-Doc Alumni

#### https://safari.ethz.ch/safari-alumni/

- Hasan Hassan (Rivos), EDAA Outstanding Dissertation Award 2023; S&P 2020 Best Paper Award, 2020 Pwnie Award, IEEE Micro TP HM 2020
- Christina Giannoula (Univ. of Toronto), NTUA Best Dissertation Award 2023
- Minesh Patel (Rutgers, Asst. Prof.), DSN Carter Award Best Thesis 2022; ETH Medal 2023; MICRO'20 & DSN'20 Best Paper Awards; ISCA HoF 2021
- Damla Senol Cali (Bionano Genomics), SRC TECHCON 2019 Best Student Presentation Award; RECOMB-Seq 2018 Best Poster Award
- Nastaran Hajinazar (Intel)
- Gagandeep Singh (AMD/Xilinx), FPL 2020 Best Paper Award Finalist
- Amirali Boroumand (Stanford Univ → Google), SRC TECHCON 2018 Best Presentation Award
- Jeremie Kim (Apple), EDAA Outstanding Dissertation Award 2020; IEEE Micro Top Picks 2019; ISCA/MICRO HoF 2021
- Nandita Vijaykumar (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021
- Kevin Hsieh (Microsoft Research, Senior Researcher)
- Justin Meza (Facebook), HiPEAC 2015 Best Student Presentation Award; ICCD 2012 Best Paper Award
- Mohammed Alser (ETH Zurich), IEEE Turkey Best PhD Thesis Award 2018
- Yixin Luo (Google), HPCA 2015 Best Paper Session
- Kevin Chang (Facebook), SRC TECHCON 2016 Best Student Presentation Award
- Rachata Ausavarungnirun (KMUTNB, Assistant Professor), NOCS 2015 and NOCS 2012 Best Paper Award Finalist
- Gennady Pekhimenko (Univ. of Toronto, Assistant Professor), ISCA Hall of Fame 2021; ASPLOS 2015 SRC Winner
- Vivek Seshadri (Microsoft Research)
- Donghyuk Lee (NVIDIA Research, Senior Researcher), HPCA Hall of Fame 2018
- Yoongu Kim (Software Robotics → Google), TCAD'19 Top Pick Award; IEEE Micro Top Picks'10; HPCA'10 Best Paper Session
- Lavanya Subramanian (Intel Labs → Facebook)
- Samira Khan (Univ. of Virginia, Assistant Professor), HPCA 2014 Best Paper Session
- Saugata Ghose (Univ. of Illinois, Assistant Professor), DFRWS-EU 2017 Best Paper Award
- Jawad Haj-Yahya (Huawei Research Zurich, Principal Researcher)
- Lois Orosa (Galicia Supercomputing Center, Director)
- Jisung Park (POSTECH, Assistant Professor)
- Gagandeep Singh (AMD/Xilinx, Researcher)
- Juan Gomez-Luna (NVIDIA, Researcher), ISPASS 2023 Best Paper Session

# Processing in Memory: Evaluation Methods

### Simulators (Open Source)

Ramulator 2.0 & Ramulator-PIM

DAMOVSim

UPMEMSim (UPMEM)

AiMSim (SK Hynix)

**...** 

### Ramulator + Gem5

 Haocong Luo, Yahya Can Tugrul, F. Nisa Bostanci, Ataberk Olgun, A. Giray Yaglikci, and Onur Mutlu,

"Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator" Preprint on arXiv, August 2023.

[arXiv version]

[Ramulator 2.0 Source Code]

# Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator

Haocong Luo, Yahya Can Tuğrul, F. Nisa Bostancı, Ataberk Olgun, A. Giray Yağlıkçı, and Onur Mutlu

https://arxiv.org/pdf/2308.11030.pdf

# Processing-in-Memory in the Real World

### Processing-in-Memory Landscape Today









[Samsung 2021]



[UPMEM 2019]

### Processing-in-Memory Landscape Today

IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 22, NO. 1, JANUARY-JUNE

#### Computational CXL-Memory Solution for Accelerating Memory-Intensive Applications

Joonseop Sim, Soohong Ahn, Taeyoung Ahn, Seungyong Lee, Myunghyun Rhee, Jooyoung Kim, Kwangsik Shin, Donguk Moon, Euiseok Kim, and Kyoung Park

Abstract—CXL interface is the up-to-date technology that enables effective memory expansion by providing a memory-sharing protocol in configuring heterogeneous devices. However, its limited physical bandwidth can be a significant bottleneck for emerging data-intensive applications. In this work, we propose a novel CXL-based memory disaggregation architecture with a real-world prototype demonstration, which overcomes the bandwidth limitation of the CXL interface using near-data processing. The experimental results demonstrate that our design achieves up to 1.9× better performance/power efficiency than the existing CPU system.

Index Terms—Compute express link (CXL), near-data-processing (NDP)





Fig. 6. FPGA prototype of proposed CMS card.

### Processing-in-Memory Landscape Today

## Samsung Processing in Memory Technology at Hot Chips 2023

By Patrick Kennedy - August 28, 2023

















Samsung PIM PNM For Transformer Based AI HC35\_Page\_24

### Samsung AxDIMM (2021)

- DDRx-PIM
  - DLRM recommendation system





#### **AxDIMM System**





Samsung Newsroom


### Samsung Develops Industry's First High Bandwidth Memory with AI Processing Power

Korea on February 17, 2021






### The new architecture will deliver over twice the system performance and reduce energy consumption by more than 70%

Samsung Electronics, the world leader in advanced memory technology, today announced that it has developed the industry's first High Bandwidth Memory (HBM) integrated with artificial intelligence (AI) processing power, the HBM-PIM. The new processing-in-memory (PIM) architecture brings powerful AI computing capabilities inside high-performance memory, to accelerate large-scale processing in data centers, high performance computing (HPC) systems and AI-enabled mobile applications.

Kwangil Park, senior vice president of Memory Product Planning at Samsung Electronics, stated, "Our groundbreaking HBM-PIM is the industry's first programmable PIM solution tailored for diverse AI-driven workloads such as HPC, training and inference. We plan to build upon this breakthrough by further collaborating with AI solution providers for even more advanced PIM-powered applications."

#### FIMDRAM based on HBM2



[3D Chip Structure of HBM with FIMDRAM]

#### **Chip Specification**

128DQ / 8CH / 16 banks / BL4

32 PCU blocks (1 FIM block/2 banks)

1.2 TFLOPS (4H)

FP16 ADD / Multiply (MUL) / Multiply-Accumulate (MAC) / Multiply-and-Add (MAD)

#### ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon<sup>1</sup>, Suk Han Lee<sup>1</sup>, Jaehoon Lee<sup>1</sup>, Sang-Hyuk Kwon<sup>1</sup>, Je Min Ryu<sup>1</sup>, Jong-Pil Son<sup>1</sup>, Seongil O<sup>1</sup>, Hak-Soo Yu<sup>1</sup>, Haesuk Lee<sup>1</sup>, Soo Young Kim<sup>1</sup>, Youngmin Cho<sup>1</sup>, Jin Guk Kim<sup>1</sup>, Jongyoon Choi<sup>1</sup>, Hyun-Sung Shin<sup>1</sup>, Jin Kim<sup>1</sup>, BengSeng Phuah<sup>1</sup>, HyoungMin Kim<sup>1</sup>, Myeong Jun Song<sup>1</sup>, Ahn Choi<sup>1</sup>, Daeho Kim<sup>1</sup>, SooYoung Kim<sup>1</sup>, Eun-Bong Kim<sup>1</sup>, David Wang<sup>2</sup>, Shinhaeng Kang<sup>1</sup>, Yuhwan Ro<sup>3</sup>, Seungwoo Seo<sup>3</sup>, JoonHo Song<sup>3</sup>, Jaeyoun Youn<sup>1</sup>, Kyomin Sohn<sup>1</sup>, Nam Sung Kim<sup>1</sup>

<sup>1</sup>Samsung Electronics, Hwaseong, Korea <sup>2</sup>Samsung Electronics, San Jose, CA <sup>3</sup>Samsung Electronics, Suwon, Korea

### **Programmable Computing Unit**

- Configuration of PCU block
  - Interface unit to control data flow
  - Execution unit to perform operations
  - Register group
    - 32 entries of CRF for instruction memory
    - 16 GRF for weight and accumulation
    - 16 SRF to store constants for MAC operations



#### [Block diagram of PCU in FIMDRAM]


#### [Available instruction list for FIM operation]

| Type           | CMD  | Description                 |
|----------------|------|-----------------------------|
| Floating Point | ADD  | FP16 addition               |
| Floating Point | MUL  | FP16 multiplication         |
| Floating Point | MAC  | FP16 multiply-accumulate    |
| Floating Point | MAD  | FP16 multiply and add       |
| Data Path      | MOVE | Load or store data          |
| Data Path      | FILL | Copy data from bank to GRFs |
| Control Path   | NOP  | Do nothing                  |
| Control Path   | JUMP | Jump instruction            |
| Control Path   | EXIT | Exit instruction            |
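To give a feel for how such a small instruction set can drive a multiply-accumulate loop, here is a toy interpreter for a subset of the commands listed above. Only the command names, the 32-entry CRF, and the 16 GRF entries come from the slides. The operand encoding (tuples of GRF indices), the loop-count form of `JUMP`, the rounding after every operation, and the omission of `MOVE` and the SRF are assumptions made for this sketch; it does not describe the actual FIMDRAM encoding or semantics.

```python
import struct

CRF_ENTRIES = 32   # instruction slots (from the slide)
GRF_ENTRIES = 16   # general registers (from the slide)


def fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def run(program, bank):
    """Execute a list of instruction tuples; return the final GRF contents.

    Hypothetical encodings:
      ("FILL", dst, addr)      grf[dst] = bank[addr]
      ("ADD",  dst, a, b)      grf[dst] = grf[a] + grf[b]
      ("MUL",  dst, a, b)      grf[dst] = grf[a] * grf[b]
      ("MAC",  dst, a, b)      grf[dst] += grf[a] * grf[b]
      ("MAD",  dst, a, b, c)   grf[dst] = grf[a] * grf[b] + grf[c]
      ("JUMP", target, times)  jump back to `target`, `times` times in total
      ("NOP",) / ("EXIT",)
    """
    if len(program) > CRF_ENTRIES:
        raise ValueError("program does not fit in the CRF")
    grf = [0.0] * GRF_ENTRIES
    jumps_left = {}                       # remaining iterations per JUMP slot
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "FILL":
            dst, addr = args
            grf[dst] = fp16(bank[addr])
        elif op == "ADD":
            dst, a, b = args
            grf[dst] = fp16(grf[a] + grf[b])
        elif op == "MUL":
            dst, a, b = args
            grf[dst] = fp16(grf[a] * grf[b])
        elif op == "MAC":
            dst, a, b = args
            grf[dst] = fp16(grf[dst] + grf[a] * grf[b])
        elif op == "MAD":
            dst, a, b, c = args
            grf[dst] = fp16(grf[a] * grf[b] + grf[c])
        elif op == "JUMP":
            target, times = args
            left = jumps_left.get(pc, times)
            if left > 0:
                jumps_left[pc] = left - 1
                pc = target
                continue
            jumps_left.pop(pc, None)      # loop finished; fall through
        elif op == "EXIT":
            break
        elif op != "NOP":
            raise ValueError(f"unsupported instruction: {op}")
        pc += 1
    return grf


program = [
    ("FILL", 0, 0),        # load first operand from the bank
    ("FILL", 1, 1),        # load second operand from the bank
    ("MAC", 2, 0, 1),      # accumulate the product into GRF entry 2
    ("JUMP", 2, 2),        # repeat the MAC two more times
    ("EXIT",),
]
print(run(program, bank=[1.5, 2.0])[2])
```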


### **Chip Implementation**

- Mixed design methodology to implement FIMDRAM
  - Full-custom + Digital RTL



[Digital RTL design for PCU block]


[Figure: FIMDRAM die floorplan. Two pseudo-channels (channel-0 and channel-1) each contain cell arrays for banks 0 to 15, with one PCU block shared by each pair of adjacent banks (0 & 1, 2 & 3, ..., 14 & 15). A TSV and control block region runs across the middle of the die.]

### SK Hynix Accelerator-in-Memory (2022)

SK hynix Newsroom


#### SK hynix Develops PIM, Next-Generation AI Accelerator

February 16, 2022







#### Seoul, February 16, 2022

SK hynix (or "the Company", www.skhynix.com) announced on February 16 that it has developed PIM\*, a next-generation memory chip with computing capabilities.

\*PIM(Processing In Memory): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory

It has been generally accepted that memory chips store data and CPU or GPU, like human brain, process data. SK hynix, following its challenge to such notion and efforts to pursue innovation in the next-generation smart memory, has found a breakthrough solution with the development of the latest technology.

SK hynix plans to showcase its PIM development at the world's most prestigious semiconductor conference, 2022 ISSCC\*, in San Francisco at the end of this month. The company expects continued efforts for innovation of this technology to bring the memory-centric computing, in which semiconductor memory plays a central role, a step closer to the reality in devices such as smartphones.

\*ISSCC: The International Solid-State Circuits Conference will be held virtually from Feb. 20 to Feb. 24 this year with a theme of "Intelligent Silicon for a Sustainable World"

For the first product that adopts the PIM technology, SK hynix has developed a sample of GDDR6-AiM (Accelerator\* in memory). The GDDR6-AiM adds computational functions to GDDR6\* memory chips, which process data at 16Gbps. A combination of GDDR6-AiM with CPU or GPU instead of a typical DRAM makes certain computation speed 16 times faster. GDDR6-AiM is widely expected to be adopted for machine learning, high-performance computing, and big data computation and storage.

11.1 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications

Seongju Lee, SK hynix, Icheon, Korea

In Paper 11.1, SK Hynix describes a 1ynm, GDDR6-based accelerator-in-memory with a command set for deep-learning operation. The 8Gb design achieves a peak throughput of 1TFLOPS with 1GHz MAC operations and supports major activation functions to improve



### SK Hynix Accelerator-in-Memory (2022)



ASPLOS 2023 Tutorial: Real-world Processing-in-Memory Systems for Modern Workloads



Streamed live on Mar 26, 2023. Livestream - Data-Centric Architectures: Fundamentally Improving Performance and Energy (Spring 2023)

https://events.safari.ethz.ch/asplos-...



### Alibaba PIM Recommendation System (2022)





Figure 29.1.7: Die micrographs of DRAM die, NE and ME. Detailed specifications of DRAM die and logic die.

#### 29.1 184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System

Dimin Niu<sup>1</sup>, Shuangchen Li<sup>1</sup>, Yuhao Wang<sup>1</sup>, Wei Han<sup>1</sup>, Zhe Zhang<sup>2</sup>, Yijin Guan<sup>2</sup>, Tianchan Guan<sup>3</sup>, Fei Sun<sup>1</sup>, Fei Xue<sup>1</sup>, Lide Duan<sup>1</sup>, Yuanwei Fang<sup>1</sup>, Hongzhong Zheng<sup>1</sup>, Xiping Jiang<sup>4</sup>, Song Wang<sup>4</sup>, Fengguo Zuo<sup>4</sup>, Yubing Wang<sup>4</sup>, Bing Yu<sup>4</sup>, Qiwei Ren<sup>4</sup>, Yuan Xie<sup>1</sup>

### UPMEM Processing-in-DRAM Engine (2019)

- Processing in DRAM Engine
- Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.
- Replaces standard DIMMs
  - DDR4 R-DIMM modules
    - 8GB+128 DPUs (16 PIM chips)
    - Standard 2x-nm DRAM process
  - Large amounts of compute & memory bandwidth
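The module figures above imply a few per-unit numbers that are useful when reasoning about data partitioning. The short calculation below derives them from the values on this slide, plus the 2,560-DPU system size mentioned later in the deck. It assumes that every DIMM in that system is the 8GB, 128-DPU, 16-chip type and that capacity is split evenly across DPUs; the variable names are illustrative.

```python
# Figures taken from the slides.
DIMM_CAPACITY_GB = 8
DPUS_PER_DIMM = 128
PIM_CHIPS_PER_DIMM = 16
SYSTEM_DPUS = 2560

# Derived values (assuming identical DIMMs and an even split of capacity).
dpus_per_chip = DPUS_PER_DIMM // PIM_CHIPS_PER_DIMM
mb_per_dpu = DIMM_CAPACITY_GB * 1024 // DPUS_PER_DIMM
dimms_in_system = SYSTEM_DPUS // DPUS_PER_DIMM
system_capacity_gb = dimms_in_system * DIMM_CAPACITY_GB

print(f"DPUs per PIM chip:   {dpus_per_chip}")
print(f"DRAM per DPU:        {mb_per_dpu} MB")
print(f"DIMMs in the system: {dimms_in_system}")
print(f"Total PIM capacity:  {system_capacity_gb} GB")
```

The small amount of DRAM attached to each DPU is why workloads need to be partitioned so that each DPU mostly works on its own slice of the data.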





### **UPMEM Memory Modules**

- E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips DIMM (2 ranks). DPUs @ 350 MHz





### 2,560-DPU Processing-in-Memory System



#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
IZZAT EL HAJJ, American University of Beirut, Lebanon
IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain
CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece
GERALDO F. OLIVEIRA, ETH Zürich, Switzerland
ONUR MUTLU, ETH Zürich, Switzerland




### More on the UPMEM PIM System



### Experimental Analysis of the UPMEM PIM Engine

### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland
IZZAT EL HAJJ, American University of Beirut, Lebanon
IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain
CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece
GERALDO F. OLIVEIRA, ETH Zürich, Switzerland
ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this *data movement bottleneck* requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory (PIM)*.

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called *DRAM Processing Units* (*DPUs*), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present *PrIM* (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.

https://arxiv.org/pdf/2105.03814.pdf

### UPMEM PIM System Summary & Analysis

Juan Gomez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu,

"Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware"

Invited Paper at Workshop on Computing with Unconventional Technologies (CUT), Virtual, October 2021.

[arXiv version]

[PrIM Benchmarks Source Code]

[Slides (pptx) (pdf)]

[Talk Video (37 minutes)]

[Lightning Talk Video (3 minutes)]

### Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

Juan Gómez-Luna (ETH Zürich), Izzat El Hajj (American University of Beirut), Ivan Fernandez (University of Malaga), Christina Giannoula (National Technical University of Athens), Geraldo F. Oliveira (ETH Zürich), Onur Mutlu (ETH Zürich)

### **PrIM Benchmarks: Application Domains**

| Domain                | Benchmark                     | Short name |
|-----------------------|-------------------------------|------------|
| Dense linear algebra  | Vector Addition               | VA         |
| Dense linear algebra  | Matrix-Vector Multiply        | GEMV       |
| Sparse linear algebra | Sparse Matrix-Vector Multiply | SpMV       |
| Databases             | Select                        | SEL        |
| Databases             | Unique                        | UNI        |
| Data analytics        | Binary Search                 | BS         |
| Data analytics        | Time Series Analysis          | TS         |
| Graph processing      | Breadth-First Search          | BFS        |
| Neural networks       | Multilayer Perceptron         | MLP        |
| Bioinformatics        | Needleman-Wunsch              | NW         |
| Image processing      | Image histogram (short)       | HST-S      |
| Image processing      | Image histogram (large)       | HST-L      |
| Parallel primitives   | Reduction                     | RED        |
| Parallel primitives   | Prefix sum (scan-scan-add)    | SCAN-SSA   |
| Parallel primitives   | Prefix sum (reduce-scan-scan) | SCAN-RSS   |
| Parallel primitives   | Matrix transposition          | TRNS       |
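The two prefix-sum entries in the table name two standard ways of parallelizing a scan over independent chunks. The sketch below shows both strategies in plain Python as a conceptual illustration only. It is not the PrIM source code: the chunking helper, the choice of an inclusive scan, and the idea that each chunk stands in for the data owned by one processing unit are assumptions made here.

```python
from itertools import accumulate


def split(data, parts):
    """Split data into at most `parts` contiguous chunks (one per unit)."""
    size = max(1, -(-len(data) // parts))   # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_offsets(totals):
    """Exclusive scan of the chunk totals: the offset each chunk must add."""
    return [0] + list(accumulate(totals))[:-1]


def scan_scan_add(data, parts):
    """Scan each chunk, scan the chunk totals, then add the offsets."""
    local = [list(accumulate(chunk)) for chunk in split(data, parts)]
    offsets = chunk_offsets([scanned[-1] for scanned in local])
    return [x + off for scanned, off in zip(local, offsets) for x in scanned]


def reduce_scan_scan(data, parts):
    """Reduce each chunk, scan the chunk totals, then scan with the offsets."""
    chunks = split(data, parts)
    offsets = chunk_offsets([sum(chunk) for chunk in chunks])
    out = []
    for chunk, off in zip(chunks, offsets):
        running = off                       # start each chunk from its offset
        for x in chunk:
            running += x
            out.append(running)
    return out


data = [1, 2, 3, 4, 5]
print(scan_scan_add(data, parts=2))
print(reduce_scan_scan(data, parts=2))
```

Both variants produce the same result. They differ in where the work goes: scan-scan-add writes each chunk twice (once for the local scan, once for the add), while reduce-scan-scan reads each chunk twice but writes it once.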

### PrIM Benchmarks are Open Source

- All microbenchmarks, benchmarks, and scripts
- https://github.com/CMU-SAFARI/prim-benchmarks



### **Understanding a Modern PIM Architecture**

### Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System

JUAN GÓMEZ-LUNA<sup>1</sup>, IZZAT EL HAJJ<sup>2</sup>, IVAN FERNANDEZ<sup>1,3</sup>, CHRISTINA GIANNOULA<sup>1,4</sup>, GERALDO F. OLIVEIRA<sup>1</sup>, AND ONUR MUTLU<sup>1</sup>

Corresponding author: Juan Gómez-Luna (e-mail: juang@ethz.ch).

https://arxiv.org/pdf/2105.03814.pdf

https://github.com/CMU-SAFARI/prim-benchmarks

<sup>1</sup>ETH Zürich

<sup>2</sup>American University of Beirut

<sup>3</sup>University of Malaga

<sup>4</sup>National Technical University of Athens

## More Security Implications (I)

"We can gain unrestricted access to systems of website visitors."

www.iaik.tugraz.at

Not there yet, but ...



ROOT privileges for web apps!





Daniel Gruss (@lavados), Clémentine Maurice (@BloodyTangerine), December 28, 2015 — 32c3, Hamburg, Germany

Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript (DIMVA'16)

Source: https://lab.dsst.io/32c3-slides/7197.html

### More Security Implications (II)

"Can gain control of a smart phone deterministically" Hammer And Root Millions of Androids

Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, CCS'16

### More Security Implications (III)

Using an integrated GPU in a mobile system to remotely escalate privilege via the WebGL interface.

IEEE S&P 2018



"GRAND PWNING UNIT" —

# Drive-by Rowhammer attack uses GPU to compromise an Android phone

JavaScript based GLitch pwns browsers by flipping bits inside memory chips.

**DAN GOODIN - 5/3/2018, 12:00 PM** 

# Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU

Pietro Frigo Vrije Universiteit Amsterdam p.frigo@vu.nl

Cristiano Giuffrida Vrije Universiteit Amsterdam giuffrida@cs.vu.nl

Herbert Bos Vrije Universiteit Amsterdam herbertb@cs.vu.nl

Kaveh Razavi Vrije Universiteit Amsterdam kaveh@cs.vu.nl

### More Security Implications (IV)

Rowhammer over RDMA (I) USENIX ATC 2018



THROWHAMMER -

# Packets over a LAN are all it takes to trigger serious Rowhammer bit flips

The bar for exploiting potentially serious DDR weakness keeps getting lower.

**DAN GOODIN - 5/10/2018, 5:26 PM** 

#### Throwhammer: Rowhammer Attacks over the Network and Defenses

Andrei Tatar VU Amsterdam

Radhesh Krishnan VU Amsterdam

Elias Athanasopoulos University of Cyprus

Cristiano Giuffrida VU Amsterdam

Herbert Bos VU Amsterdam

Kaveh Razavi VU Amsterdam

### More Security Implications (V)

Rowhammer over RDMA (II)



Nethammer—Exploiting DRAM Rowhammer Bug Through Network Requests



# Nethammer: Inducing Rowhammer Faults through Network Requests

Moritz Lipp Graz University of Technology

Misiker Tadesse Aga University of Michigan

Michael Schwarz Graz University of Technology

Daniel Gruss Graz University of Technology

Clémentine Maurice Univ Rennes, CNRS, IRISA

Lukas Raab Graz University of Technology

Lukas Lamster Graz University of Technology

### More Security Implications (VI)

**IEEE S&P 2020** 



### RAMBleed

### RAMBleed: Reading Bits in Memory Without Accessing Them

Andrew Kwong University of Michigan ankwong@umich.edu

Daniel Genkin University of Michigan genkin@umich.edu

Daniel Gruss Graz University of Technology daniel.gruss@iaik.tugraz.at

Yuval Yarom University of Adelaide and Data61 yval@cs.adelaide.edu.au

### More Security Implications (VII)

USENIX Security 2019

# Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks

Sanghyun Hong, Pietro Frigo<sup>†</sup>, Yiğitcan Kaya, Cristiano Giuffrida<sup>†</sup>, Tudor Dumitraș

University of Maryland, College Park

†Vrije Universiteit Amsterdam



#### A Single Bit-flip Can Cause Terminal Brain Damage to DNNs

One specific bit-flip in a DNN's representation leads to accuracy drop over 90%

Our research found that a specific bit-flip in a DNN's bitwise representation can cause the accuracy loss up to 90%, and the DNN has 40-50% parameters, on average, that can lead to the accuracy drop over 10% when individually subjected to such single bitwise corruptions...
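
Why can one flipped bit matter so much? A minimal sketch, assuming the weights are stored as IEEE 754 32-bit floats (the paper's actual experimental setup is not reproduced here, and the helper name `flip_bit` is hypothetical):

```python
import struct

def flip_bit(value, bit):
    """Return float32 `value` with bit `bit` inverted (0 = mantissa LSB, 31 = sign)."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))      # float32 -> raw 32-bit pattern
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

weight = 0.5
print(flip_bit(weight, 30))  # most significant exponent bit
print(flip_bit(weight, 0))   # least significant mantissa bit
```

Flipping the top exponent bit turns a weight of 0.5 into 2^127 (roughly 1.7e38), which can swamp every activation downstream. Flipping the lowest mantissa bit changes the same weight by only 2^-24. The damage therefore depends heavily on which bit is hit.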

### More Security Implications (VIII)

#### USENIX Security 2020

# DeepHammer: Depleting the Intelligence of Deep Neural Networks through Targeted Chain of Bit Flips

Fan Yao University of Central Florida fan.yao@ucf.edu

Adnan Siraj Rakin Arizona State University asrakin@asu.edu

Deliang Fan Arizona State University dfan@asu.edu

#### Degrade the inference accuracy to the level of Random Guess

Example: ResNet-20 for CIFAR-10, 10 output classes

Before attack, Accuracy: 90.2%

After attack, Accuracy: ~10% (1/10)



### Google's Half-Double RowHammer Attack (May 2021)

### Google Security Blog

The latest news and insights from Google on security and safety on the Internet

### Introducing Half-Double: New hammering technique for DRAM Rowhammer bug

May 25, 2021

Research Team: Salman Qazi, Yoongu Kim, Nicolas Boichat, Eric Shiu & Mattias Nissler

Today, we are sharing details around our discovery of Half-Double, a new Rowhammer technique that capitalizes on the worsening physics of some of the newer DRAM chips to alter the contents of memory.

Rowhammer is a DRAM vulnerability whereby repeated accesses to one address can tamper with the data stored at other addresses. Much like speculative execution vulnerabilities in CPUs, Rowhammer is a breach of the security guarantees made by the underlying hardware. As an electrical coupling phenomenon within the silicon itself, Rowhammer allows the potential bypass of hardware and software memory protection policies. This can allow untrusted code to break out of its sandbox and take full control of the system.
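
The access pattern described above can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not a model of real DRAM physics or of Half-Double specifically: the class `ToyDramBank`, the one-byte rows, the fixed flip threshold, and the deterministic flip are all hypothetical. In real chips, bit flips depend on the data pattern, on temperature, and on the individual device.

```python
class ToyDramBank:
    """Toy model: each row activation disturbs the two physically adjacent rows."""

    def __init__(self, num_rows, threshold):
        self.data = [0xFF] * num_rows      # one byte per row, for brevity
        self.disturbance = [0] * num_rows  # neighbour activations since this row was last accessed
        self.threshold = threshold         # hypothetical number of disturbances that flips a bit

    def activate(self, row):
        self.disturbance[row] = 0          # accessing a row restores its own charge
        for victim in (row - 1, row + 1):
            if 0 <= victim < len(self.data):
                self.disturbance[victim] += 1
                if self.disturbance[victim] == self.threshold:
                    self.data[victim] &= 0xFE  # model a single 1 -> 0 bit flip

bank = ToyDramBank(num_rows=8, threshold=1000)
for _ in range(500):
    bank.activate(2)  # aggressor rows 2 and 4 sandwich victim row 3
    bank.activate(4)
print(hex(bank.data[3]))
```

Row 3 is never activated by the loop, yet its stored value changes. This is the property that lets the attack sidestep protections that only check which addresses a program accesses.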

### More Security Implications (IX)

USENIX Security 2022

#### Half-Double: Hammering From the Next Row Over

Andreas Kogler<sup>1</sup> Jonas Juffinger<sup>1,2</sup> Salman Qazi<sup>3</sup> Yoongu Kim<sup>3</sup> Moritz Lipp<sup>4\*</sup> Nicolas Boichat<sup>3</sup> Eric Shiu<sup>5</sup> Mattias Nissler<sup>3</sup> Daniel Gruss<sup>1</sup>

<sup>1</sup>Graz University of Technology <sup>2</sup>Lamarr Security Research <sup>3</sup>Google <sup>4</sup>Amazon Web Services <sup>5</sup>Rivos

## More Security Implications?



### Aside: Intelligent Controller for NAND Flash



[DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, HPCA 2015, DSN 2015, MSST 2015, JSAC 2016, HPCA 2017, DFRWS 2017, PIEEE 2017, HPCA 2018, SIGMETRICS 2018]

NAND Daughter Board

### Intelligent Flash Controllers [PIEEE'17]



Proceedings of the IEEE, Sept. 2017

# Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives



This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

https://arxiv.org/pdf/1706.08642