# Rethinking Memory System Design Robustness, Energy, Performance

Onur Mutlu

omutlu@amail.com

https://people.inf.ethz.ch/omutlu

July 3, 2018 IOLTS Keynote Talk



**EH** zürich



### Current Research Focus Areas





Graphics and Vision Processing



 An embedded device that performs complex genome analysis in real time

### What Is a Genome Made Of?



**SAFARI** The discovery of DNA's double-helical structure (Watson+, 1953)

# DNA Sequencing

Goal:

- □ Find the complete sequence of A, C, G, T's in DNA.
- Challenge:
  - There is no machine that takes long DNA as an input, and gives the complete sequence as output
  - All sequencing machines chop DNA into pieces and identify relatively small pieces (but not how they fit together)

# Untangling Yarn Balls & DNA Sequencing





# Genome Sequencers



Roche/454



AB SOLID



Illumina HiSeq2000



SAFARI Ion Torrent PGM

Pacific Biosciences RS



Ion Torrent Proton



Illumina MiSeq



#### Oxford Nanopore MinION



Complete Genomics





Illumina NovaSeq 6000

#### Oxford Nanopore GridION

... and more! All produce data with different properties.



300 M bases/min bases/min

# Genome Sequence Alignment: Example



V meandate barandaris ID22052

Source: By Aaron E. Darling, István Miklós, Mark A. Ragan - Figure 1 from Darling AE, Miklós I, Ragan MA (2008). "Dynamics of Genome Rearrangement in Bacterial Populations". PLOS Genetics. DOI:10.1371/journal.pgen.1000128., CC BY 2.5, https://commons.wikimedia.org/w/index.pnp?curid=30550950

# Hash Table Based Read Mappers

- + Guaranteed to find *all* mappings  $\rightarrow$  sensitive
- + Can tolerate up to *e* errors



http://mrfast.sourceforge.net/

# Personalized copy number and segmental duplication maps using next-generation sequencing

Can Alkan<sup>1,2</sup>, Jeffrey M Kidd<sup>1</sup>, Tomas Marques-Bonet<sup>1,3</sup>, Gozde Aksay<sup>1</sup>, Francesca Antonacci<sup>1</sup>, Fereydoun Hormozdiari<sup>4</sup>, Jacob O Kitzman<sup>1</sup>, Carl Baker<sup>1</sup>, Maika Malig<sup>1</sup>, Onur Mutlu<sup>5</sup>, S Cenk Sahinalp<sup>4</sup>, Richard A Gibbs<sup>6</sup> & Evan E Eichler<sup>1,2</sup>

Alkan+, "Personalized copy number and segmental duplication maps using next-generation sequencing", Nature Genetics 2009.

### Read Mapping Execution Time Breakdown



# Filter fast before you align

# Minimize costly "approximate string comparisons"

# Our First Filter: Pure Software Approach

- Download source code and try for yourself
  - Download link to FastHASH

Xin et al. BMC Genomics 2013, 14(Suppl 1):S13 http://www.biomedcentral.com/1471-2164/14/S1/S13



**Open Access** 

#### PROCEEDINGS

### Accelerating read mapping with FastHASH

Hongyi Xin<sup>1</sup>, Donghyuk Lee<sup>1</sup>, Farhad Hormozdiari<sup>2</sup>, Samihan Yedkar<sup>1</sup>, Onur Mutlu<sup>1\*</sup>, Can Alkan<sup>3\*</sup>

From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013) Vancouver, Canada. 21-24 January 2013

### Shifted Hamming Distance: SIMD Acceleration

Bioinformatics, 31(10), 2015, 1553–1560 doi: 10.1093/bioinformatics/btu856 Advance Access Publication Date: 10 January 2015 Original Paper

OXFORD

Sequence analysis

### Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping

Hongyi Xin<sup>1,\*</sup>, John Greth<sup>2</sup>, John Emmons<sup>2</sup>, Gennady Pekhimenko<sup>1</sup>, Carl Kingsford<sup>3</sup>, Can Alkan<sup>4,\*</sup> and Onur Mutlu<sup>2,\*</sup>

Xin+, <u>"Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter</u> to Accelerate Alignment Verification in Read Mapping", Bioinformatics 2015.

### An Example Solution: GateKeeper



# FPGA-Based Alignment Filtering

 Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur Mutlu, and Can Alkan
 "GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping" *Bioinformatics*, [published online, May 31], 2017.
 [Source Code]
 [Online link at Bioinformatics Journal]

# GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping

Mohammed Alser 🖾 , Hasan Hassan, Hongyi Xin, Oğuz Ergin, Onur Mutlu 🖾 , Can Alkan 🖾

*Bioinformatics*, Volume 33, Issue 21, 1 November 2017, Pages 3355–3363, https://doi.org/10.1093/bioinformatics/btx342

Published: 31 May 2017 Article history -

# DNA Read Mapping & Filtering

- Problem: Heavily bottlenecked by Data Movement
- GateKeeper FPGA performance limited by DRAM bandwidth [Alser+, Bioinformatics 2017]
- Ditto for SHD on SIMD [Xin+, Bioinformatics 2015]
- Solution: Processing-in-memory can alleviate the bottleneck
- However, we need to design mapping & filtering algorithms to fit processing-in-memory

# In-Memory DNA Sequence Analysis

 Jeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu,
 "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using

**Processing-in-Memory Technologies**"

**BMC Genomics**, 2018. *Proceedings of the <u>16th Asia Pacific Bioinformatics Conferenc</u>e (APBC), Yokohama, Japan, January 2018. arxiv.org Version (pdf)* 

### GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies

Jeremie S. Kim<sup>1,6\*</sup>, Damla Senol Cali<sup>1</sup>, Hongyi Xin<sup>2</sup>, Donghyuk Lee<sup>3</sup>, Saugata Ghose<sup>1</sup>, Mohammed Alser<sup>4</sup>, Hasan Hassan<sup>6</sup>, Oguz Ergin<sup>5</sup>, Can Alkan<sup>4\*</sup> and Onur Mutlu<sup>6,1\*</sup>

From The Sixteenth Asia Pacific Bioinformatics Conference 2018 Yokohama, Japan. 15-17 January 2018

# New Genome Sequencing Technologies

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali 🕿, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017Published:02 April 2018Article history •



Oxford Nanopore MinION

Senol Cali+, "Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions," Briefings in Bioinformatics, 2018.

[Preliminary arxiv.org version]

# Nanopore Genome Assembly Pipeline



Figure 1. The analyzed genome assembly pipeline using nanopore sequence data, with its five steps and the associated tools for each step. SAFARI SAFARI SAFARI

### More on Genome Analysis: Another Talk

# Accelerating Genome Analysis A Primer on an Ongoing Journey

Onur Mutlu omutlu@gmail.com https://people.inf.ethz.ch/omutlu

May 21, 2018 HiCOMB-17 Keynote Talk



SAFARI

ETH zürich

### Four Key Directions

Fundamentally Secure/Reliable/Safe Architectures

- Fundamentally Energy-Efficient Architectures
  - Memory-centric (Data-centric) Architectures

Fundamentally Low-Latency Architectures

Architectures for Genomics, Medicine, Health

# Memory & Storage

# The Main Memory System



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

# The Main Memory System



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

# The Main Memory System



- Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
- Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits

### Memory System: A Shared Resource View



Most of the system is dedicated to storing and moving data

### State of the Main Memory System

- Recent technology, architecture, and application trends
   lead to new requirements
  - exacerbate old requirements
- DRAM and memory controllers, as we know them today, are (will be) unlikely to satisfy all requirements
- Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging
- We need to rethink the main memory system
  - to fix DRAM issues and enable emerging technologies
  - to satisfy all requirements



- Major Trends Affecting Main Memory
- Key Challenges and Solution Directions
  - Robustness: Reliability and Security
  - Low Latency/Energy and Latency/Energy/Reliability Tradeoffs
  - Energy and Performance: In-Memory Computation
- Concluding Remarks
- Summary

# Major Trends Affecting Main Memory (I)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern

### DRAM technology scaling is ending

# Major Trends Affecting Main Memory (II)

- Need for main memory capacity, bandwidth, QoS increasing
  - Multi-core: increasing number of cores/agents
  - Data-intensive applications: increasing demand/hunger for data
  - Consolidation: cloud computing, GPUs, mobile, heterogeneity

• Main memory energy/power is a key system design concern

DRAM technology scaling is ending

# Example: The Memory Capacity Gap

Core count doubling ~ every 2 years DRAM DIMM capacity doubling ~ every 3 years



Memory capacity per core expected to drop by 30% every two years

Trends worse for memory bandwidth per core!

### DRAM Capacity, Bandwidth, Latency Trends



1999 2003 2006 2008 2011 2013 2014 2015 2016 2017

Memory latency remains almost constant

# DRAM Latency Is Critical for Performance



### **In-memory Databases**

[Mao+, EuroSys'12; Clapp+ (**Intel**), IISWC'15]



### **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Graph/Tree Processing** 

[Xu+, IISWC'12; Umuroglu+, FPL'15]



**Datacenter Workloads** 

[Kanev+ (Google), ISCA'I5]

### DRAM Latency Is Critical for Performance





#### **In-memory Databases**

#### **Graph/Tree Processing**

### Long memory latency -> performance bottleneck



### **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



### **Datacenter Workloads**

[Kanev+ (Google), ISCA'I5]

# Major Trends Affecting Main Memory (III)

Need for main memory capacity, bandwidth, QoS increasing

- Main memory energy/power is a key system design concern
  - ~40-50% energy spent in off-chip memory hierarchy [Lefurgy, IEEE Computer'03] >40% power in DRAM [Ware, HPCA'10][Paul,ISCA'15]
  - DRAM consumes power even when not used (periodic refresh)
- DRAM technology scaling is ending

# Major Trends Affecting Main Memory (IV)

Need for main memory capacity, bandwidth, QoS increasing

Main memory energy/power is a key system design concern

### DRAM technology scaling is ending

- ITRS projects DRAM will not scale easily below X nm
- Scaling has provided many benefits:
  - higher capacity (density), lower cost, lower energy

# Major Trends Affecting Main Memory (V)

- DRAM scaling has already become increasingly difficult
  - Increasing cell leakage current, reduced cell reliability, increasing manufacturing difficulties [Kim+ ISCA 2014], [Liu+ ISCA 2013], [Mutlu IMW 2013], [Mutlu DATE 2017]
  - Difficult to significantly improve capacity, energy
- Emerging memory technologies are promising

| - |  |
|---|--|
|   |  |
| - |  |
|   |  |
|   |  |
|   |  |
|   |  |
|   |  |

# Major Trends Affecting Main Memory (V)

- DRAM scaling has already become increasingly difficult
  - Increasing cell leakage current, reduced cell reliability, increasing manufacturing difficulties [Kim+ ISCA 2014], [Liu+ ISCA 2013], [Mutlu IMW 2013], [Mutlu DATE 2017]
  - Difficult to significantly improve capacity, energy

### Emerging memory technologies are promising

| 3D-Stacked DRAM                                                              | higher bandwidth | smaller capacity                                          |
|------------------------------------------------------------------------------|------------------|-----------------------------------------------------------|
| <b>Reduced-Latency DRAM</b><br>(e.g., RL/TL-DRAM, FLY-RAM)                   | lower latency    | higher cost                                               |
| Low-Power DRAM<br>(e.g., LPDDR3, LPDDR4, Voltron)                            | lower power      | higher latency<br>higher cost                             |
| <b>Non-Volatile Memory (NVM)</b><br>(e.g., PCM, STTRAM, ReRAM, 3D<br>Xpoint) | larger capacity  | higher latency<br>higher dynamic power<br>lower endurance |

# Major Trend: Hybrid Main Memory



#### Hardware/software manage data allocation and movement to achieve the best of multiple technologies

Meza+, "Enabling Efficient and Scalable Hybrid Memories," IEEE Comp. Arch. Letters, 2012. Yoon+, "Row Buffer Locality Aware Caching Policies for Hybrid Memories," ICCD 2012 Best Paper Award.

#### SAFARI



# Main Memory Needs Intelligent Controllers



# Industry Is Writing Papers About It, Too

#### **DRAM Process Scaling Challenges**

#### Refresh

- · Difficult to build high-aspect ratio cell capacitors decreasing cell capacitance
- · Leakage current of cell access transistors increasing

#### tWR

- Contact resistance between the cell capacitor and access transistor increasing
- · On-current of the cell access transistor decreasing
- Bit-line resistance increasing

#### VRT

· Occurring more frequently with cell capacitance decreasing



# Call for Intelligent Memory Controllers

### **DRAM Process Scaling Challenges**

#### \* Refresh

 Difficult to build high-aspect ratio cell capacitors decreasing cell capacitance THE MEMORY FORUM 2014

# Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling

Uksong Kang, Hak-soo Yu, Churoo Park, \*Hongzhong Zheng, \*\*John Halbert, \*\*Kuljit Bains, SeongJin Jang, and Joo Sun Choi



Samsung Electronics, Hwasung, Korea / \*Samsung Electronics, San Jose / \*\*Intel



- Major Trends Affecting Main Memory
- Key Challenges and Solution Directions
  - Robustness: Reliability and Security
  - Low Latency/Energy and Latency/Energy/Reliability Tradeoffs
  - Energy and Performance: In-Memory Computation
- Concluding Remarks
- Summary

# Four Key Issues in Future Platforms

Fundamentally Secure/Reliable/Safe Architectures

- Fundamentally Energy-Efficient Architectures
  - Memory-centric (Data-centric) Architectures

Fundamentally Low-Latency Architectures

Architectures for Genomics, Medicine, Health

# Maslow's (Human) Hierarchy of Needs



• We need to start with reliability and security...

# How Reliable/Secure/Safe is This Bridge?





# Collapse of the "Galloping Gertie"



# How Secure Are These People?



### Security is about preventing unforeseen consequences

#### Source: https://s-media-cache-ak0.pinimg.com/originals/48/09/54/4809543a9c7700246a0cf8acdae27abf.jpg

SAFARI

# The DRAM Scaling Problem

- DRAM stores charge in a capacitor (charge-based memory)
  - Capacitor must be large enough for reliable sensing
  - Access transistor should be large enough for low leakage and high retention time
  - Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009]



DRAM capacity, cost, and energy/power hard to scale

# As Memory Scales, It Becomes Unreliable

- Data from all of Facebook's servers worldwide
- Meza+, "Revisiting Memory Errors in Large-Scale Production Data Centers," DSN'15.



SAFARI

Chip density (Gb)

# Large-Scale Failure Analysis of DRAM Chips

- Analysis and modeling of memory errors found in all of Facebook's server fleet
- Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field" Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015. [Slides (pptx) (pdf)] [DRAM Error Model]

Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field

Justin Meza Qiang Wu\* Sanjeev Kumar\* Onur Mutlu

Carnegie Mellon University \* Facebook, Inc.

# Infrastructures to Understand Such Issues



Elipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case (Lee et al., HPCA 2015)

AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems (Qureshi et al., DSN 2015) An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms (Liu et al., ISCA 2013)

The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study (Khan et al., SIGMETRICS 2014)



#### SAFARI

# Infrastructures to Understand Such Issues



### **SAFARI** Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

# SoftMC: Open Source DRAM Infrastructure

 Hasan Hassan et al., "SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies," HPCA 2017.

- Flexible
- Easy to Use (C++ API)
- Open-source github.com/CMU-SAFARI/SoftMC





### <u>https://github.com/CMU-SAFARI/SoftMC</u>

### SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies

Hasan Hassan<sup>1,2,3</sup> Nandita Vijaykumar<sup>3</sup> Samira Khan<sup>4,3</sup> Saugata Ghose<sup>3</sup> Kevin Chang<sup>3</sup> Gennady Pekhimenko<sup>5,3</sup> Donghyuk Lee<sup>6,3</sup> Oguz Ergin<sup>2</sup> Onur Mutlu<sup>1,3</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>TOBB University of Economics & Technology <sup>3</sup>Carnegie Mellon University <sup>4</sup>University of Virginia <sup>5</sup>Microsoft Research <sup>6</sup>NVIDIA Research

#### SAFARI

# Data Retention in Memory [Liu et al., ISCA 2013]

Retention Time Profile of DRAM looks like this:

# 64-128ms >256ms **Location** dependent 128-256ms Stored value pattern dependent Time dependent

#### SAFARI

A Curious Discovery [Kim et al., ISCA 2014]

# One can predictably induce errors in most DRAM memory chips

# A simple hardware failure mechanism can create a widespread system security vulnerability



## Modern DRAM is Prone to Disturbance Errors



Repeatedly reading a row enough times (before memory gets refreshed) induces disturbance errors in adjacent rows in most real DRAM chips you can buy today

Elipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)

# Most DRAM Modules Are Vulnerable

A company B company







**C** company

| Up to               | Up to               | Up to               |
|---------------------|---------------------|---------------------|
| 1.0×10 <sup>7</sup> | 2.7×10 <sup>6</sup> | 3.3×10 <sup>5</sup> |
| errors              | errors              | errors              |

Elipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)

# Recent DRAM Is More Vulnerable



# Recent DRAM Is More Vulnerable



# Recent DRAM Is More Vulnerable



All modules from 2012–2013 are vulnerable



loop: mov (X), %eax mov (Y), %ebx clflush (X) clflush (Y) mfence jmp loop



Download from: https://github.com/CMU-SAFARI/rowhammer



- Avoid *cache hits* Flush X from cache
- Avoid *row hits* to X
   Read Y in another row



Download from: <a href="https://github.com/CMU-SAFARI/rowhammer">https://github.com/CMU-SAFARI/rowhammer</a>



loop: mov (X), %eax mov (Y), %ebx clflush (X) clflush (Y) mfence jmp loop



Download from: https://github.com/CMU-SAFARI/rowhammer



loop: mov (X), %eax mov (Y), %ebx clflush (X) clflush (Y) mfence jmp loop



Download from: https://github.com/CMU-SAFARI/rowhammer



loop: mov (X), %eax mov (Y), %ebx clflush (X) clflush (Y) mfence jmp loop



Download from: <a href="https://github.com/CMU-SAFARI/rowhammer">https://github.com/CMU-SAFARI/rowhammer</a>

# Observed Errors in Real Systems

| CPU Architecture          | Errors | Access-Rate |
|---------------------------|--------|-------------|
| Intel Haswell (2013)      | 22.9K  | 12.3M/sec   |
| Intel Ivy Bridge (2012)   | 20.7K  | 11.7M/sec   |
| Intel Sandy Bridge (2011) | 16.1K  | 11.6M/sec   |
| AMD Piledriver (2012)     | 59     | 6.1M/sec    |

### A real reliability & security issue

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of 70 DRAM Disturbance Errors," ISCA 2014.

### One Can Take Over an Otherwise-Secure System

### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Abstract. Memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology

# Project Zero

Elipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)

News and updates from the Project Zero team at Google

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)

Monday, March 9, 2015

Exploiting the DRAM rowhammer bug to gain kernel privileges

# RowHammer Security Attack Example

- "Rowhammer" is a problem with some recent DRAM devices in which repeatedly accessing a row of memory can cause bit flips in adjacent rows (Kim et al., ISCA 2014).
  - Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (Kim et al., ISCA 2014)
- We tested a selection of laptops and found that a subset of them exhibited the problem.
- We built two working privilege escalation exploits that use this effect.
  - □ <u>Exploiting the DRAM rowhammer bug to gain kernel privileges</u> (Seaborn+, 2015)
- One exploit uses rowhammer-induced bit flips to gain kernel privileges on x86-64 Linux when run as an unprivileged userland process.
- When run on a machine vulnerable to the rowhammer problem, the process was able to induce bit flips in page table entries (PTEs).
- It was able to use this to gain write access to its own page table, and hence gain read-write access to all of physical memory.

## Security Implications



## Security Implications



It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

## Selected Readings on RowHammer (I)

- Our first detailed study: Rowhammer analysis and solutions (June 2014)
  - Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu,
     "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors"
     Proceedings of the <u>41st International Symposium on Computer Architecture</u> (ISCA), Minneapolis, MN, June 2014. [Slides (pptx) (pdf)] [Liahtning Session Slides (pptx) (pdf)] [Source Code and Data]
- Our Source Code to Induce Errors in Modern DRAM Chips (June 2014)
  - <u>https://github.com/CMU-SAFARI/rowhammer</u>
- Google Project Zero's Attack to Take Over a System (March 2015)
  - Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn+, 2015)
  - https://github.com/google/rowhammer-test
  - Double-sided Rowhammer

## Selected Readings on RowHammer (II)

- Remote RowHammer Attacks via JavaScript (July 2015)
  - http://arxiv.org/abs/1507.06955
  - https://github.com/IAIK/rowhammeris
  - Gruss et al., DIMVA 2016.
  - CLFLUSH-free Rowhammer
  - "A fully automated attack that requires nothing but a website with JavaScript to trigger faults on remote hardware."
  - "We can gain unrestricted access to systems of website visitors."
- ANVIL: Software-Based Protection Against Next-Generation Rowhammer Attacks (March 2016)
  - http://dl.acm.org/citation.cfm?doid=2872362.2872390
  - □ Aweke et al., ASPLOS 2016
  - CLFLUSH-free Rowhammer
  - Software based monitoring for rowhammer detection

## Selected Readings on RowHammer (III)

- Dedup Est Machina: Memory Deduplication as an Advanced Exploitation Vector (May 2016)
  - https://www.ieee-security.org/TC/SP2016/papers/0824a987.pdf
  - Bosman et al., IEEE S&P 2016.
  - Exploits Rowhammer and Memory Deduplication to overtake a browser
  - "We report on the first reliable remote exploit for the Rowhammer vulnerability running entirely in Microsoft Edge."
  - "[an attacker] ... can reliably "own" a system with all defenses up, even if the software is entirely free of bugs."

## Selected Readings on RowHammer (IV)

- Flip Feng Shui: Hammering a Needle in the Software Stack (August 2016)
  - https://www.usenix.org/system/files/conference/usenixsecurity16/sec16\_paper razavi.pdf
  - Razavi et al., USENIX Security 2016.
  - Combines memory deduplication and RowHammer
  - "A malicious VM can gain unauthorized access to a co-hosted VM running OpenSSH."
  - Breaks OpenSSH public key authentication
- Drammer: Deterministic Rowhammer Attacks on Mobile Platforms (October 2016)
  - <u>http://dl.acm.org/citation.cfm?id=2976749.2978406</u>
  - □ Van Der Veen et al., CCS 2016
  - **Can take over an ARM-based Android system deterministically**
  - Exploits predictable physical memory allocator behavior
    - Can deterministically place security-sensitive data (e.g., page table) in an attackerchosen, vulnerable location in memory

## Selected Readings on RowHammer (V)

- Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU (May 2018)
  - https://www.vusec.net/wp-content/uploads/2018/05/alitch.pdf
  - Frigo et al., IEEE S&P 2018.
  - The first end-to-end remote Rowhammer exploit on mobile platforms that use our GPU-based primitives in orchestration to compromise browsers on mobile devices in under two minutes.
- Throwhammer: Rowhammer Attacks over the Network and Defenses (July 2018)
  - https://www.cs.vu.nl/~herbertb/download/papers/throwhammer\_atc18.pdf
  - Tatar et al., USENIX ATC 2018.
  - "[We] show that an attacker can trigger and exploit Rowhammer bit flips directly from a remote machine by only sending network packets."

## Selected Readings on RowHammer (VI)

- Nethammer: Inducing Rowhammer Faults through Network Requests (July 2018)
  - https://arxiv.org/pdf/1805.04956.pdf
  - Lipp et al., arxiv.org 2018.
  - Nethammer is the first truly remote Rowhammer attack, without a single attacker-controlled line of code on the targeted system."

## More Security Implications (I)

#### "We can gain unrestricted access to systems of website visitors."

Not there yet, but ...



ROOT privileges for web apps!

Daniel Gruss (@lavados), Clémentine Maurice (@BloodyTangerine), December 28, 2015 - 32c3, Hamburg, Germany





Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript (DIMVA'16)

## More Security Implications (II)

"Can gain control of a smart phone deterministically"

## Hammer And Root

# androids Millions of Androids

Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, CCS'16<sup>82</sup>

Source: https://fossbytes.com/drammer-rowhammer-attack-android-root-devices/

## More Security Implications (III)

 Using an integrated GPU in a mobile system to remotely escalate privilege via the WebGL interface

ars TECHNICA

BIZ & IT TECH SCIENCE POLICY CARS GAMING & CULTURE

#### "GRAND PWNING UNIT" ---

## Drive-by Rowhammer attack uses GPU to compromise an Android phone

JavaScript based GLitch pwns browsers by flipping bits inside memory chips.

DAN GOODIN - 5/3/2018, 12:00 PM

## Grand Pwning Unit: Accelerating Microarchitectural Attacks with the GPU

Pietro Frigo Vrije Universiteit Amsterdam p.frigo@vu.nl Cristiano Giuffrida Vrije Universiteit Amsterdam giuffrida@cs.vu.nl Herbert Bos Vrije Universiteit Amsterdam herbertb@cs.vu.nl Kaveh Razavi Vrije Universiteit Amsterdam kaveh@cs.vu.nl

## More Security Implications (IV)

#### Rowhammer over RDMA (I)

ars TECHNICA

BIZ & IT TECH SCIENCE POLICY CARS GAMING & CULTURE

THROWHAMMER —

# Packets over a LAN are all it takes to trigger serious Rowhammer bit flips

The bar for exploiting potentially serious DDR weakness keeps getting lower.

DAN GOODIN - 5/10/2018, 5:26 PM

#### **Throwhammer: Rowhammer Attacks over the Network and Defenses**

Andrei Tatar VU Amsterdam Radhesh Krishnan VU Amsterdam Elias Athanasopoulos University of Cyprus

Herbert Bos VU Amsterdam Kaveh Razavi VU Amsterdam Cristiano Giuffrida VU Amsterdam

## More Security Implications (V)

Rowhammer over RDMA (II)

# Security in a serious way

Nethammer—Exploiting DRAM Rowhammer Bug Through Network Requests



#### Nethammer: Inducing Rowhammer Faults through Network Requests

Moritz Lipp Graz University of Technology

Daniel Gruss Graz University of Technology Misiker Tadesse Aga University of Michigan

Clémentine Maurice Univ Rennes, CNRS, IRISA

Lukas Lamster Graz University of Technology Michael Schwarz Graz University of Technology

Lukas Raab Graz University of Technology

## More Security Implications?



## Some Potential Solutions





• Refresh frequently **Power, Performance** 

• Sophisticated ECC

Cost, Power

## • Access counters Cost, Power, Complexity

## Apple's Patch for RowHammer

#### https://support.apple.com/en-gb/HT204934

Available for: OS X Mountain Lion v10.8.5, OS X Mavericks v10.9.5

Impact: A malicious application may induce memory corruption to escalate privileges

Description: A disturbance error, also known as Rowhammer, exists with some DDR3 RAM that could have led to memory corruption This issue was mitigated by increasing memory refresh rates.

CVE-ID

CVE-2015-3693 : Mark Seaborn and Thomas Dullien of Google, working from original research by Yoongu Kim et al (2014)

HP, Lenovo, and other vendors released similar patches

## **Our Solution to RowHammer**

- PARA: Probabilistic Adjacent Row Activation
- Key Idea
  - After closing a row, we activate (i.e., refresh) one of its neighbors with a low probability: p = 0.005

## Reliability Guarantee

- When p=0.005, errors in one year:  $9.4 \times 10^{-14}$
- By adjusting the value of p, we can vary the strength of protection against errors

# Advantages of PARA

- PARA refreshes rows infrequently
  - Low power
  - Low performance-overhead
    - Average slowdown: 0.20% (for 29 benchmarks)
    - Maximum slowdown: 0.75%
- PARA is stateless
  - Low cost
  - Low complexity
- PARA is an effective and low-overhead solution to prevent disturbance errors

## Requirements for PARA

- If implemented in DRAM chip
  - Enough slack in timing parameters
  - Plenty of slack today:
    - Lee et al., "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common Case," HPCA 2015.
    - Chang et al., "Understanding Latency Variation in Modern DRAM Chips," SIGMETRICS 2016.
    - Lee et al., "Design-Induced Latency Variation in Modern DRAM Chips," SIGMETRICS 2017.
    - Chang et al., "Understanding Reduced-Voltage Operation in Modern DRAM Devices," SIGMETRICS 2017.
    - Ghose et al., "What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study," SIGMETRICS 2018.
- If implemented in memory controller
  - Better coordination between memory controller and DRAM
  - Memory controller should know which rows are physically adjacent

## More on RowHammer Analysis

Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu,
 "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors"
 Proceedings of the <u>A1st International Symposium on Computer</u>
 <u>Architecture</u> (ISCA), Minneapolis, MN, June 2014.
 [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Source Code and Data]

## Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Yoongu Kim<sup>1</sup> Ross Daly<sup>\*</sup> Jeremie Kim<sup>1</sup> Chris Fallin<sup>\*</sup> Ji Hye Lee<sup>1</sup> Donghyuk Lee<sup>1</sup> Chris Wilkerson<sup>2</sup> Konrad Lai Onur Mutlu<sup>1</sup> <sup>1</sup>Carnegie Mellon University <sup>2</sup>Intel Labs

## Future of Memory Reliability/Security

## Onur Mutlu, "The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser" Invited Paper in Proceedings of the Desian, Automation, and Test in Europe Conference (DATE), Lausanne, Switzerland, March 2017. [Slides (pptx) (pdf)]

#### The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser

Onur Mutlu ETH Zürich onur.mutlu@inf.ethz.ch https://people.inf.ethz.ch/omutlu

#### SAFARI https://people.inf.ethz.ch/omutlu/pub/rowhammer-and-other-memory-issues\_date17.pdf 93

## Industry Is Writing Papers About It, Too

#### **DRAM Process Scaling Challenges**

#### Refresh

- · Difficult to build high-aspect ratio cell capacitors decreasing cell capacitance
- · Leakage current of cell access transistors increasing

#### tWR

- Contact resistance between the cell capacitor and access transistor increasing
- · On-current of the cell access transistor decreasing
- Bit-line resistance increasing

#### VRT

· Occurring more frequently with cell capacitance decreasing



## Call for Intelligent Memory Controllers

#### **DRAM Process Scaling Challenges**

#### \* Refresh

 Difficult to build high-aspect ratio cell capacitors decreasing cell capacitance THE MEMORY FORUM 2014

## Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling

Uksong Kang, Hak-soo Yu, Churoo Park, \*Hongzhong Zheng, \*\*John Halbert, \*\*Kuljit Bains, SeongJin Jang, and Joo Sun Choi



Samsung Electronics, Hwasung, Korea / \*Samsung Electronics, San Jose / \*\*Intel

## Aside: Intelligent Controller for NAND Flash



[DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, HPCA 2015, DSN 2015, MSST 2015, JSAC 2016, HPCA 2017, DFRWS 2017, PIEEE 2017, HPCA 2018, SIGMETRICS 2018]

NAND Daughter Board

Cai+, "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives," Proc. IEEE 2017.

## Aside: Intelligent Controller for NAND Flash



Proceedings of the IEEE, Sept. 2017

## Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives



This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

https://arxiv.org/pdf/1706.08642



# Main Memory Needs Intelligent Controllers

## Solution Direction: Principled Designs

# Design fundamentally secure computing architectures

# Predict and prevent such safety issues

## Architecting for Security

#### Understand: Methods for vulnerability modeling & discovery

- Modeling and prediction based on real (device) data and analysis
- Understanding vulnerabilities
- Developing reliable metrics

#### Architect: Principled architectures with security as key concern

- Good partitioning of duties across the stack
- Cannot give up performance and efficiency
- Patch-ability in the field

#### Design & Test: Principled design, automation, (online) testing

- Design for security
- High coverage and good interaction with system reliability methods

### Understand and Model with Experiments (DRAM)



## **SAFARI** Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

## Understand and Model with Experiments (Flash)



[DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, HPCA 2015, DSN 2015, MSST 2015, JSAC 2016, HPCA 2017, DFRWS 2017, PIEEE'17, HPCA'18]

NAND Daughter Board

Cai+, "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives," Proc. IEEE 2017.

## Understanding Flash Memory Reliability



Proceedings of the IEEE, Sept. 2017

## Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

SAFARI

https://arxiv.org/pdf/1706.08642

## Understanding Flash Memory Reliability

 Justin Meza, Qiang Wu, Sanjeev Kumar, and <u>Onur Mutlu</u>, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the <u>ACM International Conference on Measurement and</u> <u>Modeling of Computer Systems</u> (SIGMETRICS), Portland, OR, June 2015. [Slides (pptx) (pdf)] [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]

#### A Large-Scale Study of Flash Memory Failures in the Field

Justin Meza Carnegie Mellon University meza@cmu.edu Qiang Wu Facebook, Inc. qwu@fb.com Sanjeev Kumar Facebook, Inc. skumar@fb.com Onur Mutlu Carnegie Mellon University onur@cmu.edu

#### SAFARI

## NAND Flash Vulnerabilities [HPCA'17]

#### HPCA, Feb. 2017

#### Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques

Yu Cai<sup>†</sup> Saugata Ghose<sup>†</sup> Yixin Luo<sup>‡†</sup> Ken Mai<sup>†</sup> Onur Mutlu<sup>§†</sup> Erich F. Haratsch<sup>‡</sup> <sup>†</sup>Carnegie Mellon University <sup>‡</sup>Seagate Technology <sup>§</sup>ETH Zürich

Modern NAND flash memory chips provide high density by storing two bits of data in each flash cell, called a multi-level cell (MLC). An MLC partitions the threshold voltage range of a flash cell into four voltage states. When a flash cell is programmed, a high voltage is applied to the cell. Due to parasitic capacitance coupling between flash cells that are physically close to each other, flash cell programming can lead to cell-to-cell program interference, which introduces errors into neighboring flash cells. In order to reduce the impact of cell-to-cell interference on the reliability of MLC NAND flash memory, flash manufacturers adopt a two-step programming method, which programs the MLC in two separate steps. First, the flash memory partially programs the least significant bit of the MLC to some intermediate threshold voltage. Second, it programs the most significant bit to bring the MLC up to its full voltage state.

In this paper, we demonstrate that two-step programming exposes new reliability and security vulnerabilities. We expebelongs to a different flash memory *page* (the unit of data programmed and read at the same time), which we refer to, respectively, as the least significant bit (LSB) page and the most significant bit (MSB) page [5].

A flash cell is programmed by applying a large voltage on the control gate of the transistor, which triggers charge transfer into the floating gate, thereby increasing the threshold voltage. To precisely control the threshold voltage of the cell, the flash memory uses *incremental step pulse programming* (ISPP) [12,21,25,41]. ISPP applies multiple short pulses of the programming voltage to the control gate, in order to increase the cell threshold voltage by some small voltage amount ( $V_{step}$ ) after each step. Initial MLC designs programmed the threshold voltage in *one shot*, issuing all of the pulses back-to-back to program *both* bits of data at the same time. However, as flash memory scales down, the distance between neighboring flash cells decreases, which

https://people.inf.ethz.ch/omutlu/pub/flash-memory-programming-vulnerabilities\_hpca17.pdf

## 3D NAND Flash Reliability I [HPCA'18]

 Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness" Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018. [Lightning Talk Video] [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

#### HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness

Yixin Luo<sup>†</sup> Saugata Ghose<sup>†</sup> Yu Cai<sup>‡</sup> Erich F. Haratsch<sup>‡</sup> Onur Mutlu<sup>§†</sup> <sup>†</sup>Carnegie Mellon University <sup>‡</sup>Seagate Technology <sup>§</sup>ETH Zürich

## 3D NAND Flash Reliability II [SIGMETRICS'18]

- Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation"
  - Proceedings of the <u>ACM International Conference on Measurement and</u> <u>Modeling of Computer Systems</u> (**SIGMETRICS**), Irvine, CA, USA, June 2018. [Abstract]

#### Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation

Yixin Luo<sup>†</sup>Saugata Ghose<sup>†</sup>Yu Cai<sup>†</sup>Erich F. Haratsch<sup>‡</sup>Onur Mutlu<sup>§†</sup><sup>†</sup>Carnegie Mellon University<sup>‡</sup>Seagate Technology<sup>§</sup>ETH Zürich

### Another Talk: NAND Flash Memory Robustness

 Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu,
 "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives"

to appear in <u>Proceedings of the IEEE</u>, 2017.

Cai+, "Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis," DATE 2012.

Cai+, "Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime," ICCD 2012.

Cai+, "Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling," DATE 2013.

Cai+, "Error Analysis and Retention-Aware Error Management for NAND Flash Memory," Intel Technology Journal 2013.

Cai+, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation," ICCD 2013.

Cai+, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories," SIGMETRICS 2014.

Cai+,"Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery," HPCA 2015.

Cal+, "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation," DSN 2015.

Luo+, "WARM: Improving NAND Flash Memory Lifetime with Write-notness Aware Retention Management," MSST 2015.

Meza+, "A Large-Scale Study of Flash Memory Errors in the Field," SIGMETRICS 2015.

Luo+, "Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory," IEEE JSAC 2016.

Cai+, "Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques," HPCA 2017.

Luo+, "HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness," HPCA 2018.

Luo+, "Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation," SIGMETRICS 2018.

Cai+, "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives," Proc. IEEE 2017.

### There are Two Other Solution Directions

New Technologies: Replace or (more likely) augment DRAM with a different technology

Non-volatile memories

### Embracing Un-reliability:

Design memories with different reliability and store data intelligently across them

| Problem            |
|--------------------|
| Aigorithm          |
| Program/Language   |
| System Software    |
| SW/HW Interface    |
| Micro-architecture |
| Logic              |
| Devices            |
| Electrons          |

# Fundamental solutions to security require co-design across the hierarchy

### Exploiting Memory Error Tolerance with Hybrid Memory Systems



On Microsoft's Web Search workload Reduces server hardware cost by 4.7 % Achieves single server availability target of 99.90 % Heterogeneous-Reliability Memory [DSN 2014]

# More on Heterogeneous-Reliability Memory

Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Aman Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, and Onur Mutlu,
 "Characterizing Application Memory Error Vulnerability to Optimize
 Data Center Cost via Heterogeneous-Reliability Memory"
 Proceedings of the <u>44th Annual IEEE/IFIP International Conference on</u>
 Dependable Systems and Networks (DSN), Atlanta, GA, June 2014. [Summary]
 [Slides (pptx) (pdf)] [Coverage on ZDNet]

### Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Memory

Yixin Luo Sriram Govindan<sup>\*</sup> Bikash Sharma<sup>\*</sup> Mark Santaniello<sup>\*</sup> Justin Meza Aman Kansal<sup>\*</sup> Jie Liu<sup>\*</sup> Badriddine Khessib<sup>\*</sup> Kushagra Vaid<sup>\*</sup> Onur Mutlu Carnegie Mellon University, yixinluo@cs.cmu.edu, {meza, onur}@cmu.edu \*Microsoft Corporation, {srgovin, bsharma, marksan, kansal, jie.liu, bkhessib, kvaid}@microsoft.com

## Summary: Memory Reliability and Security

- Memory reliability is reducing
- Reliability issues open up security vulnerabilities
  - Very hard to defend against
- Rowhammer is an example
  - □ Its implications on system security research are tremendous & exciting
- Good news: We have a lot more to do.
- Understand: Solid methodologies for failure modeling and discovery
  - Modeling based on real device data small scale and large scale
- Architect: Principled co-architecting of system and memory
  - Good partitioning of duties across the stack
- Design & Test: Principled electronic design, automation, testing
  - High coverage and good interaction with system reliability methods

Challenge and Opportunity for Future

# Fundamentally Secure, Reliable, Safe Computing Architectures

### One Important Takeaway

# Main Memory Needs Intelligent Controllers

### Four Key Issues in Future Platforms

Fundamentally Secure/Reliable/Safe Architectures

Fundamentally Energy-Efficient Architectures
 Memory-centric (Data-centric) Architectures

Fundamentally Low-Latency Architectures

Architectures for Genomics, Medicine, Health



### Maslow's Hierarchy of Needs, A Third Time



# Reducing Memory Latency

### Main Memory Latency Lags Behind



1999 2003 2006 2008 2011 2013 2014 2015 2016 2017

Memory latency remains almost constant

### A Closer Look ...



Figure 1: DRAM latency trends over time [20, 21, 23, 51].

Chang+, "<u>Understanding Latency Variation in Modern DRAM Chips: Experimental</u> Characterization, Analysis, and Optimization"," SIGMETRICS 2016.

## DRAM Latency Is Critical for Performance



### **In-memory Databases**

[Mao+, EuroSys'12; Clapp+ (**Intel**), IISWC'15]



### **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Graph/Tree Processing** 

[Xu+, IISWC'12; Umuroglu+, FPL'15]



**Datacenter Workloads** 

[Kanev+ (Google), ISCA'I5]

### DRAM Latency Is Critical for Performance





#### **In-memory Databases**

#### **Graph/Tree Processing**

### Long memory latency -> performance bottleneck



### **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



### **Datacenter Workloads**

[Kanev+ (Google), ISCA'I5]

## Two Major Sources of Latency Inefficiency

- Modern DRAM is not designed for low latency
   Main focus is cost-per-bit (capacity)
- Modern DRAM latency is determined by worst case conditions and worst case devices
  - Much of memory latency is unnecessary

## Our Goal: Reduce Memory Latency at the Source of the Problem

## Why the Long Memory Latency?

- Reason 1: Design of DRAM Micro-architecture
  - Goal: Maximize capacity/area, not minimize latency



## Latency Variation in Memory Chips

Heterogeneous manufacturing & operating conditions → latency variation in timing parameters



### DRAM Characterization Infrastructure



#### SAFARI

Kim+, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014.

### DRAM Characterization Infrastructure

 Hasan Hassan et al., <u>SoftMC: A</u> <u>Flexible and Practical Open-</u> <u>Source Infrastructure for</u> <u>Enabling Experimental DRAM</u> <u>Studies</u>, HPCA 2017.

- Flexible
- Easy to Use (C++ API)
- Open-source github.com/CMU-SAFARI/SoftMC



## SoftMC: Open Source DRAM Infrastructure

https://github.com/CMU-SAFARI/SoftMC

### SoftMC: A Flexible and Practical Open-Source Infrastructure for Enabling Experimental DRAM Studies

Hasan Hassan<sup>1,2,3</sup> Nandita Vijaykumar<sup>3</sup> Samira Khan<sup>4,3</sup> Saugata Ghose<sup>3</sup> Kevin Chang<sup>3</sup> Gennady Pekhimenko<sup>5,3</sup> Donghyuk Lee<sup>6,3</sup> Oguz Ergin<sup>2</sup> Onur Mutlu<sup>1,3</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>TOBB University of Economics & Technology <sup>3</sup>Carnegie Mellon University <sup>4</sup>University of Virginia <sup>5</sup>Microsoft Research <sup>6</sup>NVIDIA Research

# Tackling the Fixed Latency Mindset

- Reliable operation latency is actually very heterogeneous
   Across temperatures, chips, parts of a chip, voltage levels, ...
- Idea: Dynamically find out and use the lowest latency one can reliably access a memory location with
  - Adaptive-Latency DRAM [HPCA 2015]
  - Flexible-Latency DRAM [SIGMETRICS 2016]
  - Design-Induced Variation-Aware DRAM [SIGMETRICS 2017]
  - Voltron [SIGMETRICS 2017]
  - DRAM Latency PUF [HPCA 2018]
  - ••••
- We would like to find sources of latency heterogeneity and exploit them to minimize latency

### Adaptive-Latency DRAM

• Key idea

SAFAR

- Optimize DRAM timing parameters online
- Two components
  - DRAM manufacturer provides multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM
  - System monitors DRAM temperature & uses appropriate DRAM timing parameters

Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 130 2015.

# Latency Reduction Summary of 115 DIMMs

- Latency reduction for read & write (55°C)
   Read Latency: 32.7%
  - Write Latency: **55.1%**
- Latency reduction for each timing parameter (55°C)
  - Sensing: **17.3%**
  - Restore: **37.3%** (read), **54.8%** (write)
  - Precharge: **35.2%**

**SAFARI** Lee+, "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case," HPCA 131 2015.

## **AL-DRAM: Real System Evaluation**

• System

### - CPU: AMD 4386 ( 8 Cores, 3.1GHz, 8MB LLC)

| D18F2x200_dct[0]_mp[1:0] DDR3 DRAM Timing 0                 |                                                                                                                                                                                                                                                         |  |
|-------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| Reset: 0F05_0505h. See 2.9.3 [DCT Configuration Registers]. |                                                                                                                                                                                                                                                         |  |
|                                                             |                                                                                                                                                                                                                                                         |  |
| Bits                                                        | Description                                                                                                                                                                                                                                             |  |
| 31:30                                                       | Reserved.                                                                                                                                                                                                                                               |  |
| 29:24                                                       | Bits       Description         07h-00h       Reserved         2Ah-08h <tras> clocks         3Fh-2Bh       Reserved</tras>                                                                                                                               |  |
| 23:21                                                       | Reserved.                                                                                                                                                                                                                                               |  |
| 20:16                                                       | <b>Trp: row precharge time</b> . Read-write. BIOS: See 2.9.7.5 [SPD ROM-Based Configuration]. Speci-<br>fies the minimum time in memory clock cycles from a precharge command to an activate command or<br>auto refresh command, both to the same bank. |  |



AL-DRAM *improves single-core performance* on a real system



AL-DRAM provides higher performance on multi-programmed & multi-threaded workloads SAFARI

# Reducing Latency Also Reduces Energy

- AL-DRAM reduces DRAM power consumption by 5.8%
- Major reason: reduction in row activation time



### More on Adaptive-Latency DRAM

- Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu,
  - "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case"

Proceedings of the <u>21st International Symposium on High</u>-

Performance Computer Architecture (HPCA), Bay Area, CA,

February 2015. [<u>Slides (pptx</u>) (pdf)] [Full data sets]

#### Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case

Donghyuk Lee Yoongu Kim Gennady Pekhimenko Samira Khan Vivek Seshadri Kevin Chang Onur Mutlu

Carnegie Mellon University

# Heterogeneous Latency within A Chip

- **Observation:** DRAM timing errors (slow DRAM cells) are concentrated on certain regions
- Flexible-LatencY (FLY) DRAM

- A software-transparent design that reduces latency

• Key idea:

I) Divide memory into regions of different latencies

2) Memory controller: Use lower latency for regions without slow cells; higher latency for other regions

Chang+, "Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization"," SIGMETRICS 2016.

## Heterogeneous Latency within A Chip



40 Workloads

Chang+, "<u>Understanding Latency Variation in Modern DRAM Chips: Experimental</u> Characterization, Analysis, and Optimization"," SIGMETRICS 2016.

### Analysis of Latency Variation in DRAM Chips

- Kevin Chang, Abhijith Kashyap, Hasan Hassan, Samira Khan, Kevin Hsieh, Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Tianshi Li, and Onur Mutlu,
  - "Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization"
  - Proceedings of the <u>ACM International Conference on Measurement and</u> <u>Modeling of Computer Systems</u> (**SIGMETRICS**), Antibes Juan-Les-Pins, France, June 2016. [Slides (pptx) (pdf)] [Source Code]

### Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization

Kevin K. Chang<sup>1</sup> Abhijith Kashyap<sup>1</sup> Hasan Hassan<sup>1,2</sup> Saugata Ghose<sup>1</sup> Kevin Hsieh<sup>1</sup> Donghyuk Lee<sup>1</sup> Tianshi Li<sup>1,3</sup> Gennady Pekhimenko<sup>1</sup> Samira Khan<sup>4</sup> Onur Mutlu<sup>5,1</sup> <sup>1</sup>Carnegie Mellon University <sup>2</sup>TOBB ETÜ <sup>3</sup>Peking University <sup>4</sup>University of Virginia <sup>5</sup>ETH Zürich SAFARI

# What Is Design-Induced Variation?

slow

fast

across column

distance from wordline driver

Inherently fast

Ine

drivers

across row

**MOIS** 

fast

distance from sense amplifier

inherently slow

*Systematic variation* in cell access times caused by the *physical organization* of DRAM

sense amplifiers



Profile only slow regions to determine min. latency → Dynamic & low cost latency optimization



Combine error-correcting codes & online profiling → Reliably reduce DRAM latency



DIVA-DRAM *reduces latency more aggressively* and uses ECC to correct random slow cells

### Design-Induced Latency Variation in DRAM

- Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri, and <u>Onur Mutlu</u>,
  - "Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms" Proceedings of the <u>ACM International Conference on Measurement and</u> <u>Modeling of Computer Systems</u> (**SIGMETRICS**), Urbana-Champaign, IL, USA, June 2017.

#### Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms

Donghyuk Lee, NVIDIA and Carnegie Mellon University Samira Khan, University of Virginia Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Carnegie Mellon University Gennady Pekhimenko, Vivek Seshadri, Microsoft Research Onur Mutlu, ETH Zürich and Carnegie Mellon University

# Voltron: Exploiting the Voltage-Latency-Reliability Relationship

## **Executive Summary**

- DRAM (memory) power is significant in today's systems
  - Existing low-voltage DRAM reduces voltage **conservatively**
- <u>Goal</u>: Understand and exploit the reliability and latency behavior of real DRAM chips under *aggressive reduced-voltage operation*
- Key experimental observations:
  - Huge voltage margin -- Errors occur beyond some voltage
  - Errors exhibit spatial locality
  - Higher operation latency mitigates voltage-induced errors
- <u>Voltron</u>: A new DRAM energy reduction mechanism
  - Reduce DRAM voltage without introducing errors
  - Use a **regression model** to select voltage that does not degrade performance beyond a chosen target  $\rightarrow$  7.3% system energy reduction

#### SAFARI

## **DIMMs Operating at Higher Latency**

Measured minimum latency that does not cause errors in DRAM modules



# DRAM requires longer latency to access data without errors at lower voltage

### **Spatial Locality of Errors**

A module under 1.175V (**12% voltage reduction**) 0 1.0 ⊃r(row with ≥1-bit error 5 0.8 10 Row (000s) 0.6 15 0.4 20 25 0.2 30 0.0 2 3 6 0 1 4 5 7 Bank

### Errors concentrate in certain regions

### **Energy Savings with Bounded Performance**



### Analysis of Latency-Voltage in DRAM Chips

 Kevin Chang, A. Giray Yaglikci, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan, and <u>Onur Mutlu</u>,
 "Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms"
 Proceedings of the <u>ACM International Conference on Measurement and</u> Modeling of Computer Systems (SIGMETRICS), Urbana-Champaign, IL, USA, June 2017.

### Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms

Kevin K. Chang<sup>†</sup> Abdullah Giray Yağlıkçı<sup>†</sup> Saugata Ghose<sup>†</sup> Aditya Agrawal<sup>¶</sup> Niladrish Chatterjee<sup>¶</sup> Abhijith Kashyap<sup>†</sup> Donghyuk Lee<sup>¶</sup> Mike O'Connor<sup>¶,‡</sup> Hasan Hassan<sup>§</sup> Onur Mutlu<sup>§,†</sup>

<sup>†</sup>Carnegie Mellon University <sup>¶</sup>NVIDIA <sup>‡</sup>The University of Texas at Austin <sup>§</sup>ETH Zürich

### And, What If ...

... we can sacrifice reliability of some data to access it with even lower latency?

## The DRAM Latency PUF:

Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

### <u>Jeremie S. Kim</u> Minesh Patel Hasan Hassan Onur Mutlu









# Motivation

• A **PUF** is function that generates a signature **unique** to a given device

- Used in a Challenge-Response Protocol
  - Each device generates a unique **PUF response** depending the inputs
  - A trusted server **authenticates** a device if it generates the expected PUF response

# DRAM Latency Characterization of 223 LPDDR4 DRAM Devices

• Latency failures come from accessing DRAM with **reduced** timing parameters.

### • Key Observations:

1. A cell's **latency failure** probability is determined by **random process variation** 

2. Latency failure patterns are **repeatable and unique to a device** 

# **DRAM Latency PUF Key Idea**

High % chance to fail with reduced trcd

Low % chance to fail with reduced t<sub>RCD</sub>



## **DRAM Accesses and Failures**



## **DRAM Accesses and Failures**



## **The DRAM Latency PUF Evaluation**

• We generate PUF responses using **latency** errors in a region of DRAM

The latency error patterns satisfy PUF requirements

The DRAM Latency PUF generates PUF responses in 88.2ms

## Results



• DL-PUF is **orders of magnitude faster** than prior DRAM PUFs & temperature independent SAFARI

## The DRAM Latency PUF:

Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

### Ieremie S. Kim Minesh Patel

### Hasan Hassan Onur Mutlu





QR Code for the paper https://people.inf.ethz.ch/omutlu/pub/dram-latency-puf\_hpca18.pd



**Carnegie Mellon** 

HPCA 2018

SAFARI

### DRAM Latency PUFs

 Jeremie S. Kim, Minesh Patel, Hasan Hassan, and <u>Onur Mutl</u>u, "The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern DRAM Devices" Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018. [Lightning Talk Video] [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

#### **The DRAM Latency PUF:** Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices

Jeremie S. Kim<sup>†§</sup> Minesh Patel<sup>§</sup> Hasan Hassan<sup>§</sup> Onur Mutlu<sup>§†</sup> <sup>†</sup>Carnegie Mellon University <sup>§</sup>ETH Zürich Challenge and Opportunity for Future

# Fundamentally Low-Latency Computing Architectures



### One Important Takeaway

# Main Memory Needs Intelligent Controllers

## On DRAM Power Consumption

### VAMPIRE DRAM Power Model

 Saugata Ghose, A. Giray Yaglikci, Raghav Gupta, Donghyuk Lee, Kais Kudrolli, William X. Liu, Hasan Hassan, Kevin K. Chang, Niladrish Chatterjee, Aditya Agrawal, Mike O'Connor, and Onur Mutlu,
 "What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study"

Proceedings of the <u>ACM International Conference on Measurement and</u> <u>Modeling of Computer Systems</u> (**SIGMETRICS**), Irvine, CA, USA, June 2018. [Abstract]

### What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study

Saugata Ghose<sup>†</sup> Abdullah Giray Yağlıkçı<sup>‡†</sup> Raghav Gupta<sup>†</sup> Donghyuk Lee<sup>§</sup> Kais Kudrolli<sup>†</sup> William X. Liu<sup>†</sup> Hasan Hassan<sup>‡</sup> Kevin K. Chang<sup>†</sup> Niladrish Chatterjee<sup>§</sup> Aditya Agrawal<sup>§</sup> Mike O'Connor<sup>§¶</sup> Onur Mutlu<sup>‡†</sup> <sup>†</sup>Carnegie Mellon University <sup>‡</sup>ETH Zürich <sup>§</sup>NVIDIA <sup>¶</sup>University of Texas at Austin

### Four Key Issues in Future Platforms

Fundamentally Secure/Reliable/Safe Architectures

Fundamentally Energy-Efficient Architectures
 Memory-centric (Data-centric) Architectures

Fundamentally Low-Latency Architectures

Architectures for Genomics, Medicine, Health

### Maslow's (Human) Hierarchy of Needs, Revisited



SAFARI

### Do We Want This?



### Or This?



Challenge and Opportunity for Future

# Sustainable and Energy Efficient



Data access is the major performance and energy bottleneck

# Our current design principles cause great energy waste (and great performance loss)

# Processing of data is performed far away from the data



## A Computing System

- Three key components
- Computation
- Communication
- Storage/memory



Burks, Goldstein, von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument," 1946.

#### **Computing System**



## A Computing System

- Three key components
- Computation
- Communication
- Storage/memory



Burks, Goldstein, von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument," 1946.

#### **Computing System**



Image source: https://lbsitbytes2010.wordpress.com/2013/03/29/john-von-neumann-roll-no-15/

### Today's Computing Systems

- Are overwhelmingly processor centric
- All data processed in the processor  $\rightarrow$  at great system cost
- Processor is heavily optimized and is considered the master
- Data storage units are dumb and are largely unoptimized (except for some that are on the processor die)



Yet ...

### "It's the Memory, Stupid!" (Richard Sites, MPR, 1996)



Mutlu+, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," HPCA 2003.

## The Performance Perspective

 Onur Mutlu, Jared Stark, Chris Wilkerson, and Yale N. Patt, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors" Proceedings of the <u>9th International Symposium on High-Performance</u> <u>Computer Architecture</u> (HPCA), pages 129-140, Anaheim, CA, February 2003. <u>Slides (pdf)</u>

#### **Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors**

Onur Mutlu § Jared Stark † Chris Wilkerson ‡ Yale N. Patt §

§ECE Department
The University of Texas at Austin
{onur,patt}@ece.utexas.edu

†Microprocessor Research Intel Labs jared.w.stark@intel.com

Desktop Platforms Group Intel Corporation chris.wilkerson@intel.com

### The Performance Perspective (Today)

All of Google's Data Center Workloads (2015):



Kanev+, "Profiling a Warehouse-Scale Computer," ISCA 2015.

### The Performance Perspective (Today)

All of Google's Data Center Workloads (2015):



#### Figure 11: Half of cycles are spent stalled on caches.

## Perils of Processor-Centric Design

### Grossly-imbalanced systems

- Processing done only in **one place**
- Everything else just stores and moves data: data moves a lot
- → Energy inefficient
- → Low performance
- $\rightarrow$  Complex
- Overly complex and bloated processor (and accelerators)
  - To tolerate data access from memory
  - Complex hierarchies and mechanisms
  - $\rightarrow$  Energy inefficient
  - $\rightarrow$  Low performance
  - $\rightarrow$  Complex

#### Perils of Processor-Centric Design



Most of the system is dedicated to storing and moving data

## The Energy Perspective



## Data Movement vs. Computation Energy



## A memory access consumes ~1000X the energy of a complex addition

### Data Movement vs. Computation Energy

#### Data movement is a major system energy bottleneck

- Comprises 41% of mobile system energy during web browsing [2]
- Costs ~115 times as much energy as an ADD operation [1, 2]



[1]: Reducing data Movement Energy via Online Data Clustering and Encoding (MICRO'16)

[2]: Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms (IISWC'14)

#### Energy Waste in Mobile Devices

 Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018.

## 62.7% of the total system energy is spent on data movement

#### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand<sup>1</sup> Saugata Ghose<sup>1</sup> Youngsok Kim<sup>2</sup> Rachata Ausavarungnirun<sup>1</sup> Eric Shiu<sup>3</sup> Rahul Thakur<sup>3</sup> Daehyun Kim<sup>4,3</sup> Aki Kuusela<sup>3</sup> Allan Knies<sup>3</sup> Parthasarathy Ranganathan<sup>3</sup> Onur Mutlu<sup>5,1</sup> SAFARI

#### We Do Not Want to Move Data!



## A memory access consumes ~1000X the energy of a complex addition

#### We Need A Paradigm Shift To ...

Enable computation with minimal data movement

Compute where it makes sense (where data resides)

Make computing architectures more data-centric

### Goal: Processing Inside Memory



### Why In-Memory Computation Today?

🔁 muusu y



UV HOFAC

#### Pull from Systems and Applications

- Data access is a major system and application bottleneck
- Systems are energy limited
- Data movement much more energy-hungry than computation

## Processing in Memory: Two Approaches

Minimally changing memory chips
 Exploiting 3D-stacked memory

## Starting Simple: Data Copy and Initialization

memmove & memcpy: 5% cycles in Google's datacenter [Kanev+ ISCA'15]





#### VM Cloning Deduplication



••• Many more

## **Page Migration**

#### Today's Systems: Bulk Data Copy 1) High latency 3) Cache pollution Memory CPU **L3** MC L2 2) High bandwidth utilization 4) Unwanted data movement (for 4KB page copy via DMA) 1046ns, 3.6uJ



### RowClone: In-DRAM Row Copy



## RowClone: Latency and Energy Savings



Seshadri et al., "RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data," MICRO 2013.

#### More on RowClone

 Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry,

"RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization"

Proceedings of the <u>46th International Symposium on Microarchitecture</u> (**MICRO**), Davis, CA, December 2013. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

#### RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Vivek Seshadri Yoongu Kim Chris Fallin\* Donghyuk Lee vseshadr@cs.cmu.edu yoongukim@cmu.edu cfallin@c1f.net donghyuk1@cmu.edu Rachata Ausavarungnirun Gennady Pekhimenko Yixin Luo rachata@cmu.edu gpekhime@cs.cmu.edu yixinluo@andrew.cmu.edu Onur Mutlu Phillip B. Gibbons† Michael A. Kozuch† Todd C. Mowry onur@cmu.edu phillip.b.gibbons@intel.com michael.a.kozuch@intel.com tcm@cs.cmu.edu Carnegie Mellon University †Intel Pittsburgh

#### In-Memory Bulk Bitwise Operations

- We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ
- At low cost
- Using analog computation capability of DRAM
  - Idea: activating multiple rows performs computation
- 30-60X performance and energy improvement
  - Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.

- New memory technologies enable even more opportunities
  - Memristors, resistive RAM, phase change mem, STT-MRAM, ...
  - Can operate on data with minimal movement

#### In-DRAM AND/OR: Triple Row Activation



Seshadri+, "Fast Bulk Bitwise AND and OR in DRAM", IEEE CAL 2015.

#### In-DRAM NOT: Dual Contact Cell



Idea: Feed the negated value in the sense amplifier into a special row

Figure 5: A dual-contact cell connected to both ends of a sense amplifier

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.



#### Performance: In-DRAM Bitwise Operations



Figure 9: Throughput of bitwise operations on various systems.

## Energy of In-DRAM Bitwise Operations

|                | Design         | not   | and/or | nand/nor | xor/xnor |
|----------------|----------------|-------|--------|----------|----------|
| DRAM &         | DDR3           | 93.7  | 137.9  | 137.9    | 137.9    |
| Channel Energy | Ambit          | 1.6   | 3.2    | 4.0      | 5.5      |
| (nJ/KB)        | $(\downarrow)$ | 59.5X | 43.9X  | 35.1X    | 25.1X    |

Table 3: Energy of bitwise operations. ( $\downarrow$ ) indicates energy reduction of Ambit over the traditional DDR3-based design.

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.



#### Ambit vs. DDR3: Performance and Energy

Performance Improvement Energy Reduction



#### Bulk Bitwise Operations in Workloads



#### SAFARI

[1] Li and Patel, BitWeaving, SIGMOD 2013[2] Goodwin+, BitFunnel, SIGIR 2017

### Performance: Bitmap Index on Ambit



## Figure 10: Bitmap index performance. The value above each bar indicates the reduction in execution time due to Ambit.

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.



#### Performance: BitWeaving on Ambit

`select count(\*) from T where c1 <= val <= c2'</pre>



## Figure 11: Speedup offered by Ambit over baseline CPU with SIMD for BitWeaving

Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology," MICRO 2017.

#### More on In-DRAM Bulk AND/OR

 Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,
 <u>"Fast Bulk Bitwise AND and OR in DRAM</u>" *IEEE Computer Architecture Letters (CAL)*, April 2015.

## Fast Bulk Bitwise AND and OR in DRAM

Vivek Seshadri\*, Kevin Hsieh\*, Amirali Boroumand\*, Donghyuk Lee\*, Michael A. Kozuch<sup>†</sup>, Onur Mutlu\*, Phillip B. Gibbons<sup>†</sup>, Todd C. Mowry\* \*Carnegie Mellon University <sup>†</sup>Intel Pittsburgh

#### More on In-DRAM Bitwise Operations

 Vivek Seshadri et al., "<u>Ambit: In-Memory Accelerator</u> for Bulk Bitwise Operations Using Commodity DRAM <u>Technology</u>," MICRO 2017.

#### Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri<sup>1,5</sup> Donghyuk Lee<sup>2,5</sup> Thomas Mullins<sup>3,5</sup> Hasan Hassan<sup>4</sup> Amirali Boroumand<sup>5</sup> Jeremie Kim<sup>4,5</sup> Michael A. Kozuch<sup>3</sup> Onur Mutlu<sup>4,5</sup> Phillip B. Gibbons<sup>5</sup> Todd C. Mowry<sup>5</sup>

<sup>1</sup>Microsoft Research India <sup>2</sup>NVIDIA Research <sup>3</sup>Intel <sup>4</sup>ETH Zürich <sup>5</sup>Carnegie Mellon University

Challenge and Opportunity for Future

# Computing Architectures with

## Minimal Data Movement



## Processing in Memory: Two Approaches

Minimally changing memory chips
 Exploiting 3D-stacked memory

### Opportunity: 3D-Stacked Logic+Memory

# Hybrid Memory Cube



## DRAM Landscape (circa 2015)

| Segment     | DRAM Standards & Architectures                                                                                                                                                                                                        |
|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Commodity   | DDR3 (2007) [14]; DDR4 (2012) [18]                                                                                                                                                                                                    |
| Low-Power   | LPDDR3 (2012) [17]; LPDDR4 (2014) [20]                                                                                                                                                                                                |
| Graphics    | GDDR5 (2009) [15]                                                                                                                                                                                                                     |
| Performance | eDRAM [28], [32]; RLDRAM3 (2011) [29]                                                                                                                                                                                                 |
| 3D-Stacked  | WIO (2011) [16]; WIO2 (2014) [21]; MCDRAM (2015) [13];<br>HBM (2013) [19]; HMC1.0 (2013) [10]; HMC1.1 (2014) [11]                                                                                                                     |
| Academic    | SBA/SSA (2010) [38]; Staged Reads (2012) [8]; RAIDR (2012) [27];<br>SALP (2012) [24]; TL-DRAM (2013) [26]; RowClone (2013) [37];<br>Half-DRAM (2014) [39]; Row-Buffer Decoupling (2014) [33];<br>SARP (2014) [6]; AL-DRAM (2015) [25] |

Table 1. Landscape of DRAM-based memory

Kim+, "Ramulator: A Flexible and Extensible DRAM Simulator", IEEE CAL 2015.

#### Two Key Questions in 3D-Stacked PIM

- How can we accelerate important applications if we use 3D-stacked memory as a coarse-grained accelerator?
  - what is the architecture and programming model?
  - what are the mechanisms for acceleration?

- What is the minimal processing-in-memory support we can provide?
  - without changing the system significantly
  - while achieving significant benefits

#### Another Example: In-Memory Graph Processing

Large graphs are everywhere (circa 2015)



Scalable large-scale graph processing is challenging



## Key Bottlenecks in Graph Processing





#### Tesseract System for Graph Processing

Interconnected set of 3D-stacked memory+logic chips with simple cores



SAFARI Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" ISCA 2015.

#### Tesseract System for Graph Processing



## Tesseract System for Graph Processing



#### **Evaluated Systems**



AFARI Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" ISCA 2015.

## Tesseract Graph Processing Performance

#### >13X Performance Improvement



SAFARI Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" ISCA 2015.

#### Tesseract Graph Processing Performance



#### Effect of Bandwidth & Programming Model



### Tesseract Graph Processing System Energy



SAFARI Ahn+, "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing" ISCA 2015.

#### More on Tesseract

 Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi,
 "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing"
 Proceedings of the <u>42nd International Symposium on</u> <u>Computer Architecture</u> (ISCA), Portland, OR, June 2015.
 [Slides (pdf)] [Lightning Session Slides (pdf)]

#### A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

Junwhan Ahn Sungpack Hong<sup>§</sup> Sungjoo Yoo Onur Mutlu<sup>†</sup> Kiyoung Choi junwhan@snu.ac.kr, sungpack.hong@oracle.com, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr Seoul National University <sup>§</sup>Oracle Labs <sup>†</sup>Carnegie Mellon University

### 3D-Stacked PIM on Mobile Devices

 Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks"

Proceedings of the <u>23rd International Conference on Architectural</u> <u>Support for Programming Languages and Operating</u> <u>Systems</u> (**ASPLOS**), Williamsburg, VA, USA, March 2018.

#### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand1Saugata Ghose1Youngsok Kim2Rachata Ausavarungnirun1Eric Shiu3Rahul Thakur3Daehyun Kim4,3Aki Kuusela3Allan Knies3Parthasarathy Ranganathan3Onur Mutlu<sup>5,1</sup>

## Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

#### **Amirali Boroumand**

Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

**Carnegie Mellon** 







SEOUL NATIONAL UNIVERSITY



Google

## **Consumer Devices**



#### **Consumer devices are everywhere!**

#### Energy consumption is a first-class concern in consumer devices



## Popular Google Consumer Workloads





Google's web browser



#### **TensorFlow Mobile**

Google's machine learning framework



**Google's video codec** 



## **Energy Cost of Data Movement**

#### I st key observation: 62.7% of the total system energy is spent on data movement



**Processing-In-Memory (PIM)** 

Potential solution: move computation close to data

Challenge: limited area and energy budget



## Using PIM to Reduce Data Movement

2<sup>nd</sup> key observation: a significant fraction of the data movement often comes from simple functions

We can design lightweight logic to implement these <u>simple functions</u> in <u>memory</u>

Small embedded low-power core

> PIM Core

Small fixed-function accelerators



Offloading to PIM logic reduces energy and improves performance, on average, by 55.4% and 54.2%

## Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

#### **Amirali Boroumand**

Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

#### **ASPLOS 2018**

**Carnegie Mellon** 







SEOUL NATIONAL UNIVERSITY



Google

#### More on PIM for Mobile Devices

 Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Williamsburg, VA, USA, March 2018.

# 62.7% of the total system energy is spent on data movement

#### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand<sup>1</sup>Saugata Ghose<sup>1</sup>Youngsok Kim<sup>2</sup>Rachata Ausavarungnirun<sup>1</sup>Eric Shiu<sup>3</sup>Rahul Thakur<sup>3</sup>Daehyun Kim<sup>4,3</sup>Aki Kuusela<sup>3</sup>Allan Knies<sup>3</sup>Parthasarathy Ranganathan<sup>3</sup>Onur Mutlu<sup>5,1</sup>231

#### **Truly Distributed GPU Processing with PIM?**



## Accelerating GPU Execution with PIM (I)

 Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler, "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems" Proceedings of the <u>43rd International Symposium on Computer</u>

Architecture (ISCA), Seoul, South Korea, June 2016.

[Slides (pptx) (pdf)]

[Lightning Session Slides (pptx) (pdf)]

#### Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

Kevin Hsieh<sup>‡</sup> Eiman Ebrahimi<sup>†</sup> Gwangsun Kim<sup>\*</sup> Niladrish Chatterjee<sup>†</sup> Mike O'Connor<sup>†</sup> Nandita Vijaykumar<sup>‡</sup> Onur Mutlu<sup>§‡</sup> Stephen W. Keckler<sup>†</sup> <sup>‡</sup>Carnegie Mellon University <sup>†</sup>NVIDIA <sup>\*</sup>KAIST <sup>§</sup>ETH Zürich

## Accelerating GPU Execution with PIM (II)

 Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, <u>Onur Mutlu</u>, and Chita R. Das,
 "Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities" Proceedings of the <u>25th International Conference on Parallel</u> <u>Architectures and Compilation Techniques</u> (PACT), Haifa, Israel, September 2016.

#### Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

Ashutosh Pattnaik<sup>1</sup> Xulong Tang<sup>1</sup> Adwait Jog<sup>2</sup> Onur Kayıran<sup>3</sup> Asit K. Mishra<sup>4</sup> Mahmut T. Kandemir<sup>1</sup> Onur Mutlu<sup>5,6</sup> Chita R. Das<sup>1</sup> <sup>1</sup>Pennsylvania State University <sup>2</sup>College of William and Mary <sup>3</sup>Advanced Micro Devices, Inc. <sup>4</sup>Intel Labs <sup>5</sup>ETH Zürich <sup>6</sup>Carnegie Mellon University

## Accelerating Linked Data Structures

 Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu,
 "Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges. Mechanisms. Evaluation"
 Proceedings of the <u>34th IEEE International Conference on Computer</u> Design (ICCD), Phoenix, AZ, USA, October 2016.

#### Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation

Kevin Hsieh<sup>†</sup> Samira Khan<sup>‡</sup> Nandita Vijaykumar<sup>†</sup> Kevin K. Chang<sup>†</sup> Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>†</sup> Onur Mutlu<sup>§†</sup> <sup>†</sup>Carnegie Mellon University <sup>‡</sup>University of Virginia <sup>§</sup>ETH Zürich

## Accelerating Dependent Cache Misses

 Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt, "Accelerating Dependent Cache Misses with an Enhanced Memory Controller" *Proceedings of the <u>43rd International Symposium on Computer</u> <i>Architecture (ISCA)*, Seoul, South Korea, June 2016. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

#### Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Milad Hashemi\*, Khubaib<sup>†</sup>, Eiman Ebrahimi<sup>‡</sup>, Onur Mutlu<sup>§</sup>, Yale N. Patt\*

\* The University of Texas at Austin <sup>†</sup>Apple <sup>‡</sup>NVIDIA <sup>§</sup>ETH Zürich & Carnegie Mellon University

## Two Key Questions in 3D-Stacked PIM

- How can we accelerate important applications if we use 3D-stacked memory as a coarse-grained accelerator?
  - what is the architecture and programming model?
  - what are the mechanisms for acceleration?

- What is the minimal processing-in-memory support we can provide?
  - without changing the system significantly
  - while achieving significant benefits

#### **PIM-Enabled Instructions**

 Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, "PIM-Enabled Instructions: A Low-Overhead,"
 Locality-Aware Processing-in-Memory Architecture" Proceedings of the <u>42nd International Symposium on</u> <u>Computer Architecture</u> (ISCA), Portland, OR, June 2015.
 [Slides (pdf)] [Lightning Session Slides (pdf)]

#### PIM-Enabled Instructions: A Low-Overhead, Locality-Aware Processing-in-Memory Architecture

Junwhan Ahn Sungjoo Yoo Onur Mutlu<sup>†</sup> Kiyoung Choi junwhan@snu.ac.kr, sungjoo.yoo@gmail.com, onur@cmu.edu, kchoi@snu.ac.kr Seoul National University <sup>†</sup>Carnegie Mellon University

## Automatic Code and Data Mapping

 Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler, "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems" Proceedings of the <u>43rd International Symposium on Computer</u>

Architecture (ISCA), Seoul, South Korea, June 2016.

Slides (pptx) (pdf)

[Lightning Session Slides (pptx) (pdf)]

#### Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

Kevin Hsieh<sup>‡</sup> Eiman Ebrahimi<sup>†</sup> Gwangsun Kim<sup>\*</sup> Niladrish Chatterjee<sup>†</sup> Mike O'Connor<sup>†</sup> Nandita Vijaykumar<sup>‡</sup> Onur Mutlu<sup>§‡</sup> Stephen W. Keckler<sup>†</sup> <sup>‡</sup>Carnegie Mellon University <sup>†</sup>NVIDIA <sup>\*</sup>KAIST <sup>§</sup>ETH Zürich

## Automatic Offloading of Critical Code

 Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt, "Accelerating Dependent Cache Misses with an Enhanced Memory Controller" *Proceedings of the <u>43rd International Symposium on Computer</u> <i>Architecture (ISCA)*, Seoul, South Korea, June 2016. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

#### Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Milad Hashemi\*, Khubaib<sup>†</sup>, Eiman Ebrahimi<sup>‡</sup>, Onur Mutlu<sup>§</sup>, Yale N. Patt\*

\* The University of Texas at Austin <sup>†</sup>Apple <sup>‡</sup>NVIDIA <sup>§</sup>ETH Zürich & Carnegie Mellon University

#### Automatic Offloading of Prefetch Mechanisms

 Milad Hashemi, Onur Mutlu, and Yale N. Patt, "Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads" Proceedings of the <u>49th International Symposium on</u> Microarchitecture (MICRO), Taipei, Taiwan, October 2016. [Slides (pptx) (pdf)] [Lightning Session Slides (pdf)] [Poster (pptx) (pdf)]

#### Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Milad Hashemi\*, Onur Mutlu§, Yale N. Patt\*

\* The University of Texas at Austin §ETH Zürich



## Efficient Automatic Data Coherence Support

 Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu,
 "LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory" IEEE Computer Architecture Letters (CAL), June 2016.

#### LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory

Amirali Boroumand<sup>†</sup>, Saugata Ghose<sup>†</sup>, Minesh Patel<sup>†</sup>, Hasan Hassan<sup>†§</sup>, Brandon Lucia<sup>†</sup>, Kevin Hsieh<sup>†</sup>, Krishna T. Malladi<sup>\*</sup>, Hongzhong Zheng<sup>\*</sup>, and Onur Mutlu<sup>‡†</sup> <sup>†</sup>Carnegie Mellon University \*Samsung Semiconductor, Inc. <sup>§</sup> TOBB ETÜ <sup>‡</sup>ETH Zürich



Challenge and Opportunity for Future

Fundamentally **Energy-Efficient** (Data-Centric) **Computing Architectures**  Challenge and Opportunity for Future

# Computing Architectures with

# Minimal Data Movement



### Eliminating the Adoption Barriers

# How to Enable Adoption of Processing in Memory



## Barriers to Adoption of PIM

1. Functionality of and applications for PIM

- 2. Ease of programming (interfaces and compiler/HW support)
- 3. System support: coherence & virtual memory
- 4. Runtime systems for adaptive scheduling, data mapping, access/sharing control
- 5. Infrastructures to assess benefits and feasibility

#### We Need to Revisit the Entire Stack

| Problem            |
|--------------------|
| Aigorithm          |
| Program/Language   |
| System Software    |
| SW/HW Interface    |
| Micro-architecture |
| Logic              |
| Devices            |
| Electrons          |

#### Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions

#### SAUGATA GHOSE, KEVIN HSIEH, AMIRALI BOROUMAND, RACHATA AUSAVARUNGNIRUN

Carnegie Mellon University

ONUR MUTLU ETH Zürich and Carnegie Mellon University

https://arxiv.org/pdf/1802.00320.pdf



## Key Challenge 1: Code Mapping

• Challenge 1: Which operations should be executed in memory vs. in CPU?



## Key Challenge 2: Data Mapping

• Challenge 2: How should data be mapped to different 3D memory stacks?



## How to Do the Code and Data Mapping?

 Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler,
 "Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems"
 Proceedings of the <u>43rd International Symposium on Computer</u>

Architecture (ISCA), Seoul, South Korea, June 2016.

[Slides (pptx) (pdf)]

[Lightning Session Slides (pptx) (pdf)]

#### Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

Kevin Hsieh<sup>‡</sup> Eiman Ebrahimi<sup>†</sup> Gwangsun Kim<sup>\*</sup> Niladrish Chatterjee<sup>†</sup> Mike O'Connor<sup>†</sup> Nandita Vijaykumar<sup>‡</sup> Onur Mutlu<sup>§‡</sup> Stephen W. Keckler<sup>†</sup> <sup>‡</sup>Carnegie Mellon University <sup>†</sup>NVIDIA <sup>\*</sup>KAIST <sup>§</sup>ETH Zürich

#### How to Schedule Code?

 Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, <u>Onur Mutlu</u>, and Chita R. Das,
 "Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities" Proceedings of the <u>25th International Conference on Parallel</u> <u>Architectures and Compilation Techniques</u> (PACT), Haifa, Israel, September 2016.

#### Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

Ashutosh Pattnaik<sup>1</sup> Xulong Tang<sup>1</sup> Adwait Jog<sup>2</sup> Onur Kayıran<sup>3</sup> Asit K. Mishra<sup>4</sup> Mahmut T. Kandemir<sup>1</sup> Onur Mutlu<sup>5,6</sup> Chita R. Das<sup>1</sup> <sup>1</sup>Pennsylvania State University <sup>2</sup>College of William and Mary <sup>3</sup>Advanced Micro Devices, Inc. <sup>4</sup>Intel Labs <sup>5</sup>ETH Zürich <sup>6</sup>Carnegie Mellon University

#### Challenge: Coherence for Hybrid CPU-PIM Apps



#### How to Maintain Coherence?

 Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T. Malladi, Hongzhong Zheng, and Onur Mutlu,
 "LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory" IEEE Computer Architecture Letters (CAL), June 2016.

#### LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory

Amirali Boroumand<sup>†</sup>, Saugata Ghose<sup>†</sup>, Minesh Patel<sup>†</sup>, Hasan Hassan<sup>†§</sup>, Brandon Lucia<sup>†</sup>, Kevin Hsieh<sup>†</sup>, Krishna T. Malladi<sup>\*</sup>, Hongzhong Zheng<sup>\*</sup>, and Onur Mutlu<sup>‡†</sup>
<sup>†</sup>Carnegie Mellon University \*Samsung Semiconductor, Inc. <sup>§</sup>TOBB ETÜ <sup>‡</sup>ETH Zürich



#### How to Support Virtual Memory?

 Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu,
 "Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges. Mechanisms. Evaluation"
 Proceedings of the <u>34th IEEE International Conference on Computer</u> Design (ICCD), Phoenix, AZ, USA, October 2016.

#### Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation

Kevin Hsieh<sup>†</sup> Samira Khan<sup>‡</sup> Nandita Vijaykumar<sup>†</sup> Kevin K. Chang<sup>†</sup> Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>†</sup> Onur Mutlu<sup>§†</sup> <sup>†</sup>Carnegie Mellon University <sup>‡</sup>University of Virginia <sup>§</sup>ETH Zürich

#### How to Design Data Structures for PIM?

 Zhiyu Liu, Irina Calciu, Maurice Herlihy, and Onur Mutlu,
 "Concurrent Data Structures for Near-Memory Computing"
 Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), Washington, DC, USA, July 2017.
 [Slides (pptx) (pdf)]

#### **Concurrent Data Structures for Near-Memory Computing**

Zhiyu Liu Computer Science Department Brown University zhiyu\_liu@brown.edu

Maurice Herlihy Computer Science Department Brown University mph@cs.brown.edu Irina Calciu VMware Research Group icalciu@vmware.com

Onur Mutlu Computer Science Department ETH Zürich onur.mutlu@inf.ethz.ch

#### Simulation Infrastructures for PIM

- Ramulator extended for PIM
  - Flexible and extensible DRAM simulator
  - Can model many different memory standards and proposals
  - Kim+, "Ramulator: A Flexible and Extensible DRAM Simulator", IEEE CAL 2015.
  - https://github.com/CMU-SAFARI/ramulator

#### Ramulator: A Fast and Extensible DRAM Simulator

Yoongu Kim<sup>1</sup> Weikun Yang<sup>1,2</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>Carnegie Mellon University <sup>2</sup>Peking University



#### An FPGA-based Test-bed for PIM?

 Hasan Hassan et al., <u>SoftMC: A</u> <u>Flexible and Practical Open-</u> <u>Source Infrastructure for</u> <u>Enabling Experimental DRAM</u> <u>Studies</u> HPCA 2017.

- Flexible
- Easy to Use (C++ API)
- Open-source github.com/CMU-SAFARI/SoftMC



#### Simulation Infrastructures for PIM (in SSDs)

 Arash Tavakkol, Juan Gomez-Luna, Mohammad Sadrosadati, Saugata Ghose, and <u>Onur Mutlu</u>,
 "MOSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices"
 Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST), Oakland, CA, USA, February 2018.
 [Slides (pptx) (pdf)]
 [Source Code]

#### MQSim: A Framework for Enabling Realistic Studies of Modern Multi-Queue SSD Devices

Arash Tavakkol<sup>†</sup>, Juan Gómez-Luna<sup>†</sup>, Mohammad Sadrosadati<sup>†</sup>, Saugata Ghose<sup>‡</sup>, Onur Mutlu<sup>†‡</sup> <sup>†</sup>ETH Zürich <sup>‡</sup>Carnegie Mellon University



# Concluding Remarks

#### A Quote from A Famous Architect

"architecture [...] based upon principle, and not upon precedent"



### Precedent-Based Design?

"architecture [...] based upon principle, and not upon precedent"



### Principled Design

## "architecture [...] based upon principle, and not upon precedent"





### The Overarching Principle

### **Organic architecture**

From Wikipedia, the free encyclopedia

**Organic architecture** is a philosophy of architecture which promotes harmony between human habitation and the natural world through design approaches so sympathetic and well integrated with its site, that buildings, furnishings, and surroundings become part of a unified, interrelated composition.

A well-known example of organic architecture is Fallingwater, the residence Frank Lloyd Wright designed for the Kaufmann family in rural Pennsylvania. Wright had many choices to locate a home on this large site, but chose to place the home directly over the waterfall and creek creating a close, yet noisy dialog with the rushing water and the steep site. The horizontal striations of stone masonry with daring cantilevers of colored beige concrete blend with native rock outcroppings and the wooded environment.

### Another Example: Precedent-Based Design



### Principled Design



### Another Principled Design



Source: By Martín Gómez Tagle - Lisbon, Portugal, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=13764903 Source: http://www.arcspace.com/exhibitions/unsorted/santiago-calatrava/

#### Principle Applied to Another Structure



### The Overarching Principle

#### Zoomorphic architecture

From Wikipedia, the free encyclopedia

**Zoomorphic architecture** is the practice of using animal forms as the inspirational basis and blueprint for architectural design. "While animal forms have always played a role adding some of the deepest layers of meaning in architecture, it is now becoming evident that a new strand of biomorphism is emerging where the meaning derives not from any specific representation but from a more general allusion to biological processes."<sup>[1]</sup>

Some well-known examples of Zoomorphic architecture can be found in the TWA Flight Center building in New York City, by Eero Saarinen, or the Milwaukee Art Museum by Santiago Calatrava, both inspired by the form of a bird's wings.<sup>[3]</sup>

### Overarching Principles for Computing?



Source: http://spectrum.ieee.org/image/MjYzMzAyMg.jpeg

### Concluding Remarks

- It is time to design principled system architectures to solve the memory problem
- Discover design principles for fundamentally secure and reliable computer architectures
- Design complete systems to be balanced and energy-efficient, i.e., low latency and data-centric (or memory-centric)
- Enable new and emerging memory architectures
- This can
  - Lead to orders-of-magnitude improvements
  - Enable new applications & computing platforms

#### We Need to Think Across the Stack



#### If In Doubt, See Other Doubtful Technologies

- A very "doubtful" emerging technology
  - for at least two decades



Proceedings of the IEEE, Sept. 2017

### Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

https://arxiv.org/pdf/1706.08642

### Rethinking Memory System Design Robustness, Energy, Performance

Onur Mutlu

omutlu@amail.com

https://people.inf.ethz.ch/omutlu

July 3, 2018 IOLTS Keynote Talk



**EH** zürich



### Acknowledgments

#### My current and past students and postdocs

Rachata Ausavarungnirun, Abhishek Bhowmick, Amirali Boroumand, Rui Cai, Yu Cai, Kevin Chang, Saugata Ghose, Kevin Hsieh, Tyler Huberty, Ben Jaiyen, Samira Khan, Jeremie Kim, Yoongu Kim, Yang Li, Jamie Liu, Lavanya Subramanian, Donghyuk Lee, Yixin Luo, Justin Meza, Gennady Pekhimenko, Vivek Seshadri, Lavanya Subramanian, Nandita Vijaykumar, HanBin Yoon, Jishen Zhao, ...

#### My collaborators

 Can Alkan, Chita Das, Phil Gibbons, Sriram Govindan, Norm Jouppi, Mahmut Kandemir, Mike Kozuch, Konrad Lai, Ken Mai, Todd Mowry, Yale Patt, Moinuddin Qureshi, Partha Ranganathan, Bikash Sharma, Kushagra Vaid, Chris Wilkerson, ...

### Funding Acknowledgments

- NSF
- GSRC
- SRC
- CyLab
- Alibaba, AMD, Google, Facebook, HP Labs, Huawei, IBM, Intel, Microsoft, Nvidia, Oracle, Qualcomm, Rambus, Samsung, Seagate, VMware

## Slides Not Covered But Could Be Useful

## Open Problems and References

#### Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions

#### SAUGATA GHOSE, KEVIN HSIEH, AMIRALI BOROUMAND, RACHATA AUSAVARUNGNIRUN

Carnegie Mellon University

ONUR MUTLU ETH Zürich and Carnegie Mellon University

https://arxiv.org/pdf/1802.00320.pdf



#### Reference Overview Paper II

 Onur Mutlu and Lavanya Subramanian, <u>"Research Problems and Opportunities in Memory</u> <u>Systems"</u> *Invited Article in <u>Supercomputing Frontiers and Innovations</u> (SUPERFRI), 2014/2015.* 

- **Research Problems and Opportunities in Memory Systems**
- Onur Mutlu<sup>1</sup>, Lavanya Subramanian<sup>1</sup>

https://people.inf.ethz.ch/omutlu/pub/memory-systems-research\_superfri14.pdf

#### Reference Overview Paper III

 Onur Mutlu,
 "The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser"
 Invited Paper in Proceedings of the Desian, Automation, and Test in Furope Conference (DATE), Lausanne, Switzerland, March 2017.
 [Slides (pptx) (pdf)]

#### The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser

Onur Mutlu ETH Zürich onur.mutlu@inf.ethz.ch https://people.inf.ethz.ch/omutlu

https://people.inf.ethz.ch/omutlu/pub/rowhammer-and-other-memory-issues\_date17.pdf

#### Reference Overview Paper IV

 Onur Mutlu, <u>"Memory Scaling: A Systems Architecture</u>
 <u>Perspective</u>" *Technical talk at <u>MemCon 2013</u> (MEMCON)*, Santa Clara, CA, August 2013. [Slides (pptx) (pdf)] [Video] [Coverage on StorageSearch]

#### Memory Scaling: A Systems Architecture Perspective

Onur Mutlu Carnegie Mellon University onur@cmu.edu http://users.ece.cmu.edu/~omutlu/

https://people.inf.ethz.ch/omutlu/pub/memory-scaling\_memcon13.pdf

#### Reference Overview Paper V



Proceedings of the IEEE, Sept. 2017

### Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By YU CAI, SAUGATA GHOSE, ERICH F. HARATSCH, YIXIN LUO, AND ONUR MUTLU

#### SAFARI

https://arxiv.org/pdf/1706.08642

### Related Videos and Course Materials (I)

- <u>Undergraduate Computer Architecture Course Lecture</u> <u>Videos (2015, 2014, 2013)</u>
- <u>Undergraduate Computer Architecture Course</u> <u>Materials (2015, 2014, 2013)</u>
- Graduate Computer Architecture Course Lecture Videos (2017, 2015, 2013)
- Graduate Computer Architecture Course Materials (2017, 2015, 2013)
- <u>Parallel Computer Architecture Course Materials</u> (<u>Lecture Videos</u>)



#### Related Videos and Course Materials (II)

- Freshman Digital Circuits and Computer Architecture Course Lecture Videos (2018, 2017)
- Freshman Digital Circuits and Computer Architecture Course Materials (2018)
- Memory Systems Short Course Materials (Lecture Video on Main Memory and DRAM Basics)



#### Some Open Source Tools (I)

- Rowhammer Program to Induce RowHammer Errors
   <u>https://github.com/CMU-SAFARI/rowhammer</u>
- Ramulator Fast and Extensible DRAM Simulator
  - https://github.com/CMU-SAFARI/ramulator
- MemSim Simple Memory Simulator
  - <u>https://aithub.com/CMU-SAFARI/memsim</u>
- NOCulator Flexible Network-on-Chip Simulator
  - <u>https://github.com/CMU-SAFARI/NOCulator</u>
- SoftMC FPGA-Based DRAM Testing Infrastructure
  - <u>https://github.com/CMU-SAFARI/SoftMC</u>
- Other open-source software from my group
  - <u>https://github.com/CMU-SAFARI/</u>

http://www.ece.cmu.edu/~safari/tools.html

#### Some Open Source Tools (II)

- MQSim A Fast Modern SSD Simulator
  - https://github.com/CMU-SAFARI/MOSim
- Mosaic GPU Simulator Supporting Concurrent Applications
  - https://github.com/CMU-SAFARI/Mosaic
- IMPICA Processing in 3D-Stacked Memory Simulator
   <u>https://github.com/CMU-SAFARI/IMPICA</u>
- SMLA Detailed 3D-Stacked Memory Simulator
  - <u>https://github.com/CMU-SAFARI/SMLA</u>
- HWASim Simulator for Heterogeneous CPU-HWA Systems
  - https://github.com/CMU-SAFARI/HWASim
- Other open-source software from my group
  - <u>https://github.com/CMU-SAFARI/</u>
  - http://www.ece.cmu.edu/~safari/tools.html

#### More Open Source Tools (III)

- A lot more open-source software from my group
  - https://github.com/CMU-SAFARI/
  - http://www.ece.cmu.edu/~safari/tools.html

| SAFARI Research Group at ETH Zurich ar           University           Site for source code and tools distribution from SAFARI Research Group at ETH Zurich           © ETH Zurich and CarnegL.                                                                                                                                                                                                           | Ū.                                                |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|
| Repositories 30 L People 27 Teams 1 Projects 0 O Settings                                                                                                                                                                                                                                                                                                                                                |                                                   |
| Search repositories Type: All - Language: All -                                                                                                                                                                                                                                                                                                                                                          | Customize pinned repositories                     |
| MQSim is a fast and accurate simulator modeling the performance of modern multi-queue (MQ) SSDs as well as traditional SATA based SSDs. MQSim faithfully models new high-bandwidth protocol implementations, steady-state SSD conditions, and the full end-to-end latency of requests in modern SSDs. It is described in detail in the FAST 2018 paper by A  C++ ★ 14 ♀ 14 ♀ 14 ♀ MIT Updated 8 days ago | Top languages<br>C++ C C C# AGS Script<br>Verilog |
|                                                                                                                                                                                                                                                                                                                                                                                                          | Most used topics Manage                           |



All are available at

#### https://people.inf.ethz.ch/omutlu/projects.htm

http://scholar.google.com/citations?user=7XvGUGkAAAAJ&hl=en



#### Key Principles and Results

- Two key principles:
  - Exploit the structure of the genome to minimize computation
  - Morph and exploit the structure of the underlying hardware to maximize performance and efficiency
- Algorithm-architecture co-design for DNA read mapping
  - Speeds up read mapping by ~200X (sometimes more)
  - Improves accuracy of read mapping in the presence of errors

Xin et al., "Accelerating Read Mapping with FastHASH," BMC Genomics 2013. Xin et al., "Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter to Accelerate Alignment Verification in Read Mapping," Bioinformatics 2015. Alser et al., "GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping," Bioinformatics 2017. Kim et al., "Genome Read In-Memory (GRIM) Filter," BMC Genomics 2018.

# End of Backup Slides

Onur Mutlu



- Strecker Professor @ Carnegie Mellon University ECE/CS, 2009-2016, 2016-...
- □ PhD from UT-Austin, worked at Google, VMware, Microsoft Research, Intel, AMD
- https://people.inf.ethz.ch/omutlu/
- omutlu@gmail.com (Best way to reach me)
- https://people.inf.ethz.ch/omutlu/projects.htm

#### Research and Teaching in:

- Computer architecture, computer systems, hardware security, bioinformatics
- Memory and storage systems
- □ Hardware security, safety, predictability
- Fault tolerance
- □ Hardware/software cooperation
- Architectures for bioinformatics, health, medicine

• ..

