# **Storage-Centric Computing** for Modern Data-Intensive Workloads

Onur Mutlu omutlu@gmail.com https://people.inf.ethz.ch/omutlu

17 May 2024







# Computing is Bottlenecked by Data



### Data is Key for AI, ML, Genomics, ...

Important workloads are all data intensive

 They require rapid and efficient processing of large amounts of data

- Data is increasing
  - We can generate more than we can process
  - We need to perform more sophisticated analyses on more data

# Exponential Growth of Neural Networks

SAFARI



Source: https://youtu.be/Bh13Idwcb0Q?t=283

### Huge Demand for Performance & Efficiency




http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped

### Do We Want This?




### Or This?



Source: V. Milutinovic

Challenge and Opportunity for Future

High Performance, Energy Efficient, Sustainable (All at the Same Time)

Data access is the major performance and energy bottleneck

# Our current design principles cause great energy waste (and great performance loss)

## Today's Computing Systems

- Processor centric
- All data processed in the processor  $\rightarrow$  at great system cost



# It's the Memory, Stupid!

### "It's the Memory, Stupid!" (Richard Sites, MPR, 1996)

#### **RICHARD SITES**

#### It's the Memory, Stupid!

When we started the Alpha architecture design in 1988, we estimated a 25-year lifetime and a relatively modest 32% per year compounded performance improvement of implementations over that lifetime (1,000× total). We guestimated about 10× would come from CPU clock improvement, 10× from multiple instruction issue, and 10× from multiple processors.

### 5, 1996 MICROPROCESSOR REPORT

I expect that over the coming decade memory subsystem design will be the *only* important design issue for microprocessors.

### Processor-Centric System Performance



Mutlu+, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," HPCA 2003.

### Processor-Centric System Performance

All of Google's Data Center Workloads (2015):



Kanev+, "Profiling a Warehouse-Scale Computer," ISCA 2015.

# Data Movement vs. Computation Energy



# A memory access consumes ~100-1000X the energy of a complex addition




### Data Movement vs. Computation Energy



A memory access consumes 6400X the energy of a simple integer addition

## Energy Waste in Mobile Devices

 Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the <u>23rd International Conference on Architectural Support for Programming</u> <u>Languages and Operating Systems</u> (ASPLOS), Williamsburg, VA, USA, March 2018.

### 62.7% of the total system energy is spent on data movement

### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand<sup>1</sup> Saugata Ghose<sup>1</sup> Youngsok Kim<sup>2</sup> Rachata Ausavarungnirun<sup>1</sup> Eric Shiu<sup>3</sup> Rahul Thakur<sup>3</sup> Daehyun Kim<sup>4,3</sup> Aki Kuusela<sup>3</sup> Allan Knies<sup>3</sup> Parthasarathy Ranganathan<sup>3</sup> Onur Mutlu<sup>5,1</sup>

## Energy Waste in Accelerators

 Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu,
 "Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks"
 Proceedings of the <u>30th International Conference on Parallel Architectures and Compilation</u> <u>Techniques</u> (PACT), Virtual, September 2021.
 [Slides (pptx) (pdf)]
 [Talk Video (14 minutes)]

### > 90% of the total system energy is spent on memory in large ML models

#### **Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks**

Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>‡</sup> Berkin Akin<sup>§</sup> Ravi Narayanaswami<sup>§</sup> Geraldo F. Oliveira<sup>★</sup> Xiaoyu Ma<sup>§</sup> Eric Shiu<sup>§</sup> Onur Mutlu<sup>★†</sup>

<sup>†</sup>Carnegie Mellon Univ. <sup>•</sup>Stanford Univ. <sup>‡</sup>Univ. of Illinois Urbana-Champaign <sup>§</sup>Google <sup>★</sup>ETH Zürich

# Processing of data is performed far away from the data

# We Need A **Paradigm Shift** To ...

Enable computation with minimal data movement

Compute where it makes sense (where data resides)

Make computing architectures more data-centric

### Process Data Where It Makes Sense




https://www.gsmarena.com/apple\_announces\_m1\_ultra\_with\_20core\_cpu\_and\_64core\_gpu-news-53481.php

# Memory/Storage as an Accelerator



#### Memory similar to a "conventional" accelerator

# Goal: Processing Inside Memory/Storage



# Vision: Storage-Centric Computing (I)

- Storage system is a heterogeneous computing device with hybrid memory
- Storage system enables data-centric design of systems & workloads
- Application-driven customization enables a powerful data-centric engine



**Fig. 1.** (a) SSD system architecture, showing controller (Ctrl) and chips. (b) Detailed view of connections between controller components and chips.


Cai+, "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives," Proc. IEEE 2017.

https://arxiv.org/pdf/1711.11427.pdf

## Vision: Storage-Centric Computing (II)



Workload-customized storage-centric computing

### Workload-Customized Storage-Centric Computing

- Software and hardware customized for major workloads
  - Genomics
  - Video analytics
  - Data & graph analytics
  - Machine learning
  - ...

- Data-centric (processing capability in all memories)
- Data-driven (design & decision making)
- Data-aware (optimization & design)
- Unified interfaces for efficient & fast communication


# Processing in Storage: Two Approaches

Processing using Storage
 Processing near Storage

# In-Storage Genomic Data Filtering [ASPLOS 2022]

Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu,
 "GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis"
 Proceedings of the <u>27th International Conference on Architectural Support for</u> Programming Languages and Operating Systems (ASPLOS), Virtual, February-March 2022.

[Lightning Talk Slides (pptx) (pdf)] [Lightning Talk Video (90 seconds)]

### GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis

Nika Mansouri Ghiasi<sup>1</sup> Jisung Park<sup>1</sup> Harun Mustafa<sup>1</sup> Jeremie Kim<sup>1</sup> Ataberk Olgun<sup>1</sup> Arvid Gollwitzer<sup>1</sup> Damla Senol Cali<sup>2</sup> Can Firtina<sup>1</sup> Haiyu Mao<sup>1</sup> Nour Almadhoun Alserr<sup>1</sup> Rachata Ausavarungnirun<sup>3</sup> Nandita Vijaykumar<sup>4</sup> Mohammed Alser<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Bionano Genomics <sup>3</sup>KMUTNB <sup>4</sup>University of Toronto

## **Genome Sequence Analysis**

- Read mapping: first key step in genome sequence analysis
  - Aligns reads to potential matching locations in the reference genome
  - For each matching location, the alignment step finds the degree of similarity (alignment score)



- Calculating the alignment score requires computationally expensive approximate string matching (ASM), which accounts for differences between reads and the reference genome due to:
  - Sequencing errors
  - Genetic variation
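To make the cost of approximate string matching concrete, here is a minimal sketch that computes the Levenshtein edit distance between a read and a candidate reference region using the classic dynamic-programming recurrence. It is only an illustration: real read mappers use richer scoring schemes and heavily optimized kernels, and the function name and example sequences are hypothetical.

```python
def edit_distance(read: str, ref: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn `read` into `ref` (Levenshtein distance)."""
    prev = list(range(len(ref) + 1))  # distance from the empty read prefix
    for i, r in enumerate(read, start=1):
        curr = [i]  # distance from read[:i] to the empty reference prefix
        for j, g in enumerate(ref, start=1):
            cost = 0 if r == g else 1
            curr.append(min(prev[j] + 1,          # delete a read character
                            curr[j - 1] + 1,      # insert a reference character
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


exact = edit_distance("ACGT", "ACGT")       # identical sequences
one_error = edit_distance("ACGT", "AGGT")   # a single substitution
```

The work grows with the product of the two sequence lengths, which is why skipping this step for reads that do not need it is attractive.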








### **Compute-Centric Accelerators**





# **Key Idea: In-Storage Filtering**

*Filter* reads that do *not* require alignment *inside the storage system* 



### **Exactly-matching reads**

Do not need expensive approximate string matching during alignment

### **Non-matching reads**

Do not have potential matching locations and can skip alignment


### GenStore

- Key idea: Filter reads that do not require alignment inside the storage system

### Challenges

- Different behavior across read mapping workloads
- Limited hardware resources in the SSD



### **Filtering Opportunities**

- Sequencing machines produce one of two kinds of reads
  - Short reads: highly accurate and short
  - Long reads: less accurate and long

Reads that do not require the expensive alignment step:

### **Exactly-matching reads**

Do not need expensive approximate string matching during alignment

- Low sequencing error rates (short reads) combined with
- Low genetic variation

### **Non-matching reads**

Do not have potential matching locations, so they skip alignment

- High sequencing error rates (long reads) or
- High genetic variation (short or long reads)
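The two filtering opportunities above can be sketched in software as follows. This is a simplified stand-in, not GenStore's hardware design: it assumes an exact-match check against all read-length substrings of the reference, treats a read as non-matching when it shares no k-mer with the reference, and ignores reverse complements. All function names, the reference string, and the parameters are hypothetical.

```python
def build_index(reference: str, length: int) -> set:
    """All substrings of the given length that occur in the reference."""
    return {reference[i:i + length]
            for i in range(len(reference) - length + 1)}


def has_seed_hit(read: str, kmers: set, k: int) -> bool:
    """True if at least one k-mer of the read occurs in the reference."""
    return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))


def filter_reads(reads, reference, read_len, k):
    """Split reads into those that still need alignment, exact matches,
    and reads with no potential matching location."""
    exact_index = build_index(reference, read_len)
    kmer_index = build_index(reference, k)
    to_align, exact_hits, non_matching = [], [], []
    for read in reads:
        if len(read) == read_len and read in exact_index:
            exact_hits.append(read)        # no approximate matching needed
        elif not has_seed_hit(read, kmer_index, k):
            non_matching.append(read)      # nothing to align against
        else:
            to_align.append(read)          # only these leave the filter
    return to_align, exact_hits, non_matching


to_align, exact_hits, non_matching = filter_reads(
    ["CGTA", "ACGA", "GGGG"], "ACGTACGTTGCA", read_len=4, k=3)
```

In this toy run only one of the three reads is passed on to alignment; the point of doing the filtering inside the storage system is that the other reads never have to cross the storage interface.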




### GenStore-EM for Exactly-Matching Reads

### GenStore-NM for Non-Matching Reads






## **Performance – GenStore-EM**



2.1× – 2.5× speedup compared to the software Base

1.5× – 3.3× speedup compared to the hardware Base

**On average 3.92× energy reduction** 

# **Performance – GenStore-NM**



22.4× – 27.9× speedup compared to the software Base

6.8× – 19.2× speedup compared to the hardware Base

**On average 27.2× energy reduction** 

# **Area and Power Consumption**

- Based on **synthesis** of **GenStore** accelerators using the Synopsys Design Compiler at the 65nm technology node

| Logic unit                 | # of instances | Area [mm²] | Power [mW] |
|----------------------------|----------------|------------|------------|
| Comparator                 | 1 per SSD      | 0.0007     | 0.14       |
| K-mer Window               | 2 per channel  | 0.0018     | 0.27       |
| Hash Accelerator           | 2 per SSD      | 0.008      | 1.8        |
| Location Buffer            | 1 per channel  | 0.00725    | 0.37375    |
| Chaining Buffer            | 1 per channel  | 0.008      | 0.95       |
| Chaining PE                | 1 per channel  | 0.004      | 0.98       |
| Control                    | 1 per SSD      | 0.0002     | 0.11       |
| Total for an 8-channel SSD | -              | 0.2        | 26.6       |

Only 0.006% of a 14nm Intel Processor, less than 9.5% of the three ARM processors in a SATA SSD controller
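The total row of the table can be cross-checked with a few lines of arithmetic. The sketch below assumes that the listed area and power are per instance and that per-channel units are replicated across all 8 channels; the variable and function names are illustrative.

```python
CHANNELS = 8  # the total row is reported for an 8-channel SSD

# (logic unit, instances per channel, instances per SSD, area mm^2, power mW)
UNITS = [
    ("Comparator",       0, 1, 0.0007,  0.14),
    ("K-mer Window",     2, 0, 0.0018,  0.27),
    ("Hash Accelerator", 0, 2, 0.008,   1.8),
    ("Location Buffer",  1, 0, 0.00725, 0.37375),
    ("Chaining Buffer",  1, 0, 0.008,   0.95),
    ("Chaining PE",      1, 0, 0.004,   0.98),
    ("Control",          0, 1, 0.0002,  0.11),
]


def totals(channels: int = CHANNELS):
    """Sum area and power over every instance of every logic unit."""
    area = power = 0.0
    for _name, per_channel, per_ssd, unit_area, unit_power in UNITS:
        count = per_channel * channels + per_ssd
        area += count * unit_area
        power += count * unit_power
    return area, power


area_mm2, power_mw = totals()
print(f"{area_mm2:.4f} mm^2, {power_mw:.2f} mW")
```

Under these assumptions the sums come to about 0.1997 mm² and 26.6 mW, which agrees with the rounded total row (0.2 mm², 26.6 mW).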


# Tight Integration of Genome Analysis Tasks

 Haiyu Mao, Mohammed Alser, Mohammad Sadrosadati, Can Firtina, Akanksha Baranwal, Damla Senol Cali, Aditya Manglik, Nour Almadhoun Alserr, and Onur Mutlu, "GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping" Proceedings of the <u>55th International Symposium on Microarchitecture</u> (MICRO), Chicago, IL, USA, October 2022.
 [Slides (pptx) (pdf)]
 [Longer Lecture Slides (pptx) (pdf)]
 [Lecture Video (25 minutes)]
 [arXiv version]

#### **GenPIP: In-Memory Acceleration of Genome Analysis** via Tight Integration of Basecalling and Read Mapping

Haiyu Mao<sup>1</sup> Mohammed Alser<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Can Firtina<sup>1</sup> Akanksha Baranwal<sup>1</sup> Damla Senol Cali<sup>2</sup> Aditya Manglik<sup>1</sup> Nour Almadhoun Alserr<sup>1</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>Bionano Genomics

https://arxiv.org/pdf/2209.08600.pdf

# Processing in Storage: Two Approaches

Processing using Storage
 Processing near Storage

# In-Flash Bulk Bitwise Execution

 Jisung Park, Roknoddin Azizi, Geraldo F. Oliveira, Mohammad Sadrosadati, Rakesh Nadig, David Novo, Juan Gómez-Luna, Myungsuk Kim, and Onur Mutlu, "Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory" Proceedings of the <u>55th International Symposium on Microarchitecture</u> (MICRO), Chicago, IL, USA, October 2022.
 [Slides (pptx) (pdf)]
 [Longer Lecture Slides (pptx) (pdf)]
 [Lecture Video (44 minutes)]
 [arXiv version]

### Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory

Jisung Park<sup>§∇</sup> Roknoddin Azizi<sup>§</sup> Geraldo F. Oliveira<sup>§</sup> Mohammad Sadrosadati<sup>§</sup> Rakesh Nadig<sup>§</sup> David Novo<sup>†</sup> Juan Gómez-Luna<sup>§</sup> Myungsuk Kim<sup>‡</sup> Onur Mutlu<sup>§</sup>

<sup>§</sup>ETH Zürich <sup>∇</sup>POSTECH <sup>†</sup>LIRMM, Univ. Montpellier, CNRS <sup>‡</sup>Kyungpook National University

https://arxiv.org/pdf/2209.05566.pdf

### Data-Movement Bottleneck

Conventional systems: Outside-storage processing (OSP) that must move the entire data to CPUs/GPUs through the memory hierarchy



External I/O bandwidth of storage systems is the main bottleneck in conventional systems (OSP)

In-storage processing (ISP) uses in-storage compute units (embedded cores or FPGA) to send only the computation results

ISP can mitigate data movement overhead by reducing SSD-external data movement

SSD-internal bandwidth becomes the new bottleneck in ISP

## In-Flash Processing (IFP)

Performs computation inside NAND flash chips



SSD internal I/O bandwidth: ~ 10 GB/s



IFP fundamentally mitigates data movement

## Our Proposal: Flash-Cosmos

#### Flash-Cosmos enables

- Computation on multiple operands with a single sensing operation
- Accurate computation results by eliminating raw bit errors in stored data




## Multi-Wordline Sensing (MWS): Bitwise AND

Intra-Block MWS:
 Simultaneously activates multiple wordlines (WLs) in the same block
 → Bitwise AND of the data stored in those WLs

Flash-Cosmos (Intra-Block MWS) enables bitwise AND of multiple pages in the same block via a single sensing operation
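A purely functional model of intra-block MWS is sketched below. It assumes that a stored 1 corresponds to a conducting cell and that a bitline senses 1 only when every activated cell on its string conducts; it models the logical outcome only, not the electrical behavior, and all names are illustrative.

```python
from functools import reduce


def sense_bitline(cells_on_string):
    """A series string conducts only if every activated cell stores 1."""
    return int(all(cells_on_string))


def intra_block_mws(pages):
    """One sensing operation over several wordlines (pages) of a block."""
    return [sense_bitline(column) for column in zip(*pages)]


def serial_and(pages):
    """Reference result: read the pages one by one and AND them outside."""
    return reduce(lambda acc, page: [a & b for a, b in zip(acc, page)], pages)


pages = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 1, 1, 1],
]
result = intra_block_mws(pages)  # [1, 0, 0, 1, 0, 0, 1, 0]
```

The model returns the same bits as reading the three pages separately and combining them, but with a single sensing step instead of three page reads.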





## **Other Types of Bitwise Operations**

Flash-Cosmos also enables other types of bitwise operations (NOT/NAND/NOR/XOR/XNOR) leveraging existing features of NAND flash memory


### Key Ideas





Enhanced SLC-Mode Programming (ESP) to eliminate raw bit errors in stored data (and thus in computation results)



## Enhanced SLC-Mode Programming (ESP)

- Goal: eliminate raw bit errors in stored data (and computation results)
- Key ideas
  - Programs only a single bit per cell (SLC-mode programming)
    - Trades storage density for reliable computation
  - Performs more precise programming of the cells
    - Trades programming latency for reliable computation

#### Maximizes the reliability margin between the different states of flash cells



## Enhanced SLC-Mode Programming (ESP)

To eliminate raw bit errors in stored data (and computation results)

Flash-Cosmos (ESP) enables reliable in-flash computation by trading storage density & programming latency

#### Storage & latency overheads affect only data used in in-flash computation



# **Evaluation Methodology**

#### Real-device characterization

- To validate the feasibility and reliability of Flash-Cosmos
- Using 160 48-WL-layer 3D Triple-Level Cell NAND flash chips
  - 3,686,400 tested wordlines
- Under worst-case operating conditions
  - Under a 1-year retention time at 10K P/E cycles
  - Worst-case data patterns

#### System-level evaluation

- Using the state-of-the-art SSD simulator (MQSim [Tavakkol+, FAST'18])
- Three real-world applications
  - Bitmap Indices (BMI): Bitwise AND of up to ~1,000 operands
  - Image Segmentation (IMS): Bitwise AND of 3 operands
  - K-clique Star Listing (KCS): Bitwise OR of up to 32 operands
- Baselines
  - Outside-Storage Processing (OSP): A multi-core CPU (Intel i7-11700K)
  - In-Storage Processing (ISP): An in-storage hardware accelerator
  - ParaBit [Gao+, MICRO'21]: State-of-the-art in-flash processing mechanism
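To show why a workload such as bitmap indices reduces to multi-operand bitwise AND, here is a small software sketch. Each bitmap marks the rows that satisfy one predicate, and a conjunctive query is the AND of all bitmaps. The data, predicates, and function names are made up for illustration and are not taken from the evaluated workloads.

```python
from functools import reduce
from operator import and_


def make_bitmap(values, predicate):
    """Bit i is set when row i satisfies the predicate."""
    bitmap = 0
    for row, value in enumerate(values):
        if predicate(value):
            bitmap |= 1 << row
    return bitmap


def query_and(bitmaps):
    """Rows satisfying every predicate: bitwise AND over all bitmaps."""
    return reduce(and_, bitmaps)


def matching_rows(bitmap):
    """Indices of the set bits, in ascending order."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


ages = [25, 41, 33, 19, 52]
active = [True, True, False, True, True]
bitmaps = [make_bitmap(ages, lambda a: a >= 21), make_bitmap(active, bool)]
rows = matching_rows(query_and(bitmaps))  # [0, 1, 4]
```

With hundreds of bitmaps over millions of rows, the same AND chain becomes a bulk bitwise operation over many large operands, which is the case the evaluation targets.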

No changes to the cell array of commodity NAND flash chips

Can have many operands (AND: up to 48, OR: up to 4) with small increase in sensing latency (< 10%)

ESP significantly improves the reliability of computation results (no observed bit error in the tested flash cells)

## **Results: Performance & Energy**



Flash-Cosmos provides significant performance & energy benefits over all the baselines

The larger the number of operands, the higher the performance & energy benefits


Processing in Storage: Adoption Challenges

Processing using Storage
 Processing near Storage

# Eliminating the Adoption Barriers

# How to Enable Adoption of Processing in Storage

# Potential Barriers to Adoption of PIM

1. Applications & software for PIM

2. Ease of **programming** (interfaces and compiler/HW support)

3. **System** and **security** support: coherence, synchronization, virtual memory, isolation, communication interfaces, ...

4. **Runtime** and **compilation** systems for adaptive scheduling, data mapping, access/sharing control, ...

5. Infrastructures to assess benefits and feasibility

#### All can be solved with a change of mindset

## We Need to Revisit the Entire Stack

- Problem
- Algorithm
- Program/Language
- System Software
- SW/HW Interface
- Micro-architecture
- Logic
- Devices
- Electrons

#### We can get there step by step

Challenge and Opportunity for Future

Fundamentally **Energy-Efficient** (Data-Centric) **Computing Architectures**

Fundamentally **High-Performance** (Data-Centric) **Computing Architectures**

# Computing Architectures with Minimal Data Movement



# Data-Driven (Self-Optimizing) Memory/Storage Architectures

# System Architecture Design Today

- Human-driven
  - Humans design the policies (how to do things)
- Many (too) simple, short-sighted policies all over the system
- No automatic data-driven policy learning
- (Almost) no learning: cannot take lessons from past actions

## Can we design fundamentally intelligent architectures?

# An Intelligent Architecture

- Data-driven
  - Machine learns the "best" policies (how to do things)
- Sophisticated, workload-driven, changing, far-sighted policies
- Automatic data-driven policy learning
- All controllers are intelligent data-driven agents

## We need to rethink design (of all controllers)

# Self-Optimizing Memory Controllers

 Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana,
 "Self Optimizing Memory Controllers: A Reinforcement Learning Approach"
 Proceedings of the <u>35th International Symposium on Computer Architecture</u> (ISCA), pages 39-50, Beijing, China, June 2008.

#### Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

Engin İpek<sup>1,2</sup> Onur Mutlu<sup>2</sup> José F. Martínez<sup>1</sup> Rich Caruana<sup>1</sup>

<sup>1</sup>Cornell University, Ithaca, NY 14850 USA

<sup>2</sup> Microsoft Research, Redmond, WA 98052 USA

### Self-Optimizing Memory Prefetchers

 Rahul Bera, Konstantinos Kanellopoulos, Anant Nori, Taha Shahroodi, Sreenivas Subramoney, and Onur Mutlu,
 "Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning"
 Proceedings of the <u>54th International Symposium on Microarchitecture</u> (MICRO), Virtual, October 2021.
 [Slides (pptx) (pdf)]
 [Short Talk Slides (pptx) (pdf)]
 [Lightning Talk Slides (pptx) (pdf)]
 [Talk Video (20 minutes)]
 [Lightning Talk Video (1.5 minutes)]
 [Pythia Source Code (Officially Artifact Evaluated with All Badges)]
 [arXiv version]
 Officially artifact evaluated as available, reusable and reproducible.



#### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Anant V. Nori<sup>2</sup> Taha Shahroodi<sup>3,1</sup> Sreenivas Subramoney<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

https://arxiv.org/pdf/2109.12021.pdf

# Learning-Based Off-Chip Load Predictors

 Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, and Onur Mutlu,
 "Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction"
 Proceedings of the <u>55th International Symposium on Microarchitecture</u> (MICRO), Chicago, IL, USA, October 2022.
 [Slides (pptx) (pdf)]
 [Longer Lecture Slides (pptx) (pdf)]
 [Talk Video (12 minutes)]
 [Lecture Video (25 minutes)]
 [arXiv version]
 [Source Code (Officially Artifact Evaluated with All Badges)]
 Officially artifact evaluated as available, reusable and reproducible. Best paper award at MICRO 2022.



#### Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Shankar Balachandran<sup>2</sup> David Novo<sup>3</sup> Ataberk Olgun<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Intel Processor Architecture Research Lab <sup>3</sup>LIRMM, Univ. Montpellier, CNRS

#### https://arxiv.org/pdf/2209.00188.pdf

### Self-Optimizing Storage Controllers

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gomez-Luna, Sander Stuijk, Henk Corporaal, and Onur Mutlu, "Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning" Proceedings of the <u>49th International Symposium on Computer</u> <u>Architecture</u> (ISCA), New York, June 2022. [Slides (pptx) (pdf)] [arXiv version] [Sibyl Source Code] [Talk Video (16 minutes)]

#### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

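Gagandeep Singh<sup>1</sup> Rakesh Nadig<sup>1</sup> Jisung Park<sup>1</sup> Rahul Bera<sup>1</sup> Nastaran Hajinazar<sup>1</sup> David Novo<sup>3</sup> Juan Gómez-Luna<sup>1</sup> Sander Stuijk<sup>2</sup> Henk Corporaal<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Eindhoven University of Technology <sup>3</sup>LIRMM, Univ. Montpellier, CNRS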

#### https://arxiv.org/pdf/2205.07394.pdf



# Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

<u>Gagandeep Singh</u>, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu




# **Hybrid Storage System Basics**

Logical Address Space (Application/File System View)



# **Key Shortcomings in Prior Techniques**

We observe **two key shortcomings** that significantly limit the performance benefits of prior techniques

1. Lack of **adaptivity** to:
   - a) Workload changes
   - b) Changes in device types and configuration
2. Lack of **extensibility** to more devices



# **Our Goal**

A data-placement mechanism that can provide:

1. **Adaptivity**, by continuously learning and adapting to the application and underlying device characteristics
2. **Easy extensibility** to incorporate a wide range of hybrid storage configurations

# **Our Proposal**



**Sibyl** formulates data placement in hybrid storage systems as a **reinforcement learning problem**

Sibyl is an oracle that makes accurate prophecies https://en.wikipedia.org/wiki/Sibyl



# **Basics of Reinforcement Learning (RL)**



Agent learns to take an **action** in a given **state** to maximize a numerical **reward** 
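The state, action, and reward loop can be illustrated with a toy tabular Q-learning agent that decides whether a page goes to a fast or a slow device. This is a deliberately simplified stand-in and not Sibyl's actual design: the two-valued state, the reward numbers, and the hyperparameters are all hypothetical, and tabular Q-learning is used here only because it is the shortest way to show the update rule.

```python
import random

FAST, SLOW = 0, 1   # actions: which device receives the page
HOT, COLD = 0, 1    # states: toy stand-in for request features

# Hypothetical rewards (higher is better), loosely standing in for 1/latency.
REWARD = {
    (HOT, FAST): 1.0, (HOT, SLOW): 0.1,
    (COLD, FAST): 0.2, (COLD, SLOW): 0.5,
}


def train(steps=20_000, alpha=0.1, gamma=0.3, epsilon=0.1, seed=0):
    """Learn action values with epsilon-greedy Q-learning."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (HOT, COLD) for a in (FAST, SLOW)}
    state = rng.choice((HOT, COLD))
    for _ in range(steps):
        if rng.random() < epsilon:                    # explore
            action = rng.choice((FAST, SLOW))
        else:                                         # exploit
            action = max((FAST, SLOW), key=lambda a: q[(state, a)])
        reward = REWARD[(state, action)]
        next_state = rng.choice((HOT, COLD))          # next request arrives
        best_next = max(q[(next_state, a)] for a in (FAST, SLOW))
        # Move the estimate toward reward plus discounted future value.
        q[(state, action)] += alpha * (reward + gamma * best_next
                                       - q[(state, action)])
        state = next_state
    return q


def policy(q, state):
    """Greedy placement decision for a given state."""
    return max((FAST, SLOW), key=lambda a: q[(state, a)])


q_table = train()
hot_choice, cold_choice = policy(q_table, HOT), policy(q_table, COLD)
```

In this toy setup the agent ends up placing hot pages on the fast device and cold pages on the slow one without that rule ever being written down; supporting another device would amount to adding one more action to the table.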



# **Formulating Data Placement as RL**




# **Sibyl Execution**



# **Sibyl Design: Overview**



# **Evaluation Methodology (1/3)**

### Real system with various HSS configurations

- Dual-hybrid and tri-hybrid systems



# **Evaluation Methodology (2/3)**

### **Cost-Oriented HSS Configuration**



High-end SSD

Low-end HDD

### **Performance-Oriented HSS Configuration**





# **Evaluation Methodology (3/3)**

- 18 different workloads from:
  - MSR Cambridge and Filebench Suites
- Four state-of-the-art data placement baselines:




# **Performance Analysis**



### **Cost-Oriented HSS Configuration**



Sibyl consistently outperforms all the baselines for all the workloads

# **Performance Analysis**



### **Performance-Oriented HSS Configuration**



### Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns



# **Performance on Tri-HSS**



Extending Sibyl for more devices: 1. Add a new action

Sibyl outperforms the state-of-the-art data placement policy by 48.2% in a real tri-hybrid system

Sibyl reduces the system architect's burden by providing ease of extensibility

# Sibyl: Summary

- We introduced Sibyl, the first reinforcement learning-based data placement technique in hybrid storage systems that provides
  - Adaptivity
  - Easy extensibility
  - Ease of design and implementation
- We evaluated Sibyl on real systems using many different workloads
  - In a tri-HSS configuration, Sibyl **outperforms** the state-of-the-art data placement policy by **48.2%**
  - Sibyl achieves **80% of the performance** of an oracle policy with a storage overhead of only **124.4 KiB**

#### https://github.com/CMU-SAFARI/Sibyl

Challenge and Opportunity for Future

# Data-Driven (Self-Optimizing) Computing Architectures


# Concluding Remarks


- We must design systems to be balanced, high-performance, energy-efficient (all at the same time) → intelligent systems
  - Data-centric, data-driven, data-aware
- Enable computation capability inside and close to storage
- This can
  - Lead to orders-of-magnitude improvements
  - Enable new applications & computing platforms
  - **Enable better understanding of nature**
  - ...

### Future of truly storage-centric computing is bright

- We need to do research & design across the computing stack

### Fundamentally Better Architectures

# **Data-centric**

# **Data-driven**

# **Data-aware**






Source: http://spectrum.ieee.org/image/MjYzMzAyMg.jpeg


### A Blueprint for Fundamentally Better Architectures

 Onur Mutlu,
 "Intelligent Architectures for Intelligent Computing Systems"
 Invited Paper in Proceedings of the Design, Automation, and Test in Europe Conference (DATE), Virtual, February 2021.
 [Slides (pptx) (pdf)]
 [IEDM Tutorial Slides (pptx) (pdf)]
 [Short DATE Talk Video (11 minutes)]
 [Longer IEDM Tutorial Video (1 hr 51 minutes)]

#### Intelligent Architectures for Intelligent Computing Systems

Onur Mutlu ETH Zurich omutlu@gmail.com

### Acknowledgments

# SAFARI Research Group safari.ethz.ch



### Onur Mutlu's SAFARI Research Group

#### Computer architecture, HW/SW, systems, bioinformatics, security, memory

https://safari.ethz.ch/safari-newsletter-january-2021/



### SAFARI Newsletter December 2021 Edition

#### <u>https://safari.ethz.ch/safari-newsletter-december-2021/</u>






### SAFARI Newsletter June 2023 Edition

#### <u>https://safari.ethz.ch/safari-newsletter-june-2023/</u>






### SAFARI Introduction & Research

#### Computer architecture, HW/SW, systems, bioinformatics, security, memory



Seminar in Computer Architecture - Lecture 5: Potpourri of Research Topics (Spring 2023)



### Referenced Papers, Talks, Artifacts

All are available at

https://people.inf.ethz.ch/omutlu/projects.htm

https://www.youtube.com/onurmutlulectures

https://github.com/CMU-SAFARI/

### Open Source Tools: SAFARI GitHub



#### https://github.com/CMU-SAFARI/




