# Hermes & Sibyl: ML-Driven Memory & Storage Management

Onur Mutlu omutlu@gmail.com

https://people.inf.ethz.ch/omutlu

27 September 2023

VMware



ETH Zürich



# Data-Driven (Self-Optimizing) Architectures

# System Architecture Design Today

- Human-driven
  - Humans design the policies (how to do things)
- Many (too) simple, short-sighted policies all over the system
- No automatic data-driven policy learning
- (Almost) no learning: cannot take lessons from past actions

# Can we design fundamentally intelligent architectures?

# An Intelligent Architecture

- Data-driven
  - Machine learns the "best" policies (how to do things)
- Sophisticated, workload-driven, changing, far-sighted policies
- Automatic data-driven policy learning
- All controllers are intelligent data-driven agents

# We need to rethink design (of all controllers)

# Self-Optimizing Memory Controllers

 Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach" Proceedings of the <u>35th International Symposium on Computer Architecture</u> (ISCA), pages 39-50, Beijing, China, June 2008. Selected to the ISCA-50 25-Year Retrospective Issue covering 1996-2020 in 2023 (Retrospective (pdf) Full Issue).

Self-Optimizing Memory Controllers: A Reinforcement Learning Approach

Engin İpek<sup>1,2</sup> Onur Mutlu<sup>2</sup> José F. Martínez<sup>1</sup> Rich Caruana<sup>1</sup>

<sup>1</sup>Cornell University, Ithaca, NY 14850 USA

<sup>2</sup>Microsoft Research, Redmond, WA 98052 USA

# Self-Optimizing Memory Prefetchers

Rahul Bera, Konstantinos Kanellopoulos, Anant Nori, Taha Shahroodi, Sreenivas Subramoney, and Onur Mutlu, "Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning" Proceedings of the <u>54th International Symposium on Microarchitecture</u> (MICRO), Virtual, October 2021. [Slides (pptx) (pdf)] [Short Talk Slides (pptx) (pdf)] [Lightning Talk Slides (pptx) (pdf)] [Talk Video (20 minutes)] [Lightning Talk Video (1.5 minutes)] [Pythia Source Code (Officially Artifact Evaluated with All Badges)] [arXiv version] Officially artifact evaluated as available, reusable and reproducible.



### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Anant V. Nori<sup>2</sup> Taha Shahroodi<sup>3,1</sup> Sreenivas Subramoney<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

https://arxiv.org/pdf/2109.12021.pdf

# Learning-Based Off-Chip Load Predictors

 Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, and Onur Mutlu,
 "Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction"
 Proceedings of the <u>55th International Symposium on Microarchitecture</u> (MICRO), Chicago, IL, USA, October 2022.
 [Slides (pptx) (pdf)]
 [Longer Lecture Slides (pptx) (pdf)]
 [Talk Video (12 minutes)]
 [Lecture Video (25 minutes)]
 [arXiv version]
 [Source Code (Officially Artifact Evaluated with All Badges)]
 Officially artifact evaluated as available, reusable and reproducible. Best paper award at MICRO 2022.



### Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Shankar Balachandran<sup>2</sup> David Novo<sup>3</sup> Ataberk Olgun<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Intel Processor Architecture Research Lab <sup>3</sup>LIRMM, Univ. Montpellier, CNRS

https://arxiv.org/pdf/2209.00188.pdf

# Self-Optimizing Hybrid SSD Controllers

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gomez-Luna, Sander Stuijk, Henk Corporaal, and Onur Mutlu, "Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning" Proceedings of the <u>49th International Symposium on Computer</u> <u>Architecture (ISCA)</u>, New York, June 2022. [Slides (pptx) (pdf)] [arXiv version] [Sibyl Source Code] [Talk Video (16 minutes)]

### Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

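Gagandeep Singh<sup>1</sup> Rakesh Nadig<sup>1</sup> Jisung Park<sup>1</sup> Rahul Bera<sup>1</sup> Nastaran Hajinazar<sup>1</sup> David Novo<sup>3</sup> Juan Gómez-Luna<sup>1</sup> Sander Stuijk<sup>2</sup> Henk Corporaal<sup>2</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Eindhoven University of Technology <sup>3</sup>LIRMM, Univ. Montpellier, CNRS

https://arxiv.org/pdf/2205.07394.pdf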

# Hermes: Perceptron-Based Off-Chip Load Prediction

# Hermes Talk Video



Computer Architecture - Lecture 18: Cutting-Edge Research in Computer Architecture (Fall 2022)

https://www.youtube.com/watch?v=PWWBtrL60dQ&t=3609s







# Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

https://github.com/CMU-SAFARI/Hermes







https://arxiv.org/pdf/2209.00188.pdf

# **The Key Problem**



Long-latency off-chip load requests often **stall** the processor by **blocking instruction retirement** from the Reorder Buffer (ROB)

## **Traditional Solutions**

- Employ sophisticated prefetchers
- Increase size of on-chip caches

# Key Observation 1





*Figure: number of off-chip loads without any prefetcher*

### **On-chip cache access latency** significantly contributes to off-chip load latency



40% of the stalls can be eliminated by removing on-chip cache access latency from critical path

# Caches are Getting Bigger and Slower...



# **Our Goal**

## Improve processor performance by **removing on-chip cache access latency** from the **critical path of off-chip loads**



Hermes:

- Predicts which load requests are likely to go off-chip
- Starts **fetching** data **directly** from **main memory** while concurrently accessing the cache hierarchy

# **Key Contribution**

Hermes employs **the first perceptron-based** off-chip load predictor

- By **learning** from multiple program context information

## **Hermes Overview**





# **Designing the Off-Chip Load Predictor**

### **History-based prediction**

HMP [Yoaz+, ISCA'99] for the **L1-D cache** 

Using **branch-predictor-like** hybrid predictor:



### Tracking-based prediction

- Metadata size increases with cache hierarchy size
- May need to track **all** cache operations
  - Gets complex depending on the cache hierarchy configuration (e.g., inclusivity, bypassing, ...)

POPET provides both higher accuracy and higher performance than predictors inspired by these previous works

### Learning from program behavior

Correlate different program features with off-chip loads



- Low storage overhead
- Low design complexity



## **POPET:** Perceptron-Based Off-Chip Predictor

- Multi-feature hashed perceptron model<sup>[1]</sup>
  - Each feature has its own weight table
    - Stores correlation between feature value and off-chip prediction





# **Predicting using POPET**

- Uses simple table lookups, addition, and comparison
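The three operations named above (table lookup, addition, comparison) can be sketched as follows. This is a minimal illustration, not the Hermes source code: the struct name `PopetSketch`, the table size, the hash function, and the `>=` comparison are assumptions made for the sketch; only the activation threshold value is taken from the POPET configuration table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTableEntries = 1024;  // assumed weight-table size
constexpr int kActivationThreshold = -18;    // tau_act from the POPET configuration

// Hypothetical hashed-perceptron predictor: one weight table per program feature.
struct PopetSketch {
    std::vector<std::array<int8_t, kTableEntries>> tables;

    explicit PopetSketch(std::size_t num_features) : tables(num_features) {
        for (auto &table : tables) table.fill(0);
    }

    // Assumed hash: mixes the feature value and folds it into a table index.
    static std::size_t hash_index(uint64_t feature_value) {
        feature_value ^= feature_value >> 17;
        feature_value *= 0x9E3779B97F4A7C15ULL;
        return static_cast<std::size_t>(feature_value >> 32) % kTableEntries;
    }

    // Table lookups and addition: one weight per feature, summed.
    int sum_weights(const std::vector<uint64_t> &feature_values) const {
        assert(feature_values.size() == tables.size());
        int sum = 0;
        for (std::size_t i = 0; i < tables.size(); ++i)
            sum += tables[i][hash_index(feature_values[i])];
        return sum;
    }

    // Comparison: predict off-chip when the sum reaches the activation threshold.
    bool predict_offchip(const std::vector<uint64_t> &feature_values) const {
        return sum_weights(feature_values) >= kActivationThreshold;
    }
};
```

Keeping `sum_weights` separate from `predict_offchip` lets a training step reuse the same sum without repeating the lookups.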









# **Training POPET**



# **Features Used in Hermes**

**Table 1: The initial set of program features used for automated feature selection.** $\oplus$ represents a bitwise XOR operation.

| Features without control-flow information | Features with control-flow information |
|-------------------------------------------|----------------------------------------|
| 1. Load virtual address                   | 8. Load PC                             |
| 2. Virtual page number                    | 9. PC $\oplus$ load virtual address    |
| 3. Cacheline offset in page               | 10. PC $\oplus$ virtual page number    |
| 4. First access                           | 11. PC $\oplus$ cacheline offset       |
| 5. Cacheline offset + first access        | 12. PC + first access                  |
| 6. Byte offset in cacheline               | 13. PC $\oplus$ byte offset            |
| 7. Word offset in cacheline               | 14. PC $\oplus$ word offset            |
|                                           | 15. Last-4 load PCs                    |
|                                           | 16. Last-4 PCs                         |
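To make a few of the features in Table 1 concrete, the sketch below derives them from a load's PC and virtual address. It assumes 64-byte cachelines and 4 KiB pages, and it models the "+" in the table as concatenation with a one-bit first-access flag. Those choices, and all names (`LoadFeatures`, `make_features`, and the helpers), are illustrative assumptions, not details stated in the slides.

```cpp
#include <cstdint>

constexpr unsigned kLineBits = 6;   // assumed 64-byte cachelines
constexpr unsigned kPageBits = 12;  // assumed 4 KiB pages

// Hypothetical container for four of the features listed in Table 1.
struct LoadFeatures {
    uint64_t pc_xor_cacheline_offset;
    uint64_t pc_xor_byte_offset;
    uint64_t pc_plus_first_access;
    uint64_t cacheline_offset_plus_first_access;
};

// Index of the cacheline within its page.
inline uint64_t cacheline_offset_in_page(uint64_t vaddr) {
    return (vaddr >> kLineBits) & ((1ULL << (kPageBits - kLineBits)) - 1);
}

// Index of the byte within its cacheline.
inline uint64_t byte_offset_in_cacheline(uint64_t vaddr) {
    return vaddr & ((1ULL << kLineBits) - 1);
}

inline LoadFeatures make_features(uint64_t pc, uint64_t vaddr, bool first_access) {
    const uint64_t line_off = cacheline_offset_in_page(vaddr);
    const uint64_t byte_off = byte_offset_in_cacheline(vaddr);
    const uint64_t first = static_cast<uint64_t>(first_access);
    // "+" is modelled as appending the first-access bit (an assumption).
    return LoadFeatures{pc ^ line_off, pc ^ byte_off,
                        (pc << 1) | first, (line_off << 1) | first};
}
```

Each resulting value would then be hashed into that feature's own weight table.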

#### **Table 2: POPET configuration parameters**

| Parameter         | Value                                                                                                                                                            |
|-------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Selected features | <ul> <li>PC ⊕ cacheline offset</li> <li>PC ⊕ byte offset</li> <li>PC + first access</li> <li>Cacheline offset + first access</li> <li>Last-4 load PCs</li> </ul> |
| Threshold values  | $\tau_{act} = -18$, $T_N = -35$, $T_P = 40$                                                                                                                      |
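One plausible way the training thresholds $T_N$ and $T_P$ could be used is sketched below: weights are adjusted only while the cumulative weight lies between the two thresholds, moving up when the load actually went off-chip and down otherwise. The exact training rule is described in the paper; the update-by-one step, the 5-bit weight range, and the name `train_popet` are assumptions of this sketch.

```cpp
#include <cstdint>
#include <vector>

constexpr int kTrainNeg = -35;   // T_N from Table 2
constexpr int kTrainPos = 40;    // T_P from Table 2
constexpr int kWeightMin = -16;  // assumed 5-bit saturating weights
constexpr int kWeightMax = 15;

// `weights` points at the table entries read at prediction time and `sum` is
// their cumulative value. Returns true if a training update was applied.
inline bool train_popet(const std::vector<int8_t *> &weights, int sum,
                        bool went_offchip) {
    // Outside [T_N, T_P] the perceptron is already confident: skip training.
    if (sum < kTrainNeg || sum > kTrainPos) return false;
    for (int8_t *w : weights) {
        if (went_offchip) {
            if (*w < kWeightMax) ++*w;   // strengthen the off-chip correlation
        } else {
            if (*w > kWeightMin) --*w;   // weaken it, saturating at the minimum
        }
    }
    return true;
}
```

Bounding updates by the thresholds keeps already-confident weights from drifting further, which is the usual motivation for such thresholds in perceptron predictors.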

# **Evaluation**

# **Simulation Methodology**

- ChampSim trace driven simulator
- **110 single-core** memory-intensive traces
  - SPEC CPU 2006 and 2017
  - PARSEC 2.1
  - Ligra
  - Real-world applications

- **220 eight-core** memory-intensive trace mixes

### LLC Prefetchers

- Pythia [Bera+, MICRO'21]
- Bingo [Bakhshalipour+, HPCA'19]
- MLOP [Shakerinava+, 3rd Prefetching Championship'19]
- SPP + Perceptron filter [Bhatia+, ISCA'19]
- SMS [Somogyi+, ISCA'06]

### Off-Chip Predictors

- History-based: HMP [Yoaz+, ISCA'99]
- Tracking-based: Address Tag-Tracking based Predictor (TTP)
- Ideal Off-chip Predictor

## **Latency Configuration**



### Cache round-trip latency

- L1-D: 5 cycles
- L2: **15** cycles
- LLC: **55** cycles

### Hermes request issue latency (incurred after address translation)

- Depends on the interconnect between POPET and the memory controller (MC)



## **Single-Core Performance Improvement**



Hermes provides nearly 90% of the performance benefit of Ideal Hermes, which has an ideal off-chip load predictor

## **Increase in Main Memory Requests**

*Figure legend: Hermes, Pythia, Pythia + Hermes, Pythia + Ideal Hermes*



Hermes is more **bandwidth-efficient** than even an efficient prefetcher like Pythia



## Performance with Varying Memory Bandwidth



Hermes+Pythia outperforms Pythia across all bandwidth configurations

# Performance with Varying Baseline Prefetcher



# **Overhead of Hermes**



\*On top of an Intel Alder Lake-like performance-core [2] configuration

# More in the Paper

- Performance sensitivity to:
  - Cache hierarchy access latency
  - Hermes request issue latency
  - Activation threshold
  - ROB size (in extended version on arXiv)
  - LLC size (in extended version on arXiv)
- Accuracy, coverage, and performance analysis against HMP and TTP
- Understanding usefulness of each program feature
- Effect on stall cycle reduction
- Performance analysis on an eight-core system

#### Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Shankar Balachandran<sup>2</sup> David Novo<sup>3</sup> Ataberk Olgun<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Intel Processor Architecture Research Lab <sup>3</sup>LIRMM, Univ. Montpellier, CNRS

Long-latency load requests continue to limit the performance of modern high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: (1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and (2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy to solely determine that it needs to go off-chip.

The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: (1) accurately predict which load requests […]

[…] off-chip main memory (i.e., an *off-chip load*) often stalls the processor core by blocking the instruction retirement from the reorder buffer (ROB), thus limiting the core's performance [88, 91, 92]. To increase the latency tolerance of a core, computer architects primarily rely on two key techniques. First, they employ increasingly sophisticated hardware prefetchers that can learn complex memory address patterns and fetch data required by future load requests before the core demands them [28, 32, 33, 35, 75]. Second, they significantly scale up the size of the on-chip cache hierarchy with each new generation of processors [10, 11, 16].

**Key problem.** Despite recent advances in processor core design, we observe two key trends in new processor designs that leave a significant opportunity for performance improvement on the table. First, even a sophisticated state-of-the-art […]

#### https://arxiv.org/pdf/2209.00188.pdf

# To Summarize...

# Summary

- Hermes advocates for **off-chip load prediction**, a **different** form of speculation than the **load address prediction** employed by prefetchers
- **Off-chip load prediction** can be applied **by itself** or **combined with load address prediction** to provide performance improvement
- Hermes employs the first perceptron-based off-chip load predictor



### Hermes is Open Sourced





## All workload traces





### 13 prefetchers

- Stride [Fu+, MICRO'92]
- Streamer [Chen and Baer, IEEE TC'95]
- SMS [Somogyi+, ISCA'06]
- AMPM [Ishii+, ICS'09]
- Sandbox [Pugsley+, HPCA'14]
- BOP [Michaud, HPCA'16]
- SPP [Kim+, MICRO'16]
- Bingo [Bakhshalipour+, HPCA'19]
- SPP+PPF [Bhatia+, ISCA'19]
- DSPatch [Bera+, MICRO'19]
- MLOP [Shakerinava+, DPC-3'19]
- IPCP [Pakalapati+, ISCA'20]
- Pythia [Bera+, MICRO'21]

### 9 off-chip predictors

| Predictor type | Description                                                 |
|----------------|-------------------------------------------------------------|
| Base           | Always NO                                                   |
| Basic          | Simple confidence counter-based threshold                   |
| Random         | Random hit-miss predictor with a given positive probability |
| HMP-Local      | Hit-miss predictor [Yoaz+, ISCA'99] with local prediction   |
| HMP-GShare     | Hit-miss predictor with GShare prediction                   |
| HMP-GSkew      | Hit-miss predictor with GSkew prediction                    |
| HMP-Ensemble   | Hit-miss predictor with all three types combined            |
| TTP            | Tag-tracking based predictor                                |
| Perc           | Perceptron-based OCP used in this paper                     |

https://github.com/CMU-SAFARI/Hermes

### **Easy To Define Your Own Off-Chip Predictor**

- Just extend the `OffchipPredBase` class

```
class OffchipPredBase
{
public:
    uint32_t cpu;
    string type;
    uint64_t seed;
    uint8_t dram_bw; // current DRAM bandwidth bucket

    OffchipPredBase(uint32_t _cpu, string _type, uint64_t _seed) : cpu(_cpu), type(_type), seed(_seed)
    {
        srand(seed);
        dram_bw = 0;
    }
    ~OffchipPredBase() {}
    void update_dram_bw(uint8_t _dram_bw) { dram_bw = _dram_bw; }

    virtual void print_config();
    virtual void dump_stats();
    virtual void reset_stats();
    virtual void train(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry);
    virtual bool predict(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry);
};

#endif /* OFFCHIP_PRED_BASE_H */
```

- Define your own `train()` and `predict()` functions

```
void OffchipPredBase::train(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry)
{
    // nothing to train
}

bool OffchipPredBase::predict(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry)
{
    // predict randomly
    // return (rand() % 2) ? true : false;
    return false;
}
```
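As an illustration of extending the base class, the sketch below defines a small per-PC saturating-counter predictor. So that it compiles on its own, it declares simplified stand-ins for `ooo_model_instr`, `LSQ_ENTRY`, and `OffchipPredBase`; the `went_offchip` field, the class name `OffchipPredPcCounter`, and the counter policy are assumptions made for this example, not code from the Hermes repository.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified stand-ins; in the simulator these types come from its own headers.
struct ooo_model_instr { uint64_t ip; };
struct LSQ_ENTRY { bool went_offchip; };  // hypothetical outcome field

class OffchipPredBase {
public:
    uint32_t cpu;
    std::string type;
    uint64_t seed;
    OffchipPredBase(uint32_t c, std::string t, uint64_t s)
        : cpu(c), type(std::move(t)), seed(s) {}
    virtual ~OffchipPredBase() {}
    virtual void train(ooo_model_instr *, uint32_t, LSQ_ENTRY *) {}
    virtual bool predict(ooo_model_instr *, uint32_t, LSQ_ENTRY *) { return false; }
};

// Hypothetical predictor: a 2-bit saturating counter per load PC.
class OffchipPredPcCounter : public OffchipPredBase {
    std::unordered_map<uint64_t, uint8_t> counters;

public:
    OffchipPredPcCounter(uint32_t c, uint64_t s)
        : OffchipPredBase(c, "pc_counter", s) {}

    // Count up when the load went off-chip, down otherwise.
    void train(ooo_model_instr *arch_instr, uint32_t, LSQ_ENTRY *lq_entry) override {
        uint8_t &ctr = counters[arch_instr->ip];
        if (lq_entry->went_offchip) {
            if (ctr < 3) ++ctr;
        } else if (ctr > 0) {
            --ctr;
        }
    }

    // Predict off-chip once the counter is in its upper half.
    bool predict(ooo_model_instr *arch_instr, uint32_t, LSQ_ENTRY *) override {
        auto it = counters.find(arch_instr->ip);
        return it != counters.end() && it->second >= 2;
    }
};
```

Only `train()` and `predict()` are overridden here, which mirrors the two extension points named on the slide.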

- Get statistics like accuracy (stat name `precision`) and coverage (stat name `recall`) out of the box

| Statistic                         | Value   |
|-----------------------------------|---------|
| `Core_0_offchip_pred_true_pos`    | 2358716 |
| `Core_0_offchip_pred_false_pos`   | 276883  |
| `Core_0_offchip_pred_false_neg`   | 132145  |
| `Core_0_offchip_pred_precision`   | 89.49   |
| `Core_0_offchip_pred_recall`      | 94.69   |
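The reported precision and recall follow from the three raw counters under the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below recomputes them; the struct and function names are hypothetical, and the simulator's own rounding may differ in the last digit.

```cpp
#include <cstdint>

// Hypothetical holder for the raw off-chip prediction counters.
struct OffchipPredStats {
    uint64_t true_pos;
    uint64_t false_pos;
    uint64_t false_neg;
};

// Accuracy in the slide's terminology: share of off-chip predictions that were right.
inline double precision_pct(const OffchipPredStats &s) {
    const uint64_t predicted = s.true_pos + s.false_pos;
    return predicted ? 100.0 * static_cast<double>(s.true_pos) / predicted : 0.0;
}

// Coverage in the slide's terminology: share of actual off-chip loads that were predicted.
inline double recall_pct(const OffchipPredStats &s) {
    const uint64_t actual = s.true_pos + s.false_neg;
    return actual ? 100.0 * static_cast<double>(s.true_pos) / actual : 0.0;
}
```

With the counters shown above, both functions return values that match the reported 89.49 and 94.69 to two decimal places.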

### **Off-Chip Prediction Can Further Enable...**

- **Prioritizing** loads that are likely to go off-chip in cache queues and on-chip network routing
- **Better instruction scheduling** of data-dependent instructions
- Other ideas to improve **performance** and **fairness** in multi-core system design...









### **Hermes Discussion**

- FAQs
  - What is the selected set of program features?
  - Can you provide some intuition on why these features work?
  - What happens in case of a misprediction?
  - What's the performance headroom for off-chip prediction?
  - Do you see a variance of different features in final prediction accuracy?
- Simulation Methodology
  - System parameters
  - Evaluated workloads
- More Results
  - Percentage of off-chip requests
  - Reduction in stall cycles by reducing the critical path
  - Fraction of off-chip load requests
  - Accuracy and coverage of POPET
  - Effect of different features
  - Are all features required?
  - 1C performance
  - 1C performance line graph
  - 1C performance against prior predictors
  - Effect on stall cycles
  - 8C performance
  - Sensitivity:
    - Hermes request issue latency
    - Cache hierarchy access latency
    - Activation threshold
    - ROB size
    - LLC size
  - Power overhead
  - Accuracy without prefetcher
  - Main memory request overhead with different prefetchers

# Sibyl: Reinforcement Learning based Data Placement in Hybrid SSDs

# Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu








# **Executive Summary**

- **Background**: A hybrid storage system (HSS) uses multiple different storage devices to provide high and scalable storage capacity at high performance
- **Problem**: Two key shortcomings of prior data placement policies:
  - Lack of adaptivity to:
    - Workload changes
    - Changes in device types and configurations
  - Lack of extensibility to more devices
- **Goal**: Design a data placement technique that provides:
  - Adaptivity, by continuously learning and adapting to the application and underlying device characteristics
  - Easy extensibility to incorporate a wide range of hybrid storage configurations
- **Contribution**: Sibyl, the first reinforcement learning-based data placement technique in hybrid storage systems that:
  - Provides adaptivity to changing workload demands and underlying device characteristics
  - Can easily extend to any number of storage devices
  - Provides ease of design and implementation that requires only a small computation overhead
- **Key Results**: Evaluated on real systems using a wide range of workloads
  - Sibyl **improves performance by 21.6%** compared to the best previous data placement technique in a dual-HSS configuration
  - In a tri-HSS configuration, Sibyl outperforms the state-of-the-art policy by 48.2%
  - Sibyl achieves 80% of the performance of an oracle policy with a storage overhead of only 124.4 KiB

https://github.com/CMU-SAFARI/Sibyl

# **Talk Outline**

- Key Shortcomings of Prior Data Placement Techniques
- Formulating Data Placement as Reinforcement Learning
- Sibyl: Overview
- Evaluation of Sibyl and Key Results
- Conclusion



# **Hybrid Storage System Basics**

*Figure: logical address space (application/file system view)*


### **Key Shortcomings in Prior Techniques**

We observe **two key shortcomings** that significantly limit the performance benefits of prior techniques

1. Lack of **adaptivity** to:
   - Workload changes
   - Changes in device types and configuration
2. Lack of **extensibility** to more devices



# Lack of Adaptivity (1/2)

### **Workload Changes**

Prior data placement techniques consider only a few workload characteristics that are statically tuned



# Lack of Adaptivity (2/2)

**Changes in Device Types and Configurations** 

Prior techniques do not consider **underlying storage device characteristics** (e.g., changes in the level of asymmetry in read/write latencies, garbage collection)



# Lack of Extensibility

**Rigid techniques** that require significant effort to accommodate more than two devices

A change in storage configuration requires designing a new policy


### **Our Goal**

A data-placement mechanism that can provide:

1. **Adaptivity**, by continuously learning and adapting to the application and underlying device characteristics
2. **Easy extensibility** to incorporate a wide range of hybrid storage configurations



### **Our Proposal**



**Sibyl** formulates data placement in hybrid storage systems as a **reinforcement learning problem**



Sibyl is an oracle that makes accurate prophecies https://en.wikipedia.org/wiki/Sibyl


### **Basics of Reinforcement Learning (RL)**



An agent interacts with an environment: it learns to take an **action** in a given **state** to maximize a numerical **reward**



### **Formulating Data Placement as RL**



# What is State?

- Limited number of state features:
  - Reduce the implementation overhead
  - RL agent is more sensitive to reward

$O_t = (\mathit{size}_t, \mathit{type}_t, \mathit{intr}_t, \mathit{cnt}_t, \mathit{cap}_t, \mathit{curr}_t)$

- We **quantize the state representation** into bins to reduce storage overhead
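Quantizing the six state features into a small number of bins could look like the sketch below. The slides do not give bin counts or value ranges, so every constant here is a placeholder. The reading of the features (request size, request type, access interval, access count, remaining fast-device capacity, current placement) is an assumption based on the feature names in $O_t$, and the names `quantize`, `QuantizedState`, and `quantize_state` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Maps a raw value in [0, max_value] to one of `num_bins` equal-width bins;
// values above max_value are clamped into the last bin.
inline uint8_t quantize(uint64_t value, uint64_t max_value, uint8_t num_bins) {
    assert(max_value > 0 && num_bins > 0);
    const uint64_t clamped = std::min(value, max_value);
    return static_cast<uint8_t>(clamped * num_bins / (max_value + 1));
}

// Hypothetical quantized observation, one small field per state feature.
struct QuantizedState {
    uint8_t size, type, intr, cnt, cap, curr;
};

inline QuantizedState quantize_state(uint64_t size_pages, bool is_write,
                                     uint64_t access_interval,
                                     uint64_t access_count,
                                     uint64_t free_fast_pages,
                                     uint64_t total_fast_pages,
                                     uint8_t current_device) {
    // All ranges and bin counts below are assumed values for illustration.
    return QuantizedState{quantize(size_pages, 255, 8),
                          static_cast<uint8_t>(is_write),
                          quantize(access_interval, 65535, 64),
                          quantize(access_count, 65535, 64),
                          quantize(free_fast_pages, total_fast_pages, 8),
                          current_device};
}
```

Each field fits in a few bits, which is what keeps the per-page metadata small.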



# What is Reward?

- Defines the **objective** of Sibyl
- We formulate the reward as a function of the request latency
- Encapsulates three key aspects:
  - Internal state of the device (e.g., read/write latencies, the latency of garbage collection, queuing delays, ...)
  - Throughput
  - Evictions
- More details in the paper
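A latency-based reward that also accounts for evictions might be sketched as below. The exact reward definition is given in the paper and is not reproduced in these slides. The inverse-latency form, the eviction penalty, the scale constant, and the name `placement_reward` are assumptions chosen only to illustrate how one scalar can reflect device latency and evictions together.

```cpp
#include <algorithm>
#include <cassert>

constexpr double kEvictionPenaltyScale = 0.001;  // assumed penalty scale

// Reward for serving one request. Lower latency yields a higher reward; if the
// placement triggered an eviction, a penalty proportional to the eviction
// latency is subtracted, and the result is floored at zero.
inline double placement_reward(double request_latency_us, bool caused_eviction,
                               double eviction_latency_us) {
    assert(request_latency_us > 0.0);
    const double base = 1.0 / request_latency_us;
    if (!caused_eviction) return base;
    return std::max(0.0, base - kEvictionPenaltyScale * eviction_latency_us);
}
```

Because the served latency already includes effects such as queuing delays and garbage collection, a reward of this shape captures the device's internal state without modelling it explicitly.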

# What is Action?

- At every new page request, the action is to select a storage device
- Action can be easily extended to any number of storage devices
- Sibyl learns to proactively evict or promote a page
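Selecting a device from per-device value estimates can be sketched with an epsilon-greedy rule, a common exploration strategy in reinforcement learning. The slides do not state which exploration policy Sibyl uses, so the policy here and the name `select_device` are assumptions. The point of the sketch is that the number of devices is simply the length of the value vector, so adding a device adds one action.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <vector>

// q_values[d] is the agent's current value estimate for placing the requested
// page on device d. With probability `epsilon` a random device is explored;
// otherwise the device with the highest estimate is chosen.
inline std::size_t select_device(const std::vector<double> &q_values,
                                 double epsilon, std::mt19937 &rng) {
    assert(!q_values.empty());
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<std::size_t> pick(0, q_values.size() - 1);
        return pick(rng);
    }
    const auto best = std::max_element(q_values.begin(), q_values.end());
    return static_cast<std::size_t>(std::distance(q_values.begin(), best));
}
```

Passing a vector of length two models a dual-HSS configuration, and a vector of length three models a tri-HSS configuration, with no other change to the function.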


# **Sibyl Execution**



# **Sibyl Design: Overview**



### **RL Decision Thread**




## **RL Training Thread**



## **Periodic Weight Transfer**
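The slide titles suggest two copies of the network: one updated by the training thread and one read by the decision thread, with weights copied across periodically. The sketch below shows that idea in a single-threaded form. The struct `TwoNetworkAgent`, the plain gradient step, and the fixed synchronization period are assumptions for illustration; a real implementation would also need synchronization between the two threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical agent holding two copies of the same weight vector.
struct TwoNetworkAgent {
    std::vector<double> training_weights;   // updated on every training step
    std::vector<double> inference_weights;  // read when making decisions
    std::size_t sync_period;                // assumed: steps between transfers
    std::size_t steps_since_sync = 0;

    TwoNetworkAgent(std::size_t num_weights, std::size_t period)
        : training_weights(num_weights, 0.0),
          inference_weights(num_weights, 0.0),
          sync_period(period) {
        assert(period > 0);
    }

    // One gradient step on the training copy; every `sync_period` steps the
    // training weights are copied into the inference copy.
    void train_step(const std::vector<double> &gradient, double learning_rate) {
        assert(gradient.size() == training_weights.size());
        for (std::size_t i = 0; i < training_weights.size(); ++i)
            training_weights[i] -= learning_rate * gradient[i];
        if (++steps_since_sync == sync_period) {
            inference_weights = training_weights;
            steps_since_sync = 0;
        }
    }
};
```

Keeping a separate inference copy means decisions are never made with half-updated weights, at the cost of acting on slightly stale ones between transfers.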




## **Evaluation Methodology (1/3)**

### Real system with various HSS configurations

- Dual-hybrid and tri-hybrid systems



## **Evaluation Methodology (2/3)**

### **Cost-Oriented HSS Configuration**



High-end SSD

Low-end HDD

### **Performance-Oriented HSS Configuration**





## **Evaluation Methodology (3/3)**

- 18 different workloads from:
  - MSR Cambridge and Filebench Suites
- Four state-of-the-art data placement baselines






### **Cost-Oriented HSS Configuration**



Sibyl consistently outperforms all the baselines for all the workloads




### **Performance-Oriented HSS Configuration**



Sibyl provides 21.6% performance improvement by dynamically adapting its data placement policy



Sibyl achieves 80% of the performance of an oracle policy that has complete knowledge of future access patterns



## **Performance on Tri-HSS**



### Extending Sibyl for more devices:

1. Add a new action
2. Add the remaining capacity of the new device as a state feature



Sibyl outperforms the state-of-the-art data placement policy by 48.2% in a real tri-hybrid system

Sibyl reduces the system architect's burden by providing ease of extensibility

## Sibyl's Overhead

- 124.4 KiB of total storage cost
  - Experience buffer, inference and training network
- 40-bit metadata overhead per page for state features
- Inference latency of ~10 ns
- Training latency of ~2 µs



## More in the Paper (1/3)

- Throughput (IOPS) evaluation
  - Sibyl provides higher IOPS than baseline policies because it indirectly captures throughput (size/latency)
- Evaluation on unseen workloads
  - Sibyl can effectively adapt its policy to highly dynamic workloads
- Evaluation on **mixed workloads**
  - Sibyl provides performance benefits as high as those for single workloads



## More in the Paper (2/3)

- Evaluation on different features
  - Sibyl autonomously decides which features are important to maximize the performance
- Evaluation with different hyperparameter values

- Sensitivity to fast storage capacity
  - Sibyl provides scalability by dynamically adapting its policy to available storage size
- Explainability analysis of Sibyl's decision making
  - Explain Sibyl's actions for different workload characteristics and device configurations

## More in the Paper (3/3)


https://arxiv.org/pdf/2205.07394.pdf

https://github.com/CMU-SAFARI/Sibyl




## Conclusion

- We introduced Sibyl, the first reinforcement learning-based data placement technique in hybrid storage systems that provides
  - Adaptivity
  - Easy extensibility
  - Ease of design and implementation
- We evaluated Sibyl on real systems using many different workloads
  - Sibyl **improves performance by 21.6%** compared to the best prior data placement policy in a dual-HSS configuration
  - In a tri-HSS configuration, Sibyl **outperforms** the state-of-the-art data placement policy by **48.2%**
  - Sibyl achieves 80% of the performance of an oracle policy with a storage overhead of only 124.4 KiB


https://github.com/CMU-SAFARI/Sibyl





## Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu






## ISCA 2022 Paper, Slides, Videos

 Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, and Onur Mutlu, "Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning" Proceedings of the <u>49th International Symposium on Computer Architecture</u> (ISCA), New York, June 2022.
 [Slides (pptx) (pdf)] [arXiv version]
 [Sibyl Source Code] [Talk Video (16 minutes)]


## SSD Course (Spring 2023)

#### Spring 2023 Edition:

https://safari.ethz.ch/projects_and_seminars/spring2023/doku.php?id=modern_ssds

#### Fall 2022 Edition:

https://safari.ethz.ch/projects_and_seminars/fall2022/doku.php?id=modern_ssds

#### YouTube Livestream (Spring 2023):

https://www.youtube.com/watch?v=4VTwOMmsnJY&list=PL5Q2soXY2Zi_8qOM5Icpp8hB2SHtm4z57&pp=iAQB

#### YouTube Livestream (Fall 2022):

https://www.youtube.com/watch?v=hqLrd-Uj0aU&list=PL5Q2soXY2Zi9BJhenUq4JI5bwhAMpAp13&pp=iAQB

- Project course
  - Taken by Bachelor's/Master's students
  - SSD Basics and Advanced Topics
  - Hands-on research exploration
  - Many research readings

#### https://www.youtube.com/onurmutlulectures



Fall 2022 Meetings/Schedule

| Week | Date       | Livestream       | Meeting | Learning Materials | Assignment |
|------|------------|------------------|---------|--------------------|------------|
| W1   | 06.10      |                  | M1: P&S Course Presentation | Required<br>Recommended | |
| W2   | 12.10      | YouTube Live     | M2: Basics of NAND Flash-Based SSDs<br>(PDF) (PPT) | Required<br>Recommended | |
| W3   | 19.10      | YouTube Live     | M3: NAND Flash Read/Write Operations<br>(PDF) (PPT) | Required<br>Recommended | |
| W4   | 26.10      | YouTube Live     | M4: Processing inside NAND Flash | Required<br>Recommended | |
| W5   | 02.11      | YouTube Live     | M5: Advanced NAND Flash Commands & Mapping | Required<br>Recommended | |
| W6   | 09.11      | YouTube Live     | M6: Processing inside Storage | Required<br>Recommended | |
| W7   | 23.11      | YouTube Live     | M7: Address Mapping & Garbage Collection | Required<br>Recommended | |
| W8   | 30.11      | YouTube Live     | M8: Introduction to MQSim | Required<br>Recommended | |
| W9   | 14.12      | YouTube Live     | M9: Fine-Grained Mapping and Multi-Plane Operation-Aware Block Management | Required<br>Recommended | |
| W10  | 04.01.2023 | YouTube Premiere | M10a: NAND Flash Basics | Required<br>Recommended | |
|      |            |                  | M10b: Reducing Solid-State Drive Read Latency by Optimizing Read-Retry | Required<br>Recommended | |
|      |            |                  | M10c: Evanesco: Architectural Support for Efficient Data Sanitization in Modern Flash-Based Storage Systems | Required<br>Recommended | |
|      |            |                  | M10d: DeepSketch: A New Machine Learning-Based Reference Search Technique for Post-Deduplication Delta Compression<br>(PDF) (PPT) (Paper) | Required<br>Recommended | |
| W11  | 11.01      | YouTube Live     | M11: FLIN: Enabling Fairness and Enhancing Performance in Modern NVMe Solid State Drives<br>(PDF) (PPT) | Required | |
| W12  | 25.01      | YouTube Premiere | M12: Flash Memory and Solid-State Drives | Recommended | |

## Comp Arch (Fall 2021)

- Fall 2021 Edition:
  - https://safari.ethz.ch/architecture/fall2021/doku.php?id=schedule
- Fall 2020 Edition:
  - https://safari.ethz.ch/architecture/fall2020/doku.php?id=schedule
- YouTube Livestream (2021):
  - https://www.youtube.com/watch?v=4yfkM_5EFgo&list=PL5Q2soXY2Zi-Mnk1PxjEIG32HAGILkTOF
- YouTube Livestream (2020):
  - https://www.youtube.com/watch?v=c3mPdZA-Fmc&list=PL5Q2soXY2Zi9xidyIgBxUz7xRPS-wisBN
- Master's level course
  - Taken by Bachelor's/Master's/PhD students
  - Cutting-edge research topics + fundamentals in Computer Architecture
  - 5 Simulator-based Lab Assignments
  - Potential research exploration
  - Many research readings

#### https://www.youtube.com/onurmutlulectures




#### Fall 2021 Lectures & Schedule

| Week | Date          | Livestream   | Lecture | Readings | Lab | HW |
|------|---------------|--------------|---------|----------|-----|----|
| W1   | 30.09<br>Thu. | YouTube Live | L1: Introduction and Basics | Required<br>Mentioned | Lab 1<br>Out | HW 0<br>Out |
|      | 01.10<br>Fri. | YouTube Live | L2: Trends, Tradeoffs and Design Fundamentals<br>(PDF) (PPT) | Required<br>Mentioned | | |
| W2   | 07.10<br>Thu. | YouTube Live | L3a: Memory Systems: Challenges and Opportunities<br>(PDF) (PPT) | Described<br>Suggested | | HW 1<br>Out |
|      |               |              | L3b: Course Info & Logistics | | | |
|      |               |              | L3c: Memory Performance Attacks | Described<br>Suggested | | |
|      | 08.10<br>Fri. | YouTube Live | L4a: Memory Performance Attacks | Described<br>Suggested | Lab 2<br>Out | |
|      |               |              | L4b: Data Retention and Memory Refresh | Described<br>Suggested | | |
|      |               |              | L4c: RowHammer | Described<br>Suggested | | |




## **Hermes Discussion**

#### FAQs

- What is the selected set of program features?
- <u>Can you provide some intuition on why these features work?</u>
- What happens in case of a misprediction?
- <u>What's the performance headroom for off-chip prediction?</u>
- <u>Do you see a variance of different features in final prediction accuracy?</u>

#### Simulation Methodology

- System parameters
- Evaluated workloads

#### More Results

- Percentage of off-chip requests
- <u>Reduction in stall cycles by reducing the critical path</u>
- Fraction of off-chip load requests
- Accuracy and coverage of POPET
- Effect of different features
- Are all features required?
- <u>1C performance</u>
- <u>1C performance line graph</u>
- <u>1C performance against prior predictors</u>
- Effect on stall cycles
- <u>8C performance</u>
- Sensitivity:
  - Hermes request issue latency
  - <u>Cache hierarchy access latency</u>
  - Activation threshold
  - <u>ROB size</u>
  - LLC size
- Power overhead
- Accuracy without prefetcher
- <u>Main memory request overhead with different prefetchers</u>

# HERMES BACKUP

## **Initial Set of Program Features**

| Features without control-flow information | Features with control-flow information |
|-------------------------------------------|----------------------------------------|
| 1. Load virtual address                   | 8. Load PC                             |
| 2. Virtual page number                    | 9. PC $\oplus$ load virtual address    |
| 3. Cacheline offset in page               | 10. PC $\oplus$ virtual page number    |
| 4. First access                           | 11. PC $\oplus$ cacheline offset       |
| 5. Cacheline offset + first access        | 12. PC + first access                  |
| 6. Byte offset in cacheline               | 13. PC $\oplus$ byte offset            |
| 7. Word offset in cacheline               | 14. PC $\oplus$ word offset            |
|                                           | 15. Last-4 load PCs                    |
|                                           | 16. Last-4 PCs                         |

## **Selected Set of Program Features**

### Five features

- PC $\oplus$ cacheline offset
- PC $\oplus$ byte offset
- PC + first access
- Cacheline offset + first access
- Last-4 load PCs

First access: a binary hint that represents whether or not a cacheblock has been recently touched
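As a rough illustration of how the five selected features could be derived from a load's PC and virtual address, here is a minimal sketch. It assumes 64-byte cachelines and 4 KiB pages, reads $\oplus$ as bitwise XOR and "+" as appending the first-access bit, and folds the last four load PCs with a made-up shift-XOR hash. The function name, dictionary keys, and hash are hypothetical and are not taken from the Hermes implementation.

```python
from collections import deque

CACHELINE_BITS = 6   # assumption: 64-byte cachelines
PAGE_BITS = 12       # assumption: 4 KiB pages


def selected_features(pc, vaddr, first_access, last_load_pcs):
    """Return illustrative values for the five selected features of one load."""
    byte_offset = vaddr & ((1 << CACHELINE_BITS) - 1)
    lines_per_page_mask = (1 << (PAGE_BITS - CACHELINE_BITS)) - 1
    line_offset = (vaddr >> CACHELINE_BITS) & lines_per_page_mask
    first = 1 if first_access else 0

    # Fold the last four load PCs into one value (hypothetical shift-XOR hash).
    folded = 0
    for old_pc in last_load_pcs:
        folded = ((folded << 1) ^ old_pc) & 0xFFFFFFFF

    return {
        "pc_xor_cacheline_offset": pc ^ line_offset,
        "pc_xor_byte_offset": pc ^ byte_offset,
        "pc_plus_first_access": (pc << 1) | first,
        "cacheline_offset_plus_first_access": (line_offset << 1) | first,
        "last_4_load_pcs": folded,
    }


history = deque([0x400100, 0x400108, 0x400110, 0x400118], maxlen=4)
features = selected_features(pc=0x400120, vaddr=0x7FFF1234,
                             first_access=True, last_load_pcs=history)
```

Each feature value would then index its own weight table in the predictor; the table sizes and index hashing are left out of this sketch.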



## When Does a Feature Work or Not Work?



### Without prefetcher

- PC + first access
- Cacheline offset + first access

### With a simple stride prefetcher

- Cacheline offset + first access



## What Happens in case of a Misprediction?

- Two cases of misprediction:
  - Predicted on-chip but actually goes off-chip
    - Loss of performance improvement opportunity
  - Predicted off-chip but actually is on-chip
    - Memory controller forwards the data to LLC if and only if a load to the same address has already missed LLC and arrived at the memory controller
- No need for misprediction detection and recovery
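The forwarding rule for the "predicted off-chip but actually on-chip" case can be sketched as a toy model. The class and method names below are hypothetical, and the assumption that the check happens when the speculatively fetched data becomes ready is a simplification for illustration, not a description of the actual hardware.

```python
class MemoryControllerModel:
    """Toy model of the forwarding rule; all names are hypothetical."""

    def __init__(self):
        self.demand_arrived = set()  # addresses whose load missed the LLC
        self.forwarded = []          # speculative data sent on to the LLC
        self.dropped = []            # speculative data silently discarded

    def demand_miss(self, addr):
        # A regular load missed the LLC and reached the memory controller.
        self.demand_arrived.add(addr)

    def hermes_data_ready(self, addr):
        # Data fetched for a predicted-off-chip load is forwarded only if a
        # demand load to the same address has actually arrived.
        if addr in self.demand_arrived:
            self.forwarded.append(addr)
            return True
        # Otherwise the load was served on-chip: drop the data, no recovery.
        self.dropped.append(addr)
        return False


mc = MemoryControllerModel()
mc.demand_miss(0x1000)                       # correct off-chip prediction
forwarded_a = mc.hermes_data_ready(0x1000)
forwarded_b = mc.hermes_data_ready(0x2000)   # mispredicted: load hit on-chip
```

In this model a wrong off-chip prediction costs only the extra main memory request; nothing in the core or caches has to be rolled back.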



### **Performance Headroom of Off-Chip Prediction**



### **System Parameters**

#### **Table 4: Simulated system parameters**

| Component    | Configuration |
|--------------|---------------|
| Core         | 1 and 8 cores, 6-wide fetch/execute/commit, 512-entry ROB, 128/72-entry LQ/SQ, Perceptron branch predictor [61] with 17-cycle misprediction penalty |
| L1/L2 Caches | Private, 48KB/1.25MB, 64B line, 12/20-way, 16/48 MSHRs, LRU, 5/15-cycle round-trip latency [25] |
| LLC          | 3MB/core, 64B line, 12-way, 64 MSHRs/slice, SHiP [122], 55-cycle round-trip latency [24, 25], **Pythia** prefetcher [32] |
| Main Memory  | **1C:** 1 channel, 1 rank per channel; **8C:** 4 channels, 2 ranks per channel; 8 banks per rank, DDR4-3200 MTPS, 64b databus per channel, 2KB row buffer per bank, tRCD=12.5ns, tRP=12.5ns, tCAS=12.5ns |
| Hermes       | Hermes-O/P: 6/18-cycle Hermes request issue latency |




#### **Table 5: Workloads used for evaluation**

| Suite  | #Workloads | #Traces        | Example Workloads                |
|--------|------------|----------------|----------------------------------|
| SPEC06 | 14         | 22             | gcc, mcf, cactusADM, lbm,        |
| SPEC17 | 11         | 23             | gcc, mcf, pop2, fotonik3d,       |
| PARSEC | 4          | 12             | canneal, facesim, raytrace,      |
| Ligra  | 11         | 20             | BFS, PageRank, Radii,            |
| CVP    | 33         | 33             | integer, floating-point, server, |



### **Observation: Not All Off-Chip Loads are Prefetched**



Nearly 50% of the loads are still not prefetched

70% of these off-chip loads block the ROB

### **Observation: With Large Cache Comes Longer Latency**

- On-chip cache access latency significantly contributes to the latency of an off-chip load






**40%** of stall cycles caused by an off-chip load can be eliminated by removing on-chip cache access latency from its critical path





### What Fraction of Load Requests Goes Off-Chip?



# **Off-Chip Prediction Quality:** *Defining Metrics*





# **Off-Chip Prediction Quality:** Analysis



POPET provides off-chip predictions with high accuracy and high coverage
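To make the two metrics concrete, the sketch below computes them from per-load outcomes. It assumes the usual precision/recall-style definitions: accuracy is the fraction of predicted-off-chip loads that really went off-chip, and coverage is the fraction of off-chip loads that were predicted as such. The function name and the sample data are made up for illustration.

```python
def accuracy_and_coverage(predicted_off_chip, actually_off_chip):
    """Return (accuracy, coverage) for paired per-load boolean outcomes."""
    pairs = list(zip(predicted_off_chip, actually_off_chip))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)

    # Guard against empty denominators (no predictions / no off-chip loads).
    predicted = true_pos + false_pos
    off_chip = true_pos + false_neg
    accuracy = true_pos / predicted if predicted else 0.0
    coverage = true_pos / off_chip if off_chip else 0.0
    return accuracy, coverage


# Hypothetical outcomes for six loads.
predicted = [True, True, False, True, False, False]
actual = [True, False, True, True, False, True]
acc, cov = accuracy_and_coverage(predicted, actual)
```

Here two of the three off-chip predictions are correct, and two of the four off-chip loads are caught.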



# **Effect of Different Features**



Combination of features provides both higher accuracy and higher coverage than any individual feature



# Are All Features Required? (1)



### No single feature individually provides the highest prediction accuracy across *all* workloads



# Are All Features Required? (2)




### No single feature individually provides the highest prediction coverage across *all* workloads either



# **Single-Core Performance**



### Hermes in combination with Pythia outperforms Pythia alone in every workload category



# **Single-Core Performance Line Graph**





### Single-Core Performance Against Prior Predictors



**POPET provides higher performance benefit** than prior predictors

Hermes with POPET achieves nearly 90% of the performance improvement of Ideal Hermes





# **Effect on Stall Cycles**



Hermes reduces off-chip-load-induced stall cycles by 16.2% on average (up to 51.8%)



# **Eight-Core Performance**



Hermes in combination with Pythia outperforms Pythia alone by **5.1%** on average



# **Effect of Hermes Request Issue Latency**



Hermes in combination with Pythia outperforms Pythia alone even with a 24-cycle Hermes request issue latency

Hermes request issue latency (in processor cycles)



# **Effect of Cache Hierarchy Access Latency**



Hermes can provide even higher performance benefit in future processors with bigger and slower on-chip caches

On-chip cache hierarchy access latency (in processor cycles)



# **Effect of Activation Threshold**



With an increase in the activation threshold:

1. Accuracy increases
2. Coverage decreases
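The trade-off follows from how a perceptron-style predictor such as POPET decides: the weights selected by the features are summed, and the load is predicted off-chip only if the sum reaches the activation threshold. The sketch below uses made-up weights and outcomes, and the `>=` comparison is an assumption; it only illustrates the direction of the effect.

```python
def predict_off_chip(weights, activation_threshold):
    """Perceptron-style decision: sum the per-feature weights and compare."""
    return sum(weights) >= activation_threshold


def evaluate(loads, activation_threshold):
    """Return (accuracy, coverage) over (weights, went_off_chip) pairs."""
    tp = fp = fn = 0
    for weights, went_off_chip in loads:
        predicted = predict_off_chip(weights, activation_threshold)
        if predicted and went_off_chip:
            tp += 1
        elif predicted and not went_off_chip:
            fp += 1
        elif went_off_chip:
            fn += 1
    accuracy = tp / (tp + fp) if tp + fp else 0.0
    coverage = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, coverage


# Hypothetical per-feature weights and ground truth for seven loads.
loads = [
    ([4, 3, 4, 2, 3], True),
    ([2, 1, 2, 1, 1], True),
    ([1, 2, 1, 1, 1], False),
    ([-2, -1, 0, -3, -1], False),
    ([3, 3, 2, 2, 2], True),
    ([1, 0, 1, 0, 1], True),
    ([2, 1, 1, 1, 0], False),
]
low = evaluate(loads, activation_threshold=4)
high = evaluate(loads, activation_threshold=10)
```

With the higher threshold only the most confident loads are predicted off-chip, so fewer predictions are wrong but more off-chip loads are missed.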



### **Power Overhead**





### **Effect of ROB Size**





### **Effect of LLC Size**





### Accuracy and Coverage with Different Prefetchers



POPET's accuracy and coverage increase significantly in the absence of a data prefetcher



### **Increase in Main Memory Requests**





# **SIBYL BACKUP**

# **Performance on Unseen Workloads**



In the H&M (H&L) HSS configuration, Sibyl outperforms RNN-HSS and Archivist by 46.1% (54.6%) and 8.5% (44.1%), respectively

# **Performance Analysis**

#### **Performance-Oriented HSS Configuration**





# **Performance on Mixed Workloads**






### **Performance With Different Features**



Sibyl autonomously decides which features are important to maximize the performance of the running workload

# **Sensitivity to Fast Storage Capacity**



# **Explainability Analysis**



# **Training and Inference Network**

- Training and inference networks allow parallel execution
- Observation vector as the input: $\langle size_t, type_t, intr_t, cnt_t, cap_t, curr_t \rangle$
- Produces a probability distribution of Q-values
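One way to read "probability distribution of Q-values" is that, for each action, the network outputs probabilities over a fixed set of candidate Q-values, and the agent picks the action with the highest expected Q-value. The sketch below illustrates that reading only. The action names, the three-point support, and the probabilities are hypothetical, and the network that would map the observation to these probabilities is not modeled.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    """The six-element observation vector; field names follow the slide."""
    size: float
    type: float
    intr: float
    cnt: float
    cap: float
    curr: float


def expected_q(probabilities, support):
    """Expected Q-value of one action given probabilities over the support."""
    return sum(p * z for p, z in zip(probabilities, support))


def select_action(q_distributions, support):
    """Pick the action whose Q-value distribution has the highest mean."""
    return max(q_distributions,
               key=lambda action: expected_q(q_distributions[action], support))


# Hypothetical input; a trained inference network would consume this vector.
obs = Observation(size=8.0, type=1.0, intr=3.0, cnt=5.0, cap=0.4, curr=0.0)

# Hypothetical network output for that observation.
support = [-1.0, 0.0, 1.0]
q_distributions = {
    "fast_device": [0.1, 0.2, 0.7],
    "slow_device": [0.3, 0.4, 0.3],
}
chosen = select_action(q_distributions, support)
```

Keeping a separate training network means weights can be updated in the background while the inference network keeps serving placement decisions.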