

# Pythia

## A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

<u>Rahul Bera</u>, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu

https://github.com/CMU-SAFARI/Pythia





Mainly use one program context info. for prediction 2 Lack inherent system awareness

Lack in-silicon customizability







Why do prefetchers not perform well?







Autonomously learns to prefetch using multiple program context information and system-level feedback Can be customized in silicon to change program context information or prefetching objective on the fly





# **Brief Overview of Pythia**

Pythia formulates prefetching as a reinforcement learning problem



# What is State?

k-dimensional vector of features

 $S \equiv \{\phi_S^1, \phi_S^2, \dots, \phi_S^k\}$ 

• Feature = control-flow + data-flow

### Control-flow examples

- PC
- Branch PC
- Last-3 PCs, ...

#### Data-flow examples

- Cacheline address
- Physical page number
- Delta between two cacheline addresses
- Last 4 deltas, ...

#### Features of memory request to address A (e.g., PC) Processor & Memory Subsystem

# What is Action?

Given a demand access to address A the action is to select prefetch offset "O"

- Action-space: 127 actions in the range [-63, +63]
  - For a machine with 4KB page and 64B cacheline
- Upper and lower limits ensure prefetches do not cross physical page boundary
- A zero offset means no prefetch is generated
- We further **prune** action-space by design-space exploration

#### SAFARI

Prefetcher

Reward

Prefetch from addres

A+offset (0)

Features of memory

(e.g., PC)

# What is Reward?

- Defines the **objective** of Pythia
- Encapsulates two metrics:

- Features of memory request to address A (e.g., PC) Processor & Memory Suosystem
- Prefetch usefulness (e.g., accurate, late, out-of-page, ...)
- System-level feedback (e.g., mem. b/w usage, cache pollution, energy, ...)
- We demonstrate Pythia with memory bandwidth usage as the system-level feedback in the paper

# What is Reward?

### Seven distinct reward levels

- Accurate and timely (R<sub>AT</sub>)
- Accurate but late (R<sub>AL</sub>)
- Loss of coverage (R<sub>CL</sub>)
- Inaccurate
  - With low memory b/w usage (R<sub>IN</sub>-L)
  - With high memory b/w usage (R<sub>IN</sub>-H)
- No-prefetch
  - With low memory b/w usage (R<sub>NP</sub>-L)
  - With high memory b/w usage(R<sub>NP</sub>-H)
- Values are set at design time via automatic designspace exploration

Can be customized further in silicon for higher performance
 SAFARI



# **Simulation Methodology**

- Champsim [3] trace-driven simulator
- **150** single-core memory-intensive workload traces
  - SPEC CPU2006 and CPU2017
  - PARSEC 2.1
  - Ligra
  - Cloudsuite
- Homogeneous and heterogeneous multi-core mixes

#### • Five state-of-the-art prefetchers

- SPP [Kim+, MICRO'16]
- Bingo [Bakhshalipour+, HPCA'19]
- MLOP [Shakerinava+, 3<sup>rd</sup> Prefetching Championship, 2019]
- SPP+DSPatch [Bera+, MICRO'19]
- SPP+PPF [Bhatia+, ISCA'20]

## **Performance with Varying Core Count**



## **Performance with Varying Core Count**



## **Performance with Varying DRAM Bandwidth**



## **Performance with Varying DRAM Bandwidth**



## Pythia outperforms prior best prefetchers for a wide range of DRAM bandwidth configurations



## **Performance Improvement via Customization**

Reward value customization

### • Strict Pythia configuration

- Increasing the rewards for no prefetching
- **Decreasing** the rewards for **inaccurate prefetching**
- Strict Pythia is more conservative in generating prefetch requests than the basic Pythia
- Evaluate on all Ligra graph processing workloads

## **Performance Improvement via Customization**



## **Performance Improvement via Customization**



# Pythia can extract even higher performance via customization without changing hardware



# **Pythia's Overhead**

### • 25.5 KB of total metadata storage per core

- Only simple tables
- We also model functionally-accurate Pythia with full complexity in Chisel [4] HDL



of a desktop-class 4-core Skylake processor (Xeon D2132IT, 60W)



# More in the Paper

- Performance comparison with **unseen traces** 
  - Pythia provides equally high performance benefits
- Comparison against multi-level prefetchers
  - Pythia outperforms prior best multi-level prefetchers
- Understanding Pythia's learning with a case study
  - We reason towards the correctness of Pythia's decision
- Performance sensitivity towards different features and hyperparameter values
- Detailed single-core and four-core performance

# **More in the Paper**

Performance comparison with unseen traces
 Pythia provides equally high performance benefits

#### Comparison against multi-level prefetchers

#### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Anant V. Nori<sup>2</sup> Taha Shahroodi<sup>3,1</sup> Sreenivas Subramoney<sup>2</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

 Performance sensitivity towards unterent features and hyperparameter values

Detailed single-core and four-core performance

# **Pythia is Open Source**



## https://github.com/CMU-SAFARI/Pythia

- MICRO'21 artifact evaluated
- Champsim source code + Chisel modeling code
- All traces used for evaluation

| CMU-SAFARI / Pythia Public            |                                                               | <ul> <li>Unwat</li> </ul> | tch - 3 $\swarrow$ Star 9 $\heartsuit$ Fork 2                              |
|---------------------------------------|---------------------------------------------------------------|---------------------------|----------------------------------------------------------------------------|
| <> Code ⊙ Issues îî Pull reques       | sts 🕑 Actions 🔟 Projects 🖽 Wiki 🕕 Security 🗠                  | Insights 🔯 Se             | ttings                                                                     |
| 🐉 master 👻 🤔 1 branch 📀 5 tags        | Go to file Add fi                                             | ile - Code -              | About 8                                                                    |
| 👘 rahulbera Github pages documentatio | on 🗸 diefc65 7 hours ago                                      | 3 40 commits              | A customizable hardware prefetching<br>framework using online reinforcemen |
| branch                                | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | learning as described in the MICRO 2021 paper by Bera and                  |
| Config                                | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | Kanellopoulos et al.                                                       |
| docs                                  | Github pages documentation                                    | 7 hours ago               |                                                                            |
| experiments                           | Added chart visualization in Excel template                   | 2 months ago              | machine-learning                                                           |
| inc inc                               | Updated README                                                | 8 days ago                | reinforcement-learning<br>computer-architecture prefetcher                 |
| prefetcher                            | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | microarchitecture cache-replacement                                        |
| replacement                           | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | branch-predictor champsim-simulator                                        |
| scripts                               | Added md5 checksum for all artifact traces to verify download | 2 months ago              | champsim-tracer                                                            |
| src src                               | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | 🛱 Readme                                                                   |
| tracer                                | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | Ճ₫ View license                                                            |
| 🗅 .gitignore                          | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | Ç∄ Cite this repository →                                                  |
| CITATION.cff                          | Added citation file                                           | 8 days ago                |                                                                            |
|                                       | Updated LICENSE                                               | 2 months ago              | Releases 5                                                                 |
| LICENSE.champsim                      | Initial commit for MICRO'21 artifact evaluation               | 2 months ago              | V1.3 Latest                                                                |
|                                       |                                                               |                           |                                                                            |





# Pythia

## A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

<u>Rahul Bera</u>, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu

https://github.com/CMU-SAFARI/Pythia





# Discussion

#### • FAQs

- Why RL?
- What about large page?
- What's the prefetch degree?
- <u>Can customization happen during</u> workload execution?
- Can runtime mixing create problem?

#### Simulation and Methodology

- Basic Pythia configuration
- System parameters
- Configuration of prefetchers
- Evaluated workloads
- Feature selection

#### Detailed Design

- <u>Reward structure</u>
- Design overview
- **QVStore Organization**

#### • More Results

- <u>Comparison against other adaptive</u> <u>prefetchers</u>
- Comparison against Context prefetcher
- Feature combination sensitivity
- <u>Hyperparameter sensitivity</u>
- Comparison with multi-level prefetchers
- Performance in unseen workloads
- Single-core s-curve
- Four-core s-curve
- Detailed performance analysis
- Benefit of bandwidth awareness
- Case study
- Customizing rewards
- Customizing features



## Why RL? Why Not Supervised Learning?

- Determining the **benefits of prefetching** (i.e., whether a decision was good for performance or not) is **not easy** 
  - Depends on a complex set of metrics
    - Coverage, accuracy, timeliness
    - Effects on system: b/w usage, pollution, cross-application interference, ...
  - Dynamically-changing environmental conditions change the benefit
  - Delayed feedback due to long latency (might not receive feedback at all for inaccurate prefetches!)
- Differs from classification tasks (e.g., branch prediction)
  - Performance strongly correlates mainly to accuracy
  - Does not depend on environment
  - Bounded feedback delay



## What About Large Pages?

- Pythia's framework can be easily extended to incorporate additional prefetch actions (i.e., possible prefetch offsets for the page size)
- To decrease the storage overhead
  - Prune action space via automatic design-space exploration
  - Hash action values to retrieve Q-values





# What is the Prefetch Degree? Is It Managed by the RL Agent?

- Pythia employs a simple degree selector, separate from the RL agent
  - If the agent has selected the same prefetch action (O) multiple times in a row, Pythia increases the degree (A+2O, A+3O, ...)
  - At most degree 4
- Future works on managing degree by the RL agent





# Can the Customization Be Done While the Workload is Running?

- Certainly.
- Pythia, being an **online learning** technique, will autonomously adapt (and optimize) its policy to use the new program features or the modified reward values



## **Can Runtime Workload Mix Create an Issue?**

- We implement the bandwidth usage feedback using a counter in the memory controller. Thus Pythia already has a global view of the memory bandwidth usage that incorporates all workloads running on a multi-core system
- We evaluate a diverse set (300 of each category) of fourcore, eight-core, twelve-core random workload mixes
- Based on our evaluation, we observe that Pythia dynamically adapts itself to varying workload demands





# How does Pythia Compare Against Other Adaptive Prefetching Solutions?

- We compare Pythia against IBM POWER7<sup>[5]</sup> prefetcher
  - Adaptively selects prefetcher degree/configuration by monitoring program IPC







# How Does Pythia Compare Against the Context Prefetcher?

- Pythia widely differs from the Context Prefetcher (CP)<sup>[6]</sup> in all three aspects: state, action, and reward. The key differences are:
  - CP does not consider system-level feedback
  - CP models the agent as a contextual bandit which takes myopic prefetch decisions as compared to Pythia
  - CP requires compiler support to extract software-level features



## Pythia outperforms CP-HW by 5.3% in single-core and 7.6% in four-core system

#### SAFARI

[6] Leeor et al., ISCA'15



### How Pythia's Performance Changes With Various State Definitions You Have Swept?

• In total we evaluate state defined as any-one, any-two, and any-three combinations of 32 features



**Performance gain ranges from 20.7% to 22.4%** 

**Coverage ranges from 66.2% to 71.5%** 

**Overprediction** ranges from 26.7% to 32.2%

## Is Pythia Sensitive to Hyperparameter?

 Not setting hyperparameters can significantly impact the overall performance improvement



Changing  $\varepsilon$  from 0.002 to 1.0 drops perf. by 16%

Changing  $\alpha$  from 0.0065 to 1.0 drops perf. by 5.4%



### How Does Pythia Compare Against Commercial Multi-level Prefetchers?



Pythia outperforms IPCP [7] by 14.2% on average in 150-MTPS

DRAM MTPS (in log scale)

#### SAFARI

[6] Prakalapati et al., ISCA'20



### **Does Pythia Perform Equally Well for Unseen Workloads also?**

- Evaluated with 500 traces from value prediction championship
  - No prefetcher has been trained on these traces



Pythia outperforms MLOP and Bingo by 8.3% and 3.5% in single-core

And 9.7% and 5.4% in four-core





## **Basic Pythia Configuration**

#### Table 2: Basic Pythia configuration derived from our automated design-space exploration

| Features                    | PC+Delta,Sequence of last-4 deltas                      |  |
|-----------------------------|---------------------------------------------------------|--|
| <b>Prefetch Action List</b> | {-6,-3,-1,0,1,3,4,5,10,11,12,16,22,23,30,32}            |  |
| Reward Level Values         | $\begin{array}{cccccccccccccccccccccccccccccccccccc$    |  |
| Hyperparameters             | ers $\alpha = 0.0065, \gamma = 0.556, \epsilon = 0.002$ |  |



## **System Parameters**

#### **Table 5: Simulated system parameters**

| Core         | 1-12 cores, 4-wide OoO, 256-entry ROB, 72/56-entry LQ/SQ                                                                                                                                                                                     |  |  |
|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| Branch Pred. | Perceptron-based [69], 20-cycle misprediction penalty                                                                                                                                                                                        |  |  |
| L1/L2        | Private, 32KB/256KB, 64B line, 8 way, LRU, 16/32 MSHRs, 4-                                                                                                                                                                                   |  |  |
| Caches       | cycle/14-cycle round-trip latency                                                                                                                                                                                                            |  |  |
| LLC          | 2MB/core, 64B line, 16 way, SHiP [133], 64 MSHRs per LLC Bank,                                                                                                                                                                               |  |  |
|              | 34-cycle round-trip latency                                                                                                                                                                                                                  |  |  |
| Main Memory  | <ul> <li>1C: Single channel, 1 rank/channel; 4C: Dual channel, 2 ranks/channel; 8C: Quad channel, 2 ranks/channel;</li> <li>8 banks/rank, 2400 MTPS, 64b data-bus/channel, 2KB row buffer-/bank, tRCD=15ns, tRP=15ns, tCAS=12.5ns</li> </ul> |  |  |



## **Configuration of Prefetchers**

#### **Table 7: Configuration of evaluated prefetchers**

| <b>SPP</b> [78]   | 256-entry ST, 512-entry 4-way PT, 8-entry GHR | 6.2 KB  |
|-------------------|-----------------------------------------------|---------|
| <b>Bingo</b> [27] | 2KB region, 64/128/4K-entry FT/AT/PHT         | 46 KB   |
| <b>MLOP</b> [111] | 128-entry AMT, 500-update, 16-degree          | 8 KB    |
| DSPatch [30]      | Same configuration as in [30]                 | 3.6 KB  |
| <b>PPF</b> [32]   | Same configuration as in [32]                 | 39.3 KB |
| Pythia            | 2 features, 2 vaults, 3 planes, 16 actions    | 25.5 KB |



### **Evaluated Workloads**

#### Table 6: Workloads used for evaluation

| Suite      | # Workloads | # Traces | Example Workloads            |
|------------|-------------|----------|------------------------------|
| SPEC06     | 16          | 28       | gcc, mcf, cactusADM, lbm,    |
| SPEC17     | 12          | 18       | gcc, mcf, pop2, fotonik3d,   |
| PARSEC     | 5           | 11       | canneal, facesim, raytrace,  |
| Ligra      | 13          | 40       | BFS, PageRank, Bellman-ford, |
| Cloudsuite | 4           | 53       | cassandra, cloud9, nutch,    |



## **List of Evaluated Features**

#### Table 3: List of program control-flow and data-flow components used to derive the list of features for exploration

| <b>Control-flow Component</b>                                                                                          | Data-flow Component                                                                                                                                                                                                                     |
|------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ol> <li>PC of load request</li> <li>PC-path (XOR-ed last-3 PCs)</li> <li>PC XOR-ed branch-PC</li> <li>None</li> </ol> | <ol> <li>Load cacheline address</li> <li>Page number</li> <li>Page offset</li> <li>Load address delta</li> <li>Sequence of last-4 offsets</li> <li>Sequence of last-4 deltas</li> <li>Offset XOR-ed with delta</li> <li>None</li> </ol> |





# **MORE RESULTS**

## **Performance S-curve: Single-core**







### **Performance S-curve: Four-core**







### **Single-core Coverage & Overprediction**



**4**3

# **Detailed Performance**



SAFARI

**a** 44

### **Benefit of Bandwidth Awareness**







### **Case Study**



Figure 13: Q-value curves of PC+Delta feature values (a) 0x436a81+0 and (b) 0x4377c5+0 in 459.GemsFDTD-1320B.



# **Customizing Rewards**



Figure 14: Performance and main memory bandwidth usage of prefetchers in Ligra-CC.



Figure 15: Performance of the basic and strict Pythia configurations on the Ligra workload suite.



## **Customizing Features**



Figure 16: Performance of the basic and feature-optimized Pythia on the SPEC CPU2006 suite.



# BACKUP

## **Executive Summary**

- Background: Prefetchers predict addresses of future memory requests by associating memory access patterns with program context (called feature)
- **Problem**: Three key shortcomings of prior prefetchers:
  - Predict mainly using a single program feature
  - Lack inherent system awareness (e.g., memory bandwidth usage)
  - Lack in-silicon customizability
- **Goal**: Design a prefetching framework that:
  - Learns from multiple features and inherent system-level feedback
  - Can be customized in silicon to use different features and/or prefetching objectives
- Contribution: Pythia, which formulates prefetching as reinforcement learning problem
  - Takes adaptive prefetch decisions using multiple features and system-level feedback
  - Can be **customized in silicon** for target workloads via simple configuration registers
  - Proposes a realistic and practical implementation of RL algorithm in hardware
- Key Results:

SAFARI

- Evaluated using a wide range of workloads from SPEC CPU, PARSEC, Ligra, Cloudsuite
- Outperforms best prefetcher (in 1-core config.) by **3.4%**, **7.7%** and **17%** in 1/4/bw-constrained cores
- Up to 7.8% more performance over basic Pythia across Ligra workloads via simple customization

#### https://github.com/CMU-SAFARI/Pythia

# **Talk Outline**

#### **Key Shortcomings of Prior Prefetchers**

### Formulating Prefetching as Reinforcement Learning

**Pythia: Overview** 

**Evaluation of Pythia and Key Results** 

Conclusion



# **Prefetching Basics**

- Predicts addresses of long-latency memory requests and fetches data before the program demands it
- Associates access patterns from past memory requests with program context information

#### **Program Feature** → Access Pattern

#### • Example program features

- Program counter (PC)
- Page number
- Page offset
- Cacheline delta
- ...
- Or a combination of these attributes

## **Key Shortcomings in Prior Prefetchers**

 We observe three key shortcomings that significantly limit performance benefits of prior prefetchers





## (1) Single-Feature Prefetch Prediction

 Provides good performance gains mainly on workloads where the feature-to-pattern correlation exists



## (1) Single-Feature Prefetch Prediction

 Provides good performance gains mainly on workloads where the feature-to-pattern correlation exists



# (2) Lack of Inherent System Awareness

- Little understanding of **undesirable effects** (e.g., memory bandwidth usage, cache pollution, ...)
  - Performance loss in **resource-constrained** configurations



Similar coverage

Lower overpredictions

Yet, lower performance

# (2) Lack of Inherent System Awareness

- Little understanding of **undesirable effects** (e.g., memory bandwidth usage, cache pollution, ...)
  - Performance loss in **resource-constrained** configurations



# (3) Lack of In-silicon Customizability

• Feature **statically** selected at design time

SΔFΔ

- **Rigid hardware** designed specifically to exploit that feature
- No way to change program feature and/or change prefetcher's objective in silicon
  - Cannot adapt to a wide range of workload demands



## **Our Goal**

### A prefetching framework that can:

1.Learn to prefetch using multiple features and inherent system-level feedback information

2.Be **easily customized in silicon** to use different features and/or change prefetcher's objectives

## **Our Proposal**



# Pythia

# Formulates prefetching as a reinforcement learning problem



Pythia is named after the oracle of Delphi, who is known for her accurate prophecies https://en.wikipedia.org/wiki/Pythia

# **Talk Outline**

**Key Shortcomings of Prior Prefetchers** 

### Formulating Prefetching as Reinforcement Learning

**Pythia: Overview** 

### **Evaluation of Pythia and Key Results**

Conclusion



# **Basics of Reinforcement Learning (RL)**

 Algorithmic approach to learn to take an action in a given situation to maximize a numerical reward



Environment

- Agent stores Q-values for every state-action pair
  - Expected return for taking an action in a state

- Given a state, selects action that provides highest Q-value SAFARI

### **Formulating Prefetching as RL**

# What is State?

k-dimensional vector of features

 $S \equiv \{\phi_S^1, \phi_S^2, \dots, \phi_S^k\}$ 

• Feature = control-flow + data-flow

#### Control-flow examples

- PC
- Branch PC
- Last-3 PCs, ...

#### Data-flow examples

- Cacheline address
- Physical page number
- Delta between two cacheline addresses
- Last 4 deltas, ...



### What is State?



# What is Action?

Given a demand access to address A the action is to select prefetch offset "O"

- Action-space: 127 actions in the range [-63, +63]
  - For a machine with 4KB page and 64B cacheline
- Upper and lower limits ensure prefetches do not cross physical page boundary
- A zero offset means no prefetch is generated
- We further **prune** action-space by design-space exploration

#### SAFARI

Prefetcher

Reward

Prefetch from addres

A+offset (0)

Features of memory

request to address A

(e.g., PC)

# What is Reward?

- Defines the **objective** of Pythia
- Encapsulates two metrics:

- Features of memory request to address A (e.g., PC) Processor & Memory Subsystem
- Prefetch usefulness (e.g., accurate, late, out-of-page, ...)
- **System-level feedback** (e.g., mem. b/w usage, cache pollution, energy, ...)
- We demonstrate Pythia with memory bandwidth usage as the system-level feedback in the paper

# What is Reward?

#### Seven distinct reward levels

- Accurate and timely (R<sub>AT</sub>)
- Accurate but late (R<sub>AL</sub>)
- Loss of coverage (R<sub>CL</sub>)
- Inaccurate
  - With low memory b/w usage (R<sub>IN</sub>-L)
  - With high memory b/w usage (R<sub>IN</sub>-H)
- No-prefetch
  - With low memory b/w usage (R<sub>NP</sub>-L)
  - With high memory b/w usage(R<sub>NP</sub>-H)
- Values are set at design time via automatic designspace exploration

- Can be **customized** further in silicon for higher performance **SAFARI** 



### **Steering Pythia's Objective via Reward Values**

- Example reward configuration for
  - Generating accurate prefetches
  - Making bandwidth-aware prefetch decisions



AT = Accurate & timely; AL = Accurate & late; NP = No-prefetching; IN = Inaccurate;H = High mem. b/w; L = Low mem. b/w

Highly prefers to generate accurate prefetches

Prefers not to prefetch if memory bandwidth usage is low

Strongly prefers not to prefetch if memory bandwidth usage is high



### **Steering Pythia's Objective via Reward Values**

 Customizing reward values to make Pythia conservative towards prefetching



AT = Accurate & timely; AL = Accurate & late; NP = No-prefetching; IN = Inaccurate; H = High mem. b/w; L = Low mem. b/w

Highly prefers to generate accurate prefetches

**Otherwise prefers not to prefetch** 





### **Steering Pythia's Objective via Reward Values**

Customizing reward values to make Dythis concernative towards p Strict Pythia configuration





# **Talk Outline**

**Key Shortcomings of Prior Prefetchers** 

### Formulating Prefetching as Reinforcement Learning

**Pythia: Overview** 

### **Evaluation of Pythia and Key Results**

#### Conclusion



# **Pythia Overview**

- **Q-Value Store**: Records Q-values for *all* state-action pairs
- Evaluation Queue: A FIFO queue of recently-taken actions



### **Architecting QVStore**



## **Architecting QVStore**



- A monolithic two-dimensional table?
  - Indexed by state and action values
- State-space increases **exponentially** with #bits



- We partition QVStore into k vaults [k = number of features in state]
  - Each vault corresponds to one feature and stores the Qvalues of feature-action pairs



### To retrieve Q(S,A) for each action

- Query each vault in parallel with feature and action
- Retrieve feature-action
   Q-value from each vault
- Compute MAX of all feature-action Q-values

MAX ensures the Q(S,A) is driven by the constituent feature that has highest Q( $\phi$ ,A)

- We further partition each vault into multiple planes
  - Each plane stores a partial Q-value of a feature-action pair

### To retrieve Q(φ,A) for each action

- Query each plane in parallel with hashed feature and action
- Retrieve partial featureaction Q-value from each plane
- Compute SUM of all parital feature-action Q-values







We further partition each vault into multiple planes
 Each plane stores a partial Q-value of a feature-action pair

**1. Enables sharing of partial Q-values between similar feature values, shortens prefetcher training time** 

parallel with hashed feature and action

2. Reduces chances of sharing partial Q-values across widely different feature values

feature-action Q-values

# **More in the Paper**

- Pipelined search operation for QVStore
- Reward assignment and **QVStore update**
- Automatic design-space exploration
  - Feature types
  - Action
  - Reward and Hyperparameter values



# **More in the Paper**

• Pipelined search operation for QVStore

#### Reward assignment and OVStore undate

#### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Anant V. Nori<sup>2</sup> Taha Shahroodi<sup>3,1</sup> Sreenivas Subramoney<sup>2</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

- Reward a <a href="https://arxiv.org/pdf/2109.12021.pdf">https://arxiv.org/pdf/2109.12021.pdf</a>

## **Talk Outline**

**Key Shortcomings of Prior Prefetchers** 

### Formulating Prefetching as Reinforcement Learning

**Pythia: Overview** 

**Evaluation of Pythia and Key Results** 

Conclusion



# **Simulation Methodology**

- Champsim [3] trace-driven simulator
- **150** single-core memory-intensive workload traces
  - SPEC CPU2006 and CPU2017
  - PARSEC 2.1
  - Ligra
  - Cloudsuite
- Homogeneous and heterogeneous multi-core mixes

### • Five state-of-the-art prefetchers

- SPP [Kim+, MICRO'16]
- Bingo [Bakhshalipour+, HPCA'19]
- MLOP [Shakerinava+, 3<sup>rd</sup> Prefetching Championship, 2019]
- SPP+DSPatch [Bera+, MICRO'19]
- SPP+PPF [Bhatia+, ISCA'20]

# **Basic Pythia Configuration**

• Derived from automatic design-space exploration

### • State: 2 features

- PC+Delta
- Sequence of last-4 deltas

### • Actions: 16 prefetch offsets

- Ranging between -6 to +32. Including 0.

### • Rewards:

- $R_{AT} = +20$ ;  $R_{AL} = +12$ ;  $R_{NP}$ -H=-2;  $R_{NP}$ -L=-4;
- $R_{IN}$ -H=-14;  $R_{IN}$ -L=-8;  $R_{CL}$ =-12



### **Performance with Varying Core Count**



### **Performance with Varying Core Count**



### **Performance with Varying DRAM Bandwidth**



### **Performance with Varying DRAM Bandwidth**



### Pythia outperforms prior best prefetchers for a wide range of DRAM bandwidth configurations



### **Performance Improvement via Customization**

- Reward value customization
- Strict Pythia configuration
  - Increasing the rewards for no prefetching
  - **Decreasing** the rewards for **inaccurate prefetching**



- Strict Pythia is more conservative in generating prefetch requests than the basic Pythia
- Evaluate on all Ligra graph processing workloads

### **Performance Improvement via Customization**



### **Performance Improvement via Customization**



# Pythia can extract even higher performance via customization without changing hardware



# **Pythia's Overhead**

### • 25.5 KB of total metadata storage per core

- Only simple tables
- We also model functionally-accurate Pythia with full complexity in Chisel [4] HDL



of a desktop-class 4-core Skylake processor (Xeon D2132IT, 60W)



# More in the Paper

- Performance comparison with **unseen traces** 
  - Pythia provides equally high performance benefits
- Comparison against multi-level prefetchers
  - Pythia outperforms prior best multi-level prefetchers
- Understanding Pythia's learning with a case study
  - We reason towards the correctness of Pythia's decision
- Performance sensitivity towards different features and hyperparameter values
- Detailed single-core and four-core performance

# **More in the Paper**

Performance comparison with unseen traces
 Pythia provides equally high performance benefits

#### Comparison against multi-level prefetchers

#### Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Anant V. Nori<sup>2</sup> Taha Shahroodi<sup>3,1</sup> Sreenivas Subramoney<sup>2</sup> Onur Mutlu<sup>1</sup> <sup>1</sup>ETH Zürich <sup>2</sup>Processor Architecture Research Labs, Intel Labs <sup>3</sup>TU Delft

 Performance sensitivity towards unterent features and hyperparameter values

Detailed single-core and four-core performance

# **Pythia is Open Source**



95

### https://github.com/CMU-SAFARI/Pythia

- MICRO'21 artifact evaluated
- Champsim source code + Chisel modeling code
- All traces used for evaluation

| ₽ C | CMU-SAFARI / Pythia Public          |                                                 | ⊙ Unw                              | atch • 3 $\swarrow$ Star 9 ${0}{0}{0}{0}{0}{0}{0}{0}{0}{0}{0}{0}{0$                                               |  |
|-----|-------------------------------------|-------------------------------------------------|------------------------------------|-------------------------------------------------------------------------------------------------------------------|--|
| <>  | Code 🕑 Issues 🖧 Pull reques         | sts 🕑 Actions 🔟 Projects 🖽 Wiki                 | ① Security 🗠 Insights 🕸 S          | ettings                                                                                                           |  |
| ٢   | master - 🖓 1 branch 📎 5 tags        |                                                 | Go to file Add file - Code -       | About දි                                                                                                          |  |
|     | rahulbera Github pages documentatic | n v                                             | / dlefc65 7 hours ago 🕚 40 commits | A customizable hardware prefetching<br>framework using online reinforcement<br>learning as described in the MICRO |  |
|     | branch                              | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | 2021 paper by Bera and                                                                                            |  |
|     | config                              | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | Kanellopoulos et al.                                                                                              |  |
|     | docs                                | Github pages documentation                      | 7 hours ago                        |                                                                                                                   |  |
|     | experiments                         | Added chart visualization in Excel template     | 2 months ago                       | machine-learning                                                                                                  |  |
|     | inc                                 | Updated README                                  | 8 days ago                         | reinforcement-learning<br>computer-architecture prefetcher                                                        |  |
|     | prefetcher                          | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | microarchitecture cache-replacement                                                                               |  |
|     | replacement                         | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | branch-predictor champsim-simulator                                                                               |  |
|     | scripts                             | Added md5 checksum for all artifact traces to v | erify download 2 months ago        | champsim-tracer                                                                                                   |  |
|     | src                                 | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | 🛱 Readme                                                                                                          |  |
|     | tracer                              | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | む View license                                                                                                    |  |
| ß   | .gitignore                          | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | Ç3 Cite this repository -                                                                                         |  |
| ۵   | CITATION.cff                        | Added citation file                             | 8 days ago                         |                                                                                                                   |  |
| ۵   | LICENSE                             | Updated LICENSE                                 | 2 months ago                       | Releases 5                                                                                                        |  |
| ۵   | LICENSE.champsim                    | Initial commit for MICRO'21 artifact evaluation | 2 months ago                       | V1.3 Latest                                                                                                       |  |

## **Talk Outline**

**Key Shortcomings of Prior Prefetchers** 

### Formulating Prefetching as Reinforcement Learning

**Pythia: Overview** 

### **Evaluation of Pythia and Key Results**

### Conclusion



### **Executive Summary**

- Background: Prefetchers predict addresses of future memory requests by associating memory access patterns with program context (called feature)
- **Problem**: Three key shortcomings of prior prefetchers:
  - Predict mainly using a single program feature
  - Lack inherent system awareness (e.g., memory bandwidth usage)
  - Lack in-silicon customizability
- Goal: Design a prefetching framework that:
  - Learns from multiple features and inherent system-level feedback
  - Can be customized in silicon to use different features and/or prefetching objectives
- Contribution: Pythia, which formulates prefetching as reinforcement learning problem
  - Takes adaptive prefetch decisions using multiple features and system-level feedback
  - Can be **customized in silicon** for target workloads via simple configuration registers
  - Proposes a realistic and practical implementation of RL algorithm in hardware
- Key Results:

SAFARI

- Evaluated using a wide range of workloads from SPEC CPU, PARSEC, Ligra, Cloudsuite
- Outperforms best prefetcher (in 1-core config.) by **3.4%**, **7.7%** and **17%** in 1/4/bw-constrained cores
- Up to 7.8% more performance over basic Pythia across Ligra workloads via simple customization

#### https://github.com/CMU-SAFARI/Pythia

# **Reward Assignment to EQ Entry**

- Every action gets inserted into EQ
- Reward is assigned to each EQ entry **before or during** the eviction
- **During EQ insertion**: for actions
  - Not to prefetch
  - Out-of-page prefetch



# **Reward Assignment to EQ Entry**

- Every action gets inserted into EQ
- Reward is assigned to each EQ entry **before or during** the eviction
- **During EQ insertion**: for actions
  - Not to prefetch
  - Out-of-page prefetch
- During EQ residency:





# **Reward Assignment to EQ Entry**

- Every action gets inserted into EQ
- Reward is assigned to each EQ entry **before or during** the eviction
- During EQ insertion: for actions
  - Not to prefetch
  - Out-of-page prefetch
- During EQ residency:
  - In case address of a demand matches with address in EQ (signifies accurate prefetch)
- During EQ eviction:
  - In case no reward is assigned till eviction (signifies inaccurate prefetch)

