







# HERMES

### **Accelerating Long-Latency Load Requests** via Perceptron-Based Off-Chip Load Prediction

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

https://github.com/CMU-SAFARI/Hermes









### The Key Problem

Long-latency off-chip load requests

- Often **stall** the processor by **blocking instruction retirement** from the Reorder Buffer (ROB)
- Limit performance



#### **Traditional Solutions**

Employ sophisticated prefetchers

Increase the size of on-chip caches



#### Many loads still go off-chip

[Figure label: number of off-chip loads without any prefetcher]

**50%** still go off-chip

#### On-chip cache access latency

significantly contributes to off-chip load latency

[Figure labels: L1, L2, LLC, Main Memory]

40% of the stalls can be eliminated by removing on-chip cache access latency from the critical path

# Caches are Getting Bigger and Slower...



Hardavellas+, "Database Servers on Chip Multiprocessors: Limitations and Opportunities", CIDR, 2007






#### **Our Goal**

Improve processor performance by removing on-chip cache access latency from the critical path of off-chip loads











Predicts which load requests are likely to go off-chip

Starts fetching data directly from main memory while concurrently accessing the cache hierarchy

# **Key Contribution**

Hermes employs **the first** perceptron-based off-chip load predictor

That predicts which loads are likely to go off-chip

By **learning** from multiple program context information



#### **Hermes Overview**

[Figure: Hermes overview. POPET, a perceptron-based off-chip load predictor, is attached to the core. Numbered steps: (1) Predict, (2) Issue a Hermes request, (3) Wait, (4) Train. Components shown: Core, L1-D, L2, LLC, memory controller (MC), and off-chip main memory. Timeline annotations: "Latency tolerance limit of ROB", "Processor is stalled", "Saved stall cycles".]

# Designing the Off-Chip Load Predictor

### **History-based prediction**

HMP [Yoaz+, ISCA'99] for the L1-D cache

Using branch-predictor-like hybrid predictor: Global, Gshare, and GSkew

### Tracking cache contents

MissMap [Loh+, MICRO'11] for the **DRAM cache**, D2D [Sembrant+, ISCA'14], D2M [Sembrant+, HPCA'17], LP [Jalili+, HPCA'22] for the **cache hierarchy**

- **Large** metadata
  - Metadata size increases with cache hierarchy size
- May need to track **all** cache operations
  - Gets complex depending on the cache hierarchy configuration (e.g., inclusivity, bypassing, ...)

### Learning from program behavior

Correlate different program features with off-chip loads

- Low storage overhead
- Low design complexity

POPET provides both higher accuracy and higher performance than predictors inspired by these previous works



### **POPET:** Perceptron-Based Off-Chip Predictor

- Multi-feature hashed perceptron model<sup>[1]</sup>
  - Each feature has its own weight table
    - Stores correlation between feature value and off-chip prediction

Prediction (step 1, Predict) uses simple table lookups, addition, and comparison, in three stages:

- Stage 1: Extract features from the load request (e.g., PC + offset, such as 0x7ffe0+12)
- Stage 2: Hash each feature to index its weight table and read the weight
- Stage 3: Sum the weights and apply the activation: if the cumulative weight $\geq \tau_{act}$, predict that the load would go off-chip

[Figure: POPET prediction datapath with Feature<sub>1</sub> ... Feature<sub>N</sub>, per-feature hash, Weight Table<sub>1</sub> ... Weight Table<sub>N</sub>, weight summation, and activation against $\tau_{act}$]



Training (step 4, Train) uses simple increment or decrement of feature weights

[Figure: POPET training datapath with Feature<sub>1</sub> ... Feature<sub>N</sub>, per-feature hash, and Weight Table<sub>1</sub> ... Weight Table<sub>N</sub>. Example shown: the load shouldn't have been activated (the cumulative weight should be $< \tau_{act}$), so each indexed weight is adjusted by -1.]
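The prediction and training steps can be sketched in a few lines of C++. This is a minimal illustration, not the Hermes implementation: the number of features, table size, weight range, hash function, and the choice to update weights on every outcome are all assumptions made here for brevity.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sizes and thresholds (assumptions, not the paper's configuration).
constexpr std::size_t kNumFeatures = 2;
constexpr std::size_t kTableSize = 1024;
constexpr int kTauAct = -2;     // activation threshold
constexpr int kWeightMin = -16; // saturating weight range (assumed)
constexpr int kWeightMax = 15;

using Features = std::array<uint64_t, kNumFeatures>;

class HashedPerceptron {
public:
    // Predict off-chip when the cumulative weight reaches the activation threshold.
    bool predict(const Features& f) const { return sum(f) >= kTauAct; }

    // Increment every indexed weight if the load went off-chip, otherwise decrement.
    void train(const Features& f, bool went_offchip) {
        for (std::size_t i = 0; i < kNumFeatures; ++i) {
            int8_t& w = tables_[i][index(f[i])];
            int updated = w + (went_offchip ? 1 : -1);
            w = static_cast<int8_t>(std::max(kWeightMin, std::min(kWeightMax, updated)));
        }
    }

    // Table lookups plus addition: one weight per feature.
    int sum(const Features& f) const {
        int s = 0;
        for (std::size_t i = 0; i < kNumFeatures; ++i) s += tables_[i][index(f[i])];
        return s;
    }

private:
    // Hypothetical hash that folds a feature value into a table index.
    static std::size_t index(uint64_t v) {
        v ^= v >> 17;
        v *= 0x9E3779B97F4A7C15ULL;
        v ^= v >> 29;
        return static_cast<std::size_t>(v % kTableSize);
    }

    std::array<std::array<int8_t, kTableSize>, kNumFeatures> tables_{};
};
```

With all weights starting at zero, the sketch predicts off-chip until enough on-chip outcomes push the cumulative weight below `kTauAct`. Perceptron predictors commonly gate training on a misprediction or a low-confidence sum; that gating is omitted here.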



# Evaluation

- ChampSim trace-driven simulator
- 110 single-core memory-intensive traces
  - SPEC CPU 2006 and 2017
  - PARSEC 2.1
  - Ligra
  - Real-world applications
- 220 eight-core memory-intensive trace mixes

#### **LLC Prefetchers**

- Pythia [Bera+, MICRO'21]
- Bingo [Bakhshalipour+, HPCA'19]
- MLOP [Shakerinava+, 3rd Prefetching Championship'19]
- SPP + Perceptron filter [Bhatia+, ISCA'19]
- SMS [Somogyi+, ISCA'06]
### Off-Chip Predictors

- History-based: HMP [Yoaz+, ISCA'99]
- Tracking-based: Address Tag-Tracking based Predictor (TTP)
- Ideal Off-chip Predictor



### Cache round-trip latency

- L1-D: **5** cycles
- L2: **15** cycles
- LLC: **55** cycles

### Hermes request issue latency

(incurred after address translation)

Depends on the interconnect between POPET and MC

Range shown: 0 cycles to 24 cycles

















Hermes alone provides nearly 50% of the performance benefit of Pythia with only 1/5th of the storage overhead











Hermes on top of Pythia outperforms Pythia alone in every workload category






## Single-Core Performance Improvement








Hermes provides nearly 90% of the performance benefit of Ideal Hermes, which uses an ideal off-chip load predictor

































Hermes is more bandwidth-efficient than even an efficient prefetcher like Pythia















In bandwidth-constrained configurations, Hermes alone outperforms Pythia







Hermes+Pythia outperforms Pythia across all bandwidth configurations


### Performance with Varying Baseline Prefetcher










#### **Overhead of Hermes**



4 KB storage overhead



1.5% power overhead\*

\*On top of an Intel Alder Lake-like performance-core [2] configuration



## More in the Paper

- Performance sensitivity to:
  - Cache hierarchy access latency
  - Hermes request issue latency
  - Activation threshold
  - ROB size (in extended version on arXiv)
  - LLC size (in extended version on arXiv)
- Accuracy, coverage, and performance analysis against HMP and TTP
- Understanding usefulness of each program feature
- Effect on stall cycle reduction
- Performance analysis on an eight-core system








## Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Rahul Bera<sup>1</sup> Konstantinos Kanellopoulos<sup>1</sup> Shankar Balachandran<sup>2</sup> David Novo<sup>3</sup> Ataberk Olgun<sup>1</sup> Mohammad Sadrosadati<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Intel Processor Architecture Research Lab <sup>3</sup>LIRMM, Univ. Montpellier, CNRS

Long-latency load requests continue to limit the performance of modern high-performance processors. To increase the latency tolerance of a processor, architects have primarily relied on two key techniques: sophisticated data prefetchers and large on-chip caches. In this work, we show that: (1) even a sophisticated state-of-the-art prefetcher can only predict half of the off-chip load requests on average across a wide range of workloads, and (2) due to the increasing size and complexity of on-chip caches, a large fraction of the latency of an off-chip load request is spent accessing the on-chip cache hierarchy to solely determine that it needs to go off-chip.

The goal of this work is to accelerate off-chip load requests by removing the on-chip cache access latency from their critical path. To this end, we propose a new technique called Hermes, whose key idea is to: (1) accurately predict which load requests [...]

[...] off-chip main memory (i.e., an off-chip load) often stalls the processor core by blocking the instruction retirement from the reorder buffer (ROB), thus limiting the core's performance [88, 91, 92]. To increase the latency tolerance of a core, computer architects primarily rely on two key techniques. First, they employ increasingly sophisticated hardware prefetchers that can learn complex memory address patterns and fetch data required by future load requests before the core demands them [28, 32, 33, 35, 75]. Second, they significantly scale up the size of the on-chip cache hierarchy with each new generation of processors [10, 11, 16].

**Key problem.** Despite recent advances in processor core design, we observe two key trends in new processor designs that leave a significant opportunity for performance improvement on the table. First, even a sophisticated state-of-the-art [...]

https://arxiv.org/pdf/2209.00188.pdf



# To Summarize...

Hermes advocates for off-chip load prediction, a different form of speculation than load address prediction employed by prefetchers

Off-chip load prediction can be applied by itself or combined with load address prediction to provide performance improvement

Hermes employs the first perceptron-based off-chip load predictor

- High accuracy (77%)
- High coverage (74%)
- Low storage overhead (4KB/core)
- High performance improvement over best prior baseline (5.4%)
- High performance per bandwidth

## Hermes is Open Sourced

#### All workload traces

#### 13 prefetchers

- Streamer [Chen and Baer, IEEE TC'95]
- SMS [Somogyi+, ISCA'06]
- AMPM [Ishii+, ICS'09]
- Sandbox [Pugsley+, HPCA'14]
- BOP [Michaud, HPCA'16]
- SPP [Kim+, MICRO'16]
- Bingo [Bakhshalipour+, HPCA'19]
- SPP+PPF [Bhatia+, ISCA'19]
- DSPatch [Bera+, MICRO'19]
- MLOP [Shakerinava+, DPC-3'19]
- IPCP [Pakalapati+, ISCA'20]
- Pythia [Bera+, MICRO'21]

Repository description (truncated in the source): "[...] removing on-chip cache access latency from their critical path, as described by MICRO 2022 paper by Bera et al. (https://arxiv.org/pdf/2209.00188.pdf)"

All experiment files and rollup script

| Predictor type | Description                                                 |
|----------------|-------------------------------------------------------------|
| Base           | Always NO                                                   |
| Basic          | Simple confidence counter-based threshold                   |
| Random         | Random hit-miss predictor with a given positive probability |
| HMP-Local      | Hit-miss predictor [Yoaz+, ISCA'99] with local prediction   |
| HMP-GShare     | Hit-miss predictor with GShare prediction                   |
| HMP-GSkew      | Hit-miss predictor with GSkew prediction                    |
| HMP-Ensemble   | Hit-miss predictor with all three types combined            |
| TTP            | Tag-tracking based predictor                                |
| Perc           | Perceptron-based OCP used in this paper                     |


## Easy To Define Your Own Off-Chip Predictor

Just extend the OffchipPredBase class

```
#ifndef OFFCHIP_PRED_BASE_H // guard opening inferred from the #endif below
#define OFFCHIP_PRED_BASE_H

// string, ooo_model_instr, and LSQ_ENTRY come from headers not shown on the slide

class OffchipPredBase
{
public:
    uint32_t cpu;
    string type;
    uint64_t seed;
    uint8_t dram_bw; // current DRAM bandwidth bucket

    OffchipPredBase(uint32_t _cpu, string _type, uint64_t _seed) : cpu(_cpu), type(_type), seed(_seed)
    {
        srand(seed);
        dram_bw = 0;
    }
    ~OffchipPredBase() {}
    void update_dram_bw(uint8_t _dram_bw) { dram_bw = _dram_bw; }

    virtual void print_config();
    virtual void dump_stats();
    virtual void reset_stats();
    virtual void train(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry);
    virtual bool predict(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry);
};

#endif /* OFFCHIP_PRED_BASE_H */
```



## Easy To Define Your Own Off-Chip Predictor

Define your own train() and predict() functions

```
void OffchipPredBase::train(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry)
{
    // nothing to train
}

bool OffchipPredBase::predict(ooo_model_instr *arch_instr, uint32_t data_index, LSQ_ENTRY *lq_entry)
{
    // predict randomly
    // return (rand() % 2) ? true : false;
    return false;
}
```
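As an illustration of overriding `train()` and `predict()`, here is a hedged sketch of a custom predictor that keeps one 2-bit saturating counter per load PC. To keep it compilable on its own, the simulator types and the base class are replaced by reduced stand-ins; the field names `ip` and `went_offchip` and the class name `OffchipPredPcCounter` are assumptions for this example, not confirmed parts of the Hermes code base.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Stand-ins for the simulator types; field names are assumptions.
struct ooo_model_instr { uint64_t ip; };    // load PC
struct LSQ_ENTRY { bool went_offchip; };    // outcome known once the load completes

// Reduced stand-in for the base class interface.
class OffchipPredBase {
public:
    virtual ~OffchipPredBase() {}
    virtual void train(ooo_model_instr*, uint32_t, LSQ_ENTRY*) {}
    virtual bool predict(ooo_model_instr*, uint32_t, LSQ_ENTRY*) { return false; }
};

// Hypothetical predictor: one 2-bit saturating counter per load PC.
class OffchipPredPcCounter : public OffchipPredBase {
    std::unordered_map<uint64_t, int> counters_;

public:
    void train(ooo_model_instr* arch_instr, uint32_t, LSQ_ENTRY* lq_entry) override {
        int& c = counters_[arch_instr->ip];
        if (lq_entry->went_offchip) {
            if (c < 3) ++c;   // saturate at 3
        } else {
            if (c > 0) --c;   // saturate at 0
        }
    }

    bool predict(ooo_model_instr* arch_instr, uint32_t, LSQ_ENTRY*) override {
        auto it = counters_.find(arch_instr->ip);
        return it != counters_.end() && it->second >= 2;  // predict off-chip in the upper half
    }
};
```

A PC that has never been trained is predicted on-chip; two consecutive off-chip outcomes for the same PC flip the prediction.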

Get statistics like accuracy (stat name: precision) and coverage (stat name: recall) out of the box

```
Core_0_offchip_pred_true_pos 2358716
Core_0_offchip_pred_false_pos 276883
Core_0_offchip_pred_false_neg 132145
Core_0_offchip_pred_precision 89.49
Core_0_offchip_pred_recall 94.69
```
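The precision and recall values above follow from the three counters using the standard definitions. The sketch below recomputes them; the struct and function names are made up for this example, and only the counter values come from the output shown.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Counters as reported in the stats dump (names here are illustrative).
struct OcpStats {
    uint64_t true_pos;
    uint64_t false_pos;
    uint64_t false_neg;
};

// Accuracy (precision): fraction of predicted off-chip loads that went off-chip.
double precision_pct(const OcpStats& s) {
    uint64_t predicted = s.true_pos + s.false_pos;
    return predicted ? 100.0 * static_cast<double>(s.true_pos) / predicted : 0.0;
}

// Coverage (recall): fraction of actual off-chip loads that were predicted.
double recall_pct(const OcpStats& s) {
    uint64_t actual = s.true_pos + s.false_neg;
    return actual ? 100.0 * static_cast<double>(s.true_pos) / actual : 0.0;
}
```

Plugging in 2358716, 276883, and 132145 gives values that agree with the reported 89.49 and 94.69 to within rounding.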





**Prioritizing** loads that are likely to go off-chip in cache queues and on-chip network routing

**Better instruction scheduling** of data-dependent instructions

Other ideas to improve **performance** and **fairness** in multi-core system design...









## HERMES

## **Accelerating Long-Latency Load Requests** via Perceptron-Based Off-Chip Load Prediction

Rahul Bera, Konstantinos Kanellopoulos, Shankar Balachandran, David Novo, Ataberk Olgun, Mohammad Sadrosadati, Onur Mutlu

https://github.com/CMU-SAFARI/Hermes









#### **Discussion**

#### FAQs

- What are the selected set of program features?
- Can you provide some intuition on why these features work?
- What happens in case of a misprediction?
- What's the performance headroom for off-chip prediction?
- Do you see a variance of different features in final prediction accuracy?

#### Simulation Methodology

- System parameters
- Evaluated workloads

#### More Results

- Percentage of off-chip requests
- Reduction in stall cycles by reducing the critical path
- Fraction of off-chip load requests
- Accuracy and coverage of POPET
- Effect of different features
- Are all features required?
- 1C performance
- 1C performance line graph
- 1C performance against prior predictors
- Effect on stall cycles
- 8C performance
- Sensitivity:
  - Hermes request issue latency
  - Cache hierarchy access latency
  - Activation threshold
  - ROB size
  - LLC size
- Power overhead
- Accuracy without prefetcher
- Main memory request overhead with different prefetchers



## **BACKUP**

## **Initial Set of Program Features**

| Features without control-flow information | Features with control-flow information |
|-------------------------------------------|----------------------------------------|
| 1. Load virtual address                   | 8. Load PC                             |
| 2. Virtual page number                    | 9. PC $\oplus$ load virtual address    |
| 3. Cacheline offset in page               | 10. PC $\oplus$ virtual page number    |
| 4. First access                           | 11. PC $\oplus$ cacheline offset       |
| 5. Cacheline offset + first access        | 12. PC + first access                  |
| 6. Byte offset in cacheline               | 13. PC $\oplus$ byte offset            |
| 7. Word offset in cacheline               | 14. PC $\oplus$ word offset            |
|                                           | 15. Last-4 load PCs                    |
|                                           | 16. Last-4 PCs                         |

## Selected Set of Program Features

## **Five** features

- PC $\oplus$ cacheline offset
- PC $\oplus$ byte offset
- PC + first access
- Cacheline offset + first access
- Last-4 load PCs

First access: a binary hint that represents whether or not a cacheblock has been recently touched
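A rough sketch of how the five selected features could be derived from a load's PC and virtual address. The 64B cacheline matches the simulated system; the 4KB page size, the reading of "+" as bit concatenation, the way the last four load PCs are folded together, and all names below are assumptions for illustration. The first-access hint is taken as an input because the slides do not show how it is tracked.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kLineBits = 6;   // 64B cacheline
constexpr unsigned kPageBits = 12;  // 4KB page (assumption)

// Byte offset of the access within its cacheline.
inline uint64_t byte_offset(uint64_t vaddr) {
    return vaddr & ((1ULL << kLineBits) - 1);
}

// Index of the cacheline within its page.
inline uint64_t cacheline_offset(uint64_t vaddr) {
    return (vaddr >> kLineBits) & ((1ULL << (kPageBits - kLineBits)) - 1);
}

struct PopetFeatures {
    uint64_t pc_xor_cl_offset;
    uint64_t pc_xor_byte_offset;
    uint64_t pc_first_access;
    uint64_t cl_offset_first_access;
    uint64_t last4_load_pcs;
};

PopetFeatures extract_features(uint64_t pc, uint64_t vaddr, bool first_access,
                               const std::array<uint64_t, 4>& last_load_pcs) {
    PopetFeatures f{};
    f.pc_xor_cl_offset = pc ^ cacheline_offset(vaddr);
    f.pc_xor_byte_offset = pc ^ byte_offset(vaddr);
    // "+" is interpreted here as appending the one-bit hint to the value.
    f.pc_first_access = (pc << 1) | (first_access ? 1 : 0);
    f.cl_offset_first_access = (cacheline_offset(vaddr) << 1) | (first_access ? 1 : 0);
    // Fold the last four load PCs with shift-and-XOR (hypothetical folding).
    uint64_t h = 0;
    for (uint64_t p : last_load_pcs) h = (h << 1) ^ p;
    f.last4_load_pcs = h;
    return f;
}
```

Each resulting value would then be hashed to index its own weight table.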



Trace: 462.libquantum-1343B

PC: 0x401442

[Figure: accesses to Cacheline #42 and Cacheline #43]

#### Without prefetcher

- PC + first access
- Cacheline offset + first access

#### With a simple stride prefetcher

- Cacheline offset + first access



Two cases of mispredictions:

- Predicted on-chip but actually goes off-chip
  - Loss of performance improvement opportunity
  - No need for misprediction detection and recovery
- Predicted off-chip but actually is on-chip
  - Memory controller forwards the data to LLC if and only if a load to the same address has already missed LLC and arrived at the memory controller
  - No need for misprediction detection and recovery
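The forwarding rule for a load that was predicted off-chip can be sketched as simple bookkeeping in the memory controller. This is a simplified model under assumptions of my own: the class and method names are hypothetical, requests are tracked by address only, and data returned for a Hermes request with no waiting demand load is simply dropped.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

class MemCtrlSketch {
    std::unordered_set<uint64_t> hermes_inflight_;  // outstanding Hermes requests
    std::unordered_set<uint64_t> demand_waiting_;   // loads that missed LLC and reached the MC

public:
    // A Hermes request arrives directly from the predictor.
    void on_hermes_request(uint64_t addr) { hermes_inflight_.insert(addr); }

    // A regular load missed the LLC and arrived at the MC.
    // Returns true if it can wait on an in-flight Hermes request to the same address.
    bool on_demand_miss(uint64_t addr) {
        if (hermes_inflight_.count(addr) == 0) return false;  // needs its own DRAM access
        demand_waiting_.insert(addr);
        return true;
    }

    // DRAM returns data for a Hermes request.
    // Returns true if the data is forwarded to the LLC, false if it is dropped.
    bool on_hermes_data(uint64_t addr) {
        hermes_inflight_.erase(addr);
        return demand_waiting_.erase(addr) > 0;
    }
};
```

In the mispredicted case (the load hits on-chip), no demand miss ever reaches the controller, so `on_hermes_data` returns false and nothing is forwarded; no recovery is needed.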



## Performance Headroom of Off-Chip Prediction





## **System Parameters**

**Table 4: Simulated system parameters** 

| Component    | Configuration |
|--------------|---------------|
| Core         | 1 and 8 cores, 6-wide fetch/execute/commit, 512-entry ROB, 128/72-entry LQ/SQ, Perceptron branch predictor [61] with 17-cycle misprediction penalty |
| L1/L2 Caches | Private, 48KB/1.25MB, 64B line, 12/20-way, 16/48 MSHRs, LRU, 5/15-cycle round-trip latency [25] |
| LLC          | 3MB/core, 64B line, 12 way, 64 MSHRs/slice, SHiP [122], 55-cycle round-trip latency [24, 25], **Pythia** prefetcher [32] |
| Main Memory  | [...] per channel, 1 rank per channel; 8C: 4 channels [...] per channel; 8 banks per rank, DDR4-3200 MTPS, [...] bus per channel, 2KB row buffer per bank, tRCD [...] tRP=12.5ns, tCAS=12.5ns |
| Hermes       | Hermes-O/P: 6/18-cycle Hermes request issue latency |



## **Evaluated Workloads**

Table 5: Workloads used for evaluation

| Suite  | #Workloads | #Traces | Example Workloads                |
|--------|------------|---------|----------------------------------|
| SPEC06 | 14         | 22      | gcc, mcf, cactusADM, lbm,        |
| SPEC17 | 11         | 23      | gcc, mcf, pop2, fotonik3d,       |
| PARSEC | 4          | 12      | canneal, facesim, raytrace,      |
| Ligra  | 11         | 20      | BFS, PageRank, Radii,            |
| CVP    | 33         | 33      | integer, floating-point, server, |











Nearly 50% of the loads are still not prefetched









## **Observation: With Large Cache Comes Longer Latency**

 On-chip cache access latency significantly contributes to the latency of an off-chip load








#### What Fraction of Load Requests Goes Off-Chip?

















**Accuracy** 

























#### **Effect of Different Features**








Combination of features provides both higher accuracy and higher coverage than any individual feature



### Are All Features Required? (1)





No single feature individually provides the highest prediction accuracy across all workloads



## Are All Features Required? (2)







No single feature individually provides the highest prediction coverage across all workloads either





# Single-Core Performance








Hermes in combination with Pythia outperforms Pythia alone in every workload category



# Single-Core Performance Line Graph







### Single-Core Performance Against Prior Predictors





POPET provides higher performance benefit than prior predictors

Hermes with POPET achieves nearly 90% of the performance improvement of the Ideal Hermes

# **Effect on Stall Cycles**





Hermes reduces off-chip load induced stall cycles by 16.2% on average (up to 51.8%)

# **Eight-Core Performance**








Hermes in combination with Pythia outperforms Pythia alone by 5.1% on average



























Hermes in combination with Pythia outperforms Pythia alone even with a 24-cycle Hermes request issue latency

Hermes request issue latency (in processor cycles)







Hermes can provide even higher performance benefit in future processors with bigger and slower on-chip caches

[Figure x-axis: On-chip cache hierarchy access latency (in processor cycles)]





#### **Effect of Activation Threshold**








#### With increase in activation threshold

- Accuracy increases
- Coverage decreases
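The trade-off can be seen on a toy example: raising the threshold keeps only high-confidence predictions, so fewer of them are wrong but more off-chip loads go unpredicted. The sample data and names below are invented for illustration and are not taken from the evaluation.

```cpp
#include <cassert>
#include <vector>

struct Sample {
    int cumulative_weight;  // sum of the indexed feature weights
    bool went_offchip;      // ground truth for the load
};

struct Quality {
    double accuracy;  // correct off-chip predictions / all off-chip predictions
    double coverage;  // correct off-chip predictions / all off-chip loads
};

// Evaluate the predictor output for a given activation threshold.
Quality evaluate(const std::vector<Sample>& samples, int tau_act) {
    int tp = 0, fp = 0, fn = 0;
    for (const Sample& s : samples) {
        bool predicted = s.cumulative_weight >= tau_act;
        if (predicted && s.went_offchip) ++tp;
        else if (predicted) ++fp;
        else if (s.went_offchip) ++fn;
    }
    Quality q{0.0, 0.0};
    if (tp + fp > 0) q.accuracy = static_cast<double>(tp) / (tp + fp);
    if (tp + fn > 0) q.coverage = static_cast<double>(tp) / (tp + fn);
    return q;
}
```

For the seven made-up samples used in the accompanying check, moving the threshold from -2 to 2 raises accuracy from 0.6 to 1.0 and lowers coverage from 0.75 to 0.5.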





#### **Power Overhead**







#### **Effect of ROB Size**





#### **Effect of LLC Size**





### **Accuracy and Coverage with Different Prefetchers**





POPET's accuracy and coverage increase significantly in the absence of a data prefetcher



# Increase in Main Memory Requests





