# Banshee: Bandwidth-Efficient DRAM Caching via Software/Hardware Cooperation

<u>Xiangyao Yu<sup>1</sup></u>, Christopher Hughes<sup>2</sup>, Nadathur Satish<sup>2</sup>, Onur Mutlu<sup>3</sup>, Srinivas Devadas<sup>1</sup>

<sup>1</sup>MIT <sup>2</sup>Intel Labs <sup>3</sup>ETH Zürich





## High-Bandwidth In-Package DRAM

- In-package DRAM has
  - 5X higher bandwidth than off-package DRAM
  - Similar latency to off-package DRAM
  - Limited capacity (up to 16 GB)
- In-package DRAM can be used as a cache



\* Numbers from Intel Knights Landing





## **Bandwidth Inefficiency in Existing DRAM Cache Designs**

- **Drawback 1**: Metadata traffic (e.g., tags, LRU bits, frequency counters, etc.)
- **Drawback 2**: Cache replacement traffic
  - Especially for coarse-granularity (e.g., page-granularity) DRAM cache designs





## **Banshee Improves DRAM Bandwidth Efficiency**

- Idea 1: Page-table-based contents tracking with efficient translation lookaside buffer (TLB) coherence
  - Track contents of DRAM cache using page tables and TLBs
  - Lightweight TLB coherence mechanism



- Idea 2: Bandwidth-aware frequency-based replacement (FBR) policy
  - Replacement traffic reduction: Limit the rate of DRAM cache replacement
  - Metadata traffic reduction: Access metadata for a sampled fraction of memory accesses



## Page-Table-Based DRAM Cache Contents Tracking

- Track DRAM cache contents using the virtual memory mechanism
- Advantage
  - Zero overhead for tag storage and lookup
- Disadvantage
  - **TLB** coherence overhead
  - Cache replacement overhead



\* Assuming 4-way set-associative DRAM cache
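The tracking idea above can be sketched as follows. This is a minimal illustration, assuming each page table entry carries a `cached` bit and a `way` field and that a page's set is chosen by a simple modulo of its page number. The names `PageTableEntry` and `route_request` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # 4 KB pages
NUM_WAYS = 4      # matches the 4-way set-associative example


@dataclass
class PageTableEntry:
    """Hypothetical PTE extended with DRAM cache mapping bits."""
    physical_page: int
    cached: bool = False  # is the page resident in in-package DRAM?
    way: int = 0          # which way of its set holds the page


def route_request(pte: PageTableEntry, num_sets: int) -> tuple[str, int]:
    """Pick the DRAM to access using only translation state (no tag read)."""
    if not pte.cached:
        return ("off-package", pte.physical_page * PAGE_SIZE)
    # Assumed mapping: the set index is the page number modulo the set count.
    set_index = pte.physical_page % num_sets
    frame = set_index * NUM_WAYS + pte.way
    return ("in-package", frame * PAGE_SIZE)
```

Because the mapping travels with the translation, a request can be routed without reading a tag from DRAM. The cost is that every remap must eventually reach the page table and all TLBs.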





## Idea 1: Efficient TLB Coherence

- Track DRAM cache contents using page tables and TLBs
- Maintain latest mapping for recently remapped pages in the Tag Buffer
- Enforce TLB coherence lazily when the Tag Buffer is full to amortize the cost





\* Assuming 4-way set-associative DRAM cache
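A rough sketch of the lazy coherence scheme, under stated assumptions: the `TagBuffer` class, its dictionary-based storage, and the flush-on-insert-when-full trigger are illustrative choices, not the hardware design. The TLB shootdown is only counted, not modeled.

```python
class TagBuffer:
    """Hypothetical model of a Tag Buffer holding recent page remappings."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.entries: dict[int, tuple[bool, int]] = {}  # page -> (cached, way)
        self.flushes = 0

    def lookup(self, page: int):
        """Return the latest mapping if the page was recently remapped, else None."""
        return self.entries.get(page)

    def record_remap(self, page: int, cached: bool, way: int, page_table: dict) -> None:
        """Record a remap; flush to the page table only when the buffer is full."""
        if page not in self.entries and len(self.entries) >= self.capacity:
            self.flush(page_table)
        self.entries[page] = (cached, way)

    def flush(self, page_table: dict) -> None:
        """Write all buffered mappings to the page table in one batch.

        In a real system this is where the TLB shootdown would happen;
        this sketch only counts how often it occurs.
        """
        page_table.update(self.entries)
        self.entries.clear()
        self.flushes += 1
```

With a capacity of 1024, one coherence event covers up to 1024 remaps, which is how the cost is amortized. Lookups that miss in the buffer fall back to the mapping already held in the TLB or page table.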

## Idea 2: Bandwidth-Aware Frequency-Based Replacement (FBR) Policy

- DRAM cache replacement incurs significant DRAM traffic
  - Cache replacement traffic
  - Metadata traffic (e.g., frequency counter lookups/updates)
- Limit cache replacement rate
  - Replace only when the incoming page's frequency counter exceeds the victim page's counter by a threshold
- Reduce metadata traffic
  - Access frequency counters for a randomly sampled fraction of memory accesses
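The two traffic-reduction rules can be sketched as below. This is an illustration only: `should_replace`, `SampledCounters`, the `sample_rate` parameter, and the seeded random generator are hypothetical names and choices, and the talk does not give concrete threshold or sampling values here.

```python
import random


def should_replace(incoming_count: int, victim_count: int, threshold: int) -> bool:
    """Replace only if the incoming page beats the victim by more than the threshold."""
    return incoming_count > victim_count + threshold


class SampledCounters:
    """Hypothetical frequency counters updated on a sampled fraction of accesses."""

    def __init__(self, sample_rate: float, seed: int = 0):
        self.sample_rate = sample_rate
        self.counts: dict[int, int] = {}
        self.metadata_accesses = 0
        self._rng = random.Random(seed)

    def on_access(self, page: int) -> None:
        # Unsampled accesses skip the counters, so they cost no metadata traffic.
        if self._rng.random() >= self.sample_rate:
            return
        self.metadata_accesses += 1
        self.counts[page] = self.counts.get(page, 0) + 1
```

The threshold keeps pages with similar counts from repeatedly evicting each other, and the sample rate scales metadata traffic down roughly in proportion.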



## **Banshee Extensions**

- Supporting large pages (e.g., 2 MB)
  - A large page is cached either in its entirety or not at all
- Supporting multi-socket processors
  - Coherent DRAM caches
  - Partitioned DRAM caches

## **Performance Evaluation**

- ZSim simulator <sup>[1]</sup>
- 16 cores (4-issue, out-of-order, 2.7 GHz)
- In-package DRAM (1 GB, 84 GB/s)
- Off-package DRAM (21 GB/s)
- Tag Buffer
  - One Tag Buffer per memory controller (MC)
  - 1024 entries, 5 KB in size

[1] Sanchez, Daniel, and Christos Kozyrakis. "ZSim: fast and accurate microarchitectural simulation of thousand-core systems." *ISCA*, 2013.
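A few derived quantities follow directly from the configuration above. The short calculation below assumes 1 KB = 1024 bytes, 1 GB = 2^30 bytes, and 4 KB pages. The variable names are illustrative.

```python
IN_PACKAGE_GBPS = 84
OFF_PACKAGE_GBPS = 21
IN_PACKAGE_BYTES = 1 << 30        # 1 GB in-package DRAM
PAGE_BYTES = 4 * 1024             # assumed 4 KB pages
TAG_BUFFER_ENTRIES = 1024
TAG_BUFFER_BYTES = 5 * 1024       # 5 KB per Tag Buffer

# In-package bandwidth relative to off-package bandwidth in this setup.
bandwidth_ratio = IN_PACKAGE_GBPS / OFF_PACKAGE_GBPS

# Number of 4 KB pages the DRAM cache can hold.
cache_pages = IN_PACKAGE_BYTES // PAGE_BYTES

# Storage available per Tag Buffer entry, in bits.
bits_per_entry = TAG_BUFFER_BYTES * 8 // TAG_BUFFER_ENTRIES
```

Under these assumptions the simulated in-package DRAM has 4 times the off-package bandwidth, holds 262,144 pages, and each Tag Buffer entry has 40 bits of storage.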


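## Speedup (Normalized to Off-Package DRAM Only)

[Figure: speedup of each design; one annotation marks the Perfect In-Package DRAM Cache]

Banshee improves performance by 15% over the best-previous (i.e., BEAR) latency-optimized DRAM cache design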



## **DRAM Bandwidth Efficiency**



- Banshee reduces in-package DRAM traffic by 36% over the best-previous design
- Banshee reduces off-package DRAM traffic by 3% over the best-previous design



## **Effect of Replacement Traffic Reduction**



Limiting replacement rate and sampling frequency counters are both important for bandwidth efficiency in Banshee

## More Analysis in the Paper

- Performance with large (2 MB) pages
- Balancing in- and off-package DRAM bandwidth
- Overhead for page table update and TLB coherence
- Storing tags in SRAM
- Sweep DRAM cache latency and bandwidth
- Sampling coefficient
- DRAM cache associativity

## Summary

- Need to optimize for bandwidth efficiency to fully exploit the performance of in-package DRAM
- Idea 1: Improving page-table-based DRAM cache designs with efficient Translation Lookaside Buffer (TLB) coherence
- Idea 2: Bandwidth-aware frequency-based replacement (FBR) policy
- Banshee improves performance by 15% and reduces in-package DRAM traffic by 36% over the best-previous latency-optimized DRAM cache design




## **Backup Slides**

## Summary of Operational Characteristics of Different State-of-the-Art DRAM Cache Designs

| Scheme | DRAM Cache Hit | DRAM Cache Miss | Replacement Traffic | Replacement Decision | Large Page Caching |
|--------|----------------|-----------------|---------------------|----------------------|--------------------|
| Unison [32] | In-package traffic: 128 B (data + tag read and update); Latency: ~1x | In-package traffic: 96 B (spec. data + tag read); Latency: ~2x | On every miss; Footprint size [31] | Hardware managed, set-associative, LRU | Yes |
| Alloy [50] | In-package traffic: 96 B (data + tag read); Latency: ~1x | In-package traffic: 96 B (spec. data + tag read); Latency: ~2x | On some misses; Cacheline size (64 B) | Hardware managed, direct-mapped, stochastic [20] | Yes |
| TDC [38] | In-package traffic: 64 B; Latency: ~1x; TLB coherence | In-package traffic: 0 B; Latency: ~1x; TLB coherence | On every miss; Footprint size [28] | Hardware managed, fully-associative, FIFO | No |
| HMA [44] | In-package traffic: 64 B; Latency: ~1x | In-package traffic: 0 B; Latency: ~1x | Software managed, high replacement cost (applies to both replacement columns) | | |
| Banshee (This work) | In-package traffic: 64 B; Latency: ~1x | In-package traffic: 0 B; Latency: ~1x | Only for hot pages; Page size (4 KB) | Hardware managed, set-associative, frequency based | Yes |
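To make the hit and miss columns concrete, the sketch below computes the expected in-package traffic per access for a given DRAM cache hit rate, using the byte counts from the table. It deliberately ignores replacement traffic, and the function name and dictionary layout are illustrative assumptions.

```python
# (hit_bytes, miss_bytes) of in-package traffic per access, taken from the table.
TRAFFIC_BYTES = {
    "Unison": (128, 96),
    "Alloy": (96, 96),
    "TDC": (64, 0),
    "HMA": (64, 0),
    "Banshee": (64, 0),
}


def expected_traffic(scheme: str, hit_rate: float) -> float:
    """Expected in-package bytes per access, excluding replacement traffic."""
    hit_bytes, miss_bytes = TRAFFIC_BYTES[scheme]
    return hit_rate * hit_bytes + (1.0 - hit_rate) * miss_bytes
```

At a 50% hit rate this gives 112 B per access for Unison, 96 B for Alloy, and 32 B for the page-table-based designs, which shows why avoiding tag reads matters most when misses are common.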



## Tag Buffer Organization

Physical Page Number (48 - 12 = 36 bits)

[Figure: Tag Buffer rows, each holding (Tag, Value) entries]

**Figure 2: Tag Buffer Organization** – The Tag Buffer is organized as a set-associative cache. The DRAM cache is 4-way set-associative in this example.
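One way to picture the set-associative organization is as a split of the 36-bit physical page number into a set index and a tag. The sketch below assumes a modulo-based index and leaves the number of sets as a parameter, since the slide does not state the Tag Buffer's associativity. The function names are hypothetical.

```python
PAGE_OFFSET_BITS = 12      # 4 KB pages
PHYSICAL_ADDR_BITS = 48    # gives a 48 - 12 = 36-bit physical page number


def physical_page_number(paddr: int) -> int:
    """Drop the page offset from a 48-bit physical address."""
    return (paddr & ((1 << PHYSICAL_ADDR_BITS) - 1)) >> PAGE_OFFSET_BITS


def split_ppn(ppn: int, num_sets: int) -> tuple[int, int]:
    """Split a page number into (set_index, tag) for a set-associative buffer."""
    return ppn % num_sets, ppn // num_sets
```

For example, address `0x12345678` has page number `0x12345`, and with 128 sets (an arbitrary choice) it falls in set `0x45` with tag `0x246`.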



## **DRAM Cache Layout**

[Figure: DRAM cache layout within in-package DRAM]



## **Speedup Normalized to NoCache**



Figure 4: Speedup Normalized to NoCache – Speedup is shown as bars and misses per kilo-instruction (MPKI) are shown as red dots.



## In-Package DRAM Traffic Breakdown



## **Off-Package DRAM Traffic**



## Sensitivity to Page Table Update Cost

Table 5: Sensitivity to Page Table Update Cost

| Update Cost (µs) | Avg Perf. Loss | Max Perf. Loss |
|------------------|----------------|----------------|
| 10               | 0.11%          | 0.76%          |
| 20               | 0.18%          | 1.3%           |
| 40               | 0.31%          | 2.4%           |

## Sensitivity to DRAM Cache Latency and Bandwidth



Figure 8: Sensitivity to DRAM Cache Latency and Bandwidth – Each data point is the geometric mean over all benchmarks. The default parameter setting is highlighted on the x-axis. Latency and bandwidth values are relative to off-package DRAM.

