#### Fairness via Source Throttling:

A configurable and high-performance fairness substrate for multi-core memory systems

> Eiman Ebrahimi\* Chang Joo Lee\* Onur Mutlu<sup>‡</sup> Yale N. Patt\*

\* HPS Research Group The University of Texas at Austin **‡** Computer Architecture Laboratory Carnegie Mellon University















 Applications slow down due to interference from memory requests of other applications

- Applications slow down due to interference from memory requests of other applications
- A memory system is fair if slowdowns of same-priority applications are equal (MICRO '06, MICRO '07, ISCA '08)

- Applications slow down due to interference from memory requests of other applications
- A memory system is fair if slowdowns of same-priority applications are equal (MICRO '06, MICRO '07, ISCA '08)

• Slowdown of application 
$$i = \frac{T_i^{\text{Shared}}}{T_i^{\text{Alone}}}$$

- Applications slow down due to interference from memory requests of other applications
- A memory system is fair if slowdowns of same-priority applications are equal (MICRO '06, MICRO '07, ISCA '08)

• Slowdown of application 
$$i = \frac{T_i^{\text{Shared}}}{T_i^{\text{Alone}}}$$

Unfairness = Max{Slowdown i} over all applications i
(MICRO '07)





 Magnitude of each application's slowdown depends on concurrently running applications' memory behavior



• Large disparities in slowdowns are unacceptable



- Large disparities in slowdowns are unacceptable
  - Low system performance



- Large disparities in slowdowns are unacceptable
  - Low system performance
  - Vulnerability to denial of service attacks



- Large disparities in slowdowns are unacceptable
  - Low system performance
  - Vulnerability to denial of service attacks
  - Difficult for system software to enforce priorities

## Outline

- Background and Problem
- Motivation for Source Throttling
- Fairness via Source Throttling (FST)
- Evaluation
- Conclusion

- Primarily manage inter-application interference in only one particular resource
  - Shared Cache, Memory Controller, Interconnect, etc.

- Primarily manage inter-application interference in only one particular resource
  - Shared Cache, Memory Controller, Interconnect, etc.
  - Combining techniques for the different resources can result in negative interaction

- Primarily manage inter-application interference in only one particular resource
  - Shared Cache, Memory Controller, Interconnect, etc.
  - Combining techniques for the different resources can result in negative interaction
- Approaches that coordinate interaction among techniques for different resources require complex implementations

- Primarily manage inter-application interference in only one particular resource
  - Shared Cache, Memory Controller, Interconnect, etc.
  - Combining techniques for the different resources can result in negative interaction
- Approaches that coordinate interaction among techniques for different resources require complex implementations

Our Goal: Enable fair sharing of the entire memory system by dynamically detecting and controlling interference in a coordinated manner

 Manage inter-application interference at the cores, not at the shared resources

- Manage inter-application interference at the cores, not at the shared resources
- Dynamically estimate unfairness in the memory system

- Manage inter-application interference at the cores, not at the shared resources
- Dynamically estimate unfairness in the memory system
- If unfairness > system-software-specified target then throttle down core causing unfairness & throttle up core that was unfairly treated



A: B:

Fair Source Throttling A: B:

Wednesday, March 17, 2010





B:

A:











Fair Source Throttling



Request Generation Order:

AI, A2, A3, A4, BI

Wednesday, March 17, 2010













Fair Source Throttling





Throttling







**B**:





























## Outline

- Background and Problem
- Motivation for Source Throttling
- Fairness via Source Throttling (FST)
- Evaluation
- Conclusion

- Runtime Unfairness Evaluation
  - Dynamically estimates the unfairness in the memory system

- Runtime Unfairness Evaluation
  - Dynamically estimates the unfairness in the memory system
- Dynamic Request Throttling
  - Adjusts how aggressively each core makes requests to the shared resources

















Wednesday, March 17, 2010











- Unfairness =  $\frac{\text{Max}\{\text{Slowdown }i\}\text{ over all applications }i}{\text{Min}\{\text{Slowdown }i\}\text{ over all applications }i}$ • Slowdown of application  $i = \frac{T_i^{\text{Shared}}}{T_i^{\text{Alone}}}$ • How can  $T_i^{\text{Alone}}$  be estimated in shared mode?
  - T<sup>Excess</sup> is the number of extra cycles it takes application *i* to execute due to interference

 Unfairness = Max{Slowdown i} over all applications i
Min{Slowdown i} over all applications i • Slowdown of application  $i = \frac{T_i^{\text{Shared}}}{T_i^{\text{Alone}}}$ • How can  $T_i^{Alone}$  be estimated in shared mode? • T<sub>i</sub> is the number of extra cycles it takes application *i* to execute due to interference •  $T_i^{\text{Alone}} = T_i^{\text{Shared}} - T_i^{\text{Excess}}$ 

















Core 0



Core I

































16



















### Fairness via Source Throttling (FST)



• To identify App-interfering, for each core *i* 

- To identify App-interfering, for each core *i* 
  - FST separately tracks interference caused by each core j ( j ! i )

To identify App-interfering, for each core *i*FST separately tracks interference

caused by each core j ( j ! i )

Interference per core bit vector

Core #0 | 2 3

To identify App-interfering, for each core *i*FST separately tracks interference caused by each core *j* (*j* ! *i*)

Interference per core bit vector

| Core # | 0 |   | 2 | 3 |
|--------|---|---|---|---|
| 0      | - | 0 | 0 | 0 |
| 1      | 0 | - | 0 | 0 |
| 2      | 0 | 0 | - | 0 |
| 3      | 0 | 0 | 0 | - |

To identify App-interfering, for each core *i*FST separately tracks interference

caused by each core j ( j ! i )

Interference per core bit vector Interfered with core Core #0 1 2 3



| П | .0 |   |   | <u> </u> | _ |
|---|----|---|---|----------|---|
| ) | -  | 0 | 0 | 0        |   |
|   | 0  | - | 0 | 0        |   |
| 2 | 0  | 0 | - | 0        |   |
| 3 | 0  | 0 | 0 | -        |   |

To identify App-interfering, for each core *i*FST separately tracks interference caused by each core *j* (*j* ! *i*)



• To identify App-interfering, for each core *i*  FST separately tracks interference caused by each core j ( j ! i )

> Interference per core bit vector

**Excess Cycles** Counters per core



Core #0 | 2 3 Interfering core

| $\pi$ | . 0 |   |   | <u> </u> | - |
|-------|-----|---|---|----------|---|
| )     | -   | 0 | 0 | 0        |   |
|       | 0   | - | 0 | 0        |   |
| 2     | 0   | 0 | - | 0        |   |
| 3     | 0   | 0 | 0 | -        |   |

| -       | Cnt 0, I | Cnt 0,2 | Cnt 0,3 |
|---------|----------|---------|---------|
| Cnt I,0 | -        | Cnt I,2 | Cnt I,3 |
| Cnt 2,0 | Cnt 2, I | -       | Cnt 2,3 |
| Cnt 3,0 | Cnt 3, I | Cnt 3,2 | -       |

• To identify App-interfering, for each core *i*  FST separately tracks interference caused by each core j ( j ! i )

> Interference per core bit vector

**Excess Cycles** Counters per core



Interfering core

| Cor  | e # | 0 |   | 2 | 3 |
|------|-----|---|---|---|---|
|      | 0   | - | 0 | 0 | 0 |
| ring |     | 0 | - | 0 | 0 |
| e    | 2   | 0 | I | - | 0 |
|      | 3   | 0 | 0 | 0 |   |

| -       | Cnt 0, I | Cnt 0,2 | Cnt 0,3 |
|---------|----------|---------|---------|
| Cnt I,0 | -        | Cnt I,2 | Cnt I,3 |
| Cnt 2,0 | Cnt 2, I | -       | Cnt 2,3 |
| Cnt 3,0 | Cnt 3, I | Cnt 3,2 | _       |

To identify App-interfering, for each core *i*FST separately tracks interference

caused by each core j ( j ! i )



To identify App-interfering, for each core *i*FST separately tracks interference

caused by each core j ( j ! i )



To identify App-interfering, for each core *i*FST separately tracks interference

caused by each core j ( j ! i )



To identify App-interfering, for each core *i*FST separately tracks interference caused by each core *j* (*j* ! *i*)



## Fairness via Source Throttling (FST)



• Goal: Adjust how aggressively each core makes requests to the shared resources

- Goal: Adjust how aggressively each core makes requests to the shared resources
- Mechanisms:
  - Miss Status Holding Register (MSHR) quota

- Goal: Adjust how aggressively each core makes requests to the shared resources
- Mechanisms:
  - Miss Status Holding Register (MSHR) quota
    - Controls the number of concurrent requests accessing shared resources from each application

 Goal: Adjust how aggressively each core makes requests to the shared resources

#### Mechanisms:

- Miss Status Holding Register (MSHR) quota
  - Controls the number of concurrent requests accessing shared resources from each application
- Request injection frequency

 Goal: Adjust how aggressively each core makes requests to the shared resources

#### Mechanisms:

- Miss Status Holding Register (MSHR) quota
  - Controls the number of concurrent requests accessing shared resources from each application
- Request injection frequency
  - Controls how often memory requests are issued to the last level cache from the MSHRs

 Throttling level assigned to each core determines both MSHR quota and request injection rate

• Throttling level assigned to each core determines both MSHR quota and request injection rate

|                          | Throttling level | MSHR quota | Request Injection<br>Rate |
|--------------------------|------------------|------------|---------------------------|
|                          | 100%             | 128        | Every cycle               |
|                          | 50%              | 64         | Every other cycle         |
|                          | 25%              | 32         | Once every 4 cycles       |
|                          | 10%              | 12         | Once every 10 cycles      |
|                          | 5%               | 6          | Once every 20 cycles      |
|                          | 4%               | 5          | Once every 25 cycles      |
|                          | 3%               | 3          | Once every 30 cycles      |
| Total # of<br>MSHRs: 128 | 2%               | 2          | Once every 50 cycles      |

 Throttling level assigned to each core determines both MSHR quota and request injection rate

|                           | Throttling level | MSHR quota | Request Injection<br>Rate |
|---------------------------|------------------|------------|---------------------------|
|                           | 100%             | 128        | Every cycle               |
|                           | 50%              | 64         | Every other cycle         |
|                           | 25%              | 32         | Once every 4 cycles       |
| $\langle$                 | 0%               | 12         | Once every 10 cycles      |
|                           | 5%               | 6          | Once every 20 cycles      |
|                           | 4%               | 5          | Once every 25 cycles      |
| <b>T</b>                  | 3%               | 3          | Once every 30 cycles      |
| Total # of<br>MSHRs: I 28 | 2%               | 2          | Once every 50 cycles      |



Core 0Core 1Core 2Core 3Interval iInterval i + 1Interval i + 2Interval i + 2Interval i + 2Interval i + 2Interval i + 2Throttling Levels

Time



|         |                   | Core 0 | Core I | Core 2 | Core 3 |  |
|---------|-------------------|--------|--------|--------|--------|--|
| Inter   | val i             | 50%    | 100%   | 10%    | 100%   |  |
| Interva | <i>i</i> +        |        |        |        |        |  |
| Interva | l i + 2           |        |        |        |        |  |
|         | Throttling Levels |        |        |        |        |  |







|                       | Core 0            | Core I | Core 2 | Core 3 |  |  |
|-----------------------|-------------------|--------|--------|--------|--|--|
| Interval i            | 50%               | 100%   | 10%    | 100%   |  |  |
| Interval <i>i</i> + 1 |                   |        |        |        |  |  |
| Interval <i>i</i> + 2 |                   |        |        |        |  |  |
|                       | Throttling Levels |        |        |        |  |  |



|                       | Core 0            | Core I | Core 2 | Core 3 |  |  |
|-----------------------|-------------------|--------|--------|--------|--|--|
| Interval i            | 50%               | 100%   | 10%    | 100%   |  |  |
| Interval <i>i</i> + 1 |                   |        |        |        |  |  |
| Interval <i>i</i> + 2 |                   |        |        |        |  |  |
|                       | Throttling Levels |        |        |        |  |  |



#### FST at Work Interval *i* \_ Interval *i*+1 Time Slowdown **Estimation** FST Unfairness Estimate 3 System software fairness goal: 1.4 Runtime Unfairness App-slowest Core 2 Dynamic Evaluation App-interfering Core Q **Request Throttling** Throttle up Throttle down Core 2 Core 3 Core 0 Core 1 Interval *i* 50% 100% 10% 100% 25% Interval i + | (25%)100% 100% Interval i + 2**Throttling Levels**



|                       | Core 0            | Core I | Core 2 | Core 3 |  |
|-----------------------|-------------------|--------|--------|--------|--|
| Interval i            | 50%               | 100%   | 10%    | 100%   |  |
| Interval i + 1        | 25%               | 100%   | 25%    | 100%   |  |
| Interval <i>i</i> + 2 |                   |        |        |        |  |
|                       | Throttling Levels |        |        |        |  |





## System Software Support

# System Software Support

 Different fairness objectives can be configured by system software

# System Software Support

- Different fairness objectives can be configured by system software
  - Estimated Unfairness > Target Unfairness

- Different fairness objectives can be configured by system software
  - Estimated Unfairness > Target Unfairness
  - Estimated Max Slowdown > Target Max Slowdown

- Different fairness objectives can be configured by system software
  - Estimated Unfairness > Target Unfairness
  - Estimated Max Slowdown > Target Max Slowdown
  - Estimated Slowdown(i) > Target Slowdown(i)

- Different fairness objectives can be configured by system software
  - Estimated Unfairness > Target Unfairness
  - Estimated Max Slowdown > Target Max Slowdown
  - Estimated Slowdown(i) > Target Slowdown(i)
- Support for thread priorities

- Different fairness objectives can be configured by system software
  - Estimated Unfairness > Target Unfairness
  - Estimated Max Slowdown > Target Max Slowdown
  - Estimated Slowdown(i) > Target Slowdown(i)
- Support for thread priorities

Weighted Slowdown(i) =
Estimated Slowdown(i) x Weight(i)



 Total storage cost required for 4 cores is ~ 12KB

 FST does not require any structures or logic that are on the processor's critical path

# Outline

- Background and Problem
- Motivation for Source Throttling
- Fairness via Source Throttling (FST)
- Evaluation
- Conclusion

# **Evaluation Methodology**

- x86 cycle accurate simulator
- Baseline processor configuration
  - Per-core
    - 4-wide issue, out-of-order, 256 entry ROB
  - Shared (4-core system)
    - 128 MSHRs
    - 2 MB, 16-way L2 cache
  - Main Memory
    - DDR3 1333 MHz
    - Latency of 15ns per command (*tRP*, *tRCD*, *CL*)
    - 8B wide core to memory bus

























#### Conclusion

- Fairness via Source Throttling (FST) is a new fair and high-performance shared resource management approach for CMPs
- Dynamically monitors unfairness and throttles down sources of interfering memory requests
- Eliminates the need for and complexity of multiple per-resource fairness techniques
- Improves both system fairness and performance
- Incorporates thread weights and enables different fairness objectives

#### Fairness via Source Throttling:

A configurable and high-performance fairness substrate for multi-core memory systems

> Eiman Ebrahimi\* Chang Joo Lee\* Onur Mutlu<sup>‡</sup> Yale N. Patt\*

\* HPS Research Group The University of Texas at Austin **‡** Computer Architecture Laboratory Carnegie Mellon University