**Revisiting Memory Errors in Large-Scale Production Data Centers:** Analysis and Modeling of New Trends from the Field

**Justin Meza**, Qiang Wu, Sanjeev Kumar, Onur Mutlu

> **facebook** Carnegie Mellon University

### Study of DRAM reliability:

- on *modern* devices and workloads
- at a large scale in the field



#### Error/failure occurrence

Errors follow a *power-law distribution*, and a large number of errors occur due to *sockets/channels*.

#### Technology scaling

We find that *newer* cell fabrication technologies have *higher failure rates*.

#### Architecture & workload

**Chips per DIMM**, **transfer width**, and **workload type** (not necessarily CPU/memory utilization) affect reliability.

#### Modeling errors

We have made publicly available a statistical model for assessing server memory reliability.

#### Page offlining at scale

First large-scale study of page offlining; real-world limitations of the technique.
## Outline

- background and motivation
- server memory organization
- error collection/analysis methodology
- memory reliability trends
- summary

# Background and motivation

### DRAM errors are common

- examined extensively in prior work
  - charged particles, wear-out
  - variable retention time (next talk)
- error correcting codes
  - used to detect and correct errors
  - require additional storage overheads

### Our goal

Strengthen understanding of DRAM reliability by studying:

- new trends in DRAM errors
  - modern devices and workloads
- at a large scale
  - billions of device-days, across 14 months

## Our main contributions

- identified new DRAM failure trends
- developed a *model* for DRAM errors
- evaluated page offlining at scale

# Server memory organization



(Figure: server memory organization, with the following labeled components.)

- Socket
- DIMM slots
- DIMM
- User data
- *ECC metadata*: additional 12.5% overhead
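The 12.5% figure can be checked with a line of arithmetic. The sketch below assumes the common ECC DIMM layout of 8 check bits per 64 data bits; the slide does not state the layout, so `DATA_BITS` and `ECC_BITS` are assumptions chosen for illustration.

```python
# Assumed layout (not stated on the slide): 64 data bits protected by
# 8 ECC check bits, as on a typical 72-bit-wide ECC DIMM.
DATA_BITS = 64
ECC_BITS = 8


def ecc_overhead(data_bits: int, ecc_bits: int) -> float:
    """Return ECC storage overhead as a fraction of the user data size."""
    return ecc_bits / data_bits


overhead = ecc_overhead(DATA_BITS, ECC_BITS)
print(f"ECC overhead: {overhead:.1%}")  # prints "ECC overhead: 12.5%"
```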

# **Reliability events**

### Fault

- the underlying cause of an error
  - DRAM cell unreliably stores charge

### Error

- the manifestation of a fault
- permanent: every time
- transient: only some of the time

# Error collection/analysis methodology

## **DRAM error measurement**

- measured every correctable error
  - across Facebook's fleet
  - for 14 months
  - metadata associated with each error
- parallelized Map-Reduce to process
- used R for further analysis

## System characteristics

- 6 different system configurations
  - Web, Hadoop, Ingest, Database, Cache, Media
  - diverse CPU/memory/storage requirements
- modern DRAM devices
  - DDR3 communication protocol
    - (more aggressive clock frequencies)
  - diverse organizations (banks, ranks, ...)
  - previously unexamined characteristics
    - density, # of chips, transfer width, workload

# Memory reliability trends

New reliability trends

- Error/failure occurrence
- Technology scaling
- Architecture & workload
- Modeling errors
- Page offlining at scale

#### Error/failure occurrence

### Server error rate

(Figure: server error rate by month.)

## Memory error distribution

## Sockets/channels: many errors

#### Bank/cell/spurious failures are common
# Analytical methodology

- measure server characteristics
  - not feasible to examine every server
  - examined all servers with errors (error group)
  - sampled servers without errors (control group)
- bucket devices based on characteristics
- measure relative failure rate
  - of error group vs. control group
  - within each bucket
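The bucketing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `control_sampling_fraction` parameter, and the choice to scale the sampled control group back up by that fraction are all assumptions, and the example data is made up.

```python
from collections import Counter


def relative_failure_rate(error_buckets, control_buckets, control_sampling_fraction):
    """Estimate a per-bucket failure rate from an error group and a sampled control group.

    error_buckets: bucket label for every server that had errors.
    control_buckets: bucket label for each *sampled* server without errors.
    control_sampling_fraction: fraction of error-free servers that were sampled
        (hypothetical parameter; used to scale the sample back up).
    """
    errors = Counter(error_buckets)
    controls = Counter(control_buckets)
    rates = {}
    for bucket in sorted(set(errors) | set(controls)):
        estimated_healthy = controls[bucket] / control_sampling_fraction
        total = errors[bucket] + estimated_healthy
        rates[bucket] = errors[bucket] / total if total else 0.0
    return rates


# Made-up example: each entry is the chip density (Gb) bucket of one server.
error_group = [4, 4, 4, 2, 1]
control_group = [4, 2, 2, 1, 1, 1]
rates = relative_failure_rate(error_group, control_group, control_sampling_fraction=0.1)
for density, rate in rates.items():
    print(f"{density} Gb: {rate:.3f}")
```

Only the ratio between buckets matters for a relative comparison, so an imprecise sampling fraction shifts all buckets together rather than reordering them.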

#### Technology scaling

### Prior work found *inconclusive trends* with respect to memory *capacity*

(Figure; x-axis: DIMM capacity (GB).)

## Examine a characteristic more closely related to cell fabrication technology

## Use **DRAM chip density** to examine technology scaling

(closely related to fabrication technology)

(Figure; x-axis: chip density (Gb).)

#### Architecture & workload

# **DIMM architecture**

- chips per DIMM, transfer width
  - 8 to 48 chips
  - x4, x8 = 4 or 8 bits per cycle
  - electrical implications

# Does DIMM organization affect memory reliability?

# Workload dependence

- prior studies: homogeneous workloads
  - web search and scientific
- warehouse-scale data centers:
  - web, hadoop, ingest, database, cache, media

## What effect do <u>heterogeneous</u> workloads have on reliability?

(Figures; x-axes: **CPU** utilization and memory utilization; series: 1 Gb, 2 Gb, 4 Gb.)

#### Modeling errors

# A model for server failure

- use statistical regression model
  - compare control group vs. error group
  - linear regression in R
  - trained using data from analysis
- enable exploratory analysis
  - high perf. vs. low power systems




 $\ln \left[ \mathcal{F}/(1-\mathcal{F}) \right] = \beta_{Intercept} + (Capacity \cdot \beta_{Capacity}) + (Density2Gb \cdot \beta_{Density2Gb}) + (Density4Gb \cdot \beta_{Density4Gb}) + (Chips \cdot \beta_{Chips}) + (CPU\% \cdot \beta_{CPU\%}) + (Age \cdot \beta_{Age}) + (CPUs \cdot \beta_{CPUs})$ 
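Because the left-hand side of the equation is the log-odds of the failure rate, recovering the rate itself means applying the logistic function to the weighted sum. The sketch below shows that inversion only. The coefficient values are placeholders set to zero, not the published ones, and the input units (capacity in GB, CPU% as a number such as 50) are assumptions; the function and variable names are hypothetical.

```python
import math


def predicted_failure_rate(features, betas):
    """Invert ln[F / (1 - F)] = intercept + sum(feature * beta) to recover F."""
    log_odds = betas["Intercept"] + sum(
        value * betas[name] for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Placeholder coefficients: NOT the published values. They are all zero so
# that the example runs and its result is easy to verify by hand.
placeholder_betas = {
    "Intercept": 0.0, "Capacity": 0.0, "Density2Gb": 0.0, "Density4Gb": 0.0,
    "Chips": 0.0, "CPU%": 0.0, "Age": 0.0, "CPUs": 0.0,
}

# One server described by the factors that appear in the equation.
server = {
    "Capacity": 4, "Density2Gb": 1, "Density4Gb": 0,
    "Chips": 16, "CPU%": 50, "Age": 1, "CPUs": 8,
}

rate = predicted_failure_rate(server, placeholder_betas)
print(rate)  # 0.5, because all-zero betas give log-odds of 0
```

Substituting fitted coefficients for `placeholder_betas` is what turns this into a usable predictor.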

### Available online

#### Memory error model

From Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu: Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field. DSN, 2015.



<u>http://www.ece.cmu.edu/~safari/tools/memerr/</u>

#### Page offlining at scale

### Prior page offlining work

- [Tang+,DSN'06] proposed technique
  - "retire" faulty pages using OS
  - do not allow software to allocate them
- [Hwang+,ASPLOS'12] simulated eval.
  - error traces from Google and IBM
  - recommended retirement on first error
    - large number of cell/spurious errors
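The retirement idea described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how any operating system implements it: the class and method names are hypothetical, pages are plain integers, and the model ignores the need to migrate data off a page that is already in use.

```python
class PageOffliner:
    """Toy model of page retirement: pages with too many errors are never allocated again."""

    def __init__(self, num_pages, error_threshold=1):
        self.free_pages = set(range(num_pages))
        self.offlined = set()
        self.error_counts = {}
        self.error_threshold = error_threshold  # 1 = retire on first error

    def report_error(self, page):
        """Record a correctable error; retire the page once it reaches the threshold."""
        self.error_counts[page] = self.error_counts.get(page, 0) + 1
        if self.error_counts[page] >= self.error_threshold:
            self.offlined.add(page)
            self.free_pages.discard(page)

    def allocate(self):
        """Hand out a free page that has not been retired; None if none remain."""
        return self.free_pages.pop() if self.free_pages else None


mem = PageOffliner(num_pages=4)
mem.report_error(2)  # page 2 is retired on its first error
allocated = [mem.allocate() for _ in range(4)]
print(allocated)  # page 2 never appears; the fourth request gets None
```

Raising `error_threshold` models a less aggressive policy than retirement on first error, at the cost of leaving a faulty page in use for longer.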

## How effective is page offlining in the wild?

*First large-scale study* of page offlining; real-world *limitations* of the technique

### More results in paper

- Vendors
- Age
- Processor cores
- Correlation analysis
- Memory model case study

# Modern systems, large scale

- Error/failure occurrence: errors follow a *power-law distribution*, and a large number of errors occur due to *sockets/channels*
- Technology scaling: *newer* cell fabrication technologies have *higher failure rates*
- Architecture & workload: chips per DIMM, transfer width, and workload type (not necessarily CPU/memory utilization) affect reliability
- Modeling errors: we have made publicly available a statistical model for assessing server memory reliability
- Page offlining at scale: first large-scale study of page offlining; real-world limitations of the technique


### Backup slides

### Decreasing hazard rate







|         |        |     |     |     |
|---------|--------|-----|-----|-----|
| Errors  | 54,326 | 0   | 2   | 10  |
| Density | 4Gb    | 1Gb | 2Gb | 2Gb |

















### Case study

| Factor                          | Low-end | High-end (HE) |
|---------------------------------|---------|---------------|
| Capacity                        | 4 GB    | 16 GB         |
| Density2Gb                      | 1       | 0             |
| Density4Gb                      | 0       | 1             |
| Chips                           | 16      | 32            |
| CPU%                            | 50%     | 25%           |
| Age                             | 1       | 1             |
| CPUs                            | 8       | 16            |
| Predicted relative failure rate | 0.12    | 0.78          |

Inputs: Capacity through CPUs. Output: predicted relative failure rate.

#### Do CPUs or density have a higher impact?

### Exploratory analysis

| Factor                                | Low-end | High-end (HE) | HE/↓density | HE/↓CPUs |
|---------------------------------------|---------|---------------|-------------|----------|
| Capacity                              | 4 GB    | 16 GB         | 4 GB        | 16 GB    |
| Density2Gb                            | 1       | 0             | 1           | 0        |
| Density4Gb                            | 0       | 1             | 0           | 1        |
| Chips                                 | 16      | 32            | 16          | 32       |
| CPU%                                  | 50%     | 25%           | 25%         | 50%      |
| Age                                   | 1       | 1             | 1           | 1        |
| CPUs                                  | 8       | 16            | 16          | 8        |
| Predicted<br>relative<br>failure rate | 0.12    | 0.78          | 0.33        | 0.51     |
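The predicted rates in the table can be compared directly to see which change lowers the high-end failure rate more. The sketch below uses only the four output values from the table; the dictionary layout and variable names are illustrative. Note that, per the table, the HE/↓density configuration also changes capacity and chip count, so its reduction is not attributable to density alone.

```python
# Predicted relative failure rates copied from the table above.
predicted = {
    "Low-end": 0.12,
    "High-end (HE)": 0.78,
    "HE/↓density": 0.33,
    "HE/↓CPUs": 0.51,
}

baseline = predicted["High-end (HE)"]

# Fractional reduction of each modified configuration relative to HE.
reductions = {
    name: (baseline - rate) / baseline
    for name, rate in predicted.items()
    if name.startswith("HE/")
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.0%} lower than HE")
```

With these values, the lower-density configuration comes out roughly 58% below the high-end prediction and the lower-CPU-count configuration roughly 35% below it.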
