# Memory Systems

# and Memory-Centric Computing Systems

Part 5: Principles and Conclusion

Prof. Onur Mutlu

omutlu@gmail.com

https://people.inf.ethz.ch/omutlu

7 July 2019

**SAMOS Tutorial** 





**Carnegie Mellon** 

# Four Key Directions

- Fundamentally Secure/Reliable/Safe Architectures
- Fundamentally Energy-Efficient Architectures
  - Memory-centric (Data-centric) Architectures
- Fundamentally Low-Latency Architectures
- Architectures for Genomics, Medicine, Health

# Guiding Principles

# Some Solution Principles (So Far)

- Data-centric system design & intelligence spread around
  - Do not center everything around traditional computation units
- Better cooperation across layers of the system
  - Careful co-design of components and layers: system/arch/device
  - Better, richer, more expressive and flexible interfaces
- Better-than-worst-case design
  - Do not optimize for the worst case
  - Worst case should not determine the common case
- Heterogeneity in design (specialization, asymmetry)
  - Enables a more efficient design (No one size fits all)

# Some Solution Principles (More Compact)

- Data-centric design
- All components intelligent
- Better cross-layer communication, better interfaces
- Better-than-worst-case design
- Heterogeneity
- Flexibility, adaptability

# **Open minds**

### Data-Aware Architectures

- A data-aware architecture understands what it can do with and to each piece of data
- It makes use of different properties of data to improve performance, efficiency and other metrics
  - Compressibility
  - Approximability
  - Locality
  - Sparsity
  - Criticality for Computation X
  - Access Semantics
  - **...**
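One way to picture a data-aware interface is as a small set of hints that software attaches to a region of data, which hardware then uses to pick an optimization. The sketch below is a minimal illustration under stated assumptions: `DataHints`, `choose_actions`, and the action strings are hypothetical names invented here, and they are not the interface of XMem or of any real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataHints:
    """Hypothetical per-region hints that software could pass down to hardware."""
    compressible: bool = False
    approximable: bool = False
    high_locality: bool = False
    sparse: bool = False


def choose_actions(hints: DataHints) -> List[str]:
    """Map data properties to illustrative (made-up) hardware actions."""
    actions = []
    if hints.compressible:
        actions.append("store compressed")
    if hints.approximable:
        actions.append("allow reduced precision")
    if hints.high_locality:
        actions.append("prioritize for caching")
    if hints.sparse:
        actions.append("skip zero blocks")
    return actions


if __name__ == "__main__":
    # A region that is both compressible and sparse.
    print(choose_actions(DataHints(compressible=True, sparse=True)))
```

The point of the sketch is only that each property in the list above can drive a different decision; a real design would have to define how such hints are expressed, tracked, and enforced across layers.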

### One Problem: Limited Interfaces

## Higher-level information is not visible to HW



Hardware

100011111... 101010011... Instructions
Memory Addresses

# A Solution: More Expressive Interfaces

**Performance** 













**Functionality** 

ISA Virtual Memory Higher-level Program Semantics

Expressive Memory "XMem"

#### **Hardware**







# Expressive (Memory) Interfaces

 Nandita Vijaykumar, Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Gennady Pekhimenko, Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. Gibbons and Onur Mutlu, "A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory"

Proceedings of the <u>45th International Symposium on Computer Architecture</u> (**ISCA**), Los Angeles, CA, USA, June 2018.

[Slides (pptx) (pdf)] [Lightning Talk Slides (pptx) (pdf)] [Lightning Talk Video]

#### A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory

Nandita Vijaykumar<sup>†§</sup> Abhilasha Jain<sup>†</sup> Diptesh Majumdar<sup>†</sup> Kevin Hsieh<sup>†</sup> Gennady Pekhimenko<sup>‡</sup> Eiman Ebrahimi<sup>ℵ</sup> Nastaran Hajinazar<sup>‡</sup> Phillip B. Gibbons<sup>†</sup> Onur Mutlu<sup>§†</sup>

# Expressive (Memory) Interfaces for GPUs

Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons and Onur Mutlu,
 "The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express
 Data Locality in GPUs"

Proceedings of the <u>45th International Symposium on Computer Architecture</u> (**ISCA**), Los Angeles, CA, USA, June 2018.

[Slides (pptx) (pdf)] [Lightning Talk Slides (pptx) (pdf)] [Lightning Talk Video]

#### The Locality Descriptor:

#### A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs

Nandita Vijaykumar<sup>†§</sup> Eiman Ebrahimi<sup>‡</sup> Kevin Hsieh<sup>†</sup> Phillip B. Gibbons<sup>†</sup> Onur Mutlu<sup>§†</sup>

<sup>†</sup>Carnegie Mellon University <sup>‡</sup>NVIDIA <sup>§</sup>ETH Zürich

# Architectures for Intelligent Machines

# **Data-centric**

**Data-driven** 

**Data-aware** 

# Concluding Remarks

# A Quote from A Famous Architect

"architecture [...] based upon principle, and not upon precedent"



# Precedent-Based Design?

"architecture [...] based upon principle, and not upon precedent"



# Principled Design

"architecture [...] based upon principle, and not upon precedent"






# The Overarching Principle

# Organic architecture

From Wikipedia, the free encyclopedia

Organic architecture is a philosophy of architecture which promotes harmony between human habitation and the natural world through design approaches so sympathetic and well integrated with its site, that buildings, furnishings, and surroundings become part of a unified, interrelated composition.

A well-known example of organic architecture is Fallingwater, the residence Frank Lloyd Wright designed for the Kaufmann family in rural Pennsylvania. Wright had many choices to locate a home on this large site, but chose to place the home directly over the waterfall and creek creating a close, yet noisy dialog with the rushing water and the steep site. The horizontal striations of stone masonry with daring cantilevers of colored beige concrete blend with native rock outcroppings and the wooded environment.

# Another Example: Precedent-Based Design



# Principled Design



# Another Principled Design



# Another Principled Design



# Principle Applied to Another Structure





# The Overarching Principle

# Zoomorphic architecture

From Wikipedia, the free encyclopedia

**Zoomorphic architecture** is the practice of using animal forms as the inspirational basis and blueprint for architectural design. "While animal forms have always played a role adding some of the deepest layers of meaning in architecture, it is now becoming evident that a new strand of biomorphism is emerging where the meaning derives not from any specific representation but from a more general allusion to biological processes."<sup>[1]</sup>

Some well-known examples of Zoomorphic architecture can be found in the TWA Flight Center building in New York City, by Eero Saarinen, or the Milwaukee Art Museum by Santiago Calatrava, both inspired by the form of a bird's wings.<sup>[3]</sup>

# Overarching Principle for Computing?



# Concluding Remarks

- It is time to design principled system architectures to solve the memory problem
- Discover design principles for fundamentally secure and reliable computer architectures
- Design complete systems to be balanced and energy-efficient,
   i.e., low latency and data-centric (or memory-centric)
- Enable new and emerging memory architectures
- This can
  - Lead to orders-of-magnitude improvements
  - Enable new applications & computing platforms
  - Enable better understanding of nature


### We Need to Think Across the Stack



We can get there step by step

# If In Doubt, See Other Doubtful Technologies

- A very "doubtful" emerging technology
  - for at least two decades



Proceedings of the IEEE, Sept. 2017

# Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu



# Accelerated Memory Course (~6.5 hours)

#### ACACES 2018

- Memory Systems and Memory-Centric Computing Systems
- Taught by Onur Mutlu July 9-13, 2018
- ~6.5 hours of lectures
- Website for the Course including Videos, Slides, Papers
  - https://people.inf.ethz.ch/omutlu/acaces2018.html
  - https://www.youtube.com/playlist?list=PL5Q2soXY2Zi-HXxomthrpDpMJm05P6J9x

#### All Papers are at:

- https://people.inf.ethz.ch/omutlu/projects.htm
- Final lecture notes and readings (for all topics)

# A Final Detour

# In-Memory Bulk Bitwise Operations

- We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ
- At low cost
- Using analog computation capability of DRAM
  - Idea: activating multiple rows performs computation
- 30-60X performance and energy improvement
  - Seshadri+, "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology," MICRO 2017.

- New memory technologies enable even more opportunities
  - Memristors, resistive RAM, phase change mem, STT-MRAM, ...
  - Can operate on data with minimal movement
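The "activating multiple rows performs computation" idea can be modeled at the logical level. In Ambit, activating three rows together leaves each bitline at the majority value of the three cells, and fixing the third row to all 0s or all 1s turns that majority into AND or OR. The sketch below models only this logical behavior, using Python integers as rows. The function names and the 8-bit row width are assumptions made for illustration, and nothing here models analog charge sharing, DRAM timing, or how Ambit actually implements NOT.

```python
ROW_BITS = 8                   # illustrative width; real DRAM rows are far wider
MASK = (1 << ROW_BITS) - 1     # all-ones row


def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows (logical effect of triple-row activation)."""
    return ((a & b) | (b & c) | (a & c)) & MASK


def bulk_and(a: int, b: int) -> int:
    """AND = MAJ with the control row set to all 0s."""
    return maj(a, b, 0)


def bulk_or(a: int, b: int) -> int:
    """OR = MAJ with the control row set to all 1s."""
    return maj(a, b, MASK)


def bulk_not(a: int) -> int:
    """Logical NOT of a row (shown only as its definition)."""
    return ~a & MASK


if __name__ == "__main__":
    a, b = 0b11001010, 0b10100110
    print(f"AND {bulk_and(a, b):08b}")
    print(f"OR  {bulk_or(a, b):08b}")
    print(f"NOT {bulk_not(a):08b}")
```

Because one activation operates on every bit of the rows at once, the amount of data processed per operation scales with the row width rather than with the width of the memory channel.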

### More on Ambit

 Vivek Seshadri et al., "<u>Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology</u>," MICRO 2017.

#### Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri<sup>1,5</sup> Donghyuk Lee<sup>2,5</sup> Thomas Mullins<sup>3,5</sup> Hasan Hassan<sup>4</sup> Amirali Boroumand<sup>5</sup> Jeremie Kim<sup>4,5</sup> Michael A. Kozuch<sup>3</sup> Onur Mutlu<sup>4,5</sup> Phillip B. Gibbons<sup>5</sup> Todd C. Mowry<sup>5</sup>

<sup>1</sup>Microsoft Research India <sup>2</sup>NVIDIA Research <sup>3</sup>Intel <sup>4</sup>ETH Zürich <sup>5</sup>Carnegie Mellon University

# Ambit Sounds Good, No?

### **Review from ISCA 2016**

#### **Paper summary**

The paper proposes to extend DRAM to include bulk, bit-wise logical operations directly between rows within the DRAM.

#### **Strengths**

- Very clever/novel idea.
- Great potential speedup and efficiency gains.

#### Weaknesses

- Probably won't ever be built. Not practical to assume DRAM manufacturers will change DRAM in this way.

### Another Review

### **Another Review from ISCA 2016**

#### **Strengths**

The proposed mechanisms effectively exploit the operation of the DRAM to perform efficient bitwise operations across entire rows of the DRAM.

#### Weaknesses

This requires a modification to the DRAM that will only help this type of bitwise operation. It seems unlikely that something like that will be adopted.

### Yet Another Review

### **Yet Another Review from ISCA 2016**

#### Weaknesses

The core novelty of Buddy RAM is almost all circuits-related (by exploiting sense amps). I do not find architectural innovation even though the circuits technique benefits architecturally by mitigating memory bandwidth and relieving cache resources within a subarray. The only related part is the new ISA support for bitwise operations at DRAM side and its induced issue on cache coherence.

### We Have a Mindset Issue...

- There are many other similar examples from reviews...
  - For many other papers...
- And, we are not even talking about JEDEC yet...
- How do we fix the mindset problem?
- By doing more research, education, implementation in alternative processing paradigms

### We need to work on enabling the better future...

### Aside: A Recommended Book



Raj Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991.


#### DECISION MAKER'S GAMES

Even if the performance analysis is correctly done and presented, it may not be enough to persuade your audience—the decision makers—to follow your recommendations. The list shown in Box 10.2 is a compilation of reasons for rejection heard at various performance analysis presentations. You can use the list by presenting it immediately and pointing out that the reason for rejection is not new and that the analysis deserves more consideration. Also, the list is helpful in getting the competing proposals rejected!

There is no clear end of an analysis. Any analysis can be rejected simply on the grounds that the problem needs more analysis. This is the first reason listed in Box 10.2. The second most common reason for rejection of an analysis and for endless debate is the workload. Since workloads are always based on the past measurements, their applicability to the current or future environment can always be questioned. Actually workload is one of the four areas of discussion that lead a performance presentation into an endless debate. These "rat holes" and their relative sizes in terms of time consumed are shown in Figure 10.26. Presenting this cartoon at the beginning of a presentation helps to avoid these areas.




FIGURE 10.26 Four issues in performance presentations that commonly lead to endless discussion.

### Box 10.2 Reasons for Not Accepting the Results of an Analysis

- 1. This needs more analysis.
- 2. You need a better understanding of the workload.
- 3. It improves performance only for long I/O's, packets, jobs, and files, and most of the I/O's, packets, jobs, and files are short.
- 4. It improves performance only for short I/O's, packets, jobs, and files, but who cares for the performance of short I/O's, packets, jobs, and files; it's the long ones that impact the system.
- 5. It needs too much memory/CPU/bandwidth and memory/CPU/bandwidth isn't free.
- 6. It only saves us memory/CPU/bandwidth and memory/CPU/bandwidth is cheap.
- 7. There is no point in making the networks (similarly, CPUs/disks/...) faster; our CPUs/disks (any component other than the one being discussed) aren't fast enough to use them.
- 8. It improves the performance by a factor of x, but it doesn't really matter at the user level because everything else is so slow.
- 9. It is going to increase the complexity and cost.
- 10. Let us keep it simple stupid (and your idea is not stupid).
- 11. It is not simple. (Simplicity is in the eyes of the beholder.)
- 12. It requires too much state.
- 13. Nobody has ever done that before. (You have a new idea.)
- 14. It is not going to raise the price of our stock by even an eighth. (Nothing ever does, except rumors.)
- 15. This will violate the IEEE, ANSI, CCITT, or ISO standard.
- 16. It may violate some future standard.
- 17. The standard says nothing about this and so it must not be important.
- 18. Our competitors don't do it. If it was a good idea, they would have done it.
- 19. Our competition does it this way and you don't make money by copying others.
- 20. It will introduce randomness into the system and make debugging difficult.
- 21. It is too deterministic; it may lead the system into a cycle.
- 22. It's not interoperable.
- 23. This impacts hardware.
- 24. That's beyond today's technology.
- 25. It is not self-stabilizing.
- 26. Why change—it's working OK.


# Initial RowHammer Reviews

# Disturbance Errors in DRAM: Demonstration, Characterization, and Prevention

Rejected (R2)



863kB Friday 31 May 2013 2:00:53pm PDT

| Review      | Overall merit (OveMer) | Novelty (Nov) | Writing quality (WriQua) | Reviewer expertise (RevExp) |
|-------------|------------------------|---------------|--------------------------|-----------------------------|
| Review #66A | 1                      | 4             | 4                        | 4                           |
| Review #66B | 5                      | 4             | 5                        | 3                           |
| Review #66C | 2                      | 3             | 5                        | 4                           |
| Review #66D | 1                      | 2             | 3                        | 4                           |
| Review #66E | 4                      | 4             | 4                        | 3                           |
| Review #66F | 2                      | 4             | 4                        | 3                           |


# Missing the Point Reviews from MICRO 2013

#### PAPER WEAKNESSES

This is an excellent test methodology paper, but there is no micro-architectural or architectural content.

#### PAPER WEAKNESSES

- Whereas they show disturbance may happen in DRAM array, authors don't show it can be an issue in realistic DRAM usage scenario
- Lacks architectural/microarchitectural impact on the DRAM disturbance analysis

#### PAPER WEAKNESSES

The mechanism investigated by the authors is one of many well known disturb mechanisms. The paper does not discuss the root causes to sufficient depth and the importance of this mechanism compared to others. Overall the length of the sections restating known information is much too long in relation to new work.

More ...

# **Reviews from ISCA 2014**

#### PAPER WEAKNESSES

- 1) The disturbance error (a.k.a coupling or cross-talk noise induced error) is a known problem to the DRAM circuit community.
- 2) What you demonstrated in this paper is so called DRAM row hammering issue you can even find a Youtube video showing this! http://www.youtube.com/watch?v=i3-gQSnBcdo
- 3) The architectural contribution of this study is too insignificant.

#### PAPER WEAKNESSES

- Row Hammering appears to be well-known, and solutions have already been proposed by industry to address the issue.
- The paper only provides a qualitative analysis of solutions to the problem. A more robust evaluation is really needed to know whether the proposed solution is necessary.

# Suggestions to Reviewers

- Be fair; you do not know it all
- Be open-minded; you do not know it all
- Be accepting of diverse research methods: there is no single way of doing research
- Be constructive, not destructive
- Do not have double standards...

## Do not block or delay scientific progress for non-reasons

# Suggestion to Community

# We Need to Fix the Reviewer Accountability Problem

# Suggestion to Community

# Eliminate Double Standards

Suggestion to Researchers: Principle: Passion

# Follow Your Passion (Do not get derailed by naysayers)

Suggestion to Researchers: Principle: Resilience

# Be Resilient

Principle: Learning and Scholarship

# Focus on learning and scholarship

Principle: Learning and Scholarship

# The quality of your work defines your impact


# Acknowledgments

### My current and past students and postdocs

 Rachata Ausavarungnirun, Abhishek Bhowmick, Amirali Boroumand, Rui Cai, Yu Cai, Kevin Chang, Saugata Ghose, Kevin Hsieh, Tyler Huberty, Ben Jaiyen, Samira Khan, Jeremie Kim, Yoongu Kim, Yang Li, Jamie Liu, Donghyuk Lee, Yixin Luo, Justin Meza, Gennady Pekhimenko, Vivek Seshadri, Lavanya Subramanian, Nandita Vijaykumar, HanBin Yoon, Jishen Zhao, ...

## My collaborators

 Can Alkan, Chita Das, Phil Gibbons, Sriram Govindan, Norm Jouppi, Mahmut Kandemir, Mike Kozuch, Konrad Lai, Ken Mai, Todd Mowry, Yale Patt, Moinuddin Qureshi, Partha Ranganathan, Bikash Sharma, Kushagra Vaid, Chris Wilkerson, ...

# Funding Acknowledgments

- NSF
- GSRC
- SRC
- CyLab
- Alibaba, AMD, Google, Facebook, HP Labs, Huawei, IBM, Intel, Microsoft, Nvidia, Oracle, Qualcomm, Rambus, Samsung, Seagate, VMware

# Slides Not Covered But Could Be Useful

# End of Backup Slides

# Readings, Videos, Reference Materials

# Accelerated Memory Course (~6.5 hours)

#### ACACES 2018

- Memory Systems and Memory-Centric Computing Systems
- Taught by Onur Mutlu July 9-13, 2018
- ~6.5 hours of lectures
- Website for the Course including Videos, Slides, Papers
  - https://safari.ethz.ch/memory\_systems/ACACES2018/
  - https://www.youtube.com/playlist?list=PL5Q2soXY2Zi-HXxomthrpDpMJm05P6J9x

### All Papers are at:

- https://people.inf.ethz.ch/omutlu/projects.htm
- Final lecture notes and readings (for all topics)

# Required: Reference Overview Paper I

# Processing Data Where It Makes Sense: Enabling In-Memory Computation

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>b,c</sup>

<sup>a</sup>ETH Zürich
<sup>b</sup>Carnegie Mellon University
<sup>c</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun, "Processing Data Where It Makes Sense: Enabling In-Memory Computation"

Invited paper in <u>Microprocessors and Microsystems</u> (**MICPRO**), June 2019. [arXiv version]

# Required: Reference Overview Paper II

Onur Mutlu and Jeremie Kim,
 "RowHammer: A Retrospective"
 <u>IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems</u> (TCAD) Special Issue on Top Picks in Hardware and Embedded Security, 2019.
 [Preliminary arXiv version]

# RowHammer: A Retrospective

Onur Mutlu<sup>§‡</sup> Jeremie S. Kim<sup>‡§</sup>

<sup>§</sup>ETH Zürich <sup>‡</sup>Carnegie Mellon University


# Required: Reference Overview Paper III

Onur Mutlu and Lavanya Subramanian,
 "Research Problems and Opportunities in Memory
 Systems"

Invited Article in <u>Supercomputing Frontiers and Innovations</u> (**SUPERFRI**), 2014/2015.

Research Problems and Opportunities in Memory Systems

Onur Mutlu<sup>1</sup>, Lavanya Subramanian<sup>1</sup>

# Reference Overview Paper IV

# **Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions**

SAUGATA GHOSE, KEVIN HSIEH, AMIRALI BOROUMAND, RACHATA AUSAVARUNGNIRUN

Carnegie Mellon University

ONUR MUTLU

ETH Zürich and Carnegie Mellon University

Saugata Ghose, Kevin Hsieh, Amirali Boroumand, Rachata Ausavarungnirun, Onur Mutlu, "Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions"

Invited Book Chapter, to appear in 2018.

[Preliminary arxiv.org version]

# Reference Overview Paper V

Onur Mutlu,

"The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser"

Invited Paper in Proceedings of the <u>Design, Automation, and Test in Europe Conference</u> (**DATE**), Lausanne, Switzerland, March 2017. [Slides (pptx) (pdf)]

# The RowHammer Problem and Other Issues We May Face as Memory Becomes Denser

Onur Mutlu
ETH Zürich
onur.mutlu@inf.ethz.ch
https://people.inf.ethz.ch/omutlu

# Reference Overview Paper VI

Onur Mutlu,
 "Memory Scaling: A Systems Architecture
 Perspective"

Technical talk at <u>MemCon 2013</u> (**MEMCON**), Santa Clara, CA, August 2013. [Slides (pptx) (pdf)]
[Video] [Coverage on StorageSearch]

# Memory Scaling: A Systems Architecture Perspective

Onur Mutlu
Carnegie Mellon University
onur@cmu.edu
http://users.ece.cmu.edu/~omutlu/

# Reference Overview Paper VII



Proceedings of the IEEE, Sept. 2017

# Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

This paper reviews the most recent advances in solid-state drive (SSD) error characterization, mitigation, and data recovery techniques to improve both SSD's reliability and lifetime.

By Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

# Related Videos and Course Materials (I)

- Undergraduate Computer Architecture Course Lecture
   Videos (2015, 2014, 2013)
- Undergraduate Computer Architecture Course
   Materials (2015, 2014, 2013)

- Graduate Computer Architecture Course Lecture
   Videos (2017, 2015, 2013)
- Graduate Computer Architecture Course
   Materials (2017, 2015, 2013)
- Parallel Computer Architecture Course Materials (Lecture Videos)

# Related Videos and Course Materials (II)

- Freshman Digital Circuits and Computer Architecture
   Course Lecture Videos (2018, 2017)
- Freshman Digital Circuits and Computer Architecture
   Course Materials (2018)
- Memory Systems Short Course Materials
   (Lecture Video on Main Memory and DRAM Basics)

# Some Open Source Tools (I)

- Rowhammer: Program to Induce RowHammer Errors
  - https://github.com/CMU-SAFARI/rowhammer
- Ramulator: Fast and Extensible DRAM Simulator
  - https://github.com/CMU-SAFARI/ramulator
- MemSim: Simple Memory Simulator
  - https://github.com/CMU-SAFARI/memsim
- NOCulator: Flexible Network-on-Chip Simulator
  - https://github.com/CMU-SAFARI/NOCulator
- SoftMC: FPGA-Based DRAM Testing Infrastructure
  - https://github.com/CMU-SAFARI/SoftMC
- Other open-source software from my group
  - https://github.com/CMU-SAFARI/
  - http://www.ece.cmu.edu/~safari/tools.html

# Some Open Source Tools (II)

- MQSim: A Fast Modern SSD Simulator
  - https://github.com/CMU-SAFARI/MQSim
- Mosaic: GPU Simulator Supporting Concurrent Applications
  - https://github.com/CMU-SAFARI/Mosaic
- IMPICA: Processing in 3D-Stacked Memory Simulator
  - https://github.com/CMU-SAFARI/IMPICA
- SMLA: Detailed 3D-Stacked Memory Simulator
  - https://github.com/CMU-SAFARI/SMLA
- HWASim: Simulator for Heterogeneous CPU-HWA Systems
  - https://github.com/CMU-SAFARI/HWASim
- Other open-source software from my group
  - https://github.com/CMU-SAFARI/
  - http://www.ece.cmu.edu/~safari/tools.html

# More Open Source Tools (III)

- A lot more open-source software from my group
  - https://github.com/CMU-SAFARI/
  - http://www.ece.cmu.edu/~safari/tools.html



# Referenced Papers

All are available at

https://people.inf.ethz.ch/omutlu/projects.htm

http://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en

https://people.inf.ethz.ch/omutlu/acaces2018.html

# Ramulator: A Fast and Extensible DRAM Simulator

[IEEE Comp Arch Letters'15]

# Ramulator Motivation

- DRAM and Memory Controller landscape is changing
- Many new and upcoming standards
- Many new controller designs
- A fast and easy-to-extend simulator is very much needed

| Segment     | DRAM Standards & Architectures                                                                                                                                                                                               |
|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Commodity   | DDR3 (2007) [14]; DDR4 (2012) [18]                                                                                                                                                                                           |
| Low-Power   | LPDDR3 (2012) [17]; LPDDR4 (2014) [20]                                                                                                                                                                                       |
| Graphics    | GDDR5 (2009) [15]                                                                                                                                                                                                            |
| Performance | eDRAM [28], [32]; RLDRAM3 (2011) [29]                                                                                                                                                                                        |
| 3D-Stacked  | WIO (2011) [16]; WIO2 (2014) [21]; MCDRAM (2015) [13];<br>HBM (2013) [19]; HMC1.0 (2013) [10]; HMC1.1 (2014) [11]                                                                                                            |
| Academic    | SBA/SSA (2010) [38]; Staged Reads (2012) [8]; RAIDR (2012) [27]; SALP (2012) [24]; TL-DRAM (2013) [26]; RowClone (2013) [37]; Half-DRAM (2014) [39]; Row-Buffer Decoupling (2014) [33]; SARP (2014) [6]; AL-DRAM (2015) [25] |

Table 1. Landscape of DRAM-based memory

# Ramulator

- Provides out-of-the box support for many DRAM standards:
  - DDR3/4, LPDDR3/4, GDDR5, WIO1/2, HBM, plus new proposals (SALP, AL-DRAM, TLDRAM, RowClone, and SARP)
- ~2.5X faster than fastest open-source simulator
- Modular and extensible to different standards

| Simulator (clang -O3) | Cycles (10<sup>6</sup>): Random | Cycles (10<sup>6</sup>): Stream | Runtime (sec.): Random | Runtime (sec.): Stream | Req/sec (10<sup>3</sup>): Random | Req/sec (10<sup>3</sup>): Stream | Memory (MB) |
|-----------------------|---------------------------------|---------------------------------|------------------------|------------------------|----------------------------------|----------------------------------|-------------|
| Ramulator             | 652                             | 411                             | 752                    | 249                    | 133                              | 402                              | 2.1         |
| DRAMSim2              | 645                             | 413                             | 2,030                  | 876                    | 49                               | 114                              | 1.2         |
| USIMM                 | 661                             | 409                             | 1,880                  | 750                    | 53                               | 133                              | 4.5         |
| DrSim                 | 647                             | 406                             | 18,109                 | 12,984                 | 6                                | 8                                | 1.6         |
| NVMain                | 666                             | 413                             | 6,881                  | 5,023                  | 15                               | 20                               | 4,230.0     |
Table 3. Comparison of five simulators using two traces
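The "~2.5X faster" figure can be checked against the runtimes in Table 3. The short sketch below divides each simulator's runtime by Ramulator's for the same trace. The dictionary layout and the function name `ramulator_speedup` are illustrative choices, and the numbers are copied from the table, not re-measured.

```python
# Runtime in seconds, copied from Table 3 (Random and Stream traces).
RUNTIME_SEC = {
    "Ramulator": {"random": 752, "stream": 249},
    "DRAMSim2": {"random": 2030, "stream": 876},
    "USIMM": {"random": 1880, "stream": 750},
    "DrSim": {"random": 18109, "stream": 12984},
    "NVMain": {"random": 6881, "stream": 5023},
}


def ramulator_speedup(other: str, trace: str) -> float:
    """How many times faster Ramulator runs than `other` on `trace`."""
    return RUNTIME_SEC[other][trace] / RUNTIME_SEC["Ramulator"][trace]


if __name__ == "__main__":
    for name in RUNTIME_SEC:
        if name == "Ramulator":
            continue
        r = ramulator_speedup(name, "random")
        s = ramulator_speedup(name, "stream")
        print(f"{name:9s} random: {r:5.2f}x  stream: {s:5.2f}x")
```

USIMM is the fastest of the other four simulators on both traces, so the smallest ratio (Ramulator versus USIMM on the Random trace) is the one that corresponds to the headline number.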

# Case Study: Comparison of DRAM Standards

| Standard          | Rate (MT/s) | Timing (CL-RCD-RP) | Data-Bus (Width × Chan.) | Rank-per-Chan | BW (GB/s) |
|-------------------|-------------|--------------------|--------------------------|---------------|-----------|
| DDR3              | 1,600       | 11-11-11           | 64-bit × 1               | 1             | 11.9      |
| DDR4              | 2,400       | 16-16-16           | 64-bit × 1               | 1             | 17.9      |
| SALP<sup>†</sup>  | 1,600       | 11-11-11           | 64-bit × 1               | 1             | 11.9      |
| LPDDR3            | 1,600       | 12-15-15           | 64-bit × 1               | 1             | 11.9      |
| LPDDR4            | 2,400       | 22-22-22           | 32-bit × 2*              | 1             | 17.9      |
| GDDR5 [12]        | 6,000       | 18-18-18           | 64-bit × 1               | 1             | 44.7      |
| HBM               | 1,000       | 7-7-7              | 128-bit × 8*             | 1             | 119.2     |
| WIO               | 266         | 7-7-7              | 128-bit × 4*             | 1             | 15.9      |
| WIO2              | 1,066       | 9-10-10            | 128-bit × 8*             | 1             | 127.2     |
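The BW column can be reproduced from the other columns. The values appear consistent with peak bandwidth computed as transfer rate × bus width in bytes × number of channels, divided by 2<sup>30</sup>. That divisor is an inference from the numbers, not something the table states, and `peak_bandwidth` is a name chosen here for illustration.

```python
GIB = 2 ** 30  # assumed divisor: the table's "GB/s" values match binary gigabytes


def peak_bandwidth(rate_mts: float, bus_width_bits: int, channels: int) -> float:
    """Peak bandwidth from rate (MT/s), per-channel bus width (bits), and channel count."""
    bytes_per_sec = rate_mts * 1e6 * (bus_width_bits / 8) * channels
    return bytes_per_sec / GIB


if __name__ == "__main__":
    rows = [
        ("DDR3", 1600, 64, 1),
        ("DDR4", 2400, 64, 1),
        ("LPDDR4", 2400, 32, 2),
        ("GDDR5", 6000, 64, 1),
        ("HBM", 1000, 128, 8),
        ("WIO", 266, 128, 4),
    ]
    for name, rate, width, chans in rows:
        print(f"{name:7s} {peak_bandwidth(rate, width, chans):6.1f}")
```

The WIO2 entry (127.2) matches this formula only when the rate is taken as the unrounded 1066.67 MT/s, so it is left out of the rows above. The comparison also shows why the wide, many-channel 3D-stacked interfaces reach the highest peak bandwidth despite modest per-pin rates.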



Across 22 workloads, simple CPU model

Figure 2. Performance comparison of DRAM standards

# Ramulator Paper and Source Code

- Yoongu Kim, Weikun Yang, and Onur Mutlu,
   "Ramulator: A Fast and Extensible DRAM Simulator"
   IEEE Computer Architecture Letters (CAL), March 2015.
   [Source Code]
- Source code is released under the liberal MIT License
  - https://github.com/CMU-SAFARI/ramulator

# Ramulator: A Fast and Extensible DRAM Simulator

Yoongu Kim<sup>1</sup> Weikun Yang<sup>1,2</sup> Onur Mutlu<sup>1</sup>
<sup>1</sup>Carnegie Mellon University <sup>2</sup>Peking University

# Optional Assignment

- Review the Ramulator paper
  - Email me your review (<u>omutlu@gmail.com</u>)
- Download and run Ramulator
  - Compare DDR3, DDR4, SALP, HBM for the libquantum benchmark (provided in Ramulator repository)
  - Email me your report (<u>omutlu@gmail.com</u>)

This will help you get into memory systems research

# Some More Suggested Readings

# Some Key Readings on DRAM (I)

- DRAM Organization and Operation
  - Lee et al., "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.
     https://people.inf.ethz.ch/omutlu/pub/tldram\_hpca13.pdf
  - Kim et al., "A Case for Subarray-Level Parallelism (SALP) in DRAM," ISCA 2012.
     https://people.inf.ethz.ch/omutlu/pub/salp-dram\_isca12.pdf
  - Lee et al., "Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost," ACM TACO 2016.
     https://people.inf.ethz.ch/omutlu/pub/smla\_high-bandwidth-3d-stacked-memory\_taco16.pdf

# Some Key Readings on DRAM (II)

### DRAM Refresh

- Liu et al., "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012.
   https://people.inf.ethz.ch/omutlu/pub/raidr-dram-refresh\_isca12.pdf
- Chang et al., "Improving DRAM Performance by Parallelizing Refreshes with Accesses," HPCA 2014.
   https://people.inf.ethz.ch/omutlu/pub/dram-access-refresh-parallelization-hpca14.pdf
- Patel et al., "The Reach Profiler (REAPER): Enabling the Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions," ISCA 2017.
   https://people.inf.ethz.ch/omutlu/pub/reaper-dram-retention-profiling-lpddr4\_isca17.pdf

# Reading on Simulating Main Memory

- How to evaluate future main memory systems?
- An open-source simulator and its brief description
- Yoongu Kim, Weikun Yang, and Onur Mutlu,
   "Ramulator: A Fast and Extensible DRAM Simulator"
   IEEE Computer Architecture Letters (CAL), March 2015.
   [Source Code]

# Some Key Readings on Memory Control 1

- Mutlu+, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems," ISCA 2008.
   https://people.inf.ethz.ch/omutlu/pub/parbs\_isca08.pdf
- Kim et al., "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior," MICRO 2010.
   https://people.inf.ethz.ch/omutlu/pub/tcm\_micro10.pdf
- Subramanian et al., "BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling," TPDS 2016.
   https://people.inf.ethz.ch/omutlu/pub/bliss-memory-scheduler\_ieee-tpds16.pdf
- Usui et al., "DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators," TACO 2016.
   https://people.inf.ethz.ch/omutlu/pub/dash\_deadline-aware-heterogeneous-memory-scheduler\_taco16.pdf

# Some Key Readings on Memory Control 2

- Ipek+, "Self Optimizing Memory Controllers: A Reinforcement Learning Approach," ISCA 2008.
  - https://people.inf.ethz.ch/omutlu/pub/rlmc\_isca08.pdf
- Ebrahimi et al., "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems," ASPLOS 2010.
  - https://people.inf.ethz.ch/omutlu/pub/fst\_asplos10.pdf
- Subramanian et al., "The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory," MICRO 2015.
  - https://people.inf.ethz.ch/omutlu/pub/application-slowdown-model\_micro15.pdf
- Lee et al., "Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM," PACT 2015.
  - https://people.inf.ethz.ch/omutlu/pub/decoupled-dma\_pact15.pdf

# More Readings

- To come as we cover the future topics
- Search for "DRAM" or "Memory" in:
  - https://people.inf.ethz.ch/omutlu/projects.htm