Retrospective: RAIDR: Retention-Aware Intelligent DRAM Refresh

Onur Mutlu
ETH Zürich

Abstract—Dynamic Random Access Memory (DRAM) is the prevalent memory technology used to build main memory systems of almost all computers. A fundamental shortcoming of DRAM is the need to refresh memory cells to keep stored data intact. DRAM refresh consumes energy and degrades performance. It is also a technology scaling challenge as its negative effects become worse as DRAM cell size reduces and DRAM chip capacity increases.

Our ISCA 2012 paper, RAIDR [1], examines the DRAM refresh problem. The paper provides detailed experimental analyses and solutions, and makes some educated guesses about what the future may bring for the DRAM refresh problem (and, more generally, for DRAM technology scaling).

I. BACKGROUND, APPROACH & MINDSET

When we began focusing on the DRAM refresh (i.e., data retention) challenge in late 2010, our research group, SAFARI, had already been working on memory controllers and memory technology scaling issues, motivated by the many challenges that memory systems, and DRAM technology in particular [2], have been facing (as described in, e.g., [3–5]). Our intense work on memory systems started during my tenure at Microsoft Research from 2006 and continued at CMU from 2009. For example, we had developed better memory schedulers for multi-core processors (e.g., [6–10]), developed platforms to perform voltage and frequency scaling to save DRAM energy (e.g., [11]), and architected emerging memory technologies to replace or augment DRAM (e.g., [12–14]). We were quite excited about the prospect of much more capable memory controllers enabling better memory systems. As such, we were pursuing new memory-controller and system-level techniques to 1) overcome the challenging device- and circuit-level scaling issues of memory technologies and 2) better exploit underlying characteristics of memory technology, an approach we termed system-DRAM co-design [15]. RAIDR is a product of this approach. Our focus on data retention and other low-level issues in DRAM increased especially through discussions with the Samsung DRAM Design Team, who visited us in April 2011 and encouraged the development of our system-level solutions to DRAM issues, enabling strong support both technically and funding-wise. In fact, much of our ensuing research in DRAM was supported by generous gift funding from, and technical discussions with, Samsung, based on a proposal entitled "New ideas to enhance DRAM scaling: Scaling-aware controller design and co-design of DRAM and controllers" (Intel provided similar gift funding and technical discussions).

II. CONTRIBUTIONS AND IMPACT OF RAIDR

RAIDR is the first work to propose a low-cost memory controller technique that reduces refresh operations by exploiting variation in data retention times across DRAM rows. Its appeal comes from its simplicity and low cost, enabled by the careful use of Bloom filters [15]. Exploiting the DRAM data retention time distribution [16], RAIDR can eliminate a very large fraction (e.g., ~75% or more) of refresh operations with very small hardware cost at the memory controller.
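The idea of storing retention bins in Bloom filters can be illustrated with a minimal sketch. It is not the RAIDR hardware design: the class names, the three refresh intervals (64, 128, and 256 ms), the filter size, and the hash construction are all assumptions chosen for illustration. The property worth noticing is that a Bloom filter can return false positives but never false negatives, so a misclassified row is only ever refreshed more often than necessary, never less often.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter over integer row addresses (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, row):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{row}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, row):
        for p in self._positions(row):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, row):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(row))


class RetentionBins:
    """Hypothetical binning: weak rows live in filters, all others use the default."""

    def __init__(self, intervals_ms=(64, 128, 256), num_bits=8192, num_hashes=4):
        self.intervals_ms = sorted(intervals_ms)
        self.default_ms = self.intervals_ms[-1]
        # One filter per interval except the longest, which is the default.
        self.filters = {
            ms: BloomFilter(num_bits, num_hashes) for ms in self.intervals_ms[:-1]
        }

    def record_profiled_row(self, row, retention_ms):
        if retention_ms < self.intervals_ms[0]:
            raise ValueError("row is not retained even at the shortest interval")
        if retention_ms >= self.default_ms:
            return  # strong row: the default interval is safe, nothing is stored
        # Use the longest interval that does not exceed the row's retention time.
        safe_ms = max(ms for ms in self.filters if ms <= retention_ms)
        self.filters[safe_ms].add(row)

    def refresh_interval_ms(self, row):
        # Check the shortest interval first so a false positive stays conservative.
        for ms in self.intervals_ms[:-1]:
            if row in self.filters[ms]:
                return ms
        return self.default_ms


bins = RetentionBins()
profile = {17: 70, 4242: 150, 9001: 900}  # hypothetical row -> retention time (ms)
for row, retention_ms in profile.items():
    bins.record_profiled_row(row, retention_ms)
```

In this sketch the storage cost depends on the filter sizes rather than on the number of rows in the DRAM, which is the kind of property that makes the approach attractive for a memory controller.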

Apart from the new technique it introduced, we believe the RAIDR paper made two other major contributions that have enabled a large number of subsequent works and new ideas. First, it provided an empirical scaling analysis that clearly demonstrated the importance of the DRAM refresh problem in modern systems: if nothing is done about it, DRAM refresh would waste almost half of the throughput and half of the energy of a high-capacity 64-Gb DRAM chip! This analytical prediction encouraged more work on the topic. Second, it demonstrated a methodical way of exploiting cell-level heterogeneous data retention times at the system (e.g., memory controller) level: if data retention times of DRAM rows are accurately known, the system can use them to optimize DRAM refresh and get rid of most refresh operations. This demonstration enabled other works to develop 1) methods for accurately determining DRAM data retention times and 2) other system-level approaches to optimize DRAM behavior using data retention time information.
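A short back-of-the-envelope calculation shows why knowing per-row retention times can remove most refresh operations. The row counts and the three refresh intervals below are hypothetical values chosen for illustration, not measurements from the paper. The baseline refreshes every row every 64 ms, while the binned scheme refreshes each row at the longest interval its retention time allows.

```python
def refreshes_per_second(rows_per_interval):
    """Total row refreshes per second, given {interval_ms: number_of_rows}."""
    return sum(rows * 1000.0 / ms for ms, rows in rows_per_interval.items())


TOTAL_ROWS = 65536  # hypothetical number of rows

# Baseline: every row is refreshed at the worst-case 64 ms interval.
baseline = refreshes_per_second({64: TOTAL_ROWS})

# Hypothetical profile: a handful of weak rows, everything else at 256 ms.
binned_rows = {64: 30, 128: 1000, 256: TOTAL_ROWS - 1030}
binned = refreshes_per_second(binned_rows)

reduction = 1.0 - binned / baseline
print(f"refresh operations eliminated: {reduction:.1%}")
```

With a 256 ms default interval, the reduction in this sketch cannot exceed 1 - 64/256 = 75%. Under these assumptions the few weak rows cost very little, so the result lands close to that ceiling, and a longer default interval would raise the ceiling further.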

III. BUILDING ON RAIDR AND MAKING IT WORK

We believe RAIDR enabled a refreshing approach to DRAM refresh. Its largest contribution could be the works it has inspired, which rigorously examined the questions of 1) how to perform accurate DRAM data retention time profiling, 2) how to overcome potential hurdles that stand in the way of obtaining accurate minimum data retention times, and 3) how to reliably get rid of unnecessary refresh operations.

We wanted to make RAIDR work in a real system setting. To this end, in collaboration with Intel, we developed a flexible FPGA-based DRAM testing infrastructure [17] that enabled us to rigorously test data retention times of cells in real DDR3 DRAM chips. Using this infrastructure, later open sourced as SoftMC [18,19] and DRAM Bender [20,21], we experimentally examined practical issues that affect the accuracy (and performance) of DRAM data retention time profiling. We analyzed two major issues that make such profiling very challenging: 1) data pattern dependence (DPD) of retention times [17,22] and 2) the variable retention time (VRT) phenomenon [17,23,24]. Our follow-up work, which appeared at ISCA 2013 [17], provides a detailed experimental analysis of these challenges in cutting-edge DRAM chips, demonstrating that ideas like RAIDR that depend on accurate identification of retention times are not easy to exploit in practice. Later works (e.g., [25,32]) developed new methods for making RAIDR-like techniques more practical, especially by tackling the DPD and VRT problems and by enhancing retention time profiling methods to work in the presence of DPD and VRT, usually by exploiting ECC techniques that have since become mainstream in DRAM chips (see [31,33]) to tolerate VRT [34].

The development of our flexible FPGA-based DRAM testing infrastructure also enabled experimental DRAM research in directions that are completely different from retention time profiling and refresh. These include studies that provided valuable experimental data on various DRAM characteristics, including RowHammer [20,25,44], latency [45,48], voltage-latency-reliability relationships [49], and power consumption and modeling [50]. Using this infrastructure, later research also demonstrated the ability of real off-the-shelf DRAM chips to perform data copy/initialization and bulk bitwise operations [51,55], implement physical unclonable functions [56], and generate true random numbers [57,58]. We believe the investment we made in trying to make RAIDR work using a real FPGA-based infrastructure helped us and the broader research community uncover many interesting characteristics of DRAM chips and propose new ideas to make DRAM-based systems more secure, reliable, efficient, and high-performance.

Other later works provided refined models of DRAM refresh's impact on system performance (e.g., [59,60]) and developed new methods to reduce DRAM refresh's negative impact on performance and energy (e.g., [41][45-67]). Our HPCA 2014 paper [59] developed a more refined projection of the effect of DRAM refresh as technology scales. AVATAR [15] in DSN 2015 [56] and REAPER in ISCA 2017 [57] enabled more practical ways of exploiting heterogeneous retention times in the presence of VRT. Our recent work [66] shows that, with a more flexible DRAM interface that gives some autonomy to DRAM chips, RAIDR can be implemented more efficiently inside the DRAM chip.

IV. SUMMARY AND FUTURE OUTLOOK

RAIDR is a nice example of how enthusiastic support from industry can foster new ideas that open up many new analyses and further ideas. We were inspired by our deep technical discussions, especially with Samsung and Intel, along with prior works that described DRAM technology scaling challenges (e.g., [3]) and that developed promising solutions (e.g., [68][69]). Engineers from Samsung and Intel later wrote an insightful paper [24] on DRAM scaling challenges, which described refresh as a key problem and advocated a controller-DRAM co-design approach, as we had been advocating [1][2]. RAIDR is also a nice example of how teaching and research smoothly feed each other: much of the research was done as part of a group project in the Parallel Computer Architecture class I taught at CMU in Fall 2011.

Looking forward, DRAM technology scaling is becoming more difficult and data retention will continue to be an important issue [34][70]. The negative effects of DRAM refresh will be (and are being) exacerbated by other technology scaling issues like RowHammer [41]. We believe there are many more new ideas and techniques to develop to minimize the impact of refresh on computing systems.

REFERENCES