As billions of transistors and complex nano-systems continue to be integrated onto a single chip and the era of big-data applications approaches, the demand for memory and mass data storage capacity grows sharply alongside exponentially increasing data processing capability. Large on-chip memories are especially vulnerable to single-bit and multi-bit soft errors caused by energetic particles, such as high-energy neutrons and alpha particles, as the technology continues to shrink [80].
To cover those errors, error correction code (ECC) has proven to be a “must-have” technology in modern memory subsystem designs. Traditional memory technologies, such as SRAM and DRAM, can usually be equipped with common ECCs, such as single-error-correction double-error-detection (SEC-DED) codes and BCH codes, to achieve a better tradeoff among reliability, performance and energy, thanks to the inherent reliability and fast programming of these storage devices. In contrast, the extremely strong but slow low-density parity-check (LDPC) codes are widely used in high-density NAND flash memories, where device reliability is severely degraded but the programming speed requirement is low [81]. In recent years, concerns about the continued scaling of these technologies have motivated tremendous investment in emerging memory technologies (EMTs), including Spin-Transfer Torque Random Access Memory (STT-RAM), Resistive Random Access Memory (RRAM) and Phase Change Memory (PCM). However, taking STT-RAM as an example, before one can benefit from its attractive features (fast access speed, low leakage power and non-volatility), its reliability issues become increasingly prominent. They stem from storage stability problems unique to the device, caused by aggravated process variations, stochastic device behaviors and environmental fluctuations, and from the ever-increasing reliability requirements of massive data systems. As a result, device-level or cell-level reliability characterization becomes complicated and extremely expensive. Moreover, popular ECCs such as SEC-DED, BCH or LDPC may be insufficient or unsuitable for STT-RAM based memory systems, and stronger ECCs or other solutions with minimal performance and hardware overhead are becoming essential for delay-sensitive on-chip and off-chip applications.
This dissertation has examined several facets of STT-RAM reliability in memory system design, including a statistical computer-aided design (CAD) tool, a novel ECC design for the asymmetric errors of SLC STT-RAM, and a holistic circuit-architecture solution set for advanced MLC STT-RAM.
5.1.1 Conclusion of Chapter 2
Process variations and thermal fluctuations significantly affect the write reliability and write energy of STT-RAM. Traditionally, modeling the impact of these variations on STT-RAM designs requires expensive Monte-Carlo runs with hybrid magnetic-CMOS simulation steps. Moreover, such analyses are usually performed on STT-RAM cells with fixed variation configurations, which significantly reduces their scalability and portability. Thus, in Chapter 2, we proposed PS3-RAM, a fast, portable and scalable statistical STT-RAM reliability/energy analysis method. PS3-RAM uses sensitivity analysis to capture the statistical characteristics of MTJ switching, and a dual-exponential model to efficiently and accurately recover the MTJ switching current samples for statistical STT-RAM thermal analysis. Compared with SPICE-based Monte-Carlo simulations, PS3-RAM reduces run time by multiple orders of magnitude with marginal accuracy degradation under any variation configuration.
5.1.2 Conclusion of Chapter 3
In Chapter 3, we proposed the first analytical asymmetric write channel (AWC) model to explain the unique operation errors of the STT-RAM write mechanism: its write failure rate is extremely asymmetric (the error rate of writing ‘1’ can be several orders of magnitude higher than that of writing ‘0’). By carefully investigating how common ECC solutions tolerate such errors, we made an observation previously neglected in memory systems: generic ECCs such as SEC-DED codes are designed under the assumption that 0 → 1 and 1 → 0 bit flips occur at the same rate, so they cannot efficiently handle write errors that are highly asymmetric across the two bit-flipping directions. To address this challenge, we introduced a new ECC design concept: a content-dependent view instead of the worst-corner design view. The original data is intentionally partitioned into two corners according to its degree of reliability, and is then processed by one of the proposed low-cost circuit-level solutions, the typical-corner-ECC (TCE) scheme or the worst-corner-ECC (WCE) scheme, respectively. By balancing and enhancing the reliability of STT-RAM with asymmetric write errors, the resulting content-dependent ECC (CD-ECC) technique significantly improves the reliability of the STT-RAM based cache system with marginal performance degradation.
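The content dependence summarized above can be illustrated with a minimal sketch. It assumes that bit failures are independent, that a word is protected by a code correcting up to t errors, and that the two write error rates take the purely hypothetical values shown; none of these numbers or function names come from Chapter 3. The sketch only shows that, under an asymmetric channel, the probability of an uncorrectable write depends on how many ‘1’ bits the word contains.

```python
from math import comb


def binom_pmf(n, p, k):
    """Probability of exactly k failures among n independent bits,
    each failing with probability p."""
    if k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def uncorrectable_prob(n_ones, n_zeros, p_write1, p_write0, t):
    """Probability that more than t bits fail when writing a word with
    n_ones '1' bits and n_zeros '0' bits, i.e. that a t-error-correcting
    code cannot recover the word (independent bit failures assumed)."""
    p_correctable = 0.0
    for errors in range(t + 1):
        for e1 in range(errors + 1):
            e0 = errors - e1
            p_correctable += (binom_pmf(n_ones, p_write1, e1)
                              * binom_pmf(n_zeros, p_write0, e0))
    return 1.0 - p_correctable


# Hypothetical error rates, chosen only to make the asymmetry visible.
P_WRITE1, P_WRITE0 = 1e-3, 1e-6

if __name__ == "__main__":
    for ones in (0, 32, 64):
        p_fail = uncorrectable_prob(ones, 64 - ones, P_WRITE1, P_WRITE0, t=1)
        print(f"{ones:2d} ones in a 64-bit word: P(uncorrectable) = {p_fail:.3e}")
```

Under these assumptions, a word dominated by ‘1’ bits is far more likely to exceed the correction capability than a word dominated by ‘0’ bits, which is the motivation for treating the two cases with different ECC strengths rather than provisioning every word for the worst corner.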
5.1.3 Conclusion of Chapter 4
Multi-level cell (MLC) technology doubles the storage density by integrating two MTJs with different dimensions in one memory cell to represent multiple logic bits. However, the MLC STT-RAM design further degrades reliability and write latency with respect to the single-level cell (SLC) version. In Chapter 4, we demonstrated that applying extremely strong ECCs to MLC STT-RAM based memory systems is infeasible for high-reliability, high-performance applications because of the associated decoding latency and storage overhead. Thus, we proposed a cross-layer solution, named State-Restrict MLC STT-RAM (SR-MLC), to address reliability, performance and information density simultaneously. Three techniques, namely state restriction, error pattern removal and ternary coding, are proposed at the circuit level to reduce the read and write errors of MLC STT-RAM cells. A state pre-recovery technique is further developed at the architecture level to improve the access performance of SR-MLC STT-RAM by eliminating unnecessary two-step write operations. Simulation results show that our SR-MLC design improves the write/read error rate by 10× to 10000× over traditional MLC designs, while boosting system performance by 6.2% on average over SLC designs. In summary, our solution delivers information density similar to that of a traditional MLC design, reliability and programming speed comparable to those of an SLC design, and significantly improved IPC performance.
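As a rough illustration of how ternary coding can sit on top of state restriction, the sketch below assumes that each state-restricted cell exposes three usable states and that three data bits are packed into two such cells. Both points are assumptions made for illustration, and the base-3 mapping and function names are hypothetical; the actual encoding used in Chapter 4 may differ. Two three-state cells offer nine combinations, which is enough to represent the eight values of a 3-bit group, leaving one combination unused.

```python
def encode_3bits(value):
    """Map a 3-bit value (0..7) to a pair of ternary cell states (hi, lo).

    Hypothetical base-3 mapping; the pair (2, 2) is never produced."""
    if not 0 <= value <= 7:
        raise ValueError("value must fit in 3 bits")
    return divmod(value, 3)


def decode_2trits(hi, lo):
    """Recover the 3-bit value from a pair of ternary cell states."""
    if hi not in (0, 1, 2) or lo not in (0, 1, 2):
        raise ValueError("each cell state must be 0, 1 or 2")
    value = 3 * hi + lo
    if value > 7:
        raise ValueError("cell pair (2, 2) is unused in this mapping")
    return value


if __name__ == "__main__":
    for v in range(8):
        hi, lo = encode_3bits(v)
        print(f"{v:03b} -> cells ({hi}, {lo}) -> {decode_2trits(hi, lo):03b}")
```

In a mapping of this kind, the unused cell pair can serve as a simple consistency check, since reading it back indicates that at least one of the two cells was disturbed.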