

Using Single Error Correction Codes to Protect Against Isolated Defects and Soft Errors
Costas Argyrides, Member, IEEE, Pedro Reviriego, Member, IEEE, and Juan Antonio Maestro, Member, IEEE
Abstract: Different techniques have been used to deal with defects and soft errors. Repair techniques are commonly used for defects, while error correction codes are used for soft errors. Recently, some proposals have been made to use error correction codes to deal with defects as well. In this paper, we analyze the impact on reliability of approaches that use error correction codes to handle defects in addition to soft errors, at the cost of a reduced ability to correct soft errors. The results show that low defect rates or small memory sizes are required for the impact on reliability to be low. Additionally, a technique that can improve reliability is proposed and analyzed. The results show that the new approach can achieve a reliability, in terms of time to failure, similar to that of a defect-free memory, at the cost of a more complex decoding algorithm.

Index Terms: Defects, error correcting codes, fault tolerance, soft errors.

Manuscript received February 19, 2012; revised August 24, 2012; accepted August 24, 2012. Date of publication January 29, 2013; date of current version February 27, 2013. Associate Editor: J.-C. Lu. C. Argyrides is with the Research Division, EVOLVI.T., 3010 Limassol, Cyprus (e-mail: costas@computer.org). P. Reviriego and J. A. Maestro are with the Departamento de Ingenieria Informatica, Universidad Antonio de Nebrija, 28040 Madrid, Spain (e-mail: previrie@nebrija.es; jmaestro@nebrija.es). Digital Object Identifier 10.1109/TR.2013.2240901

ACRONYMS

ECC      Error correcting codes
1-D      One-dimensional redundancy
2-D      Two-dimensional redundancy
SEU      Single-event upsets
SEC      Single error correction
MCU      Multiple cell upsets
SEC-DED  Single error correction-double error detection
METF     Mean number of events to failure
MTTF     Mean time to failure

I. INTRODUCTION

AS TECHNOLOGY scales, reliability becomes a challenge for CMOS circuits. Reliability issues appear, for example, during device manufacturing as defects that can compromise production yield. Once the devices are in the field, other reliability issues appear in the form of soft errors or age-induced permanent failures. Memory devices are among those affected by these issues due to their high level of integration. Current techniques to address these reliability issues in memories include the use of redundant elements to repair manufacturing defects, and the use of Error Correcting Codes (ECC) to deal with soft errors once the device is in operation. Different techniques are thus used to deal with defects versus soft errors. ECC can also be used to correct errors caused by defects, but then their ability to correct soft errors may be compromised, leading to reduced reliability. However, to the best of our knowledge, there is no previous work on how the use of ECC to deal with defects affects the reliability of memory in the field.

In this paper, an effective technique to use ECC to deal with isolated defects and soft errors on memory chips is presented. The technique can cope with stuck-at defects, soft errors, or both at the same time. The analysis shows that the resulting reliability is approximately the same as when the code is used to correct soft errors only.

II. RELATED WORK

The technology scaling process provides high-density, low-cost, high-performance integrated circuits. These circuits are characterized by high operating frequencies, low voltage levels, and small noise margins, with an increased defect rate [1]. To cope with defects in memory chips, many different techniques have been proposed, all of them based on the use of redundant elements to replace defective ones. Those techniques range from those applied during the manufacturing process, in the test phase, to the use of built-in circuits able to repair the memory chips even during normal operation in the field, with different tradeoffs in terms of cost and speed.

The use of redundant rows and columns has been widely used in memory design to cope with this problem. One-dimensional (1-D) redundancy is the simplest variation, in which only redundant rows (or columns) are included in the memory array and used to replace the defective rows (or columns) detected during test. The main advantage of this approach is that its implementation does not require any complex allocation algorithms. Unfortunately, its repair efficiency can be low because a defective column (row) containing multiple defective cells cannot be replaced by a single redundant row (column). Examples of such techniques are presented in [2] and [3]. In [4] and [5], the authors proposed a two-dimensional (2-D) redundancy approach that improves the efficiency of the 1-D approach. This approach adds both redundant rows and columns to the memory array to provide more efficient repair when multiple defective cells exist in the same row or column of the array.



When multiple faulty cells are detected, the choice between the use of a redundant row or a redundant column to replace them is made based on the maximum repair capability of each alternative. The main drawback of this approach is that the optimal redundancy allocation problem becomes NP-complete, as discussed in [6] and [7]. Although many heuristic algorithms have been proposed to solve this problem, it is still difficult to develop built-in repair implementations using them. For both redundancy approaches, when the number of defective cells in the array exceeds the repair capability of the redundant elements, the last alternative before discarding the defective chip is to try to use it as a downgraded version of the memory. For example, when all remaining defective cells are located in one half of the array, the other half can still be used as a memory with reduced capacity. This reduction is done by permanently setting the most significant bit of the addresses either to 0 or 1, depending on which part of the memory is to be used. However, in most cases, the remaining defective cells are evenly distributed across the whole array rather than clustered in one half of it, making this technique useless.

Another major issue in designing memory chips in submicron technologies is the susceptibility to single-event upsets (SEUs) produced by atmospheric neutrons and alpha particles. When these particles hit the silicon bulk, they create minority carriers which, if collected by the source-drain diffusions, can change the voltage level of the node [1]. This change in the voltage level changes the state of the transistor, which results in a change of the value stored in a memory cell. For example, if a memory cell holds a 1, an SEU can force it to 0. Radiation-induced soft errors are a major reliability issue, and are a critical factor for memories that operate in environments with many sources of error [8].

Traditionally, memories have been protected with Single Error Correction (SEC) codes [9]-[12] that can correct up to one error per memory word. Per-word parity bits are also commonly used when the objective is only to detect errors. Unfortunately, these techniques fail in the presence of multiple cell upsets (MCUs). The most common approach to deal with multiple errors has been the use of interleaving in the physical arrangement of the memory cells, so that cells that belong to the same logical word are separated. As the errors in an MCU are physically close, as discussed in [15], they cause single errors in different words that can be corrected by Single Error Correction-Double Error Detection (SEC-DED) codes. However, interleaving cannot be used, for example, in small memories or register files; and in other cases, its use may have an impact on floor-planning, access time, and power consumption, as discussed in [16], [17]. Another technique to cope with SEUs is scrubbing [18]. The scrubbing process periodically reads the memory words and corrects the errors, so that a failure can occur only if two errors arrive in the same word within one scrubbing period. This approach is usually a valid solution to prevent error accumulation.

There have been some approaches that propose to use redundancy (as explained before) together with Error Correction Codes (ECC) to improve the protection effectiveness. In [13], a mechanism is proposed to identify whether detected errors are permanent or temporary (soft). In the permanent case, redundancy would be used to solve the problem; in the temporary case, ECC would correct the soft errors.
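To make the behaviour of the SEC-DED codes referred to in this section concrete, the following is a minimal Python sketch of an extended Hamming code (Hamming parity bits plus one overall parity bit), which is one simple way to obtain SEC-DED. The (13,8) word size and the helper names `hamming_encode`/`secded_decode` are our illustrative choices, not a code taken from the cited works.

```python
N = 12       # Hamming positions 1..12 (4 parity bits at powers of two, 8 data bits)
WIDTH = 13   # 12 Hamming bits plus one overall parity bit

def hamming_encode(data_bits):
    """Encode 8 data bits into a 13-bit extended Hamming (SEC-DED) codeword."""
    code = [0] * (N + 1)                     # 1-indexed; code[0] unused
    data = iter(data_bits)
    for pos in range(1, N + 1):
        if pos & (pos - 1):                  # not a power of two: data position
            code[pos] = next(data)
    for p in (1, 2, 4, 8):                   # Hamming parity bits
        for pos in range(1, N + 1):
            if pos != p and (pos & p):
                code[p] ^= code[pos]
    overall = 0
    for pos in range(1, N + 1):              # overall parity over the 12 Hamming bits
        overall ^= code[pos]
    return code[1:] + [overall]              # 13-bit word, overall parity last

def secded_decode(word):
    """Return (status, word); status is 'no_error', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, N + 1):
        if word[pos - 1]:
            syndrome ^= pos                  # XOR of positions holding a 1
    parity = 0
    for bit in word:
        parity ^= bit                        # parity of all 13 bits (0 for an even number of errors)
    if syndrome == 0 and parity == 0:
        return "no_error", list(word)
    if parity == 1:                          # odd number of errors: attempt single-error correction
        fixed = list(word)
        if syndrome == 0:
            fixed[N] ^= 1                    # the error is in the overall parity bit itself
            return "corrected", fixed
        if syndrome <= N:
            fixed[syndrome - 1] ^= 1         # flip the bit the syndrome points to
            return "corrected", fixed
        return "uncorrectable", list(word)   # syndrome points outside the word: detected
    return "uncorrectable", list(word)       # even, non-zero syndrome: double error detected
```

With these helpers, a one-bit flip in a codeword is returned as "corrected" with the flip undone, while any two-bit flip is reported as "uncorrectable". The same helpers are reused in the sketches that follow.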

In [14], redundancy is used when the number of defects is large enough to utilize a whole spare row or column; for isolated errors, this approach produces an excessive waste of redundant bits, and in these cases ECC is proposed as a better option. However, using ECC to handle a permanent defect would leave that particular word unprotected if a soft error affects it.

In this paper, we propose to use the standard memory protection approach of SEC-DED plus interleaving to deal not only with soft errors but also with isolated stuck-at defects. A technique is proposed to locate the defects such that they do not compromise the ability of SEC-DED codes to correct single soft errors, even when a word has a stuck-at defect. Stuck-at defects are faults in which memory cells (as well as lines or transistors) permanently hold (are stuck at) the same value, regardless of what is supposed to be stored. This approach enables us to achieve a reliability similar to that of a defect-free memory.

III. PROPOSED TECHNIQUE

The main problem when using SEC-DED correction to deal with manufacturing defects is that words in which a cell has a defect are left unprotected, as a single soft error will cause a failure. This outcome can significantly reduce memory reliability. However, for defects that manifest as isolated stuck-at failures, such that a defective cell always returns the same value when read, an alternative correction scheme can be used to improve reliability. This alternative provides several benefits: it detects defects while leaving the SEC-DED code free to handle soft errors; it detects defects that appear in the field (after the manufacturing process, and therefore not caught during test); and it addresses isolated defects, as well as clustered defects that become isolated defects at the logical level when interleaving is used.

The proposed technique is as follows (a sketch is given in the listing below). When a word is read and an error is detected, if it is classified as a single error, it is corrected as in a normal SEC-DED memory. However, if an uncorrectable error is detected, a procedure is triggered to determine whether the word contains defects. This procedure stores the contents of the word in a register, writes all-zeros into the word, and reads it back to check that there are no errors. The same operation is then repeated with the all-ones pattern. If there is a stuck-at defect in that word, the procedure will detect and locate it. If there is no defect, a failure is flagged, as the error is in fact uncorrectable. However, if there is a defect, the corresponding bit in the register is inverted, and the modified word is decoded again. This technique can effectively correct words that contain a soft error, a stuck-at defect, or both simultaneously.

An example is illustrated in Fig. 1. In this case, two bits are affected: one by a soft error, and another by a defect. When the word is read, two errors are detected, so the word is uncorrectable. Following the proposed technique, the defect is detected and the defective bit is inverted. The word is then correctly decoded, as only a single error (the soft error) remains. Therefore, when the proposed technique is used, single bit errors will not cause failures even if they affect a word that contains a stuck-at defect.
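The listing below is a minimal sketch of the defect-aware read procedure described above, reusing the `hamming_encode`/`secded_decode` helpers from the previous listing. The `FaultyMemory` model, the per-word stuck-at defect map, and the explicit restore of the word after probing are our own illustrative assumptions about details the text leaves open.

```python
class FaultyMemory:
    """Toy memory: stored bits plus a per-word map {bit index: stuck value} for defects."""
    def __init__(self, n_words, width=13):   # 13 matches the toy SEC-DED word above
        self.width = width
        self.words = [[0] * width for _ in range(n_words)]
        self.defects = [dict() for _ in range(n_words)]

    def write(self, addr, bits):
        self.words[addr] = list(bits)

    def read(self, addr):
        bits = list(self.words[addr])
        for pos, stuck in self.defects[addr].items():
            bits[pos] = stuck                 # a defective cell always reads its stuck value
        return bits


def read_with_defect_recovery(mem, addr):
    """Read a word; on an uncorrectable error, probe for stuck-at defects and retry."""
    word = mem.read(addr)
    status, data = secded_decode(word)
    if status != "uncorrectable":
        return status, data                   # no error, or a normal single-error correction
    saved = list(word)                        # keep the word in a register
    mem.write(addr, [0] * mem.width)          # write all-zeros and read back
    stuck_high = {i for i, b in enumerate(mem.read(addr)) if b == 1}
    mem.write(addr, [1] * mem.width)          # write all-ones and read back
    stuck_low = {i for i, b in enumerate(mem.read(addr)) if b == 0}
    mem.write(addr, saved)                    # restore the original content
    defective = stuck_high | stuck_low
    if not defective:
        return "failure", word                # genuinely uncorrectable (e.g., two soft errors)
    for i in defective:
        saved[i] ^= 1                         # invert the bit(s) read from defective cells
    status, data = secded_decode(saved)
    return ("failure", word) if status == "uncorrectable" else (status, data)


# Example: one stuck-at defect plus one soft error in the same word (the Fig. 1 scenario).
mem = FaultyMemory(n_words=4)
codeword = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])
mem.write(0, codeword)
mem.defects[0][5] = codeword[5] ^ 1           # defect whose stuck value differs from the stored bit
mem.words[0][9] ^= 1                          # soft error flips another stored bit
status, recovered = read_with_defect_recovery(mem, 0)
print(status, recovered == codeword)          # expected output: corrected True
```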


Fig. 1. Example of a correctable error.

Fig. 2. Example of a situation in which the proposed method can provoke a miscorrection or a failure.

Fig. 3. Example of a situation in which the proposed method can provoke a miscorrection or a failure.

This approach would greatly increase the reliability when ECC is used to correct defects. In the proposed approach, failures occur when two soft errors affect a word that contains a defect. This case is illustrated in Fig. 2, where there are three errors: a defect and two soft errors. When trying to decode this word, there are two possible outcomes: 1) an uncorrectable error is flagged, or 2) a correctable error is detected. The second outcome is possible because SEC-DED codes have a Hamming distance of four, and therefore triple errors can be interpreted as single errors and miscorrected. In case 1), the procedure will correct the defect and end with a double, and therefore uncorrectable, error, causing a failure. In case 2), a miscorrection will be performed, causing a failure with silent data corruption. In both cases there is a failure, but in the second case the failure is not detected, which makes it more dangerous.

There is another situation for a word that suffers two soft errors and a defect, illustrated in Fig. 3. In this case, the defect does not cause an error, as the original data matches the stuck-at value. Therefore, when reading the word, an uncorrectable error is detected; the proposed technique then transforms the double error into a triple one, which in turn can provoke a miscorrection. This is a drawback of the proposed technique, whose impact is analyzed in the next section.

In Fig. 4, an exhaustive analysis of the different error-defect combinations is presented, showing how the algorithm would be applied and its outcome (only up to two soft errors are considered). The technique described in Fig. 4 may be implemented not only in memories but also in existing devices that implement SEC-DED. In the case of memories, it requires a register to store the word while the defect detection procedure runs, and some control logic to implement the correction algorithm. The cost of that logic should be negligible compared to the rest of the memory. For the second case, there is no additional cost, as the proposed technique can be implemented in the system processor. However, for the proposed technique to work, it needs access to all bits in each word, including the SEC-DED check bits. In terms of speed, the proposed technique should have a negligible impact on average access time, as it only slows down accesses when there are multiple errors in a word, a situation that should occur in a very small percentage of the accesses. In those cases, two additional read operations and two additional write operations are required to detect the defects.

IV. ANALYSIS

In this section, we study the reliability of a memory in which SEC-DED is used to deal with soft errors and isolated stuck-at failures. For the analysis, soft errors are assumed to arrive following a Poisson process, and they are uniformly distributed among all memory cells, as in previous studies [10], [11]. Finally, a word will contain a defect with probability F (the per-word defect rate).

The analysis starts by considering a memory in which the proposed approach is not used and words are decoded normally: if a single error is detected, it is corrected; otherwise a failure is flagged. In this case, if we consider only failures caused by an error falling on a word that contains a defect, the mean number of events to failure (METF) is given by (1), and the METF for soft errors only is given by (2) (see [10]), where M is the number of memory words. Failures caused by soft errors only will be dominant when (3) holds. If we define the METF ratio as in (4), then (3) holds when (5) is satisfied.
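As a rough sketch of the quantities referred to in (1)-(5), the following assumes METF for defect-related failures of about 1/F (an event causes this type of failure only when it falls on a defective word) and METF for soft errors of about sqrt(pi*M/2), the birthday-problem estimate for a SEC-protected memory without scrubbing in the spirit of [10]. The ratio below and its interpretation are our assumptions, consistent with the surrounding discussion but not necessarily the paper's exact expressions.

```python
from math import pi, sqrt

def metf_defects(F):
    """Assumed form of (1): a soft error causes a defect-related failure only if it
    falls on a defective word, which happens with probability F per event."""
    return 1.0 / F

def metf_soft(M):
    """Assumed form of (2): birthday-problem estimate of the number of uniformly
    distributed soft errors needed before two of them hit the same word."""
    return sqrt(pi * M / 2.0)

def metf_ratio(F, M):
    """Assumed form of (4): ratio of the soft-error-only METF to the defect-driven
    METF. The assumed condition (3)/(5), soft errors dominate, reads metf_ratio << 1."""
    return metf_soft(M) / metf_defects(F)     # equals F * sqrt(pi * M / 2)

# Illustration only: a 32 M-word memory with an assumed per-word defect rate of 1e-6.
M = 32 * 2**20
F = 1e-6
print(metf_soft(M), metf_defects(F), metf_ratio(F, M))
```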


Fig. 5. Value of the METF ratio in (5) for different values of F and M.

Fig. 4. Procedures applied when a word with defects, errors, or both is read.
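The miscorrection of triple errors discussed in connection with Figs. 2 and 3, and quantified for several codes in Table I, can be reproduced with the toy extended Hamming code sketched earlier. The enumeration below is purely illustrative (a 13-bit toy code rather than the codes evaluated in the paper): it counts how many 3-bit error patterns are silently miscorrected instead of being detected.

```python
from itertools import combinations

codeword = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])
miscorrected = detected = 0
for positions in combinations(range(len(codeword)), 3):
    corrupted = list(codeword)
    for i in positions:
        corrupted[i] ^= 1                     # inject a 3-bit error pattern
    status, out = secded_decode(corrupted)
    if status == "uncorrectable":
        detected += 1                         # the triple error is at least detected
    else:
        miscorrected += 1                     # decoded as a "single" error: silent corruption
        assert out != codeword                # the output is never the original word
total = miscorrected + detected
print(f"{miscorrected}/{total} triple errors miscorrected "
      f"({100.0 * miscorrected / total:.1f}%)")
```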

Condition (5) implies that, as technology scales and larger memory sizes are used, smaller defect rates are needed to ensure that defects do not affect memory reliability. Therefore, the use of standard SEC-DED to deal with isolated stuck-at failures will not be effective unless the defect rate is very small. The impact of defects becomes even more prominent if scrubbing is used, as the METF for soft errors increases substantially while the METF for defects remains the same, further restricting the values of F and M for which (3) is valid. As an example, Fig. 5 illustrates the defect rates (F) and memory sizes (M) for which condition (5) is valid; as a reference, a value of 0.01 is shown as a straight line. For a memory with 32 M-words, a very small defect rate is needed to ensure that defects do not affect memory reliability. The plots show that, as memory sizes increase, lower defect rates are required to ensure that defects do not affect reliability.

When the proposed technique is used, failures occur only when at least two soft errors affect a given word. Therefore, the time to failure is the same as for a defect-free memory protected with SEC-DED. The same result applies when scrubbing is used, as the same number of soft errors is required to cause a failure in a word with or without defects.
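Continuing the hedged reconstruction sketched after the analysis equations (the same assumed forms, not the paper's verbatim expressions), the largest per-word defect rate that keeps the assumed METF ratio below a chosen threshold can be computed directly. The 0.01 reference value and the 32 M-word size come from the discussion above; the specific rates printed here only illustrate the assumed condition.

```python
from math import pi, sqrt

def max_defect_rate(M, threshold=0.01):
    """Largest F such that the assumed METF ratio F * sqrt(pi * M / 2) stays below
    the given threshold (i.e., soft-error failures remain dominant)."""
    return threshold / sqrt(pi * M / 2.0)

for m_words in (2**20, 8 * 2**20, 32 * 2**20, 128 * 2**20):
    print(f"M = {m_words:>10d} words -> F below about {max_defect_rate(m_words):.1e}")
```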


TABLE I
PERCENTAGE OF 3-ERROR MISCORRECTION FOR DIFFERENT CODING TECHNIQUES
(OWC: Odd-Weight Column codes.)

The main issue with the proposed approach is that undetected failures can appear with only two soft errors in a word that contains a defect, whereas for a defect-free memory they would require three soft errors in a word. To analyze the probability that a failure is not detected, it will be assumed that, as soon as a failure occurs, it is either detected (if detectable) or provokes silent data corruption that affects the system (if not detectable). Under this assumption, the probability that a failure is an undetected failure that affects the system is equal to the defect rate F multiplied by the miscorrection probability of the SEC-DED code used. The miscorrection probability depends on the type of SEC-DED code, and can be minimized by appropriately selecting the code, at the cost of a more complex encoder-decoder. The miscorrection probabilities for different SEC-DED codes are shown in Table I, which indicates that linear SEC-DED codes provide the lowest values. Some of the values have been obtained from reference [21]; the rest of the entries have been obtained by simulation. For small defect rates and a large Mean Time To Failure (MTTF), the probability that an undetected failure occurs during memory operation can be negligible. In those cases, the proposed technique can effectively deal with stuck-at defects without compromising reliability.

V. CONCLUSIONS

In this paper, the use of SEC-DED to deal with both soft errors and isolated stuck-at defects has been studied. An analysis has been presented to evaluate the defect rates and memory sizes for which SEC-DED codes with no modifications can be used to deal with defects without compromising reliability. Then a technique has been proposed that can deal with both types of errors effectively by applying a modified error correction process. The analysis shows that the mean time to failure for the proposed technique would be the same as that of a defect-free memory that incorporates SEC-DED. The main issue of the proposed approach is that a small percentage of the failures will be undetected, leading to silent data corruption. However, for small defect rates and low failure probabilities during device operation, the probability of undetected failures will be negligible.

The proposed technique can also be combined with traditional 2-D repair approaches, such that row and column failures and defects affecting multiple bits are repaired, while isolated defects are handled by the SEC-DED codes. This would provide a complete solution to protect against defects and soft errors in memories.

REFERENCES

[1] I.T.R.S., International Technology Road Map for Semiconductors, Jul. 2010 update. [Online]. Available: http://public.itrs.net/
[2] D. Bhavsar, "An algorithm for row-column self-repair of RAMs and its implementation in the Alpha 21264," in Proc. Int. Test Conf., 1999, pp. 311-318.
[3] I. Kim, Y. Zorian, G. Komoriya, H. Pham, F. Higgins, and J. Lewandowski, "Built-in self repair for embedded high density SRAM," in Proc. Int. Test Conf., 1998, pp. 1112-1119.
[4] W. K. Huang, Y.-N. Shen, and F. Lombardi, "New approaches for the repairs of memories with redundancy by row/column deletion for yield enhancement," IEEE Trans. Computer-Aided Des. Integr. Circuits Syst., vol. 9, no. 3, pp. 323-328, Mar. 1990.
[5] M. Horiguchi, J. Etoh, M. Aoki, K. Itoh, and T. Matsumoto, "A flexible redundancy technique for high-density DRAMs," IEEE J. Solid-State Circuits, vol. 26, no. 1, pp. 12-17, Jan. 1991.
[6] S.-K. Lu, Y.-C. Tsai, C.-H. Hsu, K.-H. Wang, and C.-W. Wu, "Efficient built-in redundancy analysis for embedded memories with 2-D redundancy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 1, pp. 34-42, Jan. 2006.
[7] S.-K. Lu and S.-C. Huang, "Built-in self-test and repair (BISTR) techniques for embedded RAMs," in Proc. Records of the 2004 Int. Workshop on Memory Technology, Design and Testing, Aug. 2004, pp. 60-64.
[8] D. Rossi, M. Omana, F. Toma, and C. Metra, "Multiple transient faults in logic: An issue for next generation ICs?," in Proc. 20th IEEE Int. Symp. Defect Fault Tolerance in VLSI Syst. (DFT 2005), Oct. 2005, pp. 352-360.
[9] G. Cardarilli, A. Leandri, P. Marinucci, M. Ottavi, S. Pontarelli, M. Re, and A. Salsano, "Design of a fault tolerant solid state mass memory," IEEE Trans. Rel., vol. 52, no. 4, pp. 476-491, Dec. 2003.
[10] A. Saleh, J. Serrano, and J. Patel, "Reliability of scrubbing recovery techniques for memory systems," IEEE Trans. Rel., vol. 39, no. 1, pp. 114-122, Apr. 1990.
[11] R. Goodman and M. Sayano, "The reliability of semiconductor RAM memories with on-chip error-correction coding," IEEE Trans. Inf. Theory, vol. 37, no. 3, pp. 884-896, May 1991.
[12] M. Blaum, R. Goodman, and R. McEliece, "The reliability of single-error protected computer memories," IEEE Trans. Comput., vol. 37, no. 1, pp. 114-119, Jan. 1988.
[13] S. Satoh, Y. Tosaka, and S. Wender, "Geometric effect of multiple-bit soft errors induced by cosmic ray neutrons on DRAMs," IEEE Electron Device Lett., vol. 21, no. 6, pp. 310-312, Jun. 2000.
[14] A. Dutta and N. Touba, "Multiple bit upset tolerant memory using a selective cycle avoidance based SEC-DED-DAEC code," in Proc. 25th IEEE VLSI Test Symp., May 2007, pp. 349-354.
[15] S. Baeg, S. Wen, and R. Wong, "SRAM interleaving distance selection with a soft error failure model," IEEE Trans. Nuclear Sci., vol. 56, no. 4, pp. 2111-2118, Aug. 2009.
[16] G.-C. Yang, "Reliability of semiconductor RAMs with soft-error scrubbing techniques," IEE Proc. Comput. Dig. Techn., vol. 142, no. 5, pp. 337-344, Sep. 1995.
[17] C.-L. Su, Y.-T. Yeh, and C.-W. Wu, "An integrated ECC and redundancy repair scheme for memory reliability enhancement," in Proc. 20th IEEE Int. Symp. Defect Fault Tolerance VLSI Syst. (DFT 2005), Oct. 2005, pp. 81-89.
[18] M. Nicolaidis, N. Achouri, and L. Anghel, "A diversified memory built-in self-repair approach for nanotechnologies," in Proc. 22nd IEEE VLSI Test Symp. (VTS'04), Apr. 2004, pp. 313-318.
[19] M. Richter, K. Oberlaender, and M. Goessel, "New linear SEC-DED codes with reduced triple bit error miscorrection probability," in Proc. 14th IEEE Int. On-Line Testing Symp. (IOLTS'08), Jul. 2008, pp. 37-42.
[20] R. W. Hamming, "Error detecting and error correcting codes," Bell Syst. Tech. J., vol. 29, no. 2, pp. 147-160, 1950.
[21] M. Y. Hsiao, "A class of optimal minimum odd-weight-column SEC-DED codes," IBM J. Res. Develop., vol. 14, no. 4, pp. 395-401, 1970.


Costas Argyrides (S'07-M'10) received the B.Sc. degree in informatics and computer science, with distinction, from the Moscow Power Engineering Institute-Technical University (MPEI-TU), Moscow, Russia. He received the M.Sc. degree in advanced computing and the Ph.D. degree in computer science from the University of Bristol, Bristol, U.K. He is currently a Validation Engineer at Intel Corporation. Prior to this, he served as a Research Assistant at the Universities of Bristol, Oxford Brookes, Warwick, and Cambridge. He is the author or coauthor of more than 40 technical papers. His research interests include fault-tolerant computer systems, software fault tolerance, reliability improvement, error correcting codes, algorithm-based fault tolerance, and nanotechnology-based designs. Dr. Argyrides received a Best Paper Award for his paper "Reliability Aware Yield Improvement Technique for Nanotechnology Based Circuits," with C. Lisboa, L. Carro, and D. K. Pradhan, presented at the 22nd Symposium on Integrated Circuits and Systems Design (SBCCI 2009).

Pedro Reviriego (A'03-M'04) received the M.Sc. and Ph.D. degrees (Hons.) in telecommunications engineering from the Technical University of Madrid, Madrid, Spain, in 1994 and 1997, respectively. From 1997 to 2000, he was an R&D Engineer with Teldat, Madrid, Spain, working on router implementation. In 2000, he joined Massana to work on the development of 1000Base-T transceivers. During 2003, he was a Visiting Professor with the University Carlos III, Leganés, Madrid. From 2004 to 2007, he was a Distinguished Member of Technical Staff with the LSI Corporation, working on the development of Ethernet transceivers. He is currently with the Universidad Antonio de Nebrija, Madrid. He is the author of numerous papers in international conference proceedings and journals. He has also participated in the IEEE 802.3 standardization for 10GBase-T. His research interests include fault-tolerant systems, performance evaluation of communication networks, and the design of physical layer communication devices.

Juan Antonio Maestro (M'07) received the M.Sc. degree in physics and the Ph.D. degree in computer science from the Universidad Complutense de Madrid, Madrid, Spain, in 1994 and 1999, respectively. He has served both as a Lecturer and a Researcher at several universities, such as the Universidad Complutense de Madrid; the Universidad Nacional de Educación a Distancia (Open University), Madrid; Saint Louis University, Madrid; and the Universidad Antonio de Nebrija, Madrid, where he currently manages the Computer Architecture and Technology Group. His current activities are oriented to the space field, with several projects on reliability and radiation protection, as well as collaborations with the European Space Agency. Aside from this, he has worked for several multinational companies, managing projects as a Project Management Professional and organizing support departments. He is the author of numerous technical publications, both in journals and international conferences. His areas of interest include high-level synthesis and co-synthesis, signal processing, real-time systems, fault tolerance, and reliability.
