
CHAPTER II

REVIEW OF LITERATURE

Memory based First-In First-Out (FIFO) buffers see widespread use in a variety of data transfer and buffering applications (Hashimoto 1988 pp. 496-497), (Le Ngoc 1989 sec. 14 pp. 27-30), (Miller 1985 sec. 14 pp. 6-7). The FIFO internal organization and variety of applications pose some interesting opportunities for error detection and correction. The rigid, sequential method the FIFO uses to handle data limits which memory faults are applicable and which fault handling techniques are possible. Establishing a framework for FIFO study requires a survey of the whole range of memory faults, testing, and error reduction techniques. The first step is to understand the FIFO, both at the chip and at the memory cell level. Next, look at the faults that can occur in different memory types. From this comprehensive list, select a realistic fault model, rooted in physical behavior, yet simplified because of the inherent FIFO structure and operation. The fault model will be the means to examine error reduction techniques. Again, the whole range of RAM fault solutions must be examined to see which are applicable to FIFOs. Fault tolerant techniques include data coding, on-chip error correction, redundancy, and data interleaving.

Introduction to FIFOs

The First-In First-Out concept provides a means to store data for retrieval in the exact input order. Desired properties include simultaneous and asynchronous read and write operations. Early FIFO devices were constructed of a parallel set of serial, linear shift registers. Data words presented to the input trickled through the shift registers to reach the output. A major disadvantage of these early FIFOs is their limited capacity and excessive delay (Miller 1985 sec. 14 p. 2). Because the data must propagate through each storage location, input-to-output latency is O(K), where K is the capacity of the FIFO. The second generation of FIFOs stores data in memory cells and uses internal logic to track the storage and retrieval addresses. The block diagram in fig. 1 shows the logical structure of a FIFO. The RAM array holds the data. A write pointer and read pointer contain the addresses of the last word written and the next word to be read. The flag logic tracks the relative locations of the two pointers. Both pointers start at 0x0000. The write pointer increments by one for each data word written. A difference between the two pointer values indicates the FIFO contains data.

Figure 1.-- FIFO Block Diagram

The read pointer increments for each word

read until it equals the write pointer, whereupon the FIFO asserts the empty flag. When the full flag indicates K words in the FIFO, the write pointer value is one less than the read pointer value. FIFOs are designed for simultaneous read and write operations in order to buffer data between two asynchronous entities. The current generation of FIFOs uses a dual port static RAM cell to allow these simultaneous reads and writes. To select a proper fault model, we will need to understand memory cell structures. Static RAM (SRAM) cells are formed by a pair of cross-coupled inverters, as shown in fig. 2. The value at node A represents the logical value of the data held by the cell. The value at node A is inverted at node B. Node B is fed back and inverted, maintaining the logic value at node A. Of the four transistors comprising the inverters, two transistors are always on,

Figure 2.-- Static RAM Cell

sinking current. The data value is written to or read from the cell via two access transistors. Enabling the transistors via the select line transfers the voltage levels at nodes A and B down to the sense amplifiers and then to the output pins. A dual port static RAM (DPSRAM) cell has the same cross-coupled inverter structure, with an additional set of access transistors, as shown in fig. 3. The upper set is used for writes; the lower set is used for reads. This design allows the memory cell to be written and read at the same time. A DPSRAM chip uses a flag

to prevent simultaneous writes to a cell (Raposa 1988 p. 362). The FIFO structure does not require this flag because it only writes with one set of transistors.

Figure 3.-- Dual Port Static RAM Cell

The latest experimental FIFOs employ dynamic random access memory (DRAM) cells. In

a paper by Hashimoto et al. (1988 p. 490), the authors describe a novel scheme to perform simultaneous, asynchronous FIFO operation with single transistor dynamic RAM cells. Unlike the SRAM cell which maintains logic states by transistor current flow, the DRAM cell in fig. 4 maintains logic states by charge
Figure 4.-- Dynamic RAM Cell

storage on a capacitor (Geiger 1990 p. 836). This high density memory cell technique has two problems in FIFO applications. The primary difficulty is the lack of simultaneous access to an individual cell. Secondly, DRAM cells require a periodic refresh to restore charge lost to capacitor leakage. Hashimoto's design solves both problems. Figure 5 shows the block diagram of a 256K x 4 FIFO. The main memory array is composed of four 256K blocks of single transistor dynamic RAM cells. A 512 x 4 static RAM buffer holds data to be written. On the output side, a 256 x 4 static RAM buffer passes data to be read.

Figure 5.-- DRAM based FIFO Block Diagram

The key to the simultaneous operation lies in the A and B buffers, which are cross connected to the input and output pins. When a data word is presented to the input pins, the word is read into either the A or B buffer, depending on which is idle. Then the word is sent to the write buffer to be held. When the memory array is idle, the entire write buffer is transferred into it. During read operations, the output accesses whichever of the A or B buffers is idle. Data words are taken from the read buffer, which is replenished whenever the main memory array is idle. A separate circuit blocks out both read and write operations to perform refreshes on the main memory array. In summary, memory based FIFOs can use two different types of memory

cells, dual port static RAM and dynamic RAM. In addition to the memory array itself, FIFO structures typically include read and write address pointers as well as input and output buffers and registers. Each of these parts is susceptible to failure. In the next section we will examine papers on physical fault processes and develop a practical fault model for further study.
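The pointer-based FIFO operation described above can be sketched in software as a circular buffer. This is a toy illustration of the pointer and flag logic only, not any cited chip design; the class and method names are mine.

```python
class Fifo:
    """Software sketch of a pointer-based FIFO of capacity k.

    Mirrors the block diagram: a RAM array, a write pointer, a read
    pointer, and flag logic derived from the pointer difference.
    """

    def __init__(self, k):
        self.k = k
        self.ram = [0] * k          # RAM array holds the data
        self.wp = 0                 # write pointer
        self.rp = 0                 # read pointer
        self.count = 0              # flag logic tracks relative positions

    def empty(self):
        return self.count == 0      # read pointer caught up to write pointer

    def full(self):
        return self.count == self.k # k words stored

    def write(self, word):
        if self.full():
            raise OverflowError("FIFO full")
        self.ram[self.wp] = word
        self.wp = (self.wp + 1) % self.k   # pointer wraps around the array
        self.count += 1

    def read(self):
        if self.empty():
            raise IndexError("FIFO empty")
        word = self.ram[self.rp]
        self.rp = (self.rp + 1) % self.k   # pointer wraps around the array
        self.count -= 1
        return word
```

Data emerges in exact input order, and each access touches one cell, in contrast to the O(K) propagation latency of the shift-register FIFOs.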
Fault Model

Our objective in this section is to create a fault model to simulate and analyze. The fault model chosen will be composed of the faults that occur during system use rather than faults screened out during burn-in testing. This requires examining faults identified both theoretically and experimentally and constructing a list of FIFO errors from the identified fault types.
Failure Modes

Study of faults starts with the baseline faults experienced on less complex ICs (Colbourne et al. 1974 p. 250). Colbourne et al. list the following package related and chip related failure modes for all integrated circuits.

The assembly- and package-related failure modes include:
1) open bond wires
2) lifted bonds
3) lifted chips
4) hermeticity
5) thermal intermittents

The chip-related failure modes include:
1) oxide faults
2) metallization faults
3) diffusion defects
4) mechanical defects in chips
5) design defects

Technical reports from the Rome Air Development Center (Coit et al. 1984),

(Reliability Analysis Center 1989) support these failure modes. Identifying physical failure modes is but a first step in developing the fault model. Failure modes need to be translated into logical faults and parametric faults. Logical faults cause the logic function of the device to change to some other value. Parametric faults change the timing, current, or voltage levels of a circuit (Breuer and Friedman 1976 p. 15). In the process of converting from physical failure modes to faults, we want to eliminate failure modes which do not appear during system operation. Environmental stress screening rejects devices with latent defects to prevent manufacturing defects from becoming service life failures (Tustin 1986). For example, in a silicon crystal with a small initial crack, the crack grows under vibration screening until vital circuits are bisected, causing a failure (Wong and Lindstrom 1988 p. 360).

Figure 6.-- The Roller-Coaster Curve

Unfortunately there is a limit to screening, which lets inherent manufacturing flaws manifest during service life. Wong and Lindstrom (1988) explain why in their paper about the roller-coaster shaped expected hazard rate curve. The roller-coaster curve starts with a high initial failure rate and gradually decreases, but there are several distinct 'humps' or localized areas of higher failure rate (fig. 6). The cause of these humps is latent flaws in the device which pass through screening to develop into failures later. Therefore, when constructing a fault model, we must include some faults which should have been screened out as part of device burn-in.

Permanent Fault Types

Failures are divided into permanent and transient faults. A typical list of permanent faults is that of Blaum et al. (1988), as illustrated in figure 7.

Figure 7.-- Typical RAM Permanent Faults

1) Single cell failures where the cell is stuck at a logical one or zero.
2) Row failures where all cells in a chip's row are affected by a failure of the chip's row driver.


3) Column failures where all cells in a chip's column are affected by a failure of the sense amplifiers or column decoders.
4) Row and column failures where a single cell short circuits both.
5) Whole chip failures in which all cells fail.

One can choose to include all or some of these fault types in a fault model, with varying results. A survey paper by Koo and Chenoweth (1984) shows an analysis as applied to four different models.
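The Blaum et al. fault types listed above can be mimicked by a simple fault-injection sketch over a two-dimensional bit array standing in for the memory chip. The array model and helper name are illustrative, not from any cited simulator.

```python
def inject_fault(array, kind, index=0, stuck=0):
    """Force a permanent fault onto a 2-D bit array (list of row lists).

    kind: 'cell' (single cell stuck, index is an (r, c) pair),
          'row' (row driver failure),
          'column' (sense amplifier / column decoder failure),
          or 'chip' (whole chip failure).
    Every affected cell is stuck at the value `stuck`.
    """
    rows, cols = len(array), len(array[0])
    if kind == 'cell':
        r, c = index
        array[r][c] = stuck
    elif kind == 'row':            # all cells in one row affected
        array[index] = [stuck] * cols
    elif kind == 'column':         # all cells in one column affected
        for r in range(rows):
            array[r][index] = stuck
    elif kind == 'chip':           # all cells fail
        for r in range(rows):
            array[r] = [stuck] * cols
    return array
```

A fault model built from these four kinds can then be swept over a simulated memory to observe the resulting data errors.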

Pattern Sensitive Faults

Later papers have considered more complicated faults. Mazumder and Patel (1989 p. 398) use a model with several pattern sensitive faults. They list two sets of physically adjacent cell patterns (neighborhoods) that could affect the base cell. The

von Neumann neighborhood (fig. 8a) consists of cells directly above, below, and to the sides of the base cell.

Figure 8.-- Pattern Sensitive Fault Neighborhoods

The Moore neighborhood (fig. 8b) extends the von


Neumann neighborhood to include the corner cells. Mazumder and Patel also differentiate between static and dynamic pattern sensitive faults. Static faults occur in the presence of a fixed pattern in the surrounding neighborhood. Dynamic faults occur when the base cell changes state due to a change in a surrounding cell. Franklin et al. (1989) describe an even more extensive fault neighborhood in which the contents of a cell become sensitive to the weight of the cell contents in its row and column. Coupling faults described by Chang et al. (1989 p. 638) are a special case of pattern sensitive faults. Active coupling faults occur when a read or write operation at address i forces a write at address j. Passive coupling faults occur when a read or write operation at address j does not work correctly because of a value at address i. David et al. (1989) refine the problem even further by differentiating between idempotent and inversion faults. A transition has an idempotent influence if the transition at cell i influences cell j only when cell j is in a fixed state. An inversion influence has cell i influencing cell j independently of the value at cell j.

Alpha Particle Burst Faults

One very important class of fault was first identified by May and Woods (1978, 1979). They showed that alpha particles interact with the memory chip, causing a soft failure. No permanent damage is done, and the chip completely recovers when the memory cell is next written. As discussed previously, dynamic memories store data as charge on a capacitor or potential well. A data value '0' is indicated by the


potential well filled with electrons. A data value '1' is indicated by an empty potential well. Alpha particles generate electron-hole pairs as they lose energy in the silicon. Holes are driven from the potential well, leaving behind the generated electrons. If there is a data value '0' in the cell, the extra electrons merely reinforce the data value. If there is a data value '1' in the cell, the excess electrons swept into the well by the applied potential will change the data value to a '0'. Further studies expanded on the work performed by May and Woods. Yaney et al. (1979) determined that the sense amplifiers and bit lines are also likely to be affected during read operations. Recent studies of the phenomenon indicate burst errors, where one alpha particle can affect several adjacent cells at the same time (Horiguchi 1988), (Goodman 1991). In addition, some static RAM architectures can experience alpha particle burst errors the same as dynamic RAMs (Minami 1989 p. 1657). Another class of transient errors stems from design errors. A paper by Shepherd and Rodgers (1985) gives detailed descriptions of FIFO design flaws which passed manufacturing test. These design flaws showed up in very specific situations. For example, when the device is full during a simultaneous read and write operation, the write pointer crossed over into the read pointer's address space, effectively losing all of the data in the FIFO. Another transient FIFO phenomenon concerns voltage transients on the write or read control line. These short noise transients can force erroneous pointer advances (Rajpal and Schapfel 1987). Many of the fault types examined before were either examined in isolation, like the alpha particle phenomenon, or as a conglomerate of experimental and theoretical

fault types. Quite a few papers have chosen fault models as a subset of available fault types without experimental backup on relative fault frequencies (Cox 1978), (Ayache and Diaz 1979), (Dobbins 1986), (Midkiff and Koe 1989). Though this is the same procedure followed in this paper, there is a more scientific method of constructing a fault model for a specific device.

Inductive Fault Analysis

"Inductive fault analysis (IFA) is a systematic and automatic method for determining what faults are likely to occur on a specific circuit" (Ferguson and Shen 1988). IFA takes into account the manufacturing process, physical defect rates, and device layout. The IFA computer program injects defects on a description of the physical layout and simulates device operation to determine what faults result. The research team of Dekker, Beenker, and Thijssen (1988, 1989, and 1990) has used inductive fault analysis to construct and grade a more effective testing technique. While inductive fault analysis appears to be a superior method for choosing a fault model, several restrictions prevented IFA use in this paper. First, one needs the detailed physical layout and the process defect statistics. Second, this technique seems more applicable to identifying manufacturing defects screened out as part of burn-in. As we learned in the discussion on the roller-coaster curve, some latent defects will always pass through any inspection procedure. The IFA method clearly identifies some defects as non-faults (Ferguson and Shen 1988 p. 479). Though Ferguson and Shen do not identify what these exceptions were, they might be prime examples of

latent defects that pass through manufacturing screening only to develop into faults under long term stress. Since this paper is predominantly focused on faults during the service life, not the burn-in period, IFA would direct our attention to a burn-in fault model when really we want to concentrate on the IFA exceptions, the faults during service life.

Selected Fault Model

Based on the review of literature and FIFO organization, the simulation uses the following fault model. The fault model is divided into permanent (hard) and intermittent (soft) faults. Many of the failure modes listed in the literature have to do with manufacturing defects, which are typically screened out. But the work by Wong and Lindstrom (1988) indicates that latent manufacturing flaws will pass screening and develop into faults during device service life.

Permanent Faults

Stuck-at faults:
Memory array: stuck-at-0; stuck-at-1.
Write pointer: multiple writes to the same address; some addresses not written to.
Read pointer: multiple reads from the same address; some addresses not read from.

Pattern Sensitive Faults in the Moore neighborhood (see fig. 8b.): Only static faults where an individual cell modifies another individual cell are modeled.
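The static pattern sensitive fault just selected can be sketched as a check on the eight cells of the Moore neighborhood (fig. 8b). The helper names and the choice of trigger pattern are illustrative, not taken from the simulation of Chapter III.

```python
def moore_neighborhood(array, r, c):
    """Return the values of the eight cells surrounding base cell (r, c),
    skipping neighbors that fall outside the array boundary."""
    cells = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue           # skip the base cell itself
            rr, cc = r + dr, c + dc
            if 0 <= rr < len(array) and 0 <= cc < len(array[0]):
                cells.append(array[rr][cc])
    return cells

def apply_static_psf(array, r, c, pattern, forced):
    """Static pattern sensitive fault: when the Moore neighborhood of
    the base cell matches the fixed `pattern`, the base cell is forced
    to the value `forced`; otherwise it is left alone."""
    if moore_neighborhood(array, r, c) == pattern:
        array[r][c] = forced
    return array
```

A dynamic pattern sensitive fault would instead trigger on a change in a surrounding cell, which requires tracking the previous neighborhood state.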

Intermittent Faults

Unidirectional burst faults from alpha particles hitting one or more cells. Design timing defects and read/write pulse noise causing erroneous pointer operation.

Chapter III explains how the simulation implements each of these fault types.

Error Reduction Techniques

Having learned about failures in memory based FIFOs, it is time to examine what techniques are used for fault detection and error reduction.

RAM Architectures

There are several architectural approaches to RAM error reduction. Each approach is aimed at a different level of the memory system. Nothing precludes one or all of the approaches from being used simultaneously.

Memory cell techniques. For the case of alpha particle errors, the design of the individual cell can have a dramatic impact on the error rate experienced (Takeuchi et al. 1989 p. 1644), (Takeda et al. 1989 p. 2567). An obvious error reduction technique is to redesign the memory cell. Minami et al. (1989) proposed a change in architecture from lateral to vertical memory cells, shielding critical areas from the alpha particles. Since chip packaging is the primary source of the alpha particles, different packages and barrier films on the chip have been tried (Sarrazin and Malek 1984 p. 53).

On-chip techniques. One method of dealing with manufacturing defects is to add spare decoders and spare word and bit lines (Schuster 1978 p. 698). Schuster explains that when faults are encountered, the defective column or row can be switched out for a spare. The changes can be latched in temporarily, burned in with a laser, or programmed using electrically programmable read only memory cells (EPROM). Chang, Fuchs, and Patel (1989) extend this concept to show how coupling faults can be diagnosed and then repaired by switching rows or columns. On-chip error correction is another error reduction method. Yamada et al. (1984) implemented a bidirectional parity code which computes horizontal and vertical parities on a 4 bit cell group and is capable of correcting single bit errors. Fuja, Heegard, and Goodman (1988) provide the theory behind this technique as they explain single, double, and triple-error correcting linear sum codes.
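The bidirectional parity principle can be sketched as follows: horizontal and vertical parities over a small bit group locate a single flipped bit at the intersection of the failing row and column parity. This is a toy sketch of the principle only, not the Yamada et al. circuit, and the function names are mine.

```python
def parities(group):
    """Compute horizontal (per-row) and vertical (per-column) parities
    of a square bit group, e.g. a 4 x 4 cell group."""
    h = [sum(row) % 2 for row in group]
    v = [sum(col) % 2 for col in zip(*group)]
    return h, v

def correct_single_error(group, h_ref, v_ref):
    """Compare stored parities against recomputed ones. A single bit
    error shows up as exactly one failing row parity and one failing
    column parity; flipping the bit at their intersection corrects it."""
    h, v = parities(group)
    bad_rows = [i for i, (a, b) in enumerate(zip(h, h_ref)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(v, v_ref)) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        group[r][c] ^= 1           # flip the bit at the intersection
    return group
```

The cost is one parity bit per row and per column of the group, which is the trade-off the linear sum code theory of Fuja, Heegard, and Goodman formalizes.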

System solutions. There are several error correcting codes useful at the system level to reduce errors. Bossen and Hsiao (1980) describe use of a single-error-correcting and double-error-detecting code with a hardware algorithm that clears soft errors in the presence of hard errors by remembering hard error locations. Grosspietsch (1988) describes a VLSI chip that remembers the location of hard faults and dynamically switches to spare bit-slices of the memory system.

Design diversity. Extending fault tolerant architectures one level beyond system solutions requires design diversity. Avizienis and Kelly (1984) discuss a method to reduce even design errors. Design faults are those that would be present in every copy of a FIFO. The asynchronous timing errors identified by Shepherd and Rodgers (1985) would manifest even in redundant FIFOs if they were from the same manufacturer. Avizienis and Kelly propose that faults of this type can be reduced only by placing two separately designed elements in parallel. For example, to reduce software errors in a phone system, have three different companies develop a solution for computers from three different companies and implement a majority voting scheme to resolve errors. Four levels of error reduction techniques have been discussed. The first, memory cell changes, is implemented independently of the other three techniques. The second level, on-chip or built-in techniques, is the subject of the next section. The third level, system solutions, includes error correcting codes, the subject of the last section. The fourth level, design diversity, is not applicable to FIFOs.

Built-In Test for RAMs

Built-in test of memories was motivated by the unacceptably long test times required for classical patterns to exercise very large RAMs (Breuer and Friedman 1976 pp. 156-160). Much work has been done to devise optimal test patterns (Jacobson 1985), (Mazumder and Patel 1989), (David, Fuentes, and Courtois 1989). But even these are of order kn, where n is the size of the array and k is a large constant.

Built-in self test (BIST) of the RAM promises a dramatic reduction in test time. Bardell and McAnney describe an algorithmic test sequence (1985) which detects all single and multiple stuck-at faults in the memory array, its decoder, registers, and all bit-rail faults (1988). Their method has a sequence generator on chip which writes into the memory cells and an on-chip comparator circuit which compares each word read out with the generated sequence. If all words pass, the chip is declared free of faults. Jain and Stroud (1986) and Sridhar (1986) propose on-chip built-in self-test circuitry to facilitate RAM testing. Both of their approaches consist of a test sequence generator and a fault grading circuit. Jain and Stroud discuss two methods to implement fault grading. The first of these has a comparator circuit which checks the test word from the generator against the word after it has been written to the memory location. A faster implementation uses a parallel signature analyzer to compress the test words and then compares the compressed word at the end of the test cycle. A key bottleneck in memory access is the time required to translate the memory cell charge to a drive current on the output pins. Furthermore, the output pins limit the number of memory cells an off-chip tester can examine together. In the parallel signature analyzer, each bit line is an input to a linear feedback shift register which computes a signature across all bit lines at one time. Sridhar shows how existing test patterns such as the marching 1's and 0's test can be modified for built-in execution. Jain and Stroud develop new test sequences, one of which is specifically targeted for detecting coupling faults between adjacent memory cells. Both Jain and

Stroud and Sridhar report diagnostic runtimes reduced by three orders of magnitude compared to external test methods. One problem with parallel signature analyzers is that the test is invalid if a redundant bit line is substituted for a defective bit line (Sridhar 1986 p. 20). Kraus et al. (1989) propose a modification to the test architecture to deal with this problem. During testing, the faulty bit line is disconnected from the parallel comparator. Thus the indeterminate state of the cells in the bad column will not corrupt the signature. Saluja et al. (1987) compare the BIST hardware overheads for random logic and microcoded ROM. Though microcoded ROM requires more silicon area, at chip sizes above 64K the difference between the two methods in percentage of chip area used is negligible. Advantages of the microcoded ROM method include a shorter design cycle and increased testability of the BIST hardware. Franklin, Saluja, and Kinoshita (1989) use the fact that the increased speed of built-in self testing can make longer test sequences practical. If the tester is limited to O(n) steps, then only pattern sensitive fault models which consider the immediate neighborhood are practical. Franklin et al. can test for pattern sensitive faults which depend on the weight of the cell contents in the base cell's row and column neighborhood. The speed of the built-in test circuit and the economy of the test equipment make their O(n^1.5) algorithm practical. Regener (1988) proposes an on-chip sequence generator which cycles through all possible bit patterns for an n-bit RAM. The generator circuit is fast and small, but any test pattern which cycles through all transitions will have n·2^n steps. Even the

most efficient circuit will quickly become impractical with increasing RAM densities. For example, at a 1 ns step time, Regener's algorithm would take an astronomically long time to test even a 1K RAM. Perhaps his method of generating test sequences might be successfully applied to some of the optimal O(n^k) test patterns. Fujiwara et al. (1988) survey RAM built-in self test techniques in Japan. Of particular note is the work by Miura, Tamamoto, and Narita (1987). They consider address faults where a given address accesses an incorrect location and the neglected location is not picked up by some other address. Their O(kn^0.5) sequence detects all of these address decoder and local neighborhood pattern sensitive faults. A further method to reduce testing time, applicable to any of the above BIST techniques, is to partition the RAM of size M into Q modules of size N. Using parallel test techniques on each module, a test pattern which requires O(M^k) operations will now only require O(N^k) operations. A great deal of the literature on built-in self test is aimed at the part screening process. For non-interruptive built-in test we will examine error correcting codes. The next section will address several codes that are intriguing because they are targeted toward the types of faults seen in RAMs.
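The marching 1's and 0's pattern mentioned above can be sketched over a software memory model. This is a generic MATS-style march for stuck-at fault detection, not Sridhar's exact built-in sequence; the memory model with a stuck cell is a toy of my own.

```python
def march_test(mem):
    """Simple march test over a 1-bit-wide memory model (list-like).

    March elements: write 0 everywhere; ascending read-0/write-1;
    descending read-1/write-0. Detects stuck-at faults in every cell.
    Returns the sorted list of failing addresses.
    """
    n = len(mem)
    failures = set()
    for i in range(n):             # M0: initialize all cells to 0
        mem[i] = 0
    for i in range(n):             # M1: ascending, read 0 then write 1
        if mem[i] != 0:
            failures.add(i)
        mem[i] = 1
    for i in reversed(range(n)):   # M2: descending, read 1 then write 0
        if mem[i] != 1:
            failures.add(i)
        mem[i] = 0
    return sorted(failures)

class StuckMem(list):
    """Toy memory model with one permanently stuck cell."""

    def __init__(self, n, stuck_addr, stuck_val):
        super().__init__([0] * n)
        self.sa, self.sv = stuck_addr, stuck_val
        list.__setitem__(self, stuck_addr, stuck_val)

    def __setitem__(self, i, v):
        if i == self.sa:
            v = self.sv            # writes to the stuck cell are ignored
        list.__setitem__(self, i, v)
```

Because each cell is both read and written in each state, a cell stuck at either value fails one of the two read phases.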

Error Correcting Codes for RAMs

In this thesis we use only one error correcting code, the modified Hamming code proposed by Hsiao. There are alternative codes which could provide better results than the Hsiao code. For example, Davydov and Tombak (1991 p. 897)


discuss the II code, which has a higher probability of detecting triple errors. Since a predominant RAM failure mode stems from alpha particle radiation, a more efficient code can be found that takes advantage of the specifics of alpha particle faults. These transient faults always produce a 1 → 0 transition in memory cell data values; researchers have capitalized on this property to devise a new class of codes. Tao, Hartmann, and Lala (1988 p. 879) provide the following definitions:
Symmetric errors - Both 1 → 0 and 0 → 1 errors can occur in a codeword.

Asymmetric errors - Only one of the errors 1 → 0 or 0 → 1 can occur in a codeword. The error is known a priori.

Unidirectional errors - Both 1 → 0 and 0 → 1 errors can occur, but they do not occur in the same codeword.
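Unidirectional error detection can be illustrated with a Berger code, whose check symbol is the binary count of 0 bits in the information word. This is a standard textbook construction offered for illustration, independent of the specific codes surveyed below; the function names are mine.

```python
import math

def berger_encode(info):
    """Append a Berger check symbol: the count of 0 bits in `info`,
    written in binary. A unidirectional error can only move the 0-count
    and the check value in opposite directions, so they never re-agree."""
    k = max(1, math.ceil(math.log2(len(info) + 1)))  # check symbol width
    zeros = info.count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(k))]
    return info + check, k

def berger_check(word, k):
    """Return True if the stored check symbol matches the recomputed
    count of 0 bits in the information part."""
    info, check = word[:-k], word[-k:]
    value = 0
    for b in check:
        value = (value << 1) | b
    return info.count(0) == value
```

All 1 → 0 errors raise the recomputed 0-count while only lowering the stored check value (and vice versa for 0 → 1), so any number of unidirectional errors in one codeword is detected.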
Pradhan (1980 p. 471) identified a new class of systematic codes which can correct t errors (t-EC) and simultaneously detect all unidirectional errors (AUED). Nikolos, Gaitanis, and Philokyprou (1986 p. 394) proposed a set of significantly more efficient systematic codes. Several research teams continued to improve the efficiency of t-EC/AUED codes (Tao, Hartmann, and Lala 1988 p. 879), (Montgomery and Kumar 1990 p. 836), (Kundu and Reddy 1990 p. 752), (Blaum and Tilborg 1989 p. 1493). Another consideration when tailoring codes is that the number of unidirectional errors may be limited. Blaum (1988 p. 453) presents a family of codes which can detect more unidirectional errors than cyclic redundancy codes but cannot detect an unlimited number of errors. Lin and Bose (1988 p. 433) provide the fundamental theory behind t-EC/d-UED codes where d > t is the number of unidirectional errors

which can be detected. The t-EC/d-UED codes use fewer check bits than the t-EC/AUED codes. Because of the bursty, asymmetric nature of alpha particle errors, the t-EC/AUED and t-EC/d-UED codes provide an interesting alternative to the modified Hamming SEC-DED code.
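The single-error-correcting behavior of a Hamming-style code can be sketched with the classic Hamming(7,4) code. This is the textbook construction, not Hsiao's specific modified SEC-DED code used in the thesis; it is included only to show how a syndrome locates and corrects one flipped bit.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword with
    parity bits at positions 1, 2, and 4 (1-indexed)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4              # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parity checks; their combined binary value (the
    syndrome) is the 1-indexed position of a single flipped bit, or 0
    if the codeword is clean. Returns (corrected word, syndrome)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1       # flip the offending bit
    return c, syndrome
```

Hsiao's modification adds an overall parity check for double-error detection and chooses the check matrix to minimize decoder logic, but the correction mechanism is the same syndrome lookup.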

