Lecture 1. Introduction

EE 658
Diagnosis and Design of Reliable Digital Systems
Lecture 1:
Introduction to testing of digital circuits and systems
University of Southern California
Viterbi School of Engineering
Ming Hsieh Dept. of Electrical Engineering
Moe Tabar
Fall 2020
References: Dr. Breuer’s lecture slides, books listed in the syllabus, and online resources
Key to Slides
Question: try to answer this question before class
Animation: reminder for me
Read only: not discussed in class
Advanced: material related to subject but not on exams
Dr. Moe Tabar - EE 658 2

What you should get from this lecture
▪ A good idea of the subject matter of this course
▪ The importance of VLSI test
▪ Basic terminology and concepts related to
VLSI test
▪ How this material fits into the VLSI/CAD
curricula

Outline
▪ 1. The VLSI fabrication process
▪ 2. Defects
▪ 3. Process variation
▪ 4. Errors in computation
▪ 5. Yield
▪ 6. Burn-in
▪ 7. Wafer testing
▪ 8. Testing
▪ Introduction
▪ Modeling defects as faults
▪ Quality of a test
▪ Generating and applying a test
▪ Three test methods
▪ 9. Summary

1. The VLSI fabrication process
Read only
Building circuitry layer by layer
⚫ Murphy’s Law:
◦ If anything can go wrong, it
will.
⚫ Currently above 10 layers
of metal
⚫ 20-30 masks
⚫ Some problem areas:
▪ Planarization is difficult; filler is needed
in empty spaces
▪ Add 10 more items to this list

Read only
Semiconductor Manufacturing
⚫ They're everywhere. From appliances to spaceships, semiconductors have
pervaded the fabric of our society. They have transformed the world so
drastically that we've practically gone through hundreds of industrial revolutions
during the last five decades. So overwhelming is the power of computing and
signal processing today that it's difficult to believe how these can come from
sand.
⚫ Indeed, this world was reinvented simply by purifying sand, making it flat, and
adding materials to it. This magical process of building integrated circuits from
sand is now referred to as semiconductor manufacturing.
Semiconductor manufacturing consists of the following steps:

1) production of silicon wafers from very pure silicon ingots;
2) fabrication of integrated circuits onto these wafers;
3) assembly of every integrated circuit on the wafer into a finished product; and
4) testing and back-end processing of the finished products.

Read only
Wafer Fabrication
⚫ Wafer fabrication generally refers to the process of building integrated circuits
on silicon wafers. Prior to wafer fabrication, the raw silicon wafers to be used
for this purpose are first produced from very pure silicon ingots, through either
the Czochralski (CZ) or the Float Zone (FZ) method. The ingots are shaped
and then sliced into thin wafers through a process called wafering.
⚫ The semiconductor industry has advanced so far that there now
exist many distinct wafer fab processes, allowing the device designer to
optimize a design by selecting the best fab process for the device.
Nonetheless, all existing fab processes today simply consist of a series of steps
to deposit special material layers on the wafers one at a time in precise amounts
and patterns.
⚫ The first step might be to grow a p-type epitaxial layer on the silicon substrate
through chemical vapor deposition. A nitride layer may then be deposited over
the epi-layer, then masked and etched according to specific patterns, leaving
behind exposed areas on the epi-layer, i.e., areas no longer covered by the
nitride layer. These exposed areas may then be masked again in specific
patterns before being subjected to diffusion or ion implantation to receive
dopants such as phosphorus, forming n-wells.

Read only
Epitaxy
⚫ Epitaxy or epitaxial growth is the process of depositing a
thin layer (0.5 to 20 microns) of single crystal material over a
single crystal substrate, usually through chemical vapor
deposition (CVD). In semiconductors, the deposited film is
often the same material as the substrate, and the process is
known as homoepitaxy, or simply, epi. An example of this is
silicon deposition over a silicon substrate.
⚫ Silicon epitaxy is done to improve the performance of bipolar
devices. By growing a lightly doped epi layer over a
heavily-doped silicon substrate, a higher breakdown voltage
across the collector-substrate junction is achieved while
maintaining low collector resistance. Lower collector
resistance allows a higher operating speed with the same
current.

Read only
Diffusion and polysilicon

⚫ Diffusion and ion implant are the two major processes by which chemical
species or dopants are introduced into a semiconductor such as silicon to form
the electronic structures that make integrated circuits useful (although ion
implant is now much more widely used for this purpose than thermal diffusion).
⚫ Diffusion is the movement of a chemical species from an area of high
concentration to an area of lower concentration. The controlled diffusion of
dopants into silicon is the foundation of forming a p-n junction and fabrication of
devices during wafer fabrication.
⚫ Diffusion is used primarily to alter the type and level of conductivity of
semiconductor materials. It is used to form bases, emitters, and resistors in
bipolar devices, as well as drains and sources in MOS devices. It is also used to
dope polysilicon layers.
⚫ Thin films of polycrystalline silicon, or polysilicon (also known as poly-Si or
poly), are widely used as MOS transistor gate electrodes and for
interconnection in MOS circuits. Polysilicon is also used for resistors and for ensuring
ohmic contacts to shallow junctions. When used as a gate electrode, a metal
(such as tungsten) or metal silicide (such as tantalum silicide) may be deposited
over it to enhance its conductivity.

Read only
Lithography/Etch
⚫ The fabrication of circuits on silicon wafers requires that several different layers,
each with a different pattern, be deposited on the surface one at a time, and
that doping of the active regions be done in very controlled amounts over tiny
regions of precise areas. The various patterns used in depositing layers and
doping regions on the substrate are defined by a process called lithography.
⚫ Simply put, the lithography process generally consists of the following steps. A
layer of photoresist (PR) material is first spin-coated on the surface of the
wafer. The resist layer is then selectively exposed to radiation such as ultraviolet
light, electrons, or X-rays, with the exposed areas defined by the exposure tool,
mask, or computer data.
⚫ After exposure, the PR layer is subjected to development, which removes
unwanted areas of the PR layer, exposing the corresponding areas of the
underlying layer. Depending on the resist type, the development stage may
remove either the exposed or unexposed areas. The areas with no resist
material left on top of them are then subjected to additive or subtractive
processes, allowing the selective deposition or removal of material on the
substrate.

Read only
Optical Lithography
⚫ The fabrication of circuits on a wafer requires a process by which specific
patterns of various materials can be deposited on or removed from the wafer's
surface. The process of defining these patterns on the wafer is known as
lithography. Lithography uses photoresist materials to cover areas on the wafer
that will not be subjected to material deposition or removal.
⚫ Optical Lithography refers to a lithographic process that uses visible or
ultraviolet light to form patterns on the photoresist through printing. Printing is
the process of projecting the image of the patterns onto the wafer surface using
a light source and a photo mask. There are three types of printing - contact,
proximity, and projection printing. The equipment used for printing is known as
printers or aligners.
⚫ Patterned masks, usually composed of glass or chromium, are used during
printing to cover areas of the photoresist layer that shouldn't get exposed to
light. Development of the photoresist in a developer solution after its exposure
to light produces a resist pattern on the wafer, which defines which areas of the
wafer are exposed for material deposition or removal.

Read only
Electron Beam Lithography

⚫ Electron Beam Lithography (EBL) refers to a lithographic process that uses a
focused beam of electrons to form the circuit patterns needed for material
deposition on (or removal from) the wafer, in contrast with optical lithography
which uses light for the same purpose. Electron lithography offers higher
patterning resolution than optical lithography because of the shorter wavelength
possessed by the 10-50 keV electrons that it employs.
⚫ Given the availability of technology that allows a small-diameter focused beam of
electrons to be scanned over a surface, an EBL system doesn't need masks
anymore to perform its task (unlike optical lithography, which uses photomasks
to project the patterns). An EBL system simply 'draws' the pattern over the
resist-coated wafer using the electron beam as its drawing pen. Thus, EBL systems
produce the resist pattern in a 'serial' manner, making them slow compared to
optical systems.
⚫ A typical EBL system consists of the following parts: 1) an electron gun or
electron source that supplies the electrons; 2) an electron column that 'shapes'
and focuses the electron beam; 3) a mechanical stage that positions the wafer
under the electron beam; 4) a wafer handling system that automatically feeds
wafers to the system and unloads them after processing; and 5) a computer
system that controls the equipment.
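The wavelength claim above can be checked numerically. The sketch below assumes the standard relativistic de Broglie relation λ = h / p with p = sqrt(2·m·E·(1 + E / (2·m·c²))); the function name `electron_wavelength_m` is illustrative. For comparison, deep-UV optical lithography uses 193 nm light, so 10-50 keV electrons have wavelengths several orders of magnitude shorter. (In practice, EBL resolution is limited by beam spot size and electron scattering in the resist rather than by wavelength alone.)

```python
import math

# Physical constants (SI units)
H = 6.62607015e-34        # Planck constant, J*s
M_E = 9.1093837015e-31    # electron rest mass, kg
Q_E = 1.602176634e-19     # elementary charge, C
C = 2.99792458e8          # speed of light, m/s


def electron_wavelength_m(kev: float) -> float:
    """Relativistic de Broglie wavelength (meters) of an electron with kinetic energy kev keV."""
    energy_j = kev * 1e3 * Q_E  # kinetic energy in joules
    momentum = math.sqrt(2 * M_E * energy_j * (1 + energy_j / (2 * M_E * C**2)))
    return H / momentum


for kev in (10, 50):
    print(f"{kev} keV electron: {electron_wavelength_m(kev) * 1e12:.2f} pm")
```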

Read only
Photoresist
⚫ During development, the unwanted areas in the PR are dissolved by the
developer. In the case wherein the exposed areas become soluble in the
developer, a positive image of the mask pattern is produced on the resist. Such a
resist is therefore called a positive photoresist. Negative photoresist layers
result in negative images of the mask pattern, wherein the exposed areas are
made less soluble in the developer. Wafer fabrication may employ both positive
and negative photoresists, although positive resists are preferred because they
offer higher resolution capabilities.
⚫ Photoresist materials consist of three components: 1) a matrix material (also
known as resin), which provides body for the photoresist; 2) the inhibitor (also
referred to as sensitizer), which is the photoactive ingredient; and 3) the solvent,
which keeps the resist liquid until it is applied to the substrate.
⚫ Etching is the process of removing regions of the underlying material that are
no longer protected by photoresist after development. The rate at which the
etching process occurs is known as the etch rate. The etching process is said to
be isotropic if it proceeds in all directions at the same rate. If it proceeds in only
one direction, then it is completely anisotropic.

Read only
Etching
⚫ Since etching processes generally fall between being completely isotropic and completely
anisotropic, an etching process needs to be described in terms of its level of isotropy. Wet
etching, or etching with the use of chemicals, is generally isotropic. On the other hand, dry
etching processes that employ reactive plasmas are generally anisotropic. Wet etching is
an etching process that utilizes liquid chemicals or etchants to remove materials from the
wafer, usually in specific patterns defined by photoresist masks on the wafer. Materials not
covered by these masks are 'etched away' by the chemicals while those covered by the
masks are left almost intact. These masks were deposited on the wafer in an earlier wafer
fab step known as 'lithography'.
⚫ A simple wet etching process may just consist of dissolution of the material to be removed
in a liquid solvent, without changing the chemical nature of the dissolved material. In general,
however, a wet etching process involves one or more chemical reactions that consume the
original reactants and produce new species.
⚫ Reactive plasma etching involves the removal of surface material not protected by
lithographic masks using chemically active species. These species are usually oxidizing and
reducing agents produced from process gases that have been ionized and fragmentized by a
glow discharge. The species react with the exposed surface material, removing it from
the substrate while forming volatile byproducts in the process.
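A common way to express the "level of isotropy" mentioned above is the degree of anisotropy A = 1 - (lateral etch rate / vertical etch rate): A = 0 for a completely isotropic etch and A = 1 for a completely anisotropic one. The sketch below assumes that definition; the etch rates used are made-up illustrative values, not measured data.

```python
def degree_of_anisotropy(lateral_rate: float, vertical_rate: float) -> float:
    """Return A = 1 - r_lateral / r_vertical (0 = isotropic, 1 = completely anisotropic)."""
    if vertical_rate <= 0:
        raise ValueError("vertical etch rate must be positive")
    return 1.0 - lateral_rate / vertical_rate


# Hypothetical etch rates in nm/min
print(degree_of_anisotropy(100.0, 100.0))  # wet etch: lateral rate equals vertical rate
print(degree_of_anisotropy(5.0, 100.0))    # reactive plasma etch: little lateral etching
```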

A survey
▪ Consider a fabrication facility mass
producing a large state of the art
processor chip using their newest and
most advanced nano-CMOS technology.
What fraction of newly manufactured die
(chips) are bad and need to be discarded?

2. Defects
Spot defects: resistive opens and shorts
[Figure: an impurity resulting in two open wires and one partially cut wire; sputtering in deposition of material leading to a short]
Example of a 2-dimensional defect due to a particulate (Khare and Maly, p. 28; courtesy Siemens A. G., Munich, Germany)
Extra metal spot defect (Khare and Maly, p. 3; courtesy Siemens A. G., Munich, Germany)
Another defect

3. Process variation
Clean room

Sources of process disturbances
▪ Human errors and equipment failures
▪ Instabilities in the process conditions
▪ Random fluctuations in the process environment, such as turbulent
flow of gases, inaccuracies in control of furnace temperature, etc.
▪ Material instabilities - variations in the physical parameters of
chemical compounds and other materials, such as fluctuations in the
purity and physical characteristics of chemical compounds, density
and viscosity of photoresist, etc.
▪ Substrate and surface inhomogeneities
▪ Local disturbances in the properties of substrate wafers that typically fall
into three categories: 1) point defects, 2) dislocations, and 3) surface
imperfections
▪ Spots: mainly lithography-related disturbances caused by the mask
fabrication process

The normal distribution
▪ The parameters produced by most process steps follow a normal distribution, e.g.,
▪ Thickness of deposition of a material
▪ Width of a line
Self-study: Look into the normal distribution, including its mean and variance
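Once a parameter is modeled as normally distributed, the fraction of parts that fall inside the specification limits follows directly from the normal cumulative distribution function. The sketch below is a minimal illustration that assumes a hypothetical line-width process with mean 100 nm, standard deviation 5 nm, and limits of 90 to 110 nm (that is, ±2σ); `normal_cdf` and `fraction_within_spec` are illustrative names.

```python
import math


def normal_cdf(x: float, mean: float, sigma: float) -> float:
    """Cumulative distribution function of a normal distribution, via math.erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def fraction_within_spec(mean: float, sigma: float, lsl: float, usl: float) -> float:
    """Fraction of parts whose parameter lies between the lower and upper spec limits."""
    return normal_cdf(usl, mean, sigma) - normal_cdf(lsl, mean, sigma)


# Hypothetical process: mean 100 nm, sigma 5 nm, spec limits 90..110 nm
print(f"{fraction_within_spec(100.0, 5.0, 90.0, 110.0):.4f}")
```

With ±2σ limits about 95.45% of parts are in spec; widening or narrowing the limits, or shrinking σ, changes that fraction and hence the parametric yield.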

Normal distribution and acceptability
[Figure: normal distribution of metal 2 wire thickness, measured as sheet resistance (Ω/□), with the acceptable range and the unacceptable parameter values marked]
Why might thicker wires be more acceptable than thinner ones?

Read only
Variation of solder thickness for a PCB

4. Errors in computation
Defects, faults and errors
⚫ A die defect is an unintended variation in the
physical aspects of a die, such as shown earlier.
⚫ A fault is a defect that can create an error in
computation. This can occur when a logic 1
erroneously becomes a logic 0 or vice versa, or
when a transition in logic value occurs too early or
too late. Faults are usually associated with a common
form of defect, such as an open.
⚫ An error occurs when a signal has the wrong logic value
for a certain period of time. Not all errors create
wrong values in state variables.

Sources of error producing entities
▪ Design error - e.g., the logic design has an
error, or the layout has an error, etc.
▪ These problems should have been caught during
design verification
▪ Producing a new chip consists primarily of design,
verification and test
▪ Such errors occur in almost all large VLSI designs
▪ Such errors are analogous to a bug in a computer
program
▪ Design errors are not considered in this course
▪ All copies of the component have this same error

More sources of error producing entities
⚫ Fabrication error/mistake/fault - e.g., someone wired up a system improperly, or
inserted the wrong component
◦ These problems only occur in some components
◦ Fabrication errors are not dealt with in this
course

More sources of error producing entities
▪ Fabrication defects - these are defects due to the
manufacturing process. They are primarily due to
▪ Process variations
▪ Spot defects due to deposition, etching or impurities
▪ Physical failures - these occur during the normal
lifetime of a system due to wear-out mechanisms like
metal migration and environmental factors
(temperature, vibration, etc.)
Yes - these are the ones we will focus on in this class

Characteristics associated with defects
⚫ Variation of defect occurrence with respect
to time
● Permanent- always present
● Intermittent – comes and goes – due to an internal
problem, e.g., vibration or temperature. Very common
and hard to diagnose. Semi-predictable, e.g., always
occurs if T > 100 °C.
● Transient – usually a random (non-predictable)
occurrence due to an external source, e.g., a high-energy
alpha particle in space, or a power surge.
(Soft-error rate (SER) is currently a major field of
study)
This class will only deal with permanent defects

Effect of defect on logic and interconnect
⚫ The defect might affect the logical function of a circuit
component, e.g., a NAND gate operating as an inverter. That
is, F = NOT(A•B) operates as F = NOT(A).
⚫ The defect might affect the interconnection pattern (topology)
of a circuit, e.g., creating an open circuit or a short.
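The NAND-operating-as-inverter example can be checked by comparing truth tables: a defect that changes the logic function can only be exposed by an input pattern on which the fault-free and faulty functions differ. A minimal sketch, with the function names chosen for illustration:

```python
from itertools import product


def nand(a: int, b: int) -> int:
    """Fault-free gate: F = NOT(A AND B)."""
    return 1 - (a & b)


def faulty_as_inverter(a: int, b: int) -> int:
    """Defective gate modeled as F = NOT(A); input B is ignored."""
    return 1 - a


# Input patterns whose responses differ between the good and the defective gate
detecting = [(a, b) for a, b in product((0, 1), repeat=2)
             if nand(a, b) != faulty_as_inverter(a, b)]
print(detecting)
```

Only A = 1, B = 0 exposes this defect; the other three patterns give the same response for both gates.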

More effects due to defects
⚫ The defect might affect the operating speed of a
circuit, e.g., circuit has too much delay. May be
possible to operate the circuit “correctly” if you
slow down the clock. (See speed binning)
⚫ The defect might affect the voltage and/or
current levels of the signals, e.g., the threshold
voltage levels.
⚫ While not a behavioral effect, the number of
distinct defects that are present simultaneously
is important. (Single vs. multiple faults)

5. Yield
The problem is process variations and defects
⚫ As scaling approaches molecular limits
◦ Process variation is very significant
◦ It is harder to control spot defects just by having
clean rooms
◦ The wavelength of light used in lithography is larger
than the feature size -- lithography is currently a
fundamental limitation to scaling

Yield and other terms
⚫ Yield (Y): The yield of a manufacturing process is the fraction of
manufactured components (printed circuit board, chips, radios,
etc.) that have “no failures,” i.e., no error-producing
manufacturing/packaging anomaly. That is, the component meets all
the specifications. Usually we cannot determine this value of yield,
so we back off on the definitions and focus on detected
error-producing anomalies and hence try to identify components
that appear to meet all specs.
⚫ Defect level (DL or d): The fraction of bad components
certified as good after testing. These are sent to unhappy
customers. We want to minimize the defect level.
◦ Example: DL = 0.000050 implies that, on average, there are
about 50 bad components per million (10^6) components.
⚫ Yield loss (YL): The fraction of good components certified as
bad and thus discarded. This should be minimized too.

Why does yield-loss occur?
⚫ Testing is costly, so we usually do less than a perfect job; to compensate, we try to err
on the side of caution, i.e., we are pessimistic. The engineering concept
is called “guard banding”.
⚫ Example: say a chip is good if it runs at 500 MHz, at a temperature of …,
humidity of …
◦ At what frequency should we test our chips?
◦ What if the user’s clock varies from 480 to 510 MHz?
◦ We must do this fast to keep the cost of testing low, so the
chip will not cost too much.
⚫ Assume a chip responds erroneously to a sequence of binary patterns
that can never occur under actual functional operation - is this chip
good or bad?
⚫ IDDQ testing (testing of the quiescent Idd, or quiescent power-supply
current) usually has some yield loss
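The trade-off behind guard banding can be illustrated by choosing different test frequencies for the same set of chips. The sketch assumes hypothetical measured maximum frequencies (fmax) and the 500 MHz criterion from the example above; a chip passes if its fmax is at least the test frequency. Testing above the spec turns marginally good chips into yield loss, while testing below it lets slow chips escape to customers.

```python
# Hypothetical maximum operating frequencies (MHz) measured on seven chips
FMAX_MHZ = [470.0, 495.0, 502.0, 505.0, 512.0, 530.0, 560.0]
SPEC_MHZ = 500.0  # a chip is "good" if it runs at 500 MHz (from the example above)


def classify(fmax_values, test_mhz):
    """Return (yield loss count, escape count) when every chip is tested at test_mhz."""
    yield_loss = sum(1 for f in fmax_values if f >= SPEC_MHZ and f < test_mhz)
    escapes = sum(1 for f in fmax_values if f < SPEC_MHZ and f >= test_mhz)
    return yield_loss, escapes


for test_mhz in (490.0, 500.0, 510.0):
    loss, esc = classify(FMAX_MHZ, test_mhz)
    print(f"test at {test_mhz:.0f} MHz: {loss} good chips rejected, {esc} bad chips shipped")
```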

Yield learning
[Figure: yield learning curves, yield (up to 100%) versus time in years from the beginning to the end of mass production, for the X nm and 0.7X nm technologies]

Read only
Future fabrication issues

⚫ Global distributions and spot defects both contribute to yield loss. As a
process matures, better control is achieved over the process
parameters and variations. Spot defects are also reduced,
but not to the same degree. Thus, in the early 1990s,
spot defects were the dominant cause of yield loss in
mature processes.
⚫ In future CMOS technologies, it appears that process
variations will be a greater problem than spot defects.
⚫ What do you think will be the major manufacturing issues
related to
◦ Quantum computing devices
◦ Biological computing devices
◦ DNA computing

Metal migration
▪ Metal migration occurs due to a high current density in a conductor.
Metal atoms actually get pushed along by the electron flow, so the conductor gets thinner and
eventually becomes an open circuit.

Read only
The bathtub curve

⚫ The operating life cycle distribution of a population of
devices may be modeled as a bathtub curve, if the failures
are plotted on the y-axis against the operating life on the x-axis.
The bathtub curve shows that the highest failure rates
experienced by a population of devices occur during the early
stage of the life cycle, or early life, and during the wear-out
period of the life cycle. Between the early life and wear-out stages
is a long period wherein the devices fail very sparingly.
[Figure: bathtub curve of failures versus operating life, x-axis marked 0 hrs, 100 hrs, 200 hrs, with a region labeled "Manufacturing failures"]
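One common way to produce a bathtub-shaped failure rate is to add three hazard terms: a decreasing Weibull hazard for early-life failures (shape β < 1), a constant rate for random failures, and an increasing Weibull hazard for wear-out (β > 1). The Weibull hazard is h(t) = (β/η)(t/η)^(β-1). This is a sketch under that modeling assumption only; every parameter value below is hypothetical and not fitted to any real device data.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Weibull hazard (failure) rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Sum of early-life, random, and wear-out hazards (hypothetical parameters)."""
    infant = weibull_hazard(t_hours, beta=0.5, eta=1e4)    # decreasing: early life
    random_failures = 1e-5                                  # constant: useful life
    wear_out = weibull_hazard(t_hours, beta=5.0, eta=1e5)  # increasing: wear-out
    return infant + random_failures + wear_out


for t in (1.0, 100.0, 1e4, 1e5, 2e5):
    print(f"t = {t:>9.0f} h: failure rate = {bathtub_hazard(t):.2e} per hour")
```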

6. Burn-in
What is Burn-in?
⚫ Wafer Burn-in is a form of accelerated aging
◦ Increase temperature
◦ Increase voltage
◦ Shake-rattle and roll!
◦ Power on device
⚫ Try to break “weak” parts of a die
⚫ Try NOT to break “good” parts of a die

Read only
Burn-in
⚫ Burn-in is an electrical stress test that employs voltage and
temperature to accelerate in time the electrical failure of a
device. Burn-in essentially simulates the operating life of the
device, since the electrical excitation applied during burn-in may
mirror the worst-case bias that the device will be subjected to
in the course of its usable life. Depending on the burn-in
duration used, the reliability information obtained may pertain
to the device's early life or its wear-out. Burn-in may be used as
a reliability monitor or as a production screen to weed
out potential infant mortalities from the lot.

Read only
Burn-in (cont.)
⚫ Burn-in is usually done at 125 °C, with electrical
excitation applied to the samples. The burn-in process is
facilitated by using burn-in boards (see Fig. 1) where the
samples are loaded. These burn-in boards are then
inserted into the burn-in oven (see Fig. 2), which supplies
the necessary voltages to the samples while maintaining
the oven temperature at 125 °C. The electrical bias
applied may either be static or dynamic, depending on
the failure mechanism being accelerated.
Figure 1: Bare and socket-populated burn-in boards

Burn-in units
Figure 2: Burn-in ovens
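How much field time a burn-in "simulates" is often estimated with the Arrhenius temperature acceleration model, AF = exp[(Ea/k)(1/T_use - 1/T_stress)], with temperatures in kelvin. This sketch models only the temperature part (voltage acceleration is ignored), and the activation energy Ea = 0.7 eV, the 55 °C use temperature, and the 48-hour burn-in duration are hypothetical values chosen for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Arrhenius acceleration factor between use and stress temperatures (given in Celsius)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k))


# Hypothetical values: Ea = 0.7 eV, field use at 55 C, burn-in at 125 C for 48 hours
af = acceleration_factor(0.7, 55.0, 125.0)
burn_in_hours = 48.0
print(f"AF = {af:.1f}; {burn_in_hours:.0f} h of burn-in ~ {af * burn_in_hours:.0f} h of field use")
```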

7. Wafer testing
A wafer and test structures
[Figure: test structures on a wafer: wires, oscillators, transistors, gates, F/Fs, etc.]

Wafer testing
From Wikipedia, the free encyclopedia
⚫ Wafer testing is a step performed during semiconductor
device fabrication. During this step, performed before a
wafer is sent to die preparation, all individual integrated
circuits that are present on the wafer are tested for
functional defects by applying special test patterns to
them. The wafer testing is performed by a piece of test
equipment called a prober, and the process is sometimes
referred to as a probe test or wafer sort.
⚫ When all test patterns pass for a specific die, its position is
remembered for later use during IC packaging. A die that
does not pass all test patterns is usually considered to be
faulty and is thrown away. Non-passing circuits are typically
marked with a small dot of paint in the middle of the die.

Wafer testing (cont.)
⚫ In some very specific cases, a die that passes some but not all test
patterns can still be used as a product, typically with limited
functionality. The most common example of this is a microprocessor
for which only one part of the on-die cache memory is functional. In
this case, the processor can sometimes still be sold as a lower cost part
with a smaller amount of memory and thus lower performance.
⚫ The contents of all test patterns and the sequence by which they are
applied to an integrated circuit are called the test program.
⚫ After IC packaging, a packaged chip will be tested again during the
IC testing phase, usually with the same or very similar test patterns. For
this reason, one might think that wafer testing is an unnecessary,
redundant step. In reality this is not the case, since the removal of
defective dies saves the considerable cost of packaging faulty devices.
However, when the production yield is so high that wafer testing
is more expensive than the packaging cost of defective devices, the wafer
testing step can be skipped altogether and dies will undergo blind
assembly.
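The skip-or-test decision described above can be written as a simple per-die comparison: wafer testing pays off when its cost per die is less than the expected packaging cost it saves, (1 - Y) × packaging cost per die. This sketch assumes wafer sort catches every bad die and ignores other costs (die cost, final-test time); the function name and the cost figures are hypothetical.

```python
def wafer_test_pays_off(yield_fraction: float,
                        wafer_test_cost_per_die: float,
                        package_cost_per_die: float) -> bool:
    """True if wafer sort is cheaper than packaging the bad dies it would have removed."""
    expected_savings = (1.0 - yield_fraction) * package_cost_per_die
    return wafer_test_cost_per_die < expected_savings


# Hypothetical costs: $0.10 per die to wafer-test, $2.00 per die to package
print(wafer_test_pays_off(0.80, 0.10, 2.00))  # lower yield: wafer testing saves money
print(wafer_test_pays_off(0.98, 0.10, 2.00))  # very high yield: blind assembly is cheaper
```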

Probing a few pads of a die on a wafer

Cutting a wafer into dice

Probing a die

A bare-die test system
• ATE
• Die under test
• Test fixture
• Test head electronics

Summary (through part 7)
▪ Manufacturing VLSI chips having billions of transistors and wires
▪ A sizeable fraction of these manufactured chips are bad
▪ Separating the good from the bad is the role of testing
▪ In practice chips are often partitioned, via testing, into 20-30 bins,
depending on which parts of them are good
▪ Test equipment is expensive; hence testing is costly if it takes too long
▪ Determining tests that result in low defect levels and low yield loss is
very difficult and is a major focus of this course
▪ Designing chips so that test development is easier, or that the chip
actually tests itself, is also a major part of this course

8. Testing – Introduction
Various stages of testing
1. During wafer processing many checks take place
▪ Thickness of layers, chemicals, etc.
2. Wafer testing using probes
▪ Test simple test structures next to scribe lines, such as ring
oscillators, simple logic cells, etc.
▪ Burn-in and stress testing
▪ Die on wafers are usually powered on, and sometimes tests are run, while
temperature and humidity are carefully controlled; in the ovens, results
are not observed - the goal is to weed out infant mortality
▪ Test each die - not extensively because of head impedance and
lack of access points
▪ Label die accordingly
3. Package what you think are good die
4. Package testing - comprehensive
▪ Check packaging (interconnect) and logic
5. If applicable, system testing - usually functional

Read only
Terminology
▪ Defects and Abnormal Processing Situations (DAPS): Something
unintended in the manufacturing process, e.g., extra metal, insufficient material
in a via, a pin hole in a conductor, etc.
▪ Defects and abnormal processing situations can affect the behavior and/or
performance of a chip, e.g., make it run slow, create errors at outputs, etc.
▪ But some DAPS have no noticeable effect on the operation of a circuit
▪ Errors: An error (usually binary) is a response that does not meet the
specification for the device.
▪ Failure: A physical part of a circuit affected by a DAPS that can result in
errors.
▪ Fault model: An abstract characterization of a failure, e.g., a line is stuck-at-1.
▪ Fault (failure) detection: Determining whether or not a circuit appears to
have a fault - actually a failure
▪ Fault (failure or defect) diagnosis: Determining the details of the DAPS
that resulted in the failure

Animation
More terminology
▪ Testing: A way to identify a faulty system, component or circuit.
Testing is an experiment in which a system is exercised (inputs
applied) and resulting responses analyzed to determine if the
behavior is correct. If not, then this implies a failure. (We assume
the design is correct.)
▪ Design-for-test and built-in self-test: Modifications made to a
design to make testing (fault detection and diagnosis) easier.
▪ One or the other or both are universally applied
▪ Automatic test pattern generation: A program that generates
test data to apply to a chip during fault detection and diagnosis.

Important aspects of testing
⚫ Quality – works properly
⚫ Reliability – will continue to work properly for
at least a reasonable amount of time (usually
7-10 years)
⚫ Maintainability – is easy to fix if it fails
◦ But we are moving into a throw-away society where reparability is
becoming less important
This course deals primarily with the issue of
Quality, a little with Maintainability, and
nothing about Reliability
Discussion
Some food for thought

▪ If the behavior during testing is correct,
does this imply there is no failure?
▪ Yes - what are your arguments?
▪ No - why not?
▪ If the behavior during testing is incorrect,
is the component usable?
▪ Yes - how so?
▪ No - why not?

Discussion
Buying a new car

⚫ Car is minimally tested upon
assembly
⚫ Cars come with a 5-year,
50,000-mile warranty
⚫ After 3000 miles you are
expected to bring it to the
dealer for an oil change and to
have your list of problems
fixed for free
◦ The horn stopped working
after 3 weeks
◦ The cigarette lighter doesn't
work
◦ Etc.
⚫ After 4 months, the
transmission goes out and they
fix it for free
How and why is this scenario different from buying a new
high-performance processor chip?

Discussion
Failures create errors

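[Figure: example input patterns 0011 and 0101 applied to a faulty circuit, with erroneous response values marked using 0/1 and D notation]
⚫ failures ≈ faults - are assumed to create errors, i.e.,
eventually under the right input conditions, the
response is wrong.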
⚫ What should a system do if a failure first occurs in the
field?
⚫ How can we detect errors?
⚫ How can we detect failures?
⚫ Etc.
8. Testing – Modeling Defects as faults
Modeling defects: Three schools of thought
1. Model physical anomalies (DAPS) as logical and/or timing faults,
e.g., a wire that might be open is modeled as a signal line always at the
value of 0. Look at the logic design. Develop a test that exercises this
fault, that is, one that creates an observable error if the defect is present.
This must be repeated for every fault in the circuit that corresponds
to this model.
2. Consider only “functional” faults, e.g., does a decoder decode
properly; does a FF reset properly; etc. Look at small building blocks
such as adders, decoders and multiplexers and test their functionality.
3. Carry out functional or behavioral testing – does the circuit
carry out its primary functions
◦ Ignore the actual hardware implementation, hence ignore defects
and faults
◦ For a processor, execute programs
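The first school of thought can be made concrete with the single stuck-at fault model: each line is assumed stuck-at-0 or stuck-at-1 in turn, and a test is any input pattern whose faulty response differs from the fault-free one. The sketch below does this for a 2-input NAND (lines A, B, Z) by brute-force search over all input patterns, which is only feasible because the circuit is tiny; real test generators use algorithms such as the D-algorithm or PODEM. The simulator and its fault encoding are illustrative, not a standard tool's API.

```python
from itertools import product

LINES = ("A", "B", "Z")


def simulate_nand(a: int, b: int, fault=None) -> int:
    """Evaluate Z = NAND(A, B); fault is a (line, stuck_value) pair or None."""
    def force(line: str, value: int) -> int:
        return fault[1] if fault is not None and fault[0] == line else value
    a = force("A", a)
    b = force("B", b)
    return force("Z", 1 - (a & b))


faults = [(line, v) for line in LINES for v in (0, 1)]
tests = {}
for fault in faults:
    # Every input pattern whose response differs from the fault-free response detects the fault
    tests[fault] = [p for p in product((0, 1), repeat=2)
                    if simulate_nand(*p) != simulate_nand(*p, fault=fault)]

for (line, value), patterns in tests.items():
    print(f"{line} s-a-{value}: detected by {patterns}")
```

Every fault has at least one detecting pattern, and the three patterns (0, 1), (1, 0), (1, 1) together detect all six faults.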

1. Classical physical fault model
A defective NAND gate: Case 1
[Figure: CMOS NAND gate with pull-up transistors P1 and P2, pull-down transistors N1 and N2, inputs In 1 and In 2, output Out, and short resistance R]
⚫ A manufacturing defect has shorted the drain and source of P1 together (rsh: resistance of the short)
◦ Steady-state behavior (In1 = In2 = Vdd): behavior similar to SA1 at line Out
◦ Transient behavior (In1 = GND->VDD, In2 = Vdd): the defect affects transient behavior
[Figure: V(Out) and V(Out2) waveforms for rsh = 2k, 5k, and 8k]
The defect causes extra delay
A resistive open at the drain of N1
[Figure: V(Out) in response to In1 for rsh = 0, 4k, and 8k]
The defect causes substantial extra delay

Read only
More on classical physical faults

▪ If either input (A or B) is
disconnected, the input node
initially charges to VCC. Thus
the circuit acts as an inverter
w.r.t. the other input. Hence
a disconnected input in this
technology “looks like” a
stuck-at-1.
▪ An “open” in the 4 kΩ bias
resistor causes Y to appear to
be stuck-at-1.
▪ A short in this same resistor
does not result in the output
being stuck-at-1 or stuck-at-0.
[Figure: A transistor-level diagram of a 7400 NAND gate relating physical defects to logic faults]

Read only
More on classical physical faults

[Figure: A 3-input diode NAND gate with inputs A, B, and C (diode D1 on input A), resistors R1, R2, R3, and RL, supplies VH and VL, node labels V1 and V2, and output Z]
⚫ If diode D1 is open, then this circuit operates as a 2-input
NAND gate with inputs B and C. This is
equivalent to A = 1, so we can model this fault as A
stuck-at-1 (A s-a-1)

Read only
More on physical faults and shorts

[Figure: the 3-input NAND gate with diode D1 shorted, and the resulting equivalent circuit (labels VH, VL, V1, V2, RL, R1, R2, R3, B, C, Z)]
▪ If D1 is shorted, then we get the equivalent circuit shown above.
How would you model this fault?

Observations!
▪ Effect of actual defects is technology
dependent
▪ There are an infinite number of possible
defects (type and degree, like shorts and
their resistance)
▪ In general, we abstract these defects into
gross categories

2. Functional fault modes and models
▪ What does a FF do?
▪ It sets, resets, holds
▪ We expect the Q and Qbar outputs to have opposite values, and a certain
propagation delay from clock edge to output change
▪ What about a decoder?
▪ We expect the correct output to be high and all others low
▪ So, d0 = d2 = 1 is an error
▪ So X0=X1 = 1 and d3 = 0 is an error
▪ A functional test would attempt to verify that a device carries out its intended
function
▪ What about testing for illegal or non-functional inputs?
▪ Would this have to be an exhaustive test?
X0 d0
d1
2X4
d2
X1
d3
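As a sketch of the functional check described above for the 2×4 decoder, the code below applies every input combination and checks that exactly the selected output is high. The functions `decoder_2x4`, `functional_test`, and the "d2 stuck-at-1" defect in `faulty_decoder` are hypothetical illustrations, not taken from the lecture.

```python
from itertools import product


def decoder_2x4(x1, x0):
    """Fault-free 2x4 decoder: output d_k is 1 iff (x1 x0) encodes k."""
    k = 2 * x1 + x0
    return tuple(1 if i == k else 0 for i in range(4))  # (d0, d1, d2, d3)


def faulty_decoder(x1, x0):
    """Hypothetical defective decoder: output d2 stuck-at-1."""
    d = list(decoder_2x4(x1, x0))
    d[2] = 1
    return tuple(d)


def functional_test(decoder):
    """Apply all four input combinations and verify the intended function:
    the selected output is high and all others are low.
    Returns the failing (x1, x0, outputs) triples; empty means pass."""
    failures = []
    for x1, x0 in product((0, 1), repeat=2):
        outputs = decoder(x1, x0)
        selected = 2 * x1 + x0
        if sum(outputs) != 1 or outputs[selected] != 1:
            failures.append((x1, x0, outputs))
    return failures


print(functional_test(decoder_2x4))          # []
print(len(functional_test(faulty_decoder)))  # 3
```

The faulty decoder passes only for input 10 (where d2 should be high anyway). For input 00 it produces d0 = d2 = 1, which is exactly the error case listed on the slide.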

3. Behavioral fault modes and models
⚫ No fault model is used
⚫ Ignore implementation issues
⚫ There are many different implementations, say C1, C2, … , Cp, for a
function f(x1, x2, …, xn). Is a “good” test for C1 necessarily a good test
for C2?
⚫ Exhaustive testing!
⚫ Let’s test f or C1 exhaustively!
◦ What is an exhaustive test for f?
◦ What effects are detected?
◦ How many patterns are in this test?
◦ What types of effects might not be detected?
◦ Is there a test that is non-exhaustive and is good for all the Ci’s?
⚫ If you did test a component in a circuit exhaustively, could you observe
the response? How?
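A minimal sketch of an exhaustive test for an n-input combinational function f: apply all 2^n input vectors and compare each implementation against the specification. The function (3-input odd parity) and the implementations `c1`, `c2`, and `c2_faulty` are hypothetical examples chosen for illustration. Such a test checks only the static logic function. It says nothing about timing (delay) behavior, and it assumes the faulty circuit is still combinational.

```python
from itertools import product


def exhaustive_patterns(n):
    """All 2**n input vectors for an n-input combinational function."""
    return list(product((0, 1), repeat=n))


def f(a, b, c):
    """Specification: 3-input odd parity."""
    return a ^ b ^ c


def c1(a, b, c):
    # Implementation C1: two cascaded XOR gates
    return (a ^ b) ^ c


def c2(a, b, c):
    # Implementation C2: sum of products over the four odd-parity minterms
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (a & nb & nc) | (na & b & nc) | (na & nb & c) | (a & b & c)


def c2_faulty(a, b, c):
    # Hypothetical defect in C2: the product term for minterm 111 is missing
    na, nb, nc = 1 - a, 1 - b, 1 - c
    return (a & nb & nc) | (na & b & nc) | (na & nb & c)


def exhaustive_test(circuit, spec, n):
    """Return the input vectors on which the circuit and the spec disagree."""
    return [x for x in exhaustive_patterns(n) if circuit(*x) != spec(*x)]


print(exhaustive_test(c1, f, 3), exhaustive_test(c2, f, 3))  # [] []
print(exhaustive_test(c2_faulty, f, 3))                      # [(1, 1, 1)]
```

The same 2^3 = 8 patterns serve as a test for both C1 and C2, because the test is defined by f alone and ignores the implementation.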

Read only
More on exhaustive testing
▪ Consider a finite state machine (FSM) M having n input wires, m output wires, and q flip-flops. (You had better know what an FSM is!)
▪ What is meant by an exhaustive test for M?
▪ How long is this test (in patterns or clock cycles)?
▪ What is the function or behavior of a microprocessor?
▪ Does a behavioral test consist of executing all possible operations (op codes) with all possible operands (data)? If so, a 32-bit adder would be tested using $2^{32} \times 2^{32} = 2^{64}$ data values!
▪ How long would this take if an 800 MHz μp could carry out 4 instructions per clock cycle?
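The last question is plain arithmetic, sketched below. It assumes one addition per instruction and no overhead for loading operands or checking results, so the real time would be longer.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years


def exhaustive_adder_time(bits, clock_hz, instr_per_cycle):
    """Seconds needed to apply every operand pair to a `bits`-bit adder,
    assuming one addition per instruction and no other overhead."""
    patterns = 2 ** bits * 2 ** bits  # 2^32 x 2^32 = 2^64 operand pairs
    return patterns / (clock_hz * instr_per_cycle)


secs = exhaustive_adder_time(32, 800e6, 4)
print(f"{secs:.3e} s = {secs / SECONDS_PER_YEAR:.0f} years")  # 5.765e+09 s = 183 years
```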

I assume you know all about:
⚫ Gates, flip-flops and latches
⚫ MUXes, decoders, encoders, ALUs, etc.
⚫ Review EE 101, a freshman course
⚫ Some theory, such as DeMorgan’s law, K-maps, BSFs,
prime implicants, etc.

8. Testing – Quality of a Test
Determining the quality of a test
⚫ An innumerable number of failures producing DAPS (defects) can occur that make a device produce erroneous results. Normally we attempt to characterize the various failures into categories via a failure/fault model. We attempt to use models that relate well to the most common defects that might occur, but our classical fault models, such as single stuck-at and bridging faults, appear to be becoming inadequate for capturing the effects of new fault mechanisms, such as crosstalk, current leakage, delay, and intermediate voltage levels.
⚫ We still use these models, however, because
◦ They are well understood
◦ We have lots of tools that support these models
◦ They give us reasonable results
Boy, these seem like weak excuses!
What do we mean by Quality?

Fault coverage
[Figure: Venn diagram of the failure sets D, D*, and D**]
▪ Let D be the universe of all failures in a circuit C.
▪ Let D* ⊆ D be those failures that correspond to model M, e.g., a bridge or open.
▪ Let T be a test, and let D** ⊆ D be the set of failures that T detects. Note: T might also detect failures not in D*.
▪ T is associated with a fault coverage value, FC, w.r.t. T, M and C, where
FC = (|D** ∩ D*| / |D*|) × 100%
▪ Given T, we can usually determine FC(T) with respect to various models M.
▪ Sometimes we generate different tests for different models, such as T_delay, T_s-a, T_short, etc.
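The fault-coverage formula maps directly onto set operations. In the sketch below the fault labels are hypothetical (for example, "E/0" stands for line E stuck-at-0). The bridging fault detected by T lies outside D*, so it does not raise the coverage with respect to the stuck-at model.

```python
def fault_coverage(detected, modeled):
    """FC = |D** ∩ D*| / |D*| * 100%.

    detected -- D**, the failures that test T detects
    modeled  -- D*, the failures that correspond to fault model M
    """
    if not modeled:
        raise ValueError("D* (modeled faults) must be non-empty")
    return 100.0 * len(detected & modeled) / len(modeled)


# Hypothetical fault labels, for illustration only
d_star = {"E/0", "E/1", "G/0", "G/1", "F/0", "F/1"}   # stuck-at faults (model M)
d_2star = {"E/0", "G/1", "F/0", "F/1", "bridge(E,G)"}  # failures detected by T
print(fault_coverage(d_2star, d_star))  # 66.66666666666667
```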

Observations regarding the next graph
⚫ Let T be a test procedure.
⚫ Defect coverage (DC): The defect coverage of a test T is the probability that T detects the existence of any failure (a defect that induces an error) in a circuit. In other words, if a failure exists in a circuit, the defect coverage is the probability that T will detect the failure.
⚫ Defect level (DL): The defect level of a manufactured product (after testing) is the probability of shipping a defective product. With Y the yield and DC the defect coverage,
$DL \approx 1 - Y^{(1-DC)}$   (2)
⚫ If you don’t test, then DC = 0 and DL = 1 - Y.
⚫ If testing is perfect, then DC = 1 and DL = 0.

Defect level vs. defect coverage
⚫ For DC = 0.8 and Y = 0.75, DL ≈ 0.06. So about 6% of the parts shipped are bad! This is unacceptable.
⚫ IC manufacturers want to ship fewer than 50 bad parts per million, i.e., 50 DPM. If Y = 0.75, what value of DC is required?
⚫ What is the relationship
between DL and DPM?
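Equation (2) can be evaluated directly and also inverted for DC. The sketch below uses DPM = DL × 10^6 and the natural logarithm; the function names are illustrative only. For Y = 0.75 and a 50 DPM target, it gives a required defect coverage of about 99.98%.

```python
import math


def defect_level(y, dc):
    """Equation (2): DL ≈ 1 - Y**(1 - DC)."""
    return 1.0 - y ** (1.0 - dc)


def required_coverage(y, dpm_target):
    """Solve 1 - Y**(1 - DC) = DPM / 10**6 for DC."""
    dl = dpm_target / 1e6
    return 1.0 - math.log(1.0 - dl) / math.log(y)


print(round(defect_level(0.75, 0.8), 3))            # 0.056
print(round(100 * required_coverage(0.75, 50), 2))  # 99.98
```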

Defect level as a function of defect coverage and yield
[Figure: defect level vs. defect coverage of test strategy (%), for several yields]
Example: If our goal is a quality level of 100 DPM and our yield is 70%, then the defect coverage should be 99.99%.
But all the books talk about fault coverage (FC). What’s up, Doc!

Animation
How good is enough

⚫ Clearly it is important to obtain very high quality tests.
⚫ Normally we seek a fault coverage FC (FC ≤ DC) that is above 98%.
⚫ Unlike some other products, where statistical sampling and testing is done (e.g., testing every 100th part), usually all ICs must be tested because of spot defects and process variations. Thus testing is a large part of the recurring cost of an IC. In some cases it represents 50% of this cost, and this percentage is growing year by year.

8. Testing – Generating and applying a test

An example of a test for a fault
[Figure: a circuit with inputs A, B, C, D, internal lines E and G, and output F. The test values are marked on the lines, with the error shown as 1/0 (good/faulty) at E and propagated toward the output F]
We say that A = x, B = 0, C = x, D = 1 is a test for the fault “line E stuck-at-0” in the circuit shown.
▪ We activated the fault, i.e., we created the initial error; this requires controllability.
▪ We propagated (sensitized) the error.
▪ We observed an error at an output; this requires observability.
How many binary patterns are actual tests for this fault?
If we apply the test pattern 0001 and observe a 1 at F, does this tell us that E is stuck-at-0?
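A test cube with k don't-care (x) positions stands for 2^k fully specified binary patterns, each of which is a test for the fault. The sketch below expands the slide's cube in input order A, B, C, D. Other cubes may also detect the fault, so this lists only the patterns covered by this one cube, which is a partial answer to the first question.

```python
from itertools import product


def expand_cube(cube):
    """Expand a test cube over {'0', '1', 'x'} into every binary pattern it covers."""
    choices = [("0", "1") if v == "x" else (v,) for v in cube]
    return ["".join(bits) for bits in product(*choices)]


# Test cube from the slide, in input order A, B, C, D
print(expand_cube("x0x1"))  # ['0001', '0011', '1001', '1011']
```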

Testing a circuit
A test (X, Z):
X – input sequence
Z – correct output sequence
[Figure: X is applied to the unit under test (UUT), also called the circuit under test (CUT), producing the actual response Z*]
1. Z* = Z iff UUT is good
2. Z* ≠ Z iff UUT is faulty
▪ Conditions (1) and (2) are equivalent.
▪ Finding Z given X and the UUT is usually not too hard (simulate the design or run a golden part).
▪ Finding Z* given X, the UUT and a specific fault model is usually easy: fault simulation.
▪ Finding X given the UUT so that (1) is true is very hard.
Generating X automatically given the UUT description (netlist) and a fault model is called Automatic Test Pattern Generation (ATPG).
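A minimal fault-simulation sketch: evaluate a gate-level netlist once fault-free (giving Z) and once with a stuck-at fault injected (giving Z*), then compare. The three-gate netlist below is hypothetical and is not the circuit from the earlier example. A stuck-at fault is modeled by forcing a line to a constant value.

```python
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
}

# Hypothetical netlist (line, gate, in1, in2), listed in topological order
NETLIST = [
    ("E", "NAND", "A", "B"),
    ("G", "AND", "C", "D"),
    ("F", "OR", "E", "G"),
]


def simulate(inputs, fault=None):
    """Evaluate NETLIST for one input vector and return output F.
    fault -- optional (line, value) stuck-at fault, e.g. ("E", 0)."""
    values = dict(inputs)
    if fault and fault[0] in values:      # stuck-at fault on a primary input
        values[fault[0]] = fault[1]
    for line, gate, a, b in NETLIST:
        values[line] = GATES[gate](values[a], values[b])
        if fault and fault[0] == line:    # stuck-at fault on a gate output
            values[line] = fault[1]
    return values["F"]


x = {"A": 0, "B": 1, "C": 0, "D": 1}
z = simulate(x)                 # good response Z
z_star = simulate(x, ("E", 0))  # response Z* of the UUT with E s-a-0
print(z, z_star, "detected" if z != z_star else "not detected")  # 1 0 detected
```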
An ATPG system
[Figure: the new circuit description, the ATE description, and the fault models to address with the desired coverage feed an ATPG system (software), which produces fault dictionaries, test statistics, and a test program for the ATE]
One of our goals is to design and build an ATPG system
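A deliberately naive ATPG sketch: for each single stuck-at fault, search all 2^n input patterns for one whose good and faulty outputs differ, then report the resulting fault coverage as a test statistic. The netlist is hypothetical. Exhaustive search is used only because the circuit is tiny; it grows as 2^n and does not scale to real designs.

```python
from itertools import product

# Hypothetical netlist: line -> (gate, in1, in2), in topological order
NETLIST = {
    "E": ("NAND", "A", "B"),
    "G": ("AND", "C", "D"),
    "F": ("OR", "E", "G"),
}
OPS = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
       "NAND": lambda a, b: 1 - (a & b)}
PIS = ["A", "B", "C", "D"]


def output(pattern, fault=None):
    """Output F for one input pattern, optionally with a (line, value) stuck-at fault."""
    v = dict(zip(PIS, pattern))
    if fault and fault[0] in v:
        v[fault[0]] = fault[1]
    for line, (op, a, b) in NETLIST.items():  # dicts keep insertion order
        v[line] = fault[1] if fault and fault[0] == line else OPS[op](v[a], v[b])
    return v["F"]


def naive_atpg(faults):
    """For each fault, return the first pattern that detects it (None if untestable)."""
    return {fault: next((p for p in product((0, 1), repeat=len(PIS))
                         if output(p) != output(p, fault)), None)
            for fault in faults}


faults = [(line, sv) for line in PIS + list(NETLIST) for sv in (0, 1)]
tests = naive_atpg(faults)
detected = sum(t is not None for t in tests.values())
print(f"fault coverage: {100 * detected / len(faults):.1f}%")  # fault coverage: 100.0%
```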

What is Design-for-Test (DFT)?
[Figure: X (from the ATPG system) is applied by the ATE to the UUT, which has DFT added and may contain a fault f; the ATE observes Z*]
▪ X and Z are derived from the ATPG system
▪ DFT makes it easier to compute X, to apply X, and to observe Z*
▪ Compare Z and Z*
▪ If they are the same (error-free) ⇒ UUT is good!
▪ Else discard the UUT or use DFT to help diagnose the fault/defect

What is Built-In Self-Test (BIST)?
[Figure: the ATE applies a minimal input X’ to the UUT, which has BIST added and may contain a fault f, and reads back the response Z’]
⚫ Built-in hardware generates most of the test stimuli internally and knows most of the correct responses Z’. X’ is minimal.
⚫ Usually, no ATPG is needed.
⚫ There is no similarity between X (from ATPG), the internally generated stimuli that target fault f, and X’.
⚫ The ATE need not be complex; it is only needed for some simple functions.

8. Testing – Three test methods

Animation
1. External Testing
⚫ ATE – Automatic Test Equipment
⚫ Go/No-Go – detection
⚫ Diagnostic dictionary – location
⚫ Probe information – location
⚫ Bed-of-Nails tester – I/O access
◦ In-circuit component testing
[Figure: the ATE connected to the unit under test (UUT, also called CUT or DUT)]
Hey Doc, what if the ATE is broken?
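A diagnostic dictionary maps each response signature (the outputs observed over the whole test set) to the faults that produce it. Looking up the observed response then points to the fault location. The circuit, test set, and fault list below are hypothetical. Note that E s-a-1 and Z s-a-1 produce the same signature here, so this dictionary cannot tell them apart.

```python
from collections import defaultdict


def circuit(a, b, c, fault=None):
    """Hypothetical circuit Z = (A AND B) OR C with internal line E = A AND B.
    fault -- optional stuck-at fault on line E or Z, e.g. ("E", 1)."""
    e = a & b
    if fault and fault[0] == "E":
        e = fault[1]
    z = e | c
    if fault and fault[0] == "Z":
        z = fault[1]
    return z


TESTS = [(1, 1, 0), (0, 1, 0), (0, 0, 1)]
FAULTS = [("E", 0), ("E", 1), ("Z", 0), ("Z", 1)]


def build_dictionary():
    """Map each response signature (outputs over TESTS) to the faults producing it.
    None stands for the fault-free circuit."""
    dictionary = defaultdict(list)
    for f in [None] + FAULTS:
        signature = tuple(circuit(*t, fault=f) for t in TESTS)
        dictionary[signature].append(f)
    return dict(dictionary)


dictionary = build_dictionary()
observed = (0, 0, 1)  # response measured on the ATE
print(dictionary.get(observed, "not in dictionary"))  # [('E', 0)]
print(dictionary[(1, 1, 1)])                         # [('E', 1), ('Z', 1)]
```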

Read only
ATE
⚫ Electrical testing is the identification and segregation
of electrical failures from a population of devices. An
electrical failure is any unit that does not meet the
electrical specifications defined for the device. In
simplified terms, electrical testing consists of providing
a series of electrical excitations to the device under
test (DUT) and measuring the response of the DUT.
⚫ For every set of electrical stimuli, the measured
response is compared to the expected response, which
is usually defined in terms of a lower and an upper limit.
Any DUT that exhibits a response outside of the
expected range of response is considered a failure.
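The pass/fail rule above compares each measured response with a lower and an upper limit. The sketch below assumes hypothetical test names, limits, and readings. A DUT fails if any measurement falls outside its window.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    name: str
    low: float
    high: float


def check(measurements, limits):
    """Return the names of tests whose measured value lies outside [low, high]."""
    return [lim.name for lim in limits
            if not lim.low <= measurements[lim.name] <= lim.high]


# Hypothetical test names, limits, and readings, for illustration only
limits = [Limit("VOH_V", 2.4, 5.5), Limit("IDD_mA", 0.0, 10.0)]
print(check({"VOH_V": 3.1, "IDD_mA": 12.5}, limits))  # ['IDD_mA'] -> DUT fails
```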

Read only
ATE cont.
⚫ In production mode, electrical testing is usually performed
using a test system or platform, consisting of a tester (see Fig.
1) and a handler (see Fig. 2). Such a test system is also
referred to as an automatic (or automated) test equipment,
or ATE. The tester performs the electrical testing itself,
while the handler takes care of transferring the unit to the
test site and positioning it for proper testing, as well as
reloading it back into another tube after the testing process is
completed.
⚫ The testing process executed by the tester is controlled by
the test program or test software. The test program is
usually written in a high level language such as C++ or
Pascal. It consists of a series of test blocks, each of
which tests the DUT for a certain parameter. Every test
block sets up the DUT fixtures for proper testing of the
DUT for the corresponding parameter. It also tells the tester
what electrical excitations need to be applied to the DUT, as
well as the correct timing for applying them.

Read only
ATE cont.
Figure 2: Test handler

Figure 1: Tester
There are usually two versions of the test program. One is a production (stringent) version and the other is a quality assurance (QA) version. The production version has stricter limits than the QA version, while the QA version more or less tests the DUT to the datasheet specification limits. The differences between the production and QA limits, or the guardbands, should be large enough to account for errors attributable to overall testing variability and noise, but not so large that they cause over-rejection. If the guardband is chosen properly, any unit passing the production test is almost certain to pass the datasheet limits, regardless of which test equipment on the floor is used.
Read only
ATE cont.
⚫ The test program usually consists of two types of test blocks,
namely, parametric and functional. Functional testing checks if the
device is able to perform its basic operation. Parametric testing
checks if the device exhibits the correct voltage, current, or power
characteristics, regardless of whether the unit is functional or not.
Parametric testing usually consists of forcing a constant voltage at a
node and measuring the current response
(force-voltage-measure-current, or FVMC) at that node, or forcing a
constant current at a node and measuring the voltage response
(force-current-measure-voltage, or FCMV).
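To illustrate FVMC and FCMV, the sketch below replaces the tester pin and DUT node with a hypothetical linear resistance to ground. A real tester would drive and measure through its pin electronics. The parametric test block forces a voltage, measures the current, and compares it against an upper limit. The 5 V force value and the 1 µA limit are illustrative assumptions.

```python
class SimulatedPin:
    """Hypothetical stand-in for a tester pin driving a DUT node that is
    modeled as a linear resistance to ground."""

    def __init__(self, resistance_ohm):
        self.r = resistance_ohm

    def force_voltage_measure_current(self, volts):
        return volts / self.r  # FVMC: returns amps

    def force_current_measure_voltage(self, amps):
        return amps * self.r   # FCMV: returns volts


def leakage_test(pin, force_v=5.0, max_current_a=1e-6):
    """Parametric test block: force a constant voltage, measure the current,
    and compare it with an upper limit. Returns (passed, measured_current)."""
    i = pin.force_voltage_measure_current(force_v)
    return i <= max_current_a, i


print(leakage_test(SimulatedPin(10e6))[0])  # True: 0.5 uA
print(leakage_test(SimulatedPin(1e6))[0])   # False: 5 uA exceeds the limit
```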
⚫ Electrical testing is normally done at ambient temperature, but testing at other temperatures is also done, depending on the screening requirements. For instance, latch-up problems have a better chance of being detected at an elevated temperature, while hot-carrier failures are more easily detected at low temperatures. Aside from 25°C, other standard test temperatures include -40°C, 0°C, 70°C, 85°C, 100°C, and 125°C.

Read only
ATE cont.
⚫ Automatic Test Equipment (ATE), or testers (see Fig. 1),
are used in the process of automatically testing the electrical
characteristics and performance of finished devices.
⚫ ATEs vary widely in accordance with the types of products
they test. In general, however, an ATE consists of an elaborate
controller- or microprocessor-based system that controls: 1)
boards or modules that can supply electrical excitation to the
device under test (DUT) and 2) boards or modules that can
measure the electrical characteristics and behavior of the DUT
in response to the applied excitation. Additional paraphernalia
such as family boards and DUT boards are attached to the
tester to configure it to the specific needs of the DUT, since
the testers themselves are often designed to be as generic as
possible.

Read only
Test Handler
⚫ Mass-production electrical testing is only possible when a test handler is attached to an
ATE. A test handler (see Fig. 2) is the equipment that presents the unit to be tested to the
test site of the ATE, allowing the ATE to test the unit. After testing, the handler puts the
unit into the appropriate output location based on the ATE test results.
⚫ Test handlers vary widely in configuration. Some use gravity to bring the device under
test (DUT) to the test site and to reload it back into a tube. Others use special
electromechanical or pick-and-place systems to accomplish this. Some handlers can only
be assigned to one tester, yet some can be allocated to eight or more testers. A typical
test handler is equipped with a loading or input stage, a test site, a sort shuttle, an
unloading or output stage, various sensors, and interfaces to the tester.
⚫ For gravity-fed handlers, the input stage usually consists of input tracks into which the
input tubes containing the units to be tested are inserted. The units slide down the input
track into the test site for testing. After testing, the unit is then transported by the sort
shuttle to the appropriate output track based on whether the unit is good or bad.
Pick-and-place handlers usually pick the units for testing from a tray and present them to
the test site for testing. After testing, the pick-and-place system takes the unit and puts it
into the appropriate output tray.

2. Self-testing
⚫ Here, the patient is the doctor! For
example, store a test in the memory of a μp and
have the μp execute the test.
⚫ What happens if the μp is faulty? Can a
designer know, consider, comprehend all
such faults when writing the test? What must
be working to execute a self-test?
⚫ If the power is turned off, will the machine
still display the error message “my power is
turned off”?
This leads to the concept of BIST

Animation
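BIST
[Figure: Part of a pipeline, shown in the normal and test modes of operation. Normal mode: an n-bit register R1 drives the combinational logic C, whose outputs are captured by an n-bit register R2. Test mode: R1 becomes R1*, a counter generating the patterns 0, 1, 2, ...; R2 becomes R2*, a signature generator that counts the number of 1’s it sees (0, 1, 2, ...); a comparator checks the final count against the expected signature]
An unequal signature implies an error due to C, hence a fault!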
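The test-mode pipeline can be sketched in a few lines. R1* is modeled as a binary counter over all 2^n patterns, and R2* is modeled as a ones counter that accumulates the 1’s in C’s outputs. The block `c_logic` (an AND and an XOR of two input bits) and the injected fault are hypothetical. The golden signature S is obtained by running the same procedure on the fault-free C.

```python
def c_logic(x, fault=None):
    """Hypothetical 2-input, 2-output block C: (AND, XOR) of the two input bits.
    fault -- optional (output_index, stuck_value)."""
    a, b = (x >> 1) & 1, x & 1
    out = [a & b, a ^ b]
    if fault is not None:
        out[fault[0]] = fault[1]
    return out


def bist_signature(c, n_bits, fault=None):
    """Test mode: R1* counts 0, 1, ..., 2**n_bits - 1 and drives C;
    R2* counts the 1's in C's outputs (a ones-count signature)."""
    ones = 0
    for pattern in range(2 ** n_bits):  # R1*: binary counter
        outputs = c(pattern, fault)     # C's output bits
        ones += sum(outputs)            # R2*: ones counter
    return ones


golden = bist_signature(c_logic, 2)                  # expected signature S
observed = bist_signature(c_logic, 2, fault=(0, 1))  # AND output stuck-at-1
print(golden, observed, "fault!" if observed != golden else "pass")  # 3 6 fault!
```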
Read only
More on BIST issues
▪ How would you determine the fault coverage of this technique with respect
to some model M?
▪ What constitutes, in terms of attributes, a good test pattern generator (R1*)? How would you design it? What are the area and performance overheads?
▪ What constitutes, in terms of attributes, a good compressor (R2*)? How would you design it? What are the area and performance overheads?
▪ How do we determine the correct signature, S?
▪ What are other good BIST architectures?
▪ How are high levels of controllability and observability achieved?
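One way to approach the first question is fault simulation. For every fault in model M, run the BIST session and count the fault as detected only if the final signature differs from the golden one. The sketch below uses a hypothetical one-output XOR block and the ones-count compressor from the slide. Every stuck-at fault on an input changes the raw response stream, but it leaves the number of 1’s unchanged (aliasing). The coverage measured at the signature is therefore much lower than the coverage measured on the raw responses.

```python
from itertools import product


def xor_block(a, b, fault=None):
    """Hypothetical one-output block C = A XOR B.
    fault -- optional stuck-at fault (line, value) on line 'A', 'B' or 'Z'."""
    if fault and fault[0] == "A":
        a = fault[1]
    if fault and fault[0] == "B":
        b = fault[1]
    z = a ^ b
    if fault and fault[0] == "Z":
        z = fault[1]
    return z


PATTERNS = list(product((0, 1), repeat=2))  # R1*: counter 00, 01, 10, 11
FAULTS = [(line, v) for line in "ABZ" for v in (0, 1)]


def responses(fault=None):
    return [xor_block(a, b, fault) for a, b in PATTERNS]


def coverage(detects):
    return 100 * sum(map(detects, FAULTS)) / len(FAULTS)


good = responses()
fc_raw = coverage(lambda f: responses(f) != good)            # compare every output bit
fc_sig = coverage(lambda f: sum(responses(f)) != sum(good))  # compare ones-count only
print(f"{fc_raw:.1f}% vs {fc_sig:.1f}%")  # 100.0% vs 33.3%
```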

Self-study
3. System level testing (diagnostic)
[Figure: four machines M1, M2, M3, M4 connected by test arcs a11, a12, a23, a32, a33, a41, a42]
Mi – a machine
aik implies that Mi tests Mk
aik = 1 implies that Mi concludes that Mk is good (pass)
aik = 0 implies that Mi concludes that Mk is bad (fail)

Self-study
More on system level diagnosis

▪ If Mi is fault-free then its conclusion (test
outcome) is correct; if Mi is actually faulty,
then its conclusion may be correct or
incorrect.
▪ Problem 1: Given {aij}, referred to as the
test result signature, determine which
processors are good and which are faulty.
▪ Problem 2: How should we interconnect the
test arcs to get reliable diagnosis?
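A brute-force approach to Problem 1, sketched under two assumptions: at most t machines are faulty, and a faulty tester’s outcome is unreliable, so it is ignored. The sketch keeps the slide’s convention that aik = 1 means “pass”. It lists every set of at most t faulty machines that is consistent with the signature. A unique answer means the diagnosis is unambiguous. The 3-machine ring is a hypothetical test-arc layout.

```python
from itertools import combinations


def consistent_fault_sets(machines, syndrome, t):
    """All sets of at most t faulty machines consistent with the test results.
    syndrome -- {(i, k): a_ik}, where a_ik = 1 means Mi says Mk passes.
    A fault-free tester reports correctly; a faulty tester's result is ignored."""
    result = []
    for size in range(t + 1):
        for faulty in combinations(machines, size):
            bad = set(faulty)
            if all(a == (0 if k in bad else 1)
                   for (i, k), a in syndrome.items() if i not in bad):
                result.append(bad)
    return result


# Hypothetical ring M1 -> M2 -> M3 -> M1 with only M3 faulty
print(consistent_fault_sets([1, 2, 3], {(1, 2): 1, (2, 3): 0, (3, 1): 0}, 1))  # [{3}]
# Two machines testing each other, both reporting "fail": ambiguous
print(consistent_fault_sets([1, 2], {(1, 2): 0, (2, 1): 0}, 1))  # [{1}, {2}]
```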

Self-study
An example of system level Diagnosis
[Figure: machines M1 and M2 test each other via arcs a12 and a21]
▪ So, if a12 = 0 and a21 = 0, what do we fix?
▪ Would it help if we added self-testing arcs, akk?

Self-study
More on system level diagnosis

[Figure: a system of six machines, M1 through M6, connected by test arcs]
If only M3 is bad, can this system properly diagnose the problem?

What are we doing about yield loss?
[Figure: four memory arrays with defects marked, from left to right: error-free memory with minor defects (with DT and DFM); error-free memory after reconfiguration (FT); error-free memory after reconfiguration and reduction in capacity; error-producing memory, which is discarded. Legend: one symbol marks a defect that is masked, the other a defect that is not masked]
Live with the errors if you can! This is called Error-Tolerance (ET).

9. Summary

Let’s summarize the important points
▪ Defects, process variation, old age and yield
▪ Failures, faults and errors
▪ Testing at various levels
▪ Wafer, die, packaged chip, system
▪ Test generation (ATPG), design-for-test and
built-in self-test and ATE
▪ Fault models
▪ Stuck-at, delay, shorts, opens, resistive
▪ Permanent, transient, intermittent
▪ Change in logic, interconnect, timing

More important points
⚫ Modes of testing
◦ Off-line vs. on-line testing (occasionally vs. real-time)
⚫ Self-checking
◦ Dual-redundant, system level testing
⚫ Fault-tolerance: mask errors
⚫ Defect-tolerance: compensate for defects that create errors, e.g., use redundant vias
⚫ Error-tolerance: learn to live with errors (Breuer & Gupta)
And don’t ever forget the difference between a Fault and an Error, and the MOST important concepts of Observability and Controllability
