Sie sind auf Seite 1von 6

1174

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 5, MAY 2014

Fig. 5 compares two valid decomposition schemes provided by


the basic decomposition method and our decomposition method,
respectively. As illustrated in this figure, the proposed decomposition method achieves higher printability because of the
following.
1) Compared to a nonlitho-friendly solution, structural spacers are
applied as much as possible in our solution to print the critical
layout features.
2) In Fig. 5, most of the structural spacers are formed around
mandrel features with critical dimensions. Therefore, the imaging imperfections of the mandrel mask are propagated to the
structural spacers.
3) Moreover, mandrel mask is mainly used for dummy features
in our solution. As dummy features are not part of the target
layout, their imaging quality is not a primary concern of OPC
engine as long as their printability imperfections do not penetrate to feature tiles printed by structural spacers. For example,
consider the top-right dummy mandrel feature in Fig. 5. As
its upper edge does not contribute to any target feature, its
printability is not as critical as other edges in the mandrel mask
such as original mandrel features. Therefore, a SATP-aware
OPC engine can lower the priority of these noncritical edges
in favor of critical mandrel edges in the neighborhood. This
weighting method provides more space for many of critical
edges to receive more aggressive OPC treatments. We used
Calibre nmOPC tool to implement a SATP-aware OPC recipe
which is applied to both sets of SATP-blind and SATP-aware
decomposed layouts.
V. C ONCLUSION
In this brief, the effective parameters in the printability of SATPdecomposed layouts are discussed. Subsequently, a decomposition
method for SATP technique is proposed which uses ILP formulation
to avoid decomposition conflicts. The proposed method also improves
the overall printability of the layout using structural spacers efficiently
and minimizing the mandrel-trim co-defined layout features. The
experimental results show that the total length of overlay-sensitive
layout features, total EPE, and overall printability of the attempted
designs are improved considerably by the proposed decomposition
method.
This brief is the first attempt to solve the litho-friendly SATP
layout decomposition and can be extended by providing heuristic
solutions for SATP decomposition. Being formulated as ILP, the
proposed method gives the least overlay-sensitive solution for fully
decomposable layouts; however, it does not converge for partially
decomposable ones. A heuristic decomposition method can support
partially decomposable layouts and improve the run-time complexity
of decomposition. Moreover, similar to [16], layout partitioning can
be used to make the ILP-based method applicable for large layouts.
Finally, SATP-aware layout design is highly needed to make SATP
a mainstream imaging solution. Similar to DPL layout migration
methods [17], the proposed formulation can be extended in future
to resolve decomposability issues.

[4] Y. Chen, P. Xu, L. Miao, Y. Chen, X. Xu, D. Mao, P. Blanco, C. Bencher,


R. Hung, and C. S. Ngai, Self-aligned triple patterning for continuous
IC scaling to half-pitch 15nm, in Proc. 24th Opt. Microlithograph.,
2011, pp. 79731P-179731P-8.
[5] B. Mebarki, H. D. Chen, Y. Chen, A. Wang, J. Liang, K. Sapre,
T. Mandrekar, X. Chen, P. Xu, P. Blanko, C. Ngai, C. Bencher, and
M. Naik, Innovative self-aligned triple patterning for 1x half pitch
using single spacer deposition-spacer etch step, in Proc. 24th Opt.
Microlithograph., 2011, pp. 79730G-179730G-6.
[6] Y. Chen, P. Xu, Y. M. Chen, L. Miao, X. Xu, C. Bencher, and C. Ngai,
Self-aligned triple patterning to extend optical lithography for 1x
patterning, in Proc. Int. Symp. Lithograph. Extensions, 2010, pp. 2022.
[7] C. Bencher, Y. Chen, H. Dai, W. Montgomery, and L. Huli, 22nm
half-pitch patterning by CVD spacer self alignment double patterning
(SADP), Proc. SPIE, vol. 6924, p. 69244E, Mar. 2008.
[8] H. Dai, J. Sweis, C. Bencher, Y. Chen, J. Shu, X. Xu, C. Ngai,
J. Huckabay, and M. Weling, Implementing self-aligned double patterning on non-gridded design layouts, Proc. SPIE, vol. 7275, p. 72751E,
Feb. 2009.
[9] Y. Ban, K. Lucas, and D. Pan, Flexible 2D layout decomposition
framework for spacer-type double pattering lithography, in Proc. 48th
Design Autom. Conf., 2011, pp. 789794.
[10] M. Mirsaeedi, J. A. Torres, and M. Anis, Self-aligned double patterning (SADP) layout decomposition, in Proc. 12th Int. Symp. Quality
Electron. Design, 2011, pp. 17.
[11] H. Zhang, Y. Du, M. D. Wong, R. Topaloglu, and W. Conley, Effective
decomposition algorithm for self-aligned double patterning, in Proc.
SPIE, vol. 7973, p. 79730J, Mar. 2011.
[12] M. Mirsaeedi, J. Andres Torres, and M. Anis, Self-aligned doublepatterning (SADP) friendly detailed routing, in Proc. Design Manuf.
Through Design-Process Integr., 2011, p. 79740O.
[13] (2013). GNU Linear Programming Kit [Online]. Available:
http://www.gnu.org/software/glpk
[14] (2010). Calibre Litho Friendly Reference Manual [Online]. Available:
http://www.mentor.com
[15] (2011). Nangate 45nm Open Cell Library [Online]. Available:
http://www.nangate.com
[16] K. Yuan, J.-S. Yang, and D. Pan, Double patterning layout decomposition for simultaneous conflict and stitch minimization, in Proc. Int.
Symp. Phys. Design, 2009, pp. 107114.
[17] C.-H. Hsu, Y.-W. Chang, and S. R. Nassif, Simultaneous layout migration and decomposition for double patterning technology, IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst., vol. 30, no. 2, pp. 284294,
Feb. 2011.

Area-Delay Efficient Binary Adders in QCA


Stefania Perri, Pasquale Corsonello, and Giuseppe Cocorullo

Abstract As transistors decrease in size more and more of them can


be accommodated in a single die, thus increasing chip computational
capabilities. However, transistors cannot get much smaller than their
current size. The quantum-dot cellular automata (QCA) approach represents one of the possible solutions in overcoming this physical limit, even
though the design of logic modules in QCA is not always straightforward.

R EFERENCES
[1] (2009). International Technology Roadmap for Semiconductors,
[Onlilne]. Available: http://en.wikipedia.org/wiki/International_Technolo
gy_Roadmap_for_Semiconductors
[2] Y. Borodovsky, Lithography 2009 overview of opportunities, in Proc.
Semicon West, 2009, pp. 13.
[3] C. Cork, J. C. Madre, and L. Barnes, Comparison of triple-patterning
decomposition algorithms using aperiodic tiling patterns, in Proc.
Photomask NGL Mask Technol. 15th Int. Soc. Opt. Photon., May 2008,
pp. 702839-1702839-3.

Manuscript received October 29, 2012; revised March 6, 2013; accepted


April 21, 2013. Date of publication June 14, 2013; date of current version
April 22, 2014.
The authors are with the Department of Electronics Computer Sciences
and Systems of the University of Calabria, Rende 87036, Italy (e-mail:
perri@deis.unical.it; p.corsonello@unical.it; g.cocorullo@unical.it).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2013.2261831

1063-8210 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 5, MAY 2014

1175

In this brief, we propose a new adder that outperforms all state-of-theart competitors and achieves the best area-delay tradeoff. The above
advantages are obtained by using an overall area similar to the cheaper
designs known in literature. The 64-bit version of the novel adder spans
over 18.72 m2 of active area and shows a delay of only nine clock
cycles, that is just 36 clock phases.
Index Terms Adders, nanocomputing, quantum-dot cellular automata
(QCA).

I. I NTRODUCTION
Quantum-dot cellular automata (QCA) is an attractive emerging
technology suitable for the development of ultra-dense low-power
high-performance digital circuits [1]. For this reason, in the last few
years, the design of efficient logic circuits in QCA has received a great
deal of attention. Special efforts are directed to arithmetic circuits
[2][16], with the main interest focused on the binary addition
[11][16] that is the basic operation of any digital system.
Of course, the architectures commonly employed in traditional
CMOS designs are considered a first reference for the new design
environment. Ripple-carry (RCA), carry look-ahead (CLA), and conditional sum adders were presented in [11]. The carry-flow adder
(CFA) shown in [12] was mainly an improved RCA in which
detrimental wires effects were mitigated. Parallel-prefix architectures,
including BrentKung (BKA), KoggeStone, LadnerFischer, and
HanCarlson adders, were analyzed and implemented in QCA in
[13] and [14]. Recently, more efficient designs were proposed in
[15] for the CLA and the BKA, and in [16] for the CLA and the
CFA.
In this brief, an innovative technique is presented to implement
high-speed low-area adders in QCA. Theoretical formulations demonstrated in [15] for CLA and parallel-prefix adders are here exploited
for the realization of a novel 2-bit addition slice. The latter allows the
carry to be propagated through two subsequent bit-positions with the
delay of just one majority gate (MG). In addition, the clever top level
architecture leads to very compact layouts, thus avoiding unnecessary
clock phases due to long interconnections.
An adder designed as proposed runs in the RCA fashion,
but it exhibits a computational delay lower than all state-ofthe-art competitors and achieves the lowest area-delay product
(ADP).
The rest of this brief is organized as follows: a brief background of
the QCA technology and existing adders designed in QCA is given
in Section II, the novel adder design is then introduced in Section III,
simulation and comparison results are presented in Section IV, and
finally, in Section V conclusions are drawn.
II. BACKGROUND
A QCA is a nanostructure having as its basic cell a square four
quantum dots structure charged with two free electrons able to tunnel
through the dots within the cell [1]. Because of Coulombic repulsion,
the two electrons will always reside in opposite corners. The locations
of the electrons in the cell (also named polarizations P) determine
two possible stable states that can be associated to the binary states
1 and 0.
Although adjacent cells interact through electrostatic forces and
tend to align their polarizations, QCA cells do not have intrinsic data
flow directionality. To achieve controllable data directions, the cells
within a QCA design are partitioned into the so-called clock zones
that are progressively associated to four clock signals, each phase
shifted by 90. This clock scheme, named the zone clocking scheme,
makes the QCA designs intrinsically pipelined, as each clock zone
behaves like a D-latch [8].

Fig. 1.

Novel 2-bit basic module.

QCA cells are used for both logic structures and interconnections
that can exploit either the coplanar cross or the bridge technique [1],
[2], [5], [17], [18]. The fundamental logic gates inherently available
within the QCA technology are the inverter and the MG. Given three
inputs a, b, and c, the MG performs the logic function reported in (1)
provided that all input cells are associated to the same clock signal
clkx (with x ranging from 0 to 3), whereas the remaining cells of
the MG are associated to the clock signal clkx+1
M(abc) = a b + a c + b c.

(1)

Several designs of adders in QCA exist in literature. The RCA


[11], [13] and the CFA [12] process n-bit operands by cascading n full-adders (FAs). Even though these addition circuits use
different topologies of the generic FA, they have a carry-in to
carry-out path consisting of one MG, and a carry-in to sum bit
path containing two MGs plus one inverter. As a consequence,
the worst case computational paths of the n-bit RCA and the
n-bit CFA consist of (n+2) MGs and one inverter.
A CLA architecture formed by 4-bit slices was also presented
in [11]. In particular, the auxiliary propagate and generate signals,
namely pi = ai + bi and gi = ai bi , are computed for
each bit of the operands and then they are grouped four by four.
Such a designed n-bit CLA has a computational path composed of
7 + 4 (log4 n) cascaded MGs and one inverter. This can be easily
verified by observing that, given the propagate and generate signals
(for which only one MG is required), to compute grouped propagate
and grouped generate signals; four cascaded MGs are introduced in
the computational path. In addition, to compute the carry signals,
one level of the CLA logic is required for each factor of four in
the operands word-length. This means that, to process n-bit addends,
log4 n levels of CLA logic are required, each contributing to the
computational path with four cascaded MGs. Finally, the computation of sum bits introduces two further cascaded MGs and one
inverter.
The parallel-prefix BKA demonstrated in [13] exploits more
efficient basic CLA logic structures. As its main advantage over
the previously described adders, the BKA can achieve lower
computational delay. When n-bit operands are processed, its worst
case computational path consists of 4 log2 n 3 cascaded MGs
and one inverter. Apart from the level required to compute propagate
and generate signals, the prefix tree consists of 2 log2 n 2 stages.
From the logic equations provided in [13], it can be easily verified
that the first stage of the tree introduces in the computational
path just one MG; the last stage of the tree contributes with
only one MG; whereas, the intermediate stages introduce in the
critical path two cascaded MGs each. Finally, for the computation
of the sum bits, further two cascaded MGs and one inverter are
added.

1176

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 5, MAY 2014

Fig. 3.

Novel 16-bit adder.


TABLE I
S IMULATION PARAMETERS

Fig. 2.

Novel n-bit adder (a) carry chain and (b) sum block.

With the main objective of trading off area and delay, the hybrid
adder (HYBA) described in [14] combines a parallel-prefix adder
with the RCA. In the presence of n-bit operands, this architecture
has a worst computational path consisting of 2 log2 n + 2 cascaded
MGs and one inverter.
When the methodology recently proposed in [15]was exploited,

the
 case path of the CLA is reduced to 4 log4 n + 2
 worst
log4 n 1 MGs and one inverter. The above-mentioned approach
can be applied also to design the BKA. In this case the overall area is
reduced with respect to [13], but maintaining the same computational
path.
By applying the decomposition method demonstrated in [16], the
computational paths of the CLA and the CFA are reduced to 7 +
2 log2 (n/8) MGs and one inverter and to (n/2) + 3 MGs and one
inverter, respectively.
III. N OVEL QCA A DDER
To introduce the novel architecture proposed for implementing ripple adders in QCA, let consider two n-bit addends
A = an1 , . . . , a0 and B = bn1 , . . . , b0 and suppose that
for the ith bit position (with i = n 1, . . . , 0) the auxiliary propagate and generate signals, namely pi = ai + bi and
gi = ai bi , are computed. ci being the carry produced at the
generic (i1)th bit position, the carry signal ci+2 , furnished at the
(i+1)th bit position, can be computed using the conventional CLA
logic reported in (2). The latter can be rewritten as given in (3),
by exploiting Theorems 1 and 2 demonstrated in [15]. In this way,
the RCA action, needed to propagate the carry ci through the two

Parameter

Value

Temperature

1K

Relaxation time

1 1015 s

Time step

1 1016 s

Total simulation time

7 1011 s

Radius of effect

80 nm

Relative permittivity

12.9

Layer separation

11.5 nm

subsequent bit positions, requires only one MG. Conversely, conventional circuits operating in the RCA fashion, namely the RCA and
the CFA, require two cascaded MGs to perform the same operation.
In other words, an RCA adder designed as proposed has a worst
case path almost halved with respect to the conventional RCA and
CFA.
Equation (3) is exploited in the design of the novel 2-bit module
shown in Fig. 1 that also shows the computation of the carry
ci+1 = M( pi gi ci ). The proposed n-bit adder is then implemented
by cascading n/2 2-bit modules as shown in Fig. 2(a). Having
assumed that the carry-in of the adder is cin = 0, the signal p0
is not required and the 2-bit module used at the least significant bit
position is simplified. The sum bits are finally computed as shown in
Fig. 2(b).
It must be noted that the time critical addition is performed when
a carry is generated at the least significant bit position (i.e., g0 =
1) and then it is propagated through the subsequent bit positions
to the most significant one. In this case, the first 2-bit module
computes c2 , contributing to the worst case computational path with
two cascaded MGs. The subsequent 2-bit modules contribute with
only one MG each, thus introducing a total number of cascaded
MGs equal to (n 2)/2. Considering that further two MGs and
one inverter are required to compute the sum bits, the worst case
path of the novel adder consists of (n/2) + 3 MGs and one
inverter
ci+2 = gi+1 + pi+1 gi + pi+1 pi ci



ci+2 = M(M ai+1 , bi+1 , gi M ai+1 , bi+1 , pi ci ).

(2)
(3)

IV. R ESULTS
The proposed addition architecture is implemented for several operands word lengths using the QCA Designer tool [16]
adopting the same rules and simulation settings used in
[11][16]. The QCA cells are 18-nm wide and 18-nm high; the
cells are placed on a grid with a cell center-to-center distance of
20 nm; there is at least one cell spacing between adjacent wires; the
quantum-dot diameter is 5 nm; the multilayer wire crossing structure

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 5, MAY 2014

Fig. 4.

Novel 32-bit adder.

Fig. 5.

Novel 64-bit adder.

Fig. 6.

Simulation results obtained for the novel 64-bit adder.

is exploited; a maximum of 16 cascaded cells and a minimum of


two cascaded cells per clock zone are assumed. The coherence
vector engine is used for simulations with the options shown in
Table I.
Layouts for the 16-, 32- and 64-bit versions of the novel adder
are shown in Figs. 35, respectively. Simulation results for the
64-bit adder is shown in Fig. 6. There, the carry out bit is included in
the output sum bus. Because of the limited QCA Designer graphical
capability, input and output busses are split into two separate more
significant and less significant busses. Fig. 6 also shows the polarization values of few single output signals (i.e., sum64 , sum32 , sum31 ,
and sum).
Simulations performed on 32- and 64-bit adders have shown that
the first valid result is outputted after five and nine latency clock
cycles, respectively.

1177

As an example, the 20 clock phases (or five cycles delay) of


the 32-bit adder are as follows: one clock phase is needed for
inputs acquisition; the carry c2 related to the least significant bit
positions is then computed within the two subsequent clock phases;
15 phases are required for the carry propagation through the remaining bit positions; finally, two more phases are needed for the sum
computation.
Critical path consistencies and post layout characteristics, such as
cell count, overall size, delay, number of clock phases, and ADP,
are shown in Table II for all the compared adders. The number
of cascaded MGs within the worst case computational path directly
impacts on the achieved speed performances as an MG always adds
one more clock phase.
However, it is worth noting that because of their different
basic logics, designs with the same critical path can achieve

1178

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 5, MAY 2014

TABLE II
C OMPARISON R ESULTS
Adder

New

CFA [12]

RCA [13]

BKA [13]

HYBA [14]

CLA [15]
BKA [15]

CLA [16]

CFA [16]

n
8
16
32
64
8
16
32
64
8
16
32
64
8
16
32
64
8
16
32
64
16
8
8
16
32
64
8
16
32
64

Worst Case Path


MGs INV
7
1
11
1
19
1
35
1
10
1
18
1
34
1
66
1
10
1
18
1
34
1
66
1
9
1
13
1
17
1
21
1
8
1
10
1
12
1
14
1
11
1
9
1
7
1
9
1
11
1
13
1
7
1
11
1
19
1
35
1

Cell Count

Size (m)2

Delay

Clock Phases

ADP

1606
3587
7691
16667
789
1769
4305
11681
712
1602
3901
10926
1782
4350
11825
30145
1422
3555
8946
22427
4489
1462
1785
4114
12540
33302
1143
2817
6942
17586

1.13
2.66
6.65
18.72
0.949
2.45
7.3
24.2
0.74
1.99
6.46
20.92
1.49
3.55
10.77
31.20
1.11
2.66
8.21
25.29
3.65
1.06
1.46
3.67
11.84
35.63
1.02
2.45
6.83
20.46

2
3
5
9
22/4
42/4
82/4
162/4
23/4
43/4
83/4
163/4
22/4
4
63/4
112/4
21/4
31/4
52/4
92/4
41/4
22/4
21/4
33/4
63/4
121/4
21/4
31/4
52/4
92/4

8
12
20
36
10
18
34
66
11
19
35
67
10
16
27
46
9
13
22
38
17
10
9
15
27
49
9
13
22
38

2.26
7.98
33.25
168.48
2.37
11.02
62.05
399.3
2.03
9.45
56.52
350.41
3.72
14.2
72.69
358.8
2.5
8.65
45.15
240.25
15.51
2.65
3.29
13.76
79.92
436.47
2.29
7.96
37.56
194.37

different numbers of clock phases. As an example, the novel


adder requires less clock phases than the CFA proposed in [16]
to perform the generic addition operation. Table II shows that the
layout strategy is also quite important. In fact, designs using the
same basic logic with the same worst case theoretical path can
have different number of clock phases due to differently compact
layouts.
It should also be noted that the critical path of the HYBA [14]
contains the fewest MGs, while the novel adder, the RCA [13] and
the CFA [12] require less additional clock phases exceeding the
number of cascaded MGs. This means that their layouts do not
contain overlong wires, thus demonstrating that the novel architecture
inherits logic advantages of CLA and parallel-prefix adders (i.e., the
short computational path), but limiting, as happens in the RCA and
CFA structures, detrimental layout effects.
Comparison results shown in Table II for operands wordlengths
ranging from 8- to 64-bit also show that the novel adder achieves the
lowest delay and spans over an area similar to that occupied by the
cheaper designs known in literature. Therefore, our design approach
allows the best area-delay tradeoff to be achieved.
Table II also shows the numbers of wire crossovers for some
competitors. As it is well known, wire crossovers are sensitive to
fabrication imperfections and consequently they can represent a reliability issue [19][21]. It can be seen that the number of crossovers
used in the novel 8- and 16-bit adder was 33.6% and 37% lower

Number of
Crossovers
69
145
297
601
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
231
104
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.
n.a.

than the CLA and the BKA proposed in [15]. Unfortunately, for
the adders presented in [11][14], [16] similar information was not
available.
The robustness of the proposed circuit was evaluated as a function
of temperature using a simulation set-up identical to that used
in [13]. Obtained results demonstrated that at 22 K, all the output cells assume correct polarizations. In fact, the cells furnishing
output signals equal to zero are polarized to 0.534, whereas
those providing output bits equal to 1 assume the polarization
0.534.
V. C ONCLUSION
A new adder designed in QCA was presented. It achieved speed
performances higher than all the existing QCA adders, with an area
requirement comparable with the cheap RCA and CFA demonstrated
in [13] and [16].
The novel adder operated in the RCA fashion, but it could
propagate a carry signal through a number of cascaded MGs
significantly lower than conventional RCA adders. In addition,
because of the adopted basic logic and layout strategy, the number of clock cycles required for completing the elaboration was
limited.
A 64-bit binary adder designed as described in this brief exhibited
a delay of only nine clock cycles, occupied an active area of
18.72 m2 , and achieved an ADP of only 168.48.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 5, MAY 2014

R EFERENCES
[1] C. S. Lent, P. D. Tougaw, W. Porod, and G. H. Bernestein, Quantum
cellular automata, Nanotechnology, vol. 4, no. 1, pp. 4957, 1993.
[2] M. T. Niemer and P. M. Kogge, Problems in designing with QCAs:
Layout = Timing, Int. J. Circuit Theory Appl., vol. 29, no. 1,
pp. 4962, 2001.
[3] J. Huang and F. Lombardi, Design and Test of Digital Circuits by
Quantum-Dot Cellular Automata. Norwood, MA, USA: Artech House,
2007.
[4] W. Liu, L. Lu, M. ONeill, and E. E. Swartzlander, Jr., Design rules
for quantum-dot cellular automata, in Proc. IEEE Int. Symp. Circuits
Syst., May 2011, pp. 23612364.
[5] K. Kim, K. Wu, and R. Karri, Toward designing robust QCA architectures in the presence of sneak noise paths, in Proc. IEEE Design,
Autom. Test Eur. Conf. Exhibit., Mar. 2005, pp. 12141219.
[6] K. Kong, Y. Shang, and R. Lu, An optimized majority logic synthesis
methology for quantum-dot cellular automata, IEEE Trans. Nanotechnol., vol. 9, no. 2, pp. 170183, Mar. 2010.
[7] K. Walus, G. A. Jullien, and V. S. Dimitrov, Computer arithmetic
structures for quantum cellular automata, in Proc. Asilomar Conf.
Sygnals, Syst. Comput., Nov. 2003, pp. 14351439.
[8] J. D. Wood and D. Tougaw, Matrix multiplication using quantumdot cellular automata to implement conventional microelectronics, IEEE Trans. Nanotechnol., vol. 10, no. 5, pp. 10361042,
Sep. 2011.
[9] K. Navi, M. H. Moaiyeri, R. F. Mirzaee, O. Hashemipour, and
B. M. Nezhad, Two new low-power full adders based on majority-not
gates, Microelectron. J., vol. 40, pp. 126130, Jan. 2009.
[10] L. Lu, W. Liu, M. ONeill, and E. E. Swartzlander, Jr., QCA systolic
array design, IEEE Trans. Comput., vol. 62, no. 3, pp. 548560,
Mar. 2013.
[11] H. Cho and E. E. Swartzlander, Adder design and analyses for
quantum-dot cellular automata, IEEE Trans. Nanotechnol., vol. 6, no. 3,
pp. 374383, May 2007.
[12] H. Cho and E. E. Swartzlander, Adder and multiplier design in
quantum-dot cellular automata, IEEE Trans. Comput., vol. 58, no. 6,
pp. 721727, Jun. 2009.
[13] V. Pudi and K. Sridharan, Low complexity design of ripple carry and
BrentKung adders in QCA, IEEE Trans. Nanotechnol., vol. 11, no. 1,
pp. 105119, Jan. 2012.
[14] V. Pudi and K. Sridharan, Efficient design of a hybrid adder in quantumdot cellular automata, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,
vol. 19, no. 9, pp. 15351548, Sep. 2011.
[15] S. Perri and P. Corsonello, New methodology for the design of efficient
binary addition in QCA, IEEE Trans. Nanotechnol., vol. 11, no. 6,
pp. 11921200, Nov. 2012.
[16] V. Pudi and K. Sridharan, New decomposition theorems on majority
logic for low-delay adder designs in quantum dot cellular automata,
IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 10, pp. 678682,
Oct. 2012.
[17] K. Walus and G. A. Jullien, Design tools for an emerging SoC
technology: Quantum-dot cellular automata, Proc. IEEE, vol. 94, no. 6,
pp. 12251244, Jun. 2006.
[18] S. Bhanja, M. Ottavi, S. Pontarelli, and F. Lombardi, QCA circuits for
robust coplanar crossing, J. Electron. Testing, Theory Appl., vol. 23,
no. 2, pp. 193210, Jun. 2007.
[19] A. Gin, P. D. Tougaw, and S. Williams, An alternative geometry
for quantum dot cellular automata, J. Appl. Phys., vol. 85, no. 12,
pp. 82818286, 1999.
[20] A. Chaudhary, D. Z. Chen, X. S. Hu, and M. T. Niemer, Fabricatable interconnect and molecular QCA circuits, IEEE Trans. Comput.
Aided Design Integr. Circuits Syst., vol. 26, no. 11, pp. 19771991,
Nov. 2007.
[21] M. Janez, P. Pecar, and M. Mraz, Layout design of manufacturable
quantum-dot cellular automata, Microelectron. J., vol. 43, no. 7,
pp. 501513, 2012.

1179

Optimization Scheme to Minimize Reference Resistance


Distribution of Spin-Transfer-Torque MRAM
Kejie Huang, Ning Ning, and Yong Lian
Abstract Spin-transfer-torque magnetoresistive random access memory (STT-MRAM) is an emerging type of nonvolatile memory with
compelling advantages in endurability, scalability, speed, and energy
consumption. As the process technology shrinks, STT-MRAM has limited
sensing margin due to the decrease in supply voltage and increase
in process variation. Furthermore, the relatively smaller resistance
difference of two states in STT-MRAM poses challenges for its read/write
circuit design to maintain an acceptable sensing margin. The proposed
reference circuits optimization scheme solves the reference resistance
distribution issue to maximize the sensing margin and minimize the
read disturbance, with low power consumption. Simulation results show
that the optimization scheme is able to significantly improve the read
reliability with the presence of one or few cases of reference cell failure,
thus it eliminates the requirement of additional circuits for failure
detection of reference cell or referencing to neighboring blocks.
Index Terms Reference cell resistance distribution, sensing margin,
spin-transfer-torque (STT).

I. I NTRODUCTION
Spin-transfer-torque magnetoresistive random access memory
(STT-MRAM) which offers advantages in endurability, scalability,
speed, and energy consumption over other types of nonvolatile
memory [1], [2] has attracted increasing research interests. The STT
switching technique enables MRAM scalability beyond 90 nm and
leads to simpler memory architecture and manufacturing than conventional MRAM [3], [4]. As the process technology shrinks, the
write current can be reduced as it is dependent on the size of the
magnetic tunnel junction (MTJ). The scaling down of technology,
however, increases the process variation and decreases the supply
voltage, which poses great challenges for STT-MRAM circuit design
to maintain the sensing margin. The sensing margin is defined as the
voltage difference between the bit line voltage during read operation
and the reference of the sense amplifier subtracting the offset voltage
and noise. Employing the differential sensing architecture [5] doubles
the sensing margin but sacrifices the density of the STT-MRAM
array. Furthermore, as its read and write operations share the same
current path, STT-MRAM has a known issue of read disturbance,
which is an unintended write occurring during a read operation [6].
Read disturbance occurs when the read current is larger than the
critical switching current (IC ) of the write operation. Consequently,
the read current is required to be small enough to prevent potential
read disturbance for STT-MRAM.
The sensing margin in STT-MRAM can be expressed as Iread
|RMTJ Rref |, where Iread is the reading current, RMTJ is the
resistance of the MTJ, which could be R P and RAP for P and AP
states of the MTJ, respectively, and Rref is the equivalent resistance
Manuscript received August 1, 2012; revised March 7, 2013; accepted April
23, 2013. Date of publication July 3, 2013; date of current version April 22,
2014.
K. Huang is with Data Storage Institute, Agency for Science, Technology
and Research (A*STAR), 117608 Singapore, and also with the Department
of Electrical and Computer Engineering, National University of Singapore,
119260 Singapore (e-mail: huangkejie@dsi.a-star.edu.sg).
N. Ning is with Data Storage Institute, A*STAR, 117608 Singapore
(e-mail: nanning@gmail.com).
Y. Lian is with the Department of Electrical and Computer
Engineering, National University of Singapore, 119260 Singapore (e-mail:
eleliany@nus.edu.sg).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2013.2260365

1063-8210 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Das könnte Ihnen auch gefallen