$(x_0, y_0) = (R_0\cos\theta_0,\; R_0\sin\theta_0)$
where $x_0$ and $y_0$ represent the input vector, anchored at the origin, with magnitude $R_0$ and angle $\theta_0$:
(4.2)  $x_i = R_i\cos\theta_i$
       $y_i = R_i\sin\theta_i$
In each subsequent step, a new angle of rotation $\alpha_i$ is determined such that:
(4.3)  $\alpha_i = \tan^{-1}(2^{-i})$, where $i \ge 0$
This restriction is crucial in allowing the rotation calculations performed in each step to be accomplished using only an add (or subtract) and a shift.
Figure 4.1-Step i in the CORDIC algorithm.
Figure 4.1 shows what each step in the CORDIC algorithm looks like. At each step, a decision is made whether to rotate the vector by $+\alpha_i$ or $-\alpha_i$. The outcome of both of these decisions is shown in the figure. The expression for the rotated vector in the (i+1)th step is:
(4.4)  $x_{i+1} = \sqrt{1 + 2^{-2i}}\; R_i \cos(\theta_i + d_i\alpha_i)$
       $y_{i+1} = \sqrt{1 + 2^{-2i}}\; R_i \sin(\theta_i + d_i\alpha_i)$
When applying the restriction in (4.3), the shifting and adding become evident:
(4.5)  $x_{i+1} = \frac{1}{K_i}\left(x_i - d_i\,2^{-i}\,y_i\right)$
       $y_{i+1} = \frac{1}{K_i}\left(y_i + d_i\,2^{-i}\,x_i\right)$
where $K_i = \sqrt{1 + 2^{-2i}}$ is the magnitude error term and $d_i = \pm 1$ corresponds to the rotation direction (+1 corresponds to a rotation away from the x-axis). The $2^{-i}$ terms in the top and bottom equations correspond to a right shift of $y_i$ and $x_i$ by $i$ bits, respectively (when operating in base 2). This shifted value is then added to (or subtracted from) the current value of the component. Volder referred to this operation as cross-addition. It is this cross-addition that enables the algorithm to be used effectively in digital hardware.
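The cross-addition step can be sketched in a few lines of Python. This is only an illustrative sketch, not the hardware described later: the function name `cordic_step`, the choice of 16 fractional bits, and the decision to leave out the $1/K_i$ factor of (4.5) (so that the gain accumulates and has to be corrected separately) are assumptions made for the example.

```python
def cordic_step(x: int, y: int, i: int, d: int) -> tuple[int, int]:
    """One CORDIC micro-rotation on integer (fixed-point) operands.

    d is +1 or -1. Multiplication by 2**-i is done with an arithmetic
    right shift; the 1/K_i scale factor is deliberately not applied.
    """
    x_next = x - d * (y >> i)  # cross-addition: x takes the shifted y
    y_next = y + d * (x >> i)  # cross-addition: y takes the shifted x
    return x_next, y_next


# 1.0 and 0.5 with 16 fractional bits (this scaling is an assumption).
x0, y0 = 1 << 16, 1 << 15
print(cordic_step(x0, y0, 1, +1))
```

Each step therefore costs two shifts and two additions or subtractions, with no multiplier.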
As illustrated in Figure 4.1, all rotation steps effect an increase in the magnitude of the input vector by a factor of $\sqrt{1 + 2^{-2i}}$ with each rotation. This error is introduced as a consequence of the algorithm's derivation from the Givens transform, which rotates a vector by a specified angle:
(4.6)  $x' = x\cos\theta - y\sin\theta$
       $y' = y\cos\theta + x\sin\theta$
These terms can be rearranged using the basic trigonometric identity $\tan\theta = \sin\theta / \cos\theta$:
(4.7)  $x' = \cos\theta\,[x - y\tan\theta]$
       $y' = \cos\theta\,[y + x\tan\theta]$
Using the same restriction in (4.3), we get the same relationships as in (4.5). The error term comes from the presence of the cosine term in (4.7), which is independent of the rotation direction, since cosine is symmetric about the y-axis. Moreover, this error accumulates with each step, so if the number of iterations is fixed, the total error in the algorithm is independent of the input angle.
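Because the per-step factor $\sqrt{1 + 2^{-2i}}$ does not depend on the rotation direction, the accumulated gain can be tabulated from the iteration count alone. The following is a minimal sketch under that reading; the name `cordic_gain` is hypothetical, and it assumes the iterations are numbered $i = 0, 1, \ldots, n-1$.

```python
import math


def cordic_gain(n_iterations: int) -> float:
    """Accumulated magnitude gain after iterations i = 0 .. n_iterations - 1."""
    gain = 1.0
    for i in range(n_iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


for n in (1, 4, 11, 20):
    print(n, round(cordic_gain(n), 6))
```

The gain settles quickly, which is why a single constant can be used once the iteration count is fixed.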
If the first rotation is given by
(4.8)  $x_1 = \sqrt{1 + 2^{-2(0)}}\; R_0 \cos(\theta_0 + d_0\alpha_0)$
       $y_1 = \sqrt{1 + 2^{-2(0)}}\; R_0 \sin(\theta_0 + d_0\alpha_0)$
then the second rotation is given by
(4.9)  $x_2 = \sqrt{1 + 2^{-2(0)}}\,\sqrt{1 + 2^{-2(1)}}\; R_0 \cos(\theta_0 + d_0\alpha_0 + d_1\alpha_1)$
       $y_2 = \sqrt{1 + 2^{-2(0)}}\,\sqrt{1 + 2^{-2(1)}}\; R_0 \sin(\theta_0 + d_0\alpha_0 + d_1\alpha_1)$
The n-th rotation can be extended and a general definition derived:
(4.10) $x_{n+1} = \left[\sqrt{1 + 2^{-2(0)}}\,\sqrt{1 + 2^{-2(1)}} \cdots \sqrt{1 + 2^{-2n}}\right] R_0 \cos(\theta_0 + d_0\alpha_0 + d_1\alpha_1 + \cdots + d_n\alpha_n)$
       $y_{n+1} = \left[\sqrt{1 + 2^{-2(0)}}\,\sqrt{1 + 2^{-2(1)}} \cdots \sqrt{1 + 2^{-2n}}\right] R_0 \sin(\theta_0 + d_0\alpha_0 + d_1\alpha_1 + \cdots + d_n\alpha_n)$
The total increase in magnitude can be specified as $K_n = \prod_{i=0}^{n} \sqrt{1 + 2^{-2i}}$. This increase must be accounted for when performing calculations using this algorithm.
4.3 Accumulator Registers
The CORDIC design calls for three accumulator registers: the X register, the Y register, and the angle accumulator (Z). The X and Y registers hold the present x- and y-components of the vector as it is being rotated. The angle accumulator holds the total rotation amount completed at the current iteration.
The angle accumulator stores the arguments to the sine and cosine terms in Equation (4.10): $\theta_0 + d_0\alpha_0 + d_1\alpha_1 + \cdots + d_n\alpha_n$. However, since this term is equivalent to the desired rotation angle constrained to $\alpha_i = \tan^{-1}(2^{-i})$, the Z register must always contain the expression
(4.11)  $\sum_{i=0}^{n} d_i \tan^{-1}(2^{-i})$, $d_i = \pm 1$
The arctangent terms can be stored in a small lookup table. When this is done, only an addition or a subtraction is required to compute the next value in the Z register, since $d_i = \pm 1$. This table is very small, requiring only one row for each iteration of the algorithm. Since the iteration count can be fixed in hardware, the size of the table is constant.
The accumulator registers are also used to determine the direction of rotation. In Rotation mode, the sign of the Z register determines the direction. In Vectoring mode, the sign of the Y register determines the direction. These modes are further elaborated in the next section.
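As a rough illustration of the lookup table, the sketch below precomputes one arctangent entry per iteration and updates the angle accumulator with a single addition or subtraction. The names `arctan_table` and `update_z` are hypothetical, and the use of degrees is an assumption made only for readability.

```python
import math


def arctan_table(n_iterations: int) -> list[float]:
    """One entry per iteration: atan(2**-i), expressed in degrees."""
    return [math.degrees(math.atan(2.0 ** -i)) for i in range(n_iterations)]


def update_z(z: float, d: int, i: int, table: list[float]) -> float:
    """Next angle-accumulator value: one table read and one add or subtract."""
    return z - d * table[i]


table = arctan_table(11)
print([round(a, 4) for a in table[:4]])
print(update_z(30.0, +1, 0, table))
```

In hardware the table would hold fixed-point constants rather than floating-point values.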
4.4 Computation Modes
The CORDIC algorithm operates in one of two modes: Rotation and Vectoring. The mode of operation determines which set of functions can be computed, and how the values in the X, Y, and Z registers change each iteration.
4.4.1 Rotation Mode
    i         X         Y          Z
 initial    1.0000    0.0000    30.0000
    0       1.0000    1.0000   -15.0000
    1       1.5000    0.5000    11.5651
    2       1.3750    0.8750    -2.4712
    3       1.4844    0.7031     4.6538
    4       1.4404    0.7959     1.0775
    5       1.4156    0.8409    -0.7124
    6       1.4287    0.8188     0.1828
    7       1.4223    0.8300    -0.2649
    8       1.4255    0.8244    -0.0411
    9       1.4272    0.8216     0.0709
   10       1.4264    0.8230     0.0149
Figure 4.2-The CORDIC Rotation Mode.
In Rotation mode, the input vector is rotated over a specified angle. The input vector is specified as the initial value of the X and Y registers. The rotation amount is input into the Z register. In order to rotate the input vector over the input angle, the goal of each iteration should be to reduce the value in the angle accumulator to 0. Since the arctangent values for each iteration are fixed, the only way that the value in the Z register can be controlled is through the $d_i$ values. In Rotation mode, the decision values are defined to be:
(4.12)  $d_i = \begin{cases} +1, & \text{if } z_i \ge 0 \\ -1, & \text{if } z_i < 0 \end{cases}$
With this definition of the decision function, after n iterations of the algorithm, the values in each of the registers are known:
(4.13)  $X_n = K_n\left[X_0\cos(Z_0) - Y_0\sin(Z_0)\right]$
        $Y_n = K_n\left[Y_0\cos(Z_0) + X_0\sin(Z_0)\right]$
        $Z_n = 0$
        $K_n = \prod_{i=0}^{n} \sqrt{1 + 2^{-2i}}$
Figure 4.2 shows what happens to the input vector after each iteration of the algorithm. In this example, the vector is initially aligned with the x-axis. The magnitude of the vector increases with each step, as the angle of the vector converges on 30°. The values of the X, Y, and Z registers are also shown.
The following sections describe how various functions can be computed using Rotation mode.
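The iteration tabulated in Figure 4.2 can be reproduced with a short floating-point sketch. It is an illustration only: the name `cordic_rotate` is hypothetical, angles are kept in degrees to match the figure, and no scale correction is applied, so the X and Y values grow by the gain $K_n$ as they do in the figure.

```python
import math


def cordic_rotate(x: float, y: float, z_deg: float, n_iterations: int):
    """Rotation mode: drive Z toward 0 using the sign of Z as the decision."""
    for i in range(n_iterations):
        d = 1 if z_deg >= 0 else -1
        # Both updates use the values from before this iteration.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z_deg -= d * math.degrees(math.atan(2.0 ** -i))
    return x, y, z_deg


x, y, z = cordic_rotate(1.0, 0.0, 30.0, 11)
print(round(x, 4), round(y, 4), round(z, 4))
```

Running eleven iterations (i = 0 to 10) from X = 1, Y = 0, Z = 30 should land on the last row of the figure to within rounding.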
4.4.1.1 Sine, Cosine and Polar-Cartesian Transformation
The computation of the sine and cosine functions is intrinsic to the Rotation mode, and can easily be derived from (4.13). All that is necessary is to initialize the Y register with 0, and the X register with the desired scaling factor. If there were no gain in the magnitude of the input vector, then the X register could be initialized to 1, and values of sine and cosine could be read directly out of the Y and X registers, respectively. However, the gain means that some scaling must be performed before computation. After n iterations of the algorithm, the contents of the registers will be:
(4.14)  $X_n = K_n X_0 \cos(Z_0)$
        $Y_n = K_n X_0 \sin(Z_0)$
        $Z_n = 0$
Therefore, in order to compute the sine or cosine of an angle $\theta$, the Z register is initialized with $\theta$, Y with 0, and the X register with $1/K_n$, to account for the gain. Figure 4.2 shows the CORDIC process to compute sine and cosine. The initial vector is aligned, as required. The X register corresponds to the scaled cosine value, and the Y register corresponds to the scaled sine value. We can determine the unscaled value by dividing the final values by $K_n = 1.6468$. For example, after 11 iterations, the algorithm yields
$\sin(30^\circ) = \dfrac{Y_{10}}{K_{10}} = \dfrac{0.8230}{1.6468} = 0.4998$
The method to compute sine and cosine is the same for Polar-to-Cartesian coordinate transformation. Since the transformation is defined to be
(4.15)  $x = r\cos\theta$
        $y = r\sin\theta$
all that is necessary to perform the transformation is to once again load $\theta$ into Z, load X with $r/K_n$, and load Y with 0.
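The pre-scaling idea can be sketched as follows. The function name `polar_to_cartesian`, the default of 16 iterations, and the use of floating point are assumptions for illustration; the gain is computed for exactly the iterations that are run, and X is loaded with r divided by that gain so that no correction is needed afterwards.

```python
import math


def polar_to_cartesian(r: float, theta_deg: float, n_iterations: int = 16):
    """Rotation mode with X preloaded with r / K so the outputs are unscaled."""
    gain = math.prod(math.sqrt(1.0 + 2.0 ** (-2 * i)) for i in range(n_iterations))
    x, y, z = r / gain, 0.0, theta_deg
    for i in range(n_iterations):
        d = 1 if z >= 0 else -1
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.degrees(math.atan(2.0 ** -i))
    return x, y  # approximately (r*cos(theta), r*sin(theta))


cos30, sin30 = polar_to_cartesian(1.0, 30.0)
print(round(cos30, 4), round(sin30, 4))
```

With r = 1 the two outputs are the cosine and sine of the angle; other values of r give the Cartesian coordinates directly.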
4.4.2 Vectoring Mode
    i         X         Y          Z
 initial    1.0000    1.7321     0.0000
    0       2.7321    0.7321    45.0000
    1       3.0981   -0.6340    71.5651
    2       3.2566    0.1405    57.5288
    3       3.2741   -0.2665    64.6538
    4       3.2908   -0.0619    61.0775
    5       3.2927    0.0409    59.2876
    6       3.2934   -0.0105    60.1828
    7       3.2935    0.0152    59.7351
    8       3.2935    0.0024    59.9589
    9       3.2935   -0.0041    60.0709
   10       3.2935   -0.0009    60.0149
Figure 4.3-CORDIC Vectoring Mode
In Vectoring mode, the equations used to update the registers remain the same, but the function that determines the rotation direction changes. In Vectoring mode, the algorithm tries to align the input vector with the x-axis, which means that the value in the Y register should converge on zero. The decision function in Vectoring mode is:
(4.16)  $d_i = \begin{cases} +1, & \text{if } y_i < 0 \\ -1, & \text{if } y_i \ge 0 \end{cases}$
Figure 4.3 shows how the input vector converges on the x-axis with each iteration of the algorithm. As in the Rotation mode, the magnitude of the vector increases with every iteration of the algorithm. The sign of Y is used to determine which direction to rotate, with the goal of bringing the value to 0. In this example, the angle of the input vector with respect to the x-axis is computed and found in the Z register. Since the angle does not scale, the result is the correct angle. Using the new decision function, after n iterations of the algorithm, the registers will contain:
(4.17)  $X_n = K_n\sqrt{X_0^2 + Y_0^2}$
        $Y_n = 0$
        $Z_n = Z_0 + \tan^{-1}\!\left(\dfrac{Y_0}{X_0}\right)$
        $K_n = \prod_{i=0}^{n} \sqrt{1 + 2^{-2i}}$
The following sections discuss which functions can be computed in Vectoring mode.
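The behaviour tabulated in Figure 4.3 can be reproduced with a sketch that differs from Rotation mode only in its decision rule. The name `cordic_vector` is hypothetical, angles are in degrees to match the figure, and no gain correction is applied, so X ends up holding the magnitude multiplied by $K_n$.

```python
import math


def cordic_vector(x: float, y: float, z_deg: float, n_iterations: int):
    """Vectoring mode: drive Y toward 0 using the sign of Y as the decision."""
    for i in range(n_iterations):
        d = 1 if y < 0 else -1
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z_deg -= d * math.degrees(math.atan(2.0 ** -i))
    return x, y, z_deg


x, y, z = cordic_vector(1.0, 3 ** 0.5, 0.0, 11)
print(round(x, 4), round(y, 4), round(z, 4))
```

Starting from X = 1 and Y ≈ 1.7321 with Z = 0, the Z register approaches 60 degrees while X approaches twice the gain.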
4.4.2.1 Arctangent
As seen in equation (4.17), the arctangent function is intrinsically computed in the Z register when in Vectoring mode. In order to compute the arctangent of a value $a$, the Z register must be initialized to 0, so that the $Z_0$ term is eliminated in (4.17), and the value $a$ must be expressed as a ratio of the two values in the Y and X registers. It is possible to initialize X with 1.0, and Y with $a$, as is done in Figure 4.3. The result of $\tan^{-1}(\sqrt{3})$ is correctly computed to be 60°.
(4.18)  $Z_n = Z_0 + \tan^{-1}\!\left(\dfrac{Y_0}{X_0}\right)$
        $Z_n = 0 + \tan^{-1}(\sqrt{3})$
        $Z_n = 60^\circ$
4.4.2.2 Vector Magnitude and Cartesian-Polar Transformation
As mentioned earlier, the final value in the X register contains the scaled magnitude of the input vector. This property, combined with the intrinsic computation of the arctangent at the same time, means that the CORDIC Vectoring mode automatically performs a Cartesian-to-Polar coordinate transformation, just as the Rotation mode performs a Polar-to-Cartesian conversion. Recall the equation for the Cartesian-Polar transformation:
(4.19)  $r = \sqrt{x^2 + y^2}$
        $\theta = \tan^{-1}\!\left(\dfrac{y}{x}\right)$
The value for $r$ (scaled by $K_n$) is computed in the X register, and $\theta$ in the Z register.
4.5 Expanding the Computation Domain
The CORDIC algorithm as presented thus far can only compute functions based on the sine and cosine functions. This is a consequence of the circular rotations performed in each step. The algorithm is capable of performing linear and hyperbolic rotations as well, which expands the set of functions that the algorithm can compute. To allow for these new domains, a new factor is introduced to the set of CORDIC equations. Its value determines in which coordinate system the algorithm will operate. This factor is defined as
(4.20)  $m \in \{-1, 0, 1\}$
An $m$ value of $-1$ corresponds to the hyperbolic domain, 0 to the linear domain, and 1 to the circular domain. When this factor is applied to Equation (4.5), the following general-purpose definition of the CORDIC algorithm is obtained (assuming n iterations):
(4.21)  $x_n = K_n\left[x_0\cos(\theta\sqrt{m}) - y_0\sqrt{m}\,\sin(\theta\sqrt{m})\right]$
        $y_n = K_n\left[y_0\cos(\theta\sqrt{m}) + \dfrac{x_0}{\sqrt{m}}\sin(\theta\sqrt{m})\right]$
        $z_n = z_0 - \theta$
As in 4.2, $\alpha_i$ is the elementary rotation performed each step, with $\theta$ being the sum of the rotations performed. This rotation is defined so that the operation performed for each step in the algorithm reduces to a shift and an add:
(4.22)  $\alpha_i = \begin{cases} \tan^{-1}(2^{-i}), & m = +1 \\ 2^{-i}, & m = 0 \\ \tanh^{-1}(2^{-i}), & m = -1 \end{cases}$
With these conditions, the algorithm reduces to:
(4.23)  $x_{i+1} = \dfrac{1}{K_i}\left[x_i - m\,d_i\,2^{-i}\,y_i\right]$
        $y_{i+1} = \dfrac{1}{K_i}\left[y_i + d_i\,2^{-i}\,x_i\right]$
        $z_{i+1} = z_i - d_i\,\alpha_i$
where $K_i = \sqrt{1 + m\,2^{-2i}}$, $d_i = \pm 1$, $m \in \{-1, 0, +1\}$
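One way to read (4.22) and (4.23) is as a single loop parameterized by $m$. The sketch below follows that reading under stated assumptions: the names `alpha` and `unified_rotate` are hypothetical, the decision rule is the Rotation-mode rule, floating point is used instead of shifts, and the $1/K_i$ factor is omitted so that any gain is left in the outputs.

```python
import math


def alpha(i: int, m: int) -> float:
    """Elementary rotation for iteration i in domain m, as in (4.22)."""
    if m == 1:
        return math.atan(2.0 ** -i)
    if m == 0:
        return 2.0 ** -i
    if m == -1:
        return math.atanh(2.0 ** -i)  # only defined for i >= 1
    raise ValueError("m must be 1, 0, or -1")


def unified_rotate(x: float, y: float, z: float, m: int, indices):
    """Rotation-mode iteration over the given index sequence (angles in radians)."""
    for i in indices:
        d = 1 if z >= 0 else -1
        x, y = x - m * d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * alpha(i, m)
    return x, y, z


# Linear domain (m = 0): X is unchanged and Y accumulates X0 * Z0.
print(unified_rotate(3.0, 0.0, 0.75, 0, range(1, 25)))
```

In the linear domain this loop acts as a shift-and-add multiplier, which is a convenient sanity check because the gain there is exactly 1.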
Domain                Growth Factor, $K_n$                                          Constant After
Circular (m = 1)      $\prod_i \sqrt{1 + 2^{-2i}} = 1.646760258$                    15 iterations
Linear (m = 0)        $\prod_i \sqrt{1 + 0 \cdot 2^{-2i}} = 1.000000000$            -
Hyperbolic (m = -1)   $\prod_i \sqrt{1 + (-1) \cdot 2^{-2i}} = 0.828159361$         15 iterations
Table 4.1-CORDIC growth factors for each computation domain.
The values of the CORDIC growth factors are shown in Table 4.1. The values are rounded to nine decimal places. The Constant After column shows how many iterations it takes until the displayed value remains constant. For all but the linear domain, which always has a factor of 1, the displayed value can be used for any implementation with 15 or more iterations.
As before, the x and y components are stored in the X and Y registers. The definition of Z depends on the computation domain, with the inverse hyperbolic tangent being used instead of the standard arctangent function when in the hyperbolic domain, for example:
(4.24)  $z_{i+1} = \begin{cases} z_i - d_i\tan^{-1}(2^{-i}), & m = 1 \\ z_i - d_i\,2^{-i}, & m = 0 \\ z_i - d_i\tanh^{-1}(2^{-i}), & m = -1 \end{cases}$
The following table summarizes which functions can be computed, and in which mode and domain they can be computed:
                      Computation Mode
Domain                Rotation                              Vectoring
Circular (m = 1)      $X = K_n X_0 \cos(Z_0)$               $X = K_n\sqrt{X_0^2 + Y_0^2}$
                      $Y = K_n X_0 \sin(Z_0)$               $Z = Z_0 + \tan^{-1}(Y_0 / X_0)$
                      $\tan Z_0 = Y / X$
Linear (m = 0)        $Y = Y_0 + X_0 Z_0$                   $Z = Z_0 + Y_0 / X_0$
Hyperbolic (m = -1)   $X = K_n X_0 \cosh(Z_0)$              $X = K_n\sqrt{X_0^2 - Y_0^2}$
                      $Y = K_n X_0 \sinh(Z_0)$              $Z = Z_0 + \tanh^{-1}(Y_0 / X_0)$
                      $\tanh Z_0 = Y / X$                   $\ln(w) = 2Z$, with $X_0 = w + 1$, $Y_0 = w - 1$
                      $e^{Z_0} = X + Y$                     $\sqrt{w} = \frac{1}{K_n} X$, with $X_0 = w + \frac{1}{4}$, $Y_0 = w - \frac{1}{4}$
Table 4.2-Functions that can be computed using the CORDIC algorithm.
In addition to the introduction of the $m$ domain selector, a couple of other modifications must be made to the algorithm when functioning in other domains.
In the circular domain, the algorithm starts with the registers in their initialized states and steps through a sequence of $i$ values of the form 0, 1, 2, 3, 4, ... until the desired number of iterations has been completed. In the first step, no shift is performed, as a result of $i$ being equal to 0. In the linear domain, the sequence goes 1, 2, 3, 4, 5, ..., with a shift happening in the first step. Finally, as explained in 4.6, the hyperbolic domain's $i$ sequence is further complicated by the necessity to repeat iterations. The $i$ sequence in this domain is 1, 2, 3, 4, 4, 5, ..., with every $3k + 1$ iteration being repeated, where $k$ is the previously repeated iteration.
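The hyperbolic index sequence is easy to get wrong, so a small generator may help. This is a sketch under the reading that iteration 4 is repeated first and that each later repeat falls at three times the previous repeat plus one; the name `hyperbolic_indices` is hypothetical.

```python
def hyperbolic_indices(n_distinct: int) -> list[int]:
    """Index sequence 1, 2, 3, 4, 4, 5, ... with repeats at 4, 13, 40, ..."""
    sequence = []
    next_repeat = 4
    for i in range(1, n_distinct + 1):
        sequence.append(i)
        if i == next_repeat:
            sequence.append(i)                 # run this iteration twice
            next_repeat = 3 * next_repeat + 1  # 4 -> 13 -> 40 -> ...
    return sequence


print(hyperbolic_indices(14))
```

For comparison, the circular sequence is simply `range(0, n)` and the linear sequence is `range(1, n + 1)`.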
4.6 Convergence
The algorithm is now fully defined, and it will now be demonstrated that the algorithm converges on the previously noted functions. The specific region of convergence will also be shown.
4.6.1 The Convergence Condition
Assume that the CORDIC algorithm is in vectoring mode. Let $\theta_i$ be the angle of the input vector after the ith iteration of the algorithm. The algorithm tries to reduce the angle of the input vector. Therefore, after step $i + 1$, the angle of the vector will change such that
(4.25)  $|\theta_{i+1}| = \big|\,|\theta_i| - \alpha_i\,\big|$
where $\alpha_i$ is the rotation performed in the ith iteration of the algorithm. In order for the algorithm to converge in $n$ iterations, all subsequent iterations must bring $\theta$ to within $\alpha_{n-1}$ of zero. If this is the case, then when the nth iteration completes, the input angle will be zero to within $\alpha_{n-1}$. Since the rotations accumulate, the following condition is derived:
(4.26)  $\alpha_i - \sum_{j=i+1}^{n-1} \alpha_j < \alpha_{n-1}$
In order for the algorithm to converge, the condition in (4.26) must hold when the algorithm first begins:
(4.27)  $|\theta_0| - \sum_{j=0}^{n-1} \alpha_j < \alpha_{n-1}$
To find the domain of initial values for which the algorithm converges, the above inequality is solved:
(4.28)  $\max|\theta_0| = \alpha_{n-1} + \sum_{j=0}^{n-1} \alpha_j$
Since the definition of $\alpha_i$ depends only on $i$, the domain of convergence can be computed. The domains of convergence for the three computation domains, found using (4.28), are shown in Table 4.3. By evaluating the limit of the $\alpha$ terms as $n$ approaches $\infty$, and observing that $\alpha_{n-1} \to 0$ as $n \to \infty$, the $\alpha_{n-1}$ term can be dropped from (4.28).
Computation Domain    $\alpha_i$                         Approximate Domain of Convergence
Circular (m = 1)      $\alpha_i = \tan^{-1}(2^{-i})$     $\sum_{i=0}^{\infty} \tan^{-1}(2^{-i}) \approx 1.74329$
Linear (m = 0)        $\alpha_i = 2^{-i}$                $\sum_{i=1}^{\infty} 2^{-i} = 1$
Hyperbolic (m = -1)   $\alpha_i = \tanh^{-1}(2^{-i})$    $\sum_{i} \tanh^{-1}(2^{-i}) \approx 1.11817$, $i \in \{1, 2, 3, 4, 4, 5, \ldots\}$
Table 4.3-Domains of convergence for the CORDIC algorithm.
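The limits in Table 4.3 can be checked numerically by summing the elementary rotations. The sketch below is an illustration only: `convergence_limit` is a hypothetical name, 40 terms are assumed to be enough for the sums to settle, and the hyperbolic case counts the repeated iterations (4, 13, ...) twice.

```python
import math


def convergence_limit(m: int, n_terms: int = 40) -> float:
    """Approximate sum of the elementary rotations for domain m (radians)."""
    if m == 1:
        return sum(math.atan(2.0 ** -i) for i in range(n_terms))
    if m == 0:
        return sum(2.0 ** -i for i in range(1, n_terms))
    total, next_repeat = 0.0, 4
    for i in range(1, n_terms):
        step = math.atanh(2.0 ** -i)
        total += step
        if i == next_repeat:       # repeated iterations contribute twice
            total += step
            next_repeat = 3 * next_repeat + 1
    return total


for m in (1, 0, -1):
    print(m, round(convergence_limit(m), 5))
```

The circular limit corresponds to roughly ±99.9 degrees, which is why larger rotation angles need a quadrant correction before the iteration starts.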
Note that the condition (4.26) is not met in the hyperbolic domain if the standard non-repeating sequence (e.g. 1, 2, 3, ...) is used. In order for the algorithm to converge, certain iterations must be repeated. Specifically, it is necessary for steps {4, 13, 40, ..., k, 3k + 1} to be repeated. This comes as a consequence of the following [1]:
(4.29)  $\alpha_i - \left(\sum_{j=i+1}^{n-1} \alpha_j\right) - \alpha_{3i+1} < \alpha_{n-1}$
4.6.2 Proof of Convergence Criteria
In order to demonstrate that $\theta_i$ will converge to at most $\alpha_{n-1}$, it will first be proven by induction that the following is true:
(4.30)  $|\theta_i| < \alpha_{n-1} + \sum_{j=i}^{n-1} \alpha_j$
First, (4.30) is true for $i = 0$, as a result of (4.27). To prove that (4.30) is true for $i + 1$, $\alpha_i$ is subtracted from (4.30) and (4.26) is applied to the left side. This yields:
(4.31)  $-\left(\alpha_{n-1} + \sum_{j=i+1}^{n-1} \alpha_j\right) < -\alpha_i \le |\theta_i| - \alpha_i < \alpha_{n-1} + \sum_{j=i+1}^{n-1} \alpha_j$
When (4.25) is applied, the following results:
(4.32)  $|\theta_{i+1}| < \alpha_{n-1} + \sum_{j=i+1}^{n-1} \alpha_j$
Therefore, by induction, it is proven that (4.30) is true for all $i \ge 0$. If $i = n$, then
(4.33)  $|\theta_n| < \alpha_{n-1}$
which proves that the CORDIC algorithm converges if the input angle is within the domain of convergence defined in (4.28).
4.6.3 Convergence in Rotation Mode
The domain of convergence definition and proof in 4.6.1 and 4.6.2 assumed that CORDIC is operating in vectoring mode. If $z$ is substituted for $\theta$ in the equations in those sections, then the domain of convergence can be found and proven for the rotation mode. In particular, (4.28) shows that $z$ has the same domain of convergence as $\theta$:
(4.34)  $\max|z_0| = \max|\theta_0|$
4.7 Accuracy and Error
With the algorithm now fully defined, and its convergence proven, the accuracy and precision of the algorithm will now be discussed. Generally speaking, for each iteration of the algorithm, an additional bit of accuracy is obtained.
Error creeps into the results from many sources. Naturally, since there is not an infinite number of bits available to any real hardware device, rounding errors are inherent to any implementation. Additionally, since there is a finite number of iterations in any real-world implementation of the algorithm, the desired rotation angle can never be fully realized. This results in an angle approximation error.
4.7.1 Numerical Representation
Because the CORDIC algorithm operates using bit-shifts, the natural representation for any number used in the algorithm is a fixed-point representation rather than floating-point. A fixed-point number is a scaled integer, where the binary point is implied to be $p$ positions to the left of the LSB. As a result, the integer is $2^p$ times larger than the number it is representing. This is similar to storing 2.54 as 254. This scale, combined with the number of bits available, completely determines the range and precision available to all numbers in the algorithm. The scale determines how many bits are available to the left and right of the binary point.
As the binary point moves to the right, greater ranges of numbers are available, but the scale becomes coarser, with larger gaps between adjacent numbers. Likewise, as the binary point moves to the left, the scale is finer, but the range of possible numbers is smaller.
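The scaled-integer idea can be sketched directly. The helper names `to_fixed`, `from_fixed`, and `resolution` are hypothetical, and rounding to the nearest integer on conversion is an assumption of the example.

```python
def to_fixed(value: float, frac_bits: int) -> int:
    """Store value as an integer that is 2**frac_bits times larger."""
    return round(value * (1 << frac_bits))


def from_fixed(raw: int, frac_bits: int) -> float:
    """Recover the represented value from the scaled integer."""
    return raw / (1 << frac_bits)


def resolution(frac_bits: int) -> float:
    """Gap between adjacent representable numbers."""
    return 1.0 / (1 << frac_bits)


raw = to_fixed(2.5, 4)
print(raw, from_fixed(raw, 4), resolution(4))
```

Moving the binary point one place to the left (one more fractional bit) halves the resolution step and, for a fixed word width, halves the representable range.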
Guard digits can also be employed at both ends of the binary word to enhance accuracy. The guard digits are not used in any input values and are reserved for bits appearing in final values. For example, at least 2 bits are necessary in circular mode, because the $K_n$ factor is greater than 1, which means that the magnitude of the values will grow as the algorithm works. Guard digits on the least-significant side are employed to reduce rounding error.
4.7.2 Rounding Error
When the number of bits required to represent a value exceeds the number of bits available in the system, the designer has two options: round or truncate. In a truncation, the excess bits are discarded. In a round, the representation is altered depending on the excess bits. Rounding works better in the CORDIC algorithm, because the maximum rounding error is only half as large as the error resulting from truncation.
Rounding errors are introduced in the course of completing each iteration of the algorithm. These errors accumulate, which has the effect of further offsetting the final result. However, Walther pointed out in [1] that the rounding of the results in each of $N$ iterations never results in more than $\log_2 N$ bits of error. In order to negate this effect, the system can use $N + \log_2 N$ bits of storage for all intermediate results.
4.7.3 Angle Approximation Error
It has been shown that the algorithm will always converge on an accurate result if given infinite precision. However, for some values, it may take an infinite number of iterations to arrive at the accurate result. As a result, the actual rotation of the input vector is only an approximation of the desired angle. This angle approximation error is the desired rotation angle minus the actual rotation angle (which is the sum of all the intermediate angles):
(4.35)  $\delta = \theta - \sum_{i=0}^{n-1} d_i\alpha_i$, where $\theta$ is the desired rotation angle
Since this error decreases with the number of iterations, an obvious way to improve the accuracy of the algorithm is to increase the number of iterations performed. However, there are some limits to how many iterations can be added and still improve accuracy. First, the rounding error discussed in 4.7.2 increases with the number of iterations, so this effect must be considered when adding to the iteration count. Further, since the angles are quantized, the accurate angle may never fit into the bit width provided. If $B$ bits are used for storage, and there are $P$ bits to the left of the binary point, then the value of the least-significant bit in storage is given by $2^{-B}\,2^{P}$. This is the smallest number that can be accurately specified in the system. Therefore the last rotation angle (the smallest) must be able to fit in the available space. This yields the following condition:
(4.36)  $\alpha_{n-1} \ge 2^{-B}\,2^{P}$
If this does not hold, then the angle actually stored will be 0, which will result in no further rotations of the input vector.
Chapter 5 The Parallel Implementation
This thesis studies two different implementations of the CORDIC algorithm. Both designs are synthesized for the Xilinx Spartan-3 XC3S1500 using the Xilinx ISE version 8.1i and simulated using ModelSim version 6.0a. This chapter studies a straightforward, parallel design consisting of two shift registers, a standard register, four adder/subtractors, a lookup table, and a control unit. The area and timing requirements are discussed, and simulations of the design are presented.
The design is configured to be in hyperbolic rotation mode in order to compute the exponential function, but all components are present for a completely general-purpose CORDIC unit. Only minor changes to the control unit would be necessary to implement the other modes. This design also includes the hardware necessary to perform the repeat iterations required in the hyperbolic domain. The control unit can be modified to allow for a more sophisticated iteration repetition scheme.
This implementation uses 32 bits of precision, with the binary point fixed such that there is one sign bit and three integer bits to the left of the binary point, and 28 bits representing the fractional part of the values. The design uses 2's complement arithmetic, giving the registers in the design a range of values from -8.0 to 8.0. The position of the binary point is implicit, and not specified anywhere in the design. The values in the lookup table are generated using the specified binary point position, and in order for correct results to be achieved, all inputs must be scaled accordingly. The only change necessary to allow for a different binary point position would be in the values stored in the lookup table. No changes to the design would be required.
5.1 Design
Figure 5.1-Block-level diagram of the complete parallel CORDIC unit. (Note in figure: the X and Y registers are special registers that also have shifting capability.)
All major components of the CORDIC processing unit are shown in Figure 5.1. The X, Y, and Z registers contain the current values of the X, Y, and Z components of the CORDIC algorithm. They each have parallel load capabilities controlled by the LD input signal. 2:1 multiplexers control the input values to each of the registers. An INIT output by the control unit allows initial values to be loaded before computation begins. When INIT is asserted (active high), the multiplexers pass the initial values into the registers. During the computation phase, the INIT signal is not asserted, allowing the output of the adders to be fed back into the registers.
The control unit uses the sign of the Z register to generate the subtract signals that are fed into the adder/subtractors. With the appropriate minor modifications to the control unit, the generation of the SUBXY and SUBZ signals can be modified so that the unit operates in vectoring mode. The design for the control unit is discussed further in 5.1.1.
The values of the X and Y registers are input into a final adder to generate the exponential function. After the algorithm is complete, the result will be $e^{Z_0}$. The unit can also compute $e^{-Z_0}$ by changing the adder into a subtractor.
Rather than computing the inverse hyperbolic tangent, a small lookup table is used to store the pre-computed values of this function. The table has one row per iteration of the algorithm. If a general-purpose CORDIC unit were to be designed, an additional lookup table would be necessary to store the arctangent values. For the linear domain, an additional shift register would also be necessary. A multiplexer would then be used to select the value from the appropriate table. The control unit provides the address used to access the data from the table.
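The data path described above can be modeled in software to see why the final adder produces the exponential. The following is a floating-point sketch, not the 32-bit hardware: the name `cordic_exp` is hypothetical, the repeat schedule 4, 13, ... is assumed, and X is preloaded with the reciprocal of the hyperbolic gain so that X + Y needs no correction.

```python
import math


def cordic_exp(z: float, n_distinct: int = 20) -> float:
    """Hyperbolic rotation mode; returns X + Y, approximately exp(z)."""
    indices, next_repeat = [], 4
    for i in range(1, n_distinct + 1):
        indices.append(i)
        if i == next_repeat:            # repeat iterations 4, 13, ...
            indices.append(i)
            next_repeat = 3 * next_repeat + 1
    gain = math.prod(math.sqrt(1.0 - 2.0 ** (-2 * i)) for i in indices)
    x, y = 1.0 / gain, 0.0              # preload X with 1/K, Y with 0
    for i in indices:
        d = 1 if z >= 0 else -1
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)  # value a lookup table would supply
    return x + y                         # cosh + sinh


print(cordic_exp(1.0), cordic_exp(-0.5))
```

The argument has to stay inside the hyperbolic domain of convergence (about ±1.118); replacing the final `x + y` with `x - y` would give the exponential of the negated argument instead.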
                     Used    Available    Utilization
Slices                329      13,312         2.5%
Slice Flip-Flops      123      26,624         0.5%
LUTs                  607      26,624         2.3%
IOBs                  132         221        59.7%
BRAMs                   1          32         3.1%
GCLKs                   1           8        12.5%
Table 5.1-Overall device utilization for the parallel design.
The device utilization for the entire design is summarized in Table 5.1. The design uses 2.5% of the XC3S1500's available slices, leaving ample space for usage by a neural network. The Xilinx synthesizer also reported that the design can handle a theoretical maximum clock frequency of 75.0 MHz. Naturally, testing is necessary to determine the true timing requirements for the design. The following sections discuss the design and area requirements for the major components of the unit.
5.1.1 The Control Unit
The control unit is responsible for regulating the flow of data through the CORDIC unit. The unit is a finite state machine having three states, as shown in Figure 5.2. The default state for the unit is the IDLE state. In this state, the registers are not loading, and the unit stays in this state until the START signal is asserted. When the START signal is asserted, the unit moves into the PRELOAD state.
In the PRELOAD state, the register load signal is asserted, and the INIT signal is also asserted, causing the initial values to be loaded into the registers (these values are represented as X0, Y0, and Z0 in Figure 5.1). The unit then advances to the COMPUTE state. The register load signals remain asserted, but the INIT signal is not asserted. As a result, the registers are loaded with the values from the corresponding adders, rather than the input values. The unit maintains an internal iteration count, and when this count matches the stop value, the control unit moves back into the IDLE state, with the DONE signal asserted, signifying that the values output by the registers are valid. Table 5.2 shows the unconditional outputs for each state. Other signals, such as the look-up table address, SHAMT, and SUB signals, vary and are continuously updated while in the COMPUTE state. The logic behind the state transitions is shown in Figure 5.3.
Figure 5.2-State diagram for the parallel CORDIC unit. (States: IDLE, PRELOAD, COMPUTE; transition labels: START, SHAMT = STOPVAL.)
         IDLE   PRELOAD   COMPUTE
INIT       0       1         0
DONE       1       0         0
LD         0       1         1
Table 5.2-Unconditional control unit outputs for each state in the parallel design.
Figure 5.3-State transition logic for the parallel CORDIC unit.
The generation of the SHAMT signal is complicated by the need to perform repeat iterations, and the need to wait one clock cycle to read inverse hyperbolic tangent data from the LUT. Figure 5.4 shows the logic behind the SHAMT signal, as well as the other signals that are state independent (SUBXY, SUBZ, ADDR, and DONE). The subtract signals SUBXY and SUBZ are determined by the sign of the Z register, since the CORDIC unit is operating in rotation mode. Therefore, the sign bit of the Z register is connected directly to the subtract inputs of the corresponding adders.
The SHAMT register contains the shift amount, which also corresponds to the current iteration count, and is connected to the X and Y registers. Since data from the look-up table takes one clock cycle to be retrieved, SHAMT is required to lag one cycle behind the ADDR signal, which is fed to the look-up table. This is shown as a direct connection between the output of the ADDR register and the input of the SHAMT register. It is the contents of the ADDR register that determine when a repeat iteration is necessary.
The current ADDR is compared against the value of the NXTRPT register. The NXTRPT register contains the next iteration that must be repeated. It is initialized to 4. In hyperbolic mode, every $3k + 1$ iteration must be repeated (where $k$ is the previously repeated iteration). This computation can be performed using a single adder. $3k$ is computed by feeding NXTRPT into the first input of the adder and $2 \times$ NXTRPT (accomplished with a bit shift) into the second. By asserting the carry-in input, $3 \times$ NXTRPT $+ 1$ is computed. When the current values of ADDR and NXTRPT match, NXTRPT is updated, and ADDR gets loaded with its current value, rather than the output of the adder that increments its value.
When SHAMT matches the constant STOPVAL, the DONE signal is asserted, and the unit returns to the IDLE state.
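The interplay between ADDR and NXTRPT can be modeled cycle by cycle. This is a behavioral sketch only: the function names are hypothetical, ADDR is assumed to start at 1, and the one-cycle lag of SHAMT behind ADDR is not modeled.

```python
def next_repeat(nxtrpt: int) -> int:
    """Single-adder update: NXTRPT + (NXTRPT << 1) with carry-in = 3*NXTRPT + 1."""
    carry_in = 1
    return nxtrpt + (nxtrpt << 1) + carry_in


def address_sequence(n_cycles: int) -> list[int]:
    """ADDR value presented to the lookup table on each COMPUTE cycle."""
    addr, nxtrpt = 1, 4  # NXTRPT is initialized to 4; ADDR = 1 is an assumption
    sequence = []
    for _ in range(n_cycles):
        sequence.append(addr)
        if addr == nxtrpt:
            nxtrpt = next_repeat(nxtrpt)  # ADDR holds, so the iteration repeats
        else:
            addr += 1                     # normal increment
    return sequence


print(address_sequence(7))
```

Holding ADDR for one cycle while NXTRPT advances is enough to produce the repeated iteration, since the comparison no longer matches on the following cycle.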
Figure 5.4-Logic behind the state-independent outputs of the parallel control unit.
                     Used    Available    Utilization
Slices                 39      13,312         0.3%
Slice Flip-Flops       29      26,624         0.1%
LUTs                   71      26,624         0.3%
IOBs                  113         221        51.1%
BRAMs                   0          32         0.0%
GCLKs                   1           8        12.5%
Table 5.3-Device utilization for the parallel control unit.
The area requirements for this component of the CORDIC unit are detailed in Table 5.3. The control unit accounts for 12% of the slices used by the entire design.
5.1.2 The Shift Registers
The X and Y registers require shift capability. Since the shift amount increases with each iteration of the algorithm, a standard single-bit shift register cannot be used. A register capable of shifting a variable amount is necessary. Fundamentally, there are two approaches to designing such a register. The easiest and most area-conscious is to perform a single one-bit shift per clock cycle and repeat as necessary until the value has been shifted by the desired amount. The single-bit shift can be hard-wired, minimizing area requirements. This approach requires the computation to pause until the shifting is complete. This is used as part of the area optimization gained from the bit-serial implementation discussed in Chapter 6.
It is preferred that the shift be performed in one clock cycle. This can be accomplished using multiplexers, and is illustrated for a 4-bit register in Figure 5.5. For this design, which uses 32-bit registers, thirty-two 32:1 multiplexers are used for each shifting unit. The registers have two outputs: the unshifted output direct from the register, and the shifted output, which is the output from the multiplexers. The SHAMT signal controls the amount by which the output is shifted. The shift register performs an arithmetic shift operation to ensure proper execution when negative values are used. Consequently, the MSB (the sign bit) is connected to the shifted outputs of the multiplexers to allow for a sign extension to occur when shifting.
Figure 5.5: Design of the variable shift register.
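As a rough software analogue of the multiplexer-based shifter, the sketch below selects every output bit independently, the way one multiplexer per bit would, and feeds the sign bit into positions that would otherwise read past the MSB. The function name `variable_shift` and the 32-bit default width are assumptions made for illustration.

```python
WIDTH = 32


def variable_shift(d, shamt, width=WIDTH):
    """Arithmetic right shift of a width-bit two's-complement word.

    Each output bit acts like one multiplexer: it picks input bit
    (position + shamt), or the sign bit when that index is past the MSB.
    """
    msb = (d >> (width - 1)) & 1
    out = 0
    for bit in range(width):
        src = bit + shamt
        selected = (d >> src) & 1 if src < width else msb
        out |= selected << bit
    return out


print(hex(variable_shift(0x80000000, 4)))   # negative value, sign-extended
print(hex(variable_shift(0x40000000, 4)))   # positive value, zero-filled
```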
                    Used    Available    Utilization
Slices                90       13,312           0.7%
Slice Flip-Flops      65       26,624           0.2%
LUTs                 162       26,624           0.6%
IOBs                 104          221          47.1%
BRAMs                  0           32           0.0%
GCLKs                  1            8          12.5%
Table 5.4: Device utilization for the parallel shift register.
The shift registers use a significant portion of the device area, relative to the rest of
the design, as shown in Table 5.4. Note that the table only covers a single register, and
there are two shift registers in the total design. The Z register is a standard register
without shifting functionality.
5.1.3 The Lookup Table
The lookup table contains the values of the arc hyperbolic tangent necessary for each
iteration of the algorithm. A C program written for that purpose generates the VHDL for
the table. The program uses the C math library functions in conjunction with a fixed-point
conversion function to output a VHDL file containing a description of the lookup
table in fixed-point notation.

                    Used    Available    Utilization
Slices                 0       13,312           0.0%
Slice Flip-Flops       0       26,624           0.0%
LUTs                   0       26,624           0.0%
IOBs                  38          221          17.2%
BRAMs                  1           32           3.1%
GCLKs                  1            8          12.5%
Table 5.5: Device utilization for the parallel lookup table.
The presence of the lookup table has essentially zero impact on the area requirements
for the unit, since the entire table can be synthesized into a single BRAM, as shown in
Table 5.5.
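The table contents can be approximated with a short script. The original generator is a C program that emits VHDL; the Python sketch below only reproduces the numeric step (arc hyperbolic tangent followed by fixed-point conversion). The 31-entry length, the index starting at 1, and the helper names `to_fixed` and `atanh_table` are assumptions for illustration.

```python
import math

FRAC_BITS = 28   # 4 integer bits + 28 fractional bits, per the design's word format
WIDTH = 32


def to_fixed(value, frac_bits=FRAC_BITS, width=WIDTH):
    """Round a real value to a width-bit two's-complement fixed-point word."""
    scaled = int(round(value * (1 << frac_bits)))
    return scaled & ((1 << width) - 1)


def atanh_table(entries=31):
    """Fixed-point atanh(2^-i) for i = 1 .. entries (assumed table layout)."""
    return [to_fixed(math.atanh(2.0 ** -i)) for i in range(1, entries + 1)]


for i, word in enumerate(atanh_table(4), start=1):
    print(f"i={i}: {word:08X}")
```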
5.1.4 Design Summary
Figure 5.6: Slices used by the various components of the parallel CORDIC unit.
The entire design uses very little of the FPGA's resources, and Figure 5.6 shows that
much of the slice usage is due to the X and Y shift registers. The large
multiplexers required by these registers dominate the resource requirements. Chapter 6
presents an alternate design that aims to further reduce the area requirements, but at the
cost of speed.
With only 2.4% of the Spartan-3's slices used by this design, there is still plenty of
room left for a system such as a neural network that uses the CORDIC unit. If necessary,
the resource requirements could be further reduced by reducing the precision of the unit
(to 16 bits, for example).
At 31 iterations (with 2 repeats), plus a PRELOAD cycle, the algorithm takes 34
clock cycles to complete. At the theoretical maximum clock frequency of 75.0 MHz, the
unit takes 0.45 µs to compute the final value.
5.2 Simulation Results
This section presents the results of simulations performed using the design specified
in 5.1. The design uses 32 bits to represent each word, with 4 bits for the integer part of
the number, and 28 bits for the fractional part of the number. Thus the values output by
the X, Y, and Z registers and by the adder that computes the exponential function are
effectively scaled up by a factor of 2^28. The simulated clock runs at 100 MHz.
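To read the register values reported in the simulations, each 32-bit word has to be interpreted as a two's-complement number and divided by 2^28. A minimal sketch of that conversion follows; the helper names `fixed_to_float` and `float_to_fixed` are hypothetical.

```python
FRAC_BITS = 28
WIDTH = 32


def fixed_to_float(word):
    """Interpret a 32-bit word as signed fixed point with 28 fractional bits."""
    if word & (1 << (WIDTH - 1)):          # sign bit set: negative value
        word -= 1 << WIDTH
    return word / (1 << FRAC_BITS)


def float_to_fixed(value):
    """Convert a real value to the 32-bit two's-complement word."""
    return int(round(value * (1 << FRAC_BITS))) & ((1 << WIDTH) - 1)


print(fixed_to_float(0x10000000))          # the word that encodes 1
print(f"{float_to_fixed(-1.0):08X}")       # the word that encodes -1
```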
5.2.1 Simulation 1: Computing e With CORDIC Growth Error
The first simulation performed demonstrates the effect of the CORDIC gain K_n. To
compute the exponential function, the X register must be initialized with 1, the Y
register with 0, and the Z register with the desired argument to the exponential function.
This simulation attempts to compute the value of e, so an initial value of 1 is placed into
the Z register.
Figure 5.7: Results of a ModelSim simulation attempting to compute the value of e. As a result of
CORDIC growth, the final computed value is inaccurate.
The results of this simulation are shown in Figure 5.7. The algorithm iterated for a
total of 34 times (including the PRELOAD state and the two repeat iterations required in
the hyperbolic domain). The final hexadecimal values in the X and Y registers are
147258C5_16 and 0F9272AB_16, respectively. When converted to decimal and scaled
accordingly, they become 1.2779_10 for the X register and 0.9733_10 for the Y register. The
value of the exponential function is the sum of these two numbers: 2.2512_10, which
matches the output of the E signal.
The actual value of e to four decimal places is 2.7183. This simulation demonstrates
an error in computation of 0.4671. This is a result of the CORDIC magnitude error factor
K_i discussed in 4.2 and illustrated in Figure 4.2. In order to produce accurate results, the
X register needs to be pre-scaled by the gain factor K.
5.2.2 Simulation 2: Computing e Compensating for CORDIC Growth
In order to compute an accurate value of e, the growth observed in 4.2 must be
accounted for. To produce an unscaled final value, the initial value must be divided by
the growth factor. The formula used to compute the amount of growth is shown in (4.23).
A precomputed value is shown in Table 4.1. For this simulation, rather than initializing X
with 1, X will be initialized with the hexadecimal value 1351E872_16, which equals
1.2075_10. Y is again initialized with 0 and Z with 1.
Figure 5.8: Results of a ModelSim simulation computing the value of e. By accounting for CORDIC
growth, the final computed value is accurate.
The results of this simulation are shown in Figure 5.8. The final computed value of e
is output on the E bus. Its hexadecimal value 2B7E1524_16 converts to 2.7183_10, which
matches the actual value of e. When taken to 30 decimal places the value output in E is
2.718281880021100000000000000000_10, which differs from the real value of e by only
0.0000000515620501850833000_10. The only way for this error to be reduced would be to
use more precision in the registers. Widening the registers or shifting the binary point to
the left can achieve this. In addition to allowing more precision in the results, it also
allows greater precision for the values stored in the arc hyperbolic tangent lookup table.
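The effect of pre-scaling can be reproduced with a small software model of the hyperbolic rotation-mode iterations. This is a sketch under stated assumptions, not the simulated hardware: it assumes shift amounts 1 through 31 with iterations 4 and 13 repeated, a table built by rounding atanh(2^-i) to 28 fractional bits, and Python integers standing in for the 32-bit registers. The names `iteration_schedule` and `cordic_exp` are hypothetical, and the low-order bits need not match the ModelSim output exactly.

```python
import math

FRAC_BITS = 28
ONE = 1 << FRAC_BITS


def iteration_schedule(last=31):
    """Shift amounts 1..last, with iterations 4 and 13 repeated (3k + 1 rule)."""
    schedule, repeat = [], 4
    for i in range(1, last + 1):
        schedule.append(i)
        if i == repeat:
            schedule.append(i)
            repeat = 3 * repeat + 1
    return schedule


def cordic_exp(z_real, x0_word):
    """Hyperbolic rotation mode: returns (X + Y) scaled back to a real number."""
    x, y = x0_word, 0
    z = int(round(z_real * ONE))
    for i in iteration_schedule():
        step = int(round(math.atanh(2.0 ** -i) * ONE))
        if z >= 0:   # rotate so that Z is driven toward zero
            x, y, z = x + (y >> i), y + (x >> i), z - step
        else:
            x, y, z = x - (y >> i), y - (x >> i), z + step
    return (x + y) / ONE


print(cordic_exp(1.0, ONE))          # X = 1: result shrunk by the CORDIC gain
print(cordic_exp(1.0, 0x1351E872))   # X pre-scaled with the value from the text
```

Python's `>>` on negative integers is an arithmetic shift, which is why no explicit sign extension appears in the sketch.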
5.2.3 Simulation 3: Computing e^-1
It is also important for the unit to compute the value of the exponential function for
negative arguments. Since the domain of convergence for the algorithm is symmetric
about 0, half of the possible arguments are negative. This simulation computes the value
of e^-1. X is initialized with 1.2075_10 to ensure an unscaled result, and Y is initialized with
0. The Z register contains the argument for the exponential function and is thus initialized
to -1_10 (F0000000_16 in 2's complement notation).
Figure 5.9: Results of a ModelSim simulation computing the value of e^-1.
The simulation results presented in Figure 5.9 show that the CORDIC unit is able to
handle the negative argument. The final value of the E signal (05E2D58C_16)
corresponds to 0.36787943542_10, which is of the same accuracy as the computation for
e^1.
5.2.4 Simulation 4: Computing Outside the Domain of Convergence
As discussed in 4.6, the CORDIC algorithm does not converge on an accurate
answer for all input values. Table 4.3 shows the domains of convergence for all the
computation domains supported by the CORDIC algorithm. For the rotation mode, which
this design uses, the domain of convergence is a limit on the magnitude of the initial
value of Z. If Z were initialized with a value whose magnitude is greater than 1.11817_10,
then the final computed value would be incorrect. For this simulation, the X register is
again initialized with 1.2075_10, and Y is again initialized to 0, but Z is initialized to 5, so
that the unit will attempt to compute e^5.
Figure 5.10: Results of a ModelSim simulation computing the value of e^5. Since the initial value 5 is
outside the domain of convergence, the computed value is incorrect.
The simulation results shown in Figure 5.10 demonstrate that the CORDIC unit is
unable to compute e^5. The final value output by E is 3.0593_10, which is nowhere near the
correct value (148.4132_10). 3.0593_10 acts as an asymptote for the computed values of E.
Once the initial values of Z move outside the domain of convergence, the computed E
value rapidly approaches 3.0593_10, and remains at that value no matter how large a value
Z is initialized with. A similar asymptote exists when Z is initialized with values that
are too negative. In this case, the E values approach 0.3269_10.
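The convergence limit and the two asymptotes are related: the largest angle the iterations can accumulate is the sum of all the table entries (counting repeated iterations twice), and the output saturates at the exponential of that sum. The sketch below checks this numerically; it assumes 31 iterations with repeats at 4 and 13, and the function name `convergence_bound` is hypothetical.

```python
import math


def convergence_bound(last=31):
    """Sum of atanh(2^-i) over the schedule, counting repeated iterations twice."""
    total, repeat = 0.0, 4
    for i in range(1, last + 1):
        total += math.atanh(2.0 ** -i)
        if i == repeat:
            total += math.atanh(2.0 ** -i)
            repeat = 3 * repeat + 1
    return total


bound = convergence_bound()
print(round(bound, 5))              # largest |Z| that still converges
print(round(math.exp(bound), 4))    # saturation value for large positive Z
print(round(math.exp(-bound), 4))   # saturation value for large negative Z
```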
5.3 Summary
The parallel design is a straightforward implementation of the CORDIC equations.
Even with the large register sizes, the design has a small footprint inside the Spartan-3.
With iterations lasting one clock cycle, the performance is consistent. Performance can be
improved by using fewer iterations, which usually entails using a smaller word size. The
simulations demonstrate acceptable accuracy, especially given the area requirements for
the design. For systems that require an even smaller footprint, Chapter 6 presents a
smaller, slower bit-serial implementation that further aims to reduce area requirements by
using 1-bit adder/subtractors and simple shift registers.
Chapter 6 The Serial Implementation
A straightforward implementation of the CORDIC algorithm is discussed in Chapter
5. While that design required very little of the FPGA's resources, there might be some
applications where further area reduction is necessary. It was observed in 5.1.4 that the
two variable-width shift registers accounted for half the resource requirements for the
design. This chapter presents an alternate design that uses bit-serial arithmetic along with
one-bit adders to execute the CORDIC algorithm. It will be shown that this design
requires far less of the FPGA's resources, but will require more time to execute.
6.1 Design
Figure 6.1: Block diagram for the serial implementation of the CORDIC algorithm.
As shown in Figure 6.1, the overall design remains relatively unchanged when
compared to the parallel design shown in Figure 5.1. All the changes were made to the
individual components of the design. The X, Y, and Z registers are now standard one-bit
shift registers with parallel load and parallel output capability. This keeps their internal
logic simple and reduces area requirements. The adder/subtractors are now reduced to
operating on single-bit operands, which also reduces their complexity. The adders also
contain a flip-flop that saves the carry output so that it can be applied to the next pair of
operands. This allows the 32-bit addition to be performed one bit at a time.
Each iteration requires the X and Y cross-values to be shifted by the current iteration
count. The two multiplexers attached to the X and Y registers allow the correct bit to be
selected for the corresponding register. Since the results of the addition or subtraction are
fed back as the serial inputs to the registers, an additional SEXT control signal and 2:1
multiplexer are needed to choose either the bit selected from the register or the
corresponding sign bit. Without this signal, bits computed previously in the iteration
would be fed into the adders. The sign of each register is saved prior to the execution of
each iteration, allowing for an arithmetic shift of the values being fed into the adders.
A multiplexer is also used to allow the correct bit to be fed from the arc hyperbolic
tangent lookup table to the Z-adder. The multiplexer allows the lookup table to be placed
in BRAM and prevents any additional clock cycle delays from being introduced into the
design, as would be the case if the value were loaded into a shift register. As before, the
control unit provides the address into the lookup table.
This design uses a 32-bit word size, with 4 bits for the whole part of the number, and
28 bits for the fractional part of the number.
                    Used    Available    Utilization
Slices               139       13,312           1.0%
Slice Flip-Flops     143       26,624           0.5%
LUTs                 256       26,624           1.0%
IOBs                 132          221          59.7%
BRAMs                  1           32           3.1%
GCLKs                  1            8          12.5%
Table 6.1: Overall device utilization for the serial design.
The device utilization for the serial design is shown in Table 6.1. This design uses
only 139 of the XC3S1500's available slices, as opposed to the 329 required by the
parallel design. This is a reduction of 57.8%. In addition, the Xilinx synthesizer reported
that a maximum clock frequency of 134.8 MHz can be used. The parallel design tops out
at 75.0 MHz. The simplicity of this design results in less combinational logic delay,
which in turn allows for faster clock speeds.
6.1.1 Control Unit
The parallel design allowed each iteration to be completed in one clock cycle. This
reduced the complexity of the control unit. The switch to a serial design means that the
number of clock cycles required to complete each iteration depends on the word size,
since all values are computed one bit at a time. Previously, the subtract control signals
that determine whether each adder is adding or subtracting could be evaluated each clock
cycle. In the serial design, these values need to be determined once at the beginning of
the iteration and saved until the current value is fully computed. To accomplish this, a
new state is added to the state machine of the control unit, as shown in Figure 6.2.
Figure 6.2: State diagram for the serial implementation of the CORDIC algorithm.
The IDLE and PRELOAD states remain unchanged. The IDLE state is active when
the algorithm is not being executed. The PRELOAD state is used to load the initial values
into the X, Y, and Z registers. The SETUP state is the new state and is only active during the
very first clock cycle of each iteration. In the SETUP state, the values in the X, Y, and Z
registers are evaluated to determine the subtract and sign signals. These signals are saved
into registers for use during the remaining clock cycles of the current iteration. The
COMPUTE state is active until all 32 bits have been computed. Once this happens, the
machine returns to the SETUP state to begin execution of the next iteration if more
iterations remain; otherwise, IDLE becomes the active state.
A new internal control value is also added, named BITCNT. Whereas SHAMT holds
the current iteration count, BITCNT holds the number of bits that have been processed
within the current iteration. When this counter rolls over to zero, the active iteration
is completed.
          IDLE    PRELOAD    SETUP    COMPUTE
INIT         0          1        0          0
DONE         1          0        0          0
PLD          0          1        0          0
SHIFT        0          0        1          1
Table 6.2: State outputs for the serial control unit.
To manage the new shift registers, a new SHIFT signal is also added. When asserted,
the SHIFT signal causes the shift registers to load the input bit into the MSB position,
and shift out the LSB. The old LD signal has been changed to cause a parallel load
operation in the shift registers. This allows execution of the algorithm to begin sooner.
Otherwise, without the parallel load, the initial values would need to be shifted into the
register. The values of these signals in each of the four states are shown in Table 6.2.
                    Used    Available    Utilization
Slices                38       13,312           0.3%
Slice Flip-Flops      33       26,624           0.1%
LUTs                  67       26,624           0.3%
IOBs                 126          221          57.0%
BRAMs                  0           32           0.0%
GCLKs                  1            8          12.5%
Table 6.3: Device utilization for the serial control unit.
The device utilization for the control unit is shown in Table 6.3. The parallel control
unit uses 39 slices, and this control unit uses 38 slices, making the area requirements for
the two units essentially the same. This unit required more slice flip-flops (33) than the
parallel unit's 29 flip-flops.
6.1.2 The Shift Registers
The X, Y, and Z registers in the serial design are all identical shift registers with
parallel load and parallel output capability. In addition to the clock signal, the registers
have two control inputs, two data inputs, and two data outputs. The data inputs are
labeled S_IN and P. S_IN is the bit to be shifted into the MSB position of the register
and P is the parallel data input. The two control signals are PLD and SHIFT. When
SHIFT is asserted, the contents of the register shift to the right every clock cycle,
with S_IN becoming the new MSB. When PLD is asserted, the parallel data in P is
loaded into the register at the next rising edge of the clock.
Figure 6.3: Design of the serial shift registers.
The basic design is shown in Figure 6.3. When PLD is not asserted, the data
inputs for each flip-flop are the data outputs of the adjacent flip-flop to the left. When
PLD is asserted, the bits in P become the inputs for each of the flip-flops.
Not shown in the figure are the clock enable inputs for the flip-flops. After the
algorithm is complete, the values should be held constant; that is, neither shifting nor
loading. The SHIFT and PLD inputs are connected through an OR gate to the clock
enable inputs of each of the flip-flops. As a result, the register will only shift if SHIFT is
asserted and will only do a parallel load if PLD is asserted. In any case, the parallel
outputs are available as the P_OUT signal, and the serial output is available as the S_OUT
signal, which always corresponds to the LSB of the register.
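The register behaviour described above can be modelled in software as follows. This is an illustrative sketch only; the class name `SerialShiftRegister`, the `clock` method, and the property names are hypothetical, and the signal spellings follow the names used in this section.

```python
class SerialShiftRegister:
    """Right-shifting register with parallel load, modelled one clock edge at a time."""

    def __init__(self, width=32):
        self.width = width
        self.value = 0

    def clock(self, shift=0, pld=0, s_in=0, p=0):
        """Apply one rising clock edge.

        The clock enable is SHIFT OR PLD, so the value is held when neither
        is asserted. PLD selects the parallel input ahead of the shift path.
        """
        if pld:
            self.value = p & ((1 << self.width) - 1)
        elif shift:
            self.value = (self.value >> 1) | ((s_in & 1) << (self.width - 1))
        return self.value

    @property
    def s_out(self):
        """Serial output: always the LSB of the register."""
        return self.value & 1

    @property
    def p_out(self):
        """Parallel output: the full register contents."""
        return self.value


reg = SerialShiftRegister()
reg.clock(pld=1, p=0x00000005)            # parallel load
print(reg.s_out)                          # LSB of the loaded word
print(hex(reg.clock(shift=1, s_in=1)))    # shift right, new MSB taken from S_IN
```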
                    Used    Available    Utilization
Slices                29       13,312           0.2%
Slice Flip-Flops      32       26,624           0.1%
LUTs                  49       26,624           0.2%
IOBs                  44          221          19.9%
BRAMs                  0           32           0.0%
GCLKs                  1            8          12.5%
Table 6.4: Device utilization for a single serial shift register (including 32:1 multiplexer).
Since this design does not have the large number of multiplexers required by the
variable shift register described in 5.1.2, the resource requirements of this register are
significantly reduced. The device utilization for one shift register is shown in Table 6.4.
The table includes the bit-select multiplexer shown in Figure 6.1 that is used to perform
the bit-shift operation. Only 29 slices are required for this design, compared to 90 for the
parallel shift register. The parallel shift registers are each 210% larger than the serial
registers. Since the parallel shift registers were used twice in the parallel design, this area
reduction is the key to the small area requirements of the entire serial design.
6.1.3 The Lookup Table
The lookup table remains essentially unchanged. In the serial design, a multiplexer is
added to choose the individual bits to be fed into the serial adder.

                    Used    Available    Utilization
Slices                 8       13,312           0.1%
Slice Flip-Flops       0       26,624           0.0%
LUTs                  16       26,624           0.1%
IOBs                  12          221           5.4%
BRAMs                  1           32           3.1%
GCLKs                  1            8          12.5%
Table 6.5: Device utilization for the serial lookup table.
The parallel lookup table did not require any slices. The entire design was
implemented in one BRAM. Since the serial design requires a 32:1 multiplexer, this
implementation of the lookup table requires 8 slices. This small increase in area is more
than offset by the savings in the registers and adders.
6.1.4 The Serial Adders
Figure 6.4: Design of the serial adder.
The adders that compute the X, Y, and Z values in the serial CORDIC unit are
simple one-bit adder/subtractors with additional hardware to support serial arithmetic.
The design of these adders is shown in Figure 6.4. The additional hardware is necessary
to save the carry value from one stage and then to feed it to the next. Without this logic,
the final sum would be incorrect since there would be no carry propagation.
The C_out output of the adder is connected to the input of a D flip-flop. The data
output of the flip-flop is in turn fed into a 2:1 multiplexer. When a subtraction is being
performed, it is necessary to have an initial carry of 1 to allow for the 2's complement of
the B input. The NEWCin signal is generated by the control unit and is asserted only in
the SETUP state and when a subtraction is called for. In the COMPUTE state, NEWCin
is not asserted, allowing the carry out of the previous stage to be passed through to the
carry in of the current stage.
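A bit-level model helps show why the saved carry and the initial carry of 1 are needed. The sketch below processes a word LSB first, as a serial adder would. It assumes that the carry flip-flop holds 0 at the start of an addition (the text does not say how the carry is cleared), and the names `SerialAdder` and `serial_add_sub` are hypothetical.

```python
class SerialAdder:
    """One-bit adder/subtractor with a carry flip-flop, clocked once per bit."""

    def __init__(self):
        self.carry = 0                       # models the D flip-flop on C_out

    def step(self, a, b, sub, new_cin):
        b_eff = b ^ sub                      # invert B when subtracting
        cin = 1 if new_cin else self.carry   # 2:1 mux: initial carry or saved carry
        total = a + b_eff + cin
        self.carry = total >> 1              # saved for the next bit
        return total & 1


def serial_add_sub(a_word, b_word, sub, width=32):
    """Add or subtract two words one bit at a time, LSB first."""
    adder = SerialAdder()                    # assumption: carry starts at 0
    result = 0
    for bit in range(width):
        # NEWCin: asserted only on the first bit, and only for a subtraction
        new_cin = (bit == 0) and bool(sub)
        s = adder.step((a_word >> bit) & 1, (b_word >> bit) & 1, sub, new_cin)
        result |= s << bit
    return result


print(serial_add_sub(5, 3, sub=0))
print(serial_add_sub(5, 3, sub=1))
```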
The adder that computes the exponential value is a 32-bit parallel adder that uses the
parallel outputs of the X and Y registers.

                    Serial    Parallel    Available    Serial Utilization
Slices                   4          16       13,312                  0.0%
Slice Flip-Flops         1           0       26,624                  0.0%
LUTs                     7          32       26,624                  0.0%
IOBs                     7          97          221                  3.2%
BRAMs                    0           0           32                  0.0%
GCLKs                    1           1            8                 12.5%
Table 6.6: Device utilization for the serial and parallel adders.
As the design would imply, the resource requirements for this component are quite
small. Table 6.6's serial column shows that only 4 slices are required to implement this
design. The parallel column shows the device utilization for the adder that computes the
exponential function.
6.1.5 Design Summary
Figure 6.5: Slices used by the various components of the serial CORDIC unit.
This design is even more efficient at using the FPGA's resources, requiring much
less of the chip's area than the parallel design. As Figure 6.5 shows, the distribution of
the chip's resources to the components is very balanced, with a near even balance between
the 3 registers and the control unit. Included in the "Other" category are the three serial
adders, the 32-bit adder, and the arc hyperbolic tangent lookup table. With only 1% of the
Spartan-3's slices used in this design, even larger applications can be supported. The
serial nature of the design means that extra precision can be added to the system without
having a large increase in area.
While the area requirements for the design are small, the timing requirements are
not. For a system using n iterations (and r repeats) with a word size of w bits, the
total execution time of the unit will be w(n + r) + 1 clock cycles. Each iteration will
require w cycles to complete as the values are passed serially through the adders, and
n + r total iterations are completed after the algorithm finishes. The PRELOAD state of the
control unit adds another clock cycle of execution time. The added clock cycles are
partially offset by an increase in clock frequency. The linear convergence of the CORDIC
algorithm means that the number of iterations should be close to the number of bits in the
registers. A halving of the word size (and a corresponding halving of the iteration count)
will reduce the execution time by a factor of 4.
This design, which performs 31 iterations, 2 repeated, with a word size of 32 bits
will take 1,057 clock cycles to complete. At the theoretical maximum clock rate of 134.8
MHz, the unit will take 7.84 µs to compute the final value.
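The cycle counts quoted for the two designs follow directly from the iteration count, repeat count, and word size. The short sketch below reproduces the arithmetic; the function names are hypothetical, and the clock frequencies are the synthesizer estimates quoted in the text.

```python
def parallel_cycles(n, r):
    """One cycle per iteration (including repeats) plus the PRELOAD cycle."""
    return n + r + 1


def serial_cycles(n, r, w):
    """w cycles per iteration (including repeats) plus the PRELOAD cycle."""
    return (n + r) * w + 1


def exec_time_us(cycles, f_mhz):
    """Execution time in microseconds at a clock frequency given in MHz."""
    return cycles / f_mhz


print(parallel_cycles(31, 2), round(exec_time_us(parallel_cycles(31, 2), 75.0), 2))
print(serial_cycles(31, 2, 32), round(exec_time_us(serial_cycles(31, 2, 32), 134.8), 2))
```

Because the serial cycle count is dominated by the product w(n + r), halving both the word size and the iteration count cuts it to roughly one quarter.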
6.2 Simulation Results
This section presents the results of simulations performed using the design specified
in 6.1. The design uses 32 bits to represent each word, with 4 bits for the integer portion,
and 28 bits for the fractional portion of the value. The same simulations are performed
here as are done in 5.2, in order to demonstrate that this unit produces identical results,
but at the cost of additional clock cycles.
6.2.1 Simulation 1: Computing e With CORDIC Growth Error
The first simulation performed demonstrates the effect of the CORDIC gain K_n. To
compute the exponential function, the X register must be initialized with 1, the Y
register with 0, and the Z register with the desired argument to the exponential function.
This simulation attempts to compute the value of e, so an initial value of 1 is placed into
the Z register.
Figure 6.6: Results of a ModelSim simulation of the serial CORDIC unit attempting to compute the
value of e. As a result of CORDIC growth, the final computed value is inaccurate.
The results of this simulation are shown in Figure 6.6. The algorithm iterated for a
total of 34 times (including the two repeat iterations required in the hyperbolic domain).
The final hexadecimal values in the X and Y registers are 147258C5_16 and 0F9272AB_16,
respectively. When converted to decimal and scaled accordingly, they become 1.2779 for
the X register and 0.9733 for the Y register. The value of the exponential function is the
sum of these two numbers: 2.2512, corresponding to the value of the E signal.
These results correspond precisely to the values produced in 5.2.1. Whereas the
parallel unit completes in under 8 µs, the serial unit takes just over 200 µs to produce the
final value.
6.2.2 Simulation 2: Computing e Compensating for CORDIC Growth
In order to compute an accurate value of e, the growth observed in 4.2 must be
accounted for. To produce an unscaled final value, the initial value must be divided by
the growth factor. The formula used to compute the amount of growth is shown in (4.23).
A precomputed value is shown in Table 4.1. For this simulation, rather than initializing X
with 1, X will be initialized with the hexadecimal value 1351E872_16, which equals
1.2075_10. Y is again initialized with 0 and Z with 1_10.
Figure 6.7: Results of a ModelSim simulation of the serial CORDIC unit computing the value of e.
By accounting for CORDIC growth, the final computed value is accurate.
The results of this simulation are shown in Figure 6.7. The final computed value of e
is output on the E bus. Its hexadecimal value 2B7E1524_16 converts to 2.7183_10, which
matches the actual value of e. When taken to 30 decimal places the value output in E is
2.718281880021100000000000000000, which differs from the real value of e by only
0.0000000515620501850833000. These values are the same as the values produced by
the parallel unit. As before, the only way to increase the precision is to increase the word
size and iteration count, which is easier to achieve in the serial design, but comes at the
cost of longer execution time.
6.2.3 Simulation 3: Computing e^-1
It is also important for the unit to compute the value of the exponential function for
negative arguments. Since the domain of convergence for the algorithm is symmetric
about 0, half of the possible arguments are negative. This simulation computes the value
of e^-1. X is initialized with 1.2075_10 to ensure an unscaled result, and Y is initialized with
0. The Z register contains the argument for the exponential function and is thus initialized
to -1 (F0000000_16 in 2's complement notation).
Figure 6.8: Results of a ModelSim simulation of the serial CORDIC unit computing the value of e^-1.
The simulation results presented in Figure 6.8 show that the CORDIC unit is able to
handle the negative argument. The final value of the E signal (05E2D58C_16) corresponds
to 0.36787943542_10, which is of the same accuracy as the computation for e^1.
6.2.4 Simulation 4: Computing Outside the Domain of Convergence
As discussed in 4.6, the CORDIC algorithm does not converge on an accurate
answer for all input values. Table 4.3 shows the domains of convergence for all the
computation domains supported by the CORDIC algorithm. For the rotation mode, which
this design uses, the domain of convergence is a limit on the magnitude of the initial
value of Z. If Z were initialized with a value whose magnitude is greater than 1.11817,
then the final computed value would be incorrect. For this simulation, the X register is
again initialized with 1.2075_10, and Y is again initialized to 0, but Z is initialized to 5_10,
so that the unit will attempt to compute e^5.
Figure 6.9: Results of a ModelSim simulation of the serial CORDIC unit computing the value of e^5.
Since the initial value 5 is outside the domain of convergence, the computed value is incorrect.
The simulation results shown in Figure 6.9 demonstrate that the CORDIC unit is
unable to compute e^5. The final value output by E is 3.0593_10, which is nowhere near the
correct value (148.4132). 3.0593 acts as an asymptote for the computed values of E. Once
the initial values of Z move outside the domain of convergence, the computed E value
rapidly approaches 3.0593, and remains at that value no matter how large a value Z is
initialized with. A similar asymptote exists when Z is initialized with values that are
too negative. In this case, the E values approach 0.3269.
6.3 Summary
For large applications where FPGA resources are scarce, the serial CORDIC unit
described in this chapter offers many advantages over the parallel unit described in
Chapter 5. By operating only on single bits in a serial fashion, the complexity of the shift
registers is reduced significantly. The size of the adders used in the design is also reduced
since they only need to handle one bit at a time. Precision and accuracy are not sacrificed
to achieve the reduction in resource utilization. However, the added execution time may
prevent this design from being used in more time-sensitive applications. For those, the
parallel design or a faster table-based approach would be better suited.
Chapter 7 Conclusion and Future Work
Artificial neural networks have the potential to become invaluable tools for systems
designers looking to implement artificial intelligence into their work. The ability of these
networks to be trained for a specific problem and to operate autonomously opens up a
door to new ways of data analysis and other applications where it is difficult to design
straightforward algorithms to produce the desired results.
FPGAs are likewise becoming increasingly useful for designers. The ability of the
chips to be reprogrammed speeds prototyping and simplifies the design process for larger
systems. The training procedure for artificial neural networks requires changes to internal
weights of the network. The reprogrammability of the FPGA means that these changes
can be made easily. The ability to place an artificial neural network in an embedded
environment with real-world inputs further expands the number of applications available.
Previously, it was very difficult to build a neural network in an FPGA because of the
difficulty of fitting the arithmetic hardware on the same chip as the network.
Neural networks require complex mathematical support in order to facilitate the
computation of the summation, weighting, and transfer functions. Most arithmetic
algorithms are optimized for speed and require large amounts of chip area when
implemented. Neural networks are inherently parallel computer systems and in order to
use a large arithmetic unit, all the computations would have to be pipelined through one
or several of these units. This would lessen the parallelism of the network, which also
results in longer computation times.
The CORDIC implementations presented are compact enough that they potentially
could be duplicated in the FPGA, allowing each neuron to have its own CORDIC unit.
This maintains the parallelism of the network. Two alternate designs of CORDIC units
have been presented, a parallel design and a much smaller serial design that takes longer
to execute. Either of these could be used in a neural network at the neuron level.
7.# $omparing the 1esigns
The parallel design fits in 32[?] of the Spartan-3's slices and can run at a theoretical
maximum clock frequency of 7[?] MHz. The serial design is much smaller, needing only
13[?] slices, and can also run at almost twice the clock frequency: 13[?].[?] MHz.
Figure 7.1 - Slices required by the various components of the two CORDIC designs.
The parallel design requires a large number of 32:1 multiplexers to accomplish the
shifting. This is necessary because the amount by which the operands are shifted
increases each iteration of the algorithm. When switching to the serial approach, these
multiplexers are eliminated, resulting in significant resource savings. As seen in Figure
7.1, these savings account for most of the efficiency of the design.
The effect of word size on the area requirement of the designs can be seen in Figure
7.2. The parallel design always requires more area than the serial design, and the number
of slices required for the parallel design increases dramatically with each increase of the
word size. The area requirement of the serial design increases roughly linearly, while the
parallel design follows a sharp exponential curve.
The effect word size has on execution time is shown in Figure 7.3. The unit is
assumed to be operating at the maximum clock frequency, as reported by the Xilinx
synthesizer. This value differs between the parallel and serial designs. The maximum
clock frequency also decreases as word size increases, which further lengthens execution
time with larger word sizes. The serial design is severely impacted by the increase in
word size. Execution time increases exponentially with word size, while execution time
for the parallel design follows a roughly linear track.
Figure 7.2 - The effect of word size on the FPGA chip area requirements of the CORDIC unit.
Figure 7.3 - The effect of word size on the execution time of the CORDIC unit.
As can be seen by comparing the two figures, the area and timing requirements for
these designs are inversely proportional. This is true for most designs. In general, when
design changes are made that reduce the timing requirements for the system, these same
changes result in an increased area requirement. The move to bit-serial arithmetic, which
optimized the algorithm for area, requires each iteration of the algorithm to take longer.
Each bit in the operands will use one clock cycle to compute the new value. For these
designs, which use a word size of 32 bits and 31 iterations (with 2 repeats), the 34
clock cycles that the parallel implementation requires are increased to 1,057 cycles. This
is an increase of roughly 3000%. This is offset somewhat by the increase in clock frequency, but
the serial implementation still takes significantly longer to complete.
For applications that can tolerate less precision, further savings in both area and
computation time can be achieved. Neural networks in particular are probabilistic in
nature and may not need 32 bits of precision. Decreasing the word size to 1[?] bits will
drastically reduce the area requirement for the parallel design, since its area requirement
decreases exponentially with word size. The area requirement of the serial design
depends only linearly on the word size, so the area savings would not be as significant.
The execution time, however, would decrease exponentially.
Both implementations also have complex control units that are designed with
flexibility in mind. It is possible that the clock frequency for both designs could be
increased if the control unit could be simplified for a specific application.
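The cycle counts in this comparison can be reproduced with a simple cost model. The sketch below is an inference, not something the text states: it assumes the parallel unit spends one clock cycle per iteration, the bit-serial unit spends one clock cycle per operand bit per iteration, and each design has one cycle of overhead. The function and parameter names are illustrative only.

```python
def cordic_cycles(word_bits, iterations, repeats, overhead=1):
    """Return (parallel_cycles, serial_cycles) under the assumed cost model."""
    steps = iterations + repeats            # e.g. 31 iterations + 2 repeats = 33 steps
    parallel = steps + overhead             # assumed: one cycle per step
    serial = steps * word_bits + overhead   # assumed: one cycle per bit per step
    return parallel, serial


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    parallel, serial = cordic_cycles(word_bits=32, iterations=31, repeats=2)
    print(parallel, serial, round(percent_increase(parallel, serial)))
```

For a 32-bit word and 31 iterations with 2 repeats, this model gives 34 and 1,057 cycles, an increase of about 3009%. Wall-clock time also depends on each design's clock frequency, which the model does not include.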
7.2 Integration with Artificial Neural Networks
7.2.1 CORDIC Artificial Neuron
This section discusses the design of a basic artificial neuron that uses the CORDIC
unit to compute its transfer function. Figure 7.4 shows the design of this neuron. This
sample design uses 4 inputs, but the design can be scaled to accommodate any number of
inputs.
[Figure 7.4: block diagram. Legible labels: weights WT_1 through WT_4; CORDIC unit ports IN_X, IN_Y, IN_Z, START, RESET, CLK, E, and DONE.]
Figure 7.4 - Block diagram of a basic artificial neuron utilizing the CORDIC unit to compute the
transfer function e^x.
The 4 inputs are weighted by constants that are stored internally in the neuron. A
separate multiplier computes the weighted value for each input in parallel. As per the
artificial neuron model shown in Figure 2.1, these weighted values are then passed
through a summation function. Three adders are cascaded to add the four values together.
The CORDIC unit then computes the transfer function. The output of the final adder is
used as the argument to the function. In this case, since the CORDIC unit has been
designed to compute e^x, the sum is connected to the IN_Z input. IN_Y is initialized to 0,
and IN_X is initialized to 1.2075 to ensure an unscaled result. The START and RESET
signals may be provided externally or may be hard-wired to high or low. The E output of
the CORDIC unit then becomes the output of the neuron (NS9T) and may be connected
to any number of neurons in the next layer of the network.
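The data path described above (weights, summation, then e^x from the CORDIC unit) can be modeled in floating point. This is a rough sketch rather than the hardware design. It assumes the standard hyperbolic rotation recurrence, assumes that the two repeated iterations are the customary indices 4 and 13, and assumes that the E output is the sum of the final x and y values. The port names in the comments are an interpretation of the description, and all function names are hypothetical.

```python
import math


def hyperbolic_steps(iterations=31, repeats=(4, 13)):
    """Shift indices 1..iterations; indices in `repeats` are executed twice (assumed)."""
    steps = []
    for i in range(1, iterations + 1):
        steps.append(i)
        if i in repeats:
            steps.append(i)
    return steps


def hyperbolic_gain(steps):
    """Magnitude scaling accumulated by the hyperbolic rotations."""
    return math.prod(math.sqrt(1.0 - 2.0 ** (-2 * i)) for i in steps)


def cordic_exp(arg, iterations=31):
    """Approximate e**arg in hyperbolic rotation mode (|arg| below about 1.1)."""
    steps = hyperbolic_steps(iterations)
    # Assumed port roles: IN_X = 1/gain (about 1.2075), IN_Y = 0, IN_Z = argument.
    x, y, z = 1.0 / hyperbolic_gain(steps), 0.0, arg
    for i in steps:
        d = 1.0 if z >= 0.0 else -1.0
        # Cross-addition: each component receives a shifted copy of the other.
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)
    return x + y  # cosh(arg) + sinh(arg) = e**arg


def neuron(inputs, weights):
    """Weighted sum followed by the e**x transfer function."""
    total = sum(w * v for w, v in zip(weights, inputs))
    return cordic_exp(total)
```

Pre-loading x with the reciprocal of the gain is what removes the scaling, which is why a constant close to 1.2075 appears at the input. The hardware would use fixed-point shifts and a small table of arctanh constants in place of the floating-point operations used here.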
The multipliers can be of any design, but since all inputs use a fixed-point format, the output
must be shifted accordingly. The design used here has 28 bits reserved for the fractional
component of the values, so the final product must be shifted to the right 28 bits to ensure
that it is in the format expected by the CORDIC unit.
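The 28-bit right shift can be illustrated with ordinary integers. The sketch below assumes a 32-bit two's-complement word with 28 fractional bits and truncation of the bits that are shifted out. The helper names are hypothetical and are not taken from the design.

```python
import math

FRAC_BITS = 28   # fractional bits, as stated in the text
WORD_BITS = 32   # assumed total word size
MASK = (1 << WORD_BITS) - 1


def to_fixed(value):
    """Encode a real number as a signed integer with 28 fractional bits (truncating)."""
    return math.floor(value * (1 << FRAC_BITS))


def to_float(raw):
    """Decode a signed fixed-point integer back to a real number."""
    return raw / (1 << FRAC_BITS)


def fixed_mul(a, b):
    """Multiply two fixed-point values and realign the product.

    The raw product carries 56 fractional bits, so it is shifted right by 28.
    Python's >> on negative integers is an arithmetic shift.
    """
    return (a * b) >> FRAC_BITS


def to_hex(raw):
    """Show the 32-bit two's-complement bit pattern as 8 hex digits."""
    return format(raw & MASK, "08X")
```

With these helpers, `to_hex(to_fixed(0.3))` gives `04CCCCCC`, and `fixed_mul(to_fixed(2.1), to_fixed(-0.5))` decodes to a value within one least significant bit of -1.05.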
Resource              Parallel, count (%)   Serial, count (%)
Slices                661 (5.0)             463 (3.5)
Slice Flip-Flops      123 (0.5)             141 (0.5)
LUTs                  1,273 (4.8)           893 (3.4)
IOBs                  163 (73.8)            163 (73.8)
BRAMs                 1 (3.1)               1 (3.1)
18x18 Multipliers     16 (50.0)             16 (50.0)
GCLKs                 1 (12.5)              1 (12.5)
Table 7.1 - Device resource requirements for the artificial neuron designs using both the parallel and
serial CORDIC units.
The resource requirements of this design are shown in Table 7.1. The design was
synthesized using both the parallel and serial CORDIC units. The neuron with the parallel
CORDIC unit requires 661 of the Spartan's slices, while the serial design only needs 463.
The added area requirement for both designs comes from the 3 adders for the summation
function and the additional routing logic required to interconnect the multipliers and
adders. The neuron is able to benefit from the Spartan's built-in multipliers. For larger
networks, further chip area will be required to implement the multiplication function. The
serial design is able to operate at a maximum theoretical clock frequency of 140.0 MHz,
and the parallel design can operate at a theoretical maximum frequency of 77.9 MHz. The
area/time tradeoff holds true for the neuron design just as it did for the CORDIC unit
alone. The 32-bit adders present in the serial design are fast enough to allow the neuron
to still operate at the faster clock rate.
It should be noted that the serial neuron still uses parallel multipliers to compute the
weighted inputs and parallel adders to compute the summation function. It should be
possible to use both serial adders and serial multipliers to gain additional area savings.
Depending on the designs used, modifications to the CORDIC design may also be
required should a designer use this approach.
Results of a ModelSim simulation of the design are shown in Figure 7.5. The four
inputs used in the simulation are 0.3 (04CCCCCC₁₆), 0.02 (0051EB85₁₆), 2.1
(21999999₁₆), and 0.65 (0A666666₁₆). Their respective weights are 1, 2, -0.5, and 0.25.
The signals in1w through in4w show the results of multiplying the inputs by their
weighting constants.
The signal sumall shows the result of the summation function: the hex value
F73D70A3₁₆, or -0.547500. The signal nval is the final output of the neuron, and its hex
value of 09411A07₁₆ (0.578394) matches the result of e^(-0.5475).
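The reported simulation values can be checked by decoding the hex words as 32-bit two's-complement numbers with 28 fractional bits. The sketch below is only a cross-check of the arithmetic. The third weight is taken as -0.5 because the sumall word F73D70A3 has its sign bit set, so the sum must be negative. The helper name `from_hex` is hypothetical.

```python
import math

FRAC_BITS = 28


def from_hex(word):
    """Decode an 8-digit hex word as signed fixed point with 28 fractional bits."""
    raw = int(word, 16)
    if raw >= 1 << 31:       # sign bit set: interpret as two's complement
        raw -= 1 << 32
    return raw / (1 << FRAC_BITS)


inputs = [from_hex(h) for h in ("04CCCCCC", "0051EB85", "21999999", "0A666666")]
weights = [1.0, 2.0, -0.5, 0.25]

total = sum(w * v for w, v in zip(weights, inputs))
sumall = from_hex("F73D70A3")   # reported summation result
nval = from_hex("09411A07")     # reported neuron output

print(f"weighted sum = {total:.6f}, sumall = {sumall:.6f}")
print(f"exp(sum)     = {math.exp(total):.6f}, nval   = {nval:.6f}")
```

The decoded sum is about -0.5475 and the decoded output is about 0.578394, which agrees with e^(-0.5475) to roughly six decimal places.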
Since the implementation uses the Spartan's internal multipliers, the intermediate
results of computing the weighted values are not available. The only delay in the
simulation is from the CORDIC unit's execution. A design utilizing custom multipliers
would have increased execution time.
Figure 7.5 - Simulation of an artificial neuron using the 4 inputs (0.3, 0.02, 2.1, 0.65) and the weights
(1, 2, -0.5, 0.25).
7.2.2
For any implementation of an artificial neural network in a programmable device
such as an FPGA, at least two possibilities exist for implementing the arithmetic
computations: a high-speed lookup table (either on-chip or off) or either of the CORDIC
implementations discussed here.
Table-based implementations work much like the logarithm tables found in the back
of textbooks, requiring interpolation between rows. These designs can be very fast and
accurate, but require large amounts of chip area. For large networks, a
much larger (and more expensive) FPGA would be required to allow the lookup table to coexist with the
network. Additionally, all arithmetic calculations would have to be queued, slowing
down the operation of the entire network. If the table implementation is fast enough and
the network small enough, then performance may be comparable to a CORDIC
implementation.
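A table-based transfer function with interpolation between rows can be sketched as follows. This is a generic illustration and not a design from this work: the table range, the number of entries, and the function names are arbitrary assumptions, and a hardware table would hold fixed-point words rather than floats.

```python
import math


def build_exp_table(lo, hi, entries):
    """Tabulate e**x at evenly spaced points from lo to hi inclusive."""
    step = (hi - lo) / (entries - 1)
    table = [math.exp(lo + k * step) for k in range(entries)]
    return table, step


def table_exp(x, table, lo, step):
    """Look up e**x for lo <= x <= hi, interpolating linearly between two rows."""
    pos = (x - lo) / step
    k = min(int(pos), len(table) - 2)   # row just below x (clamped at the top edge)
    frac = pos - k                      # fractional distance toward the next row
    return table[k] + frac * (table[k + 1] - table[k])
```

Each lookup costs one subtraction, one multiplication, and two table reads, which is why the approach is fast. The table itself is the cost: halving the row spacing to improve accuracy doubles the storage.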
The CORDIC algorithm is powerful enough and small enough to be able to support a
neural network, ideally with a CORDIC unit in each neuron. This would maintain the
parallelism that is key to the operation of the neural network. The longer computation
time is not a major shortcoming, since the power of a neural network lies in its
parallelism. The fact that all neurons can be computing in parallel and all neurons can
finish processing their inputs at the same time allows for a natural progression of data
through the network.
The flexibility of the algorithm with its two computation modes that can each operate
in one of three domains means that the same unit can compute a wide variety of
functions. This gives the network designer the flexibility to vary the transfer functions
used with only minimal changes and no redesigning necessary. It would also be possible
for the CORDIC unit to perform the multiplication of the weighting factors with the
inputs, though for neurons with large numbers of inputs, this would come with a severe
performance penalty.
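To illustrate how a CORDIC unit could form a weighted input, the sketch below uses the standard linear-domain rotation mode, in which the product is built up from shifted copies of one operand. It is a floating-point model under stated assumptions: the multiplier must satisfy |z| < 2, the iteration count of 31 is borrowed from the designs discussed earlier, and the function name is hypothetical.

```python
def cordic_multiply(x, z, iterations=31):
    """Approximate x * z for |z| < 2 using only shift-and-add style steps."""
    y = 0.0
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        y += d * x * 2.0 ** -i   # add or subtract x shifted right by i bits
        z -= d * 2.0 ** -i       # drive the remaining multiplier toward zero
    return y
```

Each product takes a full pass of iterations, so a neuron with many inputs would need many passes through the unit (or many units). That is the performance penalty mentioned above.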
The systems designer would choose the implementation that best fits the size
of the network that would be supported. The fast table-based approaches would be
preferred because of their high performance, but only the smaller networks would be able
to effectively utilize them. For medium-size networks, the parallel CORDIC unit is the
best compromise between size and speed. And for large networks, the serial
implementation may be the only choice.
7.3 Future Study
Further research can be done into the benefits that can be derived from using the
CORDIC unit in a neural network. Study of a live, practical neural network will help in
understanding how large a network could be supported and which implementation is best
suited for networks of varying size.
Students may also wish to study ways that the CORDIC unit can be further
optimized specifically for use in neural networks. The control unit is one component that
is an easy target for optimization, but further area reduction may be possible when the
unit is integrated into the structure of a neuron. This also includes determining the ideal
word size for the application. The accuracy of the network needs to be taken into account
when determining the precision of the underlying hardware.
The CORDIC unit presented here is hard-wired to the hyperbolic rotational mode. A
truly complete general unit that is capable of operating in all three domains in both modes
should be possible with only a little more hardware. A study of the resource requirements
of this general-purpose unit could also be done.