
CORDIC and SVD Implementation

in Digital Hardware
Przemysław M. Szecówka, Piotr Malinowski*

Faculty of Microsystem Electronics and Photonics
Wrocław University of Technology
Wrocław, Poland
przemyslaw.szecowka@pwr.wroc.pl
*) now with University School of Physical Education in Wrocław, Poland

Abstract: Singular Value Decomposition (SVD) is classified among
the most effective numerical methods of matrix inversion. The
paper presents a study of hardware implementations of the SVD and
CORDIC algorithms. Various digital architectures are proposed
and compared, including low-cost sequential and high-performance
pipelined solutions. Both fixed point and floating point
arithmetic are considered. The concepts were implemented in
VHDL, verified and synthesized with Xilinx tools. A selected
approach was physically implemented and tested.
Index Terms: CORDIC, SVD, digital, hardware, VHDL, FPGA
I. INTRODUCTION
Processing of matrices, especially inversion, remains a key
challenge for contemporary computing machines. Very smart
algorithms were proposed many years ago by scientists who
expected rapid development of digital hardware in the future.
Many of those solutions were intended to work on futuristic
parallel devices. CORDIC and Singular Value Decomposition (SVD)
are good examples here [1-3]. Recent years have finally brought
the long-expected rapid development of digital hardware and
growth in the complexity of programmable logic devices. There is
growing interest in the construction of dedicated digital
hardware according to more or less classic concepts [4-7].
This paper describes a study of hardware implementation of
Singular Value Decomposition of a matrix based on replicated
CORDIC modules. The authors focus on comparison of architecture
variants in terms of resource allocation, speed and accuracy.
Similar works may be found in contemporary literature [8],
showing growing interest in the practical use of the
achievements of the great mathematicians of the mid-20th century.
II. CORDIC AND SVD OVERVIEW
The CORDIC algorithm (Coordinate Rotation Digital Computer) was
proposed by Volder in 1959 [2]. Initially it was used to
transform polar to rectangular coordinates and the reverse. Then
CORDIC was extended to provide estimation of hyperbolic and
exponential functions, calculation of square roots and other
numeric applications. Nowadays it is extensively used in digital
signal and data processing such as DFT [7] and SVD [5], i.e. it
is quite a universal tool which may be applied in many variants
and configurations. In general CORDIC consists in iterative
rotations of a vector by a predefined series of constant angles.
The angles decrease in a special manner,
$\alpha_i = \operatorname{arctg} 2^{-i}$, forming the series
45°, 26.6°, 14°, 7.1°, 3.58° etc. Consecutive rotations are left
or right depending on the target and the actual result. The
accuracy increases with the number of rotations n. This generic
scheme may be applied in various modes, depending on needs. If
the target is rotation by a defined angle, a series of rotations
is performed. For 2-dimensional space, where the vector
$[x_0, y_0]^T$ is to be rotated by an angle $z_0$, after n
iterations the new coordinates are:
$$x_n = \frac{1}{K_n}\left[x_0 \cos z_0 - y_0 \sin z_0\right] \tag{1}$$
$$y_n = \frac{1}{K_n}\left[y_0 \cos z_0 + x_0 \sin z_0\right] \tag{2}$$
whilst the final rotation angle $z_n = 0$.
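The rotation mode described above can be sketched in a few lines of Python. This is a minimal floating point model and not the authors' VHDL; the function names, the default iteration count and the sign convention of the direction variable `d` (+1 for a counter-clockwise step) are assumptions made for illustration.

```python
import math

def cordic_rotate(x0, y0, z0, n=16):
    """Rotate (x0, y0) by z0 radians in n steps (|z0| must stay below about 1.74)."""
    x, y, z = x0, y0, z0
    for i in range(n):
        t = 2.0 ** -i                  # tangent of the i-th elementary angle
        d = 1.0 if z >= 0.0 else -1.0  # assumed convention: +1 turns counter-clockwise
        x, y = x - d * y * t, y + d * x * t
        z -= d * math.atan(t)          # remaining angle is driven towards zero
    return x, y, z

def scale_factor(n):
    """K_n of equations (1) and (2): the raw outputs are the true values divided by K_n."""
    k = 1.0
    for i in range(n):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k
```

Multiplying the raw outputs by `scale_factor(n)` recovers the rotated coordinates; the residual angle returned as the third value is bounded by the last elementary angle, arctg 2^-(n-1).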
In vector mode CORDIC determines the angle between the
$[x_0, y_0]^T$ vector and the X axis. After a series of dummy
iterative rotations the new coordinates would be
$$x_n = \frac{1}{K_n}\sqrt{x_0^2 + y_0^2} \tag{3}$$
$$y_n = 0 \tag{4}$$
and $z_n = \operatorname{arctg}\left(\frac{y_0}{x_0}\right)$. The
product of the algorithm in such a case, however, is the
numerical value of $z_n$, determined by the cumulated sum of
angles (+/- for left/right) applied in consecutive rotations.
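A minimal Python sketch of the vector mode follows. It assumes x0 > 0 and starts the angle accumulator at zero; the function name and the floating point arithmetic are illustrative assumptions rather than the implementation studied in the paper.

```python
import math

def cordic_vector(x0, y0, n=24):
    """Drive y towards zero; returns (x_n, y_n, z_n). Assumes x0 > 0."""
    x, y, z = x0, y0, 0.0
    for i in range(n):
        t = 2.0 ** -i
        d = 1.0 if y < 0.0 else -1.0   # pick the direction that reduces |y|
        x, y = x - d * y * t, y + d * x * t
        z -= d * math.atan(t)          # accumulate the angles applied so far
    return x, y, z
```

For the input (3, 4) the accumulated angle approaches arctg(4/3) and the first output approaches 5 divided by the scale factor K_n.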
MIXDES 2010, 17th International Conference "Mixed Design of Integrated Circuits and Systems", June 24-26, 2010, Wrocław, Poland
Copyright 2010 by Department of Microelectronics & Computer Science, Technical University of Lodz

Singular Value Decomposition of a matrix consists in finding a
series of singular values $\sigma_1, \sigma_2, \ldots, \sigma_l$
which simplify inversion of the matrix. For each matrix
$M \in \mathbb{R}^{m,n}$ there exist orthogonal matrices
$U \in \mathbb{R}^{m,m}$ and $V \in \mathbb{R}^{n,n}$ for which
$$U^T M V = \Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_l) \in \mathbb{R}^{m,n} \tag{5}$$
where l = min(m,n), and for r = rank(M) the diagonal values
fulfill the conditions
$$\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0 \tag{6}$$
$$\sigma_{r+1} = \sigma_{r+2} = \ldots = \sigma_l = 0 \tag{7}$$
A pseudo-inverse matrix $M^+$ may be determined by
$$M^+ = V \Sigma^+ U^T \tag{8}$$
where $\Sigma^+$ is the pseudo-inverse of the diagonal matrix,
i.e. it is a diagonal matrix formed by the inverted (when
non-zero) values of $\sigma_1, \sigma_2, \ldots, \sigma_l$. SVD
is currently classified among the most efficient numerical
methods of matrix inversion. SVD may be performed by the
appropriate rotation of a matrix. For a basic 2x2 matrix
$M = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ the rotation
angle is $\operatorname{arctg}\left(\frac{c + b}{d - a}\right)$.
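How such a rotation diagonalises a 2x2 matrix can be illustrated with a short Python sketch. It takes the angle given in the text as arctg((c + b)/(d - a)) and treats it as the sum of a left and a right rotation angle. The companion formula for the difference of the two angles, arctg((c - b)/(d + a)), is not given in the text; it is an assumption taken from the standard two-sided rotation scheme. Library trigonometric functions stand in for the CORDIC modules, and all names are hypothetical.

```python
import math

def svd2x2(a, b, c, d):
    """Diagonalise M = [[a, b], [c, d]] by a left and a right plane rotation."""
    theta_sum = math.atan2(c + b, d - a)    # theta_r + theta_l (angle from the text)
    theta_diff = math.atan2(c - b, d + a)   # theta_r - theta_l (assumed companion formula)
    theta_r = 0.5 * (theta_sum + theta_diff)
    theta_l = 0.5 * (theta_sum - theta_diff)
    cl, sl = math.cos(theta_l), math.sin(theta_l)
    cr, sr = math.cos(theta_r), math.sin(theta_r)
    # B = R(theta_l)^T * M * R(theta_r), with R(t) = [[cos t, sin t], [-sin t, cos t]]
    p11, p12 = a * cr - b * sr, a * sr + b * cr      # rows of M * R(theta_r)
    p21, p22 = c * cr - d * sr, c * sr + d * cr
    b11, b12 = cl * p11 - sl * p21, cl * p12 - sl * p22
    b21, b22 = sl * p11 + cl * p21, sl * p12 + cl * p22
    return (b11, b12, b21, b22), theta_l, theta_r
```

The off-diagonal entries b12 and b21 vanish up to rounding, and the singular values are the absolute values of b11 and b22, since the diagonal entries produced this way may be negative.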
The rotation of a 2x2 matrix may be done by double use of CORDIC
in two modes. First the appropriate angles are determined and
then the rotations are performed. Due to the properties of
CORDIC the iterations may be described by combinations of
additions/subtractions and bit shifts:
$$x_{i+1} = x_i + \delta_i\,\mathrm{SHIFT}_i(y_i) \tag{9}$$
$$y_{i+1} = y_i - \delta_i\,\mathrm{SHIFT}_i(x_i) \tag{10}$$
where $\delta_i = \pm 1$ denotes left or right rotation and
$\mathrm{SHIFT}_i$ is a right shift by i bits. Consequently a
hardware implementation of CORDIC consists of adders,
subtractors and muxes.
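The shift-and-add form of equations (9) and (10) can be modelled with plain integers, where Python's `>>` acts as an arithmetic right shift. The sketch below is only an illustration: the 16-bit fractional scaling, the table length and the choice that a direction value of -1 means a counter-clockwise step are assumptions, not details taken from the paper.

```python
import math

FRAC_BITS = 16   # assumed fixed point scaling, not from the paper
ANGLES = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(16)]

def cordic_step(x, y, z, i):
    """One iteration on integers: a sign test, two shifts and three add/subtracts."""
    delta = -1 if z >= 0 else 1      # assumed convention: -1 turns counter-clockwise
    x_next = x + delta * (y >> i)    # equation (9); in hardware a mux selects add or subtract
    y_next = y - delta * (x >> i)    # equation (10)
    z_next = z + delta * ANGLES[i]   # remaining angle moves towards zero
    return x_next, y_next, z_next

def rotate_fixed(x, y, z):
    """Run all iterations of the rotation mode on fixed point integers."""
    for i in range(len(ANGLES)):
        x, y, z = cordic_step(x, y, z, i)
    return x, y, z
```

No multiplier is needed apart from the final scale correction by K_n, which is why the datapath reduces to adders, subtractors and multiplexers.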


Figure 1. CORDIC - sequential architecture
III. CORDIC ARCHITECTURE
Two variants of CORDIC architecture are presented in Fig. 1 and
2. Both solutions are fully synchronous with a single clock. In
the first, sequential approach, arithmetic modules are shared by
iterations. Intermediate results are fed back via the registers
and the appropriate angles are delivered to the arithmetic units
by the muxes. Control is provided by an iteration counter.
Another concept is the pipelined architecture presented in
Fig. 2. The schematic shows hardware providing 3 consecutive
iterations. Arithmetic blocks are replicated for each iteration,
thus the data flow may form a pipeline. This solution provides
much higher throughput but needs more hardware resources. On the
other hand the control circuitry is simpler for this solution,
leading to some savings and a much higher available clock speed.
The two concepts were implemented in VHDL [9], verified and
synthesized with Xilinx ISE [10] tools for a Virtex-5
programmable device. Arithmetic is fixed point with 8-bit
numbers coded in two's complement. Synthesis results summarized
in Table I clearly show the difference between the low-cost and
high-speed approaches.
TABLE I. SYNTHESIS RESULTS FOR 2 VARIANTS OF CORDIC ARCHITECTURES

                             Sequential          Pipelined
Number of Slice Registers    56                  208
Number of Slice LUTs         151                 243
Clock frequency              257 MHz             428 MHz
Levels of Logic              10                  2
Delay                        3.891 ns            2.336 ns
Delay on Logic               1.612 ns (41.4%)    0.659 ns (28.2%)
Delay on Route               2.279 ns (58.6%)    1.677 ns (71.8%)



Figure 2. CORDIC pipelined architecture.
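The difference between the two architectures can be illustrated with a small cycle-level model in Python. It is a behavioural sketch only: the number of iterations (8), the floating point stage function and all names are assumptions, and it models clock counts rather than clock frequency.

```python
import math

N_STAGES = 8   # assumed number of iterations, chosen for illustration

def stage(i, state):
    """Combinational logic of iteration i (rotation mode, floating point for brevity)."""
    x, y, z = state
    d = 1.0 if z >= 0.0 else -1.0
    t = 2.0 ** -i
    return (x - d * y * t, y + d * x * t, z - d * math.atan(t))

def run_sequential(inputs):
    """One shared arithmetic unit: N_STAGES clock cycles per input vector."""
    clocks, results = 0, []
    for state in inputs:
        for i in range(N_STAGES):
            state = stage(i, state)
            clocks += 1
        results.append(state)
    return results, clocks

def run_pipelined(inputs):
    """One arithmetic unit per iteration, with a register after each stage."""
    regs = [None] * N_STAGES          # regs[i] holds the output of stage i
    pending = list(inputs)
    clocks, results = 0, []
    while len(results) < len(inputs):
        # update from the last stage backwards so each register reads the old value
        for i in range(N_STAGES - 1, 0, -1):
            regs[i] = stage(i, regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stage(0, pending.pop(0)) if pending else None
        clocks += 1
        if regs[-1] is not None:
            results.append(regs[-1])
    return results, clocks
```

For m input vectors the sequential model needs m * N_STAGES clock cycles, while the pipelined model needs N_STAGES + m - 1: after the initial latency one result leaves the pipeline on every clock.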

IV. SVD ARCHITECTURE
The general concept of the SVD architecture based on CORDIC
modules is presented in Fig. 3. The input is a basic 2x2 matrix.
The primary outputs are two singular values; the secondary
outputs are the rotation angles. This module, either replicated
or reused, may be applied in the construction of dedicated
devices working with bigger matrices. A detailed schematic of
the vector rotation block is presented in Fig. 4. It is a
synchronous machine based on a single CORDIC element reused for
consecutive iterations. The CORDIC output is fed back to the
input via the register until the final value is obtained and
latched. The rotation angle is delivered by the module shown in
Fig. 5. The arithmetic block is again reused for consecutive
iterations, thus the output is fed back. The appropriate angles
for elementary rotations are stored in a memory. Control of data
flow in these two modules is provided by the Finite State
Machine working together with the iteration counter. The
schematic of the FSM is presented in Fig. 6. The initial neutral
state is wait. Activation of the strobe signal forces
calculation of the angle and then the following steps of
processing.








Figure 3. Basic SVD architecture composed of CORDIC blocks (block labels: SVD 2x2, CORDIC, SHIFT-SUM; inputs a, b, c, d)
After transition to each state the iteration counter is
activated and counts to a predefined value. When the appropriate
number of iterations is reached the FSM moves to the next state.
The two final stages are used to correct the scale of the output
values, disturbed during the iterative approximations. In
general the machine cycles through all the states, with one
exception: on request, new processing may start immediately with
the wait state skipped.
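The control flow described above can be sketched as a small state machine in Python. Only the state name "wait" comes from the text; the other state names, the iteration counts and the class interface are hypothetical and serve only to illustrate the interplay of the FSM and the iteration counter.

```python
STATES = ["wait", "angle", "rotate", "scale_1", "scale_2"]  # names other than "wait" are hypothetical
ITERATIONS = {"angle": 24, "rotate": 24, "scale_1": 1, "scale_2": 1}  # assumed counts

class SvdFsm:
    def __init__(self):
        self.state = "wait"
        self.counter = 0

    def clock(self, strobe):
        """Advance by one clock cycle; strobe requests a new computation."""
        if self.state == "wait":
            if strobe:
                self.state, self.counter = "angle", 0
            return self.state
        self.counter += 1
        if self.counter < ITERATIONS[self.state]:
            return self.state                      # keep iterating in this state
        self.counter = 0
        if self.state == "scale_2":
            # last state: return to wait, or restart at once if a request is pending
            self.state = "angle" if strobe else "wait"
        else:
            self.state = STATES[STATES.index(self.state) + 1]
        return self.state
```

With the assumed counts one complete computation takes 50 clock cycles after the strobe, and asserting the strobe in the last cycle restarts the angle calculation without passing through the wait state.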

Figure 4. SVD architecture - vector rotation block (blocks: CORDIC, shiftsum; signals: clk, nreset, enable, iteration, FSM state; outputs: Out 1, Out 2)


Figure 5. SVD architecture - calculation of rotation angle (ROM holding rotation_angle 1 to rotation_angle 24; signals: clk, nreset, enable, iteration, FSM state)
For this part of the study two kinds of number formats and
arithmetic were applied. In the first approach floating point
numbers compatible with the IEEE 754 standard [11] were used. In
this format the bit vector consists of a sign bit, an 8-bit
biased exponent and a 23-bit (non-negative) significand. The
other approach was fixed point arithmetic with 25-bit vectors
coded in two's complement. For the constant angles a specific
format was chosen: fixed point with 2 bits reserved for the
integral part and the rest left for the fraction (the possible
angle values, when expressed in radians, do not exceed 2). The
CORDIC module described in the previous section was redesigned
twice for these two formats.
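The fixed point angle format can be illustrated as follows. The sketch assumes that the 2 integral bits include the sign bit, which leaves 23 fractional bits in a 25-bit word and a representable range of [-2, 2). That split, the table length of 24 entries and the function names are assumptions made for illustration.

```python
import math

TOTAL_BITS = 25
INT_BITS = 2                          # assumed to include the sign bit
FRAC_BITS = TOTAL_BITS - INT_BITS     # 23 fractional bits under that assumption

def to_fixed(value):
    """Quantise a real number to a two's complement integer code."""
    code = round(value * (1 << FRAC_BITS))
    lo, hi = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    if not lo <= code <= hi:
        raise OverflowError("value does not fit the assumed format")
    return code

def from_fixed(code):
    """Convert an integer code back to a real number."""
    return code / (1 << FRAC_BITS)

# Elementary rotation angles arctg(2**-i) in radians, as they might be stored in a ROM
ANGLE_ROM = [to_fixed(math.atan(2.0 ** -i)) for i in range(24)]
```

Under these assumptions each stored angle differs from its exact value by at most half of the least significant bit, i.e. 2^-24 radians.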

Figure 6. Finite State Machine controlling SVD
The SVD architecture with 2 variants of arithmetic was
implemented in VHDL and synthesized for a Xilinx Virtex-5
device. Synthesis results are summarized in Table II. In terms
of resource allocation there is no large difference in the
number of registers allocated (337 vs. 314). On the other hand
the floating point variant consumes much more combinatorial
logic (4648 vs. 2609 slice LUTs). There is a large difference in
maximum clock speed: 148 MHz for the fixed point version and
only 35 MHz for the floating point approach. Arithmetic
operations on floating point numbers require long chains of
combinatorial logic, which need more time to transfer a signal
from one register to another.
TABLE II. SYNTHESIS RESULTS FOR 2 VARIANTS OF SVD ARCHITECTURE

                             32-bit IEEE floating point    25-bit fixed point
Clock frequency              35 MHz                        148 MHz
Levels of Logic              74                            35
Delay                        28.602 ns                     6.738 ns
Number of Slice Registers    337 (1%)                      314 (1%)
Number of Slice LUTs         4648 (14%)                    2609 (7%)
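As a quick consistency check on Table II, the reported clock frequencies agree with the reported delays if the maximum clock frequency is taken as the reciprocal of the critical path delay. That relation is an assumption about how the figures were derived; it is not stated in the text.

```latex
\[
f_{\max} = \frac{1}{T_{\mathrm{delay}}}:\qquad
\frac{1}{28.602\ \mathrm{ns}} \approx 35.0\ \mathrm{MHz},\qquad
\frac{1}{6.738\ \mathrm{ns}} \approx 148.4\ \mathrm{MHz},\qquad
\frac{28.602}{6.738} \approx 4.2
\]
```

Under this reading the fixed point variant can be clocked roughly four times faster than the floating point one.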

The two code variants were simulated in the Xilinx ISE
environment for several sample matrices. The results were
written to a file, converted and compared with those given by
the SVD algorithm run in the Octave environment. Fig. 7 shows
two plots of relative errors obtained for the two architectures.
The fixed point architecture visibly delivers substantially
better results.
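The comparison against a reference result reduces to a relative error metric, which can be sketched in Python as below. The helper that rounds a double precision value to IEEE 754 single precision is only a stand-in for hardware output and is an assumption of this sketch; it does not reproduce the authors' simulation flow.

```python
import math
import struct

def relative_error(measured, reference):
    """Relative error |delta sigma / sigma|; the reference must be non-zero."""
    return abs(measured - reference) / abs(reference)

def to_float32(value):
    """Round a Python float (binary64) to the nearest IEEE 754 binary32 value."""
    return struct.unpack(">f", struct.pack(">f", value))[0]
```

For example, `relative_error(to_float32(s), s)` gives the error caused solely by storing a reference singular value `s` in single precision, which never exceeds 2^-24 for values in the normal range of the format. Errors accumulated during the iterations come on top of that.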
Figure 7. Relative error $|\Delta\sigma_1 / \sigma_1|$ of singular value determination versus $\sigma_1$ for two kinds of arithmetic: 25-bit fixed point (lower plot) and 32-bit floating point (upper plot). Horizontal axis: $\sigma_1$ from $10^{-23}$ to $10^{37}$; vertical axis: from 0 to $1.4 \times 10^{-6}$.
V. CONCLUSIONS
A comprehensive study of digital hardware dedicated to Singular
Value Decomposition was performed. The motivation was the
authors' interest in the construction of specialized computing
machines performing operations on matrices in a highly parallel
way. Significant effort was devoted to the CORDIC algorithm,
which was used for SVD but may be treated as a separate issue as
well. The results lead to the conclusion that contemporary FPGAs
are very close to enabling the construction of machines dealing
with huge computational complexity.
The presented results, although limited to small matrices, are a
good basis for further work, and at this stage already deliver
reasonable comparative material about architecture and
arithmetic variants. In this context the results obtained for
fixed and floating point are very interesting. As expected, the
fixed point approach provides higher processing speed and lower
allocation of logic resources. A surprising result was the
higher precision obtained with fixed point. It should be noted,
however, that the 25-bit vectors were selected after very
careful consideration and estimation.
Further research will focus on the construction of devices
dealing with matrices of higher dimension, perhaps with
processing decomposed into basic 2x2 elements, so that the
described modules may be used without any redesign. An advantage
of this approach is the chance to develop a methodology for
processing matrices of unlimited dimension with a limited number
of basic SVD/CORDIC units. That would enable optimal utilization
of currently available resources, at least partially independent
of input complexity.
REFERENCES
[1] C. Eckart, G. Young, "The approximation of one matrix by another of lower rank", Psychometrika, vol. 1, no. 3, 1936.
[2] J.E. Volder, "The CORDIC Trigonometric Computing Technique", IRE Transactions on Electronic Computers, 1959.
[3] G. Golub, W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix", J. SIAM Numerical Analysis, Ser. B, vol. 2, no. 2, 1965, pp. 205-224.
[4] R.P. Brent, F.T. Luk, C.F. Van Loan, "Computation of the singular value decomposition using mesh-connected processors", Journal for VLSI Computer Systems, vol. 1, no. 3, 1985, pp. 243-270.
[5] J.R. Cavallaro, F.T. Luk, "CORDIC Arithmetic for a SVD Processor", Journal for Parallel and Distributed Computing, vol. 5, 1988, pp. 271-290.
[6] R. Andraka, "A Survey of CORDIC Algorithms for FPGA based computers", in FPGA '98: Proc. of sixth international symposium on Field programmable gate arrays ACM/SIGDA, 1998, pp. 191-200.
[7] F. Deprettere (ed.), "SVD and signal processing. Algorithms, applications and architectures", Department of Electrical Engineering, Delft University of Technology, Elsevier Science Publishers B.V., Amsterdam, 1988.
[8] H. Wang, P. Leray, J. Palicot, "A CORDIC-based dynamically reconfigurable FPGA architecture for signal processing algorithms", URSI 08, The XXIX General Assembly of the International Union of Radio Science, Chicago IL, 2008.
[9] VHDL, IEEE Std No. 1076, 2000.
[10] Xilinx ISE Web Pack, www.xilinx.com, 2009.
[11] Floating-point arithmetic, IEEE Std No. 754, 2008.




