Abstract - This paper presents a new type of coprocessor architecture suited for both conventional floating point and interval arithmetic. The coprocessor is composed of two logical processors (LPs). The floating point units are shared between these two LPs in order to reduce the area overhead. Some functional units implement two or more operations (for example, the multiply-add fused (MAF) unit can be used for addition, multiplication or multiply-add fused). The set of functional units can thus help reduce the number of structural hazards and increase the resource utilization (for example, if addition occurs on both LPs, one can be executed on the adder, while the other is executed on the MAF). In order to further reduce the data and structural hazards, a scheduler for this architecture is also proposed.

Keywords - Simultaneous multithreading, interval arithmetic, parallel architectures

I. INTRODUCTION

Interval arithmetic represents a more reliable alternative to conventional floating point (FP) arithmetic. Over the last decades, interval-based applications have been developed in a wide range of fields. Several approaches for dedicated interval arithmetic units have been proposed, like the ones in [1][5]. This paper presents a coprocessor architecture with support for interval arithmetic which can be used for improving the performance of these applications.

The basic architecture is that of a dual-threaded processor. The architecture of each LP is similar to a MIPS type architecture [3]. We chose this architecture to take advantage of the particular structure of the functional unit set. Each of the processor's threads has in-order issue, and each thread potentially issues an instruction every clock cycle. Thus, the difficulty of identifying data hazards is the same as in a standard single-thread in-order issue RISC, yet the potential IPC is double.

The functional units of this processor are: one adder, one multiplier, one multiply-add fused (MAF), one comparator and one divide-add fused (DAF). Some units can perform multiple operations: the multiplier can also be used for comparisons; the MAF can implement additions and multiplications, while the DAF can be used for divisions and additions. With three FUs able to execute an addition, and two each able to execute a comparison or a multiplication, we can reduce the number of structural hazards and increase resource utilization. Since the latencies of the addition and multiplication differ from one FU to another, a complication in the scheduling scheme appears. We try to solve this problem using a new scheduling technique. This technique is able to identify all the available FU parallelism, at the FU pipeline stage level. The novelty of the scheduling technique consists in the ability to dynamically verify whether the FU resources needed for a certain instruction are free. The availability of those resources varies dynamically with the instructions currently executing in the FU's micro FUs.

This paper is organized as follows: in Section II the overall architecture is presented, Section III is dedicated to the scheduling mechanism, while in Section IV the floating point units are presented. The last section is dedicated to the concluding remarks.

II. OVERALL ARCHITECTURE

The goal was to devise an architecture that can balance high functional unit usage and low overhead. To achieve this goal we chose a baseline simultaneous dual-threaded architecture similar to [4]. Simultaneous multithreading (SMT) and its advantages are presented in [7][9][10]. The basic idea that is of interest to us is SMT's ability to raise functional unit usage (and hence instructions per cycle) by issuing instructions from multiple threads in the same clock cycle. This feature increases the level of parallelism available.

As shown in Fig. 1, the physical processor is made up of two logical processors (LPs), each having its own pipeline. The issue logic and functional units (FUs) are shared between the logical processors. Each LP fetches and issues in order, at most one instruction per cycle. An independent thread runs on each logical processor. In our context, by thread we understand a contiguous set of FP and/or interval arithmetic instructions. Inside the coprocessor, the independence of the threads is implicit, since the two LPs have separate register files and the arithmetic operands can only be located in the register file.

The coprocessor's structure is based on a standard RISC architecture, more precisely on the MIPS architecture [3]. Each pipeline stage takes one clock cycle. The exception is the execution stage, which can take a variable number of clock cycles. The proposed pipeline contains four stages: instruction fetch (IF), instruction decode (ID), execute (EX) and write back (WB). IF and WB are duplicated for each LP. The register files are also duplicated, one per LP.
Operation            FUs
Addition             ADD, MAF, DAF
Multiplication       MUL, MAF
Comparison           COM, MUL
Division             DAF
Multiply-add fused   MAF
Divide-add fused     DAF
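The operation-to-FU table can be read as a capability map. The following is a minimal sketch, not the paper's scheduler: it looks only at a single issue cycle, ignores latencies and pipeline state, and assumes that the first FU listed for each operation is the preferred (specialized) one and that LP1 always chooses first. The names `FU_CAPABILITY` and `assign_fus` are hypothetical.

```python
# Operation -> FUs able to execute it, as listed in the table.
# ASSUMPTION: the order inside each list is a preference order (specialized unit first).
FU_CAPABILITY = {
    "add": ["ADD", "MAF", "DAF"],
    "mul": ["MUL", "MAF"],
    "cmp": ["COM", "MUL"],
    "div": ["DAF"],
    "maf": ["MAF"],
    "daf": ["DAF"],
}


def assign_fus(op_lp1, op_lp2):
    """Pick one FU per LP for the same issue cycle.

    Returns (fu_for_lp1, fu_for_lp2). A value of None for LP2 means no
    capable FU is left, i.e. a structural hazard that stalls LP2.
    ASSUMPTION: LP1 has fixed priority in this sketch.
    """
    fu1 = FU_CAPABILITY[op_lp1][0]
    # LP2 takes its most preferred FU that LP1 has not already claimed.
    fu2 = next((fu for fu in FU_CAPABILITY[op_lp2] if fu != fu1), None)
    return fu1, fu2
```

With this map, two simultaneous additions go to the adder and the MAF, while a division issued together with a divide-add fused leaves the second LP without a unit, since only the DAF can execute either operation.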
Fig. 2 - General structure of operation scheduler

Fig. 3 - Structure of micro FU hazard check unit
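The write-back (WB) hazard check in the operation scheduler keeps, for each LP, a shift register in which every bit stands for a future clock cycle that already has a write-back booked, together with a LUT giving the WB cycle for each FU-instruction combination. The sketch below illustrates that bookkeeping only. The class and function names are hypothetical, and the latency values in `WB_LUT` are invented for illustration; they are not taken from the paper.

```python
class WriteBackTracker:
    """One instance per LP. Bit i set = a write-back is booked i cycles from now."""

    def __init__(self):
        self.bits = 0

    def has_hazard(self, cycles_to_wb):
        # A WB hazard exists if that future cycle is already taken.
        return (self.bits >> cycles_to_wb) & 1 == 1

    def reserve(self, cycles_to_wb):
        self.bits |= 1 << cycles_to_wb

    def tick(self):
        # Advance one clock cycle: every booked write-back moves one cycle closer.
        self.bits >>= 1


# ASSUMPTION: illustrative latencies only (cycles from issue to write-back).
WB_LUT = {
    ("ADD", "add"): 3,
    ("MAF", "add"): 5,
    ("MUL", "mul"): 4,
}


def try_issue(tracker, fu, op):
    """Issue op on fu if its write-back cycle is free for this LP; else report a hazard."""
    cycles = WB_LUT[(fu, op)]
    if tracker.has_hazard(cycles):
        return False
    tracker.reserve(cycles)
    return True
```

For example, after an addition is issued on the MAF (write-back in 5 cycles) and one clock tick passes, a multiplication with a 4-cycle latency would collide with that write-back and has to wait, whereas a 3-cycle addition on the adder can still be issued.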
The WB hazard check block verifies whether the two instructions waiting to be issued will write to the register file in the same clock cycle as a previously issued instruction from the same thread. A LUT contains the WB clock cycle number for each valid FU-instruction combination, and has two read ports, each addressed by the opcode of one thread. The information in the LUT is used to check if a WB hazard exists or to update the clock cycles in which there is a WB. A shift register per LP (with each bit corresponding to a future clock cycle) holds the clock cycles in which there is a WB.

The RAW & WAW hazard check block verifies the existence of the hazards specified in its name. We use one register per LP to hold information about the registers to which the instructions in execution will write back. The information is used to check if any of the operands or the result register of the operations waiting to be issued are used by the operations in execution.

The micro FUs hazard check block (Fig. 3) verifies the potential hazards if the instructions in the ID stage are to be issued on each possible FU. Two LUTs and a structure based on shift registers (FU operation shift register - FUOSR) are used for this verification. The first LUT (FU CAN EXEC OP LUT) contains information about possible operation-FU combinations. The input address is an opcode, and the output is the set of bits in the LUT that correspond to the FU-input opcode combination. The second LUT (FU OP LUT) contains information on conflicts between two instructions that could start execution any number of cycles apart on the same FU. This LUT is addressed by a coding of the two selected FUs and the two selected operations. The output consists of the lines of information that indicate the possible conflicts between the selected operation and any possible future operation on the selected FU. The FUOSR is divided into groups of shift registers, one group for each functional unit. Each group contains as many shift registers as there are operations that can be executed on the corresponding functional unit. Also, each of the shift registers contains as many bits as the number of clock cycles required for the lengthiest operation. For output purposes, the FUOSR is addressed with an opcode. The opcode selects its corresponding shift register from every group (if it can be issued on the group's FU) and outputs from these registers the bits corresponding to the next clock cycle. For input, the registers are addressed by the selected functional unit signals, and the whole groups selected this way are ORed with the output of the FU OP LUT.

The second logic stage of the scheduler (the actual FU selection block) uses all the information resulting from the first stage to generate FU selection signals and stall signals for the two threads. Since in the proposed architecture various operations can be issued on different FUs, with different latencies, a selection criterion is needed for choosing the FUs for the two instructions waiting to issue in the ID stage. Our chosen criterion is to match each instruction with an FU optimized for it. For this purpose we use a priority encoding scheme. If a match is found for both instructions on the same FU at the same priority level, an LP priority flip-flop decides which LP uses the FU. The flip-flop then changes value to give the other LP high priority. The highest priority matches are selected.

The employed scheme takes as inputs all the outputs of the first logic stage. The outputs are the selected FUs and the selected operations (if an LP pipeline is stalled, the selected instruction for that LP is set to NOP), and also the stall signals for the two LPs. The selected FU and instruction signals are fed to the ID/EX register and also to the inputs of the blocks in the first logic stage, so that the machine's state can be updated. Each of the LUTs in the scheduler can be initialized with different values, permitting the modification of every FU's set of micro FUs. Also configurable in this way is the combination and order of FUs used for the execution of a certain instruction.

III. ARITHMETIC UNITS

Important features of this coprocessor are the arithmetic floating point units (FU). The FUs are designed for both interval and conventional floating point arithmetic. Six units of FU are used: interval adder, interval multiplier, floating point comparison unit, floating point MAF and floating point DAF.
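As background for the units in this section, the interval operations they support can be sketched on interval endpoints. The following Python sketch is illustrative only: the function names are hypothetical, intervals are plain `(lo, hi)` tuples, and it ignores the directed rounding that hardware interval units apply to the lower and upper bounds.

```python
def iadd(x, y):
    """Interval addition: add lower bounds and upper bounds separately."""
    return (x[0] + y[0], x[1] + y[1])


def imul(x, y):
    """Interval multiplication: min and max over the four endpoint products."""
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    return (min(products), max(products))


def imaf(x, y, z):
    """Interval multiply-add x*y + z, composed from imul and iadd."""
    return iadd(imul(x, y), z)


def hull(x, y):
    """Interval hull: smallest interval containing both x and y."""
    return (min(x[0], y[0]), max(x[1], y[1]))


def includes(x, y):
    """Interval inclusion: True if y lies entirely inside x."""
    return x[0] <= y[0] and y[1] <= x[1]
```

Each interval addition needs two floating point additions and each interval multiplication up to four products plus comparisons, which is why shared adders, multipliers and comparators can serve both interval and conventional floating point instructions.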
The interval adder design is the one proposed in [2]. This adder is based on a classical double path adder. The main characteristic of this type of adder is that the two floating point operations needed for an interval addition can be done simultaneously, each on a different path. This adder design is also suitable for SMT architectures, because two floating point additions can be performed in parallel.

The interval multiplier is based on a design presented in [1] and follows an algorithm suitable for pipeline structures. The structure of the interval multiplier is based on a dual result multiplier (a floating point multiplier with two differently rounded results for the same multiplication) and on two floating point comparators. Therefore, comparisons can also be performed using this type of multiplier.

Comparisons are very important in interval arithmetic, because they are used in interval set operations, like the interval hull, interval intersection and interval inclusion.

Two FUs for combined operations are used: MAF and DAF. For the MAF unit, a structure based on the design in [6] is used, which is a high performance floating point MAF unit. For the interval MAF, a combination of the interval addition and interval multiplication has to be done. Thus, the interval MAF is defined as:

\[
[X_{lo},X_{hi}] \cdot [Y_{lo},Y_{hi}] + [Z_{lo},Z_{hi}] = [\min(X_{lo}Y_{lo}+Z_{lo};\ X_{lo}Y_{hi}+Z_{lo};\ X_{hi}Y_{lo}+Z_{lo};\ X_{hi}Y_{hi}+Z_{lo}),\ \max(X_{lo}Y_{lo}+Z_{hi};\ X_{lo}Y_{hi}+Z_{hi};\ X_{hi}Y_{lo}+Z_{hi};\ X_{hi}Y_{hi}+Z_{hi})]
\]

\[
[X_{lo},X_{hi}] \cdot [Y_{lo},Y_{hi}] - [Z_{lo},Z_{hi}] = [\min(X_{lo}Y_{lo}-Z_{hi};\ X_{lo}Y_{hi}-Z_{hi};\ X_{hi}Y_{lo}-Z_{hi};\ X_{hi}Y_{hi}-Z_{hi}),\ \max(X_{lo}Y_{lo}-Z_{lo};\ X_{lo}Y_{hi}-Z_{lo};\ X_{hi}Y_{lo}-Z_{lo};\ X_{hi}Y_{hi}-Z_{lo})]
\]

In order to decrease the number of operations, a sign examination of the multiplication operands is done. This is similar to the interval multiplier design.

A new unit was designed for this coprocessor: the DAF unit. The floating point DAF was meant to improve Newton's interval method for solving equations and systems of equations, which is one of the high performance interval arithmetic methods [5]. Newton's interval method relies on a division followed by a subtraction at every iteration, thus justifying such a combined unit. Because interval division is simpler than interval multiplication, the interval DAF is simpler than the interval MAF, requiring only two floating point DAF operations.

IV. CONCLUSIONS AND FUTURE WORK

This paper presents a new type of SMT coprocessor with in-order fetch and issue, suitable for both interval and conventional floating point arithmetic. This coprocessor uses a specialized set of functional units, some of which can perform two or three interval and floating point operations specific to other units. The specialized units designed are: the interval adder, the interval multiplier, the comparator, the floating point MAF and the floating point DAF. The coprocessor design has two logical processors. In order to reduce the area overhead, the functional unit set is common to both logical processors. The units in this set are each specialized for one operation, but can execute other operations, although less optimally. Therefore, it is possible to execute two operations of the same type simultaneously if they occur on the two LPs. Thus, an increase in throughput is obtained. A specialized operation scheduler was designed, which tries to select for each operation the FU best suited for it. It also parallelizes tasks at the FU stage level.

The next step in the development and optimization of the coprocessor will be the construction and analysis of an interval benchmark (similar to SPECFPU). This way, a rigorous performance analysis can be made and further optimization can be achieved. Improvements can be made both in the operation scheduler and in the functional units. Regarding the operation scheduler, a further research direction of this project is to minimize the area occupied by the LUTs and shift registers and to experiment with different degrees of configurability. Regarding the functional units, improvements can be made to the DAF unit by including higher-performance SRT-based dividers.

ACKNOWLEDGEMENTS

This work was supported by the Romanian Second Research and Development National Plan (PNII) grants IDEI17/2007 and TD26/2007.

REFERENCES

[1] A. Amaricai, M. Vladutiu, L. Prodan, M. Udrescu, O. Boncalo, "Hardware Support for Combined Interval and Floating Point Multiplication", Proceedings 14th Mixed Design Of Integrated Circuits and Systems, 2007, pp. 278-282
[2] A. Amaricai, M. Vladutiu, L. Prodan, M. Udrescu, O. Boncalo, "Exploiting Parallelism in Double Path Adders' Structure for Increased Throughput of Floating Point Addition", Proceedings 10th EUROMICRO Conference on Digital System Design, Architectures, Methods and Tools, 2007, pp. 132-137
[3] J. L. Hennessy, D. A. Patterson, "Computer Architecture, Fourth Edition: A Quantitative Approach", Morgan Kaufmann, 2006
[4] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, T. Nishizawa, "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads", Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992, pp. 136-145
[5] U. W. Kulisch, "Advanced Arithmetic for the Digital Computer", Springer-Verlag, 2002
[6] T. Lang, J. Bruguera, "Floating-Point Multiply-Add-Fused with Reduced Latency", IEEE Transactions on Computers, Vol. 53, No. 8, 2004, pp. 988-1003
[7] H. Levy, J. Lo, J. Emer, R. Stamm, S. Eggers, D. M. Tullsen, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", Proceedings 23rd Annual International Symposium on Computer Architecture (ISCA'96), 1996, pp. 191-203
[8] C. V. Ramamoorthy, H. F. Li, "Pipeline Architecture", ACM Computing Surveys (CSUR), Vol. 9, Issue 1, 1977, pp. 61-102
[9] D. M. Tullsen, S. J. Eggers, H. Levy, "Simultaneous multithreading: maximizing on-chip parallelism", Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA'95), 1995, pp. 392-403
[10] T. Ungerer, B. Robic, J. Silc, "A survey of processors with explicit multithreading", ACM Computing Surveys (CSUR), Vol. 35, Issue 1, 2003, pp. 29-63