Jin-Fu Li
Department of Electrical Engineering
National Central University
Barrel Shifter
– Shift or rotate the operand by any number of bits
Address register and incrementer
Data Registers
– Hold data passing to and from memory
Instruction Decoder and Control
[Figure: datapath showing instruction decode & control, multiply register, A bus, B bus, ALU bus, barrel shifter, ALU, address register, incrementer, data out register, data in register, D[31:0]]
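The barrel shifter's behaviour (shift or rotate by any number of bits) can be sketched in software. This is only an illustrative model on 32-bit words; the function names `lsl`, `lsr` and `ror` and the handling of amounts of 32 or more are assumptions made for the sketch, not a description of the hardware.

```python
MASK32 = 0xFFFFFFFF  # operands are modelled as 32-bit words


def lsl(value, amount):
    """Logical shift left; bits shifted past bit 31 are discarded."""
    if amount >= 32:
        return 0
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right; zeros are shifted in from the top."""
    if amount >= 32:
        return 0
    return (value & MASK32) >> amount


def ror(value, amount):
    """Rotate right; bits leaving bit 0 re-enter at bit 31."""
    value &= MASK32
    amount %= 32
    if amount == 0:
        return value
    return ((value >> amount) | (value << (32 - amount))) & MASK32
```

For instance, `ror(0x12345678, 8)` moves the low byte to the top of the word, and `lsl(x, 2)` multiplies a word offset by 4.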
SOC Consortium Course Material 4
3-Stage Pipeline (1/2)
Fetch
– The instruction is fetched from memory and placed in the instruction pipeline
Decode
– The instruction is decoded and the datapath control signals prepared for the
next cycle
Execute
– The register bank is read, an operand shifted, the ALU result generated and
written back into the destination register
3-Stage Pipeline (2/2)
At any time slice, 3 different instructions may occupy these stages (one in each), so the hardware in each stage has to be capable of independent operation
When the processor is executing data processing instructions, the latency = 3 cycles and the throughput = 1 instruction/cycle
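The latency and throughput figures above can be checked with a small model. The sketch below assumes an ideal pipeline with no stalls, where every instruction spends exactly one cycle in each stage; the names `pipeline_cycles` and `occupancy` are hypothetical helpers introduced for illustration.

```python
STAGES = ("Fetch", "Decode", "Execute")


def pipeline_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles to run n instructions through an ideal, stall-free pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction takes n_stages cycles (the latency);
    # each later instruction completes one cycle after its predecessor.
    return n_stages + (n_instructions - 1)


def occupancy(n_instructions):
    """For each cycle, map stage name -> index of the instruction in it."""
    table = []
    for cycle in range(pipeline_cycles(n_instructions)):
        row = {}
        for stage_index, name in enumerate(STAGES):
            instr = cycle - stage_index
            if 0 <= instr < n_instructions:
                row[name] = instr
        table.append(row)
    return table
```

With 10 instructions the model gives 12 cycles: 3 cycles of latency for the first instruction, then one completed instruction per cycle.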
[Figure: datapath activity, two panels: operand B taken from register Rm, and operand B taken from instruction bits [7:0]; in both, Rn supplies operand A and the result is written to Rd]
[Figure: datapath activity for a store: (a) 1st cycle - compute address (offset from instruction bits [11:0]); (b) 2nd cycle - store data & auto-index]
[Figure: datapath activity for a branch: (a) 1st cycle - compute branch target (instruction bits [23:0], lsl #2, added to PC); (b) 2nd cycle - save return address in R14]
The third cycle, which is required to complete the pipeline refilling, is also
used to make a small correction to the value stored in the link register,
so that it points directly at the instruction which follows the branch
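The branch cycles described above can be sketched as arithmetic. The sketch assumes the usual ARM conventions that the PC reads as the branch address plus 8 and that the 24-bit offset field is sign-extended before the `lsl #2`; the function names are hypothetical and chosen only for this illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend_24(field):
    """Interpret the low 24 bits of `field` as a signed two's-complement value."""
    field &= 0xFFFFFF
    return field - (1 << 24) if field & 0x800000 else field


def branch_target(branch_address, offset_field):
    """1st cycle: target = PC + (offset << 2), with PC = branch address + 8 (assumed)."""
    pc = branch_address + 8
    return (pc + (sign_extend_24(offset_field) << 2)) & MASK32


def link_value(branch_address):
    """2nd and 3rd cycles: save PC in R14, then correct it by 4 (assumed)
    so that it points at the instruction following the branch."""
    pc = branch_address + 8
    return (pc - 4) & MASK32
```

Under these assumptions, a branch at address 0x8000 with offset field 0xFFFFFE targets itself, and a branch-with-link at 0x8000 leaves 0x8004 in the link register.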
Branch Pipeline Example
Decode
– The instruction is decoded and register operands read from the register file. There are 3 operand read ports in the register file, so most ARM instructions can source all their operands in one cycle
Execute
– An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU
[Figure: pipeline organization showing instruction decode, register read, execute, buffer/data and write-back stages, with ALU forwarding paths]
Write back
– The results generated by the instruction are written back to the register file, including any data loaded from memory
Forwarding works as follows:
– The ALU result from the EX/MEM register is always fed back to the ALU input latches.
– If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
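The selection made by the forwarding control logic can be sketched as a simple multiplexer. This is a minimal model, assuming a single forwarding source (the EX/MEM register); the function and parameter names are hypothetical.

```python
def select_alu_input(src_reg, regfile_value, ex_mem_dest, ex_mem_result):
    """Choose the ALU input for one source operand.

    src_reg        -- register number the current instruction reads
    regfile_value  -- value read from the register file (possibly stale)
    ex_mem_dest    -- register written by the previous ALU operation,
                      or None if it wrote no register
    ex_mem_result  -- ALU result held in the EX/MEM register
    """
    if ex_mem_dest is not None and ex_mem_dest == src_reg:
        # The previous ALU operation wrote this source register:
        # take the forwarded result instead of the register file value.
        return ex_mem_result
    return regfile_value
```

In this model, `select_alu_input(1, 10, 1, 99)` returns the forwarded value 99, while a read of a different register returns the register file value unchanged.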
Cycle           1      2      3      4      5      6      7      8
LDR R1,@(R2)    IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     EXsub  MEM    WB
AND R6,R1,R7                  IF     ID     EXand  MEM    WB
OR R8,R1,R9                          IF     ID     EXE    MEM    WB
Cycle           1      2      3      4      5      6      7      8      9
LDR R1,@(R2)    IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     stall  EXsub  MEM    WB
AND R6,R1,R7                  IF     stall  ID     EX     MEM    WB
OR R8,R1,R9                          stall  IF     ID     EX     MEM    WB
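The one-cycle stall in the second timing table arises because the loaded value is not available until the end of MEM, one cycle too late for the EX stage of the very next instruction. A minimal sketch of that check follows, assuming a 5-stage pipeline with forwarding in which only a load followed immediately by a dependent instruction stalls; the `Instr` record and function names are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: text, destination, sources, load flag.
Instr = namedtuple("Instr", "text dest srcs is_load")

PROGRAM = [
    Instr("LDR R1,@(R2)", "R1", ("R2",), True),
    Instr("SUB R4,R1,R5", "R4", ("R1", "R5"), False),
    Instr("AND R6,R1,R7", "R6", ("R1", "R7"), False),
    Instr("OR R8,R1,R9", "R8", ("R1", "R9"), False),
]


def count_load_use_stalls(program):
    """Count stalls: one per load whose result is read by the next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.is_load and prev.dest in cur.srcs:
            stalls += 1
    return stalls


def total_cycles(program, n_stages=5):
    """Ideal pipeline time plus one cycle per load-use stall."""
    if not program:
        return 0
    return n_stages + len(program) - 1 + count_load_use_stalls(program)
```

For the four-instruction sequence above this model gives one stall and 9 cycles in total, matching the table; only SUB needs the stall, since AND and OR read R1 late enough to receive it by forwarding.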
[Figure: processor core with Embedded ICE, scan chain 0, scan chain 2, bus splitter and JTAG TAP controller; signals extern0, extern1, opc, r/w, mreq, trans, mas[1:0], A[31:0], Din[31:0], Dout[31:0] and other signals]
ARM710T
– 8K unified write through cache
– Full memory management unit supporting virtual memory
– Write buffer
ARM720T
– As ARM 710T but with WinCE support
ARM 740T
– 8K unified write through cache
– Memory protection unit
– Write buffer
ARM8
Higher performance than ARM7
– By increasing the clock rate
– By reducing the CPI
• Higher memory bandwidth, 64-bit wide memory
• Separate memories for instruction and data accesses
ARM8, ARM9TDMI, ARM10TDMI
[Figure: pipeline with prefetch unit (addresses, +4), inst. decode, register read, coproc data, multiplier, ALU/shifter (execute), memory (write data, address, read data, rot/sgn ex), register write, write pipeline and forwarding paths]
– Copy-back
– Double-bandwidth
– MMU
– Coprocessor
– Write buffer
[Figure: 8 Kbyte cache (double-bandwidth), ARM8 integer unit, CP15, write buffer and address buffer; signals PC, instructions, read data, write data, CPinst., CPdata, copy-back tag, copy-back data, physical address]
[Figure: pipeline organization with instruction decode, register read, execute and buffer/data stages, with ALU forwarding paths]
ARM9TDMI pipeline stages: instruction fetch | r. read, decode | shift/ALU | data memory access | reg write
There is not sufficient slack time to translate Thumb instructions into ARM instructions and
then decode them; instead, the hardware decodes both ARM and Thumb instructions
directly
addressing and memory protection
– Write buffer
[Figure: ARM9TDMI core with instruction MMU and data MMU, CP15, EmbeddedICE & JTAG, write buffer and AMBA interface; signals virtual IA, virtual DA, physical IA, physical DA, physical address tag, copy-back DA, AMBA address, AMBA data]
[Figure: ARM9TDMI core with EmbeddedICE & JTAG, write buffer and AMBA interface; signals I address, instructions, data address, data, AMBA address, AMBA data]
Main memory
[Figure: a single cache between the processor (registers) and memory, carrying address, instructions and data; the cache holds copies of instructions and copies of data, and memory holds instructions and data]
[Figure: separate instruction cache and data cache between the processor (registers) and memory; one holds copies of instructions, the other copies of data, each with its own address path]
The 8 Kbytes of data are held in 16-byte lines. There would therefore be 512 lines
A 32-bit address:
– 4 bits to address bytes within the line
– 9 bits to select the line
– 19-bit tag
[Figure: address split into tag (19 bits), index (9 bits) and line (4 bits); the index selects one of 512 lines in the tag RAM and data RAM, and a compare and mux produce hit and data]
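The tag/index/line split above can be worked through numerically. The sketch below derives the field widths from the cache and line sizes given in the example; the constant and function names are hypothetical.

```python
ADDRESS_BITS = 32
CACHE_BYTES = 8 * 1024   # 8 Kbytes of data
LINE_BYTES = 16          # 16-byte lines

N_LINES = CACHE_BYTES // LINE_BYTES                  # number of cache lines
OFFSET_BITS = (LINE_BYTES - 1).bit_length()          # bits to address bytes in a line
INDEX_BITS = (N_LINES - 1).bit_length()              # bits to select the line
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS   # remaining bits form the tag


def split_address(addr):
    """Split a 32-bit address into (tag, index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (N_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Two addresses that differ only in their tag bits (for example 0x00000010 and 0x00002010) select the same line, which is why the stored tag must be compared before a hit is signalled.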
[Figure: two tag RAM / data RAM pairs of 256 lines each, each with its own compare and mux, producing hit and data]
Fully associative cache
A CAM (Content Addressed Memory) cell is a RAM cell with an inbuilt comparator, so a CAM-based tag store can perform a parallel search to locate an address in any location
The address bits are compared with the stored tags
If they are equal, the item is in the cache
The lowest address bits can be used to access the desired item within the line.
[Figure: address applied to the tag CAM and data RAM, with a mux producing hit and data]
Example
The 8 Kbytes of data are held in 16-byte lines. There would therefore be 512 lines
A 32-bit address:
– 4 bits to address bytes within the line
– 28-bit tag
[Figure: address split into tag (28 bits) and line (4 bits), applied to the tag CAM and data RAM, with a mux producing hit and data]
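A fully associative lookup with the 28-bit tag above can be sketched in software. In this sketch a Python dict stands in for the parallel CAM search, and the oldest-first eviction in `fill` is purely an assumption, since the slides do not state a replacement policy; the class and method names are hypothetical.

```python
class FullyAssociativeCache:
    """Toy model: any line can live in any location, found by its tag."""

    def __init__(self, n_lines=512, line_bytes=16):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.offset_bits = (line_bytes - 1).bit_length()  # 4 bits for 16-byte lines
        self.lines = {}  # tag -> line data; dict lookup models the CAM search

    def split(self, addr):
        """Return (tag, byte offset); with 4 offset bits the tag is 28 bits."""
        return addr >> self.offset_bits, addr & (self.line_bytes - 1)

    def fill(self, addr, line_data):
        """Store a whole line for the address, evicting the oldest if full."""
        tag, _ = self.split(addr)
        if tag not in self.lines and len(self.lines) >= self.n_lines:
            # Assumed policy: evict the line that was inserted first.
            self.lines.pop(next(iter(self.lines)))
        self.lines[tag] = bytes(line_data)

    def read_byte(self, addr):
        """Return (hit, byte); the low address bits pick the byte in the line."""
        tag, offset = self.split(addr)
        line = self.lines.get(tag)
        if line is None:
            return False, None
        return True, line[offset]
```

A short usage example: after `cache.fill(0x12345670, bytes(range(16)))`, a read of 0x12345673 hits and returns byte 3 of the line, while a read of 0x12345680 misses because its tag differs.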
[Figure: development tools: C compiler, assembler, ARMsd, system model, ARMulator, development board]