Sie sind auf Seite 1von 37

ARM

Organization and Implementation


Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka
2
Outline
ARM Architecture
ARM Organization and Implementation
ARM Instruction Set
Architectural Support for High-level Languages
Thumb Instruction Set
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
Architectural Support for Operating Systems
ARM CPU Cores
Embedded ARM Applications
3
ARM organization
Register file
2 read ports, 1 write port +
1 read, 1 write port reserved for
r15 (pc)
Barrel shifter shift or rotate
one operand for any number of
bits
ALU performs the arithmetic
and logic functions required
Memory address register +
incrementer
Memory data registers
Instruction decoder and
associated control logic
multiply
data out register
instruction
decode
&
control
incrementer
register
bank
address register
barrel
shifter
A[31:0]
D[31:0]
data in register
ALU
control
P
C
PC
A
L
U

b
u
s
A

b
u
s
B

b
u
s
register
4
Three-stage pipeline
Fetch
the instruction is fetched from memory and placed in the
instruction pipeline
Decode
the instruction is decoded and the datapath control signals
prepared for the next cycle; in this stage the instruction
owns the decode logic but not the datapath
Execute
the instruction owns the datapath; the register bank is read,
an operand shifted, the ALU register generated and written
back into a destination register

5
ARM single-cycle instruction pipeline
f etch decode execute
time
1
f etch decode execute
f etch decode execute
2
3
instruction
6
ARM single-cycle instruction pipeline
add r0,r1,#5
sub r2,r3,r6
cmp r2,#3
fetch
time
decode
fetch
execute add
decode
fetch
execute sub
decode execute cmp
1 2 3
7
ARM multi-cycle instruction pipeline
f etch ADD decode execute
time
1
f etch STR decode calc. addr.
f etch ADD decode execute
2
3
data xfer
f etch ADD decode execute 4
5 f etch ADD decode execute
instruction
Decode logic is always generating
the control signals for the datapath
to use in the next cycle
8
ARM multi-cycle LDMIA (load multiple)
instruction
fetch decode ex ld r2
ldmia
r0,{r2,r3}
sub r2,r3,r6
cmp r2,#3
ex ld r3
fetch
time
decode ex sub
fetch decode ex cmp
Decode stage occupied
since ldmia must continue to
remember decoded instruction
sub fetched at normal time but
not decoded until LDMIA is finishing
Instruction delayed
9
Control stalls: due to branches
Branches often introduce stalls (branch penalty)
Stall time may depend on whether branch is taken
May have to squash instructions
that already started executing
Dont know what to fetch until condition is evaluated
10
ARM pipelined branch
time
fetch decode ex bne
bne foo
sub
r2,r3,r6
fetch decode
foo add
r0,r1,r2
ex bne
fetch decode ex add
ex bne
Decision not made until the third clock cycle
Two cycles of work thrown
away if bne takes place
11
Pipeline: how it works
All instructions occupy the datapath
for one or more adjacent cycles
For each cycle that an instruction occupies the datapath,
it occupies the decode logic in
the immediately preceding cycle
During the fist datapath cycle each instruction issues
a fetch for the next instruction but one
Branch instruction flush and refill the instruction pipeline
12
ARM9TDMI
5-stage pipeline
I-cache
rot/sgn ex
+ 4
byte repl.
ALU
I decode
register read
D-cache
fetch
instruction
decode
execute
buffer/
data
write-back
forwarding
paths
immediate
fields
next
pc
reg
shift
load/store
address
LDR pc
SUBS pc
post-
index
pre-index
LDM/
STM
register write
r15
pc + 8
pc + 4
+4
mux
shift
mul
B, BL
MOV pc
Fetch
Decode
instruction is decoded
register operands read
(3 read ports)
Execute
an operand is shifted and
the ALU result generated, or
address is computed
Buffer/data
data memory is accessed
(load, store)
Write-back
write to register file

13
ARM9TDMI
Data Forwarding
I-cache
rot/sgn ex
+ 4
byte repl.
ALU
I decode
register read
D-cache
fetch
instruction
decode
execute
buffer/
data
write-back
forwarding
paths
immediate
fields
next
pc
reg
shift
load/store
address
LDR pc
SUBS pc
post-
index
pre-index
LDM/
STM
register write
r15
pc + 8
pc + 4
+4
mux
shift
mul
B, BL
MOV pc
ADD r3, r2, r1, LSL #3
ADD r5, r5, r3, LSL r2
r3 := r2 + 8 x r1
r5 := r5 + 2
r2
x r3
ADD r3, r2, r1, LSL #3
ADD r8, r9, r10
ADD r5, r5, r3, LSL r2
r3 := r2 + 8 x r1
r8 := r9 + r10
r5 := r5 + 2
r2
x r3
LD r3, [r2]
ADD r1, r2, r3
r3 := mem[r2]
r1 := r2 + r3
Data Forwarding
Stall?
14
ARM9TDMI
PC generation
I-cache
rot/sgn ex
+ 4
byte repl.
ALU
I decode
register read
D-cache
fetch
instruction
decode
execute
buffer/
data
write-back
forwarding
paths
immediate
fields
next
pc
reg
shift
load/store
address
LDR pc
SUBS pc
post-
index
pre-index
LDM/
STM
register write
r15
pc + 8
pc + 4
+4
mux
shift
mul
B, BL
MOV pc
3-stage pipeline
PC behavior:
operands are read in execution
stage
r15 = PC + 8
5-stage pipeline
operands are read in decode
stage and r15 = PC + 4?
incompatibilities between 3-
stage and 5-stage
implementations =>
unacceptable
to avoid this 5-stage pipeline
ARMs emulate the behavior of
the older 3-stage designs

15
Data processing instruction
datapath activity
address register
increment
registers
Rd
Rn
PC
Rm
as ins.
as instruction
mult
data out data in i. pipe
(a) register register operations
address register
increment
registers
Rd
Rn
PC
as ins.
as instruction
mult
data out data in i. pipe
[7:0]
(b) register immediate operations
Reg-Reg
Rd = Rn op Rm
r15 = AR + 4
AR = AR + 4
Reg-Imm
Rd = Rn op Imm
r15 = AR + 4
AR = AR + 4

16
STR (store register) datapath activity
address register
increment
registers
Rn
PC
lsl #0
= A / A + B / A - B
mult
data out data in i. pipe
[11:0]
address register
increment
registers
Rn
Rd
shifter
= A + B / A - B
mult
PC
byte? data in i. pipe
(a) 1
st
cycle compute address (b) 2
nd
cycle store data & auto-index
Compute address
AR = Rn op Disp
r15 = AR + 4
Store data
AR = PC
mem[AR] =
Rd<x:y>
If autoindexing
=>
Rn = Rn +/- 4
17
The first two (of three) cycles of a
branch instruction
address register
increment
registers
PC
lsl #2
= A + B
mult
data out data in i. pipe
[23:0]
address register
increment
registers
R14
PC
shifter
= A
mult
data out data in i. pipe
(a) 1
st
cycle compute branch target
(b) 2
nd
cycle save return address
Third cycle: do a small
correction to the value
stored in the link register in
order that it points to
directly at the instruction
which follows the branch?
Compute target
address
AR = PC + Disp,lsl #2
Save return address
(if required)
r14 = PC
AR = AR + 4
18
ARM Implementation
Datapath
Control unit (FSM)
19
2-phase non-overlapping clock scheme
Most ARMs do not operate on edge-sensitive registers
Instead the design is based around
2-phase non-overlapping clocks which are generated
internally from a single clock signal
Data movement is controlled by passing the data
alternatively through latches which are open during phase
1 or latches during phase 2

1 clock cycle
phase 1
phase 2
20
ARM datapath timing
Register read
Register read buses dynamic, precharged during phase 2
During phase 1 selected registers discharge the read buses
which become valid early in phase 1
Shift operation
second operand passes through barrel shifter
ALU operation
ALU has input latches which are open in phase 1,
allowing the operands to begin combining in ALU
as soon as they are valid, but they close at the end of phase 1
so that the phase 2 precharge does not get through to the ALU
ALU processes the operands during the phase 2, producing the
valid output towards the end of the phase
the result is latched in the destination register
at the end of phase 2
21
ARM datapath timing (contd)


read bus valid
shif t out valid
ALU out
shif t time
ALU t ime
register
write time
register
read
time
ALU operands
latched
phase 1
phase 2
precharge
invalidates
buses
Minimum Datapath Delay =
Register read time +
Shifter Delay + ALU Delay +
Register write set-up time + Phase 2 to phase 1 non-overlap time
22
The original ARM1 ripple-carry adder
Carry logic: use CMOS AOI (And-Or-Invert) gate
Even bits use circuit show below
Odd bits use the dual circuit with inverted inputs and
outputs and AND and OR gates swapped around
Worst case path:
32 gates long

A
B
Cin
sum
Cout
23
ARM2 4-bit carry look-ahead scheme
Carry Generate (G)
Carry Propagate (P)
Cout[3] =Cin[0].P + G
Use AOI and
alternate AND/OR gates
Worst case:
8 gates long
A[3:0]
B[3:0]
Cin[0]
sum[3:0]
Cout[3]
4-bit
adder
logic
P
G
24
The ARM2 ALU logic for one result bit
ALU functions
data operations (add, sub, ...)
address computations for memory accesses
branch target computations
bit-wise logical
operations
...
ALU
bus
4 3 2 1 0 5
NB
bus
NA
bus
carry
l ogi c
fs:
G
P
25
ARM2 ALU function codes
f s 5 f s 4 f s 3 f s 2 f s 1 f s 0 ALU o ut put
0 0 0 1 0 0 A and B
0 0 1 0 0 0 A and not B
0 0 1 0 0 1 A xor B
0 1 1 0 0 1 A plus not B plus carry
0 1 0 1 1 0 A plus B plus carry
1 1 0 1 1 0 not A plus B plus carry
0 0 0 0 0 0 A
0 0 0 0 0 1 A or B
0 0 0 1 0 1 B
0 0 1 0 1 0 not B
0 0 1 1 0 0 zero
26
The ARM6 carry-select adder scheme
Compute sums of
various fields of
the word
for carry-in of
zero and carry-in
of one
Final result is
selected by using
the correct carry-
in value to control
a multiplexor
sum[31:16] sum[15:8] sum[7:4] sum[3:0]
s s+1
a,b[31:28] a,b[3:0]
+ +, +1
c
+, +1
mux
mux
mux
Worst case:
O(log
2
[word width]) gates long
Note: Be careful! Fan-out on some of these gates is
high so direct comparison with previous schemes is
not applicable.
27
The ARM6 ALU organization
Not easy to merge the arithmetic and logic functions =>
a separate logic unit runs in parallel with the adder,
and multiplexor selects the output
Z
N
V
C
logic/arithmetic
C in
f unction
invert A
invert B
result
result mux
logic f unctions
A operand latch B operand latch
XOR gates XOR gates
adder
zero detect
28
ARM9 carry arbitration encoding
Carry arbitration adder

A B C u v
0 0 0 0 0
0 1 unknown 1 0
1 0 unknown 1 0
1 1 1 1 1
29
The cross-bar switch barrel shifter
Shifter delay is critical since it contributes directly to the
datapath cycle time
Cross-bar switch matrix (32 x 32)
Principle for 4x4 matrix

in[0]
in[1]
in[2]
in[3]
out[0] out[1] out[2] out[3]
no shift right 1 right 2 right 3
left 1
left 2
left 3
30
The cross-bar switch barrel shifter
(contd)
Precharged logic is used =>
each switch is a single NMOS transistor
Precharging sets all outputs to logic 0, so those which are
not connected to any input during switching remain at 0
giving the zero filling required by the shift semantics
For rotate right, the right shift diagonal is enabled +
complementary shift left diagonal (e. g., right 1 + left 3)
Arithmetic shift right:
use sign-extension => separate logic is used to decode
the shift amount and discharge those outputs
appropriately



31
The 2-bit multiplication algorithm, Nth
cycle
Carry - i n Mul t i pl i er Shi f t ALU Carry - o ut
0 x 0 LSL #2N A + 0 0
x 1 LSL #2N A + B 0
x 2 LSL #(2N + 1) A B 1
x 3 LSL #2N A B 1
1 x 0 LSL #2N A + B 0
x 1 LSL #(2N + 1) A + B 0
x 2 LSL #2N A B 1
x 3 LSL #2N A + 0 1
32
Carry-propagate (a) and carry-save (b)
adder structures
+
A B Cin
Cout S
(a)
+
A B Cin
Cout S
+
A B Cin
Cout S
+
A B Cin
Cout S
+
A B Cin
Cout S
(b) +
A B Cin
Cout S
+
A B Cin
Cout S
+
A B Cin
Cout S
33
ARM high-speed multiplier organization
Rs >> 8 bits/cycle
carry-save adders
partial sum
partial carry
i ni ti al i zation for MLA
registers
Rm
ALU (add partials)
rotate sum and
carry 8 bi ts/cycl e
34
ARM2 register cell circuit
A bus
B bus
ALU bus
write
read
B
read
A
35
ARM register bank floorplan
A bus read decoders
B bus read decoders
write decoders
register cells
PC
Vdd
Vss
ALU
bus
PC
bus
INC
bus
ALU
bus
A bus
B bus
36
ARM core datapath buses
address register
incrementer
register bank
multiplier
ALU
shif ter
data in
instruction pipe
data out
A B
W
instruction
Din
shif t out
PC
Ad
inc
37
ARM control logic structure
decode
PLA
cycle
count
multiply
control
load/store
multiple
address
control
register
control
ALU
control
shif ter
control
instruction
coprocessor

Das könnte Ihnen auch gefallen