Introduction
GENERAL PURPOSE PROCESSOR
¨ General-Purpose Processor
¤ Processor designed for a variety of computation tasks
¤ Low unit cost, in part because the manufacturer spreads NRE (non-recurring engineering) cost over large numbers of units
▪ Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
¤ Carefully designed since higher NRE is acceptable
▪ Can yield good performance, size and power
¤ Low NRE cost for the embedded system designer, short time-to-market/prototype, high flexibility
▪ User just writes software; no processor design
8/21/2012
[Figure: instruction execution on a processor made of a control unit (controller, PC, IR) and a datapath (ALU, registers R0 and R1), connected to memory and I/O. Each instruction passes through the stages Fetch, Decode, Fetch ops, Exec., Store results.]
[Program in memory: 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1. Data: M[500] = 10; M[501] = 11 once the store has executed. The registers end with R0 = 10 and R1 = 11.]
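The three instructions at addresses 100 to 102 (load R0, M[500]; inc R1, R0; store M[501], R1) can be traced with a small simulator. This is only a sketch: the function `run`, the tuple encoding of instructions, and the reading of `inc R1, R0` as R1 = R0 + 1 are assumptions made for illustration, chosen to match the values 10 and 11 shown in the slides.

```python
def run(memory, pc, steps):
    """Execute `steps` instructions starting at address `pc`; return the registers."""
    regs = {"R0": 0, "R1": 0}
    for _ in range(steps):
        op, *args = memory[pc]      # fetch the instruction and decode it
        pc += 1                     # PC now points at the next instruction
        if op == "load":            # load Rd, M[addr]
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "inc":           # inc Rd, Rs (assumed meaning: Rd = Rs + 1)
            rd, rs = args
            regs[rd] = regs[rs] + 1
        elif op == "store":         # store M[addr], Rs
            addr, rs = args
            memory[addr] = regs[rs]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


# Program and data laid out in one memory, as in the slides.
memory = {
    100: ("load", "R0", 500),
    101: ("inc", "R1", "R0"),
    102: ("store", 501, "R1"),
    500: 10,
    501: 0,
}
regs = run(memory, pc=100, steps=3)
print(regs, memory[501])
```

After three steps the register file holds R0 = 10 and R1 = 11, and M[501] holds 11.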
¤ Embedded: 8-bit, 16-bit, 32-bit common
¤ Desktop/servers: 32-bit, even 64
… than the longest register-to-register delay in the entire processor
¤ Memory access is …
ARM
Introduction
¤ Embedded systems prefer slow and low cost memory
¤ Reduce area of the die taken by embedded processor
▪ Leave space for specialized processor
▪ Hardware debug capability
… and end of function
¨ Inline barrel shifting: leads to complex instructions
¤ Improved code density
¤ E.g. ADD r0,r1,r1, LSL #1
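The example ADD r0,r1,r1, LSL #1 shifts the second operand left by one bit before adding, so r0 = r1 + 2*r1 = 3*r1 in a single instruction. A minimal sketch of that behaviour follows; the helper names `lsl` and `add_with_shift` are hypothetical, and 32-bit wrap-around is assumed.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32 bits wide


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def add_with_shift(rn, rm, shift):
    """Model ADD rd, rn, rm, LSL #shift: rd = rn + (rm << shift)."""
    return (rn + lsl(rm, shift)) & MASK32


r1 = 7
r0 = add_with_shift(r1, r1, 1)  # ADD r0, r1, r1, LSL #1
print(r0)  # 21, i.e. 3 * r1
```

Without the inline shift, the same result would need a separate shift instruction followed by an add, which is the code-density gain the slide refers to.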
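[Figure: ARM program status register fields: N Z C V (condition flags), unused, I F (interrupt disable), T (Thumb state), mode]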
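[Figure: pipeline timing over clock cycles 1 to 7. One panel shows a Load with a complex addressing mode: F, D, X+[R1], [X+[R1]], [[X+[R1]]], W, with the result forwarded to the next instruction (F, D, E, W). The other shows simple instructions: Add (F, D, X+[R1], W), then Load (F, D, [X+[R1]], W), then the next instruction (F, D, E, W).]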
[Figure: branch penalty. Pipeline timing in which I1 completes (F1 D1 E1 W1), I2 (Branch) runs F2 D2 E2, the instructions fetched after it (I3: F3 D3 X; I4: F4 X) are discarded, and execution resumes at the branch target Ik (Fk Dk Ek Wk) followed by Ik+1.]
[Figure: reducing the penalty. Over clock cycles 1 to 7, I2 (Branch) runs only F2 D2, so a single instruction (I3: F3 X) is discarded before Ik and Ik+1 proceed.]
[Figure: branch timing with an instruction queue. I1, I3 and I4 pass through F, D, E, W; I5 (Branch) runs F5 D5; I6 (F6 X) is discarded; then Ik (Fk Dk Ek Wk) and Ik+1 follow. Unit labels: D: Dispatch/Decode unit, E: Execute instruction, W: Write results.]
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
Figure 8.13. Execution timing showing the delay slot being filled
during the last two passes through the loop in Figure 8.12.
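In the branch timing diagrams above, each instruction marked X was fetched after a branch and then discarded, so every taken branch costs one lost cycle per discarded instruction. A rough way to estimate the effect is sketched below; the function `total_cycles` and its parameters are hypothetical, and the model assumes one instruction completes per cycle with no other hazards.

```python
def total_cycles(n_instructions, n_stages, taken_branches, penalty):
    """Estimate cycles on an ideal pipeline plus a fixed penalty per taken branch.

    penalty is the number of discarded instructions (X) per taken branch.
    """
    if n_instructions == 0:
        return 0
    fill = n_stages - 1  # cycles needed before the first instruction completes
    return fill + n_instructions + taken_branches * penalty


# 100 instructions, 4-stage pipeline, 20 taken branches
print(total_cycles(100, 4, 20, penalty=2))  # 143: two instructions discarded per branch
print(total_cycles(100, 4, 20, penalty=1))  # 123: branch resolved one stage earlier
print(total_cycles(100, 4, 20, penalty=0))  # 103: e.g. every delay slot usefully filled
```

The numbers are illustrative only; they show why resolving the branch earlier, or filling the delay slot, shortens total execution time.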
[Figure: timing over clock cycles 1 to 7 for four instructions. I1 (Fadd): F1 D1 E1A E1B E1C W1; I2 (Add): F2 D2 E2 W2; I3 (Fsub): F3 D3 E3 E3 E3 W3; I4 (Sub): F4 D4 E4 W4.]
Figure 8.20. An example of instruction execution flow in the processor of Figure 8.19,
assuming no hazards are encountered.
ALU
¨ Logic operation
¤ OR, AND, XOR, NOT, NAND, NOR etc.
¤ No dependencies among bits: each result bit can be calculated in parallel
¨ Arithmetic operation
¤ ADD, SUB, INC, DEC, MUL, DIVIDE
¤ Involve long carry propagation chain
▪ Major source of delay
▪ Require optimization
¤ Suitability of algorithm based on
▪ Resource usage: physical space on silicon die
▪ Turnaround time
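The contrast between logic and arithmetic operations in the ALU can be made concrete. In the sketch below, a bitwise AND needs nothing from neighbouring bits, while a ripple-carry adder (one simple adder organization, used here as an assumption for illustration) must pass the carry from each bit position to the next. The function name `ripple_carry_add` is hypothetical.

```python
def ripple_carry_add(a, b, width=8, carry_in=0):
    """Add two width-bit values one bit at a time; return (sum, carry_out)."""
    carry = carry_in
    result = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                    # sum bit needs the incoming carry
        carry = (x & y) | (carry & (x ^ y))  # carry out feeds the next bit
        result |= s << i
    return result, carry


# Logic operation: every result bit depends only on the same bit of each input.
print(bin(0b1100 & 0b1010))  # 0b1000

# Arithmetic operation: the carry from bit 0 ripples through all eight bits.
print(ripple_carry_add(0b11111111, 0b00000001))  # (0, 1)
```

In the second call the final carry cannot be known until every lower bit has been processed, which is the long carry propagation chain the slide names as the major source of delay.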
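[Figure: 4-bit adder block with inputs A[3:0], B[3:0] and Cin[0] and outputs sum[3:0] and Cout[3], built from per-bit adder logic with inputs A, B, Cin and outputs sum, G and P.]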
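The G and P outputs of the 4-bit adder are the usual generate and propagate signals of a carry look-ahead scheme. The sketch below assumes the common definitions G = A AND B and P = A XOR B, which the slides do not spell out, and writes every carry directly in terms of G, P and the carry input, so that no carry has to wait for the previous one. The function name `cla4` is hypothetical.

```python
def cla4(a, b, cin=0):
    """4-bit carry look-ahead add; returns (sum[3:0], cout)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]  # generate: both bits are 1
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate: exactly one bit is 1

    # Each carry is a flat expression of g, p and cin.
    c0 = cin
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    cout = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, cout


print(cla4(0b1011, 0b0110))  # (1, 1): 11 + 6 = 17 = 16 + 1
```

The price of the shorter carry path is wider gates, which is the trade-off between delay and silicon area.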
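[Figure: barrel shifter switch matrix with inputs in[0], in[1], in[2] and outputs out[0], out[1], out[2], out[3]; diagonals labelled left 2 and left 3.]
[Figure: ALU organization: function select, logic functions, adder with C in, V output, logic/arithmetic result mux, N output, zero detect producing Z, result.]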
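The ALU organization above, with logic functions and an adder feeding a result mux and with N, Z and V derived from the result, can be approximated in a few lines. This is a simplified sketch, not the actual datapath: the operation names, the function `alu`, and the choice to report C = V = 0 for logic operations are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def alu(op, a, b, carry_in=0):
    """Return (result, flags) for 32-bit operands; flags holds N, Z, C, V."""
    a &= MASK32
    b &= MASK32
    carry = overflow = 0
    if op == "and":            # logic functions: no carry chain involved
        result = a & b
    elif op == "orr":
        result = a | b
    elif op == "eor":
        result = a ^ b
    elif op == "add":          # adder path
        full = a + b + carry_in
        result = full & MASK32
        carry = full >> 32     # carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        overflow = ((a ^ result) & (b ^ result)) >> 31
    else:
        raise ValueError(f"unknown operation: {op}")
    flags = {
        "N": result >> 31,         # sign bit of the result
        "Z": int(result == 0),     # zero detect
        "C": carry,
        "V": overflow,
    }
    return result, flags


print(alu("add", 0x7FFFFFFF, 1))  # result 0x80000000 with N = 1 and V = 1
print(alu("add", 0xFFFFFFFF, 1))  # result 0 with Z = 1 and C = 1
```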
[Figure: multiplier organization over time: Rs shifted right 8 bits per cycle, Rm, partial sum and partial carry.]
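The multiplier figure keeps the running product as a partial sum and a partial carry and consumes Rs eight bits per cycle. The sketch below models that idea with a 3:2 carry-save adder, so no carry is propagated until one final addition. It is a simplified illustration under stated assumptions: real hardware may recode the multiplier bits differently, and the names `csa` and `multiply`, the early exit when the remaining bits of Rs are zero, and the cycle count are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def csa(x, y, z):
    """3:2 carry-save adder: three inputs in, (partial sum, partial carry) out."""
    partial_sum = x ^ y ^ z
    partial_carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum & MASK32, partial_carry & MASK32


def multiply(rm, rs):
    """Low 32 bits of rm * rs, consuming 8 bits of rs per modelled cycle."""
    psum, pcarry = 0, 0
    shift = 0
    cycles = 0
    while rs:
        for bit in range(8):  # the 8 multiplier bits handled in this cycle
            if (rs >> bit) & 1:
                addend = (rm << (shift + bit)) & MASK32
                psum, pcarry = csa(psum, pcarry, addend)
        rs >>= 8              # Rs shifted right 8 bits per cycle
        shift += 8
        cycles += 1
    # One carry-propagate addition at the end merges sum and carry.
    return (psum + pcarry) & MASK32, cycles


print(multiply(1000, 0x1234))  # (4660000, 2)
```

Keeping the intermediate result in sum-and-carry form avoids a full carry chain on every step; only the last addition pays that delay.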