

GENERAL PURPOSE PROCESSOR

Introduction
¨ General-Purpose Processor
¤ Processor designed for a variety of computation tasks
¤ Low unit cost, in part because the manufacturer spreads NRE over large numbers of units
n Motorola sold half a billion 68HC05 microcontrollers in 1996 alone
¤ Carefully designed, since higher NRE is acceptable
n Can yield good performance, size and power
¤ Low NRE cost for the embedded system designer, short time-to-market/prototype, high flexibility
n User just writes software; no processor design

Basic Architecture
¨ Control unit and datapath
¤ Similar to single-purpose processor
¨ Key differences
¤ Datapath is general
¤ Control unit doesn't store the algorithm – the algorithm is "programmed" into the memory

[Figure: processor block diagram – control unit (controller, PC, IR) plus datapath (ALU, registers), connected to memory and I/O]

Datapath Operations
¨ Load
¤ Read a memory location into a register
¨ ALU operation
¤ Input certain registers through the ALU, store the result back in a register
¨ Store
¤ Write a register to a memory location

[Figure: the same block diagram, showing 10 loaded from memory into a register, an ALU "+1" producing 11, and the result stored back to memory]

Control Unit
¨ Control unit: configures the datapath operations
¤ Sequence of desired operations ("instructions") stored in memory – the "program"
¨ Instruction cycle – broken into several sub-operations, each taking one clock cycle, e.g.:
¤ Fetch: Get the next instruction into the IR
¤ Decode: Determine what the instruction means
¤ Fetch operands: Move data from memory to a datapath register
¤ Execute: Move data through the ALU
¤ Store results: Write data from a register to memory

Instruction Cycles
[Figure: fetch/decode/execute/store timeline for PC=100, with the program 100: load R0, M[500]; 101: inc R1, R0; 102: store M[501], R1 and memory location 500 holding 10]
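A rough Python sketch of this cycle (the instruction encoding and memory model are invented for the illustration, not ARM's), stepping through the same three-instruction program one fetch/decode/execute cycle at a time:

# Minimal sketch of the fetch/decode/execute/store cycle for the example program.
imem = {100: ("load", 0, 500),   # R0 <- M[500]
        101: ("inc", 1, 0),      # R1 <- R0 + 1
        102: ("store", 501, 1)}  # M[501] <- R1
dmem = {500: 10, 501: 0}
reg = [0, 0]                     # R0, R1
pc = 100
while pc in imem:
    ir = imem[pc]                # fetch: next instruction into IR, PC advances
    pc += 1
    op = ir[0]                   # decode: determine what the instruction means
    if op == "load":             # execute / store results
        reg[ir[1]] = dmem[ir[2]]
    elif op == "inc":
        reg[ir[1]] = reg[ir[2]] + 1
    elif op == "store":
        dmem[ir[1]] = reg[ir[2]]
print(reg, dmem[501])            # expect [10, 11] 11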


Instruction Cycles (continued)
[Figure: the remaining cycles of the same example – at PC=101, inc R1, R0 executes and R1 becomes 11; at PC=102, store M[501], R1 writes 11 to memory location 501]

Architectural Considerations
¨ N-bit processor
¤ N-bit ALU, registers, buses, memory data interface
¤ Embedded: 8-bit, 16-bit, 32-bit common
¤ Desktop/servers: 32-bit, even 64-bit
¨ PC size determines the address space
¨ Clock frequency
¤ Inverse of the clock period
¤ The clock period must be longer than the longest register-to-register delay in the entire processor
¤ Memory access is often the longest such delay
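A minimal numeric sketch of the last two points, with made-up stage delays (the names and values are assumptions, not measurements of any real core): the clock period is set by the longest register-to-register delay, and the maximum frequency is its inverse.

# Hypothetical register-to-register delays in nanoseconds.
delays_ns = {"register_read": 1.5, "alu": 4.0, "memory_access": 6.0}
t_clk_ns = max(delays_ns.values())     # memory access limits the clock here
f_max_mhz = 1000.0 / t_clk_ns          # frequency = 1 / period
print(t_clk_ns, f_max_mhz)             # 6.0 ns -> ~166.7 MHz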

ARM
Introduction

ARM RISC Design Philosophy
¨ Smaller die size
¨ Shorter development time
¨ Higher performance
¤ Insects flap wings faster than small birds
¤ A complex instruction can make some high-level function more efficient, but it slows down the clock for all instructions


ARM Design Philosophy
¨ Reduce power consumption and extend battery life
¨ High code density
¨ Low price
¤ Embedded systems prefer slow and low-cost memory
¤ Reduce the area of the die taken by the embedded processor
n Leave space for a specialized processor
n Hardware debug capability
¨ ARM is not a pure RISC architecture
¤ Designed primarily for embedded systems

Instruction set for embedded systems
¨ Variable-cycle execution for certain instructions
¤ Multi-register load-store instructions
¤ Faster if memory access is sequential
¤ Higher code density – a common operation at the start and end of a function
¨ Inline barrel shifting – leads to more complex instructions
¤ Improved code density
¤ E.g. ADD r0, r1, r1, LSL #1

ARM-Based Embedded Device
[Figure: typical ARM-based embedded device block diagram (no text captured)]

Instruction set for embedded systems (continued)
¨ Thumb 16-bit instruction set
n Code can execute both 16-bit and 32-bit instructions
¨ Conditional execution
n Improved code density
n Reduces branch instructions, e.g.:
n CMP r1, r2
n SUBGT r1, r1, r2
n SUBLT r2, r2, r1
¨ Enhanced instructions – DSP instructions
n Use one processor instead of the traditional combination of two
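To see why conditional execution saves branches, here is a small Python sketch of what the CMP/SUBGT/SUBLT sequence above does (it mirrors the flag-based behaviour informally; it is not an ARM emulator). Repeating the step computes a GCD with no branch inside the step itself:

def gcd_step(r1, r2):
    gt, lt = r1 > r2, r1 < r2   # CMP r1, r2 sets the condition flags
    if gt:                      # SUBGT r1, r1, r2: only takes effect when r1 > r2
        r1 -= r2
    if lt:                      # SUBLT r2, r2, r1: only takes effect when r1 < r2
        r2 -= r1
    return r1, r2

r1, r2 = 12, 8
while r1 != r2:
    r1, r2 = gcd_step(r1, r2)
print(r1)                       # 4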

Peripherals
¨ All ARM peripherals are memory mapped
¨ Interrupt controllers
¤ Standard interrupt controller
n Sends an interrupt signal to the processor core
n Can be programmed to ignore or mask an individual device or set of devices
n The interrupt handler reads a device bitmap register to determine which device requires servicing
¤ VIC – vectored interrupt controller
n Assigns a priority and an ISR handler to each device
n Depending on the type, it either calls the standard interrupt handler or jumps to the specific device handler directly

ARM Datapath
¨ Registers
¤ R0–R15 general-purpose registers
¤ R13 – stack pointer
¤ R14 – link register
¤ R15 – program counter
¤ R0–R13 are orthogonal
¤ Two program status registers
n CPSR
n SPSR


ARM's Visible Registers
[Figure: register file – r0–r7 usable in user mode; banked copies r8_fiq–r14_fiq, r13_svc/r14_svc, r13_abt/r14_abt, r13_irq/r14_irq, r13_und/r14_und available only in the corresponding system modes; r15 (PC); CPSR plus SPSR_fiq, SPSR_svc, SPSR_abt, SPSR_irq, SPSR_und]

Banked Registers
¨ 37 registers in total
¤ 20 are hidden from the program at any one time
¤ Also called banked registers
¤ Available only when the processor is in a particular mode
¤ The mode can be changed by the program or on an exception
n Reset, interrupt request, fast interrupt request, software interrupt, data abort, prefetch abort and undefined instruction
¤ No SPSR access in user mode

CPSR
¨ Condition flags – N, Z, C, V
¨ Interrupt masks – I, F
¨ Thumb state – T; Jazelle – J
¨ Mode bits 4–0 – processor mode
¤ Six privileged modes
n Abort – entered after a failed attempt to access memory
n Fast interrupt request (FIQ)
n Interrupt request (IRQ)
n Supervisor – entered after reset; the kernel works in this mode
n System – a special version of user mode with full read/write access to the CPSR
n Undefined – entered when an undefined or unsupported instruction is executed
¤ User mode

CPSR layout (bit positions):
31–28: N Z C V | 27–8: unused | 7–6: I F | 5: T | 4–0: mode

Instruction Execution
[Figure: no text captured]
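A small sketch of pulling those fields out of a 32-bit CPSR value, following the bit layout above (the mode numbers used here are the standard ARM mode encodings; treat the snippet as illustrative, not as a definition of the format):

MODES = {0x10: "User", 0x11: "FIQ", 0x12: "IRQ", 0x13: "Supervisor",
         0x17: "Abort", 0x1B: "Undefined", 0x1F: "System"}

def decode_cpsr(cpsr):
    return {"N": (cpsr >> 31) & 1, "Z": (cpsr >> 30) & 1,
            "C": (cpsr >> 29) & 1, "V": (cpsr >> 28) & 1,
            "I": (cpsr >> 7) & 1,  "F": (cpsr >> 6) & 1,   # interrupt masks
            "T": (cpsr >> 5) & 1,                          # Thumb state
            "mode": MODES.get(cpsr & 0x1F, "unknown")}

print(decode_cpsr(0x60000013))   # Z and C set, Supervisor mode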

3-Stage Pipeline ARM Organization
¨ Fetch
¤ The instruction is fetched from memory and placed in the instruction pipeline
¨ Decode
¤ The instruction is decoded and the datapath control signals are prepared for the next cycle. In this stage the instruction 'owns' the decode logic but not the datapath
¨ Execute
¤ The instruction 'owns' the datapath: the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register

ARM7 Core Diagram
[Figure: no text captured]


3-Stage Pipeline – Single Cycle / Multi-Cycle
[Figure: pipeline diagrams for single-cycle and multi-cycle instructions (no text captured)]

PC Behavior
¨ R15 is incremented twice before an instruction executes
¤ due to pipeline operation
¨ R15 = current instruction address + 8
¤ The offset is +4 for Thumb instructions

To Get Higher Performance
¨ Tprog = (Ninst × CPI) / fclk
¨ Ninst – the number of instructions executed for the program – is constant
¨ Increase the clock rate
¤ The clock rate is limited by the slowest pipeline stage
n Decrease the logic complexity per stage
n Increase the pipeline depth
¨ Improve the CPI
¤ Instructions that take more than one cycle are re-implemented to occupy fewer cycles
¤ Pipeline stalls are reduced
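A quick numeric check of Tprog = (Ninst × CPI) / fclk with made-up numbers, showing that raising the clock rate or lowering the CPI both shorten program time:

def t_prog(n_inst, cpi, f_clk_hz):
    return n_inst * cpi / f_clk_hz

print(t_prog(1_000_000, 1.9, 50e6))    # baseline: 0.038 s
print(t_prog(1_000_000, 1.9, 100e6))   # double the clock rate: 0.019 s
print(t_prog(1_000_000, 1.3, 50e6))    # improve the CPI: 0.026 s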

Typical Dynamic Instruction Usage
Statistics for a print-preview program in an ARM instruction emulator:

Instruction Type        Dynamic Usage
Data movement           43%
Control flow            23%
Arithmetic operation    15%
Comparisons             13%
Logical operation        5%
Other                    1%

Memory Bottleneck
¨ Von Neumann bottleneck
¤ Single instruction and data memory
¤ Limited by the available memory bandwidth
¤ A 3-stage ARM core accesses memory on (almost) every clock cycle
¨ Harvard architecture is used in higher-performance ARM cores


The 5-Stage Pipeline
¨ Fetch
¤ The instruction is fetched and placed in the instruction pipeline
¨ Decode
¤ The instruction is decoded and the register operands are read from the register file
¨ Execute
¤ An operand is shifted and the ALU result is generated. For load and store, the memory address is computed
¨ Buffer/Data
¤ Data memory is accessed if required; otherwise the ALU result is simply buffered
¨ Write Back
¤ The results are written back to the register file

Data Forwarding
¨ Read-after-write pipeline hazard
¤ An instruction needs to use the result of one of its predecessors before that result has returned to the register file
n e.g. ADD r1, r2, r3
n      ADD r4, r5, r1
¤ Data forwarding is used to eliminate the stall
¤ In the following case, even with forwarding, a pipeline stall cannot be avoided:
n LDR rN, [..]    ; load rN from somewhere
n ADD r2, r1, rN  ; and use it immediately
¤ The processor cannot avoid a one-cycle stall
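The stall condition can be stated as a one-line check. In this toy sketch (the instruction tuples and register names are invented for the example), a result produced by an ALU instruction is assumed to be forwarded, so only a load followed immediately by a consumer of its destination still stalls:

def needs_stall(producer, consumer):
    op, dest, _, _ = producer
    _, _, src1, src2 = consumer
    return op == "LDR" and dest in (src1, src2)

print(needs_stall(("ADD", "r1", "r2", "r3"), ("ADD", "r4", "r5", "r1")))   # False: forwarded
print(needs_stall(("LDR", "rN", "mem", None), ("ADD", "r2", "r1", "rN")))  # True: one-cycle stall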

Data Hazards
¨ Handling data hazards in software
¤ Solution: encourage the compiler not to place a dependent instruction immediately after a load instruction
¨ Side effects
¤ When a location other than the one explicitly named in an instruction as the destination operand is affected
¨ Addressing modes
¤ Complex addressing modes do not necessarily lead to faster execution
¤ Register, register indirect, and index modes
¤ E.g. Load (X(R1)), R2 can be replaced by the sequence Add #X, R1, R2; Load (R2), R2; Load (R2), R2

Data Hazards (continued)
¨ Complex addressing modes
¤ Require more complex hardware to decode and execute
¤ Cause the pipeline to stall
¨ Features that suit pipelining
¤ Access to an operand does not require more than one access to memory
¤ Only load and store instructions access memory
¤ The addressing modes used do not have side effects
¨ Condition codes
¤ Flags are modified by as few instructions as possible
¤ The compiler should be able to specify in which instructions of the program they are affected and in which they are not

Complex vs. Simple Addressing Mode
[Figure: pipeline timing. (a) Complex addressing mode: Load (X(R1)), R2 occupies the execute stage for three cycles (X+[R1], [X+[R1]], [[X+[R1]]]) before writing back. (b) Simple addressing mode: the equivalent sequence Add #X, R1, R2; Load (R2), R2; Load (R2), R2 flows through the pipeline one stage per cycle, with forwarding between the instructions.]


ARM 5-Stage Pipeline
[Figure: no text captured]

Instruction Hazards – Overview
¨ Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls
¨ Cache miss
¨ Branch

Unconditional Branches / Branch Timing
[Figure 8.8: an idle cycle caused by a branch instruction – the instruction fetched after the branch is discarded, leaving the execution unit idle for one cycle (the branch penalty).]
[Figure 8.9: branch timing. (a) When the branch target address is computed in the Execute stage, two already-fetched instructions (I3, I4) must be discarded. (b) When the target address is computed in the Decode stage, only one instruction is discarded, reducing the penalty.]

Instruction Queue and Prefetching / Branch Timing with an Instruction Queue
[Figure 8.10: use of an instruction queue – the fetch unit places instructions in a queue from which the dispatch/decode unit issues them to the execute and write stages.]
[Figure 8.11: branch timing in the presence of an instruction queue. While I1 stalls in the execute stage the fetch unit keeps filling the queue (queue length grows from 1 to 3); the branch target address is computed in the D stage, so the branch I5 adds no extra cycle – branch folding.]


Branch Folding
¨ Branch folding – executing the branch instruction concurrently with the execution of other instructions
¨ Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue
¨ It is therefore desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing
¨ This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache
¨ Having an instruction queue is also beneficial in dealing with cache misses

Conditional Branches
¨ A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction
¨ The decision to branch cannot be made until the execution of that instruction has been completed
¨ Branch instructions represent about 20% of the dynamic instruction count of most programs

Delayed Branch
¨ The instructions in the delay slots are always fetched. Therefore, we would like to arrange for them to be fully executed whether or not the branch is taken
¨ The objective is to place useful instructions in these slots
¨ The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions

(a) Original program loop:
LOOP  Shift_left R1
      Decrement  R2
      Branch=0   LOOP
NEXT  Add        R1, R3

(b) Reordered instructions:
LOOP  Decrement  R2
      Branch=0   LOOP
      Shift_left R1
NEXT  Add        R1, R3

Figure 8.12. Reordering of instructions for a delayed branch.

Delayed Branch (continued)
[Figure 8.13: execution timing showing the delay slot being filled during the last two passes through the loop of Figure 8.12 – the Shift instruction in the delay slot executes whether or not the branch is taken.]

Branch Prediction
¨ Predict whether or not a particular branch will be taken
¨ Simplest form: assume the branch will not take place and continue to fetch instructions in sequential address order
¨ Until the branch is evaluated, instruction execution along the predicted path must be done on a speculative basis
¨ Speculative execution: instructions are executed before the processor is certain that they are in the correct execution sequence
¨ Care is needed so that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed


Incorrectly Predicted Branch
[Figure 8.14: timing when a branch decision has been incorrectly predicted as not taken – the speculatively fetched instructions I3 and I4 are discarded once the branch I2 resolves in its execute stage.]

Branch Prediction (continued)
¨ Better performance can be achieved if some branch instructions are predicted as taken and others as not taken
¨ Use hardware to observe whether the target address is lower or higher than that of the branch instruction
¨ Alternatively, let the compiler include a branch prediction bit
¨ So far the prediction decision is the same every time a given instruction is executed – static branch prediction
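The second bullet amounts to the classic "backward taken, forward not taken" rule; a one-function sketch (the addresses are arbitrary example values):

def predict_taken(branch_addr, target_addr):
    # Backward branches (target below the branch address) are usually loops: predict taken.
    return target_addr < branch_addr

print(predict_taken(0x120, 0x100))   # True: backward branch, likely a loop
print(predict_taken(0x120, 0x140))   # False: forward branch, predict not taken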

Superscalar Operation
¨ Maximum throughput of a single pipeline: one instruction per clock cycle
¨ Multiple processing units
¤ More than one instruction per cycle

Superscalar
[Figure 8.19: a processor with two execution units – an instruction fetch unit feeding an instruction queue, a dispatch unit that issues to a floating-point unit and an integer unit, and a write-results stage.]

Timing
[Figure 8.20: an example of instruction execution flow in the processor of Figure 8.19, assuming no hazards – the floating-point Fadd and Fsub take three execute cycles each while the integer Add and Sub complete in one, so instructions finish out of order.]

ALU
¨ Logic operations
¤ OR, AND, XOR, NOT, NAND, NOR, etc.
¤ No dependencies among bits – each result bit can be calculated in parallel
¨ Arithmetic operations
¤ ADD, SUB, INC, DEC, MUL, DIVIDE
¤ Involve a long carry-propagation chain
n The major source of delay
n Requires optimization
¤ Suitability of an algorithm is judged by
n Resource usage – physical space on the silicon die
n Turnaround time


The Original ARM1 Ripple-Carry Adder Circuit
[Figure: one bit of the ripple-carry adder – inputs A, B and Cin produce sum and Cout, which ripples to the next bit.]

The ARM2 4-Bit Carry Look-Ahead Scheme
[Figure: A[3:0] and B[3:0] feed G (generate) and P (propagate) logic plus a 4-bit adder; the block takes Cin[0] and produces sum[3:0] and Cout[3], so the carry skips across four bits at a time.]
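In the spirit of the G/P block above, a rough 4-bit carry-lookahead sketch in Python (the bit handling and naming are mine, not the ARM2 circuit): generate g = a·b, propagate p = a⊕b, and each carry comes from g, p and the carry-in rather than rippling bit by bit.

def cla4(a, b, cin=0):
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]   # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]   # propagate
    c = [cin]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))   # c[i+1] = g[i] + p[i]*c[i]; hardware expands this in parallel
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]                        # 4-bit sum and carry-out

print(cla4(0b1011, 0b0110))               # (1, 1): 11 + 6 = 17 = 0b10001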

The ARM2 ALU Logic for One Result Bit
[Figure: per-bit ALU cell built from the NB, G, P and NA signals, the carry logic and the function-select inputs.]

ARM2 ALU Function Codes

fs5 fs4 fs3 fs2 fs1 fs0   ALU output
 0   0   0   1   0   0    A and B
 0   0   1   0   0   0    A and not B
 0   0   1   0   0   1    A xor B
 0   1   1   0   0   1    A plus not B plus carry
 0   1   0   1   1   0    A plus B plus carry
 1   1   0   1   1   0    not A plus B plus carry
 0   0   0   0   0   0    A
 0   0   0   0   0   1    A or B
 0   0   0   1   0   1    B
 0   0   1   0   1   0    not B
 0   0   1   1   0   0    zero

The ARM6 Carry-Select Adder Scheme
[Figure: the 32-bit operands are split into 4-bit groups (a,b[3:0] … a,b[31:28]); each upper group is added twice, once assuming carry-in 0 and once assuming carry-in 1 (+, +1), and multiplexers select sum[3:0], sum[7:4], sum[15:8] and sum[31:16] once the true carries are known.]

Conditional Sum Adder
¨ An extension of the carry-select adder
¨ Carry-select adder
¤ One level using k/2-bit adders
¤ Two levels using k/4-bit adders
¤ Three levels using k/8-bit adders
¤ Etc.
¨ Assuming k is a power of two, we eventually reach the extreme of log2 k levels using 1-bit adders
¤ This is the conditional sum adder
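A minimal sketch of one carry-select level, assuming simple unsigned integer blocks: the upper half is added twice, once per possible carry-in, and a "mux" picks the correct result when the lower half's carry-out is known. Recursing the same idea down to 1-bit blocks gives the conditional sum adder described above.

def add_block(a, b, cin, width):
    s = a + b + cin
    return s & ((1 << width) - 1), s >> width        # (sum, carry-out)

def carry_select_add(a, b, width=8):
    half = width // 2
    lo_sum, lo_carry = add_block(a & ((1 << half) - 1), b & ((1 << half) - 1), 0, half)
    hi0 = add_block(a >> half, b >> half, 0, half)   # speculative: carry-in = 0
    hi1 = add_block(a >> half, b >> half, 1, half)   # speculative: carry-in = 1
    hi_sum, carry_out = hi1 if lo_carry else hi0     # the select multiplexer
    return (hi_sum << half) | lo_sum, carry_out

print(carry_select_add(0xAB, 0x77))                  # (34, 1), i.e. 0x22 with carry-out 1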


Conditional Sum – Example / Conditional Sum Adder: Top-Level Block for One Bit Position
[Figures: worked conditional-sum example and the top-level block for one bit position (no text captured)]

The ARM6 ALU Organization
[Figure: A and B operand latches feed invert-A/invert-B XOR gates, the adder and the logic-function block; a logic/arithmetic result multiplexer produces the result, with C, V and N flags from the adder and a zero-detect block producing Z.]

The Cross-Bar Switch Barrel Shifter Principle
[Figure: a 4x4 cross-bar of switches between in[3:0] and out[3:0]; turning on one diagonal gives "no shift", "right 1/2/3" or "left 1/2/3" in a single pass.]

Shift Implementation
¨ For a left or right shift, one diagonal of the cross-bar is turned on
¤ The shifter operates in negative logic
¤ Precharging sets all outputs to logic '0'
¨ For a rotate right, the right-shift diagonal is enabled together with the complementary left-shift diagonal
¨ An arithmetic shift uses sign extension rather than '0' fill

Multiplier
¨ ARM includes hardware support for integer multiplication
¨ Older ARM cores include low-cost multiplication hardware
¤ Supports 32-bit-result multiply and multiply-accumulate
¤ Uses the main datapath iteratively
n The barrel shifter and ALU generate a 2-bit product in each cycle
n A modified Booth's algorithm is employed to produce the 2-bit product
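A plain-Python sketch of the shift behaviours named above for a 32-bit word (this shows the results, not the precharged cross-bar implementation): logical shifts fill with 0, rotate right wraps the shifted-out bits around, and arithmetic shift right sign-extends instead of filling with 0.

MASK = 0xFFFFFFFF

def ror32(x, n):                      # rotate right
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK

def asr32(x, n):                      # arithmetic shift right: sign extension
    sign = -((x >> 31) & 1)           # 0 or -1
    return ((x | (sign << 32)) >> n) & MASK

print(hex(ror32(0x80000001, 1)))      # 0xc0000000
print(hex(asr32(0x80000000, 4)))      # 0xf8000000
print(hex((0x80000000 >> 4) & MASK))  # logical shift for comparison: 0x8000000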


Multiplier
¨ Radix-2 multiplication
¨ Radix-4 multiplication
¨ Radix-2 Booth algorithm
¨ Radix-4 Booth algorithm

Modified Booth's Recoding

x(i+1) x(i) x(i-1)   y(i+1) y(i)   z(i/2)   Explanation
  0     0     0        0     0       0      No string of 1s in sight
  0     0     1        0     1       1      End of string of 1s
  0     1     0        0     1       1      Isolated 1
  0     1     1        1     0       2      End of string of 1s
  1     0     0       -1     0      -2      Beginning of string of 1s
  1     0     1       -1     1      -1      End a string, begin a new one
  1     1     0        0    -1      -1      Beginning of string of 1s
  1     1     1        0     0       0      Continuation of string of 1s

Example:
Operand x:                 1  0  0  1  1  1  0  1  1  0  1  0  1  1  1  0
Recoded radix-2 digits y: (1) -1  0  1  0  0 -1  1  0 -1  1 -1  1  0  0 -1  0
Radix-4 digits z:             (1)   -2    2   -1    2   -1   -1    0   -2
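A sketch of the radix-4 recoding rule in the table above, assuming a two's-complement multiplier of even bit width: scan overlapping triples x(i+1) x(i) x(i-1) (with x(-1) = 0) two bits at a time and map each to a digit in {-2, -1, 0, 1, 2}; summing digit·4^j reproduces the multiplier's value.

DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
         0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}

def booth_radix4_digits(x, bits=8):
    digits, prev = [], 0
    for i in range(0, bits, 2):
        triple = (((x >> (i + 1)) & 1) << 2) | (((x >> i) & 1) << 1) | prev
        digits.append(DIGIT[triple])
        prev = (x >> (i + 1)) & 1
    return digits                                     # least-significant digit first

x = 0b01110110                                        # 118
print(booth_radix4_digits(x))                         # [-2, 2, -1, 2]
print(sum(d * 4**j for j, d in enumerate(booth_radix4_digits(x))))   # 118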

Example: Modified Booth's Recoding
================================
a        0 1 1 0          Multiplicand
x        1 0 1 0          Multiplier
z         -1  -2          Radix-4 digits: z0 from (x1 x0), z1 from (x3 x2)
================================
p(0)     0 0 0 0 0 0
+z0·a    1 1 0 1 0 0      (-2 × a, sign-extended)
---------------------------------
4p(1)    1 1 0 1 0 0
p(1)     1 1 1 1 0 1 0 0
+z1·a    1 1 1 0 1 0      (-1 × a, sign-extended)
---------------------------------
4p(2)    1 1 0 1 1 1 0 0
p(2)     1 1 0 1 1 1 0 0  Product
================================

High-Speed Multiplier
¨ Recent cores have high-performance multiplication hardware
¤ Supports 64-bit-result multiply and multiply-accumulate

Multiplier: Carry-Save Addition
¨ In multiplication, multiple partial products must be added; doing this with 2-operand adders repeats the time-consuming carry propagation several times: k operands need k-1 propagations
¨ Techniques for lowering this penalty exist – carry-save addition
¨ Carry propagates only in the last step – the other steps generate a partial sum and a sequence of carries
¨ A basic CSA accepts 3 n-bit operands and generates 2 n-bit results: an n-bit partial sum and an n-bit carry
¨ A second CSA accepts these 2 sequences and another input operand, and generates a new partial sum and carry
¨ A CSA reduces the number of operands to be added from 3 to 2 without carry propagation

CSA Basic Unit – the (3,2) Counter
¨ Simplest implementation – a full adder (FA) with 3 inputs x, y, z
¨ x + y + z = 2c + s (s, c – the sum and carry outputs)
¨ The outputs are a weighted binary representation of the number of 1's among the inputs
¨ Hence the FA is called a (3,2) counter
¨ An n-bit CSA is n (3,2) counters in parallel with no carry links between them
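A minimal sketch of two cascaded carry-save steps on small integers (the operand values are arbitrary): each step compresses three operands into a partial-sum word and a carry word with no carry propagation, and only the final step uses an ordinary carry-propagate addition.

def csa(x, y, z):
    s = x ^ y ^ z                              # per-bit sum, no carries between positions
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carry, shifted to its weight
    return s, c

a, b, d, e = 9, 14, 7, 3                       # four operands
s1, c1 = csa(a, b, d)                          # first CSA level
s2, c2 = csa(s1, c1, e)                        # second CSA level
print(s2 + c2, a + b + d + e)                  # final carry-propagate add: 33 33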


Cascaded CSA for Four 4-Bit Operands
¤ Upper 2 levels – 4-bit CSAs
¤ 3rd level – a 4-bit carry-propagating adder (CPA)
[Figure: (a) a carry-propagate adder, where each bit's Cout feeds the next bit's Cin; (b) a carry-save adder, where each bit's Cout is simply output alongside S with no ripple.]

Wallace Tree
¨ A better organization of the CSAs – faster operation

ARM High-Speed Multiplier Organization
[Figure: Rs is consumed 8 bits per cycle and Rm is rotated into the carry-save adder array; the partial-sum and partial-carry registers are recirculated, the ALU adds the partials at the end, and the registers are initialized for MLA (multiply-accumulate).]
