
Lecture 5 - ARM Organization and Implementation - ICE 1222/2342

Fall, 2008
Daeyoung Kim kimd@icu.ac.kr http://resl.icu.ac.kr/~kimd

Contents

3-stage pipeline ARM organization & implementation
5-stage pipeline ARM organization & implementation

3-stage pipeline ARM Organization


ARM processors up to ARM7

[Figure: 3-stage pipeline ARM organization: address register (A[31:0]) with incrementer, register bank including the PC, instruction decode & control, multiply register, barrel shifter, ALU, data out register and data in register (D[31:0]), connected by the A bus, B bus and ALU bus.]

3-stage pipeline

Fetch

Instruction is fetched and placed in the instruction pipeline

Decode

The instruction is decoded and the datapath control signals prepared for the next cycle
The instruction owns the decode logic but not the datapath

Execute

The instruction owns the datapath
The register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register

ARM single-cycle instruction 3-stage pipeline operation

[Figure: instructions 1, 2 and 3 plotted against time; each passes through fetch, decode and execute, starting one cycle after the previous instruction.]

ARM multi-cycle instruction 3-stage pipeline operation

[Figure: instructions 1 to 5 (ADD, STR, ADD, ADD, ADD) plotted against time; after decode the STR takes two datapath cycles (calc. addr., then data xfer), which delays the decode and execute of the following ADD instructions.]

To achieve higher performance


Tprog = Ninst x CPI / fclk

Increase the clock rate, fclk

The logic in each pipeline stage must be simplified and, therefore, the number of pipeline stages increased

Reduce the average number of clock cycles per instruction, CPI

Instructions which occupy more than one pipeline slot are reimplemented to occupy fewer slots
Pipeline stalls caused by dependencies between instructions are reduced

Memory bottleneck (von Neumann bottleneck)

Deliver more than 32 bits per access
Separate instruction and data memory
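
The relation Tprog = Ninst x CPI / fclk can be tried out numerically. The sketch below is only an illustration: the function name `program_time` and all the numbers in the usage lines are made up for the example and are not figures from the lecture.

```python
def program_time(n_inst: int, cpi: float, f_clk_hz: float) -> float:
    """Return Tprog = Ninst * CPI / fclk, in seconds."""
    if f_clk_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return n_inst * cpi / f_clk_hz


# Hypothetical workload: one million instructions.
baseline = program_time(1_000_000, cpi=2.0, f_clk_hz=20e6)
# Same workload with a faster clock and a lower CPI (illustrative values only).
improved = program_time(1_000_000, cpi=1.5, f_clk_hz=40e6)
speedup = baseline / improved
```

Both levers named on the slide appear as separate arguments, so the effect of raising fclk and of lowering CPI can be explored independently.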

ARM9TDMI 5-stage pipeline organization


Fetch
Instruction is fetched and placed in the instruction pipeline

Decode
The instruction is decoded and register operands read

Execute
An operand is shifted and the ALU result generated
Load/Store: the memory address is calculated in the ALU

Buffer/Data
Data memory is accessed if required
Otherwise the ALU result is simply buffered

Write-back
Result is written back to the register file

[Figure: ARM9TDMI 5-stage pipeline organization. fetch: next pc, +4, I-cache, pc + 4. decode: instruction decode, immediate fields, register read, pc + 8 into r15. execute: shift, ALU, mul, reg shift, pre-index, post-index, +4 for LDM/STM, forwarding paths, execute mux. buffer/data: load/store address, D-cache, byte repl., rot/sgn ex. write-back: register write. Paths that update the pc are labelled B, BL; MOV pc; SUBS pc; and LDR pc.]

Data Forwarding
A major source of complexity in the 5-stage pipeline
Instruction execution is spread across the stages
Forwarding paths resolve data dependencies without stalling the pipeline

Even with forwarding we cannot avoid every stall

LDR rN, [..]
ADD r2, r1, rN
One cycle stall required: rN is available only at the end of the buffer/data stage

Use instruction-level scheduling

Do not put a dependent instruction immediately after a load instruction

[Figure: the ARM9TDMI 5-stage pipeline organization diagram, repeated to show the forwarding paths.]
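
The load-use rule above can be sketched as a small checker. This is a simplified model, not the ARM9TDMI interlock logic: it assumes the only stall is the one-cycle load-use case from the slide, and the `Instr` tuple and `load_use_stalls` function are hypothetical names introduced for the example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                 # mnemonic, e.g. "LDR" or "ADD"
    dest: Optional[str]     # destination register, if any
    srcs: Tuple[str, ...]   # source registers


def load_use_stalls(program: List[Instr]) -> int:
    """Count one-cycle stalls: an instruction that reads the destination
    of the LDR immediately before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "LDR" and prev.dest is not None and prev.dest in cur.srcs:
            stalls += 1
    return stalls


# Dependent ADD directly after the load: one stall.
unscheduled = [
    Instr("LDR", "r3", ("r0",)),
    Instr("ADD", "r2", ("r1", "r3")),
    Instr("SUB", "r4", ("r5", "r6")),
]
# Same work, with the independent SUB moved into the load delay slot.
scheduled = [unscheduled[0], unscheduled[2], unscheduled[1]]
```

Reordering the independent SUB between the LDR and the ADD is the kind of instruction-level scheduling the slide recommends.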

Data Processing Instructions


[Figure: datapath activity for data processing instructions, showing the address register, increment, PC, registers, mult, shifter, ALU, data out, data in and i. pipe.
(a) register - register operations: operands Rn and Rm, result Rd, shift and ALU function as specified by the instruction.
(b) register - immediate operations: operand Rn and immediate field [7:0] from the instruction, result Rd.]

Data Transfer Instructions (STR)


[Figure: datapath activity for STR, showing the address register, increment, PC, registers, mult, shifter, ALU, data out, data in and i. pipe.
(a) 1st cycle - compute address: Rn and the immediate offset [11:0] (lsl #0), ALU function = A / A + B / A - B.
(b) 2nd cycle - store data & auto-index: Rd to data out under the byte? control, Rn updated with ALU function = A + B / A - B.]

Immediate offset
If store byte, replicates it four times
Lowest two bits are used for proper byt…

Branch Instructions
[Figure: datapath activity for branch instructions, showing the address register, increment, PC, registers, mult, shifter, ALU, data out, data in and i. pipe.
(a) 1st cycle - compute branch target: offset [23:0] from the instruction shifted lsl #2, ALU function = A + B.
(b) 2nd cycle - save return address: PC written to R14, ALU function = A.]

ARM Implementation - 1

Clocking Scheme

Most ARMs do not operate with edge-sensitive registers
Based around 2-phase non-overlapping clocks generated internally from a single input clock signal

Allows level-sensitive transparent latches
Data movement is controlled by passing the data alternately through latches open during phase 1 and latches open during phase 2
Non-overlapping property ensures no race condition

[Figure: 2-phase non-overlapping clock waveforms, phase 1 and phase 2, over 1 clock cycle.]

ARM Implementation - 2

Datapath Timing (1)


[Figure: datapath timing across phase 1 and phase 2: register read time, read bus valid, shift time, shift out valid, ALU operands latched, ALU time, ALU out, register write time; precharge invalidates buses.]

ARM Implementation - 3

Datapath Timing (2)
The minimum datapath cycle time is the sum of

Register read time
Shifter delay
ALU delay: dominates cycle time. Logical operations are relatively faster than arithmetic operations. Why?
Register write set-up time
Phase 2 to phase 1 non-overlap time

ARM Implementation - 4

Adder Design 1

http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/adder.htm
32-bit addition time has a significant effect on the datapath cycle time
Influences the maximum clock rate and the processor's performance

The first ARM processor prototype

Ripple-carry adder circuit
Worst-case carry path is 32 gates long

[Figure: one bit of the ripple-carry adder, with inputs A, B and Cin and outputs sum and Cout.]
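
A bit-level model makes the 32-stage carry chain visible. The sketch below is a behavioural illustration only, not the gate-level circuit of the ARM prototype; `full_adder` and `ripple_carry_add` are names chosen for this example.

```python
from typing import Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """One adder bit: return (sum, carry_out) for single-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(a: int, b: int, cin: int = 0, width: int = 32) -> Tuple[int, int]:
    """Add two width-bit values bit by bit; the carry ripples from bit 0 upward."""
    result = 0
    carry = cin
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry
```

Each loop iteration depends on the carry from the previous one, which is why the worst case (for example 0xFFFFFFFF + 1) has to pass through all 32 bit positions.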

ARM Implementation - 5

Adder Design - 2

ARM2 4-bit look-ahead scheme

To reduce the worst-case carry path length

[Figure: 4-bit adder logic block with inputs A[3:0], B[3:0] and Cin[0], outputs sum[3:0] and Cout[3], and generate (G) and propagate (P) signals.]

Carry-Look-Ahead (CLA) Adder 1

Calculating the carry signals in advance

A carry signal will be generated

when both bits Ai and Bi are 1
when one of the two bits is 1 and the carry-in (carry of the previous stage) is 1

$$C_{out} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i \tag{1}$$
$$C_{i+1} = G_i + P_i \cdot C_i \tag{2}$$
$$G_i = A_i \cdot B_i \quad \text{(Generate)} \tag{3}$$
$$P_i = A_i \oplus B_i \quad \text{(Propagate)} \tag{4}$$

Propagate and Generate terms only depend on the input bits

They will be valid after one gate delay
If one uses the above expression to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value.
Let's apply this to a 4-bit adder

Carry-Look-Ahead (CLA) Adder 2

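Let's apply this to a 4-bit adder

$$C_1 = G_0 + P_0 \cdot C_0 \tag{5}$$
$$C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0 \tag{6}$$
$$C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0 \tag{7}$$
$$C_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot C_0 \tag{8}$$

The carry-out bit, $C_{i+1}$, of the last stage will be available after three delays (one delay to calculate the Propagate signal and two delays as a result of the AND and OR gates)
The sum signal can be calculated as follows

$$S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i \tag{9}$$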
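
Equations (3) to (9) translate directly into code. The following is a minimal behavioural sketch of a 4-bit look-ahead adder; the function name `cla4` is an assumption for this example, and Python's bitwise operators stand in for the AND, OR and XOR gates.

```python
from typing import Tuple


def cla4(a: int, b: int, c0: int = 0) -> Tuple[int, int]:
    """4-bit carry-look-ahead add; returns (sum[3:0], C4)."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(abits, bbits)]   # generate, eq. (3)
    p = [x ^ y for x, y in zip(abits, bbits)]   # propagate, eq. (4)

    # Every carry is written in terms of G, P and C0 only, eqs. (5)-(8).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))

    carries = [c0, c1, c2, c3]
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i       # sum bit, eq. (9)
    return total, c4
```

None of the carry expressions refers to another carry, which is the point of the scheme: no carry has to wait for the one before it.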

Carry-Look-Ahead (CLA) Adder 3

4-bit adder


Carry-Look-Ahead (CLA) Adder 4

16-bit adder (Group)

$$P_G = P_3 \cdot P_2 \cdot P_1 \cdot P_0 \tag{10}$$
$$G_G = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 \tag{11}$$
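
The group signals in (10) and (11) let four 4-bit blocks be chained into a 16-bit adder by applying equation (2) at group level. The sketch below is one possible reading, under stated simplifications: the names `group_pg` and `cla16` are hypothetical, the sum bits inside each group are taken with integer addition as a shortcut, and the group carries are computed in a loop, whereas a hardware look-ahead unit would derive them in parallel.

```python
from typing import Tuple


def group_pg(a4: int, b4: int) -> Tuple[int, int]:
    """Return (PG, GG) for one 4-bit group, eqs. (10) and (11)."""
    g = [((a4 >> i) & 1) & ((b4 >> i) & 1) for i in range(4)]
    p = [((a4 >> i) & 1) ^ ((b4 >> i) & 1) for i in range(4)]
    pg = p[3] & p[2] & p[1] & p[0]
    gg = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return pg, gg


def cla16(a: int, b: int, c0: int = 0) -> Tuple[int, int]:
    """16-bit add built from four groups; returns (sum[15:0], carry out)."""
    result = 0
    carry = c0
    for k in range(4):
        a4 = (a >> (4 * k)) & 0xF
        b4 = (b >> (4 * k)) & 0xF
        pg, gg = group_pg(a4, b4)
        # Shortcut: in-group sum bits via integer addition.
        result |= ((a4 + b4 + carry) & 0xF) << (4 * k)
        # Group-level form of eq. (2): next carry = GG + PG . carry
        carry = gg | (pg & carry)
    return result, carry
```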

ARM Implementation - 6

ALU functions

Adder, address computations for memory transfer, branch calculations, bit-wise logical functions, and so on
fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero
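
The function-select table can be exercised with a table-driven model. This sketch only reproduces the input/output behaviour listed in the table; it does not model how the fs lines steer the ARM2 gate logic, and the names `ALU_FUNCTIONS` and `alu` are assumptions for the example. Results are truncated to 32 bits.

```python
MASK32 = 0xFFFFFFFF

# Key: fs[5:0] as a 6-bit number. Value: behaviour for that code.
ALU_FUNCTIONS = {
    0b000100: lambda a, b, c: a & b,        # A and B
    0b001000: lambda a, b, c: a & ~b,       # A and not B
    0b001001: lambda a, b, c: a ^ b,        # A xor B
    0b011001: lambda a, b, c: a + ~b + c,   # A plus not B plus carry
    0b010110: lambda a, b, c: a + b + c,    # A plus B plus carry
    0b110110: lambda a, b, c: ~a + b + c,   # not A plus B plus carry
    0b000000: lambda a, b, c: a,            # A
    0b000001: lambda a, b, c: a | b,        # A or B
    0b000101: lambda a, b, c: b,            # B
    0b001010: lambda a, b, c: ~b,           # not B
    0b001100: lambda a, b, c: 0,            # zero
}


def alu(fs: int, a: int, b: int, carry: int = 0) -> int:
    """Return the 32-bit ALU output for function-select code fs."""
    return ALU_FUNCTIONS[fs](a, b, carry) & MASK32
```

With carry set to 1, "A plus not B plus carry" gives the two's-complement difference A - B, and "not A plus B plus carry" gives B - A.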

ARM Implementation - 7

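ALU functions
The ARM2 ALU logic for one result bit

[Figure: ARM2 ALU logic for one result bit: the NA bus and NB bus inputs are gated by the function-select lines fs 5 to 0 to form the G and P signals, which feed the carry logic and drive the ALU bus.]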

ARM Implementation - 8

ARM6 Carry-Select Adder

Computes the sums of various fields of the word for a carry-in of both zero and one
The final result is selected by using the correct carry-in bit

[Figure: carry-select adder: fields from a,b[3:0] up to a,b[31:28] are added to give both s and s+1; multiplexers controlled by the carries select sum[3:0], sum[7:4], sum[15:8] and sum[31:16].]
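
The carry-select idea can be sketched in a few lines. The field boundaries below (4, 4, 8 and 16 bits) are an assumption read off the sum[3:0], sum[7:4], sum[15:8] and sum[31:16] labels in the figure, and `carry_select_add` is a name chosen for this example; the real ARM6 adder is a gate-level circuit, not this loop.

```python
from typing import Tuple

# (least significant bit, width) of each field; assumed from the figure labels.
FIELDS = [(0, 4), (4, 4), (8, 8), (16, 16)]


def carry_select_add(a: int, b: int, cin: int = 0) -> Tuple[int, int]:
    """32-bit add; each field is summed for carry-in 0 and 1, then selected."""
    result = 0
    carry = cin
    for lsb, width in FIELDS:
        mask = (1 << width) - 1
        fa = (a >> lsb) & mask
        fb = (b >> lsb) & mask
        s0 = fa + fb        # candidate for carry-in = 0
        s1 = fa + fb + 1    # candidate for carry-in = 1
        chosen = s1 if carry else s0   # the mux
        result |= (chosen & mask) << lsb
        carry = chosen >> width
    return result, carry
```

In hardware both candidates for every field are computed at the same time, so only the multiplexer selection has to wait for the incoming carry.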

ARM Implementation - 9

ARM6 ALU Organization


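[Figure: ARM6 ALU organization: the A operand latch and B operand latch feed XOR gates controlled by invert A and invert B; the logic functions block (selected by function) and the adder (with C in, producing C and V) feed the result mux, selected by logic/arithmetic; the result gives N, and zero detect gives Z.]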

ARM Implementation - 10

Barrel Shifter
The shifter performance is critical

Shifter time contributes to the datapath cycle time

[Figure: 4 x 4 cross-bar switch matrix connecting in[3], in[2], in[1], in[0] to out[0], out[1], out[2], out[3]; diagonal control lines select right 3, right 2, right 1, no shift, left 1, left 2 or left 3.]

ARM Implementation - 10

The ARM register bank


[Figure: ARM register bank floorplan: A bus read decoders, B bus read decoders, write decoders, register cells and PC, with Vdd and Vss rails and connections to the ALU bus, A bus, B bus, PC bus and INC bus.]

ARM Implementation - 11

Control Structures
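[Figure: ARM control logic structure: the instruction and coprocessor signals enter the decode PLA, which works with the cycle count, multiply control and load/store multiple blocks to drive address control, register control, ALU control and shifter control.]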

ARM Coprocessor Interface - 1

A general-purpose extension of its instruction set through the addition of hardware coprocessors

Also supports software emulation of coprocessors through undefined instruction trap

Coprocessor Architecture

16 logical coprocessors
Each coprocessor can have up to 16 private registers of any reasonable size
Load-store architecture

Internal operations on registers
Load and store from and to the memory
Move data to or from an ARM register

Implementation

Board-level coprocessor: slow speed
On-chip coprocessor: high clock speed, cache and memory management, etc.

ARM Coprocessor Interface - 2

ARM7TDMI Coprocessor interface

Bus watching

Coprocessor is attached to a bus where the ARM instruction stream flows into the ARM
Coprocessor copies the instructions into an internal pipeline

Handshake between ARM and coprocessor

cpi* (from ARM to all coprocessors): coprocessor instruction
cpa (from the coprocessors to ARM): coprocessor absent
cpb (from the coprocessors to ARM): coprocessor busy

ARM Coprocessor Interface - 3

Handshake outcomes

ARM may decide not to execute it

It falls in a branch shadow or fails the condition code test
cpi* high

ARM may decide to execute it (cpi* low), but cpa high

Undefined instruction trap

ARM decides to execute it and a coprocessor accepts it, but cannot execute it yet

cpa low but cpb high
Busy-wait while stalling instruction stream
Enabled interrupt request arrives? Handle it and retry coprocessor instruction later

ARM decides to execute it and coprocessor accepts it and executes it immediately

cpi* low, cpa low, cpb low
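
The four handshake outcomes amount to a small decision table over the three signals. The sketch below encodes it, treating each signal as a boolean where True means the line is high; the function name `handshake_outcome` and the returned strings are illustrative labels, not ARM terminology beyond what the slide uses.

```python
def handshake_outcome(cpi_n: bool, cpa: bool, cpb: bool) -> str:
    """Classify one coprocessor instruction from the handshake lines.

    cpi_n: cpi* (active low), high means ARM will not execute the instruction
    cpa:   coprocessor absent, high means no coprocessor accepted it
    cpb:   coprocessor busy, high means accepted but not ready yet
    """
    if cpi_n:
        return "not executed"                 # branch shadow or failed condition
    if cpa:
        return "undefined instruction trap"   # nobody claimed the instruction
    if cpb:
        return "busy-wait"                    # ARM stalls until cpb goes low
    return "execute"                          # all three lines low


# Example: ARM wants to execute, a coprocessor accepted, but it is still busy.
state = handshake_outcome(cpi_n=False, cpa=False, cpb=True)
```

The order of the tests mirrors the priority on the slide: cpi* is examined first, then cpa, then cpb.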
