
ARM Processor Architecture (II)

Speaker: Lung-Hao Chang
Advisor: Prof. Andy Wu
Graduate Institute of Electronics Engineering, National Taiwan University
Modified from National Chiao-Tung University IP Core Design course

Outline
ARM processor core
Memory hierarchy
Software development
Summary

ARM Platform Design

SOC Consortium Course Material

09/21/2003

ARM Processor Core


ARM7TDMI Processor Core


Current low-end ARM core for applications such as digital mobile phones
TDMI
T: Thumb, 16-bit compressed instruction set
D: on-chip Debug support, enabling the processor to halt in response to a debug request
M: enhanced Multiplier, yielding a full 64-bit result, high performance
I: EmbeddedICE hardware

Von Neumann architecture
3-stage pipeline
CPI ~ 1.9

ARM7TDMI Block Diagram


[Figure: ARM7TDMI block diagram. Blocks: processor core, EmbeddedICE, bus splitter and JTAG TAP controller, linked by scan chains 0, 1 and 2. Core signals: extern0, extern1, opc, r/w, mreq, trans, mas[1:0], A[31:0], D[31:0], Din[31:0], Dout[31:0] and other signals. TAP controller pins: TCK, TMS, TRST, TDI, TDO.]

ARM7TDMI Core Diagram


ARM7TDMI Interface Signals (1/4)


[Figure: ARM7TDMI core interface signals, grouped by function.]
clock control: mclk, wait, eclk
configuration: bigend
interrupts: irq, fiq, isync
initialization: reset
bus control: enin, enout, enouti, abe, ale, ape, dbe, tbe, busen, highz, busdis, ecapclk
memory interface: A[31:0], Din[31:0], Dout[31:0], D[31:0], bl[3:0], r/w, mas[1:0], mreq, seq, lock
MMU interface: trans, mode[4:0], abort
state: Tbit
debug: dbgrq, breakpt, dbgack, exec, extern1, extern0, dbgen, rangeout0, rangeout1, dbgrqi, commrx, commtx
coprocessor interface: opc, cpi, cpa, cpb
power: Vdd, Vss
TAP information: tapsm[3:0], ir[3:0], tdoen, tck1, tck2, screg[3:0]
boundary scan extension: drivebs, ecapclkbs, icapclkbs, highz, pclkbs, rstclkbs, sdinbs, sdoutbs, shclkbs, shclk2bs
JTAG controls: TRST, TCK, TMS, TDI, TDO

ARM7TDMI Interface Signals (2/4)


Clock control
All state changes within the processor are controlled by mclk, the memory clock
Internal clock = mclk AND \wait
The eclk clock output reflects the clock used by the core

Memory interface
32-bit address A[31:0], bidirectional data bus D[31:0], separate data out Dout[31:0] and data in Din[31:0]
seq indicates that the memory address will be sequential to that used in the previous cycle

ARM7TDMI Interface Signals (3/4)


lock indicates that the processor should keep the bus to ensure the atomicity of the read and write phases of a SWAP (SWP) instruction
\r/w, read or write
mas[1:0], encodes the memory access size: byte, half-word or word
bl[3:0], externally controlled enables on latches on each of the 4 bytes on the data input bus

MMU interface
\trans (translation control), 0: user mode, 1: privileged mode
\mode[4:0], bottom 5 bits of the CPSR (inverted)
abort, disallows the access

State
T bit, indicates whether the processor is currently executing ARM or Thumb instructions

Configuration
bigend, selects big-endian or little-endian operation
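
As a rough illustration of how external logic might interpret the MMU interface pins described above, the following Python sketch recovers the processor mode from the inverted \mode[4:0] outputs. The function and table names are hypothetical; the encodings are the standard CPSR mode field values of ARM architecture v4, which are not listed on the slide.

```python
# CPSR mode field encodings (bits 4:0) in ARM architecture v4.
# These values are background knowledge, not taken from the slide.
CPSR_MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}


def decode_mode_pins(pins: int) -> str:
    """Decode the inverted mode[4:0] pin value into a mode name."""
    cpsr_bits = ~pins & 0x1F  # undo the inversion and keep 5 bits
    return CPSR_MODES.get(cpsr_bits, "unknown")


def is_privileged(pins: int) -> bool:
    """Mirror the \\trans convention: anything except User is privileged."""
    return decode_mode_pins(pins) != "User"


print(decode_mode_pins(0b01111))  # User
print(decode_mode_pins(0b01101))  # IRQ
print(is_privileged(0b01111))     # False
```

An unknown bit pattern is reported as "unknown" rather than raising, which is a design choice of this sketch only.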


ARM7TDMI Interface Signals (4/4)


Interrupt
\fiq, fast interrupt request, higher priority
\irq, normal interrupt request
isync, allows the interrupt synchronizer to be bypassed

Initialization
\reset, starts the processor from a known state, executing from address 0x00000000

ARM7TDMI characteristics


Memory Access
The ARM7 is a Von Neumann, load/store architecture, i.e.,
A single 32-bit data bus carries both instructions and data.
Only the load/store instructions (and SWP) access memory.

Memory is addressed as a 32-bit address space
Data types can be 8-bit bytes, 16-bit half-words or 32-bit words, and memory may be seen as a byte line folded into 4-byte words
Words must be aligned to 4-byte boundaries, and half-words to 2-byte boundaries
Always ensure that the memory controller supports all three access sizes
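
The alignment rule above can be sketched in a few lines of Python. This is only an illustration: the names ACCESS_SIZE and is_aligned are made up for the example and do not come from any ARM tool.

```python
# Size in bytes of the three data types named on the slide.
ACCESS_SIZE = {"byte": 1, "half-word": 2, "word": 4}


def is_aligned(address: int, kind: str) -> bool:
    """Return True if a 32-bit address is aligned for the given access size."""
    if not 0 <= address <= 0xFFFFFFFF:
        raise ValueError("address outside the 32-bit address space")
    return address % ACCESS_SIZE[kind] == 0


print(is_aligned(0x1000, "word"))       # True
print(is_aligned(0x1002, "word"))       # False
print(is_aligned(0x1002, "half-word"))  # True
print(is_aligned(0x1003, "byte"))       # True
```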

ARM Memory Interface


Sequential (S cycle)
(nMREQ, SEQ) = (0, 1)
The ARM core requests a transfer to or from an address which is either the same as, or one word or one half-word greater than, the preceding address.

Non-sequential (N cycle)
(nMREQ, SEQ) = (0, 0)
The ARM core requests a transfer to or from an address which is unrelated to the address used in the preceding cycle.

Internal (I cycle)
(nMREQ, SEQ) = (1, 0)
The ARM core does not require a transfer, as it is performing an internal function, and no useful prefetching can be performed at the same time.

Coprocessor register transfer (C cycle)
(nMREQ, SEQ) = (1, 1)
The ARM core wishes to use the data bus to communicate with a coprocessor, but does not require any action by the memory system.
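
The four encodings above amount to a two-bit lookup. A minimal Python sketch follows, for instance as part of a bus-trace checker; the names CYCLE_TYPES, cycle_type and needs_memory are hypothetical.

```python
# (nMREQ, SEQ) -> cycle type, exactly as listed on the slide.
CYCLE_TYPES = {
    (0, 1): "S",  # sequential
    (0, 0): "N",  # non-sequential
    (1, 0): "I",  # internal
    (1, 1): "C",  # coprocessor register transfer
}


def cycle_type(nmreq: int, seq: int) -> str:
    """Classify one bus cycle from the nMREQ and SEQ signal levels."""
    return CYCLE_TYPES[(nmreq, seq)]


def needs_memory(nmreq: int, seq: int) -> bool:
    """Only N and S cycles ask the memory system for a transfer."""
    return cycle_type(nmreq, seq) in ("N", "S")


# A made-up trace: one non-sequential access followed by a burst.
trace = [(0, 0), (0, 1), (0, 1), (1, 0)]
print("".join(cycle_type(m, s) for m, s in trace))  # NSSI
```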


Cached ARM7TDMI Macrocells


[Figure: cached ARM7TDMI macrocell. Blocks: ARM core with EmbeddedICE & JTAG and CP15, MMU, instruction & data cache, write buffer, AMBA interface. Paths: virtual address, physical address, instructions & data, AMBA address, AMBA data.]

ARM710T
8K unified write-through cache
Full memory management unit supporting virtual memory
Write buffer

ARM720T
As ARM710T but with WinCE support

ARM740T
8K unified write-through cache
Memory protection unit
Write buffer

Processor Core Vs CPU Core


Processor Core
The engine that fetches instructions and executes them
E.g.: ARM7TDMI, ARM9TDMI, ARM9E-S

CPU Core
Consists of the ARM processor core and some tightly coupled function blocks
Cache and memory management blocks
E.g.: ARM710T, ARM720T, ARM740T, ARM920T, ARM922T, ARM940T, ARM946E-S, and ARM966E-S

[Figure: ARM710T CPU core. Blocks: ARM7TDMI with EmbeddedICE & JTAG, CP15, MMU, instruction & data cache, write buffer, AMBA interface. Paths: virtual address, physical address, instructions & data, AMBA address, AMBA data.]

ARM8
Higher performance than ARM7
By increasing the clock rate
By reducing the CPI
Higher memory bandwidth, 64-bit wide memory (ARM8)
Separate memories for instruction and data accesses (ARM9TDMI, ARM10TDMI)

Core Organization
The prefetch unit is responsible for fetching instructions from memory and buffering them (exploiting the double-bandwidth memory)
It is also responsible for branch prediction, using static prediction based on the branch direction (backward: predicted taken; forward: predicted not taken)
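
The static prediction rule described above (backward taken, forward not taken) can be sketched as follows. The function name is hypothetical, and treating a branch to its own address as backward is an assumption of this sketch, not something the slide states.

```python
def predict_taken(branch_address: int, target_address: int) -> bool:
    """Static prediction: backward branches taken, forward branches not taken.

    Assumption: a target equal to the branch address counts as backward.
    """
    return target_address <= branch_address


# A loop-closing branch jumps back to the top of the loop.
print(predict_taken(0x8020, 0x8000))  # True
# A forward branch that skips over an error handler.
print(predict_taken(0x8020, 0x8040))  # False
```

The rule works well for loops, where the backward branch at the bottom is taken on every iteration except the last.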
[Figure: ARM8 core organization. Blocks: prefetch unit, integer unit, double-bandwidth memory, coprocessor(s). Paths: addresses, PC, instructions, read data, write data, CPinst., CPdata.]

Pipeline Organization
5-stage pipeline: the prefetch unit occupies the 1st stage, the integer unit occupies the remainder

(1) Instruction prefetch (prefetch unit)
(2) Instruction decode and register read (integer unit)
(3) Execute (shift and ALU) (integer unit)
(4) Data memory access (integer unit)
(5) Write back results (integer unit)
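
To see how the five stages overlap, here is a small Python sketch of an ideal pipeline with no stalls, hazards or branches. This is a simplification assumed for the example, and the names STAGES and pipeline_schedule are hypothetical.

```python
STAGES = ["prefetch", "decode", "execute", "memory", "write-back"]


def pipeline_schedule(n_instructions: int) -> list:
    """Return, for each cycle, a dict mapping stage name to instruction index.

    Models an ideal pipeline: instruction i enters stage s in cycle i + s.
    """
    total_cycles = n_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for s, stage in enumerate(STAGES):
            i = cycle - s
            if 0 <= i < n_instructions:
                row[stage] = i
        schedule.append(row)
    return schedule


for cycle, row in enumerate(pipeline_schedule(3)):
    print(cycle, row)
# Three instructions finish in 3 + 5 - 1 = 7 cycles.
```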


Integer Unit Organization


[Figure: ARM8 integer unit organization. Stage labels: inst. decode, register read, execute, memory, write. Blocks: decode, register read, multiplier, ALU/shifter, write pipeline, +4, mux, rot/sgn ex, register write. Paths: instructions, coprocessor instructions, PC+4, coproc data, write data, address, read data, forwarding paths.]

ARM8 Macrocell
ARM810
8 Kbyte unified instruction and data cache
Copy-back
Double-bandwidth
MMU
Coprocessor
Write buffer

[Figure: ARM810 organization. Blocks: prefetch unit, ARM8 integer unit, 8 Kbyte double-bandwidth cache, CP15, copy-back tag, copy-back data, JTAG, write buffer, MMU, address buffer. Paths: virtual address, PC, instructions, read data, write data, CPinst., CPdata, physical address, data in, data out, address.]

StrongARM
The first ARM processor to use a modified-Harvard architecture (separate instruction and data caches); now available from Intel
Features
A 5-stage pipeline with register forwarding
Single-cycle execution of all common instructions except 64-bit multiplies
Instruction cache and copy-back data cache
Write buffer
Pseudo-static operation with low power consumption

StrongARM core pipeline organization


[Figure: StrongARM core pipeline organization. Stages: fetch, decode, execute, buffer/data, write-back. Labelled blocks and paths: next pc, +4, I-cache, pc + 4, r15, instruction decode, immediate fields, branch offset, + disp, register read, shift, ALU & multiply, reg shift, pre-index, post-index, mux, D-cache, load/store address, rotate, rot/sgn ex, register write, forwarding paths. Instructions marked as writing the pc: B, BL, MOV pc, SUBS pc, LDR pc, LDM/STM (branch target).]

StrongARM Processor
SA-1110/SA-1111
Intel SA-1 core
16-Kbyte instruction cache and 8-Kbyte data cache
MMU, read and write buffers
512-byte mini-data cache

ARM9TDMI
Harvard architecture
Increases available memory bandwidth
Instruction memory interface
Data memory interface
Simultaneous accesses to instruction and data memory can be achieved

5-stage pipeline
Changes implemented to
Improve CPI to ~1.5
Improve maximum clock frequency
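
The effect of the two improvements can be estimated with the usual relation time = instructions x CPI / clock frequency. The Python sketch below uses the CPI figures quoted in these slides (about 1.9 for ARM7TDMI, about 1.5 for ARM9TDMI); the clock frequencies and function names are placeholders chosen for illustration only.

```python
def execution_time(instructions: int, cpi: float, clock_hz: float) -> float:
    """Seconds needed to run a program: instructions * CPI / clock frequency."""
    return instructions * cpi / clock_hz


def speedup(cpi_old: float, f_old: float, cpi_new: float, f_new: float) -> float:
    """Ratio of old to new execution time for the same instruction count."""
    return (cpi_old / cpi_new) * (f_new / f_old)


# Same (assumed) clock for both cores: only the CPI improvement counts.
print(round(speedup(1.9, 100e6, 1.5, 100e6), 2))  # 1.27
# Doubling the (assumed) clock as well multiplies the gain.
print(round(speedup(1.9, 100e6, 1.5, 200e6), 2))  # 2.53
```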


ARM9TDMI Organization
[Figure: ARM9TDMI organization. Stages: fetch, decode, execute, buffer/data, write-back. Labelled blocks and paths: next pc, +4, I-cache, pc + 4, r15, instruction decode, immediate fields, register read, mul, shift, ALU, reg shift, pre-index, post-index, mux, byte repl., load/store address, D-cache, rot/sgn ex, register write, forwarding paths. Instructions marked as writing the pc: B, BL, MOV pc, SUBS pc, LDR pc, LDM/STM.]

ARM9TDMI Pipeline Operations (1/2)


ARM7TDMI
Stages: Fetch, Decode, Execute
Operations in order: instruction fetch, Thumb decompress, ARM decode, reg read, shift/ALU, reg write

ARM9TDMI
Stages: Fetch, Decode, Execute, Memory, Write
Operations in order: instruction fetch, r. read and decode, shift/ALU, data memory access, reg write

The ARM9TDMI pipeline is much tighter and does not have sufficient slack time to allow Thumb instructions to be first translated into ARM instructions and then decoded
It has hardware to decode both ARM and Thumb instructions directly

ARM9TDMI Pipeline Operations (2/2)


Coprocessor support
Coprocessors: floating-point, digital signal processing, special-purpose hardware accelerators

On-chip debugger
Additional features compared to ARM7TDMI
Hardware single stepping
Breakpoints can be set on exceptions

ARM9TDMI characteristics

Cached ARM9TDMI Macrocell


ARM920T
ARM9TDMI
16KB instruction cache, 16KB data cache
Full Memory Management Unit, Write Buffer

ARM922T
ARM9TDMI
8KB instruction cache, 8KB data cache
Full Memory Management Unit, Write Buffer

ARM940T
ARM9TDMI
4KB instruction cache, 4KB data cache
Protection Unit

ARM920T CPU Core


ARM9E-S Family Overview


ARM9E-S is based on an ARM9TDMI with the following extensions:
Single-cycle 32x16 multiplier implementation
EmbeddedICE logic RT
Improved ARM/Thumb interworking
New 32x16 and 16x16 multiply instructions
New count leading zeros instruction
New saturated math instructions
Architecture v5TE

ARM946E-S
ARM9E-S core
Instruction and data caches, selectable sizes
Instruction and data RAMs, selectable sizes
Protection unit
AHB bus interface

ARM10TDMI (1/2)
High-end ARM processor core
Performance on the same IC process
ARM10TDMI is about 2x ARM9TDMI, which is about 2x ARM7TDMI

300 MHz, 0.25 µm CMOS
Increase clock rate

ARM10TDMI pipeline
Fetch: branch prediction, instruction fetch
Issue: decode
Decode: r. read, decode
Execute: addr. calc., shift/ALU, multiply
Memory: data memory access, multiplier partials add
Write: data write, reg write

ARM10TDMI (2/2)
Reduce CPI
Branch prediction
Non-blocking load and store execution
64-bit data memory: transfers 2 registers in each cycle

ARM1020T Overview
Architecture v5T
ARM1020E will be v5TE

CPI ~ 1.3
6-stage pipeline
Static branch prediction
32KB instruction and 32KB data caches
Hit-under-miss support
64 bits per cycle LDM/STM operations
EmbeddedICE Logic RT-II
Support for new VFPv1 architecture

ARM10200 test chip
ARM1020T
VFP10
SDRAM memory interface
PLL

ARM1136J(F)-S
First implementations of the ARMv6 architecture
ARM1136J-S: integer-only core
ARM1136JF-S: core with integrated floating point
8 pipeline stages

High-speed pipeline microarchitecture
System-level flexibility
Low power
Microarchitecture designed for low power
Power management modes

Availability
Delivering to first licensees in December 2002

The ARM11 core has been developed and integrated in parallel with the ARM11 PrimeXsys Platform to ensure a fully compatible, high-performance, extendable system solution

Memory Hierarchy


Memory Size and Speed

[Figure: memory hierarchy, running from small, fast and expensive at the top to large capacity, slow access time and cheap cost at the bottom.]
Registers
On-chip cache memory
2nd-level off-chip cache
Main memory
Hard disk

Caches (1/2)
A cache memory is a small, very fast memory that retains copies of recently used memory values.
It is usually implemented on the same chip as the processor.
Caches work because programs normally display the property of locality: at any particular time they tend to execute the same instructions many times on the same areas of data.
An access to an item which is in the cache is called a hit, and an access to an item which is not in the cache is a miss.
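
A toy model makes the link between locality and hit rate concrete. The Python sketch below assumes a direct-mapped cache with 4 lines of 16 bytes; the organization, the sizes and the class name are all assumptions for the example and do not describe any particular ARM cache.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that only tracks tags, hits and misses."""

    def __init__(self, n_lines: int = 4, line_bytes: int = 16):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.tags = [None] * n_lines
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Look up one address; return True on a hit, False on a miss."""
        block = address // self.line_bytes
        index = block % self.n_lines
        tag = block // self.n_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # fill the line on a miss
        self.misses += 1
        return False


cache = DirectMappedCache()
# A loop that reads the same 8 words (32 bytes) ten times.
for _ in range(10):
    for addr in range(0x100, 0x120, 4):
        cache.access(addr)
print(cache.hits, cache.misses)  # 78 2
```

Only the first touch of each 16-byte line misses; every later access in the loop hits, which is the locality effect described above.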

Caches (2/2)
A processor can have one of the following two organizations:
A unified cache
This is a single cache for both instructions and data

Separate instruction and data caches
This organization is sometimes called a modified Harvard architecture

Unified instruction and data cache


[Figure: unified instruction and data cache. Blocks: processor with registers, a single cache holding copies of instructions and copies of data, memory. Paths: address, instructions and data. Memory addresses run from 00..00 to FF..FF (hexadecimal).]

Separate data and instruction caches


[Figure: separate data and instruction caches. Blocks: processor with registers, an instruction cache holding copies of instructions, a data cache holding copies of data, memory. Paths: instruction address and instructions, data address and data. Memory addresses run from 00..00 to FF..FF (hexadecimal).]

Cache Write Strategies


Write-through
All write operations are passed to main memory

Write-through with buffered write
All write operations are still passed to main memory and the cache is updated as appropriate, but instead of slowing the processor down to main memory speed, the write address and data are stored in a write buffer which can accept the write information at high speed.

Copy-back (write-back)
The cache is not kept coherent with main memory
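
The practical difference between the strategies shows up in main-memory traffic. The Python sketch below counts memory writes under two simplifying assumptions that are not on the slide: a copy-back cache writes each modified line to memory exactly once (when it is finally flushed), and the cache is large enough that no modified line is evicted earlier. Function names and the 16-byte line size are hypothetical.

```python
def writes_write_through(write_addresses: list) -> int:
    """Write-through: every processor write also goes to main memory."""
    return len(write_addresses)


def writes_copy_back(write_addresses: list, line_bytes: int = 16) -> int:
    """Copy-back: each modified (dirty) line is written to memory once.

    Assumes no dirty line is evicted before the final flush.
    """
    dirty_lines = {addr // line_bytes for addr in write_addresses}
    return len(dirty_lines)


# A counter at one address updated 100 times.
counter_updates = [0x2000] * 100
print(writes_write_through(counter_updates))  # 100
print(writes_copy_back(counter_updates))      # 1
```

A write buffer does not reduce the write-through count; it only hides the memory latency from the processor.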


Software Development


Main Components in ADS


ADS: ARM Developer Suite
ANSI C compilers - armcc and tcc
ISO/Embedded C++ compilers - armcpp and tcpp
ARM/Thumb assembler - armasm
Linker - armlink
Project management tool for Windows - CodeWarrior
Instruction set simulator - ARMulator
Debuggers - AXD, ADW, ADU and armsd
Format converter - fromelf
Librarian - armar
ARM profiler - armprof
C and C++ libraries
ROM-based debug tools (ARM Firmware Suite, AFS)
Real Time Debug and Trace support
Support for all ARM cores and processors including ARM9E, ARM10, Jazelle, StrongARM and Intel XScale

ARM C Compiler
The compiler is compliant with the ANSI standard for C
Supported by the appropriate library of functions
Uses the ARM Procedure Call Standard (APCS) for all external functions
For procedure entry and exit

May produce assembly source output
Can be inspected, hand optimized and then assembled subsequently

Can also produce Thumb code

ARM Linker
Takes one or more object files and combines them
Resolves symbolic references between the object files and extracts the object modules from libraries
Normally the linker includes debug tables in the output file

ARM Symbolic Debugger


A front-end interface for debugging a program running either under emulation (on the ARMulator) or remotely on an ARM development board (via a serial line or through the JTAG test interface)
ARMsd allows an executable program to be loaded into the ARMulator or a development board and run. It allows the setting of
Breakpoints: addresses in the code
Watchpoints: memory addresses, triggered if accessed as a data address
Both cause an exception that halts execution so that the processor state can be examined

ARM Emulator: ARMulator


A suite of programs that models the behavior of various ARM processor cores and system architectures in software on a host system
Can operate at various levels of accuracy
Instruction accurate
Cycle accurate
Timing accurate

Benchmarking before hardware is available
The instruction count or number of cycles can be measured for a program
Performance analysis

Run software on ARMulator
Through ARMsd or ARM GUI debuggers, e.g., AXD
The processor core model incorporates the remote debug interface, so the processor and the system state are visible from the ARM symbolic debugger
Supports a C library to allow complete C programs to run on the simulated system

ARM Development Board


A circuit board including an ARM core (e.g. ARM9TDMI), memory components, I/O and electrically programmable devices
It can support both hardware and software development before the final application-specific hardware is available

ARM Integrator
A motherboard with some extensions to support the development of applications
Provides core modules, logic modules (Xilinx Virtex FPGA, Altera APEX FPGA), OS, input/output resources, bus arbitration, interrupt handling

Summary (1/2)
ARM Processor Family

Processor family   # of pipeline stages   Memory organization   Clock rate   MIPS/MHz
ARM6               3                      Von Neumann           25 MHz       -
ARM7               3                      Von Neumann           66 MHz       0.9
ARM8               5                      Von Neumann           72 MHz       1.2
ARM9               5                      Harvard               200 MHz      1.1
ARM10              6                      Harvard               400 MHz      1.25
StrongARM          5                      Harvard               233 MHz      1.15
ARM11              8                      Von Neumann/Harvard   550 MHz      1.2

Summary (2/2)
Memory hierarchy
Unified cache / separate instruction and data caches
Write-through with buffered write

Software Development
CodeWarrior IDE
armcc/tcc/armcpp/tcpp
armasm
armlink
armprof
AXD (ARM eXtended Debugger)
armsd
ARMulator
ARM Integrator

References
[1] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
[2] ARM System-on-Chip Architecture, Second Edition, edited by S. Furber, Addison Wesley Longman: ISBN 0-201-67519-6.
[3] Architecture Reference Manual, Second Edition, edited by D. Seal, Addison Wesley Longman: ISBN 0-201-73719-1.
[4] www.arm.com
