Getting Started with the Cortex-A8 Instruction Set

Instruction Sets
32-bit ARM instruction set
16-bit Thumb instruction set
Thumb-2 instruction set (mixed 16-bit and 32-bit encodings): a trade-off between the two above. Most 32-bit Thumb-2 instructions are unconditional, unlike their ARM counterparts.

Advanced SIMD architecture.


Enables the same operation to be performed on multiple data items in parallel.
Instructions operate on vectors held in 64-bit or 128-bit registers.
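To make "the same operation on multiple items" concrete, here is a minimal C sketch that models one 64-bit register as eight independent 8-bit lanes. The name `lanewise_add_u8` is hypothetical and used only for illustration; a real NEON instruction does all lanes in one operation, whereas this model simply loops over them.

```c
#include <stdint.h>

#define LANES 8  /* eight 8-bit lanes model one 64-bit register */

/* Hypothetical helper: adds two 8-lane vectors element by element.
   Each lane wraps modulo 256; no carry crosses into the neighbouring lane. */
void lanewise_add_u8(const uint8_t a[LANES], const uint8_t b[LANES],
                     uint8_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

The point of the model is that lanes never interact: an overflow in one lane does not disturb the lane next to it.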

Other instruction sets


ThumbEE instruction set
Jazelle Extension

Register Set (ARM and Neon)


33 general-purpose 32-bit registers
In User mode only R0 to R15 are available
R14 -> Link register: holds the return address when a branch with link (BL) is executed
R15 -> Program counter

seven 32-bit status registers


Status Flags/Processor mode

Neon Register Bank


View 1: 32 x 64-bit (doubleword) registers, D0-D31
View 2: 16 x 128-bit (quadword) registers, Q0-Q15
View 3: a combination of these 128-bit and 64-bit registers, Q0-Q15 and D0-D31. The views name the same storage.
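The two views overlap: each quadword register Qn occupies the same bits as the doubleword pair D(2n) and D(2n+1). A small C model of this aliasing, using a union, might look like the sketch below. The type name `neon_bank_model` is hypothetical and this is only a picture of the mapping, not of how the hardware is built.

```c
#include <stdint.h>

/* Hypothetical model of the NEON register bank: 256 bytes of storage
   seen through two overlapping views. */
typedef union {
    uint64_t d[32];      /* View 1: D0-D31, 64 bits each */
    uint64_t q[16][2];   /* View 2: Q0-Q15, each made of two 64-bit halves */
} neon_bank_model;

/* Index of the D register that holds the low half of Qn. */
unsigned q_low_d(unsigned qn)  { return 2u * qn; }

/* Index of the D register that holds the high half of Qn. */
unsigned q_high_d(unsigned qn) { return 2u * qn + 1u; }
```

In this model, writing `d[6]` and `d[7]` changes what is read back through `q[3]`, which mirrors Q3 overlaying D6 and D7.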

ARM Instruction set


All ARM instructions are 32 bits long
Branch instructions
Data processing instructions
Register load and store instructions
Multiple register load and store instructions
Status register access instructions (OOS)
Coprocessor instructions (OOS)

ARM Instruction set


Branch Instructions
Branch backwards to form loops
Branch forwards in conditional structures
Branch to subroutines

e.g.
    B label1
    BL label1              ; branch with link
    BEQ {pc}+4             ; taken only if the Z flag is set

ARM Instruction set


Data processing instructions
Add or multiply two registers
Add register with constant
Bitwise operations operate on 8-bit, 16-bit and 32-bit data
Long multiply instructions give a 64-bit result in two registers

e.g.
    ADD r2, r1, r3
    SUBS r8, r6, #240      ; sets the flags on the result
    RSB r4, r4, #1280      ; subtracts contents of r4 from 1280
    AND r9, r2, #0xFF00
    ORREQ r2, r0, r5
    MOVS r3, r2, LSR #3    ; r3 = r2 shifted right by 3 bits, flags updated

ARM Instruction set


Register load and store instructions
Load or store a single register: 8, 16 or 32 bits
Load double words
Byte and halfword loads can be zero filled or sign extended

e.g.
    STMFD r13!, {r0-r5}
    LDMFD r13!, {r0-r5}
    PUSH {r5-r7,lr}
    POP {r5-r7,pc}
    LDR r3, [r0], #4       ; r0 is incremented by 4
    LDR r3, [r0], r4       ; r0 is incremented by r4
    LDR r3, [r0, #0x2C]    ; load with offset
    LDR r3, [r0, r4, LSL #2] ; load from r0 + (r4 << 2)
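The LDR addressing modes have close analogues in C pointer arithmetic. The sketch below is only an illustration: the helper names are hypothetical, the pointer stands in for r0, and a compiler is free to choose different instructions for this code.

```c
#include <stdint.h>

/* Post-indexed load, like LDR r3, [r0], #4:
   read the word, then advance the pointer by one word (4 bytes). */
uint32_t load_post_inc(const uint32_t **p)
{
    uint32_t v = **p;
    *p += 1;
    return v;
}

/* Load with immediate offset, like LDR r3, [r0, #0x2C]:
   byte offset 0x2C is word index 11; the base pointer is unchanged. */
uint32_t load_offset(const uint32_t *base)
{
    return base[0x2C / 4];
}

/* Scaled register offset, like LDR r3, [r0, r4, LSL #2]:
   the index is scaled by 4 to form a byte offset. */
uint32_t load_scaled(const uint32_t *base, uint32_t idx)
{
    return base[idx];
}
```

Walking an array with the post-indexed form folds the pointer update into the load, which is one reason it is common in hand-written loops.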

ARM Instruction set


Conditional Execution Flags
N  Set when the result of the operation was Negative.
Z  Set when the result of the operation was Zero.
C  Set when the operation resulted in a Carry.
V  Set when the operation caused oVerflow.

Most ARM instructions can be executed conditionally, e.g.


    ADD r0, r1, r2         ; r0 = r1 + r2, don't update flags
    ADDS r0, r1, r2        ; r0 = r1 + r2, and update flags
    ADDSCS r0, r1, r2      ; if C flag set then r0 = r1 + r2, and update flags
    CMP r0, r1             ; update flags based on r0 - r1
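As a rough illustration of what the four flags mean for an addition, the following C sketch computes N, Z, C and V the way a flag-setting add would. The names `nzcv_flags` and `adds_model` are hypothetical; this is a model for reasoning about the flags, not processor code.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical container for the four condition flags. */
typedef struct {
    bool n;  /* result is negative (bit 31 set) */
    bool z;  /* result is zero */
    bool c;  /* unsigned carry out of bit 31 */
    bool v;  /* signed overflow */
} nzcv_flags;

/* Models a flag-setting 32-bit add: returns a + b and fills in the flags. */
uint32_t adds_model(uint32_t a, uint32_t b, nzcv_flags *f)
{
    uint32_t r = a + b;                        /* wraps modulo 2^32 */
    f->n = (r >> 31) != 0;
    f->z = (r == 0);
    f->c = (r < a);                            /* wrapped, so a carry occurred */
    /* Overflow: operands share a sign and the result's sign differs. */
    f->v = ((~(a ^ b) & (a ^ r)) >> 31) != 0;
    return r;
}
```

For example, adding 1 to 0xFFFFFFFF gives zero with Z and C set, while adding 1 to 0x7FFFFFFF sets N and V because the signed result overflowed.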

Why are conditional instructions needed if branch instructions are available?
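One common answer is that a short conditional sequence avoids a branch altogether, so there is nothing to mispredict. The C sketch below contrasts a branching maximum with a branch-free one. Both function names are hypothetical, and whether a compiler actually emits a branch for the first or conditional instructions for either is an assumption that depends on the compiler and its options.

```c
#include <stdint.h>

/* Branching form: a compiler may emit a compare followed by a branch. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a < b) {
        return b;
    }
    return a;
}

/* Branch-free form: the comparison result selects the value directly. */
int32_t max_select(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);   /* all ones when a < b, else zero */
    return (a & ~mask) | (b & mask);
}
```

Both functions return the same value for every input; only the control flow differs.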

ARM Instruction set


Suffix details

Neon Instruction set


Vector Duplicate
VDUP{cond}.size Qd, Dm[x]
cond is an optional condition code
size must be 8, 16, or 32
Qd specifies the destination register for a quadword operation
Dm[x] specifies the NEON scalar

VADD.datatype {Qd}, Qn, Qm
VADD.datatype {Dd}, Dn, Dm


Datatype -> I8, I16, I32, I64 or F32 for VADD and VSUB
Datatype -> S8, S16, S32, S64, U8, U16, U32, U64 for VQADD or VQSUB (depends on the instruction, refer to the TRM)
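The difference between the plain and the saturating (Q) forms is easiest to see on a single lane. The sketch below models one 8-bit lane in C. The function names are hypothetical and merely echo the mnemonics; real instructions apply the same rule to every lane of the vector at once.

```c
#include <stdint.h>

/* One lane of a plain 8-bit integer add: the result wraps modulo 256. */
uint8_t vadd_i8_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* One lane of an unsigned saturating add: the result clamps at 255. */
uint8_t vqadd_u8_lane(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (sum > 255u) ? 255u : (uint8_t)sum;
}

/* One lane of a signed saturating add: the result clamps to [-128, 127]. */
int8_t vqadd_s8_lane(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum > 127) {
        return 127;
    }
    if (sum < -128) {
        return -128;
    }
    return (int8_t)sum;
}
```

Saturation depends on signedness, which is why the saturating forms take S or U types while the plain add only needs the element size.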

Neon Instruction set (e.g.)

Effective Assembly coding


Branch prediction
Maximize usage of conditional instructions instead of branches
A 512-entry 2-way set associative Branch Target Buffer (BTB)
A 4096-entry Global History Buffer (GHB)
An 8-entry return stack

Pipeline model- Instruction cycle timing


Fetch, decode, execute >> 13 stages
Load/Store, MAC, ALU

Neon pipeline >> 10 stages
Removing interlocks/stalls

Maximize usage of SIMD/Neon instructions
Maximize dual issue

Effective Assembly coding


how to read ARM instruction tables
    ADDEQ R0, R1, R2, LSL #10

Effective Assembly coding


Interlock e.g. (refer to the table on the next slide)
    SMLAL R0, R1, R2, R3
    ADD R7, R8, R0         ; >> four cycles wasted waiting for R0

Alternate approach
    SMLAL R0, R1, R2, R3
    MOV r4, #0x6
    ADD r5, r4, r5
    MOV r6, #0x6
    LDR r5, [r6, #0x2C]
    ADD R7, R8, R0

Effective Assembly coding


Dual Issue
Two basic pipelines -> Pipeline 0 and Pipeline 1
LS pipeline, Multiply pipeline, ALU pipeline
The Multiply pipeline always goes in Pipeline 0
The first instruction always issues in Pipeline 0 and the second instruction, if present, issues in Pipeline 1
Instructions with the same destination cannot be issued in the same cycle
Refer to the next slide for more examples

Dual issue (contd..)

General ARM optimization Techniques


Loop unrolling

Use fixed-point arithmetic
Use shifts instead of multiplies and divisions
See if complex calculations can be avoided using table lookup
Minimize the number of arguments of a function
Avoid branches in low-level functions
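Three of the techniques above (shifts instead of multiplies, fixed-point arithmetic and loop unrolling) can be sketched in C as follows. All function names are hypothetical, the Q15 format is chosen only as an example, and the fixed-point helper assumes the compiler implements right shift of a negative value as an arithmetic shift, which is implementation-defined in C.

```c
#include <stdint.h>
#include <stddef.h>

/* Shifts instead of multiply: x * 10 = x * 8 + x * 2. */
uint32_t times10(uint32_t x)
{
    return (x << 3) + (x << 1);
}

/* Q15 fixed point: the real value is raw / 32768. */
typedef int16_t q15_t;

/* Multiply two Q15 numbers. The 32-bit product is in Q30, so shifting
   right by 15 brings it back to Q15. The caller must avoid -1.0 * -1.0,
   which does not fit in Q15. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (q15_t)(product >> 15);
}

/* Loop unrolled by four: fewer loop-control instructions per element.
   A second loop handles the remaining 0 to 3 elements. */
uint32_t sum_unrolled4(const uint32_t *v, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    }
    for (; i < n; i++) {
        sum += v[i];
    }
    return sum;
}
```

A modern compiler may already perform some of these transformations itself, so it is worth checking the generated assembly before rewriting code by hand.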

Assembly functions/files e.g.


The first four arguments go in r0, r1, r2, r3
e.g. of an assembly function

General/Neon optimization Techniques


Code vectorization in C itself
Use word arrays instead of halfword or byte arrays
Cache-friendly coding
Put code belonging to the same module in the same code section
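As a minimal sketch of what "vectorization in C itself" can mean, the loop below is written so that a vectorizing compiler has an easy job: word-sized elements, a simple counted loop, and `restrict` pointers that promise the arrays do not overlap. The function name `add_arrays_u32` is hypothetical, and whether the loop is actually turned into NEON instructions is an assumption that depends on the compiler and its options.

```c
#include <stdint.h>
#include <stddef.h>

/* Element-wise add of two word arrays.
   restrict tells the compiler that dst, a and b never alias, so it is
   free to load and add several elements at a time. */
void add_arrays_u32(uint32_t *restrict dst,
                    const uint32_t *restrict a,
                    const uint32_t *restrict b,
                    size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Four 32-bit elements fit in one 128-bit quadword register, so a vectorized version of this loop could in principle handle four elements per iteration.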

Code Vectorization

Code Warrior Demo/Hands on
