Beruflich Dokumente
Kultur Dokumente
Advanced Topics
Lecture 25: 1
Announcements
• ECE Day tomorrow: Duffield atrium
– Photo booth challenge
– Mini golf
– Student vs. faculty JEOPARDY
– Grab & Go lunch (Dinosaur BBQ)
– Professor dunk tank
– Many other activities
Lecture 25: 2
Final Exam
• When: Tuesday May 22, 2:00-3:40PM (100mins)
Lecture 25: 3
Final Exam
• Review session
– Friday May 18, 1:00-3:00pm, Location TBD
Lecture 25: 4
Performance Tradeoffs
Lecture 25: 5
Discussion: Performance Tradeoff
Example 1
LW R2,0(R1)
ADD 8(R1),0(R1),4(R1) vs. LW R3,4(R1)
ADD R2,R2,R3
SW R2,8(R1)
Lecture 25: 6
Discussion: Performance Tradeoff
Example 2
Lecture 25: 7
Parallelism: Making our Processor Fast
• Processor architects improve performance
through hardware that exploits the different
types of parallelism within computer programs
Lecture 25: 8
Instruction-Level Parallelism (ILP)
Lecture 25: 9
Two-Way Superscalar Pipeline
IF ID EX MEM WB
A
IM Reg L Reg
DM
U
A
L
U
Lecture 25: 10
Instruction Sequence on 2W SS
A
IM Reg L DM Reg
ADD R1,R2,R3 U
OR R4,R4,R3
SUB R5,R2,R3
AND R6,R6,R2
ADDI R7,R7,3 A
LW R2,R3,0
L
U
Lecture 25: 11
Instruction Sequence on 2W SS
A
IM Reg L DM Reg
ADD R1,R2,R3 U
OR R4,R4,R3
SUB R5,R2,R3
AND R6,R6,R2
ADDI R7,R7,3 A
LW R2,R3,0
L
U
ADD R1,R2,R3
OR R4,R4,R3
Lecture 25: 12
Instruction Sequence on 2W SS
A
IM Reg L DM Reg
ADD R1,R2,R3 U
OR R4,R4,R3
SUB R5,R2,R3
AND R6,R6,R2
ADDI R7,R7,3 A
LW R2,R3,0
L
U
Lecture 25: 13
Instruction Sequence on 2W SS
A
IM Reg L DM Reg
ADD R1,R2,R3 U
OR R4,R4,R3
SUB R5,R2,R3
AND R6,R6,R2
ADDI R7,R7,3 A
LW R2,R3,0
L
U
Lecture 25: 14
ARM Cortex-A8 Microprocessor
Apple iPhone 4, iPod Touch (3rd & 4th gen), iPad; Motorola Droid, Droid X, Droid 2; Palm Pre, Pre
2; Samsung Omnia HD, Wave 8500, i9000 Galaxy S, P1000 Galaxy Tab; HTC Desire; Google Nexus
S; Nokia N900; Sony Ericsson Satio, Xperia X10, etc, etc
Lecture 25: 15
Cortex-A8 Processor Pipeline
• 2-way superscalar
• L2 cache
– Up to 1MB, 8-way set associative, 64 byte block size,
random replacement
Lecture 25: 17
Data Dependences Limit SS Execution
• Consider this program sequence
ADD R1,R2,R3
OR R4,R1,R3
SUB R5,R2,R3
AND R6,R6,R5
• The ADD and the OR, and the SUB and AND,
cannot execute at the same time
– Cortex-A8 limits dual issue in this case
Lecture 25: 18
Out-of-Order Execution
• Processor can execute instructions out of the
original program order
ADD R1,R2,R3 ADD R1,R2,R3
OR R4,R1,R3 SUB R5,R2,R3
SUB R5,R2,R3 OR R4,R1,R3
AND R6,R6,R5 AND R6,R6,R5
Reg
File
Lecture 25: 19
ARM Cortex-A9
• Successor to the Cortex-A8
• Superscalar pipeline with out-of-order execution
– Issues up to 4 instruction each clock cycle
• ITLB and DTLB + L2 TLB
Apple iPhone 4S, iPad2; Motorola Droid Bionic, Altrix 4G, Xoom; Blackberry Playbook;
Samsung Galaxy S II, Galaxy S III; HTC Sensation, EVO 3D; LG Optimus 2X, Optimus
3D; Lenovo IdeaPad K2, ASUS Eee Pad Transformer; Acer ICONIA TAB A-series, etc, etc
Lecture 25: 20
Samsung Exynos 4210 System-on-Chip
Reg A
Reg
L
U
PC
PC IF/ID ID/EX EX/MEM MEM/WB
Lecture 25: 25
SIMD Instructions
• Single Instruction, Multiple Data
Lecture 25: 28
Ongoing Trends in Computer Systems
Lecture 25: 29
Microprocessor Scaling
7
10 Intel 48-Core Transistors
Prototype (Thousands)
6
10 AMD 4-Core Parallel Proc
Opteron Performance
5
10 Intel Sequential
Pentium 4 Processor
4 Performance
10 DEC Alpha
21264 Frequency
3 (MHz)
10 MIPS R2K
2 Typical Power
10 (Watts)
1 Number
10 of Cores
0
10
Lecture 25: 30
Multicore to the Rescue?
• Multicore: Multiprocessor on a single chip
– Performance improvement by splitting task among
multiple CPUs (thread-level parallelism)
Lecture 25: 31
Modern Computers are Power Constrained
Performance (Ops/Second)
Lecture 25: 32
Hardware Specialization in Mobile Chips
Lecture 25: 34
Hardware Specialization in Datacenter
• ASIC/FPGA-based accelerators are deployed for a rich set
of compute-intensive applications in cloud datacenters
EC2 F1
Amazon Instance
Machine
Image (AMI)
Launch Instance
and Load AFI
DDR
DDR-4
Controllers DDR-4
CPU PCIe Attached
DDR-4
Attached
Memory DDR-4
DDR-4
Application Attached
Memory
Attached
Attached
on F1 Memory
Memory
Memory
FPGA Link
Lecture 25: 35
Google TPU
Flexibility vs. Efficiency Trade-offs
ASICs
FPGAs
• CS 4120 (Compilers)
Lecture 25: 37
Next Time
Final Exam
Lecture 25: 38