Beruflich Dokumente
Kultur Dokumente
Krste Asanovic
(krste@mit.edu)
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Krste
CS 252 Feb. 27, 2006
Lecture 12, Slide 2
Supercomputers
Definition of a supercomputer:
• Fastest machine in world at given task
• A device to turn a compute-bound problem into an
I/O bound problem
• Any machine costing $30M+
• Any machine designed by Seymour Cray
Supercomputer Applications
Vector Supercomputers
Epitomized by Cray-1, 1976:
V0 Vi V. Mask
V1
V2 Vj
64 Element V3 V. Length
Vector Registers V4 Vk
Single Port V5
V6
Memory V7
FP Add
S0 Sj FP Mul
16 banks of ( (Ah) + j k m ) S1
S2 Sk FP Recip
64-bit words Si S3
(A0) 64 S4 Si Int Add
+ Tjk S5
T Regs S6
8-bit SECDED S7
Int Logic
Int Shift
A0
80MW/sec data ( (Ah) + j k m ) A1 Pop Cnt
A2
load/store Ai A3
Aj
(A0) 64 A4 Ak Addr Add
Bjk A5
Ai
320MW/sec B Regs A6 Addr Mul
A7
instruction
buffer refill NIP CIP
64-bitx16
LIP
4 Instruction Buffers
r0 v0
[0] [1] [2] [VLRMAX-1]
Vector Length Register VLR
v1
Vector Arithmetic v2
Instructions + + + + + +
ADDV v3, v1, v2 v3
[0] [1] [VLR-1]
Memory
Base, r1 Stride, r2
Krste
CS 252 Feb. 27, 2006
Lecture 12, Slide 8
• Compact
– one short instruction encodes N operations
• Expressive, tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
• Scalable
– can run same object code on more parallel pipelines or lanes
Krste
CS 252 Feb. 27, 2006
V3 <- v1 * v2
Krste
CS 252 Feb. 27, 2006
Lecture 12, Slide 11
Base Stride
Vector Registers
Address
Generator +
0 1 2 3 4 5 6 7 8 9 A B C D E F
Memory Banks
Krste
CS 252 Feb. 27, 2006
ADDV C,A,B
A[6] B[6] A[24] B[24] A[25] B[25] A[26] B[26] A[27] B[27]
A[5] B[5] A[20] B[20] A[21] B[21] A[22] B[22] A[23] B[23]
A[4] B[4] A[16] B[16] A[17] B[17] A[18] B[18] A[19] B[19]
A[3] B[3] A[12] B[12] A[13] B[13] A[14] B[14] A[15] B[15]
Vector
Registers
Elements Elements Elements Elements
0, 4, 8, … 1, 5, 9, … 2, 6, 10, … 3, 7, 11, …
Lane
Memory Subsystem
Krste
CS 252 Feb. 27, 2006
load
Iter. Iter.
load 1 2 Vector Instruction
Iter. 2
add
Vectorization is a massive compile-time
reordering of operation sequencing
store requires extensive loop dependence
analysis
Krste
Instruction
issue
V V V V V
LV v1
1 2 3 4 5
MULV v3,v1,v2
ADDV v5, v3, v4
Chain Chain
Load
Unit
Mult. Add
Memory
Krste
CS 252 Feb. 27, 2006
Lecture 12, Slide 21
Load
Mul
Time Add
Load
Mul
Add
Krste
R X X X W
R X X X W
R X X X W
Dead Time
R X X X W
R X X X W
R X X X W
Dead Time R X X X W Second Vector Instruction
R X X X W
Krste
CS 252 Feb. 27, 2006
Lecture 12, Slide 23
No dead time
64 cycles active
Vector Scatter/Gather
Vector Scatter/Gather
Scatter example:
for (i=0; i<N; i++)
A[B[i]]++;
M[0]=0 C[0]
Write Enable Write data port
Krste
Compress Expand
A Modern Vector Super: NEC SX-6 (2003) CS 252 Feb. 27, 2006
Lecture 12, Slide 30
• CMOS Technology
– 500 MHz CPU, fits on single chip
– SDRAM main memory (up to 64GB)
• Scalar unit
– 4-way superscalar with out-of-order and speculative
execution
– 64KB I-cache and 64KB data cache
• Vector unit
– 8 foreground VRegs + 64 background VRegs (256x64-bit
elements/VReg)
– 1 multiply unit, 1 divide unit, 1 add/shift unit, 1 logical unit,
1 mask unit
– 8 lanes (8 GFLOPS peak, 16 FLOPS/cycle)
– 1 load & store unit (32x8 byte accesses/cycle)
– 32 GB/s memory bandwidth per processor
• SMP structure
– 8 CPUs connected to memory through crossbar
– 256 GB/s shared memory bandwidth (4096 interleaved
banks)
Krste
CS 252 Feb. 27, 2006
Lecture 12, Slide 31
Multimedia Extensions