
Chapter 7

Multicores, Multiprocessors, and Clusters

7.1 Introduction

Introduction

Goal: connecting multiple computers to get higher performance
  Multiprocessors
  Scalability, availability, power efficiency

Job-level (process-level) parallelism
  High throughput for independent jobs

Parallel processing program
  Single program run on multiple processors

Multicore microprocessors
  Chips with multiple processors (cores)

Chapter 7 Multicores, Multiprocessors, and Clusters 2

Hardware and Software

Hardware
  Serial: e.g., Pentium 4
  Parallel: e.g., quad-core Xeon e5345

Software
  Sequential: e.g., matrix multiplication
  Concurrent: e.g., operating system

Sequential/concurrent software can run on serial/parallel hardware

Challenge: making effective use of parallel hardware


Chapter 7 Multicores, Multiprocessors, and Clusters 3

What We've Already Covered

2.11: Parallelism and Instructions
  Synchronization

3.6: Parallelism and Computer Arithmetic
  Associativity

4.10: Parallelism and Advanced Instruction-Level Parallelism

5.8: Parallelism and Memory Hierarchies
  Cache Coherence

6.9: Parallelism and I/O
  Redundant Arrays of Inexpensive Disks

Chapter 7 Multicores, Multiprocessors, and Clusters 4

7.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming

Parallel software is the problem

Need to get significant performance improvement
  Otherwise, just use a faster uniprocessor, since it's easier!

Difficulties
  Partitioning
  Coordination
  Communications overhead

Chapter 7 Multicores, Multiprocessors, and Clusters 5

Amdahl's Law

Sequential part can limit speedup

Example: 100 processors, 90× speedup?

  Tnew = Tparallelizable/100 + Tsequential

  Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90

Solving: Fparallelizable = 0.999

Need sequential part to be 0.1% of original time
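As a quick numeric check of the algebra above, here is a minimal C sketch (not from the chapter; the function names are illustrative) that evaluates Amdahl's Law and the fraction required for a target speedup.

    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction f of the work is spread
       across p processors and the rest stays sequential. */
    double speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    /* Fraction that must be parallelizable to reach speedup s on p
       processors: solve s = 1 / ((1 - f) + f/p) for f. */
    double fraction_needed(double s, int p) {
        return (1.0 - 1.0 / s) / (1.0 - 1.0 / (double)p);
    }

    int main(void) {
        printf("F needed for 90x on 100 processors: %.4f\n",
               fraction_needed(90.0, 100));     /* ~0.9989, i.e. ~0.999 */
        printf("speedup with F = 0.999 on 100:     %.1f\n",
               speedup(0.999, 100));            /* ~91x */
        return 0;
    }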


Chapter 7 Multicores, Multiprocessors, and Clusters 6

Scaling Example

Workload: sum of 10 scalars, and 10 × 10 matrix sum
  Speed up from 10 to 100 processors

Single processor: Time = (10 + 100) × tadd

10 processors
  Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  Speedup = 110/20 = 5.5 (55% of potential)

100 processors
  Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  Speedup = 110/11 = 10 (10% of potential)

Assumes load can be balanced across processors


Chapter 7 Multicores, Multiprocessors, and Clusters 7

Scaling Example (cont)


What if matrix size is 100 × 100?

Single processor: Time = (10 + 10000) × tadd

10 processors
  Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  Speedup = 10010/1010 = 9.9 (99% of potential)

100 processors
  Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  Speedup = 10010/110 = 91 (91% of potential)

Assuming load balanced

Chapter 7 Multicores, Multiprocessors, and Clusters 8

Strong vs Weak Scaling

Strong scaling: problem size fixed
  As in example

Weak scaling: problem size proportional to number of processors
  10 processors, 10 × 10 matrix
    Time = 20 × tadd
  100 processors, 32 × 32 matrix
    Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  Constant performance in this example
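A small C check of the arithmetic on the last three slides, with times in units of tadd (a sketch only, not from the chapter; the helper name is made up).

    #include <stdio.h>

    /* Time on p processors: the 10 scalar adds stay sequential,
       the matrix sum is divided evenly (load assumed balanced). */
    double par_time(double matrix_elems, int p) {
        return 10.0 + matrix_elems / p;
    }

    int main(void) {
        double t10x10 = 10.0 + 100.0;     /* single processor, 10 x 10 matrix   */
        double t100sq = 10.0 + 10000.0;   /* single processor, 100 x 100 matrix */

        printf("10x10:   %.1fx on 10, %.1fx on 100 procs\n",
               t10x10 / par_time(100, 10),   t10x10 / par_time(100, 100));   /* 5.5, 10 */
        printf("100x100: %.1fx on 10, %.1fx on 100 procs\n",
               t100sq / par_time(10000, 10), t100sq / par_time(10000, 100)); /* 9.9, 91 */

        /* Weak scaling: 10 procs on 10x10 and 100 procs on 32x32 both take ~20 tadd */
        printf("weak scaling: %.0f vs. %.1f tadd\n",
               par_time(100, 10), par_time(1024, 100));
        return 0;
    }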


Chapter 7 Multicores, Multiprocessors, and Clusters 9

7.3 Shared Memory Multiprocessors

Shared Memory

SMP: shared memory multiprocessor
  Hardware provides single physical address space for all processors
  Synchronize shared variables using locks
  Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Chapter 7 Multicores, Multiprocessors, and Clusters 10

Example: Sum Reduction

Sum 100,000 numbers on 100 processor UMA
  Each processor has ID: 0 ≤ Pn ≤ 99
  Partition 1000 numbers per processor
  Initial summation on each processor
      sum[Pn] = 0;
      for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
        sum[Pn] = sum[Pn] + A[i];

Now need to add these partial sums
  Reduction: divide and conquer
  Half the processors add pairs, then quarter,
  Need to synchronize between reduction steps

Chapter 7 Multicores, Multiprocessors, and Clusters 11

Example: Sum Reduction

half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
      /* Conditional sum needed when half is odd;
         Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half)
    sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
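To see how this loop collapses the 100 partial sums, here is a sequential C simulation of the same steps (a sketch only; on a real SMP the inner loop body would run on each processor in parallel between barriers, and the partial sums here are faked so the program is self-contained).

    #include <stdio.h>

    #define P 100   /* number of processors being simulated */

    int main(void) {
        double sum[P];
        for (int Pn = 0; Pn < P; Pn++)
            sum[Pn] = Pn + 1.0;            /* stand-in for each processor's partial sum */

        int half = P;
        do {
            /* synch(): all processors would reach this barrier before continuing */
            if (half % 2 != 0)
                sum[0] += sum[half - 1];   /* processor 0 picks up the odd element */
            half = half / 2;               /* dividing line on who sums */
            for (int Pn = 0; Pn < half; Pn++)
                sum[Pn] += sum[Pn + half]; /* each remaining processor adds one pair */
        } while (half != 1);

        printf("total = %.0f (expected %.0f)\n", sum[0], P * (P + 1) / 2.0);
        return 0;
    }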
Chapter 7 Multicores, Multiprocessors, and Clusters 12

7.4 Clusters and Other Message-Passing Multiprocessors

Message Passing

Each processor has private physical address space

Hardware sends/receives messages between processors

Chapter 7 Multicores, Multiprocessors, and Clusters 13

Loosely Coupled Clusters

Network of independent computers


Each has private memory and OS Connected using I/O system

E.g., Ethernet/switch, Internet

Suitable for applications with independent tasks

Web servers, databases, simulations,

High availability, scalable, affordable

Problems
  Administration cost (prefer virtual machines)
  Low interconnect bandwidth
    c.f. processor/memory bandwidth on an SMP


Chapter 7 Multicores, Multiprocessors, and Clusters 14

Sum Reduction (Again)


Sum 100,000 on 100 processors

First distribute 1000 numbers to each
  Then do partial sums
      sum = 0;
      for (i = 0; i<1000; i = i + 1)
        sum = sum + AN[i];

Reduction
  Half the processors send, other half receive and add
  The quarter send, quarter receive and add,

Chapter 7 Multicores, Multiprocessors, and Clusters 15

Sum Reduction (Again)

Given send() and receive() operations


limit = 100; half = 100; /* 100 processors */
repeat
  half = (half+1)/2;     /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum = sum + receive();
  limit = half;          /* upper limit of senders */
until (half == 1);       /* exit with final sum */

Send/receive also provide synchronization
Assumes send/receive take similar time to addition
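For comparison, the same send/receive reduction can be written with a real message-passing library. The C sketch below uses MPI, which is not covered in this chapter; the local partial sum is faked so the program is self-contained, and all other names follow the pseudocode above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int Pn, limit, half;
        double sum, partial;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
        MPI_Comm_size(MPI_COMM_WORLD, &limit);   /* e.g., 100 processes */

        sum = Pn + 1.0;                          /* stand-in for the local partial sum */

        half = limit;
        do {
            half = (half + 1) / 2;               /* send vs. receive dividing line */
            if (Pn >= half && Pn < limit)
                MPI_Send(&sum, 1, MPI_DOUBLE, Pn - half, 0, MPI_COMM_WORLD);
            if (Pn < limit / 2) {
                MPI_Recv(&partial, 1, MPI_DOUBLE, Pn + half, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum = sum + partial;             /* the receive also synchronizes */
            }
            limit = half;                        /* upper limit of senders */
        } while (half != 1);                     /* exit with final sum on process 0 */

        if (Pn == 0)
            printf("total = %.0f\n", sum);
        MPI_Finalize();
        return 0;
    }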
Chapter 7 Multicores, Multiprocessors, and Clusters 16

Grid Computing

Separate computers interconnected by long-haul networks


E.g., Internet connections

Work units farmed out, results sent back
  E.g., SETI@home, World Community Grid

Can make use of idle time on PCs

Chapter 7 Multicores, Multiprocessors, and Clusters 17

7.5 Hardware Multithreading

Multithreading

Performing multiple threads of execution in parallel
  Replicate registers, PC, etc.
  Fast switching between threads

Fine-grain multithreading
  Switch threads after each cycle
  Interleave instruction execution
  If one thread stalls, others are executed

Coarse-grain multithreading
  Only switch on long stall (e.g., L2-cache miss)
  Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Chapter 7 Multicores, Multiprocessors, and Clusters 18

Simultaneous Multithreading

In multiple-issue dynamically scheduled processor
  Schedule instructions from multiple threads
  Instructions from independent threads execute when function units are available
  Within threads, dependencies handled by scheduling and register renaming

Example: Intel Pentium-4 HT
  Two threads: duplicated registers, shared function units and caches

Chapter 7 Multicores, Multiprocessors, and Clusters 19

Multithreading Example

Chapter 7 Multicores, Multiprocessors, and Clusters 20

Future of Multithreading

Will it survive? In what form?

Power considerations ⇒ simplified microarchitectures
  Simpler forms of multithreading

Tolerating cache-miss latency
  Thread switch may be most effective

Multiple simple cores might share resources more effectively

Chapter 7 Multicores, Multiprocessors, and Clusters 21

7.6 SISD, MIMD, SIMD, SPMD, and Vector

Instruction and Data Streams

An alternate classification

                                Data Streams:
                                Single                    Multiple
  Instruction      Single       SISD: Intel Pentium 4     SIMD: SSE instructions of x86
  Streams:         Multiple     MISD: No examples today   MIMD: Intel Xeon e5345

SPMD: Single Program Multiple Data
  A parallel program on a MIMD computer
  Conditional code for different processors


Chapter 7 Multicores, Multiprocessors, and Clusters 22

SIMD

Operate elementwise on vectors of data
  E.g., MMX and SSE instructions in x86
    Multiple data elements in 128-bit wide registers

All processors execute the same instruction at the same time
  Each with different data address, etc.

Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
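To make "operate elementwise" concrete, here is a minimal C example using the x86 SSE intrinsics that correspond to these instructions (the intrinsics themselves are not part of the chapter).

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE: 128-bit registers holding 4 floats */

    int main(void) {
        float x[4] = {1, 2, 3, 4};
        float y[4] = {10, 20, 30, 40};
        float z[4];

        __m128 vx = _mm_loadu_ps(x);     /* load 4 floats into one register */
        __m128 vy = _mm_loadu_ps(y);
        __m128 vz = _mm_add_ps(vx, vy);  /* one instruction adds all 4 lanes */
        _mm_storeu_ps(z, vz);

        printf("%.0f %.0f %.0f %.0f\n", z[0], z[1], z[2], z[3]);  /* 11 22 33 44 */
        return 0;
    }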
Chapter 7 Multicores, Multiprocessors, and Clusters 23

Vector Processors

Highly pipelined function units

Stream data from/to vector registers to units
  Data collected from memory into registers
  Results stored from registers to memory

Example: Vector extension to MIPS
  32 × 64-element registers (64-bit elements)
  Vector instructions
    lv, sv: load/store vector
    addv.d: add vectors of double
    addvs.d: add scalar to each element of vector of double

Significantly reduces instruction-fetch bandwidth


Chapter 7 Multicores, Multiprocessors, and Clusters 24

Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:
        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a × x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

Vector MIPS code:
        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result
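For reference, both sequences compute the same thing; a plain-C version (not from the chapter) is just:

    /* DAXPY: Y = a*X + Y over 64-element double vectors. */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

The scalar MIPS loop executes roughly 9 instructions per element, nearly 600 in all, while the vector version needs only the 6 instructions shown, which is the instruction-fetch saving noted on the previous slide.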

Chapter 7 Multicores, Multiprocessors, and Clusters 25

Vector vs. Scalar

Vector architectures and compilers
  Simplify data-parallel programming
  Explicit statement of absence of loop-carried dependences
    Reduced checking in hardware
  Regular access patterns benefit from interleaved and burst memory
  Avoid control hazards by avoiding loops

More general than ad-hoc media extensions (such as MMX, SSE)
  Better match with compiler technology


Chapter 7 Multicores, Multiprocessors, and Clusters 26

7.7 Introduction to Graphics Processing Units

History of GPUs

Early video cards
  Frame buffer memory with address generation for video output

3D graphics processing
  Originally high-end computers (e.g., SGI)
  Moore's Law ⇒ lower cost, higher density
  3D graphics cards for PCs and game consoles

Graphics Processing Units
  Processors oriented to 3D graphics tasks
  Vertex/pixel processing, shading, texture mapping, rasterization


Chapter 7 Multicores, Multiprocessors, and Clusters 27

Graphics in the System

Chapter 7 Multicores, Multiprocessors, and Clusters 28

GPU Architectures

Processing is highly data-parallel
  GPUs are highly multithreaded
  Use thread switching to hide memory latency
    Less reliance on multi-level caches
  Graphics memory is wide and high-bandwidth

Trend toward general purpose GPUs
  Heterogeneous CPU/GPU systems
  CPU for sequential code, GPU for parallel code

Programming languages/APIs
  DirectX, OpenGL
  C for Graphics (Cg), High Level Shader Language (HLSL)
  Compute Unified Device Architecture (CUDA)

Chapter 7 Multicores, Multiprocessors, and Clusters 29

Example: NVIDIA Tesla


Streaming multiprocessor
  8 × Streaming processors
Chapter 7 Multicores, Multiprocessors, and Clusters 30

Example: NVIDIA Tesla

Streaming Processors
  Single-precision FP and integer units
  Each SP is fine-grained multithreaded

Warp: group of 32 threads
  Executed in parallel, SIMD style
    8 SPs × 4 clock cycles
  Hardware contexts for 24 warps
    Registers, PCs,
Chapter 7 Multicores, Multiprocessors, and Clusters 31

Classifying GPUs

Don't fit nicely into SIMD/MIMD model
  Conditional execution in a thread allows an illusion of MIMD
    But with performance degradation
    Need to write general purpose code with care

                                  Static: Discovered     Dynamic: Discovered
                                  at Compile Time        at Runtime
  Instruction-Level Parallelism   VLIW                   Superscalar
  Data-Level Parallelism          SIMD or Vector         Tesla Multiprocessor

Chapter 7 Multicores, Multiprocessors, and Clusters 32

7.8 Introduction to Multiprocessor Network Topologies

Interconnection Networks

Network topologies

Arrangements of processors, switches, and links

Bus

Ring

2D Mesh

N-cube (N = 3)

Fully connected

Chapter 7 Multicores, Multiprocessors, and Clusters 33

Multistage Networks

Chapter 7 Multicores, Multiprocessors, and Clusters 34

Network Characteristics

Performance

  Latency per message (unloaded network)
  Throughput
    Link bandwidth
    Total network bandwidth
    Bisection bandwidth
  Congestion delays (depending on traffic)

Cost
Power
Routability in silicon
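As a concrete illustration of total versus bisection bandwidth, the C sketch below evaluates the standard formulas for a ring and a fully connected network in units of one link's bandwidth (the value of P and the program itself are illustrative, not from the chapter).

    #include <stdio.h>

    int main(void) {
        int P = 64;                             /* number of nodes */

        /* Ring: P links in total; cutting it in half severs 2 links. */
        int ring_total = P, ring_bisection = 2;

        /* Fully connected: P*(P-1)/2 links; a bisection cuts (P/2)*(P/2) links. */
        int full_total = P * (P - 1) / 2;
        int full_bisection = (P / 2) * (P / 2);

        printf("ring:            total = %d, bisection = %d\n", ring_total, ring_bisection);
        printf("fully connected: total = %d, bisection = %d\n", full_total, full_bisection);
        return 0;
    }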


Chapter 7 Multicores, Multiprocessors, and Clusters 35

7.9 Multiprocessor Benchmarks

Parallel Benchmarks

Linpack: matrix linear algebra

SPECrate: parallel run of SPEC CPU programs
  Job-level parallelism

SPLASH: Stanford Parallel Applications for Shared Memory
  Mix of kernels and applications, strong scaling

NAS (NASA Advanced Supercomputing) suite
  Computational fluid dynamics kernels

PARSEC (Princeton Application Repository for Shared Memory Computers) suite
  Multithreaded applications using Pthreads and OpenMP


Chapter 7 Multicores, Multiprocessors, and Clusters 36

Code or Applications?

Traditional benchmarks
  Fixed code and data sets

Parallel programming is evolving
  Should algorithms, programming languages, and tools be part of the system?
  Compare systems, provided they implement a given application
    E.g., Linpack, Berkeley Design Patterns
  Would foster innovation in approaches to parallelism


Chapter 7 Multicores, Multiprocessors, and Clusters 37

7.10 Roofline: A Simple Performance Model

Modeling Performance

Assume performance metric of interest is achievable GFLOPs/sec
  Measured using computational kernels from Berkeley Design Patterns

Arithmetic intensity of a kernel
  FLOPs per byte of memory accessed

For a given computer, determine
  Peak GFLOPS (from data sheet)
  Peak memory bytes/sec (using Stream benchmark)


Chapter 7 Multicores, Multiprocessors, and Clusters 38

Roofline Diagram

Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak FP Performance)
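The roofline bound is easy to express in code. A minimal C sketch (function and parameter names are illustrative, not from the chapter):

    /* Attainable GFLOPs/sec is the lower of the compute roof and the
       memory roof at a given arithmetic intensity. */
    double roofline(double peak_gflops,          /* from the data sheet       */
                    double peak_gbytes_per_s,    /* from the Stream benchmark */
                    double arithmetic_intensity  /* FLOPs per byte accessed   */) {
        double memory_roof = peak_gbytes_per_s * arithmetic_intensity;
        return memory_roof < peak_gflops ? memory_roof : peak_gflops;
    }

For example, a kernel with 0.25 FLOPs/byte on a hypothetical machine with a 75 GFLOPs peak and 10 GB/sec of memory bandwidth is memory bound at 10 × 0.25 = 2.5 GFLOPs/sec.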

Chapter 7 Multicores, Multiprocessors, and Clusters 39

Comparing Systems

Example: Opteron X2 vs. Opteron X4

2-core vs. 4-core, 2× FP performance/core, 2.2 GHz vs. 2.3 GHz
Same memory system

To get higher performance on X4 than X2
  Need high arithmetic intensity
  Or working set must fit in X4's 2MB L3 cache

Chapter 7 Multicores, Multiprocessors, and Clusters 40

Optimizing Performance

Optimize FP performance
  Balance adds & multiplies
  Improve superscalar ILP and use of SIMD instructions

Optimize memory usage
  Software prefetch
    Avoid load stalls
  Memory affinity
    Avoid non-local data accesses

Chapter 7 Multicores, Multiprocessors, and Clusters 41

Optimizing Performance

Choice of optimization depends on arithmetic intensity of code

Arithmetic intensity is not always fixed
  May scale with problem size
  Caching reduces memory accesses
    Increases arithmetic intensity

Chapter 7 Multicores, Multiprocessors, and Clusters 42

7.11 Real Stuff: Benchmarking Four Multicores

Four Example Systems


2 × quad-core Intel Xeon e5345 (Clovertown)

2 × quad-core AMD Opteron X4 2356 (Barcelona)

Chapter 7 Multicores, Multiprocessors, and Clusters 43

Four Example Systems


2 × oct-core Sun UltraSPARC T2 5140 (Niagara 2)

2 × oct-core IBM Cell QS20

Chapter 7 Multicores, Multiprocessors, and Clusters 44

And Their Rooflines

Kernels
  SpMV (left)
  LBMHD (right)

Some optimizations change arithmetic intensity

x86 systems have higher peak GFLOPs
  But harder to achieve, given memory bandwidth

Chapter 7 Multicores, Multiprocessors, and Clusters 45

Performance on SpMV

Sparse matrix/vector multiply
  Irregular memory accesses, memory bound

Arithmetic intensity
  0.166 before memory optimization, 0.25 after

Xeon vs. Opteron
  Similar peak FLOPS
  Xeon limited by shared FSBs and chipset

UltraSPARC/Cell vs. x86
  20-30 vs. 75 peak GFLOPs
  More cores and memory bandwidth


Chapter 7 Multicores, Multiprocessors, and Clusters 46

Performance on LBMHD

Fluid dynamics: structured grid over time steps
  Each point: 75 FP read/write, 1300 FP ops

Arithmetic intensity
  0.70 before optimization, 1.07 after

Opteron vs. UltraSPARC
  More powerful cores, not limited by memory bandwidth

Xeon vs. others
  Still suffers from memory bottlenecks

Chapter 7 Multicores, Multiprocessors, and Clusters 47

Achieving Performance

Compare naïve vs. optimized code
  If naïve code performs well, it's easier to write high performance code for the system

  System             Kernel   Naïve         Optimized    Naïve as %
                              GFLOPs/sec    GFLOPs/sec   of optimized
  Intel Xeon         SpMV     1.0           1.5          64%
                     LBMHD    4.6           5.6          82%
  AMD Opteron X4     SpMV     1.4           3.6          38%
                     LBMHD    7.1           14.1         50%
  Sun UltraSPARC T2  SpMV     3.5           4.1          86%
                     LBMHD    9.7           10.5         93%
  IBM Cell QS20      SpMV     not feasible  6.4          0%
                     LBMHD    not feasible  16.7         0%

Chapter 7 Multicores, Multiprocessors, and Clusters 48

7.12 Fallacies and Pitfalls

Fallacies

Amdahl's Law doesn't apply to parallel computers
  Since we can achieve linear speedup
  But only on applications with weak scaling

Peak performance tracks observed performance
  Marketers like this approach!
  But compare Xeon with others in example
  Need to be aware of bottlenecks
Chapter 7 Multicores, Multiprocessors, and Clusters 49

Pitfalls

Not developing the software to take account of a multiprocessor architecture

Example: using a single lock for a shared composite resource

Serializes accesses, even if they could be done in parallel
Use finer-granularity locking
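A hedged illustration of the pitfall in C with Pthreads (the hash-table setting and all names are invented for this example): one lock over the whole structure serializes every update, while a lock per bucket lets updates to different buckets proceed in parallel.

    #include <pthread.h>

    #define NBUCKETS 256
    static int counts[NBUCKETS];

    /* Coarse grained: every update contends for the same lock. */
    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
    void increment_coarse(int bucket) {
        pthread_mutex_lock(&table_lock);
        counts[bucket]++;
        pthread_mutex_unlock(&table_lock);
    }

    /* Finer granularity: updates to different buckets do not serialize. */
    static pthread_mutex_t bucket_lock[NBUCKETS];
    void init_bucket_locks(void) {
        for (int i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&bucket_lock[i], NULL);
    }
    void increment_fine(int bucket) {
        pthread_mutex_lock(&bucket_lock[bucket]);
        counts[bucket]++;
        pthread_mutex_unlock(&bucket_lock[bucket]);
    }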

Chapter 7 Multicores, Multiprocessors, and Clusters 50

7.13 Concluding Remarks

Concluding Remarks

Goal: higher performance by using multiple processors

Difficulties
  Developing parallel software
  Devising appropriate architectures

Many reasons for optimism
  Changing software and application environment
  Chip-level multiprocessors with lower latency, higher bandwidth interconnect

An ongoing challenge for computer architects!


Chapter 7 Multicores, Multiprocessors, and Clusters 51
