
Morgan Kaufmann Publishers

4 December 2015

COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface

5th Edition

Chapter 6
Parallel Processors from
Client to Cloud

6.1 Introduction

- Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
- Parallel processing program
  - Single program run on multiple processors
- Multicore microprocessors
  - Chips with multiple processors (cores)

Hardware and Software

- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
- Challenge: making effective use of parallel hardware

What We've Already Covered

- 2.11: Parallelism and Instructions
  - Synchronization
- 3.6: Parallelism and Computer Arithmetic
  - Subword Parallelism
- 4.10: Parallelism and Advanced Instruction-Level Parallelism
- 5.10: Parallelism and Memory Hierarchies
  - Cache Coherence

6.2 The Difficulty of Creating Parallel Processing Programs

Parallel Programming

- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law

- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?

  Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90

- Solving: Fparallelizable = 0.999
- Need sequential part to be 0.1% of original time
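The example above can be checked numerically with a short C sketch (the function name `amdahl_speedup` is illustrative, not from the text):

```c
#include <assert.h>
#include <math.h>

/* Amdahl's Law: Speedup = 1 / ((1 - f) + f/p), where f is the
 * parallelizable fraction of the original execution time and
 * p is the number of processors. */
double amdahl_speedup(double f, int p) {
    return 1.0 / ((1.0 - f) + f / p);
}
```

With f = 0.999 and p = 100 this evaluates to roughly 91, confirming that the sequential part must shrink to about 0.1% of the original time before 100 processors can deliver a 90× speedup.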

Scaling Example

- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × tadd
- 10 processors
  - Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

Scaling Example (cont)

- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × tadd
- 10 processors
  - Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced
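Both scaling examples use the same simple time model: 10 sequential scalar additions plus the matrix additions divided across the processors. A C sketch (names are illustrative) reproduces the arithmetic:

```c
#include <assert.h>
#include <math.h>

/* Time in units of t_add for the scaling examples: the 10 scalar
 * additions run sequentially; the matrix-sum additions divide
 * evenly across p processors. */
double example_time(int matrix_adds, int p) {
    return 10.0 + (double)matrix_adds / p;
}

/* Speedup relative to a single processor. */
double example_speedup(int matrix_adds, int p) {
    return example_time(matrix_adds, 1) / example_time(matrix_adds, p);
}
```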

Strong vs Weak Scaling

- Strong scaling: problem size fixed
  - As in example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix
    - Time = 20 × tadd
  - 100 processors, 32 × 32 matrix
    - Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  - Constant performance in this example

6.3 SISD, MIMD, SIMD, SPMD, and Vector

Instruction and Data Streams

- An alternate classification:

                                   Data Streams
                                   Single               Multiple
  Instruction        Single        SISD:                SIMD:
  Streams                          Intel Pentium 4      SSE instructions of x86
                     Multiple      MISD:                MIMD:
                                   No examples today    Intel Xeon e5345

- SPMD: Single Program Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors

Example: DAXPY (Y = a × X + Y)

- Conventional MIPS code:

        l.d    $f0,a($sp)      ;load scalar a
        addiu  r4,$s0,#512     ;upper bound of what to load
  loop: l.d    $f2,0($s0)      ;load x(i)
        mul.d  $f2,$f2,$f0     ;a × x(i)
        l.d    $f4,0($s1)      ;load y(i)
        add.d  $f4,$f4,$f2     ;a × x(i) + y(i)
        s.d    $f4,0($s1)      ;store into y(i)
        addiu  $s0,$s0,#8      ;increment index to x
        addiu  $s1,$s1,#8      ;increment index to y
        subu   $t0,r4,$s0      ;compute bound
        bne    $t0,$zero,loop  ;check if done

- Vector MIPS code:

        l.d     $f0,a($sp)     ;load scalar a
        lv      $v1,0($s0)     ;load vector x
        mulvs.d $v2,$v1,$f0    ;vector-scalar multiply
        lv      $v3,0($s1)     ;load vector y
        addv.d  $v4,$v2,$v3    ;add y to product
        sv      $v4,0($s1)     ;store the result
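Both MIPS sequences implement the same loop, shown here in C (a sketch; the vector version above processes 64 elements per pass):

```c
#include <assert.h>

/* DAXPY: Y = a*X + Y over n double-precision elements -- the
 * computation performed by both MIPS code sequences above. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```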

Vector Processors

- Highly pipelined function units
- Stream data from/to vector registers to units
  - Data collected from memory into registers
  - Results stored from registers to memory
- Example: Vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
    - lv, sv: load/store vector
    - addv.d: add vectors of double
    - addvs.d: add scalar to each element of vector of double
- Significantly reduces instruction-fetch bandwidth

Vector vs. Scalar

- Vector architectures and compilers
  - Simplify data-parallel programming
  - Explicit statement of absence of loop-carried dependences
    - Reduced checking by hardware
  - Regular access patterns benefit from interleaved and burst memory
  - Avoid control hazards by avoiding loops
- More general than ad-hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology

SIMD

- Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
    - Multiple data elements in 128-bit wide registers
- All processors execute the same instruction at the same time
  - Each with different data address, etc.
- Simplifies synchronization
- Reduced instruction control hardware
- Works best for highly data-parallel applications

Vector vs. Multimedia Extensions

- Vector instructions have a variable vector width, multimedia extensions have a fixed width
- Vector instructions support strided access, multimedia extensions do not
- Vector units can be a combination of pipelined and arrayed functional units

6.4 Hardware Multithreading

Multithreading

- Performing multiple threads of execution in parallel
  - Replicate registers, PC, etc.
  - Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)

Simultaneous Multithreading

- In a multiple-issue dynamically scheduled processor
  - Schedule instructions from multiple threads
  - Instructions from independent threads execute when function units are available
  - Within threads, dependencies handled by scheduling and register renaming
- Example: Intel Pentium-4 HT
  - Two threads: duplicated registers, shared function units and caches

Future of Multithreading

- Will it survive? In what form?
- Power considerations ⇒ simplified microarchitectures
  - Simpler forms of multithreading
- Tolerating cache-miss latency
  - Thread switch may be most effective
- Multiple simple cores might share resources more effectively

6.5 Multicore and Other Shared Memory Multiprocessors

Shared Memory

- SMP: shared memory multiprocessor
  - Hardware provides single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Example: Sum Reduction

- Sum 100,000 numbers on a 100 processor UMA
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition 1000 numbers per processor
  - Initial summation on each processor:

    sum[Pn] = 0;
    for (i = 1000*Pn;
         i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then quarter, ...
  - Need to synchronize between reduction steps

Example: Sum Reduction

half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
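The reduction loop can be checked by simulating all processors in lockstep in ordinary C (a sketch; `simulate_reduction` is an illustrative name, and the barrier `synch()` is implied by the loop structure rather than called):

```c
#include <assert.h>

/* Sequentially simulates the shared-memory reduction above over n
 * partial sums: each step, processors Pn < half add their partner's
 * element; when half is odd, processor 0 also picks up the element
 * at half-1 that would otherwise be orphaned. */
double simulate_reduction(double *sum, int n) {
    int half = n;
    while (half > 1) {
        /* synch(): all processors reach this point together */
        if (half % 2 != 0)                  /* Pn == 0 gets missing element */
            sum[0] = sum[0] + sum[half - 1];
        half = half / 2;                    /* dividing line on who sums */
        for (int pn = 0; pn < half; pn++)   /* one pass per active processor */
            sum[pn] = sum[pn] + sum[pn + half];
    }
    return sum[0];
}
```

Running it on 100 partial sums exercises both the even steps (100 → 50 → ...) and the odd-half case (e.g., half = 25) handled by the conditional.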

6.6 Introduction to Graphics Processing Units

History of GPUs

- Early video cards
  - Frame buffer memory with address generation for video output
- 3D graphics processing
  - Originally high-end computers (e.g., SGI)
  - Moore's Law ⇒ lower cost, higher density
  - 3D graphics cards for PCs and game consoles
- Graphics Processing Units
  - Processors oriented to 3D graphics tasks
  - Vertex/pixel processing, shading, texture mapping, rasterization

GPU Architectures

- Processing is highly data-parallel
  - GPUs are highly multithreaded
  - Use thread switching to hide memory latency
    - Less reliance on multi-level caches
  - Graphics memory is wide and high-bandwidth
- Trend toward general purpose GPUs
  - Heterogeneous CPU/GPU systems
  - CPU for sequential code, GPU for parallel code
- Programming languages/APIs
  - DirectX, OpenGL
  - C for Graphics (Cg), High Level Shader Language (HLSL)
  - Compute Unified Device Architecture (CUDA)

Example: NVIDIA Tesla

[Figure: a streaming multiprocessor, containing 8 streaming processors]

Example: NVIDIA Tesla (cont)

- Streaming Processors
  - Single-precision FP and integer units
  - Each SP is fine-grained multithreaded
- Warp: group of 32 threads
  - Executed in parallel, SIMD style
    - 8 SPs × 4 clock cycles
  - Hardware contexts for 24 warps
    - Registers, PCs, ...

Putting GPUs into Perspective

Feature                                            Multicore with SIMD   GPU
SIMD processors                                    4 to 8                8 to 16
SIMD lanes/processor                               2 to 4                8 to 16
Multithreading hardware support for SIMD threads   2 to 4                16 to 32
Typical ratio of single precision to
double-precision performance                       2:1                   2:1
Largest cache size                                 8 MB                  0.75 MB
Size of memory address                             64-bit                64-bit
Size of main memory                                8 GB to 256 GB        4 GB to 6 GB
Memory protection at level of page                 Yes                   Yes
Demand paging                                      Yes                   No
Integrated scalar processor/SIMD processor         Yes                   No
Cache coherent                                     Yes                   No

6.7 Clusters, WSC, and Other Message-Passing Multiprocessors

Message Passing

- Each processor has private physical address space
- Hardware sends/receives messages between processors

Loosely Coupled Clusters

- Network of independent computers
  - Each has private memory and OS
  - Connected using I/O system
    - E.g., Ethernet/switch, Internet
- Suitable for applications with independent tasks
  - Web servers, databases, simulations, ...
- High availability, scalable, affordable
- Problems
  - Administration cost (prefer virtual machines)
  - Low interconnect bandwidth
    - c.f. processor/memory bandwidth on an SMP

Sum Reduction (Again)

- Sum 100,000 on 100 processors
- First distribute 1000 numbers to each
  - Then do partial sums:

    sum = 0;
    for (i = 0; i<1000; i = i + 1)
      sum = sum + AN[i];

- Reduction
  - Half the processors send, other half receive and add
  - The quarter send, quarter receive and add, ...

Sum Reduction (Again)

- Given send() and receive() operations:

limit = 100; half = 100; /* 100 processors */
repeat
  half = (half+1)/2;     /* send vs. receive dividing line */
  if (Pn >= half && Pn < limit)
    send(Pn - half, sum);
  if (Pn < (limit/2))
    sum = sum + receive();
  limit = half;          /* upper limit of senders */
until (half == 1);       /* exit with final sum */

- Send/receive also provide synchronization
- Assumes send/receive take similar time to addition
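The send/receive protocol can likewise be simulated sequentially, with an ordinary array standing in for the interconnect (a sketch; `simulate_msg_reduction` is an illustrative name, and each send/receive pair is modeled as a direct array update):

```c
#include <assert.h>

/* Sequentially simulates the message-passing reduction above for p
 * processors: each round, processors in [half, limit) send their
 * partial sum to processor Pn - half, which receives and adds it.
 * Returns the final sum left on processor 0. */
double simulate_msg_reduction(double *sum, int p) {
    int limit = p, half = p;
    do {
        half = (half + 1) / 2;            /* send vs. receive dividing line */
        for (int pn = half; pn < limit; pn++)
            sum[pn - half] += sum[pn];    /* send(Pn - half, sum); receiver adds */
        limit = half;                     /* upper limit of senders */
    } while (half > 1);                   /* exit with final sum */
    return sum[0];
}
```

Note the rounding-up in `half = (half+1)/2`: with an odd number of active processors, the middle processor receives nothing that round, which is why the protocol needs no special odd-case fix-up like the shared-memory version.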

Grid Computing

- Separate computers interconnected by long-haul networks
  - E.g., Internet connections
  - Work units farmed out, results sent back
- Can make use of idle time on PCs
  - E.g., SETI@home, World Community Grid

6.8 Introduction to Multiprocessor Network Topologies

Interconnection Networks

- Network topologies
  - Bus
  - Ring
  - 2D Mesh
  - N-cube (N = 3)
  - Fully connected

Multistage Networks

Network Characteristics

- Performance
  - Latency per message (unloaded network)
  - Throughput
    - Link bandwidth
    - Total network bandwidth
    - Bisection bandwidth
  - Congestion delays (depending on traffic)
- Cost
- Power
- Routability in silicon
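For the topology extremes on the previous slide (ring vs. fully connected), the two bandwidth metrics reduce to link counts. A C sketch (function names are illustrative; the formulas are the standard ones for these topologies):

```c
#include <assert.h>

/* Bandwidths in units of one link's bandwidth, for p nodes.
 * Total network bandwidth: sum over all links.
 * Bisection bandwidth: links crossing a worst-case cut that
 * divides the machine into two halves. */
int ring_total_bw(int p)      { return p; }              /* p links in a cycle */
int ring_bisection_bw(int p)  { (void)p; return 2; }     /* any cut severs 2 links */
int full_total_bw(int p)      { return p * (p - 1) / 2; } /* one link per node pair */
int full_bisection_bw(int p)  { return (p / 2) * (p / 2); } /* links between halves */
```

The gap between the two (e.g., 64 vs. 2016 total links at p = 64) is exactly the cost/performance trade-off the slide's bullet list is weighing.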

Code or Applications?

- Traditional benchmarks
  - Fixed code and data sets
- Parallel programming is evolving
  - Should algorithms, programming languages, and tools be part of the system?
  - Compare systems, provided they implement a given application
  - E.g., Linpack, Berkeley Design Patterns
- Would foster innovation in approaches to parallelism

Optimizing Performance

- Optimize FP performance
  - Balance adds and multiplies
  - Improve superscalar ILP and use of SIMD instructions
- Optimize memory usage
  - Software prefetch
    - Avoid load stalls
  - Memory affinity
    - Avoid non-local data accesses

Use OpenMP:

void dgemm (int n, double* A, double* B, double* C)
{
  #pragma omp parallel for
  for ( int sj = 0; sj < n; sj += BLOCKSIZE )
    for ( int si = 0; si < n; si += BLOCKSIZE )
      for ( int sk = 0; sk < n; sk += BLOCKSIZE )
        do_block(n, si, sj, sk, A, B, C);
}
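Here `do_block` is the blocked DGEMM kernel carried over from Chapter 5's cache-blocking example. A self-contained sketch of both routines (assuming column-major matrices and n a multiple of BLOCKSIZE, as in the book's earlier DGEMM versions; compile with -fopenmp to activate the pragma):

```c
#include <assert.h>

#define BLOCKSIZE 32

/* One BLOCKSIZE x BLOCKSIZE block of C += A * B (column-major). */
static void do_block(int n, int si, int sj, int sk,
                     double *A, double *B, double *C) {
    for (int i = si; i < si + BLOCKSIZE; ++i)
        for (int j = sj; j < sj + BLOCKSIZE; ++j) {
            double cij = C[i + j * n];
            for (int k = sk; k < sk + BLOCKSIZE; ++k)
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}

void dgemm(int n, double *A, double *B, double *C) {
    /* Each thread gets a disjoint set of sj column blocks, so no two
     * threads ever write the same element of C. */
    #pragma omp parallel for
    for (int sj = 0; sj < n; sj += BLOCKSIZE)
        for (int si = 0; si < n; si += BLOCKSIZE)
            for (int sk = 0; sk < n; sk += BLOCKSIZE)
                do_block(n, si, sj, sk, A, B, C);
}
```

Parallelizing the outermost (sj) loop is what makes the pragma safe: iterations write disjoint columns of C, so no locking is needed.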

Fallacies

- Amdahl's Law doesn't apply to parallel computers
  - Since we can achieve linear speedup
  - But only on applications with weak scaling
- Peak performance tracks observed performance
  - Marketers like this approach!
  - But compare Xeon with others in example
  - Need to be aware of bottlenecks

Pitfalls

- Not developing the software to take account of a multiprocessor architecture
  - Example: using a single lock for a shared composite resource
    - Serializes accesses, even if they could be done in parallel
    - Use finer-granularity locking

6.14 Concluding Remarks

Concluding Remarks

- Goal: higher performance by using multiple processors
- Difficulties
  - Developing parallel software
  - Devising appropriate architectures
- SaaS importance is growing and clusters are a good match
- Performance per dollar and performance per Joule drive both mobile and WSC
- SIMD and vector operations match multimedia applications and are easy to program