
Introduction to Parallel and Heterogeneous Computing

Benedict R. Gaster | October 2010

Agenda

Motivation

A little terminology
Hardware in a heterogeneous world
Software in a heterogeneous world


The Free Lunch is Over


Herb Sutter (2005)

Hardware can no longer depend on getting:


- Increased clock speed
- Execution optimization (i.e. instruction-level parallelism)
- Larger caches

How has this been addressed, and how is it being addressed now?


Solution

Parallelism (lots of it!)



Quick stop to cover a bit of terminology


Definitions
Parallelism: A property of a computation in which portions of the calculation are independent of each other, allowing them to be executed at the same time.

For example, consider the following pseudocode:


    float a = E + A;
    float b = E + B;
    float c = E + C;
    float d = E + D;
    float r = a + b + c + d;

The assignments to a, b, c, and d are independent of one another, so they can run in parallel; only the final sum r depends on all four.
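As a minimal, illustrative sketch of how that parallelism could be expressed (assuming a C++11 compiler; std::async stands in for whatever tasking mechanism is available):

    #include <future>

    float sum_parallel(float E, float A, float B, float C, float D) {
        // The four independent additions may each run on their own thread.
        auto fa = std::async(std::launch::async, [=] { return E + A; });
        auto fb = std::async(std::launch::async, [=] { return E + B; });
        auto fc = std::async(std::launch::async, [=] { return E + C; });
        auto fd = std::async(std::launch::async, [=] { return E + D; });
        // The final sum depends on all four results, so it waits on them.
        return fa.get() + fb.get() + fc.get() + fd.get();
    }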


Definitions
Concurrency: A logical programming abstraction used to arbitrate communication between multiple processing entities (such as processes or threads). For example, concurrency can be used to build user interfaces and other asynchronous tasks.

Concurrency is NOT the same as parallelism


Concurrency does not preclude running tasks in parallel, but parallelism is not a necessary component of concurrency.
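For illustration, a minimal C++11 sketch of concurrency as arbitrated communication: the main thread passes messages to a worker through a locked queue. The program is correct whether or not the two threads ever actually run in parallel (all names here are hypothetical):

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    std::queue<std::string> inbox;   // shared channel between the threads
    std::mutex m;
    std::condition_variable cv;

    void worker() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !inbox.empty(); }); // sleep until a message arrives
            std::string msg = inbox.front();
            inbox.pop();
            lock.unlock();
            if (msg == "quit") return;
            std::cout << "handled: " << msg << "\n";
        }
    }

    int main() {
        std::thread t(worker);
        for (const char* msg : {"resize", "redraw", "quit"}) {
            std::lock_guard<std::mutex> lock(m);
            inbox.push(msg);
            cv.notify_one();   // wake the worker: communication, not parallelism
        }
        t.join();
    }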


Definitions
Heterogeneous Computing: A system comprised of two or more compute engines with significant structural differences. In our case, a low-latency x86 CPU and a high-throughput Radeon GPU.

Fusion: Bringing together two or more components and joining them into a single unified whole. In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power.


Hardware in a heterogeneous world


AMD Balanced Platform Advantage


The CPU is ideal for scalar processing:
- Out-of-order x86 cores with low-latency memory access
- Optimized for sequential and branching algorithms
- Runs existing applications very well

The GPU is ideal for parallel processing:
- GPU shaders optimized for throughput computing
- Ready for emerging workloads: media processing, simulation, natural UI, etc.

Together they cover graphics workloads, serial/task-parallel workloads, and other highly parallel workloads, delivering optimal performance for a wide range of platform configurations.


Three Eras of Processor Performance


Single-Core Era (single-thread performance over time)
- Enabled by: Moore's Law, voltage scaling, microarchitecture
- Constrained by: power, complexity

Multi-Core Era (throughput performance over time, i.e. number of processors)
- Enabled by: Moore's Law, the desire for throughput, 20 years of SMP architecture
- Constrained by: power, parallel software availability, scalability

Heterogeneous Systems Era (targeted application performance over time, i.e. data-parallel exploitation)
- Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
- Temporarily constrained by: programming models, communication overheads

[Chart: one performance-over-time curve per era, each marked "we are here".]


GPU SP ALU Performance

[Chart: single-precision ALU performance over time for the HD4870, HD5870, and CPU.]


GPU DP ALU Performance

[Chart: double-precision ALU performance over time for the HD4870, HD5870, and CPU.]


GPU BW Performance expectations over time


[Chart: GPU memory bandwidth expectations over time, axis from 50 to 300 (presumably GB/s), with the HD4870 and HD5870 marked.]


GPU Computing Efficiency Trend


[Chart: GPU computing efficiency trend across successive GPU generations, plotting GFLOPS/W (rising to 14.47) and GFLOPS/mm² (rising to 7.90).]


Fusion APUs: Putting it all together


[Chart: programmer accessibility versus throughput performance. Accessibility labels include "unacceptable", "experts only", "system-level programmable", and "mainstream". Microprocessor advancement runs from the Single-Thread Era through the Multi-Core Era to the Heterogeneous Systems Era; GPU advancement runs from graphics driver-based programs through OCL/DC (OpenCL/DirectCompute) driver-based programs to heterogeneous computing. The two paths converge on the Fusion APU, which pairs high-performance task-parallel execution with power-efficient data-parallel execution.]

Why AMD Fusion APUs?


The x86 CPU owns the software universe:
- Windows, Mac OS, and Linux franchises
- Many thousands of applications
- Well matched to branchy, scalar code
- An established programming and memory model
- A mature tool chain
- Backward compatible across 15 years of applications and OSs

The GPU is the game changer:
- Enormous parallel computing capacity
- Outstanding performance per watt, per dollar
- Very efficient hardware threading
- A SIMD architecture well matched to media workloads: video, audio, graphics

A balanced approach is optimal: highly programmable, power efficient, and capable of massive throughput. The best of both worlds, positioned to enable the emergence of immersive, media-based experiences.


PC with Discrete GPU


Fusion APU Based PC


The Benefits of Fusion


- Unparalleled processing capabilities in mobile form factors
- Shared memory for the CPU and GPU: eliminates copies (increasing performance) and reduces dispatch overhead
- Lower latency from the GPU to memory
- Power-efficient design
- Enables architectural innovations between the CPU, GPU, and memory system

A scalable architecture that can target a broad range of platforms, from mobile to data center.

These machines are being built, but...

Heterogeneous systems are being built, and there is no question that we will build more of them. New, emerging workloads contain enough parallelism to use them. But this is not enough! The question becomes: how do we program 10, 100, or even more than 1,000 cores? The future of performance is entirely about software!

AMD Fusion Developer Summit


Find out more about:
- Fusion APUs
- Programming models for Fusion
- Much more

June 13-16, 2011 | Seattle, Washington, USA

http://sites.amd.com/us/fusion/apu/Pages/fusion-developer-summit.aspx


OpenCL Programming Webinar Series


Designed to help advance your experience in parallel programming, with a focus on OpenCL. Much of what is taught is useful for parallel programming in general.

Beginner Tracks:
- Introduction to OpenCL
- OpenCL Programming in Detail
- Using the OpenCL C Language

Advanced Tracks:
- Device Fission Extension for OpenCL
- Optimization Techniques I
- Optimization Techniques II

http://developer.amd.com/zones/OpenCLZone/Events/pages/OpenCLWebinars.aspx


Software in a heterogeneous world


What lies ahead?


Guy Steele (2009), "The Future Is Parallel: What's a Programmer to Do?"

A million-dollar question with many (many) answers! A word cloud spanning both task parallelism and data parallelism: POSIX Threads, Win32 Threads, Java Threads, MPI, OpenMP, OpenCL, CUDA, Brook+, Accelerator, Thread Building Blocks, Cilk, Concurrent ML, Task Parallel Library, X10, Fortress, and Kite.


Different types of parallelism

Task-decomposition, data-decomposition, and their combination: braided parallelism.

Task-decomposition
Task-decomposition divides the problem by the type of task to be done.

For example, in modern games:
- Computations are organized as tasks/jobs
- Some may be fine-grained (short-running)
- Others are long-running and data-parallel

A tasking runtime must account for:
- Task dependencies
- Synchronization
- Load balancing
- Etc.

A sketch of this style follows.
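A minimal C++11 sketch of task decomposition in a game frame. The subsystem functions are hypothetical stand-ins, and the dependency tracking and load balancing that a real tasking runtime would provide are reduced to explicit waits:

    #include <future>
    #include <iostream>

    // Hypothetical subsystem stubs, for illustration only.
    void update_physics() { std::cout << "physics\n"; }
    void update_ai()      { std::cout << "ai\n"; }
    void mix_audio()      { std::cout << "audio\n"; }
    void render()         { std::cout << "render\n"; }

    void game_frame() {
        // Each subsystem becomes a task of a different kind.
        auto physics = std::async(std::launch::async, update_physics);
        auto ai      = std::async(std::launch::async, update_ai);
        auto audio   = std::async(std::launch::async, mix_audio);

        // Rendering depends on physics and AI, so wait on those tasks first;
        // audio has no such dependency and may finish at any point.
        physics.get();
        ai.get();
        render();
        audio.get();
    }

    int main() { game_frame(); }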


Load Balancing: Work Stealing

Internally, most tasking runtimes use a work-stealing implementation. Work stealing has provably good locality and work-distribution properties.

Seminal reference: Blumofe et al., "Cilk: An Efficient Multithreaded Runtime System", SIGPLAN Notices, 1995.
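To make the idea concrete, a minimal sketch of a per-worker work-stealing deque. A coarse mutex keeps it simple; production runtimes such as Cilk use carefully engineered lock-free deques:

    #include <deque>
    #include <functional>
    #include <mutex>

    typedef std::function<void()> Task;

    class WorkStealingQueue {
        std::deque<Task> tasks;
        std::mutex m;
    public:
        // The owning worker pushes and pops at the front (LIFO order),
        // which gives good cache locality for recently spawned tasks.
        void push(Task t) {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_front(t);
        }
        bool pop(Task& out) {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            out = tasks.front();
            tasks.pop_front();
            return true;
        }
        // Idle workers steal from the back (FIFO order), taking the oldest,
        // typically largest piece of work; this spreads work across workers
        // while minimizing contention with the owner.
        bool steal(Task& out) {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            out = tasks.back();
            tasks.pop_back();
            return true;
        }
    };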


Popular task runtimes (CPU only)


Unmanaged C/C++:
- Intel's Thread Building Blocks
- Apple's Grand Central Dispatch
- OpenMP (parallelism should not be tacked on!)

Managed languages:
- Microsoft's Task Parallel Library for .NET 4

Different OS, different options! A small example using one of these runtimes follows.
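For instance, a minimal sketch using Thread Building Blocks' tbb::parallel_invoke, which hands independent functions to TBB's work-stealing scheduler (assumes TBB is installed; the two functions are hypothetical):

    #include <iostream>
    #include <tbb/parallel_invoke.h>

    void simulate() { std::cout << "simulate\n"; }  // independent work
    void animate()  { std::cout << "animate\n"; }   // independent work

    int main() {
        // Runs both functions as tasks, potentially in parallel,
        // and returns only once both have completed.
        tbb::parallel_invoke(simulate, animate);
        return 0;
    }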


Data-decomposition
Data-decomposition divides the problem into elements to be processed, assigning a subset of the elements to each parallel worker.

For example, in modern games: particle systems
- 1,000, maybe 100,000, even millions of particles
- Forces and actions computed independently (locality can be used to describe interaction)

Data-parallel execution must account for:
- Local communication
- Synchronization
- Etc.

A sketch follows.
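A minimal C++11 sketch of data decomposition over a particle system: each worker thread applies the same (hypothetical) update rule to its own contiguous chunk of the array:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Particle { float x, y, vx, vy; };

    void update_particles(std::vector<Particle>& ps, float dt, unsigned workers) {
        std::vector<std::thread> pool;
        std::size_t chunk = (ps.size() + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(ps.size(), begin + chunk);
            pool.emplace_back([&ps, dt, begin, end] {
                // Every element is independent of the others.
                for (std::size_t i = begin; i < end; ++i) {
                    ps[i].x += ps[i].vx * dt;
                    ps[i].y += ps[i].vy * dt;
                }
            });
        }
        for (std::thread& t : pool) t.join();
    }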



Popular data-parallel languages


Unmanaged C/C++:
- Khronos' Open Compute Language (OpenCL) (CPU + GPU)
- NVIDIA's CUDA (GPU only)
- OpenMP (parallelism should not be tacked on!) (CPU only)

Managed languages:
- Microsoft's Accelerator II for .NET 4 (CPU + GPU via DX9)
- AMD's Aparapi (A PARallel API) for Java (CPU + GPU via OpenCL)

Different OS, different options! A taste of the OpenCL style follows.
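To give a flavor of the data-parallel style, a minimal OpenCL C kernel for element-wise vector addition (the host-side setup of platform, device, buffers, and dispatch is omitted):

    // OpenCL C source: one work-item per output element. The host enqueues
    // the kernel with a global work size equal to the number of elements.
    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* out)
    {
        size_t i = get_global_id(0);   // this work-item's unique index
        out[i] = a[i] + b[i];          // each element computed independently
    }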



Task and data-parallelism together

Braided Parallelism

[Figure: job graph from DICE's Battlefield: Bad Company 2.]

Reference: Aaron Lefohn, "Programming Larrabee: Beyond Data Parallelism", Beyond Programmable Shading Course, SIGGRAPH 2008.
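A minimal C++11 sketch of braided parallelism: independent subsystems run as tasks (task parallelism), and the particle task splits its elements across threads internally (data parallelism). All names and the update rule are hypothetical:

    #include <cstddef>
    #include <future>
    #include <thread>
    #include <vector>

    void update_ai() { /* hypothetical task-parallel work */ }

    // Data-parallel leaf: split the particle positions across two threads.
    void update_positions(std::vector<float>& xs, float dt) {
        std::size_t mid = xs.size() / 2;
        auto kernel = [&xs, dt](std::size_t begin, std::size_t end) {
            for (std::size_t i = begin; i < end; ++i) xs[i] += dt;
        };
        std::thread half(kernel, std::size_t(0), mid); // first half, new thread
        kernel(mid, xs.size());                        // second half, this thread
        half.join();
    }

    // Braided: tasks at the top level, data parallelism inside one task.
    void braided_frame(std::vector<float>& xs, float dt) {
        auto particles = std::async(std::launch::async,
                                    [&xs, dt] { update_positions(xs, dt); });
        auto ai = std::async(std::launch::async, update_ai);
        particles.get();
        ai.get();
    }

    int main() {
        std::vector<float> xs(1000, 0.0f);
        braided_frame(xs, 0.016f);
    }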

Fusion APUs: Putting it all together

[Chart repeated from earlier: microprocessor advancement (Single-Thread Era, Multi-Core Era, Heterogeneous Systems Era) and GPU advancement (graphics driver-based programs, OCL/DC driver-based programs, heterogeneous computing) converging on the Fusion APU, with an added callout: braided parallelism is a natural programming model for heterogeneous computing.]

Conclusion and Questions


Trademark Attribution: AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. © 2009 Advanced Micro Devices, Inc. All rights reserved.

