
Introduction to Parallel and Heterogeneous Computing

Benedict R. Gaster | October 2010

Agenda

Motivation

A little terminology
Hardware in a heterogeneous world
Software in a heterogeneous world


The Free Lunch is Over


Herb Sutter (2005)

Hardware can no longer depend on getting:


- Increased clock speed
- Execution optimization (i.e. instruction-level parallelism)
- Larger caches

How has this been addressed, and how is it being addressed now?


Solution

Parallelism (lots of it!)



Quick stop to cover a bit of terminology


Definitions
Parallelism: A property of a computation in which portions of the calculation are independent of each other, allowing them to be executed at the same time.

For example, consider the following pseudocode:


    float a = E + A;
    float b = E + B;
    float c = E + C;
    float d = E + D;
    float r = a + b + c + d;

The assignments to a, b, c, and d are independent of one another, so they can run in parallel; only the final sum r depends on all four.
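As a minimal, illustrative sketch of how that parallelism could be expressed (assuming a C++11 compiler; std::async stands in for whatever tasking mechanism is available):

    #include <future>

    float sum_parallel(float E, float A, float B, float C, float D) {
        // The four independent additions may each run on their own thread.
        auto fa = std::async(std::launch::async, [=] { return E + A; });
        auto fb = std::async(std::launch::async, [=] { return E + B; });
        auto fc = std::async(std::launch::async, [=] { return E + C; });
        auto fd = std::async(std::launch::async, [=] { return E + D; });
        // The final sum depends on all four results, so it waits on them.
        return fa.get() + fb.get() + fc.get() + fd.get();
    }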


Definitions
Concurrency: A logical programming abstraction used to arbitrate communication between multiple processing entities (such as processes or threads). For example, concurrency can be used to build user interfaces and other asynchronous tasks.

Concurrency is NOT the same as parallelism


Concurrency does not preclude running tasks in parallel, but parallelism is not a necessary component of concurrency.
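For illustration, a minimal C++11 sketch of concurrency as arbitrated communication: the main thread passes messages to a worker through a locked queue. The program is correct whether or not the two threads ever actually run in parallel (all names here are hypothetical):

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    std::queue<std::string> inbox;   // shared channel between the threads
    std::mutex m;
    std::condition_variable cv;

    void worker() {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !inbox.empty(); }); // sleep until a message arrives
            std::string msg = inbox.front();
            inbox.pop();
            lock.unlock();
            if (msg == "quit") return;
            std::cout << "handled: " << msg << "\n";
        }
    }

    int main() {
        std::thread t(worker);
        for (const char* msg : {"resize", "redraw", "quit"}) {
            std::lock_guard<std::mutex> lock(m);
            inbox.push(msg);
            cv.notify_one();   // wake the worker: communication, not parallelism
        }
        t.join();
    }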


Definitions
Heterogeneous Computing: A system comprised of two or more compute engines with significant structural differences. In our case, a low-latency x86 CPU and a high-throughput Radeon GPU.

Fusion: Bringing together two or more components and joining them into a single unified whole. In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power.


Hardware in a heterogeneous world


AMD Balanced Platform Advantage


The CPU is ideal for scalar processing:
- Out-of-order x86 cores with low-latency memory access
- Optimized for sequential and branching algorithms
- Runs existing applications very well

The GPU is ideal for parallel processing:
- GPU shaders optimized for throughput computing
- Ready for emerging workloads: media processing, simulation, natural UI, etc.

Together they cover graphics workloads, serial/task-parallel workloads, and other highly parallel workloads, delivering optimal performance for a wide range of platform configurations.


Three Eras of Processor Performance


Single-Core Era (single-thread performance over time)
- Enabled by: Moore's Law, voltage scaling, microarchitecture
- Constrained by: power, complexity

Multi-Core Era (throughput performance over time, i.e. number of processors)
- Enabled by: Moore's Law, the desire for throughput, 20 years of SMP architecture
- Constrained by: power, parallel software availability, scalability

Heterogeneous Systems Era (targeted application performance over time, i.e. data-parallel exploitation)
- Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
- Temporarily constrained by: programming models, communication overheads

[Chart: one performance-over-time curve per era, each marked "we are here".]


GPU SP ALU Performance

[Chart: single-precision ALU performance over time for the HD4870, HD5870, and CPU.]


GPU DP ALU Performance

[Chart: double-precision ALU performance over time for the HD4870, HD5870, and CPU.]


GPU BW Performance expectations over time


[Chart: GPU memory bandwidth expectations over time, axis from 50 to 300 (presumably GB/s), with the HD4870 and HD5870 marked.]


GPU Computing Efficiency Trend


[Chart: GPU computing efficiency trend across successive GPU generations, plotting GFLOPS/W (rising to 14.47) and GFLOPS/mm² (rising to 7.90).]


Fusion APUs: Putting it all together


[Chart: programmer accessibility versus throughput performance. Accessibility labels include "unacceptable", "experts only", "system-level programmable", and "mainstream". Microprocessor advancement runs from the Single-Thread Era through the Multi-Core Era to the Heterogeneous Systems Era; GPU advancement runs from graphics driver-based programs through OCL/DC (OpenCL/DirectCompute) driver-based programs to heterogeneous computing. The two paths converge on the Fusion APU, which pairs high-performance task-parallel execution with power-efficient data-parallel execution.]

Why AMD Fusion APUs?


The x86 CPU owns the software universe:
- Windows, Mac OS, and Linux franchises
- Many thousands of applications
- Well matched to branchy, scalar code
- An established programming and memory model
- A mature tool chain
- Backward compatible across 15 years of applications and OSs

The GPU is the game changer:
- Enormous parallel computing capacity
- Outstanding performance per watt, per dollar
- Very efficient hardware threading
- A SIMD architecture well matched to media workloads: video, audio, graphics

A balanced approach is optimal: highly programmable, power efficient, and capable of massive throughput. The best of both worlds, positioned to enable the emergence of immersive, media-based experiences.


PC with Discrete GPU


Fusion APU Based PC


The Benefits of Fusion


- Unparalleled processing capabilities in mobile form factors
- Shared memory for the CPU and GPU: eliminates copies (increasing performance) and reduces dispatch overhead
- Lower latency from the GPU to memory
- Power-efficient design
- Enables architectural innovations between the CPU, GPU, and memory system

A scalable architecture that can target a broad range of platforms, from mobile to data center.

These machines are being built, but...

Heterogeneous systems are being built, and there is no question that we will build more of them. New, emerging workloads contain enough parallelism to use them. But this is not enough! The question becomes: how do we program 10, 100, or even more than 1,000 cores? The future of performance is entirely about software!

AMD Fusion Developer Summit


Find out more about:
- Fusion APUs
- Programming models for Fusion
- Much more

June 13-16, 2011 | Seattle, Washington, USA

http://sites.amd.com/us/fusion/apu/Pages/fusion-developer-summit.aspx


OpenCL Programming Webinar Series


Designed to help advance your experience in parallel programming, with a focus on OpenCL. Much of what is taught is useful for parallel programming in general.

Beginner Tracks:
- Introduction to OpenCL
- OpenCL Programming in Detail
- Using the OpenCL C Language

Advanced Tracks:
- Device Fission Extension for OpenCL
- Optimization Techniques I
- Optimization Techniques II

http://developer.amd.com/zones/OpenCLZone/Events/pages/OpenCLWebinars.aspx


Software in a heterogeneous world


What lies ahead?


Guy Steele (2009), "The Future Is Parallel: What's a Programmer to Do?"

A million-dollar question with many (many) answers! A word cloud spanning both task parallelism and data parallelism: POSIX Threads, Win32 Threads, Java Threads, MPI, OpenMP, OpenCL, CUDA, Brook+, Accelerator, Thread Building Blocks, Cilk, Concurrent ML, Task Parallel Library, X10, Fortress, and Kite.


Different types of parallelism

Task-decomposition, data-decomposition, and their combination: braided parallelism.

Task-decomposition
Task-decomposition divides the problem by the type of task to be done.

For example, in modern games:
- Computations are organized as tasks/jobs
- Some may be fine-grained (short-running)
- Others are long-running and data-parallel

A tasking runtime must account for:
- Task dependencies
- Synchronization
- Load balancing
- Etc.

A sketch of this style follows.
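A minimal C++11 sketch of task decomposition in a game frame. The subsystem functions are hypothetical stand-ins, and the dependency tracking and load balancing that a real tasking runtime would provide are reduced to explicit waits:

    #include <future>
    #include <iostream>

    // Hypothetical subsystem stubs, for illustration only.
    void update_physics() { std::cout << "physics\n"; }
    void update_ai()      { std::cout << "ai\n"; }
    void mix_audio()      { std::cout << "audio\n"; }
    void render()         { std::cout << "render\n"; }

    void game_frame() {
        // Each subsystem becomes a task of a different kind.
        auto physics = std::async(std::launch::async, update_physics);
        auto ai      = std::async(std::launch::async, update_ai);
        auto audio   = std::async(std::launch::async, mix_audio);

        // Rendering depends on physics and AI, so wait on those tasks first;
        // audio has no such dependency and may finish at any point.
        physics.get();
        ai.get();
        render();
        audio.get();
    }

    int main() { game_frame(); }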


Load Balancing: Work Stealing

Internally, most tasking runtimes use a work-stealing implementation. Work stealing has provably good locality and work-distribution properties.

Seminal reference: Blumofe et al., "Cilk: An Efficient Multithreaded Runtime System", SIGPLAN Notices, 1995.
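To make the idea concrete, a minimal sketch of a per-worker work-stealing deque. A coarse mutex keeps it simple; production runtimes such as Cilk use carefully engineered lock-free deques:

    #include <deque>
    #include <functional>
    #include <mutex>

    typedef std::function<void()> Task;

    class WorkStealingQueue {
        std::deque<Task> tasks;
        std::mutex m;
    public:
        // The owning worker pushes and pops at the front (LIFO order),
        // which gives good cache locality for recently spawned tasks.
        void push(Task t) {
            std::lock_guard<std::mutex> lock(m);
            tasks.push_front(t);
        }
        bool pop(Task& out) {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            out = tasks.front();
            tasks.pop_front();
            return true;
        }
        // Idle workers steal from the back (FIFO order), taking the oldest,
        // typically largest piece of work; this spreads work across workers
        // while minimizing contention with the owner.
        bool steal(Task& out) {
            std::lock_guard<std::mutex> lock(m);
            if (tasks.empty()) return false;
            out = tasks.back();
            tasks.pop_back();
            return true;
        }
    };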


Popular task runtimes (CPU only)


Unmanaged C/C++:
- Intel's Thread Building Blocks
- Apple's Grand Central Dispatch
- OpenMP (parallelism should not be tacked on!)

Managed languages:
- Microsoft's Task Parallel Library for .NET 4

Different OS, different options! A small example using one of these runtimes follows.
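For instance, a minimal sketch using Thread Building Blocks' tbb::parallel_invoke, which hands independent functions to TBB's work-stealing scheduler (assumes TBB is installed; the two functions are hypothetical):

    #include <iostream>
    #include <tbb/parallel_invoke.h>

    void simulate() { std::cout << "simulate\n"; }  // independent work
    void animate()  { std::cout << "animate\n"; }   // independent work

    int main() {
        // Runs both functions as tasks, potentially in parallel,
        // and returns only once both have completed.
        tbb::parallel_invoke(simulate, animate);
        return 0;
    }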


Data-decomposition
Data-decomposition divides the problem into elements to be processed, assigning a subset of the elements to each parallel worker.

For example, in modern games: particle systems
- 1,000, maybe 100,000, even millions of particles
- Forces and actions computed independently (locality can be used to describe interaction)

Data-parallel execution must account for:
- Local communication
- Synchronization
- Etc.

A sketch follows.
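A minimal C++11 sketch of data decomposition over a particle system: each worker thread applies the same (hypothetical) update rule to its own contiguous chunk of the array:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Particle { float x, y, vx, vy; };

    void update_particles(std::vector<Particle>& ps, float dt, unsigned workers) {
        std::vector<std::thread> pool;
        std::size_t chunk = (ps.size() + workers - 1) / workers;
        for (unsigned w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(ps.size(), begin + chunk);
            pool.emplace_back([&ps, dt, begin, end] {
                // Every element is independent of the others.
                for (std::size_t i = begin; i < end; ++i) {
                    ps[i].x += ps[i].vx * dt;
                    ps[i].y += ps[i].vy * dt;
                }
            });
        }
        for (std::thread& t : pool) t.join();
    }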



Popular data-parallel languages


Unmanaged C/C++:
- Khronos' Open Compute Language (OpenCL) (CPU + GPU)
- NVIDIA's CUDA (GPU only)
- OpenMP (parallelism should not be tacked on!) (CPU only)

Managed languages:
- Microsoft's Accelerator II for .NET 4 (CPU + GPU via DX9)
- AMD's Aparapi (A PARallel API) for Java (CPU + GPU via OpenCL)

Different OS, different options! A taste of the OpenCL style follows.
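To give a flavor of the data-parallel style, a minimal OpenCL C kernel for element-wise vector addition (the host-side setup of platform, device, buffers, and dispatch is omitted):

    // OpenCL C source: one work-item per output element. The host enqueues
    // the kernel with a global work size equal to the number of elements.
    __kernel void vec_add(__global const float* a,
                          __global const float* b,
                          __global float* out)
    {
        size_t i = get_global_id(0);   // this work-item's unique index
        out[i] = a[i] + b[i];          // each element computed independently
    }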



Task and data-parallelism together

Braided Parallelism

[Figure: job graph from DICE's Battlefield: Bad Company 2.]

Reference: Aaron Lefohn, "Programming Larrabee: Beyond Data Parallelism", Beyond Programmable Shading Course, SIGGRAPH 2008.
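A minimal C++11 sketch of braided parallelism: independent subsystems run as tasks (task parallelism), and the particle task splits its elements across threads internally (data parallelism). All names and the update rule are hypothetical:

    #include <cstddef>
    #include <future>
    #include <thread>
    #include <vector>

    void update_ai() { /* hypothetical task-parallel work */ }

    // Data-parallel leaf: split the particle positions across two threads.
    void update_positions(std::vector<float>& xs, float dt) {
        std::size_t mid = xs.size() / 2;
        auto kernel = [&xs, dt](std::size_t begin, std::size_t end) {
            for (std::size_t i = begin; i < end; ++i) xs[i] += dt;
        };
        std::thread half(kernel, std::size_t(0), mid); // first half, new thread
        kernel(mid, xs.size());                        // second half, this thread
        half.join();
    }

    // Braided: tasks at the top level, data parallelism inside one task.
    void braided_frame(std::vector<float>& xs, float dt) {
        auto particles = std::async(std::launch::async,
                                    [&xs, dt] { update_positions(xs, dt); });
        auto ai = std::async(std::launch::async, update_ai);
        particles.get();
        ai.get();
    }

    int main() {
        std::vector<float> xs(1000, 0.0f);
        braided_frame(xs, 0.016f);
    }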

Fusion APUs: Putting it all together

[Chart repeated from earlier: microprocessor advancement (Single-Thread Era, Multi-Core Era, Heterogeneous Systems Era) and GPU advancement (graphics driver-based programs, OCL/DC driver-based programs, heterogeneous computing) converging on the Fusion APU, with an added callout: braided parallelism is a natural programming model for heterogeneous computing.]

Conclusion and Questions


Trademark Attribution: AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. © 2009 Advanced Micro Devices, Inc. All rights reserved.

