OpenCL

Parallel Computing on the GPU and CPU

Aaftab Munshi

Opportunity: Processor Parallelism
Today's processors are increasingly parallel

CPUs
  Multiple cores are driving performance increases

GPUs
  Transforming into general purpose data-parallel computational coprocessors
  Improving numerical precision (single and double)

Beyond Programmable Shading: Fundamentals

Challenge: Processor Parallelism

Writing parallel programs is different for the CPU and GPU
  Differing domain-specific techniques
  Vendor-specific technologies

Graphics API is not an ideal abstraction for general purpose compute

Introducing OpenCL

OpenCL: Open Computing Language

Approachable language for accessing heterogeneous computational resources

Supports parallel execution on single or multiple processors
  GPU, CPU, GPU + CPU or multiple GPUs

Desktop and Handheld Profiles

Designed to work with graphics APIs such as OpenGL

OpenCL = Open Standard

Specification under review
  Royalty free, cross-platform, vendor neutral
  Khronos OpenCL working group (www.khronos.org)

Based on a proposal by Apple
  Developed in collaboration with industry leaders
  Performance-enhancing technology in Mac OS X Snow Leopard

OpenCL Working Group Members


Broad Industry Support

Copyright Khronos Group, 2008

OpenCL

A Sneak Preview

Design Goals of OpenCL

Use all computational resources in system
  GPUs and CPUs as peers
  Data- and task-parallel compute model

Efficient parallel programming model
  Based on C
  Abstract the specifics of underlying hardware

Specify accuracy of floating-point computations
  IEEE 754 compliant rounding behavior
  Define maximum allowable error of math functions

Drive future hardware requirements
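
The goal of defining a maximum allowable error for math functions can be made concrete with a small example. Error bounds of this kind are typically stated in units in the last place (ULP). The plain C sketch below is not OpenCL code and not taken from the slides: the helper names ordered_bits, ulp_distance and within_ulp are hypothetical, and it assumes IEEE 754 single-precision floats and non-NaN inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: map a float's bit pattern to an integer whose
   ordering matches the ordering of the floats (assumes IEEE 754 binary32). */
static int64_t ordered_bits(float x) {
    int32_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Negative floats have the sign bit set; mirror them below zero
       so that -0.0f and +0.0f both map to 0. */
    return bits < 0 ? (int64_t)INT32_MIN - bits : bits;
}

/* Number of representable-float steps from a to b (0 if they are equal). */
static int64_t ulp_distance(float a, float b) {
    assert(a == a && b == b);   /* NaN has no meaningful ULP distance */
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Returns 1 if `computed` lies within `max_ulp` steps of `reference`. */
static int within_ulp(float computed, float reference, int64_t max_ulp) {
    return ulp_distance(computed, reference) <= max_ulp;
}
```

A conformance-style check would compare a device result against a higher-precision reference and call within_ulp with the bound allowed for that function.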


OpenCL Software Stack

Platform Layer
  query and select compute devices in the system
  initialize a compute device(s)
  create compute contexts and work-queues

Runtime
  resource management
  execute compute kernels

Compiler
  A subset of ISO C99 with appropriate language additions
  Compile and build compute program executables
    online or offline

OpenCL Execution Model

Compute Kernel
  Basic unit of executable code, similar to a C function
  Data-parallel or task-parallel

Compute Program
  Collection of compute kernels and internal functions
  Analogous to a dynamic library

Applications queue compute kernel execution instances
  Queued in-order
  Executed in-order or out-of-order
  Events are used to implement appropriate synchronization of execution instances

OpenCL Data-Parallel Execution

Define N-Dimensional computation domain
  Each independent element of execution in N-D domain is called a work-item
  The N-D domain defines the total number of work-items that execute in parallel: global work size

Work-items can be grouped together: work-group
  Work-items in group can communicate with each other
  Can synchronize execution among work-items in group to coordinate memory access

Execute multiple work-groups in parallel

Mapping of global work size to work-groups
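
The relationship between the global work size and work-groups can be sketched in plain C for the one-dimensional case. This is only an illustration, not OpenCL code: the type work_item_pos and the three helper functions are hypothetical names, and the sketch assumes the global work size is an exact multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of the OpenCL API. */
typedef struct {
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* position of the work-item inside its work-group */
} work_item_pos;

/* Number of work-groups needed to cover the global work size
   (assumes the global size is a multiple of the work-group size). */
static size_t num_work_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Split a global work-item index into (work-group, index within group). */
static work_item_pos split_global_id(size_t global_id, size_t local_size) {
    assert(local_size > 0);
    work_item_pos p = { global_id / local_size, global_id % local_size };
    return p;
}

/* Inverse mapping: rebuild the global index from the pair. */
static size_t join_global_id(work_item_pos p, size_t local_size) {
    return p.group_id * local_size + p.local_id;
}
```

With a global work size of 1024 and work-groups of 64 work-items, this gives 16 work-groups, and global work-item 130 is item 2 of work-group 2.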



OpenCL Task-Parallel Execution

Data-parallel execution model must be implemented by all OpenCL compute devices

Some compute devices such as CPUs can also execute task-parallel compute kernels
  Executes as a single work-item
  A compute kernel written in OpenCL
  A native C / C++ function

OpenCL Memory Model

Implements a relaxed consistency, shared memory model

Multiple distinct address spaces
  Address spaces can be collapsed

Address Qualifiers
  __private
  __local
  __constant and __global
  Example: __global float4 *p;

[Diagram: a compute device containing compute units 1 to N; each compute unit holds work-items 1 to M, each with its own private memory, plus one local memory per compute unit; a global / constant memory data cache on the compute device; global memory in compute device memory.]

Language for writing compute kernels

Derived from ISO C99
  A few restrictions: recursion, function pointers, functions in C99 standard headers ...
  Preprocessing directives defined by C99 are supported

Built-in Data Types
  Scalar and vector data types
  Structs, Pointers
  Data-type conversion functions
    convert_type<_sat><_roundingmode>
  Image types
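
The optional _sat and _roundingmode suffixes of the conversion functions can be illustrated with a plain C emulation of a float to 8-bit unsigned conversion. This is a sketch of the idea, not the OpenCL built-in itself: the names round_mode and float_to_u8_sat are hypothetical, only two rounding modes are modeled, and mapping NaN to 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rounding-mode selector for this sketch. */
typedef enum { ROUND_TOWARD_ZERO, ROUND_NEAREST_EVEN } round_mode;

/* Convert a float to an 8-bit unsigned value with saturation:
   out-of-range inputs clamp to 0 or 255 instead of wrapping. */
static uint8_t float_to_u8_sat(float x, round_mode mode) {
    assert(mode == ROUND_TOWARD_ZERO || mode == ROUND_NEAREST_EVEN);
    if (!(x > 0.0f)) return 0;      /* negatives, zero and NaN map to 0 */
    if (x >= 255.0f) return 255;    /* saturate at the top of the range */

    uint8_t t = (uint8_t)x;         /* a C cast truncates toward zero */
    if (mode == ROUND_NEAREST_EVEN) {
        float frac = x - (float)t;
        /* round up past the midpoint, or at the midpoint when t is odd */
        if (frac > 0.5f || (frac == 0.5f && (t & 1u))) t++;
    }
    return t;
}
```

For example, 300.0f saturates to 255 in either mode, while 2.5f becomes 2 and 3.5f becomes 4 under round-to-nearest-even.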

Language for writing compute kernels

Built-in Functions Required
  work-item functions
  math.h
  read and write image
  relational
  geometric functions
  synchronization functions

Built-in Functions Optional
  double precision
  atomics to global and local memory
  selection of rounding mode


OpenCL FFT Example - Host API


// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context,
                            CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);

OpenCL FFT Example - Host API


// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");

OpenCL FFT Example - Host API


// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);


OpenCL FFT Example - Compute


// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function.
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008

__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);  // coalesced global reads

    fftRadix16Pass(data);       // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

    fftRadix16Pass(data);       // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);  // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}
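
The indexing in the kernel can be checked with a small host-side sketch in plain C. The slides do not show the body of globalLoads, so this assumes it reads 16 elements per work-item at a stride of 64, which is what the "coalesced global reads" comment and the data[16] array suggest. The names element_index and group_covers_fft_once are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { FFT_LEN = 1024, WORK_GROUP_SIZE = 64, POINTS_PER_ITEM = 16 };

/* Index of the i-th element touched by work-item `tid` of work-group
   `group`, assuming globalLoads steps through memory with stride 64. */
static size_t element_index(size_t group, size_t tid, size_t i) {
    assert(tid < WORK_GROUP_SIZE && i < POINTS_PER_ITEM);
    return group * FFT_LEN + tid + i * WORK_GROUP_SIZE;
}

/* Returns 1 if the 64 work-items of one work-group together touch each
   of that group's 1024 points exactly once, 0 otherwise. */
static int group_covers_fft_once(size_t group) {
    unsigned char hits[FFT_LEN];
    memset(hits, 0, sizeof hits);
    for (size_t tid = 0; tid < WORK_GROUP_SIZE; tid++) {
        for (size_t i = 0; i < POINTS_PER_ITEM; i++) {
            size_t idx = element_index(group, tid, i) - group * FFT_LEN;
            if (idx >= FFT_LEN || hits[idx] != 0) return 0;
            hits[idx] = 1;
        }
    }
    for (size_t k = 0; k < FFT_LEN; k++) {
        if (!hits[k]) return 0;
    }
    return 1;
}
```

Under that assumption, neighbouring work-items read neighbouring addresses on every load, and each work-group of 64 work-items handles one complete 1024-point transform.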


OpenCL and OpenGL

Sharing OpenGL Resources
  OpenCL is designed to efficiently share with OpenGL
    Textures, Buffer Objects and Renderbuffers
    Data is shared, not copied
  Efficient queuing of OpenCL and OpenGL commands
  Apps can select compute device(s) that will run OpenGL and OpenCL

Summary

A new compute language that works across GPUs and CPUs
  C99 with extensions
  Familiar to developers
  Includes a rich set of built-in functions
  Makes it easy to develop data- and task-parallel compute programs

Defines hardware and numerical precision requirements

Open standard for heterogeneous parallel computing