Munshi Opencl

OpenCL
Parallel Computing on the GPU and CPU
Aaftab Munshi
Opportunity: Processor Parallelism
Today's processors are increasingly parallel
CPUs
  Multiple cores are driving performance increases
GPUs
  Transforming into general purpose data-parallel computational coprocessors
  Improving numerical precision (single and double)
Beyond Programmable Shading: Fundamentals
Challenge: Processor Parallelism
Writing parallel programs is different for the CPU and GPU
  Differing domain-specific techniques
  Vendor-specific technologies
Graphics API is not an ideal abstraction for general purpose compute
Introducing OpenCL
OpenCL: Open Computing Language
Approachable language for accessing heterogeneous computational resources
Supports parallel execution on single or multiple processors
  GPU, CPU, GPU + CPU or multiple GPUs
Desktop and Handheld Profiles
Designed to work with graphics APIs such as OpenGL
OpenCL = Open Standard
Specification under review
  Royalty free, cross-platform, vendor neutral
  Khronos OpenCL working group (www.khronos.org)
Based on a proposal by Apple
  Developed in collaboration with industry leaders
  Performance-enhancing technology in Mac OS X Snow Leopard
OpenCL Working Group Members

Broad Industry Support
Copyright Khronos Group, 2008
OpenCL
A Sneak Preview
Design Goals of OpenCL
Use all computational resources in system
  GPUs and CPUs as peers
  Data- and task-parallel compute model
Efficient parallel programming model
  Based on C
  Abstract the specifics of underlying hardware
Specify accuracy of floating-point computations
  IEEE 754 compliant rounding behavior
  Define maximum allowable error of math functions
Drive future hardware requirements
OpenCL Software Stack
Platform Layer
  query and select compute devices in the system
  initialize a compute device(s)
  create compute contexts and work-queues
Runtime
  resource management
  execute compute kernels
Compiler
  A subset of ISO C99 with appropriate language additions
  Compile and build compute program executables
    online or offline
OpenCL Execution Model
Compute Kernel
  Basic unit of executable code, similar to a C function
  Data-parallel or task-parallel
Compute Program
  Collection of compute kernels and internal functions
  Analogous to a dynamic library
Applications queue compute kernel execution instances
  Queued in-order
  Executed in-order or out-of-order
  Events are used to implement appropriate synchronization of execution instances
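The queueing rules above can be illustrated with a small host-side model in plain C. This is a sketch, not the OpenCL API: the `Command` struct, its `wait_for` field and the `run_out_of_order` function are hypothetical names, and the scheduler deliberately scans the queue from the back to show that commands queued in-order may still execute out-of-order as long as their event dependency has completed.

```c
#include <stddef.h>

/* Hypothetical model of one queued kernel execution instance.
   wait_for is the queue index of the command whose completion event
   must be signaled first, or -1 if there is no dependency.
   Any non-negative wait_for is assumed to be a valid queue index. */
typedef struct {
    int wait_for;
    int done;
} Command;

/* Out-of-order execution: scan from the back of the queue so that
   later commands run first whenever their event allows it.
   Writes the execution order into order[] and returns how many
   commands were executed. */
size_t run_out_of_order(Command *q, size_t n, size_t *order) {
    size_t executed = 0;
    int progress = 1;
    while (executed < n && progress) {
        progress = 0;
        for (size_t k = n; k-- > 0; ) {
            if (q[k].done) continue;
            int dep = q[k].wait_for;
            if (dep >= 0 && !q[dep].done) continue; /* event not signaled yet */
            q[k].done = 1;
            order[executed++] = k;
            progress = 1;
        }
    }
    return executed;
}
```

With three queued commands where only the second waits on the first, this model runs the third command first, then the first, then the second; the event is what keeps the dependent pair in order.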
OpenCL Data-Parallel Execution
Define N-Dimensional computation domain
  Each independent element of execution in N-D domain is called a work-item
  The N-D domain defines the total number of work-items that execute in parallel: global work size
Work-items can be grouped together: work-group
  Work-items in group can communicate with each other
  Can synchronize execution among work-items in group to coordinate memory access
Execute multiple work-groups in parallel
  Mapping of global work size to work-groups
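The mapping of global work size to work-groups can be sketched for a one-dimensional domain in plain C. The struct and function names below are hypothetical, and the sketch assumes the global work size is an exact multiple of the work-group size.

```c
#include <stddef.h>

/* Position of one work-item in a 1-D domain (hypothetical struct). */
typedef struct {
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* index of the work-item inside its group   */
} WorkItemPos;

/* Number of work-groups; assumes global_size is a multiple of local_size. */
size_t num_groups(size_t global_size, size_t local_size) {
    return global_size / local_size;
}

/* Split a global index into (group, local) coordinates. */
WorkItemPos locate(size_t global_id, size_t local_size) {
    WorkItemPos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* Inverse mapping: rebuild the global index from (group, local). */
size_t global_id_of(WorkItemPos p, size_t local_size) {
    return p.group_id * local_size + p.local_id;
}
```

For example, a global work size of 1024 with work-groups of 64 gives 16 work-groups, and global index 130 is work-item 2 of work-group 2.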

OpenCL Task-Parallel Execution
Data-parallel execution model must be implemented by all OpenCL compute devices
Some compute devices such as CPUs can also execute task-parallel compute kernels
  Executes as a single work-item
  A compute kernel written in OpenCL
  A native C / C++ function
OpenCL Memory Model
Implements a relaxed consistency, shared memory model
Multiple distinct address spaces
  Address spaces can be collapsed
[Figure: OpenCL memory hierarchy. Each work-item (1 to M) has its own Private Memory; each compute unit (1 to N) has a Local Memory; the compute device has a Global / Constant Memory Data Cache; Global Memory resides in compute device memory.]
Address Qualifiers
  __private
  __local
  __constant and __global
Example:
  __global float4 *p;
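How the address spaces cooperate can be emulated in plain C. The sketch below is not OpenCL code: it models one work-group summing its slice of a buffer, with ordinary arrays standing in for global memory, a stack array for local memory and a loop variable for private memory. The function name and `GROUP_SIZE` value are illustrative assumptions.

```c
#include <stddef.h>

#define GROUP_SIZE 4  /* work-items per work-group (illustrative value) */

/* Emulates one work-group summing its slice of global memory.
   global_in / global_out stand in for __global buffers,
   local_scratch for a __local array shared by the group,
   and the variable `mine` for a __private value. */
void group_sum(const float *global_in, float *global_out, size_t group_id) {
    float local_scratch[GROUP_SIZE];

    /* Phase 1: every work-item copies one element into local memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        float mine = global_in[group_id * GROUP_SIZE + lid]; /* private */
        local_scratch[lid] = mine;
    }

    /* A real kernel would need a work-group synchronization point here
       so that all local writes are visible before any read. The
       sequential loops in this emulation make it implicit. */

    /* Phase 2: one work-item reduces and writes the result to global memory. */
    float total = 0.0f;
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        total += local_scratch[lid];
    global_out[group_id] = total;
}
```

Calling `group_sum` once per group over an eight-element input yields one partial sum per work-group in the output buffer.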
Language for writing compute kernels
Derived from ISO C99
A few restrictions
  Recursion, function pointers, functions in C99 standard headers ...
Preprocessing directives defined by C99 are supported
Built-in Data Types
  Scalar and vector data types
  Structs, Pointers
  Data-type conversion functions
    convert_type<_sat><_roundingmode>
  Image types
Built-in Functions
Required
  work-item functions
  math.h
  read and write image
  relational
  geometric functions
  synchronization functions
Optional
  double precision
  atomics to global and local memory
  selection of rounding mode
OpenCL FFT Example - Host API

// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL);

// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
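The size expressions in the host listing can be checked with a few lines of plain C. The helper names below are hypothetical; they only restate the arithmetic used for the buffer objects and the two local-memory kernel arguments.

```c
#include <stddef.h>

/* Bytes requested for each local-memory kernel argument in the listing:
   sizeof(float) * (local_work_size + 1) * 16. */
size_t local_arg_bytes(size_t local_work_size) {
    return sizeof(float) * (local_work_size + 1) * 16;
}

/* Bytes for a buffer of num_entries complex values stored as float pairs:
   sizeof(float) * 2 * num_entries. */
size_t complex_buffer_bytes(size_t num_entries) {
    return sizeof(float) * 2 * num_entries;
}
```

With a work-group size of 64 and a 4-byte `float`, each local-memory argument is 65 * 16 floats, or 4160 bytes, and a 1024-entry complex buffer is 8192 bytes.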
OpenCL FFT Example - Compute

// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian, UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64); // coalesced global reads

    fftRadix16Pass(data); // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

    fftRadix16Pass(data); // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}
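The kernel's work decomposition (64 work-items, 16 complex values each, 1024 values per work-group) can be sanity-checked in plain C. The body of `globalLoads` is not shown in the slides, so the access pattern below, in which work-item `tid` reads element `tid + i * stride` of its group's slice, is an assumption made for illustration, and all names are hypothetical.

```c
#include <stddef.h>

enum { FFT_LEN = 1024, GROUP_ITEMS = 64, PER_ITEM = 16 };

/* Assumed access pattern of globalLoads(data, in, 64): work-item tid
   reads element tid + i * stride for i = 0..15, relative to the start
   of its work-group's slice. Illustration only. */
size_t load_index(size_t tid, size_t i, size_t stride) {
    return tid + i * stride;
}

/* Counts how often each of the 1024 elements is read by one work-group.
   Returns 1 if every element is read exactly once, 0 otherwise. */
int covers_once(void) {
    int hits[FFT_LEN] = {0};
    for (size_t tid = 0; tid < GROUP_ITEMS; ++tid)
        for (size_t i = 0; i < PER_ITEM; ++i)
            hits[load_index(tid, i, GROUP_ITEMS)]++;
    for (size_t k = 0; k < FFT_LEN; ++k)
        if (hits[k] != 1) return 0;
    return 1;
}
```

Under this assumption, neighbouring work-items touch neighbouring addresses on every iteration, which is the property the "coalesced global reads" comment refers to.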
OpenCL and OpenGL

Sharing OpenGL Resources
OpenCL is designed to efficiently share with OpenGL
  Textures, Buffer Objects and Renderbuffers
  Data is shared, not copied
Efficient queuing of OpenCL and OpenGL commands
Apps can select compute device(s) that will run OpenGL and OpenCL
Summary
A new compute language that works across GPUs and CPUs
  C99 with extensions
  Familiar to developers
  Includes a rich set of built-in functions
  Makes it easy to develop data- and task-parallel compute programs
Defines hardware and numerical precision requirements
Open standard for heterogeneous parallel computing
