Aaftab Munshi
Opportunity: Processor
Today's processors are increasingly parallel
CPUs
  Multiple cores are driving performance increases
GPUs
  Transforming into general purpose data-parallel computational coprocessors
  Improving numerical precision (single and double)
Introducing OpenCL
OpenCL: Open Computing Language
Approachable language for accessing heterogeneous computational resources (processors)
Desktop and Handheld Profiles
Designed to work with graphics APIs such as OpenGL
Royalty free, cross-platform, vendor neutral
Khronos OpenCL working group (www.khronos.org)
Developed in collaboration with industry leaders
Performance-enhancing technology in Mac OS X Snow Leopard
OpenCL
A Sneak Preview
query and select compute devices in the system
initialize a compute device(s)
create compute contexts and work-queues
Runtime
  resource management
  execute compute kernels
Compiler
  A subset of ISO C99 with appropriate language additions
  Compile and build compute program executables
  online or offline
Compute Kernel
  Basic unit of executable code, similar to a C function
  Data-parallel or task-parallel
Compute Program
  Collection of compute kernels and internal functions
  Analogous to a dynamic library
Queued in-order
Executed in-order or out-of-order
Events are used to implement appropriate
Each independent element of execution in the N-D domain is called a work-item
The N-D domain defines the total number of work-items that execute in parallel: the global work size
Work-items in a group can communicate with each other
Can synchronize execution among work-items in a group to coordinate memory access
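The relationship between a work-item's position in the global domain and its position inside a work-group can be sketched in plain C. This is a host-side illustration for a 1-D domain with no global offset, not OpenCL API code; the type `work_item_pos` and the functions `split_global_id` and `join_global_id` are hypothetical names introduced here. In a kernel, the corresponding values would come from `get_global_id(0)`, `get_group_id(0)`, `get_local_id(0)` and `get_local_size(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type; not part of the OpenCL API. */
typedef struct {
    size_t group_id;  /* which work-group the work-item belongs to */
    size_t local_id;  /* position of the work-item inside its group */
} work_item_pos;

/* Split a 1-D global ID into (group ID, local ID) for a given
   work-group size. Assumes the global size is a multiple of local_size. */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    assert(local_size > 0);
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Rebuild the global ID from its (group ID, local ID) pair. */
size_t join_global_id(work_item_pos pos, size_t local_size)
{
    return pos.group_id * local_size + pos.local_id;
}
```

With a work-group size of 64, global ID 130 lands in group 2 at local position 2, and joining the pair gives 130 back.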
Executes as a single work-item
A compute kernel written in OpenCL
A native C / C++ function
Figure (memory model diagram): each compute unit (Compute Unit 1 to N) contains work-items 1 to M; each work-item has its own private memory (__private), and each compute unit has a local memory (__local).
Example:
convert_type<_sat><_roundingmode>
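As a rough illustration of what the `_sat` suffix means, the plain-C sketch below emulates a saturating int-to-unsigned-char conversion, the behavior `convert_uchar_sat` has for an `int` argument in OpenCL C. The function names `emulate_convert_uchar_sat`, `plain_cast_uchar` and `saturation_examples` are hypothetical, and rounding modes are left out because they only matter for floating-point sources.

```c
#include <assert.h>
#include <limits.h>

/* Plain-C stand-in for a saturating conversion to unsigned char:
   out-of-range inputs clamp to the nearest representable value. */
unsigned char emulate_convert_uchar_sat(int x)
{
    if (x < 0) return 0;
    if (x > UCHAR_MAX) return UCHAR_MAX;
    return (unsigned char)x;
}

/* For contrast: an ordinary C cast to an unsigned type wraps around
   (modulo 256 when CHAR_BIT == 8). */
unsigned char plain_cast_uchar(int x)
{
    return (unsigned char)x;
}

/* Usage example, assuming 8-bit chars. */
void saturation_examples(void)
{
    assert(emulate_convert_uchar_sat(300) == 255); /* clamped high */
    assert(emulate_convert_uchar_sat(-5) == 0);    /* clamped low */
    assert(emulate_convert_uchar_sat(42) == 42);   /* in range, unchanged */
    assert(plain_cast_uchar(300) == 44);           /* 300 mod 256 */
}
```

Saturation is usually what image and signal code wants, since a wrapped value turns a slightly too bright pixel into a dark one.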
Image types
work-item functions
math.h
read and write image
relational
geometric functions
synchronization functions
double precision
atomics to global and local memory
selection of rounding mode
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);            // coalesced global reads

    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));

    fftRadix16Pass(data);                 // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);   // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}
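The kernel's indexing can be checked with a small host-side sketch in plain C. It rests on two assumptions that the slide does not state: that a work-group has 64 work-items (1024 points divided by the 16 elements each work-item holds in `data`), and that `globalLoads(data, in, 64)` reads element `i` of `data` from `in[i * 64]`. Under those assumptions, the function `strided_loads_cover_block` (a hypothetical name) confirms that every point of a 1024-point block is read exactly once.

```c
#include <assert.h>
#include <string.h>

enum {
    POINTS   = 1024, /* points per FFT, one work-group per FFT */
    ITEMS    = 64,   /* ASSUMED work-group size (1024 / 16) */
    PER_ITEM = 16,   /* size of the kernel's private data[] array */
    STRIDE   = 64    /* third argument of globalLoads in the kernel */
};

/* Returns 1 if, under the assumed access pattern, the work-items of
   work-group `group` touch each of its 1024 points exactly once. */
int strided_loads_cover_block(int group)
{
    unsigned char hits[POINTS];
    int tid, i, p;

    memset(hits, 0, sizeof hits);
    for (tid = 0; tid < ITEMS; ++tid) {
        /* same expression as in the kernel */
        int blockIdx = group * POINTS + tid;
        for (i = 0; i < PER_ITEM; ++i) {
            /* ASSUMED load address, made relative to the block start */
            int idx = blockIdx + i * STRIDE - group * POINTS;
            assert(idx >= 0 && idx < POINTS);
            hits[idx]++;
        }
    }
    for (p = 0; p < POINTS; ++p)
        if (hits[p] != 1)
            return 0;
    return 1;
}
```

In this pattern, for a fixed `i` consecutive work-items read consecutive addresses, which is consistent with the "coalesced global reads" comment in the kernel.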
Efficient queuing of OpenCL and OpenGL commands
Apps can select compute device(s) that will run OpenGL and OpenCL
Summary
A new compute language that works across GPUs and CPUs
C99 with extensions
Familiar to developers
Includes a rich set of built-in functions
Makes it easy to develop data- and task-parallel compute programs