
Introduction to OpenCL

Module Overview
- Overview
- OpenCL Architecture & Programming Model
- Basic components for getting started
- Information on tools

OVERVIEW

OpenCL
- OpenCL: Open Computing Language
- Open standard: royalty-free, cross-platform, vendor-neutral
- Standard for accessing heterogeneous computational resources
- Targets a GPU, a CPU, GPU + CPU, or multiple GPUs

What is OpenCL: Processor Parallelism
- CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP
- GPUs: increasingly general-purpose data-parallel computing with improving numerical precision; graphics APIs and shading languages
- Emerging intersection: heterogeneous computing, the area OpenCL addresses

OpenCL: Open Computing Language
- Open, royalty-free standard for portable, parallel programming of heterogeneous CPUs, GPUs, and other processors

Design Goals of OpenCL
- Use all computational resources in the system
  - Program GPUs, CPUs, Cell, DSP and other processors as peers
  - Support both data-parallel and task-parallel compute models
- Low-level, high-performance but portable
  - Primarily targeted at expert developers
  - Foundation for a parallel computing ecosystem
- C-based programming model
- Specify accuracy of floating-point computations
  - IEEE 754 compliant rounding behavior
  - Define maximum allowable error of math functions
- Defines a configuration profile for handheld and embedded devices
- Close integration with OpenGL and other 3D APIs

OpenCL Software Stack
- Interface: an API designed to be independent of graphics
- High-level language: C extended to express parallelism
- Runtime libraries: allow GPU memory management
- How does it fit with vendor-specific architectures?

OPENCL ARCHITECTURE & PROGRAMMING MODEL

OpenCL Platform Model
- One host plus one or more compute devices
  - Each compute device is composed of one or more compute units
  - Each compute unit is further divided into one or more processing elements
- Computations on a device occur within the processing elements
- An OpenCL application runs on a host and submits commands from the host to execute computations on the processing elements within a device
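The host / compute device / compute unit / processing element hierarchy can be pictured as nested containers. The following is a minimal sketch in plain C; the struct and function names are hypothetical and are not OpenCL API types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not part of the OpenCL API. */
typedef struct {
    size_t processing_elements;    /* PEs inside this compute unit */
} ComputeUnit;

typedef struct {
    const ComputeUnit *units;      /* compute units of this device */
    size_t unit_count;
} ComputeDevice;

typedef struct {
    const ComputeDevice *devices;  /* devices attached to the one host */
    size_t device_count;
} Platform;

/* Total number of processing elements reachable from the host. */
size_t platform_total_pes(const Platform *p)
{
    assert(p != NULL);
    size_t total = 0;
    for (size_t d = 0; d < p->device_count; ++d)
        for (size_t u = 0; u < p->devices[d].unit_count; ++u)
            total += p->devices[d].units[u].processing_elements;
    return total;
}
```

For example, a platform with one device that has two compute units of 16 processing elements each yields a total of 32.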

GPU as Co-processor
- GPU as compute device
  - Has its own DRAM (video memory)
  - Can run many threads in parallel
- The application runs on the host
- The compute-intensive, data-parallel part is sent to the GPU
  - Written as C functions called kernels
  - A kernel is executed on the device by many threads simultaneously

Programming Model
Host application (e.g. Opteron CPU with main memory) and GPU kernel (e.g. FireStream GPU with GPU memory):
1. Host: load/initialize input data in main memory
2. Host: copy input data from host to GPU memory
3. GPU kernel: process input data and write to output
4. Host: copy output from GPU to host memory
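The copy-in, process, copy-out flow can be emulated on the CPU to make the separate memories explicit. This is only a sketch: the "device" buffers are ordinary heap allocations, the function names are hypothetical, and the doubling operation stands in for whatever the real kernel computes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the GPU kernel (assumed operation: double every element). */
static void process_on_device(const float *dev_in, float *dev_out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dev_out[i] = 2.0f * dev_in[i];
}

/* Returns 0 on success, -1 if the emulated device allocation fails. */
int run_offload(const float *host_in, float *host_out, size_t n)
{
    assert(host_in != NULL && host_out != NULL && n > 0);
    size_t bytes = n * sizeof(float);
    float *dev_in = malloc(bytes);    /* emulated GPU memory */
    float *dev_out = malloc(bytes);
    if (dev_in == NULL || dev_out == NULL) {
        free(dev_in);
        free(dev_out);
        return -1;
    }

    memcpy(dev_in, host_in, bytes);         /* copy input: host -> GPU   */
    process_on_device(dev_in, dev_out, n);  /* process and write output  */
    memcpy(host_out, dev_out, bytes);       /* copy output: GPU -> host  */

    free(dev_in);
    free(dev_out);
    return 0;
}
```

In a real OpenCL program the two copies become buffer transfer commands and the processing step becomes a kernel launch, all issued through a command queue.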

Implicit Data Parallelism

C
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

C - Rewritten
/* The per-element work is factored out into a "kernel" function. */
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}

void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        C[i] = sum_kernel(i, A, B);
}

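C - Rewritten 2
/* Pseudocode: launch_thread() and sync_threads() are placeholders, not a real API. */
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}

void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        launch_thread(C[i] = sum_kernel(i, A, B));  /* one thread per element */
    sync_threads();                                 /* wait for all threads */
}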

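OpenCL
// Kernel definition
__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    // The global ID is unique across the whole index space,
    // not only within one work-group
    int i = get_global_id(0);
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation from the host; n is the vector length
    size_t globalWorkSize[] = {n};  // number of OpenCL threads (work-items)
    size_t localWorkSize[] = {n};   // one work-group; n must not exceed the
                                    // device's maximum work-group size
    clEnqueueNDRangeKernel(.., 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}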
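Conceptually, the runtime calls the kernel body once per work-item, handing each call a different ID. The following serial emulation in plain C is a sketch of that idea; `emulate_vec_add` and `vec_add_item` are hypothetical names, and a real device runs the work-items in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body for one work-item; gid plays the role of get_global_id(0). */
static void vec_add_item(size_t gid, const float *A, const float *B, float *C)
{
    C[gid] = A[gid] + B[gid];
}

/* Serial stand-in for a 1D launch with global_size work-items. */
void emulate_vec_add(size_t global_size,
                     const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t gid = 0; gid < global_size; ++gid)
        vec_add_item(gid, A, B, C);
}
```

Because no work-item reads a value written by another, the order in which the loop visits the IDs does not affect the result, which is what allows the device to run them concurrently.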

Kernel
- Each thread (work-item) has a unique thread ID, accessible within the kernel through an intrinsic function
- Function qualifier: the __kernel qualifier declares a function as a kernel

__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    int i = get_global_id(0);  // unique thread ID
    C[i] = A[i] + B[i];
}

Work-Group
- Work-items are organized into work-groups
- A group can be a 1D, 2D or 3D array of work-items
  - Specified during kernel invocation
  - Helpful when invoking kernels on matrices and fields
- Each work-item within a group can be identified by a 1D, 2D or 3D ID
  - Built-in function get_local_id()

Work-items (WI) of a 5 x 3 work-group:
WI (0, 0)  WI (0, 1)  WI (0, 2)
WI (1, 0)  WI (1, 1)  WI (1, 2)
WI (2, 0)  WI (2, 1)  WI (2, 2)
WI (3, 0)  WI (3, 1)  WI (3, 2)
WI (4, 0)  WI (4, 1)  WI (4, 2)
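When a launch uses more than one work-group, the local ID only identifies a work-item inside its own group. Per dimension, and with no global offset, the IDs are related by global = group * local_size + local. A minimal sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Global ID of a work-item from its group ID and local ID (one dimension). */
size_t global_from_local(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}

/* Inverse mapping: which group a global ID falls into ... */
size_t group_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id / local_size;
}

/* ... and its position inside that group. */
size_t local_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id % local_size;
}
```

With a work-group size of 64, for instance, local ID 5 in group 2 corresponds to global ID 133. Inside a kernel these values come from the built-ins get_group_id(), get_local_size(), get_local_id() and get_global_id().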

Work-Group
Example of 2D work-group
// Add two matrices A and B of dimension NxN and store the
// result into C
__kernel void matAdd(int N, __global float* A, __global float* B, __global float* C)
{
    // Local IDs suffice here only because a single N x N work-group
    // covers the whole problem; with several groups use get_global_id()
    int i = get_local_id(0);
    int j = get_local_id(1);
    C[j * N + i] = A[j * N + i] + B[j * N + i];
}

// host code
int main()
{
    // Declare, allocate and initialize device memory A, B & C

    // Kernel invocation: work_dim is 2 for a 2D index space
    size_t globalWorkSize[] = {N, N};
    size_t localWorkSize[] = {N, N};
    clEnqueueNDRangeKernel(.., 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
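The row-major indexing used by the matrix kernel can be checked on the CPU by looping over both ID dimensions. This is a serial sketch with hypothetical function names, where `i` and `j` take the place of the dimension-0 and dimension-1 IDs.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one 2D work-item: i is the column ID, j is the row ID. */
static void mat_add_item(size_t i, size_t j, size_t N,
                         const float *A, const float *B, float *C)
{
    size_t idx = j * N + i;  /* row-major offset of element (j, i) */
    C[idx] = A[idx] + B[idx];
}

/* Serial stand-in for an N x N two-dimensional launch. */
void emulate_mat_add(size_t N, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t j = 0; j < N; ++j)      /* dimension 1 */
        for (size_t i = 0; i < N; ++i)  /* dimension 0 */
            mat_add_item(i, j, N, A, B, C);
}
```

Letting dimension 0 index the column means neighbouring work-items touch neighbouring memory addresses, which is usually the access pattern GPUs handle best.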

An N-dimensional domain of work-items
- Global dimensions: 1024 x 1024 (whole problem space)
- Local dimensions: 128 x 128 (executed together)
- Choose the dimensions that are best for your algorithm
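The global and local dimensions together determine how many work-groups a launch creates. A small sketch, assuming the local size must divide the global size evenly in each dimension (as OpenCL 1.x requires); `groups_in_dim` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one dimension, or 0 when the local size
   does not divide the global size evenly. */
size_t groups_in_dim(size_t global, size_t local)
{
    assert(local > 0);
    return (global % local == 0) ? global / local : 0;
}
```

For the 1024 x 1024 domain with 128 x 128 groups this gives 8 groups per dimension, so 64 groups of 16384 work-items each. Real devices cap the work-group size (queryable as CL_DEVICE_MAX_WORK_GROUP_SIZE), so a local size this large may need to be reduced on actual hardware.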

Example Problem Dimensions
- 1D: 1 million elements in an array:
  size_t global_dim[3] = {1000000, 1, 1};
- 2D: 1920 x 1200 HD video frame, 2.3M pixels:
  size_t global_dim[3] = {1920, 1200, 1};
- 3D: 256 x 256 x 256 volume, 16.7M voxels:
  size_t global_dim[3] = {256, 256, 256};
- Choose the dimensions that are best for your algorithm
  - Maps well
  - Performs well
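Problem sizes such as 1,000,000 are often not a multiple of the chosen work-group size. A common workaround is to pad the global size up to the next multiple and have the kernel skip the surplus IDs. The helper below is a sketch; its name and the local sizes in the example are assumptions, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local that is greater than or equal to n. */
size_t round_up_global(size_t n, size_t local)
{
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}
```

With a local size of 256, the 1D example pads from 1000000 to 1000192 work-items, while the 1920 x 1200 frame needs no padding for 16 x 16 groups. When padding is used, the kernel needs a guard such as `if (i < n)` so the extra work-items do not access memory out of bounds.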

BASIC COMPONENTS FOR GETTING STARTED

Basic OpenCL Program Structure
- Kernels
  - Language: C code with some restrictions and extensions
- Host program
  - Platform layer
    - Query compute devices
    - Create contexts
  - Runtime
    - Create memory objects associated with contexts
    - Compile and create kernel program objects
    - Issue commands to the command queue
    - Synchronize commands
    - Clean up OpenCL resources

Typical OpenCL Program
- Computation-intensive, data-parallel function written as a kernel
- Host-side code
  1. Context creation
  2. Allocate memory on the device
  3. Host-to-device data transfer
  4. Compilation and creation of kernel program objects
  5. Bind memory objects to kernel arguments
  6. Call a kernel function to be executed on the device
  7. Read back result data from the device

INFORMATION ON TOOLS

OpenCL Implementation
- AMD's implementation
  - Ships with ATI Stream SDK v2.0
  - Released on 21 Dec 2009
  - Requires an ATI GPU >= RV7XX

OpenCL Installation
- ATI Stream SDK
  - Environment variables
    - $(ATISTREAMSDKROOT) = ATI Stream SDK installation directory
    - $(ATISTREAMSDKSAMPLESROOT) = ATI Stream SDK Samples installation directory

ATI OpenCL SDK
- Header files
  - cl.h, cl_gl.h, cl_platform.h under $(ATISTREAMSDKROOT)\include\CL
- Library files
  - OpenCL.lib under $(ATISTREAMSDKROOT)\lib\x86
- Dynamic link library
  - OpenCL.dll under $(ATISTREAMSDKROOT)\bin\x86
  - Make sure Path contains this directory

Recap and Q&A
- Overview & Programming model
- Basic components for getting started
- Information on tools
