1 - Introduction To OpenCL

Introduction to OpenCL
Module Overview
Overview
OpenCL Architecture & Programming Model
Basic components for getting started
Information on tools
OVERVIEW
OpenCL
OpenCL: Open Computing Language
Open standard
Royalty free, cross-platform, vendor neutral
Standard for accessing heterogeneous computational resources
  GPU, CPU, GPU+CPU or multiple GPUs
What is OpenCL: Processor Parallelism

CPUs
  Multiple cores driving performance increases
  Multi-processor programming, e.g. OpenMP
GPUs
  Increasingly general-purpose data-parallel computing
  Improving numerical precision
  Graphics APIs and shading languages
Emerging intersection
  OpenCL
  Heterogeneous computing
OpenCL Open Computing Language

Open, royalty-free standard for portable, parallel programming of heterogeneous CPUs, GPUs, and other processors
Design Goals of OpenCL

Use all computational resources in system
  Program GPUs, CPUs, Cell, DSP and other processors as peers
  Support both data- and task-parallel compute models
Low-level, high-performance but portable
  Primarily targeted at expert developers
  Foundation for parallel computing ecosystem
C-based programming model
Specify accuracy of floating-point computations
  IEEE 754 compliant rounding behavior
  Define maximum allowable error of math functions
Defines a configuration profile for handheld and embedded devices
Close integration with OpenGL and other 3D APIs
OpenCL Software Stack
Interface
  API designed for compute, independent of graphics
High-level language
  Extended C to express parallelism
Runtime libraries
  Allow GPU memory management
How does it fit with vendor-specific architecture?
OPENCL ARCHITECTURE & PROGRAMMING MODEL
OpenCL Platform Model
One Host + one or more compute devices

Each Compute Device is composed of one or more Compute Units
Each Compute Unit is further divided into one or more Processing Elements
OpenCL Platform Model

Computations on a device occur within the processing elements
An OpenCL application runs on a host and submits commands from the host to execute computations on the processing elements within a device
GPU as Co-processor
GPU as compute device
  Has its own DRAM (video memory)
  Can run multiple threads in parallel
Application runs on host
  The compute-intensive, data-parallel part is sent to the GPU
  Written as C functions called kernels
  The kernel is executed on the device simultaneously by multiple threads
Programming Model
Host application (Opteron, main memory) and GPU kernel (FireStream, GPU memory)
  Host: load/initialize input data
  Copy input data from host to GPU memory
  GPU kernel: process input data and write to output
  Copy output from GPU to host memory
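The load, copy-in, process, copy-out sequence above can be sketched on the host alone. This is a plain C emulation, not OpenCL API code: "GPU memory" is modeled by separate heap buffers, and the names `run_offload` and `process_on_device`, as well as the doubling operation, are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the GPU kernel: doubles every input element.
 * The operation is an assumption; the slides do not name one. */
static void process_on_device(const float *dev_in, float *dev_out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dev_out[i] = 2.0f * dev_in[i];
}

/* Emulates the offload flow with "device memory" modeled as separate
 * heap buffers. Returns 0 on success, -1 if an allocation fails. */
int run_offload(const float *host_in, float *host_out, size_t n)
{
    assert(host_in != NULL && host_out != NULL && n > 0);

    float *dev_in = malloc(n * sizeof *dev_in);
    float *dev_out = malloc(n * sizeof *dev_out);
    if (dev_in == NULL || dev_out == NULL) {
        free(dev_in);
        free(dev_out);
        return -1;
    }

    memcpy(dev_in, host_in, n * sizeof *dev_in);    /* copy input: host -> device */
    process_on_device(dev_in, dev_out, n);          /* kernel processes the data  */
    memcpy(host_out, dev_out, n * sizeof *dev_out); /* copy output: device -> host */

    free(dev_in);
    free(dev_out);
    return 0;
}
```

The point of the sketch is that the kernel only ever touches the device-side buffers; the host sees results only after the explicit copy back.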
Implicit Data Parallelism

C
void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}
C - Rewritten
// Work done for one element
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}

void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        C[i] = sum_kernel(i, A, B);
}
Implicit Data Parallelism

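C - Rewritten 2
// Pseudocode: launch_thread() and sync_threads() stand for
// "run this in its own thread" and "wait for all threads to finish"
float sum_kernel(int x, float A[], float B[])
{
    return A[x] + B[x];
}

void sum(float A[], float B[], float C[], int n)
{
    for (int i = 0; i < n; i++)
        launch_thread(C[i] = sum_kernel(i, A, B));
    sync_threads();
}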
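Since `launch_thread` and `sync_threads` are not real functions, a compilable version may help. The sketch below is a serial stand-in, assuming an explicit length parameter `n`: it walks the indices in reverse order to show that the result does not depend on the order in which elements are processed, which is the property that lets each index be handed to its own thread.

```c
#include <assert.h>
#include <stddef.h>

/* Per-element work: the body one logical thread would execute. */
static float sum_kernel(size_t x, const float A[], const float B[])
{
    return A[x] + B[x];
}

/* Runs the kernel once per index. Indices are visited in reverse to
 * underline that no iteration depends on any other iteration. */
void sum(const float A[], const float B[], float C[], size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = n; i-- > 0; )
        C[i] = sum_kernel(i, A, B);
}
```

Any visiting order, including a fully concurrent one, writes the same values into `C`, because each iteration reads and writes only its own index.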
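OpenCL
// Kernel definition
__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    // Index of this work-item in the whole index space. With a single
    // work-group covering all n items it equals get_local_id(0).
    int i = get_global_id(0);
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation from host
    size_t globalWorkSize[] = {n};  // number of OpenCL threads (work-items)
    size_t localWorkSize[] = {n};   // one work-group; n must not exceed the device's maximum work-group size
    clEnqueueNDRangeKernel(.., 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}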
Kernel
Each thread has a unique thread ID
__kernel void vecAdd(__global float* A, __global float* B, __global float* C)
{
    // Unique thread ID, accessible within the kernel through an intrinsic function:
    // get_global_id() is unique across the whole index space,
    // get_local_id() is unique only within a work-group
    int i = get_global_id(0);
    C[i] = A[i] + B[i];
}
Function qualifier
  The __kernel qualifier declares a function as a kernel
Work-Group
Work-items are organized into work-groups
Group can be a 1D, 2D or 3D array of work-items
  Specified during kernel invocation
  Helpful to invoke kernels on matrices, fields
Each work-item within a group can be identified by a 1D, 2D or 3D id
  Built-in function get_local_id()
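The local id relates to the position in the whole index space by simple arithmetic. The sketch below is plain C, not kernel code, and assumes one dimension, a zero global offset, and a global size that is a multiple of the work-group size; the type `work_item_pos` and the functions `locate` and `global_from` are hypothetical names. Inside a real kernel the same values come from the built-ins `get_group_id()`, `get_local_id()` and `get_global_id()`.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work-item: which group it is in, and where inside it. */
typedef struct {
    size_t group_id;  /* what get_group_id(0) would report */
    size_t local_id;  /* what get_local_id(0) would report */
} work_item_pos;

/* Splits a global index into (group, local) for a given work-group size. */
work_item_pos locate(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos p = { global_id / local_size, global_id % local_size };
    return p;
}

/* Inverse mapping: global = group * local_size + local. */
size_t global_from(work_item_pos p, size_t local_size)
{
    return p.group_id * local_size + p.local_id;
}
```

For example, with work-groups of 4 items, global index 10 is the item with local id 2 in group 2.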
Work-Group
WI (0, 0)  WI (0, 1)  WI (0, 2)
WI (1, 0)  WI (1, 1)  WI (1, 2)
WI (2, 0)  WI (2, 1)  WI (2, 2)
WI (3, 0)  WI (3, 1)  WI (3, 2)
WI (4, 0)  WI (4, 1)  WI (4, 2)
Work-Group
Example of 2D work-group
// Add two matrices A and B of dimension NxN and store the
// result into C
__kernel void matAdd(int N, __global float* A, __global float* B, __global float* C)
{
    int i = get_global_id(0);  // column index
    int j = get_global_id(1);  // row index
    C[j * N + i] = A[j * N + i] + B[j * N + i];
}

// host code
int main()
{
    // Declare, allocate and initialize device memory A, B & C

    // Kernel invocation
    size_t globalWorkSize[] = {N, N};
    size_t localWorkSize[] = {N, N};  // one NxN work-group
    // work_dim is 2 because the index space is two-dimensional
    clEnqueueNDRangeKernel(.., 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
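To check the row-major indexing used by matAdd without a device, the 2D index space can be emulated with two nested loops. This is a host-only sketch in plain C; `mat_add_item` and `mat_add` are hypothetical names, and the loops stand in for the N x N work-items that OpenCL would run concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item at column i, row j (row-major layout). */
static void mat_add_item(size_t i, size_t j, size_t N,
                         const float *A, const float *B, float *C)
{
    size_t idx = j * N + i;
    C[idx] = A[idx] + B[idx];
}

/* Visits every (i, j) pair of the N x N index space exactly once. */
void mat_add(size_t N, const float *A, const float *B, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            mat_add_item(i, j, N, A, B, C);
}
```

Each (i, j) pair touches exactly one element, so no two work-items write to the same location.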
An N-dimension domain of work-items

Global Dimensions: 1024 x 1024 (whole problem space)
Local Dimensions: 128 x 128 (executed together)
Choose the dimensions that are best for your algorithm
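The relation between the two sets of dimensions is a division per axis. The helper names below are hypothetical, and the sketch assumes the OpenCL 1.x rule that the global size must be evenly divisible by the local size in each dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one dimension. */
size_t groups_per_dim(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Number of work-items inside one 2D work-group. */
size_t items_per_group_2d(size_t local_x, size_t local_y)
{
    return local_x * local_y;
}
```

For the numbers above this gives 8 groups per axis (64 groups in total), each holding 16384 work-items. Whether a device accepts a group that large depends on its `CL_DEVICE_MAX_WORK_GROUP_SIZE`, so smaller local dimensions may be needed in practice.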
Example Problem Dimensions

1D: 1 million elements in an array:
global_dim[3] = {1000000, 1, 1};
2D: 1920 x 1200 HD video frame, 2.3M pixels:

global_dim[3] = {1920, 1200, 1};
3D: 256 x 256 x 256 volume, 16.7M voxels:

global_dim[3] = {256, 256, 256};
Choose the dimensions that are best for your algorithm
  Maps well
  Performs well
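The element counts quoted for the three examples are the products of the global dimensions. A small sketch, with `total_work_items` as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Total number of work-items in a 3-entry global dimension array.
 * Unused dimensions are expected to be set to 1, as in the examples. */
size_t total_work_items(const size_t global_dim[3])
{
    assert(global_dim != NULL);
    size_t total = 1;
    for (int d = 0; d < 3; d++)
        total *= global_dim[d];
    return total;
}
```

Applied to the examples, the products are 1,000,000 for the array, 2,304,000 for the video frame, and 16,777,216 for the volume.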
BASIC COMPONENTS FOR GETTING STARTED
Basic OpenCL Program Structure

Kernels
  Language
    C code with some restrictions and extensions
Host program
  Platform layer
    Query compute devices
    Create contexts
  Runtime
    Create memory objects associated to contexts
    Compile and create kernel program objects
    Issue commands to command-queue
    Synchronization of commands
    Clean up OpenCL resources
Typical OpenCL Program

Computation-intensive, data-parallel function written as a kernel
Host-side code
  Context creation
  Allocate memory on device
  Host-to-device data transfer
  Compilation and creation of kernel program objects
  Bind memory objects to kernel arguments
  Call a kernel function to be executed on device
  Read back result data from device
INFORMATION ON TOOLS
OpenCL Implementation
AMD's implementation
  Ships with ATI Stream SDK v2.0
  Released on 21st Dec, 2009
  Requires ATI GPU >= RV7XX
OpenCL Installation
ATI Stream SDK
Environment variable
$(ATISTREAMSDKROOT) = ATI Stream SDK installation directory
$(ATISTREAMSDKSAMPLESROOT) = ATI Stream SDK Samples installation directory
ATI OpenCL SDK

Header files
cl.h, cl_gl.h, cl_platform.h under $(ATISTREAMSDKROOT)\include\CL
Library files
OpenCL.lib under $(ATISTREAMSDKROOT)\lib\x86
Dynamic Link Library

OpenCL.dll under $(ATISTREAMSDKROOT)\bin\x86
Make sure Path contains this directory
Recap and Q&A

Overview & Programming model
Basic components for getting started
Information on tools
