
CS521 CSE IITG, 11/23/2012

[Figure: Tesla hardware/software hierarchy. The Streaming Processor Array (a row of Texture Processor Clusters, TPCs) executes a grid of thread blocks; each TPC contains Streaming Multiprocessors (SMs) and a Texture Unit; each SM, with its SPs, SFUs, and shared memory, runs multiple thread blocks, i.e. many warps of threads; individual threads execute on the SPs.]

240 shader cores
1.4B transistors
Up to 2 GB onboard memory
~150 GB/sec memory bandwidth
1.06 SP TFLOPS
CUDA and OpenCL support
Programmable memory spaces
Tesla S1070 provides 4 GPUs in a 1U unit

GeForce 8 Series is a massively parallel computing platform:
12,288 concurrent threads, hardware managed
128 Thread Processor cores at 1.35 GHz == 518 GFLOPS peak
GPU Computing features enable C on the Graphics Processing Unit

[Figure: GeForce 8 block diagram. A Host CPU feeds a Work Distribution unit, which dispatches work to 16 multiprocessors; each has an Instruction Unit (IU), Stream Processors (SPs), and Shared Memory; texture filter (TF) units with TEX L1 caches sit below the multiprocessors and connect through L2 caches to device memory. Legend: SP = Stream Processor, TX/TF = texture filter, IU = Instruction Unit, SFU = Special Function Unit.]

367 GFLOPS peak performance
25-50 times that of current high-end microprocessors
265 GFLOPS sustained for appropriate applications
Massively parallel, 128 cores
Partitioned into 16 Multiprocessors
Massively threaded, sustains 1000s of threads per application
768 MB device memory
1.4 GHz clock frequency (CPU at 3.6 GHz)
86.4 GB/sec memory bandwidth (CPU at 8 GB/sec front-side bus)
Multi-GPU servers available
SLI Quad: high-end NVIDIA GPUs on a single motherboard
Also 8-GPU servers announced recently
Source: CUDA Prog. Guide 4.0



Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Basic idea:
Heterogeneous execution model: the CPU is the host, the GPU is the device
Develop a C-like programming language for the GPU
Unify all forms of GPU parallelism as the CUDA thread
Programming model is Single Instruction Multiple Thread (SIMT)

A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or the OS

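As a minimal sketch of this thread/block/grid organization (not from the original slides; the kernel name inc_kernel and the sizes are assumptions for illustration):

// One CUDA thread per data element; threads are grouped into blocks, blocks into a grid.
__global__ void inc_kernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread's element
    if (i < n)
        a[i] += 1.0f;                               // each thread updates exactly one element
}

// Host side: pick a block size, derive the grid size, and let the hardware schedule the threads.
// inc_kernel<<<(n + 255) / 256, 256>>>(d_a, n);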

Similarities to vector machines:
Works well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files

Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Conventional C code:
// Invoke DAXPY
DAXPY(n, 2.0, x, y);
// DAXPY function in C
void DAXPY(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

CUDA code:
// Invoke DAXPY with 256 threads per thread block
int nb = (n + 255) / 256;
DAXPY<<<nb, 256>>>(n, 2.0, x, y);
// DAXPY kernel in CUDA
__global__ void DAXPY(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

Complete DAXPY program with host-side memory management:

__global__ void DAXPY(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
  int n = 64;
  double x[64], y[64];
  int size = sizeof(double) * 64;
  Initialize(x, y);                                  // fill host arrays
  double *x_d, *y_d;                                 // declaration of device pointers
  cudaMalloc((void **)&x_d, size);                   // memory creation on device
  cudaMalloc((void **)&y_d, size);
  cudaMemcpy(x_d, x, size, cudaMemcpyHostToDevice);  // copy inputs to device
  cudaMemcpy(y_d, y, size, cudaMemcpyHostToDevice);
  int nb = (n + 255) / 256;                          // invoke DAXPY with 256 threads per thread block
  DAXPY<<<nb, 256>>>(n, 2.0, x_d, y_d);
  cudaMemcpy(y, y_d, size, cudaMemcpyDeviceToHost);  // copy result back to host
}

Code that works over all elements is the grid.
Thread blocks break this down into manageable sizes: 512 threads per block.
A SIMD instruction executes 32 elements at a time.
Thus GridSize = Total Required Threads / Thread Block Size = 8192 / 512 = 16.
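A small arithmetic sketch of this grid-size calculation, using ceiling division so it also works when the element count is not an exact multiple of the block size (variable names are assumptions):

// Grid-size calculation for the slide's numbers: 8192 elements, 512 threads per block.
int total_threads = 8192;                                            // one thread per element
int block_size    = 512;                                             // threads per thread block
int grid_size     = (total_threads + block_size - 1) / block_size;   // = 16 here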


A block is analogous to a strip-mined vector loop with a vector length of 32.
A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
Current-generation GPUs (Fermi) have 7 to 15 multithreaded SIMD processors.

Vector width is exposed to programmers.

Scalar program:
float A[4][8];
do-all(i=0; i<4; i++){
  do-all(j=0; j<8; j++){
    A[i][j]++;
  }
}

Vector program (vector width of 8):
float A[4][8];
do-all(i=0; i<4; i++){
  movups xmm0, [ &A[i][0] ]
  incps  xmm0
  movups [ &A[i][0] ], xmm0
}

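A compilable C sketch of the vector program above using SSE intrinsics (an illustration, not from the slides; note that a 128-bit SSE register holds 4 floats, so the inner loop steps by 4):

#include <xmmintrin.h>

float A[4][8];

void inc_rows(void) {
    __m128 one = _mm_set1_ps(1.0f);              // vector with 1.0f in every lane
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 8; j += 4) {         // 4 floats per SSE register
            __m128 v = _mm_loadu_ps(&A[i][j]);   // load 4 elements (movups)
            v = _mm_add_ps(v, one);              // increment each lane
            _mm_storeu_ps(&A[i][j], v);          // store them back (movups)
        }
    }
}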

A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
Hardware converts TLP into DLP at run time.

Scalar program:
float A[4][8];
do-all(i=0; i<4; i++){
  do-all(j=0; j<8; j++){
    A[i][j]++;
  }
}

CUDA program:
float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
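A self-contained, compilable version of the kernelF example above (a sketch under the assumption that A is flattened in device memory; in real CUDA a kernel launched with <<<>>> must be declared __global__):

#include <cuda_runtime.h>

__global__ void kernelF(float *A) {
    int i = blockIdx.x;           // one block per row of the 4x8 array
    int j = threadIdx.x;          // one thread per column
    A[i * 8 + j]++;               // each thread increments its own element
}

int main(void) {
    float h_A[4][8] = {{0}};
    float *d_A;
    cudaMalloc((void **)&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    kernelF<<<dim3(4, 1), dim3(8, 1)>>>(d_A);   // grid (4,1), block (8,1): 32 threads total
    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return 0;
}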

Example: scheduling 4 thread blocks on 3 SMs. [Figure not reproduced.]

Both the grid and the thread block can have a two-dimensional index.

kernelF<<<(2,2),(4,2)>>>(A);
__device__ kernelF(A){
  i = gridDim.x * blockIdx.y + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
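A compilable sketch of this two-dimensional launch (an illustration, with the same flattened-array assumption as above):

__global__ void kernelF2D(float *A) {
    int i = gridDim.x * blockIdx.y + blockIdx.x;     // linear block index, 0..3 (rows)
    int j = blockDim.x * threadIdx.y + threadIdx.x;  // linear thread index in block, 0..7 (columns)
    A[i * 8 + j]++;
}

// Host side: a 2x2 grid of 4x2 blocks gives one thread per element of the 4x8 array.
// kernelF2D<<<dim3(2, 2), dim3(4, 2)>>>(d_A);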


The kernelF<<<(2,2),(4,2)>>> example above, executed on a machine with SIMD width of 4: [figure not reproduced]
Executed on a machine with SIMD width of 8: [figure not reproduced]
Note: the number of Processing Elements (PEs) is transparent to the programmer.

There are two main parts of a CUDA program:
1. Host (CPU part): Single Program, Single Data
2. Device (GPU part): Single Program, Multiple Data


