[Figure: Streaming Processor Array (SPA): a grid of thread blocks maps onto ten Texture Processor Clusters (TPCs); each TPC holds Streaming Multiprocessors (SMs), and each SM holds Streaming Processor (SP) cores, Special Function Units (SFUs), and a Texture Unit. Multiple thread blocks, many warps of threads; individual threads run on the SPs.]

- 240 shader cores, 1.4B transistors
- Up to 2 GB onboard memory
- ~150 GB/sec memory bandwidth
- 1.06 single-precision TFLOPS
- CUDA and OpenCL support
- Programmable memory spaces
- Tesla S1070 provides 4 GPUs in a 1U unit

July-Aug 2011 · A Sahu
GeForce 8 Series is a massively parallel computing platform:
- 12,288 concurrent threads, hardware managed
- 128 Thread Processor cores at 1.35 GHz == 518 GFLOPS peak
- GPU Computing features enable C on the Graphics Processing Unit

[Figure: Host CPU and a Work Distribution unit feeding 16 multiprocessors, each with an Instruction Unit (IU), Stream Processors (SPs), and Shared Memory; texture filters (TF) with TEX L1 caches, L2 caches, and memory partitions below. Legend: SP = Stream Processor, TF/TX = texture filter, IU = Instruction Unit, SFU = Special Function Unit.]

- 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
- 265 GFLOPS sustained for appropriate applications
- Massively parallel: 128 cores, partitioned into 16 multiprocessors
- Massively threaded: sustains 1000s of threads per application
- 768 MB device memory
- 1.4 GHz clock frequency (CPU at 3.6 GHz)
- 86.4 GB/sec memory bandwidth (CPU at 8 GB/sec front-side bus)
- Multi-GPU servers available: SLI Quad puts high-end NVIDIA GPUs on a single motherboard; 8-GPU servers also announced recently

Source: CUDA Prog. Guide 4.0
CS521, CSE IITG, 11/23/2012
Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Basic idea:
- Heterogeneous execution model: the CPU is the host, the GPU is the device
- Develop a C-like programming language for the GPU
- Unify all forms of GPU parallelism as the CUDA thread
- Programming model is Single Instruction Multiple Thread (SIMT)

CUDA thread organization:
- A thread is associated with each data element
- Threads are organized into blocks
- Blocks are organized into a grid
- GPU hardware handles thread management, not applications or the OS
Similarities to vector machines:
- Works well with data-level parallel problems
- Scatter-gather transfers
- Mask registers
- Large register files

Differences:
- No scalar processor
- Uses multithreading to hide memory latency
- Has many functional units, as opposed to a few deeply pipelined units like a vector processor

// DAXPY function in C
void DAXPY(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
// Invoke DAXPY
DAXPY(n, 2.0, x, y);

// CUDA version: invoke DAXPY with 256 threads per thread block
__host__
int nb = (n + 255) / 256;
DAXPY<<<nb, 256>>>(n, 2.0, x, y);

// A kernel launched with <<<...>>> must be declared __global__
__global__
void DAXPY(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}
__global__ void DAXPY(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}

int main() {
  int n = 64;
  double x[64], y[64];
  int size = sizeof(double) * 64;
  Initialize(x, y);
  double *x_d, *y_d;                // Declaration and memory creation
  cudaMalloc((void **)&x_d, size);  // On device
  cudaMalloc((void **)&y_d, size);
  cudaMemcpy(x_d, x, size, cudaMemcpyHostToDevice);
  cudaMemcpy(y_d, y, size, cudaMemcpyHostToDevice);
  int nb = (n + 255) / 256;         // Invoke DAXPY with 256 threads per thread block
  DAXPY<<<nb, 256>>>(n, 2.0, x_d, y_d);
  cudaMemcpy(y, y_d, size, cudaMemcpyDeviceToHost);
}

Code that works over all elements is the grid. Thread blocks break this down into manageable sizes: 512 threads per block, with SIMD instructions executing 32 elements at a time. Thus
GridSize = TotalRequiredThreads / ThreadBlockSize = 8192 / 512 = 16
Example: both the grid and the thread block can have a two-dimensional index. Scheduling 4 thread blocks on 3 SMs:

// Launch a (2,2) grid of (4,2) thread blocks
kernelF<<<dim3(2,2), dim3(4,2)>>>(A);

__global__ void kernelF(float A[][8]) {
  int i = gridDim.x * blockIdx.y + blockIdx.x;    // linear block index (0..3)
  int j = blockDim.x * threadIdx.y + threadIdx.x; // linear thread index (0..7)
  A[i][j]++;
}
There are two main parts:
1. Host (CPU part): Single Program, Single Data
2. Device (GPU part): Single Program, Multiple Data

Executed on a machine with width of 4. Note: the number of Processing Elements (PEs) is transparent to the programmer.