
CS521 CSE IITG, 11/23/2012

[Figure: Tesla hardware/software hierarchy. The Streaming Processor Array (a row of Texture Processor Clusters, TPCs) executes a grid of thread blocks; each TPC contains Streaming Multiprocessors (SMs) and a Texture Unit; each SM, with its SPs, SFUs, and shared memory, runs multiple thread blocks, i.e. many warps of threads; individual threads execute on the SPs.]

240 shader cores
1.4B transistors
Up to 2 GB onboard memory
~150 GB/sec memory bandwidth
1.06 SP TFLOPS
CUDA and OpenCL support
Programmable memory spaces
Tesla S1070 provides 4 GPUs in a 1U unit

GeForce 8 Series is a massively parallel computing platform:
12,288 concurrent threads, hardware managed
128 Thread Processor cores at 1.35 GHz == 518 GFLOPS peak
GPU Computing features enable C on the Graphics Processing Unit

[Figure: GeForce 8 block diagram. A Host CPU feeds a Work Distribution unit, which dispatches work to 16 multiprocessors; each has an Instruction Unit (IU), Stream Processors (SPs), and Shared Memory; texture filter (TF) units with TEX L1 caches sit below the multiprocessors and connect through L2 caches to device memory. Legend: SP = Stream Processor, TX/TF = texture filter, IU = Instruction Unit, SFU = Special Function Unit.]

367 GFLOPS peak performance
25-50 times that of current high-end microprocessors
265 GFLOPS sustained for appropriate applications
Massively parallel, 128 cores
Partitioned into 16 Multiprocessors
Massively threaded, sustains 1000s of threads per application
768 MB device memory
1.4 GHz clock frequency (CPU at 3.6 GHz)
86.4 GB/sec memory bandwidth (CPU at 8 GB/sec front-side bus)
Multi-GPU servers available
SLI Quad: high-end NVIDIA GPUs on a single motherboard
Also 8-GPU servers announced recently
Source: CUDA Prog. Guide 4.0



Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Basic idea:
Heterogeneous execution model: the CPU is the host, the GPU is the device
Develop a C-like programming language for the GPU
Unify all forms of GPU parallelism as the CUDA thread
Programming model is Single Instruction Multiple Thread (SIMT)

A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or the OS

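As a minimal sketch of this thread/block/grid organization (not from the original slides; the kernel name inc_kernel and the sizes are assumptions for illustration):

// One CUDA thread per data element; threads are grouped into blocks, blocks into a grid.
__global__ void inc_kernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread's element
    if (i < n)
        a[i] += 1.0f;                               // each thread updates exactly one element
}

// Host side: pick a block size, derive the grid size, and let the hardware schedule the threads.
// inc_kernel<<<(n + 255) / 256, 256>>>(d_a, n);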

Similarities to vector machines:
Works well with data-level parallel problems
Scatter-gather transfers
Mask registers
Large register files

Differences:
No scalar processor
Uses multithreading to hide memory latency
Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Conventional C code:
// Invoke DAXPY
DAXPY(n, 2.0, x, y);
// DAXPY function in C
void DAXPY(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

CUDA code:
// Invoke DAXPY with 256 threads per thread block
int nb = (n + 255) / 256;
DAXPY<<<nb, 256>>>(n, 2.0, x, y);
// DAXPY kernel in CUDA
__global__ void DAXPY(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

Complete DAXPY program with host-side memory management:

__global__ void DAXPY(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
  int n = 64;
  double x[64], y[64];
  int size = sizeof(double) * 64;
  Initialize(x, y);                                  // fill host arrays
  double *x_d, *y_d;                                 // declaration of device pointers
  cudaMalloc((void **)&x_d, size);                   // memory creation on device
  cudaMalloc((void **)&y_d, size);
  cudaMemcpy(x_d, x, size, cudaMemcpyHostToDevice);  // copy inputs to device
  cudaMemcpy(y_d, y, size, cudaMemcpyHostToDevice);
  int nb = (n + 255) / 256;                          // invoke DAXPY with 256 threads per thread block
  DAXPY<<<nb, 256>>>(n, 2.0, x_d, y_d);
  cudaMemcpy(y, y_d, size, cudaMemcpyDeviceToHost);  // copy result back to host
}

Code that works over all elements is the grid.
Thread blocks break this down into manageable sizes: 512 threads per block.
A SIMD instruction executes 32 elements at a time.
Thus GridSize = Total Required Threads / Thread Block Size = 8192 / 512 = 16.
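A small arithmetic sketch of this grid-size calculation, using ceiling division so it also works when the element count is not an exact multiple of the block size (variable names are assumptions):

// Grid-size calculation for the slide's numbers: 8192 elements, 512 threads per block.
int total_threads = 8192;                                            // one thread per element
int block_size    = 512;                                             // threads per thread block
int grid_size     = (total_threads + block_size - 1) / block_size;   // = 16 here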


A block is analogous to a strip-mined vector loop with a vector length of 32.
A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
Current-generation GPUs (Fermi) have 7 to 15 multithreaded SIMD processors.

Vector width is exposed to programmers.

Scalar program:
float A[4][8];
do-all(i=0; i<4; i++){
  do-all(j=0; j<8; j++){
    A[i][j]++;
  }
}

Vector program (vector width of 8):
float A[4][8];
do-all(i=0; i<4; i++){
  movups xmm0, [ &A[i][0] ]
  incps  xmm0
  movups [ &A[i][0] ], xmm0
}

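A compilable C sketch of the vector program above using SSE intrinsics (an illustration, not from the slides; note that a 128-bit SSE register holds 4 floats, so the inner loop steps by 4):

#include <xmmintrin.h>

float A[4][8];

void inc_rows(void) {
    __m128 one = _mm_set1_ps(1.0f);              // vector with 1.0f in every lane
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 8; j += 4) {         // 4 floats per SSE register
            __m128 v = _mm_loadu_ps(&A[i][j]);   // load 4 elements (movups)
            v = _mm_add_ps(v, one);              // increment each lane
            _mm_storeu_ps(&A[i][j], v);          // store them back (movups)
        }
    }
}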

A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
Hardware converts TLP into DLP at run time.

Scalar program:
float A[4][8];
do-all(i=0; i<4; i++){
  do-all(j=0; j<8; j++){
    A[i][j]++;
  }
}

CUDA program:
float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
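A self-contained, compilable version of the kernelF example above (a sketch under the assumption that A is flattened in device memory; in real CUDA a kernel launched with <<<>>> must be declared __global__):

#include <cuda_runtime.h>

__global__ void kernelF(float *A) {
    int i = blockIdx.x;           // one block per row of the 4x8 array
    int j = threadIdx.x;          // one thread per column
    A[i * 8 + j]++;               // each thread increments its own element
}

int main(void) {
    float h_A[4][8] = {{0}};
    float *d_A;
    cudaMalloc((void **)&d_A, sizeof(h_A));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    kernelF<<<dim3(4, 1), dim3(8, 1)>>>(d_A);   // grid (4,1), block (8,1): 32 threads total
    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return 0;
}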

Example: scheduling 4 thread blocks on 3 SMs. [Figure not reproduced.]

Both the grid and the thread block can have a two-dimensional index.

kernelF<<<(2,2),(4,2)>>>(A);
__device__ kernelF(A){
  i = gridDim.x * blockIdx.y + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
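A compilable sketch of this two-dimensional launch (an illustration, with the same flattened-array assumption as above):

__global__ void kernelF2D(float *A) {
    int i = gridDim.x * blockIdx.y + blockIdx.x;     // linear block index, 0..3 (rows)
    int j = blockDim.x * threadIdx.y + threadIdx.x;  // linear thread index in block, 0..7 (columns)
    A[i * 8 + j]++;
}

// Host side: a 2x2 grid of 4x2 blocks gives one thread per element of the 4x8 array.
// kernelF2D<<<dim3(2, 2), dim3(4, 2)>>>(d_A);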


The kernelF<<<(2,2),(4,2)>>> example above, executed on a machine with SIMD width of 4: [figure not reproduced]
Executed on a machine with SIMD width of 8: [figure not reproduced]
Note: the number of Processing Elements (PEs) is transparent to the programmer.

There are two main parts of a CUDA program:
1. Host (CPU part): Single Program, Single Data
2. Device (GPU part): Single Program, Multiple Data


