
2017 IEEE 20th International Symposium on Real-Time Distributed Computing

Static WCET Analysis of GPUs with Predictable Warp Scheduling

Yijie Huangfu (huangfuy2@vcu.edu) and Wei Zhang (wzhang4@vcu.edu)
Virginia Commonwealth University

Abstract—The capability of GPUs to accelerate general-purpose applications that can be parallelized into a massive number of threads makes it promising to apply GPUs to real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed for CPUs cannot be used directly to estimate the WCET of GPUs. In this work, based on an analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model and analyzer based on a predictable GPU warp scheduling policy to enable WCET estimation on GPUs.

I. INTRODUCTION

The massive number of processing cores on chip and the Single-Instruction Multiple-Thread (SIMT) execution model allow GPUs to execute thousands of threads simultaneously. Therefore, GPUs have become ideal accelerators for applications that are compute- and/or data-intensive but can be parallelized into a large number of threads with very little dependency among each other. With the increasing need for computing power in all kinds of devices, the applications in embedded systems have become more compute- and data-intensive as well. As a result, more and more GPUs and GPU platforms for embedded applications have come to the market, e.g., the NVIDIA Tegra[1] and the DRIVE PX[2].

GPUs can also benefit real-time applications, such as human pose recognition[3] and traffic sign classification[4], where high throughput and/or computation power are needed. However, to exploit the potential of GPUs in real-time applications, the predictability issue of the GPU architecture must be addressed first. To achieve high average-case performance and throughput, modern GPUs maintain a massive number of active threads at the same time and use the large number of on-chip cores to schedule and execute these threads. The scheduling of this massive number of active threads is a dynamic behavior, which is very hard to analyze statically and harms predictability. Moreover, due to the dynamic scheduling among different threads, which execute the same program code, the execution order of the basic blocks of assembly instructions can be different from the execution on CPUs. Therefore, the traditional static analysis methods cannot be applied to GPUs directly. Furthermore, the computing cores on a GPU chip are divided into groups, which are connected to the memory systems outside the computing cores through interconnection networks. The dynamic behavior of cores competing for the memory resources is also hard to predict statically.

Therefore, before applying GPUs to real-time applications, the time predictability of the GPU architecture needs to be improved and made analyzable. In this work, we propose to employ a predictable greedy then round-robin scheduling policy, based on which we build a timing model for GPGPU programs. With the proposed timing model, we build a static analyzer that analyzes the assembly code of GPGPU programs and gives Worst-Case Execution Time (WCET) estimations. The evaluation results show that the proposed timing model and static analyzer can provide safe and fairly tight WCET estimations for GPGPU applications.

The rest of the paper is organized as follows. Section II introduces the background on the GPU architecture and the programming model of GPGPU applications. Section III describes the GPU architectural simulator GPGPU-Sim[9] used in this work. The proposed WCET analyzer is discussed in Section IV, following which the evaluation methodology and experimental results are given in Section V. Related work is reviewed in Section VI, and the conclusion and future work are in Section VII.

II. GPU ARCHITECTURE AND PROGRAMMING MODEL

A. GPU Architecture

Fig. 1 shows the basic architecture of an Nvidia GPU¹, which has a certain number of Streaming Multiprocessors (SMs), e.g., 16 SMs in the Fermi architecture[5]. All the SMs share the L2 cache, through which they access the DRAM global memory. Other parts, like the interface to the host CPU, are not included in Fig. 1.

Fig. 1: GPU Architecture[5]

¹The Nvidia CUDA GPU terminologies are used in this paper.

Fig. 2 shows the architecture of an SM, which contains a group of Streaming Processors (SPs, also called CUDA processors or CUDA cores). Each SP has a pipelined integer arithmetic logic unit and a floating point unit, which execute the normal arithmetic instructions, while the Special Function Units (SFUs) execute the transcendental instructions. Besides the computing functional units, there are several L1 caches for instructions, data, texture data and constants. The register file contains a huge number of registers shared by all the SPs and SFUs, while the warp scheduler and dispatching unit choose among the active warps, collect the operands and send the warps to execution. The load and store units (LD/ST) handle the memory instructions.

Fig. 2: SM Architecture[5]

B. GPU Programming Model

GPUs use the SIMT execution model to allow a large number of threads to execute in parallel. A general-purpose GPU (GPGPU) program, which is also called a GPU kernel, is commonly written in CUDA C[6] or OpenCL[7]. A GPU kernel is configured and launched by a host CPU. Through the configuration of the kernel, the host CPU tells the GPU the number of threads in the execution of the kernel and the hierarchy of the threads. The hierarchy of a kernel has two levels: the dimensions of the kernel grid (the number of kernel blocks in a kernel grid) and of the kernel block (the number of threads in a kernel block). As a group of threads, a kernel block is the basic unit in the workload assignment to different SMs.

The kernel code describes the function and behavior of a single thread, based on the position of the thread in the hierarchy of the kernel, e.g., the thread and block IDs. The most common pattern is to use the thread and block IDs to calculate the indices with which each thread accesses a certain array. In the GPU kernel execution, a kernel block is assigned to an SM and stays there until it finishes its execution. 32 threads in a kernel block are grouped together as the basic scheduling and execution unit, which is called a warp. The threads in the same warp execute the same instruction together in the SIMT model.
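To make this concrete, the short Python sketch below (our illustration, not code from the paper) shows the kind of index arithmetic a 1-D kernel performs and the byte addresses one warp touches as a result; the base address and element size are hypothetical parameters. The static analyzer in Section IV evaluates exactly this kind of arithmetic, using the block and thread IDs, to recover the addresses of each warp's memory instructions.

# Sketch (assuming a 1-D grid and 1-D blocks): the index a thread derives
# from its block and thread IDs, and the byte addresses one warp touches.
def global_thread_index(block_id, block_dim, thread_id):
    return block_id * block_dim + thread_id

def warp_byte_addresses(base_addr, elem_size, block_id, block_dim, warp_id):
    """Addresses accessed by the 32 threads of one warp for array[index]."""
    first_tid = warp_id * 32
    return [base_addr + elem_size * global_thread_index(block_id, block_dim, first_tid + t)
            for t in range(32)]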


III. GPGPU-SIM SIMULATOR

Accurate documentation of the internal processor architecture is critical for a precise low-level model and analysis[8]. However, detailed architecture information about modern GPUs is unavailable. Therefore, in this work we choose to use a detailed and configurable GPU simulator, GPGPU-Sim[9], based on which we design and implement our scheduling policy and GPU WCET analyzer.

A. Pipeline Model

Fig. 3 shows the architectural model of the pipelines of the function units in one SM, including the SPs, SFUs and LD/ST unit, in GPGPU-Sim. After being decoded, an instruction is sent to one of the channels (ID_OC_*). Then the instruction is buffered at the operand collectors to collect the values of its operands. In the SPs and SFUs, the execution latency has two parts: initiation and execution. The instruction does not move to the execution stage until the initiation stage finishes, which may stall the other instructions in the pipelines. Therefore, the configuration of the number of operand collectors and the length of the initiation stage decides how many SP or SFU instructions can be sent into the pipeline before a stall happens.

Fig. 3: Pipeline Model of Function Units in GPGPU-Sim

The LD/ST unit is the interface of an SM to the memory system. If the L1 data cache is enabled, the memory requests are sent to the L1 cache, which then handles the access to the global memory if the request misses in L1. If the L1 cache is disabled, the LD/ST unit directly accesses the global memory through the interconnection network. In this work, we assume all the requests go to the global memory directly, i.e., there are no L1 or L2 caches, since cache analysis is not the focus of this work.

B. Interconnection Model

In GPGPU-Sim the global memory space is divided into several partitions. The SMs and the memory partitions are connected through an interconnection network. Since the timing analysis of different topologies is not the focus of this work, we choose to use the default topology configuration, as shown in Fig. 4. Within one memory partition, the memory requests from different SMs are served in a round-robin order. Therefore, the number of SMs in the GPU and how they compete to access a memory partition decide how long a memory instruction will be stalled at the LD/ST unit in the pipeline model.

Fig. 4: Interconnection Example in GPGPU-Sim

C. Warp Scheduler Model

In the execution of warps, when a warp gets stalled, the warp scheduler chooses a ready-to-issue warp among the active warps. Although dynamic warp scheduling can benefit the average-case performance, it seriously harms time predictability, since it is hard to statically predict the scheduling order. Therefore, in this work we propose to use a greedy then round-robin warp scheduling policy, in which the scheduler tries to issue as many instructions as possible from a warp until a stall happens due to a dependency, and then moves to the next warp following the order of the warp IDs. The scheduler will issue at least one instruction before it switches to the next warp. It should be noted that the proposed greedy then round-robin policy is not one of the default policies supported by GPGPU-Sim. We propose this policy to combine the predictability of the round-robin policy with the average-case performance of the greedy policy.
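A minimal sketch of the selection logic just described, assuming hypothetical ready() and issue() hooks into the pipeline model; it illustrates the policy and is not code from GPGPU-Sim.

# Greedy then round-robin (GTRR): keep issuing from the current warp until
# it stalls on a dependency, then move to the next warp ID, wrapping around.
def gtrr_schedule_cycle(warps, current):
    """Pick the warp to issue from this cycle; return the new current index."""
    n = len(warps)
    if warps[current].ready():          # greedy: stay on the same warp
        warps[current].issue()
        return current
    for step in range(1, n + 1):        # round-robin: next ready warp by ID
        cand = (current + step) % n
        if warps[cand].ready():
            warps[cand].issue()
            return cand
    return current                      # nothing ready this cycle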
D. Branch Divergence

In some cases, the 32 threads in a warp may not all take the same program path, which is called branch divergence. When this happens, one path of the branch is executed first and then the other path. Therefore, the WCET analysis for branches is different on GPUs than on CPUs. On a CPU, the analyzer would estimate the latencies of both paths and use the larger one if the branch condition cannot be decided statically. On a GPU, however, the two paths are actually executed serially. Therefore, the WCET analyzer needs to add the latencies together if it cannot decide whether or not there would be a branch divergence.
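The difference can be summarized in a few lines (an illustration with made-up latencies): with possible divergence the analyzer must sum the two path latencies, whereas a CPU-style bound would only take the longer path.

# Illustrative only: latency bound of a branch region when divergence
# cannot be excluded statically (both paths execute serially on a GPU).
def branch_region_wcet(lat_taken, lat_not_taken, may_diverge):
    if may_diverge:
        return lat_taken + lat_not_taken   # GPU: paths run one after the other
    return max(lat_taken, lat_not_taken)   # CPU-style bound: longer path only

assert branch_region_wcet(120, 80, may_diverge=True) == 200
assert branch_region_wcet(120, 80, may_diverge=False) == 120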
E. Resource Limitation

Due to the limitation of different resources in an SM, the maximal number of threads that can be active at the same time on an SM is limited. The limitations include the total number of registers, the total available shared memory, and the maximal numbers of concurrent kernel blocks, warps and threads. All these limitations constrain the number of active warps.

IV. GPU WCET ANALYZER

A. Greedy Then Round-Robin Scheduler Timing Model

We propose the greedy then round-robin (GTRR) warp scheduling policy so that a timing model can be built for the execution of the warps in a GPU kernel. Based on the dependencies between instructions, the PTX[11] code of a GPU kernel can be divided into segments, each of which has one or more instructions and is called a Code Segment in this work. The dependencies between these code segments mean that the instructions in one code segment cannot be issued until the instructions in the previous code segments have finished their execution and written back their results.

\[ T_{0,0} = 0, \qquad T_{i,0} = T_{i-1,0} + LI_{i-1,0} \quad (i > 0) \tag{1} \]

\[ T_{i,j} = \max\left(T_{i',k} + LI_{i',k},\; T_{i,j-1} + LI_{i,j-1} + LE_{i,j-1}\right) \tag{2} \]
where \(k = j-1\) if \(i = 0\) and \(k = j\) otherwise, \(i' = N-1\) if \(i = 0\) and \(i' = i-1\) otherwise, and \(N\) is the number of warps.

\[ T_{i(end)} = T_{i,j_{last}} + LI_{i,j_{last}} + LE_{i,j_{last}}, \qquad WCET = \max\left(T_{0(end)}, T_{1(end)}, \ldots, T_{N-1(end)}\right) \tag{3} \]

\[ LI_{inst,Arithmetic} = (N \le C_{pipeline})\;?\;0 : LI_{Stall,Arithmetic}, \qquad LI_{inst,Memory} = (N \le C_{pipeline})\;?\;0 : LI_{Stall,Memory} \tag{4} \]

\[ LI_{Stall,Arithmetic} = L_{initiation}, \qquad LI_{Stall,Memory} = N_{coal} + N_{coal} \times N_{CompetingSM} \tag{5} \]

\[ LE_{inst,Arithmetic} = Length_{pipeline} + L_{initiation} + L_{execution}, \qquad LE_{inst,Memory} = L_{base} + Length_{pipeline} \times (N_{coal} + N_{coal} \times N_{CompetingSM}) \tag{6} \]

Assuming K instructions in code segment j of warp i:
\[ LI_{ij} = S_{CodeSeg,ij} + \sum_{n=0}^{K-1} LI_{inst,n}, \qquad LE_{ij} = \max\left(LE_{inst,0}, \ldots, LE_{inst,K-1}\right) \tag{7} \]

Fig. 5 shows the scheduling of N warps with the greedy then round-robin scheduling policy. \(T_{i,j}\) represents the time point at which the GPU can start to issue code segment j of warp i. \(LI_{i,j}\) is the latency of issuing code segment j of warp i, while \(LE_{i,j}\) represents the latency of executing the same code segment. After initializing the starting issue time point of each warp with Equation 1, the rest of the time points in the schedule can be calculated using Equation 2, which basically means that the time point at which one code segment in a warp can start to issue depends on the maximum of the latency of executing the previous code segment in the same warp and the latency of issuing the segments in the other warps before the scheduler gets back to this warp. Based on this model, the estimated WCET is the time point at which all the warps have finished their execution, as shown in Equation 3.

Fig. 5: Timing Model of GTRR Scheduling Policy
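Once the per-segment issue and execution latencies are known, the recurrence of Equations 1-3 can be evaluated directly. The sketch below is our own illustration of that computation (not the analyzer's actual code), assuming every warp has the same number of code segments and that LI and LE are given as N x S arrays.

# LI[i][j] and LE[i][j]: statically estimated issue and execution latency
# of code segment j of warp i, for N warps with S segments each.
def gtrr_wcet(LI, LE):
    N = len(LI)            # number of warps
    S = len(LI[0])         # number of code segments per warp
    T = [[0] * S for _ in range(N)]
    # Equation 1: the first segments of the warps are issued back to back.
    for i in range(1, N):
        T[i][0] = T[i - 1][0] + LI[i - 1][0]
    # Equation 2: a segment can issue once the previously scheduled warp has
    # issued its segment and this warp's previous segment has finished.
    for j in range(1, S):
        for i in range(N):
            i_prev = N - 1 if i == 0 else i - 1   # warp issued just before warp i
            k = j - 1 if i == 0 else j            # segment that warp was issuing
            issue_ready = T[i_prev][k] + LI[i_prev][k]
            dep_ready = T[i][j - 1] + LI[i][j - 1] + LE[i][j - 1]
            T[i][j] = max(issue_ready, dep_ready)
    # Equation 3: the WCET is the finish time of the last segment over all warps.
    return max(T[i][S - 1] + LI[i][S - 1] + LE[i][S - 1] for i in range(N))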
B. Code Segment Issuing and Execution Latency Timing Models

To use the timing model in Section IV-A, the latencies of issuing and executing code segments need to be estimated statically. Generally, the issue latency is related to the length of the segment, if there is no stall. But stalls can happen when the number of active warps is larger than the capacity of a pipeline. Also, for the memory instructions, the number of coalesced memory accesses an instruction has and the number of SMs that compete to access one memory partition affect both the issuing and execution latencies. Therefore, we are interested in three things: the number of active warps, the maximal and average number of coalesced memory accesses from one global memory instruction, and the maximal and average number of SMs competing to access a memory partition.

a) Number of Active Warps: The pipelines can act as buffers for different types of instructions. In other words, as long as the pipeline is not full, there will be no extra stalls in issuing. For arithmetic instructions, the configuration of the number of operand collectors and the length of the initiation buffer in the function units decides how many instructions the pipeline can hold before a stall happens, while the configured initiation latency determines how long the stall is. The kernel analyzer checks whether the number of active warps is larger than the capacity of the pipeline and adds the stall latencies to the code segment issuing period according to the instruction types, as shown in Equations 4 and 5.

b) Number of Coalesced Memory Accesses: In a global memory instruction, different threads in the warp can access different memory addresses, which are coalesced so that addresses belonging to the same 128-byte memory space are merged together. Therefore, there can be as many as 32 memory requests with different addresses from one warp memory instruction. Since these memory requests need to be sent out by the LD/ST unit one by one, one per clock cycle, the number of coalesced memory requests affects not only the issuing latency but also the execution latency of the instruction, as shown in Equations 5 and 6.

c) Number of Competing SMs: Different SMs may compete to access the same memory partition in the memory system. In the simulated architecture, the requests from different SMs are served in a round-robin order. Therefore, if there are M SMs trying to access the same memory partition, the interval for two consecutive requests from the same SM to be served is M-1 cycles. This latency can occur at every coalesced memory request, as shown in Equations 5 and 6.

Equation 5 calculates the possible stall latency of issuing an instruction. For arithmetic instructions, if a stall happens, the latency equals the initiation latency. For memory instructions, if a stall happens, each coalesced memory access causes one cycle of stall by itself. Besides, since every memory access needs to compete for access to the global memory, the number of coalesced accesses is multiplied by the possible number of competing SMs.

Equation 6 calculates the latency of executing instructions. The total execution latency of an arithmetic instruction equals the length of the SP or SFU pipeline plus the sum of the initiation and execution latencies. For memory instructions, L_base is the baseline latency of accessing the global memory. The other part of the equation represents the latency of the instructions buffered in the pipeline before the current instruction is sent out to the interconnection network.

Equation 7 calculates the LI and LE of a code segment. Adding the size of the code segment S_CodeSeg,ij and the possible stall latencies of the instructions in the code segment gives the issuing latency, while the execution latency is the maximal execution latency among the instructions in the code segment.
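The sketch below illustrates how Equations 4-7 combine for one code segment; the dictionary keys are our own names for the configuration parameters in the equations (pipeline capacity, initiation, execution and base memory latencies), with the actual values coming from the simulator configuration in Section V.

# Equations 4-6 for a single instruction, and Equation 7 for a code segment.
def instruction_latencies(kind, n_active_warps, cfg, n_coal=0, n_competing_sm=0):
    """Return (LI_inst, LE_inst) for one instruction of the given kind."""
    if kind == "arithmetic":
        stall = cfg["L_initiation"]                                              # Eq. 5
        le = cfg["Length_pipeline"] + cfg["L_initiation"] + cfg["L_execution"]   # Eq. 6
    else:  # global memory instruction
        stall = n_coal + n_coal * n_competing_sm                                 # Eq. 5
        le = cfg["L_base"] + cfg["Length_pipeline"] * (n_coal + n_coal * n_competing_sm)  # Eq. 6
    li = 0 if n_active_warps <= cfg["C_pipeline"] else stall                     # Eq. 4
    return li, le

def segment_latencies(per_inst, segment_size):
    """Eq. 7: LI adds the segment size and the stalls; LE is the maximum."""
    LI = segment_size + sum(li for li, _ in per_inst)
    LE = max(le for _, le in per_inst)
    return LI, LE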
from one global memory instruction, as well as the maximal
and average number of competing SMs to access a memory
partition. C. Static GPU Kernel Analyzer

a) Number of Active Warps: The pipelines can act as The static kernel analyzer parses the PTX code of a GPU
buffers for different types of instructions. In other words, kernel to get the estimated value of the metrics in the equations
as long as the pipeline is not full, there will be no extra in Section IV-B, as well as the scheduling order of each warp,
stalls in issuing. For arithmetic instructions, the configurations which is used to generate the code segments in the timing
of the number of operand collectors and the length of the model. The analyzer also needs the kernel inputs and the
initiation buffer in function units decide how many instructions hierarchy configuration of the kernel as inputs for the analysis.
the pipeline can hold before the stall happens, while the Fig. 6 shows the components in the analyzer.
configuration of initiation latency determines how long the 1) Warp Scheduling Order: Algorithm 1 shows how the
stall is. The kernel analyzer checks whether the number of scheduling order of a warp is generated. The analyzer starts
active warps is larger than the capacity of the pipeline and with the first instruction of the first basic block and parses
adds the stall latencies to the code segment issuing period each instruction in the current basic block. The register values
according to the instruction types, as shown in Equation 4 are updated with the arithmetic instructions. If the last in-
and 5. struction of a basic block is a branch instruction and there is

Algorithm 1 Warp Execution Trace Generation
 1: procedure WARPEXETRACEANA(Inputs, CFG, Block, Warp)
 2:   WarpExecutionTrace = []
 3:   ReconvergenceStack = []
 4:   NumCoalAccessList = []
 5:   AddrCoalAccessList = []
 6:   CurrentBB = FirstBB(CFG)
 7:   WarpExecutionTrace.append(CurrentBB)
 8:   INST = FirstInstruction(CurrentBB)
 9:   while INST is not Exit do
10:     if INST is arithmetic instruction then
11:       UpdateRegisterValue(INST, Inputs, Block, Warp)
12:     end if
13:     if INST is global load/store then
14:       CoalList = CoalescedAddrListGen(INST, Warp)
15:       AddrCoalAccessList.append(CoalList)
16:       N = SizeOf(CoalList)
17:       NumCoalAccessList.append(N)
18:     end if
19:     if INST is last of CurrentBB then
20:       if INST is branch then
21:         if Has Divergence then
22:           IPD = FindImmediatePostdominator(CFG, CurrentBB)
23:           ReconvergenceStack.push(IPD)
24:           ReconvergenceStack.push(NotTakenBB)
25:           ReconvergenceStack.push(TakenBB)
26:         else
27:           if Taken then
28:             ReconvergenceStack.push(TakenBB)
29:           else
30:             ReconvergenceStack.push(NotTakenBB)
31:           end if
32:         end if
33:       end if
34:       CurrentBB = ReconvergenceStack.pop()
35:       WarpExecutionTrace.append(CurrentBB)
36:       INST = FirstInstruction(CurrentBB)
37:     else
38:       INST = NextInstCurBB()
39:     end if
40:   end while
41:   Return WarpExecutionTrace, NumCoalAccessList, AddrCoalAccessList
42: end procedure

2) Number of Coalesced Memory Accesses: The analyzer also collects information about the memory addresses used by each global memory instruction. All the memory addresses used by the threads in a warp are coalesced using Algorithm 2. The list of coalesced memory addresses is appended to the result list AddrCoalAccessList, which contains the lists of coalesced memory addresses of each global memory instruction in the warp. Then the analyzer gets the number (N) of coalesced memory addresses for this instruction and appends it to the result list NumCoalAccessList of this warp. The analyzer returns the warp execution trace, the list of numbers that represent the number of coalesced memory accesses, and the list of address lists of each global memory instruction in this warp. The same process is done for every warp, and all the results are collected together to calculate the maximal and average numbers of coalesced accesses of the GPU kernel.

Algorithm 2 Coalesced Addresses Generation
 1: procedure COALESCEDADDRLISTGEN(I, W)
 2:   CoalAddrList = []
 3:   for Each Thread Ti ∈ W do
 4:     if CheckActive(Ti) then
 5:       CurAddr = GetAddr(Ti, I)
 6:       Coalesced = False
 7:       for Each Address Aj ∈ CoalAddrList do
 8:         if Coalesce(CurAddr, Aj) then
 9:           Coalesced = True
10:           Break
11:         end if
12:       end for
13:       if Not Coalesced then
14:         CoalAddrList.append(CurAddr)
15:       end if
16:     end if
17:   end for
18:   Return CoalAddrList
19: end procedure
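Algorithm 2 leaves the Coalesce() test abstract. The sketch below shows one concrete interpretation we assume for illustration: two addresses coalesce when they fall into the same 128-byte block, matching the 128-byte rule described in Section IV-B.

# Count the coalesced requests of one warp memory instruction by merging
# addresses that fall into the same 128-byte block (assumed interpretation
# of the Coalesce() test in Algorithm 2).
def coalesced_addresses(thread_addrs):
    blocks = []
    for addr in thread_addrs:                 # one address per active thread
        if addr // 128 not in blocks:
            blocks.append(addr // 128)
    return [b * 128 for b in blocks]          # one request per 128-byte block

# 32 consecutive 4-byte accesses starting at 0 touch only one 128-byte block.
assert len(coalesced_addresses([4 * t for t in range(32)])) == 1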
3) Number of Competing SMs: Algorithm 3 shows how the analyzer estimates the possible number of competing SMs that may access the same memory partition at the same time. Based on the memory addresses each warp instruction uses, the analyzer builds a vector for every global memory instruction in an SM. This vector represents the distribution of the memory addresses among the memory partitions from a certain instruction on a certain SM. For instance, if there are 3 memory partitions and, from one instruction I on SM s, there are 5 memory addresses used, among which 2 addresses go to partition 0 and 3 addresses go to partition 2, then the distribution vector is [2,0,3]. As shown in the algorithm, there is one such vector for every global memory instruction in every SM, i.e., MemPtnAccVector is a 2D array of such vectors. Two metrics are calculated using this vector.

The first metric represents the unevenness of the distribution. The Distance2Center function calculates the Euclidean distance between the vector of the address distribution and the vector that represents an even distribution (called the center in the algorithm). This distance indicates how uneven the distribution over the different partitions is. The larger the distance, the more uneven the distribution, and thus the more likely it is that SMs compete for the same partition.

The other metric is the Euclidean distance between the distribution vector of one instruction on one SM and the distribution vector of the same instruction on another SM, named D2OtherSM in the algorithm. The smaller the value of D2OtherSM, the more similar the address distributions from the two SMs (s and s') are. If the distance is 0, the distributions are the same and the number of possibly competing SMs is increased by 1, as shown on line 9, where MaxDistance is the maximal distance of two vectors whose distributions each focus on a single but different partition, e.g., [5, 0, 0] and [0, 0, 4]. This is a constant value according to the total number of SMs; if there are M SMs, MaxDistance is (M − 1)√2.

Then the number of possibly competing SMs (CompetingSM) and the distance to the center (D2Center) are compared to heuristic thresholds, i.e., TCompetingSM and TD2Center, to decide whether the number of possibly competing SMs of the current instruction counts toward the final result (line 11). After all the instructions are analyzed for all the SMs, the average value of the number of competing SMs is returned and used in the calculations in Equations 5 and 6. The maximal value of the number of competing SMs is the number of active SMs minus one. The reason that heuristic thresholds are used is that the behaviors of different SMs are basically independent of each other and, therefore, their interactions are very hard to predict statically. So, we use these heuristic threshold values to estimate the average degree of competition among SMs. The heuristic values used in this work are 13 for TCompetingSM and 0.5 for TD2Center, for the architecture configuration with 15 SMs and 12 memory partitions. It should be noted that we do not claim that the WCET estimation with the average degree of competing SMs is a safe upper bound, while the WCET estimation with the maximal number of possibly competing SMs can be considered a safe upper bound.

Algorithm 3 Average Number of Competing SMs
 1: NumCompetingSM = []
 2: for Each I in all load/store instructions do
 3:   for Each s in all SMs do
 4:     D2Center = Distance2Center(MemPtnAccVector[I][s])
 5:     CompetingSM = 0
 6:     for Each s' in all the rest SMs do
 7:       D2OtherSM = Distance2Vector(MemPtnAccVector[I][s],
 8:                                   MemPtnAccVector[I][s'])
 9:       CompetingSM += (MaxDistance - D2OtherSM)/MaxDistance
10:     end for
11:     if CompetingSM > TCompetingSM or D2Center > TD2Center then
12:       NumCompetingSM.append(CompetingSM)
13:     else
14:       NumCompetingSM.append(0)
15:     end if
16:   end for
17: end for
18: Return average(NumCompetingSM)
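The two distance metrics used in Algorithm 3 can be computed as plain Euclidean distances over the per-partition access-count vectors described above; the sketch below is our own illustration of Distance2Center and Distance2Vector for such vectors.

import math

# One access-count vector per (instruction, SM), e.g. [2, 0, 3] for 3 partitions.
def distance_to_center(vec):
    """Euclidean distance from an even spread of the same total (D2Center)."""
    center = [sum(vec) / len(vec)] * len(vec)
    return math.dist(vec, center)

def distance_to_other_sm(vec_a, vec_b):
    """Distance between the same instruction's vectors on two SMs (D2OtherSM)."""
    return math.dist(vec_a, vec_b)

# Identical distributions on two SMs contribute a full competing SM on
# line 9 of Algorithm 3: (MaxDistance - 0) / MaxDistance == 1.
assert distance_to_other_sm([2, 0, 3], [2, 0, 3]) == 0.0
print(distance_to_center([2, 0, 3]))   # uneven spread -> distance > 0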
V. EVALUATION METHODOLOGY AND EXPERIMENTAL RESULTS

A. Evaluation Methodology

We used the GPGPU-Sim[9] simulator in this work as the analysis target of the GPU architecture, and we implemented the greedy then round-robin warp scheduling policy in the simulator. The general configuration of the simulator is shown in Table I and the latency configuration of the function units is in Table II, in which the numbers represent cycle counts.

TABLE I: GPGPU-Sim Configuration
Number of SMs: 15
Number of Memory Partitions: 12
Number of 32-bit registers per SM: 32768
Size of shared memory per SM: 48KB
L1 data cache: None
L2 cache: None
Max Number of Active Kernel Blocks: 8
Max Number of Active Warps: 48
Number of SP Operand Collectors: 6
Number of SFU Operand Collectors: 8
Number of load/store Operand Collectors: 2
Warp Scheduling Policy: GTRR

TABLE II: GPGPU-Sim Function Unit Latency Configuration (cycles)
Initiation   ADD  MAX  MUL  MAD  DIV
Integer        1    2    2    1    8
Float          1    2    1    1    4
Double         8   16    8    8  130
Execution    ADD  MAX  MUL  MAD  DIV
Integer        4   13    4    5  145
Float          4   13    4    5   39
Double         8   19    8    8  330

In the evaluation, 5 benchmarks, including Gaussian Elimination (gsn), Needleman-Wunsch (nw), CFD Solver (cfd), LU Decomposition (lud) and Speckle Reducing Anisotropic Diffusion (srad), are chosen from the Rodinia benchmark suite[12]. There are 9 GPU kernels in total in these benchmarks, 2 from each benchmark except lud. Table III shows the configuration of the grid and block sizes of each kernel, as well as the numbers of active SMs and warps in execution. As shown in Table III, two kernel-size configurations are used in the srad benchmark, where the block sizes are the same but the grid sizes are different, leading to different numbers of active warps on an SM. These benchmarks are selected based on the criterion that their loop bounds are known statically.

TABLE III: GPU Benchmark Kernels
Benchmark    Grid Size  Block Size  # Act SMs  # Act Warps
gsn k1       1x1x1      512x1x1     1          16
gsn k2       32x32x1    8x8x1       15         16
nw k1        32x1x1     16x1x1      15         2
nw k2        31x1x1     16x1x1      15         2
cfd k1       127x1x1    192x1x1     15         48
cfd k2       127x1x1    192x1x1     15         48
lud k1       15x15x1    16x16x1     15         48
srad 128 k1  8x8x1      16x16x1     15         40
srad 128 k2  8x8x1      16x16x1     15         40
srad 512 k1  32x32x1    16x16x1     15         48
srad 512 k2  32x32x1    16x16x1     15         48

B. Experimental Results

Fig. 7 shows the normalized estimated WCET of the simulated GPU architecture with and without the perfect memory configuration. The estimated WCET results with the perfect memory configuration are normalized to the measured simulation performance results with the same configuration. Perfect memory means that every memory request takes just one cycle after it has arrived at the LD/ST unit and does not go to the memory partitions through the interconnection network. The normalized estimated WCET results with normal memory in Fig. 7 are the estimated WCET results when the simulator and the WCET analyzer use a normal memory system model.
These estimated WCET results are normalized to the measured simulation performance results with the normal memory system configuration. The results show that, generally, the estimator produces tighter estimations with a perfect memory model than with a normal memory model. This is because, when no interference from other SMs needs to be considered, the predictability within an SM is better than in the case where the interconnection network and the interference from other SMs must be taken into account. It should be noted that the average values of the number of coalesced memory accesses and of the number of competing SMs are used to obtain the estimated results in Fig. 7, and the estimated results are normalized to the measured performance with and without perfect memory, respectively. Therefore, the overestimation in the estimated results with normal memory can be smaller than the overestimation in the estimated results with perfect memory, e.g., in benchmarks gsn k1 and cfd k2.

Fig. 7: Normalized Estimated WCET with and without Perfect Memory Model

TABLE IV: Estimated Average and Maximal Number of Coalesced Accesses and Competing SMs
Benchmark              gsn k1  gsn k2  nw k1  nw k2  cfd k1  cfd k2  lud k1  srad128 k1  srad128 k2  srad512 k1  srad512 k2
Avg. Coalesced Access      22       7      3      2       1       1       2           2           2           2           2
Max. Coalesced Access      32       8     16     16       1       1       2           2           2           2           2
Avg. Competing SMs          0      10      7      5       7       7      10           9           9          13          13
Max. Competing SMs          0      14     14     14      14      14      14          14          14          14          14

Fig. 8 shows the estimated WCET results of the estimator using different combinations of the estimated number of coalesced memory accesses and number of competing SMs. Since there are two types of estimated values for each metric, i.e., the average and the maximal, there are four groups of estimated WCET results. In kernel k1 of the gsn benchmark, only one SM is active in the execution. Therefore, both the average and maximal numbers of competing SMs are 0. In all the other kernels all the SMs are active. Therefore, the maximal number of competing SMs is 14. As shown in the results, the estimator has the lowest overestimation when the average values are used for both metrics. The overestimation increases when the maximal estimated value is used for either or both of the metrics. The estimated average and maximal values of the number of coalesced accesses and competing SMs are shown in Table IV. When the difference between the average and maximal values is small, or when they are the same, the increase in the overestimation is small. But when the difference grows, the overestimation increases. For example, in gsn k2 and nw, both the number of coalesced accesses and the number of competing SMs differ between their average and maximal values and, as a result, the overestimation is large when the maximal values are used. For the two kernel hierarchy configurations of the srad benchmark, srad128 has a smaller estimated average number of competing SMs than srad512, since there are fewer active warps per SM in srad128. Therefore, when the maximal values are used, the overestimation in srad512 is smaller than in srad128, since its estimated average value is closer to the maximal one. For hard real-time applications, the maximal estimated values of these two metrics should be used, while for soft real-time applications, the average values can be used.

Fig. 8: Normalized Estimated WCET with Maximal and Average Number of Coalesced Accesses and Competing SMs

Fig. 9 shows the normalized average-case performance results of the proposed greedy then round-robin scheduling policy and the default loose round-robin policy in GPGPU-Sim. In the proposed greedy then round-robin scheduling policy, the scheduler tries to issue at least one instruction for a warp. However, this can introduce performance overhead due to missing some of the opportunities for hiding latency. As shown in Fig. 9, on average the performance overhead is about 50%, which we consider the trade-off for predictability.

Fig. 9: Normalized Average-Case Performance Results

VI. RELATED WORK

GPUs are now widely used in all kinds of data-parallel applications, due to their ability to boost performance with parallel computing. For instance, GPUs are useful in medical image processing[13] and can achieve high efficiency and performance improvements in specific algorithms[14].

There is also research on real-time scheduling with GPUs and heterogeneous processors, which focuses on real-time scheduling algorithms[15][16]. Although these studies try to employ GPUs in real-time applications, they assume that the WCETs of the real-time applications and tasks are known to the scheduling algorithms, which emphasizes the importance of having reliable WCET estimations for GPU kernels.

Studies on performance analysis of GPU architectures and GPGPU applications[17][18] focus on building performance models. These studies mainly concentrate on models of average-case performance and/or on using the model to identify performance bottlenecks, while the performance model in this work focuses on WCET estimation.

There are also studies on GPU warp scheduling policies[19][20] that improve the efficiency of utilizing the computational resources and access the memory in a more friendly way, so that performance is improved. However, the scheduling policy proposed in this work focuses on improving the predictability of the GPU architecture. The memory access reordering method proposed in [21] regulates the order of memory accesses to the GPU L1 data cache to improve the time predictability of the GPU L1 data cache, while the proposed scheduling policy and analyzer in this work focus on the timing model of the whole GPU system.

The studies on GPU WCET analysis[22][23] use measurement-based methods, while the WCET analysis method proposed in this work is based on static timing analysis and can give safe WCET estimations for GPU kernels.

VII. CONCLUSION AND FUTURE WORK

The parallel computing capability of GPUs can potentially benefit real-time applications with better performance and energy efficiency. However, the time predictability issues of the GPU architecture and GPGPU applications must be addressed. In this work, we analyze the GPU architecture of a detailed GPU simulator, based on which we propose a predictable warp scheduling policy that enables us to build a worst-case timing model for GPU kernels. The experimental results show that our WCET analyzer can effectively provide WCET estimations for both soft and hard real-time purposes.

The proposed timing model and the WCET analysis method developed for the GPU can be further enhanced in our future work. We plan to incorporate static timing models for cache memories and shared memory in the future to estimate the GPU execution time more accurately. Also, we will explore other time-predictable warp scheduling methods to reduce the impact on the average-case performance.

ACKNOWLEDGMENT

This work was funded in part by the NSF grant CNS-1421577.

REFERENCES

[1] Nvidia. Nvidia Tegra Mobile Processors. http://www.nvidia.com/object/tegra.html.
[2] Nvidia. Nvidia DRIVE PX 2. http://www.nvidia.com/object/drive-px.html.
[3] J. Shotton, et al. Real-time human pose recognition in parts from single depth images. Commun. ACM 56, 1 (January 2013), 116-124.
[4] D. Cireşan, et al. Multi-column deep neural network for traffic sign classification. Neural Networks, Volume 32, August 2012, Pages 333-338.
[5] NVIDIA Next Generation CUDA Compute Architecture: Fermi. www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.
[6] Nvidia CUDA. CUDA Toolkit Documentation v7.0.
[7] J. E. Stone, D. Gohara, and Guochun Shi. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. In Computing in Science and Engineering, 2010.
[8] Martin Schoeberl. Time-Predictable Computer Architecture. EURASIP Journal on Embedded Systems, 2009.
[9] A. Bakhoda, et al. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, Apr 2009.
[10] W. W. L. Fung, et al. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Transactions on Architecture and Code Optimization, Volume 6, Issue 2, June 2009.
[11] Nvidia CUDA. Parallel Thread Execution ISA Version 4.2.
[12] S. Che, et al. Rodinia: A benchmark suite for heterogeneous computing. In Proc. of the IEEE Int. Symp. on Workload Characterization, 2009.
[13] A. Eklund, et al. Medical image processing on the GPU - Past, present and future. Medical Image Analysis, Volume 17, Issue 8, December 2013, Pages 1073-1094.
[14] D. Merrill, et al. Scalable GPU graph traversal. In Proc. of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
[15] G. Elliott and J. Anderson. Globally scheduled real-time multiprocessor systems with GPUs. In Real-Time Systems, vol. 48, 2012.
[16] G. A. Elliott, et al. GPUSync: Architecture-Aware Management of GPUs for Predictable Multi-GPU Real-Time Systems. In Proc. of RTSS, 2013.
[17] J. Sim, et al. A performance analysis framework for identifying potential benefits in GPGPU applications. In Proc. of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012.
[18] Z. Cui, et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization. In Proc. of the Parallel and Distributed Processing Symposium (IPDPS), 2012.
[19] A. Jog, et al. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proc. of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.
[20] V. Narasiman, et al. Improving GPU performance via large warps and two-level warp scheduling. In Proc. of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.
[21] Y. Huangfu, et al. Warp-Based Load/Store Reordering to Improve GPU Data Cache Time Predictability and Performance. In Proc. of the 19th International Symposium on Real-Time Distributed Computing, 2016.
[22] A. Betts and A. Donaldson. Estimating the WCET of GPU-Accelerated Applications Using Hybrid Analysis. In Proc. of the 25th Euromicro Conference on Real-Time Systems, 2013.
[23] K. Berezovskyi, et al. WCET Measurement-based and Extreme Value Theory Characterisation of CUDA Kernels. In Proc. of RTNS, 2014.

