
Improving Bayesian Computational Time and Scalability with GPGPU

Thanakij Pechprasarn, Noppadon Khiripet
thanakij.pechprasarn@nectec.or.th

Knowledge Elicitation Archiving Laboratory (KEA)
National Electronics and Computer Technology Center (NECTEC)

ANSCSE 15, 1st April 2011
Bayesian applications
• Typical problems include inference problems and causal problems
• For example: given that the grass is wet (evidence), what is the probability of each possible cause (rain, sprinkler)?
Bayesian probability
• “Probability” as a degree of belief
• “Conditional probability”: given new information (evidence), your belief changes
• “Posterior” as an inverse probability
• Bayes’ theorem
  P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}

Where,
  P(\theta) = prior of \theta
  P(D \mid \theta) = likelihood
  P(\theta \mid D) = posterior
  P(D) = prior of D (acts as a normalizing constant, of value \int P(D \mid \theta)\, P(\theta)\, d\theta)
Our selected application
• To do hypothesis testing given observed data
• The expected value of the posterior has to fall within the 95% region (credible interval) of the prior distribution
• If it does, the hypothesis is accepted; otherwise it is rejected
Posterior expectation
• The expected value of the posterior, E_{P(\theta \mid D)}[\theta]

• It requires one to sample from the posterior, but a sampling method for an arbitrary posterior may not be known, especially when the posterior has a complex form
• We can work out the math to make it simpler

Remark: a powerful method such as Markov chain Monte Carlo (MCMC) can sample from such posteriors (see Future Work)

Posterior expectation (2)
• The definition of an expected value:
  E[X] = \int_{-\infty}^{\infty} x \, P(x) \, dx
• So,
  E_{P(\theta \mid D)}[\theta] = \int_{-\infty}^{\infty} \theta \, P(\theta \mid D) \, d\theta
• Using Bayes’ rule,
  = \int_{-\infty}^{\infty} \frac{\theta \, P(D \mid \theta) \, P(\theta)}{P(D)} \, d\theta
• From the definition of an expectation,
  = \frac{E_{P(\theta)}[\theta \, P(D \mid \theta)]}{P(D)}
• Now we have changed the distribution from the posterior to the prior: E_{P(\theta \mid D)}[\ldots] \Rightarrow E_{P(\theta)}[\ldots]
• We assume that a known sampling method for the prior distribution exists
Hypothesis testing
• We do the testing to see if the calculated expected value of the posterior falls within the 95% region of the prior distribution
• That is, to see if \int_{-\infty}^{\text{value}} P(\theta) \, d\theta < 0.95

Problems
• However, we still have to solve the integrals appearing in the denominator, P(D), and in the hypothesis test
• Analytical methods may not work because a closed-form solution may not be found
• Notice that we can convert back and forth between the integrals and the expectations
• But how can we actually solve either an integral or an expected value?
Solutions
• Monte Carlo integration (MCI) can be used to approximate an expectation/integral involving a “random” process:
  E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)

Thus, to find an expectation with MCI:
1. Sample x_{1..N} according to the underlying distribution
2. Calculate the sample mean of f(x_1), ..., f(x_N)
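
A minimal host-side sketch of these two steps (illustrative only; f(x) = x^2 and x ~ N(0, 1) are our own toy choices, not the application's model):

// Monte Carlo integration sketch: estimate E[f(x)] for f(x) = x^2, x ~ N(0,1).
// The exact answer is 1, so the printed estimate should be close to it.
#include <cstdio>
#include <random>

int main() {
    const int N = 1 << 20;                          // number of samples
    std::mt19937 rng(42);                           // fixed seed for reproducibility
    std::normal_distribution<double> dist(0.0, 1.0);

    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double x = dist(rng);                       // step 1: sample from the distribution
        sum += x * x;                               // step 2: accumulate f(x)
    }
    std::printf("E[f(x)] ~= %f (exact: 1)\n", sum / N);
    return 0;
}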
Solutions (2)
• Unfortunately, MCI also has its drawbacks
• In general, the more samples, the more accurate the final answer
• However, with many more samples, the computation becomes much slower!
GPUs and CUDA
• GPU computing: leveraging graphics cards as accelerators for computation
• Nvidia CUDA is a major framework for programming GPUs
• CUDA allows developers to exploit parallelism in the form of blocks and threads
Parallel reduction
• A common programming pattern found in parallel programs
• It lets one calculate a summation quickly
• With its tree-based structure, many additions can be done in parallel (a minimal kernel sketch follows)
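
A minimal CUDA reduction kernel, as a sketch (names and launch configuration are illustrative assumptions, not the authors' actual code):

// Tree-based parallel reduction: each block sums blockDim.x elements of `in`
// into one element of `out`, halving the number of active threads each level.
__global__ void block_sum(const float* in, float* out, int n) {
    extern __shared__ float s_data[];           // one shared-memory slot per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s_data[tid] = (i < n) ? in[i] : 0.0f;       // load, padding with 0 past the end
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            s_data[tid] += s_data[tid + s];     // add the upper half onto the lower half
        __syncthreads();                        // finish this tree level before the next
    }

    if (tid == 0)
        out[blockIdx.x] = s_data[0];            // one partial sum per block
}

A launch such as block_sum<<<num_blocks, num_threads, num_threads * sizeof(float)>>>(d_in, d_out, n) leaves one partial sum per block, which a second pass (or the CPU) then sums.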
Previous work
• Sped up the computation of parallel reduction using GPUs
• With special care taken over floating-point errors
• Achieved a maximum speed-up of 57.19x over the sequential (CPU) code
Current work
• Make use of our previous work, the parallel reduction module on GPUs
• Speed up the computation in a real-world Bayesian application with GPU computing
Current work (2)
• Calculate the posterior expectation:
  E_{P(\theta \mid D)}[\theta] = \frac{E_{P(\theta)}[\theta \, P(D \mid \theta)]}{P(D)}
  = \frac{E_{P(\theta)}[\theta \, P(D \mid \theta)]}{\int P(D \mid \theta) \, P(\theta) \, d\theta}
  = \frac{E_{P(\theta)}[\theta \, P(D \mid \theta)]}{E_{P(\theta)}[P(D \mid \theta)]}

With this form, we can calculate the expectations with MCI for both the numerator and the denominator
Current work (3)
• Given the computed value of the posterior expectation, one can test the hypothesis via Monte Carlo methods as follows:
1. x_{1..N} = samples from the prior
2. count = the number of samples whose value is less than the expected value
3. If count/N < 0.95 then accept, else reject
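
A host-side sketch of this test (the prior N(5, 0.5) and the expected value 5.483 are taken from later slides; treating 0.5 as the standard deviation is our assumption):

// Monte Carlo hypothesis test: does the posterior mean fall within the
// lower 95% region of the prior? (illustrative sketch, not the authors' code)
#include <cstdio>
#include <random>

int main() {
    const int N = 1 << 20;
    const double posterior_mean = 5.483;               // computed posterior expectation
    std::mt19937 rng(42);
    std::normal_distribution<double> prior(5.0, 0.5);  // broad prior (0.5 taken as sigma)

    long count = 0;
    for (int i = 0; i < N; ++i)
        if (prior(rng) < posterior_mean)               // empirical CDF at the posterior mean
            ++count;

    bool accept = (double)count / N < 0.95;
    std::printf("P(theta < E) ~= %f -> %s\n", (double)count / N,
                accept ? "accept" : "reject");
    return 0;
}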
Structures of the parallel program
1. Sample from the prior, P(\theta) (CPU)
2. Copy the samples to the GPU
3. Calculate the posterior expectation (GPU)
   – The numerator part: \sum_{i=1}^{N} \theta_i \, P(D \mid \theta_i) / N
   – The denominator part: \sum_{i=1}^{N} P(D \mid \theta_i) / N
4. Copy the result back to the CPU
5. Do hypothesis testing, \int_{-\infty}^{\text{value}} P(\theta) \, d\theta < 0.95 (CPU)
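
A sketch of the GPU step, reusing the block_sum reduction sketched earlier; lik() stands for the likelihood P(D | \theta) and is sketched on the model slide below (all names are illustrative assumptions):

// Build both MCI summands on the GPU, then reduce each array with block_sum.
// theta[i] are the prior samples copied over from the CPU.
__device__ float lik(float theta);               // device likelihood, sketched later

__global__ void build_terms(const float* theta, float* num, float* den, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float L = lik(theta[i]);
        num[i] = theta[i] * L;                   // numerator summand: theta_i * P(D|theta_i)
        den[i] = L;                              // denominator summand: P(D|theta_i)
    }
}
// Reducing num and den and dividing the two sums gives the posterior
// expectation; the two 1/N factors cancel in the ratio.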

Partial GPU implementation


Extra issues
In addition to the parallelized Bayesian application, we also handle 2 issues found in our previous work, in the parallel reduction step:
1. Further optimization
– Although results from previous work show that the computational time is substantially reduced, we find that it can be further improved
– Techniques: loop unrolling, a more compact kernel launch
2. Scalability
– The problem is that a given block size can only handle problem sizes up to a certain point, so small blocks cannot handle larger problems
What about the likelihood and prior?
• Prior: \theta \sim N(5, 0.5) (broad prior)
• Each observation: D_i \sim N(\mu, 0.04) (normal model)
• Likelihood: \prod_{i=1}^{23} N(D_i; \mu, 0.04) (observations are independent)

• The 23 observations we’ve used are from Cavendish’s data:
5.36, 5.29, 5.58, 5.65, 5.57, 5.53,

5.62, 5.29, 5.44, 5.34, 5.79, 5.10,

5.27, 5.39, 5.42, 5.47, 5.63, 5.34,

5.46, 5.30, 5.78, 5.68, 5.85


Platforms
• CUDA 3.2
• A workstation with the following specification:

Description              CPU            GPU
Model                    Intel Core i7  Nvidia GeForce GTX 580
Clock frequency (GHz)    2.8            1.56
# processors             2              16
# cores per processor    4              32
# total cores            8              512
Results
Results (2)
• The calculated expected value is about 5.483
• It falls within the 95% region, so the hypothesis is accepted
Results (3)
• Running time: Sequential (CPU) vs Parallel (GPU)

[Chart: running time comparison, CPU vs GPU]
• Our maximum speed-up achieved is
53.49x
Results (4)
• However, we know that the parallel implementation also contains a sequential part
• Currently only the portion finding the posterior expectation is parallelized
• If we compare the running time of this specific portion between the CPU and GPU versions, we see a greater difference in performance
• And the maximum speed-up we achieve on this portion alone is higher
Future Work
• Employ CURAND and develop a full GPU implementation, i.e., generate the random samples within the GPU, to see if the speed-up increases
• Use advanced techniques such as MCMC when sampling, to cover the case where the distribution has no known sampling method
• This may not be trivial to parallelize, since each MCMC sample depends on the previous one
Summary
• We’ve implemented a Bayesian application to do hypothesis testing given a posterior expectation
• We developed parallel programs running on GPUs to help accelerate the computation
• Our maximum speed-up obtained is 53.49x
• In addition, we cope with the optimization and scalability issues found in our previous parallel reduction work
Thank You
• Q&A
Solving the scalability issues
• We now use a 2D grid of blocks instead of a 1D grid (a sketch follows)
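
A sketch of the 2D-grid workaround (the flattening below is the standard idiom; the authors' exact scheme isn't shown, so names are illustrative):

// A 1D grid caps gridDim.x at 65,535 blocks (compute capability 2.x), so a
// 128-thread block tops out at 65,535 * 128 elements. A 2D grid lifts the cap
// by flattening (blockIdx.y, blockIdx.x) back into one linear block id.
__global__ void kernel2d(const float* in, float* out, int n) {
    int block = blockIdx.y * gridDim.x + blockIdx.x;   // linear block id
    int i = block * blockDim.x + threadIdx.x;          // linear element id
    if (i < n)
        out[i] = in[i];                                // placeholder body
}

// Host side: split the blocks over two grid dimensions, e.g.
//   dim3 grid(65535, (num_blocks + 65534) / 65535);
//   kernel2d<<<grid, 128>>>(d_in, d_out, n);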
Results (scalability issue)
Problem Size     Running Time (s), block size = 128
65,535           0.011
131,070          0.021
262,140          0.041
524,280          0.080
1,048,560        0.159
2,097,120        0.317
4,194,240        0.631
8,388,480        1.261
16,776,960       2.523
33,553,920       5.076
67,107,840       10.368
134,215,680      20.516
268,431,360      40.332

• We show that the smallest block size can also be used with the largest problem size (this would not have been possible in our previous work)
Further optimization: Loop unrolling

(* parallel reduction in the reduce kernel *)
FOR s from num_samples/2 down to 64, halving s each iteration
  Sync threads (* make sure that all threads are working on the same level of the tree *)
  IF threadId is less than s THEN
    Add s_data[threadId + s] to s_data[threadId]
  END IF
END FOR

(* loop unrolling for the last warp *)
IF threadId is less than 32 THEN (* CUDA warp size is 32 *)
  Add s_data[threadId + 32] to s_data[threadId]
  Add s_data[threadId + 16] to s_data[threadId]
  Add s_data[threadId + 8] to s_data[threadId]
  Add s_data[threadId + 4] to s_data[threadId]
  Add s_data[threadId + 2] to s_data[threadId]
  Add s_data[threadId + 1] to s_data[threadId]
END IF
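
In real CUDA code, the warp-unrolled tail needs a volatile shared-memory pointer so the compiler re-reads each partial sum instead of caching it in a register (a standard caveat for this warp-synchronous idiom; the sketch below is ours, not the authors' kernel):

// Warp-synchronous tail: one warp sums the last 64 values without any
// __syncthreads(), relying on lock-step execution within a warp.
__device__ void warp_reduce(volatile float* s_data, int tid) {
    s_data[tid] += s_data[tid + 32];
    s_data[tid] += s_data[tid + 16];
    s_data[tid] += s_data[tid + 8];
    s_data[tid] += s_data[tid + 4];
    s_data[tid] += s_data[tid + 2];
    s_data[tid] += s_data[tid + 1];
}
// Called from the reduce kernel as: if (threadIdx.x < 32) warp_reduce(s_data, threadIdx.x);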
Further optimization: Enhanced compacting kernel

• Original version:
  kernel_reduce<<<num_samples, 1>>>(...)
• Modified version:
  kernel_reduce<<<num_samples/num_threads, num_threads>>>(...)
• Launching one thread per block leaves most of each multiprocessor idle; grouping num_threads elements per block makes far better use of the hardware
Effect of further optimization

[Chart: running time with each further optimization applied]

• Unfortunately, each introduced optimization of the parallel reduction seems to yield only a small gain
• We find that this is due to the other hot spot in the program that dominates the computation (that is, the time spent outside the parallel reduction itself)

Monte Carlo integration (MCI)
• We want to integrate f over [a, b]:
  I = \int_a^b f(x) \, dx
• Multiply and divide by P(x), a distribution that we know how to sample from:
  I = \int \frac{f(x)}{P(x)} \, P(x) \, dx
• Change into the form of an expectation:
  I = E_{P(x)}\!\left[\frac{f(x)}{P(x)}\right]
• Estimate the integral by sampling from P(x) and calculating the sample mean:
  I \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{P(x_i)}
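
A minimal sketch of this estimator (f(x) = x^2 on [0, 1] with P = U(0, 1) is our own toy choice; the exact answer is 1/3):

// Importance-sampling form of MCI: I = integral of x^2 over [0, 1].
// With P = U(0,1), P(x) = 1 on [0, 1], so the estimator averages f(x_i)/1.
#include <cstdio>
#include <random>

int main() {
    const int N = 1 << 20;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> P(0.0, 1.0);  // sampling distribution

    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        double x = P(rng);
        sum += (x * x) / 1.0;                   // f(x) / P(x), with P(x) = 1 here
    }
    std::printf("I ~= %f (exact: 1/3)\n", sum / N);
    return 0;
}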
