
Improving Bayesian Computational Time and Scalability with GPGPU

T. Pechprasarn* and N. Khiripet
Knowledge Elicitation and Archiving Laboratory, National Electronics and Computer Technology Center, Pathumthani, 12120, Thailand
*E-mail: thanakij.pechprasarn@nectec.or.th; Fax: 02-5646772; Tel. 02-5646900 ext. 2220

ABSTRACT
It is almost impossible to find the posterior probability in Bayesian inference analytically, due to the lack of closed-form antiderivatives. Instead, an approximation method such as Monte Carlo integration (MCI) is used to calculate the integral. MCI involves a random process to generate samples corresponding to the target distribution. In general, a larger number of samples yields a more accurate result; however, it also requires more computational time. To obtain higher performance, NVidia CUDA can accelerate the computation by leveraging a parallel programming pattern called "parallel reduction". Although our previously achieved speed-up is reasonable, it can still be improved, and in addition to the running time, scalability is another issue to address. In this paper, we further optimize our parallel programs by introducing optimization techniques and also cope with the problem of scalability. Our optimization methods include loop unrolling and an enhanced compact kernel. To improve scalability, we utilize the multidimensional feature of CUDA by using 2D blocks instead of 1D blocks. The results show that the computation time is substantially decreased and that the program can handle much larger problem sizes even when a small block size is used. We conclude our work by identifying proper block sizes for certain problem sizes.

Keywords: Bayesian probability, Monte Carlo integration, Parallel reduction, GPU computing, CUDA.

1. INTRODUCTION
In Bayesian probability, one is often interested in finding the posterior distribution to test a hypothesis given observed training data. However, solving for the posterior is challenging because the posterior typically takes the form of integrals, and closed-form solutions for such integrals are usually not available [3]. Instead, an approximation method such as Monte Carlo integration (MCI) is used to find the integrated value. MCI involves a random process to generate samples from the target distribution; the contribution of each sample to the final integrated value is then calculated. In general, using a larger number of samples yields more accurate results. Nevertheless, when the sample size is large, the computation becomes much slower. Therefore, we speed up the computation of MCI with GPUs. We implement a parallel program using the Compute Unified Device Architecture (CUDA), a leading framework for programming GPUs. Given a set of samples, our work focuses on the core integration part, which involves summing the contributions of all samples. We employ a parallel pattern called parallel reduction for finding this summation; it suits our CUDA programs because it allows many parts of the calculation to be done in parallel [5]. The experimental results of our previous work [7] indicate that higher performance is gained, since the running time is substantially decreased. Although our previous work was successful to some extent, it can still be improved in many aspects. For example, the computational time can be further reduced by introducing optimization methods such as loop unrolling. There is also a scalability issue: we could not use smaller block sizes for larger problem sizes. This issue is important because it prevents us from determining the effect and performance of smaller block sizes; solving it significantly enlarges the search space for finding the optimal running time. To solve the scalability problem, we utilize the multidimensional feature of CUDA and divide the samples into 2D blocks instead of 1D blocks. This lets us use smaller block sizes, such as 128, for larger problem sizes. We present our work to reduce the running time with the chosen optimization techniques and to cope with the problem of scalability. In addition, a real-world example of a Bayesian application is provided. According to our experiments, our parallel programs perform much better than the sequential implementation; for example, the maximum speed-up obtained is 53.49 times over the sequential code.

2. THEORY AND RELATED WORKS

2.1 Bayesian Probability


According to [1], Bayes' rule is defined as

$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)} \qquad (1)$$

where
D = observed data
$\theta$ = the hypothesis, defined by parameter $\theta$
$P(\theta \mid D)$ = posterior of $\theta$ given D
$P(D \mid \theta)$ = likelihood of $\theta$
$P(\theta)$ = prior probability of $\theta$
$P(D)$ = probability of D
The posterior is often of interest because it is an inverse probability, as opposed to the direct probability of classical statistics, and it is used to infer the causes given observed data. Given data, the model, called the likelihood, can be constructed. The prior distribution expresses general knowledge about the data. Then, to decide whether the hypothesis is accepted or rejected, the expected value of the posterior has to be computed: the posterior expectation has to fall within the 95% region of the prior distribution.
According to [2], an expected value gives the average outcome of a function in the long run and is defined as

$$E[g(x)] = \int g(x)\,P(x)\,dx \qquad (2)$$

where
$P(x)$ = probability density function
The expectation of the posterior is $E[\theta \mid D]$. According to (2),

$$E[\theta \mid D] = \int \theta\,P(\theta \mid D)\,d\theta$$

According to (1),

$$E[\theta \mid D] = \int \theta\,\frac{P(D \mid \theta)\,P(\theta)}{P(D)}\,d\theta$$

which gives

$$E[\theta \mid D] = \frac{1}{P(D)} \int \theta\,P(D \mid \theta)\,P(\theta)\,d\theta \qquad (3)$$

where
$P(D)$ = the constant value $\int P(D \mid \theta)\,P(\theta)\,d\theta = E[P(D \mid \theta)]$, the expectation of the likelihood under the prior (again by (2))

2.2 Monte Carlo Integration (MCI)


Involving a random process, MCI is an integration method for finding the value of a definite integral [4]. A general form of such an integral is

$$I = \int_a^b f(x)\,dx \qquad (4)$$

We can divide and multiply f(x) by P(x), so (4) becomes

$$I = \int_a^b \frac{f(x)}{P(x)}\,P(x)\,dx \qquad (5)$$

where
$P(x)$ = probability density function on the interval [a,b]

According to (2), we have

$$I = E\!\left[\frac{f(x)}{P(x)}\right] \qquad (6)$$

But the expected value can be estimated as

$$E[g(x)] \approx \frac{1}{N}\sum_{i=1}^{N} g(x_i) \qquad (7)$$

From (6) and (7), MCI is defined as

$$I \approx \frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{P(x_i)} \qquad (8)$$

where
$P(x)$ = a sampling distribution
N = the number of samples
N = the number of samples
There are two major steps in MCI. The first is to generate a set of samples from the sampling distribution. Then, the contributions from all samples are summed to find the integrated value.
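To make the two steps concrete, here is a minimal sequential sketch of (8) in C/CUDA host code, assuming a uniform sampling distribution P(x) = 1/(b-a) on [a,b]; the integrand exp(x) is our example choice, not one used in the paper:

#include <cstdio>
#include <cstdlib>
#include <cmath>

// example integrand: the integral of exp(x) over [0,1] equals e - 1
double f(double x) { return exp(x); }

int main() {
    const double a = 0.0, b = 1.0;
    const int N = 1 << 20;                  // number of samples
    const double p = 1.0 / (b - a);         // uniform pdf P(x) on [a,b]
    double sum = 0.0;
    for (int i = 0; i < N; ++i) {
        // step 1: draw x_i from the sampling distribution P(x)
        double x = a + (b - a) * (rand() / (double)RAND_MAX);
        // step 2: accumulate the contribution f(x_i)/P(x_i)
        sum += f(x) / p;
    }
    printf("MCI estimate = %f, exact = %f\n", sum / N, exp(1.0) - 1.0);
    return 0;
}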

2.3 Parallel Reduction


Parallel reduction is a common pattern for reducing a set of numbers to a single value [5]. The structure of parallel reduction is shown in Figure 1. With a tree-based structure, there are log2(N) tree levels. All operations at the same level can be done in parallel, but each level has to wait until the operands from the previous level are ready. We employ the parallel reduction pattern in the second step of MCI, which is finding the summation of the contributions from the samples.


Figure 1. Structure of parallel reduction.
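To make the pattern concrete, a minimal per-block reduction kernel in the style of [5] could look like the following sketch (the names are ours, not the paper's exact code; it assumes blockDim.x is a power of two):

// tree-based reduction: each block produces one partial sum
__global__ void kernel_reduce(const float *in, float *out, int n) {
    extern __shared__ float s_data[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s_data[tid] = (i < n) ? in[i] : 0.0f;   // one element per thread
    __syncthreads();
    // each loop iteration is one level of the tree in Figure 1
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            s_data[tid] += s_data[tid + s];
        __syncthreads();                    // wait until the level is done
    }
    if (tid == 0)
        out[blockIdx.x] = s_data[0];        // write the block's partial sum
}

The partial sums in out are then reduced again, pass by pass, until a single value remains.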

2.4 Compute Unified Device Architecture (CUDA)


According to [6], CUDA is a general-purpose parallel computing architecture that comes with a programming model and a new instruction set architecture. The architecture consists of GPUs with streaming multiprocessors, each of which contains CUDA cores. CUDA exploits parallelism via blocks of threads. Blocks are executed independently by the CUDA cores, so more than one block can be executed in parallel, depending on the available CUDA core resources. This allows CUDA programs to scale up automatically by simply running more blocks. A kernel is a function to be executed on the GPU; in order to launch a kernel, the CPU caller has to specify both the number of blocks and the number of threads per block.
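For example, a hypothetical launch of the reduction kernel sketched in Section 2.3 over n elements with 256 threads per block could look like this (d_in and d_out are assumed to be device pointers allocated elsewhere):

int threads = 256;
int blocks = (n + threads - 1) / threads;       // ceiling division
size_t smem = threads * sizeof(float);          // shared memory per block
kernel_reduce<<<blocks, threads, smem>>>(d_in, d_out, n);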

3. IMPLEMENTATION DETAILS

3.1 Bayesian Application
We extend our previous work by introducing a Bayesian application that calculates the expectation of the posterior, i.e., the result of (3). Using MCI, the first step is to generate a set of samples according to the prior distribution; this random number generation is done on the CPU. Next, using the generated samples, the expected value is calculated with parallel reduction according to (7), and this parallel reduction is accelerated on the GPU. After obtaining the expectation of the posterior, hypothesis testing is performed by checking whether the probability falls within the 95% region of the prior. An overview of the implementation is shown in Figure 2.
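Concretely, applying the estimator (7) to both the numerator and denominator of (3), with samples $\theta_i$ drawn from the prior $P(\theta)$, makes the $1/N$ factors and the constant $P(D)$ cancel, giving the self-normalized form implemented in Figure 2:

$$E[\theta \mid D] \approx \frac{\sum_{i=1}^{N} \theta_i\,P(D \mid \theta_i)}{\sum_{i=1}^{N} P(D \mid \theta_i)}, \qquad \theta_i \sim P(\theta).$$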

(* Calculate the expectation of the posterior using MCI *)


SET samples to Sampling(Normal(5,0.5), N)
SET numerator to Reduce(f, samples, N) using f(x) = x*lhd(x)
SET denominator to Reduce(f, samples, N) using f(x) = lhd(x)
SET expected_value to numerator/denominator

(* Hypothesis testing *)
SET pH0 to Test(expected_value, Normal(5,0.5))
RETURN pH0 < 0.95
Figure 2. Implementation of the Bayesian application.
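As a hedged device-side sketch of the lhd function used in Figure 2 (our naming and layout; the paper does not list this code), the likelihood for the model of Section 4.2, where each observation is N(θ, 0.04), could be evaluated as:

#define NUM_DATA 23
__constant__ float d_data[NUM_DATA];   // the 23 observations, copied from the host

// lhd(theta) = product over i of N(d_i; theta, 0.04)
// (for larger datasets a log-likelihood would be more robust to underflow)
__device__ float lhd(float theta) {
    const float var = 0.04f;
    const float norm = 1.0f / sqrtf(2.0f * 3.14159265f * var);
    float l = 1.0f;
    for (int i = 0; i < NUM_DATA; ++i) {
        float diff = d_data[i] - theta;
        l *= norm * expf(-diff * diff / (2.0f * var));
    }
    return l;
}

// per-sample contributions for the two reductions in Figure 2
__device__ float f_numerator(float x)   { return x * lhd(x); }
__device__ float f_denominator(float x) { return lhd(x); }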

3.2 Solving the Scalability Issue


In order to solve the scalability issue found in our previous work [7], we employ the multidimensional feature of CUDA. Our CUDA programs divide an array of samples into smaller blocks, which are later reduced to find the integrated value. Figure 3 illustrates our idea of transforming 1D blocks into 2D blocks. With 1D blocks, the maximum number of blocks we can use is 65535 x 1 x 1 = 65535. If 2D blocks are used instead, the maximum number of blocks becomes 65535 x 65535 x 1 = 4294836225. This number is large enough to utilize the CUDA core resources and also allows us to scale to larger problem sizes. Theoretically, the maximum problem size for a certain block size is the number of blocks multiplied by the block size; in practice, the physical limit of GPU memory may be the bottleneck for very large problem sizes, as the example below shows.
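For instance, with a block size of 128, the theoretical maximum problem size is

$$65535 \times 65535 \times 128 \approx 5.5 \times 10^{11} \text{ samples},$$

far more than fits in device memory: assuming the GTX 580's 1.5 GB and 4-byte samples, only about $4 \times 10^{8}$ samples can be resident at once, so memory, not grid dimensions, is the practical limit.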

Figure 3. Transformation from 1D blocks to 2D blocks.

(* 1D block representation *)
SET num_blocks to num_samples/block_size
SET block.x to num_blocks
SET block.y to 1

(* 2D block representation *)
SET num_blocks to num_samples/block_size
SET num_rows to num_blocks/row_size
SET block.x to min(num_blocks, row_size)
SET block.y to num_rows

(* # CUDA blocks = min(block.x, 65535) x min(block.y, 65535) *)


Figure 4. Implementation of 2D blocks.

Figure 4 shows our implementation with its one required parameter, the row size. This parameter can be tuned to fit certain problem sizes: if a large row size is used, there may be considerable wasted computation in the last row; with smaller rows there is less wasted computation, but as the number of rows grows it may hit the limit of 65535. Future work can provide a deeper analysis of this trade-off.
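Inside the kernel, the 2D block coordinates must be flattened back into a linear sample index; a minimal sketch (our naming) is:

__global__ void kernel_reduce_2d(const float *in, float *out, int n) {
    // row-major flattening of the 2D grid back to a linear block id
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    int i = block_id * blockDim.x + threadIdx.x;   // global sample index
    if (i < n) {
        // ... the per-block reduction proceeds as in the 1D case ...
    }
}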

3.3 Performance Optimizations
3.3.1 Loop unrolling
We apply loop unrolling to our parallel reduction code. The major advantage of loop unrolling is that the loop condition no longer needs to be checked at each iteration. We unroll the last six iterations, since at that point the number of active threads is guaranteed to be within one warp. This brings an extra benefit: unnecessary, expensive synchronization instructions can be removed, because threads within the same warp always execute the same instruction and therefore require no synchronization point. The idea is shown in Figure 5.

(* parallel reduction in the reduce kernel *)

FOR s from block_size/2 down to 64, halving s each iteration
Sync threads (* make sure that all threads are working on the same level of the tree *)
IF threadId is less than s THEN
Add s_data[threadId + s] to s_data[threadId]
END IF
END FOR
(* loop unrolling: the last six iterations fit within a single warp *)
IF threadId is less than 32 THEN (* CUDA warp size is 32 *)
Add s_data[threadId + 32] to s_data[threadId]
Add s_data[threadId + 16] to s_data[threadId]
Add s_data[threadId + 8] to s_data[threadId]
Add s_data[threadId + 4] to s_data[threadId]
Add s_data[threadId + 2] to s_data[threadId]
Add s_data[threadId + 1] to s_data[threadId]
END IF
Figure 5. Loop unrolling in parallel reduction.
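In CUDA C, the warp-level tail of Figure 5 might look like the following sketch, in the style of [5] (not the paper's exact code). The volatile qualifier keeps the compiler from caching shared-memory values in registers, which matters on the Fermi-class GTX 580 used here; the function assumes a block size of at least 64:

// last six levels of the tree: the active threads fit in one warp,
// which executes in lockstep, so no __syncthreads() is needed
__device__ void warp_reduce(volatile float *s_data, int tid) {
    s_data[tid] += s_data[tid + 32];
    s_data[tid] += s_data[tid + 16];
    s_data[tid] += s_data[tid + 8];
    s_data[tid] += s_data[tid + 4];
    s_data[tid] += s_data[tid + 2];
    s_data[tid] += s_data[tid + 1];
}

// in the reduce kernel, the tree loop now stops at one warp:
//     for (int s = blockDim.x / 2; s > 32; s >>= 1) {
//         if (tid < s) s_data[tid] += s_data[tid + s];
//         __syncthreads();
//     }
//     if (tid < 32) warp_reduce(s_data, tid);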

3.3.2 Enhancing the compact kernel


The compact kernel gathers the reduced values from all CUDA blocks and forms a new array, which is then sent to the reduce kernel again until only one block is left. In our previous work, to keep the programming simple, we used only a single thread per block and let each block do the compact job, which does not fully utilize the CUDA resources. Although this is not core code, tuning this part also yields a performance improvement. Figure 6 shows our modification.

(* Original version *)
kernel_compact <<<num_blocks, 1>>>(…)

(* Modified version *)
kernel_compact <<<num_blocks/num_threads, num_threads>>>(…)
Figure 6. Enhancing the compact kernel.
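As a hedged sketch of what such a compact kernel might do (our simplification, assuming the reduce kernel leaves each block's partial sum at the first element of that block's segment, with one thread per partial sum):

__global__ void kernel_compact(const float *in, float *out,
                               int num_blocks, int block_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_blocks)
        out[i] = in[(size_t)i * block_size];   // gather into a dense array
}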

This introduces another parameter, the number of threads for the compact kernel, which we adjust according to the problem size: for example, we use 128 threads if the sample size is less than 8,388,480; 512 threads if it is larger than 16,776,960; and 256 threads if the size is in between. A sketch of this selection is given below.
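As a minimal sketch (the thresholds are those stated above; the helper name is ours):

// hypothetical host-side helper choosing the compact kernel's thread count
int compact_threads(long num_samples) {
    if (num_samples < 8388480L)  return 128;
    if (num_samples > 16776960L) return 512;
    return 256;
}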

4. EXPERIMENTS AND RESULTS

4.1 Platforms
We use an NVidia GeForce GTX 580 as our GPU platform; on the CPU side, we have an Intel Core i7. The detailed specifications are shown in Table 1.

Table 1. Specification of CPUs and GPUs.
Description CPU GPU
Model Intel Core i7 NVidia GeForce GTX 580
Clock frequency (GHz) 2.8 1.56
# processors 2 16
# cores per processor 4 32
# total cores 8 512

4.2 Datasets
Cavendish's data [8] are used in our experiments. The data represent the specific density of the earth, measured in 23 experiments: 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68 and 5.85. Following [9], the corresponding model is also a normal distribution, N(x; θ, 0.04). The prior is chosen to be normal with mean 5 and variance 0.5.

4.3 Results
Our computed posterior expectation is 5.483, which is similar to the result from [9]. We find that the computed probability falls within the 95% region (0.75 < 0.95). Thus, the hypothesis is accepted.

Figure 7. Results from our Bayesian application.

Figure 7 shows example results from our application. In addition to the answer, the running time is provided for both the CPU and GPU versions, broken down into the posterior-expectation part and the hypothesis-testing part. In terms of performance, we expect an improvement in the posterior calculation, since that part is parallelized with CUDA, and similar running times for hypothesis testing, since it has no GPU involvement.

4.3.1 Running Time


According to the experiment, for our Bayesian application, the GPU program takes less time than the CPU implementation. The logarithmic chart in Figure 8 illustrates the comparison.

Figure 8. Running time of CPU and GPU for the whole application (seconds vs. problem size, log scale).

Because there are two main parts in the application, 1) posterior-expectation calculation and 2) hypothesis testing, we provide the detailed running time for each part. The posterior-expectation part is shown in Figure 9.

Figure 9. Running time of CPU and GPU for the posterior-expectation part (seconds vs. problem size, log scale).

The two charts, Figures 8 and 9, reveal a similar trend: the GPU implementation is faster than the CPU one. For the hypothesis-testing part, the code has no GPU involvement, so there is no timing difference between the CPU and GPU versions. However, it is still useful to see how this part scales with the problem size. According to Figure 10, the testing part scales linearly.

Figure 10. Running time of the hypothesis-testing portion (seconds vs. problem size).

Returning to the posterior-expectation calculation, it is interesting to see how each optimization strategy performs on the GPU side. Figure 11 gives the detailed running time of the GPU programs with different optimizations.

Figure 11. Effect of optimization methods in GPU programs (seconds vs. problem size; series: 1) no extra optimization, 2) enhanced compact kernel, 3) loop unrolling, 4) optimizations (2)+(3)).

However, the chart shows little difference in running time among the methods. We attribute this to the evaluation of a complex function, the likelihood, on the GPU during the parallel-reduction step of MCI. Although many threads evaluate the function in parallel, the time even a single thread spends on this evaluation dominates the whole reduction. Because optimization techniques such as enhancing the compact kernel and loop unrolling target the core reduction part, their improvement is very small compared to the time used by the function evaluation, and so the optimization techniques show little difference.

4.3.2 Scalability
We show the results after solving the scalability problem in Table 2. Notice that every block size, even 128, can be used with every problem size; this was not possible in our previous work.

Table 2. Running time (seconds) of GPU programs with different block sizes.

                          Block size
Problem Size      128     256     512     1024    2048    4096
65,535 0.011 0.011 0.011 0.011 0.011 0.011
131,070 0.021 0.021 0.021 0.021 0.021 0.021
262,140 0.041 0.041 0.040 0.047 0.040 0.040
524,280 0.080 0.080 0.081 0.080 0.080 0.080
1,048,560 0.159 0.159 0.165 0.166 0.159 0.158
2,097,120 0.317 0.316 0.316 0.316 0.316 0.316
4,194,240 0.631 0.638 0.652 0.631 0.638 0.631
8,388,480 1.261 1.261 1.264 1.261 1.261 1.262
16,776,960 2.523 2.529 2.522 2.522 2.525 2.524
33,553,920 5.076 5.042 5.117 5.042 5.041 5.045
67,107,840 10.368 10.087 10.082 10.082 10.085 10.084
134,215,680 20.516 20.502 20.786 20.161 20.155 20.516
268,431,360 40.332 40.311 41.300 40.313 40.316 40.329

Table 2 shows no difference in the running time of the GPU programs as the block size varies. Again, this is likely because most of the time is spent outside the core parallel-reduction code, so the effect of different block sizes cannot be seen.

4.3.3 Speed-up
We calculate the speed-up of the GPU programs for both the whole application and the posterior-calculation portion. The speed-ups are shown in Table 3.

Table 3. Speed-ups of GPU programs.


Problem Size      Whole Application      Posterior Expectation
65,535 49.97 84.25
131,070 50.17 88.58
262,140 52.27 91.59
524,280 52.83 93.48
1,048,560 53.22 94.61
2,097,120 53.29 94.95
4,194,240 53.37 95.21
8,388,480 53.45 95.41
16,776,960 53.44 95.52
33,553,920 53.48 95.54
67,107,840 53.49 95.60
134,215,680 53.49 95.56
268,431,360 52.21 95.58

ANSCSE15 Bangkok University, Thailand


March 30-April 2, 2011
The maximum speed-up obtained for the whole application is 53.49 times over the sequential code. For the posterior-expectation portion alone, the maximum speed-up is 95.60 times.

5. CONCLUSION
We illustrate a real world application of Bayesian probability for testing the
hypothesis. The expectation is required to do the hypothesis testing. The implementation
shows that our method can be accurately used to find such the posterior expectation. We
also present an enhancement to our previous work by further optimizing our CUDA programs and
also handling the scalability issue. Our results show that our parallel programs perform better
than the CPU program as they take much less time when executing. In our experiments, we show
that with small block sizes, we still can handle large problem sizes and this is essential since more
solution space has been created. The maximum speed-up identified in our experiment is 53.49
times the sequential code. Future work would focus on employing a full GPU implementation by
generating random numbers in GPUs and also cover the issue of evaluating the function in the
parallel reduction step so that the effect of optimization and block size can be seen.

REFERENCES
1. Bayes, T., and Price, R., "An Essay towards solving a Problem in the Doctrine of Chance. By
the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F.
R. S.". Philosophical Transactions of the Royal Society of London 53, 1763, 370–418.
2. Ross, S., "2.4 Expectation of a random variable". Introduction to probability models (9th ed.).
Academic Press, 2007, p. 38.
3. Tierney, L., and Kadane, J., "Accurate Approximations for Posterior Moments and Marginal
Densities," Journal of the American Statistical Association, 1986, 81, 82-86.
4. Caflisch, R., Monte Carlo and quasi-Monte Carlo methods, Acta Numerica vol. 7, Cambridge
University Press, 1998, pp. 1-49.
5. Harris, M., Mapping computational concepts to GPUs, in: M. Pharr (ed.), GPUGems 2 :
Programming Techniques for High-Performance Graphics and General-Purpose
Computation, chap. 31, Addison-Wesley, 2005, pp. 493–508.
6. NVIDIA CUDA C Programming Guide Version 3.2, 2010.
7. Pechprasarn, T., and Khiripet, N., "Accelerating Bayesian Computation with Parallel Reduction
using CUDA", The 4th Mahasarakham International Workshop on Artificial Intelligence
(MIWAI), 2010, pp. 40-45.
8. Cavendish, H., "Experiments to Determine the Density of the Earth", in MacKenzie, A. S. (ed.),
Scientific Memoirs Vol. 9: The Laws of Gravitation, American Book Co., 1900, pp. 59-105.
9. Piche, R., "Normal Data", note 2 of the Bayesian statistics course, Tampere University of
Technology, 2009.

