GPGPU
T. PechprasarnC and N. Khiripet
Knowledge Elicitation and Archiving Laboratory, National Electronics and Computer Technology Center,
Pathumthani, 12120, Thailand
C E-mail: thanakij.pechprasarn@nectec.or.th; Fax: 02-5646772; Tel: 02-5646900 ext. 2220
ABSTRACT
It is almost impossible to find the posterior probability in Bayesian inference analytically due to the lack of closed-form antiderivatives. Instead, an approximation method such as Monte Carlo integration (MCI) is used to calculate the integral. MCI involves a random process that generates samples corresponding to the target distribution. In general, a larger number of samples yields a more accurate result; however, it also requires more computational time. To obtain higher performance, NVidia CUDA can accelerate the computation by leveraging a parallel programming pattern called "parallel reduction". Although our previously achieved speed-up is reasonable, it can still be improved. In addition to the running time, scalability is another issue to address. In this paper, in order to improve the performance, we further optimize our parallel programs by introducing optimization techniques and also cope with the problem of scalability. Our optimization methods include loop unrolling and enhancing the compacting kernel. To improve scalability, we utilize the multidimensional feature of CUDA by using 2D blocks instead of 1D blocks. The results show that the computation time is substantially decreased and that the program can handle much larger problem sizes even when a small block size is used. We conclude our work by identifying proper block sizes for certain problem sizes.
1. INTRODUCTION
In Bayesian probability, one is often interested in finding the posterior distribution to test a hypothesis given observed training data. However, solving for the posterior is a challenging task because the posterior typically takes the form of integrals, and most of the time closed-form solutions for such integrals are not available [3]. Instead, an approximation method such as Monte Carlo integration (MCI) is used to find the integrated value. MCI involves a random process to generate samples from the target distribution. Then, the contribution of each sample to the final integrated value is calculated. In general, using a larger number of samples yields more accurate results. Nevertheless, when the sample size is large, the computation becomes much slower. Therefore, we try to speed up the computation of MCI with GPUs. We implement a parallel program using the Compute Unified Device Architecture (CUDA) [6], a leading framework for programming GPUs. Given a set of samples, our work focuses on the core integration part. The integration involves finding a summation over the contributed parts. We employ a parallel pattern called parallel reduction for this summation. Parallel reduction is suitable for our CUDA programs as it allows many parts of the calculation to be done in parallel [5]. The experimental results from our previous work [7] indicate that higher performance is gained, since the running time is substantially decreased. Although our previous work was successful to some extent, the achieved speed-up can still be improved, and scalability remains an open issue; this paper addresses both.
2. BACKGROUND

According to Bayes' theorem [1], the posterior probability is given by

P(\theta|D) = \frac{P(D|\theta)\, P(\theta)}{P(D)} \qquad (1)

where,
D = observed data
θ = the hypothesis, defined by its parameter
P(θ|D) = posterior of θ given D
P(D|θ) = likelihood of θ
P(θ) = prior probability of θ
P(D) = probability of D
The posterior is often of interest because it is an inverse probability, as opposed to the direct probability of classical statistics. The posterior is used to infer the causes of observed data. Given the data, a model, called the likelihood, can be constructed. The prior distribution expresses general knowledge about the data. Next, to decide whether the hypothesis is accepted or rejected, the expected value of the posterior has to be computed: the posterior expectation has to fall within the 95% region of the prior distribution.
According to [2], an expected value gives the averaged outcome of a function in the long run and is defined as

E[f(x)] = \int f(x)\, P(x)\, dx \qquad (2)

where,
P(x) = probability density function

The quantity we need is the expectation of the posterior, E[θ|D]. According to (2),

E[\theta|D] = \int \theta\, P(\theta|D)\, d\theta \qquad (3)

According to (1),

E[\theta|D] = \frac{1}{P(D)} \int \theta\, P(D|\theta)\, P(\theta)\, d\theta \qquad (4)

where,
P(D) = a constant value, \int P(D|\theta)\, P(\theta)\, d\theta

Monte Carlo integration [4] approximates a definite integral

I = \int_a^b f(x)\, dx \qquad (5)

where,
P(x) = probability density function on the interval [a,b]

Rewriting the integrand as (f(x)/P(x))\, P(x) and applying (2), we have

I = E\!\left[\frac{f(x)}{P(x)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{f(x_i)}{P(x_i)} \qquad (6)

where,
P(x) = a sampling distribution, from which the samples x_i are drawn
N = the number of samples

Applying (6) to both the numerator and the denominator integrals of (4), with samples θ_i drawn from the prior P(θ), gives

E[\theta|D] \approx \frac{\sum_{i=1}^{N} \theta_i\, P(D|\theta_i)}{\sum_{i=1}^{N} P(D|\theta_i)} \qquad (7)

There are two major steps in MCI. The first is to generate a set of samples from the sampling distribution. Then, the contributions from all samples are summed to obtain the integrated value.
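For concreteness, a minimal sequential sketch of the estimator in (7) is shown below. This is an illustrative reference under our reading of (7); the function name and signature are not from the actual implementation.

/* Sequential reference for (7): samples theta_i are drawn from the
 * prior, and L(theta_i) is the likelihood of sample i. */
double mci_posterior_expectation(const double *theta, long n,
                                 double (*likelihood)(double)) {
    double num = 0.0, den = 0.0;
    for (long i = 0; i < n; ++i) {
        double L = likelihood(theta[i]);  /* contribution of sample i */
        num += theta[i] * L;              /* numerator of (7) */
        den += L;                         /* denominator, estimates P(D) */
    }
    return num / den;
}

The two sums are exactly the kind of quantities that a parallel reduction on the GPU computes.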
3. IMPLEMENTATION DETAILS
1. Bayesian Application
We extend our previous work by introducing a Bayesian application that calculates the expectation of the posterior, i.e., the result of (3). Using MCI, the first step is to generate a set of samples according to the prior distribution. This random number generation part is done on the CPU. Next, using the generated samples, an expected value can be calculated with parallel reduction according to (7). With GPUs, the computation of the parallel reduction is accelerated. After obtaining the expectation of the posterior, hypothesis testing is performed by checking whether the probability falls within the 95% region of the prior. An overview of the implementation is shown in Figure 2.
(* Hypothesis testing *)
SET pH0 to Test(expected_value, Normal(5,0.5))
RETURN pH0 < 0.95
Figure 2. Overview of the implementation.
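The Test step above can be realized with the standard normal CDF. The following is a sketch, under the assumption that Test returns the prior CDF evaluated at the posterior expectation; this reading is consistent with the reported value of 0.75 for an expectation of 5.483 under the N(5, 0.5) prior. The helper name is illustrative.

#include <math.h>

/* Sketch of Test(x, Normal(mean, variance)): the prior CDF at x.
 * The hypothesis is accepted if the value is below 0.95, i.e. the
 * expectation does not fall in the upper 5% tail of the prior. */
double test_against_prior(double x, double mean, double variance) {
    double z = (x - mean) / sqrt(variance);
    return 0.5 * (1.0 + erf(z / sqrt(2.0)));  /* standard normal CDF */
}
/* e.g. test_against_prior(5.483, 5.0, 0.5) ~= 0.75 < 0.95 -> accept */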
2. Scalability

In our previous work, the blocks were laid out in one dimension, so the number of blocks was bounded by the grid-dimension limit, which restricted the problem size. To improve scalability, we arrange the samples into rows of a fixed row size, so that the blocks can be laid out in two dimensions, as illustrated in Figure 3. The 1D and 2D block configurations are compared in Figure 4.

(Diagram: the samples arranged into rows row0, row1, …, each of length row size.)
Figure 3. Arrangement of samples into rows.
(* 1D block representation *)
SET num_blocks to num_samples/block_size
SET block.x to num_blocks
SET block.y to 1

(* 2D block representation *)
SET num_blocks to num_samples/block_size
SET num_rows to num_blocks/row_size
SET block.x to min(num_blocks, row_size)
SET block.y to num_rows
Figure 4. Implementation of 2D blocks.
Figure 4 shows our implementation with a required parameter, the row size. This parameter can be tuned to fit certain problem sizes. If a larger row size is used, there may be a lot of wasted computation in the last row. On the other hand, if smaller rows are used, there is much less wasted computation; however, as the number of rows grows, it may hit the grid-dimension limit of 65535. Future work could provide a more thorough analysis of this trade-off.
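As a concrete illustration, the 2D configuration can be expressed with CUDA's dim3 type. The following sketch assumes the names used in Figure 4 and adds ceiling divisions so that partial blocks and rows are still covered; it is not the exact code of our implementation.

// Sketch: lay out the blocks in 2D so that the total block count can
// exceed the 65535 per-dimension grid limit.
int num_blocks = (num_samples + block_size - 1) / block_size;  // ceil
int num_rows   = (num_blocks + row_size - 1) / row_size;       // ceil
dim3 grid(num_blocks < row_size ? num_blocks : row_size, num_rows);
kernel_reduce<<<grid, block_size>>>(/* ... */);

Inside the kernel, the flat block index is then recovered as blockIdx.y * gridDim.x + blockIdx.x.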
3. Performance
3.1) Loop unrolling
We employ loop unrolling in our parallel reduction code. The major advantage of the loop unrolling technique is that there is no need to check the loop condition when iterating. We unroll the last six iterations, since at that point the number of active threads is guaranteed to be within a warp. By doing so, the loop overhead and the synchronization between those iterations are eliminated.
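A sketch of the unrolled tail, in the style of the reduction in [5], is shown below; it assumes a shared-memory reduction with a block size of at least 64 and a warp size of 32, where the volatile qualifier prevents stale shared-memory reads within the warp.

// Sketch (after [5]): the last six iterations execute within a single
// warp, so no __syncthreads() or loop-condition checks are needed.
__device__ void warp_reduce(volatile double *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}

The caller invokes warp_reduce only for tid < 32, after the strided loop has reduced the block's data down to 64 elements.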
3.2) Enhancing the compacting kernel

We also enhance the compacting kernel from our previous work [7]. Originally, the kernel was launched with one thread per block; the modified version launches num_threads threads per block, as shown in Figure 6.

(* Original version *)
kernel_reduce <<<num_samples, 1>>>(…)

(* Modified version *)
kernel_reduce <<<num_samples/num_threads, num_threads>>>(…)
Figure 6. Enhancing the compacting kernel.
This introduces another parameter: the number of threads for the compacting kernel. We adjust the number of threads for this kernel according to the problem size. For example, we use 128 threads if the sample size is less than 8388480, 512 threads if the sample size is larger than 16776960, and 256 threads if the size is in between.
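This rule can be captured in a small helper (hypothetical, for illustration). Note that the two thresholds are exactly 65535 × 128 = 8388480 and 65535 × 256 = 16776960, i.e., the points at which the grid limit forces a larger thread count.

/* Illustrative helper encoding the thresholds quoted above. */
int compact_threads(long num_samples) {
    if (num_samples < 8388480L)  return 128;
    if (num_samples > 16776960L) return 512;
    return 256;  /* sizes in between */
}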
4. EXPERIMENTS

1. Platforms
We use an NVidia GeForce GTX 580 as our GPU platform. On the CPU side, we have an Intel Core i7. The detailed specifications are shown in Table 1.
2. Datasets
Cavendish’s data [8] are used in our experiments. The data represent measurements of the specific density of the Earth. The values from the 23 experiments are: 5.36, 5.29, 5.58, 5.65, 5.57, 5.53, 5.62, 5.29, 5.44, 5.34, 5.79, 5.10, 5.27, 5.39, 5.42, 5.47, 5.63, 5.34, 5.46, 5.30, 5.78, 5.68 and 5.85. Following [9], the corresponding model is also a normal distribution, N(x; θ, 0.04). The prior is chosen to be normal with mean = 5 and variance = 0.5.
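For reference, a sketch of the per-sample likelihood under this model is given below (names and exact form are illustrative, not our actual code). Each thread evaluates a product of 23 normal densities, which is the relatively expensive step discussed in the results; the constant normalization factors are omitted since they cancel in (7).

// Sketch: unnormalized likelihood of theta given the 23 measurements,
// with model N(x; theta, 0.04), i.e. variance 0.04.
__device__ double likelihood(double theta, const double *data, int n) {
    const double var = 0.04;
    double logL = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = data[i] - theta;
        logL -= 0.5 * d * d / var;  // log of the unnormalized density
    }
    return exp(logL);  // constants cancel in (7)
}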
3. Results
Our computed posterior expectation is 5.483, which is similar to the result from [9]. We find that the computed probability falls within the 95% region of the prior (0.75 < 0.95). Thus, the hypothesis is accepted.
Figure 7 shows examples of results from our application. In addition to the answer, the running time is provided for both the CPU and GPU versions. The computational time is detailed for both the posterior expectation calculation and the hypothesis testing part. In terms of performance, we expect an improvement in the posterior calculation, since this part is parallelized with CUDA. On the other hand, we should observe similar running times for the hypothesis testing part, since there is no GPU involvement there.
(Chart: running time in seconds, log scale from 0.010 to 1000.000, versus problem size from 0 to 300,000,000; series: CPU and GPU.)
Figure 8. Running time of CPU and GPU (the whole application).
Because there are two main parts in the application, 1) posterior expectation calculation and 2) hypothesis testing, we provide the details of the running time for each part. The expectation calculation part is shown in Figure 9.
(Chart: running time in seconds, log scale from 0.001 to 10000.000, versus problem size from 0 to 300,000,000; series: CPU and GPU.)
Figure 9. Running time of CPU and GPU (for posterior expectation).
The two charts, Figures 8 and 9, reveal a similar trend: the GPU implementation is faster than the CPU one. Next, for the running time of hypothesis testing, because this portion of the code has no GPU involvement, there is no difference in timing between the CPU and GPU versions. However, it is still useful to see how this part scales with different problem sizes. According to Figure 10, the testing part scales linearly.
(Chart: running time in seconds, linear scale from 0.000 to 16.000, versus problem size from 0 to 300,000,000.)
Figure 10. Running time of the hypothesis testing part.
Next, we return to the posterior expectation calculation. It is interesting to see how each optimization strategy performs on the GPU side. Figure 11 therefore provides the detailed running times of the GPU programs with different optimizations.
(Chart: running time in seconds, from 0.000 to 25.000, versus problem size from 0 to 300,000,000; series: 1) no extra optimization, 2) enhanced compacting kernel, 3) loop unrolling, 4) optimizations (2)+(3).)
Figure 11. Effect of optimization methods in GPU programs.
However, the chart illustrates that there is not much difference in the running time of each method. We attribute this to the evaluation of a complex function, the likelihood, on the GPU side in the parallel reduction step of MCI. Although many threads work in parallel to evaluate the function, the elapsed time of this calculation within a single thread dominates the whole reduction. Because the optimization techniques, such as enhancing the compacting kernel and loop unrolling, focus on the reduction itself, their effect is hidden by the cost of the likelihood evaluation.
3.2) Scalability
We show the results after solving the scalability problem in Table 2. Notice that every block size, even a block size of 128, can now be used with all problem sizes; this was not possible in our previous work.
Table 2 shows no difference in the running time of the GPU programs across block sizes. Again, this is because most of the time is spent outside the core parallel reduction code, so the effect of different block sizes cannot be seen.
3.3) Speed-up
We calculate the speed-up of the GPU programs for both the whole program and the
portion of posterior calculation. The speed-ups are shown in Table 3.
5. CONCLUSION
We illustrate a real-world application of Bayesian probability for hypothesis testing, in which the posterior expectation is required. The implementation shows that our method accurately finds the posterior expectation. We also present an enhancement to our previous work by further optimizing our CUDA programs and handling the scalability issue. Our results show that our parallel programs outperform the CPU program, taking much less execution time. In our experiments, we show that even with small block sizes we can handle large problem sizes, which is essential because it enlarges the space of usable configurations. The maximum speed-up identified in our experiments is 53.49 times over the sequential code. Future work will focus on a full GPU implementation that also generates the random numbers on the GPU, and on the cost of function evaluation in the parallel reduction step, so that the effects of the optimizations and block sizes become visible.
REFERENCES
1. Bayes, T., and Price, R., "An Essay towards solving a Problem in the Doctrine of Chances. By the late Rev. Mr. Bayes, communicated by Mr. Price, in a letter to John Canton, M. A. and F. R. S.", Philosophical Transactions of the Royal Society of London 53, 1763, pp. 370–418.
2. Ross, S., "2.4 Expectation of a random variable". Introduction to probability models (9th ed.).
Academic Press, 2007, p. 38.
3. Tierney, L., and Kadane, J., "Accurate Approximations for Posterior Moments and Marginal
Densities," Journal of the American Statistical Association, 1986, 81, 82-86.
4. Caflisch, R., Monte Carlo and quasi-Monte Carlo methods, Acta Numerica vol. 7, Cambridge
University Press, 1998, pp. 1-49.
5. Harris, M., Mapping computational concepts to GPUs, in: M. Pharr (ed.), GPUGems 2 :
Programming Techniques for High-Performance Graphics and General-Purpose
Computation, chap. 31, Addison-Wesley, 2005, pp. 493–508.
6. NVIDIA CUDA C Programming Guide Version 3.2, 2010.
7. Pechprasarn, T., and Khiripet, N., Accelerating Bayesian Computation with Parallel Reduction using CUDA, The 4th Mahasarakham International Workshop on Artificial Intelligence (MIWAI), 2010, pp. 40–45.
8. Cavendish, H., "Experiments to Determine the Density of the Earth", in MacKenzie, A. S. (ed.), Scientific Memoirs Vol. 9: The Laws of Gravitation, American Book Co., 1900, pp. 59–105.
9. Piche, R., "Normal Data", note 2 of the Bayesian statistics course, Tampere University of Technology, 2009.