Rahul Agrawal
rahul@ece.iitkgp.ernet.in

Soumyajit Gupta
smjtgupta@gmail.com

Jayanta Mukherjee
jay@cse.iitkgp.ernet.in

Ritwik Kumar Layek
ritwik@ece.iitkgp.ernet.in
ABSTRACT
We present a GPU-based implementation of the saliency model proposed by Achanta et al. [1] to perform real-time and detailed saliency map generation. We map all the components of the algorithm to GPU-based kernels and data structures. The parallel version of the algorithm accurately reproduces the desired results in very little time. We describe the streaming pipeline and address many issues in obtaining high throughput on multi-core GPUs. We highlight the parallel performance of the algorithm on three different generations of GPUs. On a high-end NVIDIA Tesla K20m, we observe up to a 600x performance improvement compared to a single-threaded CPU-based implementation, and about a 300x improvement over a CPU-based OpenCV implementation.
Keywords
Saliency, FSRD, GPU, OpenCV, CUDA
1. INTRODUCTION
2. SALIENCY MODEL
2.1 Description
The method of calculating the saliency map S for an image I of width W and height H pixels can be formulated as:

S(x, y) = || Iμ − Iωhc(x, y) ||    (1)

where Iμ is the mean image feature vector, Iωhc(x, y) is the corresponding pixel of a Gaussian-blurred version of the image, and || · || is the L2 norm. With pixels represented in the Lab color space, each feature vector is [L a b]T and the L2 norm is the Euclidean distance between the two vectors.
2.2 Limitation
3.
4. CUDA
5.
[Figure: GPU processing pipeline. Host (CPU): allocate global memory for the input image, Lab image, output image, and the buffers used for global synchronization; copy the input image stored in host memory to the allocated device memory; set the convolution kernel in constant memory; finally, copy the output image from device memory back to host memory. Device (GPU): convert the input from uchar4 (BGRA) to float4 (RGBA); RGBA to LabA kernel; find Avg(image) using three shared-memory reduce kernels, each acting as a global synchronization point; mean-distance kernel (L2 norm from the average); the saliency map is left in device memory.]

5.1 Initialization
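As a minimal sketch of this initialization, the host might allocate the device buffers, upload the image, and place the filter coefficients in constant memory roughly as follows. The buffer names are hypothetical, the 1/16 normalization of the [1 4 6 4 1] kernel is an assumption, and error checking is omitted:

#include <cuda_runtime.h>

// 5-tap separable filter coefficients, held in constant memory.
__constant__ float c_Kernel[5];

void initDevice(const uchar4 *h_src, int W, int H, uchar4 **d_src,
                float4 **d_lab, float **d_sal, float4 **d_partial)
{
    size_t n = (size_t)W * H;

    // Global memory for the input image, Lab image, output saliency map,
    // and a scratch buffer for the partial results of the reduce kernels.
    cudaMalloc((void **)d_src,     n * sizeof(uchar4));
    cudaMalloc((void **)d_lab,     n * sizeof(float4));
    cudaMalloc((void **)d_sal,     n * sizeof(float));
    cudaMalloc((void **)d_partial, n * sizeof(float4));

    // Copy the input image stored in host memory to device memory.
    cudaMemcpy(*d_src, h_src, n * sizeof(uchar4), cudaMemcpyHostToDevice);

    // Set the convolution kernel in constant memory.
    const float h_Kernel[5] = { 1.f/16, 4.f/16, 6.f/16, 4.f/16, 1.f/16 };
    cudaMemcpyToSymbol(c_Kernel, h_Kernel, sizeof(h_Kernel));
}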
5.3 Filter
A GPU is efficient at launching a large number of threads that work together in parallel. The mapping between threads and memory is called the communication pattern. A very basic parallel version of the filter divides the image into blocks of threads (the maximum number of threads per block is 1024), where each thread corresponds to one pixel of the filtered output. Each thread gathers the corresponding pixel values from its neighbourhood in the input image, applies the filter weights, and writes the result to its output pixel, as sketched below.
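A minimal sketch of this one-thread-per-pixel gather, assuming a single-channel float image and a full 5x5 weight matrix already placed in constant memory (the name c_Kernel2D and the 2D launch shape are hypothetical):

__constant__ float c_Kernel2D[5][5];   // full 5x5 filter weights

// One thread per output pixel: gather a 5x5 neighbourhood,
// multiply by the filter weights and write one filtered value.
__global__ void naiveFilter(const float *in, float *out, int W, int H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;

    float sum = 0.0f;
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx) {
            // Clamp reads at the image border.
            int sx = min(max(x + dx, 0), W - 1);
            int sy = min(max(y + dy, 0), H - 1);
            sum += c_Kernel2D[dy + 2][dx + 2] * in[sy * W + sx];
        }
    out[y * W + x] = sum;
}

Every thread reads its whole 5x5 neighbourhood from global memory, so neighbouring threads reload the same pixels many times; the shared-memory scheme described next avoids most of this redundancy.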
Figure 5: Stencil Communication Pattern
To implement the stencil operation more efficiently, a separable 5x5 filter with elements [1 4 6 4 1] is used. A full two-dimensional convolution requires 5x5 = 25 multiplications for each output pixel. A separable filter, divided into two consecutive one-dimensional convolution operations, requires only 5+5 = 10 multiplications for each output pixel. Following the technique reported in [10], the convolution is computed separately in horizontal (row) and vertical (column) passes, with a write to global memory between the passes; each pixel is loaded at most five times. The pixels at the edge of the image depend on pixels outside the thread block, shown in yellow and called the apron region (Fig. 6). Thus each thread block must load into shared memory both the pixels to be filtered and the apron pixels. With a separable filter it is no longer necessary to load the top and bottom apron regions for the horizontal pass, and likewise the left and right apron regions for the vertical pass. This allows more pixels to be loaded for processing in each thread block.
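A sketch of the horizontal (row) pass under these assumptions is given below; the vertical pass is analogous with rows and columns exchanged. This is a simplified version of the scheme in [10], not the exact kernel; the tile size and names are illustrative, with one thread per tile pixel:

#define RADIUS     2      // 5-tap kernel
#define ROW_TILE_W 128    // pixels filtered per block in the row pass

__constant__ float c_Kernel[5];   // coefficients set during initialization

// Horizontal pass of the separable filter. Each block loads its tile
// plus the left and right apron pixels into shared memory; no
// top/bottom apron is needed for this pass.
// Launch: rowFilter<<<dim3((W + ROW_TILE_W - 1) / ROW_TILE_W, H),
//                     ROW_TILE_W>>>(in, out, W, H);
__global__ void rowFilter(const float *in, float *out, int W, int H)
{
    __shared__ float tile[ROW_TILE_W + 2 * RADIUS];

    int y     = blockIdx.y;               // one image row per block row
    int tileX = blockIdx.x * ROW_TILE_W;  // first pixel of this tile
    const float *row = in + y * W;

    // Cooperatively load tile + apron, clamping reads at the row ends.
    for (int i = threadIdx.x; i < ROW_TILE_W + 2 * RADIUS; i += blockDim.x) {
        int x = min(max(tileX + i - RADIUS, 0), W - 1);
        tile[i] = row[x];
    }
    __syncthreads();

    // Each thread convolves one output pixel from shared memory.
    int x = tileX + threadIdx.x;
    if (threadIdx.x < ROW_TILE_W && x < W) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += c_Kernel[k + RADIUS] * tile[threadIdx.x + RADIUS + k];
        out[y * W + x] = sum;
    }
}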
5.4 RGBA to LABA
Representing the image in the Lab color space was implemented in parallel by launching one thread per pixel, each of which calculates [L a b]T for the corresponding [R G B]T.
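A per-pixel sketch of this conversion, using CIE formulas in the style of [14] (linear RGB to XYZ with the D65 white point); the sRGB gamma linearization is omitted for brevity, and the component layout is assumed to be RGBA in a float4 with values in [0, 1]:

__device__ __forceinline__ float labF(float t)
{
    // Piecewise cube-root function from the CIE Lab definition.
    return (t > 0.008856f) ? cbrtf(t) : 7.787f * t + 16.0f / 116.0f;
}

// One thread per pixel: convert a float4 RGBA value to LabA.
__global__ void rgbaToLaba(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 p = in[i];

    // Linear RGB -> CIE XYZ (D65 white point).
    float X = 0.4124f * p.x + 0.3576f * p.y + 0.1805f * p.z;
    float Y = 0.2126f * p.x + 0.7152f * p.y + 0.0722f * p.z;
    float Z = 0.0193f * p.x + 0.1192f * p.y + 0.9505f * p.z;

    // XYZ -> Lab, normalizing by the D65 reference white.
    float fx = labF(X / 0.9505f);
    float fy = labF(Y);
    float fz = labF(Z / 1.0890f);

    out[i] = make_float4(116.0f * fy - 16.0f,   // L
                         500.0f * (fx - fy),    // a
                         200.0f * (fy - fz),    // b
                         p.w);                  // alpha passed through
}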
5.5 Mean

Operations such as the filter and the RGB to Lab conversion decompose into small per-pixel tasks that are independent of each other. Obtaining the mean, however, cannot be mapped into such independent parallel tasks, so the reduce algorithm is used to obtain the mean of the image. Computing the sum of n elements serially takes O(n) steps (7 steps for the n = 8 case shown in Fig. 8(a)); in parallel, we can pair the elements in groups of two and sum each pair to get intermediate results, which are again paired and added. This process continues until a single result remains, and the complexity of this parallel algorithm is of order log2(n) (3 steps).
The natural approach would be to apply the reduce algorithm to the whole image stored in global memory. But due to the limit on the maximum number of threads that can be launched per block, the limited amount of shared memory, and the absence of any global synchronization (synchronization between threads of different blocks) in CUDA, this approach does not work for large arrays (such as images, in our case). So multiple reduce kernels (three in our case) are used, each serving as a global synchronization point, as shown in Fig. 10. After each kernel launch, the partially reduced results of the blocks of the grid are obtained as an array of partial results. These partial results are again divided into a grid of blocks, and each block is reduced to give once again an array of partial results. These are reduced again in the last step to give the final value. This four-step processing, similar to the approach reported in [11], is shown in Fig. 9.
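A minimal sketch of one such reduce step follows, assuming a power-of-two block size, each thread pre-summing two elements on load, and the dynamic shared-memory launch reduceSum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n). It is shown for a scalar array; the per-channel Lab reduction is analogous with float4 accumulators:

// One step of the multi-kernel reduction: each block sums a chunk of
// the input in shared memory and writes a single partial result.
// Relaunching this kernel on the array of partial results acts as the
// global synchronization point between steps.
__global__ void reduceSum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];

    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x * 2 + tid;

    // Each thread loads (up to) two elements and adds them on the fly.
    float v = 0.0f;
    if (i < n)              v  = in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory: log2(blockDim.x) steps.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

After the third launch a single sum remains; dividing it by W x H gives the image mean.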
Figure 10: Global Synchronization

5.6

5.7 Intrinsic Functions
To maximize instruction throughput, the use of arithmetic instructions with low throughput is minimized. This includes trading precision for speed when it does not affect the end result. Instead of regular math.h functions like sqrt() and cbrt(), their single-precision CUDA counterparts sqrtf() and cbrtf() are used, which compute in single rather than double floating-point precision.
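As an illustrative contrast (hypothetical helpers, not the paper's exact code), the distance computation of Eq. (1) in both flavours:

// sqrt() on a double operand forces the slow double-precision path.
__device__ float euclidDouble(float4 d)
{
    return (float)sqrt((double)(d.x * d.x + d.y * d.y + d.z * d.z));
}

// sqrtf() keeps the whole computation in single precision.
__device__ float euclidFloat(float4 d)
{
    return sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
}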
6. PROFILING

6.1.1

6.1.2

6.1.3 SM Activity
Shows the percentage of time each multiprocessor was active during the kernel launch. A multiprocessor is considered active if at least one warp is currently assigned for execution.
6.1.4

6.1.5 Warps Launched
Shows the total number of warps launched per multiprocessor for the executed kernel grid.
6.2.1 Warps Per SM
Each warp scheduler manages a fixed, hardware-given maximum number of warps. This defines the device limit of warps per SM: the upper bound on how many warps can be resident at once on each SM. An active warp is active from the time it is scheduled on a multiprocessor until it completes its last instruction; each warp scheduler maintains its own list of assigned active warps. An eligible warp is an active warp that is able to issue its next instruction. Each warp scheduler selects the next warp to issue an instruction from the pool of eligible warps; warps that are not eligible report an issue stall reason. Theoretical occupancy acts as an upper bound on the number of warps that can be active, determined by the kernel's launch configuration and resource usage.
Function      Issued IPC  Executed IPC  IS %   SM Activity %  IPW      Warps Launched  Blocks
Filter        1.81        1.77          2.68   99.98          2459.25  4096            1024
RGBA2LabA     1.96        1.60          18.52  99.98          409.00   32768           8192
Reduce1       1.59        1.42          10.20  99.98          669.75   16384           4096
Reduce2       1.24        1.11          10.61  99.04          669.75   64              16
Reduce3       0.10        0.09          9.51   88.89          875.00   1               1
Euclid        1.76        1.57          10.91  99.99          427.00   32768           8192

Function      Active Warps per SM  Eligible Warps per SM  Occupancy
Filter        27.13                3.20                   28
RGBA2LabA     30.33                5.62                   32
Reduce1       24.69                2.81                   28
Reduce2       19.68                2.14                   28
Reduce3       1.00                 0.09                   8
Euclid        30.60                2.82                   32

Table 2: Profiling data for Issue Efficiency on NVIDIA GT 610M. "-" indicates Not Applicable.
6.2.2
6.2.3
6.3 Discussion
7. RESULTS
7.1 Implementation
The proposed parallel version of the algorithm is implemented on three different generations of GPUs, namely the NVIDIA GeForce GTS 450, NVIDIA GeForce GT 610M, and NVIDIA Tesla K20m (Table 3). For all these NVIDIA GPUs, CUDA Toolkit 5.5, OpenCV 2.4.6, and Visual Studio 2010 are used as the APIs and development environment. All tests were carried out on a standard PC (Windows 7 Ultimate 64-bit, Intel i3 CPU @ 2.3 GHz, 4 GB DDR3 RAM).
7.2 Performance
The original algorithm demonstrated very high execution times. To speed it up on the CPU, an OpenCV version of the same algorithm was implemented.
Figure 11: Top Row: Input Images, Bottom Row: Corresponding Saliency Maps.
GPU                            GTS 450  GT 610M  Tesla K20m
Number of Cores                192      48       2496
Number of SMs                  4        1        13
Shared Memory per Block (B)    49152    49152    49152
Memory Capacity (GB)           1        2        5
Memory Bus Width (bits)        128      64       320
Memory Clock Rate (MHz)        1804     900      2600
GPU Clock Rate (MHz)           1566     1250     706

Table 3: Specifications of the GPUs used for evaluation.
Resolution   FSRD - CPU        FSRD - OpenCV
             Time(s)  fps      Time(s)  fps
256×256      0.37     2.70     0.09     11.11
512×512      0.68     1.47     0.27     3.70
640×480      0.71     1.41     0.33     3.03
1024×768     1.19     0.84     0.75     1.33
1024×1024    1.42     0.70     0.87     1.15
2048×2048    3.41     0.29     2.37     0.42

Table 4: Performance evaluation data showing Execution Time and Framerate on CPU versions of FSRD.
Resolution   GTS 450   GT 610M   Tesla K20m
             Time(ms)  Time(ms)  Time(ms)
256×256      0.83      3.44      0.35
512×512      2.27      10.83     0.74
640×480      2.62      13.26     0.85
1024×768     6.41      30.87     1.81
1024×1024    8.35      40.72     2.41
2048×2048    32.54     160.88    8.89

Table 5: Performance evaluation data showing Execution Time on GPU versions of FSRD.
Resolution   NVIDIA GTS 450  NVIDIA GT 610M  NVIDIA Tesla K20m
256×256      108.44          26.16           257.17
512×512      119.06          24.96           365.23
640×480      125.97          24.88           388.27
1024×768     117.30          24.35           415.41
1024×1024    104.14          21.36           360.82
2048×2048    73.17           14.79           267.83

Table 6: Speedup of the GPU versions of FSRD over the CPU OpenCV version.
8. CONCLUSION
9. ACKNOWLEDGMENTS
10. REFERENCES
[1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, "Frequency-Tuned Salient Region Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1597-1604, 2009.
[2] L. Itti, C. Koch, and E. Niebur, "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[3] J. Harel, C. Koch, and P. Perona, "Graph-Based Visual Saliency," Advances in Neural Information Processing Systems, vol. 19, MIT Press, Cambridge, MA, pp. 545-552, 2006.
[4] S. Frintrop, "VOCUS: A Visual Attention System for Object Detection and Goal-Directed Search," Lecture Notes in Computer Science (LNCS), vol. 3899, Springer, pp. 7-31, 2006.
[5] X. Hou and L. Zhang, "Saliency Detection: A Spectral Residual Approach," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2007.
[6] V. Navalpakkam and L. Itti, "Modeling the Influence of Task on Attention," Vision Research, vol. 45, no. 2, pp. 205-231, 2005.
[7] J. Li, D. Levine, X. An, and X. Xu, "Visual Saliency Based on Scale-Space Analysis in the Frequency Domain," IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 4, pp. 996-1010, April 2013.
[8] S. Gupta, R. Agrawal, R. Layek, and J. Mukhopadhyay, "Psychovisual Saliency in Color Images," Proc. IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1-4, 2013.
[9] NVIDIA Corporation, "NVIDIA CUDA C Programming Guide," June 2011.
[10] V. Podlozhnyuk, "Image Convolution with CUDA," NVIDIA Corporation white paper, June 2008.
[11] S. Sengupta, M. Harris, and M. Garland, "Efficient Parallel Scan Algorithms for GPUs," NVIDIA Technical Report NVR-2008-003, December 2008.
[12] G. Bradski, "OpenCV Computer Vision Library," Dr. Dobb's Journal of Software Tools, 2000.
[13] T. Xu, T. Pototschnig, K. Kuhnlenz, and M. Buss, "A High-Speed Multi-GPU Implementation of Bottom-Up Attention Using CUDA," Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 41-47, 2009.
[14] G. Hoffmann, "CIE Color Space," Tech. Rep., FHO Emden, 2008.