
Diploma Thesis

High-Performance Tomographic
Reconstruction using OpenCL
submitted by
cand. inform. Sebastian Schuberth

Department of Mathematics and Computer Science


at Freie Universität Berlin

First Reviewer: Prof. Dr. Konrad Polthier, FU Berlin


Second Reviewer: Hans-Christian Hege, Director Visualization and Data Analysis, ZIB

February 2011

Konrad-Zuse-Zentrum
für Informationstechnik Berlin

Eidesstattliche Erklärung: Hiermit versichere ich, Sebastian Schuberth, dass ich die Diplomarbeit "High-Performance Tomographic Reconstruction using OpenCL" selbstständig und ohne Benutzung anderer als der angegebenen Quellen und Hilfsmittel angefertigt habe und dass alle Ausführungen, die wörtlich oder sinngemäß übernommen wurden, als solche gekennzeichnet sind, sowie dass diese Diplomarbeit noch keiner anderen Prüfungsbehörde in gleicher oder ähnlicher Form vorgelegt wurde.

Affidavit: I, Sebastian Schuberth, hereby declare that I wrote the thesis "High-Performance Tomographic Reconstruction using OpenCL" on my own and without the use of any sources and tools other than those cited, that all passages taken over verbatim or in essence are marked as such, and that this thesis has not been submitted to any other examination board in the same or a similar form.

Datum / Date

Unterschrift / Signature


Acknowledgments
This thesis is dedicated to my mother, who always let me go my own way and never lost faith in me.

Furthermore, I wish to thank Advanced Micro Devices, Inc., for supporting my research work
with two GPUs of type Radeon HD 5870.

Last but not least, thanks go to Visage Imaging GmbH for sparking my interest in tomographic reconstruction and to the Zuse Institute Berlin for providing a pleasant work environment.

Abstract
This thesis presents the implementation of a cross-platform software library for Cone Beam Computed Tomography (CBCT) reconstruction using a Filtered Backprojection (FBP) algorithm. By utilizing the newly established OpenCL API, the library works on the Windows, Linux, and Mac OS X operating systems. It runs on Graphics Processing Units (GPUs) as well as Central Processing Units (CPUs), and scales well with multiple compute units in the system. Supported by benchmarks on different hardware platforms, the specific design decisions that led to a high-performance implementation are analyzed and explained in detail, and a report on the state of current OpenCL implementations is given. As part of the thesis, a module for the Amira visualization system that wraps the library's functionality was implemented for ease of use in current research projects.


Contents

Affidavit . . . iii
Acknowledgments
Abstract . . . vii

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Physical and Mathematical Principles
  1.4 OpenCL Overview

2 Methods . . . 13
  2.1 Programming Language . . . 13
  2.2 Benchmarking Framework . . . 13
  2.3 Visualization Tool . . . 17
  2.4 Kernel Implementation . . . 21
  2.5 Choice of Hardware . . . 22
  2.6 Choice of Experiments . . . 22

3 Results . . . 23
  3.1 Work-Size Influence . . . 23
  3.2 Benchmarks . . . 44
    3.2.1 Standard Problem Sizes . . . 44
    3.2.2 Large Problem Size . . . 46
    3.2.3 RabbitCT Ranking . . . 47
  3.3 Multiple Devices . . . 49
  3.4 Kernel Compilation . . . 51
  3.5 Image Interpolation . . . 52

4 Discussion . . . 55
  4.1 Work-Size Determination . . . 55
    4.1.1 CPU Device Specifics . . . 55
    4.1.2 GPU Device Specifics . . . 57
  4.2 Performance Considerations . . . 63
  4.3 Image Quality Considerations . . . 72
  4.4 Vendor-Specific Issues . . . 73
    4.4.1 ATI Stream . . . 73
    4.4.2 NVIDIA OpenCL . . . 75
    4.4.3 Intel OpenCL . . . 77

5 Conclusion . . . 81
  5.1 Future Work . . . 83

1 Introduction
Over the last decade, many researchers have proposed to use GPUs for tomographic reconstruction using an FBP algorithm. While the early approaches used the available graphics-based APIs like OpenGL or DirectX [CCF94, M04, SBMW07], more recent publications often use the Compute Unified Device Architecture (CUDA) introduced by NVIDIA [NVI10a]. Although the geometric nature of FBP maps well to the graphics-based APIs [MXN07], the General Purpose GPU (GPGPU) APIs provide more direct access to the hardware, potentially resulting in better performance due to more explicit resource management [YLKK07, NWH+ 08, OIH10].

1.1 Motivation
In the fast-moving GPU hardware business, the vendor offering the best price / performance ratio often changes, which leads to a diverse landscape of GPUs available at both research institutions and project partners. As such, it is desirable to have GPGPU applications that are largely vendor-independent. The OpenCL API [Mun10], as managed by the Khronos Group, is a solution to this. Additionally, it is not only available for GPUs, but also for CPUs and more exotic platforms like the Cell Broadband Engine [BF10] or Field-Programmable Gate Arrays (FPGAs), which offers great portability with the same code base.

Moreover, OpenCL offers a way to handle the increasingly difficult task of programming the growing number of cores in modern processor architectures. While in the past processors got faster by basically just increasing their clock rate, processors now get faster because they can do more things in parallel, provided that the applications are designed to make use of parallelism. OpenCL's programming paradigm is a good choice to make ubiquitous


supercomputing possible on today's workstations [CH10].


The proposed library for CBCT reconstruction will enable researchers to make full use of the compute power of recent GPUs, typically cutting down the reconstruction time by orders of magnitude compared to straightforward CPU implementations. The speed-up is also the key to reconstructing volumes at resolutions that were previously impractical to reconstruct, paving the way to state-of-the-art visualizations, for example in biology research. The vast computing power of current GPUs also allows for implementing higher-quality custom projection image filtering at some cost to the gained reconstruction time. This helps to either achieve better image quality, or maintain image quality with a lower number of projection images / a lower electric current of the scanning device, lowering the dose of radiation a patient or probe is exposed to.

Another specific aim is to investigate the current state of OpenCL implementations and whether there are any performance impacts compared to CUDA, as the latter has been reported to be faster in the past [ZZS+ 09, KDH10, KSAK10].

1.2 Related Work


Some previous work exists which evaluates the use of OpenCL for scientific and / or medical problems, in particular in comparison to CUDA. To start with, [KDH10] provides a systematic comparison of CUDA and OpenCL using the example of a Monte Carlo simulation with a Mersenne Twister Pseudo-Random Number Generator (PRNG) at its heart. Instead of rewriting an OpenCL kernel from scratch, minimal changes were made to an existing CUDA kernel to make it pass through the OpenCL compiler, disregarding any OpenCL-specific optimizations. The authors mention that they did try to compile the same OpenCL kernel using ATI Stream [Adv10], but that it failed due to existing global variable declarations. As a result, they decided that evaluating the performance of OpenCL using ATI's Stream computing platform is outside the scope of their paper, something that is done intensively in this thesis. Taking a look at the OpenCL specification [Mun10] in fact reveals that OpenCL allows program scope variables but requires them to be declared in __constant address


space. So the ATI Stream compiler is correct if it refuses to compile such code. The fact that NVIDIA's OpenCL implementation accepts the invalid code leaves a certain aftertaste and also casts a different light on the paper's performance results. The paper concludes OpenCL to be between 13% and 63% slower than CUDA and recommends the latter for applications where high performance is important.
A very convincing comparison between OpenCL and CUDA is done in [KSAK10]. The paper features comparisons on the Parallel Thread eXecution (PTX) language level and inspects the code generated by both the OpenCL and CUDA compilers. It emphasizes that the five existing CUDA applications used for benchmarking have been ported as faithfully as possible to OpenCL, all of which also compile using ATI Stream. The authors note that the performance of the code generated by the OpenCL compiler with default settings, that is with no additional options, is worse than that of the code generated by the CUDA compiler by default. However, if optimizations for floating-point arithmetic that may violate the IEEE 754 standard are enabled, performance is almost on par. For ATI Stream targeting the GPU, there is no perceivable performance increase for enabled compiler optimizations, leading the authors to assume that there is room for improvement by maturing the OpenCL compiler for that platform / device combination. Both of these last statements are something this thesis can confirm, as will be shown in later chapters. In order to fully exploit the potential of GPU computation, the authors plan to explore an OpenCL kernel's work-size configuration space based on profiling. This is done as part of this thesis and thus complements the paper's findings.
For the specific FBP problem, see [ZZS+ 09, WZJZ10]. The first offers a straightforward implementation for Parallel Beam Computed Tomography (PBCT) as part of an application for drug detection in luggage. The authors admit that they did not put much effort into optimizing their OpenCL implementation and report it to be nine times slower than a CUDA implementation, but about 100 times faster than a native CPU implementation. Given these numbers, neither the OpenCL nor the CPU implementation seems to have been optimized for high performance; for example, it may be doubted that the CPU implementation takes advantage of vector instructions or multiple threads. To get a more meaningful comparison



to CPU devices, this thesis uses the same optimized OpenCL kernel on all devices.
The second aforementioned paper relates the most to this thesis in the sense that it also implements a cone beam reconstructor using OpenCL. But again, performance comparisons are only done against a conventional CPU implementation. The authors see a speed-up factor of 40 to 60 when comparing the OpenCL implementation running on the GPU to the CPU implementation. No information about OpenCL compiler options is given, which hampers the assessment of the results. In a sense this thesis continues the authors' efforts by comparing the backprojection performance to a CUDA implementation, which they left for future work.

1.3 Physical and Mathematical Principles


In contrast to classic X-ray imaging, which only measures the relative intensity $I$ of the rays after they have passed through the object being scanned, computed tomography also takes the primary intensity $I_0$ into account. This value is typically acquired by a reference scan without any object. Given that an X-ray's intensity falls off exponentially with an object's diameter $d$, see sub-figure 1.1(a), the measured intensity can be described as

$$I = I_0 \, e^{-\mu d}$$

and

$$P = \ln\left(\frac{I_0}{I}\right) = \mu d$$

is the so-called projection value. The introduced $\mu$ is a linear attenuation factor that depends on the object's material. If $d$ is known, this yields

$$\mu = \frac{1}{d} \ln\left(\frac{I_0}{I}\right)$$

Note that this is a simplification that assumes the object to consist of a single homogeneous material. In case of an inhomogeneous object, $\mu$ varies along an X-ray. Sub-figure 1.1(b) illustrates this with an object composed of $n$ homogeneous areas of diameters $d_i$ and attenuation factors $\mu_i$. In practice, $n$ grows towards infinity while the $d_i$ become infinitesimal, thus

$$I = I_0 \, e^{-d_1 \mu_1 - d_2 \mu_2 - d_3 \mu_3 - \ldots} = I_0 \, e^{-\left[\sum_{i=0}^{n} d_i \mu_i\right]} = I_0 \, e^{-\int_0^d \mu \, ds}$$

and

$$P = \ln\left(\frac{I_0}{I}\right) = \sum_{i=0}^{n} d_i \mu_i$$

Figure 1.1: X-ray attenuation for (a) a homogeneous object and (b) an inhomogeneous object (image courtesy W. Kalender).

Although the individual $\mu_i$ cannot be directly determined, as only their weighted sum as part of $P$ is known, Johann Radon already showed back in 1917 that any density distribution in a plane can be calculated from an infinite number of different line integrals in that plane [Rad17]. As a material's density is proportional to its attenuation factor, a discretized version of the so-called (inverse) Radon Transform can be used to approximate the attenuation factors $\mu_i$ for a finite number of measurements. FBP is such a discretization of the inverse Radon Transform. It was not until [Cor63] that this transform was reinvented by Nobel Prize laureate Allan Cormack and put to use in parallel beam tomography. Shortly afterwards, the British engineer Godfrey N. Hounsfield announced the first X-ray scanning system [Hou73]. He also offered a solution to the problem that $\mu$ not only depends on the object's material but also on the X-ray energy, which makes it hard to compare different scans of the same


object. Thus he defined a relative value

$$\mu_{\mathrm{rel}} = \frac{\mu - \mu_{\mathrm{water}}}{\mu_{\mathrm{water}} - \mu_{\mathrm{air}}} \cdot 1000$$

where $\mu_{\mathrm{water}}$ is the measured attenuation for water and $\mu_{\mathrm{air}}$ is the measured attenuation for air at a fixed X-ray energy. In honor of Hounsfield, the unit for $\mu_{\mathrm{rel}}$ is called Hounsfield Unit (HU). The formula maps materials to distinct HU values: By definition, air is at -1000 HU and water is at 0 HU, while muscle is at about 40 HU and bone at 400 HU and above. In computer science, usually the range from -1024 HU to 3071 HU is mapped to 12 bits of storage and often displayed as 0 HU to 4095 HU.
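For illustration, the mapping from a measured attenuation value to an HU value and on to 12-bit storage can be written as the following small sketch; the function names are hypothetical and not part of the reconstruction library.

// Convert a measured linear attenuation factor mu to Hounsfield Units, given
// the attenuation factors of water and air at the same X-ray energy.
inline double attenuationToHU(double mu, double muWater, double muAir)
{
    return (mu - muWater) / (muWater - muAir) * 1000.0;
}

// Map the usual HU range [-1024, 3071] to 12 bits of storage [0, 4095].
inline int huToStored12Bit(double hu)
{
    double shifted = hu + 1024.0;
    if (shifted < 0.0)    shifted = 0.0;
    if (shifted > 4095.0) shifted = 4095.0;
    return static_cast<int>(shifted);
}

With these definitions, water (0 HU) is stored as 1024 and air (-1000 HU) as 24.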
Since then, parallel beam tomography has first been extended to fan beam tomography [HN77, Hor79] and later to cone beam tomography [Tuy83, FDK84]. The last two papers are particularly important: The first introduces Tuy's Condition, which states that for complete cone beam data, any plane which intersects the object must also intersect the scan trajectory. The second introduces a formula for direct reconstruction of a 3D density function from a set of 2D projection images, which became known as the Feldkamp-Davis-Kress (FDK) algorithm. This is the algorithm used in the implementation which has been developed as part of this thesis. It is outlined in algorithm 1.1 and basically works by filtering the projection images row by row, and then accumulating, for each output voxel, the projection values of the pixels onto which it projects.
For further details about the derivation of cone beam tomography from fan and parallel
beam tomography see [Tur01]. A nice introduction to the topic of computed tomography is
also given in [Qui06].

1.4 OpenCL Overview


Based on its experience with the Core Image and Core Video frameworks, which use OpenGL1 on the GPU and SIMD instructions on the CPU to accelerate various operations, Apple Inc. initiated and managed the work on OpenCL, later in cooperation with the NVIDIA

1 http://www.opengl.org/


Algorithm 1.1 Pseudo-code for the FDK algorithm.

set $d$ to the source trajectory radius
set $f_u$ to a Ramp filter
{Preprocessing}
for each vertical detector position $v$ do
  for each horizontal detector position $u$ do
    compute angle weighting factor $w_a(u, v) = d / \sqrt{d^2 + u^2 + v^2}$
  end for
end for
for each projection angle $\theta$ do
  {Weighting}
  for each vertical detector position $v$ do
    for each horizontal detector position $u$ do
      calculate weighted projection value $p_w(u, v) = p_\theta(u, v) \cdot w_a(u, v)$
    end for
  end for
  {Filtering}
  for each vertical detector position $v$ do
    convolve the current row $r_f(v) = r_w(v) * f_u$
  end for
end for
{Backprojection}
clear voxels in the output volume
for each projection angle $\theta$ do
  for each output voxel position $(x, y, z)$ do
    project world space $(x, y, z)$ into image space $(u, v)$
    look up interpolated projection value $p_i = p_f(u, v)$
    compute distance weighting factor $w_d(x, y, \theta) = d + x \cos(\theta) + y \sin(\theta)$
    add $d^2 / w_d(x, y, \theta)^2 \cdot p_i$ to the current output voxel value
  end for
end for


Corporation. Since the submission of the OpenCL specification to the Khronos group2 in 2008, it has been managed as an open standard.

In order to provide a common interface for heterogeneous compute resources, OpenCL abstracts the underlying hardware as follows:

- The host is the system the OpenCL runtime executes on. In the common case, this is a PC running, for example, a Linux or Windows operating system.
- A platform encapsulates all OpenCL resources provided by a specific vendor. Installable Client Devices (ICDs)3 allow multiple platforms to be installed in parallel on a single host.
- Platforms provide one or more devices that map to a specific piece of hardware in the system that is able to perform parallel computations. Typically, a device is a GPU or a multi-core CPU. Note that in the latter case the CPU is both the host and the device.
- Devices are composed of compute units, representing the coarsest partitioning of a device's compute power. A compute unit usually maps to a Streaming Multiprocessor (SM) for NVIDIA GPUs or to a SIMD Engine for ATI GPUs, or to a processor core / Hyper-Threading Technology (HTT) hardware thread for CPUs.
- Processing elements are a compute device's building blocks that model a virtual scalar processor.
In OpenCL, the main work to solve a specific problem is done by the kernel, a function
that serves as an entry point in a compute program. A running instance of a kernel is
called a work-item, which is executed by one or more processing elements as part of a work-group. While a single work-group executes on a single compute unit, in the case of multiple compute units multiple work-groups execute in parallel. Likewise, a work-group's work-items
are scheduled to run in parallel on one or more processing elements. A context consists of
one or more devices and provides the environment within which a kernel executes.
2 http://www.khronos.org/
3 http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt
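As a small illustration of this platform model, the following sketch uses the official C++ wrapper to enumerate the installed platforms and their devices and to create a context and command queue for each platform's devices. It is not part of the reconstruction library, and error handling is omitted for brevity.

#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Enumerate all installed OpenCL platforms (e.g. ATI Stream, NVIDIA OpenCL).
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    for (size_t i = 0; i < platforms.size(); ++i) {
        std::cout << "Platform: " << platforms[i].getInfo<CL_PLATFORM_NAME>() << std::endl;

        // Each platform exposes one or more devices (GPUs and / or CPUs).
        std::vector<cl::Device> devices;
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices);

        for (size_t j = 0; j < devices.size(); ++j) {
            std::cout << "  Device: " << devices[j].getInfo<CL_DEVICE_NAME>()
                      << ", compute units: "
                      << devices[j].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>() << std::endl;
        }

        // A context groups one or more devices and provides the environment
        // within which kernels execute; a command queue targets one device.
        if (!devices.empty()) {
            cl::Context context(devices);
            cl::CommandQueue queue(context, devices[0]);
        }
    }

    return 0;
}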


Each of the aforementioned entities has access to some dedicated memory in OpenCL's memory model. Per work-item memory is called private memory, which is very fast and can be used without the need for synchronization primitives. It is similar to registers in a GPU or CPU device. Multiple work-items that are part of the same work-group share local memory, which is usually located on-chip and can be used to enable coalesced accesses to global memory, to which all work-groups running on the compute units of a device have access. Global memory generally is the largest-capacity memory subsystem on the compute device, but also the slowest. Finally, constant memory denotes a read-only section of memory.

Note that there is no implicit memory management. A kernel has no direct access to host memory; it is necessary to copy data from the host to global memory, possibly on to local memory, and back. To sum up, figure 1.2 shows graphical representations of OpenCL's platform and memory models.

Figure 1.2: The OpenCL platform (a) and memory (b) models (image courtesy B. König).
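To illustrate these address spaces, a trivial OpenCL C kernel using all four qualifiers might look like the following sketch; it is unrelated to the backprojection kernels developed in this thesis and only shows the syntax.

// Trivial example kernel: scale the input by a constant factor, staging data
// in local memory. Illustrates the four OpenCL address spaces only.
__kernel void scale(__global const float* input,   // global memory: visible to all work-groups
                    __global float* output,
                    __constant float* factor,      // constant memory: read-only
                    __local float* tile)           // local memory: shared within a work-group
{
    size_t gid = get_global_id(0);    // per work-item values live in private memory
    size_t lid = get_local_id(0);

    tile[lid] = input[gid];           // stage the value in on-chip local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // synchronize the work-group before reuse

    output[gid] = tile[lid] * factor[0];
}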

When developing an OpenCL application, the programmer is a wanderer between two worlds: the runtime C language API on the one hand and the C99-based computing language on the other hand. The first is comprised of host functions that manage the OpenCL state in the form of memory and program objects, command queues and kernel executions. Memory objects can either be buffers that store a linear collection of bytes, or images that store a 2D or 3D structured array. A program object encapsulates a compute program's source or binary along with several pieces of build-related information. Command queues hold commands that


will be executed on a specific device, either in an in-order or out-of-order fashion.


As the second world offers a language that is similar to GPU shading languages like the OpenGL Shading Language (GLSL), and even more similar to NVIDIA CUDA, most OpenCL tool chains share their back-ends with pieces of the existing driver architecture. See figure 1.3 for how OpenCL language support just required adding a different compiler front-end to NVIDIA's architecture. Note that in contrast to CUDA, the OpenCL compiler is not an external program that needs to be deployed to the developer's machine and run as part of a custom build step. Instead, it is an integral part of the OpenCL driver, just like the GLSL compiler is part of an OpenGL driver.

Figure 1.3: OpenCL versus CUDA compile process on NVIDIA devices (image courtesy S. Strobl).

Due to the similar syntax to CUDA, tools like Swan4 have surfaced which aid the reversible conversion of existing CUDA code bases to OpenCL, allowing one to quickly evaluate the OpenCL performance of existing CUDA code or to maintain a dual-target OpenCL and CUDA code base. But other vendors' OpenCL implementations build upon proven technologies as well. For example, ATI Stream uses the Edison Design Group C++ front-end to create an Intermediate Representation (IR) for the Low-Level Virtual Machine (LLVM) project's linker. If targeting a GPU, specific optimizations are made and Intermediate Language (IL) code for ATI's Compute Abstraction Layer (CAL) is generated, which in turn is compiled to a GPU binary. If the target is a CPU, the LLVM project's x86 back-end is used to generate assembler code that can be passed to the as / ld tools. The Intel OpenCL tool chain uses LLVM's Clang as a front-end instead, and its back-end relies on the Threading Building Blocks (TBB) library. However, compilers

4 http://www.multiscalelab.org/swan/


for parallel computing hardware are continuously evolving, and it will be interesting to see whether ideas like those proposed in [YXKZ10] are going to find their way into commercial-quality products.

For further reading about OpenCL see AMD's excellent OpenCL University Kit5, which contains a set of materials for teaching a full-semester course in OpenCL programming. Other good sources are Marcus Bannerman's slides on Supercomputing on Graphics Cards6 and Rob Farber's series of articles at The Code Project7,8,9.

5 http://developer.amd.com/zones/openclzone/universities/Pages/default.aspx
6 http://www.mss.cbi.uni-erlangen.de/?p1=lecturefeed&id=29
7 http://www.codeproject.com/KB/showcase/Portable-Parallelism.aspx
8 http://www.codeproject.com/KB/showcase/Memory-Spaces.aspx
9 http://www.codeproject.com/KB/showcase/Work-Groups-Sync.aspx


2 Methods
To provide some justification for the employed methods and tools, this chapter gives an
overview of the specific design decisions for the implementation of the reconstruction library.

2.1 Programming Language


OpenCL at its heart is a C language API, which comes with an official C++ wrapper (consisting of a single header file). Although bindings for quite a few different languages like Java, C#, Python, Ruby, Scheme and probably more already exist, the choice was to use C++ as the primary programming language because the native C API is the most mature one and by definition feature-complete. Moreover, the official C++ wrapper provides the typical advantages of object orientation like encapsulation and freeing of resources on scope exit. With clutil1, an alternate unofficial C++ wrapper exists that makes use of advanced features introduced by C++0x, but the cross-platform requirement and portability constraints forbid the use of such recent compiler features.

The decision in favor of C++ also simplifies the library's integration into other applications. Ever since the downfall of Fortran in the scientific community, C++ has gained many supporters, especially in the field of high-performance computing.

2.2 Benchmarking Framework


Doing fair benchmarks is a difficult task because many parameters have an influence on
the results. For filtered backprojection, this includes questions about which steps of the
1 http://code.google.com/p/clutil/


backprojection pipeline are to be considered, the number and resolution of the projection
images, the output volume resolution, the amount of knowledge the implementation has
regarding the nature of the data, the internal data structures and memory management, and
much more. Fortunately, the RabbitCT [RKHH09] framework provides the means to level the playing field. It defines a simple C language API consisting of four functions to load the
algorithm (RCTLoadAlgorithm), backproject a single image (RCTAlgorithmBackprojection),
finish the backprojection (RCTFinishAlgorithm), and unload the algorithm (RCTUnloadAlgorithm). These functions need to be implemented by a shared object / dynamic link library,
which is dynamically loaded by the so-called runner. The runner is supplied as a binary
command line application for different platforms and expects the following arguments:
Usage: RabbitCTRunner.exe [Algorithm Library] [Dataset File] [Result File] [ProblemSize: 128|256|512|1024]

The Algorithm Library is the path to the binary that implements the RabbitCT API
functions, Dataset File is the path to the data file containing the projection images and
reference reconstructions (this is available at the RabbitCT project page2 ), Result File is the
path to the result file to be written that will contain the reconstruction statistics (in the
same directory, the output volume file will be written with the same base name and a .vol
extension), and finally ProblemSize is one of the four supported problem sizes.
After registering as a participant, the reasonably small Result File can be uploaded via the
Algorithm User Area to the RabbitCT project page. After manual approval, it will be listed
on the public ranking page.
Each of the API functions accepts only a single argument, a pointer to the structure shown
in listing 2.1. Some structure members require a few explanatory words as the official RabbitCT documentation is rather sparse: The projection matrix A_n takes an output volume's
voxel coordinates in 3D world space (with the output volume being centered around the
origin) and converts it to 2D projection image space coordinates of the pixel it projects onto.
But more often than not, the projection geometry is given in terms of source / emitter and
detector positions. In order to calculate A_n from those quantities, one has to
2 http://www.rabbitct.com/


1. calculate the perspective 4x4 projection matrix M from the source onto the virtual
detector which is centered around the origin,
2. pre- / post-multiply M with translation matrices that account for translating the virtual
detector to the position of the physical detector,
3. multiply M with another translation matrix that maps to the upper left projection
image corner to which the detector coordinate system's basis B is attached,
4. and finally multiply M with the basis transformation matrix from world coordinates to
B.
The resulting 4x4 matrix M can be simplified to a 3x4 matrix A by omitting the third row, which is responsible for calculating the resulting vector's z-component. In detector / projection image space z is 0 by definition, which is why that row can be removed. A in turn is the sort of matrix the RabbitCT framework expects to be pointed to by A_n.
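Written out, this last simplification step keeps all rows of $M$ except the third one (the indices only indicate which entries are kept):

$$
M = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \\ m_{41} & m_{42} & m_{43} & m_{44} \end{pmatrix}
\qquad\Longrightarrow\qquad
A = \begin{pmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{41} & m_{42} & m_{43} & m_{44} \end{pmatrix}
$$

In the usual homogeneous-coordinate interpretation, a voxel position $(x, y, z, 1)^T$ is mapped by $A$ to $(u w, v w, w)^T$, and dividing by the third component yields the projection image coordinates $(u, v)$.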

/// \brief RabbitCT global data structure.
///
/// This is the main structure describing the relevant dataset descriptions.
/// The notation is adapted to the Medical Physics Technical Note.
struct RabbitCtGlobalData
{
    ///@{ Relevant data for the backprojection.
    unsigned int L;                  ///< problem size L in {128, 256, 512, 1024}
    unsigned int S_x;                ///< projection image width
    unsigned int S_y;                ///< projection image height (detector rows)
    double*      A_n;                ///< 3x4 projection matrix
    float*       I_n;                ///< projection image buffer
    float        R_L;                ///< isotropic voxel size
    float        O_L;                ///< position of the 0-index in the world coordinate system
    float*       f_L;                ///< pointer to where the result volume should be stored
    ///@}

    ///@{ Relevant data for projection image memory management. Only required for advanced usage.
    unsigned int adv_numProjBuffers; ///< number of projection buffers in RAM
    float**      adv_pProjBuffers;   ///< projection image buffers
    ///@}
};

Listing 2.1: The RabbitCT global data structure.

As the comment for R_L states, for simplicity only isotropic voxel sizes are supported. This is also why L and O_L only have one instead of three components. While this certainly is a limitation that would hinder the framework's use in practice, it is reasonable for the purpose of benchmarking, as fewer parameters lead to clearer results.


An unintentionally well-hidden feature is that the f_L and adv_pProjBuffers members support being written to. On the call to RCTLoadAlgorithm the adv_pProjBuffers pointer is NULL. If
it is non-NULL at return, the runner will not allocate any memory for the projection images but
use the given memory pointer instead. A use case for this is when special devices like GPUs
are used for reconstruction: As the runner does not know anything about the implementation,
it allocates standard host memory by default. But for GPUs allocating page-locked memory
for fast Direct Memory Access (DMA) transfers is beneficial. So if this is supported by the
GPU, the implementation may decide to allocate page-locked memory during load time and
set adv_pProjBuffers accordingly. The same mechanism applies also to the f_L pointer.
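As an illustration of how an implementation might obtain such page-locked memory through the OpenCL API, the following sketch creates a buffer with CL_MEM_ALLOC_HOST_PTR and maps it to get a host pointer that could then be assigned to one of the adv_pProjBuffers entries. The function name and the surrounding setup (context, queue) are illustrative assumptions, not the library's actual code.

#include <CL/cl.hpp>

// Sketch: allocate a page-locked projection image buffer and return the mapped
// host pointer, which an implementation could assign to adv_pProjBuffers[i].
float* allocatePinnedProjectionBuffer(cl::Context& context, cl::CommandQueue& queue,
                                      unsigned int S_x, unsigned int S_y,
                                      cl::Buffer& pinnedOut)
{
    size_t projBytes = size_t(S_x) * S_y * sizeof(float);

    // CL_MEM_ALLOC_HOST_PTR asks the OpenCL runtime for host-accessible,
    // typically page-locked memory suitable for fast DMA transfers.
    pinnedOut = cl::Buffer(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, projBytes);

    // Mapping the buffer yields a host pointer into that allocation.
    return static_cast<float*>(
        queue.enqueueMapBuffer(pinnedOut, CL_TRUE, CL_MAP_WRITE, 0, projBytes));
}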
So to sum up, a total reconstruction process as coordinated by the runner looks like this:
1. The Algorithm Library is loaded and the API function pointers are initialized.
2. RCTLoadAlgorithm is called, A_n and I_n are still NULL to prevent any cheating by
performing backprojection already during load time.
3. If either f_L or adv_pProjBuffers is NULL, host memory is allocated accordingly.
4. Up to adv_numProjBuffers projection images and matrices are loaded into memory.
5. For all resident projections, RCTAlgorithmBackprojection is called; the run time of
each call is stored.
6. The last two steps are repeated until all projections are processed.
7. RCTFinishAlgorithm is called where the implementation for example copies the output
volume to host memory.
8. RCTUnloadAlgorithm is called for resource clean-up.
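A minimal skeleton of an Algorithm Library that follows this sequence might look as follows; the return types, the export declarations and the header name are assumptions based on the description above, and the actual OpenCL work is only indicated by comments.

#include "rabbitct.h"  // assumed header declaring RabbitCtGlobalData

extern "C" {

// Called once; I_n and A_n are still NULL here, so only set up OpenCL state.
bool RCTLoadAlgorithm(RabbitCtGlobalData* data)
{
    // create context and command queue, build the backprojection kernel,
    // allocate the output volume buffer of size L^3 ...
    return true;
}

// Called once per projection image; this is the timed function.
bool RCTAlgorithmBackprojection(RabbitCtGlobalData* data)
{
    // upload data->I_n, pass data->A_n as kernel arguments, enqueue the kernel
    return true;
}

// Called after the last projection; copy the volume back to data->f_L.
bool RCTFinishAlgorithm(RabbitCtGlobalData* data)
{
    return true;
}

// Called last; release all OpenCL resources.
bool RCTUnloadAlgorithm(RabbitCtGlobalData* data)
{
    return true;
}

}  // extern "C"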
Note that the RabbitCT framework of course is not immune to cheating attempts. For
example, there are no technical means to prevent participants from performing reconstruction
work during the call to RCTFinishAlgorithm outside the timing. But the approval step which


validates the statistics before publishing the results will most likely uncover most cheating attempts.
Besides the API definition the RabbitCT framework more importantly also defines the
input data and a reference output data set for each problem size. See figure 2.1 for some
projection image examples. The statistics that are calculated against the reference data
include the mean square error, error histogram, peak signal-to-noise ratio, as well as the
mean and total reconstruction times. All these are encoded into the Result File which is
uploaded to the RabbitCT project.

Figure 2.1: RabbitCT pre-filtered projection images from different angles: (a) 0 degrees, (b) 90 degrees.

The results of all submissions are then grouped by problem size and ranked by mean
backprojection time on the RabbitCT project page.

2.3 Visualization Tool


While the RabbitCT framework is convenient for performing benchmarks, it is less suitable
for modifying various backprojection parameters to quickly evaluate their impact on performance and image quality. For example, each runner execution loads the projection data
and writes the output volume, vastly increasing the turn-around times when tuning parameters. Therefore several modules for the Amira [SWH05] visualization system were developed.
Amira is a data-flow oriented application which provides the means for visual programming


and the development of powerful networks for analyzing volume data. Figure 2.2 gives an
overview of Amira's Graphical User Interface (GUI).

Figure 2.2: The Amira GUI layout (image courtesy R. Brandt).

The (Object) Pool at the upper right contains widgets representing the modules and data objects that the network is composed of. A widget's color indicates the type of object it represents: Data objects are green, compute modules are red, display modules are yellow, etc. Objects are interconnected using straight lines that mirror the data flow. Below that, the settings of all currently selected objects are shown in the Properties panel. Any object-specific information is displayed here and settings can be adjusted using so-called ports. In the lower left the Console is visible, which displays all kinds of status information, warnings and errors. It also integrates a browser for the help system. The 3D Viewer above the console takes by far the most space. This is where the display modules render their output. Both the console and the pool / properties panels can be collapsed in order to make even more room for the 3D viewer.
Several Amira modules were implemented as part of this thesis. First of all, a generic
OpenCL information module was implemented that complements Amira's built-in system
and OpenGL information dialog. The OpenCL module lists the capabilities of all OpenCL
platforms and devices installed in the system. The list can be exported to various spreadsheet


formats for convenience.


Then, file readers for the custom RabbitCT data format (.rctd) and RabbitCT result
format (no file extension) were implemented. Using the first, the projection images and reference reconstructions can be loaded and visualized in Amira. With the second module, the
information encoded in previously saved result files can be dumped to the Amira console. Finally, a module was developed that basically is a GUI version of the original RabbitCT runner. It makes calls to the API functions like the runner does, but also exposes several OpenCL-related settings and implementation-specific parameters via the properties panel, see figure 2.3.
Figure 2.3: The RabbitCTRunner module's available properties.

The first two Platform and Device ports choose the OpenCL platform and device to use, whose Work size limits are displayed in the port below. The OpenCL kernel is loaded from an external text file which can be opened in the system's default text editor by pressing the button in the File port. Whenever the file is modified and saved in the external editor, the kernel is automatically rebuilt. Any compiler output is dumped to the Amira console. The Kernel port lists all kernels available in the text file and selects the one to be used for backprojection. After the build process some kernel-dependent information is available, which is displayed in the Kernel info port. For example, due to resource / register usage, the maximum work-group size supported for this kernel might be smaller than the theoretical hardware limit. The Work size port specifies the X-, Y- and Z-dimension of the local work-size to use. For convenience, the product of all dimensions is calculated and also displayed. The Options1 settings refer to host code paths: They choose whether to respect the CL_DEVICE_MAX_MEM_ALLOC_SIZE limit, or to ignore it and simply try whether memory allocation fails. Another path allows trying to use page-locked memory for the projection image data. Settings that affect kernel compilation are listed under Options2:


During the build process, the binary can be dumped to the Amira console, support for
image objects can be disabled and nearest-neighbor interpolation can be enabled. The build
process itself can be influenced by the Compiler options specified in the port below. The
output can be controlled by the Isotropic volume bounding box and - resolution ports. For
visual debugging and better understanding, parts of the projection geometry can be displayed
in the 3D viewer. For example, the Volume BBox and Projected BBox in image space can
be shown for the projection number as set in the port below. Finally, the total scan angle
can be adjusted and the estimated angle for the current projection number is shown in the
Geometry scan angle port. Take a look at figure 2.4 to see the described module in action
together with some volume rendering of the reconstructed volume.

Figure 2.4: Network for a RabbitCTRunner reconstruction with volume rendering.


2.4 Kernel Implementation


The first version of the backprojection kernel was a straightforward implementation of the
formula as given in the Reconstruction algorithm section of [RKHH09]. Where necessary,
arguments like R_L and O_L are splat to vectors before passing them as arguments to
the kernel in order to make use of SIMD instructions inside the kernel. The projection
matrix rows are passed as float4-vectors which makes it easy to use them with dot()
for coordinate transformation. Next, a single image object which is fed with new data in
between backprojection calls is created and passed to the kernel. Note that in OpenCL, like
in OpenGL, coordinate (0.0,0.0) does not sample the first image element at its center, but
at its upper left corner. This means that compared to native CPU code adding an offset of
0.5 in both U- and V-direction is necessary to sample the image element at its center. The
kernel's last argument is a pointer to a buffer object stored in global memory which contains
the output volume.
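To illustrate this structure, the following heavily simplified kernel sketch performs the per-voxel work as described: the matrix rows are used with dot(), the homogeneous coordinate provides the distance weighting, and 0.5 is added to sample image elements at their centers. It omits the splatted vector arguments, the multi-buffer variants and all optimizations of the actual kernels, and should only be read as a sketch with hypothetical names.

// Simplified backprojection sketch (one voxel per work-item); not the
// optimized kernel used for the benchmarks in this thesis.
__kernel void backproject(__read_only image2d_t projection,
                          __global float* volume,
                          float4 row0, float4 row1, float4 row2,  // rows of the 3x4 matrix A_n
                          float voxelSize, float origin, uint L)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

    size_t x = get_global_id(0), y = get_global_id(1), z = get_global_id(2);

    // World space position of the voxel (isotropic spacing, volume centered at the origin).
    float4 pos = (float4)(origin + x * voxelSize,
                          origin + y * voxelSize,
                          origin + z * voxelSize, 1.0f);

    // Project into homogeneous image coordinates and dehomogenize.
    float w = dot(row2, pos);
    float u = dot(row0, pos) / w;
    float v = dot(row1, pos) / w;

    // Add 0.5 to sample the image element at its center, then accumulate.
    float value = read_imagef(projection, sampler, (float2)(u + 0.5f, v + 0.5f)).x;
    volume[(z * L + y) * L + x] += value / (w * w);
}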
After verifying that this first implementation generated reasonable results compared to the
reference reconstruction, some minor optimizations were made. For example, the output is
now only modified if the value to accumulate is non-zero, and suitable mathematical expressions were replaced by their equivalent native_*() function calls. Another optimization
was to not clear the output volume on the host before uploading it to the device and running
the standard backprojection kernel, but to use a special backprojection kernel for the first
pass which simply overwrites the output instead. This way, neither the output needs to be
uploaded nor cleared.
Finally, more complex modifications were necessary to both the kernel and host code in
order to handle output volumes that exceed the maximum memory object allocation size
and / or the global device memory size. The first case was quite easy to handle as it just
requires the output volume to be split into multiple buffer objects. Two kernel variants were
developed that can handle two or four buffer objects, respectively. Taking two buffer objects as an example, a single kernel call writes to both buffer objects, which represent the upper and
lower output volume halves. The second case was harder as it requires multiple reconstruction


passes to swap out the already reconstructed parts and reconstruct the remaining ones. This
was particularly tricky to accomplish with the RabbitCT API, as it is not designed for multiple
passes. But in the end, it was successfully implemented using some wrapper functions that
store several settings across calls and replace the original API functions to call them multiple
times if required.
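The host-side decision of how many buffer objects are required can be sketched as follows; this is illustrative only, and the real implementation additionally handles the case where the volume exceeds the total global memory by using multiple reconstruction passes.

#include <CL/cl.hpp>

// Sketch: how many buffer objects are needed so that each chunk of the L^3
// output volume respects the device's maximum allocation size.
int requiredVolumeBuffers(const cl::Device& device, unsigned int L)
{
    cl_ulong maxAlloc = device.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();
    cl_ulong volumeBytes = cl_ulong(L) * L * L * sizeof(float);

    int numBuffers = 1;
    while (volumeBytes / numBuffers > maxAlloc && numBuffers < 4) {
        numBuffers *= 2;  // kernel variants exist for two and four buffer objects
    }
    return numBuffers;
}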

2.5 Choice of Hardware


For practice-oriented results, the choice was to use hardware that is pretty much regarded
as standard nowadays for a well-equipped scientific workstation. The CPU should provide
several hardware cores, possibly in addition to so-called hardware threads. An existing workstation with an Intel Core i7 CPU fulfilled those criteria. As this CPU is able to run OpenCL CPU implementations by different vendors, benchmarking a different vendor's CPU was not regarded as strictly necessary for meaningful results, but additional tests were still made on an AMD Opteron 6174 multi-processor system.

Regarding the GPUs, two devices by the major competing vendors were used. The choice was to use an existing NVIDIA GTX 260 and an ATI Radeon HD 5870. Note that the former is from a previous generation compared to the latter: The GTX 260 roughly matches an HD 4870
according to most game-oriented benchmarks. But in the past it has turned out that driver
and compiler quality can have a huge impact on performance, not necessarily mirroring the
theoretical hardware performance.

2.6 Choice of Experiments


The problem of FBP itself is well suited for benchmarking the performance of parallel computing devices because it requires high bandwidth for streaming the projection image data and also is massively data-parallel for computing the individual output volume elements. The experiments are chosen such that at first several parameters that generally determine a device's peak performance are identified, and subsequently the problem sizes and some implementation details are altered to examine the devices' response to that.


3 Results
In this chapter, benchmark results of the implementation are presented. Unless noted otherwise, the benchmarks were performed on a PC with an Intel Core i7 920 CPU running at
2.67 GHz and 12 GB of DDR3 RAM under Windows Vista x64 SP2. Timings were measured
using the RabbitCT runner for Windows 64-bit1 , except for the Intel OpenCL platform which
currently is available for 32-bit only and thus the 32-bit version of the runner2 was used.
For all benchmarks, the -cl-mad-enable and -cl-fast-relaxed-math OpenCL compiler
options were enabled in order to benefit from automatic device-specific optimizations.
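With the C++ wrapper, enabling these options simply amounts to passing them to the program build step, for example as in this small sketch (program and device list assumed to exist already):

#include <CL/cl.hpp>
#include <vector>

// Sketch: build a program with the fast-math options used for all benchmarks.
void buildWithFastMath(cl::Program& program, const std::vector<cl::Device>& devices)
{
    program.build(devices, "-cl-mad-enable -cl-fast-relaxed-math");
}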

3.1 Work-Size Influence


In order to reduce the number of parameters that can be tweaked for best performance, the
first series of benchmarks tries to determine the optimal work-size per platform and problem
size. Consequently, the following benchmarks were only performed for that optimal work-size. This way, the number of benchmarks is limited to the most relevant ones, simplifying
the extraction of conclusive data.

For each platform, there are three figures in this section. The first one shows how a given
work-size in the X, Y and Z dimensions relates to the achieved performance for a problem size of 256³. The second one shows the same correlation for a problem size of 512³; larger problem sizes were not benchmarked to keep the duration of a benchmark within reasonable limits.
1 http://www5.informatik.uni-erlangen.de/fileadmin/Forschung/Software/RabbitCT/download/RabbitCTRunner-win64.zip
2 http://www5.informatik.uni-erlangen.de/fileadmin/Forschung/Software/RabbitCT/download/RabbitCTRunner-win32.zip


A triple of dots, one from the X, Y and Z work-size charts each, corresponds to a work-size
configuration that was used in the benchmark. Note that an increase of the work-size in one
dimension usually goes hand in hand with a decrease in at least one of the other dimensions
and vice versa, as there is a device-specific maximum number of work-items supported per
work-group. The only exceptions are work-size configurations whose number of work-items
is below that maximum.
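The relevant limits can be queried from the OpenCL runtime; the following sketch computes the upper bound on the number of work-items per work-group for a given kernel and device (the per-dimension limits are available via CL_DEVICE_MAX_WORK_ITEM_SIZES in the same way). The function name is illustrative only.

#include <CL/cl.hpp>
#include <algorithm>

// Sketch: upper bound for the number of work-items per work-group of a given
// kernel on a given device.
size_t maxWorkItemsPerGroup(const cl::Device& device, const cl::Kernel& kernel)
{
    size_t deviceLimit = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();

    // The kernel may support fewer work-items than the device due to its
    // resource and register usage.
    size_t kernelLimit = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);

    return std::min(deviceLimit, kernelLimit);
}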
As an example of how to read the charts in order to find the work-size configuration that
resulted in the best performance, proceed as follows: In each chart, find the dot that is
closest to the horizontal axis. These are the dots for the backprojection that took the least
time. Finding the corresponding value on the horizontal axis then leads to the work-size
being used for the respective dimension. In general, starting with any time on the vertical
axis and finding the dot that lies on the horizontal line for that time leads to the work-size
for the backprojection that took that long. The best work-size configuration is marked with
a cross in the charts.
Besides finding the best work-size configuration, the charts also reveal other properties
of interest: In a single chart, finding the dots for all work-sizes (that is, in each column) which
are closest to the horizontal axis and drawing an approximating trend line for them yields
a lower bound for the backprojection time for varying work-sizes in the current dimension.
In other words, one can quickly estimate whether increasing the work-size in that dimension
would benefit performance or not.
Finally, vertical scattering of dots is of interest. The more dots agglomerate for a specific
chart and dimension, the less influence modifying the work-size in the other dimensions has on the performance.

The third figure for a platform in this section reveals the influence of the total number of
work-items (the product of the work-sizes for all dimensions) on the performance for different
problem sizes. Again, the number of work-items resulting in the best performance is marked
with a cross.
These charts show nicely if and how much increasing the number of work-items also increases performance.


Figure 3.1: Influence of the X, Y and Z work-sizes (ms per backprojection) on an HD 5870 GPU for problem size 256³. The best configuration is (16,4,4).


Figure 3.2: Influence of the X, Y and Z work-sizes (ms per backprojection) on an HD 5870 GPU for problem size 512³. The best configuration is (16,4,4).


Vertical scattering of dots here indicates how much the actual work-size configuration affects performance for a constant number of work-items. The further the dots are apart, the more important it is to choose a good work-size configuration for that number of work-items.

To start with, figure 3.1 shows data for the HD 5870 GPU. On this device, each work-size
dimension may not exceed 256 work-items, and the maximum work-group size is also limited
to 256 work-items. Benchmarking was performed using the ATI Stream SDK 2.33 and ATI
Catalyst 10.12 drivers4 .
For problem size 256³, the dot that is closest to the horizontal axis in the chart for the X dimension is located at a work-size of 16. For the Y dimension, it is located at 4, and for the Z dimension also at 4. To sum up, in this case the work-size configuration (16,4,4) is best on this platform. That configuration is more than 2.3 times faster than the worst configuration of (4,4,16), which turns out to be exactly the reversed triple.
The trend lines for the X and Y work-size do not deviate too much from a straight
horizontal line, meaning that changing the work-size in these dimensions does not affect
performance very much. There is only a slight tendency noticeable that increasing the X
work-size also increases performance a little bit, while increasing the Y work-size decreases
performance a little bit. In contrast, the trend line for the Z work-size clearly shows that
an increase, especially if at the cost of the X or Y work-size, decreases performance. (The
larger the Z work-size increase is, the more likely it is that this is at the cost of decreasing
the X or Y work-size in order to not exceed the maximum supported number of work-items
per work-group.) For example, increasing the Z work-size from 4 to 16 in the worst case
decreases performance by a factor of 2.3 (this equals the ratio of best vs. worst work-size
configuration), and at least by a factor of 1.3. This impact of larger Z work-sizes is mirrored
in the vertically scattered dots for small X and Y work-sizes: If the latter stay small, there
is more room for varying Z work-sizes, which result in noticeably different backprojection
3 http://developer.amd.com/gpu/ATIStreamSDK/downloads/Pages/default.aspx
4 http://sites.amd.com/us/game/downloads/Pages/downloads.aspx


times.
The charts for problem size 512³ in figure 3.2 support the previous observations even
more clearly. The best and worst work-size configurations also are (16,4,4) and (4,4,16),
respectively, but here they differ by a larger factor of 4.4 in performance. This time, increasing
the Z work-size from 4 to 16 leads to a decrease in performance by at least a factor of 1.7.

Figure 3.3: Performance for different numbers of work-items (ms per backprojection) on an HD 5870 GPU for problem sizes 256³ and 512³. Best performance is achieved with 256 work-items.

Regarding the total number of work-items, figure 3.3 shows that for both problem sizes
using the maximum of 256 work-items gives the best performance. However, using only 128
work-items, that is 50% of the original number, still reaches 90% of the best performance


for problem size 256³, and 80% of the best performance for problem size 512³. Finally, further reading of the charts makes it apparent that both the best and the worst configuration actually use the full 256 work-items.

The second GPU device is a GTX 260, featuring a maximum work-group size of 512, a maximum of 512 work-items in the X and Y dimensions, and at most 64 work-items in the Z dimension. It was benchmarked using the 263.06 developer drivers for use with the CUDA Toolkit 3.25.
By looking at figure 3.4 one can see that for a problem size of 256³ the shortest reconstruction time was achieved with a work-size of (8,4,8). For problem size 512³ (see figure 3.5), the configuration (8,4,4) gave slightly better results than (8,4,8), but only by a negligible factor of about 1.05. Generally, on this GPU many different work-size configurations result in similar performance: The top 20% deviate by at most 0.02% in relative performance.

The trend lines for problem size 256³ show that an increase of the X or Y work-size slightly decreases performance, and an increase in the Z dimension results in a noticeable performance loss. In short, this device performs best for small work-size configurations. The numbers for problem size 512³ again basically confirm that finding, with the exception of an X work-size of 4, which performs even worse than an X work-size of 32.
Scattering of the dots and therefore the influence of the work-size configuration is larger for
the larger problem size, especially for small work-sizes. For a work-size of 4 in any dimension,
performance varies at least by a factor of 2.0 depending on the configuration of the other
dimensions when backprojecting to problem size 512³, and about a factor of 1.3 for problem size 256³, which equals the best vs. worst performance ratio for that problem size.

When looking at the total number of work-items for best performance in figure 3.6, 128 and 256 go head-to-head for both problem sizes. Using either of the remaining configurations of 64 or 512 work-items reduces performance, in the case of problem size 256³ only marginally, but for problem size 512³ by up to a factor of 2.6, which results in the worst performance.
Supporting the initial statement about many work-size configurations resulting in similar
5 http://developer.nvidia.com/object/cuda_download.html


Figure 3.4: Influence of the X, Y and Z work-sizes (ms per backprojection) on a GTX 260 GPU for problem size 256³. The best configuration is (8,4,8).


Figure 3.5: Influence of the X, Y and Z work-sizes (ms per backprojection) on a GTX 260 GPU for problem size 512³. The best configuration is (8,4,4).


Figure 3.6: Performance for different numbers of work-items (ms per backprojection) on a GTX 260 GPU for problem sizes 256³ and 512³. Best performance is achieved with 256 and 128 work-items, respectively.


performance, the scattering in these charts indicates that the GTX 260 is less sensitive to changing the configuration than the HD 5870 for a fixed number of work-items. To recall, while for the latter choosing the worst configuration for 256 work-items reconstructs problem size 512³ slower by a factor of 4.4, the GTX 260 only becomes slower by a factor of 2.6 in the worst case using 512 work-items.

For the CPU device provided by the ATI Stream SDK 2.3, experimental image support
has been enabled by defining the CPU_IMAGE_SUPPORT environment variable. This was done
to get a fair comparison to the other benchmarked OpenCL devices which all provide image
support by default. On this device, each dimension is limited to a maximum of 1024 work-items, and the maximum work-group size is also limited to a total of 1024 work-items.
As can be seen in figure 3.7, work-size configuration (16,16,4) is best for problem size 256³, but only by about a factor of 1.1 compared to the worst configuration of (4,4,4), so all measurements are tightly staggered. The trend line shows that increasing the X work-size also increases performance, but at a factor of 1.03 the gain when going from 4 to 64 work-items is expectedly low. Increasing the Y or Z work-size decreases the performance in a similar manner. However, there is one exception in the X and Y dimensions: Here a work-size of 16 is better than any other work-size, clearly breaking the trend in the corresponding charts.
Although the absolute performance gain from optimizing the work-size configuration is quite small on this device, the obvious vertical scattering of the dots indicates a relatively large influence of the configuration. For a work-size of 4, the dots in the X and Y charts are scattered the most, because for such small work-sizes the other work-size dimensions can vary the most. With increasing work-size the scattering decreases for all dimensions, again with one exception, this time in the X dimension: For a work-size of 16, which is best otherwise, the particular configuration of (16,4,4) is second worst after (4,4,4).
The charts in figure 3.8 for problem size 5123 do not come up with too many surprises
apart from work-size 16 not quite being so exceptional anymore. The best configuration is
(16,8,8) this time, the worst still is (4,4,4), which again equals a speed-up factor of 1.1,


Figure 3.7: Influence of the work-sizes on an Intel Core i7 CPU (ATI Stream) for problem size 256³ (ms per backprojection over the X, Y, and Z work-sizes). The best configuration is (16,16,4).


Figure 3.8: Influence of the work-sizes on an Intel Core i7 CPU (ATI Stream) for problem size 512³ (ms per backprojection over the X, Y, and Z work-sizes). The best configuration is (16,8,8).


confirming that this device is not very sensitive to work-size configuration changes.
The trend lines again show that increasing the X work-size is preferable to increasing the Y or Z work-sizes in order to gain performance. Relative to the backprojection time, vertical scattering stays about the same as for a problem size of 256³, so no larger influence of the work-size configuration is visible for the larger problem size. As mentioned before, there no longer is a positive or negative outlier for a work-size of 16 in any of the charts. Although a work-size of 16 still is a good choice for any dimension, choosing 8 is better in the Y and Z dimensions and second best in the X dimension.

Figure 3.9: Performance for numbers of work-items on an Intel Core i7 CPU (ATI Stream) for problem sizes 256³ and 512³ (ms per backprojection over the total number of work-items). Best performance is achieved with 1024 work-items in both cases.


A look at figure 3.9 reveals that using the full number of 1024 work-items results in the best performance for both problem sizes. Again for both sizes, using only 512 work-items still leads to about 99% of the maximum performance. A more noticeable performance drop to 95% only occurs at 256 work-items for problem size 256³ and at 128 work-items for problem size 512³.
In both charts, the vertical scattering pattern is about the same. Although it was already observed that the actual work-size configuration does not matter too much in absolute performance numbers on this device, using 512 work-items in a good configuration instead of 1024 work-items in a bad configuration can still get close to the maximum speed.

Finally, the CPU device provided by the Intel OpenCL SDK 1.1.0.89836 was benchmarked. As for the previous CPU device, the work-size limit is 1024 for all dimensions as well as for the maximum number of work-items per work-group.
Figure 3.10 shows the familiar charts for the performance of different work-size configurations for problem size 256³. The best configuration of (32,4,8) is about 1.05 times faster than the worst configuration of (4,8,4), so again all measurements lie very close to each other.
The trend lines show the usual tendency that performance benefits from increasing the X work-size but suffers from increasing the Y or Z work-size. In the X dimension, the trend is broken by an outlier at a work-size of 64, which is slower than using 32 or 16 work-items by a mere 2 ms per backprojection. The chart for the Y dimension shows a close to perfectly linear slowdown in backprojection performance when raising the work-size from 4 work-items up to the maximum of 64. Here, the difference between the best timings for 4 and 64 work-items is 10.5 ms per backprojection. In comparison, the performance decrease in the Z dimension for increasing work-sizes is stronger than linear, with a slope greater than 1. Apart from the work-size of 64 in the X dimension, none of the charts shows any remarkable outliers.
When examining the scattering for problem size 2563 we see about the same amount of
6 http://software.intel.com/en-us/articles/download-intel-opencl-sdk/


Figure 3.10: Influence of the work-sizes on an Intel Core i7 CPU (Intel OpenCL) for problem size 256³ (ms per backprojection over the X, Y, and Z work-sizes). The best configuration is (32,4,8).


Figure 3.11: Influence of the work-sizes on an Intel Core i7 CPU (Intel OpenCL) for problem size 512³ (ms per backprojection over the X, Y, and Z work-sizes). The best configuration is (16,16,4).


relative variation as for the ATI Stream CPU device when excluding the latter's outliers. The charts for the Y and Z dimensions show the most variation, meaning those are the most sensitive to changes in the other dimensions, for better or worse regarding performance. In particular, a work-size of 8 in the Y dimension is very susceptible to changes in the X dimension: Even when choosing the best (because smallest) work-size of 4 in the Z dimension, choosing 4 in the X dimension results in the worst performance, but choosing 16 for the X work-size leads to the second best performance, which is only 0.8 ms per backprojection slower than the best result.

Turning to problem size 512³ in figure 3.11, the X work-size chart shows that the maximum of 64 work-items starts to decrease performance again, contradicting the trend that an increased X work-size also increases performance. Similar to the results for the ATI Stream CPU device and a problem size of 256³, using 16 work-items in the X and Y dimensions clearly is the best choice, even better than larger numbers of work-items. For the Z dimension, once more using as few work-items as possible obviously is the best choice, with the trend line following a logarithmic-like curve. In short, (16,16,4) is the best configuration for this problem size and (4,16,8) is the worst, but they only differ by roughly 65 ms per backprojection.

The general shapes of the trend lines also resemble those from the ATI Stream CPU device and a problem size of 256³ quite closely: The curve for the X dimension slowly flattens with increasing work-size, only countered by two measurements for a work-size of 16. In the chart for the Y dimension, the curve does not start out as linearly, but again a work-size of 16 breaks the trend in a positive sense. The Z work-size chart corresponds to the one from the ATI Stream CPU device taken to the extreme: It starts out with a greater slope and flattens out more quickly.
Scattering of the measurements is most pronounced in the Y dimension chart, closely followed by the Z dimension, meaning those dimensions are the most responsive to configuration changes in the other dimensions. In particular, keeping the Y work-size at 16 and varying the X or Z work-sizes leads to either the best or the worst total performance.
Lastly, figure 3.12 shows that using the maximum of 1024 work-items leads to the best


performance for both problem sizes on this device. In case of problem size 256³, the second best work-group size of 512 is at least 1.9 ms per backprojection slower; for problem size 512³, the second best result is 6.7 ms slower. Once more these charts nicely illustrate that not only the number of work-items matters, but also the actual work-size configuration. For example, depending on the configuration, 512 work-items lead to both the worst and the second best result for problem size 512³.

To sum up the results of this section, tables 3.1 and 3.2 give an overview of the work-size influences on the benchmarked devices for problem sizes 256³ and 512³, respectively. The data once again shows that the work-size configuration is an important setting in particular for the GPUs, but not so much for the CPUs. Moreover, while the factor by which the performance can be increased grows with the problem size for the GPUs, it is independent of the problem size for the CPUs. Among all devices, the HD 5870 is the most sensitive to work-size configuration optimizations. Finally, all devices but the GTX 260 achieve their best performance with a work-size configuration that comprises all available work-items.


Figure 3.12: Performance for numbers of work-items on an Intel Core i7 CPU (Intel OpenCL) for problem sizes 256³ and 512³ (ms per backprojection over the total number of work-items). Best performance is achieved with 1024 work-items in both cases.


Device name              Device type    Best configuration    Worst configuration    Best vs. worst ratio    Best work-group size
HD 5870                  GPU            (16,4,4)              (4,4,16)               2.34                    256 (of 256)
GTX 260                  GPU            (8,4,8)               (4,4,32)               1.34                    256 (of 512)
Core i7 (ATI Stream)     CPU            (16,16,4)             (4,4,4)                1.10                    1024 (of 1024)
Core i7 (Intel OpenCL)   CPU            (32,4,8)              (4,8,4)                1.05                    1024 (of 1024)

Table 3.1: Influence of the work-sizes on GPU and CPU devices for problem size 256³.

Device name              Device type    Best configuration    Worst configuration    Best vs. worst ratio    Best work-group size
HD 5870                  GPU            (16,4,4)              (4,4,16)               4.40                    256 (of 256)
GTX 260                  GPU            (8,4,4)               (4,4,32)               2.64                    128 (of 512)
Core i7 (ATI Stream)     CPU            (16,8,8)              (4,4,4)                1.10                    1024 (of 1024)
Core i7 (Intel OpenCL)   CPU            (16,16,4)             (4,16,8)               1.03                    1024 (of 1024)

Table 3.2: Influence of the work-sizes on GPU and CPU devices for problem size 512³.


3.2 Benchmarks
Using the determined best work-size configuration per device and problem size, the backprojection performances of all benchmarked devices are compared to each other in this section.
As the GPU devices perform an order of magnitude better than the CPU devices for this
particular problem, charts are grouped by device type and not by problem size to avoid axis
scaling issues and make them easier to read.

3.2.1 Standard Problem Sizes


Figure 3.13: Best performances of the GPU devices per problem size (lower is better): for problem size 256³, the HD 5870 takes 6.7 ms and the GTX 260 5.5 ms per backprojection; for 512³, 31.1 ms and 21.4 ms, respectively.

The GPU results in figure 3.13 show that the GTX 260 performs faster than the HD 5870 for both problem sizes. In case of problem size 256³, the speed-up is about a factor of 1.2. For a complete reconstruction consisting of 496 backprojections, this equals a gain of 0.6 s in total reconstruction time. For problem size 512³, the GTX 260 is 1.5 times faster than the HD 5870, which amounts to a gain of 4.8 s for a complete reconstruction. Note that total reconstruction times here do not include the time required for input / output operations.
When comparing the timings for problem sizes 256³ and 512³ per device, the HD 5870 requires 4.6 times the time for the larger problem size, whereas the GTX 260 requires 3.9 times the time. So for a large problem which is 8 times the size of the small problem, both


GPUs roughly take 4 times the time.

Figure 3.14: Best performances of the CPU devices per problem size (lower is better): for problem size 256³, the Core i7 (ATI Stream) takes 393 ms and the Core i7 (Intel OpenCL) 328 ms per backprojection; for 512³, 3097 ms and 2574 ms, respectively.

Figure 3.14 shows the chart for the CPU timings. Here, the Intel OpenCL device outperforms the device provided by ATI Stream: For both problem sizes the former is roughly a factor of 1.2 faster. But compared to the best GPU device, the best CPU device is about 60 times slower for problem size 256³, and 120 times slower for problem size 512³ (so for 8 times the problem size the performance ratio doubles). This also means that the total reconstruction times are much longer on CPU devices. For the 496 projections, the Intel OpenCL device requires 163 s for reconstructing problem size 256³ and 1277 s for problem size 512³, again not including input / output. The ATI Stream device requires 32 s and 259 s more time for the respective problem sizes.
Going from the smaller to the larger problem size on either the ATI Stream or Intel OpenCL CPU device increases the reconstruction time by a factor of 7.9, so the time requirements grow by roughly the same factor as the problem size.


3.2.2 Large Problem Size


As none of the introduced CPU platforms was able to complete the largest problem size of 1024³, which the RabbitCT team jokingly refers to as being only for real rabbits, within a practice-oriented time limit of one hour, another test system was set up by installing the ATI Stream SDK 2.3 on a Linux server running Ubuntu 10.04.1 LTS7. The system is made up of four physical CPU sockets, each of which hosts an AMD Opteron 6174 processor with 12 cores running at 2.2 GHz, and 256 GiB of RAM. As the timings still differ greatly from those of the GPUs, the results are presented in table 3.3 instead of a chart.

Device name     Device type    Work-size configuration    Mean time [ms]    Total time [s]
HD 5870         GPU            (16,4,4)                   200.91            99.65
GTX 260         GPU            (8,4,4)                    172.58            85.60
Opteron 6174    CPU            (64,4,4)                   5098.02           2528.62

Table 3.3: Performance of devices reconstructing problem size 1024³.

Again, the CPU implementation clearly loses out compared to the GPU implementations. This time, the CPU is only a factor of 25 - 30 slower than the GPUs, but to anticipate some results from section 3.3 about Multiple Devices, this is due to the Opteron being faster than the Core i7 by about a factor of 4.35. Extrapolating the data, it is reasonable to assume that the Core i7 would be 110 - 130 times slower than the GPUs, which pretty much matches the CPU versus GPU observations for problem size 512³.
The HD 5870 and GTX 260 GPUs are almost on par performance-wise; the latter is only less than a factor of 1.2 faster, which is the same ratio as for problem size 256³, and in the same ballpark as the factor of 1.5 for problem size 512³. Going from problem size 512³ to 1024³ again is an eightfold increase in data to reconstruct, and the HD 5870 requires 6.5 times the time to finish, while the GTX 260 requires 8.0 times the time.
7 http://www.ubuntu.com/


3.2.3 RabbitCT Ranking


The best result for each problem size was submitted to the RabbitCT project via the algorithm
user area to make them show up on the public ranking web page8 after approval. Figure 3.15
shows screen shots from the web page after the submissions had been approved. CLeopatra,
the implementation which was developed as part of this thesis, came first for all three problem
sizes when run on the GTX 260 GPU. In particular, it is ranked before the only other GPU
implementation named SpeedyGonzales (which is written using NVIDIA CUDA C) while
maintaining a very similar mean square error and peak signal to noise ratio. However, as
the later discussion will elaborate, the ranking is to be taken with care as it only reflects
absolute performance, not relative performance with respect to the hardware an algorithm
is running on. The given number of Giga Updates Per Second (GUPS) as introduced by
[GBB+ 07] is only suitable for comparing performance within one implementation, but not
across implementations. That said, it obviously would have been best to run SpeedyGonzales
on the same hardware as CLeopatra, but unfortunately the binary does not execute on the
system used for benchmarks in this thesis for unknown reasons, and any debugging was
impeded due to SpeedyGonzales being closed-source.
Note that SpeedyGonzales does not appear at all in the ranking for problem size 10243
as it cannot handle problem sizes that do not fully fit into GPU memory, in this particular
case into the 4 GiB of memory of the Quadro FX 5800 GPU it was run on. That amount of
memory would already be fully exhausted for the output volume of problem size 10243 and
floating point data.

The CPU implementations continue to show a lower performance for this particular problem. Depending on the problem size, the fastest CPU implementation, called TomatoSalad, is roughly 10 to 50 times slower than CLeopatra. For problem sizes 256³ and 512³, TomatoSalad was run on an Intel Core2 Duo T9300 (two cores) at 2.50 GHz; for problem size 1024³, two Intel Xeon E5410 (with four cores each) at 2.33 GHz were used. However, an advantage of CPU-based algorithms is their typically slightly better image quality.
8 http://www5.informatik.uni-erlangen.de/research/projects/rabbitct/ranking/


Figure 3.15: RabbitCT project public ranking web page screen shots: (a) problem size 256³, (b) problem size 512³, (c) problem size 1024³.


3.3 Multiple Devices


In order to test running on a multi-CPU system and the scalability of OpenCL and the reconstruction implementation, again the Opteron server from benchmarking the large problem size reconstruction was used. The system's four physical CPU sockets comprise a Non-Uniform Memory Access (NUMA) architecture. The memory subsystem of NUMA architectures can usually have Node Interleaving disabled or enabled. If disabled, each CPU can use its own memory controller and does not have to compete for access to the shared interconnect. Only if local memory is exhausted will remote memory be used. To do so efficiently, however, applications need to be aware of running on a NUMA architecture and support it. Enabling Node Interleaving hides the details of the underlying physical architecture from the applications and interleaves memory allocations across all memory banks at the price of higher latency, effectively turning it into a Uniform Memory Access (UMA) system. See figure 3.16 for a nicely illustrated example of a NUMA system with two physical CPU sockets of four cores each, running two virtual machine processes.

Figure 3.16: Node interleaving on NUMA architectures: (a) disabled, (b) enabled (image courtesy F. Denneman).

To optimize memory access for highly multi-threaded applications whose threads all work
on the same large amount of data, Node Interleaving was enabled for this benchmark.
The ATI Stream implementation exposes the systems complete computing power as a
single OpenCL device compromising 48 compute units. As such, it does not take any
source code modifications to make the reconstruction algorithm utilize all hardware threads.


While this is very convenient on the one hand, on the other hand it means that distributing work across multiple OpenCL devices cannot be tested with this platform. The chart in figure 3.17 contrasts the timings with the known ones from ATI Stream running on the Core i7. Again, the CPU_IMAGE_SUPPORT environment variable was set.

Figure 3.17: Backprojection times of the Core i7 (8 compute units) and the Opteron 6174 (48 compute units) per problem size: 393 ms versus 105 ms per backprojection for 256³, and 3097 ms versus 712 ms for 512³.

For problem size 256³ the Opteron is about 3.73 times faster than the Core i7; in case of problem size 512³ the ratio is slightly higher at 4.35.

Distributing work across multiple GPU devices is a more complex topic. To make a long story short, even when sticking to identical GPUs from the same vendor, none of the current OpenCL implementations for GPUs is able to make them appear as a single device with the combined number of compute units, like ATI Stream is able to do for the CPU NUMA architecture. Moreover, any GPU interconnection (ATI CrossFire, NVIDIA SLI) has to be disabled for GPGPU usage. For GPUs that have multiple chips on a single board (ATI Radeon HD 5970, NVIDIA GTX 295), the interconnection usually is hardwired and thus only the first device can be used by GPGPU applications.
As a result, the easiest and quite effective way to scale an OpenCL application across multiple GPUs is to simply launch the application once for each GPU in the system with


its affinity set to that particular GPU. Obviously, this is only applicable if there are at least as many problems to solve as there are GPUs in the system. Otherwise, the first choice would be to modify the application to create an OpenCL context for all GPU devices of the platform and one command queue for each device. If that fails to scale well, the second, more complex choice would be to create separate host threads, each of which creates one context and command queue per device. The latter solution, however, requires OpenCL 1.1, as previous versions of the API are not guaranteed to be thread-safe. In any case, an advantage of creating one context per device is that multiple platforms, for example those exposing CPU devices, can also be used. Creating one context per device from different host threads is the conservative fallback that is reported to work best across all vendors.
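As a hedged sketch of the first option, the following PyOpenCL snippet creates one context spanning all GPU devices of a platform and one command queue per device; how work is then distributed across the queues is up to the application and not shown here.

#!/usr/bin/env python
import pyopencl as cl

# One context for all GPU devices of a platform, one command queue per device.
for platform in cl.get_platforms():
    # Filter for GPU devices without assuming that every platform exposes one.
    gpus = [d for d in platform.get_devices() if d.type & cl.device_type.GPU]
    if not gpus:
        continue
    context = cl.Context(gpus)
    queues = [cl.CommandQueue(context, device) for device in gpus]
    print platform.name, ":", len(queues), "command queue(s) created"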

3.4 Kernel Compilation


It turns out that on some OpenCL platforms, compiling the kernel code for specific devices
takes a notable amount of time. For completeness, figure 3.18 shows the compile times of
the backprojection kernels for the benchmarked OpenCL implementations.

Figure 3.18: Times required to compile the backprojection kernels for the benchmarked OpenCL implementations (kernel compile time in s for ATI Stream (GPU), ATI Stream (CPU), NVIDIA OpenCL (GPU), and Intel OpenCL (CPU), each with cold and hot caches).

The kernel source code file currently consists of six kernel functions (of which only two are used per reconstruction, depending on the device's memory constraints) comprising about 200 lines of code, not counting comments or empty lines. The Cold series represents the kernel compile times as measured directly after application startup and creation of an OpenCL context, that is, with cold caches. Correspondingly, the Hot series was obtained by taking the minimum timing measured when compiling the same kernel ten times in a row, to take advantage of any caching effects.
The NVIDIA OpenCL and Intel OpenCL implementations reach very similar speeds and are the fastest, while the ATI Stream implementation when targeting the CPU is the slowest, on average taking 4.5 times as long. Interestingly, all implementations are able to cut the required compile time almost in half with hot caches.
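The cold and hot timings can be reproduced with a simple measurement loop. The following sketch uses a trivial placeholder kernel instead of the actual backprojection kernels and takes the first build as the cold timing and the fastest of the remaining builds as the hot timing.

#!/usr/bin/env python
import time
import pyopencl as cl

# A trivial placeholder kernel; the real backprojection kernels are much larger.
SOURCE = "__kernel void noop(__global float* data) { data[get_global_id(0)] = 0.0f; }"

context = cl.Context([cl.get_platforms()[0].get_devices()[0]])

timings = []
for i in range(10):
    start = time.time()
    cl.Program(context, SOURCE).build()
    timings.append(time.time() - start)

print "cold compile time:", timings[0], "s"
print "hot compile time:", min(timings[1:]), "s"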

3.5 Image Interpolation


In OpenCL, support for image objects is optional and can be checked for by a call to clGetDeviceInfo() with CL_DEVICE_IMAGE_SUPPORT as an argument. With the release of version 2.3, the ATI Stream SDK, the last OpenCL implementation lacking image support, added experimental image support that can be turned on by setting the CPU_IMAGE_SUPPORT environment variable. In all previous benchmarks, this variable was set in order to have the same prerequisites and a fair comparison to the other OpenCL implementations.
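A minimal PyOpenCL sketch of that check, where image_support corresponds to CL_DEVICE_IMAGE_SUPPORT; the two messages merely illustrate which kernel variant the reconstructor would have to pick.

#!/usr/bin/env python
import pyopencl as cl

# Report for every device whether image objects (and thus built-in samplers)
# are available or whether custom interpolation in the kernel is required.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        if device.image_support:
            print device.name, ": images supported, built-in interpolation possible"
        else:
            print device.name, ": no image support, custom interpolation required"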
Although image support and image interpolation seem to be a natural fit especially for GPU devices with their hardware texturing units, and something straightforward to add for CPU devices, the OpenCL implementations on Mac OS X in particular (both GPU and CPU) often still lack image support. In order to generally support such devices, custom code that performs the interpolation manually, featuring both nearest-neighbor and bilinear interpolation, was added to the reconstructor implementation. While at it, the reconstructor implementation was also modified to be able to fake missing image support even if the OpenCL implementation does support images, in order to run some more benchmarks, with some interesting results as table 3.4 for GPU devices and table 3.5 for CPU devices show.
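To illustrate the math behind the custom bilinear path (not the actual kernel code), the following NumPy sketch samples a projection image at a continuous position by weighting the four surrounding pixels; the clamping at the image border is a simplification.

import numpy as np

def bilinear_sample(img, u, v):
    # Integer coordinates of the upper-left neighbor, clamped so that the
    # 2x2 neighborhood stays inside the image.
    u0 = min(max(int(np.floor(u)), 0), img.shape[1] - 2)
    v0 = min(max(int(np.floor(v)), 0), img.shape[0] - 2)
    # Fractional distances to that neighbor act as interpolation weights.
    fu, fv = u - u0, v - v0
    return ((1 - fu) * (1 - fv) * img[v0, u0] +
            fu * (1 - fv) * img[v0, u0 + 1] +
            (1 - fu) * fv * img[v0 + 1, u0] +
            fu * fv * img[v0 + 1, u0 + 1])

# Sampling in the middle of four pixels yields their average.
projection = np.array([[0.0, 1.0], [2.0, 3.0]])
print bilinear_sample(projection, 0.5, 0.5)  # prints 1.5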
When reading the first table from left to right and top to bottom, there is no real gain


ms per backprojection                           HD 5870 (ATI Stream)              GTX 260 (NVIDIA OpenCL)
                                                Bilinear    Nearest-neighbor      Bilinear    Nearest-neighbor
With image support (built-in interpolation)     7.35        7.32                  5.70        5.69
Without image support (custom interpolation)    9.22        5.30                  12.57       7.01

Table 3.4: Interpolation performance of GPU devices for problem size 256³.

ms per backprojection                           Core i7 (ATI Stream)              Core i7 (Intel OpenCL)
                                                Bilinear    Nearest-neighbor      Bilinear    Nearest-neighbor
With image support (built-in interpolation)     392.67      168.80                327.73      190.56
Without image support (custom interpolation)    134.61      79.47                 144.73      96.38

Table 3.5: Interpolation performance of CPU devices for problem size 256³.

apparent from using nearest-neighbor interpolation on GPUs with native image support. The table's second row shows that custom bilinear interpolation in software as part of the OpenCL kernel is slower than built-in interpolation, but on the HD 5870 only by a factor of 1.3, compared to a factor of 2.2 on the GTX 260. The first surprise manifests when looking at the numbers for custom nearest-neighbor interpolation: On the HD 5870 this is actually faster by a factor of 1.4 than built-in nearest-neighbor interpolation. There is no surprise when looking at the GTX 260 again, though; here, custom nearest-neighbor interpolation is 1.2 times slower. When sticking to custom interpolation, nearest-neighbor is 1.7 times faster than bilinear on the HD 5870, and 1.8 times faster on the GTX 260.

The table for the CPU numbers holds even more interesting results. When switching from built-in bilinear to built-in nearest-neighbor interpolation, there is a notable performance increase for both the ATI Stream and the Intel OpenCL implementation, by factors of 2.3 and 1.7, respectively. The second big surprise is the result for custom interpolation: For ATI Stream, bilinear interpolation becomes 2.9 times faster and nearest-neighbor interpolation 2.1 times faster; for Intel OpenCL the numbers are 2.3 and 2.0. This also means that with custom bilinear interpolation ATI Stream is faster than Intel OpenCL, and the CPU is no longer 60 times slower than the GPUs, but only 25 times slower. For problem size 512³ separate


measurements have shown that the performance difference is reduced from a factor of 120 to a factor of 50. The factors by which custom nearest-neighbor interpolation is faster than bilinear interpolation on the CPU are very similar to those for the GPUs: They are 1.7 for ATI Stream and 1.5 for Intel OpenCL.
As another positive side effect, image quality improves noticeably with custom bilinear interpolation, see table 3.6. For all implementations except Intel OpenCL, the error can be reduced to the level of RabbitCT participants that run on the CPU using SSE code.

Mean square error                  HD 5870 (ATI Stream)    GTX 260 (NVIDIA OpenCL)    Core i7 (ATI Stream)    Core i7 (Intel OpenCL)
Built-in bilinear interpolation    8.07115 HU²             8.071 HU²                  8.05676 HU²             10.9059 HU²
Custom bilinear interpolation      0.001 HU²               0.00088 HU²                0.0093 HU²              2.90477 HU²

Table 3.6: Image quality for built-in versus custom bilinear interpolation for problem size 256³.


4 Discussion
To interpret the presented results, this chapter discusses the measurements with respect to
several architectural properties of the underlying hardware.

4.1 Work-Size Determination


As observed in the previous chapter, choosing a good work-size configuration is important
for high-performance reconstruction. While this is true in particular for the GPU devices, the
CPU devices remain less sensitive to changes regarding the configuration, which is discussed
in the following.

4.1.1 CPU Device Specifics


AMD's OpenCL Zone provides a direct link to the OpenCL Programming Guide accompanying the ATI Stream SDK [Adv10]. It contains a guideline on how to partition the work for best performance, which recommends using at least as many work-groups as there are compute units. For the benchmarked CPU device with eight compute units (four cores with HTT each) this is always the case, as even for the largest number of 1024 work-items per group (and thus the smallest number of work-groups) the smallest problem size 256³ requires a total of

\frac{256^3}{1024\ \text{work-items}} = 16384\ \text{work-groups}

to calculate the output, which always allows eight work-groups to be processed in parallel regardless of the work-size configuration. Other than that, the guide does not expose any CPU-specific constraints or hints for optimizations regarding the work-size configuration, so


it does not provide any insights as to why changing the configuration does not have much
of an effect. On the other hand, the guide contains a lot more information when it comes
to GPU optimizations, probably suggesting that the work-size configuration simply is not
something to care about too much on CPUs.

The Intel OpenCL SDK comes with a performance guide [Int10] that features a section about work-group size considerations. For all configurations that were benchmarked as part of this thesis, the work-size in each dimension is a multiple of 4. Thus, the total work-group size of all configurations is also a multiple of 4, enabling implicit vectorization by the OpenCL compiler according to the guide. In this respect no configuration is better or worse than another, which contributes to explaining why they do not differ too much in performance. Moreover, the current kernel implementation does not require any barriers for synchronization, ruling out another possible source of varying performance with changing configurations, as barriers issue copies for the total amount of private and local memory used by all work-items in the work-group.
In general, Intel advises to experiment with the work-group size (that is, the product of the work-sizes in all dimensions) and not with the actual configuration for a specific number of work-items per work-group. This is another indication that the total number of work-items rather than their configuration has an impact on performance. Intel recommends using work-groups of size 64 - 128 for kernels without barrier instructions. However, best results in the benchmarks were achieved with a work-group size of the full 1024 work-items. This discrepancy is probably due to the simplicity of the backprojection kernel in terms of generated instructions. To get the largest performance gain from parallelism between work-groups, Intel's guide urges to ensure that the execution of a work-group takes around 10000 - 100000 instructions, as a smaller value increases the proportion of switching overhead compared to actual work. Using the Intel OpenCL Offline Compiler, the x86 assembly code for the kernel was generated manually. The generated assembly contains about 90 instructions. Making use of the maximum work-group size of 1024 work-items yields 1024 · 90 = 92160 instructions, which lies well within the recommended range and close to its upper bound of 100000 instructions.


The optimization guide provided by Intel offers more details when it comes to work-size configuration optimizations on the CPU than AMD's guide does, but as both run on the same hardware, the hardware-specific considerations from the Intel guide probably apply to both OpenCL implementations. Generally speaking, using the maximum number of 1024 work-items per work-group minimizes the number of work-groups and thus reduces the likelihood of context switches, which are quite heavyweight on CPUs. More specifically, from a cache-usage perspective, using at least 64 bytes of data (which equals 16 work-items for float data) in the fastest-changing X dimension makes sense because this matches the cache-line size of the Core i7 CPU.
Due to the low impact of the work-size configuration on the performance of CPU devices, it is reasonable to choose a fixed configuration for all problem sizes. For the measured results, a configuration of (64,4,4) would be a good choice. This configuration came second / third on the ATI Stream device and fifth / sixth on the Intel OpenCL device, which makes it the single best configuration across all problem sizes for the benchmarked CPU devices.

4.1.2 GPU Device Specifics


When looking at the GPU devices to find out why these are much more sensitive to work-size
configuration changes, their radically different hardware architecture compared to CPU devices needs to be taken into account. As an example, GPUs typically feature high-bandwidth
texture memory whose (small) cache is optimized for 2D spatial locality. This means it
can provide a significant performance benefit to have all work-items of a work-group access
nearby locations in the texture as demonstrated in [GM07].

The section about work-group optimizations in the ATI Stream SDK's Programming Guide
opens with the statement that the most effective way to exploit the potential performance
on a GPU is to provide enough work-items per work-group to keep the device completely
busy, which is reflected in the benchmarks because best performance is achieved by using
the maximum work-group size. On the HD 5870, work-items inside a work-group are further


organized into so-called wavefronts consisting of 64 work-items. Each compute unit executes
a quarter-wavefront on each cycle, and the entire wavefront is executed in four consecutive
cycles. Thus, to hide eight cycles of latency, the program must schedule two wavefronts,
that is 128 work-items. Lower work-group sizes are susceptible to exposing the ALU pipeline
latency which results in lower performance. This explains why the benchmarked work-group
sizes of 64 work-items lag behind in performance.

#!/usr/bin/env python
import pyopencl as cl

# Set the problem size.
PROBLEM_SIZE = 256

# Create an empty dictionary.
work_sizes = {}

print "Listing best work-size configurations for problem size", PROBLEM_SIZE, "per device"

for platform in cl.get_platforms():
    # Work-size determination is vendor-specific.
    if not "Advanced Micro Devices" in platform.vendor:
        continue
    # Loop over all GPU devices.
    for device in platform.get_devices(cl.device_type.GPU):
        # The x should be at least 16 (one fourth of a wavefront).
        for x in range(device.max_work_item_sizes[0], 15, -1):
            # y must be less than x and within limits.
            for y in range(min(x - 1, device.max_work_item_sizes[1]), 0, -1):
                # Always use the maximum work-group size, so derive z from it.
                z = device.max_work_group_size / (x * y)
                # z must be less than or equal to y and within limits.
                if z > y or z == 0 or z > device.max_work_item_sizes[2]:
                    continue
                # The work-size must be a factor of the problem size.
                if PROBLEM_SIZE % x or PROBLEM_SIZE % y or PROBLEM_SIZE % z:
                    continue
                size = x * y * z
                # Skip configurations that are no multiple of a wavefront
                # or do not contain at least two wavefronts.
                if size % 64 or size < 2 * 64:
                    continue
                # Create a metric that favors "cubic" work-sizes.
                metric = abs(x - y) + abs(y - z)
                work_sizes[metric] = [x, y, z]
        best_metric = sorted(work_sizes)[0]
        print device.name, ":", work_sizes[best_metric]

Listing 4.1: Python code for finding the best work-size configuration on ATI GPUs.

According to the ATI Stream SDK's Programming Guide, the GPU hardware schedules the kernels so that the X dimension moves fastest as the work-items are packed into wavefronts. As a result, the coalescing of global memory accesses and local memory bank conflicts can be


impacted by the work-size dimension, in particular if the fast-moving X dimension is small,


which is why best results are achieved for X dimensions which are larger than the Y or Z
dimensions. Moreover, work-items in the same quarter-wavefront execute on the same cycle
in the processing engine. To make such work-items work on similar data so that they use
the same control-flow and to avoid idle work-items waiting for others on the same cycle to
finish, the X work-size should be at least 16.
See listing 4.1 for an algorithm written in the Python1 language which determines the best work-size configuration on ATI GPUs. To run the script on Windows, install Python along with the PyOpenCL and NumPy packages2. It works by looping over all possible work-size configurations, filtering out those which are not within the hardware limits, and further filtering out configurations which do not match the mentioned constraints. The metric computed at the end of the innermost loop uses the differences between the work-sizes in all dimensions to prefer compact configurations where the differences are small. Such cube-like work-groups have high spatial locality in the output volume and thus are likely to map to nearby pixels in the input texture for all projection angles, making best use of the texture's 2D cache. Running the Python script for a HD 5870 returns (16,4,4) as the best work-size configuration, which equals the best configuration determined by benchmarking.

As can be learned from NVIDIA's OpenCL Best Practices Guide [NVI10b], the GTX 260 uses so-called warps comprising 32 work-items as its smallest executable unit of parallelism. See figure 4.1 for an overview of how work from different warps is interleaved by the instruction scheduler on GPUs of the GT200 generation. Moreover, NVIDIA categorizes its GPU devices by Compute Capability, composed of a major and a minor revision number: Devices with the same major revision number are of the same core architecture, whereas the minor revision number corresponds to an incremental improvement to the core architecture, possibly introducing new features. The GTX 260 is of Compute Capability 1.3, which means it can coalesce memory accesses of active work-items within a half-warp to segments of 128
1 http://www.python.org/
2 http://www.lfd.uci.edu/~gohlke/pythonlibs/


bytes in size (for float data). The transaction size may be reduced to 64 or 32 bytes if all other active work-items accessing the same segment only use the segment's lower or upper half or quarter. As on this device the X dimension also moves fastest when assigning work-items to warps, this means the X work-size should be 8, 16 or 32 so that reading and writing the output memory can benefit from coalesced memory access. Note that in general, the multi-dimensional aspect of a work-group does not play a role in performance but just allows easier mapping to multi-dimensional problems. However, whenever mapping of a multi-dimensional problem involves non-linear memory access, the choice of dimension becomes important.

Figure 4.1: The instruction scheduler on GT200 GPUs (image courtesy AnandTech, Inc.).
Another prominent performance metric related to the work-group size is occupancy, that is, the ratio of the number of active warps per multi-processor to the maximum number of possible active warps. According to NVIDIA, to completely hide latencies due to register dependencies, caused by an instruction that uses a result stored in a register written by a preceding instruction, multi-processors should be running at least 192 work-items (six warps) per work-group, which equates to 18.75% occupancy. The top 20% of the benchmark results all use either 128 or 256 work-items, and 192 happens to be just in between. As 128 is lower than the suggested number of 192 work-items, there must be other means than increasing the number of work-items to hide latency. This reveals an interesting characteristic of the


GTX 260: Even though it supports larger work-group sizes than for example the HD 5870, it
reaches higher performance at lower occupancy, an effect that is described in detail in [Vol10].
Increasing the number of work-items per work-group only helps to increase Thread-Level Parallelism (TLP, figure 4.2), but there also is Instruction-Level Parallelism (ILP, figure 4.3). The key is to use ILP in addition to TLP for best performance. Doing so theoretically allows using as few as 64 work-items per work-group to completely hide arithmetic latency on the GTX 260. Using fewer work-items also means more registers can be used per work-item, and keeping values in registers is the only way to get peak performance as they provide the fastest data access.
Working at lower occupancy also frees resources for concurrent kernels running on the same device. Having that in mind, it is tempting to use only 128 instead of 256 work-items on the GTX 260, which in the worst case still reaches 97% of the performance of 256 work-items.

Figure 4.2: Thread-Level Parallelism (TLP) to hide arithmetic latency (image courtesy V. Volkov).
Figure 4.3: Instruction-Level Parallelism (ILP) to hide arithmetic latency (image courtesy V. Volkov).

The Python script in listing 4.2 implements the gained insights about the GTX 260's hardware specifics to determine the best work-size configuration. This time, the Z work-size is not implicitly derived from the maximum work-group size, in order to allow low-occupancy configurations comprising fewer work-items. However, at least 64 work-items are required to hide latency. As spatial locality when accessing 2D texture pixels also is a big advantage on the GTX 260, the metric from the script for ATI GPUs which favors cube-like work-groups is used again. With these few simple rules, the script returns (8,4,4) as the best work-size configuration, which was benchmarked as the


best configuration for problem size 5123 and came third for problem size 2563 .

#!/usr/bin/env python
import pyopencl as cl

# Set the problem size.
PROBLEM_SIZE = 256

# Create an empty dictionary.
work_sizes = {}

print "Listing best work-size configurations for problem size", PROBLEM_SIZE, "per device"

for platform in cl.get_platforms():
    # Work-size determination is vendor-specific.
    if not "NVIDIA" in platform.vendor:
        continue
    # Loop over all GPU devices.
    for device in platform.get_devices(cl.device_type.GPU):
        # The x should be at least 8 for minimal coalesced memory access.
        for x in range(device.max_work_item_sizes[0], 7, -1):
            # y must be less than x and within limits.
            for y in range(min(x - 1, device.max_work_item_sizes[1]), 0, -1):
                # z must be less than or equal to y and within limits.
                for z in range(min(y, device.max_work_item_sizes[2]), 0, -1):
                    # The work-size must be a factor of the problem size.
                    if PROBLEM_SIZE % x or PROBLEM_SIZE % y or PROBLEM_SIZE % z:
                        continue
                    size = x * y * z
                    # The work-size must be within limits.
                    if size > device.max_work_group_size:
                        continue
                    # Skip configurations that do not consist of at least 2 warps.
                    if size < 2 * 32:
                        continue
                    # Create a metric that favors "cubic" work-sizes.
                    metric = abs(x - y) + abs(y - z)
                    work_sizes[metric] = [x, y, z]
        best_metric = sorted(work_sizes)[0]
        print device.name, ":", work_sizes[best_metric]

Listing 4.2: Python code for finding the best work-size configuration on NVIDIA GPUs.

To summarize, the highly specialized GPU architecture is more sensitive to work-size configuration changes than the more generic CPU architecture. Also, simply using the maximum number of work-items by itself is no guarantee for best performance on all devices, in particular not on the GTX 260. For all devices, it is possible to determine a suitable work-size configuration for a given problem size using the presented algorithms.


4.2 Performance Considerations


For all problem sizes, the GPU results have shown that the GTX 260 is ahead of the HD 5870 by something between a factor of 1.2 and 1.5. This is interesting insofar as the best work-size configuration on the GTX 260 is, with 128 work-items, also notably smaller than the HD 5870's with 256 work-items. This could either mean that a compute unit on the GTX 260 has more computing power (for example due to more resources, like hardware registers, or due to running at a higher clock rate), or that the code does not map equally well to the ATI hardware, probably leaving some compute units underutilized. The first can be ruled out: comparing both GPUs on GPUReview3 reveals that the HD 5870 has in fact a higher core and memory clock, and a much higher theoretical peak performance of 2720 GFLOPS versus 875 GFLOPS for the GTX 260. So it is time to investigate the latter assumption.
Using the ATI Stream Profiler4, the OpenCL implementation of the reconstruction was analyzed with an eye on the ALUBusy and ALUPacking metrics. As explained in [PRH10], while the first indicates the rate of instructions processed by the SIMD units, the latter equals the utilization of the HD 5870's VLIW-5 architecture (Very Long Instruction Word), that is, how well ALU instructions can be assigned to the five available ALU units.

Figure 4.4: The Cypress GPU generation's Streaming Processor (SP) architecture (image courtesy ATI, Inc.).

3 http://www.gpureview.com/show_cards.php?card1=613&card2=604
4 http://developer.amd.com/gpu/StreamProfiler/Pages/default.aspx

Figure 4.4 depicts a single Streaming Processor (SP) to illustrate this design: x through w are full-blown ALU units capable of performing fused multiply-additions (FMA) on vectors, whereas ALU unit t is only able to perform transcendental calculations like 1/x, √x, e^x etc. and simple arithmetic instructions. A block of 16 such SPs teams up for an OpenCL compute unit, of which the HD 5870 has 20. For kernels that mostly work on 4-component vector data, it can be quite


hard to make use of the transcendental unit in order to keep the SP fully busy. That said,
figure 4.5 shows the ALUBusy and ALUPacking percentages over the number of projection
images.
With median values of about 46% each, the values are suspiciously low. Multiplying the ALUBusy and ALUPacking values yields the percentage of SIMD utilization with respect to the theoretical peak performance. Doing so, we get a median SIMD utilization of only about 21%.

Figure 4.5: ALUBusy (a) and ALUPacking (b) metrics over the number of projection images as reported by the ATI Stream Profiler.

Regarding the low ALUBusy percentage, this indicates that there is either not enough work scheduled, or ALU units are stalled due to data latency. Some source code modifications are helpful to determine which assumption applies. When replacing the data fetch

float4 value = read_imagef(input, sampler, (float2)(un + 0.5f, vn + 0.5f));

with an arithmetic instruction like

float4 value = (float4)(un + 0.5f, vn + 0.5f, un + 0.5f, vn + 0.5f);

and looking at the performance, a slightly increased reconstruction time can be observed.
This suggests that ALU units are not stalled due to data latency. In conjunction with the
low ALUPacking percentage, a reasonable conclusion is that not enough ALU instructions
that can be calculated independently of each other are available. Indeed, when looking at
the source code again, it is apparent that a straight-forward backprojection implementation


naturally introduces long dependency computation chains, because a work-item's ID evaluates to both the coordinates for the input data and the output position to accumulate. A possible way to reduce this effect would be to introduce higher Instruction-Level Parallelism (ILP), as explained when discussing the occupancy of the GTX 260 in the previous section. This can be done by merging the work of two or more work-items into one work-item in order to interleave independent instructions from different work-items.
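The following PyOpenCL sketch shows the idea in isolation; it is not the CLeopatra backprojection kernel, and all names are made up. Each work-item updates two array elements whose computations do not depend on each other, so the compiler and hardware can interleave the two instruction chains.

#!/usr/bin/env python
import numpy as np
import pyopencl as cl

# Two independent computation chains per work-item (illustrative only).
SOURCE = """
__kernel void update_two(__global float* volume)
{
    int i = get_global_id(0) * 2;
    float a = volume[i];
    float b = volume[i + 1];
    a = a * 0.5f + 1.0f;
    b = b * 0.5f + 2.0f;
    volume[i] = a;
    volume[i + 1] = b;
}
"""

context = cl.create_some_context()
queue = cl.CommandQueue(context)
volume = np.zeros(16, dtype=np.float32)
volume_buf = cl.Buffer(context, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=volume)
program = cl.Program(context, SOURCE).build()
# Only half as many work-items as array elements are launched.
program.update_two(queue, (volume.size // 2,), None, volume_buf)
cl.enqueue_copy(queue, volume, volume_buf)
print volume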

For completeness, NVIDIA's Compute Visual Profiler5 was used to determine the same code's occupancy on the GTX 260 and thus for that GPU architecture. Like all GPUs of the GT200 generation, the GTX 260 has a fundamentally different hardware architecture than the HD 5870. Instead of vector-based ALUs it is based on scalar processors, eight of which form a Streaming Multiprocessor (SM) together with two Special Function Units (SFUs), which can be compared to the HD 5870's transcendental ALU unit. In total, there are 27 SMs (or compute units in OpenCL-speak) available on the GTX 260.

Figure 4.6: NVIDIA's Compute Visual Profiler showing GPU time in a width plot.

Figure 4.6 shows a width plot of the GPU time for a complete reconstruction, where idle times would be visible as white gaps. As no white gaps are visible in the plot and the Occupancy column in the profiler output table lists a value of 1 (which equals 100%) for all backprojections, the GTX 260 seems to be working much closer to capacity than the HD 5870.
Another likely reason for the OpenCL code to better utilize the GTX 260 than the HD 5870 is differences in compiler efficiency.

5 http://developer.nvidia.com/object/visual-profiler.html


Since the announcement of CUDA in November 2006, NVIDIA was able to gain more experience regarding high-level language compilers than ATI, even though ATI announced its Close To Metal (CTM) technology for GPGPU computing at about the same time; this, however, was a low-level approach only. Not until one year later, in December 2007, and after first laying the foundations in the form of the Compute Abstraction Layer (CAL), did ATI announce Brook+ as its first high-level GPGPU computing language, based on the Brook language developed at Stanford University [BFH+ 04]. Meanwhile, NVIDIA was able to optimize both their front-end compiler, which translates CUDA C to the Parallel Thread eXecution (PTX) intermediate language, and the back-end compiler, which translates PTX assembly to the native GPU binary code. NVIDIA's head start in high-level language GPGPU compiler technology and the fact that it is harder for a compiler to produce optimized code for the VLIW-5 architecture than for NVIDIA's scalar architecture contribute to the GTX 260 taking the lead in the benchmarks.

The RabbitCT ranking provides a nice overview of how a given algorithm (filtered backprojection, in this case) performs on different hardware architectures. In the scope of this thesis, it serves as a tool to compare the performance of the emerging OpenCL technology against other well-established parallel computing tools and APIs.
The first interesting question is how the OpenCL-based implementation named CLeopatra, which was developed as part of this thesis, performs against its direct competitor from the CUDA community named SpeedyGonzales. As already raised in the results section, the RabbitCT framework does not provide any reference hardware to run the algorithm implementations on. This decision was made in order to encourage submissions that run on rather exotic hardware (like the Cell Broadband Engine or FPGAs), for which no common reference hardware could be provided by the University of Erlangen-Nuremberg, which is hosting the RabbitCT project. However, this has the disadvantage that two implementations from different submitters that are designed to run on the same hardware cannot easily be compared performance-wise if the submitters do not share the same hardware and keep their implementations' binaries and / or source code private. As such, it is difficult to compare the CLeopatra results from a GTX 260 to the SpeedyGonzales results from a Quadro FX


5800. But thanks to the overview of NVIDIA Quadro GPUs at Wikipedia6 one can learn
that this GPU also belongs to the GT200 generation of GPUs, and is in fact the professional
version of the GeForce GTX 285 with 240 shader processors (30 compute units), which in
turn is the larger sibling of the GeForce GTX 260 with 216 shader processors (27 compute
units). For reference, table 4.1 lists the Quadro and GeForce specifications side-by-side.

GPU
name
FX 5800
GTX 285
GTX 260

Core
clock [MHz]
650
648
576

Memory
clock [MHz]
1632
1242
999

Memory
size [MiB]
4096
1024
896

Memory
bandwidth [GiB/s]
102
159
112

Compute
units
240
240
216

Table 4.1: Hardware specifications for the GPUs used by SpeedyGonzales and CLeopatra.

The specifications of the FX 5800 surpass those of the GTX 260 in all areas by at
least 10%, except for memory bandwidth, where the GTX 260 is about 10% better. On the
other hand, the FX 5800's memory clock is about 60% faster. Given these numbers it is
reasonable to expect the FX 5800 to be faster, or at least not slower, than the GTX 260,
especially since the huge memory size allows for less swapping between device and host memory
and thus saves memory bandwidth, the FX 5800's weak point. However, to verify this, both
GPUs would need to be running the same code rather than different implementations, as
one implementation might simply be more efficient than the other, in which case the software
instead of the hardware would be responsible for the performance difference. Nevertheless, a
conservative estimation can be made: Assuming the FX 5800 to be not faster but of equal speed
as the GTX 260, CLeopatra would still be 1.7 times faster than SpeedyGonzales for problem size
256³, and 1.4 times faster for problem size 512³. A more accurate assessment of the FX 5800's
performance advantage can only increase these factors in favor of CLeopatra. The reason for
the drop in the speed factor for the larger problem size is the more complex memory management
the CLeopatra implementation has to do on the GTX 260. Although the 512 MiB required by
the output volume would theoretically fit into the GPU's 896 MiB of memory (even including
some projection images), the GTX 260's CL_DEVICE_MAX_MEM_ALLOC_SIZE is 217.344
MiB. That is why multiple buffers sized smaller than this limit need to be allocated to hold
the output volume's data. Having to write to multiple buffers is less efficient due to multiple
offset calculations and cache thrashing, which is why CLeopatra's performance gain is slightly
smaller for problem size 512³.
As already mentioned in the results section, CLeopatra currently is the only GPU implementation
in the RabbitCT ranking that is capable of reconstructing problem size 1024³.
To be able to do so, the memory management was extended to support not only output
volumes larger than CL_DEVICE_MAX_MEM_ALLOC_SIZE by using multiple buffers in parallel,
but also volumes larger than CL_DEVICE_GLOBAL_MEM_SIZE by using multiple passes. Even
with the additional memory management involved, which unfortunately is rather complex
due to limitations in the RabbitCT API, CLeopatra is more than ten times faster than the
second-placed TomatoSalad, which uses Intel's Threading Building Blocks (TBB) library
(http://threadingbuildingblocks.org/), multi-threading and SSE code on two Intel Xeon E5410
quad-core CPUs running at 2.33 GHz.
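To illustrate the kind of bookkeeping involved, the following is a minimal sketch (not the thesis code) of how the relevant device limits can be queried and how the number of buffers and passes for a given output volume could be derived; the function and variable names are placeholders only.

#include <stdio.h>
#include <CL/cl.h>

/* Sketch only: derive how many buffers (per pass) and how many passes a
 * reconstruction volume of volume_bytes would need on the given device. */
static void plan_volume_storage(cl_device_id device, cl_ulong volume_bytes)
{
    cl_ulong max_alloc = 0, global_mem = 0;

    /* Largest single buffer the implementation allows. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);

    /* Total global memory; volumes exceeding it require multiple passes. */
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);

    cl_ulong buffers = (volume_bytes + max_alloc - 1) / max_alloc;
    cl_ulong passes  = (volume_bytes + global_mem - 1) / global_mem;

    printf("%llu buffer(s) per pass, %llu pass(es)\n",
           (unsigned long long)buffers, (unsigned long long)passes);
}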
A big advantage of the OpenCL-based CLeopatra implementation is that it also runs without
modifications on CPUs. This makes it easy to see how it performs against native CPU
applications. Classifying the CPU reconstruction results in terms of the RabbitCT ranking would
place them somewhere in between the LolaTBB and LolaOMP implementations. While the
former exclusively uses Intel's TBB library, the latter makes sole use of OpenMP. Considering
that the Intel OpenCL implementation seemingly also uses TBB internally (as can
be guessed by looking at the files in the Intel OpenCL SDK installation directory) and that
the quality of the parallelization done by the OpenCL front-end compiler is probably in the
same ballpark as that of an OpenMP compiler, the result is not very surprising and actually
makes sense. All better-performing CPU implementations in the RabbitCT ranking contain
hand-tuned SSE and / or multi-threading code, which makes it unlikely for the completely
compiler-generated OpenCL code to perform equally well.

The results from running the reconstruction on the Opteron server are a little disappointing.
Although it has six times more compute units than the Core i7, the Opteron is at
best 4.35 times faster. This is particularly interesting because, since AMD's acquisition of ATI,
one could expect ATI Stream to be optimized for AMD CPUs, if any. Moreover, not all of the
Core i7's eight compute units are real hardware cores; there are only four cores, each with HTT.
But to be fair, the clock frequency needs to be taken into account: The Core i7 is
running at a 1.2 times higher frequency. Assuming a comparable instruction throughput per
clock cycle, one could expect the Opteron with its six times more compute units to be
theoretically five times faster. This is much closer to the observed speedup, but still, in
a compute-unit-to-compute-unit comparison, the Core i7 outperforms the Opteron.
The reason probably lies not only in raw compute power differences, but rather
in the different memory architectures. While the Core i7 can make use of local high-performance
DDR3 RAM, the Opteron with Node Interleaving enabled has to go through the bandwidth-limiting
interconnect to access remote memory attached to neighboring CPUs. As on
CPU platforms both the input and output data reside in regular host memory, the reconstruction
process is very I/O intensive, making the CPU interconnect on the NUMA architecture
a limiting factor.

Regarding the kernel compile times, the speedup with hot caches is very likely due to a
reduced overhead for setting up the compiler when compiling the same kernel multiple
times (or several different kernels), as the compiler is already loaded into memory. This is
underpinned by an API function in the OpenCL reference called clUnloadCompiler(), which
allows the implementation to release the resources allocated by the OpenCL compiler. Its
documentation says that a call to clBuildProgram() will reload the compiler if necessary,
and it seems obvious that implementations do not implicitly unload the compiler directly
after a single compile pass.
The occasionally observed long kernel compile times actually are a non-issue, as the OpenCL API
provides the means to retrieve the compiled program binary by calling clGetProgramInfo()
with CL_PROGRAM_BINARIES and to load it again later via clCreateProgramWithBinary().
However, as the program binary is not only specific to the device but also to the OpenCL
implementation and driver version, applications should not ship pre-compiled OpenCL program
binaries, but compile them once when the application is run initially and store them
for later reuse.
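A minimal sketch of such a caching scheme, assuming a single device; save_blob() is a hypothetical file helper and not part of the thesis code:

#include <stdlib.h>
#include <CL/cl.h>

/* Hypothetical helper that writes a blob to disk; not part of the thesis code. */
void save_blob(const char *path, const unsigned char *data, size_t size);

static cl_program cache_and_reload(cl_context context, cl_device_id device,
                                   cl_program program)
{
    /* Retrieve the binary of the already built program (single device assumed). */
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);

    unsigned char *binary = (unsigned char *)malloc(size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);
    save_blob("kernels.bin", binary, size);

    /* On a later run, with the same device, driver and OpenCL implementation: */
    cl_int binary_status, error;
    cl_program cached = clCreateProgramWithBinary(context, 1, &device, &size,
                                                  (const unsigned char **)&binary,
                                                  &binary_status, &error);
    clBuildProgram(cached, 1, &device, NULL, NULL, NULL);
    free(binary);
    return cached;
}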
Note that the term binary might be a bit misleading in this context, as the returned
bytes may actually not contain the binary representation of an executable program, but
human-readable source code in a device-specific low-level language. As an example, NVIDIA's
OpenCL implementation returns PTX assembly source code, whereas ATI's implementation
returns an Executable and Linkable Format (ELF) binary and Intel's implementation a custom
Computing Language Program Container (CLPC) binary format.

Some surprises surfaced during the rather peripheral test of replacing the native image
support provided by the OpenCL implementations with custom code for bilinear or nearest-neighbor
interpolation. While it is expected that using built-in nearest-neighbor interpolation
on GPUs results in no real performance gain, as bilinear interpolation is available in hardware
(see again [MXN07]), it is surprising that on the HD 5870 custom nearest-neighbor
interpolation implemented in software is faster than the built-in nearest-neighbor interpolation.
It is obvious that the benefit of the 2D texture cache is not as big for nearest-neighbor
interpolation as for bilinear interpolation, because no neighboring pixels need to be accessed
to fetch a single texture value, but the question remains how the custom nearest-neighbor
interpolation code can be faster.
One possibility is that the 1D caching of buffer objects, which are used instead of textures in
case of custom interpolation, is actually more efficient for nearest-neighbor interpolation and
the chosen work-size configuration of (16,4,4). This configuration favors the X dimension
over the other dimensions, meaning that most projected voxels of a work-group hit the virtual
detector in a small number of long rows. So having a 1D cache which caches more values
in the X dimension could be beneficial compared to a 2D cache with fewer cached values in
the X dimension due to also caching values in the Y dimension.
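For illustration, this is how a fixed local work-size of (16,4,4) would be passed to a backprojection kernel launch; the variable names are placeholders and not the thesis code:

/* global_size must be a multiple of local_size in each dimension (OpenCL 1.1). */
size_t global_size[3] = { 512, 512, 512 };   /* output volume dimensions */
size_t local_size[3]  = { 16, 4, 4 };        /* favors the X dimension   */

clEnqueueNDRangeKernel(queue, backprojection_kernel, 3, NULL,
                       global_size, local_size, 0, NULL, NULL);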
Another possible explanation is that the custom nearest-neighbor interpolation performs
better than the built-in one due to optimizations it can apply because it knows the data
being worked on, whereas the built-in interpolation needs to be generic, e.g. regarding
the data type, image dimensions, and border color. For example, although the current
OpenCL 1.1 specification does not define a way to set a border color, it does define the
concept of a border color for image objects. It is quite possible that, in preparation for
future extensions to the specification, the ATI OpenCL implementation is already generic
enough to in fact handle any custom border color. As such, the custom nearest-neighbor
interpolation might be more efficient because it assumes a fixed float value of 0.0 when
sampling the virtual detector outside the projection image bounds.
The large performance gain, an averaged factor of two, when using nearest-neighbor
interpolation on the CPU devices can be explained in a similar way. Their global memory,
that is, regular host memory, only provides caching for addresses that lie linearly in memory,
typically based on cache lines with a size of something like 64 bytes in the case of the Core i7.
Nearest-neighbor interpolation avoids access to pixels in adjacent rows, which are likely to
start at an offset from the current pixel that exceeds the cache line size, so accessing them
will probably thrash the cache and cause a major drop in performance. Another increase in
performance of at least a factor of two can be achieved by using the custom interpolation
code. Again, the a priori knowledge about the data probably is the key to the performance
gain of the custom interpolation code. A generic implementation of the read_imagef()
API call has to deal with all kinds of sampler settings like (non-)normalized coordinates,
as well as different addressing and interpolation modes. Of course, each combination of
sampler settings could be handled by an optimized code path, but this is rather unlikely for
an alpha software release like the Intel OpenCL SDK. Finally, even for scalar float images,
read_imagef() is specified to return a float4 vector, which requires the implementation
to splat the scalar value to a vector. On the other hand, the custom code knows that it
will always deal with float data, normalized coordinates, address clamping with a border
color of 0.0, and a compile-time determined interpolation mode. All this eliminates a lot of
conditional code compared to more generic implementations, which is sufficient to explain
the observed performance gain.
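To make the comparison more concrete, the following is a minimal sketch (not the thesis kernel) of custom bilinear sampling from a plain buffer object with a fixed border value of 0.0f; unnormalized coordinates are used for brevity:

/* Sketch only: clamp-to-border sampling with a hard-coded border value of 0.0f. */
float fetch(__global const float *img, int w, int h, int x, int y)
{
    return (x < 0 || y < 0 || x >= w || y >= h) ? 0.0f : img[y * w + x];
}

float sample_bilinear(__global const float *img, int w, int h, float2 pos)
{
    float2 p = pos - 0.5f;               /* move to texel centers            */
    int2   i = convert_int2(floor(p));   /* top-left neighbor                */
    float2 f = p - convert_float2(i);    /* fractional parts = blend weights */

    float v00 = fetch(img, w, h, i.x,     i.y);
    float v10 = fetch(img, w, h, i.x + 1, i.y);
    float v01 = fetch(img, w, h, i.x,     i.y + 1);
    float v11 = fetch(img, w, h, i.x + 1, i.y + 1);

    return mix(mix(v00, v10, f.x), mix(v01, v11, f.x), f.y);
}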
Finally, a look at the image quality metrics provided by the RabbitCT framework revealed
another advantage of using the custom bilinear image interpolation code: By enabling that
code path it was possible to drastically increase the image quality to a Mean Square Error of as
little as 0.001 HU². This works for both GPU and CPU OpenCL devices, and puts them on
the same image quality level as the native CPU solutions from the RabbitCT ranking.

4.3 Image Quality Considerations


While the GPU solutions, no matter whether using CLeopatra or SpeedyGonzales, clearly outperform
the native CPU solutions, the latter achieve better image quality in the RabbitCT ranking.
However, when looking at the error histogram from CLeopatra's RabbitCT algorithm details
page, as shown in sub-figure 4.7(a), one can see that over 90% of the errors are off
by at most one HU relative to the output volume generated by the LolaBunny reference
implementation running on a CPU. Furthermore, by far the most errors occur at the border
of the area of interest, as can be seen from the spatial error distribution visualization in
sub-figure 4.7(b). Only a relatively small number of likewise small errors is scattered across the
volume. The roof-like pattern in the spatial error distribution occurs where the backprojection
frustum planes intersect the output volume bounding box. It seems that especially when sampling
the detector at its borders, GPUs and CPUs come to slightly different results.

Figure 4.7: Errors relative to the reference implementation. (a) Error value histogram. (b) Spatial error distribution.

Besides the occurring errors being negligible due to their location in the output volume,
the image interpolation results have also shown that errors can be vastly reduced by using
custom bilinear filtering. This leads to the assumption that built-in bilinear interpolation
on GPUs does not consistently use the full precision of 32-bit floating point numbers, but
partly uses lower precision for performance reasons. This is supported by [VMR07], which has
experimentally shown that older GPUs like the NVIDIA 8800 GTX use 8-bit blending weights
even when interpolating 32-bit values. As errors due to interpolation at reduced precision become
most apparent for values that differ largely, they are most visible when sampling border
pixels, which are always 0. Also, it is interesting to note that even when running CLeopatra on
the CPU, the image quality is worse than that of the native CPU implementations. This
is probably because the OpenCL CPU implementations are designed to deliver results that
resemble those of the GPU implementations as closely as possible, and not results of the
highest possible precision.

4.4 Vendor-Specific Issues


Due to OpenCL being an emerging technology, only a few vendors provide implementations
to the public yet. Some implementations only have the quality of technology previews
and contain severe bugs, while others mainly suffer from rather trivial issues that can easily
be worked around. The following gives an overview of vendor-specific issues that were
discovered during the implementation phase of the thesis and reported back to the respective
vendors. As most vendors do not announce any estimated date for updates to their OpenCL
implementations, it is quite possible that these issues are still current at the time of
reading, making this a valuable guide for OpenCL developers on how to avoid some pitfalls.

4.4.1 ATI Stream


To start with the positive aspects, both the CPU and GPU OpenCL implementations provided
by ATI Stream have proven to be the most stable ones. Only two minor issues were discovered
during this thesis: For one thing, the number of characters in the build log returned by a call
to clGetProgramBuildInfo() with CL_PROGRAM_BUILD_LOG is too high, as it assumes the
log to contain DOS-style line-endings consisting of two characters (0x0D, 0x0A), whereas
it always contains Unix-style line-endings comprised of only one character (0x0A). As a
result, an allocated buffer is too large and will most likely contain garbage characters at
the end if not cleared previously. The second discovered issue is probably rather a missing
feature than a bug: On the HD 5870, only 512 MiB of the total 1024 MiB of memory
are exposed via OpenCL, so only half the physically available memory is usable, which is
a major inconvenience as it requires implementing otherwise unnecessary work-arounds to
handle large data.
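A sketch of the corresponding build-log workaround is to query the (over-estimated) log size first and to zero-initialize the buffer before retrieving the log; program and device are assumed to be valid handles:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static void print_build_log(cl_program program, cl_device_id device)
{
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);

    char *log = (char *)calloc(log_size + 1, 1);   /* calloc() zero-fills the buffer */
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("%s\n", log);
    free(log);
}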
The downside of the ATI implementation is the lack of features that could potentially
increase performance, the most prominent example being the lack of page-locked memory
support for Direct Memory Access (DMA) transfers from the host to the device and vice
versa. This would allow hiding latencies by overlapping data copies with kernel computation,
resulting in a notable increase in performance.

A more critical issue surfaces when frequently changing GPUs in a PC, for example when
doing benchmarks. In that case it is convenient not to uninstall the ATI GPU driver or
Stream SDK. Leaving the ATI Stream SDK installed makes sense anyway, as it also provides
an OpenCL CPU device in addition to the GPU device. Unfortunately, now using another
GPU like the NVIDIA GTX 260 in the same system leads to a crash when enumerating the
available OpenCL platforms via clGetPlatformIDs(). As uninstalling the ATI GPU driver
fixes the problem, it seems likely that the ATI Stream runtime assumes there to be an ATI
GPU in the system just because the driver files are still present, and crashes when it fails to
initialize the missing hardware.
Alternatively, the ATI OpenCL ICDs can be disabled by removing the corresponding entries
from the system registry using the script in listing 4.3. Of course, this will also get rid of
the CPU OpenCL device provided by ATI Stream. To add the ATI OpenCL ICDs again, the
script shown in listing 4.4 can be used.


Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"atiocl.dll"=-
"atiocl64.dll"=-

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Khronos\OpenCL\Vendors]
"atiocl.dll"=-
"atiocl64.dll"=-

Listing 4.3: Registry script to remove the ATI OpenCL ICDs.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors]
"atiocl.dll"=dword:00000000
"atiocl64.dll"=dword:00000000

[HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Khronos\OpenCL\Vendors]
"atiocl.dll"=dword:00000000
"atiocl64.dll"=dword:00000000

Listing 4.4: Registry script to add the ATI OpenCL ICDs.

4.4.2 NVIDIA OpenCL


During the thesis, NVIDIA's OpenCL implementation has proven to be the one providing
the best performance (probably due to NVIDIA's experience in GPGPU computing with the
CUDA API), but also the one with the most issues. In addition, the latest official driver still
only supports OpenCL 1.0, while all other benchmarked implementations by other vendors
already support OpenCL 1.1.

const sampler_t samp = CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void copy_image(image2d_t input, image2d_t output)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 pixel = read_imagef(input, samp, pos);
    write_imagef(output, pos, pixel);
}

Listing 4.5: Missing image access qualifiers generate invalid PTX assembly.


As an example, consider the OpenCL kernel code in listing 4.5. It implements a simple
kernel which copies the input image to the output image. In OpenCL, image objects implicitly
point to the __global address space and are __read_only by default. Since the image
access qualifiers are missing in the example, output is read-only and cannot be written
to. But instead of providing a compile-time error message to the user about trying to call
write_imagef() on a read-only image, the NVIDIA compiler front-end generates invalid
PTX assembly in this case, resulting in an obscure message which says

    ptxas application ptx input, line 14; fatal : Parsing error near ',': syntax error
    ptxas fatal : Ptx assembly aborted due to errors

Indeed, when looking at the PTX assembly returned by a call to clGetProgramInfo()
using CL_PROGRAM_BINARIES, one will see code similar to the one shown in listing 4.6.

    .entry copy_image (

Listing 4.6: Invalid PTX assembly as generated by the NVIDIA compiler.

So the error message is right about invalid PTX code being generated. The code can easily
be fixed by adding the required __write_only qualifier to output. Although it is correct
that the invalid code does not compile, the error message is not useful at all and
hints towards an internal compiler issue.
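With the access qualifiers spelled out, the kernel signature from listing 4.5 would read:

__kernel void copy_image(__read_only image2d_t input, __write_only image2d_t output)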
The above example also features a more subtle issue that affects portability. On the one
hand, the OpenCL specification states that all program scope variables must be declared in
the __constant address space. However, here samp is declared at program scope without
any address space qualifier. Note that const is not an address space qualifier, but a type
qualifier as defined by the C99 language specification. This declaration currently compiles
fine on the NVIDIA implementation, but ATI Stream SDK versions prior to 2.3 show an error
message. On the other hand, the OpenCL specification contradicts itself by containing an
explicit example that declares a sampler at program scope as shown in listing 4.7.

const sampler_t <sampler name> = <value>

Listing 4.7: Sampler declaration example from the OpenCL specification.

As a result, the current ATI implementation was changed to issue just a warning if const
is used instead of __constant. There currently is an ongoing discussion within the OpenCL
working group at Khronos on how to resolve this issue.

OpenCL image support on NVIDIA also seems to have an issue when querying the device
using clGetDeviceInfo() with CL_DEVICE_IMAGE2D_MAX_WIDTH for its maximum image
width. According to the OpenCL 1.0 specification, this value must be at least 8192. However,
the returned value is 4096. NVIDIA reports this to be a mix-up in the specification, as
the actual limits for OpenCL 1.0 were reduced to 4096 instead of 8192, but so far no
updated revision of the specification has been released which addresses this issue.
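The query in question looks like the following sketch, with device being an already obtained cl_device_id:

size_t max_width = 0;
clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH,
                sizeof(max_width), &max_width, NULL);
/* yields 4096 on the tested NVIDIA driver, although OpenCL 1.0 seems to mandate at least 8192 */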

Another obscure and hard-to-reproduce error only occurs for source code files that contain
multiple kernels. Under certain circumstances, the kernel code will fail to compile when
targeting the WIN64 platform, although the exact same host and kernel source code compiles
successfully for the WIN32 platform. Remarkably, removing some seemingly arbitrary lines of
code will make the code compile for WIN64 as well. This issue is likely to be fixed
with the 270.22 driver release, but it makes dramatically clear how fragile current OpenCL
implementations can be.

4.4.3 Intel OpenCL


The Intel OpenCL SDK includes an implicit CPU vectorization module as part of the program
build process. The current version of the vectorization module is sensitive to complex control
flow inside kernels, so most if-statements will prevent the compiler from performing any
vectorization. However, if the compiler is able to vectorize the kernel, it will most likely
also try to perform loop-unrolling on the work-group level for lightweight kernels. Consider
the code shown in listing 4.8:

__kernel void copy_data(__global int *input, __global int *output)
{
    int pos = get_global_id(0);
    output[pos] = input[pos];
}

__kernel void copy_data_unrolled(__global int *input, __global int *output)
{
    int pos = get_global_id(0) * 4;
    output[pos + 0] = input[pos + 0];
    output[pos + 1] = input[pos + 1];
    output[pos + 2] = input[pos + 2];
    output[pos + 3] = input[pos + 3];
}

Listing 4.8: Work-group level unrolling returns wrong results for changing input data.

Launching kernel copy_data() with a 1D work-size of, say, 1024 is equivalent to launching
the compiler-generated kernel copy_data_unrolled() with a 1D work-size of 256. The Intel
OpenCL runtime keeps both kernels and dynamically dispatches to the unrolled version if
the local X work-size as passed to clEnqueueNDRangeKernel() is a multiple of 4. While
this sounds good in theory, in practice the auto-vectorization and unrolling does not always
generate correct code. For the backprojector implemented as part of this thesis, the output
is wrong if the compiler is able to perform auto-vectorization. As a work-around, the kernel
declaration can be accompanied by a hint to the compiler as shown in listing 4.9.
__kernel __attribute__((vec_type_hint(float4)))

Listing 4.9: Hinting the compiler about the basic computation width.

This signals the compiler that the kernel declaration to follow is already written in an
optimized vectorized form, so the implicit CPU vectorization module will not operate on it.
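Applied to a kernel declaration, the hint could look like this sketch; the kernel name and parameters are placeholders only:

__kernel __attribute__((vec_type_hint(float4)))
void backproject(__global float4 *volume, __read_only image2d_t projection)
{
    /* kernel body already written in explicit float4 form */
}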

But even with auto-vectorization disabled, the Intel compiler currently might generate invalid
code. Some tests explicitly used the mad() built-in on scalar floats, so expressions like
D = A * B + C were manually rewritten as D = mad(A, B, C); the latter always returned 0.
Again using the Intel OpenCL Offline Compiler, further investigations
showed that when compiling the kernel
__kernel void float_mad(int2 in_res, float2 pos)
{
    int2 coord = convert_int2(pos);
    float2 t = pos - convert_float2(coord), s = 1.0f - t;

    volatile float value = t.x * t.y + s.x;
}

the generated assembly looks correct:

_float_mad:                             # @float_mad
# BB#0:
        sub        RSP, 36
        movq       XMM0, QWORD PTR [RSP + 84]
        cvttps2dq  XMM1, XMM0
        cvtdq2ps   XMM1, XMM1
        subps      XMM0, XMM1
        movss      XMM1, DWORD PTR [RIP + LCPI3_0]
        subss      XMM1, XMM0
        pshufd     XMM2, XMM0, 1
        mulss      XMM2, XMM0
        addss      XMM2, XMM1
        movss      DWORD PTR [RSP + 32], XMM2
        add        RSP, 36
        ret

but when compiling the semantically equivalent code


__kernel void float_mad(int2 in_res, float2 pos)
{
    int2 coord = convert_int2(pos);
    float2 t = pos - convert_float2(coord), s = 1.0f - t;

    volatile float value = mad(t.x, t.y, s.x);
}

the assembly looks like


_float_mad:                             # @float_mad
# BB#0:
        sub   RSP, 36
        mov   DWORD PTR [RSP + 32], 0
        add   RSP, 36
        ret

so a constant value of 0 is returned. Interestingly, passing the -cl-mad-enable option
to the compiler does not change anything; in particular, it does not make the first example
also return 0.


5 Conclusion
The thesis has outlined the successful implementation of a high-performance tomographic
backprojection library using the OpenCL API and its integration into the Amira visualization
system. The implementation was submitted to the RabbitCT benchmarking project, where it
currently outperforms any other implementation in both absolute and relative performance.
Working with different OpenCL implementations has shown the API's great capabilities
in terms of ease of use, portability, scalability and performance, but has also uncovered several
severe bugs in the current vendors' implementations. Note, however, that these have mainly
been development-time issues. If an OpenCL kernel compiles and runs correctly, it usually
runs stably and fast. Since all issues that have surfaced during this thesis have been reported
to the respective vendors, chances are good that the upcoming versions will have most of
them fixed. For example, the anticipated NVIDIA CUDA Toolkit 4.0 and NVIDIA graphics
driver 270.22 releases, which include the OpenCL runtime, will have all the reported issues
fixed.
If the existing issues can be worked around, OpenCL has proven to be a stable and performant
alternative to existing GPGPU languages like CUDA C or parallel programming
approaches like OpenMP or TBB in the scope of this thesis. Despite previous results like
those in [ZZS+09], the presented OpenCL implementation is not slower than a similar
implementation in CUDA C. This has two major reasons: First, more time was spent on
OpenCL-specific performance tuning (like using the native_*() functions); second, the
OpenCL compilers have matured and produce better code. [KSAK10] also emphasizes the
OpenCL compiler's importance in closing the semantic gap between the language and the
compute devices, and concludes that with a combination of manually optimizing the
intermediate PTX code generated by the OpenCL compiler and enabling the compiler's automatic
optimizations, performance equal to CUDA C implementations can be achieved.
However, the described manual optimizing primarily addresses only two issues (loop-invariant
code motion and use of the rsqrt instruction) which are both not applicable to the implementation
presented in this thesis; therefore, no manual optimizing of intermediate code is
required in this case to achieve CUDA-like performance. The current NVIDIA OpenCL
compiler's weakness at loop optimizations is also a reason why [KDH10]
sees OpenCL being slower than CUDA by up to 67%.
Instead of comparing OpenCL to CUDA C, [WZJZ10] makes performance comparisons to a
conventional CPU implementation written in the C language. Depending on the problem size,
their implementation of a full backprojection pipeline including pre-weighting and filtering
is between 40 and 60 times faster than the CPU implementation. To be able to compare
their results to those of this thesis, only the core timings for the grid mapping and
accumulation steps are summed up for problem size 256 × 256 × 128, multiplied by two to
match problem size 256³, and divided by the number of projection images in order to get
an estimated average backprojection time for a single projection image. Doing so yields a
time of (10199 ms + 2636 ms) · 2 / 320 ≈ 80.2 ms, which is 16.4 times slower than the
backprojection achieved in this thesis.
Despite the results of [LKC+10], in the scope of this thesis the GPUs performed much
better than the CPU. Although the initial factor of 60-120 (depending on the problem size)
can be brought down to 25-50 by using custom bilinear interpolation, this does not put
GPUs and CPUs roughly in the same performance ballpark for throughput computing.
When comparing the GPUs among themselves, the more recent HD 5870 did not perform
as well as the older GTX 260. It was argued that this is most likely due to ATI
Stream's compiler currently not doing a very good job at optimizing for the complex VLIW-5
architecture, and due to NVIDIA's more mature GPGPU tool chain.
The presented algorithms to determine an optimal work-size configuration for a given
OpenCL device and problem size turned out to be very handy tools to quickly narrow down
the number of performance tuning parameters. By changing the metric which ranks the ratios
of the work-group dimensions, the algorithms are also applicable to algorithms that are not
backprojection-like. In conjunction with some auto-tuning heuristics as discussed in [DWL+10],
the means are given to quickly find the parameters for high-performance OpenCL
implementations, which according to [KSAK10] is essential to enable a single OpenCL code to run
efficiently on various GPUs.

5.1 Future Work


Although the RabbitCT framework provides pre-filtered data for benchmarking, in real use
cases this often is not the case. As the projection image data resides in GPU memory
anyway, it seems natural to also perform the filtering on the GPU. The required convolution
in the spatial domain is usually performed as a multiplication in the frequency domain for
efficiency reasons. As such, it is necessary to implement both forward and inverse Discrete
Fourier Transforms (DFT) on the GPU. A future extension to the backprojector should
evaluate the native OpenCL_FFT implementation provided by Apple, Inc.
(http://developer.apple.com/library/mac/#samplecode/OpenCL_FFT/), which makes use of
several techniques as proposed in [VK08, GLD+08]. Doing the filtering as part of the
backprojection implementation also paves the way for using custom filters to improve image
quality or to emphasize certain features in the reconstructed data. In addition to filtering
the projection images via OpenCL, filtering the output volume in a post-processing step as
suggested in [Waa09] is a tempting extension. When it comes to filtering in terms of projection
image interpolation, it was shown that a custom implementation of bilinear interpolation is
actually faster than the built-in interpolation on CPU devices. With this in mind, if implementing
custom interpolation anyway, higher-quality sinc interpolation as proposed in [ZXM10] could
be used at the same time.

Some work should also be put into improving the RabbitCT framework itself, as the current
API is too limiting to implement optimization techniques or memory management features
that require a priori knowledge about the geometry or data of all projection images. For
example, the current API design enforces to first backproject a projection image into the
entire volume before advancing to the next projection image. Once the total number of
projection images has been iterated, there is no way an implementation could refer to the
first projection again in order to run a second pass. Likewise, only complete projection
images are passed to the implementation; there is no way to request only several
projection image rows to save memory bandwidth. This makes it impossible to implement a
more cache-friendly and memory-saving backprojection scheme which reconstructs the output
volume slice-by-slice by uploading to the device only those rows of all projection images
which the current slice projects onto.

Regarding the recent announcements of new GPU hardware, two architectures in particular
are interesting for future evaluation: NVIDIA's GF100 (aka Fermi) GPU generation is the
first to provide full IEEE 754 floating-point support as well as double-precision arithmetic at
an acceptable performance. To see whether this has a positive effect on the image quality of
the reconstructed volume, and in particular whether the quality of built-in bilinear interpolation
improves when compared to custom bilinear interpolation, it would be beneficial to run
benchmarks on this architecture.
To verify whether the observed under-utilization of the HD 5870 is really caused by the
complexity of the VLIW-5 architecture, taking a closer look at the recently announced ATI
Radeon HD 6970 from the Cayman series is a good idea. It introduces the VLIW-4 architecture,
which several sources indicate to be much better suited for GPGPU applications:
http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/4
http://www.realworldtech.com/page.cfm?ArticleID=RWT121410213827&p=11
http://www.tomshardware.com/reviews/radeon-hd-6970-radeon-hd-6950-cayman,2818-2.html


Bibliography

[Adv10] Advanced Micro Devices Inc. Programming Guide for ATI Stream Computing with OpenCL. http://developer.amd.com/gpu/amdappsdk/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf, 2010.

[BF10] Jens Breitbart and Claudia Fohry. OpenCL - An effective programming model for data parallel computations at the Cell Broadband Engine. 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1-8, 2010.

[BFH+04] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In ACM SIGGRAPH 2004 Papers, pages 777-786. ACM, 2004.

[CCF94] Brian Cabral, Nancy Cam, and Jim Foran. Accelerated volume rendering and tomographic reconstruction using texture mapping hardware. In Proceedings of the 1994 Symposium on Volume Visualization, pages 91-98. ACM, 1994.

[CH10] Slo-Li Chu and Chih-Chieh Hsiao. OpenCL: Make Ubiquitous Supercomputing Possible. In International Conference on High Performance Computing and Communications, pages 556-561. IEEE, September 2010. ISBN 978-1-4244-8335-8.

[Cor63] Allan M. Cormack. Representation of a Function by Its Line Integrals, with Some Radiological Applications. Journal of Applied Physics, 34(9):2722-2727, 1963.

[DWL+10] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, and Jack Dongarra. From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform. Parallel Computing, 2010.

[FDK84] L. A. Feldkamp, L. C. Davis, and J. W. Kress. Practical cone-beam algorithm. Journal of the Optical Society of America, 1(6):612-619, 1984.

[GBB+07] Iain Goddard, Ari Berman, Olivier Bockenbach, Frank Lauginiger, Sebastian Schuberth, and Scott Thieret. Evolution of Computer Technology for Fast Cone Beam Backprojection. SPIE Symposium on Electronic Imaging, 2007.

[GLD+08] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High Performance Discrete Fourier Transforms on Graphics Processors. Proc. of ACM/IEEE SuperComputing, 2008.

[GM07] Naga K. Govindaraju and Dinesh Manocha. Cache-efficient numerical algorithms using graphics hardware. Parallel Computing, October 2007.

[HN77] Gabor T. Herman and Abraham Naparstek. Fast Image Reconstruction Based on a Radon Inversion Formula Appropriate for Rapidly Collected Data. SIAM Journal on Applied Mathematics, 33(3):511-533, 1977.

[Hor79] Berthold K. P. Horn. Fan-Beam Reconstruction Methods. Proc. IEEE, 67(12):1616-1623, 1979.

[Hou73] Godfrey N. Hounsfield. Computerized transverse axial scanning (tomography): Part I. Description of system. British Journal of Radiology, (46):1016-1022, 1973.

[Int10] Intel Corporation. Writing Optimal OpenCL Code with the Intel OpenCL SDK. http://software.intel.com/file/33860, 2010.

[KDH10] Kamran Karimi, Neil G. Dickson, and Firas Hamze. A Performance Comparison of CUDA and OpenCL, 2010.

[KSAK10] Kazuhiko Komatsu, Katsuto Sato, Yusuke Arai, and Kentaro Koyama. Evaluating Performance and Portability of OpenCL Programs. The Fifth International Workshop on Automatic Performance Tuning, 2010.

[LKC+10] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. Proceedings of the 37th Annual International Symposium on Computer Architecture, 38(3):451-460, 2010.

[Mü04] Klaus Müller. Ultra-fast 3D filtered backprojection on commodity graphics hardware. 2004 2nd IEEE International Symposium on Biomedical Imaging: Macro to Nano (IEEE Cat No. 04EX821), pages 571-574, 2004.

[Mun10] Aaftab Munshi. OpenCL 1.1 Specification. http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf, 2010.

[MXN07] Klaus Müller, Fang Xu, and Neophytos Neophytou. Why do commodity graphics hardware boards (GPUs) work so well for acceleration of computed tomography? Proceedings of SPIE, 2007.

[NVI10a] NVIDIA Corporation. NVIDIA CUDA Reference Manual. http://developer.nvidia.com/object/gpucomputing.html, 2010.

[NVI10b] NVIDIA Corporation. OpenCL Best Practices Guide. http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/OpenCL_Best_Practices_Guide.pdf, 2010.

[NWH+08] Peter B. Noël, Alan M. Walczak, Kenneth R. Hoffmann, Jinhui Xu, Jason J. Corso, and Sebastian Schafer. Clinical evaluation of GPU-based cone beam computed tomography. In Proc. of High-Performance Medical Image Computing and Computer-Aided Intervention (HP-MICCAI), 2008.

[OIH10] Yusuke Okitsu, Fumihiko Ino, and Kenichi Hagihara. High-performance cone beam reconstruction using CUDA compatible GPUs. Parallel Computing, 36(2-3):129-141, 2010.

[PRH10] Budirijanto Purnomo, Norman Rubin, and Michael Houston. ATI Stream Profiler. ACM Press, New York, New York, USA, July 2010. ISBN 9781450303934. 1 pp.

[Qui06] Eric Todd Quinto. An introduction to X-ray tomography and Radon transforms. Proceedings of Symposia in Applied Mathematics, 63:1, 2006.

[Rad17] Johann Radon. Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physische Klasse, 69:262-277, 1917.

[RKHH09] Christopher Rohkohl, Benjamin Keck, Hannes G. Hofmann, and Joachim Hornegger. RabbitCT - An open platform for benchmarking 3D cone-beam reconstruction algorithms. Medical Physics, 36(9):3940-3944, 2009.

[SBMW07] Thomas Schiwietz, Supratik Bose, Jonathan Maltz, and Rüdiger Westermann. A fast and high-quality cone beam reconstruction pipeline using the GPU. Proceedings of SPIE, pages 65105H-65105H-12, 2007.

[SWH05] Detlev Stalling, Malte Westerhoff, and Hans-Christian Hege. Amira: A Highly Interactive System for Visual Data Analysis, chapter 38, pages 749-767. Elsevier, 2005. ISBN 978-0-12-387582-2.

[Tur01] Henrik Turbell. Cone-Beam Reconstruction Using Filtered Backprojection. PhD thesis, University of Linköping, 2001.

[Tuy83] Heang K. Tuy. An inversion formula for cone-beam reconstruction. SIAM Journal on Applied Mathematics, 43(3):546-552, 1983.

[VK08] Vasily Volkov and Brian Kazian. Fitting FFT onto the G80 architecture. University of California, Berkeley, 2008.

[VMR07] Michael S. Vaz, Matthew McLin, and Alan Ricker. White Paper: Current and next-generation GPUs for accelerating CT reconstruction: quality, performance, and tuning. 2007.

[Vol10] Vasily Volkov. Better Performance at Lower Occupancy. http://www.cs.berkeley.edu/%7Evolkov/volkov10-GTC.pdf, 2010.

[Waa09] Jonas Waage. Accelerated Filtering using OpenCL. In Ivan Viola and Helwig Hauser, editors, Seminar in Visualization, 2009.

[WZJZ10] Bo Wang, Lei Zhu, Kebin Jia, and Jie Zheng. Accelerated Cone Beam CT Reconstruction Based on OpenCL. International Conference on Image Analysis and Signal Processing (IASP), 2010.

[YLKK07] Haiquan Yang, Meihua Li, Kazuhito Koizumi, and Hiroyuki Kudo. Accelerating Backprojections via CUDA Architecture. In Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, pages 52-55, 2007.

[YXKZ10] Yi Yang, Ping Xiang, Jingfei Kong, and Huiyang Zhou. A GPGPU Compiler for Memory Optimization and Parallelism Management. PLDI, 2010.

[ZXM10] Ziyi Zheng, Wei Xu, and Klaus Mueller. VDVR: Verifiable visualization of projection-based data. IEEE Transactions on Visualization and Computer Graphics, 16(6):1515-1524, 2010.

[ZZS+09] Wenyu Zhang, Li Zhang, Shangmin Sun, Yuxiang Xing, Yajie Wang, and Juan Zheng. A preliminary study of OpenCL for accelerating CT reconstruction and image recognition. 2009 IEEE Nuclear Science Symposium Conference Record (NSS/MIC), pages 4059-4063, 2009.
