
A Programming Model for GPU-based Parallel Computing with Scalability and Abstraction

Balázs Domonkos
Budapest University of Technology and Economics, Hungary
Mediso Medical Imaging Systems, Hungary
domonkos@iit.bme.hu

Gábor Jakab
Mediso Medical Imaging Systems, Hungary
gjakab@mediso.hu

Abstract
In this paper, we present a multi-level programming model for recent GPU-based high performance computing systems. Involving cooperative stream threads and symmetric multiprocessing threads, our model gives a computational framework that scales through multi-GPU environments to GPU-cluster systems. Instead of hiding the execution environment from the programmer using compiler extensions or metaprogramming techniques, we aim at a solution that both enables optimizations and provides abstract problem space mapping with code reusability and virtualization of hardware resources in order to decrease the programming effort. We evaluate an implementation of our model based on CUDA, OpenMP, and MPI2 technologies on a complex practical application scenario and discuss its performance scaling behavior.

CR Categories: I.3.1 [Computer Graphics]: Hardware Architecture - Parallel processing; I.3.2 [Computer Graphics]: Graphics Systems - Distributed/network graphics

Keywords: GPU-based computing, multi-GPU systems, GPU-based compute clusters

1 Introduction and Motivation

Although semiconductor capability increases at about the same rate for both CPUs and GPUs, driven by advances in fabrication technology, the computational performance of GPUs increases more rapidly than that of CPUs. With many transistors dedicated to instruction-level parallelism using branch prediction and out-of-order execution, CPUs are optimized for high performance sequential code execution. In contrast, graphics computations with their data-parallel characteristics result in higher arithmetic intensity for GPUs with the same transistor count. GPUs have undergone several stages of evolution in the last decade, including increasing raw speed, precision, and programmability. From highly graphics-related fixed-point, fixed-function pipelines, GPUs rapidly evolved through the era of dedicated vertex and fragment programs with limited SIMD programming features to general purpose, fully programmable stream processors with data dependent branching, IEEE double precision data type support, and a real Single Program Multiple Data (SPMD) programming model.

Meanwhile, various programming languages and frameworks have emerged that support high performance data-parallel computations with real scatter operations and thread cooperation, getting rid of the former limited shared memory emulation in the texture memory. These continuous innovations, the wide deployment of GPUs, and the attractive performance per price ratio resulted in increasing experimental research on GPUs, which have become an attractive throughput architecture for high performance computations in various domains of science.

From the algorithmic point of view, computational tasks often contain exploitable concurrency that enables different parts of the problem to be solved simultaneously by multiple closely coupled processors (parallel computing). Usually, high-speed communication channels exist between these processing elements and fast synchronization is also possible. This allows compute-intensive problems to be solved in less time and might also enable large data-intensive problems to be handled. However, when memory resources are insufficient to store the model of a problem on a single device, the parallel program has to be further split up into loosely coupled instances that require only a limited amount of interaction during their execution (distributed computing). Commonly, there is no communication medium between these devices as fast as the one within a device, and there is no adequate solution for rapid synchronization.

It has been shown by many researchers that GPUs and clusters of GPUs can be successfully applied for these purposes; see the NVIDIA web site for various examples. However, when writing complex applications or software library stacks, one often faces issues of reusability, abstraction, or portability. On the other hand, the underlying computational efficiency can be lost by defining too high-level, too general, or excessively transparent GPU frameworks. Therefore, we have investigated how an OpenCL-like model can be extended in two dimensions without losing the computational horsepower: in the levels of scalability, from a streaming device through an SMP computer to a cluster; and in the layers of abstraction, from the hardware level through mapping the execution to the hardware to projecting the problem space to the execution (Figure 1).

The remainder of this paper is organized as follows. First, existing languages, programming libraries, and frameworks are briefly summarized in Section 2. Considering the conjectural trends of the evolution, we introduce a simple formalism using which a multi-level and multi-layer programming model can be defined. Since code portability does not immediately involve efficient code reusability on different system architectures, our model does not hide the abstract hardware environment (Section 3) from the programmer using compiler extensions or metaprogramming techniques. Instead, we propose a solution that still enables architectural optimizations (Section 4) while it also provides high-level mapping of the abstract problem space (Section 5). This supports code reusability and virtualization of hardware resources in order to decrease the programming effort. Although our primary contribution is our three-tier model, we present a potential software infrastructure and implementation based on CUDA, OpenMP, and MPI2 technologies (Section 6). Finally, to demonstrate the usability of the model, this implementation is evaluated on a complex practical application scenario, a tomographic reconstruction algorithm (Section 7), and its performance scaling behavior is discussed.

2 Previous Work

The concept of using graphics hardware for general purpose algorithms was introduced in the early nineties by Lengyel et al., who used a rasterization device for robot motion planning [Lengyel et al. 1990], and by Cabral et al., who accelerated tomographic reconstruction using texture mapping hardware [Cabral et al. 1994]. Although at that time most of the pioneering work was done as proof-of-concept prototypes, from the beginning of this decade a clear new domain has been emerging for stream computations on the graphics hardware.

Influenced by the GPGPU movement, numerous programming languages, libraries, and frameworks have been designed to make stream programming on the GPU easier and more productive. The common requirements for such frameworks were portability to the primary hardware and software platforms, and efficiency, introducing only a low overhead compared to optimized hand-written implementations. On the other hand, high abstraction that obscures the details of the underlying hardware was also welcome. Besides, a strong demand arose to couple the shader language and the host code into a single language in order to handle CPU-GPU interactions by grammar instead of cumbersome API calls.

The evolution of stream programming on the GPU began with programmable shading, using GPUs for their primary purpose: for doing computer graphics. The first high level real-time procedural shading languages were the Stanford Shading Language [Proudfoot et al. 2001], Cg [Mark et al. 2003], and graphics API specific languages like HLSL [Oneppo 2007] and GLSL [Rost et al. 2004]. In addition to traditional rendering, GPGPU languages have appeared in order to support common activities like glue code generation and shader synthesis. Such languages are the RapidMind Development Platform [Monteyne 2008] as the generalization of Sh [McCool et al. 2002; McCool et al. 2004], and the GPU++ language [Jansen 2007], which apply C++ metaprogramming techniques for generating a syntax tree and synthesizing shader source code on-the-fly. Other approaches that use special compilers to generate kernel and host source code at compile time are the Brook stream programming language [Buck et al. 2004] and its spin-offs, PeakStream and Brook+. Other similar compiled languages are CGiS [Fritz et al. 2004], Accelerator [Tarditi et al. 2006], and Brahma [Bra 2006]. In addition, there are GPGPU programming libraries like Shallows [Sha 2005] and Glift [Lefohn et al. 2006], which provide a variety of standard data structures for the GPU shader model.

Closing the GPGPU era, currently general purpose GPU programming languages represent the most effective way to write and execute programs on GPUs without the need to map the algorithms onto a 3D graphics API such as OpenGL or DirectX. These solutions provide compilers, runtime libraries, and tools bundled for efficient development and debugging. Frameworks of the main vendors are CUDA from NVIDIA [Nickolls et al. 2008], the CAL programming framework of AMD/ATI Stream [AMD/ATI 2008] (formerly Close To Metal), and Ct with Larrabee from Intel [Gholoum et al. 2007; Seiler et al. 2008]. Designed especially for CUDA, CUDASA [Strengert et al. 2008] can handle multiple hardware levels in a CUDA-fashioned way. Also, there are languages that can compile source code for multiple hardware architectures, such as DirectX Compute Shaders, hmpp [hmp 2008] from CAPS Enterprise, OpenCL [Khronos OpenCL Working Group 2008], a new open standard for general purpose parallel programming from the Khronos Group, and the C$ language [A. V. 2007].

Many of these models are restricted to a single device. The domain of OpenCL and CUDA is a single host computer with possibly multiple devices. CUDASA has a similar purpose to our work. However, the clear difference is that our model does not force the block-grid mechanism of CUDA onto higher hardware levels, since blocks and grids are a specific solution for problem space mapping (Section 5) designed especially for NVIDIA GPUs and are not necessarily optimal for all hardware levels. In our solution, which might be considered a modification and extension of the OpenCL model, a memory transfer API is defined for each level of the execution layer, and split-merge-exchange operations are defined with address translators at the problem space layer. The API of the execution layer can be implemented in different ways, including the one proposed in [Strengert et al. 2008] or the way we did it (Section 6). The most important benefit is that the code at the highest abstraction layer, the problem space layer, only has to be altered if the underlying hardware architecture changes significantly, for instance when porting the application from GPUs to the Cell BE.

In the next three sections, we summarize the three abstraction layers of our model. At the end of each section, the usage of the components of the model is illustrated by an example that is depicted in Figure 1.

3 Layer 1: A Conceptual Hardware Architecture Model

First, we introduce a simple formalism to describe hardware architectures. Our model follows the key ideas behind the OpenCL Platform Model [Khronos OpenCL Working Group 2008] but with the following assumptions: any hardware element is reliable, the execution times of all tasks and memory transfers are predictable, and there is no fault tolerance introduced in the model. The hardware architecture is homogeneous, the hardware elements are identical on the same level, and the transfer times are also approximately equal.

The abstract hardware element of the execution is the processing element (PE), which is a virtual scalar processor that can execute sequential code. PEs might be bundled into processing arrays (PA) that provide instruments to organize the cooperation of the PEs by barriers, shared memory, or communication channels. A processing device (PD) is the physical unit of the computing hardware for a certain hardware level: a collection of PEs, PAs, and memory spaces. A processing unit (PU) is a task execution interface for a processing device. A PE, a PA, or even a PD can represent the task execution interface for a specific device.

Within their device, PEs may access data from multiple memory spaces during their execution. A PE may have a private memory space that is inaccessible to other PEs. Each PA may have a local memory space shared among all PEs of the PA. Finally, a PD can have a global memory space to which all PEs have access. Private memory and local memory may be implemented as registers or as dedicated regions of memory on the PD. Alternatively, these regions may be mapped onto sections of the global memory.

The models of a multiprocessor of an NVIDIA GPU, a graphics board, a desktop computer with a quad-core CPU and four graphics cards, and a compute cluster built up from these abstract elements are depicted in Figure 1 (a).

Figure 1: An overview of an imaginary distributed computing system built up from the basic components of our model. From left to right, the abstraction layers are depicted bounded by upright dark grey rectangles: the hardware abstraction layer, the abstract execution model, and the problem space layer, respectively. From bottom to top, the scalability of the system is enhanced with levels bounded by horizontal light grey polygons. (a) Abstract model of a multiprocessor of an NVIDIA GPU, a graphics board, an SMP computer, and a compute cluster. (b) Contexts and task instances of a kernel, a process, and a job corresponding to the hardware level. (c) A typical problem space partitioning scenario that illustrates split memory objects in system memory, in the global memory of the graphics board, and in the local memory of an NVIDIA multiprocessor, driven by address translation.
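
To make this formalism concrete, the following C++ sketch (illustrative only; the type and function names are hypothetical and not part of our implementation) describes a PD as a collection of PAs and PEs together with the sizes of its memory spaces, and instantiates it for a GPU-like stream device with 16 PAs of 8 PEs each.

    #include <cstddef>
    #include <vector>

    // Illustrative descriptors for the conceptual hardware model (hypothetical names).
    struct ProcessingElement {             // PE: a virtual scalar processor
        std::size_t privateMemoryBytes;    // private memory space (may be 0)
    };

    struct ProcessingArray {               // PA: a bundle of cooperating PEs
        std::vector<ProcessingElement> pes;
        std::size_t localMemoryBytes;      // local memory shared by the PEs of the PA
        bool hasBarrier;                   // the PEs of a PA may synchronize via barriers
    };

    struct ProcessingDevice {              // PD: the physical unit of a hardware level
        std::vector<ProcessingArray> pas;
        std::size_t globalMemoryBytes;     // global memory accessible to all PEs
    };

    // Example: a GPU-like stream PD with 16 PAs of 8 PEs each,
    // 16 KB of local memory per PA, and 512 MB of global memory.
    ProcessingDevice makeStreamDeviceExample() {
        ProcessingDevice pd;
        pd.globalMemoryBytes = 512u * 1024u * 1024u;
        for (int mp = 0; mp < 16; ++mp) {
            ProcessingArray pa;
            pa.localMemoryBytes = 16u * 1024u;
            pa.hasBarrier = true;
            pa.pes.resize(8, ProcessingElement{ 0 });
            pd.pas.push_back(pa);
        }
        return pd;
    }

Higher levels reuse the same three notions: an SMP computer and a whole cluster are again PDs, only with different memory space sizes and cooperation primitives.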

4 Layer 2: Execution Model

This section explains a logical execution model that can be mapped to the abstract hardware introduced above. In our work, we restricted ourselves to the class of problems for which the following assumptions hold:

1. The execution control, including executing tasks and performing memory transfers, depends only on the task to instantiate and on the extent of the input and output data, and does not depend on the data itself. The size of the related data is always known in advance. This also implies that tasks are associated with data before the execution of the tasks begins.

2. Only task instances of the same task code can cooperate. This assumes at least the SPMD or the stricter SIMD or SISD programming model at any level of the system.

3. The distributed parallel architecture operates on the same instance of a problem at the same time, not on multitudinous parallel transactions.

memory object | access                | scope              | lifetime                 | consistency
private MO    | sequential            | one WI             | execution of the WI      | always consistent
local MO      | parallel / sequential | WIs in the same WG | execution of the WG      | across WIs in the same WG only at a WG barrier
global MO     | parallel / sequential | any WI             | managed by the source WI | across WIs in the same WG only at a WG barrier; for the source WI only at a command-queue barrier

Table 1: Properties of memory objects (MOs) in the different memory spaces of the processing device. Sequential access can be provided if the underlying hardware supports atomic memory instructions.

4.1 Logical Units of Execution and Storage

In this paper, we define a task as a sequence of instructions that operate together as a group corresponding to some logical part of a program. A task, implemented in a procedural programming language, has well defined inputs and outputs, and it is executed on a PU in order to modify its output data. When a task is submitted for execution, a discrete index space is also defined by the caller. For each point of this index space, an instance of the task is executed, which is called a work-item (WI). WIs are distinguished from each other only by their positions in the index space. Bundling WIs in a sub-space of the index space, work-groups (WG) force a coarser grained decomposition of the index space. WIs are executed by single PEs, possibly as part of a WG running on a PA. Each WI operates according to the same task, but the actual sequence of instructions can differ across the set of WIs. WIs within a WG are executed concurrently and may interact through local memory objects and WG barriers implemented by the PA. The number of WGs can usually be higher than the number of PAs of a specific device, just as the number of WIs is usually not limited by the number of PEs in a PA.

Memory objects (MO) store a linear collection of bytes in a memory space. Variables, arrays, textures, or data structures can also be considered MOs. MOs are randomly addressable by the target WIs using pointers or possibly real-valued positions. In contrast, MOs are usually not directly accessible by the source WI, which can manipulate them only by API calls. Table 1 describes the properties of private MOs, local MOs, and global MOs defined in the different memory spaces of the PD. Task instances of a job running on a compute cluster, instances of a process executed on an SMP thread, and instances of a kernel executed on a GPU are illustrated in Figure 1 (b).
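
The index space, WI, and WG concepts can also be illustrated by a short sketch. The code below (hypothetical names, not taken from our implementation) emulates sequentially what a PU does when a task is submitted over a one-dimensional index space; a real PU would map the WGs to PAs and execute the WIs concurrently.

    #include <cstddef>
    #include <functional>

    // A WI is identified only by its position in the index space.
    struct IndexPosition {
        std::size_t global;   // position of the WI in the index space
        std::size_t group;    // index of the WG the WI belongs to
        std::size_t local;    // position of the WI within its WG
    };

    // The task is modelled here simply as the per-WI code.
    using WorkItemCode = std::function<void(const IndexPosition&)>;

    // Sequential emulation of submitting a task over an index space.
    void instantiate(std::size_t indexSpaceSize,
                     std::size_t workGroupSize,
                     const WorkItemCode& task)
    {
        for (std::size_t g = 0; g * workGroupSize < indexSpaceSize; ++g) {   // one WG per sub-space
            for (std::size_t l = 0; l < workGroupSize; ++l) {                // WIs of the WG
                const std::size_t idx = g * workGroupSize + l;
                if (idx >= indexSpaceSize) break;
                task(IndexPosition{ idx, g, l });
            }
        }
    }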

4.2 Managing the Execution

We define a context as a linkage between a logical source WI and a task execution target PU interface. A context is managed by the source WI and it includes the PU, the memory spaces accessible to that PU, and a command queue used to schedule commands. From the viewpoint of the source WI, it is an execution environment within which tasks can be instantiated as target WIs and WGs, and a domain for managing the memory spaces accessible to the target PU.

The context is created and manipulated by the source WI issuing commands. The commands are scheduled by a command queue for execution within a context, then they are executed asynchronously with respect to the source WI and the target WIs. The command queue can be implemented on the target PD, on the PAs, or directly on the PEs. There are three types of commands:

1. An instantiation command creates WI task instances and WGs by defining an index space for a task.

2. Memory space manipulation commands can be used for creating and destroying memory objects on the target PU and for interaction between the memory objects of the source WI and the created target memory objects.

3. Synchronization commands make it possible to constrain the order of command executions. Command-queue barriers ensure that all previously queued commands have finished their execution and any modifications on global memory objects are effectively updated.

A GPU context maintained by a process WI, a node context managed by a job instance, and a cluster-context attachment point are illustrated in Figure 1 (b). An application pseudo-WI can be attached to any of these contexts: to the GPU context in case of single-GPU programs, to the node context for multi-GPU environments, and to the cluster context to get a GPU cluster system architecture.
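
The use of a context and its three command types by a source WI can be sketched as follows. All names are hypothetical placeholders rather than the actual API of our implementation, and the command queue is reduced to synchronous stubs so that the snippet stays self-contained.

    #include <cstddef>
    #include <cstring>
    #include <initializer_list>
    #include <vector>

    struct MO { std::vector<unsigned char> bytes; };              // memory object on the target PU
    struct IndexSpace { std::size_t size; std::size_t wgSize; };  // index space and WG size

    struct Context {                                              // source WI <-> target PU linkage
        MO createMO(std::size_t n) { return MO{ std::vector<unsigned char>(n) }; }
        void copyTo(MO& mo, const void* src, std::size_t n)   { std::memcpy(mo.bytes.data(), src, n); }
        void copyFrom(void* dst, const MO& mo, std::size_t n) { std::memcpy(dst, mo.bytes.data(), n); }
        void enqueueTask(const char* /*task*/, IndexSpace /*is*/, std::initializer_list<MO*> /*args*/) {
            // a real command queue would create the WIs and WGs and run them asynchronously
        }
        void barrier() { /* wait until all previously queued commands have finished */ }
    };

    // The source WI issues commands of the three types through its context.
    void sourceWorkItem(Context& ctx, float* hostData, std::size_t n)
    {
        MO in  = ctx.createMO(n * sizeof(float));                        // (2) memory space manipulation
        MO out = ctx.createMO(n * sizeof(float));
        ctx.copyTo(in, hostData, n * sizeof(float));

        ctx.enqueueTask("myTask", IndexSpace{ n, 256 }, { &in, &out });  // (1) instantiation

        ctx.barrier();                                                   // (3) synchronization
        ctx.copyFrom(hostData, out, n * sizeof(float));
    }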

5 Layer 3: Problem Space Mapping and Virtualization

Using the elements of the execution model, parallel and distributed tasks can be designed that are executed on an abstract multi-level hardware architecture. Nevertheless, reimplementing a parallel algorithm originally designed for the shared memory architecture of a single GPU on a distributed memory architecture is usually not a trivial porting issue. In general, it can be performed easily only for the simplest classes of problems, when the parallel task instances (e.g. kernel executions for the GPU) work on different data without any communication. Even so, suitable guidelines can be defined for the more general problem domain defined in Section 4 that enable detaching the memory and address space management from the algorithmic code. This section explains how it can be done.

Any computational problem has a common, well defined purpose, inputs, and outputs for any of its implementations, which may differ in the details of the applied variant of the algorithm. Thus, it is a reasonable expectation for a computational system to provide an interface that represents an abstract solution for a certain computational problem with specified inputs and outputs. This interface completely obscures the hardware architecture and the related underlying scheduling and communication mechanisms. We advise introducing this abstraction here as a third layer instead of trying to force it into the second one. Using this abstraction, the memory and address space management and the algorithmic code can be separated. This enables the virtualization of the hardware resources of multiple PUs for the source WI, and it also makes possible the virtualization of the execution environment for the target WIs, which can be exploited for writing reusable task codes. This is advantageous since tasks usually have very similar implementations that can often be fused by defining a simple memory address translation for the input/output data.

In the terms of our model, an abstract task solves a certain computational problem on well defined data. For instance, an abstract task can be the multiplication of two sparse matrices, sorting a sequence, or the reconstruction of a volume from its X-ray projections. An abstract task refers to a set of different tasks that implement the functionality of the abstract task, and a control logic that selects and parametrizes one of these tasks for a given abstract hardware configuration and the specified extent of the input/output data.

Dealing with computational problems with high memory requirements, the Geometric Decomposition design pattern [Mattson et al. 2005] is an appropriate solution to the recurring problem of organizing the execution and managing the data decomposition among multiple WIs of the adjacent levels of parallelism. In this case, the problem space is partitioned into chunks, where the solution for each chunk can be computed using data from only a few other chunks. This scheme requires the following operations to be implemented for a certain task. The global problem space is partitioned into coarse-grained chunks by the split and merge operation pair, considering the shape of the chunks; the decomposition can be performed using write and read memory copies or memory mappings. The exchange operation ensures that each task has access to all data needed for its subsequent execution by target-to-target data transfer. The computation is done by the execute operation, which updates the data within the chunk by instantiating WIs on the target PU. Using these pattern elements, the generic outline of a task is

split -> execute -> [exchange -> execute]* -> merge    (1)

where the asterisk indicates that the [exchange -> execute] block might be executed zero or more times. This model can be applied recursively, since the execute operation invokes another task on the target PU.

To efficiently encapsulate the data decomposition strategy both for the memory transfer operations, like split, exchange, and merge, and for unifying the environment for the called WIs, we apply address translators (AT) [Lefohn et al. 2006]. An AT is an object for memory read and write operations supporting both random-access reads and writes, and range translations to support efficient memory copy operations. Technically, only the index space splitting scheme has to be specified by the programmer, since it is closely related to the solved problem. We think there is no framework that can do this automatically, thus in our solution just a structured way is defined for the programmer to express it. Nevertheless, the number of practical splitting schemes is usually low; usually cloning or modulo partitioning (for instance into stripes or chessboard cells) is appropriate. This index space splitting scheme then directly implies the automatic execution of the instantiation and memory space manipulation commands.

As an example, the distribution of the problem space among job instances, SMP processes, and GPU kernel instances is depicted in Figure 1 (c). Listing 1 illustrates code snippets of an imaginary abstract task and one of its implementations. The specific task implementation of the abstract task is selected based on the hardware resources and the extent of the input/output data. The hardware resources and the input/output data also rule the selection and the parametrization of the split/merge parameters and the address translators.

    Task MyAbstractTask::createTask( MOExtent dataExtent[],            // in
                                     AbstractHardwareCapabilities hw ) // in
    {
        // Create task instance based on the hw and data
        Task myTask = someCondition( dataExtent, hw ) ? new ThisTask()
                                                      : new ThatTask();
        // Parametrize the created task instance
        calculateParamsSomeHow( dataExtent, hw,             // in
                                myTask->splitParams,        // out
                                myTask->mergeParams,        // out
                                myTask->exchangeParams,     // out
                                myTask->addressTranslators  // out
                              );
        return myTask;
    }

    void MyAbstractTask::execute( Context ctx, Task task, MO data[] )
    {
        task->setContext( ctx );      // set context to the task
        task->split( data );          // split data on target PUs
        for ( iterations ) {
            task->execute();          // execute the task
            task->exchange( data );   // exch. between target PUs
        }
        task->merge( data );          // merge data from target PUs
    }

    void ThisTask::workItemCode( IndexPosition pos )
    {
        ...
        // Update local MO using address translator
        addressTranslators.foo( pos, getLocalMOs()->foo ) = ...;
        // Do some synchronization in the work group
        getWorkGroup()->synchronize();
        ...
        // Update global MO using address translator
        addressTranslators.bar( pos, getGlobalMOs()->bar ) = ...;
    }

Listing 1: Code snippets from an imaginary abstract task and one of its implementations. Method createTask() creates and parametrizes a task instance based on the input/output sizes and the abstract hardware. Later, when the context to the target PU is created, method execute() spawns the task with the additional split/exchange/merge operations. The final method is the WI code of a task implementation with a typical code layout: calculation, local MO update, synchronization, another calculation, and global MO update.
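
As a minimal illustration of an address translator, the sketch below (a hypothetical helper, simplified to a dense volume of float voxels) maps logical voxel coordinates of the global problem space to an address inside a chunk that stores only every k-th axial slice, i.e. the modulo partitioning scheme used for the reconstruction example in Section 7.

    #include <cstddef>

    // Illustrative address translator for modulo partitioning of axial slices.
    struct ModuloSliceTranslator {
        std::size_t dimX, dimY;   // in-slice resolution of the volume
        std::size_t numTargets;   // number of target PUs (k)
        std::size_t target;       // which target PU this translator belongs to
        float*      subVolume;    // chunk on this PU: the slices z with z % k == target

        // True if the logical slice z is stored on this target PU.
        bool owns(std::size_t z) const { return z % numTargets == target; }

        // Random-access translation: logical voxel (x, y, z) -> address in the chunk.
        float& at(std::size_t x, std::size_t y, std::size_t z) const {
            const std::size_t localSlice = z / numTargets;   // position of the slice in the chunk
            return subVolume[(localSlice * dimY + y) * dimX + x];
        }
    };

Range translations for the split and merge copies can be derived from the same mapping, so the WI code itself never needs to know how the volume was distributed among the PUs.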

scalability level | PE               | PA             | PD           | private memory | local memory    | global memory     | task    | WI            | WG
cluster           | node             | -              | cluster (PU) | -              | -               | system memory     | job     | MPI2 job      | -
SMP               | CPU core         | CPU (PU)       | PC           | -              | system memory   | system memory     | process | OpenMP thread | OpenMP thread team
stream            | scalar processor | multiprocessor | GPU (PU)     | registers      | local MP memory | global GPU memory | kernel  | CUDA thread   | CUDA thread block

Table 2: Hardware elements of execution and memory spaces (layer 1) and the corresponding logical units of execution (layer 2) at each scalability level of our software implementation.

6 Implementation Details

In order to illustrate a realization of our model, we built a system of connected desktop computers with multi-core CPUs and modern GPUs according to the abstract hardware architecture outlined in Figure 1 (a).

6.1 Hardware Layer

At the topmost level of scalability, the cluster PEs, i.e. the nodes of the cluster, can cooperate via their network adapters. The cluster PU is represented by the cluster itself, the cluster PD. On the next level, processor cores can be considered SMP PEs since their access to the system memory is symmetric. The SMP PDs have memory spaces, in our case 4 GB of system memory fully addressable by any SMP PE. Since the L1 caches are not addressable directly, no SMP private memory exists, while the SMP local and SMP global memory spaces coincide. The SMP PA stands for the SMP PU, the multi-core CPU, when using homogeneous processor threads, for instance with OpenMP. Note that an SMP PE can at the same time be an SMP PU, e.g. when using pthreads. On the lowest level of scalability, the stream PAs are the multiprocessors on the graphics board. There are two additional cache spaces over the 16 KB local store in each multiprocessor: the constant memory cache and the texture cache. These caches can be fully exploited when using the CUDA interface for programming the graphics device. A stream PD, which is also the stream PU, consists of numerous stream PAs, 16 NVIDIA multiprocessors for our graphics cards, while each multiprocessor contains 8 scalar processors, the stream PEs. There is an additional stream global memory space for each stream PD, 512 MB and 1 GB in our case. See Figure 1 (a) and Table 2.
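
The stream-level entries of Table 2 can be obtained at run time from the CUDA device properties; the sketch below shows one way to query the number of PAs and the sizes of the local and global memory spaces of every stream PD in a node (error handling omitted for brevity).

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print the stream-level hardware parameters of every CUDA-capable PD.
    void printStreamDevices()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            std::printf("stream PD %d (%s): %d PAs (multiprocessors), "
                        "%zu KB local memory per PA, %zu MB global memory\n",
                        dev, prop.name, prop.multiProcessorCount,
                        prop.sharedMemPerBlock / 1024,
                        prop.totalGlobalMem / (1024 * 1024));
        }
    }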

6.2 Execution Layer

According to the scalability levels, we defined three task levels for the execution layer, as illustrated in Figure 1 (b) and Table 2. We have implemented our code in C++ under the Linux operating system. To invoke the cluster WIs, the job instances at the topmost level, we applied the Grid Engine scheduler. The communication between the job instances is performed using MPI2 calls, for which we used the OpenMPI implementation. For simplicity, no cluster WGs are defined, although using MPI2 this would also be possible if needed. On the next level of scalability, OpenMP is applied to spawn SMP WIs using pragma compiler directives in the C++ code. The OpenMP threads are grouped into thread teams as SMP WGs. As a straightforward solution, shared system memory is applied for communication between SMP WIs. However, significantly different implementations are also possible, for instance pthreads with message queues, if that fits the demands better. Finally, the runtime API of CUDA implements the lowest scalability level. According to the recommendation of NVIDIA, a dedicated SMP WI is assigned to each stream PU. This SMP WI manages the memory operations and the kernel invocations. For now, according to our assumptions about homogeneous execution (Section 4), each stream PD executes the same kernel at the same time. The stream WIs correspond to CUDA kernel instances with specific block and thread indices, while a stream WG represents a CUDA block, in which synchronization and fast communication are possible.
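
The nesting of the three levels can be sketched as follows: each MPI2 process acts as a cluster WI, an OpenMP thread team forms the SMP WG, and one dedicated OpenMP thread per GPU selects its device and launches the kernels. This is a minimal sketch under the assumptions above; myKernel, its launch configuration, and the buffer size are placeholders, and at least one CUDA device per node is assumed.

    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>

    __global__ void myKernel(float* data) { /* stream WIs: one CUDA thread per WI */ }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);                  // cluster WI = MPI2 process (job instance)
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // rank would select this node's chunk in the split phase

        int gpuCount = 0;
        cudaGetDeviceCount(&gpuCount);

        // SMP WG = OpenMP thread team; one dedicated SMP WI per stream PU (GPU).
        #pragma omp parallel num_threads(gpuCount)
        {
            const int gpu = omp_get_thread_num();
            cudaSetDevice(gpu);                  // bind this SMP WI to its GPU

            float* dData = nullptr;
            cudaMalloc(&dData, 1024 * sizeof(float));
            myKernel<<<16, 256>>>(dData);        // stream WGs = CUDA blocks
            cudaDeviceSynchronize();
            cudaFree(dData);
        }

        MPI_Finalize();                          // exchanges between cluster WIs would use MPI2 calls
        return 0;
    }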

volume size

hardware architecture 1 PC 1 GPU 2 GPUs 4 GPUs 1 GPU 2 GPUs 4 GPUs 1 GPU 2 GPUs 1 GPU 2 GPUs 1 GPU 2 GPUs 1 GPU 2 GPUs 2 GPUs 1 GPU 2 GPUs 2 GPUs

tiny, 384 MB, 512 × 512 × 384

config-1 2 PCs 1 PC config-2 2 PCs 1 PC config-2 2 PCs 1 PC config-2 config-2 2 PCs 2 PCs

tiny, 384 MB, 512 × 512 × 384

small, 768 MB, 512 × 512 × 768; medium, 1.5 GB, 1k × 1k × 384; large, 3 GB, 1k × 1k × 768

projection transfer network bus 0.7 1.4 2.8 14.8 0.7 14.8 1.4 15.1 2.8 0.6 1.2 14.7 0.6 14.6 1.2 0.6 1.2 15.8 0.6 14.8 1.2 1.2 14.8 0.6 15.2 1.2 15.7 1.2

2D image processing 5.9 5.9 5.9 5.9 5.9 5.9 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7 3.7

back projection 115.0 63.0 26.0 67.0 38.6 17.2 25.0 13.0 12.6 6.5 46.0 23.0 24.9 12.0 41.4 46.4 21.8 41.4

volume transfer bus network 0.3 0.3 0.4 0.1 2.1 0.1 2.0 0.2 2.0 0.3 0.3 0.2 1.9 0.1 2.0 0.6 0.6 0.3 2.0 0.1 2.0 1.2 0.6 2.0 0.1 2.1 0.6 1.9

total 121.9 70.6 35.0 90.6 62.8 43.3 29.6 18.2 33.7 28.0 50.9 28.5 47.3 33.8 47.5 68.1 44.1 64.5

speedup compute total ( 1.00 ) 1.75 1.73 3.78 3.48 1.65 1.35 2.72 1.94 5.23 2.82 ( 1.00 ) 1.73 1.63 1.78 0.88 2.87 1.07 ( 1.00 ) 1.86 1.79 1.73 1.08 3.16 1.51 ( 1.00 ) 0.90 0.70 1.77 1.54 -

Table 3: Mean values of task execution times in seconds for different volume sizes on different hardware architectures. Task executions include computations and memory transfers, through the network and via the PCI Express bus.

7 An Example and Evaluation

A problem space layer has to be defined in order to evaluate the implementation of the hardware and execution layers. For this purpose we implemented the GPU adaptation of a complete practical helical CT reconstruction scheme with projection image preprocessing, filtering, and back-projection. Our implementation is based on the well-known FDK method [Feldkamp et al. 1984], which solves the three-dimensional reconstruction for cone-beam CT scanners. To prevent the CPU from becoming the bottleneck of the system, not only the filtering and the back-projection step of the FDK method, but also each step preprocessing the scanned raw projections is decomposed into parallel tasks performed on the GPU as different kernel programs, such as offset, gain, bad pixel, and geometrical corrections.

At the highest level of abstraction, projection images are cloned for each PU in the split phase. Therefore, no address translation is needed for the projections. The 2D image processing tasks are executed on each GPU since, according to our experience, the execution time of these tasks is not dominant in the overall execution time (Table 3). To avoid write-write collisions, the index space is directly mapped to the output volume for the back-projection. The output volume is split among the target GPUs and the reconstructed data have to be copied back by the source WI. Due to the helical geometry of the scanner, not every projection has to be projected back to every voxel. In order to balance the computational workload among the GPUs, axial slices are assigned to the GPUs in a modulo scheme. To support this scheme, a simple memory translator was implemented that creates a mapping between the logical voxel coordinates of the problem space and the memory address in the stored sub-volume data.

For our experiments we used a Mediso NanoScan CT scanner and a LEGO® Technic plastic figure as a specimen (see Figure 2 for the reconstructed image). We have investigated two different hardware configurations:

1. Two desktop computers, each equipped with an Intel Core2 6700 dual-core CPU @ 2.6 GHz, 4 GB of DDR2 RAM, and two NVIDIA 9800X2 GPUs (2 × 512 MB RAM).

2. Two desktop computers, each equipped with an Intel Core2 Q9450 quad-core CPU @ 2.6 GHz, 4 GB of DDR3 RAM, and two NVIDIA GTX-280 GPUs (1 × 1024 MB RAM).

In both cases, the desktop computers were connected over Gigabit Ethernet using Intel Pro1000 network adapters. In fact, Gigabit Ethernet is not well suited as the interconnection network of a compute cluster: we could hardly reach 90-100 MB/s bandwidth, with an average delay of 1-3 milliseconds and a 1-1.5% packet loss ratio. Thus, in order to increase the effective network transfer rate, we used MPI2 only to control the distributed application. For transferring data we created a custom solution using simplex point-to-point connections with 8K jumbo frames and the UDP protocol. Nevertheless, it would be worth using InfiniBand or Myrinet technology for this purpose; running tests on such systems remains future work.

The volume was reconstructed on a Cartesian cubic lattice in four resolution settings (384 MB, 768 MB, 1.5 GB, and 3 GB) with single precision floating point values from 360 projection images, each consisting of 1024 × 2048 pixels with 16-bit values (1.4 GB). Nine rays were cast for each voxel. The projection transfer times, the 2D correction and filtering times, the back-projection time, and the volume transfer times, as well as the performance scaling values, are detailed in Table 3. The task execution times were monitored using GNU gprof and the NVIDIA CUDA Profiler. These values reflect the efficiency of our C++ and CUDA implementation of the tomographic reconstruction algorithm. The computational cost of the implementation of our model elements is practically zero, since these tasks have to be performed in any case: creating light-weight CPU threads, copying memory to the device, launching CUDA threads. The overhead is 2-3 vtable fetches for the split, merge, exchange, and execute calls, since C++ virtual methods are introduced to detach the implementation from the API.

The network transfer times over Gigabit Ethernet are a constant 15-16 seconds for the projection images of 1.4 GB, which is higher than the sum of all other steps for the config-2, 2-PC, 2-GPU case for tiny and small volumes and still dominates the overall time in the other cases, too. Using an InfiniBand network architecture, these transfer times could be lowered by an order of magnitude and would be comparable to the PCI Express transfer times. Thus, these circumstances have to be considered when evaluating the overall speedup factors of this implementation on this hardware. On the other hand, since all addressing calculations are encapsulated in the address translator logic, a volume of three gigabytes can be reconstructed in a minute using config-2 without any modification of the C++ or CUDA code, only with tuning the parameters of the execution.

The rendering of the reconstructed small data set is depicted in Figure 2. The image was generated using a simple OpenGL program rendering a full-screen quad with a Cg pixel shader performing first-hit ray casting in a 3D texture.

Figure 2: Rendering of the reconstructed volumetric data of a LEGO® Technic plastic figure with a pipe wrench in its hand, using first-hit ray casting with a simple local illumination model.

8 Conclusion and Future Work

We have presented a scalable, multi-layer and multi-level programming model for recent GPU-based high performance computing systems. With this model, different aspects of a computing system can be described. It enables creating an abstract architectural model for different hardware architectures as well as defining tasks from the basic components of the execution model that can be mapped to the hardware model using context implementations. Instead of hiding the execution environment from the programmer, we define a higher abstraction level on top of the execution model to provide abstract problem space mapping with code reusability and virtualization of hardware resources. We have created a C++ implementation of the abstract model elements and we have also defined context, work-group, and work-item implementations using recent popular parallel programming frameworks: OpenMPI, OpenMP, and CUDA. Within this programming environment we have implemented a medical industry level micro-CT reconstruction application and evaluated its performance scaling behavior.

In the future, we plan to evaluate the flexibility and the performance of our model on more applications, especially ones where multiple split-exchange-merge strategies can be investigated and compared. We plan to support OpenCL contexts as well, since this technology seems to be an upcoming standard. On the other hand, there are several ways to ease the constraints we made during the design of the model. Some of these are software considerations, such as introducing features of transaction systems, enabling the distributed parallel architecture to execute multiple instances of the same application or instances of different applications. For SMP threads we used homogeneous task instances, although this is not necessary. Furthermore, the execution control could be data-dependent, as the size of the input/output data might be unknown a priori. From the hardware point of view, supporting inhomogeneous hardware architectures as well as fault tolerant systems can be future work.

Acknowledgements

This work was supported by the Hungarian National Office for Research and Technology (TECH08A2) and Mediso Medical Imaging Systems.

References

A. V., A. 2007. A Higher-Level and Portable Approach to GPGPU Programming. In Proceedings of SYRCoSE 2007.
AMD/ATI. 2008. ATI Stream Computing Technical Overview. Tech. rep., AMD Inc.
Brahma. 2006. Brahma: Shader Meta-Programming Framework for .NET. http://brahma.ananthonline.net/.
Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., and Hanrahan, P. 2004. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Trans. Graph. 23, 3, 777-786.
Cabral, B., Cam, N., and Foran, J. 1994. Accelerated volume rendering and tomographic reconstruction using texture mapping hardware. In VVS '94: Proceedings of the 1994 Symposium on Volume Visualization, ACM, New York, NY, USA, 91-98.
Feldkamp, L. A., Davis, L. C., and Kress, J. W. 1984. Practical Cone-Beam Algorithm. Journal of the Optical Society of America A 1 (Jun), 612-619.
Fritz, N., Lucas, P., and Slusallek, P. 2004. CGiS, a new language for data-parallel GPU programming. In Proceedings of the 9th International Workshop on Vision, Modeling, and Visualization (VMV 04), B. Girod, H.-P. Seidel, and M. Magnor, Eds., 241-248.
Gholoum, A., Sprangle, E., Fang, J., Wu, G., and Zhou, X. 2007. Ct: A Flexible Parallel Programming Model for Tera-scale Architectures. Tech. rep., Intel, Oct.
hmpp. 2008. hmpp: Hybrid Compiler with Data-Parallel Code Generators. http://www.caps-entreprise.com/.
Jansen, T. C. 2007. GPU++: An Embedded GPU Development System for General-Purpose Computations. PhD thesis, Technische Universität München.
Khronos OpenCL Working Group. 2008. The OpenCL Specification, Dec. Version 1.0, Document Revision 29.
Lefohn, A. E., Sengupta, S., Kniss, J., Strzodka, R., and Owens, J. D. 2006. Glift: Generic, efficient, random-access GPU data structures. ACM Trans. Graph. 25, 1, 60-99.
Lengyel, J., Reichert, M., Donald, B. R., and Greenberg, D. P. 1990. Real-time robot motion planning using rasterizing computer graphics hardware. In SIGGRAPH, 327-335.
Mark, W. R., Glanville, R. S., Akeley, K., and Kilgard, M. J. 2003. Cg: A system for programming graphics hardware in a C-like language. In SIGGRAPH '03: ACM SIGGRAPH 2003 Papers, ACM, New York, NY, USA, 896-907.
Mattson, T., Sanders, B., and Massingill, B. 2005. Patterns for Parallel Programming. Addison-Wesley Professional.
McCool, M. D., Qin, Z., and Popa, T. S. 2002. Shader metaprogramming. In HWWS '02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, Eurographics Association, Aire-la-Ville, Switzerland, 57-68.
McCool, M., Toit, S. D., Popa, T., Chan, B., and Moule, K. 2004. Shader algebra. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, ACM, New York, NY, USA, 787-795.
Monteyne, M. 2008. RapidMind Multi-Core Development Platform. RapidMind Inc., Feb.
Nickolls, J., Buck, I., Garland, M., and Skadron, K. 2008. Scalable parallel programming with CUDA. In SIGGRAPH '08: ACM SIGGRAPH 2008 Classes, ACM, New York, NY, USA, 1-14.
Oneppo, M. 2007. HLSL shader model 4.0. In SIGGRAPH '07: ACM SIGGRAPH 2007 Courses, ACM, New York, NY, USA, 112-152.
Proudfoot, K., Mark, W. R., Tzvetkov, S., and Hanrahan, P. 2001. A real-time procedural shading system for programmable graphics hardware. In SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, ACM, New York, NY, USA, 159-170.
Rost, R. J., Kessenich, J. M., and Lichtenbelt, B. 2004. OpenGL Shading Language. Addison-Wesley.
Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., and Hanrahan, P. 2008. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Trans. Graph. 27, 3 (Aug.), 1-15.
Shallows. 2005. Shallows: GPGPU Abstraction Library. http://shallows.sourceforge.net/.
Strengert, M., Müller, C., Dachsbacher, C., and Ertl, T. 2008. CUDASA: Compute Unified Device and Systems Architecture. In Eurographics Symposium on Parallel Graphics and Visualization (EGPGV 08), Eurographics Association, 49-56.
Tarditi, D., Puri, S., and Oglesby, J. 2006. Accelerator: Using data parallelism to program GPUs for general-purpose uses. In ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, NY, USA, 325-335.
