

ParCompMark: A Benchmark Environment for Parallel Image Compositing


Balázs Domonkos, Attila Egri, and Tibor Fóris
Budapest University of Technology and Economics, Department of Control Engineering and Information Technology [domonkos|egri|foris]@ik.bme.hu

Abstract. In several domains of computer graphics, the amount of data to be handled or the complexity of the rendering model makes interactive frame rates unreachable on a single graphics card. Parallel implementations of rendering algorithms have been proposed by several researchers to overcome this limitation. In order to obtain quantitative feedback about the operation of parallel rendering systems, test-bench applications are required. This paper presents ParCompMark (Parallel Compositing Benchmark Framework), which measures the operation of sort-last parallel rendering techniques. Our implementation provides a test bench for the Parallel Compositing API specified by Hewlett-Packard and Computational Engineering International as an intended standard for parallel compositing implementations on present-day PC clusters.

Introduction

In computer graphics, the generated image is created from an abstract description of the scene using a defined rendering model. When interactive frame rates are required and the scene is large or the rendering model is complex, the limits of a single graphics card are easily reached. The domains of scientific visualization, real-time simulation, photo-realistic imaging, volume rendering, and animation are particularly affected. Algorithms from these domains are usually very computing-intensive: simulations and dynamic data generation require intensive use of the CPU, while rendering with a complex shading model or into a large output image requires high GPU computing capacity. Moreover, a large amount of data needs substantial storage (system memory and video RAM), and with dynamic data the system bandwidths can also become bottlenecks. Therefore, to provide the necessary level of performance in such applications, rendering must be parallelized to overcome the memory constraints and the lack of computing power. The advantage of this distribution is that it scales up the graphics resources: the texture memory, the amount of geometry processed, and the overall size of the image produced. An additional benefit is that it also scales up the host system resources: the host system memory and the bandwidth between the system memory and the graphics card.


In order to obtain quantitative feedback about the effectiveness of such a parallel rendering and compositing system, it is expedient to apply test-bench applications that can simulate various GPU and CPU loads, rendering and compositing modes, and software scenarios under controlled conditions. We have designed a general benchmark framework for measuring the performance of parallel compositing systems. Using this framework, several implementations of the Hewlett-Packard parallel image compositing API were evaluated in a distributed-memory multiprocessor environment. This compositing API may be general enough to become a standard for parallel image compositing soon. In the next section we briefly overview parallelization techniques, focusing on image-compositing-based models. Our parallel image compositing benchmark framework is presented in Section 3. The measured quantitative characteristics are reported in Section 4, and in the last section we summarize the contributions of this paper.

Parallelization of Graphics Pipelines

Parallel rendering methods are usually classified by what kind of data is distributed and where the sorting in the graphics pipeline occurs: (1) at the beginning of the pipeline, (2) between the geometry processing and the rasterization step, or (3) at the end of the pipeline, after rasterization [6]. The location of the sorting fundamentally determines the required hardware architecture and communication infrastructure. The sort-first approach distributes the primitives to the rendering nodes early in the pipeline, and each node performs all the rendering operations [7][2]. In sort-middle parallelization the primitives are transformed into screen coordinates, clipped, and made ready for rasterization at the corresponding display device [3]. Sort-last sends the primitives through local rendering pipelines and defers sorting until after rasterization. In this case, one group of processors (renderers) is assigned to determined subsets of the primitives, while the other group (compositors) is assigned to subsets of pixels of the final output image, usually to rectangular areas. Note that the partitioning of the pixels does not have to be identical in the two cases (see Fig. 1).

Another classification scheme is based on the type of entities that are processed simultaneously. Single-threaded software renderers take graphics primitives one after another, and the pixels corresponding to these primitives are also processed sequentially. In contrast, recent graphics cards have multiple graphics pipelines, so several vertices and pixels can be processed at the same time; this is called pixel-parallel rendering. Pixel-based parallelization can also be performed when multiple graphics cards are used to create tiles of the overall output image and the rendering queue is branched into multiple pipes (screen-space decomposition). On the other hand, when the data is divided in an initialization step, multiple subsets of graphics primitives can be processed at the same time. This is called object-parallel rendering or object-space decomposition.
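To make the two decomposition schemes concrete, here is a minimal illustrative sketch in Python (not part of ParCompMark; the function names and the horizontal-tile layout are our assumptions):

```python
def screen_space_decompose(width, height, n_pipes):
    """Pixel-parallel: split the output image into n_pipes horizontal tiles.

    Returns one (x, y, w, h) rectangle per graphics pipe."""
    rows = height // n_pipes
    tiles = []
    for i in range(n_pipes):
        # The last tile absorbs any remainder so the tiles cover the image.
        h = rows if i < n_pipes - 1 else height - i * rows
        tiles.append((0, i * rows, width, h))
    return tiles


def object_space_decompose(primitives, n_renderers):
    """Object-parallel: deal the primitives round-robin to the renderers."""
    return [primitives[i::n_renderers] for i in range(n_renderers)]
```

With an 800x600 frame and four pipes, `screen_space_decompose` yields four 800x150 tiles, while `object_space_decompose` splits a primitive list so that each renderer draws roughly the same number of primitives.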



Fig. 1. Sort-last parallel rendering approach with both screen- and object-level parallelization. Geometry processing and rasterization are done in the same rendering pipeline; the renderers then transmit the pixels over an interconnection network to compositing processors, which calculate the corresponding segment of the final output [6].
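The per-pixel combination performed by the compositing processors is typically either a depth test or alpha blending. A minimal, illustrative Python sketch of the two modes follows (not ParCompMark code; real implementations operate on whole framebuffers on the GPU):

```python
def composite_depth(front, back):
    """Depth compositing: per pixel, keep the fragment with the smaller z.

    Each pixel is an (r, g, b, z) tuple."""
    return [f if f[3] <= b[3] else b for f, b in zip(front, back)]


def composite_over(front, back):
    """Alpha compositing with the premultiplied 'over' operator [8].

    Each pixel is an (r, g, b, a) tuple with components in [0, 1]."""
    out = []
    for (fr, fg, fb, fa), (br, bg, bb, ba) in zip(front, back):
        k = 1.0 - fa  # transparency of the front fragment
        out.append((fr + k * br, fg + k * bg, fb + k * bb, fa + k * ba))
    return out
```

Note that, unlike the depth test, the over operator is order-dependent, which is why object-parallel alpha compositing must preserve a consistent back-to-front (or front-to-back) ordering of the composited framelets.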
Fig. 2. Parallel pipeline algorithm for the sort-last approach on distributed-memory architectures. Left: image area transfer for four compositing processes, performed in N-1 steps, where N is the number of compositors. Right: collecting the final data for an external or an internal process in one step [4].

Object-parallel rendering requires the combination of the subsets of pixels corresponding to different objects, which is called image compositing. This is a simple procedure that involves processing pixel attributes. Originally, alpha colours were introduced as a pixel-coverage model for compositing digital images [8]. Besides alpha-based compositing, spatial coverage can also be resolved by comparing depth values, when a subset of the Z-buffer is transmitted with the colour values [1]. These per-pixel calculations have to be carried out for all image elements, so compositing can become a bottleneck of the whole rendering system and make it unsuitable for interactive applications. However, when the compositing is also done in parallel, interactive compositing is possible. Several algorithms provide parallel image compositing on multiprocessor architectures, including direct send, parallel pipeline [4], and binary swap [5]. The main demands on interactive parallel visualization systems are good data scalability and performance scalability. Nowadays there are two significant trends in interactive parallel rendering that satisfy these demands. One of them, based on the sort-first approach, virtualizes multiple graphics cards and provides a single conceptual graphics pipeline. The other solution uses the


sort-last method and operates with object-space data distribution and image compositing. The drawback of this method is that larger modifications or a redesign of existing applications is required, but the advantage is that the load balance is more predictable and easier to design for. Because we intended to create a test bench especially for applications doing scientific visualization, simulation, and volume rendering, where it is mainly the size of the data that determines the rendering time, the latter approach was more favourable for our investigations. The specific software solution on which our work was implemented uses the parallel pipeline compositing algorithm, which consists of two parts (detailed in Fig. 2). The images to be composited are divided into N frame areas, where N is the number of compositing processes. In the first part of the algorithm these areas flow around through each node in N-1 steps, each consisting of a compositing and a communication stage. After N-1 steps, each processor holds a fully composited portion of the final frame. In the second part, the areas are collected in one step for an external display node or for an internal node. The clear benefit of this compositing scheme is that the amount of data transferred on the network is independent of the number of compositing processes.
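The first part of this scheme can be simulated in a few lines of Python. This is an illustrative model only: compositing is replaced by addition so the result is easy to verify, and communication is modelled as a ring shift of values, not real network transfers.

```python
def parallel_pipeline(local_images):
    """Simulate part one of the parallel pipeline algorithm.

    local_images[i][a] is processor i's rendering of frame area a.
    Returns a dict mapping each area to its fully composited value."""
    n = len(local_images)  # n compositing processes and n frame areas
    # Each processor starts with the partial composite of one area.
    held = [(i, local_images[i][i]) for i in range(n)]
    for _ in range(n - 1):  # the N-1 communication + compositing steps
        # Communication stage: ring shift, processor i receives from i-1.
        held = [held[(i - 1) % n] for i in range(n)]
        # Compositing stage: blend in the local rendering of that area
        # (modelled as addition here).
        held = [(a, partial + local_images[i][a])
                for i, (a, partial) in enumerate(held)]
    return dict(held)


# Four renderers, four areas; renderer i contributes 10**i to every area.
images = [[10 ** i] * 4 for i in range(4)]
final = parallel_pipeline(images)
# After N-1 steps every area holds contributions from all four renderers
# (1 + 10 + 100 + 1000 = 1111); part two would collect the areas in one step.
```

Because each processor only ever sends one 1/N-sized area per step, the total network traffic per frame stays one full image regardless of N, which is the independence property noted above.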

Our Benchmark Framework

ParCompMark (Parallel Image Compositing Benchmark Framework) is a framework for analyzing the ParaComp (Parallel Compositing) API specified by Hewlett-Packard in collaboration with Computational Engineering International (CEI) for multiprocessor systems. The benchmark framework should allow the definition and execution of complex parallel rendering tasks in a distributed environment while collecting measurement data from every participating process. The main requirements for such a benchmark program are:

- Provide means for defining the target cluster structure and the CPU and GPU resources to use in a benchmark.
- Define the roles of the participating resources, such as rendering and/or compositing.
- Specify the characteristics of the rendering task: shading model, type and number of primitives, specific algorithm, size of the final image, compositing mode (depth, alpha, etc.), and other specialties.
- Allow various measurements such as frame rate, latency, and network communication; collect measurement data in a format suitable for further processing (e.g. by scientific visualization tools).
- Provide means to simulate the workload of real environments by specifying multiple processes/threads, extra load on the elements of the cluster, etc.
- Facilitate the definition of time-varying and batch-like scenarios for benchmark test suites.
- Besides the display of the final image, allow the visualization of partial results, i.e. output from different nodes.

Based on the requirements listed above, we designed and implemented a general benchmarking framework which allows its users to concentrate solely on



Fig. 3. Basic definitions of ParCompMark illustrated on the component model of an example setup: cluster, host, node, process, buffer, and context. The cluster contains two hosts: one is used for rendering with two graphics cards (left), and the other (right) composites the output of the two renderers. Each node has one process and one corresponding buffer, used both for containing rendering results and for storing composited output.

specifying the cluster parameters, rendering characteristics, and scenarios, without the need to understand the details behind the operation of the framework. The system exposes its services through an integrated scripting engine, which provides all the necessary tools for defining and running the benchmarks. In the following, some relevant design and implementation aspects are detailed.

The model of the benchmark system was created to describe the physical cluster and a specific benchmark scenario, and to correspond to the ParaComp API. The basic model elements of ParCompMark are therefore the cluster and the scenario. The cluster describes the physical system; it contains hosts. The scenario is similar to the cluster but is viewed from the perspective of benchmark execution. Its elements are (see Fig. 3):

- A context groups a collection of nodes together by host strings. An individual context has a parallel structure in each node. The context defines which nodes participate in the frame of the context, and it also defines the basic parameters of this frame: its pixel format, its dimensions, etc. The operation of the contexts is controlled using a master-slave scheme. One of the nodes (referred to as the global or master node) creates a global context in which all the context parameters are specified and the list of participating nodes is passed. The remaining nodes (referred to as slave nodes) create local contexts that share this information from the global node.
- A node is an entity that produces framelets. These framelets can be the result of a compositing or a rendering step. A node has well-defined input and output corresponding to the framelets it requires and the framelets it provides.


- A process supplies a well-defined function for the node; it can provide data for, or require data from, only one context. Processes are the most basic elements in application scenario modelling, and each belongs to exactly one context. A node, however, can have several processes and can therefore be attached to several contexts. This property supports the hierarchical development of the logical cluster structure, giving the opportunity to replace a node with a sub-cluster.

For flexible benchmark scenario build-up, ParCompMark provides a scripting engine with script methods connected to the core code. The scripting has three levels:

- A low-level script is cluster-dependent and describes one complete benchmark execution. This script initializes and controls the overall execution.
- A dynamic script is cluster-independent procedural code that generates the low-level script. It can also describe one benchmark execution.
- A scenario script is a batch script. It describes a group of benchmark tests that parameterize dynamic scripts for arbitrary tests.

A low-level script contains all the information that the current benchmark test needs on a given cluster. It describes the nodes a host has to create and the nodes of a context. The hosts initialize themselves at benchmark execution from the low-level script. The input of a dynamic script is the physical cluster description and some benchmark parameters for customization. In a scenario script one can specify that ParCompMark execute several benchmarks, varying, for example, the host number, node number, or triangle count during execution.
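The relationship between the three levels can be illustrated with a hypothetical sketch in Python. The paper does not show ParCompMark's actual scripting language, so every name below is our assumption; the point is only that the batch level sweeps the parameters of the cluster-independent level, which in turn would generate and run a cluster-dependent low-level script.

```python
def dynamic_script(hosts, triangles, mode):
    """Stand-in for a dynamic script: in the real framework this would
    generate a cluster-dependent low-level script and execute it once."""
    return {"hosts": hosts, "triangles": triangles, "mode": mode}


def scenario_script():
    """Batch level: parameterize the dynamic script over compositing mode,
    host count, and triangle count, similar to the sweep in Section 4."""
    runs = []
    for mode in ("depth", "alpha"):
        for hosts in (1, 2, 3, 4):
            for triangles in (100_000, 250_000, 500_000, 1_000_000):
                runs.append(dynamic_script(hosts, triangles, mode))
    return runs
```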

Results

We used a Hewlett-Packard Scalable Visualization Array (SVA) consisting of 5 computing nodes for our experiments. Each node had a dual-core AMD Opteron 246 processor, an nVidia Quadro FX3450 graphics controller, and an InfiniBand network adapter. In the test scenarios we defined one compositing context with 1 to 4 rendering processes and an additional compositing process; each process was placed on a separate host. Each renderer produced a full-sized (800x600) image of its subset of the objects. As the rendering job, a simple scene was designed containing an increasing number of triangles with random vertices. The benchmark was executed in both depth- and alpha-compositing mode, and the frame rate of the compositing process was measured as the performance. According to our measurements (Fig. 4), we can draw the following conclusions. The performance of both depth compositing and alpha compositing can be improved by increasing the number of rendering-compositing nodes. For depth compositing this improvement is around 33% at 10^5 triangles per frame and 95% at 1.5x10^6 triangles per frame. For alpha compositing the improvement is better balanced: it is 81% at 10^5 triangles and 99% at 1.5x10^6 triangles. The reasons for this are twofold. First, alpha blending is a more rendering-expensive




Fig. 4. Compositing benchmark results on our five-node cluster: the average frame rate as performance (P) as a function of host count (computing power, C) and total number of triangles (work, W) (a, b), and performance scalability results plotting the frame rate as a function of host count for different numbers of triangles (c, d). Panels (a) and (c) show depth compositing; panels (b) and (d) show alpha compositing.

operation than depth testing. For a lower number of graphics primitives it is not worth decomposing the rendering for depth-based operations. On the other hand, alpha compositing requires less image data to be read back from the framebuffer (only four bytes per pixel in RGBA format), while depth compositing requires reading back the Z-buffer as well, which involves 5-6 bytes per pixel. Additionally, reading the Z-buffer can be slower on recent graphics hardware than reading alpha values. Therefore, although parallel rendering and compositing always increases the performance of the application above a very low triangle count, alpha-based rendering algorithms seem to be more generally scalable on the hardware we investigated. Since our cluster consists of widely prevailing hardware elements, we expect this result to generalize.
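The readback argument can be checked with simple arithmetic at the benchmark's 800x600 resolution. This is a sketch only: the exact per-pixel depth cost depends on the hardware, which is why the text above quotes a range of 5-6 bytes per pixel.

```python
WIDTH, HEIGHT = 800, 600
pixels = WIDTH * HEIGHT  # 480,000 pixels per framelet

alpha_readback = pixels * 4       # RGBA only: 4 bytes per pixel
depth_readback_low = pixels * 5   # RGBA + depth, lower bound from the text
depth_readback_high = pixels * 6  # RGBA + depth, upper bound from the text

print(alpha_readback)       # 1920000 bytes (~1.8 MiB) per frame
print(depth_readback_high)  # 2880000 bytes (~2.7 MiB) per frame
```

So depth compositing moves 25-50% more data per framelet than alpha compositing before any blending is done, which is consistent with the better scaling measured for the alpha mode.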


Conclusions and Future Work

A general benchmark framework was designed for measuring the performance of parallel compositing systems. This framework currently provides a test bench for ParaComp (Parallel Compositing) API implementations on an HP SVA cluster, analyzing compositing performance. In our future work, we intend to extend the benchmarking framework to support sort-first parallel rendering systems as well, in order to compare the scalability of the different parallel rendering trends. Furthermore, the framework could be extended with a mechanism capable of inferring, from the collected data of various measurements, the parameters of a cluster to be used for a certain task.

Acknowledgements
This work has been supported by OTKA (T042735), the GameTools FP6 (IST-2-004363) project, and by Hewlett-Packard and the National Office for Research and Technology (Hungary).

References
1. T. Duff. Compositing 3-D rendered images. In SIGGRAPH '85: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, pages 41-44, New York, NY, USA, 1985. ACM Press.
2. G. Humphreys, M. Eldridge, I. Buck, G. Stoll, M. Everett, and P. Hanrahan. WireGL: a scalable graphics system for clusters. In SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 129-140, New York, NY, USA, 2001. ACM Press.
3. W. J. L. and H. R. E. A proposal for a sort-middle cluster rendering system. In Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, Proceedings of the Second IEEE International Workshop, pages 36-38, 2003.
4. T.-Y. Lee, C. S. Raghavendra, and J. B. Nicholas. Image composition schemes for sort-last polygon rendering on 2D mesh multicomputers. IEEE Transactions on Visualization and Computer Graphics, 2(3):202-217, 1996.
5. K.-L. Ma, J. S. Painter, C. D. Hansen, and M. F. Krogh. Parallel volume rendering using binary-swap compositing. IEEE Computer Graphics and Applications, 14(4):59-68, 1994.
6. S. Molnar, M. Cox, and D. Ellsworth. A sorting classification of parallel rendering. IEEE Computer Graphics and Applications, 14(4):23-32, 1994.
7. C. Mueller. The sort-first rendering architecture for high-performance graphics. In Proceedings of the 1995 Symposium on Interactive 3D Graphics, pages 75 ff., New York, NY, USA, 1995. ACM Press.
8. T. Porter and T. Duff. Compositing digital images. In SIGGRAPH '84: Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 253-259, New York, NY, USA, 1984. ACM Press.
