
GRS - GPU Radix Sort For Multifield Records*

Shibdas Bandyopadhyay and Sartaj Sahni, Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611. shibdas@ufl.edu, sahni@cise.ufl.edu
Abstract: We develop a radix sort algorithm, GRS, suitable for sorting multifield records on a graphics processing unit (GPU). We assume the ByField layout for the records to be sorted. GRS is benchmarked against the radix sort algorithm, SDK, in NVIDIA's CUDA SDK 3.0 [14] as well as the radix sort algorithm, SRTS, of Merrill and Grimshaw [11]. Although SRTS is faster than both GRS and SDK when sorting numbers as well as records that have a key and an additional 32-bit field, both GRS and SDK outperform SRTS on records with 2 or more fields (in addition to the key). GRS is consistently faster than SDK on numbers as well as records with 1 or more fields. When sorting records with 9 32-bit fields, GRS is up to 74% faster than SRTS and up to 55% faster than SDK. Thus, GRS is the fastest way to radix sort records with more than 1 32-bit field on a GPU.

Index Terms: Graphics Processing Units, sorting multifield records, radix sort.

I. INTRODUCTION

Contemporary graphics processing units (GPUs) are massively parallel manycore processors. NVIDIA's Tesla GPUs, for example, have 240 scalar processing cores (SPs) per chip [12]. These cores are partitioned into 30 Streaming Multiprocessors (SMs) with each SM comprising 8 SPs. Each SM has a 16KB local memory (called shared memory) shared by its cores and a total of 16,384 32-bit registers that may be utilized by the threads running on that SM. Besides registers and shared memory, on-chip memory shared by the cores in an SM also includes constant and texture caches. The 240 on-chip cores also share a 4GB off-chip global (or device) memory. Figure 1 shows a schematic of the Tesla architecture. With the introduction of CUDA (Compute Unified Device Architecture) [21], it has become possible to program GPUs using C. This has resulted in an explosion of research directed toward expanding the applicability of GPUs from their native computer graphics applications to a wide variety of high-performance computing applications. One of the very first GPU sorting algorithms, an adaptation of bitonic sort, was developed by Govindaraju et al. [4]. Since this algorithm was developed before the advent of CUDA, it was implemented using GPU pixel shaders. Zachmann et al. [5] improved on this sort algorithm by using BitonicTrees to reduce the number of comparisons while merging the bitonic sequences. Cederman et al. [3] have adapted quick sort for GPUs.
* This research was supported, in part, by the National Science Foundation under grant 0829916. The authors acknowledge the University of Florida High-Performance Computing Center for providing computational resources and support that have contributed to the research results reported within this paper. URL: http://hpc.ufl.edu.

Their adaptation first partitions the sequence to be sorted into subsequences, sorts these subsequences in parallel, and then merges the sorted subsequences in parallel. A hybrid sort algorithm that splits the data using bucket sort and then merges the data using a vectorized version of merge sort is proposed by Sintorn et al. [16]. Satish et al. [14] have developed an even faster merge sort. The fastest GPU merge sort algorithm known at this time is Warpsort [19]. Warpsort first creates sorted sequences using bitonic sort, each sorted sequence being created by a thread warp. The sorted sequences are merged in pairs until too few sequences remain. The remaining sequences are partitioned into subsequences that can be pairwise merged independently, and finally this pairwise merging is done with each warp merging a pair of subsequences. Experimental results reported in [19] indicate that Warpsort is about 30% faster than the merge sort algorithm of [14]. Another comparison-based sort for GPUs, GPU sample sort, was developed by Leischner et al. [10]. Sample sort is reported to be about 30% faster than the merge sort of [14], on average, when the keys are 32-bit integers. This would make sample sort competitive with Warpsort for 32-bit keys. For 64-bit keys, sample sort is twice as fast, on average, as the merge sort of [14]. Radix sort has been adapted to GPUs in [15], [20], [9], [14], [11]. Radix sort accomplishes the sort in phases where each phase sorts on a digit of the key using, typically, either a count sort or a bucket sort. The counting to be done in each phase may be carried out using a prefix sum or scan [2] operation that is done quite efficiently on a GPU [15]. Harris et al.'s [20] adaptation of radix sort to GPUs uses radix 2 (i.e., each phase sorts on a bit of the key) and uses the bit-split technique of [2] in each phase of the radix sort to reorder records by the bit being considered in that phase. This implementation of radix sort is available in the CUDA Data Parallel Primitives (CUDPP) library [20]. For 32-bit keys, this implementation of radix sort requires 32 phases. In each phase, expensive scatter operations to/from global memory are made. Le Grand et al. [9] reduce the number of phases, and hence the number of expensive scatters to global memory, by using a larger radix, 2^b, b > 1. A radix of 16, for example, reduces the number of phases from 32 to 8. The sort in each phase is done by first computing the histogram of the 2^b possible values that a radix-2^b digit may have. Satish et al. [14] further improve the 2^b-radix sort of Le Grand et al. [9] by sorting blocks of data in shared memory before writing to global memory. This reduces the randomness of the scatter to global memory, which, in turn, improves performance.

Fig. 1: NVIDIA's Tesla GPU [14]

The radix-sort implementation of Satish et al. [14] is included in NVIDIA's CUDA SDK 3.0. Merrill and Grimshaw [11] have developed an alternative radix sort, SRTS, for GPUs that is based on a highly optimized algorithm, developed by them, for the scan operation and on co-mingling several logical steps of a radix sort so as to reduce accesses to device/global memory. Presently, SRTS is the fastest GPU radix sort algorithm for integers as well as for records that have a 32-bit key and a 32-bit value field. The results of [10], [19] indicate that the radix sort algorithm of [14] outperforms both Warpsort [19] and sample sort [10] on 32-bit keys. These results, together with those of [11], imply that the radix sort of [11] is the fastest GPU sort algorithm for 32-bit integer keys. Our focus, in this paper, is to develop a GPU adaptation of radix sort that is suitable for sorting records that have many fields in addition to the key. The radix sort implementation of Satish et al. [14], which is included in the CUDA SDK 3.0, as well as that of Merrill and Grimshaw [11], are written for the case when each record has a 32-bit key and one additional 32-bit field. These implementations are easily extended to the case when each record has more additional fields. Although both implementations use the same record layout for records that have a key and a single additional 32-bit field, their natural extensions to records with more than 1 field result in different layout formats. The implementation of [14] readily extends to the ByField layout [1] while that of [11] results in a hybrid layout. In this paper, we target primarily the ByField layout. However, we make a simple adaptation to our code for the ByField layout enabling it to handle the hybrid layout as well. Our objective is to obtain a faster radix sort for GPUs when records have multiple fields. The strategy is to reduce the number of times these multifield records are read from or written to global memory. Our multifield sorting algorithm, which is an adaptation of the radix sort algorithm of [14], handily outperforms the algorithm of [14] when sorting records with zero or more fields (in addition to the key field).

While our algorithm is slower than that of [11] when the number of fields is either 0 or 1, it is considerably faster for records with two or more fields.

The remainder of this paper is organized as follows. In Section II we describe features of the NVIDIA Tesla GPU that affect program performance. In Section III, we describe three popular layouts for records, including the ByField layout used in this paper, as well as two overall strategies to handle the sorting of multifield records. The SDK 3.0 radix sort algorithm of Satish et al. [14] is described in Section IV and our proposed multifield GPU algorithm GRS is described in Section V. The key differences between these two radix sort algorithms are enumerated in Section VI and an experimental evaluation of the two algorithms is provided in Section VII.

II. NVIDIA TESLA PERFORMANCE CHARACTERISTICS

GPUs operate under the master-slave computing model (see, e.g., [13]) in which there is a host or master processor to which are attached a collection of slave processors. A possible configuration would have a GPU card attached to the bus of a PC. The PC CPU would be the host or master and the GPU processors would be the slaves. The CUDA programming model requires the user to write a program that runs on the host processor. At present, CUDA supports host programs written in C and C++ only, though there are plans to expand the set of available languages [21]. The host program may invoke kernels, which are C functions that run on the GPU slaves. A kernel may be instantiated in synchronous (the CPU waits for the kernel to complete before proceeding with other tasks) or asynchronous (the CPU continues with other tasks following the spawning of a kernel) mode. A kernel specifies the computation to be done by a thread. When a kernel is invoked by the host program, the host program specifies the number of threads that are to be created. Each thread is assigned a unique ID and CUDA provides C-language extensions to enable a kernel to determine which thread it is executing. The host program groups threads into blocks, by specifying a block size, at the time a kernel is invoked. Figure 2 shows the organization of threads used by CUDA.

Fig. 2: CUDA programming model [21]
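As a concrete, purely illustrative example of this model (the kernel and variable names below are ours and do not appear in the GRS code), the fragment defines a trivial kernel that increments every element of an array and launches it from the host with a chosen block size:

// Illustrative only: each thread derives a unique global index from its
// block and thread IDs and updates one array element.
__global__ void incrementKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
    if (idx < n)
        data[idx] += 1;
}

// Host side: launch a grid of blocks, each block holding 256 threads.
void incrementOnGPU(int *d_data, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();  // kernel launches are asynchronous; wait here
}

The block size chosen at launch time is what determines how threads are grouped into the blocks, and hence warps, discussed next.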

The GPU schedules the threads so that a block of threads runs on the cores of an SM. At any given time, an SM executes the threads of a single block, and the threads of a block can execute only on a single SM. Once a block begins to execute on an SM, that SM executes the block to completion. Each SM schedules the threads in its assigned block in groups of 32 threads called a warp. The partitioning into warps is fairly intuitive, with the first 32 threads forming the first warp, the next 32 threads forming the next warp, and so on. A half warp is a group of 16 threads. The first 16 threads in a warp form the first half warp and the remaining 16 threads form the second half warp. When an SM is ready to execute the next instruction, it selects a warp that is ready (i.e., its threads are not waiting for a memory transaction to complete) and executes the next instruction of every thread in the selected warp. Common instructions are executed in parallel using the 8 SPs in the SM. Non-common instructions are serialized. So, it is important, for performance, to avoid thread divergence within a warp. Some of the other factors important for performance are:

1) Since access to global memory is about an order of magnitude more expensive than access to registers and shared memory, data that are to be used several times should be read once from global memory and stored in registers or shared memory for reuse.
2) When the threads of a half warp access global memory, this access is accomplished via a series of memory transactions. The number of memory transactions equals the number of different 32-byte (64-byte, 128-byte, 128-byte) memory segments that the words to be accessed lie in when each thread accesses an 8-bit (16-bit, 32-bit, 64-bit) word. Given the cost of a global memory transaction, it pays to organize the computation so that the number of global memory transactions made by each half warp is minimized.
3) Shared memory is divided into 16 banks in round-robin fashion using words of size 32 bits. When the threads of a half warp access shared memory, the access is accomplished as a series of 1 or more memory transactions. Let S denote the set of addresses to be accessed. Each transaction is built by selecting one of the addresses in S to define the broadcast word. All addresses in S that are included in the broadcast word are removed from S. Next, up to one address from each of the remaining banks is removed from S. The set of removed addresses is serviced by a single memory transaction. Since the user has no way to specify the broadcast word, for maximum parallelism the computation should be organized so that, at any given time, the threads in a half warp access either words in different banks of shared memory or the same word of shared memory.
4) Volkov et al. [17] have observed greater throughput using operands in registers than operands in shared memory. So, data that is to be used often should be stored in registers rather than in shared memory.
5) Loop unrolling often improves performance. However, the #pragma unroll statement unrolls loops only under certain restrictive conditions. Manual loop unrolling, by replicating code and changing the loop stride, can be employed to overcome these limitations.
6) Arrays declared as register arrays get assigned to global memory when the CUDA compiler is unable to determine at compile time what the value of an array index is. This is typically the case when an array is indexed using a loop variable. Manually unrolling the loop so that all references to the array use indexes known at compile time ensures that the register array is, in fact, stored in registers (a small sketch of points 5 and 6 follows this list).
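The following sketch illustrates points 5 and 6; it is our own example, not code from GRS or the SDK. The first kernel indexes a small array with a loop variable, so the compiler may place the array in slow local (off-chip) memory; the manually unrolled version uses only compile-time-constant indices, which allows the array to stay in registers. (Current compilers often unroll such constant-trip-count loops automatically; the point is only where the indices must be known.)

// Version 1: buf is indexed with a loop variable; it may be spilled to
// local memory if the compiler cannot resolve the index at compile time.
__global__ void sumLooped(const int *in, int *out) {
    int buf[4];
    int sum = 0;
    for (int i = 0; i < 4; i++)
        buf[i] = in[threadIdx.x * 4 + i];
    for (int i = 0; i < 4; i++)
        sum += buf[i];
    out[threadIdx.x] = sum;
}

// Version 2: manually unrolled; every index into buf is a compile-time
// constant, so buf can be kept entirely in registers.
__global__ void sumUnrolled(const int *in, int *out) {
    int buf[4];
    buf[0] = in[threadIdx.x * 4 + 0];
    buf[1] = in[threadIdx.x * 4 + 1];
    buf[2] = in[threadIdx.x * 4 + 2];
    buf[3] = in[threadIdx.x * 4 + 3];
    out[threadIdx.x] = buf[0] + buf[1] + buf[2] + buf[3];
}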

III. MULTIFIELD RECORD LAYOUT AND SORTING

A record R is comprised of a key k and m other fields f_1, f_2, ..., f_m. For simplicity, we assume that the key and each other field occupies 32 bits. Let k_i be the key of record R_i and let f_ij, 1 ≤ j ≤ m, be this record's other fields. With our simplifying assumption of uniform size fields, we may view the n records to be sorted as a two-dimensional array fieldsArray[][] with fieldsArray[i][0] = k_i and fieldsArray[i][j] = f_ij, 1 ≤ j ≤ m, 1 ≤ i ≤ n. When this array is mapped to memory in column-major order, we get the ByField layout of [1]. This layout was also used for the AA-sort algorithm developed for the Cell Broadband Engine in [7] and is essentially the same as that used by the GPU radix sort algorithm of [14]. When the fields array is mapped to memory in row-major order, we get the ByRecord layout of [1]. A third layout, Hybrid, is employed in [11]. This is a hybrid between the ByField and ByRecord layouts. The keys are stored in an array and the remaining fields are stored using the ByRecord layout. Essentially then, in the Hybrid layout, we have two arrays. Each element of one array is a key and each element of the other array is a structure that contains all fields associated with an individual record. We observe that when m < 2, the ByField and Hybrid layouts are identical. When the sort begins with data in a particular layout format, the result of the sort must also be in that layout format. Our primary focus in this paper is the ByField layout. However, to make an apples-to-apples comparison with SRTS [11], which is the fastest known radix sort for integers and records with a single 32-bit field, we make a simple extension of our ByField algorithm to sort when the Hybrid layout is used.

At a high level, there are two very distinct approaches to sorting multifield records. In one, we construct a set of tuples (k_i, i), where k_i is the key of the ith record. Then, these tuples are sorted by extending a number sort algorithm so that whenever the number sort algorithm moves a key, the extended version moves a tuple. Once the tuples are sorted, the original records are rearranged by copying records from the fieldsArray to a new array, placing the records into their sorted positions in the new array, or in place using a cycle-chasing algorithm as described for a table sort in [6]. The second strategy is to extend a number sort so as to move an entire record every time its key is moved by the number sort. For relatively small records, the second strategy outperforms the first when the records are stored in uniform access memory. But, for large records, the first strategy is faster as it reduces the number of moves of large records. However, since reordering records according to a prescribed permutation, as is done in the first strategy, makes random accesses to memory, the second scheme outperforms the first (unless the record size is very large) when the records to be rearranged are in relatively slow memory such as the global memory of the GPU. For this reason, we focus, in this paper, on using the second strategy. That is, our sort algorithm moves all the fields of a record whenever its key is moved.
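To make the three layouts described at the beginning of this section concrete, the declarations below sketch how n records with a 32-bit key and m = 3 additional 32-bit fields might be stored. This is an illustrative sketch with names of our choosing; it is not the layout code of GRS or SRTS.

#define N 1000000   /* number of records (illustrative value) */
#define M 3         /* number of non-key fields (illustrative) */

/* ByField layout [1]: the column-major mapping of fieldsArray[][] puts
   each column (field) in its own contiguous array. */
unsigned int keyCol[N];        /* keyCol[i]        = fieldsArray[i][0] */
unsigned int fieldCol[M][N];   /* fieldCol[j-1][i] = fieldsArray[i][j] */

/* ByRecord layout [1]: the row-major mapping keeps all fields of a
   record together (array of structures). */
struct ByRecord {
    unsigned int key;
    unsigned int f[M];
} recs[N];

/* Hybrid layout (used by SRTS [11]): keys in one array; the remaining
   fields of each record grouped in a structure in a second array. */
unsigned int keys[N];
struct Payload { unsigned int f[M]; } payload[N];

When m < 2 the Payload structure degenerates to a single word, which is why the ByField and Hybrid layouts coincide in that case.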

IV. SDK RADIX SORT ALGORITHM OF [14]

Since our multifield sorting algorithm is an adaptation of the radix sort algorithm of [14], we describe this latter algorithm in some detail in this section. Two versions of the radix sort algorithm of [14] are available in NVIDIA's CUDA SDK 3.0. These versions differ only in that one version sorts numbers while the other sorts pairs of the form (key, value), where the keys are stored in one array and the values in another. Although the latter code assumes that values are 32 bits in size, the code is easily extended to handle values that are larger than 32 bits. Both versions use a radix of 2^b with b = 4 (b = 4 was determined, experimentally, to give best results). With b = 4 and 32-bit keys, the radix sort runs in 8 phases with each phase sorting on 4 bits of the key. Each phase of the radix sort is accomplished in 4 steps as below. These steps assume the data is partitioned (implicitly) into tiles of 1024 records each. There is a separate kernel for each of these 4 steps.

Step 1: Sort each tile on the b bits being considered in this phase using the bit-split algorithm of [2].
Step 2: Compute the histogram of each tile. Note that the histogram has 2^b entries with entry i giving the number of records in the tile for which the b bits considered in this phase equal i.
Step 3: Compute the prefix sum of the histograms of all tiles.
Step 4: Use the prefix sums computed in Step 3 to rearrange the records into sorted order of the b bits being considered.

For Step 1, each SM inputs a tile into shared memory using 256 threads. So, each thread reads 4 records. This is done as two reads. The first read inputs the 4 keys associated with these records using a variable of type int4, which is 128 bits long. The next read inputs the 4 values associated with these records using another variable of type int4. The input data is stored in registers. Next, the 256 threads collaborate to do b rounds of the bit-split scheme of [2]. In each round, shared memory is used to move data from the registers of one thread to those of other threads as needed. Following these b rounds, the tile is in sorted order (of the b bits being considered) in registers. The sorted tile is then written to global memory.

In Step 2, the keys of each half tile are input (in sorted order) using 256 threads per SM and the histogram of the b bits being considered is computed for the half tile input. For this computation, each thread inputs two keys from global memory and writes these to shared memory. The threads then determine the up to 15 places in the input half tile where the 4 bits being considered change. The histogram for the half tile is then written to global memory. In Step 3, the prefix sum of the half-tile histograms is computed using the prefix sum code available in CUDPP [20] and written to global memory. In Step 4, each SM inputs a half tile of data, which is sorted on the b bits being considered, and uses the computed prefix-sum histogram for this half tile to write the records in the half tile to their correct positions in global memory. Following this, all records are in sorted order, in global memory, with respect to the b bits being considered in this phase of the radix sort.

As can be seen, the SDK implementation of radix sort inputs from global memory and writes to global memory potentially long records twice: once in Step 1 and again in Step 4. In Step 2 only the keys are input and in Step 3 only histograms are input. Both steps, however, output only histograms. Although the half-tile histograms could be computed by combining Steps 1 and 2 and inputting only the keys from global to shared memory, the overall performance drops because the global memory writes done in Step 4 become more random in nature. By sorting the tiles in Step 1, the writes done in Step 4 take sufficiently less time to more than compensate for the input and output of entire records in Step 1.
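Steps 1-4 above distribute, across tiles and thread blocks, the work of one phase of a standard radix-2^b count sort: extract the current b-bit digit, build a histogram, turn it into an exclusive prefix sum, and scatter. For reference, a sequential sketch of one such phase is shown below; this is our own illustration for exposition, not the SDK code.

/* One phase of a radix-2^b count sort on the digit that starts at bit
   position "startbit". Sequential sketch for exposition only. */
#define B 4
#define RADIX (1 << B)   /* 16 */

void radixPhase(const unsigned int *keysIn, unsigned int *keysOut,
                int n, int startbit) {
    int hist[RADIX] = {0};
    int offset[RADIX];

    /* Histogram of the b-bit digit values. */
    for (int i = 0; i < n; i++)
        hist[(keysIn[i] >> startbit) & (RADIX - 1)]++;

    /* Exclusive prefix sum: offset[d] = number of keys with a smaller digit. */
    offset[0] = 0;
    for (int d = 1; d < RADIX; d++)
        offset[d] = offset[d - 1] + hist[d - 1];

    /* Stable scatter: each key goes to the next free slot of its digit. */
    for (int i = 0; i < n; i++) {
        int d = (keysIn[i] >> startbit) & (RADIX - 1);
        keysOut[offset[d]++] = keysIn[i];
    }
}

Eight such phases with b = 4 sort a 32-bit key; the GPU algorithms differ mainly in how the histogram, scan, and scatter are parallelized and in how much record data moves through global memory per phase.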

V. GPU RADIX SORT (GRS)

Like the SDK radix sort algorithm of [14], GRS accomplishes a radix sort using a radix of 2^b with b = 4. Each phase of the radix sort is done in 3 steps with each step using a different kernel. For purposes of discussion, we assume a tile size of t (t = 1024 in the SDK implementation). We define the rank of record i in a tile to be the number of records in the tile that precede record i and have the same key as record i. Since we compute ranks in each phase of the radix sort, key equality (for rank purposes) translates to equality of the b bits of the key being considered in a particular phase. Note that when the tile size is 1024, ranks lie in the range 0 through 1023. The three steps in each phase of GRS are:

Step 1: Compute the histogram for each tile as well as the rank of each record in the tile. This histogram is the same as that computed in Step 2 of the SDK radix sort.
Step 2: Compute the prefix sums of the histograms of all tiles.
Step 3: Use the ranks computed in Step 1 to sort the data within a tile. Next, use the prefix sums computed in Step 2 to rearrange the records into sorted order of the b bits being considered.

Step 1 requires us to read the keys of the records in a tile from global memory, compute the histogram and ranks, and then write the computed histogram and ranks to global memory. Note that only the key of a record is read from global memory, not the entire record as is done in Step 1 of the SDK algorithm. Step 2 is identical to Step 3 of the SDK algorithm. In Step 3, entire records are read from global memory. The records in a tile are first reordered in shared memory to get the sorted arrangement of Step 1 of the SDK algorithm and then written to global memory so as to obtain the sorted order following Step 4 of the SDK algorithm. This writing of records from shared memory to global memory is identical to that done in Step 4 of the SDK algorithm. So, our 3-step algorithm reduces the number of times non-key fields of records are read from global memory and written to global memory by 50% (i.e., from 2 to 1). The following subsections provide implementation details for the 3 steps of GRS.

A. Step 1: Histogram and Ranks

An SM computes the histograms and ranks for 64 tiles at a time employing 64 threads. Figure 3 gives a high-level description of the algorithm used by us for this purpose. Our algorithm processes 32 keys per tile in an iteration of the for loop. So, the number of for loop iterations is the tile size (t) divided by 32. In each iteration of the for loop, the 64 threads cooperate to read 32 keys from each of the 64 tiles. This is done in such a way (described later) that global memory transactions are 128 bytes each. The data that is read is written to shared memory. Next, each thread reads the 32 keys of a particular tile from shared memory and updates the tile histogram, which itself resides in shared memory. Although we have enough registers to accommodate the 64 histograms, CUDA relegates a register array to global memory unless it is able to determine, at compile time, the value of the array index. To maintain the histograms in registers, we would need an elaborate histogram update scheme whose computation time exceeds the time saved relative to making accesses to random points in an array stored in shared memory.

When a key is processed by a thread, the thread extracts the b bits in use for this phase of the radix sort. Suppose the extracted b bits have the value 12; then the current histogram value for 12 is the rank of the key. The new histogram value for 12 is 1 more than the current value. The determined rank is written to shared memory using the same location used by the key (i.e., the rank overwrites the key). Note that once a key is processed, it is no longer needed by Algorithm HR. Once the ranks for the current batch of 32 keys per tile have been computed, these are written to global memory and we proceed to the next batch of 32 keys per tile. To write the ranks to global memory, the 64 threads cooperate, ensuring that each transaction to global memory is 128 bytes. When the for loop terminates, we have successfully computed the histograms for the 64 tiles and these are written to global memory.
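Before looking at the parallel kernels, a sequential sketch of the per-tile histogram-and-rank computation may help; it is our own illustration of the definition above, not the GRS code. The rank of a key is simply the histogram count for its digit at the moment the key is processed.

/* Ranks within one tile for the b-bit digit starting at "startbit".
   rank[i] = number of earlier keys in the tile with the same digit value.
   Sequential sketch for exposition only. */
void tileHistogramAndRanks(const unsigned int *tileKeys, int t, int startbit,
                           int *hist /* size 16 */, int *rank /* size t */) {
    for (int d = 0; d < 16; d++)
        hist[d] = 0;
    for (int i = 0; i < t; i++) {
        int d = (tileKeys[i] >> startbit) & 0xF;
        rank[i] = hist[d];   /* current count is the rank ...            */
        hist[d]++;           /* ... and the count then goes up by one    */
    }
}

For example, if the digit values in a tile begin 3, 7, 3, 3, the corresponding ranks are 0, 0, 1, 2, and hist[3] ends up holding the number of digit-3 keys in the tile.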
Algorithm HR() {
    // Compute the histograms and ranks for 64 tiles
    itrs = t / 32;   // t = tile size
    for (i = 0; i < itrs; i++) {
        Read 32 keys from each of the 64 tiles;
        Determine the ranks and update the histograms;
        Write the ranks to global memory;
    }
    Write the histograms to global memory;
}

Fig. 3: Algorithm to compute the histograms and ranks of 64 tiles

To ensure 128-byte read transactions to global memory, we use an array that is declared as

__shared__ int4 sKeys4[512];

Each element of sKeys4 is comprised of 4 4-byte integers and the entire array is assigned to shared memory. A thread reads in 1 element of sKeys4 at a time from global memory and, in so doing, 4 keys are input. It takes 8 cooperating threads to read in 32 keys (or 128 bytes) from a tile. The 16 threads in a half warp read 32 keys from each of 2 tiles. This read takes two 128-byte memory transactions. With a single read of this type, the 64 threads together are able to read 32 keys from a total of 8 tiles. So, each thread needs to repeat this read 8 times (each time targeting a different tile) in order for the 64 threads to input 32 keys from each of the 64 tiles. Besides maximizing bandwidth utilization from global memory, we also need to be concerned about avoiding shared memory bank conflicts when the threads begin to process the keys of their assigned tiles. Since shared memory is divided into 16 banks of 4-byte words, storing the keys in the natural way results in the first key of each tile residing in the same bank. Since, in the next step, each thread processes its tile's keys in the same order, we would have shared memory conflicts that cause the reads of keys to be serialized within each half warp. To avoid this serialization, we use a circular shift pattern to map keys to the array sKeys4. The CUDA kernel code to do this read is given in Figure 4.

// tid is the thread id and bid is the block id
// Determine the first tile handled by this thread
startTileId = (bid * 64 + tid / 8) * (t / 4);
// starting key position in the tile
// keyOffset is the offset of the current 32 keys
keyPos = keyOffset + tid % 8;
// shared memory position to write the keys with
// a circular shift
sKeyPos = (tid / 8) * 8 + (((tid / 8) % 8) + (tid % 8)) % 8;
// some constants
tileSize8 = 8 * (t / 4);
tid4 = tid * 4;
// Initialize the histogram counters
for (i = 0; i < 16; i++) {
    sHist[tid * 16 + i] = 0;
}
// Wait for all threads to finish
__syncthreads();
curTileId = startTileId;
for (i = 0; i < 8; i++) {
    sKeys4[sKeyPos + i * 64] = keysIn4[keyPos + curTileId];
    curTileId += tileSize8;
}
__syncthreads();

Fig. 4: Reading the keys from global memory

As stated earlier, to compute the histograms and ranks, each thread works on the keys of a single tile. A thread inputs one element (4 keys) of sKeys4, updates the histogram using these 4 keys, and writes back the ranks of these 4 keys (as noted earlier, the rank equals the histogram value just before it is updated). Figure 5 gives the kernel code to update the histogram and compute the ranks for the 4 keys in one element of sKeys4.

// Update the histograms and calculate the ranks
// startbit is the starting bit position for this phase
int4 p4, r4;
for (i = 0; i < 8; i++) {
    p4 = sKeys4[tid4 + (i + tid) % 8];
    r4.x = sHist[((p4.x >> startbit) & 0xF) * 64 + tid]++;
    r4.y = sHist[((p4.y >> startbit) & 0xF) * 64 + tid]++;
    r4.z = sHist[((p4.z >> startbit) & 0xF) * 64 + tid]++;
    r4.w = sHist[((p4.w >> startbit) & 0xF) * 64 + tid]++;
    sKeys4[tid4 + (i + tid) % 8] = r4;
}
__syncthreads();

Fig. 5: Processing an element of sKeys4[]

Once the ranks have been computed, they are written to global memory using a process similar to that used to read in the keys. Figure 6 gives the kernel code. Here, ranks4[] is an array in global memory; its data type is int4.

curTileId = startTileId;
for (i = 0; i < 8; i++) {
    ranks4[keyPos + curTileId] = sKeys4[sKeyPos + i * 64];
    curTileId += tileSize8;
}
__syncthreads();

Fig. 6: Writing the ranks to global memory

When an SM completes the histogram computation for the 64 tiles assigned to it, it writes the computed 64 histograms to the array counters in global memory. If we view the 64 histograms as forming a 16 × 64 array, then this array is mapped to the one-dimensional array counters in column-major order. Figure 7 gives the kernel code for this.

// Calculate the id of this thread amongst all threads
// nTiles is the total number of tiles
globalTid = bid * 64 + tid;
for (i = 0; i < 16; i++) {
    counters[i * nTiles + globalTid] = sHist[i * 64 + tid];
}

Fig. 7: Writing the histograms to global memory
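Our reading of why the histograms are written in this column-major (digit-major) order is that a single exclusive prefix sum over counters (Step 2) then directly yields global starting offsets. For digit value d and tile j,

\[
\text{countersSum}[\,d \cdot \text{nTiles} + j\,] \;=\; \sum_{d' < d}\ \sum_{j'=0}^{\text{nTiles}-1} \text{hist}_{j'}[d'] \;+\; \sum_{j' < j} \text{hist}_{j'}[d],
\]

i.e., the number of keys whose digit is smaller than d plus the number of digit-d keys in earlier tiles. A record's final position is this offset plus its position among the digit-d keys of its own tile, which appears to be what the kernel in Figure 8 computes as sGOffset[radix] + tid - sTileCnt[radix].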

B. Step 2: Prefix sum of tile histograms

As in Step 3 of the SDK algorithm, the prefix sum of the tile histograms is computed with a CUDPP function call. The SDK radix sort computes the prefix sum of half-tile histograms while we do this for full tiles. Assuming both algorithms use the same tile size, the prefix sum in Step 2 of GRS involves half as many histograms as does the prefix sum in Step 3 of the SDK algorithm. This difference, however, results in a negligible reduction in run time for Step 2 of GRS versus Step 3 of SDK.

C. Step 3: Positioning records in a tile


To move the records in a tile to their correct overall sorted positions with respect to the b bits being considered in a particular phase, we need to determine the correct position of each record in the tile. The correct position is obtained by first computing the prefix sum of the tile histogram. This prefix sum may be computed using the warp scan algorithm used in the SDK radix sort code corresponding to the algorithm of [14]. The correct position of a record is its rank plus the histogram prefix sum corresponding to the b bits of the record key being considered in this phase.

As noted earlier, moving records directly from their current positions to their final correct positions is expensive because of the large number of memory transactions to global memory. Better performance is obtained when we first rearrange the records in a tile into sorted order of the b bits being considered and then move the records to their correct positions in the overall sorted order. We do this reordering one field at a time so that our code can handle records with a large number of fields. Figure 8 gives the kernel code. One SM reorders the records in a single tile. Since each thread handles 4 records, the number of threads used by an SM for this purpose (equivalently, the number of threads per thread block) is t/4.

kernel reorderData(keysOut, recsOut, keysIn4, recsIn4, counters, countersSum, ranks4) {
    // Read the records from recsIn4 and put them in
    // sorted order in recsOut
    // sTileCnt stores the tile histogram
    // sGOffset stores the global prefix-summed histogram
    __shared__ sTileCnt[16], sGOffset[16];
    // storage for keys and fields
    __shared__ int sKeys[t];
    __shared__ int sFields[t];
    int4 k4, r4;
    // Read the histograms from global memory
    if (tid < 16) {
        sTileCnt[tid] = counters[tid * nTiles + bid];
        sGOffset[tid] = countersSum[tid * nTiles + bid];
    }
    __syncthreads();
    // Perform a warp scan on the tile histogram
    sTileCnt = warp_scan(sTileCnt);
    __syncthreads();
    // Read the keys and their ranks
    k4 = keysIn4[bid * (t / 4) + tid];
    r4 = ranks4[bid * (t / 4) + tid];
    // Find the correct position within the tile and write the key to shared memory
    r4.x = r4.x + sTileCnt[(k4.x >> startbit) & 0xF];
    sKeys[r4.x] = k4.x;
    // Similar code for the y, z and w components of r4 and k4 comes here
    __syncthreads();
    // Determine the global rank. Each thread places 4 keys at positions
    // tid, (tid + t/4), (tid + t/2) and (tid + 3t/4)
    radix = (sKeys[tid] >> startbit) & 0xF;
    globalOffset.x = sGOffset[radix] + tid - sTileCnt[radix];
    keysOut[globalOffset.x] = sKeys[tid];
    // Place the other keys similarly
    // Reorder the fields using the shared memory
    // and the positions of the keys
    for (i = 0; i < nFields; i++) {
        fieldi = recsIn4[bid * (t / 4) + tid][i];
        sFields[r4.x] = fieldi.x;
        // Similar code for the y, z and w components of r4 and fieldi comes here
        __syncthreads();
        recsOut[globalOffset.x][i] = sFields[tid];
        // Place the other records similarly
    }
}

Fig. 8: Rearranging Data

VI. DIFFERENCES BETWEEN SDK RADIX SORT AND GRS

The essential differences between the SDK radix sort and GRS are summarized below.

1) The SDK code, as written, can handle only records that have one key and one 32-bit value field. The code may be extended easily to handle records with multiple 32-bit fields. For this extension, the rearrangement of records in Steps 1 and 4 is done one field at a time.
2) The SDK reads/writes the records to be sorted from/to global memory twice while GRS does this once. More precisely, suppose we are sorting n records that have a 4-byte key and m 4-byte fields each. We shall ignore the global memory I/O of the histograms in our analysis as, for reasonable tile sizes, the histograms represent a relatively small amount of the total global memory I/O. The SDK algorithm reads 4mn + 4n bytes of data and writes as much data in Step 1 (exclusive of the histogram I/O). In Step 2, only the keys are input from global memory. So, in this step, 4n bytes of data are read. Step 3 does I/O on histograms only. In Step 4, 4mn + 4n bytes are read and written. So, in all, the SDK algorithm reads 8mn + 12n bytes and writes 8mn + 8n bytes. The GRS algorithm, on the other hand, reads 4n bytes (the n keys) in Step 1 and writes 4n bytes (the ranks). Since the ranks require only 2 bytes each (2 bytes are sufficient so long as the tile size is no more than 2^16), the writes could be reduced to 2n bytes. The Step 2 I/O involves only histograms. In Step 3, all keys, fields, and ranks are read but only keys and fields are written back to global memory. So, in Step 3, we read 4mn + 8n bytes and write 4mn + 4n bytes. GRS reads a total of 4mn + 12n bytes and writes a total of 4mn + 8n bytes. We see that SDK reads (writes) 4mn bytes of data more than does GRS. Although our analysis for the SDK sort is based on the version for records with multiple fields, the analysis applies also to the number sort version (i.e., m = 0). So, GRS does the same amount of global memory I/O when we are sorting numbers but reads 4mn fewer bytes and writes 4mn fewer bytes when we are sorting records that have m 4-byte fields in addition to the key field.
3) GRS needs global memory to store ranks. Although our code stores each rank as a 4-byte integer, 2-byte integers suffice and the global memory needed to store the ranks is 2n bytes, where n is the number of records to be sorted. The SDK algorithm does not need this additional space. Both algorithms use the same amount of global memory space to store tile histograms and prefix sums.

VII. EXPERIMENTAL RESULTS

We programmed our GRS algorithm using NVIDIA CUDA SDK 3.0 and ran it on an NVIDIA Tesla C1060 GPU. We have empirically determined that a tile size t of 768 works best for GRS. For benchmarking purposes, we compare the performance of GRS with that of the SDK radix sort algorithm (SDK), extended to sort records with more than one field as described in Section VI, and the SRTS radix sort implementation of [11]. Since SRTS uses the Hybrid layout, our first comparison between GRS and SRTS uses the ByField layout for GRS and the Hybrid layout for SRTS. Later, we compare an adaptation of GRS to the Hybrid layout with SRTS. For each of our experiments, the run time reported is the average run time for 100 randomly generated sequences. When comparing SDK, GRS and SRTS, we used the same random sequences for the 3 algorithms.

First, we compared the performance of the three algorithms when sorting only numbers. For the first set of comparisons, we employed SDK, GRS and SRTS to sort from 1M to 10M numbers. As shown in Figure 9, SDK runs 20% to 7% faster than GRS for 1M to 3M numbers, respectively. However, GRS outperforms SDK when sorting 4M records and more. It runs 11% to 21% faster than SDK for 4M to 10M numbers, respectively. SRTS is the best performing algorithm of the three, running 53% to 57% faster than GRS for 1M to 10M numbers.

Fig. 9: Time to sort 1M to 10M integers

This performance differential is also observed when sorting even larger sets of numbers. We recorded the execution time for sorting up to 100M numbers. Figure 10 shows the run times of SDK, GRS and SRTS starting from 10M numbers with an increment of 10M. For 100M numbers, GRS runs 21% faster than SDK and 53% slower than SRTS.

Fig. 10: Time to sort 10M to 100M integers

Next, we ran GRS on records with 1 to 9 32-bit fields (in addition to the key). The fields are assumed to be 32-bit integers. Figure 11 shows the run time of GRS on 1M to 10M records whose fields are stored in the ByField layout. GRS takes 13ms to sort 1M records having 1 field, while on the other extreme, it takes 149ms to sort 10M records with 9 fields.

Fig. 11: GRS for 1M to 10M records with 1 to 9 Fields

We then compared SDK, GRS and SRTS for sorting records. We varied the number of fields from 1 to 9. We used the ByField format for SDK and GRS and the Hybrid format for SRTS. For sorting 4M records with a single field, GRS runs 25% faster than SDK and 62% slower than SRTS. GRS is faster than both SDK and SRTS when records have 2 or more fields, and SDK is faster than SRTS when records have 2 or more fields. In fact, when records have 9 fields, GRS runs 46% faster than SDK and 74% faster than SRTS, as shown in Figure 12.

Fig. 12: Time for sorting 4M multifield records

Figure 13 shows the run time of SDK, GRS and SRTS for 40M records with 1 to 9 fields. The results are similar to those for 4M records. GRS is 34% faster than SDK but 62% slower than SRTS for records with 1 field; for records with 2 or more fields, GRS is faster than both SDK and SRTS and SDK is faster than SRTS; and for records with 9 fields, GRS runs 55% faster than SDK and 74% faster than SRTS.

Fig. 13: Time for sorting 40M multifield records

To get an apples-to-apples comparison, we modified the rearrangement step of GRS to work with the Hybrid layout used in SRTS. We refer to the modified GRS as GRSH. Figures 14 and 15 plot the run times of GRSH and SRTS for 4M and 40M records, respectively. For 4M records with 1 field, GRSH runs 51% slower than SRTS but, for records with 2 or more fields, GRSH is faster. For records with 9 fields, GRSH is 59% faster than SRTS. For 40M records, GRSH is 48% slower than SRTS for records with a single field; faster than SRTS when records have 2 or more fields; and runs 60% faster than SRTS for records with 9 fields.

Fig. 14: SRTS and GRSH for 4M multifield records

Fig. 15: SRTS and GRSH for 40M multifield records

To get another perspective on sorting large records, we measured the sorting rate (in million records per second) for records with 9 fields. For 1M to 41M records, GRSH gives a consistent sorting rate of approximately 22 million records per second while the sorting rate for SRTS is approximately 9 million records per second (Figure 16).

Fig. 16: Sorting rate for 9-field records

VIII. CONCLUSION

We have developed a new radix sort algorithm, GRS, to sort records on a GPU using the ByField layout. GRS reads and writes records from/to global memory only once per radix sort phase. Our experiments indicate that, of the three contemporary GPU radix sort algorithms considered (SDK, SRTS, and GRS), SRTS is the fastest for records with 0 or 1 field. However, when records have 2 or more fields, GRS is the fastest. This conclusion remains the same when GRS is adapted to sort records in the Hybrid layout used by SRTS: GRSH is faster than SRTS when records have 2 or more fields.

REFERENCES


[1] Bandyopadhyay, S. and Sahni, S., Sorting Large Records on a Cell Broadband Engine, IEEE International Symposium on Computers and Communications (ISCC), 2010.
[2] Blelloch, G.E., Vector Models for Data-Parallel Computing, MIT Press, Cambridge, MA, USA, 1990.
[3] Cederman, D. and Tsigas, P., GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors, ACM Journal of Experimental Algorithmics (JEA), 14, 4, 2009.
[4] Govindaraju, N., Gray, J., Kumar, R. and Manocha, D., GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management, ACM SIGMOD International Conference on Management of Data, 2006.
[5] Greß, A. and Zachmann, G., GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.
[6] Horowitz, E., Sahni, S., and Mehta, D., Fundamentals of Data Structures in C++, Second Edition, Silicon Press, 2007.
[7] Inoue, H., Moriyama, T., Komatsu, H., and Nakatani, T., AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT), 2007.
[8] Knuth, D., The Art of Computer Programming: Sorting and Searching, Volume 3, Second Edition, Addison Wesley, 1998.
[9] Le Grand, S., Broad-Phase Collision Detection with CUDA, GPU Gems 3, Addison-Wesley Professional, 2007.
[10] Leischner, N., Osipov, V. and Sanders, P., GPU Sample Sort, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.


[11] Merrill, D. and Grimshaw, A., Revisiting Sorting for GPGPU Stream Architectures, University of Virginia, Department of Computer Science, Technical Report CS2010-03, 2010.
[12] Lindholm, E., Nickolls, J., Oberman, S. and Montrym, J., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, 28, 39-55, 2008.
[13] Sahni, S., Scheduling Master-Slave Multiprocessor Systems, IEEE Trans. on Computers, 45, 10, 1195-1199, 1996.
[14] Satish, N., Harris, M. and Garland, M., Designing Efficient Sorting Algorithms for Manycore GPUs, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2009.
[15] Sengupta, S., Harris, M., Zhang, Y. and Owens, J.D., Scan Primitives for GPU Computing, Graphics Hardware 2007, 97-106, 2007.
[16] Sintorn, E. and Assarsson, U., Fast Parallel GPU-Sorting Using a Hybrid Algorithm, Journal of Parallel and Distributed Computing, 10, 1381-1388, 2008.
[17] Volkov, V. and Demmel, J.W., Benchmarking GPUs to Tune Dense Linear Algebra, ACM/IEEE Conference on Supercomputing, 2008.
[18] Won, Y. and Sahni, S., Hypercube-to-Host Sorting, Journal of Supercomputing, 3, 41-61, 1989.
[19] Ye, X., Fan, D., Lin, W., Yuan, N. and Ienne, P., High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs, IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.
[20] CUDPP: CUDA Data-Parallel Primitives Library, http://www.gpgpu.org/developer/cudpp/, 2009.
[21] NVIDIA CUDA Programming Guide, NVIDIA Corporation, Version 3.0, Feb 2010.
