Vitor Pamplona
vitor@vitorpamplona.com
Goals
N by N Queens Problem
http://en.wikipedia.org/wiki/Eight_queens_puzzle
Possibilities vs Solutions
Board size   Possibilities (N^N)             Solutions
 1           1                               1
 2           4                               0
 3           27                              0
 4           256                             2
 5           3,125                           10
 6           46,656                          4
 7           823,543                         40
 8           16,777,216                      92
 9           387,420,489                     352
10           10,000,000,000                  724
11           285,311,670,611                 2,680
12           8,916,100,448,256               14,200
13           302,875,106,592,253             73,712
14           11,112,006,825,558,016          365,596
15           437,893,890,380,859,375         2,279,184
16           18,446,744,073,709,551,616      14,772,512
17           827,240,261,886,336,764,177     95,815,104
Cu... what?
- Compute Unified Device Architecture
- C-style language and compiler
- Designed for parallel solutions
- Not a graphics API
- Runs on current graphics hardware (nVidia GeForce 8+)
- Faster transfers between CPU and GPU
- Compiler for CPU and GPU
Copyright Vitor F. Pamplona
Hardware Architecture

[Diagram, built up over several slides: the host side (CPU, cache, host memory) connects to the GPU device. The GPU's processors run warps of threads; each thread has its own local memory; shared memory is organized in banks; constant memory (64 kB) and texture memory (optimized for 2D access) are cached; global memory and host memory complete the hierarchy.]
Memory Access
Basics of Programming
[Diagram, built up over several slides: the application on the CPU does not talk to the GPU directly; calls go through the CUDA Driver, which manages the GPU.]
Startup
- Special Windows/Linux drivers
- CUDA Toolkit
- CUDA Developer SDK, which includes:
  - API documentation and programming guide
  - Compiler (nvcc)
  - Libraries (CUFFT, CUBLAS)
  - Source code examples
Host Example
float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);   // device pointers are released with cudaFree, not free
cudaFree(pOutput);
Kernel Example
__global__ void MyKernel(float* pInData, float* pOutData) {
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}
Competitors
Back to Work
- Monothread depth-first recursive
- Monothread depth-first plain
- N-threads depth-first plain
- Step-based breadth-first static memory
- Step-based breadth-first dynamic memory
- Plain depth-first dynamic memory version
Optimized implementation:
- Single thread
- Depth-first approach
- No recursion, no function calls
- Memory buffers :)
- Fast, really fast!
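The plain (non-recursive) variant can be sketched in C as an explicit backtracking loop over a column array — a reconstruction of the idea, not the talk's actual code:

```c
/* Iterative depth-first N-queens counter: no recursion, no helper calls.
   col[r] holds the column chosen in row r; backtrack by decrementing row. */
static long count_queens(int n) {
    int col[32];
    long solutions = 0;
    int row = 0;
    col[0] = -1;                        /* next candidate in row 0 is column 0 */
    while (row >= 0) {
        int c, found = 0;
        for (c = col[row] + 1; c < n; c++) {
            int r, valid = 1;
            for (r = 0; r < row; r++) {
                int d = col[r] - c;
                if (d == 0 || d == row - r || d == r - row) { valid = 0; break; }
            }
            if (valid) { found = 1; break; }
        }
        if (!found) { row--; continue; } /* row exhausted: backtrack */
        col[row] = c;
        if (row == n - 1) {
            solutions++;                 /* full board placed; keep scanning */
        } else {
            row++;
            col[row] = -1;               /* descend, start next row at column 0 */
        }
    }
    return solutions;
}
```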
- N threads, where N is the board size
- First column filled in the main thread
- Create N Linux pthreads, one thread per line
- Each thread processes the remaining N-1 columns
- Critical section: solutions++; saveSolution(board);
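The decomposition can be illustrated in plain C: each of the N workers counts solutions with the first-row queen fixed at its own column, and the totals are summed. The pthread plumbing (and the critical section around solutions++) is omitted; the launch loop below runs sequentially.

```c
/* Per-thread work for the N-threads scheme: count solutions with the
   first-row queen fixed. In the real version each call runs in its own
   pthread; helper names are illustrative. */
static long count_from(int *col, int row, int n) {
    long count = 0;
    for (int c = 0; c < n; c++) {
        int valid = 1;
        for (int r = 0; r < row; r++) {
            int d = col[r] - c;
            if (d == 0 || d == row - r || d == r - row) { valid = 0; break; }
        }
        if (!valid) continue;
        if (row == n - 1) { count++; continue; }
        col[row] = c;
        count += count_from(col, row + 1, n);
    }
    return count;
}

static long count_all(int n) {
    long total = 0;
    for (int c0 = 0; c0 < n; c0++) {    /* one "thread" per first-row column */
        int col[32];
        col[0] = c0;
        total += count_from(col, 1, n); /* each thread explores rows 1..N-1 */
    }
    return total;
}
```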
[Diagrams: each breadth-first step maps threads to candidate boards by index — threads 1…N per partial solution (N threads = num. solutions * N), then threads 1…N*N per partial solution in the finer decomposition.]
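The index mapping can be sketched in plain C (a sequential stand-in for one kernel launch): each step turns every valid k-row partial board into N candidates, one per column, so the step needs (num. partial solutions * N) threads. Buffer sizes and names are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

enum { CAP = 4096 };  /* enough partial boards for N <= 8 */

/* One breadth-first step: candidate (i, c) extends partial board i with a
   queen at (row k, column c); valid candidates are compacted into out. */
static long expand(const int *in, long nin, int k, int n, int *out) {
    long nout = 0;
    for (long i = 0; i < nin; i++)       /* thread id / N */
        for (int c = 0; c < n; c++) {    /* thread id % N */
            int valid = 1;
            for (int r = 0; r < k; r++) {
                int d = in[i * n + r] - c;
                if (d == 0 || d == k - r || d == r - k) { valid = 0; break; }
            }
            if (!valid) continue;
            memcpy(&out[nout * n], &in[i * n], sizeof(int) * k);
            out[nout * n + k] = c;
            nout++;
        }
    return nout;
}

/* N expansion steps; survivors of the last step are full solutions. */
static long bfs_count(int n) {
    int *cur = malloc(sizeof(int) * CAP * n);
    int *next = malloc(sizeof(int) * CAP * n);
    long count = 1;                      /* start from one empty board */
    for (int k = 0; k < n; k++) {
        count = expand(cur, count, k, n, next);
        int *tmp = cur; cur = next; next = tmp;
    }
    free(cur);
    free(next);
    return count;
}
```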
Good for GPU:
- Easy solution-thread mapping by indexes
- Fast kernels

Bad:
- One sort of the output for each step
- Synchronized memory access
- Global last-output index
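The global last-output index can be avoided: an exclusive prefix sum over the validity flags gives every element a private output slot, so all writes are contention-free. A hedged C sketch, with a sequential scan standing in for a parallel one:

```c
#include <stdlib.h>

/* Stream compaction without a shared output counter: offset[i] is the
   exclusive prefix sum of the valid flags, i.e. element i's output slot. */
static long compact(const int *in, const int *valid, long n, int *out) {
    long *offset = malloc(sizeof(long) * n);
    long total = 0;
    for (long i = 0; i < n; i++) {       /* exclusive scan (sequential here) */
        offset[i] = total;
        total += (valid[i] != 0);
    }
    for (long i = 0; i < n; i++)         /* each "thread" writes independently */
        if (valid[i]) out[offset[i]] = in[i];
    free(offset);
    return total;
}
```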
Test platforms
- Ubuntu, 4 GB RAM
- GPU: 8 multiprocessors (64 processors) at 650 MHz, 512 MB RAM at 900 MHz
- CUDA 1.0
Results: CPU
[Chart: CPU running time (y-axis 0-9000) for board sizes 12-14.]
Results
Running time per board size N = 1…9:

Solution                   Threads   1    2    3    4    5    6    7    8     9
GPU-breadth-first static   Sol*N     171  171  171  174  174  174  178  184   220
GPU-breadth-first dynamic  Sol*N     171  171  171  173  173  173  173  173   174
GPU-depth-first 1Thread    1         171  171  171  171  171  171  171  185   227
GPU-depth-first n-Threads  N         171  171  171  172  172  173  173  175   230
GPU-depth-first n-grids    N         171  171  171  171  171  173  173  173   177
GPU-depth-first n*n-grids  N*N       172  172  172  172  172  172  172  172   174
GPU-depth-first N^3        N^3       171  172  172  172  172  172  172  172   174
GPU-depth-first N^4        N^4       171  171  171  171  171  171  171  171   171
GPU-depth-first FULL       N^N       171  171  172  172  172  172  230  1682  11420
CPU-Plain                  1         2    2    2    2    2    2    2    2     2
CPU-Recursive              1         2    2    2    2    2    2    2    2     2
CPU-Plain-Threads          N         2    2    2    2    2    2    3    3     5
Results
Solutions (threads): GPU-breadth-first static (Sol*N), GPU-breadth-first dynamic (Sol*N), GPU-depth-first 1Thread (1), n-Threads (N), n-grids (N), n*n-grids (N*N), N^3, N^4, FULL (N^N), CPU-Plain (1), CPU-Recursive (1), CPU-Plain-Threads (N).

Running time per board size N = 11…17 ("Mem" = out of memory), values in the order above:

Board 11: 1234  218   1463  441   301   216  192  181
Board 12: 6184  407   7198  1561  824   424  267  199
Board 13: Mem   1481  7827  3604  1425  661  360
Board 14: Mem   7886
Board 15: Mem   Cont
Board 16: Mem   Mem
Board 17: Mem   Mem

Remaining measurements (columns unclear): 7562; 43488; 05:38.99; 18 35 17; 91 198 84; 32010; 04:40.95
Conclusions
- CUDA is slow: low use of the GPU's graphics resources; GLSL, HLSL and Cg are faster
- The compiler needs improvements
- More documentation on assembly-level optimization is needed
- Unstable: the GPU kills some processes (reason unknown)
- Performance depends heavily on the implementation
- Good for mixed solutions: CPU + GPU
Conclusions
- %, * and / are slow
- threadIdx and blockIdx are fantastic
- __shared__ memory helps
- CUDA locks the screen while processing (synchronized architecture)
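One standard workaround for the slow % and /: when the divisor is a power of two (the warp size 32, for example), they reduce to bit operations — a small illustrative sketch:

```c
/* For power-of-two divisors, modulo and division become bit operations:
   i % 32 == i & 31 and i / 32 == i >> 5 for unsigned i. */
static unsigned mod_warp(unsigned i) { return i & 31u; }  /* i % 32 */
static unsigned div_warp(unsigned i) { return i >> 5; }   /* i / 32 */
```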
Questions?
Vitor Pamplona
vitor@vitorpamplona.com