Vitor Pamplona
vitor@vitorpamplona.com
Goals
N by N Queens Problem
http://en.wikipedia.org/wiki/Eight_queens_puzzle
Possibilities vs Solutions
Board size   Possibilities (N^N)             Solutions
 1           1                               1
 2           4                               0
 3           27                              0
 4           256                             2
 5           3,125                           10
 6           46,656                          4
 7           823,543                         40
 8           16,777,216                      92
 9           387,420,489                     352
10           10,000,000,000                  724
11           285,311,670,611                 2,680
12           8,916,100,448,256               14,200
13           302,875,106,592,253             73,712
14           11,112,006,825,558,016          365,596
15           437,893,890,380,859,375         2,279,184
16           18,446,744,073,709,551,616      14,772,512
17           827,240,261,886,336,764,177     95,815,104
Cu... what?
- Compute Unified Device Architecture
- C-style language and compiler
- Designed for parallel solutions
- Not a graphics API
- Runs on current graphics hardware (nVidia GeForce 8+)
- Faster transfers between CPU and GPU
- Compiler for CPU and GPU
Copyright Vitor F. Pamplona
Hardware Architecture

[Diagram, built up over several slides: the host side (CPU, cache, host memory) connects to the GPU device. The GPU's processors run warps of threads; each thread has its own local memory; shared memory is organized in banks; constant memory (64 kB) and texture memory (optimized for 2D access) are cached; global memory and host memory complete the hierarchy.]
Memory Access
Basics of Programming
[Diagram, built up over several slides: the application on the CPU does not talk to the GPU directly; calls go through the CUDA Driver, which manages the GPU.]
Startup
- Special Windows/Linux drivers
- CUDA Toolkit
- CUDA Developer SDK, which includes:
  - API documentation and programming guide
  - Compiler (nvcc)
  - Libraries (CUFFT, CUBLAS)
  - Source code examples
Host Example
float *pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);   // device pointers are released with cudaFree, not free
cudaFree(pOutput);
Kernel Example
__global__ void MyKernel(float* pInData, float* pOutData) {
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}
Competitors
Back to Work
- Monothread depth-first recursive
- Monothread depth-first plain
- N-threads depth-first plain
- Step-based breadth-first static memory
- Step-based breadth-first dynamic memory
- Plain depth-first dynamic memory version
Optimized implementation:
- Single thread
- Depth-first approach
- No recursion, no function calls
- Memory buffers :)
- Fast, really fast!
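The plain (non-recursive) variant can be sketched in C as an explicit backtracking loop over a column array — a reconstruction of the idea, not the talk's actual code:

```c
/* Iterative depth-first N-queens counter: no recursion, no helper calls.
   col[r] holds the column chosen in row r; backtrack by decrementing row. */
static long count_queens(int n) {
    int col[32];
    long solutions = 0;
    int row = 0;
    col[0] = -1;                        /* next candidate in row 0 is column 0 */
    while (row >= 0) {
        int c, found = 0;
        for (c = col[row] + 1; c < n; c++) {
            int r, valid = 1;
            for (r = 0; r < row; r++) {
                int d = col[r] - c;
                if (d == 0 || d == row - r || d == r - row) { valid = 0; break; }
            }
            if (valid) { found = 1; break; }
        }
        if (!found) { row--; continue; } /* row exhausted: backtrack */
        col[row] = c;
        if (row == n - 1) {
            solutions++;                 /* full board placed; keep scanning */
        } else {
            row++;
            col[row] = -1;               /* descend, start next row at column 0 */
        }
    }
    return solutions;
}
```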
- N threads, where N is the board size
- First column filled in the main thread
- Create N Linux pthreads, one thread per line
- Each thread processes the remaining N-1 columns
- Critical section: solutions++; saveSolution(board);
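The decomposition can be illustrated in plain C: each of the N workers counts solutions with the first-row queen fixed at its own column, and the totals are summed. The pthread plumbing (and the critical section around solutions++) is omitted; the launch loop below runs sequentially.

```c
/* Per-thread work for the N-threads scheme: count solutions with the
   first-row queen fixed. In the real version each call runs in its own
   pthread; helper names are illustrative. */
static long count_from(int *col, int row, int n) {
    long count = 0;
    for (int c = 0; c < n; c++) {
        int valid = 1;
        for (int r = 0; r < row; r++) {
            int d = col[r] - c;
            if (d == 0 || d == row - r || d == r - row) { valid = 0; break; }
        }
        if (!valid) continue;
        if (row == n - 1) { count++; continue; }
        col[row] = c;
        count += count_from(col, row + 1, n);
    }
    return count;
}

static long count_all(int n) {
    long total = 0;
    for (int c0 = 0; c0 < n; c0++) {    /* one "thread" per first-row column */
        int col[32];
        col[0] = c0;
        total += count_from(col, 1, n); /* each thread explores rows 1..N-1 */
    }
    return total;
}
```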
[Diagrams: each breadth-first step maps threads to candidate boards by index — threads 1…N per partial solution (N threads = num. solutions * N), then threads 1…N*N per partial solution in the finer decomposition.]
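The index mapping can be sketched in plain C (a sequential stand-in for one kernel launch): each step turns every valid k-row partial board into N candidates, one per column, so the step needs (num. partial solutions * N) threads. Buffer sizes and names are illustrative assumptions.

```c
#include <stdlib.h>
#include <string.h>

enum { CAP = 4096 };  /* enough partial boards for N <= 8 */

/* One breadth-first step: candidate (i, c) extends partial board i with a
   queen at (row k, column c); valid candidates are compacted into out. */
static long expand(const int *in, long nin, int k, int n, int *out) {
    long nout = 0;
    for (long i = 0; i < nin; i++)       /* thread id / N */
        for (int c = 0; c < n; c++) {    /* thread id % N */
            int valid = 1;
            for (int r = 0; r < k; r++) {
                int d = in[i * n + r] - c;
                if (d == 0 || d == k - r || d == r - k) { valid = 0; break; }
            }
            if (!valid) continue;
            memcpy(&out[nout * n], &in[i * n], sizeof(int) * k);
            out[nout * n + k] = c;
            nout++;
        }
    return nout;
}

/* N expansion steps; survivors of the last step are full solutions. */
static long bfs_count(int n) {
    int *cur = malloc(sizeof(int) * CAP * n);
    int *next = malloc(sizeof(int) * CAP * n);
    long count = 1;                      /* start from one empty board */
    for (int k = 0; k < n; k++) {
        count = expand(cur, count, k, n, next);
        int *tmp = cur; cur = next; next = tmp;
    }
    free(cur);
    free(next);
    return count;
}
```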
Good for GPU:
- Easy solution-thread mapping by indexes
- Fast kernels

Bad:
- One sort of the output for each step
- Synchronized memory access
- Global last-output index
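The global last-output index can be avoided: an exclusive prefix sum over the validity flags gives every element a private output slot, so all writes are contention-free. A hedged C sketch, with a sequential scan standing in for a parallel one:

```c
#include <stdlib.h>

/* Stream compaction without a shared output counter: offset[i] is the
   exclusive prefix sum of the valid flags, i.e. element i's output slot. */
static long compact(const int *in, const int *valid, long n, int *out) {
    long *offset = malloc(sizeof(long) * n);
    long total = 0;
    for (long i = 0; i < n; i++) {       /* exclusive scan (sequential here) */
        offset[i] = total;
        total += (valid[i] != 0);
    }
    for (long i = 0; i < n; i++)         /* each "thread" writes independently */
        if (valid[i]) out[offset[i]] = in[i];
    free(offset);
    return total;
}
```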
Test platforms
- Ubuntu, 4 GB RAM
- GPU: 8 multiprocessors (64 processors) at 650 MHz, 512 MB RAM at 900 MHz
- CUDA 1.0
Results: CPU
[Chart: CPU running time (y-axis 0-9000) for board sizes 12-14.]
Results
Running time per board size N = 1…9:

Solution                   Threads   1    2    3    4    5    6    7    8     9
GPU-breadth-first static   Sol*N     171  171  171  174  174  174  178  184   220
GPU-breadth-first dynamic  Sol*N     171  171  171  173  173  173  173  173   174
GPU-depth-first 1Thread    1         171  171  171  171  171  171  171  185   227
GPU-depth-first n-Threads  N         171  171  171  172  172  173  173  175   230
GPU-depth-first n-grids    N         171  171  171  171  171  173  173  173   177
GPU-depth-first n*n-grids  N*N       172  172  172  172  172  172  172  172   174
GPU-depth-first N^3        N^3       171  172  172  172  172  172  172  172   174
GPU-depth-first N^4        N^4       171  171  171  171  171  171  171  171   171
GPU-depth-first FULL       N^N       171  171  172  172  172  172  230  1682  11420
CPU-Plain                  1         2    2    2    2    2    2    2    2     2
CPU-Recursive              1         2    2    2    2    2    2    2    2     2
CPU-Plain-Threads          N         2    2    2    2    2    2    3    3     5
Results
Solutions (threads): GPU-breadth-first static (Sol*N), GPU-breadth-first dynamic (Sol*N), GPU-depth-first 1Thread (1), n-Threads (N), n-grids (N), n*n-grids (N*N), N^3, N^4, FULL (N^N), CPU-Plain (1), CPU-Recursive (1), CPU-Plain-Threads (N).

Running time per board size N = 11…17 ("Mem" = out of memory), values in the order above:

Board 11: 1234  218   1463  441   301   216  192  181
Board 12: 6184  407   7198  1561  824   424  267  199
Board 13: Mem   1481  7827  3604  1425  661  360
Board 14: Mem   7886
Board 15: Mem   Cont
Board 16: Mem   Mem
Board 17: Mem   Mem

Remaining measurements (columns unclear): 7562; 43488; 05:38.99; 18 35 17; 91 198 84; 32010; 04:40.95
Conclusions
- CUDA is slow: low use of the GPU's graphics resources; GLSL, HLSL and Cg are faster
- The compiler needs improvements
- More documentation on assembly-level optimization is needed
- Unstable: the GPU kills some processes (reason unknown)
- Performance depends heavily on the implementation
- Good for mixed solutions: CPU + GPU
Conclusions
- %, * and / are slow
- threadIdx and blockIdx are fantastic
- __shared__ memory helps
- CUDA locks the screen while processing (synchronized architecture)
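One standard workaround for the slow % and /: when the divisor is a power of two (the warp size 32, for example), they reduce to bit operations — a small illustrative sketch:

```c
/* For power-of-two divisors, modulo and division become bit operations:
   i % 32 == i & 31 and i / 32 == i >> 5 for unsigned i. */
static unsigned mod_warp(unsigned i) { return i & 31u; }  /* i % 32 */
static unsigned div_warp(unsigned i) { return i >> 5; }   /* i / 32 */
```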
Questions?
Vitor Pamplona
vitor@vitorpamplona.com