
n-Queens Problem: A Comparison Between CPU and GPU using C++ and Cuda

Vitor Pamplona
vitor@vitorpamplona.com

Goals

Learn Cuda and its limitations
Implement some n-Queens solutions
  Cuda version
  C++ version
Compare performance
Check for possible papers
  Parallel processing
  Computer graphics

Copyright Vitor F. Pamplona

N by N Queens Problem

http://en.wikipedia.org/wiki/Eight_queens_puzzle


Possibilities vs Solutions
Board Size  Possibilities                 Solutions
1           1                             1
2           4                             0
3           27                            0
4           256                           2
5           3,125                         10
6           46,656                        4
7           823,543                       40
8           16,777,216                    92
9           387,420,489                   352
10          10,000,000,000                724
11          285,311,670,611               2,680
12          8,916,100,448,256             14,200
13          302,875,106,592,253           73,712
14          11,112,006,825,558,016        365,596
15          437,893,890,380,859,375       2,279,184
16          18,446,744,073,709,551,616    14,772,512
17          827,240,261,886,336,764,177   95,815,104
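The table can be reproduced with a small brute-force search. A minimal sketch (the function names are mine, not from the talk): possibilities are simply N^N, since each of the N columns can hold a queen in any of N rows, and solutions are counted by backtracking over columns.

```cpp
#include <cassert>
#include <cstdint>

// N^N candidate boards: one queen per column, any of N rows.
uint64_t possibilities(int n) {
    uint64_t p = 1;
    for (int i = 0; i < n; ++i) p *= (uint64_t)n;
    return p;
}

// Count valid boards by depth-first backtracking over columns.
// rows[c] holds the row of the queen already placed in column c.
int countSolutions(int n, int col, int* rows) {
    if (col == n) return 1;
    int total = 0;
    for (int r = 0; r < n; ++r) {
        bool ok = true;
        for (int c = 0; c < col; ++c)
            if (rows[c] == r || col - c == r - rows[c] || col - c == rows[c] - r) {
                ok = false;   // same row or same diagonal as column c
                break;
            }
        if (ok) {
            rows[col] = r;
            total += countSolutions(n, col + 1, rows);
        }
    }
    return total;
}
```

For example, possibilities(8) gives 16,777,216 and countSolutions(8, 0, rows) gives 92, matching the table's row for an 8x8 board.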


Cu... what?

Compute Unified Device Architecture
C-style language and compiler
Designed for parallel solutions
Not a graphics API
Runs on current graphics hardware
  nVidia GeForce 8+
Faster transfers between CPU and GPU
Compiler for CPU and GPU

Hardware Architecture

(Diagram, built up over several slides. On the host side: the CPU, its cache, and host memory. On the device side: the GPU runs many threads grouped into warps; each thread has its own local memory; shared memory is split into banks; constant memory (64kB) and texture memory (optimized for 2D access) are both cached; global memory is filled from host memory.)

Memory Access

(diagram)

Basics of Programming

(diagram)

Hardware Architecture

(diagrams over several slides)

Libraries and Access

(Stack diagram, built up over several slides: the Application runs on the CPU and reaches the GPU through the CUDA Libraries, the CUDA Runtime, and finally the CUDA Driver.)

Startup

Special Windows/Linux drivers
CUDA Toolkit
CUDA Developer SDK, which includes:
  API documentation
  Programming guide
  Compiler (nvcc)
  Libraries (CUFFT, CUBLAS)
  Source code examples

Host Example
float* pHostData = (float*) malloc(sizeof(float) * 256);
// fill in the data array...

// allocate global memory
float *pInput, *pOutput;
cudaMalloc((void**) &pInput, sizeof(float) * 256);
cudaMalloc((void**) &pOutput, sizeof(float) * 256);

// host memory to global memory
cudaMemcpy(pInput, pHostData, sizeof(float) * 256, cudaMemcpyHostToDevice);

dim3 nDimGrid(1, 1, 1);   // 1 block only
dim3 nDimBlock(32, 1, 1); // 32 threads per block
int nSharedMemBytes = sizeof(float) * 32;
MyKernel<<<nDimGrid, nDimBlock, nSharedMemBytes>>>(pInput, pOutput);

// global memory to host memory
cudaMemcpy(pHostData, pOutput, sizeof(float) * 256, cudaMemcpyDeviceToHost);

free(pHostData);
cudaFree(pInput);   // device pointers are released with cudaFree, not free
cudaFree(pOutput);

Kernel Example
__global__ void MyKernel(float* pInData, float* pOutData) {
    extern __shared__ float sharedData[];
    const unsigned int tid = threadIdx.x;
    const unsigned int num_threads = blockDim.x;

    // global memory to shared memory
    sharedData[tid] = pInData[tid];
    __syncthreads();

    // do something
    sharedData[tid] = (float) num_threads * sharedData[tid];
    __syncthreads();

    // shared memory to global memory
    pOutData[tid] = sharedData[tid];
}
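The kernel above simply scales each element by the number of threads in the block. When debugging, it helps to keep a host-side reference to compare the GPU output against; a minimal sketch (the function name is mine, not from the talk):

```cpp
#include <vector>

// Host-side reference of MyKernel: out[i] = blockDim.x * in[i].
// Useful for validating the buffer copied back with cudaMemcpy.
std::vector<float> myKernelReference(const std::vector<float>& in,
                                     unsigned int numThreads) {
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (float)numThreads * in[i];
    return out;
}
```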

Competitors

AMD/ATI Close to Metal (CTM)
RapidMind
Acceleware
PeakStream
  Unavailable since acquisition by Google
BrookGPU
OpenGL/Direct3D + GLSL/HLSL/Cg
BSGP

Back to Work

Brute force implementations
3 solutions for CPU
  Monothread depth-first recursive
  Monothread depth-first plain
  N-threads depth-first plain
3 solutions for GPU
  Step-based breadth-first, static memory
  Step-based breadth-first, dynamic memory
  Plain depth-first, dynamic memory
CPU Monothread Depth-first Plain

Optimized implementation
Single thread
Depth-first approach
No recursion, no function calls
Memory buffers :)
Fast, really fast!
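The bullets above describe a search loop without recursion or function calls; such a loop can be sketched with an explicit column cursor over a rows[] buffer. This is my illustration of the idea, not the talk's actual code:

```cpp
// Iterative depth-first n-queens: no recursion, no per-node function
// calls, just a column cursor walking a rows[] buffer.
long countSolutionsIterative(int n) {
    int rows[32];            // rows[c] = candidate row for column c (n <= 32)
    long solutions = 0;
    int col = 0;
    rows[0] = -1;
    while (col >= 0) {
        int r = ++rows[col];                  // advance candidate in this column
        if (r >= n) { --col; continue; }      // column exhausted: backtrack
        bool ok = true;
        for (int c = 0; c < col; ++c)
            if (rows[c] == r || col - c == r - rows[c] || col - c == rows[c] - r) {
                ok = false;                    // conflicts with column c
                break;
            }
        if (!ok) continue;                    // try the next row
        if (col == n - 1) { ++solutions; continue; }  // full board found
        rows[++col] = -1;                     // descend into the next column
    }
    return solutions;
}
```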

CPU N-threads Depth-first Plain

N threads, where N is the board size
First column filled in the main thread
Create N Linux pthreads
  One thread for each line
  Each thread processes N-1 columns
Critical section:
  solutions++;
  saveSolution(board);
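The scheme above can be sketched in a few lines. The talk uses Linux pthreads; std::thread is shown here for brevity, and all names are mine. Each worker gets a fixed first-column row and searches the remaining columns; only the shared counter update is the critical section.

```cpp
#include <mutex>
#include <thread>
#include <vector>

static bool safeAt(const std::vector<int>& rows, int col, int r) {
    for (int c = 0; c < col; ++c)
        if (rows[c] == r || col - c == r - rows[c] || col - c == rows[c] - r)
            return false;
    return true;
}

static long searchFrom(std::vector<int>& rows, int col, int n) {
    if (col == n) return 1;
    long total = 0;
    for (int r = 0; r < n; ++r)
        if (safeAt(rows, col, r)) { rows[col] = r; total += searchFrom(rows, col + 1, n); }
    return total;
}

// One worker per first-column row: the main thread fixes column 0,
// each worker searches columns 1..N-1.
long countSolutionsThreaded(int n) {
    long solutions = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int firstRow = 0; firstRow < n; ++firstRow) {
        workers.emplace_back([&, firstRow] {
            std::vector<int> rows(n);
            rows[0] = firstRow;               // column 0 fixed by the main thread
            long local = searchFrom(rows, 1, n);
            std::lock_guard<std::mutex> lock(m);   // critical section
            solutions += local;
        });
    }
    for (auto& t : workers) t.join();
    return solutions;
}
```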

GPU Step Breadth-first

(Diagram, built up over three steps. Step 1: the input holds the empty board; threads 1..N each place a queen in the first column, producing the partial boards 1, 2, 3, ..., N in the output. Step 2: threads 1..N*N extend each one-queen board with a second queen: 1 1, 1 2, 1 3, ... Step 3: the surviving two-queen boards (1 3, 1 4, 1 5, ...) are extended again. At every step, Threads = Num. Solutions * N.)
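The step scheme above maps naturally onto a frontier of partial boards. A CPU sketch of the idea (my code, not the talk's): each step expands every valid partial board by one column, and each (board, row) pair plays the role of one GPU thread.

```cpp
#include <vector>

// Breadth-first n-queens: the frontier holds all valid partial boards
// with `col` queens; each step maps every (board, row) pair -- one GPU
// thread's work item -- to at most one extended board.
long countSolutionsBreadthFirst(int n) {
    std::vector<std::vector<int>> frontier(1);   // one empty board
    for (int col = 0; col < n; ++col) {
        std::vector<std::vector<int>> next;
        for (const auto& board : frontier) {     // frontier.size() * n work items
            for (int r = 0; r < n; ++r) {
                bool ok = true;
                for (int c = 0; c < col; ++c)
                    if (board[c] == r || col - c == r - board[c] || col - c == board[c] - r) {
                        ok = false;
                        break;
                    }
                if (ok) {
                    auto b = board;
                    b.push_back(r);
                    next.push_back(std::move(b));
                }
            }
        }
        frontier.swap(next);
    }
    return (long)frontier.size();                // complete boards = solutions
}
```

Note how the output of each step becomes the input of the next, which is exactly why the GPU version needs either a pre-sized (static) or growable (dynamic) output buffer, discussed on the next slide.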

Why a Breadth-first solution?

Graphics processors are not Intel/AMD
  Slow: 650 MHz
  Driver can kill time-expensive kernels
Lots of threads: good for the GPU
Easy solution-thread mapping by indexes: good for the GPU
Fast kernels

GPU Step Breadth-first

Static memory version
  Good for the GPU
  Bad: one sort of the output for each step
  Bad: synchronized memory access
  Bad: global last-output index
Dynamic memory version

Plain Depth-first Dynamic

Best case: N^4 threads
  Thread indexes fill the first 4 columns
Depth-first approach
Synchronized global memory access
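The index-to-board mapping can be sketched on the CPU, with each loop iteration playing the role of one GPU thread. This is my illustration of the scheme, not the talk's kernel; it assumes N >= 4.

```cpp
// N^4-thread scheme: a linear index (the thread id on the GPU) is
// decoded into rows for the first four columns; invalid prefixes exit
// immediately, valid ones finish depth-first.
static bool prefixOk(const int* rows, int col, int r) {
    for (int c = 0; c < col; ++c)
        if (rows[c] == r || col - c == r - rows[c] || col - c == rows[c] - r)
            return false;
    return true;
}

static long finishDepthFirst(int* rows, int col, int n) {
    if (col == n) return 1;
    long total = 0;
    for (int r = 0; r < n; ++r)
        if (prefixOk(rows, col, r)) { rows[col] = r; total += finishDepthFirst(rows, col + 1, n); }
    return total;
}

long countSolutionsIndexed(int n) {          // assumes n >= 4
    long solutions = 0;
    long nThreads = (long)n * n * n * n;
    for (long idx = 0; idx < nThreads; ++idx) {   // one iteration per GPU thread
        int rows[32];
        long v = idx;
        bool valid = true;
        for (int col = 0; col < 4 && valid; ++col) {
            int r = (int)(v % n);                 // decode one column from the index
            v /= n;
            valid = prefixOk(rows, col, r);
            rows[col] = r;
        }
        if (valid) solutions += finishDepthFirst(rows, 4, n);
    }
    return solutions;
}
```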

Implementations and Threads

Solution                               Threads
GPU-breadth-first static mem           Sol * N
GPU-breadth-first dynamic mem          Sol * N
GPU-depth-first 1Thread                1
GPU-depth-first n-Threads              N
GPU-depth-first n-grids                N
GPU-depth-first n*n-grids              N*N
GPU-depth-first n*n-grids*n-threads    N*N*N
GPU-depth-first n*n-grids*n*n-threads  N*N*N*N
GPU-depth-first FULL threads           N^N
CPU-Plain                              1
CPU-Recursive                          1
CPU-Plain-Threads                      N

Test platforms

CPU: Intel Quad Core 2.4 GHz
  Ubuntu, 4 GB RAM
GPU: GeForce 9600 GT
  8 multiprocessors, 64 processors at 650 MHz
  512 MB RAM at 900 MHz
  Cuda 1.0

Results: CPU
(chart: execution times for board sizes 12-14; series: CPU-Plain, CPU-Recursive, CPU-PlainThreads)

Results: GPU: Static vs Dynamic

(chart: execution times for board sizes 11-12; series: breadth-first static, breadth-first dynamic, CPU-Plain, CPU-Recursive, CPU-PlainThreads)

Results: Same Number of Threads

(chart: execution times for board sizes 12-13; series: depth-first nThreads, depth-first nGrids, CPU-PlainThreads)

Results: Only 1 Thread

(chart: execution times for board sizes 10-12; series: depth-first 1Thread, CPU-Recursive, CPU-Plain)

Results: Dynamic vs Depth

(chart: execution times for board size 12; series: breadth-first dynamic, depth-first nThreads, depth-first nGrids, CPU-Plain, CPU-Recursive, CPU-PlainThreads)

Results: Depth vs CPU

(chart: execution times for board size 12; series: depth-first nThreads, depth-first nGrids, depth-first n*ngrids, depth-first n*ngrids*n-threads, depth-first n*ngrids*n*nthreads, CPU-Plain, CPU-Recursive, CPU-PlainThreads)

Results: GPU N^N solution

(chart: execution times for board size 9; series: depth-first n^n, CPU-Plain, CPU-Recursive, CPU-PlainThreads)

Results: Dynamic, Depth, CPU

(chart: execution times for board sizes 10-13; series: breadth-first dynamic, depth-first N*N*N*N, CPU-Plain, CPU-Recursive, CPU-PlainThreads)

Results: Depth vs CPU Threads

(chart: execution times for board sizes 14-16; series: depth-first N*N*N*N, CPU-Plain, CPU-PlainThreads)

Results
Solution                   Threads  1    2    3    4    5    6    7    8     9
GPU-breadth-first static   Sol*N    171  171  171  174  174  174  178  184   220
GPU-breadth-first dynamic  Sol*N    171  171  171  173  173  173  173  173   174
GPU-depth-first 1Thread    1        171  171  171  171  171  171  171  185   227
GPU-depth-first n-Threads  N        171  171  171  172  172  173  173  175   230
GPU-depth-first n-grids    N        171  171  171  171  171  173  173  173   177
GPU-depth-first n*n-grids  N*N      172  172  172  172  172  172  172  172   174
GPU-depth-first N^3        N^3      171  172  172  172  172  172  172  172   174
GPU-depth-first N^4        N^4      171  171  171  171  171  171  171  171   171
GPU-depth-first FULL       N^N      171  171  172  172  172  172  230  1682  11420
CPU-Plain                  1        2    2    2    2    2    2    2    2     2
CPU-Recursive              1        2    2    2    2    2    2    2    2     2
CPU-Plain-Threads          N        2    2    2    2    2    2    3    3     5

Results
Solution                   Threads  11    12    13    14    15     16     17
GPU-breadth-first static   Sol*N    1234  6184  Mem   Mem   Mem    Mem    Mem
GPU-breadth-first dynamic  Sol*N    218   407   1481  7886  Cont   Mem    Mem
GPU-depth-first 1Thread    1        1463  7198  -     -     -      -      -
GPU-depth-first n-Threads  N        441   1561  7827  -     -      -      -
GPU-depth-first n-grids    N        301   824   3604  -     -      -      -
GPU-depth-first n*n-grids  N*N      216   424   1425  7025  -      -      -
GPU-depth-first N^3        N^3      192   267   661   2937  -      -      -
GPU-depth-first N^4        N^4      181   199   360   1369  7562   43488  05:38.99
GPU-depth-first FULL       N^N      -     -     -     -     -      -      -
CPU-Plain                  1        18    91    502   3020  19685  -      -
CPU-Recursive              1        35    198   1225  8283  58493  -      -
CPU-Plain-Threads          N        17    84    290   1393  8578   32010  04:40.95

Conclusions

Cuda is slow
  Low use of the GPU's graphics resources
  GLSL, HLSL and Cg are faster
Compiler needs improvements
More documentation on assembly optimization
Unstable
  The GPU kills some processes (I don't know why)
Performance depends on the implementation
Good for mixed solutions: CPU + GPU

Conclusions

%, * and / are slow
threadIdx and blockIdx are fantastic
__shared__ memory helps
Cuda locks the screen while processing
  No inter-process scheduling
Think synchronized
  Synchronized architecture
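The "%, * and / are slow" point has a common workaround for the wrap-around case: replace the modulo with a compare-and-select, which GPUs handle cheaply. A small sketch (my example, not from the talk):

```cpp
// Wrap-around increment two ways: with %, and with a compare that
// avoids the integer division the deck flags as slow on the device.
inline int nextIndexMod(int i, int n) { return (i + 1) % n; }
inline int nextIndexCmp(int i, int n) {
    int j = i + 1;
    return (j == n) ? 0 : j;   // same result, no division
}
```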

Questions?

Vitor Pamplona
vitor@vitorpamplona.com
