1. Introduction
Thomas Bradley
CUDA Review
Review of CUDA Architecture
Programming & Memory Models
Programming Environment
Execution
Performance
Optimization Guidelines
Productivity
Resources
[Figure: architecture diagrams; the only surviving label is "PCI Bus"]
CUDA defines:
Programming model
Memory model
Execution model
CUDA is scalable
Scales to run on 100s of cores and 1000s of parallel threads
PROGRAMMING MODEL
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
All threads execute the same code, can take different paths
float x = input[threadID];
float y = func(x);
output[threadID] = y;
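A minimal sketch of how the three-line thread body above might sit inside a complete kernel. The kernel name `apply_func`, the host helper `run_apply_func`, the body of `func`, and the block size of 256 are all assumptions made for illustration; the slide only gives the per-thread lines. `threadID` is derived here from the built-in `blockIdx`, `blockDim` and `threadIdx` variables, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for func(x) on the slide.
__device__ float func(float x) { return 2.0f * x + 1.0f; }

// Every thread executes this same code on its own element.
__global__ void apply_func(const float* input, float* output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n) {   // threads past the end of the data take a different path
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }
}

// Host helper (illustrative): copy in, launch, copy out.
std::vector<float> run_apply_func(const std::vector<float>& host_in)
{
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                           // assumed block size
    int blocks = (n + threads - 1) / threads;    // enough blocks to cover n elements
    apply_func<<<blocks, threads>>>(d_in, d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Calling `run_apply_func` from a `main` and compiling with `nvcc` should be enough to try it; the bounds check is one place where threads take different paths.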
[Figure: thread blocks 1-12 scheduled on GPUs of different sizes (2, 8, or 12 blocks at a time); unused slots are marked Idle]
A kernel executes as a grid of thread blocks
A block is a batch of threads
Communicate through shared memory
Each block has a block ID
[Figure: the host launches kernels on the device; Kernel 1 runs as a 1D grid of blocks 0, 1, 2, 3, and Kernel 2 as a 2D grid of blocks (0,0) to (1,3)]
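The grid and block ID idea can be sketched with a 2D launch. In this illustrative example (the names `record_block_ids` and `run_record_block_ids` and the 16x16 block shape are assumptions, not from the slides), each thread writes the linear ID of the block it belongs to, so the result shows which block handled which part of the data. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records the row-major ID of its own block.
__global__ void record_block_ids(int* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        out[row * width + col] = blockIdx.y * gridDim.x + blockIdx.x;
    }
}

// Host helper (illustrative): launches a 2D grid that covers width x height.
std::vector<int> run_record_block_ids(int width, int height)
{
    std::vector<int> host(static_cast<size_t>(width) * height);
    if (host.empty()) return host;

    int* d_out = nullptr;
    cudaMalloc(&d_out, host.size() * sizeof(int));

    dim3 block(16, 16);                                   // assumed block shape
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    record_block_ids<<<grid, block>>>(d_out, width, height);

    cudaMemcpy(host.data(), d_out, host.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return host;
}
```

With `width = 64` and `height = 32` the launch uses a grid of 4 x 2 blocks, the same 2-row by 4-column layout as the 2D grid in the figure.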
MEMORY MODEL
Thread:
Registers
Thread:
Local memory
Block of threads:
Shared memory
All blocks:
Global memory
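One way to see the memory levels together is a per-block sum. In the sketch below (kernel and helper names, and the block size of 256, are assumptions for illustration), scalar variables such as `tid` would normally be held in registers, `tile` is shared by the threads of one block, and `in` and `block_sums` live in global memory visible to all blocks. Local memory is not used explicitly here; it is per-thread storage the compiler falls back on, for example when registers run out. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256   // assumed; must be a power of two for this reduction

__global__ void block_sum(const float* in, float* block_sums, int n)
{
    __shared__ float tile[BLOCK_SIZE];          // shared memory: one copy per block

    int tid = threadIdx.x;                      // per-thread scalars, normally in registers
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;         // global memory -> shared memory
    __syncthreads();

    // Tree reduction within the block, communicating through shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0) block_sums[blockIdx.x] = tile[0];   // shared memory -> global memory
}

// Host helper (illustrative): sums the per-block partial results on the CPU.
float run_block_sum(const std::vector<float>& host_in)
{
    int n = static_cast<int>(host_in.size());
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float* d_in = nullptr;
    float* d_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK_SIZE>>>(d_in, d_sums, n);

    std::vector<float> sums(blocks);
    cudaMemcpy(sums.data(), d_sums, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);

    float total = 0.0f;
    for (float s : sums) total += s;
    return total;
}
```

Blocks cannot see each other's shared memory, which is why each block writes its partial sum to global memory and the final combination happens in a separate step.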
PROGRAMMING ENVIRONMENT
OPTIMIZATION GUIDELINES
Algorithm selection
Understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)
Recompute?
GPU allocates transistors to arithmetic, not memory
Sometimes better to recompute rather than cache
Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics
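As an illustration of the bank-conflict guideline, here is a sketch of a tiled matrix transpose that pads its shared-memory tile by one column. It assumes shared memory is organized as 32 banks of 4-byte words and that the device allows 1024 threads per block; the names `transpose_tiled` and `run_transpose` are hypothetical. Without the padding, the column-wise read `tile[threadIdx.x][threadIdx.y]` would have every thread of a warp hitting the same bank and being serialized. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32   // assumed tile width, equal to the assumed number of banks

// in is height x width (row-major); out is width x height.
__global__ void transpose_tiled(const float* in, float* out, int width, int height)
{
    // The extra column shifts each row by one bank, so column reads spread across banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Swap the block coordinates so that global writes stay row-contiguous.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Host helper (illustrative).
std::vector<float> run_transpose(const std::vector<float>& host_in, int width, int height)
{
    size_t count = static_cast<size_t>(width) * height;
    std::vector<float> host_out(count);
    if (count == 0) return host_out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, count * sizeof(float));
    cudaMalloc(&d_out, count * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);

    cudaMemcpy(host_out.data(), d_out, count * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

The result is the same with `tile[TILE][TILE]`; only the access pattern, and so the amount of serialization, changes. A profiler is the reliable way to confirm the effect on a given device.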
RESOURCES
CUDA Zone
www.nvidia.com/cuda
Introductory tutorials/webinars
Forums
Documentation
Programming Guide
Best Practices Guide
Examples
CUDA SDK
© NVIDIA Corporation 2010
Libraries
NVIDIA
cuBLAS: Dense linear algebra (subset of full BLAS suite)
cuFFT: 1D/2D/3D real and complex FFTs
Third party
NAG: Numeric libraries, e.g. RNGs
cuLAPACK/MAGMA
Open Source
Thrust: STL/Boost-style template library
cuDPP: Data-parallel primitives (e.g. scan, sort and reduction)
CUSP: Sparse linear algebra and graph computation
Many more...
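To give a feel for the STL-like style of Thrust, here is a small sketch that sorts and sums a vector on the GPU. The helper name `sort_and_sum` is hypothetical; `thrust::device_vector`, `thrust::sort`, `thrust::reduce` and `thrust::copy` are part of Thrust itself.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <vector>

// Sorts `values` in place (via the GPU) and returns the sum of its elements.
int sort_and_sum(std::vector<int>& values)
{
    // Constructing a device_vector from host iterators copies the data to the GPU.
    thrust::device_vector<int> d(values.begin(), values.end());

    thrust::sort(d.begin(), d.end());                    // parallel sort on the device
    int total = thrust::reduce(d.begin(), d.end(), 0);   // parallel sum on the device

    // Copy the sorted data back into the host vector.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

No kernels or explicit `cudaMemcpy` calls appear in the code; the library handles the launches and transfers behind the container and algorithm interfaces.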