Advanced CUDA Optimization

1. Introduction
Thomas Bradley

© NVIDIA Corporation 2010


Agenda

CUDA Review
  Review of CUDA Architecture
  Programming & Memory Models
  Programming Environment
  Execution
Performance
  Optimization Guidelines
Productivity
  Resources



CUDA Review

REVIEW OF CUDA ARCHITECTURE



Processing Flow

[Figure: CPU memory and GPU memory connected via the PCI bus]

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
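A minimal host-side sketch of this three-step flow (the kernel, buffer names, and sizes are illustrative, not taken from the deck):

#include <cuda_runtime.h>
#include <stdlib.h>

// trivial kernel, only so that step 2 has something to execute
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float*)malloc(bytes);                        // CPU (host) buffer
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;
    float *d_data;
    cudaMalloc(&d_data, bytes);                                   // GPU (device) buffer

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    // 1. copy input to GPU memory
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);             // 2. load and execute GPU program
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    // 3. copy results back to CPU memory

    cudaFree(d_data);
    free(h_data);
    return 0;
}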


CUDA Parallel Computing Architecture

Parallel computing architecture and programming model

Includes a CUDA C compiler, support for OpenCL and DirectCompute

Architected to natively support multiple computational interfaces (standard languages and APIs)


CUDA Parallel Computing Architecture

CUDA defines:
Programming model
Memory model
Execution model

CUDA uses the GPU, but is for general-purpose computing
Facilitates heterogeneous computing: CPU + GPU

CUDA is scalable
Scales to run on 100s of cores / 1000s of parallel threads


CUDA Review

PROGRAMMING MODEL



CUDA Kernels

Parallel portions of the application execute as kernels
The entire GPU executes a kernel, with many threads

CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously

CPU (host): executes functions
GPU (device): executes kernels


CUDA Kernels: Parallel Threads

A kernel is a function executed on the GPU, as an array of threads in parallel

All threads execute the same code, but can take different paths

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Each thread has an ID:
Select input/output data
Control decisions
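Expanding the snippet above into a complete kernel (a sketch: func is replaced by an arbitrary stand-in, and the ID computation and bounds check are the usual idiom rather than anything specified in the deck):

__global__ void apply_func(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID for each thread
    if (threadID < n) {
        float x = input[threadID];
        float y = 2.0f * x + 1.0f;                           // stand-in for func(x)
        output[threadID] = y;
    }
}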


CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed on the GPU as a grid of blocks of threads
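A sketch of how such a launch is written (apply_func from the earlier sketch, the buffer names, and the sizes are illustrative):

dim3 blockSize(256);                          // threads per block
dim3 gridSize((n + 255) / 256);               // blocks per grid, enough to cover n elements
apply_func<<<gridSize, blockSize>>>(d_input, d_output, n);
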
Communication Within a Block

Threads may need to cooperate:
Memory accesses
Share results

Cooperate using shared memory (see the sketch below)
Accessible by all threads within a block

Restriction to “within a block” permits scalability
Fast communication between N threads is not feasible when N is large
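A minimal sketch of block-level cooperation through shared memory (the kernel reverses each 256-element tile in place; the names and the fixed block size are assumptions):

__global__ void reverse_each_block(float *data)
{
    __shared__ float tile[256];                 // shared memory: visible to every thread in this block
    int t = threadIdx.x;                        // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                          // each thread stages one element
    __syncthreads();                            // wait until the whole block has written

    data[i] = tile[blockDim.x - 1 - t];         // read an element written by another thread
}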


Transparent Scalability – G84
[Figure: the same 12-block grid runs on G84 two blocks at a time, in six successive waves.]


Transparent Scalability – G80
[Figure: on G80 the same grid runs in two waves: blocks 1–8, then blocks 9–12.]


Transparent Scalability – GT200
[Figure: on GT200 all 12 blocks run concurrently, with multiprocessors still left idle.]

The same grid runs unchanged on each GPU; the hardware simply schedules blocks across however many multiprocessors are available.


CUDA Programming Model - Summary

A kernel executes as a grid of thread blocks; the grid can be 1D or 2D
A block is a batch of threads
Threads in a block communicate through shared memory
Each block has a block ID
Each thread has a thread ID

[Figure: Kernel 1 launched on a 1D grid of blocks 0–3; Kernel 2 on a 2D grid of blocks (0,0)–(1,3).]
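For a 2D launch like Kernel 2, the IDs are typically combined like this (width and the row-major layout are assumptions for the sketch):

int x   = blockIdx.x * blockDim.x + threadIdx.x;   // column, from block ID and thread ID
int y   = blockIdx.y * blockDim.y + threadIdx.y;   // row
int idx = y * width + x;                           // row-major index into a 2D array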


CUDA Review

MEMORY MODEL



Memory hierarchy

Thread:
Registers
Local memory

Block of threads:
Shared memory

All blocks:
Global memory
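A sketch of where variables sit in this hierarchy (names and sizes are illustrative):

__device__ float g_table[256];           // global memory: visible to all blocks

__global__ void hierarchy_demo(float *out)
{
    int i = threadIdx.x;                 // scalar: lives in a register, private to the thread
    float spill[32];                     // per-thread array: may be placed in (off-chip) local memory
    __shared__ float s_tile[256];        // shared memory: visible to all threads in the block

    s_tile[i] = g_table[i];              // global -> shared
    spill[i % 32] = s_tile[i];
    __syncthreads();
    out[i] = spill[i % 32];              // results leave the GPU through global memory
}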


Additional Memories

Host can also allocate textures and arrays of constants

Textures and constants have dedicated caches
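A sketch of the constant-memory side (the coefficient array and host buffer are illustrative):

__constant__ float c_coeffs[16];         // constant memory, read through its dedicated cache

void upload_coeffs(const float *h_coeffs)    // host code, run before launching kernels that read c_coeffs
{
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(c_coeffs));
}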



CUDA Review

PROGRAMMING ENVIRONMENT



CUDA C and OpenCL

CUDA C: entry point for developers who prefer high-level C
OpenCL: entry point for developers who want a low-level API

Shared back-end compiler and optimization technology


Visual Studio

Separate file types
.c/.cpp for host code
.cu for device/mixed code

Compilation rules: cuda.rules
Syntax highlighting
Intellisense

Integrated debugger and profiler: Nexus


NVIDIA Nexus IDE

The industry’s first IDE for massively parallel applications

Accelerates co-processing (CPU + GPU) application development

Complete Visual Studio-integrated development environment


Linux

Separate file types
.c/.cpp for host code
.cu for device/mixed code

Typically makefile driven

cuda-gdb for debugging

CUDA Visual Profiler
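A typical makefile-driven build boils down to an nvcc invocation like this (file names are illustrative):

nvcc -O2 -g -G -o myapp main.cpp kernels.cu    # -g/-G add host/device debug info for cuda-gdb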



Performance

OPTIMIZATION GUIDELINES



Optimize Algorithms for GPU

Algorithm selection
Understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)

Recompute?
GPU allocates transistors to arithmetic, not memory
Sometimes better to recompute rather than cache

Serial computation on GPU?
Even low-parallelism computation may be faster on the GPU than copying the data to/from the host


Optimize Memory Access

Coalesce global memory accesses (see the sketch below)
Maximize DRAM efficiency
Order of magnitude impact on performance

Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics

Understand spatial locality
Optimize use of textures to ensure spatial locality
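A sketch of the coalescing point (kernel and array names are illustrative): neighbouring threads should touch neighbouring addresses.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];             // thread i touches element i: few wide transactions per warp
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];             // neighbouring threads hit addresses far apart: many narrow transactions
}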


Exploit Shared Memory

Hundreds of times faster than global memory

Inter-thread cooperation via shared memory and synchronization

Cache data that is reused by multiple threads

Stage loads/stores to allow reordering (see the transpose sketch below)
Avoid non-coalesced global memory accesses
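A classic staging sketch is a tiled transpose (tile size and names are illustrative): both the load and the store are coalesced, and the reordering happens in shared memory.

#define TILE 16

__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];                       // +1 column pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;                     // assumes blockDim == (TILE, TILE)
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                         // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];    // coalesced store
}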


Use Resources Efficiently

Partition the computation to keep multiprocessors busy
Many threads, many thread blocks
Multiple GPUs

Monitor per-multiprocessor resource utilization
Registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor

Overlap computation with I/O (see the streams sketch below)
Use asynchronous memory transfers
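A sketch of overlapping transfers with computation using streams (the chunking scheme, the process kernel, and the buffer names are all assumptions; asynchronous copies also require the host buffers to be pinned, e.g. allocated with cudaHostAlloc):

__global__ void process(const float *in, float *out, int n)     // stand-in kernel (assumed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void pipelined_run(const float *h_in, float *h_out, float *d_in, float *d_out,
                   int nChunks, int chunkElems)
{
    size_t chunkBytes = chunkElems * sizeof(float);
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int c = 0; c < nChunks; ++c) {
        int s = c % 2;                                           // alternate between two streams
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(d_in + off, d_out + off, chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();                                     // wait for all streams to finish
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
}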


Productivity

RESOURCES



Getting Started

CUDA Zone
www.nvidia.com/cuda
Introductory tutorials/webinars
Forums

Documentation
Programming Guide
Best Practices Guide

Examples
CUDA SDK

Libraries

NVIDIA:
cuBLAS: dense linear algebra (subset of the full BLAS suite)
cuFFT: 1D/2D/3D real and complex transforms

Third party:
NAG: numeric libraries, e.g. RNGs
cuLAPACK/MAGMA

Open source:
Thrust: STL/Boost-style template library
cuDPP: data parallel primitives (e.g. scan, sort and reduction)
CUSP: sparse linear algebra and graph computation

Many more...
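As a flavor of the library route, a minimal Thrust sketch (the data and sizes are illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = std::rand();   // fill with illustrative data

    thrust::device_vector<int> d = h;                  // copy to the GPU
    thrust::sort(d.begin(), d.end());                  // parallel sort on the device
    thrust::copy(d.begin(), d.end(), h.begin());       // copy the sorted results back
    return 0;
}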
