Advanced CUDA Optimization

1. Introduction
Thomas Bradley

© NVIDIA Corporation 2010


Agenda

CUDA Review
  Review of CUDA Architecture
  Programming & Memory Models
  Programming Environment
  Execution
Performance
  Optimization Guidelines
Productivity
  Resources



CUDA Review

REVIEW OF CUDA ARCHITECTURE



Processing Flow

[Figure: CPU memory and GPU memory connected via the PCI bus]

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
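A minimal host-side sketch of this three-step flow (the kernel, buffer names, and sizes are illustrative, not taken from the deck):

#include <cuda_runtime.h>
#include <stdlib.h>

// trivial kernel, only so that step 2 has something to execute
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float*)malloc(bytes);                        // CPU (host) buffer
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;
    float *d_data;
    cudaMalloc(&d_data, bytes);                                   // GPU (device) buffer

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    // 1. copy input to GPU memory
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);             // 2. load and execute GPU program
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    // 3. copy results back to CPU memory

    cudaFree(d_data);
    free(h_data);
    return 0;
}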


CUDA Parallel Computing Architecture

Parallel computing architecture and programming model

Includes a CUDA C compiler, support for OpenCL and DirectCompute

Architected to natively support multiple computational interfaces (standard languages and APIs)


CUDA Parallel Computing Architecture

CUDA defines:
Programming model
Memory model
Execution model

CUDA uses the GPU, but is for general-purpose computing
Facilitates heterogeneous computing: CPU + GPU

CUDA is scalable
Scales to run on 100s of cores / 1000s of parallel threads


CUDA Review

PROGRAMMING MODEL



CUDA Kernels

Parallel portions of the application execute as kernels
The entire GPU executes a kernel, with many threads

CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously

CPU (host): executes functions
GPU (device): executes kernels


CUDA Kernels: Parallel Threads

A kernel is a function executed on the GPU, as an array of threads in parallel

All threads execute the same code, but can take different paths

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Each thread has an ID:
Select input/output data
Control decisions
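Expanding the snippet above into a complete kernel (a sketch: func is replaced by an arbitrary stand-in, and the ID computation and bounds check are the usual idiom rather than anything specified in the deck):

__global__ void apply_func(const float *input, float *output, int n)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID for each thread
    if (threadID < n) {
        float x = input[threadID];
        float y = 2.0f * x + 1.0f;                           // stand-in for func(x)
        output[threadID] = y;
    }
}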


CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed on the GPU as a grid of blocks of threads
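A sketch of how such a launch is written (apply_func from the earlier sketch, the buffer names, and the sizes are illustrative):

dim3 blockSize(256);                          // threads per block
dim3 gridSize((n + 255) / 256);               // blocks per grid, enough to cover n elements
apply_func<<<gridSize, blockSize>>>(d_input, d_output, n);
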
Communication Within a Block

Threads may need to cooperate:
Memory accesses
Share results

Cooperate using shared memory (see the sketch below)
Accessible by all threads within a block

Restriction to “within a block” permits scalability
Fast communication between N threads is not feasible when N is large
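A minimal sketch of block-level cooperation through shared memory (the kernel reverses each 256-element tile in place; the names and the fixed block size are assumptions):

__global__ void reverse_each_block(float *data)
{
    __shared__ float tile[256];                 // shared memory: visible to every thread in this block
    int t = threadIdx.x;                        // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                          // each thread stages one element
    __syncthreads();                            // wait until the whole block has written

    data[i] = tile[blockDim.x - 1 - t];         // read an element written by another thread
}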


Transparent Scalability – G84
[Figure: the same 12-block grid runs on G84 two blocks at a time, in six successive waves.]


Transparent Scalability – G80
[Figure: on G80 the same grid runs in two waves: blocks 1–8, then blocks 9–12.]


Transparent Scalability – GT200
[Figure: on GT200 all 12 blocks run concurrently, with multiprocessors still left idle.]

The same grid runs unchanged on each GPU; the hardware simply schedules blocks across however many multiprocessors are available.


CUDA Programming Model - Summary

A kernel executes as a grid of thread blocks; the grid can be 1D or 2D
A block is a batch of threads
Threads in a block communicate through shared memory
Each block has a block ID
Each thread has a thread ID

[Figure: Kernel 1 launched on a 1D grid of blocks 0–3; Kernel 2 on a 2D grid of blocks (0,0)–(1,3).]
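For a 2D launch like Kernel 2, the IDs are typically combined like this (width and the row-major layout are assumptions for the sketch):

int x   = blockIdx.x * blockDim.x + threadIdx.x;   // column, from block ID and thread ID
int y   = blockIdx.y * blockDim.y + threadIdx.y;   // row
int idx = y * width + x;                           // row-major index into a 2D array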


CUDA Review

MEMORY MODEL



Memory hierarchy

Thread:
Registers
Local memory

Block of threads:
Shared memory

All blocks:
Global memory
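A sketch of where variables sit in this hierarchy (names and sizes are illustrative):

__device__ float g_table[256];           // global memory: visible to all blocks

__global__ void hierarchy_demo(float *out)
{
    int i = threadIdx.x;                 // scalar: lives in a register, private to the thread
    float spill[32];                     // per-thread array: may be placed in (off-chip) local memory
    __shared__ float s_tile[256];        // shared memory: visible to all threads in the block

    s_tile[i] = g_table[i];              // global -> shared
    spill[i % 32] = s_tile[i];
    __syncthreads();
    out[i] = spill[i % 32];              // results leave the GPU through global memory
}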


Additional Memories

Host can also allocate textures and arrays of constants

Textures and constants have dedicated caches
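A sketch of the constant-memory side (the coefficient array and host buffer are illustrative):

__constant__ float c_coeffs[16];         // constant memory, read through its dedicated cache

void upload_coeffs(const float *h_coeffs)    // host code, run before launching kernels that read c_coeffs
{
    cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(c_coeffs));
}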



CUDA Review

PROGRAMMING ENVIRONMENT



CUDA C and OpenCL

CUDA C: entry point for developers who prefer high-level C
OpenCL: entry point for developers who want a low-level API

Shared back-end compiler and optimization technology


Visual Studio

Separate file types
.c/.cpp for host code
.cu for device/mixed code

Compilation rules: cuda.rules
Syntax highlighting
Intellisense

Integrated debugger and profiler: Nexus


NVIDIA Nexus IDE

The industry’s first IDE for massively parallel applications

Accelerates co-processing (CPU + GPU) application development

Complete Visual Studio-integrated development environment


Linux

Separate file types
.c/.cpp for host code
.cu for device/mixed code

Typically makefile driven

cuda-gdb for debugging

CUDA Visual Profiler
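A typical makefile-driven build boils down to an nvcc invocation like this (file names are illustrative):

nvcc -O2 -g -G -o myapp main.cpp kernels.cu    # -g/-G add host/device debug info for cuda-gdb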



Performance

OPTIMIZATION GUIDELINES



Optimize Algorithms for GPU

Algorithm selection
Understand the problem, consider alternate algorithms
Maximize independent parallelism
Maximize arithmetic intensity (math/bandwidth)

Recompute?
GPU allocates transistors to arithmetic, not memory
Sometimes better to recompute rather than cache

Serial computation on GPU?
Even low-parallelism computation may be faster on the GPU than copying the data to/from the host


Optimize Memory Access

Coalesce global memory accesses (see the sketch below)
Maximize DRAM efficiency
Order of magnitude impact on performance

Avoid serialization
Minimize shared memory bank conflicts
Understand constant cache semantics

Understand spatial locality
Optimize use of textures to ensure spatial locality
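A sketch of the coalescing point (kernel and array names are illustrative): neighbouring threads should touch neighbouring addresses.

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];             // thread i touches element i: few wide transactions per warp
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];             // neighbouring threads hit addresses far apart: many narrow transactions
}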


Exploit Shared Memory

Hundreds of times faster than global memory

Inter-thread cooperation via shared memory and synchronization

Cache data that is reused by multiple threads

Stage loads/stores to allow reordering (see the transpose sketch below)
Avoid non-coalesced global memory accesses
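A classic staging sketch is a tiled transpose (tile size and names are illustrative): both the load and the store are coalesced, and the reordering happens in shared memory.

#define TILE 16

__global__ void transpose_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];                       // +1 column pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;                     // assumes blockDim == (TILE, TILE)
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                         // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];    // coalesced store
}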


Use Resources Efficiently

Partition the computation to keep multiprocessors busy
Many threads, many thread blocks
Multiple GPUs

Monitor per-multiprocessor resource utilization
Registers and shared memory
Low utilization per thread block permits multiple active blocks per multiprocessor

Overlap computation with I/O (see the streams sketch below)
Use asynchronous memory transfers
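A sketch of overlapping transfers with computation using streams (the chunking scheme, the process kernel, and the buffer names are all assumptions; asynchronous copies also require the host buffers to be pinned, e.g. allocated with cudaHostAlloc):

__global__ void process(const float *in, float *out, int n)     // stand-in kernel (assumed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void pipelined_run(const float *h_in, float *h_out, float *d_in, float *d_out,
                   int nChunks, int chunkElems)
{
    size_t chunkBytes = chunkElems * sizeof(float);
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int c = 0; c < nChunks; ++c) {
        int s = c % 2;                                           // alternate between two streams
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes, cudaMemcpyHostToDevice, stream[s]);
        process<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(d_in + off, d_out + off, chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaDeviceSynchronize();                                     // wait for all streams to finish
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
}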


Productivity

RESOURCES



Getting Started

CUDA Zone
www.nvidia.com/cuda
Introductory tutorials/webinars
Forums

Documentation
Programming Guide
Best Practices Guide

Examples
CUDA SDK

Libraries

NVIDIA:
cuBLAS: dense linear algebra (subset of the full BLAS suite)
cuFFT: 1D/2D/3D real and complex transforms

Third party:
NAG: numeric libraries, e.g. RNGs
cuLAPACK/MAGMA

Open source:
Thrust: STL/Boost-style template library
cuDPP: data parallel primitives (e.g. scan, sort and reduction)
CUSP: sparse linear algebra and graph computation

Many more...
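As a flavor of the library route, a minimal Thrust sketch (the data and sizes are illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = std::rand();   // fill with illustrative data

    thrust::device_vector<int> d = h;                  // copy to the GPU
    thrust::sort(d.begin(), d.end());                  // parallel sort on the device
    thrust::copy(d.begin(), d.end(), h.begin());       // copy the sorted results back
    return 0;
}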
