
Programming with CUDA and OpenCL

Dana Schaa and Byunghyun Jang
Northeastern University
Tutorial Overview

CUDA
- Architecture and programming model
- Strengths and limitations of the GPU
- Example applications

OpenCL
- Architecture and programming model
- Comparison with CUDA
- Example applications
References

CUDA Programming Guide
- http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

CUDA SDK (example applications)
- http://www.nvidia.com/object/cuda_get.html

OpenCL Specification
- http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf

Introduction to GPU Programming
- Volodymyr Kindratenko, Innovative Systems Laboratory @ NCSA, Institute for Advanced Computing Applications and Technologies (IACAT)
CUDA
Installing CUDA

http://www.nvidia.com/object/cuda_get.html

CUDA Driver
- Software to communicate with the GPU

CUDA Toolkit
- Compiler, libraries, emulator, development tools

CUDA SDK
- Example programs
Hardware Architecture

Scalable array of Streaming Multiprocessors (SMs)
- Each SM contains 8 scalar processors that execute in SIMD fashion

Multiple memory spaces
- On-chip memory (shared memory, registers, some caches)
- Off-chip memory (global/device, texture, constant): high-latency, high-bandwidth

PCIe interface with CPU
- Transfers ~GB/s
GPU vs. CPU Architectures
Programming Model meets HW

Massively multithreaded programs
- A program will run correctly without considering the underlying hardware, but it will be very slow

Programmer must divide threads between SMs (discussed in the following slides)

Divergence in control flow serializes SIMD execution (see the sketch below)

No global synchronization*
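As a minimal sketch of the divergence point above (a hypothetical kernel, not from the tutorial): even- and odd-numbered threads take different branches, so a warp executes both paths one after the other with the inactive threads masked off.

__global__ void divergent(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        data[tid] = data[tid] * 2.0f;   // even-numbered threads take this path
    else
        data[tid] = data[tid] + 1.0f;   // odd-numbered threads take this one;
                                        // the warp runs both paths serially
}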
Thread Structure

A CUDA kernel is executed by a grid of threads

Because of the GPU architecture, threads are grouped into blocks; all threads in a block execute together on the same SM

Each block has a unique ID within its grid (block ID), and each thread has a unique ID within its block (thread ID)
- Together these are used to compute a thread's global ID
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of blocks (Block (0,0) through Block (2,1)), and each block is a 3D array of threads (Thread (0,0,0) through Thread (3,1,0)).]
Thread Blocks

Threads within a block:
- Can perform local barriers
- Have access to the same shared memory (a software-managed cache; see the sketch below)
- Are scheduled in SIMD groups called warps

Threads within a warp execute the same instruction simultaneously on different data (this is where divergence impacts performance)
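A minimal sketch of these block-level features (a hypothetical kernel, assuming blocks of 256 threads): each thread stages a value in shared memory, the block synchronizes at a local barrier, and only then do threads read their neighbors' entries.

__global__ void shiftLeft(float *in, float *out, int N) {
    __shared__ float tile[256];              // shared memory: visible to all threads in the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N)
        tile[threadIdx.x] = in[tid];
    __syncthreads();                         // local barrier: wait until the whole block has loaded

    if (tid < N - 1 && threadIdx.x < blockDim.x - 1)
        out[tid] = tile[threadIdx.x + 1];    // safely read a neighboring thread's value
}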
Porting Applications

Porting an application to the GPU
- Create a standalone C version (remove classes, library calls)
- Write a multi-threaded CPU version (for debugging and partitioning)
- Create a simple CUDA version
- Optimize the CUDA version for the underlying hardware

Learning curve is similar to threaded C programming
- Large performance gains require mapping the program to the specific underlying architecture
Vector Addition (CPU)
// Computational kernel
void vecAdd(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 4096;

    // Allocate memory
    float *A = (float *)malloc(sizeof(float)*N);
    float *B = (float *)malloc(sizeof(float)*N);
    float *C = (float *)malloc(sizeof(float)*N);

    // Initialize memory
    init(A); init(B);

    vecAdd(A, B, C, N);

    // Deallocate memory
    free(A); free(B); free(C);
}
Vector Addition (GPU)
// GPU computational kernel
__global__
void gpuVecAdd(float *A, float *B, float *C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    C[tid] = A[tid] + B[tid];
}

[Figure: the grid is a row of blocks (0,0), (1,0), ...; each block holds 32 threads (0,0) through (31,0), so blockDim.x = 32 and each thread's global index is tid = blockIdx.x * blockDim.x + threadIdx.x.]
Vector Addition (GPU)
int main() {
    int N = 4096;

    float *A = (float *)malloc(sizeof(float)*N);
    float *B = (float *)malloc(sizeof(float)*N);
    float *C = (float *)malloc(sizeof(float)*N);
    init(A); init(B);

    // Allocate memory on GPU
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, sizeof(float)*N);
    cudaMalloc((void **)&d_B, sizeof(float)*N);
    cudaMalloc((void **)&d_C, sizeof(float)*N);

    // Initialize memory on GPU (copy inputs from CPU)
    cudaMemcpy(d_A, A, sizeof(float)*N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, sizeof(float)*N, cudaMemcpyHostToDevice);

    // Configure threads
    dim3 dimBlock(32, 1);
    dim3 dimGrid(N/32, 1);

    // Run kernel (on GPU)
    gpuVecAdd <<< dimGrid, dimBlock >>> (d_A, d_B, d_C);

    // Copy results back to CPU
    cudaMemcpy(C, d_C, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Deallocate memory on GPU
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    free(A);
    free(B);
    free(C);
}
Example: Image Flip
[Figure: original input image and flipped output image]
Image Flip (GPU)
int main() {
    int width, height;
    float *inImage, *outImage;

    readImage(&inImage, &width, &height);
    int size = width * height;
    outImage = (float *)malloc(sizeof(float)*size);

    float *d_inImage, *d_outImage;
    cudaMalloc((void **)&d_inImage, sizeof(float)*size);
    cudaMalloc((void **)&d_outImage, sizeof(float)*size);

    cudaMemcpy(d_inImage, inImage, sizeof(float)*size, cudaMemcpyHostToDevice);

    dim3 dimBlock(8, 8);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y);

    flipImage <<< dimGrid, dimBlock >>> (d_inImage, d_outImage, width, height);

    cudaMemcpy(outImage, d_outImage, sizeof(float)*size, cudaMemcpyDeviceToHost);

    cudaFree(d_inImage);
    cudaFree(d_outImage);

    writeImage(outImage);

    free(inImage);
    free(outImage);
}
Image Flip (GPU)
__global__
void flipImage(float *inImage, float *outImage, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    outImage[((height-1)-y)*width + x] = inImage[y*width + x];
}

[Figure: a 512 x 512 image divided into 8 x 8 thread blocks, giving a 64 x 64 grid of blocks from Thread Block (0,0) to Thread Block (63,63).]
Checking GPU Capabilities

Run deviceQuery program in CUDA SDK
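deviceQuery is built on runtime calls that any program can make directly; a minimal standalone sketch (not the SDK sample itself):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);   // query capabilities of device i
        printf("Device %d: %s, compute capability %d.%d, %d SMs, %zu bytes of global memory\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.totalGlobalMem);
    }
    return 0;
}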


OpenCL
OpenCL Architecture

Parallel computing for heterogeneous devices
- CPUs, GPUs, other processors (Cell, DSPs, etc.)
- Portable accelerated code

Defined in four parts
- Platform Model
- Execution Model
- Memory Model
- Programming Model
Platform Model

The model consists of a host connected to one or more OpenCL devices

A device is divided into one or more compute units

Compute units are divided into one or more processing elements
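This hierarchy is what the platform API exposes; a minimal sketch (error checking omitted, assuming the OpenCL headers and library are installed) that lists each device and its number of compute units:

#include <stdio.h>
#include <CL/cl.h>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);                 // pick the first available platform

    cl_device_id devices[8];
    cl_uint numDevices;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);

    for (cl_uint i = 0; i < numDevices; i++) {
        char name[128];
        cl_uint computeUnits;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(computeUnits), &computeUnits, NULL);
        printf("Device %u: %s (%u compute units)\n", i, name, computeUnits);
    }
    return 0;
}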
Execution Model
CUDA Terminology    OpenCL Terminology
Grid                Index space
Block               Work-group
Thread              Work-item
Execution Model

Two main parts:
- Host programs execute on the host
- Kernels execute on one or more OpenCL devices

Each instance of a kernel is called a work-item

Work-items are organized into work-groups

When a kernel is submitted, an index space of work-groups and work-items is defined

Work-items can identify themselves by their work-group ID and their local ID within the work-group (sound familiar?)
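Inside a kernel these IDs come from built-in functions; a minimal sketch (hypothetical kernel) showing how they relate to the CUDA indices used earlier:

__kernel void whoAmI(__global int *out) {
    size_t grp = get_group_id(0);      // work-group ID   (CUDA: blockIdx.x)
    size_t lid = get_local_id(0);      // local ID        (CUDA: threadIdx.x)
    size_t lsz = get_local_size(0);    // work-group size (CUDA: blockDim.x)
    size_t gid = get_global_id(0);     // equals grp * lsz + lid

    out[gid] = (int)grp;               // record which work-group handled this element
}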
Execution Model
[Figure (© Khronos Group, 2009): an OpenCL context holds programs (e.g., a dp_mul kernel source compiled into per-device CPU and GPU binaries), kernels with their argument values, memory objects (buffers and images), and per-device command queues (in-order and out-of-order).]
Execution Model

A context refers to the environment in which kernels execute, including:
- Devices
- Kernels (OpenCL functions that run on OpenCL devices)
- Program objects (the program source that implements the kernels)
- Memory objects (data that can be operated on by the device)
- Command queues, used to coordinate execution of the kernels on the devices:
  - Memory commands (data transfers)
  - Kernel and synchronization commands
Synchronization

Execution between host and device(s) is asynchronous

Commands can execute in-order or out-of-order
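Host code therefore waits explicitly when it needs results; a minimal sketch (assuming the cmd_queue, kernel, and work sizes from the vector-addition example later in this tutorial):

cl_event done;
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                       global_work_size, local_work_size, 0, NULL, &done);
clWaitForEvents(1, &done);   // block until this particular command has completed
clReleaseEvent(done);

// ...or simply drain the queue:
clFinish(cmd_queue);         // block until every enqueued command has completed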


Memory Model

Defines the various types of supported memory

No guarantees of consistency between different work-groups

Memory      Description
Global      Accessible by all work-items
Constant    Read-only, global
Local       Local to a work-group
Private     Private to a work-item
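In kernel source these regions correspond to address-space qualifiers; a minimal sketch (hypothetical kernel) that touches each one and uses a work-group barrier before work-items read each other's staged data:

__constant float scale = 2.0f;               // constant: read-only, visible to all work-items

__kernel void demo(__global float *data,     // global: accessible by all work-items
                   __local  float *tile)     // local: shared within one work-group
{
    float x = data[get_global_id(0)];        // private: lives in this work-item's registers

    tile[get_local_id(0)] = x * scale;
    barrier(CLK_LOCAL_MEM_FENCE);            // synchronize the work-group before reuse

    data[get_global_id(0)] = tile[0];        // every work-item reads work-item 0's staged value
}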
Programming Model

Data parallel
- One-to-one mapping between work-items and elements in a memory object
- Work-groups can be defined explicitly (like CUDA) or implicitly (specify the number of work-items and OpenCL creates the work-groups)

Task parallel (see the sketch after this list)
- A kernel is executed independently of an index space
- Other ways to express parallelism: enqueueing multiple tasks, using device-specific vector types, etc.

Synchronization
- Possible between work-items in a work-group
- Possible between commands in a context's command queues
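For the task-parallel case, OpenCL provides clEnqueueTask, which runs a kernel as a single work-item with no index space; a minimal sketch (assuming a valid cmd_queue and kernel):

// run one instance of the kernel, independent of any index space
cl_int err = clEnqueueTask(cmd_queue, kernel, 0, NULL, NULL);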
OpenCL Program Flow

Typical OpenCL program:


- Select the desired devices (ex: all GPUs)
- Create a context
- Create command queues (per device)
- Compile programs
- Create kernels
- Allocate memory on devices
- Transfer data to devices
- Execute
- Transfer results back
- Free memory on devices
Vector Addition (OpenCL)
__kernel void VectorAdd(__global const float* A,
                        __global const float* B,
                        __global float* C)
{
    // get index into the global data array
    int iGID = get_global_id(0);

    // add the vector elements
    C[iGID] = A[iGID] + B[iGID];
}
Vector Addition (OpenCL)
// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float) * n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float) * n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(cl_float) * n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, (const char**)&program_source,
                                    NULL, NULL);
// build the program
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "VectorAdd", NULL);
...
Vector Addition (OpenCL)
// set the kernel argument values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 1;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             local_work_size, 0, NULL, NULL);

// read back the output buffer
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE,
                          0, n * sizeof(cl_float), dst,
                          0, NULL, NULL);

// clean up
clReleaseMemObject(memobjs[0]);
clReleaseMemObject(memobjs[1]);
clReleaseMemObject(memobjs[2]);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmd_queue);
clReleaseContext(context);
GPU projects at NU

- Tomosynthesis mammography
- 3D cardiac CT
- Vascular segmentation
- Physics simulation (surgical simulator)
- Hyperspectral imaging
- Clustering algorithms (k-means)
- Image manipulation (convolution, filtering)
- Phase unwrapping
- Ray tracing
- Memory hierarchy analysis
- Compiler optimizations
Thank you!
