
Programming with CUDA and OpenCL

Dana Schaa and Byunghyun Jang
Northeastern University
Tutorial Overview

CUDA
- Architecture and programming model
- Strengths and limitations of the GPU
- Example applications

OpenCL
- Architecture and programming model
- Comparison with CUDA
- Example applications
References

CUDA Programming Guide
- http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

CUDA SDK (example applications)
- http://www.nvidia.com/object/cuda_get.html

OpenCL Specification
- http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf

Introduction to GPU Programming
- Volodymyr Kindratenko, Innovative Systems Laboratory @ NCSA, Institute for Advanced Computing Applications and Technologies (IACAT)
CUDA
Installing CUDA

http://www.nvidia.com/object/cuda_get.html

CUDA Driver
- Software to communicate with the GPU

CUDA Toolkit
- Compiler, libraries, emulator, development tools

CUDA SDK
- Example programs
Hardware Architecture

Scalable array of Streaming Multiprocessors (SMs)
- Each SM contains 8 scalar processors that execute in SIMD fashion

Multiple memory spaces
- On-chip memory (shared memory, registers, some caches)
- Off-chip memory (global/device, texture, constant): high-latency, high-bandwidth

PCIe interface with CPU
- Transfers ~GB/s
GPU vs. CPU Architectures
Programming Model meets HW

Massively multithreaded programs
- A program will run correctly without considering the underlying hardware, but it will be very slow

Programmer must divide threads between SMs (discussed in the following slides)

Divergence in control flow serializes SIMD execution (see the sketch below)

No global synchronization*
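As a minimal sketch of the divergence point above (a hypothetical kernel, not from the tutorial): even- and odd-numbered threads take different branches, so a warp executes both paths one after the other with the inactive threads masked off.

__global__ void divergent(float *data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        data[tid] = data[tid] * 2.0f;   // even-numbered threads take this path
    else
        data[tid] = data[tid] + 1.0f;   // odd-numbered threads take this one;
                                        // the warp runs both paths serially
}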
Thread Structure

A CUDA kernel is executed by a grid of threads

Because of the GPU architecture, threads are grouped into blocks; all threads in a block execute together on the same SM

Each block has a unique ID within its grid (block ID), and each thread has a unique ID within its block (thread ID)
- Together these are used to compute a thread's global ID
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2D array of blocks (Block (0,0) through Block (2,1)), and each block is a 3D array of threads (Thread (0,0,0) through Thread (3,1,0)).]
Thread Blocks

Threads within a block:
- Can perform local barriers
- Have access to the same shared memory (a software-managed cache; see the sketch below)
- Are scheduled in SIMD groups called warps

Threads within a warp execute the same instruction simultaneously on different data (this is where divergence impacts performance)
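A minimal sketch of these block-level features (a hypothetical kernel, assuming blocks of 256 threads): each thread stages a value in shared memory, the block synchronizes at a local barrier, and only then do threads read their neighbors' entries.

__global__ void shiftLeft(float *in, float *out, int N) {
    __shared__ float tile[256];              // shared memory: visible to all threads in the block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < N)
        tile[threadIdx.x] = in[tid];
    __syncthreads();                         // local barrier: wait until the whole block has loaded

    if (tid < N - 1 && threadIdx.x < blockDim.x - 1)
        out[tid] = tile[threadIdx.x + 1];    // safely read a neighboring thread's value
}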
Porting Applications

Porting an application to the GPU
- Create a standalone C version (remove classes, library calls)
- Write a multi-threaded CPU version (for debugging and partitioning)
- Create a simple CUDA version
- Optimize the CUDA version for the underlying hardware

Learning curve is similar to threaded C programming
- Large performance gains require mapping the program to the specific underlying architecture
Vector Addition (CPU)
// Computational kernel
void vecAdd(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main() {
    int N = 4096;

    // Allocate memory
    float *A = (float *)malloc(sizeof(float)*N);
    float *B = (float *)malloc(sizeof(float)*N);
    float *C = (float *)malloc(sizeof(float)*N);

    // Initialize memory
    init(A); init(B);

    vecAdd(A, B, C, N);

    // Deallocate memory
    free(A); free(B); free(C);
}
Vector Addition (GPU)
// GPU computational kernel
__global__
void gpuVecAdd(float *A, float *B, float *C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    C[tid] = A[tid] + B[tid];
}

[Figure: the grid is a row of blocks (0,0), (1,0), ...; each block holds 32 threads (0,0) through (31,0), so blockDim.x = 32 and each thread's global index is tid = blockIdx.x * blockDim.x + threadIdx.x.]
Vector Addition (GPU)
int main() {
    int N = 4096;

    float *A = (float *)malloc(sizeof(float)*N);
    float *B = (float *)malloc(sizeof(float)*N);
    float *C = (float *)malloc(sizeof(float)*N);
    init(A); init(B);

    // Allocate memory on GPU
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, sizeof(float)*N);
    cudaMalloc((void **)&d_B, sizeof(float)*N);
    cudaMalloc((void **)&d_C, sizeof(float)*N);

    // Initialize memory on GPU (copy inputs from CPU)
    cudaMemcpy(d_A, A, sizeof(float)*N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, sizeof(float)*N, cudaMemcpyHostToDevice);

    // Configure threads
    dim3 dimBlock(32, 1);
    dim3 dimGrid(N/32, 1);

    // Run kernel (on GPU)
    gpuVecAdd <<< dimGrid, dimBlock >>> (d_A, d_B, d_C);

    // Copy results back to CPU
    cudaMemcpy(C, d_C, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Deallocate memory on GPU
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    free(A);
    free(B);
    free(C);
}
Example: Image Flip
[Figure: original input image and flipped output image]
Image Flip (GPU)
int main() {
    int width, height;
    float *inImage, *outImage;

    readImage(&inImage, &width, &height);
    int size = width * height;
    outImage = (float *)malloc(sizeof(float)*size);

    float *d_inImage, *d_outImage;
    cudaMalloc((void **)&d_inImage, sizeof(float)*size);
    cudaMalloc((void **)&d_outImage, sizeof(float)*size);

    cudaMemcpy(d_inImage, inImage, sizeof(float)*size, cudaMemcpyHostToDevice);

    dim3 dimBlock(8, 8);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y);

    flipImage <<< dimGrid, dimBlock >>> (d_inImage, d_outImage, width, height);

    cudaMemcpy(outImage, d_outImage, sizeof(float)*size, cudaMemcpyDeviceToHost);

    cudaFree(d_inImage);
    cudaFree(d_outImage);

    writeImage(outImage);

    free(inImage);
    free(outImage);
}
Image Flip (GPU)
__global__
void flipImage(float *inImage, float *outImage, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    outImage[((height-1)-y)*width + x] = inImage[y*width + x];
}

[Figure: a 512 x 512 image divided into 8 x 8 thread blocks, giving a 64 x 64 grid of blocks from Thread Block (0,0) to Thread Block (63,63).]
Checking GPU Capabilities

Run deviceQuery program in CUDA SDK
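deviceQuery is built on runtime calls that any program can make directly; a minimal standalone sketch (not the SDK sample itself):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);   // query capabilities of device i
        printf("Device %d: %s, compute capability %d.%d, %d SMs, %zu bytes of global memory\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.totalGlobalMem);
    }
    return 0;
}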


OpenCL
OpenCL Architecture

Parallel computing for heterogeneous devices
- CPUs, GPUs, other processors (Cell, DSPs, etc.)
- Portable accelerated code

Defined in four parts
- Platform Model
- Execution Model
- Memory Model
- Programming Model
Platform Model

The model consists of a host connected to one or more OpenCL devices

A device is divided into one or more compute units

Compute units are divided into one or more processing elements
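This hierarchy is what the platform API exposes; a minimal sketch (error checking omitted, assuming the OpenCL headers and library are installed) that lists each device and its number of compute units:

#include <stdio.h>
#include <CL/cl.h>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);                 // pick the first available platform

    cl_device_id devices[8];
    cl_uint numDevices;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);

    for (cl_uint i = 0; i < numDevices; i++) {
        char name[128];
        cl_uint computeUnits;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(computeUnits), &computeUnits, NULL);
        printf("Device %u: %s (%u compute units)\n", i, name, computeUnits);
    }
    return 0;
}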
Execution Model
CUDA Terminology    OpenCL Terminology
Grid                Index space
Block               Work-group
Thread              Work-item
Execution Model

Two main parts:
- Host programs execute on the host
- Kernels execute on one or more OpenCL devices

Each instance of a kernel is called a work-item

Work-items are organized into work-groups

When a kernel is submitted, an index space of work-groups and work-items is defined

Work-items can identify themselves by their work-group ID and their local ID within the work-group (sound familiar?)
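Inside a kernel these IDs come from built-in functions; a minimal sketch (hypothetical kernel) showing how they relate to the CUDA indices used earlier:

__kernel void whoAmI(__global int *out) {
    size_t grp = get_group_id(0);      // work-group ID   (CUDA: blockIdx.x)
    size_t lid = get_local_id(0);      // local ID        (CUDA: threadIdx.x)
    size_t lsz = get_local_size(0);    // work-group size (CUDA: blockDim.x)
    size_t gid = get_global_id(0);     // equals grp * lsz + lid

    out[gid] = (int)grp;               // record which work-group handled this element
}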
Execution Model
[Figure (© Khronos Group, 2009): an OpenCL context holds programs (e.g., a dp_mul kernel source compiled into per-device CPU and GPU binaries), kernels with their argument values, memory objects (buffers and images), and per-device command queues (in-order and out-of-order).]
Execution Model

A context refers to the environment in which kernels execute, including:
- Devices
- Kernels (OpenCL functions that run on OpenCL devices)
- Program objects (the program source that implements the kernels)
- Memory objects (data that can be operated on by the device)
- Command queues, used to coordinate execution of the kernels on the devices:
  - Memory commands (data transfers)
  - Kernel and synchronization commands
Synchronization

Execution between host and device(s) is asynchronous

Commands can execute in-order or out-of-order
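Host code therefore waits explicitly when it needs results; a minimal sketch (assuming the cmd_queue, kernel, and work sizes from the vector-addition example later in this tutorial):

cl_event done;
clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                       global_work_size, local_work_size, 0, NULL, &done);
clWaitForEvents(1, &done);   // block until this particular command has completed
clReleaseEvent(done);

// ...or simply drain the queue:
clFinish(cmd_queue);         // block until every enqueued command has completed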


Memory Model

Defines the various types of supported memory

No guarantees of consistency between different work-groups

Memory      Description
Global      Accessible by all work-items
Constant    Read-only, global
Local       Local to a work-group
Private     Private to a work-item
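In kernel source these regions correspond to address-space qualifiers; a minimal sketch (hypothetical kernel) that touches each one and uses a work-group barrier before work-items read each other's staged data:

__constant float scale = 2.0f;               // constant: read-only, visible to all work-items

__kernel void demo(__global float *data,     // global: accessible by all work-items
                   __local  float *tile)     // local: shared within one work-group
{
    float x = data[get_global_id(0)];        // private: lives in this work-item's registers

    tile[get_local_id(0)] = x * scale;
    barrier(CLK_LOCAL_MEM_FENCE);            // synchronize the work-group before reuse

    data[get_global_id(0)] = tile[0];        // every work-item reads work-item 0's staged value
}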
Programming Model

Data parallel
- One-to-one mapping between work-items and elements in a memory object
- Work-groups can be defined explicitly (like CUDA) or implicitly (specify the number of work-items and OpenCL creates the work-groups)

Task parallel (see the sketch after this list)
- A kernel is executed independently of an index space
- Other ways to express parallelism: enqueueing multiple tasks, using device-specific vector types, etc.

Synchronization
- Possible between work-items in a work-group
- Possible between commands in a context's command queues
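For the task-parallel case, OpenCL provides clEnqueueTask, which runs a kernel as a single work-item with no index space; a minimal sketch (assuming a valid cmd_queue and kernel):

// run one instance of the kernel, independent of any index space
cl_int err = clEnqueueTask(cmd_queue, kernel, 0, NULL, NULL);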
OpenCL Program Flow

Typical OpenCL program:


- Select the desired devices (ex: all GPUs)
- Create a context
- Create command queues (per device)
- Compile programs
- Create kernels
- Allocate memory on devices
- Transfer data to devices
- Execute
- Transfer results back
- Free memory on devices
Vector Addition (OpenCL)
__kernel void VectorAdd(__global const float* A,
                        __global const float* B,
                        __global float* C)
{
    // get index into the global data array
    int iGID = get_global_id(0);

    // add the vector elements
    C[iGID] = A[iGID] + B[iGID];
}
Vector Addition (OpenCL)
// create the OpenCL context on a GPU device
context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float) * n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float) * n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(cl_float) * n, NULL, NULL);

// create the program
program = clCreateProgramWithSource(context, 1, (const char**)&program_source,
                                    NULL, NULL);
// build the program
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
kernel = clCreateKernel(program, "VectorAdd", NULL);
...
Vector Addition (OpenCL)
// set the kernel argument values
clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 1;

// execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size,
                             local_work_size, 0, NULL, NULL);

// read back the output buffer
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE,
                          0, n * sizeof(cl_float), dst,
                          0, NULL, NULL);

// clean up
clReleaseMemObject(memobjs[0]);
clReleaseMemObject(memobjs[1]);
clReleaseMemObject(memobjs[2]);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmd_queue);
clReleaseContext(context);
GPU projects at NU

- Tomosynthesis mammography
- 3D cardiac CT
- Vascular segmentation
- Physics simulation (surgical simulator)
- Hyperspectral imaging
- Clustering algorithms (k-means)
- Image manipulation (convolution, filtering)
- Phase unwrapping
- Ray tracing
- Memory hierarchy analysis
- Compiler optimizations
Thank you!
