GPU Programming

NSM Nodal Centre for Training in HPC and AI

Slides:

Intro + Logistics
Computation
Memory
Synchronization
Functions
Support
Streams
Topics
Case Study -- Graphs

Resources:

All codes
Lecture videos
FAQ one and two
Compute capability crossword

Doubts Session 1: February 8
Doubts Session 2: March 9
Doubts Session 3: April 1
Doubts Session 4: May 4
Assignment 1 on Computation: Statement, Header file, Sample testcases
Assignment 2 on Memory: Statement, main.cu, Sample testcases
Assignment 3 on Synchronization: Statement, main.cu, Sample testcases
Assignment 4 on Application: Statement, Sample testcases

Recorded Lectures

Month Dates Topic Comments

February 1, 2, 3, 5 Introduction, Computation
Hello World, One, Two, Three
Grid, Blocks, Threads
Kernel Launch: 1D, 1D-General, 2D

8, 9, 10, 12 Computation
CPU-GPU Communication (cudaMalloc, cudaMemcpy)
Global variables
Matrix multiplication: CPU, Outer parallel, Outer+Inner parallel

15, 16, 17, 19 Computation
Thread Divergence
Divergence due to switch
Problem Set 1

22, 23, 24, 26 Memory
Memory Coalescing
AoS versus SoA
Barrier

March 1, 2, 3, 5 Memory, Support
Linked List Copying
Shared Memory
Shared Memory with Barrier
String Permutation
Dynamic Shared Memory
Dynamic Shared Memory with Multiple Arrays

CUDA GDB
Error Handling
Dangling Pointer

8, 9, 10, 12 Memory
Texture Memory (via CUDA SDK)
Constant Memory
Bank Conflicts
Problem Set 2

NvProf
Original Code
Loop Fusion
Kernel Fusion
Converting Loop to Blocks

15, 16, 17, 19 Synchronization
Convolution
Worklist Insertion
Task Donation

22, 23, 24, 26 Synchronization
Reduction: i + N/2, N - i, i + 1
Prefix Sum / Scan

29, 30, 31 Synchronization No classes due to semester break.
No Global Barrier
Global Barrier using Atomics
Hierarchical Global Barrier

April 2 Synchronization No classes due to semester break.
Linked List Insertion
CPU-GPU Shared Pinned Memory
Persistent Kernel
Problem Set 3

5, 6, 7, 9 Functions
Array increment: Sequential, Parallel
Thrust basics
Thrust Reduction
Thrust Prefix Sum
Thrust-like device vector implementation

12, 13, 14, 16 Streams
Basic Stream Program
with Asynchronous memcpy
with cudaHostAlloc
Cooperative Kernels

19, 20, 21, 23 Topics
Dynamic Parallelism
Conditional Child Kernels
using Global Device Memory
with Non-Blocking Streams

26, 27, 28, 30 Topics
MultiGPU: Number of Devices
Cross-Device Synchronization
PTX: CUDA Code, Assembly Code
Basic Warp Voting
Converting Mask to Count (popc)
Use of ffs
Conditional Participation in ballot

May 3, 4, 5, 7 Topics, Case Study
Loop Unrolling, Unrolled Assembly
Heterogeneous Computation
with Shared Variable
Task Distribution
OpenMP Reduction
with HostAlloc'ed Memory
Dynamic Scheduling
OpenCL: Driver, Kernel