GPU Programming with CUDA

NSM Nodal Centre for Training in HPC and AI

Final exam on May 1, 2022 from 10:00--11:30 AM.

Slides:

Resources:

All codes
Lecture videos
FAQ one and two
Compute capability crossword
Doubts Session 1: Feb 22, 2022 from 10:00 -- 11:00: Join here.
Doubts Session 2: Mar 22, 2022 from 10:00 -- 11:00: Join here.
Doubts Session 3: Apr 26, 2022 from 10:00 -- 11:00: Join here.

Evaluation

A1 on Computation: due Feb 28, 2022: Statement, Header file, Sample testcases
A2 on Memory: due Mar 28, 2022: Statement, main.cu, Sample testcases
A3 on Synchronization: due Apr 28, 2022:
-- Statement, main.cu, Sample testcases
OR
-- Statement, main.cu, Sample testcases

Submit assignments here.

Schedule

Month Week Topic Comments

February week 1 Introduction, Computation
Hello World, One, Two, Three
Grid, Blocks, Threads
Kernel Launch: 1D, 1D-General, 2D

week 2 Computation
CPU-GPU Communication (cudaMalloc, cudaMemcpy)
Global variables
Matrix multiplication: CPU, Outer parallel, Outer+Inner parallel

week 3 Computation
Thread Divergence
Divergence due to switch
Problem Set 1

week 4 Memory
Memory Coalescing
AoS versus SoA
Barrier

March week 5 Memory, Support
Linked List Copying
Shared Memory
Shared Memory with Barrier
String Permutation
Dynamic Shared Memory
Dynamic Shared Memory with Multiple Arrays

CUDA GDB
Error Handling
Dangling Pointer

week 6 Memory
Texture Memory (via CUDA SDK)
Constant Memory
Bank Conflicts
Problem Set 2

NvProf
Original Code
Loop Fusion
Kernel Fusion
Converting Loop to Blocks

week 7 Synchronization
Convolution
Worklist Insertion
Task Donation

week 8 Synchronization
Reduction: i + N/2, N - i, i + 1
Prefix Sum / Scan

week 9 Synchronization
No Global Barrier
Global Barrier using Atomics
Hierarchical Global Barrier

April week 10 Synchronization
Linked List Insertion
CPU-GPU Shared Pinned Memory
Persistent Kernel
Problem Set 3

week 11 Functions
Array increment: Sequential, Parallel
Thrust basics
Thrust Reduction
Thrust Prefix Sum
Thrust-like device vector implementation

week 12 Streams
Basic Stream Program
with Asynchronous memcpy
with cudaHostAlloc
Cooperative Kernels
Dynamic Parallelism
Conditional Child Kernels

week 13 Topics
Dynamic Parallelism using Global Device Memory
Dynamic Parallelism with Non-Blocking Streams
MultiGPU: Number of Devices
Cross-Device Synchronization
PTX: CUDA Code, Assembly Code

week 14 Topics
Basic Warp Voting
Converting Mask to Count (popc)
Use of ffs
Conditional Participation in ballot
Loop Unrolling, Unrolled Assembly
Heterogeneous Computation