CS6023 GPU Programming

January 2025

Photo Courtesy: Satya Bhagavan

Important Links

Moodle
All codes
Video recordings
FAQ one and two
Using Aqua | Google Colab | Olakrutrim
(Courtesy: HPCE Team | Rajesh Pandian | Sanket Tarafder)

Course Slides

TCF rating:

course = 0.87 (institute mean 0.79)

instructor = 0.94 (institute mean 0.83)

Full report

Evaluation

Eval. Item	Marks	Student deadline	TA deadline	Instructor deadline
A1	10	Feb 9	+10 days	--
A2	15	Mar 2	+10 days	--
A3	15	Apr 6	+10 days	--
A4/Project	20	May 4	+10 days	--

MidSem	20	Mar 6	--	+10 days
EndSem	20	May 9	--	+10 days

Attendance
Standard institute rules apply.

Other details

Syllabus and structure

Prerequisite: CS2710 (PDS Lab) or equivalent.

TAs: Joel Raj K, Hrudai Koda, Shubhodeep Chanda, Ravindra Bidwe, Abhirup Majumder, Kushagra Jain, Tanmay Pramod Garde, Srikakolapu Naga Soma Satya Bhagavan, Sriram Gugulothu, Omkar Dhawal

Instructor: Rupesh Nasre.

Venue: SSB 134

Slot: D (Monday 11, Tuesday 10, Wednesday 9, Thursday 12)

Schedule

Month Dates Topic Comments

January 16 Introduction, Computation
Hello World, One, Two, Three
Grid, Blocks, Threads
Kernel Launch: 1D, 1D-General, 2D

20, 21, 22, 23 Computation
CPU-GPU Communication (cudaMalloc, cudaMemcpy)
Global variables
Matrix mult.: CPU, Outer parallel, Outer+Inner parallel

27, 28, 29, 30 Computation
Thread Divergence
Divergence due to switch
Problem Set 1

February 3, 4, 5, 6 Memory
Memory Coalescing
AoS versus SoA
Barrier

10, 11, 12, 13 Memory
Linked List Copying
Shared Memory
Shared Memory with Barrier
String Permutation
Dynamic Shared Memory
Dynamic Shared Memory with Multiple Arrays

CUDA GDB
Error Handling
Dangling Pointer

17, 18, 19, 20 Memory ~~Class at 8 on February 17.~~
Texture Memory (via CUDA SDK)
Constant Memory
Bank Conflicts
Problem Set 2

NvProf
Original Code
Loop Fusion
Kernel Fusion
Converting Loop to Blocks

24, 25, 26, 27 Synchronization Class at 8 on February 25.
Convolution
Worklist Insertion
Task Donation

March 3, 4, 5, 6 Synchronization MidSem on 6.
Reduction: i + N/2, N - i, i + 1
Prefix Sum / Scan

10, 11, 12, 13 Synchronization
No Global Barrier
Global Barrier using Atomics
Hierarchical Global Barrier

17, 18, 19, 20 Synchronization
Linked List Insertion
CPU-GPU Shared Pinned Memory
Persistent Kernel
Problem Set 3

24, 25, 26, 27 Synchronization
Array increment: Sequential, Parallel
Thrust basics
Thrust Reduction
Thrust Prefix Sum
Thrust-like device vector implementation

April 1, 2, 3 Functions Class at 8 on April 1.
Basic Stream Program
with Asynchronous memcpy
with cudaHostAlloc
Cooperative Kernels

7, 8, 9 Functions
Dynamic Parallelism
Conditional Child Kernels
using Global Device Memory
with Non-Blocking Streams

15, 16, 17 Functions
MultiGPU: Number of Devices
Cross-Device Synchronization

21, 22, 23, 24 Topics Guest lectures by Dr. Pradeep Ramachandran, KLA on 23 and 24
PTX: CUDA Code, Assembly Code
Basic Warp Voting
Converting Mask to Count (popc)
Use of ffs
Conditional Participation in ballot

28, 29, 30 Topics
Loop Unrolling, Unrolled Assembly
Heterogeneous Computation

May 1 Topics
with Shared Variable
Task Distribution
OpenMP Reduction
with HostAlloc'ed Memory
Dynamic Scheduling
OpenCL: Driver, Kernel

9 EndSem from 10:00 -- 12:00
