CS6023 GPU Programming

January 2025

Important Links

Course Slides
  1. Intro + Logistics
  2. Computation
  3. Memory
  4. Synchronization
  5. Functions
  6. Support
  7. Streams
  8. Topics
  9. Case Study -- Graphs

Evaluation
Eval. Item    Marks   Deadlines
                      Student   TA         Instructor
A1            10      Feb 9     +10 days   --
A2            15      Mar 2     +10        --
A3            15      Apr 6     +10        --
A4/Project    20      May 4     +10        --
MidSem        20      Mar 6     --         +10
EndSem        20      May 9     --         +10

Attendance
Standard institute rules apply.

Other details

  • Syllabus and structure
  • Prerequisite: CS2710 (PDS Lab) or equivalent.
  • TAs: Joel Raj K, Hrudai Koda, Shubhodeep Chanda, Ravindra Bidwe, Abhirup Majumder, Kushagra Jain, Tanmay Pramod Garde, Srikakolapu Naga Soma Satya Bhagavan, Sriram Gugulothu, Omkar Dhawal
  • Instructor: Rupesh Nasre.
  • Venue: SSB 134
  • Slot: D (Monday 11, Tuesday 10, Wednesday 9, Thursday 12)


Schedule
Month     Dates            Topic                       Comments

January   16               Introduction, Computation
  • Hello World, One, Two, Three
  • Grid, Blocks, Threads
  • Kernel Launch: 1D, 1D-General, 2D

          20, 21, 22, 23   Computation
  • CPU-GPU Communication (cudaMalloc, cudaMemcpy)
  • Global variables
  • Matrix mult.: CPU, Outer parallel, Outer+Inner parallel

          27, 28, 29, 30   Computation
  • Thread Divergence
  • Divergence due to switch
  • Problem Set 1

February  3, 4, 5, 6       Memory
  • Memory Coalescing
  • AoS versus SoA
  • Barrier

          10, 11, 12, 13   Memory
  • Linked List Copying
  • Shared Memory
  • Shared Memory with Barrier
  • String Permutation
  • Dynamic Shared Memory
  • Dynamic Shared Memory with Multiple Arrays

    CUDA GDB
  • Error Handling
  • Dangling Pointer

          17, 18, 19, 20   Memory                      Class at 8 on February 17.
  • Texture Memory (via CUDA SDK)
  • Constant Memory
  • Bank Conflicts
  • Problem Set 2

    NvProf
  • Original Code
  • Loop Fusion
  • Kernel Fusion
  • Converting Loop to Blocks

          24, 25, 26, 27   Synchronization             Class at 8 on February 25.
  • Convolution
  • Worklist Insertion
  • Task Donation

March     3, 4, 5, 6       Synchronization             MidSem on 6.
  • Reduction: i + N/2, N - i, i + 1
  • Prefix Sum / Scan

          10, 11, 12, 13   Synchronization
  • No Global Barrier
  • Global Barrier using Atomics
  • Hierarchical Global Barrier

          17, 18, 19, 20   Synchronization
  • Linked List Insertion
  • CPU-GPU Shared Pinned Memory
  • Persistent Kernel
  • Problem Set 3

          24, 25, 26, 27   Synchronization
  • Array increment: Sequential, Parallel
  • Thrust basics
  • Thrust Reduction
  • Thrust Prefix Sum
  • Thrust-like device vector implementation

April     1, 2, 3          Functions                   Class at 8 on April 1.
  • Basic Stream Program
  • with Asynchronous memcpy
  • with cudaHostAlloc
  • Cooperative Kernels

          7, 8, 9          Functions
  • Dynamic Parallelism
  • Conditional Child Kernels
  • using Global Device Memory
  • with Non-Blocking Streams

          15, 16, 17       Functions
  • MultiGPU: Number of Devices
  • Cross-Device Synchronization

          21, 22, 23, 24   Topics                      Guest lectures by Dr. Pradeep Ramachandran, KLA on 23 and 24.
  • PTX: CUDA Code, Assembly Code
  • Basic Warp Voting
  • Converting Mask to Count (popc)
  • Use of ffs
  • Conditional Participation in ballot

          28, 29, 30       Topics
  • Loop Unrolling, Unrolled Assembly
  • Heterogeneous Computation

May       1                Topics
  • with Shared Variable
  • Task Distribution
  • OpenMP Reduction
  • with HostAlloc'ed Memory
  • Dynamic Scheduling
  • OpenCL: Driver, Kernel

          9                                            EndSem from 10:00 -- 12:00

