CS6700: Reinforcement learning
Course information
When: Jul-Nov 2025
Lectures: Slot J
Where: CS34
Teaching Assistants: Udit Narayan Singh (CS23S038), Sahil Kumar Koiri (CS25M042), Shikhar Tiwari (CS25M044), Sayak Sen (CS25S006)
Course Content
Markov Decision Processes (MDPs)
Finite horizon MDPs: General theory, DP algorithm
Infinite horizon model: (1) Stochastic shortest path
General theory: Contraction mapping, Bellman equation
Computational solution schemes: Value and policy iteration, convergence analysis
Infinite horizon model: (2) Discounted cost MDPs
General theory: Contraction mapping, Bellman equation
Classical solution techniques: value and policy iteration (value-iteration sketch below)
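To make the discounted-cost material above concrete, here is a minimal value-iteration sketch in Python. The two-state, two-action MDP (transition kernel P, one-stage costs c, discount factor gamma) is entirely made up for illustration; this is a sketch, not course-supplied code.

```python
import numpy as np

# Hypothetical two-state, two-action discounted-cost MDP (all numbers made up).
# P[a, s, s2] = probability of moving from s to s2 under action a;
# c[s, a] = one-stage cost.
gamma = 0.9
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],
              [[0.5, 0.5],
               [0.9, 0.1]]])
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality operator:
    # (TV)(s) = min_a [ c(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]
    Q = c + gamma * np.tensordot(P, V, axes=([2], [0])).T  # shape (states, actions)
    V_new = Q.min(axis=1)
    if np.abs(V_new - V).max() < 1e-8:  # sup-norm stopping rule
        break
    V = V_new

greedy_policy = Q.argmin(axis=1)  # greedy policy w.r.t. the converged V
print(V, greedy_policy)
```

The sup-norm stopping rule is justified by the Bellman operator being a contraction in the sup norm, which is exactly the convergence analysis listed above.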
Reinforcement learning
Stochastic approximation
Introduction and connection to RL
Convergence result for contraction mappings
Tabular methods
Monte Carlo policy evaluation
Temporal difference learning
TD(0), TD(λ) (TD(0) sketch after this outline)
Convergence analysis
Q-learning and its convergence analysis (sketch after this outline)
Function approximation
Approximate policy evaluation using TD(λ)
Least-squares methods: LSTD and LSPI (LSTD sketch after this outline)
Policy-gradient algorithms
Policy gradient theorem
Gradient estimation using likelihood ratios
Variants (REINFORCE, PPO, etc.; REINFORCE sketch after this outline)
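The sketches below illustrate a few of the algorithms listed in this outline; all environments, features, and hyperparameters in them are invented for illustration. First, TD(0) policy evaluation on a five-state random walk (following the Sutton-Barto reward convention). Note that TD(0) is itself a stochastic-approximation scheme: each update moves V(s) a small step towards the bootstrapped target r + gamma*V(s').

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 5-state random walk, episodic: states 0 and 4 are terminal;
# reaching state 4 gives reward 1, everything else reward 0. The evaluated
# policy moves left/right uniformly at random.
n_states = 5
gamma = 1.0
alpha = 0.1  # fixed step size, chosen arbitrarily

V = np.zeros(n_states)
for episode in range(2000):
    s = 2  # start in the middle
    while s not in (0, n_states - 1):
        s_next = s + 1 if rng.random() < 0.5 else s - 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        # TD(0) update: a stochastic-approximation step towards r + gamma*V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V[1:-1])  # should approach (0.25, 0.5, 0.75) for this random walk
```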
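Next, tabular Q-learning with an epsilon-greedy behaviour policy on an invented chain environment; ties in the greedy action are broken randomly so that early episodes behave like an unbiased walk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented 5-state chain: action 1 moves right, action 0 moves left
# (with walls at the ends); reaching state 4 ends the episode with reward 1.
n_states, n_actions = 5, 2
alpha, gamma, eps = 0.1, 0.95, 0.2  # arbitrary hyperparameters

Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s):
    # Explore with probability eps; break exact ties randomly.
    if rng.random() < eps or Q[s, 0] == Q[s, 1]:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

for episode in range(500):
    s = 0
    done = False
    while not done:
        a = epsilon_greedy(s)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        done = s_next == n_states - 1
        # Q-learning: off-policy bootstrap on max_a' Q(s', a')
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

print(Q.argmax(axis=1)[:-1])  # learned greedy policy: expected to move right everywhere
```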
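For the function-approximation portion, an LSTD(0) sketch that evaluates the same random-walk policy with linear features V(s) ≈ phi(s)^T w; the one-hot features here are a made-up choice, under which LSTD recovers the tabular values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Evaluate the uniform random policy on the 5-state random walk with linear
# function approximation. Features (invented): one-hot over the three
# non-terminal states, zero vector at the terminals.
n_states, gamma = 5, 1.0

def phi(s):
    f = np.zeros(3)
    if 0 < s < n_states - 1:
        f[s - 1] = 1.0
    return f

# LSTD(0): accumulate A = sum phi(s)(phi(s) - gamma*phi(s'))^T and
# b = sum phi(s)*r over sampled transitions, then solve A w = b.
A = np.zeros((3, 3))
b = np.zeros(3)
for episode in range(2000):
    s = 2
    while s not in (0, n_states - 1):
        s_next = s + 1 if rng.random() < 0.5 else s - 1
        r = 1.0 if s_next == n_states - 1 else 0.0
        f = phi(s)
        A += np.outer(f, f - gamma * phi(s_next))
        b += f * r
        s = s_next

w = np.linalg.solve(A, b)
print(w)  # should approach (0.25, 0.5, 0.75), the tabular values
```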
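Finally, a REINFORCE sketch using the likelihood-ratio (score-function) gradient on a made-up two-armed bandit; with a one-step horizon the return equals the immediate reward, so the policy-gradient update is especially transparent. The arm means and step size are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up two-armed bandit; REINFORCE over softmax preferences theta.
true_means = np.array([0.2, 0.8])
theta = np.zeros(2)
alpha = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(2000):
    pi = softmax(theta)
    a = int(rng.choice(2, p=pi))
    r = rng.normal(true_means[a], 0.1)  # sampled reward (= return, one step)
    # Likelihood-ratio (score-function) gradient: grad log pi(a) = e_a - pi
    grad_log = -pi
    grad_log[a] += 1.0
    theta += alpha * r * grad_log  # REINFORCE update, no baseline
print(softmax(theta))  # probability mass should concentrate on the better arm
```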
The portion on MDPs roughly coincides with Chapter 1 of Vol. I of Bertsekas' Dynamic Programming and Optimal Control and Chapters 2, 4, 5, and 6 of Bertsekas and Tsitsiklis' Neuro-Dynamic Programming. For several topics, the book by Sutton and Barto is a useful reference, in particular for gaining an intuitive understanding. Chapters 6 and 7 of DP/OC Vol. II are a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
Grading
Quiz-1: 15%
Quiz-2: 15%
Final exam: 30%
Papers’ study: 10%
Class participation (including attendance): 10%
Mini-quizzes: 20%
Mini-quiz 1 is mandatory and carries a weightage of 10%, while the better of the scores on mini-quizzes 2 and 3 will be taken for the remaining 10%.
Important Dates
Quiz-1, Quiz-2, and the Final exam: Sep 3, Oct 8, and Nov 14, respectively (as per the academic calendar)
Mini-quizzes: Aug 11, Sep 29, Oct 27
Papers’ study: Oct 28
Papers’ study
Students form teams of at most two and read a set of papers on RL. Each team is then quizzed orally on the content of these papers by the instructor and TAs.
Textbooks/References
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018.