CS6700: Reinforcement learning

Course information

  • When: Jul-Nov 2025

  • Lectures: Slot J

  • Where: CS34

  • Teaching Assistants: Udit Narayan Singh (CS23S038), Sahil Kumar Koiri (CS25M042), Shikhar Tiwari (CS25M044), Sayak Sen (CS25S006)

Course Content

Markov Decision Processes (MDPs)

  • Finite horizon MDPs: General theory, DP algorithm

  • Infinite horizon model: (1) Stochastic shortest path

    • General theory: Contraction mapping, Bellman equation

    • Computational solution schemes: Value and policy iteration, convergence analysis

  • Infinite horizon model: (2) Discounted cost MDPs

    • General theory: Contraction mapping, Bellman equation

    • Classical solution techniques: value and policy iteration (a value-iteration sketch follows this list)
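
To make the solution schemes above concrete, here is a minimal value-iteration sketch for a discounted-cost MDP, assuming the transition matrices P[a], one-stage cost vectors g[a], and discount factor alpha are supplied as NumPy arrays (all names are illustrative, not course-supplied code):

    import numpy as np

    # Value iteration for a discounted-cost MDP.
    # P[a] is the |S| x |S| transition matrix under action a, g[a] is the
    # |S|-vector of one-stage costs, and alpha in (0, 1) is the discount factor.
    def value_iteration(P, g, alpha, tol=1e-8, max_iter=10_000):
        num_actions, num_states = len(P), P[0].shape[0]
        J = np.zeros(num_states)
        for _ in range(max_iter):
            # Bellman operator: (TJ)(s) = min_a [ g(s,a) + alpha * sum_s' p(s'|s,a) J(s') ]
            Q = np.stack([g[a] + alpha * P[a] @ J for a in range(num_actions)])
            J_new = Q.min(axis=0)
            if np.max(np.abs(J_new - J)) < tol:  # sup-norm stopping rule
                J = J_new
                break
            J = J_new
        # Greedy policy with respect to the (near-)optimal cost J.
        Q = np.stack([g[a] + alpha * P[a] @ J for a in range(num_actions)])
        return J, Q.argmin(axis=0)

Since the Bellman operator is an alpha-contraction in the sup norm, these iterates converge to the optimal cost at a geometric rate.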

Reinforcement learning

  • Stochastic approximation

    • Introduction and connection to RL

    • Convergence result for contraction mappings

  • Tabular methods

    • Monte Carlo policy evaluation

    • Temporal difference learning

      • TD(0), TD(λ)

      • Convergence analysis

    • Q-learning and its convergence analysis (a tabular Q-learning sketch follows this list)

  • Function approximation

    • Approximate policy evaluation using TD(λ)

    • Least-squares methods: LSTD and LSPI (an LSTD sketch follows this list)

  • Policy-gradient algorithms

    • Policy gradient theorem

    • Gradient estimation using likelihood ratios

    • Variants (REINFORCE, PPO, etc.); a REINFORCE sketch follows this list
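
As a companion to the tabular methods above, here is a minimal Q-learning sketch. The Gym-style environment interface (env.reset() returning a state, env.step(a) returning (next_state, cost, done, info)) and the step-size and exploration schedules are illustrative assumptions; costs are minimized to match the course's cost-based formulation.

    import numpy as np

    # Tabular Q-learning with epsilon-greedy exploration, minimizing costs.
    def q_learning(env, num_states, num_actions, alpha=0.95,
                   num_episodes=5000, eps=0.1, seed=0):
        Q = np.zeros((num_states, num_actions))
        visits = np.zeros((num_states, num_actions))
        rng = np.random.default_rng(seed)
        for _ in range(num_episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection.
                a = rng.integers(num_actions) if rng.random() < eps else int(Q[s].argmin())
                s_next, cost, done, _ = env.step(a)
                visits[s, a] += 1
                step = 1.0 / visits[s, a]  # Robbins-Monro step size
                target = cost if done else cost + alpha * Q[s_next].min()
                # Stochastic-approximation update toward the Bellman target.
                Q[s, a] += step * (target - Q[s, a])
                s = s_next
        return Q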
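
For the function-approximation topics, here is a minimal LSTD(0) sketch with linear features; the feature map phi and the sampled transitions are hypothetical inputs, not course-supplied code.

    import numpy as np

    # LSTD(0) for approximate policy evaluation with linear features.
    # phi(s) -> R^d is a hypothetical feature map and `samples` is a list of
    # transitions (s, cost, s_next) collected under the policy being evaluated.
    def lstd(samples, phi, d, alpha, reg=1e-6):
        A = np.zeros((d, d))
        b = np.zeros(d)
        for s, cost, s_next in samples:
            f, f_next = phi(s), phi(s_next)
            A += np.outer(f, f - alpha * f_next)  # A accumulates phi(s)(phi(s) - alpha phi(s'))^T
            b += cost * f                         # b accumulates phi(s) * cost
        # Solve A theta = b; a small ridge term keeps A well conditioned.
        theta = np.linalg.solve(A + reg * np.eye(d), b)
        return theta  # approximate cost: J(s) is roughly phi(s) @ theta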
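
Finally, a minimal REINFORCE sketch with a tabular softmax policy, using the likelihood-ratio (score-function) gradient estimate from the policy gradient theorem. Here env.step(a) is assumed to return a reward to be maximized, as in Sutton and Barto; the interface is otherwise the same assumption as in the Q-learning sketch.

    import numpy as np

    # REINFORCE with a tabular softmax policy.
    def reinforce(env, num_states, num_actions, gamma=0.99,
                  lr=0.01, num_episodes=2000, seed=0):
        theta = np.zeros((num_states, num_actions))  # policy parameters
        rng = np.random.default_rng(seed)
        for _ in range(num_episodes):
            # Roll out one episode under the current softmax policy.
            traj, s, done = [], env.reset(), False
            while not done:
                logits = theta[s] - theta[s].max()
                probs = np.exp(logits) / np.exp(logits).sum()
                a = rng.choice(num_actions, p=probs)
                s_next, r, done, _ = env.step(a)
                traj.append((s, a, r, probs))
                s = s_next
            # Monte Carlo returns and score-function updates:
            # grad log pi(a|s) = e_a - pi(.|s) for a softmax policy.
            G = 0.0
            for s, a, r, probs in reversed(traj):
                G = r + gamma * G
                grad_log = -probs
                grad_log[a] += 1.0
                theta[s] += lr * G * grad_log
        return theta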

The portion on MDPs roughly coincides with Chapter 1 of Vol. I of Bertsekas' Dynamic Programming and Optimal Control and Chapters 2, 4, 5, and 6 of Bertsekas and Tsitsiklis' Neuro-Dynamic Programming. For several topics, the book by Sutton and Barto is a useful reference, in particular for building an intuitive understanding. Chapters 6 and 7 of DP/OC Vol. II are also a useful reference for the advanced topics under RL with function approximation.

Honourable omissions: Neural networks, Average cost models.

Grading

  • Quiz-1: 15%

  • Quiz-2: 15%

  • Final exam: 30%

  • Papers’ study: 10%

  • Class participation (including attendance): 10%

  • Mini-quizzes: 20%

Mini-quiz 1 is mandatory and carries a weightage of 10%, while the better of the scores on mini-quizzes 2 and 3 will be taken for the remaining 10%.

Important Dates

  • Quiz-1, Quiz-2 and Final: Sep 3, Oct 8, Nov 14 (as per academic calendar)

  • Mini-quizzes: Aug 11, Sep 29, Oct 27

  • Papers’ study: Oct 28

Papers’ study

Students form teams of at most two and read a set of papers on RL. Each team will then be quizzed orally on the content of these papers by the instructor and the TAs.

Textbooks/References

  • D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.

  • D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.

  • D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.

  • R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2020.

Resources

  • The lecture notes are accessible here

  • The schedule of lectures from the 2019 run of this course is available here