CS6700: Reinforcement learning
Course information
When: Feb-May 2021
Lectures: Slot J
Where: Online
Teaching Assistants: Nithia Vijayan, Dipayan Sen
Course Content
Markov Decision Processes
Finite horizon model
General theory
DP algorithm
Infinite horizon model: (1) Stochastic shortest path
General theory: Contraction mapping, Bellman equation
Computational solution schemes: Value and policy iteration, convergence analysis
Infinite horizon model: (2) Discounted cost MDPs
General theory: Contraction mapping, Bellman equation
Classical solution techniques: value and policy iteration
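To make the classical solution techniques above concrete, here is a minimal sketch of value iteration for a discounted cost MDP. The model inputs P (transition array) and c (one-stage cost array) are hypothetical placeholders; the loop simply iterates the Bellman optimality operator, which is a sup-norm contraction with modulus alpha.

import numpy as np

def value_iteration(P, c, alpha=0.9, tol=1e-8):
    # P[a, s, s2]: probability of moving from s to s2 under action a, shape (A, S, S)
    # c[a, s]: expected one-stage cost of action a in state s, shape (A, S)
    # alpha: discount factor in (0, 1)
    A, S, _ = P.shape
    J = np.zeros(S)
    while True:
        # Bellman optimality operator: (TJ)(s) = min_a [ c(a, s) + alpha * sum_s2 P(a, s, s2) J(s2) ]
        Q = c + alpha * (P @ J)          # shape (A, S)
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=0)   # approximate optimal cost-to-go and a greedy policy
        J = J_new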
Reinforcement learning
Stochastic iterative algorithm
Convergence result for contraction mappings
Tabular methods
Monte Carlo policy evaluation
Temporal difference learning
TD(0), TD(lambda)
Convergence analysis
Q-learning and its convergence analysis
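As a quick illustration of the tabular methods listed above, the following is a minimal Q-learning sketch in the cost-minimisation convention used in this course. The environment interface (reset/step returning next state, cost and a termination flag) is a hypothetical placeholder, and the 1/n(s, a) step sizes satisfy the usual stochastic approximation conditions under which Q-learning converges.

import numpy as np

def q_learning(env, num_states, num_actions, num_episodes=5000, alpha=0.9, epsilon=0.1):
    # alpha: discount factor; epsilon: exploration probability
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration around the current greedy (cost-minimising) action
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmin(Q[s]))
            s_next, cost, done = env.step(a)
            visits[s, a] += 1
            step = 1.0 / visits[s, a]        # diminishing step size 1/n(s, a)
            target = cost + alpha * np.min(Q[s_next]) * (not done)
            Q[s, a] += step * (target - Q[s, a])
            s = s_next
    return Q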
Function approximation
Approximate policy iteration and error bounds
Approximate policy evaluation using TD(lambda)
The case of linear function approximation
Convergence analysis
Least-squares methods: LSTD and LSPI
Q-learning
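For the function approximation part, here is a minimal sketch of approximate policy evaluation via TD(lambda) with a linear architecture, J(s) ≈ theta^T phi(s). The feature map phi, the environment interface and the fixed policy being evaluated are hypothetical placeholders.

import numpy as np

def linear_td_lambda(env, policy, phi, d, lam=0.9, alpha=0.9, num_episodes=2000):
    # phi(s): d-dimensional feature vector; lam: trace parameter; alpha: discount factor
    theta = np.zeros(d)
    for k in range(num_episodes):
        s = env.reset()
        z = np.zeros(d)                  # eligibility trace
        step = 1.0 / (k + 1)             # diminishing step size
        done = False
        while not done:
            a = policy(s)
            s_next, cost, done = env.step(a)
            # temporal-difference error for the cost-to-go estimate
            delta = cost + alpha * (0.0 if done else theta @ phi(s_next)) - theta @ phi(s)
            z = alpha * lam * z + phi(s)
            theta += step * delta * z
            s = s_next
    return theta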
Tentative list of other topics:
Policy-gradient and actor-critic algorithms
Policy gradient theorem
Gradient estimation using likelihood ratios
Actor-critic methods: Compatible features, essential features, advantage updating
Regret minimization in MDPs
Introduction to bandits and UCB algorithm
UCRL and PSRL algorithms
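For the regret minimisation topic above, the following is a minimal sketch of the UCB1 algorithm for a stochastic multi-armed bandit. The arms are modelled as hypothetical callables returning rewards in [0, 1]; each arm is pulled once before the index-based selection starts.

import numpy as np

def ucb1(arms, horizon):
    # arms[i]() returns a reward in [0, 1]; at each round pull the arm maximising
    # the empirical mean plus the exploration bonus sqrt(2 log t / n_i)
    K = len(arms)
    counts = np.zeros(K)
    means = np.zeros(K)
    for t in range(1, horizon + 1):
        if t <= K:
            i = t - 1                    # initialise by pulling each arm once
        else:
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            i = int(np.argmax(means + bonus))
        r = arms[i]()
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental empirical mean
    return means, counts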
The portion on MDPs roughly coincides with Chapter 1 of Vol. I of the Dynamic Programming and Optimal Control book by Bertsekas and Chapters 2, 4, 5 and 6 of the Neuro-Dynamic Programming book by Bertsekas and Tsitsiklis. For several topics, the book by Sutton and Barto is a useful reference, in particular for obtaining an intuitive understanding. Also, Chapters 6 and 7 of DP/OC Vol. II are a useful reference for the advanced topics under RL with function approximation.
Honourable omissions: Neural networks, Average cost models.
The schedule of lectures from the 2019 run of this course is available here.
Grading
Mid-term exam: 25%
Final exam: 25%
Assignments (two): 10% each
Project: 20%
Paper review: 5%
Scribe: 5%
Important Dates
Assignment 1: Mar 10, Assignment 2: Apr 20
Mid-term: Mar 24
Final: May 13
Project: May 5 (both report and presentation are due on this date)
Paper review: Apr 26
Scribing: May 10
Paper review
Each student is expected to read a paper from a list to be provided soon and submit a 1-page critical review that identifies the strengths and areas for improvement of the paper studied. A paraphrase of the paper's content is strongly discouraged.
Course project
Students work in teams of two.
Each team is required to implement an RL algorithm of its choice.
A few MDP environments will be provided for tuning the RL algorithm.
The submitted RL algorithm will be evaluated across different MDP environments and assigned a score based on several performance indicators.
Each team is required to submit a short report describing the chosen algorithm and its associated parameters.
Assignments facilitated by AIcrowd
Assignment 1: Click here
Assignment 2: Click here
Project: Click here
Textbooks/References
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. I, Athena Scientific, 2017.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II, Athena Scientific, 2012.
D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, 1996.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd edition, MIT Press, 2018.