Pre-requisites


Logistics


Schedule for Part-1 (Oct-Nov '23)


Lecture # | Contents | Lecture Slides | Lecture Videos | Extra Reading Material
Week 1
  • Introduction to the transformer architecture
  • Self-attention, encoder layers, and the encoder stack (a minimal self-attention sketch follows below)
Slides
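As a companion to the self-attention material above, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the shapes and random weights are purely illustrative and not tied to any particular model.

```python
# A toy single-head scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # each row is a distribution over positions
    return weights @ V                        # (seq_len, d_k) weighted mix of values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 5
X = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
W_q, W_k, W_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
print(self_attention(X, W_q, W_k, W_v).shape) # (5, 8)
```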
Week 2
  • Teacher forcing and masked attention (a toy causal mask is sketched below)
  • Zooming into the decoder layer and the decoder stack
  • Position encoding
  • Batch and layer normalization
Slides
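A small illustration of the masked (causal) attention idea from this week: future positions are given a score of negative infinity before the softmax, so each token can only attend to itself and earlier tokens. This is a toy sketch, not code from the lecture.

```python
# Causal (look-ahead) mask for masked self-attention.
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks "future" positions that must be hidden.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# With uniform raw scores, row t spreads its attention evenly over positions 0..t only.
print(masked_softmax(np.zeros((4, 4))).round(2))
```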
Week 3
  • What is a language model?
  • Decoder-only LLMs: A deep dive into GPT to understand the architecture and training objectives (Causal language model).
  • The distinction between pre-training and fine-tuning.
  • Understanding decoding strategies: greedy, beam search, top-k, and top-p sampling (see the sketch after this list).
Slides
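To make the decoding strategies above concrete, here is a hedged sketch of greedy, top-k, and top-p (nucleus) selection over a single toy next-token distribution; a real decoder applies one of these at every generation step, and beam search additionally keeps several partial sequences, which is omitted here.

```python
# Greedy, top-k, and top-p selection over one toy next-token distribution.
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    return int(np.argmax(probs))                      # always the most likely token

def top_k_sample(probs, k=3):
    idx = np.argsort(probs)[-k:]                      # keep the k most likely tokens
    p = probs[idx] / probs[idx].sum()                 # renormalise over the kept set
    return int(rng.choice(idx, p=p))

def top_p_sample(probs, p=0.9):
    order = np.argsort(probs)[::-1]                   # tokens sorted by probability
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]  # smallest nucleus covering mass p
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])         # toy next-token distribution
print(greedy(probs), top_k_sample(probs), top_p_sample(probs))
```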
Week 4
  • Encoder-only LLMs: A deep dive into BERT to understand the architecture and training objectives (Masked Language Model)
  • Adapting to downstream tasks.
Slides
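As a small illustration of the masked language model objective, the sketch below applies the BERT-style corruption rule (roughly 15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) to a toy sentence. The sentence and vocabulary are made up for illustration.

```python
# BERT-style input corruption for masked language modelling.
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep unchanged
        else:
            labels.append(None)                      # not part of the MLM loss
            inputs.append(tok)
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```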
Week 5
  • Understanding tokenization
  • Challenges and the motivation for sub-word tokenization.
  • Byte Pair Encoding and WordPiece (a toy BPE merge loop is sketched below).
  • SentencePiece.
Slides
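The sketch referenced above: a toy Byte Pair Encoding merge loop on a small word-frequency table, where each round counts adjacent symbol pairs and merges the most frequent pair into a new symbol. It follows the textbook toy example, not any production tokenizer.

```python
# Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent symbol pair.
import re
from collections import Counter

def pair_counts(vocab):
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    # Replace the pair only when it appears as two whole, space-separated symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for step in range(5):
    best_pair = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best_pair, vocab)
    print(step, best_pair)
```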
Week 6
  • Encoder-decoder models: A brief intro to BART.
  • Text-to-text framework: GPT-2 (zero-shot learning) and a deep dive into the T5 framework (a short text-to-text example follows below).
Slides
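To show the text-to-text framing in practice, here is a short sketch using the Hugging Face transformers library (an external dependency, not part of the course material). The t5-small checkpoint and the task prefixes follow the T5 paper's convention; running it downloads the model weights.

```python
# T5's text-to-text framing: every task is "prefix: input text" -> generated text.
# Requires: pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same model handles different tasks purely through the text prefix.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: The transformer architecture relies entirely on attention ...",
]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```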
The road ahead: Zooming into the differences
  • Data: What are the different data sources? What are the components of a good data pipeline?
  • Model/Architecture: What are the different types of attention mechanisms? How do you increase the scale of a model (narrow vs. deep)? What are the different types of positional embeddings?
  • Training: What are the different types of objective functions? Are there any specific choices for optimisers?

Schedule for Part-2 (Jan-Feb '24)


Lecture # | Contents | Lecture Slides | Lecture Videos | Extra Reading Material
Week 7 Data
  • The bigger picture: A taxonomy of all models (encoder-only, decoder-only, encoder-decoder)
  • Where do they differ? Data (sources and pipelines), model (scale, attention, PEs, vocabulary), and training (optimiser, objective functions)
  • Data Sources: C4, mC4, Pile, RedPajama, SlimPajama, Stack, Sangraha, …
  • Data Pipelines: Gopher, OPT, Llama, Falcon, SETU
  • Studies on the effectiveness of clean data (recent papers)
Slides-1
Slides-2
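As a flavour of what a cleaning pipeline does, here is a toy heuristic document filter with thresholds loosely inspired by Gopher-style quality rules; the cut-off values are illustrative guesses, not the values used in any of the pipelines listed above.

```python
# Toy document-quality filter: cheap heuristics of the kind used in data pipelines.
def keep_document(text,
                  min_words=50, max_words=100_000,
                  max_symbol_ratio=0.1, min_mean_word_len=3, max_mean_word_len=10):
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False                                    # too short or too long
    mean_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_word_len <= mean_len <= max_mean_word_len):
        return False                                    # likely boilerplate or garbage
    symbols = sum(text.count(c) for c in "#{}<>|")
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False                                    # probably markup or code debris
    return True

docs = ["word " * 10, "This is a reasonably long natural-language paragraph. " * 20]
print([keep_document(d) for d in docs])                 # [False, True]
```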
Week 8 Fast attention and Fast Inference
  • Time and space complexity of attention mechanism
  • Sub-quadratic attention: local, dilated, random, and block-sparse attention; low-rank approximations; kernel methods
  • Hardware-aware full attention: FlashAttention
  • Fast inference: KV caching, MQA, GQA, PagedAttention (see the KV-cache sketch below)
Slides
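A minimal sketch of the KV-caching idea mentioned above: during autoregressive decoding, the keys and values of past tokens are stored, so each new step only projects the newest token and attends over the cache. Shapes and random weights are illustrative.

```python
# KV caching during autoregressive decoding (single head, NumPy).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
W_q, W_k, W_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: (d_model,) embedding of the newest token only."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)                 # O(1) new projection work per step
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d_k))   # attend over all cached positions
    return weights @ V

for _ in range(4):
    out = decode_step(rng.normal(size=d_model))
print(out.shape, len(k_cache))                  # (8,) 4
```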
Week 9 Position encoding and Length Generalization
  • Positional embeddings: the importance of long sequences, drawbacks of sinusoidal PEs, ALiBi, RoPE, NoPE
  • Pre-normalization vs post-normalization
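As a taste of the positional-embedding alternatives listed above, here is a rough sketch of the ALiBi bias: instead of adding position embeddings, each attention head subtracts a head-specific linear penalty proportional to how far back the key is. The slope formula follows the ALiBi paper's recipe for a power-of-two number of heads; everything else is illustrative.

```python
# ALiBi (Attention with Linear Biases): scores get -slope * (query_pos - key_pos).
import numpy as np

def alibi_bias(seq_len, num_heads):
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads (power-of-two heads assumed).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]             # how far back the key is (i - j)
    bias = -slopes[:, None, None] * np.maximum(dist, 0)
    return bias                                    # (num_heads, seq_len, seq_len);
                                                   # future positions are handled by the causal mask

print(alibi_bias(seq_len=5, num_heads=8)[0].round(2))   # head 0, slope 1/2
```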
Week 10 Training (Part 1)
  • Recap of optimisers, the Lion optimiser
  • Learning rate schedules, gradient clipping, and typical failures during training (a schedule-and-clipping sketch follows below)
  • Scaling laws
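To make the training-recipe topics a bit more concrete, the sketch below implements a linear-warmup-then-cosine-decay learning-rate schedule and global-norm gradient clipping. All constants (peak LR, warmup steps, clip norm) are illustrative choices, not recommendations from any specific paper.

```python
# Warmup-then-cosine learning-rate schedule and global-norm gradient clipping.
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total=100_000):
    if step < warmup:
        return max_lr * step / warmup                      # linear warmup from 0 to max_lr
    progress = (step - warmup) / (total - warmup)          # 0 -> 1 over the rest of training
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def clip_gradients(grads, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grads))            # global L2 norm of all gradients
    scale = min(1.0, max_norm / (norm + 1e-6))             # shrink only if norm exceeds max_norm
    return [g * scale for g in grads]

print(lr_at(0), lr_at(2000), lr_at(100_000))               # 0, peak LR, final (min) LR
print(clip_gradients([3.0, 4.0]))                          # norm 5 rescaled to ~1
```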
Week 11 Training (Part 2)
  • Mixed-precision training, activation checkpointing, CPU offloading (a mixed-precision sketch follows below)
  • 3D parallelism, ZeRO
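A hedged sketch of one mixed-precision training step using PyTorch's AMP utilities (torch.cuda.amp); it assumes PyTorch and a CUDA device are available, and the tiny linear layer is just a stand-in for a transformer block.

```python
# One mixed-precision training step with loss scaling and gradient clipping.
import torch
import torch.nn as nn

device = "cuda"                                    # AMP with GradScaler assumes a CUDA device
model = nn.Linear(512, 512).to(device)             # stand-in for a transformer block
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()               # rescales the loss to avoid fp16 underflow

x = torch.randn(8, 512, device=device)
target = torch.randn(8, 512, device=device)

with torch.cuda.amp.autocast():                    # forward pass runs in reduced precision
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                      # backward pass on the scaled loss
scaler.unscale_(opt)                               # unscale gradients before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)                                   # skips the update if gradients overflowed
scaler.update()
opt.zero_grad(set_to_none=True)
```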
Week 12 Bringing it all together
  • Coming back to the big picture: An overview of all recent LLMs highlighting the differences and similarities
The road ahead: Fine-tuning and evaluating LLMs

Schedule for Part-3 (Mar'24)


We are still working on the schedule for Part 3, but here is a tentative list of topics:
  • Fine-tuning LLMs: Prompt tuning, multi-task fine-tuning, and parameter-efficient fine-tuning (a minimal low-rank adapter sketch follows below)
  • Evaluating LLMs: Benchmarks, evaluation frameworks, and popular leaderboards
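As a preview of the parameter-efficient fine-tuning topic, here is a minimal sketch of the low-rank adapter idea popularised by LoRA: the pretrained weight stays frozen and only a small low-rank update B @ A is trained. Dimensions, rank, and initialisation scale are illustrative.

```python
# Low-rank adapter (LoRA-style) on a single frozen weight matrix.
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # model dimension and adapter rank (r << d)
W = rng.normal(size=(d, d))          # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable, r * d parameters
B = np.zeros((d, r))                 # trainable, zero-initialised so W is unchanged at start

def adapted_forward(x):
    # Only A and B would receive gradient updates during fine-tuning.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
print(adapted_forward(x).shape)                    # (1, 512)
print(2 * d * r, "trainable params vs", d * d)     # 8192 vs 262144 in the full weight
```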