Pre-requisites


Logistics


Schedule for Part-1 (Oct-Nov '23)


Lecture # | Contents | Lecture Slides | Lecture Videos | Extra Reading Material
Week 1
  • Introduction to the transformer architecture
  • Self-attention, encoder layers, and the encoder stack (a minimal self-attention sketch follows below)
Slides
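As a companion to the self-attention material above, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the shapes and random weights are purely illustrative and not tied to any particular model.

```python
# A toy single-head scaled dot-product self-attention in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # each row is a distribution over positions
    return weights @ V                        # (seq_len, d_k) weighted mix of values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 5
X = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
W_q, W_k, W_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
print(self_attention(X, W_q, W_k, W_v).shape) # (5, 8)
```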
Week 2
  • Teacher forcing and masked attention (a toy causal mask is sketched below)
  • Zooming into the decoder layer and the decoder stack
  • Position encoding
  • Batch and layer normalization
Slides
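A small illustration of the masked (causal) attention idea from this week: future positions are given a score of negative infinity before the softmax, so each token can only attend to itself and earlier tokens. This is a toy sketch, not code from the lecture.

```python
# Causal (look-ahead) mask for masked self-attention.
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks "future" positions that must be hidden.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

# With uniform raw scores, row t spreads its attention evenly over positions 0..t only.
print(masked_softmax(np.zeros((4, 4))).round(2))
```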
Week 3
  • What is a language model?
  • Decoder-only LLMs: A deep dive into GPT to understand the architecture and training objectives (Causal language model).
  • The distinction between pre-training and fine-tuning.
  • Understanding decoding strategies: greedy, beam search, top-k, and top-p sampling (see the sketch after this list).
Slides
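To make the decoding strategies above concrete, here is a hedged sketch of greedy, top-k, and top-p (nucleus) selection over a single toy next-token distribution; a real decoder applies one of these at every generation step, and beam search additionally keeps several partial sequences, which is omitted here.

```python
# Greedy, top-k, and top-p selection over one toy next-token distribution.
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    return int(np.argmax(probs))                      # always the most likely token

def top_k_sample(probs, k=3):
    idx = np.argsort(probs)[-k:]                      # keep the k most likely tokens
    p = probs[idx] / probs[idx].sum()                 # renormalise over the kept set
    return int(rng.choice(idx, p=p))

def top_p_sample(probs, p=0.9):
    order = np.argsort(probs)[::-1]                   # tokens sorted by probability
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]  # smallest nucleus covering mass p
    q = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=q))

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])         # toy next-token distribution
print(greedy(probs), top_k_sample(probs), top_p_sample(probs))
```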
Week 4
  • Encoder-only LLMs: A deep dive into BERT to understand the architecture and training objectives (Masked Language Model)
  • Adapting to downstream tasks.
Slides
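As a small illustration of the masked language model objective, the sketch below applies the BERT-style corruption rule (roughly 15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged) to a toy sentence. The sentence and vocabulary are made up for illustration.

```python
# BERT-style input corruption for masked language modelling.
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                       # model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep unchanged
        else:
            labels.append(None)                      # not part of the MLM loss
            inputs.append(tok)
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```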
Week 5
  • Understanding tokenization
  • Challenges and the motivation for sub-word tokenization.
  • Byte Pair Encoding and WordPiece (a toy BPE merge loop is sketched below).
  • SentencePiece.
Slides
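The sketch referenced above: a toy Byte Pair Encoding merge loop on a small word-frequency table, where each round counts adjacent symbol pairs and merges the most frequent pair into a new symbol. It follows the textbook toy example, not any production tokenizer.

```python
# Toy Byte Pair Encoding: repeatedly merge the most frequent adjacent symbol pair.
import re
from collections import Counter

def pair_counts(vocab):
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, vocab):
    # Replace the pair only when it appears as two whole, space-separated symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with </w> marking the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for step in range(5):
    best_pair = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best_pair, vocab)
    print(step, best_pair)
```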
Week 6
  • Encoder-decoder models: A brief intro to BART.
  • Text-to-text framework: GPT-2 (zero-shot learning) and a deep dive into the T5 framework (a short text-to-text example follows below).
Slides
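To show the text-to-text framing in practice, here is a short sketch using the Hugging Face transformers library (an external dependency, not part of the course material). The t5-small checkpoint and the task prefixes follow the T5 paper's convention; running it downloads the model weights.

```python
# T5's text-to-text framing: every task is "prefix: input text" -> generated text.
# Requires: pip install transformers sentencepiece torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same model handles different tasks purely through the text prefix.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: The transformer architecture relies entirely on attention ...",
]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```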
The road ahead: Zooming into the differences
  • Data: What are the different data sources? What are the components of a good data pipeline?
  • Model/Architecture: What are the different types of attention mechanisms? How do you increase the scale of a model (narrow vs. deep)? What are the different types of positional embeddings?
  • Training: What are the different types of objective functions? Are there any specific choices for optimisers?

Schedule for Part-2 (Jan-Feb '24)


Lecture # | Contents | Lecture Slides | Lecture Videos | Extra Reading Material
Week 7 Data
  • The bigger picture: A taxonomy of all models (encoder-only, decoder-only, encoder-decoder)
  • Where do they differ? Data (sources and pipelines), model (scale, attention, PEs, vocabulary), and training (optimiser, objective functions)
  • Data Sources: C4, mC4, Pile, RedPajama, SlimPajama, Stack, Sangraha, …
  • Data Pipelines: Gopher, OPT, Llama, Falcon, SETU
  • Studies on the effectiveness of clean data (recent papers)
Slides-1
Slides-2
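As a flavour of what a cleaning pipeline does, here is a toy heuristic document filter with thresholds loosely inspired by Gopher-style quality rules; the cut-off values are illustrative guesses, not the values used in any of the pipelines listed above.

```python
# Toy document-quality filter: cheap heuristics of the kind used in data pipelines.
def keep_document(text,
                  min_words=50, max_words=100_000,
                  max_symbol_ratio=0.1, min_mean_word_len=3, max_mean_word_len=10):
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False                                    # too short or too long
    mean_len = sum(len(w) for w in words) / len(words)
    if not (min_mean_word_len <= mean_len <= max_mean_word_len):
        return False                                    # likely boilerplate or garbage
    symbols = sum(text.count(c) for c in "#{}<>|")
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False                                    # probably markup or code debris
    return True

docs = ["word " * 10, "This is a reasonably long natural-language paragraph. " * 20]
print([keep_document(d) for d in docs])                 # [False, True]
```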
Week 8 Fast attention and Fast Inference
  • Time and space complexity of attention mechanism
  • Sub-quadratic attention: local, dilated, random, and block-sparse attention; low-rank approximations; kernel methods
  • Hardware-aware full attention: FlashAttention
  • Fast inference: KV caching, MQA, GQA, PagedAttention (see the KV-cache sketch below)
Slides
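A minimal sketch of the KV-caching idea mentioned above: during autoregressive decoding, the keys and values of past tokens are stored, so each new step only projects the newest token and attends over the cache. Shapes and random weights are illustrative.

```python
# KV caching during autoregressive decoding (single head, NumPy).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
W_q, W_k, W_v = [rng.normal(size=(d_model, d_k)) for _ in range(3)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: (d_model,) embedding of the newest token only."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)                 # O(1) new projection work per step
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    weights = softmax(q @ K.T / np.sqrt(d_k))   # attend over all cached positions
    return weights @ V

for _ in range(4):
    out = decode_step(rng.normal(size=d_model))
print(out.shape, len(k_cache))                  # (8,) 4
```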
Week 9 Position encoding and Length Generalization
  • Positional embeddings: the importance of long sequences, drawbacks of sinusoidal PEs, ALiBi, RoPE, NoPE
  • Pre-normalization vs post-normalization
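As a taste of the positional-embedding alternatives listed above, here is a rough sketch of the ALiBi bias: instead of adding position embeddings, each attention head subtracts a head-specific linear penalty proportional to how far back the key is. The slope formula follows the ALiBi paper's recipe for a power-of-two number of heads; everything else is illustrative.

```python
# ALiBi (Attention with Linear Biases): scores get -slope * (query_pos - key_pos).
import numpy as np

def alibi_bias(seq_len, num_heads):
    # Geometric slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads (power-of-two heads assumed).
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]             # how far back the key is (i - j)
    bias = -slopes[:, None, None] * np.maximum(dist, 0)
    return bias                                    # (num_heads, seq_len, seq_len);
                                                   # future positions are handled by the causal mask

print(alibi_bias(seq_len=5, num_heads=8)[0].round(2))   # head 0, slope 1/2
```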
Week 10 Training (Part 1)
  • Recap of optimisers, the Lion optimiser
  • Learning rate schedules, gradient clipping, and typical failures during training (a schedule-and-clipping sketch follows below)
  • Scaling laws
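To make the training-recipe topics a bit more concrete, the sketch below implements a linear-warmup-then-cosine-decay learning-rate schedule and global-norm gradient clipping. All constants (peak LR, warmup steps, clip norm) are illustrative choices, not recommendations from any specific paper.

```python
# Warmup-then-cosine learning-rate schedule and global-norm gradient clipping.
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=2000, total=100_000):
    if step < warmup:
        return max_lr * step / warmup                      # linear warmup from 0 to max_lr
    progress = (step - warmup) / (total - warmup)          # 0 -> 1 over the rest of training
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

def clip_gradients(grads, max_norm=1.0):
    norm = math.sqrt(sum(g * g for g in grads))            # global L2 norm of all gradients
    scale = min(1.0, max_norm / (norm + 1e-6))             # shrink only if norm exceeds max_norm
    return [g * scale for g in grads]

print(lr_at(0), lr_at(2000), lr_at(100_000))               # 0, peak LR, final (min) LR
print(clip_gradients([3.0, 4.0]))                          # norm 5 rescaled to ~1
```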
Week 11 Training (Part 2)
  • Mixed-precision training, activation checkpointing, CPU offloading (a mixed-precision sketch follows below)
  • 3D parallelism, ZeRO
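A hedged sketch of one mixed-precision training step using PyTorch's AMP utilities (torch.cuda.amp); it assumes PyTorch and a CUDA device are available, and the tiny linear layer is just a stand-in for a transformer block.

```python
# One mixed-precision training step with loss scaling and gradient clipping.
import torch
import torch.nn as nn

device = "cuda"                                    # AMP with GradScaler assumes a CUDA device
model = nn.Linear(512, 512).to(device)             # stand-in for a transformer block
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()               # rescales the loss to avoid fp16 underflow

x = torch.randn(8, 512, device=device)
target = torch.randn(8, 512, device=device)

with torch.cuda.amp.autocast():                    # forward pass runs in reduced precision
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                      # backward pass on the scaled loss
scaler.unscale_(opt)                               # unscale gradients before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(opt)                                   # skips the update if gradients overflowed
scaler.update()
opt.zero_grad(set_to_none=True)
```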
Week 12 Bringing it all together
  • Coming back to the big picture: An overview of all recent LLMs highlighting the differences and similarities
The road ahead: Fine-tuning and evaluating LLMs

Schedule for Part-3 (Mar'24)


We are still working on the schedule for Part 3, but here is a tentative list of topics:
  • Fine-tuning LLMs: Prompt tuning, multi-task fine-tuning, and parameter-efficient fine-tuning (a minimal low-rank adapter sketch follows below)
  • Evaluating LLMs: Benchmarks, evaluation frameworks, and popular leaderboards
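As a preview of the parameter-efficient fine-tuning topic, here is a minimal sketch of the low-rank adapter idea popularised by LoRA: the pretrained weight stays frozen and only a small low-rank update B @ A is trained. Dimensions, rank, and initialisation scale are illustrative.

```python
# Low-rank adapter (LoRA-style) on a single frozen weight matrix.
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # model dimension and adapter rank (r << d)
W = rng.normal(size=(d, d))          # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable, r * d parameters
B = np.zeros((d, r))                 # trainable, zero-initialised so W is unchanged at start

def adapted_forward(x):
    # Only A and B would receive gradient updates during fine-tuning.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
print(adapted_forward(x).shape)                    # (1, 512)
print(2 * d * r, "trainable params vs", d * d)     # 8192 vs 262144 in the full weight
```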