RLHF / post-training
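The idea: Pretraining → supervised fine-tuning (SFT) → RLHF. In the RLHF stage, people rank model outputs, those rankings train a reward model, and the reward model's scores shift the policy toward preferred answers.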
What you'll be able to do: Explain how RLHF works: human preferences train a reward model, and that reward model shapes the assistant's behavior.
The problem it solves: A raw pretrained model only predicts the next token, so it isn't reliably helpful or aligned with what people want.
Builds on: Gradient descent: rolling downhill
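To make the reward-model step concrete, here is a minimal sketch in plain Python. It assumes a pairwise (Bradley-Terry style) preference loss, which is one common way to learn from rankings, and a linear reward over two made-up features. The feature names, toy data, learning rate, and function names are all hypothetical, chosen only for illustration. The training loop is ordinary gradient descent.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    # Linear reward model: a weighted sum of the output's features.
    return sum(wi * xi for wi, xi in zip(w, features))

def preference_loss(w, chosen, rejected):
    # Pairwise loss: -log sigmoid(r(chosen) - r(rejected)).
    # It is small when the chosen output scores well above the rejected one.
    margin = reward(w, chosen) - reward(w, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, dim, lr=0.5, steps=200):
    # Plain gradient descent on the average pairwise loss.
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Derivative of -log sigmoid(margin) with respect to margin.
            coeff = -(1.0 - sigmoid(margin))
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

# Hypothetical toy data: each output is described by
# [helpfulness, rudeness]; the first item in each pair was ranked higher.
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.3]),
]

w = train_reward_model(pairs, dim=2)
initial_loss = sum(preference_loss([0.0, 0.0], c, r) for c, r in pairs) / len(pairs)
final_loss = sum(preference_loss(w, c, r) for c, r in pairs) / len(pairs)
```

After training on this toy data, the learned weights reward the first feature and penalize the second, so every human-preferred output scores higher than its rejected counterpart. The sketch stops at the reward model; the later step, where the policy is updated to produce higher-scoring outputs, is not shown here.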