Training-data curation
The idea: Training data passes through a funnel: raw web → deduplication → quality filtering → decontamination.
What you'll be able to do: Explain how training data is cleaned and why quality beats quantity.
The problem it solves: What actually goes into the training set?
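The funnel can be sketched in a few lines of Python. This is a minimal illustration, not how any particular lab does it: the function names, the thresholds (`min_words`, `min_alpha_ratio`), and the 5-gram overlap rule are all assumptions chosen for readability. Production pipelines typically use fuzzy deduplication and learned quality classifiers rather than the exact hashing and simple heuristics shown here.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedup(docs):
    """Exact dedup on normalized text; keeps the first copy of each document."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def looks_high_quality(doc, min_words=5, min_alpha_ratio=0.6):
    """Hypothetical heuristic: enough words, and mostly alphabetic characters."""
    if len(doc.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in doc)
    return alpha / len(doc) >= min_alpha_ratio


def ngrams(text, n):
    """Set of word n-grams from the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(docs, benchmark_texts, n=5):
    """Drop any document sharing at least one n-gram with a benchmark item."""
    bench = set()
    for text in benchmark_texts:
        bench |= ngrams(text, n)
    return [doc for doc in docs if not (ngrams(doc, n) & bench)]


def run_funnel(raw_docs, benchmark_texts):
    """raw web -> dedup -> quality filter -> decontaminate."""
    stage = dedup(raw_docs)
    stage = [doc for doc in stage if looks_high_quality(doc)]
    return decontaminate(stage, benchmark_texts)
```

Each stage only removes documents, so the set shrinks as it moves down the funnel. Running cheap exact dedup first means the later stages see fewer documents.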