Training-data curation
Drive a funnel: raw web → dedup → quality filter → decontaminate.
1. The web is mostly junk. Drag the cleaning slider and watch how few documents survive.
cat.jpg (an original)
cat.jpg (← copy of it)
BUY NOW!!! (still here)
[TEST] (still here)
raw web crawl: 24 of 24 kept
raw web crawl: everything, all mixed
Keep dragging. Each filter drops more, until barely a quarter is left.
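The funnel can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's actual implementation: the toy corpus, the `BENCHMARK` list, the thresholds in `looks_clean`, and every function name are assumptions made up for the example. It uses exact-match hashing for dedup, a crude upper-case and exclamation-mark heuristic as the quality filter, and 5-word overlap against held-out test text for decontamination.

```python
import hashlib

# Hypothetical toy corpus; the strings loosely echo the lesson's sample items.
RAW = [
    "A photo of a cat sleeping on a windowsill in the afternoon sun.",
    "A photo of a cat sleeping on a windowsill in the afternoon sun.",  # exact copy
    "BUY NOW!!! CLICK HERE!!! BEST DEALS!!!",
    "[TEST] What is the capital of France? Answer: Paris",
    "Gradient descent updates parameters in the direction that lowers the loss.",
]

# Hypothetical held-out benchmark text that must not leak into training.
BENCHMARK = ["What is the capital of France? Answer: Paris"]


def dedup(docs):
    """Drop exact duplicates by hashing case- and whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def looks_clean(doc, max_upper=0.5, max_bang=0.05):
    """Crude heuristic: reject text that is mostly upper-case or heavy on '!'."""
    letters = [c for c in doc if c.isalpha()]
    if not letters:
        return False
    upper_ratio = sum(c.isupper() for c in letters) / len(letters)
    bang_ratio = doc.count("!") / len(doc)
    return upper_ratio <= max_upper and bang_ratio <= max_bang


def ngrams(text, n=5):
    """Return the set of n-word sequences in the lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(docs, benchmark, n=5):
    """Drop any document that shares an n-word sequence with benchmark text."""
    banned = set()
    for item in benchmark:
        banned |= ngrams(item, n)
    return [d for d in docs if not (ngrams(d, n) & banned)]


def funnel(docs, benchmark):
    """Run the stages in order and return (stage name, surviving docs) pairs."""
    stages = [("raw web crawl", list(docs))]
    stages.append(("dedup", dedup(stages[-1][1])))
    stages.append(("quality filter", [d for d in stages[-1][1] if looks_clean(d)]))
    stages.append(("decontaminate", decontaminate(stages[-1][1], benchmark)))
    return stages


if __name__ == "__main__":
    for name, docs in funnel(RAW, BENCHMARK):
        print(f"{name}: {len(docs)} of {len(RAW)} kept")
```

Each stage only sees what the previous one kept, so the counts can only shrink. Real pipelines typically swap in near-duplicate detection and learned quality classifiers, but the shape of the funnel stays the same.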
You've reached the end
That's the whole arc, from a word, to attention, to agents, down to the GPUs. Now show someone what you can do.
Common questions
What is "Training-data curation" about?
It walks you through a data-cleaning funnel: raw web → dedup → quality filter → decontaminate.
What problem does it solve?
It answers the question of what actually goes into the training set.
What will I be able to do after this lesson?
You can explain how training data is cleaned, and why quality beats quantity.