See How AI Works
Lesson E.5 (elective)

Training-data curation

Drive a funnel: raw web → dedup → quality filter → decontaminate.

1. The web is mostly junk. Drag the cleaning slider and watch how few survive.
cat.jpg (an original)
cat.jpg (a copy of it)
BUY NOW!!! (still here)
[TEST] (still here)

raw web crawl: 24 of 24 kept (everything, all mixed)

Keep dragging. Each filter drops more, until barely a quarter is left.
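
The funnel above (raw web → dedup → quality filter → decontaminate) can be sketched in a few lines of Python. This is a minimal illustration, not how any production pipeline works: the toy crawl, the benchmark snippet list, the thresholds, and every function name here are assumptions made up for the example. Real pipelines typically use near-duplicate detection and learned quality classifiers rather than exact hashing and hand-written rules.

```python
import hashlib

# Hypothetical toy crawl; the items echo the examples in the demo above.
RAW_CRAWL = [
    "a photo of a cat sitting on a windowsill in the morning sun",
    "a photo of a cat sitting on a windowsill in the morning sun",  # exact copy
    "BUY NOW!!! CLICK HERE!!! BEST DEAL!!!",
    "[TEST] what is the capital of France? answer: Paris",
    "gradient descent updates weights in the direction that lowers the loss",
]

# Hypothetical benchmark snippets that must not leak into the training set.
BENCHMARK_SNIPPETS = ["what is the capital of france"]


def dedup(docs):
    """Drop exact duplicates by hashing case- and whitespace-normalized text."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def looks_like_quality(doc, max_upper_ratio=0.5, max_bangs=2, min_words=5):
    """Crude heuristic (assumed thresholds): reject shouting, '!' spam, tiny docs."""
    letters = [c for c in doc if c.isalpha()]
    if not letters:
        return False
    upper_ratio = sum(c.isupper() for c in letters) / len(letters)
    return (
        upper_ratio <= max_upper_ratio
        and doc.count("!") <= max_bangs
        and len(doc.split()) >= min_words
    )


def quality_filter(docs):
    """Keep only documents that pass the heuristic above."""
    return [doc for doc in docs if looks_like_quality(doc)]


def decontaminate(docs, benchmark_snippets):
    """Drop documents containing any benchmark snippet (case-insensitive)."""
    return [
        doc for doc in docs
        if not any(snippet in doc.lower() for snippet in benchmark_snippets)
    ]


def run_funnel(docs, benchmark_snippets):
    """Run the three stages in order and record what survives each one."""
    stages = [("raw web crawl", docs)]
    stages.append(("dedup", dedup(stages[-1][1])))
    stages.append(("quality filter", quality_filter(stages[-1][1])))
    stages.append(("decontaminate", decontaminate(stages[-1][1], benchmark_snippets)))
    return stages


if __name__ == "__main__":
    total = len(RAW_CRAWL)
    for name, survivors in run_funnel(RAW_CRAWL, BENCHMARK_SNIPPETS):
        print(f"{name}: {len(survivors)} of {total} kept")
```

On this five-item toy crawl each stage removes exactly one item: the copied cat caption, the all-caps ad, and the document that quotes a benchmark question. The order matters for cost more than correctness: deduplicating first means the later, more expensive checks run on fewer documents.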



Common questions

What is "Training-data curation" about?
Drive a funnel: raw web → dedup → quality filter → decontaminate.
What problem does it solve?
It answers the question of what actually goes into the training set.
What will I be able to do after this lesson?
You can explain how training data is cleaned, and why quality beats quantity.