Beyond text: images become tokens too
Images and audio become vectors in the same space as text tokens.
1. The model only does math on vectors. So how can it see a photo? Guess first.
[Interactive figure: a photo divided into 9 patches. Tap what you think each patch becomes.]
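To make the patch idea concrete, here is a minimal sketch in plain Python. It assumes a toy 6x6 grayscale "photo", 2x2 patches (so 9 patches in total), and an 8-dimensional vector space. All names (`split_into_patches`, `project`, `patch_weights`, `word_vectors`) and all sizes are hypothetical, and the weights are random stand-ins for values a real model would learn. It only illustrates the general pattern of cutting an image into patches and linearly projecting each one to the same length as a word vector, not how any particular model does it.

```python
import math
import random


def split_into_patches(image, patch_size):
    """Cut a square grayscale image (a list of rows) into flattened patches,
    scanning left to right, top to bottom."""
    n = len(image)
    patches = []
    for top in range(0, n, patch_size):
        for left in range(0, n, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size)
                     for c in range(patch_size)]
            patches.append(patch)
    return patches


def project(vector, weights):
    """Linear projection: one output number per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
EMBED_DIM = 8           # assumed toy size of the shared vector space
PATCH_SIZE = 2          # assumed toy patch size

# Toy 6x6 "photo": one brightness value per pixel.
image = [[rng.random() for _ in range(6)] for _ in range(6)]

# Hypothetical projection weights; a trained model would learn these.
patch_weights = [[rng.uniform(-1, 1) for _ in range(PATCH_SIZE * PATCH_SIZE)]
                 for _ in range(EMBED_DIM)]

# Hypothetical word vectors of the same length as the patch vectors.
word_vectors = {
    "cat": [rng.uniform(-1, 1) for _ in range(EMBED_DIM)],
    "dog": [rng.uniform(-1, 1) for _ in range(EMBED_DIM)],
}

patches = split_into_patches(image, PATCH_SIZE)
patch_vectors = [project(p, patch_weights) for p in patches]

print(len(patch_vectors), len(patch_vectors[0]))  # 9 8

# Because a patch vector and a word vector have the same length,
# the same math (here, cosine similarity) applies to both.
similarity = cosine(patch_vectors[0], word_vectors["cat"])
```

With random weights the similarity value carries no meaning. The point is only that once a patch is a vector of the same length as a word vector, the model can compare and combine the two with identical arithmetic.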
Now: can we predict the next word?
2.1 Predict the next word
Common questions
What is "Beyond text: images become tokens too" about?
Images and audio become vectors in the same space as text tokens.
What problem does it solve?
If a model only does math on token vectors, how can it 'see' an image or 'hear' audio?
What will I be able to do after this lesson?
You can explain how multimodal models turn images and audio into vectors in the same space as text, so the model compares them the same way it compares words.
What comes next?
Now: can we predict the next word?