See How AI Works
← all lessons1.5●●○○○

Beyond text: images become tokens too

Images and audio become vectors in the same space as text tokens.

1. The model only does math on vectors. So how can it see a photo? Guess first.

The photo is cut into 9 patches. What do you think becomes of each patch?
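
To make the idea concrete, here is a minimal sketch of one way a photo can be turned into vectors. It cuts a toy 6x6 grayscale image into a 3x3 grid of 9 patches, flattens each patch, and multiplies it by a projection matrix so it ends up with the same length as a text token vector. Everything here is an assumption for illustration: the function names, the image size, the embedding size of 8, and the random projection weights (a real model would learn its weights during training).

```python
import random


def split_into_patches(image, grid=3):
    """Cut a square 2D image (list of rows) into grid*grid flattened patches."""
    size = len(image)
    if size % grid != 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square with a side divisible by grid")
    step = size // grid
    patches = []
    for pr in range(grid):          # patch row
        for pc in range(grid):      # patch column
            patch = [
                image[r][c]
                for r in range(pr * step, (pr + 1) * step)
                for c in range(pc * step, (pc + 1) * step)
            ]
            patches.append(patch)
    return patches


def make_projection(in_dim, out_dim, seed=0):
    """Hypothetical stand-in for learned weights: a random out_dim x in_dim matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(patch, weights):
    """Matrix-vector product: maps a flattened patch to an embedding vector."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


EMBED_DIM = 8  # assumed vector length shared by text tokens and image patches

# Toy 6x6 grayscale "photo" with pixel values between 0 and 1.
photo = [[(r * 6 + c) / 35 for c in range(6)] for r in range(6)]

patches = split_into_patches(photo, grid=3)        # 9 patches, 4 pixels each
weights = make_projection(in_dim=4, out_dim=EMBED_DIM)
image_tokens = [project(p, weights) for p in patches]

# Placeholder text token vectors of the same length (all zeros, for shape only).
text_tokens = [[0.0] * EMBED_DIM for _ in range(2)]

# Because every vector has the same length, patches and words fit in one sequence.
sequence = image_tokens + text_tokens
print(len(patches), len(image_tokens[0]), len(sequence))  # prints: 9 8 11
```

The point of the sketch is the shapes: once each patch is a vector of the same length as a word vector, the model can run the same math on both.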

Now: can we predict the next word?

2.1 Predict the next word

Common questions

What is "Beyond text: images become tokens too" about?
Images and audio become vectors in the same space as text tokens.
What problem does it solve?
If a model only does math on token vectors, how can it 'see' an image or 'hear' audio?
What will I be able to do after this lesson?
You can explain how multimodal models turn images and audio into vectors in the same space as text, so the model compares them the same way it compares words.
What comes next?
Lesson 2.1, "Predict the next word," which asks: can we predict the next word?