Beyond text: images become tokens too
The idea: Images and audio are cut into patches or frames, and each one is turned into a vector in the same space as text tokens. The model can then compare words and pixels with the same dot-product similarity and predict the next token.
What you'll be able to do: Explain how multimodal models turn images and audio into vectors in the same space as text, so the model compares them the same way it compares words.
The problem it solves: If a model only does math on token vectors, how can it 'see' an image or 'hear' audio?
Builds on: Embeddings: meaning as coordinates
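To make the idea concrete, here is a minimal sketch of the patch-to-vector step in plain Python. Everything in it is an illustrative assumption rather than any particular model's method: the 8x8 grayscale image, the 4x4 patch size, the 8-dimensional embedding space, and the names image_to_patches, project, and dot are all hypothetical. The projection weights are random here, whereas a real model learns them during training. Audio would follow the same pattern, with short frames of samples taking the place of patches.

```python
import random


def image_to_patches(image, patch):
    """Cut a 2D grayscale image (a list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tile = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(tile)
    return patches


def project(vector, weights):
    """Linear projection: each row of weights produces one output coordinate."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def dot(a, b):
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
EMBED_DIM = 8  # assumed size of the shared embedding space
PATCH = 4      # assumed patch side length in pixels

# A toy 8x8 "image" of random brightness values.
image = [[rng.random() for _ in range(8)] for _ in range(8)]

# Random stand-in for the learned projection (EMBED_DIM rows, PATCH*PATCH columns).
projection = [[rng.gauss(0, 0.1) for _ in range(PATCH * PATCH)]
              for _ in range(EMBED_DIM)]

# Random stand-in for the embedding of one text token.
text_token = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]

patches = image_to_patches(image, PATCH)                  # 4 patches of 16 pixels
image_tokens = [project(p, projection) for p in patches]  # 4 vectors of length 8
scores = [dot(text_token, t) for t in image_tokens]       # one similarity per patch

print(len(patches), len(image_tokens[0]), len(scores))
```

Because each image patch ends up as a vector of the same length as a text token vector, the same dot product that compares two words can compare a word with a patch. With untrained random weights the scores carry no meaning; the sketch only shows that the shapes line up.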