Question 1

What is "Beyond text: images become tokens too" about?

Accepted Answer

Images and audio are cut into patches or frames, and each piece is turned into a vector in the same space as text tokens. The model can then compare words and image patches with the same dot-product similarity it uses for text, and predict the next token.
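
To make that concrete, here is a minimal sketch of the idea, not how any particular model implements it. It cuts a toy 4x4 grayscale image into 2x2 patches, projects each flattened patch into an 8-dimensional space, and scores each patch against one text-token vector with a dot product. The names `patchify`, `project`, `weights`, and `text_token` are made up for illustration, and the sizes are arbitrary. The projection weights and the text vector are random here; in a real model both would be learned.

```python
import random

def patchify(image, patch):
    """Split a 2-D grid of pixel values into flattened patch x patch tiles."""
    rows, cols = len(image), len(image[0])
    tiles = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            tiles.append([image[r + i][c + j]
                          for i in range(patch) for j in range(patch)])
    return tiles

def project(vec, weights):
    """Linear projection: one output value per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def dot(a, b):
    """Dot-product similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
dim = 8      # assumed size of the shared embedding space
patch = 2    # assumed patch side length, in pixels

# Toy 4x4 grayscale image with values between 0 and 1.
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]

# Hypothetical projection weights and text-token vector (random stand-ins
# for values a real model would learn during training).
weights = [[rng.gauss(0, 1) for _ in range(patch * patch)] for _ in range(dim)]
text_token = [rng.gauss(0, 1) for _ in range(dim)]

# Each patch becomes one "image token" living in the same space as text.
image_tokens = [project(p, weights) for p in patchify(image, patch)]

# The same dot product used between words now compares a word with a patch.
scores = [dot(text_token, tok) for tok in image_tokens]
print(len(image_tokens), "image tokens of size", dim)
```

Audio would follow the same pattern under these assumptions: swap the 2-D patches for short 1-D frames of samples and keep the projection and dot-product steps unchanged.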

Question 2

What problem does it solve?

Accepted Answer

It answers a puzzle: if a model only does math on token vectors, how can it 'see' an image or 'hear' audio?

Question 3

What will I be able to do after this lesson?

Accepted Answer

You will be able to explain how multimodal models turn images and audio into vectors in the same space as text, so that the model compares them the same way it compares words.

Question 4

What comes next?

Accepted Answer

Now: can we predict the next word?