Question 1

What is "Beyond text: images become tokens too" about?

Accepted Answer

This lesson covers how images and audio become vectors in the same space as text tokens.

Question 2

What problem does it solve?

Accepted Answer

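A model only does math on token vectors, so it is not obvious how it could 'see' an image or 'hear' audio. This lesson shows how those inputs are turned into vectors the model can work with.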

Question 3

What will I be able to do after this lesson?

Accepted Answer

You will be able to explain how multimodal models turn images and audio into vectors in the same space as text, so the model can compare them the same way it compares words.
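
To make the idea concrete, here is a minimal sketch in Python, not how any particular model does it. It assumes a tiny 4x4 grayscale image, a patch size of 2, a shared embedding size of 4, and random numbers standing in for the weights a real model would learn. The names EMBED_DIM, PATCH, patch_weights, and text_embeddings, and the words "cat" and "dog", are hypothetical and chosen only for illustration.

```python
import math
import random

EMBED_DIM = 4  # hypothetical size of the shared vector space
PATCH = 2      # hypothetical patch width and height, in pixels


def image_to_patches(image, patch):
    """Split a 2D grid of pixel values into flattened patch x patch squares."""
    patches = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            patches.append([image[r + dr][c + dc]
                            for dr in range(patch)
                            for dc in range(patch)])
    return patches


def project(vector, weights):
    """Linear projection: one output value per row of weights."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


rng = random.Random(0)

# Random stand-ins for parameters a real model would learn during training.
patch_weights = [[rng.uniform(-1, 1) for _ in range(PATCH * PATCH)]
                 for _ in range(EMBED_DIM)]
text_embeddings = {word: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for word in ("cat", "dog")}

image = [
    [0.0, 0.1, 0.9, 1.0],
    [0.2, 0.3, 0.8, 0.7],
    [0.5, 0.5, 0.4, 0.6],
    [0.1, 0.0, 0.3, 0.2],
]

# Each patch becomes one vector of length EMBED_DIM: an "image token".
image_tokens = [project(p, patch_weights)
                for p in image_to_patches(image, PATCH)]

# Image tokens and text tokens have the same length, so one function compares both.
similarity = cosine(image_tokens[0], text_embeddings["cat"])
print(len(image_tokens), "image tokens of length", len(image_tokens[0]))
print("similarity between patch 0 and 'cat':", similarity)
```

Because the weights here are random, the similarity value carries no meaning. The point is only that an image patch and a word end up as vectors of the same length, so the same math applies to both.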

Question 4

What comes next?

Accepted Answer

The next lesson asks: can we predict the next word?

Beyond text: images become tokens too

Common questions