How do images, audio, and documents become something a model can reason about?

Question

Accepted Answer

Multimodal models turn every input (text, an image, audio, or a screenshot) into vectors in one shared space, then reason over all of them together. An image is cut into patches, and each patch becomes a vector of the same kind a word becomes. Because these vectors live in the same space, the model can compare a picture with a caption, answer a question about a chart, or describe what is on a screen. It is the same machinery used for text, pointed at more kinds of input.
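To make the patch idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not how any particular model is implemented: the patch size, the embedding width, the tiny vocabulary, and the helper names (`patchify`, `make_matrix`, `project`) are all assumptions made up for this example, and the weights are random rather than learned.

```python
import random

PATCH = 2  # assumed patch side length, in pixels
DIM = 4    # assumed width of the shared vector space


def patchify(image, patch):
    """Split a 2D grayscale image (a list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return patches


def make_matrix(rows, cols, rng):
    """Random weights standing in for parameters a real model would learn."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Multiply a flat vector by a (len(vec) x DIM) matrix."""
    return [
        sum(value * matrix[i][j] for i, value in enumerate(vec))
        for j in range(len(matrix[0]))
    ]


rng = random.Random(0)

# A 4x4 "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

# Image side: each flattened patch is projected into DIM dimensions.
patch_projection = make_matrix(PATCH * PATCH, DIM, rng)
image_vectors = [project(p, patch_projection) for p in patchify(image, PATCH)]

# Text side: each word looks up a DIM-dimensional row in an embedding table.
vocab = {"a": 0, "cat": 1, "photo": 2}
word_table = make_matrix(len(vocab), DIM, rng)
text_vectors = [word_table[vocab[word]] for word in ["a", "cat"]]

# Both kinds of vector have the same width, so they can sit in one sequence.
sequence = image_vectors + text_vectors
print(len(image_vectors), len(text_vectors), len(sequence))
```

The point of the sketch is the last step: once patches and words are vectors of the same width, they can be concatenated into a single sequence for the model to process together. In a real system the projection and the embedding table are trained so that related images and text end up near each other.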

Multimodal AI, explained

What people get wrong

Where you see it in real products
