Plain-language explainer

Multimodal AI, explained

How do images, audio, and documents become something a model can reason about?

Multimodal models turn every input, whether text, an image, audio, or a screenshot, into vectors in one shared space, then reason over all of them together. An image is cut into patches, and each patch becomes a vector of the same kind a word becomes. Because these vectors live in the same space, the model can compare a picture with a caption, answer a question about a chart, or describe what is on a screen. It is the same machinery used for text, pointed at more kinds of input.
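
To make the patch idea concrete, here is a minimal sketch in plain Python. It cuts a toy 4x4 grayscale image into 2x2 patches, maps each patch to a vector, and places those vectors in the same sequence as word vectors. Everything named here (`patchify`, `make_projection`, `EMBED_DIM`, the two-word vocabulary) is hypothetical and chosen for illustration, and the random weights only stand in for what a real model learns during training. Real models also add position information and use far larger vectors.

```python
import math
import random

EMBED_DIM = 8   # size of the shared vector space (illustrative assumption)
PATCH = 2       # patch side length in pixels (illustrative assumption)


def patchify(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return tiles


def make_projection(in_dim, out_dim, seed):
    """Random stand-in for a learned projection matrix (out_dim rows, in_dim columns)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(vector, matrix):
    """Multiply a matrix by a vector: one output number per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# A toy 4x4 image: dark and bright 2x2 squares in a checkerboard layout.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]

# Image side: each patch of 4 pixels becomes one EMBED_DIM-sized vector.
patches = patchify(image, PATCH)
image_projection = make_projection(PATCH * PATCH, EMBED_DIM, seed=0)
image_tokens = [project(p, image_projection) for p in patches]

# Text side: each word looks up one EMBED_DIM-sized vector in a table.
vocab = {"a": 0, "checkerboard": 1}
table_rng = random.Random(1)
embedding_table = [
    [table_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in vocab
]
text_tokens = [embedding_table[vocab[word]] for word in ["a", "checkerboard"]]

# Both kinds of vectors have the same size, so they can share one sequence.
sequence = image_tokens + text_tokens
print(len(sequence), "vectors, each of length", len(sequence[0]))
```

Because every vector has the same length, a function like `cosine` can compare any two of them, whether they came from a patch or a word. With random weights those comparisons are not meaningful; training is what makes a picture and its caption land close together.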

Do not just read it. Operate the mechanism yourself in a short interactive lesson.

See it work: Beyond text: images become tokens too →

Free, no code, no signup.

What people get wrong

  • "The model sees like an eye." In fact, it converts pixels into vectors and reasons over those, not over raw images.
  • "Vision is a separate bolt-on model." In fact, modern multimodal models map every input into one shared representation.
  • "It reads any image perfectly." In fact, fine print, dense charts, and odd layouts still trip it up.

Where you see it in real products

  • Assistants answer questions about photos, screenshots, and PDFs.
  • Document AI extracts data from scans and forms.
  • Voice agents and computer-use agents build on multimodal understanding.

Part of See How AI Works, a free interactive course, where you learn how modern AI works by operating it, not watching videos.