Plain-language explainer
Multimodal AI, explained
How do images, audio, and documents become something a model can reason about?
Multimodal models turn every input, whether text, an image, audio, or a screenshot, into vectors in one shared space, then reason over all of them together. An image is cut into patches, and each patch becomes a vector, the same kind of vector a word becomes. Because these vectors live in the same space, the model can compare a picture with a caption, answer a question about a chart, or describe what is on a screen. It is the same machinery used for text, pointed at more kinds of input.
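To make the "cut into patches, each patch becomes a vector" step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any particular model is implemented: the function names (patchify, project, embed_image), the tiny 4x4 grayscale image, and the random weights are all invented for this example. A real model learns its weights and works with far larger patches and vectors.

```python
import random

def patchify(image, patch):
    """Split an H x W grid of pixel values into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Read the tile row by row and flatten it into one list.
            tile = [image[top + r][left + c]
                    for r in range(patch) for c in range(patch)]
            patches.append(tile)
    return patches

def project(vector, weights):
    """Multiply a flat patch by a weight matrix to get an embedding vector."""
    return [sum(v * w for v, w in zip(vector, row)) for row in weights]

def embed_image(image, patch=2, dim=4, seed=0):
    """Turn an image into one vector per patch (hypothetical helper)."""
    rng = random.Random(seed)
    # Random stand-in weights (an assumption); a trained model learns these.
    weights = [[rng.uniform(-1, 1) for _ in range(patch * patch)]
               for _ in range(dim)]
    return [project(tile, weights) for tile in patchify(image, patch)]

# A toy 4x4 grayscale "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = embed_image(image)
print(len(tokens), len(tokens[0]))  # 4 patches, each a 4-dimensional vector
```

Each entry in tokens plays the same role a word vector plays for text, which is why the rest of the model can treat both kinds of input alike.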
Do not just read it. Operate the mechanism yourself in a short interactive lesson.
See it work: Beyond text: images become tokens too. Free, no code, no signup.
What people get wrong
- The model 'sees' like an eye. In fact, it converts pixels into vectors and reasons over those, not over raw images.
- Vision is a separate bolt-on model. In fact, modern multimodal models share one representation across inputs.
- It reads any image perfectly. In fact, fine print, dense charts, and odd layouts still trip it up.
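The shared representation mentioned in the list above is what lets a model compare a picture with a caption: once both are vectors in the same space, "how well do they match?" becomes a similarity score. The sketch below assumes hand-picked three-dimensional vectors purely for illustration. The numbers, the captions, and the cosine helper are all invented here, and real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: near 1.0 means same direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings (assumed values, not output from any real model).
image_vector = [0.9, 0.1, 0.0]
captions = {
    "a dog on grass": [0.8, 0.2, 0.1],
    "a bar chart": [0.0, 0.1, 0.9],
}

# Pick the caption whose vector points most nearly the same way as the image's.
best = max(captions, key=lambda text: cosine(image_vector, captions[text]))
print(best)  # a dog on grass
```

Nothing in cosine knows whether a vector came from pixels or from words, which is the point of the shared space.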
Where you see it in real products
- Assistants answer questions about photos, screenshots, and PDFs.
- Document AI extracts data from scans and forms.
- Voice agents and computer-use agents build on multimodal understanding.
Part of See How AI Works, a free interactive course, where you learn how modern AI works by operating it, not watching videos.