See How AI Works — An Interactive Course on How Modern LLMs Actually Work

A free, no-code, fully interactive course on how modern LLMs work — from 'a word is a vector' to transformers, attention, RAG, agents, and the hardware underneath.

The Big Picture — See it work first

See it think — It only ever predicts the next word, from a ranked list of guesses.

Representation — A word is a vector

Words have no math — To do math on meaning, turn each word into a position: a vector.
Embeddings: meaning as coordinates — More dimensions = more nuance; an embedding is a learned coordinate list.
Dot-product similarity — Dot product / cosine as an alignment score: multiply-and-add.
Below the vector: tokens — Before meaning, text is split into tokens — common chunks (often sub-words) drawn from a fixed vocabulary an algorithm like BPE learned. The model only ever sees token ids, not characters.
Beyond text: images become tokens too — Images and audio are cut into patches or frames, each turned into a vector in the same space as text tokens — so the model can compare words and pixels with the same dot-product similarity and predict the next token.
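
To make the "multiply-and-add" idea from this section concrete, here is a minimal sketch in Python. The three-dimensional embeddings are hypothetical, hand-picked numbers for illustration only; real embeddings are learned and have far more dimensions.

```python
import math

def dot(a, b):
    """Multiply matching coordinates and add the results."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product after scaling both vectors to length 1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical embeddings (assumption): not taken from any real model.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine(embeddings["cat"], embeddings["dog"])
cat_car = cosine(embeddings["cat"], embeddings["car"])
print(f"cat~dog: {cat_dog:.2f}  cat~car: {cat_car:.2f}")
```

With these made-up coordinates, "cat" and "dog" point in nearly the same direction, so their score is higher than the score for "cat" and "car".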

Prediction & Learning — How it gets good

Predict the next word — Language needs context.
The bigram model — Condition on the previous word; n-grams extend this to the previous several words.
Loss as a scoreboard — Loss = surprise; lower is better; it is the game's score.
Gradient descent: rolling downhill — Follow the slope of the loss downhill.
Where the map comes from — Words that fill the same blanks get pulled together. Predicting the next word, over and over, learns the map — no dictionary needed.
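
The bigram model and "loss = surprise" can be sketched together in a few lines of Python. The tiny corpus and the function names (`train_bigram`, `next_word_probs`, `surprise`) are assumptions made up for this illustration. This version just counts word pairs; it does not use gradient descent.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the counts after `prev` into a probability per candidate word."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}

def surprise(counts, prev, actual):
    """Loss for one prediction: -log of the probability given to the actual word."""
    return -math.log(next_word_probs(counts, prev)[actual])

# Toy corpus (assumption): real models train on vastly more text.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
print(surprise(model, "the", "cat"), surprise(model, "the", "mat"))
```

After "the", this corpus contains "cat" twice and "mat" once, so "cat" gets the higher probability and the lower surprise. A word never seen after "the" would raise a `KeyError` here; real models avoid that by assigning every token a non-zero probability.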

Architecture — Attention & the transformer block

The context wall — Tokens must look at other tokens.
Query, Key, Value — Each token emits a Query, a Key, and a Value; match Q to K by dot product.
Attention scores, softmax & the weighted sum — Q·K → scores → softmax → weighted sum of Values.
Causal masking — Mask the upper triangle to −∞ before softmax.
Multi-head attention — Several heads attend to different relationships in parallel.
Matrix × vector as a neural layer — A matrix multiply is a layer; it rotates / scales the vector.
Residuals & LayerNorm — Residual = keep a copy and add the change; LayerNorm = rescale to stay sane.
The transformer block, assembled & stacked — attention + FFN + residual/norm = a block; stack N blocks (nano-GPT).
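
The attention pipeline from this section (Q·K scores, causal mask, softmax, weighted sum of Values) can be sketched for a single head in plain Python. Two assumptions to flag: the scores are divided by the square root of the key dimension, which is the common scaled dot-product convention and is not stated above, and the toy vectors are invented. Real models compute Q, K, and V with learned matrices.

```python
import math

def softmax(scores):
    """Turn scores into non-negative weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention where token i may only look at tokens 0..i."""
    scale = math.sqrt(len(keys[0]))  # assumed scaling convention
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # masked: a future token
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / scale)
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input (assumption): the same vectors reused as Q, K, and V.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = causal_attention(Q, K, V)
for row in w:
    print([round(x, 2) for x in row])
```

Each printed row is one token's attention weights. The masked entries come out as exactly zero because softmax maps −∞ to a weight of 0, so the first token can only attend to itself.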

Operating & Scaling — Decoding, context, scaling laws

Decoding & sampling knobs — Greedy vs. sampling; temperature, top-k, top-p, penalties.
Context window & KV cache — Fixed context length; the KV cache reuses past keys/values for speed.
Positional information — Inject position (sinusoidal / RoPE intuition).
Scaling laws — Loss falls as a power law in params, data, compute (Chinchilla).
Mixture of Experts — A router sends each token to a few expert FFNs; sparsity decouples per-token compute from total parameter count.
Quantization — Store weights at lower precision; small accuracy cost, big memory win.
GPT-2 → GPT-3 → frontier — The same architecture scaled by orders of magnitude.
The model thinks before it answers — Reasoning models are trained to spend tokens thinking first — a private chain of thought — before the final answer, so the 'scratch paper' is just more generated tokens.
Test-time compute: pay at answer-time — A third scaling axis (beyond parameters and data): spend more compute per question at inference — think longer, sample many tries, pick the best — trading latency and cost for accuracy.
The thinking dial: when more hurts — More thinking has diminishing (then negative) returns: it adds latency and cost, can overthink simple tasks, and sometimes talks itself out of the right answer. Match the thinking budget to the problem.
Speculative decoding: two models, one fast answer — A small, fast model drafts several tokens ahead; the big model checks them all in one pass, keeping the run it agrees with. Same output, fewer slow steps.
Distillation: a small model learns from a big one — Train a small 'student' model to imitate a big 'teacher' (its outputs/probabilities), capturing much of the skill at a fraction of the size and cost. Many small, fast models are built this way.
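
The decoding knobs listed in this section can be sketched as small functions over a list of scores. The logits below are invented, and the helper names (`softmax_with_temperature`, `top_k_filter`, `sample`) are assumptions for this illustration, not any library's API. Top-p and penalties are left out to keep it short.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Draw one token index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
greedy = max(range(len(logits)), key=lambda i: logits[i])
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
filtered = top_k_filter(softmax_with_temperature(logits), k=2)
choice = sample(filtered, random.Random(0))
print(greedy, [round(p, 2) for p in sharp], [round(p, 2) for p in flat], filtered, choice)
```

Greedy decoding always takes the top-scoring token. Sampling draws from the (optionally reshaped and filtered) distribution, which is where the variety comes from.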

A System in the World — Frozen model → RAG → agents

The model is frozen and stateless — Weights are fixed after training; no memory between calls; it confabulates.
RAG: retrieval as a callback to similarity — Embed the query, find nearest chunks (the dot product again!), paste into context.
Tools & agents: the loop around a frozen model — An agent is a loop: the model proposes an action, a tool runs, the result re-enters context.
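
The RAG steps above (embed the query, find the nearest chunks by dot product, paste them into context) might look like this in miniature. The `embed` function is a hypothetical stand-in that counts vocabulary words; a real system would call a learned embedding model. The vocabulary, chunks, and prompt layout are all invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def embed(text, vocabulary):
    """Hypothetical embedding (assumption): one word count per vocabulary term."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

def retrieve(query, chunks, vocabulary, top_n=1):
    """Rank chunks by dot-product similarity to the query and keep the best."""
    query_vec = embed(query, vocabulary)
    ranked = sorted(chunks,
                    key=lambda c: dot(query_vec, embed(c, vocabulary)),
                    reverse=True)
    return ranked[:top_n]

def build_prompt(query, chunks, vocabulary):
    """Paste the retrieved chunks into the context ahead of the question."""
    context = "\n".join(retrieve(query, chunks, vocabulary))
    return f"Context:\n{context}\n\nQuestion: {query}"

VOCAB = ["refund", "shipping", "days", "password", "reset"]
CHUNKS = [
    "refund requests are accepted within 30 days",
    "shipping takes 5 business days",
    "reset your password from the login page",
]
prompt = build_prompt("how do i reset my password", CHUNKS, VOCAB)
print(prompt)
```

The model itself stays frozen throughout. Only the text placed in front of it changes.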

Agents in Practice — Drive the tools well

The harness: the loop, made real — A harness wraps the model in a real loop: it reads the running notes, the model proposes an action, a tool runs, the result joins the notes — repeat. The model is the same frozen predictor; the harness makes it act.
What's in the context window — Each turn the model sees one bundle — system prompt, tool definitions, your files, the whole chat so far, plus room reserved for its reply — and it all must fit in a fixed context window.
Managing the context — Clear it, compact it, or start fresh — pick the move for the moment. Anything not in the window (files, memory notes) can be reloaded on demand, so pruning is safe.
Prompt caching: reuse the prefix — Cache the stable prefix: the model processes it once, then reuses it for a fraction of the cost and latency — as long as the start of the prompt is byte-for-byte unchanged.
Tools, permissions & trust — Every tool call is the agent acting on the world — read, edit, run, search — and each result flows back into context. You gate permissions, scope what it can touch, and verify what it did.
Skills & on-demand context — A skill is expertise on a shelf: only its one-line description stays in context until it's relevant, then its full instructions load — progressive disclosure. MCP adds tools the same just-in-time way.
Working well: habits that compound — Scope the task, give the right context (not all of it), keep the window clean, plan before acting, and verify the output — small habits that compound into real leverage.
MCP: the universal connector — The Model Context Protocol (MCP) is one open standard: connect a tool or data source once, and any MCP-aware harness can use it — think USB-C for AI tools.
Let the agent gather context — Give the agent tools and a clear goal and it fetches its own context: it searches the repo, opens the right files, runs commands — pulling in exactly what it needs, when it needs it.
Token awareness & the economics — Every token in and out costs money and time, and the model even paces itself against its budget. Fewer, well-chosen tokens beat more — and caching, retrieval, and clearing all tilt the economics.
Inside one turn: reason, then act — Each turn it writes a private thought, picks ONE action (a tool call), then reads the result — interleaving reasoning and acting (the ReAct pattern) instead of answering in one shot.
The self-correction loop: test, fix, repeat — Give the agent a way to check itself — run the tests, read the failure, fix, re-run — and the test suite becomes its ground truth. Verification, not generation, is the real bottleneck.
Plan first, or feel your way? — Plan-and-execute is cheap and predictable but shatters when a step surprises you; step-by-step (ReAct) adapts but costs more and can wander. Match it to how predictable the task is.
How much leash? Levels of autonomy — Autonomy is a dial: suggest → approve-each-step → hand off a task → autonomous teammate. Higher isn't better — match it to how well-scoped the task is and how easily you can verify the result.
Many agents: orchestrator & subagents — An orchestrator delegates sub-tasks to subagents, each with its own clean context, working in parallel and reporting back. Powerful for wide search — but it multiplies cost and can fragment, so reach for it deliberately.
Memory across sessions — Memory lives outside the context window — notes files (like CLAUDE.md), saved facts, and retrieval — that the agent reloads on demand. The window is short-term; these are its long-term notebook.
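
The harness loop described in this section (read the notes, let the model propose an action, run a tool, append the result, repeat) can be sketched with a scripted stand-in for the model. Everything here is hypothetical: `fake_model`, the `TOOLS` registry, and the note format are invented for illustration, and a real harness would call an actual LLM and gate tool permissions.

```python
def fake_model(notes):
    """Hypothetical stand-in for the frozen model: picks an action from the notes."""
    if not any(line.startswith("RESULT") for line in notes):
        return {"tool": "add", "args": [2, 3]}
    return {"final": "The sum is " + notes[-1].split()[-1]}

# Hypothetical tool registry (assumption): name -> callable.
TOOLS = {"add": lambda a, b: a + b}

def run_agent(goal, model, tools, max_steps=5):
    """Loop: model proposes an action, a tool runs, the result joins the notes."""
    notes = ["GOAL " + goal]
    for _ in range(max_steps):
        action = model(notes)
        if "final" in action:
            return action["final"], notes
        result = tools[action["tool"]](*action["args"])
        notes.append(f"RESULT {action['tool']} {result}")
    return None, notes  # step budget exhausted without a final answer

answer, notes = run_agent("add 2 and 3", fake_model, TOOLS)
print(answer)
print(notes)
```

The `max_steps` cap is one simple way to keep a wandering loop from running forever. The notes list plays the role of the context window.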

LLMs in Production — Ship, watch, evaluate, improve

An LLM feature in production — In production the model is one piece of a pipeline: input guardrails → prompt assembly (system + retrieved context + user input) → model → validate/parse → output guardrails → your app. Every step can fail and must be handled.
Observability: seeing inside — Trace every request — inputs, retrieved context, the assembled prompt, the output, tool calls, tokens, latency, cost, errors — plus user feedback. You can't fix what you can't see.
Evals: proving it works — Build an eval set: inputs paired with expected answers or rubric criteria. Score every change against it — unit tests for prompts — so you ship improvements, not regressions.
LLM-as-a-judge — Use a strong model as the grader against a rubric — pointwise scores or pairwise 'which is better.' Fast and scalable, but watch its biases (order, verbosity, self-preference) and calibrate it against human labels.
Datasets, labeling & ground truth — Ground truth is built by people: clear guidelines, human labelers, agreement checks, and curated 'gold' sets — increasingly seeded by models but verified by humans. Quality beats quantity.
The data flywheel — Production traffic → log it → curate the hard and failed cases → label them → feed evals and training → a better model → more usage. The loop compounds, and your proprietary data becomes the moat.
The lethal trifecta — Danger spikes when an agent has all three at once: access to private data, exposure to untrusted content, and a way to send data out. Untrusted text becomes instructions (prompt injection) and exfiltrates. Remove any one leg to defuse it.
Designing trustworthy AI features — Design for the verification gap: stream the answer so it feels alive, cite sources so claims are checkable, show confidence and let users correct, and keep a human in the loop for high-stakes actions.
Structured outputs: guaranteed JSON — Constrained decoding forces output to match a schema by only allowing valid next tokens, so a completed response always parses. Schema-valid still isn't semantically correct, though, so validate the values too.
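
The "validate/parse" step of the pipeline, and the point that schema-valid output can still be wrong, can be sketched as a small validator. The order payload, its field names, and `validate_order` are hypothetical examples, not part of any real API.

```python
import json

def validate_order(raw):
    """Return (ok, problems) for a hypothetical order payload from the model."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["not valid JSON"]
    if not isinstance(data, dict):
        return False, ["top level is not an object"]

    problems = []
    # Shape checks: roughly what a schema would already enforce.
    if not isinstance(data.get("item"), str):
        problems.append("item must be a string")
    quantity = data.get("quantity")
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        problems.append("quantity must be an integer")
    elif quantity <= 0:
        # Value check: the type is right, but the value makes no sense.
        problems.append("quantity must be positive")
    return not problems, problems

print(validate_order('{"item": "pen", "quantity": 2}'))
print(validate_order('{"item": "pen", "quantity": -1}'))
print(validate_order("oops"))
```

The second call is the interesting one: it parses and has the right types, yet the value check still rejects it.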

Physical Substrate — GPUs & datacenters

Why GPUs beat CPUs — Each output cell of a layer is its own independent sum, so all of them can be computed at once; a GPU has thousands of slower workers to do exactly that.
Why training needs datacenters — Cut the model into slices, put one slice per GPU, and wire them together to act as one machine.
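
Both ideas in this section, independent output cells and slicing a model across devices, can be illustrated with a matrix-vector product split by rows. This is only a sketch: Python threads stand in for GPUs here and give no real speedup, and the slicing scheme (`sharded_layer`) is an assumption chosen for simplicity, not a description of how production systems partition models.

```python
from concurrent.futures import ThreadPoolExecutor

def output_cell(row, vector):
    """One output cell: its own independent multiply-and-add."""
    return sum(w * x for w, x in zip(row, vector))

def layer(matrix, vector):
    """All output cells, computed one after another."""
    return [output_cell(row, vector) for row in matrix]

def sharded_layer(matrix, vector, num_workers=2):
    """Split the rows into slices, compute each slice separately, then join."""
    size = max(1, -(-len(matrix) // num_workers))  # ceiling division
    slices = [matrix[i:i + size] for i in range(0, len(matrix), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(lambda rows: layer(rows, vector), slices))
    return [cell for part in parts for cell in part]

# Toy weights and input (assumption).
W = [[1, 0], [0, 1], [1, 1], [2, -1]]
x = [3, 4]
print(layer(W, x), sharded_layer(W, x))
```

Because no output cell depends on another, the sliced version gives the same answer as the sequential one. That independence is what lets the work be spread across many workers.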

Elective Rooms — Off the main arc

RLHF / post-training — Pretraining → SFT → RLHF: rank outputs to train a reward model, shift the policy.
Fine-tuning vs. prompting — Same task via a better prompt vs. a few adapter weights (LoRA intuition).
Hallucination — It optimizes plausibility, not truth; connect to RAG as mitigation.
Interpretability — Explore real attention patterns / feature activations on curated inputs.
Training-data curation — Drive a funnel: raw web → dedup → quality filter → decontaminate.
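
The curation funnel (raw web → dedup → quality filter → decontaminate) can be sketched as three small filters chained together. The rules here are deliberately crude assumptions: exact-match dedup after whitespace and case normalization, a minimum word count as the quality bar, and substring matching against benchmark text. Real pipelines use much more sophisticated versions of each stage.

```python
def dedup(docs):
    """Drop documents that match an earlier one after normalizing case and spacing."""
    seen, kept = set(), []
    for doc in docs:
        key = " ".join(doc.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=4):
    """Crude quality bar (assumption): keep documents with enough words."""
    return [d for d in docs if len(d.split()) >= min_words]

def decontaminate(docs, benchmark_items):
    """Drop documents that contain any benchmark item verbatim."""
    return [d for d in docs
            if not any(item.lower() in d.lower() for item in benchmark_items)]

def curate(raw_docs, benchmark_items):
    return decontaminate(quality_filter(dedup(raw_docs)), benchmark_items)

raw = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",
    "click here",
    "Q: what is 2 + 2? A: 4, as everyone knows.",
    "Gradient descent follows the slope downhill.",
]
benchmark = ["what is 2 + 2?"]  # hypothetical eval question to keep out of training data
clean = curate(raw, benchmark)
print(clean)
```

Each stage removes one document from this toy set: the near-duplicate, the two-word fragment, and the one that leaks a benchmark question.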