See How AI Works — An Interactive Course on How Modern LLMs Actually Work
A free, no-code, fully interactive course on how modern LLMs work — from 'a word is a vector' to transformers, attention, RAG, agents, and the hardware underneath.
The Big Picture — See it work first
- See it think — It only ever predicts the next word, from a ranked list of guesses.
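As a rough illustration of "predict the next word from a ranked list of guesses", the sketch below uses a tiny hand-written lookup table in place of a real model. The table `NEXT_WORD`, its probabilities, and the function names are all made up for this example; a real LLM computes its guesses with a neural network over tokens, not a dictionary over words.

```python
# Toy next-word table: each word maps to candidate next words with
# made-up probabilities (hypothetical values, for illustration only).
NEXT_WORD = {
    "the": {"cat": 0.5, "dog": 0.3, "moon": 0.2},
    "cat": {"sat": 0.6, "ran": 0.3, "is": 0.1},
    "sat": {"down": 0.7, "still": 0.3},
}


def ranked_guesses(word):
    """Return (candidate, probability) pairs, most likely first."""
    guesses = NEXT_WORD.get(word, {})
    return sorted(guesses.items(), key=lambda pair: pair[1], reverse=True)


def generate(start, max_words=5):
    """Repeatedly append the top-ranked guess until no guess is available."""
    words = [start]
    for _ in range(max_words):
        guesses = ranked_guesses(words[-1])
        if not guesses:
            break  # the toy table has no entry for this word
        words.append(guesses[0][0])
    return words


print(ranked_guesses("the"))
print(" ".join(generate("the")))
```

Always taking the top guess is only one possible choice; the point here is that the whole sentence comes from repeating a single step.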
Representation — A word is a vector
- Words have no math — To do math on meaning, turn each word into a position: a vector.
- Embeddings: meaning as coordinates — More dimensions = more nuance; an embedding is a learned coordinate list.
- Dot-product similarity — Dot product / cosine as an alignment score: multiply-and-add.
- Below the vector: tokens — Before meaning, text is split into tokens — common chunks (often sub-words) drawn from a fixed vocabulary an algorithm like BPE learned. The model only ever sees token ids, not characters.
- Beyond text: images become tokens too — Images and audio are cut into patches or frames, each turned into a vector in the same space as text tokens — so the model can compare words and pixels with the same dot-product similarity and predict the next token.
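The multiply-and-add behind dot-product and cosine similarity can be sketched in a few lines. The three-dimensional vectors in `EMBEDDINGS` are invented for this example; learned embeddings typically have hundreds or thousands of dimensions, and their values come from training rather than being written by hand.

```python
import math

# Hypothetical 3-dimensional embeddings (hand-picked values, not learned).
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def dot(a, b):
    """Multiply matching coordinates and add the products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by both vector lengths, so only direction matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


king, queen, apple = (EMBEDDINGS[w] for w in ("king", "queen", "apple"))
print("king vs queen:", round(cosine(king, queen), 3))
print("king vs apple:", round(cosine(king, apple), 3))
```

With these made-up values, "king" and "queen" point in nearly the same direction and score close to 1, while "king" and "apple" score much lower.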
Prediction & Learning — How it gets good
Architecture — Attention & the transformer block
Operating & Scaling — Decoding, context, scaling laws
- Decoding & sampling knobs — Greedy vs. sampling; temperature, top-k, top-p, penalties.
- Context window & KV cache — Fixed context length; the KV cache reuses past keys/values for speed.
- Positional information — Attention by itself ignores token order, so position has to be supplied explicitly (sinusoidal / RoPE intuition).
- Scaling laws — Loss falls as a power law in params, data, compute (Chinchilla).
- Mixture of Experts — A router sends each token to a few expert FFNs; sparsity decouples per-token compute from total parameter count.
- Quantization — Store weights at lower precision; small accuracy cost, big memory win.
- GPT-2 → GPT-3 → frontier — The same architecture scaled by orders of magnitude.
- The model thinks before it answers — Reasoning models are trained to spend tokens thinking first — a private chain of thought — before the final answer, so the 'scratch paper' is just more generated tokens.
- Test-time compute: pay at answer-time — A third scaling axis (beyond parameters and data): spend more compute per question at inference — think longer, sample many tries, pick the best — trading latency and cost for accuracy.
- The thinking dial: when more hurts — More thinking has diminishing (then negative) returns: it adds latency and cost, can overthink simple tasks, and sometimes talks itself out of the right answer. Match the thinking budget to the problem.
- Speculative decoding: two models, one fast answer — A small, fast model drafts several tokens ahead; the big model checks them all in one pass, keeping the run it agrees with. Same output, fewer slow steps.
- Distillation: a small model learns from a big one — Train a small 'student' model to imitate a big 'teacher' (its outputs or probabilities), capturing much of the skill at a fraction of the size and cost. Many of the small, fast models in everyday use are distilled.
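The decoding knobs listed in this section (greedy vs. sampling, temperature, top-k, top-p) can be sketched over a plain list of scores. This is a minimal sketch rather than any library's implementation: the function names are hypothetical, penalties are left out, and real decoders differ in details such as whether probabilities are renormalized between the top-k and top-p steps.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p. Returns {index: prob},
    renormalized to sum to 1."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for index, p in ranked:
            kept.append((index, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {index: p / total for index, p in ranked}


def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token index. Temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    kept = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(kept)
    weights = [kept[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(sample(logits, temperature=0))            # greedy: always index 0
print(sample(logits, temperature=0.8, top_k=2)) # random among the top two
```

Passing a seeded `random.Random` instance as `rng` makes the sampled runs repeatable, which is handy when comparing settings.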
A System in the World — Frozen model → RAG → agents
Agents in Practice — Drive the tools well
- The harness: the loop, made real — A harness wraps the model in a real loop: it reads the running notes, the model proposes an action, a tool runs, the result joins the notes — repeat. The model is the same frozen predictor; the harness makes it act.
- What's in the context window — Each turn the model sees one bundle — system prompt, tool definitions, your files, the whole chat so far, plus room reserved for its reply — and it all must fit in a fixed context window.
- Managing the context — Clear it, compact it, or start fresh: pick the move for the moment. Anything that also lives outside the window (files, memory notes) can be reloaded on demand, so pruning it is safe.
- Prompt caching: reuse the prefix — Cache the stable prefix: the model processes it once, then reuses it for a fraction of the cost and latency — as long as the start of the prompt is byte-for-byte unchanged.
- Tools, permissions & trust — Every tool call is the agent acting on the world — read, edit, run, search — and each result flows back into context. You gate permissions, scope what it can touch, and verify what it did.
- Skills & on-demand context — A skill is expertise on a shelf: only its one-line description stays in context until it's relevant, then its full instructions load — progressive disclosure. MCP adds tools the same just-in-time way.
- Working well: habits that compound — Scope the task, give the right context (not all of it), keep the window clean, plan before acting, and verify the output — small habits that compound into real leverage.
- MCP: the universal connector — The Model Context Protocol (MCP) is one open standard: connect a tool or data source once, and any MCP-aware harness can use it — think USB-C for AI tools.
- Let the agent gather context — Give the agent tools and a clear goal and it fetches its own context: it searches the repo, opens the right files, runs commands — pulling in exactly what it needs, when it needs it.
- Token awareness & the economics — Every token in and out costs money and time, and the model even paces itself against its budget. Fewer, well-chosen tokens beat more — and caching, retrieval, and clearing all tilt the economics.
- Inside one turn: reason, then act — Each turn it writes a private thought, picks ONE action (a tool call), then reads the result — interleaving reasoning and acting (the ReAct pattern) instead of answering in one shot.
- The self-correction loop: test, fix, repeat — Give the agent a way to check itself — run the tests, read the failure, fix, re-run — and the test suite becomes its ground truth. Verification, not generation, is the real bottleneck.
- Plan first, or feel your way? — Plan-and-execute is cheap and predictable but shatters when a step surprises you; step-by-step (ReAct) adapts but costs more and can wander. Match it to how predictable the task is.
- How much leash? Levels of autonomy — Autonomy is a dial: suggest → approve-each-step → hand off a task → autonomous teammate. Higher isn't better — match it to how well-scoped the task is and how easily you can verify the result.
- Many agents: orchestrator & subagents — An orchestrator delegates sub-tasks to subagents, each with its own clean context, working in parallel and reporting back. Powerful for wide search — but it multiplies cost and can fragment, so reach for it deliberately.
- Memory across sessions — Memory lives outside the context window — notes files (like CLAUDE.md), saved facts, and retrieval — that the agent reloads on demand. The window is short-term; these are its long-term notebook.
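The harness loop described in this section (read the notes, let the model propose one action, run the tool, append the result, repeat) can be sketched with a stand-in for the model. Everything here is hypothetical: `fake_model` is a hard-coded function rather than an LLM, `TOOLS` holds a single toy tool, and the step format is invented for the example.

```python
def fake_model(notes):
    """Stand-in for the frozen model: reads the notes, proposes ONE next step."""
    results = [n for n in notes if n.startswith("result: ")]
    if not results:
        return {"thought": "I need the sum first.", "action": "add", "args": [2, 3]}
    answer = results[-1][len("result: "):]
    return {"thought": "I have what I need.", "action": "finish", "args": [answer]}


# Registry of tools the harness is willing to run (a crude permission gate).
TOOLS = {"add": lambda a, b: a + b}


def run_harness(task, model, tools, max_turns=5):
    """Loop: model proposes, harness runs the tool, the result joins the notes."""
    notes = [f"task: {task}"]
    for _ in range(max_turns):
        step = model(notes)
        notes.append(f"thought: {step['thought']}")
        if step["action"] == "finish":
            return step["args"][0], notes
        tool = tools[step["action"]]  # unknown tool names raise KeyError
        result = tool(*step["args"])
        notes.append(f"result: {result}")
    return None, notes  # turn budget spent without a final answer


answer, notes = run_harness("What is 2 + 3?", fake_model, TOOLS)
print(answer)
for line in notes:
    print(line)
```

The `max_turns` cap is a design choice in this sketch: it keeps a wandering loop from running, and spending tokens, indefinitely.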
LLMs in Production — Ship, watch, evaluate, improve
- An LLM feature in production — In production the model is one piece of a pipeline: input guardrails → prompt assembly (system + retrieved context + user input) → model → validate/parse → output guardrails → your app. Every step can fail and must be handled.
- Observability: seeing inside — Trace every request — inputs, retrieved context, the assembled prompt, the output, tool calls, tokens, latency, cost, errors — plus user feedback. You can't fix what you can't see.
- Evals: proving it works — Build an eval set: inputs paired with expected answers or rubric criteria. Score every change against it — unit tests for prompts — so you ship improvements, not regressions.
- LLM-as-a-judge — Use a strong model as the grader against a rubric — pointwise scores or pairwise 'which is better.' Fast and scalable, but watch its biases (order, verbosity, self-preference) and calibrate it against human labels.
- Datasets, labeling & ground truth — Ground truth is built by people: clear guidelines, human labelers, agreement checks, and curated 'gold' sets — increasingly seeded by models but verified by humans. Quality beats quantity.
- The data flywheel — Production traffic → log it → curate the hard and failed cases → label them → feed evals and training → a better model → more usage. The loop compounds, and your proprietary data becomes the moat.
- The lethal trifecta — Danger spikes when an agent has all three at once: access to private data, exposure to untrusted content, and a way to send data out. Untrusted text can be read as instructions (prompt injection) that tell the agent to exfiltrate that data. Remove any one leg to defuse it.
- Designing trustworthy AI features — Design for the verification gap: stream the answer so it feels alive, cite sources so claims are checkable, show confidence and let users correct, and keep a human in the loop for high-stakes actions.
- Structured outputs: guaranteed JSON — Constrained decoding forces output to match a schema by only allowing valid next tokens — so the JSON always parses. But schema-valid still isn't semantically correct, so validate the values too.
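The gap between "schema-valid" and "semantically correct" can be illustrated with a small validation step of the kind that would sit after the model in the pipeline. The schema, field names, and business rules below are assumptions made up for this sketch, and it uses only the standard-library `json` module rather than any particular provider's structured-output feature.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"product": str, "quantity": int, "unit_price": float}


def matches_schema(data):
    """Shape check: exactly the expected fields, each with the exact type."""
    return (
        isinstance(data, dict)
        and set(data) == set(SCHEMA)
        and all(type(data[key]) is expected for key, expected in SCHEMA.items())
    )


def semantic_problems(data):
    """Value checks that a schema alone cannot express (made-up rules)."""
    problems = []
    if not data["product"].strip():
        problems.append("product must not be empty")
    if data["quantity"] <= 0:
        problems.append("quantity must be positive")
    if data["unit_price"] < 0:
        problems.append("unit_price must not be negative")
    return problems


def validate(raw_text):
    """Return (data, problems); data is None whenever any check fails."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None, ["not valid JSON"]
    if not matches_schema(data):
        return None, ["does not match schema"]
    problems = semantic_problems(data)
    return (data if not problems else None), problems


good = '{"product": "pen", "quantity": 3, "unit_price": 1.5}'
bad = '{"product": "pen", "quantity": -3, "unit_price": 1.5}'
print(validate(good))
print(validate(bad))  # parses and fits the schema, yet the value is wrong
```

The second input is the case the bullet warns about: it parses and fits the schema, so only the value check catches it.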
Physical Substrate — GPUs & datacenters
- Why GPUs beat CPUs — Each output cell of a layer is its own independent sum, so all of them can be computed at once; a GPU has thousands of slower workers to do exactly that.
- Why training needs datacenters — Cut the model into slices, put one slice per GPU, and wire them together to act as one machine.
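Both ideas in this section can be sketched in plain Python: each output cell of a layer is an independent sum, and the layer can be cut into slices that are computed separately and stitched back together. This is a simulation under stated assumptions, not how any real framework does it: the "devices" are just loop iterations, the weights are made-up numbers, and slicing by output rows is only one of several ways a model can be split across GPUs.

```python
def layer_output(weights, x):
    """Each output cell is its own multiply-and-add over the inputs,
    so no cell depends on any other cell."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sharded_output(weights, x, num_devices):
    """Give each simulated device a contiguous slice of output rows,
    compute the slices separately, then concatenate the partial outputs."""
    rows_per_device = -(-len(weights) // num_devices)  # ceiling division
    output = []
    for device in range(num_devices):
        start = device * rows_per_device
        shard = weights[start:start + rows_per_device]
        # On real hardware the shards would run at the same time.
        output.extend(layer_output(shard, x))
    return output


W = [[1, 0, 2], [0, 1, 0], [3, 1, 1], [2, 2, 2]]  # made-up 4x3 weights
x = [1, 2, 3]
print(layer_output(W, x))
print(sharded_output(W, x, num_devices=2))
```

Because the cells are independent, the sharded result matches the single-device result; what real systems add on top is the communication needed to move inputs and partial outputs between devices.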
Elective Rooms — Off the main arc