LLM Glossary — Key AI Terms Explained · See How AI Works
Plain-language definitions of the core terms behind large language models: tokens, embeddings, attention, transformers, softmax, RAG, agents, MoE, quantization, and more.
- Token
- The unit a language model reads and writes — usually a word or word-piece. Text is split into tokens, and the model only ever predicts the next one. Learn it.
- Embedding
- A list of numbers (a vector) that represents a word's meaning as a position in space, so similar words sit close together and a machine can do math on meaning. Learn it.
- Vector
- An ordered list of numbers. In an LLM, words, positions, and internal states are all vectors, which is what lets the model compute with them. Learn it.
- Dot product / cosine similarity
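- A way to score how aligned two vectors are. The dot product multiplies matching numbers and adds them up; cosine similarity divides that result by the two vectors' lengths, so only direction counts. It's how a model measures meaning-similarity, and it reappears in attention and RAG. Learn it.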
- N-gram model
- A simple predictor that guesses the next word from the previous one or few words. It shows why a fixed, tiny window of context isn't enough. Learn it.
- Loss
- A single number measuring how surprised the model was by the correct next word. Lower is better; training is the search for lower loss. Learn it.
- Gradient descent
- The training method: nudge every parameter a little in the direction that lowers the loss, over and over — like rolling downhill on the loss surface. Learn it.
- Attention
- The mechanism that lets each token look at other tokens and pull in the ones relevant to its meaning. It's the core idea behind the transformer. Learn it.
- Query, Key, Value (Q/K/V)
- Three vectors each token produces. A token's Query is matched against every Key (by dot product) to decide how much of each Value to blend in. Learn it.
- Softmax
- A function that turns raw scores into a set of weights that add up to 1 — a probability distribution. Attention uses it to turn match-scores into a blend. Learn it.
- Causal mask
- A rule that stops a token from attending to later tokens. During training the whole sequence is visible at once, so the mask forces the model to predict the next token rather than peek ahead. Learn it.
- Multi-head attention
- Running several attention patterns in parallel, each free to track a different relationship (grammar, reference, …), then combining them. Learn it.
- Transformer block
- The repeating unit of a modern LLM: attention + a feed-forward layer, wrapped in residual connections and normalization. Stack many of them and you have the model. Learn it.
- Residual connection
- A shortcut that adds a layer's input back to its output, so the original signal survives through a deep stack and the network stays trainable. Learn it.
- Layer normalization
- A step that rescales a vector back to a stable range as it passes through each layer, keeping values from exploding or vanishing in deep networks. Learn it.
- Temperature
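- A decoding knob for randomness that divides the model's raw scores before softmax. Near 0 the model picks the most likely token every time (effectively deterministic); higher values flatten the distribution, so less likely tokens get sampled more often. Learn it.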
- Top-k / top-p sampling
- Ways to pick the next word from only the most probable candidates — the top k of them (top-k) or the smallest set covering probability p (top-p / nucleus). Learn it.
- Context window
- The maximum number of tokens a model can attend to at once. Everything outside it is invisible to the model on that call. Learn it.
- KV cache
- A speed trick that stores the Keys and Values already computed for past tokens, so generating each new token reuses them instead of recomputing. Learn it.
- Positional encoding
- Information added to each token so the model knows word order — because attention on its own is order-blind. Learn it.
- Scaling laws
- The empirical finding that loss falls predictably as you add parameters, data, and compute. They forecast loss, not which specific abilities emerge. Learn it.
- Mixture of Experts (MoE)
- An architecture where a router sends each token to a few specialized sub-networks (experts), so the model holds lots of knowledge but only runs a slice per token. Learn it.
- Quantization
- Storing a model's weights with fewer bits of precision to shrink memory and speed up serving, at a small, usually acceptable, cost to quality. Learn it.
- Retrieval-Augmented Generation (RAG)
- Giving a frozen model fresh or private knowledge by searching your documents for the most relevant chunks (using similarity) and pasting them into its context. Learn it.
- Agent
- A loop around a model: it proposes an action, a tool runs, the result re-enters the context, and it repeats — turning a one-shot predictor into something that can act. Learn it.
- Hallucination
- When a model states something false with confidence. It happens because the model optimizes for plausible-sounding text, not verified truth. Learn it.
- RLHF
- Reinforcement Learning from Human Feedback: people rank model outputs to train a reward model, which is then used to steer the model toward helpful, aligned answers. Learn it.
- Fine-tuning
- Continuing to train a pretrained model on extra data to specialize it — versus prompting, which changes behavior with instructions alone. LoRA does this cheaply with small adapter weights. Learn it.
- Parameter / weight
- One of the model's learned numbers. Frontier models have billions; training is the process of setting them so predictions improve. Learn it.
- GPU
- A processor with thousands of parallel cores. The matrix multiplications inside an LLM are massively parallel, so wide GPU hardware runs them far faster than a CPU. Learn it.
- Harness
- The program that wraps a model and lets it act: it loops — read context, let the model propose a tool call, run the tool, feed the result back — until the task is done. Claude Code, Codex, and Cursor are harnesses. Learn it.
- System prompt
- The standing instructions placed at the very start of the context window — who the assistant is, its rules and tools. It's sent every turn, which is why it's the prime candidate for prompt caching. Learn it.
- Context rot
- The measured tendency of model accuracy to degrade as the context window fills, even on easy tasks. A short, focused prompt often beats the same information buried in a huge one, so curating context beats stuffing it. Learn it.
- Prompt caching
- Reusing the processed form of a stable prompt prefix so you don't pay full price to re-send it. A cache read is billed at a fraction of a normal input token (about 10% on some providers; pricing varies). It only hits if the prefix is unchanged, so keep stable content first. Learn it.
- Skill
- A packaged bit of expertise the agent loads on demand: only its one-line description sits in context until it's relevant, then its full instructions load (progressive disclosure) — capability without a permanent context cost. Learn it.
- MCP (Model Context Protocol)
- An open standard for connecting a model/harness to external tools and data sources (files, APIs, databases) in a uniform way — connect once, use from any MCP-aware tool. Turns N×M custom integrations into N+M. Learn it.
- Observability
- Seeing inside a production LLM app: tracing each request (retrieval, prompt, model, tools) and logging inputs, outputs, tokens, latency, cost, errors, and user feedback. You can't fix what you can't see. Learn it.
- Eval
- A test set for an LLM app — inputs paired with expected answers or rubric criteria — scored on every change so you ship improvements, not regressions. Like unit tests for prompts. Learn it.
- LLM-as-a-judge
- Using a strong model to grade outputs at scale against a rubric (pointwise scores or pairwise comparisons). Scalable, but prone to biases (position, verbosity, self-preference) — calibrate it against human labels. Learn it.
- Ground truth
- The verified 'right answers' a dataset is judged against — built by people via clear guidelines, multiple labelers, and agreement checks. Evals and training are only as good as their ground truth. Learn it.
- Data flywheel
- The compounding loop where production traffic is logged, hard cases are curated and labeled, and the results are fed back into evals and training, yielding a better model and more usage. Proprietary production data becomes the moat. Learn it.
- Prompt injection
- An attack where untrusted content the model reads is treated as new instructions, because the model sees instructions and data on one channel. It ranks first in the OWASP Top 10 for LLM Applications and is not fully solvable: guardrails reduce the risk but do not eliminate it. Learn it.
- Tokenization (BPE)
- Splitting text into tokens: common sub-word chunks from a fixed vocabulary learned by an algorithm like Byte-Pair Encoding. The model reads token ids, not letters, which is why it miscounts characters and why usage is billed per token. Learn it.
- Multimodal model
- A model that handles more than text: images are cut into patches and audio into frames, each turned into a vector in the same space as text tokens, so one transformer attends across them together. Learn it.
- Chain of thought
- Tokens a model generates to 'work things out' before its final answer. Reasoning models are trained to do this; the thinking is just more generated tokens spent before answering. Learn it.
- Test-time compute
- Spending more compute per question at inference — thinking longer, or sampling many answers and picking the best — to raise accuracy. A third way to scale capability beyond parameters and training data. Learn it.
- Speculative decoding
- A speedup where a small, fast model drafts several tokens ahead and the big model verifies them in one pass, keeping the agreed prefix — same output, fewer slow steps. Learn it.
- Distillation
- Training a small 'student' model to imitate a big 'teacher' so it keeps most of the skill at a fraction of the size and cost. Many small, fast models you use are distilled. Learn it.
- ReAct
- The agent loop pattern of reason → act → observe: each turn the model writes a private thought, takes one action (a tool call), then reads the result — interleaving thinking and acting instead of answering in one shot. Learn it.
- Autonomy levels
- How much an agent does on its own — from suggesting, to asking before each step, to handing off a whole task. Higher isn't better: match it to how well-scoped the task is and how easily you can verify the result. Learn it.
- Multi-agent (orchestrator & subagents)
- Splitting a job across agents: an orchestrator delegates sub-tasks to subagents, each with its own clean context, often in parallel. Powerful for wide search, but it multiplies cost and can fragment the work, since no single agent sees the whole picture. Learn it.
- Lethal trifecta
- The dangerous combination of an agent having private-data access, exposure to untrusted content, and a way to send data out — together they enable data exfiltration via prompt injection. Remove any one leg to defuse it. Learn it.
- Structured outputs
- Constraining a model's output to a schema (e.g. JSON) by only allowing schema-valid next tokens, so it always parses. Valid shape still isn't the same as a correct value, so validate the contents too. Learn it.
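
The Dot product / cosine similarity, Softmax, Q/K/V, and Causal mask entries describe steps that chain together, so a small worked sketch may help. The one below uses plain Python lists and made-up two-number vectors for three tokens. The function names and toy values are illustrative assumptions, and the division by the square root of the vector size is the standard scaling used in transformers rather than something the glossary states. Real models do the same arithmetic with large matrices on a GPU.

```python
import math

def dot(a, b):
    """Multiply matching entries and add them up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by both vector lengths, so only direction counts."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    """Turn raw scores into non-negative weights that add up to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention where token i may only look at tokens 0..i."""
    scale = math.sqrt(len(keys[0]))  # standard scaling (an assumption here)
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: ignore every position after i.
        visible_keys = keys[: i + 1]
        visible_values = values[: i + 1]
        # Match this token's Query against each visible Key.
        scores = [dot(q, k) / scale for k in visible_keys]
        weights = softmax(scores)
        # Blend the visible Values using those weights.
        blended = [
            sum(w * v[d] for w, v in zip(weights, visible_values))
            for d in range(len(values[0]))
        ]
        outputs.append(blended)
    return outputs

# Hypothetical toy Q/K/V vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

out = causal_attention(Q, K, V)
for position, vector in enumerate(out):
    print(position, vector)
```

The first token can only see itself, so its output is simply its own Value. Later tokens receive a weighted blend of every Value up to and including their own position.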
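
The Loss and Gradient descent entries can be made concrete with a toy example. In the sketch below the only "parameters" are three raw scores, one per candidate token, and the loss is the negative log of the probability given to the correct token. This setup, the function names, the learning rate, and the step count are all illustrative assumptions; a real model has billions of parameters and computes gradients by backpropagation.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that add up to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, correct):
    """Loss: how surprised the model was by the correct token."""
    return -math.log(softmax(scores)[correct])

def gradient(scores, correct):
    """Slope of the loss for each score: probability minus the one-hot target."""
    probs = softmax(scores)
    return [p - (1.0 if i == correct else 0.0) for i, p in enumerate(probs)]

def train(scores, correct, learning_rate=0.5, steps=50):
    """Repeatedly nudge every score a little in the direction that lowers the loss."""
    history = []
    for _ in range(steps):
        history.append(cross_entropy(scores, correct))
        grad = gradient(scores, correct)
        scores = [s - learning_rate * g for s, g in zip(scores, grad)]
    return scores, history

# Start with no preference among three candidate tokens; token 2 is correct.
final_scores, history = train([0.0, 0.0, 0.0], correct=2)
print("first loss:", history[0])
print("last loss:", history[-1])
print("probability of correct token:", softmax(final_scores)[2])
```

The loss starts at the value for a uniform guess over three options and falls with every step, which is the "rolling downhill" picture in miniature.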
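
The Temperature and Top-k / top-p sampling entries describe filters that are usually applied together when picking the next token. The sketch below shows one plausible ordering under stated assumptions: the token scores are invented, the function and parameter names are hypothetical, and real decoders differ in details such as tie handling and the order of the filters.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into weights that add up to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top token.
        best = max(logits, key=logits.get)
        return {best: 1.0}
    # Sort candidates from most to least likely.
    tokens = sorted(logits, key=logits.get, reverse=True)
    # Temperature divides the raw scores before softmax.
    probs = softmax([logits[t] / temperature for t in tokens])
    keep = len(tokens)
    if top_k is not None:
        keep = min(keep, top_k)  # keep only the k most probable tokens
    if top_p is not None:
        running = 0.0
        for count, p in enumerate(probs, start=1):
            running += p
            if running >= top_p:  # smallest set covering probability p
                keep = min(keep, count)
                break
    kept = probs[:keep]
    total = sum(kept)  # renormalize so the survivors add up to 1
    return {t: p / total for t, p in zip(tokens[:keep], kept)}

def sample(distribution, rng=random):
    """Draw one token according to its probability."""
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical raw scores for four candidate tokens.
logits = {"cat": 2.0, "dog": 1.0, "car": 0.1, "the": -1.0}

print(next_token_distribution(logits, temperature=0))
print(next_token_distribution(logits, top_k=2))
print(sample(next_token_distribution(logits, temperature=1.5, top_p=0.9)))
```

Raising the temperature lowers the top token's share of the probability, while top-k and top-p cut off the unlikely tail before sampling.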