LLM Glossary — Key AI Terms Explained · See How AI Works

Plain-language definitions of the core terms behind large language models: tokens, embeddings, attention, transformers, softmax, RAG, agents, MoE, quantization, and more.

Token: The unit a language model reads and writes — usually a word or word-piece. Text is split into tokens, and the model only ever predicts the next one. Learn it.
Embedding: A list of numbers (a vector) that represents a word's meaning as a position in space, so similar words sit close together and a machine can do math on meaning. Learn it.
Vector: An ordered list of numbers. In an LLM, words, positions, and internal states are all vectors, which is what lets the model compute with them. Learn it.
Dot product / cosine similarity: A way to score how aligned two vectors are by multiplying matching numbers and adding them up. It's how a model measures meaning-similarity — and it reappears in attention and RAG. Learn it.
N-gram model: A simple predictor that guesses the next word from the previous one or few words. It shows why a fixed, tiny window of context isn't enough. Learn it.
Loss: A single number measuring how surprised the model was by the correct next word. Lower is better; training is the search for lower loss. Learn it.
Gradient descent: The training method: nudge every parameter a little in the direction that lowers the loss, over and over — like rolling downhill on the loss surface. Learn it.
Attention: The mechanism that lets each token look at other tokens and pull in the ones relevant to its meaning. It's the core idea behind the transformer. Learn it.
Query, Key, Value (Q/K/V): Three vectors each token produces. A token's Query is matched against every Key (by dot product) to decide how much of each Value to blend in. Learn it.
Softmax: A function that turns raw scores into a set of weights that add up to 1 — a probability distribution. Attention uses it to turn match-scores into a blend. Learn it.
Causal mask: A rule that stops a token from attending to future tokens during training, so the model learns to predict, not peek ahead. Learn it.
Multi-head attention: Running several attention patterns in parallel, each free to track a different relationship (grammar, reference, …), then combining them. Learn it.
Transformer block: The repeating unit of a modern LLM: attention + a feed-forward layer, wrapped in residual connections and normalization. Stack many of them and you have the model. Learn it.
Residual connection: A shortcut that adds a layer's input back to its output, so the original signal survives through a deep stack and the network stays trainable. Learn it.
Layer normalization: A step that rescales a vector back to a stable range as it passes through each layer, keeping values from exploding or vanishing in deep networks. Learn it.
Temperature: A decoding knob for randomness. Near 0 the model picks the most likely word every time (deterministic); higher values make it sample more creatively. Learn it.
Top-k / top-p sampling: Ways to pick the next word from only the most probable candidates — the top k of them (top-k) or the smallest set covering probability p (top-p / nucleus). Learn it.
Context window: The maximum number of tokens a model can attend to at once. Everything outside it is invisible to the model on that call. Learn it.
KV cache: A speed trick that stores the Keys and Values already computed for past tokens, so generating each new token reuses them instead of recomputing. Learn it.
Positional encoding: Information added to each token so the model knows word order — because attention on its own is order-blind. Learn it.
Scaling laws: The empirical finding that loss falls predictably as you add parameters, data, and compute. They forecast loss, not which specific abilities emerge. Learn it.
Mixture of Experts (MoE): An architecture where a router sends each token to a few specialized sub-networks (experts), so the model holds lots of knowledge but only runs a slice per token. Learn it.
Quantization: Storing a model's weights with fewer bits of precision to shrink memory and speed up serving, at a small, usually acceptable, cost to quality. Learn it.
Retrieval-Augmented Generation (RAG): Giving a frozen model fresh or private knowledge by searching your documents for the most relevant chunks (using similarity) and pasting them into its context. Learn it.
Agent: A loop around a model: it proposes an action, a tool runs, the result re-enters the context, and it repeats — turning a one-shot predictor into something that can act. Learn it.
Hallucination: When a model states something false with confidence. It happens because the model optimizes for plausible-sounding text, not verified truth. Learn it.
RLHF: Reinforcement Learning from Human Feedback: people rank model outputs to train a reward model, which is then used to steer the model toward helpful, aligned answers. Learn it.
Fine-tuning: Continuing to train a pretrained model on extra data to specialize it — versus prompting, which changes behavior with instructions alone. LoRA does this cheaply with small adapter weights. Learn it.
Parameter / weight: One of the model's learned numbers. Frontier models have billions; training is the process of setting them so predictions improve. Learn it.
GPU: A processor with thousands of parallel cores. The matrix multiplications inside an LLM are massively parallel, so wide GPU hardware runs them far faster than a CPU. Learn it.
Harness: The program that wraps a model and lets it act: it loops — read context, let the model propose a tool call, run the tool, feed the result back — until the task is done. Claude Code, Codex, and Cursor are harnesses. Learn it.
System prompt: The standing instructions placed at the very start of the context window — who the assistant is, its rules and tools. It's sent every turn, which is why it's the prime candidate for prompt caching. Learn it.
Context rot: The measured tendency of every model's accuracy to degrade as the context window fills — even on easy tasks. A short, focused prompt often beats the same answer buried in a huge one, so curating context beats stuffing it. Learn it.
Prompt caching: Reusing the processed form of a stable prompt prefix so you don't pay full price to re-send it. A cache read costs about 10% of a normal input token; it only hits if the prefix is unchanged, so keep stable content first. Learn it.
Skill: A packaged bit of expertise the agent loads on demand: only its one-line description sits in context until it's relevant, then its full instructions load (progressive disclosure) — capability without a permanent context cost. Learn it.
MCP (Model Context Protocol): An open standard for connecting a model/harness to external tools and data sources (files, APIs, databases) in a uniform way — connect once, use from any MCP-aware tool. Turns N×M custom integrations into N+M. Learn it.
Observability: Seeing inside a production LLM app: tracing each request (retrieval, prompt, model, tools) and logging inputs, outputs, tokens, latency, cost, errors, and user feedback. You can't fix what you can't see. Learn it.
Eval: A test set for an LLM app — inputs paired with expected answers or rubric criteria — scored on every change so you ship improvements, not regressions. Like unit tests for prompts. Learn it.
LLM-as-a-judge: Using a strong model to grade outputs at scale against a rubric (pointwise scores or pairwise comparisons). Scalable, but prone to biases (position, verbosity, self-preference) — calibrate it against human labels. Learn it.
Ground truth: The verified 'right answers' a dataset is judged against — built by people via clear guidelines, multiple labelers, and agreement checks. Evals and training are only as good as their ground truth. Learn it.
Data flywheel: The compounding loop where production traffic is logged, hard cases are curated and labeled, fed back into evals and training, yielding a better model and more usage. Proprietary production data becomes the moat. Learn it.
Prompt injection: An attack where untrusted content the model reads is treated as new instructions (the model sees instructions and data on one channel). The #1 LLM security risk and not fully solvable — guardrails reduce, not eliminate it. Learn it.
Tokenization (BPE): Splitting text into tokens — common sub-word chunks from a fixed vocabulary learned by an algorithm like Byte-Pair Encoding. The model reads token ids, not letters, which is why it miscounts characters and bills per token. Learn it.
Multimodal model: A model that handles more than text: images are cut into patches and audio into frames, each turned into a vector in the same space as text tokens, so one transformer attends across them together. Learn it.
Chain of thought: Tokens a model generates to 'work things out' before its final answer. Reasoning models are trained to do this; the thinking is just more generated tokens spent before answering. Learn it.
Test-time compute: Spending more compute per question at inference — thinking longer, or sampling many answers and picking the best — to raise accuracy. A third way to scale capability beyond parameters and training data. Learn it.
Speculative decoding: A speedup where a small, fast model drafts several tokens ahead and the big model verifies them in one pass, keeping the agreed prefix — same output, fewer slow steps. Learn it.
Distillation: Training a small 'student' model to imitate a big 'teacher' so it keeps most of the skill at a fraction of the size and cost. Most small, fast models you use are distilled. Learn it.
ReAct: The agent loop pattern of reason → act → observe: each turn the model writes a private thought, takes one action (a tool call), then reads the result — interleaving thinking and acting instead of answering in one shot. Learn it.
Autonomy levels: How much an agent does on its own — from suggesting, to asking before each step, to handing off a whole task. Higher isn't better: match it to how well-scoped the task is and how easily you can verify the result. Learn it.
Multi-agent (orchestrator & subagents): Splitting a job across agents: an orchestrator delegates sub-tasks to subagents, each with its own clean context, often in parallel. Powerful for wide search, but it multiplies cost and can fragment. Learn it.
Lethal trifecta: The dangerous combination of an agent having private-data access, exposure to untrusted content, and a way to send data out — together they enable data exfiltration via prompt injection. Remove any one leg to defuse it. Learn it.
Structured outputs: Constraining a model's output to a schema (e.g. JSON) by only allowing schema-valid next tokens, so it always parses. Valid shape still isn't the same as a correct value, so validate the contents too. Learn it.

All lessons