The production AI glossary.
Plain-English definitions for the 43 terms that come up most in real production engagements. Free to bookmark, free to share, free to copy into your internal docs.
Agent
Agents · An LLM that decides what to do next from a set of available tools.
In production AI, an agent is a system where the language model is given a task and a set of tools (functions it can call), and it iterates: model picks a tool, tool runs, result feeds back into the next call, repeat until the task is done. Agents are the most failure-prone class of AI software because each iteration can compound an earlier mistake — but also the most economically valuable when they work, because they replace whole workflows. Constrain the action space, persist intermediate state, and plan for human-in-the-loop on day one.
Related: Tool use · Multi-agent system · temporal · state machine
BM25
Retrieval · A classical keyword-based search algorithm. Often beats pure vector search.
BM25 (Best Matching 25) is a probabilistic ranking function from the 1990s that scores documents by how well their terms match a query, weighted by term frequency and inverse document frequency. In modern RAG systems, BM25 is paired with dense vector search (the "hybrid" pattern) because each catches what the other misses — BM25 nails exact terms and proper nouns; dense retrieval nails semantic similarity. Pure vector search loses to hybrid in almost every benchmark.
Related: Hybrid search · RAG · reranking
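The BM25 score itself is simple enough to sketch in stdlib Python. The function name, toy documents, and the k1/b defaults below are illustrative, not any particular library's API:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each pre-tokenized doc against a pre-tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [["exact", "term", "match"],
        ["semantic", "similarity"],
        ["exact", "exact", "terms"]]
print(bm25_scores(["exact"], docs))
```

Note how exact presence of the query term dominates: the purely "semantic" document scores zero, which is exactly the gap dense retrieval covers in a hybrid setup.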
Cached input
Cost · A token sent to the model that the provider has already processed, billed at a discount.
Cached input is the billing category for prompt-prefix tokens the provider has already processed. When a request reuses a prefix it has seen before (system prompt, retrieved-doc context, conversation history), the provider reads the stored KV-state for that prefix instead of recomputing it, and bills those tokens at a fraction of the regular input rate, typically 10-25%. See Prompt caching for how to structure prompts so the stable parts come first and the hit rate stays high.
Related: Prompt caching · Tokens · Prompt engineering
Chunking
Retrieval · Splitting documents into pieces small enough to embed and retrieve.
Embeddings have a token limit (typically 512-8k), so source documents have to be split into chunks before indexing. Naive chunking (every N characters) destroys meaning at the boundaries. Document-aware chunking respects natural structure — paragraphs, sections, code blocks, table rows — and preserves enough context that a retrieved chunk makes sense in isolation. Most RAG quality problems are chunking problems in disguise.
Related: RAG · embeddings · overlap
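A minimal document-aware chunker, assuming paragraph boundaries are the structure worth preserving. The function name and character budget are illustrative:

```python
def chunk_paragraphs(text, max_chars=1000):
    """Greedy document-aware chunking: split on blank lines and pack whole
    paragraphs into chunks, so no paragraph is ever cut mid-sentence."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paras:
        # start a new chunk when adding this paragraph would bust the budget
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nSecond paragraph with more detail.\n\nThird."
print(chunk_paragraphs(doc, max_chars=40))
```

A production version would also respect headings, code blocks, and table rows, and would measure size in tokens rather than characters.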
Citation
Retrieval · A reference back to the source chunk that produced an answer.
In production RAG, every claim the model makes should be traceable to the document chunks that informed it. Citations are surfaced to the user (clickable source links), used by evaluators to verify correctness, and used by you to debug retrieval failures. Without citations, you cannot tell whether a wrong answer came from bad retrieval, bad ranking, or bad generation.
Related: RAG · evals · Hallucination
Context window
Models · The maximum number of tokens a model can process in one request.
Frontier models in 2026 commonly support 200k-2M token context windows. Larger windows do not eliminate the need for retrieval — quality of long-context recall degrades meaningfully past 50-100k tokens, costs scale linearly, and latency scales worse than linearly. Treat context window as a soft constraint, not a license to dump everything.
Cross-encoder reranker
Retrieval · A small model that re-scores retrieved chunks for relevance to the query.
After initial retrieval returns 20-50 candidate chunks, a cross-encoder reranker scores each (query, chunk) pair jointly and surfaces the top 3-5 to send to the LLM. Rerankers are slower than vector search but dramatically more accurate, because they let the model attend to query and document together. Cohere Rerank, Jina Rerank, and BGE Rerank are common choices. A reranker is the single highest-leverage retrieval upgrade.
Related: RAG · Hybrid search · embeddings
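The pipeline shape is independent of the model. A sketch with a pluggable scorer; in production `score_fn` would wrap a real cross-encoder (sentence-transformers' `CrossEncoder`, for example, scores query/chunk pairs in batches), while the word-overlap scorer here is purely illustrative:

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Jointly score each (query, chunk) pair and keep the best top_k."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:top_k]

def overlap_score(q, c):
    """Toy stand-in for a cross-encoder: count of shared words."""
    return len(set(q.lower().split()) & set(c.lower().split()))

chunks = ["our refund policy", "shipping times", "policy archive"]
print(rerank("refund policy", chunks, overlap_score, top_k=2))
```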
DPO
Training · Direct Preference Optimization. A simpler alternative to RLHF for tuning model behavior.
DPO (Direct Preference Optimization) is a fine-tuning technique that trains a model to prefer one output over another using paired preference data (chosen vs. rejected). Unlike full RLHF it does not need a separate reward model or reinforcement-learning loop. In practice DPO is easier to set up than RLHF and almost as effective for most preference-tuning use cases.
Related: Fine-tuning · RLHF · preference data
Embedding
Retrieval · A numeric vector representation of text used for similarity search.
An embedding is a fixed-length vector of floating-point numbers (typically 768-3072 dimensions) produced by an embedding model. Two pieces of text with similar meaning produce vectors that are close in that high-dimensional space. Embeddings are the foundation of semantic search and RAG — you embed your documents once, embed the user query at runtime, and find the closest matches. Embedding model selection matters: text-embedding-3-large, voyage-3, and nomic-embed-text-v1.5 are common 2026 choices.
Related: RAG · Vector database · cosine similarity
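"Close in that high-dimensional space" usually means high cosine similarity. A minimal sketch, with toy 3-dimensional vectors standing in for real 768-3072-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means
    identical direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]      # toy stand-ins for model-produced vectors
doc_vec = [1.0, 0.1, 0.9]
print(cosine_similarity(query_vec, doc_vec))
```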
Eval / Evaluation harness
Evaluation · A test suite for AI applications. The most important piece of infrastructure to build first.
An eval harness scores your AI system's outputs against a known dataset on every change. It has four parts: a dataset of inputs that mirrors real production traffic, a scoring mechanism per input (deterministic, LLM-as-judge, or human review), a reporting layer non-engineers can read, and CI integration that blocks shipping regressions. Tools like LangSmith, Braintrust, and Phoenix make this dramatically easier than rolling your own.
Related: LLM-as-judge · regression · Production AI
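The four parts reduce to a small loop. A deliberately toy sketch, with a stubbed system and a deterministic exact-match scorer standing in for real scoring mechanisms:

```python
def run_evals(dataset, system, scorers):
    """Run the system over a fixed dataset, score every output, and report
    a pass rate that CI can gate on."""
    results = []
    for case in dataset:
        output = system(case["input"])
        passed = all(score(case, output) for score in scorers)
        results.append({"input": case["input"], "output": output, "passed": passed})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "3+3", "expected": "6"},
]
broken_system = lambda prompt: "4"                    # stub: always answers "4"
exact_match = lambda case, output: output == case["expected"]

pass_rate, results = run_evals(dataset, broken_system, [exact_match])
print(pass_rate)  # 0.5: the failure on the second case is visible, not silent
```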
Fine-tuning
Training · Updating a model's weights on your own data so it specializes in your task.
Fine-tuning takes a pre-trained model and continues training it on your task-specific data. In 2026 this almost always means parameter-efficient methods — LoRA or QLoRA — that update a small fraction of weights and ship as adapters rather than full models. Full fine-tunes are reserved for cases where you need deep adaptation (specialty domains, languages, or compliance). Distillation from a frontier model to a smaller production-target is the most common production fine-tuning pattern.
Related: LoRA / QLoRA · distillation · DPO
Frontier model
Models · The current generation of largest, most capable models.
Frontier refers to the top-tier models from each major lab — Claude Opus, GPT, Gemini Ultra. They have the best reasoning, longest context handling, and highest quality on the hardest evals. They are also the slowest and most expensive (10-30× the price of mid-tier). Reserve frontier for the hardest 30% of your workload; route the easier majority to mid-tier or small models.
Related: Model routing · Mid-tier model · Small model
Function calling
Agents · The model returns structured JSON to invoke a tool you defined.
Function calling (also called tool use) is the mechanism by which an LLM produces a structured request to invoke an external function. The model returns JSON matching a schema you provide ({"name": "search_orders", "arguments": {"customer_id": "..."}}), your code executes that function, and the result is fed back to the model. This is the foundation of agentic systems and is the right abstraction for almost any "the LLM needs to do something in the real world" use case.
Related: Agent · Tool use · JSON mode / Structured output
Graph RAG
Retrieval · RAG that uses a knowledge graph as the index instead of (or alongside) vector search.
Graph RAG constructs a knowledge graph from your documents — entities, relationships, attributes — and then traverses the graph at query time. This is dramatically more powerful than vector RAG for multi-hop reasoning ("what companies has Person X founded that were later acquired by Y?") because it can follow chains of relationships rather than relying on text similarity. Microsoft's GraphRAG paper popularized the pattern; LangChain, LlamaIndex, and several specialized tools support it.
Related: RAG · knowledge graph · entity extraction
Hallucination
Concepts · When a model generates plausible-sounding text that is factually wrong.
Hallucinations are the failure mode that defines the gap between LLM demos and production systems. They are not bugs to be fixed once — they are statistical behaviors to be measured continuously and constrained by design. The standard mitigations are: ground answers in retrieved context (RAG), require citations, run an LLM-as-judge over outputs, constrain output schemas, and lower temperature. Eliminating hallucinations entirely is not currently possible; bounding their cost is.
Related: RAG · evals · Citation · Temperature
Hybrid search
Retrieval · Combining keyword (BM25) and dense (vector) search and merging the results.
Hybrid search runs both BM25 and dense vector retrieval in parallel and merges the results, usually with reciprocal rank fusion (RRF). It outperforms either method alone in essentially every benchmark we have measured because the two methods catch different kinds of matches: BM25 wins on exact terms and rare entities, dense wins on semantic similarity and paraphrasing. If you are doing pure vector search in production, switching to hybrid is the easiest meaningful upgrade.
Related: BM25 · embeddings · reciprocal rank fusion
Inference
Models · Running the model to generate output. Distinct from training.
Inference is what you pay for in production: each call to the model API, each token in and out. Inference cost dominates the total cost of an AI application at scale, which is why prompt caching, model routing, output-length tightening, and batch APIs are all critical optimizations.
Related: training · cost · Tokens
JSON mode / Structured output
Models · A constrained decoding mode where the model is forced to produce valid JSON matching a schema.
Structured output (JSON mode in OpenAI, tool use in Claude, controlled generation in Gemini) constrains the model to return JSON matching a schema you provide. This eliminates the parse-the-output reliability problem that plagues early LLM applications. Always use structured output when the downstream consumer is code, not a human.
Related: Function calling · json-schema · reliability
LLM-as-judge
Evaluation · Using a model to evaluate the outputs of another model.
LLM-as-judge uses a (usually frontier) model to score the outputs of your production model against a rubric you provide. It is the standard pattern for evaluating open-ended generation tasks where deterministic scoring is impossible. Calibrate the judge against human ratings on a sample before trusting its scores at scale, and watch for known biases (judges prefer longer outputs, prefer first-listed options in pairwise comparisons).
Related: evals · rubric · pairwise eval
LoRA / QLoRA
Training · Parameter-efficient fine-tuning methods that only update a small fraction of weights.
LoRA (Low-Rank Adaptation) trains small adapter matrices that are added to a frozen base model, instead of updating all the model's weights. Result: vastly cheaper training, smaller artifacts to deploy (megabytes instead of gigabytes), and the ability to swap adapters in/out for different tasks on the same base model. QLoRA is LoRA on a quantized (4-bit) base model, which makes fine-tuning large open-weight models possible on a single GPU.
Related: Fine-tuning · quantization · adapter
Mid-tier model
Models · The tier below frontier — Claude Sonnet, GPT-5 Mini, Gemini 2.5 Pro.
Mid-tier models are the workhorse of production AI in 2026. They handle 80% of real workloads at a fraction of frontier cost and acceptable latency. Default to mid-tier; promote individual calls to frontier when evals show a quality gap.
Related: Frontier model · Model routing · cost
Model routing
Models · Sending different requests to different model tiers based on difficulty or stakes.
Model routing classifies each request and sends it to the smallest model that can handle it. Easy classification → small model. Standard chat → mid-tier. Hard reasoning, novel reasoning, multi-step planning → frontier. Done well, routing cuts production AI bills 50-70% with no quality loss. The router is itself usually a small model or a deterministic classifier.
Related: Frontier model · Mid-tier model · Small model
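A deterministic router can be as plain as a rule table. The tier names and rules below are illustrative, not a recommended taxonomy:

```python
def route(request):
    """Toy deterministic router: pick the cheapest tier that can handle
    the request, escalating only on signals of difficulty."""
    if request["task"] == "classify":
        return "small"        # easy classification -> small model
    if request.get("needs_planning") or request.get("novel"):
        return "frontier"     # hard or novel reasoning -> frontier
    return "mid"              # everything else -> mid-tier default

print(route({"task": "chat"}))
print(route({"task": "chat", "needs_planning": True}))
```

In practice the signals come from a small classifier model or from request metadata, but the escalation logic keeps this shape.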
MoE (Mixture of Experts)
Models · A model architecture where only a subset of parameters activate per token.
Mixture-of-Experts models contain many "expert" sub-networks but route each token through only a small subset, so the active parameter count per inference is much smaller than the total parameter count. This makes very large models tractable to serve. Most frontier 2026 models use MoE under the hood.
Related: model architecture · Inference
Multi-agent system
Agents · Multiple specialized agents collaborating to solve a task.
Instead of one agent with twelve tools, a multi-agent system splits the work into multiple smaller agents that each have a tight remit and communicate through a shared coordinator. This is easier to reason about, easier to debug, and usually cheaper to run (each sub-agent can use a smaller model). The pattern works best when sub-tasks are clearly bounded.
Related: Agent · orchestration · coordinator
Multi-modal
Models · Models that accept inputs beyond text — images, audio, video, PDF.
Modern frontier models accept images, audio, and (increasingly) video as inputs alongside text. In production this enables document understanding (no separate OCR), visual QA, voice agents, and image-grounded chat. Each modality has its own pricing and latency profile; budget accordingly.
Related: vision · Whisper · Voice agent
Observability
Infrastructure · Recording prompt/response pairs, latency, cost, and quality signals from production.
AI observability records every prompt, every response, latency, token cost, and user feedback signal — usually with PII redaction. The goal is to detect silent quality degradation before users do, debug specific failures, and feed real production traffic back into the eval harness for continuous evaluation. LangSmith, Braintrust, and Phoenix are common choices in 2026.
Related: evals · tracing · monitoring
OpenAI compatibility
Infrastructure · A common HTTP API shape that many providers implement so client SDKs are interchangeable.
The OpenAI chat completions API is now the de facto standard for LLM HTTP APIs. Anthropic, Google, Mistral, Together, Groq, and most open-weight inference providers expose an OpenAI-compatible endpoint. This makes provider-agnostic abstractions straightforward (use one client, swap baseURL) but does not eliminate provider-specific features (Claude prompt caching, OpenAI predicted outputs, Gemini grounding).
Related: provider abstraction · SDK
Pgvector
Infrastructure · A Postgres extension that adds vector data types and similarity search.
pgvector turns Postgres into a vector database. For most production AI applications below the tens-of-millions-of-vectors scale, pgvector + a regular Postgres is the right choice — you keep ACID guarantees, your existing tooling, and one fewer system to operate. Above ~50M vectors or when sub-100ms latency is critical, dedicated vector DBs (Qdrant, Pinecone, Weaviate, Turbopuffer) start to pay off.
Related: Vector database · postgres · embeddings
Production AI
Concepts · AI systems that run reliably under real load with real users — distinct from prototypes.
Production AI is the engineering discipline of taking LLM-based systems from "works in the playground" to "runs at 3am on a Sunday without paging anyone." The core requirements: continuous evaluation, observability that detects silent degradation, bounded cost, fallbacks for model failure, and human-in-the-loop on high-stakes actions. The principles are the same as production software engineering applied to a less deterministic substrate.
Related: evals · Observability · fallback
Prompt caching
Cost · Provider-side caching of prompt prefixes for cheaper subsequent requests.
When you send the same prompt prefix repeatedly (system prompts, retrieved context, conversation history), providers can cache the model's internal state for that prefix and bill subsequent requests at 10-25% of the regular input rate. Aggressive prompt caching is the highest-leverage cost optimization in production AI: 60-85% input cost reduction with zero behavior change. Requires structuring prompts so stable parts come first.
Related: Cached input · cost · Prompt engineering
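The arithmetic behind those savings, assuming an illustrative $3/Mtok input rate and a 10% cache-read multiplier:

```python
def effective_input_cost(total_tokens, cached_fraction, rate_per_mtok,
                         cache_multiplier=0.10):
    """Blended input cost when some fraction of tokens hit the provider
    cache and are billed at cache_multiplier times the regular rate."""
    cached = total_tokens * cached_fraction
    fresh = total_tokens - cached
    return (fresh + cached * cache_multiplier) * rate_per_mtok / 1_000_000

full = effective_input_cost(100_000, 0.0, 3.0)     # no caching
warm = effective_input_cost(100_000, 0.8, 3.0)     # 80% of prefix cached
print(full, warm, 1 - warm / full)                 # savings fraction
```

With an 80% hit rate the blended cost drops 72%, squarely inside the 60-85% range quoted above.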
Prompt engineering
Concepts · The practice of designing prompts to elicit reliable, high-quality outputs.
Prompt engineering is undervalued in 2026 — partly because the term has been used to describe both serious work (system prompt design, output structuring, few-shot example curation, constrained decoding) and trivial work ("let me try rewording it"). The serious version is real engineering with measurable outcomes against an eval harness; the trivial version is folklore.
Related: evals · few-shot · System prompt
RAG
Retrieval · Retrieval-Augmented Generation. The dominant architecture for grounded LLM applications.
RAG fetches relevant documents at query time and inserts them into the model's context, so the model answers from current, authoritative sources rather than its training data. Almost every production LLM application is some flavor of RAG. The hard parts are not the LLM call — they are chunking, embedding choice, hybrid search, reranking, and citation tracking. Most failed AI applications are failed retrieval applications wearing an LLM costume.
Related: Hybrid search · reranking · Graph RAG · embeddings
Reciprocal Rank Fusion (RRF)
Retrieval · A simple algorithm for merging multiple ranked result lists.
RRF combines results from multiple retrievers (e.g., BM25 and dense vector search) by summing 1/(rank + k) for each document across the lists. It is parameter-free (k=60 is a fine default), needs no training, and reliably outperforms more complex fusion methods. The standard merge for hybrid search.
Related: Hybrid search · BM25 · reranking
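The whole algorithm fits in a few lines; `rrf_merge` and the toy rankings are illustrative:

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: each document's score is the sum of
    1/(rank + k) over every list it appears in, ranks starting at 1."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d1", "d2", "d3"]    # keyword ranking
dense_results = ["d3", "d1", "d4"]   # vector ranking
print(rrf_merge([bm25_results, dense_results]))
```

Documents ranked well by both lists (d1, d3) float to the top, which is the whole point of the fusion.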
RLHF
Training · Reinforcement Learning from Human Feedback. The classic preference-tuning method.
RLHF trains a reward model on human preference pairs, then uses reinforcement learning (PPO) to update the LLM to maximize that reward. It is what made early ChatGPT and Claude useful. In 2026, DPO has largely replaced RLHF for ease of setup, but RLHF still wins at the highest scale and for the most subtle preference targets.
Related: DPO · Fine-tuning · preference data
Small model
Models · A small, fast model — sub-10B parameters typically — for cheap, low-stakes tasks.
Small language models (Haiku, Nano, Flash, Phi, Llama 3.2 Small) handle classification, routing, lightweight extraction, and batched scoring at a fraction of frontier cost and latency. The right pick for the long tail of an application that does not need top-tier reasoning. A model router that delegates appropriately to SLMs is often the single largest cost optimization in a mature application.
Related: Model routing · Mid-tier model · cost
Streaming
Models · Returning tokens to the client as they are generated, rather than waiting for completion.
Streaming returns each generated token to the client as it is produced. This dramatically improves perceived latency for chat-style interfaces — users see the first word in 200ms instead of waiting 4 seconds for the whole response. SSE (Server-Sent Events) is the standard transport. Streaming has no cost impact but does complicate client-side error handling (the connection can fail mid-response).
Related: SSE · latency · UX
System prompt
Models · The instructions that define the model's persona, role, and constraints for the conversation.
The system prompt is the first message in a conversation, distinct from the user's messages, that tells the model how to behave. Good system prompts are precise, specify output format, include examples of edge cases, and acknowledge the model's tools. They should be cached aggressively — they are the most stable part of any conversation.
Related: Prompt engineering · Prompt caching · few-shot
Temperature
Models · A sampling parameter that controls randomness — lower is more deterministic.
Temperature scales the model's output distribution before sampling. Temperature 0 is deterministic (always pick the most likely token); 0.7 is the typical default for chat; 1+ produces creative, surprising outputs. For production tasks where reliability matters (extraction, classification, structured output), use temperature 0 or 0.1. For creative writing or brainstorming, raise it.
Related: sampling · top-p · reliability
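Mechanically, temperature divides the logits before the softmax. A sketch with toy logits:

```python
import math

def sample_distribution(logits, temperature):
    """Softmax over logits scaled by 1/temperature. Lower temperature
    concentrates probability on the most likely token."""
    if temperature == 0:
        # greedy decoding: all probability mass on the argmax
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                   # toy scores for three tokens
print(sample_distribution(logits, 0.2))    # sharply peaked
print(sample_distribution(logits, 1.0))    # broader spread
```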
Tokens
Models · The atomic units of text a model processes. Roughly 0.75 words each in English.
Models do not see characters or words — they see tokens, which are statistical sub-word units learned during training. "Hello, world" is about 3 tokens. Pricing is per-million-tokens of input and output (output costs ~5× input). Token counts are how you reason about context window limits and cost. Provider tokenizers differ slightly, so token counts are model-family-specific.
Related: Context window · cost · tokenizer
Tool use
Agents · See: function calling. Same concept, slightly different naming.
Anthropic uses "tool use" where OpenAI uses "function calling" — same concept. The model returns a structured request to invoke a function you defined; your code runs that function; the result feeds back into the model. The foundation of agentic systems.
Related: Function calling · Agent · JSON mode / Structured output
Vector database
Infrastructure · A database optimized for storing embeddings and finding nearest neighbors.
Vector databases index embeddings using approximate nearest-neighbor algorithms (HNSW, IVF) so similarity search stays fast at scale. Popular 2026 choices: pgvector (Postgres extension, default for most), Qdrant, Pinecone, Weaviate, Turbopuffer. Selection criteria: throughput, latency, hybrid-search support, multi-tenancy model, and cost-per-vector at your expected scale.
Related: embeddings · Pgvector · hnsw
Voice agent
Agents · A real-time conversational agent that speaks and listens.
Voice agents combine speech-to-text (Whisper, Deepgram), an LLM, and text-to-speech (ElevenLabs, Cartesia, OpenAI Voice) into a real-time loop. The hard parts are latency (sub-500ms end-to-end is the bar), interruption handling (the user starts talking mid-response), and turn detection. Frameworks like LiveKit Agents, Pipecat, and Vapi handle the orchestration; the LLM and prompt design are still your job.
Related: Multi-modal · Whisper · tts
Whisper
Models · OpenAI's open-source speech-to-text model. The default for transcription.
Whisper is the de facto open-source speech-to-text model. Multiple sizes (tiny → large), multilingual, available via OpenAI API or self-hosted. For production voice applications, Deepgram and AssemblyAI are common alternatives that beat Whisper on latency for streaming use cases.
Related: Voice agent · speech-to-text · Multi-modal