Reference

The production AI glossary.

Plain-English definitions for the 43 terms that come up most in real production engagements. Free to bookmark, free to share, free to copy into your internal docs.

  • Agent

    Agents

    An LLM that decides what to do next from a set of available tools.

    In production AI, an agent is a system where the language model is given a task and a set of tools (functions it can call), and it iterates: model picks a tool, tool runs, result feeds back into the next call, repeat until the task is done. Agents are the most failure-prone class of AI software because each iteration can compound an earlier mistake — but also the most economically valuable when they work, because they replace whole workflows. Constrain the action space, persist intermediate state, and plan for human-in-the-loop on day one.
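    A minimal sketch of that loop, assuming an OpenAI-style chat client and a list of tool schemas; run_tool is a hypothetical dispatcher that maps tool names to your own functions. Real agents add retries, state persistence, and tighter guards.

      # Minimal agent loop: the model picks tools until it stops asking for them.
      import json
      from openai import OpenAI

      client = OpenAI()

      def run_agent(task: str, tools: list, max_steps: int = 10) -> str:
          messages = [{"role": "user", "content": task}]
          for _ in range(max_steps):   # hard step cap: constrain the action space
              resp = client.chat.completions.create(
                  model="gpt-4o", messages=messages, tools=tools)
              msg = resp.choices[0].message
              if not msg.tool_calls:   # no tool requested: the task is done
                  return msg.content
              messages.append(msg)
              for call in msg.tool_calls:   # execute each requested tool
                  result = run_tool(call.function.name,          # hypothetical dispatcher
                                    json.loads(call.function.arguments))
                  messages.append({"role": "tool",
                                   "tool_call_id": call.id,
                                   "content": json.dumps(result)})
          raise RuntimeError("agent exceeded max_steps")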

    Related: Tool use · Multi-agent system · temporal · state-machine

  • BM25

    Retrieval

    A classical keyword-based search algorithm. Often beats pure vector search.

    BM25 (Best Matching 25) is a probabilistic ranking function from the 1990s that scores documents by how well their terms match a query, weighted by term frequency and inverse document frequency. In modern RAG systems, BM25 is paired with dense vector search (the "hybrid" pattern) because each catches what the other misses — BM25 nails exact terms and proper nouns; dense retrieval nails semantic similarity. Pure vector search loses to hybrid in almost every benchmark.
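    A toy illustration using the rank_bm25 package, one of several open-source implementations. Production systems usually get BM25 from their search engine (Elasticsearch, OpenSearch, Postgres full-text) rather than scoring in-process; whitespace tokenization here is a simplification.

      from rank_bm25 import BM25Okapi

      docs = ["error code E1042 on checkout",
              "how refunds are processed",
              "checkout flow architecture"]
      bm25 = BM25Okapi([d.split() for d in docs])

      query = "E1042 checkout".split()
      print(bm25.get_scores(query))   # the exact term "E1042" dominates the ranking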

    Related: Hybrid search · RAG · reranking

  • Cached input

    Models

    A token sent to the model that the provider has already processed, billed at a discount.

    When you send the same prompt prefix repeatedly (system prompt, retrieved-doc context, conversation history), the provider can cache its KV-state internally and bill those tokens at a fraction of the regular input rate — typically 10-25%. Cached input is the billing side of prompt caching; see that entry for how to structure prompts so the stable parts come first.
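    Back-of-envelope arithmetic, with illustrative prices (say $3.00 per million input tokens regular, and 10% of that when cached):

      # 20k-token stable prefix + 500 fresh tokens per call, 10k calls per day.
      prefix, fresh, calls = 20_000, 500, 10_000
      regular, cached = 3.00 / 1e6, 0.30 / 1e6   # dollars per input token
      without_cache = (prefix + fresh) * calls * regular
      with_cache = (prefix * cached + fresh * regular) * calls
      print(without_cache, with_cache)   # ~$615/day vs ~$75/day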

    Related: Prompt caching · Tokens · Prompt engineering

  • Chunking

    Retrieval

    Splitting documents into pieces small enough to embed and retrieve.

    Embeddings have a token limit (typically 512-8k), so source documents have to be split into chunks before indexing. Naive chunking (every N characters) destroys meaning at the boundaries. Document-aware chunking respects natural structure — paragraphs, sections, code blocks, table rows — and preserves enough context that a retrieved chunk makes sense in isolation. Most RAG quality problems are chunking problems in disguise.
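    A sketch of the document-aware version at its simplest: split on paragraph boundaries, then pack paragraphs into chunks under a size budget. Real pipelines also respect headings, code blocks, and tables, and add overlap between chunks.

      def chunk(text: str, max_chars: int = 1500) -> list[str]:
          chunks, current = [], ""
          for para in text.split("\n\n"):   # paragraph boundaries, never mid-sentence
              if current and len(current) + len(para) > max_chars:
                  chunks.append(current.strip())
                  current = ""
              current += para + "\n\n"
          if current.strip():
              chunks.append(current.strip())
          return chunks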

    Related: RAG · embeddings · overlap

  • Citation

    Retrieval

    A reference back to the source chunk that produced an answer.

    In production RAG, every claim the model makes should be traceable to the document chunks that informed it. Citations are surfaced to the user (clickable source links), used by evaluators to verify correctness, and used by you to debug retrieval failures. Without citations, you cannot tell whether a wrong answer came from bad retrieval, bad ranking, or bad generation.
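    One common shape, sketched as a response schema (field names are illustrative): make the model return chunk ids alongside the answer, so every citation can be checked in code against what was actually retrieved.

      CITED_ANSWER_SCHEMA = {
          "type": "object",
          "properties": {
              "answer": {"type": "string"},
              "citations": {              # chunk ids from the retrieval step
                  "type": "array",
                  "items": {"type": "string"},
              },
          },
          "required": ["answer", "citations"],
      }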

    Related: RAG · evals · Hallucination

  • Context window

    Models

    The maximum number of tokens a model can process in one request.

    Frontier models in 2026 commonly support 200k-2M token context windows. Larger windows do not eliminate the need for retrieval — quality of long-context recall degrades meaningfully past 50-100k tokens, costs scale linearly, and latency scales worse than linearly. Treat context window as a soft constraint, not a license to dump everything.

    Related: Tokens · RAG · attention

  • Reranker

    Retrieval

    A small model that re-scores retrieved chunks for relevance to the query.

    After initial retrieval returns 20-50 candidate chunks, a cross-encoder reranker scores each (query, chunk) pair jointly and surfaces the top 3-5 to send to the LLM. Rerankers are slower than vector search but dramatically more accurate, because they let the model attend to query and document together. Cohere Rerank, Jina Rerank, and BGE Rerank are common choices. A reranker is the single highest-leverage retrieval upgrade.
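    A sketch with the sentence-transformers CrossEncoder class; the checkpoint named is a common public one, not a recommendation, and hosted rerank APIs follow the same score-and-sort shape.

      from sentence_transformers import CrossEncoder

      reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

      def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
          # score each (query, chunk) pair jointly, then keep the best
          scores = reranker.predict([(query, c) for c in chunks])
          ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
          return [c for c, _ in ranked[:top_k]]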

    Related: RAG · Hybrid search · embeddings

  • DPO

    Training

    Direct Preference Optimization. A simpler alternative to RLHF for tuning model behavior.

    DPO (Direct Preference Optimization) is a fine-tuning technique that trains a model to prefer one output over another using paired preference data (chosen vs. rejected). Unlike full RLHF it does not need a separate reward model or reinforcement-learning loop. In practice DPO is easier to set up than RLHF and almost as effective for most preference-tuning use cases.
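    A minimal sketch with the trl library; exact argument names vary across trl versions, and the model and dataset ids are illustrative. The training data needs "prompt", "chosen", and "rejected" columns.

      from datasets import load_dataset
      from transformers import AutoModelForCausalLM, AutoTokenizer
      from trl import DPOConfig, DPOTrainer

      model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
      tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
      dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

      trainer = DPOTrainer(
          model=model,
          args=DPOConfig(output_dir="dpo-out", beta=0.1),   # beta: how far from the base model
          train_dataset=dataset,
          processing_class=tokenizer,
      )
      trainer.train()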

    Related: Fine-tuning · RLHF · preference-data

  • Embedding

    Retrieval

    A numeric vector representation of text used for similarity search.

    An embedding is a fixed-length vector of floating-point numbers (typically 768-3072 dimensions) produced by an embedding model. Two pieces of text with similar meaning produce vectors that are close in that high-dimensional space. Embeddings are the foundation of semantic search and RAG — you embed your documents once, embed the user query at runtime, and find the closest matches. Embedding model selection matters: text-embedding-3-large, voyage-3, and nomic-embed-text-v1.5 are common 2026 choices.
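    A minimal sketch using the OpenAI embeddings endpoint with one of the models named above. These vectors come unit-normalized, so cosine similarity reduces to a dot product.

      from openai import OpenAI

      client = OpenAI()
      resp = client.embeddings.create(
          model="text-embedding-3-large",
          input=["reset my password", "I can't log in"])
      a, b = (d.embedding for d in resp.data)
      similarity = sum(x * y for x, y in zip(a, b))   # dot product on unit vectors
      print(similarity)   # closer to 1.0 means more semantically similar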

    Related: RAG · Vector database · cosine-similarity

  • Eval harness

    Evaluation

    A test suite for AI applications. The most important piece of infrastructure to build first.

    An eval harness scores your AI system's outputs against a known dataset on every change. It has four parts: a dataset of inputs that mirrors real production traffic, a scoring mechanism per input (deterministic, LLM-as-judge, or human review), a reporting layer non-engineers can read, and CI integration that blocks shipping regressions. Tools like LangSmith, Braintrust, and Phoenix make this dramatically easier than rolling your own.
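    The shape, framework-free, for concreteness. The exact-match scoring here only works for closed-ended tasks; swap in an LLM-as-judge for open-ended generation.

      import json

      THRESHOLD = 0.90   # block shipping below this pass rate

      def run_evals(system, dataset_path: str) -> float:
          # dataset: one JSON object per line with "input" and "expected" keys
          cases = [json.loads(line) for line in open(dataset_path)]
          passed = sum(
              1 for case in cases
              if system(case["input"]).strip() == case["expected"].strip())
          return passed / len(cases)

      # in CI: if run_evals(my_app, "evals/golden.jsonl") < THRESHOLD, fail the build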

    Related: LLM-as-judge · regression · Production AI

  • Fine-tuning

    Training

    Updating a model's weights on your own data so it specializes in your task.

    Fine-tuning takes a pre-trained model and continues training it on your task-specific data. In 2026 this almost always means parameter-efficient methods — LoRA or QLoRA — that update a small fraction of weights and ship as adapters rather than full models. Full fine-tunes are reserved for cases where you need deep adaptation (specialty domains, languages, or compliance). Distillation from a frontier model to a smaller production target is the most common production fine-tuning pattern.

    Related: LoRA / QLoRA · distillation · DPO

  • Frontier model

    Models

    The current generation of largest, most capable models.

    Frontier refers to the top-tier models from each major lab — Claude Opus, GPT-5, Gemini Ultra. They have the best reasoning, longest context handling, and highest quality on the hardest evals. They are also the slowest and most expensive (10-30× the price of mid-tier). Reserve frontier for the hardest 30% of your workload; route the easier majority to mid-tier or small models.

    Related: Model routing · mid-tier · small-model

  • Function calling

    Agents

    The model returns structured JSON to invoke a tool you defined.

    Function calling (also called tool use) is the mechanism by which an LLM produces a structured request to invoke an external function. The model returns JSON matching a schema you provide ({"name": "search_orders", "arguments": {"customer_id": "..."}}), your code executes that function, and the result is fed back to the model. This is the foundation of agentic systems and is the right abstraction for almost any "the LLM needs to do something in the real world" use case.
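    The search_orders example from above, written out as an OpenAI-style tool schema (Anthropic's shape differs slightly). The model fills in the arguments; your code runs the function.

      SEARCH_ORDERS_TOOL = {
          "type": "function",
          "function": {
              "name": "search_orders",
              "description": "Look up a customer's recent orders.",
              "parameters": {
                  "type": "object",
                  "properties": {
                      "customer_id": {"type": "string"},
                      "limit": {"type": "integer", "default": 5},
                  },
                  "required": ["customer_id"],
              },
          },
      }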

    Related: Agent · Tool use · JSON mode / Structured output

  • Graph RAG

    Retrieval

    RAG that uses a knowledge graph as the index instead of (or alongside) vector search.

    Graph RAG constructs a knowledge graph from your documents — entities, relationships, attributes — and then traverses the graph at query time. This is dramatically more powerful than vector RAG for multi-hop reasoning ("what companies has Person X founded that were later acquired by Y?") because it can follow chains of relationships rather than relying on text similarity. Microsoft's GraphRAG paper popularized the pattern; LangChain, LlamaIndex, and several specialized tools support it.

    Related: RAG · knowledge-graph · entity-extraction

  • Hallucination

    Models

    When a model generates plausible-sounding text that is factually wrong.

    Hallucinations are the failure mode that defines the gap between LLM demos and production systems. They are not bugs to be fixed once — they are statistical behaviors to be measured continuously and constrained by design. The standard mitigations are: ground answers in retrieved context (RAG), require citations, run an LLM-as-judge over outputs, constrain output schemas, and lower temperature. Eliminating hallucinations entirely is not currently possible; bounding their cost is.

    Related: RAG · evals · Citation · Temperature

  • Inference

    Models

    Running the model to generate output. Distinct from training.

    Inference is what you pay for in production: each call to the model API, each token in and out. Inference cost dominates the total cost of an AI application at scale, which is why prompt caching, model routing, output-length tightening, and batch APIs are all critical optimizations.

    Related: training · cost · Tokens

  • JSON mode / Structured output

    Models

    A constrained decoding mode where the model is forced to produce valid JSON matching a schema.

    Structured output (JSON mode in OpenAI, tool use in Claude, controlled generation in Gemini) constrains the model to return JSON matching a schema you provide. This eliminates the parse-the-output reliability problem that plagues early LLM applications. Always use structured output when the downstream consumer is code, not a human.
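    A sketch of one provider's spelling of the idea, OpenAI's json_schema response format; Claude reaches the same result through tool use.

      from openai import OpenAI

      client = OpenAI()
      resp = client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": "Extract: 'Ada, 37, Lisbon'"}],
          response_format={
              "type": "json_schema",
              "json_schema": {
                  "name": "person",
                  "strict": True,   # strict mode guarantees schema conformance
                  "schema": {
                      "type": "object",
                      "properties": {"name": {"type": "string"},
                                     "age": {"type": "integer"},
                                     "city": {"type": "string"}},
                      "required": ["name", "age", "city"],
                      "additionalProperties": False,
                  },
              },
          },
      )
      print(resp.choices[0].message.content)   # guaranteed parseable JSON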

    Related: Function calling · json-schema · reliability

  • LLM-as-judge

    Evaluation

    Using a model to evaluate the outputs of another model.

    LLM-as-judge uses a (usually frontier) model to score the outputs of your production model against a rubric you provide. It is the standard pattern for evaluating open-ended generation tasks where deterministic scoring is impossible. Calibrate the judge against human ratings on a sample before trusting its scores at scale, and watch for known biases (judges prefer longer outputs, prefer first-listed options in pairwise comparisons).
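    A minimal judge sketch; the rubric text and 1-5 scale are illustrative, and as noted, the scores mean nothing until calibrated against human ratings.

      from openai import OpenAI

      client = OpenAI()

      JUDGE_PROMPT = """Score the ANSWER from 1-5 against this rubric:
      - Grounded: every claim is supported by the provided CONTEXT.
      - Complete: the QUESTION is fully addressed.
      Return JSON: {{"score": <int>, "reason": "<one sentence>"}}

      QUESTION: {question}
      CONTEXT: {context}
      ANSWER: {answer}"""

      def judge(question: str, context: str, answer: str) -> str:
          resp = client.chat.completions.create(
              model="gpt-4o",   # use a stronger model than the one being judged
              messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                  question=question, context=context, answer=answer)}],
              temperature=0)    # deterministic scoring
          return resp.choices[0].message.content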

    Related: evals · rubric · pairwise-eval

  • LoRA / QLoRA

    Training

    Parameter-efficient fine-tuning methods that only update a small fraction of weights.

    LoRA (Low-Rank Adaptation) trains small adapter matrices that are added to a frozen base model, instead of updating all the model's weights. Result: vastly cheaper training, smaller artifacts to deploy (megabytes instead of gigabytes), and the ability to swap adapters in/out for different tasks on the same base model. QLoRA is LoRA on a quantized (4-bit) base model, which makes fine-tuning large open-weight models possible on a single GPU.
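    A sketch with the peft library: freeze the base model, attach small trainable adapters to the attention projections. The hyperparameters shown are common starting points and the model id is illustrative, not a recommendation.

      from peft import LoraConfig, get_peft_model
      from transformers import AutoModelForCausalLM

      base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
      config = LoraConfig(
          r=16,                                  # rank of the adapter matrices
          lora_alpha=32,                         # scaling factor
          target_modules=["q_proj", "v_proj"],   # which layers get adapters
          task_type="CAUSAL_LM",
      )
      model = get_peft_model(base, config)
      model.print_trainable_parameters()   # typically well under 1% of total weights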

    Related: Fine-tuning · quantization · adapter

  • Mid-tier model

    Models

    The tier just below frontier — Claude Sonnet, GPT-5 Mini, Gemini 2.5 Pro.

    Mid-tier models are the workhorse of production AI in 2026. They handle 80% of real workloads at a fraction of frontier cost and acceptable latency. Default to mid-tier; promote individual calls to frontier when evals show a quality gap.

    Related: Frontier model · Model routing · cost

  • Model routing

    Models

    Sending different requests to different model tiers based on difficulty or stakes.

    Model routing classifies each request and sends it to the smallest model that can handle it. Easy classification → small model. Standard chat → mid-tier. Hard reasoning, novel reasoning, multi-step planning → frontier. Done well, routing cuts production AI bills 50-70% with no quality loss. The router is itself usually a small model or a deterministic classifier.
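    A deliberately simple sketch of the deterministic-classifier version; real routers often train a small model on labeled traffic, but the shape is the same. The markers and tier names here are illustrative.

      def route(request: str) -> str:
          hard_markers = ("plan", "analyze", "step by step", "compare")
          if any(m in request.lower() for m in hard_markers):
              return "frontier"   # hard reasoning goes to the top tier
          if len(request) < 200:
              return "small"      # short, low-stakes requests
          return "mid"            # default workhorse tier

      MODEL_BY_TIER = {"small": "claude-haiku", "mid": "claude-sonnet",
                       "frontier": "claude-opus"}   # illustrative ids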

    Related: Frontier model · Mid-tier model · small-model

  • Mixture-of-Experts (MoE)

    Models

    A model architecture where only a subset of parameters activate per token.

    Mixture-of-Experts models contain many "expert" sub-networks but route each token through only a small subset, so the active parameter count per inference is much smaller than the total parameter count. This makes very large models tractable to serve. Most frontier 2026 models use MoE under the hood.

    Related: model-architecture · Inference

  • Multi-agent system

    Agents

    Multiple specialized agents collaborating to solve a task.

    Instead of one agent with twelve tools, a multi-agent system splits the work into multiple smaller agents that each have a tight remit and communicate through a shared coordinator. This is easier to reason about, easier to debug, and usually cheaper to run (each sub-agent can use a smaller model). The pattern works best when sub-tasks are clearly bounded.

    Related: Agent · orchestration · coordinator

  • Multi-modal

    Models

    Models that accept inputs beyond text — images, audio, video, PDF.

    Modern frontier models accept images, audio, and (increasingly) video as inputs alongside text. In production this enables document understanding (no separate OCR), visual QA, voice agents, and image-grounded chat. Each modality has its own pricing and latency profile; budget accordingly.
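    A sketch of image input using OpenAI-style content parts; other providers use a similar shape with different field names, and the URL is a placeholder.

      from openai import OpenAI

      client = OpenAI()
      resp = client.chat.completions.create(
          model="gpt-4o",
          messages=[{
              "role": "user",
              "content": [   # mixed text + image parts in one turn
                  {"type": "text", "text": "What does this invoice total to?"},
                  {"type": "image_url",
                   "image_url": {"url": "https://example.com/invoice.png"}},
              ],
          }],
      )
      print(resp.choices[0].message.content)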

    Related: vision · Whisper · Voice agent

  • Observability

    Infrastructure

    Recording prompt/response pairs, latency, cost, and quality signals from production.

    AI observability records every prompt, every response, latency, token cost, and user feedback signal — usually with PII redaction. The goal is to detect silent quality degradation before users do, debug specific failures, and feed real production traffic back into the eval harness for continuous evaluation. LangSmith, Braintrust, and Phoenix are common choices in 2026.

    Related: evals · tracing · monitoring

  • OpenAI compatibility

    Infrastructure

    A common HTTP API shape that many providers implement so client SDKs are interchangeable.

    The OpenAI chat completions API is now the de facto standard for LLM HTTP APIs. Anthropic, Google, Mistral, Together, Groq, and most open-weight inference providers expose an OpenAI-compatible endpoint. This makes provider-agnostic abstractions straightforward (use one client, swap baseURL) but does not eliminate provider-specific features (Claude prompt caching, OpenAI predicted outputs, Gemini grounding).
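    The pattern in practice: one client class, different base URLs. The endpoint and model id below are illustrative; check each provider's docs for current values.

      from openai import OpenAI

      openai_client = OpenAI()   # defaults to api.openai.com
      groq_client = OpenAI(base_url="https://api.groq.com/openai/v1",
                           api_key="GROQ_API_KEY")

      # identical call shape against either provider
      resp = groq_client.chat.completions.create(
          model="llama-3.3-70b-versatile",
          messages=[{"role": "user", "content": "hello"}])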

    Related: provider-abstraction · sdk

  • Pgvector

    Infrastructure

    A Postgres extension that adds vector data types and similarity search.

    pgvector turns Postgres into a vector database. For most production AI applications below the tens-of-millions-of-vectors scale, pgvector + a regular Postgres is the right choice — you keep ACID guarantees, your existing tooling, and one fewer system to operate. Above ~50M vectors or when sub-100ms latency is critical, dedicated vector DBs (Qdrant, Pinecone, Weaviate, Turbopuffer) start to pay off.
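    The whole flow in plain SQL, driven from Python with psycopg; assumes the extension is installed and 1536-dimension embeddings.

      import psycopg

      query_embedding = [0.1] * 1536   # stand-in for a real query embedding

      with psycopg.connect("dbname=app") as conn:
          conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
          conn.execute("""CREATE TABLE IF NOT EXISTS chunks (
                              id bigserial PRIMARY KEY,
                              body text,
                              embedding vector(1536))""")
          # <=> is cosine distance; smaller means more similar
          rows = conn.execute(
              "SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
              (str(query_embedding),)).fetchall()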

    Related: Vector database · postgres · embeddings

  • Production AI

    Infrastructure

    AI systems that run reliably under real load with real users — distinct from prototypes.

    Production AI is the engineering discipline of taking LLM-based systems from "works in the playground" to "runs at 3am on a Sunday without paging anyone." The core requirements: continuous evaluation, observability that detects silent degradation, bounded cost, fallbacks for model failure, and human-in-the-loop on high-stakes actions. The principles are the same as production software engineering applied to a less deterministic substrate.

    Related: evals · Observability · fallback

  • Prompt caching

    Infrastructure

    Provider-side caching of prompt prefixes for cheaper subsequent requests.

    When you send the same prompt prefix repeatedly (system prompts, retrieved context, conversation history), providers can cache the model's internal state for that prefix and bill subsequent requests at 10-25% of the regular input rate. Aggressive prompt caching is the highest-leverage cost optimization in production AI: 60-85% input cost reduction with zero behavior change. Requires structuring prompts so stable parts come first.
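    What this looks like with Anthropic's explicit cache markers (OpenAI caches repeated prefixes automatically); the model id is illustrative. The long, stable system prompt is marked cacheable and only the user turn changes per request.

      import anthropic

      SYSTEM = "...your long, stable system prompt..."   # identical on every call

      client = anthropic.Anthropic()
      resp = client.messages.create(
          model="claude-sonnet-4-5",
          max_tokens=1024,
          system=[{
              "type": "text",
              "text": SYSTEM,
              "cache_control": {"type": "ephemeral"},   # cache everything up to here
          }],
          messages=[{"role": "user", "content": "Today's question..."}],
      )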

    Related: Cached input · cost · Prompt engineering

  • Prompt engineering

    Prompting

    The practice of designing prompts to elicit reliable, high-quality outputs.

    Prompt engineering is undervalued in 2026 — partly because the term has been used to describe both serious work (system prompt design, output structuring, few-shot example curation, constrained decoding) and trivial work ("let me try rewording it"). The serious version is real engineering with measurable outcomes against an eval harness; the trivial version is folklore.

    Related: evals · few-shot · System prompt

  • RAG

    Retrieval

    Retrieval-Augmented Generation. The dominant architecture for grounded LLM applications.

    RAG fetches relevant documents at query time and inserts them into the model's context, so the model answers from current, authoritative sources rather than its training data. Almost every production LLM application is some flavor of RAG. The hard parts are not the LLM call — they are chunking, embedding choice, hybrid search, reranking, and citation tracking. Most failed AI applications are failed retrieval applications wearing an LLM costume.
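    The skeleton, with the hard parts abstracted away: retrieve() stands in for whatever your index provides (hybrid search plus a reranker), and llm() for the model call. Numbering the chunks is what makes citations checkable.

      def answer(question: str, retrieve, llm) -> str:
          chunks = retrieve(question, top_k=5)
          context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
          return llm(f"Answer using only the sources below. Cite like [0].\n\n"
                     f"{context}\n\nQuestion: {question}")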

    Related: Hybrid search · reranking · Graph RAG · embeddings

  • Reciprocal Rank Fusion (RRF)

    Retrieval

    A simple algorithm for merging multiple ranked result lists.

    RRF combines results from multiple retrievers (e.g., BM25 and dense vector search) by summing 1/(rank + k) for each document across the lists. It is parameter-free (k=60 is a fine default), needs no training, and reliably outperforms more complex fusion methods. The standard merge for hybrid search.
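    The whole algorithm in a few lines, exactly as described: sum 1/(rank + k) per document across the ranked lists.

      def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
          scores: dict[str, float] = {}
          for ranking in ranked_lists:
              for rank, doc_id in enumerate(ranking, start=1):
                  scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
          return sorted(scores, key=scores.get, reverse=True)

      # rrf([bm25_ids, vector_ids])[:10] gives the fused top-10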

    Related: Hybrid search · BM25 · reranking

  • RLHF

    Training

    Reinforcement Learning from Human Feedback. The classic preference-tuning method.

    RLHF trains a reward model on human preference pairs, then uses reinforcement learning (PPO) to update the LLM to maximize that reward. It is what made early ChatGPT and Claude useful. In 2026, DPO has largely replaced RLHF for ease of setup, but RLHF still wins at the highest scale and for the most subtle preference targets.

    Related: DPO · Fine-tuning · preference-data

  • Small language model

    Models

    A small, fast model — sub-10B parameters typically — for cheap, low-stakes tasks.

    Small language models (Haiku, Nano, Flash, Phi, Llama 3.2 1B/3B) handle classification, routing, lightweight extraction, and batched scoring at a fraction of frontier cost and latency. The right pick for the long tail of an application that does not need top-tier reasoning. A model router that delegates appropriately to SLMs is often the single largest cost optimization in a mature application.

    Related: Model routing · Mid-tier model · cost

  • Streaming

    Models

    Returning tokens to the client as they are generated, rather than waiting for completion.

    Streaming returns each generated token to the client as it is produced. This dramatically improves perceived latency for chat-style interfaces — users see the first word in 200ms instead of waiting 4 seconds for the whole response. SSE (Server-Sent Events) is the standard transport. Streaming has no cost impact but does complicate client-side error handling (the connection can fail mid-response).
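    The client side with the OpenAI SDK: stream=True yields chunks as tokens are generated.

      from openai import OpenAI

      client = OpenAI()
      stream = client.chat.completions.create(
          model="gpt-4o",
          messages=[{"role": "user", "content": "Explain BM25 in one paragraph"}],
          stream=True,
      )
      for chunk in stream:
          delta = chunk.choices[0].delta.content
          if delta:   # final chunk carries no content
              print(delta, end="", flush=True)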

    Related: sse · latency · ux

  • System prompt

    Prompting

    The instructions that define the model's persona, role, and constraints for the conversation.

    The system prompt is the first message in a conversation, distinct from the user's messages, that tells the model how to behave. Good system prompts are precise, specify output format, include examples of edge cases, and describe the tools available to the model. They should be cached aggressively — they are the most stable part of any conversation.

    Related: Prompt engineering · Prompt caching · few-shot

  • Temperature

    Models

    A sampling parameter that controls randomness — lower is more deterministic.

    Temperature scales the model's output distribution before sampling. Temperature 0 is deterministic (always pick the most likely token); 0.7 is the typical default for chat; 1+ produces creative, surprising outputs. For production tasks where reliability matters (extraction, classification, structured output), use temperature 0 or 0.1. For creative writing or brainstorming, raise it.

    Related: sampling · top-p · reliability

  • Tokens

    Models

    The atomic units of text a model processes. Roughly 0.75 words each in English.

    Models do not see characters or words — they see tokens, which are statistical sub-word units learned during training. "Hello, world" is about 3 tokens. Pricing is per-million-tokens of input and output (output costs ~5× input). Token counts are how you reason about context window limits and cost. Provider tokenizers differ slightly, so token counts are model-family-specific.
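    Counting tokens with tiktoken, OpenAI's tokenizer library; other model families ship their own tokenizers, which is why counts differ slightly across providers.

      import tiktoken

      enc = tiktoken.encoding_for_model("gpt-4o")
      tokens = enc.encode("Hello, world")
      print(len(tokens))   # a small handful of tokens, not 2 words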

    Related: Context window · cost · tokenizer

  • Tool use

    Agents

    See: function calling. Same concept, slightly different naming.

    Anthropic uses "tool use" where OpenAI uses "function calling" — same concept. The model returns a structured request to invoke a function you defined; your code runs that function; the result feeds back into the model. The foundation of agentic systems.

    Related: Function calling · Agent · JSON mode / Structured output

  • Vector database

    Infrastructure

    A database optimized for storing embeddings and finding nearest neighbors.

    Vector databases index embeddings using approximate nearest-neighbor algorithms (HNSW, IVF) so similarity search stays fast at scale. Popular 2026 choices: pgvector (Postgres extension, default for most), Qdrant, Pinecone, Weaviate, Turbopuffer. Selection criteria: throughput, latency, hybrid-search support, multi-tenancy model, and cost-per-vector at your expected scale.

    Related: embeddings · Pgvector · hnsw

  • Voice agent

    Agents

    A real-time conversational agent that speaks and listens.

    Voice agents combine speech-to-text (Whisper, Deepgram), an LLM, and text-to-speech (ElevenLabs, Cartesia, OpenAI Voice) into a real-time loop. The hard parts are latency (sub-500ms end-to-end is the bar), interruption handling (the user starts talking mid-response), and turn detection. Frameworks like LiveKit Agents, Pipecat, and Vapi handle the orchestration; the LLM and prompt design are still your job.

    Related: Multi-modal · Whisper · tts

  • Whisper

    Models

    OpenAI's open-source speech-to-text model. The default for transcription.

    Whisper is the de facto open-source speech-to-text model. Multiple sizes (tiny → large), multilingual, available via OpenAI API or self-hosted. For production voice applications, Deepgram and AssemblyAI are common alternatives that beat Whisper on latency for streaming use cases.
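    Transcription via the hosted OpenAI API, the simplest path; self-hosting the open-source weights is the common alternative at volume. The filename is a placeholder.

      from openai import OpenAI

      client = OpenAI()
      with open("call.mp3", "rb") as f:
          transcript = client.audio.transcriptions.create(
              model="whisper-1", file=f)
      print(transcript.text)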

    Related: Voice agent · speech-to-text · Multi-modal

Got a project that uses any of these?

Send a brief. Written proposal in 48 hours.

Send a brief →