Mini Trends · architecture memo · v1.0 · 2026-04-15
RAG-based contract Q&A: architecture & build plan
Prepared for [Client] · author: R. Kitamura · pages: 1 of 14 (sample)
§1 Executive summary
We recommend building a hybrid-retrieval RAG system over the client’s contract corpus, with citation tracking, an evaluation harness, and a Claude Sonnet 4.6 generator. Estimated build: 9 weeks, $138k fixed-price. Estimated steady-state run cost: $1,400/month at projected volume (5,000 queries/day).
Three structural risks have been identified and addressed in the plan: PDF-extraction fidelity (mitigated by a two-pass extractor), retrieval precision on near-duplicate documents (mitigated by metadata filtering and re-ranking), and answer hallucination on ambiguous questions (mitigated by an explicit “insufficient context” response path).
§2 Problem statement
Legal-ops users at [Client] need to answer factual questions about ~12,000 active contracts (counterparty terms, expiry, renewal, indemnification, payment schedules) without manually opening each document. Current process: keyword search in a SaaS DMS, followed by manual review. Average time to answer: 11 minutes. Target: under 30 seconds with cited source clauses.
§3 Recommended architecture
1. PDF upload → two-pass extractor (PyMuPDF + Claude Vision fallback)
2. Section-aware chunker (respects clause boundaries; max 1024 tokens; see the sketch after this list)
3. Embed with text-embedding-3-large → pgvector (multi-tenant via row-level security)
4. Query path: hybrid (BM25 + dense) → cross-encoder rerank top-30 → top-5 to model
5. Generator: Claude Sonnet 4.6, structured output, citations required
6. Eval: LangSmith with 380 labeled cases + LLM-as-judge for fidelity
7. Observability: Langfuse, Datadog for infra, Sentry for errors
§4 Why this stack (and what we considered)
pgvector vs Pinecone
pgvector chosen because (a) you are already on managed Postgres, (b) ACID + row-level multi-tenancy aligns with your existing data model, (c) at 12k documents × ~40 chunks each, you are well below the scale where dedicated vector DBs pay off.
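To make the multi-tenancy point concrete, here is a sketch of the table and row-level-security policy this implies, written against psycopg 3. The table name, tenant setting, and DSN are illustrative placeholders; 3072 is the text-embedding-3-large dimension.

```python
import psycopg  # psycopg 3, against the existing managed Postgres

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id          bigserial PRIMARY KEY,
    tenant_id   text NOT NULL,
    contract_id text NOT NULL,
    section     text,
    body        text NOT NULL,
    embedding   vector(3072)            -- text-embedding-3-large dimension
);
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
ALTER TABLE chunks FORCE ROW LEVEL SECURITY;   -- apply the policy to the table owner too
CREATE POLICY tenant_isolation ON chunks
    USING (tenant_id = current_setting('app.tenant_id'));
"""


def dense_search(conn: psycopg.Connection, tenant: str, query_embedding: list[float], k: int = 30):
    """Top-k cosine search; the RLS policy above scopes rows to the active tenant."""
    conn.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant,))
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"  # pgvector literal
    return conn.execute(
        "SELECT id, contract_id, section, body FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",   # <=> is pgvector's cosine-distance operator
        (vec, k),
    ).fetchall()


with psycopg.connect("postgresql:///contracts") as conn:  # placeholder DSN
    conn.execute(DDL)  # one-time migration
```

At ~480k chunks (12k documents × ~40 chunks each), a pgvector HNSW index on the embedding column is sufficient; nothing here requires a dedicated vector database.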
Sonnet 4.6 vs Opus 4.7 vs GPT-5
On a 50-case eval blind-rated by your legal-ops lead, Sonnet 4.6 scored 0.91 vs 0.93 for Opus 4.7, a gap we judge too small to justify the ~$4k/mo cost difference at projected volume. We recommend Sonnet for production, with Opus reserved for the one question class where Sonnet regressed in our eval (clause-precedent comparisons).
Hybrid retrieval vs dense-only
Hybrid won on 47 of our 50 eval cases. Dense-only missed exact-term matches (counterparty names, clause numbers) that BM25 caught reliably.
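The fusion step itself is simple: reciprocal-rank fusion over the two ranked lists before the cross-encoder sees anything. A minimal sketch, assuming each retriever returns chunk ids in rank order; k=60 is the common RRF default, not something we tuned.

```python
from collections import defaultdict


def rrf_fuse(bm25_ids: list[str], dense_ids: list[str], k: int = 60, top_n: int = 30) -> list[str]:
    """Reciprocal-rank fusion: score = sum over both rankings of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Top-30 goes to the cross-encoder reranker; its top-5 goes to the model.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```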
§5 Build plan (9 weeks)
| Weeks | Milestone | Demo |
|---|---|---|
| 1–2 | Extraction + chunking pipeline; 1k-doc smoke test | Confluence dump |
| 3–4 | Indexing + hybrid retrieval; eval set v1 (200 cases) | Live retrieval demo |
| 5–6 | Generator integration; eval set v2 (380 cases); reranker tuning | End-to-end flow |
| 7–8 | Production hardening: observability, rate limits, fallbacks | Staging cutover |
| 9 | Production deploy + handover docs + runbook | Production use |
§6 Cost projection (steady state)
- Sonnet 4.6 inference: ~$890 / mo (5k queries/day; ~6k input / ~500 output tokens per query)
- Embedding + reranker: ~$110 / mo
- pgvector hosting (existing Postgres, marginal): ~$0 / mo
- LangSmith Pro: $99 / mo
- Datadog (assumed allocation): ~$300 / mo
- Total: ~$1,400 / mo at projected volume
Cost optimizations baked in from day one: aggressive prompt caching (system prompt + retrieved docs), output-length cap at 600 tokens, intent-router that sends classification-only queries to Haiku.
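A hedged sketch of what those optimizations look like at the call site, using the Anthropic Python SDK's prompt caching (`cache_control`) and `max_tokens`. The model identifiers and the `route_model` heuristic are placeholders, not final values.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder model identifiers; substitute production names at deploy time.
SONNET = "claude-sonnet-4-6"
HAIKU = "claude-haiku-4"

SYSTEM_PROMPT = (
    "Answer questions about contracts using only the provided clauses. "
    "Cite every clause you rely on."
)


def route_model(intent: str) -> str:
    """Classification-only queries go to the cheaper model; everything else to Sonnet."""
    return HAIKU if intent == "classification" else SONNET


def answer(question: str, retrieved_chunks: list[str], intent: str) -> str:
    context = "\n---\n".join(retrieved_chunks)
    response = client.messages.create(
        model=route_model(intent),
        max_tokens=600,  # output-length cap from the cost plan
        system=[
            # Static system prompt is marked cacheable so repeat requests hit the prompt cache.
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{
            "role": "user",
            "content": f"Clauses:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```

Retrieved documents can be marked cacheable the same way (a `cache_control` block on the user turn) for follow-up questions about the same contract.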
§7 Risks & mitigations
PDF extraction fidelity
Risk: 8% of source PDFs have embedded scanned pages where text extraction fails silently. Mitigation: two-pass extractor that detects empty extracted regions and falls back to Claude Vision OCR. Adds $40/mo at projected volume; eliminates the failure mode.
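A sketch of the detection half of that extractor, assuming PyMuPDF (fitz) for the first pass. The character threshold is an illustrative assumption, and the Claude Vision OCR call itself is elided; only the page-image handoff is shown.

```python
import fitz  # PyMuPDF

MIN_CHARS = 20  # below this, treat the page as a scan with no usable text layer


def extract_pages(pdf_path: str) -> list[dict]:
    """First pass: native text extraction; flag pages that need the OCR fallback."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text("text").strip()
            if len(text) >= MIN_CHARS:
                pages.append({"page": page.number, "text": text, "source": "pymupdf"})
            else:
                # Second pass: render the page and hand the PNG to the vision OCR step.
                png_bytes = page.get_pixmap(dpi=200).tobytes("png")
                pages.append({"page": page.number, "image_png": png_bytes, "source": "ocr_pending"})
    return pages
```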
Near-duplicate retrieval
Risk: ~30% of contracts share boilerplate sections. Naive retrieval surfaces multiple chunks of identical text. Mitigation: deduplicate retrieval results by content hash before passing to the model.
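A minimal sketch of that deduplication step, run on the reranked results before prompt assembly; the normalization (whitespace collapse, lowercasing) reflects our assumption about what counts as identical boilerplate.

```python
import hashlib
import re


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first (highest-ranked) chunk for each distinct normalized body."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:  # chunks arrive in rerank order, best first
        normalized = re.sub(r"\s+", " ", chunk["body"]).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```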
Answer hallucination on ambiguous questions
Risk: model invents answers when retrieved context is insufficient. Mitigation: explicit "insufficient context" response path with structured output schema, validated in evals.
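One way to make that path explicit is to force the generator's structured output through a schema along these lines (a Pydantic sketch; field names are illustrative, not the final contract).

```python
from pydantic import BaseModel, Field


class Citation(BaseModel):
    contract_id: str
    section: str
    quote: str  # verbatim clause text the answer relies on


class ContractAnswer(BaseModel):
    sufficient_context: bool = Field(
        description="False when the retrieved clauses do not answer the question."
    )
    answer: str | None = None                                 # null when context is insufficient
    citations: list[Citation] = Field(default_factory=list)   # non-empty whenever answer is set


# Downstream handling: when sufficient_context is False, the UI shows a fixed
# "not found in the retrieved contracts" message instead of model-generated text.
```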
§8 What we are not doing (and why)
- Fine-tuning a custom model. Not yet — base Sonnet 4.6 with good retrieval clears your accuracy bar in eval. Revisit after 6 months of production data if there are systematic question classes we miss.
- Graph RAG. Your queries are largely single-document factual lookups, not multi-hop reasoning over relationships. Graph RAG would add complexity without a measurable win on this dataset.
- An agent loop. Single-pass RAG meets the spec. Adding an agent would multiply latency and cost without a benefit on the user research we ran.
§9 Handover & ownership
All code lives in your monorepo from day one. Infra runs in your AWS account. We deliver a written runbook covering common failure modes, an updated architecture diagram, the eval harness with seed cases, and a 30-minute Loom walkthrough of the codebase for your engineering team.