Mini Trends · architecture memo · v1.0 · 2026-04-15
RAG-based contract Q&A: architecture & build plan
Prepared for [Client] · author: R. Kitamura · pages: 1 of 14 (sample)
§1 Executive summary
We recommend building a hybrid-retrieval RAG system over the client’s contract corpus, with citation tracking, an evaluation harness, and a Claude Sonnet 4.6 generator. Estimated build: 9 weeks, $138k fixed-price. Estimated steady-state run cost: $1,400/month at projected volume (5,000 queries/day).
Three structural risks have been identified and addressed in the plan: PDF-extraction fidelity (mitigated by a two-pass extractor), retrieval precision on near-duplicate documents (mitigated by metadata filtering and re-ranking), and answer hallucination on ambiguous questions (mitigated by an explicit “insufficient context” response path).
§2 Problem statement
Legal-ops users at [Client] need to answer factual questions about ~12,000 active contracts (counterparty terms, expiry, renewal, indemnification, payment schedules) without manually opening each document. Current process: keyword search in a SaaS DMS, followed by manual review. Average time to answer: 11 minutes. Target: under 30 seconds with cited source clauses.
§3 Recommended architecture
1. PDF upload → two-pass extractor (PyMuPDF + Claude Vision fallback)
2. Section-aware chunker (respects clause boundaries; max 1024 tokens; see the sketch after this list)
3. Embed with text-embedding-3-large → pgvector (multi-tenant via row-level security)
4. Query path: hybrid (BM25 + dense) → cross-encoder rerank top-30 → top-5 to model
5. Generator: Claude Sonnet 4.6, structured output, citations required
6. Eval: LangSmith with 380 labeled cases + LLM-as-judge for fidelity
7. Observability: Langfuse, Datadog for infra, Sentry for errors
§4 Why this stack (and what we considered)
pgvector vs Pinecone
pgvector chosen because (a) you are already on managed Postgres, (b) ACID + row-level multi-tenancy aligns with your existing data model, (c) at 12k documents × ~40 chunks each, you are well below the scale where dedicated vector DBs pay off.
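To make the multi-tenancy point concrete, here is a sketch of the table and row-level-security policy this implies, written against psycopg 3. The table name, tenant setting, and DSN are illustrative placeholders; 3072 is the text-embedding-3-large dimension.

```python
import psycopg  # psycopg 3, against the existing managed Postgres

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id          bigserial PRIMARY KEY,
    tenant_id   text NOT NULL,
    contract_id text NOT NULL,
    section     text,
    body        text NOT NULL,
    embedding   vector(3072)            -- text-embedding-3-large dimension
);
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
ALTER TABLE chunks FORCE ROW LEVEL SECURITY;   -- apply the policy to the table owner too
CREATE POLICY tenant_isolation ON chunks
    USING (tenant_id = current_setting('app.tenant_id'));
"""


def dense_search(conn: psycopg.Connection, tenant: str, query_embedding: list[float], k: int = 30):
    """Top-k cosine search; the RLS policy above scopes rows to the active tenant."""
    conn.execute("SELECT set_config('app.tenant_id', %s, false)", (tenant,))
    vec = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"  # pgvector literal
    return conn.execute(
        "SELECT id, contract_id, section, body FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",   # <=> is pgvector's cosine-distance operator
        (vec, k),
    ).fetchall()


with psycopg.connect("postgresql:///contracts") as conn:  # placeholder DSN
    conn.execute(DDL)  # one-time migration
```

At ~480k chunks (12k documents × ~40 chunks each), a pgvector HNSW index on the embedding column is sufficient; nothing here requires a dedicated vector database.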
Sonnet 4.6 vs Opus 4.7 vs GPT-5
On a 50-case eval blind-rated by your legal-ops lead, Sonnet 4.6 scored 0.91 vs 0.93 for Opus 4.7, a gap we judge too small to justify the ~$4k/mo cost difference at projected volume. We recommend Sonnet for production, with Opus reserved for the one question class where Sonnet regressed in our eval (clause-precedent comparisons).
Hybrid retrieval vs dense-only
Hybrid won on 47 of our 50 eval cases. Dense-only missed exact-term matches (counterparty names, clause numbers) that BM25 caught reliably.
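The fusion step itself is simple: reciprocal-rank fusion over the two ranked lists before the cross-encoder sees anything. A minimal sketch, assuming each retriever returns chunk ids in rank order; k=60 is the common RRF default, not something we tuned.

```python
from collections import defaultdict


def rrf_fuse(bm25_ids: list[str], dense_ids: list[str], k: int = 60, top_n: int = 30) -> list[str]:
    """Reciprocal-rank fusion: score = sum over both rankings of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Top-30 goes to the cross-encoder reranker; its top-5 goes to the model.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```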
§5 Build plan (9 weeks)
| Weeks | Milestone | Demo |
|---|---|---|
| 1–2 | Extraction + chunking pipeline; 1k-doc smoke test | Confluence dump |
| 3–4 | Indexing + hybrid retrieval; eval set v1 (200 cases) | Live retrieval demo |
| 5–6 | Generator integration; eval set v2 (380 cases); reranker tuning | End-to-end flow |
| 7–8 | Production hardening: observability, rate limits, fallbacks | Staging cutover |
| 9 | Production deploy + handover docs + runbook | Production use |
§6 Cost projection (steady state)
- Sonnet 4.6 inference: ~$890 / mo (5k queries/day; ~6k input / ~500 output tokens per query)
- Embedding + reranker: ~$110 / mo
- pgvector hosting (existing Postgres, marginal): ~$0 / mo
- LangSmith Pro: $99 / mo
- Datadog (assumed allocation): ~$300 / mo
- Total: ~$1,400 / mo at projected volume
Cost optimizations baked in from day one: aggressive prompt caching (system prompt + retrieved docs), output-length cap at 600 tokens, intent-router that sends classification-only queries to Haiku.
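A hedged sketch of what those optimizations look like at the call site, using the Anthropic Python SDK's prompt caching (`cache_control`) and `max_tokens`. The model identifiers and the `route_model` heuristic are placeholders, not final values.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder model identifiers; substitute production names at deploy time.
SONNET = "claude-sonnet-4-6"
HAIKU = "claude-haiku-4"

SYSTEM_PROMPT = (
    "Answer questions about contracts using only the provided clauses. "
    "Cite every clause you rely on."
)


def route_model(intent: str) -> str:
    """Classification-only queries go to the cheaper model; everything else to Sonnet."""
    return HAIKU if intent == "classification" else SONNET


def answer(question: str, retrieved_chunks: list[str], intent: str) -> str:
    context = "\n---\n".join(retrieved_chunks)
    response = client.messages.create(
        model=route_model(intent),
        max_tokens=600,  # output-length cap from the cost plan
        system=[
            # Static system prompt is marked cacheable so repeat requests hit the prompt cache.
            {"type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{
            "role": "user",
            "content": f"Clauses:\n{context}\n\nQuestion: {question}",
        }],
    )
    return response.content[0].text
```

Retrieved documents can be marked cacheable the same way (a `cache_control` block on the user turn) for follow-up questions about the same contract.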
§7 Risks & mitigations
PDF extraction fidelity
Risk: 8% of source PDFs have embedded scanned pages where text extraction fails silently. Mitigation: two-pass extractor that detects empty extracted regions and falls back to Claude Vision OCR. Adds $40/mo at projected volume; eliminates the failure mode.
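A sketch of the detection half of that extractor, assuming PyMuPDF (fitz) for the first pass. The character threshold is an illustrative assumption, and the Claude Vision OCR call itself is elided; only the page-image handoff is shown.

```python
import fitz  # PyMuPDF

MIN_CHARS = 20  # below this, treat the page as a scan with no usable text layer


def extract_pages(pdf_path: str) -> list[dict]:
    """First pass: native text extraction; flag pages that need the OCR fallback."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text("text").strip()
            if len(text) >= MIN_CHARS:
                pages.append({"page": page.number, "text": text, "source": "pymupdf"})
            else:
                # Second pass: render the page and hand the PNG to the vision OCR step.
                png_bytes = page.get_pixmap(dpi=200).tobytes("png")
                pages.append({"page": page.number, "image_png": png_bytes, "source": "ocr_pending"})
    return pages
```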
Near-duplicate retrieval
Risk: ~30% of contracts share boilerplate sections. Naive retrieval surfaces multiple chunks of identical text. Mitigation: deduplicate retrieval results by content hash before passing to the model.
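A minimal sketch of that deduplication step, run on the reranked results before prompt assembly; the normalization (whitespace collapse, lowercasing) reflects our assumption about what counts as identical boilerplate.

```python
import hashlib
import re


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first (highest-ranked) chunk for each distinct normalized body."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:  # chunks arrive in rerank order, best first
        normalized = re.sub(r"\s+", " ", chunk["body"]).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```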
Answer hallucination on ambiguous questions
Risk: model invents answers when retrieved context is insufficient. Mitigation: explicit "insufficient context" response path with structured output schema, validated in evals.
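One way to make that path explicit is to force the generator's structured output through a schema along these lines (a Pydantic sketch; field names are illustrative, not the final contract).

```python
from pydantic import BaseModel, Field


class Citation(BaseModel):
    contract_id: str
    section: str
    quote: str  # verbatim clause text the answer relies on


class ContractAnswer(BaseModel):
    sufficient_context: bool = Field(
        description="False when the retrieved clauses do not answer the question."
    )
    answer: str | None = None                                 # null when context is insufficient
    citations: list[Citation] = Field(default_factory=list)   # non-empty whenever answer is set


# Downstream handling: when sufficient_context is False, the UI shows a fixed
# "not found in the retrieved contracts" message instead of model-generated text.
```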
§8 What we are not doing (and why)
- Fine-tuning a custom model. Not yet — base Sonnet 4.6 with good retrieval clears your accuracy bar in eval. Revisit after 6 months of production data if there are systematic question classes we miss.
- Graph RAG. Your queries are largely single-document factual lookups, not multi-hop reasoning over relationships. Graph RAG would add complexity without a measurable win on this dataset.
- An agent loop. Single-pass RAG meets the spec. Adding an agent would multiply latency and cost without a benefit on the user research we ran.
§9 Handover & ownership
All code lives in your monorepo from day one. Infra runs in your AWS account. We deliver a written runbook covering common failure modes, an updated architecture diagram, the eval harness with seed cases, and a 30-minute Loom walkthrough of the codebase for your engineering team.