
Architecture memo — sample

This is a redacted sample of the architecture memo we deliver at the end of every Discovery Sprint. Real shape, real structure, real opinions. Print to PDF if you want to forward it internally.

Get one for your project →

Mini Trends · architecture memo · v1.0 · 2026-04-15

RAG-based contract Q&A: architecture & build plan

Prepared for [Client] · author: R. Kitamura · pages: 1 of 14 (sample)

§1 Executive summary

We recommend building a hybrid-retrieval RAG system over the client’s contract corpus, with citation tracking, an evaluation harness, and a Claude Sonnet 4.6 generator. Estimated build: 9 weeks, $138k fixed-price. Estimated steady-state run cost: $1,400/month at projected volume (5,000 queries/day).

Three structural risks have been identified and addressed in the plan: PDF-extraction fidelity (mitigated by a two-pass extractor), retrieval precision on near-duplicate documents (mitigated by metadata filtering and re-ranking), and answer hallucination on ambiguous questions (mitigated by an explicit “insufficient context” response path).

§2 Problem statement

Legal-ops users at [Client] need to answer factual questions about ~12,000 active contracts (counterparty terms, expiry, renewal, indemnification, payment schedules) without manually opening each document. Current process: keyword search in a SaaS DMS, followed by manual review. Average time to answer: 11 minutes. Target: under 30 seconds with cited source clauses.

§3 Recommended architecture

  1. PDF upload → 2-pass extractor (PyMuPDF + Claude Vision fallback)
  2. Section-aware chunker (respects clause boundaries; max 1024 tokens; sketched below)
  3. Embed with text-embedding-3-large → pgvector (multi-tenant via row-level security)
  4. Query path: hybrid (BM25 + dense) → cross-encoder rerank top-30 → top-5 to model
  5. Generator: Claude Sonnet 4.6, structured output, citations required
  6. Eval: LangSmith with 380 labeled cases + LLM-as-judge for fidelity
  7. Observability: Langfuse, Datadog for infra, Sentry for errors
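To make step 2 concrete, here is a minimal sketch of the section-aware chunker: split at clause headings, then pack whole clauses into chunks under the token budget. The clause-heading regex and the characters-per-token estimate are illustrative assumptions, not the delivered implementation.

```python
import re
from dataclasses import dataclass

# Illustrative clause-heading pattern: numbered clauses ("12.", "7.3)") or
# ALL-CAPS headings followed by a separator. The real pattern is tuned per corpus.
CLAUSE_HEADING = re.compile(r"^\s*(?:\d+(?:\.\d+)*|[A-Z][A-Z ]{3,})\s*[.:)]", re.MULTILINE)
MAX_TOKENS = 1024


def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); a real tokenizer replaces this.
    return max(1, len(text) // 4)


@dataclass
class Chunk:
    contract_id: str
    start_offset: int  # character offset of the first clause in the chunk
    text: str


def chunk_contract(contract_id: str, full_text: str) -> list[Chunk]:
    starts = [m.start() for m in CLAUSE_HEADING.finditer(full_text)]
    if not starts or starts[0] != 0:
        starts = [0] + starts  # keep any preamble before the first clause
    sections = [full_text[a:b] for a, b in zip(starts, starts[1:] + [len(full_text)])]

    chunks: list[Chunk] = []
    buf: list[str] = []
    buf_start = starts[0]
    for offset, section in zip(starts, sections):
        # Never split inside a clause; flush the buffer when the next clause
        # would push the chunk past the token budget.
        if buf and approx_tokens("".join(buf) + section) > MAX_TOKENS:
            chunks.append(Chunk(contract_id, buf_start, "".join(buf)))
            buf, buf_start = [], offset
        buf.append(section)
    if buf:
        chunks.append(Chunk(contract_id, buf_start, "".join(buf)))
    return chunks
```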

§4 Why this stack (and what we considered)

pgvector vs Pinecone

pgvector chosen because (a) you are already on managed Postgres, (b) ACID + row-level multi-tenancy aligns with your existing data model, (c) at 12k documents × ~40 chunks each, you are well below the scale where dedicated vector DBs pay off.
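For reference, a minimal sketch of what (b) looks like in practice: one chunk table on the existing Postgres instance with a row-level-security policy keyed to the tenant. Table and column names are illustrative, not the final schema.

```python
import psycopg  # psycopg 3; assumes the pgvector extension is available on the instance

# DDL shown as plain SQL strings; names are illustrative, not the delivered schema.
STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS contract_chunks (
        id          bigserial PRIMARY KEY,
        tenant_id   uuid NOT NULL,
        contract_id text NOT NULL,
        clause_ref  text,
        body        text NOT NULL,
        embedding   vector(3072)  -- text-embedding-3-large output dimension
    )
    """,
    "ALTER TABLE contract_chunks ENABLE ROW LEVEL SECURITY",
    # Each application connection sets app.tenant_id; rows belonging to other
    # tenants are invisible to queries on that connection.
    """
    CREATE POLICY tenant_isolation ON contract_chunks
        USING (tenant_id = current_setting('app.tenant_id')::uuid)
    """,
]


def migrate(dsn: str) -> None:
    with psycopg.connect(dsn) as conn:  # context manager commits on clean exit
        for stmt in STATEMENTS:
            conn.execute(stmt)
```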

Sonnet 4.6 vs Opus 4.7 vs GPT-5

On a 50-case eval blind-rated by your legal-ops lead, Sonnet 4.6 scored 0.91 vs Opus 4.7 at 0.93; the cost difference at projected volume is $4k/mo. We recommend Sonnet for production, with Opus reserved for the one question class we found regressed (clause-precedent comparisons).
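In the query path this recommendation reduces to a per-question-class routing rule. A minimal sketch; the model identifier strings and the class label are assumptions for illustration.

```python
DEFAULT_MODEL = "claude-sonnet-4-6"    # assumed identifier for Sonnet 4.6
PRECEDENT_MODEL = "claude-opus-4-7"    # assumed identifier for Opus 4.7


def pick_model(question_class: str) -> str:
    # Only clause-precedent comparisons regressed on Sonnet in the eval,
    # so only that class pays the Opus premium.
    if question_class == "clause_precedent_comparison":
        return PRECEDENT_MODEL
    return DEFAULT_MODEL
```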

Hybrid retrieval vs dense-only

Hybrid won 47 of our 50 eval cases. Dense-only missed exact-term matches (counterparty names, clause numbers) that BM25 caught reliably.
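One straightforward way to merge the BM25 and dense lists ahead of the cross-encoder rerank is reciprocal-rank fusion; the sketch below assumes that choice, and the production merge logic may differ.

```python
def rrf_fuse(bm25_ids: list[str], dense_ids: list[str], k: int = 60, top_n: int = 30) -> list[str]:
    # Reciprocal-rank fusion: each list contributes 1/(k + rank) per chunk id,
    # so chunks ranked well by either retriever float into the top-30 for the reranker.
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```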

§5 Build plan (9 weeks)

Weeks | Milestone | Demo
1–2 | Extraction + chunking pipeline; 1k-doc smoke test | Confluence dump
3–4 | Indexing + hybrid retrieval; eval set v1 (200 cases) | Live retrieval demo
5–6 | Generator integration; eval set v2 (380 cases); reranker tuning | End-to-end flow
7–8 | Production hardening: observability, rate limits, fallbacks | Staging cutover
9 | Production deploy + handover docs + runbook | Production use

§6 Cost projection (steady state)

  • Sonnet 4.6 inference: ~$890 / mo (5k queries/day, ~6k input / ~500 output tokens per query)
  • Embedding + reranker: ~$110 / mo
  • pgvector hosting (existing Postgres, marginal): ~$0 / mo
  • LangSmith Pro: $99 / mo
  • Datadog (assumed allocation): ~$300 / mo
  • Total: ~$1,400 / mo at projected volume

Cost optimizations baked in from day one: aggressive prompt caching (system prompt + retrieved docs), output-length cap at 600 tokens, intent-router that sends classification-only queries to Haiku.
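A minimal sketch of how the first two optimizations land in the generator call, using the Anthropic Messages API's cache_control blocks and a hard max_tokens cap; treat the model identifier string as an assumption.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def generate(system_prompt: str, retrieved_chunks: list[str], question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # assumed identifier for Sonnet 4.6
        max_tokens=600,             # output-length cap from the cost plan
        system=[
            # Static system prompt and the retrieved clauses are marked cacheable,
            # so repeat queries against the same contract reuse the cached prefix.
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": "\n\n".join(retrieved_chunks),
             "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```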

§7 Risks & mitigations

PDF extraction fidelity

Risk: 8% of source PDFs have embedded scanned pages where text extraction fails silently. Mitigation: two-pass extractor that detects empty extracted regions and falls back to Claude Vision OCR. Adds $40/mo at projected volume; eliminates the failure mode.
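A minimal sketch of the first-pass check, assuming PyMuPDF: pages whose extracted text comes back (near-)empty are rendered to images and queued for the vision OCR fallback instead of passing through silently. The character threshold is an illustrative assumption.

```python
import fitz  # PyMuPDF

MIN_CHARS = 20  # below this, treat the page as a likely embedded scan


def extract_pages(pdf_path: str) -> list[dict]:
    pages = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            text = page.get_text("text").strip()
            if len(text) >= MIN_CHARS:
                pages.append({"page": i, "text": text, "source": "pymupdf"})
            else:
                # Render the page and queue it for the vision OCR fallback
                # (second pass not shown here).
                pix = page.get_pixmap(dpi=200)
                pages.append({"page": i, "image_png": pix.tobytes("png"), "source": "needs_ocr"})
    return pages
```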

Near-duplicate retrieval

Risk: ~30% of contracts share boilerplate sections. Naive retrieval surfaces multiple chunks of identical text. Mitigation: deduplicate retrieval results by content hash before passing to the model.
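A minimal sketch of that dedupe step, assuming retrieved chunks arrive ordered best-first; the normalization is illustrative.

```python
import hashlib


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    # Chunks sharing a normalized content hash collapse to the best-ranked copy
    # before the rerank/model stage.
    seen: set[str] = set()
    unique: list[dict] = []
    for chunk in chunks:
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```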

Answer hallucination on ambiguous questions

Risk: model invents answers when retrieved context is insufficient. Mitigation: explicit "insufficient context" response path with structured output schema, validated in evals.
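A minimal sketch of the kind of response schema behind that path, shown with pydantic; field names are illustrative, not the final contract.

```python
from typing import Literal

from pydantic import BaseModel


class Citation(BaseModel):
    contract_id: str
    clause_ref: str
    quote: str


class ContractAnswer(BaseModel):
    status: Literal["answered", "insufficient_context"]
    answer: str | None = None          # None when status == "insufficient_context"
    citations: list[Citation] = []


def parse_answer(raw_json: str) -> ContractAnswer:
    # The eval harness asserts that ambiguous cases come back with status
    # "insufficient_context" rather than a fabricated answer.
    return ContractAnswer.model_validate_json(raw_json)
```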

§8 What we are not doing (and why)

  • Fine-tuning a custom model. Not yet — base Sonnet 4.6 with good retrieval clears your accuracy bar in eval. Revisit after 6 months of production data if there are systematic question classes we miss.
  • Graph RAG. Your queries are largely single-document factual lookups, not multi-hop reasoning over relationships. Graph RAG would add complexity without a measurable win on this dataset.
  • An agent loop. Single-pass RAG meets the spec. Adding an agent would multiply latency and cost without a benefit on the user research we ran.

§9 Handover & ownership

All code lives in your monorepo from day one. Infra runs in your AWS account. We deliver a written runbook covering common failure modes, an updated architecture diagram, the eval harness with seed cases, and a 30-minute Loom walkthrough of the codebase for your engineering team.

Mini Trends · 2026 · Sample document. Real engagements include sections 10–14 (deployment plan, observability spec, security review, legal & procurement appendix, signed SOW).

This is what your Discovery Sprint produces.

Plus a working prototype of the most uncertain path. From $35k. Two weeks.

Send a brief →