LLMs hallucinate financial numbers - with legal consequences
Financial analysts spend 40+ hours/week manually reading SEC 10-K filings. LLM-only approaches hallucinate specific figures, dates, and risk factors - a direct regulatory liability in financial contexts.
Hallucination Risk at Scale
Standard LLMs freely fabricate financial figures when context is insufficient. "NVIDIA's revenue was $X billion" with a plausible-but-wrong number goes undetected - until it reaches a portfolio decision. There is no self-correction loop, no citation grounding, no verification.
Dual Retrieval Failure Modes
Financial text has two failure modes. Dense vectors miss exact tokens like "$47 billion" or "accession 0001045810", which BM25 catches; BM25 misses paraphrases, such as "fabless strategy" implying supply chain concentration, which dense retrieval catches. Neither alone achieves acceptable F1.
API Rate Limits Under Real Load
A single reasoning step on a 70B model consumes ~3,500 tokens. Without budget management, a multi-turn session exhausts Groq free-tier limits in minutes, after which every subsequent query hard-fails with HTTP 429. Production systems need adaptive degradation, not crashes.
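As a rough illustration of adaptive degradation, the sketch below tracks tokens spent in a rate-limit window and picks a tier before each call rather than letting the call fail. The ~3,500-token step cost comes from the text above; the tier names, thresholds, and the 10,000-token limit are assumptions made for this example, not values from the project or from Groq.

```python
from dataclasses import dataclass

# Approximate cost of one 70B reasoning step (figure taken from the text).
REASONING_STEP_TOKENS = 3_500


@dataclass
class TokenBudget:
    """Tracks tokens spent in the current rate-limit window (hypothetical helper)."""
    limit: int      # assumed per-window token allowance
    used: int = 0

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)

    def spend(self, tokens: int) -> None:
        self.used += tokens

    def tier(self) -> str:
        """Choose a degradation tier instead of risking an HTTP 429."""
        if self.remaining >= 2 * REASONING_STEP_TOKENS:
            return "full"      # room for a reasoning step plus one retry
        if self.remaining >= REASONING_STEP_TOKENS:
            return "reduced"   # one reasoning step, no reflect-retry loop
        return "minimal"       # e.g. cached answer or a smaller model


budget = TokenBudget(limit=10_000)
tiers = []
for _ in range(3):
    tiers.append(budget.tier())
    budget.spend(REASONING_STEP_TOKENS)
print(tiers)
```

The point of the design is that the tier is decided before the request is sent, so an exhausted budget changes the pipeline's behavior rather than surfacing as an error.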
5-layer end-to-end pipeline
Every layer is independently replaceable. The pipeline combines a parser strategy pattern, async ingestion with ThreadPoolExecutor, multi-collection Qdrant storage, and a budget-aware optimization stack that survives real traffic.
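A minimal sketch of how a parser strategy and ThreadPoolExecutor ingestion could fit together. The class and function names here (Parser, ParagraphParser, LineParser, ingest) are hypothetical, and the Qdrant write step is left out so the example runs with the standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Protocol


class Parser(Protocol):
    """Strategy interface: any object with a matching parse() can be swapped in."""
    def parse(self, raw: str) -> list[str]: ...


class ParagraphParser:
    """Splits a document into chunks on blank lines."""
    def parse(self, raw: str) -> list[str]:
        return [p.strip() for p in raw.split("\n\n") if p.strip()]


class LineParser:
    """Splits a document into one chunk per non-empty line."""
    def parse(self, raw: str) -> list[str]:
        return [line.strip() for line in raw.splitlines() if line.strip()]


def ingest(documents: dict[str, str], parser: Parser, max_workers: int = 4) -> dict[str, list[str]]:
    """Parse documents concurrently; changing the parser changes nothing else."""
    names = list(documents)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        parsed = list(pool.map(parser.parse, (documents[n] for n in names)))
    return dict(zip(names, parsed))


filings = {
    "item_1a": "Risk A.\n\nRisk B.",
    "item_7": "Revenue grew.\nMargins held.",
}
by_paragraph = ingest(filings, ParagraphParser())
by_line = ingest(filings, LineParser())
```

Because ingest depends only on the parse() interface, replacing the parsing layer does not touch the concurrency or storage code.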
The self-correcting loop - interactive
Switch budget tiers to see how the pipeline adapts automatically. Click any node to inspect its responsibilities, model choice, and latency contribution.
The Reflector is the system's intelligence: it judges whether the Reasoner's answer is grounded in retrieved documents - and if not, it triggers targeted re-retrieval rather than blind retry.
Why hybrid + reranking is non-negotiable for finance
Four retrieval strategies benchmarked end-to-end on NVIDIA 10-K queries. Each iteration adds a specific capability that addresses a concrete failure mode.
Five non-obvious choices
LangGraph over vanilla LangChain
LangChain sequential chains cannot express cycles. The reflect-retry loop requires a stateful cyclic graph: context_chunks must accumulate across retry iterations (via operator.add reducer) rather than overwriting on retry. LangChain would discard previously retrieved context each time the reflector requests more information - undermining the entire self-correction mechanism.
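The accumulate-versus-overwrite distinction can be shown in plain Python. The sketch below imitates reducer semantics and is not LangGraph's API: a state key annotated with operator.add is merged with the previous value, while unannotated keys are replaced. AgentState and apply_update are names made up for the example.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class AgentState(TypedDict):
    question: str
    # The Annotated metadata names the reducer used to merge updates for this key.
    context_chunks: Annotated[list[str], operator.add]
    retries: int


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update: reduce annotated keys, overwrite the rest."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)  # e.g. list + list
        else:
            merged[key] = value
    return merged


state = {"question": "What was FY revenue?", "context_chunks": ["chunk A"], "retries": 0}
# A retry adds newly retrieved context and bumps the counter.
state = apply_update(state, {"context_chunks": ["chunk B"], "retries": 1})
print(state["context_chunks"], state["retries"])
```

After the retry, context_chunks holds both chunks while retries is simply replaced, which is the behavior the reflect-retry loop relies on.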
Hybrid BM25+Dense over pure dense vectors
Financial queries have two distinct failure modes: semantic misses (paraphrased concepts, such as "fabless strategy" implying supply chain concentration) and lexical misses (exact figures like "$47 billion" or "accession 0001045810"). Dense-only retrieval (F1=0.70) fails on exact numbers; BM25-only retrieval (F1=0.57) fails on paraphrase. RRF fusion with k=60 (the standard constant from the original RRF paper) combines both rankings without requiring learned weights.
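Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the rankings in which it appears, with rank starting at 1. A small sketch with k=60 follows; the function name rrf_fuse and the chunk IDs are illustrative, not taken from the project.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum of 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative rankings: BM25 favors exact tokens, dense favors paraphrase.
bm25_ranking = ["chunk_revenue", "chunk_accession", "chunk_risk"]
dense_ranking = ["chunk_supply", "chunk_revenue", "chunk_risk"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Only ranks enter the formula, so the incomparable raw scores of BM25 and cosine similarity never need to be normalized against each other. A chunk that both retrievers rank reasonably well (chunk_revenue here) ends up ahead of chunks that only one retriever found.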
Cross-encoder reranking as non-negotiable
Bi-encoders embed query and document independently, so attention cannot flow between them. A cross-encoder processes each (query, passage) pair jointly, letting query tokens attend to document tokens. For financial precision (exact citation required), this joint attention is what separates 0.79 precision (RRF alone) from 0.91 precision (with reranker). The added 0.72s of latency is justified by that gain: 0.12 absolute, or roughly 15% relative.
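The control flow of the rerank stage is simple: score every (query, passage) pair, then re-sort the fused candidates by that score. The sketch below shows only that flow. The scorer is passed in as a function, and token_overlap is a toy stand-in used so the example runs without a model download; it is not a cross-encoder, and a real system would plug a cross-encoder's pair scorer into the same slot.

```python
from typing import Callable


def rerank(
    query: str,
    passages: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Score each (query, passage) pair and keep the top_k highest-scoring passages."""
    scored = [(passage, score_pair(query, passage)) for passage in passages]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, passage: str) -> float:
    """Toy stand-in scorer (NOT a cross-encoder): share of query tokens found in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


candidates = [
    "Gaming revenue declined.",
    "NVIDIA data center revenue rose sharply.",
    "Supply chain risk factors.",
]
top = rerank("nvidia data center revenue", candidates, token_overlap, top_k=2)
print(top)
```

Because the scorer sees the query and passage together, it has to run once per candidate at query time, which is where the extra latency comes from.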
GroqSafeWrapper for RAGAS compatibility
RAGAS AnswerRelevancy internally requests n>1 completions (it generates reverse-questions for similarity scoring). The Groq API rejects n>1 entirely, so without a fix the entire evaluation crashes. GroqSafeWrapper subclasses ChatGroq and intercepts both _generate and _agenerate to force n=1 - enabling the metric with a single completion and producing valid relevancy scores without changing any RAGAS internals.
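The interception pattern can be sketched without the real client. Below, StubChatModel is a hypothetical stand-in for a provider client that rejects n>1, and SingleCompletionWrapper overrides both the sync and async generation methods to pin n to 1. The method signatures are simplified for illustration and are not ChatGroq's actual signatures.

```python
import asyncio


class StubChatModel:
    """Hypothetical stand-in for a client that rejects n > 1 (not ChatGroq)."""

    def _generate(self, messages, n=1, **kwargs):
        if n != 1:
            raise ValueError("n > 1 is not supported")
        return [f"completion for {len(messages)} message(s)"]

    async def _agenerate(self, messages, n=1, **kwargs):
        return self._generate(messages, n=n, **kwargs)


class SingleCompletionWrapper(StubChatModel):
    """Overrides both code paths so a caller asking for n > 1 gets n = 1."""

    def _generate(self, messages, **kwargs):
        kwargs["n"] = 1  # discard whatever the caller requested
        return super()._generate(messages, **kwargs)

    async def _agenerate(self, messages, **kwargs):
        kwargs["n"] = 1
        return await super()._agenerate(messages, **kwargs)


model = SingleCompletionWrapper()
sync_result = model._generate(["hi"], n=3)
async_result = asyncio.run(model._agenerate(["hi"], n=3))
print(sync_result, async_result)
```

Overriding only the sync method would leave any caller that uses the async path still failing, which is why both are intercepted.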
Shared embedding model singleton
Four components each need all-MiniLM-L6-v2: HybridRetriever (dense search), SemanticResponseCache (query similarity), DynamicContextWindow (relevance re-scoring), and QueryBatcher (eval). Loading the ~80MB model separately in each would hold four copies, wasting ~240MB of RAM and adding startup overhead. A class-level _shared_embedding_model is instantiated once at main.py startup and injected into all consumers - eliminating the redundancy.
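One way the load-once, inject-everywhere arrangement might look is sketched below. FakeEmbedder stands in for the real model so the example runs without downloads. EmbeddingProvider and the constructor signatures of the four consumers are assumptions; only the component names and _shared_embedding_model come from the text.

```python
class FakeEmbedder:
    """Stand-in for the ~80MB embedding model; counts how often it is constructed."""
    instances = 0

    def __init__(self) -> None:
        FakeEmbedder.instances += 1


class EmbeddingProvider:
    """Holds one class-level model instance (hypothetical helper)."""
    _shared_embedding_model = None

    @classmethod
    def get(cls):
        # Load on first use only; later calls return the same object.
        if cls._shared_embedding_model is None:
            cls._shared_embedding_model = FakeEmbedder()
        return cls._shared_embedding_model


class HybridRetriever:
    def __init__(self, embedder): self.embedder = embedder

class SemanticResponseCache:
    def __init__(self, embedder): self.embedder = embedder

class DynamicContextWindow:
    def __init__(self, embedder): self.embedder = embedder

class QueryBatcher:
    def __init__(self, embedder): self.embedder = embedder


# Startup wiring: fetch the model once, then inject it into every consumer.
shared = EmbeddingProvider.get()
consumers = [
    HybridRetriever(shared),
    SemanticResponseCache(shared),
    DynamicContextWindow(shared),
    QueryBatcher(shared),
]
print(FakeEmbedder.instances)
```

Passing the model in through the constructor also makes each consumer easy to test with a stub. This sketch is not thread-safe: concurrent first calls to get() would need a lock.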
FinSight-Alpha
Production-grade Agentic RAG for SEC financial intelligence - 0.91 RAGAS Faithfulness, 6-node LangGraph, 5-layer optimization.