Agentic RAG · LLM Systems · Financial Intelligence · LangGraph

FinSight-Alpha

Production-grade Agentic RAG for institutional SEC filing analysis. A 6-node LangGraph state machine with self-correcting hallucination detection achieves 0.91 RAGAS Faithfulness - well above the 0.70 institutional threshold.

RAGAS Faithfulness: 0.91 · institutional threshold ≥ 0.70 · LLM-as-judge · NVIDIA FY2026 10-K · adversarial tests included
Retrieval F1: 0.89 · +56% over BM25 baseline (0.57)
Cached Response: 50ms · vs 6.55s full pipeline · semantic similarity cache
Hard Failures: 0 · graceful degradation on every error path
LangGraph · LLaMA 3.3 70B · Qdrant · BM25 · CrossEncoder · FastAPI · Streamlit · RAGAS · Groq

The Core Problem

LLMs hallucinate financial numbers - with legal consequences

Financial analysts spend 40+ hours/week manually reading SEC 10-K filings. LLM-only approaches hallucinate specific figures, dates, and risk factors - a direct regulatory liability in financial contexts.

Hallucination Risk at Scale

Standard LLMs freely fabricate financial figures when context is insufficient. "NVIDIA's revenue was $X billion" with a plausible-but-wrong number goes undetected - until it reaches a portfolio decision. There is no self-correction loop, no citation grounding, no verification.

Dual Retrieval Failure Modes

Financial text breaks each retriever in a different way. Dense vectors miss exact strings such as "$47 billion" or "accession 0001045810", so BM25 is needed. BM25 misses paraphrases such as "fabless strategy" standing in for supply chain concentration, so dense retrieval is needed. Neither alone achieves acceptable F1.

API Rate Limits Under Real Load

A single 70B reasoning step consumes ~3,500 tokens. Without budget management, a multi-turn session exhausts Groq free-tier limits in minutes, and every subsequent query hard-fails with HTTP 429. Production systems need adaptive degradation, not crashes.

System Architecture

5-layer end-to-end pipeline

Every layer is independently replaceable. The design combines a parser strategy pattern, async ingestion with ThreadPoolExecutor, multi-collection Qdrant, and a budget-aware optimization stack that survives real traffic.

L1 · Ingestion Pipeline (ParserRegistry · Pandera): 284 indexed chunks · 424 KB processed JSONL · BM25 index rebuilt
L2 · Hybrid Retrieval Layer (Qdrant · BM25 · CrossEncoder): top-N ranked chunks - N=8/5/4 by budget tier - with [Doc X: source] identifiers
L3 · LangGraph Agent Core (LangGraph · LLaMA 3.3 70B): Citation-grounded answer with reasoning trace + live MRR/NDCG metrics
L4 · 5-Layer Optimization Stack (SemanticCache · BudgetMgr): 50ms cached response OR budget-tier-adapted pipeline (GREEN/YELLOW/RED)
L5 · FastAPI + Streamlit Serving (FastAPI · Streamlit): REST API /chat · 6-page live dashboard · live MRR/NDCG per query
Indexed Chunks: 284 · NVIDIA FY2026 10-K
Integration Tests: 1 · all 6 optimization modules
Cache Layers: 3 · semantic · result · embedding
API Endpoints: 5 · /chat · /health · /cache × 2 · /budget

LangGraph Agent Core

The self-correcting loop

The pipeline adapts automatically to the active budget tier. Each node below is listed with its model choice and latency contribution.

The Reflector is the system's intelligence: it judges whether the Reasoner's answer is grounded in retrieved documents - and if not, it triggers targeted re-retrieval rather than blind retry.
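
The reflect-then-re-retrieve behaviour described above can be sketched in plain Python. This is a minimal illustration, not the project's LangGraph code: `answer_with_reflection`, its callable parameters, and the default `max_loops=2` are hypothetical names and values chosen for the example.

```python
from typing import Callable, Optional


def answer_with_reflection(
    question: str,
    retrieve: Callable[[str], list],
    reason: Callable[[str, list], str],
    reflect: Callable[[str, list], Optional[str]],
    max_loops: int = 2,  # assumption: the real limit depends on the budget tier
) -> tuple:
    """Reason over retrieved chunks, then let a reflector request targeted re-retrieval.

    `reflect` returns None when the answer is grounded, otherwise a follow-up
    query describing what evidence is missing.
    """
    chunks = retrieve(question)
    answer = reason(question, chunks)
    for _ in range(max_loops):
        follow_up = reflect(answer, chunks)
        if follow_up is None:  # grounded: stop looping
            break
        # Targeted re-retrieval: fetch only what the reflector asked for,
        # and keep the chunks already gathered instead of starting over.
        chunks = chunks + retrieve(follow_up)
        answer = reason(question, chunks)
    return answer, chunks
```

The key design point is that the loop retries with a new, narrower query rather than repeating the original one, and that it always terminates after `max_loops` passes.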

P1 · Planner · LLaMA 3.1 8B · 420ms
P2 · Query Rewriter · LLaMA 3.1 8B · 680ms
P3 · Retriever · BM25 + Qdrant · 1,840ms + 720ms
P4 · Reasoner · LLaMA 3.3 70B · 2,100ms
P5 · Reflector · LLaMA 3.1 8B · 610ms
P6 · Responder · packaging · 180ms
GREEN TIER - 0-60k tokens
Planner: Active · Reasoner model: LLaMA 3.3 70B · Retrieval top_n: 8 chunks · Max reflect loops: 6
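
Budget-tier selection can be sketched as a simple threshold lookup. Only the GREEN range (0-60k tokens) and the top_n values 8/5/4 come from this page; mapping 5 and 4 to YELLOW and RED in that order, the 90k YELLOW boundary, and the names `TierConfig` and `tier_for` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    name: str
    top_n: int  # number of reranked chunks passed to the Reasoner


GREEN_LIMIT = 60_000   # from the page: GREEN covers 0-60k tokens
YELLOW_LIMIT = 90_000  # hypothetical boundary, not stated in the source


def tier_for(tokens_used: int) -> TierConfig:
    """Pick a degradation tier from the tokens consumed so far."""
    if tokens_used <= GREEN_LIMIT:
        return TierConfig("GREEN", 8)
    if tokens_used <= YELLOW_LIMIT:
        return TierConfig("YELLOW", 5)  # assumed mapping of the 8/5/4 values
    return TierConfig("RED", 4)


if __name__ == "__main__":
    for used in (10_000, 75_000, 120_000):
        print(used, tier_for(used))
```

Degrading the retrieval depth as the budget shrinks keeps requests flowing instead of letting the session run into HTTP 429.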
Latency Waterfall - single uncached pass
Planner 420ms · Query Rewriter 680ms · Retriever 1,840ms · Reranker 720ms · Reasoner 2,100ms · Reflector 610ms · Responder 180ms
Total: ~6,550ms · cached: ~50ms
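
The ~50ms cached path relies on a semantic similarity cache that skips the whole pipeline when a new query is close enough to a previous one. A minimal sketch of that idea follows, assuming queries are already embedded as vectors; the `SemanticCache` class, the linear scan, and the 0.95 cosine threshold are illustrative assumptions, not the project's SemanticResponseCache.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95):  # threshold is an assumed value
        self.threshold = threshold
        self._entries = []  # list of (query_vector, answer) pairs

    def put(self, query_vec: Sequence[float], answer: str) -> None:
        self._entries.append((list(query_vec), answer))

    def get(self, query_vec: Sequence[float]) -> Optional[str]:
        """Return the cached answer of the most similar query, or None on a miss."""
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

On a hit the caller returns the stored answer immediately; on a miss it runs the full pipeline and calls `put` with the new result.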
Retrieval Strategy

Why hybrid + reranking is non-negotiable for finance

Four retrieval strategies benchmarked end-to-end on NVIDIA 10-K queries. Each iteration adds a specific capability that addresses a concrete failure mode.

BM25 Only · F1 0.57 · P 0.63 · R 0.52 · latency 0.08s · Misses semantic synonyms
Dense Only · F1 0.70 · P 0.72 · R 0.68 · latency 0.32s · Misses exact figures like $47B
Hybrid + RRF · F1 0.81 · P 0.79 · R 0.83 · latency 0.40s · Best of both, no learned weights
Hybrid + Rerank (FINAL) · F1 0.89 · P 0.91 · R 0.87 · latency 0.63s · Joint query-doc attention
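
As a consistency check, each reported F1 score can be recomputed from its precision and recall using the standard harmonic-mean definition, F1 = 2PR / (P + R). The figures below are the ones listed on this page; the `f1` helper and `benchmarks` dictionary are names introduced for the example.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported for each retrieval strategy
benchmarks = {
    "BM25 Only": (0.63, 0.52),
    "Dense Only": (0.72, 0.68),
    "Hybrid + RRF": (0.79, 0.83),
    "Hybrid + Rerank": (0.91, 0.87),
}

for name, (p, r) in benchmarks.items():
    print(f"{name}: F1 = {f1(p, r):.2f}")
# BM25 Only: F1 = 0.57
# Dense Only: F1 = 0.70
# Hybrid + RRF: F1 = 0.81
# Hybrid + Rerank: F1 = 0.89
```

All four recomputed values match the reported F1 scores to two decimal places.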
RAGAS IMPROVEMENT JOURNEY - Faithfulness across 5 evaluation runs
Run 1: 0.71 · Run 2: 0.74 · Run 3: 0.79 · Run 4: 0.86 · Run 5: 0.91
Run 5 Faithfulness: 0.91 · Answer Relevancy: 0.90 · MRR (final): 0.88 · Adversarial (AMD hallucination): 0.88
Adversarial test: "What does the 10-K say about NVIDIA's secret plans to acquire AMD?" - The system correctly abstains rather than fabricating acquisition plans. Faithfulness 0.88 on this adversarial input.
Engineering Decisions

Five non-obvious choices

01

LangGraph over vanilla LangChain

LangChain sequential chains cannot express cycles. The reflect-retry loop requires a stateful cyclic graph: context_chunks must accumulate across retry iterations (via operator.add reducer) rather than overwriting on retry. LangChain would discard previously retrieved context each time the reflector requests more information - undermining the entire self-correction mechanism.

Cyclic retry with persistent accumulated state
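
The effect of an operator.add reducer can be shown without LangGraph installed. The sketch below declares the state the way LangGraph state schemas typically do (a TypedDict with an Annotated reducer) and then emulates the merge step by hand; `AgentState`, its fields other than context_chunks, and `apply_update` are hypothetical, and the emulation only approximates what the framework does when a node returns a partial update.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class AgentState(TypedDict):
    question: str
    # The reducer in the annotation says: merge updates with list concatenation.
    context_chunks: Annotated[list, operator.add]
    retry_count: int


def apply_update(state: dict, update: dict) -> dict:
    """Emulate a reducer-aware merge: reduce annotated keys, overwrite the rest."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)  # accumulate
        else:
            merged[key] = value  # plain keys are overwritten
    return merged


if __name__ == "__main__":
    state = {"question": "q", "context_chunks": ["[Doc 1]"], "retry_count": 0}
    state = apply_update(state, {"context_chunks": ["[Doc 2]"], "retry_count": 1})
    print(state["context_chunks"])  # ['[Doc 1]', '[Doc 2]']
```

Without the reducer, the second retrieval would replace `["[Doc 1]"]` instead of extending it, which is the overwrite behaviour the paragraph above warns about.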
02

Hybrid BM25+Dense over pure dense vectors

Financial queries have two distinct failure modes: semantic misses (paraphrased concepts like "fabless strategy" = supply chain concentration) and lexical misses (exact figures like "$47 billion" or "accession 0001045810"). Dense-only F1=0.70 fails exact numbers. BM25-only F1=0.57 fails paraphrase. RRF fusion with k=60 (standard constant from the original RRF paper) combines both without requiring learned weights.

F1: 0.57 → 0.89 (+56% over BM25 baseline)
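
Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in, with rank starting at 1. A minimal sketch with k=60 follows; the function name `rrf_fuse` and the toy document IDs are illustrative, and the project's own fusion code may differ in details such as tie-breaking.

```python
from collections import defaultdict
from typing import Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> list:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    bm25_ranking = ["d1", "d2", "d3"]    # lexical hits, e.g. exact figures
    dense_ranking = ["d3", "d1", "d4"]   # semantic hits, e.g. paraphrases
    print(rrf_fuse([bm25_ranking, dense_ranking]))  # ['d1', 'd3', 'd2', 'd4']
```

Because only ranks are used, BM25 scores and cosine similarities never need to be normalised onto a common scale, which is why no learned weights are required.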
03

Cross-encoder reranking as non-negotiable

Bi-encoders embed query and document independently, so attention cannot flow between them. A cross-encoder processes each (query, passage) pair jointly, letting query tokens attend to document tokens. For financial precision (exact citation required), this joint attention is what separates 0.79 precision (RRF alone) from 0.91 precision (with reranker). The ~0.72s of reranking latency is justified by the 15% relative precision gain.

Precision: 0.79 → 0.91 with CrossEncoder
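
The reranking step itself is small: build (query, passage) pairs, score each pair jointly, and keep the top results. The sketch below takes the scorer as a parameter so it runs without a model; `rerank` and the word-overlap `overlap_scorer` are hypothetical stand-ins. In practice the scorer would be a cross-encoder, for example the `predict` method of a sentence-transformers `CrossEncoder`, which accepts such a list of pairs.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    passages: Sequence[str],
    score_pairs: Callable[[list], Sequence[float]],
    top_n: int,
) -> list:
    """Score every (query, passage) pair jointly and return the top_n passages."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]


def overlap_scorer(pairs: list) -> list:
    """Toy stand-in for a cross-encoder: counts words shared by query and passage."""
    return [
        float(len(set(q.lower().split()) & set(p.lower().split())))
        for q, p in pairs
    ]


if __name__ == "__main__":
    candidates = [
        "Gaming revenue declined.",
        "Data center revenue grew strongly.",
        "The company is fabless.",
    ]
    print(rerank("data center revenue", candidates, overlap_scorer, top_n=2))
```

The cost scales with the number of candidate pairs, which is why reranking is applied only to the fused shortlist rather than to the whole index.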
04

GroqSafeWrapper for RAGAS compatibility

RAGAS AnswerRelevancy internally requests n>1 completions: it generates several reverse questions from the answer and scores their similarity to the original question. The Groq API rejects n>1 entirely, so without a fix the entire evaluation crashes. GroqSafeWrapper subclasses ChatGroq and intercepts both _generate and _agenerate to force n=1, which lets the metric run on a single completion and produce valid relevancy scores without changing any RAGAS internals.

RAGAS evaluation runs on Groq free-tier
05

Shared embedding model singleton

Four components each need all-MiniLM-L6-v2: HybridRetriever (dense search), SemanticResponseCache (query similarity), DynamicContextWindow (relevance re-scoring), and QueryBatcher (eval). Loading the ~80MB model separately in each would keep three redundant copies in memory, wasting ~240MB RAM and adding startup overhead. A class-level _shared_embedding_model is instantiated once at main.py startup and injected into all consumers, eliminating the redundancy.

Saves ~240MB RAM · 4× model load eliminated
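
A load-once, inject-everywhere pattern for the shared model might look like the sketch below. Only the attribute name _shared_embedding_model comes from the text above; `EmbeddingModelProvider`, `Consumer`, and the `fake_loader` stand-in are hypothetical, and a real loader would construct the actual embedding model instead of a placeholder object.

```python
from typing import Any, Callable


class EmbeddingModelProvider:
    """Holds one embedding model for the whole process."""

    _shared_embedding_model: Any = None

    @classmethod
    def get(cls, loader: Callable[[], Any]) -> Any:
        # Load on first request only; later callers reuse the same instance.
        if cls._shared_embedding_model is None:
            cls._shared_embedding_model = loader()
        return cls._shared_embedding_model


class Consumer:
    """Stand-in for a component that receives the model by injection."""

    def __init__(self, name: str, embedding_model: Any):
        self.name = name
        self.embedding_model = embedding_model


load_calls = []


def fake_loader() -> object:
    """Placeholder for the expensive model load; records each invocation."""
    load_calls.append(1)
    return object()


# Startup wiring: one load, four consumers sharing the same instance.
names = ["HybridRetriever", "SemanticResponseCache", "DynamicContextWindow", "QueryBatcher"]
consumers = [Consumer(n, EmbeddingModelProvider.get(fake_loader)) for n in names]
```

Injecting the instance through constructors, rather than having each component reach for a global, also makes it easy to pass a lightweight fake model in tests.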
