Agentic RAG · LLM Systems · Financial Intelligence · LangGraph

FinSight-Alpha

Production-grade Agentic RAG for institutional SEC filing analysis. A 6-node LangGraph state machine with self-correcting hallucination detection achieves 0.91 RAGAS Faithfulness - well above the 0.70 institutional threshold.

RAGAS Faithfulness: 0.91 (institutional threshold: 0.70)

LLM-as-judge · NVIDIA FY2026 10-K · adversarial tests included

Cached query response time: 50ms

vs 6.55s full pipeline · semantic similarity cache

LangGraph · LLaMA 3.3 70B · Qdrant · BM25 · CrossEncoder · FastAPI · Streamlit · RAGAS · Groq

Retrieval F1: 0.89

+56% over BM25 baseline (0.57)

Hard Failures

graceful degradation on every error path

The Core Problem

LLMs hallucinate financial numbers - with legal consequences

Financial analysts spend 40+ hours/week manually reading SEC 10-K filings. LLM-only approaches hallucinate specific figures, dates, and risk factors - a direct regulatory liability in financial contexts.

Hallucination Risk at Scale

Standard LLMs freely fabricate financial figures when context is insufficient. "NVIDIA's revenue was $X billion" with a plausible-but-wrong number goes undetected - until it reaches a portfolio decision. There is no self-correction loop, no citation grounding, no verification.

Dual Retrieval Failure Modes

Retrieval over financial text fails in two ways. Dense vectors miss exact strings such as "$47 billion" or "accession 0001045810", which need BM25. BM25 misses paraphrases such as "fabless strategy" standing for supply chain concentration, which need dense retrieval. Neither alone achieves an acceptable F1.

API Rate Limits Under Real Load

A single reasoning step on the 70B model consumes ~3,500 tokens. Without budget management, a multi-turn session exhausts Groq free-tier limits in minutes, and every subsequent query then hard-fails with HTTP 429. Production systems need adaptive degradation, not crashes.

System Architecture

5-layer end-to-end pipeline

Every layer is independently replaceable. The design combines a parser strategy pattern, async ingestion with ThreadPoolExecutor, multi-collection Qdrant, and a budget-aware optimization stack that survives real traffic.

Ingestion Pipeline: ParserRegistry · Pandera

284 indexed chunks · 424 KB processed JSONL · BM25 index rebuilt

Hybrid Retrieval Layer: Qdrant · BM25 · CrossEncoder

top-N ranked chunks - N=8/5/4 by budget tier - with [Doc X: source] identifiers

LangGraph Agent Core: LangGraph · LLaMA 3.3 70B

Citation-grounded answer with reasoning trace + live MRR/NDCG metrics

5-Layer Optimization Stack: SemanticCache · BudgetMgr

50ms cached response OR budget-tier-adapted pipeline (GREEN/YELLOW/RED)
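A minimal sketch of how budget-tier adaptation could work, assuming a running token counter. Only the GREEN range (0-60k tokens) and the top-N values 8/5/4 come from this page. The `BudgetManager` class, the YELLOW ceiling, and the mapping of 8/5/4 onto GREEN/YELLOW/RED in that order are illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    GREEN = "GREEN"
    YELLOW = "YELLOW"
    RED = "RED"


# GREEN's 0-60k range and the 8/5/4 top_n values are quoted on this page.
GREEN_CEILING = 60_000
# Hypothetical placeholder: the page does not state where YELLOW ends.
YELLOW_CEILING = 85_000

# Assumed order: GREEN -> 8 chunks, YELLOW -> 5, RED -> 4.
TOP_N = {Tier.GREEN: 8, Tier.YELLOW: 5, Tier.RED: 4}


@dataclass
class BudgetManager:
    """Tracks token spend and exposes the tier-dependent retrieval depth."""

    tokens_used: int = 0

    def record(self, tokens: int) -> None:
        """Add the tokens consumed by one pipeline step."""
        self.tokens_used += tokens

    @property
    def tier(self) -> Tier:
        if self.tokens_used <= GREEN_CEILING:
            return Tier.GREEN
        if self.tokens_used <= YELLOW_CEILING:
            return Tier.YELLOW
        return Tier.RED

    @property
    def top_n(self) -> int:
        """Number of chunks the retriever should return in the current tier."""
        return TOP_N[self.tier]
```

The point of the design is that downstream nodes read `top_n` (and, in a fuller version, model choice) from the manager instead of hard-coding it, so the pipeline degrades step by step as the budget drains rather than failing on a 429.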

FastAPI + Streamlit Serving: FastAPI · Streamlit

REST API /chat · 6-page live dashboard · live MRR/NDCG per query

Indexed Chunks: 284 · NVIDIA FY2026 10-K

Integration Tests

all 6 optimization modules

Cache Layers

semantic · result · embedding
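The semantic layer is the one behind the 50ms cached responses. Below is a hedged sketch of a similarity-threshold cache. The `SemanticCache` class, the 0.9 threshold, and the `toy_embed` function are assumptions made for illustration; a real deployment would pass an embedding function backed by a sentence-embedding model instead of the toy word counter.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold  # hypothetical value, tune per workload
        self._entries: list[tuple[Vector, str]] = []

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self.embed(query), answer))

    def get(self, query: str) -> Optional[str]:
        q = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self._entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # Miss (None) means the caller must run the full pipeline.
        return best_answer if best_score >= self.threshold else None


# Toy stand-in for a real embedding model: counts a few vocabulary words.
VOCAB = ["nvidia", "revenue", "risk", "supply"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


cache = SemanticCache(toy_embed)
cache.put("nvidia revenue", "cached answer about revenue")
hit = cache.get("revenue nvidia")   # same words, different order
miss = cache.get("supply risk")     # unrelated query
```

A linear scan is enough for a sketch; at larger scale the lookup would normally go through a vector index.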

API Endpoints

/chat · /health · /cache × 2 · /budget

LangGraph Agent Core

The self-correcting loop - interactive

The Reflector is the system's intelligence: it judges whether the Reasoner's answer is grounded in retrieved documents - and if not, it triggers targeted re-retrieval rather than blind retry.
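The routing decision described above can be sketched as a small pure function. This is an illustration under stated assumptions: the `Reflection` type, the `next_step` function, the choice to route back to the retriever, and the `max_loops` parameter are hypothetical names, and the page does not state the actual loop limit.

```python
from dataclasses import dataclass


@dataclass
class Reflection:
    """Verdict from the judge model (hypothetical shape)."""

    grounded: bool
    missing: str = ""  # what the judge says is not supported by the context


def next_step(reflection: Reflection, question: str,
              loops_done: int, max_loops: int) -> tuple[str, str]:
    """Return (next node name, query to use there)."""
    if reflection.grounded:
        return ("responder", question)
    if loops_done >= max_loops:
        # Out of retries: hand over to the responder, which can abstain
        # instead of returning an ungrounded answer.
        return ("responder", question)
    # Targeted re-retrieval: extend the query with what was missing,
    # rather than blindly repeating the same search.
    return ("retriever", f"{question} {reflection.missing}".strip())
```

In LangGraph, a function of this kind would typically back a conditional edge leaving the Reflector node.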

Planner · LLaMA 3.1 8B · 420ms

Query Rewriter · LLaMA 3.1 8B · 680ms

Retriever · BM25 + Qdrant · 1,840ms + 720ms

Reasoner · LLaMA 3.3 70B · 2,100ms

Reflector · LLaMA 3.1 8B · 610ms

Responder · packaging · 180ms

GREEN TIER - 0-60k tokens

Planner: Active

Reasoner model: LLaMA 3.3 70B

Retrieval top_n: 8 chunks

Max reflect loops

Latency Waterfall - single uncached pass

Planner: 420ms

Query Rewriter: 680ms

Retriever: 1,840ms

Reranker: 720ms

Reasoner: 2,100ms

Reflector: 610ms

Responder: 180ms

Total: ~6,550ms · cached: ~50ms
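The total and the cache speedup follow directly from the per-stage figures in the waterfall. A quick check in Python, using only numbers quoted on this page:

```python
# Per-stage latencies from the waterfall, in milliseconds.
stage_ms = {
    "Planner": 420,
    "Query Rewriter": 680,
    "Retriever": 1840,
    "Reranker": 720,
    "Reasoner": 2100,
    "Reflector": 610,
    "Responder": 180,
}

total_ms = sum(stage_ms.values())            # 6550
cached_ms = 50
speedup = total_ms / cached_ms               # 131.0
slowest = max(stage_ms, key=stage_ms.get)    # "Reasoner"

print(f"total={total_ms}ms speedup={speedup:.0f}x slowest={slowest}")
```

The Reasoner and Retriever together account for 3,940ms of the 6,550ms, which is why a cache hit that skips both has such a large effect.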

Retrieval Strategy

Why hybrid + reranking is non-negotiable for finance

Four retrieval strategies benchmarked end-to-end on NVIDIA 10-K queries. Each iteration adds a specific capability that addresses a concrete failure mode.

BM25 Only: F1 0.57 · Precision 0.63 · Recall 0.52 · Latency 0.08s · Misses semantic synonyms

Dense Only: F1 0.70 · Precision 0.72 · Recall 0.68 · Latency 0.32s · Misses exact figures like $47B

Hybrid + RRF: F1 0.81 · Precision 0.79 · Recall 0.83 · Latency 0.40s · Best of both, no learned weights

Hybrid + Rerank (FINAL): F1 0.89 · Precision 0.91 · Recall 0.87 · Latency 0.63s · Joint query-doc attention
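As a sanity check on the benchmark, the F1 scores can be recomputed from the two values listed with each strategy. This assumes those two values are precision and recall, in that order, which matches the precision figures (0.79 and 0.91) quoted under Engineering Decisions.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per strategy, as listed on this page.
results = {
    "BM25 Only": (0.63, 0.52, 0.57),
    "Dense Only": (0.72, 0.68, 0.70),
    "Hybrid + RRF": (0.79, 0.83, 0.81),
    "Hybrid + Rerank": (0.91, 0.87, 0.89),
}

for name, (p, r, reported) in results.items():
    computed = f1(p, r)
    # Each recomputed value should match the reported one to two decimals.
    assert abs(computed - reported) < 0.005, name
    print(f"{name}: computed F1 = {computed:.3f}, reported = {reported:.2f}")

# Relative gain of the final configuration over the BM25 baseline.
relative_gain = (0.89 - 0.57) / 0.57
print(f"relative F1 gain over BM25: {relative_gain:.0%}")
```

All four reported F1 values are consistent with their precision/recall pairs, and the relative gain comes out at roughly 56%.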

RAGAS IMPROVEMENT JOURNEY - Faithfulness across 5 evaluation runs

Run 1: 0.71 · Run 2: 0.74 · Run 3: 0.79 · Run 4: 0.86 · Run 5: 0.91

Run 5 Faithfulness: 0.91

Answer Relevancy: 0.90

MRR (final): 0.88

Adversarial (AMD hallucination): 0.88

Adversarial test: "What does the 10-K say about NVIDIA's secret plans to acquire AMD?" - The system correctly abstains rather than fabricating acquisition plans. Faithfulness 0.88 on this adversarial input.

Engineering Decisions

Five non-obvious choices

LangGraph over vanilla LangChain

LangChain's sequential chains cannot express cycles. The reflect-retry loop requires a stateful cyclic graph in which context_chunks accumulate across retry iterations (via an operator.add reducer) instead of being overwritten. A sequential chain would discard previously retrieved context each time the Reflector requests more information, undermining the entire self-correction mechanism.

Cyclic retry with persistent accumulated state
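The accumulation behaviour can be illustrated without running LangGraph. In LangGraph, a state field annotated as `Annotated[list[str], operator.add]` is merged with that reducer instead of being overwritten. The `apply_update` helper below is a hypothetical stand-in that imitates that merge rule for demonstration; it is not LangGraph's implementation.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class AgentState(TypedDict):
    question: str
    # Reducer annotation: updates are concatenated onto the existing list.
    context_chunks: Annotated[list[str], operator.add]


def apply_update(state: dict, update: dict, schema=AgentState) -> dict:
    """Merge a node's partial update into the state, honoring reducers."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)  # accumulate
        else:
            merged[key] = value                       # plain overwrite
    return merged


state = {"question": "What drove data center revenue?", "context_chunks": ["chunk A"]}
# A second retrieval pass (after the Reflector asks for more) adds chunks.
state = apply_update(state, {"context_chunks": ["chunk B", "chunk C"]})
print(state["context_chunks"])  # ['chunk A', 'chunk B', 'chunk C']
```

Fields without a reducer, such as `question` here, are simply replaced. Keeping earlier chunks is the behaviour the retry loop depends on.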

Hybrid BM25+Dense over pure dense vectors

Financial queries have two distinct failure modes: semantic misses (paraphrased concepts, such as "fabless strategy" standing for supply chain concentration) and lexical misses (exact strings like "$47 billion" or "accession 0001045810"). Dense-only retrieval (F1=0.70) fails on exact numbers; BM25-only retrieval (F1=0.57) fails on paraphrase. RRF fusion with k=60 (the standard constant from the original RRF paper) combines both without requiring learned weights.

F1: 0.57 → 0.89 (+56% over BM25 baseline)
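Reciprocal Rank Fusion itself is only a few lines: each document scores 1/(k + rank) in every ranking that contains it, and the scores are summed. A minimal sketch with k=60, where the function name and chunk IDs are made up for illustration:

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs; best fused score first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers for one query.
bm25_ranking = ["chunk_12", "chunk_07", "chunk_33"]
dense_ranking = ["chunk_07", "chunk_41", "chunk_12"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)  # ['chunk_07', 'chunk_12', 'chunk_41', 'chunk_33']
```

Only ranks enter the formula, so BM25 scores and cosine similarities never need to be put on a common scale, which is why no learned weights are required.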

Cross-encoder reranking as non-negotiable

Bi-encoders embed query and document independently, so attention cannot flow between them. A cross-encoder processes each (query, passage) pair jointly, letting query tokens attend to document tokens. For financial answers that require exact citations, this joint attention is what lifts precision from 0.79 (RRF alone) to 0.91 (with the reranker). The added 0.72s of latency is justified by the 15% relative precision gain.

Precision: 0.79 → 0.91 with CrossEncoder
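The rerank step has a simple shape: build (query, passage) pairs, score each pair jointly, sort, keep the top N. The sketch below keeps the scorer pluggable. With sentence-transformers, the `predict` method of a `CrossEncoder` accepts such a list of pairs and could be passed as `score_pairs`. The `toy_overlap_scorer` used here is only a stand-in so the example runs without a model download; it is not a cross-encoder, and all names are hypothetical.

```python
from typing import Callable

PairScorer = Callable[[list[tuple[str, str]]], list[float]]


def rerank(query: str, passages: list[str],
           score_pairs: PairScorer, top_n: int) -> list[tuple[str, float]]:
    """Score every (query, passage) pair and return the top_n passages."""
    pairs = [(query, passage) for passage in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: fraction of query tokens that appear in the passage."""
    scores = []
    for query, passage in pairs:
        q_tokens = set(query.lower().split())
        p_tokens = set(passage.lower().split())
        scores.append(len(q_tokens & p_tokens) / max(len(q_tokens), 1))
    return scores


candidates = [
    "Gaming revenue grew",
    "Data center revenue grew strongly",
    "Supply chain risk",
]
top = rerank("data center revenue", candidates, toy_overlap_scorer, top_n=2)
top_passages = [passage for passage, _ in top]
```

Because the scorer runs once per candidate, reranking cost grows with the size of the fused candidate list, so shrinking top-N also shrinks rerank latency.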

GroqSafeWrapper for RAGAS compatibility

RAGAS AnswerRelevancy internally requests n>1 completions (it generates reverse-questions for similarity scoring). The Groq API rejects n>1 outright, so without a fix the entire evaluation crashes. GroqSafeWrapper subclasses ChatGroq and intercepts both _generate and _agenerate to force n=1, so the metric runs on a single completion and produces valid relevancy scores without changing any RAGAS internals.

RAGAS evaluation runs on Groq free-tier

Shared embedding model singleton

Four components each need all-MiniLM-L6-v2: HybridRetriever (dense search), SemanticResponseCache (query similarity), DynamicContextWindow (relevance re-scoring), and QueryBatcher (eval). Loading the ~80MB model separately in each one means four copies where one suffices, wasting ~240MB of RAM (three redundant copies) and adding startup overhead. A class-level _shared_embedding_model is instantiated once at main.py startup and injected into all consumers, eliminating the redundancy.

Saves ~240MB RAM · 4× model load eliminated
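A sketch of the load-once-and-inject pattern, with a fake loader standing in for the real model so the example runs anywhere. The `EmbeddingProvider` and `Consumer` classes and the `fake_loader` function are hypothetical; only the `_shared_embedding_model` attribute name and the four component names come from this page.

```python
from typing import Callable, ClassVar, Optional


class EmbeddingProvider:
    """Holds one shared model instance at class level."""

    _shared_embedding_model: ClassVar[Optional[object]] = None
    load_count: ClassVar[int] = 0  # instrumentation for this demo only

    @classmethod
    def get(cls, loader: Callable[[], object]) -> object:
        # Load on first request; every later call reuses the same object.
        if cls._shared_embedding_model is None:
            cls._shared_embedding_model = loader()
            cls.load_count += 1
        return cls._shared_embedding_model


class Consumer:
    """Any component that needs embeddings receives the model by injection."""

    def __init__(self, name: str, model: object):
        self.name = name
        self.model = model


def fake_loader() -> object:
    # Stand-in for loading the ~80MB embedding model from disk.
    return object()


model = EmbeddingProvider.get(fake_loader)
names = ["HybridRetriever", "SemanticResponseCache",
         "DynamicContextWindow", "QueryBatcher"]
consumers = [Consumer(name, EmbeddingProvider.get(fake_loader)) for name in names]

# Memory arithmetic from the text: four copies vs one, at ~80MB each.
saved_mb = 4 * 80 - 1 * 80  # 240
```

Injecting the instance, rather than having each component call the loader itself, also makes it easy to substitute a stub model in tests.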
