
Quantitative Finance · 13F Intelligence · Institutional Rigour

Andria Systems

Mines 116M rows of SEC 13F filings to generate regime-conditioned activist equity signals. HDBSCAN manager clustering, Gaussian HMM macro regimes, and a mandatory 6-check leakage audit produce institutional-grade alpha signals validated at an out-of-sample Sharpe ratio of 1.847.

Out-of-Sample Sharpe Ratio

1.847

10 expanding walk-forward folds · 2010-2024 · after transaction costs

Python 3.12 · DuckDB · HDBSCAN + UMAP · Gaussian HMM · MLflow · Next.js 14 · Docker · GitHub Actions

1.847

Out-of-Sample Sharpe

10 walk-forward folds, 2010-2024

116M

EDGAR Rows Processed

DuckDB in-process, 16 GB RAM

6 / 6

Leakage Checks Passed

Non-bypassable pre-flight audit

Monte Carlo Sims

Bootstrap + timing + regime permutations

Why naive 13F copying fails

Four Ways Raw 13F Data Misleads You

Most practitioners lose their edge before they even compute a signal. The problem isn't the data - it's how it's used.

45-Day Look-Ahead Bias

13F filings are due within 45 days of quarter end. Using the quarter-end date directly as the signal date gives a free look-ahead window of up to 45 days - the most common source of inflated Sharpe in 13F research. Many published 13F strategies lose most of their apparent edge once this is corrected.
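A minimal sketch of the lag correction, assuming a hypothetical helper named `signal_date`. The project's own MarketCalendar is not shown here, so the function name, the fallback rule, and the use of calendar days are illustrative assumptions rather than the actual implementation.

```python
from datetime import date, timedelta
from typing import Optional

FILING_LAG_DAYS = 45  # 13F deadline: within 45 days of quarter end


def signal_date(quarter_end: date, filed_on: Optional[date] = None) -> date:
    """Earliest date on which a 13F holding may be used as a signal.

    Uses the actual filing date when it is known; otherwise falls back to
    the conservative deadline of quarter end + 45 calendar days.
    """
    if filed_on is None:
        return quarter_end + timedelta(days=FILING_LAG_DAYS)
    if filed_on <= quarter_end:
        raise ValueError("a filing cannot be dated on or before its quarter end")
    return filed_on


# Holdings as of 2023-12-31 are not tradable information until mid-February.
print(signal_date(date(2023, 12, 31)))  # 2024-02-14
```

Keying every downstream join on this date instead of the quarter-end date is what removes the look-ahead window.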

Manager Homogenisation

A passive index fund and a concentrated activist hedge fund are both "13F filers." Treating them identically in a signal means an index hugger's mandatory S&P 500 position counts equally to an activist's carefully researched 8% portfolio bet. The signal is diluted to noise.

Crowding Blindness

A CUSIP held by 300 institutional managers carries little informational edge - nearly everyone already owns it. Raw 13F signals weight popular positions highest, which is exactly backwards. The crowding penalty in RACS explicitly deflates scores for institutionally saturated positions.

Regime Ignorance

Activist conviction signals that work in Goldilocks environments can be destructive in Rate_Shock. A strategy that buys high-conviction activist positions uniformly across 2010-2024 is unknowingly long risk in every down-regime. Macro conditioning is not optional.

End-to-end system

A 6-Stage Production Pipeline

From raw SEC filings to validated walk-forward results. Each stage has typed schema contracts, SHA256 input hashes, and MLflow artifact logging.
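As a rough illustration of the input-hashing step, the sketch below computes a SHA256 digest of a stage input in fixed-size chunks. The function name `sha256_of_file` and the 1 MiB chunk size are assumptions; how the project stores or compares these hashes is not described here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording the digest alongside each stage's output makes it possible to tell whether a rerun saw byte-identical inputs.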

Data Ingestion · DuckDB

Hive-partitioned Parquet (ZSTD)

Manager DNA · DuckDB SQL

14 behavioral features per manager

HDBSCAN Clustering · scikit-learn

4 behavioral archetypes

Regime Detection · hmmlearn

4 macro regimes + per-day probabilities

RACS Signal · DuckDB SQL

regime_adjusted_racs per (CUSIP, quarter)

Backtest + Validation · Python

1.847 OOS Sharpe · PBO < 0.5 · DSR > 1.0

Core innovation

The RACS v2 Signal

Raw activist conviction signals have crowding risk. RACS transforms "follow the smart money" into "follow the right smart money at the right macro moment."

racs.py - RACS v2 formula

RACS raw: consensus_weight × ln(activist_buyers + 1.1)

Crowding: (1 − total_inst_holders / total_managers)

Regime mult: (1 ± regime_weight × regime_prob)

Final signal: regime_adjusted_racs, per (CUSIP, quarter)

Crowding Penalty

1 − (total_inst_holders / total_managers)

A CUSIP held by 300 of 500 institutional managers has a crowding ratio of 300 / 500 = 0.6, so its penalty factor is 1 − 0.6 = 0.4 and its score is deflated by 60%. Positions already owned by the institutional consensus carry little informational edge; RACS explicitly prices this in.
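The penalty above is a one-line computation. The sketch below wraps it in a small function; the name `crowding_penalty` and the input validation are illustrative additions.

```python
def crowding_penalty(total_inst_holders: int, total_managers: int) -> float:
    """Multiplier in [0, 1]: one minus the share of managers holding the CUSIP."""
    if total_managers <= 0:
        raise ValueError("total_managers must be positive")
    if not 0 <= total_inst_holders <= total_managers:
        raise ValueError("total_inst_holders must lie in [0, total_managers]")
    return 1.0 - total_inst_holders / total_managers


# The 300-of-500 example: the score keeps roughly 40% of its raw value.
penalty = crowding_penalty(300, 500)
```

A position held by nobody keeps its full score (factor 1.0), and one held by every manager is zeroed out.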

Regime Multiplier

(1 ± regime_weight × regime_prob)

Goldilocks and Recovery regimes amplify the signal (+30% at full confidence). Rate_Shock and Recession_Fear suppress it (−30%). The multiplier is probability-weighted - a regime with 0.95 HMM confidence has stronger conditioning than a mixed-state transition.
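A sketch of how the multiplier could be applied, under two assumptions: `regime_weight` is 0.30 (implied by the ±30% figure at full confidence), and the three RACS components combine as a simple product (the source lists the components but does not spell out the combination). Function names are hypothetical.

```python
import math

AMPLIFYING = {"Goldilocks", "Recovery"}
SUPPRESSING = {"Rate_Shock", "Recession_Fear"}


def regime_multiplier(regime: str, regime_prob: float,
                      regime_weight: float = 0.30) -> float:
    """(1 + w * p) for amplifying regimes, (1 - w * p) for suppressing ones."""
    if not 0.0 <= regime_prob <= 1.0:
        raise ValueError("regime_prob must be a probability")
    if regime in AMPLIFYING:
        sign = 1.0
    elif regime in SUPPRESSING:
        sign = -1.0
    else:
        raise ValueError(f"unknown regime: {regime!r}")
    return 1.0 + sign * regime_weight * regime_prob


def regime_adjusted_racs(consensus_weight: float, activist_buyers: int,
                         total_inst_holders: int, total_managers: int,
                         regime: str, regime_prob: float) -> float:
    """Assumed combination: raw score x crowding penalty x regime multiplier."""
    raw = consensus_weight * math.log(activist_buyers + 1.1)
    crowding = 1.0 - total_inst_holders / total_managers
    return raw * crowding * regime_multiplier(regime, regime_prob)
```

With this convention a Goldilocks state at 0.95 probability scales the score by about 1.285, while a 50/50 transition state scales it by only 1.15.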

HDBSCAN + UMAP

4 Behavioral Archetypes

An HDBSCAN sweep across 5 min_cluster_size values retains the configuration with the best silhouette score. Cosine-similarity labeling then produces stable archetype names across refits, regardless of the integer cluster labels assigned in any one run.

Conviction Activists

High concentration, growing conviction, long holds

High HHI (concentration)

Growing conviction delta

Low put ratio

Multi-quarter holds

Build large concentrated positions with increasing conviction over time. Their alignment of position sizing, holding duration, and absence of hedging signals a genuine information edge - not index replication. Only this cluster contributes to the RACS numerator.

PRIMARY SIGNAL SOURCE

Index Huggers

Massive AUM, diversified, passive mandate

Low HHI (diversified)

Very high AUM

Low conviction delta

Low new position rate

Large institutional managers with passive or quasi-passive mandates. Their positions reflect index-tracking, not conviction. Including them in the RACS numerator would dilute signal quality - they appear only in the crowding penalty denominator.

CROWDING DENOMINATOR ONLY

Macro Tourists

High options exposure, regime-reactive rotation

High put ratio

High options notional

High turnover

Regime-correlated churn

Heavy options hedging and high turnover. Their signals are regime-lagged rather than forward-predictive - including them introduces noise precisely when the regime model is transitioning. Signal quality is anti-correlated with conviction quality.

EXCLUDED

Nimble Traders

Small AUM, tactical entry/exit, short windows

Low AUM

High new_position_rate

High exit_rate

Short holding duration

Highly active smaller managers with rapid position turnover. Holding windows are too short (under 1 quarter) to generate predictive signals in a 90-day forward return framework. Signal half-life is shorter than the signal observation window.

EXCLUDED

Institutional-grade validation

Six-Check Leakage Audit

The LeakageAuditReport is mandatory and non-bypassable - it runs inside AlphaFactoryEngine.run_backtest() before any metrics are computed. ERROR-level findings halt execution immediately.

ERROR

check_future_timestamps

No signal timestamp may reference data beyond its filing date. Raises BacktestError if violated.

ERROR

check_lookahead_joins

Validates all pricing asof-joins are directional. Entry = forward fill; exit = backward fill.

WARNING

check_forward_contamination

Scans DuckDB SQL patterns for cross-table joins that reference future quarters.

WARNING

check_duplicate_signals

Prevents the same (CUSIP, date) from appearing in multiple walk-forward folds with different labels.

WARNING

check_overlapping_labels

Ensures train/test windows share no CUSIP-date labels across expanding folds.

WARNING

check_regime_leakage

HMM regime labels must use only data available at signal decision time, not end-of-quarter values.
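The severity rule (ERROR halts, WARNING is reported) can be sketched as follows. This is not the project's LeakageAuditReport: the `Signal` and `Finding` types, the `run_audit` runner, and the simplified bodies of the two checks are assumptions made for illustration. In particular, `check_future_timestamps` is read here as "a signal may not be dated before its filing became public".

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Sequence


class BacktestError(RuntimeError):
    """Raised when an ERROR-level leakage finding is detected."""


@dataclass(frozen=True)
class Signal:
    cusip: str
    signal_date: date  # date the signal is acted on
    filing_date: date  # date the underlying 13F became public


@dataclass(frozen=True)
class Finding:
    check: str
    severity: str  # "ERROR" or "WARNING"
    message: str


def check_future_timestamps(signals: Sequence[Signal]) -> List[Finding]:
    """ERROR: a signal dated before its filing uses data from the future."""
    return [
        Finding("check_future_timestamps", "ERROR",
                f"{s.cusip}: signal {s.signal_date} precedes filing {s.filing_date}")
        for s in signals if s.signal_date < s.filing_date
    ]


def check_duplicate_signals(signals: Sequence[Signal]) -> List[Finding]:
    """WARNING (simplified): the same (CUSIP, date) key appears more than once."""
    counts = Counter((s.cusip, s.signal_date) for s in signals)
    return [
        Finding("check_duplicate_signals", "WARNING", f"{key} appears {n} times")
        for key, n in counts.items() if n > 1
    ]


Check = Callable[[Sequence[Signal]], List[Finding]]


def run_audit(signals: Sequence[Signal],
              checks: Sequence[Check] = (check_future_timestamps,
                                         check_duplicate_signals)) -> List[Finding]:
    """Run every check; raise on any ERROR, otherwise return the WARNINGs."""
    findings = [f for check in checks for f in check(signals)]
    errors = [f for f in findings if f.severity == "ERROR"]
    if errors:
        raise BacktestError("; ".join(f.message for f in errors))
    return findings
```

Calling the runner at the top of the backtest entry point, before any metric is computed, is what makes the audit non-bypassable in practice.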

PBO (CSCV) · Target: < 0.5

Probability of Backtest Overfitting

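Full CSCV (combinatorially symmetric cross-validation) implementation (Bailey, Borwein, Lopez de Prado & Zhu 2016). N=16 partitions, M=21 trials. PBO is the probability that the configuration selected as best in-sample ranks below the median out-of-sample; PBO < 0.5 means that outcome is less likely than not, which is the bar this project sets for publishable quant research.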
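A compact, pure-Python sketch of the CSCV procedure, written for readability rather than speed. The function names and the use of a per-period Sharpe ratio as the performance metric are assumptions; the project's implementation is not shown. Rows that do not fill a whole partition are dropped.

```python
import math
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Sequence


def sharpe(returns: Sequence[float]) -> float:
    """Per-period Sharpe ratio; 0.0 when the series has no variance."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns: Sequence[Sequence[float]], n_partitions: int = 16) -> float:
    """Probability of backtest overfitting; returns[t][k] is trial k at period t."""
    if n_partitions < 2 or n_partitions % 2:
        raise ValueError("n_partitions must be an even number >= 2")
    n_trials = len(returns[0])
    block = len(returns) // n_partitions
    if block < 2:
        raise ValueError("need at least 2 rows per partition")
    blocks = [range(i * block, (i + 1) * block) for i in range(n_partitions)]

    logits: List[float] = []
    for in_sample in combinations(range(n_partitions), n_partitions // 2):
        chosen = set(in_sample)
        is_rows = [t for b in in_sample for t in blocks[b]]
        oos_rows = [t for b in range(n_partitions) if b not in chosen
                    for t in blocks[b]]
        is_perf = [sharpe([returns[t][k] for t in is_rows]) for k in range(n_trials)]
        oos_perf = [sharpe([returns[t][k] for t in oos_rows]) for k in range(n_trials)]
        best = max(range(n_trials), key=is_perf.__getitem__)
        # Out-of-sample rank of the in-sample winner: 1 = worst, n_trials = best.
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_trials + 1)
        logits.append(math.log(omega / (1.0 - omega)))
    # PBO: share of splits where the winner lands at or below the OOS median.
    return sum(1 for lam in logits if lam <= 0) / len(logits)
```

With 16 partitions there are 12,870 train/test splits, so a vectorized version would be needed at full scale; the loop form is only meant to make each step of the procedure visible.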

DSR · Target: > 1.0

Deflated Sharpe Ratio

Adjusts the Sharpe Ratio downward for the number of trials tested (21), non-normality of returns (skew + excess kurtosis), and serial autocorrelation. The project treats DSR > 1.0 as its bar for significance at the 5% level after accounting for trial multiplicity - evidence that the result is not just lucky selection.
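For reference, here is a sketch of the Deflated Sharpe Ratio as published by Bailey and Lopez de Prado, assuming per-period (non-annualized) Sharpe ratios and raw kurtosis (3 for a normal distribution). This published form returns a probability in [0, 1] and does not include an autocorrelation adjustment, so it is not necessarily the same quantity as the score compared against the > 1.0 target here. Function names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(sharpe_variance: float, n_trials: int) -> float:
    """Expected best Sharpe among n_trials when every true Sharpe is zero."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe_ratio(sharpe: float, n_obs: int, skew: float,
                          kurtosis: float, sharpe_variance: float,
                          n_trials: int) -> float:
    """Probability that the observed Sharpe beats the multiple-testing benchmark.

    sharpe          observed per-period Sharpe of the selected strategy
    n_obs           number of return observations
    skew, kurtosis  sample moments of the returns (kurtosis is not excess)
    sharpe_variance variance of Sharpe estimates across the trials
    """
    benchmark = expected_max_sharpe(sharpe_variance, n_trials)
    radicand = 1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2
    if radicand <= 0:
        raise ValueError("moment inputs give a non-positive variance term")
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / math.sqrt(radicand)
    return _STD_NORMAL.cdf(z)
```

Raising `n_trials` raises the benchmark and lowers the result, which is the deflation for trial multiplicity.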

Validated results

1.847 Out-of-Sample Sharpe

Achieving Sharpe > 1.5 out-of-sample in equity markets - after realistic transaction costs, T+1 fills, the 45-day filing lag, and ADV liquidity constraints - is a bar that most academic quant strategies, whose in-sample metrics are inflated, fail to clear.

Primary metric

1.847

Out-of-Sample Sharpe Ratio

10 expanding walk-forward folds, 2010-2024

T+1 execution fills, 45-day filing lag enforced

20-50 bps transaction costs + sqrt market impact

ADV 5% position size cap (liquidity constraint)

FF5 + Momentum factor orthogonalization
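The liquidity cap and cost assumptions listed above can be sketched as two small functions. The 5% ADV cap and the 20 bps fee (the low end of the 20-50 bps range) come from the list; the square-root impact coefficient `impact_coeff_bps` is a hypothetical placeholder, since the calibration is not stated.

```python
import math


def capped_position(target_notional: float, adv_notional: float,
                    adv_cap: float = 0.05) -> float:
    """Limit a position to a fixed fraction of average daily dollar volume."""
    return min(target_notional, adv_cap * adv_notional)


def trade_cost(notional: float, adv_notional: float, fee_bps: float = 20.0,
               impact_coeff_bps: float = 10.0) -> float:
    """Dollar cost of one trade: flat fee plus square-root market impact.

    Impact in bps grows with the square root of the trade's share of ADV,
    so doubling the trade size raises the impact rate by about 41%.
    """
    if adv_notional <= 0:
        raise ValueError("adv_notional must be positive")
    participation = notional / adv_notional
    impact_bps = impact_coeff_bps * math.sqrt(participation)
    return notional * (fee_bps + impact_bps) / 10_000.0


# A $1M target in a name trading $10M a day is cut to the 5% ADV cap first,
# then costed at the fee plus impact for that participation rate.
size = capped_position(1_000_000, 10_000_000)
cost = trade_cost(size, 10_000_000)
```

Applying the cap before costing matters: the impact term is computed on what can actually be traded, not on the unconstrained target.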

Leakage Audit

0 ERROR findings

6 / 6

PBO (Bailey-LdP)

Strategy not overfit to in-sample

< 0.5

Deflated Sharpe

Significant at 5% level (21 trials)

> 1.0

Signal Half-Life

Persists through institutional execution

> 20D

Engineering Decisions

Five decisions that separate this from naive 13F research. Each has a specific technical justification, not a preference.

DuckDB over pandas for 116M rows

Loading EDGAR TSVs as pandas DataFrames would require 12+ GB RAM and hours of I/O. DuckDB operates out-of-core, uses vectorized execution, and has native Parquet read/write with Hive partitioning - the full pipeline runs in under 16 GB RAM on a laptop.

116M rows on a 16 GB machine

HDBSCAN over k-Means for behavioral clustering

k-Means requires a fixed k and assumes spherical clusters. Manager behavioral features occupy a non-spherical manifold in 14D space. HDBSCAN handles noise points natively, requires no fixed k, and the silhouette sweep across 5 min_cluster_size values selects the best structure automatically.

Stable 4-archetype taxonomy without fixed k

Cosine similarity for stable HMM labeling

GaussianHMM state IDs are arbitrary integers that flip between refits. Cosine similarity against prototype vectors (Goldilocks = low VIX + steep yield curve + tight credit) maps each state to a stable semantic name regardless of the integer assigned during EM training.
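A minimal sketch of the labeling idea, assuming each state is summarized by a z-scored mean vector over (VIX, yield-curve slope, credit spread). Only the Goldilocks prototype follows the description above; the other three prototype vectors and all names are hypothetical placeholders.

```python
import math
from typing import Dict, Sequence, Tuple

# Feature order: (vix, yield_curve_slope, credit_spread), all z-scored.
PROTOTYPES: Dict[str, Tuple[float, float, float]] = {
    "Goldilocks": (-1.0, 1.0, -1.0),     # low VIX, steep curve, tight credit
    "Recovery": (-0.5, 1.0, 0.5),        # hypothetical
    "Rate_Shock": (0.5, -1.0, 0.0),      # hypothetical
    "Recession_Fear": (1.0, -0.5, 1.0),  # hypothetical
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_states(state_means: Dict[int, Sequence[float]]) -> Dict[int, str]:
    """Map each integer state ID to the name of its most similar prototype."""
    return {
        state: max(PROTOTYPES, key=lambda name: cosine(vec, PROTOTYPES[name]))
        for state, vec in state_means.items()
    }


# Two refits whose integer IDs are swapped still receive the same names.
refit_a = {0: (-0.9, 1.1, -0.8), 1: (1.5, -0.7, 1.2)}
refit_b = {1: (-0.8, 1.0, -0.9), 0: (1.4, -0.8, 1.3)}
print(label_states(refit_a))  # {0: 'Goldilocks', 1: 'Recession_Fear'}
print(label_states(refit_b))  # {1: 'Goldilocks', 0: 'Recession_Fear'}
```

This greedy version can assign two states to the same name; a one-to-one assignment step would be needed if that must be ruled out.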

Reproducible regime names across all refits

45-day filing lag as a hard constraint

13F filings are due within 45 days of quarter end. Using quarter-end dates directly gives a look-ahead advantage of up to 45 days - the most common source of inflated Sharpe in 13F research. MarketCalendar enforces the lag at signal-generation time, with no opt-out.

Eliminates the primary 13F pitfall

Staged temp tables over single CTE chain

A monolithic CTE chain can keep all intermediate results in memory at once. Manager DNA (6 CTEs) and RACS (5 CTEs) are instead staged as temp tables, and each intermediate is dropped after use. This creates memory checkpoints and keeps peak DuckDB usage under 16 GB for 116M-row inputs.

Memory within consumer hardware constraints
