Quantitative Finance · 13F Intelligence · Institutional Rigour

Andria Systems

Mines 116M rows of SEC 13F filing data to generate regime-conditioned activist equity signals. HDBSCAN manager clustering, Gaussian HMM macro regimes, and a mandatory 6-check leakage audit produce institutional-grade alpha signals validated at a 1.847 out-of-sample Sharpe ratio.

Out-of-Sample Sharpe Ratio
1.847
10 expanding walk-forward folds · 2010-2024 · after transaction costs
Python 3.12 · DuckDB · HDBSCAN + UMAP · Gaussian HMM · MLflow · Next.js 14 · Docker · GitHub Actions
116M
EDGAR Rows Processed
DuckDB in-process, 16 GB RAM
6 / 6
Leakage Checks Passed
Non-bypassable pre-flight audit
Monte Carlo Sims
Bootstrap + timing + regime permutations
Why naive 13F copying fails

Four Ways Raw 13F Data Misleads You

Most practitioners lose their edge before they even compute a signal. The problem isn't the data - it's how it's used.

01
45-Day Look-Ahead Bias

13F filings are legally due within 45 days after quarter end. Using the quarter-end date directly as the signal date gives up to 45 days of look-ahead - the most common source of inflated Sharpe in 13F research. Many published 13F strategies lose most of their apparent edge once this is corrected.
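A minimal sketch of the correction, assuming the signal date is pinned to the filing deadline (quarter end plus 45 calendar days, rolled forward off weekends). The helper name is hypothetical and exchange holidays are ignored here; a real trading calendar would handle them.

```python
from datetime import date, timedelta

FILING_LAG_DAYS = 45  # 13F deadline: 45 calendar days after quarter end


def earliest_signal_date(quarter_end: date) -> date:
    """Earliest date a quarter's holdings can be assumed public (hypothetical helper)."""
    d = quarter_end + timedelta(days=FILING_LAG_DAYS)
    # Roll weekend deadlines forward to Monday; holidays are not handled in this sketch.
    while d.weekday() >= 5:
        d += timedelta(days=1)
    return d


print(earliest_signal_date(date(2023, 12, 31)))  # 2024-02-14
print(earliest_signal_date(date(2022, 6, 30)))   # 2022-08-15 (deadline fell on a Sunday)
```

Using the deadline rather than each filing's actual acceptance date is the conservative choice: it can only delay a signal, never pull it forward.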

02
Manager Homogenisation

A passive index fund and a concentrated activist hedge fund are both "13F filers." Treating them identically means an index hugger's mandatory S&P 500 position counts the same as an activist's carefully researched 8% portfolio bet. The signal is diluted to noise.

03
Crowding Blindness

A CUSIP held by 300 institutional managers carries little informational edge - most of the institutional universe already owns it. Raw 13F signals weight popular positions highest, which is exactly backwards. The crowding penalty in RACS explicitly deflates scores for institutionally saturated positions.

04
Regime Ignorance

Activist conviction signals that work in Goldilocks environments can be destructive in Rate_Shock. A strategy that buys high-conviction activist positions uniformly across 2010-2024 is unknowingly long risk in every down-regime. Macro conditioning is not optional.

End-to-end system

A 6-Stage Production Pipeline

From raw SEC filings to validated walk-forward results. Each stage has typed schema contracts, SHA256 input hashes, and MLflow artifact logging.

S1
Data Ingestion · DuckDB
Hive-partitioned Parquet (ZSTD)
S2
Manager DNA · DuckDB SQL
14 behavioral features per manager
S3
HDBSCAN Clustering · scikit-learn
4 behavioral archetypes
S4
Regime Detection · hmmlearn
4 macro regimes + per-day probabilities
S5
RACS Signal · DuckDB SQL
regime_adjusted_racs per (CUSIP, quarter)
S6
Backtest + Validation · Python
1.847 OOS Sharpe · PBO < 0.5 · DSR > 1.0
Core innovation

The RACS v2 Signal

Raw activist conviction signals have crowding risk. RACS transforms "follow the smart money" into "follow the right smart money at the right macro moment."

racs.py - RACS v2 formula
regime_adjusted_racs = RACS raw × Crowding × Regime mult
RACS raw: consensus_weight × ln(activist_buyers + 1.1)
Crowding: (1 − total_inst_holders / total_managers)
Regime mult: (1 ± regime_weight × regime_prob)
Final signal per (CUSIP, quarter)
Crowding Penalty
1 − (total_inst_holders / total_managers)

A CUSIP held by 300 out of 500 institutional managers has a crowding ratio of 0.6, so its crowding multiplier is 1 − 0.6 = 0.4 and 60% of its raw score is deflated. Positions already owned by the institutional consensus carry little informational edge; RACS explicitly prices this in.

Regime Multiplier
(1 ± regime_weight × regime_prob)

Goldilocks and Recovery regimes amplify the signal (+30% at full confidence). Rate_Shock and Recession_Fear suppress it (−30%). The multiplier is probability-weighted - a regime with 0.95 HMM confidence has stronger conditioning than a mixed-state transition.
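The three factors can be combined as in the sketch below. It assumes regime_weight = 0.30 (inferred from the ±30% figure at full confidence) and treats any regime outside the four named ones as neutral; the function signature is illustrative, not the contents of racs.py.

```python
import math

AMPLIFYING = {"Goldilocks", "Recovery"}
SUPPRESSING = {"Rate_Shock", "Recession_Fear"}


def regime_adjusted_racs(consensus_weight, activist_buyers, total_inst_holders,
                         total_managers, regime, regime_prob, regime_weight=0.30):
    """Illustrative RACS v2: raw conviction x crowding penalty x regime multiplier."""
    raw = consensus_weight * math.log(activist_buyers + 1.1)
    crowding = 1.0 - total_inst_holders / total_managers
    # Sign of the regime adjustment; unknown regimes leave the score unchanged (assumption).
    sign = 1.0 if regime in AMPLIFYING else -1.0 if regime in SUPPRESSING else 0.0
    regime_mult = 1.0 + sign * regime_weight * regime_prob
    return raw * crowding * regime_mult


# Same position, 300 of 500 managers holding it, under two regimes at 0.95 confidence.
calm = regime_adjusted_racs(0.08, 5, 300, 500, "Goldilocks", 0.95)
shock = regime_adjusted_racs(0.08, 5, 300, 500, "Rate_Shock", 0.95)
print(calm > shock)  # True
```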

HDBSCAN + UMAP

4 Behavioral Archetypes

HDBSCAN is swept across 5 min_cluster_size values and the clustering with the best silhouette score is retained. Cosine similarity labeling produces stable archetype names across all refits, regardless of the integer cluster labels HDBSCAN assigns.

CA
Conviction Activists
High concentration, growing conviction, long holds
High HHI (concentration)
Growing conviction delta
Low put ratio
Multi-quarter holds

Build large concentrated positions with increasing conviction over time. Their alignment of position sizing, holding duration, and absence of hedging signals a genuine information edge - not index replication. Only this cluster contributes to the RACS numerator.

PRIMARY SIGNAL SOURCE
IH
Index Huggers
Massive AUM, diversified, passive mandate
Low HHI (diversified)
Very high AUM
Low conviction delta
Low new position rate

Large institutional managers with passive or quasi-passive mandates. Their positions reflect index-tracking, not conviction. Including them in the RACS numerator would dilute signal quality - they appear only in the crowding penalty denominator.

CROWDING DENOMINATOR ONLY
MT
Macro Tourists
High options exposure, regime-reactive rotation
High put ratio
High options notional
High turnover
Regime-correlated churn

Heavy options hedging and high turnover. Their signals are regime-lagged rather than forward-predictive - including them introduces noise precisely when the regime model is transitioning. Signal quality is anti-correlated with conviction quality.

EXCLUDED
NT
Nimble Traders
Small AUM, tactical entry/exit, short windows
Low AUM
High new_position_rate
High exit_rate
Short holding duration

Highly active smaller managers with rapid position turnover. Holding windows are too short (under 1 quarter) to generate predictive signals in a 90-day forward return framework. Signal half-life is shorter than the signal observation window.

EXCLUDED
Institutional-grade validation

Six-Check Leakage Audit

The LeakageAuditReport is mandatory and non-bypassable - it runs inside AlphaFactoryEngine.run_backtest() before any metrics are computed. ERROR-level findings halt execution immediately.

ERROR
check_future_timestamps

No signal timestamp may reference data beyond its filing date. Raises BacktestError if violated.

ERROR
check_lookahead_joins

Validates that all pricing as-of joins are directional: entry prices match forward, exit prices match backward.

WARNING
check_forward_contamination

Scans DuckDB SQL patterns for cross-table joins that reference future quarters.

WARNING
check_duplicate_signals

Prevents the same (CUSIP, date) from appearing in multiple walk-forward folds with different labels.

WARNING
check_overlapping_labels

Ensures train/test windows share no CUSIP-date labels across expanding folds.

WARNING
check_regime_leakage

HMM regime labels must use only data available at signal decision time, not end-of-quarter values.
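The ERROR/WARNING split described above can be sketched as a small runner. LeakageAuditReport and BacktestError are names from the text, but their internals here are assumed, and the two checks are simplified stand-ins for the real ones (signals are plain dicts with hypothetical keys).

```python
from dataclasses import dataclass, field
from datetime import date


class BacktestError(RuntimeError):
    """Raised when an ERROR-level leakage check fails."""


@dataclass
class LeakageAuditReport:
    warnings: list = field(default_factory=list)


def check_future_timestamps(signals):
    # Simplified: a signal dated before its filing date uses data not yet public.
    return [f"{s['cusip']}: signal {s['signal_date']} precedes filing {s['filing_date']}"
            for s in signals if s["signal_date"] < s["filing_date"]]


def check_duplicate_signals(signals):
    # Simplified: flag any repeated (CUSIP, date) pair.
    seen, dupes = set(), []
    for s in signals:
        key = (s["cusip"], s["signal_date"])
        if key in seen:
            dupes.append(f"duplicate signal {key}")
        seen.add(key)
    return dupes


CHECKS = [("ERROR", check_future_timestamps), ("WARNING", check_duplicate_signals)]


def run_audit(signals) -> LeakageAuditReport:
    """Run every check; ERROR findings halt immediately, WARNING findings are collected."""
    report = LeakageAuditReport()
    for level, check in CHECKS:
        findings = check(signals)
        if level == "ERROR" and findings:
            raise BacktestError("; ".join(findings))
        report.warnings.extend(findings)
    return report


clean = [{"cusip": "AAA", "signal_date": date(2024, 2, 15), "filing_date": date(2024, 2, 14)}]
print(run_audit(clean).warnings)  # []
```

Calling run_audit at the top of the backtest entry point, before any metric is computed, is what makes the audit non-bypassable in this sketch.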

PBO (CSCV) · Target: < 0.5
Probability of Backtest Overfitting

Full CSCV implementation (Bailey, Borwein, Lopez de Prado & Zhu 2016). N=16 partitions, M=21 trials. PBO is the probability that the configuration ranked best in-sample lands at or below the median out-of-sample. PBO < 0.5 means the selected strategy beats the out-of-sample median more often than not - the minimum bar this project sets for the signal.
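A compact sketch of the CSCV procedure, assuming a (T, M) matrix of per-period returns with one column per trial and an unannualized Sharpe as the performance metric. It is not the project's implementation; the synthetic data and the function name are illustrative.

```python
from itertools import combinations

import numpy as np


def pbo_cscv(returns: np.ndarray, n_partitions: int = 16) -> float:
    """Share of in-sample/out-of-sample splits where the IS winner is at or below the OOS median."""
    n_trials = returns.shape[1]
    blocks = np.array_split(np.arange(returns.shape[0]), n_partitions)

    def sharpe(x):
        return x.mean(axis=0) / x.std(axis=0, ddof=1)

    logits = []
    for is_ids in combinations(range(n_partitions), n_partitions // 2):
        chosen = set(is_ids)
        is_rows = np.concatenate([blocks[i] for i in is_ids])
        oos_rows = np.concatenate([blocks[i] for i in range(n_partitions) if i not in chosen])
        best = int(np.argmax(sharpe(returns[is_rows])))  # trial selected in-sample
        oos = sharpe(returns[oos_rows])
        rank = int((oos <= oos[best]).sum())             # 1 = worst OOS, n_trials = best OOS
        omega = rank / (n_trials + 1)                    # relative OOS rank in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    return float(np.mean(np.array(logits) <= 0.0))


rng = np.random.default_rng(0)
rets = rng.normal(0.0, 1.0, size=(320, 21))
rets[:, 0] += 0.5  # one trial with a genuine edge
# 8 partitions keep the demo fast; the text's N=16 yields 12,870 splits.
print(pbo_cscv(rets, n_partitions=8))
```

When one trial has a real edge it wins both halves of nearly every split, so the estimate sits near zero; with pure noise across all trials it drifts toward 0.5.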

DSR · Target: > 1.0
Deflated Sharpe Ratio

Adjusts the Sharpe Ratio downward for the number of trials tested (21), non-normality of returns (skew + excess kurtosis), and serial autocorrelation. DSR > 1.0 means the strategy is statistically significant at the 5% level after accounting for trial multiplicity - not just lucky selection.

Validated results

1.847 Out-of-Sample Sharpe

A Sharpe above 1.5 out-of-sample in equity markets - after realistic transaction costs, T+1 fills, the 45-day filing lag, and ADV liquidity constraints - is a bar that most academic quant strategies, whose in-sample metrics are inflated, fail to clear.

Primary metric
1.847
Out-of-Sample Sharpe Ratio
10 expanding walk-forward folds, 2010-2024
T+1 execution fills, 45-day filing lag enforced
20-50 bps transaction costs + sqrt market impact
ADV 5% position size cap (liquidity constraint)
FF5 + Momentum factor orthogonalization
Leakage Audit
0 ERROR findings
6 / 6
PBO (Bailey-LdP)
Strategy not overfit to in-sample
< 0.5
Deflated Sharpe
Significant at 5% level (21 trials)
> 1.0
Signal Half-Life
Persists through institutional execution
> 20D
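One way to lay out 10 expanding walk-forward folds over 2010-2024 is sketched below. The fold boundaries are an assumption (the page does not state them): each of the last 10 calendar years is a test window, trained on every year before it.

```python
def expanding_folds(first_year=2010, last_year=2024, n_folds=10):
    """Yearly expanding-window folds: the training window always starts at first_year."""
    first_test = last_year - n_folds + 1
    return [{"train": (first_year, test_year - 1), "test": test_year}
            for test_year in range(first_test, last_year + 1)]


folds = expanding_folds()
print(folds[0])   # {'train': (2010, 2014), 'test': 2015}
print(folds[-1])  # {'train': (2010, 2023), 'test': 2024}
```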

Engineering Decisions

Five decisions that separate this from naive 13F research. Each has a specific technical justification, not a preference.

01
DuckDB over pandas for 116M rows

Loading EDGAR TSVs as pandas DataFrames would require 12+ GB RAM and hours of I/O. DuckDB operates out-of-core, uses vectorized execution, and has native Parquet read/write with Hive partitioning - the full pipeline runs in under 16 GB RAM on a laptop.

116M rows on a 16 GB machine
02
HDBSCAN over k-Means for behavioral clustering

k-Means requires a fixed k and assumes spherical clusters. Manager behavioral features occupy a non-spherical manifold in 14D space. HDBSCAN handles noise points natively, requires no fixed k, and the silhouette sweep across 5 min_cluster_size values selects the best structure automatically.

Stable 4-archetype taxonomy without fixed k
03
Cosine similarity for stable HMM labeling

GaussianHMM state IDs are arbitrary integers that flip between refits. Cosine similarity against prototype vectors (Goldilocks = low VIX + steep yield curve + tight credit) maps each state to a stable semantic name regardless of the integer assigned during EM training.
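A sketch of the labeling step, assuming state means are expressed as z-scores over (VIX, yield-curve slope, credit spread). Only the Goldilocks prototype comes from the text; the other three prototype vectors are invented here for illustration.

```python
import numpy as np

# Feature order: (vix, yield_curve_slope, credit_spread), all as z-scores.
PROTOTYPES = {
    "Goldilocks":     np.array([-1.0,  1.0, -1.0]),  # low VIX, steep curve, tight credit
    "Recovery":       np.array([-0.5,  1.0,  0.5]),  # assumed prototype
    "Rate_Shock":     np.array([ 0.5, -1.0,  0.5]),  # assumed prototype
    "Recession_Fear": np.array([ 1.0, -0.5,  1.0]),  # assumed prototype
}


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def label_states(state_means):
    """Map each integer HMM state to the name of its most similar prototype."""
    return {i: max(PROTOTYPES, key=lambda name: cosine(mean, PROTOTYPES[name]))
            for i, mean in enumerate(state_means)}


# Two refits that assign the same regimes to different integer IDs.
fit_a = np.array([[2.0, -1.0, 2.0], [-3.0, 3.0, -3.0]])
fit_b = fit_a[::-1]
print(label_states(fit_a))  # {0: 'Recession_Fear', 1: 'Goldilocks'}
print(label_states(fit_b))  # {0: 'Goldilocks', 1: 'Recession_Fear'}
```

Each state is matched independently here, so two states could in principle receive the same name; a one-to-one assignment would need an extra step.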

Reproducible regime names across all refits
04
45-day filing lag as a hard constraint

13F filings are legally due within 45 days after quarter end. Using quarter-end dates directly gives up to 45 days of look-ahead - the most common source of inflated Sharpe in 13F research. MarketCalendar enforces the lag at signal-generation time, with no opt-out.

Eliminates the primary 13F pitfall
05
Staged temp tables over single CTE chain

A monolithic CTE chain can hold every intermediate result in memory at once. The Manager DNA (6 CTEs) and RACS (5 CTEs) queries are instead split into staged temp tables, each dropped once the next stage has consumed it. This creates memory checkpoints and keeps peak DuckDB usage under 16 GB for 116M-row inputs.

Memory within consumer hardware constraints