Four Ways Raw 13F Data Misleads You
Most practitioners lose their edge before they even compute a signal. The problem isn't the data - it's how it's used.
13F filings are due within 45 days after quarter end. Using the quarter-end date directly as the signal date lets a backtest trade on holdings up to 45 days before they become public - the most common source of inflated Sharpe in 13F research. Nearly all academic strategies evaporate once this is corrected.
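A minimal sketch of the lag correction, assuming a hypothetical helper named earliest_signal_date (not part of the pipeline described here). It uses the actual filing date when one is known and otherwise falls back to the 45-day deadline, then rolls weekends forward. Exchange holidays are ignored, so a real market calendar would replace the weekday check.

```python
from datetime import date, timedelta
from typing import Optional

FILING_LAG_DAYS = 45  # 13F deadline after quarter end


def earliest_signal_date(quarter_end: date,
                         filing_date: Optional[date] = None) -> date:
    """Return the first date a 13F-based signal may be acted on."""
    # Prefer the real filing date; otherwise assume the filing arrived
    # on the last permitted day.
    if filing_date is not None:
        available = filing_date
    else:
        available = quarter_end + timedelta(days=FILING_LAG_DAYS)
    # Roll Saturday/Sunday forward to Monday (holidays not handled here).
    while available.weekday() >= 5:
        available += timedelta(days=1)
    return available


if __name__ == "__main__":
    print(earliest_signal_date(date(2023, 12, 31)))  # deadline-based
    print(earliest_signal_date(date(2023, 12, 31), date(2024, 2, 9)))
```

Stamping every holding with this date rather than the quarter end is what removes the look-ahead window.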
A passive index fund and a concentrated activist hedge fund are both "13F filers." Treating them identically in a signal means an index hugger's mandatory S&P 500 position counts equally to an activist's carefully researched 8% portfolio bet. The signal is diluted to noise.
A CUSIP held by 300 institutional managers has zero informational edge - everyone already owns it. Raw 13F signals weight popular positions highest, which is exactly backwards. The crowding penalty in RACS explicitly deflates scores for institutionally saturated positions.
Activist conviction signals that work in Goldilocks environments can be destructive in Rate_Shock. A strategy that buys high-conviction activist positions uniformly across 2010-2024 is unknowingly long risk in every down-regime. Macro conditioning is not optional.
A 6-Stage Production Pipeline
From raw SEC filings to validated walk-forward results. Each stage has typed schema contracts, SHA256 input hashes, and MLflow artifact logging.
The RACS v2 Signal
Raw activist conviction signals have crowding risk. RACS transforms "follow the smart money" into "follow the right smart money at the right macro moment."
1 − (total_inst_holders / total_managers)
A CUSIP held by 300 out of 500 institutional managers gets a crowding ratio of 0.6 - meaning 60% of its score is deflated. Positions already owned by the institutional consensus have no informational edge; RACS explicitly prices this in.
(1 ± regime_weight × regime_prob)
Goldilocks and Recovery regimes amplify the signal (+30% at full confidence). Rate_Shock and Recession_Fear suppress it (−30%). The multiplier is probability-weighted - a regime with 0.95 HMM confidence has stronger conditioning than a mixed-state transition.
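The two factors above can be sketched as plain functions. The function names, the 0.30 default for regime_weight (read off the "+30% at full confidence" figure), and the neutral treatment of any regime not named in the text are assumptions made for illustration, not the RACS source code.

```python
REGIME_WEIGHT = 0.30  # assumed from the +/-30% figure in the text
AMPLIFYING = {"Goldilocks", "Recovery"}
SUPPRESSING = {"Rate_Shock", "Recession_Fear"}


def crowding_factor(total_inst_holders: int, total_managers: int) -> float:
    """1 - (holders / managers): the share of the score that survives."""
    if total_managers <= 0:
        raise ValueError("total_managers must be positive")
    return 1.0 - total_inst_holders / total_managers


def regime_multiplier(regime: str, regime_prob: float,
                      regime_weight: float = REGIME_WEIGHT) -> float:
    """1 +/- regime_weight * regime_prob, signed by regime type."""
    if not 0.0 <= regime_prob <= 1.0:
        raise ValueError("regime_prob must lie in [0, 1]")
    if regime in AMPLIFYING:
        sign = 1.0
    elif regime in SUPPRESSING:
        sign = -1.0
    else:
        sign = 0.0  # assumption: unlisted regimes leave the score unchanged
    return 1.0 + sign * regime_weight * regime_prob


if __name__ == "__main__":
    # 300 of 500 managers hold the CUSIP: ratio 0.6, so 40% of the score survives.
    print(round(crowding_factor(300, 500), 3))
    # Rate_Shock at 0.95 confidence: 1 - 0.30 * 0.95
    print(round(regime_multiplier("Rate_Shock", 0.95), 3))
```

A mixed-state quarter with regime_prob near 0.5 moves the score by only about 15% in either direction, which is the probability weighting the text describes.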
4 Behavioral Archetypes
HDBSCAN sweep across 5 min_cluster_size values - best silhouette retained. Cosine similarity labeling produces stable archetype names across all refits, regardless of integer state assignments.
Build large concentrated positions with increasing conviction over time. Their alignment of position sizing, holding duration, and absence of hedging signals genuine information edge - not index replication. Only this cluster contributes to the RACS numerator.
Large institutional managers with passive or quasi-passive mandates. Their positions reflect index-tracking, not conviction. Including them in the RACS numerator would dilute signal quality - they appear only in the crowding penalty denominator.
Heavy options hedging and high turnover. Their signals are regime-lagged rather than forward-predictive - including them introduces noise precisely when the regime model is transitioning. Signal quality is anti-correlated with conviction quality.
Highly active smaller managers with rapid position turnover. Holding windows are too short (under 1 quarter) to generate predictive signals in a 90-day forward return framework. Signal half-life is shorter than the signal observation window.
Six-Check Leakage Audit
The LeakageAuditReport is mandatory and non-bypassable - it runs inside AlphaFactoryEngine.run_backtest() before any metrics are computed. ERROR-level findings halt execution immediately.
No signal timestamp may reference data beyond its filing date. Raises BacktestError if violated.
Validates all pricing asof-joins are directional. Entry = forward fill; exit = backward fill.
Scans DuckDB SQL patterns for cross-table joins that reference future quarters.
Prevents the same (CUSIP, date) from appearing in multiple walk-forward folds with different labels.
Ensures train/test windows share no CUSIP-date labels across expanding folds.
HMM regime labels must use only data available at signal decision time, not end-of-quarter values.
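Two of the checks above, the filing-date check and the fold checks, are simple enough to sketch. Everything here is illustrative: the Signal record, the function names, and the fold layout are assumptions, and BacktestError is a local stand-in for the exception named in the text.

```python
from dataclasses import dataclass
from datetime import date


class BacktestError(Exception):
    """Local stand-in for the BacktestError named in the audit."""


@dataclass(frozen=True)
class Signal:
    cusip: str
    signal_date: date  # when the strategy acts on the signal
    filing_date: date  # when the underlying 13F became public


def check_no_lookahead(signals):
    """Raise if any signal is dated before its source filing was public."""
    bad = [s for s in signals if s.signal_date < s.filing_date]
    if bad:
        raise BacktestError(
            f"{len(bad)} signal(s) precede their filing date, e.g. {bad[0]}"
        )


def check_folds(folds):
    """Each fold is {"train": {...}, "test": {...}}, mapping
    (cusip, date) keys to labels."""
    seen = {}  # (cusip, date) -> first label observed in any fold
    for i, fold in enumerate(folds):
        # Train and test of the same fold must not share any key.
        shared = fold["train"].keys() & fold["test"].keys()
        if shared:
            raise BacktestError(
                f"fold {i}: {len(shared)} key(s) in both train and test"
            )
        # A key may recur across expanding folds, but its label must not change.
        for part in ("train", "test"):
            for key, label in fold[part].items():
                if seen.setdefault(key, label) != label:
                    raise BacktestError(f"fold {i}: conflicting labels for {key}")
```

Raising rather than logging mirrors the halt-on-ERROR behavior described for the audit; a softer variant could collect findings into a report object first.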
Full CSCV implementation (Bailey, Borwein, Lopez de Prado & Zhu 2016). N=16 partitions, M=21 trials. PBO is the probability that the configuration selected as best in-sample ranks below the median out-of-sample. PBO < 0.5 means the selected strategy is more likely than not to beat the median trial out-of-sample - the bar for publishable quant research.
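A compact, standard-library sketch of the CSCV procedure, for illustration only; it is not the project's implementation. The function names and the input layout (a list of T rows, each holding one return per trial) are assumptions. With 16 partitions the loop visits C(16, 8) = 12,870 train/test splits.

```python
from itertools import combinations
from math import log
from statistics import mean, pstdev


def sharpe(returns):
    """Per-period Sharpe ratio; 0.0 when the series has no variance."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns, n_partitions):
    """Probability of backtest overfitting via CSCV.

    returns: list of T rows, each a list of M per-trial returns.
    """
    if n_partitions % 2 != 0:
        raise ValueError("n_partitions must be even")
    n_trials = len(returns[0])
    size = len(returns) // n_partitions  # remainder rows are dropped
    blocks = [returns[i * size:(i + 1) * size] for i in range(n_partitions)]
    logits = []
    for is_idx in combinations(range(n_partitions), n_partitions // 2):
        chosen = set(is_idx)
        is_rows = [r for i in is_idx for r in blocks[i]]
        oos_rows = [r for i in range(n_partitions) if i not in chosen
                    for r in blocks[i]]
        is_sr = [sharpe([row[m] for row in is_rows]) for m in range(n_trials)]
        oos_sr = [sharpe([row[m] for row in oos_rows]) for m in range(n_trials)]
        best = max(range(n_trials), key=lambda m: is_sr[m])
        # Relative out-of-sample rank of the in-sample winner, in (0, 1).
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])
        omega = rank / (n_trials + 1)
        logits.append(log(omega / (1.0 - omega)))
    # PBO: share of splits where the winner lands at or below the OOS median.
    return sum(1 for x in logits if x <= 0) / len(logits)
```

A trial that dominates in every block yields a PBO of 0, while a winner chosen purely by luck tends toward 0.5 or higher.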
Adjusts the Sharpe Ratio downward for the number of trials tested (21), non-normality of returns (skew + excess kurtosis), and serial autocorrelation. In the standard formulation DSR is a probability between 0 and 1, so DSR > 0.95 means the strategy is statistically significant at the 5% level after accounting for trial multiplicity - not just lucky selection.
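For reference, a sketch of the Deflated Sharpe Ratio in the Bailey and Lopez de Prado formulation, where the result is a probability. The function and parameter names are assumptions. Here sr is the per-period (non-annualized) Sharpe, kurt is raw kurtosis (3 for a normal distribution), and sr_std is the standard deviation of Sharpe estimates across the trials. This textbook form does not include the serial-autocorrelation adjustment mentioned above.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sr_std: float) -> float:
    """Expected best Sharpe among n_trials when the true Sharpe is zero."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    return sr_std * (
        (1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe(sr: float, n_obs: int, skew: float, kurt: float,
                    n_trials: int, sr_std: float) -> float:
    """Probability that the true Sharpe exceeds the luck-adjusted benchmark."""
    sr0 = expected_max_sharpe(n_trials, sr_std)
    # Variance term of the Sharpe estimator under non-normal returns.
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return _NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)
```

Holding everything else fixed, raising n_trials raises the benchmark Sharpe and lowers the DSR, which is the multiplicity penalty at work.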
1.847 Out-of-Sample Sharpe
Achieving Sharpe > 1.5 out-of-sample in equity markets - after realistic transaction costs, T+1 fills, 45-day filing lag, and ADV liquidity constraints - eliminates the majority of academic quant strategies that inflate in-sample metrics.
Engineering Decisions
Five decisions that separate this from naive 13F research. Each has a specific technical justification, not a preference.
Loading EDGAR TSVs as pandas DataFrames would require 12+ GB RAM and hours of I/O. DuckDB operates out-of-core, uses vectorized execution, and has native Parquet read/write with Hive partitioning - the full pipeline runs in under 16 GB RAM on a laptop.
k-Means requires a fixed k and assumes spherical clusters. Manager behavioral features occupy a non-spherical manifold in 14D space. HDBSCAN handles noise points natively, requires no fixed k, and the silhouette sweep across 5 min_cluster_size values selects the best structure automatically.
GaussianHMM state IDs are arbitrary integers that flip between refits. Cosine similarity against prototype vectors (Goldilocks = low VIX + steep yield curve + tight credit) maps each state to a stable semantic name regardless of the integer assigned during EM training.
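The labeling step can be sketched without any HMM library. The feature order (VIX, yield-curve slope, credit spread, all z-scored) and the Recession_Fear prototype are assumptions for illustration. Only the Goldilocks pattern, low VIX with a steep curve and tight credit, comes from the description above.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def label_states(state_means, prototypes):
    """Map each integer state id to its most similar prototype name."""
    return {
        state_id: max(prototypes, key=lambda name: cosine(vec, prototypes[name]))
        for state_id, vec in state_means.items()
    }


# Hypothetical prototypes over z-scored [VIX, curve slope, credit spread].
PROTOTYPES = {
    "Goldilocks": [-1.0, 1.0, -1.0],     # low VIX, steep curve, tight credit
    "Recession_Fear": [1.0, -1.0, 1.0],  # assumed mirror image
}

if __name__ == "__main__":
    # The same two regimes with swapped integer ids, as after a refit.
    print(label_states({0: [1.2, -0.8, 1.5], 1: [-0.9, 1.1, -0.7]}, PROTOTYPES))
    print(label_states({1: [1.2, -0.8, 1.5], 0: [-0.9, 1.1, -0.7]}, PROTOTYPES))
```

Taking the best match per state can assign one name to two states; if that matters, a one-to-one assignment over the similarity matrix would be needed instead.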
13F filings are legally due 45 days after quarter end. Using quarter-end dates directly gives a 45-day look-ahead advantage - the most common source of inflated Sharpe in 13F research. MarketCalendar enforces this at signal-generation time, non-optionally.
A monolithic CTE chain materializes all intermediate results simultaneously. Manager DNA (6 CTEs) and RACS (5 CTEs) drop intermediates after use. This creates memory checkpoints and keeps peak DuckDB usage under 16 GB for 116M-row inputs.