String matching finds 2.6% of the market
Five UK supermarkets each name the same physical product differently. Without semantic understanding, you cannot measure competition, predict prices, or track market dynamics across retailers.
Name Heterogeneity at Scale
"Heinz Tomato Ketchup Bottle 570g" vs "Heinz Ketchup 570g" vs "Heinz Ketchup Btl 570g" - the same product, three incompatible strings. With 127,226 unique raw names across five retailers, equivalent names rarely match exactly, so string distance functions treat them as different products.
O(n²) Intractability
Naïve pairwise Levenshtein matching across 127K names requires ~8 billion comparisons. Even parallelised, this is impractical at pipeline cadence. Fuzzy matching libraries hit timeout thresholds and produce low-precision clusters that still miss semantically identical products.
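The ~8 billion figure follows from counting unordered pairs, n(n-1)/2, for the 127,226 unique names quoted above. A minimal sketch of that arithmetic (the function name is illustrative):

```python
from math import comb


def pairwise_comparisons(n_names: int) -> int:
    """Number of unordered pairs among n_names distinct strings: n(n-1)/2."""
    return comb(n_names, 2)


# 127,226 is the unique raw name count quoted in the text.
n_pairs = pairwise_comparisons(127_226)
print(f"{n_pairs:,} comparisons")  # 8,093,163,925
```

Each of those comparisons is itself a dynamic-programming edit-distance computation, so the real cost is higher than the pair count alone suggests.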
Statistical Meaninglessness
With only ~3,000 exact-match products, cross-retailer price comparisons cover 2.6% of SKUs - a statistically useless sample. Market dispersion, price leadership, and competitive index calculations on this sample are dominated by selection bias: it contains only the products that happened to have identical names.
6-stage end-to-end ML system
Config-driven architecture via Pydantic + YAML. Every stage is a Typer CLI command. Pandera enforces three-stage data contracts. GitHub Actions runs pytest and ruff on every commit.
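As a rough illustration of what "config-driven via Pydantic + YAML" can look like, the sketch below validates a YAML document against typed models. The field names, the example values, and the encoder name are assumptions made for illustration; the project's actual schema is not shown in the source.

```python
import yaml
from pydantic import BaseModel


class MatchingConfig(BaseModel):
    # Hypothetical fields; the real config schema is not given in the source.
    encoder_name: str
    similarity_threshold: float


class PipelineConfig(BaseModel):
    retailers: list[str]
    matching: MatchingConfig


EXAMPLE_YAML = """
retailers: [Tesco, ASDA]
matching:
  encoder_name: all-MiniLM-L6-v2
  similarity_threshold: 0.85
"""


def load_config(text: str) -> PipelineConfig:
    """Parse YAML text and validate it; bad types raise a pydantic ValidationError."""
    return PipelineConfig(**yaml.safe_load(text))


config = load_config(EXAMPLE_YAML)
print(config.matching.similarity_threshold)
```

The appeal of this pattern is that a typo or wrong type in the YAML fails at load time, before any pipeline stage runs.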
The 22× breakthrough: why strings fail, semantics win
The same physical product - named differently by Tesco and ASDA - goes from unlinked to matched when you embed semantic intent rather than compare characters.
SHAP + CCF reveal pricing structure
With 67,341 matched products and precomputed SHAP values, the data reveals how UK supermarkets actually set prices - not in isolation, but as reactions to each other.
Aldi systematically depresses prices across all categories. SHAP analysis shows supermarket=Aldi has the single most negative feature contribution - it consistently pushes price predictions down regardless of product or date. The Big Four cannot undercut Aldi; they price relative to it.
The cross-correlation function (CCF) between the Tesco and Sainsbury's price series peaks at lag=14 days, indicating a systematic 2-week follow pattern. Tesco acts as the Big Four price anchor; Sainsbury's matches within a fortnight. ASDA and Morrisons show less predictable lag structures.
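A lag peak like this can be found by correlating one series against shifted copies of the other. The sketch below uses synthetic data in which the follower copies the leader 14 steps later; it is not the project's code. It assumes the inputs are stationary series such as daily price changes, since raw price levels usually need differencing first.

```python
import numpy as np


def ccf_peak_lag(leader, follower, max_lag):
    """Correlate leader[t] with follower[t + lag] for lag = 0..max_lag.

    Returns (lag with the highest Pearson correlation, array of correlations).
    """
    leader = np.asarray(leader, dtype=float)
    follower = np.asarray(follower, dtype=float)
    corrs = np.array([
        np.corrcoef(leader[: len(leader) - lag], follower[lag:])[0, 1]
        for lag in range(max_lag + 1)
    ])
    return int(np.argmax(corrs)), corrs


# Synthetic example: the follower repeats the leader's moves 14 days later,
# plus a little noise. These are made-up series, not scraped prices.
rng = np.random.default_rng(0)
base = rng.normal(size=414)
leader = base[14:]
follower = base[:-14] + rng.normal(scale=0.1, size=400)

best_lag, corrs = ccf_peak_lag(leader, follower, max_lag=30)
print(best_lag)
```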
price_vs_market_avg is a top-3 SHAP feature across all categories, outranking lag and momentum features. This confirms retailers do not price in isolation - they algorithmically monitor and react to the daily market average. Prices converge toward the cluster mean within 2-3 days of any deviation.
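The source names the price_vs_market_avg feature but does not define it. One plausible construction, shown here purely as an assumption, is each retailer's price divided by the mean price of the same matched product across retailers on the same day. The column names and the toy prices are made up.

```python
import pandas as pd


def add_price_vs_market_avg(df: pd.DataFrame) -> pd.DataFrame:
    """Add price / (mean price of the same product on the same day across retailers)."""
    out = df.copy()
    market_avg = out.groupby(["canonical_id", "date"])["price"].transform("mean")
    out["price_vs_market_avg"] = out["price"] / market_avg
    return out


# Toy rows with invented prices, for illustration only.
prices = pd.DataFrame({
    "canonical_id": ["ketchup_570g"] * 3,
    "date": ["2024-01-01"] * 3,
    "supermarket": ["Tesco", "ASDA", "Aldi"],
    "price": [3.00, 2.80, 2.00],
})
features = add_price_vs_market_avg(prices)
print(features[["supermarket", "price_vs_market_avg"]])
```

Values above 1 mark a retailer priced above the market for that product on that day; values below 1 mark one priced below it.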
Five non-obvious choices
SBERT over fuzzy string matching
Fuzzy matching (Levenshtein, token_sort_ratio) is O(n²) across 127K unique names - already intractable at this scale. More critically, it treats names as character sequences, not semantic objects: "Heinz Ketchup 570g" scores low against "Heinz Tomato Ketchup Bottle 570g" despite describing the same product. SBERT embeds semantic intent - abbreviations and descriptor variations map to nearby vectors regardless of string structure.
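A minimal sketch of embedding-based matching, assuming the Sentence-Transformers library, a placeholder model name, and a placeholder similarity threshold (the source states none of these). The matching step is plain cosine similarity with a nearest-neighbour lookup.

```python
import numpy as np


def embed_names(names, model_name="all-MiniLM-L6-v2"):
    """Encode product names with Sentence-Transformers.

    The model name is an assumption; the import is lazy so the matcher
    below can be used without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(names, normalize_embeddings=True)


def match_by_cosine(emb_a, emb_b, threshold=0.85):
    """For each row of emb_a, return (index in emb_b, similarity) of its nearest
    neighbour by cosine similarity, or None if the best score is below threshold."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T
    best = sims.argmax(axis=1)
    return [
        (int(j), float(sims[i, j])) if sims[i, j] >= threshold else None
        for i, j in enumerate(best)
    ]


# Toy 2-D vectors standing in for real embeddings.
toy_a = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_b = np.array([[0.9, 0.1], [-1.0, 0.0]])
matches = match_by_cosine(toy_a, toy_b)
print(matches)
```

This dense version still builds the full similarity matrix, so at the scale of the whole catalogue it would likely need batching or an approximate nearest-neighbour index. The source does not say which approach the project uses.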
Time-series split over random shuffle
A random train/test split causes data leakage: future price information leaks into the training data through rolling and lag features. A model trained on randomly shuffled data is tested on prices it has already seen in its rolling context, which artificially inflates R². A time-series split enforces the correct causal direction, producing an honest £0.14 MAE on genuinely unseen future prices.
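The idea reduces to one rule: every training row must predate every test row. A minimal sketch with a single cutoff date, using made-up rows (the project's actual split points are not given in the source):

```python
from datetime import date


def time_series_split(rows, cutoff):
    """Split (day, price) rows chronologically.

    Train gets rows strictly before the cutoff, test gets rows on or after it,
    so no future observation can reach the training set.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Ten invented daily prices, split so the last three days are held out.
rows = [(date(2024, 1, d), 1.00 + 0.01 * d) for d in range(1, 11)]
train, test = time_series_split(rows, cutoff=date(2024, 1, 8))
print(len(train), len(test))  # 7 3
```

Rolling and lag features should be computed before the split, from past values only, so that a test row's features never include its own or later prices. For multiple folds, scikit-learn's TimeSeriesSplit applies the same ordering constraint.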
MAE over MSE as training objective
Price scraping produces non-Gaussian outliers: promotional flash sales, mislabelled weights, and bundle-price errors appear as extreme values. MSE penalises outliers quadratically - pulling the model towards fitting noise. MAE provides linear loss, making LightGBM robust to scraping artefacts while optimising for accurate day-to-day price prediction.
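The difference is easiest to see with a constant predictor: the value that minimises squared error is the mean, while the value that minimises absolute error is the median. The prices below are invented, with one deliberate scraping-style outlier.

```python
from statistics import mean, median


def mae(values, prediction):
    """Mean absolute error of a constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


def mse(values, prediction):
    """Mean squared error of a constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


# Four ordinary prices and one made-up artefact (e.g. a bundle price logged as a unit price).
prices = [1.50, 1.55, 1.45, 1.50, 15.00]

best_for_mse = mean(prices)    # dragged towards the outlier
best_for_mae = median(prices)  # stays with the bulk of the data

print(best_for_mse, best_for_mae)
```

In LightGBM, an absolute-error objective is available as objective="regression_l1" (alias "mae"). The source does not show exactly how the project configures it.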
Precomputed SHAP over on-demand
Running TreeExplainer on the LightGBM model over all 9.5M rows takes 4-8 minutes per run, which a 1GB RAM Streamlit deployment cannot do in real time. Precomputing on an 8,000-sample stratified subset preserves the global feature importance distribution while reducing computation to seconds. The resulting .npy matrix is cached at startup via @st.cache_resource, enabling instant SHAP waterfall charts.
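A sketch of the precompute step under stated assumptions: the stratification variable is taken to be the supermarket (the source does not say), the feature matrix is a NumPy array, and the helper names and output path are hypothetical. The sampler allocates rows to each stratum in proportion to its size.

```python
import numpy as np


def stratified_sample_indices(strata, n_total, seed=0):
    """Pick roughly n_total row indices, proportionally to each stratum's size."""
    strata = np.asarray(strata)
    rng = np.random.default_rng(seed)
    picked = []
    for value in np.unique(strata):
        idx = np.flatnonzero(strata == value)
        k = max(1, round(n_total * len(idx) / len(strata)))
        picked.append(rng.choice(idx, size=min(k, len(idx)), replace=False))
    return np.sort(np.concatenate(picked))


def precompute_shap(model, X, idx, out_path="shap_values.npy"):
    """Compute SHAP values for the sampled rows only and save them to disk.

    shap is imported lazily; X is assumed to be a NumPy array.
    """
    import shap

    values = shap.TreeExplainer(model).shap_values(X[idx])
    np.save(out_path, values)
    return values


# Toy strata: 600 rows from one retailer, 400 from another.
strata = np.array(["Tesco"] * 600 + ["Aldi"] * 400)
idx = stratified_sample_indices(strata, n_total=100)
print(len(idx))
```

At app startup the saved matrix can then be read back with np.load inside a function decorated with @st.cache_resource, so it is loaded once per process rather than on every rerun.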
DuckDB + PyArrow for 1GB constraint
Loading 9.5M rows × 40 columns into native Pandas consumes >2.5 GB RAM - crashing Streamlit Cloud on every cold start. DuckDB executes SQL directly on Parquet files with projection pushdown, so only the queried columns are loaded. PyArrow provides the memory-efficient backend. The lite dataset (every 4th day) reduces RAM to <150 MB while preserving all 67K canonical products for basket analysis.
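A hedged sketch of how such a query might be put together. The file path, the column names (including price_date), and the anchor date are hypothetical, and the every-4th-day rule is implemented here as a modulo on the day offset, which is one possible reading of "lite dataset". Only the listed columns appear in the SELECT, which is what lets DuckDB skip the rest of the Parquet file.

```python
def build_lite_query(parquet_path, columns, every_n_days=4, anchor="2024-01-01"):
    """Build a DuckDB SQL string that reads only `columns` from a Parquet file
    and keeps every n-th day, counted from a hypothetical anchor date."""
    cols = ", ".join(columns)
    return (
        f"SELECT {cols} FROM read_parquet('{parquet_path}') "
        f"WHERE date_diff('day', DATE '{anchor}', price_date) % {every_n_days} = 0"
    )


def run_query(sql):
    """Execute the query with DuckDB and return a PyArrow table (lazy import)."""
    import duckdb

    return duckdb.sql(sql).fetch_arrow_table()


# Hypothetical path and column names, for illustration only.
sql = build_lite_query(
    "prices.parquet",
    ["canonical_id", "supermarket", "price_date", "price"],
)
print(sql)
```

The path and column names are interpolated directly into the SQL here, which is acceptable for fixed internal values but not for user input.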
PricePoint Dynamics
UK Supermarket Competitive Intelligence - 9.5M price records, 67,341 matched products, £0.14 MAE.