NLP Matching · LightGBM · SHAP · Competitive Intelligence

PricePoint Dynamics

End-to-end ML pipeline decoding UK supermarket pricing dynamics across 9.5M records. Sentence-BERT semantic matching expanded comparable products 22× over string matching - from 3,000 to 67,341 canonical products.

Cross-retailer products matched: 67,341 (22× over string matching · SBERT + FAISS)
Daily price prediction MAE: £0.14 (R²=0.98 · LightGBM · time-series split)
Python · Sentence-BERT · FAISS · LightGBM · SHAP · Pandera · DuckDB · Streamlit · PyArrow
Price Records: 9.5M rows ingested
Comparable Products: 67,341 (22× over string matching)
Price MAE: £0.139 (R²=0.98 · 30-day holdout)
Matching Improvement: 22× (SBERT vs string baseline)
The Core Problem

String matching finds 2.6% of the market

Five UK supermarkets each name the same physical product differently. Without semantic understanding, you cannot measure competition, predict prices, or track market dynamics across retailers.

Name Heterogeneity at Scale

"Heinz Tomato Ketchup Bottle 570g" vs "Heinz Ketchup 570g" vs "Heinz Ketchup Btl 570g" - the same product, three incompatible strings. With 127,226 unique raw names across five retailers, every pair of equivalent names looks different to string distance functions.

O(n²) Intractability

Naïve pairwise Levenshtein matching across 127K names requires ~8 billion comparisons. Even parallelised, this is impractical at pipeline cadence. Fuzzy matching libraries hit timeout thresholds and produce low-precision clusters that still miss semantically identical products.
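
The ~8 billion figure follows from counting unordered pairs, n(n-1)/2. A quick check in Python, using the 127,226 raw-name count quoted above:

```python
import math

n_names = 127_226  # unique raw product names, as quoted in the text
pairs = math.comb(n_names, 2)  # unordered pairs: n * (n - 1) / 2

print(f"{pairs:,} pairwise comparisons")
```

Each of those comparisons is itself a dynamic-programming edit-distance computation, so the real cost is higher than the pair count alone suggests.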

Statistical Meaninglessness

With only ~3,000 exact-match products, cross-retailer price comparisons cover 2.6% of SKUs - a statistically useless sample. Market dispersion, price leadership, and competitive index calculations on this sample are dominated by the bias of which products happened to have identical names.

Pipeline Architecture

6-stage end-to-end ML system

Config-driven architecture via Pydantic + YAML. Every stage is a Typer CLI command. Pandera enforces three-stage data contracts. GitHub Actions runs pytest and ruff on every commit.

S1 · Data Ingestion & Validation (Pandera) → cleaned_supermarket_data.parquet - Snappy compressed
S2 · Semantic Product Matching (SBERT + FAISS) → canonical_products_e5.parquet - 67,341 canonical clusters
S3 · Feature Engineering (Pandas + NumPy) → feature_engineered_data.parquet - 9.5M × ~40 feature columns
S4 · LightGBM Training (LightGBM) → price_predictor_lgbm.joblib - MAE £0.139 · R²=0.98 · 3.09 MB
S5 · Market Precomputation (SHAP + DuckDB) → shap_values.npy · market_dispersion.parquet · price_leadership.parquet
S6 · Streamlit Dashboard (Streamlit + PyArrow) → 5-page live dashboard · sub-1ms inference · <150 MB RAM footprint
Test Coverage: 30+ tests (pytest · 3 modules)
CI/CD: GitHub Actions (ruff + pytest on every PR)
Config: Pydantic + YAML (zero hardcoded paths)
Model Size: 3.09 MB (LightGBM .joblib artifact)
Semantic Matching Engine

The 22× breakthrough: why strings fail, semantics win

The example below shows how the same physical product - named differently by Tesco and ASDA - goes from unlinked to matched when you embed semantic intent rather than compare characters.

String Matching (Levenshtein)
TESCO
Heinz Tomato Ketchup Bottle 570g
ASDA
Heinz Ketchup 570g
Similarity: 61% < 80% threshold → rejected
✗ NO MATCH - products not linked
Total coverage: ~3,000 products
SBERT + FAISS (intfloat/e5-large)
TESCO → normalise → 1024-dim embed
Heinz Tomato Ketchup Bottle 570g
ASDA → normalise → 1024-dim embed
Heinz Ketchup 570g
Cosine similarity: 0.943 > 0.85 threshold → accepted
canonical: "heinz tomato ketchup 570g"
✓ MATCHED - same canonical cluster
Total coverage: 67,341 products (22×)
Matching Pipeline - name reduction funnel
127,226 raw product names (ingest)
116,229 after normalisation (lowercase · strip units · deduplicate)
67,341 canonical products (FAISS cosine @ 0.85 threshold)
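
The final funnel step, grouping names whose cosine similarity exceeds 0.85, can be sketched without FAISS or a real encoder. The example below is a brute-force NumPy stand-in, not the project's code: the hand-made 3-dimensional vectors replace the 1024-dim embeddings, the union-find grouping is an assumption (the page does not say how matched pairs become clusters), and `cluster_by_cosine` is a hypothetical helper.

```python
import numpy as np

def cluster_by_cosine(embeddings, threshold=0.85):
    """Group rows whose pairwise cosine similarity exceeds the threshold."""
    # L2-normalise so that the inner product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T

    parent = list(range(len(unit)))  # union-find forest, one node per name

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge every pair above the threshold (upper triangle only, no self-pairs).
    for i, j in zip(*np.where(np.triu(sims, k=1) > threshold)):
        parent[find(int(i))] = find(int(j))

    return [find(i) for i in range(len(unit))]

# Toy vectors: rows 0 and 1 point the same way, row 2 does not.
toy_embeddings = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
])
labels = cluster_by_cosine(toy_embeddings)
print(labels)
```

The dense similarity matrix here is quadratic in memory, which is exactly what an approximate or indexed nearest-neighbour search avoids at 116K names.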
Market Intelligence Findings

SHAP + CCF reveal pricing structure

With 67,341 matched products and precomputed SHAP values, the data reveals how UK supermarkets actually set prices - not in isolation, but as reactions to each other.

PRICE LEADERSHIP NETWORK - cross-correlation function (CCF) analysis
Aldi · Price Setter · Sets the floor
Tesco · Big Four Leader · ~2 day lag to Aldi
ASDA · Fast Follower · ~2 day lag to Aldi
Morrisons · Mid Follower · 3-5 day lag
Sainsbury's · Tesco Tracker · 14-day lag to Tesco
🔻 The Price Floor Setter · Aldi

Aldi systematically depresses prices across all categories. SHAP analysis shows supermarket=Aldi is the feature with the single most negative contribution - it consistently pushes price predictions down regardless of product or date. The Big Four cannot undercut Aldi; they price relative to it.

🔗 The Lockstep Leaders · Tesco · Sainsbury's

The cross-correlation function (CCF) between the Tesco and Sainsbury's price series peaks at lag=14, indicating a systematic 2-week follow pattern. Tesco acts as the Big Four price anchor; Sainsbury's matches within a fortnight. ASDA and Morrisons show less predictable lag structures.
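
A lag estimate like this can be read off a lagged correlation scan. The sketch below is a minimal illustration on synthetic daily price changes, not the project's code: the function name `lagged_correlation`, the choice to correlate price changes rather than raw prices, and the 14-day delay built into the synthetic follower are all assumptions made for the example.

```python
import numpy as np

def lagged_correlation(leader, follower, max_lag):
    """Pearson correlation of leader[t] with follower[t + lag], for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = leader[: len(leader) - lag]  # leader values that have a follower value lag days later
        y = follower[lag:]               # follower values shifted forward by lag days
        result[lag] = float(np.corrcoef(x, y)[0, 1])
    return result

rng = np.random.default_rng(0)
n_days, true_lag = 400, 14

# Synthetic leader: daily price changes. Synthetic follower: the same changes
# arriving true_lag days later, plus a little noise.
leader_changes = rng.normal(0.0, 0.05, n_days)
follower_changes = np.concatenate([np.zeros(true_lag), leader_changes[:-true_lag]])
follower_changes = follower_changes + rng.normal(0.0, 0.01, n_days)

ccf = lagged_correlation(leader_changes, follower_changes, max_lag=21)
best_lag = max(ccf, key=ccf.get)
print(best_lag, round(ccf[best_lag], 3))
```

Working on changes rather than levels is a common precaution, since two trending price series can correlate strongly at every lag.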

📊 Algorithmic Competitive Anchoring · All 5 retailers

price_vs_market_avg is a top-3 SHAP feature across all categories, outranking the 1-day lag and rolling-minimum features. This is consistent with retailers not pricing in isolation - they appear to monitor and react algorithmically to the daily market average. Prices converge toward the cluster mean within 2-3 days of any deviation.

SHAP FEATURE IMPORTANCE - Top drivers of price prediction
price_rol_mean_7d · 7-day rolling mean · 92
price_vs_market_avg · deviation from market average · 78
price_lag_1d · yesterday's price · 71
supermarket=Aldi · Aldi membership flag · 58
price_rol_min_7d · 7-day rolling minimum · 51
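
A ranking of this kind is commonly derived by averaging the absolute SHAP value of each feature over all sampled rows. The sketch below assumes that convention; the tiny matrix is made up for illustration, and how the page scales its scores (92, 78, ...) is not stated, so no attempt is made to reproduce them.

```python
import numpy as np

feature_names = ["price_rol_mean_7d", "price_vs_market_avg", "price_lag_1d"]

# Stand-in for a precomputed SHAP matrix: one row per sample, one column per feature.
# These numbers are invented; a real matrix would be loaded with np.load(...).
shap_matrix = np.array([
    [0.9, -0.2, 0.1],
    [-0.7, 0.4, -0.1],
    [0.8, -0.3, 0.2],
])

# Global importance = mean absolute contribution per feature.
importance = np.abs(shap_matrix).mean(axis=0)
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name:<22} {score:.3f}")
```

Taking the absolute value matters: a feature that pushes predictions strongly up for some rows and strongly down for others would otherwise average out to near zero.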
Engineering Decisions

Five non-obvious choices

01

SBERT over fuzzy string matching

Fuzzy matching (Levenshtein, token_sort_ratio) is O(n²) across 127K unique names - already intractable at this scale. More critically, it treats names as character sequences, not semantic objects: "Heinz Ketchup 570g" scores low against "Heinz Tomato Ketchup Bottle 570g" despite describing the same product. SBERT embeds semantic intent - abbreviations and descriptor variations map to nearby vectors regardless of string structure.

3,000 → 67,341 comparable products (22×)
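
To make the character-level failure concrete, here is a minimal pure-Python sketch of a normalised Levenshtein similarity. The normalisation (one minus distance over the longer length) is an assumption; fuzzy-matching libraries use different ratio formulas, so the exact score will not necessarily match the 61% shown earlier.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, approaching 0.0 as the strings diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

tesco_name = "Heinz Tomato Ketchup Bottle 570g"
asda_name = "Heinz Ketchup 570g"
score = similarity(tesco_name, asda_name)
print(f"{score:.2f}")
```

Dropping "Tomato " and "Bottle " costs 14 single-character deletions out of 32 characters, so under this normalisation the pair falls well below an 80% acceptance threshold even though a shopper would treat the two names as the same product.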
02

Time-series split over random shuffle

Random train/test split causes data leakage: future price information leaks into training data through rolling and lag features. A model trained with random shuffle predicts prices it has already seen in its rolling context - inflating R² artificially. Time-series split enforces the correct causal direction, producing an honest £0.14 MAE on genuinely unseen future prices.

Prevents leakage · honest £0.14 MAE
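
A minimal pandas sketch of the idea, on synthetic data. The column names `product_id`, `date`, and `price` are assumptions, as is the `shift(1)` before the rolling window (which keeps today's price out of today's features); the 30-day holdout mirrors the figure quoted at the top of the page.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for two products over 120 days.
dates = pd.date_range("2024-01-01", periods=120, freq="D")
df = pd.DataFrame({
    "product_id": np.repeat(["a", "b"], len(dates)),
    "date": np.tile(dates.to_numpy(), 2),
    "price": np.concatenate([np.linspace(1.0, 1.6, 120), np.linspace(2.0, 2.3, 120)]),
}).sort_values(["product_id", "date"])

# Lag and rolling features look strictly backwards within each product.
grouped = df.groupby("product_id")["price"]
df["price_lag_1d"] = grouped.shift(1)
df["price_rol_mean_7d"] = grouped.transform(lambda s: s.shift(1).rolling(7).mean())

# Chronological split: the final 30 days are held out, nothing is shuffled.
cutoff = df["date"].max() - pd.Timedelta(days=30)
train = df[df["date"] <= cutoff].dropna()
test = df[df["date"] > cutoff]

print(train["date"].max(), test["date"].min(), test["date"].nunique())
```

With a random shuffle, a test row from day 50 would sit next to training rows from days 49 and 51 whose lag and rolling features encode almost the same prices, which is the leakage described above.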
03

MAE over MSE as training objective

Price scraping produces non-Gaussian outliers: promotional flash sales, mislabelled weights, and bundle-price errors appear as extreme values. MSE penalises outliers quadratically - pulling the model towards fitting noise. MAE provides linear loss, making LightGBM robust to scraping artefacts while optimising for accurate day-to-day price prediction.

Robust to price-scraping artefacts
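
The effect is easiest to see with a constant predictor: the value minimising squared error is the mean, while the value minimising absolute error is the median. The prices below are invented, with one mis-scraped entry. In LightGBM the L1 objective is selected with `objective="regression_l1"`; whether the project configures it exactly this way is an assumption.

```python
import statistics

# Five plausible shelf prices plus one scraping error (15.00 instead of 1.50).
prices = [1.50, 1.50, 1.55, 1.50, 1.45, 15.00]

mse_optimal = statistics.mean(prices)    # best constant under squared loss
mae_optimal = statistics.median(prices)  # best constant under absolute loss

print(f"MSE-optimal constant: {mse_optimal:.2f}")
print(f"MAE-optimal constant: {mae_optimal:.2f}")
```

A single bad record drags the squared-loss optimum far from every genuine price, while the absolute-loss optimum stays on the typical value.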
04

Precomputed SHAP over on-demand

TreeExplainer on LightGBM with 9.5M rows takes 4-8 minutes per run. A 1GB RAM Streamlit deployment cannot execute this in real time. Precomputing on an 8,000-sample stratified subset preserves the global feature importance distribution while reducing computation to seconds. The .npy matrix is cached at startup via @st.cache_resource, enabling instant SHAP waterfall charts.

4-8 min → sub-second SHAP in dashboard
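
The subsetting step can be sketched with pandas alone. This is an illustration under assumptions: the page does not say which column the 8,000-row sample is stratified on, so `supermarket` is a guess, `stratified_subset` is a hypothetical helper, and the toy frame is scaled down to 1,000 rows with a target of 100. The SHAP computation itself is not shown.

```python
import numpy as np
import pandas as pd

def stratified_subset(df, by, n_target, seed=0):
    """Sample roughly n_target rows, keeping each group's share of the data."""
    frac = n_target / len(df)
    return df.groupby(by).sample(frac=frac, random_state=seed)

# Toy data: unequal row counts per retailer.
sizes = {"Aldi": 100, "ASDA": 200, "Tesco": 300, "Morrisons": 200, "Sainsbury's": 200}
rows = pd.DataFrame({"supermarket": np.repeat(list(sizes), list(sizes.values()))})
rows["price"] = np.linspace(0.5, 5.0, len(rows))

subset = stratified_subset(rows, by="supermarket", n_target=100)
print(subset["supermarket"].value_counts().to_dict())
```

Because every group is sampled at the same fraction, a retailer with three times the rows contributes three times the samples, which is what keeps the global importance distribution representative.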
05

DuckDB + PyArrow for 1GB constraint

9.5M rows × 40 columns in native Pandas consumes >2.5 GB RAM - crashing Streamlit Cloud on every cold start. DuckDB executes SQL directly on Parquet files with columnar projection pushdown (only queried columns loaded). PyArrow provides the memory-efficient backend. The lite dataset (every 4th day) reduces RAM to <150 MB while preserving all 67K canonical products for basket analysis.

2.5 GB Pandas → <150 MB serving layer
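
The "every 4th day" thinning can be sketched in a few lines of pandas. The column names and the helper `every_nth_day` are assumptions for illustration, and the choice to count days from the earliest date in the data is one of several reasonable conventions.

```python
import pandas as pd

def every_nth_day(df, date_col="date", step=4):
    """Keep rows whose date falls on every step-th day, counted from the first date."""
    day_index = (df[date_col] - df[date_col].min()).dt.days
    return df[day_index % step == 0]

# Toy data: two products observed daily for 12 days.
dates = pd.date_range("2024-01-01", periods=12, freq="D")
full = pd.DataFrame({
    "product_id": ["a"] * 12 + ["b"] * 12,
    "date": list(dates) * 2,
    "price": [1.0] * 24,
})

lite = every_nth_day(full)
print(len(full), len(lite), lite["product_id"].nunique())
```

Thinning along the time axis cuts rows by roughly a factor of four while leaving the set of products untouched, which matches the stated goal of preserving every canonical product for basket analysis.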
