
NLP Matching · LightGBM · SHAP · Competitive Intelligence

PricePoint Dynamics

End-to-end ML pipeline decoding UK supermarket pricing dynamics across 9.5M records. Sentence-BERT semantic matching expanded comparable products 22× over string matching - from 3,000 to 67,341 canonical products.

Cross-retailer products matched

67,341

22× over string matching · SBERT + FAISS

Daily price prediction MAE

£0.14

R²=0.98 · LightGBM · time-series split

Python · Sentence-BERT · FAISS · LightGBM · SHAP · Pandera · DuckDB · Streamlit · PyArrow

9.5M

Price Records

9.5M rows ingested

67,341

Comparable Products

22× over string matching

£0.139

Price MAE

R²=0.98 · 30-day holdout

22×

Matching Improvement

SBERT vs string baseline

The Core Problem

String matching finds 2.6% of the market

Five UK supermarkets each name the same physical product differently. Without semantic understanding, you cannot measure competition, predict prices, or track market dynamics across retailers.

Name Heterogeneity at Scale

"Heinz Tomato Ketchup Bottle 570g" vs "Heinz Ketchup 570g" vs "Heinz Ketchup Btl 570g" - the same product, three incompatible strings. With 127,226 unique raw names across five retailers, every pair of equivalent names looks different to string distance functions.

O(n²) Intractability

Naïve pairwise Levenshtein matching across 127K names requires ~8 billion comparisons. Even parallelised, this is impractical at pipeline cadence. Fuzzy matching libraries hit timeout thresholds and produce low-precision clusters that still miss semantically identical products.

Statistical Meaninglessness

With only ~3,000 exact-match products, cross-retailer price comparisons cover 2.6% of SKUs - a statistically useless sample. Market dispersion, price leadership, and competitive index calculations on this sample are dominated by selection bias: the only products included are those that happen to share an identical name across retailers.
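The scale figures above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch, not pipeline code. The page does not state the denominator behind the 2.6% figure, so the assumption here is that it is the 116,229 normalised names from the matching funnel; all variable names are illustrative.

```python
# Back-of-envelope check of the scale figures quoted on this page.
raw_names = 127_226         # unique raw product names
normalised_names = 116_229  # names left after normalisation
exact_matches = 3_000       # approximate count of string-matched products
canonical = 67_341          # canonical products after semantic matching

# Number of unordered pairs among n names: n * (n - 1) / 2
pairwise_comparisons = raw_names * (raw_names - 1) // 2

# Assumption: coverage is measured against the normalised names.
coverage_pct = 100 * exact_matches / normalised_names

# Improvement factor of semantic matching over the string baseline.
improvement = canonical / exact_matches

print(f"{pairwise_comparisons:,} pairwise comparisons")
print(f"{coverage_pct:.1f}% coverage from exact matches")
print(f"{improvement:.1f}x more comparable products")
```

The pair count comes out at roughly 8.09 billion, in line with the "~8 billion" quoted above.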

Pipeline Architecture

6-stage end-to-end ML system

Config-driven architecture via Pydantic + YAML. Every stage is a Typer CLI command. Pandera enforces three-stage data contracts. GitHub Actions runs pytest and ruff on every commit.

Data Ingestion & Validation · Pandera

cleaned_supermarket_data.parquet - Snappy compressed

Semantic Product Matching · SBERT + FAISS

canonical_products_e5.parquet - 67,341 canonical clusters

Feature Engineering · Pandas + NumPy

feature_engineered_data.parquet - 9.5M × ~40 feature columns

LightGBM Training · LightGBM

price_predictor_lgbm.joblib - MAE £0.139 · R²=0.98 · 3.09 MB

Market Precomputation · SHAP + DuckDB

shap_values.npy · market_dispersion.parquet · price_leadership.parquet

Streamlit Dashboard · Streamlit + PyArrow

5-page live dashboard · sub-1ms inference · <150 MB RAM footprint

30+ tests

Test Coverage

pytest · 3 modules

GitHub Actions

CI/CD

ruff + pytest on every PR

Pydantic + YAML

Config

zero hardcoded paths

3.09 MB

Model Size

LightGBM .joblib artifact

Semantic Matching Engine

The 22× breakthrough: why strings fail, semantics win

The example below shows how the same physical product - named differently by Tesco and ASDA - goes from unlinked to matched when you embed semantic intent rather than compare characters.

String Matching (Levenshtein)

TESCO

Heinz Tomato Ketchup Bottle 570g

ASDA

Heinz Ketchup 570g

Similarity: 61% < 80% threshold → rejected

✗ NO MATCH - products not linked

Total coverage: ~3,000 products

SBERT + FAISS (intfloat/e5-large)

TESCO → normalise → 1024-dim embed

Heinz Tomato Ketchup Bottle 570g

ASDA → normalise → 1024-dim embed

Heinz Ketchup 570g

Cosine similarity: 0.943 > 0.85 threshold → accepted

canonical: "heinz tomato ketchup 570g"

✓ MATCHED - same canonical cluster

Total coverage: 67,341 products (22×)

Matching Pipeline - name reduction funnel

127,226

raw product names

ingest

→

116,229

after normalisation

lowercase · strip units · deduplicate

→

67,341

canonical products

FAISS cosine @ 0.85 threshold
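The last step of the funnel can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it swaps FAISS for brute-force NumPy (an inner-product search over L2-normalised vectors is equivalent to cosine similarity), uses hand-made 3-dimensional vectors in place of 1024-dim embeddings, and assumes clusters are the connected components of all pairs at or above the 0.85 threshold. The function `cluster_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def cluster_by_cosine(vectors, threshold=0.85):
    """Group rows whose cosine similarity is >= threshold (connected components)."""
    # L2-normalise so that the dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T

    # Union-find over every pair above the threshold.
    parent = list(range(len(unit)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Upper triangle only, so each pair is visited once and self-pairs are skipped.
    rows, cols = np.where(np.triu(sims, k=1) >= threshold)
    for i, j in zip(rows, cols):
        parent[find(int(i))] = find(int(j))

    return [find(i) for i in range(len(unit))]

names = [
    "heinz tomato ketchup bottle 570g",
    "heinz ketchup 570g",
    "baked beans 415g",
]
# Toy stand-ins for embeddings: the two ketchup names point the same way.
toy_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.95],
])

labels = cluster_by_cosine(toy_vectors, threshold=0.85)
for name, label in zip(names, labels):
    print(label, name)
```

The brute-force similarity matrix is itself O(n²), which is why an approximate or indexed search such as FAISS is used at 116K names; the sketch only shows the thresholding and grouping logic.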

Market Intelligence Findings

SHAP + CCF reveal pricing structure

With 67,341 matched products and precomputed SHAP values, the data reveals how UK supermarkets actually set prices - not in isolation, but as reactions to each other.

PRICE LEADERSHIP NETWORK - Cross-correlation function (CCF) analysis

Aldi

Price Setter

Sets the floor

Tesco

Big Four Leader

~2 day lag to Aldi

ASDA

Fast Follower

~2 day lag to Aldi

Morrisons

Mid Follower

3-5 day lag

Sainsbury's

Tesco Tracker

14-day lag to Tesco

🔻

The Price Floor Setter: Aldi

Aldi systematically depresses prices across all categories. In the SHAP analysis, supermarket=Aldi has the most negative contribution of any feature: it consistently pushes price predictions down regardless of product or date. The Big Four do not undercut Aldi; they price relative to it.

🔗

The Lockstep Leaders: Tesco · Sainsbury's

The cross-correlation function (CCF) between the Tesco and Sainsbury's price series peaks at lag=14, indicating a systematic 2-week follow pattern. Tesco acts as the Big Four price anchor; Sainsbury's matches within a fortnight. ASDA and Morrisons show less predictable lag structures.

📊

Algorithmic Competitive Anchoring: All 5 retailers

price_vs_market_avg is a top-3 SHAP feature across all categories, outranking lag and momentum features. This is consistent with retailers not pricing in isolation but algorithmically monitoring and reacting to the daily market average. Prices converge toward the cluster mean within 2-3 days of any deviation.
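The lag readings in the findings above come from cross-correlation. A minimal sketch of the idea on synthetic data follows. It assumes daily series that are differenced before correlating (the page does not say how the real series were preprocessed); the function `best_lag` and both series are hypothetical, with the follower built to copy the leader 14 days late.

```python
import numpy as np

def best_lag(leader, follower, max_lag=30):
    """Return the lag (in days) at which follower changes best track leader changes."""
    # Difference both series so shared trends do not dominate the correlation.
    x = np.diff(leader)
    y = np.diff(follower)
    scores = []
    for lag in range(max_lag + 1):
        # Pair the leader's change at day t with the follower's change at day t + lag.
        a = x[: len(x) - lag]
        b = y[lag:]
        scores.append(np.corrcoef(a, b)[0, 1])
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
n_days = 400

# Synthetic leader: a random walk around 2.00 GBP.
leader = 2.0 + np.cumsum(rng.normal(0, 0.02, n_days))

# Synthetic follower: copies the leader 14 days late, plus small noise.
follower = np.empty_like(leader)
follower[:14] = leader[0]
follower[14:] = leader[:-14]
follower += rng.normal(0, 0.002, n_days)

lag, scores = best_lag(leader, follower, max_lag=30)
print("estimated lag in days:", lag)
```

On real price series the peak is rarely this clean, so the height of the peak relative to neighbouring lags matters as much as its position.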

SHAP FEATURE IMPORTANCE - Top drivers of price prediction

price_rol_mean_7d

7-day rolling mean

price_vs_market_avg

deviation from market average

price_lag_1d

yesterday's price

supermarket=Aldi

Aldi membership flag

price_rol_min_7d

7-day rolling minimum
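The features listed above can be sketched in pandas. The exact definitions are not given on this page, so everything here is an assumption: one row per retailer, product and day; rolling windows built from previous-day values so that no same-day price enters the features; and price_vs_market_avg taken as the retailer's previous-day price minus the cross-retailer mean for the same canonical product. The column names `date`, `canonical_product` and `price` are illustrative.

```python
import pandas as pd

days = pd.date_range("2024-01-01", periods=10, freq="D")
prices = pd.DataFrame({
    "date": list(days) * 2,
    "supermarket": ["Tesco"] * 10 + ["Aldi"] * 10,
    "canonical_product": "heinz tomato ketchup 570g",
    "price": [2.00] * 5 + [2.20] * 5 + [1.50] * 10,
})

# Sort so that shift/rolling operate in time order within each series.
prices = prices.sort_values(["supermarket", "canonical_product", "date"])
series = prices.groupby(["supermarket", "canonical_product"])["price"]

# Yesterday's price (assumes exactly one row per day per series).
prices["price_lag_1d"] = series.shift(1)

# Rolling statistics over the previous 7 days, excluding today.
prices["price_rol_mean_7d"] = series.transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).mean()
)
prices["price_rol_min_7d"] = series.transform(
    lambda s: s.shift(1).rolling(7, min_periods=1).min()
)

# Deviation from the cross-retailer average for the same product and day,
# computed on lagged prices (an assumption, to avoid using today's price).
market_avg = prices.groupby(["canonical_product", "date"])["price_lag_1d"].transform("mean")
prices["price_vs_market_avg"] = prices["price_lag_1d"] - market_avg

print(prices.tail(3))
```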

Engineering Decisions

Five non-obvious choices

SBERT over fuzzy string matching

Fuzzy matching (Levenshtein, token_sort_ratio) is O(n²) across 127K unique names - already intractable at this scale. More critically, it treats names as character sequences, not semantic objects: "Heinz Ketchup 570g" scores low against "Heinz Tomato Ketchup Bottle 570g" despite describing the same product. SBERT embeds semantic intent - abbreviations and descriptor variations map to nearby vectors regardless of string structure.

3,000 → 67,341 comparable products (22×)
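To make the character-level failure concrete, here is a small standard-library sketch of Levenshtein similarity on the ketchup example. It assumes similarity is defined as one minus edit distance divided by the longer string's length; fuzzy-matching libraries use different normalisations, so this will not necessarily reproduce the 61% shown in the comparison above, only the fact that the pair falls below an 80% threshold.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete from a
                curr[j - 1] + 1,     # insert into a
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

tesco = "Heinz Tomato Ketchup Bottle 570g"
asda = "Heinz Ketchup 570g"

distance = levenshtein(tesco, asda)
score = similarity(tesco, asda)
print(f"edit distance {distance}, similarity {score:.2f}")
```

Dropping the two descriptor words costs 14 edits out of 32 characters, which is enough to push a same-product pair well under the threshold.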

Time-series split over random shuffle

Random train/test split causes data leakage: future price information leaks into training data through rolling and lag features. A model trained with random shuffle predicts prices it has already seen in its rolling context - inflating R² artificially. Time-series split enforces the correct causal direction, producing an honest £0.14 MAE on genuinely unseen future prices.

Prevents leakage · honest £0.14 MAE
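A time-ordered split can be sketched in a few lines of pandas. This assumes a `date` column and a holdout made of the final 30 days, matching the "30-day holdout" quoted in the headline stats; the function `time_split` is hypothetical and the project's actual split logic may differ.

```python
import pandas as pd

def time_split(df, date_col="date", holdout_days=30):
    """Train on everything up to the cutoff, test on the final holdout_days."""
    cutoff = df[date_col].max() - pd.Timedelta(days=holdout_days)
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test

# Toy frame: 120 consecutive days with a constant price.
frame = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=120, freq="D"),
    "price": 1.0,
})

train, test = time_split(frame)
print(len(train), "training days,", len(test), "holdout days")
```

Every training row precedes every test row, so lag and rolling features computed inside the training window never see holdout prices.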

MAE over MSE as training objective

Price scraping produces non-Gaussian outliers: promotional flash sales, mislabelled weights, and bundle-price errors appear as extreme values. MSE penalises outliers quadratically - pulling the model towards fitting noise. MAE provides linear loss, making LightGBM robust to scraping artefacts while optimising for accurate day-to-day price prediction.

Robust to price-scraping artefacts
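The robustness argument can be seen on a toy example: the constant prediction that minimises MSE is the mean, which a single scraping error drags upward, while the constant that minimises MAE is the median, which barely moves. The numbers below are made up for illustration. In LightGBM the L1 objective is selected with `objective="regression_l1"`; how the project configures it is not shown on this page.

```python
import numpy as np

# Four plausible prices and one scraping artefact (e.g. a bundle price).
prices = np.array([1.00, 1.00, 1.10, 1.20, 9.99])

# Try every constant prediction from 0.50 to 10.00 in 1p steps.
candidates = np.linspace(0.5, 10.0, 951)
errors = prices[None, :] - candidates[:, None]

mse = (errors ** 2).mean(axis=1)
mae = np.abs(errors).mean(axis=1)

best_under_mse = candidates[mse.argmin()]
best_under_mae = candidates[mae.argmin()]

print(f"MSE-optimal constant: {best_under_mse:.2f} (mean   = {prices.mean():.2f})")
print(f"MAE-optimal constant: {best_under_mae:.2f} (median = {np.median(prices):.2f})")
```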

Precomputed SHAP over on-demand

TreeExplainer on LightGBM with 9.5M rows takes 4-8 minutes per run. A 1GB RAM Streamlit deployment cannot execute this in real time. Precomputing on an 8,000-sample stratified subset preserves the global feature importance distribution while reducing computation to seconds. The .npy matrix is cached at startup via @st.cache_resource, enabling instant SHAP waterfall charts.

4-8 min → sub-second SHAP in dashboard
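The subsetting step might look like the sketch below. The stratification key is not stated on this page, so stratifying by `supermarket` is an assumption, as are the toy row counts. The SHAP computation itself is omitted; the point is only that a proportional sample keeps each stratum's share of the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for the feature table: 100,000 rows across three retailers.
rows = pd.DataFrame({
    "supermarket": np.repeat(["Tesco", "ASDA", "Aldi"], [50_000, 30_000, 20_000]),
    "price": rng.uniform(0.5, 10.0, 100_000).round(2),
})

target_rows = 8_000
frac = target_rows / len(rows)

# Sample the same fraction from every stratum so proportions are preserved.
subset = rows.groupby("supermarket").sample(frac=frac, random_state=0)

counts = subset["supermarket"].value_counts().to_dict()
print(len(subset), counts)
```

The resulting subset would then be passed to the explainer once, offline, and the output array saved for the dashboard to load at startup.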

DuckDB + PyArrow for 1GB constraint

9.5M rows × 40 columns in native Pandas consumes >2.5 GB RAM - crashing Streamlit Cloud on every cold start. DuckDB executes SQL directly on Parquet files with columnar projection pushdown (only queried columns loaded). PyArrow provides the memory-efficient backend. The lite dataset (every 4th day) reduces RAM to <150 MB while preserving all 67K canonical products for basket analysis.

2.5 GB Pandas → <150 MB serving layer
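The "every 4th day" lite dataset can be sketched as a simple date filter. This assumes "every 4th day" means days whose offset from the first date is a multiple of four; the function `thin_to_every_nth_day` and the column names are hypothetical, and the sketch uses plain pandas rather than the DuckDB serving layer.

```python
import pandas as pd

def thin_to_every_nth_day(df, n=4, date_col="date"):
    """Keep only rows whose day offset from the earliest date is a multiple of n."""
    day_index = (df[date_col] - df[date_col].min()).dt.days
    return df[day_index % n == 0]

# Toy frame: two products observed daily for 40 days.
days = pd.date_range("2024-01-01", periods=40, freq="D")
full = pd.DataFrame({
    "date": list(days) * 2,
    "canonical_product": ["product_a"] * 40 + ["product_b"] * 40,
    "price": 1.0,
})

lite = thin_to_every_nth_day(full, n=4)
print(len(full), "rows ->", len(lite), "rows")
```

Thinning by date rather than by product is what keeps every canonical product available for basket analysis, provided each product has at least one observation on a retained day.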

PricePoint Dynamics

UK Supermarket Competitive Intelligence - 9.5M price records, 67,341 matched products, £0.14 MAE.
