Three hard constraints at once
Raw fitness tracker telemetry is useless without infrastructure to process it, models to interpret it, and a deployment strategy that fits within the memory ceiling of serverless cloud platforms.
OOM Deployment Crash
- Unoptimized RF Regressor serializes to ~3 GB
- Streamlit Cloud enforces a hard 1 GB RAM ceiling
- Direct deployment results in immediate MemoryError
- GitHub blocks files over 100 MB without LFS overhead
Pandas Does Not Scale
- 358K records across 6 months and 1,959 users
- Pandas loads entire dataset into memory for each operation
- No partition pruning: every query scans all records
- ETL scripts become brittle and environment-dependent
Generic Health Insights
- Raw step counts and heart rates are not actionable
- No user segmentation: every user receives identical advice
- No activity recognition: calorie estimates are guesswork without knowing the activity type
- No behavioral pattern extraction for product personalization
Five-stage production pipeline
Data generation through containerized cloud deployment: each stage outputs a versioned artifact that feeds the next.
Production discipline: the PySpark ETL uses full-overwrite semantics on partitions, making every run idempotent. GitHub Actions runs Pytest on every commit. Docker locks the JVM and Python environment. The entire pipeline from raw data to deployed dashboard can be reproduced by any engineer with a single docker run command.
From 3 GB OOM crash to 50 MB deployment
The compression pipeline turned an undeployable model into a production artifact. Each optimization is quantified by its impact on size and accuracy.
Three models, three different problems
Each model solves a distinct business problem using Scikit-Learn pipelines that bundle all preprocessing steps, eliminating training-serving skew at the architecture level.
Zero training-serving skew: all preprocessing steps (SimpleImputer, StandardScaler, OneHotEncoder) are bundled inside Scikit-Learn Pipeline objects. The exact transformation graph applied at training is structurally guaranteed to be applied at inference. No manual re-implementation of preprocessing in the serving layer.
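As a minimal sketch of this bundling, the block below wraps the three named transformers and a regressor in a single Pipeline object. The column names (steps, avg_heart_rate, activity_type), the toy data, and the hyperparameter values are assumptions made for illustration, not the project's actual schema or settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, chosen only for this example.
NUMERIC = ["steps", "avg_heart_rate"]
CATEGORICAL = ["activity_type"]


def build_pipeline() -> Pipeline:
    """Bundle preprocessing and the model so both are fitted and applied together."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, NUMERIC),
        # handle_unknown="ignore" encodes unseen categories as all zeros at inference.
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    model = RandomForestRegressor(n_estimators=20, max_depth=6, random_state=0)
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Toy training data with missing numeric values (illustrative only).
train = pd.DataFrame({
    "steps": [2000, 10000, 7500, np.nan, 12000, 4000],
    "avg_heart_rate": [72, 130, 110, 95, np.nan, 80],
    "activity_type": ["walk", "run", "walk", "cycle", "run", "walk"],
})
calories = np.array([150.0, 620.0, 400.0, 300.0, 700.0, 210.0])

pipe = build_pipeline().fit(train, calories)

# Raw inference rows go straight into the fitted pipeline: the same imputation,
# scaling, and encoding learned at training time are applied automatically.
new_rows = pd.DataFrame({
    "steps": [np.nan, 9000],
    "avg_heart_rate": [100, 125],
    "activity_type": ["swim", "run"],  # "swim" was never seen in training
})
predictions = pipe.predict(new_rows)
```

Because the fitted pipeline is one object, serializing it carries the learned medians, scaling statistics, and category vocabulary along with the model, so the serving layer only ever calls predict on raw rows.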
Five deliberate choices with documented tradeoffs
Each decision is grounded in the specific constraints of the system: memory limits, scale requirements, and production deployment reliability.
PySpark over Pandas
Pandas holds the full dataset in memory for every operation. PySpark with Parquet partitioning lets downstream ML loaders and analytics queries scan only the required year/month slices. At 358K records this is a latency optimization; at terabyte scale it becomes a necessity. Parquet's columnar format also typically compresses far better than CSV.
Random Forest Selection
Fitness metrics have fundamentally non-linear interactions: 500 calories over 2,000 steps is categorically different from 500 calories over 10,000 steps. Random Forests capture these interaction effects naturally without requiring polynomial feature expansion. Unlike scale-sensitive algorithms such as SVM and KNN, tree-based splits are unaffected by the scale discrepancy between heart rate (60-200 bpm) and step counts (0-30,000).
Hyperparameter Compression Strategy
Rather than switching to a smaller model family (trading accuracy for size), the strategy attacks the Random Forest hyperparameters directly. Capping max_depth bounds the exponential growth in nodes per tree. Lowering n_estimators shrinks the model in proportion to the tree count. Raising min_samples_leaf prevents the noise-fitting that inflates tree size without adding predictive value. Together these reduce the 3 GB model to ~200 MB before compression is even applied.
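The effect of these three hyperparameters on serialized size can be sketched on synthetic data as below. The dataset, the specific values (30 trees, depth 10, leaf size 5), and the helper name pickled_size_bytes are illustrative assumptions, not the settings used in the project.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def pickled_size_bytes(model) -> int:
    """Size of the model when serialized with pickle, in bytes."""
    return len(pickle.dumps(model))


# Synthetic regression data with an interaction term plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=2000)

# Default settings: trees grow until leaves are pure, so they memorize noise.
unconstrained = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Constrained settings: fewer, shallower trees with larger leaves.
constrained = RandomForestRegressor(
    n_estimators=30,      # fewer trees: size scales with tree count
    max_depth=10,         # caps nodes per tree
    min_samples_leaf=5,   # stops splits that only isolate a few noisy samples
    random_state=0,
).fit(X, y)

size_before = pickled_size_bytes(unconstrained)
size_after = pickled_size_bytes(constrained)
print(f"unconstrained: {size_before / 1e6:.2f} MB, constrained: {size_after / 1e6:.2f} MB")
```

In practice each setting would be checked against a held-out accuracy metric, so that the size reduction is accepted only when the accuracy cost is tolerable.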
Google Drive + Memory-Aware Loading
GitHub blocks files over 100 MB without LFS, and LFS adds commit complexity and storage costs. Google Drive provides a simple public URL that gdown can download from. The loading sequence matters: the K-Means and Classifier models (the smaller two) load first, then gc.collect() prompts Python's garbage collector to reclaim memory left over from initialization, and only then does the Regressor load into the freed headroom. Without this sequence, accumulated framework initialization state pushes total usage over the cloud limit.
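The loading order described above could be sketched as follows. The file names, the use of pickle as the serialization format, and the function name load_models_in_order are assumptions for illustration; the sketch also assumes the files have already been downloaded (for example with gdown) into a local directory.

```python
import gc
import pickle
from pathlib import Path


def load_models_in_order(
    model_dir,
    small_names=("kmeans.pkl", "classifier.pkl"),  # hypothetical file names
    large_name="regressor.pkl",                    # hypothetical file name
):
    """Load the small models first, collect garbage, then load the largest model."""
    model_dir = Path(model_dir)
    models = {}

    # Step 1: the smaller artifacts go first while memory is still tight.
    for name in small_names:
        with open(model_dir / name, "rb") as fh:
            models[name] = pickle.load(fh)

    # Step 2: run a full collection so unreachable objects left over from
    # startup and the first loads are freed before the big allocation.
    gc.collect()

    # Step 3: the largest artifact loads last, into the reclaimed headroom.
    with open(model_dir / large_name, "rb") as fh:
        models[large_name] = pickle.load(fh)

    return models
```

In a Streamlit app a function like this would typically be wrapped in a resource cache so the sequence runs once per process instead of on every rerun.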
Docker + GitHub Actions CI
PySpark requires a specific JVM version (OpenJDK 11) and a matching Spark binary distribution. Without containerization, "works on my machine" failures are common in CI. The multi-stage Docker build (openjdk:11-jre-slim base) locks the execution environment. GitHub Actions triggers Pytest on every commit, running atomic ETL function tests (including zero-division handling in calorie ratios) and integration tests that verify the end-to-end Spark job produces the expected partition structure.
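An atomic test of the zero-division case might look like the sketch below. The function calories_per_step is a hypothetical plain-Python stand-in for the ratio logic; the real ETL runs in PySpark, where the same guard would be expressed on columns.

```python
from typing import Optional


def calories_per_step(calories: Optional[float], steps: Optional[int]) -> Optional[float]:
    """Return calories divided by steps, or None when the ratio is undefined."""
    # A zero or missing step count would otherwise raise ZeroDivisionError
    # or a TypeError, so the ratio is reported as missing instead.
    if calories is None or not steps:
        return None
    return calories / steps


def test_zero_steps_returns_none():
    assert calories_per_step(500, 0) is None


def test_missing_inputs_return_none():
    assert calories_per_step(None, 2000) is None
    assert calories_per_step(500, None) is None


def test_normal_ratio():
    assert calories_per_step(500, 2000) == 0.25
```

Pytest discovers the test_ functions automatically, so a CI step that runs pytest on every commit would fail the build if the guard were ever removed.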
358K records. 3 models. One Docker command.
Explore the full interactive dashboard or read the source. The entire pipeline is reproducible locally via Docker.