Three hard constraints at once
Predicting macroscopic thermodynamic properties from a 1D molecular string is a notoriously difficult problem. The provided data was unusable, the obvious deep learning approach failed, and the target variable has an unusual statistical structure.
Extreme Feature Sparsity (99% Zeros)
- 419 binary Group features representing molecular substructures
- Over 319 columns have near-zero variance: almost all zeros
- Tree-based models waste split evaluations on constant columns
- Neural networks cannot learn from features with no signal variance
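A minimal sketch of how near-constant binary columns like these could be filtered out, assuming a plain 0/1 NumPy matrix. The helper name `drop_near_constant` and the 1% minority threshold are illustrative assumptions, not the project's actual code or cutoff.

```python
import numpy as np


def drop_near_constant(X, min_minority_frac=0.01):
    """Keep binary columns whose rarer value covers at least min_minority_frac of rows.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    X = np.asarray(X)
    ones_frac = X.mean(axis=0)                       # share of 1s in each column
    minority = np.minimum(ones_frac, 1.0 - ones_frac)
    keep = np.flatnonzero(minority >= min_minority_frac)
    return X[:, keep], keep


# Synthetic stand-in data: 3 informative columns, 4 all-zero columns, 1 rare column.
rng = np.random.default_rng(0)
n_rows = 1000
dense = rng.integers(0, 2, size=(n_rows, 3))         # roughly half ones
constant = np.zeros((n_rows, 4), dtype=int)          # zero variance
rare = np.zeros((n_rows, 1), dtype=int)
rare[:2, 0] = 1                                      # 0.2% ones, near-zero variance

X = np.hstack([dense, constant, rare])
X_kept, kept = drop_near_constant(X)
print(X.shape, "->", X_kept.shape, "kept columns:", kept.tolist())
```

Filtering on the minority-class share rather than raw variance keeps the threshold easy to read for binary indicators: a column survives only if its rarer value appears in at least 1% of rows.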
Standard GCN Fails on Thermodynamics
- GCN aggregates local atomic neighborhoods (bonds, hybridization)
- No mechanism to capture global properties like MolWt or TPSA
- Phase transitions are driven by macroscopic descriptors, not topology alone
- Vanilla GCN reaches 56.5 K MAE, worse than a Random Forest baseline
Thermodynamic Complexity + Outliers
- Target range: 53.54 K to 897.15 K (17x spread)
- Right-skewed distribution with physically valid high-Tm outliers
- RMSE training would sacrifice median accuracy for rare extremes
- Approaching experimental limits (~20-25 K) from 1D strings alone
Five-stage research-to-production pipeline
From raw sparse molecular data to deployed dual-architecture inference. Each stage resolves a specific technical constraint.
The key insight from the pipeline: the provided sparse binary features were nearly useless (319 of 427 dropped). The entire predictive signal came from RDKit-generated physicochemical descriptors extracted from the SMILES strings, a domain enrichment step that both the Stacking Ensemble and the Hybrid GNN depend on. LightGBM feature importance confirmed this empirically: virtually all of the top predictive features are RDKit-generated rather than original Group indicators.
Three architectures, three lessons
Step through the model evolution from the architecture that failed, to the classical ML approach that achieved SOTA accuracy, to the neural network design that solved the representational gap. Each iteration informed the next.
Two model paths, one key prerequisite
Both the Stacking Ensemble and the Hybrid GNN depend critically on RDKit feature engineering. The models are complements, not substitutes: Stacking maximizes accuracy on known scaffolds, the GNN generalizes better to novel ones.
Engineering pragmatism: the Stacking Ensemble outperforms the Hybrid GNN in absolute MAE (24.59 K vs 30.27 K). Rather than hiding this, the project uses both in parallel: Stacking for best accuracy on in-distribution molecules, the Hybrid GNN for better generalization to novel scaffolds. Acknowledging that classical ML beat deep learning here, and shipping both, demonstrates a results-first mindset over architecture hype.
Five choices grounded in the problem constraints
Each decision was made in response to a specific constraint in the data, the model architecture, or the deployment environment. Most are non-obvious without the full context.
Log-Transform Target Variable
Tm ranges from 53 to 897 K (a 17x spread) with a right-skewed distribution. Training on raw Kelvin values with L1 (MAE) loss causes the neural network to allocate gradient budget disproportionately toward high-Tm outliers. np.log1p() compresses the dynamic range, making prediction errors comparable in relative magnitude across the full Kelvin range. The transformation is inverted at inference via np.expm1(), the exact inverse of np.log1p(); plain np.exp() would overshoot every prediction by 1 K. This single change was critical for GNN training convergence and stability over 150 epochs.
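The round trip can be checked in a few lines. This is a small sketch with made-up melting points spanning the stated range, not the project's preprocessing code; the variable names are illustrative.

```python
import numpy as np

# Illustrative melting points in Kelvin covering the stated target range.
tm_kelvin = np.array([53.54, 250.0, 400.0, 897.15])

# Forward transform applied to the training target.
y_log = np.log1p(tm_kelvin)

# Exact inverse for inference-time predictions.
tm_back = np.expm1(y_log)

# np.exp is the inverse of np.log, not np.log1p, so it lands 1 K too high.
tm_exp = np.exp(y_log)

raw_spread = tm_kelvin.max() / tm_kelvin.min()
log_spread = y_log.max() / y_log.min()
print(f"raw spread: {raw_spread:.1f}x, log-space spread: {log_spread:.2f}x")
print("expm1 round trip exact:", np.allclose(tm_back, tm_kelvin))
print("np.exp offset (K):", np.round(tm_exp - tm_kelvin, 6))
```

In log space the roughly 17x spread shrinks to under 2x, which is the compression the paragraph above relies on.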
MAE (L1) over RMSE / MSE
Melting point datasets contain physically valid extreme outliers: organic crystals with Tm approaching 900 K. RMSE penalizes large errors quadratically: a 100 K error contributes 100x more squared error than a 10 K error. This forces models to sacrifice median accuracy in order to reduce errors on rare high-Tm compounds. MAE treats all errors linearly, producing a more robust model for the typical organic chemistry use case where most compounds lie in the 200-400 K range.
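A toy calculation makes the difference concrete. The error values below are invented for illustration (99 compounds off by 10 K and one outlier off by 100 K); they are not results from the project.

```python
import numpy as np


def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))


def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(np.square(errors))))


# Hypothetical residuals: 99 typical compounds and a single high-Tm outlier.
errors = np.append(np.full(99, 10.0), 100.0)

# Fraction of each loss that is attributable to the single outlier.
outlier_share_abs = abs(errors[-1]) / np.sum(np.abs(errors))
outlier_share_sq = errors[-1] ** 2 / np.sum(errors ** 2)

print(f"MAE  = {mae(errors):.2f} K, outlier share = {outlier_share_abs:.1%}")
print(f"RMSE = {rmse(errors):.2f} K, outlier share = {outlier_share_sq:.1%}")
```

Here the one outlier accounts for about 9% of the total absolute error but just over 50% of the total squared error, so a squared-error objective would spend half its effort on 1% of the data.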
Late-Fusion over Early Fusion
Early fusion, which concatenates RDKit descriptors to node-level features before the GCN layers, corrupts the graph message-passing semantics. GCN layers perform neighborhood aggregation over per-atom features; mixing in global molecular descriptors (which represent the entire molecule, not individual atoms) breaks the locality assumption the GCN architecture relies on. Late fusion allows the GCN to specialize in topology and the parallel MLP to specialize in thermodynamics, with the regression head performing the joint optimization.
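The data flow of late fusion can be sketched as a framework-free NumPy forward pass. Everything specific here is an assumption made for illustration: the layer widths, mean pooling as the graph readout, the random untrained weights, and the three-atom toy graph. The project's actual model is a trained network whose exact layout is not given in this text.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def normalized_adjacency(adj):
    """Symmetric GCN normalization: D^-1/2 (A + I) D^-1/2."""
    a = adj + np.eye(adj.shape[0])                # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, h, w):
    """One graph convolution: aggregate neighbor features, then project."""
    return relu(a_hat @ h @ w)


# Toy molecule: three atoms bonded in a chain (0-1-2).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
atom_feats = rng.normal(size=(3, 8))    # per-atom features (local information)
descriptors = rng.normal(size=(5,))     # molecule-level descriptors (global information)

# Random weights stand in for trained parameters; sizes are assumptions.
w_gcn1 = rng.normal(size=(8, 16))
w_gcn2 = rng.normal(size=(16, 16))
w_mlp = rng.normal(size=(5, 16))
w_head = rng.normal(size=(32, 1))

# Branch 1: the GCN only ever sees per-atom features.
a_hat = normalized_adjacency(adj)
h = gcn_layer(a_hat, gcn_layer(a_hat, atom_feats, w_gcn1), w_gcn2)
graph_emb = h.mean(axis=0)              # readout: pool atoms into one vector

# Branch 2: a parallel MLP only ever sees the global descriptors.
desc_emb = relu(descriptors @ w_mlp)

# Late fusion: the branches meet only after each has built its own embedding.
fused = np.concatenate([graph_emb, desc_emb])
prediction = fused @ w_head             # regression head output, shape (1,)
print(fused.shape, prediction.shape)
```

Early fusion would instead tile `descriptors` onto every row of `atom_feats` before the first `gcn_layer` call, so the same global values would be averaged over neighbors as if they were atom properties.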
Ridge as Meta-Learner over Tree Ensemble
A gradient boosting meta-learner at Level-1 would overfit the OOF meta-feature matrix: at N=2662 training rows and only 3 meta-features (one per base learner), trees have too few samples per leaf to generalize. RidgeCV learns the optimal linear combination of base learner predictions with cross-validated alpha regularization. The linear model assumption is appropriate here because the meta-features (OOF predictions) are already in prediction space, and the relationship between their linear combination and the true target is approximately linear.
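A hedged sketch of this two-level setup using scikit-learn. The three base learners, the synthetic data, the 5-fold split, and the alpha grid are stand-ins chosen for illustration; the text above does not specify which base models or settings the project used.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for descriptors and melting points.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 300 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(scale=5, size=300)

# Level 0: three base learners (hypothetical choices, not the project's).
base_learners = [
    RandomForestRegressor(n_estimators=50, random_state=0),
    GradientBoostingRegressor(random_state=0),
    KNeighborsRegressor(n_neighbors=5),
]

# Out-of-fold predictions: each row is predicted by a model that never saw it.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.column_stack(
    [cross_val_predict(model, X, y, cv=cv) for model in base_learners]
)

# Level 1: RidgeCV learns a regularized linear blend of the three columns.
meta = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(oof, y)
print("meta-feature matrix:", oof.shape)
print("blend weights:", np.round(meta.coef_, 3), "alpha:", meta.alpha_)
```

With only three input columns, the meta-learner has three weights and an intercept to fit, which is why a regularized linear model is a natural match for the meta-feature matrix described above.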
@st.cache_resource + .eval() Mode
PyTorch state dictionaries are deserialized from disk into RAM on every model load. Without @st.cache_resource, every Streamlit script re-run (triggered by any UI interaction) triggers a fresh deserialization, adding 2-5 seconds of latency and risking OOM under concurrent sessions. @st.cache_resource persists the loaded PyTorch module across all sessions and reruns in the same server process. .eval() mode switches BatchNorm from per-batch statistics to its stored running statistics and disables dropout sampling, reducing per-inference compute and eliminating non-determinism in predictions.
2,662 molecules. Two architectures. One SMILES input.
Paste any SMILES string and get a melting point prediction in under 50 ms. Both the Stacking Ensemble and Hybrid GNN paths are available in the dashboard.