Three hard constraints that break standard ML
Detecting a star being torn apart by a supermassive black hole from simulated telescope data requires solving simultaneous data engineering, statistical, and architectural challenges that standard pipelines cannot handle out of the box.
Extreme Class Imbalance (1:20)
- 148 TDEs vs 2,895 Non-TDEs in the training set
- Accuracy is a misleading metric: predicting Non-TDE for every object already scores about 95%
- Standard 0.5 threshold catastrophically biases toward the majority class
- Macro F1 is the correct metric: it averages the per-class F1 scores with equal weight, so the rare TDE class counts as much as the Non-TDE majority
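To make the gap between the two metrics concrete, here is a minimal sketch that scores the trivial "everything is Non-TDE" baseline using the class counts listed above. The helper name `f1` is illustrative and not part of the project code.

```python
# Class counts taken from the training set described above.
n_tde, n_non = 148, 2895
total = n_tde + n_non

def f1(tp, fp, fn):
    """F1 for one class; returns 0.0 when tp, fp and fn are all zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Baseline that labels every object Non-TDE.
accuracy = n_non / total
f1_non = f1(tp=n_non, fp=n_tde, fn=0)  # every TDE becomes a false "Non-TDE"
f1_tde = f1(tp=0, fp=0, fn=n_tde)      # no TDE is ever recovered
macro_f1 = (f1_non + f1_tde) / 2

print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}")
# accuracy=0.951  macro_f1=0.488
```

The baseline looks excellent on accuracy, yet its macro F1 sits below 0.5 because the TDE class contributes an F1 of zero.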
Irregular Multi-Band Sparsity
- 6 optical filter bands: u, g, r, i, z, y (each observed independently)
- Observations at irregular Modified Julian Dates (MJD) per filter
- A reading in g-band means the other five bands (u, r, i, z, y) have missing data for that timestamp
- RNNs overfit to gap patterns rather than brightness signal
The Interpolation Trap
- Naive fix: interpolate missing filter values onto a common time grid
- Linear interpolation creates physically non-existent data points
- The model then overfits to these hallucinated brightness values
- Imputation testing empirically degraded F1 vs no imputation at all
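The trap can be illustrated with a few lines of plain Python. The readings below are made up for illustration, and `interp` is a hypothetical helper rather than the project's code. The sketch shows how a daily grid over a sparse band ends up consisting mostly of values no telescope ever measured.

```python
def interp(t, times, values):
    """Linearly interpolate (times, values) at time t; times must be ascending."""
    pairs = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(pairs, pairs[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the observed range")

# Hypothetical r-band readings with a 30-day gap: (MJD, flux).
r_mjd = [60000.0, 60003.0, 60033.0]
r_flux = [10.0, 12.0, 4.0]

# Common daily grid, as the naive fix would build it.
grid = [60000.0 + day for day in range(34)]
filled = [interp(t, r_mjd, r_flux) for t in grid]

n_real = sum(t in r_mjd for t in grid)
n_synthetic = len(grid) - n_real
print(n_real, n_synthetic)  # 3 31
```

In this toy case 31 of the 34 grid points are straight-line guesses. Any real brightening or fading inside the 30-day gap is replaced by a smooth ramp that the model can then learn as if it were signal.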
Five-stage champion pipeline
The winning architecture: from raw irregular telescope observations to calibrated TDE probability scores. Each stage was specifically engineered around the constraints of astronomical sparsity.
Why this pipeline works on sparse data: tsfresh never requires observations at regular intervals. It independently analyzes whatever flux measurements exist per filter band and compresses the temporal distribution into statistics. A filter with 5 observations and a filter with 50 observations are both valid inputs. The tree splits in LightGBM then handle missing tsfresh features (from filters with very few or zero observations) natively without imputation.
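The idea can be sketched without tsfresh itself. The function below is a simplified stand-in (a handful of statistics instead of tsfresh's full feature set), and the names `band_features` and the `band__stat` key format are assumptions made for illustration. It shows the key property: each band is summarized from whatever observations exist, and empty or near-empty bands yield NaN instead of imputed values.

```python
import math
import statistics

BANDS = ["u", "g", "r", "i", "z", "y"]

def band_features(observations):
    """observations: list of (mjd, band, flux) tuples for one object.
    Returns a flat dict of per-band statistics, NaN where a band lacks data."""
    features = {}
    for band in BANDS:
        flux = [f for _, b, f in observations if b == band]
        features[f"{band}__count"] = len(flux)
        features[f"{band}__mean"] = statistics.fmean(flux) if flux else math.nan
        features[f"{band}__max"] = max(flux) if flux else math.nan
        # Sample standard deviation needs at least two points.
        features[f"{band}__std"] = statistics.stdev(flux) if len(flux) >= 2 else math.nan
    return features

# Hypothetical object: three g-band readings, one r-band reading, nothing else.
obs = [
    (60000.1, "g", 10.0), (60003.4, "g", 14.0), (60009.0, "g", 12.0),
    (60001.2, "r", 8.0),
]
feats = band_features(obs)
print(feats["g__mean"], feats["g__std"], feats["r__std"], feats["u__mean"])
# 12.0 2.0 nan nan
```

The output is a fixed-width feature vector regardless of cadence, and the NaN entries are exactly the kind of missing values a gradient-boosted tree model can consume directly.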
Three architectures, one brutal lesson
The project started with a complex bidirectional GRU and ended with automated feature abstraction. Step through the architectural evolution and the engineering reasoning behind each transition.
Three components of the champion system
Peak performance comes from three interlocking components: automated feature abstraction, statistically principled signal selection, and carefully calibrated decision boundary engineering.
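The decision boundary component can be sketched as a simple threshold sweep on validation predictions. Everything below is illustrative: the labels and probabilities are invented, the helper names are hypothetical, and the sketch optimizes positive-class F1 for brevity where the project reports macro F1.

```python
def f1_at(threshold, y_true, y_prob):
    """F1 of the positive (TDE) class when predicting 1 for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at(t, y_true, y_prob))

# Invented validation set: 3 TDEs among 10 objects, with low predicted scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.45, 0.30, 0.12, 0.20, 0.10, 0.08, 0.05, 0.05, 0.03, 0.02]

candidates = [i / 100 for i in range(5, 100, 5)]
t_star = best_threshold(y_true, y_prob, candidates)
print(f1_at(0.5, y_true, y_prob), t_star, f1_at(t_star, y_true, y_prob))
# 0.0 0.25 0.8
```

On imbalanced data the model's scores for true positives often sit well below 0.5, so the default cutoff recovers nothing while a tuned cutoff does. The threshold should be chosen on held-out data, not on the test set.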
Real-world significance: the Vera C. Rubin Observatory is expected to generate roughly 20 terabytes of image data per night. Without automated filtering, scientists cannot physically review all observations for rare transients before they fade. A 0.53 F1 system on a 1:20 imbalanced dataset successfully filters more than 95% of background cosmic noise, delivering a condensed high-probability target list to telescope follow-up systems in milliseconds.
Five deliberate choices, each empirically validated
Every decision in the champion pipeline was validated against an alternative. None of these choices are defaults: each was made by testing the alternative and measuring the F1 impact.
Strategic Pivot: GRU to LightGBM
The BiGRU architecture was mathematically elegant and required significant engineering effort (custom attention, multi-channel parallel experts). Abandoning it required accepting that the sunk cost of development time was irrelevant compared to the data reality. Irregular temporal gaps (observations every 3 days, then every 30 days) corrupt RNN hidden states with gap noise rather than signal. LightGBM handles missing values natively: at each split it learns which branch missing values should follow, so "feature X is missing" effectively becomes a valid decision rule, which is exactly what sparse astronomical data needs.
tsfresh over Manual Feature Engineering
Manual feature engineering (mean, std, skew per filter) reached 0.4281 F1 and then stalled: a human can only enumerate so many statistical moments before running out of ideas. tsfresh systematically extracts 60+ mathematical feature types including continuous wavelet transforms (which capture multi-scale variability simultaneously), Fourier coefficients (which capture periodicity and phase), and nonlinear aggregations that no human would manually compute. The automated approach found the specific feature combinations that discriminate TDE brightening curves from AGN or supernova variability.
merge_asof over Linear Interpolation
An intuitive approach to computing cross-filter color differences (g-band minus r-band flux) would be to interpolate each filter onto a common time grid. Empirical testing proved this degrades F1 performance: linear interpolation between real observations creates synthetic data points that never actually existed, and the LightGBM model overfits to these hallucinated values. pd.merge_asof with tolerance=1.0 only joins observations taken within 1 day of each other on the MJD axis, producing authentic differential flux measurements from real telescope readings only.
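A minimal sketch of the join, assuming one DataFrame per band with columns named `mjd`, `flux_g` and `flux_r` (these names, the sample values, and `direction="nearest"` are assumptions; the source only specifies `tolerance=1.0`):

```python
import pandas as pd

# Hypothetical g-band and r-band readings for one object.
g = pd.DataFrame({"mjd": [60000.2, 60005.1, 60040.0], "flux_g": [10.0, 14.0, 9.0]})
r = pd.DataFrame({"mjd": [60000.6, 60012.0, 60040.9], "flux_r": [8.0, 11.0, 7.5]})

# merge_asof requires both frames to be sorted by the join key.
paired = pd.merge_asof(
    g.sort_values("mjd"),
    r.sort_values("mjd"),
    on="mjd",
    direction="nearest",  # assumption: take the closest r reading in time
    tolerance=1.0,        # only pair readings within 1 day of each other
)

# Color is computed only where a real r-band partner exists.
paired["g_minus_r"] = paired["flux_g"] - paired["flux_r"]
colors = paired.dropna(subset=["g_minus_r"])
print(colors[["mjd", "g_minus_r"]])
```

The middle g-band reading has no r-band partner within a day, so it gets NaN instead of an interpolated value and drops out of the color computation. Only the two physically concurrent pairs survive.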
scale_pos_weight over SMOTE
SMOTE generates synthetic minority class samples by interpolating between existing TDE lightcurve feature vectors. On this dataset, SMOTE introduces fabricated feature combinations that never correspond to real TDE physics, and the interpolated samples may fall in feature regions that are indistinguishable from Non-TDE backgrounds. scale_pos_weight directly modifies the LightGBM loss function: the loss and gradient contribution of each TDE training sample is weighted about 20x more heavily than that of a Non-TDE sample, without creating any synthetic data. This keeps the training set physically grounded in real observations.
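Conceptually, the weighting works as in the sketch below. It is a simplified illustration of a weighted binary log loss, not LightGBM's internal code, and it assumes the weight is set to the negative-to-positive ratio of the training counts quoted earlier; the function names are hypothetical.

```python
import math

# Class counts from the training set; the ratio is the usual starting value.
n_tde, n_non = 148, 2895
scale_pos_weight = n_non / n_tde  # about 19.56

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_logloss_grad(raw_score, label, pos_weight):
    """Gradient of the weighted binary log loss with respect to the raw score.
    Positive (TDE) samples are scaled by pos_weight, negatives by 1."""
    weight = pos_weight if label == 1 else 1.0
    return weight * (sigmoid(raw_score) - label)

# Same raw score (probability 0.5) for one TDE and one Non-TDE sample.
grad_tde = weighted_logloss_grad(0.0, 1, scale_pos_weight)
grad_non = weighted_logloss_grad(0.0, 0, scale_pos_weight)
print(round(abs(grad_tde) / abs(grad_non), 2))  # 19.56
```

With this ratio the 148 TDEs carry the same total weight as the 2,895 Non-TDEs, so neither class dominates the gradient, and no synthetic rows are added to the training set.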
FDR Control over Variance-Based Reduction
PCA and variance thresholding reduce dimensionality based on how much features vary across samples, without asking whether that variation correlates with the target label. A feature could have high variance but be completely uninformative (noise), or low variance but be a critical TDE discriminator. The Benjamini-Yekutieli FDR procedure tests each feature for statistical association with the TDE target and controls the expected proportion of false discoveries among retained features at the FDR level q = 0.05. This is statistically principled: retained features have measurable evidence of predictive value, not just high variance.
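The selection step can be sketched as the standard Benjamini-Yekutieli step-up procedure applied to per-feature p-values. The feature names and p-values below are invented for illustration, and how the p-values are obtained (the per-feature significance test) is outside this sketch.

```python
def benjamini_yekutieli(p_values, q=0.05):
    """Return a list of booleans, True where the feature is kept
    (null hypothesis rejected) under the Benjamini-Yekutieli procedure."""
    m = len(p_values)
    # Harmonic-sum correction that makes the bound valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * c_m):
            k_max = rank
    keep = [False] * m
    for idx in order[:k_max]:
        keep[idx] = True
    return keep

# Hypothetical features with made-up p-values against the TDE label.
names = ["g__max", "r__std", "u__mean", "z__count"]
p_vals = [0.0001, 0.004, 0.03, 0.6]

keep = benjamini_yekutieli(p_vals, q=0.05)
selected = [n for n, k in zip(names, keep) if k]
print(selected)  # ['g__max', 'r__std']
```

Note that `u__mean` has p = 0.03, below a naive 0.05 cutoff, yet it is dropped: with four tests its rank-3 threshold is 3 × 0.05 / (4 × 2.083) ≈ 0.018. That stricter bar is what keeps the share of spurious features among the retained set under control.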
479K observations. 6 filter bands. 0.53 F1.
Explore the interactive inference lab with lightcurve visualizations, threshold simulation, and real-time TDE classification.