Rare Event Detection + Time Series ML

MALLORN Astronomical Classification

Detecting rare Tidal Disruption Events (stars torn apart by supermassive black holes) from 479K irregular multi-band LSST lightcurve observations. A strategic pivot from failing Bi-GRU deep learning to automated tsfresh feature abstraction raised F1 by 197%.

0.5353
Macro F1 Score
4.86% positive class rate, 1:20 imbalance
+197%
F1 improvement over BiGRU
0.18 (deep learning) to 0.53 (tsfresh)
BiGRU + Attention: 0.18 F1 (overfit to temporal gaps)
tsfresh + LightGBM: 0.5353 F1 (+197%)
0.5353
Macro F1 Score
stratified 5-fold CV, best fold ~0.59
4.86%
Positive Class Rate
1:20 imbalance, 148 TDEs vs 2,895 Non-TDEs
198
Features Selected
from 1,000+ via Benjamini-Yekutieli FDR control
+197%
F1 vs Deep Learning
Bi-GRU baseline: 0.18 F1 (architecture failure)
The Problem

Three hard constraints that break standard ML

Detecting a star being torn apart by a supermassive black hole from simulated telescope data requires solving simultaneous data engineering, statistical, and architectural challenges that standard pipelines cannot handle out of the box.

Extreme Class Imbalance (1:20)

  • 148 TDEs vs 2,895 Non-TDEs in the training set
  • Accuracy is a useless metric: 95% accuracy by predicting all Non-TDE
  • Standard 0.5 threshold catastrophically biases toward the majority class
  • Macro F1 is the better metric: it averages the per-class F1 scores with equal weight, so the rare TDE class counts as much as the Non-TDE majority
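
To make the accuracy trap concrete, here is a minimal sketch that scores a classifier labelling every object Non-TDE, using only the class counts quoted above. The helper name `f1` is illustrative and not part of the project code.

```python
# Score an "always Non-TDE" classifier on the quoted class counts.
n_tde, n_non = 148, 2895
total = n_tde + n_non

# Every object is predicted Non-TDE: all Non-TDEs are right, all TDEs are missed.
accuracy = n_non / total

def f1(tp, fp, fn):
    """F1 from confusion counts; defined as 0 when there are no true positives."""
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

f1_tde = f1(tp=0, fp=0, fn=n_tde)        # the rare class is never found
f1_non = f1(tp=n_non, fp=n_tde, fn=0)    # every TDE is a false Non-TDE
macro_f1 = (f1_tde + f1_non) / 2         # unweighted mean over the two classes

print(f"accuracy={accuracy:.3f}  TDE F1={f1_tde:.3f}  macro F1={macro_f1:.3f}")
```

Accuracy lands near 0.95 even though the TDE-class F1 is exactly zero, which is why accuracy alone cannot rank models on this dataset.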

Irregular Multi-Band Sparsity

  • 6 optical filter bands: u, g, r, i, z, y (each observed independently)
  • Observations at irregular Modified Julian Dates (MJD) per filter
  • A reading in g-band means the other five bands (u, r, i, z, y) have no measurement at that timestamp
  • RNNs overfit to gap patterns rather than brightness signal

The Interpolation Trap

  • Naive fix: interpolate missing filter values onto a common time grid
  • Linear interpolation creates physically non-existent data points
  • The model then overfits to these hallucinated brightness values
  • Imputation testing empirically degraded F1 vs no imputation at all
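
A small sketch of why gridding is risky, assuming a hypothetical single-band lightcurve with three real observations and a daily target grid. The times, flux values, and the helper `linear_interp` are invented for illustration.

```python
# Sparse single-band lightcurve: three real observations with a 30-day gap.
obs_mjd = [0.0, 3.0, 33.0]      # hypothetical observation times (days)
obs_flux = [10.0, 40.0, 20.0]   # hypothetical flux values

def linear_interp(t, times, values):
    """Linearly interpolate at time t between the two surrounding observations."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            frac = (t - t0) / (t1 - t0)
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError("t lies outside the observed time range")

grid = [float(day) for day in range(0, 34)]                   # daily grid, 34 points
gridded = [linear_interp(t, obs_mjd, obs_flux) for t in grid]
synthetic = [t for t in grid if t not in obs_mjd]             # no real measurement here

print(f"{len(synthetic)} of {len(grid)} grid values are synthetic")
print(f"flux at day 18 (never observed): {linear_interp(18.0, obs_mjd, obs_flux)}")
```

In this toy case the large majority of gridded values come from the straight-line assumption rather than from the telescope, and a model has no way to tell the two apart.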
Architecture

Five-stage champion pipeline

The winning architecture: from raw irregular telescope observations to calibrated TDE probability scores. Each stage was specifically engineered around the constraints of astronomical sparsity.

S1. Data Ingestion + Preprocessing [Pandas]
3,043 objects unified; 479,384 lightcurve observations across 6 optical filter bands
S2. tsfresh Temporal Feature Abstraction [tsfresh]
1,000+ complex time-series characteristics per optical filter (Fourier, wavelets, kurtosis, energy ratios)
S3. Robust Color Engineering (merge_asof) [Pandas]
High-quality g-r, u-g, r-i, i-z cross-filter color indices without interpolation noise
S4. FDR Dimensionality Reduction (198 features) [tsfresh / scipy]
198 statistically significant features retained (from 1,000+) via Benjamini-Yekutieli FDR control
S5. Cost-Sensitive LightGBM + Optuna + Threshold Calibration [LightGBM / Optuna]
0.5353 Macro F1 (5-fold stratified CV); optimal decision boundary at P > 0.35

Why this pipeline works on sparse data: tsfresh never requires observations at regular intervals. It independently analyzes whatever flux measurements exist per filter band and compresses the temporal distribution into statistics. A filter with 5 observations and a filter with 50 observations are both valid inputs. The tree splits in LightGBM then handle missing tsfresh features (from filters with very few or zero observations) natively without imputation.
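
The idea can be sketched with plain pandas aggregations standing in for the much larger tsfresh feature set. The column names (`object_id`, `band`, `flux`) and the toy values are assumptions for illustration; the project's actual schema is not shown in this write-up.

```python
import pandas as pd

# Hypothetical long-format lightcurves: one row per observation.
obs = pd.DataFrame({
    "object_id": [1, 1, 1, 1, 2, 2, 2],
    "band":      ["g", "g", "r", "r", "g", "g", "g"],   # object 2 has no r-band data
    "flux":      [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 9.0],
})

# Summarize each (object, band) series independently; no common time grid is needed.
stats = obs.groupby(["object_id", "band"])["flux"].agg(["mean", "std", "count"])

# Pivot bands into columns: one row per object, one column per (statistic, band).
wide = stats.unstack("band")
wide.columns = [f"{band}_{stat}" for stat, band in wide.columns]

print(wide)
```

Object 2 ends up with NaN in its r-band columns instead of an imputed value, and a gradient-boosted tree model can consume that table directly.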

The Strategic Pivot

Three architectures, one brutal lesson

The project started with a complex bidirectional GRU and ended with automated feature abstraction. This section traces the architectural evolution and the engineering reasoning behind each transition.

The Challenger: Deep Sequence Modeling
0.18 F1
Macro F1 Score (stratified 5-fold CV)
# Bi-Directional GRU with Custom Attention
import torch.nn as nn
import torch.nn.functional as F


class BiGRU_Attention(nn.Module):
    def __init__(self, input_size=2, hidden_size=64, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(
            input_size, hidden_size,
            num_layers=num_layers,
            bidirectional=True, batch_first=True,
        )
        # A bidirectional GRU concatenates both directions: hidden_size * 2
        self.attention  = nn.Linear(hidden_size * 2, 1)
        self.classifier = nn.Linear(hidden_size * 2, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, _  = self.gru(x)                            # (batch, seq_len, hidden_size * 2)
        attn    = F.softmax(self.attention(out), dim=1)  # attention weights over time steps
        context = (attn * out).sum(dim=1)                # attention-weighted pooling
        return self.classifier(context)                  # one raw logit per lightcurve

# Also tried: Multi-Channel BiGRU (1 expert per filter)
# Both variants: max CV F1 ~ 0.18
# Reason: hidden states corrupted by irregular temporal gaps

The initial hypothesis: a bidirectional GRU with custom attention can map irregular lightcurves onto a latent space encoding the TDE signal. Two architectures were built: single-channel BiGRU concatenating all observations, and multi-channel BiGRU with one expert GRU per optical filter. Both failed. RNN hidden states update sequentially, so irregular temporal gaps (observations every 3 days, then every 30 days) directly corrupt the hidden state with noise from the gap itself rather than astronomical signal. The model overfits to the pattern of missingness, not the physical brightness evolution.

F1 Score Comparison (higher is better)
Bi-GRU + Attention: 0.18 F1
LightGBM + Basic Stats: 0.43 F1
tsfresh + LightGBM: 0.54 F1
Random baseline (majority-class): ~0.09 F1
Best single fold (champion): ~0.59 F1
BEFORE: Hypothesis: RNNs can infer signal across irregular temporal gaps
AFTER: Max CV F1: 0.18. Both BiGRU architectures overfit to noise, not TDE signal.
Strategic pivot:
0.18 F1: BiGRU (fails)
0.43 F1: Basic LightGBM
0.50 F1: + interpolation (rejected)
0.53 F1: tsfresh Champion
ML Engine

Three components of the champion system

Peak performance comes from three interlocking components: automated feature abstraction, statistically principled signal selection, and carefully calibrated decision boundary engineering.

Peak Performance
Champion Pipeline
0.5353 Macro F1 (5-fold stratified CV)
tsfresh EfficientFCParameters: 1,000+ features per filter, 6 filters
FDR dimensionality reduction to 198 statistically significant vectors
Cost-sensitive LightGBM: scale_pos_weight per-fold dynamically set
50-trial Optuna TPE search maximizing validation F1 directly
FDR Signal Distillation
Feature Intelligence
198 validated features from 1,000+
Benjamini-Yekutieli FDR control at level q = 0.05
Retains features whose association with the TDE label survives multiple-testing correction
Fourier coefficients, wavelet transforms, energy ratios, kurtosis...
Principled selection vs variance-based PCA (preserves label-relevant signal)
Boundary Calibration
Threshold Engineering
P > 0.35 optimal decision threshold
OOF probability sweep: P=0.01 to P=0.99 in 0.01 steps
Boundary that maximizes macro F1 across all 5 folds
Lower than 0.5: intentional bias toward TDE recall vs precision
Protects against the false negative cost of missing rare transients
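
A minimal sketch of the threshold sweep described above, written in plain Python so that it stays self-contained. The function names and the toy out-of-fold probabilities are assumptions for illustration, not project data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        scores.append(0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)

def sweep_threshold(y_true, oof_prob):
    """Return (threshold, macro F1) for the best cut between 0.01 and 0.99."""
    best_t, best_f1 = 0.5, -1.0
    for step in range(1, 100):
        t = step / 100
        preds = [1 if p > t else 0 for p in oof_prob]
        score = macro_f1(y_true, preds)
        if score > best_f1:          # keep the lowest threshold among ties
            best_t, best_f1 = t, score
    return best_t, best_f1

# Toy out-of-fold predictions (hypothetical values, not project data).
y_true = [0, 0, 0, 0, 1, 1]
oof_prob = [0.10, 0.20, 0.25, 0.20, 0.40, 0.80]

best_t, best_f1 = sweep_threshold(y_true, oof_prob)
f1_at_half = macro_f1(y_true, [1 if p > 0.5 else 0 for p in oof_prob])
print(best_t, best_f1, f1_at_half)
```

In the toy data the default 0.5 cut misses the positive scored at 0.40, while a lower cut recovers it. The sweep must run on out-of-fold predictions only, otherwise the chosen threshold is tuned on data the model has already seen.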

Real-world significance: the Vera C. Rubin Observatory is expected to generate roughly 20 terabytes of data per night. Without automated filtering, scientists cannot review all observations for rare transients before they fade. A 0.53 F1 system on a 1:20 imbalanced dataset successfully filters more than 95% of background cosmic noise, delivering a condensed high-probability target list to telescope follow-up systems in milliseconds.

Engineering Decisions

Five deliberate choices, each empirically validated

Every decision in the champion pipeline was validated against an alternative. None of these choices are defaults: each was made by testing the alternative and measuring the F1 impact.

Strategic Pivot: GRU to LightGBM

Abandon deep learning when empirical results prove architecture mismatch

The BiGRU architecture was mathematically elegant and required significant engineering effort (custom attention, multi-channel parallel experts). Abandoning it required accepting that the sunk cost of development time was irrelevant compared to the data reality. Irregular temporal gaps (observations every 3 days, then every 30 days) corrupt RNN hidden states with gap noise rather than signal. LightGBM tree splits handle missing values natively: a tree can branch on "feature X is missing" as a valid decision rule, which is exactly what sparse astronomical data needs.

tsfresh over Manual Feature Engineering

Automated temporal abstraction vs hand-crafted statistical aggregations

Manual feature engineering (mean, std, skew per filter) reached 0.4281 F1 and then stalled: a human can only enumerate so many statistical moments before running out of ideas. tsfresh systematically extracts 60+ mathematical feature types, including continuous wavelet transforms (which capture variability at several time scales at once), Fourier coefficients (which capture periodicity and phase), and nonlinear aggregations that few people would compute by hand. The automated approach found the specific feature combinations that discriminate TDE brightening curves from AGN or supernova variability.

merge_asof over Linear Interpolation

Strict ±1 MJD tolerance join vs synthetic gap filling

An intuitive approach to computing cross-filter color differences (g-band minus r-band flux) would be to interpolate each filter onto a common time grid. Empirical testing showed that this degrades F1: linear interpolation between real observations creates synthetic data points that never existed, and the LightGBM model overfits to these hallucinated values. pd.merge_asof with tolerance=1.0 (in MJD days) only joins observations taken within 1 day of each other, producing differential flux measurements from real telescope readings only.
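
A minimal sketch of the tolerance join, assuming hypothetical per-band frames with columns `mjd`, `flux_g`, and `flux_r`. The `tolerance=1.0` setting comes from the text above; `direction="nearest"` is an assumption, since the write-up does not state which direction was used.

```python
import pandas as pd

# Hypothetical single-object observations; merge_asof requires sorted join keys.
g = pd.DataFrame({"mjd": [100.0, 103.0, 140.0], "flux_g": [10.0, 12.0, 8.0]})
r = pd.DataFrame({"mjd": [100.4, 120.0, 140.9], "flux_r": [7.0, 9.0, 5.0]})

# Pair each g-band reading with the nearest r-band reading, but only within 1 day.
pairs = pd.merge_asof(
    g.sort_values("mjd"),
    r.sort_values("mjd"),
    on="mjd",
    direction="nearest",   # assumption: not specified in the write-up
    tolerance=1.0,
)

# Unmatched g-band rows get NaN for flux_r; drop them instead of filling the gap.
pairs = pairs.dropna(subset=["flux_r"])
pairs["color_g_r"] = pairs["flux_g"] - pairs["flux_r"]

print(pairs)
```

The g-band reading at MJD 103.0 has no r-band partner within one day, so it contributes no color value at all rather than an interpolated one.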

scale_pos_weight over SMOTE

Cost-sensitive loss penalization vs synthetic minority class sampling

SMOTE generates synthetic minority class samples by interpolating between existing TDE lightcurve feature vectors. On this dataset, SMOTE introduces fabricated feature combinations that need not correspond to real TDE physics, and the interpolated samples may fall in feature regions that are indistinguishable from Non-TDE backgrounds. scale_pos_weight instead modifies the LightGBM loss function directly: each TDE training sample is weighted roughly 20x more heavily than a Non-TDE sample at the gradient computation level (the Non-TDE to TDE ratio of each training fold), without creating any synthetic data. This keeps the training set physically grounded in real observations.
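
The per-fold weight can be sketched as the ratio of negatives to positives in each training split. The round-robin fold assignment below is a deterministic stand-in for a shuffled stratified splitter, and both helper names are hypothetical.

```python
def stratified_fold_ids(labels, n_folds=5):
    """Assign fold ids round-robin within each class (no shuffling, illustration only)."""
    seen = {}
    fold_ids = []
    for y in labels:
        k = seen.get(y, 0)
        fold_ids.append(k % n_folds)
        seen[y] = k + 1
    return fold_ids

def pos_weight(train_labels):
    """Ratio of negative to positive samples in one training split."""
    n_pos = sum(train_labels)
    n_neg = len(train_labels) - n_pos
    return n_neg / n_pos

# Class counts quoted in the write-up: 148 TDEs (1) and 2,895 Non-TDEs (0).
labels = [1] * 148 + [0] * 2895
fold_ids = stratified_fold_ids(labels)

weights = []
for fold in range(5):
    train = [y for y, f in zip(labels, fold_ids) if f != fold]
    weights.append(pos_weight(train))   # value to pass as scale_pos_weight for this fold

print([round(w, 2) for w in weights])
```

Each value stays close to the overall 2,895 / 148 ratio but is computed from the training portion only, so the held-out fold never influences the weight.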

FDR Control over Variance-Based Reduction

Benjamini-Yekutieli hypothesis testing vs PCA / variance thresholding

PCA and variance thresholding rank features or directions by how much they vary across samples, without asking whether that variation relates to the target label. A feature could have high variance but be completely uninformative (noise), or low variance but be a critical TDE discriminator. The Benjamini-Yekutieli FDR procedure tests each feature for statistical association with the TDE target and controls the expected proportion of false discoveries among retained features at level q = 0.05, without assuming the tests are independent. This is statistically principled: retained features have measurable evidence of association with the label, not just high variance.
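
The step-up rule itself is short enough to sketch in plain Python. This is an illustration of the Benjamini-Yekutieli procedure under the stated q = 0.05, not the project's selection code; the function name and the p-values are hypothetical.

```python
def benjamini_yekutieli(p_values, q=0.05):
    """Return a keep/drop flag per p-value under Benjamini-Yekutieli FDR control."""
    m = len(p_values)
    harmonic = sum(1.0 / i for i in range(1, m + 1))      # c(m), allows dependent tests
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p-value

    # Find the largest rank k whose p-value satisfies p_(k) <= k * q / (m * c(m)).
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / (m * harmonic):
            cutoff_rank = rank

    # Keep every feature ranked at or below the cutoff.
    keep = [False] * m
    for idx in order[:cutoff_rank]:
        keep[idx] = True
    return keep

# Hypothetical per-feature p-values from univariate tests against the TDE label.
p_values = [0.6, 0.001, 0.041, 0.008, 0.039]
selected = benjamini_yekutieli(p_values, q=0.05)
print(selected)
```

Note that p-values of 0.039 and 0.041 would pass an uncorrected 0.05 cut but are dropped here, which is the behavior that keeps spurious features out when more than a thousand candidates are tested at once.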

Live on Streamlit

479K observations. 6 filter bands. 0.53 F1.

Explore the interactive inference lab with lightcurve visualizations, threshold simulation, and real-time TDE classification.