Exoplanet Anomaly Tagging Using Clustered Folded TESS Light Curves

ABSTRACT

This paper presents a novel, fully unsupervised machine learning pipeline for identifying and analyzing anomalies in the classification of exoplanet candidates, a domain traditionally dominated by supervised learning methods. Using data from the Transiting Exoplanet Survey Satellite (TESS), this work departs from existing models that rely heavily on labeled datasets, offering instead an interpretable framework for exploratory analysis of previously misclassified or unidentified exoplanetary signals. Our methodology clusters phase-folded light curves with the HDBSCAN algorithm, using UMAP for dimensionality reduction and visual anomaly detection. HDBSCAN enables the discovery of dense clusters based on astrophysical parameters such as signal-to-noise ratio (SNR), transit depth, transit duration, and orbital period, allowing for both well-formed groupings and outlier identification without supervision. UMAP further aids cluster interpretation by providing a low-dimensional embedding that captures the underlying structure. To validate the significance of the discovered anomalies, we implemented a lightweight scoring mechanism that flags cluster-level outliers by cross-referencing known astrophysical features. This separates likely exoplanet candidates from confounding artifacts such as stellar flares, instrumental noise, and other periodic signals. Our method demonstrates the ability to isolate high-confidence candidates, including potential hot Jupiters and sub-Neptunes, while significantly reducing the candidate search space for further vetting. To our knowledge, this is the first work to apply a fully unsupervised clustering and anomaly-detection framework to the exoplanet-detection pipeline using TESS data. Our approach lays the foundation for scalable, label-free analysis in future exoplanet missions.

INTRODUCTION.

The search for planets beyond our solar system has been transformed by space-based photometric surveys, most notably NASA’s Transiting Exoplanet Survey Satellite (TESS) [1]. These missions continuously monitor stellar brightness over time, producing light curves: sequences of flux measurements recorded at regular or irregular intervals. When a planet passes in front of its host star along the observer’s line of sight, it blocks a small fraction of the starlight, producing a characteristic dip in the light curve. TESS collects large numbers of such light curves across different observing sectors of the sky. Each target is assigned a TESS Input Catalog identifier (TIC ID), a unique integer used to track stars and their corresponding light curves in the catalog. Figure 1 presents the phase-folded light curve of TIC 389096698, which shows a clean, symmetric flux dip near phase 0 consistent with a transiting planet. Phase folding superimposes multiple transit events onto a single orbital cycle to emphasize the periodic signal.
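As an illustration of phase folding, the following minimal sketch maps each observation time to an orbital phase in [-0.5, 0.5) so that repeated transits stack at phase 0. The function name `phase_fold`, the `epoch` parameter, and the synthetic data are assumptions made for this example; they are not taken from the pipeline described in this paper.

```python
def phase_fold(times, fluxes, period, epoch=0.0):
    """Fold a light curve on `period`, centring the transit at phase 0.

    Returns (phases, fluxes) sorted by phase, with phases in [-0.5, 0.5).
    `epoch` is an assumed reference time of mid-transit.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    folded = []
    for t, f in zip(times, fluxes):
        # Fractional number of orbits since the epoch, shifted so that the
        # transit midpoint lands at phase 0 rather than at the edge.
        phase = ((t - epoch) / period + 0.5) % 1.0 - 0.5
        folded.append((phase, f))
    folded.sort(key=lambda pair: pair[0])
    phases = [p for p, _ in folded]
    folded_flux = [f for _, f in folded]
    return phases, folded_flux


# Synthetic example (not TESS data): a 2-day period with a 1% dip at each
# transit midpoint, sampled every 6 hours for 10 days.
times = [0.25 * i for i in range(40)]
fluxes = [0.99 if (t % 2.0) == 0.0 else 1.0 for t in times]
phases, folded_flux = phase_fold(times, fluxes, period=2.0)
```

After folding, the five separate dips in the synthetic series all fall at phase 0, which is why a weak periodic signal becomes easier to see in a folded curve.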

Figure 1. Phase-folded light curve of TIC 389096698, a typical transiting exoplanet, from TESS [1] data. The smooth, symmetric flux dip near phase 0 is characteristic of a planet passing cleanly in front of its host star.

Detecting, measuring, and interpreting these transit signatures is the foundation of modern exoplanet discovery. Beyond simple single-planet transits, the shape and timing of these flux dips can reveal physical information such as the orbital period, the planetary size relative to the host star, and, in some cases, clues about the atmospheric or structural composition of the planet.
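The link between dip depth and relative planetary size can be written down with the standard first-order transit relation. It assumes a fully opaque planet crossing a uniformly bright stellar disk and neglects limb darkening; the symbols below are introduced here for illustration.

```latex
\delta \;=\; \frac{\Delta F}{F} \;\approx\; \left(\frac{R_p}{R_\star}\right)^{2}
% \delta    : fractional transit depth
% \Delta F  : drop in observed flux during transit
% F         : out-of-transit flux
% R_p       : planetary radius
% R_\star   : stellar radius
```

For example, a Jupiter-sized planet orbiting a Sun-like star has $R_p / R_\star \approx 0.1$, giving a depth of roughly 1%.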

Anomalous light curves, which deviate from the clean periodic dip shown in Figure 1, are also scientifically informative. Figure 2 provides an example: the phase-folded light curve of TIC 466376085, folded at 1.27 days, shows an asymmetric, highly scattered transit profile instead of a clean, symmetric dip. This irregular signature is consistent with a disintegrating planet, whose trailing debris cloud can scatter and absorb starlight in a complex, time-variable way [9]. Such signals offer rare insights into extreme planetary physics, including planetary destruction, ultrashort-period orbits, multi-body interactions, and stellar activity. However, they are also the signals most likely to be missed or misclassified by conventional detection methods, which are primarily tuned to identify signals resembling Figure 1.

Figure 2. Phase-folded light curve of TIC 466376085, folded at 1.27 days, from TESS [1] data. The asymmetric, highly scattered transit profile is consistent with a disintegrating planet rather than a clean, symmetric transit.

The vast light-curve datasets produced by TESS are archived at the Mikulski Archive for Space Telescopes (MAST) [3]. Analyzing these light curves at scale requires automated methods, including mission processing pipelines [8], supervised neural-network classifiers for exoplanet detection [10], and shape-based ranking methods for transit candidates [11]. These pipelines flag potential Threshold Crossing Events (TCEs) [4] and use tools such as Lomb–Scargle periodograms [5, 6] to detect periodic transits via phase-folding.
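To make the periodogram step concrete, the sketch below evaluates the classical Lomb–Scargle power [5, 6] on an unevenly sampled series and reads off the best period. It is a NumPy illustration under stated assumptions: the function name, the frequency grid, and the synthetic sinusoid are hypothetical, and a production analysis would more likely rely on an established implementation such as `astropy.timeseries.LombScargle`.

```python
import numpy as np


def lomb_scargle_power(times, fluxes, frequencies):
    """Classical Lomb-Scargle power at each trial frequency (cycles per unit time)."""
    t = np.asarray(times, dtype=float)
    y = np.asarray(fluxes, dtype=float)
    y = y - y.mean()                      # the classical form assumes zero-mean data
    power = np.empty(len(frequencies))
    for i, freq in enumerate(frequencies):
        omega = 2.0 * np.pi * freq
        # The time offset tau makes the sine and cosine terms orthogonal, so
        # the result does not depend on where the time axis starts.
        tau = np.arctan2(np.sin(2 * omega * t).sum(),
                         np.cos(2 * omega * t).sum()) / (2 * omega)
        c = np.cos(omega * (t - tau))
        s = np.sin(omega * (t - tau))
        power[i] = 0.5 * ((y @ c) ** 2 / (c @ c) + (y @ s) ** 2 / (s @ s))
    return power


# Synthetic, unevenly sampled signal with a known 2.5-day period (not TESS data).
rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0.0, 30.0, size=200))
fluxes = 1.0 + 0.01 * np.sin(2.0 * np.pi * times / 2.5)

frequencies = np.linspace(0.05, 2.0, 2000)        # trial frequencies in 1/day
power = lomb_scargle_power(times, fluxes, frequencies)
best_period = 1.0 / frequencies[np.argmax(power)]
```

The period at the highest peak is the one a light curve would then be folded on.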

Candidate events are then cataloged in the Exoplanet Follow-up Observing Program (ExoFOP) [2]. However, existing classification approaches can still produce false positives and misclassified candidates, including eclipsing binaries, instrumental noise, weak signals, and unusual light-curve morphologies. To address these limitations, we developed an unsupervised pipeline that groups TESS light curves by similarity and highlights unusual signals for follow-up.

Rather than relying on pre-labeled examples, the pipeline identifies structure directly from the data, making it better suited for rare or previously misclassified systems. We then compare the resulting groups and outliers with known astrophysical parameters to prioritize candidates for further inspection.

MATERIALS AND METHODS.

Data Sources and Preprocessing.

The analysis used the TCE database of periodic signals detected by the Science Processing Operations Center (SPOC) pipeline [3]. Stellar and planetary parameters were retrieved from the ExoFOP database using TIC IDs. After excluding light curves that failed periodogram calculation or phase folding, approximately 1300 TIC IDs from the 2025-05-02 TCE database remained. Lomb–Scargle periodograms were computed, and each light curve was folded over one orbital period. The folded light curves were then reduced using Principal Component Analysis (PCA) [7], retaining 95% of the variance, and Uniform Manifold Approximation and Projection (UMAP) [12], yielding a two-dimensional embedding. The hyperparameter choices for PCA and UMAP are shown in Table S1.

Methodology.

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [13, 14] was applied to the UMAP output of the folded light-curve profiles to identify clusters of similar targets. The HDBSCAN parameter choices are listed in Table S1. Cluster assignments and membership probabilities were extracted from the HDBSCAN output. In parallel, an Isolation Forest algorithm [15] was applied to the original folded light curves to produce an anomaly score for each entry. In some cases, distances from the cluster center were also computed to identify the signals most atypical relative to the rest of their assigned cluster. A sensitivity analysis of the UMAP and HDBSCAN parameters is shown in Table S2.

Each TIC ID was then assigned to a class using thresholds derived from the merged TCE and ExoFOP databases. These classes included hot Jupiters (high transit depth, short duration, low equilibrium temperature); likely flares (very short duration, high stellar temperature); and instrumental noise (shallow or inconsistent outliers with no discernible pattern). Entries that did not match any of these categories were retained as Filtered Candidates. Among these, objects with high combined anomaly scores and signal-to-noise ratios (SNR) were designated Likely Candidates.
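The PCA step, with its 95% retained-variance criterion, can be sketched as follows. This is a NumPy illustration, not the code used in this study: the function name, the synthetic Gaussian-shaped dips, and the noise level are assumptions. scikit-learn's `PCA(n_components=0.95)` applies a comparable selection rule.

```python
import numpy as np


def pca_retain_variance(X, variance_to_keep=0.95):
    """Project rows of X onto the fewest principal axes reaching the target variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # SVD of the centred data: rows of vt are the principal axes, and the
    # squared singular values are proportional to the variance along each axis.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    # Smallest number of components whose cumulative share reaches the target.
    n_components = int(np.searchsorted(cumulative, variance_to_keep) + 1)
    n_components = min(n_components, len(s))
    scores = centered @ vt[:n_components].T
    return scores, explained[:n_components]


# Synthetic folded curves (not TESS data): 50 targets x 100 phase bins,
# each a transit-like dip of different depth plus small Gaussian noise.
rng = np.random.default_rng(0)
phase = np.linspace(-0.5, 0.5, 100)
dip = -np.exp(-(phase / 0.05) ** 2)               # hypothetical dip template
depths = rng.uniform(0.005, 0.02, size=50)
folded = 1.0 + depths[:, None] * dip[None, :] + rng.normal(0.0, 1e-4, size=(50, 100))

scores, explained = pca_retain_variance(folded)
```

The reduced `scores` array is the kind of input that would then be passed to UMAP to obtain the two-dimensional embedding.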

Cluster Statistics and Hypotheses Validation.

For anomalous outliers, the median and standard deviation of transit duration, depth, and planetary temperature were computed to further distinguish likely sub-Neptunes, potential binaries, and hot Jupiters (Figure S1, Figure S2, and Figure S3). These statistics are shown in Table S3. The outputs were compiled into figures and tabulated datasets to support further analysis.
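A per-cluster summary of this kind can be sketched with the Python standard library. The field names (`duration_hr`, `depth_ppm`, `teq_k`) and the numeric values below are invented for illustration only, and the label -1 follows the HDBSCAN convention for points assigned to no cluster.

```python
from collections import defaultdict
from statistics import median, stdev


def cluster_summary(records, fields):
    """Median and sample standard deviation of each field, per cluster label."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["cluster"]].append(rec)
    summary = {}
    for label, members in grouped.items():
        summary[label] = {}
        for field in fields:
            values = [m[field] for m in members]
            # stdev needs at least two values; report 0.0 for singletons.
            spread = stdev(values) if len(values) > 1 else 0.0
            summary[label][field] = (median(values), spread)
    return summary


# Illustrative records only; these are not measurements from the study.
records = [
    {"cluster": 0, "duration_hr": 2.0, "depth_ppm": 900, "teq_k": 1200},
    {"cluster": 0, "duration_hr": 2.4, "depth_ppm": 1100, "teq_k": 1300},
    {"cluster": 0, "duration_hr": 2.2, "depth_ppm": 1000, "teq_k": 1250},
    {"cluster": -1, "duration_hr": 6.0, "depth_ppm": 15000, "teq_k": 1800},
    {"cluster": -1, "duration_hr": 8.0, "depth_ppm": 17000, "teq_k": 2000},
]
summary = cluster_summary(records, ["duration_hr", "depth_ppm", "teq_k"])
```

Comparing the outlier group's medians against those of the dense clusters is one way to see which physical parameters set the outliers apart.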

RESULTS.

A total of 1,322 TIC IDs from the May 2025 TCE dataset were analyzed by applying UMAP and HDBSCAN clustering to the folded light curves. The pipeline reduced the candidate pool by approximately 90%, yielding 122 TIC IDs for further inspection. In UMAP space (Figure 3), the Likely Candidates formed a coherent grouping, suggesting that their light-curve profiles are systematically distinct from the broader sample. When mapped back to ExoFOP parameters (Figure 4), the groups remain statistically separable, with no two groups sharing the same combination of properties, suggesting that they represent distinct signal populations rather than random scatter. Filters for known false positives, including hot Jupiters, flares, and instrumental issues, did not exclude any additional TIC IDs. The remaining candidates therefore constitute a compact subset of anomalous candidates that merit closer inspection because of their distinct morphology and associated physical parameters.
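The size of the reduction follows directly from the two counts reported above:

```latex
\frac{122}{1322} \approx 0.092
\qquad\Longrightarrow\qquad
1 - \frac{122}{1322} \approx 0.908
```

That is, about 9.2% of the analyzed TIC IDs were retained, a reduction of about 90.8%.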

Figure 3. Two-dimensional UMAP projection of TESS TCE light-curve profiles, illustrating the separation of candidate classes (including Hot Jupiters, Likely Candidates, and Likely Flares) identified through HDBSCAN clustering.

Figure 4. Candidate classes from Figure 3 (including Hot Jupiters, Likely Candidates, and Likely Flares) mapped back to ExoFOP parameters, showing that the groups remain separable.

DISCUSSION.

This study has several limitations. Many TIC IDs lacked complete light curves, so in most cases only the best available sector was analyzed. Few targets had data across all sectors, and some were excluded because there was insufficient information for classification. The pipeline also does not yet explicitly model several anomalous classes, including disintegrating planets and binary contamination, although possible cases were identified. Some TIC IDs exhibit unusually broad transit features, while a few show anomalies that do not match known signatures and require further modeling.

Future work will extend the pipeline to additional archives, including Kepler/K2, and incorporate a broader set of astrophysical classes to improve anomaly detection. K-means-based centroid distances may also be explored for identifying atypical clusters. Over the longer term, lightweight onboard, near-real-time detection systems may enable survey missions to flag anomalies for follow-up.
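A centroid-distance score of the kind mentioned above could look like the following sketch. It assumes that cluster labels already exist (for example, from a K-means fit, which is not performed here) and only measures how far each point lies from the mean of its assigned cluster. The function name and the coordinates are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroid_distances(points, labels):
    """Euclidean distance from each point to the centroid of its own cluster."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    # Centroid = coordinate-wise mean of the cluster members.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in grouped.items()
    }
    return [dist(point, centroids[label]) for point, label in zip(points, labels)]


# Illustrative 2-D embedding coordinates only; not values from the study.
points = [(0, 0), (0, 2), (2, 0), (2, 2), (6, 6), (10, 10), (10, 12)]
labels = [0, 0, 0, 0, 0, 1, 1]
distances = centroid_distances(points, labels)
most_atypical = max(range(len(points)), key=lambda i: distances[i])
```

Ranking points by this distance within each cluster gives a simple, interpretable measure of atypicality that could complement the density-based outlier labels.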

ACKNOWLEDGMENTS.

I want to thank Dr. Daniel Muthukrishna for his guidance throughout my research.

SUPPORTING INFORMATION.

The supporting information document describes the hyperparameter choices for each algorithm and the sensitivity analysis for those choices (Tables S1 and S2). It also provides additional tables and figures that support our results (Table S3 and Figures S1-S3).

REFERENCES

  1. G. R. Ricker, et al., Transiting Exoplanet Survey Satellite. Journal of Astronomical Telescopes, Instruments, and Systems 1, 014003 (2015).
  2. NASA Exoplanet Science Institute, “Exoplanet Follow-up Observing Program for TESS (ExoFOP-TESS)” (2025); https://exofop.ipac.caltech.edu/tess/.
  3. MAST Archive Team, “Mikulski Archive for Space Telescopes (MAST)” (2025); https://mast.stsci.edu.
  4. NASA Ames Research Center, “TESS Threshold Crossing Event (TCE) catalog” (2025); https://archive.stsci.edu/tess/tce.html.
  5. N. R. Lomb, Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science 39, 447–462 (1976).
  6. J. D. Scargle, Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 263, 835–853 (1982).
  7. Columbia University Mailman School of Public Health, “Principal Components Analysis” (2025); https://www.publichealth.columbia.edu/research/population-health-methods/principal-components-analysis.
  8. J. M. Jenkins, et al., Overview of the Kepler science processing pipeline. The Astrophysical Journal Letters 713, L87–L91 (2010).
  9. M. Hon, et al., A disintegrating rocky planet with prominent comet-like tails around a bright star. The Astrophysical Journal Letters 984, L3 (2025).
  10. M. Ansdell, et al., Scientific domain knowledge improves exoplanet transit classification with deep learning. The Astrophysical Journal Letters 869, L7 (2018).
  11. D. J. Armstrong, D. Pollacco, A. Santerne, Transit shapes and self-organizing maps as a tool for ranking planetary candidates: Application to Kepler and K2. Monthly Notices of the Royal Astronomical Society 465, 2634–2642 (2017).
  12. L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 [stat.ML] (2018).
  13. R. J. G. B. Campello, D. Moulavi, J. Sander, Density-based clustering based on hierarchical density estimates. In: J. Pei, V. S. Tseng, L. Cao, H. Motoda, G. Xu (eds), Advances in Knowledge Discovery and Data Mining, vol. 7819. Springer, 2013, pp. 160–172.
  14. L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2, 205 (2017).
  15. M. Brenndoerfer, “Isolation Forest: Complete Guide to Unsupervised Anomaly Detection with Random Trees and Path Length Analysis” (2025); https://mbrenndoerfer.com/writing/isolation-forest-anomaly-detection-unsupervised-learning-random-trees-path-length-mathematical-foundations-python-scikit-learn-guide.


Posted on Tuesday, June 2, 2026 in May 2026.
