AI Exploration of Non-Linearities in Dark Matter and Supercluster Detection with SRG/eROSITA
ABSTRACT
Dark matter constitutes most of the matter in the universe but can only be observed through its gravitational interactions. Galaxy clusters and superclusters serve as probes of its nature and evolution through cosmic time. The former are collections of densely packed galaxies, while the latter are even larger collections of galaxy clusters. To investigate this, we used the SRG/eROSITA eRASS1 dataset, the deepest all-sky soft X-ray survey in thirty years, which reaches a depth over 20 times greater than what ROSAT achieved in 1990. By comparing a linear regression model with a neural network, we found significant non-linearity in dark matter, as traced by the relationship between a galaxy cluster’s luminosity and its other features. We also used those features to identify two different types of galaxy clusters through KMeans clustering. Finally, we used HDBSCAN clustering to detect superclusters within this catalogue, and among them isolated the already discovered Shapley and Horologium Reticulum superclusters, finding that they have more member galaxy clusters than previously estimated. This study shows promise in the application of supervised and unsupervised machine learning to large astrophysical datasets in analyzing cosmic mysteries like dark matter.
INTRODUCTION.
Analyzing the relationship between the different features of these galaxy clusters is necessary to the understanding of dark matter because it has mostly been studied at the level of individual galaxies or galaxy clusters. Analyzing a dataset of unparalleled depth will help identify whether dark matter behaves differently at different scales. This and some subsequent analyses were conducted separately on galaxy clusters of different redshifts, where redshift measures distance from Earth under the standard Lambda-CDM cosmology, as per Hubble’s law [1]. Throughout this paper, we use redshift to measure not only physical distance but also separation in time. This is possible because, over such large scales, light takes so long to travel to us that the farther away an observed object is, the more ‘outdated’ our observation of it is, giving us a glimpse of what the universe looked like a long time ago. Our dataset was split into low-redshift (z < 0.3) and high-redshift (0.3 < z < 1.2) galaxy clusters. Using galaxy cluster features to sort galaxy clusters into classifications is also important, as it will serve to confirm previous studies aimed at galaxy cluster classification.
In this study, we use the SRG/eROSITA eRASS1 dataset. eROSITA conducted the deepest all-sky soft X-ray survey in thirty years, reaching a depth over 20 times greater than what ROSAT achieved in 1990 and cataloging 12,247 galaxy clusters [2]. This dataset has many benefits: its increased sensitivity allows it to detect dimmer and more distant galaxy clusters, and it records a diverse assortment of galaxy cluster features such as mass, X-ray luminosity, and size, not just position.
Supercluster detection through machine learning on a dataset covering such a large region of the universe will provide a significant contribution to this field, as previous efforts to identify superclusters either a) have a solid theoretical foundation but no data to back up their conclusions [3], or b) use simpler algorithms like Friends of Friends that are not as well suited to clustering problems as Artificial Intelligence (AI) is, leading to supercluster groupings that are physically inconsistent with preexisting models [4]. Our supercluster groupings are consistent with previous estimates of the number and size of possible superclusters in this region while simultaneously performing well on clustering evaluation metrics commonly used to quantify the quality of clustering in unsupervised machine learning.
This paper accomplishes two objectives – it uses the features of each galaxy cluster to quantify the non-linearity of dark matter at this scale and identify classifications of galaxy clusters, and it uses the positions of each galaxy cluster to detect superclusters in this region.
MATERIALS AND METHODS.
The dataset used is a catalogue of X-ray detected galaxy clusters, so each sample corresponds to a specific galaxy cluster. The data was cleaned by dropping all galaxy clusters with null observations, reducing the number of datapoints from 12,247 to 10,447. All features used, except for right ascension and declination, were evaluated at their logarithmic values. All machine learning models were drawn from the scikit-learn Python library.
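The cleaning step described above can be sketched as follows. The column names and values are hypothetical placeholders rather than the actual eRASS1 catalogue fields, and the base-10 logarithm is an assumption, since the base is not stated in the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the catalogue; column names and values are hypothetical.
catalogue = pd.DataFrame({
    "ra": [10.0, 20.0, 30.0, 40.0],
    "dec": [-5.0, 0.0, 5.0, 10.0],
    "mass": [1e14, 5e14, np.nan, 2e14],
    "luminosity": [1e43, 8e43, 3e43, 2e44],
    "redshift": [0.1, 0.4, 0.2, 0.8],
})

# Drop every galaxy cluster that has at least one missing measurement.
clean = catalogue.dropna().reset_index(drop=True)

# Log-transform every feature except the sky coordinates (base 10 assumed).
log_columns = [c for c in clean.columns if c not in ("ra", "dec")]
clean[log_columns] = np.log10(clean[log_columns])

print(clean)
```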
Feature regression.
To quantify the non-linearity of dark matter, we compared the accuracy of a linear regression model and a Multi-Layer Perceptron (MLP) model in predicting the luminosity of a galaxy cluster given its mass, gas mass, radius, and redshift. The MLP model was hyper-tuned (its hyper-parameters were set to optimized values) by adjusting the size and number of hidden layers, with the goal of maximum accuracy. Accuracy was measured with mean absolute percentage error. Dark matter non-linearity was quantified with the sigma value of the difference between the accuracy of 32 iterations of the MLP over different random states and the accuracy of the linear regression model. Here, sigma is a measure of statistical significance that expresses this difference in units of the standard error in the data. A high, positive sigma value would indicate a significant non-linear relationship among the features of galaxy clusters (since linear regression does not pick up on it) and therefore a high degree of dark matter non-linearity; a low, insignificant sigma value would indicate the opposite.
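A minimal sketch of this comparison is given below, using synthetic data in place of the catalogue. The exact sigma formula is not spelled out in the text, so the version here (the gap between the linear model's error and the mean MLP error, divided by the standard error of the MLP errors across seeds) is one plausible reading and should be treated as an assumption, as should the network size and the train/test split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in: four unit-scale features (mass, gas mass, radius, redshift)
# and a target with a deliberately non-linear term.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 2.0 + 0.3 * X[:, 0] + 0.2 * X[:, 1] ** 2 + rng.normal(scale=0.02, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: ordinary linear regression, scored with MAPE.
linear = LinearRegression().fit(X_train, y_train)
linear_mape = mean_absolute_percentage_error(y_test, linear.predict(X_test))

# MLP refitted over 32 random states (hidden layer sizes are illustrative).
mlp_mapes = []
for seed in range(32):
    mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=seed)
    mlp.fit(X_train, y_train)
    mlp_mapes.append(mean_absolute_percentage_error(y_test, mlp.predict(X_test)))
mlp_mapes = np.array(mlp_mapes)

# Assumed sigma definition: improvement in units of the MLP's standard error.
standard_error = mlp_mapes.std(ddof=1) / np.sqrt(len(mlp_mapes))
sigma = (linear_mape - mlp_mapes.mean()) / standard_error

print(f"linear MAPE = {linear_mape:.4f}, mean MLP MAPE = {mlp_mapes.mean():.4f}")
print(f"sigma = {sigma:.2f}")
```

A positive sigma means the MLP's error is lower than the linear model's; on real data the value depends strongly on how the standard error is defined.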
Feature clustering.
To identify classifications of galaxy clusters, we used a KMeans clustering model, which partitions data into a specified number of clusters using each point’s distance to cluster centroids. This model was fitted to the mass, gas mass, radius, luminosity, and redshift of each galaxy cluster (all variables were scaled beforehand). To evaluate the quality of this and other clustering models, we used the Silhouette Coefficient (Sil. score) and the Davies-Bouldin Index (DB score). For context, highly significant clustering should receive a Sil. score of 0.5 or above and a DB score of 0.8 or below. To determine the number of classifications, we hyper-tuned the number of clusters, optimizing for the model’s average performance on each of these metrics over 32 iterations with different random states. Note that the number of clusters is a hyper-parameter that sets how many groupings the model generates; it is not synonymous with ‘galaxy cluster’, which refers to the points within the dataset. We then ran the tuned model over the total dataset as well as the high and low redshift regimes and evaluated the significance of the classification using the Sil. and DB scores.
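The tuning loop described above might look like the following sketch. It runs on synthetic five-feature data with two built-in groups, and selecting the number of clusters by mean Sil. score alone is a simplifying assumption; the actual tuning weighed both metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five logged features, with two groups built in.
X, _ = make_blobs(
    n_samples=500,
    centers=[[0.0] * 5, [6.0] * 5],
    cluster_std=1.0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Average both metrics over 32 random states for each candidate cluster count.
results = {}
for k in range(2, 6):
    sil_scores, db_scores = [], []
    for seed in range(32):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_scaled)
        sil_scores.append(silhouette_score(X_scaled, labels))
        db_scores.append(davies_bouldin_score(X_scaled, labels))
    results[k] = (np.mean(sil_scores), np.mean(db_scores))

# Higher Sil. is better; lower DB is better. Here we pick by Sil. only.
best_k = max(results, key=lambda k: results[k][0])

for k, (sil, db) in results.items():
    print(f"k = {k}: Sil. = {sil:.2f}, DB = {db:.2f}")
print("selected number of clusters:", best_k)
```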
Supercluster detection and analysis.
To detect superclusters, we used a Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) model, which identifies groups of points by finding regions of high data density across multiple density scales, while automatically labeling sparse points as noise or outliers. This model was fitted to the right ascension, declination, and redshift of each galaxy cluster; together, right ascension and declination function as coordinates on the night sky, as seen by the telescope. All variables were scaled beforehand. We hyper-tuned the model by adjusting the minimum and maximum cluster sizes, optimizing for Sil. and DB score performance (no score validation across random states was needed, as HDBSCAN is a fully deterministic algorithm). We decided on two final models after hyper-tuning and ran those over the positions of the galaxy clusters to generate superclusters.
We then performed further analysis on the superclusters. Before patterns in the data could be analyzed, further processing was needed to determine certain variables. The redshift of each supercluster was taken at the medoid, which is the location of the member galaxy cluster closest to the average location of all the supercluster’s member galaxy clusters. Within each supercluster, each member galaxy cluster’s distance from the medoid was determined using distance conversions performed with Astropy functions under the Planck18 cosmological model.
We also analyzed the relationship between a supercluster’s ellipticity and the average luminosity of its member galaxy clusters using a KMeans clustering model. Parameters were optimized through hyper-tuning.
RESULTS.
Feature regression.
Our MLP regressor used for galaxy cluster luminosity prediction showed a 1.79 sigma improvement over the linear regression model at low redshifts, and a 10.04 sigma improvement at high redshifts.
This shows that dark matter was very non-linear earlier in the universe’s history, then evolved to be less (but still significantly) non-linear (Fig. 1). This improvement over linear models is also seen in previous studies of dark matter field and halo field prediction at the galactic level [6].

Feature clustering.
Our hyper-tuning of the KMeans model used to cluster galaxy clusters by feature demonstrated that two moderately significant classifications of galaxy clusters exist at both high and low redshifts (Fig. 2). The clustering generated by this model received Sil. and DB scores of 0.46 and 0.85, respectively, at low redshifts. These metrics shifted to 0.35 and 1.08 at high redshifts, suggesting that classifications became more defined as the universe aged, as expected from cosmological simulations [3]. The degree of order exhibited by galaxy cluster features can serve as a proxy for the linearity of dark matter, so the evolution of this classification supports our earlier conclusion that dark matter has evolved to be more linear. That said, the modest metric performance of our clustering demonstrates that, overall, the galaxy cluster classifications are only moderately defined.

Supercluster detection and analysis.
After hyper-tuning the HDBSCAN model, we selected two viable models for further analysis. The first, with a supercluster size range of 10-30 member galaxy clusters [4], identified 84 superclusters and received Sil. and DB scores of 0.57 and 0.58. The second, with a supercluster size range of 25-50 member galaxy clusters [3], identified 11 superclusters and received Sil. and DB scores of 0.74 and 0.36. The mapping of these superclusters is visualized in Figure 3.

Within the superclusters generated by the size 25-50 model, we used position in right ascension, declination, and redshift (the full 3D spatial coordinates) to identify two of the superclusters as the already documented Shapley and Horologium Reticulum superclusters (Fig. 4). These are among the few already documented superclusters [5], and the only prominent ones contained within the region of the sky examined in our dataset. Not only does this confirm the physical viability of our supercluster mapping, but it also presents an alternative and independent assessment of the size of each supercluster. Previous estimates for both have varied widely, ranging from 11 member galaxy clusters [6] to 45 members [4] for Shapley, with similar uncertainty and underestimation in Horologium Reticulum’s member estimates [7]. Our model identified 48 member clusters within the Shapley supercluster and 32 within the Horologium Reticulum supercluster (interestingly, the Horologium Reticulum supercluster was not detected by the Friends of Friends algorithm used by the SRG consortium in their supercluster mapping). These are therefore the largest estimates generated so far, and a testament to the power of machine learning in improving preexisting astronomical detection techniques.
To analyze trends in supercluster characteristics, we use the size 10-30 model because having more superclusters, and therefore datapoints, to analyze yields more trustworthy correlations, and all correlations are confirmed by the size 25-50 model. We found a positive correlation between the redshift of a supercluster and its members’ average luminosity (Figure 4), with a p-value of 6.185 × 10⁻²¹, and an R² value of 0.660, which shows a significant relationship. This demonstrates that as the universe has aged, the density of dark matter (X-ray luminosity serves as a proxy for dark matter density [8]) contained in superclusters has decreased.
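A correlation of this kind can be tested with an ordinary least-squares fit, as in the sketch below. The text does not state which routine produced the reported p-value and R² value, so the use of `scipy.stats.linregress` is an assumption, and the data are synthetic rather than our supercluster measurements.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in: 84 superclusters with a built-in positive trend.
rng = np.random.default_rng(0)
redshift = rng.uniform(0.05, 0.8, size=84)
mean_log_luminosity = 43.0 + 1.5 * redshift + rng.normal(scale=0.2, size=84)

# Ordinary least-squares fit of mean member luminosity against redshift.
fit = linregress(redshift, mean_log_luminosity)
r_squared = fit.rvalue ** 2
p_value = fit.pvalue  # two-sided test of the null hypothesis of zero slope

print(f"slope = {fit.slope:.3f}, R^2 = {r_squared:.3f}, p = {p_value:.3e}")
```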

The output of our clustering model’s examination of the relationship between a supercluster’s ellipticity and its redshift (Fig. 5) shows that rounder superclusters existed in the early universe and then were stretched out. This was potentially the mechanism by which they integrated into the ‘cosmic web’, a network of long filaments that transport matter throughout the universe [9]. Since this clustering only received Sil. and DB scores of 0.39 and 0.90, this claim can only be made with moderate confidence, as the relationship is not very significant.

DISCUSSION.
We conducted simulations of supercluster morphology over time using ellipticity measurements of each supercluster. These simulations were based on the Planck18 cosmological model [10], and the space-time metric was evolved using the Friedmann equations. Two simulations were run: one assuming a Planck18 cosmology within the flat Lambda-CDM model (which assumes that the density of dark energy stays constant), and the other assuming a Planck18 cosmology within the recently proposed DESI model [11], which postulates an evolving lambda, implying that dark energy changes over cosmic time. These simulations evaluated the most physically viable set of eigenvalues for all superclusters. Each eigenvalue represents a dimension, or axis, in space, and its magnitude determines how much a supercluster expands or shrinks along that axis. Negative values indicate shrinking, while positive values indicate expansion. Changing the eigenvalues in a simulation changes its output ellipticities. We determined the optimal common set of eigenvalues by comparing the superclusters’ actual ellipticities to the ellipticities predicted by the simulations for every set of eigenvalues, with accuracy measured by root mean squared error. The resulting optimized eigenvalues demonstrate that superclusters expand on one axis and contract on another, narrowing down the exact process of their morphological evolution. The two simulations did not show any significant difference, which suggests promise for the DESI model, which is relatively new and a significant shift from the previously assumed flat Lambda-CDM cosmology. However, another reason for this lack of variation is that our survey was conducted over relatively low redshifts and non-extreme ellipticities, so there was little opportunity for any deviations to be expressed.
CONCLUSION.
In this paper, we investigated the nature of dark matter using the largest dataset of X-ray detected galaxy clusters. We applied various supervised and unsupervised machine learning models to quantitatively understand patterns in the data. We used a performance comparison between a linear regression model and an MLP model to show that dark matter not only has non-linear characteristics but was in fact more non-linear earlier in the universe’s lifespan. We used a clustering model to identify two classifications that a galaxy cluster can fit into based on its features. We used a different clustering model, fitted to galaxy cluster positions, to identify superclusters within our dataset, and analyzed the resulting groupings. One of the most significant findings of our analysis was the confirmation of the existence of the Shapley supercluster, together with our new estimate of its size at 48 member galaxy clusters. As astronomical surveys get larger and machine learning gets more powerful, methods and strategies like the ones presented here will be crucial to unraveling the extreme physics of the cosmos.
ACKNOWLEDGMENTS.
I would like to thank Dr. Antonio Rodriguez of Harvard University for being an incredible mentor and providing very helpful guidance throughout this study, as well as the Inspirit AI program for making this collaboration possible.
REFERENCES
1. E. Hubble, A relation between distance and radial velocity among extra-galactic nebulae. Proceedings of the National Academy of Sciences. 15, 168–173 (1929).
2. V. Ghirardini et al. The SRG/eROSITA all-sky survey. Astronomy and Astrophysics. 689, A298 (2024).
3. G. Chon, H. Böhringer, S. Zaroubi, On the definition of superclusters. Astronomy and Astrophysics. 575, L14 (2015).
4. A. Liu et al. The SRG/eROSITA All-Sky Survey. Astronomy and Astrophysics. 683, A130 (2024).
5. D. Proust et al. The Shapley Supercluster: the Largest Matter Concentration in the Local Universe. The Messenger. 124, 30–31 (2006).
6. M. C. Münchmeyer, Reconstruction of Dark Matter and Baryon Density From Galaxies: A Comparison of Linear, Halo Model and Machine Learning-Based Methods. arXiv e-prints, arXiv:2507.12530 (2025).
7. M. C. Fleenor et al. Redshifts and Velocity Dispersions of Galaxy Clusters in the Horologium-Reticulum Supercluster. The Astronomical Journal. 131, 1280–1287 (2006).
8. Y. Lu, X. Yang, S. Shen, Using Member Galaxy Luminosities as Halo Mass Proxies Of Galaxy Groups. The Astrophysical Journal. 804, 55 (2015).
9. M. S. S. L. Oei et al. Black hole jets on the scale of the cosmic web. Nature. 633, 537–541 (2024).
10. Planck Collaboration, Planck 2018 results. VI. Cosmological parameters. Astronomy and Astrophysics. 641, A6 (2020), https://doi.org/10.1051/0004-6361/201833910.
11. A. G. Adame et al. DESI 2024 VI: cosmological constraints from the measurements of baryon acoustic oscillations. Journal of Cosmology and Astroparticle Physics. 2025, 021 (2025).
Posted by buchanle on Wednesday, June 3, 2026 in May 2026.
Tags: dark matter, Galaxy cluster, Machine learning, supercluster
