
Enhanced Particle Classification in Water Cherenkov Detectors Using Machine Learning: Modeling and Validation With Monte Carlo Simulation Datasets

A peer-reviewed article of this preprint also exists.

Submitted: 24 June 2024
Posted: 27 June 2024

Abstract
The Latin American Giant Observatory (LAGO) is a ground-based, extended cosmic-ray observatory designed to study transient astrophysical events, the role of the atmosphere in the formation of secondary particles, and space-weather-related phenomena. Using a network of Water Cherenkov Detectors (WCDs), LAGO measures the flux of secondary particles produced when astroparticles impinging on the Earth interact with the atmosphere. These secondaries can be grouped into three basic constituents: the electromagnetic, muonic, and hadronic components. When a particle enters a WCD, it generates a measurable signal whose features correlate with the particle's type and the detector's specific response. The resulting charge histograms provide valuable insights into the flux of primary astroparticles and their key characteristics; however, on their own they are insufficient to distinguish the contributions of the different secondary particles. This work extends our previous research, in which we implemented OPTICS, a density-based clustering algorithm, to identify patterns in data collected by a single WCD. We have further refined that approach into a method that categorizes and differentiates particle groups using unsupervised machine learning techniques, exploiting the detector's nuanced response to each particle type to pinpoint the principal contributors within each group. Here, we apply OPTICS to simulated data to further validate the proposed method. The synthetic dataset was created by combining the outputs of the ARTI and MEIGA frameworks and reproduces the WCD signals expected from the flux of secondary particles during one day at the LAGO site in Bariloche, Argentina, situated at 865 meters above sea level. To achieve this, real-time magnetospheric and local atmospheric conditions for February and March of 2012 were analyzed, and the resulting atmospheric secondary-particle flux was injected into a dedicated MEIGA application featuring a comprehensive Geant4 model of the WCD at this LAGO location. The MEIGA output was restructured for effective integration into our machine learning pipeline.
Our analysis demonstrates that the enhanced methodology can identify the originating particle with a high degree of confidence on a single-pulse basis, highlighting its precision and reliability. These promising results suggest the feasibility of deploying machine-learning-based models throughout the LAGO distributed detection network, and in other astroparticle observatories, for semi-automated, onboard, and real-time data analysis.
Keywords: 
Subject: Environmental and Earth Sciences - Atmospheric Science and Meteorology

1. Introduction

The Earth’s atmosphere is constantly impinged by astroparticles, giving rise to an atmospheric flux of secondary particles comprising three main components: electromagnetic (γ and e±), muonic (μ±), and hadronic (consisting of various types of mesons and baryons, including nuclei).
The Latin American Giant Observatory (LAGO) is a network of Water Cherenkov Detectors (WCDs) situated across multiple sites in Ibero-America. Among LAGO’s primary objectives are the measurement of high-energy events originating from space using WCDs at ground-level locations [1] and the continuous enhancement of our WCD systems [2]. LAGO WCDs employ a single, large-area photomultiplier tube as the primary sensor. When relativistic charged particles traverse the WCD, they induce the emission of Cherenkov radiation within the water volume, which subsequently triggers a detection event in the detector’s data acquisition system. Due to the large water volume, neutral particles such as photons or neutrons can also be detected indirectly, through processes like Compton scattering or pair creation in the former case, or nuclear interactions with the various materials present in the latter. The WCDs at LAGO, which are not necessarily homogeneous across the detection network, are deployed in selected, sometimes remote, sites with different altitudes and geomagnetic characteristics (and therefore different rigidity cut-offs).
In this scenario, one important goal is the continuous monitoring of the detector’s status and of the factors that could affect the WCD response, such as detector ageing or water quality, as they could produce artificial increases or decreases in the flux of measured signals. Moreover, registering and transferring the complete flux of secondary particles may not be necessary during calm periods, when no astrophysical transients are detected. Thus, being able to characterize the WCD response in a real-time, local, and unattended mode is essential, especially for detectors deployed in challenging sites (e.g., Antarctica or very high-altitude mountains). For this reason, one of the goals of this work is to enable the automatic determination of the WCD response to the flux of secondary particles, especially during astrophysical transients such as those produced under the disturbed conditions driven by Space Weather phenomena.
Magnetic and plasma interplanetary (IP) conditions near Earth, which are of major interest in Space Weather, can significantly modify the transport of low-energy galactic cosmic rays (GCRs). These conditions can produce variability in the flux of charged primary galactic protons, which can be indirectly observed with particle detectors installed at the Earth’s surface, mainly for primaries in the band from ∼hundreds of MeV to slightly more than 10 GeV. This effect is evident, for example, in the well-known anti-correlation between the ∼11-year solar cycle of sunspots and the long-term variability of the galactic cosmic ray flux, e.g., [3].
There are two major transient IP perturbations producing decreases of GCRs: Interplanetary Coronal Mass Ejections (ICMEs) and Stream Interaction Regions (SIRs). ICMEs are coronal mass ejections from the Sun propagating through interplanetary space, while SIRs are interplanetary structures that develop in the solar wind when fast solar wind streams merge with slower interplanetary plasma ahead of them, e.g., [4].
When ICMEs or SIRs are observed by spacecraft near the geospace, ground-level observations generally show decreases in the flux of secondary particles, a phenomenon called a Forbush decrease (FD), e.g., [5,6].
The variability of the GCR flux at ground level, on both long (e.g., the ∼11-year solar cycle) and short time scales (∼hours to days for FDs), has been systematically observed for several decades by measuring the secondary neutrons produced in atmospheric cascades with Neutron Monitors (NMs), e.g., [7,8].
Given their characteristics, WCDs started to be examined as a possible complement to NMs, since they can observe FDs produced by ICMEs, e.g., [9,10]. In particular, FDs have been observed in different channels of LAGO WCDs [11]. Recently, first explorations with data from the Antarctic node have shown that LAGO WCDs can observe the spatial anisotropy of the GCR flux, at least during quiet days [12].
Thus, because the general interest of Space Weather focuses on the variability of the flux of low-energy charged primaries, there is a special interest in discriminating the energy deposited by the secondaries originating from these primaries and, consequently, a particular focus on being able to discriminate the traces deposited by the different particles in order to identify them.
The primary objective of this work is to find patterns within each WCD’s data that could allow us to assess particle separation within the charge histogram. We propose a machine learning (ML) algorithm to perform this task in such a way that each LAGO detector can have its own tailored ML model as the result of a learning process based on its own particular dataset.
ML techniques have been used in many fields, including astroparticle research, with encouraging results. In general, particle discrimination in WCD detections is an important task for various kinds of studies, and ML has been applied to analyze WCD data in different scenarios. For example, Jamieson et al. [13] propose a boosted decision tree (XGBoost), a graph convolutional network (GCN), and a dynamic graph convolutional neural network (DGCNN) to optimize neutron capture detection capability in large-volume WCDs, such as the Hyper-Kamiokande detector. Their work is driven by the need to distinguish the neutron signal from others, such as muon spallation or other background signals. Additionally, ML techniques are used to identify muons in Conceição et al. [14] for WCDs with reduced water volume and four photomultipliers (PMTs). In these cases, convolutional neural networks (CNNs) have been used, showing that the identification of muons in the station depends on the amount of electromagnetic contamination, with nearly no dependence on the configuration of the WCD array (dense vs. sparse). These are two of many examples showing the potential of ML in this area of research [15,16,17,18].
In Torres Peralta et al. [19], we proposed a clustering (unsupervised ML) technique to identify each of the components detected by LAGO WCDs. We showed that OPTICS (Ordering Points To Identify the Clustering Structure) is suitable for separating the particles. However, further validation was needed to ensure that the algorithm is robust and obtains the desired outcome with high confidence.
In this work, we continue the study of OPTICS applied to LAGO WCDs by using synthetic data from Monte Carlo simulations. This work thereby serves as a validation of the proposed OPTICS method, since the algorithm is applied to a synthetic dataset where the ground truth is known a priori. Moreover, we implement statistical analyses to ensure robustness and precision.
This work is organized as follows: Section 2 explains the LAGO software suite, the simulation’s main parameters and limitations, and the outcome in the form of a synthetic dataset. Section 3 details the ML technique, including the main hyperparameters used. Section 4 explains the pipeline followed for the ML modelling and the decisions made on the data treatment. Finally, we present the results and conclusions in Section 5 and Section 6, respectively.

2. Simulation Framework

2.1. Atmospheric Radiation Calculations

Cosmic rays (CRs) are defined as particles and atomic nuclei originating from beyond Earth, spanning energies from several GeV to more than 10^19 eV [20]. These particles, upon reaching the upper atmosphere, interact with atmospheric elements to produce extensive air showers (EAS), a discovery made by Rossi and Auger in the 1930s [21]. An EAS generates new particles, or secondaries, through radiative and decay processes that follow the incoming direction of the CR [22].
The formation and characteristics of an EAS are influenced by the energy (E_p) and type (such as gamma, proton, or iron) of the incident primary CR, which can generate up to 10^10 particles at peak energies. The process continues through atmospheric interactions until reaching the ground, where 85-90% of E_p is transferred to the electromagnetic (EM) channel, consisting of γ and e±. Muons are produced by the decay of different mesons during the cascade development, mainly but not exclusively from π± and kaons. Hadrons are produced mainly by evaporation and fragmentation during QCD interactions with atmospheric nuclei at the core of the EAS, also known as the shower axis, which follows the direction of the incoming primary particle. The particle distribution across the EM, muon, and hadronic channels is approximately 10^2 : 1 : 10^-2, respectively [23].
As can be inferred from the above description, the simulation of EAS is a task that demands significant computational resources. This challenge arises not only from the need to model intricate physical interactions but also from tracking an enormous number of particles and taking into account their respective interactions with the atmosphere. Among the available simulation tools, CORSIKA [24] stands out as the most broadly adopted and rigorously tested, benefiting from ongoing enhancements [25]. CORSIKA allows for the detailed simulation of EAS initiated by individual cosmic rays, with adjustable settings for various parameters such as atmospheric conditions, local geomagnetic field variations, and observation altitude. To effectively simulate the expected background radiation across different global locations and times using CORSIKA, an auxiliary tool is necessary. This tool must dynamically adjust the parameters based on seasonal changes in the atmospheric profile and on variations in the cosmic ray flux driven by solar activity, which also impacts the geomagnetic field.
To tackle these challenges, the LAGO Collaboration has created ARTI [26], an accessible toolkit designed to compute and analyze background radiation and to assess detector responses and their variability [27]. ARTI is capable of predicting the expected flux of atmospheric cosmic radiation at any location under dynamic atmospheric and geomagnetic conditions [28,29,30,31], effectively integrating CORSIKA with Magneto-Cosmics [32] and Geant4 [33], and including its own analysis tools. Recently, ARTI has been utilized in diverse applications, including gamma source detection for site characterization [34], monitoring space weather phenomena like Forbush decreases [29,31,35], estimating atmospheric muon fluxes at subterranean locations [31,36], and analyzing volcanic structures using muography [37,38,39,40]. Additionally, ARTI has been used in conflict zones in Colombia to detect improvised explosive devices [41], to examine the effects of space weather on neutron detection in water Cherenkov detectors [42], to develop neutron detectors for monitoring the transport of fissile materials [43,44], and even to create ACORDE, a code for calculating radiation exposure during commercial flights [45].
Calculating the expected flux Ξ of atmospheric radiation at any geographical position requires long integration times to avoid statistical fluctuations [26,35]. This is because a single EAS involves the interaction and tracking of billions of particles during the shower development through the atmosphere, while the atmospheric radiation is caused by the interaction of up to billions of CRs impinging on the Earth each second. For the modelling of EAS, not only the interactions involved but also the corresponding atmospheric profile at each location, which varies as a function of time, should be considered, as it also determines the evolution of the shower [46]. For this reason, ARTI can handle different atmospheric models: the MODTRAN model, which sets a general atmospheric profile depending on the seasonal characteristics of large areas of the world (say, tropical, subtropical, arctic, and antarctic) [47]; Linsley’s layered model, which uses atmospheric profiles obtained from measurements at predefined sites [48]; a real-time atmospheric profile built from data of the Global Data Assimilation System (GDAS) [49] and characterized using Linsley’s model; and, finally, an averaged atmospheric profile obtained from the temporal averaging of the GDAS profiles to build up an averaged density profile at each location for a certain period, e.g., one month [30,31]. Finally, Ξ is also affected by the variable conditions of the heliosphere and of the Earth’s magnetic field (EMF), as both affect the CR transport down to the atmosphere. As developed and described by Asorey et al. [29], ARTI also incorporates modules to account for changes in the secular magnitude of the EMF and for disturbances due to transient solar phenomena, such as Forbush decreases.
Upon setting the primary spectra, the atmospheric profile, and the EMF’s secular values and potential disturbances, it becomes feasible to compute Ξ. This is achieved by injecting into the atmosphere the integrated flux of primary particles, with energies ranging from Z × min(R) to 10^15 eV, where R represents the local directional rigidity cutoff tensor derived from the Earth’s magnetic field secular values as per the current International Geomagnetic Reference Field (IGRF) version 13 model [50], and Z is the charge of the primary particles, from protons to iron (1 ≤ Z ≤ 26). The upper limit is used because, at energies above 1 PeV, the primary spectrum features the so-called ’knee’, reducing the flux even more for higher E_p, so that their impact becomes negligible for atmospheric background calculations [26].
These calculations cover a standard area of typically 1 m² over a time integration period τ ranging from several hours to days. Post-simulation, secondaries generated by geomagnetically forbidden primaries are discarded by comparing their magnetic rigidity to the evolving values of R. A detailed description of these procedures is provided in Asorey et al. [29]. These intensive processes demand substantial computing resources; for example, estimating the daily flux Ξ of secondary particles per square meter at a high-latitude location involves simulating approximately 10^9 EAS, each contributing to the production of a comparable number of ground-level secondaries. ARTI is prepared to operate on high-performance computing (HPC) clusters and within Docker containers on virtualized platforms such as the European Open Science Cloud (EOSC), and to manage data storage and retrieval across public and federated cloud servers [51].

2.2. Detector Response Simulations

Meiga is a software framework built on Geant4, tailored to facilitate the calculation of particle transport through extensive distances, such as hundreds or thousands of meters through rocks of varying densities and compositions [38], and is also pivotal in the design and characterization of particle detectors for muography [38,52]. Structurally, Meiga is composed of various C++ classes, each dedicated to a specific functionality. It integrates Geant4 simulations for particle transport and detector response calculations, providing interfaces for users to manage detector descriptions and simulation executions.
Meiga offers a suite of customizable applications that simplify the simulation process for users by utilizing configuration files formatted in XML and JSON. This customization allows users to adapt the framework to meet their specific project requirements [38]. Additionally, Meiga includes utilities such as a configuration file parser, physical constants, material properties, and tools for geometric and mathematical calculations. The adoption of JSON for configuration files was motivated by the possibility of complying with the FAIR (Findable, Accessible, Interoperable and Reusable) data principles for the data produced [53], and by the incorporation of standards used in the creation of digital twins. Moreover, as detailed in Taboada et al. [38], the framework’s modular and adaptable design includes a set of pre-configured detector models and physics lists, which users can easily extend or modify to develop tailored detectors and processes. This modular approach considerably reduces the time and effort required for development. For example, users can create new detector classes by extending the Detector class or define new materials within the Materials class. Additionally, the framework supports the development of new action classes to introduce specific operations during the simulation process.
Once the flux of secondary particles Ξ is obtained, the particles are propagated through a detailed model of the LAGO Water Cherenkov Detector (WCD) developed using Meiga. This model incorporates specific variables to account for the water quality, the model of the photomultiplier tube (PMT), its geometric positioning within the detector, the internal Tyvek coating, and the detector’s electronic response. As charged particles enter the water, they generate Cherenkov photons, which propagate through the detector’s volume until they are either absorbed or reach the PMT. The PMT is simulated in Meiga as a photosensitive surface, accurately replicating its characteristic spectral response based on the PMT’s quantum efficiency provided by the manufacturer. Given the substantial water volume of typical WCDs (generally over 1 m³), the system is also sensitive to neutral particles like neutrons and photons through secondary processes such as neutron capture followed by prompt gamma emission, Compton scattering, or pair creation within the water [43].
The detector’s electronic response is simulated to produce the final signal: a pulse representing the time distribution of the photo-electrons detected by the simulated electronics. This pulse, typically resembling a sampled FRED (Fast Rise and Exponential Decay) curve, is captured at the same rate as the detector’s electronics, ranging from 40 to 125 million samples per second, with time bins spanning 25 to 8 ns, respectively, and with 10 to 14 bits for the analog-to-digital converter. As in the physical WCD, the total pulse acquisition time can be set to 300 to 400 ns depending on the acquisition conditions. Once captured, the pulse is analyzed for its ’peak’ (the maximum number of photo-electrons registered within a single time bin), its ’charge’ (the total photo-electrons collected during the event), and characteristic times such as the rise and decay times, determined by the period needed for the integrated signal to reach predefined levels (typically 10→50% and 30→90% of the charge). Each pulse and its associated characteristics are logged for further analysis, along with details of the impinging secondary particle. Unlike physical detectors, which record pulses without identifying the type of particle producing each event, Meiga allows each pulse to be linked to its corresponding secondary particle. This capability enables the testing of predictions made by unsupervised machine learning analysis techniques, as detailed in the subsequent section.
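As an illustration of how these pulse observables can be derived, the following minimal Python sketch computes the peak, charge, and a rise time from an array of photo-electron counts per time bin. The function name, the 8 ns bin width, and the toy FRED-like pulse are illustrative assumptions, not part of the Meiga output format.

```python
import numpy as np

def pulse_features(pulse, bin_ns=8.0, lo=0.10, hi=0.50):
    """Peak, charge, and rise time of a digitized WCD-like pulse.

    pulse  : array of photo-electron counts per time bin
    bin_ns : assumed bin width in nanoseconds (8 ns at 125 MS/s)
    lo, hi : fractions of the total charge defining the rise time
    """
    charge = pulse.sum()                    # total photo-electrons in the event
    peak = pulse.max()                      # maximum photo-electrons in a single bin
    cumulative = np.cumsum(pulse) / charge  # integrated signal, normalized to 1
    t_lo = np.searchsorted(cumulative, lo)  # first bin reaching `lo` of the charge
    t_hi = np.searchsorted(cumulative, hi)  # first bin reaching `hi` of the charge
    return peak, charge, (t_hi - t_lo) * bin_ns

# Toy FRED-like pulse: fast rise followed by an exponential decay (illustrative only).
t = np.arange(50)
toy_pulse = np.where(t < 3, t * 4.0, 12.0 * np.exp(-(t - 3) / 10.0))
print(pulse_features(toy_pulse))
```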

3. Machine Learning Framework

As mentioned above, the primary goal is to separate the signal contributions of the different secondary particles within a given charge histogram from a single WCD. Since the ground truth is unknown for actual data obtained from a WCD, the approach presented by Torres Peralta et al. [19] was to implement an unsupervised ML clustering algorithm. This type of algorithm deals with the problem of partitioning a dataset into groups using insights drawn from the unlabelled data.
In that work, the selected dataset consisted of pulses (samples) captured by the data acquisition system (DAQ), digitized at a sampling rate of 40 MHz with 10-bit resolution over a time window of 400 ns. The data came from LAGO’s “Nahuelito” WCD at the Bariloche site, Argentina.
The originally measured dataset was analyzed using the following features: the total charge deposited (the time integration of the pulse), the maximum value of the pulse, the time taken to deposit 90% of the charge, the pulse duration, and the time difference between the current and the next pulse. These initially selected features were further analyzed using Principal Component Analysis (PCA). Figure 1 shows a visualization of the resulting components from PCA, where each subplot is a two-dimensional projection of the distribution between the selected components, labeled PCA 1 through PCA 5. In each subplot, a darker color means that there are more points in that bin and thus that the region is denser. Overall, the structure of the data shows a high potential for forming groups of points; in some cases, as in the projection between PCA 3 and PCA 1, one can observe a potential hierarchy of groups of different densities that compose a larger group.
We considered different types of clustering algorithms, such as partitioning methods (e.g., K-Means, probably the most well-known clustering algorithm), hierarchical methods, density-based methods, grid-based methods, and distribution-based methods. After analyzing the structure of the data, a hierarchical density-based algorithm was chosen as a good candidate. In particular, we proposed Ordering Points To Identify the Clustering Structure (OPTICS) [54], a hierarchical density-based clustering algorithm whose aim, in our case, is to organize the secondary particle contributions from the cosmic rays into well-separated clusters.
To achieve a more robust validation process, in this paper we use the same method but applied to synthetic data resulting from simulations. Here, the pipeline begins with a standard preprocessing stage to prepare the data for the next stages. A set of criteria was chosen to filter out of the dataset those data points considered anomalous. From the resulting dataset, features were extracted, normalized, and passed through a PCA stage that produced a new set of orthonormal features called principal components. These new features compose the final dataset used to feed the main stage of the pipeline. Details about the specifics of the methodology can be found in the next section.
The main part of the pipeline, the ML modelling, uses the OPTICS algorithm to generate the separated clusters by grouping points that share similarities in their features. What is particular to density-based clustering algorithms is that cluster membership is defined through a distance metric quantifying how close points are to each other. It is worth mentioning that OPTICS has a set of advantages that we considered crucial for this application.
Other density-based algorithms can also deal with complex data, such as the well-known Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [55]. One of the major advantages of OPTICS over DBSCAN is its efficient memory usage: OPTICS is O(n log n) while DBSCAN is O(n²) [56]. This is especially relevant in our case because the high number of samples (over 39 million) requires efficient memory management by the algorithm; under this constraint, DBSCAN fails. Even though OPTICS performs worse in execution time (it is highly sequential) while DBSCAN is more efficient and parallelizable at run time, memory management is a hard constraint. In brief, OPTICS meets the scalability requirements of our system.
Moreover, one of the desired characteristics of clustering algorithms is the capability of discovering arbitrarily shaped clusters, which is one of the most challenging tasks. Density-based algorithms may achieve this goal but, unlike DBSCAN, OPTICS can handle both complex cluster geometries, like groups within groups (hierarchical structures), and variable cluster density [56]. Many algorithms based on either centroids (such as K-Means) or medoids (such as K-medoids) fail to satisfy these clustering criteria of producing widely different clusters, converging on concave-shaped clusters, and grouping hierarchically. In addition, OPTICS does not require any pre-defined specification of the number of partitions or clusters. Because of the above-mentioned advantages, we proposed OPTICS as the most suitable clustering method.
In OPTICS, there are two main concepts: core distances and reachability distances. The core distance defines the minimum distance needed for a given point o to be considered a core point; multiple core points form a core group and the possible beginning of a new cluster. The reachability distance defines the minimum distance of p with respect to o such that, if o is a core point, p is directly density-reachable from o. If o is not a core point, the reachability distance is undefined [54].
In addition to these two concepts, two main hyperparameters need to be set a priori, before running the algorithm: ϵ_max and minPoints. ϵ_max is the maximum possible reachability distance that can be calculated between two points, while minPoints is the minimum number of points required in the neighborhood of a point o for it to be considered a core point (o itself is included in minPoints). These parameters greatly affect the runtime performance of the algorithm, which in the absolute worst case can be O(n²). minPoints also controls the granularity when searching for clusters, as a smaller value can help find clusters within clusters [56].
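Restating the standard definitions from Ankerst et al. [54] in this notation, with d(·,·) the chosen distance metric and N_{ϵ_max}(o) the ϵ_max-neighborhood of o, the two quantities can be written as:

```latex
\[
\operatorname{core-dist}_{\epsilon_{\max},\,\text{minPoints}}(o) =
\begin{cases}
  \text{UNDEFINED} & \text{if } |N_{\epsilon_{\max}}(o)| < \text{minPoints},\\
  d\bigl(o,\, o_{\text{minPoints}}\bigr) & \text{otherwise},
\end{cases}
\]
\[
\operatorname{reach-dist}_{\epsilon_{\max},\,\text{minPoints}}(p, o) =
\begin{cases}
  \text{UNDEFINED} & \text{if } |N_{\epsilon_{\max}}(o)| < \text{minPoints},\\
  \max\bigl(\operatorname{core-dist}_{\epsilon_{\max},\,\text{minPoints}}(o),\; d(o, p)\bigr) & \text{otherwise},
\end{cases}
\]
```

where o_minPoints denotes the minPoints-th nearest neighbor of o. The reachability plot introduced below is built from these reach-dist values, taken in the order in which OPTICS processes the points.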
The first output of OPTICS is a visual representation of the calculated reachability distances, ϵ_r, of each point with respect to its closest core group. The points are ordered along the X-axis, from smallest to greatest ϵ_r within their corresponding core group, while the Y-axis shows the ϵ_r value. The resulting plot is the so-called reachability plot.
To interpret the reachability plot: points in the valleys are data points that are spatially close to each other, meaning high local density, and thus likely to belong to the same cluster. The valleys are separated by data points with larger reachability distances, which means they are farther away from the data points in a valley. Figure 2, from Wang et al. [57], illustrates this interpretation. It is worth mentioning, referring to the same figure, that OPTICS can detect a hierarchy of clusters; for example, the green and light blue clusters are clearly child clusters of a parent cluster in red. Thus, in problems where varying density within clusters is expected, OPTICS will be able to achieve cluster separation.
As hinted in the previous paragraph, cluster formation depends on the choice of a maximum reachability distance as a cutoff for cluster membership. This can be done in two ways: adaptively choosing a maximum reachability distance depending on the structure of each valley, or choosing a fixed maximum as a cut-off. The latter is the option selected in this work.
Like many ML algorithms, OPTICS is non-deterministic, meaning that different results can be obtained on each run. In particular, the non-deterministic part of OPTICS is related to the order in which the data are processed when each point is selected to belong (or not) to a given cluster; this order varies between runs. If the model is robust enough, the results of each independent run will tend to converge to a similar clustering of the data. We address this issue again when we explain the methodology used in this work.
Finally, it is worth mentioning that the implementation was done in Python using the scikit-learn library.
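A minimal sketch of how this clustering step looks with scikit-learn is shown below. The data and the hyperparameter values here are placeholders (the values actually adopted are discussed in Section 4); `min_samples` corresponds to minPoints and `max_eps` to ϵ_max in the text.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Stand-in for the PCA-transformed features of one dataset.
X = np.random.default_rng(0).normal(size=(2000, 3))

# Fit OPTICS; min_samples plays the role of minPoints, max_eps that of eps_max.
model = OPTICS(min_samples=50, max_eps=0.5, metric="euclidean").fit(X)

# Reachability distances in processing order: the quantity shown in the reachability plot.
reachability = model.reachability_[model.ordering_]
print(reachability[:10])
```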

4. Methodology

We propose a methodology based on a data science approach in which ML is used to implement a data-driven model/system. Often, these techniques must not only address the actual scientific question but also satisfy emerging quality attributes such as debiasing and fairness, explainability, reproducibility, privacy and ethics, sustainability, and scalability.
The collection of data science stages, from acquisition to cleaning/curation, modelling, and so on, is referred to as a data science pipeline (or workflow). Data science pipelines enable flexible but robust development of ML-based modelling and of the software used for later decision making. We embrace the data science pipeline methodology to analyze and implement our proposed model [58].
In brief, machine learning pipelines provide a structured, efficient, and scalable approach to developing and maintaining machine learning models. They allow the workflow to be modularized, standardize processes, and ensure consistency, which is essential for producing (and reproducing) robust models.
The pipeline designed in this work, shown in Figure 3, can be summarized as follows:
  • Acquisition: produces the resulting 24-hour simulation data using the characteristics of "Nahuelito" WCD.
  • Preprocessing:
    • Filtering: removes anomalies to guarantee the quality of the data used.
    • Splitting: divides simulation output into two sets, input and ground truth. This is done because we want to do clustering on the input set in a ’blind’ fashion (without the ground truth). The dataset with the ground truth is later used for validation of the results.
  • Feature Engineering and Feature Selection: creates the initial features, after which PCA is performed to select the final feature set.
  • Parallel running of OPTICS: the input set is divided into 24 datasets of one hour each, which are fed in parallel to the OPTICS algorithm. As a result, 24 independent models are obtained, and for each independent run the particle composition of each cluster is extracted.
  • Averaging: the previous two steps are repeated 10 times and the results are aggregated.
The methodology starts at the acquisition stage, where we used the MEIGA simulation framework to produce a dataset with information about every particle event that passed through the simulated WCD. To achieve this, real-time magnetospheric and local atmospheric conditions for February and March of 2012 were analyzed, and the resulting atmospheric secondary-particle flux was injected into a specific MEIGA application featuring a comprehensive Geant4 model of the WCD at a specific LAGO location. The dataset includes information about the interactions of the particle that produced each event (secondary particles) as well as the particle type itself; thus, we have a priori knowledge of the particle composition of the simulated events. In particular, we used a 24-hour simulation for a WCD with the same characteristics as "Nahuelito" in order to reproduce conditions similar to those in [19] (e.g., WCD geometry, rigidity cut-off, etc.). The output from MEIGA was restructured for effective integration into our machine learning pipeline. Each simulated day of data consists of approximately 500 million particles arriving at the WCD.
Following the ML pipeline, the preprocessing stage takes the simulation output dataset and transforms it into a curated dataset suitable for extracting and selecting features for the learning process. The preprocessing stage consists of two subtasks: filtering and data splitting.
Regarding the first task, the criterion used to filter out anomalous data points reduces to discarding events in which the particle did not have enough energy to interact with the WCD, in other words, did not produce photo-electrons (PEs). Unlike the actual data used in our previous work [19], where aggressive cleaning/filtering was needed, with synthetic data we perform a simple cleaning by eliminating those particles that do not produce a signal. This is because the generation of synthetic data is done in a controlled environment, meaning that the simulation does not account for external factors, such as ambient noise or external sunlight filtering into the detector, that would be present in real data and would need to be filtered out.
For the second part of the preprocessing stage, the cleaned dataset is split into two subsets. The first set contains the input data of the events that feed the next stage, while the second contains the ground truth (reserved for later validation). The ground truth consists of the actual particle composition of each event. OPTICS is an unsupervised ML algorithm and, as such, it is not a classifier; this means that, after the clustering, the ground truth is used to analyze the particle composition of each cluster.
The next stages in the pipeline correspond to feature engineering and feature selection. Feature engineering refers to the construction of features, from the given features, that lead to improved model performance; it relies on transforming the feature space by applying different transformation functions (e.g., arithmetic and/or aggregate operators) to generate new features. Feature selection, in turn, corresponds to the actual choice of the most suitable features to enhance the performance of the ML model [59].
In general terms, the features should contain enough information so that the algorithm is able to properly cluster the signals from the secondary particles. At the same time, features that do not aid the learning process should be removed to avoid the so-called ’curse of dimensionality’ [60]. High-dimensional data (features are also referred to as dimensions of the data in ML) can be extremely difficult to analyze, counter-intuitive and, usually, carry a high computational cost (especially when dealing with Big Data), which in turn can degrade the predictive capabilities of a given ML model. In brief, as the dimensionality increases, the number of data points required for good performance of any ML algorithm increases exponentially. Thus, we have to reach a trade-off: a relatively small number of features that still explain the problem as well as possible.
The initial feature set proposed can be seen in Table 1. In order to select the most suitable features, we performed a cross-correlation analysis, as a good practice to remove highly correlated features that may not add significant new information and can negatively affect the performance of ML algorithms, as stated previously. From the resulting cross-correlation matrix, shown in Figure 4, it can be seen that the features Peak and Pulse Duration were two possible candidates for removal from the final feature set: Peak had a high correlation of 0.94 with Total PEs Deposited and of 0.75 with Pulse Duration, while Pulse Duration had a high correlation of 0.87 with Total PEs Deposited.
After thorough testing (not shown here) using the complete methodology presented in this work, it was found that removing only the feature Peak produced better results; thus, the final feature set used was Total PEs Deposited, Time to Deposit 90%, and Pulse Duration.
Before feeding the features into the ML stage, they were normalized and passed through a step of Principal Component Analysis (PCA). PCA is a feature selection and dimensionality reduction procedure that produces a set of principal components maximizing the variance along each dimension. It assumes a linear relationship between features and produces an ordered dataset in which the first principal component is the one with the most variance and each subsequent principal component is orthonormal to the previous ones. This is a standard process to transform the original dataset into a new one better suited for ML algorithms [61]. The components are linear combinations of the original features that maximally explain the variance in each selected dimension. This final transformed dataset is the output of the ’feature engineering and feature selection’ stage and is passed to the ML stage.
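As a sketch of this step (column names loosely follow Table 1 and are illustrative; the data are random stand-ins, not the simulated events), the correlation check, the removal of Peak, the normalization, and the PCA projection can be chained as follows:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def select_and_project(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the highly correlated 'Peak' feature, normalize, and project with PCA."""
    print(df.corr().round(2))                # cross-correlation matrix (cf. Figure 4)
    features = df.drop(columns=["Peak"])     # keep Total PEs, Time to Deposit 90%, Pulse Duration
    scaled = StandardScaler().fit_transform(features)
    components = PCA().fit_transform(scaled) # components ordered by explained variance
    cols = [f"PCA {i + 1}" for i in range(components.shape[1])]
    return pd.DataFrame(components, columns=cols)

# Toy usage with random stand-in data.
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(1000, 4)),
                   columns=["Total PEs Deposited", "Peak",
                            "Time to Deposit 90%", "Pulse Duration"])
print(select_and_project(toy).head())
```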
The resulting dataset generated from the 24 hours of simulated data was then divided into 24 datasets, one for each hour. The size of each final subset was around 500,000 data points. We performed a grid search to set the OPTICS hyperparameters minPoints and ϵ_max. We found that setting minPoints to 5000 produced the best results, while for ϵ_max a value of 0.5 ensured a good exploration of the space while reducing the run time considerably. A summary can be seen in Table 2.
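A simplified sketch of such a grid search is shown below. The grid values and dataset are illustrative stand-ins (the values adopted in this work were minPoints = 5000 and ϵ_max = 0.5 on the full hourly datasets), and since the selection criterion is not spelled out here, the sketch only records basic clustering statistics for each combination.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Stand-in for one hour of PCA-transformed features (the real hourly sets hold ~500,000 points).
X = np.random.default_rng(2).normal(size=(5000, 3))

results = []
for min_points in (50, 200, 500):
    for eps_max in (0.25, 0.5, 1.0):
        labels = OPTICS(min_samples=min_points, max_eps=eps_max).fit(X).labels_
        results.append((min_points, eps_max,
                        int(labels.max() + 1),          # number of clusters found
                        float(np.mean(labels == -1))))  # fraction of unassigned points

for row in results:
    print(row)
```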
The ML modelling stage consisted of running the OPTICS clustering algorithm to produce the reachability plot and, consequently, selecting a cutoff ϵ_r for the actual clustering (see Section 5). Since we are interested in classifying the secondary particle contributions, we needed to see whether the generated clusters have a majority contribution from a specific particle; this, essentially, would mean that a cluster becomes a classification of a particular particle type. As such, and using the ground truth provided by the simulation, the composition of each cluster was calculated from the OPTICS output (for each hour).
Up to the ML modelling stage of the pipeline, the process is deterministic, so it only needed to be done once. For the ML stage, as mentioned above, the 24 hours of data were divided into 24 one-hour datasets that were processed independently in parallel. This process was then repeated ten times to test both its accuracy and precision. In each run, the output may change (the process is non-deterministic) because in each instance the algorithm may start ordering the points from different initial points; ideally, all runs should converge to similar results. Thus, we computed the final output as the average of the results of the independent runs. This strategy is used to evaluate whether the algorithm consistently yields both accurate and precise results, hence being robust. The final output of the ML modelling stage is the set of clusters obtained and the average and standard deviation of each particle fraction within each cluster.
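The repetition-and-parallelization scheme can be sketched as follows. This is not the LAGO production code: the datasets, hyperparameters, and function names are illustrative, the varying processing order is emulated by shuffling the rows, and the per-cluster particle composition is replaced by simple cluster fractions since the ground truth is not reproduced here.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.cluster import OPTICS

def cluster_one_hour(args):
    """Cluster one one-hour dataset after shuffling its row order
    (emulating the varying processing order that makes each run non-deterministic)."""
    X, seed = args
    order = np.random.default_rng(seed).permutation(len(X))
    labels = OPTICS(min_samples=50, max_eps=0.5).fit(X[order]).labels_
    # Fraction of points falling in each cluster; the real pipeline computes the
    # particle composition against the ground truth instead.
    return {int(c): float(np.mean(labels == c)) for c in np.unique(labels)}

if __name__ == "__main__":
    # Stand-ins for the 24 one-hour PCA-transformed datasets.
    hours = [np.random.default_rng(h).normal(size=(1500, 3)) for h in range(24)]
    runs = []
    for repeat in range(10):                      # 10 independent repetitions, as in the text
        with ProcessPoolExecutor() as pool:
            jobs = [(X, repeat * 24 + h) for h, X in enumerate(hours)]
            runs.extend(pool.map(cluster_one_hour, jobs))
    # Aggregate: here, the mean and spread of the unassigned fraction over all 240 runs.
    noise = [r.get(-1, 0.0) for r in runs]
    print(f"unassigned fraction: {np.mean(noise):.3f} +/- {np.std(noise):.3f}")
```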
In future works, we expect to enhance the pipeline by adding other stages towards achieving better scalability, easy implementation and monitoring in semi-operative mode, and explainability.
Finally, this methodology can be extrapolated and applied to different LAGO sites and WCDs by running the same learning process once to learn their respective actual characteristics. If this model is used after the calibration of a WCD, we can estimate the particle composition in each of the detectors, taking into account the site rigidity cut-off, altitude, and WCD geometry, among other particular characteristics. When, for instance, the water starts ageing, the particle grouping will start varying, and the model will then act as an automatic monitoring tool of the WCD health. This is one of the possible applications of an operative version.

5. Results

A total of 240 runs of the ML pipeline were done: 10 runs per hour of simulated data for a total of 24 hours. Each run employed the OPTICS algorithm, which determined groupings in two steps: a) generating a reachability plot and b) performing the actual clustering based on a cut-off threshold to determine cluster membership.
An example reachability plot obtained from a single hour, shown in Figure 5, displays clear cluster structures. Each cluster is marked with a different color, while points that do not belong to any cluster are marked in black. As described in Section 3, in the reachability plot the X-axis represents the ordered points and the Y-axis represents the reachability distance ϵ_r.
A visual inspection revealed several ’valleys’ that indicate potential clusters. To define these clusters, a threshold must be selected. In our study, we used a fixed cut-off threshold of 0.08 for cluster membership, resulting in a stable 8-cluster structure across all runs. Although we considered different values of ϵ_r, 0.08 consistently provided good results for all datasets used in this work.
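In scikit-learn, this fixed-threshold extraction can be performed with cluster_optics_dbscan applied to a fitted OPTICS model. The sketch below runs on random stand-in data, with ϵ_r = 0.08 as the cut-off; only the cut-off value comes from this work.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

X = np.random.default_rng(3).normal(size=(1000, 3))   # stand-in for one-hour features
model = OPTICS(min_samples=30, max_eps=0.5).fit(X)

# Flat clusters from the reachability ordering, using the fixed cut-off eps_r = 0.08.
labels = cluster_optics_dbscan(
    reachability=model.reachability_,
    core_distances=model.core_distances_,
    ordering=model.ordering_,
    eps=0.08,
)
print("clusters:", int(labels.max() + 1), "| unassigned points:", int(np.sum(labels == -1)))
```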
Each identified cluster (Figure 5) is well-defined. However, while the first six clusters exhibit mostly a singular structure, cluster 7 displays a more complex composition with numerous substructures, appearing as small ’valleys’ within the larger group. These subgroups are absorbed into the larger cluster due to the fixed cut-off threshold that was selected. This particularly complex case needs further investigation, which will be addressed in future work incorporating additional data and adaptive thresholding.
Using the same example run as a reference, Figure 6 shows the corresponding histogram of the total number of PEs deposited by the events, with the y-axis in logarithmic scale. Each cluster is labeled and colored according to the same scheme used in the reachability plot, facilitating comparison between the two figures. The histogram shows three groups with similar behaviour and larger counts (between 6000 and 10000 PEs), in concordance with the muon hump; these groups are clusters 0, 1, and 2.
When analyzing the content of each cluster, Cluster 7 contained the highest number of points, with approximately 431,000 particles, followed by Cluster 0 with around 110,000 particles and Cluster 2 with about 104,000 particles. The remaining clusters each contained between 34,000 and 67,000 particles. These numbers represent the number of particles averaged over the output of each run. It is noteworthy that the number of particles not assigned to any cluster was around 13,000, which is relatively small compared to the number of particles that the algorithm is able to assign to clusters. These values are aligned with the visual groupings observed in Figure 5.
An initial visual inspection suggested that certain clusters are strong candidates, with larger counts of a given secondary particle. For instance, clusters 0, 1, and 2 appeared to be predominantly composed of muons, as evidenced by the well-known muon hump visible in each of these clusters. Additionally, each of the one-hour runs consistently produced similar results for the eight clusters, maintaining a similar composition of particles.
To validate these results, we used the ground truth dataset. For each run, the particle composition was extracted and recorded. To assess the robustness of our findings, we aggregated the results by calculating the average and variation of the outcomes.
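For reference, the per-cluster composition used in this validation can be computed as in the following sketch (particle names and the toy arrays are illustrative), assuming the cluster labels and the ground-truth particle types are aligned arrays:

```python
import numpy as np
import pandas as pd

def cluster_composition(labels, particles):
    """Percentage of each true particle type within each cluster."""
    df = pd.DataFrame({"cluster": labels, "particle": particles})
    counts = df.groupby(["cluster", "particle"]).size().unstack(fill_value=0)
    return 100 * counts.div(counts.sum(axis=1), axis=0)   # row-normalized percentages

# Toy example; in the pipeline this table is computed per run and then averaged
# (mean and standard deviation) over the 240 independent runs.
labels    = np.array([0, 0, 0, 1, 1, -1])
particles = np.array(["muon", "muon", "photon", "photon", "e+/-", "photon"])
print(cluster_composition(labels, particles).round(1))
```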
Table 3 presents the summary statistics for the 240 one-hour runs, detailing the distribution of clusters and secondary particles. The most noteworthy results can be observed in clusters 0, 1, and 2, where the majority of particles are muons, accounting for approximately 96.58%, 89.25%, and 97.27% of the total particles, respectively. These clusters showed minimal presence of other particle types. Cluster 3 contained about 58% muons, which we did not consider a significant majority; the variation for this cluster was around ±1.41% across different runs, indicating some difficulty for the algorithm in consistently grouping these particles. Additionally, Cluster 3 included roughly 23% photons, with other particles present in lesser amounts. Cluster 4 was more heterogeneous, with approximately 46% photons, 23% electrons and positrons, 28% muons, and smaller percentages of neutrons and hadrons. The variation of the constituent particles (e.g., ±0.87% for photons, ±1.15% for muons) suggests that the algorithm varied slightly in how it formed the clusters across runs. The remaining clusters, situated in the lower-energy regions of the histogram (i.e., towards the left side), were predominantly composed of photons: clusters 5, 6, and 7 had photon compositions of approximately 62%, 70%, and 80%, respectively. These clusters also exhibited similar proportions of electrons and positrons (around 20%), with small percentages of other particles. These results imply that the algorithm lacked sufficient information to adequately group the types of secondary particles in these lower-energy regions. Nevertheless, the consistency between the independent one-hour runs indicates that the algorithm reliably produces robust results.
In summary, the statistics revealed three distinct categories of clusters: a) clusters with a majority of muons (Clusters 0, 1, and 2); b) clusters with a majority of photons (Clusters 5, 6, and 7); and c) mixed groups (Clusters 3 and 4). These results are illustrated in Figure 7, which presents a stacked bar chart showing the percentage composition of particles for each cluster. This visualization highlights the algorithm’s high accuracy, especially in grouping the muonic contributions of the simulated data.

6. Discussion and Conclusions

In this work, we proposed an ML pipeline implementing the OPTICS clustering algorithm to identify individual components within a charge histogram derived from synthetic data. The synthetic dataset was generated with the ARTI and MEIGA frameworks of the LAGO software suite, and the Monte Carlo simulation outputs were tailored to fit the pipeline. The dataset encapsulates the characteristics of the LAGO WCD located at the Bariloche site in Argentina, known as ’Nahuelito’. The pipeline can be summarized as a set of linked stages whose output is the outcome of the ML model; these stages include filtering/cleaning, feature engineering and selection, and the ML model itself. Unlike typical unsupervised ML settings, and because the dataset is the output of simulations, we have knowledge of the ground truth.
Using the 24-hour dataset, we developed an end-to-end data science pipeline to implement the OPTICS algorithm, a hierarchical density-based clustering method. The results were then validated against the ground truth and demonstrated that our pipeline can effectively produce well-separated clusters.
Specifically, clusters 0, 1, and 2 predominantly consisted of muons, contributing to the well-known muon hump present in the charge histogram. These findings help to validate the initial results presented in our previous work [19].
Figure 8 presents a zoomed-in view of the charge histogram (in black), alongside clusters 0, 1, and 2. The distinct shapes of the charge distributions for these clusters, which have the highest muon content, reflect the expected differences due to the entry and exit trajectories of muons in the WCDs [62]. Cluster 0 would correspond to signals from muons passing vertically through the detector, associated with the well-known Vertical Equivalent Muon (VEM), a standard observable for the calibration of this type of detector when no secondary detectors are available. Clusters 1 and 2 would correspond to other types of muons, e.g., those arriving at the WCD at different angles.
The predominance of the electromagnetic component (gammas, electrons, and positrons) in clusters 5, 6, and 7, and of muons in clusters 0, 1, and 2, suggests the potential to define bands of maximum content for these components, as well as an intermediate zone with mixed content. This will facilitate not only multispectral analysis but also particle analysis, similar to the methodology employed in the LAGO space weather program [35], which relies on the automatic determination of the WCD response to the flux of secondary particles, especially during astrophysical transients such as those produced during Space Weather events (see Figure 9).
The proposed ML pipeline produced robust results for all 240 independent instances. In addition, repeated results showed minimal variation between runs, demonstrating the stability of the algorithm in reproducing results. Furthermore, the lower-energy regions of the charge histogram, as in cluster 7, showed substructures that need further analysis, planned for future work.
Given that the proposed ML pipeline is planned to be implemented as a semi-automated, onboard, and real-time data analysis and calibration tool across the LAGO distributed network of WCDs, it is crucial to analyze its scalability in the context of Big Data with larger datasets, and its robustness across WCDs with diverse site characteristics. To achieve these objectives, we plan to develop a comprehensive benchmarking framework to automatically and seamlessly test the ML pipeline under various scenarios. As part of the planned automation, we will include hyperparameter tuning, the handling of datasets of increasing size, and further exploration of predictive features. This approach will ensure the model’s adaptability and reliability under varied operational conditions.
Another key advantage of the framework developed in this work is the ability to integrate it seamlessly into the current LAGO software suite, directly using the output from MEIGA simulations. Unlike conventional high-performance computing benchmarks, which have a low dependency on datasets, ML benchmarks are highly dependent on the dataset used for training and inference. Thus, we will perform the benchmarking for each of the simulated LAGO sites and report the output and statistics.
Finally, this proposed benchmark will be an important step towards its deployment at LAGO WCD sites, as we want to ensure an effective use and monitoring of ML methods tailored specifically for each site.

Author Contributions

This work has a multidisciplinary approach, dealing mainly with fields such as astroparticle and space physics, machine learning, scientific programming, and instrumentation. Conceptualization, T. J. Torres Peralta, M. G. Molina, H. Asorey, I. Sidelnik, and S. Dasso; methodology, T. J. Torres Peralta, M. G. Molina, H. Asorey; software, T. J. Torres Peralta, M. G. Molina, H. Asorey, and A. Taboada; investigation, T. J. Torres Peralta, M. G. Molina, H. Asorey, I. Sidelnik, A. J. Rubio-Montero, S. Dasso, R. Mayo-Garcia, A. Taboada, L. Otiniano; resources, H. Asorey, A. J. Rubio-Montero, R. Mayo-Garcia, A. Taboada; data curation, T. J. Torres Peralta; writing, original draft preparation, T. J. Torres Peralta, M. G. Molina, H. Asorey, I. Sidelnik, S. Dasso; writing, review and editing, T. J. Torres Peralta, M. G. Molina, H. Asorey, I. Sidelnik, S. Dasso, R. Mayo-Garcia, L. Otiniano; visualization, T. J. Torres Peralta; supervision, M. G. Molina, H. Asorey. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The LAGO Collaboration is very thankful to all the participating institutions and to the Pierre Auger Collaboration for their continuous support. We acknowledge the ICTP and OIEA grant NT-17 that partially funded stays to carry out this work. This work was partially funded by grant RC-TW-2020-00098, MINCYT, Argentina. This work has profited from computing resources provided by CIEMAT at their Xula (Madrid) and Turgalium (Trujillo) clusters funded with ERDF funds. It has also been partially funded by the EU-LAC ResInfra Plus project funded by the European Commission through its Horizon Europe Program (no. 101131703). SD acknowledges support from the Argentine grant PICT-2019-02754 (FONCyT-ANPCyT).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sidelnik, I.; et al. The capability of water Cherenkov detectors arrays of the LAGO project to detect Gamma-Ray Burst and High Energy Astrophysics sources. 2022 RICH Conference, these proceedings, 2022.
  2. Otiniano, L.; et al. Measurement of the Muon Lifetime and the Michel Spectrum in the LAGO Water Cherenkov Detectors as a tool to improve energy calibration and to enhance the signal-to-noise ratio. 2022 RICH Conference, these proceedings, 2022.
  3. Grieder, P.K.F. Cosmic Rays at Earth; Elsevier, 2001. [CrossRef]
  4. Daglis, I.A.; Chang, L.C.; Dasso, S.; Gopalswamy, N.; Khabarova, O.V.; Kilpua, E.; Lopez, R.; Marsh, D.; Matthes, K.; Nandy, D.; Seppälä, A.; Shiokawa, K.; Thiéblemont, R.; Zong, Q. Predictability of variable solar–terrestrial coupling. Annales Geophysicae 2021, 39, 1013–1035. [Google Scholar] [CrossRef]
  5. Dumbović, M.; Vršnak, B.; Temmer, M.; Heber, B.; Kühl, P. Generic profile of a long-lived corotating interaction region and associated recurrent Forbush decrease. Astronomy & Astrophysics 2022, 658, A187. [Google Scholar] [CrossRef]
  6. Melkumyan, A.A.; Belov, A.V.; Shlyk, N.S.; Abunina, M.A.; Abunin, A.A.; Oleneva, V.; Yanke, V.G. Statistical comparison of time profiles of Forbush decreases associated with coronal mass ejections and streams from coronal holes in solar cycles 23-24. Monthly Notices of the Royal Astronomical Society 2023, 521, 4544–4560. [Google Scholar] [CrossRef]
  7. Simpson, J.A. The Cosmic Ray Nucleonic Component: The Invention and Scientific Uses of the Neutron Monitor - (Keynote Lecture). Space Science Reviews 2000, 93, 11–32. [Google Scholar] [CrossRef]
  8. Aspinall, M.D.; Alton, T.L.; Binnersley, C.L.; Bradnam, S.C.; Croft, S.; Joyce, M.J.; Mashao, D.; Packer, L.W.; Turner, T.; Wild, J.A. A new ground level neutron monitor for space weather assessment. Scientific Reports 2024, 14, 7174. [Google Scholar] [CrossRef]
  9. Pierre Auger Collaboration. The Pierre Auger Observatory scaler mode for the study of solar activity modulation of galactic cosmic rays. Journal of Instrumentation 2011, 6, 1003. [Google Scholar] [CrossRef]
  10. Dasso, S.; Asorey, H.; Pierre Auger Collaboration. The scaler mode in the Pierre Auger Observatory to study heliospheric modulation of cosmic rays. Advances in Space Research 2012, 49, 1563–1569, arXiv:astro-ph.SR/1204.6196. [Google Scholar] [CrossRef]
  11. Durán, M.S.; Asorey, H.; Dasso, S.; Nunez, L.; Peréz, Y.; Sarmiento, C.; LAGO Collaboration. The LAGO Space Weather Program: Directional Geomagnetic Effects, Background Fluence Calculations and Multi-Spectral Data Analysis. 34th International Cosmic Ray Conference (ICRC2015), 2015, Vol. 34, International Cosmic Ray Conference, p. 142. [CrossRef]
  12. Santos, N.A.; Dasso, S.; Gulisano, A.M.; Areso, O.; Pereira, M.; Asorey, H.; Rubinstein, L.; LAGO Collaboration. First measurements of periodicities and anisotropies of cosmic ray flux observed with a water-Cherenkov detector at the Marambio Antarctic base. Advances in Space Research 2023, 71, 2967–2976. [Google Scholar] [CrossRef]
  13. Jamieson, B.; Stubbs, M.; Ramanna, S.; Walker, J.; Prouse, N.; Akutsu, R.; de Perio, P.; Fedorko, W. Using machine learning to improve neutron identification in water Cherenkov detectors. Frontiers in Big Data 2022, 5, 978857. [Google Scholar] [CrossRef]
  14. Conceição, R.; González, B.; Guillén, A.; Pimenta, M.; Tomé, B. Muon identification in a compact single-layered water Cherenkov detector and gamma/hadron discrimination using machine learning techniques. The European Physical Journal C 2021, 81, 1–9. [Google Scholar] [CrossRef]
  15. Bom, C.R.; Dias, L.O.; Conceição, R.; Tomé, B.; de Almeida, U.B.; Moraes, A.; Pimenta, M.; Shellard, R.; de Albuquerque, M.P. Bayesian Deep Learning for Shower Parameter Reconstruction in Water Cherenkov Detectors. PoS 2021, ICRC2021, 739. [Google Scholar] [CrossRef]
  16. Hachaj, T.; Bibrzycki, L.; Piekarczyk, M. Fast Training Data Generation for Machine Learning Analysis of Cosmic Ray Showers. IEEE Access 2023, 11, 7410–7419. [Google Scholar] [CrossRef]
  17. Kalashev, O.; Pshirkov, M.; Zotov, M. Identifying nearby sources of ultra-high-energy cosmic rays with deep learning. Journal of Cosmology and Astroparticle Physics 2020, 2020, 005. [Google Scholar] [CrossRef]
  18. González, B.S.; Conceição, R.; Pimenta, M.; Tomé, B.; Guillén, A. Tackling the muon identification in water Cherenkov detectors problem for the future Southern Wide-field Gamma-ray Observatory by means of machine learning. Neural Computing and Applications 2022, 34, 5715–5728. [Google Scholar] [CrossRef]
  19. Torres Peralta, T.; Molina, M.; Otiniano, L.; Asorey, H.; Sidelnik, I.; Taboada, A.; Mayo-García, R.; Rubio-Montero, A.; Dasso, S. Particle classification in the LAGO water Cherenkov detectors using clustering algorithms. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 2023, 1055, 168557. [Google Scholar] [CrossRef]
  20. Blümer, J.; Engel, R.; Hörandel, J.R. Cosmic rays from the knee to the highest energies. Progress in Particle and Nuclear Physics 2009, 63, 293–338. [Google Scholar] [CrossRef]
  21. Kampert, K.H.; Watson, A.A. Extensive air showers and ultra high-energy cosmic rays: a historical review. The European Physical Journal H 2012, 37, 359–412. [Google Scholar] [CrossRef]
  22. Grieder, P.K.F. Extensive Air Showers and High Energy Phenomena; Springer Berlin Heidelberg, 2010. [CrossRef]
  23. Matthews, J. A Heitler model of extensive air showers. Astroparticle Physics 2005, 22, 387–397. [Google Scholar] [CrossRef]
  24. Heck, D.; Knapp, J.; Capdevielle, J.N.; Schatz, G.; Thouw, T. CORSIKA: A Monte Carlo Code to Simulate Extensive Air Showers. Technical report, 1998.
  25. Engel, R.; Heck, D.; Huege, T.; Pierog, T.; Reininghaus, M.; Riehn, F.; Ulrich, R.; Unger, M.; Veberič, D. Towards A Next Generation of CORSIKA: A Framework for the Simulation of Particle Cascades in Astroparticle Physics. Computing and Software for Big Science 2019, 3, 2. [Google Scholar] [CrossRef]
  26. Sarmiento-Cano, C.; Suárez-Durán, M.; Calderón-Ardila, R.; Vásquez-Ramírez, A.; Jaimes-Motta, A.; Núñez, L.A.; Dasso, S.; Sidelnik, I.; Asorey, H. The ARTI framework: cosmic rays atmospheric background simulations. The European Physical Journal C 2022, 82, 1019. [Google Scholar] [CrossRef]
  27. Asorey, H.; Sarmiento-Cano, C.; Suárez-Durán, M.; Rubio-Montero, A.J. The ARTI Framework, 2022. [CrossRef]
  28. Sarmiento-Cano, C.; Suárez-Durán, M.; Ramírez, A.V.; Jaimes-Motta, A.; Calderón-Ardila, R.; Peña-Rodríguez, J. Modeling the LAGO’s detectors response to secondary particles at ground level from the Antarctic to Mexico. Proceedings of 36th International Cosmic Ray Conference — PoS(ICRC2019). Sissa Medialab, 2019, p. 412. [CrossRef]
  29. Asorey, H.; Núñez, L.A.; Suárez-Durán, M. Preliminary Results From the Latin American Giant Observatory Space Weather Simulation Chain. Space Weather 2018, 16, 461–475. [Google Scholar] [CrossRef]
  30. Grisales-Casadiegos, J.; Sarmiento-Cano, C.; Núñez, L.A.; the LAGO Collaboration. Impact of Global Data Assimilation System atmospheric models on astroparticle showers. Canadian Journal of Civil Engineering 2020, 40, 1–31. [Google Scholar] [CrossRef]
  31. Rubio-Montero, A.J.; Pagán-Muñoz, R.; Mayo-García, R.; Pardo-Diaz, A.; Sidelnik, I.; Asorey, H. The EOSC-Synergy cloud services implementation for the Latin American Giant Observatory (LAGO). Proceedings of 37th International Cosmic Ray Conference — PoS(ICRC2021). Sissa Medialab, 2021, p. 261. [CrossRef]
  32. Desorgher, L.; Bütikofer, R.; Moser, M.R. Geant4 Application for Simulating the Propagation of Cosmic Rays through the Earth’s Magnetosphere. Proceedings of the 28th International Cosmic Ray Conference. Universal Academy Press, 2003, pp. 4281–4285.
  33. Agostinelli, S.; Allison, J.; Amako, K.; Apostolakis, J.; Araujo, H.; Arce, P.; Asai, M.; Axen, D.; Banerjee, S.; Barrand, G.; Behner, F.; Bellagamba, L.; Boudreau, J.; Broglia, L.; Brunengo, A.; Burkhardt, H.; Chauvie, S.; Chuma, J.; Chytracek, R.; Cooperman, G.; Cosmo, G.; Degtyarenko, P.; Dell’Acqua, A.; Depaola, G.; Dietrich, D.; Enami, R.; Feliciello, A.; Ferguson, C.; Fesefeldt, H.; Folger, G.; Foppiano, F.; Forti, A.; Garelli, S.; Giani, S.; Giannitrapani, R.; Gibin, D.; Cadenas, J.G.; González, I.; Abril, G.G.; Greeniaus, G.; Greiner, W.; Grichine, V.; Grossheim, A.; Guatelli, S.; Gumplinger, P.; Hamatsu, R.; Hashimoto, K.; Hasui, H.; Heikkinen, A.; Howard, A.; Ivanchenko, V.; Johnson, A.; Jones, F.; Kallenbach, J.; Kanaya, N.; Kawabata, M.; Kawabata, Y.; Kawaguti, M.; Kelner, S.; Kent, P.; Kimura, A.; Kodama, T.; Kokoulin, R.; Kossov, M.; Kurashige, H.; Lamanna, E.; Lampén, T.; Lara, V.; Lefebure, V.; Lei, F.; Liendl, M.; Lockman, W.; Longo, F.; Magni, S.; Maire, M.; Medernach, E.; Minamimoto, K.; de Freitas, P.M.; Morita, Y.; Murakami, K.; Nagamatu, M.; Nartallo, R.; Nieminen, P.; Nishimura, T.; Ohtsubo, K.; Okamura, M.; O’Neale, S.; Oohata, Y.; Paech, K.; Perl, J.; Pfeiffer, A.; Pia, M.; Ranjard, F.; Rybin, A.; Sadilov, S.; Salvo, E.D.; Santin, G.; Sasaki, T.; Savvas, N.; Sawada, Y.; Scherer, S.; Sei, S.; Sirotenko, V.; Smith, D.; Starkov, N.; Stoecker, H.; Sulkimo, J.; Takahata, M.; Tanaka, S.; Tcherniaev, E.; Tehrani, E.S.; Tropeano, M.; Truscott, P.; Uno, H.; Urban, L.; Urban, P.; Verderi, M.; Walkden, A.; Wander, W.; Weber, H.; Wellisch, J.; Wenaus, T.; Williams, D.; Wright, D.; Yamada, T.; Yoshida, H.; Zschiesche, D. Geant4—a simulation toolkit. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 2003, 506, 250–303. [Google Scholar] [CrossRef]
  34. Sarmiento-Cano, C.; Asorey, H.; Sacahui, J.; Otiniano, L.; Sidelnik, I. The Latin American Giant Observatory (LAGO) capabilities for detecting Gamma Ray Bursts. Proceedings of 37th International Cosmic Ray Conference — PoS(ICRC2021). Sissa Medialab, 2021, p. 929. [CrossRef]
  35. Asorey, H.; Dasso, S.; Núñez, L.; Pérez, Y.; Sarmiento-Cano, C.; Suárez-Durán, M.; et al. The LAGO space weather program: Directional geomagnetic effects, background fluence calculations and multi-spectral data analysis. The 34th International Cosmic Ray Conference, PoS(ICRC2015), 2015, p. 142.
  36. Pérez-Bertolli, C.; Sarmiento-Cano, C.; Asorey, H. Muon Flux Estimation in the ANDES Underground laboratory. Anales AFA 2022, 32, 106–111. [Google Scholar] [CrossRef]
  37. Peña-Rodriguez, J.; Vesga-Ramirez, A.; Vasquez-Ramirez, A.; Suarez-Duran, M.; de Leon-Barrios, R.; Sierra-Porta, D.; Calderon-Ardila, R.; Pisco-Guavabe, J.; Asorey, H.; Sanabria-Gomez, J.D.; Nuñez, L. Muography in Colombia: Simulation Framework, Instrumentation, and Data Analysis. Journal for Advanced Instrumentation in Science 2022, 2022. [Google Scholar] [CrossRef]
  38. Taboada, A.; Sarmiento-Cano, C.; Sedoski, A.; Asorey, H. Meiga, a Dedicated Framework Used for Muography Applications. Journal for Advanced Instrumentation in Science 2022, 2022. [Google Scholar] [CrossRef]
  39. Vásquez-Ramírez, A.; Suárez-Durán, M.; Jaimes-Motta, A.; Calderón-Ardila, R.; Penã-Rodríguez, J.; Sánchez-Villafrades, J.; Sanabria-Gómez, J.D.; Asorey, H.; Núñez, L.A. Simulated response of MuTe, a hybrid Muon Telescope. Journal of Instrumentation 2020, 15. [Google Scholar] [CrossRef]
  40. Vesga-Ramírez, A.; Sanabria-Gómez, J.; Sierra-Porta, D.; Arana-Salinas, L.; Asorey, H.; Kudryavtsev, V.; Calderón-Ardila, R.; Núñez, L. Simulated Annealing for volcano muography. Journal of South American Earth Sciences 2021, 109, 103248. [Google Scholar] [CrossRef]
  41. Ramírez, A.V.; Gómez, M.A.; Moreno, M.C.; Medrano, V.B.; Asorey, H.; Nunez, L.A. Improvised Explosive Devices and cosmic rays. Proceedings of 37th International Cosmic Ray Conference — PoS(ICRC2021). Sissa Medialab, 2021, p. 480. [CrossRef]
  42. Sidelnik, I.; Asorey, H.; Guarín, N.; Durán, M.S.; Berisso, M.G.; Lipovetzky, J.; Blostein, J.J. Simulation of 500 MeV neutrons by using NaCl doped Water Cherenkov detector. Advances in Space Research 2020, 65, 2216–2222. [Google Scholar] [CrossRef]
  43. Sidelnik, I.; Asorey, H.; Guarin, N.; Durán, M.S.; Lipovetzky, J.; Arnaldi, L.H.; Pérez, M.; Haro, M.S.; Berisso, M.G.; Bessia, F.A.; Blostein, J.J. Enhancing neutron detection capabilities of a water Cherenkov detector. Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 2020, 955. [Google Scholar] [CrossRef]
  44. Sidelnik, I.; Asorey, H.; Guarin, N.; Durán, M.S.; Bessia, F.A.; Arnaldi, L.H.; Berisso, M.G.; Lipovetzky, J.; Pérez, M.; Haro, M.S.; Blostein, J.J. Neutron detection capabilities of Water Cherenkov Detectors. Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 2020, 952. [Google Scholar] [CrossRef]
  45. Asorey, H.; Suárez-Durán, M.; Mayo-García, R. ACORDE: A new application for estimating the dose absorbed by passengers and crews in commercial flights. Applied Radiation and Isotopes 2023, 196. [Google Scholar] [CrossRef]
  46. Dawson, B.R. The importance of atmospheric monitoring at the Pierre Auger Observatory. EPJ Web of Conferences 2017, 144, 01001. [Google Scholar] [CrossRef]
  47. Kneizys, F.; et al. The MODTRAN 2/3 report and LOWTRAN 7 model. Technical report, Phillips Laboratory, Hanscom AFB, MA (USA), 1996.
  48. National Oceanic and Atmospheric Administration (NOAA); National Aeronautics and Space Administration (NASA); United States Air Force. US Standard Atmosphere 1976. NOAA technical report NOAA-S/T-76-1562, National Oceanic and Atmospheric Administration, 1976.
  49. NOAA Air Resources Laboratory (ARL). Global Data Assimilation System (GDAS1) Archive Information. https://www.ready.noaa.gov/gdas1.php, accessed on 2023-05-31.
  50. Alken, P.; Thébault, E.; Beggan, C.D.; Amit, H.; Aubert, J.; Baerenzung, J.; Bondar, T.N.; Brown, W.J.; Califf, S.; Chambodut, A.; Chulliat, A.; Cox, G.A.; Finlay, C.C.; Fournier, A.; Gillet, N.; Grayver, A.; Hammer, M.D.; Holschneider, M.; Huder, L.; Hulot, G.; Jager, T.; Kloss, C.; Korte, M.; Kuang, W.; Kuvshinov, A.; Langlais, B.; Léger, J.M.; Lesur, V.; Livermore, P.W.; Lowes, F.J.; Macmillan, S.; Magnes, W.; Mandea, M.; Marsal, S.; Matzka, J.; Metman, M.C.; Minami, T.; Morschhauser, A.; Mound, J.E.; Nair, M.; Nakano, S.; Olsen, N.; Pavón-Carrasco, F.J.; Petrov, V.G.; Ropp, G.; Rother, M.; Sabaka, T.J.; Sanchez, S.; Saturnino, D.; Schnepf, N.R.; Shen, X.; Stolle, C.; Tangborn, A.; Tøffner-Clausen, L.; Toh, H.; Torta, J.M.; Varner, J.; Vervelidou, F.; Vigneron, P.; Wardinski, I.; Wicht, J.; Woods, A.; Yang, Y.; Zeren, Z.; Zhou, B. International Geomagnetic Reference Field: the thirteenth generation. Earth, Planets and Space 2021, 73, 49. [Google Scholar] [CrossRef]
  51. Rubio-Montero, A.J.; Pagan-Munoz, R.; Mayo-Garcia, R.; Pardo-Diaz, A.; Sidelnik, I.; Asorey, H. A Novel Cloud-Based Framework For Standardized Simulations In The Latin American Giant Observatory (LAGO). 2021 Winter Simulation Conference (WSC). IEEE, 2021, Vol. 2021-December, pp. 1–12. [CrossRef]
  52. Calderón-Ardila, R.; Asorey, H.; Almela, A.; Sedoski, A.; Varela, C.; Leal, N.; Gómez-Berisso, M. Development of Mudulus: A Muography Detector Based on Double-Synchronized Electronics for Geophysical Applications. Journal for Advanced Instrumentation in Science 2022, 2022. [Google Scholar] [CrossRef]
  53. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; Bouwman, J.; Brookes, A.J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, S.; Evelo, C.T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A.J.; Groth, P.; Goble, C.; Grethe, J.S.; Heringa, J.; ’t Hoen, P.A.; Hooft, R.; Kuhn, T.; Kok, R.; Kok, J.; Lusher, S.J.; Martone, M.E.; Mons, A.; Packer, A.L.; Persson, B.; Rocca-Serra, P.; Roos, M.; van Schaik, R.; Sansone, S.A.; Schultes, E.; Sengstag, T.; Slater, T.; Strawn, G.; Swertz, M.A.; Thompson, M.; van der Lei, J.; van Mulligen, E.; Velterop, J.; Waagmeester, A.; Wittenburg, P.; Wolstencroft, K.; Zhao, J.; Mons, B. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 2016, 3, 160018. [Google Scholar] [CrossRef]
  54. Ankerst, M.; Breunig, M.M.; Kriegel, H.P.; Sander, J. OPTICS: ordering points to identify the clustering structure. SIGMOD Rec. 1999, 28, 49–60. [Google Scholar] [CrossRef]
  55. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press, 1996, KDD’96, p. 226–231.
  56. Schubert, E.; Gertz, M. Improving the Cluster Structure Extracted from OPTICS Plots. Lernen, Wissen, Daten, Analysen, 2018.
  57. Wang, J.; Schreiber, D.K.; Bailey, N.; Hosemann, P.; Toloczko, M.B. The Application of the OPTICS Algorithm to Cluster Analysis in Atom Probe Tomography Data. Microscopy and Microanalysis 2019, 25, 338–348. [Google Scholar] [CrossRef]
  58. Biswas, S.; Wardat, M.; Rajan, H. The art and practice of data science pipelines: A comprehensive study of data science pipelines in theory, in-the-small, and in-the-large. Proceedings of the 44th International Conference on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2022; ICSE ’22, pp. 2091–2103. [Google Scholar] [CrossRef]
  59. Nargesian, F.; Samulowitz, H.; Khurana, U.; Khalil, E.B.; Turaga, D.S. Learning Feature Engineering for Classification. IJCAI, 2017, Vol. 17, pp. 2529–2535.
  60. Altman, N.; Krzywinski, M. The curse(s) of dimensionality. Nature Methods 2018, 15, 399–400. [Google Scholar] [CrossRef]
  61. Tang, J.; Alelyani, S.; Liu, H. Feature Selection for Classification: A Review. In Data Classification; CRC Press, 2014; pp. 37–64. [CrossRef]
  62. Etchegoyen, A.; Bauleo, P.; Bertou, X.; Bonifazi, C.; Filevich, A.; Medina, M.; Melo, D.; Rovero, A.; Supanitsky, A.; Tamashiro, A.; others. Muon-track studies in a water Cherenkov detector. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 2005, 545, 602–612. [Google Scholar] [CrossRef]
Data assimilation is the adjustment of the parameters of any specific atmospheric model to the real state of the atmosphere, measured by meteorological observations.
Figure 1. 2D projections of the distributions between the different components obtained using PCA for the selected features in [19]. By visual exploration, one can observe the presence of different ’groups’ aggregated within other groups of varying density, showing the complexity of the data.
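For reference, projections of this kind can be produced with a standard PCA. The sketch below is illustrative only: the feature file names and the coloring by the simulated particle type are assumptions, not the exact procedure of [19].

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: feature matrix (one row per pulse); y: simulated particle type, used only for coloring.
X = StandardScaler().fit_transform(np.load("pulse_features.npy"))   # hypothetical file
y = np.load("pulse_labels.npy", allow_pickle=True)                  # hypothetical file

components = PCA(n_components=3).fit_transform(X)

# Pairwise 2D projections of the first three principal components.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (i, j) in zip(axes, [(0, 1), (0, 2), (1, 2)]):
    for label in np.unique(y):
        sel = y == label
        ax.scatter(components[sel, i], components[sel, j], s=1, alpha=0.3, label=str(label))
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
axes[0].legend(markerscale=10, fontsize="small")
plt.tight_layout()
plt.show()
```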
Figure 2. Construction of a reachability plot. Cluster 1 can be observed within the reachability plot as a valley (see the marked blue arrow). The algorithm can distinguish clusters with different densities of points in different hierarchies. Figure adapted from Wang et al., 2019 [57].
Figure 3. General data pipeline for the ML modelling.
Figure 4. Cross-correlation matrix of initial feature set.
Figure 5. Example of a reachability plot from one run of the ML pipeline. A consistent cut-off threshold of 0.08 was used in every run, and each independent run produced a similar plot. Points belonging to a cluster are colored and labeled accordingly, while points in black were not assigned to any cluster. A total of eight clusters were found.
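As an illustration of how flat clusters can be extracted from a reachability plot with a fixed cut-off, the sketch below uses scikit-learn's OPTICS utilities; the feature file and the hyperparameters (taken from Table 2) stand in for the actual pipeline configuration and are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(np.load("pulse_features.npy"))  # hypothetical file
optics = OPTICS(min_samples=5000, max_eps=0.5).fit(X)

# Reachability distances in the OPTICS ordering; valleys correspond to clusters.
reachability = optics.reachability_[optics.ordering_]

# Cut the reachability plot at a fixed threshold to obtain flat clusters.
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=0.08,  # cut-off threshold quoted in the caption
)

plt.figure(figsize=(10, 3))
plt.plot(np.arange(len(reachability)), reachability, lw=0.5, color="black")
plt.axhline(0.08, color="red", ls="--", label="cut-off = 0.08")
plt.xlabel("Points in OPTICS ordering")
plt.ylabel("Reachability distance")
plt.legend()
plt.tight_layout()
plt.show()
```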
Figure 6. Histogram of charge, in black, with the resulting clusters labeled and colored. The y-axis is on a logarithmic scale for clarity.
Figure 7. Stacked bar chart showing the percentage particle compositions for each cluster. It can be seen that the algorithm is very accurate in grouping muonic contributions.
Figure 8. Zoomed-in view of the charge histogram, in black, highlighting clusters 0, 1, and 2.
Figure 9. Zoomed-in view of the charge histogram, in black. Clusters have been combined into three groups: group 1 comprises clusters 0, 1, and 2; group 2 comprises clusters 3 and 4; and group 3 comprises clusters 5, 6, and 7.
Table 1. Name and description of initial feature set.
Name | Description
Total PE Deposited | Total amount of PEs deposited by an event in the WCD.
Peak | Maximum of the pulse generated by the PEs during an event.
Time to Deposit 90% | Time that it took for the event to deposit 90% of the PEs generated.
Pulse Duration | Duration of the pulse generated by the PEs during an event.
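As an illustration of how features like those in Table 1 could be derived from a digitized WCD pulse, the sketch below computes them from a 1-D trace of photoelectron counts; the sampling period, the helper name, and the array layout are assumptions rather than the MEIGA output format.

```python
import numpy as np

def pulse_features(trace, dt_ns=8.0):
    """Compute Table 1-style features from a single digitized pulse.

    trace : 1-D array of photoelectron (PE) counts per time bin.
    dt_ns : sampling period in nanoseconds (assumed value).
    """
    total_pe = trace.sum()                              # Total PE Deposited
    peak = trace.max()                                  # Peak
    cumulative = np.cumsum(trace)
    t90_bin = np.searchsorted(cumulative, 0.9 * total_pe)
    time_to_90 = t90_bin * dt_ns                        # Time to Deposit 90%
    nonzero = np.nonzero(trace)[0]
    duration = (nonzero[-1] - nonzero[0] + 1) * dt_ns   # Pulse Duration
    return np.array([total_pe, peak, time_to_90, duration])

# Example: stack the features of many pulses into the matrix used by the ML pipeline.
# traces = [...]  # list of 1-D arrays, one per simulated WCD pulse
# X = np.vstack([pulse_features(t) for t in traces])
```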
Table 2. OPTICS hyperparameters selected after grid search.
Parameter Name | Value
minPoints | 5000
ϵ_max | 0.5
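For orientation, a grid search over these hyperparameters could look like the sketch below, where minPoints and ϵ_max map onto the min_samples and max_eps arguments of scikit-learn's OPTICS; the candidate grids, the scoring by cluster count and unclustered fraction, and the feature file are illustrative assumptions, not the actual selection procedure.

```python
import itertools
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(np.load("pulse_features.npy"))  # hypothetical file

# Candidate values bracketing the selected ones (minPoints = 5000, eps_max = 0.5).
grid = {"min_samples": [1000, 2500, 5000, 10000], "max_eps": [0.25, 0.5, 1.0]}

for min_samples, max_eps in itertools.product(grid["min_samples"], grid["max_eps"]):
    labels = OPTICS(min_samples=min_samples, max_eps=max_eps).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise = np.mean(labels == -1)
    print(f"minPoints={min_samples:>6}  eps_max={max_eps:<4}  "
          f"clusters={n_clusters}  unclustered={noise:.1%}")
```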
Table 3. Cluster compositions for 240 independent one-hour runs of the ML pipeline for the 24 hours of simulated data.
No. | Photons | Electrons & Positrons | Muons | Neutrons | Hadrons
0 | 1.41% ± 0.12% | 1.61% ± 0.11% | 96.58% ± 0.24% | 0.20% ± 0.02% | 0.20% ± 0.02%
1 | 5.13% ± 0.34% | 4.53% ± 0.25% | 89.25% ± 0.59% | 0.49% ± 0.03% | 0.60% ± 0.02%
2 | 0.91% ± 0.15% | 1.35% ± 0.15% | 97.27% ± 0.30% | 0.22% ± 0.02% | 0.33% ± 0.02%
3 | 23.08% ± 0.96% | 15.20% ± 0.55% | 58.46% ± 1.41% | 1.38% ± 0.07% | 1.88% ± 0.08%
4 | 46.45% ± 0.87% | 22.78% ± 0.44% | 27.86% ± 1.15% | 1.34% ± 0.05% | 1.58% ± 0.08%
5 | 62.45% ± 0.51% | 22.92% ± 0.25% | 12.76% ± 0.67% | 0.90% ± 0.03% | 0.97% ± 0.07%
6 | 70.45% ± 0.43% | 20.01% ± 0.25% | 6.71% ± 0.37% | 2.12% ± 0.15% | 0.71% ± 0.03%
7 | 80.60% ± 0.61% | 9.16% ± 0.07% | 1.30% ± 0.06% | 8.23% ± 0.61% | 0.71% ± 0.03%
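Statistics of this kind can be assembled by aggregating, over the independent runs, the fraction of each simulated particle type falling in each cluster. The following pandas sketch assumes each run provides a table with hypothetical 'cluster' and 'particle' columns; it is not the pipeline's actual bookkeeping code.

```python
import pandas as pd

def composition_table(runs):
    """Aggregate per-cluster particle compositions over independent runs.

    runs : list of DataFrames, one per run, with columns
           'cluster' (OPTICS label) and 'particle' (true type from the simulation).
    Returns the mean and standard deviation of the percentage composition per cluster.
    """
    per_run = []
    for i, df in enumerate(runs):
        df = df[df["cluster"] >= 0]  # drop unclustered pulses (label -1)
        pct = (df.groupby("cluster")["particle"]
                 .value_counts(normalize=True)
                 .mul(100)
                 .rename("percent")
                 .reset_index())
        pct["run"] = i
        per_run.append(pct)
    allruns = pd.concat(per_run)
    return (allruns.groupby(["cluster", "particle"])["percent"]
                   .agg(["mean", "std"]))

# Example usage: composition_table(list_of_run_dataframes).round(2)
```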
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.