Although MSProfileR can be applied beyond the entomological domain, its performance was assessed here with two datasets from arthropods. The two datasets of MS spectra were very different from each other, both in terms of sample diversity, but also in terms of the number of spectra tested. The first dataset was intentionally heterogeneous and included several distinct arthropod families. The second consisted in a single mosquito genera but included 13 species with a larger dataset.
3.1. Use Case N°1: Arthropod Families
This first dataset is composed of several arthropod species all laboratory reared. It comprises three arthropod families, mosquitoes (Culicidae), ticks (Ixodidae) and fleas (Pulicidae), including six species at two developmental stages for some of them. According to family and developmental stages, different body parts were submitted to MS analysis. Several recent papers established standardized operational protocols for sample preparation of these arthropod families [
7,
8,
11]. Details about the arthropod species used for MS spectra analysis are summarized in
Table 2. The breeding conditions and sample preparations are also indicated in this table, with respective references. This first dataset consisted in a total of 192 MS spectra coming from 48 samples, analysed in quadruplicate. These spectra come from our home-made reference spectra DB [
30] and are then considered as high quality (Supplementary file S1).
The 192 MS spectra were uploaded in MSProfileR. The number of files uploaded and the name of samples can be controlled on the “data loading” tab (
Supplementary Figure S1). As it is the first dataset of spectra analysed with this tool, no file containing MSProfileR parameters set in a previous analysis was available. Then, all the parameters were selected in the present analysis.
In the preprocessing tab (
Supplementary Figure S2A), MS spectra were trimmed in the range of 2-20 kDa and were submitted to conformity tests (
Supplementary Figure S2B and S2C). All spectra passed these tests. The parameters selected in the spectra cleaning panel were: square root method for transforming intensity, Savitzky-Golay method with a half-window size of 10 for smoothing intensity, SNIP method with 100 iterations for removing baseline and TIC method for normalizing the intensity of MS spectra. The result of the cleaning steps can be visualized on a plot for each spectrum (
Supplementary Figure S2D). The Q estimator and the RC method with a threshold of 3 were computed for detecting of outliers in the quality control panel (
Supplementary Figure S2E). In the present dataset, an upper atypical score (A score) limit of 0.59 revealed six spectra among 192 as outliers (3.1%). Interestingly, unticking the “Include spectra below the lower threshold” button did not exclude additional spectra. This underlines that none of the spectra with a low A score were considered outlier. All spectra classified as outliers are automatically removed from the next steps of the analysis. However, it remains possible to visualize all outliers and keep them (
Supplementary Figure S2F). Using MSProfileR, all outlier spectra were compared with their respective replicates (
Supplementary Figure S6). These spectra with higher A score (i.e., upper the limit) were clearly confirmed to be of lower quality, particularly because they presented a higher background noise. These spectra came from 5 distinct samples, two replicates from one
Ae. albopictus at larval stage, two spectra from legs of two
Ae. aegypti specimens and two spectra from cephalothoraxes of two
Ct. felis specimens. As no sample had four replicates considered outliers, removing the six outlier spectra did not suppress any sample, which was confirmed in the averaging panel (
Supplementary Figure S2G). The averaging of selected replicates was done with the mean methods revealing that 48 samples were available for further analysis.
In the processing tab (
Supplementary Figure S3A), the MAD method with a half-window of 20 was applied for peak detection. Peak detection is directly linked to the signal-to-noise (SNR) value selected. In this dataset, when the SNR increases from 2 to 7, the mean number of peaks detected per spectrum decreases from about 144 to 39, respectively (
Supplementary Figure S3B). The choice of the optimal SNR value could be decisive for the next steps of the analysis. To determine the most appropriate SNR value, a comparison of the peak list detected per arthropod family and per body part was carried out based on the SNR value from two to five (
Supplementary Figure S7). It was noticed that at SNR equal to two, numerous peaks were detected between 2 to 8 kDa, among which several are unspecific and correspond to background noise. At SNR equal to three, the majority of these peaks of very low intensity were deleted. Moving to SNR equal to four, regardless of the sample types, some peaks which do not correspond to background noise were no longer detected. In such condition, some information about the protein profile of the sample was lost. To avoid this phenomenon, a SNR value of three was selected for this dataset. At this SNR value, the number of peaks detected per spectrum was 103.3 ± 11.5 (mean ± standard deviation (SD)).
After peak detection, the alignment of spectra was done (
Supplementary Figure S3C). The strict method with a minimum frequency of 50% and a tolerance of 0.002 were set for selecting reference peaks. A total of 33 peaks met these criteria and were used to align spectra from the dataset by applying lowess method for spectra warping. For the binning step, the strict method with a tolerance of 0.002 was set and the parameter for peak filtering was a prevalence of at least 20% for a m/z value to be conserved (
Supplementary Figure S3D and S3E). Among the 516 peak positions obtained after the binning step, the filtering reduced this number to 202. Finally, these 202 peaks were used to classify each sample based on their averaged MS profiles using a hierarchical clustering (
Supplementary Figure S3F). The clustering confirmed that all averaged spectra from the same sample type were grouped per species and body parts. Interestingly, the first criterion of classification is the body part followed by the species (
Supplementary Figure S8).
In the annotation tab (
Supplementary Figure S4A), the list of the sample name of the 48 averaged spectra can be downloaded and was completed with several metadata from each sample prior to be uploaded (
Supplementary Figure S4B and S4C). The annotation processing table allows to verify the correspondence of the averaged spectra with the metadata (
Supplementary Figure S4D). Here, as the number of averaged spectra and annotated samples are identical (n=48) and as none of them was not paired, then all averaged spectra from the dataset N°1 were annotated (Supplementary fileS1).
In the output tab (
Supplementary Figure S5A and S5(a)), several kinds of files could be downloaded. The report file detailed modules, methods and parameters applied, plus information about the dataset N°1 and the results (tables, plots, boxplot, gelviews and clustered heatmap) of the preprocessing and processing parts (
Supplementary Figure S5B and S5(b)). The parameters could be uploaded as a JSON file and used for analysing a new dataset (
Supplementary Figure S5C and S5(c)). This also ensures the traceability and the reproducibility of the analysis. The plots generated by the application during dataset N°1 analysis can also be exported (
Supplementary Figure S5E and S5(e)), as well as the intensity matrix (
Supplementary Figure S5D and S5(d)) and the HDF5 files (
Supplementary Figure S5F and S5(f)). All the outputs from this dataset are downloadable in one zipped file which is available in the Supplementary file S2.
3.2. Use Case N°2: Culex Genus
This second dataset included MS spectra from Neotropical
Culex mosquitoes (Diptera: Culicidae) coming from a work recently published [
31]. The methods and protocols used for mosquito collection, for their morphological and/or molecular identification, for sample preparation and MS submission were detailed in this previous study [
31]. Briefly, this dataset was composed of MS spectra from legs and thoraxes of 13 distinct
Culex species collected in the field, in French Guiana. The number of specimens per species varied according to the availability of the sample, ranging from 1 to 34 (Supplementary file S3).
The dataset N°2 has interesting characteristics for the evaluation of MSProfileR. First, it includes cryptic species, morphologically undistinguishable. Secondly, specimens were collected in the field and then higher inter-sample variations could occurred among specimens from the same species compared to laboratory reared specimens. Thirdly, it encompasses a large number of samples, 169 mosquito specimens submitted to MALDI-TOF MS on two distinct body parts (legs and thoraxes) and loaded in quadruplicates, corresponding to a total of 1352 spectra (169 specimens × two body parts × four replicates). Finally, in the previous work using the same dataset [
31], some spectra were excluded by the authors based on their visual inspection which were considered as low quality. The MSProfileR tool offers now the opportunity to detect and visualize all spectra considered as outliers and can then automatically exclude them from the dataset. The comparison of the spectra list excluded by the experimenters on the same dataset in the previous work will allow to assess the relevance of MSProfileR tool.
For analyzing the dataset N°2, the parameters applied for the dataset N°1 were uploaded. Uniquely the modified parameters compared to the dataset N°1 were specified. In the preprocessing part, all the 1352 spectra passed the conformity tests. However, the quality control steps revealed that 65 spectra (4.8%) exceeded the upper limit of the atypical threshold (A score = 0.72) and were considered as outliers (Figure 3D). Interestingly, all spectra classified as outliers had leg origin (Figure 3A, Supplementary file S4).
Several studies reported a higher diversity of spectra between body parts than between species [
9,
32]. Moreover, spectra from thoraxes generally had more numerous and higher intensity peaks than the paired leg spectra [
11,
31]. These two factors could likely explain why numerous spectra originating from legs were classified as outliers. To avoid this bias of exclusion, dataset N°2 was analyzed by splitting the list of spectra per body part, legs or thoraxes.
By applying the same parameters, “A” score thresholds of 0.93 and 0.59 were obtained for leg and thorax spectra, respectively (See report in Supplementary file S5 and Supplementary file S6). Nine and 20 spectra from leg and thorax, respectively, were classified as outliers. The inspection of these spectra confirmed the low quality of these outlier profiles compared to typical spectra. However, some spectra with a low-quality profile now scored below the threshold. (Figure 3E). Then, the threshold parameter of the quality control step was adjusted at 1.5 in order to exclude these spectra of low quality. “A” score thresholds of 0.73 and 0.52 were obtained for leg (Figure 3B) and thorax (Figure 3C) spectra, respectively. Details of the spectra listed as outliers are available in the reports of leg (Supplementary file S7) and thorax datasets (Supplementary file S8). Although 62 (9.2%) and 70 (10.4%) spectra from leg and thorax samples were classified as outliers (Figure 3D), the number of samples excluded remains modest, less than 5%. Effectively, the four spectra replicates from leg and thorax of eight and six specimens were excluded from the analysis. It is interesting to note that the eight specimens for which leg spectra were classified as outliers, encompassed those from the two
Culex idottus which were excluded by the authors due to the low intensity and inter-sample heterogeneity of MS profiles in the previous work [
31]. The inspection of the other spectra for which the four replicates were classified as outliers from legs or thoraxes confirmed the lower quality of these protein profiles. These samples were then excluded from the analysis. A total of 161 and 163 averaged spectra from legs or thoraxes, respectively, passed the preprocessing steps.
To check whether MSProfileR tool succeeded in classifying spectra according to species for each body part, averaged spectra were submitted to the peak detection steps by applying the parameters used in the dataset N°1. The filtering step revealed that 204 and 194 peaks for legs (Supplementary file S7) and thoraxes (Supplementary file S8), respectively, were retained for averaged spectra classification. Hierarchical clustering revealed that thorax averaged spectra were grouped per species. Solely two
Cx. usquatus (#TH_24 and #TH_82) were not clustered with the other spectra from the same species; and the thorax averaged spectra from the unique
Cx. adamesi (#TH_2) was classified inside the
Cx. dunni group (Supplementary figure S10A). The clustering of leg averaged spectra appeared less efficient, with several samples intertwining with other species (Supplementary figure S10B). The lower classification of the leg averaged spectra compared to thorax was attributed, in part, to their lower spectra intensity. Effectively, as shown in the present study and in concordance with previous works, leg spectra were generally less intense than those of thoraxes from the same specimens [
11,
31].
For thoraxes, among the seven samples, from the previous analysis [
31], which did not reach the threshold value (LSV>1.8) for relevant identification, for five of them, two or more of their spectra replicates were considered as outliers, leading to the exclusion of three samples, for which all replicates overtaken the “A” score threshold. The two other thorax samples (#TH_233 and #TH_234) from
Cx. dunni conserved by the quality control step, obtained nearly relevant identification scores, 1.78 and 1.74, respectively, in the previous study [
31]. The clustering of these last two thorax averaged spectra (#TH_233 and #TH_234) with the other samples from the same species support the conservation of their respective spectra by quality control steps.
Interestingly, the previous work [
31] reported that the selection of the top ten of mass peak list per
Culex species and per body part appeared sufficient to discriminate these 13
Culex species with a correct classification upper than 90%. Here, during the filtering step, all peaks with a frequency lower than 20 % (i.e., 0.2) across the dataset were excluded from the analysis. In the dataset N°2, for five species, one to five specimens were available which represents less than 1 to 4% of the number of averaged spectra of the dataset N°2 (161 for legs and 163 for thoraxes). It is then possible that some peaks specific to these species were not included in the intensity matrix due to their low representation (i.e., too few specimens from the same species to reach filtering threshold), which could explain the imperfect clustering of the samples per
Culex species, notably for legs.
The decrease of the peak filtering from 20% to 0.5% (i.e., 0.005) led to the inclusion of 878 peaks in the intensity matrix without improving drastically the classification of leg averaged spectra. Among the peaks added in the intensity matrix, about 81.7% (n=717), some of them are heterogeneous between averaged spectra from the same species which should perturb clustering. MSProfileR then appears well adapted for the detection of atypical spectra, but also for their classification at condition that a subgroup is not too under-representative, which could alter the classification. Moreover, spectra of high intensity remain essential for relevant classification.
The duration of computational analyse is another parameter to take into account. Generally, the quickness of the analysis is directly linked to the power of the computer used. In the present work, the analysis was done on a laptop with classical configuration (Processor: Intel (R) Core(TM) i7-10510U CPU @ 1.80 GHz, Hard disk: RAM: 16 GiB, Graphics card: 00:02.0 VGA compatible controller: Intel Corporation UHD Graphics (rev 02), Operating System: Ubuntu 20.04.). The complete pipeline process for analysing the 1352 spectra of dataset N°2 took less than four minutes, by applying default parameters. This processing speed allows to the user to compare/adjust methods and parameters throughout the workflow without a loss of time.