1. Introduction
Voice results from the complex configuration, arrangement and coordination of the elements that make up the phonatory apparatus, the respiratory system and the central nervous system: abnormal neurological and anatomical characteristics, which often accompany genetic syndromes, can alter voice production. Over the years, several papers have investigated the detection of voice pathologies due to benign formations (e.g., nodules, polyps), neuromuscular disorders (e.g., vocal cord paralysis) [1, 2] as well as neurodegenerative diseases such as Parkinson's and Alzheimer's [3, 4, 5]. Anomalies of the vocal tract, larynx and vocal folds can be identified by analyzing important acoustical parameters that are perceptually rated by experienced physicians and objectively assessed by dedicated software. In this latter context, some of the most important parameters are [6]:
The fundamental frequency (F0) that describes the frequency of vibration of the vocal folds.
The first formant (F1), related to front-half oral cavity constriction: the greater the cavity, the lower F1. Furthermore, F1 is raised by pharyngeal tract constriction.
The second formant (F2), linked to tongue movements: it is lowered by back-tongue constriction and increased by front-tongue constriction.
The third formant (F3), which depends on lip rounding: the more accentuated this configuration, the lower F3.
F0 is inversely proportional to the size and thickness of the vocal folds, while formants F1-F3 are inversely proportional to the vocal tract length.
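The inverse relation between formants and vocal tract length can be illustrated with a rough sketch. It assumes the textbook approximation of the vocal tract as a uniform tube closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / (4L). This idealized model, the speed of sound and the two tract lengths are assumptions introduced here for illustration only; they are not part of the proposed procedure.

```python
def tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end (quarter-wave model).

    length_m       -- assumed vocal tract length in metres
    speed_of_sound -- assumed speed of sound in warm, humid air (m/s)
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, n_formants + 1)]


# Illustrative lengths only: a longer (adult-like) and a shorter (child-like) tract.
adult_like = tube_formants(0.175)   # roughly 500, 1500, 2500 Hz
child_like = tube_formants(0.125)   # roughly 700, 2100, 3500 Hz

for name, values in (("adult-like", adult_like), ("child-like", child_like)):
    print(name, [round(v) for v in values])
```

In this simplified model, shortening the tube raises every formant by the same factor, which is consistent with the higher formant values expected in children.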
In the last two decades, acoustical analysis has started to be applied to patients affected by genetic syndromes, such as Costello (OMIM #218040, CS), Down (OMIM #190685, DS), Noonan (OMIM #163950, NS) and Smith-Magenis (OMIM #182290, SMS) syndromes, leading to interesting results that point to markedly irregular voices. Non-invasive semeiotics of these diseases is generally carried out by studying somatic traits. To obtain a more detailed phenotype, a promising approach is objective acoustical analysis aimed at identifying a set of parameters associated with individual pathological conditions. Perceptually, low tonality and intensity of voice, as well as hoarseness, are typical characteristics of CS adult individuals [7]. Norman [8] detected in CS children a severe deficit of articulation and language by means of the GFTA-3 test in comparison to age-matched healthy subjects (HS). However, the rarity of this syndrome has made it particularly difficult to outline a precise acoustical profile, and indeed no objective acoustical analysis has been carried out on these patients. In Down syndrome, Moura et al. [9] detected statistically significant differences in comparison to HS concerning F0 for sustained vowels /a/, /e/, /i/ and /ɔ/ as well as for formants F1-F2 and HNR measures in Portuguese-speaking children. In adults, Bunton and Leddy [10] highlighted articulation difficulties that reduce vowel space and intelligibility. Other studies show higher values for F0 in DS adults with respect to normophonic individuals, while the perceptual assessment in DS subjects, and in children in comparison to healthy, age-matched subjects [11, 12]. Turkyilmaz et al. [13] analyzed with the MDVP software [14] the sustained vowel /a/ of 11 children affected by Noonan syndrome: no significant differences were detected with respect to a control group except for the soft phonation index (SPI). Using the GFTA-2 test, Pierpont et al. [15] highlighted that 20% of NS children and adolescents present articulatory deficits. In a single-case report, Wilson and Dyson [16] noticed vowel neutralization and nasalization in a female child. In the study by Garayzabal-Heinze et al. [17], SMS adults showed higher F0 values for the vowel /a/ as compared to normophonic ones, but no significant differences were found in voice perturbation measures. In another work [18], the same authors performed a similar experiment with SMS children and analyzed formants F1-F2 and cepstral peak prominence (CPP) [19], an important parameter for assessing dysphonia. Only CPP showed a significant difference between patients and control group. These studies suggest that acoustical analysis can provide physicians and speech therapists with useful information. However, each of them focused on a single pathology and identified acoustical parameters that show statistically significant differences only in comparison to the normophonic case. Thus, it is important to extend this research by analyzing and comparing vocal phenotypes across syndromes, in order to find significant alterations that could support and speed up differential diagnosis. In this work, we propose a procedure for the standardization of the recording and analysis of the voice of patients affected by several genetic syndromes. Specifically, this procedure focuses on the most suitable devices for voice recordings and on which vocal tasks should be uttered. To ensure repeatability, we also propose signal pre-processing and feature extraction steps. Furthermore, this work describes how to carry out the statistical analysis, the development of machine learning models and the performance assessment. The feasibility and robustness of this procedure were tested by applying the proposed approach to 72 patients recruited at the Fondazione Policlinico Universitario A. Gemelli (FPUG) in Rome, Italy.
4. Discussion
This paper proposes an innovative and detailed procedure for the assessment of voice characteristics of patients affected by genetic diseases. It was developed following the general guidelines provided by otolaryngological societies and associations and by reviewing literature articles on voice analysis and automatic voice quality assessment. This is a first attempt to standardize the acquisition, analysis and classification of voice samples from subjects affected by genetic syndromes. Acoustical analysis represents a promising non-invasive approach in this clinical field; therefore, this paper aims at establishing first ground rules for uniform and comparable results. For the recordings, quiet rooms and a Huawei Mate 10 Lite (RNE-L21) smartphone were used. Though rigorous, the proposed procedure is easily adaptable to other pathologies. Moreover, it might also be applicable to languages other than Italian, provided that language-specific vocal tasks are taken into account. As this is an exploratory analysis, we validated the procedure with statistical analysis and machine learning techniques and reported the outcome. Age range and gender were taken into account, which allowed more reliable acoustical parameters to be obtained. Voice properties were compared not only between healthy and pathological subjects but also among genetic syndromes. Specifically, the considered syndromes were: Costello (CS), Down (DS), Noonan (NS) and Smith-Magenis (SMS). The results are discussed in this order.
CS paediatric subjects did not show any statistically significant difference in acoustical parameters with the exception of F0 std /a/, which could reflect a lower ability to sustain vowel emission with respect to (w.r.t.) SMS patients due to generalized hypotonia or neck muscle spasticity [53]. Articulation deficits were highlighted by the vowel triangle (shrunk and left-shifted diagram in Figure 3) and may depend on detectable deformations of the vocal tract such as ogival palate, macroglossia, hypopharyngeal velum laxity and supraglottic stenosis [53]: these signs, as well as pharynx structure malformations, might cause difficulties in tongue movements. Indeed, statistical analysis detected significant differences in formant ratios (related to tongue motor ranges) and articulatory measures, e.g., F1a/F1i w.r.t. DS (p-value=0.006) and HS (p-value=0.014), FCR w.r.t. NS (p-value=0.021), DS (p-value=0.015) and HS (p-value<0.001). In the FA group of the Costello syndrome, the fundamental frequency of vowel /a/ presents significant differences w.r.t. NS (p-value=0.004) and of /u/ w.r.t. NS (p-value=0.006) and DS (p-value=0.018). Vocal instability and noise metrics computed for /I/ showed significant differences as well: jitter w.r.t. DS (p-value=0.018) and NNE w.r.t. NS (p-value=0.026). This latter finding is in agreement with perceptual evaluation of the CS voice, which is defined as hoarse [7]. Hypotonia-related constraints on lip and tongue movements, especially in reaching their limit positions, and pharyngeal space reduction due to macroglossia could be the reason for significant differences in F2 mean /a/ and F2 max /a/ w.r.t. the control group (p-value=0.031 and p-value=0.005, respectively).
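The articulatory measures discussed here (VSA, FCR and formant ratios) can be sketched as follows. The sketch assumes the definitions commonly used in the literature: the triangular vowel space area built on the corner vowels /a/, /i/, /u/, and the formant centralization ratio FCR = (F2u + F2a + F1i + F1u) / (F2i + F1a). The exact definitions adopted in this study may differ, and the formant values below are illustrative numbers, not data from the patients.

```python
def vowel_space_area(f):
    """Triangular vowel space area (Hz^2) from the (F1, F2) pairs of /a/, /i/, /u/."""
    (f1a, f2a), (f1i, f2i), (f1u, f2u) = f["a"], f["i"], f["u"]
    return 0.5 * abs(f1i * (f2a - f2u) + f1a * (f2u - f2i) + f1u * (f2i - f2a))


def formant_centralization_ratio(f):
    """FCR: grows as the corner vowels move towards the centre of the vowel space."""
    (f1a, f2a), (f1i, f2i), (f1u, f2u) = f["a"], f["i"], f["u"]
    return (f2u + f2a + f1i + f1u) / (f2i + f1a)


def formant_ratios(f):
    """Ratios between corner-vowel formants, as used in the text (e.g., F1a/F1i)."""
    return {
        "F1a/F1i": f["a"][0] / f["i"][0],
        "F1a/F1u": f["a"][0] / f["u"][0],
        "F2i/F2u": f["i"][1] / f["u"][1],
    }


# Hypothetical mean formants in Hz, as (F1, F2) pairs; illustrative values only.
well_separated = {"a": (730, 1090), "i": (270, 2290), "u": (300, 870)}
centralized = {"a": (600, 1200), "i": (350, 1900), "u": (380, 1000)}

for name, formants in (("well separated", well_separated), ("centralized", centralized)):
    print(name,
          "VSA =", vowel_space_area(formants),
          "FCR =", round(formant_centralization_ratio(formants), 3),
          formant_ratios(formants))
```

With these toy values, the centralized vowel set yields a smaller VSA and a larger FCR, which is the pattern a shrunk vowel triangle is expected to produce.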
Male subjects diagnosed with CS did not present significant differences for F0-related parameters, whereas statistical analysis showed differences concerning articulation, specifically w.r.t. NS (p-value=0.044) for F2 min /a/ and w.r.t. HS (p-value=0.024) for F2 mean /u/, which is also supported by the vowel triangle in Figure 4. This could be related to structural alterations of the posterior fossa, which can cause dysarthria [54], to macroglossia or to generalized hypotonia. The latter is also related to a significant difference in F3 min /u/ w.r.t. HS (p-value=0.023).
In the DS PS group, unlike the results by Moura et al. [9] and Zampini et al. [12], F0 of vowels and jitter did not significantly differ from the HS group. Such discrepancy could be related to different spoken languages (Brazilian Portuguese in [9]), recorded utterances (speech fragments in [12]), sample size ([9] applied acoustical analysis to a group of patients ten times larger than the one in this study), and the software used for acoustical analysis (PRAAT [55] in both cases). In the review by Kent [40] it was also stated that voice impairments of neurologic origin cause large variability in results, especially when evaluating F0 and its perturbations. As far as formant analysis is concerned, multiple comparisons showed statistical differences w.r.t. CS in F1 mean /a/ (p=0.002) and VSA (p=0.015), and w.r.t. NS in F2 max /I/ (p=0.021). These could be related to larger tongue dimensions, which affect its movements and therefore modify vocal tract resonances. Statistical analysis on DS female subjects highlighted significant differences in F0 mean /I/ w.r.t. HS (p-value=0.003) and in F0 mean /u/ w.r.t. CS (p-value=0.018). Multiple comparisons showed significant differences between DS and CS (p-value=0.018) and NS (p-value<0.001) for jitter /I/ and between DS and CS (p-value=0.037) and HS (p-value=0.038) for jitter /u/. Articulation problems, which are still present in adults, determine significant differences in F1 mean /u/ and F1 min /a/ w.r.t. NS (p-value=0.001 and p-value=0.028) and F2 max /u/ w.r.t. SMS (p-value=0.003). In the MA DS group, post-hoc analysis detected statistically significant differences for FCR w.r.t. HS (p-value=0.007), for F2 mean /a/ w.r.t. NS (p-value<0.001) and HS (p-value=0.015), for F2 mean /I/ w.r.t. HS (p-value=0.004), and for F3 mean /a/ w.r.t. NS (p-value=0.008) and HS (p-value=0.004). Neurologic abnormalities, located in the low temporal regions of the motor cortex, could be the reason for these results.
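To make the F0 and jitter measures more concrete, the sketch below computes F0 mean, F0 standard deviation and local jitter from a sequence of glottal cycle durations. It assumes the common definition of local jitter (mean absolute difference between consecutive periods divided by the mean period); the software used in this study or in the cited works may apply a different variant, and the period values are invented for illustration.

```python
import statistics


def f0_statistics(periods_s):
    """Mean and sample standard deviation of F0 (Hz) from cycle durations in seconds."""
    f0 = [1.0 / p for p in periods_s]
    return statistics.mean(f0), statistics.stdev(f0)


def local_jitter_percent(periods_s):
    """Local jitter (%): mean absolute difference between consecutive periods
    divided by the mean period."""
    if len(periods_s) < 2:
        raise ValueError("at least two periods are required")
    diffs = [abs(b - a) for a, b in zip(periods_s, periods_s[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods_s) / len(periods_s)
    return 100.0 * mean_diff / mean_period


# Hypothetical cycle durations (s) for a sustained vowel around 200 Hz.
steady = [0.0050] * 6
irregular = [0.0050, 0.0051, 0.0050, 0.0051, 0.0050, 0.0051]

print("steady:", f0_statistics(steady), local_jitter_percent(steady))
print("irregular:", f0_statistics(irregular), local_jitter_percent(irregular))
```

A perfectly periodic signal gives zero jitter and zero F0 standard deviation, while cycle-to-cycle irregularities raise both, which is why these measures are used as indicators of vocal instability.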
For NS paediatric subjects, generalized low muscular tone, which tends to hinder lateralization and protrusion of lips and tongue and also limits jaw opening, might explain statistical differences in F1 min /a/ w.r.t. HS (p=0.024), in F2 mean /a/ w.r.t. HS (p=0.001), in F2 mean /I/ w.r.t. SMS (p=0.001) and F2i/F2u w.r.t. CS (p=0.049). Indeed, Lee et al. [56] have demonstrated, with ultrasonographic measures, that F1 and F2 are strongly correlated to the oral cavity anterior length and the tongue posterior superficial length. Moreover, T0(F0 min) /a/ and T0(F0 max) /a/ show statistically significant differences w.r.t. HS (p<0.001 and p=0.002, respectively), which could be related to patients’ difficulty in maintaining a stable and regular vocal fold vibration during phonation. Statistical analysis of FA diagnosed with NS highlighted differences in F0 mean /a/ and F0 mean /I/ w.r.t. HS (p-value=0.005 and p-value=0.028, respectively) and in F0 mean /u/ w.r.t. CS (p-value=0.006): these alterations might depend on shorter height and shorter neck w.r.t. control subjects, which are common phenotypical features of this syndrome. Moreover, jitter /I/ showed a significant difference w.r.t. CS (p-value=0.018) and SMS (p-value=0.025). As shown in Figure 5, the NS FA vowel triangle is characterized by a small area, but VSA did not show any statistical significance. Nevertheless, formant coordinates showed significant differences in F2 mean /a/ w.r.t. HS (p-value=0.001), F1 mean /u/ w.r.t. DS (p-value=0.001) and in F2 mean /I/ w.r.t. SMS (p-value=0.024): such alterations can be associated with difficulties in lip protrusion and lateralization [41]. Regarding the NS MA group, multiple comparisons showed statistical differences for F0 mean /a/ w.r.t. HS (p-value=0.028) and for F0 mean /u/ w.r.t. SMS (p-value=0.014): besides short stature, possible causes of this alteration could be the presence of an anterior glottic web [57] or a tendency to develop vocal fold paralysis. These conditions can also be associated with NNE values closer to 0, especially for /I/ and /u/ w.r.t. HS (p-value<0.001 and p-value=0.001, respectively). Figure 4 shows vowel area reduction and centralization: both VSA and FCR detected significant differences w.r.t. HS (p-value<0.001 and p-value=0.001, respectively). F2 also showed significant differences: F2 mean /a/ w.r.t. DS (p-value<0.001), F2 mean /I/ w.r.t. HS (p-value=0.004), and F2 mean /u/ w.r.t. HS (p-value=0.028). Such alterations might depend on structural properties, such as choanal atresia, supraglottic stenosis and soft palate laxity, as well as on neurologic problems.
In PS SMS subjects, articulation measures and formants showed statistically significant differences for: F1 mean /u/ w.r.t. CS (p=0.034), F2 mean /a/ w.r.t. HS (p=0.037), F2 mean /I/ w.r.t. HS (p=0.015) and CS (p=0.001), F1a/F1u w.r.t. HS (p=0.004) and FCR w.r.t. HS (p=0.001). In Garayzabal-Heinze et al. [18], neither F1 nor F2 was able to discriminate SMS individuals from the control group: this difference could depend on the use of different acoustical analysis software tools. First formant alterations may be linked to velopharyngeal insufficiency: this incomplete closure, typical of SMS patients, causes a constant leak of airflow through the nasal cavities, consequently altering resonant frequencies along the vocal tract [42]. F0-related features did not show any significant difference for the FA group diagnosed with SMS. Significant differences were instead found for F2 mean /a/ w.r.t. HS (p-value=0.026), F2 mean /I/ w.r.t. NS (p-value=0.024) and F3 median /I/ w.r.t. NS (p-value=0.001) and CS (p-value=0.039). Hypotonia and structural lip malformations [17], in addition to frontal lobe calcification and cortical atrophy, could be the reasons for these anomalies. For the MA SMS group, post-hoc analysis highlighted significant differences in F0 mean /a/, F0 mean /I/ and F0 mean /u/ w.r.t. HS (p-value<0.001 for the three cardinal vowels). Alterations of F0 can be associated with greater vocal effort during phonation due to vocal cord stiffness. Orofacial dysfunctions, worsened by hypotonia, soft palate clefts and posterior fossa anomalies, might be responsible for articulation disabilities and therefore related to the significant differences identified for F1 min /a/ and F2 max /u/ w.r.t. HS (p-value=0.050 and p-value=0.009, respectively).
Figure 3, Figure 4 and Figure 5 show that age and gender strongly influence F1 and F2: with respect to the reference adult males (solid line with diamond markers), the PS group, and to a lesser extent the FA group, present higher formant values due to the smaller vocal folds and shorter vocal tract. These results underline the importance of taking age and gender into account when carrying out the acoustical analysis. Moreover, as shown in Figure 4, a difference also exists between the healthy adult male subjects considered in this study (simple solid line) and the reference adult males (solid line with diamonds), possibly because of our limited sample size.
In the KNN classifier of the PS group, the NS and CS classes showed the highest AUCs (97% and 99%, respectively), whereas DS subjects present lower performance, in particular an AUC of 77% and a recall of 43%. This may mean that their acoustical properties are not specific and that DS subjects are therefore misclassified as other syndromes, possibly as SMS, since this class shows a low recall value as well (65%). The SVM model of the FA group was able to detect healthy subjects with 100% precision; moreover, no pathological subject was misclassified as HS, leading to a 100% F1-score. This is an important result: as supported by statistical analysis, voice quality differs between female normophonic subjects and genetic syndrome patients. Moreover, the SMS and NS classes presented high precision values as well (85% and 86%), whereas DS presents a low specificity of 50%, meaning that its vocal properties are probably similar to those of the other considered diseases. The overall best classification results were obtained in the male cohort: all the observations belonging to the CS, DS and HS classes were correctly identified. High performance characterized the SMS and NS classes as well. In this latter case, it will be important in the future to understand whether the same performance can be achieved by reducing the number of parameters.
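For readers less familiar with the per-class metrics quoted above, the following sketch shows how precision, recall, specificity and F1-score can be derived from a multi-class confusion matrix in a one-vs-rest fashion. The function name, the class labels and the confusion matrix are hypothetical and chosen only for illustration; they do not reproduce the results of this study.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest metrics from a square confusion matrix.

    confusion[i][j] = number of subjects of true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # class k predicted as another class
        fp = sum(row[k] for row in confusion) - tp   # other classes predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall,
                          "specificity": specificity, "f1": f1}
    return metrics


# Toy example: healthy subjects are perfectly separated, while two
# syndromes are partly confused with each other.
labels = ["HS", "DS", "SMS"]
confusion = [
    [10, 0, 0],
    [0, 3, 4],
    [0, 2, 6],
]
results = per_class_metrics(confusion, labels)
for label in labels:
    print(label, {name: round(value, 2) for name, value in results[label].items()})
```

In this toy matrix the healthy class reaches 100% precision, recall and F1-score, whereas the low recall of one syndrome comes from its samples being assigned to the other, which mirrors the kind of confusion between syndromes discussed above.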
Summing up, these preliminary results are promising for defining a phonatory profile for genetic diseases. However, we remark that this outcome was obtained with a limited dataset and therefore more voice samples need to be collected. By applying the proposed procedure to a larger dataset, it will be possible to carry out reliable comparisons in order to validate and possibly find new acoustical features that could reliably describe genetic syndromes. With a larger amount of data, new models could be developed to understand whether the differences in acoustical parameters between syndromes found in this work are confirmed and whether any improvement in classification results is feasible. Moreover, the small number of subjects analyzed in this first study did not allow investigating feature selection or feature engineering techniques to obtain better classifiers. Such methods will be implemented once a larger database is available.
Author Contributions
Conceptualization, C.M., G.Z. and A.L.; methodology, F.C., L.F., C.M., A.L.; software, C.M.; validation, L.F., C.M., A.L., G.Z.; formal analysis, F.C.; investigation, F.C.; resources, E.F., R.O., L.A., G.Z.; data curation, F.C., E.S.; writing—original draft preparation, F.C.; writing—review and editing, L.F., C.M., A.L., E.S., G.Z.; visualization, F.C.; supervision, C.M., A.L., G.Z.; project administration, C.M., G.Z.; funding acquisition, A.L., G.Z. All authors have read and agreed to the published version of the manuscript.