1. Introduction
Dysprosody is a well-attested symptom of Parkinson’s disease [1] and is discussed in the literature as an “impaired melody of speech”, speaking monotony in pitch or loudness (“monopitch” and “monoloudness” respectively), “hypophonia”, or an “altered rate of speech” [2]. Dysprosody is an early-onset symptom of the disease [1] that is a prominent factor behind reduced speech intelligibility of patients with Parkinsonian dysarthria [3,4,5,6]. While most often discussed in connection with the dysarthrias, and predominantly in connection with Parkinson’s disease, expressive dysprosody has also been observed following lesions in the caudate nucleus, the globus pallidus, and the putamen [7], in case reports of left hemiparesis and right hemisphere tumors [8], and in approximately 2.7% of patients with epileptic seizures [9]. When dysprosody occurs as a component of apraxia of speech [10], it has been observed that symptoms may be alleviated by neurobehavioral treatment [11].
While widely attested and often discussed in reports of speech effects of Parkinson’s disease and other neurological diseases [12] or of dopaminergic treatments [13,14], there is currently no clinically validated objective measure of dysprosody by which the effect of treatment may be assessed [15]. One barrier to developing acoustic assessment methods for dysprosody originates in the complex nature of prosody itself [7], and the assessment is therefore often approximated by a singular focus on variability in fundamental frequency (f0) in utterances [6,13,16,17,18,19]; in Parkinsonian dysarthria, smaller f0 excursions have been observed in speech.
Well-functioning prosody is, however, a well-explored field of linguistics, with analysis frameworks that may offer a path toward analysis of dysprosody resulting from neurological diseases as well. The autosegmental-metrical framework analyses the realization and temporal alignment of language-specific intonational units (tones) and the strength of breaks (pauses) in speech. The aim of the analysis is to describe the functional properties of the speech signal that convey prosodic information to the listener. The autosegmental-metrical framework has been applied to describe aspects of dysprosody in patients with Parkinson’s disease. In these analyses, it has been observed that the frequency of pitch accents and boundary tones on the intonational level, and pauses and length of utterances on the rhythmic level, may be altered by the disease while the inventory of tones used remains unchanged [20,21,22]. The autosegmental-metrical analysis is a manual process, but Frota et al. recently proposed a quantification of the autosegmental-metrical analysis framework for Portuguese (P-ToBI) that incorporated expected and realized pitch accents and breaks into an intonational index of prosodic deviance, by which a difference in prosodic phrasing and use of nuclear tones in persons with Parkinson’s disease was indicated [16]. The analysis requires that a specifically trained annotator manually marks where the utterance occurs and the pitch accents and break indices within it. Further, the analysis system has not been demonstrated to be generalizable to languages other than Portuguese, and the tone and break indices approach to prosodic analysis is well recognized to require that the analysis system is adapted to fit the individual language [23]. The index has not been successfully related to the perceived level of dysprosody in the patients. Thus, autosegmental-metrical approaches currently do not offer a path to automated and language-independent evaluation of neurogenic dysprosody.
Alternative approaches are acoustically defined composite measures applied to easily identified portions of the speech signal, which therefore require less laborious manual annotation. The Stress Pattern Index [24], computed from the fundamental frequency f0 and the speech signal energy E, applies to words that should have prominence in an utterance to signal appropriate meaning. Alternatively, the Syllabic Prosody Index [25], computed from the duration d of the syllable and the speech signal energy E obtained after low-pass filtering, applies to prominent syllables. Both indices have been verified to quantify aspects of dysprosody due to Parkinson’s disease. The Syllabic Prosody Index was, for instance, found to be reduced compared to control speakers only for male speakers [24,26]. The indices have not been evaluated in terms of classification performance. In terms of classification, the most reliable models of dysprosody severity due to dysarthria have shown an accuracy of 62.2-73.9%, depending on model type, when trained on a set of intonation (f0) and rhythmic properties [27]. Application of the models investigated by Hernandez et al. [27] and of the Stress Pattern and Syllabic Prosody indices requires language-specific manual annotation on one [24,26] or multiple [27] tiers beforehand, and may therefore be considered unwieldy in clinical use.
A path to automated assessment of dysprosody severity may, however, be afforded by application of language-independent approaches to describe intonational contours. The International Transcription System for Intonation (INTSINT) re-encodes the surface intonational contour into the language-agnostic levels (T)op, (H)igher (local maximum), (U)pstepped, (S)ame, (M)id, (D)ownstepped, (L)ower (local minimum), and (B)ottom, defined relative to the f0 key and f0 range. Therefore, the original f0 contour may be approximately reproduced given the INTSINT transcription labels and their locations in the utterance [28]. A companion f0 contour simplification algorithm (Modeling melody, Momel) filters out micro-prosodic components of the contour by approximation using quadratic splines. With only the macro-prosodic intonational structure remaining, the representation is refocused to include only intonational movements that are likely to serve a function in communication. See Hirst [29] for an overview. The macro-prosodic approximation of the intonational contour can then be used for the identification of INTSINT anchor (target) points in the utterance.
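The claim that an f0 contour can be approximately reproduced from INTSINT labels can be illustrated with a minimal decoding sketch. It assumes the target definitions commonly given for INTSINT: T, M, and B are fixed by the key and range, and the remaining labels are interpolated geometrically from the preceding target. The function name, the treatment of a relative label in first position, and the exact interpolation constants are assumptions made for illustration and have not been checked against the canonical implementation [36].

```python
import math


def intsint_to_f0(labels, key_hz, range_octaves):
    """Decode a sequence of INTSINT labels into approximate f0 targets in Hz.

    Assumed definitions (illustrative, not verified against the canonical code):
    T/M/B are absolute levels set by key and range; H, U, S, D, L are
    geometric interpolations from the preceding target.
    """
    top = key_hz * 2 ** (range_octaves / 2)
    bottom = key_hz / 2 ** (range_octaves / 2)
    targets = []
    prev = key_hz  # assumption: a relative label in first position starts from the key
    for label in labels:
        if label == "T":
            f0 = top
        elif label == "M":
            f0 = key_hz
        elif label == "B":
            f0 = bottom
        elif label == "H":
            f0 = math.sqrt(prev * top)
        elif label == "U":
            f0 = math.sqrt(prev * math.sqrt(prev * top))
        elif label == "S":
            f0 = prev
        elif label == "D":
            f0 = math.sqrt(prev * math.sqrt(prev * bottom))
        elif label == "L":
            f0 = math.sqrt(prev * bottom)
        else:
            raise ValueError(f"Unknown INTSINT label: {label!r}")
        targets.append(f0)
        prev = f0
    return targets


# Hypothetical utterance: key of 200 Hz and a range of one octave.
contour = intsint_to_f0(["M", "T", "L", "S"], key_hz=200.0, range_octaves=1.0)
```

Pairing each decoded target with its time mark and interpolating between the targets would give the approximate contour; the interpolation step is omitted here.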
Figure 1 exemplifies the result of a combined INTSINT and Momel analysis. The Momel and INTSINT annotation procedures have been automated, applied successfully to different languages [30,31,32,33,34,35], and given a canonical computer implementation [36]. An alternative f0 stylization procedure is offered by ProsodyPro [37], which assumes a relatively constant segmental tier. This assumption was, however, considered not generally valid with regard to dysarthria [38], and this possible route to an automatic annotation was not pursued further. In contrast, the INTSINT / Momel analysis quantifies only the intonation contour and disregards the segmental information and pausing. The time marks identified by an INTSINT annotation provide reference points at which supplementary acoustic information could be computed directly from the sound. Whether an assessment procedure based on the result of an INTSINT / Momel analysis affords modeling of the level of dysprosody has, however, not been evaluated.
While the INTSINT and Momel procedures offer the opportunity to process intonation and other time-aligned acoustic information, the procedures presuppose that the utterance has already been identified and extracted manually. Recent developments in speaker diarization and vocal activity detection [39,40,41,42,43,44], however, offer the opportunity to preprocess speech recordings into vocal activities, approximating utterances, which can then be fed into an INTSINT / Momel analysis workflow in which the tonal movements are described in terms of timings and f0 levels reached. The INTSINT / Momel analysis workflow provides a description of the micro- and macro-prosodic intonation pattern of an utterance but not a summative quantification of prosody. The tonal levels provided by an INTSINT annotation can, however, be used as reference time points at which the Momel-stylized f0 and the RMS amplitude of the speech signal are extracted and then combined into a quantification that provides a prosodically relevant summary of the utterance.
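How tonal target points could serve as reference times for extracting f0 and RMS amplitude may be sketched as follows. This is a minimal illustration under stated assumptions, not the quantification evaluated in this study: the function names, the 50 ms analysis window, and the choice to express f0 change in semitones are all hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def local_changes(target_times, target_f0_hz, signal, sample_rate, window_s=0.05):
    """Describe the movement from each tonal target to the next.

    window_s is an assumed analysis window centred on each target time.
    Returns one dict per pair of consecutive targets.
    """
    half = int(window_s * sample_rate / 2)
    levels = []
    for t in target_times:
        centre = int(round(t * sample_rate))
        lo = max(0, centre - half)
        hi = min(len(signal), centre + half + 1)
        levels.append(rms(signal[lo:hi]))

    changes = []
    for i in range(1, len(target_times)):
        changes.append({
            "dt_s": target_times[i] - target_times[i - 1],
            # f0 change in semitones between consecutive stylized targets
            "df0_st": 12 * math.log2(target_f0_hz[i] / target_f0_hz[i - 1]),
            # difference in RMS amplitude between the two target windows
            "d_rms": levels[i] - levels[i - 1],
        })
    return changes


# Hypothetical one-second signal whose amplitude doubles halfway through.
signal = [0.1] * 800 + [0.2] * 800
changes = local_changes([0.25, 0.75], [100.0, 200.0], signal, sample_rate=1600)
```

Summary statistics over such per-target changes (for example their mean or spread within an utterance) would then form utterance-level predictors.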
The aim of the current study was to describe and evaluate an automated pipeline, with utterance identification and a novel quantification of intonation and speech intensity alterations, and to evaluate this pipeline in terms of its affordance for automatic assessment of dysprosody in patients with Parkinson’s disease.
Figure 1.
An example automatic INTSINT annotation (bottom panel) along with the audio waveform (middle panel) and computed f0 curve (top panel, green). The approximation of the macro-prosodic structure of the utterance estimated by the Momel algorithm is overlaid on the original f0 curve (top panel, red).
4. Discussion
Prosody is the language function that organizes the speech stream into manageable chunks for the listener to process, and failure to meet listeners’ expectations is linked with reduced speech intelligibility. Prosody is inherently multidimensional in how it is signaled to the listener, and previous models aimed at detecting neurogenic dysprosody severity have achieved 62.2-73.9% detection accuracy by incorporating information from intonation, rhythm, and pausing, information that was acquired by means of a manual annotation procedure. The requirement of a laborious and time-consuming transcription task preceding assessment presents a clear barrier to clinical adoption of the assessment procedure. In this study, an automatic dysprosody assessment pipeline was constructed from tools that were already available for speech utterance identification and pitch contour stylization, together with a novel quantification aimed at capturing aspects of variability in pitch and intensity of speech. The complete pipeline was then assessed in terms of its proficiency in assessing dysprosody severity of patients with Parkinson’s disease based on a recording of the patients’ read speech, with no prior pre-processing. Five models were trained on the individual assessments of levels of dysprosody made by four clinical raters and evaluated in terms of their ability to predict the consensus assessment of dysprosody severity among expert human raters on unobserved utterances.
The results suggest that severity of dysprosody is not well described by single metrics, including the predominant proxy measure for dysprosody (variability in f0), or by simpler statistical models (Naïve Bayes or Decision tree classifiers). Simpler bases for classification tended to result in a strongly biased prediction that does not reflect human experts’ ratings well, and no acoustic predictor showed an influence on the classification that was strong enough for it to serve as a proxy in determining dysprosody severity. The ensemble model Random forest and the strong ordinal classifier model type penalized ordinal regression showed a stronger ability to learn how to identify utterances in the evaluation set that human experts had determined to have moderate to severe deviation in prosody. The Support Vector Machine models failed to reach similar levels of accuracy, particularly in identifying severe levels of dysprosody. Overall, the Random forest achieved a higher level of accuracy in predicting the severity of dysprosody in unseen utterances than one of the expert human raters.
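The comparison of a model and an individual rater against a consensus rating can be sketched as below. How the consensus was formed is not restated here, so the sketch assumes the lower median of the four ratings; the integer severity coding and all ratings shown are hypothetical illustration data, not results from this study.

```python
from statistics import median_low


def consensus(ratings_per_utterance):
    """Assumed consensus: the lower median of the raters' ordinal scores."""
    return [median_low(ratings) for ratings in ratings_per_utterance]


def accuracy(predicted, reference):
    """Proportion of utterances on which two rating sequences agree exactly."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Hypothetical ordinal coding: 0 = none, 1 = mild, 2 = moderate, 3 = severe.
# Each inner list holds the four raters' scores for one utterance.
ratings = [
    [0, 0, 1, 0],
    [2, 2, 3, 1],
    [3, 3, 2, 3],
    [1, 2, 2, 2],
]
reference = consensus(ratings)

model_prediction = [0, 2, 3, 1]                  # hypothetical model output
third_rater = [scores[2] for scores in ratings]  # one individual rater

model_accuracy = accuracy(model_prediction, reference)
rater_accuracy = accuracy(third_rater, reference)
```

Note that a rater's own scores contribute to the consensus in this sketch, which favors the rater in such a comparison.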
The result, therefore, demonstrates that dysprosody severity as perceived by human clinical experts in the assessment of speech in Parkinson’s disease can be deduced by a fully automatic speech processing pipeline in which utterances are automatically identified and productive features for predicting dysprosody severity are identified using an established prosodic algorithm with a theoretical basis. It is therefore concluded that the developed pipeline constitutes a new development that can shed additional light on what constitutes a symptom of perceived dysprosody due to Parkinson’s disease. One may further observe in the results that the one acoustic feature that has to date predominantly been investigated as a proxy for dysprosody, utterance-wide variability in f0, was not identified here as a robust predictor of the perceived level of dysprosody. Instead, the most productive predictors described local degrees of change in f0 from the timing of one tonal level to the next; measures of utterance-level variability in f0 or RMS amplitude were less important. Thus, previous reports in which dysprosody has been evaluated solely based on the proxy measure of the standard deviation of f0 are likely to have determined, in part, the level of liveliness [63] in speech. Liveliness is essential and, in a communicative setting, likely contributes significantly to the experience of both parties. However, variability in f0 alone does not ensure a retained linguistically functional intonation that adequately supports the transfer of information from the speaker to the listener. Instead, estimates of more local alterations of how intonation and intensity variation are used to distinguish portions of the speech signal of particular importance to the message from relatively less significant portions provided a better model of clinical judgments of reduced prosodic functioning in patients with Parkinson’s disease. Patients with Parkinson’s disease have previously been observed to be reduced in their rapid regulation of phonation [64,65,66,67,68,69,70,71], which may provide a partial explanation of the findings of significant predictors of clinically rated dysprosody. While an explanation for the converging observations in terms of neurofunctional correlates cannot be offered to date, the connection with the subcortical structures, the globus pallidus, and the putamen [7] is congruent with an interpretation that failure to achieve tonal targets by persons with Parkinson’s disease may be related to a failure to initiate an alteration of state in the phonatory musculature rather than an effect of muscular inability or fatigue, or conflicting signaling in the direct, indirect, or hyper-direct pathways from the striatum to the cortex [72]. This interpretation is, however, tentative and requires experimental support before being accepted.
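Why utterance-wide f0 variability and local f0 change can diverge may be illustrated with two hypothetical contours that contain the same f0 values in a different order. The values, the 100 Hz semitone reference, and the use of mean absolute target-to-target change as the local measure are assumptions for illustration; they are not the predictors used in the models above.

```python
import math
from statistics import pstdev


def to_semitones(f0_hz, ref_hz=100.0):
    """Convert f0 values in Hz to semitones relative to an assumed reference."""
    return [12 * math.log2(f / ref_hz) for f in f0_hz]


def mean_abs_local_change(values):
    """Mean absolute change between consecutive values."""
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(steps) / len(steps)


# Same six f0 targets (Hz): once as a gradual rise, once alternating.
gradual = to_semitones([100, 110, 120, 130, 140, 150])
alternating = to_semitones([100, 150, 110, 140, 120, 130])

sd_gradual = pstdev(gradual)
sd_alternating = pstdev(alternating)
local_gradual = mean_abs_local_change(gradual)
local_alternating = mean_abs_local_change(alternating)
```

Because both contours hold the same values, their standard deviations coincide, while the alternating contour shows a larger mean local change; an utterance-wide measure cannot separate the two.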
Dysprosody is discussed here and in other parts of the literature as a single symptom. While discussed under a single term, dysprosody of a rated severity due to Parkinson’s disease may differ from dysprosody caused by other neurological conditions [8]. The automatic processing pipeline does not presuppose a particular language or underlying disease causing the dysprosody, but the relative importance weights, for instance that of reduced amplitude fluctuations, may warrant an increase for other diseases. Such adjustments can, however, only be made with access to the appropriate speech recordings and the appropriate clinical expert raters in the language and are not considered further in this report. The components of the processing pipeline are, however, widely available and well documented [36,42,44,73,74], which lowers the barrier to replication and to language- or disease-specific adjustment of weights in later research.