A peer-reviewed article of this preprint also exists. This version is not peer-reviewed.
Submitted: 16 December 2022; Posted: 03 January 2023
1 | Primary emotions are innate and support fast, reactive responses, e.g., anger, happiness, sadness. Ekman [4] defined six basic emotions based on cross-cultural studies, which found that these basic emotions are expressed similarly across cultures. The terms ‘primary’ and ‘basic’ emotions are used in the literature with no clear distinction between them; for this study, the definition of primary emotions given above is used, in alignment with studies in emotional speech synthesis [5]. Secondary emotions are assumed to arise from higher cognitive processes that evaluate preferences over outcomes and expectations, e.g., relief, hope [6]. This distinction between the two emotion classes is based on neurobiological research by Damasio [7]. |
2 | The ‘base’ speech synthesis system refers to the synthesis system that is built first and has no emotional modelling. The emotion-based modelling is built on top of this ‘base’ system. |
3 | Deep Neural Network |
4 | The term ‘good quality’ here refers to recordings made in recording-studio environments with controlled noise levels. |
5 | Neutral in this context refers to speech without any emotions. |
6 | Medium in comparison to the databases needed for the deep-learning approaches explained in the next paragraph. |
7 | |
8 | Valence indicates the pleasantness of the voice, ranging from unpleasant (e.g., sad, fear) to pleasant (e.g., happy, hopeful). Arousal specifies the level of reaction to stimuli, ranging from inactive (e.g., sleepy, sad) to active (e.g., anger, surprise). Russel developed this model in a psychology study in which Canadian participants categorised English stimulus words portraying moods, feelings, affect or emotions. Later, 80 more emotion words were superimposed on Russel’s model based on studies in German [24]. Russel’s circumplex model diagram as used in this study, shown in Figure 1, is adapted from [25], which in turn drew on Russel’s and Scherer’s work, but positive valence is depicted on the right side of the x-axis (in contrast to Scherer’s study, where it was on the left side). A two-dimensional model is used, rather than higher-dimensional models, because representing the emotions on a plane facilitates their visualisation (see the sketch after these notes). |
9 | Both male and female speakers’ sentences were used for the initial analysis, but only the male speaker’s sentences were used for the emotion-based contour model because it is this male speaker’s voice that will be synthesised. |
10 | This study was approved by the University of Auckland Human Participants Ethics Committee on 24/07/2019 for three years. Ref. No. 023353. |
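As a simple illustration of why a two-dimensional representation aids visualisation (note 8), the sketch below places a few emotion words on a valence–arousal plane with positive valence on the right of the x-axis. The coordinates are assumed values chosen only for demonstration; they are not taken from Russel’s or Scherer’s studies or from this study’s data.

```python
import matplotlib.pyplot as plt

# Assumed (valence, arousal) coordinates in [-1, 1] x [-1, 1]; illustrative only.
emotions = {
    "happy": (0.8, 0.5), "hopeful": (0.6, 0.3), "relieved": (0.5, -0.2),
    "sad": (-0.7, -0.4), "fear": (-0.6, 0.6), "anger": (-0.7, 0.7),
    "sleepy": (0.0, -0.8), "surprise": (0.2, 0.8),
}

fig, ax = plt.subplots(figsize=(5, 5))
for label, (valence, arousal) in emotions.items():
    ax.scatter(valence, arousal)
    ax.annotate(label, (valence, arousal), textcoords="offset points", xytext=(4, 4))

ax.axhline(0, linewidth=0.5)   # arousal axis
ax.axvline(0, linewidth=0.5)   # valence axis
ax.set_xlim(-1, 1); ax.set_ylim(-1, 1)
ax.set_xlabel("Valence (unpleasant → pleasant)")
ax.set_ylabel("Arousal (inactive → active)")
plt.show()
```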
Speech synthesis method | Emotional speech synthesis method | Approach | Resources needed | Naturalness | Emotions modelled
---|---|---|---|---|---
1993 [13] Diphone synthesis | Rule-based | Emotion rules applied on speech synthesis systems | All possible diphones in a language have to be recorded for neutral TTS¹, e.g., 2431 diphones in British English; an emotional speech database (312 sentences) is needed to frame the rules | Average² | neutral, joy, boredom, anger, sadness, fear, indignation
1995 [9] Formant synthesis | Rule-based | Rules framed for prosody features such as pitch, duration and voice quality | DECtalk synthesiser used, containing approximately 160,000 lines of C code; emotion rules framed from past research | Average | anger, happiness, sadness, fear, disgust and grief
2004 [14] Parametric speech synthesis | Style control vector | Style control vector associated with the target style transforms the mean vectors of the neutral HMM models | 504 phonetically balanced sentences for the average voice, and at least 10 sentences of each style | Good | Three styles: rough, joyful, sad
2006 [10] Recorded neutral speech used as-is | Rule-based | GMM³- and CART⁴-based models for F0 and duration | Corpus with 1500 sentences | Average | neutral, happiness, sadness, fear, anger
2006 [15] Parametric speech synthesis | Corpus-based | Decision trees trained on the database determine the contours & timing | 11 hours (excluding silence) of neutral sentences + 1 hour of emotional speech | Good⁵ | Conveying bad news, yes–no questions
2006 [15] Parametric speech synthesis | Prosodic phonology approach | ToBI⁶-based modelling | 11 hours (excluding silence) of neutral sentences + 1 hour of emotional speech | Good | Conveying bad news, yes–no questions
2007 [16] Parametric speech synthesis | Model adaptation on average voice | Acoustic features Mel-cepstrum & log F0 were adapted | 503 phonetically balanced sentences for the average voice, and at least 10 sentences of a particular style | Good | Speaking styles of speakers in the database
2010 [17] Neutral voice not created | HMM-based parametric speech synthesis | Each emotion’s database was used to train an emotional voice | Spanish expressive voices corpus, 100 minutes per emotion | Good | happiness, sadness, anger, surprise, fear, disgust
2017 [12] Parametric speech synthesis using recurrent neural networks with long short-term memory units | Emotion-dependent modelling and unified modelling with emotion codes | Emotion code vector is input to all model layers to indicate the emotion characteristics (see the sketch after the table) | 5.5 hours of emotional speech data + speaker-independent model from 100 hours of speech data | Reported to be better than HMM-based synthesis | neutral, happiness, anger, and sadness
2018 [11] Tacotron-based end-to-end synthesis using a DNN³ | Prosody transfer | Tacotron model learns a latent embedding space of prosody derived from a reference acoustic representation containing the desired prosody | English dataset of audiobook recordings, 147 hours | Reported to be better than HMM-based synthesis | Speaking styles of speakers in the database
2019 [18] Deep Convolutional TTS | Emotion adaptation | Transfer learning from neutral TTS to emotional TTS | Large dataset (24 hours) of neutral speech + 7000 emotional speech sentences (5 emotions) | Reported to be better than HMM-based synthesis | anger, happiness, sadness, neutral
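To make the “emotion code” idea in the 2017 row [12] concrete, the following minimal sketch conditions a two-layer LSTM acoustic model by concatenating a one-hot emotion code to the input of every layer. The framework (PyTorch), layer sizes and feature dimensions are assumptions chosen for illustration, not the configuration used in [12].

```python
import torch
import torch.nn as nn

class EmotionCodedAcousticModel(nn.Module):
    """Unified acoustic model steered by an emotion code fed to every layer."""
    def __init__(self, n_linguistic=102, n_emotions=4, hidden=64, n_acoustic=5):
        super().__init__()
        self.lstm1 = nn.LSTM(n_linguistic + n_emotions, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden + n_emotions, hidden, batch_first=True)
        self.out = nn.Linear(hidden + n_emotions, n_acoustic)

    def forward(self, linguistic, emotion_code):
        # linguistic: (batch, time, n_linguistic); emotion_code: (batch, n_emotions)
        code = emotion_code.unsqueeze(1).expand(-1, linguistic.size(1), -1)
        h, _ = self.lstm1(torch.cat([linguistic, code], dim=-1))
        h, _ = self.lstm2(torch.cat([h, code], dim=-1))
        return self.out(torch.cat([h, code], dim=-1))

model = EmotionCodedAcousticModel()
x = torch.randn(2, 50, 102)                  # dummy linguistic feature sequences
code = torch.eye(4)[torch.tensor([0, 3])]    # one-hot codes for two emotions
y = model(x, code)                           # (2, 50, 5) predicted acoustic parameters
```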
Feature | Description | Extraction method |
---|---|---|
Linguistic context features | Count = 102; e.g., accented/unaccented, vowel/consonant | Text analysis at the phonetic level using MaryTTS
Non-emotional contour Fujisaki model parameters | Five Fujisaki parameters (see the model equations after this table) | Passing the non-emotional speech through the AutoFuji extractor
Emotion tag | Five primary & five secondary emotions | An emotion tag is assigned to each sentence
Speaker tag | Two male speakers | A speaker tag is assigned to each sentence
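For context on the Fujisaki parameters listed above: in the standard Fujisaki (command-response) model, the log-F0 contour is the superposition of a base frequency, phrase commands and accent commands. The formulation below is the textbook form of the model, given only as a reference sketch; the exact parameter subset returned by the AutoFuji extractor is not reproduced here.

$$
\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{p,i}\, G_p\!\left(t - T_{0,i}\right) + \sum_{j=1}^{J} A_{a,j}\left[ G_a\!\left(t - T_{1,j}\right) - G_a\!\left(t - T_{2,j}\right) \right]
$$

$$
G_p(t) = \alpha^2 t\, e^{-\alpha t}, \qquad G_a(t) = \min\!\left[\,1 - (1 + \beta t)\, e^{-\beta t},\ \gamma\,\right], \qquad t \ge 0,
$$

with both response functions equal to zero for $t < 0$. Here $F_b$ is the base frequency, $A_{p,i}$ and $T_{0,i}$ are the magnitude and timing of the $i$-th phrase command, and $A_{a,j}$, $T_{1,j}$ and $T_{2,j}$ are the amplitude, onset and offset of the $j$-th accent command.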
Secondary emotion | Mean speech rate (syllables/sec) | Mean intensity (dB) |
---|---|---|
anxious | 3.25 | 58.24 |
apologetic | 2.93 | 55.14 |
confident | 3.20 | 59.50 |
enthusiastic | 3.24 | 63.91 |
worried | 2.99 | 56.34 |
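A rough sketch of how the two quantities in the table above could be estimated is given below. It assumes the syllable count of each prompt is known and uses librosa; the absolute dB values depend on the chosen reference amplitude and recording calibration, so they are only comparable within one consistent setup, not directly to the table, and the file name is hypothetical.

```python
import numpy as np
import librosa

def speech_rate_and_intensity(wav_path, n_syllables, ref=1.0):
    y, sr = librosa.load(wav_path, sr=None)
    duration = librosa.get_duration(y=y, sr=sr)           # seconds
    rate = n_syllables / duration                         # syllables/sec
    rms = librosa.feature.rms(y=y)[0]                     # frame-wise RMS energy
    mean_db = 20 * np.log10(np.mean(rms) / ref + 1e-12)   # mean intensity in dB re. ref
    return rate, mean_db

# rate, db = speech_rate_and_intensity("anxious_001.wav", n_syllables=26)
```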
Actual \ Perceived | APO | ANX
---|---|---
APO | 97.9% | 2.1%
ANX | 0% | 100%

Actual \ Perceived | APO | ENTH
---|---|---
APO | 100% | 0%
ENTH | 1.4% | 98.6%

Actual \ Perceived | CONF | ANX
---|---|---
CONF | 88.3% | 11.7%
ANX | 12.4% | 87.6%

Actual \ Perceived | APO | WOR
---|---|---
APO | 64.3% | 35.2%
WOR | 32.4% | 67.6%

Actual \ Perceived | ENTH | ANX
---|---|---
ENTH | 78.6% | 21.4%
ANX | 24.8% | 75.2%

Actual \ Perceived | CONF | ENTH
---|---|---
CONF | 69% | 31%
ENTH | 30.3% | 69.7%

Actual \ Perceived | WOR | ANX
---|---|---
WOR | 97.9% | 2.1%
ANX | 4.19% | 95.9%

Actual \ Perceived | CONF | WOR
---|---|---
CONF | 95.2% | 4.8%
WOR | 22.8% | 77.2%

Actual \ Perceived | APO | CONF
---|---|---
APO | 94.5% | 5.5%
CONF | 9.7% | 90.3%

Actual \ Perceived | WOR | ENTH
---|---|---
WOR | 97.9% | 2.1%
ENTH | 0.7% | 99.3%
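A minimal sketch of how the pairwise percentages above can be computed from forced-choice listener responses is shown below: responses are counted per (actual, chosen) label pair and each “actual” row is normalised to 100%. The response list is made-up example data, not the study’s responses.

```python
from collections import Counter

def pairwise_confusion(responses, pair):
    """responses: iterable of (actual_label, chosen_label); pair: the two labels tested."""
    counts = Counter(responses)
    table = {}
    for actual in pair:
        row_total = sum(counts[(actual, chosen)] for chosen in pair)
        table[actual] = {
            chosen: (100.0 * counts[(actual, chosen)] / row_total) if row_total else 0.0
            for chosen in pair
        }
    return table

# Example: listeners judging APO vs ANX stimuli (hypothetical data).
responses = [("APO", "APO"), ("APO", "APO"), ("APO", "ANX"),
             ("ANX", "ANX"), ("ANX", "ANX"), ("ANX", "ANX")]
print(pairwise_confusion(responses, ("APO", "ANX")))
# APO row: ~66.7% / ~33.3%; ANX row: 0% / 100%
```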
Actual emotions | Emotion words by participants (count of times used)
---|---
Anxious | Anxious (41), Enthusiastic (9), Neutral (4), Confident (3), Energetic (1) |
Apologetic | Apologetic (35), Worried (22), Worried/Sad (1) |
Confident | Confident (34), Enthusiastic (9), Worried (8), Neutral (5), Authoritative (1), Demanding (1) |
Enthusiastic | Confident (24), Enthusiastic (21), Neutral (4), Apologetic (3), Worried (5), Encouraging (1) |
Worried | Worried (38), Apologetic (12), Anxious (5), Condescending (1), Confident (1), Neutral (1) |
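The counts in the table above can be produced by tallying participants’ free-response words per presented emotion, as in the short sketch below; the example responses are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (actual_emotion, response_word) pairs from a free-response listening test.
responses = [
    ("Anxious", "anxious"), ("Anxious", "enthusiastic"), ("Anxious", "anxious"),
    ("Worried", "worried"), ("Worried", "apologetic"),
]

tallies = defaultdict(Counter)
for actual, word in responses:
    tallies[actual][word.strip().lower()] += 1   # normalise case/whitespace before counting

for actual, counter in tallies.items():
    words = ", ".join(f"{w.capitalize()} ({n})" for w, n in counter.most_common())
    print(f"{actual} | {words}")
# Anxious | Anxious (2), Enthusiastic (1)
# Worried | Worried (1), Apologetic (1)
```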