1. Introduction
Ask a native English speaker on a random street in London to pronounce Happisburgh ('Hayes-bruh'), a little-known village in Norfolk, England, with a confusing spelling, and the speaker will likely sound hesitant, reflecting a lack of knowledge or familiarity. Here, the human speaker can encode his or her intended confidence towards what they say through their tone of voice. However, in most cases, an Artificial Intelligence (AI) speaker (a voice assistant or Text-to-Speech (TTS) service, such as Apple's Siri, that synthesises human-like speech) simply sounds the word out regardless of its knowledge or familiarity, even if it mispronounces the name as 'Hap-pis-burgh'. A TTS system comprises a text analysis component and a speech synthesis component, where the speech synthesis component can be built with models such as WaveNet, Tacotron, or FastSpeech to produce natural-sounding speech resembling a human speaker (Arik et al., 2018; Williams & King, 2019; Yang et al., 2022). It remains unknown whether (and if so, how) such a human-specific confidence encoding process can be expressed by AI speakers.
In human voice communication, paralinguistic vocal information is essential in encoding a variety of speaker-related information, including both stable traits or identity (e.g. biological sex, age, and personality) and dynamic or short-term states (e.g., emotion) for further decoding by listeners (Schuller & Batliner, 2013). Both human listeners and computational models based on algorithms for paralinguistic features are capable of identifying speakers’ short-term states, such as sleepiness (Egas-López & Gosztolya, 2021), intentions and attitudes (Ishi et al., 2008), as well as vocal expression of emotions (Anagnostopoulos et al., 2015; Kaya et al., 2017), and confidence that signals the speaker’s feeling of knowing (Jiang & Pell, 2018).
To what extent can AI encode vocal expression as a human does? How can AI's capability to clone the acoustic-articulatory mechanisms underlying human vocal expressions be quantified? One of the most studied synthesis approaches is AI voice cloning, the technology behind personalised voice assistants. To illustrate, the algorithm uses a speaker encoder network to extract a target speaker's utterance-level embeddings (Shi et al., 2020), which are passed to a Tacotron 2 network that synthesises speech conditioned on those speaker embeddings representing speaker identity (Arik et al., 2018). Notably, such a cloning process can also copy the original speech prosody; for example, Lux et al. (2022) extracted and normalised fine-grained prosodic features from reference audio and applied them to new voices using a FastSpeech 2 model and an utterance-level speaker embedding. A perceptual study with 32 human listeners, together with a speaker embedding test (a technique that visualises the similarity of speaker models before and after applying the speaker embedding as clusters in a low-dimensional 2D plot), confirmed that the algorithm could mimic not only speaker identity but also replicate speech prosody in AI voice cloning.
This study addresses the above questions by examining the pattern of encoding vocally-expressed confidence with articulatory and acoustic cues both in human target speech and in AI speech generated via voice-cloning TTS technology.
1.1. Research Gaps that Warrant the Adoption of Voice Cloning TTS for Studying the Encoding Mechanisms Underlying Vocal Expressions
Existing human-computer interaction (HCI) studies on voice-cloning AI mostly focus on human perceptual outcomes and performance during interaction with speech produced by voice-cloning TTS, and few have examined expressive voices or compared AI speech with the human speech used to synthesise it. Behavioural studies demonstrated that human listeners displayed high consensus in reporting the biological sex of Amazon's Alexa to be female (Fortunati et al., 2022). Moreover, listeners preferred AI speech synthesised in their native accent over speech in other accents. Listeners who judged TTS-generated voices to be older in age also rated the same voices as more credible (Edwards et al., 2019), and they judged human-like AI voices as more credible than machine-like ones (Kim et al., 2022). Neurophysiological studies also showed differential event-related potentials and event-related spectral perturbations at the onset of the most relative to the least preferred AI voices (Li et al., 2023).
However, these studies did not investigate how humans react differently to AI and human voices, which another group of studies has attempted. Listeners preferred a human female voice over speech by an AI female speaker (Mullennix et al., 2003). Human participants exaggerated their preference for human voices when they explicitly knew they were rating AI against human voices, compared with their performance in the Implicit Association Task, a response-latency test that measures the strength of the association of human voices with positive and synthesised voices with negative valence, or vice versa (Mitchell et al., 2011). In these studies, no effort was made to match the identity of the human and the AI speakers. Rodero (2017) performed acoustic analyses to ensure the similarity of the F0 ranges of human and AI speakers; human raters perceived both synthetic voices (by Siri and Loquendo) and human voices morphed with the KaleiVoiceCope software as less effective advertising tools than original, non-manipulated human voices (Rodero, 2017). In another study, the same human acted as an 'AI' speaker or spoke with her own voice: naive children shifted their interactive style to be less active when they were convinced they were playing a game with an 'AI' speaker rather than a human speaker, even though the 'AI' and the human speaker were the same woman, who merely delivered the information in a monotone or lively tone (Gampe et al., 2023). These studies attempted to construct comparable human versus AI speaker groups (Gampe et al., 2023; Rodero, 2017); however, paralinguistic features such as speaker identity and speech prosody can still vary between speaker groups and require more effective experimental control.
To address this gap, this study treats Huawei's Xiaoyi, a conversational agent service, as an accessible AI voice-cloning technology that can clone both speaker identity and speech prosody (namely, confidence levels in this study), and aims to provide empirical evidence to attest to this. The SV2TTS framework (speaker verification to multi-speaker TTS) is constructed from three key components, Encoder, Synthesiser, and Vocoder, which largely draw upon the Mel spectrogram that can be converted to Mel Frequency Cepstral Coefficients (MFCC) for acoustic analysis (Jia et al., 2018); such a spectrogram carries not only the speaker's vocal identity embeddings but also paralinguistic information (Hossain & Muhammad, 2019; Zhao et al., 2019). It is worth noting that past HCI studies also relied on accessible voice-cloning TTS services, since the products they used, such as DECtalk, KaleiVoiceCope, Siri, Loquendo, and Microsoft Mary, were all built upon a specific original human speaker's embeddings. The present study resembles those studies in taking the audio produced by voice-cloning TTS as its research target, but differs from them in also including the audio produced by the target human speakers, in which the human speakers expressed their vocal confidence.
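To make the acoustic representation underlying this pipeline concrete, the sketch below is a minimal illustration of how a Mel spectrogram can be computed and converted into MFCCs with Librosa (McFee et al., 2015). It is not the TTS system's own code: the file name and parameter choices are assumptions for illustration only.

```python
import librosa

# Load a mono recording at its native sampling rate (hypothetical file name).
y, sr = librosa.load("utterance.wav", sr=None)

# Mel spectrogram: a time-frequency representation on the perceptual Mel scale,
# the kind of intermediate representation SV2TTS-style synthesisers operate on.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# Convert power to decibels and derive MFCCs from the Mel spectrogram.
log_mel = librosa.power_to_db(mel_spec)
mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=13)

print(mel_spec.shape, mfcc.shape)  # (n_mels, frames), (n_mfcc, frames)
```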
The perceptual differences between human and AI speakers raise the question of whether human and AI speakers encode vocal expressions similarly or differently (Gampe et al., 2023; Rodero, 2017). If they do share similarities, how can acoustic measurement and machine-learning prediction provide supporting evidence (Jiang & Pell, 2017)? To address this gap, data-driven computational studies of acoustic features should be introduced to investigate affective factors in audio produced by AI and humans (Rodero, 2017; Schuller & Batliner, 2013). In human-human interaction (HHI), Jiang and Pell (2017) prepared prosody-varied human recordings of the same text in different accent groups (Canadian-English, Quebecois-French, and Australian-English) and performed a computational study using supervised machine-learning models that classified doubtful and confident prosodies from the acoustic measures supplied as input: mean pitch (F0), pitch variation, and vocal quality (HNR). This computational paralinguistic experiment reported the consensus importance of F0 in categorising audio clips across the three accent groups and an in-group bias when predicting novel vocal expressions: training on Canadian-English and testing on Canadian-English yielded a higher classification accuracy than training on Canadian-English and testing on Australian-English. Although Jiang and Pell (2017) adopted only a few acoustic features and performed machine learning only on the confident and doubtful conditions, the study provided a methodological paradigm that encouraged the present study to treat AI speakers as accent users in the computational analysis, thereby testing the generalisability of vocal confidence across human and AI speakers.
1.2. Linguistic Phonetic Cues to Differentiate Human Vocally-Expressed Confidence
Linguistic phonetic studies have demonstrated a profile of suprasegmental and segmental cues with which speakers express their feeling of knowing in speech, thereby forming speaker confidence. Speaker confidence is a type of emotive communication that signals the transient mental state of the talker's subjective certainty towards the statement they are making, based on the concepts and/or words retrieved from their metacognitive judgement; often, the level of confidence conveyed to signal pragmatic intention is independent of the speech content (Boduroglu et al., 2014; Mori & Pell, 2019; Nelson & Russell, 2011).
When speakers encode confidence in short English sentences, the unconfident voice is signalled by a higher F0, mean amplitude, and Harmonics-to-Noise Ratio (HNR), a slower speech rate, and more pauses, whereas the confident voice shows a higher amplitude, a lower HNR, and greater variation in F0 and amplitude (Jiang & Pell, 2017). Similarly, an acoustic analysis of vowels in the Chinese Wuxi dialect reported F0 modulation by intended speaker confidence consistent with the English findings, with dialect speakers raising their F0 to express unconfident feelings (Ji et al., 2022). Spectral information, such as formant peak values, was also relevant. A follow-up study using machine-learning classification with XGBoost (eXtreme Gradient Boosting) further confirmed that mean F0, F0 variation, and HNR were crucial features for distinguishing perceived confident and doubtful voices across accented and native English speakers (Jiang & Pell, 2018).
In addition to the findings from acoustic analyses of produced vocal sounds, perceptual studies confirmed that listeners reliably use F0 to represent speaker levels of confidence (Jiang & Pell, 2015; Jiang & Pell, 2017; Monetta et al., 2008). Studies that used Praat to shift the F0 of auditory sentences higher or lower found corresponding changes in the perceived level of confidence (Goupil et al., 2021; Guyer et al., 2019; see also Guyer et al., 2021). Consistent with the dialect theory of vocal communication (Elfenbein & Ambady, 2002; Jiang & Pell, 2018), these findings suggest that different speaker groups may encode confidence in different vocal dialects while also following a universal encoding mechanism in human speech during human-human interaction (Ji et al., 2022; Scherer, 1997). Although F0 is considered to reflect a biological modulator encoding speaker identity (Lavan, Knight, et al., 2019), neither acoustic analyses nor human perceptual experiments have sufficiently addressed the biological significance of F0 modulation across confidence levels.
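For concreteness, the following sketch illustrates the kind of Praat-based F0 manipulation used in such perceptual studies, here scripted through the Parselmouth interface. The file name and the 20% shift factor are arbitrary assumptions for illustration, not the settings of the cited studies.

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("statement.wav")  # hypothetical stimulus file

# Measure the original mean F0 and HNR for reference.
pitch = snd.to_pitch()
mean_f0 = call(pitch, "Get mean", 0, 0, "Hertz")
harmonicity = snd.to_harmonicity()
mean_hnr = call(harmonicity, "Get mean", 0, 0)

# Raise F0 by 20% via Praat's Manipulation object (overlap-add resynthesis),
# leaving duration and spectral envelope largely untouched.
manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
pitch_tier = call(manipulation, "Extract pitch tier")
call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, 1.2)
call([pitch_tier, manipulation], "Replace pitch tier")
raised = call(manipulation, "Get resynthesis (overlap-add)")
raised.save("statement_f0_up.wav", "WAV")

print(f"Original mean F0: {mean_f0:.1f} Hz, mean HNR: {mean_hnr:.1f} dB")
```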
1.3. Laryngeal and Acoustic Features of Human Vocal Expressions
While few studies have directly reported how speakers vary their laryngeal structures when speaking confidently or doubtfully, many have shown that speaker identity is reliably represented and distinguished by how the vocal tract varies in shape and length (Belin et al., 2004). Studies on speaker identity have consistently reported the role of Vocal Tract Length (VTL) in signalling speakers' biological sex and age (Lavan, Knight, et al., 2019; Meister et al., 2016; Rachman et al., 2022; Smith & Patterson, 2005). The VTL measures the curvilinear distance along the midline of the tract, from the glottis to its intersection with a line drawn tangentially to the lips, growing from 6-8 cm in infancy to about 15 cm in adult females and 18 cm in adult males (Vorperian et al., 2005).
VTL has been reported to correlate with F0 across speakers (Nagels et al., 2020), and this study therefore anticipated that VTL and F0 could also vary reliably with the speaker's mental state of being confident, doubtful, or neutral (Boduroglu et al., 2014; Mori & Pell, 2019; Nelson & Russell, 2011). The bridge between the apparent acoustic dimensions of VTL and F0 could be vocal size, the listener's perceived estimate of the speaker's anatomical properties (Fuller et al., 2014). Specifically, VTL modulation has been widely associated with vocal size deception (for example, vocalisers lengthen their VTL to sound larger and intimidate enemies) in the animal world (Charlton et al., 2008; Pfefferle & Fischer, 2006; Reby & McComb, 2003) as well as in human beings (Belyk et al., 2022; Pisanski & Reby, 2021; Waters et al., 2021). Such evidence links laryngeal mechanisms to paralinguistic information in speech communication (Belin et al., 2004). It has thus been deduced that humans' ability to exaggerate vocal size predates human language and could have contributed to the origins of the vocalic complexity of language (Pisanski et al., 2022; Pisanski et al., 2016). As such, VTL, an anatomically-grounded vocal cue correlated with F0 (Nagels et al., 2020), is suspected of signalling vocal expression much as F0 does in traditional vocal emotion research (Jiang & Pell, 2017). This study predicts that VTL modulation will be found when human speakers convey paralinguistic signals, such as confidence levels, in language communication.
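One common way to approximate apparent VTL from recorded speech, not necessarily the procedure used in this study, is to treat the vocal tract as a uniform tube closed at the glottis, for which the n-th formant satisfies F_n = (2n - 1)c / (4 VTL), and to average the tube lengths implied by the first few formants. The sketch below illustrates this approximation under those simplifying assumptions; the file name and formant-tracking settings are placeholders.

```python
import numpy as np
import parselmouth

SPEED_OF_SOUND = 35000.0  # cm/s, approximate speed of sound in warm, moist air

def apparent_vtl(path, n_formants=4, max_formant=5500.0):
    """Rough apparent-VTL estimate (cm) from the first few formants,
    assuming an idealised uniform tube closed at the glottis."""
    snd = parselmouth.Sound(path)
    formant = snd.to_formant_burg(maximum_formant=max_formant)
    times = np.linspace(snd.xmin, snd.xmax, 50)
    lengths = []
    for n in range(1, n_formants + 1):
        fn = np.array([formant.get_value_at_time(n, t) for t in times])
        fn = fn[~np.isnan(fn)]          # drop unvoiced frames
        if fn.size:
            # F_n = (2n - 1) * c / (4 * VTL)  =>  VTL = (2n - 1) * c / (4 * F_n)
            lengths.append((2 * n - 1) * SPEED_OF_SOUND / (4 * fn.mean()))
    return float(np.mean(lengths))

print(apparent_vtl("statement.wav"))  # hypothetical file; roughly 15-18 cm for adults
```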
Mel Frequency Cepstral Coefficients (MFCC), which simulate human auditory perception, are derived from the shape of the spectral envelope and represent the short-term power spectrum of the voice (X. Chen et al., 2022). MFCC is a frequently cited parameter in computational paralinguistics, including vocal emotion recognition (Koduru et al., 2020; S. Liu et al., 2021) and musical instrument or genre classification (Bhalke et al., 2016; Friberg et al., 2014). MFCC is also effective for speaker identification tasks (Hansen et al., 2017; Tirumala et al., 2017); for example, males have higher one-dimensional MFCC values than females (X. Chen et al., 2022). Still, how MFCC specifically contributes to characterising vocal confidence remains unanswered; the limited research available has noted that MFCC is an effective tool for depicting vocal expressions without showing exactly how (Hossain & Muhammad, 2019; Zhao et al., 2019).
Chroma-based features, or pitch-class profiles (PCP), represent the musical octave as 12 semitone pitch classes and capture the relationship between changes in timbre and the musical aspect of harmony. Chroma Short-Time Fourier Transform (STFT), Chroma Constant-Q Transform (CQT), and Chroma Energy Normalized Statistics (CENS) have been reported to be efficient parameters for classifying basic vocal emotions, including neutral, calm, happy, sad, angry, fearful, disgusted, and surprised (Alnuaim et al., 2022). Accordingly, this study aims to provide a further understanding of how Chroma-based features contribute to characterising speaker confidence levels.
Root Mean Square (RMS) energy, which indexes the loudness of the speech signal and is calculated as the square root of the mean of the squared amplitudes, has proven salient in classifying emotions (Abhang & Gawali, 2015), much as amplitude alone does (Jiang & Pell, 2018), and was therefore also considered useful. Still, the relative importance of RMS and amplitude, compared with other cues, in signalling paralinguistic information remains to be established. Spectral centroid, which measures the centre of mass of the voice spectrum and signals speech brightness, is also suspected of playing a role in signalling different levels of confidence (Huang et al., 2019), despite its limited ability to support consistent predictions about music (Schubert, 2004). Considering the homology of music and speech, and that Librosa (https://Librosa.org/; Version 0.9.2) has been reported to be a reliable tool for extracting and visualising various acoustic features, including the aforementioned Chroma-based features, the present study also extracted a list of additional accessible features through Librosa (Er, 2020; McFee et al., 2015). Specifically, the additional features were Spectral Bandwidth, which measures the width of a band of frequencies at half the maximum intensity (Abel & Fingscheidt, 2017; Cramer & Huggins, 1958); Spectral Contrast, which measures the difference between the peaks and valleys of the spectrum of a speech signal (Leek & Summers, 1996; Nogueira et al., 2016); Spectral Flatness, which reflects how much the speech signal resembles white noise (Kim & Stern, 2011; Madhu, 2009); Spectral Rolloff, which measures how fast the spectrum of a speech signal decreases with frequency (Chandwadkar & Sutaone; Stolar et al., 2018); Tonnetz (German for 'tone network'), which shows the triadic relationships of the perfect fifth and the major third among the 12 pitch classes of the chromatic scale (Milne & Holland, 2016); Zero Crossing Rate (ZCR), which measures how many times the signal crosses zero, from positive to negative or from negative to positive (Song et al., 2021); and Utempo, the static tempo estimated with a uniform prior, which measures the speed of a musical piece or speech signal, usually expressed in beats per minute (BPM) (Kong et al., 2004).
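To make this feature set concrete, the sketch below computes per-clip means of several of the features listed above with Librosa 0.9.x default parameters. The file name is hypothetical and the study's actual extraction settings may differ.

```python
import numpy as np
import librosa

def clip_features(path):
    """Per-clip mean values of a subset of the Librosa features described above."""
    y, sr = librosa.load(path, sr=None)
    feats = {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        "chroma_stft": librosa.feature.chroma_stft(y=y, sr=sr),
        "chroma_cqt": librosa.feature.chroma_cqt(y=y, sr=sr),
        "chroma_cens": librosa.feature.chroma_cens(y=y, sr=sr),
        "rms": librosa.feature.rms(y=y),
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
        "spectral_bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),
        "spectral_contrast": librosa.feature.spectral_contrast(y=y, sr=sr),
        "spectral_flatness": librosa.feature.spectral_flatness(y=y),
        "spectral_rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
        "tonnetz": librosa.feature.tonnetz(y=y, sr=sr),
        "zcr": librosa.feature.zero_crossing_rate(y),
    }
    summary = {name: float(np.mean(values)) for name, values in feats.items()}
    # Static tempo estimate (BPM), Librosa 0.9.x API.
    summary["utempo"] = float(librosa.beat.tempo(y=y, sr=sr)[0])
    return summary

print(clip_features("statement.wav"))  # hypothetical clip
```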
1.4. The Present Study
This study aims to characterise human vocal confidence through acoustic features and to assess how well AI can mimic human-specific vocal confidence, leading to three research questions. First, how do acoustic features, especially laryngeal-related cues, contribute to depicting human-specific vocal confidence? Second, can AI-cloned speakers mimic the vocal confidence encoding mechanism observed in human beings? If so, third, is it viable to predict confidence levels across human and AI speech sources?
In this study, ten human speakers were invited to produce 30 statements of trivia/geography knowledge in neutral, doubtful, and confident tones. The audio recordings of the trivia and the geography knowledge statements were then used separately to train AI models that replicate speaker identity and confidence prosodies. In this way, along with the human speech, 2,700 Chinese audio clips reading the same 30 sentences were obtained from three sources: Humans, AI models trained on the trivia texts (AI-Trivia), and AI models trained on the geography texts (AI-Geography). After extracting a set of 19 acoustic cues, linear mixed-effects models (LMEMs) were fitted to each of these features across the human and AI sources. Ten-fold cross-validated XGBoost classification was applied to produce importance scores for these features (Jiang & Pell, 2018). Further model comparisons, based on 1,000 simulations, compared accuracies between models trained and tested on human speech and on the different AI sources.
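As a sketch of the classification step (illustrative only: the feature file, column layout, label coding, and hyperparameters are assumptions rather than the study's exact configuration), a 10-fold cross-validated XGBoost classifier and its averaged importance scores could be obtained as follows:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

# Assumed layout: one row per clip, 19 acoustic feature columns plus a
# 'confidence' label (confident / doubtful / neutral) coded as 0/1/2.
df = pd.read_csv("human_features.csv")           # hypothetical file
X = df.drop(columns=["confidence"]).to_numpy()
y = df["confidence"].to_numpy()
feature_names = df.drop(columns=["confidence"]).columns

accuracies, importances = [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    model = xgb.XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))
    importances.append(model.feature_importances_)

print("Mean 10-fold accuracy:", np.mean(accuracies))
print(pd.Series(np.mean(importances, axis=0), index=feature_names)
        .sort_values(ascending=False))
```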
3. Results
3.1. The Effects of Confidence Levels on Acoustic Cues in Human and AI Sources
Without separating the sources, the analysis for 'VARIABLE ~ Source * Confidence Levels + Biological Sex + (1|Height) + (1|Item)' revealed significant main effects of both source and confidence level across all 19 acoustic features (p<.05), as illustrated in Table 2. For the main effect of source, only modest effect sizes (ηp² ranging from .01 to .06) were observed for the following features: ΔVTL, ΔF0, Mean VTL, Mean F0, Tonnetz, and Utempo. For the main effect of confidence levels, the effect on Tonnetz was negligible (ηp²<.01), whereas Mean VTL, Spectral Flatness, ΔVTL, Amplitude, and Utempo demonstrated small effect sizes (ηp² between .01 and .06). (Note that small, medium, and large effect sizes are conventionally taken as ηp² = .01, .06, and .14, respectively (Olejnik & Algina, 2000). The ηp² values in this study were calculated with the test-statistic approximation method (https://easystats.github.io/effectsize/articles/anovaES.html).)
Regarding interactions, no large effect sizes (ηp²>.14) were detected for any parameter. A majority of the acoustic features, including Amplitude, ΔF0, Chroma_stft, Spectral Flatness, Mean VTL, ΔVTL, Mean F0, Chroma_cqt, Chroma_cens, Spectral Rolloff, ZCR, and HNR, exhibited modest effect sizes (ηp² between .01 and .06), whereas Spectral Contrast, Root Mean Square, Tonnetz, Spectral Centroid, and Utempo showed only negligible effect sizes (ηp²<.01).
The left side of Supplementary Table S2 presents the results of the pairwise analysis concerning the sources. In a comparison between the two AI speech sources, no significant differences (p>.10) were observed for Mean VTL, Chroma_cqt, Spectral Contrast, or Root Mean Square, and Chroma_stft and HNR showed only marginally significant differences (.05<p<.10). In contrast, significant differences (p<.05) were observed across all acoustic parameters when contrasting clips generated by AI-Geography with human speech, or clips produced by AI-Trivia with human speech.
The right side of Supplementary Table S2 outlines the results of the pairwise analysis related to confidence levels. When contrasting the confident with the doubtful speech, significant differences (p<.05) were observed across all features. However, when confident and neutral speech were contrasted, only Amplitude and Utempo failed to show significant differences (p>.05), whereas the remaining 17 features did. When comparing doubtful and neutral speech, no significant differences (p>.05) were found for ΔVTL, Spectral Bandwidth, Tonnetz, Spectral Centroid, Spectral Rolloff, and Utempo, while significant differences were observed for the other features.
Supplementary Table S3 delineates a comparative analysis across the three confidence levels (C for Confident, D for Doubtful, and N for Neutral) for each of the three sources. Within AI-Geography, non-significance (p>.05) was observed for ΔVTL (C-D, C-N, D-N), Amplitude (C-N), Tonnetz (C-D, C-N), and ZCR (D-N). In the AI-Trivia subset, non-significance (p>.05) was found in nine conditions: Mean VTL (C-N), ΔVTL (D-N), MFCC (C-D), Spectral Bandwidth (D-N), Tonnetz (C-N, D-N), Spectral Centroid (D-N), Spectral Rolloff (D-N), and Utempo (C-N). In the human subset, the following did not meet the threshold of significance (p>.05): Spectral Bandwidth (C-D), Tonnetz (C-D, C-N, D-N), and Utempo (C-D, C-N, D-N). All remaining conditions across the three sources demonstrated significant variation.
In sum, both source and confidence level showed significant main and interaction effects across all 19 acoustic features. However, the main effects on Mean F0 and Mean VTL, the cues that signal biological modulation, were relatively small, and scattered non-significant contrasts remained in the pairwise results.
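For illustration, the mixed-model specification reported above could be fitted as in the minimal sketch below, using the pymer4 Python wrapper around lme4 as one possible route (the effect-size link above suggests the original analysis used R). The data file, column names, and the choice of Mean F0 as the example outcome are assumptions, not the study's actual setup.

```python
import pandas as pd
from pymer4.models import Lmer

# Assumed long-format table: one row per clip with an acoustic feature value,
# plus source (Human / AI-Trivia / AI-Geography), confidence level, biological
# sex, speaker height, and sentence item.
df = pd.read_csv("acoustic_features_long.csv")   # hypothetical file

model = Lmer(
    "mean_f0 ~ source * confidence + sex + (1|height) + (1|item)",
    data=df,
)
model.fit(factors={"source": ["Human", "AI-Trivia", "AI-Geography"],
                   "confidence": ["confident", "doubtful", "neutral"]})
print(model.anova())   # F tests for the main effects and the interaction
```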
3.2. Similar Effect of Biological Sex and Its Interaction with Confidence Levels between Human and AI Sources
This analysis treated the AI-Geography, AI-Trivia, and Human datasets separately. The aim was twofold: first, to describe the mechanisms of vocal confidence, and second, to probe the potential of AI to mimic such mechanisms. The key acoustic features were Mean VTL, Mean F0, Chroma_cqt, Chroma_cens, Amplitude, and MFCC, chosen for their high importance scores in the predictive model trained to characterise human vocal confidence (Figure 4A) and for their biological relevance to encoding vocal expression.
Table 3 presents both the main effects and the interaction of biological sex and confidence levels. For the main effect of biological sex, Chroma_cens and Amplitude showed no significant effect, whereas the other four features did. These patterns for the six features were consistently observed in AI-Geography, AI-Trivia, and Human data. Regarding the main effect of confidence levels, all parameters demonstrated main effects across all three sources. For the interaction between biological sex and confidence levels, non-significant effects were noted only for Amplitude in AI-Trivia (p=.91) and Mean VTL in Human (p=.29); apart from these two conditions, all other conditions for the six parameters demonstrated interaction effects.
A subsequent pairwise analysis was performed on confidence levels by subtracting male values from female values (Supplementary Table S4). The results revealed significant differences in Mean VTL (AIg: -2.1; AIt: -2.23; Human: -2.17), Mean F0 (97.43; 103.56; 100.3), Chroma_cqt (-.05; -.06; -.04), and MFCC (-6.53; -6.47; -7.03) across the AI-Geography, AI-Trivia, and Human sources, respectively. Inferential statistical analyses comparing the F-M values across the three confidence levels and the C-D/C-N/D-N values across the two biological sexes are shown in Supplementary Tables S5 and S6, respectively, with the associated significance annotated in Figures 1 and 2.
To distinguish among the three confidence levels, the estimated marginal means (emmeans), which account for all factors in the model, and their confidence intervals are shown in Table 4. First, for Mean VTL, the male ranking was Confident > Neutral > Doubtful in both the AI-Geography and Human datasets, whereas in the AI-Trivia dataset it was Neutral > Confident > Doubtful; for females, the ranking was consistently Confident > Neutral > Doubtful across all three sources. Second, for Mean F0, males showed a ranking of Doubtful > Confident > Neutral in the AI-Geography and AI-Trivia datasets, whereas the Human dataset showed Confident > Doubtful > Neutral; for females, the ranking of Doubtful > Confident > Neutral held across all three sources. Third, for Chroma_cqt, the pattern was uniform for both sexes across all three sources: Confident > Neutral > Doubtful. Fourth, for MFCC, the male ranking across all sources was consistently Neutral > Doubtful > Confident, while the female ranking was Neutral > Confident > Doubtful in the AI-Geography and AI-Trivia datasets but Neutral > Doubtful > Confident in the Human dataset.
Given the similar rankings and the biological relevance of Mean VTL, Mean F0, and Chroma_cqt, correlation analyses were performed. The findings revealed a negative correlation between Mean VTL and Mean F0 and a positive correlation between Mean VTL and Chroma_cqt across the AI-Geography, AI-Trivia, and Human conditions (Figure 3).
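As a simple illustration of this step (assuming per-clip means stored in a long-format table with hypothetical column names), the source-wise correlations could be computed as follows:

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("acoustic_features_long.csv")   # hypothetical per-clip table

for source, group in df.groupby("source"):
    r_f0, p_f0 = pearsonr(group["mean_vtl"], group["mean_f0"])
    r_cqt, p_cqt = pearsonr(group["mean_vtl"], group["chroma_cqt"])
    print(f"{source}: VTL~F0 r={r_f0:.2f} (p={p_f0:.3g}), "
          f"VTL~Chroma_cqt r={r_cqt:.2f} (p={p_cqt:.3g})")
```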
In short, humans lengthened their vocal tract when confident, resulting in higher Chroma_cqt and lower F0, and shortened it when doubtful, causing the opposite effects; this pattern held across human and AI speaker sources and across male and female speakers. Likewise, the other important features (see Section 3.3) showed agreement across biological sexes and sources.
3.3. The Important Features Signalling Vocal Confidence of Humans and AI
Seven key audio features showed high importance scores for accurately classifying confidence levels in human audio: Chroma_cqt, Chroma_cens, Root Mean Square, MFCC, Spectral Contrast, Vocal Tract Length, and Amplitude (Figure 4A).
However, in the case of AI-Trivia, only six of these features were found to be important, with Amplitude not contributing significantly to classification accuracy. In addition, Spectral Bandwidth, F0, Chroma_stft, Spectral Rolloff, and Spectral Flatness were also identified as important features for AI-Trivia classification (Figure 4B).
For AI-Geography, Amplitude was likewise of no importance, while Spectral Bandwidth, Chroma_stft, F0, Spectral Rolloff, and Spectral Flatness contributed more than they did in the human audio (Figure 4C).
Thus, the two AI models showed similar patterns of importance scores for Amplitude and for the additional acoustic features, despite some scattered differences in the values.
3.4. Training and Predicting Vocal Confidence across Sources
The results averaged over 1,000 runs in Table 5 revealed two in-group advantages. First, as expected, models trained and tested on data from the same source demonstrated the highest overall accuracy (Jiang & Pell, 2018). For instance, H/H achieved an accuracy of .72, whereas AIg/H and AIt/H dropped to .51 and .38, respectively; likewise, H/AIg and H/AIt achieved accuracies of only .54 and .53. Second, and most importantly, when AI models were tested on one another's data, their overall accuracies were higher than when tested on human data: AIg/AIt reached .67 and AIt/AIg reached .69, compared with .51 for AIg/H and .38 for AIt/H. All accuracy levels were above the chance level (1/3). The ROC curve analysis in Figure 4D also demonstrated this in-group advantage.
An ANOVA of 'Overall Accuracy ~ Training * Testing' revealed significant main effects of both training (F=2175, p<2e-16, ηp²=.33) and testing (F=8335, p<2e-16, ηp²=.65), as well as their interaction (F=16123, p<2e-16, ηp²=.88).
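For illustration, a two-way ANOVA of this form could be computed over the simulated accuracies as sketched below (a minimal example with statsmodels; the results file and column names are hypothetical, and the study's exact ANOVA procedure may differ):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Assumed layout: one row per simulation run, with its overall accuracy and
# the training / testing source labels (H, AIg, AIt).
acc = pd.read_csv("simulation_accuracies.csv")    # hypothetical results table

model = smf.ols("accuracy ~ C(training) * C(testing)", data=acc).fit()
print(anova_lm(model, typ=2))   # F tests for Training, Testing, and their interaction
```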
The pairwise contrasts in Supplementary Table S7 yielded several findings. H/H outperformed AIt/H (β=.34, p<.0001) and AIt/AIt (β=.04, p<.0001). AIg/H performed worse than H/H (β=-.21, p<.0001) but better than AIt/H (β=.13, p<.0001). AIg/AIg consistently outperformed the other conditions, particularly AIt/H (β=.38, p<.0001) and H/AIg (β=.22, p<.0001). AIt/AIg also differed significantly from H/AIg (β=-.15, p<.0001) and H/AIt (β=.16, p<.0001). Finally, the model trained on AI-Trivia performed equally well when tested on AI-Geography and on AI-Trivia data; see AIt/AIg - AIt/AIt (β=0, t=2.09, p=1).
Altogether, the 1,000-run training and testing study confirmed that models trained and tested on data from the same source achieved higher accuracy, and that AI models generally performed better when tested on each other's data than when tested on human data, hence the in-group advantage. Still, the above-chance accuracies obtained when training and testing across humans and AI point to AI's robust capacity to replicate human-specific vocal confidence.
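A rough sketch of this cross-source training and testing procedure is given below. It is illustrative only: the feature files, hyperparameters, resampling scheme, and the reduced number of repetitions are placeholders rather than the study's exact 1,000-simulation protocol.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def load_xy(path):
    df = pd.read_csv(path)  # hypothetical per-clip feature tables
    X = df.drop(columns=["confidence"]).to_numpy()
    y = LabelEncoder().fit_transform(df["confidence"])  # 3 confidence labels -> 0/1/2
    return X, y

sources = {"H": load_xy("human_features.csv"),
           "AIg": load_xy("ai_geography_features.csv"),
           "AIt": load_xy("ai_trivia_features.csv")}

def cross_source_accuracy(train, test, n_runs=100, seed=0):
    """Mean accuracy over repeated runs: within-source cells are scored on a
    held-out split, cross-source cells on the full other-source set."""
    X_all, y_all = sources[train]
    rng = np.random.RandomState(seed)
    accs = []
    for _ in range(n_runs):
        X_fit, X_hold, y_fit, y_hold = train_test_split(
            X_all, y_all, test_size=0.2, stratify=y_all,
            random_state=rng.randint(1_000_000))
        model = xgb.XGBClassifier(objective="multi:softprob", eval_metric="mlogloss")
        model.fit(X_fit, y_fit)
        X_te, y_te = (X_hold, y_hold) if train == test else sources[test]
        accs.append(model.score(X_te, y_te))
    return float(np.mean(accs))

for tr in sources:
    for te in sources:
        print(f"{tr}/{te}: {cross_source_accuracy(tr, te):.2f}")  # chance level = 1/3
```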
4. Discussion
4.1. Characterising Human Vocal Confidence through VTL
Mean VTL was found to encode speaker confidence, with the confident voice showing the longest VTL, the neutral-intending voice an intermediate VTL, and the doubtful voice the shortest. Longer and shorter Mean VTL values reflect active modulation of the vocal tract, with lengthening serving to project a larger body size (Anikin et al., 2022). Hence, the confident voice can be described as a state in which human speakers extend their vocal tract, lowering Mean F0, to sound more dominant (Puts et al., 2007). Previous studies have noted the importance of Mean F0 and Mean VTL in characterising 'who is talking', that is, speaker identity (Lavan, Knight, et al., 2019). The current study contributes to the argument that speaker identity and long-term traits, on the one hand, and short-term states such as speaker emotions, on the other, are intertwined (Belin et al., 2004; Lavan, Burton, et al., 2019; Mileva & Lavan, 2023; Schuller & Batliner, 2013; Sorokowski et al., 2019).
Moreover, Mean VTL was positively correlated with Chroma_cqt and negatively correlated with Mean F0, and both Mean VTL and Chroma_cqt followed a robust value ranking of Confident > Neutral > Doubtful. Chroma_cqt represents the twelve pitch classes in the speech signal, corresponding to the notes C, C#, D, D#, E, F, F#, G, G#, A, A#, and B, from lowest to highest, in the Western musical scale (Huang & Mushi, 2022). A higher Chroma_cqt value suggests the speech sample is closer to the higher notes of the scale. To produce confident speech, the speaker may therefore increase Chroma_cqt to sound brighter (Collier & Hubbard, 2004).
These findings, based on vocal expressions portrayed by Chinese speakers, extend previous findings in the English context (Jiang & Pell, 2017, 2018). Although they did not take VTL into account, those studies did show that the unconfident voice had a higher Mean F0, consistent with the shorter Mean VTL reported in the current study. Vocal size exaggeration has been associated with the evolution of the human speech-oral-motor system and cited as a common ability across species, and human speakers have incorporated this intuitive capacity for coordinating vocal anatomy and sound into the vocal communication system (Anikin et al., 2021; Pisanski et al., 2022; Pisanski & Reby, 2021). The current study therefore strengthens the empirical evidence from a cross-linguistic perspective and offers a more informative depiction of the vocal modulation mechanism underlying speaker expression by analysing acoustic cues that signal the anatomical structure.
4.2. AI Speakers Can Imitate Human-Specific Vocal Confidence
The present study demonstrated that an AI algorithm designed to clone speaker identity can also mimic human vocal confidence levels. The importance scores showed that the two sets of AI data, like the human audio, relied on the same range of acoustic features, a list of seven cues from Chroma_cqt to VTL, to encode vocal confidence. Such shared reliance on important acoustic features could underlie the above-chance classification accuracies observed when training and testing across data sources, as well as the similar value rankings for Chroma_cqt, VTL, and MFCC.
Compared with human speakers, the AI models relied more on additional features, such as Spectral Bandwidth and F0, which were not deemed important for encoding human confidence. It is possible that, because these additional features entered into a multivariate pattern of representation, the AI models achieved even higher accuracies when trained and tested within AI sources than when training or testing involved human data.
AI's already strong cloning ability in HCI is in line with dialect theory in HHI. Dialect theory of human communication holds that individuals from different cultural backgrounds or group identities share similar patterns of expressing emotions (Scherer, 1997), despite variations caused by culture- or group-specific norms (e.g., mother tongue). Such encoding rules within and between speaker cultures and groups have been supported by perceptual studies showing common and differential neural responses when decoding vocally-expressed confidence and doubt in native and accented speakers (Jiang et al., 2015; Laukka et al., 2014; Pell et al., 2009; Scherer, 1997). By systematically modulating the anatomical structure of the larynx, humans produce different types of speech that convey their varied levels of 'feeling of knowing'. Here, AI proved capable of replicating such subtle speech differences while also cloning the speaker-specific identity that signals 'who is talking', in effect simulating the human speech-motor/laryngeal control of the voice box to communicate pragmatic meanings that can be 'perceived'. AI's capability to learn and replicate human-specific speaker identity and emotive vocal states could pose an empirical threat to modern daily activities that rely heavily on HCI, in which networked computers play out sounds unceasingly. These sounds may contain speech signals that are either authentic or faked and that vary in tone and emotion, making it difficult for human listeners to discern whether the speaker's group identity is human or AI. This realistic concern should motivate the advancement of institutional regulation of speech synthesis. For instance, to counter voice deepfakes, legislation such as the Defending Each and Every Person from False Appearances by Keeping Exploitation Subject to Accountability (DEEP FAKES Accountability) Act has been proposed, which would require deepfake creators to mark their content with an indelible digital watermark (Langa, 2021).
4.3. Implications and Limitations
This study demonstrated how a voice-cloning service designed to clone human speaker identity also captures and replicates vocal confidence across expressive levels. Such a capacity matters because a wide range of research has attested that the human brain responds distinctively to different speaker identities (Kroczek & Gunter, 2021; Puhacheuskaya & Järvikivi, 2022). Theories of human-machine co-behaviour have identified an emerging trend of examining the long-term dynamics of hybrid systems and the ways human social interactions can be modified by the introduction of intelligent machines (Rahwan et al., 2019). By demonstrating AI's capacity to clone the acoustic encoding of vocally-expressed confidence, the current study serves as an interface between engineering studies on the synthesis of affective speech and affective computing with decoding approaches (Gunes et al., 2019; Gunes et al., 2011; Habib et al., 2019). The current findings also provide a basis for recognising communicative meanings from alternative modalities through psychological and neurophysiological data analysis (Cross & Ramsey, 2021; Di Cesare et al., 2022; Kuriki et al., 2016; Li et al., 2023; Nummenmaa et al., 2023; Saarimäki et al., 2022; Tamura et al., 2015). In existing HCI studies, AI as a social agent does not have emotional intelligence (EQ) equivalent to that of humans, and human counterparts can immediately detect its group identity (Mou et al., 2019). However, future research on HCI with AI powered by large language models such as ChatGPT and by affective speech synthesis technology may be challenged by the higher human likeness conveyed by emotive states like vocal confidence and perceived personalities (2021); such challenges are expected to influence human performance in HCI scenarios across settings such as cooperation, competition, coordination, learning, and communication; for example, self-driving cars could adopt assertive and dominant synthesised voices to encourage responsible driving (Wong et al., 2019; Yoo et al., 2022).
Still, this study has limitations that can be addressed in further research. On the one hand, it remains unanswered how speakers who share an individual identity but differ in group identity (here determined by the speaker source, AI or human) affect the perceptual responses of human listeners, that is, how does 'knowing the speaker to be AI/human' influence social judgement (Chen et al., 2023; Gampe et al., 2023; Mou & Xu, 2017)? Possible scenarios include customer service, education, health care, and entertainment, where the tone, emotion, feedback, encouragement, empathy, trustworthiness, humour, and personality of the agent could influence listeners' social judgements along dimensions such as satisfaction, loyalty, motivation, learning, well-being, adherence, enjoyment, and engagement (Baird et al., 2018; Canbek & Mutlu, 2016; Chattaraman et al., 2019; Hu et al., 2022; McLean et al., 2021; Reicherts et al., 2022; Rodero & Lucas, 2021).
Furthermore, the current study serves as an interface for clinical work involving special populations. For example, behavioural data have shown that children with autism spectrum disorder (ASD) have a weak ability to perceive psychological attributes of humanness in singing (Kuriki et al., 2016), which prompted proposals to construct an emotional speech database in Mexican Spanish for future ASD studies (Duville et al., 2021). Notably, Duville et al. (2021) did attempt to construct comparable human versus AI emotive speech sets, yet their AI voices were obtained by modifying human speech, which could distort the speaker identity encoded in the speech; the current study instead employed voice-cloning techniques to ensure the similarity of both emotive states and speaker identity, and has provided validation evidence accordingly. On this foundation, future studies can investigate the behavioural and neural mechanisms of decoding emotions and the speaker's individual and group identity in vocal speech in atypical populations (Cha et al., 2021; Chevallier et al., 2011; Frijia et al., 2021; Golan et al., 2006; Järvinen-Pasley et al., 2008; Jones et al., 2009), thus informing fields such as automatic diagnosis through interaction tasks (Baki et al., 2022; Cummins et al., 2020; Siddi et al., 2023) or the investigation of neural mechanisms (Gallagher et al., 2000; M. Liu et al., 2021).
On the other hand, it remains unclear how vocal expressions conveying pragmatic intentions in AI speech affect the perceived group identity of human-like AI, that is, how paralinguistic and linguistic information influences the judgement of a speaker's group identity as, for example, human-like (Jiang & Pell, 2015; Ko et al., 2023; Lee, 2010; Melo et al., 2023; Pelachaud, 2017). The current study showcased AI's strong ability to learn and mimic human vocal confidence when providing text-to-speech services, and the AI-generated speech was found to be comparable to the original human-specific vocal confidence. These features mean that emotive AI-produced audio could confuse human listeners about the human-ness of the speech, raising issues such as the uncanny valley and the Eliza effect. As perceptual effects, the uncanny valley is the phenomenon whereby people feel uneasy or repulsed by human-like robots or animations that are not quite realistic enough (Mori et al., 2012), while the Eliza effect is the tendency to anthropomorphically attribute emotions and intentions to artificial intelligence systems that mimic human speech or behaviour (Kim et al., 2019).
Importantly, it should be acknowledged that the AI speech in this study acquired vocal confidence expression in a non-interactive task using a pre-defined model. Future research could examine how AI can learn to encode human emotional cues directly from interactive tasks and how this would influence the perception and cognition of human participants (Gampe et al., 2023; Nasir et al., 2022; Salam et al., 2023).