1. Introduction
The cochlear implant (CI) represents a major bioengineering achievement in the treatment of individuals with severe to profound sensorineural hearing loss (Tamati, Pisoni, & Moberly, 2022; B. S. Wilson, Dorman, Woldorff, & Tucci, 2011). Although CIs provide recipients with impressive access to sound, current devices transmit distorted acoustic signals via speech coding strategies with a limited number of electrodes, resulting in degraded spectral-temporal information (Moore & Shannon, 2009). Fine structure information is poorly resolved, with only temporal envelope cues preserved, which compromises the representation of fundamental frequency (F0) and harmonics (Oxenham, 2008). As F0 is a primary acoustic correlate of pitch patterns in speech sounds (Whalen & Xu, 1992), the pitch percept is generally weak, which poses a unique challenge for CI users.
Because pitch perception is impaired in CI users, persistent challenges arise in acquiring lexical tones, particularly for speakers of tonal languages such as Mandarin Chinese (Tan, Dowell, & Vogel, 2016 for a review). Mandarin uses four pitch variations (high-level, low-to-high rising, low-dipping, and high-falling) to represent its four tones (T1, T2, T3, T4) (Chao, 1948; W. S. Y. Wang, 1973). Prior studies have consistently documented significant deficits in both tone perception and production among Mandarin-speaking CI children (Chen & Wong, 2017; Gao, Wong, & Chen, 2021; Tan et al., 2016 for reviews). These children typically achieve accuracy rates of about 67%-77% in tone recognition (Tao et al., 2015; H. Zhang, Zhang, Ding, & Zhang, 2020; Zhou, Huang, Chen, & Xu, 2013) and around 50% in tone production (Peng, Tomblin, Cheung, Lin, & Wang, 2004; Xu et al., 2011; Zhou & Xu, 2008). In contrast, typically developing children with normal hearing (NH) acquire lexical tones early, and this acquisition stabilizes over the first 2-3 years of life (Singh & Fu, 2016 for a review). For instance, with a large subject pool (107 Mandarin-speaking children with CIs and 125 age mates with NH), Zhou et al. (2013) found that the group mean accuracies of tone perception and production were 67.3% and 46.8%, respectively, for CI children, whereas the corresponding accuracies were 98.7% and 94.8% for their NH counterparts. This gap highlights significant room for improvement in Mandarin-speaking pediatric CI recipients’ lexical tone acquisition.
Auditory training is a sound-based intervention for speech and hearing rehabilitation that engages the central auditory system to make perceptual distinctions of sound contrasts through repetition and variation of sound stimuli together with adaptive listening and effective feedback (Cambridge, Taylor, Arnott, & Wilson, 2022; Rayes, Al-Malky, & Vickers, 2019 for reviews). Several reports have demonstrated the potential benefits of auditory training in lexical tone rehabilitation for CI users (X. Cheng et al., 2018; Kim, Chou, & Luo, 2021; Wu, Yang, Lin, & Fu, 2007; H. Zhang, Ding, & Zhang, 2021; H. Zhang, Ma, Ding, & Zhang, 2023). These studies employed different approaches, such as phonetic identification, tone recognition, and melodic contour identification training, and reported improvements in lexical tone perception for trained CI recipients. Intensive music training also yielded perceptual gains, with improved recognition of lexical tones and sentences (X. Cheng et al., 2018). Despite these positive findings, previous research mainly assessed training-induced gains in speech perception, overlooking cross-modal transfer to speech production. To address this gap, our study employs the widely recognized high variability phonetic training (HVPT) protocol. Our primary aim is to assess the potential success of HVPT for lexical tone acquisition in Mandarin-speaking CI children, focusing on robust generalization of perceptual training benefits to new stimuli and reliable far-transfer to lexical tone production.
As a well-established technique in second language learning, HVPT utilizes training materials with multiple talkers and varying phonetic contexts. HVPT is known for its benefits in robust generalization, long-term retention, and far-transfer of perceptual learning to production in nonnative speech acquisition (Barriuso & Hayes-Harb, 2018; Ingvalson & Wong, 2016; X. Zhang, Cheng, & Zhang, 2021 for reviews). In a seminal study, Logan et al. (1991) employed variable natural speech produced by multiple talkers to train native Japanese speakers in distinguishing English consonants /r/ and /l/. Results showed improved identification of the target speech sounds across talkers and stimuli, indicating robust generalization (Logan et al., 1991), and this generalization effect was attributed to talker variability (Lively, Logan, & Pisoni, 1993). Further research demonstrated long-term retention (at least three months) of perceptual gains (Lively, Pisoni, Yamada, Tohkura, & Yamada, 1994) and far-transfer to speech production (Bradlow, Akahane-Yamada, Pisoni, & Tohkura, 1999; Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997). HVPT has also been applied to nonnative speakers learning lexical tones, showing transfer from perception to production (Dong, Clayards, Brown, & Wonnacott, 2019; Y. Wang, Jongman, & Sereno, 2003; Y. Wang, Spence, Jongman, & Sereno, 1999; Wiener, Chan, & Ito, 2020). For example, Wang et al. (1999, 2003) assessed HVPT’s effectiveness in teaching Mandarin tones to native speakers of American English. Participants underwent an eight-session identification training over two weeks, resulting in improved tone identification, generalization to new words and untrained speakers, and a remarkable 18% increase in tone production accuracy rated by native Mandarin-speaking adults (Y. Wang et al., 2003).
Inspired by successes in second language learning, attempts to apply the HVPT protocol to CI users have yielded promising results. Miller and colleagues administered a two-week perceptual training program for postlingually deafened adults with CIs, significantly enhancing their recognition of phonetic contrasts (Miller, Zhang, & Nelson, 2016a, 2016b). Recent studies by Zhang and colleagues demonstrated that Mandarin-speaking CI children could obtain robust and lasting benefits in lexical tone perception through a five-session HVPT protocol (H. Zhang et al., 2021, 2023; H. Zhang, Zhang, Ding, & Li, 2020). These gains extended to both familiar and unfamiliar talkers and lasted up to 10 weeks post-training. However, it remains unclear whether these perceptual gains generalize to novel phonetic contexts, given that testing was conducted on the monosyllable /i/, which was also used in training. Moreover, research is still needed on cross-modal transfer of perceptual learning to lexical tone production in Mandarin-speaking pediatric CI users.
This study builds upon previous research to investigate the effectiveness of the HVPT protocol for Mandarin-speaking pediatric CI users’ lexical tone rehabilitation. Three research questions are addressed: (a) whether training-induced gains generalize to recognizing lexical tones in novel phonetic contexts; (b) whether perceptual learning transfers to improved tone production during spontaneous speech; (c) whether a relationship exists between perception gains and production improvements. Based on prior research findings, we would expect the following results: (a) trained children with CIs would improve recognition of lexical tones for both trained and untrained stimuli, while control counterparts would show minimal pretest-posttest changes; (b) perceptual learning would transfer to improved tone production, with some tones benefiting more than others; (c) there would be a weak but positive relationship between gains in tone recognition and production. These findings can inform effective speech rehabilitation for tone language CI users.
4. Discussion
This study aimed to extend our prior work by assessing the effectiveness of HVPT for Mandarin-speaking pediatric CI users’ lexical tone learning with additional tests of transfer of learning. Pretest and posttest assessments were conducted to evaluate tone recognition and production. The results largely aligned with our expectations based on nonnative phonetic training studies. Trained children improved recognition for both trained (i.e., /i/) and untrained (i.e., /ɤ/) syllables. Additionally, perceptual learning robustly enhanced tone production, even though the relationship between perception gains and production gains was not significant. This is a noteworthy finding, potentially the first to demonstrate a reliable perception-production connection in pediatric CI users via cross-modal transfer following auditory training.
4.1. Robust Generalization to Lexical Tone Recognition in Novel Phonetic Contexts
Following five sessions of HVPT, significant improvements in lexical tone recognition were observed for the TG child participants with CIs in both trained and untrained phonetic contexts. In contrast, the CG children, who did not undergo formal training, showed no significant pretest-posttest changes in either context (see Figure 3 and Figure 4). This underscores the substantial role of HVPT in eliciting perceptual gains, as the TG and CG had similar baseline performance. This finding replicated our previous research (H. Zhang et al., 2021, 2023) and further demonstrated that, with high variability input, the perceptual gains could transfer to a novel, untrained phonetic context. Robust generalization of perceptual learning is crucial in speech training (Lively et al., 1993; Logan et al., 1991; Pisoni & Lively, 1995). Our finding adds to the growing body of evidence that emphasizes the pivotal role of high variability input in successful learning across a variety of speech domains (Fuhrmeister & Myers, 2020; Lively et al., 1993; Shinohara & Iverson, 2018; X. Zhang, Cheng, Qin, & Zhang, 2021; X. Zhang, Cheng, & Zhang, 2021; X. Zhang, Cheng, Zou, Li, & Zhang, 2023). Moreover, across a broader spectrum of scientific literature, including visual perception, motor learning, language acquisition, computational modeling/deep learning, inductive reasoning, problem-solving, and education, variability in input generally follows a common pattern: while more variable input may initially hinder learning, it typically leads to robust generalization effects (Raviv, Lupyan, & Green, 2022).
The benefits of high variability input in perceptual training can be explained through various theoretical perspectives. One view, known as the abstractionist view, posits that listeners normalize phonetic variability by extracting generalized phonetic patterns across different talkers, allowing them to map these patterns to long-term memory representations (Ladefoged & Broadbent, 1957; Lieberman, 1973). Exposure to diverse acoustic cues during training necessitates a normalization process that emphasizes between-category phonetic differences while reducing sensitivity to within-category acoustic variations, ultimately enhancing perceptual sensitivity (Pisoni & Lively, 1995). This perspective aligns with computational models suggesting that generalization occurs when talker-specific idiosyncrasies (linguistically irrelevant cues) are dissociated from linguistically relevant cues (Ramscar & Baayen, 2013; Ramscar, Yarlett, Dye, Denny, & Thorpe, 2010). In contrast, the exemplar view contends that talker-specific idiosyncrasies are not excluded but rather encoded and stored alongside phonetically relevant information in long-term memory (Johnson, 1994, 1997). These talker-specific cues are thought to enhance learning by creating more associative connections, leading to more robust mental representations (Goldinger, 1998). This perspective aligns with recent training studies using the categorical perception paradigm, which observed improvements in within-category tone discrimination following HVPT for Mandarin-speaking pediatric CI users (H. Zhang et al., 2023). It is important to note that these two views are not necessarily contradictory; both emphasize the importance of introducing high variability input to establish more robust abstract representations. Future research, employing techniques such as electrophysiology (B. Cheng, Zhang, Fan, & Zhang, 2019) and eye-tracking (Qin, Tremblay, & Zhang, 2019), could further investigate how idiosyncratic information is used to form lexical tone representations in CI recipients undergoing formal auditory training.
4.2. Cross-modality Transfer of Perceptual Learning to Tone Production
The perceptual learning achieved through HVPT extended beyond lexical tone recognition and significantly improved the accuracy of lexical tone production in trained pediatric CI users. However, this improvement was not uniform across all four tone types, with notable gains observed for T1 and T2, while T3 and T4 showed no significant improvement (refer to Figure 5). This pattern aligns with previous research on the developmental trajectory of lexical tone production in Mandarin-speaking children aged 3-5 years, which suggests an acquisition order of the four Mandarin tones from T4 to T1 to T2 to T3 (Wong, 2012a, 2012b, 2013; Wong et al., 2005). T3, in particular, was found to be more challenging due to its complex motor sequencing and control demands, involving the activation of various laryngeal muscles (Wong, 2012b). The tangible benefits observed in T3 recognition (see Figure 2) may not have translated to significant improvements in T3 production, potentially due to the immature motor control required for this complex tone. These findings support the hypothesis that perception precedes production in speech acquisition (Edwards, 1974; Greenlee, 1980; Kuhl et al., 2008) and underscore the importance of accurate tone perception as a precursor for good tone production in pediatric CI users (Xu et al., 2011; Zhou et al., 2013). Conversely, T4 production involves minimal motor control, primarily requiring the relaxation of the cricothyroid muscle (Wong, 2013). The lack of significant perceptual learning transfer to T4 production might be attributed to participants having already mastered this tone type to a ceiling-level degree. This asymmetric pattern of production improvement across the four tones aligns with recent research suggesting that HVPT, combined with explicit instruction, improved nonnative beginners’ productions of T1, T2, and T4 but had limited effectiveness in enhancing T3 production even after extensive classroom exposure (Wiener et al., 2020). The nonsignificant improvement in T3 production in both this study and nonnative learning research calls for future investigations exploring whether incorporating motor control exercises for laryngeal muscles into HVPT might improve T3 production accuracy for both native and nonnative learners.
In addition to the perceptual training-related benefits in production, we also evaluated correlations between lexical tone perception and production in terms of baseline performance (i.e., recognition and production accuracies in pretest) and magnitude of training-induced gains (i.e., difference between pretest and posttest) to uncover the perception-production link in Mandarin-speaking children with CIs. The significant positive correlation between overall perception and production of lexical tones (i.e., with all four tones combined) replicated previous findings (e.g., Peng et al., 2004; Xu et al., 2011; Zhou et al., 2013) that pediatric CI users with exceptional performance in tone recognition also tend to perform well in tone production. Nevertheless, the relationship between perception gains and production gains was nonsignificant, with a strikingly small correlation coefficient (r = .038, p = .76). This observation was partially inconsistent with our prediction, which was based on perceptual training of second language phonemes. A recent meta-analysis synthesizing 21 studies demonstrated a small to medium relationship between perception and production gains after training, although the relationship failed to reach significance (r = .31, p = .18) (Sakai & Moorman, 2018). The particularly small correlation coefficient in this study may be partially due to the small number of trained participants with CIs (n = 16) and the remarkable individual differences in the trainees’ perception and production performance. Future research recruiting more participants with less heterogeneity may yield more statistically robust and reliable correlation results. Nonetheless, the nonsignificant finding echoed the meta-analysis results, suggesting that the contributing factors for production improvements may be partially or entirely independent of those for perception improvements (Sakai & Moorman, 2018).
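To make the gain-score correlation concrete, the following is a minimal sketch in Python of how pretest-posttest gains and their Pearson correlation could be computed. It is an illustration under stated assumptions, not the analysis pipeline used in this study: the function names (gain_scores, pearson_r, t_statistic) and all scores are hypothetical, and the five data points are invented for demonstration only.

```python
import math

def gain_scores(pretest, posttest):
    """Training-induced gain per participant: posttest minus pretest."""
    if len(pretest) != len(posttest):
        raise ValueError("pretest and posttest must have the same length")
    return [post - pre for pre, post in zip(pretest, posttest)]

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with at least 3 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero (n - 2 degrees of freedom).

    Undefined when r is exactly +1 or -1.
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical scores for five children (illustration only, NOT study data).
perception_pre = [60.0, 55.0, 70.0, 65.0, 50.0]
perception_post = [75.0, 68.0, 78.0, 80.0, 66.0]
production_pre = [45.0, 50.0, 55.0, 40.0, 48.0]
production_post = [52.0, 61.0, 60.0, 49.0, 50.0]

perception_gain = gain_scores(perception_pre, perception_post)
production_gain = gain_scores(production_pre, production_post)
r = pearson_r(perception_gain, production_gain)
n = len(perception_gain)
print(f"r = {r:.3f}, t({n - 2}) = {t_statistic(r, n):.3f}")
```

Converting the t statistic to a p value requires the cumulative distribution of Student's t, which the standard library does not provide; a statistics package (for example, scipy.stats.pearsonr) would return both r and p directly. The sketch also shows why small samples matter: with few degrees of freedom, even a moderate r yields a small t statistic.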
The observed gains in production and the positive relationship between perception and production outcomes in Mandarin-speaking pediatric CI recipients undergoing perceptual training have important implications for the theoretical understanding of the interplay between perception and production in lexical tone rehabilitation. The motor theory (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985) and the direct realist theory (Fowler, 1986, 1989) propose the existence of common speech representations shared by both perceptual and production modalities. Additionally, the native language magnet theory suggests that the strong connection between speech perception and production is developmental, forged through accumulated perceptual experience and learned mappings between the two modes (Kuhl et al., 2008; Kuhl & Meltzoff, 1996). Beyond shared mental representations, speech perception and production are believed to involve overlapping neural mechanisms. Neuroimaging studies have demonstrated co-activation of brain regions responsible for both speech perception and production tasks (e.g., Campbell et al., 2001; S. M. Wilson, Saygin, Sereno, & Iacoboni, 2004). Notably, the motor cortex (involved in speech production) is activated during passive speech listening, while the auditory cortex (associated with hearing speech sounds) is co-activated when participants view silent videos of mouthed speech. Hypothetically, the HVPT in this study extended the perceptual experience of lexical tones in trained pediatric CI recipients. This tuning of mental representations, coupled with potential changes in the underlying cortical processes due to neural plasticity, might have facilitated a more accurate representation of target lexical tones. These refined mental representations, in conjunction with established neural mechanisms and learned mappings between perception and production, likely contributed to improvements in lexical tone production. 
Future research should delve into the neural correlates of perceptual training-induced changes, employing a cognitive neuroscience approach. This would provide a comprehensive understanding of the perception-production link in the context of speech rehabilitation and learning for pediatric CI recipients.
4.3. Limitations and Future Directions
Several limitations of this study should be acknowledged. First, the evaluation of pediatric CI participants’ lexical tone production relied solely on perceptual judgments by native Mandarin-speaking adults. Although measuring production accuracy on the same scale (RAU score) as recognition accuracy facilitates correlation analysis between perception and production, future investigations could benefit from incorporating alternative production measures, such as acoustic analyses, to capture more nuanced differences in perceptual training-induced gains in lexical tone production (Deroche, Lu, Lin, Chatterjee, & Peng, 2019; Mao, Chen, Xie, & Xu, 2020; Tang, Yuen, Xu Rattanasone, Gao, & Demuth, 2019, 2021; Tao, Liu, & Zhou, 2022).
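For readers unfamiliar with the RAU scale, the sketch below shows the commonly used rationalized arcsine transform (Studebaker, 1985), which converts a count of correct responses into units that are approximately linear across the middle of the percentage range and less compressed near floor and ceiling. This section does not specify the exact computation applied to the present data, so the formula's use here is an assumption, and the function name rau and the 20-trial example are hypothetical.

```python
import math

def rau(correct, total):
    """Rationalized arcsine units for `correct` responses out of `total` trials.

    Assumed formulation (Studebaker, 1985):
        theta = asin(sqrt(X / (N + 1))) + asin(sqrt((X + 1) / (N + 1)))
        RAU   = (146 / pi) * theta - 23
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require total > 0 and 0 <= correct <= total")
    theta = (math.asin(math.sqrt(correct / (total + 1)))
             + math.asin(math.sqrt((correct + 1) / (total + 1))))
    return (146.0 / math.pi) * theta - 23.0

# Hypothetical 20-trial test (illustration only, NOT study data).
for n_correct in (0, 5, 10, 15, 20):
    print(n_correct, round(rau(n_correct, 20), 2))
```

Under this formulation, a score of exactly half the trials correct maps to 50 RAU, while scores at floor and ceiling fall slightly below 0 and above 100, respectively. Placing recognition and production accuracy on this common scale is what allows the two measures to be correlated directly.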
Another limitation pertains to the absence of a long-term retention assessment. Evaluating long-term retention, which involves measuring the maintenance of training-induced benefits over an extended period, is crucial for assessing the stability of learning. Although long-term retention has received relatively limited attention in auditory training studies involving pediatric CI users (Rayes et al., 2019), it is an important indicator of the durability of learning outcomes. Future research should consider conducting follow-up assessments to determine whether the perceptual training-induced improvements in tone production persist over time for trained CI children.
Additionally, it is important to acknowledge the relatively small sample size in this study, which is a common challenge in training research due to the substantial time commitments required for both researchers and participants. Future studies should aim to increase the sample size substantially to enhance the robustness and generalizability of findings in the realm of aural interventions.
Despite these limitations, the findings of this study hold significant implications for rehabilitative strategies for pediatric CI users from tone languages. The study demonstrates tangible gains in recognizing lexical tones and highlights the cross-modality transfer of perceptual learning to tone production, affirming the effectiveness and efficiency of the HVPT protocol for lexical tone rehabilitation and learning in CI children. Given its cost-effectiveness and accessibility via the internet, the HVPT protocol has the potential to bridge the gap between laboratory research and clinical practice. However, future research endeavors should focus on confirming the ecological validity and therapeutic efficacy of HVPT for individuals with CIs in real-world clinical settings. Particular attention should be given to optimizing perceptual training methods for pronunciation improvements.