Pronunciation assessment
Use of speech recognition to verify pronunciation

From Wikipedia, the free encyclopedia
Automatic pronunciation assessment uses speech recognition to check how accurately speech is pronounced,[1][2] instead of relying on a human instructor or proctor.[3] Also called speech verification, pronunciation evaluation, and pronunciation scoring, this technology is mainly used for computer-aided pronunciation teaching (CAPT), when combined with computer-aided instruction for computer-assisted language learning (CALL), speech remediation, or accent reduction.[4]
Pronunciation assessment does not transcribe unknown speech (as in dictation or automatic transcription); instead, knowing the expected word(s) in advance or from prior transcription, it attempts to verify the correctness of the learner's pronunciation and, ideally, their intelligibility to listeners,[5][6] sometimes along with prosodic features, often of lesser consequence, such as intonation, pitch, tempo, rhythm, and syllable and word stress.[7] Pronunciation assessment is also used in reading tutoring, for example in products such as Microsoft Teams[8] and from Amira Learning.[9] Automatic pronunciation assessment can also be used to help diagnose and treat speech disorders such as apraxia.[10]
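As a concrete illustration of verifying known words rather than transcribing unknown speech, one widely studied approach in the research literature (not necessarily the method of any product named here) is a "goodness of pronunciation" style score: a forced alignment maps each expected phoneme to a span of audio frames, and the recognizer's per-frame phoneme probabilities for that span are averaged. The sketch below is minimal and assumes the per-frame log posteriors and the alignment are already available from an external recognizer; all names and values are hypothetical.

```python
def gop_scores(frame_log_posteriors, alignment):
    """Average, for each expected phone, the log posterior probability
    that the recognizer assigns to that phone over its aligned frames.

    frame_log_posteriors: list of dicts mapping phone -> log P(phone | frame)
    alignment: list of (phone, start_frame, end_frame) from forced alignment
    Returns a dict mapping each expected phone to its score (higher = closer
    to the expected pronunciation).
    """
    scores = {}
    for phone, start, end in alignment:
        frames = frame_log_posteriors[start:end]
        scores[phone] = sum(f[phone] for f in frames) / len(frames)
    return scores


# Toy example: the word "cat" (/k ae t/) over six audio frames.
posteriors = [
    {"k": -0.1}, {"k": -0.2},    # frames aligned to /k/
    {"ae": -0.5}, {"ae": -0.7},  # frames aligned to /ae/
    {"t": -1.0}, {"t": -2.0},    # frames aligned to /t/ (poorly matched)
]
alignment = [("k", 0, 2), ("ae", 2, 4), ("t", 4, 6)]
scores = gop_scores(posteriors, alignment)
```

A low score on a phone (here /t/) flags it for corrective feedback, which is the basic loop behind CAPT systems.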
Intelligibility
The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility.[11] That shortcoming was corrected in 2011 at the Toyohashi University of Technology,[12] and the capability was included in the Versant high-stakes English fluency assessment from Pearson[13] and in mobile apps from 17zuoye Education & Technology,[14] but it was still missing in 2023 from products by Google Search,[15] Microsoft,[16] Educational Testing Service,[17] Speechace,[18] and ELSA.[19] Assessing authentic listener intelligibility is essential for avoiding inaccuracies from accent bias, especially in high-stakes assessments;[20][21][22] from words with multiple correct pronunciations;[23] and from phoneme coding errors in machine-readable pronunciation dictionaries.[24] In the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels.[25]
In 2022, researchers found that some newer speech-to-text systems, based on end-to-end reinforcement learning to map audio signals directly into words, produce word and phrase confidence scores closely correlated with genuine listener intelligibility.[26] In 2023, others were able to assess intelligibility using dynamic time warping based distance from Wav2Vec2 representation of good speech.[27]
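The dynamic time warping (DTW) approach mentioned above compares a learner's utterance to good speech by aligning two sequences of per-frame feature vectors (such as Wav2Vec2 embeddings) that may differ in length and tempo. The following is a minimal sketch of the standard DTW recurrence with Euclidean frame distance; real systems would feed in model embeddings, whereas the example uses toy one-dimensional "features".

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of feature
    vectors (e.g., per-frame speech embeddings), using Euclidean frame
    distance and the standard match/insert/delete step pattern."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning seq_a[:i] to seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of seq_a
                                 cost[i][j - 1],      # skip a frame of seq_b
                                 cost[i - 1][j - 1])  # match frames
    return cost[n][m]


# A slower rendition of the same content aligns with zero cost;
# genuinely different speech yields a positive distance.
same = dtw_distance([(0.0,), (1.0,), (1.0,)], [(0.0,), (1.0,)])
diff = dtw_distance([(0.0,)], [(3.0,)])
```

In an intelligibility-scoring setting, a smaller distance from the reference representation of well-understood speech is taken as evidence of higher intelligibility.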
Evaluation
Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use for improving assessment quality.[28][29][30][31] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility evident from blinded listener transcriptions.[6]
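Evaluations of assessment quality against such corpora are typically reported as a correlation between the automatic scores and a human measure, such as expert ratings or blinded-listener transcription accuracy. A minimal sketch of that comparison, using Pearson correlation and hypothetical score values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists,
    e.g., automatic pronunciation scores vs. human intelligibility ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: automatic scores vs. the fraction of words that
# blinded listeners transcribed correctly for the same utterances.
auto_scores = [0.91, 0.40, 0.75, 0.62, 0.88]
listener_accuracy = [0.95, 0.35, 0.80, 0.55, 0.90]
r = pearson_r(auto_scores, listener_accuracy)
```

A coefficient near 1.0 indicates the automatic scores track the human measure; perfect agreement on a linear scale gives exactly 1.0.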
Ethical issues in pronunciation assessment are present in both human and automatic methods. Authentic validity, fairness, and mitigating bias in evaluation are all crucial. Diverse speech data should be included in automatic pronunciation assessment models. Combining human judgment with automated feedback can improve accuracy and fairness.[32]
Second language learners benefit substantially from using common speech recognition systems for dictation, as virtual assistants, and with AI chatbots.[33] In such systems, users naturally try to correct errors they notice in the speech recognition results. Such use improves their grammar and vocabulary development along with their pronunciation skills. The extent to which explicit pronunciation assessment and remediation approaches improve on such self-directed interactions remains an open question.[33]
Recent developments
Some promising areas for improvement being developed in 2024 include articulatory feature extraction[34][35][36] and transfer learning to suppress unnecessary corrections.[37] Other interesting advances under development include "augmented reality" interfaces for mobile devices using optical character recognition to provide pronunciation training on text found in user environments.[38][39]
As of mid-2024, audio multimodal large language models have been used to assess pronunciation.[40] That work has been carried forward by other researchers who report positive results.[41]
In 2025, the Duolingo English Test authors published a description of their pronunciation assessment method, purportedly built to measure intelligibility rather than accent imitation.[42] Although it achieves a correlation of 0.82 with expert human ratings, very close to inter-rater agreement, and outperforms alternative methods, the method is nonetheless based on experts' scores along the six-point CEFR common reference levels scale rather than genuine blinded listener transcriptions.[42]
See also
- Phonetics
- Speech segmentation — often called "forced alignment" (of audio to its expected phonemes) in this context[43]
- Statistical classification
References