Pronunciation assessment

Use of speech recognition to verify pronunciation

Automatic pronunciation assessment uses computer speech recognition to determine how accurately speech has been pronounced,[1][2] instead of relying on a human instructor or proctor.[3] It is also called speech verification, pronunciation evaluation, and pronunciation scoring.[4] This technology is used to grade speech quality, for computer-aided pronunciation teaching (CAPT) in computer-assisted language learning (CALL), for speaking skill remediation, and for accent reduction.[4]

Pronunciation assessment differs from dictation or automatic transcription: instead of transcribing unknown speech, it verifies learners' pronunciation of known words, often from a prior transcription of the same utterance, ideally scoring the intelligibility of the learners' speech.[5][6] Pronunciation assessment sometimes also evaluates the prosody of the learners' speech, such as intonation, pitch, tempo, rhythm, and syllable and word stress, although these are usually not essential for being understood in most languages.[7] Pronunciation assessment is also used in reading tutoring, for example in products from Google,[8] Microsoft,[9][10] and Amira Learning.[11] Automatic pronunciation assessment can also help diagnose and treat speech disorders such as apraxia.[12]
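The distinction between verification and transcription can be sketched as aligning the learner's recognized phoneme sequence against the expected one for a known word. The following minimal Python sketch uses the standard library's SequenceMatcher; the ARPABET phoneme labels and the similarity-ratio scoring are illustrative choices, not any particular product's method:

```python
from difflib import SequenceMatcher

def verify_pronunciation(expected, recognized):
    """Align recognized phonemes against the expected sequence and
    report which expected phonemes were mispronounced."""
    matcher = SequenceMatcher(a=expected, b=recognized, autojunk=False)
    errors = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # "replace", "delete", or "insert"
            errors.append((op, expected[i1:i2], recognized[j1:j2]))
    score = matcher.ratio()  # similarity in [0.0, 1.0]
    return score, errors

# Expected phonemes for "water" (ARPABET) vs. a learner rendering
# where the vowel /AO/ was recognized as /AA/
score, errors = verify_pronunciation(
    ["W", "AO", "T", "ER"], ["W", "AA", "T", "ER"])
# score → 0.75, errors → [("replace", ["AO"], ["AA"])]
```

A real system would obtain the recognized sequence from a speech recognizer constrained to the known text; only the comparison step is shown here.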

Intelligibility


The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility,[13] a shortcoming corrected in 2011 at the Toyohashi University of Technology.[14] Intelligibility measurement was included in the Versant high-stakes English fluency assessment from Pearson[15] and in mobile apps from 17zuoye Education & Technology,[16] but was still missing in 2023 products from Google Search,[17] Microsoft,[18] Educational Testing Service,[19] Speechace,[20] and ELSA.[21] Assessing authentic listener intelligibility is essential for avoiding inaccuracies from accent bias,[5] especially in high-stakes assessments;[22][23][24] from words with multiple correct pronunciations;[25] and from phoneme coding errors in machine-readable pronunciation dictionaries.[26] In the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels.[27]

In 2022, researchers found that some newer speech-to-text systems, based on end-to-end reinforcement learning to map audio signals directly into words, produce word and phrase confidence scores (aggregated from the logits of 10–25 ms audio frames) that correlate closely with genuine listener intelligibility.[28] In 2023, others assessed intelligibility using dynamic time warping distance measures from Wav2Vec2 representations of good speech.[29][30] Further work through 2025 has focused specifically on measuring intelligibility.[31][32]
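The dynamic time warping approach can be illustrated with a minimal NumPy implementation. The cited work aligns Wav2Vec2 frame representations of learner and reference speech; this sketch uses generic feature vectors, since extracting Wav2Vec2 features is outside its scope:

```python
import numpy as np

def dtw_distance(ref, hyp):
    """Dynamic time warping distance between two feature sequences
    (one row per frame), with Euclidean frame-to-frame cost.
    Lower distance suggests the learner's speech tracks the reference."""
    n, m = len(ref), len(hyp)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - hyp[j - 1])
            # extend the cheapest of the three allowed warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
hyp = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
dist = dtw_distance(ref, hyp)
# → 0.0: the repeated frame warps onto the same reference frame
```

Slower speech with the same trajectory thus incurs no penalty, which is what makes DTW attractive for comparing utterances of different tempo.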

A 2025 study of 42 pronunciation and speech coaching apps (32 mobile and 10 web) found that none offered intelligibility assessment. Instead, most provided only segmental and accent-focused scoring. About two-thirds of the apps provided some form of specific pronunciation feedback, usually with phonetic transcriptions, but accompanied by visual cues (such as animations of the vocal tract or the lips and tongue from the front) in only about 5% of the apps. Less than a third provided feedback on learner perception of exemplar speech.[33]

Evaluation


Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpora for others to use in improving assessment quality.[34][35][36][37] Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility evident from blinded listener transcriptions.[6] As of mid-2025, state-of-the-art approaches for automatically transcribing phonemes typically achieve an error rate of about 10% on known good speech.[38][39][40][41]
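The phoneme error rate cited here is conventionally computed as the Levenshtein edit distance (substitutions, insertions, and deletions) between the reference and recognized phoneme sequences, normalized by the reference length. A minimal sketch, with illustrative ARPABET phonemes:

```python
def phoneme_error_rate(reference, hypothesis):
    """Levenshtein edit distance over phoneme sequences,
    normalized by the reference length."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = edit distance between reference[:i] and hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[n][m] / n

# "speech" /S P IY CH/ recognized with one substituted phoneme
per = phoneme_error_rate(["S", "P", "IY", "CH"], ["S", "B", "IY", "CH"])
# → 0.25, i.e. a 25% phoneme error rate
```

An error rate of about 10% therefore means roughly one edit per ten reference phonemes.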

Ethical issues in pronunciation assessment are present in both human and automatic methods. Authentic validity, fairness, and mitigating bias in evaluation are all crucial. Diverse speech data should be included in automatic pronunciation assessment models. Combining human judgments, especially listener transcriptions, with automated feedback can improve accuracy and fairness.[42]

Second language learners benefit substantially from using common speech recognition systems for dictation, virtual assistants, and AI chatbots.[43] With such systems, users naturally try to correct their own errors when they notice them in speech recognition results. Such use improves their grammar and vocabulary development along with their pronunciation skills. The extent to which explicit pronunciation assessment and remediation approaches improve on such self-directed interactions remains an open question.[43]

Recent developments


During 2021–22, a smartphone-based CAPT system was used to sense articulation through both audible and inaudible signals, providing feedback at the phoneme level.[44][45]

Promising areas for improvement under development in 2024 include articulatory feature extraction[46][47][48] and transfer learning to suppress unnecessary corrections.[49] Other advances under development include "augmented reality" interfaces for mobile devices that use optical character recognition to provide pronunciation training on text found in the user's environment.[50][51]

In 2024, audio multimodal large language models were first described as assessing pronunciation.[52] That work has been carried forward by other researchers in 2025 who report positive results.[53][54] Subsequently, researchers demonstrated pronunciation scoring by providing a language model with textual descriptions of speech, including the speech-to-text transcript, phoneme sequences, pauses, and phoneme sequence matching; this approach can achieve performance similar to multimodal LLMs that analyze raw audio while avoiding their higher computational cost.[55]
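A textual description of speech along these lines might be assembled as in the sketch below. The field names, prompt format, and phoneme-matching score are hypothetical illustrations, not the cited work's actual feature set:

```python
from difflib import SequenceMatcher

def describe_speech(transcript, expected_phonemes, observed_phonemes, pauses):
    """Render speech-analysis features as plain text so a text-only
    language model can score pronunciation without raw audio."""
    match = SequenceMatcher(a=expected_phonemes, b=observed_phonemes,
                            autojunk=False).ratio()
    return (
        f"Transcript: {transcript}\n"
        f"Expected phonemes: {' '.join(expected_phonemes)}\n"
        f"Observed phonemes: {' '.join(observed_phonemes)}\n"
        f"Pause durations (s): {', '.join(f'{p:.2f}' for p in pauses)}\n"
        f"Phoneme match ratio: {match:.2f}"
    )

prompt = describe_speech("water",
                         ["W", "AO", "T", "ER"],
                         ["W", "AA", "T", "ER"],
                         [0.12])
```

The resulting text would be prepended to a scoring instruction for the language model; the appeal of this design is that only lightweight text, not audio, reaches the model.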

In 2025, the Duolingo English Test authors published a description of their pronunciation assessment method, purportedly built to measure intelligibility rather than accent imitation.[56] While it achieves a correlation of 0.82 with expert human ratings, very close to inter-rater agreement, and outperforms alternative methods, the method is nonetheless based on experts' scores along the six-point CEFR common reference levels scale rather than actual blinded listener transcriptions.[56]

Further promising work in 2025 includes assessment feedback aligning learner speech to synthetic utterances using interpretable features, identifying continuous spans of words for remediation feedback;[57] synthesizing corrected speech matching learners' self-perceived voices, which they prefer and imitate more accurately as corrections;[58] and streaming such interactions.[59]

Fiction

The 2012 horror film Prometheus shows android character David 8 learning to pronounce Proto-Indo-European phrases from a holographic virtual tutor.[60]
