Sevenster, B., de Krom, G., and Bloothooft, G. (1998). 'Evaluation and training of second-language learners' pronunciation using phoneme-based HMMs'. Proceedings of ESCA workshop on Speech Technology in Language Learning, Stockholm, 91-94.

Evaluation and training of second-language learners’ pronunciation using phoneme-based HMMs

Bob Sevenster, Guus de Krom and Gerrit Bloothooft

Utrecht Institute of Linguistics-OTS, University of Utrecht

Tel: + 31 30 2536059, Fax: + 31 30 2536000, email:




In this study, phoneme-based Hidden Markov Models (HMMs) were used to evaluate pronunciation. First, their suitability for this task was determined; second, the effectiveness of feedback at a segmental level was investigated in a pronunciation learning experiment. The study is based on ten monosyllabic Dutch words, spoken by native and non-native speakers of Dutch. An expert listener judged the nativeness of each pronunciation. Words spoken by natives and judged native by the expert listener were used to train phoneme-based HMMs. In a test of these models, the words judged non-native achieved significantly lower scores than the words judged native. The Equal Error Rates were low enough to consider the HMMs suitable for pronunciation evaluation. Forty non-native second-language learners of Dutch participated in a training experiment. Half of the group received pronunciation feedback at word level; the other half received feedback at segmental level. We expected that the latter group would improve their pronunciation more than the former. Test results confirmed this hypothesis.

1. Introduction

This paper describes two experiments in which phoneme-based Hidden Markov Models (HMMs) were applied to the evaluation of pronunciation in second language learning. HMMs have been used to evaluate pronunciation since the early 1990s [1]. At SRI, the emphasis has shifted from a global pronunciation evaluation of utterances [2] to evaluation at a segmental level [3]. Russell and colleagues evaluated HMMs with respect to their ability to distinguish between minimal pairs [4]. However, pronunciation errors made by second language learners need not lead to minimal pair differences, but may be more subtle. Witt and Young [5] used HMM recognition scores to calculate a Goodness of Pronunciation (GOP) score at a phonemic level. Their GOP metric proved useful for the task of locating and assessing mispronunciations at the phoneme level.

In a previous study [6], we evaluated HMMs with respect to their ability to discriminate between native and non-native pronunciations of monosyllabic Dutch words. The HMMs used in that study were "whole word" models, giving an indication of the quality of pronunciation for the entire word, rather than its constituent phonemes. We compared HMMs trained on word tokens produced by a single native speaker and HMMs trained on tokens produced by a variety of native speakers. As expected, the HMMs that were trained on a variety of speakers yielded lower false rejection rates than HMMs trained on word tokens produced by just one speaker.

In the present study, we used phoneme-based rather than word-based HMMs to assess the quality of pronunciation. In the first part of the study, phoneme-based HMMs were evaluated off-line with regard to their suitability for discriminating between native and non-native pronunciations of Dutch words. In the second part, an on-line segmental pronunciation evaluator was implemented and used by a group of students. Our aim was to determine whether visual feedback about the pronunciation quality of the individual phonemes would help improve the pronunciation skills of second language learners.

2. Off-line verification of the discrimination performance of phoneme-based HMMs

2.1 Recordings; material, speakers and procedure

Recordings were made of native and non-native speakers of Dutch (males and females, ages between 17 and 60 years, about 50 speakers per group), yielding 730 stimuli uttered by native speakers and 500 by non-native speakers. The native speakers had no obvious regional accent. The non-native speakers had various language backgrounds and differed widely in their fluency in Dutch. Each speaker was asked to read a list of ten monosyllabic Dutch words that are regularly mispronounced by foreign students: man /m A n/, deur /d ø: r/, echt /E x t/, rook /r o: k/, muis /m U y s/, maat /m a: t/, kijk /k E i k/, kwiek /k w i: k/, kind /k I n t/, and pot /p O t/. Recordings were made in a sound-treated booth, with a condenser microphone, on digital audio tape. Speakers were instructed to pause between the words. To reduce possible begin- and end-of-list effects, the lists began and ended with two dummy words. The recorded word tokens were downsampled to 9.8 kHz and stored on disk.

2.2 Perceptual evaluation

The recorded words were perceptually evaluated by a professional teacher of Dutch as a foreign language. The listener’s primary task was to judge each word token as coming from either a native or a non-native speaker. She was allowed to listen to a specific word as often as desired, but could not change a judgement once it was made. In a first session, she judged 295 stimuli. In a second session those 295 stimuli were judged again, together with the remaining 935 stimuli. Of the 295 stimuli judged twice, only 22 (7.5%) were judged inconsistently as regards nativeness. We therefore felt confident in using the listener’s judgements as a reference for the HMM-based evaluation.

2.3 HMM definition and recognition

2.3.1 Speech signal parameterisation

The speech signals were pre-emphasised, Hamming windowed, and converted into MFCC format using the HCopy tool of the Hidden Markov Toolkit [7] (frame duration 25.6 ms, frame shift 10 ms). The resulting parameter files contained data vectors with 26 coefficients: 12 MFCCs, log energy, and their first-order derivatives.
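To make the 26-coefficient layout concrete, the following minimal sketch appends first-order derivatives to 13 static coefficients (12 MFCCs plus log energy). It assumes the regression-style delta formula commonly used in HTK, a regression window of two frames, and replication of the first and last frames at the edges; none of these settings is stated in the paper, and the function name `add_deltas` is purely illustrative.

```python
def add_deltas(frames, window=2):
    """Append first-order regression deltas to each static feature vector.

    Assumption: delta_t = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2),
    with edge frames replicated. The window size is not given in the paper.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        delta = []
        for c in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                after = frames[min(t + k, n - 1)][c]   # replicate last frame at the end
                before = frames[max(t - k, 0)][c]      # replicate first frame at the start
                num += k * (after - before)
            delta.append(num / denom)
        out.append(list(frames[t]) + delta)
    return out


# Six hypothetical frames of 13 static coefficients, each rising by 1 per frame.
static = [[float(t)] * 13 for t in range(6)]
full = add_deltas(static)
print(len(full[0]))  # 13 static + 13 delta coefficients
```

For a linearly rising coefficient the interior deltas equal the slope, which is a quick sanity check on the formula.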

2.3.2 HMM prototype definition

Phonemes were modelled with HMMs consisting of 2, 3, or 4 states. The pre-release silence of plosives was modelled with one state. All HMMs had output distributions with diagonal covariance matrices and two mixture components, to model possible differences in the distributions of acoustic parameters for males and females. One-state skips were allowed in the transition matrices. The ten words were transcribed at phoneme level; word-based HMMs were defined as concatenations of the corresponding phoneme-based HMMs.
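The topology described here can be illustrated with a small sketch that builds a left-to-right transition matrix in which each state may loop, move to the next state, or skip one state. The uniform starting probabilities and the omission of HTK's non-emitting entry and exit states are simplifying assumptions, and `left_to_right_transitions` is a hypothetical helper, not part of any toolkit.

```python
def left_to_right_transitions(n_states, max_skip=1):
    """Build a row-stochastic left-to-right transition matrix.

    From state i the allowed targets are i (self-loop), i + 1 (next state)
    and up to i + 1 + max_skip (skipping at most max_skip states).
    Assumption: probabilities start uniform over the allowed targets.
    """
    matrix = []
    for i in range(n_states):
        last_target = min(i + 1 + max_skip, n_states - 1)
        allowed = range(i, last_target + 1)
        row = [0.0] * n_states
        for j in allowed:
            row[j] = 1.0 / len(allowed)
        matrix.append(row)
    return matrix


# A four-state phoneme model with one-state skips.
transitions = left_to_right_transitions(4)
for row in transitions:
    print([round(p, 2) for p in row])
```

Entries that start at zero stay at zero during Baum-Welch re-estimation, so the prototype fixes which transitions the trained model can ever use.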

2.3.3 HMM initialisation and training

Each HMM was trained using 40 tokens (20 male and 20 female), chosen at random from the material spoken by natives and judged native. HMM prototypes were initialised using the Forward-Backward procedure, and further improved using Baum-Welch re-estimation [7, 8].

2.3.4 HMM recognition

The words that had not been used for training were used for testing: 330 native and 500 non-native tokens. For recognition, the Viterbi algorithm was used [7, 8]. For every speech frame, the probability of being generated by a state from any HMM was calculated. The maximum of those probabilities defines the most likely HMM and state for the speech frame. The mean of those maxima over all speech frames yields a recognition score (MPPF, maximum probability per frame). As the HMMs were trained on native material, the tokens produced by native speakers were expected to yield higher MPPF values than tokens produced by non-native speakers. Thus, one could define an MPPF threshold, such that tokens with an MPPF exceeding the threshold are labelled native and tokens with a lower value non-native. Ideally, MPPF values for native and non-native tokens would not overlap, in which case a threshold value could be defined that perfectly separates native and non-native tokens. In practice, however, there will be some overlap, which means that any given threshold results in a combination of false acceptance (FA) and false rejection (FR) errors (FA: accepting a non-native pronunciation as native; FR: rejecting a native pronunciation as non-native). An example of overlapping MPPF values for native and non-native word tokens is given in Figure 1.
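The MPPF score and the threshold decision can be sketched as follows. The sketch assumes that per-frame scores are log probabilities and that all candidate states are scored for every frame; the numeric values are invented for illustration, and the names `mppf` and `label_token` are not taken from the paper.

```python
def mppf(frame_scores):
    """Mean over frames of the best per-frame state score.

    frame_scores: one list per speech frame, holding the (log) probability
    of that frame under every candidate HMM state.
    """
    return sum(max(scores) for scores in frame_scores) / len(frame_scores)


def label_token(score, threshold):
    """Label a token native if its MPPF exceeds the threshold."""
    return "native" if score > threshold else "non-native"


# Three hypothetical frames scored against three hypothetical states.
frame_scores = [
    [-60.0, -55.0, -70.0],
    [-58.0, -62.0, -52.0],
    [-65.0, -50.0, -59.0],
]
score = mppf(frame_scores)  # mean of -55, -52 and -50
print(round(score, 2), label_token(score, threshold=-54.0))
```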

Figure 1. Frequency distribution of the MPPF values for native and non-native tokens of the word "deur".

The problem is how to define an optimal threshold, as there is a trade-off between the two types of error. Which error rates are acceptable depends largely on the purpose for which the system is used: a beginning second language learner should not be discouraged by a system that accepts hardly anything he or she says. In such a case, the threshold should be set at a low MPPF value, biased towards false acceptance. We chose to consider false rejection as equally bad as false acceptance. The MPPF threshold was therefore defined by calculating the Equal Error Rate (EER), the error rate at the threshold where the false rejection rate equals the false acceptance rate. Figure 2 gives false rejection and false acceptance curves for the word "deur".

Figure 2. Cumulative error rates (in %) as a function of MPPF value for the word "deur". FR = False Rejection; FA = False Acceptance.
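One way to locate the equal-error threshold is to sweep the observed MPPF values and pick the one where the false rejection and false acceptance rates are closest. The sketch below is an illustrative procedure under that assumption, not necessarily the one used by the authors; the scores are invented and the function names are hypothetical.

```python
def error_rates(native, non_native, threshold):
    """Return (FR rate, FA rate) when tokens scoring above threshold count as native."""
    fr = sum(s <= threshold for s in native) / len(native)
    fa = sum(s > threshold for s in non_native) / len(non_native)
    return fr, fa


def equal_error_rate(native, non_native):
    """Sweep observed scores; return the threshold where FR and FA are closest."""
    def gap(threshold):
        fr, fa = error_rates(native, non_native, threshold)
        return abs(fr - fa)

    candidates = sorted(set(native) | set(non_native))
    best = min(candidates, key=gap)
    fr, fa = error_rates(native, non_native, best)
    return best, (fr + fa) / 2


# Invented MPPF values with some overlap between the two groups.
native = [-50.0, -51.0, -52.0, -53.0, -56.0]
non_native = [-54.0, -55.0, -57.0, -58.0, -60.0]
eer_threshold, eer = equal_error_rate(native, non_native)
print(eer_threshold, eer)
```

With finite data the two rates rarely coincide exactly, which is why the sketch reports the mean of the two rates at the closest point.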

2.4 Results

MPPF values were determined for both word-based and phoneme-based HMMs. The significance of the difference between the MPPF values of tokens judged native and tokens judged non-native was determined by means of a Mann-Whitney U test. Word-level MPPF values for native and non-native tokens were significantly different at the 0.1% level for all words. The difference in phoneme-level MPPF values for native and non-native tokens was significant at the 5% level in most cases; the exceptions were /t/ in /E x t/, the final /k/ in /k E i k/, /n/ in /k I n t/, /m/ in /m a: t/, and /s/ in /m U y s/.
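For readers unfamiliar with the test, a minimal Mann-Whitney U computation looks roughly like this. It uses the normal approximation without tie or continuity correction, which is an assumption of this sketch; the paper does not say which variant or software was used, and the scores below are invented.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation).

    U counts the pairs in which a value from sample_a exceeds one from
    sample_b, with ties counted as one half.
    """
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    n_a, n_b = len(sample_a), len(sample_b)
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, p_two_sided


# Invented MPPF values for tokens judged native and non-native.
native_mppf = [-50.0, -51.0, -52.0, -53.0, -56.0]
non_native_mppf = [-54.0, -55.0, -57.0, -58.0, -60.0]
u_stat, p_value = mann_whitney_u(native_mppf, non_native_mppf)
print(u_stat, p_value < 0.05)
```

For samples as small as this toy example an exact test would be preferable; the approximation becomes reasonable at the sample sizes reported in the paper.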

Word-level EERs ranged between 7% and 33%; phoneme-level EERs ranged between 10% and 50%. The most likely explanation for the differences in EERs is that the amount of variation in native Dutch pronunciation differs from word to word. If native speakers show little acoustic variance in their pronunciation of a given word or phoneme, its HMM will become quite sensitive to subtle changes in pronunciation as produced by non-native speakers. In such a case, relatively low EERs may be expected. In contrast, high EERs may be expected if there is much acoustic similarity between native and non-native pronunciations of a given word or phoneme, or if there is much acoustic dissimilarity among the native speakers’ realisations of a given sound. The former may be the case for the nasals /n/ in /k I n t/ and /m/ in /m a: t/; the acoustic differences between native and non-native speakers may be small, as speakers do not have much articulatory freedom in their realisation of these sounds. We concluded that the results obtained in this off-line pronunciation verification task were good enough to justify some confidence in the system’s ability to discriminate between native and non-native pronunciations of the words used in the experiment.

3. On-line pronunciation training using phoneme-based HMMs

3.1 Material and procedure

In the second part of the study, an on-line segmental pronunciation evaluator was implemented, using the HMMs and settings obtained in the off-line experiment. Twenty-four non-native second language learners, male and female, took part in an actual pronunciation training experiment that tested the effect of the type of feedback on the development of pronunciation skills. The speakers had various language backgrounds and differed widely in their fluency in Dutch. Each speaker was asked to pronounce a word written on the computer screen. The same ten words as in the off-line experiment were used. The speakers were asked to repeat each word twenty times; speakers were asked to try to improve their pronunciation with the help of visual (colour) feedback. After a speaker had gone through the twenty repetitions of a given word, the next test word came up, until all words had been evaluated. The order of the actual test words to be evaluated was randomised across speakers.

In this on-line test, we tried to avoid discouraging beginning second language learners. This was done by defining different thresholds: an "incorrect" (non-native) threshold set at a 2% false rejection rate, an interval of "dubious" pronunciation between 2% and 10% false rejection, and a "correct" (native) threshold set at 10% false rejection. These lenient settings gave non-natives considerable encouragement in their attempts to improve. Depending on the MPPF value, a given phoneme could now be labelled as correct (native), incorrect (non-native), or dubious.
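The three-way labelling can be sketched as below. The sketch assumes that each threshold is the MPPF value below which the stated percentage of native tokens falls, which is one reading of "set at a 2% false rejection rate"; the native scores are invented and both function names are hypothetical.

```python
def threshold_at_false_rejection(native_scores, fr_percent):
    """MPPF value below which fr_percent of the native scores fall (fr_percent < 100)."""
    ordered = sorted(native_scores)
    index = (fr_percent * len(ordered)) // 100
    return ordered[index]


def label_phoneme(score, incorrect_threshold, correct_threshold):
    """Map an MPPF value onto the three feedback categories."""
    if score < incorrect_threshold:
        return "incorrect"
    if score < correct_threshold:
        return "dubious"
    return "correct"


# Fifty invented native MPPF values, evenly spaced from -70.0 to -45.5.
native_scores = [-70.0 + 0.5 * i for i in range(50)]
t_incorrect = threshold_at_false_rejection(native_scores, 2)
t_correct = threshold_at_false_rejection(native_scores, 10)

for value in (-70.0, -68.0, -60.0):
    print(value, label_phoneme(value, t_incorrect, t_correct))
```

Under this reading a native token has only a 2% chance of being labelled incorrect outright, which is what makes the scheme lenient.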

Half of the speakers received visual feedback at a segmental (phoneme) level; the others received visual feedback at word level, indicating whether a word was correct or incorrect. At word level, "correct" could only be obtained when all constituent phonemes were considered correct by the machine, in which case the entire word was written on the computer screen in green. When one or more phonemes were considered incorrect, the entire word was written in red. Segmental pronunciation feedback was provided by writing the word on the computer screen with each grapheme coloured green, orange or red, indicating that the pronunciation of the corresponding phoneme was correct, dubious or incorrect, respectively.
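The two feedback conditions can be contrasted in a few lines. The sketch assumes that, at word level, any word that is not entirely correct is shown in red (the text does not say how "dubious" phonemes were treated there) and that the alignment of graphemes to phonemes is given; the split of "echt" into "e", "ch", "t" and the labels are illustrative only.

```python
COLOURS = {"correct": "green", "dubious": "orange", "incorrect": "red"}


def word_level_feedback(phoneme_labels):
    """One colour for the whole word: green only if every phoneme is correct.

    Assumption: dubious phonemes also turn the word red.
    """
    return "green" if all(label == "correct" for label in phoneme_labels) else "red"


def segmental_feedback(graphemes, phoneme_labels):
    """One colour per grapheme, following the label of its phoneme."""
    return [(g, COLOURS[label]) for g, label in zip(graphemes, phoneme_labels)]


# Hypothetical outcome for the word "echt" /E x t/.
graphemes = ["e", "ch", "t"]
labels = ["correct", "dubious", "incorrect"]

print(word_level_feedback(labels))
print(segmental_feedback(graphemes, labels))
```

The word-level learner sees only that something was wrong, whereas the segmental learner sees which sound to work on.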

We compared pronunciation evaluation scores for the "segmental feedback" and "word-level feedback" speaker groups to investigate the effectiveness of the type of feedback. A possible change in pronunciation skills for a given speaker was investigated by comparing, for each word, the MPPF recognition scores of the first eight spoken tokens with the scores of the last eight. Pronunciation was considered to have improved if the MPPF values of the last eight tokens were significantly higher than those of the first eight.

We hypothesised that more improvements would occur in the segmental than in the word-level feedback group, as information on the precise location of pronunciation errors enables a speaker to correct these errors more efficiently.

3.2 Recognition and feedback

The test was performed in a sound-treated booth, using a condenser microphone and a voice-activated start and stop mechanism (sample rate 9.8 kHz). The recorded speech signals were converted in real time using the same set-up as described for the off-line experiment. The Viterbi algorithm was used to calculate the MPPF value for each phoneme of a token, on which the feedback was based [7]. The words "deur" and "maat" were used to familiarise the speakers with the task. These words were repeated 5 times; the other eight words were repeated 20 times each.

3.3 Results

As mentioned before, an improvement was defined as an instance in which, for a given speaker, MPPF recognition scores of the last eight word tokens were significantly higher than the scores of the first eight tokens. Results are given in Table I.

Table I: Percentage of improvements for the "word-level feedback" and "segmental feedback" speaker groups.

As can be seen in Table I, pronunciation of the word "echt" did not improve with either word-level or segmental feedback. For all other words except "pot", segmental feedback yielded more improvements than word-level feedback. On average, pronunciation improved in 7% of all cases in which subjects were given word-level feedback, against 23% of all cases in which subjects were given segmental feedback. These results indicate the benefits of segmental feedback using phoneme-based HMMs for training second-language learners’ pronunciation.

4. Discussion and conclusion

The results of the experiments described in this study indicate that phoneme-based Hidden Markov Models can be used with some success to evaluate and train second language learners’ pronunciation.

In an off-line experiment, the suitability of phoneme-based HMMs for the task of judging pronunciation was investigated. At word level, the difference in recognition scores for tokens produced by native and non-native speakers was significant for all word types. At phoneme level, the difference in recognition scores for native and non-native tokens was also significant in most cases. Equal Error Rates ranged between 7% and 33% at word level, versus 10% and 50% at phoneme level. We concluded that these errors were tolerable for a pilot on-line experiment in which 40 non-native speakers of Dutch would be given visual feedback on their pronunciation skills on the basis of recognition scores as determined by the system. Our hypothesis that the provision of visual feedback at a segmental level would be more effective in improving pronunciation than the provision of feedback at a global word level was confirmed for six out of the eight words that were investigated.

In evaluating the results of this experiment, a number of shortcomings should be mentioned. First of all, in the first experiment, the listener’s task was to judge whether a given word token was spoken by a native or a non-native speaker. Although the listener’s consistency was high, some tokens might have been erroneously assigned to the native or non-native part of the corpus. Second, we would like to emphasise that the HMMs were based on speech parameters that are conventionally used for ordinary speech recognition purposes. These parameters need not be optimal for the task of discriminating between native and non-native pronunciations of a given word. Duration is only partly modelled in our HMMs, by the allowance of state skips in the transition matrix, and intonation was not modelled at all. Yet intonation seems a very important cue for the discrimination between native and non-native speakers, and it is most probable that intonation influenced the listener’s judgements. Third, on a few occasions non-native speakers who took part in the on-line experiment already achieved perfect pronunciation in their first tokens of some words. As a result, they were not able to improve.



[1] Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., & Weintraub, M. (1990). Automatic evaluation and training in English pronunciation. Proceedings ICSLP ’90, 1185-1188.

[2] Neumeyer, L., Franco, H., Weintraub, M., & Price, P. (1996). Automatic text-independent pronunciation scoring of foreign language student speech. Proceedings ICSLP ’96, 1457-1460.

[3] Kim, Y., Franco, H., & Neumeyer, L. (1997). Automatic pronunciation scoring of specific phone segments for language instruction. Proceedings Eurospeech ’97, 645-648.

[4] Russell, M., Brown, C., Skilling, A., Series, R., Wallace, J., Bonham, B., & Barker, P. (1996). Applications of automatic speech recognition to speech and language development in young children. Proceedings ICSLP ’96, 176-179.

[5] Witt, S., & Young, S. (1997). Language learning based on non-native speech recognition. Proceedings Eurospeech ’97, 633-636.

[6] Goddijn, S., & de Krom, G. (1997). Evaluation of second language learners’ pronunciation using Hidden Markov Models. Proceedings Eurospeech ’97, 2331-2334.

[7] Young, S.J., Woodland, P.C., & Byrne, W.J. (1993). Reference manual Hidden Markov Toolkit (HTK) V2.1. Entropic Cambridge Research Laboratory.

[8] Rabiner, L.R., & Juang, B.H. (1986). An introduction to Hidden Markov Models. IEEE ASSP Magazine, 4-16.