Vernooij, G.J., and Bloothooft, G. (1989). 'Simulation of isolated word recognition on the basis of a hierarchy of phonetic classes', Proc. Eurospeech 89, Paris, 469-472.

Figure 1 missing, table incomplete


Simulation of isolated word recognition on the basis of a hierarchy of phonetic classes

 

Vernooij, G.J., and Bloothooft, G.
Utrecht University, Trans 10, 3512 JK Utrecht NL 

 

Abstract

In broad phonetic classification of phonemes, a group of phonemes is put together and associated with one single broad phonetic class. As a result, the identity of the phonemes that make up such a class is lost, but in general the class as a whole can be classified more reliably. Several simulation studies reported in the literature have shown the potential of broad phonetic classification for reducing the number of possible word candidates. These studies used efficiency of lexical access as the main criterion and paid little attention to the interaction with acoustic information. In this paper we describe a broad classification experiment based on acoustic information rather than lexical information, and describe a classification hierarchy based on classification scores of broad phonetic classes.

1. Introduction

The idea underlying broad phonetic classification schemes is that spoken language is limited not only by the number of different sounds that are utilized but also by constraints present in phonemic structures. These (phonotactic) constraints make it worthwhile to have a crude phonetic analysis prior to a more detailed analysis. In this way lexical access efficiency can be greatly improved. Connected to this idea of using constraints in phonemic structures is the idea of interaction between acoustic properties of phonemes and their functions in words, often making a detailed analysis unnecessary and in fact sometimes even useless.

Several authors have described simulation studies on a lexical basis, examining the use of broad classifications for pruning down the number of word candidates. Shipman and Zue [1], for instance, showed that even a very large lexicon of 20,000 words can be pruned down very effectively by using just six broad phonetic classes. Vernooij et al. [2] and Kassel and Zue [3] have described experiments in which hierarchies of broad classification levels are constructed. Going downwards from one level to the next, two broad phonetic classes are joined together, thus decreasing the total number of classes by one. Starting with N classes at level 1, one ends up with one single class at level N. In the recognition phase, subsequent levels can be associated with subsequent steps in pruning down the number of word candidates, starting with just a few broad phonetic classes and ending with all individual phonemes at the lowest level. Going from a higher level to a lower one, these hierarchies exhibit an improved pruning of the lexicon, at the cost of having to classify more distinct classes. The gain in pruning down a lexicon slows down as the number of classes increases: for a Dutch lexicon of 11,644 isolated words, 90% of the tokens are uniquely classified using 17 classes, 99% using 25 classes, and 100% using 32 classes [2]. This implies that for a practical implementation it is probably useful to fold the hierarchy into just a few levels, starting with just a few broad phonetic classes followed by a rather detailed analysis, maybe using a few intermediate levels.

The aforementioned simulation studies were based mainly on properties of the lexicon that was used. Kassel and Zue [3] used an information theoretic measure, average mutual information, as a metric. Many of the classes in the phoneme class hierarchy described in [3] are acoustically similar even though no acoustic information was available to the clustering process, but there are also a number of peculiarities. Vernooij et al. [2] took the number of unique cohorts as a criterion, a cohort being defined as a group of words sharing a broad phonetic labeling, and a unique cohort being defined as a cohort with just one single member. They employed some general constraints for broad classification of vowels and plosives, which are plausible from a phonetic point of view, but apart from this the broad classification hierarchy was built using only efficiency of lexical access as a criterion. Because of these built-in constraints their hierarchy seems to be more feasible from an acoustic point of view than the hierarchy described in Kassel and Zue [3]. Since Vernooij et al. [2] did not allow for mixing between five broad phonetic classes (vowels, plosives, fricatives, nasals and sonorants), their hierarchy of broad phonetic classes is nevertheless somewhat hypothetical.
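The cohort criterion can be illustrated with a minimal sketch. The toy lexicon, the phoneme symbols and the broad class mapping below are hypothetical and serve only to show how cohorts and unique cohorts are counted; they are not the Dutch lexicon or the class inventory used in [2].

```python
from collections import defaultdict

# Hypothetical mapping from phonemes to broad phonetic classes
# (P = plosive, F = fricative, N = nasal, S = sonorant, V = vowel).
BROAD_CLASS = {
    "p": "P", "t": "P", "k": "P", "b": "P", "d": "P",
    "s": "F", "f": "F", "z": "F",
    "m": "N", "n": "N",
    "l": "S", "r": "S",
    "a": "V", "e": "V", "i": "V", "o": "V",
}

# Hypothetical toy lexicon: word -> phoneme sequence.
LEXICON = {
    "pat": ["p", "a", "t"],
    "bed": ["b", "e", "d"],
    "sat": ["s", "a", "t"],
    "man": ["m", "a", "n"],
    "lap": ["l", "a", "p"],
    "stop": ["s", "t", "o", "p"],
}


def cohorts(lexicon, broad_class):
    """Group words that share the same broad phonetic labeling."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        label = tuple(broad_class[p] for p in phonemes)
        groups[label].append(word)
    return dict(groups)


def unique_cohort_count(groups):
    """Number of cohorts that contain exactly one word."""
    return sum(1 for words in groups.values() if len(words) == 1)


groups = cohorts(LEXICON, BROAD_CLASS)
n_unique = unique_cohort_count(groups)
```

In this toy case "pat" and "bed" share the labeling P-V-P and therefore fall into the same cohort, whereas the other four words each form a unique cohort.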

From an acoustic point of view, broad classification of phonemes introduces loss of information about classification scores of individual phonemes in exchange for improved classification scores of the broad phonetic classes. For instance, if the plosives /p/, /t/, and /k/ are grouped into one single broad phonetic class {/p/,/t/,/k/}, one loses information about the identity of the individual phones /p/, /t/ and /k/ which make up the class, but the classification score for the group as a whole will improve, since there are no longer confusions between the members of this group.
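A small numerical sketch may make this trade-off concrete. The confusion counts below are invented for illustration only and are not results from this study; rows are taken to be the true class and columns the class assigned by the classifier.

```python
# Hypothetical confusion counts (100 tokens per true class).
LABELS = ["p", "t", "k", "s"]
CONFUSION = [
    [60, 20, 15, 5],   # true /p/
    [18, 62, 15, 5],   # true /t/
    [14, 16, 65, 5],   # true /k/
    [3, 4, 3, 90],     # true /s/
]


def percent_correct(confusion, labels, group):
    """Score of a (broad) class: a token counts as correct when it is
    assigned to any member of the group it belongs to."""
    idx = [labels.index(label) for label in group]
    total = sum(sum(confusion[i]) for i in idx)
    correct = sum(confusion[i][j] for i in idx for j in idx)
    return 100.0 * correct / total


score_p = percent_correct(CONFUSION, LABELS, ["p"])
score_ptk = percent_correct(CONFUSION, LABELS, ["p", "t", "k"])
```

With these invented counts /p/ on its own is classified correctly in 60% of the cases, whereas the combined class {/p/,/t/,/k/} reaches 95%, because the confusions among the three plosives no longer count as errors.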

If implementation of a broad classification scheme is aimed at, then efficiency of lexical access cannot be the only criterion. Acoustic information has to be considered too: there has to be a balance between efficiency of lexical access and the probability of correct classification of the broad phonetic classes. In other words, it is not worthwhile to use a group of broad phonetic classes for pruning down a lexicon of word candidates if one cannot classify these broad phonetic classes reliably. Being able to distinguish vowels and consonants, for example, is very efficient with regard to lexical access, and lexical studies will probably always find this division to be the most effective single broad classification that can be made. However, from an acoustic point of view it is hard to make this vowel/consonant distinction reliably.

In this paper we will describe a broad classification experiment based on acoustic information rather than lexical information. Our aim was to investigate whether such a classification hierarchy has any resemblance to the hierarchies found in lexical studies, and to see whether the well-known broad phonetic classes (sonorants, nasals, fricatives, vowels, plosives) would emerge.

2. Experimental procedure

The feasibility of a broad phonetic classification of phonemes was investigated using a perceptron as a classifier and 10 ms filterbank output spectra as input. No contextual information or other information such as (differenced) power and duration, which can be useful for classification, was used in this experiment, so the results apply only to some characteristic spectrum of each phoneme. As a result the classification scores for individual phonemes (the lowest level of the classification hierarchy) are substantially lower than can be achieved if more information is used. We considered this to be acceptable for this particular experiment, since our goal was not to build an optimal phoneme recognizer. Thus, the classification scores at a particular level of the broad classification hierarchy may be considered as worst-case scores.
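The input representation, which is specified in more detail in the remainder of this section, can be sketched as follows. This is only an illustration under stated assumptions: the rounding used for the frame position, the reading of "normalized with respect to cumulative energy" as division by the sum over all channels, and the amount and distribution of the added noise are not given in the text and are chosen here for the sake of the example.

```python
import random

N_CHANNELS = 16  # filterbank channels per 10 ms frame, as stated in the text


def pick_frame_index(first_frame, n_frames, is_vowel):
    """Index of the single frame taken from a labeled segment.

    Vowels: the frame at one third of the segment duration.
    All other phonemes: the middle frame of the segment.
    Integer division is an assumed rounding rule.
    """
    offset = n_frames // 3 if is_vowel else n_frames // 2
    return first_frame + offset


def normalize_frame(energies, noise_level=1e-3, rng=None):
    """Add a small amount of noise, then divide by the total energy.

    noise_level and the uniform noise model are assumptions; the noise
    keeps the total strictly positive, so silent frames stay well defined.
    """
    rng = rng if rng is not None else random.Random(0)
    noisy = [e + noise_level * rng.uniform(0.5, 1.0) for e in energies]
    total = sum(noisy)
    return [e / total for e in noisy]


silent = normalize_frame([0.0] * N_CHANNELS)
voiced = normalize_frame([float(c + 1) for c in range(N_CHANNELS)])
```

Without the added noise, the all-zero frame in this example would lead to a division by zero in the normalization step.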

The speech material was drawn from the prototype version of the DARPA-TIMIT CD-ROM. This CD-ROM contains the data for 4200 American English sentences uttered by 420 speakers, each speaker contributing 10 sentences. These 420 speakers represent eight major dialect regions of American English. All sentences are provided with a time-aligned sequence of acoustic phonetic labels. Of these eight dialect regions, two were picked at random for this experiment (denoted dr3 and dr6 on the DARPA-TIMIT CD-ROM). There are 98 speakers in this subset, 26 of which are female and 82 are male. Although this imbalance in speaker sex might influence classification scores, we decided to use both sexes. Two thirds of the material was used for training and one third for testing.

The time-aligned phonetic labeling was used for picking one 10 ms frame per labeled segment. For vowels we took the frame at one third of the segment duration so as to avoid off-glides as much as possible, and for all other phonemes we picked the middle frame of each segment. In the DARPA-TIMIT label set there are more than 60 different labels, most of them phonemic. We used only a subset of these labels, since some of the labels occurred too infrequently to be properly trained. Diphthongs were left out, assuming that they can be rewritten as a concatenation of two monophthongs. The b, d, g, p, t, and k closures were grouped into two sets, one for the unvoiced closures and one for the voiced closures. Two allophones for h and u were joined into two combined classes instead of four separate classes. The total number of classes remaining was 41. The TIMIT labels, the number of training and test data, and for each label a specimen word containing the phonetic class are gathered in table 1.

For each 10 ms frame 16 filterbank output energies were computed [4], which were normalized with respect to cumulative energy.
Since normalization with respect to energy may give unstable results for silent spectra, a small amount of noise was added to all filterbank energies before the normalization was carried out.

A perceptron with two hidden layers, each containing 50 nodes, was used for the classification of these 41 classes. We used adaptive backpropagation as described by Chan & Fallside [5], which gave us a factor of 3.5 improvement in learning time. A confusion matrix was constructed on the basis of the classification scores on the test set, which served as the basis for the broad classification hierarchy.

The clustering algorithm that was used mimics the greedy clustering algorithms used in the aforementioned lexical studies. At each level two (broad) phonetic classes are joined into one single class, thus decreasing the total number of broad phonetic classes by one. The broad phonetic class which has the worst classification score at a given level is merged with another class such that the combined class has the highest classification score of all combinations that are tried. This procedure is repeated at each subsequent level until there are only two broad phonetic classes left.

The joining of two classes is done as follows. Let Con[i,j] be the confusion of class i with respect to class j, and let N[i] denote the number of phonetic classes contained in the broad phonetic class i. If classes m and n are joined into one single class, then for all classes i, Con[i,m] is added to Con[i,n] and all Con[i,m] are removed. The confusion scores for the combined class m & n are taken as a weighted mean of the confusion scores Con[n,i] and Con[m,i]:

(N[n]*Con[n,i]+N[m]*Con[m,i])/(N[n]+N[m])
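The clustering procedure described above can be sketched as follows. This is one possible reading rather than the original implementation: it assumes that Con[i,j] is the percentage of tokens of class i assigned to class j, that the classification score of a class is its diagonal entry Con[i,i], and that the column sum is applied before the weighted row mean. The function names and the small confusion matrix are hypothetical.

```python
def merge(con, members, m, n):
    """Join class m into class n and return the reduced (con, members)."""
    size_m, size_n = len(members[m]), len(members[n])  # N[m], N[n]
    # Column step: for every class i, Con[i,m] is added to Con[i,n].
    summed = [row[:] for row in con]
    for row in summed:
        row[n] += row[m]
    # Row step: size-weighted mean of rows n and m, as in the formula above.
    summed[n] = [(size_n * a + size_m * b) / (size_n + size_m)
                 for a, b in zip(summed[n], summed[m])]
    # Remove row m and column m.
    keep = [i for i in range(len(con)) if i != m]
    new_con = [[summed[i][j] for j in keep] for i in keep]
    new_members = [members[i] + members[m] if i == n else members[i]
                   for i in keep]
    return new_con, new_members


def build_hierarchy(con, labels, stop_at=2):
    """Greedy merging until only stop_at broad classes are left."""
    members = [(label,) for label in labels]
    levels = []
    while len(members) > stop_at:
        # Class with the worst classification score (diagonal entry).
        m = min(range(len(con)), key=lambda i: con[i][i])
        best = None
        for n in range(len(con)):
            if n == m:
                continue
            trial_con, trial_members = merge(con, members, m, n)
            k = n if n < m else n - 1  # index of merged class after removal
            score = trial_con[k][k]
            if best is None or score > best[0]:
                best = (score, trial_con, trial_members)
        score, con, members = best
        levels.append((score, members))
    return levels


# Hypothetical confusion percentages (each row sums to 100).
LABELS = ["p", "t", "k", "s"]
CON = [
    [60, 20, 15, 5],
    [18, 62, 15, 5],
    [14, 16, 65, 5],
    [3, 4, 3, 90],
]
levels = build_hierarchy(CON, LABELS)
```

For this invented matrix /p/ has the worst score and is first joined with /t/ (combined score 80); at the next level /k/ is the worst class and joins the same group (combined score 95), leaving the two classes {t, p, k} and {s}.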

3. Results

The constructed phoneme class hierarchy is displayed in the form of a dendrogram, shown in Figure 1. The horizontal axis lists all phonemes and the vertical axis shows the mean classification scores and the level at which classes are merged. The height of the walls corresponds to the relative robustness of clusters. At the top level all vowels together with the sonorants, the glottal stop and the nasal flap are joined into one single group, and all other consonants are in the other group. With the exception of the two stops, DX and Q, this seems to be a reasonable result, essentially corresponding to a vowel and semi-vowel group on the one side and all other consonants on the other side. Within the vowel and semi-vowel part of the tree, the sonorants L, W, Y, and R do not form a single intermediate broad phonetic class, but nevertheless seem to have some status of their own. As was to be expected, IY and Y cluster, just as ER and R, but essentially there is a vowel group which merges with the sonorants one after the other. On the consonant side there is some mingling of plosives and fricatives, but similar to the vowels vs semi-vowels distinction, these broad phonetic groups seem to have some status of their own. Durational information, which we did not use, might disambiguate these groups. The nasals do indeed form a separate group of their own.

Summarizing, we may conclude that the well-known broad phonetic classes do indeed seem to emerge in the broad classification scheme, which is an encouraging result, since lexical studies show an analogous grouping into these broad phonetic classes. Some hand-tuning to remove part of the mingling will probably not make much of a difference in classification scores. As for the use of broad classification for pruning down the number of word candidates in isolated word recognition, it is not quite clear whether the classification scores are good enough. Starting with a better input representation than we did will certainly boost classification scores, but due to the nature of the clustering algorithm it is unlikely that near-perfect classification scores can be achieved. As broad phonetic classes grow in size, the confusions between classes are mainly the result of confusions between members at the edges of each broad phonetic class, and these members at the edges will continue to have a less than near-perfect classification score.

References

[1] Shipman, D.S. and V.W. Zue, "Properties of Large Lexicons: Implications for Advanced Isolated Word Recognition Systems", Proc. IEEE ICASSP, 1982, pp. 546-549.

[2] Vernooij, G.J., G. Bloothooft, and Y. van Holsteijn, "A Simulation Study on the Usefulness of Broad Phonetic Classification in Automatic Speech Recognition", Proc. IEEE ICASSP, 1989, pp. 85-87.

[3] Kassel, R. and V.W. Zue, "An Information Theoretic Approach to the Study of Phoneme Collocational Constraints", Proc. ICSLP, 1990, pp. 937-940.

[4] Sekey, A. and B.A. Hanson, "Improved 1-Bark bandwidth auditory filter", JASA 75(6), 1984, pp. 1902-1904.

[5] Chan, L.W. and F. Fallside, "An adaptive training algorithm for backpropagation networks", Computer Speech and Language 2, 1989, pp. 205-218.

Table 1. Acoustic phonetic symbols and number of data

nr  symbol  nr training  nr testing  example
1 p     pop
2 t     tot
3 k     kick
4 b     bob
5 d     dad
6 g     gag
7 dx     butter
8 m     mom
9 n     non
10 ng     sing
11 s     sis
12 z     zoo
13 f     fief
14 v     very
15 ch     church
16 th     thief
17 sh     shoe
18 jh     judge
19 l     led
20 r     red
21 y     yet
22 w     wet
23 hv&hh     hay
24 eh     bet
25 ao     bought
26 aa     cot
27 uw     boot
28 er     bird
29 ay     bite
30 ey     bait
31 aw     about
32 ax     the
33 ix     roses
34 ih     bit
35 ae     bat
36 ah     butt
37 uw&ux&uh     boot
39 iy     beat
40 ow     boat
41 vcl     voiced closure
42 uvcl     unvoiced closure