Bloothooft, G. and Binnenpoorte, D. (1998). 'Towards a searchable resource of phonetograms'. Proceedings Voicedata98, OTS Utrecht, 112-116.


TOWARDS A SEARCHABLE RESOURCE OF PHONETOGRAMS

 

Gerrit Bloothooft and Diana Binnenpoorte

Utrecht Institute of Linguistics OTS, Trans 14, 3512 JK Utrecht, The Netherlands

telephone +31.30.2536042 / fax +31.30.2536000 / Email


Abstract

The phonetogram is an informative description of vocal capacities. Still, it is questionable whether phonetograms can be compared on the basis of the current storage of fundamental frequency, vocal intensity and the running average of acoustic parameters. This paper addresses the requirements for information storage in a phonetogram if the phonetogram is modelled as a set of Hidden Markov Models. Such an approach makes it possible to compare phonetograms, and to find in a huge resource of phonetograms the one that most likely explains some new voice sample. It is argued that this possibility is important to the diagnostic potential of phonetograms. A first experiment in this direction is described.

Introduction

Present-day technology has brought automatic recording of phonetograms within the standard package of voice measurements. Fundamental frequency, vocal intensity and a series of acoustic parameters such as jitter and crest factor are sampled with reasonable accuracy and presented in an elegant display (see Fig. 1). However, whereas at first sight the phonetogram recording is appealing, problems arise if we want to compare phonetograms. The comparison of phonetograms is important if we want to trace differences in phonetograms of a single subject over time, or if we want to compare phonetograms of different voices. For decades, research has been directed at defining so-called norm-phonetograms, and various averaging procedures have been proposed [1-8]. Yet this does not solve the problem of comparing a newly recorded phonetogram with the norm-phonetogram and describing the deviations. It is even questionable whether such an approach is useful at all.

Fig. 1. Phonetogram recording of a female soprano, measured by Pabon (1991) [8]. The crest factor is shown as acoustical parameter.

[figure not available]

Sometimes the phonetogram is seen as a basic description of the voice, in the same way as the audiogram is a basic description of hearing capacities. As for the audiogram, in which we have normalised thresholds of hearing for healthy subjects, we may look for a normalised phonetogram as well. It is essential, however, to see the big differences between the two areas in terms of the object of measurement and the measurement method. The hearing organ is passive and not controlled by human will. If a sinusoid is presented to the ear and a subject is asked whether the sound has been heard or not, the answer is a straightforward yes or no. On this basis a fairly accurate and reproducible estimate can be made of the hearing thresholds that make up the audiogram. There is little influence the subject can have on the outcome of the measurements. Under normal conditions, the capacities of the hearing organ of a subject are more or less fixed and cannot be trained. Furthermore, the variation in the hearing threshold at some frequency among healthy subjects is within reasonable limits. In contrast, voice production is highly influenced by human will and training. It is by no means a stable registration within and across subjects. Furthermore, vocal capacities in themselves are not at all comparable across subjects. There is a wide range depending on sex, age, and intrinsic anatomic variation in the first place, which has no counterpart in normal hearing. In addition, vocal characteristics are far more complicated to describe than the threshold of hearing.

Because of this, normalisation of phonetograms may not be the path we should pursue. In this paper we take a completely different approach. We assume that we have a huge resource of phonetogram recordings. Then we record a new phonetogram and ask which phonetogram from our resource best matches the new one. If there is an answer to this question, its implications are far-reaching. The resource of phonetograms could have been enriched, for instance, with medical histories. The best-matching phonetogram would then point towards possible diagnostic interpretations of the new phonetogram. Also, a technique for comparing phonetograms would allow us to cluster phonetograms within the resource without actually standardising them. In this way we would get away from rigid normalisation and instead open very promising ways of interpreting phonetograms.


A MODEL OF A PHONETOGRAM

Basically, the comparison of phonetograms is a statistical problem. We wish to test whether differences in some variable at some point between two phonetograms are significant or not. This test can only be done if we know the probability density function (pdf) of the variable at that point. And here we immediately hit the weakest aspect of the current registration of phonetograms. What is measured is fundamental frequency, vocal intensity and the running average of acoustic parameters. We can use fundamental frequency and vocal intensity as basic variables defining a ‘cell’ in a phonetogram of typically one semitone by one decibel. We wish to compare phonetograms cell-wise and therefore need the distribution of each acoustic parameter per cell. The running average - currently chosen because of storage limitations - is insufficient to model the entire distribution, but this technical limitation can be overcome. In the research presented here we had the parameter distributions at our disposal.

For the statistical comparison of phonetograms, we best interpret the phonetogram as a production model. The probability density functions of the acoustic parameters in a phonetogram give the probability that some property of the sound can be produced by the subject concerned. This implies that if we have some recordings of new phrases of another subject, we can compute the probability that these phrases could have been produced by the subject for whom we already have the phonetogram. If subjects have very different voices, the probability will be low; if the voices are very comparable, the probability will be high. This procedure can be extended to all phonetograms we have in our resource, and thus we can identify the phonetograms that have the highest likelihood of having produced the new phrases.

The approach sketched above implies the following framework for phonetogram recordings and comparisons. For the recording of phonetograms we collect series of phrases sung at all fundamental frequencies and vocal intensities a subject is able and willing to produce. These phrases are not only represented on the screen in the current way as running averages, but the values of all parameters are stored for all samples too. From the full data the pdfs of the acoustical variables in each phonetogram cell are computed. In the comparison phase, the probability is estimated that the phrases that constituted one phonetogram could have been produced by other phonetograms. It is also possible to do this for new recordings of phrases from new subjects. Interestingly, it is not at all necessary to make a full phonetogram registration of a new subject to estimate the probability that some phonetogram could have generated the phrase. Any short phrase is sufficient for the computations, but of course the accuracy of the probability estimate will increase with an increasing number of phrases and a more complete scan of the vocal possibilities.
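The comparison phase can be sketched as follows. For illustration we assume a single Gaussian pdf per cell for one acoustic parameter (the paper itself proposes HMMs per cell); all function names are hypothetical. A phrase is a sequence of labelled samples, and each candidate phonetogram in the resource is scored by the log likelihood that it produced those samples.

```python
import math

def gauss_logpdf(x, mean, std):
    # Log density of a normal distribution; one Gaussian per cell
    # is a simplifying assumption for this sketch.
    return -0.5 * math.log(2.0 * math.pi * std * std) \
           - (x - mean) ** 2 / (2.0 * std * std)

def log_likelihood(phrase, model):
    """phrase: list of (cell, parameter_value) samples.
    model: dict mapping cell -> (mean, std) of one acoustic
    parameter in that cell."""
    total = 0.0
    for cell, value in phrase:
        if cell not in model:         # cell never produced by this voice:
            return float("-inf")      # this phonetogram cannot explain it
        mean, std = model[cell]
        total += gauss_logpdf(value, mean, std)
    return total

def best_match(phrase, resource):
    """Rank all phonetograms in the resource by the likelihood that
    they produced the new phrase (most likely first)."""
    return sorted(resource,
                  key=lambda name: log_likelihood(phrase, resource[name]),
                  reverse=True)
```

Note that any short phrase yields a score; more phrases simply make the ranking more reliable, as argued in the text.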

The model we have in mind for the description of parameter distributions in a phonetogram cell is a Hidden Markov Model (HMM). The argument is that voice production can have more than one ‘state’ for a combination of fundamental frequency and vocal intensity. This assumption is most obvious for vocal registers but different states may also occur at very soft phonation, depending on whether the soft phonation is at the start or the end of phonation [9]. Often, we do not know what ‘state’ voice production is in, it is ‘hidden’ to us. If we adopt a very simple model with two states per cell, we could arrive at the following architecture:

[figure not available]

Fig. 2. Two-state HMM for a phonetogram cell (restricted ranges of fundamental frequency and vocal intensity).

This architecture implies that we assume that once we enter a phonetogram cell in one state, we will stay in that state as long as the same fundamental frequency and intensity are maintained. In other words, it is assumed that we cannot switch, for instance, between modal and falsetto register without (temporarily) changing fundamental frequency and/or vocal intensity.
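Under this no-switching assumption, the two-state model reduces, within one cell, to a mixture over just two admissible state sequences: the one that stays in state 1 throughout and the one that stays in state 2. A minimal sketch with Gaussian emissions (the names and the single-parameter emission are illustrative assumptions, not the paper's HTK setup):

```python
import math

def gauss_logpdf(x, mean, std):
    # Log density of a normal emission distribution.
    return -0.5 * math.log(2.0 * math.pi * std * std) \
           - (x - mean) ** 2 / (2.0 * std * std)

def cell_loglik(samples, states):
    """Log likelihood of the samples observed in one cell.
    states: list of (prior, mean, std), one entry per hidden state.
    Because the state cannot change within a cell, we only sum over
    the constant state sequences (log-sum-exp for stability)."""
    per_state = []
    for prior, mean, std in states:
        ll = math.log(prior) + sum(gauss_logpdf(x, mean, std)
                                   for x in samples)
        per_state.append(ll)
    m = max(per_state)
    return m + math.log(sum(math.exp(ll - m) for ll in per_state))
```

With two well-separated states this naturally accommodates bimodal parameter distributions, e.g. from two vocal registers overlapping in the same F0/intensity region.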

Hidden Markov Models are well known from speech recognition and from speaker verification and identification. If we compare these applications with the presently proposed usage of HMMs, a speech sentence now corresponds to a sung phrase. Unlike in speech recognition, the phonetogram sound samples are always labelled, not as phonemes but by the combination of fundamental frequency and vocal intensity. A phoneme model in speech recognition has its counterpart in the presented model of a phonetogram cell. The spectral parameters used in speech recognition are replaced by acoustic voice parameters, which each constitute a separate data stream. The language model in speech recognition, which defines the possible order of phonemes in words and words in sentences, is not yet applied in the phonetogram model. This implies that we allow all combinations of fundamental frequency and intensity to follow each other in a phrase. The overall correspondence to standard applications of HMMs has the attractive consequence that we can use standard software to implement phonetogram models.


EXPERIMENT

To test the method we needed a series of phonetograms, recorded according to the given requirements. Unfortunately, only limited material that satisfied our needs was available at the time of writing, although it still allowed us to perform an interesting experiment. For eight amateur and professional singers (4 male, 4 female), we had registrations of swelltones (variation in vocal intensity at constant fundamental frequency) and glissandi (variation in fundamental frequency at reasonably constant vocal intensity), sung on the vowel /a/ and recorded at a microphone distance of 30 cm [9]. Both types of phrases were sung in the speaking voice area, the area of register change between modal and falsetto, and in falsetto register (or high singing region). Per singer, three recordings were made under each condition. For each singer, we computed phonetogram HMMs on the basis of the swelltones only. With these HMMs we estimated for each singer the probability that the glissandi were produced by him or her.

 

Because of the recording procedure, we had a limited sampling of fundamental frequency, which we solved by applying cell widths of about one octave instead of one semitone. This means that we had a low interval for the male lower modal register, a middle interval for the male higher modal and falsetto registers and the lower female range, and a third interval for the higher female range. Vocal intensity was divided into 6 dB intervals instead of 1 dB intervals. In all, there were three intervals for fundamental frequency and nine for vocal intensity, constituting 27 cells and 27 HMMs to be estimated. Figure 3 gives the definition of the cells in the F0/I plane. Even with wide intervals the number of cells is already considerable, and it could easily have exceeded 1000 cells when using semitone and decibel intervals. This raises the question of how to choose intervals for an optimal description of differential voice characteristics that are trainable in a reasonable amount of time and with a minimum number of cells.
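The 27-cell grid of the experiment can be sketched as follows. The F0 boundaries of 165 Hz (40 ST) and 370 Hz (54 ST) come from Fig. 3; the intensity grid starting at 40 dB SPL is an assumption, chosen so that one band spans 64-70 dB SPL as stated for the cell of Fig. 4.

```python
# Experiment grid: three F0 intervals (boundaries at 165 Hz = 40 ST
# and 370 Hz = 54 ST, from Fig. 3) and nine 6-dB intensity bands.
# The 40 dB SPL starting point of the intensity grid is an assumption.
F0_EDGES_HZ = [165.0, 370.0]                    # low / middle / high
I_EDGES_DB = [40 + 6 * k for k in range(1, 9)]  # 46, 52, ..., 88

def experiment_cell(f0_hz, intensity_db):
    """Index (0..26) of the experiment cell for one sample."""
    f0_band = sum(f0_hz >= edge for edge in F0_EDGES_HZ)       # 0..2
    i_band = sum(intensity_db >= edge for edge in I_EDGES_DB)  # 0..8
    return f0_band * 9 + i_band
```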

[Figure not available]

Fig. 3. Plane of fundamental frequency and vocal intensity with cell areas as used in the experiment. Model regions for male and female phonetograms are indicated. 40 ST = 165 Hz, 54 ST = 370 Hz.

Fig. 4. Distribution of the crest factor and relative rise time for eight singers in the phonetogram cell with an F0 range between 165 and 370 Hz and a vocal intensity range between 64 and 70 dB SPL.

As acoustical variables we used the crest factor (which we defined as the peak-to-peak difference minus the RMS value, in dB) and the relative rise time: the time it takes from the start of a period to reach the maximum peak, relative to the period duration. Both variables were measured every 10 ms. For a single cell in the middle of the phonetograms, figure 4 shows the distribution of both parameters for all singers. The distributions show considerable differences among singers in the means and in the standard deviations. We also observe some cases of bimodal distributions for male singers. The latter are a strong motivation for the proposed two-state architecture of the HMM.
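The two measures can be sketched per period as below. We read "peak-to-peak difference minus the RMS value, in dB" as the level difference 20·log10(pp/rms); that reading, and the function names, are assumptions for this sketch.

```python
import math

def crest_factor_db(period):
    """Crest factor as defined in the text: peak-to-peak level minus
    RMS level, in dB (read here as 20*log10(pp/rms)).
    `period` holds the waveform samples of one glottal period."""
    pp = max(period) - min(period)
    rms = math.sqrt(sum(x * x for x in period) / len(period))
    return 20.0 * math.log10(pp / rms)

def relative_rise_time(period):
    """Time from the start of the period to the maximum peak,
    as a fraction of the period duration."""
    return period.index(max(period)) / len(period)
```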

Data from the swelltones were fed into an HMM training procedure (HTK package). All samples were labelled on the basis of F0 and I as belonging to one of the 27 cells of a singer, while two data streams were applied for the crest factor and the relative rise time, respectively. This training procedure resulted in between 10 and 21 HMMs per singer, depending on the available recordings. As a test, we took the glissando recordings and estimated the probability that these had been produced by any of the eight singers. Table 1 shows the results.

 

Singer   Order of most likely candidates
M1       M1  M3  M4  M2  F2  F4  F1  F3
M2       M1  M2  M3  M4  F2  F1  F4  F3
M3       M3  M1  M4  F4  M2  F1  F2  F3
M4       M4  M3  M1  F2  F1  F4  M2  F3
F1       M4  F1  F4  F2  F3  M1  M3  M2
F2       F2  M4  M1  F1  F4  M3  F3  M2
F3       M4  F2  F3  F1  F4  M3  M1  M2
F4       F4  M4  F1  M3  F2  F3  M1  M2

Table 1. Test on the recognition of acoustic data of glissandi by HMMs trained on swelltones. The left column gives the singer producing the glissandi, the other columns the singers ordered on the basis of the likelihood that their HMMs could explain the glissandi (left = high, right = low). Only those glissandi were considered that are within the range of the HMMs of the singer tested.

Table 1 shows that a singer's own HMMs were among the best two for seven out of the eight singers. It can also be seen that the male HMMs best explained the glissandi of the male singers, with only a few exceptions. For the female singers this effect is less clear, mainly due to singer M4.


DISCUSSION

Although the experiment was very limited, we believe that the result is encouraging. But many more tests need to be done. First of all, we should extend the data to full phonetogram registrations and many more subjects. Second, we should study the optimal boundaries of the cells for which we build an HMM. It could be that an oblique orientation of the cells, inspired by the orientation of register boundaries, optimises results. Then, we could extend the data with more streams of acoustic parameters, such as jitter. We could look at other architectures for the HMMs. There is also the problem of sparse or missing data. What kind of interpolation could offer a solution for this? It would be very interesting to investigate which areas in the phonetograms distinguish subjects most. With the present encouraging results in mind, all this is well worth trying.


REFERENCES

[1] Böhme, G. & Stuchlick, G. (1995). Voice profiles and standard voice profiles of untrained children. J. of Voice 9, 304-307.

[2] Gramming, P. (1988). The phonetogram, an experimental and clinical study. Malmö, PhD dissertation.

[3] Coleman, R.F., Mabis, J.H. & Hinson, J.K. (1977). Fundamental frequency - Sound pressure level profiles of adult male and female voices. J. Speech and Hearing Research 20, 197-204.

[4] Schultz-Coulon, H.J. & Asche, S. (1988). Das "Normstimmfeld"- ein Vorschlag. Sprache-Stimme-Gehör 11, 5-8.

[5] Sulter, A.M., Wit, H.P., Schutte, H.K. & Miller, D.G. (1994). A Structured approach to voice range profile (phonetogram) analysis. J. Speech and Hearing Research 37, 1076-1085.

[6] Hacki, T., Frittrang, B., Zywietz, C. & Zupan, C. (1990). Verfahren zur statistischen Ermittlung von Stimmfeldgrenzen - Das Durchschnittsstimmfeld. Sprache-Stimme-Gehör 14, 110-112.

[7] Heylen, L. (1997). Normfonetogrammen voor kinderen (6-11) en leerkrachten met gezonde stemmen. In: The clinical relevance of the phonetogram, PhD dissertation, Antwerpen [in Dutch], 86-101.

[8] Pabon, J.P.H. (1991). Objective voice-quality parameters in the computer phonetogram. J. of Voice 5, 203-216.

[9] Fokkens, J. (1997). The distribution of acoustic voice quality parameters in the phonetogram. MA thesis, Utrecht [in Dutch].