Bloothooft, G., Bos, S., and Schaeffer, R. (1994). Distance metric for the perception of formant frequency differences in quiet and noise. Presented at Workshop Psychoacoustics of Speech Perception II, Utrecht.
Distance metric for the perception of formant frequency differences in quiet and noise
Gerrit Bloothooft, Sidonne Bos, Roeland Schaeffer
Research Institute for Language and Speech, University of Utrecht
Trans 10, 3512 JK Utrecht, The Netherlands
The perception of vowels that varied only in the frequency of the second formant (between /u/ and /i/) was investigated in both quiet and noise. The multi-dimensional perceptual stimulus configurations did not show a linear correspondence with log-formant frequencies in either quiet or noise. The perceptual configurations were subsequently matched with physical configurations obtained from metrics based on (weighted) band filter levels, or (weighted) level differences between adjacent band filters. Best results were obtained with weighted-level metrics. The weighting indicated that band filters with a local maximum in level difference, in particular, contribute to the distance between two spectra. These metrics could describe the considerable influence of the qualities of the vowels /u/ and /i/ on the perceived vowel differences.
I. Introduction
The acoustical representation of speech in the initial phase of recognition has been acknowledged as highly important both for understanding human sound perception and for achieving good recognition performance by machines. For the last ten years, the modelling of the peripheral sound processing of simple signals, such as stationary speech sounds, has been based mainly on electrophysiological data. Although the application of (simplified) models of auditory signal processing in speech recognition systems has been generally accepted, we feel that a gap still exists between these models and perceptual data on both the psycho-acoustic and the phonetic level of perception (Klatt, 1982, 1986).
From the recognition point of view, the phonetic level of perception is the most interesting. At this level we abstract from acoustic characteristics which are to a large extent irrelevant for the understanding of a message, such as spectral slope and fundamental frequency (as far as gender differences are concerned). Formant frequencies are probably the dominant factor. On the other hand, there is the psycho-acoustic level where gross spectral features dominate. This level is sensitive to variation in spectral slope and bandwidth, for instance, but is also capable of capturing many phonetic characteristics.
In this study we investigated, both in quiet and noise, the perception of stationary vowel stimuli which only varied in the frequency of the second formant. The stimulus space thus varied between /u/ and /i/, leading to a possible influence of these phonemic categories on the perceptual difference scores (Kewley-Port and Atal, 1989). The aim was to find the physical distance metric which best predicted the perceived differences.
II. Method and material
Vowel stimuli were synthesized with a Klatt synthesizer (Klatt, 1980). Only the second formant frequency was varied; all other parameters were held constant. F0 was fixed at 100 Hz, F1 was set at 250 Hz, F3 at 2720 Hz, F4 at 3300 Hz, and F5 at 3750 Hz. The bandwidths of the formants were 50, 70, 110, 250, and 200 Hz for F1 to F5, respectively. The duration of each vowel was 250 ms, including a sinusoidal fade-in and fade-out of 15 ms. Fifteen vowels were synthesized with the second formant frequency equally distributed on a logarithmic scale between 677 and 2291 Hz, at 677, 738, 805, 879, 959, 1046, 1141, 1245, 1358, 1482, 1617, 1764, 1925, 2100, and 2291 Hz. Equal loudness of the stimuli was obtained by means of a perceptual comparison of the stimuli by three subjects.
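The logarithmic spacing of the fifteen F2 values can be checked with a few lines of Python. This is only an illustrative sketch: the function name `log_spaced_f2` is our own, and the comparison assumes that the listed values are the exact log-spaced frequencies reduced to whole hertz. Under that assumption neighbouring stimuli differ by a constant ratio of about 9 %.

```python
import numpy as np

# F2 values as listed in the text (Hz).
LISTED_F2 = np.array([677, 738, 805, 879, 959, 1046, 1141, 1245,
                      1358, 1482, 1617, 1764, 1925, 2100, 2291])

def log_spaced_f2(f_low=677.0, f_high=2291.0, n=15):
    """Return n frequencies equally spaced on a logarithmic scale."""
    return np.geomspace(f_low, f_high, n)

f2 = log_spaced_f2()
step_ratio = f2[1] / f2[0]                 # constant ratio between neighbours
max_dev = np.max(np.abs(f2 - LISTED_F2))   # largest deviation from the list (Hz)
print(np.round(f2, 1))
print(step_ratio, max_dev)
```

The computed frequencies stay within about 1 Hz of the listed ones, which is consistent with the stated design.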
In an identification test, eight experienced listeners classified each of the 15 stimuli ten times, choosing from a list of 13 Dutch vowels. Figure 1 shows the results. The stimuli varied between the clearly recognized vowels /u/ and /i/, with schwa responses dominating for intermediate F2 values.
Fig. 1. Identification scores for the 15 vowel stimuli.
Five sets of vowel stimuli were constructed. The first set consisted of the nine vowels close to /u/, with the lower F2 values (between 677 and 1358 Hz). The second set contained the nine vowels around /i/, with the higher F2 values (between 1141 and 2291 Hz). The eight odd-numbered stimuli, from /u/ to /i/, formed the third set (F2 = 677, 805, ..., 2291 Hz). The fourth and the fifth set consisted of the same vowels as the third set, but with white noise added, resulting in a signal-to-noise ratio of 36 dB and 24 dB, respectively.
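How white noise can be scaled to a given signal-to-noise ratio is sketched below. The text does not state how the S/N ratio was measured, so the sketch assumes the ratio of average broadband powers over the whole stimulus. The sampling rate, the sine tone standing in for a synthesized vowel, and the function names are hypothetical.

```python
import numpy as np

def add_white_noise(signal, target_snr_db, rng):
    """Add Gaussian white noise scaled to a target S/N ratio (dB).

    Assumption: S/N is the ratio of average powers over the whole signal.
    Returns the noisy signal and the scaled noise that was added.
    """
    signal = np.asarray(signal, dtype=float)
    noise = rng.standard_normal(signal.shape)
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that 10*log10(p_signal / p_scaled_noise) = target.
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    scaled = gain * noise
    return signal + scaled, scaled

def measured_snr_db(signal, noise):
    """S/N ratio in dB from average signal and noise power."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
fs = 10000                                # hypothetical sampling rate (Hz)
t = np.arange(int(0.25 * fs)) / fs        # 250 ms, as for the stimuli
tone = np.sin(2 * np.pi * 100.0 * t)      # placeholder, not a real vowel
noisy36, noise36 = add_white_noise(tone, 36.0, rng)
noisy24, noise24 = add_white_noise(tone, 24.0, rng)
print(measured_snr_db(tone, noise36), measured_snr_db(tone, noise24))
```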
Band filter analysis
Band filter spectra of the stimuli were computed on the basis of 1-Bark bandwidth auditory filters, as described by Sekey and Hanson (1984). Sixteen filters were used, with centre frequencies between 117 and 4395 Hz. The energy (dB) measured in each filter was represented as a coordinate value along one dimension in a 16-dimensional space. To get an idea of the variation among the stimuli, the 16-dimensional representation of the stimuli was subjected to Principal Component Analysis. As expected, this resulted in a largely one-dimensional solution, but with non-negligible amounts of variance in the second and third dimensions. For the five sets of vowels, Table I gives the amount of variance in the first three principal component dimensions.
Table I. Total spectral variance in each of the five vowel subsets, and the distribution of that variance over the first three Principal Component dimensions.
| Vowel set | Total variance (dB²) | Dimension I (%) | Dimension II (%) | Dimension III (%) | Total I-III (%) |
|---|---|---|---|---|---|
| Set I (stim. 1-9) | 2850 | 90.7 | 7.5 | 1.3 | 99.5 |
| Set II (stim. 7-15) | 3670 | 84.6 | 12.5 | 2.0 | 99.1 |
| Set III (stim. 1, 3, ..., 13, 15) | 9600 | 83.8 | 11.2 | 2.8 | 97.8 |
| Set IV (set III, 36 dB S/N) | 6040 | 82.0 | 10.2 | 3.9 | 96.1 |
| Set V (set III, 24 dB S/N) | 3600 | 74.2 | 11.2 | 8.4 | 93.8 |
Although the stimuli within each set varied only in second formant frequency, this one-dimensional F2 variation appears as multi-dimensional variation in the band filter energy space.
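The following sketch illustrates why a single moving spectral peak spreads over several principal components. It does not use the actual band filter data: the level matrix is a synthetic stand-in (a Gaussian-shaped peak in dB shifting across 16 bands), and the total is taken as the sum of squared deviations from the mean spectrum, which is an assumption about how the total variance in Table I was defined.

```python
import numpy as np

def pca_variance(levels):
    """Total variance and percentage per principal component.

    levels: array (n_stimuli, n_bands) of band filter levels in dB.
    Assumption: 'total variance' is the sum of squared deviations (dB^2)
    from the mean spectrum; the percentages do not depend on this choice.
    """
    levels = np.asarray(levels, dtype=float)
    centered = levels - levels.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    component_ss = singular_values ** 2   # sum of squares per component
    total = component_ss.sum()
    return total, 100.0 * component_ss / total

# Synthetic stand-in: one peak moving across the bands, nine "stimuli".
bands = np.arange(16)
peak_positions = np.linspace(6.0, 12.0, 9)
levels = np.array([30.0 * np.exp(-0.5 * (bands - c) ** 2)
                   for c in peak_positions])

total, percent = pca_variance(levels)
print(total, np.round(percent[:3], 1))
```

Even though only one parameter (the peak position) changes, the variance is not confined to the first component, because shifting a peak is not a linear operation on the band levels.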
The stimuli of a set were judged pairwise. Accordingly, there were 36 pairs for the nine stimuli of sets I and II, and 28 pairs for the eight stimuli of sets III, IV, and V. Each pair was presented four times: twice in one order and twice in the reversed order. The inter-stimulus interval was 300 ms, while between pairs there was a silent interval of 2.5 s in which subjects could respond. They had to give a judgement on a 10-point scale, on which 0 meant 'no difference' and 9 'very different'. Listeners were seated in a sound-treated booth; they were not informed about the characteristics of the sounds. A short training phase, in which 10 stimulus pairs were presented, preceded the experiment. Each test lasted about 30 minutes.
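The pair counts follow directly from the number of stimuli per set, as the short sketch below shows. The function name `trial_list` is our own, and the trials are left in a fixed order because the text does not describe how presentations were ordered.

```python
from itertools import combinations

def trial_list(n_stimuli, repeats_per_order=2):
    """Return all unordered pairs and the full list of presentations.

    Each pair is presented repeats_per_order times in each of the two orders.
    """
    pairs = list(combinations(range(1, n_stimuli + 1), 2))
    trials = []
    for a, b in pairs:
        trials += [(a, b)] * repeats_per_order + [(b, a)] * repeats_per_order
    return pairs, trials

pairs9, trials9 = trial_list(9)   # sets I and II
pairs8, trials8 = trial_list(8)   # sets III, IV, and V
print(len(pairs9), len(trials9))  # 36 pairs, 144 presentations
print(len(pairs8), len(trials8))  # 28 pairs, 112 presentations
```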
Three groups of 20 subjects, aged between 20 and 26 years, participated in the listening experiments. No hearing deficiencies were reported. The first group judged sets I and II, the second group set III, and the last group sets IV and V. In the last group, half of the subjects judged set IV first and the other half judged set V first.
We investigated whether each subject was consistent in his or her judgements by computing the average deviation in the judgements of the same pair. When this deviation was larger than 2, we discarded the results of that subject. This was the case for two subjects who judged sets I and II, and for two subjects who judged sets IV and V.
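A consistency check of this kind might look as follows. The exact definition of the deviation is not given in the text; the sketch assumes the mean absolute deviation of the four judgements of a pair from their mean, averaged over all pairs. The function names and the example ratings are hypothetical.

```python
import numpy as np

def mean_within_pair_deviation(ratings):
    """Average absolute deviation of repeated judgements from the pair mean.

    ratings: array (n_pairs, n_repetitions) of judgements on the 0-9 scale.
    Assumption: this is one possible reading of 'average deviation'.
    """
    ratings = np.asarray(ratings, dtype=float)
    pair_means = ratings.mean(axis=1, keepdims=True)
    return np.mean(np.abs(ratings - pair_means))

def is_consistent(ratings, threshold=2.0):
    """True when the average deviation does not exceed the threshold."""
    return mean_within_pair_deviation(ratings) <= threshold

# Hypothetical ratings for two pairs, four presentations each.
steady = np.array([[3, 3, 4, 4], [7, 8, 7, 8]])
erratic = np.array([[0, 9, 0, 9], [1, 8, 9, 0]])
print(mean_within_pair_deviation(steady), is_consistent(steady))
print(mean_within_pair_deviation(erratic), is_consistent(erratic))
```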
The subjects heard each stimulus pair four times, twice in one order and twice in the reversed order. There appeared to be a systematic difference between the judgements for the two orders. A stimulus pair ordered low F2 - high F2 was judged less different than the same stimulus pair ordered high F2 - low F2 (significant for sets I and III). Separate analyses of the results for each order gave the same perceptual representation, indicating the systematic character of the effect. We therefore present the results of the combined data.
Subjects' judgements of the stimulus pairs were averaged, and the resulting upper-triangle matrix of perceptual differences was used as input for Kruskal's multi-dimensional scaling technique (Kruskal, 1964). The number of perceptual dimensions was varied between one and three, giving stress values of 8.8, 3.2, and 2.4 %, respectively, averaged over the stimulus sets. Since Kruskal (1964) considered 5 % stress to be a good fit, a two-dimensional solution seems acceptable. The introduction of the third dimension reduced the stress somewhat further, however, so we retained that dimension in further analyses.
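For readers unfamiliar with the stress measure, the sketch below evaluates Kruskal's stress (formula 1) for a given configuration. It does not perform the iterative scaling itself. The monotone regression uses a plain pool-adjacent-violators pass, and the function names and the small example configuration are our own.

```python
import numpy as np

def monotone_fit(values):
    """Non-decreasing least-squares fit (pool adjacent violators)."""
    blocks = []  # each block holds [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge blocks while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

def kruskal_stress1(config, dissim):
    """Stress-1 of a configuration against a dissimilarity matrix."""
    config = np.asarray(config, dtype=float)
    dissim = np.asarray(dissim, dtype=float)
    iu = np.triu_indices(config.shape[0], k=1)
    diff = config[:, None, :] - config[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))[iu]     # configuration distances
    order = np.argsort(dissim[iu], kind="stable")  # rank order of the data
    d_hat = np.empty_like(d)
    d_hat[order] = monotone_fit(d[order])          # disparities
    return np.sqrt(((d - d_hat) ** 2).sum() / (d ** 2).sum())

# Four points on a line; dissimilarities are a monotone transform of distance.
points = np.array([[0.0], [1.0], [3.0], [6.0]])
dist = np.abs(points - points.T)
print(kruskal_stress1(points, dist ** 2))          # rank order preserved
```

Because only the rank order of the dissimilarities enters the fit, any monotone transformation of the judgements leaves the stress unchanged.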
Physical stimulus configurations and the rotation to the perceptual configuration
We investigated four types of physical representation of the vowel stimuli. These are: (1) the 16-dimensional Euclidean band filter energy representation [Plomp-metric; Plomp (1970), Pols et al. (1969), Bloothooft and Plomp (1988)], (2) the 15-dimensional Euclidean band filter energy difference representation [degenerated Klatt-metric; Klatt (1982)], (3) the Plomp-metric in which each dimension is optimally weighted before matching with the perceptual configuration [post-hoc weighted Plomp-metric], and (4) the degenerated Klatt-metric, weighted as under (3) [post-hoc weighted Klatt-metric].
These four representations need some explanation. In pilot investigations, we found that a physical configuration based on the Klatt-metric (Klatt, 1982) led to the best matches with the perceptual configuration when the parameters in the metric that weight global and local peaks were all set to infinite values. Essentially, this results in a degenerated Klatt-metric, which only takes into account the energy difference between adjacent band filters. The same result has been reported by Nocerino et al. (1985) in an ASR task on alpha-digits.
In the case of the weighted metrics, the weighting factors of the band filter energies (or energy differences) were obtained post-hoc by iteratively optimizing the fit between the resulting physical configuration and the perceptual configuration. As a criterion we used the overall fit measure, the coefficient of alienation (Lingoes and Schönemann, 1974); see below. This post-hoc weighting was performed to get an idea of the way weighting factors should be incorporated in distance metrics. Because of the weighting factors, the matched physical dimensions are no longer orthogonal with respect to the original unweighted band filter dimensions. This means that it is possible, for instance, that two types of correlated physical variation map onto two uncorrelated perceptual dimensions.
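The two distance measures described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the weights are assumed to multiply each coordinate before the Euclidean distance is taken, the function names are our own, and the example spectrum is arbitrary.

```python
import numpy as np

def plomp_distance(spec_a, spec_b, weights=None):
    """Euclidean distance between two band filter level vectors (dB).

    Assumption: optional weights scale each dimension before the distance.
    """
    diff = np.asarray(spec_a, dtype=float) - np.asarray(spec_b, dtype=float)
    w = np.ones_like(diff) if weights is None else np.asarray(weights, float)
    return np.sqrt(np.sum((w * diff) ** 2))

def degenerated_klatt_distance(spec_a, spec_b, weights=None):
    """Euclidean distance between adjacent-band level differences.

    A 16-band spectrum yields 15 level differences, so weights has length 15.
    """
    slope_a = np.diff(np.asarray(spec_a, dtype=float))
    slope_b = np.diff(np.asarray(spec_b, dtype=float))
    return plomp_distance(slope_a, slope_b, weights)

# Arbitrary 16-band example spectrum and a copy raised by 6 dB overall.
spec = np.array([40, 52, 48, 41, 36, 33, 35, 42,
                 38, 31, 28, 27, 30, 34, 29, 22], dtype=float)
louder = spec + 6.0
print(plomp_distance(spec, louder))              # responds to overall level
print(degenerated_klatt_distance(spec, louder))  # ignores overall level
```

The example shows one qualitative difference between the two representations: a uniform level shift changes the band level distance but leaves the level-difference distance at zero.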
Physical and perceptual configurations were matched using orthogonal rotation to congruence (Schönemann and Carroll, 1970). In all cases, the distribution of explained spectral variance over the matched perceptual dimensions did not deviate much from the distribution over the PCA dimensions (Table I) and is therefore not given. Also, for all three matched perceptual dimensions, the correlation coefficient between the perceptual coordinate values and the coordinate values of any matched physical dimension was significant (p<0.01); these coefficients are not given either. Instead, we present in Table II the overall measure of fit, the coefficient of alienation.
Table II. Coefficients of alienation (fit measure) for four physical representations of the stimuli, matched with perceptual configurations of the five vowel sets. A value of 1 denotes unrelated configurations, 0 a perfect fit.
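The matching step and its fit measure can be sketched as below. We assume here that the coefficient of alienation equals sqrt(1 - r²), where r is the sum of the singular values of the cross-product matrix of the two centred configurations, divided by the product of their Frobenius norms. This corresponds to rotation with a free central dilation and translation, with the lower-dimensional configuration padded with zero columns. The function name and the test configurations are our own, and the original formulation should be consulted for the exact definition.

```python
import numpy as np

def alienation_after_rotation(physical, perceptual):
    """Fit measure after optimal rotation, dilation, and translation.

    physical:   array (n_stimuli, p), e.g. band filter levels
    perceptual: array (n_stimuli, q), e.g. MDS coordinates
    Assumption: alienation = sqrt(1 - r^2), r = sum(singular values of X'Y)
    divided by ||X|| * ||Y|| for the centred configurations X and Y.
    """
    X = np.asarray(physical, dtype=float)
    Y = np.asarray(perceptual, dtype=float)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    s = np.linalg.svd(X.T @ Y, compute_uv=False)
    r = s.sum() / (np.linalg.norm(X) * np.linalg.norm(Y))
    return np.sqrt(max(0.0, 1.0 - r ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
Y_same = 2.5 * X @ Q + 1.0                        # rotated, scaled, shifted
k_same = alienation_after_rotation(X, Y_same)

# Two one-dimensional configurations with no shared variation.
A = np.array([[1.0], [-1.0], [0.0], [0.0]])
B = np.array([[0.0], [0.0], [1.0], [-1.0]])
k_unrelated = alienation_after_rotation(A, B)
print(k_same, k_unrelated)
```

A configuration that differs from the target only by rotation, scaling, and shift gives a value near 0, and configurations without shared variation give 1, in line with the interpretation stated in the caption of Table II.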
For all metrics, the configuration matching is given in Fig. 2 for set I in both the I-II and the I-III perceptual plane. Results for the I-II perceptual plane are given in Fig. 3 for sets III and V. The latter two sets consist of the same eight stimuli under different signal-to-noise conditions (no noise and 24 dB S/N, respectively). For set I, Fig. 4 presents the transformation vector towards the first perceptual dimension, for the (weighted) Plomp-metric and Klatt-metric.
Fig. 2. The match of four types of metric with the vowel configuration of set I (/u/-like vowels) in the I-II and the I-III perceptual plane.
IV. Discussion and conclusions
From all presented perceptual configurations we may safely conclude that there is no linear relationship between the perception of vowel differences and log-formant frequency, in either quiet or noise. Furthermore, there seems to be an effect of the stimulus set. For sets I and II, which contained /u/-like and /i/-like vowels, respectively, a regular, although curved, distribution of stimuli was found in the perceptual space (Fig. 2). For sets III, IV, and V, in which both /u/-like and /i/-like vowels were present, the /u/-like vowels clustered perceptually, while the /i/-like vowels showed remarkably large differences (Fig. 3). No clustered category of /i/-like vowels was found.
Fig. 3. The match of four types of metric with the vowel configuration in the I-II perceptual plane of set III, vowels between /i/ and /u/, (upper panel) and set V, the same vowels presented under 24 dB S/N ratio, (lower panel).
Fig. 4. Transformation vector towards the first perceptual dimension of set I, (a) for the unweighted metrics, and (b) for the weighted metrics. Direction cosines for the Klatt-metric are positioned at the highest rank order of the two adjacent band filters, for which the level difference is computed.
The addition of white noise did not result in a closer relationship between the perceptual space and log-formant frequencies, which indicates that in this type of experiment listeners still rely on broad spectral information. But in what way? For the one-vowel sets the Klatt-metric was better; for the two-vowel sets, the Plomp-metric. In all cases, weighting of the original band filter levels (or level differences) improved the matching considerably. This means that weighting should be incorporated in the metric, but not in the way Klatt (1982) proposed. The post-hoc weighting we found indicated that the listening results could be best predicted by using only a few, but relevant, band filter levels (or level differences). Of course, these band filters are entirely dependent on the stimulus set. In a more general sense, our results seem to indicate that only those few band filters in which two spectra differ maximally within a local frequency region contribute to a distance measure.
The implication for automatic speech recognition seems to be that formant frequencies are not the only determinant of phonetic vowel differences and that further work on metrics for broad-band spectra is necessary. The good results obtained with Klatt's metric in ASR tasks by Nocerino et al. (1985) and Junqua (1991) are promising in this respect.
References
Bloothooft, G. and Plomp, R. (1988). "The timbre of sung vowels," J. Acoust. Soc. Am. 84, 847-860.
Junqua, J.C. (1991). "A two-pass hybrid system using a low dimensional auditory model for speaker-independent isolated-word recognition," Speech Communication 10, 33-44.
Kewley-Port, D. and Atal, B.S. (1989). "Perceptual differences between vowels located in a limited phonetic space," J. Acoust. Soc. Am. 85, 1726-1740.
Klatt, D.H. (1980). "Software for a cascade/parallel formant synthesizer," J. Acoust. Soc. Am. 67, 971-995.
Klatt, D.H. (1982). "Prediction of perceived phonetic distance from critical-band spectra: A first step," Proc. ICASSP, Paris, 1278-1281.
Klatt, D.H. (1986). "The problem of variability in speech recognition and in models of speech perception," in: J.S. Perkell and D.H. Klatt (Eds.), Invariance and variability of speech processes, Lawrence Erlbaum Ass., Hillsdale, N.J., 300-319.
Kruskal, J.B. (1964). "Multi-dimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika 29, 1-27.
Lingoes, J.C. and Schönemann, P.H. (1974). "Alternative measures of fit for the Schönemann-Carroll matrix fitting algorithm," Psychometrika 39, 423-427.
Nocerino, N., Soong, F.K., Rabiner, L.R., and Klatt, D.H. (1985). "Comparative study of several distortion measures for speech recognition," Speech Communication 4, 317-331.
Plomp, R. (1970). "Timbre as a multi-dimensional attribute of complex tones," in: R. Plomp and G.F. Smoorenburg (Eds.), Frequency analysis and periodicity detection in hearing, Sijthoff, Leiden, 397-414.
Pols, L.C.W., Kamp, L.J.Th van der, and Plomp, R. (1969). "Perceptual and physical space of vowel sounds," J. Acoust. Soc. Am. 46, 458-467.
Schönemann, P.H. and Carroll, R.M. (1970). "Fitting one matrix to another under choice of a central dilation and a rigid motion," Psychometrika 35, 245-255.
Sekey, A. and Hanson, B.A. (1984). "Improved 1-Bark bandwidth auditory filter," J. Acoust. Soc. Am. 75, 151-168.