Bloothooft, G., and Plomp, R. (1988). "The timbre of sung vowels," J. Acoust. Soc. Am. 84, 847-860.

The timbre of sung vowels


Gerrit Bloothooft a) and Reinier Plomp b)


Department of Otolaryngology,
Free University Hospital,
1007 MC Amsterdam,
The Netherlands

a) Present address: Research Institute for Language and Speech, University of Utrecht, Utrecht, The Netherlands

b) Also: Institute for Perception TNO, Soesterberg, The Netherlands



The perception of timbre differences in a vowel sung by 8 male and 7 female singers has been investigated by means of two types of listening experiments: (1) using the paradigm of the comparison of similarity, and (2) using judgments on 21 semantic bipolar scales. Using INDSCAL analysis for the similarity-comparison data and MDPREF analysis for the semantic-scale judgments, vowel configurations in a multidimensional perceptual space were derived, as well as a space which showed the weighting of perceptual dimensions by individual listeners (INDSCAL). The interpretation of semantic scales was represented by directions in the perceptual space (MDPREF). The perceptual vowel configurations, whether based on timbre similarities or on semantic-scale judgments, were comparable. Broadly, semantic scales clustered into the categories vocal technique, general evaluation, vibrato, clarity, and sharpness. These five clusters were not independent and could be described in two dimensions. Timbre differences could be predicted on the basis of differences in the 1/3-octave spectra of the vowels. It turned out that only sharpness had a constant interpretation across the various stimulus sets and was roughly related to the slope of the spectrum. One experiment, using a song phrase, extended the results to a more general domain.



In phonetic research, the perception of singing has been given relatively little attention. In an excellent review, Sundberg (1982) commented that experimental data in this field mainly focus on vowel intelligibility as a function of fundamental frequency, recognition of vocal registers, perceptual determinants of voice classification, and the effect of vibrato on perceived pitch. In many of these perceptual studies, listeners have been shown to use timbre in distinguishing between vowels, registers, or singer voice types. Timbre is used, for instance, as the perceptual quality which allows one to judge whether a phonation has been sung in the falsetto or modal register. However, the perception of timbre itself has hardly been investigated explicitly in singing. Yet the importance of timbre cannot be overestimated: most of the vocabulary in voice pedagogy for the description of voice quality is related to timbre; it is involved when a voice is called light, dark, pressed, warm, mellow, and so on; the list of terms seems unlimited. A number of attributes of voice quality also refer to the associated singing technique (e.g. covered, open, throaty, pressed, free), as was investigated by van den Berg and Vennard (1959). When we ask for an explicit, objective definition of these many terms, only fragmentary data are available. For fruitful discussions on the singing voice, however, this knowledge seems to be an essential premise.

Timbre, tone-color, or "Klangfarbe", has been defined by the American Standards Association (1960) as that attribute of auditory sensation in terms of which listeners can judge that sounds having the same pitch and loudness are dissimilar. In an effort to present acoustic variables underlying timbre, Schouten (1968) mentioned five factors: (1) tonal or noise-like character, (2) the envelope of the frequency spectrum, (3) the temporal envelope, (4) change in spectral or temporal envelope, and (5) the onset. Plomp (1970) left dynamic aspects out of consideration and investigated the timbre of steady-state sounds. For these sounds, timbre is determined by the frequency spectrum only. He showed that the timbre of sounds can be represented as points in a multidimensional perceptual space in which distance corresponds to dissimilarity in timbre: the larger the distance in this space, the more the sounds are perceptually dissimilar in timbre. Since such a representation of timbre is based on perceived dissimilarity, it is a psychoacoustical perceptual representation, which has its origin in properties at a peripheral auditory level of perception. In a large part of the present study we limited ourselves to stationary sung vowels, so we could use the spatial representation of timbre.

The perceptual dissimilarities in timbre of stationary sounds are correlated with the differences in their spectra. This has been shown by Pols et al. (1969) for spoken vowels and by Plomp (1970, 1976) for sounds of musical instruments. The spectral representation was based on sound levels (in dB) in 1/3-oct bandpass filters, which approximate the critical bandwidths in audition. The close correspondence between the (subjective) perceptual space and the (objective) spectral space makes it possible to study timbre dissimilarities in terms of their underlying spectral differences.

Within the framework of a spatial representation of timbre we follow in this paper two main lines in studying timbre in singing:

(1) Is the correspondence between the perceptual and the spectral spaces for the timbre differences between vowels also valid for small timbre differences, such as timbre differences in the same vowel sung by different singers?

(2) How can we represent relations between descriptive terms of timbre in the perceptual and spectral timbre space?

These two main lines have been investigated in two experiments, which also provided the opportunity to study the following issues:

(a) How consistent is a listener in the use of timbre terms and what is the agreement among listeners about the meaning of these terms?

(b) What is the influence of musical training on the use of timbre terms?

(c) How do the many descriptive terms of timbre cluster and what is their spectral interpretation?

(d) How does timbre perception of steady-state vowels compare with timbre perception of a song phrase?


I. Material and spectral analysis

In a concert hall, recordings were made of vowels sung by nine female and eight male singers. All but one (professional bass-baritone singer 1) were advanced students of the Sweelinck Conservatory in Amsterdam, aged between 19 and 26, with 3 to 7 years of vocal training. The microphone distance was 0.3 m, so that the direct sound predominated. According to their own voice classification, the group consisted of: 2 bass-baritone, 2 baritone, 4 tenor, 2 alto, 3 mezzo-soprano, and 4 soprano singers (Table I). The vowels /a/, /i/, and /u/ were sung at a comfortable level at fundamental frequencies (Fo) of 131 (C3), 220 (A3), and 392 Hz (G4). A tone at these Fo values was repeatedly presented during the recording sessions to cue the singer. Some male singers performed at Fo = 392 Hz in modal register and falsetto register as well.

Table I. Classification of the singers

male singers            female singers
1  bass-baritone        9   alto
2  bass-baritone        10  alto
3  baritone             11  mezzo-soprano
4  baritone             12  mezzo-soprano
5  tenor                13  mezzo-soprano
6  tenor                14  soprano
7  tenor                15  soprano
8  tenor                16  soprano
                        17  soprano

Since we wanted to investigate spectral attributes of timbre, temporal variations such as vibrato were removed from the vowel sounds. To accomplish this, each vowel sound was digitized (10 kHz sampling frequency); subsequently a single period with a fixed number of samples was segmented from the central part of the vowel, and this period was repeated to obtain a stimulus duration of 400 ms. Care was taken that the beginning and end of the segmented period did not show a discontinuity. To avoid clicks at the onset and offset of each stimulus, the sound level increased and decreased smoothly during the first and last 40 ms, respectively.
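The stimulus construction just described (one pitch period at a 10-kHz sampling rate, repeated to a 400-ms duration, with smooth 40-ms onset and offset) can be sketched as follows. The raised-cosine ramp shape and the function name are assumptions; the paper only specifies that the level changed "smoothly".

```python
import numpy as np

def build_stimulus(period, duration_ms=400.0, ramp_ms=40.0, fs=10000):
    """Repeat a single extracted pitch period to a fixed duration and
    apply onset/offset ramps.  The durations are the paper's values;
    the raised-cosine ramp shape is an assumption."""
    n_total = int(round(duration_ms / 1000.0 * fs))
    reps = int(np.ceil(n_total / len(period)))
    # Concatenate copies of the period and truncate to the target length.
    x = np.tile(np.asarray(period, dtype=float), reps)[:n_total]
    n_ramp = int(round(ramp_ms / 1000.0 * fs))
    # Raised-cosine ramp rising from 0 to (nearly) 1 over n_ramp samples.
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    x[:n_ramp] *= ramp
    x[-n_ramp:] *= ramp[::-1]
    return x
```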

Ten different subsets of eight or nine vowel sounds were made by combining phonations of the same vowel with the same Fo sung by different singers. The subsets were organized according to vowel type (/a/, /i/, and /u/), fundamental frequency (131, 220, and 392 Hz), and sex of the singers (see Table II). As the table indicates, the vowel /a/ was studied for all Fo values, the vowels /i/ and /u/ for the midrange values of 220 Hz (males) and 392 Hz (females) only. Subset V combined a selection of vowels /a/ by both male and female singers (Fo = 220 Hz). Subset X combined /a/ phonations sung in the modal register and the falsetto register by male singers (Fo = 392 Hz). Because of the limitations imposed by the listening experiments, the maximum number of stimuli in each subset was nine. The loudness of the vowels in each subset was equalized by means of a subjective matching procedure.

Table II. Vowel, fundamental frequency and participating singers (labeled according to Table I) for the eleven subsets. f indicates falsetto register, t indicates a tenor-like register produced by baritone singer 3. M are male and F are female singers.

Vowel    a    u    i    a    a    a    i    u    a    a    phrase
Fo (Hz)  131  220  220  220  220  220  392  392  392  392  -
Sex      M    M    M    M    M+F  F    F    F    F    M    M
Singers  1    1    1    1    1    9    9    9    9    1    1
         2    2    2    2    5    10   10   10   10   3    2
         3    3    3    3    6    11   11   11   11   3t   3
         4    4    4    4    8    12   12   12   12   6    4
         5    5    5    5    9    13   13   13   13   8    5
         6    6    6    6    10   14   14   14   14   1f   6
         7    7    7    7    11   15   15   15   15   3f   7
         8    8    8    8    13   16   16   16   16   7f   8
                        15   17   17   17   17   8f

In order to determine whether the restriction to stationary vowels was justified in studying the relations between descriptive timbre terms, we added subset XI derived from short sung phrases. Recordings were made of a Dutch folk song sung by the eight male singers. From this folk song the phrase 'Halleluja' was extracted. Since the singers were free in their interpretation of this song, their recordings varied in time between 2.6 and 4.4 s. The loudness of the phrases was equalized by means of a subjective matching procedure.

The vowel stimuli were analyzed with a computer-controlled filter bank (Pols, 1977). Eleven 1/3-oct band-pass filters were used with center frequencies from 400 Hz up to 4 kHz, whereas the filters below 400 Hz were replaced by three 90-Hz wide filters with center frequencies of 122, 215, and 307 Hz, respectively, according to the concept of critical bandwidth in audition. The total number of filters used was 14, 11, and 8, for Fo = 131, 220, and 392 Hz, respectively, because filter bands which did not contain a partial were excluded.
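As an illustration of this type of analysis, the sketch below computes band levels by summing FFT power between band edges at fc * 2**(+-1/6). This is a crude software stand-in for the computer-controlled hardware filter bank of Pols (1977); the function name and the dB reference are illustrative.

```python
import numpy as np

def band_levels_db(signal, fs, centers):
    """Sound level (dB re an arbitrary reference) per band, obtained by
    summing FFT power over each band.  Edges at fc * 2**(+-1/6) give
    1/3-octave bands; narrower bands would use other edge ratios."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    levels = []
    for fc in centers:
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)
        p = spec[(freqs >= lo) & (freqs < hi)].sum()
        levels.append(10 * np.log10(p + 1e-12))  # avoid log(0) for empty bands
    return np.array(levels)
```

In practice one would exclude bands containing no partial, as the paper does.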


II. Experiment 1: Correspondence between the perceptual and spectral timbre spaces.

A. Procedure

1. Listeners

We used two categories of listeners. The first category consisted of seven non-musicians (who had never had any musical training), the second category of nine musicians (five singers and four teachers of singing). These two categories of listeners were chosen to investigate the influence of musical training on timbre perception. All listeners had normal audiograms. They were paid for their services.

2. Method

To map the perceptual representation of timbre differences we used the method of paired comparison of pairs, with the modification that both pairs had one stimulus in common. Unlike the triadic comparison technique used by Pols et al. (1969) and Plomp (1970), in which the listener can listen repeatedly to the three stimuli before deciding which pair is most similar and which most dissimilar, the present procedure presented each pair of pairs only once, after which the listener indicated which pair contained the more similar stimuli. The listener, seated in a sound-proof room, heard the stimuli monaurally at a comfortable level through earphones (Beyer DT-48). All possible pairs of pairs of vowels in a subset were presented in random order, while each stimulus of a subset was presented equally often in the first and second pairs. To eliminate order effects, half of the listeners heard the pairs in reversed order (e.g. AC-AB instead of AB-AC). Stimulus generation, timing, and response processing were handled by a PDP 11/10 computer.
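The comparison units of this method, all unordered pairs of stimulus pairs sharing exactly one stimulus, can be enumerated as follows (randomization and first/second-position balancing are left out of this sketch). For eight stimuli this yields 8 * C(7,2) = 168 trials.

```python
from itertools import combinations

def pairs_of_pairs(stimuli):
    """All unordered pairs of stimulus pairs that share exactly one
    stimulus, the trial units of the paired-comparison-of-pairs method."""
    out = []
    for common in stimuli:
        others = [s for s in stimuli if s != common]
        # Each unordered pair-of-pairs {(common, a), (common, b)} once.
        for a, b in combinations(others, 2):
            out.append(((common, a), (common, b)))
    return out
```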

3. INDSCAL analysis

The results of the paired-comparison experiment were collected for each listener in a dissimilarity matrix: Every time the listener judged a particular pair of vowel stimuli as more similar than another pair, the more similar pair scored one point. The total number of points which could be assigned to a pair could vary between zero and the total number of vowel stimuli minus one. As we were also interested in intersubject differences in the representation of timbre, especially between musicians and non-musicians, the dissimilarity matrices were analyzed by means of a quasi non-metric version of INDSCAL (Carroll and Chang, 1970). In this multi-dimensional scaling program, the subjects are assumed to weight differentially the several dimensions of a common perceptual space, the so-called object space. The individual weighting factors for each dimension are presented in a subject space. The dimensionality of the INDSCAL solution was chosen on the basis of the results of matching the configuration in the object space with the spectral representation of the stimuli, discussed in the next section. The correlation in each matched dimension had to be significant beyond the 0.01 level.
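The scoring just described can be sketched as follows. Taking a pair's dissimilarity as the number of comparisons it appeared in minus its similarity points is one plausible reading of the procedure, not necessarily the paper's exact normalization.

```python
def dissimilarity_matrix(n, judgments):
    """Build a dissimilarity matrix from pairs-of-pairs judgments.
    `judgments` is a list of (pair1, pair2, winner), where winner is the
    pair judged more similar and scores one point."""
    points, appearances = {}, {}
    for p1, p2, winner in judgments:
        for p in (p1, p2):
            key = frozenset(p)
            appearances[key] = appearances.get(key, 0) + 1
        points[frozenset(winner)] = points.get(frozenset(winner), 0) + 1
    d = [[0] * n for _ in range(n)]
    for key, seen in appearances.items():
        i, j = sorted(key)
        # More points won = judged more similar = smaller dissimilarity.
        d[i][j] = d[j][i] = seen - points.get(key, 0)
    return d
```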

4. Matching of the perceptual and spectral vowel configurations

For each subset of stimuli we had available both a perceptual and a spectral configuration of the vowels. The latter configuration is given by the multidimensional representation of the spectra with the sound level in each filter band as coordinates. To investigate the agreement between these configurations we matched them, using the procedure of rotation to maximal congruence (Schönemann and Carroll, 1970) between the spectral and the perceptual configurations. As a result, each perceptual dimension is optimally fitted with a direction in the spectral space. This direction is defined as a linear combination of the original spectral dimensions (filter band levels). As a measure of fit we computed (1) the correlation between vowel coordinates in each perceptual and matching spectral dimension and (2) the coefficient of alienation (Lingoes and Schönemann, 1974); this coefficient varies between 0 (perfect fit) and 1 (unrelated configurations) 1). Since the perceptual space derived from INDSCAL is normalized, we computed weighting factors for its dimensions which minimized the coefficient of alienation. Since this procedure involved very little additional rotation of the spectral space, the correlation coefficients per dimension remained practically unchanged.

B. Results and discussion

An example of a perceptual (object) space, the matched spectral space, and the listener (subject) space is shown in Fig. 1 for subset II (/u/, sung by eight male singers at Fo = 220 Hz). This subset was judged by nine singers and teachers of singing and seven non-musicians. The upper panels of Fig. 1 illustrate the very good agreement between the vowel configuration in the three-dimensional perceptual space (filled circles) and the matched configuration in the spectral space (open circles). As can be seen from the subject space (lower panels), the intersubject differences in the weighting of the perceptual dimensions are large, both for the musicians (open symbols) and the non-musicians (solid symbols). The listeners' average weighting factors for the three dimensions were .57, .44, and .41, respectively. An analysis of variance of the individual weighting factors did not show a significant difference between musicians and non-musicians. Consequently, separate analyses of the data from musicians and non-musicians resulted in similar perceptual spaces (not shown). This finding supports the view that in the comparison of timbre similarity, musical knowledge and experience do not play a part.

A summary of the results of matching the perceptual vowel configuration with the spectral vowel configuration for all vowel subsets is given in Table III. This table gives first the total spectral variance for each subset. This total variance is, accumulated over all bands, the sum of the variances in the filter bands, representing the spectral differences between the vowel stimuli of each subset. They vary between 103 and 332 dB2, due to differences between singers; this range agrees well with previous findings of Bloothooft and Plomp (1984) for 14 professional singers. The spectral differences between the modal register and the falsetto register introduced the largest spectral variance (subset X).

Table III. Results for the ten subsets of vowel stimuli of matching the configurations in the perceptual space with the spectral space. For each subset the total spectral variance is given as well as the percentage of this variance accounted for by the computed dimensions as far as the perceptual and spectral ones correlated significantly at the 0.01 level or between 0.05 and 0.01 (figures within brackets). The subsets were judged by seven listeners, except subset II which was judged by 15 listeners (see text and Fig. 2). The coefficient of alienation has been computed over all given dimensions D.

set           vowel  Fo (Hz)  total spectral   percentage of total spectral     coefficient
                              variance (dB2)   variance in matched dimensions   of alienation
                                               D1   D2    D3    D4    sum
I     8 M     /a/    131      211              34   21    19    17    91        0.39
II    8 M     /u/    220      255              45   21    19    -     85        0.47
III   8 M     /i/    220      239              37   22    16    14    89        0.44
V     9 M+F   /a/    220      200              30   26    26    7     89        0.48
VI    9 F     /a/    220      114              48   (21)  -     -     69        0.68
VII   9 F     /u/    392      248              38   36    (15)  0     89        0.52
VIII  9 F     /i/    392      180              28   24    19    18    89        0.42
IX    9 F     /a/    392      103              20   (34)  -     -     54        0.76
X     9 M     /a/    392      332              38   37    (11)  -     86        0.54

The next five columns give the percentage of total spectral variance explained by the common dimensions of the perceptual and spectral vowel configurations. The number of dimensions for which the correlations between vowel coordinates along the perceptual and matched spectral dimensions were significant beyond the 0.01 level varied between one and four. This number is probably related to a minimum amount of spectral variance needed to define a perceptual dimension. It was found that the least significant dimension explained on average 34 dB2 of spectral variance. If spectral variance is uniformly distributed over frequency bands, this value corresponds to a standard deviation of about 2 dB for each frequency band. If spectral variance is concentrated in a single frequency band, the variance of 34 dB2 would correspond to a standard deviation of 5.5 dB in that band. De Bruyn (1978) concluded from investigations on timbre dissimilarity of complex tones that two complex tones are distinguished well by listeners for a mean difference in sound levels of between 3 and 5 dB in each 1/3-oct band. The difference limen for individual harmonics in vowel sounds was estimated by Kakusho et al. (1971) to be less than 2 dB for most vowels. These thresholds roughly indicate that in our investigation the correspondence between the spectral representation and the psychoacoustic representation of timbre is valid up to the perceptual threshold of timbre differences. This limit determined the dimensionality of the perceptual vowel configuration for all Fo values investigated. For low Fo values this is in agreement with results obtained by Nord and Sventelius (1979) concerning just-noticeable differences in formant frequency, and by Klatt (1982) for a number of physical manipulations of a single vowel.

C. Conclusions

(1) The prediction of timbre differences on the basis of 1/3-oct spectra is valid up to the perceptual threshold of differences in timbre, and thus for all kinds of timbre differences in stationary sung vowels.

(2) This prediction is valid up to Fo values of at least 392 Hz.

(3)  In judgments of similarity of timbre, musical knowledge or experience does not play a role.


III. Experiment 2: Relations between descriptive terms of timbre

A. Procedure

1. Semantic bipolar scales

For the study of descriptive timbre terms we designed a listening experiment in which sounds were compared on a number of semantic bipolar scales. Each semantic scale consisted of two adjectives with opposite meaning, describing timbre characteristics such as light-dark and colorful-colorless. For the determination of the set of semantic scales to be used in our listening experiment we first collected 50 scales from related studies on timbre (Isshiki et al., 1969; Donovan, 1970; von Bismarck, 1974a; Fagel et al., 1983; Boves, 1984) and from the literature on singing (Vennard, 1967; Hussler and Rodd-Marling, 1976). These semantic scales were rated by seven experts (speech therapists, teachers of singing) on their suitability for describing the timbre of sung vowels. Of the 50 scales, 21 were generally judged to be suitable (Table IV). Of these, scales 1 to 14 and scale 21 were regarded as commonly known adjectives for the description of timbre and were used by all listeners. The scale vibrato-straight (21) was only used for judgments of the song phrase. The scales free-pressed (15), open-throaty (16), and open-covered (17) were considered to be evaluative of singing technique. The scales dramatic-lyrical (18), soprano-alto (19), and tenor-bass (20) were intended to investigate relations between timbre and voice classification. These six scales were judged only by the musicians.

2. Method

The interpretation of a semantic scale was investigated using the method of paired comparisons. The listener had to judge which of the two stimuli presented was closer to a given target adjective of a semantic scale, for example: which of two stimuli was darker (on the semantic scale light-dark). The chosen stimulus scored one point. After all possible pairs of stimuli had been judged, the total number of points obtained by each stimulus was used to rank the stimuli of a subset from light to dark. Semantic scales were handled one after another. For a set of nine stimuli, 720 judgments by a listener were needed to investigate 20 semantic scales. In Table IV the adjectives used as targets are underlined.

Experiments performed in this way are very time consuming. Hence we reduced the number of subsets to five subsets of stationary vowels (II, III, V, IX, X) and one subset of song phrases (XI). In this selection the vowel /a/ was investigated for Fo = 220 and 392 Hz for both male and female singers, while the vowels /i/ and /u/ were investigated for Fo = 220 Hz for male singers only. Three half-day sessions were planned for each listener to complete all measurements. Most of the musicians could not finish their task within the time available; therefore the number of listeners per subset varied. The stimuli were presented in a random order but equally frequently in first and last position of a pair. Half of the total number of listeners heard the stimuli of a pair in reversed order. The experiments were computer controlled. The same musicians and non-musicians who participated in Experiment 1 served as listeners. They were paid for their services.

B. Results

1. Reliability

To judge the reliability of the results of a listener in a paired comparison experiment on a single semantic scale, we computed the number of circular triads the listener made. A circular triad occurs when, for example, stimulus B is judged to be darker than A, and C is judged to be darker than B but not darker than A. In a subset with 8 stimuli a score of less than 9 out of a maximum of 20 possible circular triads was accepted as a consistent and reliable result (0.05 significance level); in a subset with 9 stimuli this criterion corresponds with less than 14 out of a maximum of 30 circular triads (e.g., Edwards, 1957). Three different explanations of circular triads can be given: (1) the stimuli are almost equal, (2) the semantic scale is not appropriate, or (3) the listener response is not reliable.
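Counting circular triads from a listener's preference scores can be done with Kendall's identity c = C(n,3) - sum_i C(a_i, 2), where a_i is the number of comparisons stimulus i won; the maxima quoted above (20 for n = 8, 30 for n = 9) follow from the standard formulas. A sketch:

```python
from math import comb

def circular_triads(wins):
    """Number of circular triads given each stimulus's win count,
    via Kendall's identity c = C(n,3) - sum_i C(a_i, 2)."""
    n = len(wins)
    return comb(n, 3) - sum(comb(a, 2) for a in wins)

def max_circular_triads(n):
    """Maximum possible number of circular triads for n stimuli."""
    return n * (n * n - 4) // 24 if n % 2 == 0 else n * (n * n - 1) // 24
```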

In the case of a subset with many almost equal stimuli, a high number of circular triads is to be expected for all semantic scales. In Table V we give, for the six subsets, the percentage of accepted semantic-scale results for musicians and non-musicians. Musicians are, on average, more reliable (85 % vs. 71 %). The number of accepted results for each subset shows, especially for the non-musicians, a clear relationship with the total spectral variance of the subset (last column). The song phrase, subset XI, was judged most consistently by all listeners. This demonstrates that voice characteristics are much more distinct in a song phrase than in stationary vowels.

Table V. Percentage of judgments with a sufficiently low number of circular triads on all semantic scales for each subset. The total spectral variance per subset is also given.

set           vowel    Fo (Hz)  musicians  non-musicians  total spectral variance (dB2)
II    8 M     /u/      220      90         81             255
III   8 M     /i/      220      79         64             239
V     9 M+F   /a/      220      90         94             200
IX    9 F     /a/      392      64         54             114
X     9 M     /a/      392      87         83             332
XI    8 M     phrase   -        94         92             -
average                         85         71

Since an inappropriate semantic scale will result in inconsistent results for all listeners, we computed for each semantic scale the percentage of accepted judgments, according to the criterion given above, over the five vowel subsets and for the song phrase (Table IV). For the musicians the semantic scales with less than 75 % accepted judgments were full-thin, rough-smooth, strong-weak, open-throaty, and dramatic-lyrical. This can be explained for the scales rough-smooth and dramatic-lyrical by the fact that these scales refer to temporal characteristics which are not present in the vowel subsets; for strong-weak it suggests that this scale is interpreted as loud-soft, which was eliminated by loudness matching.

Table IV. Bipolar semantic scales used in Experiment 2, ordered according to the consistency of listeners' judgments. Target adjectives are underlined. The columns give the percentage of judgments with a sufficiently low number of circular triads on a semantic scale for musicians and non-musicians, both as the averaged number over the five subsets of stationary vowels, and for the subset of song phrases.

      semantic scale             musicians        non-musicians
                                 vowels  phrase   vowels  phrase
 2    light-dark                   89     100       74      82
19    tenor-bass                   94     100       -       -
 7    high-low                     89     100       63     100
17    open-covered                 89      90       -       -
 3    sharp-dull                   87     100       71     100
13    metallic-velvety             87     100       69      82
11    cold-warm                    85     100       66     100
15    free-pressed                 85      90       -       -
 9    angular-round                85      90       63      65
 4    clear-dull                   82     100       82     100
 6    shrill-deep                  82      90       77      82
12    colorful-colorless           78     100       51      65
14    melodious-unmelodious        78     100       51      82
 1    light-heavy                  76      90       71      82
18    dramatic-lyrical             70      80       -       -
 5    full-thin                    69      80       60      64
10    strong-weak                  65     100       55      82
16    open-throaty                 59      90       -       -
20    soprano-alto                 55      -        -       -
 8    rough-smooth                 52     100       37      65
21    vibrato-straight             -       90       -       50


2. Relations between semantic scales

The 21 semantic scales may be expected to be highly interdependent. As a tool to reveal these relations, while preserving intersubject differences among listeners, we again used the multidimensional scaling technique of INDSCAL. For a given subset we had for each listener stimulus scores on all semantic scales; these stimulus scores were rank-correlated for each pair of semantic scales. In this way we obtained a correlation matrix of semantic scales for each listener. These correlation matrices were analyzed by INDSCAL.
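The rank-correlation step can be sketched as follows: each listener's stimulus scores on two scales are converted to ranks (mid-ranks for ties) and correlated, giving a scale-by-scale matrix. This assumes a plain Spearman coefficient; the paper does not specify the exact rank-correlation variant used.

```python
def rank(xs):
    """1-based average ranks; tied values receive their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        r = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

def scale_correlation_matrix(scores):
    """scores[s] = stimulus scores on semantic scale s; returns the
    scale-by-scale correlation matrix that would be fed to INDSCAL."""
    k = len(scores)
    return [[spearman(scores[a], scores[b]) for b in range(k)]
            for a in range(k)]
```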

The semantic scales are represented by positions in an object space in which distance is related to correlation: (1) the closer the positions of two scales, the more their correlation coefficient approaches the value of 1, and the more the scales are synonymous; (2) the more distant the positions of two scales, the more their correlation coefficient approaches the value of -1, and the more the scales are inverted versions of each other; (3) in between, with one scale positioned in the origin of the space or in another dimension, the correlation is zero and the scales are unrelated. This shows that the point configuration representing semantic scales depends on the arbitrarily chosen polarity of the semantic scales. To eliminate this problem, each correlation matrix was extended by including all (redundant) correlations between semantic scales with reversed polarities. After this extension, the two versions of a semantic scale with opposite polarities, whose correlation is -1 by definition, are positioned radially and symmetrically relative to the origin in the object space. The complete solution is therefore radially symmetric relative to the origin. In the final presentation of the configuration in the object space, however, only that polarity of a semantic scale will be given which is the more easily interpretable.
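Extending a k x k scale correlation matrix with the reversed-polarity versions of the scales amounts to the block structure [[R, -R], [-R, R]], since reversing one scale flips the sign of its correlations and a scale correlates -1 with its own reverse. A minimal sketch:

```python
def extend_with_reversed(R):
    """Extend a k x k correlation matrix of semantic scales to 2k x 2k
    by appending each scale's reversed-polarity version: blocks
    [[R, -R], [-R, R]]."""
    k = len(R)
    ext = [[0.0] * (2 * k) for _ in range(2 * k)]
    for i in range(k):
        for j in range(k):
            ext[i][j] = R[i][j]              # original vs original
            ext[i + k][j + k] = R[i][j]      # reversed vs reversed
            ext[i][j + k] = -R[i][j]         # original vs reversed
            ext[i + k][j] = -R[i][j]
    return ext
```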

As a final step in summarizing the results, we combined the correlation matrices of all non-musicians for the five vowel subsets in one INDSCAL analysis. This was also done for the musicians (who judged an augmented set of semantic scales). This combination of matrices from different vowel subsets seems justified, since the subject space revealed that, both for musicians and non-musicians, the relations between the semantic scales were independent of the subset. This implied that the relations between semantic scales have a general validity for a listener, irrespective of vowel and Fo. For the sake of clarity in presentation, we averaged the position of each listener in the subject space over the vowel subsets. Results for song phrases were analyzed separately. The results of a two-dimensional INDSCAL analysis for all four cases are presented in Fig. 2. We limited ourselves to a two-dimensional solution of INDSCAL, since higher dimensions mostly presented only unique relations between semantic scales found for individual listeners. This effect can, for instance, be seen in the second dimension of the subject space for non-musician judgments of stationary vowels, which is almost exclusive to listener 5. The two-dimensional INDSCAL analysis included only a limited part of the total variance in the correlations between semantic scales, varying between 49.1 % (musicians, vowels) and 75.2 % (non-musicians, song phrase). This relatively low percentage explains the deviations of subject weightings from the unit circle. Examination of the subject space shows, especially for non-musicians, a tendency towards a one-dimensional interpretation of the object space, which was either directed towards dimension I or towards dimension II.

Before interpreting the object space, we should mention some general properties of that presentation: (1) semantic scales which are close to each other are highly correlated and are used more or less synonymously; for easier interpretation, clusters of semantic scales are circled; (2) the distance of a scale to the origin is a measure of its discriminative power (at least in this plane): the closer a semantic scale is located to the origin, the more synonymous it is with its reversed version and the smaller its discriminative power is.

Let us first consider the configuration of semantic scales in the object space for musicians. The clusters of semantic scales are from left to right:

(1) Singing technique: free-pressed (15) and open-throaty (16). For the song phrase the reversed scales 1, 9, and 11: heavy-light, round-angular, and warm-cold can also be considered to belong to this group. Round and warm are probably used as more impressionistic alternatives to the description of a good singing technique. For the stationary vowels, the scales free-pressed and open-throaty cluster with the scales for a general evaluation.

(2) General evaluation: melodious-unmelodious (14), colorful-colorless (12), and full-thin (5). For the song phrase these scales are positively related to scales on singing technique and temporal aspects; for the vowels they overlap the scales on singing technique.

(3) Temporal aspects: vibrato-straight (21) and rough-smooth (8). Rough-smooth is unreliable for stationary vowels (this conclusion was also drawn by von Bismarck, 1974a). Roughness is probably mainly related to irregularities in periodicity, not present in our vowel stimuli since they consist of repetitions of a single pitch period. For the song phrase both scales are used independently from scales on singing technique.

(4) Clarity: clear-dull (4), high-low (7), strong-weak (10), and open-covered (17). It is noteworthy that the technical scale open-covered (17) is to a great extent unrelated to the other scales on singing technique, namely free-pressed (15) and open-throaty (16).

(5) Sharpness: sharp-dull (3), light-dark (2), shrill-deep (6), and metallic-velvety (13). For the song phrase the scale tenor-bass (19) is also included in this group. The scale soprano-alto was not used in the present analysis, since it was not applied to all vowel subsets. Separate analysis of subsets VI and VIII showed that the scale soprano-alto led to the same judgments as sharp-dull. For stationary vowels the scales light-heavy (1), angular-round (9), and cold-warm (11) can be included in sharpness, too. The scale dramatic-lyrical (18) is judged ambiguously: for the song phrase it is related to temporal aspects, clarity, and sharpness; for the stationary vowels it presents a general evaluation.

According to the characteristics of the INDSCAL analysis, the two dimensions of the object space may have a fundamental psychological meaning. In this view, we suggest that dimension I can be interpreted as a 'pleasantness' factor (melodious-unmelodious, free-pressed, open-throaty, cold-warm, angular-round), while dimension II can be interpreted as a 'potency' factor (clear-dull, vibrato-straight, strong-weak). When both dimensions are equally weighted, as was the case for most musicians, the clusters of semantic scales mentioned above can be described by some combination of these two factors. For some listeners, however, there was only one dominating dimension. When dimension I dominates, the scales which evaluate temporal aspects do not discriminate, while all other scales are used in the same way, with sharpness and clarity negatively related to general evaluation and singing technique. When dimension II dominates, the scales on singing technique do not discriminate, while all other scales are used in the same way, with sharpness, clarity, and temporal aspects positively related to general evaluation.

For non-musicians, the results for the song phrase and the vowels are very similar and show a further simplification relative to the results of the musicians. The interpretation of the configuration is easy since non-musicians on the whole use, as the subject space indicates, either dimension I or dimension II. This implies that, apart from some unreliable scales, most semantic scales are used synonymously, according to sharpness. In view of the position of scales 12 and 14, the subjects only differed in their opinion as to whether sharp had to be associated with unmelodious (when dimension I was used) or melodious (when dimension II was used).

In summary, semantic scales are used very one-dimensionally by most non-musicians and also by some musicians. Most semantic scales are used according to sharpness, and the only difference between listeners is whether sharp is positively or negatively associated with a general evaluation of the sound and with singing technique. Most musicians, however, differentiate more clearly between semantic scales, especially for the song phrase, for which clusters of semantic scales related to singing technique, general evaluation, temporal aspects, clarity, and sharpness can be distinguished. These groups of scales are not independent, however, but can be represented in two dimensions with the psychological interpretations of 'pleasantness' and 'potency'. For vowels, the results depended neither on the type of vowel nor on fundamental frequency.

3. Relation between semantic scales and perceptual space

In order to learn how the semantic scales are related to the perception of the stimuli of a subset, we used the method of multidimensional analysis of preference MDPREF (Carroll, 1972). From the listening experiments we obtained for each listener, for each semantic scale, stimulus scores which ranked the stimuli from one adjective of the semantic scale (low score) to the other adjective (high score), for instance from the most light to the most dark stimulus. For a given subset, the stimulus scores on all semantic scales, for all listeners, served as an input for the MDPREF analysis. In MDPREF the ordering of stimuli along a semantic scale is represented as an ordering of stimuli along a straight line in a perceptual space. The MDPREF algorithm computes in a multidimensional perceptual space both the stimuli, which are represented as points, and the semantic scales, which are represented as straight lines through the origin of that space.

Mathematically, a straight line is described by a vector. The direction of a vector is computed in such a way that the projection of the stimulus points on the vector corresponds as closely as possible (least-squares criterion) to the stimulus scores on the semantic scale concerned. The first dimension of the perceptual space explains most of the variance in the listeners' judgments, the second most of the remaining variance, etc. The stimulus configuration is normalized so that the variance is equal in all dimensions. The semantic-scale vectors are given unit length and, in graphical presentations of the results, are represented by their end points. An example of such a representation is given in Fig. 3.
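The least-squares fitting of a semantic-scale vector can be sketched as follows. The stimulus coordinates and scale scores are invented, and this illustrates only the fitting criterion, not the MDPREF algorithm itself.

```python
import numpy as np

def fit_scale_vector(X, y):
    """Fit a unit-length direction v such that the projections X @ v
    approximate the scale scores y in the least-squares sense.

    X : (n_stimuli, n_dims) stimulus points in the perceptual space
    y : (n_stimuli,) scores of one listener on one semantic scale
    """
    y = y - y.mean()                        # scores are defined up to an offset
    v, *_ = np.linalg.lstsq(X, y, rcond=None)
    return v / np.linalg.norm(v)            # unit length, plotted as an end point

# Hypothetical data: six stimuli in a 2-D space, scored on a scale
# that follows dimension I exactly.
X = np.array([[-2.0, 0.0], [-1.0, 1.0], [0.0, -1.0],
              [1.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 3.0])
print(fit_scale_vector(X, y))   # [1. 0.]: the vector points along dimension I
```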

Our application of MDPREF is unconventional because normally only one aspect (preference) of a stimulus set is investigated. We employed MDPREF on data from a large number of different semantic scales. This is allowed if we may assume that the stimulus configuration in the perceptual space is invariant for the different semantic scales. In Sec.IIIB4 we will show that the stimulus configuration derived by MDPREF and the one derived in Experiment 1 with the similarity-comparison paradigm, are highly congruent. This supports the view that the perceptual representation of timbre is an invariant stimulus representation, which is the basis for the further labeling of directions in that space by semantic-scale vectors.

Because we showed in Experiment 1 that the stimulus configuration in the perceptual space did not depend on musical experience, we combined in the present experiment, for each subset, the judgments on all semantic scales for both musicians and non-musicians. The interpretation of semantic scales, however, may vary from scale to scale and from listener to listener, and is represented by the semantic-scale vector configuration. Whenever a listener's judgments on a semantic scale included too many circular triads, those data were excluded from the MDPREF analysis.
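A circular triad occurs when a listener ranks stimulus A above B and B above C, yet C above A; Kendall's classic count over a complete set of paired comparisons provides such a consistency check. A minimal sketch with invented data (the paper's actual rejection threshold is not specified here):

```python
import numpy as np

def circular_triads(wins):
    """Kendall's count of circular triads in a complete paired-comparison
    matrix: wins[i, j] = 1 if stimulus i was ranked above stimulus j.

    c = n(n^2 - 1)/24 - (1/2) * sum_i (a_i - (n - 1)/2)^2,
    where a_i is the number of times stimulus i 'won'.
    """
    n = wins.shape[0]
    a = wins.sum(axis=1)
    return n * (n**2 - 1) / 24 - 0.5 * ((a - (n - 1) / 2) ** 2).sum()

# A perfectly transitive ranking of four stimuli has no circular triads,
transitive = np.triu(np.ones((4, 4)), k=1)
# whereas A>B, B>C, C>A forms one circular triad.
cyclic = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
print(circular_triads(transitive), circular_triads(cyclic))   # 0.0 1.0
```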

The results of the MDPREF analyses did not allow a generalized interpretation of the semantic scales, owing to large intersubject differences. This will be demonstrated with the examples in Figs. 4 and 5. In Fig. 4a-d, typical judgments on the semantic scales are presented for the subset of song phrases. All panels present the same stimulus configuration but show different subsets of semantic-scale vectors. The first two dimensions of the perceptual space explain 71 % of the total variance in semantic-scale judgments. In Fig. 4a and Fig. 4b all accepted semantic scales for musician 7 and non-musician 2 are presented. The spread of the positions of the semantic-scale vectors shows that the musician is able to distinguish between several characteristics of the song phrases. The clustering of the semantic-scale vectors for the non-musician demonstrates that most semantic scales, except vibrato-straight (21), are used synonymously, implying that this subject is unable to describe more than one perceptual dimension. For both listeners the relations between the semantic scales are in good agreement with the general results presented in Fig. 2.

In Fig. 4c and Fig. 4d, for all musicians (open circles) and non-musicians (filled circles) the directions of the semantic scales clear-dull and colorful-colorless are shown, respectively. The scale clear-dull shows corresponding judgments along the first dimension for all listeners. This is not the case for the scale colorful-colorless, which most musicians judged to be close to the second dimension of the perceptual space, while non-musicians judged again along the first dimension. It can be seen that along dimension I some listeners even have opposite opinions on this scale. The musician and the non-musician for whom the semantic-scale vector is positioned near the origin use this scale in another perceptual dimension.

Fig. 5 shows the two-dimensional MDPREF solution for subset III (/i/ sung by 8 male singers at Fo = 220 Hz). The positions of all accepted semantic scales are given for musicians 2 (open squares) and 4 (open circles) and non-musicians 2 (filled squares) and 4 (filled circles). In this example of stationary vowels, the clustering of most semantic scales per listener illustrates that the scales are used synonymously but in different ways by the individual listeners. This effect was most clearly present in the subsets with stationary vowels.

The application of semantic scales can only be demonstrated in examples such as those given in Figs. 4 and 5. Due to strongly listener- and vowel-dependent behavior, the results are difficult to generalize. Especially when a listener used all semantic scales synonymously, it is likely that only one particular perceptual attribute was dominant in the vowel subset, and that all semantic scales were judged according to this attribute. The attribute concerned differed among listeners.

4. Comparison with the vowel configurations derived from Experiment 1

The MDPREF analyses resulted in a perceptual vowel configuration for each vowel subset. Since for these subsets vowel configurations were also available from INDSCAL analysis of timbre dissimilarities in Experiment 1, we investigated whether these two configurations, derived with quite different techniques, were comparable. For this purpose, we rotated both normalized vowel configurations to maximal congruence. For all subsets the correlation between coordinate values of vowels on matched dimensions was significant beyond the 0.01 level and the coefficient of alienation varied between 0.36 and 0.54. Fig. 6 shows an example of the matching of the vowel configurations for subset IV, for which the matching with the spectral vowel configuration was already shown in Fig. 1.
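Rotating one normalized configuration to maximal congruence with another is an orthogonal Procrustes problem with a closed-form solution via the singular value decomposition. The sketch below uses invented configurations and is not necessarily the exact matching procedure used in the paper.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Return the orthogonal matrix R (rotation/reflection) minimizing
    ||B @ R.T - A|| in the least-squares sense, via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Hypothetical 2-D vowel configurations: B is A rotated by 30 degrees,
# so the rotation should be recovered exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 2))           # e.g., 8 singers, 2 dimensions
th = np.pi / 6
B = A @ np.array([[np.cos(th), np.sin(th)],
                  [-np.sin(th), np.cos(th)]])
R = procrustes_rotation(A, B)
print(np.allclose(B @ R.T, A))            # True: configurations made congruent
```

After rotation, the correlation between coordinate values on matched dimensions (as reported in the paper) can be computed directly from the columns of A and B @ R.T.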

Although the fit between the two configurations was generally good for all the subsets, there was no one-to-one correspondence between the original dimensions of the two spaces. The dimension related to sharpness, for instance, did not come out immediately in the INDSCAL analysis of the similarity-comparison data. This indicates that the dimensions originally derived by means of INDSCAL did not have a unique psychological meaning. Apart from the difference in the orientation of dimensions, we may conclude that both the comparison of timbre similarity of vowels (Experiment I) and the ranking of vowels along semantic scales (Experiment II) resulted in the same configuration of vowels in the perceptual space.

5. Spectral correlates of perceptual dimensions of vowel subsets

The vowel-point configuration derived by means of MDPREF can also be related to the 1/3-oct spectra representation of the vowels. This was done in the same way as in Experiment 1, using orthogonal rotation to congruence (see Sec.IIA4). The results of the matching procedure are given in Table VI. The dimensionality of each subset and the total amount of spectral variance explained are comparable to the results of Experiment 1 (Table III). This was to be expected, since the vowel configurations derived from the two types of listening experiments were highly congruent.

Table VI. Results of matching vowel configurations in the perceptual space, derived by MDPREF from semantic scale judgments, and in the spectral space. For both the variance in semantic scale judgments and spectral variance, the distribution over matched dimensions is given. The correlation between perceptual and spectrum coordinate values along the presented dimensions is for all dimensions beyond the 0.01 level of significance except for the figures within brackets, which were significant beyond the 0.05 level. Ls represents the number of listeners for each subset. The coefficient of alienation has been computed over all given dimensions D.

set  vowels        Fo (Hz)  Ls   % variance in semantic-scale judgments   total spectral   % spectral variance          coefficient of
                                 D1    D2    D3    D4    sum              variance (dB²)   D1    D2    D3    D4   sum   alienation
II   8 M   /u/     220      16   49    17    12    10     88              255              28    23    22    18    91   0.51
III  8 M   /i/     220      14   54    20     8     4     86              239              16    27    13    36    92   0.46
V    9 M+F /a/     220      15   57    14     9     -     80              200              32    31    22     -    85   0.53
IX   9 F   /a/     392      21   33   (23)    -     -     56              114              18   (14)    -     -    32   0.88
X    9 M   /a/     392      16   58    15    11     -     84              332              39    28    21     -    88   0.48
XI   8 M   phrase  -        16   57    14    10     5     86              -                -     -     -     -     -    -

The good fit between vowel configurations in the perceptual space and the spectral space allows us to relate corresponding directions in the two spaces. This means that we can assign a spectral vector to a perceptual semantic-scale vector. This spectral vector is a linear combination of the original spectral dimensions (related to the sound level in frequency bands). The contribution of each original dimension (frequency band) to the spectral vector is expressed as the direction cosine of the angle between the spectral vector and the original dimension. The value of this direction cosine can vary between 1 (identical), 0 (unrelated), and -1 (identical, but in opposite direction). The presentation of the values of the direction cosines as a function of the center frequency of the frequency bands is called the profile of the spectral vector (see also Bloothooft and Plomp, 1985). When a spectral vector is matched with a perceptual semantic-scale vector, the profile can be considered to represent the spectral variation which underlies perceptual judgments on a semantic scale.
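Since the spectral vector has unit length, its direction cosines with the original band dimensions are simply its components, and plotting them against band center frequency yields the profile. A toy sketch (the band centers and the vector itself are invented; a real profile would come from the matching procedure):

```python
import numpy as np

# Hypothetical unit spectral vector over five 1/3-octave bands with a
# spectral-slope-like shape: low bands weighted down, high bands up.
v = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
v = v / np.linalg.norm(v)                 # unit length

# The direction cosine with original band axis e_i is v . e_i = v[i],
# ranging from -1 (opposite) through 0 (unrelated) to 1 (identical).
centers_khz = [0.5, 1.0, 2.0, 4.0, 8.0]
for f, cos in zip(centers_khz, v):
    print(f"{f:4.1f} kHz   cos = {cos:+.2f}")
```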

Spectral vectors can be derived for all individual semantic-scale judgments on stationary vowels. It would be of interest to search for spectral descriptions with a general validity for semantic scales. Unfortunately, the large intersubject differences in the interpretations of semantic scales, demonstrated previously, do not allow this. However, we can give the corresponding spectral interpretations of the principal dimensions of the perceptual space, derived by MDPREF. These dimensions are determined on the basis of the explained variance in the semantic-scale judgments; the first dimension explains most of this variance, the second dimension most of the remaining variance etc. In Fig. 7 the main results are presented. This figure shows for the five subsets of stationary vowels: (1) the average spectrum, (2) the profiles of the spectral vectors associated with the first two matched perceptual dimensions; because these vectors define an orthogonal basis of the matched spectrum space they are called basis vectors, and (3) the vowel configuration in the plane of the first two dimensions of the perceptual space. Table VI indicates that for most subsets more than half of the total variance in perceptual judgments is covered by the first dimension of the perceptual space. Figure 7 (panels of the second column) demonstrates that this dimension typically has spectral-slope weighting properties. Spectral slope is independent of the vowel type of a subset and has a general interpretation. This corresponds well with results of a factorial analysis of verbal attributes of timbre by von Bismarck (1974a), who found, for complex stationary harmonic tones, only one prominent attribute: sharpness. Von Bismarck (1974b) and Benedini (1980) demonstrated that sharpness was related to the relative importance of higher harmonics. 
It is remarkable, however, that spectral slope also plays the most important role when the corresponding amount of spectral variance is relatively low, as is the case for subset III (see Table VI). For some other subsets, variation in spectral slope coincides with typical properties of the vowel subset: in subset II the singers 3 and 8 colored the vowel /u/ towards /o/; the spectral effect of this phonemic difference (all formant frequencies of /o/ are higher than those of /u/) is represented along dimension I; for subset X the configuration in the perceptual space (Fig. 7, last column) shows that dimension I contributes highly to the differentiation between falsetto and modal registers (except singer 6, a very "dull" tenor). In subset V, dimension I of the perceptual space shows that the differentiation between soprano and alto singers has spectral-slope like properties (see also Bloothooft and Plomp, 1986a). In Sec.IIIB2 it was said that the semantic scale soprano-alto is used in the same way as sharp-dull. This correspondence between soprano and sharp implies that strong higher harmonics in the vowel sounds used in this experiment are associated with soprano singing, which is completely contrary to the actual spectral differences between soprano and alto singers.

The profiles of the second basis vector (third column of Fig. 7) show that the related perceptual dimensions have no general acoustical interpretation. The properties of these dimensions are probably related to the effects of vowel articulation. For subsets II and III, the vowels /u/ and /i/ respectively, the second dimension weights the depth of the spectral valley between lower and higher formants. This dimension is, for the vowel /i/ (subset III), strongly related to phonemic differences: most vowels /i/ were colored towards /y/, except for the singers 2, 5, and 6. In subset V, with combined male and female phonations of /a/, the second dimension weights the frequency positions of the peaks of higher formants. This property roughly discriminates between tenor and bass singers, as can be seen in the configuration in the perceptual space (Fig. 7, last column; see also Bloothooft and Plomp, 1985). For subsets IX and X, the vowel /a/, the profile of the second basis vector is comparable and weights the level of the frequency band of 1.2 kHz. Such a profile differentiates between the phonemes /ɑ/ and /a/; not all singers produced the requested phonemic quality precisely. It may be noted that for subsets II, III, and IX the perceptual vowel configuration (Fig. 7, last column) does not have a relationship to the voice classifications of the singers.

The distribution of the variance in semantic-scale judgments over the perceptual dimensions, and therefore the order of importance of these dimensions, depends on our choice of selected scales. We cannot exclude that a single semantic scale describes a specific perceptual dimension, explaining little variance, while a large number of scales may be focused on one other perceptual dimension, explaining a large amount of variance. In previous sections it has been shown that for the subsets of stationary vowels there is agreement among listeners about scales which describe sharpness, the first dimension in the present analysis. A detailed study of the data did not reveal another perceptual dimension for which listeners agreed in their description. Therefore, the second and higher dimensions mainly rely on the extent to which listeners, unsystematically, use the acoustical properties of these dimensions in their judgments.

In summary, when listeners are requested to judge stationary vowels on semantic scales, they seem to focus primarily on differences in spectral slope between the vowels, even when this difference is smaller than those for the other perceptual dimensions. A large number of different semantic scales, related to sharpness, are judged according to this criterion. For other perceptual dimensions there is no agreement among listeners on verbal attributes.

6. Acoustic correlates of perceptual dimensions of song phrases

The acoustic properties of the song phrase 'Halleluja' are much more complex than the spectral variation between stationary vowels. Temporal aspects, such as vibrato and vowel duration, may also have influenced the judgments of listeners. This makes it difficult to relate perceptual dimensions to possible acoustic correlates. Since the first /ɑ/ and final /a/ of 'Halleluja' took up more than half of the total phrase duration, we used these two vowels to investigate spectral correlates. The 1/3-oct spectrum of the vowel segments was measured every 10 ms; the resulting 10-ms spectra were normalized for overall sound-pressure level, and their average was considered to be the representative spectrum of each singer. Subsequently, the configuration of the corresponding eight points in the spectrum space was matched with the perceptual configuration. Three dimensions showed significant correlations (p<0.01). The profiles of the spectral basis vectors associated with the first three perceptual dimensions are shown in Fig. 8, together with the grand-average spectrum. The first dimension accounts for 45 % of the total spectral variance. The profile of the first basis vector shows that the corresponding perceptual dimension, describing sharpness, is, for song phrases too, associated with spectral-slope-like variation. The profile of the second basis vector strongly weights the sound level of the frequency bands with center frequencies of 0.8 and 2.5 kHz. This indicates that a positive contribution of this dimension, perceptually associated with full, melodious, and colorful, is related to a more open vowel, /a/ instead of /ɑ/, and a higher sound level of the high spectral peak. The latter peak is also known as the singer's formant and has been described as the origin of the "ring" of the voice (Bartholomew, 1934; Bloothooft and Plomp, 1986b). The frequency position of this spectral peak is weighted by the third basis vector.
No systematically used verbal attribute was associated with this direction.

The amount of spectral variance accounted for by the second and third dimensions was only 9 and 12 %, respectively. It may therefore well be that acoustical factors other than those present in the average spectrum of the two vowels contribute to these dimensions. Concerning the influence of temporal measures on perceptual judgments, no effect of total phrase duration (tempo) could be established; this was to be expected, since the listeners were requested to ignore this factor. The specifically temporal semantic scale vibrato-straight was judged to a great extent on the basis of the depth of the vibrato modulations (r = 0.74) and not on vibrato rate (r = -0.13). Fig. 2 showed that this scale was positively related both to sharpness and to general evaluation (full, melodious, colorful). Therefore, the presence of a good vibrato may contribute as a temporal attribute to these factors.

C. Discussion

The experimental results showed that the human peripheral auditory system can detect very small differences in timbre. Acoustically, these differences amount to only a few decibels per 1/3-octave frequency band. Detection of timbre differences can be modelled as the observation of distance in a multidimensional perceptual space. With respect to the ear's frequency-analyzing power, the maximum number of dimensions of this perceptual space may, theoretically, be high. However, in most cases this number will depend on the Euclidean dimensionality of the spectral variation in the stimuli, with the restriction of a minimum amount of spectral variation of about 34 dB² per dimension. For harmonic vowel sounds in speech and singing, which are limited with respect to produced spectral variation, this criterion implies for a single vowel, sung by different singers, a maximum of about four independent perceptual dimensions. For the entire vowel system the number of dimensions is only slightly higher, with a maximum of about five dimensions, because a great deal of the spectral variation between different vowels is already captured by the type of spectral variation within a single vowel. The number five corresponds well with the number of resonance frequencies describing the properties of the vocal tract.

The representation of stimuli in a perceptual space was based on a grand average for a large number of listeners, whose individual perception may deviate from this idealized grand-average representation. The use of INDSCAL analysis and MDPREF analysis revealed those intersubject differences. It is interesting, however, that the two different experimental techniques we used, similarity comparison and semantic scaling, resulted in comparable stimulus configurations. Whereas in the similarity-comparison experiment detection properties of the auditory system play a large role, this is not self-evident in the semantic-scaling experiment. The latter approach makes use of adjectives which require the mediation of some internal reference. That both types of experiments still resulted in comparable grand-average stimulus configurations supports the view that these representations are really basic for human timbre perception. For each individual, however, a central weighting of perceptual dimensions may influence experimental outcomes, both in the similarity-comparison and in the semantic-scaling experiments.

We showed that a difference in timbre correlates with a difference in 1/3-oct spectrum for stationary vowels, at least up to a fundamental frequency of 392 Hz. This allows us to investigate properties of vowel representations in the perceptual space on the basis of 1/3-oct spectra only, that is, without the need to perform time-consuming perceptual experiments. For vowels sung by professional singers, results of such a study have been reported in Bloothooft and Plomp (1984, 1985, 1986a, 1986b).

For all subsets of stationary vowels, only sharpness turned out to be a verbal attribute of timbre on which most listeners, regardless of their degree of musical training, agreed in their judgments; they only differed in their evaluation of sharpness: whether sharpness was melodious or not. In conformity with von Bismarck (1974b), sharpness was found to be acoustically related to the slope of the spectrum. The importance of sharpness in timbre perception was even apparent for subset III, in which only 16 % of total spectral variance was associated with this factor. These results support von Bismarck's opinion that sharpness may be considered as a fundamental perceptual quality, besides pitch and loudness, of any harmonic complex tone.

For spoken and sung vowels both a psycho-acoustical level and a phonetic level of perception may be distinguished. At the more central, phonetic, level the phonemic identity of a vowel is determined. This level is especially sensitive to formant-frequency variation (Klatt, 1982). It may well be possible that a number of verbal attributes of the timbre of vowel sounds refer to this level of perception and describe, for instance, formant-frequency deviations from typical reference values. The acoustical interpretation of such verbal attributes would then be vowel dependent. Since our subsets each included only one vowel, this kind of timbre description could have emerged from the listening experiments. Although the second perceptual dimension of the vowel subsets, shown in Fig. 7, did turn out to be related to vowel-specific acoustical variation, no indications were found that listeners agreed in their verbal description of this variation. This suggests that there are no stable verbal attributes for the phonetic level of perception under the experimental conditions used here.

The present experiments failed to reveal a relationship between the description of timbre of stationary vowels and voice classification (see Fig. 7). In all cases both the semantic scales tenor-bass and soprano-alto were used in the same way as sharp-dull: the more high-frequency energy, the higher the estimated voice classification. In fact, sharpness was unrelated to actual voice classification at all and even showed a reverse relationship with female voice classification for subset IX (Fig. 7). Whereas such results may be attributed at first sight to the restrictions of stationary vowels, which make it impossible to estimate voice classification even for musically trained listeners, the observation persisted to some extent for the phrases sung by male singers. Although in this case the relationship between judgments on the scale tenor-bass (19) and actual voice classifications was rather good (as an example, see the stimulus configuration in Fig. 4), a contradictory result was obtained for tenor singer 6, who had a rather dull voice and was associated with the lowest voice classification. For the song phrases the semantic scale tenor-bass was also highly correlated with the first perceptual dimension and, therefore, associated with the slope of the spectrum (Fig. 8). Listeners seem to relate a shallow spectral slope to tenor-voice timbre type and a steep negative spectral slope to bass-voice timbre type. A shallow slope may originate in the spectral effect of high first and second formants (e.g. Fant, 1960) or in a shallow source spectrum. Cleveland (1977) indicated that higher formant frequencies are indeed associated with higher voice classifications in professional male singers. The contradictory result for "dull" tenor singer 6 in the present experiment possibly demonstrated the confusing influence of a steep source spectrum. 
This raises the interesting question of whether perceptual voice classification, based on timbre, has a phonetic basis (formant-frequency detection) or a psycho-acoustical basis (sharpness detection). The present results suggest a psycho-acoustical basis, which may lead, however, to incorrect judgments of voice classification. Fortunately, many more factors contribute to voice classification, which guards against a wrong judgment on the basis of timbre alone.

The limitation in most of our experiments to steady-state vowels made it possible to conduct well-defined perceptual experiments, but this experimental paradigm may be rather far removed from perception in a real singing performance. It was stressed in a review by Risset and Wessel (1982) that dynamic factors in timbre contribute to the identification and naturalness of musical instruments. We may assume that this will also be the case for the singing voice. Nevertheless, it was our informal observation that many typical characteristics of sung vowels are also present in their steady-state versions, despite their somewhat mechanical character. Furthermore, our listening experiment with song phrases was a first attempt to bridge the gap between experiments with stationary sounds and real singing performance. Just as for stationary vowels, sharpness was the most important spectral attribute of the timbre of song phrases. The second perceptual dimension (colorfulness) indicated the influence of the relative sound level of the singer's formant. Vibrato was not found to take up a separate perceptual dimension, but vibrato quality may enhance judgments on both sharpness and colorfulness, especially when the spectral contributions of these perceptual dimensions are small.

Finally, we should be careful in interpreting semantic scales in terms of acoustical properties in view of the small number of singers in the subsets. Accidental combinations of acoustic characteristics of singers, or their absence, may have influenced the results. Nevertheless, we trust that the main effects, found for most subsets, are likely to have a more general validity.



This research was supported by the Netherlands Organization for the Advancement of Pure Research (ZWO). The authors are indebted to the singers for their kind cooperation and to Louis C.W. Pols and two anonymous reviewers for their comments on earlier versions of this paper.


Note 1.

More precisely, we used the matrix analogue of a coefficient of alienation, S½. For the one-dimensional case, S is the analogue of 1 - r², where r is the correlation coefficient. For example, for r = 0.81 (a significant correlation for 9 pairs of data points at the 0.01 level), S½ = 0.60.
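The one-dimensional case can be checked numerically; assuming S = 1 - r², the coefficient of alienation is sqrt(1 - r²):

```python
import math

def alienation(r):
    """One-dimensional coefficient of alienation, sqrt(1 - r^2),
    for a correlation coefficient r."""
    return math.sqrt(1.0 - r * r)

print(alienation(0.81))   # about 0.59, i.e. roughly the 0.60 quoted above
```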



ASA (1960). "American Standard Acoustical Terminology", New York, Definition 12.9, Timbre, 45.

Bartholomew, W.T. (1934). "A physical definition of 'good voice quality' in the male voice," J.Acoust.Soc.Am. 6, 25-33.

Benedini, K. (1980). "Klangfarbenunterschiede zwischen tiefpass gefilterten harmonischen Klängen," Acustica 44, 129-134.

Berg, J.W. van den, and Vennard, W. (1959). "Towards an objective vocabulary for voice pedagogy," NATS Bulletin, February.

Bismarck, G. von (1974a). "Timbre of steady sounds: A factorial investigation of its verbal attributes," Acustica 30, 146-159.

Bismarck, G. von (1974b). "Sharpness as an attribute of the timbre of steady sounds," Acustica 30, 159-172.

Bloothooft, G., and Plomp, R. (1984). "Spectral analysis of sung vowels. I. Variation due to differences between vowel, singers, and modes of singing," J.Acoust.Soc.Am. 75, 1259-1264.

Bloothooft, G., and Plomp, R. (1985). "Spectral analysis of sung vowels. II. The effect of fundamental frequency on vowel spectra," J.Acoust.Soc.Am. 77, 1580-1588.

Bloothooft, G., and Plomp. R. (1986a). "Spectral analysis of sung vowels. III. Characteristics of singers and modes of singing," J.Acoust.Soc.Am. 79, 852-864.

Bloothooft, G., and Plomp. R. (1986b). "The sound level of the singer's formant in professional singing," J.Acoust.Soc.Am. 79, 2028-2033.

Boves, L.W.J. (1984). "The phonetic basis of perceptual ratings of running speech," Doctoral dissertation (Foris, Dordrecht).

Carroll, J.D. (1972). "Individual differences and multi-dimensional scaling," in R.N. Shepard, A.K. Romney, and S.B. Nerlove (Eds.) Multi-dimensional scaling I, 105-155, Seminar Press, New York.

Carroll, J.D., and Chang, J.J. (1970). "Analysis of individual differences in multi-dimensional scaling via an N-way generalization of 'Eckart-Young' decomposition," Psychometrika 35, 283-319.

Cleveland, T.F. (1977). "Acoustic properties of voice timbre types and their influence on voice classification," J.Acoust.Soc.Am. 61, 1622-1629.

De Bruyn, A. (1978). "Timbre-classification of complex tones," Acustica 40, 108-114.

Donovan, R. (1970). "The relationship between physical analysis of sounds and the auditory impression of their vowel and tone quality in tenor singing," Acustica 23, 269-276.

Edwards, A.L. (1957). "Techniques of attitude scale construction," (Appleton, New York).

Fagel, W.P.F., Herpt, L.W.A. van, and Boves, L. (1983). "Analysis of the perceptual qualities of Dutch speakers' voice and pronunciation," Speech Communication 2, 315-326.

Fant, G. (1960). "Acoustic theory of speech production," (Mouton, The Hague).

Husler, F., and Rodd-Marling, Y. (1976). "Singing: the physical nature of the vocal organ," (Faber and Faber, London).

Isshiki, N., Okamura, H., Tanabe, M., and Morimoto, M. (1969). "Differential diagnosis of hoarseness," Folia Phoniatrica 21, 9-19.

Kakusho, O., Hirato, H., Kato, K., and Kobayashi, T. (1971). "Some experiments of vowel perception by harmonic synthesizer," Acustica 24, 179-190.

Klatt, D.H. (1982). "Prediction of perceived phonetic distance from critical-band spectra: a first step," Proc. ICASSP, 1278-1281.

Lingoes, J.C., and Schönemann, P.H. (1974). "Alternative measures of fit for the Schönemann-Carroll matrix fitting algorithm," Psychometrika 39, 423-427.

Nord, L., and Sventelius, E. (1979). "Analysis and prediction of difference limen data for formant frequencies," Speech Transmission Laboratory, Quarterly Progress and Status Reports 3-4, 60-72.

Plomp, R. (1970). "Timbre as a multi-dimensional attribute of complex tones," in R. Plomp and G.F. Smoorenburg (Eds.) Frequency analysis and periodicity detection in hearing, (Sythoff, Leiden).

Plomp, R. (1976). "Aspects of tone sensation," (Academic, London).

Pols, L.C.W. (1977). "Spectral analysis and identification of Dutch vowels in monosyllabic words," Doctoral dissertation (Free University, Amsterdam).

Pols, L.C.W., Kamp, L.J.Th. van der, and Plomp, R. (1969). "Perceptual and physical space of vowel sounds," J.Acoust.Soc.Am. 46, 458-467.

Risset, J.C., and Wessel, D.L. (1982). "Exploration of timbre by analysis and synthesis," in D. Deutsch (Ed.) The psychology of music, (Academic, London).

Schönemann, P.H., and Carroll, R.M. (1970). "Fitting one matrix to another under choice of a central dilation and a rigid motion," Psychometrika 35, 245-255.

Schouten, J.F. (1968). "The perception of timbre," Reports of the 6th ICA, Tokyo, GP-6-2, 35-44.

Sundberg, J. (1982). "The perception of singing," in D. Deutsch (Ed.) The Psychology of Music, (Academic, London).

Vennard, W. (1967). "Singing, the mechanism and the technique," (Fisher, New York).


Legends of figures 1-8

Fig. 1. Result of an INDSCAL analysis on data from similarity-comparison judgments of the vowel /u/, sung by eight male singers at Fo = 220 Hz. The upper panels show the I-II and the I-III planes of the object space and spectral space combined. Filled circles form the vowel configuration obtained from INDSCAL; open circles represent the best fitting spectral vowel configuration. The lower panels show the corresponding planes from the subject space of INDSCAL. Coordinate values of points represent the weight a subject attaches to a dimension. Open squares are musicians, filled squares are non-musicians.

Fig. 2. Representation of relations between semantic scales by means of the results of INDSCAL analyses. The I-II plane of both the object space (semantic scales) and the subject space (listeners) is presented for the subset of song phrases and the subsets of stationary vowels, both for musicians and non-musicians. Semantic scale numbers refer to Table IV.

Fig. 3. Example of presentation of results of MDPREF analysis. Stimuli are represented as points in a space and semantic scales are represented as vectors on which the projection of the stimulus points gives the best estimate of a listener's judgment.

Fig. 4. Perceptual space for the song phrases sung by eight male singers (large filled circles, the numbers of which refer to Table I) with various results of semantic-scale vectors:

(a) All accepted semantic scales (numbers refer to Table IV) for musician 7.
(b) All accepted semantic scales for non-musician 2.
(c) All accepted listeners on the semantic scale clear-dull; small open circles refer to musicians, small solid circles to non-musicians.
(d) All accepted listeners on the semantic scale colorful-colorless; small open circles refer to musicians; small solid circles to non-musicians.

Fig. 5. Perceptual space for a subset of vowels /i/, sung by eight male singers at Fo = 220 Hz. All semantic-scale vectors are shown for musicians 2 (open squares) and 4 (open circles) and non-musicians 2 (closed squares) and 4 (closed circles).

Fig. 6. Results of matching vowel configurations in a perceptual space derived from the similarity-comparison experiment (filled circles) and in a perceptual space, derived from semantic scaling experiments (open circles), in the first two best fitting dimensions. The subset consisted of the vowel /u/, sung by eight male singers at Fo = 220 Hz (see also Fig. 1).

Fig. 7. Results of matching vowel configurations in the perceptual space (derived from semantic-scaling experiments) and the spectrum space. Left-hand panels show the grand-average spectrum of each subset; the middle panels present the profiles of the spectral basis vectors associated with the first two perceptual dimensions, and the right-hand panels show the vowel configurations in the perceptual space. Numbers refer to singers (see Table I), f indicates falsetto register, t indicates a tenor-like phonation produced by baritone singer 3.

Fig. 8. Results of matching the perceptual configuration of song phrases and the configuration of average spectra of the two vowels / / and /a/ in 'Halleluja'. The grand-average spectrum and the profiles of the first three spectral basis vectors are shown.