Proceedings of the 1st International Conference on Language Resources & Evaluation, Granada, 28-30 May 1998, edited by A. Rubio, N. Gallardo, R. Castro, A. Tejada, pp. 51-54.


Evaluating various spoken dialogue systems with a single questionnaire:

Analysis of the ELSNET Olympics

Els den Os1 & Gerrit Bloothooft2


1KPN Research Leidschendam, THE NETHERLANDS
2Utrecht Institute of Linguistics OTS, THE NETHERLANDS
e.a.denos@research.kpn.com

 

Abstract

At Eurospeech ’97 ELSNET provided the possibility for participants to test ten different spoken dialogue systems. After each test, participants were asked to complete a questionnaire (18 questions). This questionnaire was designed to assess the major properties of the spoken dialogue systems in a system-independent way. The total number of completed evaluation forms returned was 255, and for eight of the ten systems the number was large enough to permit meaningful statistical processing. A factor analysis was performed on the pooled data; the five main factors explained 75% of the total variance. Three of these five factors could be interpreted to reflect user opinions on (1) general appreciation, (2) functional capabilities of the system, (3) intelligibility of the output speech, while two were user-oriented: (4) proficiency in the system’s language, and (5) familiarity with spoken dialogue systems. Comparable sets of main factors were also obtained in separate factor analyses on the data for each system. This indicates that the first three main factors play a dominant role in the user's opinion of a spoken dialogue system, independently of the complexity or purpose of the system and independently of the user's knowledge of the system's language or the user's familiarity with this kind of information service. Some recommendations are given for the design of subjective evaluation of spoken dialogue systems by means of questionnaires.

 

1. Introduction

As the number of spoken dialogue systems grows, the need to evaluate them becomes more pressing. Since spoken dialogue systems are complex and include a combination of speech recognition, natural language processing, dialogue management, and speech synthesis, assessment of the complete system is not a simple task. Objective evaluation may address the length of the dialogue, the number of turns, the number of diagnostic turns by the system, the number of speech recognition errors, the number of successful database queries, and so on. Besides this objective evaluation, subjective assessment can be used to address the perceived quality of the dialogue system (Walker et al., 1997). The only way to investigate the perceived quality of a system is by asking users questions about it. These questions may be related to the overall acceptability of the system or may have a more diagnostic purpose in order to address specific features of a system. In the first case it is more appropriate to ask open questions like "How did you get along?" and "What do you think of this system?" In the second case it is more appropriate to use Likert scales (e.g. "The system understood me well": I fully agree [1]-[2]-[3]-[4]-[5] I fully disagree). In this paper we concentrate on the latter type of diagnostic evaluation.

At the ELSNET Olympics, Eurospeech '97 participants had the unique opportunity to test ten different dialogue systems. All systems were evaluated using the same questionnaire. For naïve subjects it will probably be difficult to answer diagnostic questions. However, the Eurospeech population may be considered capable of answering these types of questions about spoken dialogue systems. In this paper we investigate whether different spoken dialogue systems can be assessed by a single questionnaire. This is only possible if the questionnaire (and the individual questions) address relevant features that are present in all systems. If that proves to be the case, we may hope to be able to compare subjective evaluations of dialogue systems for different tasks. Even more importantly: we would then be in a position to transfer knowledge and experience gained with one system to speed up the development of other systems.

2. The questionnaire

The ten systems that participated in the ELSNET Olympics differed in task, lexicon size, type of speech recognition, dialogue design, availability of barge-in, and type of system speech and language. Our hypothesis was that, irrespective of these differences, it is still possible to assess systems subjectively by focusing on whether systems do what they claim to do from the perspective of the user. Therefore, the questions were formulated in a general way (see Table 1 for the questionnaire). Most questions were of the closed type using a Likert scale; a few others were of the open type, to help interpret the meaning of the answers to the closed questions more clearly. In addition to the questions related to the spoken dialogue systems, two questions were included addressing the user’s proficiency in the system’s language and the user’s familiarity with spoken dialogue systems, to see whether these factors influence the diagnostic evaluation.

3. The systems

Two of the ten systems could not be tested properly. The Aizula system (Japan) (Ward, 1996) was not an information system but a demonstration of simulated back-channel feedback. The Actis system (KDD, Japan) provides information about area codes, country codes and time differences between Japan and the destination of an international call. The lexicon covers the names of 300 countries and a thousand cities all over the world. Due to line problems the system was only tested a few times.


- Did the system fulfil your expectations
  (based on aims and task as explained on the poster)?
  yes / no / in part

- Please give your opinion on the following aspects (encircle number):

                                                    POOR          GOOD
  * the functional possibilities of the system       1   2   3   4   5
  * speech recognition                               1   2   3   4   5
  * intelligibility of the system's speech           1   2   3   4   5
  * wording of system's utterances                   1   2   3   4   5
  * overall dialogue structure                       1   2   3   4   5
  * the possibility to recover from system errors    1   2   3   4   5
  * appropriateness of the system's reactions        1   2   3   4   5
  * length of the dialogue                           1   2   3   4   5
  * the system's error messages                      1   2   3   4   5
  * task completion                                  1   2   3   4   5
  * your overall satisfaction                        1   2   3   4   5

- Please list strong points of the system

- Please list weak points of the system

- What is your mother tongue?

                                                    POOR          NATIVE
- What is your proficiency in the system's language  1   2   3   4   5

                                                    LITTLE        MUCH
- Are you acquainted with spoken dialogue systems    1   2   3   4   5

- Please use the space below for additional comments

Table 1: Text of the questionnaire

The remaining eight information systems were:

Jupiter (MIT, Cambridge MA, USA) (Zue et al., 1997): The version of Jupiter used at the ELSNET Olympics was a US English conversational system that gives current weather forecasts over the telephone for over 500 cities worldwide. The system can answer queries about general weather forecasts, temperature, humidity, wind speed, sunrise/sunset times, and weather alerts (such as flooding and hurricanes). It obtains its weather information from Web sources. Jupiter has a vocabulary of 1400 words, and will soon be able to handle calls in German, Mandarin, and Spanish. It uses text-to-speech synthesis for output speech.

STACC (University of Granada, Spain) (Rubio et al., 1997): This system-driven service allows students to consult their marks. Students have to enter one of two degrees and one of six courses they want to consult, and say their full name as well as an eight-digit identification number. When the name and identification number match, the system provides the required mark. The lexicon contains about 300 words.

O-tel (University of Maribor, Slovenia). This automatic reverse directory service can handle Slovenian, English, and German. The user enters digits in one of these languages; the digits must be separated by short pauses, since isolated-word recognition is used. The system repeats each digit, and when a digit is not recognized correctly, the caller can erase it. The most distinguishing feature of this system is that it has talk-through capability. The output is given by word-based synthesis.

EVAR (University of Erlangen, Germany) (Boros et al., 1997): Developed within the Sundial project, this system provides information on German Intercity timetables. It is run mainly for research purposes. Research emphasis is on the (relatively free) dialogue, on robust recognition, parsing of spontaneous speech and detection of out-of-vocabulary words. The dialogue manager can cope with anaphora and ellipses, and has a variety of recovery strategies for unusual situations. The user can always go back and change information. The system has a vocabulary of 1600 words, as well as a spelling mode for when standard dialogue strategies fail.

Dialogos/Italian Arise (CSELT, Torino, Italy) (Albesano et al., 1997): Like the three systems below, this continuous-speech dialogue system forms part of the Arise project, a European research project partly funded by the EC under the LE sector of the Fourth Framework Telematics Application Programme. It has a vocabulary of 3,500 words, including 3,000 Italian station names. The dialogue module interprets the content of the user's utterances by taking into account both previous utterances and data pertaining to the application. The system can support different clarification and correction subdialogues, and is able to detect repairs initiated by the user. It uses text-to-speech synthesis for output speech.

LIMSI Arise (Orsay, France) (Lamel et al., 1996): This system provides information on train schedules, fares, reductions, and services in French. It uses continuous speech recognition with task-dependent acoustic models. The lexicon contains about 1500 words, 680 of which are station names. It is possible to interrupt system prompts (barge-in); speech output is handled by synthesis, through concatenation of about 2000 prerecorded units. The system uses a very open mixed-initiative dialogue: the caller is free to ask any question at any time. The system asks the user to provide the information it needs for database access, but can deal with the caller providing different information.

IRIT Arise (Toulouse, France) : This system uses speech recognition and dialogue management technology developed by Philips. The lexicon contains 1500 words (500 of which are station names). It is a conversational system, with concatenation of prerecorded speech being used for the system's output.

Dutch Arise (NS/OVR, Philips, KPN Research, University of Nijmegen, the Netherlands): This conversational system has a lexicon of 1380 words (680 station names). Its speech recognition component uses context-dependent acoustic models (triphones). The system has been trained by more than 11,000 dialogues, and uses concatenation of prerecorded units for speech output.

4. Results

We analysed the data from the Likert scales by means of a factor analysis, followed by varimax rotation to obtain more easily interpretable links between questions and factors. Since only fully completed questionnaires could be used for factor analysis, we had to discard quite a few questionnaires (see Table 2). Some users did not encounter errors and consequently could not answer error-related questions, others failed to communicate meaningfully with a system and did not get any impression of the system's capabilities, while some users simply forgot to turn the page of the questionnaire. There was no specific relation between a system and the type of missing answers, with the exception of the multilingual O-tel system, for which some users failed to get past the language choice.
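The pipeline described above — discarding incomplete questionnaires, extracting factors from the correlation matrix, and applying a varimax rotation — can be sketched in plain NumPy. The data below are random stand-ins for the 255 completed thirteen-question forms, and the extraction uses principal components as a common simplification, since the paper does not state which factoring method was used:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (questions x factors) loading matrix to the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d_old = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # SVD step of the standard varimax iteration (gamma = 1)
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated**3 - rotated * (rotated**2).sum(axis=0) / p))
        rotation = u @ vt
        d = s.sum()
        if d_old != 0 and d <= d_old * (1 + tol):
            break
        d_old = d
    return loadings @ rotation

# Random stand-in for 255 completed questionnaires x 13 closed questions
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(255, 13)).astype(float)  # Likert scores 1..5

# Factor extraction via principal components of the correlation matrix
corr = np.corrcoef(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(corr)
top = np.argsort(eigval)[::-1][:5]                    # keep five factors
loadings = eigvec[:, top] * np.sqrt(eigval[top])
explained = eigval[top].sum() / corr.trace()          # fraction of variance
rotated = varimax(loadings)                           # easier-to-read loadings
```

Library routines such as scikit-learn's `FactorAnalysis(n_components=5, rotation='varimax')` package the same steps; the rotation leaves each question's communality (row sum of squared loadings) unchanged, which is a useful sanity check.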

System          Total   Completed
ARISE DUTCH       24       16
ARISE ITALIAN     33       23
ARISE IRIT        33       30
ARISE LIMSI       41       31
EVAR              46       36
JUPITER          114       78
O-TEL             45       31
STACC             18       10
TOTAL            354      255

Table 2. Total number of questionnaires returned for each system and the total number of fully completed questionnaires used for factor analysis.

Table 3 shows the results of the factor analysis for the pooled data. Five main factors can be distinguished, which together explain 75% of the total variance. The first three factors are related to system properties, the last two are user-oriented. The first factor refers to the general appreciation of a system and addresses how a user perceives the interaction with the system. Questions about overall satisfaction, task completion, error recovery and error messages have their main loading on this factor.

Question                        1     2     3     4     5

1 GENERAL APPRECIATION
Task completion                .81   .28  -.00   .14  -.08
Error recovery                 .80   .23   .09  -.12   .03
Appropriate reactions          .77   .30   .17   .04   .12
Error messages                 .74  -.06   .33  -.10   .22
Overall satisfaction           .74   .52   .11  -.00  -.01
Speech recognition             .56   .58   .10   .24  -.03
Dialogue length                .49   .27   .37  -.25   .02

2 FUNCTIONAL CAPABILITIES
Functional possibilities       .23   .85   .14  -.02   .07
Dialogue structure             .43   .60   .24  -.27   .17

3 SYSTEM SPEECH
Speech intelligibility         .01   .09   .81   .23  -.16
Wording of utterances          .33   .19   .73  -.09   .04

4 USER'S LANGUAGE PROFICIENCY  .00  -.02   .10   .88   .14

5 USER'S FAMILIARITY WITH
  DIALOGUE SYSTEMS             .08   .09  -.10   .14   .93
Table 3. Loadings of all closed questions on the first five factors after factor analysis (followed by varimax rotation). Boldface numbers show loadings that exceed 0.70. Factors 4 and 5 each correspond to a single question.

The second factor refers to the functional capabilities of the system, i.e. whether the system does what it is meant to do. Note that most questions listed under the first two factors have substantial loadings on both factors 1 and 2, which implies that these questions measure some combination of general appreciation and the functional capabilities of the system. The third factor deals with properties of the system prompts and output speech. These three factors are independent dimensions in the user’s assessment of a system.

The fourth and fifth factors (language proficiency of the user and the familiarity with spoken dialogue systems) are user-related. Interestingly, there is no interaction between these user-oriented factors and the factors dealing with the assessment of system properties.

A proper factor analysis presupposes a single subject group evaluating all systems. Obviously, this could not be the case because of the different languages of the systems. We therefore also performed factor analyses on the data for each system separately. If the same factor pattern as for the pooled data shows up for the individual systems, this supports the general validity of the overall analysis. Table 4 summarizes the results of the separate analyses. For most questions the relation to the main factors was comparable across all systems. A slight complication was that the question on the wording of the system utterances was assigned a separate factor for four systems, in two cases together with the question on error messages. The question on dialogue structure loaded on factor 1 or factor 2, while the question on dialogue length was linked to various factors for the different systems. The scatter in the loadings of the latter questions over the factors in the pooled data evidently originates from system-dependent interpretation of these questions.
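The association rule used here (a question belongs to a factor when its loading exceeds .70) can be applied mechanically. As an illustration, the plain-NumPy sketch below takes the pooled loadings from Table 3 and recovers each question's main factor, with None where no loading reaches the threshold:

```python
import numpy as np

# Loadings copied from Table 3 (pooled data), in the table's question order
questions = ["Task completion", "Error recovery", "Appropriate reactions",
             "Error messages", "Overall satisfaction", "Speech recognition",
             "Dialogue length", "Functional possibilities", "Dialogue structure",
             "Speech intelligibility", "Wording of utterances",
             "Language proficiency", "Familiarity"]
L = np.array([
    [.81,  .28, -.00,  .14, -.08],
    [.80,  .23,  .09, -.12,  .03],
    [.77,  .30,  .17,  .04,  .12],
    [.74, -.06,  .33, -.10,  .22],
    [.74,  .52,  .11, -.00, -.01],
    [.56,  .58,  .10,  .24, -.03],
    [.49,  .27,  .37, -.25,  .02],
    [.23,  .85,  .14, -.02,  .07],
    [.43,  .60,  .24, -.27,  .17],
    [.01,  .09,  .81,  .23, -.16],
    [.33,  .19,  .73, -.09,  .04],
    [.00, -.02,  .10,  .88,  .14],
    [.08,  .09, -.10,  .14,  .93],
])

assignment = {}
for q, row in zip(questions, L):
    j = int(np.argmax(np.abs(row)))          # factor with the largest loading
    assignment[q] = j + 1 if abs(row[j]) >= .70 else None
    print(f"{q:26s} -> {assignment[q] or '-'}")
```

On the pooled data this yields factor 1 for task completion but no factor for speech recognition or dialogue length, matching the pattern of boldface loadings in Table 3.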

Question                  DUTCH LIMSI IRIT ITALIAN EVAR JUPITER O-TEL STACC  Main
Task completion             1     1    -      1     1      1      1     2     1
Error recovery              1     1    1      1     -      7      1     1     1
Appropriate reactions       1     1    -      1     1      1      1     1     1
Error messages              -     1    6      1     6      -      1     1     1
Overall satisfaction        1     1    1      1     1      1      -     1     1
Speech recognition          8     1    1      -     1      1      -     8     1
Functional possibilities    2     2    1      2     2      -      2     2     2
Speech intelligibility      3     3    3      3     3      3      3     3     3
User's proficiency          -     4    4      4     3      4      4     2     4
User's familiarity          2     5    5      5     5      5      5     3     5
Wording of utterances       6     2    6      3     6      -      3     6     6
Dialogue structure          -     2    1      2     6      -      2     1     X
Dialogue length             6     2    3      -     -      -      5     1     X

Table 4. Factor associations per question after separate factor analysis (and varimax rotation) per system. A question was associated with a factor when its loading exceeded .70 (otherwise - is inserted). The factor numbering follows Table 3, but higher numbers were introduced when for some system the factor pattern did not match the general pattern. The right column shows the most common factor per question.

System           1     2     3     4     5
ARISE DUTCH    -.10   .50   .51   .27  -.42
ARISE IRIT      .05  -.22  -.55  -.09  -.16
ARISE ITALIAN   .79   .40  -.10   .17   .26
ARISE LIMSI    -.50   .02   .72  -.03   .11
EVAR           -.14  -.20  -.36   .11  -.23
JUPITER        -.05   .20  -.17  -.31  -.10
O-TEL           .12  -.82   .36   .17   .10
STACC           .29   .58   .31   .97  -.03

Table 5. Average factor scores of the eight systems on the five main factors. Gray cells indicate significantly different values (p<.05) per factor. For factor 3 the marked group of three cells differs significantly from the values -.36 and -.51 only.

Finally, we present in Table 5 the average factor scores of the systems on the five main factors. The significance of differences was determined by performing, per factor, an ANOVA on the raw factor scores, followed by a Newman-Keuls post-hoc analysis. The results are consistent with the description presented in Bloothooft & den Os (1997). Since each system more or less had its own user group, systematic scoring differences between these groups may have affected the average scores and hence the factor scores. This can happen even when the relative relations between questions are about the same for all groups, as shown in the present study. The average factor scores should therefore be treated with some caution.
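The per-factor significance testing can be sketched as follows. The scores below are synthetic (normal noise around the Table 5 means for three of the systems, with group sizes from Table 2), and since SciPy offers no Newman-Keuls routine, Tukey's HSD is used here as a closely related post-hoc stand-in:

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# Synthetic per-user scores on factor 1: means follow Table 5, n follows Table 2
rng = np.random.default_rng(1)
italian = rng.normal(0.79, 1.0, size=23)
limsi = rng.normal(-0.50, 1.0, size=31)
jupiter = rng.normal(-0.05, 1.0, size=78)

f, p = f_oneway(italian, limsi, jupiter)        # omnibus ANOVA for this factor
print(f"ANOVA on factor 1: F = {f:.2f}, p = {p:.4f}")
if p < 0.05:                                    # probe pairs only if it fires
    print(tukey_hsd(italian, limsi, jupiter))   # pairwise post-hoc comparisons
```

Repeating this per factor over all eight groups reproduces the structure of the analysis, though with real (rather than synthetic) factor scores the post-hoc groupings would of course differ.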

5. Discussion

In the subjective diagnostic evaluation of spoken dialogue systems we found that users distinguish three independent factors while judging these systems. These factors reflect three general features: (1) general appreciation of the system, (2) functional capabilities of the system and (3) quality of the output speech. Since these three factors were shown to be highly system independent, we recommend that they should get special attention in both the design and the subjective diagnostic evaluation of any system. However, since these factors refer to very general features, it remains to be seen whether answers to questions addressing these factors will be of help in designing improved systems. On the other hand, it is also questionable whether more detailed and system-specific questions would result in more useful responses, especially when the system is tested by non-experienced users. It may well happen that detailed questions are interpreted in terms of the three main factors presented.

Familiarity with dialogue systems and the language proficiency of users did not interact with the assessment of the systems. This result supports the fundamental character of the main factors. They span not only the views of experts but also the experiences of users who are less proficient in the language and probably pose a critical test to the system, as well as the assessments of users who are less familiar with these types of systems, their properties, and their terminology. It is uncertain, however, whether this result extends from the Eurospeech population to the general public. In general it may be wise to include explicit questions on the user's knowledge of the functionality of a system.

Some of our questions were not interpreted in the same way across different systems. These questions relate to important diagnostic features such as error messages, speech recognition, wording of system speech, dialogue structure, and dialogue length (see Table 4). However, it is not always clear whether this system-dependent interpretation could have been avoided by better phrasing of the questions. For instance, the question on speech recognition has a more precise meaning for an isolated-word recogniser than for a system using continuous speech recognition, where it is not always clear whether speech recognition or other parts of the system failed. In this case it probably does not make sense to refine the question: even for experienced users it is often hard if not impossible to distinguish between failures due to speech recognition, a badly designed dialogue structure, or incomplete functionality of the application. This ambiguity does not hold, for instance, for the question on dialogue length. Dialogue length can be influenced by, among other things, system utterances that are too long or provide too much information, an inefficient dialogue design, or speech recognition that is too slow. These causes can be addressed separately in a system-specific questionnaire. Because our questionnaire had to be applied to very different systems, the phrasing of the questions had to be general, but we recommend more precise phrasing in system-specific questionnaires.

References

Albesano, D., Baggia, P., Danieli, M., Gemello, R., Gerbino, E., & Rullent, C. (1997). A Robust System for Human-Machine Dialogue in Telephony-Based Applications. J. of Speech Technology 2, pp. 99-110.

Bloothooft, G. & den Os, E. (1997). The ELSNET Olympics. Testing Spoken Dialogue Systems at Eurospeech'97. ELSNews 6.5, pp. 1-3.

Boros, M., Aretoulaki, M., Gallwitz, F., Noeth, E., & Niemann, H. (1997). Semantic Processing of Out-of-Vocabulary Words in a Spoken Dialogue System. Proc. Eurospeech'97, Rhodes, pp. 1887-1890.

Lamel, L., Gauvain, J.L., Bennacef, S.K., Devillers, L., Foukia, S., Gangolf, J.J., & Rosset, S. (1996). Field Trials of a Telephone Service for Rail Travel Information. Proc. IEEE IVTTA-96, Basking Ridge, NJ, pp. 111-116, October. (also to appear in Speech Communication)

Rubio, A.J., García, P., Torre, I. de la, Segura, J., Díaz-Verdejo, J., Benítez, M.C., Sánchez, V., Peinado, A.M., & López-Córdoba, J.L. (1997). STACC: An Automatic Service for Information Access Using Continuous Speech Recognition Through Telephone Line. Proc. Eurospeech'97, Rhodes, pp. 1779-1782.

Walker, M.A., Litman, D., Kamm, C. & Abella, A. (1997) Paradise: A general framework for evaluating spoken dialogue agents. Proc. ACL/EACL 97, Madrid.

Ward, N. (1996). Using Prosodic Clues to Decide When to Produce Back-channel Utterances. Proc. ICSLP'96, Philadelphia, pp. 1728-1731.

Zue, V., Seneff, S., Glass, J., Hetherington, L., Hurley, E., Meng, H., Pao, C., Polifroni, J., Schloming, R., & Schmid, P. (1997). From Interface to Content: Translingual Access and Delivery of On-line Information. Proc. Eurospeech'97, Rhodes, pp. 2227-2230. Also http://www.sls.lcs.mit.edu/jupiter.


Additional figures from the oral presentation

 

Figure 1: Factor loadings of all closed questions on the first two factors, associated with user-system interaction and intrinsic qualities, respectively.

 

Figure 2: Factor scores of all systems on the first two factors, associated with user-system interaction and intrinsic qualities, respectively.

 

 

Figure 3: Factor scores of all systems on factors 1 and 3, associated with user-system interaction and system speech, respectively.

 

Figure 4: Factor scores of all systems on factors 2 and 3, associated with intrinsic qualities and system speech, respectively.

 

Figure 5: Average scores of all systems on the questions on the language proficiency of the user (horizontal axis) and on the user's familiarity with dialogue systems (vertical axis).