Bloothooft, G. and den Os, E.A. (1997). 'The Elsnet Olympics: Testing Spoken Dialogue Systems at Eurospeech'97'. ELSNews 6.5, 1-3.
Elsnet Olympics: Testing Spoken Dialogue Systems at Eurospeech'97
Gerrit Bloothooft (1) and Els den Os (2)
(2) KPN Research, Leidschendam, The Netherlands
At Eurospeech'97, held from 22-25 September in Rhodes, Greece, Elsnet offered conference participants the opportunity to become familiar with, and to test, present-day spoken dialogue systems over the telephone. Since the conference was held in Greece, the test was named the "Elsnet Olympics", although no real competition was intended. Ten academic and industrial sites from eight countries provided their systems. Conference participants were invited to try these systems, and after each call they were asked to give their opinion on a number of system features using a 5-point Likert scale. In addition, they were invited to add comments and suggestions in free text.
The Elsnet Olympics turned out to be a great success. During the four days of the conference, the four available phones were continuously occupied, and some 390 questionnaires were returned. All completed questionnaires have been forwarded to the respective system developers. The main goals of the Elsnet Olympics were reached: promoting a variety of up-to-date systems, providing hands-on experience for participants at a major speech conference, and gathering useful feedback for developers from an expert audience.
Although its name may suggest otherwise, the Elsnet Olympics was certainly not meant to identify a real "winner", in the form of the world's single best spoken dialogue system. The large number of variables that could not be controlled in this type of testing (number and background of speakers, languages tested, different types of systems) does not allow for a comparison leading to a simple, single rank order. Of course, at the closing ceremony of Eurospeech we had to announce a "winner"; for this we took four representative questions from the questionnaire and determined the system with the highest scores. This turned out to be the Italian train timetable information system developed by CSELT.
The Elsnet Olympics has been very successful in mixing promotion and pleasure. Furthermore, the comments of the Eurospeech participants are very useful, since they can be regarded as the views of real expert users. In our opinion, the strong and weak points of the systems can be deduced from the questionnaires returned, and the answers give more insight into the most important features of a good spoken dialogue system. We therefore present some of the results here. In doing so, we focus on the distinguishing features of the systems and how they were perceived by the Eurospeech'97 attendees, and we set aside the competition element of the Elsnet Olympics.
2. Ten spoken dialogue systems
Nine out of the ten systems are information systems: some type of information (e.g. departure and arrival times of trains, weather forecasts, etc.) is given after callers have indicated what they want to know. The exception is the Japanese (1) "Aizula" system (Ward, 1996), which is not an information system. This system takes over, after some time, the back-channel feedback (interjections like 'oh', 'really', 'hmm') from a human listener. It mainly uses the intonation contour of the caller's speech to identify the moments at which it may insert the Japanese back-channel utterance 'un'. The comments on the questionnaire showed that some users liked the idea and thought it a very useful technique to be integrated into a dialogue system.
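The idea of timing back-channels from the intonation contour can be illustrated with a toy detector. The sketch below is our own illustration, not the published Aizula algorithm: it triggers an 'un' when the caller's pitch stays in the low region of their range for a minimum duration, with a refractory period between triggers. All thresholds and the frame format are illustrative assumptions.

```python
# Toy back-channel timing detector in the spirit of the Aizula system.
# Assumption: the recognizer front end supplies one F0 value (Hz) per
# 10 ms frame, with 0 marking unvoiced frames. Thresholds are invented.

def backchannel_times(f0_frames, frame_ms=10, low_quantile=0.26,
                      min_low_ms=110, refractory_ms=800):
    """Return the times (ms) at which a back-channel 'un' may be inserted."""
    voiced = sorted(f for f in f0_frames if f > 0)
    if not voiced:
        return []
    # "Low pitch" = bottom part of the caller's own pitch range.
    threshold = voiced[int(low_quantile * (len(voiced) - 1))]
    triggers, low_run, last = [], 0, -refractory_ms
    for i, f in enumerate(f0_frames):
        t = i * frame_ms
        # Count how long the pitch has stayed low and voiced.
        low_run = low_run + frame_ms if 0 < f <= threshold else 0
        if low_run >= min_low_ms and t - last >= refractory_ms:
            triggers.append(t)          # insert 'un' here
            last, low_run = t, 0
    return triggers
```

Used on a contour that drops from 200 Hz to 100 Hz, the detector fires once the low-pitch region has lasted long enough.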
Due to line problems, only seven questionnaires were returned for the second Japanese system, (2) "Actis". This system, produced by the international telephone company KDD, is meant for callers who want to make international phone calls. It provides information about area codes, country codes, and time differences between Japan and the destination of the call. The lexicon covers the names of 300 countries and a thousand cities all over the world.
Since the questionnaire was not really appropriate for the Aizula system and since very few Japanese callers used the Actis system, we did not include results on these systems in our analysis.
MIT provided an information system called (3) "Jupiter" (Zue et al., 1997). The version of Jupiter used at the Elsnet Olympics is a US English conversational system that gives current weather forecasts over the telephone for over 500 cities around the world (about 350 in the US). In the near future the system will be able to handle multilingual calls (German, Mandarin, Spanish). It can answer queries about general weather forecasts, temperature, humidity, wind speed, sunrise/sunset times, and weather alerts (flooding, hurricanes, etc.). A 1400-word vocabulary is used, and the weather information is obtained from Web sources. Text-to-speech synthesis is used for the output speech. The system is a laboratory prototype.
The Spanish (4) "STACC" system was developed by the University of Granada (Rubio et al., 1997). This system-driven service allows students to consult their marks. Students have to select one of two possible degrees and one of six possible courses, and have to speak their first and last name and an eight-digit identification number (spoken as connected digits). Only when the name and identification number match does the system provide the requested mark. The lexicon contains about 200 words (for each degree and course, about 35 different names and identification numbers have to be recognized).
The (5) "O-tel" system was developed by the University of Maribor. This experimental system is an automatic reverse directory service that can handle Slovenian, English, and German. The user has to enter digits in any of these languages. Since isolated-word recognition is used, the digits must be separated by short pauses. The system repeats each digit; when a digit is not recognized correctly, the user can erase the misunderstood digit by saying "delete". The most distinguishing feature of this system is its talk-through capability. The telephone directory tested at Eurospeech included 50 subscribers from Slovenia. The output is given by word-based synthesis.
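The echo-and-erase interaction can be sketched as a small accumulator. This is a minimal illustration of the dialogue logic only, with invented function and token names; it assumes the isolated-word recognizer delivers one token ("0"-"9" or "delete") per utterance, as in the O-tel interaction style described above.

```python
# Minimal sketch of O-tel-style digit entry with a spoken "delete"
# correction. Names and the 7-digit default are illustrative assumptions.

def collect_number(tokens, length=7):
    """Accumulate recognized digits; "delete" erases the most recent one.
    Returns the number string once `length` digits have been collected."""
    digits = []
    for tok in tokens:
        if tok == "delete":
            if digits:
                digits.pop()       # erase the misrecognized digit
        elif tok.isdigit():
            digits.append(tok)     # a real system would echo the digit here
        # other tokens are ignored; a real system would re-prompt
        if len(digits) == length:
            return "".join(digits)
    return None
```

For example, a caller who hears the system echo a wrong third digit says "delete" and simply re-speaks it.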
The remaining five systems are all train timetable information systems, giving information on departure and arrival times of trains and related matters. Within the Sundial project, the University of Erlangen designed the German (6) "EVAR" system, which provides information on German InterCity timetables; this system is now mainly run for research purposes (Eckert et al., 1993). The research emphasis lies on the (rather free) dialogue and less on the robustness and speed of the speech recognizer. The vocabulary contains 1600 words. The dialogue manager can cope with anaphora and ellipsis, and it has a variety of recovery strategies for different exceptional situations. The user can always go back and change information already given. There is a spelling mode, but the user is only asked to spell the name of a station if the standard dialogue strategies fail.
The four remaining systems are all related to the Arise project. This European research project is partly funded by the European Commission under the Language Engineering sector of the Fourth Framework Telematics Applications Programme. Four prototypes are under development (one for Dutch, one for Italian, and two for French); all four systems were present at the Elsnet Olympics.
The Italian (7) "Arise" system, Dialogos (Albesano et al., 1997), was developed by CSELT. It is a continuous-speech dialogue system with a vocabulary of 3,500 words, including all 3,000 Italian station names. The dialogue module interprets the semantic content of the user's utterances by taking into account both previous utterances and data pertaining to the application. At the dialogue level, different clarification and correction subdialogues are supported, and the system is able to detect user-initiated repairs. Text-to-speech synthesis is used for the output speech. A system-driven version of this dialogue system, which uses only isolated-word recognition, has already been integrated into the F.S. railway call center in Milan, Italy (tel. ++.39.2.29041).
For French, two Arise systems have been developed. The (8) "LIMSI Arise" system provides information on train schedules, fares, reductions, and services. Continuous speech recognition with task-dependent acoustic models is used; the lexicon contains 1500 words, 680 of which are station names. It is possible to interrupt system prompts (barge-in), and speech output is handled by synthesis by concatenation of about 2000 pre-recorded units.
The (9) "IRIT Arise" system uses the technology developed by Philips (speech recognition as well as dialogue management). The lexicon contains x words (x station names). It is a conversational system. Concatenation of pre-recorded speech is used for the system's output.
The Dutch (10) "Arise" system is also based on the Philips technology. This conversational system uses a lexicon of 1380 words (680 station names). Speech recognition uses context-dependent acoustic models (triphones), and the system has been trained on more than 11,000 dialogues. Here too, concatenation of pre-recorded units is used for the system's output.
3. Some observations
The aim of the questionnaire was to obtain information on how users experienced the systems. Questions related to the appreciation of the technologies used, the adequacy of the system-user interaction, and overall satisfaction. To get a basic user profile, we asked respondents to indicate their proficiency in the system's language and their level of acquaintance with spoken dialogue systems. Respondents could also indicate strong and weak points of the systems and give general comments.
We are aware that assessment methodologies for spoken dialogue systems have only recently come under development. Ways have to be found to assess the quality of systems that are designed to perform different tasks for different groups of users in different languages. Still, we are convinced that it should be possible to assess systems irrespective of the factors in which they differ. Systems can be tested to check whether they do what they are supposed to do from the users' point of view (see e.g. Walker et al. (1997)). In this contribution we report on four questions that can be considered important for the users' perception of each system: speech recognition, intelligibility of the system's speech, error recovery, and task completion. More detailed and elaborate analyses of all questions posed will be presented at the First International Conference on Language Resources and Evaluation in Granada (28-30 May, 1998).
3.1 Speech recognition
Generally speaking, the experts were rather satisfied with the performance of the speech recognition in the systems (average score 3.5 out of 5). The appreciation of the recognition did not seem to correspond closely to the size of the system's vocabulary. Recognition of a limited vocabulary, as in the O-tel system (isolated digits and about 8 other words), does not automatically result in the perception of good speech recognition quality. An error in, for instance, the recognition of the language to be used (German, English, or Slovenian) is considered very serious. Errors in the recognition of such critical, frequent words occurred in other systems as well. Speech recognition in the STACC system and in the Italian and Dutch Arise systems was considered good. For the STACC system, the vocabulary to be recognized is rather small and the dialogue is system-driven. The Italian system also uses a system-driven dialogue, but here the vocabulary is the largest of all systems. The vocabulary of the Dutch system is half the size of the Italian one, but the Dutch system uses a conversational dialogue strategy, which results in a more difficult task for the recognizer. Mixed remarks were made on the capability of the systems to cope with non-native speakers. Some users were surprised to see systems doing very well; others became frustrated by persistent recognition errors.
3.2 System's intelligibility
The intelligibility of the systems' speech was scored even better than the quality of speech recognition (on average 4.2 out of 5). Most systems used concatenated pre-recorded speech; two of them (MIT and the Italian Arise system) used text-to-speech synthesis. Of course, text-to-speech output is not considered very natural, but for more or less comparable systems concatenation of pre-recorded utterances is not always considered better. If the pre-recorded utterances differ in loudness, or if the prosody is not handled properly (as was mentioned for the IRIT Arise and EVAR systems), concatenation does not result in acceptable speech quality. The concatenated speech of the LIMSI system was considered very good; the comments mentioned that the voice is very agreeable and that concatenation is done very smoothly. A couple of times the voice of the LIMSI system was even considered too human: people forgot that they were talking to a machine, which is nice as long as everything goes well, but confusing and counterproductive when, for example, speech recognition errors occur.
3.3 Error recovery
Error recovery is a very important feature of spoken dialogue systems. It is especially when errors occur that the limits of the systems become clear, and proper handling of errors probably contributes largely to the general appreciation of a system by a user. With the exception of the Italian Arise system, none of the systems scored particularly well on this question (average score 2.9 out of 5). Comments showed that it was often hard to correct the system when speech recognition errors occurred or when the dialogue went in a wrong direction. Simply repeating error messages does not seem to be a good solution. The Italian Arise system is able to detect misunderstandings (Danieli, 1996); it first tries to resolve the misunderstanding by directly asking for confirmation or for a choice between mutually inconsistent parameter values, and if the recovery subdialogue becomes too long, the system degrades the interaction to isolated-word recognition.
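The escalation strategy described for the Italian Arise system can be sketched as a small state machine. This is our own hedged illustration of the general pattern (confirm first, degrade the input mode if recovery drags on), not CSELT's implementation; the class name, mode labels, and turn limit are all invented.

```python
# Illustrative error-recovery escalation, loosely following the strategy
# described for the Italian Arise system. All names and the turn limit
# are our assumptions, not the published design.

class RecoveryManager:
    def __init__(self, max_recovery_turns=3):
        self.mode = "continuous"          # normal continuous-speech dialogue
        self.recovery_turns = 0
        self.max_recovery_turns = max_recovery_turns

    def on_turn(self, misunderstanding_detected):
        """Decide the next dialogue action for one user turn."""
        if not misunderstanding_detected:
            self.recovery_turns = 0       # dialogue is back on track
            return "proceed"
        self.recovery_turns += 1
        if self.recovery_turns > self.max_recovery_turns:
            self.mode = "isolated-word"   # degrade to a more robust input mode
            return "switch_to_isolated_word"
        # Clarify by asking for confirmation or a choice between
        # mutually inconsistent parameter values.
        return "ask_confirmation"
```

The point of the pattern is that the system never simply repeats an error message: each failed recovery turn moves it toward a more constrained, more robust interaction style.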
3.4 Task completion
The question on task completion summarizes the user's experience with the information retrieval of the system and with the interaction in case of errors. Systems scored 3.5 on average, and scores were highly correlated with the questions on error handling and system reactions in general. The Italian Arise and STACC systems scored highest. Comments related to the type of information given (the Italian Arise system presented far more information than requested, without a barge-in facility; the handling of unforeseen tasks, unknown destinations, etc.), failures in error recovery (over two minutes of silence, 11 trials before an answer), and speed (very long waiting times versus explicit appreciation of quick answers).
Elsnet's initiative to organise a test of spoken dialogue systems at a major conference has been successful and deserves continuation at subsequent events (and possibly with other topics and systems as well). Given the various uncontrollable factors during the test, the results should be interpreted with care. For the participating systems the Elsnet Olympics has been a big challenge, and the system developers should be praised for their courage in bringing their systems to such a demanding audience. They are well rewarded with system promotion and useful expert-user feedback.
References

Albesano, D., Baggia, P., Danieli, M., Gemello, R., Gerbino, E., and C. Rullent (1997) A Robust System for Human-Machine Dialogue in Telephony-Based Applications. In International Journal of Speech Technology, vol. 2, n. 2, pp. 99-110.
Danieli, M. (1996) On the use of expectations for detecting and repairing human-machine miscommunications. In Proceedings of the AAAI-96 Workshop on Detecting, Preventing, and Repairing Human-Machine Miscommunications, Portland, OR, pp. 87-93.
Eckert, W., Kuhn, T., Niemann, H., Rieck, S., Scheuer, A., and E.G. Schukat-Talamazzini (1993) A Spoken Dialogue System for German InterCity Train Timetable Inquiries. In Proceedings 3rd European Conference on Speech Communication and Technology, Berlin, pp. 1871-1874.
Rubio, A.J., García, P., Torre, Á. de la, Segura, J., Díaz-Verdejo, J., Benítez, M.C., Sánchez, V., Peinado, A.M., and J.L. López-Córdoba (1997) STACC: An Automatic Service for Information Access Using Continuous Speech Recognition Through Telephone Line. In Proceedings 5th European Conference on Speech Communication and Technology, Rhodes, pp. 1779-1782.
Walker, M.A., Litman, D.J., Kamm, C.A., and A. Abella (1997) Evaluating Interactive Dialogue Systems: Extending Component Evaluation to Integrated System Evaluation. In Proceedings, pp. 1-8.
Ward, N. (1996) Using Prosodic Clues to Decide When to Produce Back-channel Utterances. In Proceedings 4th International Conference on Spoken Language Processing, Philadelphia, pp. 1728-1731.
Zue, V., Seneff, S., Glass, J., Hetherington, L., Hurley, E., Meng, H., Pao, C., Polifroni, J., Schloming, R., and P. Schmid (1997) From Interface to Content: Translingual Access and Delivery of On-line Information. In Proceedings 5th European Conference on Speech Communication and Technology, Rhodes, pp. 2227-2230. Also http://www.sls.lcs.mit.edu/jupiter.