In general, the ability to process input from different modality devices adds inherent robustness to a system, since the interpretation of the user's communicative acts can be based on input from different channels: errors in one channel can be compensated for by information coming from another channel. On the other hand, the use of additional modalities is also likely to introduce ambiguities and uncertainty which the system must be able to deal with. Moreover, multimodality puts heavy demands on flexible interaction modelling, since merging input from different sources requires extensive reasoning capabilities both in understanding and in responding to partial, erroneous, multi-channel input. The way in which different modalities reinforce or complement each other is still poorly understood and exploited. Modality integration is thus a compelling research topic. In fact, the representation and integration of multimodal messages in dialogue systems is also a very important issue for commercial parties. On the generation side, a dialogue system's combination of speech with other modalities to present information to the user also poses a number of unsolved problems regarding the choice, timing and consistency of the output.
To avoid building systems where the processing of multimodal inputs and generation of multimodal outputs is implemented as a series of idiosyncratic procedures tailored to specific tasks, standards and generic methodologies for modality integration should be studied. Such methodologies should enable systems to make the most of the redundancy introduced by multimodality and, so to speak, find (or present) the right information in the right thread. Furthermore, they should make multimodal systems more easily scalable and portable across domains.
The aim of this network is to contribute to the development of such standards, architectures and methodologies - and to a deeper understanding of how language and other modalities best complement each other in computer interfaces - by bringing together research institutes working with multimodal interaction in the Nordic countries. The relevance of such a network becomes clear when one considers the various relevant activities currently undertaken in the Nordic countries. These include research projects of national and European scale, courses (e.g. the 7th European Summer School on Language and Speech with the theme Multimodality in Language and Speech Systems, held in Stockholm in 1999 and arranged by KTH), and basic research carried out at individual research institutes. Furthermore, we believe funding a Nordic network on multimodal interaction is relevant to the Nordic language technology research programme - and in particular to the theme Human-computer interaction in natural language - not only because language is one of the modalities used, but also because techniques from NLP can be expected to play a major role in models of multimodal integration. In this respect, it is interesting to note that the growing interest in multimodal interaction is opening a new perspective on Nordic dialogue research, which is already acknowledged internationally.
The creation and running of a network on multimodality cannot be achieved by the individual efforts of the interested institutes. In order to produce fruitful and useful results, coordination of the work is needed, and joint activities must be planned, organised and seen through. Thus although willingness to participate and need for such a network already exist, an operational network requires that a basic infrastructure for coordination and management, and financial support of joint activities are in place. We believe that by providing this financial support, the Nordic language technology programme would contribute to the development of a very promising area in which Nordic research stands a good chance of achieving remarkable international results.
A central issue, and one where language technology research results may be capitalised on, is that of multimodal integration. A promising approach put forward by several researchers is in fact that of using techniques known from NLP (see Johnston et al. 1997). A distinction similar to that made in NLP between grammar rules and parsing algorithms can be made between a multimodal grammar and an algorithm for applying the grammar to input from multiple modalities. By upholding this separation of process and data, the merging of inputs from different modalities can be made more general: the entire representation becomes media-independent, and any procedures defined for modality integration within the processing stages are then applicable regardless of which input modalities the information in question originates from. Finally, defining algorithms for modality integration independently of the specific modalities used in a particular application also increases the chances that components of the system can be extended and/or reused. For example, in the Danish research project Staging, Center for Sprogteknologi (CST) has developed a multimodal dialogue interface to a virtual environment (see Paggio et al. 2000) where speech, keyboard and gestural inputs are merged by a feature-based parser. These results will be shared with the other network participants, and extended as a result of CST's engagement in the network. Another Danish partner, the SMC group from Aalborg, also has extensive research and teaching experience in the area of multimodality, complemented with expertise in speech processing.
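The idea of a media-independent representation merged by unification can be illustrated with a minimal sketch. The feature names below (an underspecified spoken command completed by a pointing gesture) are hypothetical illustrations, not the actual Staging or Johnston et al. representations:

```python
# A minimal sketch of feature-structure unification for modality merging.
# Feature structures are plain dicts; "action", "object", "id" etc. are
# invented for the example.

def unify(fs1, fs2):
    """Recursively unify two feature structures; return None on a clash."""
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            sub = unify(result[key], value)
            if sub is None:
                return None          # incompatible substructures
            result[key] = sub
        elif result[key] != value:
            return None              # atomic value clash
    return result

# Spoken input "move that" leaves the referent underspecified ...
speech = {"action": "move", "object": {"type": "box"}}
# ... while a pointing gesture contributes the referent's identity.
gesture = {"object": {"type": "box", "id": "box-7"}}

merged = unify(speech, gesture)
# merged == {"action": "move", "object": {"type": "box", "id": "box-7"}}
```

Because `unify` never inspects where a structure came from, the same routine merges speech with gesture, keyboard with gaze, or any other pair of modalities - the media-independence argued for above.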
Another promising approach to modality integration is the use of machine learning techniques, especially techniques such as neural networks, which have already been successfully applied e.g. to speech recognition and various classification tasks. As has been the case in many other application domains, hybrid systems mixing rule-based approaches with machine learning algorithms may well provide the most interesting results for multimodal integration too. Although rule-based methods in general work reasonably well, it is a well-known problem that explicitly specifying the steps, i.e. the rules required to control the processing of the input, is a difficult task, and as the domain becomes more complex, the rules become more complex too. Often the correlation between input and output is difficult to specify. This is the case e.g. with multimodal interfaces, and thus approaches which are both robust and able to adapt to new inputs are needed. Expertise in this domain is brought to the network by the Media Lab at the University of Art and Design Helsinki (UIAH), and especially its Soft-Computing Interfaces Group, which is devoted to designing adaptive interfaces and developing tools for human-machine interaction, relying on nature-like emergent knowledge that arises from subsymbolic, unsupervised processes of a self-organizing nature (see e.g. Koskenniemi et al. 2001, Jokinen et al. 2001). One of the Media Lab's goals is also to explore the impact of new digital technology on society, and to evaluate, understand and deal with the challenges it poses to the design of information technology products. In this, multimodality plays an active role in opening new possibilities for communication, interaction, education and expression, and the network will provide an important channel for planning and integration of matters relating to interactive media.
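The hybrid architecture described above can be sketched very simply: hand-written rules handle the clear cases, and a learned classifier handles the rest. The rules and the classifier below are toy stand-ins invented for the example (a real system would use a trained model, e.g. a neural network):

```python
# Hypothetical hybrid interpreter: symbolic rules first, learned fallback.

def rule_based(utterance):
    """Return an interpretation when a rule fires, else None."""
    if utterance.strip().endswith("?"):
        return "question"
    if utterance.lower().startswith(("please", "could you")):
        return "request"
    return None            # no rule applies: defer to the learned model

def learned_classifier(utterance):
    # Stand-in for a trained model; a fixed default keeps the sketch
    # self-contained.
    return "statement"

def interpret(utterance):
    # Rules take precedence; the classifier covers everything they miss.
    return rule_based(utterance) or learned_classifier(utterance)

print(interpret("Where is the box?"))   # question (rule)
print(interpret("Please move it."))     # request (rule)
print(interpret("The box is red."))     # statement (learned fallback)
```

The design point is the division of labour: the rules stay small and auditable, while the adaptive component absorbs the input-output correlations that are hard to specify by hand.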
To fully exploit multimodality in various interfaces, it is important to know how neurocognitive mechanisms support multimodal and multisensory integration. Compared to the research devoted to single sensory systems, there has been very little research on the mechanisms by which information received via different senses is integrated. However, the Cognitive Science and Technology research group at the Helsinki University of Technology is using various methods to uncover the integration principles of auditory and visual speech. On the basis of the results, mathematical models of the integration are being developed. The group is also developing a Finnish artificial person, a talking and gesturing audiovisual head model. The model will be used in practical dialogue systems, and will also serve as a well-controlled stimulus for neurocognitive studies.
In the same way as rule-based integration of modalities can be enhanced using machine learning techniques, results obtained through purely probabilistic analysis methods may well be boosted by the addition of symbolic rules. An example relevant to multimodal interfaces is provided by the algorithms for character and word prediction used in connection with eye-tracking, where the system tries to guess what the user is "typing with the eye". Although the performance of the probabilistic approaches implemented in current systems is promising, language technology techniques seem to constitute a valuable add-on. This is an issue that the group at the IT University of Copenhagen is working on.
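A minimal sketch of the combination, assuming an invented toy corpus and lexicon: a frequency model proposes completions of the prefix typed so far, and a symbolic filter (here a tiny lexicon, standing in for a morphological analyser) keeps only well-formed words.

```python
# Illustrative word completion for "eye typing": probabilistic ranking
# constrained by a symbolic lexicon. All data is made up for the example.

from collections import Counter

corpus = "the system tries to guess what the user is typing with the eye".split()
unigram = Counter(corpus)          # word frequencies from the toy corpus
lexicon = set(corpus)              # symbolic stand-in: only attested words allowed

def complete(prefix, k=3):
    """Rank in-lexicon completions of a prefix by corpus frequency."""
    candidates = [w for w in lexicon if w.startswith(prefix)]
    return sorted(candidates, key=lambda w: -unigram[w])[:k]

print(complete("t"))   # 'the' ranks first, being the most frequent
```

In a real system the unigram counts would be replaced by a proper language model, and the lexicon filter is where the language technology add-on argued for above comes in.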
A third research issue regards the interpretation of multimodal input and the generation of multimodal output in relation to a dialogue model and to a model of the domain and task at hand. Several language technology institutes in the Nordic countries have contributed substantially to dialogue research, and have developed dialogue models as well as implemented dialogue systems. Notable examples are the Department of Linguistics at the University of Göteborg, the Natural Interactive Systems Laboratory (NISLab) at the University of Southern Denmark, and the natural language processing research group (NLPLab) at the University of Linköping, all of which will participate in the network. The Göteborg group has extensive experience in corpus collection and dialogue management. They have developed tools for spoken language analysis and coding which can be applied to the collection and analysis of multimodal dialogues, thus providing an empirical basis and insight for research on multimodal interaction: how different modalities are used in human-human communication (Allwood, 2001). NISLab has a strong background in dialogue management, component and system evaluation, and spoken dialogue corpus coding, having led the EU projects DISC and DISC2 (1997-2000) on best practice in the development and evaluation of spoken language dialogue systems and components (see www.disc2.dk), as well as the EU project MATE (1998-2000), which developed the MATE Workbench for multi-level and cross-level annotation of spoken dialogue.
NISLab is currently in the process of generalising the DISC and MATE results by addressing best practice in the development and evaluation of natural interactivity systems and components (in the EU project CLASS, 2000-2002), surveying data resources, coding schemes and coding tools for natural interactivity (EU-US project ISLE, 2000-2002), and building the world's first general-purpose coding tool for natural interactive communicative behaviour (EU project NITE, 2001-2002). NLPLab at Linköping University has for almost two decades conducted research on dialogue systems and now has a platform for the development of multimodal dialogue systems for various applications, to be developed further towards an open source code repository (Degerstedt & Jönsson, 2001). The current focus is on integrating dialogue systems with intelligent document processing techniques in order to develop multimodal dialogue systems that can retrieve information from unstructured documents, where the request requires that the user, in a dialogue with the system, specifies their information needs (Merkel & Jönsson, 2001). At KTH several multimodal dialogue systems have been developed. The first system, Waxholm, was a multimodal system exploring an animated agent (Carlson & Granström, 1996). Current work and interest involves research on multimodal output using animations and, to some extent, multimodal input using both speech and pointing (Gustafson et al., 2000).
Another branch of research includes the development of generic technology resources in an open source code repository. This involves a method for the development of dialogue systems (Degerstedt and Jönsson, 2001), as well as the design of generic system architectures. The Jaspis framework developed at the TAUCHI group at the University of Tampere (Turunen and Hakulinen, 2000) provides an agent-based flexible development platform which has been applied to various dialogue applications.
More concretely, the network will organise a number of activities, described below.
The network will coordinate with broader, multidisciplinary oriented networks like Nordic Interactive and its NIRES research school, which concentrate on interactive digital media. MUMIN will complement these networks by focussing on the multimodal and language technology aspects of the interaction. It will also seek contact with the ACL/ISCA Special Interest Group on Discourse and Dialogue (SIGdial), whose current president is Professor Laila Dybkjær from NISLab.
We expect 28 PhD students to participate in the network, distributed among the participating countries as follows: 10 in Denmark, 9 in Finland and 9 in Sweden. However, we expect the network participants to attract a larger number of students than those formally "registered". The network will also support PhD students' visits to other Nordic countries.
Throughout the whole period: email discussions and web site maintenance. The network will also be present at NoDaLiDa.
Bernsen, N. O. (2001) Multimodality in language and speech systems - from theory to design support tool. Chapter to appear in Granström, B. (Ed.): Multimodality in Language and Speech Systems. Dordrecht: Kluwer Academic Publishers.
Bernsen, N. O., Dybkjær, L. (2001) Combining multi-party speech and text exchanges over the Internet. Proceedings of Eurospeech 2001, pp. 1189-1192.
Carlson, R. and Granström, B. (1996). The WAXHOLM spoken dialogue system. Acta Universitatis Carolinae Philologica 1, pp. 39-52.
Degerstedt, L. and Jönsson, A. (2001). A Method for Iterative Implementation of Dialogue Management. IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle.
Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D. and Wirén, M. (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Proceedings of ICSLP 2000, Beijing, vol. 2, pp. 134-137.
Johnston, M., Cohen, P. R., McGee, D., Oviatt, S. L. and Pittman, J. A. (1997). Unification-based multimodal integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, pp. 281-288.
Jokinen, K., Hurtig, T., Hynnä, K., Kanto, K., Kaipainen, M. and Kerminen, A. (2001). Dialogue Act classification and self-organising maps. In Proceedings of the Neural Networks and Natural Language Processing Workshop, Tokyo, Japan.
Koskenniemi T., Kerminen A., Raike, A. and Kaipainen, M., (2001) Presenting data as similarity clusters instead of lists. In Proceedings of the 1st International Conference on Universal Access in Human-Computer Interaction, New Orleans, USA.
Merkel, M. and Jönsson, A. (2001). Towards multimodal public information systems. Proceedings of the 13th Nordic Conference on Computational Linguistics, NoDaLiDa '01, Uppsala, Sweden.
Nivre, J., Tullgren, K., Allwood, J., Ahlsén, E., Holm, J., Grönqvist, L., Lopez-Kästen, D., and Sofkova, S. (1998). Towards multimodal spoken language corpora: TransTool and SyncTool. Proceedings of ACL-COLING 1998.
Paggio, P., Jongejan, B. and Madsen, C. B. (2000). Unification-based multimodal analysis in a 3D virtual world: the Staging project. In Proceedings of the CELE-Twente Workshop on Language Technology: Interacting Agents, pp. 71-82.
Sams, M., Manninen, P., Surakka, V., Helin, P. and Kättö, R. (1998). Effects of word meaning and sentence context on the integration of audiovisual speech. Speech Communication, 26, 75-87.
Sams, M., Kulju, J., Möttönen, R., Jussila, V., Olivés J-L., Zhang, Y., Kaski, K., Majaranta, P., Räihä, K-J. (2000). Towards a high-quality and well-controlled Finnish audio-visual speech synthesizer. Proceedings of The 4th World Multiconference on Systemics, Cybernetics and Informatics (Sci'2000), Orlando, Florida (USA).
Turunen, M. and Hakulinen, J. (2000). JASPIS - a Framework for Multilingual Adaptive Speech Applications. In Proceedings of the 6th International Conference on Spoken Language Processing, ICSLP 2000.