Speech-to-speech machine translation

Related


"The industry realises that the incorporation of speech technology will imply the ultimate step to bring computers closer to the general public."

EuTrans: web page

Consortium:

"To the extent that personal computers are being equipped with more and more telematic applications, together with the impending arrival of third generation mobile phones, reliable speech recognition systems become a must."
Other related projects:
TransType2

"TT2 aims at facilitating the task of producing high-quality translations, and make the translation task more cost-effective for human translators. Research progress will thus be measurable in terms of the increased productivity of translators using this new computer-assisted translation system. "


Verbmobil

Vision
"The vision behind the Verbmobil project was a portable translation device that one could carry to a meeting with speakers of other foreign languages.
Languages
Input in English, German or Japanese; the translation is bidirectional, English-German and Japanese-German.
Funders
The project was funded by the German Ministry for Research and Technology (BMFT) and an industrial consortium (including Alcatel, Daimler-Benz AG, IBM Deutschland, Philips GmbH, Siemens Aktiengesellschaft). For the first four years of the project the BMFT funding amounted to 60 Million DM. "
Verbmobil resources at ELDA
Including 200 spontaneous transliterated dialogues in Denglish (English spoken by Germans).


LC-STAR: Lexica and Corpora for Speech to Speech Translation Components

Fame: Facilitating Agents in Multicultural Exchange

JANUS
Language Technologies Institute (LTI), School of Computer Science, Carnegie Mellon University.
Travel Domain - conversations between travel agents and clients.
Languages: English, German, Japanese, Korean, Italian and French.
System applications include an Interactive Video Translation Station, a Portable Translator, and a Passive Dialog Interpreter.
Approach: speech-to-speech translation of spontaneous conversational dialogs in multiple languages, primarily using an interlingua-based approach.


"Another notable point is that task success (73.8%) is higher than translation accuracy (51.8%). This confirms the need for Task Based Evaluation (TBE) in addition to Accuracy Based Evaluations (ABE). "

"The reason for task success being higher than translation accuracy is that both experienced and inexperienced users accepted some bad translations as long as they can be understood in context. For example, in the context of the question How much does it cost?, users will accept the answer128 hours."

From Lessons Learned from a Task-Based Evaluation of Speech-to-Speech Machine Translation (Levin et al., 2000).

An interview with Francisco Casacuberta and Enrique Vidal


by Joseba Abaitua

"At present, there are only a few speech-to-speech machine translation projects of some relevance either in Europe or in the United States, and Japan. Nevertheless, there is no doubt that its importance is continually increasing, the same as the technology of cellular telephony and machine translation technologies. Without a doubt, in a few years time, speech-to-speech machine translation will be a commonplace thing."

Because oral language is the most spontaneous and natural form of communication among people, speech technology is perceived as a determining factor in achieving better interaction with computers. The industry is aware of this and realises that incorporating speech technology will be the final step in bringing computers closer to the general public. As personal computers are equipped with more and more telematic applications, and with the impending arrival of third-generation mobile phones, reliable speech recognition systems become a must. In recent years there has been important progress, although limitations remain (in vocabulary, in domain coverage, in the treatment of disfluencies, and so on). Despite these limitations, the technology is already able to offer a wide range of services.

One of the most attractive applications is, without a doubt, speech-to-speech machine translation. A small number of projects have contributed most significantly to the development of this technology, among which Verbmobil, sponsored by the German government, and the European project EuTrans are worth citing. In this interview we talk to two representatives of one of the Spanish research groups that has gained the most recognition in recent years for its research on speech-to-speech translation: the Pattern Recognition and Human Language Technology (PRHLT) group of the Universitat Politècnica de València (UPV), directed by Francisco Casacuberta Nolla and Enrique Vidal Ruiz.

PRHLT-ITI

This group carries out its research activity both in speech technologies and in computer vision. In addition to the two directors, the PRHLT subgroup devoted to speech-to-speech translation in the EuTrans project comprises Carlos Martínez Hinarejos, Francisco Nevado Montero, Moisés Pastor Gadea, David Picó Vila and Alberto Sanchis Navarro, who belong to the Computer Science Institute (ITI) of the UPV, where they also lecture, and David Llorens Piñana and Juan Miguel Vilar Torres, from the Universitat Jaume I (UJI).

The group has also developed other research projects on speech translation, such as "EXTRA: Example-based extensions to text and speech translation in restricted domains" and "Translation and comprehension of spoken language through example-based learning techniques: TRACOM", both funded by the Spanish Interministerial Commission for Science and Technology (CICYT). The group is currently participating in a new European project, "TransType2 - Computer-Assisted Translation" (TT2).

EuTrans

Question: In what context has your research been conducted lately?

Enrique Vidal: Recent research has been carried out within the framework of the EuTrans project, financed by the European Community's ESPRIT programme (actions 20268 and 30268). The consortium that carried out the project was formed by the University of Aachen (Germany), the research centre of the Fondazione Ugo Bordoni (Italy), the German company ZERES GmbH, and our group at the Computer Science Institute (ITI) of the Universitat Politècnica de València (UPV), which led the project. The project involved two different stages. The first stage (in 1996), lasting only six months, demonstrated the viability of the proposed approach on a task of moderate complexity. In the second stage, which ran for three years (from 1997 to 2000), methodologies were developed to address real tasks.

The approach

Question: What is your approach to speech translation?

Enrique Vidal: In the classical approach, a typical speech translator works in two stages. In the first stage speech recognition takes place, converting the input sound into a source-language text. Then, in the second stage, that source text is translated into a target-language text. This approach is also known as uncoupled or serial translation. In our approach, speech recognition and translation are carried out simultaneously. This is possible because the word acoustic models that are part of the speech recognition system are integrated into the translation model. We call it the integrated approach. Integration is possible because hidden Markov models and finite-state transducers are very similar in essence.
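As a rough illustration of the contrast Vidal describes (a toy sketch only: the words, scores and weights below are invented, and the real EuTrans systems composed hidden Markov models with finite-state transducers learned from bilingual corpora, not word-level tables like these), a serial pipeline commits to a single recognition hypothesis before translating, whereas an integrated decoder combines acoustic and translation evidence before deciding:

# Toy contrast between the serial (cascade) pipeline and the integrated decode.
# Everything here is invented for illustration purposes only.

# Acoustic scores p(audio segment | source word), one dict per audio segment.
acoustic = [
    {"una": 0.8, "la": 0.2},
    {"habitacion": 0.9, "habitaciones": 0.1},
    {"noble": 0.55, "doble": 0.45},   # acoustically confusable pair
]

# Toy word-level "transducer": source word -> [(target word, arc weight)].
# Arc weights stand in for how often each pair occurred in a hotel-domain
# training bitext, so a domain prior is baked into them.
transducer = {
    "una": [("a", 0.70), ("one", 0.10)],
    "la": [("the", 0.60)],
    "habitacion": [("room", 0.65)],
    "habitaciones": [("rooms", 0.20)],
    "doble": [("double", 0.60)],
    "noble": [("noble", 0.02)],       # rare in this domain
}

def cascade(acoustic, transducer):
    """Serial approach: commit to one source word per segment, then translate.
    (A real recogniser would also use a source language model, but it still
    decides before any translation knowledge is consulted.)"""
    output = []
    for segment in acoustic:
        source = max(segment, key=segment.get)              # recognition decision
        target, _ = max(transducer[source], key=lambda a: a[1])
        output.append(target)
    return output

def integrated(acoustic, transducer):
    """Integrated approach: acoustic and translation scores are combined
    before deciding, so translation knowledge can veto an acoustic near-miss."""
    output = []
    for segment in acoustic:
        _, target, _ = max(
            ((s, t, p_ac * w) for s, p_ac in segment.items()
                              for t, w in transducer[s]),
            key=lambda x: x[2],
        )
        output.append(target)
    return output

print("cascade:   ", " ".join(cascade(acoustic, transducer)))    # ... noble
print("integrated:", " ".join(integrated(acoustic, transducer))) # ... double

In this toy example the cascade outputs "noble" for the ambiguous segment because the recogniser has already committed to it, while the integrated decode, weighing the domain-trained translation weights at the same time, recovers "double".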

Question: So, what are the advantages of this approach over the classical one?

Francisco Casacuberta: For one thing, no speech recognition system is perfect. Moreover, translation systems that work from text, as opposed to speech, take for granted that the source text is correct. As a result, in the classical approach the translation system has to deal with errors of all kinds for which it was not designed. This problem obviously has a solution, but a very expensive one.

Question: In short, what would you say is the most remarkable aspect of your approach?

Enrique Vidal: Our approach has two outstanding characteristics. One is the capacity to resolve, in a homogeneous and simultaneous way, the two phases of speech recognition and translation. The other is the possibility of generating the acoustic, lexical and translation knowledge sources automatically from examples.

But, as I said before, these systems have a limited application domain and need very large amounts of training data. As we know, corpus compilation is very expensive. Furthermore, the size of the resulting models poses another problem. Consequently, in order to make the approach viable, it is very important to restrict the domain of application.

Results

Question: How do the positive results of EuTrans translate into a more optimistic outlook for speech technologies?

Francisco Casacuberta: Within the framework of EuTrans several prototypes have been developed for relatively simple translation tasks. The language pairs developed were Spanish into English and Spanish into German. On top of these prototypes, two further translation systems (from Spanish and Italian into English) have been constructed for more realistic applications. The small error rates and excellent response times, close to real time, provide good practical support to our technological standpoint.

Two versions of the Spanish prototype were built with very different training sets, so that the influence of training-corpus size on the behaviour of the translation model could be measured. The first prototype was trained with nearly half a million sentence pairs that were generated semi-automatically from texts taken from tourist guides. The second was trained with only 10,000 sentence pairs.

This difference in the size of the training corpus was deliberate. We wanted to simulate a situation of scarce training samples and to verify its effect on the behaviour of the translator. The vocabulary contained in the corpus was 686 words in Spanish and 513 in English. The acoustic models had been trained on four hours of recorded speech. The oral test consisted of 336 sentences in Spanish pronounced by four speakers. With the larger training corpus and microphone input, the Spanish system produced a word error rate below 2% on the translated output. With the corpus of only 10,000 sentence pairs the outcome was considerably worse, with an error rate just under 8%.

The Italian prototype was trained with a very small sample of just 3,038 sentence pairs. The acoustic models had been trained on a corpus of some eight hours of speech, acquired directly from real telephone calls to a hotel reception desk using the Wizard of Oz technique. The translation corpus was transcribed orthographically from the oral corpus and translated manually into English. The oral test was carried out on 278 Italian sentences, with a word error rate of 22% on the translated output.
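The error figures quoted in these answers are word-level error rates. As a rough guide to how such a figure is obtained (a generic sketch, not the evaluation code actually used in EuTrans), the usual definition counts the minimum number of word substitutions, insertions and deletions needed to turn the system output into a reference translation, divided by the length of the reference:

# Minimal word error rate (WER) sketch: edit distance between the system
# output and a reference translation, normalised by the reference length.
# This is the standard textbook definition, not the project's own script.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example sentences, just to show the call:
print(word_error_rate("a double room please", "a double room pleases"))  # 0.25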

Question: In what domains have the prototypes been applied?

Enrique Vidal: The prototypes were designed for the restricted domain of person-to-person interactions at a hotel reception desk. Five types of interaction were expected: information requests, bookings, cancellations, complaints, and booking changes.

Question: This makes EuTrans similar to other speech-to-speech translation prototypes, such as Verbmobil or Janus. Is there some reason for this coincidence?

Enrique Vidal: This task is sufficiently restricted, with a vocabulary of moderate size, which makes it possible to demonstrate the viability of speech translation in a very direct way.

Question: What conclusions could be extracted from the evaluation tests?

Enrique Vidal: The results obtained clearly show the viability of the approach for concrete tasks and restricted discourse domains.

The future

Question: What is left to do in this field?

Francisco Casacuberta: Many problems remain open. In the first place, all those that affect speech modelling: unfavourable environments (for example, speech transmitted over cellular telephones), prosody, and other phenomena caused by spontaneous speech. Modelling dialogue, as is done in Verbmobil, would also help translation. With regard to translation models, it is necessary to solve the computational problems that affect larger models, for instance when the task requires large lexicons (of several tens of thousands of words).

Question: What is the state of speech-to-speech translation in Europe and outside Europe?

Enrique Vidal: Some systems are already being marketed. These employ the classical approach of a commercial speech recognition system coupled to a text-translation system. In any case, there are only a few projects of some relevance, whether in Europe, the United States or Japan. Nevertheless, there is no doubt that the importance of speech-to-speech translation is continually increasing, as is that of cellular telephony and machine translation technologies. In a few years' time, speech-to-speech machine translation will be commonplace.

Conclusions

The achievements of the EuTrans project reveal two things. The first is that speech-to-speech translation is conditional on the development of speech recognition technology itself. The second is that the models employed in speech recognition, which are trained on large collections of examples, have also proved valid for the development of speech translation. This implies that in the future the two technologies will be successfully integrated.

At present, however, speech-to-speech translation systems are scarcely available. In recent years speech recognition has made important progress thanks to the increasing availability of the resources needed for its development: large collections of spoken-language data and efficient data-driven processing techniques, such as those designed by the PRHLT-ITI group itself. However, the integration of these systems into market products is still complicated. We should not forget that the prototypes developed within research projects are only capable of processing a few hundred sentences (around 300), on very specific topics (accommodation booking, trip planning, etc.) and for a handful of languages (English, German, Japanese, Spanish, Italian). It seems unlikely that any application will be able to go beyond these boundaries in the short term.

The direct incorporation of speech translation prototypes into industrial applications is at present too costly. However, the growing demand for these products suggests that they will soon be on the market at more affordable prices. The systems developed in projects such as Verbmobil, EuTrans or Janus, despite still being at the laboratory stage, contain thoroughly evaluated and robust technologies. A manufacturer considering their integration may join R&D projects and take part in the development of prototypes with the prospect of quickly recouping the start-up costs. It is quite clear that we are witnessing the emergence of a new technology with great potential for expansion in the telecommunications and microelectronics markets of the immediate future.

Another remarkable aspect of the EuTrans project is its methodological contribution to machine translation as a whole, in both the spoken and written modes. Although these two modes of communication are very different in essence, and their respective technologies cannot always be compared, speech-to-speech translation has brought prospects of improvement for the other channel. Traditional methods for written text tend to be based on grammatical rules. As a result, many MT systems have no coverage problem, although this comes at the expense of quality. The most common way of improving quality is to restrict the domain of application, and it is widely accepted that broadening coverage immediately endangers quality. In this regard, learning techniques that enable systems to adapt automatically to new text types, styles, structures, and terminological and lexical items represent a radical contribution to the technology.

Given the differences between oral and written communication, rule-based systems designed for written text can hardly be readapted to oral applications; this is an approach that has failed. By contrast, the example-based learning methods designed for speech-to-speech translation systems can easily be adapted to written text, given the increasing availability of bilingual corpora. One of the main contributions of the PRHLT-ITI group is precisely its learning model based on bilingual corpora, and it is along this line of experimentation that interesting prospects for improvement in written translation lie.

Despite limitations in the number of languages, linguistic coverage and context, effective speech-to-speech translation will become available in the coming years, along with other voice-oriented technologies. It could be argued that EuTrans' main contribution is to have raised speech-to-speech translation to the level of speech recognition technology, making any new advance in speech recognition immediately accessible to translation.


Joseba Abaitua has a Ph.D. in Computational Linguistics from the University of Manchester Institute of Science and Technology (UMIST). For four years he worked on the Japanese-Spanish module of Fujitsu's ATLAS machine translation system. Since 1992 he has been professor of Linguistic Technology at the University of Deusto, where he has participated in several natural language processing projects. Alongside his academic work, he also acts as a technology consultant for AutomaticTrans.