Lexical Ambiguity in Machine Translation: Expressing Regularities in the Polysemy of Danish Motion Verbs.

Submitted as PhD thesis October 1997.

Abstract:

The primary aim of this thesis is to investigate new methods for handling lexical ambiguity in machine translation in order to achieve a more adequate treatment of polysemy in the lexicon. More specifically, we are interested in finding alternative ways of treating systematic polysemy, which is characterised by several words following the same pattern of meaning change.

A more empirical subgoal is to perform a corpus-based analysis of Danish motion verbs, which have not previously been examined from this viewpoint even though they are highly ambiguous and therefore very difficult to treat in machine translation. The purpose of this investigation is to identify the regularities that Danish motion verbs exhibit in their ability to undergo meaning change.

In the thesis we discuss how lexical ambiguity is traditionally treated in machine translation and show why simple sense-enumerating techniques, which make no distinction between unrelated meanings (homonymy) and related meanings (polysemy), are inadequate for several analysis and translation purposes.

We therefore turn to recent research in lexical semantics to examine how the problem of verb polysemy is dealt with there. Two approaches are considered adequate for our purpose. The first is Cognitive Frame Semantics (Atkins & Fillmore 1994), which takes a taxonomical approach to verbs and, for each semantic domain, identifies the central semantic parameters (frame elements) and examines them with regard to their syntactic realisation. The second is the alternation method (Levin 1993, Levin & Rappaport Hovav forthcoming), which aims to identify the relation between syntactic potential and elasticity in meaning.

On the basis of a corpus analysis, a small set of frame elements is identified for the motion domain, namely Mover, Direction, Manner, Vehicle and Protagonist control. Likewise, a set of possible aspectual values for motion events is established: activity, accomplishment and achievement. Finally, we present a partial sort hierarchy for nouns, intended for encoding selectional restrictions on motion verbs. On the basis of these semantic features, a taxonomy of the prototypical meanings of Danish motion verbs is established and, in order to clarify their syntactic potential, each subgroup of verbs from the taxonomy is subjected to a set of distributional tests. The tests expose a more or less transparent relation between semantics and syntax: words which share certain semantic features also show a strong tendency to exhibit similar syntactic behaviour.
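
As a rough illustration of how such a feature set could be recorded, the sketch below encodes the five frame elements and the three aspectual values in Python and groups verbs by which features they lexicalise. It is not the formalism used in the thesis. The class and function names, the sort labels, and the feature values assigned to gå, løbe, cykle and ankomme are illustrative assumptions, not results of the corpus analysis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Aspect(Enum):
    """The three aspectual values named in the abstract."""
    ACTIVITY = "activity"
    ACCOMPLISHMENT = "accomplishment"
    ACHIEVEMENT = "achievement"


@dataclass(frozen=True)
class MotionSense:
    """One prototypical sense of a motion verb (hypothetical encoding)."""
    lemma: str
    mover_sort: str              # selectional restriction on the Mover (assumed labels)
    direction: Optional[str]     # None when the verb does not lexicalise Direction
    manner: Optional[str]        # None when the verb does not lexicalise Manner
    vehicle: Optional[str]       # None when no Vehicle is involved
    protagonist_control: bool
    aspect: Aspect


def signature(sense):
    """Coarse semantic profile: which frame elements are lexicalised, plus aspect."""
    return (
        sense.direction is not None,
        sense.manner is not None,
        sense.vehicle is not None,
        sense.aspect,
    )


def group_by_signature(senses):
    """Collect lemmas that share the same semantic profile."""
    groups = {}
    for sense in senses:
        groups.setdefault(signature(sense), []).append(sense.lemma)
    return groups


# Feature values below are assumptions made for the sake of the example.
GAA = MotionSense("gå", "animate", None, "on foot", None, True, Aspect.ACTIVITY)
LOEBE = MotionSense("løbe", "animate", None, "fast, on foot", None, True, Aspect.ACTIVITY)
CYKLE = MotionSense("cykle", "human", None, "pedalling", "bicycle", True, Aspect.ACTIVITY)
ANKOMME = MotionSense("ankomme", "entity", "to goal", None, None, True, Aspect.ACHIEVEMENT)

GROUPS = group_by_signature([GAA, LOEBE, CYKLE, ANKOMME])
```

Under these assumed values, gå and løbe fall into the same group, while cykle and ankomme each form a group of their own. Subgroups of this kind are the units to which the distributional tests would then be applied.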

The exposure of this close relationship between syntax and semantics in Danish motion constructions constitutes a first step towards an identification of different types of meaning change. Three basic kinds of systematic polysemy are identified for Danish motion verbs: meaning extension, meaning reduction, and co-composition. Described in very short terms, the following relation between the taxonomy and the systematic meaning changes is exposed:

These results are partially explained on the basis of Levin & Rappaport Hovav's analysis, which claims that direction verbs are 'saturated' in meaning because they inherently denote a final state of the event and therefore cannot be varied, whereas manner verbs (comprising body verbs, body-by-vehicle verbs, non-body verbs and group verbs) denote activities, which are simple non-finite events, and are therefore open to variation.

In the second part of the thesis we describe how the lexicon of Danish motion verbs can be expressed in an HPSG-like formalism which applies Typed Feature Structures and which facilitates a high degree of organisation of the selected syntactic and semantic information. Furthermore, we demonstrate how systematic polysemy can be formalised by means of lexical rules, so that only the prototypical senses need to be encoded directly in the lexicon. In addition, a blocking mechanism is established which prevents overgeneration in those cases where a verb, for etymological or other reasons, exhibits less flexible behaviour than one would expect given its semantic affiliation. In this way, two essentially opposing characteristics of polysemy are treated: systematicity on the one hand and idiosyncrasy on the other.
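
The interplay of lexical rules and blocking can be sketched in a few lines of Python. This is a minimal illustration and not the typed feature structure encoding of the thesis. The rule shown (a manner verb denoting an activity gains a goal and becomes an accomplishment) is an assumed example of meaning extension, and the entry names, field names and the blocked verb are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    """A lexical entry holding one sense of a verb (hypothetical encoding)."""
    lemma: str
    verb_class: str                   # "manner" or "direction"
    aspect: str                       # "activity", "accomplishment" or "achievement"
    has_goal: bool = False
    blocked: frozenset = frozenset()  # names of rules this verb is exempt from


def meaning_extension(entry):
    """Assumed rule: a manner verb denoting an activity may take a goal,
    which yields an accomplishment reading. Returns None if not applicable."""
    if entry.verb_class != "manner" or entry.aspect != "activity":
        return None
    return replace(entry, aspect="accomplishment", has_goal=True)


# Registry of lexical rules; only one is sketched here.
RULES = {"meaning_extension": meaning_extension}


def expand(entry):
    """Return the prototypical sense plus every sense derivable by an
    applicable rule that the entry does not block."""
    senses = [entry]
    for name, rule in RULES.items():
        if name in entry.blocked:
            continue  # idiosyncratic exception: skip the rule for this verb
        derived = rule(entry)
        if derived is not None:
            senses.append(derived)
    return senses


# Only prototypical senses are listed; derived senses come from the rules.
LOEBE = Entry("løbe", "manner", "activity")
ANKOMME = Entry("ankomme", "direction", "achievement")
# Placeholder lemma standing in for a verb that resists the rule.
EXCEPTION = Entry("hypothetical_verb", "manner", "activity",
                  blocked=frozenset({"meaning_extension"}))

LOEBE_SENSES = expand(LOEBE)
ANKOMME_SENSES = expand(ANKOMME)
EXCEPTION_SENSES = expand(EXCEPTION)
```

In this sketch the manner verb receives a second, derived sense, the direction verb keeps only its prototypical sense because the rule does not apply to it, and the blocked entry keeps only its prototypical sense because the rule is explicitly switched off for that verb.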

Finally, we describe an implementation of 100 Danish motion verbs in the ALEP formalism, which we integrate into a running analysis grammar for Danish. A test suite of 42 Danish sentences containing motion events is run on the system, and the results are very promising: in 39 of the 42 examples (about 93%) the motion verbs are disambiguated correctly. We therefore conclude that the method seems to be both operational and sufficiently nuanced to capture the semantic parameters which are central to the disambiguation process, and we consider it plausible, although further investigations are required, that the method can be successfully extended to other semantic domains as well.




Bolette Pedersen
Center for Sprogteknologi, Njalsgade 80, DK-2300 Copenhagen S, Denmark