Department of Nordic Studies and Linguistics, Faculty of Humanities, University of Copenhagen

STO - A Large Computational Lexicon for Danish - Ready for Applications

The Danish lexicon resource for language technology applications is available worldwide for research and commercial purposes from the Evaluations & Language resources Distribution Agency (ELDA), www.elda.org.

The Centre for Language Technology (CST) is in charge of a national project developing a large-scale Danish lexicon for HLT and NLP applications. The short name of the project is STO, which stands for "SprogTeknologisk Ordbase" (Lexical Database for Language Technology). The project is funded by the Danish Ministry for Information Technology and Research for a period of three years (2001-03).

The objective of the STO project is to develop a comprehensive, generic lexical database from which various, dedicated lexicon modules can be derived and adapted to particular applications.

In order to fulfil the project objectives of data production with a fixed scheme, we defined three areas of work - namely the organisational, the computational and the linguistic area. In the following we give an account of the most important, finished and ongoing tasks of each area.

Organisational area

CST is – as the project manager – responsible for the proper performance of the work processes. Currently, the lexicon encoding is carried out by sixteen project members (mainly part-timers) employed by four different institutes. In order to run the project effectively, the work processes must be carefully co-ordinated.

The project programme is logistically broken down into appropriate sub-processes and work packages with precise milestones/deliverables. All data production is carried out outside the central lexical database and validated by CST before a delivery is integrated into the lexicon.

Computational area

Adaptation of reusable data and uploading of the tailored entries into the central (ORACLE) database (with integrity checks performed)

Adjustment of the information structure to hold added language specific information types and their interrelations

Development of a lexicographer's toolkit (consisting of ACCESS encoding tools, ACCESS/ORACLE uploading interface, ORACLE browsing macros, etc.)
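The kind of integrity check performed before a delivery is uploaded can be sketched as follows. This is only an illustration under stated assumptions: the Entry layout, the closed set of parts of speech and the three rules are hypothetical and are not taken from the actual STO schema or toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical entry layout; the real STO information structure is richer.
@dataclass
class Entry:
    lemma: str
    pos: str
    morphology: Optional[str] = None
    syntax: List[str] = field(default_factory=list)

# Assumed closed set of parts of speech, for illustration only.
ALLOWED_POS = {"noun", "verb", "adjective"}

def check_delivery(entries):
    """Return (lemma, problem) pairs; an empty list means no problem was found."""
    problems = []
    seen = set()
    for e in entries:
        key = (e.lemma, e.pos)
        if key in seen:
            problems.append((e.lemma, "duplicate lemma/part-of-speech pair"))
        seen.add(key)
        if e.pos not in ALLOWED_POS:
            problems.append((e.lemma, "unknown part of speech"))
        # Assumed rule: a syntactic description presupposes a morphological one.
        if e.syntax and e.morphology is None:
            problems.append((e.lemma, "syntactic description without morphology"))
    return problems
```

A delivery would be integrated only when check_delivery returns an empty list; otherwise the reported entries go back to the lexicographer.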

Linguistic area

The goal is to populate the lexicon with a coverage of at least 50,000 lemmas (current figures: 40,000 entries with morphological descriptions, 24,000 entries also provided with syntactic descriptions, and a further 10,000 related semantic readings originating from the SIMPLE project).

Definition of principles of the lexical coverage with respect to general and domain language respectively

Integration of a frequency-based method into the process of selection of entry words and the relevant syntactic/semantic readings

Enhancement of the linguistic description with further language specific information types relevant to a broad application area

Elaboration of Linguistic Specifications documenting the lexical and linguistic coverage (both for lexicographers and customers/users)

Creation of a Coder's Manual containing practical guidelines for lexicographers (e.g. use of supporting tools, appropriate selection of information types, use of corpus evidence)
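A frequency-based selection of entry words could, in its simplest form, look like the sketch below. The function name, the min_freq threshold and the toy corpus are assumptions made for illustration; the method actually used in the project is not specified here.

```python
from collections import Counter

def select_entry_words(corpus_lemmas, existing, min_freq=2, limit=None):
    """Rank corpus lemmas not yet in the lexicon by frequency (hypothetical helper)."""
    counts = Counter(corpus_lemmas)
    candidates = [
        (lemma, n)
        for lemma, n in counts.items()
        if n >= min_freq and lemma not in existing
    ]
    # Most frequent first; ties are broken alphabetically for a stable order.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:limit]

# Toy lemmatised corpus; "bil" is assumed to be encoded already.
corpus = ["hus", "bil", "hus", "ord", "bil", "hus", "sprog"]
print(select_entry_words(corpus, existing={"bil"}, min_freq=1))
```

Raising min_freq or setting limit narrows the candidate list to the words with the strongest corpus evidence.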

Some features of the linguistic aspect

In order to enhance the applicability of the lexicon we implemented, for example, the linguistic description of noun compounding, which is the most productive means of word formation in Danish; in the next step derivational information will be added. Both contribute to a dynamic exploitation of the data.
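To illustrate how compounding information allows a dynamic exploitation of the data, the sketch below analyses an unlisted compound as two known lemmas joined by an optional linking element. The toy lexicon, the inventory of linking elements and the function name are assumptions for illustration; this is not the STO representation of compounding.

```python
# Assumed inventory of linking elements: none, "s" or "e".
LINKING_ELEMENTS = ("", "s", "e")

def split_compound(word, lexicon):
    """Return all (modifier, linker, head) analyses whose parts are known lemmas."""
    analyses = []
    for i in range(1, len(word)):
        head = word[i:]
        if head not in lexicon:
            continue
        left = word[:i]
        for linker in LINKING_ELEMENTS:
            if linker and not left.endswith(linker):
                continue
            # Strip the linking element (if any) to recover the modifier lemma.
            modifier = left[:len(left) - len(linker)]
            if modifier in lexicon:
                analyses.append((modifier, linker, head))
    return analyses

# Toy lexicon of simplex nouns.
lexicon = {"ord", "base", "barn", "vogn"}
print(split_compound("ordbase", lexicon))
print(split_compound("barnevogn", lexicon))
```

With such a rule, compounds need not all be listed as separate entries: a word is accepted when its parts are in the lexicon and the link between them is licensed.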

The area also incorporates research work into particular linguistic tasks where there is no systematic and exhaustive description accessible that could be implemented in a straightforward way.

It deserves notice that the STO lexicon complies with the recommendations for classification of information types laid down by the Danish Standard for lexical data collections. This is an essential prerequisite for a broad applicability of the lexicon.

Although the STO project focuses on the monolingual information content and data structure, we are also aware of the need for a Danish lexicon that can be integrated into multilingual lexical resources. To this end, the lexical data produced are kept compatible with the PAROLE descriptive language and, especially as regards the semantic layer, we remain attentive to structures produced within follow-up projects such as SIMPLE.

Further information

Braasch, Anna & Sussi Olsen (2000):
Formalised Representation of Collocations in a Danish Computational Lexicon.
In: U. Heid et al. (eds.), Proceedings of the Ninth EURALEX Congress. Stuttgart. pp. 475-488.

A. Braasch, A. Buhr Christensen, S. Olsen, B. Pedersen (1998):
A Large Scale Lexicon for Danish in the Information Society. In: Proceedings from the First International Conference on Language Resources & Evaluation, ELRA.

The model for STO.

CST contact

Anna Braasch anna @ cst.dk

Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
Tel: +45 35329090 - Fax: +45 35329089
