CST's Lemmatiser

CST's lemmatiser reduces all words in a text to their base form, the lemma. For example:

"lemmatisation can be used for many purposes"

lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use for/IN/for many/JJ/many purposes/NNS/purpose
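Each token in the output line above has the form word/tag/lemma. As a minimal sketch, assuming that this slash-separated layout is how you receive the output, the following Python splits such a line into its parts. The names `Token` and `parse_tagged` are illustrative and not part of the lemmatiser.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged(line: str) -> list[Token]:
    """Split a line of word/tag/lemma items into Token tuples."""
    tokens = []
    for item in line.split():
        # Split from the right so that a slash inside the word itself is kept.
        word, tag, lemma = item.rsplit("/", 2)
        tokens.append(Token(word, tag, lemma))
    return tokens


line = (
    "lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use "
    "for/IN/for many/JJ/many purposes/NNS/purpose"
)
tokens = parse_tagged(line)
lemmas = [t.lemma for t in tokens]
```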

The lemmatiser derives the base form (lemma) of words using a set of rules and an optional dictionary that express the relation between word forms and base forms.

The rules used in this demo are generated from a full-form word list derived from CELEX. The correctness of the lemmatiser is estimated to be at least 87% for words that are not in the dictionary. The output is better if the input is POS-tagged. Without POS tags the lemmatiser cannot, for example, distinguish between the verb "means" and the noun "means".
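The interplay of dictionary, rules and POS tags can be illustrated with a toy sketch. It is not CST's implementation: the dictionary entries, the tag names and the three suffix rules below are invented for illustration, and the real rules are trained from a word list rather than written by hand.

```python
# Hypothetical toy data, not CST's dictionary or rule set.
DICTIONARY = {
    ("means", "NN"): "means",   # noun: "a means to an end"
    ("means", "VBZ"): "mean",   # verb: "it means"
    ("used", "VBN"): "use",
}

# (suffix to strip, replacement), tried in order, longest suffix first.
SUFFIX_RULES = [("ies", "y"), ("es", "e"), ("s", "")]


def lemmatise(word, tag=None):
    """Return a list of candidate lemmas for a word form."""
    # With a POS tag, the dictionary can give one unambiguous answer.
    if tag is not None and (word, tag) in DICTIONARY:
        return [DICTIONARY[(word, tag)]]
    # Without a tag, every dictionary reading of the form is a candidate.
    if tag is None:
        candidates = sorted(
            {lemma for (form, _), lemma in DICTIONARY.items() if form == word}
        )
        if candidates:
            return candidates
    # Words not in the dictionary fall back to the rules.
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return [word[: -len(suffix)] + replacement]
    return [word]
```

With a tag, `lemmatise("means", "VBZ")` yields a single lemma; without one, both readings are returned and the ambiguity remains.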

The Danish rule set (which is demonstrated in the Danish version of this page) is based on STO. A manual check of the results of the lemmatiser for Danish text has shown that the correctness is 94%-98%, again depending on whether or not the input is POS-tagged.

The lemmatiser can use other dictionaries for rule training and for lookup, provided that they contain full forms and lemmas. The lemmatiser can be used for languages with suffix-based inflectional morphology, such as the Nordic languages, English and Greek, but also for languages with more complex morphology, such as German and Dutch. To that end we have developed a new training algorithm that handles prefixes and infixes as well as suffixes.
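To see why suffix-only rules fall short for German and Dutch, consider a rule that rewrites a prefix and a suffix at the same time. The sketch below uses a wildcard notation chosen purely for illustration (`*` stands for any non-empty stretch of the word that is copied unchanged); it is an assumption and not the rule format the lemmatiser actually stores. The function name `apply_rule` is likewise hypothetical.

```python
import re


def apply_rule(word, pattern, replacement):
    """Apply a wildcard rule; return the lemma, or None if the pattern does not match."""
    # Turn "ge*t" into the regular expression "^ge(.+)t$".
    regex = "^" + "(.+)".join(re.escape(part) for part in pattern.split("*")) + "$"
    match = re.match(regex, word)
    if match is None:
        return None
    parts = replacement.split("*")
    if len(parts) - 1 != len(match.groups()):
        raise ValueError("pattern and replacement need the same number of wildcards")
    # Interleave the captured stretches with the literal parts of the replacement.
    lemma = parts[0]
    for captured, literal in zip(match.groups(), parts[1:]):
        lemma += captured + literal
    return lemma


# Prefix and suffix change together (German participle to infinitive).
german = apply_rule("gesagt", "ge*t", "*en")
# The changed material sits in the middle of the word (infix).
infix = apply_rule("aufgemacht", "*ge*t", "**en")
# A plain suffix rule is just the special case with one leading wildcard.
english = apply_rule("purposes", "*s", "*")
```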

You can try the lemmatiser with the following languages. The third column points to the sources that have provided us with the training data.

Language     Number of tuples   Source
Danish       594,000            STO
Dutch        337,000            CELEX
English      87,000             CELEX
French       273,000            Lexique + Limsi
German       318,000            CELEX
Greek        564,700            G. Petasis
Icelandic    72,000             IFD
Latin        106,000            Morfologik
Polish       3,443,000          Morfologik
Russian      3,045,000          Lemmatizer.org

Apart from these languages the lemmatiser has been tested with Slovene, Norwegian and Swedish.

Lemmatisation can be used for many purposes. For example, the lemmatiser can collect all inflected forms of the same lemma, compute frequencies and show with which inflected forms the lemma occurs in the text, which is the first step to building an index of a text. Lexicographers can identify new terms by letting the lemmatiser mark which words are in the dictionary and which words are not.
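The indexing and term-spotting uses described above can be sketched in a few lines of Python. The input pairs and the set `known_lemmas` are made-up sample data; in practice they would come from the lemmatiser's output and from your own dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical (word form, lemma) pairs as produced by a lemmatiser run.
pairs = [
    ("purposes", "purpose"),
    ("purpose", "purpose"),
    ("used", "use"),
    ("uses", "use"),
    ("used", "use"),
    ("lemmatisers", "lemmatiser"),
]

# Frequency of each lemma across the text.
lemma_freq = Counter(lemma for _, lemma in pairs)

# For each lemma, the inflected forms it occurs with and how often.
forms_by_lemma = defaultdict(Counter)
for form, lemma in pairs:
    forms_by_lemma[lemma][form] += 1

# A simple alphabetical index: lemma -> {form: count}.
index = {lemma: dict(forms) for lemma, forms in sorted(forms_by_lemma.items())}

# Candidate new terms: lemmas that are absent from the known dictionary.
known_lemmas = {"purpose", "use"}
new_terms = sorted(set(lemma_freq) - known_lemmas)
```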

More information

Jongejan, Bart and Dalianis, Hercules: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, 2009. pp. 145-153

Report about CST's lemmatiser

Contact: Bart Jongejan

Procurement and use

You can use CST's lemmatiser freely for non-commercial applications. The program, but in most cases not the linguistic data, can be downloaded here.

To speed up the initialisation of the program, the run-time dictionary must be represented in a dedicated compressed binary format. The lemmatiser therefore has three modes: one for training the inflectional rules, one for lemmatisation proper, and one for generating the run-time dictionary from a column-wise text format. The lemmatiser uses a full-form word list both for training the inflectional rules and for creating the run-time dictionary. For Danish, CST can provide data from STO in a format made especially for this purpose. This word list contains 81 511 lemmas and 594 357 word forms. Each line contains the following information:

  • A word form
  • The lemma (base form) of the word form
  • The category (word class) of the word form.
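A word list with these three fields per line can be read as follows. This is a sketch under assumptions: the tab separator, the field order and the category labels `N` and `V` are chosen for illustration, since the exact layout of the column-wise format is not specified here, and the three Danish entries are sample data rather than an excerpt from STO.

```python
# Hypothetical sample in an assumed tab-separated layout: form, lemma, category.
SAMPLE = "huset\thus\tN\nhuse\thus\tN\nløber\tløbe\tV\n"


def read_full_forms(lines):
    """Parse lines of 'form<TAB>lemma<TAB>category' into tuples."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        form, lemma, category = line.split("\t")
        entries.append((form, lemma, category))
    return entries


entries = read_full_forms(SAMPLE.splitlines())

# The same two counts that are quoted for the Danish list: lemmas and word forms.
n_forms = len({form for form, _, _ in entries})
n_lemmas = len({lemma for _, lemma, _ in entries})
```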

If you want to use the lemmatiser for Danish text, we can provide ready-made inflectional rules (all platforms) and run-time dictionaries for Windows and for Linux.

For use of Danish STO-lingware and commercial applications of the program, contact:

Claus Povlsen
E-mail: cpovlsen@hum.ku.dk

Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S