CST's Lemmatiser

CST's lemmatiser reduces all words in a text to their base form, the lemma. For example:

"lemmatisation can be used for many purposes"

lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use for/IN/for many/JJ/many purposes/NNS/purpose


The lemmatiser derives the base form (lemma) of words using a set of rules and an optional dictionary that express the relation between word forms and base forms.
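
How a dictionary and a rule set can work together is illustrated by the minimal Python sketch below. It is not CST's algorithm: the real rules are trained automatically from a full form word list, whereas the names DICTIONARY, SUFFIX_RULES and lemmatise, as well as the toy data, are hypothetical and hand-written, and the sketch only rewrites suffixes.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data, hand-written for illustration only.
DICTIONARY: Dict[str, str] = {"used": "use", "purposes": "purpose", "be": "be"}

# (suffix to strip, replacement) pairs; an assumption, not the trained rule format.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ies", "y"),
    ("es", "e"),
    ("ed", ""),
    ("s", ""),
]


def lemmatise(word: str) -> str:
    """Return a lemma: dictionary lookup first, then the longest matching suffix rule."""
    key = word.lower()
    if key in DICTIONARY:
        return DICTIONARY[key]
    # Try longer suffixes first so that "ies" is preferred over "s".
    for suffix, replacement in sorted(SUFFIX_RULES, key=lambda r: len(r[0]), reverse=True):
        if key.endswith(suffix) and len(key) > len(suffix):
            return key[: len(key) - len(suffix)] + replacement
    # No rule applies: the word form is taken to be its own lemma.
    return key
```

Words found in the dictionary never reach the rules, so the rules only have to cover words that are not in the dictionary.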

The rules used in this demo are generated from a full form word list derived from CELEX. The accuracy of the lemmatiser is estimated to be at least 87% for words that are not in the dictionary. The output is better if the input is POS-tagged. Without POS tags the lemmatiser cannot, for example, distinguish the verb "means" from the noun "means".

The Danish rule set (which is demonstrated in the Danish version of this page) is based on STO. A manual check of the lemmatiser's output for Danish text showed an accuracy of 94%-98%, again depending on whether the input is POS-tagged.

The lemmatiser can use other dictionaries for rule training and for lookup, provided that they contain full forms and lemmas. The lemmatiser can be used for languages with suffix-based inflectional morphology, such as the Nordic languages, English and Greek, but also for languages with more complex morphology, such as German and Dutch. To that end we have developed a new training algorithm that handles prefixes and infixes as well as suffixes.

There are lemmatisation rules for about 30 languages; see the table below. The last column names the sources that provided us with the training data.

Language code  Language                Number of tuples  Source
af             Afrikaans               6 731             UD Afrikaans
ast            Asturian                108 728           https://github.com/michmech/lemmatization-lists
be             Belarusian              224 3311          UD_Belarusian-HSE
bg             Bulgarian               55 198            MULTEXT-East free lexicons 4.0
ca             Catalan                 583 924           https://github.com/michmech/lemmatization-lists
cs             Czech                   184 620           MULTEXT-East free lexicons 4.0
cy             Welsh                   369 387           https://github.com/michmech/lemmatization-lists
da             Danish (Contemporary)   583 397           STO
da             Danish (Late modern)    1 651 435         DSL Ordbog over det danske sprog
da             Danish (Medieval)       48 075            DSL middelaldertekster, Clara Kloster Guldkorpus
de             German                  318 000           CELEX
el             Greek                   564 700           G. Petasis
en             English                 86 992            CELEX
es             Spanish                 160 723           lachica
et             Estonian                3 573 534         EstNLTK
fa             Persian                 13 006            MULTEXT-East non-commercial lexicons 4.0
fo             Faroese                 900 233           Føroyski bendingargrunnurin, far-ABLTagger
fr             French                  273 000           Lexique + Limsi
ga             Irish                   339 003           https://github.com/michmech/lemmatization-lists
gd             Scottish Gaelic         63 160            https://github.com/michmech/lemmatization-lists
gl             Galician                436 070           https://github.com/michmech/lemmatization-lists
gml            Middle Low German       107 020           Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200–1650)
gv             Manx                    62 534            https://github.com/michmech/lemmatization-lists
hr             Croatian                498 333           SETimes.HR+
hu             Hungarian               63 998            MULTEXT-East free lexicons 4.0
is             Icelandic               2 874 175         Icelandic Centre for Language Technology
it             Italian                 413 196           morph-it
ka             Georgian                28 076            Vladimír Benko
la             Latin                   1 116 121         Jan Šipoš
mk             Macedonian              1 323 572         MULTEXT-East non-commercial lexicons 4.0
nl             Dutch                   341 779           CELEX
no             Norwegian               478 904           Scarrie Lexicon
pl             Polish                  4 773 631         Morfologik
pt             Portuguese              912 617           LABELLEX
ro             Romanian                428 191           MULTEXT-East free lexicons 4.0
ru             Russian                 3 087 092         Lemmatizer.org
sk             Slovak                  935 541           MULTEXT-East free lexicons 4.0
sl             Slovene                 207 044           MULTEXT-East free lexicons 4.0
sq             Albanian                411 865           Albanian-POS
sr             Serbian                 150 504           MULTEXT-East free lexicons 4.0
sv             Swedish                 572 136           Språkbanken
uk             Ukrainian               300 292           MULTEXT-East free lexicons 4.0

To try out these languages, please visit CST on-line tools or the Text Tonsorium.

Lemmatisation can be used for many purposes. For example, the lemmatiser can collect all inflected forms of the same lemma, compute frequencies and show with which inflected forms the lemma occurs in the text, which is the first step to building an index of a text. Lexicographers can identify new terms by letting the lemmatiser mark which words are in the dictionary and which words are not.

More information

Jongejan, Bart and Dalianis, Hercules: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, 2009. pp. 145-153

Report about CST's lemmatiser

Contact: Bart Jongejan

Procurement and use

CST's lemmatiser is licensed under the GNU General Public License, version 2. Information about downloading the program and the linguistic data is here.

To speed up the initialisation of the program, the run-time dictionary must be represented in a dedicated compressed binary format. Therefore, the lemmatiser has, apart from a mode to train the inflectional rules and a mode for lemmatisation proper, also a mode for the generation of the run-time dictionary from a column-wise text format.

The lemmatiser uses a full form word list for training the inflectional rules and for creating the run-time dictionary. Each line must contain the following information:

  • A word form
  • The lemma (base form) of the word form
  • (Optional) The category (word class) of the word form.
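
A reader for such a full form word list might look like the Python sketch below. The tab separator, the column order (word form, lemma, category) and the names Entry and parse_full_form_list are assumptions made for illustration; they are not taken from the lemmatiser's documentation, so check the actual format expected by the program before preparing data.

```python
from typing import Iterable, List, NamedTuple, Optional


class Entry(NamedTuple):
    form: str                 # inflected word form
    lemma: str                # base form of the word form
    category: Optional[str]   # word class, or None when the column is absent


def parse_full_form_list(lines: Iterable[str]) -> List[Entry]:
    """Parse lines of the assumed layout: form <TAB> lemma [<TAB> category]."""
    entries: List[Entry] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            raise ValueError(f"line {line_no}: expected a word form and a lemma")
        category = fields[2] if len(fields) > 2 and fields[2] else None
        entries.append(Entry(fields[0], fields[1], category))
    return entries


# Hypothetical sample lines; the third column is optional.
sample_lines = ["means\tmean\tV\n", "means\tmeans\tN\n", "used\tuse\n"]
entries = parse_full_form_list(sample_lines)
```

Keeping the category column lets the same word form map to different lemmas depending on its word class, which is what POS-tagged input makes use of.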

Contact:

Bart Jongejan
E-mail: bartj@hum.ku.dk


Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S