CST's Lemmatiser

Department of Nordic Studies and Linguistics, Faculty of Humanities, University of Copenhagen

CST's lemmatiser reduces all words in a text to their base form, the lemma. For example:

"lemmatisation can be used for many purposes"

→

lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use for/IN/for many/JJ/many purposes/NNS/purpose
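The example output above can be read back into a program with a few lines of Python. This is only a sketch: it assumes that every item has the shape word/TAG/lemma and that items are separated by whitespace, as in the example. The names Token and parse_tagged_output are made up for this illustration and are not part of the lemmatiser.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged_output(line: str) -> list[Token]:
    """Split whitespace-separated 'word/TAG/lemma' items into Token tuples."""
    tokens = []
    for item in line.split():
        # Split from the right so that a slash inside the word itself survives.
        # Assumption: tags and lemmas contain no slash.
        word, tag, lemma = item.rsplit("/", 2)
        tokens.append(Token(word, tag, lemma))
    return tokens


output = (
    "lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use "
    "for/IN/for many/JJ/many purposes/NNS/purpose"
)
tokens = parse_tagged_output(output)

# Keep only the words whose lemma differs from the word form.
changed = [(t.word, t.lemma) for t in tokens if t.word != t.lemma]
print(changed)
```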

The lemmatiser derives the base form (lemma) of words using a set of rules and an optional dictionary that express the relation between word forms and base forms.

The rules used in this demo are generated from a full form word list derived from CELEX. The correctness of the lemmatiser is estimated to be at least 87% for words that are not in the dictionary. The output improves if the input is POS-tagged. Without POS tags the lemmatiser cannot, for example, distinguish the verb "means" from the noun "means".
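Why POS tags help can be illustrated with a toy lookup table. The sketch below is not the lemmatiser's dictionary format. The table, the tags (Penn-style, as in the example output) and the function candidate_lemmas are assumptions made for illustration.

```python
# Toy dictionary: (word form, POS tag) -> lemma. Entries are illustrative only.
DICTIONARY = {
    ("means", "VBZ"): "mean",   # verb: "it means"
    ("means", "NNS"): "means",  # noun: "by all means"
}


def candidate_lemmas(word, tag=None):
    """Return the possible lemmas of a word, narrowed by the POS tag if given."""
    if tag is not None:
        lemma = DICTIONARY.get((word, tag))
        return [lemma] if lemma is not None else []
    # Without a tag, every lemma recorded for the word form is a candidate.
    return sorted({lemma for (form, _), lemma in DICTIONARY.items() if form == word})


print(candidate_lemmas("means", "VBZ"))  # one candidate
print(candidate_lemmas("means"))         # ambiguous: two candidates
```

With a tag the lookup returns a single lemma. Without one, both "mean" and "means" remain possible and the caller has to choose.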

The Danish rule set (which is demonstrated in the Danish version of this page) is based on STO. A manual check of the lemmatiser's results for Danish text has shown that the correctness is 94%-98%, again depending on whether or not the input is POS-tagged.

The lemmatiser can use other dictionaries for rule training and for lookup, provided that they contain full forms and lemmas. The lemmatiser can be used for languages with suffix-based inflectional morphology, such as the Nordic languages, English and Greek, but also for languages with more complex morphology, such as German and Dutch. To that end we have developed a new training algorithm that handles prefixes and infixes as well as suffixes.
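The idea of rules that change more than the suffix can be sketched as follows. This is a deliberately simplified toy, not the rule format or the matching strategy of CST's lemmatiser. The rules are hand-written rather than trained, each pattern has a single wildcard (so infix changes are not covered), and the first matching rule wins. RULES and apply_rules are hypothetical names.

```python
# Hypothetical hand-written rules: (pattern, replacement).
# '*' stands for the part of the word that is kept unchanged.
RULES = [
    ("ge*t", "*en"),  # German participle: gemacht -> machen (prefix and suffix change)
    ("*ies", "*y"),   # English plural: ponies -> pony
    ("*s", "*"),      # English plural: purposes -> purpose
]


def apply_rules(word, rules=RULES):
    """Apply the first rule whose pattern matches; return the word unchanged otherwise."""
    for pattern, replacement in rules:
        prefix, suffix = pattern.split("*")
        # The wildcard must cover at least one character.
        if (
            len(word) > len(prefix) + len(suffix)
            and word.startswith(prefix)
            and word.endswith(suffix)
        ):
            stem = word[len(prefix):len(word) - len(suffix)]
            return replacement.replace("*", stem)
    return word


for form in ["gemacht", "ponies", "purposes", "can"]:
    print(form, "->", apply_rules(form))
```

A toy rule set like this overgenerates (it would also turn "geht" into "hen"), which is one reason rules are trained from a full form word list rather than written by hand.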

There are lemmatisation rules for about 40 languages; see the table below. The last column lists the sources that provided the training data.

Language code	Language	Number of tuples	Source
af	Afrikaans	6 731	UD Afrikaans
ast	Asturian	108 728	https://github.com/michmech/lemmatization-lists
be	Belarusian	224 3311	UD_Belarusian-HSE
bg	Bulgarian	55 198	MULTEXT-East free lexicons 4.0
ca	Catalan	583 924	https://github.com/michmech/lemmatization-lists
cs	Czech	184 620	MULTEXT-East free lexicons 4.0
cy	Welsh	369 387	https://github.com/michmech/lemmatization-lists
da	Danish (Contemporary)	583 397	STO
da	Danish (Late modern)	1 651 435	DSL Ordbog over det danske sprog
da	Danish (Medieval)	48 075	DSL middelaldertekster, Clara Kloster Guldkorpus
de	German	318 000	CELEX
el	Greek	564 700	G. Petasis
en	English	86 992	CELEX
es	Spanish	160 723	lachica
et	Estonian	3 573 534	EstNLTK
fa	Persian	13 006	MULTEXT-East non-commercial lexicons 4.0
fo	Faroese	900 233	Føroyski bendingargrunnurin, far-ABLTagger
fr	French	273 000	Lexique + Limsi
ga	Irish	339 003	https://github.com/michmech/lemmatization-lists
gd	Scottish Gaelic	63 160	https://github.com/michmech/lemmatization-lists
gl	Galician	436 070	https://github.com/michmech/lemmatization-lists
gml	Middle Low German	107 020	Referenzkorpus Mittelniederdeutsch/Niederrheinisch (1200–1650)
gv	Manx	62 534	https://github.com/michmech/lemmatization-lists
hr	Croatian	498 333	SETimes.HR+
hu	Hungarian	63 998	MULTEXT-East free lexicons 4.0
is	Icelandic	2 874 175	Icelandic Centre for Language Technology
it	Italian	413 196	morph-it
ka	Georgian	28 076	Vladimír Benko
la	Latin	1 116 121	Jan Šipoš
mk	Macedonian	1 323 572	MULTEXT-East non-commercial lexicons 4.0
nl	Dutch	341 779	CELEX
no	Norwegian	478 904	Scarrie Lexicon
pl	Polish	4 773 631	Morfologik
pt	Portuguese	912 617	LABELLEX
ro	Romanian	428 191	MULTEXT-East free lexicons 4.0
ru	Russian	3 087 092	Lemmatizer.org
sk	Slovak	935 541	MULTEXT-East free lexicons 4.0
sl	Slovene	207 044	MULTEXT-East free lexicons 4.0
sq	Albanian	411 865	Albanian-POS
sr	Serbian	150 504	MULTEXT-East free lexicons 4.0
sv	Swedish	572 136	Språkbanken
uk	Ukrainian	300 292	MULTEXT-East free lexicons 4.0

To try out these languages, please visit CST on-line tools or the Text Tonsorium.

Lemmatisation can be used for many purposes. For example, the lemmatiser can collect all inflected forms of the same lemma, compute frequencies and show with which inflected forms the lemma occurs in the text, which is the first step to building an index of a text. Lexicographers can identify new terms by letting the lemmatiser mark which words are in the dictionary and which words are not.
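Both uses can be sketched in Python once word forms have been paired with their lemmas. The sketch assumes the pairs are already available as (word form, lemma) tuples. build_index and mark_unknown are illustrative names, not functions of the lemmatiser.

```python
from collections import Counter, defaultdict


def build_index(pairs):
    """Group word forms under their lemma and count each form (case-insensitive)."""
    index = defaultdict(Counter)
    for form, lemma in pairs:
        index[lemma][form.lower()] += 1
    return index


def mark_unknown(forms, dictionary_forms):
    """Pair each word form with a flag telling whether it is in the dictionary."""
    return [(form, form.lower() in dictionary_forms) for form in forms]


pairs = [
    ("use", "use"), ("used", "use"), ("uses", "use"), ("Used", "use"),
    ("purposes", "purpose"),
]
index = build_index(pairs)
for lemma in sorted(index):
    # Lemma, total frequency, and the inflected forms it occurs with.
    print(lemma, sum(index[lemma].values()), sorted(index[lemma]))

# Words flagged False are candidates for new terms.
marked = mark_unknown(["used", "tonsorium"], {"use", "used", "uses"})
print(marked)
```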

More information

Jongejan, Bart and Dalianis, Hercules: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, 2009. pp. 145-153

Report about CST's lemmatiser

Contact: Bart Jongejan

Procurement and use

CST's lemmatiser is licensed under the GNU General Public License, version 2. Information about downloading the program and the linguistic data is here.

To speed up the initialisation of the program, the run-time dictionary must be represented in a dedicated compressed binary format. Therefore, the lemmatiser has, apart from a mode to train the inflectional rules and a mode for lemmatisation proper, also a mode for the generation of the run-time dictionary from a column-wise text format.

The lemmatiser uses a full form word list for training the inflectional rules and for creating the run-time dictionary. Each line must contain the following information:

A word form
The lemma (base form) of the word form
(Optional) The category (word class) of the word form.
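A file with these fields could be read as sketched below. The text above only lists the fields, so the column order (word form, lemma, category) and the tab separator are assumptions. Entry and read_full_forms are names invented for this example.

```python
import io
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    form: str
    lemma: str
    category: Optional[str]


def read_full_forms(stream):
    """Read 'form<TAB>lemma[<TAB>category]' lines into Entry tuples."""
    entries = []
    for line in stream:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2 or not columns[0] or not columns[1]:
            continue  # skip blank or incomplete lines
        # The category column is optional.
        category = columns[2] if len(columns) > 2 and columns[2] else None
        entries.append(Entry(columns[0], columns[1], category))
    return entries


sample = "purposes\tpurpose\tNNS\nused\tuse\tVBN\n\nmany\tmany\n"
entries = read_full_forms(io.StringIO(sample))
for entry in entries:
    print(entry)
```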

Contact:

Bart Jongejan
E-mail: bartj@hum.ku.dk

Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
