CST's Lemmatiser

CST's lemmatiser reduces all words in a text to their base form, the lemma. For example:

"lemmatisation can be used for many purposes"

→

lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use for/IN/for many/JJ/many purposes/NNS/purpose
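The output line above appears to use a word/tag/lemma layout with tokens separated by spaces. A minimal sketch of reading such a line back into structured tokens follows; the `Token` type and the `parse_tagged_output` helper are hypothetical names for illustration and are not part of the lemmatiser.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_tagged_output(line: str) -> list[Token]:
    """Split a 'word/tag/lemma' line into tokens (layout assumed from the example)."""
    tokens = []
    for item in line.split():
        # Split from the right so a slash inside the word does not shift the fields.
        # An item with fewer than two slashes raises ValueError here.
        word, tag, lemma = item.rsplit("/", 2)
        tokens.append(Token(word, tag, lemma))
    return tokens


line = ("lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use "
        "for/IN/for many/JJ/many purposes/NNS/purpose")
tokens = parse_tagged_output(line)
lemmas = [t.lemma for t in tokens]
print(lemmas)
```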

The lemmatiser derives the base form (lemma) of words using a set of rules and an optional dictionary that express the relation between word forms and base forms.

The rules used in this demo are generated from a full form word list derived from CELEX. For words that are not in the dictionary, the accuracy of the lemmatiser is estimated to be at least 87%. The output is better if the input is POS-tagged. Without POS tags, the lemmatiser cannot, for example, distinguish between the verb "means" and the noun "means".
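To illustrate why POS tags help, here is a small sketch of a lookup keyed on both word form and tag. The dictionary entries, the Penn-style tags and the `candidate_lemmas` function are assumptions made for this example; they do not reflect the lemmatiser's internal data structures.

```python
# Toy dictionary: (word form, POS tag) -> lemma. Entries are illustrative only.
TAGGED_LEMMAS = {
    ("means", "VBZ"): "mean",   # verb: "it means"
    ("means", "NN"): "means",   # noun, singular: "a means"
    ("means", "NNS"): "means",  # noun, plural: "the means"
}


def candidate_lemmas(word, tag=None):
    """Return the sorted lemmas compatible with the word and, if given, its tag."""
    matches = {
        lemma
        for (form, pos), lemma in TAGGED_LEMMAS.items()
        if form == word and (tag is None or pos == tag)
    }
    return sorted(matches)


print(candidate_lemmas("means"))         # untagged input: two candidates remain
print(candidate_lemmas("means", "VBZ"))  # tagged as a verb: one candidate
```

Without a tag the lookup returns both "mean" and "means"; with the tag it returns a single lemma.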

The Danish rule set (which is demonstrated in the Danish version of this page) is based on STO. A manual check of the lemmatiser's results for Danish text showed an accuracy of 94%-98%, again depending on whether the input is POS-tagged.

The lemmatiser can use other dictionaries for rule training and for lookup, provided that they contain full forms and lemmas. The lemmatiser can be used for languages with suffix-based inflectional morphology, such as the Nordic languages, English and Greek, but also for languages with more complex morphology, such as German and Dutch. To that end we have developed a new training algorithm that handles prefixes and infixes as well as suffixes.
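The idea of rules that change more than the suffix can be sketched with a toy pattern matcher. This is not CST's rule format or training algorithm: the `RULES` list, the single `*` wildcard and both functions are hypothetical, and a single wildcard covers only a prefix plus a suffix, not infixes.

```python
# Hypothetical rules: (pattern, replacement); '*' stands for the unchanged stem.
RULES = [
    ("ge*t", "*en"),  # German: gemacht -> machen (prefix and suffix change)
    ("*ies", "*y"),   # English: studies -> study
    ("*s", "*"),      # English: purposes -> purpose
]


def apply_rule(word, pattern, replacement):
    """Return the rewritten word, or None if the pattern does not match."""
    prefix, suffix = pattern.split("*")
    if len(word) <= len(prefix) + len(suffix):
        return None  # nothing would be left for the stem
    if not (word.startswith(prefix) and word.endswith(suffix)):
        return None
    stem = word[len(prefix):len(word) - len(suffix)]
    return replacement.replace("*", stem)


def lemmatise(word):
    """Try the longest patterns first; fall back to the word itself."""
    for pattern, replacement in sorted(RULES, key=lambda r: -len(r[0])):
        result = apply_rule(word, pattern, replacement)
        if result is not None:
            return result
    return word


for w in ["gemacht", "studies", "purposes", "can"]:
    print(w, "->", lemmatise(w))
```

Trying longer patterns first is a crude stand-in for preferring the most specific rule; a trained rule set would resolve such conflicts from the data.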

There are lemmatisation rules for about 30 languages; see the table below. The last column points to the sources that have provided us with the training data.

Language code	Language	Number of tuples	Source
af	Afrikaans	6 731	UD Afrikaans
bg	Bulgarian	55 198	MULTEXT-East free lexicons 4.0
cz	Czech	184 620	MULTEXT-East free lexicons 4.0
da	Danish (Contemporary)	583 397	STO
da	Danish (Late modern)	1 651 435	DSL Ordbog over det danske sprog
da	Danish (Medieval)	48 075	DSL middelaldertekster, Clara Kloster Guldkorpus
de	German	318 000	CELEX
el	Greek	564 700	G. Petasis
en	English	86 992	CELEX
es	Spanish	160 723	lachica
et	Estonian	3 573 534	EstNLTK
fa	Persian	13 006	MULTEXT-East non-commercial lexicons 4.0
fr	French	273 000	Lexique + Limsi
hr	Croatian	498 333	SETimes.HR+
hu	Hungarian	63 998	MULTEXT-East free lexicons 4.0
is	Icelandic	2 874 175	Icelandic Centre for Language Technology
it	Italian	413 196	morph-it
la	Latin	1 116 121	Jan Šipoš
mk	Macedonian	1 323 572	MULTEXT-East non-commercial lexicons 4.0
nl	Dutch	341 779	CELEX
no	Norwegian	478 904	Scarrie Lexicon
pl	Polish	4 773 631	Morfologik
pt	Portuguese	912 617	LABELLEX
ro	Romanian	428 191	MULTEXT-East free lexicons 4.0
ru	Russian	3 087 092	Lemmatizer.org
sk	Slovak	935 541	MULTEXT-East free lexicons 4.0
sl	Slovene	207 044	MULTEXT-East free lexicons 4.0
sr	Serbian	150 504	MULTEXT-East free lexicons 4.0
sv	Swedish	572 136	Språkbanken
uk	Ukrainian	300 292	MULTEXT-East free lexicons 4.0

To try out these languages, please visit CST on-line tools or the Text Tonsorium.

Lemmatisation can be used for many purposes. For example, the lemmatiser can collect all inflected forms of the same lemma, compute frequencies and show with which inflected forms the lemma occurs in the text, which is the first step to building an index of a text. Lexicographers can identify new terms by letting the lemmatiser mark which words are in the dictionary and which words are not.
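The indexing use case described above can be sketched in a few lines once each word form has been paired with its lemma. The `build_lemma_index` function and the sample pairs are hypothetical; in practice the pairs would come from the lemmatiser's output.

```python
from collections import Counter, defaultdict


def build_lemma_index(pairs):
    """Group word forms under their lemma and count each form's occurrences."""
    index = defaultdict(Counter)
    for word, lemma in pairs:
        index[lemma][word.lower()] += 1
    return index


# (word form, lemma) pairs, invented for this example.
pairs = [
    ("use", "use"),
    ("used", "use"),
    ("uses", "use"),
    ("Used", "use"),
    ("purposes", "purpose"),
]

index = build_lemma_index(pairs)
for lemma in sorted(index):
    forms = index[lemma]
    # Lemma, total frequency, and the inflected forms it occurs with.
    print(lemma, sum(forms.values()), dict(forms))
```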

More information

Jongejan, Bart and Dalianis, Hercules: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, 2009, pp. 145-153.

Report about CST's lemmatiser

Contact: Bart Jongejan

Procurement and use

CST's lemmatiser is licensed under the GNU General Public License, version 2. Information about downloading the program and the linguistic data is here.

To speed up the initialisation of the program, the run-time dictionary must be stored in a dedicated compressed binary format. The lemmatiser therefore has three modes: one for training the inflectional rules, one for lemmatisation proper, and one for generating the run-time dictionary from a column-wise text format.

The lemmatiser uses a full form word list for training the inflectional rules and for creating the run-time dictionary. Each line must contain the following information:

A word form
The lemma (base form) of the word form
(Optional) The category (word class) of the word form.
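A line of such a word list could be read as sketched below. The text above does not say which column separator is used, so the tab separator here is an assumption, as are the function name `parse_full_form_line` and the sample lines.

```python
def parse_full_form_line(line):
    """Parse 'word form<TAB>lemma[<TAB>category]' (tab separator is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError(f"expected at least a word form and a lemma: {line!r}")
    word, lemma = fields[0], fields[1]
    # The category column is optional; represent its absence as None.
    category = fields[2] if len(fields) > 2 and fields[2] else None
    return word, lemma, category


sample = ["purposes\tpurpose\tNNS\n", "used\tuse\n"]
entries = [parse_full_form_line(line) for line in sample]
print(entries)
```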

Contact:

Bart Jongejan
E-mail: bartj@hum.ku.dk

Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
