SOURCE
------
The lexicon used to build the Russian wordform-lemma training list for the
lemmatizer was downloaded from
https://github.com/bachan/libturglem/tree/master/share/turglem/russian

In addition, Serge Sharoff provided the dataset msd-ru-lemma.lex.
See https://aclanthology.org/L08-1539/ 

LICENSE
-------
LGPL 
(see the comment at the start of the files gramtab.xml, lemmas.xml and paradigms.xml)


HOW TO USE THE RESOURCES
------------------------
You can run the lemmatizer either with or without a dictionary.
Flex rules are always needed. If the lemmatizer must lemmatize a word that is
not found in the dictionary (an out-of-vocabulary or OOV word), it applies the
flex rules. Tests have shown that, in the case of Russian, a pruned set of flex
rules gives better results than an unpruned set. Pruning, in this case, means
that all rules based on fewer than three examples in the training set are
removed.

An unpruned set of flex rules lemmatizes every word in the dictionary
correctly, because the dictionary and the training set are made from the same
lexicon. A pruned set of flex rules does not lemmatize all words in the
dictionary correctly, so it is a good idea to always use pruned flex rules
together with the optional dictionary.

So, for Russian you get the best results by using the dictionary (dict)
together with the pruned set of flex rules (flexrules2). The differences are
marginal, however, so for a smaller footprint and faster start-up you can use
the flex rules alone (flexrules0).

Examples:
Without dictionary:
    cstlemma -L -eU -p+ -q- -t- -U- -H2  -m0 -l -B'$w'            -c'$w/$B$s'           -f'flexrules0'          -i yourinput -o youroutput

With dictionary:
    cstlemma -L -eU -p+ -q- -t- -U- -H2  -m0 -l -B'$w' -b'$w' -u- -c'$w/$b[[$b0]?$B]$s' -f'flexrules2' -d'dict' -i yourinput -o youroutput

There are many options to specify input and output formats. See the lemmatizer's documentation.
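
The two invocations above can be wrapped in a small shell function. The
following is a minimal sketch and not part of the lemmatizer distribution: the
function name lemmatize_ru, the CSTLEMMA variable and the choice to switch to
dictionary mode when a readable dictionary file is passed as third argument are
assumptions made for illustration. The options themselves are copied unchanged
from the examples, and flexrules0 and flexrules2 are assumed to be in the
current directory.

```shell
#!/bin/sh
# Hypothetical wrapper around the two example command lines.
# CSTLEMMA can be overridden to point at another binary (assumption).
CSTLEMMA="${CSTLEMMA:-cstlemma}"

# Usage: lemmatize_ru INPUT OUTPUT [DICT]
lemmatize_ru() {
    input="$1"
    output="$2"
    dict="${3:-}"
    if [ -n "$dict" ] && [ -r "$dict" ]; then
        # Dictionary given and readable: dictionary plus pruned flex rules.
        $CSTLEMMA -L -eU -p+ -q- -t- -U- -H2 -m0 -l -B'$w' -b'$w' -u- \
            -c'$w/$b[[$b0]?$B]$s' -f'flexrules2' -d"$dict" \
            -i "$input" -o "$output"
    else
        # No usable dictionary: flex rules only.
        $CSTLEMMA -L -eU -p+ -q- -t- -U- -H2 -m0 -l -B'$w' \
            -c'$w/$B$s' -f'flexrules0' \
            -i "$input" -o "$output"
    fi
}
```

After sourcing the file, the function could be called like this:

    lemmatize_ru yourinput youroutput dict
    lemmatize_ru yourinput youroutput

Note that this sketch silently falls back to the flex rules alone when the
dictionary file cannot be read.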


CONTACT
-------
Bart Jongejan bartj at hum dot ku dot dk