SOURCE
------
The lexicon used to build the Russian wordform-lemma training list for the
lemmatizer was downloaded from
https://github.com/bachan/libturglem/tree/master/share/turglem/russian

In addition, Serge Sharoff provided the dataset msd-ru-lemma.lex.
See https://aclanthology.org/L08-1539/

LICENSE
-------
LGPL (see the comment at the start of the files gramtab.xml, lemmas.xml and
paradigms.xml)

HOW TO USE THE RESOURCES
------------------------
You can run the lemmatizer with or without a dictionary. Flex rules are
always needed: if the lemmatizer must lemmatize a word that is not found in
the dictionary (an OOV word), it applies the flex rules.

Tests have shown that, in the case of Russian, a pruned set of flex rules
gives better results than an unpruned set. Pruning, in this case, means that
all rules that are based on fewer than three examples in the training set are
removed.

An unpruned set of flex rules lemmatizes every word in the dictionary
correctly, because the dictionary and the training set are made from the same
lexicon. A pruned set of flex rules does not lemmatize all words in the
dictionary correctly, so it is a good idea to always use pruned flex rules
together with the optional dictionary.

So, for Russian you get the best results by using the dictionary (dict) and
the pruned set of flex rules (flexrules2). The differences are marginal,
however, so for a smaller footprint and faster start-up you can use only the
flex rules (flexrules0).

Examples:

Without dictionary:

    cstlemma -L -eU -p+ -q- -t- -U- -H2 -m0 -l -B'$w' -c'$w/$B$s' -f'flexrules0' -i yourinput -o youroutput

With dictionary:

    cstlemma -L -eU -p+ -q- -t- -U- -H2 -m0 -l -B'$w' -b'$w' -u- -c'$w/$b[[$b0]?$B]$s' -f'flexrules2' -d'dict' -i yourinput -o youroutput

There are many options to specify input and output formats. See the
lemmatizer's documentation.

CONTACT
-------
Bart Jongejan
bartj at hum dot ku dot dk
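
SCRIPTING THE EXAMPLES (SKETCH)
-------------------------------
The two example command lines from HOW TO USE THE RESOURCES can also be
assembled from a script. The following is a minimal sketch, not part of
cstlemma itself. It assumes that cstlemma is on the PATH and that the files
flexrules0, flexrules2 and dict are in the current directory. The function
name build_cstlemma_command is hypothetical; the options are copied verbatim
from the two examples, with the shell quotes removed because each list item is
already a single argument.

```python
import shlex

# Options shared by both example invocations in this README.
COMMON = ["-L", "-eU", "-p+", "-q-", "-t-", "-U-", "-H2", "-m0", "-l", "-B$w"]


def build_cstlemma_command(infile, outfile, use_dict=True):
    """Return the argument list for one of the two README examples.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    args = ["cstlemma"] + COMMON
    if use_dict:
        # Dictionary plus pruned flex rules (best results for Russian).
        args += ["-b$w", "-u-", "-c$w/$b[[$b0]?$B]$s", "-fflexrules2", "-ddict"]
    else:
        # Flex rules only (smaller footprint, faster start-up).
        args += ["-c$w/$B$s", "-fflexrules0"]
    args += ["-i", infile, "-o", outfile]
    return args


if __name__ == "__main__":
    # Print both variants as shell-quoted command lines.
    for use_dict in (False, True):
        cmd = build_cstlemma_command("yourinput", "youroutput", use_dict)
        print(shlex.join(cmd))
```

To run the lemmatizer rather than print the command, the returned list can be
passed to subprocess.run(cmd, check=True). Passing a list avoids having the
shell expand $w, $B and $b.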