CST's Lemmatiser

CST's lemmatiser reduces all words in a text to their base form, the lemma. For example, "lemmatisation can be used for many purposes" →
lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use for/IN/for many/JJ/many purposes/NNS/purpose

The lemmatiser derives the base form (lemma) of words using a set of rules and an optional dictionary that express the relation between word forms and base forms. The rules used in this demo are generated from a full-form word list derived from CELEX. The correctness of the lemmatiser is estimated to be at least 87% for words that are not in the dictionary. The output is better if the input is POS-tagged. Without POS tags the lemmatiser cannot, for example, distinguish between the verb "means" and the noun "means".

The Danish rule set (which is demonstrated in the Danish version of this page) is based on STO. A manual check of the lemmatiser's results for Danish text has shown that the correctness is 94%-98%, again depending on whether or not the input is POS-tagged. The lemmatiser can use other dictionaries for rule training and for lookup, provided that they contain full forms and lemmas.

The lemmatiser can be used for languages with suffix-based inflectional morphology, such as the Nordic languages, English and Greek, but also for languages with more complex morphology, such as German and Dutch. To that end we have developed a new training algorithm that handles prefixes and infixes as well as suffixes.

There are lemmatisation rules for about 30 languages; see the table below. The third column points to the sources that have provided us with the training data. To try out these languages, please visit CST on-line tools or the Text Tonsorium.

Lemmatisation can be used for many purposes.
For example, the lemmatiser can collect all inflected forms of the same lemma, compute their frequencies and show which inflected forms of the lemma occur in the text, which is the first step towards building an index of a text. Lexicographers can identify new terms by letting the lemmatiser mark which words are in the dictionary and which are not.

More information

Jongejan, Bart and Dalianis, Hercules: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, 2009. pp. 145-153.
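The index-building use described above (collecting the inflected forms of each lemma together with their frequencies) can be sketched in a few lines of Python. This is only an illustration that post-processes output in the word/POS/lemma format shown in the example at the top of this page; it is not part of the lemmatiser itself. The function name `build_index` is hypothetical, the sketch assumes that every whitespace-separated token has exactly that three-part shape, and the last two tokens of the sample input are invented to show a lemma with more than one form.

```python
from collections import Counter, defaultdict


def build_index(tagged_text):
    """Group inflected forms by lemma and count how often each form occurs.

    Assumes every token has the form word/POS/lemma (an assumption of this
    sketch, based on the example output shown on this page).
    """
    index = defaultdict(Counter)
    for token in tagged_text.split():
        # Split from the right so that a slash inside the word form itself
        # does not break the parsing of the POS tag and the lemma.
        form, _pos, lemma = token.rsplit("/", 2)
        index[lemma][form.lower()] += 1
    return index


# The first seven tokens are the example from this page; the last two are
# invented here purely for illustration.
tagged = (
    "lemmatisation/NN/lemmatization can/MD/can be/VB/be used/VBN/use "
    "for/IN/for many/JJ/many purposes/NNS/purpose "
    "use/VB/use purposes/NNS/purpose"
)

index = build_index(tagged)

# Print each lemma, its total frequency and the inflected forms seen for it.
for lemma in sorted(index):
    forms = index[lemma]
    print(lemma, sum(forms.values()), dict(forms))
```

Keying the index on the lemma rather than on the surface form is what lets "used" and "use" end up in the same entry, which is the property an index of a text needs.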
Contact: Bart Jongejan

Procurement and use

CST's lemmatiser is licensed under the GNU General Public License, Version 2. Information about downloading the program and the linguistic data is here.

To speed up the initialisation of the program, the run-time dictionary must be represented in a dedicated compressed binary format. Therefore, apart from a mode for training the inflectional rules and a mode for lemmatisation proper, the lemmatiser also has a mode for generating the run-time dictionary from a column-wise text format.

The lemmatiser uses a full-form word list for training the inflectional rules and for creating the run-time dictionary. Each line must contain the following information:
Contact: Bart Jongejan
Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S