Download CST's NLP software

Department of Nordic Studies and Linguistics, Faculty of Humanities, University of Copenhagen

Download of software

Software under GNU General Public License (GPL):

RTFreader

The package consists of the C++ source code of CST's RTFreader. The program reads a flat text file or an RTF (Rich Text Format) file and converts it to segmented text with one sentence per line. Where full stops are missing, the program uses layout information such as character size and typeface to decide where a sentence ends. Optionally, the program delivers tokenised output, using heuristics to decide whether a dot belongs to a token or delimits a sentence. You can supplement the tokenisation process with lists of abbreviations and multi-word units (MWUs). Such language-dependent resources are not part of the package.
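The dot heuristic can be illustrated with a small Python sketch. This is not RTFreader's algorithm (the package itself is C++ and also uses layout information); it only shows the general idea of using an abbreviation list plus the case of the next word to decide whether a dot ends a sentence. The function name `split_sentences` and the abbreviation list are hypothetical.

```python
# Hypothetical abbreviation list; in practice such language-dependent
# lists have to be supplied by the user.
ABBREVIATIONS = {"dr.", "e.g.", "etc.", "prof."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split whitespace-separated text into a list of sentences."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue  # no sentence-final punctuation on this token
        if token.lower() in abbreviations:
            continue  # the dot belongs to a known abbreviation
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if following and following[0].islower():
            continue  # a lowercase continuation suggests the dot is token-internal
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))  # trailing text without a final dot
    return sentences


for sentence in split_sentences("Dr. Smith arrived. He met Prof. Jones at 3 p.m. today. Done"):
    print(sentence)
```

The case-based check is a crude stand-in for the heuristics mentioned above; it fails, for instance, when an abbreviation really does end a sentence.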

CSTlemma

CST's lemmatizer uses affix rules (affix: prefix, infix, suffix, circumfix) and has been trained for a number of languages. Trained affix rules are available for the following languages:

Afrikaans, Albanian, Asturian, Belarusian, Bulgarian, Catalonian, Czech, Danish, Dutch, English, Estonian, Faroese, Farsi, French, Gaelic, Galician, Georgian, German, Greek, Hungarian, Icelandic, Irish, Italian, Latin, Macedonian, Manx, Middle Low German, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Ukrainian, Welsh.

Rules are obtained by supervised learning from a list of full form and lemma pairs.
In this folder you will find lemmatisation rules for a number of languages and, in some cases, dictionaries in binary format that the lemmatiser can use. These dictionaries are not required, but they can improve output quality. A complete list of resources is available on GitHub.

Get the most recent version of the source code at GitHub. You can use tinylemmatizer in your Python programs. This module uses the same lemmatisation rules as CSTlemma but offers none of CSTlemma's many options.

If you want to train affix rules for the lemmatiser (think of German abgebrannt → abbrennen), you need the program affixtrain in addition to cstlemma. For languages with rich morphology, you need large full form word lists (more than 100,000 entries) to attain a reasonable result.
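To make the idea of an affix rule concrete, here is a minimal Python sketch of applying a rule that changes an infix and a suffix at the same time. The rule notation (a pattern and a replacement with `*` wildcards), the flat rule list, and the function names are assumptions made for illustration; they are not the rule format that affixtrain produces or that cstlemma reads.

```python
import re


def apply_rule(word, pattern, replacement):
    """Apply one wildcard rule; return the lemma, or None if the rule does not match.

    Each '*' in the pattern matches a non-empty part of the word that is
    copied unchanged into the corresponding '*' of the replacement.
    """
    regex = "^" + "(.+)".join(re.escape(part) for part in pattern.split("*")) + "$"
    match = re.match(regex, word)
    if match is None:
        return None
    kept = match.groups()
    pieces = replacement.split("*")
    if len(pieces) != len(kept) + 1:
        raise ValueError("pattern and replacement need the same number of wildcards")
    lemma = pieces[0]
    for kept_part, literal in zip(kept, pieces[1:]):
        lemma += kept_part + literal
    return lemma


def lemmatise(word, rules):
    """Return the result of the first matching rule, or the word itself."""
    for pattern, replacement in rules:
        lemma = apply_rule(word, pattern, replacement)
        if lemma is not None:
            return lemma
    return word


# Hypothetical hand-written rules, most specific first.
RULES = [("*ge*annt", "**ennen"), ("*s", "*")]

print(lemmatise("abgebrannt", RULES))
```

In the first rule, the wildcards keep "ab" and "br", the infix "ge" is dropped, and the ending "annt" is rewritten to "ennen". Trained rule sets are much larger, which is why large full form lists are needed.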

Jongejan, Bart and Dalianis, Hercules: Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Suntec, Singapore: Association for Computational Linguistics, 2009. pp. 145-153.

taggerXML

CST uses Eric Brill's POS-tagger in many applications for the analysis of English texts (using Eric Brill's linguistic resources, in some cases with small adaptations) as well as Danish texts (with CST's linguistic resources). The distribution comprises Eric Brill's original distribution and a ZIP file with CST's software adaptations. Note that the training part of Brill's tagger is unchanged! We have made the following adaptations:

Reformatting from UNIX-style C to standard C++,
Replacement of some UNIX-specific functions with standard C functions,
Better handling of capitals in (supposed) headings,
The introduction of an option file "xoptions" to make the source code independent of language and tagset, and
Reading and writing XML formatted files, storing the POS tag in an attribute of the element containing the word.
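The XML adaptation can be pictured with the following Python sketch, which stores a tag in an attribute of each word element. It is an illustration only, not taggerXML's code: the element name `w`, the attribute name `pos`, and the toy tagging function are all assumptions, and the real tagger is a C++ program configured through its option file.

```python
import xml.etree.ElementTree as ET


def tag_words(xml_text, tag_sentence, element="w", attribute="pos"):
    """Write one POS tag per word element into the given attribute.

    `tag_sentence` is any callable that maps a list of words to a list
    of tags of the same length.
    """
    root = ET.fromstring(xml_text)
    nodes = list(root.iter(element))
    tags = tag_sentence([node.text or "" for node in nodes])
    for node, tag in zip(nodes, tags):
        node.set(attribute, tag)
    return ET.tostring(root, encoding="unicode")


def toy_tagger(words):
    """Placeholder standing in for a real tagger (hypothetical tag choices)."""
    return ["DT" if word.lower() == "the" else "NN" for word in words]


print(tag_words("<s><w>The</w><w>cat</w></s>", toy_tagger))
```

Keeping the tag in an attribute leaves the text content of the document untouched, so other annotations on the same elements are preserved.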

Get the most recent version of the source code at GitHub.

Academic users may use the linguistic resources that enable the POS-tagger to tag Danish text. These resources are based on linguistic resources (the Danish Parole corpus) owned by the Society for Danish Language and Literature (DSL). See korpus.dsl.dk.

Bracmat

Bracmat is an interpreted programming language that Bart Jongejan has been developing since 1986. It was originally designed as a computer algebra system, but it has shown its merits in natural language processing as well. It has been used in the field of General Relativity for the algebraic computation of Ricci tensors from given space-time metrics, for the implementation of a dialogue manager in the Staging project, for the analysis of texts in the "Controlled Language" part of the VID project, for automatic error correction of CST's many HTML pages, and for many corpus validation tasks. Bracmat has also proved useful in real-world applications, for example to identify persons, companies, etc. in pre-tagged texts that must be anonymised.

The most advanced application of Bracmat to date is the CLARIN-DK Text Tonsorium, a workflow planner and executor. Instead of letting the user choose between tools that she may not know very well, the Tools module asks the user to specify what kind of output she wants. With this information, the Tools module computes all (not necessarily sequential) combinations of tools and parameter settings that form workflows guaranteed to produce the specified output from the given input. The computed list is condensed into a short format that highlights the differences between the workflows and leaves out everything of lesser importance.

Get earlier versions (1986-2012) of the source code here.

Read more about Bracmat.

Papers:

Jongejan, Bart: Intricate Natural Language Processing made easier with Symbolic Computation software: Pattern Matching lessons from Bracmat. CLIN26. Amsterdam, 2015

Bart Jongejan. 2013. Workflow management in CLARIN-DK. In Proceedings of the workshop on Nordic language research infrastructure at NODALIDA 2013, volume 089 of NEALT Proceedings Series, pages 11–20. Northern European Association for Language Technology (NEALT), May.

Bart Jongejan. 2016. Implementation of a workflow management system for non-expert users. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 101–108, Osaka, Japan, December. The COLING 2016 Organizing Committee.

makeUTF8

Converts UTF-16 (BE/LE), UTF-32 (BE/LE), and ISO-8859-N to UTF-8. Removes the BOM and surrogate pairs from UTF-8, converting a code point between U+D800 and U+DBFF followed by a code point between U+DC00 and U+DFFF into one valid code point above U+FFFF.
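The surrogate pair repair follows the standard UTF-16 arithmetic, sketched here in Python. This is an illustration of the conversion, not makeUTF8's source code; the function names are hypothetical.

```python
def combine_surrogates(high, low):
    """Combine a high and a low surrogate into the code point they encode."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def repair_surrogate_pairs(text):
    """Replace every high+low surrogate pair in a str by a single code point."""
    out = []
    i = 0
    while i < len(text):
        cp = ord(text[i])
        has_low = i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF
        if 0xD800 <= cp <= 0xDBFF and has_low:
            out.append(chr(combine_surrogates(cp, ord(text[i + 1]))))
            i += 2
        else:
            out.append(text[i])  # ordinary character, copied unchanged
            i += 1
    return "".join(out)


# Bytes in which U+1F600 was wrongly stored as two 3-byte surrogates.
broken = b"a\xed\xa0\xbd\xed\xb8\x80b".decode("utf-8", "surrogatepass")
print(repair_surrogate_pairs(broken).encode("utf-8"))
```

The `surrogatepass` error handler is only needed to get the invalid input into a Python string; the repaired string encodes as ordinary UTF-8 with one four-byte sequence for the combined code point.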

Text Tonsorium

A web service that, given input file(s) and the user's requirements regarding the output, computes candidate workflows of NLP tools. After the computation, it enacts the workflow that the user has selected from the list of candidates. This web service is part of CLARIN-DK and can be found at https://cst.dk/texton/. It can also be reached via the Clarin.eu Language Resource Switchboard.
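The idea of computing candidate workflows from a goal can be sketched in Python as a search over tools described by what they consume and produce. This is a deliberately simplified illustration, not the Text Tonsorium's implementation (which is written in Bracmat): it only finds sequential chains, ignores parameter settings, and the tool names and data types below are invented.

```python
from typing import NamedTuple


class Tool(NamedTuple):
    name: str
    consumes: str  # type of data the tool needs
    produces: str  # type of data the tool delivers


def candidate_workflows(tools, have, want, max_steps=5):
    """Return every chain of tool names that turns `have` into `want`."""
    results = []

    def extend(state, chain, seen):
        if state == want:
            results.append(chain)
            return
        if len(chain) == max_steps:
            return
        for tool in tools:
            # Never revisit a data type within one chain, to avoid cycles.
            if tool.consumes == state and tool.produces not in seen:
                extend(tool.produces, chain + [tool.name], seen | {tool.produces})

    extend(have, [], {have})
    return results


# Hypothetical tool registry.
TOOLS = [
    Tool("segmenter", "text", "tokens"),
    Tool("tagger-a", "tokens", "tagged"),
    Tool("tagger-b", "tokens", "tagged"),
    Tool("lemmatiser", "tagged", "lemmas"),
    Tool("lemmatiser-notags", "tokens", "lemmas"),
]

for workflow in candidate_workflows(TOOLS, have="text", want="lemmas"):
    print(" -> ".join(workflow))
```

The user states only the starting point and the goal; the alternatives (here, two taggers and a tag-free lemmatiser) surface as separate candidates to choose from.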

Linguistic resources

If you are interested in obtaining linguistic resources that have been produced under the auspices of CST (STO, training data for the lemmatiser, grammars for the NP-recogniser, rules for the name recogniser), please contact Bart Jongejan (bartj {at} hum ku dk).

Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
Tel: +45 35329090
