Download of software
Software under GNU General Public License (GPL):
- CST's RTFreader
The package consists of the source code (C++) of CST's RTFreader. The program can
read a flat text file or an RTF (Rich Text Format) file and convert it to segmented
text with one sentence per line. Where full stops are missing, the program uses layout
information such as character size and typeface to decide where a sentence ends.
Optionally, the program delivers tokenised output and uses some heuristics to decide
whether dots belong to a token or are sentence delimiters. You can supplement the
tokenisation process with lists of abbreviations and multi-word units (MWUs). Such
language-dependent resources are not part of the package.
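The role of an abbreviation list in deciding whether a dot ends a sentence can be illustrated with a minimal Python sketch. This is not RTFreader's actual algorithm: the function name `split_sentences`, the example abbreviation set and the rule itself (a token-final dot, exclamation mark or question mark ends a sentence unless the token is a listed abbreviation) are assumptions made for illustration only.

```python
def split_sentences(text, abbreviations):
    """Split whitespace-separated text into sentences (one string each).

    Hypothetical rule: a token ending in '.', '!' or '?' closes a sentence
    unless its lowercased form occurs in the abbreviation list.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token.endswith((".", "!", "?"))
        if ends and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final full stop
        sentences.append(" ".join(current))
    return sentences


# Hypothetical abbreviation list, stored lowercased.
abbrevs = {"dr.", "e.g."}
for line in split_sentences("Dr. Smith arrived. He was late", abbrevs):
    print(line)
```

Without the abbreviation list, "Dr." would be treated as a sentence of its own, which is why such language-dependent lists matter.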
- CSTlemma
CST's lemmatizer uses affix rules (affix: prefix, infix, suffix, circumfix) and
has been trained for a number of languages. Trained affix rules are available for
the following languages:
Afrikaans, Albanian, Asturian, Belarusian, Bulgarian, Catalonian, Czech, Danish, Dutch, English, Estonian, Faroese, Farsi, French, Gaelic, Galician, Georgian, German, Greek,
Hungarian, Icelandic, Irish, Italian, Latin, Macedonian, Manx, Middle Low German, Polish, Portuguese, Romanian,
Russian, Serbian, Slovak, Slovene, Spanish, Ukrainian, Welsh.
Rules are obtained by supervised learning from a list of full form/lemma pairs.
In this folder you find lemmatisation
rules for a number of languages and in some cases dictionaries in binary format
that the lemmatiser can use. These dictionaries are not necessary, but can improve
output quality. Find a complete list of resources on GitHub.
Get the most recent version of the source code at GitHub.
You can use tinylemmatizer in your Python programs. This module uses the same lemmatisation rules as CSTlemma,
but does not have any of CSTlemma's many options.
If you want to train affix rules for the lemmatiser (think of German abgebrannt
→ abbrennen), you need the program affixtrain in addition to cstlemma. For languages
with rich morphology you need large full form word lists (more than 100,000 word
forms) to attain a reasonable result.
Jongejan, Bart and Dalianis, Hercules:
Automatic training of lemmatization rules that handle morphological changes in pre-,
in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language
Processing of the AFNLP. Suntec, Singapore: Association for Computational
Linguistics, 2009, pp. 145-153.
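The idea of an affix rule that changes an infix and a suffix at the same time (abgebrannt → abbrennen) can be sketched in a few lines of Python. The rule notation below (regular expressions with a replacement template), the two example rules and the first-match-wins strategy are assumptions for illustration; they are not CSTlemma's rule format, and real rules are learned by affixtrain rather than written by hand.

```python
import re

# Hypothetical hand-written rules: (pattern over the full form, lemma template).
RULES = [
    (re.compile(r"(.+)ge(.+)annt"), r"\1\2ennen"),  # infix "ge" removed, suffix changed
    (re.compile(r"(.+)te"), r"\1en"),               # suffix-only change
]


def lemmatize(word, rules=RULES):
    """Return the lemma given by the first matching rule, else the word itself."""
    for pattern, template in rules:
        match = pattern.fullmatch(word)
        if match:
            return match.expand(template)
    return word


print(lemmatize("abgebrannt"))  # abbrennen
print(lemmatize("sagte"))       # sagen
```

A word that matches no rule is returned unchanged, which mirrors the common fallback of treating an unknown form as its own lemma.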
- taggerXML
CST uses Eric Brill's POS-tagger
in many applications for the analysis of English texts (using Eric Brill's linguistic
resources, in some cases with small adaptations) as well as Danish texts (with CST's
linguistic resources). The distribution comprises Eric Brill's original distribution
and a Zip-file with CST's software adaptations. Note that the training part of Brill's
tagger is unchanged! We have made the following adaptations:
- Reformatting from UNIX-style C to standard C++.
- Replacement of some UNIX-specific functions with standard C functions.
- Better handling of capitals in (supposed) headings.
- The introduction of an option file "xoptions" to make the source code independent
of language and tagset.
- Reading and writing XML formatted files, storing the POS tag in an attribute of
the element containing the word.
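Storing a POS tag in an attribute of the element that contains the word might look like the following Python sketch. It only illustrates the data layout, not taggerXML itself: the element name `w`, the attribute name `pos` and the hard-coded tag list are assumptions, and in practice the tags would come from the tagger.

```python
import xml.etree.ElementTree as ET


def add_pos_attributes(xml_text, tags, word_element="w", pos_attribute="pos"):
    """Attach one POS tag per word element, in document order.

    `word_element` and `pos_attribute` are hypothetical names.
    """
    root = ET.fromstring(xml_text)
    words = list(root.iter(word_element))
    if len(words) != len(tags):
        raise ValueError("number of tags must equal number of word elements")
    for element, tag in zip(words, tags):
        element.set(pos_attribute, tag)
    return ET.tostring(root, encoding="unicode")


tagged = add_pos_attributes("<s><w>Dogs</w><w>bark</w></s>", ["NNS", "VBP"])
print(tagged)  # <s><w pos="NNS">Dogs</w><w pos="VBP">bark</w></s>
```

Keeping the tag in an attribute leaves the text content of the element untouched, so the original words can still be read back without stripping annotations.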
Get the most recent version of the source code at GitHub.
Academic users may use the linguistic resources that enable the POS-tagger to tag
Danish text. These resources are based on linguistic resources (the Danish Parole
corpus) owned by the Society for Danish Language and Literature (DSL).
See korpus.dsl.dk.
- Bracmat
Bracmat is an interpreted programming language that Bart Jongejan has been developing
since 1986. It was originally designed as a computer algebra system, but it has
shown its merits in natural language processing as well. It has been used in the
field of General Relativity for the algebraic computation of Ricci tensors from
given space-time metrics, for the implementation of a dialogue manager in the
Staging project, for the analysis of texts in the "Controlled Language" part of
the VID project, for automatic error correction of CST's many HTML pages and for
many corpus validation tasks. Bracmat has also shown its utility in some real-world
applications, for example to identify persons, companies etc. in pre-tagged texts
that must be anonymised. The most advanced application of Bracmat to date is as
a workflow planner and executor, the CLARIN-DK Text Tonsorium.
Instead of letting the user choose between tools,
which the user may not know very well, the Tools module asks the user to specify
what kind of output she wants. With this information the Tools module computes all
(not necessarily sequential) combinations of tools and their parameter settings
that combine into workflows that are guaranteed to produce the specified output
from the given input. The computed list is condensed into a short format that highlights
the differences between the workflows for the user and leaves out all that is of
less importance.
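The planning step described above can be approximated by a small search over tool descriptions. The Python sketch below is a simplification under stated assumptions: it only finds sequential chains (the Tools module also handles non-sequential combinations and parameter settings), and the `Tool` type, the tool names and the format labels are all hypothetical. It is not how the Bracmat implementation works internally.

```python
from typing import NamedTuple


class Tool(NamedTuple):
    name: str
    consumes: str  # kind of data the tool needs as input
    produces: str  # kind of data the tool delivers


def find_workflows(tools, have, want, used=()):
    """Return every chain of distinct tools that turns `have` into `want`."""
    if have == want:
        return [list(used)]
    workflows = []
    for tool in tools:
        if tool.consumes == have and tool not in used:
            workflows += find_workflows(tools, tool.produces, want, used + (tool,))
    return workflows


# Hypothetical tool inventory.
TOOLS = [
    Tool("segmenter", "text", "sentences"),
    Tool("tokeniser", "sentences", "tokens"),
    Tool("tagger", "tokens", "tagged"),
    Tool("lemmatiser", "tokens", "lemmas"),
    Tool("lemmatiser-with-tags", "tagged", "lemmas"),
]

for workflow in find_workflows(TOOLS, "text", "lemmas"):
    print(" -> ".join(tool.name for tool in workflow))
```

With this inventory the search finds two candidate workflows for producing lemmas from text, one with and one without the tagging step, which is the kind of difference the condensed list is meant to highlight for the user.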
Get earlier versions (1986-2012) of the source code here.
Read more about Bracmat.
Papers:
Jongejan, Bart: Intricate
Natural Language Processing made easier with Symbolic Computation software: Pattern
Matching lessons from Bracmat.
CLIN26. Amsterdam, 2015
Bart Jongejan. 2013. Workflow
management in CLARIN-DK. In Proceedings of the workshop on Nordic language research
infrastructure at NODALIDA 2013, volume 089 of NEALT Proceedings Series,
pages 11–20. Northern European Association for Language Technology (NEALT), May.
Bart Jongejan. 2016. Implementation of
a workflow management system for non-expert users. In Proceedings of the Workshop
on Language Technology Resources and Tools for Digital Humanities (LT4DH),
pages 101–108, Osaka, Japan, December. The COLING 2016 Organizing Committee.
- makeUTF8
Converts UTF-16 (BE/LE), UTF-32 (BE/LE) and ISO-8859-N to UTF-8. Removes the BOM and
surrogate pairs from UTF-8: a code point between U+D800 and U+DBFF followed by a
code point between U+DC00 and U+DFFF is converted to one valid code point above U+FFFF.
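The surrogate-pair conversion follows the standard UTF-16 arithmetic, shown here as a minimal Python sketch. The function name `combine_surrogates` is an assumption for illustration; this is not code from makeUTF8.

```python
def combine_surrogates(high, low):
    """Combine a high and a low surrogate into one code point above U+FFFF."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in U+D800..U+DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in U+DC00..U+DFFF")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD83D, 0xDE00)
print(f"U+{code_point:X}")             # U+1F600
print(chr(code_point).encode("utf-8"))  # the valid four-byte UTF-8 sequence
```

Encoding each surrogate separately would yield two three-byte sequences that are not valid UTF-8, whereas the combined code point is encoded as a single four-byte sequence.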
- Text Tonsorium
A web service that, given one or more input files and the user's
requirements regarding the output, computes candidate workflows of NLP tools.
After the computation, it enacts the workflow that the user has selected
from the list of candidates.
This web service is part of CLARIN-DK and can be found at the following URL:
https://cst.dk/texton/.
It can also be reached via the CLARIN Language Resource Switchboard.
Linguistic resources
If you are interested in obtaining linguistic resources that have been produced
under the auspices of CST (STO, training data for the lemmatiser, grammars for the
NP-recogniser, rules for the name recogniser), please contact Bart Jongejan (bartj {at} hum ku dk).