The Text Tonsorium is a workflow management system that not only executes workflows, but also composes workflows from building blocks. Each building block encapsulates a Natural Language Processing tool.
The Text Tonsorium may compose many workflows that all lead to your goal, and it will then ask you to choose one of the proposed workflows. In general, the more detail you add to your goal, the fewer solutions the Text Tonsorium will find, possibly none at all.
In contrast to other workflow management systems, the Text Tonsorium focuses on results rather than on tools, so a general understanding of how the desired result is described is useful. Still, you may also want to know which tools are currently integrated in the Text Tonsorium. This page provides information about both.
# | Name of the tool | Supported languages |
---|---|---|
1 | Annotation splitter | |
2 | Bohnet's POS tagger | de, en, es, fr |
3 | Bohnet's parser | da, de, en, es, fr |
4 | Brill's PoS-tagger | as, da, en, la |
5 | CBF-Tokenizer | af, bg, bs, ca, cs, cy, da, de, el, en, eo, es, et, fa, fi, fr, hi, hr, hu, hy, id, is, it, ka, kn, ku, la, lv, mk, ml, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, sw, ta, tr, uk, vi, zh |
6 | cluster | |
7 | CoNLL converter | |
8 | CONLL to Penn Treebank | |
9 | CQP-corpus creator | |
10 | Create pre-tokenized Clarin Base Format text | |
11 | CST's Name recogniser | da |
12 | CST's NP recogniser | da |
13 | CST's Repetitiveness checker | |
14 | CST's RTFreader | |
15 | CST-Lemmatiser | af, bg, cs, da, de, el, en, es, et, fa, fr, hr, hu, is, it, la, mk, nl, no, pl, pt, ro, ru, sk, sl, sr, sv, uk |
16 | CST-Normaliser | da |
17 | CuneiForm | bg, cs, da, de, en, es, et, fr, hr, hu, it, lt, lv, nl, pl, pt, ro, ru, sl, sr, sv, tr, uk |
18 | daner | da |
19 | dapipe | da |
20 | Diplom annotator | |
21 | Diplom fetch corrected text | da, gml, la, sv |
22 | eSpeak | af, bg, bs, ca, cs, cy, da, de, el, en, eo, es, et, fi, fr, hi, hr, hu, hy, id, is, it, ka, kn, ku, la, lv, mk, ml, nl, pl, pt, ro, ru, sk, sq, sr, sv, sw, ta, tr, vi, zh |
23 | Frequencies | |
24 | html2text | |
25 | JSON pretty print | |
26 | JSON to ORG-mode converter | |
27 | JSON to Tab-separated | |
28 | JSON to TEI P5 annotation | |
29 | KORP to Excel converter | |
30 | Lapos POS tagger | da, la |
31 | LemPoS | bg, cs, da, de, en, es, et, fa, hr, hu, is, it, la, mk, nl, pl, pt, ro, ru, sk, sl, sr, sv, uk |
32 | LibreOffice | |
33 | Normalize dipl | la |
34 | OpenNLP tools PosTagger | da, en |
35 | pdf2htmlEX | |
36 | PDFMiner | |
37 | PoS tag translator | da, la |
38 | pretokenize TEI P5 | |
39 | PruneLemPos | |
40 | Sentence extractor | |
41 | TEI P5 anno to Org-mode | |
42 | TEI P5 annotator | |
43 | TEI P5 to CoNLL-U converter | |
44 | TEIP5-segmenter | |
45 | TEIP5-tokeniser/sentence extractor | |
46 | Tesseract-OCRv4 | af, br, bs, ca, co, cs, cy, da, de, en, eo, es, et, eu, fi, fo, fr, ga, gl, hr, ht, hu, id, is, it, iu, jv, la, lb, lt, lv, ms, mt, nb, nl, nn, oc, pl, pt, ro, sk, sl, sq, sr, sv, sw, tr, uz, vi, yi |
47 | vujiLoX | la |
Takes a TEI P5 document containing multiple stand-off annotation groups (spanGrp). Outputs one of the annotation groups.
Part of Speech tagger that is distributed as part of mate-tools.
Dependency parser, part of mate-tools.
Part-of-speech tagger: Marks each word in a text with information about word class and morphological features.
From Clarin Base Format enriched with token and segment attributes, extract tokens and their offset in the input.
Makes clusters of lines, separated by empty lines. Each cluster contains all lines that share the value in the first or second column with one or more other lines in the same cluster.
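The clustering rule above can be sketched with a union-find over column values. `cluster_lines` is a hypothetical name, and the real tool may differ in details such as cluster ordering:

```python
from collections import defaultdict

def cluster_lines(lines):
    # Two lines join the same cluster when they share a value in the
    # first or second tab-separated column. Values are union-find nodes.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for line in lines:
        cols = line.split('\t')
        if len(cols) > 1:
            union(cols[0], cols[1])
        else:
            find(cols[0])

    groups = defaultdict(list)
    for line in lines:
        groups[find(line.split('\t')[0])].append(line)
    # clusters are separated by empty lines
    return '\n\n'.join('\n'.join(g) for g in groups.values())
```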
Converts input to CoNLL 2007 format.
Converts syntax dependency annotation in CoNLL 2009 or CoNLL-U format to bracketed "Lisp-like" format.
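The conversion can be illustrated with a small sketch that nests each token under its head; the exact bracket shape the tool emits is not documented here, so this is only an approximation:

```python
def conllu_to_brackets(rows):
    # `rows` holds (id, form, head) triples for one sentence, with
    # head 0 marking the root, as in CoNLL-U. Each token is nested
    # under its head in a Lisp-like bracketing.
    children = {}
    for tid, form, head in rows:
        children.setdefault(head, []).append((tid, form))

    def render(tid, form):
        kids = ''.join(' ' + render(c, f) for c, f in children.get(tid, []))
        return '(%s%s)' % (form, kids)

    return ' '.join(render(t, f) for t, f in children.get(0, []))
```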
Takes input containing words, tags and lemmas and creates output that can be read by the CQP software.
From a plain segmented and tokenized text file that uses DEL characters to separate tokens that are written together in the input, creates a TEI P5 Clarin Base Format text with attributes S and T for segment and token identification.
Classifies names as proper names, locations (with sub-classes of street, city, land and other types of locations), and other names (called MISC).
Collects words that constitute noun phrases.
Uses a statistical method to find repetitions in a text.
Extracts segments from RTF-file or from plain text. Optionally tokenises.
Produces the dictionary look-up form (or lemma) for each word, inflected or not, in the input.
Normalises older (1200-1900) Danish text to spelling rules as employed in ODS (Ordbog over det danske Sprog).
Intelligent text recognition system: Optical Character Recognition.
Named Entity Recognition for Danish, Distributed by ITU NLP. Uses Stanford CoreNLP NER and the model from DKIE to tag incoming Danish plain text for named entities, in three classes: location, person, and organization names.
UDPipe tools for Danish. udpipe does PoS-tagging, lemmatization and syntactic analysis. The syntactic analysis and lemmatization are always based on UDPipe's own PoS-tagging. Using dapipe with TEI P5 input is discouraged unless tokenisation and sentence extraction are done in separate steps, and not by dapipe itself.
Stores the lemma in column 3 and/or the word class in column 4 of an Org-mode input file that already has diplomatic and facsimile values in columns 7 and 8.
Fetches the column with corrected transcriptions. This column contains words with additions between parentheses; the parentheses are removed in the output.
Text-to-speech software, originally known as speak and first written for Acorn/RISC_OS computers in 1995. This version is an enhancement and rewrite, relaxing the original memory and processing power constraints and adding support for additional languages.
Sorts input lines, collapses equal lines, appends column with frequencies. Assumes that input is 1, 2 or 3 columns, separated by tabs.
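The described behaviour is essentially `sort | uniq -c` with tab-separated output; a minimal sketch (function name hypothetical):

```python
from collections import Counter

def frequencies(lines):
    # Sort input lines, collapse duplicates, and append a
    # tab-separated frequency column, as described above.
    counts = Counter(lines)
    return ['%s\t%d' % (line, n) for line, n in sorted(counts.items())]
```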
A very simple script that loads HTML and iterates over the DOM to output plain text correctly.
JSON pretty-print parser based on a recursive lexical analyser. The parser is based on the specification defined at json.org. The input file is parsed to build a JSON object. If the object is correct, it is pretty-printed to standard output.
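Python's standard library gives the same parse-then-pretty-print pipeline; this is an illustration, not the tool's own code:

```python
import json

def pretty_print(text):
    # Parse first (malformed input raises an error), then pretty-print,
    # mirroring the parse-validate-print pipeline described above.
    obj = json.loads(text)
    return json.dumps(obj, indent=2, ensure_ascii=False)
```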
Converts JSON output with tokens, lemmas and Part of Speech tags to a three-column ORG-mode table.
Convert word-lemma-pos data from JSON to CQP format.
Read json file with fields for token ID, word, lemma and pos. Output a TEI P5 annotation file (spanGrp) containing either lemmas or Part of Speech tags.
This tool generates a tabulator separated file with all KWIC (keyword-in-context) results generated by the KORP tool at the address https://alf.hum.ku.dk/korp/. Input to the tool is the URL copied from the address line when KORP has performed a search.
Fork of the Lookahead Part-Of-Speech (Lapos) Tagger.
Lemmatizes input text and adds PoS-options to each lemma. Output can be ambiguous.
A powerful office suite, here used to convert office documents to RTF or PDF.
Fills the column to the left of the diplomatic column with normalized tokens, i.e. v -> u, j -> i, and all lowercase.
Part of Speech Tagger that marks tokens with their corresponding word type based on the token itself and the context of the token. Uses a probability model to predict the correct pos tag.
Converts PDF to HTML without losing text or format. (The produced HTML can hardly be interpreted by other tools.)
Extracts information from PDF documents. Focuses entirely on getting and analyzing text data.
Translates from DSL's tag set to Menota's.
Applies a primitive tokenisation to the contents of the <text> element in a TEI P5 document. Each word, punctuation mark and whitespace is marked up with w or c tags. S and T attributes indicate which primitive tokens should be combined to create higher-level tokens.
Takes text input with lemma and PoS annotations that are ambiguous. In the output, the ambiguity is diminished by using bigram frequencies.
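A greedy sketch of bigram-based pruning, assuming per-token candidate lists and a table of PoS bigram counts (both parameter names hypothetical); the real tool's model may differ:

```python
def prune_lempos(candidates, bigram_freq):
    # Greedy left-to-right pass: keep, at each position, the
    # (lemma, pos) option whose PoS forms the most frequent bigram
    # with the previously chosen PoS. Ties fall back to the first option.
    chosen = [candidates[0][0]]
    for options in candidates[1:]:
        prev_pos = chosen[-1][1]
        best = max(options,
                   key=lambda opt: bigram_freq.get((prev_pos, opt[1]), 0))
        chosen.append(best)
    return chosen
```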
From Clarin Base Format (enriched with attributes for tokens and segments) extract the sentences and their offsets in the source.
Converts TEI P5 stand-off annotation to a two-column file in Org-mode format. The first column contains a token, the second contains the annotation: PoS-tag, word class, or lemma.
Adds attributes for lemma and Part of Speech tag to <w> and <c> elements. (<w> and <c> elements must already exist.)
Converts a TEI P5 document with annotations for lemma, pos (or msd) and syntactic dependencies to CoNLL-U 10 column format.
Reads tokens and sentences as annotations and produces segment annotations, where segments refer to tokens, not to the base text. Input and output is encoded in TEI P5.
Reads TEI P5 and produces token and sentence annotations. The annotations refer to the base text, but also include the tokens and sentences themselves in plain text.
Tesseract Open Source OCR Engine. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns.
Converts Latin text to lower case and transforms v to u and j to i.
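The transformation is simple enough to state as a one-liner (function name hypothetical):

```python
def vujilox_normalise(text):
    # Lowercase the Latin text, then merge v into u and j into i.
    return text.lower().replace('v', 'u').replace('j', 'i')
```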
Subtype of resource, e.g. basis text, tokenisation, alphabetic list.
Language of the text.
The way that information is encoded for storage in a computer file.
Time period of the spelling in the text resource.
How results are presented to the user.
Decorative tradition, for example typeface class.
Whether data contains ambiguous elements.
Special feature used to give unique identity to input and intermediate data, for example output of OCR software.