Text Tonsorium

(This page is still a work in progress.)

The Text Tonsorium is a workflow management system that not only executes workflows, but also composes workflows from building blocks. Each building block encapsulates a Natural Language Processing tool.

The Text Tonsorium may compose many workflows that all lead to your goal. It will then ask you to choose one of the proposed workflows. In general, the more detail you add to your goal, the fewer solutions the Text Tonsorium will find; with a very specific goal it may find none at all.

In contrast to other workflow management systems, the Text Tonsorium focuses on results rather than on tools, so a general understanding of how the desired result is described is useful. Still, you may also want to know which tools are currently integrated in the Text Tonsorium. This page covers both.
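
The goal-driven composition described above can be illustrated with a small sketch. This is not the Text Tonsorium's actual algorithm or data model: the `Tool` type, the three tool entries, the feature names and the `compose` function are all hypothetical, chosen only to show how a more detailed goal leaves fewer candidate workflows.

```python
from typing import Dict, List, NamedTuple, Tuple

Features = Dict[str, str]


class Tool(NamedTuple):
    name: str
    needs: Features  # feature values the input must have
    gives: Features  # feature values the tool sets on its output


# Hypothetical tool descriptions, invented for this illustration.
TOOLS = [
    Tool("tokeniser", {"content": "text"}, {"content": "tokens"}),
    Tool("lemmatiser A", {"content": "tokens"}, {"content": "lemmas"}),
    Tool("lemmatiser B", {"content": "tokens"},
         {"content": "lemmas", "ambiguity": "ambiguous"}),
]


def matches(data: Features, wanted: Features) -> bool:
    """True if the data has every feature value that is asked for."""
    return all(data.get(key) == value for key, value in wanted.items())


def compose(data: Features, goal: Features, max_steps: int = 4,
            used: Tuple[str, ...] = ()) -> List[List[str]]:
    """Return every chain of tool names that turns `data` into `goal`."""
    if matches(data, goal):
        return [list(used)]
    if len(used) >= max_steps:
        return []
    workflows: List[List[str]] = []
    for tool in TOOLS:
        if tool.name not in used and matches(data, tool.needs):
            workflows += compose({**data, **tool.gives}, goal,
                                 max_steps, used + (tool.name,))
    return workflows
```

With the input `{"content": "text"}`, the goal `{"content": "lemmas"}` yields two workflows in this toy registry. Adding `"ambiguity": "ambiguous"` to the goal leaves one, and the goal `{"content": "syntax"}` leaves none.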

Overview

Data
  Input
  Goal
Features
  Type of content
    Subtype of resource, e.g. basis text, tokenisation, alphabetic list.
  Language
    Language of the text
  Format
    The way that information is encoded for storage in a computer file.
  Historical period
    Time period of the spelling in the text resource.
  Assemblage
    How results are presented for the user.
  Appearance
    Decorative tradition, for example typeface class.
  Ambiguity
    Whether data contains ambiguous elements.
  Smell
    Special feature used to give unique identity to input and intermediate data, for example output of OCR software.
Tools
# | Name of the tool | Supported languages
1 | Annotation splitter |
2 | Bohnet's POS tagger | de, en, es, fr
3 | Bohnets parser | da, de, en, es, fr
4 | Brill's PoS-tagger | as, da, en, la
5 | CBF-Tokenizer | af, bg, bs, ca, cs, cy, da, de, el, en, eo, es, et, fa, fi, fr, hi, hr, hu, hy, id, is, it, ka, kn, ku, la, lv, mk, ml, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, sw, ta, tr, uk, vi, zh
6 | cluster |
7 | CoNLL converter |
8 | CONLL to Penn Treebank |
9 | CQP-corpus creator |
10 | Create pre-tokenized Clarin Base Format text |
11 | CST's Name recogniser | da
12 | CST's NP recogniser | da
13 | CST's Repetitiveness checker |
14 | CST's RTFreader |
15 | CST-Lemmatiser | af, bg, cs, da, de, el, en, es, et, fa, fr, hr, hu, is, it, la, mk, nl, no, pl, pt, ro, ru, sk, sl, sr, sv, uk
16 | CST-Normaliser | da
17 | CuneiForm | bg, cs, da, de, en, es, et, fr, hr, hu, it, lt, lv, nl, pl, pt, ro, ru, sl, sr, sv, tr, uk
18 | daner | da
19 | dapipe | da
20 | Diplom annotator |
21 | Diplom fetch corrected text | da, gml, la, sv
22 | eSpeak | af, bg, bs, ca, cs, cy, da, de, el, en, eo, es, et, fi, fr, hi, hr, hu, hy, id, is, it, ka, kn, ku, la, lv, mk, ml, nl, pl, pt, ro, ru, sk, sq, sr, sv, sw, ta, tr, vi, zh
23 | Frequencies |
24 | html2text |
25 | JSON pretty print |
26 | JSON to ORG-mode converter |
27 | JSON to Tab-separated |
28 | JSON to TEI P5 annotation |
29 | KORP to Excel converter |
30 | Lapos POS tagger | da, la
31 | LemPoS | bg, cs, da, de, en, es, et, fa, hr, hu, is, it, la, mk, nl, pl, pt, ro, ru, sk, sl, sr, sv, uk
32 | LibreOffice |
33 | Normalize dipl | la
34 | OpenNLP tools PosTagger | da, en
35 | pdf2htmlEX |
36 | PDFMiner |
37 | PoS tag translator | da, la
38 | pretokenize TEI P5 |
39 | PruneLemPos |
40 | Sentence extractor |
41 | TEI P5 anno to Org-mode |
42 | TEI P5 annotator |
43 | TEI P5 to CoNLL-U converter |
44 | TEIP5-segmenter |
45 | TEIP5-tokeniser/sentence extractor |
46 | Tesseract-OCRv4 | af, br, bs, ca, co, cs, cy, da, de, en, eo, es, et, eu, fi, fo, fr, ga, gl, hr, ht, hu, id, is, it, iu, jv, la, lb, lt, lv, ms, mt, nb, nl, nn, oc, pl, pt, ro, sk, sl, sq, sr, sv, sw, tr, uz, vi, yi
47 | vujiLoX | la

Annotation splitter

Takes a TEI P5 document containing multiple stand-off annotation groups (spanGrp). Outputs one of the annotation groups.

Bohnet's POS tagger

Part of Speech tagger that is distributed as part of mate-tools.

Bohnets parser

Dependency parser, part of mate-tools.

Brill's PoS-tagger

Part-of-speech tagger: Marks each word in a text with information about word class and morphological features.

CBF-Tokenizer

From Clarin Base Format enriched with token and segment attributes, extract tokens and their offset in the input.

cluster

Makes clusters of lines, separated by empty lines. Each cluster contains all those lines that have the value in the first or second column in common with one or more other lines in the same cluster.
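
The clustering described above amounts to finding connected components among lines. The following is a minimal sketch using union-find, not the tool's own implementation. It assumes tab-separated columns, that values are compared column by column (first with first, second with second), and that a line sharing nothing with any other line forms a cluster of its own. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_lines(lines: List[str]) -> List[List[str]]:
    """Group lines that share a first- or second-column value, transitively."""
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smallest index as root so clusters stay in input order.
            parent[max(ri, rj)] = min(ri, rj)

    first_seen: Dict[Tuple[int, str], int] = {}
    for idx, line in enumerate(lines):
        for col_no, value in enumerate(line.split("\t")[:2]):
            key = (col_no, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups: Dict[int, List[str]] = defaultdict(list)
    for idx, line in enumerate(lines):
        groups[find(idx)].append(line)
    return list(groups.values())


def format_clusters(clusters: List[List[str]]) -> str:
    """Write clusters as blocks of lines separated by empty lines."""
    return "\n\n".join("\n".join(cluster) for cluster in clusters)
```

Because membership is transitive, two lines can end up in the same cluster without sharing a value directly, as long as a chain of other lines connects them.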

CoNLL converter

Converts input to CoNLL 2007 format.

CONLL to Penn Treebank

Convert syntax dependency annotation in CoNLL 2009 or CoNLL-U format to bracketed "Lisp-like" format.

CQP-corpus creator

Takes input containing words, tags and lemmas and creates output that can be read by the CQP software.

Create pre-tokenized Clarin Base Format text

From a plain segmentized and tokenized text file that uses DEL characters to separate tokens that are written together in the input, create a TEI P5 Clarin Base Format text with attributes S and T for segment and token identification.

CST's Name recogniser

Classifies names as proper names, locations (with sub-classes of street, city, land and other types of locations), and other names (called MISC).

CST's NP recogniser

Collects words that constitute noun phrases.

CST's Repetitiveness checker

Uses a statistical method to find repetitions in a text.

CST's RTFreader

Extracts segments from RTF-file or from plain text. Optionally tokenises.

CST-Lemmatiser

Produces the dictionary look-up form (or lemma) for each word, inflected or not, in the input.

CST-Normaliser

Normalises older (1200-1900) Danish text to spelling rules as employed in ODS (Ordbog over det danske Sprog).

CuneiForm

Intelligent text recognition system (Optical Character Recognition).

daner

Named Entity Recognition for Danish, distributed by ITU NLP. Uses Stanford CoreNLP NER and the model from DKIE to tag incoming Danish plain text for named entities, in three classes: location, person, and organization names.

dapipe

UDPipe tools for Danish. UDPipe does PoS-tagging, lemmatization and syntactic analysis. The syntactic analysis and lemmatization are always based on UDPipe's own PoS-tagging. Using dapipe with TEI P5 input is discouraged, unless tokenisation and sentence extraction are done in separate steps, and not by dapipe itself.

Diplom annotator

Stores the lemma in column 3 and/or the word class in column 4 of an Org-mode input file that already has diplomatic and facsimile values in columns 7 and 8.

Diplom fetch corrected text

Fetch the column with corrected transcriptions. This column contains words with additions between parentheses. The parentheses are removed in the output.
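
As a small illustration of the parenthesis handling described above, the sketch below keeps the added letters and drops only the parentheses themselves. That reading of the description is an assumption, and the function name and the sample word are hypothetical. How the column is located in the input is not shown.

```python
def strip_parentheses(word: str) -> str:
    """Drop the parentheses but keep the letters they enclose."""
    return word.replace("(", "").replace(")", "")


# An abbreviated word with expansions marked by parentheses (invented example).
print(strip_parentheses("d(omi)n(u)s"))
```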

eSpeak

Text to speech software. Originally known as speak and originally written for Acorn/RISC_OS computers starting in 1995. This version is an enhancement and re-write, including a relaxation of the original memory and processing power constraints, and with support for additional languages.

Frequencies

Sorts input lines, collapses equal lines, appends column with frequencies. Assumes that input is 1, 2 or 3 columns, separated by tabs.
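
The sort, collapse and count steps described above can be sketched in a few lines. This is an illustration rather than the tool's own code. The function name is hypothetical, and skipping blank lines is an assumption.

```python
from collections import Counter
from typing import Iterable, List


def frequencies(lines: Iterable[str]) -> List[str]:
    """Collapse equal lines and append a tab-separated frequency column."""
    counts = Counter(line.rstrip("\n") for line in lines if line.strip())
    # Sorting the (line, count) pairs orders the output by line content.
    return [f"{line}\t{n}" for line, n in sorted(counts.items())]
```

Since whole lines are compared, the sketch works unchanged for one, two or three tab-separated columns.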

html2text

A very simple script that loads from HTML, and then iterates over the DOM to correctly output plain text.

JSON pretty print

JSON pretty-print parser based on a recursive lexical analyser. The parser was based on the specification defined at json.org. The input file is parsed to build a JSON object. If the object is correct, it will be pretty-printed to standard output.
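
The parse-then-print behaviour can be reproduced with Python's standard `json` module. This sketch swaps in that module for the tool's own recursive parser. The function name and the two-space indentation are assumptions.

```python
import json


def pretty_print(text: str) -> str:
    """Parse JSON text and return an indented rendering of it."""
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed
    # input, so nothing is returned unless the input parses correctly.
    obj = json.loads(text)
    return json.dumps(obj, indent=2, ensure_ascii=False)


print(pretty_print('{"a": [1, 2]}'))
```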

JSON to ORG-mode converter

Converts JSON output with tokens, lemmas and Part of Speech tags to a three-column ORG-mode table.
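
A conversion of this kind might look like the sketch below. The JSON layout (a list of objects with the keys `word`, `lemma` and `pos`) is an assumption made for the example, and the function name is hypothetical. The actual input produced by the upstream tools may be structured differently.

```python
import json


def json_to_org(text: str) -> str:
    """Render a list of token objects as a three-column Org-mode table."""
    tokens = json.loads(text)
    # Org-mode table rows are written as cells delimited by vertical bars.
    rows = [f"| {t['word']} | {t['lemma']} | {t['pos']} |" for t in tokens]
    return "\n".join(rows)
```

A token that itself contains a vertical bar would break the table, so a fuller version would need to escape it.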

JSON to Tab-separated

Convert word-lemma-pos data from JSON to CQP format.

JSON to TEI P5 annotation

Reads a JSON file with fields for token ID, word, lemma and pos. Outputs a TEI P5 annotation file (spanGrp) containing either lemmas or Part of Speech tags.

KORP to Excel converter

This tool generates a tab-separated file with all KWIC (keyword-in-context) results generated by the KORP tool at the address https://alf.hum.ku.dk/korp/. Input to the tool is the URL copied from the address line when KORP has performed a search.

Lapos POS tagger

Fork of the Lookahead Part-Of-Speech (Lapos) Tagger.

LemPoS

Lemmatizes input text and adds PoS-options to each lemma. Output can be ambiguous.

LibreOffice

A powerful office suite, here used to convert office documents to RTF or PDF.

Normalize dipl

Fill column left of diplom column with normalized tokens, i.e. v -> u, j -> i and all lowercase.

OpenNLP tools PosTagger

Part of Speech Tagger that marks tokens with their corresponding word type based on the token itself and the context of the token. Uses a probability model to predict the correct pos tag.

pdf2htmlEX

Converts PDF to HTML without losing text or format. (The produced HTML can hardly be interpreted by other tools.)

PDFMiner

Extracts information from PDF documents. Focuses entirely on getting and analyzing text data.

PoS tag translator

Translates from DSL's tag set to Menotas.

pretokenize TEI P5

Applies a primitive tokenisation to the contents of the <text> element in a TEI P5 document. Each word, punctuation mark and whitespace is marked up by w or c tags. S and T attributes indicate which primitive tokens should be combined to create higher level tokens.

PruneLemPos

Takes text input that has lemma and PoS annotations that are ambiguous. In the output, the ambiguity is diminished by using bigram frequencies.

Sentence extractor

From Clarin Base Format (enriched with attributes for tokens and segments) extract the sentences and their offsets in the source.

TEI P5 anno to Org-mode

Converts TEI P5 stand-off annotation to a two-column file in Org-mode format. The first column contains a token, the second contains the annotation: POS-tag, word class, or lemma.

TEI P5 annotator

Add attributes for lemma and Part of Speech tag to <w> and <c> elements. (<w> and <c> elements must already exist.)

TEI P5 to CoNLL-U converter

Converts a TEI P5 document with annotations for lemma, pos (or msd) and syntactic dependencies to CoNLL-U 10 column format.
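
For readers unfamiliar with the target format, the sketch below writes one sentence in the ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). It does not read TEI P5. The token dictionaries, their lowercase keys and the function name are assumptions made for illustration, and the Danish sample sentence is invented.

```python
from typing import Dict, List

CONLLU_COLUMNS = ["id", "form", "lemma", "upos", "xpos",
                  "feats", "head", "deprel", "deps", "misc"]


def to_conllu(sentence: List[Dict[str, object]]) -> str:
    """Format one sentence as tab-separated CoNLL-U token lines."""
    lines = []
    for token in sentence:
        # Missing values are written as an underscore.
        lines.append("\t".join(str(token.get(col, "_"))
                               for col in CONLLU_COLUMNS))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"


sentence = [
    {"id": 1, "form": "Hunden", "lemma": "hund", "upos": "NOUN",
     "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "sover", "lemma": "sove", "upos": "VERB",
     "head": 0, "deprel": "root"},
]
print(to_conllu(sentence), end="")
```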

TEIP5-segmenter

Reads tokens and sentences as annotations and produces segment annotations, where segments refer to tokens, not to the base text. Input and output are encoded in TEI P5.

TEIP5-tokeniser/sentence extractor

Reads TEIP5 and produces token and sentence annotations. The annotations refer to the base text, but also include the tokens and sentences themselves in plain text.

Tesseract-OCRv4

Tesseract Open Source OCR Engine. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns.

vujiLoX

Converts Latin text to lower case and transforms v to u and j to i.
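
The transformation is simple enough to state as code. The following is a sketch of the described behaviour, not the tool's source, and the function name is hypothetical.

```python
# Translation table applied after lower-casing: v becomes u, j becomes i.
_VJ_TABLE = str.maketrans({"v": "u", "j": "i"})


def vujilox(text: str) -> str:
    """Lower-case the text, then replace v with u and j with i."""
    return text.lower().translate(_VJ_TABLE)


print(vujilox("Juvenis VENIT"))
```

Lower-casing first means that capital V and J are covered by the same two substitutions.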

Type of content

Subtype of resource, e.g. basis text, tokenisation, alphabetic list.

text
sentences
segments
tokens
  Penn Treebank
  Simple
named entities
PoS-tags
  Penn Treebank
  CST-tagset
  DSL-tagset
  Universal Part-of-Speech Tagset
  Menotas
lemmas
noun phrases
syntax (dependency structure)
PoS-tags,lemmas,syntax
repeated phrases
segments,tokens
  Penn Treebank
  Simple
segments,tokens,named entities
segments,PoS-tags
  CST-tagset
segments,tokens,PoS-tags
  Penn Treebank
  CST-tagset
  DSL-tagset
  Universal Part-of-Speech Tagset
segments,lemmas
segments,tokens,lemmas
segments,tokens,lemmas,PoS-tags
  Penn Treebank
  CST-tagset
  DSL-tagset
  Universal Part-of-Speech Tagset
  Menotas
segments,tokens,word class,PoS-tags,lemmas
segments,tokens,PoS-tags,dependency relations
segments,tokens,PoS-tags,lemmas,dependency relations
tokens,PoS-tags
  Menotas
tokens,lemmas
tokens,PoS-tags,lemmas
  Penn Treebank
  CST-tagset
  DSL-tagset
  Menotas
keyword-in-context (KWIC)

Language

Language of the text

Afrikaans
Albanian
Armenian
Assamese
Basque
Bosnian
Breton
Bulgarian
Catalan
Chinese
Corsican
Croatian
Czech
Danish
Dutch
English
Esperanto
Estonian
Faroese
Finnish
French
Galician
Georgian
German
Greek
Middle Low German
Haitian
Hindi
Hungarian
Icelandic
Indonesian
Inuktitut
Irish
Italian
Javanese
Kannada
Kurdish
Latin
Latvian
Lithuanian
Luxembourgish
Macedonian
Malay
Malayalam
Maltese
Norwegian
Norwegian Bokmål
Norwegian Nynorsk
Occitan
Persian
Polish
Portuguese
Romanian
Russian
Serbian
Slovak
Slovene
Spanish
Swahili
Swedish
Tamil
Turkish
Ukrainian
Uzbek
Vietnamese
Welsh
Yiddish

Format

The way that information is encoded for storage in a computer file.

plain
Can be edited with a simple text editor like 'vi'.
UTF-8
RTF
PDF
HTML
Traditional tags (h, p, etc.)
Exact layout
Corpus Workbench (for CQP queries)
verticalized text, Corpus Workbench input format
DOC
DOCX
ODF
ODP
PPT
PPTX
TEIP5
TEIP5DKCLARIN
Can be viewed in a browser and edited with a text editor like 'vi' or with an XML editor like 'Oxygen' or Microsoft's 'Visual Studio'
TEIP5DKCLARIN_ANNOTATION
Can be viewed in a browser and edited with a text editor like 'vi' or with an XML editor like 'Oxygen' or Microsoft's 'Visual Studio'
id: not disclosed
image
GIF
JPEG JFIF
Progressive JPEG JFIF
Portable Network Graphics
Tag Image File Format
Image as PDF
audio
CoNLL
CoNLL 2009 (14 columns)
CoNLL-U (10 columns)
Penn Treebank
JSON
No unique ID
With xml id
Org-mode
plain text with ASCII 127 characters
single-column list
two-column list, tab separated
three-column list, tab separated
four-column list, tab separated
two-column list, tab separated, with empty lines
three-column list, tab separated, with empty lines
four-column list, tab separated, with empty lines
URL

Historical period

Time period of the spelling in the text resource.

classical
medieval
late modern
contemporary

Assemblage

How results are presented for the user.

normal
frequency list
alphabetic list

Appearance

Decorative tradition, for example typeface class.

roman
blackletter
blackletter w. ø
OCR
unnormalised
normalised
optimized for software
pretty printed

Ambiguity

Whether data contains ambiguous elements.

unambiguous
ambiguous
pruned

Smell

Special feature used to give unique identity to input and intermediate data, for example output of OCR software.

new smell