Text Tonsorium

The Text Tonsorium is a workflow management system that not only executes workflows, but also composes workflows from building blocks. Each building block encapsulates a Natural Language Processing tool.

In contrast to other workflow management systems, the Text Tonsorium focuses on results rather than on tools, so a general understanding of how the desired result is described is useful. Still, you may also like to know which tools are currently integrated in the Text Tonsorium. This page gives you information about both.

Overview

How does the Text Tonsorium compute and enact workflows?
Working with the Text Tonsorium
Features
Tools

How does the Text Tonsorium compute and enact workflows?

A workflow design consists of one or more tools connected by data streams, see Figure 1. In this diagram, we depict tools as shapes (squares, circles, stars, etc.) and data streams as edges between the shapes.
Broadly speaking, the Text Tonsorium can do two things with workflows: compose them and enact them. Here we show how both are done.

Figure 1. Computation and enactment of a workflow design. Top: the input (uploaded by the user); bottom: the output (the goal as set by the user).

Composition of workflow design

  1. Top of the diagram: The user has uploaded one or more documents that she wants to process using the Text Tonsorium.
    Bottom of the diagram: The user has also specified the output she wishes to get from the Text Tonsorium.
  2. The first tool the Text Tonsorium adds to the design is in fact the last tool of the workflow: a tool that satisfies the user's goal. From there, it works backwards toward the input.
  3. Once the last tool of the workflow has been established, its input requirements become the new goal.
  4. This process of moving the goal repeats for each addition of a tool to the workflow design.
  5. When a tool is found that takes the user's uploaded document as input, it is still too early to claim that a viable workflow design has been found.
  6. The Text Tonsorium retraces its path to see whether any of the tools in the path needs more inputs. When it finds such a tool, it again seeks toward the input to pick up the missing data.
    Sometimes, another tool is added to the workflow design to fulfill the need.
  7. In other cases, the needed input is the output of an already established component in the workflow.
  8. Again, the Text Tonsorium crawls back towards the goal and tries to obtain any still-missing inputs.
  9. Sometimes the Text Tonsorium inserts the same tool in the workflow design multiple times. In such cases, all incarnations of the tool have different parameters, so they do different things.
  10. The Text Tonsorium has completed the composition of a viable workflow design when the last tool produces the output that the user wants and all tools get the inputs they need. (A minimal code sketch of this backward search follows the list.)
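
To make the direction of the search concrete, here is a minimal sketch in Python. The Tool class, the registry and the feature names are invented for illustration; they are not the Text Tonsorium's actual data model, and the real composer explores all alternatives rather than stopping at the first viable design.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Tool:                          # hypothetical stand-in for a registered tool
        name: str
        inputs: tuple                    # feature specs the tool consumes
        output: str                      # feature spec the tool produces

    REGISTRY = [
        Tool("tagger", ("tokens",), "PoS-tags"),
        Tool("tokenizer", ("text",), "tokens"),
    ]

    def compose(goal, have, path=()):
        # Work backwards from the goal toward the uploaded input.
        # Assumes the registry is acyclic; returns one viable design or None.
        if goal in have:                 # the user's upload satisfies this goal
            return path
        for tool in REGISTRY:
            if tool.output == goal:      # last tool first: it must satisfy the goal
                plan = path + (tool,)
                for needed in tool.inputs:    # its inputs become the new goals
                    sub = compose(needed, have, plan)
                    if sub is None:
                        break            # dead end; try another producer
                    plan = sub
                else:
                    return plan
        return None                      # no viable workflow design

    plan = compose("PoS-tags", have={"text"})
    print([t.name for t in plan])        # ['tagger', 'tokenizer'], i.e. reverse order of execution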

Enactment of a workflow

  1. Once the Text Tonsorium has designed a workflow, we can ask it to enact the workflow with data that we have uploaded as input.
  2. The input data is sent to the first tool in the workflow.
  3. The output from the first tool is sent to the second tool.
  4. The output from the first tool is also sent to the third tool. Depending on the involved tools and the availability of processing units, this can happen at the same time.
  5. A tool is not activated before all its required inputs are present (see the sketch after this list).
  6. The same tool can appear again with the same input but different parameters, and therefore produces a different output.
  7. Many data streams come together in the last tool in the workflow. This is common, since users often wish to see many annotation layers in the output.
  8. A single data stream, containing all required annotation layers, is made available to the user.
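
The firing rule in step 5 can be illustrated with a toy scheduler. The tool names, inputs and outputs below are invented, and the real Text Tonsorium dispatches tools as web services, possibly in parallel; the sketch only shows that a tool runs once all of its inputs exist.

    # Hypothetical dataflow scheduler. Each tool lists the data streams it
    # needs and the one it produces; it fires only when all inputs exist.
    tools = {
        "tokenizer":  (["text"], "tokens"),
        "tagger":     (["tokens"], "PoS-tags"),
        "lemmatizer": (["tokens"], "lemmas"),
        "merger":     (["PoS-tags", "lemmas"], "annotated text"),
    }
    available = {"text": "uploaded document"}    # step 2: input enters the workflow
    pending = dict(tools)
    while pending:                               # terminates because the design is viable
        for name, (needs, produces) in list(pending.items()):
            if all(n in available for n in needs):       # firing rule (step 5)
                available[produces] = f"{name}({', '.join(needs)})"
                del pending[name]
    print(available["annotated text"])           # merger(PoS-tags, lemmas)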

The diagram above is a gross simplification. The main omission is that the Text Tonsorium always attempts to find not just one, but all roads leading to a goal.
The Text Tonsorium finds all workflows by trying out all tools and all tool parameter combinations.

Pruning

The Text Tonsorium does not present the full set of workflows it finds to the user. In general, there will be workflows that do not make much sense to a user, so many workflows are pruned away.

Tools that compete for the same goal

In the following example, two tools compete to create output fulfilling the same goal. There are three tools that 'consume' the output. The consuming tools cannot take output from both competing tools. The question is: which of the competing tools should they choose?

Figure 2. Pruning.
  1. Non-pruned, ambiguous workflow design containing two tools fulfilling the same goal.
  2. Pruned, unambiguous workflow design that ignores the second tool completely, configuration 1.
  3. Pruned, unambiguous workflow design that includes both tools, configuration 2.
  4. Pruned, unambiguous workflow design that includes both tools, configuration 3.
  5. Pruned, unambiguous workflow design that includes both tools, configuration 4.
  6. Pruned, unambiguous workflow design that includes both tools, configuration 5.
  7. Pruned, unambiguous workflow design that includes both tools, configuration 6.
  8. Pruned, unambiguous workflow design that includes both tools, configuration 7.
  9. Pruned, unambiguous workflow design that ignores the first tool completely, configuration 8.

If a goal can be satisfied in M ways and there are N workflow nodes setting that goal, then there are M^N viable configurations. In the example, M = 2 and N = 3, so there are 2^3 = 8 configurations.

As more and more tools were integrated in the Text Tonsorium, situations where tools competed arose more often, sometimes resulting in enormous numbers (tens of thousands) of viable workflow designs that the user would not be able to choose from.

Pruning reduces the number of workflows presented to the user by excluding all workflow designs that contain two or more competing tools. Thus, in the example, only the first and the last configuration survive the pruning process. So, in the end, there are not M^N, but only M configurations.
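
The arithmetic is easy to verify by brute force. In this sketch, two hypothetical producers A and B compete for one goal, and three consumers each choose one of them; pruning keeps only the assignments in which all consumers agree.

    from itertools import product

    producers = ["A", "B"]                    # M = 2 competing tools
    consumers = ["c1", "c2", "c3"]            # N = 3 consuming nodes
    configs = list(product(producers, repeat=len(consumers)))
    print(len(configs))                       # M**N = 2**3 = 8 configurations
    pruned = [c for c in configs if len(set(c)) == 1]
    print(len(pruned))                        # only M = 2 survive the pruning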

Tools repelling another tool by 'smell'

Some tools should never occur in the same workflow design. For example, two Optical Character Recognition systems will, in general, output different numbers of tokens when given the same input, e.g. because one system sees white space where the other does not.

In the Text Tonsorium, tools that should not co-occur can be given a value for a 'smell' feature that spreads throughout the workflow. If two different smells collide, the workflow design is discarded.
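
The idea can be sketched as follows. The graph encoding and the smell values are invented for illustration; the point is only that smells propagate downstream and that a node reached by two different smells invalidates the design.

    def smells(node, graph, memo=None):
        # Collect every smell value that reaches `node` through the workflow.
        # `graph` maps a node to (its own smell or None, list of upstream nodes).
        memo = {} if memo is None else memo
        if node not in memo:
            own, upstream = graph[node]
            reached = set() if own is None else {own}
            for up in upstream:
                reached |= smells(up, graph, memo)
            memo[node] = reached
        return memo[node]

    graph = {
        "ocr_a": ("smell-A", []),            # two OCR tools, two different smells
        "ocr_b": ("smell-B", []),
        "tagger": (None, ["ocr_a", "ocr_b"]),
    }
    # Two different smells collide in "tagger": the design is discarded.
    print(len(smells("tagger", graph)) > 1)  # True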

Working with the Text Tonsorium

The Text Tonsorium may compose many workflows that all lead to your goal. It will then ask you to choose one of the proposed workflows. In general, the more detail you add to your goal, the fewer solutions the Text Tonsorium will find, possibly even zero.

1: Upload
You can upload your input in three ways:
file upload
The Text Tonsorium can handle many different input formats.
You are not limited to uploading a single file. We have done batches of over 100 files.
via URL
You can enter a list of URLs. Notice that some web pages require JavaScript to be active in the browser. Such web pages cannot be fetched successfully.
direct typing
You can type in a text.
2: Specify what you want
Input
Goal
3: Select a workflow
4: Launch the selected workflow
5: Inspect/download results

Features

Data streams, but also the input and output specifications of tools, are described in terms of 'features'. Features express things such as the language used in a text, file format, and the type of content.
Users are confronted with features and feature values when they specify the goal of a workflow design. There is a second level of feature specification that users normally are not bothered with: feature values can be further specified with 'style descriptors'.
Feature values and style descriptors are always chosen from predefined sets of values, using drop-down lists.
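
Purely as an illustration (the real interface uses drop-down lists, and this encoding is invented), a goal can be pictured as a set of feature/value pairs in which some values carry a style descriptor:

    goal = {
        "Type of content": {"value": "PoS-tags",
                            "style": "Universal Part-of-Speech Tagset"},
        "Language":        {"value": "Danish"},
        "Format":          {"value": "CoNLL",
                            "style": "CoNLL-U (10 columns)"},
    }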

This is the list of features currently defined in this instance of the Text Tonsorium:

Type of content
Subtype of resource, e.g. basis text, tokenisation, alphabetic list.
Language
Language of the text
Format
The way that information is encoded for storage in a computer file.
Historical period
Time period of the spelling in the text resource.
Assemblage
How results are presented for the user.
Appearance
Decorative tradition, for example typeface class.
Ambiguity
Whether data contains ambiguous elements.
Smell
Special feature used to give unique identity to input and intermediate data, for example output of OCR software.

Type of content

Subtype of resource, e.g. basis text, tokenisation, alphabetic list.

text
text excerpts
paragraphs
sentences
segments
tokens
named entities
word class
PoS-tags
Style options:
Penn Treebank
CST-tagset
DSL-tagset
Universal Part-of-Speech Tagset
Menota
HiNTS (Historisches-Niederdeutsch-Tagset): https://www.fdr.uni-hamburg.de/record/9195/files/Annotationshandbuch_1_2021-01-30.pdf
lemmas
noun phrases
morphological features
Style options:
Universal Part-of-Speech Tagset
Menota
sentiment
syntax (dependency structure)
repeated phrases
keyword-in-context (KWIC)

Language

Language of the text. The three-letter ISO 639-3 code is given in parentheses.

Afrikaans (afr)
Albanian (sqi)
Arabic (ara)
Armenian (hye)
Asturian (ast)
Basque (eus)
Belarusian (bel)
Bosnian (bos)
Breton (bre)
Bulgarian (bul)
Catalan (cat)
Chinese (zho)
Corsican (cos)
Croatian (hrv)
Coptic (cop)
Czech (ces)
Danish (dan)
Dutch (nld)
English (eng)
Esperanto (epo)
Estonian (est)
Faroese (fao)
Finnish (fin)
French (fra)
Galician (glg)
Georgian (kat)
German (deu)
Gothic (got)
Greek (ell)
Middle Low German (gml)
Haitian (hat)
Hebrew (heb)
Hindi (hin)
Hungarian (hun)
Icelandic (isl)
Indonesian (ind)
Inuktitut (iku)
Irish (gle)
Italian (ita)
Japanese (jpn)
Javanese (jav)
Kannada (kan)
Korean (kor)
Kurdish (kur)
Latin (lat)
Latvian (lav)
Lithuanian (lit)
Luxembourgish (ltz)
Macedonian (mkd)
Malay (msa)
Malayalam (mal)
Maltese (mlt)
Manx (glv)
Marathi (Marāṭhī) (mar)
Northern Sami (sme)
Norwegian (nor)
Norwegian Bokmål (nob)
Norwegian Nynorsk (nno)
Occitan (oci)
Old Church Slavonic (chu)
Persian (fas)
Polish (pol)
Portuguese (por)
Romanian (ron)
Russian (rus)
Scottish Gaelic (gla)
Serbian (srp)
Slovak (slk)
Slovene (slv)
Spanish (spa)
Swahili (swa)
Swedish (swe)
Tamil (tam)
Telugu (tel)
Turkish (tur)
Ukrainian (ukr)
Urdu (urd)
Uyghur (uig)
Uzbek (uzb)
Vietnamese (vie)
Welsh (cym)
Wolof (wol)
Yiddish (yid)

Format

The way that information is encoded for storage in a computer file.

plain
Can be edited with a simple text editor like 'vi'.
Style options:
UTF-8
RTF
PDF
HTML
Style options:
Traditional tags (h, p, etc.)
Exact layout
Corpus Workbench (for CQP queries)
verticalized text, Corpus Workbench input format
DOC
DOCX
ODF
ODP
PPT
PPTX
TEIP5
TEIP5DKCLARIN
Can be viewed in a browser and edited with a text editor like 'vi' or with an XML editor like 'Oxygen' or Microsoft's 'Visual Studio'.
TEIP5DKCLARIN_ANNOTATION
Can be viewed in a browser and edited with a text editor like 'vi' or with an XML editor like 'Oxygen' or Microsoft's 'Visual Studio'.
Style options:
id: not disclosed
image
Style options:
GIF
JPEG JFIF
Progressive JPEG JFIF
Portable Network Graphics
Tag Image File Format
Image as PDF
audio
CoNLL
Style options:
CoNLL 2009 (14 columns)
ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED
CoNLL-U (10 columns)
ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
Penn Treebank
JSON
Style options:
No unique ID
With xml id
Org-mode
plain text with ASCII DEL (127) characters
columns, tab separated fields
two-column list, tab separated
three-column list, tab separated
four-column list, tab separated
two-column list, tab separated, with empty lines
three-column list, tab separated, with empty lines
four-column list, tab separated, with empty lines
URL

Historical period

Time period of the spelling in the text resource.

classical
medieval
late modern
contemporary

Assemblage

How results are presented for the user.

normal
frequency list
alphabetic list
frequency list, OOV only
alphabetic list, OOV only

Appearance

Decorative tradition, for example typeface class.

roman
blackletter
blackletter w. ø
OCR
unnormalised
normalised
optimized for software
pretty printed

Ambiguity

Whether data contains ambiguous elements.

unambiguous
ambiguous
pruned

Smell

Special feature used to give unique identity to input and intermediate data, for example output of OCR software.

new smell

Tools

These are the currently active tools:

#  | Name of the tool | Supported languages
1  | ANNIE DE (GATE cloud) | de
2  | ANNIE EN (GATE cloud) | en
3  | ANNIE FR (GATE cloud) | fr
4  | ANNIE RO (GATE cloud) | ro
5  | ANNIE RU (GATE cloud) | ru
6  | Anno-splitter |
7  | Bohnet parser | da, de, es, fr
8  | Bohnet tagger | de, en, es, fr
9  | Brill tagger | da, en, gml, la
10 | cluster |
11 | CoNLL 2009 to U |
12 | CoNLL formatter |
13 | CONLL to Penn Treebank |
14 | CONLL to three columns |
15 | CQP formatter |
16 | CSTlemma | af, ast, be, bg, ca, cs, cy, da, de, el, en, es, et, fa, fo, fr, ga, gd, gl, gml, gv, hr, hu, is, it, ka, la, mk, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, uk
17 | CSTner | da
18 | daner | da
19 | dependency2tree |
20 | Diplom annotator |
21 | Diplom fetch corrected text | da, gml, la, sv
22 | Document similarity checker |
23 | eSpeak | af, bg, bs, ca, cs, cy, da, de, el, en, eo, es, et, fi, fr, hi, hr, hu, hy, id, is, it, ka, kn, ku, la, lv, mk, ml, nl, pl, pt, ro, ru, sk, sq, sr, sv, sw, ta, tr, vi, zh
24 | Frequencies |
25 | html2text |
26 | JSON pretty print |
27 | JSON to ORG-mode |
28 | JSON to TEI |
29 | JSON to TSV |
30 | KORP to Excel |
31 | Lapos | da, gml, la
32 | LemPoS | be, bg, cs, da, de, es, et, fa, fo, hr, hu, is, it, la, mk, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, uk
33 | LibreOffice |
34 | Normaliser | da
35 | Normalize dipl | la
36 | NP finder | da
37 | OpenNLP Tagger | da, en
38 | pdf2htmlEX |
39 | PDFMiner |
40 | plain to TEI |
41 | PoS translator | da, la
42 | PruneLemPos |
43 | Repetitiveness checker |
44 | RTFreader |
45 | Sentence extractor |
46 | Stanford CoreNLP | en
47 | TEI annotator |
48 | TEI extract tokens/sentences |
49 | TEI to CoNLL-U |
50 | TEI to Org-mode |
51 | TEI tokenizer |
52 | TEI-segmenter |
53 | Tesseract-OCRv5 | af, br, bs, ca, co, cs, cy, da, de, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, hr, ht, hu, id, is, it, iu, jv, la, lb, lt, lv, ms, mt, nb, nl, nn, oc, pl, pt, ro, sk, sl, sq, sr, sv, sw, tr, uz, vi, yi
54 | Token extractor | af, ar, ast, be, bg, bs, ca, cop, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gd, gl, gml, got, gv, he, hi, hr, hu, hy, id, is, it, ja, ka, kn, ko, ku, la, lt, lv, mk, ml, mr, mt, nb, nl, nn, no, pl, pt, ro, ru, se, sk, sl, sq, sr, sv, sw, ta, te, tr, ug, uk, ur, vi, wo, zh
55 | udpipe | af, ar, be, bg, ca, cop, cs, cu, da, de, el, en, es, et, eu, fa, fi, fr, ga, gd, gl, got, he, hi, hr, hu, hy, id, it, ja, ko, la, lt, lv, mr, mt, nb, nl, nn, pl, pt, ro, ru, se, sk, sl, sr, sv, ta, te, tr, ug, uk, ur, vi, wo, zh
56 | vujiLoX | la

ANNIE DE / EN / FR / RO / RU (GATE cloud)

ANNIE is a named entity recognition pipeline that identifies basic entity types, such as Person, Location, Organization, Money amounts, Time and Date expressions. It is the prototypical information extraction pipeline distributed with the GATE framework and forms the base of many more complex GATE-based IE applications. The DE, EN, FR, RO and RU entries in the table are language-specific deployments of the same pipeline.

Anno-splitter

Takes a TEI P5 document containing multiple stand-off annotation groups (spanGrp) and outputs one of the annotation groups.

Bohnet parser

Dependency parser, part of mate-tools.

Bohnet tagger

Part of Speech tagger that is distributed as part of mate-tools.

Brill tagger

Part-of-speech tagger: Marks each word in a text with information about word class and morphological features.

cluster

Make clusters of lines, separated by empty lines. Each cluster contains all those lines that have the value in the first or second column in common with one or more other lines in the same cluster.
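
The grouping amounts to finding connected components, for example with a union-find structure. This sketch is a reading of the description above, not the tool's actual code.

    def clusters(lines):
        # Two lines belong to the same cluster when they share the value
        # in column 1 or column 2, directly or through other lines.
        parent = list(range(len(lines)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        seen = {}                                 # value -> first line holding it
        for i, line in enumerate(lines):
            for value in line.split("\t")[:2]:
                if value in seen:
                    parent[find(i)] = find(seen[value])
                else:
                    seen[value] = i
        groups = {}
        for i in range(len(lines)):
            groups.setdefault(find(i), []).append(lines[i])
        return "\n\n".join("\n".join(g) for g in groups.values())

    print(clusters(["a\tb", "b\tc", "x\ty"]))
    # "a\tb" and "b\tc" share 'b' and form one cluster; "x\ty" stands alone.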

CoNLL 2009 to U

Converts CoNLL 2009 (14 columns) to CoNLL-U (10 columns).
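
The two column inventories (see the Format feature above) suggest a mapping like the one below. Which of POS/PPOS feeds UPOS and XPOS, and what lands in DEPS and MISC, are assumptions made for illustration; the tool's actual choices may differ.

    # CoNLL 2009: ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD
    #             DEPREL PDEPREL FILLPRED PRED              (14 columns)
    # CoNLL-U:    ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC (10 columns)
    def conll09_to_u(line):
        c = line.rstrip("\n").split("\t")
        return "\t".join([c[0], c[1], c[2],    # ID, FORM, LEMMA
                          c[4],                # POS  -> UPOS (assumption)
                          c[5],                # PPOS -> XPOS (assumption)
                          c[6],                # FEAT -> FEATS
                          c[8],                # HEAD
                          c[10],               # DEPREL
                          "_", "_"])           # DEPS, MISC left unfilled

    print(conll09_to_u("1\tHunden\thund\thund\tNC\tNC\t_\t_\t2\t2\tsubj\tsubj\t_\t_"))
    # 1\tHunden\thund\tNC\tNC\t_\t2\tsubj\t_\t_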

CoNLL formatter

Converts input to CoNLL 2009 format.

CONLL to Penn Treebank

Convert syntax dependency annotation in CoNLL 2009 or CoNLL-U format to bracketed "Lisp-like" format.

CONLL to three columns

Convert a CoNLL 2009 or CoNLL-U file to a tab-separated file. On each line: <word> \t <lemma> \t <pos> \n

CQP formatter

Takes input containing words, tags and lemmas and creates output that can be read by the CQP software.

CSTlemma

Produces the dictionary look-up form (or lemma) for each word, inflected or not, in the input.

CSTner

Classifies names as proper names, locations (with sub-classes of street, city, country and other types of locations), and other names (called MISC).

daner

Named Entity Recognition for Danish, distributed by ITU NLP. Uses Stanford CoreNLP NER and the model from DKIE to tag incoming Danish plain text for named entities, in three classes: location, person, and organization names.

dependency2tree

Convert CoNLL output of a dependency parser into a LaTeX or Graphviz tree.
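
A Graphviz rendering needs little more than one node per token and one edge per head relation. This sketch assumes simplified (ID, FORM, HEAD) triples rather than full CoNLL lines.

    def to_dot(rows):
        # rows: (ID, FORM, HEAD) triples; HEAD 0 marks the root.
        out = ["digraph dependencies {"]
        for wid, form, head in rows:
            out.append(f'  n{wid} [label="{form}"];')
            if head != 0:
                out.append(f"  n{head} -> n{wid};")
        out.append("}")
        return "\n".join(out)

    print(to_dot([(1, "Hunden", 2), (2, "sover", 0)]))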

Diplom annotator

Store lemma in column 3 and/or word class in column 4 of an Org-mode input file that already has diplomatic and facsimile values in columns 7 and 8.

Diplom fetch corrected text

Fetch the column with corrected transcriptions. This column contains words with additions between parentheses. The parentheses are removed in the output.

Document similarity checker

Uses a statistical method to find phrases that are found in each of the input documents.

eSpeak

Text to speech software. Originally known as speak, first written for Acorn/RISC_OS computers in 1995. This version is an enhancement and re-write, including a relaxation of the original memory and processing power constraints, and with support for additional languages.

Frequencies

Sorts input lines, collapses equal lines, appends column with frequencies. Assumes that input is 1, 2 or 3 columns, separated by tabs.
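
The described behaviour corresponds roughly to this sketch (a reading of the description above, not the tool's code):

    from collections import Counter

    def frequencies(lines):
        # Sort, collapse duplicate lines, append a tab-separated frequency column.
        counts = Counter(line.rstrip("\n") for line in lines)
        return [f"{line}\t{n}" for line, n in sorted(counts.items())]

    print(frequencies(["a\tx", "b\ty", "a\tx"]))   # ['a\tx\t2', 'b\ty\t1']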

html2text

A very simple script that loads HTML and then iterates over the DOM to correctly output plain text.
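
With Python's standard library the same idea looks like this; it is a generic illustration of DOM-event traversal, not the script that the Text Tonsorium runs.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Keep only character data; ignore tags and attributes.
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    extractor = TextExtractor()
    extractor.feed("<html><body><p>Hello <b>world</b></p></body></html>")
    print("".join(extractor.chunks))    # Hello world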

JSON pretty print

JSON pretty-print parser based on a recursive lexical analyser. The parser is based on the specification defined at json.org. The input file is parsed to build a JSON object. If the object is correct, it is pretty-printed to standard output.

JSON to ORG-mode

Converts JSON output with tokens, lemmas and Part of Speech tags to a three-column ORG-mode table.

JSON to TEI

Reads a JSON file with fields for token ID, word, lemma and pos. Outputs a TEI P5 annotation file (spanGrp) containing either lemmas or Part of Speech tags.

JSON to TSV

Convert word-lemma-pos data from JSON to CQP format.

KORP to Excel

This tool generates a tabulator separated file with all KWIC (keyword-in-context) results generated by the KORP tool at the address https://alf.hum.ku.dk/korp/. Input to the tool is the URL copied from the address line when KORP has performed a search.

Lapos

Fork of the Lookahead Part-Of-Speech (Lapos) Tagger

LemPoS

Lemmatizes input text and adds PoS-options to each lemma. Output can be ambiguous.

LibreOffice

A powerful office suite, here used to convert office documents to RTF or PDF.

Normaliser

Normalises older (1200-1900) Danish text to spelling rules as employed in ODS (Ordbog over det danske Sprog).

Normalize dipl

Fill column left of diplom column with normalized tokens, i.e. v -> u, j -> i and all lowercase.

NP finder

Collects words that constitute noun phrases.

OpenNLP Tagger

Part of Speech Tagger that marks tokens with their corresponding word type based on the token itself and the context of the token. Uses a probability model to predict the correct pos tag.

pdf2htmlEX

Converts PDF to HTML without losing text or format, utilizing modern Web technologies. (The produced HTML can hardly be interpreted by other tools.) It aims to provide an accurate rendering while staying optimized for Web display, and works best for text-based PDF files, for example scientific papers with complicated formulas and figures. Text, fonts and formats are natively preserved in HTML, so you can still search and copy. The generated HTML file is static, with optional features powered by JavaScript.

PDFMiner

Extracts information from PDF documents. Focuses entirely on getting and analyzing text data.

plain to TEI

From a plain segmentized and tokenized text file that uses DEL characters to separate tokens that are written together in the input, create a TEI P5 Clarin Base Format text with attributes S and T for segment and token identification.

PoS translator

Translate from DSL's tag set to Menota

PruneLemPos

A "Poor man's POS-tagger" that takes text input that has ambiguous lemma and PoS annotations and diminishes the ambiguity by using bigram HMM + Viterbi algorithm. No training data are involved! Works best with larger texts.

Repetitiveness checker

Uses a statistical method to find repetitions in a text.

RTFreader

Extracts segments from an RTF file or from plain text. Optionally tokenises. Keeps \f (form feed) characters.

Sentence extractor

From a TEI text enriched with T (token) and S (segment) attributes, extract the sentences and their offsets in the source.

Stanford CoreNLP

CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish.

TEI annotator

Add attributes for lemma and Part of Speech tag to <w> and <c> elements. (<w> and <c> elements must already exist.)

TEI extract tokens/sentences

Reads TEIP5 and produces token and sentence annotations. The annotations refer to the base text, but also include the tokens and sentences themselves in plain text.

TEI to CoNLL-U

Converts a TEI P5 document with annotations for lemma, pos (or msd) and syntactic dependencies to CoNLL-U 10 column format.

TEI to Org-mode

Convert TEI P5 stand-off annotation to a two-column file in Org-mode format. The first column contains a token, the second contains the annotation: POS-tag, word class, or lemma.

TEI tokenizer

Apply a primitive tokenisation to the contents of the <text> element in a TEI P5 document. Each word, punctuation mark and whitespace is marked up with <w> or <c> tags. S and T attributes indicate which primitive tokens should be combined to create higher-level tokens.

TEI-segmenter

Reads tokens and sentences as annotations and produces segment annotations, where segments refer to tokens, not to the base text. Input and output is encoded in TEI P5.

Tesseract-OCRv5

Tesseract Open Source OCR Engine. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns.

Token extractor

From a TEI text enriched with T (token) and S (segment) attributes, extract tokens and their offset in the input.

udpipe

Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks.

vujiLoX

Converts Latin text to lower case and transforms v to u and j to i.
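
The rule is simple enough to state as a sketch:

    def vujilox(text: str) -> str:
        # Lowercase, then merge v into u and j into i.
        return text.lower().replace("v", "u").replace("j", "i")

    print(vujilox("Veni, vidi, Iulius ivit"))   # ueni, uidi, iulius iuit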