Text Tonsorium is a workflow management system that not only runs data through workflows, but also designs workflows from 'building blocks'. Each building block contains a language technology tool.
When you work with Text Tonsorium, your focus should be on the result you want to obtain, not on the tools in the toolbox. It is therefore useful to know how best to describe the desired result. You may nevertheless also want to know which tools are integrated in Text Tonsorium. This page gives you information about both.
A workflow consists of one or more tools connected by data streams, see Figure 1. In the diagram, tools are drawn as shapes (squares, circles, stars, etc.) and data streams as lines between them.
Text Tonsorium can do two quite different things with workflows: it can compose workflows and it can execute them. Here we show how.
The drawing above gives a simplified picture. What you cannot see is that Text Tonsorium always tries to find not just one, but all paths that lead to the goal.
Text Tonsorium finds all workflows by trying out all tools and all tool settings.
The user is not presented with the full set of workflows that Text Tonsorium has found. There will almost always be workflows that make little sense to a user. Heavy pruning therefore takes place, removing many workflows.
In the next example, two tools compete to deliver output that fulfils the same goal. Three tools take this output as input. These three tools cannot take output from two tools at once, but must choose.
Some tools should never occur in the same workflow. For example, two OCR programs (OCR = Optical Character Recognition) will almost always disagree about the number of words in the input, because one tool sees a space where the other does not.
Text Tonsorium can assign different values of a 'smell' attribute to tools that do not go together. The smell spreads through the workflow, but if it runs into a different smell, the workflow is discarded.
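One way to picture the smell mechanism is as a label that travels downstream through the workflow graph. The following sketch assumes a workflow is given as a mapping from each tool to the tools that feed it; the function names, the smell values and the tool names are hypothetical and do not reflect how Text Tonsorium stores this internally.

```python
def workflow_smell(tool, smell_of, inputs_of):
    """Return the smell reaching `tool`'s output, or raise on a clash."""
    smells = set()
    own = smell_of.get(tool)
    if own is not None:
        smells.add(own)
    # Collect the smells that arrive from every upstream tool.
    for upstream in inputs_of.get(tool, []):
        inherited = workflow_smell(upstream, smell_of, inputs_of)
        if inherited is not None:
            smells.add(inherited)
    if len(smells) > 1:
        raise ValueError(f"smell clash at {tool}: {sorted(smells)}")
    return smells.pop() if smells else None


def is_acceptable(final_tool, smell_of, inputs_of):
    """A workflow is kept only if no two different smells ever meet."""
    try:
        workflow_smell(final_tool, smell_of, inputs_of)
        return True
    except ValueError:
        return False


# Two hypothetical OCR tools get different smells.
SMELL_OF = {"ocr-a": "A", "ocr-b": "B"}

# Both OCR outputs end up in the same annotator: discarded.
CLASHING = {"tokenizer": ["ocr-a"], "annotator": ["tokenizer", "ocr-b"]}
# Only one OCR tool is involved: kept.
CONSISTENT = {"tokenizer": ["ocr-a"], "annotator": ["tokenizer", "ocr-a"]}

print(is_acceptable("annotator", SMELL_OF, CLASHING))
print(is_acceptable("annotator", SMELL_OF, CONSISTENT))
```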
Text Tonsorium often computes many workflows that all lead to your goal, and you then have to choose one of them. As a rule of thumb, the more detail you give when expressing your goal, the fewer workflow proposals there will be, possibly none at all.
Data streams, as well as a tool's input and output specifications, are described by means of 'features'. Features describe things such as the language used in a text, the file format, and the type of content.
Users encounter features and feature values when they specify the goal of a workflow. There is a second level in the feature specification, one that the user normally need not worry about: feature values can be enriched with 'style indicators'.
Feature values and style indicators must always be chosen from predefined value classes. This is done by means of drop-down lists.
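The rule of thumb that a more detailed goal leaves fewer workflows can be illustrated by treating a goal as a set of feature-value pairs. This is a minimal sketch under that assumption: the feature names, the values and the three workflow outputs are placeholders, and style indicators are not modelled.

```python
def satisfies(data, spec):
    """True if `data` has every feature value that `spec` asks for."""
    return all(data.get(feature) == value for feature, value in spec.items())


# Hypothetical descriptions of what three candidate workflows deliver.
OUTPUTS = {
    "workflow 1": {"language": "da", "format": "plain text", "content": "tokens"},
    "workflow 2": {"language": "da", "format": "TEI P5", "content": "tokens"},
    "workflow 3": {"language": "da", "format": "TEI P5", "content": "lemmas"},
}


def candidates(goal):
    """Names of the workflows whose output meets the goal."""
    return [name for name, out in OUTPUTS.items() if satisfies(out, goal)]


print(candidates({"language": "da"}))
print(candidates({"language": "da", "format": "TEI P5"}))
print(candidates({"language": "da", "format": "TEI P5", "content": "lemmas"}))
print(candidates({"language": "da", "content": "syntax trees"}))
```

Each extra feature in the goal can only remove candidates, never add them, so the last goal, which no workflow can meet, returns an empty list.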
This is the list of features used in this version of Text Tonsorium:
Subtype of resource, e.g. base text, tokens, alphabetical list.
The language of the text.
The way information is stored in a file.
The time period in which the spelling or orthography used in the text was in use.
The way results are presented to the user.
Decorative tradition, e.g. typeface.
Indicates whether the data contains ambiguous elements.
Special feature used to give unique identity to input and intermediate data, for example output of OCR software.
Here is the list of available tools:
# | Name of the tool | Supported languages |
1 | ANNIE DE (GATE cloud) | de |
2 | ANNIE EN (GATE cloud) | en |
3 | ANNIE FR (GATE cloud) | fr |
4 | ANNIE RO (GATE cloud) | ro |
5 | ANNIE RU (GATE cloud) | ru |
6 | Anno-splitter | |
7 | Bohnet parser | da, de, es, fr |
8 | Bohnet tagger | de, en, es, fr |
9 | Brill tagger | da, en, gml, la |
10 | cluster | |
11 | CoNLL 2009 to U | |
12 | CoNLL formatter | |
13 | CONLL to Penn Treebank | |
14 | CONLL to three columns | |
15 | CQP formatter | |
16 | CSTlemma | af, ast, be, bg, ca, cs, cy, da, de, el, en, es, et, fa, fo, fr, ga, gd, gl, gml, gv, hr, hu, is, it, ka, la, mk, nl, no, pl, pt, ro, ru, sk, sl, sq, sr, sv, uk |
17 | CSTner | da |
18 | daner | da |
19 | dependency2tree | |
20 | Diplom annotator | |
21 | Diplom fetch corrected text | da, gml, la, sv |
22 | Document similarity checker | |
23 | eSpeak | af, bg, bs, ca, cs, cy, da, de, el, en, eo, es, et, fi, fr, hi, hr, hu, hy, id, is, it, ka, kn, ku, la, lv, mk, ml, nl, pl, pt, ro, ru, sk, sq, sr, sv, sw, ta, tr, vi, zh |
24 | Frequencies | |
25 | html2text | |
26 | JSON pretty print | |
27 | JSON to ORG-mode | |
28 | JSON to TEI | |
29 | JSON to TSV | |
30 | KORP to Excel | |
31 | Lapos | da, gml, la |
32 | LemPoS | be, bg, cs, da, de, es, et, fa, fo, hr, hu, is, it, la, mk, nl, pl, pt, ro, ru, sk, sl, sq, sr, sv, uk |
33 | LibreOffice | |
34 | Normaliser | da |
35 | Normalize dipl | la |
36 | NP finder | da |
37 | OpenNLP Tagger | da, en |
38 | pdf2htmlEX | |
39 | PDFMiner | |
40 | plain to TEI | |
41 | PoS translator | da, la |
42 | PruneLemPos | |
43 | Repetitiveness checker | |
44 | RTFreader | |
45 | Sentence extractor | |
46 | Stanford CoreNLP | ar, de, en, es, fr, hu, it, zh |
47 | TEI annotator | |
48 | TEI extract tokens/sentences | |
49 | TEI to CoNLL-U | |
50 | TEI to Org-mode | |
51 | TEI tokenizer | |
52 | TEI-segmenter | |
53 | Tesseract-OCRv5 | af, br, bs, ca, co, cs, cy, da, de, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, hr, ht, hu, id, is, it, iu, jv, la, lb, lt, lv, ms, mt, nb, nl, nn, oc, pl, pt, ro, sk, sl, sq, sr, sv, sw, tr, uz, vi, yi |
54 | Token extractor | af, ar, ast, be, bg, bs, ca, cop, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gd, gl, gml, got, gv, he, hi, hr, hu, hy, id, is, it, ja, ka, kn, ko, ku, la, lt, lv, mk, ml, mr, mt, nb, nl, nn, no, pl, pt, ro, ru, se, sk, sl, sq, sr, sv, sw, ta, te, tr, ug, uk, ur, vi, wo, zh |
55 | udpipe | af, ar, be, bg, ca, cop, cs, cu, da, de, el, en, es, et, eu, fa, fi, fr, ga, gd, gl, got, he, hi, hr, hu, hy, id, it, ja, ko, la, lt, lv, mr, mt, nb, nl, nn, pl, pt, ro, ru, se, sk, sl, sr, sv, ta, te, tr, ug, uk, ur, vi, wo, zh |
56 | vujiLoX | la |
ANNIE is a named entity recognition pipeline that identifies basic entity types, such as Person, Location, Organization, Money amounts, Time and Date expressions. It is the prototypical information extraction pipeline distributed with the GATE framework and forms the base of many more complex GATE-based IE applications.
Takes TEI P5 document containing multiple stand-off annotation groups (spanGrp). Outputs one of the annotation groups.
Dependency parser, part of mate-tools.
Part of Speech tagger that is distributed as part of mate-tools.
Part-of-speech tagger: Marks each word in a text with information about word class and morphological features.
Make clusters of lines, separated by empty lines. Each cluster contains all those lines that have the value in the first or second column in common with one or more other lines in the same cluster.
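The clustering rule can be sketched with a union-find structure. This is an illustration, not the tool's own code, and it rests on two assumptions: columns are separated by tabs, and two lines belong together when they share the same first-column value or the same second-column value.

```python
def cluster_lines(lines):
    """Group lines that are linked through a shared first or second column."""
    parent = list(range(len(lines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # (column index, value) -> index of first line with it
    for i, line in enumerate(lines):
        columns = line.split("\t")
        for col, value in enumerate(columns[:2]):
            key = (col, value)
            if key in first_seen:
                # Merge this line's cluster with the earlier line's cluster.
                parent[find(i)] = find(first_seen[key])
            else:
                first_seen[key] = i

    clusters = {}
    for i, line in enumerate(lines):
        clusters.setdefault(find(i), []).append(line)
    return list(clusters.values())


LINES = ["a\t1", "b\t2", "a\t3", "c\t2", "d\t4"]
# Print the clusters separated by empty lines.
print("\n\n".join("\n".join(c) for c in cluster_lines(LINES)))
```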
Convert CoNLL 2009 (14 columns) to CoNLL-U (10 columns)
Converts input to CoNLL 2009 format.
Convert syntax dependency annotation in CoNLL 2009 or CoNLL-U format to bracketed "Lisp-like" format.
Convert a CONLL 2009 or CONLL-U file to a tabulator separated file. On each line: <word> \t <lemma> \t <pos> \n
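For CoNLL-U input, the conversion amounts to keeping three of the ten columns. The sketch below assumes that the part of speech is taken from the UPOS column (column 4) and that comment lines and sentence breaks are dropped; the actual tool may make different choices, and CoNLL 2009 input, which has a different column layout, is not handled here.

```python
def conllu_to_three_columns(text):
    """Reduce CoNLL-U token lines to word, lemma and part of speech."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a well-formed token line
        form, lemma, upos = fields[1], fields[2], fields[3]
        rows.append(f"{form}\t{lemma}\t{upos}")
    return "\n".join(rows) + "\n" if rows else ""


SAMPLE = (
    "# text = Hunden sover\n"
    "1\tHunden\thund\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsover\tsove\tVERB\t_\t_\t0\troot\t_\t_\n"
)

print(conllu_to_three_columns(SAMPLE), end="")
```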
Takes input containing words, tags and lemmas and creates output that can be read by the CQP software.
Produces the dictionary look-up form (or lemma) for each word, inflected or not, in the input.
Classifies names as proper names, locations (with sub-classes of street, city, land and other types of locations), and other names (called MISC)
Named Entity Recognition for Danish, distributed by ITU NLP. Uses Stanford CoreNLP NER and the model from DKIE to tag incoming Danish plain text for named entities, in three classes: location, person, and organization names.
Convert CoNLL output of a dependency parser into a latex or graphviz tree.
Store lemma in column 3 and/or word class in column 4 of an Org-mode input file that already has diplomatic and facsimile values in columns 7 and 8.
Fetch the column with corrected transcriptions. This column contains words with additions between parentheses. The parentheses are removed in the output.
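The removal of parentheses can be shown on a single word. This is a minimal sketch: the function name and the sample word are made up for illustration, and selecting the right column from the input file is left out.

```python
import re


def strip_parentheses(word):
    """Keep the added letters but drop the parentheses around them."""
    return re.sub(r"[()]", "", word)


# A hypothetical transcription where an abbreviation has been expanded.
print(strip_parentheses("d(omi)n(u)s"))
```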
Uses a statistical method to find phrases that are found in each of the input documents.
Text to speech software. Originally known as speak and originally written for Acorn/RISC_OS computers starting in 1995. This version is an enhancement and re-write, including a relaxation of the original memory and processing power constraints, and with support for additional languages.
Sorts input lines, collapses equal lines, appends column with frequencies. Assumes that input is 1, 2 or 3 columns, separated by tabs.
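The same sort, collapse and count steps can be sketched in a few lines. The function name and sample data are illustrative only, and the sort order of the actual tool may differ from Python's default string ordering.

```python
from collections import Counter


def line_frequencies(lines):
    """Collapse equal lines and append a tab-separated frequency column."""
    counts = Counter(lines)
    return [f"{line}\t{counts[line]}" for line in sorted(counts)]


LINES = ["hund\tNOUN", "kat\tNOUN", "hund\tNOUN"]
for row in line_frequencies(LINES):
    print(row)
```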
A very simple script that loads from HTML, and then iterates over the DOM to correctly output plain text.
JSON pretty-print parser based on a recursive lexical analyser. The parser was based on the specification defined at json.org. The input file is parsed to build a JSON object. If the object is correct, it will be pretty-printed to standard output.
Converts JSON output with tokens, lemmas and Part of Speech tags to a three-column ORG-mode table.
Read json file with fields for token ID, word, lemma and pos. Output a TEI P5 annotation file (spanGrp) containing either lemmas or Part of Speech tags.
Convert word-lemma-pos data from JSON to CQP format.
This tool generates a tabulator separated file with all KWIC (keyword-in-context) results generated by the KORP tool at the address https://alf.hum.ku.dk/korp/. Input to the tool is the URL copied from the address line when KORP has performed a search.
Fork of the Lookahead Part-Of-Speech (Lapos) Tagger
Lemmatizes input text and adds PoS-options to each lemma. Output can be ambiguous.
A powerful office suite, here used to convert office documents to RTF or PDF.
Normalises older (1200-1900) Danish text to spelling rules as employed in ODS (Ordbog over det danske Sprog).
Fill column left of diplom column with normalized tokens, i.e. v -> u, j -> i and all lowercase.
Collects words that constitute noun phrases.
Part of Speech Tagger that marks tokens with their corresponding word type based on the token itself and the context of the token. Uses a probability model to predict the correct pos tag.
Converts PDF to HTML without losing text or format. (The produced HTML can hardly be interpreted by other tools.) Renders PDF files in HTML, utilizing modern Web technologies. It aims to provide an accurate rendering, while keeping optimized for Web display. Best for text-based PDF files, for example scientific papers with complicated formulas and figures. Text, fonts and formats are natively preserved in HTML such that you can still search and copy. The generated HTML file is static, with optional features powered by JavaScript.
Extracts information from PDF documents. Focuses entirely on getting and analyzing text data.
From a plain segmentized and tokenized text file that uses DEL characters to separate tokens that are written together in the input, create a TEI P5 Clarin Base Format text with attributes S and T for segment and token identification.
Translate from DSL's tag set to Menota
A "Poor man's POS-tagger" that takes text input that has ambiguous lemma and PoS annotations and diminishes the ambiguity by using bigram HMM + Viterbi algorithm. No training data are involved! Works best with larger texts.
Uses a statistical method to find repetitions in a text.
Extracts segments from RTF-file or from plain text. Optionally tokenises. Keeps \f
From a TEI text enriched with T (token) and S (segment) attributes, extract the sentences and their offsets in the source.
CoreNLP is your one stop shop for natural language processing in Java! CoreNLP enables users to derive linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports 8 languages: Arabic, Chinese, English, French, German, Hungarian, Italian, and Spanish.
Add attributes for lemma and Part of Speech tag to <w> and <c> elements. (<w> and <c> elements must already exist.)
Reads TEI P5 and produces token and sentence annotations. The annotations refer to the base text, but also include the tokens and sentences themselves in plain text.
Converts a TEI P5 document with annotations for lemma, pos (or msd) and syntactic dependencies to CoNLL-U 10 column format.
Convert TEI P5 stand-off annotation to a two column file in Org-mode format. The first column contains a token, the second contains the annotation: POS-tag, word class, or lemma
Apply a primitive tokenisation to the contents of the <text> element in a TEI P5 document. Each word, punctuation mark and whitespace character is marked up by w or c tags. S and T attributes indicate which primitive tokens should be combined to create higher level tokens.
Reads tokens and sentences as annotations and produces segment annotations, where segments refer to tokens, not to the base text. Input and output is encoded in TEI P5.
Tesseract Open Source OCR Engine. Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns.
From a TEI text enriched with T (token) and S (segment) attributes, extract tokens and their offset in the input.
Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks.
Converts Latin text to lower case and transforms v to u and j to i.
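The transformation is simple enough to show in full. This is a sketch of the described behaviour, not the tool's source; the function name and sample text are made up.

```python
# Map v to u and j to i; applied after lowercasing, so V and J are covered too.
_VU_JI = str.maketrans({"v": "u", "j": "i"})


def normalize_latin(text):
    """Lowercase the text, then replace v with u and j with i."""
    return text.lower().translate(_VU_JI)


print(normalize_latin("Iuvenis Vidit JUS"))
```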