Text Tonsorium - Changelog

2024-03-06
Improved translation from UD-pipe tag set to (enriched) CST tag set, taking word and morphology into account.
2024-03-01
Fixed tokenisation issue: tokens containing non-alphabetic characters were split.
2024-02-05
The Repetitiveness checker tool can now take multiple inputs. In that case it shows phrases that occur in multiple files.
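A minimal sketch of how phrases shared between several inputs could be found. This is not the Repetitiveness checker's actual code: the function name shared_phrases, the fixed phrase length n and the whitespace tokenisation are assumptions made for illustration only.

```python
from collections import defaultdict

def shared_phrases(texts, n=3):
    """Return the n-word phrases that occur in more than one input.

    texts maps an input name to its text. The result maps each shared
    phrase to the set of input names it was found in.
    """
    seen = defaultdict(set)
    for name, text in texts.items():
        words = text.lower().split()  # naive whitespace tokenisation (assumption)
        for i in range(len(words) - n + 1):
            seen[" ".join(words[i:i + n])].add(name)
    # Keep only phrases found in at least two different inputs.
    return {phrase: names for phrase, names in seen.items() if len(names) > 1}

texts = {
    "a.txt": "the quick brown fox jumps",
    "b.txt": "a quick brown fox sleeps",
}
print(shared_phrases(texts))
```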
2024-02-05
Can now register tools that can take multiple inputs of the same type, e.g. file comparison tools.
2023-12-05
Fixed @xml:id bug that showed up in <exclude> elements.
2023-12-04
(1) Fixed handling of inputs that are not in the leftmost column. If the user changed the input description, such files could not be processed anymore. (2) The readme.txt is removed from the output zip file. (3) TEI files with <w> (token) tags and no other linguistic annotations are treated as 'text', not as 'tokenized text'.
2023-11-24
Visually improved SVG workflow diagram, with animation. On the same page, an easier-to-read textual description of the workflow.
2023-09-06
Display workflow diagram as SVG (vector) drawing with mouseover effects.
2023-02-19
Added Belarusian, using data from https://github.com/Belarus/GrammarDB.
2023-02-08
Extended Faroese training data for CSTlemma: https://github.com/hinrikur/far-ABLTagger/blob/master/inflection/edfm-v0.1/edfm.csv plus https://urdarbrunnur.rhi.hi.is/bendingar-nidurhal/Storasnid_allt.zip
2023-02-08
Extended Norwegian training data for CSTlemma with Scarrie data. (https://www.nb.no/sbfil/leksikalske_databaser/leksikon/scarrie-lex-lmf.zip)
2023-01-29
New 'Assemblage' values 'alphabetic list, OOV only' and 'frequency list, OOV only'. (Can be used to find words to extend CSTlemma's training data.)
2023-01-26
Albanian training data for CSTlemma now combination of https://github.com/NeldaKote/Albanian-POS and https://github.com/isahb/sq-morph-dict/blob/main/verbs_tagged.txt
2023-01-26
English training data for CSTlemma now combination of http://celex.mpi.nl/ and https://github.com/lamblabo/node-lemmatizer
2023-01-10
'Dry' running of a workflow. Outputs mix of pseudocode and bash command lines.
2022-11-03
Fixed and improved: zooming in when specifying goal.
2022-10-19
Fixed: in some situations some workflows were computed, but not shown as candidate workflows.
2022-10-19
Improved CST-NER and TEI annotation.
2022-10-19
Clarin base format amended: <s> elements for sentences, <w> elements for tokens. <w> elements can contain other elements, including other <w> elements.
2022-10-19
conll2pt requires CoNLL-U formatted input.
2022-08-23
Used https://github.com/michmech/lemmatization-lists.git as training data for CSTlemma. New languages: Asturian, Catalan, Welsh, Irish, Scottish Gaelic, Galician, Manx Gaelic. Improved: Swedish.
2022-08-23
Solved: sometimes, after zooming in on goal, too few workflow candidates were shown because of erroneously added goal requirements.
2022-07-22
Front page: added Clarin-DK and Text Tonsorium logos, with links. "Text Tonsorium" removed from headline.
2022-07-20
The Text Tonsorium tries to offer a helpful explanation if no workflows are found.
2022-07-19
Some languages (Czech, English, German, modern Greek, French, Spanish) are assumed not to have undergone substantial orthographic changes since the 19th century, making 'late modern' and 'contemporary' equivalent period values.
2022-05-10
Now URLs with non-ASCII characters are accepted.
2022-05-09
Added tool to convert CoNLL to three columns, copying word, lemma and pos.
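A minimal sketch of such a conversion, assuming CoNLL-U input in which columns 2, 3 and 4 hold FORM, LEMMA and UPOS. This is not the tool's actual code: the function name is hypothetical, and whether the tool copies UPOS or XPOS is not stated here.

```python
def conll_to_three_columns(conllu_text):
    """Reduce CoNLL-U text to tab-separated word, lemma and pos columns."""
    out = []
    for line in conllu_text.splitlines():
        if not line.strip():
            out.append("")  # keep the blank line that separates sentences
            continue
        if line.startswith("#"):
            continue  # drop comment lines such as "# text = ..."
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip malformed lines
        out.append("\t".join(fields[1:4]))  # FORM, LEMMA, UPOS
    return "\n".join(out)

sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)
print(conll_to_three_columns(sample))
```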
2022-04-22
Accept input with a multivalued feature, such as CoNLL-formatted annotated texts with several types of contents (form, lemma, pos, etc.).
2022-04-13
PDFMiner, Cuneiform and Tesseract 'repel' each other and do not occur in the same workflow.
2022-04-12
'Combination with other values allowed' is now ignored if no value at all is chosen. This solves issue with zooming in on goal.
2022-04-11
Tesseract: version 4 --> 5.1.0
2022-03-12
RTFreader: tokenisation w/o segmentation. (Requires input Type of Content 'segments', not 'text'.)
2022-03-11
CSTlemma: addition of one-line-per-sentence output format.
2022-03-08
End of line markers CRLF (as in Windows text) are read as LF when analyzing ORG mode input (e.g. detect language).
2022-01-28
Restricted 'lempos' to non-English and non-contemporary Danish. Lempos is for "Poor man's POS-tagging".
2022-01-26
The Latin texts in the 'Skrift og tekst i tid og rum' project are now annotated with morphological features in addition to Part of Speech and lemmas.
2022-01-14
The drop down lists on the source/goal specification page are cached and not recomputed if present in cache.
2022-01-12
log4j 2.16.0 -> 2.17.1
2021-12-30b
Fixed a bug in workflow memoization: workflows were characterized by their last tool instead of by the desired output.
2021-12-30a
Workflows containing tool invocations that differ only at the subspecification level are pruned. For example, a workflow that uses the same tokeniser to tokenise the same text using different tokenisation styles, e.g. 'simple' and 'Penn Treebank'.
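The pruning rule might be sketched as follows. The representation of a workflow as a list of (tool, input, subspecification) steps and both function names are assumptions for illustration, not the Text Tonsorium's internal data model.

```python
def has_subspec_only_duplicates(workflow):
    """True if two steps apply the same tool to the same input
    and differ only in their subspecification."""
    seen = {}
    for tool, input_id, subspec in workflow:
        key = (tool, input_id)
        if key in seen and seen[key] != subspec:
            return True
        seen[key] = subspec
    return False

def prune(workflows):
    """Drop every workflow that contains such near-duplicate steps."""
    return [w for w in workflows if not has_subspec_only_duplicates(w)]

redundant = [("tokeniser", "text", "simple"),
             ("tokeniser", "text", "Penn Treebank")]
useful = [("tokeniser", "text", "simple"),
          ("lemmatiser", "tokens", "default")]
print(prune([redundant, useful]))
```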
2021-12-20
The vulnerable log4j (logging for Java) libraries are replaced with safe(r) ones: version 2.16.0
2021-12-03
When asked for unambiguous output, CSTlemma still produced ambiguous output if it couldn't make up its mind. Now, if lemmas are equiprobable, CSTlemma chooses the first one.
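A minimal sketch of this tie-breaking rule, assuming candidates arrive as (lemma, probability) pairs in the order they were produced. This is an illustration, not CSTlemma's code; the function name and the data layout are hypothetical.

```python
def choose_lemma(candidates):
    """Pick one lemma from a list of (lemma, probability) pairs.

    Python's max() returns the first maximal item it encounters,
    so equiprobable lemmas resolve to the first one in the list.
    """
    lemma, _ = max(candidates, key=lambda pair: pair[1])
    return lemma

print(choose_lemma([("lie", 0.5), ("lay", 0.5)]))
```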
2021-11-27
Beginning of changelog.