Text Tonsorium - Changelog

2022-07-22
Front page: added Clarin-DK and Text Tonsorium logos, with links. "Text Tonsorium" removed from headline.
2022-07-20
The Text Tonsorium tries to offer a helpful explanation if no workflows are found.
2022-07-19
Some languages (Czech, English, German, modern Greek, French, Spanish) are assumed not to have undergone substantial orthographic changes since the 19th century, making 'late modern' and 'contemporary' equivalent period values.
2022-05-10
URLs with non-ASCII characters are now accepted.
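To illustrate what accepting such a URL can involve, here is a minimal Python sketch that rewrites a URL containing non-ASCII characters into an ASCII-only form. It is not the Text Tonsorium's actual code: the function name `to_ascii_url` is hypothetical, and the approach (IDNA for the host name, UTF-8 percent-encoding for path, query and fragment) is an assumption about one common way to handle these URLs.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only form of `url` (hypothetical helper, not from the changelog).

    Assumptions: the host is converted with Python's built-in IDNA codec,
    the other parts are percent-encoded as UTF-8, and any user info in the
    URL is dropped for simplicity.
    """
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # '%' is kept safe so that already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_url("https://example.org/tekster/blåbærgrød.txt"))
```

A URL that is already plain ASCII, has a lower-case host and carries no user info passes through unchanged.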
2022-05-09
Added a tool that converts CoNLL to three columns, copying word, lemma and POS.
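The conversion can be sketched as follows. This is an illustration only, not the tool's implementation: it assumes tab-separated CoNLL-U style input in which columns 2, 3 and 4 hold word form, lemma and POS, and the function name `conll_to_three_columns` is hypothetical. Dropping comment lines and keeping blank lines as sentence boundaries are likewise assumptions.

```python
def conll_to_three_columns(conll_text):
    """Keep word, lemma and POS (assumed to be columns 2-4) from CoNLL-style text."""
    out = []
    for line in conll_text.splitlines():
        if not line.strip():
            out.append("")  # blank line: sentence boundary, kept as-is
            continue
        if line.startswith("#"):
            continue  # comment line: dropped (assumption)
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # too few columns to hold word, lemma and POS: skipped
        out.append("\t".join(cols[1:4]))
    return "\n".join(out)


sample = (
    "# text = Hunden sover\n"
    "1\tHunden\thund\tNOUN\t_\n"
    "2\tsover\tsove\tVERB\t_\n"
    "\n"
    "1\tJa\tja\tINTJ\t_"
)
print(conll_to_three_columns(sample))
```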
2022-04-22
Accept input with a multivalued feature, such as CoNLL-formatted annotated texts with several types of content (form, lemma, POS, etc.).
2022-04-13
PDFMiner, Cuneiform and Tesseract 'repel' each other and do not occur in the same workflow.
2022-04-12
'Combination with other values allowed' is now ignored if no value at all is chosen. This solves an issue with zooming in on the goal.
2022-04-11
Tesseract: version 4 -> 5.1.0
2022-03-12
RTFreader: tokenisation w/o segmentation. (Requires input Type of Content 'segments', not 'text'.)
2022-03-11
CSTlemma: addition of one-line-per-sentence output format.
2022-03-08
End-of-line markers CRLF (as in Windows text) are read as LF when analyzing ORG mode input (e.g. when detecting the language).
2022-01-28
Restricted 'lempos' to non-English and non-contemporary Danish. Lempos is for "Poor man's POS-tagging".
2022-01-26
The Latin texts in the 'Skrift og tekst i tid og rum' project are now annotated with morphological features in addition to Part of Speech and lemmas.
2022-01-14
The drop down lists on the source/goal specification page are cached and not recomputed if present in cache.
2022-01-12
log4j 2.16.0 -> 2.17.1
2021-12-30b
Fixed a bug in memoizing workflows: they were characterized by the last tool instead of by the desired output.
2021-12-30a
Workflows that contain tool invocations that differ only at the subspecification level are pruned. An example is a workflow that uses the same tokeniser to tokenise the same text in different tokenisation styles, e.g. 'simple' and 'Penn Treebank'.
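The pruning rule can be sketched in Python as below. The representation is hypothetical: a workflow is modelled as a list of (tool, input, subspecification) tuples, which is an assumption made for illustration and not the Text Tonsorium's internal data structure.

```python
def differs_only_in_subspec(workflow):
    """True if two steps use the same tool on the same input with different subspecs."""
    seen = {}
    for tool, inp, subspec in workflow:
        key = (tool, inp)
        if key in seen and seen[key] != subspec:
            return True
        seen.setdefault(key, subspec)
    return False


def prune(workflows):
    """Drop every workflow that contains such a pair of invocations."""
    return [w for w in workflows if not differs_only_in_subspec(w)]


workflows = [
    [("tokeniser", "text1", "simple"), ("lemmatiser", "tokens1", None)],
    [("tokeniser", "text1", "simple"), ("tokeniser", "text1", "Penn Treebank")],
]
kept = prune(workflows)
print(len(kept))
```

Only the first workflow survives, because the second one tokenises the same text twice in different styles.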
2021-12-20
The vulnerable log4j (logging for Java) libraries are replaced with safe(r) ones: version 2.16.0
2021-12-03
When asked for unambiguous output, CSTlemma still produced ambiguous output if it couldn't make up its mind. Now, if lemmas are equiprobable, CSTlemma chooses the first one.
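The tie-breaking rule can be sketched as follows. CSTlemma's internals are not described here, so the candidate list of (lemma, probability) pairs and the function name `choose_lemma` are hypothetical and serve only to show the "first of the equiprobable lemmas" behaviour.

```python
def choose_lemma(candidates):
    """Return the most probable lemma; on a tie, the first one in the list wins."""
    if not candidates:
        return None
    best = max(prob for _, prob in candidates)
    for lemma, prob in candidates:
        if prob == best:
            return lemma  # first candidate that reaches the best probability


print(choose_lemma([("være", 0.5), ("vær", 0.5)]))
```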
2021-11-27
Beginning of changelog.