CST's online tools

Here you can analyse text with a combination of CST's tools.
All tools support Danish and some tools also support other languages.

Language:  Bonus code:

Write a few lines ...

... or specify a text or RTF file.   Or view a demo text!

Choose one or more tools:

Tip

Point at the bold words for an explanation.

Language

The name recogniser and the noun-phrase recogniser only support Danish. All other tools support all the languages in the list.

More options

You can choose more options for some tools, e.g. the lemmatiser.

Fewer options

Showing only the most frequently used options may give you a better overview.

Bonus code

The bonus code allows you to analyse larger amounts of text.
Contact us if you are interested:

bartj@hum.ku.dk.

Demo text

The text shown depends on your choice of tools. For example, the "Treaty Establishing a Constitution for Europe" is used to show the repetitiveness checker's capabilities, and a passage from "Alice's Adventures in Wonderland" is used to demonstrate the lemmatiser. (That is, if you choose English as the input language.)

Tokeniser

Separates the tokens in the text from each other. E.g.
  Oops, → Oops ,

Some multiple word units are combined into one token. E.g. (Danish)
  for længst → for_længst

The program also divides the text into sentences.
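
The two tokenisation steps described above can be sketched roughly as follows. This is only an illustration and not CST's tokeniser: the function name `tokenise` and the one-entry list `MULTIWORD_UNITS` are hypothetical, and sentence splitting is left out.

```python
import re

# Hypothetical list of multiple word units; the list used by the
# real tokeniser is not shown on this page.
MULTIWORD_UNITS = [("for", "længst")]

def tokenise(text):
    """Separate punctuation from words, then join known multiple word units."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    raw = re.findall(r"\w+|[^\w\s]", text)
    tokens = []
    i = 0
    while i < len(raw):
        for unit in MULTIWORD_UNITS:
            n = len(unit)
            if tuple(t.lower() for t in raw[i:i + n]) == unit:
                # Combine the words of the unit into a single token.
                tokens.append("_".join(raw[i:i + n]))
                i += n
                break
        else:
            tokens.append(raw[i])
            i += 1
    return tokens

print(" ".join(tokenise("Oops,")))                    # Oops ,
print(" ".join(tokenise("Det er for længst sket.")))  # Det er for_længst sket .
```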

POS-tagger

(Part-Of-Speech Tagger)

Identifies word class and morphological features for all tokens in the text.

Lemmatiser

Builds the base form of a word.
E.g. 'went' → 'go', 'houses' → 'house'.

In the demo you can choose between running text (with or without the inflected forms of the words) or an alphabetical list (all inflected forms per lemma or the other way around).

Word splitter

Decomposes a compound word.

Repetitiveness checker

Uses a probabilistic model to find and weight repeated sequences of words (tokens) in the text.

The text must have at least two sentences.

N-gram frequencies

Counts all occurrences of all unigrams (=words), bigrams and trigrams.
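
As a rough illustration of what such a count involves, the sketch below tallies unigrams, bigrams and trigrams over a list of tokens. It assumes the text has already been tokenised; the function name `ngram_counts` and the sample sentence are made up for this example and say nothing about how the online tool is implemented.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Return {n: Counter of n-grams} for n = 1 .. max_n."""
    counts = {}
    for n in range(1, max_n + 1):
        # Zipping the token list with shifted copies of itself yields
        # every run of n consecutive tokens as a tuple.
        grams = zip(*(tokens[i:] for i in range(n)))
        counts[n] = Counter(grams)
    return counts

tokens = "the cat saw the cat".split()
counts = ngram_counts(tokens)
print(counts[1][("the",)])                # 2
print(counts[2][("the", "cat")])          # 2
print(counts[3][("the", "cat", "saw")])   # 1
```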

Keywords

Finds keyword candidates in a text.

Multiple word terms

Finds term candidates in a text corpus.

The text is tokenised, names are recognised if required, and then the text is POS-tagged and lemmatised.

The repetitiveness checker looks for candidate multiple word units.

Anonymise a text

The text is tokenised and names are recognised.

Proper names, which are now marked with "*PERSONNAMEX"
(e.g. "Ole=P.=Dam*PERSONNAMEX"), can then be replaced with e.g. "XXX".
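
The replacement step could be done with a short script along these lines. It is a minimal sketch that assumes the marker always ends a token and that tokens are separated by whitespace; the function name `anonymise` and the sample sentence are invented for illustration.

```python
import re

# A marked name: a run of non-space characters ending in the marker.
MARKED_NAME = re.compile(r"\S+\*PERSONNAMEX")

def anonymise(tagged_text, replacement="XXX"):
    """Replace every token marked as a person name with the replacement."""
    return MARKED_NAME.sub(replacement, tagged_text)

print(anonymise("I går mødte Ole=P.=Dam*PERSONNAMEX sin nabo ."))
# I går mødte XXX sin nabo .
```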

Links

Alternative web pages with online tools, offering more options, are the Clarin-DK Toolbox and the even more advanced Text Tonsorium💈.


If you want to know more about the tools individually you can visit these pages:


Some projects with participation of CST also show demos.


You can download some programs and linguistic resources.


The Danish linguistic resources for the POS-tagger are based on the Parole corpus. See the download page for more info.


The word lists for the lemmatiser stem from STO (da), CELEX (de, en, nl), the Icelandic Centre for Language Technology IFD (is), lemmatizer.org (ru), Morfologik 1.5 (pl), Lexique + Limsi (fr), Perseus (la), G. Petasis (el), MULTEXT-East free lexicons 4.0 (bg, cs, et, hu, ro, sk, sl, uk), MULTEXT-East non-commercial lexicons 4.0 (fa, mk), LABELLEX (pt), lachica (es), morph-it (it), UD Afrikaans (af), SETimes.HR+ (hr) and Språkbanken (sv).


Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
Webmaster: bartj(at]hum dot ku.dk