CST's online tools

Tip

Point at the bold words for an explanation.

Language

The name recogniser and the noun-phrase recogniser only support Danish. All other tools support all the languages in the list.

More options

You can choose more options for, e.g., the lemmatiser.

Fewer options

Showing fewer options (only the most frequently used ones) may give you a better overview.

Bonus code

The bonus code allows you to analyse larger amounts of text.
Contact us if you are interested:

bartj@hum.ku.dk.

Demo text

The text shown depends on your choice of tools. For example, the "Treaty Establishing a Constitution for Europe" is used to show the repetitiveness checker's capabilities, and a passage from "Alice's Adventures in Wonderland" is used to demonstrate the lemmatiser (if you choose English as the input language, that is).

Tokeniser

Separates the tokens in the text from each other. E.g.,
Oops, → Oops ,

Some multi-word units are combined into one token. E.g. (Danish):
for længst → for_længst

The program also divides the text into sentences.
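
The three behaviours described for the tokeniser can be illustrated with a minimal Python sketch. This is not CST's tokeniser: the regular expression, the one-entry list of multi-word units, the sentence-end rule and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical list of multi-word units; the real tokeniser's list is not given here.
MULTI_WORD_UNITS = {("for", "længst"): "for_længst"}

def tokenise(text):
    """Separate words from punctuation, then merge known multi-word units."""
    raw = re.findall(r"\w+|[^\w\s]", text)
    tokens = []
    i = 0
    while i < len(raw):
        pair = tuple(t.lower() for t in raw[i:i + 2])
        if pair in MULTI_WORD_UNITS:
            tokens.append(MULTI_WORD_UNITS[pair])
            i += 2
        else:
            tokens.append(raw[i])
            i += 1
    return tokens

def split_sentences(tokens):
    """Naive sentence division: a sentence ends at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

Calling `tokenise("Oops, no.")` separates the comma and the full stop from the words, and `split_sentences` then groups the resulting tokens sentence by sentence.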

Name recogniser

Finds and marks names and dates in the text.

Name recogniser (CSTNER)

Finds and marks names and dates in the text.

POS-tagger

(Part-Of-Speech Tagger)

Identifies word class and morphological features for all tokens in the text.

Lemmatiser

Builds the base form of a word.
E.g. 'went' → 'go', 'houses' → 'house'.

In the demo you can choose between running text (with or without the inflected forms of the words) or an alphabetical list (all inflected forms per lemma or the other way around).

Word splitter

Decomposes a compound word.

NP-recogniser

(or noun-phrase recogniser)

Collects the constituents of an NP into one unit.

Repetitiveness checker

Uses a probabilistic model to find and weight repeated sequences of words (tokens) in the text.

The text must have at least two sentences.

N-gram frequencies

Counts all occurrences of all unigrams (=words), bigrams and trigrams.
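
Counting unigrams, bigrams and trigrams can be sketched in a few lines of Python. The function name `ngram_counts` and the whitespace-split sample sentence are assumptions for illustration; the online tool works on tokenised text and may treat details such as sentence boundaries differently.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Return {n: Counter of n-grams} for n = 1 .. max_n."""
    counts = {}
    for n in range(1, max_n + 1):
        # Slide a window of width n over the token list.
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        counts[n] = Counter(grams)
    return counts

# Hypothetical sample input, already split into tokens.
tokens = "the cat saw the cat".split()
counts = ngram_counts(tokens)
```

Here `counts[1]` holds the unigram frequencies, `counts[2]` the bigram frequencies and `counts[3]` the trigram frequencies, each keyed by a tuple of tokens.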

Keywords

Finds keyword candidates in a text.

Multiple word terms

Finds keyword candidates in a text.

Lemmatiser options

Show

Sort:

Show only the words not found in the dictionary

The dictionary

Danish: STO; Dutch, English and German: CELEX; Icelandic: IFD; Russian: lemmatizer.org; Polish: Morfologik; French: limsi+lexique; Bulgarian, Czech, Estonian, Farsi, Hungarian, Macedonian, Romanian, Slovak, Slovene, Ukrainian: MULTEXT-East; Latin: Perseus; Greek: G. Petasis; Portuguese: LABEL-LEX; Spanish: lachica; Italian: morph-it; Afrikaans: UD_Afrikaans; Croatian: SETimes.HR; Swedish: Språkbanken

Dictionary

The word frequencies in the dictionary can be used to remove ambiguities. The dictionary can also handle small errors in the word class identification.
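
One simple way in which frequencies could resolve an ambiguity is to pick the most frequent candidate lemma. Whether the lemmatiser does exactly this is not stated here, so the sketch below is only an assumption; the counts in `FREQUENCIES` are invented and the function name is hypothetical.

```python
# Hypothetical frequency counts per word form and candidate lemma (invented numbers).
FREQUENCIES = {
    "saw": {"see": 120, "saw": 7},
    "left": {"leave": 95, "left": 40},
}

def most_frequent_lemma(word):
    """Return the candidate lemma with the highest count, or None if the word is unknown."""
    candidates = FREQUENCIES.get(word.lower())
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```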

Show whether word is found

Words that are not in the dictionary are marked '-'.
Words that can be assigned more than one lemma are marked '+'.
All other words are left unmarked.
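
The marking rule can be sketched as a small Python function. The toy `DICTIONARY` and the function name `mark` are assumptions for illustration; the real dictionaries are those listed under "The dictionary".

```python
# Hypothetical toy dictionary mapping a word form to its possible lemmas.
DICTIONARY = {
    "went": ["go"],
    "houses": ["house"],
    "saw": ["see", "saw"],
}

def mark(word):
    """Return '-' if the word is unknown, '+' if it has several lemmas, else ''."""
    lemmas = DICTIONARY.get(word.lower(), [])
    if not lemmas:
        return "-"
    if len(lemmas) > 1:
        return "+"
    return ""
```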

Ambiguities

Some words cannot be lemmatised unambiguously on the basis of word form (and word class) alone. However, the lemmatiser's heuristics can always "solve" the problem.

Running text

All words in the same order as in the text.

Inflected form + lemma

Shows both the inflected forms and the lemmas of the words.

Only lemma

Shows only the lemmas of the words.

Sorted list

Alphabetically sorted list without duplicates.

Output fields

Lemmas alone, all lemmas per word or (sorted output) all words per lemma.

Sorting order

Sorted according to fields.

Lemma + inflected forms

List sorted on lemma. Outputs a list with the found inflected forms for each lemma.

Inflected form + lemmas

List sorted on inflected forms. Outputs the lemma(s) for each inflected form.
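
The two sorted-list layouts differ only in which field is used as the key. A minimal Python sketch, assuming the lemmatiser's result is available as (inflected form, lemma) pairs; the sample pairs and the function name `group` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (inflected form, lemma) pairs, including one repeated pair.
pairs = [("went", "go"), ("houses", "house"), ("goes", "go"), ("went", "go")]

def group(pairs, by_lemma):
    """Group the pairs on lemma or on inflected form, sorted and without duplicates."""
    table = defaultdict(set)
    for form, lemma in pairs:
        key, value = (lemma, form) if by_lemma else (form, lemma)
        table[key].add(value)
    return [(key, sorted(table[key])) for key in sorted(table)]

lemma_list = group(pairs, by_lemma=True)    # lemma + inflected forms
form_list = group(pairs, by_lemma=False)    # inflected form + lemmas
```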

Find term candidates in a text corpus.

The text is tokenised, names are recognised if required, and then the text is POS-tagged and lemmatised.

The repetitiveness checker looks for candidate multiple word units.

Find the content-rich elements in a text.

Find the content-rich elements in a text.

Anonymise a text

The text is tokenised and names are recognised.

Proper names, which are now marked with "*PERSONNAMEX"
(e.g. "Ole=P.=Dam*PERSONNAMEX"), can then be replaced with, e.g., "XXX".
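
The replacement step can be sketched with a regular expression in Python. The sketch assumes that every recognised name is a single whitespace-free token ending in "*PERSONNAMEX", as in the example above; the function name `anonymise` and the sample sentence are hypothetical.

```python
import re

# Matches one whitespace-free token that ends in the name marker.
MARKED_NAME = re.compile(r"\S+\*PERSONNAMEX")

def anonymise(text, replacement="XXX"):
    """Replace every marked proper name with the replacement string."""
    return MARKED_NAME.sub(replacement, text)

# Hypothetical tokenised input with one marked name.
result = anonymise("Yesterday Ole=P.=Dam*PERSONNAMEX called .")
```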
