Keyword extractor

CST's keyword extractor extracts 20 keywords that characterise an input text. It does so by comparing the words of the input text with 1500 articles (approx. 900,000 words) from the 1999 volume of the Danish newspaper Berlingske Tidende. A word is assumed to characterise the text if it occurs in relatively few of the Berlingske Tidende articles, meaning that it is not an ordinary, frequent word.

First, the input text is POS-tagged and lemmatised. Then the relative frequency of the nouns is calculated with the well-known weighting algorithm TF*IDF [1], which combines the term frequency (TF) and the inverse document frequency (IDF). The function is capable of discriminating terms that characterise the individual text from terms that characterise the whole document collection or general language:

TF*IDF = log10((n/df)*tf)

- where n is the number of documents (in this case 1500 articles from Berlingske Tidende), df is the number of documents in which the term occurs, and tf is the term frequency in the input text.
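The formula above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CST's actual implementation: the function names, the `doc_freq` dictionary (lemma to number of reference articles containing it) and the decision to skip nouns that never occur in the reference collection are all hypothetical. The source does not say how terms with df = 0 are handled.

```python
import math
from collections import Counter


def tfidf_scores(nouns, doc_freq, n_docs=1500):
    """Score each noun lemma with log10((n/df) * tf).

    nouns    -- noun lemmas of the input text (already tagged and lemmatised)
    doc_freq -- hypothetical mapping: lemma -> number of reference articles
                in which the lemma occurs (df)
    n_docs   -- number of reference articles (n); 1500 in the description above
    """
    term_freq = Counter(nouns)  # tf: occurrences in the input text
    scores = {}
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            # Assumption: the formula is undefined for df = 0,
            # so unseen terms are skipped in this sketch.
            continue
        scores[term] = math.log10((n_docs / df) * tf)
    return scores


def top_keywords(nouns, doc_freq, n_docs=1500, k=20):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = tfidf_scores(nouns, doc_freq, n_docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

For example, a noun that occurs twice in the input text and in 15 of the 1500 articles scores log10((1500/15)*2) = log10(200), roughly 2.3, whereas a noun that occurs once in the input and in all 1500 articles scores log10(1) = 0.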

[1] See e.g. Manning & Schütze 1999, p. 543.


Multi word terms

Multi word terms in Danish differ somewhat from multi word terms in English. Compounds in Danish are normally written as one word and therefore do not count as multi word terms. As a result, multi word terms in Danish more often follow the pattern adjective + noun than the pattern noun + noun that is common in English.

Besides single words, multi word terms can also characterise a text, e.g. finansiel støtte (financial support). The multi word terms are calculated differently from the single word keywords: they are the most relevant adjective + noun combinations among the words in the text, rather than the word pairs that are relatively most frequent compared with other texts.

All occurrences of the same multi word term from the text are listed here with the frequency of each occurrence and the sum of frequencies.
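One possible way to collect adjective + noun candidates and list them as described above is sketched below. It rests on assumptions not stated in the source: the input is a list of hypothetical `(word, lemma, pos)` tuples, the tag names `"ADJ"` and `"NOUN"` are illustrative, occurrences are grouped under the same term when their lemmas match, and only directly adjacent pairs are considered. How the extractor actually decides which combinations are "most relevant" is not specified, so no ranking is attempted here.

```python
from collections import Counter, defaultdict


def adjective_noun_terms(tagged):
    """Group adjacent adjective + noun pairs by their lemmas.

    tagged -- list of (word, lemma, pos) tuples; the tag names "ADJ" and
              "NOUN" are assumptions made for this sketch.

    Returns a dict: (adj_lemma, noun_lemma) -> (forms, total), where forms
    maps each surface occurrence to its frequency and total is their sum.
    """
    forms_by_term = defaultdict(Counter)
    # Walk over every pair of neighbouring tokens.
    for (w1, l1, p1), (w2, l2, p2) in zip(tagged, tagged[1:]):
        if p1 == "ADJ" and p2 == "NOUN":
            forms_by_term[(l1, l2)][f"{w1} {w2}".lower()] += 1
    return {
        term: (dict(forms), sum(forms.values()))
        for term, forms in forms_by_term.items()
    }
```

With this grouping, the occurrences "finansiel støtte" and "finansielle støtter" would both be listed under the term (finansiel, støtte), each with frequency 1 and a summed frequency of 2.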


More Information

C. Navarretta, B.S. Pedersen, D.H. Hansen: Language Technology in knowledge-organisation systems. In New Review of Hypermedia and Multimedia, Vol. 12, No. 1, June 2006, 29-49.

Contact: Dorte Haltrup Hansen
E-mail: dorte @ cst.dk


Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S