CST Named Entity Recogniser

CST's named entity recogniser classifies names as:
    - personal names,
    - locations (with sub-classes for street, city, country and other types of location),
    - and other names (called MISC).


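For example, the Danish input sentence

"Per Sørensen arbejder i Landmandsbanken i Sorø" ("Per Sørensen works at Landmandsbanken in Sorø")

is tagged as

Per=Sørensen*PERSONNAMEX arbejder i Landmandsbanken*MISCNAMEX i Sorø*CITYNAMEX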
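
The tagged output shown above can be post-processed with a few lines of code. The following is a minimal sketch in Python, not part of CST's program. It assumes, based only on the single example above, that the words of a multi-word name are joined by "=" and that the category label follows a "*". The function names parse_tagged and entities are hypothetical.

```python
import re

# Assumed token shape, inferred from the one example in this document:
# the words of a name joined by "=", then "*", then an upper-case label.
TAGGED = re.compile(r"^(?P<name>.+)\*(?P<label>[A-Z]+)$")


def parse_tagged(line):
    """Return (text, label) pairs; label is None for untagged tokens."""
    result = []
    for token in line.split():
        match = TAGGED.match(token)
        if match:
            # Restore the spaces inside a multi-word name.
            name = match.group("name").replace("=", " ")
            result.append((name, match.group("label")))
        else:
            result.append((token, None))
    return result


def entities(line):
    """Keep only the tokens that carry a category label."""
    return [(text, label) for text, label in parse_tagged(line) if label is not None]


example = "Per=Sørensen*PERSONNAMEX arbejder i Landmandsbanken*MISCNAMEX i Sorø*CITYNAMEX"
print(entities(example))
# [('Per Sørensen', 'PERSONNAMEX'), ('Landmandsbanken', 'MISCNAMEX'), ('Sorø', 'CITYNAMEX')]
```

If the real output uses other separators or labels in some cases, the regular expression would need to be adjusted accordingly.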

Dates are also identified.

CST's name recogniser is rule-based and can be extended with other name categories (e.g. product names, company names, event names) by adding word lists and by writing rules that characterise the new categories. The program can also classify other kinds of numeric expressions, e.g. civil registration numbers and telephone numbers.
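
The general idea of combining word lists with pattern rules can be illustrated by the minimal Python sketch below. It is not CST's rule format or implementation. The list entries, the category labels and the function name classify are all hypothetical, and the two number patterns are simplifying assumptions (a Danish civil registration number written as six digits, an optional hyphen and four digits; a Danish telephone number written as eight digits without spaces).

```python
import re

# Hypothetical word list for a new category; a real list would be far longer.
PRODUCT_NAMES = {"ExampleProduct", "DemoPhone"}

# Assumed, simplified number formats (no validity checks on the digits).
CPR_PATTERN = re.compile(r"^\d{6}-?\d{4}$")   # e.g. 010203-1234
PHONE_PATTERN = re.compile(r"^\d{8}$")        # e.g. 35329090


def classify(token):
    """Return a hypothetical category label for a token, or None."""
    if token in PRODUCT_NAMES:      # list lookup
        return "PRODUCT"
    if CPR_PATTERN.match(token):    # pattern rule
        return "CPR"
    if PHONE_PATTERN.match(token):  # pattern rule
        return "PHONE"
    return None


for token in ["DemoPhone", "010203-1234", "35329090", "hus"]:
    print(token, classify(token))
```

A real rule set would also look at the context of a token, not only at the token itself.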

Good results have been obtained when CST's name recogniser is used as a tool for anonymisation (see IDANNA). On advocacy-related texts a recall of 82% and a precision of 98% were obtained, and on psychiatric journals a recall of 95% and a precision of 97%. (Recall is the proportion of the names in a text that the program finds, and precision is the proportion of the names it finds that it categorises correctly.) On general texts a lower recognition rate can be expected, because adapting the program to a subject area (e.g. case books or advocacy-related texts) helps with the identification of names.
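
Recall and precision can be computed from three counts, as in the minimal Python sketch below. The function name precision_recall is hypothetical, and the counts in the usage example are invented for illustration; they are not taken from the evaluations mentioned above.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute (precision, recall) from raw counts.

    true_positives:  names found and categorised correctly
    false_positives: items marked by the program that are wrong
    false_negatives: names in the text that the program missed
    """
    marked = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when nothing was marked or present.
    precision = true_positives / marked if marked else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Invented counts: 9 correct, 1 wrong, 3 missed.
print(precision_recall(9, 1, 3))  # (0.9, 0.75)
```

A program that marks few names but marks them correctly gets a high precision and a low recall. For anonymisation, recall is usually the more critical figure, since every missed name stays in the text.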

Apart from anonymisation, name recognition can be used, for example, as a preprocessing stage in machine translation to prevent names from being translated, as a fact finder in documents, or as a preprocessing stage for a POS tagger (see the Brill tagger).

CST's name recogniser only supports Danish texts.

More information

Report about the name recogniser used for anonymisation

Contact: Dorte Haltrup Hansen
E-mail: dorteh@hum.ku.dk

Procurement and use

For use of Danish lingware and commercial applications of the program, contact:

Claus Povlsen
E-mail: cpovlsen@hum.ku.dk

Njalsgade 140-142, building 25, DK-2300 Copenhagen S
Tel: +45 35329090 - Fax: +45 35329089