--------------------------------------------------------------------------------
CstNER
--------------------------------------------------------------------------------
Version: V4.0
Licence: CLARIN_ACA-NC (CLARIN academic licence for non-commercial use)
IPR-holder: University of Copenhagen
17-12 2012
Dorte Haltrup Hansen
------------
DESCRIPTION:
------------
CstNER is a rule-based, language-dependent named entity recognizer (NER) for Danish.
It is implemented in Perl and runs on Linux with UTF-8 input and output.
NEs are characterized here by starting with an uppercase letter. Punctuation around
uppercase tokens is ignored, the uppercase tokens are looked up in external lists (see FILES),
and token-internal and token-external clues are used to classify the names.
Frequent sentence-initial words, frequent listed "no-names" and uppercase tokens
that also occur in a lowercase version elsewhere in the document are ignored.
Adjacent tokens are concatenated if they form a person or company name.
The user can add their own sentence initial words to the file: "Berl.initord"
and their own "no-names" to the file "nonames".
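The filtering of uppercase tokens described above can be sketched roughly as
follows. This is a minimal illustration and not the code in CstNER.pl: the subs
strip_punct and is_candidate, the hashes %skip and %lower_seen, and the sample
sentence are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not taken from CstNER.pl.

# Ignore punctuation around a token, e.g. "(Arla," becomes "Arla".
sub strip_punct {
    my ($token) = @_;
    $token =~ s/^\P{Alnum}+//;
    $token =~ s/\P{Alnum}+$//;
    return $token;
}

# $skip:       hash ref of sentence-initial words and "no-names"
# $lower_seen: hash ref of tokens that occur in lowercase in the document
sub is_candidate {
    my ($token, $skip, $lower_seen) = @_;
    return 0 unless $token =~ /^\p{Lu}/;       # must start with uppercase
    return 0 if $skip->{$token};               # listed word to ignore
    return 0 if $lower_seen->{ lc $token };    # lowercase version exists
    return 1;
}

# Made-up sample data; the real words come from Berl.initord and nonames.
my %skip   = map { $_ => 1 } qw(Det Men);
my @tokens = map { strip_punct($_) }
             split ' ', 'Det siger Arla. Avisen skriver, at avisen lukker.';
my %lower_seen = map { $_ => 1 } grep { /^\p{Ll}/ } @tokens;

my @candidates = grep { is_candidate($_, \%skip, \%lower_seen) } @tokens;
print "@candidates\n";    # prints: Arla
```

In this sketch "Det" is dropped because it is on the skip list, and "Avisen" is
dropped because "avisen" also occurs in lowercase in the same text.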
At the end of the program the user can change the output mode by using '#'
to comment the relevant print statements in or out:
print "$tok"; # Print all
#if($tok =~ /Named Entity/)
Syntax:
perl CstNER.pl INPUTFILE > OUTPUTFILE
Must be run in the same directory as the external name lists.
---------------
NER CATEGORIES:
---------------
Category        Example
person          Antonio Natale
street          Lindhagensgatan
city            Warszawa
country         Danmark
place           Asien
organisation    Arla Foods amba
misc (other)    Telefax
---------------
CERTAINTY TAGS:
---------------
The certainty tags indicate whether a named entity is found on a list (certainty = 1),
whether part of the name is found on a list or clues indicate the category (certainty = 2),
or whether the category is uncertain (certainty = 3).
In the last case the token is probably not a name, but it can be.
E.g.
Lene Stampe Thomsen (all names are on the lists)
Iver Huifeldt (only Iver is on the firstname list)
Nato (Nato is unknown - but is a name)
Rederi (Rederi is unknown - but is not a name)
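One way to read the three certainty levels is sketched below. It covers the
list lookup only; the clue-based route to certainty 2 is left out. The sub name
certainty and the small %known hash are assumptions for the example and do not
come from CstNER.pl.

```perl
use strict;
use warnings;

# Hypothetical sketch: derive a certainty tag from list lookups alone.
# $known is a hash ref standing in for the external name lists.
sub certainty {
    my ($name, $known) = @_;
    my @parts = split ' ', $name;
    my $hits  = grep { $known->{$_} } @parts;   # number of listed parts

    return 1 if @parts && $hits == @parts;      # every part is on a list
    return 2 if $hits > 0;                      # only part of the name is listed
    return 3;                                   # nothing listed: uncertain
}

# Made-up stand-in for the firstnames and surnames lists.
my %known = map { $_ => 1 } qw(Lene Stampe Thomsen Iver);

for my $name ('Lene Stampe Thomsen', 'Iver Huifeldt', 'Nato') {
    printf "%-20s certainty = %d\n", $name, certainty($name, \%known);
}
```

With the sample hash the three names receive certainty 1, 2 and 3 respectively,
matching the first three examples above.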
-----------------------
FILES (lists of names):
-----------------------
 1391 Berl.initord (frequent sentence-initial words from the newspaper Berlingske Tidende)
   43 company      (company abbreviations)
  281 contrynames
  584 DKcity       (Danish city names)
10328 firstnames
   72 islandnames
  333 miscplace    (other place names)
 1531 nonames      (uppercase words that are not names)
  977 nonDKcity    (cities outside Denmark)
   20 street
94639 surnames
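A list file can be loaded into a hash for fast lookup along the following
lines. The sketch assumes one entry per line and UTF-8 encoding, which is not
stated explicitly here; load_list is a hypothetical helper, not a sub from
CstNER.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: read a list file into a hash ref keyed by entry.
# Assumes one entry per line, UTF-8 encoded.
sub load_list {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my %entries;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace and the line ending
        next if $line eq '';        # skip blank lines
        $entries{$line} = 1;
    }
    close $fh;
    return \%entries;
}

# Usage (run in the directory that holds the lists):
#   my $firstnames = load_list('firstnames');
#   print "listed\n" if $firstnames->{'Lene'};
```

A hash keeps each lookup at constant cost, which matters for the larger lists
such as surnames.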
-------------
CONTACT INFO:
-------------
For questions and remarks about the program, please feel free to contact us.
Our postal address is:
Center for Sprogteknologi
University of Copenhagen
Njalsgade 140
2300 Copenhagen S.
Denmark