--------------------------------------------------------------------------------
CstNER
--------------------------------------------------------------------------------
Version:    V4.0
Licence:    CLARIN_ACA-NC (CLARIN academic licence for non-commercial use)
IPR-holder: University of Copenhagen
17-12 2012 Dorte Haltrup Hansen

------------
DESCRIPTION:
------------
CstNER is a rule-based, language-dependent named entity recognizer (NER) for
Danish. It is implemented in Perl and runs on Linux with UTF-8 input and output.

Named entities (NEs) are here identified by an initial uppercase letter.
Punctuation around uppercase tokens is ignored, the uppercase tokens are looked
up in external lists (see FILES), and token-internal and token-external clues
are used to classify the names. Frequent sentence-initial words, frequent listed
"no-names" and uppercase tokens that exist in a lowercase version elsewhere in
the document are ignored. Tokens are concatenated when they are part of a person
or company name.

Users can add their own sentence-initial words to the file "Berl.initord" and
their own "no-names" to the file "nonames".

At the end of the program, the user can change the output by using '#' to
comment output modes in or out:

    print "$tok"; # Print all
    #if($tok =~ /Named Entity)

Syntax:

    perl CstNER.pl INPUTFILE > OUTPUTFILE

The program must be run in the same directory as the external name lists.

---------------
NER CATEGORIES:
---------------
Category        E.g.
person          Antonio Natale
street          Lindhagensgatan
city            Warszawa
country         Danmark
place           Asien
organisation    Arla Foods amba
misc (other)    Telefax

---------------
CERTAINTY TAGS:
---------------
The certainty tag indicates whether a named entity is found on a list
(certainty = 1), whether part of the name is found on a list or clues indicate
the category (certainty = 2), or whether the category is uncertain
(certainty = 3). With certainty 3, the token is probably not a name, but it can
be.

E.g.
Lene Stampe Thomsen    (all names are on the lists)
Iver Huifeldt          (only Iver is on the firstname list)
Nato                   (Nato is unknown - but is a name)
Rederi                 (Rederi is unknown - but is not a name)

-----------------------
FILES (lists of names):
-----------------------
 1391 Berl.initord   (frequent sentence-initial words for the newspaper Berlingske Tidende)
   43 company        (company abbreviations)
  281 contrynames
  584 DKcity         (Danish city names)
10328 firstnames
   72 islandnames
  333 miscplace      (other place names)
 1531 nonames        (uppercase words that are not names)
  977 nonDKcity      (cities outside Denmark)
   20 street
94639 surnames

-------------
CONTACT INFO:
-------------
For questions and remarks about the program, please feel free to contact us.

Our postal address is:

Center for Sprogteknologi
University of Copenhagen
Njalsgade 140
2300 Copenhagen S.
Denmark
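
Note on the certainty tags: the three levels described under CERTAINTY TAGS can be read as a small decision rule. The following is a minimal sketch in Python, not CstNER's own Perl code. The function name `certainty`, the `has_clue` flag and the tiny `known` set are hypothetical and only for illustration. It also assumes that "found on a list" means every token of the name is listed, which the actual program may handle differently.

```python
def certainty(name_tokens, known_names, has_clue=False):
    """Return 1, 2 or 3 following the certainty scheme described above.

    name_tokens: tokens of one candidate name, e.g. ["Iver", "Huifeldt"]
    known_names: set of listed names (hypothetical stand-in for the
                 contents of files such as firstnames and surnames)
    has_clue:    True if token-internal or token-external clues
                 point to a category (assumed to be computed elsewhere)
    """
    hits = [tok in known_names for tok in name_tokens]
    if hits and all(hits):
        return 1  # every token of the name is on the lists
    if any(hits) or has_clue:
        return 2  # part of the name is listed, or clues indicate the category
    return 3      # category uncertain: probably not a name, but can be


# Tiny made-up list for illustration; the real lists are far larger.
known = {"Lene", "Stampe", "Thomsen", "Iver"}

print(certainty(["Lene", "Stampe", "Thomsen"], known))  # 1
print(certainty(["Iver", "Huifeldt"], known))           # 2
print(certainty(["Nato"], known))                       # 3
```

Under these assumptions, "Nato" and "Rederi" both end up with certainty 3, which matches the README's remark that such tokens may or may not be names.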