CST's Part-Of-Speech tagger (Brill, with adaptations)

The POS-tagger marks each word in a text with information about word class and morphological features. For example, "This page is about the Brill-tagger" → This/DT page/NN is/VBZ about/IN the/DT Brill-tagger/NNP
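The word/TAG output shown above is easy to post-process. The following is a minimal sketch, not part of the tagger itself: it assumes the output is whitespace-separated tokens in which the last slash separates word and tag, and the helper names `parse_tagged` and `words_with_tag` are hypothetical. It shows how one might pull out all words of one word class.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    Assumption: the LAST slash in a token separates word from tag,
    so a word that itself contains '/' is kept intact.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash found: treat the token as an untagged word.
            pairs.append((token, ""))
        else:
            pairs.append((word, tag))
    return pairs


def words_with_tag(pairs, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


tagged = "This/DT page/NN is/VBZ about/IN the/DT Brill-tagger/NNP"
pairs = parse_tagged(tagged)

print(words_with_tag(pairs, "NN"))  # ['page', 'Brill-tagger']
print(words_with_tag(pairs, "VB"))  # ['is']
```

Matching on a tag prefix rather than the full tag groups related tags together: here `NN` covers both the common noun (`NN`) and the proper noun (`NNP`).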
POS-tags can be used to extract words of a specific word class (all finite verbs, all nouns, etc.), to decide which word class a word belongs to in a given position ("She flies" = verb, "the flies" = noun), or to group word classes into syntagmata.

The Danish version of the Brill-tagger is trained on the Parole corpus, so the rules it uses to compute word classes for new words or homographs reflect the composition and usage in the Parole corpus (see report below). Under optimal circumstances the tagger attains 97% correct POS-tagging.

In principle, Brill's tagger can be used for many different languages. The only requirement is a POS-tagged training corpus of at least about 250,000 words.

More information
Short description (Danish)
Report "Træning og brug af Brill-taggeren på danske tekster" (Danish, "Training and application of Brill's tagger to Danish texts")
Guide to the Danish Parole corpus (Danish)
Contact: Dorte Haltrup Hansen
Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S