Center for Sprogteknologi has developed a 'Tokenizer' (a pre-processor) as part of the STO project (www.cst.dk/sto).
The purpose of the tokenizer is to split a text into tokens, e.g. so that each word in the text is written on a separate line. Besides words, the tokenizer also treats abbreviations, sentence separators, multi-word units, numbers and HTML codes as tokens.
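The token types listed above can be illustrated with a minimal Python sketch. This is not the CST implementation: the function names, the regular expression and the small abbreviation and multi-word-unit lists are assumptions made only for illustration.

```python
import re

# Hypothetical sample lists; the lexicons used by the real tokenizer
# are not described in this text.
ABBREVIATIONS = ["f.eks.", "bl.a.", "e.g."]
MULTI_WORD_UNITS = ["i dag", "New York"]


def build_pattern(abbreviations, multi_word_units):
    """Build one regular expression covering all token types."""
    # Try longer fixed expressions first so they win over shorter ones.
    fixed = sorted(abbreviations + multi_word_units, key=len, reverse=True)
    fixed_alt = "|".join(re.escape(item) for item in fixed)
    return re.compile(
        r"<[^>]+>"                        # HTML code such as <b> or </b>
        r"|(?:" + fixed_alt + r")(?!\w)"  # abbreviations and multi-word units
        r"|\d+(?:[.,]\d+)*"               # numbers such as 12,50 or 1.000
        r"|\w+(?:-\w+)*"                  # ordinary (possibly hyphenated) words
        r"|[.!?]+"                        # sentence separators
        r"|[^\w\s]"                       # any other punctuation mark
    )


def tokenize(text):
    """Return the tokens of text in their original order."""
    return build_pattern(ABBREVIATIONS, MULTI_WORD_UNITS).findall(text)


if __name__ == "__main__":
    sample = "Det koster 12,50 i dag, f.eks. i <b>New York</b>."
    # Write one token per line, as described above.
    print("\n".join(tokenize(sample)))
```

In this sketch the fixed expressions are tried before the general word rule, so "i dag" and "f.eks." each come out as a single token instead of being split at the space or the full stops.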
The program also converts the character set from DOS and HTML formats to Latin-1 (ISO 8859-1), the format used by Unix and XKwic.
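The character set conversion could look roughly like the following Python sketch. It assumes that "DOS format" means the code page 850 encoding and that "HTML format" means text containing character entities such as &aelig;. Neither detail is stated here, and both function names are hypothetical.

```python
import html


def dos_to_latin1(raw, codepage="cp850"):
    """Re-encode DOS bytes as Latin-1 (code page 850 is an assumption)."""
    text = raw.decode(codepage)
    # Characters that do not exist in Latin-1 are replaced by '?'.
    return text.encode("latin-1", errors="replace")


def html_to_latin1(text):
    """Resolve HTML character entities and encode the result as Latin-1."""
    return html.unescape(text).encode("latin-1", errors="replace")


if __name__ == "__main__":
    # The Danish letters ae, oe and aa in code page 850 and as HTML entities.
    print(dos_to_latin1(b"\x91\x9b\x86"))
    print(html_to_latin1("&aelig;&oslash;&aring;"))
```

Both calls yield the same three Latin-1 bytes, since they encode the same three Danish letters.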
The Tokenizer is designed as a pre-processor that converts raw texts into the format used by XKwic; however, the program is also used as a pre-processor for tagging, lemmatizing and parsing.