CST's tokenisation and segmentation program

Department of Nordic Studies and Linguistics, Faculty of Humanities, University of Copenhagen

CST's tokenisation and segmentation program for text and RTF files

The program's task is to prepare text for other programs. Programs for text analysis often require input that is formatted in a prescribed way, which almost always means that the text must not contain control characters, must use a prescribed character set, and must consist of separate tokens and possibly also segments (sentences, headings and bullets). Tokenisation means that words, numbers and punctuation are separated by spaces and that multi-word units are combined.
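As an illustration of what tokenisation in this sense involves, here is a minimal Python sketch. It is not CST's implementation: the regular expression, the `tokenise` function, the small `MULTI_WORD_UNITS` list and the choice of an underscore to join multi-word units are all assumptions made for the example.

```python
import re

# Hypothetical list of multi-word units, for illustration only.
MULTI_WORD_UNITS = [("in", "spite", "of"), ("New", "York")]

# A token is a number (digits with optional inner "." or ","),
# a word, or a single non-space, non-word symbol such as punctuation.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenise(text, units=MULTI_WORD_UNITS, joiner="_"):
    """Return text with tokens separated by single spaces.

    Sequences of tokens that match a known multi-word unit are
    combined into one token using `joiner` (an assumed convention).
    """
    tokens = TOKEN_RE.findall(text)
    out = []
    i = 0
    while i < len(tokens):
        for unit in units:
            n = len(unit)
            if tuple(tokens[i:i + n]) == unit:
                out.append(joiner.join(unit))
                i += n
                break
        else:
            # No multi-word unit starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(tokenise("He moved to New York, in spite of the cost (3,5 million)."))
# prints: He moved to New_York , in_spite_of the cost ( 3,5 million ) .
```

A real tokeniser also has to deal with abbreviations, hyphenation, dates and similar cases, which this sketch ignores.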

Background

Many texts are not readily available in the required flat text format and must be converted first, for example PDF files, HTML files and Microsoft Word documents.

Conversion from text with layout to flat text is much more than removal of the characters that control layout and typography. The layout and typography contain valuable information about both tokens and sentences. For example, headings normally do not end with a full stop; instead, their end is indicated by the transition from large, bold print to normal print. And bullets in a list can sometimes only be recognised by looking at both typography and position.
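The heading example can be made concrete with a small Python sketch. It assumes that the styled text has already been parsed into runs carrying a font size and a bold flag; the `Run` class and the `split_on_style_change` function are hypothetical names for this illustration and say nothing about how the CST program itself reads RTF.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A stretch of text with uniform typography (assumed input format)."""
    text: str
    font_size: float  # in points
    bold: bool


def split_on_style_change(runs):
    """Start a new segment whenever font size or boldness changes."""
    segments = []
    current = []
    for run in runs:
        if current and (run.font_size, run.bold) != (
            current[-1].font_size,
            current[-1].bold,
        ):
            segments.append("".join(r.text for r in current).strip())
            current = []
        current.append(run)
    if current:
        segments.append("".join(r.text for r in current).strip())
    return segments


runs = [
    Run("Background", 16, True),
    Run(" Many texts are not readily available.", 11, False),
]
print(split_on_style_change(runs))
# prints: ['Background', 'Many texts are not readily available.']
```

Once the typography is discarded, the same input becomes "Background Many texts are not readily available.", and the boundary between heading and sentence is no longer marked. Note that this naive rule would also split at a single bold word inside a sentence, so a practical segmenter needs to combine typography with other cues such as position and punctuation.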

CST's tokenisation and segmentation program was developed to convert RTF (Rich Text Format) to the required flat text format. We chose RTF because it is well documented and because documents created in other formats, such as PDF, HTML and DOC, can easily be converted to RTF without loss of crucial layout information.

The program can also tokenise and segment flat texts, but the quality will normally be slightly lower, because flat text offers fewer unambiguous layout cues.

Contact: Bart Jongejan

Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S
