CST's repetitiveness checker

The program uses a statistical method to find repetitions in a text. The number and type of repetitions indicate how well the text lends itself to automatic processing, for example machine translation. Sequences of two or more words that are repeated often in a text are also good candidates for multi-word compounds or domain-specific phrases. In general, the program helps to characterise a text on a statistical basis.


Background

The program looks for sequences of words that occur more than once. Each sequence is weighted according to its length and its frequency in the text, but also according to the frequency of the words it contains. If very frequent words occur together, the assigned weight is lower than if infrequent words occur together. For example, a sequence like 'there are' is assigned a low weight even though it occurs often. In this way the program prioritises sequences that are characteristic of the text over sequences that are frequent in general.
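The weighting idea described above can be illustrated with a minimal Python sketch. It is not the program's actual computation: the function names (repeated_sequences, weight, rank_sequences), the whitespace tokenisation, the length limits and the formula itself (length times frequency times the average negative log relative frequency of the words) are all assumptions chosen only to reproduce the described behaviour, namely that sequences of very frequent words score lower than sequences of infrequent words.

```python
from collections import Counter
import math


def repeated_sequences(tokens, min_len=2, max_len=5):
    """Return {sequence: count} for word sequences that occur more than once."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c > 1}


def weight(seq, seq_count, word_counts, total_words):
    """Illustrative weight (an assumption, not the program's formula).

    Grows with sequence length and sequence frequency, and shrinks when
    the words in the sequence are themselves very frequent in the text.
    """
    rarity = sum(-math.log(word_counts[w] / total_words) for w in seq) / len(seq)
    return len(seq) * seq_count * rarity


def rank_sequences(text, min_len=2, max_len=5):
    """Rank repeated sequences in a text, highest illustrative weight first."""
    tokens = text.lower().split()  # naive tokenisation, assumed for the sketch
    word_counts = Counter(tokens)
    repeated = repeated_sequences(tokens, min_len, max_len)
    scored = [
        (weight(seq, count, word_counts, len(tokens)), " ".join(seq), count)
        for seq, count in repeated.items()
    ]
    return sorted(scored, reverse=True)
```

Calling rank_sequences on a text returns tuples of (weight, sequence, count). With this formula, a pair such as 'there are' gets a low weight whenever 'there' and 'are' each make up a large share of the text, even if the pair itself is repeated more often than a rarer pair.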

The program can address tasks like:

  • assessing whether automatic translation by means of translation memory is a good option,
  • compressing a text,
  • spotting unintended repetitions of sentences or parts thereof,
  • spotting pieces of plagiarism,
  • extracting term candidates for a translation memory or search engine.

More information

Underwood, N. L. & B. Jongejan: "Profiling Translation Projects - An Essential Part of Routing Translations". The 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 99), Chester, August 1999.

Computation of the weight of a sequence.

Contact: Bart Jongejan


Emil Holms Kanal 2, building 22, 3, DK-2300 Copenhagen S