Presentation Examples

Computing the weight of a sequence in a text

The repetitiveness checker assigns a high weight to a sequence if the sequence occurs more often than would be expected under the assumption that a text is an incoherent list of words. In this way, the weight says something about the tendency of words in a text to appear together with their friends, even though there may not be many of those friends in the text and it would not be expected a priori that they appear together.

This is how the repetitiveness checker computes the probability that a candidate sequence occurs a certain number of times:

$P_{n}^{m}(x_{1} \ldots x_{l}) = \frac{\prod_{j = 1}^{m} (n - j l + 1)}{m!} \cdot (P_{x_{1} \ldots x_{l}})^{m} \cdot (1 - P_{x_{1} \ldots x_{l}})^{n - m l}$

In this formula, $P_{n}^{m}(x_{1} \ldots x_{l})$ is the probability that a sequence of $l$ words $x_{1} \ldots x_{l}$ occurs $m$ times in a text with $n$ tokens, whereas $P_{x_{1} \ldots x_{l}}$ is the probability that the sequence of words $x_{1} \ldots x_{l}$ occurs at a given position. The latter probability is simply the product of the probabilities of the words in the sequence: $P_{x_{1} \ldots x_{l}} = \prod_{i = 1}^{l} P_{x_{i}}$, where $P_{x_{i}}$ is the probability that the word at a given position is the word $x_{i}$.
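
To make the formula concrete, here is a minimal Python sketch that evaluates it term by term. The function names `sequence_probability` and `occurrence_probability` are hypothetical and are not taken from the repetitiveness checker itself. The sketch also assumes that the word probabilities $P_{x_{i}}$ are supplied by the caller (for example as relative frequencies) and that $m l \le n$; neither assumption is stated in the text above.

```python
from math import factorial, prod


def sequence_probability(word_probs):
    """P_{x_1 ... x_l}: product of the per-word probabilities P_{x_i}."""
    return prod(word_probs)


def occurrence_probability(n, m, word_probs):
    """P_n^m(x_1 ... x_l): probability that the sequence occurs m times
    in a text of n tokens, following the formula above.

    word_probs holds P_{x_i} for each of the l words in the sequence.
    """
    l = len(word_probs)
    if m < 0 or m * l > n:
        # Assumption of this sketch: m occurrences of length l must fit in n tokens.
        raise ValueError("need 0 <= m and m * l <= n")

    p_seq = sequence_probability(word_probs)

    # Combinatorial factor: prod_{j=1}^{m} (n - j*l + 1) / m!
    # (for m = 0 the empty product is 1).
    placements = prod(n - j * l + 1 for j in range(1, m + 1)) / factorial(m)

    # m positions carry the sequence; the remaining n - m*l do not.
    return placements * p_seq ** m * (1 - p_seq) ** (n - m * l)


# Usage: a two-word sequence with illustrative (made-up) word probabilities
# in a text of 1000 tokens.
if __name__ == "__main__":
    for m in range(4):
        print(m, occurrence_probability(1000, m, [0.01, 0.02]))
```

For long texts or large $m$, the combinatorial factor can exceed floating-point range, so a practical implementation would probably work with logarithms; the direct form is kept here because it mirrors the formula.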