TEMAA D13 - 3 Overview of Testing

In the original plan of work, it was agreed that the methodology would be tried out with at least two different spelling checkers, taking into consideration at least one language. This report presents the testing carried out on two products in two languages (Danish and Italian).

As described in earlier reports (D9 and D10), we subdivided the functionality attribute for spelling checkers into coverage for correct (valid) words, identification of wrong (erroneous) words, and suggestion of correct words for the erroneous ones.

3.1 Preparatory work

In order to proceed with the testing, the following preparatory work was done. First of all, the test material had to be available, i.e., lists of words in a given language. Lists of words are now available for Danish, Italian and English.

Secondly, we needed an automatic way to interact with a spelling checker that would emulate a user's session with it. For this purpose, we created a special piece of software, ASCC (Automated Spelling Checker Checker), which is described in detail in deliverable D10, "Evaluation Methods for Spelling Checkers", and in this report. ASCC interacts with a spelling checker via the word processor the checker is integrated into.

Two reasons led us to this choice. On the one hand, we wanted to test the products as a typical end user sees them, and a typical end user would generally access a spelling checker via a word processor. On the other hand, the spelling checkers that were available to us were integral parts of a full text processing package. This approach has a practical disadvantage: the testing process becomes quite slow. However, it serves to demonstrate the basic feasibility of automated testing.

ASCC also produces reports that indicate, for a given checker and a given word list, the rate of acceptance or rejection, what the checker's suggestions were, and in which position the intended word appears among them.
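The kind of summary such a report contains can be illustrated with a minimal sketch. It is not ASCC's actual code or report format: the record layout `CheckResult`, the function `summarise` and the labels "none offered" and "not listed" are all hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """One word submitted to a checker (hypothetical record layout)."""
    word: str                 # the string sent to the checker
    accepted: bool            # True if the checker did not flag the word
    suggestions: list = field(default_factory=list)  # ranked suggestions
    intended: str = ""        # correct word behind an error item, if any


def summarise(results):
    """Return acceptance/rejection rates and counts of suggestion ranks."""
    total = len(results)
    accepted = sum(1 for r in results if r.accepted)
    ranks = Counter()
    for r in results:
        if r.accepted or not r.intended:
            continue  # a rank only makes sense for rejected error items
        if not r.suggestions:
            ranks["none offered"] += 1
        elif r.intended in r.suggestions:
            # 1-based position of the intended word in the suggestion list
            ranks[r.suggestions.index(r.intended) + 1] += 1
        else:
            ranks["not listed"] += 1
    return {
        "total": total,
        "acceptance_rate": accepted / total if total else 0.0,
        "rejection_rate": (total - accepted) / total if total else 0.0,
        "intended_rank": dict(ranks),
    }
```

Keeping the rank of the intended word as a count per position, rather than as a single average, makes it straightforward to report how often the first suggestion is the intended one and how often no suggestion is offered at all.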

Finally, we needed to supply the spelling checkers with lists of invalid words in order to test their error detection and their suggestion adequacy. A number of error lists were derived from the frequency list of correct words, according to the error taxonomy of the given language (see D12, "Test materials", for the findings for Danish and Italian), and supplied to each spelling checker.

ASCC itself has an embedded facility to introduce errors into correct words, but we realized that we needed a tool that could better specify the context in which an error is to be inserted. For instance, we might want to undouble a double consonant only when it is preceded by a given vowel, if that is the error pattern we have observed in a given language. The errgen package was created to allow exactly this. In addition, errgen checks that words generated by inserting one error are not (accidentally) correct words, by comparing the error list to a correct list and discarding the unwanted results. For instance, deleting one character from the word chat yields cat, which is itself a correct word and would therefore be discarded.
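The two ideas just described, a context-conditioned error rule and the filtering of accidental real words, can be sketched as follows. This is an illustration under stated assumptions, not errgen's implementation or rule syntax: the function names `undouble_after` and `make_error_list` are hypothetical, and the rule handles lower-case ASCII letters only.

```python
import re


def undouble_after(word, vowel):
    """Yield variants in which a double consonant directly preceded by
    `vowel` is reduced to a single consonant (one error per variant)."""
    # Assumption: lower-case ASCII only; the class lists the consonants.
    pattern = re.compile(
        rf"(?<={re.escape(vowel)})([b-df-hj-np-tv-z])\1"
    )
    for m in pattern.finditer(word):
        yield word[:m.start()] + m.group(1) + word[m.end():]


def make_error_list(correct_words, vowel):
    """Return (error form, intended word) pairs, discarding any
    generated form that is itself in the list of correct words."""
    lexicon = set(correct_words)
    errors = []
    for word in correct_words:
        for variant in undouble_after(word, vowel):
            if variant not in lexicon:
                errors.append((variant, word))
    return errors


words = ["latter", "later", "batter", "bitter"]
print(make_error_list(words, "a"))
# prints [('bater', 'batter')]
```

In this toy list, "latter" would become "later", which is a correct word and is therefore dropped, while "bitter" is untouched because its double consonant follows "i" rather than "a". Keeping the intended word next to each error form is what later allows the position of the intended word among the suggestions to be checked.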

With one exception, all lists (i.e., correct word lists and error lists) are made available to the Commission. The exception is the list made available to ISSCO by Pisa's Istituto di Linguistica Computazionale, which is protected by a non-distribution agreement.

3.2 Actual testing

Testing with the two products has been undertaken for the Danish and Italian languages. The appendices give a full account of the various test lists and respective results obtained.

Although products from the same manufacturers were used for the two languages, the approach differs slightly, and the lists supplied to the checkers are meant to reflect different needs. For instance, the test for Danish includes a list of computer loan words from English, whereas the Italian test includes Italian proper names. This was done deliberately, in order to cover a wider range of possible scenarios for intended users.

For the Danish language, lexical coverage was tested on one word list totalling 6,780 words. The results are broken down by frequency class, in order to give a better account of the functionality of each checker. Two other lists were also used for coverage, one of closed-class words (748 items), the other of loan words from an English computer-related dictionary (135 items). As for precision and suggestion adequacy, ten error types were tested. All error types relate to the competence of the writer (i.e., they are real misspellings and not mis-typings). Together, the error types come to a total of 4,562 error items.
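A breakdown of coverage by frequency class, as used for the Danish word list, can be computed along the following lines. This is only a sketch: the function name `coverage_by_class`, the tuple layout and the class labels "high" and "mid" are assumptions for illustration, and the four sample items are invented, not taken from the test lists.

```python
from collections import defaultdict


def coverage_by_class(items):
    """items: iterable of (word, frequency_class, accepted) tuples.
    Returns the share of accepted words for each frequency class."""
    counts = defaultdict(lambda: [0, 0])  # class -> [accepted, total]
    for _word, freq_class, accepted in items:
        counts[freq_class][1] += 1
        if accepted:
            counts[freq_class][0] += 1
    return {c: acc / tot for c, (acc, tot) in sorted(counts.items())}


# Invented sample data, for illustration only.
sample = [
    ("og", "high", True),
    ("hus", "high", True),
    ("cykel", "mid", True),
    ("xylofon", "mid", False),
]
print(coverage_by_class(sample))
# prints {'high': 1.0, 'mid': 0.5}
```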

For the Italian language, two lists were available for lexical coverage, one much larger but older than the other (their word counts were 33,377 and 16,525 respectively). Two lists of proper names (Italian cities [103 items] and given names [609 items]) were also tested, as well as adjectives referring to those same cities (320 items) and ordinal numerals written out in full (63 items). Precision and suggestion adequacy were tested with seven error types, two of which reflected mis-typings, for a total of 5,253 error items.

3.3 Conclusions

The spelling checkers reported in the appendices are off-the-shelf products, available to the general public. The prime reason for carrying out tests on commercial products was to put our methodology to the test of real-life experience. The tests have shown enough of a difference between the two products for a potential consumer to make a purchasing decision with better knowledge of the functionality of a given product.

For instance, in the case of Danish, only one checker recognized all closed-class words, and that same checker scored slightly better for general language recall. Both checkers did poorly on recall for loan words. As for testing with error lists, the checker that scored lower on recall scored higher on precision, reaching a 100% score on seven of the ten lists. The other checker reached 100% precision on only three lists out of ten. Suggestion adequacy scores are similar for both checkers; the first suggestion is the intended one in 76% of the cases for one checker, and in 73% for the other one. No suggestions are offered in 5% of the cases for one checker, and in 4% of the cases for the other one.

As for Italian, recall for general language and for names of Italian cities is similar for the two checkers, but the difference in coverage of given names, city adjectives and ordinal adjectives is quite significant. One checker recognized 93% of the given names, but only 34% of the city adjectives and 29% of the ordinal adjectives, whereas the other checker scored 79% for given names, 95% for city adjectives, and 97% for ordinal adjectives. The results for precision are also indicative: one checker scored 100% on four of the seven lists supplied, the other on only one. Suggestion adequacy results are similar to the ones for Danish. The correct suggestion is the intended one in 72% of the cases for one checker, and in 76% for the other one. One checker does not offer any suggestion in 7% of the cases, the other one in 11% of the cases.