TEMAA D13 - 2 Methodology

In TEMAA, the functionality of a spelling checker is broken down into three attributes, namely lexical coverage, error coverage and suggestion adequacy, each of which is in turn subdivided into a number of sub-attributes. In order to test how well a system performs with respect to each of these, it must be run on collections of linguistic materials consisting of lists of correct words (to test lexical coverage) and lists of misspelled words (to test the other two attributes). An experimental 'proof of concept' tool, ASCC (Automated Spelling Checker Checker), has been developed at Stichting Taaltechnologie to automate the production of corpora of misspelled words and the actual testing of the three functionality attributes.

The TEMAA model thus provides clear evaluation benchmarks in the form of attributes and sub-attributes, and automation as implemented in ASCC makes it possible for the project to work with large data samples, a clearly desirable goal for any serious evaluation attempt.

The following sections deal with the way in which ASCC has been used to test with materials relevant to Danish.

2.1.1 Testing lexical coverage

To test lexical coverage, a collection of word lists representative of a number of coverage sub-attributes has been prepared. Details about the lists are given in D12: Test materials. Here, we only repeat the sub-attributes covered: common word coverage (words belonging to open and closed part of speech classes have been tested by way of two separate lists) and coverage of English loan words from the computer science domain.

It should be noted that some of the items in the lists were multi-word units, i.e. expressions consisting of more than one typographical word. In particular, the loan word list contains a large number of English compounds. Such multi-word units had to be tested one word at a time, as explained in Section 2.2.2.
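As an illustration only, lexical coverage for one word list can be read as the share of correct words that the checker accepts. The sketch below assumes this reading. The function `lexical_coverage` and the `accepts` callback (a stand-in for running a real spelling checker on a single word) are hypothetical names, not part of ASCC, and the sample words are invented.

```python
from typing import Callable, Iterable


def lexical_coverage(words: Iterable[str], accepts: Callable[[str], bool]) -> float:
    """Return the fraction of correct words that the checker accepts.

    `accepts` is a hypothetical stand-in for a real spelling checker:
    it takes one word and returns True if the word is not flagged.
    """
    word_list = list(words)
    if not word_list:
        raise ValueError("the word list must not be empty")
    accepted = sum(1 for word in word_list if accepts(word))
    return accepted / len(word_list)


# Toy checker whose lexicon knows two common words but no loan words.
toy_lexicon = {"hus", "bil"}
sample = ["hus", "bil", "debugger", "scanner"]
print(lexical_coverage(sample, lambda w: w in toy_lexicon))  # prints 0.5
```

Running the same function separately on each list keeps the sub-attributes (open class words, closed class words, loan words) apart in the results.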

2.1.2 Testing error coverage and suggestion adequacy

Error coverage and suggestion adequacy have been tested by automatically generating misspelled words on the basis of ten different error types. These are a subset of the error types described in D12, representing a reasonable spread with respect to the manipulations carried out to generate a wrong word from a correct one, as well as to the linguistic factors causing the errors.

The error types tested are shown below. The rule notation used consists of three parts:

The following symbols are used: '.' indicates beginning or end of word, '>' means rewrites as, '-' means at least one preceding letter, 'V' means any vowel, 'C' any consonant, '&' any letter, '!=' means different from, and '{}' enclose a list of restrictions.
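To make the notation concrete, the following minimal sketch translates the left-hand side of a rule into a regular expression. It is not the ASCC or Perl implementation. The function name `lhs_to_regex`, the vowel and consonant inventories, and the sample patterns `-nd.` and `.VC` are assumptions made for illustration and are not among the rules of this report. The '>' rewrite step, '!=' and the '{}' restrictions are deliberately left out.

```python
import re

# Assumed letter inventories for Danish (lower case only).
VOWELS = "aeiouyæøå"
CONSONANTS = "bcdfghjklmnpqrstvwxz"
LETTERS = VOWELS + CONSONANTS


def lhs_to_regex(lhs: str) -> re.Pattern:
    """Translate a rule left-hand side into a compiled regular expression.

    Handles '.', '-', 'V', 'C' and '&'; any other character is taken
    as a literal letter. Restrictions and the rewrite are not handled.
    """
    parts = []
    last = len(lhs) - 1
    for i, ch in enumerate(lhs):
        if ch == ".":
            # '.' marks the beginning or the end of the word.
            if i == 0:
                parts.append("^")
            elif i == last:
                parts.append("$")
            else:
                raise ValueError("'.' is only allowed at either end of a pattern")
        elif ch == "-":
            parts.append(f"[{LETTERS}]+")    # at least one preceding letter
        elif ch == "V":
            parts.append(f"[{VOWELS}]")      # any vowel
        elif ch == "C":
            parts.append(f"[{CONSONANTS}]")  # any consonant
        elif ch == "&":
            parts.append(f"[{LETTERS}]")     # any letter
        else:
            parts.append(re.escape(ch))      # literal letter
    return re.compile("".join(parts))


final_nd = lhs_to_regex("-nd.")              # hypothetical pattern: word-final "nd"
print(bool(final_nd.search("mand")))         # prints True
print(bool(final_nd.search("nd")))           # prints False (no preceding letter)
```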

Errors have only been generated from the list of common words of open part of speech classes. There are a number of reasons for excluding the other two lists. Firstly, few words from the closed part of speech classes match any of the letter combinations in the left-hand side of the rules. Secondly, most closed class words are highly frequent, which makes them unlikely candidates for misspelling of the kind tested here, so the errors are hardly relevant to those words. Finally, the list of loan words is basically a list of English words: since the errors tested for are all related to the way in which Danish phonology is realised in the spelling system of Danish, it makes no sense to apply the rules that generate those errors to words deriving from a different spelling system.

In general, it must be stressed that the errors tested for are only a sample of the kind which can be described as intentional errors, i.e. errors due to the writer's wrong use of their linguistic knowledge. Thus, we have not looked at another central error type, namely mis-typings. However, there is no reason to believe that the methodology described in this report should not be directly applicable to that class of errors as well.

2.2 Problematic areas

2.2.1 Error generation

ASCC consists of three parts, namely an error generator, a spelling checker runner and a reporter. However, it has been necessary to detach the error generation phase from the rest of the testing procedure. Thus, the error generation code in ASCC has been commented out, and misspelled words have been generated by a separate Perl script, where each corruption rule has been implemented as a Perl routine.[3]

The reason for this was that some of the corruption rules listed above generated too much noise, so that "false" errors had to be removed manually from the lists of misspellings. Such false errors were of two kinds: words that were not actually misspelled at all (i.e. a corruption rule unintentionally generated a different valid word), and words that did contain an error, but not of the kind the rule was intended to create. The first kind is not too problematic. ASCC checks all the generated items against the base list to make sure that they are not correct words of the language, so such words would not have been taken to be misspelled if a larger input list had been used. The second kind of unwanted output seems more difficult to get rid of completely given the kind of corruption rules used, so a certain amount of manual checking seems unavoidable. Of the rules listed above, the second one in particular generated such a large number of false or unwanted misspellings that it was finally taken out of the sample. Results relative to rule 2 are therefore not shown in the tables further below.
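The check against the base list can be sketched as follows. This is a minimal illustration, not the ASCC code: the function `filter_false_errors` is a hypothetical name, and the (source, corrupted) pairs assume an invented rule that drops a word-final "d" after "n". The example also shows why the size of the base list matters: "hun" is a valid Danish word, but it is only rejected once it is present in the list.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source word, corrupted word)


def filter_false_errors(
    candidates: Iterable[Pair], base_list: Iterable[str]
) -> Tuple[List[Pair], List[Pair]]:
    """Split generated items into kept misspellings and false errors.

    An item counts as a false error here if the corrupted form is
    itself in the base list or is identical to its source word.
    Errors of an unintended kind are not detected by this check.
    """
    lexicon = {word.lower() for word in base_list}
    kept, rejected = [], []
    for source, corrupted in candidates:
        if corrupted == source or corrupted.lower() in lexicon:
            rejected.append((source, corrupted))
        else:
            kept.append((source, corrupted))
    return kept, rejected


generated = [("mand", "man"), ("hund", "hun")]

small_base = ["mand", "man", "hund"]
print(filter_false_errors(generated, small_base)[0])
# prints [('hund', 'hun')]

larger_base = small_base + ["hun"]
print(filter_false_errors(generated, larger_base)[0])
# prints []
```

Only the first kind of false error is caught this way, so the list of kept items would still need the manual inspection described above.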

2.2.2 Multi-word units

Multi-word units are a problem in the sense that spelling checkers only treat them as sequences of individual words. This means that each component word is checked independently of the others in the compound. Consequently, ASCC cannot treat words consisting of more than one typographical unit. Therefore, all the multi-word units occurring in our lists (the list of loan words in particular contains a large number of them, as already mentioned) had to be split up, so that each word was written on a separate line.
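The splitting step amounts to something like the sketch below, which assumes that the component words are separated by whitespace. The function name `split_multiword_units` and the sample entries are hypothetical. Whether repeated component words were kept or removed is not stated here, so the sketch keeps them.

```python
from typing import Iterable, List


def split_multiword_units(entries: Iterable[str]) -> List[str]:
    """Return one typographical word per item, in the original order."""
    words: List[str] = []
    for entry in entries:
        # str.split() with no argument splits on any run of whitespace
        # and drops leading and trailing whitespace.
        words.extend(entry.split())
    return words


entries = ["hard disk", "floppy disk drive", "mouse"]
print("\n".join(split_multiword_units(entries)))  # one word per line
```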

2.2.3 Dots and dashes

Another problem was due to misspellings ending in dots or dashes (i.e. corrupted abbreviations or compound components). One of the systems tested showed rather idiosyncratic behaviour in relation to such unknown words: in the spelling checker's menu they appear without the final dot or dash. This confuses ASCC, which simply skips these words in the report it generates, so they had to be added to the list of unknown words manually.
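One conceivable way to avoid the manual step is to compare the words with their final dots and dashes stripped. The sketch below only illustrates that idea and is not how ASCC works. The function `match_flagged` and the sample strings are hypothetical. Note that the comparison becomes ambiguous if the input contains both a word and the same word followed by a dot or dash.

```python
from typing import Iterable, List


def match_flagged(inputs: Iterable[str], reported: Iterable[str]) -> List[str]:
    """Return the input words that the checker reported as unknown.

    A word counts as reported if it appears either verbatim or without
    its trailing dots and dashes in the checker's output.
    """
    reported_set = set(reported)
    flagged = []
    for word in inputs:
        if word in reported_set or word.rstrip(".-") in reported_set:
            flagged.append(word)
    return flagged


inputs = ["osb.", "komputer-", "hus"]   # invented test items
reported = ["osb", "komputer"]          # as shown by the checker, endings lost
print(match_flagged(inputs, reported))  # prints ['osb.', 'komputer-']
```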

2.2.4 File format

All the files that were used during testing are DOS files. To run a spelling checker on them, the user is usually asked which format the files are in, so that they can be read into the word processor in the proper way. ASCC simulates an interactive session and thus expects to be asked about the format of the files. However, one of the systems tested showed rather unpredictable behaviour with respect to this point, so that some of the control characters sent by ASCC corrupted the files. The problem has been solved for the purpose of this test experiment, but not in a general way, as it is still unclear what caused the spelling checker's unexpected behaviour. In general, it must be stressed that ASCC is completely dependent on systematic behaviour on the part of the checker, and that even minor idiosyncrasies (such as a missing dot in abbreviations as described earlier) are enough to corrupt the final results.

[3] The project has now developed a separate tool to express errors, namely the errgen program, so this problem has been solved.