The TEMAA model thus provides clear evaluation benchmarks in the form of attributes and sub-attributes, and automation as implemented in ASCC makes it possible for the project to work with large data samples, a clearly desirable goal for any serious evaluation attempt.
The following sections describe how ASCC has been used for testing with material relevant to Danish.
Note that some of the items in the lists were multi-word units, i.e. expressions consisting of more than one typographical word. In particular, the loan word list contains a large number of English compounds. Such multi-word units had to be tested one word at a time, as explained in Section 2.2.2.
The error types tested are listed below. The rule notation used consists of three parts:
1. the letter combination to be found,
2. the letter combination to be substituted for the combination found in 1,
3. an optional list of restrictions on the letter combination in 1.
The following symbols are used: '.' marks the beginning or end of a word, '>' means 'rewrites as', '-' stands for at least one preceding letter, 'V' for any vowel, 'C' for any consonant and '&' for any letter; '!=' means 'different from', and '{}' encloses a list of restrictions.
1. rer. > re e.g. bærer > bære
2. rer. > r e.g. kontrollører > kontrollør
3. -et. > ed e.g. betragtet > betragted
4. -VrV > VV {V != V} e.g. Panduro > Panduo
5. -VrC > VC {C != r} e.g. hierarkisk > hierakisk
6. -nd > n e.g. sandsynligvis > sansynligvis
7. &rC > &rdC {& != r, C!=s} e.g. samværs > samværds
8. (-kk > gg) or (-k > g) e.g. makke ret > magge ret
9. &ce > &se {& != c, & != s} e.g. nyancer > nyanser
10. (.chV > sjV) or &chV > &sjV {& != s} e.g. chance > sjance
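To make the notation concrete, a few of the rules above can be sketched as regular-expression substitutions. This is only an illustration under stated assumptions, not the way ASCC itself encodes the rules: the names RULES and corrupt are hypothetical, the vowel inventory is assumed to be a, e, i, o, u, y, æ, ø, å, and '-' is read as "the match may not start at the beginning of the word". Only rules 1, 3, 4, 5, 6 and 9 are covered, and each matching site yields one misspelling.

```python
import re

# Assumed letter classes for Danish (an illustration, not project data).
VOWELS = "aeiouyæøå"
CONS_NOT_R = "bcdfghjklmnpqstvwxz"   # any consonant except r
LETTER = "[a-zæøå]"                  # stands in for '-' via a lookbehind

I = re.IGNORECASE

# (rule number, pattern for the left-hand side, replacement template)
RULES = [
    (1, re.compile(r"rer$", I), "re"),                                   # rer. > re
    (3, re.compile(rf"(?<={LETTER})et$", I), "ed"),                      # -et. > ed
    (4, re.compile(rf"(?<={LETTER})([{VOWELS}])r(?!\1)([{VOWELS}])", I),
        r"\1\2"),                                                        # -VrV > VV {V != V}
    (5, re.compile(rf"(?<={LETTER})([{VOWELS}])r([{CONS_NOT_R}])", I),
        r"\1\2"),                                                        # -VrC > VC {C != r}
    (6, re.compile(rf"(?<={LETTER})nd", I), "n"),                        # -nd > n
    (9, re.compile(rf"(?![cs])({LETTER})ce", I), r"\1se"),               # &ce > &se {& != c, & != s}
]


def corrupt(word):
    """Return {rule number: [misspellings]}, one misspelling per matching site."""
    result = {}
    for number, pattern, replacement in RULES:
        for match in pattern.finditer(word):
            variant = (word[:match.start()]
                       + match.expand(replacement)
                       + word[match.end():])
            result.setdefault(number, []).append(variant)
    return result


print(corrupt("sandsynligvis"))  # {6: ['sansynligvis']}
```

Applying one rule at one site at a time keeps every generated item traceable to a single rule, which is what per-rule result tables require. A word such as hierarkisk also matches rule 4 in this sketch, alongside the intended rule 5 result hierakisk.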
Errors have been generated only from the list of common words of open part-of-speech classes. There are a number of reasons for excluding the other two lists. Firstly, few words from the closed part-of-speech classes match any of the letter combinations on the left-hand side of the rules. Secondly, most of the closed-class words are quite frequent, which makes them unlikely candidates for intentional misspelling, so the errors are hardly relevant to those words. Finally, the list of loan words is basically a list of English words: since the errors tested for are all related to the way in which Danish phonology is realised in the spelling system of Danish, it makes no sense to apply the rules that generate those errors to words deriving from a different spelling system.
In general, it must be stressed that the errors tested for are only a sample of the kind that can be described as intentional errors, i.e. errors due to the writer's incorrect use of their linguistic knowledge. We have thus not looked at another central error type, namely mistypings. However, there is no reason to believe that the methodology described in this report would not be directly applicable to that class of errors as well.
The reason for this was that some of the corruption rules listed above generated too much noise, so that "false" errors had to be removed manually from the lists of misspellings. Such false errors were of two kinds: words that were not actually misspelled at all (i.e. a corruption rule unintentionally generated a different valid word), and words that did contain an error, but not of the kind the rules were intended to create. The first kind is not too problematic: ASCC checks all generated items against the base list to make sure that they are not correct words of the language, so these words would not have been taken to be misspellings if a larger input list had been used. The second kind seems more difficult to eliminate completely given the kind of corruption rules used, so a certain amount of manual checking seems unavoidable. Of the rules listed above, the second one in particular generated such a large number of false or unwanted misspellings that it was finally taken out of the sample. Results for rule 2 are therefore not shown in the tables further below.
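The check against the base list described above can be pictured as a simple set lookup. The sketch below is a hedged illustration, not ASCC code: the function name drop_valid_words and the three-word base list are assumptions made up for the example. It shows why the first kind of false error depends on the size of the base list: the corruption of mand by rule 6 gives man, which is filtered out only because man happens to be in the list.

```python
def drop_valid_words(candidates, base_list):
    """Keep only generated items that are not themselves words in the base list."""
    known = {word.lower() for word in base_list}
    return [c for c in candidates if c.lower() not in known]


# Toy base list (an assumption for illustration, not project data).
base_list = ["mand", "man", "sandsynligvis"]

# Items produced by rule 6 (-nd > n) from "mand" and "sandsynligvis".
candidates = ["man", "sansynligvis"]

print(drop_valid_words(candidates, base_list))  # ['sansynligvis']
```

A lookup of this kind cannot catch the second kind of false error, since an unintended misspelling is by definition absent from the base list.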
[3] The project has now developed a separate tool to express errors, namely the errgen program, so this problem has been solved.