4 General comments

In the light of the results obtained in this test experiment, we think the usefulness of the attributes and sub-attributes defined by the project emerges clearly. Using them as benchmarks, we have managed to elicit information about the two products' behaviour, and to show that they differ in their treatment of the Danish material.

A closer look at the results shows, as expected, that both systems do better on lexical coverage and error recognition than on suggestion adequacy. The system that scores better on lexical coverage is also the one that does less well on error coverage. It is also the product that sometimes shows idiosyncratic behaviour, so it remains to be seen whether its comparatively poor error coverage is due to imperfect linguistic knowledge or to bugs.

More work, we feel, still needs to be done on error generation. Generating errors automatically has the great advantage that systems can be tested against large samples of errors: results obtained from a large sample are statistically more significant than those obtained from a limited ad hoc word list. Even if a certain degree of noise seems unavoidable, so that the errors produced have to be checked manually, we feel the method is still a useful one. However, it ought to be complemented with information on the frequency of occurrence of the various error types in different user populations. This would make it possible to model the relative importance of the various error types in the same way as has been done for different word classes under common word coverage.
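
To make the idea of frequency-weighted error generation concrete, a minimal sketch in Python follows. It is an illustration only: the four single-edit error types, the alphabet, the function names and the weights are our assumptions for the example, not the error rules or the data used in the experiment. The lexicon filter removes only the most obvious noise (generated strings that are themselves valid words), so it does not replace manual checking.

```python
from collections import defaultdict

# Assumed alphabet for the illustration (lower-case Danish letters).
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def generate_errors(word, lexicon):
    """Return {error_type: set of misspellings} for one correct word.

    Hypothetical helper: applies every single-character insertion,
    deletion, substitution and adjacent transposition to `word`.
    """
    errors = defaultdict(set)
    for i in range(len(word) + 1):
        for c in ALPHABET:
            errors["insertion"].add(word[:i] + c + word[i:])
    for i in range(len(word)):
        errors["deletion"].add(word[:i] + word[i + 1:])
        for c in ALPHABET:
            if c != word[i]:
                errors["substitution"].add(word[:i] + c + word[i + 1:])
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            errors["transposition"].add(swapped)
    # Discard noise: empty strings and strings that are valid words.
    return {
        etype: {e for e in forms if e and e not in lexicon}
        for etype, forms in errors.items()
    }


def weighted_recognition(rates, weights):
    """Combine per-type recognition rates into one weighted score.

    `rates` maps error type to the share of errors a system flagged;
    `weights` maps error type to its assumed relative frequency in a
    given user population.
    """
    total = sum(weights[t] for t in rates)
    return sum(rates[t] * weights[t] for t in rates) / total


if __name__ == "__main__":
    lexicon = {"kat", "at", "kant"}  # toy lexicon, for illustration only
    for etype, forms in generate_errors("kat", lexicon).items():
        print(etype, len(forms))
```

With per-type recognition rates measured on the generated sample, `weighted_recognition` would let the same raw results be re-scored for different user populations simply by changing the weights, in the same spirit as the weighting of word classes under common word coverage.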