
4 Discussion


4.1 Tools for Test Suites for Spelling Checkers

In this section we describe a tool for the creation of test suites for spelling checkers. The tool is called errgen.

4.1.1 Testing the functionality of a spelling checker

In order to perform repeatable and large-scale tests for the functionality attribute of spelling checkers, systematically constructed test suites are needed. Since the spelling checkers under consideration in this project check words in isolation (i.e., they do not take context into account), an appropriate form for a test suite is a list of words.

The functionality attribute for spelling checkers is composed of three sub-attributes:


* recall, i.e., the degree to which correct words are accepted by the checker;


* precision, i.e., the degree to which incorrect words are rejected;


* suggestion adequacy, i.e., the degree to which, on rejecting a word, the checker provides a good suggestion for replacement.

To test for these, three kinds of test suites are used in TEMAA.

A test suite for recall is a list of correct words. These can be subdivided into lists of specific types, e.g., based on word frequency in the language (or in some text type), or special sub-vocabularies like technical terminology. Testing means checking whether all words in the list are accepted by the spelling checker. We have not created any specific tools for the construction or maintenance of such lists.

A test suite for precision is a list of incorrect words, representing some type of spelling error. One such type is a typing error, e.g., hitting a key next to the intended key. For TEMAA, we have created a tool for generating such lists from lists of correct words.

A test suite for suggestion adequacy is a list of pairs, each pair consisting of a correct word W and a non-correct word NW. The idea is that, if NW is derived from W by a realistic spelling error, W itself is the best suggestion for replacing NW. This simulates the case of a human author who intends to write W but writes NW instead. The tool that generates suites for precision can, as a side-effect, also generate suites for suggestion adequacy.
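To make the three measures concrete, the following minimal sketch (in Perl, the language errgen itself is written in) shows how they could be computed from per-word test outcomes. The data layout and subroutine names are ours, purely for illustration, and do not correspond to any TEMAA program.

use strict;
use warnings;

# Recall: fraction of correct words accepted by the checker.
sub recall {
    my @accepted = @_;                      # one 0/1 flag per correct word
    return @accepted ? (grep { $_ } @accepted) / @accepted : 0;
}

# Precision: fraction of incorrect words rejected by the checker.
sub precision {
    my @rejected = @_;                      # one 0/1 flag per non-word
    return @rejected ? (grep { $_ } @rejected) / @rejected : 0;
}

# Suggestion adequacy: fraction of non-words NW for which the original
# word W appears among the checker's replacement suggestions for NW.
sub suggestion_adequacy {
    my @pairs = @_;                         # each pair: [W, [suggestions for NW]]
    my $good  = grep { my ($w, $s) = @$_; grep { $_ eq $w } @$s } @pairs;
    return @pairs ? $good / @pairs : 0;
}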

4.1.2 Functionality of errgen

Errgen is a program for the generation of test suites for testing the precision and suggestion adequacy of spelling checkers. It is written in Perl 4 and works on various platforms, including Unix workstations and PCs, provided a Perl interpreter is installed.

The idea is that the errgen user provides a set of rules, where each rule represents a type of spelling mistake.

One selected rule is applied to an input list of words, provided by the user. Errgen applies the rule to each word, and produces a non-word according to the rule if one exists. It also creates a reference from each non-word to its `source' (the word from which it was formed), for use in suggestion adequacy tests.

The user may provide a separate list of correct words that is used as a filter on the output of errgen. For example, if the input list is just a list of legal terminology, the filter may contain general vocabulary as well. In this way, the user can make sure that the non-word list does not contain correct words (which would damage the precision and suggestion adequacy tests).

4.1.3 Rules

The basic idea of the program is that its user can write rules that simulate realistic spelling mistakes. A rule is a collection of substitute statements in Perl style. The example below shows a rule to replace `t' by `+'. The first and last lines are delimiters; the second line is a comment; the third line is the (only) sub-rule:

Rule 1

# simple case: each t becomes +

s/t/+/

End

The syntax for sub-rules is that of Perl substitute expressions, i.e., expressions of the form s/R1/R2/, where R1 is a regular expression and R2 a replacement string; these may contain backreferences \1, \2, ..., and the special Perl variables $1, $2, ..., as described in Wall & Schwartz (1990), pp. 24-29.

The rule writer can define variables to generalize over e.g. vowels and use these in the regular expressions. Example:

Declaration

$Vowel="[aeiou]"

End

Such a variable can be used in subsequent rules, e.g.:

Rule 2

# Shows how you can use a declared variable

s/$Vowel/+/

End

The above rule will replace a vowel by `+'.

Example for use of Perl variables:

Rule 3

# double some character

s/(.)/$1$1/

End

Regular expressions are a well-known formalism; as far as we can judge on the basis of TEMAA, the formalism is well suited to simulating spelling mistakes. The rule sets actually used are documented in the section and appendix on functionality testing for Italian.
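As a further illustration, a hypothetical rule in the same format could simulate the adjacent-key mis-typing mentioned in section 4.1.1. The rule below is purely an example and was not part of the rule sets actually used; the key pair (`e' typed as its QWERTY neighbour `r') is chosen arbitrarily:

Rule 4

# hypothetical: `e' mis-typed as its QWERTY neighbour `r'

s/e/r/

End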

4.1.4 Errgen operation

Input to errgen:


* a rule file


* a rule name


* an input word list and


* optionally, a filter

Just one rule is applied per invocation of the program; the user specifies it by name. Let us call this the selected rule.

Errgen will read the rule file up to and including the selected rule. It will read all the declarations preceding the selected rule but no declarations following the selected rule in the file.

If a later declaration assigns to the same variable as a previous declaration, the later one is used (as long as it precedes the selected rule).

The selected rule, taking into account the preceding declarations, is applied to each item of the input list.

Per input item, this is done in four steps:


* all matching sub-rules are collected;


* each of these is applied to the current item in all possible ways, adding the results to a result pool;


* if errgen is in checking mode, the filter is applied to the result pool;


* of the remaining results, one is chosen randomly to be the final result.

If no sub-rules are applicable or all results have been eliminated by the filter, there is no result.

If there is a result, the current input item is written to the output source items file and the corruption result to the output error list. The output source file is made for suggestion checking. It can also be compared with the input list (e.g., using diff) to see which items produced no result.

A note on speed: errgen does not assume that the filter is a small file and consequently does not load it into memory. This means that the filter is re-read from file for each input item, which makes checking mode slow.
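The per-item procedure can be summarised in the following sketch. The subroutine and variable names are ours, not errgen's internals, and for brevity the filter is shown as an in-memory hash, although, as just noted, the real program re-reads it from file for each item.

# Sketch of the per-item procedure; names are ours, not errgen's internals.
sub corrupt_item {
    my ($word, $subrules, $filter, $checking) = @_;
    my @pool;
    for my $apply (@$subrules) {                 # each matching sub-rule
        push @pool, $apply->($word);             # apply it in all possible ways
    }
    @pool = grep { !$filter->{$_} } @pool if $checking;   # drop correct words
    return undef unless @pool;                   # no applicable sub-rule, or all filtered out
    return $pool[int rand @pool];                # choose one result at random
}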

4.1.5 Software deliverable

The errgen program is available as part of TEMAA deliverable D17, together with an example rules file and a users' manual.

4.2 Development of test materials

4.2.1 Test materials are word lists

To test the linguistic functionality of a spelling checker, we have developed various test materials. All spelling checkers we know of perform checking at the word level: words are either accepted as valid words of the language or flagged as errors, regardless of the context in which they occur. In other words, the context around the word to be checked is not taken into consideration. To identify and correct a valid word used in the wrong syntactic context, a grammar checker is needed rather than a spelling checker. Therefore, the test materials used in TEMAA are lists of words, either simple or compound.

Recall that we distinguish two kinds of word lists: lists of correct words, which we call base lists, and lists of corrupted ones, which we call error lists. Base lists are used to test a spelling checker's performance on the various lexical coverage sub-attributes. Error lists are used to test the checker's error coverage and suggestion adequacy.

4.2.2 Methods for constructing base lists

Different methods can be envisaged for constructing base lists for the various lexical coverage sub-attributes, e.g., frequency word lists, lemmatised lists taken from existing published dictionaries, or manually or automatically constructed samples.

TEMAA has opted for frequency word lists where possible, i.e., where such lists were available or where it was possible to construct them within the project. We were interested in lists of inflected forms, as words in running text occur as inflected full forms, and not all forms of the same lexeme are equally frequent. Furthermore, we wanted to be able to distinguish between different degrees of frequency when testing the lexical coverage of the various domains. It is very important for a spelling checker to recognise the most frequently used words of the domain correctly, in order not to bother the user with too many false flaggings. On the other hand, it may also be of particular interest to know how well a spelling checker performs at correcting infrequent words, which are the ones a user is probably most likely to misspell.

The validity of the test results obtained with a frequency word list as input, however, depends on the quality of the list, i.e., on its representativeness and correctness. The issue of correctness is particularly relevant because of the purpose the list is intended to serve. Thus, base lists must have been proof-read to remove systematic typos and spelling errors. Sporadic errors, of course, are difficult to eradicate, but precisely because of their sporadic nature they will only appear among the most infrequent words. Examples of systematic and apparently conscious deviance from the spelling norm on the part of an author, on the other hand, should be kept in the list as possible variants. Representativeness of the input list is necessary to ensure that the results reflect coverage of the intended domain, and not something else. The representativeness of a frequency word list depends on the way in which the corpus it is built from has been put together. Criteria to be met are variation on the one hand, and homogeneity on the other. The former is needed to avoid the systematic influence of individual texts or themes on the overall lexical material. The latter ensures that the characteristics of the corpus can be generalised to texts of the same type and domain.

In TEMAA, frequency word lists are used for example to test a system's coverage of the standard vocabulary of the language. For Danish, such lists have been identified in Maegaard & Ruus (1986). This is a collection of lists derived from five different corpora representative of different text types, namely fiction for adults, children's fiction, newspapers, magazines and professional publications. Together, the five corpora represent the most frequent standard vocabulary of modern Danish. The corpora used to derive the word lists are made up of randomly chosen text samples of 250 words each, to ensure reasonable variation. The decision to have five separate corpora for distinct text types rather than a unified corpus of texts in modern Danish is motivated by the homogeneity criterion. In fact, the most frequent content words are different in the different lists. Additionally, still to ensure homogeneity, publications were selected from a relatively short period, namely 1970-1974. More details on the principles used to build the corpora are given in Maegaard & Ruus (1987).

One of the lists in the collection has been used for actual testing, namely the list covering the domain of general interest magazines[1]. The size is approximately 6,800 words, covering around 82% of the corpus. The list was divided into three frequency classes, as shown in the following table (figures have been rounded up):

class interval        relative frequency      relative cumulative
(number of words)     of each class (%)       frequency (%)

1-189                 48                      48
190-1732              21                      69
1733-6780             13                      82

The results of testing are provided in two different ways, either as simple percentage scores showing the proportion of words recognised by the checker for each class, or as a weighted mean score showing the overall performance for the list as a whole. The weights used correspond to the relative frequencies of each class.
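For example, with invented acceptance rates for each class, the weighted mean could be computed as in the following sketch; only the weights (48, 21, 13) are taken from the table above, the per-class rates are illustrative figures.

use strict;
use warnings;

# Weighted mean coverage score; acceptance rates are invented figures.
my @classes = (
    { weight => 48, accepted => 0.99 },   # ranks 1-189
    { weight => 21, accepted => 0.97 },   # ranks 190-1732
    { weight => 13, accepted => 0.92 },   # ranks 1733-6780
);
my ($sum, $wsum) = (0, 0);
for my $c (@classes) {
    $sum  += $c->{weight} * $c->{accepted};
    $wsum += $c->{weight};
}
printf "weighted mean coverage: %.3f\n", $sum / $wsum;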

For Italian, ISSCO has constructed a frequency word list on the basis of ANSA news wire bulletins dated from January to the end of April 1995. The corpus was filtered to render accented characters in the ISO 8859-1 character set; in order to avoid noise in the frequency counts from proper names, acronyms and so on, capitalised and all-uppercase words were deleted. The list consists of 16,530 words; low-frequency words account for about 50%. The list can be considered representative of current newspaper style and language. As it was derived from published material, the number of errors or mis-typings is probably minimal and confined to the lowest frequency category. Because any manual correction might have introduced other undesirable errors, we automatically sorted out capitalised words and figures and deleted them from the list.

Moreover, ISSCO was given a larger frequency list from Istituto di Linguistica Computazionale in Pisa, consisting of 244,191 words. Since this list is composed of much broader material, we restricted the word list used for testing to words that had a relatively high frequency (over 20 instances found in the corpus), thus limiting the list to 33,377 words in total.
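As an illustration of the frequency cut-off, a filter of the following kind could be used; the file name and the whitespace-separated "frequency word" format are assumptions made for the sketch, not a description of the Pisa list itself.

use strict;
use warnings;

# Keep only words seen more than 20 times (format and file name assumed).
open my $in, '<', 'frequency_list.txt' or die "cannot open list: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($freq, $word) = split /\s+/, $line;
    print "$word\n" if defined $word && $freq > 20;
}
close $in;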

We believe the frequency word list approach would also be fruitful for technical sublanguages: for each technical domain of interest to the user, a separate word list should be used to test the spelling checker's lexical coverage.

In one particular case, namely words belonging to closed part-of-speech classes, exhaustive word lists can be constructed. We have provided an example of such a list for Danish: a list of determiners, pronouns, prepositions, conjunctions and non-productive adverbs. The list was constructed automatically by extracting the relevant word forms from the electronic version of the Danish spelling dictionary (Retskrivningsordbogen 1986).

For other sub-attributes of lexical coverage, including multi-word units, closed sub-vocabularies (e.g., foreign loans, archaisms, slang, obscenities), proper names, acronyms, initialisms, abbreviations, and symbols, the project had foreseen using limited samples, constructed manually or semi-manually. We have actually constructed three different lists to treat coverage of loan words and of proper names.

For Danish, we have a list of loan words from the technical domain of computer science, in which a large number of English terms are used. This is a manually-constructed distributed sample of 95 words extracted from a printed dictionary of computer terms (Politikens Dataleksikon 1986). The dictionary includes both Danish and English words. To obtain a sample of 100 terms[2] from a dictionary of approximately 500 pages, we took the first English term on every fifth page. Since we could not always find an English term on the page chosen, we had to repeat the procedure by taking a term from every 17th page, and again from every 20th page. In the end, we had a sample of 100 terms, which we ran through a spelling checker to remove spelling errors. For five of the terms chosen, the spelling checker suggested an alternative spelling. To make sure that we did not introduce unconventional spellings in the sample, we removed those five forms.

For Italian, two lists of proper names have been constructed: one containing the names of all Italian provincial capitals (capoluoghi di provincia), 99 cities in total, and the other containing first names, 593 in total. Two further lists were derived: from the city list, a list of the adjectives referring to the inhabitants of those cities, and another list of adjectives referring to people's age or to ordinal numbers written out in full (these are represented by different forms in Italian). Although the age and ordinal list is not exhaustive, the city-based list is complete, with all inflections in gender and number.

The English base lists are (i) a list of 288 closed-class words drawn from the Alvey Grammar 3rd release (lexicon file d.le), (ii) a list of 9,532 general open-class words derived from the British National Corpus ([BNC]) via word/part-of-speech frequency lists compiled by Adam Kilgarriff of the University of Brighton, kindly made available by anonymous ftp [AK], and (iii) a list of 32,250 technical words drawn from the European Corpus Initiative CD-ROM (ECI).

The project had also foreseen the use of domain-related grammars to generate samples relative to other coverage sub-attributes, e.g., numbers, units of measure, and dates. An example of such a test suite has been provided for Italian in the form of the list of adjectives referring to people's age and of ordinal numbers written out in full.

In conclusion, the collection of word lists provided ranges over a broad selection of coverage sub-attributes and constitutes, in our opinion, a good exemplification of the methods set up by the project.

4.2.3 Testing base lists

The procedures used to test coverage on the basis of the available base lists, and the problems encountered, are described in detail in the two Appendices to this report. Here, we shall only indicate one particular problem, and summarise the general conclusions.

The problem regards compound terms and multi-word units in general. Some of the items in the lists were expressions consisting of more than one typographical word. In particular, the Danish loan word list contains a large number of English compounds. Multi-word units are a problem in that spelling checkers treat them simply as sequences of individual words: each component word is checked independently of the others in the compound. Consequently, ASCC cannot treat words consisting of more than one typographical unit. Therefore, all the multi-word units occurring in our lists had to be split up, so that each word was written on a separate line. A similar problem was encountered with compounds in which the component nouns are linked by a hyphen, and with Italian words ending in an apostrophe.

Apart from this issue, the lists have served their purpose, and it has been possible to elicit results showing different coverage performances in different systems.

4.2.4 Methods for constructing error lists

Lists of corrupted words, error lists, are generated automatically by applying so-called corruption rules to our base lists. Corruption rules are intended to simulate typing or spelling errors. A corruption rule applies to a correct word and produces an incorrect one. At a high level, at least the following error types can be distinguished:


* mis-typings (errors of performance),


* misspellings (errors of competence),


* medium-related errors.

Common practice in most spelling checker evaluations (cf Green & Hendry 1993) is to use limited samples of particularly tricky words, i.e., words which the evaluator supposes will be hard for the checker to spot for different reasons. Automation of the error generation process and of spelling checker testing, on the other hand, makes it possible to check spelling checker behaviour against large error samples. This has been done in TEMAA by designing language-specific corruption rules and applying them to some of the word lists used to test lexical coverage.

The project has concentrated on mis-typings and misspellings. In particular, for Italian we have tested how well spelling checkers recognised and corrected two kinds of mis-typings (consonant doubling and undoubling). In addition, different types of misspellings were tested for both Italian and Danish. The various error types, as well as the results obtained, are commented on in detail in the Appendices.

A few general comments will be made here. The first one relates to the approach to suggestion adequacy taken in the project. Suggestion adequacy is tested by checking whether one of the suggestions offered by the checker matches the original correct word, and by taking the position of the correct suggestion in the suggestion menu into account. However, the checker may sometimes provide a suggestion which does not match the original input, but is nevertheless a plausible replacement for the error in question. Taking this into account would make automation of the testing considerably more complex.

The second issue concerns the accuracy of our corruption rules. It is not always possible to state the transformation needed to generate a certain error with enough precision, and undesired transformations are sometimes carried out. Two types of `false' errors may in fact be produced: on the one hand, words that are not actually misspelled at all (i.e., the corruption rules unintentionally generate a different valid word); on the other, words that do contain an error, but not of the kind the rules are intended to create. The first type of false error is not too problematic: the errgen program checks all the generated items against a base list to make sure that they are not correct words of the language, and provided that the base list is large enough, valid words will thus be filtered out. The second kind, on the other hand, is more difficult to eliminate completely, because for a particular language it might not always be possible to restrict narrowly the context in which the error should be inserted. Therefore, a certain amount of manual checking seems unavoidable.

Finally, an aspect that has not been treated by the project is how the various error types relate to different user types. If frequency information relative to the various error types were available, it would add to the significance of the results obtained.

4.3 Testing

In the course of the project we have realized the need for an automated testbed. Such software has been implemented (ASCC), and improved (errgen) to a high degree of satisfaction.

The main precaution for using ASCC in any PC configuration is that keyboard definitions and operating systems can vary greatly across the localized versions of the same product. ASCC works by emulating a user hitting a key on the keyboard, so a certain amount of work is required to adapt ASCC to a particular environment.

As already mentioned, testing on two spelling checkers was carried out for two languages, namely Danish and Italian. The lists of words were structured with one word per line. To our knowledge, spelling checkers identify the object to be verified as a string of characters between certain delimiters (e.g., spaces, some punctuation marks, end of line). The lists were fed to the checkers both as correct lists and as error lists, the latter containing one error per word introduced using the facilities developed during the course of the project (ASCC and errgen).

The correct word lists were used to test the coverage of each spelling checker. In other words, for each valid word in a given language, the checker should accept that word (recall).

On the other hand, we needed to verify that words that are not correct in that same language are rejected (precision). This second test is performed by corrupting the base lists with one error per word, creating a number of error-specific error lists per language.

The lists of wrong words were derived according to a taxonomy of frequent errors found for each language. A few words should be said about the reason for generating errors from correct lists, which might seem a convoluted way of proceeding. After all, humans make enough mistakes that we should not need to create error lists on purpose. Why not extract invalid words from un-edited text?

Although the availability of written corpora has grown over the past few years, access to an un-edited corpus is not easy. People or companies that could provide un-edited texts show an understandable reluctance to distribute them, precisely because they have not been proof-read. Moreover, errors found in un-edited text might not lend themselves to automatic compilation, especially if one wants to know for certain what the intended word was. For instance, it would seem safe to assume that the word "teh" is a mis-typing of "the", but we could not be certain about many other mis-typings, especially once the words had been sorted into an alphabetical word list, out of their original context. Finally, un-edited texts would not contain only mis-typings. A great deal of manual work would have been necessary had we chosen this way of proceeding.

More central to the testbed idea is the fact that we wanted a list of errors prepared in a controlled way, i.e., one where we control the underlying manipulations that change a correct word into an incorrect one. This way of proceeding is repeatable for each language and for each error type that might subsequently be identified (for instance, a new keyboard layout).

We have therefore focussed our work around the following reasoning.

First of all, we wanted lists of errors in order to test the precision of the spelling checkers, i.e., how many invalid words of a language each checker rejects. To measure this, we needed to give as input a list containing only incorrect words and then count the number of words rejected. As with coverage testing, the lists contain only one word per line.

Secondly, we wanted to measure the suggestion adequacy of the spelling checkers, i.e., given an incorrect word, whether and where the word originally intended by the user appears in the suggestion list. The most reliable way of doing this is to know what the intended word was, which we do precisely because we were the ones to insert the error.

In fact, by providing ASCC with both the error and the base lists of words, ASCC can check at which position in the suggestion list offered by the spelling checker the intended word appears. ASCC writes separate report files, one for coverage and one for suggestion adequacy. In addition, ASCC writes out the lists of words that each checker offered as suggestions. Most of the measures in the test reports in the appendices are based on these two report files.
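To make the position-based scoring concrete, the following sketch assigns a score to a single error from the rank at which the intended word appears in the suggestion list. The reciprocal-rank scheme and the subroutine name are our own illustrative choices, not necessarily the measures used in the appendices.

use strict;
use warnings;

# Score one error from the rank of the intended word in the suggestions:
# 1 for first place, 1/2 for second, and so on; 0 if it is never offered.
sub suggestion_score {
    my ($intended, @suggestions) = @_;
    for my $i (0 .. $#suggestions) {
        return 1 / ($i + 1) if $suggestions[$i] eq $intended;
    }
    return 0;
}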

4.4 User profiles as a compositional tree of weighted attribute value specifications

In the framework the project has produced, User Profiles (which are effectively statements of requirements) are directly expressed in terms of constraints on values of measured attributes of the systems under test. There are three dimensions which define a user profile: attribute choice; specification of required values for an attribute; and specification of the importance (in the hierarchical composition of the overall evaluation) of the attribute.

The evaluator's tool ET exemplifies this. In checklist mode, attributes can be added, removed, or their measure types modified. In user profiling mode, ET produces a tree whose root is the `utility' value for a single evaluation, its immediate children are quality characteristics, and their children are attributes, which may further decompose into sub-attributes. Each node has a specification, or requirement, plus a weight corresponding to the importance of the attribute, each of which may be modified by the user. Sibling nodes have their weights normalised to maintain the proportional contributions of their values to the parent node's value.

This approach offers a precise definition of how a composite figure for the value of a particular system to a particular user may be determined from the measurements that are made. However, such precision results in accuracy only where the attributes, their specifications, weights and their combinations accurately reflect requirements.
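The compositional scheme can be pictured as in the following sketch, where a leaf holds a measured attribute value and an inner node combines its children by normalised weights. The data structure and the figures are our own illustration of the idea, not ET's actual implementation.

use strict;
use warnings;

# A leaf holds a measured value; an inner node combines its children
# by weights normalised to sum to one (our illustration, not ET's code).
sub node_value {
    my ($node) = @_;
    return $node->{value} if exists $node->{value};           # leaf: measured attribute
    my $total = 0;
    $total += $_->{weight} for @{ $node->{children} };
    my $v = 0;
    $v += ($_->{weight} / $total) * node_value($_) for @{ $node->{children} };
    return $v;
}

# Invented example: utility <- one quality characteristic plus one attribute.
my $profile = {
    children => [
        { weight => 2, children => [
            { weight => 3, value => 0.95 },                   # e.g. lexical coverage
            { weight => 1, value => 0.60 },                   # e.g. suggestion adequacy
        ] },
        { weight => 1, value => 0.80 },                       # another characteristic
    ],
};
printf "utility: %.3f\n", node_value($profile);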

The project has not been able to produce much in the way of realistic instances of weights or specifications, or even, to some extent, of attribute choice. This may be partly because the user survey that was carried out was not heavily resourced and, since it took place at the beginning of the project, could not be informed by the more focussed questions that have arisen from the subsequent work. But a question mark must remain over how usable such an approach is until it has been applied in more detailed, focussed empirical work on defining user profiles. In future work, it would be desirable to move user survey tasks to after the development of a detailed framework of what is to be asked or, indeed, as software design best practice suggests, to integrate user survey activity into the whole process.

We have identified a number of attributes which implicitly characterise user groups. The language of the text is the most obvious. In addition, we have identified tendencies towards certain kinds of typing and phonetic errors on the basis of published research on spelling errors (e.g., following Green and Hendry (1993)), and the effect of particular upstream paths from which text is derived, such as OCR. These factors might in principle be reflected in the assignment of weights to particular attributes.

For instance, Green and Hendry (1993) report that Japanese first language (L1) writers of English make different phonetically based errors from other L1 groups -- this might result in the need for different attributes, not just different weights. However, this does not invalidate the approach; attributes may simply have weight zero for some user types. It is important to see the attribute structure developed so far as an approximate template, which can be modified as required (as ET indeed permits). As such, the attributes can be thought of as a Generic Task Model for the task of carrying out an evaluation in the area of authoring aids, available for modification to meet particular future evaluation requirements.

Most of the attribute choices we have made, and of course the definition of the tasks themselves, are not backed up by solid empirical evidence of the type that might be considered necessary to support the nomination of performance factors that truly determine a system's conformance to requirements (cf Sparck Jones and Galliers 1996); to do this requires extensive experimental work varying all possible factors independently. This shortcoming does not mean the attributes are useless, however; it can be seen as the result of taking a different approach, similar to that of requirements capture for software systems, to the question of how to justify the attribute choices, specifications, and weightings. When a requirements engineer is preparing a specification for a piece of software, the items in the specification are not in general arrived at by the sort of rigorous combinatorial processes mentioned, but by a complex human process (which may be supported by various analysis tools and methods) by means of which requirements couched in the vocabulary of the user are clarified, made consistent where possible, and translated into specifications for system behaviour.

It would be valuable to state explicitly the methods used to determine or justify attributes, specifications, and weightings, but this has mostly fallen outside the bounds of the project's work so far: although it is fully compatible with the framework, it is a major undertaking in its own right. For example, we have used the work of other people in identifying and establishing the relative frequency of error types in particular L1/L2 pairs. Explicit representation and recording of the methods they used would allow easy customisation of attribute lists, specifications, and weightings (for example, of the requirements on error types and frequencies for new L1/L2 pairs), in the same way as the libraries of methods for measuring system performance allow easy customisation.

This is a problem both of justification and of communication. The framework as exemplified by the ET program, with its fully compositional approach to utility, does not provide any support for a user of such an evaluation who is not able to supply specifications and weights to reflect their requirements at the level of the attributes we define. However, one of the problems with a technology evaluation targeted, for example, at the technology purchaser, as might well be the case for adequacy evaluation, is that the user does not know how to express their requirements at this level. For instance, without some sort of guidelines, a user has no way to know what relative weight to place on the correction rates for simple typos versus those for phonetic errors. This is a matter for future work, however.

So, to sum up, for a number of reasons it may be necessary to leave the final decision on User Profiles (how to choose and combine information about a system to get a utility value) to the user of the evaluation, but the framework as currently instantiated provides no support for the process of self-diagnosis that this requires. This is connected with the next section, which deals with the question of how an evaluation should be thought about and presented.

4.5 Evaluation Types

In the various discussions of the evaluation process which have taken place around this project (e.g., in the EAGLES Evaluation working group and report), the question of what an evaluation is has often been unclear. Two main alternatives have been discussed: the Consumer Report Paradigm (CRP), and the functional definition of evaluation as a function from user requirements and objects of evaluation to values, with its implied compositional instantiation, as in ET. These seem on the face of it to be incompatible approaches. The functional approach, exemplified by the ET tool, allows for the summing of all attribute values according to specific requirement weightings. The essence of the Consumer Report is the reporting of values for a given set of attributes for the type of system under test in such a way as to supposedly satisfy the needs for information of all relevant user groups in a single report; explicit user profile types may be presented separately, and indeed recommendations made separately on the most suitable systems for each such type, but there is no explicit composition function by which the values of individual attributes are weighted and combined for each user type, and then (presumably) ranked to give the recommendations.

The two views are not, however, truly incompatible. The classic Consumer Report grid can be thought of as (a) a partial representation of a set of compositional user profiles, or (b) as a representation of a set of partial compositional user profiles.

On view (a), a CRP would be a partial representation of a set of compositional user profiles if a fully instantiated set of user profiles existed, but were used only to ascertain the relevant attributes, such that the CRP reported the values for all attributes reportable in any of the user profiles.

On view (b), a CRP would be a representation of a set of partial compositional user profiles if there was not enough information to determine all user compositional factors (see the discussion in the previous section for why this might be so), and hence it was necessary to report attributes down to the level at which compositionality is true (perhaps right down to the smallest measurement taken in principle, since the line between attributes and partial results combined by methods is drawn purely by the definition of what level of result is considered to be reportable).

It would be interesting to further formalise the idea of the CRP as a combination of the results grid for reportable attributes, plus a set of user profile descriptions, in such a way as to bring out the equivalences between this and the compositional view. Such an account would treat the results grid as embodying the results of composition up to the point of the reportable attributes (i.e., those combinations that currently fall under the heading of methods in the framework). The user profile descriptions would then be treated in terms of the further processes of value combination on the part of the user of the evaluation that the user profiles must be able to support, in allowing a user of the evaluation to map from characteristics they understand to weighting judgements about the reportable attributes (whose significance to their utility requirements may not be apparent to them without such guidance). A user profile description, then, might correspond to a set of plausible mappings from user-level characteristics (`quality of resultant text very important; first language Japanese...') to weighting judgements on reportable attributes (`recall more important than precision; error type XXX higher weight than other phonetic errors...'). Such mappings could potentially be represented in something like an expert system, supporting an interface which asks questions about the user-level characteristics, and on the basis of the answers, assigns the appropriate weights to produce a more completely compositional evaluation.

4.6 General notes on using TEMAA tools

As noted earlier, a practical testbed performs four tasks:

1. creation/maintenance of testing materials and tools

2. applying testing tools and materials to objects of evaluation

3. creation and maintenance of feature checklists and user profiles

4. performing evaluations

In the course of the TEMAA project, some tools have been constructed for each of these slots. It has to be noted, though, that these tools are mainly illustrations of the possibilities for automation in evaluation. That is, they are not very efficient or user-friendly. TEMAA is a project on principles of evaluation methods, not an enterprise in software production.

In general, tools like the ones built here are not intended to be used by the general end users of, say, spelling checkers. Rather, they should be used by specialized evaluators, e.g., in consumers' organizations.

The errgen tool was used for test suite generation in some of the languages tested (it appeared too late for Danish), and has caused no particular problems. The task to be performed by this tool is quite straightforward. The intelligence lies in the definition of the `corruption rules'; see the testing reports.

The tool that was used most heavily in TEMAA is ASCC. That functionality testing can be automated to such a large degree is a consequence of the somewhat fortunate nature of the task area of spelling checking. ASCC starts from a simulation of an author making mistakes (using errgen) and simply runs spelling checkers on the result; the scoring relies on the fact that a spelling checker can actually replace a mistake by a correction. Therefore, scoring is simply a matter of comparing the input and output of a spelling checker run. Doing similar things with, e.g., a grammar checker would entail much more sophisticated communication between the grammar checker and the simulation program. It is not clear whether equivalents to ASCC will be feasible for many other types of objects of evaluation.

ASCC enables the user to do a serious functionality test on a spelling checker; however, this is not to say that doing this is a very user-friendly affair.

First of all, ASCC is quite slow, since it actually simulates a spelling checker run on a given input text. As the spelling checkers tested are written to be used interactively, there is no reason for them to be fast. In normal operation, a spelling checker will not be presented with a list of, say, 100,000 different words that are all mis-spelled.

Second, ASCC communicates with a spelling checker by running it and sending it commands in the form of keyboard key sequences. For example, to open a file in ClarisWorks, it will send the string "!FO" to the program (where !F means `ALT+F'). This causes several problems for ASCC users, as it makes ASCC dependent on the local computer's keyboard definition, as well as on the specific version of the spelling checker (or of the text processor embodying it).

Nevertheless, though ASCC is unfriendly, it enables an evaluation technology that supports serious, large-scale testing of a class of objects. It would be worthwhile to try similar techniques in other areas.

With respect to task 3, maintenance of feature checklists and user profiles, the ET program was delivered. This program arrived somewhat later in the project and was not used as much as ASCC. However, it provides some better prospects for the future: unlike ASCC it is quite user-friendly and entirely general over kinds of objects of evaluation. There are still some conceptual questions (cf the section on user profiles, above), but they are solvable in principle. This program should be a useful basis for evaluation work in the future since it visualizes reasonably well the notion of a user category.

Task 4: Integrating and reporting the results of a particular evaluation is the function of a program written in the CLIPS Object Oriented Language, which contains representations of the main classes in the project's formalisation of user profiles, requirements, methods and the resulting system values. Data is drawn from the ET and PTB programs and presented as hypertext reports or in textual form.

The program PTB is a front end that integrates all the other programs. It was made somewhat late in the project and has not been used on a large scale. For users outside of TEMAA it provides a friendly interface to the other parts of the evaluation software.

[1] The following magazines were used: Haven, Folkeskolen, Arbejdsgiveren, Sygeplejersken, Ingeniørens Ugeblad, Dansk Jagt.

[2] We use the word "term" to indicate that the list contains simple words as well as compounds.

