TEMAA Final Report - 3 Case studies

The most general level at which spelling checkers are described in TEMAA is that of quality characteristics. The project has adapted the ISO approach by looking at the way in which the ISO quality characteristics can be influenced by the different roles that can be distinguished in a basic spelling checking task set-up. This basic task set-up includes at least the following roles:

the customer, who is concerned with the overall quality of the system in relation to the requirements expressed by the writer, end-user and reader of a certain set-up, as well as with other considerations such as cost.

Each of these roles must be seen not as an individual with idiosyncratic properties and desires, but as a class. Note that some or all of these roles may be taken by the same person in a particular instance, or all may be taken by different persons.

3.1.1.1 Functionality

Evaluation of the linguistic functionality of a spelling checker is influenced by properties of several of the roles mentioned above, especially those of the writer and the end-user. Different writers may have different requirements with respect to the lexical domain the system must be able to cover, for example. Thus, for a technical writer it will be important that the system recognises technical terms as valid words of the language, whereas coverage of, say, archaic words is not so relevant. Whether the writer is a native speaker of the language of the text to be spell-checked is also an important factor in determining the system's adequacy. Writers working in their native language will be prone to making errors of a certain kind, whilst writers working in a foreign language will tend to make other types of errors. This is an acknowledged fact, and various products on the market already offer specific versions of a spelling checker for a particular language tuned to native speakers of different languages, e.g. an English spelling checker for native speakers of French.

The end-user's degree of competence in the language of the text is important when looking at how good a system is at suggesting a correct replacement for a misspelled word. For an end-user with good competence, being made aware of an error is more important than being provided with the right correction, especially if the end-user is also the writer of the text under consideration: a writer with a high level of competence in the language will typically make typing rather than spelling mistakes, and will find it easy to correct these errors once the system has spotted them[1]. For an end-user with poor spelling competence, on the other hand, being provided with the correct replacement

Functionality evaluation has constituted the focus of the project. The various functionality attributes are discussed in more detail further in this section.

3.1.1.2 Reliability

Reliability concerns the issue of whether the system always performs in the same way under stated conditions. Reliability can of course be understood in the sense of how likely it is for a system to crash under certain circumstances: in this sense, there is nothing particular to be said about spelling checkers. However, reliability can also be seen in terms of whether the system always recognises a certain error type, thus showing consistent behaviour. To give an example, in Danish it is sometimes difficult to distinguish between the present tense form and the infinitive form of the same verb (lærer/lære, 'teach'). The distinction is difficult because for certain verbs the two forms are pronounced similarly. A spelling checker which works strictly by dictionary look-up will never detect an error due to confusion between the two forms: in other words, it will be reliable in never finding an error of this kind. A slightly more sophisticated system may be able to determine that the infinitive is needed after e.g. a modal verb or the infinitive marker at. However, it will not be able to detect all the errors of this type. Consequently, the user will observe that sometimes this error is detected, sometimes not. Although the issue of reliability is an interesting one, it has not been investigated any further in the project.

3.1.1.3 Efficiency

3.1.1.4 Maintainability

3.1.1.5 Portability

According to the ISO definition, portability does not depend on the number of different platforms a product is implemented for, but is rather understood in terms of how easy it is to implement the product on different platforms. In this sense, portability is not relevant to the adequacy evaluation of existing systems except to system developers and implementors.

The issue of how easy it is to install and use the same product across several platforms, on the other hand, is relevant in the context of adequacy evaluation. It has not, however, been treated any further in the project.

3.1.1.6 Usability

3.1.1.7 Customisability

As explained earlier (Section 2.1), customisability is not an independent quality characteristic according to the ISO framework, but it is so important in the evaluation not only of spelling checkers, but of NLP systems in general, that we have chosen to treat it as a separate category. Customisability of NLP systems mainly concerns the possibilities the user has to customise the system to particular text types and preferences. For spelling checkers, this means whether and how easily the user can create and update user-defined dictionaries.

3.1.2 Reportable attributes

The ISO quality characteristics describe rather abstract aspects of a software product. To provide a more detailed view of the system, each characteristic must be broken down into a number of attributes and sub-attributes. The granularity of choice of attributes and sub-attributes is to some extent arbitrary, as it depends on the needs of specific types of users. We call an attribute of interest to a user type a reportable attribute.

For each attribute and sub-attribute, we have defined an appropriate measure, as well as a method to be used to get the measure for a concrete product. As mentioned above, depending on the characteristics of the participants in the various roles of a specific evaluation set-up, some sub-attributes may be more important than others. Weightings can therefore be attached in the Parametrisable Test Bed (PTB) to specific sub-attributes: the final assessment will then provide the customer with individual measures for each of the sub-attributes relevant to them, as well as with a weighted overall measure for each of the reportable attributes.

As mentioned above, the project has concentrated on a subset of the ISO quality characteristics, as not all of them seem relevant to adequacy evaluation, which is the main focus of the project, or especially distinctive in the context of either spelling or grammar checker evaluation.

Consequently, a serious attempt at breaking down quality characteristics into reportable attributes and sub-attributes, and at defining methods and measures relevant to each of them, has been made only for functionality, usability and customisability.

3.1.2.1 Functionality

* recall (the degree to which the checker accepts all the valid words of a language)

* suggestion adequacy (in case of invalid words, does the checker provide correct suggestions)

3.1.2.1.1 Recall

In this project, recall is defined as the degree to which a spelling checker accepts correct words of some language. That is, the checker is taken to be in essence a dictionary, and recall is seen as the completeness of this dictionary with respect to the vocabulary of the language (or of some specialized vocabulary; see below). To some, this way of using the term is confusing, as they tend to see a spelling checker as something that must react to errors, so that they would prefer to see recall as completeness with respect to the complement of the vocabulary; i.e., recall would mean the degree of rejection of non-words. We find both views equally reasonable. To avoid confusion, we use only one of them throughout this report; rather arbitrarily, we have chosen the first.

productive sub-vocabularies (e.g., numbers, units of measure, currency, dates, equations)

As can be seen, each of these sub-attributes is intended to account for the spelling checker's adequacy at recognising the vocabulary of a specific lexical domain.

3.1.2.1.2 Precision

Precision can be decomposed according to different error types, so that it can be investigated how well a system recognises specific error types. The taxonomy of errors the project has used includes the following categories:

Each of these can be further decomposed into more specific error types. In the case of mis-typings, these can be defined in terms of letter manipulations. In the case of misspellings, on the other hand, both the language of the writer and the kind of manipulation can be used as distinguishing features.

A spelling error defined by precisely one of these transformations is called a single-error item in Kukich (1992), who cites evidence showing that error items are most often single-error items, though the percentages vary.

The four types above can be broken down into a larger number of more specific error types, namely:

1. doubling: insertion of a letter X next to another X (bicycle > bicyccle) (special case of insertion)

2. singling: the opposite of doubling (winning > wining) (special case of deletion)

5. addition of a letter X next to a letter Y where X and Y are close on the keyboard (bicycle > biocycle)

6. substitution of X by Y where X and Y are close on the keyboard (bicycle > bocycle)
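The four error types that survive in the list above are simple letter manipulations, so instances can be generated mechanically from a valid word. The following Python sketch is an illustration only and is not the project's Errgen program; the function names and the deliberately partial keyboard adjacency map `ADJACENT` are assumptions made for the example.

```python
# Illustrative sketch only; not the TEMAA Errgen program.
# ADJACENT is a deliberately partial QWERTY neighbour map (assumption):
# a usable map would list the neighbours of every key.
ADJACENT = {"i": "uo", "o": "ip", "c": "xv", "e": "wr"}


def doublings(word):
    """Insert a copy of a letter next to itself (bicycle > bicyccle)."""
    return sorted({word[:i] + word[i] + word[i:] for i in range(len(word))})


def singlings(word):
    """Drop one letter of a doubled pair (winning > wining)."""
    return sorted({
        word[:i] + word[i + 1:]
        for i in range(len(word) - 1)
        if word[i] == word[i + 1]
    })


def adjacent_insertions(word):
    """Insert a keyboard neighbour before or after a letter (bicycle > biocycle)."""
    out = set()
    for i, ch in enumerate(word):
        for near in ADJACENT.get(ch, ""):
            out.add(word[:i] + near + word[i:])          # neighbour typed before ch
            out.add(word[:i + 1] + near + word[i + 1:])  # neighbour typed after ch
    return sorted(out)


def adjacent_substitutions(word):
    """Replace a letter by a keyboard neighbour (bicycle > bocycle)."""
    out = set()
    for i, ch in enumerate(word):
        for near in ADJACENT.get(ch, ""):
            out.add(word[:i] + near + word[i + 1:])
    return sorted(out)
```

Some generated strings may happen to be valid words of the language, so a real error list would still need to be filtered against a base list before use.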

All these error types have been treated by the project, in the sense that a methodology has been devised to generate instances of each error class for testing spelling checker adequacy. The list of error types may still be extended. Additional possibilities are:

A taxonomy of spelling errors will be based on language-specific criteria, at least to some extent. A first distinction, as mentioned earlier, is that between misspellings made by native speakers and those made by foreign speakers. We have mainly concentrated on native-speaker misspellings. As an example, the following taxonomy of misspellings in Danish has been used as a basis to generate error instances:

The typology has been set up on the basis of investigations of errors made by students in Danish primary and high schools (cf. Löb 1983 and Andersen et al. 1992) as well as relevant textbooks for native Danish speakers (Togeby 1989). Each type corresponds to a number of different errors (for more detail cf. Test Materials, TEMAA Deliverable 12) that can be expressed as a context-sensitive substitution of a character sequence with some other character sequence.
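A context-sensitive substitution of this kind can be represented as a rule with a left context, a target sequence, a replacement and a right context. The Python sketch below is a hypothetical illustration: the `Rule` structure, the `apply_rule` function and the example rule are invented for this purpose and are not taken from the Danish typology or from Deliverable 12.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    left: str         # regular expression that must match just before the target
    target: str       # literal character sequence to be replaced
    replacement: str  # literal character sequence written instead
    right: str        # regular expression that must match just after the target


def apply_rule(rule, word):
    """Return every variant obtained by applying the rule at exactly one position."""
    variants = []
    start = word.find(rule.target)
    while start != -1:
        end = start + len(rule.target)
        left_ok = re.search(f"(?:{rule.left})$", word[:start]) is not None
        right_ok = re.match(rule.right, word[end:]) is not None
        if left_ok and right_ok:
            variants.append(word[:start] + rule.replacement + word[end:])
        start = word.find(rule.target, start + 1)
    return variants


# Invented example rule (assumption): word-final "nd" after a vowel is written "n".
FINAL_ND = Rule(left="[aeiouyæøå]", target="nd", replacement="n", right="$")

print(apply_rule(FINAL_ND, "land"))    # the rule applies at the end of the word
print(apply_rule(FINAL_ND, "handle"))  # right context fails, so no variant
```

Keeping the contexts as regular expressions and the target as a literal string makes each rule easy to read alongside a prose description of the error type.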

3.1.2.1.3 Suggestion Adequacy

Most of today's spelling checkers do not just spot spelling errors but also suggest corrections for them. In this project, the attribute named `suggestion adequacy' is used for the evaluation of this part of the checker's functionality. Both of the checkers used as test cases perform this function.

In order to evaluate suggestion adequacy, one needs some general idea of what constitutes a `good' suggestion. In this project, we take the following view on this. A writer (including artificial writing devices like OCR) intends to write a given word W but, by mistake, writes W' (W' ≠ W). The difference between W and W' reflects some error type, as defined above. The text is then checked by some spelling checker C. If C concludes that W' is not correct, it offers a list of suggestions for changing W'. Clearly, the best suggestion would be to change W' to W. In fact, all other suggestions are equally useless. We therefore define W to be the unique correct suggestion. The way to operationalize this is to generate collections of W' from collections of W in systematic ways, reflecting types of mistakes that writers (in the general sense) tend to actually make.

In sum, suggestion adequacy is defined, in this project, in a context where each spelling mistake is derived from an intended word form, so that one can measure the degree to which suggestion lists offered by checkers contain the intended words.

3.1.2.2 Usability attributes

There is a great deal of literature in the field of Human Computer Interaction (HCI) on the usability of interfaces, and it is clear that in TEMAA we have only attempted to specify some useful best practice from this work rather than contributing much original work of our own. Our aim is to evaluate the usability of spelling checkers, which we take to mean the ease with which a given set of tasks can be accomplished by a user. For the basic operation of a checker, we take the top level task set to be:

These tasks themselves decompose into a number of sub-tasks, some of which are optional or alternatives.

To evaluate the system's usability with respect to each of the basic tasks, we use a set of distinctions introduced in Jordan et al. (1991): the differences between experienced user performance (EUP), guessability, and learnability. These dimensions, which are used instead of the distinction between novice and expert, in one sense flesh out the conventional wisdom that it should be easy to do simple things and possible to do difficult things; they add to this some consideration of how the system supports the development of the user's expertise.

For experienced users whose typing is in general fast and accurate, what we are evaluating here is in a sense the potential usability of the system, the performance level at which the learning curve levels off to approach an asymptote. This applies to all actual functions of the interface.

It is not clear that user manuals are very relevant to spelling checkers, which rely heavily on guessability. Guessability is based largely on the similarity of a system's interface or operations to those of familiar packages. Thus a Macintosh user may be better able to guess how to use a new program that employs a dialogue format following the Macintosh user interface guidelines.

However, it turns out that having used any spreadsheet program, to take another example, is a basis for guessability, as it imparts some understanding of the kind of operations that must be offered in some way in a new instance of such a type, i.e., domain expertise makes a difference to what the user brings to guessability. Guessability may also be a function of the number of available options, so that, for instance, a system with a very rich set of functions will be less guessable (and may have lower EUP) than a simpler one that happens to fit the experienced user's needs and also is guessable.

Learnability describes the nature of the learning curve by which novice users can approach the asymptotic performance of EUP. One criticism levelled at `intuitive' interfaces, for example, is that it is hard to get beyond what is intuitive; the user never has occasion to learn that there may be faster and better ways to do tasks. The usability of on-line help or printed manuals is one factor in this, as is the number of options available (with too many options the user may find it harder to learn an optimum selection for his or her own tasks).

Our consideration of the dimensions of usability suggests that we can think of three questions to ask about each task or subtask:

3.1.2.3 Customisability attributes

For each of the sub-tasks given above, we have a sub-attribute for the basic facts about what is possible, in terms of whether and how a task is supported, for example what number of dictionaries can be accessed simultaneously during spelling checking.

There appears to be an important distinction between what can be done from within the checking program and what cannot. We define a sub-attribute with related measure for some subtasks in terms of whether a particular function is available from within the checking program or outside it; we feel that both options could be desirable in order to allow the user to create a customised dictionary during the checking process, but also to be able to predefine a dictionary, before the checker is ever run.

Additionally, we feel that information indicating how easy it is to create, modify and update user dictionaries would be useful to the user. Clearly notions of ease of use can depend on particular users' usual modes of working; however, we feel that the measures we have chosen (see Section 3.1.3.5) can be reported on and will enable the customer to form their own idea of how easy it is to customise the checker.

3.1.3 Evaluation measures for spelling checkers

3.1.3.1 Recall measures

As mentioned earlier, recall is defined as the degree to which a system recognises the valid words of the language. This attribute is decomposed into a number of domain-dependent sub-attributes. For each sub-attribute, one or more representative word lists must be made available to provide testing material. The same measure is used for each list, namely:

The simplest way to express this measure is by giving the percentage of words recognised out of the total number of words in the list. There will also be cases, however, where coverage of some sub-attribute will be expressed as a weighted percentage. To test common word coverage, for example, we propose to use a structured list of words, i.e., a list consisting of a number of separate sublists, defined on the basis of frequency of occurrence. For each sublist, then, we count the percentage of recognised words. In order to obtain a single coverage value for the whole list, the single percentages will be weighted to reflect the relative frequency of the words contained in each sublist. A weighted mean score will then be calculated and presented to the customer as the total value for common word coverage in addition to the simple percentages obtained for each sublist.

Thus, we use weights to relate coverage values to the different frequencies of different groups of words belonging to the same domain. It may be objected that the use of weights makes the results difficult for the customer to interpret. However, we believe that different users will be interested in different portions of the lexicon, and that it is important, therefore, to provide rather differentiated coverage values. For instance, technical writers may not want to evaluate a spelling checker on the basis of its total coverage of the general vocabulary of a certain language. However, the most frequent common words of that language will also be relevant to them. Thus, the only common word coverage value which is of interest to them is the percentage obtained on the group of most frequent words. Writers of fiction, instead, will use a much larger general vocabulary. However, coverage of the most frequent common words will still be more important than coverage of rarer words, and must therefore receive more weight in the calculation of the overall score.
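The weighted coverage calculation described above can be illustrated with a short Python sketch. The function name, the sublist names and the weights are assumptions chosen for the example; the report does not prescribe particular weights, and the checker is modelled simply as a function that returns True for accepted words.

```python
def weighted_coverage(sublists, weights, accepts):
    """Return (per-sublist percentages, weighted mean percentage).

    sublists: dict mapping a sublist name to a non-empty list of valid words
    weights:  dict mapping the same names to positive weights
    accepts:  function returning True if the checker accepts a word
    """
    per_list = {
        name: 100.0 * sum(1 for w in words if accepts(w)) / len(words)
        for name, words in sublists.items()
    }
    total_weight = sum(weights[name] for name in sublists)
    overall = sum(weights[name] * per_list[name] for name in sublists) / total_weight
    return per_list, overall


# Toy data (assumption): a tiny lexicon and two frequency bands.
lexicon = {"the", "of", "and", "house", "table"}
sublists = {
    "high_frequency": ["the", "of", "and", "in"],
    "low_frequency": ["house", "table", "zephyr", "quark"],
}
weights = {"high_frequency": 3, "low_frequency": 1}

per_list, overall = weighted_coverage(sublists, weights, lambda w: w in lexicon)
print(per_list)  # simple percentages, reported for each sublist
print(overall)   # single weighted value for the whole list
```

Returning the simple percentages together with the weighted mean mirrors the proposal to present both to the customer, so that a reader interested only in one frequency band can ignore the overall score.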

Unless a spelling checker has perfect recall, there will be correct words that it nevertheless rejects. These are called false positives in this report. Expressing recall as a percentage R, it is clear that the percentage F of false positives is computed by F = 100 - R.
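As a minimal sketch of the relation F = 100 - R, the following Python fragment computes both values for a base list of valid words. The function names and the toy lexicon are assumptions made for the illustration.

```python
def recall_percentage(accepts, base_list):
    """Percentage R of valid words in base_list that the checker accepts."""
    if not base_list:
        raise ValueError("base_list must not be empty")
    accepted = sum(1 for word in base_list if accepts(word))
    return 100.0 * accepted / len(base_list)


def false_positive_percentage(recall):
    """Percentage F of valid words wrongly rejected: F = 100 - R."""
    return 100.0 - recall


# Toy checker (assumption): a plain dictionary look-up.
lexicon = {"the", "cat", "sat"}
base_list = ["the", "cat", "sat", "mat"]  # all valid words of the language

R = recall_percentage(lambda w: w in lexicon, base_list)
F = false_positive_percentage(R)
print(R, F)
```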

3.1.3.2 Precision measures

For example, a whole range of existing checkers detect no errors at all in the following sentence, because they cannot treat multi-word units:

This behaviour enables these checkers to accept the idiom ad hoc (thereby improving their scores on the recall attribute) but it is questionable whether this improves the checkers' behaviour as a whole.

Our approach to measuring precision is in essence the same as that proposed for recall. As discussed in Section 3.1.2.1.2, we define a set of error types. For each error type, we apply a spelling checker to a collection of non-words containing errors of this type, and we determine how many of them are rejected.
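The per-error-type procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's ASCC program: the function name is invented, the checker is modelled as an acceptance function, and the toy lexicon deliberately contains one wrong entry so that one error goes undetected.

```python
def precision_by_error_type(accepts, error_lists):
    """Percentage of non-words rejected by the checker, for each error type.

    error_lists: dict mapping an error type to a non-empty list of non-words
    accepts:     function returning True if the checker accepts a word
    """
    return {
        error_type: 100.0 * sum(1 for w in words if not accepts(w)) / len(words)
        for error_type, words in error_lists.items()
    }


# Toy lexicon (assumption). "leter" is a deliberately wrong entry.
lexicon = {"bicycle", "winning", "letter", "leter"}
error_lists = {
    "doubling": ["bicyccle", "winnning"],
    "singling": ["wining", "leter"],
}

scores = precision_by_error_type(lambda w: w in lexicon, error_lists)
print(scores)
```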

3.1.3.3 Suggestion adequacy measures

In the example case, clicking on Change will cause the checker to replace speling with spelling. If the user clicks on one of the other suggested corrections first, that one will be the replacement.

A suggested correction makes sense only if it actually is the intended word. In the example, the first suggestion is the right one if the user actually intended to type spelling.

The term suggestion adequacy denotes the degree to which a spelling checker produces suggestions that can actually be used as replacements for spelling errors.

In order to set up a measure for evaluation of the usefulness of suggestions like the examples above, we will distinguish:

An error item is an item which is not included in the valid words of the language under consideration. A source item is the item that the user intends to be in the text instead of the error item. A suggestion is an item suggested by the checker as a correction for the error item.

As suggestions are listed in some order, we will refer to the Nth suggestion for some integer N.

We will say that a suggestion for some error item E is a hit if it is identical to the source of E; we will say that it is a miss otherwise.

Evaluation measures for the adequacy of suggestions should reflect how often one of the suggestions is a hit. Moreover, we want to know if the first suggestion is a hit, as this is the best possible result. The worst result, on the other hand, is provided if there is no hit visible on the spelling checker window, or if all the visible suggestions are wrong. In other words, we don't consider the possibility of additional replacements which the user can only have access to by extending the window: even if they may be correct, they are too difficult to get to.

We therefore decompose suggestion adequacy into four sub-attributes which will be our suggestion adequacy measures:

Each of these is a numeric measure whose value is a percentage between 0 and 100.
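Percentage measures of this kind can be computed once each error item is paired with its source item and with the checker's ordered suggestion list. The Python sketch below is illustrative only: the three scores it reports, their names and the `visible` window size are assumptions, and they are not claimed to coincide with the four sub-attributes used in the project.

```python
def suggestion_scores(cases, visible=5):
    """Hit percentages over a non-empty list of (source_item, suggestions) pairs.

    Each pair describes one error item: the word the writer intended and the
    ordered list of suggestions offered by the checker. `visible` is the
    assumed number of suggestions shown in the checker window.
    """
    n = len(cases)
    first = sum(1 for src, sugg in cases if sugg and sugg[0] == src)
    shown = sum(1 for src, sugg in cases if src in sugg[:visible])
    return {
        "first_suggestion_hit": 100.0 * first / n,
        "visible_hit": 100.0 * shown / n,
        "no_visible_hit": 100.0 * (n - shown) / n,
    }


# Toy data (assumption): four error items with their intended words.
cases = [
    ("spelling", ["spelling", "spewing"]),  # hit on the first suggestion
    ("bicycle", ["bicycles", "bicycle"]),   # hit, but not the first suggestion
    ("winning", ["wining"]),                # only misses
    ("letter", []),                         # no suggestions offered
]
scores = suggestion_scores(cases)
print(scores)
```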

3.1.3.4 Usability measures

How is the spelling check initiated? For the most part, this boils down to the question of whether the speller runs as a command in another program such as a word processor or as a standalone program.

Means of initiation. This is a factual measure, taking values that are a subset of {standalone, word processor command}.

How easy is it to invoke the spelling checker (from each of the available methods)? This is an operation count measure.

What unit of text can be selected for spell-checking? This measure is a subset of {one word, selection, whole text}.

How easy is it to do in each case that is offered? This is an operation count measure.

A significant class of users require the correct suggestion to be the first on the list, partly, seemingly, because any accelerated commands such as one-click replacement apply only to the first suggestion, and hence accepting lower suggestions involves considerably more user operations.

For example, Microsoft Word 3 requires a separate user operation to show suggested replacements, which, although it may speed up the initial dialogue, puts an extra burden on users who want to see suggestions.

The measurements we make here may be a little more complicated since the dialogue with the user may involve a number of user choices, not all of which will be present in every case. The choices can be tentatively listed as follows:

dismiss suggestions and save word to a user dictionary (this is covered under customisability);

Each of these actions (which are of course sub-tasks) can be analysed for the ease with which they can be carried out. Note that the relative importance of each will vary with the performance of the checker on its coverage and replacement suggestion attributes: if, for whatever combinations of writer, text type, and checker, the level of false positives is high, it is particularly important that the first and second options should be easy; if the suggestions are in general poorly ordered, it is more important that second and lower suggestions be as easy to select as the first. However, on average we are likely to want to give more importance to the measurement for accepting the first suggestion and dismissing the suggestions, each of which should have the most direct possible commands. The error dialogue, then, has one usability measure for each user action:

There may still be spelling checkers that make it difficult to actually convert the user's decision in an error dialogue to changes in the text; certainly this is necessarily true of much of the operation of grammar and style checkers. Under this heading we also consider the possible desirability of retaining a record of changes suggested and made; this may not be typically required for a `general purpose' spelling checker, but for various types of user, such as those involved in group authoring, this could be a valuable function.

What types of text alteration facility are available? This measure is a subset of {automatic text replacement, user alteration}

What kinds of recording of text alterations are possible? This measure is a subset of {none, record changes, record proposals}

Our measure for guessability is applied to each coherent task set as a whole. Thus, we have only one such measure for the basic checking task, and another for customisability. We propose one overall measure to be a rating scale with values from 1 to 5, plus a measure to be expressed in a short structured prose report detailing task/command mappings (i.e., the number of key strokes or mouse operations) that are found problematic to various kinds of users.

Learnability means how easy it is to find out how to use the checker. Thus, we want to evaluate any learning resources by use of which a user's expertise can increase.

What learning resources are available? This measure is a subset of {on-line help, printed manuals}.

* Learning resource completeness. We propose to measure learning resource completeness by using the task/command mapping, giving a rating scale score from 1 to 5 based on the proportion of tasks that are adequately represented in the manual.

* Learning resource quality. Quality and clarity of user manuals or indeed on-line help might be more difficult to measure. Whilst there are standard indices of so-called clarity of writing, they are somewhat controversial, and it is difficult to see that they would produce any reliable results on what are likely to be very small pieces of text. We provisionally give a measure that is a simple rating scale, 1-5, plus perhaps a short structured prose report based around tasks or commands that are particularly hard to learn.

3.1.3.5 Customisability measures

The measures for customisability are either factual measures to indicate whether a particular task is supported, or set measures indicating a set of possible ways in which a task can be solved, or operation count measures intended to give the user an idea of how easy it is to perform a given customisation task. They are organised along the various tasks described in Section 3.1.2.3.

* In what ways can a user dictionary be defined? This measure is a subset of {inside checker, outside checker, non-applicable (N/A)}.

* How easy is it to create a new dictionary from inside the checker? This measure is operation count or N/A.

* How many user dictionaries can be loaded for simultaneous access? Factual numeric measure.

Defining error dictionaries may need some explanation. It amounts to defining a dictionary of valid words which are closely related to another valid word but which would nevertheless be inappropriate to the domain to which a particular document belongs. For example in Italian, one may want to exclude the word coniglio (rabbit) so that the spelling checker interprets that string as a misspelling of the word consiglio (committee). In other words, is it possible to block the acceptance by the checker of otherwise valid words when checking a particular text?
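The effect of an error dictionary can be illustrated with a small Python sketch in which the set of accepted words is the main lexicon plus any user dictionary, minus the error dictionary. The function `make_checker` and its parameters are hypothetical and do not describe the interface of any actual product.

```python
def make_checker(lexicon, user_words=(), error_words=()):
    """Build an acceptance function from a lexicon and two optional dictionaries.

    user_words:  extra words the user wants accepted
    error_words: otherwise valid words that must be flagged for this text
    """
    allowed = (set(lexicon) | set(user_words)) - set(error_words)
    return lambda word: word in allowed


lexicon = {"coniglio", "consiglio"}

plain = make_checker(lexicon)
blocked = make_checker(lexicon, error_words={"coniglio"})

print(plain("coniglio"))     # accepted as a valid word
print(blocked("coniglio"))   # flagged, so that consiglio can be suggested
print(blocked("consiglio"))  # still accepted
```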

* Is it possible to define error dictionaries? The values for this measure would be either positive or negative (yes/no).

* How can user dictionaries be inspected? This measure is a subset of {inside checker, outside checker} or `not offered'.

* How can a new word be added? This measure is a subset of {inside checker, outside checker}.

* How easy is it to add a new word from inside the checker? This measure is operation count.

* How can a word be deleted? This measure is a subset of {inside checker, outside checker}.

* How easy is it to delete a word from inside the checker? This measure is operation count.

Finally, we would also add an attribute regarding the number of users who can access a single user-defined dictionary at one time. Again this is subdivided into two questions: what is possible within the spelling checking program? And what is possible outside it?

* Can more than one user simultaneously modify a single user-defined dictionary from within the spelling checker program? This measure is (yes/no).

* Is it possible to merge different user-defined dictionaries into one, outside the spelling checker program? This measure is (yes/no).

3.1.4 Evaluation methods for spelling checkers

In this project, `evaluation methods' refers to ways in which values for evaluative attributes are obtained. As attributes are hierarchically organized, the value of an attribute that has sub-attributes is derived from the values of those sub-attributes. A possible method for this is to take a weighted average, but whether that is appropriate depends of course on the types of values of the sub-attributes.

Establishing values for the attributes that are not decomposed (the leaves of the attributes tree) involves performing tests on objects of evaluation. In this section, we describe testing methods for the attributes defined above.
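For numeric values, the derivation of attribute values from the leaves of the attribute tree could be sketched as a recursive weighted average. The `Attribute` class, the `evaluate` function, the weights and the leaf values in this Python example are assumptions made for illustration; they do not describe how the Parametrisable Test Bed is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    name: str
    weight: float = 1.0            # weight relative to sibling attributes
    value: Optional[float] = None  # test result, set for leaves only
    children: List["Attribute"] = field(default_factory=list)


def evaluate(attr):
    """Return the leaf's tested value, or the weighted mean of the children."""
    if not attr.children:
        if attr.value is None:
            raise ValueError(f"leaf attribute {attr.name!r} has no test result")
        return attr.value
    total_weight = sum(child.weight for child in attr.children)
    return sum(child.weight * evaluate(child) for child in attr.children) / total_weight


# Toy tree (assumption): leaf values are percentages obtained by testing.
functionality = Attribute("functionality", children=[
    Attribute("recall", weight=2, value=90.0),
    Attribute("precision", weight=1, value=60.0),
])
print(evaluate(functionality))
```

Non-numeric values, such as the set-valued and factual measures used for usability and customisability, would need a different combination method or would be reported individually.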

An important part of the TEMAA philosophy has been that tests for evaluation should be automated wherever possible. This would serve the repeatability of tests in the very first place, but it would also be in the interest of testing efficiency.

The Parametrisable Test Bed (chapter 4) contains some programs that support testing: Errgen supports the construction of lists of mis-spelled words, and ASCC runs spelling checkers on test data to produce statistics on recall, precision and suggestion adequacy.

But it should be clear from the start that, for several quality characteristics, no automatic methods are possible. For example, testing characteristics like guessability and learnability essentially involves human action. Also, the construction of basic lists of correct words cannot be done by computers.

In this section, we describe the construction of data used in testing recall, precision, and suggestion adequacy of spelling checkers, and, more informally, methods for testing usability and customisability.

3.1.4.1 Methods for constructing basic word lists

To test the linguistic functionality of a spelling checker, we have developed various test materials. All spelling checkers we know of perform spelling checking at the word level, so that words are either accepted as valid words of the language or flagged as errors no matter what context they occur in. In other words, the context around the word to be checked is not taken into consideration. To identify and correct a valid word used in the wrong syntactic context, a grammar checker is needed rather than a spelling checker. Therefore, the test materials used in TEMAA are lists of words, either simple or compound words.

We distinguish two kinds of word lists: lists of valid words, which we call base lists, and lists of corrupted ones, which we call error lists. Base lists are used to test a spelling checker's recall (for different lexical coverage sub-attributes), to construct error lists, and to test suggestion adequacy. Error lists are used to test the checker's precision and suggestion adequacy.

Different methods can be envisaged to construct base lists, e.g., using frequency word lists, lemmatised lists taken from existing published dictionaries, manually-constructed or automatically-constructed samples.

TEMAA has opted for frequency word lists where possible, i.e. where such lists were available or where it was possible to construct them within the project. We were interested in lists of inflected words, as words in running text occur as inflected full forms, and not all forms of the same lexeme are equally frequent. Furthermore, we wanted to be able to distinguish between different degrees of frequency when testing the lexical coverage of the various domains. In fact, it is very important for a spelling checker to be able to recognise the most used words of the domain correctly in order not to bother the user with too many false flaggings. On the other hand, it may also be of particular interest to know how well a spelling checker performs at correcting misspellings of infrequent words, which are the ones a user is most likely to misspell.

The validity of the test results obtained with a frequency word list as input, however, depends on the quality of the list, i.e. on its representativeness and correctness. The issue of correctness is particularly relevant because of the purpose the list is intended to serve. Thus, base lists must have been proofread to remove systematic typos and spelling errors. Sporadic errors, of course, are difficult to eradicate, but precisely because of their sporadic nature they will only appear among the most infrequent words. Examples of systematic and apparently conscious deviance from the spelling norm on the part of an author, instead, should be kept in the list as possible variants. Representativeness of the input list, on the other hand, is necessary to ensure that the results reflect coverage of the intended domain, and not something else. The representativeness of a frequency word list depends on the way in which the corpus it is built from has been defined and assembled. Criteria to be met are variation on the one hand, and homogeneity on the other. The former is needed to avoid the systematic influence of individual texts or themes on the overall lexical material. The latter ensures that the characteristics of the corpus can be generalised to texts of the same type and domain.

In TEMAA, frequency word lists are used for example to test a system's coverage of the standard vocabulary of the language. For Danish, such lists have been identified in Maegaard & Ruus (1986). This is a collection of the lists derived from five different corpora representative of different text types, namely fiction for adults, children's fiction, newspapers, magazines and professional publications. Together, the five corpora represent the most frequent standard vocabulary of modern Danish. The corpora used to derive the word lists are made up of randomly chosen text samples of 250 words each, to ensure reasonable variation. The decision to have five separate corpora for distinct text types rather than a unified corpus of texts in modern Danish is motivated by the homogeneity criterion. In fact, the most frequent content words are different in the different lists. Additionally, again to ensure homogeneity, publications were selected from a relatively short period, namely 1970-1974. More details on the principles used to build the corpora are given in Maegaard & Ruus (1987).

One of the lists in the collection has been used for actual testing, namely the list covering the domain of general interest magazines[2]. The size is approximately 6,800 words, covering around 82% of the corpus. The list was divided into three frequency classes, as shown in the following table (figures have been rounded up):

The results of testing are provided in two different ways, either as simple percentage scores showing the proportion of words recognised by the checker for each class, or as a weighted mean score showing the overall performance for the list as a whole. The weights used correspond to the relative frequencies of the various classes.
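The weighted mean can be sketched as below. This is a minimal illustration which assumes that the weight of a class is the share of corpus occurrences it accounts for; the class names, scores and weights in the example are invented and are not the project's figures.

```python
def weighted_mean_score(class_scores, class_weights):
    """Weighted mean of per-class recognition scores.

    class_scores:  {class name: proportion of words recognised}
    class_weights: {class name: relative frequency of the class}
    """
    if set(class_scores) != set(class_weights):
        raise ValueError("scores and weights must cover the same classes")
    total_weight = sum(class_weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(class_scores[c] * class_weights[c] for c in class_scores)
    return weighted_sum / total_weight


# Invented example values for three frequency classes.
scores = {"high": 0.99, "medium": 0.95, "low": 0.80}
weights = {"high": 60, "medium": 30, "low": 10}
print(round(weighted_mean_score(scores, weights), 3))
```

Because the high-frequency class carries most of the weight, a poor score on rare words lowers the overall figure only slightly, which is why the per-class percentages are reported alongside it.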

For Italian, ISSCO has constructed a frequency word list on the basis of ANSA news wire bulletins, dated from January 1995 to the end of April 1995. The corpus was filtered to obtain accented characters in the ISO 8859-1 character set; to keep proper names, acronyms and the like from distorting the frequency counts, capitalised and uppercase words were deleted. The list consists of 16,530 words; low frequency words account for about 50%. This list can be considered representative of current newspaper style and language. As it was derived from published material, the number of errors or mis-typings is probably minimal, and confined to the lowest frequency category. Because any manual correction might have introduced other undesirable errors, we automatically sorted out capitalised words and figures, and deleted them from the list.
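The automatic removal of capitalised words and figures could look roughly like the following. This is a sketch under stated assumptions: the input is taken to be an already tokenised text, and the function name and sample tokens are hypothetical rather than taken from the ISSCO tools.

```python
from collections import Counter


def build_frequency_list(tokens):
    """Count word forms, skipping capitalised/uppercase words and figures."""
    counts = Counter()
    for tok in tokens:
        if not tok:
            continue
        if tok[0].isupper():                  # capitalised or all-uppercase word
            continue
        if any(ch.isdigit() for ch in tok):   # figures such as years or amounts
            continue
        counts[tok] += 1
    return counts


sample_tokens = ["il", "governo", "Roma", "ANSA", "1995", "il", "città"]
print(build_frequency_list(sample_tokens).most_common())
```

Note that this filter also discards ordinary words that happen to begin a sentence, which lowers their counts slightly.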

Moreover, ISSCO was given a larger frequency list from Istituto di Linguistica Computazionale in Pisa, consisting of 244,191 words. Since this list is composed from much broader material, we restricted the word list used for testing to words that had a relatively high frequency (over 20 instances found in the corpus), thus limiting the list to 33,377 words in total.
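Restricting a frequency list in this way amounts to a simple threshold filter. The sketch below assumes the list is held as a mapping from word form to corpus count; the function name and sample counts are hypothetical.

```python
def restrict_to_frequent(freq_list, more_than=20):
    """Keep only word forms found more than `more_than` times in the corpus."""
    return {word: n for word, n in freq_list.items() if n > more_than}


sample_counts = {"di": 5000, "casa": 21, "zufolo": 20, "abbacinare": 3}
print(restrict_to_frequent(sample_counts))
```

The comparison is strict, following the wording "over 20 instances"; a form with exactly 20 occurrences is dropped.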

Also for technical sublanguages, we believe the frequency word list approach would be a fruitful one. So, for each technical domain of interest to the user, a separate word list should be used to test the spelling checker's lexical coverage.

In a particular case, namely words belonging to the closed part of speech classes, exhaustive word lists can be constructed. We have provided an example of such a list for Danish: this is a list of determiners, pronouns, prepositions, conjunctions and non-productive adverbs. The list was constructed automatically by extracting the relevant word forms from the electronic version of the Danish spelling dictionary (Retskrivningsordbogen 1986).

For other sub-attributes of lexical coverage, including multi-word units, closed sub-vocabularies (e.g., foreign loans, archaisms, slang, obscenities), proper names, acronyms, initialisms, abbreviations, and symbols, the project had foreseen using limited samples, constructed manually or semi-manually. We have actually constructed three different lists to treat coverage of loan words and of proper names.

For Danish, we have a list of loan words from the technical domain of computer science, in which a large number of English terms are used. This is a manually constructed distributed sample of 95 words extracted from a printed dictionary of computer terms (Politikens Dataleksikon 1986). The dictionary includes both Danish and English words. To build this list, we made use of a method described in D2 (Survey of Existing Practices Within the Consortium: Spelling Checkers). To obtain a sample of 100 terms[3] from a dictionary of approximately 500 pages, we took the first English term on every fifth page. Since we could not always find an English term on the page chosen, we had to repeat the procedure by taking a term from every 17th page, and again from every 20th page. In the end, we had a sample of 100 terms, which we ran through a spelling checker to remove spelling errors. For five of the terms chosen, the spelling checker suggested an alternative spelling. To make sure that we did not introduce unconventional spellings in the sample, we removed those five forms.
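The page-based sampling procedure can be approximated as follows. This is only a sketch: it assumes the dictionary is represented as a mapping from page number to the English terms on that page, that each pass starts at the first multiple of the step, and that a term already sampled is not taken twice. None of these details are specified above, and the names are hypothetical.

```python
def sample_terms(pages, steps=(5, 17, 20), target=100):
    """Take the first English term on every n-th page, for each step n in
    turn, until `target` terms have been collected."""
    sample, seen = [], set()
    last_page = max(pages) if pages else 0
    for step in steps:
        for page in range(step, last_page + 1, step):
            if len(sample) >= target:
                return sample
            terms = pages.get(page, [])
            # Skip pages without an English term and terms already sampled.
            if terms and terms[0] not in seen:
                seen.add(terms[0])
                sample.append(terms[0])
    return sample


toy_pages = {5: ["buffer"], 10: [], 15: ["cache"], 17: ["compiler"], 20: ["debug"]}
print(sample_terms(toy_pages, target=4))
```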

For Italian, two lists of proper names have been constructed, one containing the names of all Italian provincial capitals (capoluoghi di provincia), for a total of 99 cities, and the other containing first names, for a total of 593 names. From the list of cities, a further list was derived, containing the adjectives referring to the inhabitants of those cities.

The English base lists are (i) a list of 288 closed class words drawn from the Alvey Grammar 3rd release (lexicon file d.le), (ii) a list of 9,532 general open class words derived from the British National Corpus ([BNC]) via word/part of speech frequency lists compiled by Adam Kilgarriff of the University of Brighton, kindly made available by anonymous ftp [AK], and (iii) a list of 32,250 technical words drawn from the European Corpus Initiative CDROM (ECI).

The project had also foreseen the use of domain-related grammars to generate samples relative to other sub-attributes of recall, e.g. numbers, units of measure, and dates. An example of such test suites has been provided for Italian in the form of a list of adjectives referring to people's age and of ordinal numbers in full letters.

In conclusion, the collection of word lists provided ranges over a broad selection of recall sub-attributes and constitutes, in our opinion, a good exemplification of the methods set up by the project.

3.1.4.2 Methods for constructing error lists

Common practice in most spelling checker evaluations (cf. Green & Hendry 1993) is to use limited samples of particularly tricky words, i.e. words which the evaluator supposes will be hard for the checker to spot for different reasons. Automation of the error generation process and of spelling checker testing, on the other hand, makes it possible to check spelling checker behaviour against large error samples. This has been done in TEMAA by designing language-specific corruption rules and applying them to some of the word lists used to test lexical coverage.

The project has concentrated on mis-typings and misspellings. In particular, for Italian we have tested how well spelling checkers recognised and corrected two kinds of mis-typings (consonant doubling and undoubling). In addition, different types of misspellings were tested for both Italian and Danish. The thrust of our work has been to formalise the description of spelling errors so that they could be treated automatically: automation, in fact, allows us to work systematically with large samples of data - a necessary precondition to offer reliable figures on the performance of different checkers. With automation as a general goal, two factors are crucial to determine whether a misspelling can be treated by our evaluation method:

* whether the error can be generated automatically by some mechanical and systematic substitution, deletion or addition of letters

For instance, among the misspelling categories identified by Löb (1983) for Danish are idiosyncratic errors for which no systematic mapping between the valid and the invalid forms seems possible (e.g. *indtasitter for intercitytog, En: intercity train). Such errors cannot be generated automatically and therefore fall outside the scope of our evaluation package. In fact, they are also very difficult for any spelling checker to handle intelligently.

The second group of misspelled words that constitute a problem in our case are the so-called `false negatives'. A false negative is a misspelled word which is wrong in the current context, but may be correct in others. The correct and the `incorrect' words are often homophones, i.e. they have the same pronunciation but different orthographies (e.g. in Danish at *terroriserer for at terrorisere, English: to terrorise). These errors are systematic, and can easily be generated automatically. However, since spelling checkers check words one at a time without taking the context into account, false negatives cannot be detected. Therefore, they are not treated in the evaluation package.

Practical considerations played a role in determining the approach to suggestion adequacy taken in the project. Suggestion adequacy is tested by checking whether one of the suggestions offered by the checker matches the original valid word, and by taking the position of the correct suggestion in the suggestion window into account. However, the checker may sometimes provide a suggestion which does not match the original input, but is nevertheless a plausible replacement for the error in question. Taking this into account would make automation of the testing considerably more complex.
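The exact-match test described here can be sketched as follows. The sketch assumes each test case pairs the original valid word with the checker's ordered suggestion list; how position is weighted in the project's own metrics is not stated above, so the "within the top n" summary is an illustrative choice, and all names are hypothetical.

```python
def suggestion_rank(original, suggestions):
    """1-based position of the original word among the suggestions, or None."""
    for position, candidate in enumerate(suggestions, start=1):
        if candidate == original:
            return position
    return None


def suggestion_adequacy(cases, top_n=1):
    """Return (share of cases with the original word suggested anywhere,
    share of cases with it among the first `top_n` suggestions)."""
    if not cases:
        raise ValueError("no test cases")
    ranks = [suggestion_rank(orig, sugg) for orig, sugg in cases]
    found = sum(1 for r in ranks if r is not None)
    near_top = sum(1 for r in ranks if r is not None and r <= top_n)
    return found / len(cases), near_top / len(cases)


toy_cases = [
    ("slæbt", ["slæbt", "slæbe"]),       # correct word offered first
    ("hus", ["mus", "hus"]),             # correct word offered second
    ("intercitytog", []),                # no suggestions at all
]
print(suggestion_adequacy(toy_cases, top_n=1))
```

A plausible but non-matching suggestion counts as a miss here, which is exactly the simplification the paragraph above accepts for the sake of automation.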

Another issue concerns the accuracy of our corruption rules. It is not always possible to state the transformation needed to generate a certain error with enough precision, and undesired transformations are sometimes carried out. Two types of "false" errors may in fact be produced: on the one hand words that are not actually misspelled at all (i.e. corruption rules unintentionally generate a different valid word), on the other, words that do contain an error, but not a very plausible one. The first type of false error is not too problematic. The Errgen program (cf. section 4.4.5) checks all the generated items against a base list to make sure that they are not valid words of the language. Provided that the base list is large enough, valid words will thus be recognised.
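A corruption rule and the check against the base list can be illustrated together. The sketch below is not the Errgen implementation: it shows a simplified consonant undoubling rule for Italian, with a hypothetical consonant inventory and function names, and then discards any generated form that is itself in the base list.

```python
CONSONANTS = set("bcdfglmnprstvz")  # simplified inventory, for illustration


def undouble(word):
    """All forms obtained by reducing one doubled consonant to a single one."""
    forms = set()
    for i in range(len(word) - 1):
        if word[i] == word[i + 1] and word[i] in CONSONANTS:
            forms.add(word[:i] + word[i + 1:])
    return forms


def generate_errors(base_list, rule):
    """Apply a corruption rule to every base word, keeping only the outputs
    that are not themselves valid words of the base list."""
    valid = set(base_list)
    errors = {}
    for word in base_list:
        for corrupted in rule(word):
            if corrupted not in valid:
                errors[corrupted] = word  # corrupted form -> original word
    return errors


toy_base = ["cappello", "capello", "gatto"]
print(generate_errors(toy_base, undouble))
```

Undoubling cappello yields both *cappelo and capello; the latter is a valid word and is dropped only because it happens to be in the base list, which shows why the base list needs to be large.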

The other kind of unwanted output, on the other hand, seems more difficult to get rid of completely due to the fact that for a particular language, it may not always be possible to narrowly restrict the context in which the error should be inserted. For example, a possible misspelling in Danish can originate as a result of confusion between the two participial endings t and et, which are used in connection with verbs belonging to two different classes. An example would be De har *slæbet (English: They have dragged), where the participial form should read slæbt. Assuming that the relevant corruption rule would simply substitute et for t as an ending, one would also generate less plausible misspellings, e.g. with an adjective:

The problem here is that the grammatical category of the input word is not checked to constrain the application domain of the rule. To avoid the generation of such unlikely errors, therefore, a certain amount of manual checking seems unavoidable.

Nevertheless, we feel that the approach to error treatment taken in TEMAA is fundamentally a sound one, and the test results obtained (cf. An Experimental Application of the TEMAA Evaluation Framework: Spelling Checkers, TEMAA Deliverable 13), show that the metrics set up for error treatment are able to elicit useful information about different spelling checkers' degree of precision with respect to different error types.

Finally, an important aspect that has not been treated by the project is how the various error types relate to different user types. If frequency information relative to the various error types were available, it would increase the significance of the results obtained.

3.1.4.3 Usability methods

In this section, we discuss the methods we employ to evaluate usability. In fact, the method types we mention here are also relevant for the customisability attributes. First we list a number of possible methods and their advantages and disadvantages; then we describe the concrete methods proposed in TEMAA for each usability attribute.

The user survey carried out in the initial phase of the project represented the first level of an empirical user-based investigation using a telephone questionnaire that operated like a semi-structured interview, although the usability aspects discussed were necessarily very general.

The main weakness of this as a method for eliciting usability evaluations stems from the remoteness from actual use, compared with behavioural observations. Many of the design problems of interfaces are not specifically recalled in such an interview. Strong points of interviews are general attitude evaluations, and, of course, the saving of user and investigator time.

In a think-aloud method, actual users are observed in the course of a (relatively) realistic task. As the requirements of the task cause them to use or seek to use the various commands of the system, they are encouraged to keep up a commentary on their reactions: their goals, their attempts to find ways of satisfying them, the difficulties they encounter. Such sessions may be recorded on video or audio tape for ease of processing if the information required from the think-aloud is not easy to extract. The great advantage of this is the real insight into usability difficulties that it produces; this makes it particularly suitable to measure learnability problems, for instance. The disadvantages include the difficulty in deciding whether problems encountered are representative without huge duplication, the expensive and time-consuming nature of working with users, and the difficulty of converting the record of such a protocol to a concise measure.

A think-aloud should be guided by a list of questions to which we want answers, probably focused around the commands and measures: i.e., for each command, how easy it is to learn. The task the user is asked to do is defined accordingly, and the researcher's note-taking is focused on those questions. A marking scheme should be established, whereby a given level of user response is judged to fall into categories of difficulty or ease.

A researcher test may be similar to a think-aloud, except for the greater convenience of not having to find users. Where we can be sure that an attribute can be representatively measured by a researcher's own experience, this is clearly the method of choice.

When we are evaluating the ease of use of a given command, it will often be realistic to measure its experienced user potential simply in terms of the number of keystrokes or mouse commands required to effect the command, which we call an operation count measure. The actual method of obtaining this measure is taken to involve the creation of a task/command mapping, which relates the general tasks and subtasks we identify to a sequence of commands in each particular system.
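A task/command mapping and the operation count derived from it can be represented very simply. In the sketch below the system names, task names and command sequences are all invented for illustration; a task a system does not support is reported as N/A (None).

```python
# Hypothetical task/command mappings for two imaginary systems.
# Each task maps to the sequence of keystrokes or mouse commands it requires.
TASK_COMMAND_MAPPING = {
    "system A": {
        "invoke checker": ["menu:Tools", "menu:Spelling"],
        "accept suggestion": ["click:suggestion", "click:Change"],
        "ignore word": ["key:Alt+I"],
    },
    "system B": {
        "invoke checker": ["key:F7"],
        "accept suggestion": ["click:suggestion", "click:Change"],
    },
}


def operation_count(mapping, system, task):
    """Number of operations needed for a task, or None if not available."""
    commands = mapping.get(system, {}).get(task)
    return None if commands is None else len(commands)


for system in TASK_COMMAND_MAPPING:
    print(system, operation_count(TASK_COMMAND_MAPPING, system, "invoke checker"))
```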

Given the operations in the basic spelling task described in Section 3.1.3.4, a method can be established for each of them out of the method types given above. Additionally, usability evaluation involves methods for general guessability and learnability evaluation across the four basic task operations. All the concrete methods suggested by the project are listed in the following subsections.

* Means of initiation. This is a factual measure, taking values that are a subset of {standalone, word processor command}.

* How easy is it to invoke the spelling checker (from each of the available methods)? This is an operation count measure.

Method: researcher test of operation count using instructions available in the literature expressed in the task-command mapping.

* What unit of text can be selected for spell-checking? This measure takes a factual value, from the set {one word, selection, whole text}.

Method: researcher test of operation count using instructions available in the literature expressed in the task-command mapping.

* dismiss suggestions and save word to a user dictionary (this is covered under customisability);

Method: researcher test of operation count using instructions available in the literature expressed in the task-command mapping.

* What types of text alteration facility are available? This measure is a subset of {automatic text replacement, user alteration}.

* What kinds of recording of text alterations are possible? This measure is a subset of {none, record changes, record proposals}.

Method: researcher test of operation count using instructions available in the literature expressed in the task-command mapping.

Values for both the guessability measures we are interested in derive from the same test, with somewhat different reporting. The test would be a think-aloud protocol on a number of users drawn from those who are familiar with spelling checkers, those who are familiar with the platform conventions, and those who are not. They would be set a task of checking a text with various errors and false positives that exercises all the task sets identified so far. The researcher would use the task/command mapping as a skeleton to note where tasks take a long time, prompt extensive experimentation, or result in errors. Each task would be given ratings based on the number of problems experienced by all the subjects. Differences between user types would be noted.

* Guessability index: We propose one overall measure to be a rating scale with values from 1 to 5.

Method: For the first measure, a sum will be obtained of all the problem scores obtained from the think-aloud report, and a value calculated such that high problem scores map to low guessability, low problem scores to high guessability. Note that this will require preliminary calibration with a reasonable number of test cases to establish suitable mappings.

* Guessability report: Measure to be expressed in a short structured prose report detailing task/command mappings that are found problematic to various kinds of users.

Method: For the second measure, tasks causing problems above a given threshold overall will be noted, and any tasks that cause problems that are differently distributed among the user types will be noted.
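Both guessability measures can be derived from the same table of problem counts. The sketch below is illustrative only: the cut-off values that map a total problem score to the 1-5 scale are placeholders for the calibration mentioned above, and the thresholds used for the report, the user type labels and the function names are likewise hypothetical.

```python
import bisect


def guessability_index(problems_by_group, cutoffs=(5, 10, 20, 40)):
    """Map the summed problem score to a 1-5 rating (5 = most guessable).
    The cut-offs are placeholders pending calibration."""
    total = sum(n for group in problems_by_group.values() for n in group.values())
    return 5 - bisect.bisect_left(cutoffs, total)


def guessability_report(problems_by_group, overall_threshold, spread_threshold):
    """List tasks that are problematic overall and tasks whose problems are
    unevenly distributed among the user types."""
    tasks = sorted({t for group in problems_by_group.values() for t in group})
    report = {"problematic": [], "group_specific": []}
    for task in tasks:
        counts = [group.get(task, 0) for group in problems_by_group.values()]
        if sum(counts) > overall_threshold:
            report["problematic"].append(task)
        if max(counts) - min(counts) > spread_threshold:
            report["group_specific"].append(task)
    return report


# Invented problem counts per user type and task.
toy_problems = {
    "checker-familiar": {"invoke": 0, "correct word": 3},
    "platform-familiar": {"invoke": 1, "correct word": 3},
    "novice": {"invoke": 4, "correct word": 4},
}
print(guessability_index(toy_problems))
print(guessability_report(toy_problems, overall_threshold=6, spread_threshold=2))
```

In the toy data, correcting a word is hard for everyone, whereas invoking the checker is a problem mainly for novices; the two lists in the report keep these cases apart.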

* What learning resources are available? This measure is a subset of {on-line help; printed manuals}.

The test to be used here is a think-aloud experiment, possibly as an extension to the one described above, in which the user is asked to comment on their reaction to using the available learning resources including their indexes. For each learning resource, we have methods for the following evaluative measures:

Method: We propose to measure learning resource completeness by using the task/command mapping, giving a rating scale score from 1 to 5 based on the proportion of tasks that are adequately represented in the manual. Judging adequate representation is likely to be rather a subjective component of this method.
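One way of turning the proportion of adequately represented tasks into a 1-5 score is shown below. The linear mapping is an assumption made for illustration, since the project does not fix one here, and the task names are hypothetical; the judgement of which tasks are adequately represented remains a manual input.

```python
def completeness_rating(all_tasks, adequately_covered):
    """Rate a learning resource from 1 to 5 by the proportion of tasks in the
    task/command mapping that it represents adequately."""
    all_tasks = set(all_tasks)
    if not all_tasks:
        raise ValueError("task list is empty")
    covered = all_tasks & set(adequately_covered)
    proportion = len(covered) / len(all_tasks)
    return 1 + round(4 * proportion)  # assumed linear mapping onto 1..5


toy_tasks = ["invoke checker", "select text", "correct word", "record changes"]
print(completeness_rating(toy_tasks, ["invoke checker", "select text", "correct word"]))
```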

* Learning resource quality. Measure: a simple rating scale, 1-5, plus a short structured prose report based around tasks or commands that are particularly hard to learn.

Method: Using the same think-aloud, but with more emphasis on the quality rather than the presence or absence of entries for tasks and commands. Similar scoring and summing of responses for the rating scale. For the structured report, the choice of tasks with above-threshold problems or user-group specific problems is similar to the guessability methods in the previous section.

Method: Using the same think-aloud, the problem here is to find a list of terms under which the various tasks and commands ought to be indexed to make lookup easy enough. A possible way of doing this is by a structured interview in which various types of users are asked how they would want to describe the tasks they are set.

3.1.4.4 Customisability methods

* In what ways can a user dictionary be defined? This measure is a subset of {inside checker, outside checker, N/A}.

* How easy is it to create a new dictionary from inside the checker? This measure is operation count or N/A.