EAGLES Evaluation

Evaluation Design: the EAGLES framework

Margaret King

TIM (ex-ISSCO)/ETI

University of Geneva

54 rte des Acacias

CH-1227 Geneva

Tel: (41) 22 705 7114

Fax: (41) 22 300 10 80

E-mail: Margaret.Kingissco.unige.ch

Introduction.

There is nothing mysterious about evaluation: we all do it every day. We decide what to eat for lunch, whether to get married, which new dress to buy, whether to apply for a new job. Sometimes we weigh up several alternatives and decide between them, as when we decide what to have for lunch. On other occasions, there is only one possibility, and we have to decide whether it will do. (I do not think that most people choose amongst several possible marriage partners). But whether we are choosing amongst many candidates or only looking at a single candidate, each time we ask whether the candidate fits with what we want or need, we are carrying out an evaluation.

This presupposes, of course, that we know what we want. In practice, our wants and needs may not be set in stone before we start the evaluation, and carrying out the evaluation may cause us to rethink. But we must have some idea before we start: otherwise we shall finish up considering the possibility of a pair of trousers as a lunch item, or of a pet gold fish as a soul mate.

It also presupposes that we have some way of judging whether a candidate meets our wants and needs. Typically, we will break our requirements down into a list of stipulations, and the for each item on the list find a way of deciding how a candidate fares with respect to each stipulation. The new pair of trousers must fit, for example, must be one a set of colours, must be made from a warm fabric, must have narrow legs. Not all these stipulations will have equal importance. If the trousers are to complete an already existing outfit, the colour may be the most important thing. If they are to be worn on a holiday high up in the mountains, the warmth may be the most important. Then again, finding a way to decide how a candidate fares may sometimes prove difficult. It is easy enough to decide whether the trousers fit: we try them on and see. But it is considerably more difficult to find a way of deciding whether the future marriage partner will be faithful, or will turn out to be a good parent. The divorce rates offer ample evidence that people sometimes get these things wrong.

Even if it is not always easy to come to a clear cut conclusion based on unshakable evidence, we do have considerable experience of carrying out evaluations. I want to use the rest of this paper to suggest that we can apply that experience to the evaluation of language technology products and systems, and come up with a principled way of designing evaluations which will facilitate communication amongst those involved in evaluation, whether as evaluation designers, users of evaluation results or as owners of what is being evaluated, and will also encourage the gradual development of a pool of expertise. Furthermore, the proposals made here are, I believe in congruity with the ISO standards which refer to the evaluation of software product (ISO/IEC 9126-1, 98, referred to hereinafter as ISO 9126)and ISO/IEC 14598-1, 98, referred to hereinafter as ISO 14598). This is encouraging in itself, and also tends towards de facto standardization in the field of evaluation. But let us sketch the proposals first and come back to the history later.

First, we shall look at a quick recipe for designing evaluations, given in the form of a series of questions, with very little comment. This section is loosely based on ISO14598. Then we shall try to make things more concrete by looking at how the quick recipe might be applied to an informal example. A more technical example will then be given. We will then return to the quick recipe, relate it more closely to EAGLES work, and discuss in more detail some of the issues involved. The paper will close with an historical note.

The quick recipe.

The recipe can be broken down into four major steps, each of which carries with it a series of questions:

1. Establish evaluation requirements:

What is the purpose of the evaluation? Do all parties involved have the same understanding of the purpose?
What exactly is being evaluated? Is it a system or a system component? A system in isolation or a system in a specific context of use? Where are the boundaries of the system?
What features of the system need to be evaluated? Are they all equally important?

2. Specify the evaluation:

For each feature which has been identified as being important, can a valid and reliable way be found of measuring the object being evaluated with respect to that feature?
If so, what will count as a good score, a satisfactory score or an unsatisfactory score? Where are the cut off points?
Usually, more than one metric will be applied. How are the different values obtained to be combined in order to reflect the relative importance in this evaluation of different features of the object being evaluated?

3. Design the execution of the evaluation:

Who will actually carry out the different measurements? When? In what circumstances? What form will the end result take?

4. Execute the evaluation:

Take measures.
Compare with the previously determined satisfaction ratings.
Summarize the results.

An informal example.

Perhaps the first comment to be made here is that although we are talking primarily about evaluating a specific kind of software - a language technology system of some sort - most of what has been said above applies equally well to the evaluation of any complex object. Say, for example, that we were trying to decide what undergraduate courses to apply for in what Universities. Following the steps outlined would lead to series of answers like:

Purpose of the evaluation: decide on whether to apply for a given University undergraduate course.
What exactly is being evaluated: the course plus facilities offered by the University and by the town in which it is located.
What features are important: content of the course, quality of teaching, library facilities, sports facilities (swimming and rugby) ....
How to measure the content of the course? Obtain list of classes.
How to measure teaching quality? Look at success rate of students.
How to measure library facilities? Visit library, ask number of volumes in specific area.
How to measure sports facilities? Is there a swimming pool? What size? Opening hours? Is there a rugby team? Does it win?
How is satisfaction rated for each of these measures?

Course content:

more than 50% of classroom hours spent on specialized topics: good

between 40% and 50% of classroom hours spent on specialized topics: acceptable
less than 50% of classroom hours spent on specialized topics: not acceptable.

We shall not belabour this further by working out satisfaction ratings for the other features we have mentioned, and we shall not go into any detail on how the evaluation will actually be executed. But it is worth mentioning that the step of deciding on the relative importance of different features carries straight over to this kind of evaluation too. For one person, it may be critically important that good swimming facilities are available, and that person will reject a course even if all the other features score very highly indeed. For another person, it may be that swimming is no more than a mildly desirable feature, and the content of courses is predominant. For that person, poor or non-existent swimming facilities might be compensated for by very good course content.

A more technical example.

This section is based on work done in the EAGLES Evaluation group, and in an LRE project TEMAA. However, in order to make some specific points, it differs from that work in the imagined purpose of the evaluation. There, an important test case was the evaluation of spelling checkers for use by end users - people actually writing and correcting texts. Here, we shall imagine that, although spelling checkers are still the object of the evaluation, the purpose is to decide whether or not the spelling checker should be integrated into a larger text processing package by a manufacturer of text processors. With this preamble, let us once again work through the first steps of the quick recipe.

1. Establish evaluation requirements.

Purpose of the evaluation: to decide whether the technical features of the spelling checker are such that it could usefully be integrated into a larger text processing package.

Clearly other factors than technical features will enter into a final decision, such as the cost of the spelling checker, the cost of the integration, whether the market for the language covered by the spelling checker is large enough to warrant the effort of integration and so on. The evaluation described here neglects those factors in order to concentrate on technical issues. They will come into play after the evaluation, when a final decision on purchase is being made.

What exactly is being evaluated?

Given that the spelling checker is to be integrated into a larger text processor, we really need to take into account both the spelling checker taken as an independent piece of software, and the text processor as it will be with the spelling checker incorporated into it. The spelling checker itself can be directly examined. The text processor with spelling checker integrated into it does not exist. However, it may well be possible to predict some features of the enhanced text processor on the basis of characteristics of the spelling checker and what would be required to carry out the integration.

This raises the general question of where the boundaries of the system to be evaluated are. Here we have distinguished two boundaries, the first the naked spelling checker, the second the enhanced text processor. Even more expansive boundaries are possible: we could, for example, be interested in evaluating the enhanced text processor's suitability for use in a specific environment, by editors in a publishing firm, for example, or by a teacher teaching language skills. Each change in boundary would imply a change in how the evaluation is designed.

What features of the system need to be evaluated? Are they all equally important?

Let us first think about features of the spelling checker itself, returning later to what might be said about the virtual text processor of which it will eventually be a part.

1. Functionality:

the spelling checker should not flag a legal word of the language as being a spelling mistake
it should flag a spelling error as an error
it should propose a list of possible corrections
the word intended by a supposed end user should be high in the list of suggestions, preferably first.

This latter requires a comment. Given the purpose of the evaluation here, the end user in mind is taken to be the most "neutral" user of a text processor, i.e. an average person writing a text. Other users might be imagined who would have special needs, but the manufacturer of a text processing system is imagined here to be aiming at the largest market first.

To this list we might add a feature which is in effect a specialisation of the first two features above. For many languages, English being a prominent example, there are variant spellings depending on geographical area or predominant culture. It is very irritating when a spelling checker flags as errors spellings which are acceptable in that culture. For example, the Oxford English Dictionary will tell you, if you look up "-ize" that it is also written "-ise", yet many spelling checkers will persist in flagging the "-ise" spelling as an error in British English. So we may want to add:

the spelling checker should not flag as an error any spelling which is authorized by a major reference dictionary for the language in question.

This could then be seen as a sub-feature of the feature which says that the spelling checker should not flag a legal word of the language as being an error.

2. Reliability:

the spelling checker should not fail as a result of bugs

3. Efficiency:

the spelling checker should have an acceptable response time
the spelling checker should only use an acceptable amount of computing resources

4. Adaptability:

it should be possible to adapt the spelling checker to a particular set of needs, for example by creating personal dictionaries

This first set of features relate to the behaviour of the spelling checker. A second set relates to how easy it might be to integrate it into the text processor.

5. Maintainability/modifiability:

By this we mean how easy it is to modify the software either to maintain it or, as in this case, to modify it to enable the integration to be done. It might usefully be broken down into a number of sub-features:

it should be easy to analyze how the spelling checker works
it should be easy to implement modifications
modifying the spelling checker should not produce unexpected side effects

There may also be other considerations such as how easy it will be to adapt the text processor with the spelling checker incorporated to different environments and different platforms, but we shall not go into these here.

When we turn to considering the final product, the text processor with the spelling checker included, a new set of features becomes relevant.

Some of these will be reconsiderations of features already associated with the spelling checker considered in isolation. Functionality we might imagine will not change, but questions of efficiency will have to be reconsidered, since, for example, the integration might have a deleterious effect on response times for the spelling checker. Conversely, integrating the spelling checker could have adverse effects on other functionalities of the text processor. Imagine for example that a correction while you type feature is included, but as a result, the text processor cannot keep up with the normal typing speed of an experienced typist.

Other features are related to usability questions, which so far we have not considered at all. Here, too we might usefully break a large question down into a series of smaller ones:

6. Usability

how easy will it be for an end user to understand the enhanced text processor well enough to know what it can be used for and to decide whether he should make the effort to install and learn how to use it?
how easy will it be for the end user to learn how to use the enhanced text processor?

(A sub-feature often added here is:

how easy is it for the end user to guess how particular features of the software work? )
how easy will it be to operate and control the enhanced text processor?

One question we have not so far addressed is whether all these features are of equal importance. Considering the spelling checker in isolation, and in the context of this evaluation, it seems plausible to suggest that functionality is most important, integratability next, efficiency next, and reliability least important. One can imagine a result of the evaluation being a stipulation to the supplier of the spelling checker that purchase will be made if response times are speeded up to some specified degree or if certain bugs are fixed. It is harder to imagine contemplating even the conditional purchase of a spelling checker that does not spot spelling errors. If we were thinking of an evaluation done with the purpose of deciding whether the newly developed enhanced text processor should be put on the market, or of one aimed at deciding whether the enhanced text processor adequately met the needs of a specified set of customers, new features would be added, some of the old ones might be deleted and the balance between the features we have already picked out would shift.

It is clear from the above that working out what are the interesting features in a particular evaluation of a particular type of language technology system is a long and fastidious task. Even so, what has been said above makes no claim to being exhaustive, or even necessarily a correct set of choices. Working out what a good set of features might be could well be an area where cooperation would produce a store of reusable work which would, in its turn lead to effective standardization of evaluation. We shall return to this question later.

For each feature which has been identified, can a valid and reliable way be found of evaluating the object to be evaluated with respect to that feature?

Not all the features mentioned above can be measured in the same way. Space restrictions (and reader boredom) prevent us from considering every feature in detail, so we will simply pick out a selection where the measures are different in their nature.

As a first example, let us take one of the functionality features: the spelling checker should flag a spelling error as a spelling error.

A possible metric here (and one used in the TEMAA project) is to take a list of legal words in the language and systematically generate from it erroneous spellings. The spelling checker is then run over the list of erroneous words, and the percentage of errors flagged is counted.

Even this simple sounding metric is not quite so simple as it sounds. First, corrupting a legal word may sometimes lead to the generation of a different legal word, so the list of errors must be checked and legal words removed from it. If this is not done, the metric is not measuring what it is supposed to measure: it is invalid.

On the positive side, the metric can be made to be very informative. If the corruption is rule driven, the error list can be made to reflect particular types of errors, for example, errors resulting from reduplication of a letter or from transposing two letters. In the limit, if it is possible to define the errors typically made by a class of users (the two above are typical of skilled typists typing quickly), the error list can intentionally reflect the specific needs of that class of users. It might be possible, for example, to generate errors typical of school children learning a language or of second language learners.

A further advantage of this particular metric is that it can be used to obtain information about other features. Since the "correct" answer is known, it is possible to use the metric to collect data on how often the correct answer is included in the list of suggestions made, and on whereabouts in the list it appears.

Finally, the metric lends itself to automation. In the TEMAA project, errors were automatically generated from a list of correct words according to rules supplied by project members expert in the language being checked. Then a session with the spelling checker was automatically simulated, and the results sent to summary files from which a report could be generated summarizing the results.

In many ways, this might be said to be an ideal metric: measurement is done directly on the object being evaluated, automation is possible, scoring is on an objective quantified scale, the metric is maximally informative.

For a metric that cannot be directly applied to the object being evaluated, let us consider one of the efficiency features, response time. Of course, when considering the spelling checker in isolation, it will be possible to measure response time directly. But this is no longer the case when we consider the enhanced text processor of which the spelling checker will eventually be a part. However, we can get some predictive estimate, by combining the response times of the spelling checker itself with a consideration of what would have to be done to bring about the integration, and of how that would affect response times.

When we look at the features relevant to the integration itself, it is no longer the case that measuring direct behaviour has any pertinence whatsoever. What is in question here are internal characteristics of the software itself; how large the program is, whether it is well structured, whether it is well documented. The most we could do in the way of direct measurement would be to walk through the code, step by step, looking at how easy it is to do the walk through, checking for points where unexpected side effects might occur and so on.

When we look at the usability features, we find yet a different kind of metric. Once a user has been introduced into the picture, the sort of measures we have already mentioned may have some predictive value concerning his reaction to the software, but we can only really measure features like learnability or understandability by observing a set of representative users as they work. This is typically done by manufacturers in laboratory conditions or by prospective clients in a real work environment. Obvious conditions of validity for this type of testing are the size of the set of users observed and their representativeness. (There are of course also others). A known problem is that laboratory results do not always correlate to real life results, in that users in laboratory conditions do not always behave in the same way as users in the real life situation. For example, a user in laboratory conditions may be more likely to "play" with the system, where a real life user is concerned only with getting the job done. This paragraph touches on a vast topic which cannot be treated with any justice here. Some references have been provided for the interested user.

However, before we leave the topic, we should at least mention the notion of "quality in use": the quality of a system as seen from the point of view of a specific user who is concerned with getting a particular job done effectively, productively, in safety and with satisfaction in the job and how it has been done. Obviously, all the features we have mentioned contribute to quality in use, along with a host of other features that have not been relevant to the evaluation context set up here, such as the work environment, ergonomic factors, work flow considerations and so on. Metrics for measuring quality in use would have to take into account all of these features. Quality in use has only recently been added into the ISO work on evaluation. The bibliography contains a few references for those who wish to pursue the subject further.

It should be intuitively clear that all these different kinds of metrics are inter-related: metrics based on internal characteristics of the software predict how the system will behave, and thus what the result of administering metrics based on directly measuring the behaviour of the system will be. In turn, metrics based on the behaviour of the system have predictive value for whether or not the user will ultimately be satisfied when he uses the system.

EAGLES, ISO and some issues.

This section is primarily based on work within the EAGLES evaluation group on applying the ISO 9126 and 14598 to the evaluation of language technology systems and products. It also draws heavily on the experience of an LRE project, TEMAA, which in turn applied the EAGLES work on developing a theoretical framework for evaluation design to the practical evaluation of spelling checkers and grammar checkers. Within EAGLES itself, a certain amount of work was done on applying the framework to translation memory systems. The section is structured around major issues encountered in the EAGLES and TEMAA work.

Users.

The ISO definition of quality is based on user needs. "Quality: the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs" (ISO 8402: 1994). The user is by no means always the end user of a product. At different stages of the life cycle, he may be, for example, the developer of the software (using, for example, specifications or a component of a larger system) or the maintainer or operator of a completed system.

Defining who the user is is very closely related to defining where the boundaries of the system are. ISO 9126 makes a distinction between the evaluation of a software product and the evaluation of the system in which it is executed. "For example, the reliability of a system is assessed by observing all failures due to whatever cause (hardware, software, human error etc.), whereas the reliability of the software product is assessed by extracting from the observed failures only those that are due to faults ... in the software"(ISO 9126-1, 5.3). It also points out that the system itself may be part of a larger system. The example given is that of an aircraft with a computer controlled flight controlled system. If the passengers are taken to be the users, the system on which they depend includes the flight crew, the airframe, and the hardware and software in the flight control system. If the flight crew are taken to be the users, then the system on which they depend consists only of the airframe and the flight control system. Early EAGLES work made essentially the same point when talking about "set-ups" (a term taken from Sparck Jones, 96), pointing out too that a set-up may be embedded in a larger set-up. The essential point is that where the boundaries of the system or set-up are taken to be will critically affect what features need to be evaluated, and the relative balance of importance between them. The passengers, for example, might care greatly whether the flight crew considers it part of their duty to provide them with a second gin and tonic. The air crew will have no interest in the question.

The "needs" used in the definition of quality may be stated or implied. A note to the definition tells us that needs are stated in a contractual situation, as when a contract has been given for the construction of a piece of software. In other situations, needs are implied. That means that they must be made explicit as prerequisite of evaluation design. Eliciting user needs is a difficult task, made more so by the fact that the user himself may not be aware of his real needs, or his needs may change over time.

The EAGLES approach was to work in terms of classes of typical users, much in the same way that consumer organizations imagine typical users of cars or washing machines when drawing up their reports on products. User profiling was then done in terms of what tasks the typical user was trying to accomplish. This work was left at a very preliminary stage, and is an area where much future work might usefully be done.

Even so, the essential point remains: in every evaluation there is somewhere, consciously or unconsciously, a user whose needs are to be satisfied. This is true even in the evaluation of core technology: furthering the progress of core technology is not an ultimate aim in itself, there is always the idea that the technology will eventually be of benefit to someone.

Defining the quality model.

In the quick recipe, we talked about deciding what features of the object to be evaluated were of interest, and what the relative importance of these features was. In ISO terms, this process is called defining the quality model: it is a way of setting out explicitly the components of quality for this type of object.

ISO 9126-1 offers a set of quality characteristics and sub-characteristics which can serve as a check-list when defining the quality model for a piece of software. The technical example given earlier makes use of some of these characteristics and sub-characteristics. The definition of the quality (sub)-characteristics in the ISO documents is given in English, and is thus open to interpretation. For example, we identified a feature of spelling checkers that had to do with whether the spelling checker accepted alternative spellings that were sanctioned by a recognized reference dictionary. In the example, this feature was organized under the functionality characteristic:

Functionality

not flag legal words as errors flag spelling errors as errors

accept alternative spellings

Another quality characteristic in the ISO list is compliance, which normally has to do with compliance to standards and norms. Someone might argue that our alternative spelling feature should be organized under compliance rather than under functionality. Such arguments are the inevitable consequence of an informal definition. However, the ISO documents themselves recognize that the definitions given do not constitute the one and only way of designing a quality model. The essential is that they provide a very useful checklist which can be used both to structure thinking about the quality model being defined and to ensure that nothing important gets forgotten.

The ISO standard defines three elements of a quality model: the quality characteristics, their sub-characteristics and attributes. Attributes are features to which metrics may be attached. The value of an attribute may influence one or more characteristics, just as one characteristic may be influenced by more than one attribute.

A major aim of the first round of EAGLES work was to develop a formalization of the ISO quality model. This was done by using the sort of feature structures familiar from work in computational linguistics, where attribute/value pairs are organized into a hierarchical structure. The hierarchy can be as deep as is needed, but eventually bottoms out into attributes for which a value may be directly obtained by applying a metric. In this model there is no formal distinction between characteristics, sub-characteristics and attributes.

When an evaluation is executed, the values obtained are combined and propagated back up the hierarchical structure. The specification of weightings on individual nodes in the structure governs the combining of values, thereby reflecting the relative importance of specific features for a particular evaluation, thus allowing the evaluation to be tailored to the needs of a particular set of users. A feature structure with the weightings specified thus constitutes a representation of a user or class of users.

A great advantage of this formalization is that it allows automation. The TEMAA project constructed a prototype evaluators' workbench. The workbench provided tools for defining a quality model and for specifying how values for pertinent attributes might be obtained. Some of these metrics were automated, some, for example the price of a product, were instructions to a human on how to find the appropriate value. A quality model for a specific product could then be derived by obtaining the values for that product. A user profile of a specific user or class of users could be defined by specifying the weightings governing the combination of values. It was then possible automatically to generate an evaluation report by comparing the quality model of one or more products with the user profile.

It is perhaps worth emphasizing once again that the construction of a quality model even for a relatively simple system like a spelling checker is time consuming and fastidious work. The model produced by the TEMAA project concentrated mainly on functionality and its attributes, on the grounds that it was the functionality attribute which captured the specificity of language technology products. A full quality model would require considerable work on the other quality characteristics.

Metrics.

When discussing the quality model, we stipulated that the leaf attributes should have metrics associated with them. The values obtained from these metrics are the starting point for assessing in practice the overall quality of the object being evaluated. The relationship between attribute and metric is often more complex than this description might lead one to suppose. It is perfectly possible to single out an attribute and then not be able to find a good metric to match it. A notorious example is quality of a translation. Faced with the impossibility of finding a valid and reliable metric, the most common reaction is to reconsider the attribute, as was done with the ARPA/DARPA evaluations of machine translation where "quality" was abandoned in favour of "fluency" and "comprehensibility", attributes which contribute intuitively to quality but which are easier to measure (although not completely unproblematic). Thus, there is often a feedback loop between defining an attribute and finding a metric for it.

The ISO documents talk of two kinds of metric. Internal metrics are based on features of the software itself; the number of lines of code, path length to response time and so on. External metrics are based on observing the behaviour of the system in action; actual response times, output etc. Internal metrics are predictive of external metrics, but the relationship is never perfect. "... the effect that a given internal attribute has upon an associated external measure will be determined by experience, and will depend on the particular context in which the software is used." (ISO 9126-1, Annex A, A.1.1).

EAGLES work so far has not made use of this distinction, primarily because its first test cases involved black box evaluation of finished products, where inspection of internal characteristics of the software is not possible.

On the other hand, much attention in EAGLES work has been paid to questions of validity. A metric is valid if and only if it measures what it is supposed to measure. This sounds easy, but is far from simple to ensure, and the literature on evaluation abounds in examples of metrics whose validity is at best shaky. In a similar strain, a metric is reliable only if it gives repeatable results. If we look at all language technology applications, only a handful of commonly accepted metrics exist. Finding others, and confirming their validity and reliability, is another area where cooperative work might lead to the creation of a very useful shared resource.

Summary and future directions.

The ISO standard on the process of evaluation (ISO 14598) offers a very general model of how to design an evaluation, which can potentially even be applied to objects other than software systems.. Combining it with the ISO 9126 standard for quality characteristics tightens the process model into a model of software evaluation. Formalizing the ISO standards and making the quality characteristics more concrete and more specific, as has been done in EAGLES work, produces a framework for designing evaluations of language technology products. This framework relies critically on the elaboration of quality models for language technology systems, and on the definition and validation of metrics associated with the attributes included in the quality model.

So far, only very partial quality models for a small number of products have been developed. Developing and validating full scale quality models is a lengthy and expensive enterprise. However, once a model has been developed, it can be used as the starting point for any future evaluation of the type of system of which it is a model, thus not only saving a great deal of work which is now done by each evaluation designer from scratch, but can also providing a common reference model for developers, as well as for research agencies and research workers seeking improvement in existing systems. Practical experience with evaluation, both within and without the EAGLES framework, has brought us to a point where it might be possible to envisage cooperative work aimed at producing a collection of quality models for a variety of different applications.

Such models are only really useful if accompanied by commonly accepted valid and reliable metrics for each bottom line attribute. Here too is an area where cooperative work could lead to the creation of a precious resource.

Historical note and acknowledgments.

The EAGLES working groups are an initiative of the European Commission. Work first started in 1993, and with a short break between late 1995 and early 1996 still continues. The present round will finish towards the end of 1998. The current work of the Evaluation Group is mainly concerned with consolidation and dissemination of the work done on developing a general framework for evaluation design during the first round. The TEMAA project was very closely linked to the first round of EAGLES work, carrying out practical experimentation with the theoretical ideas developed.

The first ISO standard on evaluation, ISO/IEC 9126 was published as a standard in 1991. This text served as important input for the first round of EAGLES work. That document laid down standards for the quality characteristics and guidelines for the evaluation process model. Subsequently, the subject matter has been separated. A new draft of 9126 concerns the quality characteristics and metrics. This draft is currently in the Final Committee Draft stage, and is therefore very close to publication as a International Standard. ISO/IEC 14598 now defines a standard for the evaluation process model. It too is very close to publication as an International Standard. The current round of EAGLES evaluation work has been quite closely influenced by the new drafts.

It would be unfair to close without thanking all my EAGLES and TEMAA colleagues, too many to name individually, for their collaboration over the last five years. Although I am solely responsible for the defects of the current paper, the ideas in it are the fruit of much collaboration and discussion within EAGLES and TEMAA. I also want to thank Nigel Bevan, technical editor of the two draft ISO standards for his invaluable cooperation during the second round of EAGLES work.

References.

Some useful web sites:

EAGLES: http://www.cst.dk/eagles2.html

This site also contains a large bibliography on evaluation, and a set of links to other sites concerned with evaluation. Anyone wanting to join the EAGLES evaluation discussion list should send a message to eagcst.dk

Usability (the INUSE project): http://www.lboro.ac.uk/research/husat/inuse

Bibliographical references.

EAGLES: Final Report (First phase), (96). To be found on the EAGLES web site or available from Center for Sprogteknologi, Copenhagen, 1996.

ISO/IEC 9126 (91) Quality characteristics and guidelines for their use. ISO, Geneva.

ISO/IEC FCD 9126-1 (98) Information Technology - Software product quality - Part 1: Quality model. ISO, Geneva.

ISO/IEC 14598-1 (98) Information Technology - Software product evaluation - Part 1: General Overview. ISO, Geneva.

LREC, (98). Proceedings of the First International Conference on Language Resources and Evaluation, ELRA. These proceedings contain very many papers on evaluation.

Sparck Jones, K. and Galliers, J.R., (96). Evaluating Natural Language Processing systems. Springer.

TEMAA: Final Report (96). Center for Sprogteknologi, Copenhagen, 1996.

van Haaren, L. et.al (98): Evaluating Quality of Spoken Dialogue Systems: Comparing a Technology-focused and a User-focused Approach. In Proceedings of ELREC, pp. 655- 659.

White, J.S. et al (94): The ARPA MT evaluation methodologies: Evolution, lessons and further approaches. In Technology partnerships for crossing the language barrier: Proceedings of the first conference of the Association for Machine Translation in the Americas. Columbia, Maryland. pp. 193-205.