
Previous EAGLES work - Maghi King

Summary of presentation

Maghi King's presentation of previous EAGLES work concentrated on its relationship with the ISO work presented by Nigel Bevan, and sought to underline issues and questions arising from comparing the two efforts.

The first round of EAGLES work had taken as one of its starting points the 1991 version of ISO 9126, and had tried to apply the standard to the evaluation of language engineering products and systems.

A major part of the EAGLES work had been to carry out a formalisation of the ISO proposals in the form of an attribute-value feature structure of the sort familiar from recent work in computational linguistics. Thus, each quality characteristic in the ISO standard was taken as a top-level attribute, having as its value a set of sub-attributes, each in turn broken down into further sub-attributes and values. The resulting structure is hierarchical, bottoming out at different places for different high-level attributes, at the point where atomic values are reached. (It should be noted that where the 1991 version of the ISO 9126 standard mentions the possibility of quality characteristics being broken down into sub-characteristics, the new draft goes further and suggests a certain number of specific sub-characteristics: in EAGLES terms, each set of sub-characteristics would be the set of attributes constituting the value of the quality characteristic of which they are the sub-characteristics.)
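
Purely by way of illustration, such a hierarchy can be pictured as a simple recursive data structure. The sketch below is not taken from the EAGLES reports, and the attribute names in it are invented; it merely shows the shape of the formalisation described above.

    # A minimal sketch of the attribute-value view described above: each quality
    # characteristic is an attribute whose value is either a set of sub-attributes
    # or, at the bottom of the hierarchy, an atomic value supplied by a metric.
    from dataclasses import dataclass, field
    from typing import Dict, Optional, Union

    AtomicValue = Union[bool, int, float, str]

    @dataclass
    class Attribute:
        name: str
        sub_attributes: Dict[str, "Attribute"] = field(default_factory=dict)
        value: Optional[AtomicValue] = None   # set only at the leaves

    # Illustrative fragment only; the attribute names are invented.
    functionality = Attribute(
        name="functionality",
        sub_attributes={
            "accuracy": Attribute(
                name="accuracy",
                sub_attributes={"error_detection": Attribute(name="error_detection")},
            ),
            "interoperability": Attribute(name="interoperability"),
        },
    )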

Metrics are then defined which allow the atomic values to be determined, and their values are fed upwards through the structure. The values at any given level can be combined in different ways to reflect the requirements arising from different user needs.
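
The following sketch, again with invented attribute names, scores and weights, is one way of picturing this bottom-up combination; the actual EAGLES and TEMAA combination functions are considerably richer.

    # Leaves hold atomic scores in [0, 1] produced by metrics; non-leaf attributes
    # combine the scores of their sub-attributes.  Different user needs are
    # expressed here simply as different weightings.
    def combine(node, weights):
        if not isinstance(node, dict):        # leaf: atomic value from a metric
            return float(node)
        total = weight_sum = 0.0
        for name, sub in node.items():
            w = weights.get(name, 1.0)        # default weight if the user profile is silent
            total += w * combine(sub, weights)
            weight_sum += w
        return total / weight_sum if weight_sum else 0.0

    product = {"accuracy": {"error_detection": 0.8, "suggestion_adequacy": 0.6},
               "usability": 0.7}
    print(combine(product, {"accuracy": 3.0, "usability": 1.0}))   # one user profile's weighting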

(See the EAGLES I report and the TEMAA final report for substantially more detailed accounts of this formalisation).

Special attention had been paid to two questions, taken up separately below:

How to design metrics that would provide valid data on which an evaluation could be based?

A classification of attributes, based on how the values for an attribute were to be obtained, had been set up (see the sketch after this list):

facts: factual values can be established simply by checking the technical documentation supplied with a product, examining the product directly, or looking at the sales literature relevant to the product. Examples are:

tests: the value is obtained by administering a test to the system. Examples are:

judgements: the value is obtained by asking a human to make a judgement. Examples are:
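
The sketch below records this three-way classification in code; the enum and the automation flag are purely illustrative and are not part of the EAGLES formalism itself.

    from enum import Enum

    class AttributeType(Enum):
        FACT = "fact"            # looked up in documentation, the product itself, or sales literature
        TEST = "test"            # obtained by administering a test to the system
        JUDGEMENT = "judgement"  # obtained by asking a human to make a judgement

    def can_be_automated(kind: AttributeType) -> bool:
        # Facts can be looked up or pre-stored and tests can in principle be
        # scripted; judgements need a human in the loop (see the discussion below).
        return kind is not AttributeType.JUDGEMENT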

Each type raises issues of validity, although the issues are more acute with attributes of type test and of type judgement.

Taking inspiration from work in the social sciences, EAGLES work had distinguished between 'external validity' and 'internal validity'. External validity depends on a correlation between the value obtained for the specified attribute and some external criterion. A classic example is offered by the questions asked by insurance companies in order to determine the amount of a life-insurance premium. Such factors as age, previous major surgery, having previously suffered from certain illnesses or having a genetic disease in the family are all held to correlate strongly with estimated life expectancy. Internal validity is intrinsic to the metric itself. A classic example is the reading tests administered to children, whose validity depends on the choice of vocabulary included in the test. Such a metric can be invalidated only by challenging the choice of vocabulary, not by appealing to a lack of correlation with some external criterion.
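
As a toy illustration of the external case only, a metric's values can be correlated with an independent criterion; the figures below are invented, and in practice far more data and a principled choice of criterion would be needed.

    from statistics import correlation   # Pearson correlation, Python 3.10+

    metric_values      = [0.62, 0.71, 0.55, 0.90, 0.40]   # values assigned by the metric
    external_criterion = [0.60, 0.75, 0.50, 0.85, 0.45]   # e.g. independent expert ratings

    # A high correlation supports the metric's external validity; a low one calls
    # it into question.  Internal validity cannot be checked in this way.
    print(f"Pearson r = {correlation(metric_values, external_criterion):.2f}")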

Determining the validity of metrics to be used in evaluation is often a difficult task, but one of critical importance. Maghi suggested that it was one of the key issues in putting the EAGLES guidelines into practice, and asked what relationship, if any, there was between external and internal validity as defined in the EAGLES work and external and internal metrics as defined in the ISO work.

The second question relates to those attributes for which it is difficult, and sometimes perhaps impossible, to define metrics of type test. Often these attributes relate to usability, and concern questions such as whether the software is user friendly or easy to use. With certain products, machine translation being a very clear example, it is particularly hard to find valid metrics of type test.

Maghi went on to point out that the type of an attribute directly influences whether finding a value for that attribute can be automated or not. She outlined the TEMAA proposals for an Evaluator's Workbench, an integrated suite of software modules which would allow formal descriptions of products and of users, given in terms of feature structures, together with definitions of the associated metrics, which in the ideal case could be administered automatically to a product in order to determine the specific values obtained by evaluation of that product. If all metrics were automatable, the profile of a specific user could be described to the workbench, evaluation of the products known to the workbench could be carried out automatically (or, where they existed, previous evaluation results compared to the desiderata), and an evaluation report produced showing which product or products best fitted the specified user needs.
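
A highly simplified sketch of the kind of matching described follows; the product names, attributes and thresholds are all invented for illustration, and this is in no sense the TEMAA implementation.

    # User requirements and stored (or freshly measured) product values are
    # expressed over the same attributes; products are ranked by how many of the
    # requirements they meet.
    user_requirements = {"error_detection": 0.8, "speed": 0.5}    # minimum acceptable values

    products = {
        "CheckerA": {"error_detection": 0.85, "speed": 0.90},
        "CheckerB": {"error_detection": 0.70, "speed": 0.95},
    }

    def fit(measured, required):
        return sum(measured.get(attr, 0.0) >= threshold
                   for attr, threshold in required.items())

    ranking = sorted(products, key=lambda p: fit(products[p], user_requirements), reverse=True)
    print("Best fit first:", ranking)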

Attributes of type test lend themselves to automation of the evaluation. For example, in TEMAA, a measure of a spelling checker's ability to detect errors of a specified kind (redoubling of letters, for example) was based on automatic generation of error lists from lists of correctly spelt words (thus ensuring that the correspondence between error and correct spelling was known to the system). The error lists then served as data for an automated simulation of a session with the spelling checker, whose results, in terms of errors detected and correct suggestions for replacement, were automatically stored and automatically analysed. Coming up with ways to automate the administration of metrics associated with attributes of type test may require considerable ingenuity, but can, in principle, be done.
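
The sketch below loosely follows this example: errors are generated so that the correct form is known in advance, and scoring is therefore automatic. The 'checker' is only a stub standing in for driving a real product.

    import re

    def redouble(word):
        # Generate redoubling errors; the correct spelling is known by construction.
        return [word[:i] + word[i] + word[i:] for i in range(len(word))]

    def checker_stub(misspelling):
        # Stand-in for a session with a real spelling checker: flag the word and
        # propose candidates obtained by collapsing each doubled letter.
        candidates = {misspelling[:m.start()] + misspelling[m.start() + 1:]
                      for m in re.finditer(r"(.)\1", misspelling)}
        return True, candidates

    correct_words = ["apple", "yellow"]
    total = detected = corrected = 0
    for word in correct_words:
        for error in redouble(word):
            total += 1
            flagged, suggestions = checker_stub(error)
            detected += flagged
            corrected += word in suggestions   # automatic scoring: the right answer is known

    print(f"detection rate {detected/total:.2f}, correction rate {corrected/total:.2f}")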

Similarly, with attributes of type fact, the value of the attribute in the case of a specific product may well be stored in the workbench as part of the description of the product. Even when this is not the case, asking a human to input the required value does not require him to exercise any judgement; he looks up or otherwise discovers what the facts are, and supplies them to the system.
As the last sentence suggests, this is not the case with attributes of type judgement. By their nature, these require the human to exercise his judgement. They are thus impossible to automate. It is also difficult to ensure their validity, since it is hard to dissociate measuring the human's capabilities from measuring the system's capacities. The metrics used in the ARPA/DARPA evaluations of machine translation systems offer clear examples of this difficulty. For example, when measurements were based on a human interacting with the system to produce correct output, factors such as the user's familiarity with the interface and with computer use in general, as well as the user's own skill as a translator, proved critical and were not easy to control for.

In summary, Maghi said that efficient and objective evaluations were best obtained through formalisation of the evaluation methodology and through automating as much of the evaluation process as possible. The first round of EAGLES work had contributed to this goal by building on the ISO 9126 standard to produce a formal version. In the course of the exercise, difficult and important issues to do with validity of metrics had been encountered, and some progress made at least towards understanding what the issues were. It was pleasing to be able to note a strong convergence between the ideas and proposals developed in the EAGLES context and the content of the new ISO draft, in particular the importance accorded to metrics definition.

Discussion

Denis Johnstone suggested that Maghi was over-pessimistic about the possibility of defining valid metrics for judgement-type attributes. He cited metrics in other areas based on the use of questionnaires, and volunteered to provide bibliographical references for the web site.

Much discussion turned on the supposed differences between language engineering software as a class and other software products. It was generally agreed that language engineering products require much more customisation than other software. Machine translation was cited as the most blatant example of this: the acceptability of a machine translation system depends on its ability to cope with a specific user's vocabulary, in-house style, language register and so on.
It was suggested that the EAGLES notion of 'internal validity' might correspond more closely to the ISO notion of 'quality in use' (see Nigel Bevan's second presentation) than to the internal metrics of the new draft of ISO 9126. There was no clear consensus on this point, which was thought worthy of later discussion via the web site.

Similarly, it was suggested that the EAGLES 'external validity' simply meant that a product conformed to what the real world required of it, and that in this sense it was analogous to 'measures' in general in the new draft of ISO 9126. Lengthy discussion failed to produce any clear consensus, and once again, the question was felt to merit further reflection and discussion via the web site.

Taking up again the topic of making metrics of type judgement objective, Nigel Bevan suggested that some of the techniques used in usability evaluation did exactly this, at least in the case of software other than language engineering software. Karen Sparck Jones pointed to information retrieval as another application where judgements had been 'objectivised'. Maghi suggested that an essential difference between information retrieval metrics as defined by the TREC evaluations and metrics for the products of other language engineering applications was that in the case of TREC it had been possible to define a set of generally agreed 'right answers' against which the system's performance could be measured. It was considerably more difficult to define 'right answers' in the case of some other language engineering applications. (See Karen's presentation for more discussion of this issue).
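
Purely as a minimal illustration of scoring against agreed 'right answers', in the TREC style mentioned above (the document identifiers below are invented):

    retrieved     = {"d1", "d3", "d4", "d7"}     # what the system returned
    right_answers = {"d1", "d2", "d3"}           # the agreed relevant documents

    hits = retrieved & right_answers
    precision = len(hits) / len(retrieved)       # 0.50
    recall    = len(hits) / len(right_answers)   # 0.67
    print(f"precision {precision:.2f}, recall {recall:.2f}")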

Several questions were requests for more information about the Evaluator's Workbench prototype produced by the TEMAA project. A number of participants felt that an operational version of the prototype with expanded capacities would be very useful.

