EAGLES Evaluation Group Workshop

Evaluation in Natural Language Engineering: Standards and Sharing.
Brussels, November 26th and 27th, 1997.

Detailed Report.

Introduction.

This is a detailed report on the workshop. It concentrates mainly on the content of the speakers' presentations and of the discussion sessions. A full list of the participants in the workshop is also available.

1. Aims of the workshop

The aims of the EAGLES Evaluation group during this second round of EAGLES work are essentially consolidation of work done during the previous round (1993 - 1996), dissemination of the results already achieved, and, perhaps most importantly, the creation of an on-going forum in which matters pertaining to evaluation can be discussed, experiences and resources shared, and work on methodology advanced. In line with these aims, the workshop sought to bring together a group of persons actively involved in work on evaluation, to inform them of previous EAGLES work and of other work held to be highly pertinent to EAGLES aims, and to create the beginnings of an on-line discussion group.

2. Themes of the workshop.

Preliminary work had identified a number of areas where active work was in progress which could benefit the EAGLES effort whilst at the same time being of interest to the evaluation community as a whole. These areas included:

a. EAGLES results from the first round.
b. On-going work in the context of ISO 9126.
c. On-going work on defining a reference model for the evaluation of speech systems.
d. Evaluation of usability.
e. Sharing of evaluation resources.

3. Organisation of the workshop.

In addition to the invited speakers, the workshop was open to public participation, and 20 people attended alongside the speakers. The workshop took place over two full days. Substantial discussion sessions were interspersed between the presentations. Speakers were encouraged to be informal and to accept discussion during their presentations. All workshop participants were invited to a communal dinner on the evening between the two workshop days. Coffee was provided at the end of each day, thus encouraging participants to engage in informal discussion after the end of the formal sessions.

In the event, a strongly informal atmosphere prevailed. Discussion and debate were very lively and felt to be very fruitful. In a final session where the present workshop was evaluated by the participants and plans for the future discussed, the participants expressed their overall satisfaction with both the contents and the organisation of the workshop.

4. Working sessions.

This section takes each session in chronological order.

4.1. ISO 9126 (including discussion of the new draft).
Nigel Bevan, National Physical Laboratory, Teddington, England.

The first round of EAGLES work had used as one of its starting points an existing ISO standard (ISO 9126), which primarily deals with the definition of quality characteristics for use in the evaluation of software products. The standard also includes guidelines for the use of these quality characteristics, including a model of the evaluation process.

During preparation of the workshop, it was discovered that a new draft of the ISO 9126 standard was in preparation, and that the new draft contained many ideas which converged substantially with recommendations of the first round of EAGLES work. In addition, the new draft is to be accompanied by documents on the elaboration of metrics, an area where EAGLES had done substantial preliminary work but where the previous ISO standard had had very little to say.

The technical editor of the new ISO draft, Nigel Bevan, was therefore invited to attend the workshop and to present the new draft.

A summary of his presentation is also available.

4.2. Previous EAGLES work
Maghi King, ISSCO, Geneva.

Maghi King's presentation of previous EAGLES work concentrated on its relationship with the ISO work presented by Nigel Bevan, and sought to underline issues and questions arising from comparing the two efforts.

The first round of EAGLES work had taken as one of its starting points the 1991 version of ISO 9126, and had tried to apply the standard to the evaluation of language engineering products and systems.

A summary of her presentation is also available.

4.3. Reference model for speech evaluation
Denis Johnston, British Telecom Laboratories, Martlesham Heath, Ipswich, UK.

This area was identified as being of interest partly because it was thought important that the written language community should have some knowledge of work in the speech community, and partly because the work in question had raised a number of issues to do with the definition and validation of metrics that were directly pertinent to on-going EAGLES work.

Denis Johnston of British Telecom was invited to present this work.

A summary of his presentation is also available.

4.4. Coherent approaches to evaluation
Karen Sparck Jones, University of Cambridge, UK

Much effort and expense has been invested by the U.S. funding authorities in a series of evaluation-driven research programmes. These programmes have met with varying degrees of success. One of the most successful has been TREC, a programme devoted to the evaluation of document retrieval systems. At the same time, even the successful programmes have raised the issue of how expensive evaluation resources can be created and shared without unduly prejudicing directions of research, and without tempting participants in the evaluation programmes to "cheat" by tailoring their systems to achieve high evaluation scores, rather than encouraging them to attack real technological problems.

Karen Sparck Jones, who has been heavily involved in the TREC programmes and who is also well known and well respected for her work on evaluation methodologies, was invited to present the TREC work and a scheme, the "braided evaluation scheme", which aims at overcoming some of the problems of the monolithic evaluation-guided programmes sponsored by ARPA/DARPA.

A summary of her presentation is also available.

4.5. User-centred design: INUSE and RESPECT
Nigel Bevan, National Physical Laboratory, Teddington, England.

Previous EAGLES work had done very little on evaluation of usability. This was felt to be an obvious lacuna and, furthermore, an area of particular interest for developers as well as for potential purchasers of systems: purchasers need to be able to assess whether a particular product will be found usable by their own particular user community, while developers and system vendors need to ensure that a sufficiently large proportion of average users will find their system usable to ensure its commercial viability.

Preliminary work identified two European projects, Inuse and Megataq, whose primary purpose was to provide support for usability evaluation, mainly to projects involving developers. Nigel Bevan, fortunately, is also heavily involved in the Inuse project and was invited to make a second presentation, on that project's work.

A summary of his presentation is also available.

4.6. MEGATAQ
Bert Arnold, Delft University of Technology

The objective of the European project Megataq is to provide Telematics Applications Projects (TAPs) with user-centred evaluation guidelines and consultancy. Bert Arnold, of Delft University, was invited to present the work of Megataq.

A summary of his presentation is also available.
