"EAGLES and Current Evaluation Practices"
Workshop
ETI, University of Geneva
September 8-9, 1998
For those of you who were unable to attend the workshop in person but would like to contribute your views and experiences, here is a fuller version of the final programme with links to some of the input that the various participants presented. This is not a report on the workshop itself; that will be made available at a later date.
September 8
1. Welcome, purpose of the workshop
-
This workshop aims to draw together the discussions of the EAGLES evaluation workshop held in Brussels in November 1997, of the LREC conference in Granada in May 1998, and of the electronic discussion list on evaluation, and to make the conclusions concrete in the form of a Handbook of Current Evaluation Practices.
2. Table of Contents of the EAGLES Handbook
3. Summary of the EAGLES Framework (Bente Maegaard)
-
A descriptive summary of the EAGLES framework as outlined in the EAGLES-I final report was given.
4. The new ISO standard (Maghi King)
5. The need for technology evaluation in language and speech engineering
(Patrick Paroubek)
-
Abstract. The ELSE project was presented, and Patrick Paroubek subsequently provided a brief summary of his presentation.
6. User profiling and requirements analysis (Nancy Underwood)
-
Based on Appendix C (Requirements analysis for linguistic engineering evaluation) of the EAGLES-I report, the EAGLES model was described. An annotated .rtf version of the slides is also available. The distinction between functional and non-functional requirements was taken up: in EAGLES, where we have concentrated on text transformation systems, functional requirements were seen as being bound to the top-level specifications on input/output. The question was raised as to whether this definition of functionality can be extended to other types of LE systems.
7. Metrics (Bente Maegaard)
-
Taking the EAGLES-I chapters on measures and methods as a starting point, this presentation discussed the notions of internal validity and external validity. The notion of typing of measures (cf. section 2.5.5) and its relation to typing of methods was also brought up.
8. Formalisation and automation and tools (Louis des Tombe, Steven
Krauwer)
-
Based on work in both the EAGLES (cf. section 2.2) and TEMAA projects, the questions of formalisation and automation were addressed, and a demonstration was given of the parameterisable testbed (PTB).
10. Test materials (Nancy Underwood)
-
In the event, this presentation was not given at the workshop due to lack of time. It is clearly closely related to the question of sharing resources and data, which was discussed on the second day. The chapter on test materials in the handbook is expected to be a somewhat updated version of section 2.6.3 in the EAGLES-I report. However, it may be that this should be integrated into the section on sharing resources. (For those interested, an annotated .rtf version of the slides is available.)
September 9
Case studies:
11. Activities and results of the ARISE project in the field of validation
(Marc Blasband)
-
Abstract. Marc Blasband presented the validation work carried out in the ARISE project.
12. Case studies: Spelling and grammar checkers (Nancy Underwood)
-
The application of the EAGLES/TEMAA framework to a concrete evaluation of Danish spelling checkers was described (a full description of this experiment, as well as an evaluation of Italian spelling checkers, can be found in "An Experimental Application of the TEMAA Evaluation Framework: Spelling Checkers"). In addition, preliminary work on applying the framework to grammar checkers was presented. See also Paggio, P. & N. L. Underwood (1998) Validating the TEMAA LE evaluation methodology: a case study on Danish spelling checkers. Natural Language Engineering, 4(3).
13. Discussion: Sharing resources - tools and data (introduction: Maghi
King)
-
As background to this, see the paper "Language Resources and Evaluation", which Margaret King presented at the Moncton Conference and which is available here as an .rtf file.
14. Current Practices
a. Short presentation of the Survey on Current Practices for the
Handbook (Maghi King)
b. Further contributions
MT Evaluation (contributions from Litton PRC):
Methodology from DARPA MT evaluation
The Machine Translation Functional Proficiency Scale
Parser Evaluation (contributions from John Carroll/Ted Briscoe)
-
Measuring parser accuracy is a difficult problem: a number of approaches have been proposed in the literature, but all suffer from drawbacks. In the paper Carroll, J., E. Briscoe & A. Sanfilippo (1998) "Parser evaluation: a survey and a new proposal", in Proceedings of the 1st International Conference on Language Resources and Evaluation, Granada, Spain, 447-454 (available online at ftp://ftp.cogs.susx.ac.uk/pub/users/johnca/lre98-final.ps), we describe and justify a new (dependency-based) technique which overcomes some of the shortcomings of previous proposals. The technique draws on work on subcategorization standards developed within EAGLES by the lexicon/syntax interest group.
-
The proceedings of the workshop 'The Evaluation of Parsing Systems', held at the 1st International Conference on Language Resources and Evaluation, Granada, Spain, May 1998, are available (in printed form only) as a technical report (Cognitive Science Research Paper 489, School of Cognitive and Computing Sciences, University of Sussex, UK); click here for the table of contents.