The TEMAA project has achieved its two main goals: to provide a framework for thinking about evaluation of NLP tools, and to produce a partial implementation of an overall tool for carrying out such an evaluation with the goal of automating the evaluation as far as possible.
The framework is shared with the EAGLES Evaluation working group (the EAGLES initiative is an LRE activity) and has been further refined and assessed by TEMAA. The TEMAA project has shown that the general framework, which builds on an ISO standard, provides a solid basis for evaluation design. It has also become clear that much more work has to be invested in methodologies for requirements analysis for NLP tools.
In the original plan, the development of data material for testing applications such as spelling checkers, and of tools for running those tests, was foreseen, but not much thought was given to the environment for carrying out an evaluation. During the project lifetime, however, this aspect was further developed: first with ASCC, the Automated Spelling Checker Checker, and Errgen, a set of programs which automatically convert a list of valid words into a list of invalid ones and then compare the results of using the spelling checker, and later in the project with the introduction of the PTB, a Parameterisable Test Bed which provides an environment for carrying out evaluations. In addition, an object-oriented implementation of the framework model, the Evaluation Reporter (ER), was developed for the automatic generation of evaluation reports.
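As an illustration of the idea behind error generation, the following is a minimal sketch in Python. It is not the project's Errgen or ASCC code: the two corruption rules (adjacent-letter transposition and single-letter deletion), the function names, and the `accepts` interface to a spelling checker are all assumptions made for the example.

```python
from typing import Callable, Iterable


def transpose_adjacent(word: str) -> list[str]:
    """Return every variant of `word` with one pair of adjacent letters swapped."""
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    ]


def delete_one(word: str) -> list[str]:
    """Return every variant of `word` with exactly one letter removed."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


def generate_errors(valid_words: Iterable[str]) -> list[str]:
    """Corrupt each valid word and keep only results that are not valid words.

    The rule set is an assumption; the rules actually used in the project
    are not reproduced here.
    """
    lexicon = set(valid_words)
    errors = set()
    for word in lexicon:
        for candidate in transpose_adjacent(word) + delete_one(word):
            # Discard empty strings and corruptions that happen to be real words.
            if candidate and candidate not in lexicon:
                errors.add(candidate)
    return sorted(errors)


def error_detection_rate(
    accepts: Callable[[str], bool], invalid_words: list[str]
) -> float:
    """Fraction of invalid words that the checker rejects.

    `accepts` is a hypothetical stand-in for a spelling checker: it returns
    True when the checker regards the word as correctly spelled.
    """
    if not invalid_words:
        return 0.0
    flagged = sum(1 for word in invalid_words if not accepts(word))
    return flagged / len(invalid_words)
```

A checker under test would be wrapped as a function from a word to a boolean and passed to `error_detection_rate` together with the generated list. Filtering out corruptions that are themselves valid words matters, because otherwise a checker would be penalised for correctly accepting them.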
The project believes it has achieved its two goals in a very satisfactory way. In particular, we are pleased to see that the two goals fitted together so well: working with the concrete evaluation tools and packages has been very closely related to working with the formal and theoretical parts of the framework. This is one of the cases where research and the application of its results tie in very well together.
The report starts with a description of the model of NLP evaluation which the project has developed (Section 2). The underlying ISO evaluation framework which TEMAA has drawn on is dealt with first, followed by an account of the original contributions made by TEMAA in terms of a formalisation of the model. Specific aspects of the TEMAA framework are then treated in detail. These include our view of the task model which should be reflected in any practical evaluation, as well as the way in which the ISO quality characteristics of software products should be adapted to the evaluation of NLP products.
The description of the TEMAA framework is then made concrete through a number of case studies, where the model has been applied to the evaluation of spelling checkers, grammar checkers, and research projects in the field of information retrieval (Section 3). The degree of detail and concreteness in the three case studies varies. The work carried out on spelling checkers has constituted the main practical application experimented with in the project, and has thus received most attention. The study on grammar checkers, on the other hand, is less geared towards practical experimentation, and should mainly be seen as a test of the model developed for spelling checkers. The third study, finally, deals with a quite different object of evaluation, namely projects rather than products, and therefore provides a quite different perspective from which the TEMAA model can be viewed. The important conclusion of all three studies, however, is that the model constitutes a solid basis on which practical evaluations can be performed. Finally, this section contains a description of the software programs developed and used in the project.
In the light of the practical results described in the case studies, the final section of the report (Section 4) reviews the theoretical assumptions made in the project, and indicates possible developments and areas for future research.
Louis des Tombe (STT)
Shona Douglas (LTG)
Maghi King (ISSCO)
Steven Krauwer (STT)
Bente Maegaard (CST)
Sandra Manzi (ISSCO)
Patrizia Paggio (CST)
Gurli Rohde (CST)
Merle Tenney (Claris)
Nancy Underwood (CST)
The software programs described in the report have been developed by Louis des Tombe (ASCC, ET, Errgen), Steven Krauwer (PTB), Shona Douglas and Gurli Rohde (ER).
General editing has been carried out by Bente Maegaard and Patrizia Paggio.