Karen's presentation examined the relationship between intellectual strategies and organisational strategies in the design and implementation of evaluations.
It was based on experience with the ARPA/DARPA TREC series of evaluations, in which the participants had profited greatly from each other's experience. This was the case even for those participants who had not been involved from the beginning but had joined in later.
However, the TREC evaluations were expensive, especially when the number of participating systems is taken into account. They were also very closely tied to a particular application, that of information retrieval. An alternative would be to take a "task free" product which nonetheless is likely to have an analogue among the components of task-oriented applications, such as a parser, and to compare its output with that produced by the task-oriented application. In this way, evaluation becomes less subjective, since all products are compared to the same "standard" output.
The intellectual methodology underlying these proposals is reported in detail in Sparck Jones and Galliers, 1996, where examples of how the methodology can be applied are also given. Here, only a very brief summary is given. However, Karen's talk was quite dense, and made use of terms which are not necessarily familiar to all readers of the present report. For that reason, the author of the report has taken the liberty of inserting into the text passages from Sparck Jones and Galliers where it was thought that this would be helpful to the reader. Such quotations are indented and followed by the page on which they occur.
There is a close relation between the evaluation remit and the evaluation design, both of which can be specified by unpacking them into a series of questions. The remit involves establishing the motivation for the evaluation, the perspective to be taken (the task to be accomplished, financial implications etc.), in whose interest the evaluation is to be performed (the developer, for example) and who is to be the consumer of the evaluation (the manager, for example).
Defining the goal of the evaluation will determine its orientation (whether it is to be extrinsic or intrinsic), the kind of evaluation (investigation or experiment), the type of the evaluation (black box or glass box), the yardstick to be used (exemplar, judged and so on), the style of the evaluation (whether it is to be indicative or exhaustive) and the mode (quantitative or qualitative).
Extrinsic and intrinsic:
Intrinsic criteria are those relating to a system's objective, extrinsic criteria are those relating to its function, i.e. to its role in relation to its set-up's purpose. (Author's gloss: a set-up is a system together with the setting in which it is to be used.) [...]
Thus for a translation system, intrinsic criteria could be ones applying to the quality of the translation, and extrinsic criteria those applying to the ease with which post-editors could tweak these, while for the translation set-up as a whole intrinsic criteria could refer, e.g., to the speed of production of the final translation, and extrinsic criteria to the value/acceptability/utility of the final translation for some purpose such as literature scanning.
(p.19)
Investigation and experiment:
An investigation is typically concerned with an operational system or set-up and with establishing its performance characteristics; an experiment is designed to answer not the question: what is happening? but the question: what would happen if so and so?
(p.22)
Black box and glass box:
[black box evaluation] considers only system input-output relations without regard to the specific mechanisms by which the outputs were obtained while [glass box evaluation] examines the mechanisms linking input and output.
(p.26)
Exemplar vs. judged:
[...] practices have also been developed [...] which are designed to promote consistency and comparability on a larger scale, through the use of, for instance, common measures, common baselines or benchmarks (e.g. 'rock bottom' or 'par' indexing and searching strategies) or other paradigms (e.g. 'standard human professional results') which can be used as performance exemplars or norms for comparison when alternative or new automatic systems are proposed.
(pp. 25-26)
Qualitative and quantitative:
Quantitative measures naturally involve statistical analysis both to establish significance and to indicate import, for instance that of the size of performance differences. Qualitative measures are based on observation or interviewing, and are broadly designed to obtain a more holistic, less reductive or fragmented view of the situation [...]
(p.61)
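As a small illustration of the quantitative mode, the following sketch is the author's own and is not drawn from Sparck Jones and Galliers: comparing two systems typically rests on per-query scores and a significance test over their paired differences. The average-precision figures below are invented, and the test shown is a simple paired sign test.

```python
# A minimal, illustrative sketch: paired comparison of two retrieval systems
# on the same queries using a sign test. All scores are invented.
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided paired sign test on per-query scores; ties are discarded."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b                      # ties excluded
    k = max(wins_a, wins_b)
    # probability of a split at least this extreme under the null (p = 0.5)
    p = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * p)                   # two-sided

# Hypothetical average-precision scores for ten queries
system_a = [0.41, 0.35, 0.52, 0.28, 0.61, 0.47, 0.33, 0.55, 0.40, 0.49]
system_b = [0.38, 0.30, 0.50, 0.30, 0.58, 0.42, 0.29, 0.51, 0.37, 0.44]

print(f"mean A = {sum(system_a)/len(system_a):.3f}, "
      f"mean B = {sum(system_b)/len(system_b):.3f}, "
      f"sign-test p = {sign_test(system_a, system_b):.3f}")
```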
Evaluation design questions can be grouped into four classes. The first class
defines the evaluation subject at the level of detail necessary to conduct the evaluation. Thus the questions here ask what the subject's ends are, for instance for systems their objective or function; what the subject's context, inanimate or animate, is; and what the subject's constitution, as structure or process, is.
(p.196)
Characterising context and constitution leads to the identification of questions about performance factors, i.e. about environment variables and system parameters.
[...] the development of an appropriate descriptive framework and analytic methodology for carrying out evaluations centres on the notion of performance factors. This framework has been used to separate the system, with its constituent parameters and their settings, from the system's environment, i.e. from its set-up as seen from the point of view of what is given as properties of or constraints on the system's inputs and outputs.
(pp. 22-23)
The second class of question about evaluation design deals with the choice of evaluation criteria and of measures and methods.
The third deals with the choice of evaluation data.
Finally, a fourth class of questions deals with the procedure, the sequence of actions to be undertaken, for carrying out the evaluation.
If all this is analysed in terms of effort and cost, instantiating the evaluation remit is not very costly. However, instantiating the design through real data and a concrete procedure can be very costly. For example, the creation of reference transcripts for evaluation of speech recognition systems or of facts for information extraction systems is expensive. What this comes down to is that the creation of answer data of any sort is likely to prove expensive.
The crux of all this is that a system or system component can be evaluated either in terms of its internal objective or in terms of its external function. For example, in document retrieval, an internal objective might be to select terms, whilst the external function is to deliver the appropriate set of documents to the user.
Thus while the message processing system is the system proper as far as its objective 'message processing' (however specified in detail) is concerned, it is also part of a larger set-up, involving its users and any other apparatus, which in turn has its own encompassing purpose(s), namely to manage road use, etc., and within which the system has a function.
(pp. 11-13)
The internal objective may be very limited. It presupposes an environment which is then not taken into account during the evaluation. Evaluating by external function is less limited, and requires knowledge of the environment, for example of document types of interest or of user likes and dislikes. Once these are known, a test grid can be drawn up based on comparing environment variables to system parameters.
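To make the idea of a test grid concrete, the author offers the following toy sketch, which is not part of the talk: it crosses a handful of hypothetical environment variables with hypothetical system parameter settings to enumerate the cells an evaluation would have to cover. All names and values are invented.

```python
# A toy sketch of a test grid crossing environment variables with system
# parameter settings. The variables and their values are hypothetical.
from itertools import product

environment_variables = {
    "document_type": ["abstracts", "full_text"],
    "user_expertise": ["novice", "expert"],
}
system_parameters = {
    "stemming": [True, False],
    "query_expansion": ["none", "thesaurus"],
}

grid = [
    dict(zip(list(environment_variables) + list(system_parameters), combo))
    for combo in product(*environment_variables.values(),
                         *system_parameters.values())
]
print(len(grid), "test cells, e.g.", grid[0])
```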
The next section of Karen's talk concerned a proposal put forward in the context of a working party set up to consider evaluation within the Fourth Framework Programme of the EU. The working party suggested an organisational approach which was intended to reduce effort and cost and to increase information synergy (the perceived advantages of TREC type evaluations) whilst avoiding being bound to a single task.
Basing the evaluation framework on a presumed modular architecture was not felt to be a satisfactory approach, since it was difficult to avoid one module being influenced by the behaviour of other modules. It was also decided not to propose a universal output requirement, since this came too close to imposing a single task.
The alternative was a 'braided chain' architecture, which allowed for overlapping strands covering specific tasks or sub-tasks, for shared data inputs or outputs, and for multiple points at which tests could be applied. The main advantage of the braided chain approach was thought to be its flexibility.
The final section of the presentation was devoted explicitly to experience with the TREC evaluations and to what could be learned from them.
TREC is an abbreviation for the Text Retrieval Conferences, a series of evaluation exercises organised within the ARPA/DARPA framework. The first conference took place in 1991, TREC-6 in 1997. The exercises were designed to be open with respect to the tasks to be performed, broad in the payoff expected, large in terms of the number of people involved and generous in the amount of help offered to participants.
The structure was in the form of 'hub and spokes': there were core tasks (the hub) and specialised tracks (the spokes) in which participants could choose to be involved. In TREC-1 there were only the two core tasks of ad hoc text retrieval and routing. By TREC-4 the tracks had developed to include languages other than English, data corruption and database merging. By TREC-6 the core task was still ad hoc, but the tracks now covered the treatment of Chinese, cross-language retrieval, filtering, high-precision retrieval, interactive systems, use of natural language processing, retrieval of spoken documents and treatment of very large corpora. Roughly twenty teams participated in TREC-1, and about fifty in TREC-6.
The organisational base was provided by the federal government, with D. Harman of NIST (National Institute of Standards and Technology) playing a central role. NIST had global control and was responsible for policy and for design of the exercise, as well as for management and administration. NIST was also responsible for providing test data in the form of inputs and answer outputs. This was its major and heaviest task.
Specific regions have taken responsibility for the tracks, and have provided specific design, local co-ordination and sometimes data.
The TREC exercises are currently flourishing, and the effect they have had is striking. There have been solid retrieval findings allowing comparisons along the dimensions of scale, time and range. The TREC community is a very lively, interacting community, whose members not only learn from one another but are also prepared to share their resources as well as their experience. The conferences have served to create a cumulative data resource in the form of both evaluation material and results. There is thus a high yield from the investment in creating data, both now and in the future.
In fact, the TREC exercises seem to have turned into what Karen called a 'virtuous circle': as more teams participated, the answer pool grew bigger, more technique checking became possible, and more new teams wanted to join. Consolidation involved building on the aims, the methods and the results achieved.
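To make the notion of a growing answer pool concrete, the author adds the sketch below, in the spirit of the pooling practice commonly associated with TREC-style exercises: the documents to be judged are taken as the union of the top-ranked documents from each submitted run, so the pool grows as more teams take part. The runs and the pool depth shown are invented for illustration.

```python
# An illustrative sketch of how an answer pool might grow with participation.
# The runs and the pooling depth are hypothetical.
def pool(runs, depth=100):
    """Union of the top-`depth` documents from each submitted ranking."""
    pooled = set()
    for ranking in runs:
        pooled.update(ranking[:depth])
    return pooled                      # these documents would then be judged

run_1 = ["d3", "d7", "d1", "d9"]
run_2 = ["d7", "d2", "d3", "d8"]
run_3 = ["d5", "d7", "d6", "d1"]

print(len(pool([run_1, run_2])), "documents to judge with two runs")
print(len(pool([run_1, run_2, run_3])), "with three runs")
```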
A particular example of one of the tracks was spoken document retrieval, which involved both sharing of resources and designing a way of comparing results. The LDC (Linguistic Data Consortium) had been responsible for providing reference transcripts, IBM had provided baseline transcripts, and NIST the basic engine. Each team ran its system against the reference transcripts, the baseline transcripts and its own corpus. The central task, defined by NIST, was to retrieve a known item. Several measures were used to assess relative success at this task.
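The report does not say which measures these were. One measure commonly applied to known-item search is the mean reciprocal rank of the known item across queries, which the author sketches below with invented rankings purely as an illustration.

```python
# A hedged sketch of mean reciprocal rank (MRR) for known-item retrieval.
# The rankings and known items are invented.
def mean_reciprocal_rank(rankings, known_items):
    """Average of 1/rank of the known item; contributes 0 if not retrieved."""
    total = 0.0
    for ranking, item in zip(rankings, known_items):
        if item in ranking:
            total += 1.0 / (ranking.index(item) + 1)
    return total / len(rankings)

rankings = [["d4", "d2", "d9"],        # known item d2 at rank 2
            ["d1", "d5", "d7"],        # known item d5 at rank 2
            ["d8", "d3", "d6"]]        # known item d6 at rank 3
known_items = ["d2", "d5", "d6"]
print(f"MRR = {mean_reciprocal_rank(rankings, known_items):.3f}")
```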
When it came to considering how the TREC experience might apply to Europe, Karen suggested that retrieval across languages was a suitable overall task. In fact, TREC-6 had a track along those lines, where the text database was multilingual, the request was formulated in a single language, but the output was multilingual. There had been European participants in this track, and the coordinator was European. European resources had provided input material and output answers. The 1997 exercise had been mainly concerned with feasibility. There was full automation of the retrieval system, but no attempt at translation of texts. Surprises in the results included the realisation that even weak alignment was useful, and that treating French as ill-spelt English was a productive way of tackling the cross-language problem.
In braided-chain terms, the cross-language retrieval task provides a foundation thread based on retrieval, translation and extraction. Translation may be of the query or of the retrieved document. Subtasks are, for example, the treatment of formatting, the creation of a thesaurus and so on. In this perspective, one person's output can be considered as another person's input.
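A toy rendering of this braided-chain view, offered by the author and not taken from the working party's proposal, treats each strand as a task with a typed input and output; the points where one strand's output matches another's input are then exactly the shared data points at which tests could be applied. The strand names and data labels are invented.

```python
# A toy model of a braided chain: strands with shared inputs/outputs.
# All names are hypothetical.
strands = {
    "retrieval":   {"input": "query",                "output": "retrieved_documents"},
    "translation": {"input": "retrieved_documents",  "output": "translated_documents"},
    "extraction":  {"input": "translated_documents", "output": "filled_templates"},
}

# Points where one strand's output is another strand's input,
# i.e. candidate points at which tests could be applied.
shared = {
    (a, b)
    for a in strands for b in strands
    if a != b and strands[a]["output"] == strands[b]["input"]
}
print("shared data points:", sorted(shared))
```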
The organisation of tracks like the cross-language track raises a number of important issues. Devolution of design and preparatory work to the regions reduces the effort demanded of the central organisation but can also weaken management, thus affecting, for example, the ability to make and maintain a timetable. Devolution also reduces central cost, but it may dilute the data used, since it is less easy, for example, to control the quality of answer data. And data preparation still implies effort and cost, even if it is undertaken away from the centre. Another issue is that of constraints imposed on the data, for example by the file contents available or by the real needs of the participants. Participating may also put constraints on the teams, by forcing them to shift tasks or environments.
Nonetheless, when all is considered, TREC shows us that the gains far outweigh the losses, and that there are generic rewards enjoyed by both participants and sponsors, as well as the more specific rewards that come from performing well or obtaining favourable notice for one's own system.
A lot of discussion addressed the question of whether experience with evaluation in the field of Information Retrieval could be directly exported to evaluation of other language engineering products, which did not easily lend themselves to measures depending on closeness to a set of 'right' answers. In the case of machine translation, for example, it seems close to impossible to define ways of comparing system output to human output.
Karen suggested that the TIPSTER series of evaluations might provide a counter-example. TIPSTER is concerned with the evaluation of summary-making programs. Evaluators are given the outputs of various systems and asked to choose the one summary which best represents the original document.
In response it was suggested that different evaluators might give different weights to the presence or absence of some specific element in the summary, and that unless it was decided in advance what the key elements were, this could lead to skewed results.
It was also objected that it was difficult to decide what the expressed needs of machine translation users might be. Any given human translation cannot be taken as a standard, since different human translators will translate the same text differently, yet correctly.
It was asked whether the cost of the TREC evaluations was known. Karen replied that it was difficult to come up with an exact figure. However, some twelve people work full time on the organisation and running of the tests, and NIST is government funded. This gives an order-of-magnitude idea of the cost.
K. Sparck Jones and J.R. Galliers (1996): Evaluating natural language processing systems. LNAI 1083, Springer.
R. Crouch, R. Gaizauskas and K. Netter (1995): Report of the Study Group on Assessment and Evaluation. Submitted to EC DG XIII. Cmp-lg: http://xxx.lanl.gov/cmp-lg/9601003