The ELSE project aims to propose a generic infrastructure for evaluating Natural Language Processing systems, whether for text or speech processing, using a semi-automatic quantitative black-box methodology in a multilingual context. Among the various kinds of evaluation that could be performed, ELSE's objective is an infrastructure dedicated to technology evaluation, with the expectation that it will: enable researchers and industrialists to compare the pros and cons of the various approaches used in natural language processing, help in assessing the merits and usability of new emerging technologies, contribute to drawing a better picture of the current state of science and technology in the field of natural language processing, and offer a neutral common ground where the main actors can compare their ideas and systems on a fair basis, using a transparent protocol that provides reproducible results from common data.
Better information on current technology and systems, better knowledge of intrinsic task performance limits, the creation and stimulation of communities of actors in the different domains, and the acceleration of the development of new markets are the main benefits to expect from implementing the ELSE proposal. They will be complemented by two other interesting by-products: an increase in the amount of annotated and validated linguistic data (new data will be built at a lower cost from the data produced by the participants of an evaluation campaign) and the production of new evaluation measures and the corresponding software, all based on widely used standards.
ELSE has also drawn up a list of properties that any control task used in an evaluation campaign ought to display; note that both Word Sense tagging and Part-of-Speech tagging fulfill those requirements. As a suitable framework linking the different evaluation tasks, ELSE identified the abstract architecture of a cross-language intelligent information extraction system. Here, information extraction is meant in a broad sense, encompassing both the classical meaning of Information Extraction (IE), i.e. template filling from documents, and Information Retrieval (IR), i.e. document selection. Such a system would have multi-modal input and output and would be able to adapt its behavior intelligently to a particular query, for instance by choosing between classical IE and IR functionality, or by deciding whether to consult a local database or the Web. This architecture finds its origins in the braided architecture produced by the EAGLES evaluation study group. It is important to note that the ELSE architecture is intended to remain abstract: for now it is meant to serve only as a theoretical and organizational reference framework, whereas the evaluation architecture of the new DARPA COMMUNICATOR project is actually implemented in software. The evaluation paradigm supported by COMMUNICATOR is based on assessing the performance increase induced by module replacement in a common reference system.
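To make the module-replacement paradigm concrete, here is a minimal sketch in Python of how the performance gain induced by swapping one module of a common reference system could be measured. All names (run_pipeline, score, compare_modules, the module callables) are hypothetical illustrations, not part of the ELSE or COMMUNICATOR software.

    # Sketch: run the common reference system once with the baseline module and
    # once with a candidate replacement, and report the score difference on a
    # shared test set.  Everything here is illustrative and purely assumed.

    def run_pipeline(module, test_inputs):
        """Run the common reference system with `module` plugged in."""
        return [module(x) for x in test_inputs]

    def score(outputs, references):
        """Any agreed-upon metric; here, simple accuracy against references."""
        correct = sum(1 for o, r in zip(outputs, references) if o == r)
        return correct / len(references)

    def compare_modules(baseline, candidate, test_inputs, references):
        base = score(run_pipeline(baseline, test_inputs), references)
        cand = score(run_pipeline(candidate, test_inputs), references)
        return cand - base  # performance increase induced by the module swap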
Other current results of ELSE are a generic plan for an evaluation campaign (installation, training, dry run, tests, impact study), organizational details for the deployment of the evaluation paradigm (guidelines for selecting the participants, called evaluatees in ELSE terminology, and the evaluators; obtaining and distributing the data, etc.), a tentative list of control tasks, and preliminary cost assessments.
Multilinguality is a problem ELSE is still working on.
How can we evaluate on one language and generalize the results to others?
A solution could be the systematic statement of cross-lingual requirements for the control task. But in that case, how can one avoid running the evaluation protocol for the whole set of possible language combinations? And what about the noise added by the cross-language mapping, which must appear somewhere in the data processing of any participating system? Furthermore, this scheme does not provide any solution for intrinsically monolingual tasks like speech recognition or speech synthesis.
Another approach investigated by ELSE is the use of a common pivot language for evaluation, in addition to the original language for which each participating system was initially developed; results for other languages can then be reasonably inferred from those produced by the evaluation. This scheme was used in the past in the SQALE project. Although it limits the number of evaluations one must perform when tests are sought for a given number of languages, generalizing the results to other languages is not possible in all cases, and such a generalization is not necessarily reliable when the languages involved are not closely related. Linguistic resource identification and the refinement of the candidate task selection are part of ELSE's current work in progress.
Annex
Here is the list of control tasks that could be considered
good potential candidates for starting a first series of evaluation campaigns.
a) Broadcast News Transcription.
b) Cross-Lingual Information Retrieval (CLIR).
c) Text-To-Speech Synthesis.
d) Text Summarization (SUMRZ).
e) Language Model evaluation (word prediction task).
f) POS tagging, lemmatization/morphological analysis, and Word Sense Disambiguation (all three together).
The aims in selecting those were:
Comments 1:
During the workshop, questions were raised only about the Text Summarization (d) control task. As I had some time on the train to think more about it, here are the reasons for proposing Text Summarization and how I think we could evaluate a text summarizer. I see two possible measures:
1) The first measure is based on the information content of the summarized text. It could be run in parallel with the CLIR evaluation: first have the summarizers process part of the database used for CLIR, then compare the precision/recall variations between a given text and its summary across the various IR systems (a sketch of this comparison is given after point 2 below). This of course implies linking the calendars of the CLIR and SUMRZ tasks. Ideally, a good summarizer would produce abstracts that are retrieved in the same way as the full texts, or with better precision and recall.
2) To partially address the problem of text quality, I suggest counting the number of changes (basic transformation steps) required to obtain an error-free text from a summary (the second sketch below illustrates such a count). For a start, I do not think that venturing up to the level of full text quality assessment is safe (or, if it is done, it should be only on an exploratory basis). The rationale behind this is that there is neither a clear definition nor a precise minimal threshold for text quality, while the error-free criterion is clearly defined for everybody. Of course, nothing prevents producing further quality assessments in addition to the previous two measures.
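For the first measure, the following minimal Python sketch compares the precision/recall obtained by an IR system when it is run on the summaries instead of the full texts of the same documents; the smaller the degradation, the better the summarizer has preserved information content. The names ir_system and relevance, and the data layout, are assumptions made for illustration, not ELSE or CLIR campaign software.

    def precision_recall(retrieved, relevant):
        """Standard precision/recall of a retrieved set against relevance judgements."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def summarization_delta(ir_system, queries, full_texts, summaries, relevance):
        """Average precision/recall variation when summaries replace full texts.
        `ir_system(collection, query)` is assumed to return a list of document ids;
        `relevance[q]` is the set of documents judged relevant for query q."""
        deltas = []
        for q in queries:
            p_full, r_full = precision_recall(ir_system(full_texts, q), relevance[q])
            p_summ, r_summ = precision_recall(ir_system(summaries, q), relevance[q])
            deltas.append((p_summ - p_full, r_summ - r_full))
        n = len(deltas)
        return sum(d[0] for d in deltas) / n, sum(d[1] for d in deltas) / n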
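For the second measure, one simple way to count the basic transformation steps is a word-level edit distance between the raw summary and a corrected, error-free version of it. The sketch below is a generic Levenshtein computation, given only as an illustration of the idea; it counts word insertions, deletions and substitutions.

    def correction_cost(summary_words, corrected_words):
        """Number of word insertions, deletions and substitutions needed to turn
        the summary into its error-free corrected version (Levenshtein distance)."""
        n, m = len(summary_words), len(corrected_words)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                cost = 0 if summary_words[i - 1] == corrected_words[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # substitution or match
            prev = cur
        return prev[m]

    # e.g. correction_cost("the summarizer have fail".split(),
    #                      "the summarizer has failed".split()) returns 2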
Comments 2:
Concerning the remarks about the list of criteria a control task ought to display to be suitable for ELSE purposes (quantitative black-box technology evaluation), and the question of whether the ease of performance by a human operator, or a related criterion, should not also apply to the measure itself: I think this is true, but it should be phrased cautiously. In most of the examples I am aware of, the "principle" of the measure is rather simple and easily understood by anybody (WER in Speech Recognition, differences in tag lists after mapping in GRACE or SENSEVAL/ROMANSEVAL, precision/recall in IR), but its "implementation" is rarely straightforward and can lead to very difficult problems (e.g. weighting in SENSEVAL, tokenization/alignment in GRACE, reference data set building for IR, etc.).
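The tag-list example gives an idea of the gap between principle and implementation. The sketch below scores a tagger by mapping both the system's tags and the reference tags to a common tagset and counting agreements; the loop itself is trivial, but in a real campaign such as GRACE the hard work lies in building and validating the tag mapping and in aligning tokenizations, which this illustrative code simply assumes to be identical. The function and the mapping dictionary are hypothetical, not the actual campaign scoring software.

    def mapped_tagging_accuracy(system_tags, reference_tags, to_common):
        """Agreement rate between two tag sequences after mapping each tag to a
        common tagset via the `to_common` dictionary.  Assumes both sequences
        come from an identical tokenization, which is the hard part in practice."""
        assert len(system_tags) == len(reference_tags)
        matches = sum(1 for s, r in zip(system_tags, reference_tags)
                      if to_common.get(s, s) == to_common.get(r, r))
        return matches / len(reference_tags)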
Patrick Paroubek
LIMSI - CNRS
Batiment 508 Universite Paris XI
Dep. Communication Homme Machine
LIMSI - BP 133 - 91403 ORSAY Cedex
phone: (33) (0)1 69 85 81 91
fax: (33) (0)1 69 85 80 88
email: pap@m192.limsi.fr