The ELSE project aims to propose a generic infrastructure for evaluating Natural Language Processing systems, whether for text or speech processing, using a semi-automatic quantitative black-box methodology in a multilingual context. Among the various kinds of evaluation that could be performed, ELSE's objective is an infrastructure dedicated to technology evaluation, with the expectation that it will: enable researchers and industrialists to compare the pros and cons of the various approaches used in natural language processing, help in assessing the merits and usability of new emerging technologies, contribute to drawing a better picture of the current state of science and technology in the field of natural language processing, and offer a neutral common ground where the main actors can compare their ideas and systems on a fair basis, using a transparent protocol that provides reproducible results from common data.
Better information on current technology and systems, better knowledge of intrinsic task performance limits, the creation and stimulation of communities of actors in the different domains, and the acceleration of the development of new markets are the main benefits to expect from implementing the ELSE proposal. They will be complemented by two other interesting by-products: an increase in the amount of annotated and validated linguistic data (new data will be built at a lower cost from the data produced by the participants of an evaluation campaign) and the production of new evaluation measures and the corresponding software, all based on widely used standards.
ELSE has also drawn up a list of properties that any control task used in an evaluation campaign ought to display; note that both Word Sense tagging and Part-of-Speech tagging fulfill those requirements. As a suitable framework linking the different evaluation tasks, ELSE identified the abstract architecture of a cross-language intelligent information extraction system. Here, information extraction is meant in a broad sense, encompassing both the classical meaning of Information Extraction (IE), i.e. template filling from documents, and Information Retrieval (IR), i.e. document selection. Such a system would have multi-modal input and output and would be able to adapt its behavior intelligently to a particular query, for instance by choosing between classical IE and IR functionality, or by deciding whether to consult a local database or the Web. This architecture finds its origins in the braided architecture produced by the EAGLES evaluation study group. It is important to note that the ELSE architecture is intended to remain abstract: for now it is meant to serve only as a theoretical and organizational reference framework, whereas the evaluation architecture of the new DARPA COMMUNICATOR project is actually implemented in software. The evaluation paradigm supported by COMMUNICATOR is based on assessing the performance increase induced by module replacement in a common reference system.
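To make the module-replacement paradigm concrete, here is a minimal sketch in Python of how the performance gain induced by swapping one module of a common reference system could be measured. All names (run_pipeline, score, compare_modules, the module callables) are hypothetical illustrations, not part of the ELSE or COMMUNICATOR software.

    # Sketch: run the common reference system once with the baseline module and
    # once with a candidate replacement, and report the score difference on a
    # shared test set.  Everything here is illustrative and purely assumed.

    def run_pipeline(module, test_inputs):
        """Run the common reference system with `module` plugged in."""
        return [module(x) for x in test_inputs]

    def score(outputs, references):
        """Any agreed-upon metric; here, simple accuracy against references."""
        correct = sum(1 for o, r in zip(outputs, references) if o == r)
        return correct / len(references)

    def compare_modules(baseline, candidate, test_inputs, references):
        base = score(run_pipeline(baseline, test_inputs), references)
        cand = score(run_pipeline(candidate, test_inputs), references)
        return cand - base  # performance increase induced by the module swap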
Other current results of ELSE are a generic plan for an evaluation campaign (installation, training, dry run, tests, impact study), organizational details for the deployment of the evaluation paradigm (guidelines for selecting the participants, called evaluatees in ELSE terminology, and the evaluators; obtaining and distributing the data, etc.), a tentative list of control tasks, and preliminary cost assessments.
Multilinguality is a problem ELSE is still working on.
How can we evaluate on one language and generalize the results to others?
A solution could be the systematic statement of cross-lingual requirements for the control task. But in that case, how can one avoid running the evaluation protocol for the whole set of possible language combinations? And what about the noise added by the cross-language mapping, which must appear somewhere in the data processing of any participating system? Furthermore, this scheme does not provide any solution for intrinsically monolingual tasks like speech recognition or speech synthesis.
Another approach investigated by ELSE is the use of a common pivot language for evaluation, in addition to the original language for which each participating system was initially developed; results for other languages can then be reasonably inferred from those produced by the evaluation. This scheme was used in the past in the SQALE project. Although it limits the number of evaluations one must perform when tests are sought for a given number of languages, generalizing the results to other languages is not possible in all cases, and such a generalization is not necessarily reliable when the languages involved are not closely related. Linguistic resource identification and the refinement of the candidate task selection are part of ELSE's current work in progress.
Annex
Here is the list of control tasks that could be considered
good potential candidates for starting a first series of evaluation campaigns.
a) Broadcast News Transcription.
b) Cross-Lingual Information Retrieval (CLIR).
c) Text-To-Speech Synthesis.
d) Text Summarization (SUMRZ).
e) Language Model evaluation (word prediction task).
f) POS tagging, lemmatization/morphological analysis, and Word Sense Disambiguation (all three together).
The aims in selecting those were:
Comments 1:
During the workshop, questions were raised only about the Text Summarization (d) control task. As I had some time on the train to think more about it, here are the reasons for proposing Text Summarization and how I think we could evaluate a text summarizer. I see two possible measures:
1) The first measure is based on the information content of the summarized text. It could be run in parallel with the CLIR evaluation: first have the summarizers process part of the database used for CLIR, then compare the precision/recall variations between a given text and its summary across the various IR systems (a sketch of this comparison is given after point 2 below). This of course implies linking the calendars of the CLIR and SUMRZ tasks. Ideally, a good summarizer would produce abstracts that are retrieved in the same way as the full texts, or with better precision and recall.
2) To partially address the problem of text quality, I suggest counting the number of changes (basic transformation steps) required to obtain an error-free text from a summary (the second sketch below illustrates such a count). For a start, I do not think that venturing up to the level of full text quality assessment is safe (or, if it is done, it should be only on an exploratory basis). The rationale behind this is that there is neither a clear definition nor a precise minimal threshold for text quality, while the error-free criterion is clearly defined for everybody. Of course, nothing prevents producing further quality assessments in addition to the previous two measures.
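For the first measure, the following minimal Python sketch compares the precision/recall obtained by an IR system when it is run on the summaries instead of the full texts of the same documents; the smaller the degradation, the better the summarizer has preserved information content. The names ir_system and relevance, and the data layout, are assumptions made for illustration, not ELSE or CLIR campaign software.

    def precision_recall(retrieved, relevant):
        """Standard precision/recall of a retrieved set against relevance judgements."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def summarization_delta(ir_system, queries, full_texts, summaries, relevance):
        """Average precision/recall variation when summaries replace full texts.
        `ir_system(collection, query)` is assumed to return a list of document ids;
        `relevance[q]` is the set of documents judged relevant for query q."""
        deltas = []
        for q in queries:
            p_full, r_full = precision_recall(ir_system(full_texts, q), relevance[q])
            p_summ, r_summ = precision_recall(ir_system(summaries, q), relevance[q])
            deltas.append((p_summ - p_full, r_summ - r_full))
        n = len(deltas)
        return sum(d[0] for d in deltas) / n, sum(d[1] for d in deltas) / n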
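For the second measure, one simple way to count the basic transformation steps is a word-level edit distance between the raw summary and a corrected, error-free version of it. The sketch below is a generic Levenshtein computation, given only as an illustration of the idea; it counts word insertions, deletions and substitutions.

    def correction_cost(summary_words, corrected_words):
        """Number of word insertions, deletions and substitutions needed to turn
        the summary into its error-free corrected version (Levenshtein distance)."""
        n, m = len(summary_words), len(corrected_words)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                cost = 0 if summary_words[i - 1] == corrected_words[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # substitution or match
            prev = cur
        return prev[m]

    # e.g. correction_cost("the summarizer have fail".split(),
    #                      "the summarizer has failed".split()) returns 2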
Comments 2:
Concerning the remarks about the list of criteria a control task ought to display to be suitable for ELSE purposes (quantitative black-box technology evaluation), and the question of whether the ease of performance by a human operator, or a related criterion, should not also apply to the measure itself: I think this is true, but it should be phrased cautiously. In most of the examples I am aware of, the "principle" of the measure is rather simple and easily understood by anybody (WER in Speech Recognition, differences in tag lists after mapping in GRACE or SENSEVAL/ROMANSEVAL, precision/recall in IR), but its "implementation" is rarely straightforward and can lead to very difficult problems (e.g. weighting in SENSEVAL, tokenization/alignment in GRACE, reference data set building for IR, etc.).
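The tag-list example gives an idea of the gap between principle and implementation. The sketch below scores a tagger by mapping both the system's tags and the reference tags to a common tagset and counting agreements; the loop itself is trivial, but in a real campaign such as GRACE the hard work lies in building and validating the tag mapping and in aligning tokenizations, which this illustrative code simply assumes to be identical. The function and the mapping dictionary are hypothetical, not the actual campaign scoring software.

    def mapped_tagging_accuracy(system_tags, reference_tags, to_common):
        """Agreement rate between two tag sequences after mapping each tag to a
        common tagset via the `to_common` dictionary.  Assumes both sequences
        come from an identical tokenization, which is the hard part in practice."""
        assert len(system_tags) == len(reference_tags)
        matches = sum(1 for s, r in zip(system_tags, reference_tags)
                      if to_common.get(s, s) == to_common.get(r, r))
        return matches / len(reference_tags)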
Patrick Paroubek
LIMSI - CNRS
Batiment 508 Universite Paris XI
Dep. Communication Homme Machine
LIMSI - BP 133 - 91403 ORSAY Cedex
phone: (33) (0)1 69 85 81 91
fax: (33) (0)1 69 85 80 88
email: pap@m192.limsi.fr