Re: EAGLESEVAL: Questions about evaluation
MK>A question from George Doddington:
MK>Several presenters have argued for moving evaluation from the technical
MK>(lower) level toward the application (user) level.
MK>...
I think that the question raised by the presenters was about the distinction
between field evaluation of applications (with real use conditions and
end-user involvement) and technology evaluation (more in the spirit of
the evaluation campaigns that have been organized by NIST and DARPA),
and about their relative importance, which in turn determines the amount of resources
that should be allotted to their development and the way they ought to be
deployed.
In my opinion both are necessary, and technology evaluation in particular
is essential, as it comes before application evaluation in the development
life-cycle of a product, looking at that cycle as a whole, from the very beginning
when a new concept comes into existence in a research lab, up to the moment
when a product is mass-marketed.
Technology evaluation is needed to ease and speed up the transfer of technology
from the lab context to the industry environment, and end-user evaluation
is needed to facilitate the transition from the industry context (pre-commercial
application) to the market environment (final product).
HC>>> - Some people say "It kills science."
HC>how can we have science without measurement?
It is not the paradigm that could kill science, it is the use one makes of it.
If the intention to kill science were there in the first place, I am sure mankind
is imaginative enough to find plenty of other means to do so, so I think the real question
is more about the use made of comparative evaluation results. In fact,
the first users are generally the commissioners themselves.
In the USA the commissioners are clearly identified, but in Europe
who are they?
MK>But I suppose what they mean is that if you're working in this
MK>comparative evaluation paradigm, you tend to concentrate on jazzing up
MK>your system so that it will do better at the next evaluation, rather
MK>than tackling some basic research problem carrying with it the risk of
MK>failure and therefore of not showing any progress at all at the next
MK>evaluation?
If you have developed an idea/concept that brings a real improvement
over the existing solutions to a problem, but unfortunately you do not
have enough resources (or the will) to fine-tune an application based on it so as to
take part in a comparative exercise, I am sure you will find
other people who have the resources and the desire to acquire the know-how,
and who would be ready to deploy your idea and enter it in a comparative
exercise. In fact, comparative evaluation exercises are a good way to stimulate
better communication and cooperation between, on the one hand,
the research community (which develops and tries out new concepts) and,
on the other hand, industry (which implements and markets them).
LC>Competitive evaluation, however, has produced something that never existed
LC>before: an identifiable and highly cooperative *research community*.
True, I have witnessed this phenomenon in Europe too, for instance
in France during the GRACE and AUPELF-Uref evaluation campaigns
and in the UK with SENSEVAL.
LC>I'd like to stress that it's not *laboratories* that are evaluated, but
LC>*implemented techniques*. It's true that, due to poor performance, some
LC>laboratories (including the one I came from in the US) have been forced to
LC>drop certain lines of research in the context of evaluations. But those
LC>same labs have been able to pick up on better-performing techniques,
LC>reorient themselves, and subsequently do very well in follow-on efforts.
LC>It seems to me that this is a good thing -- methods that work can be sorted
LC>out from methods that don't work. This is, after all, our goal as
LC>scientists, is it not?
Comparative evaluation won't change this basic fact; it might only
speed up the process a bit and help make the issue clearer by providing
a common reference ground (training and evaluation data as well as an
evaluation protocol) and by helping communication (evaluation campaigns
are very good at binding a community together). Looking at this
kind of argument the other way round, I would like to point out that comparative evaluation
(particularly with a transparent protocol and reproducible results) is
a way to show that research is still needed in an area and can provide
an argument for preserving some lines of research: for example, while continuous
speech recognition software already exists on the market, the problem of
multi-speaker speech recognition under various noise conditions is
still unsolved.
LC>There is always the danger that some promising but immature technique will
LC>never have a chance to prove itself against an entrenched behemoth. This
LC>is especially dangerous if all available funding becomes tied to evaluation
LC>results. But it seems to me that the danger of this happening in Europe is
LC>low, as many researchers have their salaries covered by direct grants from
LC>their own countries. This guaranteed minimum level of funding is rare in
LC>the US, where the complaint that evaluation stifles innovation is often
LC>heard. The Europeans have a distinct advantage on this count.
In my humble opinion, this is an important point, and space should
always be reserved in an evaluation campaign for the testing of new ideas.
These should be clearly identified and distinguished from the more mature ones.
esteam>In most NLP testing (technical) we tend to have a wishful thinking aspect
esteam>and to some degree forget the state of the art - we want more from the
esteam>system than the technology can provide because we are mostly researchers.
esteam>When developing you need to know where you are at - somewhere the issue of
esteam>competition is healthy - however in science there are things that have to
esteam>brew for themselves, that actually take much more effort and perhaps a time
esteam>versus performance curve is not suited for. If we are going to improve the
esteam>performance of NLP systems we need basic research that is left in peace (not
esteam>completely - it NEEDS funding).
esteam>I would like to get back to the first paragraph - you need to know where you
esteam>are at - also we need to learn from each other. The point of entry for good
esteam>communication between the people working in a field is I believe through
esteam>evaluation.
Here also I agree entirely: technology evaluation comes after the concept
elaboration phase in a development life cycle. In my opinion, technology
evaluation can serve two purposes here: to help identify which
concepts/ideas are ripe for transfer into the industry environment given
the current state of the technology, and, over a longer time span,
to help position the current state of the technology with respect to its intrinsic limits
and sometimes to contribute to the identification of these limits (e.g. the MUC performance ceiling problem).
esteam>There are negative aspects of the dog-runs the US are currently
esteam>operating but there are also good aspects - perhaps we can do them better
esteam>than them (sucker for competition) - or take a completely different approach
esteam>like a panel of advisers evaluating the technology and telling us where we
esteam>are going wrong instead of saying you are number 1 - something we feel
esteam>awfully proud of but doesn't say anything at all.
The problem with the expert-panel solution is that it does not yield reproducible
results, nor can it be replicated and distributed.
With comparative evaluation based on a quantitative protocol and common reference
data, on the other hand, once the evaluation chain has been developed it becomes
another tool that helps in developing new ideas (provided it is made available).
It then functions both as a minimal benchmark test and as a measuring and viewing
instrument for finding correlations between a system modification and a performance change.
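
To make this concrete, here is a minimal sketch (in Python) of what such a
distributed evaluation chain might look like: a fixed scoring protocol applied
to a common gold reference, so that two versions of a system are compared on
exactly the same ground and a performance change can be tied to a specific
modification. The file names, the token-per-line format and the two system
outputs are purely hypothetical.

# Minimal sketch of a reusable evaluation chain: a fixed scoring protocol
# applied to a common gold reference, so that two versions of a system are
# compared on the same ground.  File names and data format are hypothetical.

def load_tags(path):
    """Read one 'token<TAB>tag' pair per line; blank lines are ignored."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                token, tag = line.split("\t")
                pairs.append((token, tag))
    return pairs

def accuracy(gold, system):
    """Fraction of tokens whose system tag matches the gold tag."""
    assert len(gold) == len(system), "system output must be aligned with the reference"
    correct = sum(1 for (_, g), (_, s) in zip(gold, system) if g == s)
    return correct / len(gold)

if __name__ == "__main__":
    gold = load_tags("gold_reference.tags")       # common reference data
    before = load_tags("system_v1_output.tags")   # output before the modification
    after = load_tags("system_v2_output.tags")    # output after the modification
    a1, a2 = accuracy(gold, before), accuracy(gold, after)
    print("v1 accuracy: %.3f" % a1)
    print("v2 accuracy: %.3f" % a2)
    print("change attributable to the modification: %+.3f" % (a2 - a1))

Because the reference data and the scoring function are shared, anyone can
rerun this chain and obtain the same numbers, which is exactly what an expert
panel cannot offer.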
I think panels of experts would be more useful if they were involved in the
development of the protocols and metrics themselves rather than in the
measurement.
LC> A general measure of "understanding" quality has never been developed.
LC> Everyone agrees that this is critical, but there's wide disagreement about
LC> what to measure and when.
LC> This is not because picking a measure is hard,
LC> but rather because we don't really know what we mean by "understanding".
Please allow me to point out that "picking a measure" often seems easy but in
fact it is not, even for apparently simple problems like POS tagger evaluation.
If you need further proof, I invite you to look at the measure definition discussion
in SENSEVAL.
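
To illustrate with a toy example (the sentence, the tags and the scoring
conventions below are made up for the sake of the argument): the very same
tagger output receives different scores under equally defensible definitions
of "accuracy", depending on whether several gold tags are accepted for an
ambiguous token and whether punctuation tokens are scored at all.

# Toy illustration (made-up sentence and tags) of why "picking a measure" is
# harder than it looks for POS tagging: the same system output gets different
# scores under equally reasonable conventions.

# Gold annotation: acceptable tags per token, the first one being preferred.
gold = [("Time",    ["NOUN", "VERB"]),
        ("flies",   ["VERB", "NOUN"]),
        ("quickly", ["ADV"]),
        (".",       ["PUNCT"])]

system = [("Time", "NOUN"), ("flies", "NOUN"), ("quickly", "ADV"), (".", "PUNCT")]

def accuracy(gold, system, lenient, score_punct):
    """lenient: any acceptable gold tag counts as correct, otherwise only the
    preferred one; score_punct: whether punctuation tokens are scored at all."""
    total = correct = 0
    for (_, gold_tags), (_, sys_tag) in zip(gold, system):
        if not score_punct and gold_tags == ["PUNCT"]:
            continue
        total += 1
        if (sys_tag in gold_tags) if lenient else (sys_tag == gold_tags[0]):
            correct += 1
    return correct / total

# Four defensible definitions of "tagging accuracy", three different numbers.
for lenient in (True, False):
    for score_punct in (True, False):
        print("lenient=%-5s score_punct=%-5s accuracy=%.2f"
              % (lenient, score_punct, accuracy(gold, system, lenient, score_punct)))

Which of these numbers is "the" accuracy of the tagger is precisely the kind
of question the measure definition has to settle in advance.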
Many thanks to all for bringing up these interesting issues.
----------
Patrick Paroubek
LIMSI - CNRS
Batiment 508 Universite Paris XI
Dep. Communication Homme Machine
LIMSI - BP 133 - 91403 ORSAY Cedex
phone: (33) (0)1 69 85 81 91
fax: (33) (0)1 69 85 80 88
email: pap@m192.limsi.fr