Re: EAGLESEVAL: A question from George Doddington.
I have been watching this debate, and can't resist jumping in here
to raise some of my favorite themes.
The issue of what is evaluated and how it relates to overall system
performance is really closely linked to who PAYS for the evaluation,
and who USES the evaluation (and for what purpose).
If we assume that CONSUMERS will use the evaluation to choose what
system to buy, then we care about how well it performs overall.
Internal metrics are only interesting if they are very clearly correlated
with overall performance for the end user. In this sense, Ed's metric
of lexicon size is probably a very useful metric -- it's pretty easy
to understand what bad things might happen if a word is not in the lexicon.
And it may be pretty easy to show that systems with bigger lexicons perform
better for the end user on broad-based tasks.
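To make the lexicon-size point concrete, one simple proxy is the out-of-vocabulary (OOV) rate: the fraction of running words that the system's lexicon cannot cover. A minimal sketch (the function name, lexicon, and sample text are mine, purely for illustration):

```python
def oov_rate(lexicon, words):
    """Fraction of tokens not covered by the lexicon."""
    lex = set(lexicon)
    misses = sum(1 for w in words if w not in lex)
    return misses / len(words)

text = "the new system handles the unknown jargon poorly".split()
small_lexicon = ["the", "new", "system", "handles", "poorly"]
# "unknown" and "jargon" are misses: 2 of 8 tokens
print(oov_rate(small_lexicon, text))  # 0.25
```

A bigger lexicon drives this number down, which is one plausible mechanism by which lexicon size could correlate with end-user performance.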
If, on the other hand, we know that FUNDING AGENCIES will consume the
evaluation, then they have a legitimate interest in funding research, which
often means novel and high-risk approaches -- including evaluation! Eventually,
these have to pay off in showing good performance on the user-centered side
(better, cheaper systems). But in the short run, it may be reasonable to
optimize systems along some technical dimension (let's say word error in
speech recognition) in the hopes that there is some magic knee in the curve --
before the knee there is no visible improvement, but suddenly at 20% word
error the system output can be used for topic clustering; at 10% for
information extraction; at lower rates still for spoken interfaces, etc.
But if we had started measuring when systems were at 70% word error, it
might have taken quite a while to see any correlation between word error
and system performance improvement on real tasks.
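For readers who want the metric pinned down: word error is conventionally the word-level edit distance (substitutions, insertions, deletions) between the recognizer's output and a reference transcript, normalized by the reference length. A minimal sketch, with invented example strings:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(word_error_rate("the cat sat on the mat",
                      "the cat sit on mat"))  # 2/6 = 0.333...
```

The knee-in-the-curve argument is precisely that this single number can sit flat against task usefulness for a long stretch and then start predicting it sharply.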
Where funding agencies are the consumer, it is crucial that evaluations
remain neutral across technical approaches -- evaluations must not be biased
towards a given solution. For example, if we really want to get MEANING
from spoken input and if we ONLY measure word error, we prejudge the
solution -- by favoring systems with lower word error, but ignoring systems
that, e.g., use prosody and robust parsing to extract new information in
the input. So word error is probably not a good predictor of meaning
extraction, at least not below a certain word error rate. "Understanding"
will be much more correlated with semantic frame extraction,
or something like that.
Measuring grammar rules -- or even measuring parse
correctness -- falls into this trap: it assumes that you have a grammar at
all -- which prejudges a particular solution. BUT if you are using that
approach and you want to evaluate successive versions of JUST YOUR system,
maybe it is a
reasonable metric -- you the researcher are paying for the evaluation, so
you get to define whatever you believe will help you make progress, even
if it excludes other approaches.
Finally, improvements that don't appear to help in one task may prove
useful in another. Take name identification, for example. It may not help
information retrieval on clean text, but accurate name identification may
provide good "signatures" for topic clustering in spoken document retrieval.
We shouldn't necessarily say it isn't worth doing good name extraction
just because we
haven't yet found the application where it provides real leverage.
So -- we want technical metrics that:
* encourage innovative research and don't prejudge the approach
* are well correlated with improvements in user-centered evaluation
-- but for WHAT TASKS?
* are cheap and cost-effective relative to the community that creates
and uses the metrics.
And we want user-centered metrics that:
* are well correlated with the state of the underlying technology
(that is, the system has some chance of more or less doing the task)
* are correlated with real tasks performed by real sets of users
* enable us to validate or ignore various technical measures
And we need to understand the complex interplay between these styles
of evaluation and the different communities that they serve.
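One way to operationalize "validate or ignore" is to check, across systems, how well a technical metric tracks the user-centered task score. A hedged sketch using rank correlation; the function is a no-ties Spearman's rho, and all the numbers are invented for illustration:

```python
def spearman_rank_correlation(xs, ys):
    """Spearman's rho: correlation of the ranks (assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for rx and ry when no ties
    return cov / var

# Hypothetical per-system numbers: word error (lower is better) versus
# task success on a user-centered evaluation (higher is better).
word_error = [0.35, 0.20, 0.12, 0.08]
task_score = [0.40, 0.55, 0.71, 0.78]
print(spearman_rank_correlation(word_error, task_score))  # -1.0 here
```

A strong (here perfectly negative) correlation is evidence the technical metric is worth keeping for that task; a weak one is grounds for ignoring it, exactly as the bullet above suggests.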
End of diatribe --