Re: EAGLESEVAL: A question from George Doddington.
I have been watching this debate, and can't resist jumping in here
to raise some of my favorite themes.
The issue of what is evaluated and how it relates to overall system
performance is really closely linked to who PAYS for the evaluation,
and who USES the evaluation (and for what purpose).
If we assume that CONSUMERS will use the evaluation to choose what
system to buy, then we care about how well it performs overall.
Internal metrics are only interesting if they are very clearly correlated
with overall performance for the end user. In this sense, Ed's metric
of lexicon size is probably a very useful metric -- it's pretty easy
to understand what bad things might happen if a word is not in the lexicon.
And it may be pretty easy to show that systems with bigger lexicons perform
better for the end user on broad-based tasks.
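To make the lexicon-size point concrete, one simple proxy is the out-of-vocabulary (OOV) rate: the fraction of running words that the system's lexicon cannot cover. A minimal sketch (the function name, lexicon, and sample text are mine, purely for illustration):

```python
def oov_rate(lexicon, words):
    """Fraction of tokens not covered by the lexicon."""
    lex = set(lexicon)
    misses = sum(1 for w in words if w not in lex)
    return misses / len(words)

text = "the new system handles the unknown jargon poorly".split()
small_lexicon = ["the", "new", "system", "handles", "poorly"]
# "unknown" and "jargon" are misses: 2 of 8 tokens
print(oov_rate(small_lexicon, text))  # 0.25
```

A bigger lexicon drives this number down, which is one plausible mechanism by which lexicon size could correlate with end-user performance.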
If, on the other hand, we know that FUNDING AGENCIES will consume the
evaluation, then they have a legitimate interest in funding research, which
often means novel and high-risk approaches -- including evaluation! Eventually,
these have to pay off in showing good performance on the user-centered side
(better, cheaper systems). But in the short run, it may be reasonable to
optimize systems along some technical dimension (let's say word error in
speech recognition) in the hopes that there is some magic knee in the curve --
before the knee there is no visible improvement, but suddenly at 20% word
error the system output can be used for topic clustering; at 10% for
information extraction; at lower rates still for spoken interfaces, etc.
But if we had started measuring when systems were at 70% word error, it
might have taken quite a while to see any correlation between word error
and system performance improvement on real tasks.
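For readers who want the metric pinned down: word error is conventionally the word-level edit distance (substitutions, insertions, deletions) between the recognizer's output and a reference transcript, normalized by the reference length. A minimal sketch, with invented example strings:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance over words, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words:
print(word_error_rate("the cat sat on the mat",
                      "the cat sit on mat"))  # 2/6 = 0.333...
```

The knee-in-the-curve argument is precisely that this single number can sit flat against task usefulness for a long stretch and then start predicting it sharply.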
Where funding agencies are the consumer, it is crucial that evaluations
remain neutral across technical approaches -- evaluations must not be biased
towards a given solution. For example, if we really want to get MEANING
from spoken input and if we ONLY measure word error, we prejudge the
solution -- by favoring systems with lower word error, but ignoring systems
that, e.g., use prosody and robust parsing to extract new information in
the input. So word error is probably not a good predictor of meaning
extraction, at least not below a certain word error rate. "Understanding"
will be much more correlated with semantic frame extraction,
or something like that.
Measuring grammar rules -- or even measuring parse
correctness -- falls into this trap: it assumes that you have a grammar at
all -- which prejudges a particular solution. BUT if you are using that
approach and you want to evaluate successive versions of JUST YOUR system,
maybe it is a
reasonable metric -- you the researcher are paying for the evaluation, so
you get to define whatever you believe will help you make progress, even
if it excludes other approaches.
Finally, improvements that don't appear to help in one task may prove
useful in another. Take name identification, for example. It may not help
information retrieval on clean text, but accurate name identification may
provide good "signatures" for topic clustering in spoken document retrieval.
We shouldn't necessarily say it isn't worth doing good name extraction
just because we
haven't yet found the application where it provides real leverage.
So -- we want technical metrics that:
* encourage innovative research and don't prejudge the approach
* are well correlated with improvements in user-centered evaluation
-- but for WHAT TASKS?
* are cheap and cost-effective relative to the community that creates
and uses the metrics.
And we want user-centered metrics that:
* are well correlated with the state of the underlying technology
(that is, the system has some chance of more or less doing the task)
* are correlated with real tasks performed by real sets of users
* enable us to validate or ignore various technical measures
And we need to understand the complex interplay between these styles
of evaluation and the different communities that they serve.
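One way to operationalize "validate or ignore" is to check, across systems, how well a technical metric tracks the user-centered task score. A hedged sketch using rank correlation; the function is a no-ties Spearman's rho, and all the numbers are invented for illustration:

```python
def spearman_rank_correlation(xs, ys):
    """Spearman's rho: correlation of the ranks (assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for rx and ry when no ties
    return cov / var

# Hypothetical per-system numbers: word error (lower is better) versus
# task success on a user-centered evaluation (higher is better).
word_error = [0.35, 0.20, 0.12, 0.08]
task_score = [0.40, 0.55, 0.71, 0.78]
print(spearman_rank_correlation(word_error, task_score))  # -1.0 here
```

A strong (here perfectly negative) correlation is evidence the technical metric is worth keeping for that task; a weak one is grounds for ignoring it, exactly as the bullet above suggests.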
End of diatribe --