RE: EAGLESEVAL: A question from George Doddington.
It took me much less time to read what you wrote than it would have taken me
to write it: Thanks a lot for the good answer.
Would we need to make it a final statement?
Dr. Christian Dugast
Philips Speech Processing
Tel: +49 (241) 88 71 251
Fax: +49 (241) 88 71 141
On Thursday, August 27, 1998 12:58 AM, Lynette Hirschman wrote:
> I have been watching this debate, and can't resist jumping in here
> to raise some of my favorite themes.
> The issue of what is evaluated and how it relates to overall system
> performance is really closely linked to who PAYS for the evaluation,
> and who USES the evaluation (and for what purpose).
> If we assume that CONSUMERS will use the evaluation to choose what
> system to buy, then we care about how well it performs overall.
> Internal metrics are only interesting if they are very clearly correlated
> with overall performance for the end user. In this sense, Ed's metric
> of lexicon size is probably a very useful metric -- it's pretty easy
> to understand what bad things might happen if a word is not in the lexicon.
> And it may be pretty easy to show that systems with bigger lexicons perform
> better for the end user on broad-based tasks.
> If, on the other hand, we know that FUNDING AGENCIES will consume the
> evaluation, then they have a legitimate interest in funding research,
> which often involves novel and high-risk approaches -- including
> evaluation! Eventually, these approaches have to pay off in showing
> good performance on the user-centered side (i.e., better, cheaper
> systems). But in the short run, it may be reasonable to try to
> optimize systems along some technical dimension (let's say word error
> in speech recognition), in the hopes that there is some magic knee in
> the curve -- below the knee there is no improvement, but suddenly at
> 20% word error the system can be used for topic clustering; at 10% for
> information extraction; at 3% for spoken interfaces, etc. But if we had
> started measuring when systems were at 70% word error, it might have
> taken quite a while to see any correlation between word error and
> system performance improvement on real tasks.
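> (For concreteness, word error here means the word-level edit distance
> -- substitutions, insertions, and deletions -- between the recognizer's
> hypothesis and a reference transcript, divided by the number of
> reference words. A minimal sketch, assuming simple whitespace
> tokenization; the function name is illustrative, not any particular
> toolkit's implementation:)

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```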
> Where funding agencies are the consumer, it is crucial that evaluations
> not prejudge
> technical approaches -- evaluations must not be biased towards a given
> solution. For example, if we really want to get MEANING from spoken
> input and if we ONLY measure word error, we prejudge the solution -- by
> favoring systems with lower word error,
> but ignoring systems that, e.g., use prosody and robust parsing to
> capture the
> new information in the input. So word error is probably not a good
> metric for meaning extraction, at least not below a certain word error
> rate. "Understanding" will be much more correlated with semantic frame
> extraction or something like that.
> Measuring grammar rules -- or even measuring parse correctness --
> falls into this trap: it assumes that you have a grammar with rules
> -- which prejudges a particular solution. BUT if you are using that
> grammar, and you want to evaluate successive versions of JUST YOUR
> system, maybe it is a reasonable metric -- you the researcher are
> paying for the evaluation, and therefore you get to define whatever
> you believe will help you make progress, even if it excludes other
> approaches.
> Finally, improvements that don't appear to help in one task may prove
> useful in another. Take name identification, for example. It may not
> help information retrieval on clean text, but accurate name
> identification may provide good "signatures" for topic clustering in
> spoken document retrieval. We shouldn't necessarily say it isn't worth
> doing good name extraction just because we haven't yet found the
> application where it provides real leverage.
> So -- we want technical metrics that:
> * encourage innovative research and don't prejudge the approach
> * are well correlated with improvements in user-centered evaluation
> -- but for WHAT TASKS?
> * are cheap and cost effective relative to the community that creates
> and uses the metrics.
> And we want user-centered metrics that:
> * are well correlated with the state of the underlying technology
> (that is, the system has some chance of more or less doing the task)
> * are correlated with real tasks performed by real sets of users
> * enable us to validate or ignore various technical measures
> And we need to understand the complex interplay between these styles
> of evaluation and the different communities that they serve.
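> (The "well correlated" criteria above can be checked empirically:
> score several systems on both the technical metric and the
> user-centered task, then measure the correlation between the two
> sets of scores. A minimal sketch; the five word-error and
> task-success figures below are invented purely for illustration:)

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical numbers: word error rates for five systems, and
# user task-success rates for the same five systems.
wer = [0.70, 0.40, 0.20, 0.10, 0.03]
task_success = [0.10, 0.15, 0.45, 0.70, 0.85]

# A strongly negative r would suggest the technical metric tracks
# user-centered performance (lower word error, higher success).
r = pearson_r(wer, task_success)
```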
> End of diatribe --