Re: EAGLESEVAL: A question from George Doddington.
Background:
>Eduard Hovy wrote:
>
>> I can think of several places in which people make progress on underlying
>> levels that have no real effect on the application level. For example,
>> many groups have insisted that detailed syntactic parsing is the key,
>> without demonstrating significant increases in output quality (or a
>> significant decrease in speed). Ditto for the heavily semanticized
>> Interlingua-based approaches, whose proponents spend years building ontologies
>> and fancy lexicons but still show no improvement on general text in the
>> large. (I happen myself to believe that semantics is the right direction
>> to go, but I do not believe that we will get adequately large semantic
>> knowledge bases and/or lexicons using current manual techniques.)
At 6:02 PM +0200 8/26/98, Maghi King wrote:
>But doesn't that mean that they HAVEN'T made progress on the underlying
>levels?
It means I am a skeptic. They *may* have, but if the measurements don't
show it, then who can tell? Who can tell, for that matter, whether the
things they are working on are even the proper underlying levels to work on?
>Or are you saying that their parsers or ontologies get better and better
>(judged I suppose by some measure other than system improvement), but
>the overall result of the application doesn't improve?
Yes indeed, this happens. What kinds of internal evaluations did the
Eurotra people produce to see if they were making progress? Number of
grammar rules? Size of lexicon? And yet, did the system end up getting
steadily better every six months? Did it end up where SYSTRAN was,
despite SYSTRAN's somewhat older methods?
>So we'd say that they have to be wrong about what the appropriate
>underlying technology really is?
That is always possible. For example, the IBM Candide system contained
nothing that anyone would recognize as a grammar (its closest equivalent
was a bigram model of English and a fertility model of French-English).
Nor did it contain any morphological knowledge. Yet it performed
French-to-English translation just about as well as SYSTRAN did, in the
large DARPA evaluation of 1994. So who knows what underlying technology
is really required? Breeding faster horses becomes irrelevant when the
horseless carriage is invented.
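(For concreteness, here is a minimal sketch, in modern Python and purely
my own illustration rather than Candide's actual model, of the kind of
bigram language model that stood in for a grammar; the class name and the
toy training data are invented for the example.)

from collections import defaultdict

class BigramLM:
    """Scores a sentence by how plausible each word is given its
    predecessor: the statistical stand-in for a grammar."""

    def __init__(self):
        self.bigram = defaultdict(lambda: defaultdict(int))  # prev -> curr counts
        self.unigram = defaultdict(int)                      # prev counts

    def train(self, sentences):
        for words in sentences:
            for prev, curr in zip(["<s>"] + words, words + ["</s>"]):
                self.bigram[prev][curr] += 1
                self.unigram[prev] += 1

    def prob(self, prev, curr):
        # Add-one smoothing so unseen word pairs keep a small nonzero probability.
        vocab = len(self.unigram) + 1
        return (self.bigram[prev][curr] + 1) / (self.unigram[prev] + vocab)

    def score(self, words):
        # Product of conditional probabilities; higher means more English-like.
        p = 1.0
        for prev, curr in zip(["<s>"] + words, words + ["</s>"]):
            p *= self.prob(prev, curr)
        return p

lm = BigramLM()
lm.train([["the", "cat", "sat"], ["the", "dog", "sat"]])
# The grammatical word order scores higher than the scrambled one:
print(lm.score(["the", "cat", "sat"]) > lm.score(["cat", "the", "sat"]))  # True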
Summary, and then I will shut up:
My belief is this: you can work on whatever aspect you like. If you're
smart, you will work on something that will help get the system in the
magical corner of HighQuality-HighDomainCoverage-FullAutomation. But
what is it that will be sure to get you there? Since *no-one has reached
that corner yet*, despite much work, it looks reasonable to say that we
don't know. And possibly also reasonable to suspect that the traditional
methods are not going to get us there.
So, not knowing, you are forced to evaluate the effects of your work,
whatever you choose to work on, and show that your effects do in fact
move the system closer to the magic corner. I believe that `internal'
measures such as syntactic coverage and size of ontology and so on are
(much) weaker indicators of movement toward success than black-box ones
of the application-oriented kind. (This is not to say I am advocating
spending time on interfaces and so on. But I believe the parameters of
the application and situation define magical corners--what Ken and I
called niches of functionality in our Crummy MT paper--toward which
you should drive your system development.)
That is, if you can *prove* the direct correlation between syntactic
coverage and application success, good for you. Then work on syntax.
In my experience, you reach the syntax plateau after about 2 person-
years, roughly 500 grammar rules (though perhaps the new statistical
learning methods of Collins or Hermjakob make a difference here), and then you
need something more. To my knowledge, the best overall indicator of
success, crudely speaking, is lexicon size (words, phrases, the lot).
The older the system, by and large, the bigger the lexicon, and the
better it performs. Most recent research has developed ways to create
systems more quickly, but not to make their output better... SYSTRAN
still wins in the most natural open-domain races.
So why evaluate?
To see whether what you are working on helps, and whether your work
is going places.
Evaluate doing what?
Doing a real application, factoring out, as far as possible, things
like machine speed, interfaces, input data formatting, and so on.
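(To make "black-box" concrete: a minimal sketch, in Python and entirely
my own illustration rather than anything used in the DARPA evaluations,
of scoring MT output against a human reference translation by word-level
edit distance. The measure never looks inside the system; lower is
better, and you track it release over release.)

def word_edit_distance(hyp, ref):
    """Minimum insertions, deletions, and substitutions needed to turn
    the hypothesis translation into the reference, counted in words."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = distance between the first i words of h and first j of r
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a word
                           dp[i][j - 1] + 1,         # insert a word
                           dp[i - 1][j - 1] + cost)  # substitute a word
    return dp[len(h)][len(r)]

print(word_edit_distance("the cat sat on mat", "the cat sat on the mat"))  # 1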
What if it takes 100 years to prove the utility of my magic method,
because I cannot build it faster (the Interlingua objection)?
Invent techniques to build it faster, or go away for 100 years and
come back and prove your case.
For my money, for most applications, a shallow semantic transfer
approach works fine, using rules built up automatically. The fun
research, and the truly difficult translation, requires a real
Interlingua semantics. We need to investigate automated ways of
building that.
Amen. No more from me!
E
----------------------------------------------------------------------------
Eduard Hovy
email: hovy@isi.edu USC Information Sciences Institute
tel: 310-822-1511 ext 731 4676 Admiralty Way
fax: 310-823-6714 Marina del Rey, CA 90292-6695
project homepage: http://www.isi.edu/natural-language/nlp-at-isi.html