
Re: EAGLESEVAL: A question from George Doddington.




Background: 
>Eduard Hovy wrote:
>
>> I can think of several places in which people make progress on underlying
>> levels that have no real effect on the application level.  For example,
>> many groups have insisted that detailed syntactic parsing is the key,
>> without demonstrating significant increases in output quality (or a
>> significant increase in speed).  Ditto for the heavily semanticized
>> Interlingua-based approaches, whose proponents can spend years building
>> ontologies and fancy lexicons but still show no improvement on general
>> text in the large.  (I happen myself to believe that semantics is the
>> right direction
>> to go, but I do not believe that we will get adequately large semantic
>> knowledge bases and/or lexicons using current manual techniques.)


At 6:02 PM +0200 8/26/98, Maghi King wrote:
>But doesn't that mean that they HAVEN'T made progress on the underlying
>levels?

It means I am a skeptic.  They *may* have, but if the measurements don't 
show it, then who can tell?  For that matter, who can tell whether the 
things they are working on are in fact the proper underlying levels to 
work on?  


>Or are you saying that their parsers or ontologies get better and better
>(judged I suppose by some measure other than system improvement), but
>the overall result of the application doesn't improve? 

Yes indeed, this happens.  What kinds of internal evaluations did the 
Eurotra people produce to see if they were making progress?  Number of 
grammar rules?  Size of lexicon?  And yet, did the system get steadily 
better every six months?  Did it end up where SYSTRAN was, despite 
SYSTRAN's somewhat older methods?  


>So we'd say that they have to be wrong about what the appropriate
>underlying technology really is?

That is always possible.  For example, the IBM Candide system contained 
nothing that anyone would recognize as a grammar (its closest equivalent 
was a bigram model of English and a fertility model of French-English).  
Nor did it contain any morphological knowledge.  Yet it performed 
French-to-English just about as well as SYSTRAN did, in the large DARPA 
evaluation of 1994.  So who knows what underlying technology is really 
required?  Breeding faster 
horses becomes irrelevant when the horseless carriage is invented.  
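
(For readers who don't know Candide: it was built on the noisy-channel 
view of translation--choose the English e that maximizes P(e) * P(f|e), 
with P(e) the language model and P(f|e) the translation model.  Below is 
a minimal Python sketch of that scoring idea.  It is purely illustrative, 
not Candide's code: the toy probability tables, the one-to-one word 
alignment, and the tiny candidate list are my own simplifications, and 
the real system summed over alignments and fertilities.)

# Illustrative sketch only -- NOT Candide's actual implementation.
# Noisy-channel translation: pick the English candidate e that
# maximizes log P(e) + log P(f | e).  All probabilities are toy values.

import math

# Toy bigram language model for English: P(word | previous word).
BIGRAM = {
    ("<s>", "the"): 0.5, ("the", "house"): 0.3, ("the", "home"): 0.1,
    ("house", "</s>"): 0.4, ("home", "</s>"): 0.4,
}

# Toy translation table: P(french_word | english_word).
TTABLE = {
    ("la", "the"): 0.7, ("maison", "house"): 0.8, ("maison", "home"): 0.15,
}

def lm_logprob(english):
    # Bigram log P(e), with a small floor for unseen bigrams.
    words = ["<s>"] + english + ["</s>"]
    return sum(math.log(BIGRAM.get((a, b), 1e-6))
               for a, b in zip(words, words[1:]))

def tm_logprob(french, english):
    # Word-for-word log P(f|e), assuming a one-to-one alignment.
    # (The real IBM models sum over alignments and fertilities.)
    return sum(math.log(TTABLE.get((f, e), 1e-6))
               for f, e in zip(french, english))

def decode(french, candidates):
    # Choose the candidate maximizing log P(e) + log P(f|e).
    return max(candidates,
               key=lambda e: lm_logprob(e) + tm_logprob(french, e))

print(decode(["la", "maison"],
             [["the", "house"], ["the", "home"]]))
# -> ['the', 'house']: both the bigram LM and the t-table prefer it.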


Summary, and then I will shut up: 

My belief is this: you can work on whatever aspect you like.  If you're 
smart, you will work on something that will help get the system in the 
magical corner of HighQuality-HighDomainCoverage-FullAutomation.  But 
what is it that will be sure to get you there?  Since *no-one has reached 
that corner yet*, despite much work, it looks reasonable to say that we 
don't know.  And possibly also reasonable to suspect that the traditional 
methods are not going to get us there.  

So, not knowing, you are forced to evaluate the effects of your work, 
whatever you choose to work on, and show that your effects do in fact 
move the system closer to the magic corner.  I believe that `internal' 
measures such as syntactic coverage and size of ontology and so on are 
(much) weaker indicators of movement toward success than black-box ones 
of the application-oriented kind.  (This is not to say I am advocating 
spending time on interfaces and so on.  But I believe the parameters of 
the application and situation define magical corners--what Ken and I 
called niches of functionality in our Crummy MT paper--toward which 
you should drive your system development.)  

That is, if you can *prove* the direct correlation between syntactic 
coverage and application success, good for you.  Then work on syntax.  
In my experience, you reach the syntax plateau after about 2 person-
years, roughly 500 grammar rules (though perhaps the new statistical 
learning methods of Collins or Hermjakob make a difference here), and 
then you 
need something more.  To my knowledge, the best overall indicator of 
success, crudely speaking, is lexicon size (words, phrases, the lot).  
The older the system, by and large, the bigger the lexicon, and the 
better it performs.  Most recent research has developed ways to create 
systems more quickly, but not to make their output better... SYSTRAN 
still wins in the most natural open-domain races.  
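
(One concrete way to test the kind of correlation claimed above, sketched 
in Python: record an internal measure and a black-box quality score for 
each system release, and compute their correlation.  All the numbers 
below are invented for illustration; nothing here comes from any real 
evaluation.)

# Illustrative sketch with made-up data: does an internal measure
# (say, grammar-rule count) track a black-box application score?
# A strong, sustained correlation would justify optimizing the
# internal measure; a plateau supports the skeptical view above.

import math

# (grammar rules, black-box quality score) per six-month release;
# invented numbers showing early gains that flatten out.
releases = [(120, 2.1), (260, 2.4), (410, 2.6), (500, 2.7),
            (640, 2.7), (800, 2.7)]

def pearson(pairs):
    # Pearson correlation coefficient between the two columns.
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print("r = %.2f" % pearson(releases))
# A high r over the early releases can mask the plateau: plot the
# black-box score itself, not just the internal measure.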

So why evaluate? 
To see whether what you are working on helps, and whether your work 
is going places. 

Evaluate doing what? 
Doing a real application, factoring out, as far as possible, things 
like machine speed, interfaces, input data formatting, and so on.  

What if it takes 100 years to prove the utility of my magic method, 
because I cannot build it faster (the Interlingua objection)? 
Invent techniques to build it faster, or go away for 100 years and 
come back and prove your case. 

For my money, for most applications, a shallow semantic transfer 
approach works fine, using rules built up automatically.  The fun 
research, and the truly difficult translation, requires a real 
Interlingua semantics.  We need to investigate automated ways of 
building that. 

Amen.  No more from me! 
E


----------------------------------------------------------------------------
Eduard Hovy
email: hovy@isi.edu          USC Information Sciences Institute 
tel: 310-822-1511 ext 731    4676 Admiralty Way 
fax: 310-823-6714            Marina del Rey, CA 90292-6695 
project homepage: http://www.isi.edu/natural-language/nlp-at-isi.html