
Re: EAGLEEVAL: A question from George Doddington




MB> In Granada, I proposed to have a third of all validation efforts devoted
MB> to user testing. During the summer, I have changed my opinion towards

The question behind every evaluation, and behind the amount of resources
devoted to a particular kind of evaluation, is always: who ordered it?
Once you know that, it is very likely that you will know what purpose it
will serve and, consequently, how much effort you should allocate to it.

MB> I do not know of any other way to discover useful technologies than to
MB> first perform user tests and then look which components are used by the
MB> better systems.

The same could be said of any scientific test procedure, at any point in the
development lifecycle.
When Marc says "useful technologies", I am tempted to ask: useful for what?
Useful in the sense that they increase the usability of the final application,
and hence its marketability, or "useful" because they contribute significantly
to bridging the gap between the current state of the technology and its
advertised goals? If it is the former, then maybe we should start
redirecting our work more towards user interface development and ergonomics.

MB> It seems that to be effective at an end-user level, every technology
MB> must reach a threshold. The detail improvements above that threshold are
MB> lost in the noise at the user application.

There is a WIDE region in the performance curves between the point where
a technology has achieved a sufficient level of maturity to have
noticeable consequences at the application level, and the point where
it has been so refined that any improvement is minimal and lost in
the noise generated by the other elements of the final application.

Technology evaluation, as ELSE proposes it, aims precisely
at speeding up and helping the discovery of emerging technologies
(by giving them a chance to prove themselves), as well as contributing
to the development of the technology between the two performance points
under consideration. In addition, technology evaluation is a means
to facilitate the discovery of the ultimate performance threshold
achievable by a technology at a given time, independently of
end-user considerations, which make the credit attribution for a
performance result harder and more costly (did the system
perform well because the technology it uses is appropriate, or did
it perform well because the coupling between the user and the
interface was optimal?).

MB> If this impression is correct, the understanding of the thresholds is a
MB> key to improving the end results. To manage our work we need to know for
MB> every technology where this threshold rule is true:
MB>
MB>     a. at which threshold we are now;
MB>     b. where the next threshold will probably be;
MB>     c. can the existing technology reach that new threshold by small
MB>        increments or is a quantum jump necessary.

Exactly. If you don't mind, Marc, I'll re-use this part of your prose
for the ELSE publications; it is a very good argument for technology evaluation.

MB> In cases where a quantum jump is necessary, the validation effort
MB> should promote worse results ('Reculer pour mieux sauter', also think
MB> of Pierre Boulard's 'Towards Increasing Speech Recognition Error
MB> Rates'). To me, speech recognition is an example where that is needed.

The paradoxical part of what you say is that speech recognition is
one of the best examples of the benefits brought by technology evaluation.

MB> By definition we need user testing to determine these thresholds. I can
MB> argue that without knowledge of these thresholds, the goal of technology
MB> validation is questionable.

Here again I agree with you, Marc, as I think that end-user evaluation and
technology evaluation are complementary, but they should be addressed
in the same order as they appear in the development lifecycle, and
end-user evaluation should serve to complement the results of
technology evaluation.

MB> In the chain from theory to technology to system to end-user, the largest
MB> drop of performance takes place at the final link. The measures that
MB> have the greatest impact are not technological but organisational (e.g.
MB> operator fallback, prompts). This is where the biggest effect for the
MB> least involvement can be achieved.

But these are mostly considerations linked with the deployment conditions of
an application, an issue which falls under the responsibility of the
deployer of the system and its proficiency at packaging the final application.
To take up your example, knowing that, when your spoken language dialog system
fails, it is better for the acceptance of the application by the customers
to have a fallback strategy with an operator, is a consideration that won't
bring any improvement to science or technology in the field.
In the same manner, putting a lot of effort into the development of the prompts
to palliate the system's lack of "understanding" capabilities is
a short-term strategy that will not address the basic problem: which technology
to use to improve "understanding" capabilities, which right now are not that
far from simple keyword spotting as far as spoken language
dialog systems are concerned.

MB> 4. Purpose of User Validation
MB> Martin Rajman asked (msg00035): 'Measuring user satisfaction does not
MB> allow progress in the field, it only increases return on investment'.
MB>
MB> This is correct when this measure is only used for user satisfaction.
MB> However, I advocate using user validation in order to structure and
MB> understand the field, so that progress is fostered where it is most
MB> needed and where it provides the greatest impact.

Going on with your example of spoken language dialog systems, we should
then focus on the use of "prompts" in a dialog and on the decision of
when to resort to a fall-back operator, rather than trying to tackle
understanding.

Interestingly enough, one of the latest American DARPA programs has
Topic Detection and Tracking as one of its control tasks, and the
general consensus at the LREC Granada conference on evaluation
was that for Speech Processing we were entering a new
era in which the focus would be put on "understanding".

Best regards,

Patrick Paroubek