In Granada, I proposed devoting a third of all validation efforts to user testing. Over the summer, my opinion has shifted towards more than that third: at least half of all efforts are necessary. I shall restate the arguments already made on the discussion list as the following three reasons; a fourth point answers a question about the purpose of user validation.
Eduard Hovy (msg00065) wrote: 'The IBM Candide system contained nothing that anyone would recognise as a grammar'. I have seen that in successful spoken dialogue systems, classical morphology and syntax are likewise either not used at all or used only in a very limited way.
I know of no other way to discover useful technologies than to first perform user tests and then examine which components the better systems use.
Eduard Hovy made the rhetorical point (msg00054): 'It is better to make progress on the underlying problems even if they have no effect at the application level'.
To paraphrase Margaret King: 'The parsers get better and better, as measured by metrics for parsers, but the overall result of the application does not improve'.
Robert Frederking writes about 'greedy search' (msg00064), Lynette about a 'magic knee in the curve' (msg00066). I keep the word 'threshold' that Lin Chase used in Geneva.
It seems that, to be effective at the end-user level, every technology must reach a threshold. Incremental improvements above that threshold are lost in the noise at the level of the user application.
If this impression is correct, understanding these thresholds is a key to improving the end results. To manage our work, we need to know, for every technology, whether this threshold rule holds.
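The threshold rule can be made concrete with a small sketch. Assuming, purely for illustration, that end-user task success follows a logistic curve over component accuracy (the threshold value of 0.80 and the steepness are invented numbers, not measurements), the same four-point component gain is dramatic below the threshold and invisible above it:

```python
import math

def task_success(component_accuracy, threshold=0.80, steepness=40.0):
    """Hypothetical mapping from component accuracy to end-user task
    success: a logistic curve that saturates above the threshold."""
    return 1.0 / (1.0 + math.exp(-steepness * (component_accuracy - threshold)))

# The same four-point gain in component accuracy...
below = task_success(0.82) - task_success(0.78)  # gain straddling the threshold
above = task_success(0.96) - task_success(0.92)  # gain well above the threshold

# ...is large and user-visible below the threshold, but lost in the
# noise of the user application once the threshold has been passed.
assert below > above
```

Only user testing can locate the threshold and the curve's shape for a given technology; the numbers here merely illustrate why component-level metrics and user-level results can diverge.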
In cases where a quantum jump is necessary, the validation effort should even promote temporarily worse results ('reculer pour mieux sauter', i.e. stepping back in order to jump better; think also of Hervé Bourlard's 'Towards Increasing Speech Recognition Error Rates'). To me, speech recognition is an example where this is needed.
By definition, we need user testing to determine these thresholds. I would argue that without knowledge of these thresholds, the goal of technology validation itself is questionable.
This point is a conclusion from my experience with spoken dialogue systems. I have heard anecdotal evidence of the same in other types of systems (e.g. that the on-screen presentation is the key to user acceptance).
In the chain from theory to technology to system to end-user, the largest drop in performance takes place at the final link. The measures with the greatest impact there are not technological but organisational (e.g. operator fallback, prompts). This is where the biggest effect can be achieved for the least effort.
Ignoring this area in validation and measurement gives me the impression that the main driver of the end result is being ignored.
Martin Rajman objected (msg00035): 'Measuring user satisfaction does not allow progress in the field, it only increases return on investment'.
This is correct if the measure is used only to gauge user satisfaction. However, I advocate using user validation to structure and understand the field, so that progress is fostered where it is most needed and where it provides the greatest impact.