
Towards a reference model for speech evaluation.
- Denis Johnston

Author's note: I have the impression that I have not done justice to the discussion following this talk. I would be very grateful to any participants who can contribute to the description. MK

Summary of presentation:

Denis started his presentation by explaining that speech evaluation has the advantage of long-established and well-known methods for measuring the quality of speech transmission over, for example, a telephone line.

Tests are based on user opinions. This has the advantage of reflecting the real world and of directly addressing the user. The test uses a listening effort scale running from "No effort required to understand", through "Some attention needed", to "No meaning can be understood despite all feasible effort being made". A distance of one metre of air between the source of speech and the listener has been accepted as an industry standard for some time: if the listener has trouble hearing at this distance, the product does not meet standard expectations. To administer the test, a microphone and earpiece are used whose combined transmission characteristics are identical to those of one metre of air between speaker and hearer. The speech transmission can then be impaired in a number of ways. Candidate impairments include babble, split frequency bands, and wow and flutter.

Tests are performed using "non-mental effort" sentences: sentences whose understanding requires no effort on the part of the hearer, such as "The sun shines in the sky". The level of impairment is raised until it affects the understanding of the intended message. This gives a measure of the signal quality in unfavourable situations: the greater the degree of impairment required to block understanding of the message, the better the quality of the signal.
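The procedure of raising the impairment until the message is lost can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `impairment_threshold`, the integer impairment levels and the two stand-in "systems" are all hypothetical, and in a real test the `understood` judgement would come from listeners rather than from a function.

```python
from typing import Callable, Optional, Sequence


def impairment_threshold(
    levels: Sequence[int],
    understood: Callable[[int], bool],
) -> Optional[int]:
    """Return the first impairment level at which the message is not understood.

    `levels` must be given in increasing order of impairment. Returns None
    if the message is still understood at the highest level tested.
    """
    for level in levels:
        if not understood(level):
            return level
    return None


# Hypothetical stand-ins for listener judgements on two signals.
# Signal B survives more impairment than signal A before understanding fails.
def signal_a_understood(level: int) -> bool:
    return level < 4


def signal_b_understood(level: int) -> bool:
    return level < 7


levels = list(range(11))  # assumed impairment steps 0 (none) to 10 (severe)
threshold_a = impairment_threshold(levels, signal_a_understood)
threshold_b = impairment_threshold(levels, signal_b_understood)

# The higher threshold indicates the better signal quality.
better = "B" if threshold_b > threshold_a else "A"
```

The comparison at the end reflects the reasoning above: the signal that needs more impairment before understanding is blocked is rated as the better one.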

A practical application of the measuring technique lies in the evaluation of text-to-speech systems, in which synthesised speech is generated from a written text. Text-to-speech technology is used for a variety of applications where recording is not possible, for example when instructions or addresses have to be given over the phone.

The evaluation is done by asking a group of sixteen listeners to transcribe the synthesised speech. The text must always be unfamiliar to the listeners, and not, for example, some well-known piece of prose, since people have a surprisingly long-term memory for something once heard or read, even if they do not consciously remember it.
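One way the listeners' transcriptions might be scored is by word-level edit distance against the original text, averaged over the group. The talk did not specify a scoring method, so the sketch below is an assumption throughout: the function names, the scoring rule and the sample sentences are hypothetical, and only three listeners are shown for brevity.

```python
from typing import Sequence


def word_errors(reference: str, transcript: str) -> int:
    """Word-level edit distance (substitutions, insertions, deletions)."""
    ref = reference.lower().split()
    hyp = transcript.lower().split()
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def mean_word_accuracy(reference: str, transcripts: Sequence[str]) -> float:
    """Average per-listener word accuracy, floored at zero."""
    n_words = len(reference.split())
    scores = [max(0.0, 1.0 - word_errors(reference, t) / n_words)
              for t in transcripts]
    return sum(scores) / len(scores)


# Hypothetical synthesised text and listener transcriptions.
reference = "please leave the parcel at number nine"
transcripts = [
    "please leave the parcel at number nine",  # no errors
    "please leave the parcel at number five",  # one substitution
    "please leave a parcel number nine",       # one substitution, one deletion
]
accuracy = mean_word_accuracy(reference, transcripts)
```

With sixteen listeners the same function would simply be given sixteen transcriptions.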

The linguistic theories underlying text-to-speech systems (theories of prosody or of speech generation, for example) vary considerably, but this does not seem to affect the acceptability or the efficacy of the proposed measure.


Discussion:

It was suggested that using sixteen listeners was unnecessarily extravagant, and that experience elsewhere had shown that twelve listeners were enough. Denis replied that sixteen seemed to be the generally accepted number, and that the text-to-speech test took only a few minutes, so that the whole experiment cost only a couple of man-days in total.

There was considerable discussion of how to determine an optimal sample size, both for this kind of experiment and more generally. Denis indicated some standard bibliographic references, and agreed to supply the project with a selected list of references.
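As a rough illustration of what is at stake in the choice between twelve and sixteen listeners, the standard error of a mean score shrinks with the square root of the number of listeners. The sketch below assumes independent listeners and the same spread of scores in both cases; the standard deviation of 1.0 is a hypothetical value, not a figure from the talk.

```python
import math


def standard_error(sd: float, n_listeners: int) -> float:
    """Standard error of the mean for n independent listeners."""
    return sd / math.sqrt(n_listeners)


sd = 1.0  # hypothetical standard deviation of individual listener scores
se_12 = standard_error(sd, 12)
se_16 = standard_error(sd, 16)

# How much wider the uncertainty is with twelve listeners than with sixteen.
ratio = se_12 / se_16  # equals sqrt(16 / 12), roughly 1.15
```

Under these assumptions, dropping from sixteen to twelve listeners widens the uncertainty on the mean by about 15 per cent; whether that matters depends on the size of the differences the experiment is meant to detect.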

A question was asked about whether any tests had been done with people listening to a language that was not their mother tongue. Denis replied that he did not know of any such experiments. There had been a proposal to carry out tests with different varieties of UK and US English, but the proposal had not been accepted.
