TEMAA Final Report - 4 Overview of the Parametrisable Test Bed (PTB)

The goal of the TEMAA project is to design a formal framework for evaluation, which is embodied by a device capable of carrying out evaluations on the basis of this formal framework. We will call this device a Test Bed.

The Test Bed will be fed with descriptions of objects to be evaluated, with descriptions of the user's requirements, and with the evaluation measures and methods to be applied. Its output is an evaluation report on the product or products evaluated.

Since new object descriptions, user requirements, measures and methods can be added to the Test Bed at any time, we will call it a Parametrisable Test Bed (PTB).

4.2 Global structure of the TEMAA PTB

Objects of evaluation can be any NLP application, e.g. spelling checkers, style checkers, information retrieval systems, translation systems or aids.

Objects are (in accordance with our formal framework) described in terms of attributes and values. The set of objects is structured: objects may be subtypes of other objects (grammar checkers may include spelling checking facilities), and classes of objects may have components in common (certain functions can be found both in grammar checkers and in style checkers). User requirements are formulated in terms of constraints on the values of attributes, and weightings of the relative importance of the attributes.
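As an illustration only, the attribute-value view of objects and user requirements can be sketched in Python. All names below (ObjectType, Requirement, the 'agreement checking' attribute, the thresholds and the weights) are hypothetical and are not taken from the PTB software. The sketch only shows a subtype inheriting attributes from its parent type, and a requirement expressed as a constraint plus a weight.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ObjectType:
    """A type of object of evaluation, described by a set of attribute names."""
    name: str
    attributes: frozenset
    parent: Optional["ObjectType"] = None  # supertype, if any

    def all_attributes(self) -> frozenset:
        """Own attributes plus everything inherited from supertypes."""
        inherited = self.parent.all_attributes() if self.parent else frozenset()
        return inherited | self.attributes


@dataclass
class Requirement:
    """One user requirement: a constraint on a value and a relative weight."""
    attribute: str
    constraint: Callable[[float], bool]
    weight: float


# Hypothetical object types: a grammar checker modelled as a subtype
# that includes the spelling checking attributes.
spelling = ObjectType(
    "spelling checker",
    frozenset({"recall", "precision", "suggestion adequacy"}),
)
grammar = ObjectType(
    "grammar checker",
    frozenset({"agreement checking"}),
    parent=spelling,
)

# Hypothetical user requirements (thresholds and weights are invented).
requirements = [
    Requirement("recall", lambda v: v >= 0.95, weight=2.0),
    Requirement("precision", lambda v: v >= 0.80, weight=1.0),
]
```

In this sketch the grammar checker type ends up with four attributes: its own and the three inherited from the spelling checker type.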

For some of the attributes it will be possible to design automated methods to obtain their values for each instance of the object (e.g. error correction precision and recall by means of methods such as ASCC, described in the section on testing methods). Others will ask the operator to provide specific data (e.g. the price of the instance), and yet others may provide the operator with a recipe for obtaining the value (e.g. a questionnaire to be sent out to 100 users, an experimental setup, etc.).

The actual testing will consist of collecting all values for the relevant attributes, either in an automated way or partly or entirely by hand. The result will be a full set of attribute-value pairs for one or more specific instances of an object. Note that this data collection is not dependent on any specific user type: the data depend on the object instances alone, and are neutral with respect to user types.

The last phase is the evaluation of the instances. Here the user requirements play a central role in that the relations between the various attributes and their weightings will determine which scores will be assigned to the products, and how they will be ranked.

4.3 The internal structure of the PTB

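The activities supported by the PTB are organised in four layers:

1. Maintenance of the test bed, i.e. creation and maintenance of testing materials and tools;

2. Collecting test data, i.e. applying testing tools and materials to objects of evaluation;

3. Defining object and user types, i.e. creation and maintenance of feature checklists and user profiles;

4. Evaluating object instances, i.e. performing the evaluation and reporting the results.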

Note that these layers correspond closely to the 'evaluation actions' described in 2.2.2. Layer 1 covers actions 4 and 5 (definition of tests and the creation of test materials), layer 2 covers action 6 (performing the tests), layer 3 covers actions 1 and 3 (definition of object and user types), and layer 4 corresponds to actions 7 and 8 (performing the evaluation and reporting). Note also that evaluation action 2 (collecting objects of evaluation) falls outside the scope of the PTB: no facilities are included to acquire the software to be evaluated.

The four layers of the PTB will be described in the rest of this section, whilst section 4.4 contains a brief technical description of the main computer programs involved.

4.3.1 Maintenance of the test bed

Many of the activities in connection with the test bed have to do with the maintenance of the libraries and the construction of support tools. They are typically carried out off-line, in part by computer specialists (e.g. writing the test programs) and in part by people who know about NLP evaluation (e.g. designing the tests).

First of all, there is a list of known object types to be evaluated, such as spelling checkers, grammar checkers, translation aids, etc.

For every object type there is a specification of all its attributes (in their hierarchical structure), and of the types and ranges of their values. These descriptions are kept in files, and are called 'feature checklists'. Here we will not try to answer the question of how exactly one arrives at a set of attributes for a given object type. Ideally, the set of attributes reflects all and only those properties of the object that have relevance for users, but the perception of what counts as relevant may change over time, and may require reconsideration of the descriptions.

For every object type there is a listing of all known (i.e. known to PTB) instances of these objects.

We will call those attributes whose values can be obtained by direct observation, inspection or measurement 'primitive attributes', as opposed to attributes whose values depend on values of other attributes.

A list of all primitive attributes (for all objects) is kept in a special catalogue, which specifies not only the full name of each attribute, but also the name of the test program that will compute (or sometimes ask for) its value.

In principle one might have different test programs for each single attribute, but it can be expected that some test programs will be applicable to more than one attribute (e.g. asking for specific information such as a product's size or price).
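A minimal sketch of such a catalogue is given below, under stated assumptions: the attribute names, the function names and the tiny word list are hypothetical, and the automated test is only a stand-in for a real tool such as ASCC. It shows one generic 'ask the operator' program being shared by several attributes, while another attribute has a dedicated automated test.

```python
def ask_operator(attribute, instance, operator_answers):
    """Generic test program: return the value supplied by the operator.

    In this sketch the operator's answers are passed in as a dictionary
    instead of being requested interactively.
    """
    return operator_answers[attribute]


def measure_recall(attribute, instance, operator_answers):
    """Stand-in for an automated test: fraction of correct words accepted."""
    words = ["house", "tree", "river", "stone"]  # hypothetical test material
    accepted = [w for w in words if instance["accepts"](w)]
    return len(accepted) / len(words)


# Hypothetical catalogue: primitive attribute name -> test program.
CATALOGUE = {
    "functionality.recall": measure_recall,
    "price": ask_operator,  # the same generic program serves
    "size": ask_operator,   # more than one attribute
}


def collect(instance, operator_answers):
    """Run the catalogued test program for every primitive attribute."""
    return {
        attribute: program(attribute, instance, operator_answers)
        for attribute, program in CATALOGUE.items()
    }


# Hypothetical product that rejects one correct word.
checker_a = {"name": "Checker A", "accepts": lambda w: w != "stone"}
values = collect(checker_a, {"price": 99.0, "size": 12})
# values == {"functionality.recall": 0.75, "price": 99.0, "size": 12}
```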

At this moment there are two test programs. One is an elaborate program (ASCC), capable of carrying out fully automatic coverage checks of spelling checkers; the other is a generic program which asks the operator to provide the desired data.

The design and construction of appropriate tests for all attributes (and the tools to support them) is a non-trivial enterprise which falls beyond the scope of this project. As a demonstration of what is needed for such a task, a description of ASCC, of its associated tools, and of the data collection work needed in order to apply it, is presented in the sections on testing methods.

4.3.2 Collecting test data

Once the description of a class of objects has been established, one or more instances of this type can be identified. If all the tests to obtain the values of the primitive attributes are in place (and possibly adapted to specific properties of the products), the tests can be carried out in order to assign values to all primitive attributes for each instance. As mentioned above, the tests may or may not be fully automated (although one should aim at automating as many of them as possible).

Note that automating tests is only possible if the task at hand is fairly well understood, as e.g. in the case of spelling checkers. As things stand it is difficult to see how a grammar checker test could be fully automated, although certain subfunctions (such as spelling checking, subject-verb agreement) could be dealt with automatically (cf. the section on grammar checkers).

This test data collection activity will under normal circumstances be a one-off activity: once all test data are available for an instance of the object, the results will be stored in the library, where they can be used to evaluate the products on the basis of different sets of user requirements.

4.3.3 Defining objects and users

For each new object a feature checklist (cf. 4.3.1 above) has to be constructed. In order to facilitate this, a special graphical editor has been constructed (ET, cf. section 4.4 below). Note that the feature checklist does not depend on specific users or classes of users. This means that the data collection activity described in section 4.3.2 above can take place before the users come into play, although it cannot be excluded that new user types may require a revision of the feature checklist.

Evaluations will be carried out for specific users or groups of users. Users are defined in terms of their requirements. These requirements are expressed in terms of constraints on attribute-value pairs or relative weights assigned to them. In order to facilitate this work, a special graphical editor has been constructed (which is the same program as the feature checklist editor ET, but running in a different mode). It will take a feature checklist as an input, and it allows the operator to add the relevant constraints and weightings. The result is called a 'user profile'.

4.3.4 Evaluating object instances

The collected test data and the user profile are fed into a program which carries out the actual evaluation, i.e. it will determine whether a specific instance satisfies the user's requirements, and (if necessary) it will rank the results for the instances tested. A report generator will present the results in a format usable for the intended end user of the evaluation.
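The evaluation step can be sketched roughly as follows. This is an illustration under assumptions, not the logic of the actual evaluation program: the profile format, the product names, the thresholds, the weights and the ranking rule (instances that satisfy all constraints first, then by weighted score) are all invented for the example.

```python
def evaluate(instances, profile):
    """Check each instance against the profile and rank the results.

    instances: {instance name: {attribute: value}}
    profile:   {attribute: (constraint function, weight)}
    Returns a list of (name, satisfied, score), best first.
    """
    total_weight = sum(weight for _, weight in profile.values())
    results = []
    for name, values in instances.items():
        satisfied = all(check(values[attr]) for attr, (check, _) in profile.items())
        score = sum(w * values[attr] for attr, (_, w) in profile.items()) / total_weight
        results.append((name, satisfied, score))
    # Assumed ranking rule: satisfying instances first, then higher score.
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)


# Hypothetical user profile and test data.
profile = {
    "recall": (lambda v: v >= 0.90, 2.0),
    "precision": (lambda v: v >= 0.80, 1.0),
}
test_data = {
    "Checker A": {"recall": 0.95, "precision": 0.85},
    "Checker B": {"recall": 0.99, "precision": 0.70},
}

ranking = evaluate(test_data, profile)
# Checker A is ranked first: it meets both constraints, whereas
# Checker B fails the precision constraint.
```

Note that the test data are the same whatever the profile; only the profile would change when the same products are evaluated for a different class of users.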

4.4 Specific software programs

4.4.1 PTB

The main program used for the maintenance of the test bed and the collection of test results is called PTB. It is a running research prototype, rather than an efficient end user product. The prototype is written in WinBatch, a programming language closely connected with MS Windows, which was selected because:

it offers a number of standard facilities for file and data manipulation, not unlike the MS DOS batch language, but tailored to Windows;

and most importantly, it offers facilities to interact with Windows based programs, in exactly the same way as the human user.

When the program is run, it lets the user select an action from one of the four layers of activity described in section 4.3 above. The maintenance layer provides facilities to edit files (all files are plain text files, which can easily be edited), and to perform certain conversions between file formats. Other facilities include debugging, viewing log files and (re)compilation of the program.

The test data collection layer has two options. The standard option is to select a specific object type and a product, and to run all known tests for the attributes connected with that object instance. The results are collected in one file, to be used by ER, the Evaluation Reporter. The other option makes it possible to assign specific values to specific attributes in case single tests have been carried out (e.g. non-automated tests) or erroneous results have to be corrected (without repeating the full test).

The third layer lets the user create or modify object descriptions or user profiles, using the program ET (cf. below).

4.4.2 ET: Evaluator's Tool

A feature checklist is assumed to be a tree of typed attributes (but with values not specified); what ET offers is a graphical tree-oriented editor dedicated to these formal objects.

A user profile is a feature checklist with additional information describing a class of users. This extra information concerns the reportability of an attribute, the importance of a sub-attribute relative to other sub-attributes, and the utility functions for values of attributes that are measured directly. The same graphical tree editor can be used in user profile mode for the creation and maintenance of user profiles.

We remind the reader that 'terminal attributes' are the ones that get a value by collecting data on objects directly, while 'nonterminal attributes' get their value by some function from their daughter attributes' values. In the case of ET, this function is a weighted average.

To perform an evaluation for a given object of evaluation O, a file of test results for the set of terminal attributes must be supplied to ET. On the basis of such a data file, ET computes scores for all nonterminal attributes, up to the root, which denotes the global utility value of O.
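The bottom-up computation can be sketched as below. Only the weighted average for nonterminal attributes and the use of utility functions for directly measured values come from the description above; the dictionary-based tree format, the attribute names, the weights, the price utility function and the test results are hypothetical.

```python
def score(node, results):
    """Return the score of an attribute node.

    Terminal nodes take their value from the test results, passed
    through the node's utility function (identity if none is given).
    Nonterminal nodes get the weighted average of their daughters.
    """
    children = node.get("children")
    if not children:
        utility = node.get("utility", lambda value: value)
        return utility(results[node["name"]])
    total_weight = sum(weight for weight, _ in children)
    weighted_sum = sum(weight * score(child, results) for weight, child in children)
    return weighted_sum / total_weight


# Hypothetical user profile: each child is a (weight, node) pair.
profile = {
    "name": "spelling checker",
    "children": [
        (3, {"name": "functionality", "children": [
            (2, {"name": "recall"}),
            (1, {"name": "precision"}),
            (1, {"name": "suggestion adequacy"}),
        ]}),
        # Invented utility function: full marks if the price is at most 100.
        (1, {"name": "price", "utility": lambda p: 1.0 if p <= 100 else 0.0}),
    ],
}

# Hypothetical test results for the terminal attributes of one object O.
results = {"recall": 0.9, "precision": 0.8, "suggestion adequacy": 0.6, "price": 80}

global_utility = score(profile, results)
# functionality = (2*0.9 + 0.8 + 0.6) / 4 = 0.8
# root          = (3*0.8 + 1*1.0) / 4     = 0.85 (up to floating-point rounding)
```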

ET is not specific to spelling checkers or even NLP evaluation, but can be used to record any user profile that can be expressed in terms of desired properties of objects of evaluation.

Full documentation with example checklists and user profiles is provided as part of TEMAA deliverable D17 (the PTB software). Snapshots of ET operation are provided as illustrations in chapter 2, section on components of an evaluation procedure.

4.4.3 ER: Evaluation reporter

Integrating and reporting the results of a particular evaluation is the function of the program ER, written in the CLIPS Object Oriented Language, which contains representations of the main classes in the project's formalisation of user profiles, requirements, methods and the resulting system values. Data is drawn from the ET and PTB programs and presented as hypertext reports or in textual form. The interface is written in Hardy (Smart 1996), a diagramming and hypertext tool which connects CLIPS objects directly to hypertext objects and interface constructs.

Reports are displayed as a hypertext tree similar to that for ET, which allows navigation via a document structure tree or from node to node. Nodes contain text and links to other nodes; for instance, in the spelling checker tree, the node for Functionality contains links to nodes for Recall, Precision, and Suggestion Adequacy. The tree can be saved as HTML, so reports can be made available via WWW.

Snapshots of ER operation are provided as illustrations in chapter 2, section on components of an evaluation procedure.

4.4.4 ASCC: Automated spelling checker checker

The program ASCC is an attempt to establish a benchmark test for the functionality of spelling checkers. It is part of the PTB component devoted to methods for determining values of terminal attributes.

As said earlier in this report, the functionality attribute of a spelling checker is composed of three attributes: recall, precision, and suggestion adequacy (though the last attribute might have been put under 'user-friendliness').

ASCC works essentially by running spelling checkers on word lists, storing the results of the spelling checker application, and generating reports based on the differences between input and output lists. It is intended to work for each spelling checker that runs under MS Windows, but adding new checkers (or even new versions of them) requires some reprogramming of part of the program (the part where ASCC communicates with the checker). Since we had, in TEMAA, no systematic way to isolate each spelling checker, ASCC runs them by running the full word processing packages they are part of. This is a somewhat roundabout way of working that is open to improvement in the future.

ASCC provides a way to test spelling checker functionality in a systematic way and on large data sets. Doing this by hand is not practicable. For example, suppose one would like to test for 20 error types (systematic types of spelling mistakes), with 5000 items per error type, for three spelling checkers and three languages: then the total number of item-test instances would be 900,000. At a pace of 20 per minute, this would amount to 750 hours of work. In order to compensate for judgement errors of the human tester, this should probably be done twice.
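The workload estimate above can be checked with a few lines of Python; the variable names are ours, the figures are those given in the text.

```python
error_types = 20          # systematic types of spelling mistakes
items_per_type = 5000     # test items per error type
checkers = 3              # spelling checkers under test
languages = 3             # languages covered
items_per_minute = 20     # assumed pace of a human tester

total_items = error_types * items_per_type * checkers * languages
hours = total_items / items_per_minute / 60

print(total_items)  # 900000
print(hours)        # 750.0
```

Doing the whole test twice, as suggested, would double this to 1500 hours.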

So, even though ASCC is not the best conceivable way to do this job, it at least makes testing on this scale feasible. It runs entirely in batch mode, so large amounts of data can be processed without human intervention.

Detailed documentation is provided as part of TEMAA deliverable D17 (the PTB software).

Recall test. Input: a collection of spelling checkers recognised by ASCC, and a list of words (in practice, correct words).

Output: Per spelling checker, statistics on the number of words accepted, recall expressed as a percentage, and a list of (erroneously) rejected words.

Precision test. Input: a collection of spelling checkers recognised by ASCC, and a list of words (in practice, incorrect words).

Output: Per spelling checker, statistics on the number of words rejected, precision expressed as a percentage, and a list of (erroneously) accepted words.
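The two reports just described can be sketched as follows, using the definitions of this report (recall: the share of correct words that are accepted; precision: the share of incorrect words that are rejected). This is not ASCC itself: the checker is simulated here by a hypothetical lexicon lookup, whereas ASCC drives real word processing packages, and the word lists and function names are invented.

```python
def recall_test(accepts, correct_words):
    """Run a checker on correct words and report what it accepts."""
    rejected = [w for w in correct_words if not accepts(w)]
    n_accepted = len(correct_words) - len(rejected)
    return {
        "accepted": n_accepted,
        "recall_pct": 100.0 * n_accepted / len(correct_words),
        "erroneously_rejected": rejected,
    }


def precision_test(accepts, incorrect_words):
    """Run a checker on incorrect words and report what it rejects."""
    accepted = [w for w in incorrect_words if accepts(w)]
    n_rejected = len(incorrect_words) - len(accepted)
    return {
        "rejected": n_rejected,
        "precision_pct": 100.0 * n_rejected / len(incorrect_words),
        "erroneously_accepted": accepted,
    }


# Hypothetical checker: accepts exactly the words in a small lexicon.
lexicon = {"house", "tree", "river", "form"}
accepts = lambda word: word in lexicon

recall_report = recall_test(accepts, ["house", "tree", "river", "stone"])
# 3 of 4 correct words accepted: recall 75.0%, "stone" wrongly rejected.

precision_report = precision_test(accepts, ["huose", "tre", "rivre", "form"])
# 3 of 4 error items rejected: precision 75.0%; "form" (intended as an
# error for "from") is wrongly accepted because it is itself a word.
```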

Definitions (see also section 3.1.3.3):

- error item: a misspelled word;

- source item: for a given error item, the correct word assumed to be the 'intended' word (for a way to construct pairs of source and error items systematically, see the description of the Errgen program);

- suggestion: a suggestion for changing the spelling, offered by a spelling checker;

- correct suggestion: a suggestion, for a given error item, that is identical to the corresponding source item.

Suggestion adequacy test. Input: a collection of spelling checkers recognised by ASCC, and two parallel lists of words: misspelled words and their 'sources' (i.e. the words that are assumed to be the intended correct words).

Output: Per spelling checker, statistics on the following:

- how often is the first suggestion correct

- how often is one of the first 5 suggestions correct

- how often are no suggestions offered

- how often are all of the first 5 suggestions wrong (redundant w.r.t. the previous three).
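The four suggestion statistics can be computed along the following lines. The function name, the counter names and the suggestion table are hypothetical; a real run would obtain the suggestions from the spelling checker under test.

```python
def suggestion_stats(suggest, error_items, source_items):
    """Count how the first five suggestions relate to the source items.

    suggest: function mapping an error item to an ordered list of suggestions.
    error_items and source_items are parallel lists.
    """
    stats = {
        "first_correct": 0,
        "top5_correct": 0,
        "no_suggestions": 0,
        "top5_all_wrong": 0,
    }
    for error, source in zip(error_items, source_items):
        top5 = suggest(error)[:5]
        if not top5:
            stats["no_suggestions"] += 1
        elif source in top5:
            stats["top5_correct"] += 1
            if top5[0] == source:
                stats["first_correct"] += 1
        else:
            stats["top5_all_wrong"] += 1
    return stats


# Hypothetical suggestions offered by some checker.
table = {
    "huose": ["house", "hose"],
    "tre": ["true", "tree"],
    "rivre": [],
    "stoen": ["stern", "steen"],
}
suggest = lambda word: table.get(word, [])

stats = suggestion_stats(
    suggest,
    ["huose", "tre", "rivre", "stoen"],
    ["house", "tree", "river", "stone"],
)
# stats == {"first_correct": 1, "top5_correct": 2,
#           "no_suggestions": 1, "top5_all_wrong": 1}
```

The redundancy noted above is visible here: the number of items whose first five suggestions are all wrong equals the total number of items minus those with a correct suggestion among the first five and those with no suggestions at all.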

4.4.5 Errgen: a program for error generation

Errgen is a Perl (4 or 5) program for the generation of spelling errors from a given word list. It is part of the PTB component dedicated to the creation of test materials. In this particular case, the test materials are word lists with systematic spelling errors. Errgen can generate such a list from a given list of words.

The program works roughly as follows. Input is a list of rules and an input word file. Output is a list of words containing errors according to one of the rules.

Detailed documentation is provided as part of TEMAA deliverable D17 (the PTB software).

A rule file contains some number of declarations and rules. Declarations are assignments to variables (identifiers prefixed with $) for use in subsequent rules; for example, the following declaration defines the notion of a vowel:

A rule is a set of Perl substitutions that simulate types of human spelling errors, e.g.:

A filter (optional), i.e. the list of (correct) words that are not allowed as corruption results

the source list, listing the items of the input list for which a result was found;

Errgen is called with parameters indicating a rule file, a rule name, an input word list and, optionally, a filter.

Each applicable subrule is applied to the current item in all possible ways, and the results are added to a result pool.

If no subrules are applicable or all results have been eliminated by the filter, there is no result.

If there is a result, the current input item is written to the output source items file and the corruption result to the output error list.

The error list can be used for testing the precision of spelling checkers (i.e. the degree to which they reject non-words).

The output source list can be used to check which items got no results (by comparing it to the input word list).

The coordinated pair of output source list and error list can be used to test the suggestion adequacy of a spelling checker (under the hypothesis that the source item was the 'intended' spelling, i.e. under the hypothesis that the errors introduced by the rule are realistic).
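The procedure described in this section can be approximated by the Python sketch below. It is not Errgen, which is a Perl program with its own rule file syntax. The subrules (an ie/ei transposition), the word lists, the function names and the way a single result is chosen from the pool (here simply the alphabetically first one) are assumptions made for the example.

```python
import re

# Hypothetical rule: two subrules, each a (pattern, replacement) pair.
SUBRULES = [(r"ie", "ei"), (r"ei", "ie")]


def corrupt(word, subrules, filter_words=frozenset()):
    """Apply every subrule at every (non-overlapping) match position.

    Returns one corruption result, or None if there is none.
    """
    pool = set()
    for pattern, replacement in subrules:
        for match in re.finditer(pattern, word):
            pool.add(word[:match.start()] + match.expand(replacement) + word[match.end():])
    pool -= set(filter_words)  # drop results that are themselves correct words
    pool.discard(word)         # a result identical to the input is no error
    # Assumption: pick the alphabetically first result to stay deterministic.
    return min(pool) if pool else None


def errgen(words, subrules, filter_words=frozenset()):
    """Return the coordinated pair (source list, error list)."""
    sources, errors = [], []
    for word in words:
        result = corrupt(word, subrules, filter_words)
        if result is not None:
            sources.append(word)
            errors.append(result)
    return sources, errors


sources, errors = errgen(
    ["field", "tree", "lie", "their"],
    SUBRULES,
    filter_words={"lei"},
)
# sources == ["field", "their"]
# errors  == ["feild", "thier"]
# "tree" matches no subrule; the only result for "lie" is "lei",
# which is removed by the filter.
```

Comparing the source list with the input word list shows which items yielded no result ("tree" and "lie" in this example).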

4.5 Concluding remarks

Its inherent flexibility makes it easily adaptable to new types of objects and requirements of new types of users;

The fact that it is a library of independent modules ensures optimal reusability across objects and user profiles;

It provides an environment in which automated and hand measuring methods can coexist in a natural way.