The Test Bed will be fed with descriptions of objects to be evaluated, with descriptions of the user's requirements, and with the evaluation measures and methods to be applied. Its output is an evaluation report on the product or products evaluated.
Since new object descriptions, user requirements, measures and methods can be added to the Test Bed at any moment, we will call it a Parameterizable Test Bed (PTB). In other words, the PTB:
* is fed with parameters describing
  * objects of evaluation, and
  * classes of users (in a broad sense) of these objects;
* consults a library of test methods;
* carries out the relevant tests, and
* produces the evaluation report.
An object of evaluation can be any NLP application, e.g., a spelling checker, style checker, information retrieval system, translation system or translation aid.
Objects are (in accordance with our formal framework) described in terms of attributes and values. The set of objects is structured: objects may be subtypes of other objects (grammar checkers may include spelling checking facilities), and classes of objects may have components in common (certain functions can be found both in grammar checkers and in style checkers).
User requirements are formulated in terms of constraints on attributes and their values, and weightings of the relative importance of the attributes.
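As an illustration of this framework (the fragment below is only a sketch; the attribute names, values and weights are invented and do not correspond to actual PTB data), an object description and a set of user requirements could be represented along the following lines:

    from dataclasses import dataclass
    from typing import Callable

    # Illustrative only: attribute names, values and weights are invented.
    # An object instance is described by a set of attribute-value pairs.
    spelling_checker_x = {
        "coverage.recall": 0.91,       # proportion of introduced errors flagged
        "coverage.precision": 0.87,    # proportion of flags that are real errors
        "suggests_corrections": True,
        "price": 120.0,
    }

    @dataclass
    class Requirement:
        """One user requirement: a constraint on an attribute plus its weight."""
        attribute: str
        constraint: Callable[[object], bool]  # predicate over the attribute value
        weight: float                         # relative importance for this user class

    # Requirements for a hypothetical class of users.
    translator_requirements = [
        Requirement("coverage.recall", lambda v: v >= 0.90, weight=0.5),
        Requirement("suggests_corrections", lambda v: v is True, weight=0.3),
        Requirement("price", lambda v: v <= 200.0, weight=0.2),
    ]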
The PTB has a (continuously updated and expanded) library of
* descriptions of known objects,
* descriptions of known user types,
* attributes and ways of establishing their values, and
* results of earlier measurements.
For some of the attributes it will be possible to design automated methods to obtain their values for each instance of the object (e.g., error correction precision and recall by means of methods such as ASCC, described in the section on testing methods). For other attributes, the method will simply ask the operator to provide specific data (e.g., the price of the instance), and for yet others it may provide the operator with a recipe for obtaining the value (e.g., a questionnaire to be sent out to 100 users, an experimental setup, etc.).
The actual testing will consist of collecting all values for the relevant attributes, either in an automated way or partly or entirely by hand. The result will be a full set of attribute-value pairs for one or more specific instances of an object. Note that this data collection is not dependent on any specific user type: the data depend on the object instances alone, and are neutral with respect to user types.
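A sketch of this collection step is given below; the function and attribute names are placeholders and do not reflect the actual PTB or ASCC interfaces:

    # Sketch of the user-independent data-collection phase; all names are
    # placeholders, not the actual PTB or ASCC interfaces.

    def ask_operator(attribute: str) -> str:
        """Generic fallback: ask the human operator to supply a value by hand."""
        return input(f"Value for '{attribute}': ")

    def run_automatic_coverage_check(instance_id: str) -> float:
        """Stand-in for an automated test program such as ASCC."""
        raise NotImplementedError("would run the automated coverage check here")

    # Which primitive attributes can be measured automatically, and by what.
    AUTOMATED_TESTS = {
        "coverage.recall": run_automatic_coverage_check,
    }

    def collect_test_data(instance_id: str, attributes: list[str]) -> dict:
        """Return a full set of attribute-value pairs for one object instance."""
        values = {}
        for attribute in attributes:
            test_program = AUTOMATED_TESTS.get(attribute)
            values[attribute] = (test_program(instance_id) if test_program
                                 else ask_operator(attribute))
        return values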
The last phase is the evaluation of the instances. Here the user requirements play a central rôle in that the relations between the various attributes and their weightings will determine which scores will be assigned to the products, and how they will be ranked.
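One simple way to picture this evaluation step is as a weighted sum over satisfied constraints; the sketch below assumes this particular scoring rule, which is not necessarily the one the PTB applies:

    # Sketch of the evaluation phase: score each product against one user
    # profile and rank the results. The weighted sum of satisfied constraints
    # is an assumed scoring rule, not necessarily the PTB's own.

    def score(product: dict, profile: list[tuple]) -> float:
        """profile entries are (attribute, constraint predicate, weight) triples."""
        total = 0.0
        for attribute, constraint, weight in profile:
            value = product.get(attribute)
            if value is not None and constraint(value):
                total += weight
        return total

    def rank(products: dict[str, dict], profile: list[tuple]) -> list[tuple[str, float]]:
        """Return (product name, score) pairs, best first."""
        scored = [(name, score(description, profile)) for name, description in products.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)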
Work with the PTB comprises four main activities:
* Maintenance of the test bed, i.e., creation and maintenance of testing materials and tools;
* Collecting test data, i.e., applying testing tools and materials to objects of evaluation;
* Defining objects and users, i.e., creation and maintenance of feature checklists and user profiles;
* Performing evaluations, i.e., evaluating object instances.
These will be described in the rest of this section. Section 2.4 contains a brief technical description of the main computer programs involved.
Currently, the PTB libraries contain the following items (or classes of items).
First of all, there is a list of known object types to be evaluated, such as spelling checkers, grammar checkers, translation aids, etc.
For every single object type there is a specification of all its attributes (in their hierarchical structure), and the types and ranges of their values. The descriptions are kept in files, and are called `feature checklists'. Here we will not try to answer the question of how exactly one arrives at a set of attributes for a given object type. Ideally, the set of attributes reflects all and only those properties of the object that have relevance for users, but the perception of what counts as relevant may change over time, and may require reconsideration of the descriptions.
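For concreteness, a fragment of such a feature checklist for a spelling checker might look roughly as follows (the attributes, types and ranges shown are invented, and the actual checklist files have their own format):

    # Invented fragment of a feature checklist: attributes in a hierarchy,
    # each with the type and admissible range of its values.
    SPELLING_CHECKER_CHECKLIST = {
        "coverage": {                                    # composite attribute
            "recall":    {"type": "ratio", "range": (0.0, 1.0)},
            "precision": {"type": "ratio", "range": (0.0, 1.0)},
        },
        "interface": {
            "suggests_corrections": {"type": "boolean"},
            "languages": {"type": "set", "values": ["da", "en", "it"]},
        },
        "price": {"type": "number", "range": (0.0, None)},   # no fixed upper bound
    }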
For every object type there is a listing of all known (i.e., known to the PTB) instances of that type.
We will call those attributes whose values can be obtained by direct observation, inspection or measurement `primitive attributes', as opposed to attributes whose values depend on values of other attributes.
A list of all primitive attributes (for all objects) is kept in a special catalogue, which specifies not only the full name of each attribute but also, per attribute, the name of the test program that will compute (or, in some cases, ask the operator for) its value.
In principle one might have a different test program for each attribute, but it can be expected that some test programs will be applicable to more than one attribute (e.g., asking for a product's size or price).
At the moment there are two test programs. One is an elaborate program (ASCC), capable of carrying out fully automatic coverage checks of spelling checkers; the other is a generic program which asks the operator to provide the desired data.
The design and construction of appropriate tests for all attributes (and the tools to support them) is a non-trivial enterprise which falls beyond the scope of this project. As a demonstration of what is needed for such a task, a detailed description of ASCC, of its associated tools, and of the data collection work needed in order to apply it, is presented in the sections on testing methods.
Note that automating tests is only possible if the task in hand is well understood, as is the case for spelling checkers. As things stand it is difficult to see how a grammar checker test could be fully automated, although certain subfunctions (such as spelling checking or subject-verb agreement checking) could be dealt with automatically.
Under normal circumstances this test data collection will be a one-off activity: once all test data are available for an instance of an object, the results will be stored in the library, where they can be used to evaluate the product against different sets of user requirements.
Evaluations will be carried out for specific users or groups of users. Users are defined in terms of their requirements, which are expressed as constraints on attribute-value pairs or as relative weights assigned to them. In order to facilitate this work, a special graphical editor has been constructed (the same program as the feature checklist editor ET, but running in a different mode). It takes a feature checklist as input and allows the operator to add the relevant constraints and weightings. The result is called a `user profile'.
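Conceptually, the resulting profile is the feature checklist with constraints and weights attached to (some of) its attributes; a hypothetical fragment, continuing the invented checklist example above, could look like this:

    # Hypothetical user-profile fragment: checklist attributes annotated with
    # constraints and relative weights for one class of users.
    TRANSLATOR_PROFILE = {
        "coverage": {
            "recall":    {"constraint": ">= 0.90", "weight": 0.5},
            "precision": {"constraint": ">= 0.80", "weight": 0.2},
        },
        "interface": {
            "suggests_corrections": {"constraint": "== true", "weight": 0.2},
        },
        "price": {"constraint": "<= 200", "weight": 0.1},
    }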
* it offers a user interface similar to the standard Windows interface;
* it offers a number of standard facilities for file and data manipulation, not unlike the MS DOS batch language, but tailored to Windows;
* it offers facilities to call other programs (such as ET, or the CLIPS program);
* and most importantly, it offers facilities to interact with Windows based programs, in exactly the same way as the human user.
The main strengths of the PTB can be summarized as follows:
* It is based on a formal framework, which gives it a solid foundation;
* Its inherent flexibility makes it easily adaptable to new types of objects and requirements of new types of users;
* The fact that it is a library of independent modules ensures optimal reusability across objects and user profiles;
* It provides an environment in which automated and manual measuring methods can coexist in a natural way.