The Test Bed will be fed with descriptions of objects to be evaluated, with descriptions of the user's requirements, and with the evaluation measures and methods to be applied. Its output is an evaluation report on the product or products evaluated.
Since new object descriptions, user requirements, measures and methods can be added to the Test Bed at any moment, we will call it a Parameterizable Test Bed (PTB). In other words, the PTB:
* is fed with parameters describing
  * objects of evaluation, and
  * classes of users (in a broad sense) of these objects;
* consults a library of test methods;
* carries out the relevant tests, and
* produces the evaluation report.
An object of evaluation can be any NLP application, e.g., a spelling checker, style checker, information retrieval system, translation system or translation aid.
Objects are (in accordance with our formal framework) described in terms of attributes and values. The set of objects is structured: objects may be subtypes of other objects (grammar checkers may include spelling checking facilities), and classes of objects may have components in common (certain functions can be found both in grammar checkers and in style checkers).
User requirements are formulated in terms of constraints on attributes and their values, and weightings of the relative importance of the attributes.
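As an illustration of this framework (the fragment below is only a sketch; the attribute names, values and weights are invented and do not correspond to actual PTB data), an object description and a set of user requirements could be represented along the following lines:

    from dataclasses import dataclass
    from typing import Callable

    # Illustrative only: attribute names, values and weights are invented.
    # An object instance is described by a set of attribute-value pairs.
    spelling_checker_x = {
        "coverage.recall": 0.91,       # proportion of introduced errors flagged
        "coverage.precision": 0.87,    # proportion of flags that are real errors
        "suggests_corrections": True,
        "price": 120.0,
    }

    @dataclass
    class Requirement:
        """One user requirement: a constraint on an attribute plus its weight."""
        attribute: str
        constraint: Callable[[object], bool]  # predicate over the attribute value
        weight: float                         # relative importance for this user class

    # Requirements for a hypothetical class of users.
    translator_requirements = [
        Requirement("coverage.recall", lambda v: v >= 0.90, weight=0.5),
        Requirement("suggests_corrections", lambda v: v is True, weight=0.3),
        Requirement("price", lambda v: v <= 200.0, weight=0.2),
    ]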
The PTB has a (continuously updated and expanded) library of
* descriptions of known objects,
* descriptions of known user types,
* attributes and ways of establishing their values, and
* results of earlier measurements.
For some of the attributes it will be possible to design automated methods to obtain their values for each instance of the object (e.g., error correction precision and recall by means of methods such as ASCC, described in the section on testing methods). For other attributes, the method will simply ask the operator to provide specific data (e.g., the price of the instance), and for yet others it may provide the operator with a recipe for obtaining the value (e.g., a questionnaire to be sent out to 100 users, an experimental setup, etc.).
The actual testing will consist of collecting all values for the relevant attributes, either in an automated way or partly or entirely by hand. The result will be a full set of attribute-value pairs for one or more specific instances of an object. Note that this data collection is not dependent on any specific user type: the data depend on the object instances alone, and are neutral with respect to user types.
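A sketch of this collection step is given below; the function and attribute names are placeholders and do not reflect the actual PTB or ASCC interfaces:

    # Sketch of the user-independent data-collection phase; all names are
    # placeholders, not the actual PTB or ASCC interfaces.

    def ask_operator(attribute: str) -> str:
        """Generic fallback: ask the human operator to supply a value by hand."""
        return input(f"Value for '{attribute}': ")

    def run_automatic_coverage_check(instance_id: str) -> float:
        """Stand-in for an automated test program such as ASCC."""
        raise NotImplementedError("would run the automated coverage check here")

    # Which primitive attributes can be measured automatically, and by what.
    AUTOMATED_TESTS = {
        "coverage.recall": run_automatic_coverage_check,
    }

    def collect_test_data(instance_id: str, attributes: list[str]) -> dict:
        """Return a full set of attribute-value pairs for one object instance."""
        values = {}
        for attribute in attributes:
            test_program = AUTOMATED_TESTS.get(attribute)
            values[attribute] = (test_program(instance_id) if test_program
                                 else ask_operator(attribute))
        return values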
The last phase is the evaluation of the instances. Here the user requirements play a central rôle in that the relations between the various attributes and their weightings will determine which scores will be assigned to the products, and how they will be ranked.
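One simple way to picture this evaluation step is as a weighted sum over satisfied constraints; the sketch below assumes this particular scoring rule, which is not necessarily the one the PTB applies:

    # Sketch of the evaluation phase: score each product against one user
    # profile and rank the results. The weighted sum of satisfied constraints
    # is an assumed scoring rule, not necessarily the PTB's own.

    def score(product: dict, profile: list[tuple]) -> float:
        """profile entries are (attribute, constraint predicate, weight) triples."""
        total = 0.0
        for attribute, constraint, weight in profile:
            value = product.get(attribute)
            if value is not None and constraint(value):
                total += weight
        return total

    def rank(products: dict[str, dict], profile: list[tuple]) -> list[tuple[str, float]]:
        """Return (product name, score) pairs, best first."""
        scored = [(name, score(description, profile)) for name, description in products.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)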
Work with the PTB comprises four main activities:
* Maintenance of the test bed, i.e., creation and maintenance of testing materials and tools;
* Collecting test data, i.e., applying testing tools and materials to objects of evaluation;
* Defining objects and users, i.e., creation and maintenance of feature checklists and user profiles;
* Performing evaluations, i.e., evaluating object instances.
These will be described in the rest of this section. Section 2.4 contains a brief technical description of the main computer programs involved.
Currently, the PTB libraries contain the following items (or classes of items).
First of all, there is a list of known object types to be evaluated, such as spelling checkers, grammar checkers, translation aids, etc.
For every single object type there is a specification of all its attributes (in their hierarchical structure), and the types and ranges of their values. The descriptions are kept in files, and are called `feature checklists'. Here we will not try to answer the question of how exactly one arrives at a set of attributes for a given object type. Ideally, the set of attributes reflects all and only those properties of the object that have relevance for users, but the perception of what counts as relevant may change over time, and may require reconsideration of the descriptions.
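For concreteness, a fragment of such a feature checklist for a spelling checker might look roughly as follows (the attributes, types and ranges shown are invented, and the actual checklist files have their own format):

    # Invented fragment of a feature checklist: attributes in a hierarchy,
    # each with the type and admissible range of its values.
    SPELLING_CHECKER_CHECKLIST = {
        "coverage": {                                    # composite attribute
            "recall":    {"type": "ratio", "range": (0.0, 1.0)},
            "precision": {"type": "ratio", "range": (0.0, 1.0)},
        },
        "interface": {
            "suggests_corrections": {"type": "boolean"},
            "languages": {"type": "set", "values": ["da", "en", "it"]},
        },
        "price": {"type": "number", "range": (0.0, None)},   # no fixed upper bound
    }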
For every object type there is a listing of all known (i.e., known to the PTB) instances of that type.
We will call those attributes whose values can be obtained by direct observation, inspection or measurement `primitive attributes', as opposed to attributes whose values depend on values of other attributes.
A list of all primitive attributes (for all objects) is kept in a special catalogue, which specifies not only the full name of each attribute but also, per attribute, the name of the test program that will compute (or, in some cases, ask the operator for) its value.
In principle one might have a different test program for each attribute, but it can be expected that some test programs will be applicable to more than one attribute (e.g., asking for a product's size or price).
At the moment there are two test programs. One is an elaborate program (ASCC), capable of carrying out fully automatic coverage checks of spelling checkers; the other is a generic program which asks the operator to provide the desired data.
The design and construction of appropriate tests for all attributes (and the tools to support them) is a non-trivial enterprise which falls beyond the scope of this project. As a demonstration of what is needed for such a task, a detailed description of ASCC, of its associated tools, and of the data collection work needed in order to apply it, is presented in the sections on testing methods.
Note that automating tests is only possible if the task in hand is well understood, as is the case for spelling checkers. As things stand it is difficult to see how a grammar checker test could be fully automated, although certain subfunctions (such as spelling checking or subject-verb agreement checking) could be dealt with automatically.
Under normal circumstances this test data collection will be a one-off activity: once all test data are available for an instance of an object, the results will be stored in the library, where they can be used to evaluate the product against different sets of user requirements.
Evaluations will be carried out for specific users or groups of users. Users are defined in terms of their requirements, which are expressed as constraints on attribute-value pairs or as relative weights assigned to them. In order to facilitate this work, a special graphical editor has been constructed (the same program as the feature checklist editor ET, but running in a different mode). It takes a feature checklist as input and allows the operator to add the relevant constraints and weightings. The result is called a `user profile'.
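Conceptually, the resulting profile is the feature checklist with constraints and weights attached to (some of) its attributes; a hypothetical fragment, continuing the invented checklist example above, could look like this:

    # Hypothetical user-profile fragment: checklist attributes annotated with
    # constraints and relative weights for one class of users.
    TRANSLATOR_PROFILE = {
        "coverage": {
            "recall":    {"constraint": ">= 0.90", "weight": 0.5},
            "precision": {"constraint": ">= 0.80", "weight": 0.2},
        },
        "interface": {
            "suggests_corrections": {"constraint": "== true", "weight": 0.2},
        },
        "price": {"constraint": "<= 200", "weight": 0.1},
    }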
* it offers a user interface similar to the standard Windows interface;
* it offers a number of standard facilities for file and data manipulation, not unlike the MS DOS batch language, but tailored to Windows;
* it offers facilities to call other programs (such as ET, or the CLIPS program);
* and most importantly, it offers facilities to interact with Windows based programs, in exactly the same way as the human user.
The main strengths of the PTB can be summarized as follows:
* It is based on a formal framework, which gives it a solid foundation;
* Its inherent flexibility makes it easily adaptable to new types of objects and requirements of new types of users;
* The fact that it is a library of independent modules ensures optimal reusability across objects and user profiles;
* It provides an environment in which automated and manual measuring methods can coexist in a natural way.