The Test Bed will be fed with descriptions of objects to be evaluated, with descriptions of the user's requirements, and with the evaluation measures and methods to be applied. Its output is an evaluation report on the product or products evaluated.
Since, at each moment in time, new object descriptions, user requirements, measures and methods can be added to the Test Bed, we will call it a Parametrisable Test Bed (PTB).
The PTB:
is fed with parameters describing
objects of evaluation, and
classes of users (in a broad sense) of these objects,
consults a library of test methods,
carries out the relevant tests, and
produces the evaluation report.
An object of evaluation can be any NLP application, e.g. a spelling checker, style checker, information retrieval system, or translation system or aid.
Objects are (in accordance with our formal framework) described in terms of attributes and values. The set of objects is structured: objects may be subtypes of other objects (grammar checkers may include spelling checking facilities), and classes of objects may have components in common (certain functions can be found both in grammar checkers and in style checkers). User requirements are formulated in terms of constraints on the values of attributes, and weightings of the relative importance of the attributes.
The PTB has a (continuously updated and expanded) library of
descriptions of known objects,
descriptions of known user types,
attributes and ways of establishing their values, and
results of earlier measurements.
For some attributes it will be possible to design automated methods to obtain their values for each instance of the object (e.g. error correction precision and recall by means of methods such as ASCC, described in the section on testing methods). For other attributes, the test method will ask the operator to provide specific data (e.g. the price of the instance), and for yet others it may provide the operator with a recipe for obtaining the value (e.g. a questionnaire to be sent out to 100 users, an experimental setup, etc.).
The actual testing will consist of collecting all values for the relevant attributes, either in an automated way or partly or entirely by hand. The result will be a full set of attribute-value pairs for one or more specific instances of an object. Note that this data collection is not dependent on any specific user type: the data depend on the object instances alone, and are neutral with respect to user types.
The last phase is the evaluation of the instances. Here the user requirements play a central role in that the relations between the various attributes and their weightings will determine which scores will be assigned to the products, and how they will be ranked.
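To make this step concrete, the following sketch (illustrative Perl, not part of the PTB software; the products, values, weights and the price constraint are all invented) shows how a set of object descriptions could first be filtered by a hard constraint from the user requirements and then ranked on a weighted combination of attribute values:

#!/usr/bin/perl
use strict;
use warnings;

# Invented object descriptions: attribute-value pairs per product.
my %products = (
    'Checker A' => { price => 120, recall => 0.97, precision => 0.88 },
    'Checker B' => { price => 260, recall => 0.99, precision => 0.91 },
    'Checker C' => { price => 180, recall => 0.92, precision => 0.95 },
);

# Invented user requirements: a hard constraint plus relative weightings.
my $max_price = 200;
my %weights   = (recall => 0.6, precision => 0.4);

# Step 1: discard products that violate the constraint.
my @admissible = grep { $products{$_}{price} <= $max_price } keys %products;

# Step 2: score the remaining products by a weighted sum of attribute values.
my %score;
for my $name (@admissible) {
    my $total = 0;
    $total += $weights{$_} * $products{$name}{$_} for keys %weights;
    $score{$name} = $total;
}

# Step 3: rank the products by score, best first.
print "$_\t$score{$_}\n" for sort { $score{$b} <=> $score{$a} } keys %score;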
The work supported by the PTB falls into four layers of activity:
1. Maintenance of the test bed, i.e. creation and maintenance of testing materials and tools;
2. Collecting test data, i.e. applying testing tools and materials to objects of evaluation;
3. Defining object and user types, i.e. creation and maintenance of feature checklists and user profiles;
4. Performing evaluations, i.e. evaluating object instances.
Note that these layers correspond closely to the 'evaluation actions' described in 2.2.2. Layer 1 covers actions 4 and 5 (definition of tests and the creation of test materials), layer 2 covers action 6 (performing the tests), layer 3 covers actions 1 and 3 (definition of object and user types), and layer 4 corresponds to actions 7 and 8 (performing the evaluation and reporting). Note also that evaluation action 2 (collecting objects of evaluation) falls outside the scope of the PTB: no facilities are included to acquire the software to be evaluated.
The four layers of the PTB will be described in the rest of this section, whilst section 4.4 contains a brief technical description of the main computer programs involved.
Currently, the PTB libraries contain the following items (or classes of items).
First of all, there is a list of known object types to be evaluated, such as spelling checkers, grammar checkers, translation aids, etc.
For every single object type there is a specification of all its attributes (in their hierarchical structure), and of the types and ranges of their values. The descriptions are kept in files and are called 'feature checklists'. Here we will not try to answer the question of how exactly one arrives at a set of attributes for a given object type. Ideally, the set of attributes reflects all and only those properties of the object that are relevant to users, but the perception of what counts as relevant may change over time, and may require reconsideration of the descriptions.
For every object type there is a listing of all known (i.e. known to the PTB) instances of these objects.
We will call those attributes whose values can be obtained by direct observation, inspection or measurement 'primitive attributes', as opposed to attributes whose values depend on values of other attributes.
A list of all primitive attributes (for all objects) is kept in a special catalogue, which specifies not only the full name of each attribute but also, per attribute, the name of the test program that will compute (or, in some cases, ask the operator for) the value of that attribute.
In principle one might have a different test program for each single attribute, but it can be expected that some test programs will be applicable to more than one attribute (e.g. asking for specific information such as a product's size or price).
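As an illustration only (the entries and the notation are invented and do not reproduce the actual PTB catalogue format), such a catalogue can be pictured as a simple mapping from full attribute names to test program names, with one generic 'ask the operator' program serving several attributes:

use strict;
use warnings;

# Invented catalogue entries: each primitive attribute is paired with the
# name of the test program that establishes its value. ASCC covers the
# automatically testable attributes; 'ask' stands for the generic program
# that requests a value from the operator.
my %catalogue = (
    'spelling checker.functionality.recall'              => 'ASCC',
    'spelling checker.functionality.precision'           => 'ASCC',
    'spelling checker.functionality.suggestion adequacy' => 'ASCC',
    'spelling checker.price'                             => 'ask',
    'spelling checker.size'                              => 'ask',
);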
At this moment there are two test programs. One is an elaborate program (ASCC), capable of carrying out fully automatic coverage checks of spelling checkers; the other is a generic program which asks the operator to provide the desired data.
The design and construction of appropriate tests for all attributes (and the tools to support them) is a non-trivial enterprise which falls beyond the scope of this project. As a demonstration of what is needed for such a task, a description of ASCC, of its associated tools, and of the data collection work needed in order to apply it, is presented in the sections on testing methods.
Note that automating tests is only possible if the task at hand is fairly well understood, as e.g. in the case of spelling checkers. As things stand it is difficult to see how a grammar checker test could be fully automated, although certain subfunctions (such as spelling checking, subject-verb agreement) could be dealt with automatically (cf. the section on grammar checkers).
This test data collection will, under normal circumstances, be a one-off activity: once all test data are available for an instance of the object, the results will be stored in the library, where they can be used to evaluate the products against different sets of user requirements.
Evaluations will be carried out for specific users or groups of users. Users are defined in terms of their requirements, which are expressed as constraints on attribute-value pairs or as relative weights assigned to them. In order to facilitate this work, a special graphical editor has been constructed (the same program as the feature checklist editor ET, but running in a different mode). It takes a feature checklist as input and allows the operator to add the relevant constraints and weightings. The result is called a 'user profile'.
The main PTB program is implemented in a scripting environment with the following properties:
it offers a user interface similar to the standard Windows interface;
it offers a number of standard facilities for file and data manipulation, not unlike the MS DOS batch language, but tailored to Windows;
it offers facilities to call other programs (such as ET or ER);
and, most importantly, it offers facilities to interact with Windows-based programs in exactly the same way as a human user.
When the program is run, it prompts the user to select an action from one of the four layers of activity described in section 4.3 above. The maintenance layer provides facilities to edit files (all files are plain text files, which can easily be edited) and to perform certain conversions between file formats. Other facilities include debugging, viewing log files and (re)compilation of the program.
The test data collection layer has two options. The standard option is to select a specific object type and a product, and to run all known tests for the attributes connected with that object instance. The results are collected in one file, to be used by the ER, the Evaluation Reporter. The other option makes it possible to assign specific values to specific attributes when individual tests have been carried out separately (e.g. non-automated tests) or when erroneous results have to be corrected (without repeating the full test).
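The following sketch (hypothetical Perl, not the PTB implementation; the program names, the file name and the stand-in subroutines are invented) illustrates the idea behind the standard option: each primitive attribute of the chosen object instance is passed to the test program registered for it, and the resulting attribute-value pairs are collected in a single results file for the Evaluation Reporter:

#!/usr/bin/perl
use strict;
use warnings;

# Which test program establishes which attribute (invented entries).
my %catalogue = (recall => 'ASCC', precision => 'ASCC', price => 'ask');
my %programs  = (ASCC => \&run_ascc, ask => \&ask_operator);

# Run all known tests for this object instance and collect the results.
open my $out, '>', 'results_checker_x.txt' or die "cannot open results file: $!";
for my $attribute (sort keys %catalogue) {
    my $value = $programs{ $catalogue{$attribute} }->($attribute);
    print {$out} "$attribute\t$value\n";
}
close $out;

# Stand-ins for the real test programs.
sub run_ascc     { my ($attr) = @_; return $attr eq 'recall' ? 0.97 : 0.88 }
sub ask_operator { my ($attr) = @_; print "Value for $attr? "; chomp(my $v = <STDIN>); return $v }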
The third layer lets the user create or modify object descriptions or user profiles, using the program ET (cf. below).
The fourth layer starts the Evaluation Reporter (ER), cf. below.
The editor ET supports three activities:
defining feature checklists
defining user profiles
performing an evaluation
A feature checklist is assumed to be a tree of typed attributes (but with values not specified); what ET offers is a graphical tree-oriented editor dedicated to these formal objects.
A user profile is a feature checklist with additional information describing a class of users. This extra information concerns the reportability of an attribute, the importance of a sub-attribute relative to other sub-attributes, and the utility functions for values of attributes that are measured directly. The same graphical tree editor can be used in user profile mode for the creation and maintenance of user profiles.
We remind the reader that 'terminal attributes' (the primitive attributes introduced above) are the ones that get a value by collecting data on objects directly, while 'nonterminal attributes' get their value from some function of their daughter attributes' values. In the case of ET, this function is a weighted average.
To perform an evaluation for a given object of evaluation O, a file of test results for the set of terminal attributes must be supplied to ET. On the basis of such a data file, ET computes scores for all nonterminal attributes, up to the root, which denotes the global utility value of O.
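The computation can be pictured as in the sketch below (illustrative Perl, not ET itself; the attribute tree, the weights and the utility functions are invented, and only a 'functionality' subtree with terminal daughters is shown). Terminal attributes obtain a utility from their measured values; every nonterminal attribute receives the weighted average of its daughters' scores, recursively up to the root:

#!/usr/bin/perl
use strict;
use warnings;

# Invented test results for the terminal attributes of one object instance.
my %measured = (recall => 0.97, precision => 0.88, suggestion_adequacy => 0.75);

# Invented user profile fragment: weights and utility functions per attribute.
my $profile = {
    name      => 'functionality',
    daughters => [
        { name => 'recall',              weight => 0.5, utility => sub { $_[0] } },
        { name => 'precision',           weight => 0.3, utility => sub { $_[0] } },
        { name => 'suggestion_adequacy', weight => 0.2, utility => sub { $_[0] >= 0.7 ? 1 : $_[0] / 0.7 } },
    ],
};

# A terminal attribute maps its measured value through its utility function;
# a nonterminal attribute takes the weighted average of its daughters.
sub score {
    my ($node) = @_;
    return $node->{utility}->( $measured{ $node->{name} } ) unless $node->{daughters};
    my ($sum, $total_weight) = (0, 0);
    for my $daughter (@{ $node->{daughters} }) {
        $sum          += $daughter->{weight} * score($daughter);
        $total_weight += $daughter->{weight};
    }
    return $sum / $total_weight;
}

printf "utility of this subtree: %.2f\n", score($profile);

For the invented figures this prints a score of 0.95 (0.5 * 0.97 + 0.3 * 0.88 + 0.2 * 1, rounded).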
ET is not specific to spelling checkers or even NLP evaluation, but can be used to record any user profile that can be expressed in terms of desired properties of objects of evaluation.
A full documentation with example checklists and user profiles is provided as part of TEMAA deliverable D17 (the PTB software). Snapshots of ET operation are provided as illustrations in chapter 2, section on components of an evaluation procedure.
Reports are displayed as a hypertext tree similar to that for ET, which allows navigation via a document structure tree or from node to node. Nodes contain text and links to other nodes; for instance, in the spelling checker tree, the node for Functionality contains links to nodes for Recall, Precision, and Suggestion Adequacy. The tree can be saved as HTML, so reports can be made available via WWW.
Snapshots of ER operation are provided as illustrations in chapter 2, section on components of an evaluation procedure.
As stated earlier in this report, the functionality attribute of a spelling checker is composed of three attributes: recall, precision, and suggestion adequacy (though the last attribute might also have been placed under 'user-friendliness').
ASCC works essentially by running spelling checkers on word lists, storing the results of the spelling checker application, and generating reports based on the differences between input and output lists. It is intended to work with any spelling checker that runs under MS Windows, but adding new checkers (or even new versions of existing ones) requires reprogramming part of ASCC (the part where it communicates with the checker). Since, within TEMAA, we had no systematic way to invoke each spelling checker in isolation, ASCC runs the checkers by driving the full word processing packages they are part of. This is a somewhat roundabout way of working that is open to improvement in the future.
ASCC provides a way to test spelling checker functionality in a systematic way and on large data sets. Doing this by hand is not practicable. For example, suppose one would like to test for 20 error types (systematic types of spelling mistakes), with 5000 items per error type, for three spelling checkers and three languages: then the total number of item-test instances would be 900,000. At a pace of 20 per minute, this would amount to 750 hours of work. In order to compensate for judgement errors of the human tester, this should probably be done twice.
So, even though ASCC may not be the ideal way to do this job, it at least provides a practicable one. It runs entirely in batch mode, so large amounts of data can be processed without human intervention.
A detailed documentation is provided as part of TEMAA deliverable D17 (the PTB software).
Testing recall
Input: a collection of spelling checkers recognised by ASCC, and a list of words (in practice, correct words).
Output: Per spelling checker, statistics on the number of words accepted, recall expressed as a percentage, and a list of (erroneously) rejected words.
Testing precision
Input: a collection of spelling checkers recognised by ASCC, and a list of words (in practice, incorrect words).
Output: Per spelling checker, statistics on the number of words rejected, precision expressed as a percentage, and a list of (erroneously) accepted words.
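Under these definitions the two figures reduce to simple counts, as in the following sketch (illustrative only, not ASCC; the word lists and the checker's behaviour are invented):

#!/usr/bin/perl
use strict;
use warnings;

# Invented test data: correct words for the recall test, incorrect words for
# the precision test, and the set of words the checker flagged as misspelled.
my @correct_words   = qw(house garden table window);
my @incorrect_words = qw(hous gaarden tabel windw);
my %flagged         = map { $_ => 1 } qw(gaarden tabel windw);

my $accepted_correct   = grep { !$flagged{$_} } @correct_words;    # words correctly accepted
my $rejected_incorrect = grep {  $flagged{$_} } @incorrect_words;  # non-words correctly rejected

printf "recall: %.1f%%\n",    100 * $accepted_correct   / @correct_words;
printf "precision: %.1f%%\n", 100 * $rejected_incorrect / @incorrect_words;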
Testing suggestion adequacy
Definitions (see also section 3.1.3.3):
- error item: some misspelled word
- source item: for some error item, the correct word assumed to be the 'intended' word; for a way to construct pairs of source and error items systematically, see the description of the Errgen program
- suggestion: a suggestion for changing the spelling, offered by a spelling checker
- correct suggestion: a suggestion, for a given error item, that is identical to the corresponding source item
Input: a collection of spelling checkers recognised by ASCC, and two parallel lists of words: mis-spelled words and their 'sources' (i.e. the words that are assumed to be the intended correct words).
Output: Per spelling checker, statistics on the following:
- how often the first suggestion is correct
- how often one of the first 5 suggestions is correct
- how often no suggestions are offered
- how often all of the first 5 suggestions are wrong (redundant with respect to the previous three)
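The tallies can be sketched as follows (illustrative only, not ASCC; the parallel error/source lists and the suggestion lists are invented, whereas in ASCC the suggestions come from the spelling checker under test):

#!/usr/bin/perl
use strict;
use warnings;

# Invented error items, their sources, and the suggestions offered for each.
my @errors      = qw(hous gaarden tabel);
my @sources     = qw(house garden table);
my %suggestions = (
    hous    => ['house', 'hours'],
    gaarden => ['garden'],
    tabel   => [],                   # no suggestions offered
);

my ($first_correct, $top5_correct, $none_offered) = (0, 0, 0);
for my $i (0 .. $#errors) {
    my @sugg = @{ $suggestions{ $errors[$i] } };
    if (!@sugg) { $none_offered++; next }
    $first_correct++ if $sugg[0] eq $sources[$i];
    my $last = $#sugg < 4 ? $#sugg : 4;                       # look at the first 5 at most
    $top5_correct++ if grep { $_ eq $sources[$i] } @sugg[0 .. $last];
}
printf "first suggestion correct: %d, correct among first 5: %d, no suggestions: %d (of %d items)\n",
       $first_correct, $top5_correct, $none_offered, scalar @errors;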
The Errgen program works roughly as follows. Its input is a list of rules and an input word file; its output is a list of words containing errors according to one of the rules.
A detailed documentation is provided as part of TEMAA deliverable D17 (the PTB software).
Rules
A rule file contains some number of declarations and rules. Declarations are assignments to variables (identifiers prefixed with $) for use in subsequent rules, for example the following declaration defines the notion of a vowel:
$Vowel="[aeiou]"
A rule is a set of Perl substitution commands (subrules) that simulate types of human spelling errors, e.g.:
Rule 3
# double some character
s/(.)/$1$1/
End
Rule 4
# double some vowel
s/($Vowel)/$1$1/
End
Input
The input list, i.e. the list of words to be corrupted
A filter (optional), i.e. the list of (correct) words that are not allowed as corruption results
A rules file
An initialization file
Output
Two output word lists are created:
the source list, listing the items of the input list for which a result was found;
the error list, listing (in the corresponding order) the corruption results.
What Errgen does
Errgen is called with parameters indicating a rule file, a rule name, an input word list and, optionally, a filter.
The selected rule is applied to each item of the input list.
Per input item, this is done in four steps:
all matching subrules are collected
each of these is applied to the current item in all possible ways, adding the result to a result pool
if a filter is supplied, this is applied to the result pool
of the remaining results, one is chosen randomly to be the final result
If no subrules are applicable or all results have been eliminated by the filter, there is no result.
If there is a result, the current input item is written to the output source items file and the corruption result to the output error list (a sketch of this procedure is given below).
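By way of illustration, the sketch below (hypothetical Perl, not Errgen itself) walks through these steps for the 'double some vowel' rule; for brevity it applies a single subrule, whereas an Errgen rule may pool the candidates of several subrules:

#!/usr/bin/perl
use strict;
use warnings;

# Invented filter: correct words that are not allowed as corruption results.
my %filter = map { $_ => 1 } qw(feed);

sub corrupt {
    my ($word, $pattern, $replace) = @_;
    my @candidates;
    while ($word =~ /$pattern/g) {                       # apply the subrule at every match position
        my ($start, $len) = ($-[0], $+[0] - $-[0]);
        my $variant = $word;
        substr($variant, $start, $len) = $replace->(substr($word, $start, $len));
        push @candidates, $variant;
    }
    @candidates = grep { !$filter{$_} } @candidates;     # remove results that are correct words
    return @candidates ? $candidates[rand @candidates]   # choose one result at random
                       : undef;                          # no result for this item
}

for my $word (qw(garden fed)) {
    my $error = corrupt($word, qr/[aeiou]/, sub { $_[0] x 2 });   # 'double some vowel'
    print defined $error ? "$word\t$error\n" : "no result for $word\n";
}

Note that in this invented example 'fed' yields no result: its only corruption, 'feed', is itself a correct word and is removed by the filter.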
Use of results
The error list can be used for testing the precision of spelling checkers (i.e. the degree to which they reject non-words).
The output source list can be used to check which items got no results (by comparing it to the input word list).
The coordinated pair of output source list and error list can be used to test the suggestion adequacy of a spelling checker (under the hypothesis that the source item was the 'intended' spelling, i.e. under the hypothesis that the errors introduced by the rule are realistic).
In summary, the main strengths of the PTB design are the following:
It is based on a formal framework, which gives it a solid foundation;
Its inherent flexibility makes it easily adaptable to new types of objects and requirements of new types of users;
The fact that it is a library of independent modules allows a high degree of reusability across objects and user profiles;
It provides an environment in which automated and hand measuring methods can coexist in a natural way.