The standard TEMAA framework for evaluation is based on the idea that an evaluation is a function from a pair of an object and a user to a utility value. For example, there is an author (the user) and a spelling checker (the object); an evaluation in this case is a value assigned to this pair, expressing how useful the spelling checker is to the author.
In the TEMAA project, this view was implemented as follows:
an object class is characterised by a hierarchy of attributes
a user profile is a statement of desired values on these attributes and their relative importance
members of the object class are assigned values for basic attributes by performing tests
evaluation is the computation of the combined scores on all attributes
In this way, what is computed is the utility value of a given object to a given user.
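The evaluation function described above can be made concrete in a few lines. The following is a minimal illustrative sketch, not project code: the attribute names, scores, and weights are invented, and a simple weighted average stands in for whatever combination function an actual evaluation would use.

```python
# Sketch of the standard TEMAA evaluation function: an attribute hierarchy
# with measured basic values, combined under a user profile of weights.
# All names and numbers below are hypothetical.

def evaluate(attribute_tree, profile):
    """Combine measured scores into a utility value for one user profile.

    attribute_tree: nested dict; leaves map attribute name -> measured
                    score in [0, 1], inner nodes map name -> subtree.
    profile:        dict mapping attribute name -> relative importance.
    """
    total, weight_sum = 0.0, 0.0
    for name, node in attribute_tree.items():
        w = profile.get(name, 1.0)  # unmentioned attributes get weight 1
        score = evaluate(node, profile) if isinstance(node, dict) else node
        total += w * score
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

# An author's profile for a spelling checker: recall matters most.
checker = {"functionality": {"recall": 0.9, "precision": 0.7}, "speed": 0.4}
author = {"recall": 3.0, "precision": 2.0, "speed": 1.0}
utility = evaluate(checker, author)
```

The recursion mirrors the attribute hierarchy: basic attributes carry test results, and each level's scores are folded into its parent.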
In the course of the project, we saw some serious conceptual problems with this view.
One is that the word `user' hides a lot of complex questions and does not help to analyse them. For example, take some translation bureau T. The author in the example above is employed by T. The quality of a spelling checker is a concern for him because he wants to do his work well, with a view to a salary raise or promotion. The director of T is also a user of the spelling checker; what she wants is faster work from her employees, fewer complaints from them or from customers, etc. If T has shareholders, they take an interest in the spelling checker as well, though they may not be aware of it. In the standard TEMAA framework, all one can do is say that there are three user profiles: the end-user, the director, and the shareholder. Their preferences may be similar, but this is not necessarily the case. For example, `user-friendliness' may not be a serious concern for the shareholder. In any case, the TEMAA framework offers no way to analyse the interdependencies amongst related users like employees, directors, and shareholders.
Another problem is that of the relation between the object of evaluation and the organisation within which it is used. Suppose that T's director is considering buying a new piece of software that offers sophisticated support to her translators, a `Translator's workbench'. With the TEMAA framework, we can define relevant attributes of such systems (what kind of computer platform is needed, does it contain a translation memory, does it have two windows with coordinated scrolling, etc.), but these cannot be systematically related to the working organisation of the bureau; they can only be stated.
A third problem is `progress evaluation'. This is related to the `imperfective paradox', i.e. the problem with sentences like `John is building a house': for most of the time during this process, there is no house. Suppose we evaluate houses in the standard TEMAA way, with attributes like `does not leak', `secure against burglary', `water available in the bathroom', etc.; then at most times during the construction all of these may be negative, and it will seem that no progress is being made at all. This problem is also relevant if one wishes to evaluate a method instead of a fully instantiated object. An example could be `using unification grammars in natural language processing'. As long as there is no finished product containing a full grammar, it may seem that no progress is made.
We do not offer solutions to these questions here. We will just look at a different way to describe the basic evaluation function, which looks more natural than the current one and may offer possibilities for future practical improvements.
Here is an example of evaluation that does not fit easily in the standard TEMAA framework. Imagine some state of affairs in which some person, John, kisses some other person, Mary. We will call this state of affairs JKM. We may wish to evaluate JKM, asking specifically how `good' it is. The answer depends on the point of view we choose. Suppose John likes kissing Mary; then from his point of view JKM is rather `good'. But from the point of view of John's wife (assuming she is not Mary), JKM is possibly `bad' (if she is inclined to jealousy). Mary's point of view may add a third, yet different result; perhaps she simply does not care, and the evaluation result for her point of view is `indifferent'.
Three basic components are present in this example:
the state of affairs (the thing that gets evaluated)
the point of view, which serves as a point of reference for the evaluation
the evaluation result, which is a `value', often related somehow to `good' and `bad' (but not in an absolute sense: good and bad are defined w.r.t. the point of view)
The `value' is determined completely by the pair consisting of the point of view and the state of affairs.
The main reason why this example is difficult for the standard TEMAA framework is that there is no clear `object' present, at least there is no individual in the state of affairs that easily takes this role. Nevertheless, it definitely is a case of evaluation.
So we might take the basic roles as follows:
the object of evaluation is a state of affairs rather than an individual or set of individuals
the evaluation is defined w.r.t. a point of view rather than a `user'
evaluation is a function from these two to `utility', as before
A `point of view' is some description of a motivation state, expressible as a set of desired states of affairs like `I like to kiss Mary', `I want my husband to stay away from Mary', `I hope to become rich', `I need to get this document proof-read before tomorrow'. One good thing in this analysis would be that both the object of evaluation and the motivation could be expressed as constraints on states of affairs.
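As a rough illustration of this reformulation, both the state of affairs and the point of view can be expressed as sets of constraints, with utility as the degree to which the desired states hold. The propositions below are invented, and the crudest possible satisfaction measure is used; a real analysis would need something richer.

```python
# Toy model: a state of affairs and a point of view are both sets of
# propositions; utility is the fraction of desires satisfied in the state.

def utility(state, point_of_view):
    """Fraction of desired propositions that hold in the state of affairs."""
    if not point_of_view:
        return 0.0  # an empty motivation state: nothing to satisfy
    satisfied = sum(1 for desire in point_of_view if desire in state)
    return satisfied / len(point_of_view)

jkm = {"john_kisses_mary"}            # the state of affairs JKM
john = {"john_kisses_mary"}           # John's desires: he likes kissing Mary
wife = {"john_stays_away_from_mary"}  # his wife's desires
```

Here the same state of affairs receives different values from different points of view, with no single `object' singled out.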
Another useful thing with this view is that we can analyse a complex object like a working organisation. Think again of the director of the translation bureau who has to decide about the introduction of new translators' workbench software. She will evaluate two states of affairs. One is T, the bureau in the way it currently works; the other is T', which is the hypothetical bureau in the way it would work after the introduction of the new system. The object of evaluation is not just a piece of software but an alternative version of the bureau. In fact, our director has to evaluate even more states: those that would occur during the transition from T to T' if she decided on the change.
It should be clear that evaluating T' cannot be based on direct measurement: at the time of evaluation, T' does not exist. What the director would do is analyse T' and evaluate it on the basis of, amongst other things, constraints on the functioning of the new software. What can then be measured is the degree to which the new software complies with those constraints.
In this example, we may call the new software the `focus of evaluation'. The objects of evaluation are states of affairs (versions of T); but they are constructed on the basis of the question of whether to introduce the software, all other things are made dependent on that. In fact, the focus does not even have to exist at the time of evaluation. In the analysis of T', one may as well fill in a hypothetical software component, by stating its properties abstractly. This shows how the new evaluation framework would be a basis for evaluation of more abstract `foci' like a certain technology (e.g. `using unification grammar') or a component (e.g. a semi-automatic gearbox).
An important construct in this style of evaluation would be a `reachability relation' on states of affairs: some relation indicating how difficult it is to go from one state to another. The less useful states during the transition from T to T' (during which the translation bureau may even lose money) are still useful in a way, as they get the bureau closer to the desired state T'. Definition of a reachability relation is a basis for defining `progress'. One way to define reachability would be as a probability: the reachability of state S' given S is the probability that S' will occur, given S. This kind of thinking must go on in a car driver who has to decide whether to obey the speed limit. He evaluates the states Slow and Fast (driving), amongst other things in view of the reachability of (the probability of, the progress toward) a state Fined. This example, by the way, is another one that does not fit easily into the standard TEMAA framework.
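A toy version of such a probabilistic reachability relation, using the speeding example, might look as follows. All states, probabilities, and outcome values here are invented for illustration.

```python
# Reachability as transition probability: reach[s][t] is the probability
# that state t occurs, given state s. The driver scores each candidate
# state by the reachability-weighted value of the states it leads to.

reach = {
    "Slow": {"Arrived": 0.95, "Fined": 0.0},
    "Fast": {"Arrived": 0.99, "Fined": 0.3},
}
value = {"Arrived": 1.0, "Fined": -5.0}  # utilities of the outcome states

def expected_value(state):
    """Score a state by the reachability-weighted value of its successors."""
    return sum(p * value[t] for t, p in reach[state].items())

# The driver compares expected_value("Slow") with expected_value("Fast").
```

Under these invented numbers the risk of the Fined state outweighs the slightly better chance of arriving, so Slow scores higher than Fast.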
The preceding comments are not meant to invalidate the work done in TEMAA. In fact, we can see the standard TEMAA framework as a special version of the new one, where the object of evaluation is identified with the focus of evaluation and the further description of the state of affairs is (more or less) hidden in the concept of `user'.
For the future, the idea would be to give a more central place to analysing states of affairs. In practice, for NLP evaluation, this would often come down to analysis of professional organisations. One would search for commonalities amongst them in terms of the constraints on certain foci (like spelling checkers, translation software). Actual measurement (testing) would concern the degree to which existing foci meet these constraints, or, in the case of progress evaluation, the reachability of the desired states.
The evaluator's tool ET exemplifies this. In checklist mode, attributes can be added or removed, or their measure types modified. In user profiling mode, ET produces a tree whose root is the `utility' value for a single evaluation; its immediate children are quality characteristics, and their children are attributes, which may further decompose into sub-attributes. Each node has a specification, or requirement, plus a weight corresponding to the importance of the attribute, each of which may be modified by the user. Sibling nodes have their weights normalised to maintain the proportional contributions of their values to the parent node's value.
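The sibling normalisation described here can be sketched as follows. The node names, weights, and values are invented, and the actual ET implementation may differ in detail.

```python
# Sketch of ET's user-profiling tree: leaves hold measured attribute
# values; each inner node combines its children's values, with sibling
# weights normalised to sum to one so proportions are preserved.

class Node:
    def __init__(self, name, weight=1.0, value=None, children=()):
        self.name, self.weight = name, weight
        self._value, self.children = value, list(children)

    def value(self):
        if not self.children:  # leaf attribute: a measured value
            return self._value
        total_w = sum(c.weight for c in self.children)
        # normalise sibling weights before combining into the parent value
        return sum((c.weight / total_w) * c.value() for c in self.children)

utility = Node("utility", children=[
    Node("functionality", weight=2.0, children=[
        Node("recall", weight=3.0, value=0.9),
        Node("precision", weight=1.0, value=0.6),
    ]),
    Node("usability", weight=1.0, value=0.5),
])
```

Because only the ratios of sibling weights matter, a user can adjust one weight without having to re-balance the rest of the profile by hand.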
This approach offers a precise definition of how a composite figure for the value of a particular system to a particular user may be determined from the measurements that are made. However, such precision results in accuracy only where the attributes, their specifications, weights and their combinations accurately reflect requirements.
The project has not been able to produce much in the way of realistic instances of weights or specifications, or even, to some extent, of attribute choice. This may be partly because the user survey that was done was not heavily resourced and, since it took place at the beginning of the project, could not be informed by the more focussed questions that have arisen from the subsequent work. But a question mark must remain over how usable such an approach is until it has been tried with more detailed, focussed empirical work on defining user profiles. In future work, it would be desirable to move user survey tasks to after the development of a detailed framework of what is to be asked; or, indeed, as software design best practice suggests, to integrate user survey activity into the whole process. Indeed, a whole methodology of user profiling or requirements identification should be developed, since such practices will have a major effect on the comparability and reusability of evaluations.
We have identified a number of attributes which implicitly characterise user groups. The language of the text is the most obvious. Additionally, we have identified tendencies to certain kinds of typographical and phonetic error on the basis of published research on spelling errors (e.g., following Green and Hendry), and the effect of particular upstream paths from which text is derived, such as OCR. These factors might in principle be factored into the assignment of weights to particular attributes.
For instance, Hendry and Green report that Japanese first language (L1) writers of English make different phonetically based errors from other L1 groups -- this might result in the need for different attributes, not just different weights. However, this does not invalidate the approach; attributes may easily have weight zero for some user types. It is important to see the attribute structure developed so far as an approximate template, which can be modified as required (as ET indeed permits). As such, the attributes can be thought of as a Generic Task Model for the task of doing an evaluation in the area of authoring aids, available for modification for particular future evaluation requirements.
Most of the attribute choices we have made, and of course the definition of the tasks themselves, are not backed up by solid empirical evidence of the type that might be considered necessary to support the nomination of performance factors that truly determine a system's conformance to requirements (cf. Galliers and Sparck Jones); to do this requires extensive experimental work varying all possible factors independently. This shortcoming does not mean the attributes are useless, however; rather, it can be looked on as the result of taking a different approach, similar to that of requirements capture for software systems, to the question of how to justify the attribute choices, specifications, and weightings. When a requirements engineer is preparing a specification for a piece of software, the items in the specification are not in general arrived at by the sort of rigorous combinatorial processes mentioned, but by a complex human process (which may be supported by various analysis tools and methods), by means of which requirements couched in the vocabulary of the user are clarified, made consistent where possible, and translated into specifications for system behaviour.
It would be valuable to state explicitly the methods used to determine or justify attributes, specs, and weightings, but this has mostly fallen outside the bounds of the project's work so far, because although it is fully compatible with the framework, it is a major undertaking in its own right. For example, we have used the work of other people in identifying and establishing the relative frequency of error types in particular L1/L2 pairs. Explicit representation and recording of the methods they used would allow the requirements on error types and frequencies, and hence the attribute lists, specs and weightings, to be easily customised for, e.g., new L1/L2 pairs, in the same way as the libraries of methods for measuring system performance allow easy customisation.
This is a problem both of justification, and of communication. The framework as exemplified by the ET program, with its fully compositional approach to utility, does not provide any support for a user of such an evaluation who is not able to supply specifications and weights to reflect their requirements at the level of the attributes we define. However, one of the problems with a technology evaluation targeted for example at the technology purchaser, as might well be the case for adequacy evaluation, is that the user does not know how to express their requirements at this level. For instance, without some sort of guidelines, a user has no way to know what relative weight to place on the correction rates for simple typos versus those for phonetic errors. Some suggestions on ways forward in this area are given in Section 4.2.3.
There may even be a more fundamental problem with the compositional theory of value, which is essentially a monetary or consistent-unit theory, assuming that preferences are in effect transitive -- that if John prefers Coke to Pepsi, and Pepsi to lemonade, he will prefer Coke to lemonade in all circumstances. Much economic theory also depends on this assumption, but in fact there is reason to believe that it is flawed (Loomes, 1991). In practice people's value requirements may be sufficiently non-compositional that it is even theoretically (as well as practically) necessary to stop an evaluation short of computing a combined utility value. Moreover, the subtree of quality characteristics rooted in utility was not designed to be a compositional system, and the assumption that it can be treated as such is rather a strong one, for which evidence is so far lacking. Indeed, there is a certain amount of evidence that, e.g., usability and functionality cannot be neatly measured in completely separate parts of the tree, but that there are complex interactions at lower levels, for instance between interaction style (user-led or system-led) and the rate of false positives a user can tolerate.
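The Coke/Pepsi point can be made concrete: if the third preference runs the other way in some circumstance (lemonade over Coke, forming an intransitive cycle), then no single ranking, and hence no single utility assignment, can reproduce all three pairwise preferences. A small exhaustive check, with deliberately invented cyclic preferences, demonstrates this:

```python
# If pairwise preferences form a cycle, no total order (and so no single
# utility scale) is consistent with all of them; we verify by checking
# every possible ranking of the three items.

from itertools import permutations

prefers = [("Coke", "Pepsi"), ("Pepsi", "lemonade"), ("lemonade", "Coke")]

def consistent(order):
    """True if ranking `order` (best first) respects every pairwise preference."""
    rank = {item: i for i, item in enumerate(order)}
    return all(rank[a] < rank[b] for a, b in prefers)

items = ["Coke", "Pepsi", "lemonade"]
# No ranking satisfies all three preferences at once.
assert not any(consistent(o) for o in permutations(items))
```

This is the formal reason why genuinely intransitive preferences cannot be folded into one composite utility figure.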
So, to sum up, for a number of reasons it may be necessary to leave the final decision on User Profiles (how to choose and combine information about a system to get a utility value) to the user of the evaluation, but the framework as currently instantiated provides no support for the process of self-diagnosis that this requires. This is connected with the next section, which deals with the question of how an evaluation should be thought about and presented.
The two views are not, however, truly incompatible. The classic Consumer Report grid can be thought of either as (a) a partial representation of a set of compositional user profiles, or (b) a representation of a set of partially compositional user profiles.
On view (a), a CRP would be a partial representation of a set of compositional user profiles if a fully instantiated set of user profiles existed, but were used only to ascertain the relevant attributes, such that the CRP reported the values for all attributes reportable in any of the user profiles.
On view (b), a CRP would be a representation of a set of partially compositional user profiles if there was not enough information to determine all user compositional factors (see the discussion in the previous section for why this might be so), and hence it was necessary to report attributes down to the level at which compositionality is true (perhaps right down to the smallest measurement taken in principle, since the line between attributes and partial results combined by methods is drawn purely by the definition of what level of result is considered to be reportable).
It would be interesting to further formalise the idea of the CRP as a combination of the results grid for reportable attributes plus a set of user profile descriptions, in such a way as to bring out the equivalences between this and the compositional view. Such an account would treat the results grid as embodying the results of composition up to the point of the reportable attributes (i.e., those combinations that currently fall under the heading of methods in the framework). The user profile descriptions would then be treated in terms of the further processes of value combination, on the part of the user of the evaluation, that they must be able to support: they allow a user of the evaluation to map from characteristics they understand to weighting judgements about the reportable attributes, whose significance to their utility requirements may not be apparent without such guidance. A user profile description, then, might correspond to a set of plausible mappings from user-level characteristics (`quality of resultant text very important; first language Japanese...') to weighting judgements on reportable attributes (`recall has higher weight than precision; error type XXX has higher weight than other phonetic errors...'). Such mappings could potentially be represented in something like an expert system, supporting an interface which asks questions about the user-level characteristics and, on the basis of the answers, assigns the appropriate weights to produce a more completely compositional evaluation.
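A first approximation to such an expert-system mapping might look like the following. The characteristics, rules, and weight values are all hypothetical; a real system would derive its rules from the kind of empirical work discussed above.

```python
# Sketch of a rule-based mapping from user-level characteristics to
# weighting judgements on reportable attributes. Rules and weights are
# invented for illustration only.

def assign_weights(characteristics):
    """Map answers about the user to weights on reportable attributes."""
    weights = {"recall": 1.0, "precision": 1.0, "phonetic_errors": 1.0}
    if characteristics.get("first_language") == "Japanese":
        # hypothetical rule reflecting L1-specific phonetic error findings
        weights["phonetic_errors"] = 3.0
    if characteristics.get("text_quality") == "very important":
        # prefer catching every error over avoiding false alarms
        weights["recall"] = 2.0
    return weights

profile = assign_weights({"first_language": "Japanese",
                          "text_quality": "very important"})
```

An interface built on such rules could question the user in their own vocabulary and emit the weights needed to complete a compositional evaluation.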
The procedures and methods for obtaining such mappings remain to be discussed. In the next section, some suggestions are made on possible future research in this area.
The input to the ISO process model is a statement of `stated or implied needs'; in terms of classical software engineering practice, this requirements statement is the first-stage product of design, and is used in evaluation for acceptance of completed software. User profiles behave like parameterisations of requirements statements, capturing regular variation in requirements for similar types of system. As discussed in Section 5.1, the term `user' hides a lot of the important questions, and leads us to messy formulations in terms of different `kinds' of user, not very well distinguished. In the following discussion, the term requirements will generally be used instead of user profiles, and the question of what should replace the idea of the user will be addressed.
Requirements methods for attribute development
The formal and operational work on the TEMAA PTB has been forced to restrict its view of stated or implied needs to the idea that user profiles are (by some means outside the scope of the system) accurate reflections of requirements expressed in terms of system attributes. However, in the more informal parts of the work, some further machinery has been suggested which might be developed to an operational level in future. In developing a set of reportable attributes for spelling checkers, the quality characteristics were used in conjunction with a task-based description of the problem, which attempted to decompose the problem in terms of basic functions, data inputs and outputs to them, the roles associated with these functions, and variations in the roles and parameterisable features of these roles.
For instance, roles were distinguished for the writer, editor and reader of the text, as well as the system with which the editor interacts. (The writer can be generalised to the upstream history of the text, to include factors such as the fact that a text has been processed using OCR technology.)
Variations in requirements can be localised to variations in parameters belonging to particular roles, such as the tendency for writers of particular types (e.g., first language Japanese writers) to produce spelling errors of particular kinds, or the need for certain kinds of editor (end-user) to have access to thesaural information to support the task of choosing between correction suggestions.
In fact, the gathering of requirements is complicated not only by the existence of different role-players, but by multiple roles in the situation, which may result in completely different viewpoints on the problem. These may simply involve different vocabularies for describing compatible requirements, but they may also involve genuine clashes in requirements between different stakeholders, e.g., management and end-users, or writers and editors, as mentioned in Section 5.1. There is a good deal of work being done in Requirements Engineering on how to represent and manage such multiple viewpoints, some of which may be relevant to our future work in this area (see, for example, Nuseibeh et al 1994).
Unlike software design, where the ostensible aim is to produce a design that is fully equivalent to the requirements, part of the purpose of evaluation is to point out informatively where designs fail to fulfil requirements; this will require machinery similar to that needed to deal with possibly incompatible viewpoints. As an example, no spelling checker of the normal type can correct errors stemming from simple typographical mistakes which result in valid though unintended words, such as typing form instead of from, and it is tempting for system developers and evaluators simply to define spelling errors in such a way as to recognise this fact; yet it is not unlikely that certain classes of users would view these as spelling errors. An explicit and structured process of decomposing or transforming requirements stated in the users' terms into measurable attributes can provide opportunities for noting where and how such discrepancies occur, and permit systems to be described not in their own terms, but in terms of real user requirements.
Requirements methods for attribute justification
To extend the TEMAA evaluation framework for user profiling/requirements capture, we need to know where a certain reportable attribute requirement comes from in terms of the actual problem domain, and whether the derivation is valid; information supporting the choice of the attributes used should ideally be part of evaluation documentation, just as much as information about the methods used to arrive at values for the attributes. For requirements engineering in general, such validation is required to show that specifications are equivalent in some sense to the problem domain requirements (user-level descriptions of what is needed) they are intended to express -- that they are in a sense alternative descriptions of `the same thing'. There is considerable work in Requirements Engineering on the traceability of requirements in this process of redescription or transformation (see, for example, Gotel & Finkelstein 1993) and this may be of use in the future to our evaluation work.
The idea of traceability or design rationale can be thought of as attaching a justification to each reportable attribute, such that it is possible to say why the measurable attributes that form the basis of the evaluation actually address less measurable, high-level requirements. Such justifications can vary in type from rigorous empirical demonstrations of correlation (such as between first language and phonetic spelling errors, determined by large-scale data analysis) to appeals to previous experience of, for example, introducing a new software component into human processes such as editing or translating. In the latter case, such an approach might be a starting point for implementing a `reachability function' from existing worlds to desired worlds, as suggested in Section 5.1: the probability that a set of attributes of, e.g., a system under development reflects the desirability of the completed system depends on the quality of the justifications for the equivalence of a series of descriptions of the path between the current state and the desired state. Such a formulation leaves open the use of the very different kinds of justification that may be necessary depending on the actual connections we wish to make, while allowing them to be given the same place in the formal framework.
Requirements methods for reusability and comparison
One of the major benefits sought from the developing evaluation framework is reusability. Reusable evaluation resources not only save the effort of reinventing the wheel, but promote comparability and standardisation through the use of similar methods in different evaluations. The Parameterisable Test Bed (PTB) contains libraries of attributes and associated measures and methods, and envisages them being built up over time, together with guidelines on how to put them together to make up a new evaluation. From the requirements side, too, such libraries can be envisaged.
In our work so far on spelling checkers and grammar checkers, we have seen how these similar applications can be described with very similar task models, in terms of roles, data input and output, and parameters, and in terms of the reportable attributes that come out of the process of task analysis. If we could augment the current implicit connections between the task analysis and the reportable attributes with the kind of justification mechanism suggested above, these justifications would provide a further reusable component to guide the development of new evaluations. For instance, suppose we say that the justification for including a certain kind of phonetic error as a subtype of errors tested for is that (a) it is asserted that Japanese first language writers of English are a relevant group for our evaluation, and (b) it has been shown that such writers make these kinds of error significantly often in comparison with the other errors we measure. Then the empirical methods for identifying this (some kind of corpus analysis, presumably) can form the basis for developing comparable requirements for some new group.
Indeed, the statement that Japanese first language writers in English are a relevant group may not be simply an assertion, but may be justified by the fact that (a) the basic requirements are for a tool that can correct the spelling errors of the members of a certain bureau and (b) some of those members are Japanese writers in English. Development of useful requirements statements by means of justified decompositions of the problem state can be prompted by using pre-existing types of actor. For instance, all the human actors in language processing have parameters for their first language and for the language they produce documents in; when a new task situation is being modelled, such parameters can be used to prompt the developer to consider whether different values for them affect the validity of justifications that have been used previously.
Such classes and libraries of requirements components and methods have a long way to go before they are a workable artifact in a super-PTB; representing them in a manageable way to the developer of an evaluation would be quite challenging. However, it is a relatively clear path of investigation for future work, and could be informed by comparisons with the development of Generic Task Model libraries in the KADS approach to knowledge-based system design (see, e.g., Hayward et al) and knowledge acquisition techniques increasingly used in KBS and other requirements analysis (see, e.g., Cordingley 1989).
The programs themselves clearly would benefit from greater integration, robustness, and flexibility; this is an inevitable result of their status as late-agreed outputs of a relatively short research project.
An important component of the PTB methodology is the idea that libraries of test methods, materials, and results should be built up and used in the development of new evaluations. This implies some central body of test material and software, maintained and made available, perhaps via a mechanism like AFS. (See, for example, the Relator project's work in this regard.)
Access to system internals
The automatic administration of tests to the consumer software the project has considered has been hampered by the restricted access this software allows to its functional components since they are embedded in word processors and similar interfaces; for example, no meaningful testing of time and space efficiency attributes has been possible because of the nature of the test harness that has been required. (This is in contrast to the investigations carried out by Claris, where the actual spelling engine was supplied by its manufacturers for testing, so that a test harness could be built that could fairly compare them.) There are, however, two ways in which this situation may not be as much of a general problem for future work as it has been for the current project.
Firstly, much LE software which it would be desirable to evaluate is never likely to be `shrink-wrapped' consumer software in the way that spelling and grammar checkers are. Partly this is because practical LE technology generally requires non-trivial customisation effort before delivery to specific users; partly it is because one of the kinds of testing which it would be desirable to carry out would be the testing of software components under development, before their incorporation into products.
Secondly, present and future developments in personal computer operating systems suggest that an object or component-based approach, with well-defined APIs, is likely to be increasingly common. This would allow, for example, a word processor to be shipped with a standard spelling checker installed, but to make it easy for this to be replaced with an upgraded or specialised checker depending on user requirements. It would also make it easy to construct a standardised test harness into which alternative modules could be installed, with a level of access to functionality which would allow more useful evaluations of time and space efficiency issues.
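The kind of standardised harness such component APIs would enable can be sketched as follows. The checker interface, the sample checker, and the reported measures are all hypothetical.

```python
# Sketch of a standardised test harness: any checker implementing a
# common (assumed) interface can be dropped in and compared fairly,
# including on time-efficiency measures.

import time

class SpellChecker:
    """The assumed standard component API."""
    def check(self, word: str) -> bool: ...
    def suggest(self, word: str) -> list: ...

def harness(checker, test_cases):
    """Run (word, is_correct) cases through any conforming checker."""
    start = time.perf_counter()
    right = sum(checker.check(w) == ok for w, ok in test_cases)
    elapsed = time.perf_counter() - start
    return {"accuracy": right / len(test_cases), "seconds": elapsed}

class TinyChecker(SpellChecker):
    """A trivial checker used only to exercise the harness."""
    WORDS = {"from", "form", "the"}
    def check(self, word):
        return word in self.WORDS
    def suggest(self, word):
        return sorted(self.WORDS)

report = harness(TinyChecker(), [("from", True), ("teh", False)])
```

Because the harness depends only on the interface, an upgraded or specialised checker can replace the standard one without changes to the test machinery.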
In terms of test materials, for example, more `naturalistic' corpus-based methods could be supported by providing associated tools to support the marking up of corpora to identify errorful text and supply corrections appropriate to some editor group, and to extract these error examples and submit them for automatic testing. Equally, test materials can be developed based on a `test case' approach, in which, on the basis of knowledge about available techniques, particular examples are used as diagnostic markers for classes of functionality. Each of these constitutes a subclass of methods, which should be duly documented and placed in the appropriate relation to the existing methods in the library.
In terms of actual test administration, the PTB currently produces simple pages of description of methods for those attributes for which we have not developed automated methods. Future developments are foreseen in which the PTB will include methods for, e.g., usability testing, which are likely to involve some empirical testing on the part of the person conducting the evaluation. The PTB methodology applies once again: class types for different kinds of experimental technique and result type, with parameters for number of subjects, guidelines on decision-making, and so on, should be developed. The evaluator would choose a test type, set a few parameters, and receive from the PTB a neat form detailing the experimental protocol; when the experiments have been carried out, the results will be inserted in the appropriate place and the overall evaluation will proceed.
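Such a parameterised test type might be sketched as follows. The test type, its parameters, and the protocol text are invented; the point is only that a class plus a few parameter settings can yield the `neat form' described above.

```python
# Sketch of a PTB class type for an empirical method: a usability test
# type whose parameters generate a protocol form for the evaluator.
# All fields and defaults are hypothetical.

from dataclasses import dataclass

@dataclass
class UsabilityTest:
    task: str
    n_subjects: int = 8
    guideline: str = "think-aloud; record completion time and errors"

    def protocol(self) -> str:
        """Render the experimental protocol form for this test instance."""
        return (f"Usability test: {self.task}\n"
                f"Subjects: {self.n_subjects}\n"
                f"Procedure: {self.guideline}")

form = UsabilityTest(task="correct a marked-up text", n_subjects=6).protocol()
```

Once the experiments are run, the results would be slotted back into the evaluation at the node the test type is attached to.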
Indeed, the project's work on grammar checkers has shown that the kind of large-scale automatic testing possible for spelling checkers would not be possible because so many grammatical errors cannot be completely corrected by grammar checkers; instead it is necessary to measure the usefulness of the advice given by the checker in allowing some specific kind of user to correct the error. This again will involve the development of new, probably person-intensive, methods, which should nonetheless fall within the PTB methodology of explicitly documented, formalised methods.
The essential benefits of the methodology remain: in the explicit recording of methods; in the incorporation of the results into the overall evaluation; in the potential for reuse and comparability to be gained in supporting the user in choosing and developing methods as instances and adaptations of an existing taxonomy of methods.
However, it is clear that for many purposes it would be desirable to be able to guide the classification of users using terms more accessible to anticipated customers of an evaluation. For instance, there may be certain distributions of phonetic spelling error types that typically apply to writers in particular second languages with a particular first language; currently the PTB has no support for entering requirements at that level of granularity. A library of previous examples of relationships between `user-level' descriptions and reportable attributes might be useful, as might guided navigation through discrimination processes to choose between examples. Particularly where the tasks for which an evaluation is being prepared have well-defined linguistic inputs and outputs, it might be possible to support diagnosis of a particular user type by guiding the PTB user through an analysis of typical input and output texts (or dialogue pairs, for instance) supplied by the PTB user. Some automatic diagnosis on the basis of such texts might also be possible.