Back to the workshop


Standards work related to evaluation.

1. A little history.

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) together form the specialized system for worldwide standardization. International Standards are developed by technical committees, whose membership comes from national bodies who are members of ISO or from IEC participants. ISO and IEC committees collaborate in fields of mutual interest.

General information about ISO, and about the two series of ISO standards which relate to management, ISO 9000 and ISO 14000, can be found at

The 9000 series primarily deals with quality assurance, the 14000 series with management and the environment.

Information on how to order ISO documents can be found at the same address.

An important standard pertaining to evaluation is ISO/IEC 9126, which was prepared by Joint Technical Committee JTC 1, Information technology.

The first edition of this standard, entitled "Information technology - Software product evaluation - Quality characteristics and guidelines for their use" was published in 1991.

As its title implies, this standard was mainly concerned with stipulating a set of quality characteristics worked out on the basis of a general definition of quality that was subsequently used in ISO 8402. The definition of quality is accepted for all kinds of products and services. It starts from the user's needs.

On the grounds that a set of definitions given only as an exercise in terminology would not provide sufficient support to those involved in assessing software quality, a description on how to proceed with evaluating the quality of a software product was also included.

It was acknowledged in the standard that evaluating product quality in practice required characteristics beyond the set given, and also required the development of metrics associated with each of the quality characteristics. However, the state of the art did not permit standardization in those areas, and rather than wait an indefinitely long period of time for the necessary enhancements, it was decided to issue the 1991 version to harmonise further development.

In 1994 it was felt that other standards being produced in the area of product quality evaluation necessitated the revision of 9126. The revision has resulted in a series of documents. The quality model and documents on metrics pertaining to it form part of the 9000 series. The process of evaluation has been separated out and is the topic of a series of documents in the 14000 series.

That revision is now almost complete, at least for the part which directly concerns the definition of quality. The draft of ISO/IEC 9126 Part 1, the quality model is, at the time of writing, at the Final Committee Draft stage. A vote will be considered at the November meeting of the Committee. No major changes are expected.

Similarly, a new standard ISO/IEC 14598-1, which gives a general overview of the process of evaluation, is very close to publication as an international standard.

Both the 1991 versions and the new versions are considered in more detail below.


The first phase of EAGLES work started in 1993. A primary goal of the initiative was standardization in the language engineering area. Naturally enough, what could or should be standardized varied from one working group to another. For the Evaluation working group, where it was felt that evaluation methods and techniques were at an early stage of development, the aim was to produce a way of thinking about evaluation rather than a set of recipes for the evaluation of particular types of systems. In particular, there was substantial agreement within the group that there could be no single and universal evaluation technique which could be applied to all language engineering products indifferently of the contexts in which the product would be used.

A first step therefore was to look for existing standardization work which could form a starting point for the development of a methodology for evaluation design: a way of thinking about evaluation which could be applied to the construction of any specific evaluation, and which, since it would be common to all evaluations of language engineering products, would provide a de facto standard at an appropriate level of abstraction, permitting the particularities of specific evaluations to be taken into account within a standardized framework.

Indeed, even though work concentrated on commercially available or near-to-market products, it was intended that the principles of evaluation design worked out within the project should be much more widely applicable, and should be capable of being used for evaluation at any point of the product's life cycle, from initial project proposal through development to commercialisation.

From this perspective, ISO/IEC 9126 was of considerable interest: it fitted almost exactly with what the group was looking for. Furthermore, it was part of the mandate of the EAGLES group that users needs and requirements should play a major role in the framework to be devised. This fitted in very closely with the ISO definition of quality:

"The totality of features and characteristics of a product or service that bears on its ability to satisfy stated or implied needs" (ISO 8402: 1986, note 1).

ISO/IEC 9126 was therefore very influential on the work of the group, and a great deal of effort was invested into first deciding what modifications and extensions would be necessary in order to apply the standards and guidelines in practice to the evaluation of language engineering systems, secondly into producing a formal version of a model of quality.

The first exercise involved defining quality characteristics and sub-characteristics for a number of different classes of systems. The characteristics for spelling checkers were worked out in some detail, a fairly substantial check-list for translation memory systems was produced, and work on grammar checkers was started. The work on spelling checkers and grammar checkers was mainly carried out in the framework of an LRE project, TEMAA, which carried the work on spelling checkers further by defining metrics for the quality sub-characteristics which had been identified. An account of that work can be found in section XXX of this report, and in the TEMAA final report.

Normalisation involved formal description of the quality characteristic hierarchy in terms of a feature structure of the type familiar from work in computational linguistics. Additional work on metrics and on automation within the TEMAA project allowed a prototype Evaluator's Workbench to be developed. Within the workbench environment, some measurement could be carried out (semi)-automatically, and a report could be automatically generated which assessed the suitability of a particular system in the light of the specific needs of a user or of a class of users. This latter was made possible by using the same descriptive tools for the description of users as those used for the description of systems, and by providing mechanisms for reflecting the relative importance of particular sub-characteristics for specific users. That work too is described in more detail elsewhere in this report (XXX).

The second round of EAGLES Evaluation work started in 1996 and is now drawing to a close. It was seen primarily as a consolidation and dissemination effort, with no new work on developing the EAGLES framework being undertaken within the group itself. During this phase, the group has been fortunate enough to have been able to enter into direct content with the Document Editor of the new drafts of ISO/IEC 9126 and of ISO/IEC14598-1. The draft of 9126 was presented in an Evaluation Group workshop in November of 1997. It was particularly pleasing to be able to notice a convergence of ideas, especially in the area of the importance of metrics. Subsequent examination of the draft of ISO/IEC 14598-1 has confirmed the convergence of ideas.

3. ISO/IEC 9126. First edition, 1991.

Since later revision has resulted in a division of the subject matter, discussion of ISO 9126 (1991) is here placed under two separate headings, even though both topics are covered in the same document in the 1991 standard.

The account is intended to be a brief summary of the documents in question, with occasional commentary touching on the relationship between EAGLES work and ISO. The commentary is of course entirely the responsibility of the EAGLES group, and in no way reflects ISO policy.

The Quality Model.

It has already been mentioned that the quality model set out in ISO/IEC 9126 is based on a general definition of quality, quoted above, which is intended to be applicable to any product or service. The model in 9126 is therefore a specialization of the generic model, intended as a quality model specifically of software product. Quality is seen in general as a composite of a set of quality characteristics. Relevant quality characteristics must be chosen and defined in order to produce a specialized quality model.

The requirements for choosing the quality characteristics set out in 9126 were as follows:

We recall that the definition of quality on which 9126 is based is that of ISO 8402: 1986:

"The totality of features and characteristics of a product or service that bear on its ability to satisfy stated or implied needs".

It is perhaps worth underlining here once again that this definition fits in very closely with the mandate given to the EAGLES Evaluation group to ensure that user needs play a central role in evaluation. Even though evaluation may be carried out at many different points in a product's life-cycle, and by many different people, thus giving rise to what 9126 calls different view-points on evaluation, the ultimate objective is always the satisfaction of user needs. Evaluation during development, for example, is aimed at predicting whether a product will ultimately satisfy user needs or not.

Six quality characteristics of software were stipulated in the standard: functionality, reliability, usability, efficiency, maintainability, and portability. We shall not give the detailed definitions here.

It is important to note that each of these characteristics was perceived to be the top level of a hierarchy of sub-characteristics. An annexe to the standard, Annexe A, whose status was informative rather than normative, gave examples of how each characteristic could be broken down into a set of sub-characteristics, each of which, could in its turn be further broken down. There is no claim that the sub-characteristics of Annexe A and their organisation constitute the only possible model of quality which can be derived from following the standard. Rather,

"The key point is that there should be a quality model to at least the subcharacteristic level for a software product, not that it should be of the precise form described in this annex." (ISO/IEC 9126, Annex A).

The guidelines contained in the body of the document also point out that the importance of each quality characteristic will vary, depending on the class of software.

"For example, reliability is most important for a mission critical system software, efficiency is most important for a time critical real time system software, and usability is most important for an interactive end user software." (ISO/IEC 9126: 1991, 5.1 Usage).

We have already mentioned that 9126 points out that there may be different views of software quality. This discussed in the document itself are those of the user (who may be an end-user in the conventional sense of end-user, but may also be an operator, a recipient of the results of the software, or even a developer or maintainer of the software: the essential point being that the user uses the system to perform a specific function), the developer or the manager. It is emphasized that the developer may use different metrics for some characteristic than the user. For example, the user may think of efficiency in terms of response time, while the developer, at some stage of development, may not be able directly to measure response time. But since he is by necessity ultimately interested in the same quality characteristics as the user, he will use other metrics, such as path length and access or waiting time to measure the same characteristic.

"Generally speaking, metrics applying to the external interface of a product are replaced by those applying to its structure". (ISO/IEC 91126: 1991, 5.2.2 Developer's view.)

We can summarize the quality model set out in 9126 by saying that a set of quality characteristics are stipulated, which can, and should be further broken down into sub-characteristics. The hierarchical structure thus obtained for some class of software product is a model of quality for that product. The quality characteristics, and especially the subcharacteristics given in Annex A are not rigid and unchangeable: their primary purpose is to serve as a check-list, guiding the evaluator in his attempt to decide and define what characteristics contribute to quality and therefore should be measured when carrying out an evaluation.

The Evaluation process model.

The evaluation process model given in 9126 is part of the guidelines for use of the quality characteristics. Three stages of the process are distinguished,

The evaluation process is conceived of as being generic: it applies to component evaluation as well as to system evaluation, and may be applied at any appropriate phase of the product life cycle.

Quality requirements definition involves setting up a model of quality for the product in question. The model defined will capture the stated or implied needs of the user, and will express the demands made by the environment upon the software produced. Requirements for system components may be derived from requirements for the whole system, but, typically, different requirements will be made on different components. The quality requirements are expressed in terms of quality characteristics and sub-characteristics.

Evaluation preparation involves three sub-phases:

Quality characteristics cannot be directly measured. Metrics must therefore be defined which correlate to the quality characteristic. Different metrics may be used in different environments and at different stages of a product's development. However, metrics used during the development phase should correlate to the metrics used when evaluating from the user view, since ultimately only the user view matters.

A metric typically involves producing a score on some scale, reflecting the particular system's performance with respect to the quality characteristic in question. This score, uninterpreted, says nothing about whether the system performs satisfactorily. Rating levels definition involves determining the correspondence between the uninterpreted score and the degree of satisfaction of the requirements. Since quality refers to given needs, there can be no general rules for when a score is satisfactory. This must be determined for each specific evaluation.

Each measure obtains contributes to the overall judgement of the product, but not necessarily in a uniform way. It may be, for example, that one requirement is critical, whilst another is desirable, but not strictly necessary. In this case, if the system does not perform satisfactorily with respect to the critical characteristic, it will be assessed negatively no matter what happens to all the other characteristics. If it performs badly with respect to the desirable but not essential characteristic, it is its performance with respect to all the other characteristics which will determine whether the system is acceptable or not. Assessment criteria definition involves defining a procedure for summarizing the results of the evaluation of the different characteristics, using for example decision tables or weighted averages.

Note that quality metrics selection, rating levels definition and assessment criteria definition all form part of the preparation of the evaluation, and are done before any measurement actually takes place.

One might comment that there are obvious good reasons for insisting that the three sub-phases above are part of the preparation. It is only too easy for the evaluator to be influenced by the results of the measurement, and to change his criteria for acceptability. Setting out those criteria before the measurement is done at least helps to minimize this danger.

The last stage is the evaluation procedure itself, broken down into;

These steps are intuitively straightforward in light of the above. Measurement gives a score on a scale appropriate to the metric being used. Rating determines the correlation between the raw score and the rating levels. Assessment is a summary of the set of rated levels. On the basis of this assessment, a final managerial decision is taken based on management criteria.

It is perhaps worth noting that all the steps above are mirrored rather faithfully in the prototype Evaluator's Workbench produced by the TEMAA project, and reported on in Section XXX.

Another point is worth making before turning to the later versions of the ISO standard. The overall perspective of the ISO standard is that software development: in the statement of scope we are told that the Standard is intended for those associated with "acquisition, development, use, support, maintenance or audit of software." This is a viewpoint quite different to hat of the comparative evaluations carried out in the framework of technology evaluation, such as the American programmes in various fields and the more recent comparative evaluation efforts in the Francophone world. (See Appendix XXX for more information).

This might lead the reader to believe that the evaluation of core technology and the sort of product or system evaluation presupposed by the ISO standards are fundamentally different. The EAGLES belief is that this is not so. The procedures set out in the ISO documents as well as the way of thinking reflected in the proposed ways of constructing definitions of models of quality are generic, and apply to all manner of evaluation. Indeed, the more recent ISO documents suggest that they may be useful even outside software evaluation and applicable to any complex product. In other words, the ISO documents propose a way of thinking which is part of the infrastructure basic to any evaluation design, no matter what the specifics of the particular evaluation might be.


Very little is said in 9126 about metrics, except that the state of the art is not sufficiently advanced for standardization work to be carried out, and that only a few generally accepted metrics exist for the quality characteristics given.

However, it is worth noticing that in this standard a metric is by definition a quantitative metric:

"3.14 software quality metric: A quantitative scale and method which can be used to determine the value a feature takes for a specific software product." (ISO/IEC 9126: 1991,3.14)

This is in contradistinction to the EAGLES proposal, where metrics are classified by the type of value they may take into facts, judgements and tests. Facts need not involve any kind of scale, and judgements are qualitative rather than quantitative, since they involve the exercise of human judgement. (For further discussion, see Section XXXX).

However, the disagreement is apparent rather than real, since 9126 elsewhere suggests that standards groups and organisations may establish their own evaluation process models for creating and validating metrics, and that

"In those cases where appropriate metrics are unavailable and cannot be developed, verbal descriptions or "rule of thumb" may sometimes be used." (ISO/IEC 9126: 1991, 5.1).

It is also interesting to note that the distinction made in the 1998 ISO documents between internal and external metrics (see below) is foreshadowed in the 1991 document by the remark, when discussing the developer's view of evaluation, that the metrics used by the developer, although they should correlate with those used from the user viewpoint, will not be the same metrics. In the developer's case:

"Generally speaking, metrics applying to the external interface of a product are replaced by those applying to its structure." (ISO/IEC 9126: 1991, 5.2.2)

4. ISO/IEC standards, 1998.

What are the major changes?

Before going on to look at the recent versions of the ISO standards relating to evaluation, it might be useful to summarize the major changes.

First, it is important to notice that the basic principles have not changed. It is still the case that the starting point for designing an evaluation is constructing a model of quality which is based on the ISO general definition of quality quoted above. Thus it is still the case that user needs are taken as primordial.

The major changes, then, are in format and in greatly expanded working out of the content of the standard. There now two separate series of documents, one series, relating to the quality model in the 9000 series, the other series, relating to the evaluation process model, in the 14000 series.

ISO/IEC 9126: 1991 did not talk explicitly of a quality model. The 1998 version explicitly specifies such a model. The quality characteristics remain unchanged, but normative sub-characteristics have been introduced, most of which are based on the illustrative subcharacteristics contained in Annex A of the 1991 standard.

A new notion "quality in use" has been introduced. Quality in use is quality as seen from the user point of view, and is super-ordinate to the six quality characteristics already defined, being a composite of them whose exact nature can only be determined by the specific requirements of a specific user in a specific environment.

Metrics have moved into the foreground as an area of interest. A distinguish is made between metrics relating to internal characteristics (internal metrics) of the software and metrics relating to the behaviour of the software as seen from the outside (external metrics). Documents on each of these two types of metrics are in preparation. Recently, a work item on metrics for quality in use has been added. This gives us the following documents in the 9126 series:

9126-1: Quality model

9126-2: External metrics

9126-3: Internal metrics

9126-4: Quality in use metrics.

9126-1 is close to publication as an International Standard. The other documents in the series are in preparation.

The evaluation process model has been removed from the 9126 series and is now part of the 14000 series. ISO/IEC 14598 consists of the following parts under the general title Information Technology - Software product evaluation:

14598-1: General overview

14598-2: Planning and management

14598-3: Process for developers

14598-4: Process for acquirers

14598-5: Process for evaluators

14598-6: Documentation of evaluation modules.

The discussion in the present draft of this document is based on the current version of 14598-1, which is close to publication as an International Standard.

ISO/IEC 9126-1: Final Committee Draft, 1998: Quality model.

The normative part of ISO/IEC 9126-1, 1998 is the definition of a quality model. The model distinguishes internal quality, external quality and quality in use.

It specifies six quality characteristics (the same six as those specified in 9126, 1991) for internal and external quality. The quality characteristics are broken down into subcharacteristics which now are an integral part of the normative work. Quality in use is broken down into four characteristics which are the combined effect of the software quality characteristics from the user's point of view.

The intended use of 9126:1998 is very wide indeed. The characteristics defined are meant to be applicable to any kind of software, and also provide a consistent terminology for software quality. Their chief purpose is to provide the framework for specifying quality requirements. The intended users of 9126:1998 include developers, acquirers, quality assurance staff and independent evaluators. Examples of uses of the quality model include:

IS/IEC 9126-1: 1998 can also be used in conjunction with other ISO standards in a wide variety of tasks, including software process assessment, definition, review, verification and validation of quality requirements during software lifecycle and quality assurance processes.

A quality model is described which explains the relationship between different approaches to quality. The breakdown into quality characteristics and subcharacteristics constitutes a specific implementation of the generic quality model.

A distinction is made between internal measures, which are typically static measures of intermediate products (by which is meant specifications, source code etc.) and external measures, which typically involve measures of the behaviour of the code when executed. Different approaches to quality then go in a chain from process quality, which influences internal quality, which in turn influences external quality which in turn influences quality in use. Seen from the opposite end of the chain, quality in use depends on external quality, which depends on internal quality, which depends on process quality.

Standing outside the chain, goal quality is the necessary and sufficient quality which reflects real user needs. Goal quality is not necessarily perfect quality, but the quality which allows the user to achieve his goals. It is not always possible to define goal quality completely before development starts, partly because real user needs are not always consciously known or stateable at that point, partly because user needs may change and develop during the development process.

The item to be evaluated differs according to the approach to quality. For process quality, it is the process itself. For internal and external quality, it is the software product. For quality in use, it is the effect of the software product.

This latter is rather important: to quote the new draft:

"Software never runs alone, but always as a part of a larger system consisting of other software products with which it has interfaces, hardware, human operators, and work flows...Quality in use (the capability of a product to meet stated and implied needs) can be measured by the extent to which a product used by specified users meets their needs to achieve specified goals with effectiveness, productivity, safety and satisfaction." (ISO/IEC 9126-1: 1998).

This is very reminiscent of what earlier EAGLES work called a set-up (based on a term used by Karen Sparck Jones in making essentially the same point as the quotation above). There, it was also pointed out that one set-up may be embedded inside another, thus leading to an important distinction, also made in 9126, between evaluation of a software product and evaluation of the system in which it is executed. ISO/IEC 9126-1 gives a clear example of how where the boundary of the system is considered to be changes depending on the purposes of the evaluation and on who the users are taken to be:

"For example, if the users of an aircraft with a computer-based flight control system are taken to be the passengers, then the system on which they depend includes the flight crew, the airframe, and the hardware and software in the flight control system, whereas if the flight crew are taken to be the users, then the system upon which they depend consists only of the airframe and the flight control system." (ISO/IEC 9126-1, 1998, 5.3).

9126 summarizes the use of a quality model in evaluation as follows:

"Software product quality should be evaluated using a defined quality model. The quality model should be used when setting quality goals for software products and intermediate products, Both software quality and quality in use should be decomposed into a quality model composed of characteristics and subcharacteristics which can be used as a checklist of issues relating to quality." (ISO/IEC 9126-1, 1998, 5.4).

Clauses 6 and 7 define a hierarchical quality model for software quality and quality in use, although it is noted that other ways of categorising quality may be more appropriate in particular circumstances. Despite this last qualification, it is noted elsewhere in the document that the model given is the default model: other models should not be used unless there is good reason to do so.

The quality characteristics.

We shall not give here the full definition of each of the quality characteristics and their subcharacteristics, restricting ourselves to a simple naming exercise. The reader is referred to the ISO document for full definitions. Note though that the subcharacteristics are now normative rather then simply informative as they were in 9126, 1991.

<<< to be continued>>>>