Below is a (very) condensed version of the parser evaluation
survey. Both my online conference paper and printed workshop paper contain
fuller versions of this material.
We present a critical overview of the state of the art
in parser evaluation methodologies and metrics:
Listing Linguistic Constructions Covered (No corpus):
the advantages of this relatively simple, traditional approach (e.g. see Briscoe
et al., 1987) are that it requires no corpus resources, and that an indication
of the constructions not covered can be useful. On the other hand, since
the distinction between core and peripheral aspects of constructions is
often ill-defined, and much of the complexity can be in their interaction,
the approach provides no precise information about actual coverage of text.
Coverage (Unannotated corpus): calculating the percentage
of sentences from a given, unannotated corpus which are assigned one or
more analyses by a parser/grammar is a useful first step in evaluation
(e.g. Lehmann et al., 1996); in the current context it is a weak measure,
though, because it does not guarantee that any of the analyses returned
is the correct one.
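As a simple illustration, coverage reduces to counting the sentences that receive at least one analysis. The following sketch assumes a hypothetical `parse` function returning the list of analyses a parser/grammar assigns to a sentence; the toy parser and its lexicon are invented for the example.

```python
# Sketch of corpus coverage: the fraction of sentences assigned at least
# one analysis. `parse` is a hypothetical function returning the list of
# analyses for a sentence.
def coverage(sentences, parse):
    parsed = sum(1 for s in sentences if len(parse(s)) > 0)
    return parsed / len(sentences)

# Toy parser that "fails" on sentences containing out-of-lexicon words.
def toy_parse(sentence):
    lexicon = {"the", "cat", "sat", "on", "mat"}
    return ["(S ...)"] if all(w in lexicon for w in sentence.split()) else []

print(coverage(["the cat sat",
                "the cat sat on the mat",
                "colorless green ideas"], toy_parse))
```

Note that nothing in this computation inspects the content of the analyses, which is exactly the weakness described above.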
Average Parse Base (APB; Unannotated corpus): the APB
(Black et al., 1993) is used to give the number of (undisambiguated) parses
that would be expected for a sentence of specified length in a particular
corpus. While it is a succinct measure, an unambiguous but perhaps low-coverage
grammar would score well on it; moreover, it is not appropriate for comparing
different parsers on different data.
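One way to make the measure concrete, under the assumption that the parse base of a single sentence with n words and p parses is p**(1/n) (so that a grammar with APB b is expected to assign roughly b**n parses to an n-word sentence), is to average these per-sentence values; the exact aggregation in Black et al. (1993) may differ in detail:

```python
# Sketch of the Average Parse Base, assuming the per-sentence parse base
# is p**(1/n) for p parses over n words, averaged across the corpus.
def average_parse_base(parse_counts, lengths):
    bases = [p ** (1.0 / n) for p, n in zip(parse_counts, lengths)]
    return sum(bases) / len(bases)

# Three sentences with 8, 81 and 2 parses of lengths 3, 4 and 1 words.
apb = average_parse_base([8, 81, 2], [3, 4, 1])
print(round(apb, 3))
```

An unambiguous grammar (one parse per sentence) trivially achieves the minimum APB of 1 regardless of how little of the corpus it actually covers, which is the weakness noted above.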
Entropy/Perplexity (Unannotated corpus): this approach
can be used to compare the effectiveness of different probabilistic models
or training regimes, and also generalised to provide a model-independent
measure of the inherent complexity of a corpus (Sharman, 1990). The basic
measure has a clear interpretation and allows comparison of different models
of the same corpus. Disadvantages are that it applies only to probabilistic
models, and provides only a weak measure of the accuracy of the (highly-ranked)
analyses.
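The basic measure can be sketched as follows: given a model's log-probabilities for each corpus sentence, the per-word cross-entropy H is the negated average log2 probability, and perplexity is 2**H. The particular numbers below are illustrative only.

```python
import math

# Sketch of corpus perplexity for a probabilistic model: `log2_probs`
# are the model's log2 probabilities of each sentence, `n_words` is the
# total word count of the corpus.
def perplexity(log2_probs, n_words):
    entropy = -sum(log2_probs) / n_words   # per-word cross-entropy H
    return 2 ** entropy                    # perplexity = 2**H

# Two 4-word sentences, each assigned probability 1/8 by the model.
print(perplexity([math.log2(1 / 8), math.log2(1 / 8)], 8))
```

A lower perplexity means the model is less "surprised" by the corpus, which makes different models of the same corpus directly comparable, but says nothing about whether the analyses the parser ranks highest are correct.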
Part-of-speech Assignment Accuracy (Annotated corpus):
although there is for many languages a good deal of manually-corrected
part-of-speech tagged material, many broad-coverage parsers take pre-tagged
text as input, so this measure is inappropriate for these parsers. Also,
it provides only a very partial measure of the accuracy of a syntactic
analysis, so has been used mainly to evaluate `lexical' parsers (Karlsson
et al., 1995).
Structural Consistency (Annotated corpus): Black et al.
(1993) define structural consistency as the percentage of sentences in
a treebank which receive at least one analysis which is structurally consistent
with the correct analysis (according to the crossing brackets measure,
defined below). The measure is stronger than simply measuring coverage;
however, on its own it is inadequate, as it favours systems which yield minimal
constituent structure.
Best-first/Ranked Consistency (Annotated corpus): this
measure, the percentage of highest-ranked analyses output by a parser which
are identical to the correct (treebank) analysis (Briscoe and Carroll,
1993), gives an immediately understandable score, but it is highly dependent
on having an accurate, coherent and consistent annotated corpus which is
fully compatible with parser output.
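Computationally this is just an exact-match rate over highest-ranked analyses; the bracketed strings below are invented toy analyses, and in practice the comparison only works if the parser's output representation matches the treebank's exactly (the compatibility requirement noted above):

```python
# Sketch of best-first/ranked consistency: the fraction of sentences
# whose highest-ranked parser analysis is identical to the treebank one.
def exact_match(best_parses, gold_parses):
    hits = sum(1 for b, g in zip(best_parses, gold_parses) if b == g)
    return hits / len(gold_parses)

best = ["(S (NP the cat) (VP sat))", "(S (NP it) (VP fell))"]
gold = ["(S (NP the cat) (VP sat))", "(S (NP it) (VP (V fell)))"]
print(exact_match(best, gold))
```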
Tree Similarity Measures (Annotated corpus): Sampson et
al. (1989) and others define tree similarity measures which are finer-grained
than full identity and more tolerant of annotation errors. However, it
is difficult to see how the measures map onto many parsing tasks, and a
detailed and compatible treebank is still required.
Grammar Evaluation Interest Group (GEIG) Scheme (Annotated
corpus): GEIG computes the percentage of the parser's bracketed constituents
that are correct (precision) and the percentage of the correct (treebank)
constituents that the parser finds (recall), and also the average number of
`crossing' brackets per sentence, where a constituent from the parser overlaps
with one from the treebank but neither is properly contained in the other (Grishman et al.,
1992). Advantages are that a relatively undetailed (merely bracketed),
less parser-specific annotation is required, some level of cross-framework/system
comparison is possible, and the measure is tolerant of annotation errors.
Disadvantages are that it is much weaker than full identity, and it is
unclear how much of a success (or failure) it is to achieve high (low)
scores: for example, a single attachment mistake embedded n levels deep
(and perhaps completely innocuous, such as an `aside' delimited by dashes)
can be penalised by n crossings, whereas incorrect identification of arguments/adjuncts
can go unpunished in some cases; also, the precision measure marks down
a system for inserting extra possibly correct constituent structure.
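The three GEIG figures can be sketched over a single sentence by representing each analysis as a set of unlabelled constituent spans (start, end) over word positions; the example spans are invented:

```python
# Sketch of the GEIG bracket measures over one sentence. Analyses are
# sets of constituent spans (start, end); labels are ignored.
def geig(parser_spans, gold_spans):
    correct = len(parser_spans & gold_spans)
    precision = correct / len(parser_spans)
    recall = correct / len(gold_spans)
    # A parser span "crosses" a gold span when the two overlap but
    # neither properly contains the other.
    crossings = sum(
        1 for (i, j) in parser_spans
        if any(i < k < j < l or k < i < l < j for (k, l) in gold_spans)
    )
    return precision, recall, crossings

gold = {(0, 5), (0, 2), (2, 5)}     # treebank bracketing
parser = {(0, 5), (0, 3), (3, 5)}   # parser bracketing
print(geig(parser, gold))
```

Note how the single mistaken split at position 3 produces one crossing and drags both precision and recall down to 1/3, even though most of the structure is harmless, illustrating how hard the scores are to interpret.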
Dependency Structure-based Scheme (Annotated corpus):
Lin (1995) proposes evaluation based on dependency structure annotation.
Atwell (1996), though, points out that transformation of standard constituency-based
analyses into a dependency-based representation would lose certain kinds
of grammatical information that might be important for subsequent processing,
such as `logical' information (e.g. location of traces, or moved constituents).
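In the simplest unlabelled form of such a scheme, each word is scored on whether the head the parser assigns it matches the gold annotation; this sketch is one plausible formulation in the spirit of Lin (1995), not his exact proposal, and the head indices are invented:

```python
# Sketch of unlabelled dependency evaluation: the fraction of words
# whose parser-assigned head matches the gold head.
def dependency_accuracy(parser_heads, gold_heads):
    hits = sum(1 for p, g in zip(parser_heads, gold_heads) if p == g)
    return hits / len(gold_heads)

# Heads indexed by word position (1-based); 0 marks the root.
print(dependency_accuracy([2, 0, 2, 5, 3], [2, 0, 2, 5, 2]))
```

Because each word contributes one independent decision, a single attachment error costs exactly one point here, in contrast to the GEIG crossing measure, but as Atwell observes the conversion from constituency output can discard information the score then never sees.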
Atwell, E. (1996) Comparative evaluation of grammatical
annotation models. In R. Sutcliffe, H. Koch, and A. McElligott (eds.) Industrial
Parsing of Software Manuals. Amsterdam, The Netherlands: Rodopi, 25-46.
Black, E., Garside, R. and Leech, G. (eds.) (1993) Statistically-driven
computer grammars of English: The IBM/Lancaster approach. Amsterdam, The
Netherlands: Rodopi.
Briscoe, E. and Carroll, J. (1993) Generalised probabilistic
LR parsing for unification-based grammars. Computational Linguistics, 19.1.
Briscoe, E., Grover, C., Boguraev, B. and Carroll, J.
(1987) A formalism and environment for the development of a large grammar
of English. In Proceedings of IJCAI-87, Milan, Italy, 703-708.
Grishman, R., Macleod, C. and Sterling, J. (1992) Evaluating
parsing strategies using standardized parse files. In Proceedings of the
3rd ACL Conference on Applied Natural Language Processing, Trento, Italy.
Karlsson, F., Voutilainen, A., Heikkilä, J. and
Anttila, A. (1995) Constraint grammar: a language-independent system for
parsing unrestricted text. Berlin, Germany: de Gruyter.
Lehmann, S. et al. (1996) TSNLP: Test suites for natural
language processing. In Proceedings of COLING-96, Copenhagen, Denmark.
Lin, D. (1995) A dependency-based method for evaluating
broad-coverage parsers. In Proceedings of IJCAI-95, Montreal, Canada, 1420-1425.
Sampson, G., Haigh, R. and Atwell, E. (1989) Natural language
analysis by stochastic optimization: a progress report on Project APRIL.
Journal of Experimental and Theoretical Artificial Intelligence, 1, 271-287.
Sharman, R. (1990) Evaluating a grammar as a language
model for speech. In L. Torres, E. Masgrau, and M. Lagunas (eds.) Signal
Processing V: Theories and Applications. The Netherlands: Elsevier, 1271-1274.