
Below is a (very) condensed version of the parser evaluation survey. Both my online conference paper and printed workshop paper contain fuller versions of this material.

John Carroll, University of Sussex, UK

Ted Briscoe, University of Cambridge, UK

We present a critical overview of the state-of-the-art in parser evaluation methodologies and metrics:

Listing Linguistic Constructions Covered (No corpus): advantages of this relatively simple traditional approach (e.g. see Briscoe et al., 1987) are that it requires no corpus resources, and an indication of constructions not covered can be useful. On the other hand, since the distinction between core and peripheral aspects of constructions is often ill-defined, and much of the complexity can be in their interaction, the approach provides no precise information about actual coverage of text.

Coverage (Unannotated corpus): calculating the percentage of sentences from a given, unannotated corpus which are assigned one or more analyses by a parser/grammar is a useful first step in evaluation (e.g. Lehmann et al., 1996); in the current context it is a weak measure, though, because it does not guarantee that any of the analyses returned are correct.
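
As an illustration of how simple this measure is to compute, here is a minimal sketch in Python (the toy parser and sentence format are assumptions for the example, not part of any cited scheme):

```python
def coverage(parser, sentences):
    """Percentage of sentences for which the parser returns at least one analysis."""
    parsed = sum(1 for s in sentences if parser(s))
    return 100.0 * parsed / len(sentences)

# Toy "parser": maps a sentence to a list of analyses; an empty list means no parse.
toy = {"a b": ["(X (a) (b))"], "c": []}
print(coverage(lambda s: toy.get(s, []), ["a b", "c"]))  # 50.0
```

Note that the analyses themselves are never inspected, which is exactly why the measure cannot guarantee correctness.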

Average Parse Base (APB; Unannotated corpus): the APB (Black et al., 1993) is used to give the number of (undisambiguated) parses that would be expected for a sentence of specified length in a particular corpus. While a succinct measure, an unambiguous but perhaps low-coverage grammar would do well on it; moreover, it is not appropriate for comparing different parsers on different data.
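
One common formulation of this idea is a per-word ambiguity figure: take the n-th root of the parse count for each length-n sentence and average over the corpus, so that APB**n approximates the expected number of parses for a length-n sentence. The sketch below follows that formulation; Black et al.'s exact definition may differ in detail:

```python
def average_parse_base(parses_per_sentence):
    """Average per-word ambiguity over a corpus.

    parses_per_sentence: list of (n, p) pairs, where n is sentence length
    and p is the number of parses the grammar assigns. The expected parse
    count for a length-n sentence is then roughly average_parse_base(...) ** n.
    (Illustrative formulation; hedged against the original definition.)"""
    return sum(p ** (1.0 / n) for n, p in parses_per_sentence) / len(parses_per_sentence)

# A 2-word sentence with 4 parses and a 3-word sentence with 8 parses
# both have per-word ambiguity 2, so the APB is 2.0.
print(average_parse_base([(2, 4), (3, 8)]))
```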

Entropy/Perplexity (Unannotated corpus): this approach can be used to compare the effectiveness of different probabilistic models or training regimes, and also generalised to provide a model-independent measure of the inherent complexity of a corpus (Sharman, 1990). The basic measure has a clear interpretation and allows comparison of different models of the same corpus. Disadvantages are that it only applies to probabilistic models, and provides only a weak measure of the accuracy of (highly-ranked) derivations.
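
The standard relationship between the two quantities is perplexity = 2 to the power of the per-word cross-entropy. A minimal sketch (the per-word probability list is an assumed input format for the example):

```python
import math

def perplexity(word_probs):
    """Perplexity of a model over a test corpus: 2 ** H, where H is the
    mean negative log2 probability the model assigns per word."""
    h = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** h

# A model that assigns probability 1/4 to every word is as uncertain
# as a uniform choice among 4 alternatives.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

Lower perplexity on held-out text indicates a better probabilistic model of that corpus, which is what makes the measure suitable for comparing models or training regimes.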

Part-of-speech Assignment Accuracy (Annotated corpus): although there is for many languages a good deal of manually-corrected part-of-speech tagged material, many broad-coverage parsers take pre-tagged text as input, so this measure is inappropriate for these parsers. Also, it provides only a very partial measure of the accuracy of a syntactic analysis, so has been used mainly to evaluate `lexical' parsers (Karlsson et al., 1995).

Structural Consistency (Annotated corpus): Black et al. (1993) define structural consistency as the percentage of sentences in a treebank which receive at least one analysis which is structurally consistent with the correct analysis (according to the crossing brackets measure, defined below). The measure is stronger than simply measuring coverage; however, alone, it is inadequate as it favours systems which yield minimal structure.

Best-first/Ranked Consistency (Annotated corpus): this measure, the percentage of highest-ranked analyses output by a parser which are identical to the correct (treebank) analysis (Briscoe and Carroll, 1993), gives an immediately understandable score, but it is highly dependent on having an accurate, coherent and consistent annotated corpus which is fully compatible with parser output.

Tree Similarity Measures (Annotated corpus): Sampson et al. (1989) and others define tree similarity measures which are finer-grained than full identity and more tolerant of annotation errors. However, it is difficult to see how the measures map onto many parsing tasks, and a detailed and compatible treebank is still required.

Grammar Evaluation Interest Group (GEIG) Scheme (Annotated corpus): GEIG computes the percentage of constituents in the parser's output that also appear in the treebank analysis (precision), the percentage of treebank constituents that the parser finds (recall), and the average number of `crossing' brackets per sentence, where a constituent from the parser overlaps with one from the treebank but neither properly contains the other (Grishman et al., 1992). Advantages are that a relatively undetailed (merely bracketed), less parser-specific annotation is required, some level of cross-framework/system comparison is possible, and the measure is tolerant of annotation errors. Disadvantages are that it is much weaker than full identity, and it is unclear how much of a success (or failure) it is to achieve high (or low) scores: for example, a single attachment mistake embedded n levels deep (and perhaps completely innocuous, such as an `aside' delimited by dashes) can be penalised by n crossings, whereas incorrect identification of arguments/adjuncts can go unpunished in some cases; also, the precision measure marks down a system for inserting extra, possibly correct, constituent structure.
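
The three GEIG quantities for a single sentence can be sketched as follows; brackets are represented here as unlabelled (start, end) word spans, an assumed encoding rather than the scheme's own file format:

```python
def crosses(a, b):
    """True if spans a and b overlap but neither properly contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

def geig_scores(parser_brackets, gold_brackets):
    """Unlabelled bracket precision/recall (%) and crossing-bracket count
    for one sentence. Brackets are (start, end) word spans."""
    matched = sum(1 for b in parser_brackets if b in gold_brackets)
    precision = 100.0 * matched / len(parser_brackets)
    recall = 100.0 * matched / len(gold_brackets)
    crossing = sum(1 for p in parser_brackets
                   if any(crosses(p, g) for g in gold_brackets))
    return precision, recall, crossing

# Gold analysis brackets words as (0,5), (0,2), (2,5); the parser instead
# groups words 1-3, which crosses the gold (0,2) bracket.
p, r, c = geig_scores([(0, 5), (0, 2), (1, 3)], [(0, 5), (0, 2), (2, 5)])
print(round(p, 1), round(r, 1), c)  # 66.7 66.7 1
```

The example shows the asymmetry noted above: the single wrong bracket costs both precision and recall and registers one crossing, while a parser that simply omitted the inner structure would have incurred no crossings at all.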

Dependency Structure-based Scheme (Annotated corpus): Lin (1995) proposes evaluation based on dependency structure annotation. Atwell (1996), though, points out that transformation of standard constituency-based analyses into a dependency-based representation would lose certain kinds of grammatical information that might be important for subsequent processing, such as `logical' information (e.g. location of traces, or moved constituents).
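
The general idea is to score head-dependent relations rather than brackets. A minimal sketch of such a comparison (the triple format and the F1 combination here are illustrative assumptions, not Lin's exact formulation):

```python
def dependency_scores(parser_deps, gold_deps):
    """Precision, recall and F1 over sets of dependency triples.

    Each triple is an illustrative (head, dependent, relation) tuple;
    dependency-based evaluation compares such relations directly
    instead of comparing constituent brackets."""
    correct = parser_deps & gold_deps
    p = len(correct) / len(parser_deps)
    r = len(correct) / len(gold_deps)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("saw", "I", "subj"), ("saw", "dog", "obj"), ("dog", "the", "det")}
pred = {("saw", "I", "subj"), ("saw", "dog", "obj"), ("saw", "the", "det")}
print(dependency_scores(pred, gold))  # one misattached determiner
```

One attraction of this representation is that a single attachment error costs exactly one dependency, avoiding the cascading crossing-bracket penalties discussed under the GEIG scheme; Atwell's objection is that the conversion from constituency analyses can discard information.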


Atwell, E. (1996) Comparative evaluation of grammatical annotation models. In R. Sutcliffe, H. Koch, and A. McElligott (eds.) Industrial Parsing of Software Manuals. Amsterdam, The Netherlands: Rodopi, 25-46.

Black, E., Garside, R. and Leech, G. (eds.) (1993) Statistically-driven computer grammars of English: The IBM/Lancaster approach. Amsterdam, The Netherlands: Rodopi.

Briscoe, E. and Carroll, J. (1993) Generalised probabilistic LR parsing for unification-based grammars. Computational Linguistics, 19.1, 25-60.

Briscoe, E., Grover, C., Boguraev, B. and Carroll, J. (1987) A formalism and environment for the development of a large grammar of English. In Proceedings of IJCAI-87, Milan, Italy, 703-708.

Grishman, R., Macleod, C. and Sterling, J. (1992) Evaluating parsing strategies using standardized parse files. In Proceedings of the 3rd ACL Conference on Applied Natural Language Processing, Trento, Italy, 156-161.

Karlsson, F., Voutilainen, A., Heikkilä, J. and Anttila, A. (1995) Constraint grammar: a language-independent system for parsing unrestricted text. Berlin, Germany: de Gruyter.

Lehmann, S. et al. (1996) TSNLP: test suites for natural language processing. In Proceedings of COLING-96, Copenhagen, Denmark, 711-716.

Lin, D. (1995) A dependency-based method for evaluating broad-coverage parsers. In Proceedings of IJCAI-95, Montreal, Canada, 1420-1425.

Sampson, G., Haigh, R. and Atwell, E. (1989) Natural language analysis by stochastic optimization: a progress report on Project APRIL. Journal of Experimental and Theoretical Artificial Intelligence, 1, 271-287.

Sharman, R. (1990) Evaluating a grammar as a language model for speech. In L. Torres, E. Masgrau, and M. Lagunas (eds.) Signal Processing V: Theories and Applications. The Netherlands: Elsevier, 1271-1274.