A Shared Task and Evaluation Campaign for NLG
The field of Natural Language Generation (NLG) has strong evaluation traditions, in particular in user-based evaluation of applied systems. However, while in most other NLP fields comparative, quantitative evaluation now plays a central role, there are few such results in NLG. We believe that NLG needs to develop the data resources and evaluation technology required for comparative, quantitative evaluation.
The idea for a Shared Task and Evaluation Campaign (STEC) in Generation of Referring Expressions (GRE) arose as a direct result of the recently held Workshop on Shared Tasks and Comparative Evaluation in NLG, organised by Michael White and Robert Dale with the support of the National Science Foundation. The workshop provided a forum for discussing the prospects and the pros and cons of STECs in NLG, along with the related question of shared resources. Several of the position papers named Generation of Referring Expressions as a prime candidate for a STEC, since the area has been the focus of intensive research over the past decade, which has led to a consensus on its basic problem definition, inputs and outputs.
This initial STEC for NLG will therefore focus on Generation of Referring Expressions, with particular attention to its most "basic" sub-task, namely attribute selection for distinguishing descriptions.
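To make the notion of a distinguishing description concrete, the following is a minimal sketch rather than part of the task specification. It assumes entities are represented as attribute-value dictionaries; the function name `is_distinguishing` and the example domain are hypothetical illustrations, not taken from any of the data resources.

```python
from typing import Dict, List, Set

# Assumption: an entity is a flat mapping from attribute names to values.
Entity = Dict[str, str]


def is_distinguishing(attrs: Set[str], target: Entity, distractors: List[Entity]) -> bool:
    """Return True if the selected attributes rule out every distractor.

    A distractor is ruled out when it differs from the target on at
    least one of the selected attributes.
    """
    return all(
        any(d.get(a) != target.get(a) for a in attrs)
        for d in distractors
    )


# Hypothetical example domain (not from the source).
target = {"type": "chair", "colour": "red", "size": "large"}
distractors = [
    {"type": "chair", "colour": "blue", "size": "large"},
    {"type": "sofa", "colour": "red", "size": "large"},
]

colour_only = is_distinguishing({"colour"}, target, distractors)
type_and_colour = is_distinguishing({"type", "colour"}, target, distractors)
```

Here `colour` alone fails to exclude the red sofa, whereas `type` and `colour` together identify the target uniquely. Attribute selection, in this simplified view, is the problem of choosing such a set.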
The report on the Workshop, jointly authored by the participants and due to be published later this year, reflects the variety of opinion on shared-task evaluations in NLG. A significant number of researchers are concerned that such evaluations may narrow the research focus of a field. This can happen when evaluation relies exclusively on a small set of metrics that may be inherently biased towards a particular approach to a problem (e.g. a statistical approach). A STEC may also have a negative impact if it leads to a restricted definition of what a field is about, namely "that which is acceptable and measurable within the context of the STEC itself".
Our specification of the Attribute Selection for GRE Challenge incorporates ideas from both sides of the
We conceive of this shared task involving referring expressions as one element of a possible constellation of shared tasks in NLG, each focusing on different aspects of the field. In the longer term, we are committed to addressing a wide range of task definitions (attribute selection, pronominalisation and other anaphoric reference, realisation, etc.), different data resources (COCONUT, TUNA, GREC, etc.), and different evaluation methods (correlation measures, set overlap, surface similarity metrics, as well as user-oriented evaluations). Most importantly, we will encourage grass-roots involvement through calls for the submission of task proposals, data resources and evaluation methods.
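As an illustration of what a set-overlap evaluation could look like for attribute selection, the sketch below scores a system's attribute set against a reference set. The text above names set overlap only in general terms; the choice of the Dice coefficient, the function name `dice` and the example attribute sets are assumptions made for this sketch.

```python
from typing import Set


def dice(system: Set[str], reference: Set[str]) -> float:
    """Dice coefficient: 2 * |A & B| / (|A| + |B|).

    Assumption: two empty sets are treated as a perfect match (1.0).
    """
    total = len(system) + len(reference)
    if total == 0:
        return 1.0
    return 2 * len(system & reference) / total


# Hypothetical attribute sets for one referring expression.
system_attrs = {"type", "colour"}
reference_attrs = {"type", "colour", "size"}

score = dice(system_attrs, reference_attrs)
```

A score of 1.0 means the two sets are identical and 0.0 means they share no attributes. A measure of this kind compares against human-produced descriptions, so it complements rather than replaces user-oriented evaluation.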