Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Anja Belz is active.

Publication


Featured research published by Anja Belz.


Natural Language Engineering | 2008

Automatic generation of weather forecast texts using comprehensive probabilistic generation-space models

Anja Belz

Two important recent trends in natural language generation are (i) probabilistic techniques and (ii) comprehensive approaches that move away from traditional strictly modular and sequential models. This paper reports experiments in which pCRU, a generation framework that combines probabilistic generation methodology with a comprehensive model of the generation space, was used to semi-automatically create five different versions of a weather forecast generator. The generators were evaluated in terms of output quality, development time and computational efficiency against (i) human forecasters, (ii) a traditional handcrafted pipelined NLG system and (iii) a HALOGEN-style statistical generator. The most striking result is that, despite acquiring all decision-making abilities automatically, the best pCRU generators produce outputs of high enough quality to be scored more highly by human judges than forecasts written by experts.
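The pCRU framework itself is not reproduced here, but the underlying idea, driving each choice in a generation space with probabilities estimated from corpus data, can be illustrated with a minimal sketch. The grammar, rule counts, and output strings below are invented placeholders, not material from the paper:

```python
from collections import defaultdict
import random

# Toy generation space: each nonterminal maps to candidate expansions.
# The rules and the decision counts below are purely illustrative.
GRAMMAR = {
    "FORECAST": [["WIND", "."], ["WIND", ",", "GUSTS", "."]],
    "WIND": [["SSW", "10-15"], ["SSW", "10-15", "increasing", "20-25"]],
    "GUSTS": [["gusts", "30"], ["gusts", "35"]],
}

# How often each expansion was chosen in a (hypothetical) aligned corpus.
COUNTS = {
    "FORECAST": [40, 60],
    "WIND": [70, 30],
    "GUSTS": [55, 45],
}

def rule_probs(counts):
    total = sum(counts)
    return [c / total for c in counts]

def generate_greedy(symbol):
    """Expand a symbol by always taking the most probable rule."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    probs = rule_probs(COUNTS[symbol])
    best = max(range(len(probs)), key=lambda i: probs[i])
    out = []
    for child in GRAMMAR[symbol][best]:
        out.extend(generate_greedy(child))
    return out

def generate_sampled(symbol, rng=random):
    """Expand a symbol by sampling a rule from the estimated distribution."""
    if symbol not in GRAMMAR:
        return [symbol]
    probs = rule_probs(COUNTS[symbol])
    choice = rng.choices(range(len(probs)), weights=probs)[0]
    out = []
    for child in GRAMMAR[symbol][choice]:
        out.extend(generate_sampled(child))
    return out

if __name__ == "__main__":
    print(" ".join(generate_greedy("FORECAST")))
    print(" ".join(generate_sampled("FORECAST")))
```

The greedy variant always takes the most probable expansion, while the sampled variant draws from the estimated distribution; these stand in roughly for the kinds of decision strategies the paper compares, without reproducing its actual generation-space model.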


Computational Linguistics | 2009

An investigation into the validity of some metrics for automatically evaluating natural language generation systems

Ehud Reiter; Anja Belz

There is growing interest in using automatically computed corpus-based evaluation metrics to evaluate Natural Language Generation (NLG) systems, because these are often considerably cheaper than the human-based evaluations which have traditionally been used in NLG. We review previous work on NLG evaluation and on validation of automatic metrics in NLP, and then present the results of two studies of how well some metrics which are popular in other areas of NLP (notably BLEU and ROUGE) correlate with human judgments in the domain of computer-generated weather forecasts. Our results suggest that, at least in this domain, metrics may provide a useful measure of language quality, although the evidence for this is not as strong as we would ideally like to see; however, they do not provide a useful measure of content quality. We also discuss a number of caveats which must be kept in mind when interpreting this and other validation studies.
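A minimal sketch of this kind of metric-validation study, assuming NLTK for BLEU and SciPy for the correlation statistic; the forecast strings and human ratings below are invented for illustration and the paper's actual data are not reproduced:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

# (system output, human reference, human quality rating) -- all hypothetical
data = [
    ("SSW 10 TO 15 INCREASING 20 BY EVENING", "SSW 10-15 INCREASING 20 LATER", 4.2),
    ("WIND SSW 10", "SSW 10-15 BACKING S", 2.8),
    ("S BACKING SSE 15 TO 20 GUSTS 30", "S-SSE 15-20 GUSTS 30", 4.7),
]

smooth = SmoothingFunction().method1
bleu_scores, human_scores = [], []
for output, reference, rating in data:
    hyp = output.lower().split()
    refs = [reference.lower().split()]
    # Automatic corpus-based metric score for this output
    bleu_scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    human_scores.append(rating)

# How well does the automatic metric track human judgments?
r, p = pearsonr(bleu_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```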


Meeting of the Association for Computational Linguistics | 2008

Intrinsic vs. Extrinsic Evaluation Measures for Referring Expression Generation

Anja Belz; Albert Gatt

In this paper we present research in which we apply (i) the kind of intrinsic evaluation metrics that are characteristic of current comparative HLT evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with NLG traditions, to 15 systems implementing a language generation task. We analyse the evaluation results and find that there are no significant correlations between intrinsic and extrinsic evaluation measures for this task.


Natural Language Generation | 2010

Introducing shared tasks to NLG: the TUNA shared task evaluation challenges

Albert Gatt; Anja Belz

Shared Task Evaluation Challenges (STECs) have only recently begun in the field of NLG. The TUNA STECs, which focused on Referring Expression Generation (REG), have been part of this development since its inception. This chapter looks back on the experience of organising the three TUNA Challenges, which came to an end in 2009. While we discuss the role of the STECs in yielding a substantial body of research on the REG problem, which has opened new avenues for future research, our main focus is on the role of different evaluation methods in assessing the output quality of REG algorithms, and on the relationship between such methods.


Computational Linguistics | 2009

That's nice... what can you do with it?

Anja Belz

A regular fixture on the mid-1990s international research seminar circuit was the billion-neuron artificial brain talk. The idea behind this project was simple: in order to create artificial intelligence, what was needed first of all was a very large artificial brain; if a big enough set of interconnected modules of neurons could be implemented, then it would be possible to evolve mammalian-level behavior with current computational-neuron technology. The talk included progress reports on the current size of the artificial brain, its structure, update rate, and power consumption, and explained how intelligent behavior was going to develop by mechanisms simulating biological evolution. What the talk didn't mention was what kind of functionality the team had so far managed to evolve, and so the first comment at the end of the talk was inevitably "nice work, but have you actually done anything with the brain yet?" In human language technology (HLT) research, we currently report a range of evaluation scores that measure and assess various aspects of systems, in particular the similarity of their outputs to samples of human language or to human-produced gold-standard annotations, but are we leaving ourselves open to the same question as the billion-neuron artificial brain researchers?


International Conference on Natural Language Generation | 2008

Attribute selection for referring expression generation: new algorithms and evaluation methods

Albert Gatt; Anja Belz

Referring expression generation has recently been the subject of the first Shared Task Challenge in NLG. In this paper, we analyse the systems that participated in the Challenge in terms of their algorithmic properties, comparing new techniques to classic ones, based on results from a new human task-performance experiment and from the intrinsic measures that were used in the Challenge. We also consider the relationship between different evaluation methods, showing that extrinsic task-performance experiments and intrinsic evaluation methods yield results that are not significantly correlated. We argue that this highlights the importance of including extrinsic evaluation methods in comparative NLG evaluations.
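A sketch of the kind of correlation check described above, assuming SciPy's Spearman rank correlation over per-system scores; all numbers are hypothetical placeholders rather than results from the Challenge:

```python
from scipy.stats import spearmanr

# Hypothetical per-system scores for 15 participating systems.
intrinsic = [0.62, 0.58, 0.71, 0.55, 0.67, 0.49, 0.73, 0.60,
             0.52, 0.66, 0.45, 0.69, 0.57, 0.63, 0.50]   # e.g. string-similarity metric
extrinsic = [0.71, 0.48, 0.55, 0.69, 0.52, 0.66, 0.47, 0.73,
             0.60, 0.51, 0.68, 0.49, 0.74, 0.56, 0.62]   # e.g. human task-performance

# Rank correlation between the two ways of scoring the same systems.
rho, p = spearmanr(intrinsic, extrinsic)
print(f"Spearman rho = {rho:.3f}, p = {p:.3f}")
if p >= 0.05:
    print("No significant rank correlation between intrinsic and extrinsic measures.")
```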


Natural Language Generation | 2009

System Building Cost vs. Output Quality in Data-to-Text Generation

Anja Belz; Eric Kow

Data-to-text generation systems tend to be knowledge-based and manually built, which limits their reusability and makes them time- and cost-intensive to create and maintain. Methods for automating (part of) the system building process exist, but do such methods risk a loss in output quality? In this paper, we investigate the cost/quality trade-off in generation system building. We compare four new data-to-text systems which were created by predominantly automatic techniques against six existing systems for the same domain which were created by predominantly manual techniques. We evaluate the ten systems using intrinsic automatic metrics and human quality ratings. We find that increasing the degree to which system building is automated does not necessarily result in a reduction in output quality. We furthermore find that standard automatic evaluation metrics underestimate the quality of handcrafted systems and overestimate the quality of automatically created systems.


Natural Language Generation | 2007

Generation of repeated references to discourse entities

Anja Belz; Sebastian Varges

Generation of Referring Expressions is a thriving subfield of Natural Language Generation which has traditionally focused on the task of selecting a set of attributes that unambiguously identify a given referent. In this paper, we address the complementary problem of generating repeated, potentially different referential expressions that refer to the same entity in the context of a piece of discourse longer than a sentence. We describe a corpus of short encyclopaedic texts we have compiled and annotated for reference to the main subject of the text, and report results for our experiments in which we set human subjects and automatic methods the task of selecting a referential expression from a wide range of choices in a full-text context. We find that our human subjects agree on choice of expression to a considerable degree, with three identical expressions selected in 50% of cases. We tested automatic selection strategies based on most frequent choice heuristics, involving different combinations of information about syntactic MSR type and domain type. We find that more information generally produces better results, achieving a best overall test set accuracy of 53.9% when both syntactic MSR type and domain type are known.
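The most-frequent-choice strategies can be sketched as a conditional frequency lookup with back-off to coarser contexts; the category labels and counts below are hypothetical, not the corpus statistics reported in the paper:

```python
from collections import Counter, defaultdict

# Hypothetical training observations:
# (syntactic MSR type, domain type, chosen expression type)
training = [
    ("subject", "person", "pronoun"),
    ("subject", "person", "pronoun"),
    ("subject", "person", "proper_name"),
    ("object",  "person", "proper_name"),
    ("subject", "city",   "proper_name"),
    ("subject", "city",   "pronoun"),
    ("object",  "city",   "definite_np"),
]

# Count choices per context, with coarser back-off contexts for unseen cases.
by_both   = defaultdict(Counter)   # condition on syntax + domain
by_syntax = defaultdict(Counter)   # condition on syntax only
overall   = Counter()              # no conditioning

for syn, dom, choice in training:
    by_both[(syn, dom)][choice] += 1
    by_syntax[syn][choice] += 1
    overall[choice] += 1

def most_frequent_choice(syn, dom):
    """Return the most frequent expression type, backing off to coarser contexts."""
    for table, key in ((by_both, (syn, dom)), (by_syntax, syn)):
        if table[key]:
            return table[key].most_common(1)[0][0]
    return overall.most_common(1)[0][0]

print(most_frequent_choice("subject", "person"))   # -> 'pronoun'
print(most_frequent_choice("object",  "river"))    # backs off to syntax-only counts
```

More conditioning information (syntax plus domain) gives sharper predictions when the context has been seen, which mirrors the paper's finding that more information generally produces better results.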


International Conference on Natural Language Generation | 2006

Shared-Task Evaluations in HLT: Lessons for NLG

Anja Belz; Adam Kilgarriff

While natural language generation (NLG) has a strong evaluation tradition, in particular in user-based and task-oriented evaluation, it has never evaluated different approaches and techniques by comparing their performance on the same tasks (shared-task evaluation, STE). NLG is characterised by a lack of consolidation of results, and by isolation from the rest of NLP, where STE is now standard. It is, moreover, a shrinking field (state-of-the-art MT and summarisation no longer perform generation as a subtask) which lacks the kind of funding and participation that natural language understanding (NLU) has attracted.


Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009) | 2009

The GREC Main Subject Reference Generation Challenge 2009: Overview and Evaluation Results

Anja Belz; Eric Kow; Jette Viethen; Albert Gatt

The GREC-MSR Task at Generation Challenges 2009 required participating systems to select coreference chains to the main subject of short encyclopaedic texts collected from Wikipedia. Three teams submitted one system each, and we additionally created four baseline systems. Systems were tested automatically using existing intrinsic metrics. We also evaluated systems extrinsically by applying coreference resolution tools to the outputs and measuring the success of the tools. In addition, systems were tested in an intrinsic evaluation involving human judges. This report describes the GREC-MSR Task and the evaluation methods applied, gives brief descriptions of the participating systems, and presents the evaluation results.

Collaboration


Dive into Anja Belz's collaborations.

Top Co-Authors

Eric Kow
University of Brighton

Roger Evans
University of Brighton

Ehud Reiter
University of Aberdeen