A Human Evaluation of AMR-to-English Generation Systems
Emma Manning Shira Wein Nathan Schneider
Georgetown University
{esm76, sw1158, nathan.schneider}@georgetown.edu

Abstract
Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. We also analyze common errors made by these systems.
1 Introduction

Abstract Meaning Representation, or AMR (Banarescu et al., 2013), is a representation of the meaning of a sentence as a rooted, labeled, directed acyclic graph. For example,

(l / label-01
   :ARG0 (c / country :wiki "Georgia_(country)"
      :name (n / name :op1 "Georgia"))
   :ARG1 (s / support-01
      :ARG0 (c2 / country :wiki "Russia"
         :name (n2 / name :op1 "Russia")))
   :ARG2 (a / act-02
      :mod (a2 / annex-01)))

represents the sentence "Georgia labeled Russia's support an act of annexation." AMR does not represent some morphological and syntactic details such as tense, number, definiteness and word order; thus, this same AMR could also represent alternate phrasings such as "Russia's support is being labeled an act of annexation by Georgia."

AMR generation is the task of generating a sentence in natural language (in this case, English) from an AMR graph. This has applications to a range of NLP tasks, including summarization (Liao et al., 2018) and machine translation (Song et al., 2019). Like other Natural Language Generation (NLG) tasks, this is difficult to evaluate due to the range of possible valid sentences corresponding to any single AMR.

Currently, AMR generation systems are typically evaluated only with automatic metrics that compare a generated sentence to a single human-authored reference; for AMR, this is the sentence from which the AMR graph was originally created. However, there is evidence that these metrics may not be a good representation of human judgments for AMR generation (May and Priyadarshi, 2017) and NLG in general (see §2.1).

Thus, in this work, we present a new human evaluation of several recent AMR generation systems, most of which have not previously been manually evaluated. Our methodology (§3) differs in several ways from previous evaluations of AMR generation, including separate direct assessment of fluency and adequacy, and asking annotators to evaluate sentences without comparison to a reference, in order to avoid biasing them toward a particular wording. We analyze (§4) what our results show about the relative quality of the systems and how this compares to their scores from automatic metrics, finding that these metrics are mostly accurate in ranking systems, but that collecting separate judgments for fluency, adequacy, and error types allows us to characterize the relative strengths and weaknesses of each system in more detail. Finally, we discuss common errors among sentences which received low scores from annotators, identifying issues for future researchers to address including hallucination, anonymization, and repetition.
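AMR graphs such as the example above are serialized in PENMAN notation. Purely as an illustration of the underlying data structure (a minimal sketch assuming the third-party penman Python library, which is not used or mentioned in this paper), the graph can be decoded into a set of labeled triples:

# Illustrative only: decode the example AMR with the "penman" library
# (pip install penman) and list its concepts and relations.
import penman

graph = penman.decode("""
(l / label-01
   :ARG0 (c / country :wiki "Georgia_(country)"
      :name (n / name :op1 "Georgia"))
   :ARG1 (s / support-01
      :ARG0 (c2 / country :wiki "Russia"
         :name (n2 / name :op1 "Russia")))
   :ARG2 (a / act-02
      :mod (a2 / annex-01)))
""")

print(graph.top)                       # root variable: 'l'
for source, role, target in graph.triples:
    # ':instance' triples name concepts; other roles are labeled edges
    print(source, role, target)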
2 Background

In §2.1 we discuss previous work on evaluation of AMR generation and related NLG tasks, both with automatic metrics and human evaluation. In §2.2 we survey recent work in AMR generation and describe the systems which we evaluate.
Automatic Metrics: The vast majority of AMR generation papers measure their performance only with automatic metrics. The most common of these metrics is BLEU (Papineni et al., 2002), which is typically used to determine the state of the art. However, it is unclear whether BLEU is a reliable metric for comparing AMR generation systems: May and Priyadarshi (2017) found that BLEU disagreed with human judgments on the ranking of five AMR generation systems, including disagreeing on which system was the best. Concerns have also been raised about the suitability of BLEU for NLG in general; for example, Reiter (2018) found that BLEU has generally poor correlations with human judgments for NLG. Novikova et al. (2017) compared many metrics to human judgments on NLG from meaning representations and concluded that the use of reference-based metrics relies on an invalid assumption that references are correct and complete enough to be used as a gold standard.

Some recent AMR generation papers have reported other automatic metrics alongside BLEU. Many have reported METEOR (Banerjee and Lavie, 2005), and a few have included TER (Snover et al., 2006) and, most recently, CHRF++ (Popović, 2017). However, it is unclear how accurately any of these metrics capture the relative performance of AMR generation systems.
Human Evaluation:
Prior to this work, the only human evaluation comparing several AMR generation systems was the SemEval-2017 AMR shared task, which used a ranking-based evaluation of five systems (May and Priyadarshi, 2017). All of these systems perform far below the current state of the art, making a new evaluation necessary.

While most AMR generation papers have reported no human evaluation of their systems, a few have conducted smaller-scale evaluations. In particular, Ribeiro et al. (2019) conducted a Mechanical Turk evaluation to compare their best graph encoder model with a sequence-to-sequence baseline, finding that their model performs better on both meaning similarity between the generated sentence and the gold reference, and readability of the generated sentence.

Lapalme (2019) also conducted a small human evaluation in which annotators chose the best output out of three options: their own system, ISI (Pourdamghani et al., 2016), and JAMR (Flanigan et al., 2016). They find that their rule-based system is on par with ISI and much better than JAMR, despite having a much lower BLEU score.

Beyond AMR generation, other NLG tasks are also often evaluated only with automatic metrics; for example, Gkatzia and Mahamood (2015) found that 38.2% of NLG papers overall, and 68% of those published in ACL venues, used automatic metrics. However, as discussed above, many studies have found that these metrics are not a reliable proxy for human judgments. One example of the use of human evaluation is the Conference on Machine Translation (WMT), which runs an annual evaluation of machine translation systems (e.g. Barrault et al., 2019).
Shortly after the 2017 shared task, Konstas et al. (2017) made significant advances to the field with a neural sequence-to-sequence approach, mitigating the limitations of the small amount of AMR-annotated data by augmenting training data with a jointly-trained parser.

Later work by Song et al. (2018) builds off this approach but uses a graph-to-sequence model to preserve more information from the structure of the AMR. Several recent papers have explored variations on a graph-to-sequence approach: improvements in encoding reentrancies and long-range dependencies (Damonte and Cohen, 2019), a dual graph encoder that captures both top-down and bottom-up representations of graph structure (Ribeiro et al., 2019), and a densely-connected graph convolutional network (Guo et al., 2019).

Recent sequence-to-sequence approaches include using structure-aware self-attention to capture relations between concepts within a sequence-to-sequence transformer model (Zhu et al., 2019), and generating syntactic constituency trees as an intermediate step before generating surface structure (Cao and Clark, 2019).

While neural approaches have achieved state-of-the-art BLEU scores, a few recent works have instead approached AMR generation through more rule-based methods. Manning (2019) constrains their system with rules, supplemented by simple statistical models, to avoid certain types of errors, such as hallucinations, that are possible in neural systems. Lapalme (2019) creates a fully rule-based generation system to help humans check their AMR annotations.

Figure 1: Screenshot from the fluency section of the survey.

Figure 2: Screenshot from the adequacy section of the survey.
We conduct a human evaluation of several AMR generation systems. §3.1 discusses the general survey design, while §3.2 discusses details of the pilot survey, which validates the methodology by applying it to data from the SemEval evaluation, and §3.3 discusses the evaluation of more recent systems.
Figures 1 and 2 show examples of the survey interface for one sentence.
Scalar Scores:
The SemEval-2017 evaluation of AMR generation elicited judgments in the form of relative rankings of output from three systems at a time (May and Priyadarshi, 2017). However, recent work in evaluation of machine translation (Bojar et al., 2016) has found that direct assessment is a preferable method for collecting judgments, partly because it evaluates the absolute quality of translations. We use a similar direct assessment method, providing annotators with a slider which represents scores from 0 to 100, although annotators are not shown numbers. Unlike recent WMT evaluations, we collect separate scalar scores for fluency and adequacy. This has been common practice in many evaluations of NLG and MT; for example, Gatt and Belz (2010) also use separate direct assessment sliders for these two dimensions for NLG.
Referenceless Design:
Many human evaluations of NLG and MT, including the SemEval evaluation for AMR, provide a reference for the annotator to compare to the system output. However, since AMR is underspecified with respect to many aspects of phrasing, including tense, number, word order, and definiteness, comparison to a single reference risks biasing annotators toward the specific phrasing used in the reference. Thus, each survey given to annotators consists of two sections: in the first half, annotators judged fluency, and saw only the output sentences; in the second, they judged the same sentences on adequacy, and were shown the AMR from which the sentence was generated, allowing them to compare the meanings. This design required that our annotators be familiar with the AMR scheme to identify mismatches in the concepts and relations expressed in the sentences.
Adequacy Error Types:
In addition to numeric scores, under each adequacy slider are three checkboxes where annotators can indicate whether certain types of adequacy errors apply:

• That they cannot understand the meaning of the utterance (i.e. it is disfluent enough to be incomprehensible, making it difficult to meaningfully judge adequacy)
• That information in the AMR is missing from the utterance
• That information not present in the AMR is added in the utterance

These options allow for a more nuanced analysis of the types of mistakes made by different systems than numerical scores alone would provide.
Survey Structure:
Instructions for judging fluency are provided at the beginning of the survey, and instructions for adequacy are shown before the start of the adequacy portion. For fluency, annotators are asked to "indicate how well each one represents fluent English, like you might expect a person who is a native speaker of English to use," and told that "some of these may be sentence fragments rather than complete sentences, but can still be considered fluent utterances." For adequacy, they are instructed to "determine how accurately the sentence expresses the meaning in the AMR." The full text of these instructions, which also includes examples, is provided in the supplementary material.

Each page of the survey includes each system's output for a given sentence, presented in a random order. The reference is also included as a sentence to judge, but is not distinguished from the system outputs.
Before collecting the full dataset of human judgments for AMR generation, we completed a smaller pilot experiment to test the validity and practicality of the methodology. This pilot used the data and systems included in the SemEval-2017 shared task (May and Priyadarshi, 2017). A random subset of 25 out of the 1293 sentences in the dataset was used. All were annotated by three annotators, each of whom was a linguist with experience with AMR.

We tweaked the design of the later survey based on feedback from the pilot annotators. In particular, the surveys were shortened (annotators completed two batches of 10 sentences each, instead of one with 25); more thorough instructions were given, with examples; and wording was changed from "sentence" to "utterance" to reflect that some are not full sentences in a grammatical sense.
The main evaluation was larger in scope than the pilot, and evaluated more recent systems, most of which are of markedly higher quality than those in the pilot. We contacted the authors of several recent papers on AMR-to-English generation to obtain their system's output for use in the evaluation, and included all five systems for which we were able to obtain usable data in time to begin our evaluation: Konstas et al. (2017), Guo et al. (2019), Manning (2019), Ribeiro et al. (2019), and Zhu et al. (2019). These systems are described in §2.2.
Data:
The LDC2015E86 and LDC2017T10 AMR test sets contain the same sentences, with some updates to the AMRs. Because some of the system output we obtained was generated from the 2015 AMRs and some from the 2017 AMRs, we decided to only include AMRs in the intersection of these datasets in our evaluation.
Table 1: Inter-annotator agreement scores for each annotator pair, with averages in the final row. For numeric ratings of Fluency (F) and Adequacy (A), we use Spearman's Rho; for binary categorical ratings of Incomprehensibility (INC), Missing Information (MI), and Added Information (AI), we use Cohen's Kappa.

Pair | F    | A    | INC  | MI   | AI
AVG  | 0.51 | 0.65 | 0.37 | 0.44 | 0.35
Table 2: For each system, average fluency and adequacy scores and the percentage of sentences where each adequacy error type was selected. Scores for the reference sentences are included for comparison.

System    | F ↑   | A ↑   | INC ↓ | MI ↓ | AI ↓
Reference | 87.56 | 93.68 | 5.0   | 4.5  | 10.0
Additionally, we chose to exclude AMRs whose root relation was multi-sentence, which indicates that the portion of text officially segmented as one sentence includes what AMR annotators analyzed as two or more sentences. These were excluded because they are often very long and pilot annotators found they could be very difficult to read and evaluate, and because, unlike other AMR relations, multi-sentence does not represent a semantic relationship between elements of meaning.

A total of 335 sentences were excluded from consideration due to differences in their AMRs between the different versions of the data, and 71 for being multi-sentence. Accounting for overlap between the excluded sets, 998 out of 1371 total sentences in the test set were considered eligible for our evaluation. A random sample of 100 of these were used in the survey.
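Purely as a sketch of this filtering step (not the authors' actual procedure; the penman library, the dictionary inputs, the helper names, and the raw string comparison used as a stand-in for "the AMR changed between releases" are all assumptions), the eligibility check could look like this:

# Hypothetical sketch: keep only AMRs unchanged across the 2015 and 2017
# releases whose root concept is not "multi-sentence", then sample 100.
import random
import penman

def root_concept(graph):
    # the concept named on the root variable, e.g. "multi-sentence"
    for source, role, target in graph.triples:
        if source == graph.top and role == ":instance":
            return target
    return None

def eligible_ids(amrs_2015, amrs_2017):
    # amrs_2015 / amrs_2017: dicts mapping sentence IDs to PENMAN strings
    keep = []
    for sent_id, penman_2017 in amrs_2017.items():
        if amrs_2015.get(sent_id) != penman_2017:
            continue  # naive proxy for "AMR differs between versions"
        if root_concept(penman.decode(penman_2017)) == "multi-sentence":
            continue  # multi-sentence root; exclude
        keep.append(sent_id)
    return keep

# sample = random.sample(eligible_ids(amrs_2015, amrs_2017), 100)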
Annotation:
A total of nine annotators participated in this evaluation, including the three who participated in the pilot. All had prior training in AMR annotation, mostly from taking a semester-long course focused on AMR and other meaning representations. Each annotated two different batches of 10 sentences each, except for one annotator who did four batches. The result was that each set of sentences was double-annotated, allowing us to quantify inter-annotator agreement. Additionally, batches were assigned such that each annotator overlapped with at least two other annotators.

Figure 3: Violin plots of ranges of human judgments for each annotator. (a) Fluency by annotator; (b) Adequacy by annotator.

Figure 4: Violin plots of human judgments for each system. (a) Fluency by system; (b) Adequacy by system.
Pilot Survey: The only previous human evaluation of several AMR-to-English generation systems was in the SemEval-2017 task discussed above. Since our survey had several differences from this previous evaluation, it was possible that the methodological differences could lead to substantial differences in judgments on the same data. Thus, before conducting the main survey, we validated our methodology by comparing the results of the pilot survey to those of the SemEval-2017 evaluation.

This is the first evaluation of AMR generation to collect separate judgments for fluency and adequacy. We hypothesized that this would provide a finer-grained characterization of system behavior, and that annotators would be able to distinguish these two scales, though they are related (incomprehensible sentences necessarily have low fluency as well as adequacy, while references and high-quality output have near-perfect fluency and adequacy). Indeed, we find a Spearman's rank correlation of 0.68 between fluency and adequacy ratings in the pilot, indicating that while they are related, annotators were largely able to evaluate these two dimensions separately.

The average fluency scores from our evaluation match the ranking of systems found in May and Priyadarshi (2017). Average adequacy scores are the same except that ISI performs slightly higher than FORGe. This suggests that our methodology is reliable for ranking systems, and that separating judgments for fluency and adequacy allows for a more nuanced view of relative system performance than overall quality judgments.

Finally, we calculate inter-annotator agreement (IAA) to measure how consistently annotators could make these judgments. We measure IAA for the numeric fluency and adequacy scores with Spearman's correlation, and for each adequacy error type with Cohen's Kappa. We find an average pairwise IAA of 0.78 for fluency and 0.67 for adequacy. For error types, we get lower agreement: average pairwise Kappa scores are 0.44 for incomprehensibility, 0.53 for missing information, and 0.28 for added information. This indicates that guidelines on when to annotate these error types were not made clear enough for annotators to apply them consistently; future studies using this methodology should clarify these guidelines for more reliable results.
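As a concrete illustration of how such agreement figures are typically computed (a minimal sketch with invented example ratings, using standard scipy and scikit-learn functions rather than the authors' own scripts), pairwise IAA for one annotator pair could be obtained as:

# Pairwise inter-annotator agreement: Spearman's rho for the 0-100
# scalar ratings, Cohen's kappa for the binary error-type checkboxes.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# invented example data: two annotators' judgments of the same 5 items
fluency_a = [85, 40, 90, 12, 70]
fluency_b = [80, 55, 95, 20, 60]
missing_info_a = [0, 1, 0, 1, 0]
missing_info_b = [0, 1, 0, 0, 0]

rho, _ = spearmanr(fluency_a, fluency_b)
kappa = cohen_kappa_score(missing_info_a, missing_info_b)
print(f"Spearman's rho (fluency): {rho:.2f}")
print(f"Cohen's kappa (missing info): {kappa:.2f}")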
Main Survey:
On this survey we find an overall Spearman's correlation of 0.58 between fluency and adequacy, indicating that annotators were able to evaluate these two dimensions separately. This correlation is lower than in the pilot, which may be due to clearer instructions given to annotators on what is meant by "fluency" and "adequacy", or because the two dimensions are easier to separate when fewer sentences are of very low quality.

Since each set of 10 AMRs (or 60 judgments of each type per annotator) was double-annotated by a different pair of annotators, we evaluated IAA separately for each pair. Agreement scores vary considerably, but indicate moderate agreement overall. Results are shown in Table 1. We find that IAA for fluency is moderate to high for most annotator pairs, with two exceptions where agreement is low. IAA is higher for adequacy than for fluency in 8 out of 10 cases, and reflects at least moderate agreement in all cases.

For adequacy error types, IAA scores vary greatly and many are low. This indicates that guidelines given to annotators may not have been clear enough. For example, it was expected that annotators would infer, based on their knowledge that AMR does not specify tense, that sentences should not be considered wrong for having any particular tense; however, we learned after the evaluation that at least one annotator marked some cases of non-present tense in sentences as added information.

Figure 3 gives each annotator's distribution of ratings, showing that different individuals chose to distribute their judgments over the available 0–100 scale in different ways. Since each annotator judged each system the same number of times, this is not a problem for our comparison of systems. However, when identifying low-scoring sentences (§4.4 and §4.5), we normalize by annotator to account for these differences.
Table 3: Of 100 sentences, the number with low fluency or adequacy (bottom 1/3 of ratings for both annotators).

System    | Low Fluency | Low Adequacy
Konstas   | 5   | 9
Zhu       | 9   | 16
Ribeiro   | 21  | 34
Guo       | 21  | 28
Manning   | 60  | 51
Reference | 0   | 1
Total     | 116 | 139

Table 2 shows the average score given for each system for fluency and adequacy, as well as how often each was marked as having each adequacy error type. We find that on both fluency and adequacy scores, Konstas performs best, followed by Zhu, and Manning performs the worst. Guo and Ribeiro are in between and within 5 points of each other on each measure, with Ribeiro performing better on fluency and Guo on adequacy.

Unsurprisingly, the lower a system's average fluency score, the more often its sentences were marked as incomprehensible.

The Missing Information and Added Information labels support the suggestion of Manning (2019) that although their system performs worse than others by most measures, its constraints make it less likely than machine-learning-based systems to omit or hallucinate information. Konstas's system performs the next-best by both of these measures; in particular, it rarely adds information not present in the AMR. Ribeiro's system is most prone to errors of these types, omitting information in nearly half of sentences and hallucinating it in nearly a third. Overall, the results from these questions indicate that neural AMR generation systems are prone to omit or hallucinate concepts from the AMR with concerning frequency.

Figures 4a and 4b show the distributions of scores each system received for fluency and adequacy, respectively. These show that Konstas is skewed toward very high scores, and that Manning skews toward low scores, especially for fluency.
To investigate how well automatic metrics align with human judgments of the relative quality of these systems, we compute BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), TER (Snover et al., 2006), CHRF++ (Popović, 2017), and BERTScore (Zhang et al., 2020) for each system. For reproducibility, details on the scripts and parameters used for each metric are given in the supplementary material. Results are shown in Table 4; the relationship between each system's average fluency and adequacy scores and its BLEU score and BERTScore is also visualized in Figure 5. Reference scores are omitted from these figures because the high concentration of perfect scores obscured the details of other systems.

Table 4: Each system's scores on automatic metrics for the full dataset of 1371 sentences (All) and the subset of 100 sentences used in the human evaluation (Sub.). Columns: BLEU ↑, METEOR ↑, TER ↓, CHRF++ ↑, and BERTScore ↑, each reported for All and Sub.; rows: Konstas, Zhu, Ribeiro, Guo, Manning.

Figure 5: Comparison of BLEU scores and BERTScores to human judgments. (a) BLEU vs. average fluency; (b) BLEU vs. average adequacy; (c) BERTScore vs. average fluency; (d) BERTScore vs. average adequacy.

All these metrics at least agree with humans that the Konstas and Zhu systems are the best, followed by Ribeiro and Guo, and that Manning is the worst. Within the top two, humans found Konstas substantially better than Zhu. When using the full data, all automatic metrics agree that Konstas is best, although for all but CHRF++ this is by a small margin. When evaluated only on the sentences used in the human evaluation, only METEOR, CHRF++, and BERTScore preserve this ranking; BLEU finds the two essentially tied, while TER finds Zhu slightly better.

For the middle two, humans preferred Ribeiro on fluency but Guo on adequacy. On the full dataset, all the metrics capture that these systems are of very similar overall quality, varying only by a fraction of a point. On the subset of sentences, all metrics except BERTScore prefer Ribeiro, suggesting that these metrics may align more with human judgments of fluency than of adequacy.

Overall, these results show that these metrics essentially capture human rankings of these systems on this dataset, although further research would be needed to more robustly confirm the validity of these metrics for the task. The results also highlight the limitations of metrics that produce only single scores. While these metrics can only capture that the Ribeiro and Guo systems are similar, our human evaluation found more nuance by identifying criteria on which each one outperforms the other.
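The exact scripts and parameters used for each metric are given in the paper's supplementary material; the following is only a rough sketch of how corpus-level scores of this kind can be obtained, assuming the third-party sacrebleu and bert-score Python packages (which the paper does not name) and invented example sentences. METEOR is omitted from this sketch.

# Hedged illustration of computing several of the reported metrics.
import sacrebleu
from bert_score import score as bert_score

hypotheses = ["georgia labels russia 's support an act of annexation ."]
references = ["Georgia labeled Russia's support an act of annexation."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order=2)  # chrF++
ter = sacrebleu.corpus_ter(hypotheses, [references])
P, R, F1 = bert_score(hypotheses, references, lang="en")

print(f"BLEU {bleu.score:.1f}  chrF++ {chrf.score:.1f}  TER {ter.score:.1f}  "
      f"BERTScore F1 {F1.mean().item():.3f}")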
To examine what factors contributed to particularly low adequacy scores, we identify sentences for which both annotators gave low ratings. Because, as shown in Figure 3, individual annotators differed in the distribution of ratings they used, we normalized this by annotator: a sentence is counted as low-adequacy if each annotator gave it a rating in the lower 1/3 of their total adequacy ratings. The number of low-scoring sentences by system is given in Table 3.

All 139 low-adequacy sentences were marked as having at least one adequacy error by at least one annotator. 46 (33%) were tagged by both annotators as incomprehensible, 51 (37%) as missing information, and 25 (18%) as adding information. Added information is perhaps the most troubling form of error; AMR generation systems will have severely limited potential for use in practical applications as long as they hallucinate meaning. In one example, a reference to prostitution is inserted:
REF: A high-security Russian laboratory complex storing anthrax, plague and other deadly bacteria faces loosing electricity for lack of payment to the mosenergo electric utility.

RIBEIRO: the russian laboratory complex as a high - security complex will be faced with anthrax , prostitution , and and other killing bacterium losing electricity as it is lack of paying for mosenergo .
As seen above in Table 2, Manning omits and adds information substantially less often than the other systems, but produces incomprehensible sentences far more often. Thus, it is unsurprising that most (73%) of its low-adequacy sentences are also low-fluency. For Guo, too, a majority (54%) of low-adequacy sentences are low-fluency, though this is largely due to anonymization and repetition of words, as discussed below.
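The per-annotator lower-third criterion described above can be sketched as follows; this is a minimal illustration with invented data structures, not the authors' code, and treating the cutoff as the 33rd percentile of each annotator's own ratings is an assumption.

# Flag a sentence as low-adequacy if every annotator who rated it gave it
# a score in the lower third of that annotator's own adequacy ratings.
import numpy as np

def low_score_threshold(ratings):
    # ratings: all adequacy scores (0-100) given by one annotator
    return np.percentile(ratings, 100 / 3)

def is_low(sentence_scores, annotator_ratings):
    # sentence_scores: {annotator_id: score for this sentence}
    # annotator_ratings: {annotator_id: all of that annotator's scores}
    return all(score <= low_score_threshold(annotator_ratings[ann])
               for ann, score in sentence_scores.items())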
Using the same procedure described above for low adequacy, we also identify sentences for which both annotators gave low fluency ratings. Counts for each system are given in Table 3. As expected, no reference sentences are low-fluency.

Of the 116 low-fluency sentences, 50 (43%) are also marked as incomprehensible by both annotators. The other error types are, unsurprisingly, less related to low fluency than to low adequacy: 23 (20%) of low-fluency sentences are missing information, and only 6 (5%) have added information.

Over half of all low-fluency sentences are from Manning's rule-based system. This is largely because in many cases the system's rules do not allow for the generation of function words that would be expected in a fluent version of the sentence, while the neural systems are more likely to include such words in similar ways to the training data. For example, for the following AMR:

(t / thank-01
   :ARG1 (y / you)
   :ARG2 (r / read-01
      :ARG0 y))
Manning's system gave the disfluent output 'Thank you read .' while others produced variants of 'thank you for reading .' or 'thanks for reading .'

For the neural systems, common sources of low fluency scores included anonymization and repetition of words. Anonymization was a problem primarily for Guo; 9 of Guo's 21 low-fluency sentences contain the token, as in:

GUO: georgia labels russia 's support for the

While Konstas uses anonymization less frequently, 2 of the system's 5 low-fluency sentences contain anonymized location names or quantities. Guo, Ribeiro, and Konstas all have several low-fluency sentences with unhumanlike repetition of words or phrases, for example:

(a / and
   :op2 (h / happen-02
      :ARG1 (l / like-01
         :ARG0 (i / i)
         :ARG1 (d / develop-02
            :ARG1 (l2 / lot
               :mod (l3 / large))))))

RIBEIRO: and i happen to like a large lot of a lot .
Our analysis of these systems, and especially of their common errors, points toward directions for researchers developing NLG systems, especially for AMR, to improve their output. We recommend attempting to find solutions to the common issues that led to low scores even from state-of-the-art systems, such as anonymization of infrequent concepts, unnecessary repetition of words, and hallucination.

While this study found that popular automatic metrics were mostly successful in ranking these systems in the same order human annotators did, we also found that the human evaluation was able to identify strengths and weaknesses of systems with more nuance than a single number can convey. We also acknowledge that, given prior work pointing to the inadequacy of metrics such as BLEU for NLG and AMR generation, more research is needed to determine the reliability of these metrics for comparing systems. We suggest that researchers in AMR generation and other NLG tasks continue to supplement automatic metrics with human evaluation as much as possible.
References
Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria. Association for Computational Linguistics.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 Conference on Machine Translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 Conference on Machine Translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Kris Cao and Stephen Clark. 2019. Factorising AMR generation through syntax. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2157–2163, Minneapolis, Minnesota. Association for Computational Linguistics.

Marco Damonte and Shay B. Cohen. 2019. Structural neural encoders for AMR-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3649–3658, Minneapolis, Minnesota. Association for Computational Linguistics.

Jeffrey Flanigan, Chris Dyer, Noah A. Smith, and Jaime Carbonell. 2016. Generation from Abstract Meaning Representation using tree transducers. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 731–739, San Diego, California. Association for Computational Linguistics.

Albert Gatt and Anja Belz. 2010. Introducing shared tasks to NLG: The TUNA shared task evaluation challenges. In Emiel Krahmer and Mariët Theune, editors, Empirical Methods in Natural Language Generation: Data-oriented Methods and Empirical Evaluation, pages 264–293. Springer, Berlin, Heidelberg.

Dimitra Gkatzia and Saad Mahamood. 2015. A snapshot of NLG evaluation practices 2005–2014. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), pages 57–60, Brighton, UK. Association for Computational Linguistics.

Zhijiang Guo, Yan Zhang, Zhiyang Teng, and Wei Lu. 2019. Densely connected graph convolutional networks for graph-to-sequence learning. Transactions of the Association for Computational Linguistics, 7:297–312.

Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146–157, Vancouver, Canada. Association for Computational Linguistics.

Guy Lapalme. 2019. Verbalizing AMR structures.

Kexin Liao, Logan Lebanoff, and Fei Liu. 2018. Abstract Meaning Representation for multi-document summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1178–1190, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Emma Manning. 2019. A partially rule-based approach to AMR generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 61–70, Minneapolis, Minnesota. Association for Computational Linguistics.

Jonathan May and Jay Priyadarshi. 2017. SemEval-2017 task 9: Abstract Meaning Representation parsing and generation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 536–545, Vancouver, Canada. Association for Computational Linguistics.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nima Pourdamghani, Kevin Knight, and Ulf Hermjakob. 2016. Generating English from Abstract Meaning Representations. In Proceedings of the 9th International Natural Language Generation Conference, pages 21–25, Edinburgh, UK. Association for Computational Linguistics.

Ehud Reiter. 2018. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401.

Leonardo F. R. Ribeiro, Claire Gardent, and Iryna Gurevych. 2019. Enhancing AMR-to-text generation with dual graph representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3174–3185, Hong Kong, China. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 223–231, Cambridge.

Linfeng Song, Daniel Gildea, Yue Zhang, Zhiguo Wang, and Jinsong Su. 2019. Semantic neural machine translation using AMR. Transactions of the Association for Computational Linguistics, 7:19–31.

Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Gildea. 2018. A graph-to-sequence model for AMR-to-text generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1616–1626, Melbourne, Australia. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Jie Zhu, Junhui Li, Muhua Zhu, Longhua Qian, Min Zhang, and Guodong Zhou. 2019. Modeling graph structure in transformer for better AMR-to-text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.