Generating Query Focused Summaries from Query-Free Resources
Yumo Xu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected], [email protected]

Abstract
The availability of large-scale datasets has driven the development of neural sequence-to-sequence models to generate generic summaries, i.e., summaries which do not correspond to any pre-specified queries. However, due to the lack of training data, query focused summarization (QFS) has been studied mainly with extractive methods. In this work, we consider the problem of leveraging only generic summarization resources to build an abstractive QFS system. We propose MARGE, a Masked ROUGE Regression framework composed of a novel unified representation for summaries and queries, and a distantly supervised training task for answer evidence estimation. To further utilize generic data for generation, three attributes are incorporated during training and inference to control the shape of the final summary: evidence rank, query guidance, and summary length. Despite learning from minimal supervision, our system achieves state-of-the-art results in the distantly supervised setting across domains and query types.

Introduction

The neural encoder-decoder framework has become increasingly popular in generic text summarization, including single-document summarization (SDS; See et al. 2017; Gehrmann et al. 2018) and multi-document summarization (MDS; Liu and Lapata 2019a; Fabbri et al. 2019). However, its success can be partially attributed to the availability of large-scale parallel datasets. In contrast, parallel data for query focused summarization (QFS) is costly to obtain; there exists only one relatively small single-document QFS training set (Nema et al., 2017), although some QFS datasets are available for evaluation purposes (Dang, 2005; Hoa, 2006; Baumel et al., 2016). There is therefore an urgent need to leverage generic summarization data for the benefit of QFS. (Our code will be available at github.com/yumoxu/margesum.)

The major issue in applying generic summarization datasets to QFS is the lack of queries in these resources: generic summarization datasets are document-summary pairs, while QFS summaries are expected to answer specific queries that do not exist in generic data. As a result, generic summarizers have no query modeling, which is an essential component for QFS (Nema et al., 2017). To sidestep this problem, recent research in this field resorts to distant supervision from query-relevant NLP resources (Xu and Lapata, 2020; Su et al., 2020; Laskar et al., 2020), including question answering (Rajpurkar et al., 2016; Chakraborty et al., 2020) and paraphrase identification (Dolan and Brockett, 2005). Nevertheless, this approach has its own deficiencies. Firstly, creating such resources, e.g., QA annotations, can also be extremely costly (Bajaj et al., 2016; Kwiatkowski et al., 2019), which contradicts the intent of handling QFS efficiently. Besides, queries in QFS at inference time suffer from a distribution shift with respect to the questions/queries in these resources (Xu and Lapata, 2020), and it is practically infeasible to find appropriate query-related resources for all domains.

Therefore, in this work, we do not assume access to any resources other than generic summarization data for abstractive QFS, and further decouple abstractive QFS into two research questions: (1) answer evidence ranking: how to learn from generic summarization data to find supportive evidence from the input document set for a given query, and (2) query focused generation: how to train an abstractive summarizer, e.g., an encoder-decoder model, on generic summarization data to maximize its performance on QFS. Under this formulation, we use generic summarization data not only for conditional language modeling, but also for learning an evidence ranking model. Inspired by the Cloze task and its applications in NLP (Taylor, 1953; Lewis et al., 2019; Lee et al., 2019), we propose MARGE, a Masked ROUGE regression framework for sentential evidence estimation and ranking. MARGE introduces a unified representation for summaries and queries, so summaries in generic data can be converted into proxy queries for learning a query model. Based on the evidence selected by MARGE, we incorporate different controllers into query focused generation, including evidence rank, query guidance, and summary length, to better shape the generation process for QFS purposes.

Our contributions in this work are threefold: we propose a minimally supervised system for abstractive QFS where no query-related resource is required; we discover a new type of connection between generic summaries and QFS queries, and provide a universal representation for them which allows generic summarization data to be further exploited for QFS; we provide experimental results on multi-document QFS benchmarks with varied types of queries across domains, and show that our system achieves state-of-the-art results in the distantly supervised setting on both evidence ranking and abstractive QFS.
Related Work

Most previous QFS systems are extractive: they take as input a document cluster and a query, and select query-relevant sentences to compose the summary. Approaches vary depending on the way centrality and relevance are estimated and incorporated, including manifold ranking (Wan et al., 2007), a look-ahead strategy (Badrinath et al., 2011), and uncertainty prediction (Wan and Zhang, 2014). In the era of neural networks, attention mechanisms are employed in both discriminative and generative modeling for salience estimation (Li et al., 2017a,b). Recently, Xu and Lapata (2020) proposed a coarse-to-fine framework that leverages distant supervision from question answering (QA) to extract summary-worthy content. Another line of work is based on Multiple Instance Learning (Angelidis and Lapata, 2018; Xu and Lapata, 2019), where sentential predictions are optimized with weak supervision and then used to produce summaries customized by a class label.

In contrast, abstractive QFS has received significantly less attention, since generation models are particularly data-hungry (Lebanoff et al., 2018; Liu and Lapata, 2019a). In generic summarization, some work proposes to inject prototypes (Saito et al., 2020), topics (Perez-Beltrachini et al., 2019; Wang et al., 2020) or guidance of different formats (Dou et al., 2020) to boost generation performance. These methods, however, are not directly applicable to QFS due to the lack of training data. Thanks to the recent availability of various pretrained models, pipeline-style frameworks for QFS have been proposed to utilize resources from a wider range of NLP tasks. Su et al. (2020) fine-tune BART (Lewis et al., 2020) on CNN/DailyMail for single-document summarization. To generate long sequences for QFS, they iteratively summarize paragraphs to a budget. To build a query model for paragraph selection, they make use of HLTC-MRQA (optimized on six QA datasets; Su et al. 2019) and BioBERT optimized on SQuAD (Rajpurkar et al., 2016). Similarly, Laskar et al. (2020) fine-tune
BERTSUM (Liu and Lapata, 2019b) on CNN/DailyMail, and propose a three-stage system which uses supervision from both QFS testing data and other tasks, e.g., QA and paraphrase identification.

In contrast to existing approaches, our system does not assume access to any training resource other than generic summarization datasets. For query modeling, we train an evidence ranker based on training signals derived from generic datasets. In addition, for the first time, our system is able to generate long QFS summaries at once, instead of iteratively generating bullet-style summaries that lack coherence.

We denote a generic summarization dataset with {(S, D)}, where S is the summary and D = {d_1, d_2, ..., d_M} is a collection of documents from which the summary is generated. We have |D| = 1 for single-document summarization (SDS) and |D| > 1 for multi-document summarization (MDS). Query focused summarization (QFS) datasets, in addition, have a query Q in each sample to specify the information request: {(S, D, Q)}.

[Figure 1: Pipeline of our abstractive QFS system. Masked ROUGE Regression (MARGE) serves as the training task for an evidence ranker. Ranked sentences are input to a controlled generator which outputs the final summary. Dashed lines denote optional inputs for summary generation.]

It is often assumed (e.g., in DUC benchmarks) that Q consists of a short title (e.g., Amnesty International) and a detailed query narrative which is considerably longer (e.g.,
What is the scope of operations of Amnesty International and what are the international reactions to its activities?).

In this work, we disentangle a QFS system into a query model and a conditional language model. The query model q_θ(D|Q; θ) estimates how relevant each text unit in the cluster (e.g., a sentence) is to a given query, so that it can be incorporated in the generation process modeled with the conditional language model p_φ(S|D, Q; φ). When S ⊥⊥ Q, we have a query-agnostic conditional language model p_φ(S|D; φ). Otherwise, we call the conditional language model query-guided.

Since creating QA datasets is extremely costly and practically infeasible for every new domain, we do not assume access to QA datasets; instead, we opt for training a query model with distant supervision derived from generic summarization data, which is automatically crawled from the web. The premise of this work is the hypothesis that in generic summarization, although no query is specified as input, summaries are composed to respond to some latent queries. Therefore, these latent queries guide the summary structure and can be induced from summaries. One straightforward option is to train a question generator (QG). To make the generated questions informative and answerable, QG usually assumes that the expected answers are located in the provided context and are accessible for generation (Yuan et al., 2017; Du and Cardie, 2018; Hosking and Riedel, 2019). Nevertheless, it is tricky to find the desirable answers in the provided summarization document(s); in fact, the motivation for constructing questions and training a query model is precisely to locate such potential answers in the documents. Also, QG usually focuses on short and factoid questions (Lewis et al., 2019), while queries in QFS can be longer and more complex (Xu and Lapata, 2020), which is likely to result in divergences between training and testing.

Inspired by the standard Cloze task (Taylor, 1953) and its recent variants (Lewis et al., 2019; Lee et al., 2019), we provide a Unified Masked Representation (UMR) to bridge the syntactic and semantic gap between queries and summaries. With UMR, summaries from generic summarization datasets can serve as proxy queries for model training. We segment the document collection D into sentences as candidates for selection, denoted with C. We first recognize the information slots in summary S, then turn S into UMR-S as a proxy query. Since we do not have annotations of correct answers in D, we use the ROUGE score between a candidate sentence C and its (unmasked) reference summary S as a proxy for the sentence's answerability to the proxy query UMR-S. With this ROUGE score as distant QA supervision, we train a matching model that takes a query and a candidate as inputs and estimates their relevance. At inference time, we turn Q into UMR-Q and rank all the sentences in the QFS dataset with the trained matching model. Top sentences are selected as inputs to a conditional language model p_φ(S|D, Q; φ), which is also optimized on generic summarization data, to generate query focused abstractive summaries.

Unified Masked Representation
The UMR for a summary is defined as the concatenation of its sentential UMRs. To convert a summary sentence from natural language to UMR, we parse it with Open Information Extraction (Open IE; Stanovsky et al. 2018), obtaining a set of propositions with varied verbs and arguments, e.g., Arg0 and Arg1. All arguments are treated as candidate information slots. As shown in Algorithm 1, we replace all slots with a mask token [MASK] for initialization. Then, under a budget constraint, e.g., a ratio γ of the total number of tokens in the information slots, we sample a set of slots from the candidates and reveal them. Finally, we merge continuous [MASK] tokens into one, resulting in a partially masked summary as the UMR of this summary.

Algorithm 1: Generate Masked Summary
function MASKSUMMARY(S, γ)                  ▷ summary sentences and mask ratio
    Parse each s ∈ S with Open IE to extract information slots I
    Reveal budget B = |I| * γ               ▷ reveal information partially
    Initialize revealed word number b = 0
    Initialize masked summary M with the same shape as S, pre-filled with [MASK]
    Initialize EOM = false                  ▷ end of masking
    while not EOM do
        S_a = GETAVAILABLE(S)               ▷ sentences with masked slots
        for s ∈ S_a do
            b = b + REVEAL(s)               ▷ reveal a randomly sampled slot and record its length
            if b ≥ B then EOM = true
    for m ∈ M do                            ▷ post-process
        MERGE(m)                            ▷ merge continuous [MASK]
    return M
end function

As to QFS queries, we empirically observe that the queried information (i.e., the information that the query seeks and thus does not exist in the query) is typically replaced with two primary types of query words: (1) interrogatives, e.g., how is A and what is B; and (2) request words, e.g., describe A and tell me B. Following this observation, we manually collect a small set of query words and replace them with [MASK] in the query narrative to convert it to a masked representation. In particular, for a query consisting of a title and a narrative, we first mask the narrative and then prepend [MASK] T. to the partially masked narrative to form the final UMR-Q, where T is a sequence of title tokens.
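To make the procedure concrete, here is a minimal Python sketch of both conversions. It assumes sentences are pre-tokenized and that Open IE slot spans are supplied as token offsets (the extraction step itself is omitted); it reveals slots in a single shuffled pass rather than the sentence-by-sentence rounds of Algorithm 1, and QUERY_WORDS is a hypothetical stand-in for the manually collected set. All names are illustrative.

import random

MASK = "[MASK]"

def mask_summary(sentences, slots, gamma=0.0, seed=13):
    """Build a UMR proxy query from a tokenized summary.

    sentences: list of token lists, one per summary sentence.
    slots: parallel list of (start, end) token spans marking
           information slots, e.g., Open IE arguments.
    gamma: ratio of slot tokens to reveal (0.0 masks all slots).
    """
    rng = random.Random(seed)
    # Start from a fully masked copy: every slot token becomes [MASK].
    masked = [list(toks) for toks in sentences]
    for sent, spans in zip(masked, slots):
        for start, end in spans:
            for i in range(start, end):
                sent[i] = MASK
    # Reveal randomly sampled slots until the budget is exhausted.
    budget = gamma * sum(end - start for spans in slots for start, end in spans)
    revealed = 0
    candidates = [(si, span) for si, spans in enumerate(slots) for span in spans]
    rng.shuffle(candidates)
    for si, (start, end) in candidates:
        if revealed >= budget:
            break
        masked[si][start:end] = sentences[si][start:end]
        revealed += end - start
    # Merge runs of consecutive [MASK] tokens into a single one.
    out = []
    for sent in masked:
        merged = [t for i, t in enumerate(sent)
                  if t != MASK or i == 0 or sent[i - 1] != MASK]
        out.append(" ".join(merged))
    return " ".join(out)

# Hypothetical query-word list; the manually collected set is assumed larger.
QUERY_WORDS = {"what", "who", "when", "where", "why", "how", "which",
               "describe", "discuss", "identify", "include", "tell", "me"}

def mask_query(title, narrative):
    """Convert a QFS query to UMR-Q: mask query words in the narrative,
    then prepend '[MASK] <title> .' to the result."""
    toks = [MASK if t.lower().strip("?,.") in QUERY_WORDS else t
            for t in narrative.split()]
    merged = [t for i, t in enumerate(toks)
              if t != MASK or i == 0 or toks[i - 1] != MASK]
    return f"{MASK} {title} . " + " ".join(merged)

Calling mask_summary with gamma=0.0 masks every slot; as discussed in the masking-ratio analysis later, masking all information slots and revealing the rest turns out to be a simple but effective strategy.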
Masked Semantic Matching

For query modeling q_θ(D|Q; θ), we construct a pre-trained BERT model (Devlin et al., 2019) which matches a UMR segment UMR-* against a candidate sentence C. We use UMR-S as a proxy query during training and UMR-Q during testing. We concatenate a UMR segment UMR-* (a token sequence U) and a candidate sentence C (a token sequence C) into an input sequence [CLS] U [SEP] C [SEP] for a BERT encoder (we pad each sequence in a mini-batch to L tokens). The [CLS] vector serves as input to a single-layer neural network to obtain the matching score.

We sample tuples of summary S (which is then converted into UMR-S) and candidate sentence C from a generic summarization dataset D for model training. We use the mean-square error as the loss and update both the BERT encoder and the regression layer via standard backpropagation:

\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{(S,C) \sim \mathcal{D}} \left( y - \hat{y}(S, C; \theta) \right)^2 .   (1)

As training signals, Liu and Lapata (2019a) use ROUGE 2 for paragraph ranking. However, sentences are significantly shorter than paragraphs, and we observe a number of instances with a ROUGE 2 score of 0. For label smoothing, we define y as an interpolation of the F1 scores of ROUGE 2 and ROUGE 1: y = R_2(S, C) + λ * R_1(S, C), where λ is optimized on the development set. At inference time, given UMR-Q, we compute its matching score with all candidate sentences from the input documents and rank them accordingly. Top sentences are selected as query-relevant evidence for summarization according to a word budget.
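As a sketch of how such a matching model can be trained, the following uses the HuggingFace transformers library (an assumption; the checkpoint name, learning rate, and helper names are illustrative rather than the paper's exact configuration).

import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class MargeRanker(nn.Module):
    """BERT encoder with a single-layer regression head on [CLS]."""
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.head = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]   # the [CLS] vector
        return self.head(cls).squeeze(-1)   # scalar matching score

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = MargeRanker()
opt = torch.optim.AdamW(model.parameters(), lr=3e-5)  # illustrative value

def train_step(umr_queries, candidates, labels):
    """One MSE step on (UMR proxy query, candidate sentence) pairs.

    labels: interpolated ROUGE, y = R2(S, C) + lambda * R1(S, C).
    """
    # Builds [CLS] U [SEP] C [SEP] pairs with padding in the batch.
    batch = tokenizer(umr_queries, candidates, padding=True,
                      truncation=True, return_tensors="pt")
    pred = model(**batch)
    loss = nn.functional.mse_loss(pred, torch.tensor(labels, dtype=torch.float))
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()

At inference time, the same forward pass scores (UMR-Q, candidate) pairs, and candidates are simply sorted by the predicted score.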
Query Narrative Expansion
Some QFS datasets have only short query titles, while query narratives are missing. Since the latent query proxies in training instances are usually long and detailed, we build query narratives in an unsupervised fashion for the QFS datasets where only short queries are available. Specifically, we employ LexRank (Erkan and Radev, 2004) to select a subset of representative sentences from clusters under a word budget and concatenate them to form the narratives. The predicted narratives are then appended to the original query titles, as described above, to form the final queries.
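A self-contained sketch of this expansion step follows. It uses a simplified LexRank-style power iteration over a TF-IDF cosine-similarity graph rather than the exact LexRank implementation, and the word budget and damping factor are illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expand_narrative(sentences, budget=60, damping=0.85, iters=50):
    """Pick central cluster sentences via a LexRank-style random walk
    and concatenate them under a word budget to form a narrative."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    np.fill_diagonal(sim, 0.0)
    # Row-normalize into a transition matrix; rows with no neighbors
    # fall back to a uniform distribution.
    n = len(sentences)
    rowsum = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, rowsum, out=np.full_like(sim, 1.0 / n),
                      where=rowsum > 0)
    # PageRank-style power iteration for centrality scores.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * trans.T @ scores
    narrative, used = [], 0
    for i in np.argsort(-scores):           # most central first
        words = len(sentences[i].split())
        if used + words > budget:
            continue
        narrative.append(sentences[i])
        used += words
    return " ".join(narrative)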
In this work, we assume that we have access to generic summarization datasets, which we can leverage to fine-tune a pretrained language model for abstractive QFS. At inference time, we input query-relevant sentences from the evidence ranker to decode summaries autoregressively.
Synthetic MDS Data
Apart from existing MDS datasets in the news domain, e.g., Multi-News, we provide a way to build synthetic MDS datasets based on SDS data. One reason is that generic resources for MDS are also limited, while SDS data is easier to acquire (Zhang et al., 2018; Lebanoff et al., 2018). On the other hand, we empirically find that under the same length constraint, summaries from MDS tend to be more redundant and cover fewer topic aspects than those from SDS. Since QFS usually requires multiple aspects of a query to be responded to, we also explore the use of SDS data.

Dataset           2005    2006    2007    TD-QFS
Domain            Cross   Cross   Cross   Medical
Query Narrative   Long    Long    Long    Short

Table 1: Multi-Document QFS dataset statistics.

[Table 2: Training data statistics for evidence ranking (above) and summary generation (below) on Multi-News and CNN/DM. CNN/DM statistics for summary generation describe the synthetic MDS dataset proposed in this work based on the original CNN/DM data.]

One challenge with leveraging SDS for QFS is the output length (Lebanoff et al., 2018). Summaries in SDS datasets, e.g., CNN/DailyMail (Hermann et al., 2015), have on average around 30 tokens. In contrast, QFS requires 250-word summaries. To sidestep this problem, we build a synthetic MDS dataset from an SDS dataset in a retrieval manner. Specifically, we first build a database with all summaries in the original dataset. For each sample (d_i, s_i) in the dataset, we query the whole database with its summary s_i. We retrieve N_i - 1 other summaries S_i with the bigram hashing and TF-IDF matching described in Chen et al. (2017). Then, we fetch their corresponding articles D_i, and form the i-th cluster as:

D^*_i = \{d_i\} \cup D_i   (2)

\hat{s}^*_i = \mathrm{concat}(s_i, s_{i,1}, \ldots, s_{i,N_i-1}), \quad s_{i,n} \in S_i   (3)

where D^*_i is the set of source documents and \hat{s}^*_i is a cluster summary with potential redundancy. We set N_i to minimize the length difference between \hat{s}^*_i and the summary length required in QFS, i.e., 250 words. To obtain the final summary s^*_i, we further remove redundancy by selecting sentences from the start of \hat{s}^*_i and skipping sentences that have high cosine similarities with the already selected ones.
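The construction can be sketched as follows. TF-IDF cosine retrieval stands in for the bigram-hashing retriever of Chen et al. (2017), and the similarity cap, the crude sentence splitter, and the function names are illustrative (the paper's exact threshold is not reproduced here).

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sent_split(text):
    # Crude sentence splitter; a real system would use NLTK/spaCy.
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def build_synthetic_mds(docs, summaries, target_len=250, sim_cap=0.6):
    """Turn an SDS corpus into synthetic MDS clusters: retrieve articles
    whose summaries resemble the anchor summary (Eqs. 2-3), then prune
    redundant summary sentences."""
    vec = TfidfVectorizer(ngram_range=(1, 2))   # unigram+bigram TF-IDF
    X = vec.fit_transform(summaries)
    clusters = []
    for i, s_i in enumerate(summaries):
        sims = cosine_similarity(X[i], X).ravel()
        sims[i] = -1.0                          # exclude the anchor itself
        # D*_i = {d_i} with retrieved D_i; s^*_i concatenates summaries.
        cluster_docs, merged_sents = [docs[i]], sent_split(s_i)
        for j in np.argsort(-sims):             # grow until ~250 words
            if sum(len(s.split()) for s in merged_sents) >= target_len:
                break
            cluster_docs.append(docs[j])
            merged_sents += sent_split(summaries[j])
        # Scan from the start, skipping sentences too similar to kept ones.
        kept, kept_mat = [], None
        for sent in merged_sents:
            v = vec.transform([sent])
            if kept_mat is not None and cosine_similarity(v, kept_mat).max() >= sim_cap:
                continue
            kept.append(sent)
            kept_mat = v if kept_mat is None else vstack([kept_mat, v])
        clusters.append((cluster_docs, " ".join(kept)))
    return clusters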
Evidence Rank

In generic MDS, the summarization source is organized by concatenating documents in a cluster, with sentences in each document following their original order (Fabbri et al., 2019). Nevertheless, we expect the input sentences during training to be agnostic to their original positions in articles, since this absolute positional feature is not generalizable to inference on QFS data after evidence ranking. To mitigate this training-testing discrepancy, for each training sample we collect all sentences across documents and rank them in descending order according to their evidence scores, e.g., ROUGE 2 scores against the gold summary. We take this evidence-ranked sentence list as the training input to the pretrained language model. With parameterized positional information, e.g., the relative position embeddings in UniLM (Bao et al., 2020), our model learns to generate summaries from evidence-ranked sentences, e.g., by potentially favoring information presented in top-ranked sentences. During inference, when actual queries are available, we instead use the top sentences ranked by our evidence ranker as the inputs to summary generation. (This setting also entails a training/testing divergence between gold and estimated evidence ranks. We also experimented with using estimated rankings for training; however, this did not yield better results.)
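A sketch of the input-ordering step, assuming the rouge-score package; the cutoff and the choice of F1 (rather than recall) are illustrative.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def rank_training_input(cluster_sentences, gold_summary, top_k=40):
    """Order all cluster sentences by ROUGE-2 against the gold summary,
    so training inputs mimic inference-time evidence ranking."""
    scored = [(scorer.score(gold_summary, s)["rouge2"].fmeasure, s)
              for s in cluster_sentences]
    scored.sort(key=lambda x: -x[0])   # highest evidence first
    return [s for _, s in scored[:top_k]]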
Query Guidance

Given that the summarization input is composed of selected evidence highly relevant to the query, an open question is whether it is still useful to explicitly model queries during generation. To answer this question, we instantiate the two conditional language models introduced in Section 3. For a query-guided summarizer p_φ(S|D, Q; φ), we prepend UMR-S to the selected evidence during training and prepend UMR-Q at inference. For a query-agnostic summarizer p_φ(S|D; φ), we consider only the selected answer evidence as summarization input, which is identical to generic MDS.

Length Control
QFS tasks usually require summaries of a fixed length budget (e.g., 250 words), while summary length varies in our training data. We empirically found that either adopting a length penalty or enforcing a minimum number of decoding tokens fails to produce meaningful, query-relevant content. Inspired by Fan et al. (2018), we quantize summary length into discrete bins of size 10. For a training instance, we prepend a length token to its input sequence according to its actual summary length, e.g., [230]. At inference, we prepend [250] to the extracted evidence for every testing instance to inform our model of the expected generation length.
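A minimal sketch of the length-control tokens, assuming whitespace tokenization; the names are illustrative.

def length_token(n_words, bin_size=10):
    """Quantize a summary length into a discrete bin token,
    e.g., 234 words -> '[230]' with bin_size=10."""
    return f"[{(n_words // bin_size) * bin_size}]"

def build_input(evidence_sentences, summary=None, infer_budget=250):
    """Prepend the length token to the evidence-ranked input sequence;
    at inference we request the QFS word budget."""
    n = len(summary.split()) if summary is not None else infer_budget
    return " ".join([length_token(n)] + evidence_sentences)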
Datasets
We performed multi-document QFS experiments on the DUC 2005-2007 benchmarks and TD-QFS (Baumel et al., 2016). Table 1 shows the statistics. The DUC benchmarks contain long query narratives, while TD-QFS focuses on medical texts with short keyword queries. We used DUC 2005 as a development set to optimize hyperparameters and select abstractive models, and evaluated performance on the other three datasets. We used Multi-News (Fabbri et al., 2019) and CNN/DailyMail (Hermann et al., 2015) as the generic summarization datasets to train MARGE (for evidence ranking) and UniLMv2 (for summary generation). To train MARGE, we sampled sentences from each dataset: specifically, we took the first and the last 20 sentences from each cluster in Multi-News, and the first and the last 3 sentences from each article in CNN/DailyMail. For fine-tuning UniLMv2, we used the original Multi-News and the multi-document version of CNN/DailyMail (created in this work as described in Section 5).
Implementation Details
We used the publicly released BERT model (https://github.com/huggingface/pytorch-transformers) and fine-tuned it for ROUGE regression with a batch size of 128 for 3 epochs on 8 GPUs (GTX 2080 Ti). We trained two summarization models, on CNN/DailyMail and Multi-News respectively, with the same hardware. For both models, we set the maximum input length to 768 and fine-tuned the publicly released UniLMv2 (https://github.com/microsoft/unilm) with a batch size of 16 for 40,000 steps with gradient accumulation every 4 steps. During decoding, we use beam search with trigram blocking (Paulus et al., 2018) to reduce redundancy.
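The trigram-blocking check used during beam expansion can be sketched as follows (a standalone illustration, not the authors' decoding code).

def violates_trigram_block(prefix_tokens, next_token):
    """Return True if appending next_token would repeat a trigram
    already present in the decoded prefix (Paulus et al., 2018)."""
    if len(prefix_tokens) < 2:
        return False
    candidate = tuple(prefix_tokens[-2:]) + (next_token,)
    seen = {tuple(prefix_tokens[i:i + 3])
            for i in range(len(prefix_tokens) - 2)}
    return candidate in seen

During beam search, a hypothesis whose next token makes this check return True is pruned or assigned a score of negative infinity.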
Evaluation Metrics

We evaluate evidence ranking with both retrieval and summarization metrics. For the retrieval-style evaluation, we follow Liu and Lapata (2019a): we concatenate the top k sentences, then calculate ROUGE recall against gold summaries. However, retrieval evaluation focuses on informativeness, which does not perfectly fit QFS in two senses: (1) sentence length can vary considerably, while a pretrained generative model encodes inputs at the token level and usually has a length budget in terms of the number of words/tokens; (2) redundancy exists in the evidence but is not taken into account. Two similar sentences can each be informative on their own; however, given the inclusion of one, the other may provide little information gain to downstream generation. Therefore, we additionally propose to evaluate evidence ranking in an extractive summarization fashion: we take top sentences subject to a word budget of 250, and remove redundancy as described in Section 5. Besides, instead of ROUGE recall, we opt for ROUGE F1 to evaluate the composed summaries so that precision is also taken into account.

Results
We compare our evidence rankers with Term Frequency, a simple but effective retrieval method that performs particularly well on DUC datasets (Katragadda and Varma, 2009). We also compare to two semantic matching models used for extractive QFS (Xu and Lapata, 2020): BERTQA, which is trained on the joint set of WikiQA (Yang et al., 2015) and TrecQA (Yao et al., 2013), and BERTMRC, which is fine-tuned on SQuAD 2.0 (Rajpurkar et al., 2018). We summarize the ranking results and the summarization results in Table 3 and Table 4, respectively. As we can see, despite learning from weak signals, i.e., proxy queries and proxy answer scores, our models outperform the strongest baseline, BERTQA, under both evaluation tasks. Without recourse to any question/answer annotations or dataset-specific retrieval methods, our model provides more informative input to the downstream generation task. (To examine ranking performance faithfully, we exclude multi-stage frameworks that further rerank the selected evidence with other modules, e.g., centrality estimation.)

[Table 3: Performance of evidence rankers (ORACLE, TERM FREQ, BERTQA, BERTMRC, MARGE-MN, MARGE-CNNDM) on top-k retrieval over DUC 2006, DUC 2007, and TD-QFS. R@k stands for the ROUGE 2 recall score for the concatenation of the top k retrieved sentences (k = 10, 30, 50). Results within brackets use query narrative expansion.]
[Table 4: Performance of evidence rankers (GOLD, ORACLE, LEAD, TERM FREQ, LEXRANK, BERTQA, BERTMRC, MARGE-MN, MARGE-CNNDM) on extractive QFS over DUC 2006, DUC 2007, and TD-QFS. R-1, R-2 and R-SU4 stand for the F1 score of ROUGE 1, 2, and SU4, respectively. Results within brackets use query narrative expansion.]
[Table 5: Ablation results for training data on DUC 2006 and DUC 2007 (absolute performance decrease denoted by ↓).]

Ablation Studies
To test the effectiveness of MARGE as a training task, we conduct an ablation study and show the results in Table 5. Specifically, -Verb classifies verbs as information slots which can be sampled and masked, to see whether verbs provide essential structural information; -Mask removes the masking mechanism so the whole summary is revealed (in this case, MARGE can perform almost perfectly on the training task, since all it requires is lexical overlap counting); -Query removes the proxy query (at training time) and the actual query (at inference time), to investigate whether our model simply learns to judge sentence salience based on a sentence's own features instead of performing semantic matching with the given query; -OpenIE removes the dependency on Open IE and chooses words to mask randomly. Specifically, we randomly mask 15% of the words in summaries, as in BERT (Devlin et al., 2019), and merge continuous [MASK] tokens. We observe a performance drop in all cases, especially when queries are removed, proving the effectiveness of the proposed representation and training framework.

[Figure 2: Performance over the reveal ratio γ, with three panels: label-prediction correlation (Multi-News validation), retrieval R@50 (DUC 2005), and extractive summarization ROUGE F1 (DUC 2005). Correlation refers to the average of Pearson correlation and Spearman correlation. The star marker denotes the query-agnostic performance where all query tokens are masked, including information slots and the rest.]

Effects of Masking Ratio
We also show how the mask reveal ratio γ affects performance in Figure 2. As we can see, performance on the ROUGE regression task improves as γ increases, since the task becomes easier when fewer tokens are masked; when γ = 1.0, simply counting lexical overlaps solves the task perfectly. However, model performance on the QFS development set (DUC 2005) shows the opposite trend: actual queries seek information instead of providing all the information needed. Therefore, the model must be trained to perform semantic matching (Guo et al., 2016) to accurately estimate evidence scores. Based on our empirical results, a simple but effective strategy is to mask all information slots (i.e., potential arguments) and reveal the remaining words (including verbs) in the summary to construct proxy queries for training.

Automatic Evaluation
We show the comparison of our system with existing abstractive systems in Table 6. Results for abstractive summarization are shown in Table 7. The first block presents a supervised abstractive approach and an extractive approach: PQSUM-WSL (Laskar et al., 2020) is an extract-replace-abstract system that achieves state-of-the-art results on DUC benchmarks. It first extracts relevant sentences for each document with a QA model, then replaces extracted sentences with gold summary sentences using a paraphrase model. These summary sentences are used to further optimize a fine-tuned BERTSUM. In its supervised setup, two years' DUC datasets are used for training and the remaining one for testing. QUERYSUM (Xu and Lapata, 2020) is the state-of-the-art extractive approach in ROUGE F1, with a coarse-to-fine process for salience estimation.

The second block compares our models with two distantly supervised approaches: BART-CAQ (Su et al., 2020), an extract-abstract system which uses an ensembled QA model to extract answer evidence and a fine-tuned BART (Lewis et al., 2020) to generate summaries from paragraphs iteratively; and PQSUM (Laskar et al., 2020), an abstract-extract system which uses fine-tuned BERTSUM to generate summaries for each document in a cluster, and a QA model to rank summary sentences against the given query.

Models                            QA   PI   GS   QFS
BART-CAQ (Su et al., 2020)        ✓    ✗    ✓    ✗
PQSUM (Laskar et al., 2020)       ✓    ✗    ✓    ✗
PQSUM-WSL (Laskar et al., 2020)   ✓    ✓    ✓    ✓
UniLM (Bao et al., 2020)          ✗    ✗    ✓    ✗
MARGESUM (our work)               ✗    ✗    ✓    ✗

Table 6: Comparison of abstractive QFS models on their required training data. QA, PI, GS and QFS stand for question answering, paraphrase identification, generic summarization and query focused summarization, respectively.

The third block presents the performance of UniLM fine-tuned on Multi-News and CNN/DailyMail with the standard setting of Bao et al. (2020). It uses no query guidance or length control. Documents are concatenated as inputs for training. During testing, the same evidence sentences from MARGE are used, but ordered by their original positions in the documents.

The last block shows two variants of our QFS system, which we call MARGESUM. Specifically, MARGESUM-MN is query-agnostic, optimized with the summarization training set from Multi-News. MARGESUM-CNNDM is query-guided, optimized with the summarization training set built from CNN/DailyMail. Both models take as input the evidence selected by MARGE-MN during inference.

[Table 7: Performance of abstractive summarization systems on DUC 2006, DUC 2007, and TD-QFS (R-1, R-2, R-SU4 F1). Recovered scores: PQSUM-WSL† (Laskar et al., 2020): 43.5/10.8/16.5 on DUC 2006 and 44.7/12.4/17.7 on DUC 2007; QUERYSUM* (Xu and Lapata, 2020): 41.6/9.5/15.3 on DUC 2006, 43.3/11.6/16.8 on DUC 2007, and 44.3/16.1/20.7 on TD-QFS; BART-CAQ (Su et al., 2020): 38.3/7.7/12.9 on DUC 2006 and 40.5/9.2/14.4 on DUC 2007. The table also reports PQSUM (Laskar et al., 2020), the fine-tuned language models without query focus (UNILM-MN, UNILM-CNNDM), and our systems (MARGESUM-MN, MARGESUM-CNNDM). Extractive method is denoted with * and supervised learning method is denoted with †.]

As we can see, without requiring expensive QA data, our system MARGESUM-CNNDM outperforms existing distantly supervised approaches, which further shrinks the gap between distantly supervised QFS and supervised QFS. Moreover, its performance on DUC is on par with one of the strongest extractive systems, while on TD-QFS our model achieves even better results across metrics. It is also noteworthy that MARGESUM-CNNDM, trained on our synthesized MDS data, outperforms MARGESUM-MN. Compared to Multi-News, synthesized summaries from SDS cover more topic aspects and are less redundant, which suits the QFS setting where there are usually multiple subqueries to respond to.

Ablation Studies
We also conduct ablation studies for generation; the results are shown in Table 8. Replacing the inputs with the evidence sentences selected by BERTQA (*BERTQA) significantly decreases performance, demonstrating that the high-quality evidence inputs from MARGE help the downstream abstractive summarization. Removing the evidence rank (-Rank) leads to a large performance drop, since features of sentence positions in their original documents are not transferable to QFS settings, where query relevance should instead be considered. Removing the length control (-Length) also hurts performance, since the length token informs the summarizer of an expected generation plan. However, the effect of incorporating query guidance explicitly in the generation is dataset-dependent: proxy queries from our synthetic data based on CNN/DailyMail help generation (CNNDM -Query), while the performance change on Multi-News shows the opposite (MN +Query). We notice that there exist several specific patterns in Multi-News summaries that are not generalizable to QFS tasks. In Table 9, we present three of these patterns: (1) extra press information, (2) lazy query copy, and (3) hallucinated hyperlinks. The first pattern exists in the outputs of both query-agnostic and query-guided models trained on Multi-News, while the other two mainly appear in the query-guided case. Therefore, we argue that using proxy queries as guidance when training an abstractive summarizer does not necessarily benefit QFS performance, and the effects vary with the training data and its distribution divergence from QFS data. One reason is potential information leakage from proxy queries. Proxy queries, which are induced from reference summaries, provide a shortcut to summary generation, and the model is therefore more prone to learning dataset-specific summarization patterns. In comparison, our synthetic data from CNN/DailyMail proves more suitable for generating query focused summaries, with which we can improve QFS by incorporating queries during training and inference.

[Table 8: Ablation results for our abstractive summarization model MARGESUM trained on CNN/DM and Multi-News, respectively (absolute performance decrease denoted by ↓). We report results on DUC 2007 and TD-QFS for brevity since DUC 2006 follows a similar pattern.]

Human Evaluation
Participants assessed summaries created by PQSUM-WSL, the state-of-the-art abstractive system, QUERYSUM, a state-of-the-art extractive system in ROUGE F1, UNILM-CNNDM, and MARGESUM-CNNDM. We also randomly selected a GOLD standard summary and included it as an upper bound. We sampled 20 query-cluster pairs from DUC (2006, 2007; 10 from each set) and 20 pairs from TD-QFS (5 from each cluster), and collected three responses per query-summary pair. (We are grateful to Md Tahmid Rahman Laskar for providing us with the output of their PQSUM-WSL system. We include PQSUM-WSL only for human evaluation on DUC since it was not evaluated on TD-QFS (Laskar et al., 2020) and the outputs are not available.)

Query: Computer viruses. Identify computer viruses detected worldwide. Include such details as how they are spread, what operating systems they affect, what damage they inflict, their country of origin, and their creators wherever possible.

MARGESUM-MN: The FBI is trying to track the author of a file-destroying computer virus that has hit some of the nation's biggest corporations and infected tens of thousands of machines, the AP reports. According to South Africa's computer virus helpline co-ordinator Ian Melamed, Hare Krsna is at present the most widely reported computer virus in the country and could wipe out all information stored on the hard drives of infected computers on that date. The new virus, named W32/KRIZ.3862, is designed to damage Windows 95, Windows 98, and Windows NT machines. Xinhua reports that the Philippine government asked the congress Tuesday to pass a law that would impose stiff penalties on cyber crimes following a delay in the arrest of the suspected creators of the "love bug" computer virus. The virus, which was allegedly created by a Philippine couple using internet access cards, has caused up to 2.6 billion dollars in damage to computer systems in more than 20 countries in Asia, America, and Europe, an expert at Carnegie Mellon University tells the AP. Home-made computer viruses are not new in Bangladesh, as the country's first locally-produced virus, the Dhaka virus, was created in 1991. Experts say off-the-shelf antivirus software can prevent infection. The AP notes that the virus is only known to attack computers using Microsoft's Word 97 or Word 2000 programs, but experts say it's not as effective as the original.

MARGESUM-MN+Query: Here's what you need to know about computer viruses detected worldwide. Include such details as how they are spread, the operating systems they affect, their country of origin, and their creators of the virus ver possible. Click here to read the full list of computer viruses, including the number of cases they are spreading, the damage they inflict, their state of origin and their country's country of origins, and the amount of damage ver has done to computer systems in more than 20 countries in Asia, America, and Europe, the AP reports. The AP quotes an expert at Carnegie Mellon University as saying Hare Krsna is at present the most widely reported computer virus in the country and could wipe out all information stored on the hard drives of infected computers on Christmas day, a potential security breach that should worry businesses and governments, an expert said Saturday, per Xinhua, which notes that the FBI on Friday tried to track the author of a computer-destroying computer virus that has hit some of the nation's biggest corporations and infected tens of thousands of machines, per the AP. (Click to read about the Chernobyl virus, or to see how it could be used to wipe out information on hard drives without warning) ... But it's the most widespread computer virus ever reported in South Africa, according to South Africa's computer virus helpline co-ordinator Ian Melamed, who says the virus is "the most commonly reported" computer virus.

Table 9: System outputs for cluster D0629B in DUC 2006. We show three dataset-specific summary patterns of Multi-News that can harm QFS performance: (1) extra press information, (2) lazy query copy, and (3) hallucinated hyperlinks.
[Table 10: Human evaluation results on DUC (above) and TD-QFS (below): average Relevance, Succinctness, and Coherence ratings for GOLD, PQSUM-WSL, QUERYSUM, UNILM-CNNDM, and MARGESUM-CNNDM. †: significantly different from MARGESUM-CNNDM; ◦: significantly different from GOLD (using a pairwise t-test).]

Table 10 shows the human ratings for each system. Participants perceive MARGESUM-CNNDM as on par with PQSUM-WSL in query relevance and summary succinctness, while significantly better than PQSUM-WSL and QUERYSUM in coherence. In fact, participants find summaries from PQSUM-WSL as incoherent as those of the extractive system QUERYSUM, which is probably due to its generation pipeline: PQSUM-WSL first generates an abstractive summary for each document and then re-ranks the generated sentences, so the final summary sentences are less related to each other. Summaries from our system are also considered significantly more relevant than those of the baseline system UNILM-CNNDM. Compared to PQSUM-WSL, although UNILM-CNNDM is not good at producing relevant content, it maintains relatively higher coherence, demonstrating the effectiveness of training abstractive systems with synthetic data from SDS and generating long summaries at once.

Conclusion

In this work we proposed an abstractive framework for query focused summarization which requires no query-relevant resources. We provided a unified masked representation for summaries and queries, and a distantly supervised approach to train a masked semantic matching model for QFS with generic summarization data. Despite learning from minimal supervision, our model outperforms strong question answering models in evidence ranking. We also studied the incorporation of various controlling attributes into training an abstractive summarizer from generic data. Experimental results across datasets show that the proposed system yields state-of-the-art query focused summarization results in the distantly supervised setting, and produces more relevant and coherent summaries compared to existing systems.

References
Stefanos Angelidis and Mirella Lapata. 2018. Multiple instance learning networks for fine-grained sentiment analysis. Transactions of the Association for Computational Linguistics, 6:17–31.

Rama Badrinath, Suresh Venkatasubramaniyan, and C. E. Veni Madhavan. 2011. Improving query focused summarization using look-ahead strategy. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, pages 641–652, Dublin, Ireland.

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, pages 642–652, Online.

Tal Baumel, Raphael Cohen, and Michael Elhadad. 2016. Topic concentration in query focused summarization datasets. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2573–2579, Phoenix, Arizona.

Souradip Chakraborty, Ekaba Bisong, Shweta Bhatt, Thomas Wagner, Riley Elliott, and Francesco Mosconi. 2020. BioMedBERT: A pre-trained biomedical language model for QA and IR. In Proceedings of the 28th International Conference on Computational Linguistics, pages 669–679, Online.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1870–1879, Vancouver, Canada.

Hoa Trang Dang. 2005. Overview of DUC 2005. In Proceedings of the 2005 Document Understanding Conference, pages 1–12, Vancouver, Canada.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, pages 9–16, Jeju Island, Korea.

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2020. GSum: A general framework for guided neural abstractive summarization. arXiv preprint arXiv:2010.08014.

Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from Wikipedia. arXiv preprint arXiv:1805.05942.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy.

Angela Fan, David Grangier, and Michael Auli. 2018. Controllable abstractive summarization. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 45–54, Melbourne, Australia.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 55–64, Indianapolis, Indiana.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 1693–1701, Cambridge, MA, USA.

TD Hoa. 2006. Overview of DUC 2006. In Proceedings of the 2006 Document Understanding Conference, New York, USA.

Tom Hosking and Sebastian Riedel. 2019. Evaluating rewards for question generation models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2278–2283, Minneapolis, Minnesota.

Rahul Katragadda and Vasudeva Varma. 2009. Query-focused summaries or query-biased summaries? In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing, pages 105–108, Suntec, Singapore.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.

Md Tahmid Rahman Laskar, Enamul Hoque, and Jimmy Xiangji Huang. 2020. WSL-DS: Weakly supervised learning with distant supervision for query focused multi-document abstractive summarization. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5647–5654, Online.

Logan Lebanoff, Kaiqiang Song, and Fei Liu. 2018. Adapting the neural encoder-decoder framework from single to multi-document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4131–4141, Brussels, Belgium.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, Florence, Italy.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online.

Patrick Lewis, Ludovic Denoyer, and Sebastian Riedel. 2019. Unsupervised question answering by cloze translation. arXiv preprint arXiv:1906.04980.

Piji Li, Wai Lam, Lidong Bing, Weiwei Guo, and Hang Li. 2017a. Cascaded attention based unsupervised information distillation for compressive summarization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2081–2090, Copenhagen, Denmark.

Piji Li, Zihao Wang, Wai Lam, Zhaochun Ren, and Lidong Bing. 2017b. Salience estimation via variational auto-encoders for multi-document summarization. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3497–3503, San Francisco, California, USA.

Yang Liu and Mirella Lapata. 2019a. Hierarchical transformers for multi-document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5070–5081, Florence, Italy.

Yang Liu and Mirella Lapata. 2019b. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3730–3740, Hong Kong, China.

Preksha Nema, Mitesh M. Khapra, Anirban Laha, and Balaraman Ravindran. 2017. Diversity driven attention model for query-based abstractive summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1063–1072, Vancouver, Canada.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Laura Perez-Beltrachini, Yang Liu, and Mirella Lapata. 2019. Generating summaries with topic templates and structured convolutional decoders. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5107–5116, Florence, Italy.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 784–789, Melbourne, Australia.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas.

Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Atsushi Otsuka, Hisako Asano, Junji Tomita, Hiroyuki Shindo, and Yuji Matsumoto. 2020. Length-controllable abstractive summarization by guiding with summary prototype. arXiv preprint arXiv:2001.07331.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083, Vancouver, Canada.

Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 885–895, New Orleans, Louisiana.

Dan Su, Yan Xu, Genta Indra Winata, Peng Xu, Hyeondey Kim, Zihan Liu, and Pascale Fung. 2019. Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 203–211, Hong Kong, China.

Dan Su, Yan Xu, Tiezheng Yu, Farhad Bin Siddique, Elham Barezi, and Pascale Fung. 2020. CAiRE-COVID: A question answering and query-focused multi-document summarization system for COVID-19 scholarly information management. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online.

Wilson L. Taylor. 1953. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433.

Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 2903–2908, Hyderabad, India.

Xiaojun Wan and Jianmin Zhang. 2014. CTSUM: Extracting more certain summaries for news articles. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 787–796, New York, United States.

Zhengjue Wang, Zhibin Duan, Hao Zhang, Chaojie Wang, Long Tian, Bo Chen, and Mingyuan Zhou. 2020. Friendly topic assistant for transformer based abstractive summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 485–497, Online.

Yumo Xu and Mirella Lapata. 2019. Weakly supervised domain detection. Transactions of the Association for Computational Linguistics, 7:581–596.

Yumo Xu and Mirella Lapata. 2020. Coarse-to-fine query focused multi-document summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3632–3645, Online.

Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2013–2018, Lisbon, Portugal.

Xuchen Yao, Benjamin Van Durme, Chris Callison-Burch, and Peter Clark. 2013. Answer extraction as sequence tagging with tree edit distance. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 858–867, Atlanta, Georgia.

Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Saizheng Zhang, Sandeep Subramanian, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 15–25, Vancouver, Canada.

Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Adapting neural single-document summarization model for abstractive multi-document summarization: A pilot study. In Proceedings of the 11th International Conference on Natural Language Generation.