Variational Question-Answer Pair Generation for Machine Reading Comprehension
Kazutoshi Shinoda and Akiko Aizawa
The University of Tokyo / National Institute of Informatics
Abstract
We present a deep generative model of question-answer (QA) pairs for machine reading comprehension. We introduce two independent latent random variables into our model in order to diversify answers and questions separately. We also study the effect of explicitly controlling the KL term in the variational lower bound in order to avoid the "posterior collapse" issue, where the model ignores latent variables and generates QA pairs that are almost the same. Our experiments on SQuAD v1.1 showed that variational methods can aid QA pair modeling capacity, and that the controlled KL term can significantly improve diversity while generating high-quality questions and answers comparable to those of the existing systems.
Introduction

Machine reading comprehension has gained much attention in the NLP community; its goal is to devise systems that can answer questions about given documents (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017). To build such systems, a substantial number of question-answer (QA) pairs are needed to train neural network based models. However, the creation of QA pairs from unlabeled documents requires considerable manual effort. To alleviate this problem, there has been a resurgence of work on automatic QA pair generation for data augmentation (Yang et al., 2017a; Du and Cardie, 2018; Subramanian et al., 2018; Alberti et al., 2019; Wang et al., 2019).

When the answers are text spans in a given paragraph, QA pair generation systems have generally used a pipeline of answer extraction (AE) and question generation (QG) models. QG aims to generate questions from each paragraph or sentence. Du et al. (2017) first used sequence-to-sequence models for QG and improved the quality, replacing a rule-based method (Heilman and Smith, 2010).
Context: ... Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
Question-answer pairs:
What album made her a worldwide known artist? — Dangerously in Love
What was the first album Beyoncé released as a solo artist? — Dangerously in Love
What was the name of Beyoncé's first solo album? — Dangerously in Love
Table 1: Example of QA pairs with context in SQuAD v1.1 (Rajpurkar et al., 2016). Underlined text spans in the context are used as the gold answers. The listed QA pairs show the case in which multiple questions can be created from a single context-answer pair.

Following works used answers as additional input and showed that answers improve the quality of QG (Zhou et al., 2018; Kim et al., 2018; Zhao et al., 2018). Since answers are not available in the real case, AE has been studied in addition to QG. AE aims to extract question-worthy phrases from documents, which are defined by Subramanian et al. (2018) and Wang et al. (2019) as phrases that are worth being asked about. Subramanian et al. (2018) and Kumar et al. (2018) proposed to extract answer candidates from documents and to generate questions from the documents and the extracted answers. Similarly, Du and Cardie (2018) proposed to generate QA pairs whose answering requires coreference resolution. Moreover, Alberti et al. (2019) presented QA pair generation with roundtrip consistency, which filters out unanswerable QA pairs using BERT (Devlin et al., 2019).

However, to the best of our knowledge, the diversity of QA pairs has been less studied. For QG, a few studies focused on diversity (Yao et al., 2018; Bahuleyan et al., 2018). Namely, existing QA pair generation systems can only extract a fixed set of answer spans from each document. Since answers are important features for QG, the lack of diversity in answers should lead to a lack of diversity in questions. Here, we specifically focus on QA pair generation where AE and QG are distinct stochastic processes that generate diverse outputs. For example, as shown in Table 1, multiple answer candidates such as "2003" and "Dangerously in Love" can be extracted from the context about Beyoncé, and multiple questions can be created from the answer "Dangerously in Love".

[Figure 1: Graphical models of a pipeline model (a) and our Variational Question-Answer Pair Generative model (VQAG) (b). (c: context, a: answer, q: question, z and y: latent variables; solid: generative model, dashed: inference model)]

It is known that using a variational autoencoder (VAE) (Kingma and Welling, 2013) can diversify the generated text and generate unseen sentences from the latent space (Bowman et al., 2016). Moreover, a conditional VAE (CVAE) can generate not only diverse sentences but also condition them on additional variables (Zhao et al., 2017). Here, we conjecture that the CVAE framework may be suitable for QA pair generation conditioned on context. Therefore, we propose a variational QA pair generative model (VQAG). As shown in Figure 1, we introduce two independent latent random variables into our VQAG to model the two one-to-many problems, AE and QG, enabling us to diversify AE and QG separately. We also study the effect of controlling the KL term in the variational lower bound by introducing hyperparameters to mitigate the posterior collapse issue, where the model ignores latent variables and generates outputs that are almost the same.

We conducted experiments on three tasks, i.e., QA pair modeling, answer extraction, and answer-aware question generation, using SQuAD v1.1. QA pair modeling is our newly developed task that enables us to assess the distribution modeling capacity of QA pair generative models. Our qualitative analysis reveals that our model can generate reasonable QA pairs that are not close to the ground truths.

Contributions
Our main contributions are three-fold: (1) We propose a Variational Question-Answer Pair Generative model (VQAG) with two independent latent random variables for modeling the diversity of AE and QG separately. To the best of our knowledge, our work is the first to introduce variational methods for both AE and QG jointly. (2) We develop the QA pair modeling task and show that our variational model achieves better modeling capacity than a non-stochastic model in terms of the negative log likelihood. (3) We show that explicitly controlling the KL term in the variational lower bound objective can avoid the posterior collapse issue. Our model with the controlled KL value significantly improves diversity while generating high-quality questions and answers comparable or superior to those of the existing systems for AE and QG.
Related Work

Answer Extraction

Answer extraction (AE) can be performed in mainly three ways: 1) using linguistic knowledge, 2) sequence labeling, and 3) using a pointer network.

Yang et al. (2017a) extracted candidate phrases using rule-based methods such as a part-of-speech tagger, a simple constituency parser, and a named entity recognizer (NER). However, in the SQuAD dataset, not all the named entities, noun phrases, verb phrases, adjectives, or clauses are used as gold answer spans, so these rule-based methods are likely to extract many trivial phrases.

Therefore, there have been studies on training neural models to identify question-worthy phrases. Subramanian et al. (2018) treated the positions of answers as a sequence and used a pointer network (Vinyals et al., 2015). Du and Cardie (2018) framed the AE problem as a sequence labeling task and used BiLSTM-CRF (Huang et al., 2015) with NER features as additional inputs. Wang et al. (2019) used a pointer network and Match-LSTM (Wang and Jiang, 2016, 2017) to interact with the question generation module. Alberti et al. (2019) made use of pretrained BERT (Devlin et al., 2019) for AE.

Note that these current AE models are deterministic, i.e., their output is static when the input is fixed. As far as we know, our work is the first to introduce a pointer network incorporating a latent random variable. In this paper, we assume that the answer spans used in the SQuAD dataset are question-worthy, but there should also be question-worthy phrases that are not used as gold answer spans in the dataset.
Question Generation

Traditionally, question generation (QG) was studied using rule-based methods (Mostow and Chen, 2009; Heilman and Smith, 2010; Lindberg et al., 2013; Labutov et al., 2015). These rule-based methods use only the syntactic roles of words. Since Du et al. (2017) proposed a neural sequence-to-sequence model (Sutskever et al., 2014) for QG and improved its BLEU scores compared to rule-based methods, neural models that take context and answer as inputs have been used to improve question quality with attention (Bahdanau et al., 2014) and copying (Gulcehre et al., 2016; Gu et al., 2016) mechanisms. Most works focused on generating relevant questions from answer-context pairs (Zhou et al., 2018; Song et al., 2018; Zhao et al., 2018; Sun et al., 2018; Kim et al., 2018; Harrison and Walker, 2018; Liu et al., 2019; Qiu and Xiong, 2019; Zhang and Bansal, 2019; Scialom et al., 2019). These works showed the importance of answers as input features for question generation. Other works studied predicting question types (Zhou et al., 2019; Kang et al., 2019), modeling the structured answer-relevant relation (Li et al., 2019), and refining generated questions (Nema et al., 2019). To further improve question quality, policy gradient techniques have been used (Yuan et al., 2017; Yang et al., 2017a; Yao et al., 2018; Kumar et al., 2018). Dong et al. (2019) used a pretrained language model. While the above QG models do not handle cases in which multiple questions can be created from a single context-answer pair, the diversity of questions has been tackled using variational attention (Bahuleyan et al., 2018) or the CVAE (Yao et al., 2018).

Our work differs from these works in that we study QA pair generation by introducing variational methods into both AE and QG, and in that we evaluate the diversity and modeling capacity of our model. Further, better QA pair generative models are needed not only for data augmentation but also for directly applying them to question answering. Lewis and Fan (2019) proposed to perform question answering tasks by reformulating them as $a = \arg\max_a p(q, a \mid c) = \arg\max_a p(q \mid a, c)\, p(a \mid c)$, and showed that the reformulation helped to mitigate the superficial understanding problems of machine reading comprehension (Weissenborn et al., 2017).

Variational Autoencoders

The VAE (Kingma and Welling, 2013) is a popular deep generative model. It consists of a neural encoder (inference model) and a decoder (generative model). The encoder learns to map from an observed variable, $x$, to a latent variable, $z$, and the decoder works vice versa. Neural approximation and reparameterization techniques of the VAE have been applied to NLP tasks such as text generation (Bowman et al., 2016), machine translation (Zhang et al., 2016), and sequence labeling (Chen et al., 2018).

The CVAE is an extension of the VAE in which the prior distribution of a latent variable is explicitly conditioned on certain variables, which enables generation processes to be more diverse than with a VAE (Li et al., 2018; Zhao et al., 2017; Shen et al., 2017). The CVAE is trained by maximizing the following variational lower bound:

$\log p_\theta(x \mid c) \geq \mathbb{E}_{z \sim q_\phi(z \mid x, c)}[\log p_\theta(x \mid z, c)] - D_{\mathrm{KL}}(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c))$   (1)

where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence, $c$ is the condition, and $\theta$ ($\phi$) are the parameters of the generative (inference) model, parameterized by neural networks.
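Since both distributions in the KL term are later parameterized as diagonal Gaussians (see the Prior and Posterior Distributions section below), the KL divergence can be computed in closed form. The paper does not write out this standard identity, but it is:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_q, \mathrm{diag}(\sigma_q^2)) \,\big\|\, \mathcal{N}(\mu_p, \mathrm{diag}(\sigma_p^2))\big) = \frac{1}{2} \sum_{i=1}^{d} \left( \log \frac{\sigma_{p,i}^2}{\sigma_{q,i}^2} + \frac{\sigma_{q,i}^2 + (\mu_{q,i} - \mu_{p,i})^2}{\sigma_{p,i}^2} - 1 \right)$$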
Proposed Method

Problem Definition

Here, the problem is to generate QA pairs from contexts (documents). We focus on the case in which an answer is a text span in the context. We use $c$, $q$, and $a$ to represent the context, question, and answer, respectively. We assume that every QA pair is sampled independently given a context. Thus, the problem is defined as maximizing the following conditional log likelihood:

$\log \prod_{k=1}^{N} p(q^k, a^k \mid c^k) = \sum_{k=1}^{N} \log p(q^k, a^k \mid c^k)$

where $N$ is the size of the training, development, or test set. For simplicity, we omit the superscript $k$ in the following sections.

Variational Lower Bound

Because questions and answers are different types of observed variables, embedding QA pairs into different latent spaces may be suitable. For example, different questions can correspond to the same answer (Table 1). Thus, we introduce two independent latent random variables to assign the roles of diversifying AE and QG to $z$ and $y$, respectively (see Figure 1 (b)). The variational lower bound of our VQAG is as follows:

$\log p_\theta(q, a \mid c) \geq \mathbb{E}_{z, y \sim q_\phi(z, y \mid q, a, c)}[\log p_\theta(q \mid y, a, c) + \log p_\theta(a \mid z, c)] - D_{\mathrm{KL}}(q_\phi(z \mid a, c) \,\|\, p_\theta(z \mid c)) - D_{\mathrm{KL}}(q_\phi(y \mid q, c) \,\|\, p_\theta(y \mid c)).$   (2)

See Appendix A for the derivation of Eq. 2.

VAEs often suffer from "posterior collapse", where the model learns to ignore latent variables and generates outputs that are almost the same. This problem occurs especially when VAEs are used for modeling discrete data and are implemented with strong decoders such as LSTMs (Bowman et al., 2016). Many approaches have been proposed to mitigate this issue, such as weakening the generators (Bowman et al., 2016; Yang et al., 2017b; Semeniuta et al., 2017) or modifying the objective functions to control the KL term (Tolstikhin et al., 2018; Zhao et al., 2017; Higgins et al., 2017).

We also observe that this issue occurs when implementing our model according to Ineq. 2. To mitigate this problem, inspired by Prokhorov et al. (2019), we use the modified β-VAE (Higgins et al., 2017) proposed by Burgess et al. (2018), which uses two hyperparameters to control the KL terms. Our modified variational lower bound is as follows:

$\log p_\theta(q, a \mid c) \geq \mathbb{E}_{z, y \sim q_\phi(z, y \mid q, a, c)}[\log p_\theta(q \mid y, a, c) + \log p_\theta(a \mid z, c)] - \beta \,|D_{\mathrm{KL}}(q_\phi(z \mid a, c) \,\|\, p_\theta(z \mid c)) - C| - \beta \,|D_{\mathrm{KL}}(q_\phi(y \mid q, c) \,\|\, p_\theta(y \mid c)) - C|,$   (3)

where $\beta > 0$ and $C \geq 0$. We use the same $\beta$ and $C$ for the two KL terms for simplicity. In this paper, we set $\beta = 1$ and change only $C$ because $C$ was enough to regularize the KL terms in our case (see Table 2).

[Figure 2: Overview of the model architecture. Each module with its input and output is shown. Note that the latent variables z and y are sampled from the posteriors when computing the variational lower bound and from the priors during generation.]

Model Architecture

An overview of our VQAG is given in Figure 2. We describe the details of each module below. Here, we denote $c = \{c_t\}_{t=1}^{L_C}$, $q = \{q_t\}_{t=1}^{L_Q}$, and $a = \{a_t\}_{t=1}^{L_A} = \{c_t\}_{t=\mathrm{start}}^{\mathrm{end}}$, where each element represents one word, and $L_C$, $L_Q$, and $L_A$ are, respectively, the lengths of the context, question, and answer span.
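Before describing the modules, here is a minimal PyTorch sketch of the training loss implied by Eq. 3; the function name and tensor interface are our own illustrative assumptions, not the authors' released code:

```python
import torch

def vqag_loss(nll_q, nll_a, kl_z, kl_y, beta=1.0, C=5.0):
    """Negative of the modified variational lower bound in Eq. 3.

    nll_q, nll_a: per-example reconstruction losses -log p(q|y,a,c) and
    -log p(a|z,c), with z and y sampled from the approximate posteriors.
    kl_z, kl_y: closed-form KL divergences between the posteriors and
    priors of z and y. All tensors have shape (batch,).
    """
    # |KL - C| pulls each KL term toward the target C rather than 0,
    # which keeps the latent variables informative (no posterior collapse).
    regularizer = beta * (kl_z - C).abs() + beta * (kl_y - C).abs()
    return (nll_q + nll_a + regularizer).mean()
```

With β = 1 and C = 0, this reduces to the negative of Eq. 2, since the KL terms are non-negative.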
Embedding and Contextual Embedding Layer

First, in the embedding layer, the $i$-th word, $w_i$, of a sequence of length $L$ is simultaneously converted into word- and character-level embedding vectors, $e^w_i$ and $e^c_i$, where the character-level embeddings are obtained with a convolutional neural network (CNN) based on Kim (2014). Then, $e^w_i$ and $e^c_i$ are concatenated across columns and $e_i = [e^w_i; e^c_i]$ is obtained.

After that, we pass the embedding vectors to the contextual embedding layer as follows:

$H, h = \mathrm{BiLSTM}([e_1^T; e_2^T; \ldots; e_L^T])$   (4)

where $H \in \mathbb{R}^{L \times d}$ is the concatenation of the outputs of the LSTMs (Hochreiter and Schmidhuber, 1997) in each direction at each time step, $e^T$ denotes the transpose of $e$, and $h \in \mathbb{R}^d$ is the concatenation of the last hidden state vectors of the LSTMs in each direction. This bidirectional LSTM (BiLSTM) encoder is shared by the AE and QG tasks. The outputs carry superscripts, $H^C$, $h^C$, $H^Q$, $h^Q$, $H^A$, and $h^A$, to indicate where they come from; i.e., $C$, $Q$, and $A$ denote the context, question, and answer, respectively.
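A sketch of this shared encoder, assuming a single-layer bidirectional LSTM (the module and argument names are ours):

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Shared BiLSTM contextual embedding layer (Eq. 4). The word- and
    character-level embeddings are assumed to be already concatenated."""

    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, e):                        # e: (batch, L, emb_dim)
        H, (h_n, _) = self.lstm(e)               # H: (batch, L, 2*hidden_dim)
        # h: last hidden states of the forward and backward LSTMs, concatenated
        h = torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, 2*hidden_dim)
        return H, h
```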
Prior and Posterior Distributions

Following Zhao et al. (2017), we hypothesize that the prior and posterior distributions of the latent variables are multivariate Gaussians with diagonal covariance:

$z \mid a, c \sim \mathcal{N}(\mu^{post}_Z, \mathrm{diag}(\sigma^{post}_Z))$
$z \mid c \sim \mathcal{N}(\mu^{prior}_Z, \mathrm{diag}(\sigma^{prior}_Z))$
$y \mid q, c \sim \mathcal{N}(\mu^{post}_Y, \mathrm{diag}(\sigma^{post}_Y))$
$y \mid c \sim \mathcal{N}(\mu^{prior}_Y, \mathrm{diag}(\sigma^{prior}_Y))$

The parameters of these distributions are computed as follows:

$[\mu^{post}_Z; \log(\sigma^{post}_Z)] = W^{post}_Z [h^C; h^A] + b^{post}_Z$
$[\mu^{prior}_Z; \log(\sigma^{prior}_Z)] = W^{prior}_Z h^C + b^{prior}_Z$
$[\mu^{post}_Y; \log(\sigma^{post}_Y)] = W^{post}_Y [h^C; h^Q] + b^{post}_Y$
$[\mu^{prior}_Y; \log(\sigma^{prior}_Y)] = W^{prior}_Y h^C + b^{prior}_Y$

Then, the latent variable $z$ (and likewise $y$) is obtained using the reparameterization trick (Kingma and Welling, 2013): $z = \mu + \sigma \odot \epsilon$, where $\odot$ denotes the Hadamard product and $\epsilon \sim \mathcal{N}(0, I)$. Finally, $z$ and $y$ are passed to the AE and QG models, respectively.
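A minimal sketch of the Gaussian parameterization and the reparameterized sampling; the class and function names are illustrative:

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps a condition vector to (mu, log sigma) of a diagonal Gaussian,
    as done for the priors and posteriors of z and y."""

    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, 2 * latent_dim)  # [mu; log sigma] = W x + b

    def forward(self, cond):
        mu, log_sigma = self.proj(cond).chunk(2, dim=-1)
        return mu, log_sigma

def reparameterize(mu, log_sigma):
    """z = mu + sigma * eps with eps ~ N(0, I): a differentiable sample."""
    eps = torch.randn_like(mu)
    return mu + log_sigma.exp() * eps

# The posterior of z conditions on [h_C; h_A]; its prior conditions on h_C alone,
# e.g., mu, log_sigma = posterior_head(torch.cat([h_C, h_A], dim=-1))
```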
Answer Extraction Model

We regard answer extraction as two-step sequential decoding, i.e.,

$p(a \mid c) = p(c_{\mathrm{end}} \mid c_{\mathrm{start}}, c)\, p(c_{\mathrm{start}} \mid c),$   (5)

which predicts the start and end positions of an answer span in this order. For AE, we modify a pointer network (Vinyals et al., 2015) to take the initial hidden state $h^{AE}_0 = W_0 z + b_0$, which in the end diversifies AE by enabling the mappings from $z$ to $a$ to be learned. The decoding process is as follows:

$h^{IN}_i = e(\Rightarrow)$ if $i = 1$, and $H^C_{t_{i-1}}$ if $i = 2$
$h^{AE}_i = \mathrm{LSTM}(h^{AE}_{i-1}, h^{IN}_i)$
$u^{AE}_{ij} = (v^{AE})^T \tanh(W_1 H^C_j + W_2 h^{AE}_i + b_1)$
$p(c_{t_i} \mid c_{t_{i-1}}, c) = \mathrm{softmax}(u^{AE}_i)$

where $1 \leq i \leq 2$, $1 \leq j \leq L_C$, $h^{AE}_i$ is the hidden state vector of the LSTM, $h^{IN}_i$ is the $i$-th input, $t_i$ denotes the start ($i = 1$) or end ($i = 2$) position in $c$, and $v^{AE}$, $W_n$, and $b_n$ are learnable parameters. We learn the embedding of the special token "$\Rightarrow$" as the initial input $h^{IN}_1$.

When we used the embedding vector $e_{t_i}$ as $h^{IN}_{i+1}$ instead of $H^C_{t_i}$, following Subramanian et al. (2018), we observed that the extracted spans tended to be long and unreasonable. We assume that this is because the decoder cannot get positional information from the input at each step.
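A sketch of this two-step pointer decoder; greedy decoding is shown, while sampling z from the prior (and optionally sampling positions) yields diverse spans. The dimension choices, the zero-initialized cell state, and all names are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPointer(nn.Module):
    """Two-step pointer decoder for Eq. 5: pick a start position, then an
    end position, with the latent variable z injected via the initial
    hidden state."""

    def __init__(self, d, latent_dim):
        super().__init__()
        self.init_h = nn.Linear(latent_dim, d)         # h0 = W z + b
        self.cell = nn.LSTMCell(d, d)
        self.W_enc = nn.Linear(d, d, bias=False)
        self.W_dec = nn.Linear(d, d)
        self.v = nn.Linear(d, 1, bias=False)
        self.start_tok = nn.Parameter(torch.randn(d))  # embedding of "=>"

    def forward(self, H_c, z):                         # H_c: (batch, L_c, d)
        batch = H_c.size(0)
        h = self.init_h(z)
        c = torch.zeros_like(h)
        inp = self.start_tok.unsqueeze(0).expand(batch, -1)
        positions = []
        for step in range(2):                          # step 0: start, 1: end
            h, c = self.cell(inp, (h, c))
            scores = self.v(torch.tanh(self.W_enc(H_c)
                                       + self.W_dec(h).unsqueeze(1))).squeeze(-1)
            probs = F.softmax(scores, dim=-1)          # over context positions
            idx = probs.argmax(dim=-1)                 # greedy; sample to diversify
            positions.append(idx)
            # Feed the encoder state at the chosen position, not its word
            # embedding, so the decoder keeps positional information.
            inp = H_c[torch.arange(batch), idx]
        return positions                               # [start, end] indices
```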
Answer-aware Context Encoder

To compute answer-aware context information for QG, we use another BiLSTM:

$H^{CA}, h^{CA} = \mathrm{BiLSTM}([H^C, o^{\mathrm{start}}, o^{\mathrm{end}}])$   (6)

where $o^{\mathrm{start}}$ and $o^{\mathrm{end}} \in \mathbb{R}^{L_C}$ are the one-hot vectors of the start and end positions of an answer span. $H^{CA} \in \mathbb{R}^{L_C \times d}$ is used as the source for attention and copying in question generation ($h^{CA} \in \mathbb{R}^d$).
Question Generation Model

For QG, we modify an LSTM decoder with attention and copying mechanisms to take the initial hidden state $h^{QG}_0 = W_3 y + b_3$ as input, which diversifies QG. In detail, at each time step, the probability distribution over words generated from the vocabulary using attention (Bahdanau et al., 2014) is computed as:

$h^{QG}_i = \mathrm{LSTM}(h^{QG}_{i-1}, q_{i-1})$
$u^{att}_{ij} = (v^{att})^T \tanh(W_4 h^{QG}_i + W_5 H^{CA}_j + b_4)$
$a^{att}_i = \mathrm{softmax}(u^{att}_i)$
$\hat{h}_i = \sum_j a^{att}_{ij} H^{CA}_j$
$\tilde{h}_i = \tanh(W_6 [\hat{h}_i; h^{QG}_i] + b_6)$
$P_{vocab} = \mathrm{softmax}(W_7 \tilde{h}_i + b_7)$

and the probability distribution for copying (Gulcehre et al., 2016; Gu et al., 2016) from the context is computed as:

$u^{copy}_{ij} = (v^{copy})^T \tanh(W_8 h^{QG}_i + W_9 H^{CA}_j + b_8)$
$a^{copy}_i = \mathrm{softmax}(u^{copy}_i)$

Accordingly, the probability of outputting $q_i$ is:

$p_g = \sigma(W_{10} h^{QG}_i)$
$p(q_i \mid q_{<i}, a, c) = p_g P_{vocab}(q_i) + (1 - p_g) \sum_{j: c_j = q_i} a^{copy}_{ij}$

where $\sigma$ is the sigmoid function.
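The final mixture can be implemented with a scatter-add over the vocabulary. This sketch assumes the attention and gate tensors have already been computed as above; the names are ours:

```python
import torch

def output_distribution(p_vocab, copy_attn, context_ids, p_gen):
    """One decoding step of the generate/copy mixture.

    p_vocab:     (batch, V) softmax over the vocabulary
    copy_attn:   (batch, L_c) copy attention over context positions
    context_ids: (batch, L_c) vocabulary ids of the context words
    p_gen:       (batch, 1) gate p_g = sigmoid(W h) in [0, 1]
    """
    gen = p_gen * p_vocab
    # Scatter-add copy mass onto the ids of the copied words; this realizes
    # (1 - p_g) * sum_{j: c_j = q_i} a_ij^copy from the equation above.
    copy = torch.zeros_like(p_vocab).scatter_add_(
        1, context_ids, (1.0 - p_gen) * copy_attn)
    return gen + copy
```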
Experiments & Results

See Appendix B for the training details.
Dataset

We used SQuAD v1.1 (Rajpurkar et al., 2016), a large QA pair dataset consisting of documents collected from Wikipedia and 100k QA pairs created by crowdworkers. Each question in SQuAD can be answered by a text span in a context. Since the SQuAD test set has not been released, we split the dataset following Du et al. (2017): the original training set is split into training and development sets, and the original development set is used as the test set. The resulting training, development, and test sets contain 70,484, 10,570, and 11,877 examples, respectively.
QA Pair Modeling

[Table 2: flattened in the source. Columns: NLL, NLL_a, NLL_q, D_KL(z), D_KL(y); rows: Pipeline (36.26, 3.99, 32.50, -, -) and VQAG with C = 0, 5, 20, and 100, whose values were not recovered.]

Table 2: QA pair modeling capacity measured on the test set. NLL: negative log likelihood ($-\log p(q, a \mid c)$); NLL_a = $-\log p(a \mid c)$; NLL_q = $-\log p(q \mid a, c)$. D_KL(z) and D_KL(y) are the Kullback–Leibler divergences between the approximate posterior and the prior of the latent variables z and y. The lower the NLL, the higher the probability the model assigns to the test set. The NLL of our models is estimated with importance sampling using 300 samples.

[Table 3: flattened in the source. Columns: Relevance (Precision and Recall, each measured as Prop. and Exact) and Diversity (Dist); rows: NER (34.44, 19.61, 64.60, 45.39, 30.0k), BiLSTM-CRF w/ char w/ NER (2018) (45.96, 33.90, 41.05, 28.37, -), and VQAG with C = 0, 5, 20, and 100, whose values were not recovered.]

Table 3: Results for answer extraction on the test set. For all the metrics, higher is better.

We originally developed the QA pair modeling task to evaluate QA pair generative models. We compared models on the basis of the probability they assign to the ground-truth QA pairs. We chose the negative log likelihood (NLL) of QA pairs as the metric, namely $-\frac{1}{N} \sum_{k=1}^{N} \log p(q^k, a^k \mid c^k)$. Since variational models cannot compute the NLL directly, we estimate it with importance sampling. We also estimate each term of the decomposed NLL, i.e., $\mathrm{NLL} = \mathrm{NLL}_a + \mathrm{NLL}_q = -\log p(a \mid c) - \log p(q \mid a, c)$. The better a model performs on this task, the better it fits the test set. As a baseline, to assess the effect of incorporating latent random variables, we implemented a pipeline model similar to Subramanian et al. (2018), eliminating all the architecture related to latent random variables from our model and treating the sequence of the start and end positions of all the possible answers in a context as the output of AE.
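The importance-sampling estimate can be written compactly. Here is a sketch for a single example and a single latent variable; for VQAG, the same estimate applies with both z and y sampled from their posteriors and their log-densities summed into the weights:

```python
import torch

def importance_sampled_nll(log_px_given_z, log_prior, log_posterior):
    """Estimate -log p(x|c) with K importance samples z_k ~ q(z|x,c).

    Each argument has shape (K,): log p(x|z_k,c), log p(z_k|c), and
    log q(z_k|x,c). The paper uses K = 300 samples per example.
    """
    K = log_px_given_z.size(0)
    log_w = log_px_given_z + log_prior - log_posterior  # log importance weights
    # log p(x|c) ~= logsumexp_k(log_w_k) - log K
    log_px = torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(K)))
    return -log_px
```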
Result

Table 2 shows the results of QA pair modeling. First, our models with C = 0 are superior to the pipeline model, which means that introducing latent random variables aids QA pair modeling capacity. However, the KL terms converge to zero with C = 0; in the other tasks, we show that our model with C = 0 collapses into a deterministic model. The fact that NLL_a is consistently lower than NLL_q is due to the decompositions $p(a \mid c) = p(c_{\mathrm{end}} \mid c_{\mathrm{start}}, c)\, p(c_{\mathrm{start}} \mid c)$ and $p(q \mid a, c) = \prod_i p(q_i \mid q_{<i}, a, c)$, which are sensitive to the sequence length. Also, we observe that the hyperparameter C can control the KL values, showing the potential to avoid the posterior collapse issue in our case. When we set C > 0, the KL values are greater than 0, which implies that the latent variables carry non-trivial information about questions and answers.

Answer Extraction

Inputs were the contexts, and outputs were a set of multiple answer spans. Following Du and Cardie (2018), to measure the accuracy of multiple phrases, we computed
Proportional Overlap and Exact Match metrics (Breck et al., 2007; Johansson and Moschitti, 2010) for each pair of a predicted answer and a ground truth. Proportional Overlap returns scores proportional to the amount of overlap. We report the precision and recall with respect to the above metrics. We exclude Binary Overlap because, as Breck et al. (2007) discussed, Binary Overlap assigns high scores to systems that extract the entire input context, and is therefore not a reliable metric.

[Table 4: flattened in the source. Upper part, one question per answer-context pair: ELMo+QPP&QAP (2019) w/Beam10 scored B1 48.39, B2 32.71, B3 24.13, B4 18.34, ME 24.82, RL 46.66, Token 133.2k, D1 10.1k, D2 45.8k, E4 15.75, SB4 -; w/DivBeam50 scored B1 48.59, B2 32.83, B3 24.21, B4 18.40, ME 24.86, RL 46.66, Token 133.8k, D1 10.2k, D2 46.4k, E4 15.78, SB4 -. Lower part, 50 questions per pair (recall metrics): ELMo+QPP&QAP (2019) w/DivBeam50 and VQAG with C = 0, 5, 20, and 100, whose values were not recovered.]

Table 4: Results for answer-aware question generation on the test set of Du et al. (2017)'s split of SQuAD. Paragraph-level contexts and answer spans are used as input. Bn: BLEU-n, ME: METEOR, RL: ROUGE-L, Token: the total number of generated words, Dn: Dist-n, E4: Ent-4 (entropy of 4-grams), SB4: Self-BLEU-4. "-R" denotes recall (e.g., B1-R is the recall of BLEU-1). One question per answer-context pair is evaluated in the upper part, while 50 questions per answer-context pair are evaluated in the lower part to assess their diversity.

Our models differ from the existing models in that they can generate an arbitrary number of samples and improve diversity. For comparison, we had our models extract a total of 50 answer spans from each context to assess their diversity and quality, while the existing models can extract only a fixed set of answer spans. To measure the diversity of the predicted answer spans, we calculated the Dist score as the total number of distinct spans.

For AE, we adopted two baselines: named entity recognition (NER) and BiLSTM-CRF w/ char w/ NER (Du and Cardie, 2018). For NER, we used spaCy. For BiLSTM-CRF w/ char w/ NER, we directly copied the scores from Du and Cardie (2018).
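For concreteness, a sketch of span-level Proportional Overlap precision and recall follows; it reflects our reading of the cited metrics, and the paper's exact aggregation protocol may differ in detail:

```python
def span_overlap(a, b):
    """Number of shared token positions between spans a = (start, end)
    and b, with inclusive boundaries."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def proportional_scores(preds, golds):
    """Proportional Overlap precision/recall over two sets of spans."""
    def length(s):
        return s[1] - s[0] + 1
    # Precision: fraction of each predicted span covered by its best gold span.
    precision = sum(max((span_overlap(p, g) for g in golds), default=0) / length(p)
                    for p in preds) / max(len(preds), 1)
    # Recall: fraction of each gold span covered by its best predicted span.
    recall = sum(max((span_overlap(g, p) for p in preds), default=0) / length(g)
                 for g in golds) / max(len(golds), 1)
    return precision, recall
```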
Result

Table 3 shows the results. Our model with C = 5 performed the best in terms of the recall scores while surpassing NER in terms of diversity. From the viewpoint of diversity, C = 20 is the best setting. However, high Dist scores do not occur together with high recall scores. This observation shows the trade-off between diversity and quality. In this task, we show that our model with C = 5 can cover most of the human-created answers and also extract more diverse answers than the baselines. However, when C = 0, the Dist score is fairly low. This implies the posterior collapse issue, though the precision scores are the best. While our models with C ≥ 5 had low precision, this was due to the diversity of the extracted answers: if diversity is improved, answer spans that are not treated as ground truths will be extracted. Since even the test set does not cover all the possible answer spans, we assert that low precision scores do not necessarily mean poor performance.

Question Generation

The inputs were the contexts and gold answer spans. To see how well our models could generate diverse questions, we had them generate a total of 50 questions from each context-answer pair. We calculated the BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and ROUGE-L (Lin, 2004) scores, and report the recall scores per reference question. Since our motivation is to improve diversity, precision metrics are not appropriate in our setting; thus, we do not report precision scores here.
To measure diversity, we computed Dist-n, Ent-n (Serban et al., 2017; Zhang et al., 2018), and Self-BLEU (Zhu et al., 2018). Ent-n is the entropy (in bits) of n-grams, and it reflects how evenly n-grams are generated. Self-BLEU evaluates the degree to which sentences generated by a system resemble each other; we calculated Self-BLEU scores for the 50 questions generated from each context-answer pair and averaged them. We computed Dist-n following the definition of Xu et al. (2018), wherein Dist-n is the number of distinct n-grams. (Dist-n is often defined as the ratio of distinct n-grams (Li et al., 2016), but this is not fair when the number of generated sentences differs among models, so we did not use that definition.) We also report the total number of generated words for reference.

For QG, we compared our models with the ELMo+QPP&QAP model (Zhang and Bansal, 2019), which achieved the state of the art in SQuAD QG.
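A sketch of Dist-n and Ent-n over tokenized questions (each question a list of tokens); Self-BLEU is omitted, since it simply computes the BLEU of each generated question against the other generated questions and averages the scores:

```python
from collections import Counter
import math

def dist_n(questions, n):
    """Dist-n as used above: the number of distinct n-grams across all
    generated questions (Xu et al., 2018), not the distinct/total ratio."""
    ngrams = set()
    for tokens in questions:
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams)

def ent_n(questions, n):
    """Ent-n: entropy (in bits) of the n-gram distribution; higher values
    mean n-grams are generated more evenly."""
    counts = Counter(tuple(tokens[i:i + n])
                     for tokens in questions
                     for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```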
[Table 5 context, with the original heatmap shading and gold-answer markers lost in extraction: "beyoncé's vocal range spans four octaves. jody rosen highlights her tone and timbre as particularly distinctive, describing her voice as 'one of the most compelling instruments in popular music'. while another critic says she is a 'vocal acrobat, being able to sing long and complex melismas and vocal runs effortlessly, and in key.' her vocal abilities mean she is identified as the centerpiece of destiny's child. the daily mail calls beyoncé's voice 'versatile', capable of exploring power ballads, soul, rock belting, operatic flourishes, and hip hop. jon pareles of the new york times commented that her voice is 'velvety yet tart, with an insistent flutter and reserves of soul belting'. rosen notes that the hip hop era highly influenced beyoncé's strange rhythmic vocal style, but also finds her quite traditionalist in her use of balladry, gospel and falsetto. other critics praise her range and power, with chris richards of the washington post saying she was 'capable of punctuating any beat with goose-bump-inducing whispers or full-bore diva-roars.'"]

Table 5: Heatmap of 250 answer spans extracted using our VQAG (C = 5), the best performing model in terms of recall of Exact Match (see Table 3). In the original, the darker the color, the more often the word is extracted, and the marked phrases are the ground truth answers of SQuAD.
[Table 6: flattened in the source; its four columns (C = 0, 5, 20, 100) could not be reliably realigned. Surviving example QA pairs include "which vocal range ? — four", "how can one find her vocal abilities in key music ? — she is identified as the centerpiece of destiny 's child", "how many octaves is beyoncé 's vocal range spans four octaves ? — spans four", and "how many power ballads are used by chris richards ? — the daily mail calls beyoncé 's voice 'versatile'".]
Table 6: Examples of QA pairs generated with our model. The input context is the same as the one in Table 5.
Since diversity metrics were not reported in that paper, we reran the model, which is publicly available (https://github.com/ZhangShiyue/QGforQA). In addition, to compare our models with the baseline under an equivalent condition, we also reran the ELMo+QPP&QAP model with diverse beam search (Li et al., 2016), kept the top 50 questions per answer, and used them to calculate the metrics.

Result
Table 4 shows the results of QG. The recall scores of our model with C = 20 were comparable to the scores of ELMo+QPP&QAP w/Beam10 and w/DivBeam50. Though ELMo+QPP&QAP w/DivBeam50 is superior in terms of the recall of the relevance scores, our models perform significantly better in terms of the diversity scores. This shows that our model can improve diversity while generating high-quality questions. Among the various settings of C, 20 is suitable based on this result.

Qualitative Analysis

Since it is hard to evaluate generated QA pairs that are valid but not close to the ground truths, we analyze the generated questions and answers qualitatively.

Table 5 shows the example answers extracted by our model and the gold answers of SQuAD. Our model extracts every gold answer of SQuAD at least once. Moreover, there are answers extracted by our model that are not used in SQuAD but are question-worthy. For example, "jon pareles" and "one of the most compelling instruments in popular music" are question-worthy because they are related to the main topic, Beyoncé. Note that our model can extract not only named entities but also phrases of other types, as in this example.

Table 6 shows some examples of QA pairs generated under the various settings of C. The examples with C = 5 seem the most reasonable and diverse. When
C = 0, the generated QA pairs are reasonable but lack diversity, suffering from posterior collapse. When C = 100, the generated QA pairs are diverse but not reasonable. From this result, finding an appropriate value of C is necessary.
Conclusion

We designed a variational QA pair generative model consisting of two independent latent random variables. We showed that explicitly controlling the KL term could either enable our model to perform well in distribution modeling (C = 0) or avoid posterior collapse and improve diversity and recall-oriented relevance scores (C > 0). However, how to find the optimal C is non-trivial.

Acknowledgments

We would like to thank Saku Sugawara at the National Institute of Informatics for his valuable support. This work was supported by NEDO SIP-2 "Big-data and AI-enabled Cyberspace Technologies".
References
Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. arXiv e-prints, arXiv:1906.05416.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, and Pascal Poupart. 2018. Variational attention for sequence-to-sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1672–1682. Association for Computational Linguistics.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.

Eric Breck, Yejin Choi, and Claire Cardie. 2007. Identifying expressions of opinion in context. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 2683–2688, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. 2018. Understanding disentangling in β-VAE. arXiv e-prints, arXiv:1804.03599.

Mingda Chen, Qingming Tang, Karen Livescu, and Kevin Gimpel. 2018. Variational sequential labelers for semi-supervised learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 215–226, Brussels, Belgium. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. CoRR, abs/1905.03197.

Xinya Du and Claire Cardie. 2018. Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1907–1917. Association for Computational Linguistics.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352. Association for Computational Linguistics.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149. Association for Computational Linguistics.

Vrindavan Harrison and Marilyn Walker. 2018. Neural generation of diverse questions using answer focus, contextual and linguistic features. In Proceedings of the 11th International Conference on Natural Language Generation, pages 296–306. Association for Computational Linguistics.

Michael Heilman and Noah A. Smith. 2010. Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 609–617. Association for Computational Linguistics.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

Richard Johansson and Alessandro Moschitti. 2010. Syntactic and semantic structure for opinion expression detection. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 67–76, Uppsala, Sweden. Association for Computational Linguistics.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551.

Junmo Kang, Haritz Puerto San Roman, and Sung-Hyon Myaeng. 2019. Let me know what to ask: Interrogative-word-aware question generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 163–171, Hong Kong, China. Association for Computational Linguistics.

Yanghoon Kim, Hwanhee Lee, Joongbo Shin, and Kyomin Jung. 2018. Improving neural question generation using answer separation. CoRR, abs/1809.02393.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv e-prints, arXiv:1312.6114.

Vishwajeet Kumar, Kireeti Boorla, Yogesh Meena, Ganesh Ramakrishnan, and Yuan-Fang Li. 2018. Automating reading comprehension by generating question and answer pairs. arXiv e-prints, arXiv:1803.03664.

Vishwajeet Kumar, Ganesh Ramakrishnan, and Yuan-Fang Li. 2018. A framework for automatic question generation from text using deep reinforcement learning. CoRR, abs/1808.04961.

Igor Labutov, Sumit Basu, and Lucy Vanderwende. 2015. Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 889–898, Beijing, China. Association for Computational Linguistics.

Mike Lewis and Angela Fan. 2019. Generative question answering: Learning to answer the whole question. In Proceedings of the Seventh International Conference on Learning Representations.

Jingjing Li, Yifan Gao, Lidong Bing, Irwin King, and Michael R. Lyu. 2019. Improving question generation with to the point context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3214–3224, Hong Kong, China. Association for Computational Linguistics.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv e-prints, arXiv:1611.08562.

Juntao Li, Yan Song, Haisong Zhang, Dongmin Chen, Shuming Shi, Dongyan Zhao, and Rui Yan. 2018. Generating classical Chinese poems via conditional variational autoencoder and adversarial training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3890–3900. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

David Lindberg, Fred Popowich, John Nesbit, and Phil Winne. 2013. Generating natural language questions to support learning on-line. In Proceedings of the 14th European Workshop on Natural Language Generation, pages 105–114, Sofia, Bulgaria. Association for Computational Linguistics.

Bang Liu, Mingjun Zhao, Di Niu, Kunfeng Lai, Yancheng He, Haojie Wei, and Yu Xu. 2019. Learning to generate questions by learning what not to generate. CoRR, abs/1902.10418.

Jack Mostow and Wei Chen. 2009. Generating instruction automatically for the reading strategy of self-questioning. In Proceedings of the 2009 Conference on Artificial Intelligence in Education: Building Learning Systems That Care: From Knowledge Representation to Affective Modelling, pages 465–472, Amsterdam, The Netherlands. IOS Press.

Preksha Nema, Akash Kumar Mohankumar, Mitesh M. Khapra, Balaji Vasan Srinivasan, and Balaraman Ravindran. 2019. Let's ask again: Refine network for automatic question generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3312–3321, Hong Kong, China. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.

Victor Prokhorov, Ehsan Shareghi, Yingzhen Li, Mohammad Taher Pilehvar, and Nigel Collier. 2019. On the importance of the Kullback-Leibler divergence term in variational autoencoders for text generation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 118–127, Hong Kong. Association for Computational Linguistics.

Jiazuo Qiu and Deyi Xiong. 2019. Generating highly relevant questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5982–5986, Hong Kong, China. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392. Association for Computational Linguistics.

Thomas Scialom, Benjamin Piwowarski, and Jacopo Staiano. 2019. Self-attention architectures for answer-agnostic neural question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6027–6032, Florence, Italy. Association for Computational Linguistics.

Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 627–637, Copenhagen, Denmark. Association for Computational Linguistics.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI'17, pages 3295–3301. AAAI Press.

Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017. A conditional variational framework for dialog generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 504–509, Vancouver, Canada. Association for Computational Linguistics.

Linfeng Song, Zhiguo Wang, Wael Hamza, Yue Zhang, and Daniel Gildea. 2018. Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 569–574. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

Sandeep Subramanian, Tong Wang, Xingdi Yuan, Saizheng Zhang, Adam Trischler, and Yoshua Bengio. 2018. Neural models for key phrase extraction and question generation. In Proceedings of the Workshop on Machine Reading for Question Answering, pages 78–88. Association for Computational Linguistics.

Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yanjun Ma, and Shi Wang. 2018. Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3930–3939. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. 2018. Wasserstein auto-encoders. In International Conference on Learning Representations.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200. Association for Computational Linguistics.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442–1451, San Diego, California. Association for Computational Linguistics.

Shuohang Wang and Jing Jiang. 2017. Machine comprehension using Match-LSTM and answer pointer. In Proceedings of the Fifth International Conference on Learning Representations.

Siyuan Wang, Zhongyu Wei, Zhihao Fan, Yang Liu, and Xuanjing Huang. 2019. A multi-agent communication framework for question-worthy phrase extraction and question generation. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.

Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 271–280, Vancouver, Canada. Association for Computational Linguistics.

Jingjing Xu, Xuancheng Ren, Junyang Lin, and Xu Sun. 2018. Diversity-promoting GAN: A cross-entropy based generative adversarial network for diversified text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3940–3949, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William Cohen. 2017a. Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1040–1050. Association for Computational Linguistics.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017b. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3881–3890, International Convention Centre, Sydney, Australia. PMLR.

Kaichun Yao, Libo Zhang, Tiejian Luo, Lili Tao, and Yanjun Wu. 2018. Teaching machines to ask questions. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4546–4552. International Joint Conferences on Artificial Intelligence Organization.

Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Saizheng Zhang, Sandeep Subramanian, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 15–25. Association for Computational Linguistics.

Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, and Min Zhang. 2016. Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 521–530, Austin, Texas. Association for Computational Linguistics.

Shiyue Zhang and Mohit Bansal. 2019. Addressing semantic drift in question generation for semi-supervised question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2495–2509, Hong Kong, China. Association for Computational Linguistics.

Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1810–1820. Curran Associates, Inc.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. 2017. InfoVAE: Information maximizing variational autoencoders. arXiv e-prints, arXiv:1706.02262.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664. Association for Computational Linguistics.

Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. 2018. Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3901–3910. Association for Computational Linguistics.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2018. Neural question generation from text: A preliminary study. In Natural Language Processing and Chinese Computing, pages 662–671, Cham. Springer International Publishing.

Wenjie Zhou, Minghua Zhang, and Yunfang Wu. 2019. Question-type driven question generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6031–6036, Hong Kong, China. Association for Computational Linguistics.

Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '18, pages 1097–1100, New York, NY, USA. ACM.
Derivations of the Variational Lower Bound