Generating Diversified Comments via Reader-Aware Topic Modeling and Saliency Detection
Wei Wang*, Piji Li, Hai-Tao Zheng†
Shenzhen International Graduate School, Tsinghua University
Department of Computer Science and Technology, Tsinghua University
Tencent AI Lab
[email protected], [email protected], [email protected]
Abstract
Automatic comment generation is a special and challenging task to verify the model ability on news content comprehension and language generation. Comments not only convey salient and interesting information in news articles, but also imply various and different reader characteristics, which we treat as the essential clues for diversity. However, most comment generation approaches only focus on saliency information extraction, while the reader-aware factors implied by comments are neglected. To address this issue, we propose a unified reader-aware topic modeling and saliency information detection framework to enhance the quality of generated comments. For reader-aware topic modeling, we design a variational generative clustering algorithm for latent semantic learning and topic mining from reader comments. For saliency information detection, we introduce Bernoulli distribution estimating on news content to select saliency information. The obtained topic representations as well as the selected saliency information are incorporated into the decoder to generate diversified and informative comments. Experimental results on three datasets show that our framework outperforms existing baseline methods in terms of both automatic metrics and human evaluation. The potential ethical issues are also discussed in detail.
Introduction
For natural language generation research, automatic comment generation is a challenging task to verify the model ability on aspects of news information comprehension and high-quality comment generation (Reiter and Dale 1997; Gatt and Krahmer 2018; Qin et al. 2018; Huang et al. 2020). One common phenomenon is that, for the same news article, there are usually hundreds or even thousands of different comments proposed by readers from different backgrounds. Figure 1 depicts an example of a news article (truncated) and three corresponding comments from Yahoo. The news article is about "The Walking Dead". From the example, we can observe and conclude two characteristics:

* Work was done during internship at Tencent AI Lab.

Title:
Andrew Lincoln Poised To Walk Off 'The Walking Dead' Next Season; Norman Reedus To Stay
Body (truncated):
He has lost his on-screen son, his wife and a number of friends to the zombie apocalypse, and now The Walking Dead star Andrew Lincoln looks to be taking his own leave of the AMC blockbuster. Cast moves on the series will also see Norman Reedus stay put under a new $20 million contract. Almost unbelievable on a show where almost no one is said to be safe, the man who has played Rick Grimes could be gone by the end of the upcoming ninth season... The show will reset with Reedus' Daryl Dixon even more in the spotlight ...
Comment A:
I’m not watching TWD without Lincoln.
Comment B:
Storylines getting stale and they keep having the same type trouble every year.
Comment C:
Reedus can't carry the show playing Daryl.
Figure 1: A news example from Yahoo.

(1) Although the news article depicts many different aspects of the event and conveys lots of detailed information, readers usually pay attention to part of the content, which means that not all the content information is salient and important. As shown in Figure 1, the first two readers both focus on "The Walking Dead" and the third reader focuses on "Reedus"; none of them mention other details in the content. (2) Different readers are usually interested in different topics, and even on the same topic, they may hold different opinions, which makes the comments diverse and informative. For example, the first reader gives the comment from the topic of "feeling" and he "cannot accept Lincoln's leaving". The second reader comments on the topic of "plot" and thinks "it is old-fashioned". The third reader comments from the topic of "acting", saying that "Reedus can't play that role".

Therefore, news comments are produced based on the interactions between readers and news articles. Comments not only imply different important information in news articles, but also convey distinct reader characteristics. Precisely because of these reader-aware factors, we can obtain a variety of diversified comments under the same news article. Intuitively, as the essential reason for diversity, these reader-aware factors should be considered jointly with saliency information detection in the task of diversified comment generation. However, few works consider these two components simultaneously. Zheng et al. (2017) proposed to generate one comment based only on the news title. Qin et al. (2018) extended the work to generate a comment jointly considering the news title and body content. These two seq2seq (Sutskever, Vinyals, and Le 2014) based methods conducted saliency detection directly via attention modeling. Li et al. (2019) extracted keywords as saliency information. Yang et al. (2019) proposed a reading network to select important spans from the news article. Huang et al. (2020) employed the LDA (Blei, Ng, and Jordan 2003) topic model to conduct information mining from the content. All these works concern saliency information extraction to enhance the quality of comments. However, the various reader-aware factors implied in the comments, which are essential for diversity as well, are neglected.

To tackle the aforementioned issues, we propose a reader-aware topic modeling and saliency detection framework to enhance the quality of generated comments. The goal of reader-aware topic modeling is to mine reader-aware latent factors from the comments. The latent factors might be reader-interested topics, the writing styles of comments, or other more detailed factors. We do not design a strategy to disentangle them; instead, we design a unified latent variable modeling component to capture them. Specifically, inspired by Jiang et al. (2017), we design a Variational Generative Clustering (VGC) model to conduct latent factor modeling from the reader comments. The obtained latent factor representations can be interpreted as news topics, user interests, or writing styles; for convenience, we collectively name them Topic. For reader-aware saliency information detection, we build a saliency detection component to conduct Bernoulli distribution estimating on the news content.
Gumbel-Softmax is introduced to address the non-differentiable sampling operation. Finally, the obtained topic representation vectors and the selected saliency news content are integrated into the generation framework to guide the model to generate diversified and informative comments.

We conduct extensive experiments on three datasets in different languages: NetEase News (Chinese) (Zheng et al. 2017), Tencent News (Chinese) (Qin et al. 2018), and Yahoo! News (English) (Yang et al. 2019). Experimental results demonstrate that our model obtains better performance according to both automatic and human evaluation. In summary, our contributions are as follows:

• We propose a framework to generate diversified comments jointly considering saliency news information detection and reader-aware latent topic modeling.
• We design a Variational Generative Clustering (VGC) based component to learn the reader-aware latent topic representations from comments.
• For reader-aware saliency information detection, Bernoulli distribution estimating is conducted on the news content. Gumbel-Softmax is introduced to address the non-differentiable sampling operation.
• Experiments on three datasets demonstrate that our model outperforms state-of-the-art baseline methods according to automatic evaluation and human evaluation.
Methodology
Overview
To begin with, we state the problem of news comment generation as follows: given a news title $T = \{t_1, t_2, \ldots, t_m\}$ and a news body $B = \{b_1, b_2, \ldots, b_n\}$, the model needs to generate a comment $Y = \{y_1, y_2, \ldots, y_l\}$ by maximizing the conditional probability $p(Y|X)$, where $X = [T, B]$. As shown in Figure 2, the backbone of our work is a sequence-to-sequence framework with an attention mechanism (Bahdanau, Cho, and Bengio 2014). Two new components, reader-aware topic modeling and saliency information detection, are designed and incorporated for better comment generation. For reader-aware topic modeling, we design a variational generative clustering algorithm for latent semantic learning and topic mining from reader comments. For reader-aware saliency information detection, we introduce a saliency detection component to conduct Bernoulli distribution estimating on the news content. Finally, the obtained topic vectors as well as the selected saliency information are incorporated into the decoder to conduct comment generation.

The Backbone Framework
The backbone of our work is a sequence-to-sequence framework with an attention mechanism (Bahdanau, Cho, and Bengio 2014). A BiLSTM encoder encodes the input content words $X$ into vectors, and an LSTM-based decoder then generates a comment $Y$ conditioned on a weighted content vector computed by the attention mechanism. Precisely, a word embedding matrix converts content words into embedding vectors, which are fed into the encoder to compute the forward hidden vectors via:

$\overrightarrow{h}_i = \mathrm{LSTM}_{ef}(x_i, \overrightarrow{h}_{i-1})$,  (1)

where $\overrightarrow{h}_i$ is a $d$-dimensional hidden vector and $\mathrm{LSTM}_{ef}$ denotes the LSTM unit. The reversed sequence is fed into $\mathrm{LSTM}_{eb}$ to get the backward hidden vectors. We concatenate them to get the final hidden representations of content words, and the representation of the news is $h^e = [\overrightarrow{h}_{|X|}; \overleftarrow{h}_1]$.

The state of the decoder is initialized with $h^e$. For predicting the comment word $y_t$, the hidden state is first obtained by:

$h_t = \mathrm{LSTM}_d(y_{t-1}, h_{t-1})$,  (2)

where $h_t \in \mathbb{R}^d$ is the hidden vector of the comment word and $\mathrm{LSTM}_d$ is the LSTM unit. Then we use the attention mechanism to query the content information from the source input. The weight of each content word is computed as follows:

$e_{ti} = h_t^\top W_a h_i^e, \quad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{i'=1}^{|X|} \exp(e_{ti'})}$,  (3)

where $W_a \in \mathbb{R}^{d \times d}$, $h_i^e$ is the hidden vector of content word $i$, and $\alpha_{ti}$ is the normalized attention score on $x_i$ at time step $t$. The attention-based content vector is then obtained by $\tilde{h}_t^e = \sum_i \alpha_{ti} h_i^e$, and the hidden state vector is updated by:

$\tilde{h}_t = \tanh(W_c[h_t; \tilde{h}_t^e])$.  (4)
Figure 2: The framework of our proposed method. Left: the reader-aware topic modeling component is used to learn topic vectors. Specifically, the comment y is encoded to get the latent semantic vector z; the classifier q(c|z) classifies z into one topic, and z is decoded to reconstruct the comment. Topic vectors are learned according to Equation 9. Right: the reader-aware saliency detection selects salient words, and the topic selector p(c|X) selects an appropriate topic vector. Finally, the comment is generated conditioned on the selected topic vector and the selected saliency information.

Finally, the probability of the next word $y_t$ is computed via:

$p(y_t \mid y_{<t}, X) = \mathrm{softmax}(\mathrm{linear}(\tilde{h}_t))$,  (5)

where $\mathrm{linear}(\cdot)$ is a linear transformation function. During training, the cross-entropy loss $\mathcal{L}_{ce}$ is employed as the optimization objective.
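For concreteness, one decoding step of Equations 2–5 can be sketched in PyTorch as follows; the module layout, tensor shapes, and parameter names are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One decoding step of the backbone (Eqs. 2-5); a sketch with assumed shapes."""

    def __init__(self, d, vocab_size):
        super().__init__()
        self.cell = nn.LSTMCell(d, d)            # LSTM_d in Eq. 2
        self.W_a = nn.Linear(d, d, bias=False)   # bilinear attention map in Eq. 3
        self.W_c = nn.Linear(2 * d, d)           # fuses [h_t ; h~e_t] in Eq. 4
        self.out = nn.Linear(d, vocab_size)      # "linear" in Eq. 5

    def forward(self, y_prev_emb, state, h_enc):
        # y_prev_emb: (B, d) embedding of the previous word
        # state: (h_{t-1}, c_{t-1}); h_enc: (B, |X|, d) encoder states
        h_t, c_t = self.cell(y_prev_emb, state)                        # Eq. 2
        e_t = torch.bmm(h_enc, self.W_a(h_t).unsqueeze(2)).squeeze(2)  # Eq. 3: h_t^T W_a h_i^e
        alpha_t = F.softmax(e_t, dim=-1)                               # normalized scores
        h_ctx = torch.bmm(alpha_t.unsqueeze(1), h_enc).squeeze(1)      # sum_i alpha_ti h_i^e
        h_tilde = torch.tanh(self.W_c(torch.cat([h_t, h_ctx], -1)))    # Eq. 4
        return F.log_softmax(self.out(h_tilde), dim=-1), (h_t, c_t)    # Eq. 5
```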
Reader-Aware Topic Modeling

Reader-aware topic modeling is conducted on all the comment sentences, aiming to learn the reader-aware topic representations only from the comments. To achieve this goal, we design a variational generative clustering algorithm which can be trained jointly with the whole framework in an end-to-end manner.

Since this component is a generative model, for each comment sentence $Y$ (we employ a Bag-of-Words feature vector $y$ to represent each comment sentence), the generation process is: (1) a topic $c$ is generated from the prior topic categorical distribution $p(c)$; (2) a latent semantic vector $z$ is generated from the conditional Gaussian distribution $p(z|c)$; (3) $y$ is generated from the conditional distribution $p(y|z)$. According to this generative process, the joint probability $p(y, z, c)$ can be factorized as:

$p(y, z, c) = p(y|z)\, p(z|c)\, p(c)$.  (6)

By using Jensen's inequality, the log-likelihood can be written as:

$\log p(y) = \log \int_z \sum_c p(y, z, c)\, dz \ge \mathbb{E}_{q(z,c|y)}\!\left[\log \frac{p(y, z, c)}{q(z, c|y)}\right] = \mathcal{L}_{ELBO}(y)$,  (7)

where $\mathcal{L}_{ELBO}$ is the evidence lower bound (ELBO) and $q(z, c|y)$ is the variational posterior that approximates the true posterior $p(z, c|y)$; it can be factorized as:

$q(z, c|y) = q(z|y)\, q(c|z)$.  (8)

Based on Equation 6 and Equation 8, $\mathcal{L}_{ELBO}$ can be rewritten as:

$\mathcal{L}_{ELBO}(y) = \mathbb{E}_{q(z,c|y)}[\log p(y, z, c) - \log q(z, c|y)]$
$\quad = \mathbb{E}_{q(z,c|y)}[\log p(y|z) + \log p(z|c) + \log p(c) - \log q(z|y) - \log q(c|z)]$
$\quad = \mathbb{E}_{q(z|y)}\Big[\log p(y|z) - \sum_c q(c|z) \log \frac{q(z|y)}{p(z|c)} - D_{KL}(q(c|z)\,\|\,p(c))\Big]$,  (9)

where the first term in Equation 9 is the reconstruction term, which encourages the model to reconstruct the input. The second term aligns the latent vector $z$ of input $y$ to the latent topic representation corresponding to topic $c$; $q(c|z)$ can be regarded as the clustering result for the input comment sentence $Y$. The last term narrows the distance between the posterior topic distribution $q(c|z)$ and the prior topic distribution $p(c)$.

In practical implementations, the prior topic categorical distribution $p(c)$ is set to the uniform distribution $p(c) = 1/K$ to prevent the posterior topic distribution $q(c|z)$ from collapsing, i.e., all comments being clustered into one topic. $p(z|c)$ is a parameterized diagonal Gaussian:

$p(z|c) = \mathcal{N}(z \mid \mu_c, \mathrm{diag}(\sigma_c^2))$,  (10)

where $\mu_c$ is the mean of the Gaussian for topic $c$ (also used as the latent topic representation of topic $c$) and $\sigma_c^2$ is its diagonal variance. Inspired by VAE (Kingma and Welling 2014; Bowman et al. 2016), we employ a parameterized diagonal Gaussian as $q(z|y)$:

$\mu = l_1(h), \quad \log \sigma^2 = l_2(h), \quad q(z|y) = \mathcal{N}(z \mid \mu, \mathrm{diag}(\sigma^2))$,  (11)

where $l_1(\cdot)$ and $l_2(\cdot)$ are linear transformations and $h$ is obtained by the comment encoder, which contains two MLP layers with tanh activation functions. In addition, a classifier with two MLP layers is used to predict the topic distribution $q(c|z)$. $p(y|z)$ is modeled by the decoder, a one-layer MLP with a softmax activation function.

After training, $K$ reader-aware topic representation vectors $\{\mu_i\}_{i=1}^{K}$ are obtained solely from the reader comments corpus in the training set. These reader-aware topics can then be used to control the topic diversity of the generated comments.
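A minimal PyTorch sketch of the training objective in Equation 9 follows; it assumes a multinomial Bag-of-Words decoder, $p(z|c) = \mathcal{N}(\mu_c, I)$ as a simplification of Equation 10, and a uniform prior $p(c)$, and `enc`, `dec`, `q_c_head`, `mu_c`, and `log_p_c` are hypothetical modules and parameters.

```python
import torch
import torch.nn.functional as F

def vgc_loss(bow, enc, dec, q_c_head, mu_c, log_p_c):
    """Negative ELBO of Eq. 9 for a batch of BoW comment vectors (a sketch).
    bow: (B, V); mu_c: (K, d) topic means; log_p_c: (K,) log prior, here log(1/K)."""
    mu, log_var = enc(bow).chunk(2, dim=-1)                  # q(z|y) parameters (Eq. 11)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()    # reparameterized sample
    recon = -(bow * F.log_softmax(dec(z), dim=-1)).sum(-1)   # -log p(y|z), BoW reconstruction
    q_c = F.softmax(q_c_head(z), dim=-1)                     # q(c|z), (B, K)
    # analytic E_{q(z|y)}[log q(z|y) - log p(z|c)] for diagonal Gaussians, per topic:
    kl_z = 0.5 * ((mu.unsqueeze(1) - mu_c) ** 2
                  + log_var.exp().unsqueeze(1)
                  - log_var.unsqueeze(1) - 1.0).sum(-1)      # (B, K)
    align = (q_c * kl_z).sum(-1)                             # second term of Eq. 9
    kl_c = (q_c * (q_c.clamp_min(1e-8).log() - log_p_c)).sum(-1)  # D_KL(q(c|z) || p(c))
    return (recon + align + kl_c).mean()
```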
Reader-Aware Saliency Detection
The reader-aware saliency detection component is designed to select the most important and reader-interesting information from the news article. It conducts Bernoulli distribution estimating on each word of the news content, which indicates whether each content word is important or not. The selected important words are then preserved for comment generation.

Specifically, the saliency detection component first uses a BiLSTM encoder to encode the title words, and the last hidden vectors of the two directions are used as the title representation $h^{te}$. Then we use a two-layer MLP with a sigmoid activation function on the final layer to predict the selection probability for each content word $x_i$, jointly considering the title information:

$p_\theta(\beta_i \mid x_i) = \mathrm{MLP}(h_i^e, h^{te})$,  (12)

where $h_i^e$ is the hidden vector obtained by the BiLSTM-based news content encoder of the backbone framework. The probability $\beta_i$ determines the probability (saliency) of the word being selected, and it is used to parameterize a Bernoulli distribution. A binary gate for each word can then be obtained by sampling from the Bernoulli distribution:

$g_i \sim \mathrm{Bernoulli}(\beta_i)$.  (13)

Content words with $g_i = 1$ are selected as context information for comment generation. Thus, the weight of each content word in the attention mechanism and the weighted source context in the backbone framework are changed as follows:

$\hat{\alpha}_{ti} = \frac{g_i \odot \exp(e_{ti})}{\sum_{i'=1}^{|X|} g_{i'} \odot \exp(e_{ti'})}, \quad \tilde{h}_t^e = \sum_i \hat{\alpha}_{ti} h_i^e$,  (14)

where $\tilde{h}_t^e$ is the selected saliency information of the news content and will be used for comment generation.

However, the sampling operation in Equation 13 is not differentiable. In order to train the reader-aware saliency detection component in an end-to-end manner, we apply the Gumbel-Softmax distribution as a surrogate of the Bernoulli distribution for each word selection gate (Xue, Li, and Zhang 2019). Specifically, the selection gate produces a two-element one-hot vector as follows:

$g_i = \mathrm{one\_hot}(\arg\max_j p_{i,j}), \; j \in \{0, 1\}, \quad p_{i,0} = 1 - \beta_i, \quad p_{i,1} = \beta_i.$  (15)

We use the Gumbel-Softmax distribution to approximate the one-hot vector $g_i$:

$\hat{g}_i = [\hat{p}_{i,j}]_{j=0,1}, \quad \hat{p}_{i,j} = \frac{\exp((\log(p_{i,j}) + \epsilon_j)/\tau)}{\sum_{j'=0}^{1} \exp((\log(p_{i,j'}) + \epsilon_{j'})/\tau)}$,  (16)

where $\epsilon_j$ is a random sample from Gumbel(0, 1). When the temperature $\tau$ approaches 0, the Gumbel-Softmax distribution approaches one-hot, and we use $g_i = \hat{g}_{i,1}$ instead of Equation 13. Via this approximation, we can train the component end-to-end with the other modules. In order to encourage the saliency detection component to turn off more gates and select fewer words, an $l_1$ norm term over all gates is added to the loss function:

$\mathcal{L}_{sal} = \frac{\|\mathcal{G}\|_1}{|X|} = \frac{\sum_i g_i}{|X|}.$  (17)
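The Gumbel-Softmax relaxation of Equations 15–16 can be sketched as below; the function signature and tensor shapes are our assumptions.

```python
import torch

def sample_word_gates(beta, tau=1.0, eps=1e-10):
    """Gumbel-Softmax surrogate for g_i ~ Bernoulli(beta_i) (Eqs. 15-16); a sketch.
    beta: (B, |X|) per-word selection probabilities from Eq. 12."""
    # log p_{i,0} = log(1 - beta_i), log p_{i,1} = log(beta_i)
    logits = torch.stack([(1.0 - beta).clamp_min(eps).log(),
                          beta.clamp_min(eps).log()], dim=-1)
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + eps) + eps)  # Gumbel(0,1) noise
    g_hat = torch.softmax((logits + gumbel) / tau, dim=-1)                # Eq. 16
    return g_hat[..., 1]   # soft gate g^_{i,1}; approaches the hard gate of Eq. 15 as tau -> 0
```

The returned soft gate can be plugged directly into the renormalized attention of Equation 14, and its mean over the content words gives the sparsity penalty of Equation 17.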
Diversified Comment Generation

Given the learned $K$ reader-aware topic vectors, we need to select an appropriate topic for the current article to guide the comment generation. Therefore, a two-layer MLP with a softmax activation function is used to predict the selection probability of each topic:

$p(c|X) = \mathrm{MLP}(h^e)$.  (18)

During training, the true topic distribution $q(c|z)$ (Equation 8) is available and is used to compute a weighted topic representation:

$\tilde{\mu} = \sum_{c=1}^{K} q(c|z) \odot \mu_c$.  (19)

After getting the topic vector $\tilde{\mu}$ and the selected saliency information $\tilde{h}_t^e$, we update the hidden vector of the backbone decoder as follows:

$\tilde{h}_t = \tanh(W_c[h_t; \tilde{h}_t^e; \tilde{\mu}])$.  (20)

Then $\tilde{h}_t$ is used to predict the next word as in Equation 5.

In the inference stage, $p(c|X)$ is used to get the topic representation. Therefore, in order to learn $p(c|X)$ during the training stage, a KL term $\mathcal{L}_{top} = D_{KL}(q(c|z)\,\|\,p(c|X))$ is added to the final loss function.
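Equations 19–20 amount to a small fusion step, sketched below; `W_c` is assumed to be an `nn.Linear(3 * d, d)` and the argument names are illustrative.

```python
import torch

def fuse_topic_and_saliency(h_t, h_ctx, c_weights, mu_c, W_c):
    """Eqs. 19-20: mix the K topic vectors with q(c|z) (training) or p(c|X)
    (inference) and fold the result into the decoder state; a sketch."""
    mu_tilde = c_weights @ mu_c                    # (B, K) x (K, d) -> (B, d), Eq. 19
    fused = torch.cat([h_t, h_ctx, mu_tilde], -1)  # [h_t ; h~e_t ; mu~]
    return torch.tanh(W_c(fused))                  # Eq. 20, fed to Eq. 5
```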
Learning

Finally, considering all components, the loss function of the whole comment generation framework is:

$\mathcal{L} = \lambda_1 \mathcal{L}_{ELBO} + \lambda_2 \mathcal{L}_{sal} + \lambda_3 \mathcal{L}_{ce} + \lambda_4 \mathcal{L}_{top}$,  (21)

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are hyperparameters that trade off the different components. We jointly train all components according to Equation 21.

Experimental Settings
Datasets
Tencent Corpus is a Chinese dataset published in (Qin et al. 2018). The dataset is built from Tencent News (https://news.qq.com/), and each data item contains a news article and the corresponding comments. Each article is made up of a title and a body. All text is tokenized by the Chinese word segmenter JieBa (https://github.com/fxsjy/jieba). The average lengths of news titles, news bodies, and comments are 15 words, 554 words, and 17 words respectively.

Yahoo! News Corpus is an English dataset published in (Yang et al. 2019), which is built from Yahoo! News (https://news.yahoo.com/). Text is tokenized by Stanford CoreNLP (Manning et al. 2014). The average lengths of news titles, news bodies, and comments are 12, 578, and 32 respectively.

NetEase News Corpus is also a Chinese dataset, crawled from NetEase News (https://news.163.com/) and used in (Zheng et al. 2017). We process the raw data according to the processing methods used for the first two datasets (Qin et al. 2018; Yang et al. 2019). On average, news titles, news bodies, and comments contain 12, 682, and 23 words respectively.

Table 1 summarizes the statistics of the three datasets.
Table 1: Statistics of the three datasets.
Baseline Models
The following models are selected as baselines:
Seq2seq (Qin et al. 2018): this model follows the framework of the seq2seq model with attention. We use two kinds of input: the title (T) and the title together with the content (TC).

GANN (Zheng et al. 2017): the authors propose a gated attention neural network, which is similar to Seq2seq-T and adds a gate layer between the encoder and the decoder.
Self-attention (Chen et al. 2018): this model also follows the seq2seq framework; it uses a multi-layer multi-head self-attention encoder and an RNN decoder with attention. We follow the setting of Li et al. (2019) and use a bag of words as input; specifically, the 600 words with the highest term frequency are used as the input.
CVAE (Zhao, Zhao, and Eskenazi 2017): this model uses a conditional VAE to improve the diversity of neural dialog. We use it as a baseline for evaluating the diversity of comments.
Evaluation Metrics
Automatic Evaluation
Following Qin et al. (2018), we use ROUGE (Lin 2004), CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015), and METEOR (Banerjee and Lavie 2005) as metrics to evaluate the performance of different models. A popular NLG evaluation tool, nlg-eval (https://github.com/Maluuba/nlg-eval), is used to compute these metrics. Besides the overlap-based metrics, we use Distinct (Li et al. 2016) to evaluate the diversity of comments. Distinct-n measures the percentage of distinct n-grams in all generated results. M-Distinct-n measures the ability to generate multiple diverse comments for the same test article; for computing M-Distinct, 5 generated comments for each test article are used. For Seq2seq-T, Seq2seq-TC, GANN, and Self-attention, the top 5 comments from beam search with a size of 5 are used. For CVAE, we decode 5 times by sampling on the latent variable to get 5 comments. For our method, we decode 5 times, each time with one of the top 5 predicted topics, to get 5 comments.

Human Evaluation. Following Qin et al. (2018), we also evaluate our method by human evaluation. Given the titles and bodies of news articles, raters are asked to rate the comments on three dimensions: Relevance, Informativeness, and Fluency. Relevance measures whether the comment is about the main story of the news, one side part of the news, or irrelevant to the news. Informativeness evaluates how much concrete information the comment contains; it measures whether the comment involves a specific aspect of some character or event. Fluency evaluates whether the sentence is fluent, mainly measuring whether the sentence follows the grammar. The score of each aspect ranges from 1 to 5. In our experiment, we randomly sample 100 articles from the test set of each dataset and ask three raters to judge the quality of the comments given by the different models. For every article, comments from all models are pooled, randomly shuffled, and presented to the raters.
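To make the diversity metrics above concrete, here is a minimal sketch of Distinct-n and a per-article M-Distinct-n; whitespace tokenization and the averaging scheme are our assumptions for illustration.

```python
def distinct_n(texts, n):
    """Distinct-n: fraction of unique n-grams among all n-grams in the generated texts."""
    ngrams, total = set(), 0
    for text in texts:
        tokens = text.split()  # whitespace tokenization (an assumption)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0

def m_distinct_n(comments_per_article, n):
    """M-Distinct-n over the 5 generated comments of each article (averaging is an assumption)."""
    scores = [distinct_n(five, n) for five in comments_per_article]
    return sum(scores) / len(scores)
```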
Implementation Details
For each dataset, we use a vocabulary of the top 30k frequent words in the entire data. We limit the maximum lengths of news titles, news bodies, and comments to 30, 600, and 50 respectively; the part exceeding the maximum length is truncated. The embedding size is set to 256, and the word embeddings are shared between the encoder and the decoder. For the RNN-based encoder, we use a two-layer BiLSTM with hidden size 128, and a two-layer LSTM with hidden size 256 as the decoder. For the self multi-head attention encoder, we use 4 heads and two layers. For CVAE and our topic modeling component, we set the size of the latent variable to 64. For our method, two of the loss weights λ in Equation 21 are set to 1, and the other two are set to a small constant and 0.2 respectively. We choose the topic number K from the set [10, 100, 1000]; we set K = 100 for the Tencent dataset and K = 1000 for the other two datasets. A dropout layer with rate 0.1 is inserted after the LSTM layers of the decoder for regularization. The batch size is set to 128. We train the model using Adam (Kingma and Ba 2014) with learning rate 0.0005, and clamp gradient values into a small symmetric range to avoid the exploding gradient problem (Pascanu, Mikolov, and Bengio 2013). In decoding, the top 1 comment from beam search with a size of 5 is selected for evaluation.
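The optimization setup described above corresponds roughly to the following loop; `model`, `loader`, and the clipping bound `clip` are illustrative placeholders rather than the authors' exact values.

```python
import torch

def train(model, loader, clip=0.1, lr=5e-4):
    """A sketch of the training loop described above; `model`, `loader`,
    and the gradient-clipping bound `clip` are hypothetical."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam, lr = 0.0005
    for batch in loader:                                     # batch size 128
        loss = model(batch)             # weighted sum of the four losses (Eq. 21)
        optimizer.zero_grad()
        loss.backward()
        # clamp each gradient value into [-clip, clip] (hypothetical bound)
        torch.nn.utils.clip_grad_value_(model.parameters(), clip)
        optimizer.step()
```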
Experimental Results and Discussions

Automatic and Human Evaluation
Automatic evaluation results on the three datasets are shown in Table 2. On most automatic metrics, our method outperforms the baseline methods.

Models          ROUGE_L  CIDEr  METEOR  Distinct-3  Distinct-4  M-Distinct-3  M-Distinct-4
Tencent
  Seq2seq-T       0.261  0.015   0.076    0.088       0.079       0.046         0.051
  Seq2seq-TC      0.280  0.021   0.088    0.121       0.122       0.045         0.054
  GANN            0.267  0.017   0.081    0.087       0.081       0.040         0.046
  Self-attention  0.280  0.019   0.092    0.117       0.121       0.043         0.051
  CVAE            0.281  0.021   0.094    0.135       0.137       0.041         0.044
  Ours              —      —       —        —           —           —             —
Yahoo
  Seq2seq-T       0.299  0.031   0.105    0.137       0.168       0.044         0.063
  Seq2seq-TC      0.308    —       —        —           —           —             —
NetEase
  Seq2seq-T       0.263  0.025   0.105    0.149       0.169       0.046         0.056
  Seq2seq-TC      0.268    —       —        —           —           —             —

Table 2: Automatic evaluation results on three datasets.
Models          Relevance  Informativeness  Fluency  Total
Tencent
  Seq2seq-TC       1.22        1.11            —       —
Yahoo
  Seq2seq-TC       1.70        1.70           3.77    2.39
  Self-attention   1.71        1.72            —       —
NetEase
  Seq2seq-TC       1.97        1.99           4.03    2.66
  Self-attention   1.90        1.96           4.02    2.63
  CVAE             1.50        1.53            —       —

Table 3: Human evaluation results on three datasets.
Metrics                Distinct-3  Distinct-4  M-Distinct-3  M-Distinct-4
No Topic Modeling         0.142       0.151        0.050         0.060
No Saliency Detection     0.173       0.188        0.087         0.104
Full Model                0.176       0.196        0.092         0.112

Table 4: Model ablation results on the Tencent dataset.

Compared with Seq2seq-TC, Seq2seq-T and GANN perform worse on all metrics, which indicates that news bodies are important for generating better comments. The results of Self-attention and CVAE are not stable: compared with Seq2seq-TC, Self-attention performs worse on the Yahoo dataset and comparably on the other datasets, while CVAE performs better on the Tencent dataset and worse on the other datasets. Compared with the other methods, our method improves the Distinct-4 and M-Distinct scores significantly. This demonstrates that our method can generate diversified comments according to different topics and salient information. Since the different comments for one article produced by Seq2seq-T, Seq2seq-TC, GANN, and Self-attention come from the same beam, the M-Distinct scores of these methods are lower than ours. Although CVAE can generate different comments for one article by sampling on a latent variable, it obtains a worse M-Distinct score than ours, which demonstrates that the semantic change of the generated comments is small when sampling on a latent variable. Our method generates comments by selecting different topic representation vectors and salient information of the news, and thus achieves higher M-Distinct scores.

Table 3 reports the human evaluation results on the three datasets. Because Seq2seq-T and GANN do not use news bodies and perform worse on the automatic metrics, we compare our method only with the remaining methods. Our method achieves the best Total scores on all three datasets. Specifically, our method mainly improves the scores on Relevance and Informativeness. This shows that our method can generate more relevant and informative comments by utilizing reader-aware topic modeling and saliency information detection. However, our method performs worse in terms of Fluency. We find that the baselines tend to generate more generic responses, such as "Me too.", resulting in higher Fluency scores.
Ablation Study
We conduct an ablation study to evaluate the effect of each component and show the results in Table 4. We compare our full model with two variants: (1) No Topic Modeling: the reader-aware topic modeling component is removed; (2) No Saliency Detection: the reader-aware saliency detection component is removed. Our full model obtains the best performance, and both components contribute to it. Removing topic modeling causes a large drop in Distinct and M-Distinct, which shows that the reader-aware topic modeling component is important for generating diversified comments. With saliency detection, the performance improves further, which indicates that detecting the important information of the news is useful for generating diversified comments.
Analysis of Learned Latent Topics
To evaluate the reader-aware topic modeling component, we visualize the learned latent semantic vectors of comments. To this end, we use t-SNE (Maaten and Hinton 2008) to reduce the dimensionality of the latent vector z to 2. The topic with the highest probability in the topic distribution q(c|z) is used as the topic of a comment. We first randomly select 10 topics from the 100 topics in the Tencent dataset and then plot 5000 sampled comments belonging to these 10 topics in Figure 3, where points with different colors belong to different topics. In addition, we plot the topic vectors of these 10 topics.

Figure 3: The latent semantic vectors of sampled comments and corresponding topic vectors.
Topic 1: 不错, 有点, 真的, 太, 挺, 比较, 应该, 确实, 好像, 其实 (nice, a little, really, too, quite, comparatively, should, indeed, like, actually)
Topic 22: 恶心, 丑, 可爱, 真, 不要脸, 太, 讨厌, 难看, 脸, 臭 (disgusting, ugly, cute, real, shameful, too, nasty, ugly, face, stink)
Topic 37: 好看, 漂亮, 挺, 不错, 身材, 演技, 颜值, 性感, 长得, 很漂亮 (good-looking, beautiful, quite, nice, body, acting, face, sexy, look like, very beautiful)
Topic 62: 好吃, 东西, 喝, 吃, 味道, 鱼, 肉, 水, 菜, 不吃 (delicious, food, drink, eat, taste, fish, meat, water, dish, don't eat)
Topic 99: 穿, 腿, 长, 衣服, 眼睛, 好看, 脸, 胖, 瘦, 高 (wear, leg, long, clothes, eye, good-looking, face, fat, thin, tall)

Table 5: The top 10 frequent words of some topics.

We can see that the latent vectors of comments are grouped into several clusters and distributed around the corresponding topic vectors. This shows that the reader-aware topic modeling component can effectively cluster comments and that the topic vectors represent the topics well. Furthermore, we collect the comments on each topic to observe what each topic is talking about. The top 10 frequent words (after removing stop words) of some topics are shown in Table 5. We can see that Topic 1 is about the intensity of emotional expression, with words such as "a little", "really", and "too". Appearance and food are discussed in Topic 37 and Topic 62 respectively.
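The visualization procedure above can be sketched as follows; the function and argument names are illustrative, and co-embedding the topic means with the comment vectors is our assumption about how the topic vectors are plotted.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_topic_clusters(z, q_c, mu_c):
    """Project latent comment vectors and topic means to 2-D with t-SNE and
    color points by their argmax topic, as described above; a sketch.
    z: (N, d) latent vectors, q_c: (N, K) topic posteriors, mu_c: (K, d) means."""
    topics = q_c.argmax(axis=1)                        # hard topic assignment per comment
    xy = TSNE(n_components=2).fit_transform(np.vstack([z, mu_c]))
    pts, centers = xy[:len(z)], xy[len(z):]
    plt.scatter(pts[:, 0], pts[:, 1], c=topics, s=4, cmap='tab10')
    plt.scatter(centers[:, 0], centers[:, 1], c='black', marker='*', s=120)
    plt.show()
```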
Case Study
To further understand our model, we compare the comments generated by different models in Table 6. The news article is about the dress of a female star. Seq2seq-TC produces a generic comment, while Self-attention produces an informative one; however, neither can produce multiple comments for one article. CVAE can do so by sampling on a latent variable, but it produces the same comment for different samples. Compared to CVAE, our model generates more relevant and diverse comments according to different topics. For example, "It is good-looking in any clothes" comments on the main story of the news and mentions a detail of the news, "wearing clothes". Moreover, comparing the generated comment conditioned on a specific topic with the corresponding topic words in Table 5, we find that the generated comments are consistent with the semantics of the corresponding topics.
Title: 蒋欣终于穿对衣服！尤其这开叉裙，显瘦20斤！微胖界穿搭典范！ (Jiang Xin is finally wearing the right clothes! Especially this open skirt, which is 20 pounds slimmer! Micro-fat dress code!)
Body (truncated): 中国全民追星热的当下，明星的一举一动，以及穿着服饰，都极大受到粉丝的追捧。事实上，每位女明星都有自己的长处、优点，善意看待就好噢。蒋欣呀蒋欣，你这样穿确实很显瘦的说，着实的好看又吸睛，难怪人人都说，瘦瘦瘦也可以凹出美腻感的节奏有么有，学会这样穿说不定你也可以一路美美美的节奏。 (At the moment when China's people are star-fighting, the celebrity's every move, as well as wearing clothing, are greatly sought after by fans. In fact, each female star has her own strengths and advantages, just look at it in good faith. Jiang Xin, Jiang Xin, you are indeed very thin when you wear it. It is really beautiful and eye-catching. No wonder everyone says that there is a rhythm of thinness. You can learn to wear it like this. You can have a beautiful rhythm.)
Seq2seq-TC: 我也是 (Me too.)
Self-attention: 喜欢蒋欣 (Like Jiang Xin.)
CVAE 1: 我喜欢蒋欣 (I like Jiang Xin.)
CVAE 2: 我喜欢蒋欣 (I like Jiang Xin.)
Ours:
  Topic 99: 穿什么衣服都好看 (It is good-looking in any clothes.)
  Topic 22: 好可爱 (So cute.)
  Topic 37: 挺好看的 (It is pretty beautiful.)
  Topic 1: 不错不错 (It is nice.)
  Topic 62: 胖了 (Gain weight.)

Table 6: A case from the Tencent News dataset.
Related Work
Automatic comment generation was proposed by Zheng et al. (2017) and Qin et al. (2018). The former proposed to generate one comment based only on the news title, while the latter extended the work to generate a comment jointly considering the news title and body content. These two methods adopted the seq2seq (Sutskever, Vinyals, and Le 2014) framework and conducted saliency detection directly via attention modeling. Recently, Li et al. (2019) extracted keywords as saliency information, and Yang et al. (2019) proposed a reading network to select important spans from the news article. Huang et al. (2020) employed the LDA (Blei, Ng, and Jordan 2003) topic model to conduct information mining from the content. All these works concern saliency information extraction but neglect the various reader-aware factors implied in the comments. Our method simultaneously considers these two aspects and utilizes two novel components to generate diversified comments.
Conclusion
We propose a reader-aware topic modeling and saliency detection framework to enhance the quality of generated comments. We design a variational generative clustering algorithm for topic mining from reader comments, and we introduce Bernoulli distribution estimating on the news content to select saliency information. Results show that our framework outperforms existing baseline methods in terms of both automatic metrics and human evaluation.

Ethics Statement
We are fully aware of the new ethics policy posted on the AAAI 2021 CFP page, and we seriously honor the AAAI Publications Ethics and Malpractice Statement as well as the AAAI Code of Professional Conduct. Throughout the whole process of our research project, we have carefully thought about them. Here we elaborate on the ethical impact of this task and our method.

Automatic comment generation aims to generate comments for news articles. This task has many potential applications. First, researchers have been working to develop more intelligent chatbots (such as XiaoIce (Shum, He, and Li 2018; Zhou et al. 2020)), which can not only chat with people, but also write poems, sing songs, and so on. One application of this task is to give the chatbot the ability to comment on articles (Qin et al. 2018; Yang et al. 2019), enabling in-depth, content-rich conversations with users based on articles they are interested in. Second, it can be used in online discussion forums to increase user engagement and foster online communities by generating enlightening comments that attract users to give their own comments. Third, we can build a comment writing assistant which generates candidate comments for users (Zheng et al. 2017); users could select one and refine it, which makes the procedure more user-friendly. Therefore, this task is novel and meaningful.

We are aware that numerous uses of these techniques can pose ethical issues. For example, there is a risk that people and organizations could use these techniques at scale to feign comments coming from people for purposes of political manipulation or persuasion (Yang et al. 2019). Therefore, in order to avoid potential risks, best practices will be necessary for guiding applications, and all aspects of deploying such a system need to be supervised. First, we suggest that market regulators monitor organizations or individuals that provide such services to a large number of users. Second, we suggest limiting the domain of such systems and excluding the political domain; post-processing techniques are also needed to filter sensitive comments. Third, we suggest limiting the number of system calls in a short period of time to prevent massive abuse. We believe that reasonable guidance and supervision can largely avoid these risks.

On the other hand, we have to mention that some typical tasks also carry potential risks. For example, the technology of dialogue generation (Zhang et al. 2020) can be used to impersonate a normal person and deceive people who chat with it for a certain purpose, and the technology of face generation (Karras et al. 2018) can be used to imitate the face of a target person and deceive face recognition systems. However, many researchers keep working on these tasks for their positive uses. Therefore, everything has two sides, and we should treat it dialectically. In addition, we believe that studying this technology is important for better understanding its defects, which helps us detect spam comments and combat such behavior. For example, Zellers et al. (2019) found that the best defense against fake news turns out to be a strong fake news generator.
Acknowledgments
This research is supported by the National Natural Science Foundation of China (Grant Nos. 61773229 and 6201101015), the Tencent AI Lab Rhino-Bird Focused Research Program (No. JR202032), Shenzhen Giiso Information Technology Co. Ltd., the Basic Research Fund of Shenzhen City (Grant No. JCYJ20190813165003837), and the Overseas Cooperation Research Fund of Graduate School at Shenzhen, Tsinghua University (Grant No. HW2018002).
References
Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Banerjee, S.; and Lavie, A. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. Ann Arbor, Michigan: Association for Computational Linguistics.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3: 993–1022.

Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2016. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 10–21. Association for Computational Linguistics.

Chen, M. X.; Firat, O.; Bapna, A.; Johnson, M.; Macherey, W.; Foster, G.; Jones, L.; Schuster, M.; Shazeer, N.; Parmar, N.; et al. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 76–86. Association for Computational Linguistics.

Gatt, A.; and Krahmer, E. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61: 65–170.

Huang, J.; Pan, L.; Xu, K.; Peng, W.; and Li, F. 2020. Generating Pertinent and Diversified Comments with Topic-aware Pointer-Generator Networks. arXiv preprint arXiv:2005.04396.

Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; and Zhou, H. 2017. Variational deep embedding: an unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 1965–1972.

Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations.

Li, J.; Galley, M.; Brockett, C.; Spithourakis, G.; Gao, J.; and Dolan, B. 2016. A Persona-Based Neural Conversation Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 994–1003. Association for Computational Linguistics.

Li, W.; Xu, J.; He, Y.; Yan, S.; Wu, Y.; and Sun, X. 2019. Coherent Comments Generation for Chinese Articles with a Graph-to-Sequence Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4843–4852. Florence, Italy: Association for Computational Linguistics.

Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics.

Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9: 2579–2605.

Manning, C. D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; and McClosky, D. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.

Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 1310–1318.

Qin, L.; Liu, L.; Bi, W.; Wang, Y.; Liu, X.; Hu, Z.; Zhao, H.; and Shi, S. 2018. Automatic Article Commenting: the Task and Dataset. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 151–156. Melbourne, Australia: Association for Computational Linguistics.

Reiter, E.; and Dale, R. 1997. Building applied natural language generation systems. Natural Language Engineering 3(1): 57–87.

Shum, H.-Y.; He, X.; and Li, D. 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19(1): 10–26.

Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 3104–3112.

Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4566–4575.

Xue, L.; Li, X.; and Zhang, N. L. 2019. Not All Attention Is Needed: Gated Attention Network for Sequence Data. arXiv preprint arXiv:1912.00349.

Yang, Z.; Xu, C.; Wu, W.; and Li, Z. 2019. Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5077–5089. Hong Kong, China: Association for Computational Linguistics.

Zellers, R.; Holtzman, A.; Rashkin, H.; Bisk, Y.; Farhadi, A.; Roesner, F.; and Choi, Y. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, 9054–9065.

Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.-C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; and Dolan, B. 2020. DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 270–278. Online: Association for Computational Linguistics.

Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 654–664.

Zheng, H.-T.; Wang, W.; Chen, W.; and Sangaiah, A. K. 2017. Automatic generation of news comments based on gated attention neural networks. IEEE Access 6: 702–710.

Zhou, L.; Gao, J.; Li, D.; and Shum, H.-Y. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics 46(1): 53–93.