Context-aware Helpfulness Prediction for Online Product Reviews
This is the preliminary version. The final version has been published in The 15th Asia Information Retrieval Societies Conference proceedings (2019) and can be obtained from https://link.springer.com/chapter/10.1007/978-3-030-42835-8_6
Iyiola E. Olatunji, Xin Li, and Wai Lam
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong
{olatunji, lixin, wlam}@se.cuhk.edu.hk

Abstract.
Modeling and prediction of review helpfulness has become more prominent due to the proliferation of e-commerce websites and online shops. Since the functionality of a product cannot be tested before buying, people often rely on different kinds of user reviews to decide whether or not to buy a product. However, quality reviews might be buried deep in the heap of a large number of reviews. Therefore, recommending reviews to customers based on review quality is of the essence. Since there is no direct indication of review quality, most approaches use the information that "X out of Y" users found the review helpful to estimate review quality. However, this approach undermines helpfulness prediction because not all reviews have statistically abundant votes. In this paper, we propose a deep neural model that predicts the helpfulness score of a review. The model is based on a convolutional neural network (CNN) and a context-aware encoding mechanism that can directly capture relationships between words irrespective of their distance in a long sequence. We validated our model on a human-annotated dataset, and the results show that it significantly outperforms existing models for helpfulness prediction.
Keywords: Helpfulness prediction · Context-aware · Product review.
1 Introduction

Reviews have become an integral part of users' experience when shopping online. This trend makes product reviews an invaluable asset because they help customers make purchasing decisions and, consequently, drive sales [7]. Due to the enormous number of reviews, it is important to analyze review quality and to present useful reviews to potential customers. The quality of a review can vary from a well-detailed opinion and argument, to excessive appraisal, to spam. Therefore, predicting the helpfulness of a review involves automatically detecting the influence the review will have on a customer's purchasing decision. Such reviews should be informative and self-contained [6].
The work described in this paper is substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418).
Votes: [1, 8]    HS: 0.13    HAS: 0.85
I received an updated charger as well as the updated replacement power supply due to the recall. It looks as if this has solved all of the problems. I have been using this charger for some time now with no problems. No more heat issues and battery charging is quick and accurate. I can now recommend this charger with no problem. I use this charger several times a week and much prefer it over the standard wall type chargers. I primarily use Eneloop batteries with this charger.
Fig. 1. Example of a review text with helpfulness scores. HS = helpfulness score based on the "X of Y" approach; HAS = human-annotated helpfulness score.
Review helpfulness prediction has been studied using arguments [12], aspects (ASP) [25], and structural (STR) and unigram (UGR) features [26]. Semantic features such as linguistic inquiry and word count (LIWC), general inquirer (INQ) [26], and the Geneva affect label coder (GALC) [14] have also been used to determine review helpfulness. However, using handcrafted features is laborious and expensive due to manual feature engineering and data annotation. Recently, convolutional neural networks (CNNs) [10], more specifically the character-based CNN [5], have been applied to review helpfulness prediction and have been shown to outperform handcrafted features. However, a CNN does not fully capture the semantic representation of a review, since different reviewers have different perspectives and writing styles, and a reviewer's language may affect the meaning of the review; that is, the choice of words of an experienced reviewer differs from that of a new reviewer. Therefore, modelling dependencies between words is important.

Recent works on helpfulness prediction use the "X of Y approach": if X out of Y users vote that a review is helpful, then the helpfulness score of the review is X/Y. However, such a simple method is not effective, as shown in Figure 1. The helpfulness score (HS = 1/8 ≈ 0.13) implies that the review is unhelpful. However, this is not the case, as the review text describes user experience and provides the information needed to draw a reasonable conclusion (it is self-contained). Hence, the review is clearly of high quality (helpful), as pointed out by human annotators (HAS = 0.85). This observation demonstrates that the "X of Y" helpfulness score may not strongly correlate with review quality [26], which undermines the effectiveness of the prediction output.

Similarly, prior methods assume that reviews are coherent [26][5]. However, this is not the case for most reviews because of content divergence features embedded in reviews, such as sentiment divergence (the opinion polarity of the product compared to that of the review). To address these issues, we propose a model that predicts the helpfulness score of a review by considering the context in which words are used in relation to the entire review. We aim to understand the internal structure of the review text.

In order to learn this internal structure, we learn dependencies between words by taking into account the positional encoding of each word and employing the self-attention mechanism to cater for long sequences. We further feed the learned representation into a CNN to produce an output sequence representation.

In our experiments, our framework outperforms existing state-of-the-art methods for helpfulness prediction on the Amazon review dataset. We conducted detailed experiments to quantitatively evaluate the effectiveness of each designed component of our model, and we validated the model on human-annotated data. The code is available.
2 Related Work

Previous works on helpfulness prediction can be broadly grouped into three categories: (a) score regression (predicting a helpfulness score between 0 and 1), (b) classification (classifying a review as either helpful or not helpful), and (c) ranking (ordering reviews by their helpfulness scores). In this paper we define helpfulness prediction as a regression task. Most studies focus on extracting specific (handcrafted) features from review text.

Semantic features such as LIWC (linguistic inquiry and word count), general inquirer (INQ) [26], and GALC (Geneva affect label coder) [14] have been used to extract meaning from the text to determine the helpfulness of a review. Extracting argument features from review text has been shown to outperform semantic features [12], since arguments provide more detailed information about the product. Structural (STR) and unigram (UGR) features [26] have also been exploited.

Content-based features, such as the review text and star rating, and context-based features, such as reviewer/user information, have both been used for the helpfulness prediction task. Content-based features are features that can be obtained from the reviews themselves; they include the length of the review, review readability, the number of words in the review text, word-category features and content divergence features. Context-based features are features derived outside the review; these include reviewer features (profiles) and features that capture similarities between users and reviews (user-reviewer idiosyncrasy). Other metadata, such as the probability of a review and its sentences being subjective, have also been used successfully as features [19][9][17][20][22][8]. Since review text is mostly subjective, modelling all the features that contribute to helpfulness is a complicated task; handcrafted features thus have limited capability and are laborious to obtain due to manual data annotation.

Several methods have been applied to the helpfulness prediction task, including support vector regression [9,29,26], probabilistic matrix factorization [23], linear regression [13], extended tensor factorization models [16], an HMM-LDA based model [18] and multi-layer neural networks [11]. These methods allow the integration of robust constraints into the learning process, which in turn has improved prediction results. Recently, convolutional neural networks (CNNs) have shown significant improvement over existing methods, achieving state-of-the-art performance [10][28]. CNNs automatically extract deep features from raw text, which alleviates manual feature selection. Furthermore,
adding more levels of abstraction, as in character-level representations, has further improved prediction results over the vanilla CNN. Embedding-gated CNN [3] and multi-domain gated CNN [4] are recent methods for helpfulness prediction.

Moreover, attention mechanisms have been combined with CNNs, as in ABCNN for modelling sentence pairs [27]. Several tasks, including textual entailment, sentence representation, machine translation and abstractive summarization, have applied the self-attention mechanism with significant results [1]. However, self-attention has not previously been employed to build a context-aware encoding mechanism for helpfulness prediction. Using the self-attention mechanism on review text is quite intuitive, because even the same word can have a different meaning depending on the context in which it is used.
Fig. 2. Proposed context-aware helpfulness prediction model.
3 Our Proposed Model

We model the problem of predicting the helpfulness score of a review as a regression problem: given a review, we predict its helpfulness score based on the review text. As shown in Figure 2, the sequence of words in the review is embedded and augmented with the corresponding positional encodings to form the input features. These input features are processed by a self-attention block that generates context-aware representations for all tokens. Such representations are then fed into a convolutional neural network (CNN), which computes a vector representation of the entire sequence (i.e., the dependency between each token and the entire sequence), and finally a regression layer predicts the helpfulness score.
The context-aware component of our model consists of positional encoding and a self-attention mechanism. We augment the word embeddings with positional encoding vectors so as to learn a text representation that takes the absolute position of each word into consideration.

Let X = (x_1, x_2, ..., x_n) be a review consisting of a sequence of words. We map each word x_i in a review X to an l-dimensional word vector e_{x_i} stored in an embedding matrix E ∈ R^{V×l}, where V is the vocabulary size. We initialize E with pre-trained GloVe vectors [21] and set the embedding dimension l to 100. A review is therefore represented as Y = (e_{x_1}, e_{x_2}, ..., e_{x_n}).

Since the above representation captures only the meaning of each word, we also need the position of each word in order to understand its context. Let S = (s_1, s_2, ..., s_n) be the positions of the words in a sentence. Inspired by Vaswani et al. [24], the positional encoding, denoted as PE ∈ R^{n×l}, is a 2D constant matrix with position-specific values calculated by the sine and cosine functions below:

    PE(s_k, i) = sin(s_k / j^{i/l}),    PE(s_k, i+1) = cos(s_k / j^{i/l})    (1)

where i is the position along the embedding vector and j is a constant representing the distance between successive peaks (or successive troughs) of the sine and cosine functions. This constant lies between 2π and 10000; based on a tuning process, we set j to 1000. This yields the sequence P = (P_1, P_2, ..., P_n), where P_k = PE(s_k) is the row vector of PE corresponding to position s_k, as in Equation 1. The final representation e' = (e'_1, e'_2, ..., e'_n) is obtained by adding each word embedding to the positional values of its position in the sequence, i.e., e'_k = P_k + e_{x_k}.

Self-attention is employed in our model. Given an input e', self-attention applies the attention mechanism to each e'_i, using e'_i as the query vector against key-value pairs from all other positions. The reason for using the self-attention mechanism is to capture the internal structure of the sentence by learning dependencies between words. We use scaled dot-product attention, which allows faster computation than the standard additive attention mechanism [2]. It computes the attention scores as:

    Attention(Q, K, V) = softmax(QK^T / √l) V    (2)

where Q, K and V are the query, key and value matrices, respectively. The equation above implies that we divide the dot product of the queries with all keys by the square root of the key vector dimension to obtain the weights on the values.
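To make Equations 1 and 2 concrete, the following is a minimal NumPy sketch of the sinusoidal positional encoding (with j = 1000) and of scaled dot-product attention; the toy dimensions and variable names are illustrative only, not part of our implementation.

    import numpy as np

    def positional_encoding(n, l, j=1000.0):
        # Eq. 1: PE(s_k, i) = sin(s_k / j^(i/l)); PE(s_k, i+1) = cos(s_k / j^(i/l))
        pe = np.zeros((n, l))
        s = np.arange(n)[:, None]               # token positions s_k
        denom = j ** (np.arange(0, l, 2) / l)   # j^(i/l) for even indices i
        pe[:, 0::2] = np.sin(s / denom)
        pe[:, 1::2] = np.cos(s / denom)
        return pe

    def scaled_dot_product_attention(Q, K, V):
        # Eq. 2: Attention(Q, K, V) = softmax(Q K^T / sqrt(l)) V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    # Toy example: a 7-token review with embedding dimension l = 100
    n, l = 7, 100
    embeddings = np.random.randn(n, l) * 0.1            # stand-in for GloVe vectors
    e_prime = embeddings + positional_encoding(n, l)    # e'_k = P_k + e_{x_k}
    context = scaled_dot_product_attention(e_prime, e_prime, e_prime)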
We use multi-head attention similar to [24]. The multi-head attention mechanism maps the input vector e' to query, key and value matrices using different linear projections, which allows the self-attention mechanism to be applied h times. The vectors produced by the different heads are then concatenated to form a single vector.

Concisely, our model captures context for a sequence as follows. We obtain the relative positions of the tokens in the sequence from Equation 1. The self-attention block learns the context by relating different positions s_1, s_2, ..., s_n of P via Equation 1 so as to compute a single encoding representation of the sequence. By employing multi-head attention, our model can attend to words from different encoding representations at different positions. We set the number of heads h to 2. The queries, keys and values used in the self-attention block are obtained from the output of the previous layer of the context-aware encoding block. This design allows every position in the context-aware encoding block to attend over all positions of the input sequence. To consider positional information in sequences longer than those observed in training, we apply the sinusoidal positional encoding to the input embedding.

The output e' of the context-aware encoding representation is fed into the CNN to obtain new feature representations for making predictions. We employ multiple filters f ∈ {1, 2, 3}, which is similar to learning uni-gram, bi-gram and tri-gram representations, respectively. Specifically, for each filter, we obtain a hidden representation r = Pool(Conv(e', filterSize(f, l, c))), where c is the channel size, Pool is the pooling operation and Conv(·) is the convolution operation. In our experiments, we use max pooling and average pooling. The final representation h is obtained by concatenating all hidden representations, i.e., h = [r_1, r_2, r_3]. These features are then passed to the regression layer to produce the helpfulness score.
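A condensed PyTorch sketch of the architecture described above is given below, under the stated settings (h = 2 heads, filter sizes f ∈ {1, 2, 3}, l = 100) and an assumed channel size c = 100; it illustrates the design rather than reproducing our exact implementation.

    import torch
    import torch.nn as nn

    class ContextAwareHelpfulness(nn.Module):
        # Sketch: multi-head self-attention over position-augmented embeddings,
        # then 1-/2-/3-gram convolutions, pooling, and a regression head.
        def __init__(self, l=100, heads=2, channels=100, filters=(1, 2, 3)):
            super().__init__()
            self.attn = nn.MultiheadAttention(embed_dim=l, num_heads=heads,
                                              batch_first=True)
            self.convs = nn.ModuleList(
                [nn.Conv1d(in_channels=l, out_channels=channels, kernel_size=f)
                 for f in filters])
            self.dropout = nn.Dropout(0.5)            # regularization, rate 0.5
            self.out = nn.Linear(channels * len(filters), 1)

        def forward(self, e_prime):                   # e_prime: (batch, n, l)
            ctx, _ = self.attn(e_prime, e_prime, e_prime)  # Q = K = V
            x = ctx.transpose(1, 2)                   # (batch, l, n) for Conv1d
            # r = Pool(Conv(e', filterSize(f, l, c))) for each filter size f
            pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
            h = self.dropout(torch.cat(pooled, dim=1))     # h = [r_1, r_2, r_3]
            return self.out(h).squeeze(-1)            # predicted helpfulness score

    model = ContextAwareHelpfulness()
    scores = model(torch.randn(8, 50, 100))           # a batch of 8 fifty-token reviews

For the average-pooling variant (S_Avg, discussed later), the max over time in the pooling step would simply be replaced by a mean.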
4 Experiments

We used two datasets for our experiments. The first dataset, called D1, is constructed from the Amazon product review dataset [15], which consists of over 142 million reviews posted on Amazon between 1996 and 2014. We used a subset of 21,776,678 reviews from 5 categories, namely health, electronics, home, outdoor and phone, and selected reviews with over 5 votes as done in [5,26]. The statistics of the dataset are shown in Table 1. We removed reviews having fewer than 7 words so that we could experiment with different filter sizes. Note that this is the largest dataset used for the helpfulness prediction task.

The second dataset, called D2, is the human-annotated dataset from Yang et al. [26]. It consists of 400 reviews, with 100 reviews selected randomly from each of four product categories (outdoor, electronics, home and books). The reason for using the human-annotated dataset is to verify that our model truly learns deep semantic features of review text; our model was therefore never trained on the human-annotated dataset, which is used only for evaluating its effectiveness. We used only three of its categories for our experiments and performed cross-domain experiments on categories not in D2.
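For illustration, the construction of D1 can be sketched as below; the field names (helpful, reviewText) follow the public Amazon review dump of [15], and the file name is a placeholder.

    import pandas as pd

    # Load one category of the raw Amazon dump; the file name is a placeholder
    reviews = pd.read_json("reviews_Health.json.gz", lines=True)

    # The dump stores votes as helpful = [helpful_votes, total_votes]
    votes = pd.DataFrame(reviews["helpful"].tolist(),
                         columns=["helpful_votes", "total_votes"])
    reviews = pd.concat([reviews, votes], axis=1)

    # Keep reviews with more than 5 votes and at least 7 words
    reviews = reviews[reviews["total_votes"] > 5]
    reviews = reviews[reviews["reviewText"].str.split().str.len() >= 7]

    # Ground-truth label: the "X of Y" helpfulness score
    reviews["score"] = reviews["helpful_votes"] / reviews["total_votes"]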
Table 1. Statistics of Amazon reviews from the 5 product categories. We used Health instead of Watches as done by Chen et al. [5] because Watches is excluded from the newly published Amazon dataset.

Product category   # reviews (>5 votes)   # reviews (total)
Phone                       261,370            3,447,249
Outdoor                     491,008            3,268,695
Health                      550,297            2,982,326
Home                        749,564            4,253,926
Electronics               1,310,513            7,824,482
Following previous works [5,26,25], all experiments were evaluated using correlation coefficients between the predicted helpfulness scores and the ground-truth scores. We split the dataset D1 into train/test/validation sets (70%/20%/10%). We used the same baselines as the state-of-the-art convolutional model for helpfulness prediction [5], i.e., STR, UGR, LIWC, INQ [26] and ASP [25]. CNN is the CNN model of [10], and C_CNN is the state-of-the-art character-based CNN from [5]. We added two variants (S_Attn and S_Avg) to test different components of our model: S_Attn uses only self-attention without CNN, while S_Avg is self-attention with CNN using average pooling; finally, our full model uses max pooling with context-aware encoding. We re-implemented all baselines as well as C_CNN as described by [5], but excluded the transfer-learning part of their model, since it is designed for tackling the insufficient-data problem. We used ReLU for non-linearity and set the dropout rate to 0.5 (for regularization). We used adaptive moment estimation (Adam) as our optimizer, with the learning rate set to 0.001 and l set to 100. We experimented with different filter sizes and found that f ∈ {1, 2, 3} produces the best results. We also tried recurrent neural networks (RNNs) such as LSTM and BiLSTM, but they performed worse than CNN.
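The following sketch summarizes the training and evaluation protocol under the stated settings (70/20/10 split, Adam with learning rate 0.001); the MSE objective and the choice of Pearson correlation are assumptions made for illustration, as the text does not name them explicitly.

    import numpy as np
    import torch
    from scipy.stats import pearsonr

    # Dummy stand-ins: position-augmented review tensors and X-of-Y labels
    X = torch.randn(1000, 50, 100)
    y = torch.rand(1000)

    # 70/20/10 train/test/validation split
    idx = np.random.permutation(len(y))
    train, test, val = np.split(idx, [700, 900])

    model = ContextAwareHelpfulness()             # from the sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = torch.nn.MSELoss()                  # assumed regression objective

    for epoch in range(3):                        # abbreviated training loop
        optimizer.zero_grad()
        loss = loss_fn(model(X[train]), y[train])
        loss.backward()
        optimizer.step()

    # Evaluate: correlation between predicted and ground-truth scores
    model.eval()
    with torch.no_grad():
        pred = model(X[test]).numpy()
    print("test correlation: %.3f" % pearsonr(pred, y[test].numpy())[0])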
As shown in Table 2, our context-aware encoding based model using max pooling outperforms all handcrafted features and CNN-based models, including C_CNN, by a large margin on D1. This is because, by applying attention at different positions of the word embedding, different kinds of information about word dependencies are extracted, which in turn handles context variation around the same word.
Table 2. Experimental results for the dataset D1.

         Phone   Outdoor   Health   Home   Electronics
STR      0.136    0.210    0.295   0.210      0.288
UGR      0.210    0.299    0.301   0.278      0.310
LIWC     0.163    0.287    0.268   0.285      0.350
INQ      0.182    0.324    0.310   0.291      0.358
ASP      0.185    0.281    0.342   0.233      0.366
CNN      0.221    0.392    0.331   0.347      0.411
C_CNN    0.270    0.407    0.371   0.366      0.442
S_Attn   0.219    0.371    0.349   0.358      0.436
S_Avg    0.194    0.236    0.336   0.318      0.382
Ours
Table 3. Experimental results for the dataset D2 (Fusion_all = STR + UGR + LIWC + INQ).

             Outdoor   Home   Electronics
Fusion_all    0.417    0.596     0.461
CNN           0.433    0.521     0.410
C_CNN         0.605    0.592     0.479
Ours
Table 4. Cross-domain investigation.

         D1-Phone → D2-Home   D1-Health → D2-Electronics
C_CNN          0.389                    0.436
Ours

However, using self-attention alone (S_Attn, Table 2) performs worse than CNN, since learning word dependencies alone is not sufficient for our task; we further need to understand the internal structure of the review text. Since self-attention can handle longer sequence lengths than CNN when modelling dependencies, we opt to capture the dependencies using self-attention and then encode them into a vector representation using CNN to further extract position-invariant features. Two variants are presented, using average pooling (S_Avg) and max pooling. S_Avg performs comparably to handcrafted features, probably due to its tendency to select tokens having low attention scores. Our proposed model with max pooling produces the best results on D1 (Table 2) and significantly better results on D2 (Table 3), since it selects the representation receiving the most attention. This implies that our model can capture the dependency between tokens and the entire sequence; likewise, it understands the internal structure of a review and correlates highly with human scores.

Since D2 does not include the Phone and Health categories, we tested our proposed model trained on the Phone and Health categories of D1 on the Home and Electronics categories of D2, respectively. Specifically, we used the training data of the Phone category from D1 to train our proposed model and the data of the Home category from D2 for testing. Similarly, we used the training data of the Health category from D1 and tested the model on the data of the Electronics category from D2.

As shown in Table 4, the result is quite surprising. It shows that our proposed model can effectively learn cross-domain features and is robust to the out-of-vocabulary (OOV) problem, predicting reasonable helpfulness scores that correlate highly with human scores.
5 Conclusions

Predicting review helpfulness can substantially save a potential customer's time by presenting the most useful reviews first. In this paper, we propose a context-aware encoding based method that learns dependencies between words in order to understand the internal structure of a review. Experimental results on human-annotated data show that our model is a good estimator of review helpfulness and is robust to the out-of-vocabulary (OOV) problem. In the future, we aim to explore learning-to-rank models to effectively rank reviews by helpfulness score while incorporating other factors that may affect helpfulness prediction, including the type of product.
References
1. Ambartsoumian, A., Popowich, F.: Self-attention: A better building block for sentiment analysis neural network classifiers. In: Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. pp. 130–139 (2018)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proc. of ICLR. pp. 1–15 (2015)
3. Chen, C., Qiu, M., Yang, Y., Zhou, J., Huang, J., Li, X., Bao, F.S.: Review helpfulness prediction with embedding-gated CNN. arXiv (2018)
4. Chen, C., Qiu, M., Yang, Y., Zhou, J., Huang, J., Li, X., Bao, F.S.: Multi-domain gated CNN for review helpfulness prediction. In: Proc. of WWW. pp. 2630–2636 (2019)
5. Chen, C., Yang, Y., Zhou, J., Li, X., Bao, F.S.: Cross-domain review helpfulness prediction based on convolutional neural networks with auxiliary domain discriminators. In: Proc. of NAACL-HLT. pp. 602–607 (2018)
6. Diaz, G.O., Ng, V.: Modeling and prediction of online product review helpfulness: A survey. In: Proc. of ACL. pp. 698–708 (2018)
7. Duan, W., Gu, B., Whinston, A.B.: The dynamics of online word-of-mouth and product sales: An empirical investigation of the movie industry. Journal of Retailing, 233–242 (2008)
8. Ghose, A., Ipeirotis, P.G.: Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering (10), 1498–1512 (2011)
9. Kim, S.M., Pantel, P., Chklovski, T., Pennacchiotti, M.: Automatically assessing review helpfulness. In: Proc. of EMNLP. pp. 423–430 (2006)
10. Kim, Y.: Convolutional neural networks for sentence classification. In: Proc. of EMNLP. pp. 1746–1751 (2014)
11. Lee, S., Choeh, J.Y.: Predicting the helpfulness of online reviews using multilayer perceptron neural networks. Expert Systems with Applications (6), 3041–3046 (2014)
12. Liu, H., Gao, Y., Lv, P., Li, M., Geng, S., Li, M., Wang, H.: Using argument-based features to predict and analyse review helpfulness. In: Proc. of EMNLP. pp. 1358–1363 (2017)
13. Lu, Y., Tsaparas, P., Ntoulas, A., Polanyi, L.: Exploiting social context for review quality prediction. In: Proc. of WWW. pp. 691–700 (2010)
14. Martin, L., Pu, P.: Prediction of helpful reviews using emotions extraction. In: Proc. of AAAI. pp. 1551–1557 (2014)
15. McAuley, J., Targett, C., van den Hengel, A.: Image-based recommendations on styles and substitutes. In: Proc. of SIGIR (2015)
16. Moghaddam, S., Jamali, M., Ester, M.: ETF: Extended tensor factorization model for personalizing prediction of review helpfulness. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. pp. 163–172 (2012)
17. Mudambi, S.M., Schuff, D.: Research note: What makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly (1), 185–200 (2010)
18. Mukherjee, S., Popat, K., Weikum, G.: Exploring latent semantic factors to find useful product reviews. In: Proceedings of the 2017 SIAM International Conference on Data Mining. pp. 480–488 (2017)
19. Otterbacher, J.: 'Helpfulness' in online communities: A measure of message quality. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 955–964 (2009)
20. Pan, Y., Zhang, J.Q.: Born unequal: A study of the helpfulness of user-generated product reviews. Journal of Retailing (4), 598–612 (2011)
21. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proc. of EMNLP. pp. 1532–1543 (2014)
22. Salehan, M., Kim, D.J.: Predicting the performance of online consumer reviews: A sentiment mining approach to big data analytics. Decision Support Systems, 30–40 (2016)
23. Tang, J., Gao, H., Hu, X., Liu, H.: Context-aware review helpfulness rating prediction. In: Proceedings of the 7th ACM Conference on Recommender Systems (2013)
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Proc. of NIPS (2017)
25. Yang, Y., Chen, C., Bao, F.S.: Aspect-based helpfulness prediction for online product reviews. In: Proc. of International Conference on Tools with Artificial Intelligence (ICTAI). pp. 836–843 (2016). https://doi.org/10.1109/ICTAI.2016.0130
26. Yang, Y., Yan, Y., Qiu, M., Bao, F.S.: Semantic analysis and helpfulness prediction of text for online product reviews. In: Proc. of ACL-IJCNLP. pp. 38–44 (2015)
27. Yin, W., Schütze, H., Xiang, B., Zhou, B.: ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics 4 (2016)