Effective Use of Word Order for Text Categorization with Convolutional Neural Networks
Rie Johnson
RJ Research Consulting, Tarrytown, NY, USA [email protected]
Tong Zhang
Baidu Inc., Beijing, China; Rutgers University, Piscataway, NJ, USA [email protected]
Abstract
Convolutional neural network (CNN) is a neural network that can make use of the internal structure of data such as the 2D structure of image data. This paper studies CNN on text categorization to exploit the 1D structure (namely, word order) of text data for accurate prediction. Instead of using low-dimensional word vectors as input as is often done, we directly apply CNN to high-dimensional text data, which leads to directly learning embedding of small text regions for use in classification. In addition to a straightforward adaptation of CNN from image to text, a simple but new variation which employs bag-of-word conversion in the convolution layer is proposed. An extension to combine multiple convolution layers is also explored for higher accuracy. The experiments demonstrate the effectiveness of our approach in comparison with state-of-the-art methods.
Text categorization is the task of automatically assigning pre-defined categories to documents written in natural languages. Several types of text categorization have been studied, each of which deals with different types of documents and categories, such as topic categorization to detect discussed topics (e.g., sports, politics), spam detection (Sahami et al., 1998), and sentiment classification (Pang et al., 2002; Pang and Lee, 2008; Maas et al., 2011) to determine the sentiment typically in product or movie reviews. A standard approach to text categorization is to represent documents by bag-of-word vectors, namely, vectors that indicate which words appear in the documents but do not preserve word order, and use classification models such as SVM.

To appear in NAACL HLT 2015.

It has been noted that loss of word order caused by bag-of-word vectors (bow vectors) is particularly problematic on sentiment classification. A simple remedy is to use word bi-grams in addition to uni-grams (Blitzer et al., 2007; Glorot et al., 2011; Wang and Manning, 2012). However, use of word n-grams with n > 1 on text categorization in general is not always effective; e.g., on topic categorization, simply adding phrases or n-grams is not effective (see, e.g., references in (Tan et al., 2002)).

To benefit from word order on text categorization, we take a different approach, which employs convolutional neural networks (CNN) (LeCun et al., 1986). CNN is a neural network that can make use of the internal structure of data such as the 2D structure of image data through convolution layers, where each computation unit responds to a small region of input data (e.g., a small square of a large image). We apply CNN to text categorization to make use of the 1D structure (word order) of document data so that each unit in the convolution layer responds to a small region of a document (a sequence of words).

CNN has been very successful on image classification; see e.g., the winning solutions of ImageNet Large Scale Visual Recognition Challenge (Krizhevsky et al., 2012; Szegedy et al., 2014; Russakovsky et al., 2014).

On text, since the work on token-level applications (e.g., POS tagging) by Collobert et al. (2011), CNN has been used in systems for entity search, sentence modeling, word embedding learning, product feature mining, and so on (Xu and Sarikaya, 2013; Gao et al., 2014; Shen et al., 2014; Kalchbrenner et al., 2014; Xu et al., 2014; Tang et al., 2014; Weston et al., 2014; Kim, 2014). Notably, in many of these CNN studies on text, the first layer of the network converts words in sentences to word vectors by table lookup. The word vectors are either trained as part of CNN training, or fixed to those learned by some other method (e.g., word2vec (Mikolov et al., 2013)) from an additional large corpus. The latter is a form of semi-supervised learning, which we study elsewhere. We are interested in the effectiveness of CNN itself without aid of additional resources; therefore, word vectors should be trained as part of network training if word vector lookup is to be done.

A question arises, however, whether word vector lookup in a purely supervised setting is really useful for text categorization. The essence of convolution layers is to convert text regions of a fixed size (e.g., “am so happy” with size 3) to feature vectors, as described later. In that sense, a word vector learning layer is a special (and unusual) case of convolution layer with region size one. Why is size one appropriate if bi-grams are more discriminating than uni-grams? Hence, we take a different approach.
We directly apply CNN to high-dimensional one-hot vectors; i.e., we directly learn embedding of text regions without going through word embedding learning. This approach is made possible by solving the computational issue through efficient handling of high-dimensional sparse data on GPU, and it turned out to have the merits of improving accuracy with fast training/prediction and simplifying the system (fewer hyper-parameters to tune). Our CNN code for text is publicly available on the internet.

Footnote: We use the term ‘embedding’ loosely to mean a structure-preserving function, in particular, a function that generates low-dimensional features that preserve the predictive structure.
Footnote: CNN implemented for image would not handle sparse data efficiently, and without efficient handling of sparse data, convolution over high-dimensional one-hot vectors would be computationally infeasible.
Footnote: riejohnson.com/cnn_download.html

We study the effectiveness of CNN on text categorization and explain why CNN is suitable for the task. Two types of CNN are tested: seq-CNN is a straightforward adaptation of CNN from image to text, and bow-CNN is a simple but new variation of CNN that employs bag-of-word conversion in the convolution layer. The experiments show that seq-CNN outperforms bow-CNN on sentiment classification, vice versa on topic classification, and the winner generally outperforms the conventional bag-of-n-gram vector-based methods, as well as previous CNN models for text which are more complex. In particular, to our knowledge, this is the first work that has successfully used word order to improve topic classification performance. A simple extension that combines multiple convolution layers (thus combining multiple types of text region embedding) leads to further improvement. Through empirical analysis, we will show that CNN can make effective use of high-order n-grams when conventional methods fail.

Figure 1: Convolutional neural network. [Diagram, bottom to top: input, convolution layer, pooling layer, convolution layer, pooling layer, output layer (linear classifier), output.]

Figure 2: Convolution layer for image. Each computation unit (oval) computes a non-linear function $\sigma(W \cdot r_\ell(x) + b)$ of a small region $r_\ell(x)$ of input image $x$, where weight matrix $W$ and bias vector $b$ are shared by all the units in the same layer.

We first review CNN applied to image data and then discuss the application of CNN to document classification tasks to introduce seq-CNN and bow-CNN.
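The introduction's remark that a word-vector lookup layer is a convolution layer with region size one can be checked numerically. The sketch below is illustrative only and is not taken from the authors' code: the embedding dimension `dim` and the random weights are hypothetical, and the bias and non-linearity of a real convolution layer are omitted so that only the linear part is compared.

```python
import numpy as np

# Toy vocabulary from the paper's running example, in alphabetical order.
vocab = ["don't", "hate", "I", "it", "love"]
index = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
dim = 3  # hypothetical word-vector dimensionality, chosen for illustration
W = rng.standard_normal((dim, len(vocab)))  # column j is the vector of word j


def one_hot(word):
    """Return the |V|-dimensional one-hot vector of a word."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v


doc = ["I", "love", "it"]

# Convolution with region size 1 and stride 1 (linear part only): W . r(x).
conv_out = np.stack([W @ one_hot(w) for w in doc])

# Table lookup: pick the column of W that belongs to each word.
lookup_out = np.stack([W[:, index[w]] for w in doc])

print(np.allclose(conv_out, lookup_out))
```

Both routes give the same matrix, which is why a lookup layer adds nothing beyond a size-one region embedding.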
CNN is a feed-forward neural network with convolution layers interleaved with pooling layers, as illustrated in Figure 1, where the top layer performs classification using the features generated by the layers below. A convolution layer consists of several computation units, each of which takes as input a region vector that represents a small region of the input image and applies a non-linear function to it. Typically, the region vector is a concatenation of pixels in the region, which would be, for example, 75-dimensional if the region is 5 × 5 and the number of channels is three (red, green, and blue). Conceptually, computation units are placed over the input image so that the entire image is collectively covered, as illustrated in Figure 2. The region stride (distance between the region centers) is often set to a small value such as 1 so that regions overlap with each other, though the stride in Figure 2 is set larger than the region size for illustration.

A distinguishing feature of convolution layers is weight sharing. Given input $x$, a unit associated with the $\ell$-th region computes $\sigma(W \cdot r_\ell(x) + b)$, where $r_\ell(x)$ is a region vector representing the region of $x$ at location $\ell$, and $\sigma$ is a pre-defined component-wise non-linear activation function (e.g., applying $\sigma(x) = \max(x, 0)$ to each vector component). The matrix of weights $W$ and the vector of biases $b$ are learned through training, and they are shared by the computation units in the same layer. This weight sharing enables learning useful features irrespective of their location, while preserving the location where the useful features appeared.

We regard the output of a convolution layer as an ‘image’ so that the output of each computation unit is considered to be a ‘pixel’ of $m$ channels where $m$ is the number of weight vectors (i.e., the number of rows of $W$) or the number of neurons.
In other words, a convolution layer converts image regions to $m$-dim vectors, and the locations of the regions are inherited through this conversion.

The output image of the convolution layer is passed to a pooling layer, which essentially shrinks the image by merging neighboring pixels, so that higher layers can deal with more abstract/global information. A pooling layer consists of pooling units, each of which is associated with a small region of the image. Commonly-used merging methods are average-pooling and max-pooling, which respectively compute the channel-wise average/maximum of each region.

Now we consider application of CNN to text data. Suppose that we are given a document $D = (w_1, w_2, \ldots)$ with vocabulary $V$. CNN requires vector representation of data that preserves internal locations (word order in this case) as input. A straightforward representation would be to treat each word as a pixel, treat $D$ as if it were an image of $|D| \times 1$ pixels with $|V|$ channels, and to represent each pixel (i.e., each word) as a $|V|$-dimensional one-hot vector. As a running toy example, suppose that vocabulary $V$ = { “don’t”, “hate”, “I”, “it”, “love” } and we associate the words with dimensions of vector in alphabetical order (as shown), and that document $D$ = “I love it”. Then, we have a document vector:

$x = [\,0\;0\;1\;0\;0 \mid 0\;0\;0\;0\;1 \mid 0\;0\;0\;1\;0\,]^\top$.

As in the convolution layer for image, we represent each region (which each computation unit responds to) by a concatenation of the pixels, which makes $p|V|$-dimensional region vectors where $p$ is the region size fixed in advance.
For example, on the example document vector $x$ above, with $p = 2$ and stride 1, we would have two regions “I love” and “love it” represented by the following vectors, where the components for each word position are ordered as don’t, hate, I, it, love:

$r_0(x) = [\,0\;0\;1\;0\;0 \mid 0\;0\;0\;0\;1\,]^\top$, $r_1(x) = [\,0\;0\;0\;0\;1 \mid 0\;0\;0\;1\;0\,]^\top$.

The rest is the same as image; the text region vectors are converted to feature vectors, i.e., the convolution layer learns to embed text regions into low-dimensional vector space. We call a neural net with a convolution layer with this region representation seq-CNN (‘seq’ for keeping sequences of words) to distinguish it from bow-CNN, described next.
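The seq-CNN region representation can be reproduced on the toy example with a few lines of NumPy. This is a minimal sketch, not the authors' GPU implementation: the helper names (`seq_regions`, `conv_layer`), the number of neurons `m`, and the random weights are assumptions made for illustration, and dense vectors are used where a practical system would use sparse ones.

```python
import numpy as np

VOCAB = ["don't", "hate", "I", "it", "love"]  # alphabetical order
INDEX = {w: i for i, w in enumerate(VOCAB)}


def one_hot(word):
    """|V|-dimensional one-hot vector of a single word."""
    v = np.zeros(len(VOCAB))
    v[INDEX[word]] = 1.0
    return v


def seq_regions(words, p, stride=1):
    """Concatenate the one-hot vectors of p consecutive words per region."""
    return [np.concatenate([one_hot(w) for w in words[i:i + p]])
            for i in range(0, len(words) - p + 1, stride)]


def conv_layer(regions, W, b):
    """Apply sigma(W . r + b) with shared W, b and a rectifier to each region."""
    return np.stack([np.maximum(W @ r + b, 0.0) for r in regions])


regions = seq_regions(["I", "love", "it"], p=2)

m = 4  # hypothetical number of neurons (rows of W)
rng = np.random.default_rng(0)
W = rng.standard_normal((m, 2 * len(VOCAB)))  # p|V| = 10 input dimensions
b = np.zeros(m)

features = conv_layer(regions, W, b)  # one m-dim vector per region
pooled = features.max(axis=0)         # max-pooling with a single pooling unit
```

Here `regions[0]` and `regions[1]` are the vectors for “I love” and “love it”, and the same `W` and `b` are applied at both locations, which is the weight sharing described earlier.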
A potential problem of seq-CNN, however, is that unlike image data with 3 RGB channels, the number of ‘channels’ $|V|$ (size of vocabulary) may be very large (e.g., 100K), which could make each region vector $r_\ell(x)$ very high-dimensional if the region size is large. Since the dimensionality of region vectors determines the dimensionality of weight vectors, having high-dimensional region vectors means more parameters to learn. If $p|V|$ is too large, the model becomes too complex (w.r.t. the amount of training data available) and/or training becomes unaffordably expensive even with efficient handling of sparse data; therefore, one has to lower the dimensionality by lowering the vocabulary size $|V|$ and/or the region size $p$, which may or may not be desirable, depending on the nature of the task.

Footnote (on the one-hot word representation): Alternatively, one could use bag-of-letter-n-gram vectors as in (Shen et al., 2014; Gao et al., 2014) to cope with out-of-vocabulary words and typos.

An alternative we provide is to perform bag-of-word conversion to make region vectors $|V|$-dimensional instead of $p|V|$-dimensional; e.g., the example region vectors above would be converted to:

$r_0(x) = [\,0\;0\;1\;0\;1\,]^\top$, $r_1(x) = [\,0\;0\;0\;1\;1\,]^\top$.

With this representation, we have fewer parameters to learn. Essentially, the expressiveness of bow-convolution (which loses word order only within small regions) is somewhere between seq-convolution and bow vectors.
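The bag-of-word conversion and its effect on the parameter count can be sketched as follows. The helper `bow_regions` is hypothetical, and using a binary indicator (rather than a count) when a word repeats inside a region is an assumption; the toy example does not distinguish the two. The sizes in the parameter comparison (30K vocabulary, region size 3, 1000 weight vectors) are the ones quoted in the experiments.

```python
import numpy as np

VOCAB = ["don't", "hate", "I", "it", "love"]  # alphabetical order
INDEX = {w: i for i, w in enumerate(VOCAB)}


def bow_regions(words, p, stride=1):
    """One |V|-dimensional bag-of-word vector per region of p words."""
    regions = []
    for i in range(0, len(words) - p + 1, stride):
        v = np.zeros(len(VOCAB))
        for w in words[i:i + p]:
            v[INDEX[w]] = 1.0  # assumption: binary indicator, not a count
        regions.append(v)
    return regions


regions = bow_regions(["I", "love", "it"], p=2)

# Weight-matrix sizes (biases excluded) for the two region representations.
vocab_size, p, m = 30_000, 3, 1000
seq_params = m * p * vocab_size  # seq-convolution: m x p|V|
bow_params = m * vocab_size      # bow-convolution: m x |V|
print(seq_params, bow_params)
```

The bow-convolution weight matrix is smaller by a factor of $p$, which is the trade-off against losing word order within each region.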
Whereas the size of images is fixed in image applications, documents are naturally variable-sized, and therefore, with a fixed stride, the output of a convolution layer is also variable-sized as shown in Figure 3. Given the variable-sized output of the convolution layer, standard pooling for image (which uses a fixed pooling region size and a fixed stride) would produce variable-sized output, which can be passed to another convolution layer. To produce fixed-sized output, which is required by the fully-connected top layer, we fix the number of pooling units and dynamically determine the pooling region size on each data point so that the entire data is covered without overlapping.

Footnote: In this work, the top layer is fully-connected (i.e., each neuron responds to the entire data) as in CNN for image. Alternatively, the top layer could be convolutional so that it can receive variable-sized input, but such CNN would be more complex.

Figure 3: Convolution layer for variable-sized text. [Panels: (a) “I love it”; (b) “This isn’t what I expected !”.]

In the previous CNN work on text, pooling is typically max-pooling over the entire data (i.e., one pooling unit associated with the whole text). The dynamic k-max pooling of (Kalchbrenner et al., 2014) for sentence modeling extends it to take the k largest values where k is a function of the sentence length, but it is again over the entire data, and the operation is limited to max-pooling. Our pooling differs in that it is a natural extension of standard pooling for image, in which not only max-pooling but other types can be applied. With multiple pooling units associated with different regions, the top layer can receive locational information (e.g., if there are two pooling units, the features from the first half and last half of a document are distinguished). This turned out to be useful (along with average-pooling) on topic classification, as shown later.

Traditional methods represent each document entirely with one bag-of-n-gram vector and then apply a classifier model such as SVM. However, since high-order n-grams are susceptible to data sparsity, use of a large n such as 20 is not only infeasible but also ineffective. Also note that a bag-of-n-gram represents each n-gram by a one-hot vector and ignores the fact that some n-grams share constituent words. By contrast, CNN internally learns embedding of text regions (given the constituent words as input) useful for the intended task. Consequently, a large n such as 20 can be used especially with the bow-convolution layer, which turned out to be useful on topic classification. A neuron trained to assign a large value to, e.g., “I love” (and a small value to “I hate”) is likely to assign a large value to “we love” (and a small value to “we hate”) as well, even though “we love” was never seen during training. We will confirm these points empirically later.

We have described CNN with the simplest network architecture that has one pair of convolution and pooling layers.
While this can be extended in several ways (e.g., with deeper layers), in our experiments, we explored parallel CNN, which has two or more convolution layers in parallel, as illustrated in Figure 4. The idea is to learn multiple types of embedding of small text regions so that they can complement each other to improve model accuracy. In this architecture, multiple convolution-pooling pairs with different region sizes (and possibly different region vector representations) are given one-hot vectors as input and produce feature vectors for each region; the top layer takes the concatenation of the produced feature vectors as input.

Figure 4: CNN with two convolution layers in parallel. [Diagram: input “I really love it !” as one-hot vectors, two convolution layers (one per region size), pooling layers, output layer; output 1 (positive).]

We experimented with CNN on two tasks, topic classification and sentiment classification. Detailed information for reproducing the results is available on the internet along with our code.
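The pooling scheme with a fixed number of pooling units, and the concatenation used by parallel CNN, can be sketched as below. This is one possible reading rather than the authors' implementation: splitting the regions into near-equal contiguous chunks with `np.array_split`, and emitting zeros for a pooling unit that receives no region, are assumptions, as are the function names and sizes.

```python
import numpy as np


def pool_fixed_units(features, n_units, mode="max"):
    """Pool a (n_regions, m) feature map into a fixed (n_units * m,) vector.

    The regions are split into n_units contiguous, non-overlapping chunks
    that together cover the whole document, so the pooling region size
    adapts to the document length.
    """
    m = features.shape[1]
    reduce = np.max if mode == "max" else np.mean
    chunks = np.array_split(features, n_units, axis=0)
    # Assumption: a pooling unit with no regions (very short text) outputs zeros.
    pooled = [reduce(c, axis=0) if len(c) else np.zeros(m) for c in chunks]
    return np.concatenate(pooled)


# Two documents of different length give outputs of the same size.
short_doc = np.arange(12, dtype=float).reshape(6, 2)  # 6 regions, m = 2
long_doc = np.ones((9, 2))                            # 9 regions, m = 2
fixed_a = pool_fixed_units(short_doc, n_units=2, mode="max")
fixed_b = pool_fixed_units(long_doc, n_units=2, mode="avg")

# Parallel CNN: pool each convolution layer's output separately, then
# concatenate the results as input to the top layer.
layer_p2 = np.ones((5, 3))  # e.g. region size 2 -> 5 regions, 3 neurons
layer_p3 = np.ones((4, 3))  # e.g. region size 3 -> 4 regions, 3 neurons
top_input = np.concatenate([pool_fixed_units(layer_p2, 1),
                            pool_fixed_units(layer_p3, 1)])
```

With two pooling units, the first half of `fixed_a` summarizes the first half of the document and the second half summarizes the rest, which is how the top layer receives locational information.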
We fixed the activation function to rectifier $\sigma(x) = \max(x, 0)$ and minimized square loss with $L_2$ regularization by stochastic gradient descent (SGD). We only used the 30K words that appeared most frequently in the training set; thus, for example, in seq-CNN with region size 3, a region vector is 90K-dimensional. Out-of-vocabulary words were represented by a zero vector. On bow-CNN, to speed up computation, we used variable region stride so that a larger stride was taken where repetition of the same region vectors can be avoided by doing so. Padding size was fixed to $p - 1$ where $p$ is the region size.

Footnote (on parallel CNN): Similar architectures have been used for image. Kim (2014) used it for text, but it was on top of a word vector conversion layer.
Footnote (on variable stride): For example, if we slide a window of size 3 over “* * foo * *” where “*” is out of vocabulary, a bag of “foo” will be repeated three times with stride fixed to 1.
Footnote (on padding): As is commonly done, to the beginning and the end of each document, special words that are treated as unknown words (and converted to zero vectors instead of one-hot vectors) were added as ‘padding’. The purpose is to equally treat the words at the edge and words in the middle.
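The padding and the repetition that variable stride avoids can be illustrated on the footnote's “* * foo * *” example. The sketch below is an assumption-laden illustration, not the authors' stride logic, which is not described in detail here: it represents each bow region as a set of in-vocabulary words and simply drops consecutive repeats. The helpers `bow_bags` and `collapse_repeats` are hypothetical names.

```python
def bow_bags(words, vocab, p, pad=True):
    """Bag of in-vocabulary words for every window of size p (stride 1).

    With pad=True, p - 1 padding tokens are added at both ends; padding and
    out-of-vocabulary words contribute nothing (they map to zero vectors).
    """
    tokens = list(words)
    if pad:
        tokens = [None] * (p - 1) + tokens + [None] * (p - 1)
    return [frozenset(w for w in tokens[i:i + p] if w in vocab)
            for i in range(len(tokens) - p + 1)]


def collapse_repeats(bags):
    """Keep one copy of each run of identical consecutive bags.

    Assumption: this stands in for taking a larger stride wherever the
    region vector would otherwise repeat.
    """
    kept = []
    for bag in bags:
        if not kept or bag != kept[-1]:
            kept.append(bag)
    return kept


vocab = {"foo"}
words = ["*", "*", "foo", "*", "*"]  # "*" is out of vocabulary

unpadded = bow_bags(words, vocab, p=3, pad=False)  # the footnote's setting
padded = bow_bags(words, vocab, p=3, pad=True)     # padding size p - 1 = 2
reduced = collapse_repeats(padded)
print(len(unpadded), len(padded), len(reduced))
```

Without padding the three windows all yield the same bag containing only “foo”; collapsing repeats leaves one region vector to compute instead of three.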
We used two techniques commonly used with CNN on image, which typically led to small performance improvements. One is dropout (Hinton et al., 2012) optionally applied to the input to the top layer. The other is response normalization as in (Krizhevsky et al., 2012), which in our case scales the output of the pooling layer $z$ at each location by multiplying $(1 + |z|^2)^{-1/2}$.

For comparison, we tested SVM with the linear kernel and fully-connected neural networks (see e.g., Bishop (1995)) with bag-of-n-gram vectors as input. To experiment with fully-connected neural nets, as in CNN, we minimized square loss with $L_2$ regularization and optional dropout by SGD, and activation was fixed to rectifier. To generate bag-of-n-gram vectors, on topic classification, we first set each component to $\log(x + 1)$ where $x$ is the word frequency in the document and then scaled them to unit vectors, which we found always improved performance over raw frequency. On sentiment classification, as is often done, we generated binary vectors and scaled them to unit vectors. We tested three types of bag-of-n-gram: bow1 with $n \in \{1\}$, bow2 with $n \in \{1, 2\}$, and bow3 with $n \in \{1, 2, 3\}$; that is, bow1 is the traditional bow vectors, and with bow3, each component of the vectors corresponds to either uni-gram, bi-gram, or tri-gram of words.

We used SVMlight for the SVM experiments.

Footnote: http://svmlight.joachims.org/

NB-LM
We also tested NB-LM, which first appeared (but without performance report) as NB-SVM in WM12 (Wang and Manning, 2012) and later with a small modification produced performance that exceeds state-of-the-art supervised methods on IMDB (which we experimented with) in MMRB14 (Mesnil et al., 2014). We experimented with the MMRB14 version, which generates binary bag-of-n-gram vectors, multiplies the component for each n-gram $f_i$ with $\log(P(f_i \mid Y = 1) / P(f_i \mid Y = -1))$ (NB-weight) where the probabilities are estimated using the training data, and does logistic regression training. We used MMRB14’s software with a modification so that the regularization parameter can be tuned on development data.

Footnote: WM12 instead reported the performance of an ensemble of NB and SVM as it performed better.
Footnote: https://github.com/mesnilgr/nbsvm

For all the methods, the hyper-parameters such as net configurations and regularization parameters were chosen based on the performance on the development data (held-out portion of the training data), and using the chosen hyper-parameters, the models were re-trained using all the training data.
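The two baseline feature constructions described above can be sketched in NumPy. This is an illustrative sketch and not MMRB14's software: the function names are hypothetical, and the probability estimate in `nb_weights` (add-`alpha` smoothing followed by normalization over features, in the style of the log-count ratio of Wang and Manning (2012)) is an assumption, since the text only says that the probabilities are estimated from the training data.

```python
import numpy as np


def log_tf_unit(counts):
    """Set each component to log(x + 1), then scale to a unit vector."""
    v = np.log(np.asarray(counts, dtype=float) + 1.0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def nb_weights(X_bin, y, alpha=1.0):
    """NB-weight log(P(f_i | Y = 1) / P(f_i | Y = -1)) for each feature.

    X_bin: (n_docs, n_features) binary matrix; y: labels in {+1, -1}.
    alpha is a smoothing constant (an assumption, not given in the text).
    """
    pos = alpha + X_bin[y == 1].sum(axis=0)
    neg = alpha + X_bin[y == -1].sum(axis=0)
    return np.log((pos / pos.sum()) / (neg / neg.sum()))


# Tiny example: feature 0 occurs only in positive documents.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
y = np.array([1, 1, -1, -1])
r = nb_weights(X, y)

# NB-LM input: binary vectors with each component multiplied by its NB-weight.
X_nb = X * r

topic_vec = log_tf_unit([3, 0, 1])  # topic-classification style features
```

A positive NB-weight marks an n-gram that is relatively more frequent in the positive class, so feature 0 receives a positive weight and feature 1 a negative one in this example.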
The IMDB dataset (Maas et al., 2011) is a benchmark dataset for sentiment classification. The task is to determine if the movie reviews are positive or negative. Both the training and test sets consist of 25K reviews. For preprocessing, we tokenized the text so that emoticons such as “:-)” are treated as tokens and converted all the characters to lower case.
Elec: electronics product reviews
Elec consists of electronic product reviews. It is part of a large Amazon review dataset (McAuley and Leskovec, 2013). We chose electronics as it seemed to be very different from movies. Following the generation of IMDB (Maas et al., 2011), we chose the training set and the test set so that one half of each set consists of positive reviews and the other half is negative, regarding rating 1 and 2 as negative and 4 and 5 as positive, and that the reviewed products are disjoint between the training set and test set. Note that to extract text from the original data, we only used the text section, and we did not use the summary section. This way, we obtained a test set of 25K reviews (same as IMDB) and training sets of various sizes. The training and test sets are available on the internet. Data preprocessing was the same as IMDB.

Footnote: riejohnson.com/cnn_data.html

RCV1: topic categorization
RCV1 is a corpus of Reuters news articles as described in LYRL04 (Lewis et al., 2004). RCV1 has 103 topic categories in a hierarchy, and one document may be associated with more than one topic. Performance on this task (multi-label categorization) is known to be sensitive to thresholding strategies, which are algorithms additional to the models we would like to test. Therefore, we also experimented with single-label categorization to assign one of 55 second-level topics to each document to directly evaluate models. For this task, we used the documents from a one-month period as the test set and generated various sizes of training sets from the documents with earlier dates. Data sizes are shown in Table 1. As in LYRL04, we used the concatenation of the headline and text elements. Data preprocessing was the same as IMDB except that we used the stopword list provided by LYRL04 and regarded numbers as stopwords.

Table 1: RCV1 data summary.
Table 2 shows the error rates of CNN in comparison with the baseline methods. The first thing to note is that on all the datasets, the best-performing CNN outperforms the baseline methods, which demonstrates the effectiveness of our approach.

To look into the details, let us first focus on CNN with one convolution layer (seq- and bow-CNN in the table). On sentiment classification (IMDB and Elec), the configuration chosen by model selection was: region size 3, stride 1, 1000 weight vectors, and max-pooling with one pooling unit, for both types of CNN; seq-CNN outperforms bow-CNN, as well as all the baseline methods except for one. Note that with a small region size and max-pooling, if a review contains a short phrase that conveys strong sentiment (e.g., “A great movie!”), the review could receive a high score irrespective of the rest of the review. It is sensible that this type of configuration is effective on sentiment classification.

By contrast, on topic categorization (RCV1), the configuration chosen for bow-CNN by model selection was: region size 20, variable-stride ≥ 2, average-pooling with 10 pooling units, and 1000 weight vectors, which is very different from sentiment classification. This is presumably because on topic classification, a larger context would be more predictive than short fragments (→ larger region size), the entire document matters (→ the effectiveness of average-pooling), and the location of predictive text also matters (→ multiple pooling units). The last point may be because news documents tend to have crucial sentences (as well as the headline) at the beginning. On this task, while both seq and bow-CNN outperform the baseline methods, bow-CNN outperforms seq-CNN, which indicates that in this setting the merit of having fewer parameters is larger than the benefit of keeping word order in each region.

Now we turn to parallel CNN. On IMDB, seq2-CNN, which has two seq-convolution layers (region size 2 and 3; 1000 neurons each; followed by one unit of max-pooling each), outperforms seq-CNN. With more neurons (3000 neurons each; Table 3) it further exceeds the best-performing baseline, which is also the best previous supervised result. We presume the effectiveness of seq2-CNN indicates that the length of predictive text regions is variable.

The best performance 7.67 on IMDB was obtained by ‘seq2-bown-CNN’, equipped with three layers in parallel: two seq-convolution layers (1000 neurons each) as in seq2-CNN above and one layer (20 neurons) that regards the entire document as one region and represents the region (document) by a bag-of-n-gram vector (bow3) as input to the computation unit; in particular, we generated bow3 vectors by multiplying the NB-weights with binary vectors, motivated by the good performance of NB-LM. This third layer is a bow-convolution layer with one region of variable size that takes one-hot vectors with n-gram vocabulary as input to learn document embedding. The seq2-bown-CNN for Elec in the table is the same except that the region sizes of seq-convolution layers are 3 and 4.
On both datasets, performance is improved over seq2-CNN. The results suggest that what can be learned through these three layers are distinct enough to complement each other. The effectiveness of the third layer indicates that not only short word sequences but also global context in a large window may be useful on this task; thus, inclusion of a bow-convolution layer with n-gram vocabulary with a large fixed region size might be even more effective, providing more focused context, but we did not pursue it in this work.

Footnote (on the third layer): It can also be regarded as a fully-connected layer that takes bow3 vectors as input.

methods              IMDB       Elec       RCV1
SVM bow3 (30K)       10.14      9.16       10.68
SVM bow1 (all)       11.36      11.71      10.76
SVM bow2 (all)       9.74       9.05       10.59
SVM bow3 (all)       9.42       8.71       10.69
NN bow3 (all)        9.17       8.48       10.67
NB-LM bow3 (all)     8.13       8.11       13.97
bow-CNN              8.66       8.39       [missing]
seq-CNN              8.39       7.64       9.96
seq2-CNN             8.04       7.48       –
seq2-bown-CNN        7.67       [missing]  –

Table 2: Error rate (%) comparison with bag-of-n-gram-based methods. Sentiment classification on IMDB and Elec (25K training documents) and 55-way topic categorization on RCV1 (16K training documents). ‘(30K)’ indicates that the 30K most frequent n-grams were used, and ‘(all)’ indicates that all the n-grams (up to 5M) were used. CNN used the 30K most frequent words.

SVM bow2 [WM12]                 10.84   –
WRRBM+bow [DAL12]               10.77   –
NB+SVM bow2 [WM12]              8.78    ensemble
NB-LM bow3 [MMRB14]             8.13    –
Paragraph vectors [LM14]        7.46    unlabeled data
seq2-CNN (3K × 2) [Ours]        7.94    –
seq2-bown-CNN [Ours]            7.67    –

Table 3: Error rate (%) comparison with previous best methods on IMDB.

Baseline methods
Comparing the baseline methods with each other, on sentiment classification, reducing the vocabulary to the most frequent n-grams notably hurt performance (also observed on NB-LM and NN) even though some reduction is a common practice. Error rates were clearly improved by addition of bi- and tri-grams. By contrast, on topic categorization, bi-grams only slightly improved accuracy, and reduction of vocabulary did not hurt performance. NB-LM is very strong on IMDB and poor on RCV1; its effectiveness appears to be data-dependent, as also observed by WM12.
Comparison with state-of-the-art results
As shown in Table 3, the previous best supervised result on IMDB is 8.13 by NB-LM with bow3 (MMRB14), and our best error rate 7.67 is better by nearly 0.5 percentage points. Le and Mikolov (2014) report 7.46 with the semi-supervised method that learns low-dimensional vector representations of documents from unlabeled data. Their result is not directly comparable with our supervised results due to use of additional resource. Nevertheless, our best result rivals their result.

We tested bow-CNN on the multi-label topic categorization task on RCV1 to compare with LYRL04. We used the same thresholding strategy as LYRL04. As shown in Table 4, bow-CNN outperforms LYRL04’s best results even though our data preprocessing is much simpler (no stemming and no tf-idf weighting).

models               micro-F    macro-F
LYRL04’s best SVM    81.6       60.7
bow-CNN              [missing]  [missing]

Table 4: RCV1 micro-averaged and macro-averaged F-measure results on multi-label task with LYRL04 split.
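For readers unfamiliar with the two averages reported in Table 4, the sketch below computes micro- and macro-averaged F-measure from per-category counts. It is a generic illustration with made-up counts, not the evaluation code used for the table; returning 0 for a category with no true positives, false positives, or false negatives is an assumption, and evaluation protocols differ on that convention.

```python
def f1(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) from counts."""
    denom = 2 * tp + fp + fn
    # Assumption: an empty category scores 0; some protocols define it otherwise.
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    # Micro-average: pool the counts over categories, then compute one F.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    # Macro-average: compute F per category, then average with equal weight.
    macro = sum(f1(*c) for c in counts) / len(counts)
    return micro, macro


# Hypothetical counts: one frequent, well-predicted category and one rare,
# poorly predicted category.
counts = [(90, 10, 10), (1, 4, 5)]
micro, macro = micro_macro_f1(counts)
print(round(micro, 4), round(macro, 4))
```

Micro-averaging is dominated by frequent categories, while macro-averaging weights rare categories equally, which is why the macro figure in such tables is typically much lower.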
Previous CNN
We focus on the sentence classification studies due to its relation to text categorization. Kim (2014) studied fine-tuning of pre-trained word vectors to produce input to parallel CNN. He reported that performance was poor when word vectors were trained as part of CNN training (i.e., no additional method/corpus). On our tasks, we were also unable to outperform the baselines with this type of model. Also, with our approach, a system is simpler with one fewer layer: there is no need to tune the dimensionality of word vectors or meta-parameters for word vector learning.

Kalchbrenner et al. (2014) proposed complex modifications of CNN for sentence modeling. Notably, given word vectors $\in \mathbb{R}^d$, their convolution with $m$ feature maps produces for each region a matrix $\in \mathbb{R}^{d \times m}$ (instead of a vector $\in \mathbb{R}^m$ as in standard CNN). Using the provided code, we found that their model is too resource-demanding for our tasks. On IMDB and Elec the best error rates we obtained by training with various configurations that fit in memory for 24 hours each on GPU (cf. Fig 5) were 10.13 and 9.37, respectively, which is no better than SVM bow2. Since excellent performances were reported on short sentence classification, we presume that their model is optimized for short sentences, but not for text categorization in general.

Footnote (on the model of Kalchbrenner et al. (2014)): We could not train adequate models on RCV1 on either Tesla K20 or M2070 due to memory shortage.

Performance dependency
CNN training is known to be expensive, compared with, e.g., linear models; linear SVM with bow3 on IMDB only takes 9 minutes using SVMlight (single-core) on a high-end Intel CPU. Nevertheless, with our code on GPU, CNN training only takes minutes (to a few hours) on these datasets, as shown in Figure 5.

Figure 5: Training time (minutes) on Tesla K20. The horizontal lines are the best-performing baselines. [Panels: Elec (25K), IMDB (25K), RCV1 (16K); x-axis: minutes; y-axis: error rate (%).]

Figure 6: Error rate in relation to training data size. For readability, only representative methods are shown. [Panels: Elec (SVM bow2 (all), SVM bow3 (all), NB-LM bow3 (all), seq2-CNN) and RCV1 (SVM bow1 (all), SVM bow2 (all), SVM bow3 (all), bow-CNN); x-axis: training data size (log-scale); y-axis: error rate (%).]

Finally, the results with training sets of various sizes on Elec and RCV1 are shown in Figure 6.
In this section we explain the effectiveness of CNN by looking into what it learns from training. First, for comparison, we show the n-grams that SVM with bow3 found to be the most predictive; i.e., the following n-grams were assigned the 10 largest weights by SVM with binary features on Elec for the negative and positive class, respectively:

• poor, useless, returned, not worth, return, worse, disappointed, terrible, worst, horrible
• great, excellent, perfect, love, easy, amazing, awesome, no problems, perfectly, beat

Note that, even though SVM was also given bi- and tri-grams, the top 10 features chosen by SVM with binary features are mostly uni-grams; furthermore, the top 100 features (50 for each class) include 28 bi-grams but only four tri-grams. This means that, with the given size of training data, SVM still relies heavily on uni-grams, which could be ambiguous, and cannot fully take advantage of higher-order n-grams. By contrast, NB-weights tend to promote n-grams with a larger n; the 100 features that were assigned the largest NB-weights are 7 uni-, 33 bi-, and 60 tri-grams. However, as seen above, NB-weights do not always lead to the best performance.

Table 5: Examples of predictive text regions in the training set.
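As a rough illustration of the bag-of-n-gram features discussed above, the following sketch extracts binary uni-, bi-, and tri-gram features and ranks them by a smoothed log-count ratio. It assumes that "NB-weights" refers to the log-count ratios of Wang and Manning (2012); the toy documents, the function names `ngrams` and `nb_log_count_ratios`, and the smoothing value `alpha` are hypothetical and are not taken from the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, max_n=3):
    """Return the set of all 1- to max_n-grams in a token list (binary features)."""
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

def nb_log_count_ratios(docs, labels, alpha=1.0, max_n=3):
    """Smoothed log-count ratio per n-gram (assumed form of the NB-weights)."""
    pos, neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        (pos if y == 1 else neg).update(ngrams(text.split(), max_n))
    vocab = set(pos) | set(neg)
    # Normalize the smoothed counts of each class so that they sum to one.
    p_total = sum(pos[f] + alpha for f in vocab)
    q_total = sum(neg[f] + alpha for f in vocab)
    return {
        f: math.log(((pos[f] + alpha) / p_total) / ((neg[f] + alpha) / q_total))
        for f in vocab
    }

# Toy data (hypothetical): label 1 = positive, 0 = negative.
docs = ["am very satisfied", "am not satisfied", "works great", "not worth it"]
labels = [1, 0, 1, 0]
ratios = nb_log_count_ratios(docs, labels)

# Show the n-grams most indicative of the positive class first.
for feat, r in sorted(ratios.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{r:+.3f}  {feat}")
```

With such a representation, an n-gram receives a weight only if it occurs in the training documents, which is the limitation the discussion of Table 6 returns to.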
In Table 5, we show some of the text regions learned by seq-CNN to be predictive on Elec. This net has one convolution layer with region size 3 and 1000 neurons; thus, embedding by the convolution layer produces a 1000-dim vector for each region, which (after pooling) serves as features in the top layer where weights are assigned to the 1000 vector components. In the table, Ni/Pi indicates the component that received the i-th highest weight in the top layer for the negative/positive class, respectively. The table shows the text regions (in the training set) whose embedded vectors have a large value in the corresponding component, i.e., predictive text regions.

Note that the embedded vectors for the text regions listed in the same row are close to each other as they have a large value in the same component. That is, Table 5 also shows that the proximity of the embedded vectors tends to reflect the proximity in terms of the relations to the target classes (positive/negative sentiment). This is the effect of embedding, which helps classification by the top layer.

With the bag-of-n-gram representation, only the n-grams that appear in the training data can participate in prediction. By contrast, one strength of CNN is that n-grams (or text regions of size n) can contribute to accurate prediction even if they did not appear in the training data, as long as (some of) their constituent words did, because the input of embedding is the constituent words of the region. To see this point, in Table 6 we show text regions from the test set that did not appear in the training data, either entirely or partially as bi-grams, and yet whose embedded features have large values in the heavily-weighted (predictive) components, thus contributing to the prediction.
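The point that a region can contribute even when the exact n-gram is unseen can be illustrated with a minimal sketch of a single convolution-layer unit over a region of size 3. Everything here is an assumption made for illustration: the tiny vocabulary, the hand-set weights (a real network learns them from training data), and the helper names `region_vector`, `neuron`, and `pooled_activation`. The region input is the concatenation of one-hot word vectors, and the unit applies a rectifier followed by max pooling over regions.

```python
VOCAB = ["i", "am", "very", "so", "entirely", "not", "satisfied"]
INDEX = {w: i for i, w in enumerate(VOCAB)}
REGION = 3  # region size, as in the net described above

def region_vector(words):
    """Concatenate one one-hot vector per position (order-preserving input)."""
    vec = [0.0] * (len(words) * len(VOCAB))
    for pos, w in enumerate(words):
        vec[pos * len(VOCAB) + INDEX[w]] = 1.0
    return vec

def neuron(weights, bias, x):
    """One convolution-layer unit: rectifier applied to w . x + b."""
    return max(0.0, sum(wi * xi for wi, xi in zip(weights, x)) + bias)

def pooled_activation(weights, bias, tokens):
    """Apply the unit to every region of the document, then max-pool."""
    regions = [tokens[i:i + REGION] for i in range(len(tokens) - REGION + 1)]
    return max(neuron(weights, bias, region_vector(r)) for r in regions)

# Hand-set weights (illustrative only): the unit fires on "am X satisfied"
# unless the middle word X is the negation "not".
w = [0.0] * (REGION * len(VOCAB))
w[0 * len(VOCAB) + INDEX["am"]] = 1.0
w[1 * len(VOCAB) + INDEX["not"]] = -2.0
w[2 * len(VOCAB) + INDEX["satisfied"]] = 1.0
b = -1.0

seen = neuron(w, b, region_vector(["am", "very", "satisfied"]))
unseen = neuron(w, b, region_vector(["am", "entirely", "satisfied"]))
negated = neuron(w, b, region_vector(["am", "not", "satisfied"]))
pooled = pooled_activation(w, b, ["i", "am", "entirely", "satisfied"])
print(seen, unseen, negated, pooled)
```

Because the weights attach to words at positions rather than to whole tri-grams, the unit responds to "am entirely satisfied" in the same way as to "am very satisfied", whereas a bag-of-n-gram model has no feature for a tri-gram it never saw.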
There are many more of these, and we only show a small part of them that fit certain patterns. One noticeable pattern is (be-verb, adverb, sentiment adjective) such as “am entirely satisfied” and “’m overall impressed”. These adjectives alone could be ambiguous as they may be negated. To know that the writer is indeed “satisfied”, we need to see the sequence “am satisfied”, but the insertion of an adverb such as “entirely” is very common. “best X ever” is another pattern, in which a discriminating pair of words is not adjacent to each other. These patterns require tri-grams for disambiguation, and seq-CNN successfully makes use of them even though the exact tri-grams were not seen during training, as a result of learning, e.g., “am X satisfied” with non-negative X (e.g., “am very satisfied”, “am so satisfied”) to be predictive of the positive class through training. That is, CNN can effectively use word order when bag-of-n-gram-based approaches fail.

were unacceptably bad, is abysmally bad, were universally poor, was hugely disappointed, was enormously disappointed, is monumentally frustrating, are endlessly frustrating
best concept ever, best ideas ever, best hub ever, am wholly satisfied, am entirely satisfied, am incredicbly satisfied, ’m overall impressed, am awfully pleased, am exceptionally pleased, ’m entirely happy, are acoustically good, is blindingly fast
Table 6: Examples of text regions that contribute to prediction. They are from the test set, and they did not appear in the training set, either entirely or partially as bi-grams.

This paper showed that CNN provides an alternative mechanism for effective use of word order for text categorization through direct embedding of small text regions, different from the traditional bag-of-n-gram approach or word-vector CNN. With the parallel CNN framework, several types of embedding can be learned and combined so that they can complement each other for higher accuracy. State-of-the-art performance on sentiment classification and topic classification was achieved using this approach.

Acknowledgements
We thank the anonymous reviewers for useful suggestions. The second author was supported by NSF IIS-1250985 and NSF IIS-1407939.

References
Christopher Bishop. 1995. Neural networks for pattern recognition. Oxford University Press.
John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes, and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, and Li Deng. 2014. Modeling interestingness with deep neural networks. In Proceedings of EMNLP.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of ICML.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modeling sentences. In Proceedings of ACL.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pages 1746–1751.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS.
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of ICML.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1986. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324.
David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of ACL.
Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys.
Grégoire Mesnil, Tomas Mikolov, Marc’Aurelio Ranzato, and Yoshua Bengio. 2014. Ensemble of generative and discriminative techniques for sentiment analysis of movie reviews. arXiv:1412.5335v5 (4 Feb 2015 version).
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.
Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? sentiment classification using machine learning techniques. In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575.
Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. 1998. A bayesian approach to filtering junk e-mail. In Proceedings of AAAI’98 Workshop on Learning for Text Categorization.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. arXiv:1409.4842.
Chade-Meng Tan, Yuan-Fang Wang, and Chan-Do Lee. 2002. The use of bigrams to enhance text categorization. Information Processing and Management, 38:529–546.
Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of ACL, pages 1555–1565.
Sida Wang and Christopher D. Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of ACL (short paper).
Jason Weston, Sumit Chopra, and Keith Adams. 2014. Proceedings of EMNLP, pages 1822–1827.
Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular crf for joint intent detection and slot filling. In ASRU.
Liheng Xu, Kang Liu, Siwei Lai, and Jun Zhao. 2014. Product feature mining: Semantic clues versus syntactic constituents. In