Text Classification with Lexicon from Pre-Attention Mechanism

QINGBIAO LI¹, CHUNHUA WU¹, KANGFENG ZHENG¹
¹School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China.

ABSTRACT

Text classification is an important task in natural language processing (NLP). A comprehensive, high-quality lexicon plays a crucial role in traditional text classification approaches because it improves the use of linguistic knowledge. Although lexicons help with the task, they have received little attention in recent neural network models. First, a high-quality lexicon is hard to obtain: there is no effective automated lexicon extraction method, and most lexicons are handcrafted, which is very inefficient for big data. Second, there is no effective way to use a lexicon in a neural network. To address these limitations, we propose a Pre-Attention mechanism for text classification, which learns the attention of different words according to their effect on the classification task. Words ranked by their attention values form a domain lexicon. Experiments on three benchmark text classification tasks show that our models achieve results competitive with state-of-the-art methods: 90.5% accuracy on the Stanford Large Movie Review dataset, 82.3% on Movie Reviews, and 93.7% on the Subjectivity dataset. Compared with the same text classification models without the Pre-Attention mechanism, the models with it improve accuracy by 0.9%-2.4%, which supports the validity of the mechanism. In addition, the Pre-Attention mechanism performs well when followed by different types of neural networks (e.g., convolutional neural networks and Long Short-Term Memory networks). For the same dataset, the words that receive high attention values under different downstream networks overlap to a high degree, which indicates the versatility and portability of the Pre-Attention mechanism. Attention values therefore yield stable lexicons, which is a promising method of information extraction.

I. INTRODUCTION

Text classification is an important task in NLP and plays an important role in many practical applications, such as email categorization, web search, and file classification [1,2]. Much research has been done on text classification. The SynTime model [3] uses three main syntactic token types to recognize time expressions. The lifelong learning model for sentiment classification [4] adopts a Bayesian optimization framework. In traditional text classification, a common way to represent text is the bag-of-words. However, the bag-of-words representation loses word order and ignores word semantics. The n-gram model is very popular in statistical language modeling and usually performs well [5], but it suffers badly from data sparsity [6].

Recently, neural network methods have become increasingly popular because they can train more complex models on large datasets and can overcome the data sparsity problem of the n-gram model [6]. Neural network models based on deep learning have achieved significant success on many NLP tasks, including learning distributed word, sentence and document representations [7], parsing [8], statistical machine translation [9], and sentiment classification [10]. In text classification, neural networks have been widely used and perform well, and some fast and efficient neural network-based methods have been proposed. For example, fastText is a linear word-level model with a rank constraint and a fast loss approximation, which achieves competitive results with a simple structure [11]. Although deep neural networks have been very successful in text classification, these methods do not make full use of linguistic knowledge: they do not exploit the fact that not all words are equally important for classification.
For example, a high-quality sentiment lexicon is very important for a sentiment classification task, and it would be easier to distinguish one person's texts from other people's if we had a lexicon of that person's most-used words. Traditional text classification approaches use a classification lexicon containing the words that play a crucial role in the task [12], which has a positive impact on classification accuracy. For example, emotion features extracted using general purpose emotion lexicons (GPELs), when combined with traditional bag-of-words features, improve emotion classification significantly [13], [14]. However, lexicons for classification have received little attention in recent neural network models. There are two main difficulties. First, a high-quality lexicon is hard to obtain; in other words, it is difficult to find the words that are important for a classification task. There is no effective automated lexicon extraction method, and most lexicons are handcrafted. For example, existing GPELs such as WordNet-Affect (WNA) [15], EmoSenticNet (ESN) [16] and the NRC word-emotion lexicon [17], which are handcrafted, associate words with the emotions identified by Ekman and Plutchik. Manual extraction is very inefficient, especially when the amount of data is large. Second, there is no effective way to use a lexicon in a neural network. Some neural network methods try to solve these problems with an attention mechanism [18, 19, 20]. However, most traditional attention mechanisms rely on RNN or encoder-decoder structures. Because they apply attention to the output states of the RNN, it is difficult to transfer the same attention structure between networks of different architectures, which limits the use of attention mechanisms.
In addition, when computing the attention of a word, a traditional attention mechanism takes the context of the word into account, which leads to different attention values for the same word in different sentences. Consequently, for a classification task on a dataset, we cannot determine which of two words is more important for the task by comparing their attention values, and we cannot obtain a lexicon either.

To address these limitations, we propose a Pre-Attention mechanism, which automatically finds a lexicon for a classification task and integrates it into the neural network. The mechanism sits between the word vectors and the classification network. First, we obtain the text representation with the Pre-Attention mechanism. Then we feed the text representation to a CNN or an LSTM for classification. In addition, for each word in a dataset, we can calculate its attention value, which reflects the contribution of the word to the classification task, and build a lexicon from these values. Words with high attention values are more likely to belong to the lexicon than words with low attention values.

Our contributions are summarized below:

1) We provide a way to integrate a lexicon into deep neural networks for classification. The Pre-Attention mechanism automatically assigns different attention values to words according to their importance for a classification task, which improves the use of linguistic knowledge.

2) We provide a method of obtaining a stable lexicon through neural networks. For a text classification task, after obtaining the words' attention values with the Pre-Attention mechanism, we can derive a lexicon for classification. Experiments show that the lexicon is stable even with different post-classification methods.

The rest of the paper is organized as follows. Section II presents related work. Section III describes the model architecture. Section IV outlines the experimental setup. Section V discusses the empirical results and analysis. Finally, Section VI presents the conclusion and future work.

II. RELATED WORK

Natural language processing (NLP) is a sub-area of artificial intelligence (AI) and one of its most difficult problems. With the development of the Internet, the amount of text data on the network has increased rapidly, and NLP is becoming more and more important. Text classification is significant for NLP systems and has been the subject of an enormous amount of research. A simple and efficient baseline for text classification is to represent the sentence as a bag-of-words and then train a linear classifier (e.g., a Support Vector Machine or logistic regression). However, the bag-of-words representation loses word order and ignores word semantics. The n-gram model is another popular way to represent a sentence and usually performs well [21]. Other popular methods for obtaining better sentence representations include topics extracted by topic models [22] and dependency parse trees [23]. One approach based on pattern matching applies extra NLP systems to derive lexical features [24], using many features derived from external corpora for a Support Vector Machine (SVM) classifier. However, all these simple techniques have limitations for certain tasks. An effective solution is to factorize the linear classifier into low-rank matrices [25].

Neural network models have achieved significant success on text classification. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have emerged as two widely used architectures and are often combined with sequence-based or tree-structured models [26,27]. A CNN treats feature extraction and classification as a joint task and extracts hierarchical representations of the input by stacking multiple convolution kernels. Convolutional layers act like a sliding window over a matrix.
Because a CNN captures local features, the n-gram language model has been successfully implemented in the CNN model [28]. So far, CNNs have achieved some very successful results in text classification [29, 30]. A convolutional neural network architecture with multiple convolution layers has been proposed that takes latent, dense and low-dimensional word vectors (initialized to random values) as inputs [31]; experiments show that it performs better than models based on unigrams and bigrams. For the classification of high-dimensional text, CNNs have achieved state-of-the-art performance on several benchmark datasets for sentiment categorization [32].

An RNN improves time complexity and analyzes a text word by word, taking into account the influence of the preceding sequence on the current word. It can handle long-term dependencies within sequences of a certain length, which makes it suitable for time series. Experiments show that an RNN can capture long-term dependencies even with a single layer [33]. However, when the input sequence is long, the RNN may suffer from exploding or vanishing gradients. To avoid this problem, variants such as Long Short-Term Memory (LSTM) [34] and the Gated Recurrent Unit (GRU) [35] were designed, which have achieved excellent results in text classification [36].

Attention is an effective mechanism for selecting significant information in order to obtain superior results. Deep neural networks, including CNNs and RNNs, can achieve better results when equipped with attention mechanisms. Notable examples among the many proposed attention mechanisms include soft and hard attention [37], global and local attention [38], and source-target attention and self-attention [39]. In image processing, integrating pre-attention mechanisms into the optimization criterion in the form of a saliency map gave good results on the task of sequential spatial reasoning in images [40].
In natural language processing, attentive neural networks have achieved great success on a wide range of tasks, such as question answering and machine translation [41, 42, 43]. A GRU combined with an attention mechanism can capture the importance of words [18]. To fuse the advantages of RNN, CNN and attention, the ARC model was proposed [19], which obtained good results on text classification. In addition, one study has confirmed that attention can be learned effectively even in a low-resource scenario [44].

III. MODEL ARCHITECTURE

In this section we describe the model framework in detail. The model structure is shown in Fig. 1. The model consists of three parts: the word embedding layer, the Pre-Attention mechanism, and the post-classification net. In the word embedding layer, we obtain the text representation with word2vec. Then we weight the text representation with the Pre-Attention mechanism to obtain the attention representation. Finally, we feed the attention representation into the post-classification net for classification. The rest of this section details these three parts.

FIGURE 1. The structure of the text classification model with the Pre-Attention mechanism (stages shown: word embedding, Pre-Attention mechanism, post-classification net).

A. THE WORD EMBEDDING LAYER

Experiments have shown that unsupervised, pre-trained word embeddings improve model accuracy, so we first obtain the word embedding matrix $M \in \mathbb{R}^{|V| \times l}$ ($\mathbb{R}$ is the space of real numbers), where $V$ is a fixed-size vocabulary and $l$ is the size of the word embedding. We initialize $M$ with an already trained word2vec model, where $M_i$ is the vector of the word $V_i$. For the i-th word in the text $word = (word_1, word_2, \dots, word_n)$, we transform it into its word embedding $w_i$ with the matrix-vector product

$$w_i = v_i M \tag{1}$$

where $v_i$ is a vector of size $|V|$ that has value 1 at the index of $word_i$ in $V$ and 0 in all other positions. The sentence is then fed into the next layer as the real-valued sequence $S = (w_1, w_2, \dots, w_n)$.

B. PRE-ATTENTION LAYER

Not all words contribute equally to the meaning of a sentence. In this section, we therefore apply a Pre-Attention mechanism to the input $S$ to calculate the attention weight of each word. These weights make the classifier focus on the words that play an important role in the classification task. The Pre-Attention mechanism is shown in Fig. 2.

FIGURE 2. The structure of the Pre-Attention mechanism (operations shown: multiplication, summation, sigmoid).

For the input $S$, $w_i$ is the word vector of the i-th word, and we calculate its attention weight as

$$W_i = f(v w_i + b) \tag{2}$$

$$f(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

where the attention vector $v \in \mathbb{R}^{l}$ is a parameter to be learned, $l$ is the size of the word embedding, $b \in \mathbb{R}$ is a bias term to be learned, and $f$ is an activation function, for which we use the sigmoid. We then use the obtained attention weights to weight the input word vectors and obtain the text representation with Pre-Attention weights, $U$:

$$u_i = W_i w_i \tag{4}$$

$$U = (u_1, u_2, \dots, u_n) \tag{5}$$

C. POST-CLASSIFICATION MODEL
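The post-classification networks described next consume the weighted representation $U$. As a minimal sketch of how $U$ is produced, Eqs. (1) to (5) can be written in NumPy as below. The sketch assumes toy dimensions and random values in place of trained word2vec vectors and learned parameters, and reads $v w_i$ as an inner product; all function and variable names are illustrative and not from the paper.

```python
import numpy as np

def sigmoid(x):
    """Eq. (3): logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def embed(token_ids, M):
    """Eq. (1): a one-hot row vector times M reduces to a row lookup."""
    return M[token_ids]                      # shape (n, l)

def pre_attention(S, v, b):
    """Eqs. (2), (4), (5): one scalar weight per word, then reweighting."""
    W = sigmoid(S @ v + b)                   # shape (n,), each value in (0, 1)
    U = W[:, None] * S                       # shape (n, l), u_i = W_i * w_i
    return W, U

# Toy setup: random stand-ins for the word2vec matrix and learned parameters.
rng = np.random.default_rng(0)
vocab_size, l = 6, 4
M = rng.normal(size=(vocab_size, l))
v = rng.normal(size=l)
b = 0.1

token_ids = np.array([2, 0, 5, 2])           # word 2 appears twice
S = embed(token_ids, M)
W, U = pre_attention(S, v, b)
```

Because $W_i$ depends only on $w_i$, $v$ and $b$, a repeated word (index 2 above) receives the same weight at every position as long as $M$ is kept static. This context independence is what allows the vocabulary to be ranked into a lexicon.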

To verify the portability of the Pre-Attention mechanism, we choose two typical text classification models as post-classification networks: the CNN-based Text-CNN model [28] and the LSTM-based Att-BLSTM model [45].

1) Text-CNN

Text-CNN is a classical CNN-based model for text classification. The model structure is shown in Fig. 3.

FIGURE 3. The structure of the Text-CNN model: the $n \times l$ representation of the sentence ($u_1, \dots, u_n$), a convolutional layer with multiple filter widths and feature maps, max pooling that maps the feature maps to a fixed-length vector, and a fully connected layer with dropout and softmax.

The input is $U = (u_1, u_2, \dots, u_n)$ from the Pre-Attention layer, where $u_i \in \mathbb{R}^{l}$, $l$ is the dimension of the word vector, and $n$ is the length of the sentence. We define

$$U_{i:j} = u_i \oplus u_{i+1} \oplus \dots \oplus u_j \tag{6}$$

where $\oplus$ is the concatenation operation. A convolution operation involves a filter $c \in \mathbb{R}^{hl}$, which is applied to a window of $h$ words to produce a new feature. For example, a feature $o_i$ is generated from a window of words $U_{i:i+h-1}$ by

$$o_i = f(c \cdot U_{i:i+h-1} + b) \tag{7}$$

The convolution kernel of height $h$ scans the input matrix $U$ from top to bottom, which is equivalent to extracting n-gram features of size $h$ from $U$. We apply a max-over-time pooling operation [46] to each feature map to take its maximum value, then concatenate the maximum values into a vector and feed it to a fully connected layer for classification. We also employ dropout on the fully connected layer, with a constraint on the $l_2$-norms of the weight vectors. Finally, we obtain the sentence category from the softmax layer.
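The convolution and pooling steps of Eqs. (6) and (7) can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: it uses one randomly initialized filter per width instead of 128 trained feature maps, assumes ReLU for the activation $f$ (which is not specified here), and omits dropout, the fully connected layer and the softmax. All names are hypothetical.

```python
import numpy as np

def feature_map(U, c, b, h):
    """Eq. (7) at every position: o_i = f(c . U_{i:i+h-1} + b)."""
    n = U.shape[0]
    # Eq. (6): concatenate h consecutive word vectors into one window vector.
    windows = np.stack([U[i:i + h].ravel() for i in range(n - h + 1)])
    return np.maximum(windows @ c + b, 0.0)  # ReLU as f (an assumption)

def max_over_time(o):
    """Keep the largest activation of one feature map."""
    return o.max()

rng = np.random.default_rng(0)
n, l = 7, 4                                  # toy sentence length and embedding size
U = rng.normal(size=(n, l))                  # stand-in for the Pre-Attention output
widths = (1, 2, 3, 4, 5)                     # filter widths h listed in Section IV
filters = {h: rng.normal(size=h * l) for h in widths}

# One pooled value per filter, concatenated into a fixed-length vector.
pooled = np.array([max_over_time(feature_map(U, filters[h], 0.0, h)) for h in widths])
```

The pooled vector has one entry per filter regardless of the sentence length $n$, which is why it can be fed to a fixed-size fully connected layer.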

2) Att-BLSTM

Att-BLSTM is a classical LSTM-based model for text classification. The model structure is shown in Fig. 4.

FIGURE 4. The structure of the Att-BLSTM model: the $n \times l$ representation of the sentence ($u_1, \dots, u_n$), a BLSTM layer whose forward and backward hidden states are combined with $\oplus$, an attention layer, and a fully connected layer with dropout and softmax.

FIGURE 5. Unit of Long Short-Term Memory.

Usually, as shown in Fig. 5, the LSTM unit contains three parts.

One forget gate $f_t$ with corresponding weight matrix $W_f$ and bias $b_f$, which decides which state information to discard:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{8}$$

One input gate $i_t$ and candidate state $\tilde{C}_t$ with corresponding weight matrices $W_i$, $W_C$ and biases $b_i$, $b_C$, which decide which kind of cell state should be added:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{9}$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{10}$$

The cell state is then updated by the following equation, where $f_t * C_{t-1}$ forgets the state information we want to discard, and $i_t * \tilde{C}_t$ adds the new content we want to remember:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{11}$$

One output gate $O_t$ with corresponding weight matrix $W_o$ and bias $b_o$, which decides which part of the cell state to output:

$$O_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{12}$$

$$h_t = O_t * \tanh(C_t) \tag{13}$$

For many sequence modelling tasks, it is beneficial to have access to future as well as past context. However, standard LSTM networks process sequences in temporal order and ignore future context. To address this limitation, bidirectional LSTM networks extend unidirectional LSTM networks with a second layer in which the hidden-to-hidden connections flow in the opposite temporal order. The model can therefore exploit information from both the past and the future. As shown in Fig. 4, we first feed $U$ into a bidirectional LSTM network, which contains two sub-networks for the left and right sequence context. The output for the i-th word is

$$h_i = \overrightarrow{h_i} \oplus \overleftarrow{h_i} \tag{14}$$

$$H = [h_1, h_2, \dots, h_n] \tag{15}$$

We then apply an attention mechanism to the output vectors $H$ produced by the BLSTM layer, where $w \in \mathbb{R}^{l}$ is a trained parameter vector and $l$ is the size of the word embedding:
$$M = \tanh(H) \tag{16}$$

$$\alpha = \mathrm{softmax}(w^{T} M) \tag{17}$$

Then we obtain the sentence representation $\gamma$ as a weighted sum of the output vectors:

$$\gamma = H \alpha^{T} \tag{18}$$

Finally, we obtain the final sentence representation:

$$h^{*} = \tanh(\gamma) \tag{19}$$

We then feed the vector $h^{*}$ to a fully connected layer for classification. We also employ dropout on the fully connected layer, with a constraint on the $l_2$-norms of the weight vectors. Finally, we obtain the sentence category from the softmax layer.

IV. EXPERIMENTAL SETUP

A. DATASETS

We test the network on three different datasets, whose details are shown in Table 1.

• Stanford Large Movie Review dataset (IMDB): The IMDB dataset [47] consists of 50,000 binary labeled reviews, split 50:50 into training and testing sets. The distribution of labels is balanced in each subset. One key aspect of this dataset is that each review contains several sentences.

• Subjectivity dataset (Subj): In the Subjectivity dataset [48], the task is to classify a sentence as subjective or objective. The data were collected from snippets of movie reviews from Rotten Tomatoes and plot summaries of movies from the Internet Movie Database. It consists of 10,000 binary labeled sentences: 5,000 subjective and 5,000 objective.

• Movie reviews (MR): This dataset [49] consists of 10,662 reviews from Rotten Tomatoes webpages, 5,331 positive and 5,331 negative. Every review consists of one sentence. Reviews marked “fresh” are positive, and reviews marked “rotten” are negative.

For IMDB, we use 20% of the labeled training documents as a validation set. We split Subj into three sets: 7k sentences for training, 2k for testing, and 1k for validation. MR is also split into three sets: 70% of the sentences for training, 20% for testing, and 10% for validation. Table 1 presents additional details about the three benchmark datasets.
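The split sizes in Table 1 follow from these percentages. The sketch below reproduces them under one assumption that is not stated in the text: the test and validation sizes are rounded down and the remainder goes to training. The function name is illustrative.

```python
def split_sizes(n_total, test_pct, dev_pct):
    """Return (train, test, dev) sizes; test and dev are rounded down."""
    n_test = n_total * test_pct // 100
    n_dev = n_total * dev_pct // 100
    return n_total - n_test - n_dev, n_test, n_dev

mr = split_sizes(10662, 20, 10)      # (7464, 2132, 1066)
subj = split_sizes(10000, 20, 10)    # (7000, 2000, 1000)

# IMDB: the 25,000 test reviews are fixed; 20% of the 25,000 training
# reviews are held out for validation.
imdb_dev = 25000 * 20 // 100
imdb = (25000 - imdb_dev, 25000, imdb_dev)   # (20000, 25000, 5000)
```

Integer arithmetic is used so that the rounding does not depend on floating-point representation.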

B. HYPERPARAMETER AND TRAINING MODEL VARIATIONS

We experiment with several variants of the model:

• Pre-Attention-Text-CNN: A model that uses Text-CNN as the post-classification model with Pre-Attention, using pre-trained word2vec vectors in the word embedding layer. Unknown words are initialized randomly. The word embedding matrix $M$ is kept static, and only the other parameters are learned.

• Text-CNN: Same as above, but with the Pre-Attention mechanism removed.

• Pre-Attention-Att-BLSTM: A model that uses Att-BLSTM as the post-classification model with Pre-Attention, using pre-trained word2vec vectors in the word embedding layer. Unknown words are initialized randomly. The word embedding matrix $M$ is kept static, and only the other parameters are learned.

• Att-BLSTM: Same as above, but with the Pre-Attention mechanism removed.

THE WORD EMBEDDINGS

In our experiments, we use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors were trained using the continuous bag-of-words architecture [50]. The embedding size is configurable; we use 300-dimensional vectors.

Text-CNN

For all datasets we use: filter windows ($h$) of 1, 2, 3, 4, 5 with 128 feature maps each, a dropout rate of 0.4, L2 regularization of 1, a learning rate of 0.001, and a mini-batch size of 64. All of these values were chosen via a grid search on IMDB with Text-CNN.

Pre-Attention-Text-CNN

Except for the Pre-Attention mechanism, the parameters are the same as for Text-CNN.

Att-BLSTM

For all datasets we use: a dropout rate of 0.5, L2 regularization of 0.1, a learning rate of 0.01, a mini-batch size of 64, and an LSTM hidden layer size of 50. All of these values were chosen via a grid search on IMDB with Att-BLSTM.

Pre-Attention-Att-BLSTM

Except for the Pre-Attention mechanism, the parameters are the same as for Att-BLSTM.
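For reference, the settings above can be collected into small configuration objects. This is only a sketch: the class and field names are hypothetical, and the values are the ones listed in this section.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TextCNNConfig:
    filter_widths: tuple = (1, 2, 3, 4, 5)
    feature_maps: int = 128          # per filter width
    dropout: float = 0.4
    l2: float = 1.0
    learning_rate: float = 0.001
    batch_size: int = 64
    embedding_dim: int = 300         # word2vec vector size
    pre_attention: bool = False

@dataclass(frozen=True)
class AttBLSTMConfig:
    hidden_size: int = 50
    dropout: float = 0.5
    l2: float = 0.1
    learning_rate: float = 0.01
    batch_size: int = 64
    embedding_dim: int = 300         # word2vec vector size
    pre_attention: bool = False

# The Pre-Attention variants differ from their baselines in one flag only.
variants = {
    "Text-CNN": TextCNNConfig(),
    "Pre-Attention-Text-CNN": replace(TextCNNConfig(), pre_attention=True),
    "Att-BLSTM": AttBLSTMConfig(),
    "Pre-Attention-Att-BLSTM": replace(AttBLSTMConfig(), pre_attention=True),
}
```

Keeping every other field identical between a baseline and its Pre-Attention variant mirrors the experimental design, in which any accuracy difference is attributed to the Pre-Attention mechanism.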

V. EMPIRICAL RESULTS AND ANALYSIS

A. MODEL AND RESULTS

Table 2 shows the results of our models and of other state-of-the-art text classification methods. The best model on the IMDB dataset is Pre-Attention-Text-CNN, whose classification accuracy is only 0.2% lower than that of DSCNN-Pretrain; compared with the other models, its accuracy is 0.3%-7.31% higher. For the Movie reviews (MR) dataset, our best model is also Pre-Attention-Text-CNN, with a classification accuracy of 82.3%, which is 0.1%-4.2% higher than the other classification models on MR. For the Subj dataset, in which each review contains one sentence, Pre-Attention-Text-CNN is still our best performer, with an accuracy of 93.7%, which exceeds the comparison models other than DSCNN-Pretrain. The experimental results verify the effectiveness of the Pre-Attention mechanism.

Although it is generally believed that RNNs are more suitable for NLP tasks because they make better use of word order information, in this paper we find that the CNN-based model performs better than the RNN-based model. This may be because, for the classification tasks in this paper, phrases with significant emotional polarity have a more critical impact on the result. A CNN mainly extracts local features, similar to n-grams, so it is understandable that the CNN works better on these tasks.

TABLE 1. Summary statistics for the datasets. T: the type of review. c: number of target classes. L: average sentence length. N: dataset size. |V|: vocabulary size. Train: training set size. Test: test set size. Dev: validation set size.

Data   T          c   L     N       |V|     Train   Test   Dev
IMDB   Document   2   230   50000   89527   20k     25k    5k
Subj   Sentence   2   23    10000   21323   7k      2k     1k
MR     Sentence   2   20    10662   18765   7464    2132   1066

TABLE 2. The classification accuracy (%) of our model compared to other approaches on IMDB, MR and Subj.

Method                         IMDB    MR      Subj
Text-CNN                       89.0    79.9    92.5
Pre-Attention-Text-CNN         90.5    82.3    93.7
Att-BLSTM                      86.5    79.1    90.3
Pre-Attention-Att-BLSTM        88.3    80.0    92.1
DNN [51]                       88.55   -       -
Naïve Bayes classifier [51]    83.19   -       -
RNN [52]                       88.59   -       -
CNN [52]                       87.44   -       -
SVM [53]                       87.97   -       -
SVM (TF-IDF) [54]              88.45   -       -
DSCNN [55]                     90.2    81.5    93.2
DSCNN-Pretrain [55]
ESN [56]                       -       78.1    92.6
CNN-BiGRU [57]                 -       79.4    93.8
CNN-Ana [58]                   -       81.02   93.66
DSCNN [59]                     -       81.5    -
combine-skip [60]              -       -       93.6

FIGURE 6. Results of comparing the Pre-Attention-Classification model and the Single-Classification model on three datasets (IMDB, MR, Subj). (A) Text-CNN vs. Pre-Attention-Text-CNN on IMDB. (B) Att-BLSTM vs. Pre-Attention-Att-BLSTM on IMDB. (C) Text-CNN vs. Pre-Attention-Text-CNN on MR. (D) Att-BLSTM vs. Pre-Attention-Att-BLSTM on MR. (E) Text-CNN vs. Pre-Attention-Text-CNN on Subj. (F) Att-BLSTM vs. Pre-Attention-Att-BLSTM on Subj.
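The gain of each Pre-Attention variant over its baseline can be computed directly from the reported accuracies. In the sketch below, the baseline and Pre-Attention-Att-BLSTM figures are the Table 2 entries, and the Pre-Attention-Text-CNN figures (90.5, 82.3, 93.7) are the ones quoted in the text; the variable names are illustrative.

```python
accuracy = {
    "Text-CNN": {"IMDB": 89.0, "MR": 79.9, "Subj": 92.5},
    "Pre-Attention-Text-CNN": {"IMDB": 90.5, "MR": 82.3, "Subj": 93.7},
    "Att-BLSTM": {"IMDB": 86.5, "MR": 79.1, "Subj": 90.3},
    "Pre-Attention-Att-BLSTM": {"IMDB": 88.3, "MR": 80.0, "Subj": 92.1},
}
datasets = ("IMDB", "MR", "Subj")

# Accuracy gain in percentage points, rounded to one decimal place.
gains = {
    base: {
        d: round(accuracy["Pre-Attention-" + base][d] - accuracy[base][d], 1)
        for d in datasets
    }
    for base in ("Text-CNN", "Att-BLSTM")
}
all_gains = [g for per_model in gains.values() for g in per_model.values()]
smallest, largest = min(all_gains), max(all_gains)   # 0.9 and 2.4
```

The smallest gain is for Att-BLSTM on MR and the largest for Text-CNN on MR, which gives the 0.9%-2.4% range reported for the Pre-Attention mechanism.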

B. PRE-ATTENTION-CLASSIFICATION MODEL VS. SINGLE-CLASSIFICATION MODEL

To demonstrate the effectiveness of the Pre-Attention mechanism, we compare the classification accuracy of each Pre-Attention-classification model with that of the corresponding single-classification model on the same dataset. There are two comparative experiments: Text-CNN vs. Pre-Attention-Text-CNN and Att-BLSTM vs. Pre-Attention-Att-BLSTM. We perform the experiments on the three datasets above. For every comparison, we train the two classifiers and calculate the classification accuracy on the test set every 5 steps. The experimental results are shown in Fig. 6, where $S$ is the number of training steps, the x-axis represents $S/5$, and the y-axis represents the classification accuracy on the test set. We can see that the Pre-Attention mechanism improves the accuracy of the classification model. According to Table 2, the text classification models with the Pre-Attention mechanism improve accuracy by 0.9%-2.4% over those without it, which demonstrates the effectiveness of the proposed Pre-Attention mechanism.

C. VISUALIZATION OF PRE-ATTENTION

Another advantage of Pre-Attention is that it is easy to visualize, which helps us analyze which words are more important for classification. We randomly select some texts from the three datasets, calculate the Pre-Attention values, and visualize the word attention $W_t$ with an open source sequence annotation tool [61]; the result is shown in Fig. 7. The Pre-Attention models give more attention to words with strong emotions and to degree adverbs, such as absolutely, horrible, sure, like, serious, good, better, and successful. This shows that the Pre-Attention mechanism can learn explicit emotional tendencies in sentences, successfully integrates an emotional lexicon into a deep neural network, and visualizes well.

FIGURE 7. Heatmap of three datasets (IMDB, MR, Subj) on the Pre-Attention-Classification models (Pre-Attention-Text-CNN, Pre-Attention-Att-BLSTM). (A) Heatmap of IMDB on Pre-Attention-Text-CNN. (B) Heatmap of IMDB on Pre-Attention-Att-BLSTM. (C) Heatmap of MR on Pre-Attention-Text-CNN. (D) Heatmap of MR on Pre-Attention-Att-BLSTM. (E) Heatmap of Subj on Pre-Attention-Text-CNN. (F) Heatmap of Subj on Pre-Attention-Att-BLSTM.

FIGURE 8. Word clouds of three datasets (IMDB, MR, Subj) on the Pre-Attention-Classification models (Pre-Attention-Text-CNN, Pre-Attention-Att-BLSTM). (A) Word clouds of IMDB on Pre-Attention-Text-CNN. (B) Word clouds of IMDB on Pre-Attention-Att-BLSTM. (C) Word clouds of MR on Pre-Attention-Text-CNN. (D) Word clouds of MR on Pre-Attention-Att-BLSTM. (E) Word clouds of Subj on Pre-Attention-Text-CNN. (F) Word clouds of Subj on Pre-Attention-Att-BLSTM.

D. SENTIMENT LEXICON

For the classification tasks above, after obtaining the words' attention values with the Pre-Attention mechanism, we draw word clouds based on the attention values, which are shown in Fig. 8. Words with strong emotions and degree adverbs receive more attention, such as funnier, anticlimactic, terrible, unappetizing, unsatisfying, dull, and amazing, which shows that the Pre-Attention mechanism can learn explicit emotional tendencies in sentences. We then show that the extracted lexicon is of high quality in two respects: effectiveness and stability.

EFFECTIVENESS

To assess the validity of the lexicon extracted from the Pre-Attention values, we compare it with a handcrafted lexicon. The Subjectivity Lexicon [62] contains 8222 words, and every word has two types of labels (type and priorpolarity). Table 3 presents additional details about this lexicon.

TABLE 3. Summary statistics for the Subjectivity Lexicon. w: the number of weaksubj words. s: the number of strongsubj words. p: the number of positive words. n: the number of negative words. b: the number of words with both priorpolarities. N: the number of neutral words.

                       type        priorpolarity
                       w     s     p     n     b     N
Subjectivity Lexicon

We select the words labeled strongsubj to form the subj-lexicon set $S_l$, and the words whose priorpolarity is positive, negative or both to form the priorpolarity-lexicon set $P_l$. We define $W^L = (W_1^L, W_2^L, \dots, W_k^L)$ as the sequence of words sorted by Pre-Attention value, where the Pre-Attention value of $W_i^L$ is larger than that of $W_j^L$ when $i < j$. We define Eq. (21) to measure the validity of the lexicon obtained from the Pre-Attention mechanism:

$$D_p = (W_1^L, W_2^L, \dots, W_{\lfloor k p \rfloor}^L) \quad (0 < p < 1) \tag{20}$$

$$L(p) = \frac{|D_p \cap D|}{\lfloor k p \rfloor} \tag{21}$$

where $D$ is a handcrafted lexicon and $L(p)$ is the ratio of the words in $D_p$ that belong to $D$. When $W^L$ comes from IMDB or MR, we set $D$ to $P_l$; when it comes from Subj, we set $D$ to $S_l$. Fig. 9 shows how $L(p)$ changes with $p$ in the different settings. Because the handcrafted lexicon is a general-purpose lexicon, whereas the lexicon extracted by the Pre-Attention mechanism is specific to a dataset, $L(p)$ is not high. Nevertheless, the trend of $L(p)$ shows that words with higher attention values are more likely to appear in the lexicon for classification, which indicates that the Pre-Attention value reflects the explicit emotional tendencies in the sentences.

FIGURE 9. The similarity of lexicons from the Pre-Attention mechanism and handcrafted lexicons with threshold change.

STABILITY

For a text classification task, we believe that each word in the dataset has a different impact on the classification results, and that there is a unique ordering that indicates the importance of the words for this task. Although the importance of a word may differ between two particular sentences, our assumption is reasonable in a statistical sense for the entire classification task.

Next, to demonstrate the stability of the Pre-Attention mechanism, we compare the orderings of words sorted by Pre-Attention weight from the two Pre-Attention models (Pre-Attention-Text-CNN and Pre-Attention-Att-BLSTM) on the same dataset. We define $o_c$ and $o_r$ as the sequences of word indices sorted by Pre-Attention value, where $o_c$ and $o_r$ come from Pre-Attention-Text-CNN and Pre-Attention-Att-BLSTM respectively, and $k$ is the number of words in the dataset.

$$o_c = (o_{c1}, o_{c2}, \ldots, o_{ck}) \tag{22}$$

$$o_r = (o_{r1}, o_{r2}, \ldots, o_{rk}) \tag{23}$$

By equation (2), we get:

$$W_{o_{mi}} < W_{o_{mj}} \quad (m \in \{c, r\},\ i < j) \tag{24}$$

Then we define a threshold to obtain sentiment lexicons: we put the words in the top $(1 - p)$ $(0 < p < 1)$ fraction by Pre-Attention value into a lexicon. We get:

$$D_c = (o_{c\lfloor k \cdot p \rfloor}, o_{c\lfloor k \cdot p \rfloor + 1}, \ldots, o_{ck}) \tag{25}$$

$$D_r = (o_{r\lfloor k \cdot p \rfloor}, o_{r\lfloor k \cdot p \rfloor + 1}, \ldots, o_{rk}) \tag{26}$$

$D_c$ is the sentiment lexicon from Pre-Attention-Text-CNN, and $D_r$ is the sentiment lexicon from Pre-Attention-Att-BLSTM. We define equation (27) to measure the similarity of the two sentiment lexicons.

$$y(p) = \frac{|D_c \cap D_r|}{\lceil k \cdot (1 - p) \rceil} \tag{27}$$

Fig. 10 depicts the similarity of the sentiment lexicons $D_c$ and $D_r$ on the three datasets. As a baseline, we also plot the similarity obtained when $D_c$ and $D_r$ are two random sequences whose elements lie between 1 and $k$, with no repeated number within a sequence. In the figure, the x-axis represents $p$ and the y-axis represents $y(p)$. The figure shows that the sentiment lexicon extracted by our Pre-Attention model has good stability even with different post-classifiers.
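The two overlap ratios used in this section, $L(p)$ of equation (21) and $y(p)$ of equation (27), can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names and the toy word lists are hypothetical, and the lexicon slice is assumed to contain exactly $k - \lfloor k \cdot p \rfloor = \lceil k \cdot (1 - p) \rceil$ words so that its size matches the denominator of equation (27) (the 1-based indices in equations (25) and (26) would include one more word).

```python
import math


def validity_ratio(ranked_desc, handcrafted, p):
    """L(p): share of the top floor(k*p) words found in a handcrafted lexicon.

    ranked_desc is sorted by DESCENDING Pre-Attention value (like W^L).
    """
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    k = len(ranked_desc)
    n = math.floor(k * p)
    if n == 0:
        raise ValueError("p is too small: the top-p set is empty")
    top = set(ranked_desc[:n])
    return len(top & set(handcrafted)) / n


def lexicon(ranked_asc, p):
    """Words above threshold p in a list sorted by ASCENDING value (like o_c, o_r).

    Assumption: the slice keeps k - floor(k*p) words, matching the
    denominator of y(p).
    """
    k = len(ranked_asc)
    return set(ranked_asc[math.floor(k * p):])


def similarity(ranked_asc_c, ranked_asc_r, p):
    """y(p): overlap of the two lexicons divided by ceil(k*(1-p))."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    k = len(ranked_asc_c)
    # k - floor(k*p) equals ceil(k*(1-p)) for integer k and avoids
    # rounding errors in the floating-point value of 1 - p.
    denom = k - math.floor(k * p)
    overlap = lexicon(ranked_asc_c, p) & lexicon(ranked_asc_r, p)
    return len(overlap) / denom


# Toy data (hypothetical, not taken from the paper's datasets).
ranked_desc = ["great", "awful", "very", "movie", "the", "a", "of", "and"]
handcrafted = {"great", "awful", "boring"}
order_c = ["and", "of", "a", "the", "movie", "very", "awful", "great"]
order_r = ["the", "and", "a", "very", "of", "movie", "great", "awful"]

print(validity_ratio(ranked_desc, handcrafted, 0.5))  # 0.5
print(similarity(order_c, order_r, 0.5))              # 0.75
```

Both quantities are the same kind of measurement, a set overlap normalized by the size of the selected word set; they differ only in what the selected set is compared against (a handcrafted lexicon or the lexicon from the other model).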
Then we define equation (28) to measure the relative stability of the sentiment lexicons, where $(1 - p)$ is the lexicon similarity when $D_c$ and $D_r$ are two random sequences.

$$Y(p) = \frac{y(p)}{1 - p} \tag{28}$$

Fig. 11 shows how $Y(p)$ changes with $p$. $Y(p)$ keeps growing as $p$ grows, which means that the words with high Pre-Attention values are more easily identified by the Pre-Attention mechanism. In Table 4, we list the values of $y(p)$ and $Y(p)$ when $p$ is 0.5, 0.6, 0.7, 0.8, and 0.9 on the three datasets.

FIGURE 10. The similarity of sentiment lexicons from Pre-Attention-Text-CNN and Pre-Attention-Att-BLSTM on three datasets with threshold change.

FIGURE 11. The relative similarity of sentiment lexicons from Pre-Attention-Text-CNN and Pre-Attention-Att-BLSTM on three datasets with threshold change.
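The normalization by $(1 - p)$ in equation (28) can be motivated with a short expectation argument. The sketch below is not taken from the paper: it assumes the two random orderings are independent and uniformly distributed, and it introduces the symbol $m$ for the lexicon size, taken to be approximately $k(1 - p)$.

```latex
\begin{align*}
m &= |D_c| = |D_r| \approx k(1-p), \\
\Pr[w \in D_c] &= \Pr[w \in D_r] = \frac{m}{k}
  \quad \text{for every word } w, \\
\mathbb{E}\,|D_c \cap D_r| &= k \cdot \frac{m}{k} \cdot \frac{m}{k}
  = \frac{m^2}{k} \approx k(1-p)^2, \\
\mathbb{E}[y(p)] &\approx \frac{k(1-p)^2}{k(1-p)} = 1-p, \\
\mathbb{E}[Y(p)] &\approx \frac{1-p}{1-p} = 1 .
\end{align*}
```

Under these assumptions, a value of $Y(p)$ above 1 indicates that the two models agree on the lexicon more often than two random orderings would.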

TABLE 4. The values of $y(p)$ and $Y(p)$ when $p$ is 0.5, 0.6, 0.7, 0.8, and 0.9 on three datasets.

Columns: $p$; IMDB ($y(p)$, $Y(p)$); MR ($y(p)$, $Y(p)$); Subj ($y(p)$, $Y(p)$).

In summary, the Pre-Attention mechanism assigns more attention to words with strong emotions and to degree adverbs, even with different post-classification models, which demonstrates that the Pre-Attention mechanism is effective, stable, and portable. By setting proper thresholds, we can obtain reliable and stable sentiment lexicons.

VI. CONCLUSION

To address the problem that linguistic knowledge is not fully utilized in text classification, this paper presented the Pre-Attention mechanism. It automatically assigns different attention values to words according to their importance for the text classification task, which is equivalent to integrating a lexicon for classification into deep neural networks. Our approach performed well on three benchmark datasets. The models with the Pre-Attention mechanism achieved higher accuracy on all three datasets than the corresponding models without it, which supports the validity of the Pre-Attention mechanism. Compared with several other methods, our approach also achieved competitive classification accuracy. In addition, words with high Pre-Attention values are more likely to appear in lexicons for classification, and we obtained lexicons from those attention values, which is an inspiring method of information extraction. We also demonstrated the stability of the lexicons extracted by the Pre-Attention mechanism. However, the Pre-Attention mechanism still has limitations. For example, it can only produce the attention value of a single word. In the future, we plan to extend the Pre-Attention mechanism to 2-grams, 3-grams, and longer n-grams.

REFERENCES

[1] Du J, Gui L, He Y, et al. Convolution-based neural attention with applications to sentiment classification[J]. IEEE Access, 2019, 7: 27983-27992.

[2] Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis." Foundations and Trends® in Information Retrieval 2.1–2 (2008): 1-135.

[3] Zhong, Xiaoshi, Aixin Sun, and Erik Cambria. "Time expression analysis and recognition using syntactic token types and general heuristic rules." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.

[4] Chen, Zhiyuan, Nianzu Ma, and Bing Liu. "Lifelong learning for sentiment classification." Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2015.

[5] Joachims, Thorsten. "Text categorization with support vector machines: Learning with many relevant features." European conference on machine learning. Springer, Berlin, Heidelberg, 1998.

[6] Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of machine learning research 3.Feb (2003): 1137-1155.

[7] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositiona (Volume 1: Long Papers). 2013.

[8] Socher, Richard, et al. "Parsing with compositional vector grammars." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013.

[9] Devlin, Jacob, et al. "Fast and robust neural network joint models for statistical machine translation." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2014.

[10] Hassan, Abdalraouf, and Ausif Mahmood. "Convolutional recurrent deep learning model for sentence classification." IEEE Access 6 (2018): 13949-13957.

[11] Joulin, Armand, et al. "Bag of tricks for efficient text classification." arXiv preprint arXiv:1607.01759 (2016).

[12] Bandhakavi, Anil, et al. "Lexicon based feature extraction for emotion text classification." Pattern recognition letters 93 (2017): 133-142.

[13] Aman, Saima, and Stan Szpakowicz. "Identifying expressions of emotion in text." International Conference on Text, Speech and Dialogue. Springer, Berlin, Heidelberg, 2007.

[14] Mohammad, Saif. "Portable features for classifying emotional text." Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2012.

[15] Strapparava C, Valitutti A. Wordnet affect: an affective extension of wordnet[C]//Lrec. 2004, 4(1083-1086): 40.

[16] Poria, Soujanya, et al. "EmoSenticSpace: A novel framework for affective common-sense reasoning." Knowledge-Based Systems 69 (2014): 108-123.

[17] Mohammad, Saif M., and Peter D. Turney. "Crowdsourcing a word–emotion association lexicon." Computational Intelligence 29.3 (2013): 436-465.

[18] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

[19] Wen, Shuang, and Jian Li. "Recurrent Convolutional Neural Network with Attention for Twitter and Yelp Sentiment Classification: ARC Model for Sentiment Classification." Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence. ACM, 2018.

[20] Lei, Zeyang, Yujiu Yang, and Min Yang. "Sentiment lexicon enhanced attention-based LSTM for sentiment classification." Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

[21] Joachims, Thorsten. "Text categorization with support vector machines: Learning with many relevant features." European conference on machine learning. Springer, Berlin, Heidelberg, 1998.

[22] Zelikovitz, Sarah, and Haym Hirsh. "Using LSI for text classification in the presence of background text." Proceedings of the tenth international conference on Information and knowledge management. ACM, 2001.

[23] Nakagawa, Tetsuji, Kentaro Inui, and Sadao Kurohashi. "Dependency tree-based sentiment classification using CRFs with hidden variables." Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2010.

[24] Rink, Bryan, and Sanda Harabagiu. "Utd: Classifying semantic relations by combining lexical and semantic resources." Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, 2010.

[25] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

[26] Tai, Kai Sheng, Richard Socher, and Christopher D. Manning. "Improved semantic representations from tree-structured long short-term memory networks." arXiv preprint arXiv:1503.00075 (2015).

[27] Lei, Tao, Regina Barzilay, and Tommi Jaakkola. "Molding cnns for text: non-linear, non-consecutive convolutions." arXiv preprint arXiv:1508.04112 (2015).

[28] Kim, Yoon. "Convolutional neural networks for sentence classification." arXiv preprint arXiv:1408.5882 (2014).

[29] Jianqiang, Zhao, Gui Xiaolin, and Zhang Xuejun. "Deep convolution neural networks for twitter sentiment analysis." IEEE Access 6 (2018): 23253-23260.

[30] Zeng, Jichuan, et al. "Topic memory networks for short text classification." arXiv preprint arXiv:1809.03664 (2018).

[31] Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. "A convolutional neural network for modelling sentences." arXiv preprint arXiv:1404.2188 (2014).

[32] Johnson, Rie, and Tong Zhang. "Effective use of word order for text categorization with convolutional neural networks." arXiv preprint arXiv:1412.1058 (2014).

[33] Sundermeyer, Martin, Hermann Ney, and Ralf Schlüter. "From feedforward to recurrent LSTM neural networks for language modeling." IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.3 (2015): 517-529.

[34] Graves, Alex. "Supervised sequence labelling." Supervised sequence labelling with recurrent neural networks. Springer, Berlin, Heidelberg, 2012. 5-13.

[35] Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

[36] Tang, Duyu, Bing Qin, and Ting Liu. "Document modeling with gated recurrent neural network for sentiment classification." Proceedings of the 2015 conference on empirical methods in natural language processing. 2015.

[37] Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.

[38] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

[39] Lin Z, Feng M, Santos C N, et al. A structured self-attentive sentence embedding[J]. arXiv preprint arXiv:1703.03130, 2017.

[40] Fouquier, Geoffroy, Jamal Atif, and Isabelle Bloch. "Sequential spatial reasoning in images based on pre-attention mechanisms and fuzzy attribute graphs." ECAI. 2008.

[41] Hermann K M, Kocisky T, Grefenstette E, et al. Teaching machines to read and comprehend[C]//Advances in neural information processing systems. 2015: 1693-1701.

[42] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[J]. arXiv preprint arXiv:1409.0473, 2014.

[43] Galassi, Andrea, Marco Lippi, and Paolo Torroni. "Attention, please! a critical review of neural attention models in natural language processing." arXiv preprint arXiv:1902.02181 (2019).

[44] Bao, Yujia, et al. "Deriving machine attention from human rationales." arXiv preprint arXiv:1808.09367 (2018).

[45] Zhou P, Shi W, Tian J, et al. Attention-based bidirectional long short-term memory networks for relation classification[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2016: 207-212.

[46] Collobert R, Weston J, Bottou L, et al. Natural language processing (almost) from scratch[J]. Journal of machine learning research, 2011, 12(Aug): 2493-253.

[47] Maas A L, Daly R E, Pham P T, et al. Learning word vectors for sentiment analysis[C]//Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1. Association for Computational Linguistics, 2011: 142-150.

[48] Pang B, Lee L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts[C]//Proceedings of the 42nd annual meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2004: 271.

[49] Pang B, Lee L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales[C]//Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005: 115-124.

[50] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems. 2013: 3111-3119.

[51] Kowsari K, Heidarysafa M, Brown D E, et al. Rmdl: Random multimodel deep learning for classification[C]//Proceedings of the 2nd International Conference on Information System and Data Mining. ACM, 2018: 19-28.

[52] Yang Z, Yang D, Dyer C, et al. Hierarchical attention networks for document classification[C]//Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 2016: 1480-1489.

[53] Zhang W, Yoshida T, Tang X. Text classification based on multi-word with support vector machine[J]. Knowledge-Based Systems, 2008, 21(8): 879-886.

[54] Chen K, Zhang Z, Long J, et al. Turning from TF-IDF to TF-IGM for term weighting in text classification[J]. Expert Systems with Applications, 2016, 66: 245-260.

[55] Zhang R, Lee H, Radev D. Dependency sensitive convolutional neural networks for modeling sentences and documents[J]. arXiv preprint arXiv:1611.02361, 2016.

[56] Wieting J, Kiela D. No training required: Exploring random encoders for sentence classification[J]. arXiv preprint arXiv:1901.10444, 2019.

[57] Zhang D, Tian L, Hong M, et al. Combining convolution neural network and bidirectional gated recurrent unit for sentence semantic classification[J]. IEEE Access, 2018, 6: 73750-73759.

[58] Zhang Y, Wallace B. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification[J]. arXiv preprint arXiv:1510.03820, 2015.

[59] Zhang R, Lee H, Radev D. Dependency sensitive convolutional neural networks for modeling sentences and documents[J]. arXiv preprint arXiv:1611.02361, 2016.

[60] Kiros R, Zhu Y, Salakhutdinov R R, et al. Skip-thought vectors[C]//Advances in neural information processing systems. 2015: 3294-3302.