Semantic-Unit-Based Dilated Convolution for Multi-Label Text Classification
Junyang Lin, Qi Su, Pengcheng Yang, Shuming Ma, Xu Sun
School of Foreign Languages, Peking University
MOE Key Lab of Computational Linguistics, School of EECS, Peking University
{linjunyang, sukia, yang pc, shumingma, xusun}@pku.edu.cn

Abstract
We propose a novel model for multi-label text classification, which is based on sequence-to-sequence learning. The model generates higher-level semantic unit representations with multi-level dilated convolution as well as a corresponding hybrid attention mechanism that extracts information both at the word level and at the level of the semantic unit. Our designed dilated convolution effectively reduces dimension and supports an exponential expansion of receptive fields without loss of local information, and the attention-over-attention mechanism is able to capture more summary-relevant information from the source context. Results of our experiments show that the proposed model has significant advantages over the baseline models on the datasets RCV1-V2 and Ren-CECps, and our analysis demonstrates that our model is competitive with the deterministic hierarchical models and more robust to classifying low-frequency labels. The code is available at https://github.com/lancopku/SU4MLC.

Multi-label text classification refers to assigning multiple labels to a given text, and it can be applied to a number of important real-world applications. A typical example is that news articles on websites often require labels so as to improve the quality of search and recommendation, allowing users to find the information they prefer efficiently and with less disturbance from irrelevant content. As a significant task of natural language processing, it has been addressed by a number of methods that have gradually achieved satisfactory performance. For instance, a series of machine learning methods have been extensively utilized in industry, such as Binary Relevance (BR) (Boutell et al., 2004), which treats the task as multiple single-label classifications and can achieve satisfactory performance. With the development of deep learning, neural methods have been applied to this task and achieved improvements (Zhang and Zhou, 2006; Nam et al., 2013; Benites and Sapozhnikova, 2015).

However, these methods cannot model the internal correlations among labels. To capture such correlations, the following work, including ML-DT (Clare and King, 2001), Rank-SVM (Elisseeff and Weston, 2002), LP (Tsoumakas and Katakis, 2006), ML-KNN (Zhang and Zhou, 2007), and CC (Read et al., 2011), attempts to capture these relationships; these methods demonstrated improvements yet capture only low-order correlations. A milestone in this field is the application of sequence-to-sequence learning to multi-label text classification (Nam et al., 2017). Sequence-to-sequence learning is about the transformation from one type of sequence to another, whose most common architecture is the attention-based sequence-to-sequence (Seq2Seq) model. The attention-based Seq2Seq model (Sutskever et al., 2014) was initially designed for neural machine translation (NMT) (Bahdanau et al., 2014; Luong et al., 2015). Seq2Seq is able to encode a given source text and decode the representation into a new sequence that approximates the target text, and with the attention mechanism, the decoder is competent in extracting vital source-side information to improve the quality of decoding. Multi-label text classification can be regarded as the prediction of the target label sequence given a source text, which can be modeled by Seq2Seq.
Moreover, it is able to model the high-order correlations among the source text as well as those among the label sequence with deep recurrent neural networks (RNN).

Nevertheless, we study the attention-based Seq2Seq model for multi-label text classification (Nam et al., 2017) and find that the attention mechanism does not play as significant a role in this task as it does in other NLP tasks, such as NMT and abstractive summarization. In Section 3, we demonstrate the results of our ablation study, which show that the attention mechanism cannot improve the performance of the Seq2Seq model. We hypothesize that, compared with neural machine translation, the requirements of neural multi-label text classification are different. The conventional attention mechanism extracts word-level information from the source context, which makes little contribution to a classification task. For text classification, humans do not assign labels to texts simply based on word-level information but usually based on their understanding of the salient meanings in the source text.

For example, regarding the text "The young boys are playing basketball with great excitement and apparently they enjoy the fun of competition", it can be found that there are two salient ideas, "game of the young" and "happiness of basketball game", which we call "semantic units" of the text. The semantic units, instead of word-level information, mainly determine that the text can be classified into the target categories "youth" and "sports".

Semantic units construct the semantic meaning of the whole text. To assign proper labels to a text, the model should capture the core semantic units of the source text, that is, information at a higher level than words, and then assign the text labels based on its understanding of those semantic units. However, it is difficult to extract information about semantic units, as the conventional attention mechanism focuses on extracting word-level information, which contains redundancy and irrelevant details.

In order to capture semantic units in the source text, we analyze the texts and find that the semantic units are often wrapped in phrases or sentences, connecting with other units with the help of contexts. Inspired by the idea of global encoding for summarization (Lin et al., 2018), we utilize the power of convolutional neural networks (CNN) to capture local interaction among words and generate representations of information at levels higher than the word, such as the phrase or sentence. Moreover, to tackle the problem of long-term dependency, we design a multi-level dilated convolution for text to capture local correlation and long-term dependency without loss of coverage, as we do not apply any form of pooling or strided convolution. Based on the annotations generated by our designed module and those by the original recurrent neural networks, we implement a hybrid attention mechanism with the purpose of capturing information at different levels; furthermore, it can extract word-level information from the source context based on its attention on the semantic units.

In brief, our contributions are illustrated below:

• We show that the conventional attention mechanism is not useful for multi-label text classification, and we propose a novel model with multi-level dilated convolution to capture semantic units in the source text.
• Experimental results demonstrate that our model outperforms the baseline models and achieves state-of-the-art performance on the RCV1-v2 and Ren-CECps datasets, and our model is competitive with the hierarchical models with the deterministic setting of sentence or phrase.

• Our analysis shows that compared with the conventional Seq2Seq model, our model, with effective information extracted from the source context, can better predict the labels of low frequency, and it is less influenced by the prior distribution of the label sequence.
As illustrated below, multi-label text classification has the potential to be regarded as a task of sequence prediction, as long as there are certain correlation patterns in the label data. Owing to the correlations among labels, it is possible to improve the performance of the model in this task by assigning a certain label permutation to the label sequence and maximizing subset accuracy, which means that the label permutation and the corresponding attention-based Seq2Seq method are competent in learning the label classification and the label correlations. By maximizing subset accuracy, the model can improve the performance of classification with the assistance of the information about label correlations. Regarding label permutation, a straightforward method is to reorder the label data in descending order of frequency, which shows satisfactory effects (Chen et al., 2017).

Multi-label text classification can be regarded as a Seq2Seq learning task, which is formulated as below. Given a source text x = {x_1, ..., x_i, ..., x_n} and a target label sequence y = {y_1, ..., y_i, ..., y_m}, the Seq2Seq model learns to approximate the probability

P(y | x) = ∏_{t=1}^{m} P(y_t | y_1, ..., y_{t-1}, x).

Table 1: Performance of the Seq2Seq models with and without attention on the RCV1-v2 test set, where HL, P, R, and F1 refer to hamming loss, micro-precision, micro-recall and micro-F1. The symbol "+" indicates that the higher the better, while the symbol "-" indicates that the lower the better.

Models          HL(-)    P(+)    R(+)    F1(+)
w/o attention   0.0082   0.883   0.849   0.866
+attention      0.0081   0.889   0.848   0.868

As shown in Table 1, the Seq2Seq models with and without the attention mechanism demonstrate similar performance on the RCV1-v2 according to their micro-F1 scores, a significant evaluation metric for multi-label text classification. This can be taken as evidence that the conventional attention mechanism does not play a significant role in improving the Seq2Seq model's performance. We hypothesize that the conventional attention mechanism does not meet the requirements of multi-label text classification. A common sense for such a classification task is that the classification should be based on the salient ideas of the source text. As in the earlier example, the semantic units, instead of word-level information, mainly determine that the text can be classified into the target categories "youth" and "sports". For every text among a variety of texts, there are always semantic units that construct the semantic meaning of the whole text. Regarding an automatic system for multi-label text classification, the system should be able to extract the semantic units in the source text for better performance in classification. Therefore, we propose our model to tackle this problem.

In the following, we introduce our proposed modules to improve the conventional Seq2Seq model for multi-label text classification. In general, our model contains two components: multi-level dilated convolution (MDC) as well as a hybrid attention mechanism.

On top of the representations generated by the original encoder, which is an LSTM in our model, we apply multi-layer convolutional neural networks to generate representations of semantic units by capturing local correlations and long-term dependencies among words. To be specific, our CNN is a three-layer one-dimensional CNN.
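To make the encoder side concrete, the following is a minimal sketch of the word-level LSTM encoder described above. It is our own illustration rather than the released code: the class name, the unidirectional single-layer LSTM, and the dummy input are assumptions, while the 512-dimensional embeddings and hidden units follow the experimental settings reported later.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Sketch of the word-level encoder: embeds the source tokens and runs a
    single-layer LSTM to produce one annotation vector per source word."""
    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) word indices of the source text
        embedded = self.embedding(src_tokens)
        # annotations h_1, ..., h_n, one vector per source word
        annotations, final_state = self.lstm(embedded)
        return annotations, final_state

# usage: the word-level annotations h serve as input to the MDC module below
encoder = WordEncoder(vocab_size=50000)
h, _ = encoder(torch.randint(0, 50000, (2, 30)))   # h: (2, 30, 512)
```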
Figure 1: Structure of Multi-level Dilated Convolution (MDC). An example of MDC with kernel size k = 2 and dilation rates [1, 2, 3]. To avoid gridding effects, the dilation rates do not share a common factor other than 1.

Following the previous work on CNN for NLP (Kalchbrenner et al., 2014), we use one-dimensional convolution with the number of channels equal to the number of units of the hidden layer, so that the information at each dimension of a representation vector will not be disconnected as it is with two-dimensional convolution. Besides, as we aim to capture semantic units in the source text instead of higher-level word representations, there is no need to use padding for the convolution.

A special design of the CNN is the implementation of dilated convolution. Dilation has become popular in semantic segmentation in computer vision in recent years (Yu and Koltun, 2015; Wang et al., 2018), and it has been introduced to the fields of NLP (Kalchbrenner et al., 2016) and speech processing (van den Oord et al., 2016). Dilated convolution refers to convolution inserted with "holes", so that it avoids the negative effects, such as information loss, caused by common down-sampling methods like max-pooling and strided convolution. Besides, it is able to expand the receptive field exponentially without increasing the number of parameters. Thus, it becomes possible for dilated convolution to capture longer-term dependency. Furthermore, with the purpose of avoiding gridding effects caused by dilation (e.g., the dilated segments of the convolutional kernel can cause missing of vital local correlation and break the continuity between word representations), we implement a multi-level dilated convolution with different dilation rates at different levels, where the dilation rates are hyperparameters in our model.

Instead of using the same dilation rate or dilation rates with a common factor, which can cause gridding effects, we apply multi-level dilated convolution with different dilation rates, such as [1, 2, 3]. Following Wang et al. (2018), for N layers of one-dimensional convolution with kernel size K and dilation rates [r_1, ..., r_N], the maximum distance between two nonzero values is

M_i = max[M_{i+1} - 2r_i, M_{i+1} - 2(M_{i+1} - r_i), r_i],

with M_N = r_N, and the design goal is M_2 ≤ K. In our experiments, we set the dilation rates to [1, 2, 3] and K to 3, and we have M_2 = 2. This implementation avoids the gridding effects and allows the top layer to access information over longer distances without loss of coverage. Moreover, as there may be information irrelevant to the semantic units at a long distance, we carefully set the dilation rates to [1, 2, 3] based on the performance on the validation set, instead of others such as [2, 5, 9], so that the top layer will not process information over an overlong distance, reducing the influence of unrelated information. Therefore, our model can generate semantic unit representations from information at the phrase level with small dilation rates and at the sentence level with large dilation rates.
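As an illustration of the MDC described above, here is a minimal PyTorch sketch (ours, not the authors' implementation) of three stacked one-dimensional convolutions with kernel size 3, dilation rates [1, 2, 3], channel width equal to the hidden size, and no padding; the ReLU activations and variable names are assumptions.

```python
import torch
import torch.nn as nn

class MDC(nn.Module):
    """Multi-level dilated convolution sketch: three 1-D conv layers with
    dilation rates [1, 2, 3], kernel size 3, and no padding, so the sequence
    shrinks and each output vector summarizes a multi-word span."""
    def __init__(self, hidden_size=512, kernel_size=3, dilations=(1, 2, 3)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(hidden_size, hidden_size, kernel_size,
                      dilation=d, padding=0)
            for d in dilations
        ])

    def forward(self, annotations):
        # annotations: (batch, src_len, hidden), e.g. from the LSTM encoder
        x = annotations.transpose(1, 2)      # (batch, hidden, src_len)
        for conv in self.convs:
            x = torch.relu(conv(x))          # length shrinks by d * (k - 1)
        # semantic unit representations g_1, ..., g_m with m < n
        return x.transpose(1, 2)

# usage with dummy word-level annotations h
h = torch.randn(2, 30, 512)
g = MDC()(h)                                 # g: (2, 18, 512)
```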
Figure 2: Structure of Hybrid Attention. The blue circles at the bottom right represent the source annotations generated by the LSTM encoder, the yellow circles at the bottom left represent the semantic unit representations generated by MDC, and the blue circles at the top represent the LSTM decoder outputs. At each decoding time step, the output of the LSTM attends to the semantic unit representations first, and then the new representation incorporated with high-level information attends to the source annotations.

As we have annotations from the RNN encoder and semantic unit representations from the MDC, we design two types of attention mechanism to evaluate the effects of information at different levels. One is the common attention mechanism, which attends to the semantic unit representations instead of the source word annotations as the conventional mechanism does; the other is our designed hybrid attention mechanism, which incorporates information of the two levels.

The idea of hybrid attention is motivated by memory networks (Sukhbaatar et al., 2015) and multi-step attention (Gehring et al., 2017). It can be regarded as an attention mechanism with multiple "hops", with the first hop attending to the higher-level semantic unit information and the second hop attending to the lower-level word-level information based on the decoding and the first attention to the semantic units. Details are presented below.

The output of the decoder at each time step not only attends to the source annotations from the RNN encoder as it usually does but also attends to the semantic unit representations from the MDC. In our model, the decoder output first pays attention to the semantic unit representations from the MDC to figure out the most relevant semantic units and generates a new representation based on this attention. Next, the new representation, carrying both the information from the decoding process and the attention to the semantic units, attends to the source annotations from the LSTM encoder, so that it can extract word-level information from the source text with the guidance of the semantic units, mitigating the problem of irrelevance and redundancy.

To be specific, for the source annotations from the LSTM encoder h = {h_1, ..., h_i, ..., h_n} and the semantic unit representations g = {g_1, ..., g_i, ..., g_m}, the decoder output s_t first attends to the semantic unit representations g and generates a new representation s′_t. Then the new representation s′_t attends to the source annotations h and generates another representation s̃_t following the identical attention mechanism mentioned above. In the final step, the model generates o_t for the prediction of y_t, where:

o_t = s′_t ⊕ s̃_t    (7)
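A sketch of the hybrid (two-hop) attention follows. It is a reconstruction under our own assumptions: dot-product attention scores, a linear merge layer after each hop, and the ⊕ in Equation 7 read as element-wise addition; the paper does not spell these details out here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, keys):
    # Plain dot-product attention (an assumption; the exact score function
    # is not given here). query: (batch, hidden), keys: (batch, len, hidden)
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (batch, len)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)   # (batch, hidden)

class HybridAttention(nn.Module):
    """Two-hop attention sketch: hop 1 over semantic units g,
    hop 2 over word-level annotations h, guided by the first hop."""
    def __init__(self, hidden_size=512):
        super().__init__()
        self.merge1 = nn.Linear(2 * hidden_size, hidden_size)
        self.merge2 = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, s_t, h, g):
        # hop 1: decoder output s_t attends to the semantic units g
        c_g = attend(s_t, g)
        s_prime = torch.tanh(self.merge1(torch.cat([s_t, c_g], dim=-1)))
        # hop 2: the unit-aware state attends to the word annotations h
        c_h = attend(s_prime, h)
        s_tilde = torch.tanh(self.merge2(torch.cat([s_prime, c_h], dim=-1)))
        # o_t = s'_t ⊕ s̃_t, with ⊕ read as element-wise addition (assumption)
        return s_prime + s_tilde

# usage with dummy annotations and semantic unit representations
h, g, s_t = torch.randn(2, 30, 512), torch.randn(2, 18, 512), torch.randn(2, 512)
o_t = HybridAttention()(s_t, h, g)   # (2, 512), used to predict y_t
```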
For comparison, we also propose another type of attention called "additive attention", whose experimental results are reported in the ablation test. In this mechanism, instead of paying attention to the two types of representations step by step as mentioned above, the output of the LSTM decoder s_t attends to the semantic unit representations g and the source annotations h respectively to generate s′_t and s̃_t, which are finally added element-wise for the final output o_t.

In the following, we introduce the datasets and our experiment settings as well as the baseline models that we compare with.

RCV1-v2: The dataset (Lewis et al., 2004) consists of more than 800k manually categorized newswire stories by Reuters Ltd. for research purposes, where each story is assigned multiple topics. The total number of topics is 103. To be specific, the training set contains around 802,414 samples, while the development set and test set contain 1,000 samples each. We filter out the samples whose lengths are over 500 words, which removes about 0.5% of the samples in the training, development and test sets. The vocabulary size is set to 50k words. Numbers as well as out-of-vocabulary words are masked by special tokens.

Ren-CECps: The dataset is a sentence corpus collected from Chinese blogs, annotated with 8 emotion tags (anger, anxiety, expect, hate, joy, love, sorrow and surprise) as well as 3 polarity tags (positive, negative and neutral). The dataset contains 35,096 sentences for multi-label text classification. We apply preprocessing to the data similar to that for the RCV1-v2, namely filtering out samples of over 500 words, setting the vocabulary size to 50k, and ordering the labels by descending frequency for label permutation.

We implement our experiments in PyTorch on an NVIDIA 1080Ti GPU. In the experiments, the batch size is set to 64, and the embedding size and the number of units of the hidden layers are 512. We use the Adam optimizer (Kingma and Ba, 2014) with the default setting β1 = 0.9, β2 = 0.999 and ε = 1 × 10^-8. The learning rate is initialized based on the performance on the development set, and it is halved after every epoch of training. Gradient clipping is applied with the range [-10, 10].

Following the previous studies (Zhang and Zhou, 2007; Chen et al., 2017), we choose hamming loss and micro-F1 score to evaluate the performance of our model. Hamming loss refers to the fraction of incorrect predictions (Schapire and Singer, 1999), and micro-F1 score refers to the weighted average F1 score. For reference, the micro-precision as well as micro-recall scores are also reported. To be specific, the computations of hamming loss (HL) and micro-F1 score are illustrated below:

HL = (1/L) Σ_{j=1}^{L} I(y_j ≠ ŷ_j)    (8)

micro-F1 = Σ_{j=1}^{L} 2·tp_j / Σ_{j=1}^{L} (2·tp_j + fp_j + fn_j)    (9)

where tp_j, fp_j and fn_j refer to the number of true positive examples, false positive examples and false negative examples of the j-th label, respectively.
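For concreteness, Equations 8 and 9 can be computed over binary label-indicator matrices as in the short sketch below; the function names are ours.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    # y_true, y_pred: (num_samples, num_labels) binary indicator arrays;
    # fraction of label slots predicted incorrectly (Eq. 8, averaged over
    # samples and labels)
    return np.mean(y_true != y_pred)

def micro_f1(y_true, y_pred):
    # micro-averaged F1 with counts pooled over all labels (Eq. 9)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn)

y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 1, 0]])
print(hamming_loss(y_true, y_pred), micro_f1(y_true, y_pred))  # 0.25, ~0.667
```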
In the following, we introduce the baseline models for comparison on both datasets.

• Binary Relevance (BR) (Boutell et al., 2004) transforms the MLC task into multiple single-label classification problems.

• Classifier Chains (CC) (Read et al., 2011) transforms the MLC task into a chain of binary classification problems to model the correlations between labels.

• Label Powerset (LP) (Tsoumakas and Katakis, 2006) creates one binary classifier for every label combination attested in the training set.

• CNN (Kim, 2014) uses multiple convolution kernels to extract text features, which are then input to a linear transformation layer followed by a sigmoid function to output the probability distribution over the label space.

• CNN-RNN (Chen et al., 2017) utilizes CNN and RNN to capture both global and local textual semantics and model label correlations.

• S2S and S2S+Attn (Sutskever et al., 2014; Bahdanau et al., 2014) are our implementations of the RNN-based sequence-to-sequence models without and with the attention mechanism, respectively.

In the following sections, we report the results of our experiments on the RCV1-v2 and Ren-CECps. Moreover, we conduct an ablation test and a comparison with hierarchical models with the deterministic setting of sentence or phrase, to illustrate that our model with learnable semantic units possesses a clear advantage over the baseline models. Furthermore, we demonstrate that the higher-level representations are useful for the prediction of labels of low frequency in the dataset, which shows that the model is not merely learning the prior distribution of the label sequence.

Table 2: Performance on the RCV1-V2 test set. HL, P, R, and F1 denote hamming loss, micro-precision, micro-recall and micro-F1, respectively.

Models   HL(-)    P(+)    R(+)    F1(+)
BR       0.0086   0.904   0.816   0.858
CC       0.0087   0.887   0.828   0.857
LP       0.0087   0.896   0.824   0.858
CNN      0.0089   -       -       -

We present the results of our implementations of our model as well as the baselines on the RCV1-v2 in Table 2. From these results, it can be found that the classical machine-learning-based methods for multi-label text classification remain competitive with the deep-learning-based methods, though not with the Seq2Seq-based models. Regarding the Seq2Seq models, both the S2S and the S2S+Attn achieve improvements on this dataset compared with the baselines above. However, as mentioned previously, the attention mechanism does not play a significant role in the Seq2Seq model for multi-label text classification. By contrast, our proposed mechanism, which is label-classification-oriented, can take both the information of semantic units and that of word units into consideration. Our proposed model achieves the best performance in the evaluation of hamming loss and micro-F1 score, reducing hamming loss by 9.8% and improving micro-F1 score by 1.3% in comparison with the S2S+Attn.

Table 3: Performance of the models on the Ren-CECps test set. HL, P, R, and F1 denote hamming loss, micro-precision, micro-recall and micro-F1, respectively.

We also present the results of our experiments on Ren-CECps (Table 3). Similar to the models' performance on the RCV1-v2, the conventional baselines other than the Seq2Seq models achieve lower micro-F1 scores than the Seq2Seq models. Moreover, the S2S and the S2S+Attn still achieve similar micro-F1 performance on this dataset, and our proposed model achieves the best performance with an improvement of 0.009 in micro-F1 score. An interesting finding is that the Seq2Seq models do not possess an advantage over the conventional baselines in the evaluation of hamming loss. We observe that there are fewer labels in Ren-CECps than in the RCV1-v2 (11 and 103, respectively). As our label data are reordered in descending order of label frequency, the Seq2Seq model is inclined to learn the frequency distribution, which is similar to a long-tailed distribution.
However, regarding the low-frequency labels, which have only a few samples each, their amounts are similar, so their distribution is much more uniform than that of the whole label data. It is more difficult for the Seq2Seq model to classify them correctly while it is approximating the long-tailed distribution, compared with the conventional baselines. As hamming loss reflects the average incorrect prediction, errors in classifying low-frequency labels lead to a sharper increase in hamming loss than in micro-F1 score.

To evaluate the effects of our proposed modules, we present an ablation test for our model. We remove certain modules to control variables so that their effects can be fairly compared. To be specific, besides the evaluation of the conventional attention mechanism mentioned in Section 3, we evaluate the effects of hybrid attention and its modules. We demonstrate the performance of five models with different attention implementations for comparison: the model without attention, one with only attention to the source annotations from the LSTM, one with only attention to the semantic unit representations from the MDC, one with attention to both the source annotations and the semantic unit representations (additive), and one with hybrid attention. Therefore, the effects of each of our proposed modules, including MDC and hybrid attention, can be evaluated individually without the influence of the other modules.

Table 4: Performance of the models with different attention mechanisms on the RCV1-V2 test set. HL, P, R, and F1 denote hamming loss, micro-precision, micro-recall and micro-F1, respectively.

Models          HL(-)    P(+)    R(+)    F1(+)
w/o attention   0.0082   0.883   0.849   0.866
attention       0.0081   0.889   0.848   0.868
MDC             0.0074   0.889   0.871   0.880
additive        0.0073   0.888   0.871   0.879
hybrid          -        -       -       -

Results in Table 4 reflect that our model still performs the best in comparison with models with the other types of attention mechanism. Beyond the insignificant effect of the conventional attention mechanism mentioned above, it can be found that the high-level representations generated by the MDC contribute much to the performance of the Seq2Seq model for multi-label text classification, improving the micro-F1 score by about 0.9 points. Moreover, the simple additive attention mechanism, which is equivalent to the element-wise addition of the representations from the MDC and those from the conventional mechanism, achieves performance similar to the single MDC, which also demonstrates that the conventional attention mechanism makes little contribution in this task. Our proposed hybrid attention, which is a relatively complex combination of the two mechanisms, can further improve the performance of MDC. This shows that although the conventional attention mechanism for word-level information does not influence the performance of the Seq2Seq model significantly, the hybrid attention, which extracts word-level information based on the generated high-level semantic information, can provide information about important details that are relevant to the most contributing semantic units.
Another method that can extract high-level representations is a heuristic one that first annotates sentences or phrases manually and then applies a hierarchical model for high-level representations. To be specific, this method applies an RNN encoder not only to the word representations but also to sentence representations. In our reimplementation, we regard the representation from the LSTM encoder at the time step of the end of each sentence as the sentence representation, and we implement another LSTM on top of the original encoder that receives the sentence representations as input, so that the whole encoder is hierarchical. We conduct this experiment on the RCV1-v2. As there is no sentence marker in the RCV1-v2, we set sentence boundaries for the source text and apply the hierarchical model to generate sentence representations.

Compared with our proposed MDC, the hierarchical model for high-level representations is relatively deterministic since the sentences or phrases are predefined manually. In contrast, our proposed MDC learns the high-level representations through dilated convolution, which is not restricted by manually annotated boundaries. Through this evaluation, we expect to see whether our model with multi-level dilated convolution as well as hybrid attention can achieve similar or better performance than the hierarchical model. Moreover, we note that the hierarchical model has more parameters than our model (47.24M versus 45.13M), so our model does not possess an advantage in parameter number in this comparison.

Table 5: Performance of the hierarchical model and our model on the RCV1-V2 test set. Hier refers to the hierarchical model, and the subsequent number refers to the length of sentence (in words) for sentence-level representations.

Models    HL(-)    P(+)    R(+)    F1(+)
Hier-5    0.0075   0.887   0.869   0.878
Hier-10   0.0077   0.883   0.873   0.878
Hier-15   0.0076   0.879   0.879   0.879
Hier-20   0.0076   0.876   -       -

We present the results of the evaluation in Table 5, where it can be found that our model with fewer parameters still outperforms the hierarchical model with the deterministic setting of sentence or phrase. Moreover, in order to alleviate the influence of the deterministic sentence boundary, we compare the performance of hierarchical models with different boundaries, which are set at the end of every 5, 10, 15 and 20 words, respectively. The results in Table 5 show that the hierarchical models achieve similar performances, which are all higher than those of the baselines. This shows that high-level representations can contribute to the performance of the Seq2Seq model on the multi-label text classification task. Furthermore, as these performances are no better than that of our proposed model, this reflects that learnable high-level representations can contribute more than deterministic sentence-level representations, as they are more flexible in representing information of diverse levels, instead of a fixed phrase or sentence level.

Figure 3: Micro-F1 scores of our model and the baseline on the evaluation of labels of different frequency. The x-axis refers to the ranking of the most frequent label in the labels for classification, and the y-axis refers to the micro-F1 score (%).

Another finding in our experiments is that the models' performance on low-frequency label classification is lower than that on high-frequency label classification. This problem is also reflected in our report of the experimental results on Ren-CECps. The decrease in performance is reasonable since the model is sensitive to the amount of data, especially on small datasets such as Ren-CECps. We also hypothesize that this error comes from the essence of the Seq2Seq model.
As the frequency of our label data follows something similar to a long-tailed distribution and the data are organized in descending order of label frequency, the Seq2Seq model is inclined to model this distribution. As the frequency distribution of the low-frequency labels is relatively uniform, it is much harder for the model to fit.

In contrast, as our model is capable of capturing deeper semantic information for label classification, we believe that it is more robust to the classification of low-frequency labels with the help of information from multiple levels. We remove the top 10, 20, 30, 40, 50 and 60 most frequent labels in turn, and we evaluate the performance of our model and the baseline Seq2Seq model on the classification of the remaining labels. Figure 3 shows the results of the models on label data of different frequency. It is obvious that although the performance of both models decreases with the decrease of label frequency, our model continues to perform better than the baseline at all levels of label frequency. In addition, the gap between the performances of the two models continues to increase with the decrease of label frequency, demonstrating our model's advantage over the baseline on classifying low-frequency labels.

The current models for the multi-label classification task can be classified into three categories: problem transformation methods, algorithm adaptation methods, and neural network models.

Problem transformation methods decompose the multi-label classification task into multiple single-label learning tasks. The BR algorithm (Boutell et al., 2004) builds a separate classifier for each label, causing the label correlations to be ignored. In order to model label correlations, Label Powerset (LP) (Tsoumakas and Katakis, 2006) creates one binary classifier for every label combination attested in the training set, and Classifier Chains (CC) (Read et al., 2011) connects all classifiers in a chain through the feature space.

Algorithm adaptation methods adapt specific learning algorithms to the multi-label classification task without requiring problem transformations. Clare and King (2001) constructed a decision tree based on multi-label entropy to perform classification. Elisseeff and Weston (2002) adopted a Support Vector Machine (SVM) like learning system to handle the multi-label problem. Zhang and Zhou (2007) utilized the k-nearest neighbor algorithm and the maximum a posteriori principle to determine the label set of each sample. Fürnkranz et al. (2008) ranked labels by utilizing pairwise comparison. Li et al. (2015) used joint learning predictions as features.

Recent studies of multi-label text classification have turned to the application of neural networks, which have achieved great success in natural language processing. Zhang and Zhou (2006) implemented fully-connected neural networks with a pairwise ranking loss function. Nam et al. (2013) changed the ranking loss function to the cross-entropy loss to improve training. Kurata et al. (2016) proposed a novel neural network initialization method that treats some neurons as dedicated neurons to model label correlations. Chen et al. (2017) incorporated CNN and RNN so as to capture both global and local semantic information and model high-order label correlations. Nam et al. (2017) proposed to generate labels sequentially, and Yang et al. (2018) and Li et al. (2018) both adopted the Seq2Seq framework, one with a novel decoder and one with a soft loss function, respectively.
In this study, we propose our model based on the multi-level dilated convolution and the hybrid attention mechanism, which can extract both semantic-unit-level information and word-level information. Experimental results demonstrate that our proposed model can significantly outperform the baseline models. Moreover, the analyses reflect that our model is competitive with the deterministic hierarchical models and more robust to classifying the low-frequency labels than the baseline.

Acknowledgements

This work was supported in part by National Natural Science Foundation of China (No. 61673028) and the National Thousand Young Talents Program. Xu Sun is the corresponding author of this paper.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Fernando Benites and Elena Sapozhnikova. 2015. Haram: a hierarchical aram neural network for large-scale text classification. In Data Mining Workshop (ICDMW), 2015 IEEE International Conference on, pages 847–854. IEEE.

Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. 2004. Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771.

Guibin Chen, Deheng Ye, Zhenchang Xing, Jieshan Chen, and Erik Cambria. 2017. Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In IJCNN 2017, pages 2377–2383.

Amanda Clare and Ross D King. 2001. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53. Springer.

André Elisseeff and Jason Weston. 2002. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, pages 681–687.

Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. 2008. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In ICML 2017, pages 1243–1252.

Sepp Hochreiter and Jürgen Schmidhuber. 1996. LSTM can solve hard long time lag problems. In NIPS 1996, pages 473–479.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In ACL 2014, pages 655–665.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP 2014, pages 1746–1751.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In NAACL HLT 2016, pages 521–526.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397.

Li Li, Houfeng Wang, Xu Sun, Baobao Chang, Shi Zhao, and Lei Sha. 2015. Multi-label text categorization with joint learning predictions-as-features method. In EMNLP 2015, pages 835–839.
Wei Li, Xuancheng Ren, Damai Dai, Yunfang Wu, Houfeng Wang, and Xu Sun. 2018. Sememe prediction: Learning semantic knowledge from unstructured textual wiki descriptions. CoRR, abs/1808.05437.

Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018. Global encoding for abstractive summarization. In ACL 2018, pages 163–169.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015, pages 1412–1421.

Jinseok Nam, Jungi Kim, Iryna Gurevych, and Johannes Fürnkranz. 2013. Large-scale multi-label text classification - revisiting neural networks. CoRR, abs/1312.5419.

Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In NIPS 2017, pages 5419–5429.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. In ISCA Speech Synthesis Workshop 2016, page 125.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine Learning, 85(3):333.

Robert E Schapire and Yoram Singer. 1999. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS 2015, pages 2440–2448.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014, pages 3104–3112.

Grigorios Tsoumakas and Ioannis Katakis. 2006. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3).

Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison W. Cottrell. 2018. Understanding convolution for semantic segmentation. In WACV 2018, pages 1451–1460.

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: sequence generation model for multi-label classification. In COLING 2018, pages 3915–3926.

Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122.

Min-Ling Zhang and Zhi-Hua Zhou. 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338–1351.

Min-Ling Zhang and Zhi-Hua Zhou. 2007. Ml-knn: A lazy learning approach to multi-label learning.