Does Head Label Help for Long-Tailed Multi-Label Text Classification
Lin Xiao, Xiangliang Zhang, Liping Jing, Chi Huang, Mingyang Song
Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
King Abdullah University of Science and Technology (KAUST), Saudi Arabia
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
Multi-label text classification (MLTC) aims to annotate documents with the most relevant labels from a number of candidate labels. In real applications, the distribution of label frequency often exhibits a long tail, i.e., a few labels are associated with a large number of documents (a.k.a. head labels), while a large fraction of labels are associated with a small number of documents (a.k.a. tail labels). To address the challenge of insufficient training data on tail label classification, we propose a Head-to-Tail Network (HTTN) to transfer the meta-knowledge from the data-rich head labels to the data-poor tail labels. The meta-knowledge is the mapping from few-shot network parameters to many-shot network parameters, which aims to promote the generalizability of tail classifiers. Extensive experimental results on three benchmark datasets demonstrate that HTTN consistently outperforms the state-of-the-art methods. The code and hyper-parameter settings are released for reproducibility (https://github.com/xiaolin1207/HTTN-master).

Introduction
Multi-label text classification has become one of the core tasks in natural language processing and has been widely applied in topic recognition (Yang et al. 2016), question answering (Kumar et al. 2016), sentiment analysis (Cambria, Olsher, and Rajagopal 2014) and so on. Even though various techniques have been proposed for multi-label learning, it is still a challenging task due to two main characteristics. One important statistical characteristic is that multi-label data usually follows a power-law distribution (called long-tailed), especially for data with a large number of labels. As shown in Figure 1 (a)-(b), a few labels are associated with a large number of documents (a.k.a. head labels), while a large fraction of labels are associated with a small number of documents (a.k.a. tail labels). In this situation, learning classifiers for tail labels is much more difficult than for head labels, due to the poor generalizability caused by insufficient training instances.

Figure 1: The label distribution and pairwise label correlation coefficients in the AAPD and RCV1 datasets: (a) label distribution in AAPD, (b) label distribution in RCV1, (c) label correlation in AAPD, (d) label correlation in RCV1. Labels are ordered by their frequency from the highest to the lowest in all subfigures; points with bright color indicate higher correlation between labels in subfigures (c)-(d).

The other characteristic is label dependency: because multiple labels may be assigned to one document, it is hard to separate classes with common instances. On the other hand, such label dependency is fortunately helpful for constructing the correlation among labels for knowledge transfer. Figure 1 (c)-(d) illustrate the pairwise label correlation on the AAPD and RCV1 datasets. An interesting observation is that there are strong correlations between head labels, as well as between head and tail labels. To handle such multi-label data, an intuitive idea is to make use of these label dependencies, which has been a hot research topic in multi-label learning. Most existing studies focus on how to exploit label structure (Zhang et al. 2018; Huang et al. 2019), label content meaning (Pappas and Henderson 2019; Xiao et al. 2019; Du et al. 2019), or label co-occurrence patterns (Liu et al. 2017; Kurata, Xiang, and Zhou 2016) to build one classifier on all labels. Recently, Wei and Li (2019) empirically demonstrated that tail labels have much less impact than head labels on the whole prediction performance. We are thus inspired to investigate an interesting and important question: Do the head labels help model construction on tail labels and thus improve long-tailed multi-label classification?

Due to the long-tailed distribution, as we know, training head labels and tail labels together makes the head labels dominate the learning procedure, which sacrifices the prediction precision of head labels and the recall of tail labels. To address this issue, great effort has been devoted to designing proper instance sampling strategies (e.g., oversampling on tail labels or undersampling on head labels), constructing complex objective functions (e.g., label-aware margin loss, class-balanced loss) (Cao et al. 2019; Cui et al. 2019), or transferring knowledge from head labels to tail labels (Liu et al. 2019). These methods are proposed for long-tailed multi-class problems; long-tailed multi-label problems, however, have barely been studied. Recently, MacAvaney et al. (2020) exploited extra information (label semantic embeddings) to improve long-tailed multi-label classification. Yuan, Xu, and Li (2019) trained classifiers for head labels and tail labels separately. Although they obtain impressive performance, the former suffers from high computing complexity and is limited by the availability of extra information; the latter obviously ignores the correlation between head and tail labels, which has been proven important for multi-label learning.

In this paper, we therefore propose a Head-to-Tail Network (HTTN) for the long-tailed multi-label classification task. Its main idea is to take advantage of the sufficient information among head labels and the label dependency between head labels and tail labels. HTTN consists of two main parts. The first part aims to learn the meta-knowledge between the learning model on few-shot data and that on many-shot data, which is implemented with the aid of data-rich head labels. The second part leverages this meta-knowledge, together with the label dependency between head and tail labels, to construct classifiers for tail labels. A good characteristic of HTTN is that only the classifier on head labels has to be trained, while the learning model on tail labels can be directly computed with the aid of the meta-knowledge and their own instances. This strategy has a valuable by-product: once a new label with a few instances arrives, we do not need to retrain the whole model. Meanwhile, to improve the robustness and generalization of tail label learning, an ensemble mechanism is designed for tail label prediction. We summarize our main contributions as follows:

• A head-to-tail network (HTTN) is proposed to tackle the long tail problem in multi-label text classification.
• HTTN effectively learns a model transformation strategy (denoted as meta-knowledge about learning) from few-shot learning to many-shot learning on head labels.
• HTTN efficiently builds classifiers for tail labels with the aid of the meta-knowledge and the label dependency between head and tail labels.
• HTTN obtains promising performance on three widely-used benchmark datasets in comparison with several popular baselines.

The rest of the paper is organized in four sections. Section 2 discusses the related work. Section 3 describes the proposed HTTN model for multi-label text classification. The experimental setting and results are discussed in Section 4. A brief conclusion and future work are given in Section 5.
Related work
Multi-label text classification (MLTC).
In the line of MLTC, it has been common practice to explore the label correlation. Kurata, Xiang, and Zhou (2016) adopted label co-occurrence to initialize the final hidden layer of the classifier network. Zhang et al. (2018) learn the label embedding from the label co-occurrence graph to supervise the classifier construction. In (Du et al. 2019), a text-based label embedding is introduced to mine fine-grained word-level classification clues. A joint input-label embedding model was proposed in (Pappas and Henderson 2019) to capture the structure of the labels and input documents and the interactions between the two. Although multi-label learning benefits from the exploration of label correlation, these methods cannot handle the long tail problem well because they treat all labels equally. In this case, the whole learning model will be dominated by the head labels. In fact, due to the insufficiency of training instances, tail labels were shown to have much less impact than head labels on the whole prediction performance (Wei and Li 2019). The language descriptions of labels were used in (MacAvaney et al. 2020) to design a soft n-gram interaction matching model for infrequent labels. To treat head and tail labels in different manners, (Yuan, Xu, and Li 2019) proposed a two-stage and ensemble learning approach. Although these methods obtain impressive results, they have limitations, such as high computing complexity, dependency on extra label information, and ignorance of the label dependency between head and tail labels.
Imbalanced data classification.
Another stream of work attacking the long tail problem is imbalanced learning, because the numbers of instances in head labels and tail labels differ greatly. The main strategies are discussed below.
Class distribution re-balancing strategy: The most popular ideas include under-sampling the head classes (Byrd and Lipton 2019), over-sampling the tail classes (Chawla et al. 2002; Buda, Maki, and Mazurowski 2018; Byrd and Lipton 2019), and allocating large weights to tail classes in loss functions (Cui et al. 2019). Unfortunately, (Zhou et al. 2020) and (Kang et al. 2019) empirically show that re-balancing methods may hurt feature learning to some extent. Recently, (Zhou et al. 2020) proposed a unified Bilateral-Branch Network (BBN) integrating a “conventional learning branch” and a “re-balancing branch”. The former branch is equipped with a typical uniform sampler to learn the universal patterns for recognition, while the latter branch is coupled with a reversed sampler to model the tail data. A similar idea is used in (Kang et al. 2019) to decouple the learning procedure into representation learning and classification.
Low-shot learning strategy: Low-shot learning shares similar features with long-tail learning, because both contain some labels with many instances, while the other labels have only a few instances. Low-shot learning aims to construct classifiers for data-poor classes with the aid of data-rich classes (Hariharan and Girshick 2017; Gidaris and Komodakis 2018; Qi, Brown, and Lowe 2018). Among them, (Hariharan and Girshick 2017) generate synthetic instances based on the head classifier and incorporate them to train the tail label learning model. Gidaris and Komodakis (2018) proposed an attention-based few-shot classification weight generator, built with the aid of base categories (i.e., head labels), which generalizes better to “unseen” categories while retaining the patterns trained from the base categories. Similarly, based on a learner initially trained on base classes with abundant samples, a simple imprinting strategy was proposed in (Qi, Brown, and Lowe 2018) to effortlessly learn novel categories with few samples.
Knowledge transfer strategy: Another way to handle imbalanced data is to transfer the knowledge learned from data-rich classes to help data-poor classes. For example, Wang, Ramanan, and Hebert (2017) propose a meta-network on the space of model parameters learned from head classes to determine the meta-knowledge, which can be transferred to tail classes in a progressive manner. In (Liu et al. 2019), a dynamic meta-embedding method is proposed to learn both a direct feature for few-shot categories and a memory feature with the aid of many-shot categories, which can handle tail recognition robustness and open recognition sensitivity.

Our proposed method HTTN also adopts the knowledge transfer strategy, but focuses on the multi-label learning problem, differing from the existing methods that work on the multi-class problem. In addition, HTTN builds generalized tail classifiers directly from the meta-knowledge and their own instances, and thus has more flexibility in processing tail and novel rare classes.
Proposed HTTN Method
Problem Definition:
Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ denote the set of documents, which consists of $N$ documents with corresponding labels $Y = \{y_i \in \{0,1\}^{l}\}$, where $l$ is the total number of labels. Each document contains a sequence of words, $x_i = \{w_1, \cdots, w_q, \cdots, w_n\}$, where $w_q \in \mathbb{R}^{k}$ is the $q$-th word vector (e.g., encoded by GloVe (Pennington, Socher, and Manning 2014)). Multi-label text classification (MLTC) aims to learn a classifier from $D$ that can assign the most relevant labels to new given documents. In this study, we divide the label set into two parts: head labels that are associated with many documents, and tail labels that are associated with few documents. The numbers of head labels and tail labels are $l_{head}$ and $l_{tail}$, respectively. The corresponding documents associated with head labels and tail labels form $D_{head}$ and $D_{tail}$, respectively. There may be overlapping documents between them: if a document belongs to both a head and a tail label, it appears in both $D_{head}$ and $D_{tail}$.

The framework of the proposed Head-to-Tail Network (HTTN) is shown in Figure 2. HTTN consists of three stages in the training process. First, the head label classifiers are learned by using a semantic extractor $\phi$, which extracts the semantic information from the documents with a Bi-LSTM and an attention mechanism; the classifier weights of the head labels are then learned, denoted as $M_{head}$. Second, a label prototyper generates a prototype for each head label by sampling the document representations learned through $\phi$ and averaging the samples with the same label, denoted by $R_{head}$. Third, a transfer learner distils class-irrelevant meta-knowledge $W_{transfer}$ for mapping $R_{head}$ to $M_{head}$. With the learned generic meta-knowledge $W_{transfer}$, the classifier weights of the tail labels $M_{tail}$ can then be inferred from their corresponding label prototypes $R_{tail}$, although only few documents are available for tail labels. Inference is done by sending the given documents to the shared semantic extractor $\phi$, and then predicting the labels by the integration of $M_{head}$ and $M_{tail}$. We next discuss the training process in detail.

Figure 2: The architecture of HTTN.
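To make the head/tail split concrete, the following is a minimal NumPy sketch (the function name, the frequency-ranking rule, and the cutoff argument are our illustrative assumptions, not part of the paper) that partitions the $l$ labels into head and tail labels by training frequency and collects the possibly overlapping document sets $D_{head}$ and $D_{tail}$.

```python
import numpy as np

def split_head_tail(Y, l_tail):
    """Split label indices into head and tail sets by training frequency.

    Y      : binary label matrix of shape (N, l), Y[i, j] = 1 if document i has label j
    l_tail : number of least-frequent labels to treat as tail labels
    """
    freq = Y.sum(axis=0)                      # number of documents per label
    order = np.argsort(-freq)                 # labels from most to least frequent
    head_labels = order[:-l_tail]
    tail_labels = order[-l_tail:]
    # D_head / D_tail: documents carrying at least one head / tail label (may overlap)
    doc_head = np.where(Y[:, head_labels].sum(axis=1) > 0)[0]
    doc_tail = np.where(Y[:, tail_labels].sum(axis=1) > 0)[0]
    return head_labels, tail_labels, doc_head, doc_tail
```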
Learning Head Label Classifiers

Semantic Extractor:
We adopt the bidirectional long short-term memory (Bi-LSTM) (Zhou et al. 2016) with a self-attention mechanism (Tan et al. 2018) to learn the document representation. With the input word $w_q$ in a document $x$, the hidden states of the Bi-LSTM are updated as

\[
\overrightarrow{h}_q = LSTM(\overrightarrow{h}_{q-1}, w_q), \qquad
\overleftarrow{h}_q = LSTM(\overleftarrow{h}_{q+1}, w_q)
\tag{1}
\]

where the hidden states $\overrightarrow{h}_q, \overleftarrow{h}_q \in \mathbb{R}^{k}$ encode the forward and backward word context representations, respectively. After taking all $n$ words, the whole document is represented as

\[
H = (\overrightarrow{H}; \overleftarrow{H}) \in \mathbb{R}^{2k \times n}, \quad
\overrightarrow{H} = (\overrightarrow{h}_1, \overrightarrow{h}_2, \cdots, \overrightarrow{h}_n), \quad
\overleftarrow{H} = (\overleftarrow{h}_1, \overleftarrow{h}_2, \cdots, \overleftarrow{h}_n)
\tag{2}
\]

To intensify the representation with important words in each document, we adopt the attention mechanism (Vaswani et al. 2017), which has been successfully used in various text mining tasks (Tan et al. 2018; Al-Sabahi, Zuping, and Nadher 2018; You et al. 2018),

\[
E = softmax(W_1 H), \qquad r = f(EH) = (E H^{T}) W_2
\tag{3}
\]

where $W_1 \in \mathbb{R}^{1 \times 2k}$ are the attention parameters, and $E \in \mathbb{R}^{1 \times n}$ represents the contribution of all words to the document. The final document representation $r \in \mathbb{R}^{d}$ is obtained by a linear embedding layer $f(\cdot)$ with parameter $W_2 \in \mathbb{R}^{2k \times d}$. It is worth noting that the semantic extractor $r = \phi(x)$ is a shared block in the whole HTTN model.
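The semantic extractor of Eq. (1)-(3) can be sketched in PyTorch as below. This is an illustrative reading of the equations, not the authors' released code; the exact parameterization of the attention (e.g., bias terms and the precise shapes of $W_1$ and $W_2$) may differ in the original implementation.

```python
import torch
import torch.nn as nn

class SemanticExtractor(nn.Module):
    """Bi-LSTM + self-attention document encoder phi(x), a sketch of Eq. (1)-(3).

    Dimensions follow the text: word vectors in R^k, hidden size k per direction,
    attention parameters W1 in R^{1 x 2k}, linear embedding W2 in R^{2k x d}.
    """
    def __init__(self, k=300, d=128):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=k, hidden_size=k,
                              batch_first=True, bidirectional=True)
        self.W1 = nn.Linear(2 * k, 1, bias=False)   # attention scores over words
        self.W2 = nn.Linear(2 * k, d, bias=False)   # linear embedding layer f(.)

    def forward(self, x):                 # x: (batch, n, k) word vectors
        H, _ = self.bilstm(x)             # (batch, n, 2k)
        E = torch.softmax(self.W1(H).squeeze(-1), dim=-1)   # (batch, n)
        doc = torch.bmm(E.unsqueeze(1), H).squeeze(1)       # (batch, 2k), i.e. E H^T
        return self.W2(doc)               # document representation r in R^d
```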
Head Classifier Construction:

Once we have the document representation $r \in \mathbb{R}^{d}$, we can build the multi-label text classifier for the head labels, e.g., by a one-layer neural network,

\[
\hat{y} = sigmoid(r M_{head})
\tag{4}
\]

where $M_{head} \in \mathbb{R}^{d \times l_{head}}$ contains the weights to learn as head classifier parameters, and $l_{head}$ is the number of head labels. The sigmoid function transfers the output values into probabilities for assigning multiple labels to one document. Cross-entropy is thus used as the loss function, whose suitability for multi-label learning has been demonstrated (Nam et al. 2014),

\[
L_c = - \sum_{j=1}^{l_{head}} \sum_{x_i \in D_{head}} \left( y_{ij} \log(\hat{y}_{ij}) + (1 - y_{ij}) \log(1 - \hat{y}_{ij}) \right)
\tag{5}
\]

where $D_{head}$ is the set of training documents associated with head labels, $l_{head}$ is the number of head labels, $\hat{y}_{ij} \in [0, 1]$ is the predicted probability, and $y_{ij} \in \{0, 1\}$ indicates the ground truth of the $i$-th document along the $j$-th label. The head classifier weights $M_{head}$ can be learned by minimizing the above loss function.
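A hedged sketch of the head-classifier training step follows; it uses PyTorch's BCEWithLogitsLoss, which fuses the sigmoid of Eq. (4) with the cross-entropy of Eq. (5) up to a constant normalization, and the tensor sizes are made up purely for illustration.

```python
import torch
import torch.nn as nn

# Head classifier of Eq. (4)-(5): a single linear layer on top of the shared extractor phi,
# trained with binary cross-entropy over the head labels only. M_head corresponds to head.weight.
d, l_head, batch = 128, 45, 8             # illustrative sizes, not taken from the paper
head = nn.Linear(d, l_head, bias=False)   # M_head in R^{d x l_head}
criterion = nn.BCEWithLogitsLoss()        # sigmoid + cross-entropy in one numerically stable op

r = torch.randn(batch, d)                 # document representations r = phi(x) for a batch of D_head
y = torch.randint(0, 2, (batch, l_head)).float()   # ground-truth head labels in {0, 1}
loss = criterion(head(r), y)              # L_c of Eq. (5), up to averaging over documents and labels
loss.backward()                           # in practice M_head (and phi) are updated jointly via Adam
```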
Label Prototyper

The label prototyper is designed to build a prototype for each class. We borrow the idea from the meta-learning prototypical network (Snell, Swersky, and Zemel 2017), which is an effective multi-class few-shot classification approach. For a head label $j$ (and later likewise for a tail label), we sample $t$ documents and get their representations $\{r^{j}_{1}, \cdots, r^{j}_{t}\}$. The prototype is then obtained by taking the average of these vectors,

\[
p^{j}_{head} = avg\{r^{j}_{1}, \cdots, r^{j}_{t}\}
\tag{6}
\]

In the multi-class prototypical network (Snell, Swersky, and Zemel 2017), one prototype is built for each class, and all prototypes are independent. However, the prototypes built here in multi-label learning are correlated, because a document sampled for one label can also be sampled for other labels. The correlation between prototypes is consistent with the correlation between labels.
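A possible implementation of the label prototyper is sketched below; it treats the semantic extractor $\phi$ as fixed while sampling, which is a simplification of ours, and the function signature is illustrative rather than the paper's API.

```python
import torch

def label_prototype(phi, docs_of_label, t, n_samples=1):
    """Eq. (6): prototypes for one label, each the average of t sampled document representations.

    phi           : the shared semantic extractor, mapping (m, n, k) word tensors to (m, d)
    docs_of_label : tensor (m, n, k) holding all documents that carry this label
    t             : number of documents sampled per prototype
    n_samples     : number of prototypes to draw (S for head labels, G for tail labels)
    """
    protos = []
    with torch.no_grad():                                  # phi is treated as fixed here
        for _ in range(n_samples):
            idx = torch.randperm(docs_of_label.size(0))[:t]
            protos.append(phi(docs_of_label[idx]).mean(dim=0))   # average of t vectors
    return torch.stack(protos)                             # (n_samples, d)
```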
Transfer Learner

The transfer learner is designed to link the (few-shot) label prototype $p^{j}$ and the corresponding many-shot classifier parameter $m^{j}$. For the head labels, we have obtained their many-shot classifier parameters $m^{j}_{head} \in M_{head}$, as well as their label prototypes $p^{j}_{head}$. Therefore, a transfer function can be learned to map $p^{j}_{head}$ to $m^{j}_{head}$, $j = 1, \ldots, l_{head}$, by minimizing

\[
L_t = \sum_{j=1}^{l_{head}} \| m^{j}_{head} - W_{transfer}\, p^{j}_{head} \|
\tag{7}
\]

where $W_{transfer} \in \mathbb{R}^{d \times d}$ is the parameter of the transfer learner. It captures the generic, class-irrelevant transformation from few-shot label prototypes to many-shot classifier parameters. For each head label, we sample $S$ times to obtain different $p^{j}_{head}$ for training a generalizable transfer learner. $S$ is usually a small constant, e.g., 30 or 40. The sensitivity analysis of $S$ is presented in Section 4.5.

Since the generic transfer learner maps a (few-shot) prototype to (many-shot) classifier parameters as a class-irrelevant transformation, we can use it to map the (few-shot) tail prototypes to their (many-shot) classifier parameters. For a tail label $z$, we also sample $t$ documents and get their representations $\{r^{z}_{1}, \cdots, r^{z}_{t}\}$ by the trained semantic extractor. Then, we use the label prototyper to get the prototype of the tail label,

\[
p^{z}_{tail} = avg\{r^{z}_{1}, \cdots, r^{z}_{t}\}
\tag{8}
\]

Thereafter, the tail label classifier parameters are estimated by using the transfer learner,

\[
\hat{m}^{z}_{tail} = W_{transfer}\, p^{z}_{tail}
\tag{9}
\]

This $\hat{m}^{z}_{tail}$ is an estimate of the tail classifier as if it had many-shot document instances. As discussed before, one of the most important characteristics of MLTC is the label correlation caused by label co-occurrence. Although the label prototyper can capture label co-occurrence, since the same document instance may contribute to more than one label prototype when the document has multiple labels, the label correlation has not been sufficiently explored due to the random sampling process. Especially for the tail labels, fully considering the correlation between tail labels and head labels can effectively improve the classification performance. We thus propose a tail label attention module, which aims to enhance the tail label classifiers by exploring their correlation with the head labels. For each tail label prototype $p^{z}_{tail}$, we calculate the attention score between it and each head prototype $p^{j}_{head}$:

\[
e_{zj} = f_{att}(p^{z}_{tail}, p^{j}_{head}), \qquad
\alpha_{zj} = softmax(e_{zj}) = \frac{\exp(e_{zj})}{\sum_{k=1}^{l_{head}} \exp(e_{zk})}, \qquad
p^{z}_{att} = \sum_{j} \alpha_{zj}\, p^{j}_{head}, \qquad
p^{z}_{new} = avg(p^{z}_{att}, p^{z}_{tail})
\tag{10}
\]

Then, the same transfer learner is applied to estimate the tail label classifier parameters,

\[
\hat{m}^{z}_{tail} = W_{transfer}\, p^{z}_{new}
\tag{11}
\]

which are concatenated with the head label classifier, forming the whole classifier for inference:

\[
M = cat[M_{head} : \hat{M}_{tail}]
\tag{12}
\]

Given a testing document, it first goes through the semantic extractor $\phi$ to obtain its representation vector $r$, and then the predicted labels are given by $\hat{y} = sigmoid(r M)$.
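The transfer learner and the tail label attention module can be sketched as follows. The paper does not specify the form of $f_{att}$, so the dot-product score used here is our assumption; for convenience, $M_{head}$ is handled in its transposed, row-per-label form.

```python
import torch
import torch.nn as nn

d = 128                                   # illustrative document/prototype dimension
W_transfer = nn.Linear(d, d, bias=False)  # meta-knowledge: maps prototypes to classifier weights

def transfer_loss(P_head, M_head):
    """Eq. (7): fit W_transfer so that W_transfer p_head^j approximates m_head^j.

    P_head : (l_head, d) sampled head prototypes
    M_head : (l_head, d) trained head weights, one row per head label (M_head^T of Eq. 4)
    """
    return (W_transfer(P_head) - M_head).norm(dim=-1).sum()

def estimate_tail_weights(P_tail, P_head):
    """Eq. (10)-(11): attend over head prototypes, then map the enhanced tail prototypes."""
    scores = P_tail @ P_head.t()          # f_att taken as a dot product (our assumption)
    alpha = torch.softmax(scores, dim=-1) # (l_tail, l_head) attention weights
    p_att = alpha @ P_head                # attention-weighted combination of head prototypes
    p_new = (p_att + P_tail) / 2          # p_new = avg(p_att, p_tail)
    return W_transfer(p_new)              # hat{M}_tail, one row per tail label
```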
Ensemble HTTN: In early research, ensembles were proven empirically and theoretically to achieve better performance than any single component. Hence, to improve the robustness of the classification process, we extend HTTN in an ensemble way. Ensemble HTTN (EHTTN) is designed to increase the accuracy of a single classifier by building several different tail classifiers. In summary, we sample the documents belonging to tail labels $G$ times, and use the transfer learner to obtain multiple classifier weights $\{\hat{M}^{1}_{tail}, \cdots, \hat{M}^{G}_{tail}\}$ for the tail labels, and thus multiple classifiers $\{M^{1}, \cdots, M^{G}\}$ used for inference. Ensemble HTTN has the following advantages: 1) Robustness. If we sample only once, the model is greatly affected by the quality of the randomly sampled documents; ensemble HTTN avoids this problem. 2) Flexibility. Ensemble HTTN is flexible in handling tail labels with different numbers of instances. Even in the long-tail part, some tail labels have dozens of instances, while others have only a few. Using a single batch of a fixed number of instances may under-sample the former and leave the latter out.
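As an illustration, EHTTN inference could be implemented as below. The paper states that the $G$ classifiers $\{M^{1}, \cdots, M^{G}\}$ are all used for inference; averaging their sigmoid scores is one natural combination rule that we assume here.

```python
import torch

def ehttn_predict(r, M_head_w, tail_weight_list):
    """EHTTN inference sketch: combine head weights with each estimated tail-weight set.

    r                : (batch, d) document representations from the shared extractor phi
    M_head_w         : (l_head, d) trained head classifier weights, one row per label
    tail_weight_list : list of G tensors (l_tail, d), one hat{M}_tail per sampling round
    """
    scores = []
    for M_tail_w in tail_weight_list:
        M = torch.cat([M_head_w, M_tail_w], dim=0)   # Eq. (12): M = cat[M_head : hat{M}_tail]
        scores.append(torch.sigmoid(r @ M.t()))      # (batch, l_head + l_tail)
    return torch.stack(scores).mean(dim=0)           # average over the G ensemble members
```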
Table 1: Summary of Experimental Datasets.

Datasets   N       M        D       L      L̄     L̃        W̄         W̃
RCV1       23,149  781,265  47,236  103    3.18   729.67   259.47    269.23
AAPD       54,840  1,000    69,399  54     2.41   2444.04  163.42    171.65
EUR-Lex    13,905  3,865    33,246  3,714  5.32   19.93    1,217.47  1,242.13

N is the number of training instances, M is the number of test instances, D is the total number of words, L is the total number of classes, L̄ is the average number of labels per document, L̃ is the average number of documents per label, W̄ is the average number of words per document in the training set, and W̃ is the average number of words per document in the testing set.

Experiments
In this section, we evaluate the proposed model on three datasets by comparing with the state-of-the-art methods in terms of the widely used metrics P@k and nDCG@k (k = 1, 3, 5) and F1-score.
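For reference, the evaluation metrics can be computed as in the following sketch (binary relevance, averaged over documents); this is a standard formulation of P@k and nDCG@k rather than code taken from the paper.

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k: fraction of the top-k predicted labels that are relevant, averaged over documents."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, topk, axis=1)
    return hits.mean(axis=1).mean()

def ndcg_at_k(y_true, y_score, k):
    """nDCG@k with binary relevance, averaged over documents."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    gains = np.take_along_axis(y_true, topk, axis=1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (gains * discounts).sum(axis=1)
    ideal = np.array([discounts[:min(int(t), k)].sum() for t in y_true.sum(axis=1)])
    return np.mean(dcg / np.maximum(ideal, 1e-12))
```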
Experimental Setting
Datasets:
Three multi-label text datasets are used to evaluate the HTTN model: AAPD, RCV1 and EUR-Lex. Their label distributions all follow a power-law distribution, as shown in Figure 1 (the label distribution of EUR-Lex is presented in the supplementary document due to the space limit). The benchmark datasets have a predefined training and testing split, and we follow the same data usage for all evaluated models. The datasets are summarized in Table 1.
Baseline Models:
To demonstrate the effectiveness of HTTN on the benchmark datasets, we selected the seven most representative baseline models from the different groups of related work discussed in the second section:
• Joint: it uses Bi-LSTM with a self-attention mechanism to tackle multi-label text classification without differentiating the head and tail labels, i.e., learning the classifier for them jointly.
• XML-CNN (Liu et al. 2017): it adopts a Convolutional Neural Network (CNN) and a dynamic pooling technique to extract high-level features for large-scale multi-label text classification.
• DXML (Zhang et al. 2018): it tries to solve the multi-label long tail problem by considering the label structure from the label co-occurrence graph.
• LTMCP (Yuan, Xu, and Li 2019): it introduces an ensemble method to tackle long-tailed multi-label training. A DNN and a linear classifier are combined to deal with the head labels and tail labels, respectively.
• BBN (Zhou et al. 2020): it takes care of both representation learning and classifier learning to exhaustively improve the performance of long-tailed tasks.
• Imprinting (Qi, Brown, and Lowe 2018): it computes embeddings of novel examples and sets the corresponding novel weights in the final layer directly.
• OLTR (Liu et al. 2019): it learns dynamic meta-embeddings in order to share visual knowledge between head and tail classes.
Parameter Setting:
For all three datasets, we use GloVe (Pennington, Socher, and Manning 2014) to obtain 300-dimensional word embeddings. The LSTM hidden state dimension $k$ is set to 300. The parameter $d = 128$ for $W_2$ and $W_{transfer}$. The numbers of sampled instances $t$ for the label prototyper in AAPD, RCV1 and EUR-Lex are $t$ = 5, , , respectively. The whole model is trained via Adam (Kingma and Ba 2014) with a learning rate of 0.001. AAPD and RCV1 have 54 and 103 labels, respectively. To test the performance on different numbers of tail labels, we set $l_{tail}$ = 18 and 9 in AAPD, and $l_{tail}$ = 28 and 14 in RCV1. For the EUR-Lex dataset, we select the last 768 one-shot tail labels and the 1,238 tail labels with fewer than three training documents. For the ensemble HTTN, we set $G = 30$ for AAPD and RCV1, and $G = 1$ for EUR-Lex, because there are many one-shot labels in EUR-Lex. We used the default parameters for the DXML, XML-CNN, EXAM, and LTMCP models. The baselines OLTR, Imprinting, and BBN deal with the long tail problem in image recognition, where the feature extractor is a ResNet-10, ResNet-32 or similar backbone; for a fair comparison, we replace their feature extractor with the Bi-LSTM with attention. The parameters of all baselines are either adopted from their original papers or determined by experiments.
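For convenience, the reported settings can be collected in a single configuration sketch; the dictionary layout and key names are ours, and the per-dataset values of $t$ are not included.

```python
# Illustrative collection of the hyper-parameters reported above (not the authors' config file).
config = {
    "word_embedding": "GloVe-300d",
    "lstm_hidden_k": 300,
    "doc_dim_d": 128,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "l_tail": {"AAPD": [18, 9], "RCV1": [28, 14], "EUR-Lex": [1238, 768]},
    "ensemble_G": {"AAPD": 30, "RCV1": 30, "EUR-Lex": 1},
}
```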
Results Comparison and Discussion

The results on the three datasets are presented in Table 2, Table 3, and Table 4. The best results are marked in bold.

Table 2: Comparing HTTN with baselines on the AAPD dataset.

Method               P@1    P@3    P@5    nDCG@3  nDCG@5  F1-score
Joint                78.20  55.21  37.89  73.42   77.63   63.88
DXML                 80.54  56.30  39.16  77.23   80.99   65.13
XML-CNN              74.38  53.84  37.79  71.12   75.93   65.35
OLTR                 78.96  56.28  38.60  74.66   78.58   62.48
Imprinting           68.68  38.22  23.71  55.30   55.67   25.58
BBN                  81.56  57.81  39.10  76.92   80.06   66.73
l_tail=18  LTMCP     78.12  55.19  37.67  75.18   75.43   62.84
l_tail=18  HTTN      82.04  57.12  39.33  76.98   80.69   67.71
l_tail=18  EHTTN     83.34  59.06  40.30  77.75   81.65   68.84
l_tail=9   LTMCP     78.51  56.02  38.46  75.19   76.05   63.59
l_tail=9   HTTN      82.49  58.72  40.31  78.20   81.24   68.14
l_tail=9   EHTTN

Table 3: Comparing HTTN with baselines on the RCV1 dataset.

Method               P@1    P@3    P@5    nDCG@3  nDCG@5  F1-score
Joint                92.18  72.33  47.35  83.02   81.47   75.19
DXML                 94.04  78.65  54.38  89.83   90.21   75.76
XML-CNN              95.75  78.63  54.94
l_tail=28  LTMCP     90.47  74.57  51.59  85.31   85.83   73.99
l_tail=28  HTTN      94.11  75.92  52.85  87.02   87.98   76.09
l_tail=28  EHTTN     95.62  77.25  54.28  87.46   88.46   76.92
l_tail=14  LTMCP     91.39  73.04  49.76  83.30   83.93   74.67
l_tail=14  HTTN      94.70  77.83  54.21  88.49   89.05   76.86
l_tail=14  EHTTN

Table 4: Comparing HTTN with baselines on the EUR-Lex dataset.

Method               P@1    P@3    P@5    nDCG@3  nDCG@5  F1-score
Joint                79.04  64.89  55.00  69.20   63.60   52.51
DXML                 80.41  66.74  56.33  70.03   63.18   53.28
XML-CNN              78.20  65.93  53.81  68.41   60.54   51.98
OLTR                 65.62  52.34  42.69  55.73   50.57   22.64
Imprinting           62.16  40.25  29.07  45.46   38.24   9.94
BBN                  76.22  60.40  49.45  64.26   58.54   41.01
l_tail=1238  LTMCP   75.23  60.12  49.36  64.89   58.23   48.10
l_tail=768   LTMCP   77.26  62.39  52.10  67.18   60.54   50.33
l_tail=1238  HTTN    80.53  66.96  55.71  70.35   63.87   53.44
l_tail=768   HTTN

From Table 2 to Table 4, we can make a number of observations. First, Imprinting is worse than the other methods because it only copies the embedding activations of a novel exemplar as the new set of classifier parameters. DXML explores the label correlation through the label graph to alleviate the long tail problem in MLTC, so it obtains satisfactory results. OLTR learns dynamic meta-embeddings to help the tail label classification. LTMCP combines a linear model and a DNN to train on the documents belonging to tail labels and head labels, respectively. However, OLTR and LTMCP both ignore the correlation between the head labels and tail labels, which is in fact important for the long tail MLTC task. The Joint method trains on the documents belonging to the head labels and the tail labels jointly, resulting in good results on the head labels but bad results on the tail labels. EHTTN transfers the meta-knowledge from the head labels to the tail labels, and ensembles multiple sampled documents from the tail labels to further improve the robustness of tail label classification. The results demonstrate the superiority of the proposed EHTTN on all metrics for MLTC.

In the EUR-Lex training set, there are 768 labels with only one training document, so $l_{tail}$ = 768 is a setting equivalent to one-shot learning. The high data scarcity causes several methods to perform poorly. In particular, OLTR, which borrows information from the learned memory to help tail label classification, cannot obtain a comprehensive memory because each tail label has only one document. HTTN in this one-shot setting outperforms the other methods on all measures. It is also interesting to find, on all three datasets, that HTTN/EHTTN performs better when $l_{tail}$ is smaller. The reason is that when $l_{tail}$ is smaller, more head labels are used for distilling $W_{transfer}$, yielding richer meta-knowledge. We present the detailed analysis of the impact of $l_{tail}$ in the supplementary document. The results in Tables 2, 3, and 4 answer the question we raised: the head labels do help a lot for long-tailed multi-label text classification. In addition, more head labels are more helpful for learning the meta-knowledge.

Ablation Test
An ablation test provides an informative analysis of the effect of the different components of the proposed HTTN, which can be taken apart as HTTN without the attention module and fine-tuning (denoted as H−F−A), HTTN without fine-tuning but with attention (H−F), the complete HTTN (H), and the ensemble of multiple HTTNs (denoted as EH) in Figure 3. The results were obtained on the AAPD and RCV1 datasets. There are two interesting observations: 1) it is always preferable to use the ensemble strategy, as shown by the superior performance of EH; and 2) the result of H−F is always better than H−F−A, because the attention module is designed to explore the correlation between the head and tail labels, which improves the classification performance. H is better than H−F−A and H−F, indicating that fine-tuning can further improve the classification performance.

Figure 3: Ablation test on two datasets: (a) AAPD, (b) RCV1.

Figure 4: F1-score on RCV1 tail labels.
Performance analysis on tail labels
To further verify the proposed HTTN, we compare it with Joint and BBN (Zhou et al. 2020) on the tail labels only. Figure 4 shows their F1-scores on the tail labels in RCV1. We can see that the F1-scores of HTTN on most of the tail labels are higher than those of the Joint and BBN models, especially on the extreme tail labels 97, 98, 99 and 100, whose numbers of documents are 6, 3, 2 and 2, respectively. Due to this high data scarcity, the predictions of Joint and BBN on them are all 0 (no positives). The meta-knowledge learned in HTTN does help to build effective tail label classifiers, making non-zero positive predictions.
Sensitivity of S in transfer learner

To investigate the impact of the sampling frequency $S$ in the transfer learner, we vary $S$ from 5 to 60 and show its influence on the F1-score in Figure 5. Increasing $S$ from 5 to 20 greatly helps HTTN gain a strong improvement on both datasets. That is to say, sampling multiple prototypes for the head labels can effectively strengthen the generalizability of $W_{transfer}$ and distil the class-irrelevant meta-knowledge to transfer.

Figure 5: Sensitivity to the sampling frequency.

Figure 6: The HTTN members in EHTTN: (a) AAPD, (b) RCV1.

Analysis of ensemble HTTN
To further verify the robustness and effectiveness of Ensemble HTTN, in Figure 6 we compare the results of EHTTN with those of the worst member (HTTN_min), the best member (HTTN_max) and the average over all members (HTTN_avg). We can see that EHTTN always achieves the best results. Learning only one $\hat{M}_{tail}$ can result in unstable performance, depending on the quality of the sampled instances; learning more $\hat{M}_{tail}$ strengthens the classifier with higher robustness and diversity.

In summary, extensive experiments are carried out on three MLTC benchmark datasets of various scales. The results demonstrate that the proposed HTTN achieves superior performance compared with the seven baselines. In particular, the effectiveness of HTTN is shown on the tail labels.

Conclusions
A Head-to-Tail Network (HTTN) is proposed in this paper for long-tailed multi-label text classification. Using the documents belonging to the head labels, a transfer learner learns the meta-knowledge that maps the class weights learned from few-shot data to the class weights learned from many-shot data. This generic, class-irrelevant meta-knowledge effectively improves the tail label classification performance. Extensive experiments on benchmark datasets demonstrate the superiority of HTTN compared with the state-of-the-art methods. With HTTN, the head labels do help for long-tailed multi-label text classification.

Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 61822601, 61773050, 61632004 and 61828302; the Beijing Natural Science Foundation under Grant Z180006; the National Key Research and Development Program of China under Grants 2020AAA0106800 and 2017YFC1703506; the Fundamental Research Funds for the Central Universities (2019JBZ110); and King Abdullah University of Science & Technology, under award number FCC/1/1976-19-01.
References
Al-Sabahi, K.; Zuping, Z.; and Nadher, M. 2018. A hierarchical structured self-attentive model for extractive document summarization (HSSAS). IEEE Access 6: 24205–24212.

Buda, M.; Maki, A.; and Mazurowski, M. A. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks.

Byrd, J.; and Lipton, Z. 2019. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, 872–881.

Cambria, E.; Olsher, D.; and Rajagopal, D. 2014. SenticNet 3: A common and common-sense knowledge base for cognition-driven sentiment analysis. In Twenty-Eighth AAAI Conference on Artificial Intelligence.

Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; and Ma, T. 2019. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, 1567–1578.

Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16: 321–357.

Cui, Y.; Jia, M.; Lin, T.-Y.; Song, Y.; and Belongie, S. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 9268–9277.

Du, C.; Chen, Z.; Feng, F.; Zhu, L.; Gan, T.; and Nie, L. 2019. Explicit interaction model towards text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6359–6366.

Gidaris, S.; and Komodakis, N. 2018. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4367–4375.

Hariharan, B.; and Girshick, R. 2017. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision, 3018–3027.

Huang, X.; Chen, B.; Xiao, L.; and Jing, L. 2019. Label-aware document representation via hybrid attention for extreme multi-label text classification. arXiv preprint arXiv:1905.10070.

Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; and Kalantidis, Y. 2019. Decoupling representation and classifier for long-tailed recognition. In International Conference on Learning Representations.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kumar, A.; Irsoy, O.; Ondruska, P.; Iyyer, M.; Bradbury, J.; Gulrajani, I.; Zhong, V.; Paulus, R.; and Socher, R. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, 1378–1387.

Kurata, G.; Xiang, B.; and Zhou, B. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 521–526.

Liu, J.; Chang, W.-C.; Wu, Y.; and Yang, Y. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 115–124.

Liu, Z.; Miao, Z.; Zhan, X.; Wang, J.; Gong, B.; and Yu, S. X. 2019. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2537–2546.

MacAvaney, S.; Dernoncourt, F.; Chang, W.; Goharian, N.; and Frieder, O. 2020. Interaction matching for long-tail multi-label classification. arXiv preprint arXiv:2005.08805.

Nam, J.; Kim, J.; Mencía, E. L.; Gurevych, I.; and Fürnkranz, J. 2014. Large-scale multi-label text classification - revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 437–452. Springer.

Pappas, N.; and Henderson, J. 2019. GILE: A generalized input-label embedding for text classification. Transactions of the Association for Computational Linguistics 7: 139–155.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Qi, H.; Brown, M.; and Lowe, D. G. 2018. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5822–5830.

Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.

Tan, Z.; Wang, M.; Xie, J.; Chen, Y.; and Shi, X. 2018. Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, Y.-X.; Ramanan, D.; and Hebert, M. 2017. Learning to model the tail. In Advances in Neural Information Processing Systems, 7029–7039.

Wei, T.; and Li, Y.-F. 2019. Does tail label help for large-scale multi-label learning? IEEE Transactions on Neural Networks and Learning Systems.

Xiao, L.; Huang, X.; Chen, B.; and Jing, L. 2019. Label-specific document representation for multi-label text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 466–475.

Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; and Hovy, E. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489.

You, R.; Dai, S.; Zhang, Z.; Mamitsuka, H.; and Zhu, S. 2018. AttentionXML: Extreme multi-label text classification with multi-label attention based recurrent neural networks. arXiv preprint arXiv:1811.01727.

Yuan, M.; Xu, J.; and Li, Z. 2019. Long tail multi-label learning. 28–31. IEEE.

Zhang, W.; Yan, J.; Wang, X.; and Zha, H. 2018. Deep extreme multi-label learning. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, 100–107.

Zhou, B.; Cui, Q.; Wei, X.-S.; and Chen, Z.-M. 2020. BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9719–9728.

Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; and Xu, B. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639.