LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification
Ting Jiang, Deqing Wang, Leilei Sun, Huayi Yang, Zhengyang Zhao, Fuzhen Zhuang
SKLSDE and BDBC Lab, Beihang University, Beijing, China
Key Lab of Intelligent Information Processing of CAS, Institute of Computing Technology, CAS, Beijing, China
Beijing Advanced Innovation Center for Imaging Theory and Technology, Academy for Multidisciplinary Studies, Capital Normal University, Beijing, China
{royokong, dqwang, leileisun, yanghy, zzy979}@buaa.edu.cn, [email protected]
Abstract
Extreme Multi-label text Classification (XMC) is the task of finding the most relevant labels from a large label set. Nowadays deep learning-based methods have shown significant success in XMC. However, existing methods (e.g., AttentionXML and X-Transformer) still suffer from 1) combining several models to train and predict on one dataset, and 2) sampling negative labels statically during the training of the label ranking model, which reduces both the efficiency and accuracy of the model. To address the above problems, we propose LightXML, which adopts end-to-end training and dynamic negative label sampling. In LightXML, we use generative cooperative networks to recall and rank labels, in which the label recalling part generates negative and positive labels, and the label ranking part distinguishes the positive labels from them. Through these networks, negative labels are sampled dynamically during label ranking part training by feeding both parts with the same text representation. Extensive experiments show that LightXML outperforms state-of-the-art methods on five extreme multi-label datasets with a much smaller model size and lower computational complexity. In particular, on the Amazon dataset with 670K labels, LightXML reduces the model size by up to 72% compared to AttentionXML. Our code is available at http://github.com/kongds/LightXML.
Introduction
Extreme Multi-label text Classification (XMC) is the task of finding the most relevant labels for each text from an extremely large label set. It is a very practical problem that has been widely applied in many real-world scenarios, such as tagging a Wikipedia article with the most relevant labels (Dekel and Shamir 2010), dynamic search advertising in E-commerce (Prabhu et al. 2018), and suggesting keywords to advertisers on Amazon (Chang et al. 2020).

Different from the classical multi-label classification problem, the candidate label set can be very large, which results in huge computational complexity. To overcome this difficulty, many methods (e.g., Parabel (Prabhu et al. 2018), DiSMEC (Babbar and Schölkopf 2017) and AttentionXML (You et al. 2019)) have been proposed in recent years. From the perspective of text representation learning, these methods can be divided into two categories: 1) raw feature methods, where texts are represented by sparse vectors and fed to classifiers directly; 2) semantic feature methods, where deep neural networks are usually employed to transform the texts into semantic representations before the classification procedure.

The superiority of XMC methods with semantic features has been frequently reported recently. For example, AttentionXML (You et al. 2019) and X-Transformer (Chang et al. 2020) have achieved significant improvements in accuracy compared to the state-of-the-art methods. However, the challenge of solving the XMC problem with constrained computational resources still remains: both AttentionXML and X-Transformer need large computational resources to train or to deploy. AttentionXML needs to train four separate models for big XMC datasets like Amazon-670K. X-Transformer uses a large transformer model for recalling labels, and simple linear classifiers to rank labels. Both methods split the training process into multiple stages, and each stage needs a separate model to train, which takes a lot of computational resources. Another disadvantage of these methods is static negative label sampling. Both methods train the label ranking model with negative labels sampled by fine-tuned label recalling models. This negative sampling strategy makes the label ranking model focus only on a small number of negative labels, and hard to converge because these negative labels are very similar to the positive labels.

To address the above problems, we propose a light deep learning model, LightXML, which fine-tunes a single transformer model with dynamic negative label sampling. LightXML consists of three parts: text representing, label recalling, and label ranking. For text representing, we use multi-layer features of the transformer model as the text representation, which provides rich text information for the other two parts. With the advantage of dynamic negative sampling, we propose generative cooperative networks to recall and rank labels. For the label recalling part, we use a generator network based on label clusters to recall labels. For the label ranking part, we use a discriminator network to distinguish positive labels from the recalled labels.

In summary, the contributions of this paper are as follows:
• A novel deep learning method is proposed, which combines the powerful transformer model with generative cooperative networks. This method can fully exploit the advantage of the transformer model by end-to-end training.
• We propose dynamic negative sampling by using generative cooperative networks to recall and rank labels. Dynamic negative sampling allows the label ranking part to learn from easy to hard examples and avoid overfitting, which boosts overall model performance.
• Our extensive experiments show that our model achieves the best results among all methods on five benchmark datasets, while having a much smaller model size and lower computational complexity than current state-of-the-art methods.
Related work
Many novel methods have been proposed to improve accuracy while controlling computational complexity and model size in XMC. These methods can be broadly categorized into two directions according to the input: traditional machine learning methods that use sparse features of text, such as BOW features, as input, and deep learning methods that use raw text. Traditional machine learning methods can be further divided into three directions: one-vs-all methods, tree-based methods and embedding-based methods.
One-vs-all methods
One-vs-all methods such as DiSMEC (Babbar and Schölkopf 2017), ProXML (Babbar and Schölkopf 2019), PDSparse (Yen et al. 2016), and PPDSparse (Yen et al. 2017) treat each label as an independent binary classification problem. Although many one-vs-all methods like DiSMEC and PPDSparse focus on improving model efficiency, one-vs-all methods still suffer from expensive computational complexity and large model size. At this cost in efficiency, these methods can achieve acceptable accuracy.
Tree-based methods
Tree-based methods aim to overcome the high computational complexity of one-vs-all methods. These methods construct a hierarchical tree structure by partitioning labels, as in Parabel (Prabhu et al. 2018), or sparse features, as in FastXML (Prabhu and Varma 2014). In Parabel, the label tree is built from label features using balanced k-means clustering; each inner node contains several classifiers that decide whether the text belongs to its children, and each leaf node contains classifiers for its labels. FastXML directly optimizes the normalized Discounted Cumulative Gain (nDCG), and each node of FastXML contains several binary classifiers that decide which child nodes to traverse, similar to Parabel.
Embedding-based methods
Embedding-based methods project the high-dimensional label space into a low-dimensional space to simplify the XMC problem. The design of the label compression and label decompression parts is significant for the performance of these methods. However, no matter how the label compression part is designed, compression always loses part of the information, which makes these methods achieve worse accuracy than one-vs-all and tree-based methods. Improved embedding-based methods such as SLEEC (Bhatia et al. 2015) and AnnexML (Tagami 2017) have been proposed to mitigate this problem by improving the label compression and decompression parts, but the problem still remains.
Deep learning methods
With the development of NLP, deep learning methods, which can learn better text representations from raw text, have shown great improvements in XMC. The main challenge for these methods is how to cope with millions of labels under limited GPU resources. We review the three most representative methods: XML-CNN (Liu et al. 2017), AttentionXML (You et al. 2019) and X-Transformer (Chang et al. 2020).

XML-CNN is the first successful method that showed the power of deep learning in XMC. It learns text representations by feeding word embeddings to CNN networks with end-to-end training. To scale to datasets with hundreds of thousands of labels, XML-CNN proposes a hidden bottleneck layer that projects the text feature into a low-dimensional space, which reduces the overall model size. However, XML-CNN only uses a simple fully connected layer to score all labels with a binary cross entropy loss, as in simple multi-label classification, which makes it hard to deal with large label sets.

After XML-CNN, AttentionXML showed great success in XMC: it surpassed all traditional machine learning methods and proved the superiority of raw text over sparse features. Unlike the simple fully connected layer used for label scoring in XML-CNN, AttentionXML adopts a probabilistic label tree (PLT) that can handle millions of labels. AttentionXML uses RNN networks and attention mechanisms to handle raw text, with a different model for each layer of the PLT, so it needs to train several models for one dataset. To alleviate this problem, AttentionXML initializes the weights of the current layer model from its upper layer model, which helps the model converge quickly. But it still makes AttentionXML slow in prediction and burdens it with a big overall model size. In conclusion, AttentionXML is an enlightening method that elegantly combines the PLT with deep learning.

X-Transformer only uses deep learning models to match label clusters for the given raw text, and ranks these labels with high-dimensional linear classifiers over the sparse features and the text representations of the deep learning models. X-Transformer is the first method to use deep transformer models in XMC. Due to the high computational complexity of transformer models, it only fine-tunes them as the label cluster matcher, which cannot fully exploit the power of transformer models. Although X-Transformer can reach higher accuracy than AttentionXML, the cost is high computational complexity and model size, which makes X-Transformer infeasible in many XMC applications; AttentionXML can reach better accuracy at the same computational cost as X-Transformer by ensembling more models.

Figure 1: An overview of the proposed framework.
Methodology
Problem Formulation
Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is raw text and $y_i \in \{0, 1\}^L$ is the label of $x_i$ represented as an $L$-dimensional multi-hot vector, our goal is to learn a function $f(x_i) \in \mathbb{R}^L$ that scores all labels, and $f$ needs to give a high score to every label $l$ with $y_{il} = 1$ for $x_i$. We can obtain the top-K predicted labels from $f(x_i)$. In many XMC methods, only the recalled labels are scored to reduce the overall computational complexity; the number of labels in this subset is much smaller than the full label set, which saves much computation time.
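As a minimal illustration of this formulation (not the authors' code), the following sketch retrieves the top-K labels from an arbitrary scoring function; the scoring function here is a random placeholder standing in for $f$.

```python
import numpy as np

L = 8          # toy label-set size (real XMC has 10^4 to 10^6 labels)
K = 3          # number of labels to return

# Placeholder for f(x): any model that maps a text to L real-valued scores.
def f(x: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(x)) % (2**32))
    return rng.normal(size=L)

scores = f("an example document")
top_k = np.argsort(-scores)[:K]        # indices of the K highest-scoring labels
print(top_k, scores[top_k])
```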
Framework
The proposed framework is shown in Figure 1. We first cluster labels by their sparse features. After label clustering, we have a certain number of label clusters, and each label belongs to exactly one cluster. Then we employ the transformer model to embed the raw text into a high-dimensional representation, which is the input of both the label recalling part and the label ranking part.

For the label recalling and label ranking parts, with the advantage of dynamic negative sampling, we propose generative cooperative networks for XMC. The label recalling part is the generator of these networks, which can dynamically sample negative labels. The label ranking part is the discriminator, which distinguishes between negative labels and positive labels. In the cooperation of these networks, the generator assists the discriminator in learning better label representations, and the discriminator helps the generator to consider fine-grained text information, because the discriminator is trained according to specific labels rather than label clusters.

Specifically, the generator scores every label cluster using the text representation to obtain the top-K label clusters, whose labels form the sampled label set. All positive labels are added to this subset in the training stage so that the discriminator can be fine-tuned together with the generator. The discriminator then scores each label in this subset to distinguish positive labels from negative labels.
Label clustering
Label clustering is equivalent to two layers of a Probabilistic Label Tree (PLT). Since a deep PLT harms performance and label clustering is enough to cope with extreme numbers of labels, we use this two-layer PLT with the same construction method as AttentionXML (You et al. 2019).

More specifically, given the maximum number of labels $s$ that each cluster may contain, our goal is to partition the labels into $K$ label clusters such that the number of labels in each cluster is less than $s$ and greater than $s/2$. To this end, we first obtain each label representation by normalizing the sum of the sparse text features of the texts that contain this label. Then, we use balanced k-means (k=2) clustering to recursively partition the label set until all label sets satisfy the above requirement; a sketch of this procedure is given below.
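The following is a minimal sketch of such recursive balanced 2-means label clustering, assuming the label representations are given as a dense matrix; the helper names and the balancing strategy (ranking labels by their distance difference and cutting the ranking in half) are illustrative, not the authors' exact implementation.

```python
import numpy as np

def balanced_2means(reps: np.ndarray, idx: np.ndarray, n_iter: int = 20):
    """Split the labels `idx` into two equally sized halves by balanced 2-means."""
    centers = reps[np.random.choice(len(idx), 2, replace=False)]
    for _ in range(n_iter):
        # Rank labels by how much more similar they are to center 0 than to center 1,
        # then cut the ranking in half so the split stays balanced.
        sim = reps @ centers.T                         # similarity to the two centers
        order = np.argsort(sim[:, 0] - sim[:, 1])
        left, right = order[len(idx) // 2:], order[:len(idx) // 2]
        centers = np.stack([reps[left].mean(0), reps[right].mean(0)])
    return idx[left], idx[right]

def cluster_labels(reps: np.ndarray, max_size: int):
    """Recursively partition all labels until every cluster has at most max_size labels."""
    clusters, queue = [], [np.arange(len(reps))]
    while queue:
        idx = queue.pop()
        if len(idx) <= max_size:
            clusters.append(idx)
        else:
            queue.extend(balanced_2means(reps[idx], idx))
    return clusters
```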
Text representation
Transformer models show outstanding performance on a wide array of NLP tasks. In our framework, we adopt three pre-trained base transformer models: BERT (Devlin et al. 2018), XLNet (Yang et al. 2019) and RoBERTa (Liu et al. 2019), the same models as X-Transformer (Chang et al. 2020). However, compared to the large transformer models (24 layers and 1024 hidden dimensions) that X-Transformer uses, we only use base transformer models (12 layers and 768 hidden dimensions) to reduce computational complexity.

For the input sequence length, the time and space complexity of transformer models grows quadratically with the text length under the self-attention mechanism (Vaswani et al. 2017), which makes transformer models hard to apply to long text. In X-Transformer, the maximum sequence length is set to 128 for all datasets. However, we set the maximum sequence length to 512 for small XMC datasets and 128 for large XMC datasets, taking advantage of using base models instead of large models.

For the text embedding, different from fine-tuning transformer models for typical text classification tasks, and in order to make full use of the transformer model in XMC, we concatenate the "[CLS]" token hidden states of the last five layers as the text representation. Let $e_i \in \mathbb{R}^{l}$ be the $l$-dimensional hidden state of the "[CLS]" token in the $i$-th layer from the last. The text is then represented by the concatenation of the last five layers, $e = [e_1, \ldots, e_5] \in \mathbb{R}^{5l}$. This high-dimensional text representation enriches text information and improves the generalization ability of the overall model in XMC, and our experiments show that it speeds up convergence and improves model performance. To avoid overfitting, we also apply a high dropout rate to this high-dimensional text representation.
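A minimal sketch of this multi-layer text representation using the Hugging Face transformers library follows; the model name and the 0.5 dropout rate come from the paper's settings, but the exact module structure is an assumption, not the released LightXML code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class TextEncoder(torch.nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased", dropout: float = 0.5):
        super().__init__()
        # output_hidden_states=True exposes the hidden states of every layer.
        self.bert = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states: embeddings + 12 layers; take the last five layers,
        # keep only the "[CLS]" position (index 0), and concatenate -> shape (B, 5*768).
        cls_states = [h[:, 0] for h in out.hidden_states[-5:]]
        return self.dropout(torch.cat(cls_states, dim=-1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = TextEncoder()
batch = tokenizer(["an example document"], return_tensors="pt",
                  truncation=True, max_length=512)
e = encoder(batch["input_ids"], batch["attention_mask"])   # text representation e
```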
Label recalling
In this part, our goal is to sample not only positive labels but also negative labels to help the label ranking part learn. To reduce computational complexity and speed up convergence, we directly sample label clusters instead of labels, and use all labels in these clusters as the sampled labels.

The generator is a fully connected layer with a sigmoid, $G(e) = \sigma(W_g e + b_g)$, and $G$ returns a $K$-dimensional vector containing the scores of all $K$ label clusters. We choose the top $b$ label clusters to generate a subset of labels. In training, all positive labels are added to this subset to force the label ranking part to distinguish positive and negative labels. In prediction, we do not modify this subset, so it may not contain all positive labels.

For the generator loss, we do not calculate the loss according to the feedback of the discriminator, because the generator is designed to cooperate with the discriminator and to make the model easy to converge; instead, we calculate the loss directly from the ground truth. The loss function is:
$$ L_g(G(e), y_g) = \sum_{i=1}^{K} (1 - y_g^i)\left(-\log(1 - G(e)_i)\right) + y_g^i\left(-\log(G(e)_i)\right), \quad (1) $$
where $y_g \in \{0, 1\}^K$ is the multi-hot representation of the label clusters of the given text.

Negative label sampling is a decisive factor for overall model performance. In AttentionXML (You et al. 2019), negative samples are static, and the model only overfits to distinguishing specific negative label samples, which constrains performance. Static negative sampling also makes the model hard to converge, because negative labels are very similar to positive labels. We solve this problem by employing the generator with dynamic negative label sampling. For the same training instance, the negative labels are resampled every time by the current generator, and negative labels are sampled from easy to difficult to distinguish as the generator fits, which makes the discriminator converge easily and avoid overfitting. The label candidates sampled by the generator are:
$$ S_g = \{ l_i : i \in \{ i : g_c(l_i) \in \mathrm{top}_b(G(e)) \} \}, \quad (2) $$
where $g_c$ is a function that maps a label to its cluster and $l_i$ is the $i$-th label. In the training stage, all positive labels are added to $S_g$.
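As a rough sketch (assuming PyTorch, a precomputed cluster-to-label mapping, and hypothetical tensor shapes), the generator and its dynamic negative sampling could look like the following; it is not the released implementation.

```python
import torch

class Generator(torch.nn.Module):
    """Scores all K label clusters from the text representation e (sketch of Eq. 1/2)."""
    def __init__(self, feat_dim: int, num_clusters: int):
        super().__init__()
        self.fc = torch.nn.Linear(feat_dim, num_clusters)

    def forward(self, e):
        return torch.sigmoid(self.fc(e))          # (B, K) cluster scores

def sample_candidates(cluster_scores, cluster_to_labels, positives, top_b: int):
    """Dynamically build the candidate label set S_g for one example."""
    top_clusters = torch.topk(cluster_scores, top_b).indices.tolist()
    candidates = set()
    for c in top_clusters:
        candidates.update(cluster_to_labels[c])   # all labels of the recalled clusters
    candidates.update(positives)                  # training only: force in positive labels
    return sorted(candidates)

# Generator loss: binary cross entropy against the multi-hot cluster targets y_g (Eq. 1).
bce = torch.nn.BCELoss()
def generator_loss(cluster_scores, y_g):
    return bce(cluster_scores, y_g.float())
```

Because the generator is re-applied to every batch, the candidate set of the same instance changes as the generator improves, which is exactly the dynamic negative sampling described above.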
Label ranking
Given the text representation $e$ and the label candidates $S_g$, we first need the embeddings of all labels in $S_g$:
$$ M = [E_i : i \in S_g], \quad (3) $$
where $E_i \in \mathbb{R}^{b}$ is the learnable $b$-dimensional embedding of the $i$-th label and $E \in \mathbb{R}^{L \times b}$ is the overall label embedding matrix, which is initialized randomly.

We use the same hidden bottleneck layer as XML-CNN (Liu et al. 2017) to project the text embedding into a low dimension. This has two advantages:
• The hidden bottleneck layer makes the overall model smaller and lets it fit into limited GPU memory. The label set size $L$ is usually more than hundreds of thousands in XMC. Without this layer, the size of this part would be $O(L \times k)$, where $k$ is the dimension of the text representation, which would take huge GPU memory. With the hidden bottleneck layer, the size becomes $O((L + k) \times b)$, where the hyper-parameter $b$, the dimension of the label embedding, is much smaller than $k$. According to the size of each dataset, we can set a different $b$ to make full use of GPU memory.
• The hidden bottleneck layer also makes the generator and the discriminator focus on different information in the text representation. The generator focuses on fine-grained text information, while the discriminator focuses on coarse-grained text information, and the final result combines these two types of information.

The discriminator can be described as:
$$ D(e, M) = \sigma(M \sigma(W_h e + b_h)), \quad (4) $$
where $W_h \in \mathbb{R}^{b \times k}$ and $b_h \in \mathbb{R}^{b}$ are the weights of the hidden bottleneck layer.

The objective of $D(e, M)$ is to distinguish the positive labels from the negative labels sampled by the generator. The training target of $D(e, M)$ is $y_d$, where $y_d^i = 1$ if the $i$-th candidate in $S_g$ is a positive label and $y_d^i = 0$ if it is a negative label. Thus the loss of this part is:
$$ L_d(D(e, M), y_d) = \sum_{i \in S_g} (1 - y_d^i)\left(-\log(1 - D(e, M)_i)\right) + y_d^i\left(-\log(D(e, M)_i)\right). \quad (5) $$

The whole framework of LightXML is shown in Algorithm 1.
Algorithm 1: The proposed framework.
Input: training set $\{X, Y\} = \{(x_i, y_i)\}_{i=1}^{N}$; sparse features $\hat{X}$ of the training texts.
Output: trained model.
  Construct the label clusters $C$ from $\hat{X}$ and $Y$;
  Initialize the transformer model $T$ with a pre-trained transformer model;
  Initialize the discriminator $D$ based on $C$;
  Initialize the label embedding $E$ and the generator $G$;
  while the model has not converged do
    Draw $m$ samples $X_{batch}$ and $Y_{batch}$ from the training set $\{X, Y\}$;
    Get the text embeddings $Z = T(X_{batch})$;
    for $i = 1..m$ do
      Generate label clusters $S_{generated}$ by $G(Z_i)$;
      Get negative labels $S_{neg}$ from $S_{generated}$ and $C$;
      Remove positive labels from $S_{neg}$;
    end for
    Get the positive labels $S_{pos}$ according to $Y_{batch}$;
    for $i = 1..m$ do
      Build the label embedding $M$ according to $S_{pos}$ and $S_{neg}$;
      Score each candidate label by $D(Z_i, M)$;
    end for
    Update the parameters of $T$, $G$ and $D$ according to Eq. 1 and Eq. 5;
  end while
  return the model;

Training
Unlike AttentionXML (You et al. 2019) and X-Transformer (Chang et al. 2020), our model can be trained end to end by using the generative cooperative networks to handle both the label recalling and label ranking parts, which reduces training time and model size. The overall loss function is:
$$ L = L_g + L_d. \quad (6) $$
We directly add the generator loss $L_g$ and the discriminator loss $L_d$ as the overall loss, and the transformer model used to represent text updates its gradients according to both $L_g$ and $L_d$, which makes the transformer model learn from both label recalling and label ranking.
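A condensed sketch of one end-to-end training step (discriminator with the hidden bottleneck layer, gathered label embeddings, and the combined loss of Eq. 6) is given below; all module names and shapes are illustrative assumptions layered on the earlier sketches, not the authors' code.

```python
import torch

class Discriminator(torch.nn.Module):
    """Hidden bottleneck + label-embedding scoring of the candidate labels (sketch of Eq. 3-5)."""
    def __init__(self, feat_dim: int, num_labels: int, emb_dim: int):
        super().__init__()
        self.bottleneck = torch.nn.Linear(feat_dim, emb_dim)      # W_h, b_h
        self.label_emb = torch.nn.Embedding(num_labels, emb_dim)  # E, initialized randomly

    def forward(self, e, candidate_ids):
        h = torch.sigmoid(self.bottleneck(e))                  # (B, b) compressed text
        M = self.label_emb(candidate_ids)                      # (B, C, b) candidate embeddings
        return torch.sigmoid((M * h.unsqueeze(1)).sum(-1))     # (B, C) candidate scores

def training_step(encoder, generator, discriminator, optimizer, batch, y_g, candidate_ids, y_d):
    """One end-to-end update with the combined loss L = L_g + L_d (Eq. 6)."""
    bce = torch.nn.BCELoss()
    e = encoder(batch["input_ids"], batch["attention_mask"])   # shared text representation
    loss_g = bce(generator(e), y_g.float())                    # label recalling loss (Eq. 1)
    loss_d = bce(discriminator(e, candidate_ids), y_d.float()) # label ranking loss (Eq. 5)
    loss = loss_g + loss_d
    optimizer.zero_grad()
    loss.backward()                                            # gradients flow into T, G and D
    optimizer.step()
    return loss.item()
```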
Prediction
The efficiency of prediction is essential for XMC applications, but deep learning methods often reach high accuracy at a huge cost in computational complexity, which makes them infeasible compared to traditional machine learning methods. Although LightXML is based on deep transformer models, it can make predictions in several milliseconds on large-scale XMC datasets.

LightXML also predicts end to end from raw text. The label recalling part scores all label clusters and returns a subset of labels containing both positive and negative labels. The label ranking part then scores every label in this subset. The final score of a label is the product of its recalling score and its ranking score, and we obtain the top-K labels from these scores.
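A minimal sketch of this prediction step, reusing the hypothetical modules from the training sketches, could look as follows; combining cluster and label scores by multiplication follows the description above, while the shapes and helper names are assumptions.

```python
import torch

@torch.no_grad()
def predict_top_k(encoder, generator, discriminator, batch, cluster_to_labels,
                  top_b: int = 10, k: int = 5):
    e = encoder(batch["input_ids"], batch["attention_mask"])   # (1, feat_dim), one document
    cluster_scores = generator(e)[0]                            # (K,) scores of all clusters
    top_clusters = torch.topk(cluster_scores, top_b).indices

    # Candidate labels = all labels of the recalled clusters, paired with their cluster score.
    cand, recall = [], []
    for c in top_clusters.tolist():
        for l in cluster_to_labels[c]:
            cand.append(l)
            recall.append(cluster_scores[c])
    cand_ids = torch.tensor([cand])                             # (1, C)
    rank = discriminator(e, cand_ids)[0]                        # (C,) ranking scores

    final = torch.stack(recall) * rank                          # recalling score x ranking score
    top = torch.topk(final, min(k, len(cand))).indices
    return [cand[i] for i in top.tolist()]
```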
Experiments
Dataset
Five widely used XMC benchmark datasets are used in our experiments: Eurlex-4K (Mencia and Fürnkranz 2008), Wiki10-31K (Zubiaga 2012), AmazonCat-13K (McAuley and Leskovec 2013), Wiki-500K and Amazon-670K (McAuley and Leskovec 2013). Detailed statistics of each dataset are shown in Table 2. The sparse text features we use for clustering are also contained in the datasets.
Evaluation Measures
We choose P@$k$ as the evaluation metric, which is widely used in XMC and represents the fraction of correct labels among the top-$k$ scored labels. P@$k$ is defined as:
$$ P@k = \frac{1}{k} \sum_{i \in \mathrm{rank}_k(\hat{y})} y_i, \quad (7) $$
where $\hat{y}$ is the prediction vector, $\mathrm{rank}_k(\hat{y})$ denotes the indices of the $k$ highest elements of $\hat{y}$, and $y \in \{0, 1\}^L$.
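For concreteness, here is a small sketch of Eq. 7 on a toy example (an illustrative helper, not the paper's evaluation code):

```python
import numpy as np

def precision_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """P@k: fraction of the k highest-scored labels that are true labels."""
    top_k = np.argsort(-y_score)[:k]
    return y_true[top_k].sum() / k

# Toy example with L = 8 labels, where labels 2 and 5 are relevant.
y_true = np.array([0, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.1, 0.9, 0.8, 0.2, 0.1, 0.3, 0.7, 0.0])
print(precision_at_k(y_true, y_score, k=3))  # top-3 are labels 1, 2, 6; only label 2 is relevant -> 1/3
```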
Baseline
We compare with the state-of-the-art and most enlightening methods, including the one-vs-all method DiSMEC (Babbar and Schölkopf 2017); the label-tree-based Parabel (Prabhu et al. 2018), ExtremeText (XT) (Wydmuch et al. 2018) and Bonsai (Khandagale, Xiao, and Babbar 2019); and the deep learning based XML-CNN (Liu et al. 2017), AttentionXML (You et al. 2019) and X-Transformer (Chang et al. 2020). The results of all baselines are taken from (You et al. 2019) and (Chang et al. 2020).
Experiment Settings
For all datasets, we directly use raw texts without preprocessing, and texts are truncated to the maximum number of input tokens. We use a dropout rate of 0.5 for the text representation and SWA (stochastic weight averaging) (Izmailov et al. 2018), which is also used in AttentionXML, to avoid overfitting. LightXML is trained by AdamW (Kingma and Ba 2014) with a constant learning rate of 1e-4 and 0.01 weight decay for the bias and layer norm weights in the model. Automatic Mixed Precision (AMP) is also used to reduce GPU memory usage and increase training speed. Our training stage is end-to-end, and the loss of the model is the sum of the label recalling loss and the label ranking loss. For datasets with small label sets such as Eurlex-4K, AmazonCat-13K and Wiki10-31K, each label cluster contains only one label, so we can obtain the score of each label directly in the label recalling part. For the ensemble, we use three different transformer models for Eurlex-4K, AmazonCat-13K and Wiki10-31K, and three different label clusterings with BERT (Devlin et al. 2018) for Wiki-500K and Amazon-670K. In contrast to state-of-the-art deep learning methods, all of our models are trained on a single Tesla V100 GPU and use less than 16GB of GPU memory for training, which is much less than other methods use. Other hyperparameters are given in Table 1.
Datasets        E   B   b    C    L_t
Eurlex-4K       20  16  -    -    512
AmazonCat-13K    5  16  -    -    512
Wiki10-31K      30  16  -    -    512
Wiki-500K       10  32  500  60   128
Amazon-670K     15  16  400  80   128
Table 1: Hyperparameters of all datasets. E is the number of epochs, B is the batch size, b is the dimension of the label embedding, C is the number of labels in one label cluster, and L_t is the maximum number of input tokens of the transformer model.

Performance comparison
Table 3 shows P@k on the five datasets. We focus on top predictions by varying k over 1, 3 and 5 in P@k, which are widely used in XMC. LightXML outperforms all methods on four datasets. Among traditional machine learning methods, DiSMEC has better accuracy than Parabel, Bonsai and XT at the cost of high computational complexity, and LightXML improves substantially on the accuracy of all of them. Compared with X-Transformer, which is also a transformer-based method, LightXML achieves better accuracy on all datasets with a much smaller model size and lower computational complexity, which demonstrates the effectiveness of our method. Compared with AttentionXML, although AttentionXML has slightly better P@5 than LightXML on Wiki-500K, LightXML achieves larger improvements in P@1 and P@3, and LightXML outperforms AttentionXML on the other four datasets.

Datasets        N_train    N_test   D          L        L̄      L̂       W̄_train   W̄_test
Eurlex-4K       15,449     3,865    186,104    3,956    5.30   20.79   1248.58   1230.40
Wiki10-31K      14,146     6,616    101,938    30,938   18.64  8.52    2484.30   2425.45
AmazonCat-13K   1,186,239  306,782  203,882    13,330   5.04   448.57  246.61    245.98
Amazon-670K     490,449    153,025  135,909    670,091  5.45   3.99    247.33    241.22
Wiki-500K       1,779,881  769,421  2,381,304  501,008  4.75   16.86   808.66    808.56
Table 2: Detailed dataset statistics. N_train is the number of training samples, N_test is the number of test samples, D is the dimension of the BOW feature vector, L is the number of labels, L̄ is the average number of labels per sample, L̂ is the average number of samples per label, W̄_train is the average number of words per training sample and W̄_test is the average number of words per test sample.

Table 3: Comparisons with different methods (P@1, P@3 and P@5 of DiSMEC, Parabel, Bonsai, XT, XML-CNN, AttentionXML, X-Transformer and LightXML on Eurlex-4K, AmazonCat-13K, Wiki10-31K, Wiki-500K and Amazon-670K). Note that XML-CNN is not scalable to Wiki-500K, and the result of X-Transformer on Amazon-670K has never been reported and is hard to reproduce under our hardware constraints.

Performance on single model
We also examine the single-model performance, denoted LightXML-1. Table 4 shows the results of single models on Amazon-670K and Wiki-500K; LightXML-1 shows better accuracy than AttentionXML-1.

Table 4: Single-model results (P@1, P@3 and P@5) of LightXML-1 and AttentionXML-1 on Wiki-500K and Amazon-670K.

Effect of the dynamic negative sampling
To examine the importance of dynamic negative sampling, we compare it with static negative sampling on Wiki-500K and Amazon-670K, where D denotes dynamic negative sampling, BS denotes static negative sampling with an additional text representation, and S denotes static negative sampling with a single text representation.

For static negative sampling, a fair comparison with dynamic negative sampling is difficult, because static negative sampling requires training the recalling and ranking parts in order. So we propose two versions of static negative sampling: 1) S has the same model size as dynamic negative sampling, and we train the label ranking part with the frozen text representation of the trained label recalling part; 2) BS has an additional text representation, which can be fine-tuned in label ranking and which we initialize with the text representation of the trained label recalling part.

The performance of the different negative sampling strategies is shown in Table 5. Although both static negative sampling methods take longer to train than the dynamic negative sampling method, the dynamic negative sampling method still outperforms the other two. Of the two static negative sampling methods, BS shows better results than S thanks to its additional text representation.

Table 5: P@1, P@3 and P@5 of D, BS and S on Wiki-500K and Amazon-670K.

Effect of the multi-layer text representation
This section analyzes how the multi-layer text representation affects model performance. We compare our full model with a variant that uses only the "[CLS]" token of the last layer as the text representation.
Figure 2: Effect of the multi-layer text representation. (a) Wiki-500K. (b) Amazon-670K.

As Figure 2 shows, we compare the training loss of the multi-layer and single-layer representations on Wiki-500K and Amazon-670K. The multi-layer representation has a lower training loss, which means it accelerates model convergence: it reaches the same training loss as the single-layer representation using only half of the total epochs. In terms of final accuracy, the multi-layer representation improves the final P@5 by more than 1%.

Computation Time and Model Size
Computation time and model size are essential for XMC: XMC is not only a task of pursuing high accuracy, but also one of improving efficiency. In this section, we compare the computation time and model size of LightXML with the high-performance XMC method AttentionXML. X-Transformer uses large transformer models and high-dimensional linear classifiers, which gives it a large model size and high computational complexity: X-Transformer takes more than 35 hours of training with eight Tesla V100 GPUs on Wiki-500K, and would need more than one hundred hours for us to reproduce. Due to this difficulty of reproduction and the poor efficiency of X-Transformer, we do not compare LightXML with it here.
Table 6: Computation time and model size of AttentionXML-1 and LightXML-1 on Wiki-500K and Amazon-670K. T_train is the overall training time in hours, S_test is the average time required to predict each sample in milliseconds per sample (ms/sample), and M is the model size in GB.

Table 6 shows the training time, prediction speed and model size of AttentionXML-1 and LightXML-1 on Wiki-500K and Amazon-670K; both AttentionXML and LightXML use the same hardware with one Tesla V100 GPU. LightXML shows significant improvements in both prediction speed and model size compared to AttentionXML. For prediction speed, LightXML can find relevant labels from more than 0.5 million labels in 5 milliseconds with raw text as input. For model size, LightXML reduces the model size by 72% on Amazon-670K and by 52% on Wiki-500K. For training time, both LightXML and AttentionXML train quickly, saving more than three times the training time compared to X-Transformer.

Conclusion
In this paper, we proposed a light deep learning model for XMC, LightXML, which combines the transformer model with generative cooperative networks. With generative cooperative networks, the transformer model can be fine-tuned end to end in XMC, which makes it learn powerful text representations. To make LightXML robust in prediction, we also proposed dynamic negative sampling based on these generative cooperative networks. In extensive experiments, LightXML shows high efficiency on large-scale datasets with the best accuracy compared to the current state-of-the-art methods, which allows all of our experiments to be performed on a single GPU card in a reasonable time. Furthermore, current state-of-the-art deep learning methods have many redundant parameters, which harm performance, while LightXML maintains accuracy while reducing the model size by more than 50% compared to these methods.
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant Nos. 71901011 and U1836206, and the National Key R&D Program of China under Grant No. 2019YFA0707204.

References
Babbar, R.; and Schölkopf, B. 2017. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 721–729.
Babbar, R.; and Schölkopf, B. 2019. Data scarcity, robustness and extreme multi-label classification. Machine Learning.
Bhatia, K.; Jain, H.; Kar, P.; Varma, M.; and Jain, P. 2015. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, 730–738.
Chang, W.-C.; Yu, H.-F.; Zhong, K.; Yang, Y.; and Dhillon, I. S. 2020. Taming Pretrained Transformers for Extreme Multi-label Text Classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 3163–3171.
Dekel, O.; and Shamir, O. 2010. Multiclass-multilabel classification with more classes than examples. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 137–144.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D.; and Wilson, A. G. 2018. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
Khandagale, S.; Xiao, H.; and Babbar, R. 2019. Bonsai – Diverse and Shallow Trees for Extreme Multi-label Classification. arXiv preprint arXiv:1904.08249.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Liu, J.; Chang, W.-C.; Wu, Y.; and Yang, Y. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 115–124.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
McAuley, J.; and Leskovec, J. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, 165–172.
Mencia, E. L.; and Fürnkranz, J. 2008. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 50–65. Springer.
Prabhu, Y.; Kag, A.; Harsola, S.; Agrawal, R.; and Varma, M. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, 993–1002.
Prabhu, Y.; and Varma, M. 2014. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 263–272.
Tagami, Y. 2017. AnnexML: Approximate nearest neighbor search for extreme multi-label classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 455–464.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wydmuch, M.; Jasinska, K.; Kuznetsov, M.; Busa-Fekete, R.; and Dembczynski, K. 2018. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In Advances in Neural Information Processing Systems, 6355–6366.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, 5753–5763.
Yen, I. E.; Huang, X.; Dai, W.; Ravikumar, P.; Dhillon, I.; and Xing, E. 2017. PPDSparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 545–553.
Yen, I. E.-H.; Huang, X.; Ravikumar, P.; Zhong, K.; and Dhillon, I. 2016. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In International Conference on Machine Learning, 3069–3077.
You, R.; Zhang, Z.; Wang, Z.; Dai, S.; Mamitsuka, H.; and Zhu, S. 2019. AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In Advances in Neural Information Processing Systems, 5820–5830.
Zubiaga, A. 2012. Enhancing navigation on Wikipedia with social tags. arXiv preprint arXiv:1202.5469.