Improved Customer Transaction Classification using Semi-Supervised Knowledge Distillation
Rohan Sukumaran ∗ Applied Research, Swiggy [email protected]
Abstract
In pickup and delivery services, transaction classification based on customer-provided free text is a challenging problem. It involves associating a wide variety of customer inputs with a fixed set of categories while adapting to varied customer writing styles. This categorization is important for the business: it helps understand market needs and trends, and also assists in building a personalized experience for different customer segments. Hence, it is vital to capture these category trends at scale, with high precision and recall. In this paper, we focus on a specific use-case where a single category drives each transaction. We propose a cost-effective transaction classification approach based on semi-supervision and knowledge distillation frameworks. The approach identifies the category of a transaction using the free text input given by the customer. We use weak labelling and observe that the performance gains are similar to those obtained with human-annotated samples. On a large internal dataset and on the 20Newsgroup dataset, we see that RoBERTa performs the best for the categorization tasks. Further, using an ALBERT model (which has 33x fewer parameters than RoBERTa) with RoBERTa as the teacher, we see performance similar to that of RoBERTa and better than that of unadapted ALBERT. This framework, with ALBERT as student and RoBERTa as teacher, is referred to as R-ALBERT in this paper. The model is in production and is used by the business to understand changing trends and take appropriate decisions.
∗ Work done while an intern at Applied Research, Swiggy.

1 Introduction

Natural Language Understanding (NLU) has become a ubiquitous part of e-commerce platforms, with applications in information discovery (Degenhardt et al., 2019; Grandhi, 2016), sentiment analysis (Liang and Wang, 2019; Taylor and Keselj, 2020), query understanding (Zhao et al., 2019) and many other similar tasks. Advances in machine learning have increased the scale of adoption of such capabilities, enabling a richer customer experience.

In this paper, we consider a use-case of pickup and delivery services where customers use short sentences or lists to describe the products to be picked up (or bought) from a certain location and dropped at a target location. Table 1 shows a few examples of the descriptions used by our customers to describe their products. Customers tend to use short, code-mixed (using more than one language in the same message; in our case, Hindi written in Roman script mixed with English) and incoherent textual descriptions of the products in a transaction. These descriptions, if mapped to a fixed set of categories, help drive critical business decisions such as enhancing the customers' experience on the platform, understanding the importance of each category and the issues faced within it, demographic-driven prioritization of categories, the launch of new product categories, etc. As expected, a transaction may comprise multiple products, which adds a further level of complexity to the task. In this work, we focus on multi-class classification of transactions, where a single majority category drives the transaction.

Transaction Description                                     Category
"Get me my lehanga" (Translation: Get me my skirt)          Clothes
"Buy a 500gm packet of Wholegrain Atta"                     Grocery
"Get a roll of paratha"                                     Food
"Mera do bags leaoo" (Translation: Bring two of my bags)    Package

Table 1: Samples of actual transaction descriptions used by our customers along with their corresponding categories as labelled by the subject matter experts (SMEs).

Classifying transactions into the right categories requires labelled data for training an ML model. Some of the data used in this paper (Table 2: Train data) was labelled manually by subject matter experts (SMEs) from the business, and it was a very expensive exercise.

Recent works have shown that BERT-based models (Conneau et al., 2019; Yang et al., 2019) are able to achieve state-of-the-art performance on text classification tasks. Through experiments we observed that RoBERTa (Liu et al., 2019) was the best performing model for our task. However, owing to its large number of parameters, it was not feasible to deploy this model at scale. Furthermore, we observed that lighter versions of BERT such as ALBERT (base) (Lan et al., 2019; Wolf et al., 2019), which are production friendly, could not match the performance of RoBERTa.

With these two limitations on production friendliness in mind, we propose: a) an approach that leverages semi-supervision to reduce the manual labelling cost, and b) a knowledge distillation setup to build a smaller model (in terms of number of parameters) that matches the performance of state-of-the-art heavier models such as RoBERTa. The key contributions of this paper are:
• Weak Labelling: A framework based on semi-supervised learning and weak supervision to reduce the manual data labelling bottleneck.
• Knowledge Distillation Framework: Training a lightweight model (33x fewer parameters) with the help of weak labels, which is able to match the performance of a much heavier model.
• Inference and Scalability: Attained an 86x increase in inference speed with the student model, compared to the teacher model, while running in production on an r5.4xlarge machine (Intel Xeon Platinum 8000 series processor (Skylake-SP) with a sustained all-core Turbo CPU clock speed of up to 3.1 GHz).
2 Related Work

Text classification problems with code-mixed inputs have been studied, and transformer-based models perform well on benchmarks (Chang et al., 2019; Lu et al., 2020) such as TREC-6 (Voorhees and Harman, 2000) and DBpedia (Auer et al., 2007). The large size of these models exacerbates the challenge of deployment with limited resources (Chen et al., 2020; Sajjad et al., 2020). Multiple methods such as quantization (Zafrir et al., 2019), pruning (Gordon et al., 2020), distillation (Sanh et al., 2019; Jiao et al., 2019) and weight sharing (Houlsby et al., 2019) are used to mitigate this issue. All these methods have shown varying degrees of success relative to the performance of the base model from which they are derived.
Data-less text classification has become a popular way to achieve low-cost model training. Hingmire and Chakraborti (2014) explored approaches based on topic modelling to predict labels for documents. Our problem setting involves short transaction descriptions that do not work well with standard topic modelling techniques. Li and Yang (2018) worked with unlabelled data by identifying a minimal set of seed-word-based pseudo labels for documents and trained a Naive Bayes model using semi-supervision. We had a large amount of manually tagged data, and we leveraged it to extract candidate training samples from unlabelled data.

Hinton et al. (2015) studied how a model can be used to label unlabelled data and how the model predictions (the probability distributions and/or the one-hot encoded labels) can be used for training with a combination of loss functions. We borrow from Hinton et al. (2015) the concept of using different loss functions during training. In this paper, we use transformer models as both teacher and student. Yuan et al. (2019) proposed how a model could be its own teacher and how lightweight models can teach a heavier model. Although this showed promising results on image classification tasks, we observed that similar performance gains were not visible in our NLP setting. This could be due to differences in the underlying information captured during representation learning. Hence, we stick to the standard teacher-student framework where heavier models are used as the teacher.
The MultiFiT paper (Eisenschlos et al., 2019) proposed fine-tuning a lightweight model on the target task with labelled or pseudo-labelled data and empirically showed the performance improvement derived from using a teacher model. We leverage a similar setup and validate that the method generalizes to large-scale transformer-based models. We observe this to be an effective training strategy. The study also showed that the student model was robust to noise, as the teacher served as a regularizer. Tamborrino et al. (2020) also explain the benefits of relevant knowledge transfer via task-specific fine-tuning.
3 Approach

Our approach leverages knowledge distillation to build a highly accurate classifier with reduced training and deployment cost. We first train a model using manually labelled data and call it the "Teacher". We then pass the unlabelled data through this Teacher model and use its output as the label for the unlabelled data. This semi-supervision approach is used for data augmentation. The augmented data is then used to train a "Student" model. With the weakly labelled data, we employ multiple strategies to improve the performance of the Student model. Figure 1 shows a high-level overview of the system.
Figure 1: High-level overview of the process of Knowledge Distillation using Semi-Supervision
3.1 Semi-supervision with one-hot labels

In this approach, we leverage the teacher model to obtain weak labels for unlabelled data. For a given sample, we assign the most confident prediction (from the Teacher) as its label. In other words, the output probabilities for each sample are converted into a one-hot encoding, with the highest-probability prediction treated as the true label.
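A minimal sketch of this hard-labelling step, assuming a fine-tuned Hugging Face sequence classification model as the teacher (the checkpoint name and batch size below are illustrative placeholders, not the paper's exact configuration):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative teacher checkpoint; in the paper the teacher is RoBERTa fine-tuned on the Train dataset.
teacher_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name, num_labels=10)
teacher.eval()

def hard_pseudo_labels(texts, batch_size=32):
    """Return the teacher's argmax class id for each unlabelled text."""
    labels = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = teacher(**batch).logits           # (batch, num_classes)
        labels.extend(logits.argmax(dim=-1).tolist())  # one-hot label = argmax class
    return labels

# Example: weak labels for two unlabelled transaction descriptions.
print(hard_pseudo_labels(["Get a roll of paratha", "Buy a 500gm packet of atta"]))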
3.2 Semi-supervision with soft labels

Similar to the previous approach, we leverage the teacher model to obtain weak labels on a dataset, but instead of one-hot encodings we use the probability distribution over the predictions as the label for each sample. In other words, we perform semi-supervision while replicating the teacher's behaviour when labelling the samples.
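A sketch of the corresponding loss, a soft-target cross entropy between the teacher's probability distribution and the student's predicted distribution (variable names are illustrative; this is not code from the paper):

import torch
import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_probs):
    """Cross entropy between the teacher's distribution (soft labels) and the
    student's predicted distribution, averaged over the batch."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Toy example with a batch of 2 samples and 10 classes.
student_logits = torch.randn(2, 10, requires_grad=True)
teacher_probs = F.softmax(torch.randn(2, 10), dim=-1)  # soft labels from the teacher
loss = soft_cross_entropy(student_logits, teacher_probs)
loss.backward()  # would sit inside the usual fine-tuning loop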
3.3 Semi-supervision with KL divergence loss

Inspired by the results of Section 3.2, we again leverage the probability-distribution based labels but use a KL divergence (Kullback, 1997) loss instead of the cross entropy loss. Results show that this strategy performs best at learning the difference between the two probability distributions (the predictions from the student and the teacher models).
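A minimal sketch of the KL divergence variant under the same assumptions as the previous snippet (PyTorch's kl_div with batchmean reduction, which expects log-probabilities for the student and probabilities for the teacher):

import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_probs):
    """KL(teacher || student) over the class distribution, averaged over the batch."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example: 2 samples, 10 classes.
student_logits = torch.randn(2, 10, requires_grad=True)
teacher_probs = F.softmax(torch.randn(2, 10), dim=-1)
loss = kl_distillation_loss(student_logits, teacher_probs)
loss.backward()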
4 Experiments

4.1 Dataset

Table 2 shows the different datasets considered for the experiments. The initial training data comprises 41,539 customer transactions sampled from the September to December 2019 time frame; each of these transactions has an associated customer message. The free text message of every transaction was annotated by a team of three SMEs and mapped to one of ten pre-defined categories. The list of categories considered is as follows: {'Food', 'Grocery', 'Package', 'Medicines', 'Household Items', 'Cigarettes', 'Clothes', 'Electronics', 'Keys', 'Documents/Books'}.

Additionally, we consider 285,235 unlabelled customer transactions sampled from January to April 2020 for the semi-supervision experiments. For benchmarking the performance of the different classification approaches, we label 20,156 customer transactions from April to construct a test dataset. These 20,156 test samples from April are not used in the semi-supervision experiments.

Dataset      Duration     Size
Train        Sept - Dec   41,539
Unlabelled   Jan - Apr    285,235
Test         April        20,156

Table 2: Dataset description
4.2 Teacher Model Selection

In the first step, we train multiple models on the Train dataset and evaluate them on the Test dataset to identify a candidate teacher model for our knowledge distillation experiments. For these experiments, we consider XgBoost (Chen et al., 2015; Pedregosa et al., 2011), BiLSTM (Gers et al., 1999; Abadi et al., 2016), ALBERT (Lan et al., 2019) and RoBERTa (Liu et al., 2019). Table 3 shows the F1-scores for the different models considered in this experiment. We observe that ALBERT and RoBERTa outperform BiLSTM and XgBoost. Therefore, RoBERTa is chosen as the teacher model for the next set of experiments.
Model      F1 Score   Accuracy (%)
XgBoost    0.60       63
BiLSTM     0.65       73
ALBERT     0.70       78
RoBERTa    0.74       82

Table 3: F1-scores and accuracies for different classification models trained on the Train dataset and tested on the Test dataset
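A sketch of how a RoBERTa teacher could be fine-tuned on the labelled Train dataset with the Hugging Face Trainer API; the in-memory data, label ids, and hyperparameters below are illustrative assumptions, not the production configuration:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical in-memory examples; in practice this comes from the labelled Train dataset.
train_data = {"text": ["Get a roll of paratha", "Get me my lehanga"], "label": [2, 6]}
train_ds = Dataset.from_dict(train_data)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=10)

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=64)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="teacher-roberta", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()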
4.3 Weak Labelling

In the second step, we leverage the teacher model described in the previous subsection to extract weakly labelled samples from the Unlabelled dataset and augment the training data. To reduce the probability of selecting mislabelled samples, we set an empirical threshold of 95% confidence in the label prediction as the criterion for accepting a sample into the pool of training samples. At the end of this process, we obtain 93,820 additional training samples (weakly labelled by the teacher).

4.4 Student Training

Based on the production constraints on the number of parameters, ALBERT (base), with roughly 11 million parameters, was chosen as the student model from the set of SOTA models. The model details can be found in Table 7 under Appendix A. The data from Section 4.3 is then used to "teach" the student model using the three strategies described in Section 3. The student model ALBERT is first trained (fine-tuned) on the labelled Train dataset that was used for training the teacher. R-ALBERT, the new student model based on the ALBERT architecture, ends up performing the best, even better than the teacher model on our Test dataset. A similar pattern was observed in (Eisenschlos et al., 2019).

Model            F1 Score   Accuracy (%)
R-ALBERT - OHE   0.72       83
R-ALBERT - CE    0.65       64
R-ALBERT - KL    0.73       84
RoBERTa          0.18       40

Table 4: Comparison of F1-scores on the internal benchmark using different approaches
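A minimal sketch of the weak-label extraction with the 95% confidence threshold described in Section 4.3, reusing the fine-tuned teacher and tokenizer from the earlier sketch (function and variable names are illustrative):

import torch
import torch.nn.functional as F

def extract_weak_labels(texts, teacher, tokenizer, threshold=0.95, batch_size=32):
    """Keep only samples where the teacher's top predicted probability >= threshold.
    Returns (accepted_texts, hard_labels, soft_labels)."""
    kept_texts, hard_labels, soft_labels = [], [], []
    teacher.eval()
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        batch = tokenizer(batch_texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = F.softmax(teacher(**batch).logits, dim=-1)
        top_probs, top_ids = probs.max(dim=-1)
        for text, p, label, dist in zip(batch_texts, top_probs, top_ids, probs):
            if p.item() >= threshold:              # empirical 95% confidence cut-off
                kept_texts.append(text)
                hard_labels.append(label.item())   # one-hot style label (Section 3.1)
                soft_labels.append(dist.tolist())  # full distribution (Sections 3.2/3.3)
    return kept_texts, hard_labels, soft_labels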
In order to validate the reproducibility of our approach, we ran similar experiments on the 20Newsgroup (Rennie, 2008) dataset.
4.5 Results and Discussion

As shown in Table 4, the student model tends to perform better than its base version (the model that did not have a teacher). We validate the statistical significance of the performance improvement using the Stuart-Maxwell test (Stuart, 1955; Maxwell, 1970). As shown in Table 6, the performance improvement over the base model is significant. Moreover, we observe that our approach achieves performance similar to that obtained with human-annotated data, despite the change in data distributions and textual patterns. From Table 5 we also observe that the method is reproducible on the 20Newsgroup dataset. Further, we observe that distillation over RoBERTa's predictions gives an improvement of 8% compared to fine-tuning directly on part of the labelled dataset.
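The Stuart-Maxwell test is a chi-square test of marginal homogeneity between two paired classifications. A sketch of how it could be computed from two models' predictions on the same test samples (a textbook formulation, not code from the paper; the toy data at the end is purely illustrative):

import numpy as np
from scipy.stats import chi2

def stuart_maxwell_test(preds_a, preds_b, num_classes):
    """Chi-square statistic, degrees of freedom and p-value for marginal homogeneity
    between two paired categorical predictions (Stuart, 1955; Maxwell, 1970)."""
    # k x k contingency table: n[i, j] = #samples where model A predicts i and model B predicts j.
    n = np.zeros((num_classes, num_classes))
    for i, j in zip(preds_a, preds_b):
        n[i, j] += 1
    # Difference between marginal totals, using the first k-1 categories.
    d = (n.sum(axis=1) - n.sum(axis=0))[:-1]
    # Covariance matrix of d.
    s = np.zeros((num_classes - 1, num_classes - 1))
    for i in range(num_classes - 1):
        for j in range(num_classes - 1):
            if i == j:
                s[i, j] = n[i, :].sum() + n[:, i].sum() - 2 * n[i, i]
            else:
                s[i, j] = -(n[i, j] + n[j, i])
    stat = float(d @ np.linalg.pinv(s) @ d)
    dof = num_classes - 1   # 9 for the ten-category internal dataset
    return stat, dof, chi2.sf(stat, dof)

# Toy example with 10 categories.
rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=1000)
b = np.where(rng.random(1000) < 0.8, a, rng.integers(0, 10, size=1000))
print(stuart_maxwell_test(a, b, num_classes=10))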
Model         F1-score   Accuracy (%)
ALBERT        0.63       65
R-ALBERT-KL   0.70       73
RoBERTa       0.88       87

Table 5: F1-scores on the 20Newsgroup dataset with 8,073 train samples, 7,037 weakly labelled samples (after the 95% threshold) and 805 samples

5 Conclusion

We explore a generalised distillation framework on transformer-based architectures, showing that "Students" can be made better with the help of the weak labels generated by a good "Teacher". Given the focus on having smaller models and on ensuring effective utilisation of resources, we feel this distillation method will pave the way for more research in the future. This framework can also help reduce manual labelling efforts.
References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.

Wei-Cheng Chang, Hsiang-Fu Yu, Kai Zhong, Yiming Yang, and Inderjit Dhillon. 2019. X-BERT: Extreme multi-label text classification using bidirectional encoder representations from transformers. arXiv preprint arXiv:1905.02331.

Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. 2020. AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246.

Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, and Yuan Tang. 2015. XGBoost: Extreme gradient boosting. R package version 0.4-2, pages 1–4.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Jon Degenhardt, Surya Kallumadi, Utkarsh Porwal, and Andrew Trotman. 2019. ECOM'19: The SIGIR 2019 workshop on eCommerce. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1421–1422.

Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kardas, Sylvain Gugger, and Jeremy Howard. 2019. MultiFiT: Efficient multi-lingual language model fine-tuning. arXiv preprint arXiv:1909.04761.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with LSTM.

Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307.

Roopnath Grandhi. 2016. Methods and systems of discovery of products in e-commerce. US Patent App. 14/830,696.

Swapnil Hingmire and Sutanu Chakraborti. 2014. Topic labeled text classification: A weakly supervised approach. In ACM SIGIR Conference on Research and Development in Information Retrieval.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751.

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.

Solomon Kullback. 1997. Information Theory and Statistics. Courier Corporation.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Ximing Li and Bo Yang. 2018. A pseudo label based dataless naive Bayes algorithm for text. In International Conference on Computational Linguistics.

Ruxia Liang and Jian-qiang Wang. 2019. A linguistic intuitionistic cloud decision support model with sentiment analysis for product selection in e-commerce. International Journal of Fuzzy Systems, 21(3):963–977.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Zhibin Lu, Pan Du, and Jian-Yun Nie. 2020. VGCN-BERT: Augmenting BERT with graph embedding for text classification. In European Conference on Information Retrieval, pages 369–382. Springer.

Albert Ernest Maxwell. 1970. Comparing the classification of subjects by two independent judges. The British Journal of Psychiatry, 116(535):651–655.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Jason Rennie. 2008. Newsgroups dataset.

Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. Poor man's BERT: Smaller and faster transformer models. arXiv preprint arXiv:2004.03844.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Alan Stuart. 1955. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42(3/4):412–416.

Alexandre Tamborrino, Nicola Pellicano, Baptiste Pannier, Pascal Voitot, and Louise Naudin. 2020. Pre-training is (almost) all you need: An application to commonsense reasoning. arXiv preprint arXiv:2004.14074.

Stacey Taylor and Vlado Keselj. 2020. E-commerce and sentiment analysis: Predicting outcomes of class action lawsuits. In Proceedings of the 3rd Workshop on e-Commerce and NLP, pages 77–85.

Ellen M. Voorhees and Donna Harman. 2000. Overview of the sixth text retrieval conference (TREC-6). Inf. Process. Manage., 36(1):3–35.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, pages arXiv–1910.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5753–5763.

Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2019. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723.

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188.

Jiashu Zhao, Hongshen Chen, and Dawei Yin. 2019. A dynamic product-aware learning model for e-commerce query intent understanding. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1843–1852.
Appendix A

Model 1        Model 2         Chi-square   DoF   p-value (<)
ALBERT         RoBERTa         2185.71      9     2.2e-16
R-ALBERT-KL    R-ALBERT-OHE    955.61       9     2.2e-16
R-ALBERT-KL    RoBERTa         955.61       9     2.2e-16

Table 6: Results of the Stuart-Maxwell test comparing different models; the increase/decrease in performance is statistically significant.
Model Parameters (in millions)