Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop
Katherine Bailey and Sunny Chopra
Acquia
{katherine.bailey,sunny.chopra}@acquia.com

ABSTRACT
Most of the literature around text classification treats it as a supervised learning problem: given a corpus of labeled documents, train a classifier such that it can accurately predict the classes of unseen documents. In industry, however, it is not uncommon for a business to have entire corpora of documents where few or none have been classified, or where existing classifications have become meaningless. With web content, for example, poor taxonomy management can result in labels being applied indiscriminately, making filtering by these labels unhelpful. Our work aims to make it possible to classify an entire corpus of unlabeled documents using a human-in-the-loop approach, where the content owner manually classifies just one or two documents per category and the rest can be automatically classified. This "few-shot" learning approach requires rich representations of the documents such that those that have been manually labeled can be treated as prototypes, and automatic classification of the rest is a simple case of measuring the distance to prototypes. This approach uses pre-trained word embeddings, where documents are represented using a simple weighted average of constituent word embeddings. We have tested the accuracy of the approach on existing labeled datasets and provide the results here. We have also made code available for reproducing the results we got on the 20 Newsgroups dataset.
1 INTRODUCTION

1.1 WORD EMBEDDINGS

Word embeddings are representations of words as low-dimensional vectors of real numbers that capture the semantic relationships between words. In Natural Language Processing (NLP), some method for converting words to numeric values is always necessary, as computation only works with numbers, not raw text. (Mikolov et al., 2013) introduced efficient techniques for learning distributed vector representations of words from huge corpora. This method is called word2vec, and since then alternative approaches have been put forward by others, such as (Pennington et al., 2014), known as GloVe, and (Bojanowski et al., 2016), known as FastText. There is also doc2vec for learning representations of entire documents.

These approaches differ in the way the representations are generated. Word2vec and FastText are "predictive" models, whereas GloVe is categorized as a "count-based" model. Predictive models learn their vectors by predicting the target word given neighboring context words using a feed-forward neural network. The weights of this network are optimized using stochastic gradient descent, and these weights are the vector representations of the words in the vocabulary. In contrast, count-based models learn their vectors by finding a lower-dimensional representation of each word, minimizing a reconstruction loss function on the co-occurrence count matrix given as input.

Code for reproducing our 20 Newsgroups results: https://github.com/katbailey/few-shot-text-classification
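To make the notion of semantic similarity between word vectors concrete, the sketch below compares toy embeddings with cosine similarity. The 4-dimensional vectors here are invented purely for illustration; real pre-trained embeddings such as GloVe are typically 50- to 300-dimensional.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings", invented for illustration only.
vectors = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

# Semantically related words point in similar directions.
sim_animals = cosine_similarity(vectors["cat"], vectors["dog"])
sim_mixed = cosine_similarity(vectors["cat"], vectors["car"])
```

With good embeddings, sim_animals comes out close to 1 while sim_mixed is much smaller, which is exactly the property the classification approach later relies on.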
1.2 FEW-SHOT LEARNING

Few-shot learning is an approach to classification that works with only a few human-labeled examples. It often goes hand-in-hand with transfer learning, a technique involving learning representations during one task that can then be applied to a different task, because it is the richness of the learned representations that makes it possible to learn from just a few examples. The use of pre-trained word embeddings is an example of transfer learning. One-shot learning is a special case of few-shot learning: as the name suggests, it means learning to classify objects when only one labeled example exists per class.

1.3 HUMAN-IN-THE-LOOP
The term Human-in-the-Loop (HitL) refers to any Machine Learning technique that involves human input in the training process. It includes techniques such as active learning, where humans handle low-confidence predictions, as well as crowd-sourced approaches to labeling data sets. In our case, the humans perform the "few shots" in our few-shot learning system.
2 FEW-SHOT TEXT CLASSIFICATION WITH A HUMAN IN THE LOOP

Our approach involves a "classification engine" that the user (the content owner) interacts with. A batch of documents is fed to the engine and each document is converted into a 300-dimensional vector. A set of two or more categories is specified, and Latent Dirichlet Allocation is run on the batch using the number of categories provided as the number of topics. This serves to surface the most likely representative documents for each category. These are then presented to the user, who must choose some documents to manually classify for each category. Once this step is complete, the system has a vector representing each category: if only one document was classified for a category then that document's vector is used as the category vector; otherwise the vectors of multiple documents are averaged together. Once this is done, the remaining documents are compared against each category representative using simple cosine similarity and each one is assigned the category whose vector it is closest to. A score is also assigned for each prediction. All of these steps are explained in section 4.
3 RELATED WORK
In (Arora et al., 2017) the authors present an approach to representing sentences or entire documents as vectors. They first take the weighted average of the constituent (pre-trained) word vectors of the document. The weighting method serves to down-weight frequent words. They then run principal components analysis (PCA) on the batch, and the final embedding for each document is obtained by subtracting the projection of the set of sentence embeddings onto their first principal component. The intuition behind this is that common methods for computing word vectors based on co-occurrence statistics, such as GloVe, lead to large components that contain no semantic information. A similar insight is presented in (Mu et al., 2017).

In (Snell et al., 2017), the authors present the idea of
Prototypical Networks as a way of performing few-shot classification. In this approach, each class prototype is the mean vector of the vectorized support points belonging to its class, and the prototypes are learned through gradient-descent-based training episodes on subsets of training examples. Classification then involves finding the nearest class prototype for each query vector. This is similar to how we perform classification, but in our case we are actively seeking the best class prototypes by using a human-in-the-loop to choose them.
4 APPROACH IN DETAIL
In this section, we provide the details of the text classification system.

4.1 DOCUMENT EMBEDDINGS
For any given collection of documents, which we refer to as a batch, each document needs to be converted to a fixed-length vector so that we can measure similarity between documents. For our task, which is about classifying content, we found there to be little improvement in accuracy when running the PCA step, as suggested by (Arora et al., 2017) and (Mu et al., 2017), over just using the weighted average of the word vectors. Moreover, PCA introduced a level of complexity that meant the embedding of a document was always batch-specific. For pragmatic reasons it was preferable for us to have document representations that were independent of the batch they came from. Hence our embedding method is simply:
Algorithm 1: Weighted average algorithm for document representations.

Input: Word representations {v_w : w ∈ V}, a set of documents D, parameter α, and estimated probabilities {p(w) : w ∈ V} of the words.
Output: Document embeddings {v_d : d ∈ D}.

for all documents d in D do
    v_d ← (1 / |d|) Σ_{w ∈ d} [α / (α + p(w))] v_w
end
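A minimal Python sketch of Algorithm 1, assuming word vectors and unigram probabilities have already been loaded; the toy 2-dimensional vectors and probabilities below are invented for illustration.

```python
import numpy as np

def embed_document(tokens, word_vectors, word_probs, alpha=1e-3):
    """Weighted average of word vectors (Algorithm 1): each word's vector
    is scaled by alpha / (alpha + p(w)), down-weighting frequent words."""
    dim = len(next(iter(word_vectors.values())))
    v = np.zeros(dim)
    for w in tokens:
        if w in word_vectors:  # skip out-of-vocabulary words
            v += (alpha / (alpha + word_probs.get(w, 0.0))) * word_vectors[w]
    return v / max(len(tokens), 1)

# Toy vocabulary: "the" is very frequent, "cat" is rare.
word_vectors = {"the": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
word_probs = {"the": 0.05, "cat": 0.0005}

v_d = embed_document(["the", "cat"], word_vectors, word_probs)
# The rare word dominates the embedding despite equal counts.
```

The α/(α + p(w)) weighting means the very frequent "the" contributes almost nothing to v_d, which is the desired behavior for content classification.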
4.2 CLASSIFICATION

The classification task involves manually labelling a small number of documents in each class and using these as representatives of the class. In the one-shot case, this means that our representation of a class is simply the single document vector that has been labeled for that class. If we have more than one representative document per class, we simply take the average of those vectors as our category representative. Once we have a representative for each class, we can proceed to predict classes for the rest.
Algorithm 2: Predicting classes using cosine similarity with class representative vectors.

Input: Class representative vectors {v_c : c ∈ C}, a set of document vectors {v_d : d ∈ D}.
Output: Predicted class for each document.

for all documents d in D do
    for all classes c in C do
        sims_c ← cosine_similarity(v_d, v_c)
    end
    ĉ_d ← argmax(sims)
end
4.3 SURFACING GOOD REPRESENTATIVES

Choosing good class representatives is of vital importance to the accuracy of this approach. The human-in-the-loop will be making the final choice about which documents to use as representatives for each class, but we need to make it as easy as possible to choose the best representatives. Our approach is to run Latent Dirichlet Allocation (LDA) on the batch of documents. This is a probabilistic approach to topic modeling which, given a number of topics, will infer those topics as distributions over the words in the vocabulary. Each document is assumed to be some mixture of topics, and these are assigned probabilities based on the words contained in the document. For example, document 1 could be deemed to be 90% about topic A and 10% about topic B. If you assume that the topics are mutually exclusive, then it is reasonable to simply assign each document to the topic it has the highest probability of belonging to. We perform this step, and then for each inferred topic we rank the documents assigned to it in descending order of their probability of belonging to that topic. The assumption here is that LDA, in inferring topics, has approximately teased apart the actual categories, and so the idea is to have an ordering of the documents within a batch such that the first page of documents seen in the UI is likely to have a mix of good representatives for each category.
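A sketch of this surfacing step using scikit-learn's LatentDirichletAllocation on a toy four-document batch. The documents and topic count are invented for illustration; in our system the topic count is the user-specified number of categories.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat purred and chased the mouse",
    "kittens and cats love sleeping in the sun",
    "the engine roared as the car sped down the road",
    "new cars have efficient petrol engines",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Assign each document its most probable topic, then rank documents
# within each topic by that probability (most representative first).
assignments = doc_topics.argmax(axis=1)
ranked = {
    t: sorted(np.where(assignments == t)[0], key=lambda i: -doc_topics[i, t])
    for t in range(2)
}
```

The per-topic rankings in `ranked` are what would drive the ordering of documents shown to the user on the first page of the UI.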
5 EXPERIMENTS
We evaluate the accuracy of performing one-shot classification on some publicly available labeled datasets.

5.1 DATASETS
We created 2- and 3-category subsets of the "20 Newsgroups" dataset. The original dataset consists of 20,000 messages taken from 20 newsgroups on subjects like cars, religion, guns and baseball. We used the Scikit-learn python library to extract subsets of this data. We also created 2-, 3- and 4-category subsets of the DBPedia dataset, which contains the first paragraph of the Wikipedia page for around half a million entities in 15 categories, e.g. Animal, Film, Plant, Company, etc. We perform some basic cleaning of the text and remove stop words before converting documents into vectors.
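The cleaning step can be sketched as below. The stop-word list here is a small illustrative subset; in practice a fuller list would be used, and the 20 Newsgroups subsets themselves can be pulled with scikit-learn's fetch_20newsgroups, passing a categories argument listing the newsgroups wanted.

```python
import re

# A small illustrative stop-word list; a real pipeline would use a
# fuller one (e.g. scikit-learn's built-in English list).
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of", "in"}

def clean(text):
    """Lowercase, keep only alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = clean("The cars are parked in the garage!")
```

The resulting token lists are what get fed into the weighted-average embedding step of Algorithm 1.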
5.2 MAXIMUM ONE-SHOT ACCURACY
In order to know how accurate our classification approach can be in theory (i.e. if the user happens to be lucky enough to find the best representatives for each category), we did some brute-force testing of the approach, where we went through thousands of combinations of document representatives to see what the highest achievable accuracy was with one-shot learning. In some cases, i.e. where there were only two categories and only around 500 documents per category, we did an exhaustive search, testing all possible combinations. In other cases we randomly sampled from all possible combinations. Table 1 shows the maximum accuracy we achieved on various subsets of categories. As expected, the accuracy is high when the categories are very distinct, such as the "Animal" and "Film" categories in the DBPedia dataset: there would be very little overlap in the types of words used in articles about animals vs articles about films. We looked at the few misclassifications that arose when the best known representatives were used in this dataset and found that the films misclassified as animals were films about animals, and the animals misclassified as films were all thoroughbred racehorses that had won or been nominated for awards.

These results show us that for a batch of documents to be classified into separate categories, it is possible to achieve very high accuracy with just a single labeled example per category, provided that 1) those categories truly are reflected in the words of the documents and 2) good enough representatives are chosen for each category. Since for any given batch we have no control over how well the words in the documents reflect the desired categories, the problem we focus on is how to choose the best representatives.

Categories                        Max acc.   LDA max
Village, Film                     0.9992     0.998388
Village, Animal                   0.9979     –
Animal, Film                      0.9946     0.992510
Animal, Company                   0.9943     0.987633
Animal, Film, Company, Village    0.9706     –
autos, baseball                   0.9703     0.945402
guns, hardware                    0.9693     0.965613
mideast, electronics              0.9689     0.959472
Animal, Film, Company             0.9688     0.961771
christian, guns                   0.9380     0.925251
med, electronics                  0.9324     0.931532
atheism, space                    0.9232     0.864271
baseball, hockey                  0.9063     0.798500
autos, baseball, space            0.8964     0.803371
Animal, Plant                     0.8540     –
politics, religion                0.8264     0.804404

Table 2: Maximum brute-force accuracy obtained and maximum accuracy obtained using only combinations of the top 12 LDA-surfaced documents.
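The brute-force search over representative combinations can be sketched as follows, on a toy batch of six 2-dimensional document vectors invented for illustration; on the real datasets this loop runs over thousands of combinations, or a random sample of them.

```python
import itertools
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accuracy(rep_a, rep_b, docs, labels):
    """One-shot accuracy with rep_a / rep_b as the two class representatives."""
    preds = ["A" if cosine(d, rep_a) >= cosine(d, rep_b) else "B" for d in docs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy 2-d document vectors with known labels.
docs = [np.array(v) for v in
        [(0.9, 0.1), (0.8, 0.3), (0.7, 0.2), (0.1, 0.9), (0.2, 0.8), (0.3, 0.7)]]
labels = ["A", "A", "A", "B", "B", "B"]

# Exhaustively try every (class-A doc, class-B doc) pair as representatives.
a_idx = [i for i, y in enumerate(labels) if y == "A"]
b_idx = [i for i, y in enumerate(labels) if y == "B"]
best = max(itertools.product(a_idx, b_idx),
           key=lambda pair: accuracy(docs[pair[0]], docs[pair[1]], docs, labels))
max_acc = accuracy(docs[best[0]], docs[best[1]], docs, labels)
```

For larger category counts the product of candidate sets explodes, which is why random sampling replaces exhaustive search in those cases.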
5.3 RESULTS WITH LDA

We ran LDA on our datasets, specifying the number of known categories as the number of topics, and then tested the maximum accuracy that could be obtained using one-shot learning when trying only the possible combinations of representatives in the top 12 documents surfaced by LDA (as described in section 4). The idea is that the user will be presented with documents for classification, perhaps 12 documents per page, and ideally they wouldn't have to look beyond the first, or at most the second, page of documents in order to choose representatives for each category.

In table 2 we show the same maximum accuracy obtained via the brute-force testing of combinations along with the maximum accuracy achieved by testing the combinations from the top 12 documents. In many cases the LDA accuracy is quite close to the maximum brute-force accuracy; however, in 3 cases we could not get a result for LDA because the top 12 documents did not include representatives from all categories. This is perhaps understandable in the case of the 4-category dataset, "Animal, Film, Company, Village." In the case of the "Animal, Plant" dataset we see that the brute-force accuracy is relatively low, due to the fact that many similar words would be used to describe animals and plants ("species", "family", "habitat", etc.), and so this perhaps explains why LDA had a difficult time teasing apart the topics. However, in the case of the "Village, Animal" dataset, which gets a maximum brute-force accuracy of 99.79%, the same argument clearly cannot be made. This tells us that there is much room for improvement in our method for inferring topics in datasets for the purpose of surfacing good category representatives.

We also tested the LDA accuracy on some datasets for which we had not run a brute-force test. Table 3 shows these results, tested using both GloVe and FastText pre-trained word embeddings.

Categories                              GloVe LDA   FastText LDA
Village, Film                           0.9984      0.9932
Animal, Film                            0.9925      0.9934
MeansOfTransportation, NaturalPlace     0.9922      0.9926
Village, Company                        0.9895      0.9896
Animal, Company                         0.9876      0.9905
Building, NaturalPlace                  0.9791      0.9807
Album, Film                             0.974       0.9825
guns, hardware                          0.9656      0.9638
Film, Company                           0.9632      0.9664
Animal, Film, Company                   0.9618      0.9445
Artist, Athlete                         0.9604      0.922
mideast, electronics                    0.9595      0.9746
autos, baseball                         0.9454      0.9511
autos, hockey                           0.9413      0.9376
Building, EducationalInstitution        0.9345      0.935
Film, WrittenWork                       0.9341      0.9376
med, electronics                        0.9315      0.9153
christian, guns                         0.9253      0.9243
med, religion                           0.888       0.888
atheism, space                          0.8643      0.8713
autos, guns                             0.8417      0.8407
autos, electronics                      0.8291      0.845
space, med                              0.8111      0.8405
politics, religion                      0.8044      0.7526
autos, baseball, space                  0.8034      0.8414
baseball, hockey                        0.7985      0.8594
religion, mideast                       0.7526      0.7808
christian, atheism                      0.6553      0.6777
atheism, religion                       0.6304      0.6522

Table 3: Max accuracies on combinations of top 12 LDA documents: GloVe vs FastText.

5.4 RELATIVE LENGTH OF CATEGORY REPRESENTATIVES

We were curious about the extent to which the relative length of the chosen category representatives mattered. For example, if the user chose a very short document as the sole representative for category A and a very long document, or multiple documents, for category B, how much would this skew the predictions towards category B? On one dataset only, the 2-category subset "autos, baseball" from 20 Newsgroups, we went through every possible combination of category representatives, noted the length of each document, ran the auto-classification step, and noted the number of predictions in each category. With this information, we were able to examine the correlation between relative representative length and relative prediction count for that category. We found a strong positive correlation between them, roughly 0.8.
This means we need to either provide guidance to the user when selecting representatives (to make sure they select representatives of roughly equal length), or we need to filter the documents we present to the user to remove very long and very short ones.
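The correlation check itself is straightforward; the sketch below computes the Pearson correlation with NumPy over a handful of hypothetical (length share, prediction share) pairs, invented purely to illustrate the computation.

```python
import numpy as np

# Hypothetical trials from a 2-category batch: category A's share of the
# total representative length, and its share of the resulting predictions.
length_share = np.array([0.20, 0.35, 0.50, 0.65, 0.80])
prediction_share = np.array([0.28, 0.42, 0.49, 0.62, 0.71])

# Pearson correlation coefficient between the two series.
r = float(np.corrcoef(length_share, prediction_share)[0, 1])
```

A value of r near 1 indicates the skew described above: longer representatives attract disproportionately many predictions.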
6 DISCUSSION
Our testing on the 20 Newsgroups and DBPedia datasets showed that the general approach is sound, that document embeddings using weighted averages of pre-trained word embeddings are useful representations, and that if good category representatives are chosen, high levels of accuracy can be achieved on the classification task. It also showed that LDA can help in choosing good representatives. The datasets we tested on were quite artificial: mostly just 2-category datasets chosen from larger datasets with more categories. However, they did help identify certain characteristics that may make a real dataset suitable for this approach. The most important characteristic is that the categories must be sufficiently distinct and the words used in the documents should reflect those categories. The length of the documents chosen as representatives is also important: they should be roughly equal in length.

As to the question of working with more categories, why not test on the entire 20 Newsgroups or DBPedia dataset? Brute-force testing becomes infeasible with larger numbers of categories, and so it wasn't possible to get a maximum achievable accuracy for the entire dataset (the maximum achieved accuracy of 97% on a particular 4-category subset of the DBPedia dataset was based on testing 390,625 of the 625 trillion possible combinations). In any case, it is unclear whether this approach will be appropriate for datasets with many categories, say more than four or five, seeing as it requires the human in the loop to find good representatives for each one, a task which might prove daunting in such a case even if the category breakdown has been successfully approximated through LDA or similar. We feel that our text classification approach is best suited to 2- or 3-category classification tasks. An interesting use case for 2-category, i.e. binary, classification might be topic stance detection. Given user-generated content on a particular topic, e.g.
posts on a web forum, it might be the case that different words are used depending on where the writer stands on that topic. An example of this would be some people using the term "family reunification" and others using "chain migration" to refer to the same thing. In this case our approach could be used to detect the stance reflected in each post.
7 FUTURE WORK
The crucial step in our method is to present the user with suitable candidates to be labeled as representatives. We used Latent Dirichlet Allocation (LDA) for topic inference and found that while in many cases we got close to the maximum accuracy achieved through brute-force trials, in some cases it fell far short or even failed to tease apart the topics at all.

An alternative to LDA would be to use probabilistic Latent Semantic Analysis (pLSA), which treats topics as word distributions and uses probabilistic methods similar to LDA. But the use of Dirichlet priors in LDA for the document-topic and topic-word distributions in order to prevent over-fitting seems to make it a better choice. Our goal in the future is to improve the topic inference step, and so we will look to other alternatives. One idea is to use guided LDA as suggested in (Jagarlamudi et al., 2012). The seeds for guiding it will be the category names that the user specifies to classify the batch of documents. Another approach is to use Gaussian LDA as proposed in (Das et al., 2015). This approach is a good fit as it works with vectorized representations of words and documents. Yet another option would be to use ProdLDA as described in (Srivastava et al., 2017), which is a neural network version in which the distribution over individual words is a product of experts rather than the mixture model used in LDA. Running clustering algorithms on the document representations is another approach we plan to try in order to solve the topic inference problem.

Although we only tested our approach on single-labeled datasets, we would like to be able to apply it to multi-labeled datasets. One idea would be to use each label independently and do a binary classification of whether a document has that label or not. By treating each label independently of the other labels, we convert it into a single-labeled problem.
However, our current LDA approach to topic inference will not work in this case, as it will always suggest the same ordering of documents regardless of which label the user is choosing representatives for. This is where guiding the LDA with the seeds of category names can prove immensely useful. An alternative approach, rather than treating each label independently, would be to use classifier chains for multi-label classification as suggested in (Read et al., 2009). This method may help us achieve a fair improvement in accuracy. The foremost issue again is topic inference, which will be essential for accurate multi-label classification, and that is going to be our principal focus of research in the future.

REFERENCES
Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proc. ICLR.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomáš Mikolov. 2016. Enriching Word Vectors with Subword Information. In Proc. TACL.

Rajarshi Das, Manzil Zaheer, and Chris Dyer. 2015. Gaussian LDA for Topic Models with Word Embeddings. In Proc. ACL, 795–804.

Jagadeesh Jagarlamudi, Hal Daume III, and Raghavendra Udupa. 2012. Incorporating Lexical Priors into Topic Models. In Proc. EACL, 204–213.

Tomáš Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Adv. NIPS.

Jiaqi Mu, Suma Bhat, and Pramod Viswanath. 2017. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. https://arxiv.org/abs/1702.01417.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proc. EMNLP, 1532–1543.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2009. Classifier Chains for Multi-label Classification. In Proc. ECML PKDD, Springer.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical Networks for Few-shot Learning. In Adv. NIPS, 4077–4087.

Akash Srivastava and Charles Sutton. 2017. Autoencoding Variational Inference for Topic Models. https://arxiv.org/abs/1703.01488.