Bootstrapping Large-Scale Fine-Grained Contextual Advertising Classifier from Wikipedia
Yiping Jin, Vishakha Kadam, and Dittaya Wanvarie

Knorex, 140 Robinson Road, {jinyiping,vishakha.kadam}@knorex.com
Department of Mathematics & Computer Science, Chulalongkorn University, Thailand 10300
Abstract.
Contextual advertising provides advertisers with the opportunity to target the context which is most relevant to their ads. However, its power cannot be fully utilized unless we can target the page content using fine-grained categories, e.g., “coupé” vs. “hatchback” instead of “automotive” vs. “sport”. The widely used advertising content taxonomy (IAB taxonomy) consists of 23 coarse-grained categories and 355 fine-grained categories. With the large number of categories, it becomes very challenging either to collect training documents to build a supervised classification model, or to compose expert-written rules in a rule-based classification system. Besides, in fine-grained classification, different categories often overlap or co-occur, making it harder to classify accurately. In this work, we propose wiki2cat, a method to tackle the problem of large-scale fine-grained text classification by tapping into the Wikipedia category graph. The categories in the IAB taxonomy are first mapped to category nodes in the graph. Then the label is propagated across the graph to obtain a list of labeled Wikipedia documents to induce text classifiers. The method is ideal for large-scale classification problems since it does not require any manually-labeled document or hand-curated rules or keywords. The proposed method is benchmarked against various learning-based and keyword-based baselines and yields competitive performance on both publicly available datasets and a new dataset containing more than 300 fine-grained categories.
Keywords:
Contextual Advertising · Wikipedia Content Analysis · Large-Scale Text Classification
1 Introduction

Despite the fast advancement of text classification technologies, most text classification models are trained and applied to a relatively small number of categories. Popular benchmark datasets contain from two up to tens of categories, such as the SST2 dataset for sentiment classification (2 categories) [15], the AG news dataset (4 categories) [18] and the 20 Newsgroups dataset [11] for topic classification.
In the meantime, industrial applications often involve fine-grained classification with a large number of categories. For example, Walmart built a hybrid classifier to categorize products into 5000+ product categories [16], and Yahoo built a contextual advertising classifier with a taxonomy of around 6000 categories [2]. Unfortunately, both systems require a huge human effort in composing and maintaining rules and keywords. Readers can neither reproduce their systems, nor are the systems or data available for comparison.

In this work, we focus on the application of contextual advertising, which allows advertisers to target the context which is most relevant to their ads. However, its power cannot be fully utilized unless we can target the page content using fine-grained categories, e.g., “coupé” vs. “hatchback” instead of “automotive” vs. “sport”. This motivates a classification taxonomy with both high coverage and high granularity. The commonly used contextual taxonomy introduced by the Interactive Advertising Bureau (IAB) contains 23 coarse-grained categories and 355 fine-grained categories. Figure 1 shows a snippet of the taxonomy.

Fig. 1: Snippet of the IAB Content Categorization Taxonomy.

Inspired by the fact that large online encyclopedias, such as Wikipedia, contain an updated account of almost all topics, we ask an essential question: can we bootstrap an accurate text classifier with hundreds of categories using a publicly available encyclopedia without any manually-labeled document? To this end, we propose wiki2cat, a simple framework to bootstrap classifiers from Wikipedia. We tap on and extend previous work on Wikipedia content analysis [10] to automatically label Wikipedia articles related to each category in our taxonomy by Wikipedia category graph traversal. We then train classification models with the labeled Wikipedia articles. We compare our method with various learning-based and keyword-based baselines and obtain competitive performance.
2 Related Work

Large knowledge bases like Wikipedia or the DMOZ content directory cover a wide range of topics. They also have a category hierarchy in either tree or graph structure, which provides a useful resource for building text classification models. Text classification using knowledge bases can be broadly categorized into two main approaches: vector space models and semantic models.

Vector space models aim to learn a category vector by aggregating the descendant pages and perform a nearest neighbor search during classification. A pruning step is usually performed first, based on the depth from the root node or the number of child pages, to reduce the number of categories. Subsequently, each document forms a document vector, which is aggregated to form the category vector. Lee et al. [13] used a tf-idf representation of the document, while Kim et al. [9] combined word embeddings and tf-idf representations to obtain a better performance.

In semantic models, the input document is mapped explicitly to concepts in the knowledge base. The concepts are used either in conjunction with a bag-of-words representation [6] or stand-alone [3] to assign categories to the document. Gabrilovich and Markovitch [6] used a feature generator to predict relevant Wikipedia concepts (articles) related to the input document. These concepts are orthogonal to the labels in specific text classification tasks and are used to enrich the representation of the input document. Experiments on multiple datasets demonstrated that the additional concepts helped improve the performance. Similarly, Zhang et al. [19] enriched the document representation with both concepts and categories from Wikipedia.

Chang et al. [3] proposed Dataless classification, which maps both input documents and category names into Wikipedia concepts using Explicit Semantic Analysis [7].
The idea is similar to Gabrilovich and Markovitch [6], except that (1) the input is mapped to a real-valued concept vector instead of a discrete list of related categories, and (2) the category name is mapped into the same semantic space, which removes the need for labeled documents.

Our work is most similar to Lee et al. [13]. However, they only evaluated on random-split Wikipedia documents, while we apply the model to a real-world large-scale text classification problem. We also employ a graph traversal algorithm to label the documents instead of labeling all descendant documents.
Some previous work tried to understand the distribution of topics in Wikipedia for data analysis and visualization. Kittur et al. [10] calculated the distance between each page and the top-level category nodes. They then assigned the category with the shortest distance to the page. With this approach, they provided the first quantitative analysis of the distribution of topics in Wikipedia. Farina et al. [5] extended the method by allowing traversal upward in the category graph and assigning categories proportionally to the distance instead of assigning the category with the shortest path only. More recently, Bekkerman and Donin [1] visualized Wikipedia by building a two-level coarse-grained/fine-grained graph representation. The edges between categories capture the co-occurrence of categories on the same page. They further pruned edges between categories that rarely appear together. The resulting graph contains the 441 largest categories and 4815 edges connecting them.
3 Method

We propose wiki2cat, a simple framework using Wikipedia to bootstrap text categorizers. We first map the target taxonomy to corresponding Wikipedia categories (briefed in Section 3.1). We then traverse the Wikipedia category graph to automatically label Wikipedia articles (Section 3.2). Finally, we induce a classifier from the labeled Wikipedia articles (Section 3.3). Figure 2 overviews the end-to-end process of building classifiers under the wiki2cat framework.

Fig. 2: Overview of wiki2cat, a framework to bootstrap large-scale classifiers from Wikipedia. User-defined categories are first mapped to Wikipedia categories. The Wikipedia category graph is then traversed to label documents for training.
3.1 Mapping IAB Categories to Wikipedia Categories

Wikipedia contains 2 million categories, which is 4 orders of magnitude larger than the IAB taxonomy. We index all Wikipedia category names in Apache Lucene (https://lucene.apache.org) and use the IAB category names to query the closest matches. We perform the following: 1) lemmatize the category names in both taxonomies, 2) index both Wikipedia category names and their alternative names from redirect links (e.g., “A.D.D.” and “Attention deficit disorder”), 3) split conjunction category names and query separately (e.g., “Arts & Entertainment” → “Arts”, “Entertainment”), and 4) capture small spelling variations with string similarity (we use Jaro-Winkler string similarity with a threshold of 0.9 to automatically map IAB categories to Wikipedia categories).

Out of all 23 coarse-grained and 355 fine-grained categories in the IAB taxonomy, 311 categories (82%) can be mapped trivially. Their category names either match exactly or contain only small variations. E.g., the IAB category “Pagan/Wiccan” is matched to three Wikipedia categories: “Paganism”, “Pagans”, and “Wiccans”. One author of this paper took roughly 2 hours to manually curate the remaining 67 categories and provide the mapping to Wikipedia categories. Out of the 67 categories, 23 are categories that cannot be matched automatically because the category names look very different, e.g., “Road-Side Assistance” and “Emergency road services”. The rest are categories where the system can find a match, but the string similarity is below the threshold (e.g., correct: “Chronic Pain” and “Chronic Pain Syndromes”; incorrect: “College Administration” and “Court Administration”). We use the curated mapping in subsequent sections.
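The fuzzy-matching step 4 above can be sketched as follows. This is a from-scratch Jaro-Winkler implementation with the paper's 0.9 threshold; the paper does not say which implementation it uses, so everything except the metric and the threshold is our assumption.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matching characters, penalizing transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among the matched characters.
    t, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boost the Jaro score for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def is_match(iab_name: str, wiki_name: str, threshold: float = 0.9) -> bool:
    """Accept a candidate Wikipedia category if similarity clears the threshold."""
    return jaro_winkler(iab_name.lower(), wiki_name.lower()) >= threshold
```

For instance, `is_match("Paganism", "Pagans")` clears the 0.9 threshold, while a pair like “Road-Side Assistance” vs. “Emergency road services” does not and has to be curated by hand, as described above.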
3.2 Labeling Wikipedia Articles via Graph Traversal

With the mapping between IAB and Wikipedia categories, we can anchor each IAB category as a node in the Wikipedia category graph, referred to as a root category node. Our task then becomes to obtain a set of labeled Wikipedia articles by performing graph traversal from the root category nodes. From each root category node, the category graph can be traversed using the breadth-first search algorithm to obtain a list of all descendant categories and pages.

One may argue that we can take all descendant pages of a Wikipedia category to form the labeled set. However, in Wikipedia, “page A belongs to category B” does not imply a hypernym relation. In fact, some pages have a long list of categories, most of which are at best remotely related to the main content of the page. E.g., the page “Truck Stop Women” is a descendant page of the category “Trucks”. However, it is a 1974 film, and the main content of the page is about the plot and the cast.

We label Wikipedia pages using a competition-based algorithm following Kittur et al. [10] and Farina et al. [5]. We treat each category node from which a page can be traversed as a candidate category and evaluate across all candidate categories to determine the final category(s) for the page.

Firstly, all pages are pruned based on the percentage of their parent categories that can be traversed from the root category. Figure 3 shows two Wikipedia pages with a snippet of their ancestor categories. Both pages have a shortest distance of 2 to the category “Trucks”. However, the page “Ford F-Max” is likely more related to “Trucks” than the page “Camping and Caravanning Club” because most of its parent categories can be traversed from “Trucks”. We empirically set the threshold such that we prune a page with respect to a root category if less than 30% of its parent categories can be traversed from that root category.

While the categories in the IAB taxonomy occur in parallel, the corresponding categories in Wikipedia may occur in a hierarchy.
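The breadth-first traversal and the 30% parent-overlap pruning described above can be sketched as follows on a toy graph. The adjacency-dict representation and the function names are ours, not from the paper; the paper stores the real graph in a database.

```python
from collections import deque

def descendants(graph, root, blocked=frozenset()):
    """BFS over 'subcat' edges from a root category. Branches belonging to a
    competing root category (`blocked`) are pruned. Returns the reachable
    category nodes and the page leaves hanging off them."""
    seen_cats, pages = {root}, set()
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for child in graph.get(cat, []):
            if child in blocked or child in seen_cats:
                continue
            if child in graph:          # internal node -> a subcategory
                seen_cats.add(child)
                queue.append(child)
            else:                       # leaf node -> a page
                pages.add(child)
    return seen_cats, pages

def keep_page(page_parents, reachable_cats, min_overlap=0.3):
    """Prune a page if fewer than 30% of its parent categories can be
    traversed from the root category (the paper's empirical threshold)."""
    if not page_parents:
        return False
    overlap = sum(1 for p in page_parents if p in reachable_cats) / len(page_parents)
    return overlap >= min_overlap

# Toy example mirroring Figure 3 (category names invented for illustration):
graph = {
    "Trucks": ["Trucks by type", "Truck manufacturers"],
    "Trucks by type": ["Light trucks"],
    "Light trucks": ["Ford F-Max"],
    "Truck manufacturers": ["Ford F-Max"],
}
cats, pages = descendants(graph, "Trucks")
# "Ford F-Max" has parents {"Light trucks", "Truck manufacturers",
# "Trucks by type", "1974 films"}: 3 of 4 are reachable -> kept.
```

Passing the categories of a competing IAB label via `blocked` implements the branch pruning described in the next paragraph of the paper.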
For example, the categories “SUVs” and “Trucks” are in parallel in the IAB taxonomy, but “SUVs” is a descendant category of “Trucks” in Wikipedia (Trucks › Trucks by type › Light trucks › Sport utility vehicles). To avoid confusing the classification algorithm, while traversing from the root category node, the algorithm prunes all branches corresponding to a competing category.

Pruning alone will not altogether remove irrelevant content, because the degree of semantic relatedness is not considered. We measure the semantic relatedness between a page and a category based on two factors, namely the shortest

(Footnotes: We construct the category graph using the “subcat” (subcategory) relation in the Wikipedia dump. The graph contains both category nodes and page nodes; pages all appear as leaf nodes, while category nodes can be either internal or leaf nodes. The page “Truck Stop Women”: https://en.wikipedia.org/wiki/Truck_Stop_Women)
Fig. 3: Intuition of the pruning for the category “Trucks”. The page “Ford F-Max” belongs to four categories, three of which can be traversed from “Trucks” and one which cannot (marked in red and italic).

path distance and the number of unique paths between them. Previous work depends only on the shortest path distance [10,5]. We observe that if a page is densely connected to a category via many unique paths, it is often an indication of a strong association. We calculate the weight w of a page with respect to a category as follows:

$w = \sum_{i=0}^{k} d_i$    (1)

where $k$ is the number of unique paths between the page and the category node, and $d_i$ is the distance between the two in the $i$-th path. To calculate the final list of categories, the weights for all competing categories are normalized to 1 by summing over each candidate category $j$, and the categories which have a weight higher than 0.3 are returned as the final assigned categories:

$w_j = \sum_{i=0}^{k_j} d_{ij} \Big/ \Big( \sum_j \sum_{i=0}^{k_j} d_{ij} \Big)$    (2)

The labeling process labeled in total 1.16 million Wikipedia articles. The blue scatter plot in Figure 4 plots the number of labeled training articles per fine-grained category in log-10 scale. We can see that the majority of the categories have between 100 and 10k articles.

3.3 Inducing Classifiers from Labeled Articles

The output of the algorithm described in Section 3.2 is a set of labeled Wikipedia pages. In theory, we can apply any supervised learning method to induce classifiers from the labeled dataset. The focus of this work is not to introduce a novel model architecture, but to demonstrate the effectiveness of the framework.
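Equations (1) and (2) together with the 0.3 cut-off can be sketched as follows, taking the per-path distances between a page and each candidate category as input. This is our reading of the equations as extracted; the input format and function name are assumptions.

```python
def assign_categories(path_distances, threshold=0.3):
    """path_distances: {category: [d_1, ..., d_k]} -- the distance of each
    unique path between the page and that candidate category.
    Per-category weights follow Eq. (1); they are normalized over all
    competing categories as in Eq. (2), and categories whose normalized
    weight exceeds the threshold are returned."""
    weights = {c: sum(ds) for c, ds in path_distances.items()}   # Eq. (1)
    total = sum(weights.values()) or 1.0                         # normalizer of Eq. (2)
    return [c for c, w in sorted(weights.items()) if w / total > threshold]
```

E.g., a page connected to “Trucks” by three paths of distances 2, 3, 3 and to “Films” by one path of distance 2 gets normalized weights 0.8 and 0.2, so only “Trucks” survives the 0.3 cut-off.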
Fig. 4: Number of labeled Wikipedia training articles per fine-grained category (blue) and number of evaluation documents per category (orange), in log-10 scale.

(Footnotes: Throughout this paper, we use the Wikipedia dump downloaded on 10 December 2019. After removing hidden categories and list pages, the final category graph contains 14.9 million articles, 1.9 million categories and 37.9 million links. The graph is stored in a Neo4j database (https://neo4j.com) and occupies 4.7 GB of disk space, not including the page content. We use the original dataset without sampling for the centroid classifier since it is not affected by label imbalance.)
4 Experiments

We use the SGD classifier implementation in scikit-learn (https://scikit-learn.org) with default hyperparameters for the linear SVM. Words are weighted using tf-idf with a minimum term frequency cutoff of 3. We implement the centroid classifier using TfidfVectorizer in scikit-learn and use numpy to implement the nearest neighbor classification. For BERT, we use the DistilBERT implementation by HuggingFace (https://huggingface.co/transformers/model_doc/distilbert.html), a model which is both smaller and faster than the original BERT-base model. For the feed-forward classification head, we use a single hidden layer with 256 units. The model is implemented in PyTorch and optimized with the Adam optimizer with a learning rate of 0.01.

We evaluate our method using three contextual classification datasets. The first two are coarse-grained evaluation datasets published by Jin et al. [8], covering all IAB tier-1 categories except for “News” (totaling 22 categories). The datasets are collected using different methods (the news-crawl-v2 dataset by mapping from news categories; the browsing dataset by manual labeling) and contain 2,127 and 1,501 documents respectively.

We compiled another dataset for fine-grained classification comprising documents labeled with one of the IAB tier-2 categories. The full dataset consists of 134k documents and took an effort of multiple person-years to collect. The sources of the dataset are news websites, URLs occurring in the online advertising traffic, and URLs crawled with keywords using Google Custom Search (https://developers.google.com/custom-search/). The number of documents per category is shown in Figure 4 (the orange scatter plot). 23 out of 355 IAB tier-2 categories are not included in the dataset because they are too rare and are not present in our data sources, so there are in total 332 fine-grained categories in the dataset. Due to company policy, we can publish only a random sample of the dataset with ten documents per category (https://github.com/YipingNUS/nle-supplementary-dataset).
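The two simpler classifiers described above can be sketched with scikit-learn as below. The paper names SGDClassifier and TfidfVectorizer; the exact pipeline wiring, and the use of `NearestCentroid` in place of the authors' numpy nearest-neighbor code, are our assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

def svm_classifier(min_df=3):
    # SGDClassifier with the default hinge loss is a linear SVM;
    # min_df=3 mirrors the paper's minimum term-frequency cutoff.
    return make_pipeline(TfidfVectorizer(min_df=min_df),
                         SGDClassifier(loss="hinge"))

def centroid_classifier(min_df=3):
    # Nearest-centroid over tf-idf vectors stands in for the paper's
    # TfidfVectorizer + numpy nearest-neighbor implementation.
    return make_pipeline(TfidfVectorizer(min_df=min_df),
                         NearestCentroid())
```

On a toy corpus (with `min_df=1`, since the corpus is tiny), fitting either pipeline on labeled documents and calling `.predict()` on new text reproduces the training setup described above in miniature.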
We report the performance on both datasets for future work to reproduce our results. To our best knowledge, this dataset will be the only publicly available dataset for fine-grained contextual classification.

We focus on classifying among fine-grained categories under the same parent category. Figure 5 shows the number of fine-grained categories under each coarse-grained category. While the median number of categories is 10, the classification is challenging because the categories are similar to each other.

Fig. 5: Number of fine-grained categories per coarse-grained category.

We compare wiki2cat with the following baselines:

– Keyword voting (kw voting): predicts the category whose name occurs most frequently in the input document. If none of the category names is present, the model predicts a random label.
– Dataless [3]: maps the input document and the category name into the same semantic space representing Wikipedia concepts using Explicit Semantic Analysis (ESA) [7].
– Doc2vec [12]: similar to the Dataless model, but instead of using ESA, it uses doc2vec to generate the document and category vectors.
– STM [14]: a seed-guided topic model, the state-of-the-art model on coarse-grained contextual classification. Under the hood, STM calculates word co-occurrence statistics and uses them to “expand” the knowledge beyond the given seed words. For coarse-grained classification, STM used hand-curated seed words, while STM-Slabel used category names as seed words. Both are trained on a private in-domain dataset. We also trained STM on our Wikipedia dataset, referred to as STM-Dwiki.
For fine-grained classification, we report only the result of STM-Slabel since no previously published seed words are available.

Keyword voting and Dataless do not require any training documents. Both Doc2vec and STM require an unlabeled training corpus. We copy the coarse-grained classification results for Doc2vec, STM, and STM-Slabel from Jin et al. [8]. For fine-grained classification, we train Doc2vec and STM-Slabel using the same set of Wikipedia documents as in wiki2cat.
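The keyword-voting baseline is simple enough to reconstruct from its description above. This is our sketch; substring counting (rather than token matching) and the tie-breaking are assumptions.

```python
import random
import re

def kw_voting(document, category_names, seed=None):
    """Predict the category whose name occurs most frequently in the
    document; fall back to a random category if no name occurs at all.
    Counts case-insensitive substring occurrences (a simplification:
    no tokenization or lemmatization)."""
    text = document.lower()
    counts = {c: len(re.findall(re.escape(c.lower()), text))
              for c in category_names}
    best = max(counts, key=counts.get)   # ties broken by dict order
    if counts[best] == 0:
        return random.Random(seed).choice(category_names)
    return best
```

As the results below show, such name matching fails whenever the topic is expressed without the category name itself, which is exactly the weakness wiki2cat targets.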
We present the performance of various models on the news-crawl-v2 and browsing datasets in Table 1.

We can observe that wiki2cat using SVM as the learning algorithm (w2c-svm) outperformed the Dataless and Doc2vec baselines. However, it did not perform as well as STM. The STM model was trained using a list of around 30 carefully chosen keywords for each category. It also used in-domain unlabeled documents during training, which we do not use. Jin et al. [8] demonstrated that the choice of seed keywords has a significant impact on the model's accuracy. STM-Slabel is the result of STM using only unigrams in the category name as seed keywords. Despite using the same learning algorithm as STM, its performance was much worse than with hand-picked seed words.

To study the contribution of the in-domain unlabeled documents to STM's superior performance, we trained an STM model with the same manually-curated keywords as Jin et al. [8] on the same Wikipedia dataset we used to train wiki2cat (denoted as STM-Dwiki). There is a noticeable decrease in performance for STM-Dwiki without in-domain unlabeled documents. It underperformed w2c-svm on the news-crawl-v2 dataset and outperformed it on the browsing dataset.

Table 1: Performance of various models on the IAB coarse-grained classification datasets. The best performance is highlighted in bold.

Model                 | news-crawl-v2     | browsing
                      | acc     ma-F1     | acc     ma-F1
kw voting             | .196    .180      | .251    .189
Dataless              | .412    .377      | .536    .392
Doc2vec               | .480    .461      | .557    .424
STM                   | .623    .607      | .794    .625
STM-Slabel            | .332    .259      | .405    .340
STM-Dwiki             | .556    .533      | .780    .595
w2c-svm               | .563    .539      | .659    .523
w2c-centroid          | .471    .426      | .675    .523
w2c-bert              | .440    .403      | .621    .482
w2c-svm-child         | .325    .289      | .340    .322
w2c-svm-descendant    | .539    .503      | .607    .481
w2c-svm-min-dist      | .533    .498      | .612    .489
w2c-svm-no-pruning    | .488    .466      | .608    .491

w2c-centroid performed slightly better than w2c-svm on the browsing dataset but worse on the news-crawl-v2 dataset. Surprisingly, BERT did not perform as well as the other two much simpler models.
We conjecture that this is because our training corpus consists of only Wikipedia articles, while the model was applied to another domain. Therefore, the contextual information that BERT captured may be irrelevant or even counterproductive. We leave a more in-depth analysis to future work and adhere to the SVM and centroid models hereafter.
We now turn our attention to the impact of different graph labeling algorithms on the final classification accuracy. We compare our graph labeling method introduced in Section 3.2 with three methods mentioned in previous work, namely labeling only immediate child pages (child), labeling all descendant pages (descendant), and assigning the label with the shortest distance (min-dist), as well as another baseline that removes the pruning step from our method (no-pruning). We use an SVM model with the same hyperparameters as w2c-svm. Their performance is shown in the last section of Table 1.

Using only the immediate child pages led to poor performance. Firstly, it limited the number of training documents: some categories have only a dozen immediate child pages. Secondly, the authors of Wikipedia often prefer to assign pages to specific categories instead of general categories; they assign a page to a general category only when it is ambiguous. Although previous work in Wikipedia content analysis advocated using the shortest distance to assign topics to articles [10,5], we did not observe a substantial improvement using the shortest distance over using all descendant pages. Our graph labeling method outperformed all baselines, including its modified version without pruning.
Table 2 presents the results on fine-grained classification. We notice a performance difference between the full and sample datasets. However, the relative performance of the various models on the two datasets remains consistent.
Table 2: Performance of various models on the IAB fine-grained classification datasets. * indicates a statistically significant improvement over the baselines with p-value < 0.05 using a single-sided t-test.

Model         | Full dataset      | Sample dataset
              | acc     ma-F1     | acc     ma-F1
kw voting     | .108    .018      | .075    .025
Dataless      | .428    .376      | .477    .462
Doc2vec       | .246    .152      | .253    .211
STM-Slabel    | .493    .370      | .533    .464
w2c-svm       | .542*   .464*     | .646*   .627*
w2c-centroid  | .548*   .451*     | .595*   .566*

A first observation is that the keyword voting baseline performed very poorly, with 7.5–10.8% accuracy. It shows that the category name itself is not enough to capture the semantics. E.g., the category “Travel > South America” does not match a document about traveling in Rio de Janeiro or Buenos Aires but will falsely match content about “South Korea” or “United States of America”.

Dataless and STM outperformed the keyword voting baseline by a large margin. However, wiki2cat is clearly the winner, outperforming these baselines by 5–10%. This demonstrates that the automatically labeled documents are helpful for the more challenging fine-grained classification task, where categories are more semantically similar and harder to specify with a handful of keywords.
5 Conclusion

We introduced wiki2cat, a simple framework to bootstrap large-scale fine-grained text classifiers from Wikipedia without having to label any document manually. The method was benchmarked on both coarse-grained and fine-grained contextual advertising datasets and achieved competitive performance against various baselines. It performed especially well on fine-grained classification, which is both more challenging and requires more manual labeling in a fully-supervised setting. As an ongoing effort, we are exploring using unlabeled in-domain documents for domain adaptation to achieve better accuracy.
References
1. Bekkerman, R., Donin, O.: Visualizing Wikipedia for interactive exploration. In: Proceedings of the KDD 2017 Workshop on Interactive Data Exploration and Analytics (IDEA'17). Halifax, Nova Scotia, Canada (2017)
2. Broder, A., Fontoura, M., Josifovski, V., Riedel, L.: A semantic approach to contextual advertising. In: Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 559–566 (2007)
3. Chang, M.W., Ratinov, L.A., Roth, D., Srikumar, V.: Importance of semantic representation: Dataless classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 2, pp. 830–835 (2008)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4171–4186. Minneapolis, Minnesota (2019)
5. Farina, J., Tasso, R., Laniado, D.: Automatically assigning Wikipedia articles to macrocategories. In: Proceedings of Hypertext (2011)
6. Gabrilovich, E., Markovitch, S.: Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 6, pp. 1301–1306 (2006)
7. Gabrilovich, E., Markovitch, S., et al.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence. vol. 7, pp. 1606–1611 (2007)
8. Jin, Y., Wanvarie, D., Le, P.T.V.: Learning from noisy out-of-domain corpus using dataless classification. Natural Language Engineering (2020)
9. Kim, K.M., Dinara, A., Choi, B.J., Lee, S.: Incorporating word embeddings into open directory project based large-scale classification. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 376–388. Springer (2018)
10. Kittur, A., Chi, E.H., Suh, B.: What's in Wikipedia? Mapping topics and conflict using socially annotated category structure. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 1509–1512 (2009)
11. Lang, K.: Newsweeder: Learning to filter netnews. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 331–339. Elsevier (1995)
12. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the International Conference on Machine Learning. pp. 1188–1196 (2014)
13. Lee, J.H., Ha, J., Jung, J.Y., Lee, S.: Semantic contextual advertising based on the open directory project. ACM Transactions on the Web (TWEB) (4), 1–22 (2013)
14. Li, C., Chen, S., Xing, J., Sun, A., Ma, Z.: Seed-guided topic model for document filtering and classification. ACM Transactions on Information Systems (1), 1–37 (2018)
15. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 1631–1642 (2013)
16. Sun, C., Rampalli, N., Yang, F., Doan, A.: Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. Proceedings of the VLDB Endowment 7