Luminoso at SemEval-2018 Task 10: Distinguishing Attributes Using Text Corpora and Relational Knowledge
Robyn Speer
Luminoso Technologies, Inc.
675 Massachusetts Avenue
Cambridge, MA 02139
[email protected]
Joanna Lowry-Duda
Luminoso Technologies, Inc.
675 Massachusetts Avenue
Cambridge, MA 02139
[email protected]
Abstract
Luminoso participated in the SemEval 2018 task on “Capturing Discriminative Attributes” with a system based on ConceptNet, an open knowledge graph focused on general knowledge. In this paper, we describe how we trained a linear classifier on a small number of semantically-informed features to achieve an F₁ score of 0.7368 on the task, close to the task’s high score of 0.75.

Introduction

Word embeddings are most effective when they learn from both unstructured text and a graph of general knowledge (Speer and Lowry-Duda, 2017). ConceptNet 5 (Speer et al., 2017) is an open-data knowledge graph that is well suited for this purpose. It is accompanied by a pre-built word embedding model known as ConceptNet Numberbatch (https://github.com/commonsense/conceptnet-numberbatch), which combines skip-gram embeddings learned from unstructured text with the relational knowledge in ConceptNet.

A straightforward application of the ConceptNet Numberbatch embeddings took first place in SemEval 2017 task 2, on semantic word similarity. For SemEval 2018, we built a system with these embeddings as a major component for a slightly more complex task.

The Capturing Discriminative Attributes task (Paperno et al., 2018) emphasizes the ability of a semantic model to recognize relevant differences between terms, not just their similarities. As the task description states, “If you can tell that americano is similar to cappuccino and espresso but you can’t tell the difference between them, you don’t know what americano is.”

The ConceptNet Numberbatch embeddings only measure the similarity of terms, and we hypothesized that we would need to represent more specific relationships. For example, the input triple “frog, snail, legs” asks us to determine whether “legs” is an attribute that distinguishes “frog” from “snail”. The answer is yes, because a frog has legs while a snail does not. This “has” relationship is one example of a specific relationship that is represented in ConceptNet.

To capture this kind of specific relationship, we built a model that infers relations between ConceptNet nodes, trained on the existing edges in ConceptNet and random negative examples. There are many models designed for this purpose; the one we decided on is based on Semantic Matching Energy (SME) (Bordes et al., 2014).

Our features consisted of direct similarity over ConceptNet Numberbatch embeddings, the relationships inferred over ConceptNet by SME, features that compose ConceptNet with other resources (WordNet and Wikipedia), and a purely corpus-based feature that looks up two-word phrases in the Google Books dataset.

We combined these features based on ConceptNet with features extracted from a few other resources in a LinearSVC classifier, using liblinear (Fan et al., 2008) via scikit-learn (Pedregosa et al., 2011). The classifier used only 15 features, of which 12 ended up with non-zero weights, from the five sources described. We aimed to avoid complexity in the classifier in order to prevent overfitting to the validation set; the power of the classifier should be in its features.

The classifier produced by this design (submitted late to the contest leaderboard) successfully avoided overfitting. It performed better on the test set than on the validation set, with a test F₁ score of 0.7368, whose margin of error overlaps with the evaluation’s reported high score of 0.75.
At evaluation time, we accidentally submitted our results on the validation data, instead of the test data, to the SemEval leaderboard. Our code had truncated the results to the length of the test data, causing us to not notice the mismatch. This erroneous submission got a very low score, of course. This paper presents the corrected test results, which we submitted to the post-evaluation CodaLab leaderboard immediately after the results appeared. We did not change the classifier or data; the change was a one-line change to our code so that it output the classifier’s predictions on the test set instead of on the validation set.

Features

In detail, these are the five sources of features we used:
ConceptNet vector similarity.
Given the triple (term₁, term₂, att), we look up the ConceptNet Numberbatch embeddings for the root words of the three terms (with root words determined using ConceptNet’s built-in lemmatizer). We determine the cosine similarity of (term₁, att) and the cosine similarity of (term₂, att). We then subtract the square roots of the similarity scores (floored at 0). If this difference is large enough, it indicates a positive example: a discriminative attribute that applies to term₁ and not to term₂.
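As a rough illustration, here is a minimal sketch of this feature in Python, assuming an `embeddings` dictionary that maps a root word to its Numberbatch vector; lemmatization is left out.

```python
import numpy as np

def vector_similarity_feature(embeddings, term1, term2, att):
    """Sketch of the vector-similarity feature: the difference between the
    square-rooted, floored cosine similarities of each term with the attribute."""
    def sim(a, b):
        va, vb = embeddings[a], embeddings[b]
        cosine = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        return np.sqrt(max(cosine, 0.0))  # floor at 0 before the square root
    return sim(term1, att) - sim(term2, att)
```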
ConceptNet relational inference.

We train a Semantic Matching Energy model to represent ConceptNet nodes and relations as vectors, along with a 3-tensor of interactions between them. This model can then assign a confidence score to any triple (a relation connecting two terms). We used this model to infer values for each of 11 different ConceptNet relations. As in the case of vector similarity, each feature value is the difference between the value inferred for rel(term₁, att) and rel(term₂, att). This model is described in more detail in the next section.
Wikipedia lead sections.

This feature expands on ConceptNet vector similarity: instead of computing the similarity between the attribute and the term, it computes the maximum of the similarity between the attribute and any word that appears in the lead section of the Wikipedia article for the term (Wikipedia, 2017). This helps to identify attributes that would be used to define the term, such as “amphibian” as an attribute for “frog”.
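A minimal sketch of this feature, assuming a hypothetical `lead_section_words` lookup that returns the words of a term’s Wikipedia lead section (extracted from a dump) and the `embeddings` table from above; we combine the two terms’ scores as a difference, as with the other similarity features.

```python
import numpy as np

def cosine(va, vb):
    return np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))

def wikipedia_lead_feature(embeddings, lead_section_words, term1, term2, att):
    """Sketch: the maximum similarity between the attribute and any word in
    the term's Wikipedia lead section, differenced across the two terms."""
    att_vec = embeddings[att]
    def max_sim(term):
        words = lead_section_words(term)  # hypothetical lookup into a dump
        sims = [cosine(embeddings[w], att_vec) for w in words if w in embeddings]
        return max(sims, default=0.0)
    return max_sim(term1) - max_sim(term2)
```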
WordNet entries.
This feature is similar to the “Wikipedia lead sections” feature. It expands each term by looking up its synonyms in WordNet (Miller et al., 1998), the synonyms in synsets it is connected to, and the words in its gloss (definition), and taking the maximum similarity of the attribute to any of these terms.
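For illustration, here is one way to build the expansion set with NLTK’s WordNet interface (the choice of library and of which connected synsets to include are assumptions on our part). The maximum similarity of the attribute to this set would then be computed as in the Wikipedia sketch above.

```python
from nltk.corpus import wordnet as wn

def wordnet_expansion(term):
    """Sketch: collect a term's synonyms, the synonyms in synsets connected
    to it (here, hypernyms and hyponyms), and the words of its glosses."""
    words = set()
    for synset in wn.synsets(term):
        words.update(lemma.name() for lemma in synset.lemmas())  # synonyms
        for related in synset.hypernyms() + synset.hyponyms():   # connected synsets
            words.update(lemma.name() for lemma in related.lemmas())
        words.update(synset.definition().split())                # gloss words
    return words
```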
Google Books 2-grams.
This feature determines whether term₁ forms a significant two-word phrase with att, more than term₂ does, based on the Google Books English Fiction data (Lin et al., 2012). The “significance” (s) of a two-word phrase is determined by comparing the smoothed log-likelihood of the individual unigrams to the smoothed log-likelihood of the phrase:

    s(term, att) = 10 + log(#(term, att) + 1) − log((#(term) + K)(#(att) + K))

where # represents the number of occurrences of a unigram or bigram in the corpus and K is a power-of-ten constant that smooths the unigram counts.
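A sketch of this score in Python, assuming a `count` dictionary of unigram and bigram occurrence counts from the Google Books data; the smoothing constant K is a stand-in value, since we only state that it is a power of ten.

```python
import math

K = 10 ** 5  # hypothetical smoothing constant (a power of ten)

def significance(count, term, att):
    """Smoothed log-likelihood comparison of the bigram to its unigrams."""
    bigram_count = count.get(term + " " + att, 0)
    return (10 + math.log(bigram_count + 1)
            - math.log((count.get(term, 0) + K) * (count.get(att, 0) + K)))

def ngram_feature(count, term1, term2, att):
    # The feature compares how much more significant (term1, att) is
    # as a phrase than (term2, att).
    return significance(count, term1, att) - significance(count, term2, att)
```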
The “ConceptNet relational inference” feature provides 11 entries to the feature vectors, while the other sources each provide one. In total, there are 15 features that represent each input triple.

Across multiple data sources, we use the square root of cosine similarity to measure the strength of the match between a term and an attribute. Because attributes should be at least somewhat related to the terms they describe, and because weak semantic similarity can be interpreted as relatedness, the square root helps us emphasize the important part of the scale. The difference between “somewhat related” and “not related” is more important to the task than the difference between “very similar” and “somewhat related”, as a discriminative attribute should ideally be unrelated to the second term.

The Relational Inference Model

To infer truth values for ConceptNet relations, we use a variant of the Semantic Matching Energy model (Bordes et al., 2014), adapted to work well on ConceptNet’s vocabulary of relations. Instead of embedding relations in the same space as the terms, this model assigns new 10-dimensional embeddings to ConceptNet relations, yielding a compact model for ConceptNet’s relatively small set of relations.

The model is trained to distinguish positive examples of ConceptNet edges from negative ones. The positive examples are edges directly contained in ConceptNet, or those that are entailed by changing the relation to a more general one or switching the directionality of a symmetric relation. The negative examples come from replacing one of the terms with a random other term, the relation with a random unentailed relation, or switching the directionality of an asymmetric relation.

We trained this model for approximately 3 million iterations (about 4 days of computation on an NVIDIA Titan Xp) using PyTorch (Paszke et al., 2017). The code of the model is available at https://github.com/LuminosoInsight/conceptnet-sme.

To extract features for the discriminative attribute task, we focus on a subset of ConceptNet relations that would plausibly be used as attributes: RelatedTo, IsA, HasA, PartOf, CapableOf, UsedFor, HasContext, HasProperty, and AtLocation.

For most of these relations, the first argument is the term, and the second argument is the attribute. We use two additional features for PartOf and AtLocation with their arguments swapped, so that the attribute is the first argument. The generic relation RelatedTo, unlike the others, is intended to be symmetric, so we add its value to the value of its swapped version and use it as a single feature.
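Putting this together, the 11 relational features might be assembled as in the sketch below; `sme_score` is a hypothetical stand-in for the trained SME model’s confidence in a triple, not the actual interface of the conceptnet-sme code.

```python
RELATIONS = ["RelatedTo", "IsA", "HasA", "PartOf", "CapableOf",
             "UsedFor", "HasContext", "HasProperty", "AtLocation"]
SWAPPED = ["PartOf", "AtLocation"]  # also scored with the attribute first

def sme_features(sme_score, term1, term2, att):
    """Sketch: one feature per relation (the difference between the two terms),
    plus two swapped-argument features, with RelatedTo folded symmetrically."""
    features = []
    for rel in RELATIONS:
        diff = sme_score(rel, term1, att) - sme_score(rel, term2, att)
        if rel == "RelatedTo":
            # RelatedTo is symmetric, so add the swapped direction to its
            # value instead of making it a separate feature.
            diff += sme_score(rel, att, term1) - sme_score(rel, att, term2)
        features.append(diff)
    for rel in SWAPPED:
        features.append(sme_score(rel, att, term1) - sme_score(rel, att, term2))
    return features  # 9 + 2 = 11 feature values
```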
The Classifier

The classifier that we use to make a decision based on these features is scikit-learn’s LinearSVC, using the default parameters in scikit-learn 0.19.1. (Other models and parameters that we tried are discussed below.) This classifier makes effective use of the features while being simple enough to avoid some amount of overfitting.

One aspect of the classifier that made a noticeable difference was the scaling of the features. We tried L₁- and L₂-normalizing the columns of the input matrix, representing the values of each feature, and decided on L₂ normalization.

We took advantage of the design of our features and the asymmetry of the task as a way to further mitigate overfitting. All of the features were designed to identify a property that term₁ has and term₂ does not, as is the case for the discriminative examples, so they should all make a non-negative contribution to a feature being discriminative. We can inspect the coefficients of the features in the SVC’s decision boundary. If any feature gets a negative weight, it is likely a spurious result from overfitting to the training data. So, after training the classifier, we clip the coefficients of the decision boundary, setting all negative coefficients to zero.

If we were to remove these features and re-train, or require non-negative coefficients as a constraint on the classifier, then other features would inherently become responsible for overfitting. By neutralizing the features after training, we keep the features that are working well as they are, and remove a part of the model that appears to purely represent overfitting. Indeed, clipping the negative coefficients in this way increased our performance on the validation set.
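A minimal sketch of this training procedure; the feature matrix and labels here are illustrative random stand-ins, and the L₂ column normalization reflects the choice described above.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Illustrative stand-ins for the real 15-column feature matrix and labels.
rng = np.random.RandomState(0)
X_train = rng.rand(100, 15)
y_train = rng.randint(0, 2, 100)

# L2-normalize each feature column, then fit the SVC with default parameters.
X_scaled = normalize(X_train, norm="l2", axis=0)
clf = LinearSVC().fit(X_scaled, y_train)

# After training, clip negative coefficients of the decision boundary to zero;
# negative weights on these features are treated as overfitting artifacts.
clf.coef_ = np.maximum(clf.coef_, 0.0)
```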
Table 1 shows the coefficients assigned to each feature based on the training data.

Feature                         Coefficient
ConceptNet vector similarity    13.82
SME: RelatedTo                  14.01
SME: (x IsA a)                   2.13
SME: (x HasA a)                  0.00
SME: (x PartOf a)                0.56
SME: (x CapableOf a)             3.72
SME: (x UsedFor a)               0.92
SME: (x HasContext a)            0.88
SME: (x HasProperty a)           0.00
SME: (x AtLocation a)            0.00
SME: (a PartOf x)                3.22
SME: (a AtLocation x)            0.69
Wikipedia lead sections         12.46
WordNet relatedness             13.95
Google Ngrams                   28.82

Table 1: Coefficients of each feature in our linear classifier. x represents a term and a represents the attribute.

There are other features that we tried and later discarded. We experimented with a feature similar to the Google Books 2-grams feature, based on the AOL query logs dataset (Pass et al., 2006). It did not add to the performance, most likely because any information it could provide was also provided by Google Books 2-grams. Similarly, we tried extending the Google Books 2-grams data to include the first and third words of a selection of 3-grams, but this, too, appeared redundant with the 2-grams.

We also experimented with a feature based on bounding box annotations available in the OpenImages dataset (Krasin et al., 2017). We hoped it would help us capture attributes such as colors, materials, and shapes. While this feature did not improve the classifier’s performance on the validation set, it did slightly improve the performance on the test set.

Before deciding on scikit-learn’s LinearSVC, we experimented with a number of other classifiers. These included random forests, differentiable models made of multiple ReLU and sigmoid layers, and SVMs with an RBF kernel or a polynomial kernel.

We also experimented with different parameters to LinearSVC, such as changing the default value of the penalty parameter C of the error term, changing the penalty from L₂ to L₁, solving the primal optimization problem instead of the dual problem, and changing the loss from squared hinge to hinge. These changes either led to lower performance or had no significant effect, so in the end we used LinearSVC with the default parameters for scikit-learn version 0.19.1.
Results

When trained on the training set, the classifier we describe achieved an F₁ score of 0.7617 on the training set, 0.7281 on the validation set, and 0.7368 on the test set. Table 2 shows these scores along with their standard error of the mean, supposing that these data sets were randomly sampled from larger sets.

Dataset       F₁       Error (SEM)
train         .7617    ± .0032
validation    .7281    ± .0085
test          .7368    ± .0091

Table 2: F₁ scores by dataset. The reported F₁ score is the arithmetic mean of the F₁ scores for both classes.
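For clarity, the reported metric corresponds to what scikit-learn calls macro-averaged F₁, as this sketch shows with illustrative labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]  # gold labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]  # predictions (illustrative)

# The arithmetic mean of the per-class F1 scores is macro-averaged F1.
print(f1_score(y_true, y_pred, average="macro"))
```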
Ablation Analysis

We performed an ablation analysis to see what the contribution of each of our five sources of features was. We evaluated classifiers that used all non-empty subsets of these sources. Figure 1 plots the results of these 31 classifiers when evaluated on the validation set and the test set.

[Figure 1: a scatter plot of validation F₁ versus test F₁ for all 31 subsets of the five feature sources, labeled A = ConceptNet vector similarity, B = ConceptNet inference using SME, C = Wikipedia lead sections, D = WordNet links and glosses, E = Google Books 2-grams. This ablation analysis shows the contributions of subsets of the five sources of features. Ellipses indicate standard error of the mean, assuming that the data is sampled from a larger, unseen set.]
It is likely that the classifier with all five sources (ABCDE) performed the best overall. It is in a statistical tie with ABDE, the classifier that omits Wikipedia as a source.

Most of the classifiers performed better on the test set than on the validation set, as shown by the dotted line. Some simple classifiers with very few features performed particularly well on the test set. One surprisingly high-performing classifier was A (ConceptNet vector similarity) alone, which gets a test F₁ score of 0.7355. This classifier is equivalent to the decision rule

    sim(term₁, att) − sim(term₂, att) > t

for a fixed threshold t, where sim(a, b) = sqrt(max(a · b / (||a|| · ||b||), 0)).

It is interesting to note that source A (ConceptNet vector similarity) appears to dominate source B (ConceptNet SME) on the test data. SME led to improvements on the validation set, but on the test set, any classifier containing AB performs equal to or worse than the same classifier with B removed. This may indicate that the SME features were the most prone to overfitting, or that the validation set generally required making more difficult distinctions than the test set.
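As a sketch, classifier A reduces to the following rule; the threshold value is not stated here, so it is left as a parameter, and `embeddings` is the Numberbatch lookup table assumed earlier.

```python
import numpy as np

def sim(embeddings, a, b):
    """sqrt(max(cos, 0)), as in the formula above."""
    va, vb = embeddings[a], embeddings[b]
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return np.sqrt(max(cos, 0.0))

def classifier_a(embeddings, term1, term2, att, threshold):
    # Predict "discriminative" when term1 is sufficiently more similar
    # to the attribute than term2 is.
    return sim(embeddings, term1, att) - sim(embeddings, term2, att) > threshold
```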
The code for our classifier is available on GitHub at https://github.com/LuminosoInsight/semeval-discriminatt, and its input data is downloadable from https://zenodo.org/record/1183358.

References

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification.

Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic annotations for the Google Books Ngram Corpus. In Proceedings of the ACL 2012 System Demonstrations, pages 169–174. Association for Computational Linguistics.

George Miller, Christiane Fellbaum, Randee Tengi, P. Wakefield, H. Langone, and B. R. Haskell. 1998. WordNet. MIT Press, Cambridge, MA.

Denis Paperno, Alessandro Lenci, and Alicia Krebs. 2018. SemEval-2018 Task 10: Capturing discriminative attributes. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, United States. Association for Computational Linguistics.

Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems, InfoScale ’06, New York, NY, USA. ACM.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, San Francisco.

Robyn Speer and Joanna Lowry-Duda. 2017. ConceptNet at SemEval-2017 task 2: Extending word embeddings with multilingual relational knowledge. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 85–89, Vancouver, Canada. Association for Computational Linguistics.

Wikipedia. 2017. Wikipedia, the free encyclopedia — English data export. (A collaborative project with thousands of authors.) Retrieved from https://dumps.wikimedia.org/enwiki/