Luminoso at SemEval-2018 Task 10: Distinguishing Attributes Using Text Corpora and Relational Knowledge
Robyn Speer
Luminoso Technologies, Inc.
675 Massachusetts Avenue
Cambridge, MA 02139
[email protected]
Joanna Lowry-Duda
Luminoso Technologies, Inc.
675 Massachusetts Avenue
Cambridge, MA 02139
[email protected]
Abstract
Luminoso participated in the SemEval 2018 task on “Capturing Discriminative Attributes” with a system based on ConceptNet, an open knowledge graph focused on general knowledge. In this paper, we describe how we trained a linear classifier on a small number of semantically-informed features to achieve an F₁ score of 0.7368 on the task, close to the task’s high score of 0.75.

Introduction

Word embeddings are most effective when they learn from both unstructured text and a graph of general knowledge (Speer and Lowry-Duda, 2017). ConceptNet 5 (Speer et al., 2017) is an open-data knowledge graph that is well suited for this purpose. It is accompanied by a pre-built word embedding model known as ConceptNet Numberbatch (https://github.com/commonsense/conceptnet-numberbatch), which combines skip-gram embeddings learned from unstructured text with the relational knowledge in ConceptNet.

A straightforward application of the ConceptNet Numberbatch embeddings took first place in SemEval 2017 task 2, on semantic word similarity. For SemEval 2018, we built a system with these embeddings as a major component for a slightly more complex task.

The Capturing Discriminative Attributes task (Paperno et al., 2018) emphasizes the ability of a semantic model to recognize relevant differences between terms, not just their similarities. As the task description states, “If you can tell that americano is similar to cappuccino and espresso but you can’t tell the difference between them, you don’t know what americano is.”

The ConceptNet Numberbatch embeddings only measure the similarity of terms, and we hypothesized that we would need to represent more specific relationships. For example, the input triple “frog, snail, legs” asks us to determine whether “legs” is an attribute that distinguishes “frog” from “snail”. The answer is yes, because a frog has legs while a snail does not. This “has” relationship is one example of a specific relationship that is represented in ConceptNet.

To capture this kind of specific relationship, we built a model that infers relations between ConceptNet nodes, trained on the existing edges in ConceptNet and random negative examples. There are many models designed for this purpose; the one we decided on is based on Semantic Matching Energy (SME) (Bordes et al., 2014).

Our features consisted of direct similarity over ConceptNet Numberbatch embeddings, the relationships inferred over ConceptNet by SME, features that compose ConceptNet with other resources (WordNet and Wikipedia), and a purely corpus-based feature that looks up two-word phrases in the Google Books dataset.

We combined these features based on ConceptNet with features extracted from a few other resources in a LinearSVC classifier, using liblinear (Fan et al., 2008) via scikit-learn (Pedregosa et al., 2011). The classifier used only 15 features, of which 12 ended up with non-zero weights, from the five sources described. We aimed to avoid complexity in the classifier in order to prevent overfitting to the validation set; the power of the classifier should be in its features.

The classifier produced by this design (submitted late to the contest leaderboard) successfully avoided overfitting. It performed better on the test set than on the validation set, with a test F₁ score of 0.7368, whose margin of error overlaps with the evaluation’s reported high score of 0.75.
At evaluation time, we accidentally submitted our results on the validation data, instead of the test data, to the SemEval leaderboard. Our code had truncated the results to the length of the test data, causing us to not notice the mismatch. This erroneous submission got a very low score, of course. This paper presents the corrected test results, which we submitted to the post-evaluation CodaLab leaderboard immediately after the results appeared. We did not change the classifier or data; the change was a one-line change to our code so that it output the classifier’s predictions on the test set instead of on the validation set.

Features

In detail, these are the five sources of features we used:
ConceptNet vector similarity.
Given the triple (term₁, term₂, att), we look up the ConceptNet Numberbatch embeddings for the root words of the three terms (with root words determined using ConceptNet’s built-in lemmatizer). We determine the cosine similarity of (term₁, att) and the cosine similarity of (term₂, att). We then subtract the square roots of the similarity scores (floored at 0). If this difference is large enough, it indicates a positive example: a discriminative attribute that applies to term₁ and not to term₂.
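As a rough illustration, here is a minimal sketch of this feature in Python, assuming an `embeddings` dictionary that maps a root word to its Numberbatch vector; lemmatization is left out.

```python
import numpy as np

def vector_similarity_feature(embeddings, term1, term2, att):
    """Sketch of the vector-similarity feature: the difference between the
    square-rooted, floored cosine similarities of each term with the attribute."""
    def sim(a, b):
        va, vb = embeddings[a], embeddings[b]
        cosine = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        return np.sqrt(max(cosine, 0.0))  # floor at 0 before the square root
    return sim(term1, att) - sim(term2, att)
```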
ConceptNet relational inference.

We train a Semantic Matching Energy model to represent ConceptNet nodes and relations as vectors, along with a 3-tensor of interactions between them. This model can then assign a confidence score to any triple (a relation connecting two terms). We used this model to infer values for each of 11 different ConceptNet relations. As in the case of vector similarity, each feature value is the difference between the value inferred for rel(term₁, att) and rel(term₂, att). This model is described in more detail in the next section.
Wikipedia lead sections.

This feature expands on ConceptNet vector similarity: instead of computing the similarity between the attribute and the term, it computes the maximum of the similarity between the attribute and any word that appears in the lead section of the Wikipedia article for the term (Wikipedia, 2017). This helps to identify attributes that would be used to define the term, such as “amphibian” as an attribute for “frog”.
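A minimal sketch of this feature, assuming a hypothetical `lead_section_words` lookup that returns the words of a term’s Wikipedia lead section (extracted from a dump) and the `embeddings` table from above; we combine the two terms’ scores as a difference, as with the other similarity features.

```python
import numpy as np

def cosine(va, vb):
    return np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))

def wikipedia_lead_feature(embeddings, lead_section_words, term1, term2, att):
    """Sketch: the maximum similarity between the attribute and any word in
    the term's Wikipedia lead section, differenced across the two terms."""
    att_vec = embeddings[att]
    def max_sim(term):
        words = lead_section_words(term)  # hypothetical lookup into a dump
        sims = [cosine(embeddings[w], att_vec) for w in words if w in embeddings]
        return max(sims, default=0.0)
    return max_sim(term1) - max_sim(term2)
```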
WordNet entries.
This feature is similar to the “Wikipedia lead sections” feature. It expands each term by looking up its synonyms in WordNet (Miller et al., 1998), the synonyms in synsets it is connected to, and the words in its gloss (definition), and taking the maximum similarity of the attribute to any of these terms.
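For illustration, here is one way to build the expansion set with NLTK’s WordNet interface (the choice of library and of which connected synsets to include are assumptions on our part). The maximum similarity of the attribute to this set would then be computed as in the Wikipedia sketch above.

```python
from nltk.corpus import wordnet as wn

def wordnet_expansion(term):
    """Sketch: collect a term's synonyms, the synonyms in synsets connected
    to it (here, hypernyms and hyponyms), and the words of its glosses."""
    words = set()
    for synset in wn.synsets(term):
        words.update(lemma.name() for lemma in synset.lemmas())  # synonyms
        for related in synset.hypernyms() + synset.hyponyms():   # connected synsets
            words.update(lemma.name() for lemma in related.lemmas())
        words.update(synset.definition().split())                # gloss words
    return words
```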
Google Books 2-grams.
This feature determines whether term₁ forms a significant two-word phrase with att, more than term₂ does, based on the Google Books English Fiction data (Lin et al., 2012). The “significance” (s) of a two-word phrase is determined by comparing the smoothed log-likelihood of the individual unigrams to the smoothed log-likelihood of the phrase:

    s(term, att) = 10 + log(#(term, att) + 1) − log((#(term) + K)(#(att) + K))

where # represents the number of occurrences of a unigram or bigram in the corpus and K is a power-of-ten constant that smooths the unigram counts.
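A sketch of this score in Python, assuming a `count` dictionary of unigram and bigram occurrence counts from the Google Books data; the smoothing constant K is a stand-in value, since we only state that it is a power of ten.

```python
import math

K = 10 ** 5  # hypothetical smoothing constant (a power of ten)

def significance(count, term, att):
    """Smoothed log-likelihood comparison of the bigram to its unigrams."""
    bigram_count = count.get(term + " " + att, 0)
    return (10 + math.log(bigram_count + 1)
            - math.log((count.get(term, 0) + K) * (count.get(att, 0) + K)))

def ngram_feature(count, term1, term2, att):
    # The feature compares how much more significant (term1, att) is
    # as a phrase than (term2, att).
    return significance(count, term1, att) - significance(count, term2, att)
```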
The “ConceptNet relational inference” feature provides 11 entries to the feature vectors, while the other sources each provide one. In total, there are 15 features that represent each input triple.

Across multiple data sources, we use the square root of cosine similarity to measure the strength of the match between a term and an attribute. Because attributes should be at least somewhat related to the terms they describe, and because weak semantic similarity can be interpreted as relatedness, the square root helps us emphasize the important part of the scale. The difference between “somewhat related” and “not related” is more important to the task than the difference between “very similar” and “somewhat related”, as a discriminative attribute should ideally be unrelated to the second term.

The Relational Inference Model

To infer truth values for ConceptNet relations, we use a variant of the Semantic Matching Energy model (Bordes et al., 2014), adapted to work well on ConceptNet’s vocabulary of relations. Instead of embedding relations in the same space as the terms, this model assigns new 10-dimensional embeddings to ConceptNet relations, yielding a compact model for ConceptNet’s relatively small set of relations.

The model is trained to distinguish positive examples of ConceptNet edges from negative ones. The positive examples are edges directly contained in ConceptNet, or those that are entailed by changing the relation to a more general one or switching the directionality of a symmetric relation. The negative examples come from replacing one of the terms with a random other term, the relation with a random unentailed relation, or switching the directionality of an asymmetric relation.

We trained this model for approximately 3 million iterations (about 4 days of computation on an NVIDIA Titan Xp) using PyTorch (Paszke et al., 2017). The code of the model is available at https://github.com/LuminosoInsight/conceptnet-sme.

To extract features for the discriminative attribute task, we focus on a subset of ConceptNet relations that would plausibly be used as attributes: RelatedTo, IsA, HasA, PartOf, CapableOf, UsedFor, HasContext, HasProperty, and AtLocation.

For most of these relations, the first argument is the term, and the second argument is the attribute. We use two additional features for PartOf and AtLocation with their arguments swapped, so that the attribute is the first argument. The generic relation RelatedTo, unlike the others, is intended to be symmetric, so we add its value to the value of its swapped version and use it as a single feature.
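Putting this together, the 11 relational features might be assembled as in the sketch below; `sme_score` is a hypothetical stand-in for the trained SME model’s confidence in a triple, not the actual interface of the conceptnet-sme code.

```python
RELATIONS = ["RelatedTo", "IsA", "HasA", "PartOf", "CapableOf",
             "UsedFor", "HasContext", "HasProperty", "AtLocation"]
SWAPPED = ["PartOf", "AtLocation"]  # also scored with the attribute first

def sme_features(sme_score, term1, term2, att):
    """Sketch: one feature per relation (the difference between the two terms),
    plus two swapped-argument features, with RelatedTo folded symmetrically."""
    features = []
    for rel in RELATIONS:
        diff = sme_score(rel, term1, att) - sme_score(rel, term2, att)
        if rel == "RelatedTo":
            # RelatedTo is symmetric, so add the swapped direction to its
            # value instead of making it a separate feature.
            diff += sme_score(rel, att, term1) - sme_score(rel, att, term2)
        features.append(diff)
    for rel in SWAPPED:
        features.append(sme_score(rel, att, term1) - sme_score(rel, att, term2))
    return features  # 9 + 2 = 11 feature values
```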
The Classifier

The classifier that we use to make a decision based on these features is scikit-learn’s LinearSVC, using the default parameters in scikit-learn 0.19.1. (Other models and parameters that we tried are discussed below.) This classifier makes effective use of the features while being simple enough to avoid some amount of overfitting.

One aspect of the classifier that made a noticeable difference was the scaling of the features. We tried L₁- and L₂-normalizing the columns of the input matrix, representing the values of each feature, and decided on L₂ normalization.

We took advantage of the design of our features and the asymmetry of the task as a way to further mitigate overfitting. All of the features were designed to identify a property that term₁ has and term₂ does not, as is the case for the discriminative examples, so they should all make a non-negative contribution to a feature being discriminative. We can inspect the coefficients of the features in the SVC’s decision boundary. If any feature gets a negative weight, it is likely a spurious result from overfitting to the training data. So, after training the classifier, we clip the coefficients of the decision boundary, setting all negative coefficients to zero.

If we were to remove these features and re-train, or require non-negative coefficients as a constraint on the classifier, then other features would inherently become responsible for overfitting. By neutralizing the features after training, we keep the features that are working well as they are, and remove a part of the model that appears to purely represent overfitting. Indeed, clipping the negative coefficients in this way increased our performance on the validation set.
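A minimal sketch of this training procedure; the feature matrix and labels here are illustrative random stand-ins, and the L₂ column normalization reflects the choice described above.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Illustrative stand-ins for the real 15-column feature matrix and labels.
rng = np.random.RandomState(0)
X_train = rng.rand(100, 15)
y_train = rng.randint(0, 2, 100)

# L2-normalize each feature column, then fit the SVC with default parameters.
X_scaled = normalize(X_train, norm="l2", axis=0)
clf = LinearSVC().fit(X_scaled, y_train)

# After training, clip negative coefficients of the decision boundary to zero;
# negative weights on these features are treated as overfitting artifacts.
clf.coef_ = np.maximum(clf.coef_, 0.0)
```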
Table 1 shows the coefficients assigned to each feature based on the training data.

Feature                         Coefficient
ConceptNet vector similarity    13.82
SME: RelatedTo                  14.01
SME: (x IsA a)                   2.13
SME: (x HasA a)                  0.00
SME: (x PartOf a)                0.56
SME: (x CapableOf a)             3.72
SME: (x UsedFor a)               0.92
SME: (x HasContext a)            0.88
SME: (x HasProperty a)           0.00
SME: (x AtLocation a)            0.00
SME: (a PartOf x)                3.22
SME: (a AtLocation x)            0.69
Wikipedia lead sections         12.46
WordNet relatedness             13.95
Google Ngrams                   28.82

Table 1: Coefficients of each feature in our linear classifier. x represents a term and a represents the attribute.

There are other features that we tried and later discarded. We experimented with a feature similar to the Google Books 2-grams feature, based on the AOL query logs dataset (Pass et al., 2006). It did not add to the performance, most likely because any information it could provide was also provided by Google Books 2-grams. Similarly, we tried extending the Google Books 2-grams data to include the first and third words of a selection of 3-grams, but this, too, appeared redundant with the 2-grams.

We also experimented with a feature based on bounding box annotations available in the OpenImages dataset (Krasin et al., 2017). We hoped it would help us capture attributes such as colors, materials, and shapes. While this feature did not improve the classifier’s performance on the validation set, it did slightly improve the performance on the test set.

Before deciding on scikit-learn’s LinearSVC, we experimented with a number of other classifiers. These included random forests, differentiable models made of multiple ReLU and sigmoid layers, and SVMs with an RBF kernel or a polynomial kernel.

We also experimented with different parameters to LinearSVC, such as changing the default value of the penalty parameter C of the error term, changing the penalty from L₂ to L₁, solving the primal optimization problem instead of the dual problem, and changing the loss from squared hinge to hinge. These changes either led to lower performance or had no significant effect, so in the end we used LinearSVC with the default parameters for scikit-learn version 0.19.1.
Results

When trained on the training set, the classifier we describe achieved an F₁ score of 0.7617 on the training set, 0.7281 on the validation set, and 0.7368 on the test set. Table 2 shows these scores along with their standard error of the mean, supposing that these data sets were randomly sampled from larger sets.

Dataset       F₁       Error (SEM)
train         .7617    ± .0032
validation    .7281    ± .0085
test          .7368    ± .0091

Table 2: F₁ scores by dataset. The reported F₁ score is the arithmetic mean of the F₁ scores for both classes.
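For clarity, the reported metric corresponds to what scikit-learn calls macro-averaged F₁, as this sketch shows with illustrative labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]  # gold labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1]  # predictions (illustrative)

# The arithmetic mean of the per-class F1 scores is macro-averaged F1.
print(f1_score(y_true, y_pred, average="macro"))
```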
Ablation Analysis

We performed an ablation analysis to see what the contribution of each of our five sources of features was. We evaluated classifiers that used all non-empty subsets of these sources. Figure 1 plots the results of these 31 classifiers when evaluated on the validation set and the test set.

[Figure 1: a scatter plot of validation F₁ versus test F₁ for all 31 subsets of the five feature sources, labeled A = ConceptNet vector similarity, B = ConceptNet inference using SME, C = Wikipedia lead sections, D = WordNet links and glosses, E = Google Books 2-grams. This ablation analysis shows the contributions of subsets of the five sources of features. Ellipses indicate standard error of the mean, assuming that the data is sampled from a larger, unseen set.]
It is likely that the classifier with all five sources (ABCDE) performed the best overall. It is in a statistical tie with ABDE, the classifier that omits Wikipedia as a source.

Most of the classifiers performed better on the test set than on the validation set, as shown by the dotted line. Some simple classifiers with very few features performed particularly well on the test set. One surprisingly high-performing classifier was A (ConceptNet vector similarity) alone, which gets a test F₁ score of 0.7355. This classifier is equivalent to the decision rule

    sim(term₁, att) − sim(term₂, att) > t

for a fixed threshold t, where sim(a, b) = sqrt(max(a · b / (||a|| · ||b||), 0)).

It is interesting to note that source A (ConceptNet vector similarity) appears to dominate source B (ConceptNet SME) on the test data. SME led to improvements on the validation set, but on the test set, any classifier containing AB performs equal to or worse than the same classifier with B removed. This may indicate that the SME features were the most prone to overfitting, or that the validation set generally required making more difficult distinctions than the test set.
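As a sketch, classifier A reduces to the following rule; the threshold value is not stated here, so it is left as a parameter, and `embeddings` is the Numberbatch lookup table assumed earlier.

```python
import numpy as np

def sim(embeddings, a, b):
    """sqrt(max(cos, 0)), as in the formula above."""
    va, vb = embeddings[a], embeddings[b]
    cos = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    return np.sqrt(max(cos, 0.0))

def classifier_a(embeddings, term1, term2, att, threshold):
    # Predict "discriminative" when term1 is sufficiently more similar
    # to the attribute than term2 is.
    return sim(embeddings, term1, att) - sim(embeddings, term2, att) > threshold
```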
The code for our classifier is available on GitHub at https://github.com/LuminosoInsight/semeval-discriminatt, and its input data is downloadable from https://zenodo.org/record/1183358.

References

Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. 2014. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.

Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. 2017. OpenImages: A public dataset for large-scale multi-label and multi-class image classification.

Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic annotations for the Google Books Ngram Corpus. In Proceedings of the ACL 2012 System Demonstrations, pages 169–174. Association for Computational Linguistics.

George Miller, Christiane Fellbaum, Randee Tengi, P. Wakefield, H. Langone, and B. R. Haskell. 1998. WordNet. MIT Press, Cambridge, MA.

Denis Paperno, Alessandro Lenci, and Alicia Krebs. 2018. SemEval-2018 Task 10: Capturing discriminative attributes. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, LA, United States. Association for Computational Linguistics.

Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In Proceedings of the 1st International Conference on Scalable Information Systems, InfoScale ’06, New York, NY, USA. ACM.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, San Francisco.

Robyn Speer and Joanna Lowry-Duda. 2017. ConceptNet at SemEval-2017 task 2: Extending word embeddings with multilingual relational knowledge. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 85–89, Vancouver, Canada. Association for Computational Linguistics.

Wikipedia. 2017. Wikipedia, the free encyclopedia — English data export. (A collaborative project with thousands of authors.) Retrieved from https://dumps.wikimedia.org/enwiki/