Contextual Compositionality Detection with External Knowledge Bases and Word Embeddings
Dongsheng Wang
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
[email protected]

Qiuchi Li
Department of Information Engineering, University of Padua, Padova, Italy
[email protected]

Lucas Chaves Lima
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
[email protected]

Jakob Grue Simonsen
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
[email protected]

Christina Lioma
Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
[email protected]
ABSTRACT
When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog), that phrase is said to be non-compositional. Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines [9]; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the viewpoint that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase "green card" is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase, and allows detecting its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality, manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality. https://github.com/dswang2011/ImprovedRankedList/tree/master/input

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license.
Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. WWW '19 Companion, May 13–17, 2019, San Francisco, CA, USA. © 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-6675-5/19/05. https://doi.org/10.1145/3308560.3316584
CCS CONCEPTS
• Information systems → Data encoding and canonicalization.

KEYWORDS
Compositionality detection; Knowledge base; Word embedding
ACM Reference Format:
Dongsheng Wang, Qiuchi Li, Lucas Chaves Lima, Jakob Grue Simonsen, and Christina Lioma. 2019. Contextual Compositionality Detection with External Knowledge Bases and Word Embeddings. In Companion Proceedings of the 2019 World Wide Web Conference (WWW '19 Companion), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3308560.3316584
1 INTRODUCTION

Automatic compositionality detection refers to the automatic assessment of the extent to which the meaning of a multi-word phrase is decomposable into the meanings of its constituent words and their combination. For example, while brown dog is a fully compositional phrase meaning a dog of brown color, hot dog is a non-compositional phrase denoting a type of food. Compositionality plays a vital role in word embeddings, because a non-decomposable phrase should, in principle, be treated as a single word instead of a bag of words (BOW) in word embedding approaches.

A typical line of research in automatic compositionality detection is to "perturb" the input phrase by replacing one of its constituent words at a time with its synonym, and then to measure the semantic distance between the original phrase and the perturbed phrase set [8]. The larger this distance, the less compositional the original phrase. For instance, hot dog would be perturbed to warm dog and hot canine. The semantic distance between the original phrase and its two perturbations is high, indicating that they denote different concepts; hence hot dog is non-compositional. However, the phrase brown dog would be perturbed to hazel dog and brown canine, which have a shorter semantic distance to brown dog, indicating that it is compositional.

In this paper, we posit that the compositionality of a phrase is not dichotomous or deterministic, but instead varies across scenarios. For instance, heavy metal could refer to a dense metal that is toxic, which is compositional, but it could also be non-compositional when it refers to a genre of music. Previous work acknowledges this property of compositionality theoretically [8], but no operational models implementing it have been presented to this day. Given a multi-word phrase as input, we reason that the phrase is used in some narrative, e.g., a query, sentence, snippet, document, etc.
We refer to this narrative as the usage scenario of the phrase. We combine evidence extracted from this usage scenario of the phrase with the global context (frequently co-occurring terms) of the phrase, and use this to enrich the word embedding representation of the phrase. We linearly combine the weights of the tokens obtained from the usage scenario and the global context. We further extend this representation with information extracted from external knowledge bases. We evaluate our model on a large dataset of phrases that are labeled with five degrees of compositionality under various usage scenarios. We find that our model outperforms state-of-the-art baselines notably on identifying phrase compositionality. Our contributions are as follows:
• A novel model that detects phrase compositionality under different contexts and outperforms the state-of-the-art performance in the area.
• A benchmarking dataset of contextualized compositionality detection, which we make publicly available to the community.
2 RELATED WORK

Compositionality detection mainly focuses on calculating the semantic distance or similarity between a given phrase and its component words or its perturbations, using a corpus or dictionary. Earlier approaches mostly estimate the similarity between the original phrase and its component words. For example, Baldwin et al. [1] and Katz and Giesbrecht [6] employ Latent Semantic Analysis (LSA) to calculate the semantic similarity (and hence to measure compositionality). Venkatapathy and Joshi [16] extended this by adding collocation features, e.g., phrase frequency and point-wise mutual information, extracted from the British National Corpus.

More recent work estimates the similarity between a phrase and perturbed versions of that phrase, in which the words are replaced, one at a time, by their synonyms. For instance, Kiela and Clark [7] compute the semantic distance between a phrase and its perturbations using cosine similarity, measuring a phrase weight by pointwise multiplication of the vectors of its terms. Lioma et al. [10] calculate the semantic distance with Kullback-Leibler divergence based on a language model; and, in subsequent work, Lioma et al. [8] represent the original phrase and its perturbations as ranked lists, and measure their correlation or distance.

A promising line of work uses word embeddings and deep artificial neural networks for compositionality detection. Salehi [15] employs the word-based skip-gram model for learning non-compositional phrases, treating phrases as individual tokens with vectorial composition functions. Hashimoto and Tsuruoka [5] adopt syntactic features, including the word index, frequency, and PMI of a phrase and its component words, to learn the embeddings. Yazdani et al. [17] utilize a polynomial projection function and deep artificial neural networks to learn the semantic composition and detect non-compositional phrases as those that stand out as outliers, assuming that the majority are compositional.

Closer to our work, Salehi et al. [14] use Wiktionary, utilizing its definitions, synonyms, and translations to detect non-compositional components. Specifically, they analyze the lexical overlap between the definition of a phrase and its component words to measure compositionality. They assume that multi-word phrases are included in Wiktionary, although there is no guarantee of perfect dictionary coverage. Unlike this approach, we use Wiktionary together with DBpedia as a structured knowledge base to represent the contextual semantics of phrases. To our knowledge, no prior work has operationalized the compositionality of a phrase as contextual.
3 METHOD

Given an input phrase p and its accompanying usage scenario s, the aim is to compute the compositionality score Score(p) of phrase p with respect to usage scenario s. We follow the substitution-based line of work [7], which (a) generates perturbations of the input phrase p by substituting one word at a time with its synonym, (b) builds a semantic representation (a vector of its co-occurring terms) separately for the input phrase p and each perturbed phrase, and (c) uses the distance between the vectors of the input phrase and its perturbations to approximate the compositionality of the input phrase: the higher the distance, the less compositional the input phrase. This substitution-based line of work does not accommodate the usage scenario of the input phrase or its perturbations. The vectors of co-occurring terms are computed on one corpus, and hence these vectors represent the global distributional semantics of the input phrase and its perturbations. We extend this line of work by incorporating the local usage scenario of the phrase and its perturbations. We furthermore enrich these representations using external knowledge bases (KBs). We describe this next. Figure 1 shows how the phrase and scenario are fed into the external corpus and knowledge base in a sequential manner in the architecture, which we refer to as the contextual representation model (CRM).
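The perturbation step (a) can be sketched as follows — a minimal illustration assuming a hypothetical in-memory synonym table (the paper draws synonyms from WordNet):

```python
from typing import Dict, List

def perturb(phrase: List[str], synonyms: Dict[str, List[str]]) -> List[List[str]]:
    """Generate perturbed phrases: replace one word at a time with a synonym,
    keeping the other l-1 words of the phrase unchanged."""
    perturbed = []
    for i, word in enumerate(phrase):
        for syn in synonyms.get(word, []):
            perturbed.append(phrase[:i] + [syn] + phrase[i + 1:])
    return perturbed

# Hypothetical synonym table for illustration only:
syns = {"hot": ["warm"], "dog": ["canine"]}
print(perturb(["hot", "dog"], syns))  # [['warm', 'dog'], ['hot', 'canine']]
```

Each perturbation differs from the input phrase in exactly one word, matching the hot dog → warm dog / hot canine example above.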
Global Phrase Context.
In Natural Language Processing (NLP), the distributional semantics of an input word are computed by fixing a natural number n and, for each occurrence of the word in some corpus, finding the n words occurring immediately before and the n words occurring immediately after that occurrence (called a context window). If there is a total of N context windows for a word, its distributional semantics in vector form can be calculated using all these N windows. Because this is a global representation of the word's distributional semantics across the whole corpus, the vector is called a globalized vector. A general word embedding (e.g., word2vec) is comparable to such a global context. Concretized representations of this globalized vector can be calculated with, e.g., ranked lists or word embeddings, as described in Section 4.1.

Figure 1: The diagram for the CRM sequential framework.
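The context-window extraction described above can be sketched as follows — a simplified in-memory version (extraction over a ClueWeb-scale corpus would stream the text instead):

```python
from typing import List

def context_windows(tokens: List[str], phrase: List[str], n: int = 20) -> List[List[str]]:
    """Collect, for every occurrence of `phrase` in `tokens`, the n words
    immediately before and the n words immediately after that occurrence."""
    l = len(phrase)
    windows = []
    for i in range(len(tokens) - l + 1):
        if tokens[i:i + l] == phrase:
            windows.append(tokens[max(0, i - n):i] + tokens[i + l:i + l + n])
    return windows

tokens = "we ate a hot dog at the game".split()
print(context_windows(tokens, ["hot", "dog"], n=2))  # [['ate', 'a', 'at', 'the']]
```

The full set of N windows returned here is what the globalized vector is built from.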
Local Phrase Context. We aim to incorporate a representation of the local usage scenario of the input phrase (local phrase context) into the above-described global phrase context. This representation will not enter the vector directly, because usage scenarios are typically extremely short (in terms of words), which may strongly bias the contextual representation of the phrase. Therefore, we rank all the global context windows of the phrase according to their similarity to the usage scenario of the phrase, and we select the top K context windows most similar to the usage scenario. These top K context windows are used to build the local usage scenario context representation of the phrase. Then, we linearly combine the global representation of the phrase (i.e., taking all N context windows) and the local usage scenario representation of the phrase (i.e., the top K context windows) to acquire the localized phrase context. The ranking score is the similarity between the usage scenario s and a window W_i, i.e., sim_i = similarity(W_i, s) ∈ [0, 1]. The details of how the similarity score is computed are introduced in Section 4.

In the above, the value of K is determined by the length of the usage scenario as follows:

K = max( N / 2^length(s), M )    (1)

where N is the total number of context windows that contain phrase p in the corpus; length(s) is the number of words in usage scenario s, excluding the original phrase; and M is a threshold. We explain these next. We posit that 2^length(s) indicates the degree of shrinking: the longer the usage scenario, the smaller the number of windows retained after shrinking. The reasoning is that a longer usage scenario carries more semantics, so we should be able to locate fewer, more specific windows. For instance, if the usage scenario is empty, with length(s) = 0, then all N windows of p are returned and no shrinking is performed; if the usage scenario has three words, with 2^length(s) = 8, only the top 1/8 of all windows is collected (K = N/8). Note that K depends on the usage scenario length and is not fixed as a threshold on the similarity values. Since the similarity values may vary drastically within [0, 1] for different usage scenarios, we argue that our method is more robust to such variations. Furthermore, an empirical threshold M guarantees that at least M windows are selected. To this end, the localized usage scenario context of a phrase is given by:

C(p, s) = α · (1/N) Σ_{i=1..N} R(W_i) + (1 − α) · (1/K) Σ_{j=1..K} R(W_j)    (2)

where α is a weight parameter between 0 and 1 indicating the weight of the global vector, and the remaining 1 − α corresponds to the contribution of the localized vector; R denotes the semantic representation of a window W. In this paper, we represent a phrase as a ranked list of words and as a word embedding, so the same symbol R is adopted for both. The approach to calculate R is described in Section 4.1.

We enrich the global and local context representations of the input phrase with information extracted from external knowledge bases. We reason that the corpus used to extract the global and local contexts has good coverage of various, but not all, possible usage scenarios of the phrase. A knowledge base is expected to contain more comprehensive, declarative information about the phrase, e.g., entities and phrase senses with categorized information.
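Equations (1) and (2) above can be sketched as follows — a toy in-memory version, with the base 2 of Eq. (1) exposed as a parameter (as discussed in the implementation settings):

```python
from typing import List

def num_local_windows(N: int, scenario_len: int, M: int = 10, base: int = 2) -> int:
    """Eq. (1): K = max(N / base**length(s), M). The paper uses base = 2."""
    return max(N // base ** scenario_len, M)

def mean_vec(vecs: List[List[float]]) -> List[float]:
    """Average a list of equal-length vectors component-wise."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def localized_context(global_vecs: List[List[float]],
                      top_k_vecs: List[List[float]], alpha: float) -> List[float]:
    """Eq. (2): alpha * mean over all N windows + (1 - alpha) * mean over top-K windows."""
    g, l = mean_vec(global_vecs), mean_vec(top_k_vecs)
    return [alpha * gi + (1 - alpha) * li for gi, li in zip(g, l)]

print(num_local_windows(N=80, scenario_len=3))  # 80 / 2**3 = 10
print(localized_context([[2.0, 0.0], [0.0, 2.0]], [[4.0, 4.0]], alpha=0.5))  # [2.5, 2.5]
```

With an empty scenario (scenario_len = 0) the function returns all N windows, matching the no-shrinking case above.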
We, therefore, enrich the global and local phrase contexts extracted from the corpus with phrase information extracted from external knowledge bases. Given an input phrase, we collect all candidate senses and entities (uniformly referred to as candidates in this paper) by searching the following properties (and associated values) in the knowledge bases: dbpedia:redirects, dbpedia:disambiguation, and their propagation relation with dbpedia:name and rdfs:label. The resources in the retrieved triples result in a set of candidates. Then, for each candidate, the values of rdfs:label, dbpedia:abstract, and rdf:type are concatenated as the context for that candidate, excluding the title (which is mostly the phrase name). We also use the JWKTL interface (https://dkpro.github.io/dkpro-jwktl/) to retrieve senses from Wiktionary and merge them into the same candidate set for that given input.

Most phrases have only a limited number of candidates, and different candidates of the same phrase can have entirely different meanings or be distinct entities. We hence investigate a sequential way to incorporate the knowledge base into the phrase contextual representation, as follows. First, the phrase is fed into the knowledge base to find all candidate articles. Then, the candidates are ranked according to how similar they are to the localized phrase context. Those with similarity values above a certain threshold are identified as the matched candidates, denoted as {D_i, sim_i}_{i=1..n}, where D_i and sim_i refer to the i-th matched candidate and its similarity value sim_i ∈ [0, 1]. A linear combination of the localized phrase context and the candidate articles is then conducted to compute the adjusted phrase context as follows:

C(p, D) = λ C(p, s) + (1 − λ) Σ_{i=1..n} w_i R(D_i)    (3)

where w_i = sim_i / Σ_{j=1..n} sim_j is the normalized similarity score for the i-th candidate article D_i, and R(D_i) denotes the semantic representation of D_i.
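The KB adjustment of Eq. (3) reduces to a similarity-weighted average of candidate representations. A minimal sketch, with vectors as plain Python lists and candidates as (vector, similarity) pairs:

```python
from typing import List, Tuple

def adjusted_context(local_ctx: List[float],
                     candidates: List[Tuple[List[float], float]],
                     lam: float) -> List[float]:
    """Eq. (3): lam * C(p, s) + (1 - lam) * sum_i w_i R(D_i),
    where w_i = sim_i / sum_j sim_j normalizes the candidate similarities."""
    total_sim = sum(sim for _, sim in candidates)
    kb = [0.0] * len(local_ctx)
    for vec, sim in candidates:
        w = sim / total_sim  # normalized weight of this matched candidate
        for d, v in enumerate(vec):
            kb[d] += w * v
    return [lam * c + (1 - lam) * k for c, k in zip(local_ctx, kb)]

# Two matched candidates with equal similarity 0.5:
print(adjusted_context([1.0, 0.0], [([0.0, 1.0], 0.5), ([0.0, 3.0], 0.5)], lam=0.5))
# [0.5, 1.0]
```

Only candidates that passed the similarity threshold enter this weighted sum, so the normalization runs over the matched candidates alone.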
Since the KB contains well-defined knowledge of words, we use a weighted sum of the matched candidates, instead of a simple average of matched contexts as in the text corpus. The knowledge base we employ consists of DBpedia, Wiktionary, and WordNet. DBpedia is constructed by extracting structured information from Wikipedia. The English version of DBpedia contains 4.58 million entries, of which 4.22 million are classified and managed under one consistent ontology. Wiktionary is a multilingual, web-based, freely available dictionary, thesaurus, and phrase book, designed as the lexical companion to Wikipedia. Wiktionary is constructed collaboratively by volunteers, so no specialized qualifications are necessary.

This section presents a non-linear combination approach as a companion to the linear combination approach introduced in Sections 3.2 and 3.3. In addition to the weight-parameter-oriented design of the linear combination, we also employ a non-linear sigmoid function in RNNs (recurrent neural networks), which resolves the arrangement of the combining order for the context inputs. In other words, RNNs take into consideration the feedback from the previous context vector back and forth, leading to numerous applications [3, 4]. Specifically, we train a neural network model using the Keras library to identify the compositionality label of each phrase. We encode the semantics using pre-trained word embeddings (word2vec [11]) as word representations, a recurrent neural network with LSTM cells as the model, and cross-entropy as the loss function. As an optimizer, we use the Adam optimizer for training the model.

In a realistic scenario (also represented in our dataset), there are fewer non-compositional than compositional phrases. This situation resembles the class imbalance issue, which arises when one class (or label) is represented by most of the examples while the other is represented by just a few. We therefore adopt re-sampling strategies to tackle this problem.
Here we introduce our proposed method for compositionality detection with Algorithm 1. Given a phrase p of length l and its usage scenario s in a large corpus Corp, we compute its compositionality score through the following steps:

Algorithm 1: Contextual compositionality detection

Input: Phrase p with length l
Input: Usage scenario s for p
Input: Corpus Corp
Input: Knowledge base (DBpedia)
Input: Similarity threshold thred
Output: Compositionality score comp(p)

1:  Set of perturbed phrases S(p̂) ← ∅
2:  Find a synonym t̂ of each term t ∈ p
3:  for each t̂ do
4:      Perturbed phrase p̂ ← {t̂, the l − 1 original terms t_o}
5:      Update the perturbed phrase set S(p̂) ← S(p̂) ∪ p̂
6:  end for
7:  for phrase p′ ∈ {p ∪ S(p̂)} do
8:      C(p′) ← context terms from the localized phrase context from Corp, smoothed with Eq. 1 if p′ has a scenario
9:  end for
10: Find the n candidate articles D_i from the KB where sim_i = similarity(s, D_i) > thred
11: R(D_i) ← semantic representation of D_i
12: C(p, D) ← linearly combined context λ C(p) + (1 − λ) Σ_{i=1..n} (sim_i / Σ_{j=1..n} sim_j) R(D_i)
13: Q(L_p̂) ← ∅
14: for each perturbed phrase p̂ ∈ S(p̂) do
15:     Q(L_p̂) ← Q(L_p̂) ∪ C(p̂)
16: end for
17: return (1 / |Q(L_p̂)|) Σ_{C(p̂) ∈ Q(L_p̂)} Similarity(C(p̂), C(p, D))

(1) Obtain the localized phrase context through Eq. 1 and 2. The usage scenario of a phrase is the critical information, and the idea behind this step is to smooth the scenario context representation with the original phrase representation, as shown in line 8 of Algorithm 1.
(2) Adjust the phrase context with a knowledge base. The knowledge base is used to adjust the localized phrase context, where we adopt Eq. 3 to encode the information (lines 10 to 12).
(3) Obtain a perturbed phrase set. For each term in the phrase, we find its synonyms in WordNet. We then generate the set of perturbed phrases S(p̂) as: S(p̂) = {p̂ where p̂ = l − 1 terms of p plus a synonym of the remaining term of p} (lines 3 to 6).
(4) Construct a perturbation representation set. For each perturbed phrase p̂ in S(p̂), the corresponding representation C(p̂) is composed of all windows of p̂ from the corpus, and is added to the perturbation list Q(L_p̂) (lines 14 to 16). Note that we do not combine context from the KB for perturbations.
(5) Compute the compositionality score for the input phrase (line 17), using the following equation:

score(p) = ( Σ_{p̂ ∈ S(p̂)} sim(C(p, D), C(p̂)) ) / |S(p̂)|    (4)

4 IMPLEMENTATION

4.1 Semantic Representation
The semantic representation of a context (a phrase, a context window, or a candidate content), i.e., R(·), is concretized as either a ranked list or a word embedding. For the ranked list model, we calculate TF-IDF weights for all tokens and rank the tokens by weight, resulting in a ranked list of tokens as the localized contextual representation. For the word embedding model, we use existing pre-trained word vectors (GloVe [12]) and represent the context vector as the average over all tokens. The corpus we employ in our experiments is ClueWeb12-B13, a subset of some 50 million pages of the ClueWeb12-Full dataset (https://lemurproject.org/clueweb12/).

These two contextual representations lead to two different compositionality scores for the same model. We apply the suffixes "word embedding" or "ranked list" to distinguish the way the contextual representation is computed, resulting in two distinct models: CRM word embedding and CRM ranked list.

In this study, we face the problem of computing the similarity value between two context vectors. We consider two types of similarity measures for this purpose: cosine similarity and the Pearson correlation coefficient.

One of the most commonly used similarity measures, cosine similarity computes the cosine of the angle between two vectors of the same length. For two vectors a = [a_1, a_2, ..., a_n] and b = [b_1, b_2, ..., b_n], their cosine similarity cossim(a, b) is given by:

cossim(a, b) = Σ_{i=1..n} a_i b_i / ( sqrt(Σ_{i=1..n} a_i^2) · sqrt(Σ_{i=1..n} b_i^2) )    (5)

The Pearson correlation coefficient computes the degree of correlation between two variables, each having a set of observed values. Suppose two variables X and Y are associated with the value sets {X_1, X_2, ..., X_n} and {Y_1, Y_2, ..., Y_n}, respectively. The Pearson correlation coefficient r can then be computed as follows:

r = Σ_{i=1..n} (X_i − X̄)(Y_i − Ȳ) / ( sqrt(Σ_{i=1..n} (X_i − X̄)^2) · sqrt(Σ_{i=1..n} (Y_i − Ȳ)^2) )    (6)

where X̄ and Ȳ denote the averages of X and Y, respectively.

Here, we introduce the process of obtaining the perturbations of a phrase p of length l. First, we get the synonyms of each word in the phrase. Then, we construct the whole perturbation set, which contains all phrases composed of l − 1 words of p and a synonym of the remaining word. Suppose the i-th word has n_i synonyms; then the perturbation set contains Σ_{i=1..l} n_i perturbed phrases. We then prune the perturbation set by filtering out the rare perturbed phrases in the text corpus: we compute the occurrence frequency of all perturbed phrases and pick the perturbed phrases with the top K frequency values. In our study, we set K to 7, derived from empirical observation of the data. The final perturbation set thus contains 7 perturbed phrases.
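The two similarity measures (Eq. 5 and 6) and the scoring step (Eq. 4) can be sketched as:

```python
import math
from typing import Callable, List

def cossim(a: List[float], b: List[float]) -> float:
    """Eq. (5): cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pearson(x: List[float], y: List[float]) -> float:
    """Eq. (6): Pearson correlation coefficient between two sets of observations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def compositionality_score(phrase_ctx: List[float],
                           perturbed_ctxs: List[List[float]],
                           sim: Callable[[List[float], List[float]], float] = cossim) -> float:
    """Eq. (4): mean similarity between the adjusted phrase context C(p, D)
    and the contexts of its perturbations; lower => less compositional."""
    return sum(sim(phrase_ctx, c) for c in perturbed_ctxs) / len(perturbed_ctxs)
```

For example, compositionality_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]) averages cosine similarities 1 and 0 to 0.5; either similarity measure can be plugged in via the sim argument.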
Contextual window setting: We set the window size to 20, i.e., we scan the 20 words before and the 20 words after the phrase, for a total of 40 words per window. In Equation 1, we set M = 10; the base 2 of Eq. 1 is also a tunable parameter, which can be changed to 3, 4, etc., to increase the localization level.
Knowledge base threshold:
As for the threshold for KB candidates, we set the similarity threshold between the localized context and the KB candidates to 0.5, filtering out candidates with similarity below 0.5.
Ranked list length:
In line with prior work [8], we set the maximum length of the ranked list to 1000: we rank the tokens according to their TF-IDF weight, and tokens after position 1000 are pruned.
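The ranked-list construction (TF-IDF ranking truncated at position 1000) can be sketched as follows — a toy version assuming a precomputed document-frequency table:

```python
import math
from collections import Counter
from typing import Dict, List

def ranked_list(context_tokens: List[str], doc_freq: Dict[str, int],
                n_docs: int, max_len: int = 1000) -> List[str]:
    """Rank context tokens by TF-IDF weight and prune after position max_len."""
    tf = Counter(context_tokens)
    # TF-IDF with add-one smoothing in the denominator to avoid division by zero:
    weight = {t: tf[t] * math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in tf}
    return sorted(weight, key=weight.get, reverse=True)[:max_len]

# Hypothetical document frequencies: "the" is common, "dog" is rare.
df = {"the": 90, "dog": 5}
print(ranked_list(["the", "the", "dog"], df, n_docs=100, max_len=2))  # ['dog', 'the']
```

Rare, discriminative tokens rise to the top of the list, which is what makes the 1000-token prefix a useful contextual representation.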
Training and testing:
As we are working with imbalanced data, we use a random oversampling strategy. We split our data in a stratified fashion into 65% for training, 15% for validation, and 20% for testing. The re-sampling is done after splitting the data into training and test sets, and only on the training data, i.e., none of the information in the test data is used to create synthetic observations.
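The random oversampling applied to the training split can be sketched as follows — a minimal stand-in for off-the-shelf re-sampling utilities:

```python
import random
from typing import List, Tuple

def oversample(examples: List, labels: List, seed: int = 0) -> Tuple[List, List]:
    """Randomly duplicate minority-class examples until every class matches
    the size of the largest class. Applied to the training split only."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        # Pad the minority class with random duplicates of its own members.
        balanced = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(balanced)
        out_y.extend([y] * len(balanced))
    return out_x, out_y

xs, ys = oversample([1, 2, 3, 4], ["comp", "comp", "comp", "non-comp"])
print(ys.count("comp"), ys.count("non-comp"))  # 3 3
```

Because the duplicates are drawn only from the training examples, no test-set information leaks into the balanced training data.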
5 EVALUATION

In this section, we evaluate the effectiveness of the model presented in Section 3. Section 5.1 introduces the dataset and Section 5.2 presents the results achieved by our model.
5.1 Dataset

We employ a dataset of 1042 noun-noun 2-term phrases [2]. In this dataset, each phrase was assessed four times on a binary scale (compositional or non-compositional). However, these phrases were assessed with a deterministic label, meaning that no scenario or context was given, and the degree of compositionality may not always be binary [13]. Therefore, we extend the dataset into a new version where each phrase is enriched with one or two scenarios where possible, using the crowdsourcing platform Figure Eight, and we use graded levels of compositionality. Table 2 summarizes the dataset statistics.

We divided the assessment into two stages. In the first stage, trusted assessors, with level 3 (the highest in Figure Eight), were required to understand the various meanings of a phrase and, if possible, create two scenarios for the same phrase. Of these two scenarios, one should be compositional or as compositional as possible, and the other non-compositional or as non-compositional as possible. If the phrase can only be compositional or only be non-compositional, they created one scenario for it. In the second stage, the assessors were required to assess the compositionality of phrases within the different scenarios using one of five graded labels: compositional, mostly compositional, ambiguous to judge, mostly non-compositional, and non-compositional. Note that, for the first stage, the two scenarios of a phrase are not necessarily of two extreme polarities.

Table 1: Results of different compositionality detection methods; na denotes not applicable.

Unsupervised Methods               ρ      α, λ
Baseline: Ranked list [8]          0.131  na
Baseline: Word Embedding           0.147  na
CRM ranked list                    0.209  0.1, 0.5
CRM Word Embedding                 0.375  0.9, 0.1

Supervised Methods (20% Testing)   ρ
RNN (LSTM cells)                   0.176  na
RNN (LSTM cells) CRM               0.324  na

Table 2: Summary of dataset statistics.

No. Non-Compositional              43 (3.6%)
No. Mostly Non-Compositional       145 (12.1%)
No. Ambiguous Phrases              126 (10.5%)
No. Mostly Compositional           141 (12.0%)
No. Compositional                  739 (61.8%)
Unique number of phrases           1042
No. of contexts                    1194
Average number of contexts per phrase  1.146
5.2 Results

Two linear combination parameters influence the performance of our model: the combination weight α (Eq. 2) between the vectors of a phrase and its scenario, resulting in a localized phrase context, and λ (Eq. 3) between the localized context and the knowledge base. The impact of these two parameters on the final performance is visualized in Figures 2 and 3, corresponding to the word embedding-based and ranked list-based contextual representations, respectively. α and λ denote the x and y coordinates. The colors indicate the performance, i.e., the correlation between the ground truth labels and the predicted labels of our models, ranging from -1 to 1. The performance values are colored from red to blue, representing the lowest to the highest performance.

As shown in Figure 2, the performance is negatively correlated with α and positively correlated with λ. This indicates that reducing the relative importance of the localized context (rightward on the x-axis) while enhancing the influence of the knowledge base (downward on the y-axis) improves the performance of the word embedding-based contextual representation. In contrast, as shown in Figure 3, if we ignore the first column, α is negatively correlated with the performance, while λ has no apparent influence on the performance. This indicates that attaching higher importance to the localized context (leftward on the x-axis) improves the performance of the ranked list-based contextual representation, while the adoption of the knowledge base has no apparent influence on the overall performance. The first column, which appears more like an outlier, indicates that the existence of a vector of the original phrase is necessary: in other words, the localized context alone would perform relatively poorly.

As summarized in Table 1, the performance improved from 0.147 to 0.375 for the CRM based on word embeddings (the best); from 0.131 to 0.209 for the CRM based on ranked lists; and from 0.176 to 0.324 for the CRM based on the RNN. For the word embedding-based contextual representation model, relying more on the knowledge base while keeping the scenario at limited importance leads to a high-performing model; for the ranked list-based contextual representation model, on the other hand, adequately high adoption of the localized context leads to improved performance. The reason may be that the knowledge base contains relatively trimmed but well-categorized information, so the word embedding model can make full use of this text as informative vectors. In contrast, ranked lists, which depend on tokens, work better on a large-scale corpus that induces a large number of context windows; the knowledge base contains a limited number of tokens that may contribute little to the final representation. Even though we can tune the weight of tokens from the knowledge base, their influence remains limited in comparison to the long ranked list, which can be as long as 1000 tokens in our experiments.

For the non-linear combination, where we employed the sigmoid function in RNNs, the CRM based on the RNN still beats the original RNN. However, its performance is still lower than that of the unsupervised approaches.
6 CONCLUSION

We developed a novel method for compositionality detection in which the compositionality of a phrase is contextual rather than static. Instead of considering an isolated phrase as input, we take a phrase and its usage scenario (e.g., a query, snippet, sentence, etc.) as input, and we model a joint semantic representation of these by combining distributional semantics extracted from a corpus with additional evidence extracted from an external structured knowledge base. Our resulting model uses word embeddings to detect compositionality more accurately than the related state of the art. Our experiments show that, for word embeddings, the use of knowledge bases can lead to notable performance improvements. In the future, we plan to evaluate our model on further datasets and compositionality detection scenarios, e.g., Verbal Phraseological Units (VPUs).
ACKNOWLEDGEMENT
This work is supported by the Quantum Access and Retrieval Theory (QUARTZ) project, which has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 721321.
Figure 2: Grid search for the word-embedding-based contextual representation. The x-coordinate is α, controlling localized context, and the y-coordinate is λ, controlling the KB combining weight. Deeper blue represents higher performance, whereas red indicates the opposite.

Figure 3: Grid search for the ranked-list-based contextual representation. The x-coordinate is α, controlling localized context, and the y-coordinate is λ, controlling the KB combining weight. Deeper blue represents higher performance, whereas red indicates the opposite.

REFERENCES

[1] Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18. Association for Computational Linguistics, 89–96.
[2] Meghdad Farahmand, Aaron Smith, and Joakim Nivre. 2015. A Multiword Expression Dataset: Annotating Non-Compositionality and Conventionalization for English Noun Compounds. In Proceedings of the 11th Workshop on Multiword Expressions (MWE-NAACL 2015). Association for Computational Linguistics.
[3] Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, and Christina Lioma. [n. d.]. Neural Speed Reading with Structural-Jump-LSTM. In Proceedings of the 2019 International Conference on Learning Representations, ICLR 2019.
[4] Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, and Christina Lioma. 2019. Modelling Sequential Music Track Skips using a Multi-RNN Approach. In Proceedings of the 2019 International Conference on Web Search and Data Mining, WSDM 2019, Sequential Skip Prediction Challenge. In press.
[5] Kazuma Hashimoto and Yoshimasa Tsuruoka. 2016. Adaptive joint learning of compositional and non-compositional phrase embeddings. arXiv preprint arXiv:1603.06067 (2016).
[6] Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties. Association for Computational Linguistics, 12–19.
[7] Douwe Kiela and Stephen Clark. 2013. Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1427–1432. http://aclweb.org/anthology/D13-1147
[8] Christina Lioma and Niels Dalum Hansen. 2017. A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection. Cogn. Syst. Res. 44, C (Aug. 2017), 40–49. https://doi.org/10.1016/j.cogsys.2017.03.001
[9] Christina Lioma, Birger Larsen, and Peter Ingwersen. 2018. To Phrase or Not to Phrase – Impact of User versus System Term Dependence Upon Retrieval. Data and Information Management 2, 1 (2018), 1–14.
[10] Christina Lioma, Jakob Grue Simonsen, Birger Larsen, and Niels Dalum Hansen. 2015. Non-compositional term dependence for information retrieval. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 595–604.
[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[12] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[13] Siva Reddy, Diana McCarthy, and Suresh Manandhar. 2011. An Empirical Study on Compositionality in Compound Nouns. In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP-11). Chiang Mai, Thailand. http://aclweb.org/anthology-new/I/I11/I11-1024.pdf
[14] Bahar Salehi, Paul Cook, and Timothy Baldwin. 2014. Detecting non-compositional MWE components using Wiktionary. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1792–1797.
[15] Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015. A word embedding approach to predicting the compositionality of multiword expressions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 977–983.
[16] Sriram Venkatapathy and Aravind K Joshi. 2005. Measuring the relative compositionality of verb-noun (VN) collocations by integrating features. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 899–906.
[17] Majid Yazdani, Meghdad Farahmand, and James Henderson. 2015. Learning semantic composition to detect non-compositionality of multiword expressions. In