Deep Learning Embeddings for Discontinuous Linguistic Units
Wenpeng Yin and Hinrich Schütze
Center for Information and Language Processing, University of Munich, Germany
[email protected]
Abstract
Deep learning embeddings have been successfully used for many natural language processing problems. Embeddings are mostly computed for word forms, although a number of recent papers have extended this to other linguistic units like morphemes and phrases. In this paper, we argue that learning embeddings for discontinuous linguistic units should also be considered. In an experimental evaluation on coreference resolution, we show that such embeddings perform better than word form embeddings.
One advantage of recent work in deep learning on natural language processing (NLP) is that linguistic units are represented by rich and informative embeddings. These embeddings support better performance on a variety of NLP tasks (Collobert et al., 2011) than symbolic linguistic representations that do not directly represent information about similarity and other linguistic properties. Embeddings are mostly derived for word forms, although a number of recent papers have extended this to other linguistic units like morphemes (Luong et al., 2013) and phrases (Mikolov et al., 2013). Thus, an important question is: what are the basic linguistic units that should be represented by embeddings in a deep learning NLP system? In this paper, we argue that certain discontinuous linguistic units should also have embeddings. We will restrict ourselves to the arguably simplest possible type of discontinuity: two noncontinuous words. For example, in the sentence "this tea helped me to relax", "helped*to" is one of several such two-word discontinuities. We will refer to discontinuous linguistic units like "helped*to" as minimal contexts (MC) for reasons that will become clear presently.

We can approach the question of which basic linguistic units should have representations from a practical as well as from a cognitive point of view. In practical terms, we want representations to be optimized for good generalization. There are many situations where a particular task involving a phrase cannot be solved based on the phrase itself, but can be solved by analyzing the context of the phrase. For example, if a coreference resolution system needs to determine whether the unknown word "Xiulan" (a Chinese first name) in "he helped Xiulan to find a flat" refers to an animate or an inanimate entity, then the minimal context "helped*to" is a good indicator of the animacy of the unknown word, whereas the unknown word itself provides no clue.

From a cognitive point of view, it can be argued that many basic units used by the human cognitive system are also discontinuous. Particularly convincing examples of such units are phrasal verbs in English, which frequently occur discontinuously. It is implausible to suppose that we retrieve atomic representations for, say, "keep", "up", "under" and "in" and then combine them to form the meanings of phrases like "keep him up", "keep them under", "keep it in". Rather, it is more plausible that we recognize "keep up", "keep under" and "keep in" as the relevant basic linguistic units in these contexts and that the human cognitive system represents them as units.

This paper presents an initial study of minimal context embeddings and shows that they are better suited than word embeddings for a classification task needed for coreference resolution. Our conclusion is that minimal contexts (as well as inflected word forms, morphemes and phrases) should be considered basic units that we need to learn embeddings for.
We use the English Gigaword corpus and the skip-gram model as implemented in word2vec (Mikolov et al., 2013; https://code.google.com/p/word2vec/) to induce embeddings. To be able to use word2vec directly, without code changes, we represent the corpus as a sequence of sentences, each consisting of two tokens: an MC (written as the two enclosing words separated by a star) and a word that occurs between the two enclosing words. The distance k between the two enclosing words can be varied; in our experiments, we use either the fixed distance k = 2 or the range 2 ≤ k ≤ 3. For example, for k = 2, the trigram w_{i-1} w_i w_{i+1} generates the single sentence "w_{i-1}*w_{i+1} w_i"; for 2 ≤ k ≤ 3, the fourgram w_{i-2} w_{i-1} w_i w_{i+1} generates the four sentences "w_{i-2}*w_i w_{i-1}", "w_{i-1}*w_{i+1} w_i", "w_{i-2}*w_{i+1} w_{i-1}" and "w_{i-2}*w_{i+1} w_i". (An illustrative sketch of this reformatting is given after Table 1.) Note that the reformatted corpus lets word2vec learn embeddings for single words and MCs simultaneously; we discard these word embeddings, however, and instead compute standard word embeddings on the original corpus with the word2vec skip-gram model. In all experiments, the embedding size is set to 200.

A markable is a linguistic expression that refers to an entity in the real world or to another linguistic expression. Examples of markables include noun phrases ("the man"), named entities ("Peter") and nested noun phrases ("their"). We address the task of animacy classification of markables: classifying them as animate or inanimate. This feature is useful for coreference resolution systems because only animate markables can be referred to by masculine and feminine pronouns in English, such as "him" and "she". Thus, it is an important clue for automatically clustering the markables of a document into correct coreference chains.

To create training and test sets, we extract all 39,689 coreference chains from the CoNLL-2012 OntoNotes corpus (http://conll.cemantix.org/2012/data.html). We label chains that contain one of the markables "she", "her", "he", "him" or "his" as animate and chains that contain one of "it" or "its" as inanimate. We extract 39,942 markables and their corresponding MCs from the 10,361 animate and inanimate chains, where an MC is simply the pair of the two words occurring to the left and to the right of the markable. The gold label of a markable and its MC is the animacy status of its chain: either animate or inanimate. We divide all MCs that received an embedding in the embedding learning phase into a training set of 11,301 instances (8097 animate, 3204 inanimate) and a balanced test set of 4036 instances.

We use LIBLINEAR (https://github.com/bwaldvogel/liblinear-java) for classification, with penalty factors 3 and 1 for the inanimate and animate classes, respectively, because the training data are unbalanced.

We compare the following representations for animacy classification of markables (an illustrative sketch of the feature construction and classifier also follows Table 1). (i) MC: minimal context embeddings with k = 2 and 2 ≤ k ≤ 3. (ii) Concatenation: the concatenation of the embeddings of the two enclosing words, where the embeddings are either the standard word2vec embeddings described above or the embeddings published by Collobert et al. (2011) (C&W; http://metaoptimize.com/projects/wordreprs/). (iii) Bag of words (BOW): the concatenation of two one-hot vectors of length V each, where V is the size of the vocabulary; the first (resp. second) vector is the one-hot vector for the left (resp. right) word of the MC. Experimental results are shown in Table 1.

    representation     accuracy
    MC, k = 2
    MC, 2 ≤ k ≤ 3
    C&W                0.662*†
    BOW                0.638*†

Table 1: Classification accuracy. "*": significantly lower than MC, k = 2; "†": significantly lower than MC, 2 ≤ k ≤ 3.
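To make the corpus reformatting concrete, the following is a minimal sketch in Python. It is illustrative code rather than our implementation; the function and file names (mc_sentences, mc_corpus.txt) are our own. It turns a tokenized sentence into the two-token "sentences" described above, which can then be fed to the word2vec tool in skip-gram mode without any code changes.

    # Illustrative sketch: reformat a tokenized corpus so that the standard
    # word2vec tool can learn minimal-context (MC) embeddings unchanged.
    # Each output "sentence" has two tokens: an MC "left*right" and one word
    # that occurs between the two enclosing words.

    def mc_sentences(tokens, k_min=2, k_max=2):
        """Yield (mc, inner_word) pairs for enclosing-word distances k_min..k_max."""
        for i in range(len(tokens)):
            for k in range(k_min, k_max + 1):
                j = i + k                      # position of the right enclosing word
                if j >= len(tokens):
                    continue
                mc = tokens[i] + "*" + tokens[j]
                for inner in tokens[i + 1:j]:  # every word strictly between i and j
                    yield mc, inner

    if __name__ == "__main__":
        sent = "this tea helped me to relax".split()
        with open("mc_corpus.txt", "w") as out:
            for mc, inner in mc_sentences(sent, k_min=2, k_max=3):
                out.write(mc + " " + inner + "\n")
        # For k = 2 this writes, e.g., "helped*to me"; with 2 <= k <= 3,
        # distance-3 pairs such as "helped*relax to" are written as well.

For k = 2, each trigram yields exactly one output line, and for 2 ≤ k ≤ 3, each fourgram yields four, matching the scheme described above.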
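The feature construction and classifier can likewise be sketched briefly. The code below is illustrative only: it assumes the learned embeddings are available as Python dictionaries (mc_emb and word_emb, mapping strings to 200-dimensional vectors) and that vocab_index maps each word to an integer index; these names are hypothetical. scikit-learn's LinearSVC, which is implemented on top of LIBLINEAR, stands in for our LIBLINEAR setup, with class weights mirroring the penalty factors 3 (inanimate) and 1 (animate).

    # Illustrative sketch of the three representations compared in Table 1
    # and of a class-weighted linear classifier for animacy.
    import numpy as np
    from sklearn.svm import LinearSVC  # backed by LIBLINEAR

    def mc_features(left, right, mc_emb):
        """(i) MC: the embedding learned for the discontinuous unit 'left*right'."""
        return mc_emb[left + "*" + right]

    def concat_features(left, right, word_emb):
        """(ii) concatenation: embeddings of the two enclosing words."""
        return np.concatenate([word_emb[left], word_emb[right]])

    def bow_features(left, right, vocab_index):
        """(iii) BOW: two concatenated one-hot vectors of length V."""
        v = np.zeros(2 * len(vocab_index))
        v[vocab_index[left]] = 1.0
        v[len(vocab_index) + vocab_index[right]] = 1.0
        return v

    def train_animacy_classifier(X, y):
        """y holds the labels 'animate'/'inanimate'; the larger weight on the
        inanimate class mirrors the penalty factors used with LIBLINEAR."""
        clf = LinearSVC(class_weight={"inanimate": 3, "animate": 1})
        clf.fit(X, y)
        return clf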
The results show that MC embeddings have a clear advantage in this classification task, both for k = 2 and for 2 ≤ k ≤ 3. This validates our hypothesis that learning embeddings for discontinuous linguistic units is promising.

In our error analysis, we found two types of frequent errors. (i) Unspecific MCs. Many MCs are equally appropriate for animate and inanimate markables; examples include "take*in", "keep*alive" and "then*goes". (ii) Untypical use of specific MCs. Even MCs that are specific with respect to the type of markable they enclose sometimes occur with the "wrong" type of markable. For example, most markables occurring in the MC "of*whose" are animate because "whose" usually refers to an animate markable. However, in the context ". . . the southeastern area of Fujian whose economy is the most active", the enclosed markable is Fujian, a province of China. This example shows that "whose" occasionally refers to an inanimate entity, even though such cases are infrequent.
Most work on embeddings has focused on word forms, with a few exceptions, notably embeddings for stems and morphemes (Luong et al., 2013) and for phrases (Mikolov et al., 2013). To the best of our knowledge, our work is the first to learn embeddings for discontinuous linguistic units.

An alternative to learning an embedding for a linguistic unit is to calculate its distributed representation from the distributed representations of its parts; the best known work along those lines is (Socher et al., 2010, 2011, 2012). This approach is superior for units that are compositional, i.e., whose properties are systematically predictable from their parts. Our approach (as well as similar work on continuous phrases) only makes sense for noncompositional units.
We have argued that discontinuous linguistic units are part of the inventory of linguistic units that we should compute embeddings for, and we have shown that such embeddings are superior to word form embeddings in a coreference resolution task.

It is obvious that we cannot and do not want to compute embeddings for all possible discontinuous linguistic units. Similarly, the subset of phrases that embeddings are computed for should be carefully selected. In future work, we plan to address the question of how to select a subset of linguistic units (e.g., those that are least compositional) when inducing embeddings.