Morph-fitting: Fine-Tuning Word Vector Spaces with Simple Language-Specific Rules
Ivan Vulić, Nikola Mrkšić, Roi Reichart, Diarmuid Ó Séaghdha, Steve Young, Anna Korhonen
University of Cambridge; Technion, Israel Institute of Technology; Apple Inc.
{iv250,nm480,sjy11,alk23}@cam.ac.uk, [email protected], [email protected]
Abstract
Morphologically rich languages accentuate two properties of distributional vector space models: 1) the difficulty of inducing accurate representations for low-frequency word forms; and 2) insensitivity to distinct lexical relations that have similar distributional signatures. These effects are detrimental for language understanding systems, which may infer that inexpensive is a rephrasing for expensive or may not associate acquire with acquires. In this work, we propose a novel morph-fitting procedure which moves past the use of curated semantic lexicons for improving distributional vector spaces. Instead, our method injects morphological constraints generated using simple language-specific rules, pulling inflectional forms of the same word close together and pushing derivational antonyms far apart. In intrinsic evaluation over four languages, we show that our approach: 1) improves low-frequency word estimates; and 2) boosts the semantic quality of the entire word vector collection. Finally, we show that morph-fitted vectors yield large gains in the downstream task of dialogue state tracking, highlighting the importance of morphology for tackling long-tail phenomena in language understanding tasks.

Introduction

Word representation learning has become a research area of central importance in natural language processing (NLP), with its usefulness demonstrated across many application areas such as parsing (Chen and Manning, 2014; Johannsen et al., 2015), machine translation (Zou et al., 2013), and many others (Turian et al., 2010; Collobert et al., 2011). Most prominent word representation techniques are grounded in the distributional hypothesis (Harris, 1954), relying on word co-occurrence information in large textual corpora (Curran, 2004; Turney and Pantel, 2010; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Levy and Goldberg, 2014; Schwartz et al., 2015, i.a.).

Morphologically rich languages, in which "substantial grammatical information ... is expressed at word level" (Tsarfaty et al., 2010), pose specific challenges for NLP. This is not always considered when techniques are evaluated on languages such as English or Chinese, which do not have rich morphology. In the case of distributional vector space models, morphological complexity brings two challenges to the fore:
1. Estimating Rare Words:
A single lemma can have many different surface realisations. Naively treating each realisation as a separate word leads to sparsity problems and a failure to exploit their shared semantics. On the other hand, lemmatising the entire corpus can obfuscate the differences that exist between different word forms even though they share some aspects of meaning.
2. Embedded Semantics:
Morphology can encode semantic relations such as antonymy (e.g. literate and illiterate, expensive and inexpensive) or (near-)synonymy (north, northern, northerly).

In this work, we tackle the two challenges jointly by introducing a resource-light vector space fine-tuning procedure termed morph-fitting. The proposed method does not require curated knowledge bases or gold lexicons. Instead, it makes use of the observation that morphology implicitly encodes semantic signals pertaining to synonymy (e.g., the German word inflections katalanisch, katalanischem, katalanischer denote the same semantic concept in different grammatical roles) and antonymy (e.g., mature vs. immature), capitalising on the proliferation of word forms in morphologically rich languages. Formalised as an instance of the post-processing semantic specialisation paradigm (Faruqui et al., 2015; Mrkšić et al., 2016), morph-fitting is steered by a set of linguistic constraints derived from simple language-specific rules which describe (a subset of) morphological processes in a language. The constraints emphasise similarity on one side (e.g., by extracting morphological synonyms) and antonymy on the other (by extracting morphological antonyms), see Fig. 1 and Tab. 2. The key idea of the fine-tuning process is to pull synonymous examples described by the constraints closer together in the transformed vector space, while at the same time pushing antonymous examples away from each other. The explicit post-hoc injection of morphological constraints enables: a) the estimation of more accurate vectors for low-frequency words which are linked to their high-frequency forms by the constructed constraints, tackling the data sparsity problem; and b) specialising the distributional space to distinguish between similarity and relatedness (Kiela et al., 2015), thus supporting language understanding applications such as dialogue state tracking (DST).

en_expensive    de_teure           it_costoso     en_slow    de_langsam        it_lento      en_book      de_buch      it_libro
costly          teuren             dispendioso    fast       allmählich        lentissimo    books        sachbuch     romanzo
costlier        kostspielige       remunerativo   slowness   rasch             lenta         memoir       buches       racconto
cheaper         aufwändige         redditizio     slower     gemächlich        inesorabile   novel        romandebüt   volumetto
prohibitively   kostenintensive    rischioso      slowed     schnell           rapidissimo   storybooks   büchlein     saggio
pricey          aufwendige         costosa        slowing    explosionsartig   graduale      blurb        pamphlet     ecclesiaste
-------------------------------------------------------------------------------------------------------------------------------
expensiveness   teures             costosa        slowing    langsamer         lenti         booked       bücher       libri
costly          teuren             costose        slowed     langsames         lente         rebook       büch         libra
costlier        teurem             costosi        slowness   langsame          lenta         booking      büche        librare
ruinously       teurer             dispendioso    slows      langsamem         veloce        rebooked     büches       libre
unaffordable    teurerer           dispendiose    idle       langsamen         rapido        books        büchen       librano

Table 1: The nearest neighbours of three example words (expensive, slow and book) in English, German and Italian before (top) and after (bottom) morph-fitting.
As a post-processor, morph-fitting allows the integration of morphological rules with any distributional vector space in any language: it treats an input distributional word vector space as a black box and fine-tunes it so that the transformed space reflects the knowledge coded in the input morphological constraints (e.g., the Italian words rispettoso and irrispettosa should be far apart in the transformed vector space, see Fig. 1). Tab. 1 illustrates the effects of morph-fitting by qualitative examples in three languages: the vast majority of nearest neighbours are "morphological" synonyms.

Footnote: For instance, the vector for the word katalanischem, which occurs only 9 times in the German Wikipedia, will be pulled closer to the more reliable vectors for katalanisch and katalanischer, with frequencies of 2097 and 1383 respectively.

Footnote: Representation models that do not distinguish between synonyms and antonyms may have grave implications in downstream language understanding applications such as spoken dialogue systems: a user looking for 'an affordable Chinese restaurant in west Cambridge' does not want a recommendation for 'an expensive Thai place in east Oxford'.

Figure 1: Morph-fitting in Italian. Representations for rispettoso, rispettosa, rispettosi (EN: respectful) are pulled closer together in the vector space (solid lines; ATTRACT constraints). At the same time, the model pushes them away from their antonyms (dashed lines; REPEL constraints) irrispettoso, irrispettosa, irrispettosi (EN: disrespectful), obtained through morphological affix transformation captured by language-specific rules (e.g., adding the prefix ir- typically negates the base word in Italian).

We demonstrate the efficacy of morph-fitting in four languages (English, German, Italian, Russian), yielding large and consistent improvements on benchmarking word similarity evaluation sets such as SimLex-999 (Hill et al., 2015), its multilingual extension (Leviant and Reichart, 2015), and SimVerb-3500 (Gerz et al., 2016). The improvements are reported for all four languages and for a variety of input distributional spaces, verifying the robustness of the approach.

We then show that incorporating morph-fitted vectors into a state-of-the-art neural-network DST model results in improved tracking performance, especially for morphologically rich languages. We report an improvement of 4% on Italian and 6% on German when using morph-fitted vectors instead of the distributional ones, setting new state-of-the-art DST performance for the two datasets.

Footnote: There are no readily available DST datasets for Russian.
Morph-fitting: Methodology
Preliminaries
In this work, we focus on four languages with varying levels of morphological complexity: English (EN), German (DE), Italian (IT), and Russian (RU). These correspond to the languages of the Multilingual SimLex-999 dataset. Vocabularies W_en, W_de, W_it, W_ru are compiled by retaining all word forms from the four Wikipedias with word frequency over 10, see Tab. 3. We then extract sets of linguistic constraints from these (large) vocabularies using a set of simple language-specific if-then-else rules, see Tab. 2. These constraints (Sect. 2.2) are used as input for the vector space post-processing ATTRACT-REPEL algorithm (outlined in Sect. 2.1).

Footnote: A native speaker can easily come up with these sets of morphological rules (or at least with a reasonable subset of them) without any linguistic training. What is more, the rules for DE, IT, and RU were created by non-native, non-fluent speakers with a limited knowledge of the three languages, exemplifying the simplicity and portability of the approach.
The ATTRACT-REPEL Model

The ATTRACT-REPEL model, proposed by Mrkšić et al. (2017b), is an extension of the PARAGRAM procedure proposed by Wieting et al. (2015). It provides a generic framework for incorporating similarity constraints (e.g. successful and accomplished) and antonymy constraints (e.g. nimble and clumsy) into pre-trained word vectors. Given the initial vector space and collections A and R of ATTRACT and REPEL constraints, the model gradually modifies the space to bring the designated word vectors closer together or further apart. The method's cost function consists of three terms. The first term pulls the ATTRACT examples (x_l, x_r) ∈ A closer together. If B_A denotes the current mini-batch of ATTRACT examples, this term can be expressed as:

A(\mathcal{B}_A) = \sum_{(x_l, x_r) \in \mathcal{B}_A} \Big( \mathrm{ReLU}\big(\delta_{att} + x_l t_l - x_l x_r\big) + \mathrm{ReLU}\big(\delta_{att} + x_r t_r - x_l x_r\big) \Big)

where δ_att is the similarity margin which determines how much closer synonymous vectors should be to each other than to each of their respective negative examples, and ReLU(x) = max(0, x) is the standard rectified linear unit (Nair and Hinton, 2010). The 'negative' example t_i for each word x_i in any ATTRACT pair is the word vector closest to x_i among the examples in the current mini-batch (distinct from its target synonym and x_i itself). This means that this term forces synonymous words from the in-batch ATTRACT constraints to be closer to one another than to any other word in the current mini-batch.

English                    German                        Italian
(discuss, discussed)       (schottisch, schottischem)    (golfo, golfi)
(laugh, laughing)          (damalige, damaligen)         (minato, minata)
(pacifist, pacifists)      (kombiniere, kombinierte)     (mettere, metto)
(evacuate, evacuated)      (schweigt, schweigst)         (crescono, cresci)
(evaluate, evaluates)      (hacken, gehackt)             (crediti, credite)
(dressed, undressed)       (stabil, unstabil)            (abitata, inabitato)
(similar, dissimilar)      (geformtes, ungeformt)        (realtà, irrealtà)
(formality, informality)   (relevant, irrelevant)        (attuato, inattuato)

Table 2: Example synonymous (inflectional; top) and antonymous (derivational; bottom) constraints.
The second term pushes antonyms away from each other. If B_R is the current mini-batch of REPEL constraints (x_l, x_r) ∈ B_R, this term can be expressed as follows:

R(\mathcal{B}_R) = \sum_{(x_l, x_r) \in \mathcal{B}_R} \Big( \mathrm{ReLU}\big(\delta_{rpl} + x_l x_r - x_l t_l\big) + \mathrm{ReLU}\big(\delta_{rpl} + x_l x_r - x_r t_r\big) \Big)

In this case, each word's 'negative' example is the (in-batch) word vector furthest away from it (and distinct from the word's target antonym). The intuition is that we want antonymous words from the input REPEL constraints to be further away from each other than from any other word in the current mini-batch; δ_rpl is now the repel margin.

The final term of the cost function serves to retain the abundance of semantic information encoded in the starting distributional space. If x_i^init is the initial distributional vector of word x_i and V(B) is the set of all vectors present in the given mini-batch, this term (per mini-batch) is expressed as follows:

\mathrm{Reg}(\mathcal{B}_A, \mathcal{B}_R) = \sum_{x_i \in V(\mathcal{B}_A \cup \mathcal{B}_R)} \lambda_{reg} \, \big\| x_i^{init} - x_i \big\|_2

where λ_reg is the L2 regularisation constant. This term effectively pulls word vectors towards their initial (distributional) values, ensuring that relations encoded in the initial vectors persist as long as they do not contradict the newly injected ones.
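To make the three terms concrete, here is a minimal NumPy sketch of one mini-batch of the cost (forward pass only; the actual model minimises this cost with AdaGrad). The function name, the integer index pairs, and the assumption of unit-normalised vectors are illustrative choices of this sketch, not the released implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def attract_repel_cost(att_pairs, rep_pairs, X, X_init,
                       delta_att=0.6, delta_rpl=0.0, lambda_reg=1e-9):
    """One mini-batch of the ATTRACT-REPEL cost (a sketch).

    att_pairs / rep_pairs: integer index pairs into the rows of X.
    X / X_init: current and initial word vectors, assumed unit-normalised,
    so a dot product x.y acts as cosine similarity.
    """
    att_pairs = np.asarray(att_pairs)
    rep_pairs = np.asarray(rep_pairs)
    batch_ids = np.unique(np.concatenate([att_pairs.ravel(), rep_pairs.ravel()]))
    B = X[batch_ids]                       # all vectors in the mini-batch

    def negative(i, partner, furthest=False):
        sims = B @ X[i]                    # similarity to every in-batch vector
        banned = np.isin(batch_ids, [i, partner])
        sims[banned] = np.inf if furthest else -np.inf
        j = sims.argmin() if furthest else sims.argmax()
        return B[j]

    cost = 0.0
    for l, r in att_pairs:                 # pull synonyms together
        t_l, t_r = negative(l, r), negative(r, l)
        cost += relu(delta_att + t_l @ X[l] - X[l] @ X[r])
        cost += relu(delta_att + t_r @ X[r] - X[l] @ X[r])
    for l, r in rep_pairs:                 # push antonyms apart
        t_l = negative(l, r, furthest=True)
        t_r = negative(r, l, furthest=True)
        cost += relu(delta_rpl + X[l] @ X[r] - X[l] @ t_l)
        cost += relu(delta_rpl + X[l] @ X[r] - X[r] @ t_r)
    # regularisation: stay close to the initial distributional vectors
    cost += lambda_reg * np.linalg.norm(X_init[batch_ids] - B, axis=1).sum()
    return cost
```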
The fine-tuning ATTRACT-REPEL procedure is entirely driven by the input ATTRACT and REPEL sets of constraints. These can be extracted from a variety of semantic databases such as WordNet (Fellbaum, 1998), the Paraphrase Database (Ganitkevitch et al., 2013; Pavlick et al., 2015), or BabelNet (Navigli and Ponzetto, 2012; Ehrmann et al., 2014), as done in prior work (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2016, i.a.). In this work, we investigate another option: extracting constraints without curated knowledge bases in a spectrum of languages by exploiting inherent language-specific properties related to linguistic morphology. This relaxation ensures a wider portability of ATTRACT-REPEL to languages and domains without readily available or adequate resources.

Footnote: We use the hyperparameter values δ_att = 0.6, δ_rpl = 0.0, and λ_reg = 10^−9 from prior work, without fine-tuning. We train all models for 10 epochs with AdaGrad (Duchi et al., 2011).

           |W|          |A|        |R|
English    1,368,891    231,448    45,964
German     1,216,161    648,344    54,644
Italian    541,779      278,974    21,400
Russian    950,783      408,400    32,174

Table 3: Vocabulary sizes and counts of ATTRACT (A) and REPEL (R) constraints.
Extracting ATTRACT Pairs

The core difference between inflectional and derivational morphology can be summarised in a few lines as follows: the former refers to a set of processes through which the word form expresses meaningful syntactic information, e.g., verb tense, without any change to the semantics of the word; the latter refers to the formation of new words with semantic shifts in meaning (Schone and Jurafsky, 2001; Haspelmath and Sims, 2013; Lazaridou et al., 2013; Zeller et al., 2013; Cotterell and Schütze, 2017).

For the ATTRACT constraints, we focus on inflectional rather than on derivational morphology rules, as the former preserve the full meaning of a word, modifying it only to reflect grammatical roles such as verb tense or case markers (e.g., (en_read, en_reads) or (de_katalanisch, de_katalanischer)). This choice is guided by our intent to fine-tune the original vector space in order to improve the embedded semantic relations.

We define two rules for English, widely recognised as morphologically simple (Avramidis and Koehn, 2008; Cotterell et al., 2016b). These are: (R1) if w1, w2 ∈ W_en, where w2 = w1 + ing/ed/s, then add (w1, w2) and (w2, w1) to the set of ATTRACT constraints A. This rule yields pairs such as (look, looks), (look, looking), (look, looked). If w[:−1] is a function which strips the last character from word w, the second rule is: (R2) if w1 ends with the letter e and w1 ∈ W_en and w2 ∈ W_en, where w2 = w1[:−1] + ing/ed, then add (w1, w2) and (w2, w1) to A. This creates pairs such as (create, creating) and (create, created). Naturally, introducing more sophisticated rules is possible in order to cover other special cases and morphological irregularities (e.g., sweep/swept), but in all our EN experiments, A is based on the two simple EN rules R1 and R2 (see the sketch at the end of this subsection).

The other three languages, with more complicated morphology, yield a larger number of rules. In Italian, we rely on sets of rules spanning: (1) regular formation of plurals (libro/libri); (2) regular verb conjugation (aspettare/aspettiamo); (3) regular formation of past participles (aspettare/aspettato); and (4) rules regarding grammatical gender (bianco/bianca). Besides these, another set of rules is used for German and Russian: (5) regular declension (e.g., asiatisch/asiatischem).
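As a minimal sketch, the two English rules R1 and R2 can be implemented as follows; `vocab` is a hypothetical set of word forms standing in for the vocabulary W_en.

```python
# A sketch of the two English ATTRACT rules (R1 and R2).
def extract_en_attract(vocab):
    attract = set()
    for w1 in vocab:
        # R1: w2 = w1 + ing/ed/s, e.g. (look, looking), (look, looked)
        candidates = [w1 + suffix for suffix in ("ing", "ed", "s")]
        # R2: for e-final words, w2 = w1[:-1] + ing/ed, e.g. (create, creating)
        if w1.endswith("e"):
            candidates += [w1[:-1] + suffix for suffix in ("ing", "ed")]
        for w2 in candidates:
            if w2 in vocab:          # add a pair only if both forms are attested
                attract.add((w1, w2))
                attract.add((w2, w1))
    return attract

print(extract_en_attract({"look", "looks", "looking", "create", "creating"}))
```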
Extracting REPEL Pairs

As another source of implicit semantic signals, W also contains words which represent derivational antonyms: e.g., two words that denote concepts with opposite meanings, generated through a derivational process. We use a standard set of EN "antonymy" prefixes: AP_en = {dis, il, un, in, im, ir, mis, non, anti} (Fromkin et al., 2013). If w1, w2 ∈ W_en, where w2 is generated by adding a prefix from AP_en to w1, then (w1, w2) and (w2, w1) are added to the set of REPEL constraints R. This rule generates pairs such as (advantage, disadvantage) and (regular, irregular). An additional rule replaces the suffix -ful with -less, extracting antonyms such as (careful, careless). Following the same principle, we use AP_de = {un, nicht, anti, ir, in, miss}, AP_it = {in, ir, im, anti}, and AP_ru = {не, анти}. For instance, this generates the IT pair (rispettoso, irrispettoso) (see Fig. 1). For DE, we use another rule targeting suffix replacement: -voll is replaced by -los.

We further expand the set of REPEL constraints by transitively combining antonymy pairs from the previous step with inflectional ATTRACT pairs. This step yields additional constraints such as (rispettosa, irrispettosi) (see Fig. 1). The final A and R constraint counts are given in Tab. 3. The full sets of rules are available as supplemental material.
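A minimal sketch of this transitive expansion, assuming `attract` and `repel` are sets of ordered word pairs as produced by the extraction rules above:

```python
# Transitive REPEL expansion: inflectional variants of a word (from ATTRACT)
# inherit its antonyms (from REPEL).
from collections import defaultdict

def expand_repel(attract, repel):
    variants = defaultdict(set)
    for w1, w2 in attract:
        variants[w1].add(w2)
    expanded = set(repel)
    for x, y in repel:
        # (allow, disallow) + (allow, allows) -> (allows, disallow)
        for x2 in variants[x]:
            expanded.update({(x2, y), (y, x2)})
        for y2 in variants[y]:
            expanded.update({(x, y2), (y2, x)})
    return expanded

attract = {("allow", "allows"), ("allows", "allow")}
repel = {("allow", "disallow"), ("disallow", "allow")}
print(("allows", "disallow") in expand_repel(attract, repel))  # True
```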
Training Data and Setup

For each of the four languages we train the skip-gram with negative sampling (SGNS) model (Mikolov et al., 2013) on the latest Wikipedia dump of each language. We induce 300-dimensional word vectors, with the frequency cut-off set to 10. The vocabulary sizes |W| for each language are provided in Tab. 3. We label these collections of vectors SGNS-LARGE.
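For reference, a sketch of this training setup using the gensim library (assumed version ≥ 4.0). The dump path is a placeholder, and the negative-sampling, subsampling, and epoch settings are assumed standard values rather than values stated in this section.

```python
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

# Stream tokenised articles from a Wikipedia dump (placeholder path);
# materialised into a list here only to keep the sketch short.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
sentences = list(wiki.get_texts())

model = Word2Vec(
    sentences=sentences,
    sg=1,              # skip-gram with negative sampling
    vector_size=300,   # 300-dimensional vectors, as stated above
    min_count=10,      # frequency cut-off of 10, as stated above
    negative=15,       # assumed standard settings (see the setup footnote,
    sample=1e-4,       #   which defers to prior work for these values)
    epochs=15,
    workers=4,
)
model.wv.save("sgns-large.en.kv")
```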
Other Starting Distributional Vectors
We also analyse the impact of morph-fitting on other collections of well-known EN word vectors. These vectors have varying vocabulary coverage and are trained with different architectures. We test standard distributional models: Common-Crawl GloVe (Pennington et al., 2014), SGNS vectors (Mikolov et al., 2013) with various contexts (BOW = bag-of-words; DEPS = dependency contexts) and training data (PW = Polyglot Wikipedia from Al-Rfou et al. (2013); 8B = the 8 billion token word2vec corpus), following Levy and Goldberg (2014) and Schwartz et al. (2015). We also test the symmetric-pattern based vectors of Schwartz et al. (2016) (SymPat-Emb), count-based PMI-weighted vectors reduced by SVD (Baroni et al., 2014) (Count-SVD), a model which replaces the context modelling function from CBOW with bidirectional LSTMs (Melamud et al., 2016) (Context2Vec), and two sets of EN vectors trained by injecting multilingual information: BiSkip (Luong et al., 2015) and MultiCCA (Faruqui and Dyer, 2014).

We also experiment with standard well-known distributional spaces in other languages (IT and DE), available from prior work (Dinu et al., 2015; Luong et al., 2015; Vulić and Korhonen, 2016a).

Morph-fixed Vectors
A baseline which utilises an equal amount of knowledge as morph-fitting, termed morph-fixing, fixes the vector of each word to the distributional vector of its most frequent inflectional synonym, tying the vectors of low-frequency words to their more frequent inflections. For each word w, we construct a set of M + 1 words W_w = {w, w'_1, ..., w'_M} consisting of the word w itself and all M words which co-occur with w in the ATTRACT constraints. We then choose the word w'_max from the set W_w with the maximum frequency in the training data, and fix all other word vectors in W_w to its word vector. The morph-fixed vectors (MFIX) serve as our primary baseline, as they outperformed another straightforward baseline based on stemming across all of our intrinsic and extrinsic experiments.

Footnote: Other SGNS parameters were set to standard values (Baroni et al., 2014; Vulić and Korhonen, 2016b). Similar trends in results persist with d = 100 and d = 500.
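A minimal sketch of this baseline, assuming `vectors` maps words to vectors, `attract` is the set of constraint pairs, and `freq` holds corpus frequencies (all constraint words are assumed to have vectors):

```python
# Morph-fixing: tie every word in an inflectional cluster to the vector of
# the cluster's most frequent member.
from collections import defaultdict

def morph_fix(vectors, attract, freq):
    cluster = defaultdict(set)
    for w1, w2 in attract:
        cluster[w1].add(w2)
    fixed = dict(vectors)
    for w, variants in cluster.items():
        W_w = {w} | variants                           # the set W_w from the text
        w_max = max(W_w, key=lambda v: freq.get(v, 0)) # most frequent member
        for v in W_w:
            fixed[v] = vectors[w_max]                  # fix all members to w_max
    return fixed
```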
Morph-fitting Variants

We analyse two variants of morph-fitting: (1) using ATTRACT constraints only (MFIT-A), and (2) using both ATTRACT and REPEL constraints (MFIT-AR).

Evaluation Setup and Datasets
The first set of experiments intrinsically evaluates morph-fitted vector spaces on word similarity benchmarks, using Spearman's rank correlation as the evaluation metric. First, we use the SimLex-999 dataset, as well as SimVerb-3500, a recent EN verb pair similarity dataset providing similarity ratings for 3,500 verb pairs. SimLex-999 was translated to DE, IT, and RU by Leviant and Reichart (2015), who crowdsourced similarity scores from native speakers. We use this dataset for our multilingual evaluation.
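A sketch of this evaluation protocol: Spearman's ρ between gold similarity ratings and cosine similarities. The `pairs` input (a list of (word1, word2, gold) tuples read from SimLex-999 or SimVerb-3500) and the out-of-vocabulary handling are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(vectors, pairs):
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            pred.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    return spearmanr(gold, pred).correlation
```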
Morph-fitting EN Word Vectors

As the first experiment, we morph-fit a wide spectrum of EN distributional vectors induced by various architectures (see Sect. 3). The results on SimLex and SimVerb are summarised in Tab. 4. The results with EN SGNS-LARGE vectors are shown in Fig. 3a. Morph-fitted vectors bring consistent improvements across all experiments, regardless of the quality of the initial distributional space. This finding confirms that the method is robust: its effectiveness does not depend on the architecture used to construct the initial space. To illustrate the improvements, note that the best score on SimVerb for a model trained on running text is achieved by Context2vec (ρ = 0.388); injecting morphological constraints into this vector space results in a gain of 0.071 ρ points.

Experiments on Other Languages
We next extend our experiments to other languages, testing both morph-fitting variants. The results are summarised in Tab. 5, while Fig. 3a-3d show results for the morph-fitted SGNS-LARGE vectors. These scores confirm the effectiveness and robustness of morph-fitting across languages, suggesting that the idea of fitting to morphological constraints is indeed language-agnostic, given the set of language-specific rule-based constraints. Fig. 3 also demonstrates that the morph-fitted vector spaces consistently outperform the morph-fixed ones.

The comparison between MFIT-A and MFIT-AR indicates that both sets of constraints are important for the fine-tuning process. MFIT-A yields consistent gains over the initial spaces, and (consistent) further improvements are achieved by also incorporating the antonymous REPEL constraints. This demonstrates that both types of constraints are useful for semantic specialisation.

Footnote: Unlike other gold standard resources such as WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014), SimLex and SimVerb provided explicit guidelines to discern between semantic similarity and association, so that related but non-similar words (e.g. cup and coffee) have a low rating.

Footnote: Since Leviant and Reichart (2015) re-scored the original EN SimLex, we use their EN SimLex version for consistency.

Vectors                                            SimLex-999    SimVerb-3500
1. SG-BOW2-PW (300) (Mikolov et al., 2013)         .339 → .439   .277 → .381
2. GloVe-6B (300) (Pennington et al., 2014)        .324 → .438   .286 → .405
3. Count-SVD (500) (Baroni et al., 2014)           .267 → .360   .199 → .301
4. SG-DEPS-PW (300) (Levy and Goldberg, 2014)      .376 → .434   .313 → .418
5. SG-DEPS-8B (500) (Bansal et al., 2014)          .373 → .441   .356 → .473
6. MultiCCA-EN (512) (Faruqui and Dyer, 2014)      .314 → .391   .296 → .354
7. BiSkip-EN (256) (Luong et al., 2015)            .276 → .356   .260 → .333
8. SG-BOW2-8B (500) (Schwartz et al., 2015)        .373 → .440   .348 → .441
9. SymPat-Emb (500) (Schwartz et al., 2016)        .381 → .442   .284 → .373
10. Context2Vec (600) (Melamud et al., 2016)       .371 → .440   .388 → .459

Table 4: The impact of morph-fitting (MFIT-AR used) on a representative set of EN vector space models. All results show the Spearman's ρ correlation before and after morph-fitting. The numbers in parentheses refer to the vector dimensionality.

Vectors                                                Distrib.   MFIT-A   MFIT-AR
EN: GloVe-6B (300)                                     .324       .376     .438
EN: SG-BOW2-PW (300)                                   .339       .385     .439
DE: SG-DEPS-PW (300) (Vulić and Korhonen, 2016a)       .267       .318     .325
DE: BiSkip-DE (256) (Luong et al., 2015)               .354       .414     .421
IT: SG-DEPS-PW (300) (Vulić and Korhonen, 2016a)       .237       .351     .391
IT: CBOW5-Wacky (300) (Dinu et al., 2015)              .363       .417     .446

Table 5: Results on multilingual SimLex-999 (EN, DE, and IT) with two morph-fitting variants.
Comparison to Other Specialisation Methods
We also tried using other post-processing specialisation models from the literature in lieu of ATTRACT-REPEL, using the same set of "morphological" synonymy and antonymy constraints. We compare ATTRACT-REPEL to the retrofitting model of Faruqui et al. (2015) and to counter-fitting (Mrkšić et al., 2016). The two baselines were trained for 20 iterations using the suggested settings. The results for EN, DE, and IT are summarised in Fig. 2. They clearly indicate that MFIT-AR outperforms the two other post-processors for each language. We hypothesise that the difference in performance mainly stems from the context-sensitive vector space updates performed by ATTRACT-REPEL. Conversely, the other two models perform pairwise updates which do not consider what effect each update has on the example pair's relation to other word vectors (for a detailed comparison, see Mrkšić et al. (2017b)).

Figure 2: A comparison of morph-fitting (the MFIT-AR variant) with two other standard specialisation approaches using the same set of morphological constraints: retrofitting (RF) (Faruqui et al., 2015) and counter-fitting (CF) (Mrkšić et al., 2016). Spearman's ρ correlation scores on the multilingual SimLex-999 dataset for the same six distributional spaces from Tab. 5.

Besides their lower performance, the two other specialisation models have additional disadvantages compared to the proposed morph-fitting model. First, retrofitting is able to incorporate only synonymy/ATTRACT pairs, while our results demonstrate the usefulness of both types of constraints, both in intrinsic evaluation (Tab. 5) and in downstream tasks (see Fig. 3). Second, counter-fitting is computationally intractable with SGNS-LARGE vectors, as its regularisation term involves the computation of all pairwise distances between words in the vocabulary.
Further Discussion
The simplicity of the language-specific rules we use does come at the cost of occasionally generating incorrect linguistic constraints such as (tent, intent), (prove, improve) or (press, impress). In future work, we will study how to further refine the extracted sets of constraints. We also plan to conduct experiments with gold standard morphological lexicons for languages where such resources exist (Sylak-Glassman et al., 2015; Cotterell et al., 2016b), and to investigate approaches which automatically learn morphological inflections and derivations in different languages as another potential source of morphological constraints (Soricut and Och, 2015; Cotterell et al., 2016a; Faruqui et al., 2016; Kann et al., 2017; Aharoni and Goldberg, 2017, i.a.).

Dialogue State Tracking

Goal-oriented dialogue systems provide conversational interfaces for tasks such as booking flights or finding restaurants. In slot-based systems, application domains are specified using ontologies that define the search constraints which users can express. An ontology consists of a number of slots and their assorted slot values. In a restaurant search domain, sets of slot-values could include PRICE = [cheap, expensive] or FOOD = [Thai, Indian, ...].

The DST model is the first component of modern dialogue pipelines (Young, 2010). It serves to capture the intents expressed by the user at each dialogue turn and to update the belief state. This probability distribution over the possible dialogue states (defined by the domain ontology) is the system's internal estimate of the user's goals. It is used by the downstream dialogue manager component to choose the subsequent system response (Su et al., 2016). The following example shows the true dialogue state in a multi-turn dialogue:
User:
What’s good in the southern part of town? inform(area=south)
System:
Vedanta is the top-rated Indian place.
User:
How about something cheaper? inform(area=south, price=cheap)
System:
Seven Days is very popular. Great hot pot.
User:
What's the address? inform(area=south, price=cheap); request(address)
System:
Seven Days is at 66 Regent Street.
The Dialogue State Tracking Challenge (DSTC) shared task series formalised the evaluation and provided labelled DST datasets (Henderson et al., 2014a,b; Williams et al., 2016). While a plethora of DST models are available based on, e.g., hand-crafted rules (Wang et al., 2014) or conditional random fields (Lee and Eskenazi, 2013), recent DST methodology has seen a shift towards neural-network architectures (Henderson et al., 2014c,d; Zilka and Jurcicek, 2015; Mrkšić et al., 2015; Perez and Liu, 2017; Liu and Perez, 2017; Vodolán et al., 2017; Mrkšić et al., 2017a, i.a.).
Model: Neural Belief Tracker
To detect intents in user utterances, most existing models rely on either (or both): Spoken Language Understanding models, which require large amounts of annotated training data; or hand-crafted, domain-specific lexicons, which try to capture lexical and morphological variation. The Neural Belief Tracker (NBT) is a novel DST model which overcomes both issues by reasoning purely over pre-trained word vectors (Mrkšić et al., 2017a). The NBT learns to compose these vectors into intermediate utterance and context representations. These are then used to decide which of the ontology-defined intents (goals) have been expressed by the user. The NBT model keeps word vectors fixed during training, so that unseen, yet related words can be mapped to the right intent at test time (e.g. northern to north).

Data: Multilingual WOZ 2.0 Dataset
Our DST evaluation is based on the WOZ dataset, released by Wen et al. (2017). In this Wizard-of-Oz setup, two Amazon Mechanical Turk workers assumed the roles of the user and the system, asking for and providing information about restaurants in Cambridge (operating over the same ontology and database used for DSTC2 (Henderson et al., 2014a)). Users typed instead of speaking, removing the need to deal with noisy speech recognition. In DSTC datasets, users would quickly adapt to the system's inability to deal with complex queries. Conversely, the WOZ setup allowed them to use sophisticated language. The WOZ 2.0 release expanded the dataset to 1,200 dialogues (Mrkšić et al., 2017a). In this work, we use translations of this dataset to Italian and German, released by Mrkšić et al. (2017b).
Evaluation Setup
The principal metric we use to measure DST performance is the joint goal accuracy, which represents the proportion of test set dialogue turns where all user goals expressed up to that point of the dialogue were decoded correctly (Henderson et al., 2014a). The NBT models for EN, DE and IT are trained using four variants of the SGNS-LARGE vectors: the initial distributional vectors, the morph-fixed vectors, and the two variants of morph-fitted vectors (see Sect. 3). As shown by Mrkšić et al. (2017b), semantic specialisation of the employed word vectors benefits DST performance across all three languages. However, large gains on SimLex-999 do not always induce correspondingly large gains in downstream performance. In our experiments, we investigate the extent to which morph-fitting improves DST performance, and whether these gains exhibit stronger correlation with intrinsic performance.
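A sketch of the joint goal accuracy metric; the `turns` input, a list of (predicted_goals, gold_goals) per turn with goals accumulated up to that turn, is an assumed representation for illustration.

```python
def joint_goal_accuracy(turns):
    # A turn counts as correct only if the full accumulated goal set matches.
    correct = sum(1 for predicted, gold in turns if predicted == gold)
    return correct / len(turns)

turns = [
    ({"area": "south"}, {"area": "south"}),
    ({"area": "south", "price": "cheap"}, {"area": "south", "price": "cheap"}),
]
print(joint_goal_accuracy(turns))  # 1.0: all goals decoded correctly
```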
Figure 3: An overview of the results (Spearman's ρ correlation) for four languages on SimLex-999 (grey bars, left y axis) and the downstream DST performance (dark bars, right y axis) using SGNS-LARGE vectors (d = 300), see Tab. 3 and Sect. 3; panels: (a) English, (b) German, (c) Italian, (d) Russian. The left y axis measures the intrinsic word similarity performance, while the right y axis provides the scale for the DST performance (there are no DST datasets for Russian).

Results and Discussion
The dark bars (against the right axes) in Fig. 3 show the DST performance of NBT models making use of the four vector collections. IT and DE benefit from both kinds of morph-fitting: IT performance increases by 4% (MFIT-A) and DE performance rises even more, by 6% (MFIT-AR), setting new state-of-the-art scores for both datasets. The morph-fixed vectors do not enhance DST performance, probably because fixing word vectors to their highest-frequency inflectional form eliminates useful semantic content encoded in the original vectors. On the other hand, morph-fitting makes use of this information, supplementing it with semantic relations between different morphological forms. These conclusions are in line with the SimLex gains, where morph-fitting outperforms both distributional and morph-fixed vectors.

English performance shows little variation across the four word vector collections investigated here. This corroborates our intuition that, as a morphologically simpler language, English stands to gain less from fine-tuning the morphological variation for downstream applications. This result again points at the discrepancy between intrinsic and extrinsic evaluation: the considerable gains in SimLex performance do not necessarily induce similar gains in downstream performance. Additional discrepancies between SimLex and downstream DST performance are detected for German and Italian. While we observe a slight drop in SimLex performance with the DE MFIT-AR vectors compared to the MFIT-A ones, their relative performance is reversed in the DST task. On the other hand, we see the opposite trend in Italian, where the MFIT-A vectors score lower than the MFIT-AR vectors on SimLex, but higher on the DST task. In summary, we believe these results show that SimLex is not a perfect proxy for downstream performance in language understanding tasks. Regardless, its performance does correlate with downstream performance to a large extent, providing a useful indicator of the usefulness of specific word vector spaces for extrinsic tasks such as DST.

Semantic Specialisation
A standard approach to incorporating external information into vector spaces is to pull the representations of similar words closer together. Some models integrate such constraints into the training procedure, modifying the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015), or using a variant of the SGNS-style objective (Liu et al., 2015; Osborne et al., 2016). Another class of models, popularly termed retrofitting, injects lexical knowledge from available semantic databases (e.g., WordNet, PPDB) into pre-trained word vectors (Faruqui et al., 2015; Jauhar et al., 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016). Morph-fitting falls into the latter category. However, instead of resorting to curated knowledge bases, and experimenting solely with English, we show that the morphological richness of any language can be exploited as a source of inexpensive supervision for fine-tuning vector spaces, at the same time specialising them to better reflect true semantic similarity, and learning more accurate representations for low-frequency words.
Word Vectors and Morphology
The use of morphological resources to improve the representations of morphemes and words is an active area of research. The majority of proposed architectures encode morphological information, provided either by gold standard morphological resources (Sylak-Glassman et al., 2015) such as CELEX (Baayen et al., 1995) or by an external analyser such as Morfessor (Creutz and Lagus, 2007), jointly with distributional information at training time in the language modelling (LM) objective (Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014; Cotterell and Schütze, 2015; Bhatia et al., 2016, i.a.). The key idea is to learn a morphological composition function (Lazaridou et al., 2013; Cotterell and Schütze, 2017) which synthesises the representation of a word given the representations of its constituent morphemes. Contrary to our work, these models typically coalesce all lexical relations. Another class of models, operating at the character level, shares a similar methodology: such models compose token-level representations from subcomponent embeddings (subwords, morphemes, or characters) (dos Santos and Zadrozny, 2014; Ling et al., 2015; Cao and Rei, 2016; Kim et al., 2016; Wieting et al., 2016; Verwimp et al., 2017, i.a.).

In contrast to prior work, our model decouples the use of morphological information, now provided in the form of inflectional and derivational rules transformed into constraints, from the actual training. This pipelined approach results in a simpler, more portable model. In spirit, our work is similar to Cotterell et al. (2016b), who formulate the idea of post-training specialisation in a generative Bayesian framework. Their work uses gold morphological lexicons; we show that competitive performance can be achieved using a non-exhaustive set of simple rules. Our framework facilitates the inclusion of antonyms at no extra cost and naturally extends to constraints from other sources (e.g., WordNet) in future work. Another practical difference is that we focus on similarity and evaluate morph-fitting in a well-defined downstream task where the artefacts of the distributional hypothesis are known to prompt statistical system failures.
Conclusion

We have presented a novel morph-fitting method which injects morphological knowledge, in the form of linguistic constraints, into word vector spaces. The method makes use of implicit semantic signals encoded in inflectional and derivational rules which describe the morphological processes in a language. The results in intrinsic word similarity tasks show that morph-fitting improves vector spaces induced by distributional models across four languages. Finally, we have shown that the use of morph-fitted vectors boosts the performance of downstream language understanding models which rely on word representations as features, especially for morphologically rich languages such as German.

Future work will focus on other potential sources of morphological knowledge, on porting the framework to other morphologically rich languages and downstream tasks, and on further refinements of the post-processing specialisation algorithm and the constraint selection.
Acknowledgments
This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909). RR is supported by the Intel-ICRI grant: Hybrid Models for Minimally Supervised Information Extraction from Conversations. The authors are grateful to the anonymous reviewers for their helpful suggestions.

References
Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of ACL.
Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL.
Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical machine translation. In Proceedings of ACL.
R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX Lexical Database. Linguistic Data Consortium.
Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL.
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL.
Parminder Bhatia, Robert Guthrie, and Jacob Eisenstein. 2016. Morphological priors for probabilistic neural word embeddings. In Proceedings of EMNLP.
Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Proceedings of ECML-PKDD.
Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research.
Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP.
Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016a. The SIGMORPHON 2016 shared task: Morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.
Ryan Cotterell and Hinrich Schütze. 2015. Morphological word-embeddings. In Proceedings of NAACL-HLT.
Ryan Cotterell and Hinrich Schütze. 2017. Joint semantic synthesis and morphological analysis of the derived word. Transactions of the ACL.
Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016b. Morphological smoothing and extrapolation of word embeddings. In Proceedings of ACL.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP).
James Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, School of Informatics, University of Edinburgh.
Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR (Workshop Papers).
Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of ICML.
John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
Maud Ehrmann, Francesco Cecconi, Daniele Vannella, John Philip McCrae, Philipp Cimiano, and Roberto Navigli. 2014. Representing multilingual data as linked data: The case of BabelNet 2.0. In Proceedings of LREC.
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT.
Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.
Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of NAACL-HLT.
Christiane Fellbaum. 1998. WordNet. MIT Press.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems.
Victoria Fromkin, Robert Rodman, and Nina Hyams. 2013. An Introduction to Language, 10th Edition.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT.
Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of EMNLP.
Zellig S. Harris. 1954. Distributional structure. Word.
Martin Haspelmath and Andrea D. Sims. 2013. Understanding Morphology.
Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The Second Dialog State Tracking Challenge. In Proceedings of SIGDIAL.
Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014b. The Third Dialog State Tracking Challenge. In Proceedings of IEEE SLT.
Matthew Henderson, Blaise Thomson, and Steve Young. 2014c. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Proceedings of IEEE SLT.
Matthew Henderson, Blaise Thomson, and Steve Young. 2014d. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGDIAL.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL-HLT.
Anders Johannsen, Héctor Martínez Alonso, and Anders Søgaard. 2015. Any-language frame-semantic parsing. In Proceedings of EMNLP.
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. Neural multi-source morphological reinflection. In Proceedings of EACL.
Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of EMNLP.
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of AAAI.
Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositional-ly derived representations of morphologically complex words in distributional semantics. In Proceedings of ACL.
Sungjin Lee and Maxine Eskenazi. 2013. Recipe for building robust spoken dialog state trackers: Dialog State Tracking Challenge system description. In Proceedings of SIGDIAL.
Ira Leviant and Roi Reichart. 2015. Separated by an un-common language: Towards judgment language informed vector space modeling. CoRR, abs/1508.00106.
Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL.
Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of EMNLP.
Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proceedings of EACL.
Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL.
Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of CoNLL.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.
Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Proceedings of NIPS.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of ACL.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Tsung-Hsien Wen, and Steve Young. 2017a. Neural Belief Tracker: Data-driven dialogue state tracking. In Proceedings of ACL.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT.
Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017b. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. arXiv preprint.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence.
Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL.
Dominique Osborne, Shashi Narayan, and Shay Cohen. 2016. Encoding prior knowledge with eigenword embeddings. Transactions of the ACL.
Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of ACL.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.
Julien Perez and Fei Liu. 2017. Dialog state tracking, a machine reading approach using memory network. In Proceedings of EACL.
Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING.
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In Proceedings of NAACL.
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL.
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2016. Symmetric patterns and coordinations: Fast and enhanced representations of verbs and adjectives. In Proceedings of NAACL-HLT.
Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of NAACL-HLT.
Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of ACL.
John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of ACL.
Reut Tsarfaty, Djamé Seddah, Yoav Goldberg, Sandra Kübler, Marie Candito, Jennifer Foster, Yannick Versley, Ines Rehbein, and Lamia Tounsi. 2010. Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL Workshop on Statistical Parsing of Morphologically-Rich Languages.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
Lyan Verwimp, Joris Pelemans, Hugo Van hamme, and Patrick Wambacq. 2017. Character-word LSTM language models. In Proceedings of EACL.
Miroslav Vodolán, Rudolf Kadlec, and Jan Kleindienst. 2017. Hybrid dialog state tracker with ASR features. In Proceedings of EACL.
Ivan Vulić and Anna Korhonen. 2016a. Is "universal syntax" universally useful for learning distributed word representations? In Proceedings of ACL.
Ivan Vulić and Anna Korhonen. 2016b. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL.
Wang et al. 2014. In Proceedings of AAAI, pages 1112–1119.
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In Proceedings of EMNLP.
Jason D. Williams, Antoine Raux, and Matthew Henderson. 2016. The Dialog State Tracking Challenge series: A review. Dialogue & Discourse.
Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM.
Steve Young. 2010. Cognitive user interfaces. IEEE Signal Processing Magazine.
Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL.
Britta Zeller, Jan Šnajder, and Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German. In Proceedings of ACL.
Lukas Zilka and Filip Jurcicek. 2015. Incremental LSTM-based dialog state tracker. In Proceedings of ASRU.
Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP.
Supplementary Material
Morphological Rules
In this supplemental material, we provide a short comprehensive overview of the simple language-specific morphological rules in English (EN), German (DE), Italian (IT), and Russian (RU). These rules were used to build the sets of synonymous ATTRACT and antonymous REPEL constraints for our morph-fitting fine-tuning procedure. As discussed in the paper, the linguistic constraints extracted from the rules require only a comprehensive list of vocabulary words in each language. A native speaker of each language used in our experiments is able to easily come up with these sets of morphological rules (or at least with a reasonable subset of rules) without any linguistic training. What is more, the rules for German, Italian, and Russian were created by non-native and non-fluent speakers who have only a passive or limited knowledge of the three languages, exemplifying the simplicity and portability of the fine-tuning approach based on this shallow "morphological supervision". The simplicity is also confirmed by the short time needed to compile the rules, ranging from a few minutes for English to approximately two hours for Russian.

Different languages differ in their "morphological richness" (e.g., declension, verb conjugation, plural forming, gender), which consequently leads to a varying number of rules in each language. However, all four languages in our study display morphological regularities described by simple morphological rules that are exploited to build sets of ATTRACT and REPEL linguistic constraints in each language from scratch.

Vocabularies W in the four languages are labeled W_en, W_de, W_it, W_ru. We add the pairs (w1, w2) and (w2, w1) generated by the rules to the sets of constraints iff both w1, w2 ∈ W. After we generate all such constraints, since some constraints may have been generated by more than one rule, we remove all duplicates from the respective sets of ATTRACT and REPEL constraints.

Footnote: Note that the rules for extracting ATTRACT constraints were additionally used to generate the Morph-SimLex evaluation set, also provided as supplemental material.

Before we start, we define two simple functions: (i) the function w[:−N] strips the last N characters from the word w; (ii) the function w.ew(sub) tests if the word w ends with the sequence of characters sub. For instance, create[:−1] returns creat, while create.ew('s') returns False and create.ew('e') returns True.

English Rules

Inflectional Synonymy: ATTRACT
As discussed in the paper, we rely on only two simple inflectional morphological rules in English:

- w2 = w1 + 's'/'ed'/'ing'. This rule yields constraints such as (speak, speaking), (turtle, turtles), or (clean, cleaned).
- If w1.ew('e'), then w2 = w1[:−1] + 'ed'/'ing'. This rule yields constraints such as (create, creating) or (generate, generated).
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes in English: AP_en = {'dis', 'il', 'un', 'in', 'im', 'ir', 'mis', 'non', 'anti'}. We rely on the following derivational rules to extract REPEL pairs:

- w2 = ap + w1, where ap ∈ AP_en. This rule yields constraints such as (mature, immature), (allow, disallow) or (regularity, irregularity).
- If w1.ew('ful'), then w2 = w1[:−3] + 'less'. This rule yields constraints such as (cheerful, cheerless).

As mentioned in the paper, for all four languages we further expand the set of REPEL constraints by transitively combining antonymy pairs with inflectional ATTRACT pairs. In simple words, the friend of my enemy is my enemy. This means that, given an ATTRACT pair (allow, allows) and a REPEL pair (allow, disallow), we extract another REPEL pair (allows, disallow).
German Rules

Inflectional Synonymy: ATTRACT
Being morphologically richer than English, the German language naturally requires more rules to describe its (inflectional) morphological richness and variation. First, we capture the regular declension of nouns and adjectives by the following heuristic:

- Generate a set of words W_w = {w, w2 | w2 = w + 'e'/'em'/'en'/'er'/'es'}; take the Cartesian product on W_w × W_w and then exclude (w_i, w_i) pairs with identical words. This rule generates pairs such as (schottisch, schottische), (schottischem, schottischen).

The second set of rules describes regular verb morphology, i.e., verb conjugation in the present and past tense, and the formation of regular past participles. This set of rules may be expressed as:

- If w.ew('en'), then w' = w[:−2]. If w'.ew('t'), then generate a set of words W_w = {w, w2 | w2 = w' + 'e'/'st'/'ete'/'etest'/'etet'/'eten', w2 = 'ge' + w' + 'et'}; otherwise (if not w'.ew('t')), generate a set of words W_w = {w, w2 | w2 = w' + 'e'/'st'/'te'/'test'/'tet'/'ten', w2 = 'ge' + w' + 't'}. We then take the Cartesian product on W_w × W_w; again, all pairs with identical words are discarded. This rule yields pairs such as (machen, machten), (mache, gemacht), (kaufst, kauft), or (arbeite, arbeitete) and (arbeiten, gearbeitet).

Another set of rules targets the regular formation of plural nouns:

- If w.ew('ei') or w.ew('heit') or w.ew('keit') or w.ew('schaft') or w.ew('ung'), then w2 = w + 'en'. This rule yields pairs such as (wahrheit, wahrheiten) or (gemeinschaft, gemeinschaften).
- If w.ew('in'), then w2 = w + 'nen'. This rule generates pairs such as (lehrerin, lehrerinnen) or (lektorin, lektorinnen).
- If w.ew('a'/'i'/'o'/'u'/'y'), then w2 = w + 's'. This rule yields pairs such as (auto, autos).
- If w.ew('e'), then w2 = w + 'n'. This rule yields pairs such as (postkarte, postkarten).
- w2 = lumlaut(w) + 'er', where the function lumlaut(w) replaces the last occurrence of the letter 'a', 'o' or 'u' with 'ä', 'ö' or 'ü'. This rule generates pairs such as (wörterbuch, wörterbücher) or (stadt, städter).
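A minimal sketch of the declension heuristic above; the function name and the toy `vocab` (standing in for W_de) are illustrative assumptions.

```python
# Build the candidate set W_w for a base form and emit all ordered pairs of
# attested forms.
from itertools import product

def de_declension_pairs(w, vocab):
    W_w = {w} | {w + suffix for suffix in ("e", "em", "en", "er", "es")}
    attested = sorted(form for form in W_w if form in vocab)
    # Cartesian product W_w x W_w, excluding pairs of identical words
    return {(w1, w2) for w1, w2 in product(attested, attested) if w1 != w2}

vocab = {"schottisch", "schottische", "schottischem", "schottischen"}
print(sorted(de_declension_pairs("schottisch", vocab)))
```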
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes in German: AP_de = {'un', 'nicht', 'anti', 'ir', 'in', 'miss'}. We rely on the following derivational rules to extract REPEL pairs in German:

- w2 = ap + w1, where ap ∈ AP_de. This rule yields constraints such as (aktiv, inaktiv), (wandelbar, unwandelbar) or (zyklone, antizyklone).
- If w1.ew('voll'), then w2 = w1[:−4] + 'los'. This rule yields constraints such as (geschmackvoll, geschmacklos).

The set of REPEL constraints is then again transitively expanded, yielding pairs such as (relevant, irrelevanter) or (aktivem, inaktiv).

Italian Rules

Inflectional Synonymy: ATTRACT
The first set of rules aims at capturing the regular plural forming in Italian (e.g., libro, libri) and regular differences in gender (e.g., rapido, rapida). We rely on a simple heuristic which can be expressed as follows:

- If w.ew('a'/'e'/'o'/'i'), then generate a set of words W_w = {w2 | w2 = w[:−1] + 'a'/'e'/'o'/'i'}, and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (nero, neri) or (generazione, generazioni).
- If w1.ew('ga'/'ca'), then w2 = w1[:−1] + 'he'. This rule generates pairs such as (tartaruga, tartarughe) or (bianca, bianche).
- If w1.ew('go'), then w2 = w1[:−1] + 'hi'. This rule generates pairs such as (albergo, alberghi).

The second set of rules targets regular verb conjugation in Italian and the formation of regular past participles. The following rules are used:

- If w.ew('are'), then generate a set of words W_w = {w, w2 | w2 = w[:−3] + 'iamo'/'ate'/'ano'/'o'/'i'/'a'/'ato'/'ata'/'ati'/'ate'}; take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule results in pairs such as (aspettare, aspettiamo).
- If w.ew('ere'), then generate a set of words W_w = {w, w2 | w2 = w[:−3] + 'iamo'/'ete'/'ono'/'o'/'i'/'e'/'uto'/'uta'/'uti'/'ute'}; take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule results in pairs such as (ricevere, ricevete) or (riceve, ricevuto).
- If w.ew('ire'), then generate a set of words W_w = {w, w2 | w2 = w[:−3] + 'iamo'/'ite'/'ono'/'o'/'i'/'e'/'ito'/'ita'/'iti'/'ite'}; take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule results in pairs such as (dormire, dormono) or (dormi, dormita).
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes: AP_it = {'in', 'ir', 'im', 'anti'}. The following derivational rule is used to extract REPEL pairs:

- w2 = ap + w1, where ap ∈ AP_it. This rule yields constraints such as (attivo, inattivo) or (rispettosa, irrispettosa).

The set of REPEL constraints was then expanded as before, e.g., generating additional pairs such as (rispettosa, irrispettosi).
Russian Rules

Inflectional Synonymy: ATTRACT
The first set of rules in Russian targets the regular forming of plurals. A few simple heuristics are used as follows:

- w2 = w1 + 'и'/'ы'. This rule yields pairs such as (альбом, альбомы), transliterated as (al'bom, al'bomy).
- If w1.ew('а'/'я'/'ь'), then w2 = w1[:−1] + 'и'/'ы'. This rule generates pairs such as (песня, песни): (pesnja, pesni).
- If w1.ew('о'), then w2 = w1[:−1] + 'а'. This rule generates pairs such as (письмо, письма): (pis'mo, pis'ma).
- If w1.ew('е'), then w2 = w1[:−1] + 'я'. This rule generates pairs such as (платье, платья): (plat'e, plat'ja).

The next set of rules targets regular conjugation of Russian verbs as well as the regular formation of past participles. We again build a simple heuristic to extract ATTRACT pairs:

- If w.ew('ти'/'ть'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'у'/'ю'/'ешь'/'ишь'/'ет'/'ит'/'ем'/'им', w2 = w[:−2] + 'ете'/'ите'/'ут'/'ют'/'ат'/'ят', w2 = w[:−2] + 'нный'/'нная'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (варить, варите) or (заканчиваю, заканчивают), transliterated as (varit', varite) and (zakanchivaju, zakanchivajut).

Following that, we also utilise the regularities regarding declension processes in Russian, captured by the following rules:

- If w.ew('а'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'е'/'у'/'ой'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (работа, работой): (rabota, rabotoj).
- If w.ew('я'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'е'/'ю'/'ей'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (линия, линию): (linija, liniju).
- If w.ew('ы'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'ам'/'ами'/'ах'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (работам, работами): (rabotam, rabotami).
- If w.ew('и'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'ь'/'ям'/'ями'/'ях'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words.

Yet another set of rules targets regular adjective comparison and gender:

- If w.ew('ый'/'ой'/'ий'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'ь'/'ее'/'ые'}. This rule yields pairs such as (быстрый, быстрее): (bystryj, bystree).
- If w.ew('ая'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'ее'/'ые'/'ый'}. This rule yields pairs such as (новая, новые): (novaja, novye).
- If w.ew('ое'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'ый'/'ые'/'ая'}. This rule yields pairs such as (новое, новый): (novoe, novyj).
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes in Russian: AP_ru = {'не', 'анти'}, and simply use the following rule:

- w2 = ap + w1, where ap ∈ AP_ru. This rule yields constraints such as (адекватный, неадекватный) or (вирусная, антивирусная), transliterated as (adekvatnyj, neadekvatnyj) and (virusnaja, antivirusnaja).

The further expansion of REPEL constraints yields pairs such as (адекватный, неадекватная): (adekvatnyj, neadekvatnaja).

Further Discussion

We stress that the listed rules for all four languages are non-exhaustive.