Morph-fitting: Fine-Tuning Word Vector Spaces with Simple Language-Specific Rules
Ivan Vulić, Nikola Mrkšić, Roi Reichart, Diarmuid Ó Séaghdha, Steve Young, Anna Korhonen
University of Cambridge; Technion, Israel Institute of Technology; Apple Inc.
{iv250,nm480,sjy11,alk23}@cam.ac.uk, [email protected], [email protected]
Abstract
Morphologically rich languages accentuate two properties of distributional vector space models: 1) the difficulty of inducing accurate representations for low-frequency word forms; and 2) insensitivity to distinct lexical relations that have similar distributional signatures. These effects are detrimental for language understanding systems, which may infer that inexpensive is a rephrasing for expensive or may not associate acquire with acquires. In this work, we propose a novel morph-fitting procedure which moves past the use of curated semantic lexicons for improving distributional vector spaces. Instead, our method injects morphological constraints generated using simple language-specific rules, pulling inflectional forms of the same word close together and pushing derivational antonyms far apart. In intrinsic evaluation over four languages, we show that our approach: 1) improves low-frequency word estimates; and 2) boosts the semantic quality of the entire word vector collection. Finally, we show that morph-fitted vectors yield large gains in the downstream task of dialogue state tracking, highlighting the importance of morphology for tackling long-tail phenomena in language understanding tasks.

Introduction

Word representation learning has become a research area of central importance in natural language processing (NLP), with its usefulness demonstrated across many application areas such as parsing (Chen and Manning, 2014; Johannsen et al., 2015), machine translation (Zou et al., 2013), and many others (Turian et al., 2010; Collobert et al., 2011). Most prominent word representation techniques are grounded in the distributional hypothesis (Harris, 1954), relying on word co-occurrence information in large textual corpora (Curran, 2004; Turney and Pantel, 2010; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Levy and Goldberg, 2014; Schwartz et al., 2015, i.a.).

Morphologically rich languages, in which "substantial grammatical information ... is expressed at word level" (Tsarfaty et al., 2010), pose specific challenges for NLP. This is not always considered when techniques are evaluated on languages such as English or Chinese, which do not have rich morphology. In the case of distributional vector space models, morphological complexity brings two challenges to the fore:
1. Estimating Rare Words:
A single lemma can have many different surface realisations. Naively treating each realisation as a separate word leads to sparsity problems and a failure to exploit their shared semantics. On the other hand, lemmatising the entire corpus can obfuscate the differences that exist between different word forms even though they share some aspects of meaning.
2. Embedded Semantics:
Morphology can encode semantic relations such as antonymy (e.g. literate and illiterate, expensive and inexpensive) or (near-)synonymy (north, northern, northerly).

In this work, we tackle the two challenges jointly by introducing a resource-light vector space fine-tuning procedure termed morph-fitting. The proposed method does not require curated knowledge bases or gold lexicons. Instead, it makes use of the observation that morphology implicitly encodes semantic signals pertaining to synonymy (e.g., the German word inflections katalanisch, katalanischem, katalanischer denote the same semantic concept in different grammatical roles) and antonymy (e.g., mature vs. immature), capitalising on the proliferation of word forms in morphologically rich languages. Formalised as an instance of the post-processing semantic specialisation paradigm (Faruqui et al., 2015; Mrkšić et al., 2016), morph-fitting is steered by a set of linguistic constraints derived from simple language-specific rules which describe (a subset of) morphological processes in a language. The constraints emphasise similarity on one side (e.g., by extracting morphological synonyms) and antonymy on the other (by extracting morphological antonyms), see Fig. 1 and Tab. 2. The key idea of the fine-tuning process is to pull synonymous examples described by the constraints closer together in the transformed vector space, while at the same time pushing antonymous examples away from each other. The explicit post-hoc injection of morphological constraints enables: a) the estimation of more accurate vectors for low-frequency words which are linked to their high-frequency forms by the constructed constraints, tackling the data sparsity problem; and b) specialising the distributional space to distinguish between similarity and relatedness (Kiela et al., 2015), thus supporting language understanding applications such as dialogue state tracking (DST).

en_expensive    de_teure           it_costoso     en_slow    de_langsam        it_lento      en_book      de_buch      it_libro
costly          teuren             dispendioso    fast       allmählich        lentissimo    books        sachbuch     romanzo
costlier        kostspielige       remunerativo   slowness   rasch             lenta         memoir       buches       racconto
cheaper         aufwändige         redditizio     slower     gemächlich        inesorabile   novel        romandebüt   volumetto
prohibitively   kostenintensive    rischioso      slowed     schnell           rapidissimo   storybooks   büchlein     saggio
pricey          aufwendige         costosa        slowing    explosionsartig   graduale      blurb        pamphlet     ecclesiaste
-------------------------------------------------------------------------------------------------------------------------------
expensiveness   teures             costosa        slowing    langsamer         lenti         booked       bücher       libri
costly          teuren             costose        slowed     langsames         lente         rebook       büch         libra
costlier        teurem             costosi        slowness   langsame          lenta         booking      büche        librare
ruinously       teurer             dispendioso    slows      langsamem         veloce        rebooked     büches       libre
unaffordable    teurerer           dispendiose    idle       langsamen         rapido        books        büchen       librano

Table 1: The nearest neighbours of three example words (expensive, slow and book) in English, German and Italian before (top) and after (bottom) morph-fitting.
As a post-processor, morph-fitting allows the integration of morphological rules with any distributional vector space in any language: it treats an input distributional word vector space as a black box and fine-tunes it so that the transformed space reflects the knowledge coded in the input morphological constraints (e.g., the Italian words rispettoso and irrispettosa should be far apart in the transformed vector space, see Fig. 1). Tab. 1 illustrates the effects of morph-fitting by qualitative examples in three languages: the vast majority of nearest neighbours are "morphological" synonyms.

Footnote: For instance, the vector for the word katalanischem, which occurs only 9 times in the German Wikipedia, will be pulled closer to the more reliable vectors for katalanisch and katalanischer, with frequencies of 2097 and 1383 respectively.

Footnote: Representation models that do not distinguish between synonyms and antonyms may have grave implications in downstream language understanding applications such as spoken dialogue systems: a user looking for 'an affordable Chinese restaurant in west Cambridge' does not want a recommendation for 'an expensive Thai place in east Oxford'.

Figure 1: Morph-fitting in Italian. Representations for rispettoso, rispettosa, rispettosi (EN: respectful) are pulled closer together in the vector space (solid lines; ATTRACT constraints). At the same time, the model pushes them away from their antonyms (dashed lines; REPEL constraints) irrispettoso, irrispettosa, irrispettosi (EN: disrespectful), obtained through morphological affix transformation captured by language-specific rules (e.g., adding the prefix ir- typically negates the base word in Italian).

We demonstrate the efficacy of morph-fitting in four languages (English, German, Italian, Russian), yielding large and consistent improvements on benchmarking word similarity evaluation sets such as SimLex-999 (Hill et al., 2015), its multilingual extension (Leviant and Reichart, 2015), and SimVerb-3500 (Gerz et al., 2016). The improvements are reported for all four languages and for a variety of input distributional spaces, verifying the robustness of the approach.

We then show that incorporating morph-fitted vectors into a state-of-the-art neural-network DST model results in improved tracking performance, especially for morphologically rich languages. We report an improvement of 4% on Italian and 6% on German when using morph-fitted vectors instead of the distributional ones, setting new state-of-the-art DST performance for the two datasets.

Footnote: There are no readily available DST datasets for Russian.
Morph-fitting: Methodology
Preliminaries
In this work, we focus on four languages with varying levels of morphological complexity: English (EN), German (DE), Italian (IT), and Russian (RU). These correspond to the languages of the Multilingual SimLex-999 dataset. Vocabularies W_en, W_de, W_it, W_ru are compiled by retaining all word forms from the four Wikipedias with word frequency over 10, see Tab. 3. We then extract sets of linguistic constraints from these (large) vocabularies using a set of simple language-specific if-then-else rules, see Tab. 2. These constraints (Sect. 2.2) are used as input for the vector space post-processing ATTRACT-REPEL algorithm (outlined in Sect. 2.1).

Footnote: A native speaker can easily come up with these sets of morphological rules (or at least with a reasonable subset of them) without any linguistic training. What is more, the rules for DE, IT, and RU were created by non-native, non-fluent speakers with a limited knowledge of the three languages, exemplifying the simplicity and portability of the approach.
The ATTRACT-REPEL Model

The ATTRACT-REPEL model, proposed by Mrkšić et al. (2017b), is an extension of the PARAGRAM procedure proposed by Wieting et al. (2015). It provides a generic framework for incorporating similarity constraints (e.g. successful and accomplished) and antonymy constraints (e.g. nimble and clumsy) into pre-trained word vectors. Given the initial vector space and collections A and R of ATTRACT and REPEL constraints, the model gradually modifies the space to bring the designated word vectors closer together or further apart. The method's cost function consists of three terms. The first term pulls the ATTRACT examples (x_l, x_r) ∈ A closer together. If B_A denotes the current mini-batch of ATTRACT examples, this term can be expressed as:

A(\mathcal{B}_A) = \sum_{(x_l, x_r) \in \mathcal{B}_A} \Big( \mathrm{ReLU}\big(\delta_{att} + x_l t_l - x_l x_r\big) + \mathrm{ReLU}\big(\delta_{att} + x_r t_r - x_l x_r\big) \Big)

where δ_att is the similarity margin which determines how much closer synonymous vectors should be to each other than to each of their respective negative examples, and ReLU(x) = max(0, x) is the standard rectified linear unit (Nair and Hinton, 2010). The 'negative' example t_i for each word x_i in any ATTRACT pair is the word vector closest to x_i among the examples in the current mini-batch (distinct from its target synonym and x_i itself). This means that this term forces synonymous words from the in-batch ATTRACT constraints to be closer to one another than to any other word in the current mini-batch.

English                    German                        Italian
(discuss, discussed)       (schottisch, schottischem)    (golfo, golfi)
(laugh, laughing)          (damalige, damaligen)         (minato, minata)
(pacifist, pacifists)      (kombiniere, kombinierte)     (mettere, metto)
(evacuate, evacuated)      (schweigt, schweigst)         (crescono, cresci)
(evaluate, evaluates)      (hacken, gehackt)             (crediti, credite)
(dressed, undressed)       (stabil, unstabil)            (abitata, inabitato)
(similar, dissimilar)      (geformtes, ungeformt)        (realtà, irrealtà)
(formality, informality)   (relevant, irrelevant)        (attuato, inattuato)

Table 2: Example synonymous (inflectional; top) and antonymous (derivational; bottom) constraints.
The second term pushes antonyms away from each other. If B_R is the current mini-batch of REPEL constraints (x_l, x_r) ∈ B_R, this term can be expressed as follows:

R(\mathcal{B}_R) = \sum_{(x_l, x_r) \in \mathcal{B}_R} \Big( \mathrm{ReLU}\big(\delta_{rpl} + x_l x_r - x_l t_l\big) + \mathrm{ReLU}\big(\delta_{rpl} + x_l x_r - x_r t_r\big) \Big)

In this case, each word's 'negative' example is the (in-batch) word vector furthest away from it (and distinct from the word's target antonym). The intuition is that we want antonymous words from the input REPEL constraints to be further away from each other than from any other word in the current mini-batch; δ_rpl is now the repel margin.

The final term of the cost function serves to retain the abundance of semantic information encoded in the starting distributional space. If x_i^init is the initial distributional vector of word x_i and V(B) is the set of all vectors present in the given mini-batch, this term (per mini-batch) is expressed as follows:

\mathrm{Reg}(\mathcal{B}_A, \mathcal{B}_R) = \sum_{x_i \in V(\mathcal{B}_A \cup \mathcal{B}_R)} \lambda_{reg} \, \big\| x_i^{init} - x_i \big\|_2

where λ_reg is the L2 regularisation constant. This term effectively pulls word vectors towards their initial (distributional) values, ensuring that relations encoded in the initial vectors persist as long as they do not contradict the newly injected ones.
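To make the three terms concrete, here is a minimal NumPy sketch of one mini-batch of the cost (forward pass only; the actual model minimises this cost with AdaGrad). The function name, the integer index pairs, and the assumption of unit-normalised vectors are illustrative choices of this sketch, not the released implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def attract_repel_cost(att_pairs, rep_pairs, X, X_init,
                       delta_att=0.6, delta_rpl=0.0, lambda_reg=1e-9):
    """One mini-batch of the ATTRACT-REPEL cost (a sketch).

    att_pairs / rep_pairs: integer index pairs into the rows of X.
    X / X_init: current and initial word vectors, assumed unit-normalised,
    so a dot product x.y acts as cosine similarity.
    """
    att_pairs = np.asarray(att_pairs)
    rep_pairs = np.asarray(rep_pairs)
    batch_ids = np.unique(np.concatenate([att_pairs.ravel(), rep_pairs.ravel()]))
    B = X[batch_ids]                       # all vectors in the mini-batch

    def negative(i, partner, furthest=False):
        sims = B @ X[i]                    # similarity to every in-batch vector
        banned = np.isin(batch_ids, [i, partner])
        sims[banned] = np.inf if furthest else -np.inf
        j = sims.argmin() if furthest else sims.argmax()
        return B[j]

    cost = 0.0
    for l, r in att_pairs:                 # pull synonyms together
        t_l, t_r = negative(l, r), negative(r, l)
        cost += relu(delta_att + t_l @ X[l] - X[l] @ X[r])
        cost += relu(delta_att + t_r @ X[r] - X[l] @ X[r])
    for l, r in rep_pairs:                 # push antonyms apart
        t_l = negative(l, r, furthest=True)
        t_r = negative(r, l, furthest=True)
        cost += relu(delta_rpl + X[l] @ X[r] - X[l] @ t_l)
        cost += relu(delta_rpl + X[l] @ X[r] - X[r] @ t_r)
    # regularisation: stay close to the initial distributional vectors
    cost += lambda_reg * np.linalg.norm(X_init[batch_ids] - B, axis=1).sum()
    return cost
```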
The fine-tuning ATTRACT-REPEL procedure is entirely driven by the input ATTRACT and REPEL sets of constraints. These can be extracted from a variety of semantic databases such as WordNet (Fellbaum, 1998), the Paraphrase Database (Ganitkevitch et al., 2013; Pavlick et al., 2015), or BabelNet (Navigli and Ponzetto, 2012; Ehrmann et al., 2014), as done in prior work (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2016, i.a.). In this work, we investigate another option: extracting constraints without curated knowledge bases in a spectrum of languages by exploiting inherent language-specific properties related to linguistic morphology. This relaxation ensures a wider portability of ATTRACT-REPEL to languages and domains without readily available or adequate resources.

Footnote: We use the hyperparameter values δ_att = 0.6, δ_rpl = 0.0, and λ_reg = 10^−9 from prior work, without fine-tuning. We train all models for 10 epochs with AdaGrad (Duchi et al., 2011).

           |W|          |A|        |R|
English    1,368,891    231,448    45,964
German     1,216,161    648,344    54,644
Italian    541,779      278,974    21,400
Russian    950,783      408,400    32,174

Table 3: Vocabulary sizes and counts of ATTRACT (A) and REPEL (R) constraints.
Extracting ATTRACT Pairs

The core difference between inflectional and derivational morphology can be summarised in a few lines as follows: the former refers to a set of processes through which the word form expresses meaningful syntactic information, e.g., verb tense, without any change to the semantics of the word; the latter refers to the formation of new words with semantic shifts in meaning (Schone and Jurafsky, 2001; Haspelmath and Sims, 2013; Lazaridou et al., 2013; Zeller et al., 2013; Cotterell and Schütze, 2017).

For the ATTRACT constraints, we focus on inflectional rather than on derivational morphology rules, as the former preserve the full meaning of a word, modifying it only to reflect grammatical roles such as verb tense or case markers (e.g., (en_read, en_reads) or (de_katalanisch, de_katalanischer)). This choice is guided by our intent to fine-tune the original vector space in order to improve the embedded semantic relations.

We define two rules for English, widely recognised as morphologically simple (Avramidis and Koehn, 2008; Cotterell et al., 2016b). These are: (R1) if w1, w2 ∈ W_en, where w2 = w1 + ing/ed/s, then add (w1, w2) and (w2, w1) to the set of ATTRACT constraints A. This rule yields pairs such as (look, looks), (look, looking), (look, looked). If w[:−1] is a function which strips the last character from word w, the second rule is: (R2) if w1 ends with the letter e and w1 ∈ W_en and w2 ∈ W_en, where w2 = w1[:−1] + ing/ed, then add (w1, w2) and (w2, w1) to A. This creates pairs such as (create, creating) and (create, created). Naturally, introducing more sophisticated rules is possible in order to cover other special cases and morphological irregularities (e.g., sweep/swept), but in all our EN experiments, A is based on the two simple EN rules R1 and R2 (see the sketch at the end of this subsection).

The other three languages, with more complicated morphology, yield a larger number of rules. In Italian, we rely on sets of rules spanning: (1) regular formation of plurals (libro/libri); (2) regular verb conjugation (aspettare/aspettiamo); (3) regular formation of past participles (aspettare/aspettato); and (4) rules regarding grammatical gender (bianco/bianca). Besides these, another set of rules is used for German and Russian: (5) regular declension (e.g., asiatisch/asiatischem).
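As a minimal sketch, the two English rules R1 and R2 can be implemented as follows; `vocab` is a hypothetical set of word forms standing in for the vocabulary W_en.

```python
# A sketch of the two English ATTRACT rules (R1 and R2).
def extract_en_attract(vocab):
    attract = set()
    for w1 in vocab:
        # R1: w2 = w1 + ing/ed/s, e.g. (look, looking), (look, looked)
        candidates = [w1 + suffix for suffix in ("ing", "ed", "s")]
        # R2: for e-final words, w2 = w1[:-1] + ing/ed, e.g. (create, creating)
        if w1.endswith("e"):
            candidates += [w1[:-1] + suffix for suffix in ("ing", "ed")]
        for w2 in candidates:
            if w2 in vocab:          # add a pair only if both forms are attested
                attract.add((w1, w2))
                attract.add((w2, w1))
    return attract

print(extract_en_attract({"look", "looks", "looking", "create", "creating"}))
```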
Extracting REPEL Pairs

As another source of implicit semantic signals, W also contains words which represent derivational antonyms: e.g., two words that denote concepts with opposite meanings, generated through a derivational process. We use a standard set of EN "antonymy" prefixes: AP_en = {dis, il, un, in, im, ir, mis, non, anti} (Fromkin et al., 2013). If w1, w2 ∈ W_en, where w2 is generated by adding a prefix from AP_en to w1, then (w1, w2) and (w2, w1) are added to the set of REPEL constraints R. This rule generates pairs such as (advantage, disadvantage) and (regular, irregular). An additional rule replaces the suffix -ful with -less, extracting antonyms such as (careful, careless). Following the same principle, we use AP_de = {un, nicht, anti, ir, in, miss}, AP_it = {in, ir, im, anti}, and AP_ru = {не, анти}. For instance, this generates the IT pair (rispettoso, irrispettoso) (see Fig. 1). For DE, we use another rule targeting suffix replacement: -voll is replaced by -los.

We further expand the set of REPEL constraints by transitively combining antonymy pairs from the previous step with inflectional ATTRACT pairs. This step yields additional constraints such as (rispettosa, irrispettosi) (see Fig. 1). The final A and R constraint counts are given in Tab. 3. The full sets of rules are available as supplemental material.
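A minimal sketch of this transitive expansion, assuming `attract` and `repel` are sets of ordered word pairs as produced by the extraction rules above:

```python
# Transitive REPEL expansion: inflectional variants of a word (from ATTRACT)
# inherit its antonyms (from REPEL).
from collections import defaultdict

def expand_repel(attract, repel):
    variants = defaultdict(set)
    for w1, w2 in attract:
        variants[w1].add(w2)
    expanded = set(repel)
    for x, y in repel:
        # (allow, disallow) + (allow, allows) -> (allows, disallow)
        for x2 in variants[x]:
            expanded.update({(x2, y), (y, x2)})
        for y2 in variants[y]:
            expanded.update({(x, y2), (y2, x)})
    return expanded

attract = {("allow", "allows"), ("allows", "allow")}
repel = {("allow", "disallow"), ("disallow", "allow")}
print(("allows", "disallow") in expand_repel(attract, repel))  # True
```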
Training Data and Setup

For each of the four languages we train the skip-gram with negative sampling (SGNS) model (Mikolov et al., 2013) on the latest Wikipedia dump of each language. We induce 300-dimensional word vectors, with the frequency cut-off set to 10. The vocabulary sizes |W| for each language are provided in Tab. 3. We label these collections of vectors SGNS-LARGE.
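For reference, a sketch of this training setup using the gensim library (assumed version ≥ 4.0). The dump path is a placeholder, and the negative-sampling, subsampling, and epoch settings are assumed standard values rather than values stated in this section.

```python
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models import Word2Vec

# Stream tokenised articles from a Wikipedia dump (placeholder path);
# materialised into a list here only to keep the sketch short.
wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", dictionary={})
sentences = list(wiki.get_texts())

model = Word2Vec(
    sentences=sentences,
    sg=1,              # skip-gram with negative sampling
    vector_size=300,   # 300-dimensional vectors, as stated above
    min_count=10,      # frequency cut-off of 10, as stated above
    negative=15,       # assumed standard settings (see the setup footnote,
    sample=1e-4,       #   which defers to prior work for these values)
    epochs=15,
    workers=4,
)
model.wv.save("sgns-large.en.kv")
```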
Other Starting Distributional Vectors
We also analyse the impact of morph-fitting on other collections of well-known EN word vectors. These vectors have varying vocabulary coverage and are trained with different architectures. We test standard distributional models: Common-Crawl GloVe (Pennington et al., 2014), SGNS vectors (Mikolov et al., 2013) with various contexts (BOW = bag-of-words; DEPS = dependency contexts) and training data (PW = Polyglot Wikipedia from Al-Rfou et al. (2013); 8B = the 8 billion token word2vec corpus), following Levy and Goldberg (2014) and Schwartz et al. (2015). We also test the symmetric-pattern based vectors of Schwartz et al. (2016) (SymPat-Emb), count-based PMI-weighted vectors reduced by SVD (Baroni et al., 2014) (Count-SVD), a model which replaces the context modelling function from CBOW with bidirectional LSTMs (Melamud et al., 2016) (Context2Vec), and two sets of EN vectors trained by injecting multilingual information: BiSkip (Luong et al., 2015) and MultiCCA (Faruqui and Dyer, 2014).

We also experiment with standard well-known distributional spaces in other languages (IT and DE), available from prior work (Dinu et al., 2015; Luong et al., 2015; Vulić and Korhonen, 2016a).

Morph-fixed Vectors
A baseline which utilises an equal amount of knowledge as morph-fitting, termed morph-fixing, fixes the vector of each word to the distributional vector of its most frequent inflectional synonym, tying the vectors of low-frequency words to their more frequent inflections. For each word w, we construct a set of M + 1 words W_w = {w, w'_1, ..., w'_M} consisting of the word w itself and all M words which co-occur with w in the ATTRACT constraints. We then choose the word w'_max from the set W_w with the maximum frequency in the training data, and fix all other word vectors in W_w to its word vector. The morph-fixed vectors (MFIX) serve as our primary baseline, as they outperformed another straightforward baseline based on stemming across all of our intrinsic and extrinsic experiments.

Footnote: Other SGNS parameters were set to standard values (Baroni et al., 2014; Vulić and Korhonen, 2016b). Similar trends in results persist with d = 100 and d = 500.
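A minimal sketch of this baseline, assuming `vectors` maps words to vectors, `attract` is the set of constraint pairs, and `freq` holds corpus frequencies (all constraint words are assumed to have vectors):

```python
# Morph-fixing: tie every word in an inflectional cluster to the vector of
# the cluster's most frequent member.
from collections import defaultdict

def morph_fix(vectors, attract, freq):
    cluster = defaultdict(set)
    for w1, w2 in attract:
        cluster[w1].add(w2)
    fixed = dict(vectors)
    for w, variants in cluster.items():
        W_w = {w} | variants                           # the set W_w from the text
        w_max = max(W_w, key=lambda v: freq.get(v, 0)) # most frequent member
        for v in W_w:
            fixed[v] = vectors[w_max]                  # fix all members to w_max
    return fixed
```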
Morph-fitting Variants

We analyse two variants of morph-fitting: (1) using ATTRACT constraints only (MFIT-A), and (2) using both ATTRACT and REPEL constraints (MFIT-AR).

Evaluation Setup and Datasets
The first set of experiments intrinsically evaluates morph-fitted vector spaces on word similarity benchmarks, using Spearman's rank correlation as the evaluation metric. First, we use the SimLex-999 dataset, as well as SimVerb-3500, a recent EN verb pair similarity dataset providing similarity ratings for 3,500 verb pairs. SimLex-999 was translated to DE, IT, and RU by Leviant and Reichart (2015), who crowdsourced similarity scores from native speakers. We use this dataset for our multilingual evaluation.
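A sketch of this evaluation protocol: Spearman's ρ between gold similarity ratings and cosine similarities. The `pairs` input (a list of (word1, word2, gold) tuples read from SimLex-999 or SimVerb-3500) and the out-of-vocabulary handling are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(vectors, pairs):
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:   # skip out-of-vocabulary pairs
            v1, v2 = vectors[w1], vectors[w2]
            pred.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            gold.append(score)
    return spearmanr(gold, pred).correlation
```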
Morph-fitting EN Word Vectors

As the first experiment, we morph-fit a wide spectrum of EN distributional vectors induced by various architectures (see Sect. 3). The results on SimLex and SimVerb are summarised in Tab. 4. The results with EN SGNS-LARGE vectors are shown in Fig. 3a. Morph-fitted vectors bring consistent improvements across all experiments, regardless of the quality of the initial distributional space. This finding confirms that the method is robust: its effectiveness does not depend on the architecture used to construct the initial space. To illustrate the improvements, note that the best score on SimVerb for a model trained on running text is achieved by Context2vec (ρ = 0.388); injecting morphological constraints into this vector space results in a gain of 0.071 ρ points.

Experiments on Other Languages
We next extend our experiments to other languages, testing both morph-fitting variants. The results are summarised in Tab. 5, while Fig. 3a-3d show results for the morph-fitted SGNS-LARGE vectors. These scores confirm the effectiveness and robustness of morph-fitting across languages, suggesting that the idea of fitting to morphological constraints is indeed language-agnostic, given the set of language-specific rule-based constraints. Fig. 3 also demonstrates that the morph-fitted vector spaces consistently outperform the morph-fixed ones.

The comparison between MFIT-A and MFIT-AR indicates that both sets of constraints are important for the fine-tuning process. MFIT-A yields consistent gains over the initial spaces, and (consistent) further improvements are achieved by also incorporating the antonymous REPEL constraints. This demonstrates that both types of constraints are useful for semantic specialisation.

Footnote: Unlike other gold standard resources such as WordSim-353 (Finkelstein et al., 2002) or MEN (Bruni et al., 2014), SimLex and SimVerb provided explicit guidelines to discern between semantic similarity and association, so that related but non-similar words (e.g. cup and coffee) have a low rating.

Footnote: Since Leviant and Reichart (2015) re-scored the original EN SimLex, we use their EN SimLex version for consistency.

Vectors                                            SimLex-999    SimVerb-3500
1. SG-BOW2-PW (300) (Mikolov et al., 2013)         .339 → .439   .277 → .381
2. GloVe-6B (300) (Pennington et al., 2014)        .324 → .438   .286 → .405
3. Count-SVD (500) (Baroni et al., 2014)           .267 → .360   .199 → .301
4. SG-DEPS-PW (300) (Levy and Goldberg, 2014)      .376 → .434   .313 → .418
5. SG-DEPS-8B (500) (Bansal et al., 2014)          .373 → .441   .356 → .473
6. MultiCCA-EN (512) (Faruqui and Dyer, 2014)      .314 → .391   .296 → .354
7. BiSkip-EN (256) (Luong et al., 2015)            .276 → .356   .260 → .333
8. SG-BOW2-8B (500) (Schwartz et al., 2015)        .373 → .440   .348 → .441
9. SymPat-Emb (500) (Schwartz et al., 2016)        .381 → .442   .284 → .373
10. Context2Vec (600) (Melamud et al., 2016)       .371 → .440   .388 → .459

Table 4: The impact of morph-fitting (MFIT-AR used) on a representative set of EN vector space models. All results show the Spearman's ρ correlation before and after morph-fitting. The numbers in parentheses refer to the vector dimensionality.

Vectors                                                Distrib.   MFIT-A   MFIT-AR
EN: GloVe-6B (300)                                     .324       .376     .438
EN: SG-BOW2-PW (300)                                   .339       .385     .439
DE: SG-DEPS-PW (300) (Vulić and Korhonen, 2016a)       .267       .318     .325
DE: BiSkip-DE (256) (Luong et al., 2015)               .354       .414     .421
IT: SG-DEPS-PW (300) (Vulić and Korhonen, 2016a)       .237       .351     .391
IT: CBOW5-Wacky (300) (Dinu et al., 2015)              .363       .417     .446

Table 5: Results on multilingual SimLex-999 (EN, DE, and IT) with two morph-fitting variants.
Comparison to Other Specialisation Methods
We also tried using other post-processing specialisation models from the literature in lieu of ATTRACT-REPEL, using the same set of "morphological" synonymy and antonymy constraints. We compare ATTRACT-REPEL to the retrofitting model of Faruqui et al. (2015) and to counter-fitting (Mrkšić et al., 2016). The two baselines were trained for 20 iterations using the suggested settings. The results for EN, DE, and IT are summarised in Fig. 2. They clearly indicate that MFIT-AR outperforms the two other post-processors for each language. We hypothesise that the difference in performance mainly stems from the context-sensitive vector space updates performed by ATTRACT-REPEL. Conversely, the other two models perform pairwise updates which do not consider what effect each update has on the example pair's relation to other word vectors (for a detailed comparison, see Mrkšić et al. (2017b)).

Figure 2: A comparison of morph-fitting (the MFIT-AR variant) with two other standard specialisation approaches using the same set of morphological constraints: retrofitting (RF) (Faruqui et al., 2015) and counter-fitting (CF) (Mrkšić et al., 2016). Spearman's ρ correlation scores on the multilingual SimLex-999 dataset for the same six distributional spaces from Tab. 5.

Besides their lower performance, the two other specialisation models have additional disadvantages compared to the proposed morph-fitting model. First, retrofitting is able to incorporate only synonymy/ATTRACT pairs, while our results demonstrate the usefulness of both types of constraints, both in intrinsic evaluation (Tab. 5) and in downstream tasks (see Fig. 3). Second, counter-fitting is computationally intractable with SGNS-LARGE vectors, as its regularisation term involves the computation of all pairwise distances between words in the vocabulary.
Further Discussion
The simplicity of the language-specific rules we use does come at the cost of occasionally generating incorrect linguistic constraints such as (tent, intent), (prove, improve) or (press, impress). In future work, we will study how to further refine the extracted sets of constraints. We also plan to conduct experiments with gold standard morphological lexicons for languages where such resources exist (Sylak-Glassman et al., 2015; Cotterell et al., 2016b), and to investigate approaches which automatically learn morphological inflections and derivations in different languages as another potential source of morphological constraints (Soricut and Och, 2015; Cotterell et al., 2016a; Faruqui et al., 2016; Kann et al., 2017; Aharoni and Goldberg, 2017, i.a.).

Dialogue State Tracking

Goal-oriented dialogue systems provide conversational interfaces for tasks such as booking flights or finding restaurants. In slot-based systems, application domains are specified using ontologies that define the search constraints which users can express. An ontology consists of a number of slots and their assorted slot values. In a restaurant search domain, sets of slot-values could include PRICE = [cheap, expensive] or FOOD = [Thai, Indian, ...].

The DST model is the first component of modern dialogue pipelines (Young, 2010). It serves to capture the intents expressed by the user at each dialogue turn and to update the belief state. This probability distribution over the possible dialogue states (defined by the domain ontology) is the system's internal estimate of the user's goals. It is used by the downstream dialogue manager component to choose the subsequent system response (Su et al., 2016). The following example shows the true dialogue state in a multi-turn dialogue:
User:
What’s good in the southern part of town? inform(area=south)
System:
Vedanta is the top-rated Indian place.
User:
How about something cheaper? inform(area=south, price=cheap)
System:
Seven Days is very popular. Great hot pot.
User:
What's the address? inform(area=south, price=cheap); request(address)
System:
Seven Days is at 66 Regent Street.
The Dialogue State Tracking Challenge (DSTC) shared task series formalised the evaluation and provided labelled DST datasets (Henderson et al., 2014a,b; Williams et al., 2016). While a plethora of DST models are available based on, e.g., hand-crafted rules (Wang et al., 2014) or conditional random fields (Lee and Eskenazi, 2013), recent DST methodology has seen a shift towards neural-network architectures (Henderson et al., 2014c,d; Zilka and Jurcicek, 2015; Mrkšić et al., 2015; Perez and Liu, 2017; Liu and Perez, 2017; Vodolán et al., 2017; Mrkšić et al., 2017a, i.a.).
Model: Neural Belief Tracker
To detect intents in user utterances, most existing models rely on either (or both): Spoken Language Understanding models, which require large amounts of annotated training data; or hand-crafted, domain-specific lexicons, which try to capture lexical and morphological variation. The Neural Belief Tracker (NBT) is a novel DST model which overcomes both issues by reasoning purely over pre-trained word vectors (Mrkšić et al., 2017a). The NBT learns to compose these vectors into intermediate utterance and context representations. These are then used to decide which of the ontology-defined intents (goals) have been expressed by the user. The NBT model keeps word vectors fixed during training, so that unseen, yet related words can be mapped to the right intent at test time (e.g. northern to north).

Data: Multilingual WOZ 2.0 Dataset
Our DST evaluation is based on the WOZ dataset, released by Wen et al. (2017). In this Wizard-of-Oz setup, two Amazon Mechanical Turk workers assumed the roles of the user and the system, asking for and providing information about restaurants in Cambridge (operating over the same ontology and database used for DSTC2 (Henderson et al., 2014a)). Users typed instead of speaking, removing the need to deal with noisy speech recognition. In DSTC datasets, users would quickly adapt to the system's inability to deal with complex queries. Conversely, the WOZ setup allowed them to use sophisticated language. The WOZ 2.0 release expanded the dataset to 1,200 dialogues (Mrkšić et al., 2017a). In this work, we use translations of this dataset to Italian and German, released by Mrkšić et al. (2017b).
Evaluation Setup
The principal metric we use to measure DST performance is the joint goal accuracy, which represents the proportion of test set dialogue turns where all user goals expressed up to that point of the dialogue were decoded correctly (Henderson et al., 2014a). The NBT models for EN, DE and IT are trained using four variants of the SGNS-LARGE vectors: the initial distributional vectors, the morph-fixed vectors, and the two variants of morph-fitted vectors (see Sect. 3). As shown by Mrkšić et al. (2017b), semantic specialisation of the employed word vectors benefits DST performance across all three languages. However, large gains on SimLex-999 do not always induce correspondingly large gains in downstream performance. In our experiments, we investigate the extent to which morph-fitting improves DST performance, and whether these gains exhibit stronger correlation with intrinsic performance.
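A sketch of the joint goal accuracy metric; the `turns` input, a list of (predicted_goals, gold_goals) per turn with goals accumulated up to that turn, is an assumed representation for illustration.

```python
def joint_goal_accuracy(turns):
    # A turn counts as correct only if the full accumulated goal set matches.
    correct = sum(1 for predicted, gold in turns if predicted == gold)
    return correct / len(turns)

turns = [
    ({"area": "south"}, {"area": "south"}),
    ({"area": "south", "price": "cheap"}, {"area": "south", "price": "cheap"}),
]
print(joint_goal_accuracy(turns))  # 1.0: all goals decoded correctly
```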
Figure 3: An overview of the results (Spearman's ρ correlation) for four languages on SimLex-999 (grey bars, left y axis) and the downstream DST performance (dark bars, right y axis) using SGNS-LARGE vectors (d = 300), see Tab. 3 and Sect. 3; panels: (a) English, (b) German, (c) Italian, (d) Russian. The left y axis measures the intrinsic word similarity performance, while the right y axis provides the scale for the DST performance (there are no DST datasets for Russian).

Results and Discussion
The dark bars (against the right axes) in Fig. 3 show the DST performance of NBT models making use of the four vector collections. IT and DE benefit from both kinds of morph-fitting: IT performance increases by 4% (MFIT-A) and DE performance rises even more, by 6% (MFIT-AR), setting new state-of-the-art scores for both datasets. The morph-fixed vectors do not enhance DST performance, probably because fixing word vectors to their highest-frequency inflectional form eliminates useful semantic content encoded in the original vectors. On the other hand, morph-fitting makes use of this information, supplementing it with semantic relations between different morphological forms. These conclusions are in line with the SimLex gains, where morph-fitting outperforms both distributional and morph-fixed vectors.

English performance shows little variation across the four word vector collections investigated here. This corroborates our intuition that, as a morphologically simpler language, English stands to gain less from fine-tuning the morphological variation for downstream applications. This result again points at the discrepancy between intrinsic and extrinsic evaluation: the considerable gains in SimLex performance do not necessarily induce similar gains in downstream performance. Additional discrepancies between SimLex and downstream DST performance are detected for German and Italian. While we observe a slight drop in SimLex performance with the DE MFIT-AR vectors compared to the MFIT-A ones, their relative performance is reversed in the DST task. On the other hand, we see the opposite trend in Italian, where the MFIT-A vectors score lower than the MFIT-AR vectors on SimLex, but higher on the DST task. In summary, we believe these results show that SimLex is not a perfect proxy for downstream performance in language understanding tasks. Regardless, its performance does correlate with downstream performance to a large extent, providing a useful indicator of the usefulness of specific word vector spaces for extrinsic tasks such as DST.

Semantic Specialisation
A standard approach to incorporating external information into vector spaces is to pull the representations of similar words closer together. Some models integrate such constraints into the training procedure, modifying the prior or the regularisation (Yu and Dredze, 2014; Xu et al., 2014; Bian et al., 2014; Kiela et al., 2015), or using a variant of the SGNS-style objective (Liu et al., 2015; Osborne et al., 2016). Another class of models, popularly termed retrofitting, injects lexical knowledge from available semantic databases (e.g., WordNet, PPDB) into pre-trained word vectors (Faruqui et al., 2015; Jauhar et al., 2015; Wieting et al., 2015; Nguyen et al., 2016; Mrkšić et al., 2016). Morph-fitting falls into the latter category. However, instead of resorting to curated knowledge bases, and experimenting solely with English, we show that the morphological richness of any language can be exploited as a source of inexpensive supervision for fine-tuning vector spaces, at the same time specialising them to better reflect true semantic similarity, and learning more accurate representations for low-frequency words.
Word Vectors and Morphology
The use of morphological resources to improve the representations of morphemes and words is an active area of research. The majority of proposed architectures encode morphological information, provided either by gold standard morphological resources (Sylak-Glassman et al., 2015) such as CELEX (Baayen et al., 1995) or by an external analyser such as Morfessor (Creutz and Lagus, 2007), jointly with distributional information at training time in the language modelling (LM) objective (Luong et al., 2013; Botha and Blunsom, 2014; Qiu et al., 2014; Cotterell and Schütze, 2015; Bhatia et al., 2016, i.a.). The key idea is to learn a morphological composition function (Lazaridou et al., 2013; Cotterell and Schütze, 2017) which synthesises the representation of a word given the representations of its constituent morphemes. Contrary to our work, these models typically coalesce all lexical relations. Another class of models, operating at the character level, shares a similar methodology: such models compose token-level representations from subcomponent embeddings (subwords, morphemes, or characters) (dos Santos and Zadrozny, 2014; Ling et al., 2015; Cao and Rei, 2016; Kim et al., 2016; Wieting et al., 2016; Verwimp et al., 2017, i.a.).

In contrast to prior work, our model decouples the use of morphological information, now provided in the form of inflectional and derivational rules transformed into constraints, from the actual training. This pipelined approach results in a simpler, more portable model. In spirit, our work is similar to Cotterell et al. (2016b), who formulate the idea of post-training specialisation in a generative Bayesian framework. Their work uses gold morphological lexicons; we show that competitive performance can be achieved using a non-exhaustive set of simple rules. Our framework facilitates the inclusion of antonyms at no extra cost and naturally extends to constraints from other sources (e.g., WordNet) in future work. Another practical difference is that we focus on similarity and evaluate morph-fitting in a well-defined downstream task where the artefacts of the distributional hypothesis are known to prompt statistical system failures.
Conclusion

We have presented a novel morph-fitting method which injects morphological knowledge, in the form of linguistic constraints, into word vector spaces. The method makes use of implicit semantic signals encoded in inflectional and derivational rules which describe the morphological processes in a language. The results in intrinsic word similarity tasks show that morph-fitting improves vector spaces induced by distributional models across four languages. Finally, we have shown that the use of morph-fitted vectors boosts the performance of downstream language understanding models which rely on word representations as features, especially for morphologically rich languages such as German.

Future work will focus on other potential sources of morphological knowledge, on porting the framework to other morphologically rich languages and downstream tasks, and on further refinements of the post-processing specialisation algorithm and the constraint selection.
Acknowledgments
This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909). RR is supported by the Intel-ICRI grant: Hybrid Models for Minimally Supervised Information Extraction from Conversations. The authors are grateful to the anonymous reviewers for their helpful suggestions.

References
Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of ACL.
Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL.
Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical machine translation. In Proceedings of ACL.
R. Harald Baayen, Richard Piepenbrock, and Leon Gulikers. 1995. The CELEX Lexical Database. Linguistic Data Consortium.
Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In Proceedings of ACL.
Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL.
Parminder Bhatia, Robert Guthrie, and Jacob Eisenstein. 2016. Morphological priors for probabilistic neural word embeddings. In Proceedings of EMNLP.
Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Knowledge-powered deep learning for word embedding. In Proceedings of ECML-PKDD.
Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML.
Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research.
Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP.
Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016a. The SIGMORPHON 2016 shared task: Morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.
Ryan Cotterell and Hinrich Schütze. 2015. Morphological word-embeddings. In Proceedings of NAACL-HLT.
Ryan Cotterell and Hinrich Schütze. 2017. Joint semantic synthesis and morphological analysis of the derived word. Transactions of the ACL.
Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016b. Morphological smoothing and extrapolation of word embeddings. In Proceedings of ACL.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP).
James Curran. 2004. From Distributional to Semantic Similarity. Ph.D. thesis, School of Informatics, University of Edinburgh.
Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2015. Improving zero-shot learning by mitigating the hubness problem. In Proceedings of ICLR (Workshop Papers).
Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of ICML.
John C. Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
Maud Ehrmann, Francesco Cecconi, Daniele Vannella, John Philip McCrae, Philipp Cimiano, and Roberto Navigli. 2014. Representing multilingual data as linked data: The case of BabelNet 2.0. In Proceedings of LREC.
Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT.
Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL.
Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of NAACL-HLT.
Christiane Fellbaum. 1998. WordNet. MIT Press.
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems.
Victoria Fromkin, Robert Rodman, and Nina Hyams. 2013. An Introduction to Language, 10th Edition.
Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL-HLT.
Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of EMNLP.
Zellig S. Harris. 1954. Distributional structure. Word.
Martin Haspelmath and Andrea D. Sims. 2013. Understanding Morphology.
Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014a. The Second Dialog State Tracking Challenge. In Proceedings of SIGDIAL.
Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014b. The Third Dialog State Tracking Challenge. In Proceedings of IEEE SLT.
Matthew Henderson, Blaise Thomson, and Steve Young. 2014c. Robust dialog state tracking using delexicalised recurrent neural networks and unsupervised adaptation. In Proceedings of IEEE SLT.
Matthew Henderson, Blaise Thomson, and Steve Young. 2014d. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGDIAL.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of NAACL-HLT.
Anders Johannsen, Héctor Martínez Alonso, and Anders Søgaard. 2015. Any-language frame-semantic parsing. In Proceedings of EMNLP.
Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. Neural multi-source morphological reinflection. In Proceedings of EACL.
Douwe Kiela, Felix Hill, and Stephen Clark. 2015. Specializing word embeddings for similarity or relatedness. In Proceedings of EMNLP.
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of AAAI.
Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositional-ly derived representations of morphologically complex words in distributional semantics. In Proceedings of ACL.
Sungjin Lee and Maxine Eskenazi. 2013. Recipe for building robust spoken dialog state trackers: Dialog State Tracking Challenge system description. In Proceedings of SIGDIAL.
Ira Leviant and Roi Reichart. 2015. Separated by an un-common language: Towards judgment language informed vector space modeling. CoRR, abs/1508.00106.
Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of ACL.
Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luís Marujo, and Tiago Luís. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of EMNLP.
Fei Liu and Julien Perez. 2017. Gated end-to-end memory networks. In Proceedings of EACL.
Quan Liu, Hui Jiang, Si Wei, Zhen-Hua Ling, and Yu Hu. 2015. Learning semantic word embeddings based on ordinal knowledge constraints. In Proceedings of ACL.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing.
Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL.
Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of CoNLL.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.
Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Proceedings of NIPS.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015. Multi-domain dialog state tracking using recurrent neural networks. In Proceedings of ACL.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Tsung-Hsien Wen, and Steve Young. 2017a. Neural Belief Tracker: Data-driven dialogue state tracking. In Proceedings of ACL.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT.
Nikola Mrkšić, Ivan Vulić, Diarmuid Ó Séaghdha, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017b. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. arXiv preprint.
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of ICML.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence.
Kim Anh Nguyen, Sabine Schulte im Walde, and Ngoc Thang Vu. 2016. Integrating distributional lexical contrast into word embeddings for antonym-synonym distinction. In Proceedings of ACL.
Dominique Osborne, Shashi Narayan, and Shay Cohen. 2016. Encoding prior knowledge with eigenword embeddings. Transactions of the ACL.
Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of ACL.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.
Julien Perez and Fei Liu. 2017. Dialog state tracking, a machine reading approach using memory network. In Proceedings of EACL.
Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING.
Patrick Schone and Daniel Jurafsky. 2001. Knowledge-free induction of inflectional morphologies. In Proceedings of NAACL.
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL.
Roy Schwartz, Roi Reichart, and Ari Rappoport. 2016. Symmetric patterns and coordinations: Fast and enhanced representations of verbs and adjectives. In Proceedings of NAACL-HLT.
Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of NAACL-HLT.
Pei-Hao Su, Milica Gašić, Nikola Mrkšić, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of ACL.
John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of ACL.
Reut Tsarfaty, Djamé Seddah, Yoav Goldberg, Sandra Kübler, Marie Candito, Jennifer Foster, Yannick Versley, Ines Rehbein, and Lamia Tounsi. 2010. Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. In Proceedings of the NAACL Workshop on Statistical Parsing of Morphologically-Rich Languages.
Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research.
Lyan Verwimp, Joris Pelemans, Hugo Van hamme, and Patrick Wambacq. 2017. Character-word LSTM language models. In Proceedings of EACL.
Miroslav Vodolán, Rudolf Kadlec, and Jan Kleindienst. 2017. Hybrid dialog state tracker with ASR features. In Proceedings of EACL.
Ivan Vulić and Anna Korhonen. 2016a. Is "universal syntax" universally useful for learning distributed word representations? In Proceedings of ACL.
Ivan Vulić and Anna Korhonen. 2016b. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL.
Wang et al. 2014. In Proceedings of AAAI, pages 1112–1119.
Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL.
John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Charagram: Embedding words and sentences via character n-grams. In Proceedings of EMNLP.
Jason D. Williams, Antoine Raux, and Matthew Henderson. 2016. The Dialog State Tracking Challenge series: A review. Dialogue & Discourse.
Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. 2014. RC-NET: A general framework for incorporating knowledge into word representations. In Proceedings of CIKM.
Steve Young. 2010. Cognitive user interfaces. IEEE Signal Processing Magazine.
Mo Yu and Mark Dredze. 2014. Improving lexical embeddings with semantic knowledge. In Proceedings of ACL.
Britta Zeller, Jan Šnajder, and Sebastian Padó. 2013. DErivBase: Inducing and evaluating a derivational morphology resource for German. In Proceedings of ACL.
Lukas Zilka and Filip Jurcicek. 2015. Incremental LSTM-based dialog state tracker. In Proceedings of ASRU.
Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of EMNLP.
Supplementary Material
Morphological Rules
In this supplemental material, we provide a short comprehensive overview of the simple language-specific morphological rules in English (EN), German (DE), Italian (IT), and Russian (RU). These rules were used to build the sets of synonymous ATTRACT and antonymous REPEL constraints for our morph-fitting fine-tuning procedure. As discussed in the paper, the linguistic constraints extracted from the rules require only a comprehensive list of vocabulary words in each language. A native speaker of each language used in our experiments is able to easily come up with these sets of morphological rules (or at least with a reasonable subset of rules) without any linguistic training. What is more, the rules for German, Italian, and Russian were created by non-native and non-fluent speakers who have only a passive or limited knowledge of the three languages, exemplifying the simplicity and portability of the fine-tuning approach based on this shallow "morphological supervision". The simplicity is also confirmed by the short time needed to compile the rules, ranging from a few minutes for English to approximately two hours for Russian.

Different languages differ in their "morphological richness" (e.g., declension, verb conjugation, plural forming, gender), which consequently leads to a varying number of rules in each language. However, all four languages in our study display morphological regularities described by simple morphological rules that are exploited to build sets of ATTRACT and REPEL linguistic constraints in each language from scratch.

Vocabularies W in the four languages are labeled W_en, W_de, W_it, W_ru. We add the pairs (w1, w2) and (w2, w1) generated by the rules to the sets of constraints iff both w1, w2 ∈ W. After we generate all such constraints, since some constraints may have been generated by more than one rule, we remove all duplicates from the respective sets of ATTRACT and REPEL constraints.

Footnote: Note that the rules for extracting ATTRACT constraints were additionally used to generate the Morph-SimLex evaluation set, also provided as supplemental material.

Before we start, we define two simple functions: (i) the function w[:−N] strips the last N characters from the word w; (ii) the function w.ew(sub) tests if the word w ends with the sequence of characters sub. For instance, create[:−1] returns creat, while create.ew('s') returns False and create.ew('e') returns True.

English Rules

Inflectional Synonymy: ATTRACT
As discussed in the paper, we rely on only two simple inflectional morphological rules in English:

- w2 = w1 + 's'/'ed'/'ing'. This rule yields constraints such as (speak, speaking), (turtle, turtles), or (clean, cleaned).
- If w1.ew('e'), then w2 = w1[:−1] + 'ed'/'ing'. This rule yields constraints such as (create, creating) or (generate, generated).
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes in English: AP_en = {'dis', 'il', 'un', 'in', 'im', 'ir', 'mis', 'non', 'anti'}. We rely on the following derivational rules to extract REPEL pairs:

- w2 = ap + w1, where ap ∈ AP_en. This rule yields constraints such as (mature, immature), (allow, disallow) or (regularity, irregularity).
- If w1.ew('ful'), then w2 = w1[:−3] + 'less'. This rule yields constraints such as (cheerful, cheerless).

As mentioned in the paper, for all four languages we further expand the set of REPEL constraints by transitively combining antonymy pairs with inflectional ATTRACT pairs. In simple words, the friend of my enemy is my enemy. This means that, given an ATTRACT pair (allow, allows) and a REPEL pair (allow, disallow), we extract another REPEL pair (allows, disallow).
German Rules

Inflectional Synonymy: ATTRACT
Being morphologically richer than English, the German language naturally requires more rules to describe its (inflectional) morphological richness and variation. First, we capture the regular declension of nouns and adjectives by the following heuristic:

- Generate a set of words W_w = {w, w2 | w2 = w + 'e'/'em'/'en'/'er'/'es'}; take the Cartesian product on W_w × W_w and then exclude (w_i, w_i) pairs with identical words. This rule generates pairs such as (schottisch, schottische), (schottischem, schottischen).

The second set of rules describes regular verb morphology, i.e., verb conjugation in the present and past tense, and the formation of regular past participles. This set of rules may be expressed as:

- If w.ew('en'), then w' = w[:−2]. If w'.ew('t'), then generate a set of words W_w = {w, w2 | w2 = w' + 'e'/'st'/'ete'/'etest'/'etet'/'eten', w2 = 'ge' + w' + 'et'}; otherwise (if not w'.ew('t')), generate a set of words W_w = {w, w2 | w2 = w' + 'e'/'st'/'te'/'test'/'tet'/'ten', w2 = 'ge' + w' + 't'}. We then take the Cartesian product on W_w × W_w; again, all pairs with identical words are discarded. This rule yields pairs such as (machen, machten), (mache, gemacht), (kaufst, kauft), or (arbeite, arbeitete) and (arbeiten, gearbeitet).

Another set of rules targets the regular formation of plural nouns:

- If w.ew('ei') or w.ew('heit') or w.ew('keit') or w.ew('schaft') or w.ew('ung'), then w2 = w + 'en'. This rule yields pairs such as (wahrheit, wahrheiten) or (gemeinschaft, gemeinschaften).
- If w.ew('in'), then w2 = w + 'nen'. This rule generates pairs such as (lehrerin, lehrerinnen) or (lektorin, lektorinnen).
- If w.ew('a'/'i'/'o'/'u'/'y'), then w2 = w + 's'. This rule yields pairs such as (auto, autos).
- If w.ew('e'), then w2 = w + 'n'. This rule yields pairs such as (postkarte, postkarten).
- w2 = lumlaut(w) + 'er', where the function lumlaut(w) replaces the last occurrence of the letter 'a', 'o' or 'u' with 'ä', 'ö' or 'ü'. This rule generates pairs such as (wörterbuch, wörterbücher) or (stadt, städter).
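A minimal sketch of the declension heuristic above; the function name and the toy `vocab` (standing in for W_de) are illustrative assumptions.

```python
# Build the candidate set W_w for a base form and emit all ordered pairs of
# attested forms.
from itertools import product

def de_declension_pairs(w, vocab):
    W_w = {w} | {w + suffix for suffix in ("e", "em", "en", "er", "es")}
    attested = sorted(form for form in W_w if form in vocab)
    # Cartesian product W_w x W_w, excluding pairs of identical words
    return {(w1, w2) for w1, w2 in product(attested, attested) if w1 != w2}

vocab = {"schottisch", "schottische", "schottischem", "schottischen"}
print(sorted(de_declension_pairs("schottisch", vocab)))
```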
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes in German: AP_de = {'un', 'nicht', 'anti', 'ir', 'in', 'miss'}. We rely on the following derivational rules to extract REPEL pairs in German:

- w2 = ap + w1, where ap ∈ AP_de. This rule yields constraints such as (aktiv, inaktiv), (wandelbar, unwandelbar) or (zyklone, antizyklone).
- If w1.ew('voll'), then w2 = w1[:−4] + 'los'. This rule yields constraints such as (geschmackvoll, geschmacklos).

The set of REPEL constraints is then again transitively expanded, yielding pairs such as (relevant, irrelevanter) or (aktivem, inaktiv).

Italian Rules

Inflectional Synonymy: ATTRACT
The first set of rules aims at capturing the regular plural forming in Italian (e.g., libro, libri) and regular differences in gender (e.g., rapido, rapida). We rely on a simple heuristic which can be expressed as follows:

- If w.ew('a'/'e'/'o'/'i'), then generate a set of words W_w = {w2 | w2 = w[:−1] + 'a'/'e'/'o'/'i'}, and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (nero, neri) or (generazione, generazioni).
- If w1.ew('ga'/'ca'), then w2 = w1[:−1] + 'he'. This rule generates pairs such as (tartaruga, tartarughe) or (bianca, bianche).
- If w1.ew('go'), then w2 = w1[:−1] + 'hi'. This rule generates pairs such as (albergo, alberghi).

The second set of rules targets regular verb conjugation in Italian and the formation of regular past participles. The following rules are used:

- If w.ew('are'), then generate a set of words W_w = {w, w2 | w2 = w[:−3] + 'iamo'/'ate'/'ano'/'o'/'i'/'a'/'ato'/'ata'/'ati'/'ate'}; take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule results in pairs such as (aspettare, aspettiamo).
- If w.ew('ere'), then generate a set of words W_w = {w, w2 | w2 = w[:−3] + 'iamo'/'ete'/'ono'/'o'/'i'/'e'/'uto'/'uta'/'uti'/'ute'}; take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule results in pairs such as (ricevere, ricevete) or (riceve, ricevuto).
- If w.ew('ire'), then generate a set of words W_w = {w, w2 | w2 = w[:−3] + 'iamo'/'ite'/'ono'/'o'/'i'/'e'/'ito'/'ita'/'iti'/'ite'}; take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule results in pairs such as (dormire, dormono) or (dormi, dormita).
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes: AP_it = {'in', 'ir', 'im', 'anti'}. The following derivational rule is used to extract REPEL pairs:

- w2 = ap + w1, where ap ∈ AP_it. This rule yields constraints such as (attivo, inattivo) or (rispettosa, irrispettosa).

The set of REPEL constraints was then expanded as before, e.g., generating additional pairs such as (rispettosa, irrispettosi).
Russian Rules

Inflectional Synonymy: ATTRACT
The first set of rules in Russian targets the regular forming of plurals. A few simple heuristics are used as follows:

- w2 = w1 + 'и'/'ы'. This rule yields pairs such as (альбом, альбомы), transliterated as (al'bom, al'bomy).
- If w1.ew('а'/'я'/'ь'), then w2 = w1[:−1] + 'и'/'ы'. This rule generates pairs such as (песня, песни): (pesnja, pesni).
- If w1.ew('о'), then w2 = w1[:−1] + 'а'. This rule generates pairs such as (письмо, письма): (pis'mo, pis'ma).
- If w1.ew('е'), then w2 = w1[:−1] + 'я'. This rule generates pairs such as (платье, платья): (plat'e, plat'ja).

The next set of rules targets regular conjugation of Russian verbs as well as the regular formation of past participles. We again build a simple heuristic to extract ATTRACT pairs:

- If w.ew('ти'/'ть'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'у'/'ю'/'ешь'/'ишь'/'ет'/'ит'/'ем'/'им', w2 = w[:−2] + 'ете'/'ите'/'ут'/'ют'/'ат'/'ят', w2 = w[:−2] + 'нный'/'нная'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (варить, варите) or (заканчиваю, заканчивают), transliterated as (varit', varite) and (zakanchivaju, zakanchivajut).

Following that, we also utilise the regularities regarding declension processes in Russian, captured by the following rules:

- If w.ew('а'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'е'/'у'/'ой'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (работа, работой): (rabota, rabotoj).
- If w.ew('я'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'е'/'ю'/'ей'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (линия, линию): (linija, liniju).
- If w.ew('ы'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'ам'/'ами'/'ах'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words. This rule yields pairs such as (работам, работами): (rabotam, rabotami).
- If w.ew('и'), then generate a set of words W_w = {w, w2 | w2 = w[:−1] + 'ь'/'ям'/'ями'/'ях'} and take the Cartesian product on W_w × W_w, discarding pairs with identical words.

Yet another set of rules targets regular adjective comparison and gender:

- If w.ew('ый'/'ой'/'ий'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'ь'/'ее'/'ые'}. This rule yields pairs such as (быстрый, быстрее): (bystryj, bystree).
- If w.ew('ая'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'ее'/'ые'/'ый'}. This rule yields pairs such as (новая, новые): (novaja, novye).
- If w.ew('ое'), then generate a set of words W_w = {w, w2 | w2 = w[:−2] + 'ый'/'ые'/'ая'}. This rule yields pairs such as (новое, новый): (novoe, novyj).
Derivational Antonymy: REPEL

We assume the following set of standard "antonymy" prefixes in Russian: AP_ru = {'не', 'анти'}, and simply use the following rule:

- w2 = ap + w1, where ap ∈ AP_ru. This rule yields constraints such as (адекватный, неадекватный) or (вирусная, антивирусная), transliterated as (adekvatnyj, neadekvatnyj) and (virusnaja, antivirusnaja).

The further expansion of REPEL constraints yields pairs such as (адекватный, неадекватная): (adekvatnyj, neadekvatnaja).

Further Discussion

We stress that the listed rules for all four languages are non-exhaustive.