Differentiable Generative Phonology
Shijie Wu∗ (Johns Hopkins University)   Edoardo M. Ponti∗ (Mila Montreal, McGill University, University of Cambridge)   Ryan Cotterell (University of Cambridge, ETH Zürich)
[email protected]   {ep490,rdc42}@cam.ac.uk
∗ Equal contribution

Abstract
The goal of generative phonology, as formulated by Chomsky and Halle (1968), is to specify a formal system that explains the set of attested phonological strings in a language. Traditionally, a collection of rules (or constraints, in the case of optimality theory) and underlying forms (UFs) are posited to work in tandem to generate phonological strings. However, the degree of abstraction of UFs with respect to their concrete realizations is contentious. As the main contribution of our work, we implement the phonological generative system as a neural model differentiable end-to-end, rather than as a set of rules or constraints. Contrary to traditional phonology, in our model UFs are continuous vectors in ℝ^d, rather than discrete strings. As a consequence, UFs are discovered automatically rather than posited by linguists, and the model can scale to the size of a realistic vocabulary. Moreover, we compare several modes of the generative process, contemplating: i) the presence or absence of an underlying representation in between morphemes and surface forms (SFs); and ii) the conditional dependence or independence of UFs with respect to SFs. We evaluate the ability of each mode to predict attested phonological strings on 2 datasets covering 5 and 28 languages, respectively. The results corroborate two tenets of generative phonology, viz. the necessity for UFs and their independence from SFs. In general, our neural model of generative phonology learns both UFs and SFs automatically and on a large scale. The code is available at https://github.com/shijie-wu/neural-transducer.

1 Introduction

Generative phonology is one of the most prominent paradigms in phonological analysis (Hayes, 2009). The goal of the research program is to devise a formal system that allows linguists to explain the systematic variation in the surface forms (SFs) of a language.

[Figure 1: Learned underlying representations for Turkish, dimensionality-reduced to ℝ² with t-SNE; panel labels {i, e}, {ı, a}, {u, o}, {ü, ö}. The gray dots show the suffixes and the colored dots show the stems. 2-way vowel harmony is encoded: back vowels (red and purple) concentrate on the left and front vowels (green and blue) on the right.]

For instance, consider how the past tense is expressed in English, a classic case of allomorphy. Comparing talked ([tʰɔːkt]), saved ([seɪvd]), and acted ([ˈæk.tɪd]), at least three pronunciations ([t], [d], [ɪd]) and at least two spellings (-ed and -d) can be counted for the past tense morpheme. Traditionally, linguists assume that each morpheme has a single underlying form (UF), a string of phonemes shared across all its contexts (Jakobson, 1948): for the past tense, /-d/. The surface variation is explained by generative phonology via a grammar, a set of rewrite rules (Chomsky and Halle, 1968) or constraint rankings (Prince and Smolensky, 2008) that map the UFs of a sequence of morphemes to the observed SFs.

In this work, we offer the first end-to-end differentiable version of generative phonology, i.e. a neural network, trainable by backpropagation, that can perform the same function that a generative phonology does: to derive the set of attested SFs. In terms of innovation for linguistic theory, the primary motivation for this paper is to reconsider the mathematical type of UFs, which are latent variables (i.e. never observed). What if, rather than being string-valued, we consider a more abstract representation, namely, a vector in ℝ^d?
This would immediately yield certain advantages: in an end-to-end differentiable phonology, we can backpropagate through the underlying representations themselves, rather than have a human or machine perform a difficult combinatorial search problem to find the best string-valued representation. We refer to this framework, which automates the process of being a phonologist, as differentiable phonology.

In the technical section of the paper, we discuss a series of possible instantiations of differentiable phonology. (Note that there are many possible neuralizations of generative phonology, several of which are beyond the scope of this paper.) Drawing on recent work in the morphological inflection literature (Cotterell et al., 2017), we develop neural sequence-to-sequence models to spell out the attested SFs given the word morphemes. In particular, we compare three variants relying on different assumptions: a) the presence (or absence) of an intermediate latent variable to bridge between morphemes and SFs, namely UFs with abstract real-valued vector representations; and b) the conditional independence (or dependence) of UFs with respect to SFs.

Moreover, we take time to discuss the impact of these ideas on phonological theory: despite not being fully interpretable, as traditional underlying strings are, real-valued vector representations can be flexibly compared through cosine similarity. We exhibit learned underlying representations from our differentiable phonology in Figure 1. For instance, we find that some classic phonological patterns, e.g. vowel harmony, are immediately evident from this visualization, where morphemes with back and front vowels cluster separately.

Empirically, we provide results on the phonological dataset taken from CELEX2 (Baayen et al., 1995) and the orthography-based dataset of UniMorph (Kirov et al., 2018). The task is to predict missing forms in a paradigm (some slots are held out), using the inferred underlying representations. We show that our model of differentiable generative phonology has comparable performance to the latent-string model of Cotterell et al. (2015) on the tiny CELEX2 dataset. However, the model scales to the UniMorph datasets, which are two orders of magnitude larger. Moreover, we find that the best-performing neuralization of differentiable phonology justifies two propositions of generative phonology, viz. the presence of UFs and their conditional independence from SFs.
2 Generative Phonology

In this section, we briefly formalize generative phonology in our own notation. First, we define the following six sets:

• Δ = {d_1, …, d_{|Δ|}} is the discrete alphabet of underlying phonemes.
• Σ = {s_1, …, s_{|Σ|}} is the discrete alphabet of surface phones.
• M = {μ_1, …, μ_{|M|}} is the discrete set of abstract morphemes.
• U ⊂ Δ* is the set of underlying forms. The optimality-theoretic notion of richness of the base (Prince and Smolensky, 2008) suggests that we should always consider all potential forms.
• S ⊂ Σ* is the set of realizable surface forms. (Note that many, if not most, of these forms will be phonotactically invalid under the language's grammar.)
• S̃ ⊂ S is the observed set of surface forms that the phonologist has access to in order to perform their analysis.

Furthermore, let s = s_1 ··· s_{|s|} ∈ S be a surface form. In phonological theory, a surface form is an observed sequence of phonological symbols, e.g., symbols of the International Phonetic Alphabet (IPA). This stands in contrast to the notion of an orthographic form, which is the sequence of orthographic symbols used in writing. To see the difference, contrast the orthographic form talked and the surface form [tʰɔːkt] of the same word.

A surface form s may be decomposed, semantically, into a (variable-length) vector of abstract morphemes:

decomp(s) = μ_s = [μ_1, …, μ_k]   (1)

(The term abstract morpheme is non-standard in the phonological literature, but we find it useful for our exposition.) With the term abstract morpheme, we are referring to an abstract unit that carries semantics, but does not have phonological or phonetic content associated with it. For each abstract morpheme μ ∈ M in a word, let m_μ ∈ U be its underlying representation. The concept of the underlying representation is commonplace in generative phonology. Using the terminology of machine learning, an underlying form is a discrete latent variable that helps explain the relation between the surface forms in the lexicon for the same abstract morpheme.

As a concrete example of our decomposition into abstract morphemes, consider that the English ran may be decomposed into its constituent abstract morphemes RUN and
PAST. Then, the underlying forms for the morphemes are m_RUN = /ɹʌn/ and m_PAST = /d/. As a shorthand, we will write m_s = [m_{μ_1}, …, m_{μ_k}]. The underlying forms of the morphemes are stitched together to create an underlying form for the entire word. We will write u_s as the underlying form for the entire surface form. In general, u_s will also be a latent variable. For instance, the underlying form for [ɹæn] (ran) may be /ɹʌn+d/.

A generative phonology P : U → S is a mapping from the space of underlying forms U to the space of surface forms S. The function P may take the form of string rewrite rules, as suggested by Chomsky and Halle (1968), or ordered, violable constraints, as suggested by Prince and Smolensky (2008). Indeed, there are many different formalisms for P; we will avoid the details of any one specific formalism, taking a broader perspective. The job of a generative phonologist is, then, to discover an underlying form u_s for every surface form s, as well as to specify the phonology P within their chosen formalism.

Which generative phonology is to be preferred? First, it must explain the data: given the empirical set of surface forms S̃ ⊂ S ⊂ Σ* that the phonologist has access to, the function P is chosen, along with corresponding underlying forms u_s, such that s = P(u_s) for all s ∈ S̃. Second, the quality of the phonology is evaluated through Occam's razor: simpler phonologies are better than more complex ones, to the extent that both explain the data. This idea goes back to Chomsky (1955, p. 118), but has recently been operationalized with information theory (Rasin and Katzir, 2016). Also, the phonology should avoid language-specific constraints. In optimality theory, for example, the set of constraints in the grammar is said to be universal, and languages differ only in their ordering of the constraints.

Another desideratum is that the phonology P can account for the gradient nature of well-formedness judgments of speakers (Hayes and Wilson, 2008). This justifies the probabilistic treatments of generative phonology that have been given in the literature (Jarosz, 2006; Apoussidou, 2006; Eisenstat, 2009; Jarosz, 2013; Cotterell et al., 2015). In broad strokes, these models have treated u_s as a string-valued latent variable. In the fully probabilistic case of Cotterell et al. (2015), the goal was to define a joint probability distribution over surface forms and underlying forms and then to marginalize out the underlying form, as it was never observed. Working within the framework of string-valued graphical modeling (Dreyer and Eisner, 2009), one can approximately perform inference despite the Δ*-valued latent variable using a variant of belief propagation. In contrast, we focus on a probabilistic generative phonology that has real-valued latent variables as underlying representations.

The methodological jump we make in this paper, in contrast to the discrete latent-variable approach discussed in Cotterell et al. (2015), is to treat the underlying forms as gradient, i.e. to treat the underlying forms as a continuous latent variable. This is loosely inspired by the ideas of gradient symbolic computation (GSC; Smolensky and Legendre, 2006). However, we emphasize that this connection is loose, as GSC creates gradient structures as a blend of discrete ones, whereas we remove the notion of discreteness altogether. Much like Cotterell et al. (2015), the final form of our model will be a probabilistic latent-variable model.
However, in contrast, using the notation in §2, we will define U = ℝ^d. Thus, each underlying form is a vector in U, e.g., m_RUN ∈ ℝ^d and m_PAST ∈ ℝ^d.
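To make the formal objects above concrete, here is a toy rendering in Python. The lexicon entries, the single rewrite rule inside `phonology`, and all function names are invented for illustration; the paper defines these objects only mathematically. The last lines preview this paper's move of replacing the string-valued lexicon with vectors in ℝ^d.

```python
import numpy as np

# A toy rendering of the objects in §2 (all names are ours, for exposition).
DELTA = set("ɹʌændt")          # Δ: underlying phonemes (toy alphabet)
SIGMA = set("ɹʌændt")          # Σ: surface phones

# Each abstract morpheme μ ∈ M has an underlying form m_μ ∈ Δ*.
underlying = {"RUN": "ɹʌn", "PAST": "d"}

def decomp(surface_form: str) -> list:
    """Eq. (1): semantic decomposition into abstract morphemes (toy lexicon)."""
    return {"ɹæn": ["RUN", "PAST"]}[surface_form]

def phonology(u: str) -> str:
    """The mapping P : U → S. One toy rewrite rule stands in for the rule
    systems or constraint rankings a phonologist would posit."""
    return u.replace("ʌ", "æ").replace("+d", "")

u_s = "+".join(underlying[mu] for mu in decomp("ɹæn"))   # u_s = /ɹʌn+d/
assert phonology(u_s) == "ɹæn"                           # s = P(u_s)

# This paper's relaxation: U = ℝ^d, i.e. vector-valued underlying forms
# learnable by backpropagation instead of combinatorial search.
d = 200
underlying_vec = {mu: np.random.randn(d) for mu in underlying}
```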
Linguistic Justification. From a theoretical perspective, a more abstract representation of UFs is far from being far-fetched. In the generative framework, Kiparsky (1968) initiated the discussion about how much phonological theory should permit phonological representations to deviate from phonetic reality (Kenstowicz and Kisseberth, 1979, ch. 6): conceptualizing contrasts in the underlying forms that never emerge in SFs, so-called 'absolute neutralizations'. (Contrary to contextual neutralizations, these are irreversible, in that they cannot re-emerge in SFs after language change; unstable, in that they lead to reanalysis of the words involved in a lexical fashion; and non-productive.) The stance that some segments may not be realized phonetically was similarly held by Hyman (1970).

Outside the generative framework, additional arguments have been put forth challenging the idea of discrete underlying forms, and arguing instead in favor of a continuous characterization of phonology. First, this succeeds in grounding phonology in the acoustic and articulatory space of phonetics, a problem known as 'naturalness' (Jakobson et al., 1952). For instance, it explains the parallel between phonetic consonant-vowel co-articulation and phonological assimilation (Ohala, 1990; Flemming, 2001). Second, it explains lexical diffusion. Sound changes spread throughout the lexicon incrementally, rather than abruptly as one would expect with discrete phonemes (Bybee, 2007). Moreover, the rate of change is known to correlate with word frequencies. In order for usage to affect lexical diffusion, the underlying representations must be amenable to vary along a continuum.

In this work, we commit to the assumptions of traditional generative phonology, which exclude considerations of usage and performance from the assessment of competence in a language. However, we abandon the assumption that phonology is discrete in nature. This is not to deny the presence of classical phonemic units at some level of abstraction; rather, we maintain that they can be "embedded in a continuous description," borrowing the expression of Pierrehumbert et al. (2000, p. 20). This leaves open the possibility that phonological constraints and finer-grained phonetic details may co-exist in the same representation (Flemming, 2001).
Doing Linguistics Through Backpropagation.
As mentioned in §2, the goal of the generative phonologist is to posit underlying forms that help explain the relationships among the observed surface forms S̃. Even when this problem is posed probabilistically using latent-variable models, one still has to contend with a tricky discrete inference problem: a sum over Δ*. In contrast, we are going to learn the underlying representations through backpropagation (Rumelhart et al., 1985), as they live in a continuous space. The beauty of this approach is that we can compute the gradient exactly in polynomial time. This means that we can automate the work of a phonologist using a neural network.

However, as mentioned above, we will have to live with the idea that our learned representations are considerably more abstract than those in traditional phonology. As a consequence, our underlying forms have no direct phonetic interpretation (they are merely vectors in ℝ^d) and thus do not lead to any indirect association with similarly sounding words. Despite this, our underlying forms can be discussed in relation to their mutual similarity, based on any distance metric (e.g. cosine distance). In addition, their most salient properties can be visualized via dimensionality reduction (e.g. Principal Component Analysis).

We formulate our differentiable generative phonology as a probabilistic model of the set of observed surface forms S̃ conditioned on the morphological lexicon M. We factorize this distribution as follows:

p(S̃ | M) = ∏_{s ∈ S̃} p_phon(s | m_s) = ∏_{s ∈ S̃} ∏_{i=1}^{|s|} p_phon(s_i | s_{<i}, m_s)   (2)
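The factorization in Eq. (2) is what makes the whole system trainable end to end: the corpus negative log-likelihood decomposes into per-character terms, each differentiable in the morpheme embeddings. Below is a minimal sketch, assuming surface forms are encoded as lists of integer symbol ids and a generic per-character network `p_phon`; both are placeholders of our own.

```python
import torch

def corpus_nll(surface_forms, morpheme_ids, embed, p_phon):
    """-log p(S̃ | M) = -Σ_s Σ_i log p_phon(s_i | s_<i, m_s), as in Eq. (2)."""
    nll = torch.tensor(0.0)
    for s, mus in zip(surface_forms, morpheme_ids):   # s: list of symbol ids
        m_s = embed(torch.tensor(mus))                # [k, d] morpheme vectors
        for i in range(len(s)):
            probs = p_phon(s[:i], m_s)                # distribution over Σ
            nll = nll - torch.log(probs[s[i]])
        # an end-of-word symbol would also be scored here; omitted for brevity
    return nll

# Calling nll.backward() sends gradients into embed.weight: the continuous
# underlying representations are thus discovered by backpropagation.
```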
Defining p_sr. We first define the part of the neural network that spells out the surface form conditioned on the underlying representation u_s. We propose that this is a conditional long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) decoder, similar to the one used in Sutskever et al. (2014) for neural machine translation. This is a recurrent neural network that conditions the decoder on a fixed-length vector. More formally, this network defines the distribution p_sr(s_i | s_{<i}, u_s) over the next surface symbol, given the prefix s_{<i} and the underlying form u_s.
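A sketch of such a conditional decoder follows, in the spirit of Sutskever et al. (2014). The exact conditioning (here, u_s initializes the LSTM hidden state) and all sizes are our assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """p_sr(s_i | s_<i, u_s): an LSTM decoder conditioned on a fixed-length
    underlying representation u_s (sketch; sizes are assumptions)."""

    def __init__(self, n_symbols: int, d: int = 200):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d)   # surface character embeddings
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, n_symbols)

    def forward(self, prefix_ids, u_s):
        # prefix_ids: [B, i] characters s_<i (a BOS symbol keeps it non-empty)
        # u_s:        [B, d] continuous underlying form of the word
        h0 = u_s.unsqueeze(0)                     # condition the initial state
        c0 = torch.zeros_like(h0)
        x = self.embed(prefix_ids)
        h, _ = self.lstm(x, (h0, c0))
        return torch.log_softmax(self.out(h[:, -1]), dim=-1)  # log p_sr(· | s_<i, u_s)
```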
Position-Independent UR. Traditionally, in generative phonology, the underlying representation is taken to be independent of how the word is spelled out; within our model, we express this as the following conditional independence assumption:

p_ur(u_s | s_{<i}, m_s) = p_ur(u_s | m_s)
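Under this assumption, u_s can be computed once from the morpheme embeddings and reused at every decoding step. The composition below (a mean of morpheme vectors passed through a linear layer) is a placeholder of our own; the paper's exact composition function is not reproduced here.

```python
import torch
import torch.nn as nn

class PositionIndependentUR(nn.Module):
    """u_s = f(m_s): one underlying representation per word, independent of
    the surface prefix s_<i (the composition f is our placeholder)."""

    def __init__(self, n_morphemes: int, d: int = 200):
        super().__init__()
        self.embed = nn.Embedding(n_morphemes, d)
        self.compose = nn.Linear(d, d)

    def forward(self, morpheme_ids):              # [B, k] abstract morphemes
        m_s = self.embed(morpheme_ids)            # [B, k, d]
        return torch.tanh(self.compose(m_s.mean(dim=1)))   # [B, d]
```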
Position-Dependent UR. Contrary to the customary practice in generative phonology, we also consider a position-dependent underlying representation. This means that the underlying representation of the word changes as the SF is being spelled out. We define such a position-dependent UF as p_ur(u_s | s_{<i}, m_s), where the representation at position i may depend on the prefix s_{<i} generated so far.
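In this variant the UR is recomputed at every step by attending over the morpheme embeddings with the decoder state, so it can drift as the SF is spelled out. The bilinear (Luong-style) score below, with a weight matrix T echoing the attention matrix T mentioned in the experimental settings, is our assumption, not the paper's exact Equation (12).

```python
import torch
import torch.nn as nn

class PositionDependentUR(nn.Module):
    """u_{s,i} = attend(m_s, h_i): the underlying representation is a
    convex combination of morpheme vectors, re-weighted at each step."""

    def __init__(self, n_morphemes: int, d: int = 200):
        super().__init__()
        self.embed = nn.Embedding(n_morphemes, d)
        self.T = nn.Linear(d, d, bias=False)      # bilinear attention weights

    def forward(self, morpheme_ids, h_i):
        # morpheme_ids: [B, k]; h_i: [B, d] decoder state after reading s_<i
        m_s = self.embed(morpheme_ids)                    # [B, k, d]
        scores = torch.einsum("bd,bkd->bk", self.T(h_i), m_s)
        alpha = torch.softmax(scores, dim=-1)             # attention over morphemes
        return torch.einsum("bk,bkd->bd", alpha, m_s)     # u_{s,i}
```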
[Figure 3 plots: panels for Dutch, English, and German; y-axes Accuracy, Edit Distance, and NLL; x-axes 200 to 800 training examples; legend: Joint UR, Position-Dependent UR, Position-Independent UR.]
Figure 3: Accuracy (↑), edit distance (↓), and surprisal (NLL ↓) on 3 languages of CELEX2. The y-axis represents the metric value, and the x-axis the amount of training examples. Shaded areas indicate one standard deviation for each estimate. Again, the Position-Independent UF outperforms both competitors by significant margins.

We evaluate how well a model of generative phonology can predict held-out surface forms (and orthographic forms) in word paradigms given a sequence of abstract morphemes. Note that our evaluation setup differs from the established practice in linguistics, where a system of rules or constraint rankings, combined with underlying forms, is favored over the alternatives if it better explains the full set of data and possibly satisfies other desiderata, such as simplicity (see §2). Instead, our setup based on blind testing quantifies the ability of any model of generative phonology to generalize to held-out forms. (Measuring generalization averts theory-internal considerations but implicitly selects for both accurate and simple models, as it indicates lack of over-fitting to the data.) In particular, we rely on three disjoint sets of data points: a training set to perform inference, a development set for fine-tuning hyper-parameters, and an evaluation set to test the model predictions.
We experiment on two datasets, CELEX2 (Baayen et al., 1995) for phonetic surface forms and UniMorph 2.0 (Kirov et al., 2018) for orthographic forms. The choice of the second dataset highlights the computational advantages of gradient underlying forms. In contrast to models with string-valued latent variables, our model can scale to morphological lexica that contain a number of forms in the same order of magnitude as natural languages. Indeed, Cotterell et al. (2015)'s largest lexicon covered only 1000 forms. Thus, this work provides the first large-scale induction of underlying forms with a computational model. In what follows, we provide additional details on the datasets.
CELEX2 (Baayen et al., 1995) provides surface forms and token counts for 3 languages: Dutch, English, and German. In particular, we use a subcorpus created by Cotterell et al. (2015), which contains 1000 nouns and verbs per language, and focuses on voicing patterns (such as final obstruent devoicing and voicing assimilation). The phonological annotation is limited to segmental features, and thus ignores suprasegmental phenomena such as prosody. As a training set, again following Cotterell et al. (2015), we consider several subsets of the subcorpus, sampling without replacement k instances based on the distribution defined by normalized token counts (see the sketch below). In particular, we consider k ∈ {200, 400, 600, 800}. In this dataset, each word has at most 2 abstract morphemes, consisting of a stem and a (possibly empty) affix.

UniMorph 2.0 (Kirov et al., 2018) is a dataset of vast proportions, covering a wide spectrum of languages. Moreover, it is the most widespread benchmark for morphological inflection (Vylomova et al., 2020). All data are taken from the UniMorph project (https://unimorph.github.io). Specifically, we look at the following 28 languages: Albanian, Arabic, Armenian, Basque, Bulgarian, Czech, Danish, Dutch, English, Estonian, French, German, Hebrew, Hindi, Irish, Italian, Latvian, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, Ukrainian, Urdu and Welsh. The languages come from 4 stocks (Indo-European, Afro-Asiatic, Finno-Ugric and Turkic), with Basque, a language isolate, included as well. They represent a reasonable degree of typological diversity. We lament that the Indo-European family is overrepresented in the UniMorph dataset. However, within the Indo-European family, we consider a diverse set of genera: Albanian, Armenian, Slavic, Germanic, Romance, Indo-Aryan, Baltic and Celtic. Contrary to CELEX2, UniMorph 2.0 contains only orthographic strings and currently lacks a phonemic transcription.

Table 1: Accuracy (ACC ↑), edit distance (MLD ↓), and surprisal (NLL ↓) on 28 languages of UniMorph 2.0 (rows ARA through URD, plus the average) for the Position-Independent UF, the Position-Dependent UF, and the Joint Model. The Position-Independent UF outperforms both competitors, often by a very large margin.
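For concreteness, the CELEX2 subsampling described above (k items drawn without replacement, with probability proportional to token count) can be sketched as follows; the function and variable names are ours.

```python
import numpy as np

def sample_training_set(word_ids, token_counts, k, seed=0):
    """Draw k items without replacement from the distribution defined by
    normalized token counts (one of 10 such samples per size k)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(token_counts, dtype=float)
    p /= p.sum()                                   # normalized token counts
    return rng.choice(word_ids, size=k, replace=False, p=p)

# e.g. for k in (200, 400, 600, 800): sample_training_set(ids, counts, k)
```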
The dimension of morpheme embeddings m, underlying form embeddings u, surface character embeddings s, and the hidden size of the 1-layer decoder LSTM are all d = 200. The learnable parameters are therefore the matrices W and V of the feed-forward output network in Equation (7) and the matrix T of the attention mechanism in Equation (12), with shapes determined by d and |Σ|. We additionally apply 0.2 dropout (Srivastava et al., 2014) to all embeddings. The model is trained with Adam (Kingma and Ba, 2015) with a learning rate of 1e-3. We halve the learning rate whenever the development loss does not improve, and we stop early when the learning rate drops below 1e-5. Finally, we wait 10 epochs before dropping the learning rate in the CELEX2 experiments.

We consider 3 evaluation metrics: (i) surprisal (the negative log probability normalized by length) of the held-out surface or orthographic forms (NLL), (ii) accuracy of the 1-best prediction (ACC), and (iii) edit distance of the 1-best prediction (MLD). Given the unfeasibly many possible splits of CELEX2 with k training examples, for each desired size we estimate the expected performance through 10 random samples. In particular, we report the sample mean and the standard deviation of each metric.

We present the results for surface form prediction on CELEX2 in Figure 3 and the results for orthographic form prediction on UniMorph 2.0 in Table 1. Of the three neural models we compare, we find that the position-independent variant performs the best. This behavior is consistent across both datasets, 3 evaluation metrics (accuracy, edit distance, and surprisal), and 4 sizes of training sets (for CELEX2). For UniMorph 2.0, all the differences are significant under a paired-permutation test with p < 0.05. Indeed, this result is quite strong, as it holds for every one of our 28 languages.

The model with position-independent UFs is superior to the second-best model with position-dependent UFs: it increases accuracy by 17.3 absolute points (+23%), reduces edit distance by 0.578 points (-77.07%), and reduces surprisal by 0.046 points (-56.10%). This shows the importance of independence assumptions of UFs with respect to SFs. What is more, the joint model without UFs achieves the worst scores, with an additional gap in performance. This result showcases the need for latent variables bridging between abstract morphemes and SFs, which correspond to gradient UFs in our formulation. While both these findings confirm received wisdom from the literature in generative phonology, the preeminence of the variant with position-independent UFs is surprising from the computational perspective: attention-based models are generally superior to those without attention in morphological tasks (Aharoni and Goldberg, 2017; Wu et al., 2018; Wu and Cotterell, 2019, inter alia).

The results suggest that a mode that encodes a single underlying representation is best for modeling the phonological and orthographic forms. If one accepts the connection between our real-valued latent variable and the traditional underlying forms of generative phonology, these results provide support for the idea that an underlying form is useful for explaining the surface forms in the lexicon.
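The optimization recipe above maps onto a short training loop; a sketch follows, assuming PyTorch's ReduceLROnPlateau as the halving mechanism, with the model and the train/evaluate functions as stand-in placeholders for the full system.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                    # placeholder for the full model
max_epochs = 1000

def train_one_epoch(model, optimizer):     # placeholder training step
    optimizer.zero_grad()
    loss = model(torch.zeros(1, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()

def evaluate(model):                       # placeholder development surprisal
    with torch.no_grad():
        return model(torch.zeros(1, 8)).pow(2).mean().item()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr = 1e-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5,     # halve on development-loss plateau
    patience=10)                           # the 10-epoch wait used for CELEX2

for epoch in range(max_epochs):
    train_one_epoch(model, optimizer)
    dev_loss = evaluate(model)             # development-set loss
    scheduler.step(dev_loss)
    if optimizer.param_groups[0]["lr"] < 1e-5:
        break                              # stop early once lr < 1e-5
```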
We have presented a differentiable and probabilistic version of generative phonology. Instead of having string-valued underlying forms, as in traditional generative phonology, we have relaxed this formalism to one where the underlying forms are real-valued vectors. Using sequence-to-sequence neural models that have become standard in natural language processing (NLP), our differentiable generative phonology spells out the words from this latent vector space. Since our model can be learned in an end-to-end fashion, underlying forms are discovered automatically rather than being posited by linguists. While lacking a phonetic interpretation, these forms can be compared with respect to their geometric distance, revealing phonological patterns such as vowel harmony.

The empirical portion of our paper conducts experiments on 3 languages from the CELEX2 dataset for surface form prediction and 28 languages from the UniMorph 2.0 dataset for morphological inflection. We find that, under three metrics, our model of differentiable phonology achieves performance comparable with previous work in the small-data regime, but scales to many more forms in the large-scale setting. Finally, we compare several variants of our model, lending credibility to two conjectures of generative phonology: that the variation in SFs is mediated by UFs, and that these are conditionally independent from SFs. All of our code, models and results may be found at https://github.com/shijie-wu/neural-transducer.

References
Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2004–2015, Vancouver, Canada.

Diana Apoussidou. 2006. On-line learning of underlying forms. Rutgers Optimality Archive.

R. Harald Baayen, Richard Piepenbrock, and Léon Gulikers. 1995. CELEX2, LDC96L14. Technical report, Linguistic Data Consortium.

Joan Bybee. 2007. Frequency of Use and the Organization of Language. Oxford University Press.

Noam Chomsky. 1955. The logical structure of linguistic theory. Technical report, MIT.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. MIT Press.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver, Canada.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics, 3:433–447.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 101–110, Singapore.

Sarah Eisenstat. 2009. Learning underlying forms with MaxEnt. Master's thesis, Brown University.

Edward Flemming. 2001. Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology, 18(1):7–44.

Bruce Hayes. 2009. Introductory Phonology. Blackwell.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Joan B. Hooper. 1976. An Introduction to Natural Generative Phonology. Academic Press.

Larry M. Hyman. 1970. How concrete is phonology? Language, 46(1):58–76.

Roman Jakobson. 1948. Russian conjugation. Word, 4(3):155–167.

Roman Jakobson, C. Gunnar Fant, and Morris Halle. 1952. Preliminaries to speech analysis: The distinctive features and their correlates. Technical report, Acoustic Laboratory, MIT.

Gaja Jarosz. 2006. Richness of the base and probabilistic unsupervised learning in Optimality Theory. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006, pages 50–59, New York City, USA.

Gaja Jarosz. 2013. Learning with hidden structure in Optimality Theory and Harmonic Grammar: Beyond robust interpretive parsing. Phonology, 30(1):27–71.

Michael Kenstowicz and Charles Kisseberth. 1979. Generative Phonology: Description and Theory. Academic Press.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada.

Paul Kiparsky. 1968. How Abstract is Phonology? Indiana University Linguistics Club.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.

John J. Ohala. 1990. The phonetics and phonology of aspects of assimilation. In John Kingston and Mary E. Beckman, editors, Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech, pages 258–275. Cambridge University Press.

Janet Pierrehumbert, Mary E. Beckman, and D. Robert Ladd. 2000. Conceptual foundations of phonology as a laboratory science. In Noel Burton-Roberts, Philip Carr, and Gerard Docherty, editors, Phonological Knowledge: Conceptual and Empirical Issues, pages 273–304. Oxford University Press.

Alan Prince and Paul Smolensky. 2008. Optimality Theory: Constraint Interaction in Generative Grammar. John Wiley & Sons.

Ezer Rasin and Roni Katzir. 2016. On evaluation metrics in optimality theory. Linguistic Inquiry, 47(2):235–282.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning internal representations by error propagation. Technical report, California University San Diego La Jolla.

Paul Smolensky and Géraldine Legendre. 2006. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Cognitive Architecture), volume 1. MIT Press.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online.

Shijie Wu and Ryan Cotterell. 2019. Exact hard monotonic attention for character-level transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1530–1537, Florence, Italy.

Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.