Differentiable Generative Phonology
Shijie Wu∗ (Johns Hopkins University)   Edoardo M. Ponti∗ (Mila Montreal, McGill University, University of Cambridge)   Ryan Cotterell (University of Cambridge, ETH Zürich)
[email protected]   {ep490,rdc42}@cam.ac.uk
∗ Equal contribution

Abstract
The goal of generative phonology, as formulated by Chomsky and Halle (1968), is to specify a formal system that explains the set of attested phonological strings in a language. Traditionally, a collection of rules (or constraints, in the case of optimality theory) and underlying forms (UFs) are posited to work in tandem to generate phonological strings. However, the degree of abstraction of UFs with respect to their concrete realizations is contentious. As the main contribution of our work, we implement the phonological generative system as a neural model differentiable end-to-end, rather than as a set of rules or constraints. Contrary to traditional phonology, in our model UFs are continuous vectors in ℝ^d, rather than discrete strings. As a consequence, UFs are discovered automatically rather than posited by linguists, and the model can scale to the size of a realistic vocabulary. Moreover, we compare several modes of the generative process, contemplating: i) the presence or absence of an underlying representation in between morphemes and surface forms (SFs); and ii) the conditional dependence or independence of UFs with respect to SFs. We evaluate the ability of each mode to predict attested phonological strings on 2 datasets covering 5 and 28 languages, respectively. The results corroborate two tenets of generative phonology, viz. the necessity for UFs and their independence from SFs. In general, our neural model of generative phonology learns both UFs and SFs automatically and on a large scale. The code is available at https://github.com/shijie-wu/neural-transducer.

1 Introduction

Generative phonology is one of the most prominent paradigms in phonological analysis (Hayes, 2009). The goal of the research program is to devise a formal system that allows linguists to explain the systematic variation in the surface forms (SFs) of a language.

[Figure 1: Learned underlying representations for Turkish, dimensionality-reduced to ℝ² with t-SNE; panel labels {i, e}, {ı, a}, {u, o}, {ü, ö}. The gray dots show the suffixes and the colored dots show the stems. 2-way vowel harmony is encoded: back vowels (red and purple) concentrate on the left and front vowels (green and blue) on the right.]

For instance, consider how the past tense is expressed in English, a classic case of allomorphy. Comparing talked ([tʰɔːkt]), saved ([seɪvd]), and acted ([ˈæk.tɪd]), at least three pronunciations ([t], [d], [ɪd]) and at least two spellings (-ed and -d) can be counted for the past tense morpheme. Traditionally, linguists assume that each morpheme has a single underlying form (UF), a string of phonemes shared across all its contexts (Jakobson, 1948): for the past tense, /-d/. The surface variation is explained by generative phonology via a grammar, a set of rewrite rules (Chomsky and Halle, 1968) or constraint rankings (Prince and Smolensky, 2008) that map the UFs of a sequence of morphemes to the observed SFs.

In this work, we offer the first end-to-end differentiable version of generative phonology, i.e. a neural network, trainable by backpropagation, that can perform the same function that a generative phonology does: to derive the set of attested SFs. In terms of innovation for linguistic theory, the primary motivation for this paper is to reconsider the mathematical type of UFs, which are latent variables (i.e. never observed). What if, rather than being string-valued, we consider a more abstract representation, namely, a vector in ℝ^d?
This would immediately yield certain advantages: in an end-to-end differentiable phonology, we can backpropagate through the underlying representations themselves, rather than have a human or machine perform a difficult combinatorial search problem to find the best string-valued representation. We refer to this framework, which automates the process of being a phonologist, as differentiable phonology.

In the technical section of the paper, we discuss a series of possible instantiations of differentiable phonology. (Note that there are many possible neuralizations of generative phonology, several of which are beyond the scope of this paper.) Drawing on recent work in the morphological inflection literature (Cotterell et al., 2017), we develop neural sequence-to-sequence models to spell out the attested SFs given the word morphemes. In particular, we compare three variants relying on different assumptions: a) the presence (or absence) of an intermediate latent variable to bridge between morphemes and SFs, namely UFs with abstract real-valued vector representations; and b) the conditional independence (or dependence) of UFs with respect to SFs.

Moreover, we take time to discuss the impact of these ideas on phonological theory: despite not being fully interpretable, as traditional underlying strings are, real-valued vector representations can be flexibly compared through cosine similarity. We exhibit learned underlying representations from our differentiable phonology in Figure 1. For instance, we find that some classic phonological patterns, e.g. vowel harmony, are immediately evident from this visualization, where morphemes with back and front vowels cluster separately.

Empirically, we provide results on the phonological dataset taken from CELEX2 (Baayen et al., 1995) and the orthography-based dataset of UniMorph (Kirov et al., 2018). The task is to predict missing forms in a paradigm (some slots are held out), using the inferred underlying representations. We show that our model of differentiable generative phonology has comparable performance to the latent-string model of Cotterell et al. (2015) on the tiny CELEX2 dataset. However, the model scales to the UniMorph datasets, which are two orders of magnitude larger. Moreover, we find that the best-performing neuralization of differentiable phonology justifies two propositions of generative phonology, viz. the presence of UFs and their conditional independence from SFs.
2 Generative Phonology

In this section, we briefly formalize generative phonology in our own notation. First, we define the following six sets:

• Δ = {d_1, …, d_{|Δ|}} is the discrete alphabet of underlying phonemes.
• Σ = {s_1, …, s_{|Σ|}} is the discrete alphabet of surface phones.
• M = {μ_1, …, μ_{|M|}} is the discrete set of abstract morphemes.
• U ⊂ Δ* is the set of underlying forms. The optimality-theoretic notion of richness of the base (Prince and Smolensky, 2008) suggests that we should always consider all potential forms.
• S ⊂ Σ* is the set of realizable surface forms. (Note that many, if not most, of these forms will be phonotactically invalid under the language's grammar.)
• S̃ ⊂ S is the observed set of surface forms that the phonologist has access to in order to perform their analysis.

Furthermore, let s = s_1 ··· s_{|s|} ∈ S be a surface form. In phonological theory, a surface form is an observed sequence of phonological symbols, e.g., symbols of the International Phonetic Alphabet (IPA). This stands in contrast to the notion of an orthographic form, which is the sequence of orthographic symbols used in writing. To see the difference, contrast the orthographic form talked and the surface form [tʰɔːkt] of the same word.

A surface form s may be decomposed, semantically, into a (variable-length) vector of abstract morphemes:

decomp(s) = μ_s = [μ_1, …, μ_k]   (1)

(The term abstract morpheme is non-standard in the phonological literature, but we find it useful for our exposition.) With the term abstract morpheme, we are referring to an abstract unit that carries semantics, but does not have phonological or phonetic content associated with it. For each abstract morpheme μ ∈ M in a word, let m_μ ∈ U be its underlying representation. The concept of the underlying representation is commonplace in generative phonology. Using the terminology of machine learning, an underlying form is a discrete latent variable that helps explain the relation between the surface forms in the lexicon for the same abstract morpheme.

As a concrete example of our decomposition into abstract morphemes, consider that the English ran may be decomposed into its constituent abstract morphemes RUN and
PAST. Then, the underlying forms for the morphemes are m_RUN = /ɹʌn/ and m_PAST = /d/. As a shorthand, we will write m_s = [m_{μ_1}, …, m_{μ_k}]. The underlying forms of the morphemes are stitched together to create an underlying form for the entire word. We will write u_s as the underlying form for the entire surface form. In general, u_s will also be a latent variable. For instance, the underlying form for [ɹæn] (ran) may be /ɹʌn+d/.

A generative phonology P : U → S is a mapping from the space of underlying forms U to the space of surface forms S. The function P may take the form of string rewrite rules, as suggested by Chomsky and Halle (1968), or ordered, violable constraints, as suggested by Prince and Smolensky (2008). Indeed, there are many different formalisms for P; we will avoid the details of any one specific formalism, taking a broader perspective. The job of a generative phonologist is, then, to discover an underlying form u_s for every surface form s, as well as to specify the phonology P within their chosen formalism.

Which generative phonology is to be preferred? First, it must explain the data: given the empirical set of surface forms S̃ ⊂ S ⊂ Σ* that the phonologist has access to, the function P is chosen, along with corresponding underlying forms u_s, such that s = P(u_s) for all s ∈ S̃. Second, the quality of the phonology is evaluated through Occam's razor: simpler phonologies are better than more complex ones, to the extent that both explain the data. This idea goes back to Chomsky (1955, p. 118), but has recently been operationalized with information theory (Rasin and Katzir, 2016). Also, the phonology should avoid language-specific constraints. In optimality theory, for example, the set of constraints in the grammar is said to be universal, and languages differ only in their ordering of the constraints.

Another desideratum is that the phonology P can account for the gradient nature of well-formedness judgments of speakers (Hayes and Wilson, 2008). This justifies the probabilistic treatments of generative phonology that have been given in the literature (Jarosz, 2006; Apoussidou, 2006; Eisenstat, 2009; Jarosz, 2013; Cotterell et al., 2015). In broad strokes, these models have treated u_s as a string-valued latent variable. In the fully probabilistic case of Cotterell et al. (2015), the goal was to define a joint probability distribution over surface forms and underlying forms and then to marginalize out the underlying form, as it was never observed. Working within the framework of string-valued graphical modeling (Dreyer and Eisner, 2009), one can approximately perform inference despite the Δ*-valued latent variable using a variant of belief propagation. In contrast, we focus on a probabilistic generative phonology that has real-valued latent variables as underlying representations.

The methodological jump we make in this paper, in contrast to the discrete latent-variable approach discussed in Cotterell et al. (2015), is to treat the underlying forms as gradient, i.e. to treat the underlying forms as a continuous latent variable. This is loosely inspired by the ideas of gradient symbolic computation (GSC; Smolensky and Legendre, 2006). However, we emphasize that this connection is loose, as GSC creates gradient structures as a blend of discrete ones, whereas we remove the notion of discreteness altogether. Much like Cotterell et al. (2015), the final form of our model will be a probabilistic latent-variable model.
However, in contrast, using the notation in §2, we will define U = ℝ^d. Thus, each underlying form is a vector in U, e.g., m_RUN ∈ ℝ^d and m_PAST ∈ ℝ^d.
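To make the formal objects above concrete, here is a toy rendering in Python. The lexicon entries, the single rewrite rule inside `phonology`, and all function names are invented for illustration; the paper defines these objects only mathematically. The last lines preview this paper's move of replacing the string-valued lexicon with vectors in ℝ^d.

```python
import numpy as np

# A toy rendering of the objects in §2 (all names are ours, for exposition).
DELTA = set("ɹʌændt")          # Δ: underlying phonemes (toy alphabet)
SIGMA = set("ɹʌændt")          # Σ: surface phones

# Each abstract morpheme μ ∈ M has an underlying form m_μ ∈ Δ*.
underlying = {"RUN": "ɹʌn", "PAST": "d"}

def decomp(surface_form: str) -> list:
    """Eq. (1): semantic decomposition into abstract morphemes (toy lexicon)."""
    return {"ɹæn": ["RUN", "PAST"]}[surface_form]

def phonology(u: str) -> str:
    """The mapping P : U → S. One toy rewrite rule stands in for the rule
    systems or constraint rankings a phonologist would posit."""
    return u.replace("ʌ", "æ").replace("+d", "")

u_s = "+".join(underlying[mu] for mu in decomp("ɹæn"))   # u_s = /ɹʌn+d/
assert phonology(u_s) == "ɹæn"                           # s = P(u_s)

# This paper's relaxation: U = ℝ^d, i.e. vector-valued underlying forms
# learnable by backpropagation instead of combinatorial search.
d = 200
underlying_vec = {mu: np.random.randn(d) for mu in underlying}
```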
Linguistic Justification. From a theoretical perspective, a more abstract representation of UFs is far from being far-fetched. In the generative framework, Kiparsky (1968) initiated the discussion about how much phonological theory should permit phonological representations to deviate from phonetic reality (Kenstowicz and Kisseberth, 1979, ch. 6): conceptualizing contrasts in the underlying forms that never emerge in SFs, so-called 'absolute neutralizations'. (Contrary to contextual neutralizations, these are irreversible, in that they cannot re-emerge in SFs after language change; unstable, in that they lead to reanalysis of the words involved in a lexical fashion; and non-productive.) The stance that some segments may not be realized phonetically was similarly held by Hyman (1970).

Outside the generative framework, additional arguments have been put forth challenging the idea of discrete underlying forms, and arguing instead in favor of a continuous characterization of phonology. First, this succeeds in grounding phonology in the acoustic and articulatory space of phonetics, a problem known as 'naturalness' (Jakobson et al., 1952). For instance, it explains the parallel between phonetic consonant-vowel co-articulation and phonological assimilation (Ohala, 1990; Flemming, 2001). Second, it explains lexical diffusion. Sound changes spread throughout the lexicon incrementally, rather than abruptly as one would expect with discrete phonemes (Bybee, 2007). Moreover, the rate of change is known to correlate with word frequencies. In order for usage to affect lexical diffusion, the underlying representations must be amenable to vary along a continuum.

In this work, we commit to the assumptions of traditional generative phonology, which exclude considerations of usage and performance from the assessment of competence in a language. However, we abandon the assumption that phonology is discrete in nature. This is not to deny the presence of classical phonemic units at some level of abstraction; rather, we maintain that they can be "embedded in a continuous description," borrowing the expression of Pierrehumbert et al. (2000, p. 20). This leaves open the possibility that phonological constraints and finer-grained phonetic details may co-exist in the same representation (Flemming, 2001).
Doing Linguistics Through Backpropagation.
As mentioned in §2, the goal of the generative phonologist is to posit underlying forms that help explain the relationships among the observed surface forms S̃. Even when this problem is posed probabilistically using latent-variable models, one still has to contend with a tricky discrete inference problem: a sum over Δ*. In contrast, we are going to learn the underlying representations through backpropagation (Rumelhart et al., 1985), as they live in a continuous space. The beauty of this approach is that we can compute the gradient exactly in polynomial time. This means that we can automate the work of a phonologist using a neural network.

However, as mentioned above, we will have to live with the idea that our learned representations are considerably more abstract than those in traditional phonology. As a consequence, our underlying forms have no direct phonetic interpretation (they are merely vectors in ℝ^d) and thus do not lead to any indirect association with similarly sounding words. Despite this, our underlying forms can be discussed in relation to their mutual similarity, based on any distance metric (e.g. cosine distance). In addition, their most salient properties can be visualized via dimensionality reduction (e.g. Principal Component Analysis).

We formulate our differentiable generative phonology as a probabilistic model of the set of observed surface forms S̃ conditioned on the morphological lexicon M. We factorize this distribution as follows:

p(S̃ | M) = ∏_{s ∈ S̃} p_phon(s | m_s) = ∏_{s ∈ S̃} ∏_{i=1}^{|s|} p_phon(s_i | s_{<i}, m_s)   (2)
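The factorization in Eq. (2) is what makes the whole system trainable end to end: the corpus negative log-likelihood decomposes into per-character terms, each differentiable in the morpheme embeddings. Below is a minimal sketch, assuming surface forms are encoded as lists of integer symbol ids and a generic per-character network `p_phon`; both are placeholders of our own.

```python
import torch

def corpus_nll(surface_forms, morpheme_ids, embed, p_phon):
    """-log p(S̃ | M) = -Σ_s Σ_i log p_phon(s_i | s_<i, m_s), as in Eq. (2)."""
    nll = torch.tensor(0.0)
    for s, mus in zip(surface_forms, morpheme_ids):   # s: list of symbol ids
        m_s = embed(torch.tensor(mus))                # [k, d] morpheme vectors
        for i in range(len(s)):
            probs = p_phon(s[:i], m_s)                # distribution over Σ
            nll = nll - torch.log(probs[s[i]])
        # an end-of-word symbol would also be scored here; omitted for brevity
    return nll

# Calling nll.backward() sends gradients into embed.weight: the continuous
# underlying representations are thus discovered by backpropagation.
```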
Defining p_sr. We first define the part of the neural network that spells out the surface form conditioned on the underlying representation u_s. We propose that this is a conditional long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) decoder, similar to the one used in Sutskever et al. (2014) for neural machine translation. This is a recurrent neural network that conditions the decoder on a fixed-length vector. More formally, this network defines the distribution p_sr(s_i | s_{<i}, u_s) over the next surface symbol, given the prefix s_{<i} and the underlying form u_s.
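A sketch of such a conditional decoder follows, in the spirit of Sutskever et al. (2014). The exact conditioning (here, u_s initializes the LSTM hidden state) and all sizes are our assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """p_sr(s_i | s_<i, u_s): an LSTM decoder conditioned on a fixed-length
    underlying representation u_s (sketch; sizes are assumptions)."""

    def __init__(self, n_symbols: int, d: int = 200):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d)   # surface character embeddings
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, n_symbols)

    def forward(self, prefix_ids, u_s):
        # prefix_ids: [B, i] characters s_<i (a BOS symbol keeps it non-empty)
        # u_s:        [B, d] continuous underlying form of the word
        h0 = u_s.unsqueeze(0)                     # condition the initial state
        c0 = torch.zeros_like(h0)
        x = self.embed(prefix_ids)
        h, _ = self.lstm(x, (h0, c0))
        return torch.log_softmax(self.out(h[:, -1]), dim=-1)  # log p_sr(· | s_<i, u_s)
```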
Position-Independent UR. Traditionally, in generative phonology, the underlying representation is taken to be independent of how the word is spelled out; within our model, we express this as the following conditional independence assumption:

p_ur(u_s | s_{<i}, m_s) = p_ur(u_s | m_s)
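Under this assumption, u_s can be computed once from the morpheme embeddings and reused at every decoding step. The composition below (a mean of morpheme vectors passed through a linear layer) is a placeholder of our own; the paper's exact composition function is not reproduced here.

```python
import torch
import torch.nn as nn

class PositionIndependentUR(nn.Module):
    """u_s = f(m_s): one underlying representation per word, independent of
    the surface prefix s_<i (the composition f is our placeholder)."""

    def __init__(self, n_morphemes: int, d: int = 200):
        super().__init__()
        self.embed = nn.Embedding(n_morphemes, d)
        self.compose = nn.Linear(d, d)

    def forward(self, morpheme_ids):              # [B, k] abstract morphemes
        m_s = self.embed(morpheme_ids)            # [B, k, d]
        return torch.tanh(self.compose(m_s.mean(dim=1)))   # [B, d]
```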
Position-Dependent UR. Contrary to the customary practice in generative phonology, we also consider a position-dependent underlying representation. This means that the underlying representation of the word changes as the SF is being spelled out. We define such a position-dependent UF as p_ur(u_s | s_{<i}, m_s), where the representation at position i may depend on the prefix s_{<i} generated so far.
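In this variant the UR is recomputed at every step by attending over the morpheme embeddings with the decoder state, so it can drift as the SF is spelled out. The bilinear (Luong-style) score below, with a weight matrix T echoing the attention matrix T mentioned in the experimental settings, is our assumption, not the paper's exact Equation (12).

```python
import torch
import torch.nn as nn

class PositionDependentUR(nn.Module):
    """u_{s,i} = attend(m_s, h_i): the underlying representation is a
    convex combination of morpheme vectors, re-weighted at each step."""

    def __init__(self, n_morphemes: int, d: int = 200):
        super().__init__()
        self.embed = nn.Embedding(n_morphemes, d)
        self.T = nn.Linear(d, d, bias=False)      # bilinear attention weights

    def forward(self, morpheme_ids, h_i):
        # morpheme_ids: [B, k]; h_i: [B, d] decoder state after reading s_<i
        m_s = self.embed(morpheme_ids)                    # [B, k, d]
        scores = torch.einsum("bd,bkd->bk", self.T(h_i), m_s)
        alpha = torch.softmax(scores, dim=-1)             # attention over morphemes
        return torch.einsum("bk,bkd->bd", alpha, m_s)     # u_{s,i}
```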
[Figure 3 plots: panels for Dutch, English, and German; y-axes Accuracy, Edit Distance, and NLL; x-axes 200 to 800 training examples; legend: Joint UR, Position-Dependent UR, Position-Independent UR.]
Figure 3: Accuracy (↑), edit distance (↓), and surprisal (NLL ↓) on 3 languages of CELEX2. The y-axis represents the metric value, and the x-axis the amount of training examples. Shaded areas indicate one standard deviation for each estimate. Again, the Position-Independent UF outperforms both competitors by significant margins.

We evaluate how well a model of generative phonology can predict held-out surface forms (and orthographic forms) in word paradigms given a sequence of abstract morphemes. Note that our evaluation setup differs from the established practice in linguistics, where a system of rules or constraint rankings, combined with underlying forms, is favored over the alternatives if it better explains the full set of data and possibly satisfies other desiderata, such as simplicity (see §2). Instead, our setup based on blind testing quantifies the ability of any model of generative phonology to generalize to held-out forms. (Measuring generalization averts theory-internal considerations but implicitly selects for both accurate and simple models, as it indicates lack of over-fitting to the data.) In particular, we rely on three disjoint sets of data points: a training set to perform inference, a development set for fine-tuning hyper-parameters, and an evaluation set to test the model predictions.
We experiment on two datasets, CELEX2 (Baayen et al., 1995) for phonetic surface forms and UniMorph 2.0 (Kirov et al., 2018) for orthographic forms. The choice of the second dataset highlights the computational advantages of gradient underlying forms. In contrast to models with string-valued latent variables, our model can scale to morphological lexica that contain a number of forms in the same order of magnitude as natural languages. Indeed, Cotterell et al. (2015)'s largest lexicon covered only 1000 forms. Thus, this work provides the first large-scale induction of underlying forms with a computational model. In what follows, we provide additional details on the datasets.
CELEX2 (Baayen et al., 1995) provides surface forms and token counts for 3 languages: Dutch, English, and German. In particular, we use a subcorpus created by Cotterell et al. (2015), which contains 1000 nouns and verbs per language, and focuses on voicing patterns (such as final obstruent devoicing and voicing assimilation). The phonological annotation is limited to segmental features, and thus ignores suprasegmental phenomena such as prosody. As a training set, again following Cotterell et al. (2015), we consider several subsets of the subcorpus, sampling without replacement k instances based on the distribution defined by normalized token counts (see the sketch below). In particular, we consider k ∈ {200, 400, 600, 800}. In this dataset, each word has at most 2 abstract morphemes, consisting of a stem and a (possibly empty) affix.

UniMorph 2.0 (Kirov et al., 2018) is a dataset of vast proportions, covering a wide spectrum of languages. Moreover, it is the most widespread benchmark for morphological inflection (Vylomova et al., 2020). All data are taken from the UniMorph project (https://unimorph.github.io). Specifically, we look at the following 28 languages: Albanian, Arabic, Armenian, Basque, Bulgarian, Czech, Danish, Dutch, English, Estonian, French, German, Hebrew, Hindi, Irish, Italian, Latvian, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, Ukrainian, Urdu and Welsh. The languages come from 4 stocks (Indo-European, Afro-Asiatic, Finno-Ugric and Turkic), with Basque, a language isolate, included as well. They represent a reasonable degree of typological diversity. We lament that the Indo-European family is overrepresented in the UniMorph dataset. However, within the Indo-European family, we consider a diverse set of genera: Albanian, Armenian, Slavic, Germanic, Romance, Indo-Aryan, Baltic and Celtic. Contrary to CELEX2, UniMorph 2.0 contains only orthographic strings and currently lacks a phonemic transcription.

Table 1: Accuracy (ACC ↑), edit distance (MLD ↓), and surprisal (NLL ↓) on 28 languages of UniMorph 2.0 (rows ARA through URD, plus the average) for the Position-Independent UF, the Position-Dependent UF, and the Joint Model. The Position-Independent UF outperforms both competitors, often by a very large margin.
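For concreteness, the CELEX2 subsampling described above (k items drawn without replacement, with probability proportional to token count) can be sketched as follows; the function and variable names are ours.

```python
import numpy as np

def sample_training_set(word_ids, token_counts, k, seed=0):
    """Draw k items without replacement from the distribution defined by
    normalized token counts (one of 10 such samples per size k)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(token_counts, dtype=float)
    p /= p.sum()                                   # normalized token counts
    return rng.choice(word_ids, size=k, replace=False, p=p)

# e.g. for k in (200, 400, 600, 800): sample_training_set(ids, counts, k)
```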
The dimension of morpheme embeddings m, underlying form embeddings u, surface character embeddings s, and the hidden size of the 1-layer decoder LSTM are all d = 200. The learnable parameters are therefore the matrices W and V of the feed-forward output network in Equation (7) and the matrix T of the attention mechanism in Equation (12), with shapes determined by d and |Σ|. We additionally apply 0.2 dropout (Srivastava et al., 2014) to all embeddings. The model is trained with Adam (Kingma and Ba, 2015) with a learning rate of 1e-3. We halve the learning rate whenever the development loss does not improve, and we stop early when the learning rate drops below 1e-5. Finally, we wait 10 epochs before dropping the learning rate in the CELEX2 experiments.

We consider 3 evaluation metrics: (i) surprisal (the negative log probability normalized by length) of the held-out surface or orthographic forms (NLL), (ii) accuracy of the 1-best prediction (ACC), and (iii) edit distance of the 1-best prediction (MLD). Given the unfeasibly many possible splits of CELEX2 with k training examples, for each desired size we estimate the expected performance through 10 random samples. In particular, we report the sample mean and the standard deviation of each metric.

We present the results for surface form prediction on CELEX2 in Figure 3 and the results for orthographic form prediction on UniMorph 2.0 in Table 1. Of the three neural models we compare, we find that the position-independent variant performs the best. This behavior is consistent across both datasets, 3 evaluation metrics (accuracy, edit distance, and surprisal), and 4 sizes of training sets (for CELEX2). For UniMorph 2.0, all the differences are significant under a paired-permutation test with p < 0.05. Indeed, this result is quite strong, as it holds for every one of our 28 languages.

The model with position-independent UFs is superior to the second-best model with position-dependent UFs: it increases accuracy by 17.3 absolute points (+23%), reduces edit distance by 0.578 points (-77.07%), and reduces surprisal by 0.046 points (-56.10%). This shows the importance of independence assumptions of UFs with respect to SFs. What is more, the joint model without UFs achieves the worst scores, with an additional gap in performance. This result showcases the need for latent variables bridging between abstract morphemes and SFs, which correspond to gradient UFs in our formulation. While both these findings confirm received wisdom from the literature in generative phonology, the preeminence of the variant with position-independent UFs is surprising from the computational perspective: attention-based models are generally superior to those without attention in morphological tasks (Aharoni and Goldberg, 2017; Wu et al., 2018; Wu and Cotterell, 2019, inter alia).

The results suggest that a mode that encodes a single underlying representation is best for modeling the phonological and orthographic forms. If one accepts the connection between our real-valued latent variable and the traditional underlying forms of generative phonology, these results provide support for the idea that an underlying form is useful for explaining the surface forms in the lexicon.
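The optimization recipe above maps onto a short training loop; a sketch follows, assuming PyTorch's ReduceLROnPlateau as the halving mechanism, with the model and the train/evaluate functions as stand-in placeholders for the full system.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                    # placeholder for the full model
max_epochs = 1000

def train_one_epoch(model, optimizer):     # placeholder training step
    optimizer.zero_grad()
    loss = model(torch.zeros(1, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()

def evaluate(model):                       # placeholder development surprisal
    with torch.no_grad():
        return model(torch.zeros(1, 8)).pow(2).mean().item()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr = 1e-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5,     # halve on development-loss plateau
    patience=10)                           # the 10-epoch wait used for CELEX2

for epoch in range(max_epochs):
    train_one_epoch(model, optimizer)
    dev_loss = evaluate(model)             # development-set loss
    scheduler.step(dev_loss)
    if optimizer.param_groups[0]["lr"] < 1e-5:
        break                              # stop early once lr < 1e-5
```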
We have presented a differentiable and probabilistic version of generative phonology. Instead of having string-valued underlying forms, as in traditional generative phonology, we have relaxed this formalism to one where the underlying forms are real-valued vectors. Using sequence-to-sequence neural models that have become standard in natural language processing (NLP), our differentiable generative phonology spells out the words from this latent vector space. Since our model can be learned in an end-to-end fashion, underlying forms are discovered automatically rather than being posited by linguists. While lacking a phonetic interpretation, these forms can be compared with respect to their geometric distance, revealing phonological patterns such as vowel harmony.

The empirical portion of our paper conducts experiments on 3 languages from the CELEX2 dataset for surface form prediction and 28 languages from the UniMorph 2.0 dataset for morphological inflection. We find that, under three metrics, our model of differentiable phonology achieves performance comparable with previous work in the small-data regime, but scales to many more forms in the large-scale setting. Finally, we compare several variants of our model, lending credibility to two conjectures of generative phonology: that the variation in SFs is mediated by UFs, and that these are conditionally independent from SFs. All of our code, models and results may be found at https://github.com/shijie-wu/neural-transducer.

References
Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2004–2015, Vancouver, Canada.

Diana Apoussidou. 2006. On-line learning of underlying forms. Rutgers Optimality Archive.

R. Harald Baayen, Richard Piepenbrock, and Léon Gulikers. 1995. CELEX2, LDC96L14. Technical report, Linguistic Data Consortium.

Joan Bybee. 2007. Frequency of Use and the Organization of Language. Oxford University Press.

Noam Chomsky. 1955. The logical structure of linguistic theory. Technical report, MIT.

Noam Chomsky and Morris Halle. 1968. The Sound Pattern of English. MIT Press.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30, Vancouver, Canada.

Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2015. Modeling word forms using latent underlying morphs and phonology. Transactions of the Association for Computational Linguistics, 3:433–447.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 101–110, Singapore.

Sarah Eisenstat. 2009. Learning underlying forms with MaxEnt. Master's thesis, Brown University.

Edward Flemming. 2001. Scalar and categorical phenomena in a unified model of phonetics and phonology. Phonology, 18(1):7–44.

Bruce Hayes. 2009. Introductory Phonology. Blackwell.

Bruce Hayes and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Joan B. Hooper. 1976. An Introduction to Natural Generative Phonology. Academic Press.

Larry M. Hyman. 1970. How concrete is phonology? Language, 46(1):58–76.

Roman Jakobson. 1948. Russian conjugation. Word, 4(3):155–167.

Roman Jakobson, C. Gunnar Fant, and Morris Halle. 1952. Preliminaries to speech analysis: The distinctive features and their correlates. Technical report, Acoustic Laboratory, MIT.

Gaja Jarosz. 2006. Richness of the base and probabilistic unsupervised learning in Optimality Theory. In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006, pages 50–59, New York City, USA.

Gaja Jarosz. 2013. Learning with hidden structure in Optimality Theory and Harmonic Grammar: Beyond robust interpretive parsing. Phonology, 30(1):27–71.

Michael Kenstowicz and Charles Kisseberth. 1979. Generative Phonology: Description and Theory. Academic Press.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, USA.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada.

Paul Kiparsky. 1968. How Abstract is Phonology? Indiana University Linguistics Club.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. UniMorph 2.0: Universal Morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal.

John J. Ohala. 1990. The phonetics and phonology of aspects of assimilation. In John Kingston and Mary E. Beckman, editors, Papers in Laboratory Phonology I: Between the Grammar and the Physics of Speech, pages 258–275. Cambridge University Press.

Janet Pierrehumbert, Mary E. Beckman, and D. Robert Ladd. 2000. Conceptual foundations of phonology as a laboratory science. In Noel Burton-Roberts, Philip Carr, and Gerard Docherty, editors, Phonological Knowledge: Conceptual and Empirical Issues, pages 273–304. Oxford University Press.

Alan Prince and Paul Smolensky. 2008. Optimality Theory: Constraint Interaction in Generative Grammar. John Wiley & Sons.

Ezer Rasin and Roni Katzir. 2016. On evaluation metrics in optimality theory. Linguistic Inquiry, 47(2):235–282.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning internal representations by error propagation. Technical report, California University San Diego La Jolla.

Paul Smolensky and Géraldine Legendre. 2006. The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Cognitive Architecture), volume 1. MIT Press.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112, Montreal, Canada.

Ekaterina Vylomova, Jennifer White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, Edoardo Maria Ponti, Rowan Hall Maudslay, Ran Zmigrod, Josef Valvoda, Svetlana Toldova, Francis Tyers, Elena Klyachko, Ilya Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, and Mans Hulden. 2020. SIGMORPHON 2020 shared task 0: Typologically diverse morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 1–39, Online.

Shijie Wu and Ryan Cotterell. 2019. Exact hard monotonic attention for character-level transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1530–1537, Florence, Italy.

Shijie Wu, Pamela Shapiro, and Ryan Cotterell. 2018. Hard non-monotonic attention for character-level transduction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.