Inherent Biases of Recurrent Neural Networks for Phonological Assimilation and Dissimilation
Amanda Doucette
University of Massachusetts Amherst [email protected]
Abstract
A recurrent neural network model of phonological pattern learning is proposed. The model is a relatively simple neural network with one recurrent layer, and displays biases in learning that mimic observed biases in human learning. Single-feature patterns are learned faster than two-feature patterns, and vowel- or consonant-only patterns are learned faster than patterns involving vowels and consonants, mimicking the results of laboratory learning experiments. In non-recurrent models, capturing these biases requires the use of alpha features or some other representation of repeated features, but with a recurrent neural network, these elaborations are not necessary.
Models of phonological pattern learning typically require large numbers of constraints or rules on where features can occur, and the presence of alpha features or some other representation of repeated features to allow certain patterns to be learned more quickly (Hayes and Wilson, 2008; Moreton et al., 2015). In human learning experiments, certain phonological patterns are learned more easily, particularly those involving multiple occurrences of the same feature, such as a voicing agreement pattern.

In order to capture this bias towards single-feature patterns, many models have some representation of repeated features. Alpha features are one example of this (see McCarthy (1988) for other approaches, such as feature geometry). Alpha features allow a model to learn a harmony pattern with only one predicate: that two features must be the same, having the value α. Without alpha features, the model must learn two predicates: that the two features must either both have the value + or both have the value −. Therefore, there cannot be a bias towards single-feature patterns, because two-feature patterns also require learning two predicates (Moreton, 2012).

In addition to alpha features, many phonological learning models have to test or search over a large number of possible rules or constraints to learn a pattern. In models that use conjunctions of features as constraints (Hayes and Wilson, 2008; Moreton et al., 2015), if there are N features in the model, each with three possible values (+, −, ±), there are 3^N possible conjunctions of these features. With even a small number of features, the number of conjunctive constraints becomes very large.

Moreton, Pater, and Pertsova (2015) describe a cue-based learning model that uses these conjunctive constraints. Their model is a maximum entropy model trained by gradient descent on negative log-likelihood, and is related to the single-layer perceptron.
It successfully models the biases found in human phonological learning experiments, but still requires listing all possible constraint conjunctions in the input. In unpublished work, I have found that it is also possible to model these biases without constraint conjunctions using a feed-forward neural network with a hidden layer. See Alderete and Tupper (To appear) for an overview of other connectionist approaches to phonology.

Hare (1990) uses a recurrent neural network to model Hungarian vowel harmony without phonological rules or constraints. In Hare's model, sequences of individual features describing vowels were the only inputs to the network. Some features in the input sequence could be left unspecified, and after training, fully specified feature sequences are output. While the model was only trained on sequences of vowels, not entire words, Hare showed that recurrent neural networks are capable of modeling vowel harmony patterns using only individual features as input.

Rodd (1997) also uses recurrent neural networks to model Turkish vowel harmony. Individual phonemes rather than features were used as input to the networks, and the task was to predict the following phoneme. Rodd showed that the hidden units in small recurrent networks were able to represent distinctions between vowels and consonants, differences in sonority, and differences between front and back vowels. Although humans most likely do not perform the task of predicting the next phoneme in a word, Rodd showed that a simple recurrent network could learn phonological regularities through differences in the distribution of phonemes.

Recurrent neural networks are capable of learning more than just vowel harmony patterns and feature representations, though.
This paper describes a simple recurrent neural network model of phonological pattern learning that is biased towards learning single-feature patterns and patterns over only consonants or vowels, without using alpha features, separate representations of consonants and vowels, or conjunctive constraints.

The model used in these simulations is a simple recurrent neural network. The "words" that make up the patterns are the inputs to the first layer of the model. At each time step, the four features representing one phoneme are input to the network. The second layer is a hidden recurrent layer with ten neurons. The third layer is a log softmax layer with two output neurons. After the entire sequence is input to the model, the outputs at the final time step represent the log probabilities of the input belonging to each of the two classes, which will be referred to as IN the pattern or OUT of the pattern. The probability of a word being IN or OUT is the probability of it being allowed in the language.

The model was trained using gradient descent on negative log-likelihood with a learning rate of 0.01. Weights were adjusted after each word in the training data, rather than in batches. In each epoch of training, the order of presentation of the training data was randomly permuted. The output of the network was considered to be correct when the log probability of the intended class was greater than that of the other class. This criterion for correctness was used because, when the model is trained on a subset of the full pattern, it prevents overfitting to that subset. After each epoch of training, all training examples were checked for correctness. If the correct class was predicted for every training example, training was stopped.

The number of neurons in the recurrent layer is not critical to the model. Ten neurons were chosen because with fewer neurons, the patterns tested could not be fully learned, and with more, the patterns were learned after only a few training epochs.
More complex patterns, or patterns requiring more features, will likely require a larger number of neurons in this layer.
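As a concrete sketch of this architecture, the forward pass can be written in a few lines. This is not the author's original code: the weight initialization, the random seed, and the example segment encoding are illustrative assumptions.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def classify(word, Wxh, Whh, Who, bh, bo):
    """Feed a sequence of 4-feature segment vectors through one recurrent
    layer and return [log P(IN), log P(OUT)] at the final time step."""
    h = np.zeros_like(bh)
    for x in word:                       # one segment per time step
        h = np.tanh(Wxh @ x + Whh @ h + bh)
    return log_softmax(Who @ h + bo)

rng = np.random.default_rng(0)
n_feat, n_hid, n_out = 4, 10, 2          # four features, ten recurrent units
theta = (rng.normal(0, 0.1, (n_hid, n_feat)),   # input -> recurrent weights
         rng.normal(0, 0.1, (n_hid, n_hid)),    # recurrent -> recurrent weights
         rng.normal(0, 0.1, (n_out, n_hid)),    # recurrent -> output weights
         np.zeros(n_hid), np.zeros(n_out))      # biases

# A CVCV word: each segment is (voi, cor, hi, back), with the irrelevant
# feature pair zeroed out (an assumed encoding, here of something like [d u t i]).
word = [np.array([+1., +1., 0., 0.]), np.array([0., 0., +1., +1.]),
        np.array([-1., +1., 0., 0.]), np.array([0., 0., +1., -1.])]
logp = classify(word, *theta)
```

Exponentiating the two outputs recovers a proper probability distribution over the IN and OUT classes.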
The patterns used in testing the model use four phonological features: two consonant features and two vowel features. Each feature has a value of +1 or -1. For consonants, the vowel features have a value of 0, and for vowels, the consonant features have a value of 0. The consonant features used in these patterns are voicing (+/- voi) and place (+/- cor), and the vowel features are height (+/- hi) and backness (+/- back). This feature set corresponds to the consonants [d, t, k, g] and the vowels [i, u, æ, a]. All "words" in the patterns have the form C1 V1 C2 V2, where the Cs and Vs range over the four consonants and vowels described by the four features, so there are 256 words in total. For each pattern, the words are divided into two classes, IN and OUT, each with 128 examples.

Six patterns, dividing the 256 words based on their features, were created as simplified versions of real phonological patterns. The feature combinations that are IN for each pattern are described in Table 1. The 128 words that do not fit these feature descriptions are in the OUT class of the pattern.

Pattern   Features in Pattern
1         C1+voi and V1+back;  C1-voi and V1-back
2         C1+voi and V2+back;  C1-voi and V2-back
3         C1+voi and C2+voi;   C1-voi and C2-voi
4         C1+voi and C2-voi;   C1-voi and C2+voi
5         C1+voi and C1+cor;   C1-voi and C1-cor
6         C1+voi and C2+cor;   C1-voi and C2-cor

Table 1: Feature descriptions of patterns.

In pattern 1, there is a feature dependency between an adjacent consonant and vowel. In pattern 2, this dependency is between a non-adjacent consonant and vowel. Pattern 3 is a voicing assimilation pattern where the two consonants must agree in voicing. In pattern 4, the consonants must disagree in voicing. In pattern 5, the two features relevant to the pattern are on the same consonant, and in pattern 6, they are on two separate consonants.

Moreton (2012) claims that there is an advantage for learning intra-dimensional patterns over inter-dimensional patterns that requires alpha features to be captured by a model.
The same advantage was also shown by Moreton, Pater, and Pertsova (2015) and Saffran and Thiessen (2003). In the six patterns described here, patterns 3 and 4 are intra-dimensional, single-feature patterns, and the rest are inter-dimensional, two-feature patterns. Therefore, patterns 3 and 4 should be learned faster than patterns 1, 2, and 6.

Pattern 5 is also an inter-dimensional pattern, but the two features are on the same segment rather than on different segments like the rest of the patterns. In the experiments of Moreton, Pater, and Pertsova (2015), patterns involving features on the same segment were easier to learn than patterns involving two segments. Because of this, pattern 5 should be learned faster than pattern 6.

Moreton (2012) also showed in an experiment that a pattern involving an adjacent consonant and vowel was learned no faster than a pattern involving a non-adjacent consonant and vowel. Therefore, there should be no difference in the amount of time needed to learn patterns 1 and 2.

Results from several studies also show that there is no difference in the difficulty of learning harmony and disharmony patterns (Moreton, 2012; Pycha et al., 2003; Skoruppa and Peperkamp, 2011). Of these six patterns, pattern 3 is a harmony pattern and pattern 4 is a disharmony pattern, so there should be no difference in the time needed to learn these patterns.
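The 256-word stimulus space and an IN/OUT split can be enumerated directly. A minimal sketch for pattern 3 (voicing agreement); the exact +/- assignment of each segment is an assumption for illustration:

```python
from itertools import product

# Assumed feature values: consonants carry (voi, cor), vowels carry (hi, back).
CONS = {'d': (+1, +1), 't': (-1, +1), 'g': (+1, -1), 'k': (-1, -1)}
VOWS = {'i': (+1, -1), 'u': (+1, +1), 'ae': (-1, -1), 'a': (-1, +1)}

words = list(product(CONS, VOWS, CONS, VOWS))   # all 4^4 = 256 CVCV words

def in_pattern3(word):
    """Pattern 3 (voicing assimilation): C1 and C2 agree in [voi]."""
    c1, _, c2, _ = word
    return CONS[c1][0] == CONS[c2][0]

IN = [w for w in words if in_pattern3(w)]       # 128 words
OUT = [w for w in words if not in_pattern3(w)]  # the other 128
```

For any of the six patterns, the predicate divides the 256 words evenly, since each feature combination in Table 1 holds for exactly half the words.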
For each pattern, the model was trained 3000 times with random initial weights on all 256 examples. For each training run, the number of epochs taken to learn the pattern according to the criterion in section 2 was recorded. Table 2 shows averages over the 3000 runs for each pattern. The model was capable of learning the patterns in all but 25 training runs, which were stopped after 400 epochs and excluded from these results. This was done because the weights in these training runs were likely stuck in local minima, and the model was incapable of learning the pattern.
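The training regime just described can be sketched end to end: per-word updates by gradient descent on negative log-likelihood (backpropagation through time), shuffling each epoch, stopping when every example is classified correctly, and capping runs at 400 epochs. This is a minimal reimplementation, not the author's code; the feature encodings, initialization scale, and the choice of pattern 3 are assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Assumed segment encodings: (voi, cor) for consonants, (hi, back) for vowels.
CONS = {'d': (+1, +1), 't': (-1, +1), 'g': (+1, -1), 'k': (-1, -1)}
VOWS = {'i': (+1, -1), 'u': (+1, +1), 'ae': (-1, -1), 'a': (-1, +1)}

def feats(word):
    """CVCV word -> four 4-feature vectors; irrelevant features are zero."""
    c1, v1, c2, v2 = word
    return [np.array(CONS[c1] + (0, 0), float), np.array((0, 0) + VOWS[v1], float),
            np.array(CONS[c2] + (0, 0), float), np.array((0, 0) + VOWS[v2], float)]

words = list(product(CONS, VOWS, CONS, VOWS))
labels = [int(CONS[w[0]][0] == CONS[w[2]][0]) for w in words]  # pattern 3: IN = 1

# 4 inputs -> 10 recurrent units -> 2 log-softmax outputs
Wxh = rng.normal(0, 0.3, (10, 4)); Whh = rng.normal(0, 0.3, (10, 10))
Who = rng.normal(0, 0.3, (2, 10)); bh = np.zeros(10); bo = np.zeros(2)

def forward(xs):
    hs = [np.zeros(10)]
    for x in xs:
        hs.append(np.tanh(Wxh @ x + Whh @ hs[-1] + bh))
    o = Who @ hs[-1] + bo
    logp = o - (o.max() + np.log(np.exp(o - o.max()).sum()))  # log softmax
    return hs, logp

def grads(xs, y):
    """Gradients of the negative log-likelihood via backprop through time."""
    hs, logp = forward(xs)
    do = np.exp(logp); do[y] -= 1.0                 # softmax minus one-hot target
    gWho, gbo = np.outer(do, hs[-1]), do.copy()
    gWxh = np.zeros_like(Wxh); gWhh = np.zeros_like(Whh); gbh = np.zeros_like(bh)
    dh = Who.T @ do
    for t in range(len(xs), 0, -1):                 # walk back through time
        dz = dh * (1.0 - hs[t] ** 2)                # through the tanh
        gWxh += np.outer(dz, xs[t - 1]); gWhh += np.outer(dz, hs[t - 1]); gbh += dz
        dh = Whh.T @ dz
    return gWxh, gWhh, gWho, gbh, gbo

def accuracy():
    return np.mean([int(np.argmax(forward(feats(w))[1])) == y
                    for w, y in zip(words, labels)])

lr, epochs_taken = 0.01, 0
for epoch in range(400):                            # runs capped at 400 epochs
    for i in rng.permutation(len(words)):           # shuffle; per-word updates
        gWxh, gWhh, gWho, gbh, gbo = grads(feats(words[i]), labels[i])
        Wxh -= lr * gWxh; Whh -= lr * gWhh; Who -= lr * gWho
        bh -= lr * gbh; bo -= lr * gbo
    epochs_taken = epoch + 1
    if accuracy() == 1.0:                           # every example correct: stop
        break
```

The correctness criterion from section 2 (intended class has the higher log probability) is exactly the argmax comparison in `accuracy`.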
[Table 2: mean, standard error, and standard deviation of training epochs for each pattern; numeric values not recoverable from the source.]

The model was also trained on randomly chosen subsets of the training data for each pattern. This was done because when learning phonological patterns, people do not have access to every possible example of the pattern and every possible example of something that does not conform to the pattern. Rather, people are only exposed to correct examples of their language as they learn.

[Table 3: training-time comparisons for pairs of patterns with p-values; numeric values not recoverable from the source.]

To test the model under these conditions, 32 examples of each pattern were randomly chosen from the 128 available. If the model were trained on only positive examples, it would predict everything to be included in the pattern, so some negative training examples are needed. As negative training examples, 32 were chosen from the remaining 224 words, after removing the 32 already selected positive examples. Therefore, the negative examples are not true negative examples; they are just randomly chosen examples from the full dataset.

Although this is an unusual method of training a neural network model, it was done to test whether the model is capable of generalizing to unseen examples of the pattern. There is no direct way of using unsupervised learning with a recurrent neural network, so randomly selected negative examples were used as an analogue to it. In training, the model gets positive examples of the pattern, but does not get true negative examples.

The model was trained 3000 times on randomly chosen subsets of each pattern. Training was stopped when the model correctly classified the 64 examples it was trained on. After each training run, the proportion of the full dataset that was classified correctly by the model was recorded. Table 4 shows averages over the 3000 runs for the number of training epochs taken and the proportion correct on the full dataset for each pattern. 41 training runs were excluded because the pattern was not learned in 400 epochs.
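The subset-training setup can be sketched as follows; word indices stand in for the actual CVCV words, and the IN/OUT split shown is illustrative.

```python
import random

random.seed(0)
all_words = list(range(256))      # stand-ins for the 256 CVCV words
in_class = all_words[:128]        # stand-in for the 128 IN words of a pattern

positives = random.sample(in_class, 32)                # 32 attested examples
remaining = [w for w in all_words if w not in positives]
negatives = random.sample(remaining, 32)               # pseudo-negatives, drawn
# from the other 224 words, so some may in fact belong to the pattern

training_set = [(w, 1) for w in positives] + [(w, 0) for w in negatives]
```

Because the pseudo-negatives are drawn from everything except the 32 chosen positives, up to 96 unseen IN words can be labeled OUT during training, which is what makes generalization to the full 256-word dataset a meaningful test.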
[Table 4: mean training epochs and mean proportion correct on the full dataset, with standard errors and standard deviations, for each pattern when trained on subsets; numeric values not recoverable from the source.]
Without a representation of repeated features, and using only single features as input, this recurrent neural network is able to model results from human phonological learning experiments. Although non-recurrent neural network models such as the single-layer perceptron require a representation of repeated features to allow single-feature patterns to be learned more easily (Moreton, 2012), the addition of a recurrent layer seems to have the same effect. In a recurrent neural network, input is processed one segment at a time, rather than simultaneously. There is no clear way to represent sequential input in non-recurrent models. Although it is difficult to interpret connection weights in a recurrent neural network, it is possible that the sequential input to the network somehow biases it towards single-feature patterns.

A non-recurrent, multilayer perceptron may be able to learn these patterns because they are all the same length, but it will not be able to capture the bias towards patterns with repeated features. When the features of all four segments are concatenated into one input to a multilayer perceptron, there is no connection between segments with repeated features; a pattern involving two instances of a consonant feature will be connected to the hidden layer in the same way as a pattern involving a consonant and a vowel. In a recurrent network, repeated activation of the same input feature will increase activation in a hidden unit more than activation of different features in the input will. For this reason, a non-recurrent network cannot model biases towards single-feature patterns.

In addition to the bias towards single-feature patterns, the recurrent neural network can learn patterns over only consonants or only vowels more quickly than patterns involving a consonant and a vowel. Some models accomplish this with an additional representation of only the vowels in a word, or only the consonants, in a vowel or consonant 'tier' (Hayes and Wilson, 2008).
A vowel tier allows for a bias towards vowel-only patterns because only the vowels in a word are considered, making finding the pattern much faster.

This recurrent neural network model does not have any representation of a consonant or vowel tier, but it is still able to learn consonant- or vowel-only patterns faster. There are separate features used for describing vowels and consonants, but they are all used as input to the network at the same time. This bias is possibly related to the single-feature bias: vowels and consonants never have the same features, so patterns over only consonants or only vowels are learned faster because there is some overlap in the features used to describe them. It is possible that adding a separate representation of vowels and consonants could make learning these patterns faster, but it does not seem necessary. Such a representation could be implemented by having multiple instances of the model, where only the consonants, only the vowels, or the entire word is input. The outputs of the recurrent layers of these separate instances would be connected to a single non-recurrent layer which would combine the predictions of the three tiers.

Although the patterns used to test the model can be described with only one or two features, it should also be capable of learning more complex patterns. More neurons in the input and recurrent layers will allow more input features to be used, and more complex patterns to be represented.

The model also requires supervised training. To learn a pattern, both positive and negative examples are necessary for the model, but humans are capable of learning these patterns through unsupervised learning (Moreton et al., 2015). Unsupervised learning can be approximated by using randomly chosen examples from the entire dataset as negative examples, but this is still not true unsupervised learning.
If there were a way to better approximate unsupervised learning with this model, it would better fit the human learning experiments.

In conclusion, a simple recurrent neural network was able to model human phonological pattern learning without alpha features or any representation of rules or constraints. The model uses only individual phonological features as input, and has no separate representations of vowels or consonants. Single-feature patterns are learned more easily than two-feature patterns, and vowel- or consonant-only patterns are learned more easily than patterns involving vowels and consonants. Modeling these biases with a recurrent neural network is possible without the representation of repeated features that is necessary in non-recurrent models.
Acknowledgments
Thanks to Joe Pater and Brendan O'Connor for their guidance and helpful discussion while I worked on this research, as well as to the four anonymous reviewers for their comments.
References

Alderete, John and Paul Tupper. To appear. Connectionist approaches to generative phonology. In Anna Bosch and S. J. Hannahs, editors, The Routledge Handbook of Phonological Theory.

Hare, Mary. 1990. The role of trigger-target similarity in the vowel harmony process. In Annual Meeting of the Berkeley Linguistics Society, volume 16, pages 140–152.

Hayes, Bruce and Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry, 39(3):379–440.

McCarthy, John J. 1988. Feature geometry and dependency: A review. Phonetica, 45(2-4):84–108.

Moreton, Elliott. 2012. Inter- and intra-dimensional dependencies in implicit phonotactic learning. Journal of Memory and Language, 67(1):165–183.

Moreton, Elliott, Joe Pater, and Katya Pertsova. 2015. Phonological concept learning. Cognitive Science.

Pycha, Anne, Pawel Nowak, Eurie Shin, and Ryan Shosted. 2003. Phonological rule-learning and its implications for a theory of vowel harmony. In Proceedings of the 22nd West Coast Conference on Formal Linguistics, volume 22, pages 101–114. Somerville, MA: Cascadilla Press.

Rodd, Jennifer M. 1997. Recurrent neural-network learning of phonological regularities in Turkish. In CoNLL, pages 97–106.

Saffran, Jenny R. and Erik D. Thiessen. 2003. Pattern induction by infant language learners. Developmental Psychology, 39(3):484.

Skoruppa, Katrin and Sharon Peperkamp. 2011. Adaptation to novel accents: Feature-based learning of context-sensitive phonological regularities.