Disambiguatory Signals are Stronger in Word-initial Positions
Tiago Pimentel (University of Cambridge), Ryan Cotterell (University of Cambridge, ETH Zürich), Brian Roark (Google)

Abstract
Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjecture—as in Wedel et al. (2019b), but common elsewhere—that languages have evolved to provide more information earlier in words than later. Information-theoretic methods to establish such tendencies in lexicons have suffered from several methodological shortcomings that leave open the question of whether this high word-initial informativeness is actually a property of the lexicon or simply an artefact of the incremental nature of recognition. In this paper, we point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word, and present several new measures that avoid these confounds. When controlling for these confounds, we still find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.

Our code is available at https://github.com/tpimentelms/frontload-disambiguation.

The psycholinguistic study of human lexical access is largely concerned with the incremental processing of words—whereby, as individual sub-lexical units (e.g., phones) are perceived, listeners update their expectations of the word being spoken. One common tenet of such studies is that the disambiguatory signal contributed by units early in the word is stronger than that contributed later—i.e. disambiguatory signals are front-loaded in words. This intuition is derived from ample indirect evidence that the beginnings of words are more important for humans during word processing—including, e.g., evidence of increased attention to word beginnings (Nooteboom, 1981, inter alia) or
evidence of increased levels of phonological reduction in word endings (van Son and Pols, 2003b).

To analyse this front-loading effect, researchers have investigated the information provided by segments in words. van Son and Pols (2003a,b) showed that, in Dutch, a segment's position in a word is a very strong predictor of its conditional surprisal, with later segments being more predictable than earlier ones—a result which we show to arise directly from its definition in §3.3.1. Recently, King and Wedel (2020) and Pimentel et al. (2020) confirmed the effect on many more languages.

Their analysis, however, presents an inherent confound between the amount of conditional information available to a model and the surprisal of the subsequent segment—see Fig. 1 for results illustrating this. Using the LSTM training recipes from Pimentel et al. (2020) (available at https://github.com/tpimentelms/phonotactic-complexity), we calculated the conditional surprisal at each segment position within the words across all languages in three datasets; see §3 and §5 for specifics on training and data. Each segment corresponds to a single phone in CELEX and NorthEuraLex, and to a single grapheme in Wikipedia. The top half of Fig. 1 shows that, indeed, positions earlier in the string have higher surprisal than positions later in the string, supporting the thesis of higher informativity earlier in words. The bottom half shows that modelling the strings right-to-left instead of left-to-right reverses the resulting effect.

Figure 1: Forward and backward surprisals with the LSTM model from Pimentel et al. (2020), plotted against position from the start (forward, top) and position from the end (backward, bottom) for the celex, wikipedia and northeuralex datasets. The bottom plot has been flipped horizontally such that it visually corresponds to the normal string direction.

This decouples conditional surprisal from disambiguatory strength. To expose this decoupling, consider an artificial language where every word contains a copy of its first half, e.g., foofoo, barbar, foobarfoobar, etc. The first and second halves of these words have identical disambiguatory strength; they are the same, so one could disambiguate the word as easily from its second half as from the first. In contrast, conditional surprisal would be nearly zero for the second halves of words, because the second half is perfectly predictable from the first half.

In natural languages, measuring conditional entropy in a left-to-right fashion inherently forces a reduction of conditional entropy in later segments because of a language's phonotactic constraints. However, the disambiguatory strength of later segments is not inherently less than that of earlier segments. For instance, in a language like Turkish, which has vowel harmony, knowledge of any of the vowels in a word will provide information about the word's other vowels in a similar way. As such, knowledge of vowels towards the front of a word is as disambiguating as knowledge of vowels towards its end.

The contributions of this paper are threefold. First, we document and demonstrate the shortcomings of existing methods for measuring the informativeness of individual segments in context, including the confound with the amount of conditional information discussed above. Second, we introduce three surprisal-based measures that control for this confound and enable comparison of word-initial versus word-final positions in this respect: unigram, position-specific and cloze surprisal (see §3). Finally, we find robust evidence across many languages of stronger disambiguatory signals in word-initial than word-final positions.
Out of a total of 151 languages analysed across three separate collections, 82 of them present a higher cloze surprisal in word beginnings than in endings—with similar patterns arising with the other two measures.
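As a small numerical illustration of the decoupling argued above, the sketch below (our own, not part of the paper) builds a toy reduplicated lexicon in the spirit of the foofoo/barbar example and estimates per-position forward surprisal with a simple count-based model over the closed word list; the second halves of words come out at (or near) zero conditional surprisal even though either half identifies the word equally well.

```python
import math
from collections import defaultdict

# Toy "reduplicated" lexicon: every word is its stem repeated twice,
# as in the foofoo/barbar example from the introduction.
stems = ["foo", "bar", "gui", "lex", "zap"]
lexicon = [s + s for s in stems]

# Count-based (plug-in) forward model: p(w_t | w_<t) estimated from
# full-history counts over the closed lexicon, one count per word type.
context_counts = defaultdict(lambda: defaultdict(int))
for word in lexicon:
    for t, segment in enumerate(word):
        context_counts[word[:t]][segment] += 1

def forward_surprisal(word, t):
    """Surprisal (in bits) of segment t given the preceding segments."""
    counts = context_counts[word[:t]]
    p = counts[word[t]] / sum(counts.values())
    return -math.log2(p)

# Average surprisal of first-half vs. second-half positions.
first, second = [], []
for word in lexicon:
    half = len(word) // 2
    first += [forward_surprisal(word, t) for t in range(half)]
    second += [forward_surprisal(word, t) for t in range(half, len(word))]

print(f"first halves : {sum(first) / len(first):.2f} bits/segment")
print(f"second halves: {sum(second) / len(second):.2f} bits/segment")
# Second halves get (near-)zero surprisal: once the first half is known, the
# rest is fully predictable, even though either half identifies the word
# equally well. Conditional surprisal is not disambiguatory strength.
```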
Psycholinguistic evidence. Lexical access has long been a topic of interest for psycholinguists, leading to many distinct models being proposed for this process (Morton, 1969; Marcus, 1981; Marslen-Wilson, 1987). Far earlier, though, Bagley (1900) had already demonstrated that earlier segments in words were more important for word recognition than later segments; specifically, they found that, when exposed to words with word-initial or word-final consonant deletions, listeners found the word-initial deletions more disruptive. Fay and Cutler (1977) showed mispronunciations are more likely in word endings, while Bruner and O'Dowd (1958) showed that recognizing written words with flipped initial characters was harder than with word-final ones—demonstrating that the initial part of the word was more "useful" for readers. More recently, Wedel et al. (2019a) found evidence in support of Houlihan (1975), showing neutralizing rules tend to target word endings more significantly than beginnings in both suffixing and prefixing languages.

Nooteboom (1981) investigated the ease of recovering lexical items from either word beginnings or endings, finding that people had an easier time recovering words from their beginnings. For this, he examined words for which the first and second halves each completely identified them in a large Dutch dictionary—controlling for both segments' length and uniqueness. Later on, though, Nooteboom and van der Vlugt (1988) showed this difference vanishes when priming people with the length of the word—proposing the difference comes not from how informative segments were, but from the difficulty in time-aligning later segments in mental lexicons. Connine et al. (1993) also found no difference in priming effects with non-words that differed from real words in either word-initial or medial positions, suggesting initial positions have no special status in word recognition.

Psycholinguistic evidence is key to understanding how lexical access works in human language processing, and can help us understand why lexicons may evolve to provide more disambiguatory signals earlier in words. Given the incremental nature of human lexical processing, however, such evidence cannot provide direct evidence of the nature of the lexicon uninfluenced by incrementality.
Computational evidence.
To the best of our knowledge, van Son and Pols (2003b,a) were the first to use computational methods coupled with an information-theoretic definition of informativeness to investigate this question. They showed that segments in the beginning of words carry most of a word's information, as measured by their contextual surprisal using a plug-in tree-structured probabilistic estimator. (Note that there are many possible reasons why the effects we demonstrate in this paper may arise, from the demands of lexical access to constraints on articulation. We provide no evidence for any of the possible explanations, evolutionary or otherwise, just methods for measuring the effect.) Although assessing a less-biased sample of words than Nooteboom (1981) (whose restriction to words completely identifiable by both their first and second halves in a large Dutch dictionary resulted in a study of only 14 words), this study is also limited to a single language (Dutch), hence cannot assess whether this is a general phenomenon or specific to that language.

Further, van Son and Pols (2003a,b) use absolute word positions in their analysis. Word length correlates strongly with frequency, hence while early positions are present in all words, later positions only exist for a much smaller sample of typically lower-frequency words. Thus this comparison amounts to asking if later positions in longer and infrequent words have lower surprisal than earlier positions in all (frequent or infrequent) words. We analyse this confounding factor in §6.

Wedel et al. (2019b) and King and Wedel (2020) applied a methodology similar to that of van Son and Pols (2003a) to show, for many diverse languages, that more frequent words contain less informative segments in word-initial positions, while less frequent types carry more informative ones. They further showed that segments in later word positions were less informative (given the previous ones) than average in rarer words. While controlling for length, King and Wedel (2020) also compared words' forward and backward uniqueness points—nodes in a trie from which only one leaf node can be reached, i.e., where the word is uniquely identified—showing they happened earlier in forward strings.

While these studies provide evidence from more diverse sets of languages, they follow van Son and Pols (2003a) in studying closed lexicons. (The closed lexicon assumption is incorporated implicitly in the probabilistic trie models used by van Son and Pols (2003a,b) and King and Wedel (2020)—i.e. they assign zero probability to any form not in their training sets—and in the uniqueness point analysis of King and Wedel (2020).) As we show in §3.3.1, the use of probabilistic trie models on a closed lexicon yields a trivial effect of higher informativity at word-initial positions. Furthermore, such studies cannot account for out-of-vocabulary words (e.g., nonce, proper name or otherwise unknown words) or derivational morphology, which are key parts of lexical recognition. Lexical access is also somewhat robust to segmental misordering (Toscano et al., 2013), and sounds later in a word help determine the perception of earlier ones (Gwilliams et al., 2018). In contrast, a trie over a closed lexicon is deterministic. Beyond this, Luce (1986) showed in a corpus study that the probability of a word type being uniquely identifiable before its last segment was low—and a notable share of types were identified only by the end of the word, being proper prefixes of other words, such as cat and cats.
They conclude that uniqueness point statistics may only be useful for long-word analysis.

In Pimentel et al. (2020), we analysed several languages' phonotactic distributions, focusing on presenting a trade-off between phonotactic entropy and word length across languages. As a control experiment we analysed the correlation between a segment's surprisal and its word position across 106 languages. We did not control for word length and did not run per-language experiments, though—so we could have just been capturing the effect that later positions will mostly be present in languages with longer words (which, as we find, have lower information on average).

While this last work avoids many of the issues raised earlier in this section, it fails to control for the key confound mentioned earlier: it relies on left-to-right conditional probabilities to calculate surprisal. Thus segments early in the word have less conditional information and hence are generally of lower probability—a trivial effect that does not indicate a segment's disambiguatory signal strength.
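As a concrete reference for the uniqueness-point analysis mentioned above, the sketch below (our illustration, not King and Wedel's (2020) code; the toy word list is made up) computes each word's forward and backward uniqueness point by scanning prefixes of a closed word list, which is equivalent to walking a trie.

```python
def uniqueness_point(word, lexicon):
    """1-based index of the first prefix of `word` shared with no other word in
    `lexicon`; returns len(word) if the word is never unique before its end."""
    for t in range(1, len(word) + 1):
        prefix = word[:t]
        if sum(1 for w in lexicon if w.startswith(prefix)) == 1:
            return t
    return len(word)

def backward_uniqueness_point(word, lexicon):
    """Same computation on reversed strings, i.e., reading the word right-to-left."""
    reversed_lexicon = [w[::-1] for w in lexicon]
    return uniqueness_point(word[::-1], reversed_lexicon)

lexicon = ["cat", "cats", "catalog", "dog", "dodge"]
for word in lexicon:
    fwd = uniqueness_point(word, lexicon)
    bwd = backward_uniqueness_point(word, lexicon)
    print(f"{word:8s} forward UP at segment {fwd}, backward UP at segment {bwd}")
# "cat", being a proper prefix of "cats" and "catalog", never becomes unique
# before (or even at) its final segment when read left-to-right, which is the
# situation Luce (1986) points out; read right-to-left it is unique immediately.
```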
In this work, instead of the lexicon itself, we investigate the probability distribution from which it is sampled. The distribution is unobserved, but we can get glimpses of it via the sampled lexicon:

\{ \mathbf{w}^{(n)} \}_{n=1}^{N} \sim p(\mathbf{w}) = \prod_{t=1}^{|\mathbf{w}|} p(w_t \mid \mathbf{w}_{<t})    (1)

where each wordform \mathbf{w} \in \Sigma^* is terminated by the end-of-word (EOW) symbol. For simplicity, we assume the alphabet includes EOW throughout the rest of the paper. This distribution should assign high probability to likely wordforms (attested or not) and low probability to unlikely ones. Using Chomsky and Halle's (1965) classic example from English, brick (attested) and blick (unattested) would have high probability, whereas *bnick (unattested) would have a low probability.

Shannon's entropy is a measure of how much information a random variable contains. Consider a segment w_t at word position t, which is a value of the random variable W_t. The average information (surprisal) relayed per segment is:

H(W_t) \equiv \sum_{w_t \in \Sigma} p(w_t) \log \frac{1}{p(w_t)}    (2)

A random variable is maximally entropic if it is uniformly distributed, in which case H(W_t) = log(|Σ|). Conditional entropy measures how much information the knowledge of a variable conveys, given some previous knowledge. The average information transmitted per segment, given the previous ones in a word, is H(W_t | W_{<t}).

We present a reductio ad absurdum which shows that van Son and Pols's (2003b) method will lead to the conclusion that word-initial segments are more informative even if all segments were equally entropic and sampled independently—a nonsensical finding. Accordingly, assume the probability distribution p(w_t | w_{<t}) = p(w_t) is identical at every position and independent of the preceding context, and consider a plug-in (count-based) estimator of the conditional entropy trained on a sample of N wordforms. Following Basharin (1959), the expected estimation error behaves approximately as

H(W_t \mid \mathbf{W}_{<t}) - \hat{H}(W_t \mid \mathbf{W}_{<t}) \approx \frac{|\Sigma|^{t-1}\,(|\Sigma|-1)\,\log e}{N}    (6)

where \hat{H} is a plug-in estimate of the entropy. The error grows exponentially in t due to the |Σ|^{t-1} factor. However, by assumption, H(W_t | W_{<t}) is constant—we have equally entropic and independent segments. Thus, the only way for this difference to increase is for the second term to decrease as a function of t. It follows that the estimated cross-entropies decrease as a function of t due to a methodological technicality. Indeed, in the extreme case, every position after a word's uniqueness point would be estimated to have zero entropy. Thus, van Son and Pols's (2003a) method only reveals a trivial effect. (This is in fact a simplification of van Son and Pols's (2003a) model, which in practice uses Katz smoothing.)

As previously mentioned, the conditional entropy measures how much information the knowledge of a variable conveys, given some previous information, and it is always smaller than or equal to the entropy. For this reason, relying on left-to-right conditional entropies to estimate the strength of disambiguatory signals yields straightforward results: the availability of larger conditioning contexts in a word's final segments will naturally reduce its conditional entropy. This will negatively skew the estimated informativeness of the later parts of a word, since

H(W_t) \geq H(W_t \mid W_{t-1}) \geq H(W_t \mid \mathbf{W}_{<t})

To control for the amount of conditioning information, we instead consider three measures:

• Unigram Surprisal H_θ(W_t): the surprisal of individual segments.
• Cloze Surprisal H_θ(W_t | W_{≠t}): the surprisal of a segment given all others in the same word.
• Position-Specific Surprisal H_θ(W_t | T = t, |W|): the surprisal of individual segments given their position in the wordform and the word's length.

The unigram surprisal captures the information provided by each segment when considering no context, while the cloze surprisal represents the information provided by a segment when one already knows the rest of the word. The position-specific surprisal represents a midway point between both, conditioning each segment only on its position and the word's length—being inspired by Nooteboom and van der Vlugt's (1988) experiments. These three measures of information control for the context size considered at each position, and are thus better suited to an investigation of disambiguatory strength.
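The reductio above is easy to reproduce numerically. The sketch below (our illustration; the alphabet size, word length and sample size are arbitrary) samples a lexicon whose segments are i.i.d. uniform, so the true entropy at every position, conditional or not, is 3 bits, and then compares a count-based plug-in estimate of the forward conditional entropy with a per-position unigram estimate.

```python
import math
import random
from collections import Counter, defaultdict

random.seed(0)
ALPHABET = "abcdefgh"          # |Sigma| = 8 (arbitrary)
WORD_LEN, N_WORDS = 6, 2000    # fixed-length toy lexicon (arbitrary sizes)

# Every segment is i.i.d. uniform, so the true entropy at every position,
# conditional or not, is log2(8) = 3 bits.
lexicon = ["".join(random.choice(ALPHABET) for _ in range(WORD_LEN))
           for _ in range(N_WORDS)]

def plugin_entropy(counter):
    """Plug-in (maximum-likelihood) entropy estimate, in bits."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

for t in range(WORD_LEN):
    # Unigram estimate: pool segments at position t across words, no context.
    unigram = plugin_entropy(Counter(w[t] for w in lexicon))

    # Forward conditional estimate: condition on the full left context, as a
    # count-based left-to-right model over a closed lexicon would.
    by_context = defaultdict(Counter)
    for w in lexicon:
        by_context[w[:t]][w[t]] += 1
    total = sum(sum(c.values()) for c in by_context.values())
    forward = sum(plugin_entropy(c) * sum(c.values())
                  for c in by_context.values()) / total

    print(f"position {t}: unigram {unigram:.2f} bits, "
          f"plug-in forward {forward:.2f} bits (true value 3.00)")

# The plug-in forward estimate collapses towards zero at later positions, the
# spurious 'front-loading' effect of the reductio, while the unigram estimate
# stays flat at roughly 3 bits.
```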
We used a unigram model (see §4) to estimate the unigram surprisal, and transformers (Vaswani et al., 2017) for the cloze and position-specific surprisals. We also use the LSTM (Long Short-Term Memory; Hochreiter and Schmidhuber, 1997) model from Pimentel et al. (2020) for two other entropy measures which do not control for the amount of conditional information:

• Forward Surprisal H_θ(W_t | W_{<t}): the surprisal of a segment given the preceding ones in the word.
• Backward Surprisal H_θ(W_t | W_{>t}): the surprisal of a segment given the following ones in the word.

Unigram model. This might be the simplest language model still in use in Natural Language Processing. We use its Laplace-smoothed variant

p_\theta(w_t) = \frac{\mathrm{count}(w_t) + 1}{\sum_{c' \in \Sigma} \mathrm{count}(c') + |\Sigma|}    (12)

LSTM. This architecture is the state of the art for character-level language modelling (Melis et al., 2020). Given a sequence of segments \mathbf{w} \in \Sigma^*, we use one-hot lookup embeddings to transform each of them into a vector \mathbf{z}_t \in \mathbb{R}^d. We then feed these vectors into a k-layer LSTM

\mathbf{h}_t = \mathrm{LSTM}(\mathbf{z}_{t-1}, \mathbf{h}_{t-1})    (13)

where \mathbf{h}_t \in \mathbb{R}^d, \mathbf{h}_0 is a vector with all zeros and w_0 is the beginning-of-word symbol. We then linearly transform these vectors before feeding them into a softmax non-linearity to obtain the distribution

p_\theta(w_t \mid \mathbf{w}_{<t}) = \mathrm{softmax}(W \mathbf{h}_t + \mathbf{b})    (14)

To get the backward surprisals we use models with the same architecture, but reverse all strings before feeding them to the models. As such, we get the analogous equations

\mathbf{h}_t = \mathrm{LSTM}(\mathbf{z}_{t+1}, \mathbf{h}_{t+1})    (15)
p_\theta(w_t \mid \mathbf{w}_{>t}) = \mathrm{softmax}(W \mathbf{h}_t + \mathbf{b})    (16)

Transformer. Transformers allow a segment to be conditioned on both future and previous symbols. Our implementation starts similarly to the LSTM one, getting embedding vectors \mathbf{z}_t for each segment in the string \mathbf{w} \in \Sigma^*, except that we replace segment w_t with a MASK symbol. We then feed these vectors through k multi-headed self-attention layers, as defined by Vaswani et al. (2017). Finally, the representations from the last layer are linearly transformed and fed into a softmax

p_\theta(w_t \mid \mathbf{w}_{\neq t}) = \mathrm{softmax}(W \mathbf{h}_t + \mathbf{b})    (17)

Position-Specific Transformer. To get position-specific surprisal values, we again use a transformer architecture, but instead of replacing a single segment with a MASK symbol, we replace all of them. This is equivalent to conditioning each segment's distribution on its position and the word length—i.e., estimating p_θ(w_t | t, |w|).
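To make the masked setup behind the cloze and position-specific estimators concrete, here is a minimal PyTorch sketch of a character-level masked transformer; the class name, layer sizes and untrained usage are illustrative assumptions, not the authors' implementation or hyperparameters.

```python
import torch
import torch.nn as nn

class MaskedSegmentTransformer(nn.Module):
    def __init__(self, alphabet_size, max_len=32, d_model=64, nhead=4, nlayers=3):
        super().__init__()
        self.mask_id = alphabet_size            # extra index reserved for MASK
        self.embed = nn.Embedding(alphabet_size + 1, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.out = nn.Linear(d_model, alphabet_size)

    def forward(self, segments, mask_all=False):
        # segments: (batch, time) integer segment ids
        batch, time = segments.shape
        positions = torch.arange(time).expand(batch, time)
        logits = []
        for t in range(time):
            masked = segments.clone()
            if mask_all:
                masked[:, :] = self.mask_id      # position-specific: hide every segment
            else:
                masked[:, t] = self.mask_id      # cloze: hide only segment t
            h = self.embed(masked) + self.pos_embed(positions)
            h = self.encoder(h)
            logits.append(self.out(h[:, t]))     # predict the segment at position t
        return torch.stack(logits, dim=1)        # (batch, time, alphabet)

def cloze_surprisal(model, segments):
    """Surprisal of each segment, in bits, under the cloze distribution p(w_t | w_!=t)."""
    logits = model(segments)                     # (batch, time, alphabet)
    log_probs = torch.log_softmax(logits, dim=-1)
    picked = log_probs.gather(-1, segments.unsqueeze(-1)).squeeze(-1)
    return -picked / torch.log(torch.tensor(2.0))

# Toy usage with a made-up 10-symbol alphabet and random "words" (untrained model).
model = MaskedSegmentTransformer(alphabet_size=10)
words = torch.randint(0, 10, (4, 7))             # batch of 4 words, 7 segments each
print(cloze_surprisal(model, words).shape)       # torch.Size([4, 7])
```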
In order to estimate the redundancy and informativeness of segments we use three different datasets, each with its own pros and cons. We focus on types instead of tokens—i.e., the datasets consist of lexicons—for a few different reasons. First, it is easier to get reliable samples of types than of tokens for a language, especially for low-resource ones. Second, it is a well-known result that token frequency correlates with both word length (Zipf, 1949) and phonotactic probability (Mahowald et al., 2018; Meylan and Griffiths, 2017), so that would be a strong confound in the results. Third, morphology is more easily modeled at the type level than at the token level (Goldwater et al., 2011).

CELEX (Baayen et al., 2015) allows us to experiment exclusively on monomorphemic words, but covers only three closely related languages. It contains both morphological and phonetic annotations for a large number of words in English, Dutch and German. We follow Dautriche et al. (2017) in using only words labeled as monomorphemic in our study, leaving us with , words in German, , words in English and , words in Dutch.

NorthEuraLex (Dellert et al., 2019) spans 107 languages from 21 language families in a unified IPA format. This database is composed of concept-aligned word lists for these languages, containing 1016 concepts, each of them translated in most languages. However, most of these languages are from Eurasia, hence the collection lacks the typological diversity we would ideally like.

Wikipedia allows us to investigate a broader and more diverse set of languages, but has no phonetic information (only graphemes), and lexicons extracted from it may be "contaminated" with foreign words. We fetch the Wikipedia for a set of 41 diverse languages (af, ak, ar, bg, bn, chr, de, el, en, es, et, eu, fa, fi, ga, gn, haw, he, hi, hu, id, is, it, kn, lt, mr, no, nv, pl, pt, ru, sn, sw, ta, te, th, tl, tr, tt, ur, zu), and tokenise their text using language-specific tokenisers from spaCy (Honnibal and Montani, 2017). When a language-specific tokeniser was not available, we used a multilingual one. We then filtered all non-word tokens—by removing the ones with any symbol not in the language's scripts—and kept only the , most frequent types in each language.

For each of the analysed datasets, we use 80% of the word types for training, with the rest being equally split between development and test sets; only test set surprisals and cross-entropies are used in our analysis.

Forward Surprisal. We first replicate the results from van Son and Pols (2003a,b), Wedel et al. (2019b), and Pimentel et al. (2020), which show that surprisal decreases as the word position advances. On average, forward surprisal, i.e. H_θ(W_t | W_{<t}), is higher in word-initial than in word-final positions (see Tab. 1 and the left graph in Fig. 2). (All statistical significance results in this work have been corrected for multiple tests with Benjamini and Hochberg (1995) corrections and use a confidence value of p < . .)

Backward Surprisal. If the result for forward surprisal is largely due to the amount of conditional information, then reversing the strings should lead to a roughly opposite effect. With this in mind, for each language, we again bin surprisals in word-initial vs. word-final positions, but now we evaluate languages using backward surprisal, i.e., H_θ(W_t | W_{>t}). When using backward surprisal, many of the analysed languages have significantly higher surprisals in word-final positions (see Tab. 1 and the right graph in Fig. 2). However, 11 languages in the NorthEuraLex dataset still have higher word-initial surprisals, suggesting that initial positions in these languages are indeed largely more informative than final ones. There does seem to be a large effect of the amount of conditional information and also some lexical effect of front-loading disambiguatory signals; however, it is difficult to determine whether there are cross-linguistic tendencies with these measures. (We note that King and Wedel (2020) also used backward surprisal, although with a different objective in mind; in one of their experiments, they presented aggregate results of a comparison between the forward and backward surprisal. We also ran the same experiments with a probabilistic trie model like the ones used in van Son and Pols (2003b) and Wedel et al. (2019b), which showed an even stronger result reversal when using backward surprisal.)

Figure 2: Word initial vs. final surprisals with: (left) Forward; (right) Backward. (Axes: initial surprisal in bits vs. final surprisal in bits; points for the wikipedia, northeuralex and celex datasets.)
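To illustrate the kind of per-language initial-vs-final comparison reported above, the sketch below assumes a hypothetical table of per-segment surprisals (random stand-ins here), splits each word into its first and second halves as one plausible binning, uses a Mann-Whitney U test as a stand-in for the paper's unspecified per-language test, and applies the Benjamini-Hochberg correction mentioned in the text.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Hypothetical data: for each language, a list of per-word surprisal sequences
# (one value per segment). A real run would use model outputs instead.
lang_surprisals = {
    f"lang{i}": [rng.gamma(2.0, 1.5, size=rng.integers(3, 9)) for _ in range(500)]
    for i in range(5)
}

p_values, directions = [], []
for lang, words in lang_surprisals.items():
    # Bin each word's segments into initial (first half) and final (second half).
    initial = np.concatenate([w[: len(w) // 2] for w in words])
    final = np.concatenate([w[len(w) // 2:] for w in words])
    stat, p = mannwhitneyu(initial, final, alternative="two-sided")
    p_values.append(p)
    directions.append("initial" if initial.mean() > final.mean() else "final")

# Benjamini-Hochberg correction across languages; the 0.05 threshold is our
# assumption, as the paper's exact value is not recoverable from this copy.
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for lang, direction, p, sig in zip(lang_surprisals, directions, p_adj, reject):
    print(f"{lang}: higher in {direction} positions, adjusted p = {p:.3f}, "
          f"significant = {sig}")
```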
Unigram Surprisal. To control for the conditioning aspect of the question—do words front-load their disambiguatory signals?—we can look at unigram surprisal H_θ(W_t). This value tells us how uncommon the segments that appear in a certain position are, when analysed in isolation from the rest of the word—uncommon segments are more informative and provide a stronger signal for disambiguation. In NorthEuraLex, 71 of the languages have significantly higher informativity in word beginnings than in endings—nonetheless, one language (Kildin Saami) has higher surprisals in word endings. In CELEX, Dutch and German have higher surprisals in initial positions, but English does not. And in Wikipedia, all languages but Hebrew and Bengali have higher surprisal in initial positions—with Bengali having higher surprisal in word endings. This experiment suggests that indeed most languages are biased towards providing stronger disambiguatory signals in word beginnings, even when we control for the amount of conditional information. Nonetheless, this is not a universal characteristic which all languages share, and two analysed languages even had a statistically significant inverse effect.

Table 1: Number of languages in the analysed datasets with significantly larger surprisals in initial | final positions.

Position-Specific Surprisal. While cloze surprisal makes explicit the non-redundant informativity a segment conveys, unigram surprisal analyses the same segments in isolation. Position-specific surprisal provides a midway analysis, incorporating the position as previously-specified knowledge, but not conditioning on the other segments in the word. The position-specific surprisal is inspired by Nooteboom and van der Vlugt's (1988) experiments, which prime individuals on word length and position. As can be seen in Tab. 1, position-specific surprisal again seems to favour initial positions over final ones, but only slightly. Interestingly, most languages present no significant difference and some show the inverse effect (i.e. higher surprisal in final positions).

Position-specific Unigram models. To better understand the differences between the unigram and position-specific surprisal results, we trained position-specific unigram models—which count each segment's frequency per position—and then calculated their Kullback–Leibler (KL) divergence per position against the traditional unigram model:

KL(p(w_t \mid t) \,\|\, p(w_t)) = \sum_{w_t \in \Sigma} p(w_t \mid t) \log \frac{p(w_t \mid t)}{p(w_t)}    (18)

We compare these KL divergences and find that, for all but four languages, the KL is largest in either the first or second segment position. This suggests that one of the reasons for higher unigram surprisal in initial positions is that the first two segments usually differ from the rest of the positions, potentially serving as markers for word segmentation. (We use Laplacian smoothing in the position-specific unigrams and constrain the analysis to positions which appear in at least 75% of the analysed words in that language.)
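A count-based version of this per-position KL computation is straightforward. The sketch below (ours, using Laplace smoothing as in the note above; the toy word list is made up and the 75% position-coverage constraint is omitted for brevity) estimates p(w_t | t) and p(w_t) from counts and evaluates eq. (18) at each position.

```python
import math
from collections import Counter

def kl_per_position(lexicon, alphabet):
    """KL(p(w_t | t) || p(w_t)) per position, in bits, with Laplace smoothing."""
    def smoothed(counter, total):
        return {c: (counter[c] + 1) / (total + len(alphabet)) for c in alphabet}

    # Unigram distribution over all segments, pooled across positions.
    all_segments = [seg for word in lexicon for seg in word]
    p_unigram = smoothed(Counter(all_segments), len(all_segments))

    kls = []
    for t in range(max(len(w) for w in lexicon)):
        segments_at_t = [w[t] for w in lexicon if len(w) > t]
        p_position = smoothed(Counter(segments_at_t), len(segments_at_t))
        kl = sum(p_position[c] * math.log2(p_position[c] / p_unigram[c])
                 for c in alphabet)
        kls.append(kl)
    return kls

# Toy lexicon; a real experiment would use one language's word list instead.
lexicon = ["stone", "stack", "steam", "plane", "plant", "crane", "crown", "clamp"]
alphabet = sorted(set("".join(lexicon)))
for t, kl in enumerate(kl_per_position(lexicon, alphabet)):
    print(f"position {t}: KL = {kl:.3f} bits")
```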
Figure 3: Word initial vs. final cloze surprisals. (Axes: initial surprisal in bits vs. final surprisal in bits; points for the wikipedia, northeuralex and celex datasets.)

Cloze Surprisal. When we condition a segment on all others in the same word, we measure how much uncertainty is left about that individual segment when considering everything else—or, in other words, how much information is passed only by that segment, non-redundantly. Word-initial surprisal is higher in most analysed languages (see Tab. 1). Nonetheless, two languages in Wikipedia, Thai and Bengali, have significantly higher surprisal in their final segments—while English in CELEX and Hungarian in NorthEuraLex also present this same inverse effect. Front-loading disambiguatory information, thus, is not established to be the linguistic universal it is believed to be, with only roughly half the analysed languages showing this property when we control for morphology (CELEX and NorthEuraLex). Fig. 3 plots the results for all languages analysed.

When we compare these results, we find an interesting pattern. Morphology seems to reduce non-redundant (cloze) information later in words—while only half of the languages had significant surprisals in CELEX (which consists of monomorphemic words) and NorthEuraLex (base forms), most languages were significant in Wikipedia. Furthermore, English and Hungarian had significantly higher surprisals in word endings in CELEX and NorthEuraLex, while showing the opposite trend in Wikipedia—this is consistent with the fact that suffix morphemes are present in more types than word roots are, so morphology would make word endings less surprising.

                       EOW    Non-EOW
  Forward              1.14    3.55
  Backward             0.89    3.61
  Unigram              2.75    4.90
  Position-specific    0.00    4.36
  Cloze                0.00    3.23

Table 2: Average surprisal (in bits) of EOW vs. non-EOW segments averaged over all datasets.

Length as a Confounding Effect. We evaluate the impact of length as a confounding effect on previous methodologies. As mentioned in §2, by directly analysing surprisal–position pairs (as opposed to binning word-initial vs. word-final positions), previous work confounds position and word length—i.e., only long words will have later word positions. In this study, we analyse forward surprisal–length pairs; instead of pairing a segment's surprisal with its position, we pair it with its word's length. We then take the slope of a linear regression between these pairs of values and test for its significance per language by using a permutation test, in which we shuffle surprisal–length values. On the three datasets, all languages have statistically significant negative slopes, meaning long words have smaller surprisals on average than shorter ones. (King and Wedel (2020) indeed present a similar correlation in their Figure 2.) A caveat, though, is that now we are confounding position into our length analysis. Constraining our analysis to only the first two segments in each word, we still find the same effect—though now one language (Hebrew) in Wikipedia and seven in NorthEuraLex are not significant. We can thus conclude that longer words have smaller surprisal values than shorter ones, even when controlling for the same word positions. This implies that directly using surprisal–position pairs for such an analysis is not ideal.
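The permutation test described above is simple to spell out. The sketch below (our illustration; the surprisal–length pairs are synthetic stand-ins for real model outputs) fits the regression slope and compares it against slopes obtained after shuffling which surprisal is paired with which length.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins: a real run would pair each segment's forward surprisal
# with the length of the word it appears in, separately per language.
lengths = rng.integers(2, 12, size=5000).astype(float)
surprisals = 5.0 - 0.15 * lengths + rng.normal(0.0, 1.0, size=5000)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.polyfit(x, y, deg=1)[0]

observed = slope(lengths, surprisals)

# Permutation test: shuffle which surprisal goes with which length and see how
# often a slope at least as extreme arises by chance.
n_perm = 1000
perm_slopes = np.array([slope(lengths, rng.permutation(surprisals))
                        for _ in range(n_perm)])
p_value = np.mean(np.abs(perm_slopes) >= abs(observed))

print(f"observed slope = {observed:.3f}, permutation p = {p_value:.3f}")
# A significantly negative slope means longer words carry less surprisal per
# segment, which is the length confound discussed in the text.
```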
The Effect of End of Word in Surprisal. The end-of-word (EOW) symbol is a special "segment" which symbolises the end of a string. It is necessary when modelling the probability distribution over strings w ∈ Σ*, to guarantee that the overall distribution sums to 1. Nonetheless, it is expected to behave in a different way from other segments. If a speaker wants to reduce their production effort, although changing from one phone to another may help, the most efficient way is usually just ending the string earlier. Furthermore, since all realisable strings must eventually end, EOW will be present in all words, making it a very frequent symbol—in fact, Tab. 2 shows its average surprisal is much lower than that of other segments. As such, it is only natural that it should be analysed on its own, separately from other segments. By the same logic, other segments should also be analysed separately from EOW—or else, lower word-final surprisals may be due to this symbol alone. As such, we analyse the surprisal of LSTM "language models" without the EOW symbol here. (To be more precise, we actually ignore the beginning-of-word symbol when estimating backward surprisal.)

                        EOW                         No EOW
                Initial  Final  Diff (%)    Initial  Final  Diff (%)
  Forward         3.85    2.65    31.1        3.83    3.00    21.6
  Backward        3.02    3.40   -11.3        3.63    3.39     6.7
  Unigram           -       -      -          4.85    4.40     9.3
  Position          -       -      -          4.36    4.17     4.3
  Cloze             -       -      -          3.26    2.81    13.9

Table 3: Average surprisal per segment in word initial and final positions with and without EOW symbols.

Unsurprisingly, Tab. 3 shows the difference between word-initial and word-final positions is considerably reduced when we remove the EOW symbol from the forward surprisal analysis. Surprisingly, we see that when we remove the beginning-of-word symbol from the backward surprisal analysis, instead of a larger word-final surprisal, we get a larger word-initial value—even though we are still conditioning the models right-to-left. This result further supports the hypothesis that disambiguatory signals are on average stronger in word-initial positions.

In this work, we analysed the distribution of disambiguatory information across word positions. We presented an in-depth critique of previous work, showing several confounding effects in their analyses. We then proposed the use of three new methods which correct for these biases—namely unigram, position-specific and cloze surprisal. These models control for the amount of conditional information across word positions, allowing for an unbiased analysis of the lexicon. Using these models, we show that the lexicons of most languages indeed front-load their disambiguatory signals. This effect, though, is not universal, and the difference in disambiguatory information between word-initial and word-final positions is much lower than previously estimated—ranging from 4% to 14%, depending on the metric used, instead of 31%.

References

R. H. Baayen, R. Piepenbrock, and L. Gulikers. 2015. CELEX2 LDC96L14.

William Chandler Bagley. 1900. The apperception of the spoken sentence: A study in the psychology of language. The American Journal of Psychology, 12(1):80–130.

Georgij P. Basharin. 1959. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability & Its Applications, 4(3):333–336.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Jerome S. Bruner and Donald O'Dowd. 1958. A note on the informativeness of parts of words. Language and Speech, 1(2):98–101.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1(2):97–138.

Cynthia M. Connine, Dawn G. Blasko, and Debra Titone. 1993. Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language, 32(2):193–210.
Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T. Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145.

Johannes Dellert, Thora Daneyko, Alla Münch, Alina Ladygina, Armin Buch, Natalie Clarius, Ilja Grigorjew, Mohamed Balabel, Hizniye Isabella Boga, Zalina Baysarova, et al. 2019. NorthEuraLex: A wide-coverage lexical database of Northern Eurasia. Language Resources and Evaluation, pages 1–29.

David Fay and Anne Cutler. 1977. Malapropisms and the structure of the mental lexicon. Linguistic Inquiry, 8(3):505–520.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12(Jul):2335–2382.

Laura Gwilliams, Tal Linzen, David Poeppel, and Alec Marantz. 2018. In spoken word recognition, the future predicts the past. Journal of Neuroscience, 38(35):7585–7599.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Kathleen Houlihan. 1975. The Role of Word Boundary in Phonological Processes. Ph.D. thesis, University of Texas at Austin.

Adam King and Andrew Wedel. 2020. Greater early disambiguating information for less-probable words: The lexicon is shaped by incremental processing. Open Mind, pages 1–12.

Paul A. Luce. 1986. A computational analysis of uniqueness points in auditory word recognition. Perception & Psychophysics, 39(3):155–158.

Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven T. Piantadosi. 2018. Word forms are structured for efficient use. Cognitive Science, 42(8):3116–3134.

Stephen Michael Marcus. 1981. ERIS: context-sensitive coding in speech perception. Journal of Phonetics, 9(2):197–220.

William D. Marslen-Wilson. 1987. Functional parallelism in spoken word-recognition. Cognition, 25(1-2):71–102.

Gábor Melis, Tomáš Kočiský, and Phil Blunsom. 2020. Mogrifier LSTM. In International Conference on Learning Representations.

Stephan C. Meylan and Thomas L. Griffiths. 2017. Word forms—not just their lengths—are optimized for efficient communication. arXiv preprint arXiv:1703.01694.

John Morton. 1969. Interaction of information in word recognition. Psychological Review, 76(2):165.

S. G. Nooteboom and M. J. van der Vlugt. 1988. A search for a word-beginning superiority effect. The Journal of the Acoustical Society of America, 84(6):2018–2032.

Sieb G. Nooteboom. 1981. Lexical retrieval from fragments of spoken words: Beginnings vs endings. Journal of Phonetics, 9(4):407–424.

Tiago Pimentel, Brian Roark, and Ryan Cotterell. 2020. Phonotactic complexity and its trade-offs. Transactions of the Association for Computational Linguistics, 8:1–18.

Rob J. J. H. van Son and Louis C. W. Pols. 2003a. Information structure and efficiency in speech production. In Eighth European Conference on Speech Communication and Technology.

Rob J. J. H. van Son and Louis C. W. Pols. 2003b. How efficient is speech? In Proceedings of the Institute of Phonetic Sciences, volume 25, pages 171–184.

Joseph C. Toscano, Nathaniel D. Anderson, and Bob McMurray. 2013. Reconsidering the role of temporal order in spoken word recognition. Psychonomic Bulletin & Review, 20(5):981–987.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Andrew Wedel, Adam Ussishkin, and Adam King. 2019a. Crosslinguistic evidence for a strong statistical universal: Phonological neutralization targets word-ends over beginnings. Language, 95(4):e428–e446.

Andrew Wedel, Adam Ussishkin, and Adam King. 2019b. Incremental word processing influences the evolution of phonotactic patterns. Folia Linguistica, 40(1):231–248.

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.