Disambiguatory Signals are Stronger in Word-initial Positions
Tiago Pimentel (University of Cambridge), Ryan Cotterell (University of Cambridge, ETH Zürich), Brian Roark (Google)

Abstract
Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjecture—as in Wedel et al. (2019b), but common elsewhere—that languages have evolved to provide more information earlier in words than later. Information-theoretic methods to establish such tendencies in lexicons have suffered from several methodological shortcomings that leave open the question of whether this high word-initial informativeness is actually a property of the lexicon or simply an artefact of the incremental nature of recognition. In this paper, we point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word, and present several new measures that avoid these confounds. When controlling for these confounds, we still find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.

Our code is available at https://github.com/tpimentelms/frontload-disambiguation.

The psycholinguistic study of human lexical access is largely concerned with the incremental processing of words—whereby, as individual sub-lexical units (e.g., phones) are perceived, listeners update their expectations of the word being spoken. One common tenet of such studies is that the disambiguatory signal contributed by units early in the word is stronger than that contributed later—i.e. disambiguatory signals are front-loaded in words. This intuition is derived from ample indirect evidence that the beginnings of words are more important for humans during word processing—including, e.g., evidence of increased attention to word beginnings (Nooteboom, 1981, inter alia) or
evidence of increased levels of phonological reduction in word endings (van Son and Pols, 2003b).

To analyse this front-loading effect, researchers have investigated the information provided by segments in words. van Son and Pols (2003a,b) showed that, in Dutch, a segment's position in a word is a very strong predictor of its conditional surprisal, with later segments being more predictable than earlier ones—a result which we show to arise directly from its definition in §3.3.1. Recently, King and Wedel (2020) and Pimentel et al. (2020) confirmed the effect on many more languages.

Their analysis, however, presents an inherent confound between the amount of conditional information available to a model and the surprisal of the subsequent segment—see Fig. 1 for results illustrating this. Using the LSTM training recipes from Pimentel et al. (2020) (available at https://github.com/tpimentelms/phonotactic-complexity), we calculated the conditional surprisal at each segment position within the words across all languages in three datasets; see §3 and §5 for specifics on training and data. Each segment corresponds to a single phone in CELEX and NorthEuraLex, and to a single grapheme in Wikipedia. The top half of Fig. 1 shows that, indeed, positions earlier in the string have higher surprisal than positions later in the string, supporting the thesis of higher informativity earlier in words. The bottom half shows that modelling the strings right-to-left instead of left-to-right reverses the resulting effect.

Figure 1: Forward and backward surprisals with the LSTM model from Pimentel et al. (2020), plotted against position from the start (forward, top) and position from the end (backward, bottom) for the celex, wikipedia and northeuralex datasets. The bottom plot has been flipped horizontally such that it visually corresponds to the normal string direction.

This decouples conditional surprisal from disambiguatory strength. To expose this decoupling, consider an artificial language where every word contains a copy of its first half, e.g., foofoo, barbar, foobarfoobar, etc. The first and second halves of these words have identical disambiguatory strength; they are the same, so one could disambiguate the word as easily from its second half as from the first. In contrast, conditional surprisal would be nearly zero for the second halves of words, because the second half is perfectly predictable from the first half.

In natural languages, measuring conditional entropy in a left-to-right fashion inherently forces a reduction of conditional entropy in later segments because of a language's phonotactic constraints. However, the disambiguatory strength of later segments is not inherently less than that of earlier segments. For instance, in a language like Turkish, which has vowel harmony, knowledge of any of the vowels in a word will provide information about the word's other vowels in a similar way. As such, knowledge of vowels towards the front of a word is as disambiguating as knowledge of vowels towards its end.

The contributions of this paper are threefold. First, we document and demonstrate the shortcomings of existing methods for measuring the informativeness of individual segments in context, including the confound with the amount of conditional information discussed above. Second, we introduce three surprisal-based measures that control for this confound and enable comparison of word-initial versus word-final positions in this respect: unigram, position-specific and cloze surprisal (see §3). Finally, we find robust evidence across many languages of stronger disambiguatory signals in word-initial than word-final positions.
Out of a total of 151 languages analysed across three separate collections, 82 of them present a higher cloze surprisal in word beginnings than in endings—with similar patterns arising with the other two measures.
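As a small numerical illustration of the decoupling argued above, the sketch below (our own, not part of the paper) builds a toy reduplicated lexicon in the spirit of the foofoo/barbar example and estimates per-position forward surprisal with a simple count-based model over the closed word list; the second halves of words come out at (or near) zero conditional surprisal even though either half identifies the word equally well.

```python
import math
from collections import defaultdict

# Toy "reduplicated" lexicon: every word is its stem repeated twice,
# as in the foofoo/barbar example from the introduction.
stems = ["foo", "bar", "gui", "lex", "zap"]
lexicon = [s + s for s in stems]

# Count-based (plug-in) forward model: p(w_t | w_<t) estimated from
# full-history counts over the closed lexicon, one count per word type.
context_counts = defaultdict(lambda: defaultdict(int))
for word in lexicon:
    for t, segment in enumerate(word):
        context_counts[word[:t]][segment] += 1

def forward_surprisal(word, t):
    """Surprisal (in bits) of segment t given the preceding segments."""
    counts = context_counts[word[:t]]
    p = counts[word[t]] / sum(counts.values())
    return -math.log2(p)

# Average surprisal of first-half vs. second-half positions.
first, second = [], []
for word in lexicon:
    half = len(word) // 2
    first += [forward_surprisal(word, t) for t in range(half)]
    second += [forward_surprisal(word, t) for t in range(half, len(word))]

print(f"first halves : {sum(first) / len(first):.2f} bits/segment")
print(f"second halves: {sum(second) / len(second):.2f} bits/segment")
# Second halves get (near-)zero surprisal: once the first half is known, the
# rest is fully predictable, even though either half identifies the word
# equally well. Conditional surprisal is not disambiguatory strength.
```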
Psycholinguistic evidence. Lexical access has long been a topic of interest for psycholinguists, leading to many distinct models being proposed for this process (Morton, 1969; Marcus, 1981; Marslen-Wilson, 1987). Far earlier, though, Bagley (1900) had already demonstrated that earlier segments in words were more important for word recognition than later segments; specifically, they found that, when exposed to words with word-initial or word-final consonant deletions, listeners found the word-initial deletions more disruptive. Fay and Cutler (1977) showed mispronunciations are more likely in word endings, while Bruner and O'Dowd (1958) showed that recognizing written words with flipped initial characters was harder than with word-final ones—demonstrating that the initial part of the word was more "useful" for readers. More recently, Wedel et al. (2019a) found evidence in support of Houlihan (1975), showing neutralizing rules tend to target word endings more significantly than beginnings in both suffixing and prefixing languages.

Nooteboom (1981) investigated the ease of recovering lexical items from either word beginnings or endings, finding that people had an easier time recovering words from their beginnings. For this, he examined words for which the first and second halves each completely identified them in a large Dutch dictionary—controlling for both segments' length and uniqueness. Later on, though, Nooteboom and van der Vlugt (1988) showed this difference vanishes when priming people with the length of the word—proposing the difference comes not from how informative segments were, but from the difficulty in time-aligning later segments in mental lexicons. Connine et al. (1993) also found no difference in priming effects with non-words that differed from real words in either word-initial or medial positions, suggesting initial positions have no special status in word recognition.

Psycholinguistic evidence is key to understanding how lexical access works in human language processing, and can help us understand why lexicons may evolve to provide more disambiguatory signals earlier in words. Given the incremental nature of human lexical processing, however, such evidence cannot provide direct evidence of the nature of the lexicon uninfluenced by incrementality.
Computational evidence.
To the best of our knowledge, van Son and Pols (2003b,a) were the first to use computational methods coupled with an information-theoretic definition of informativeness to investigate this question. They showed that segments in the beginning of words carry most of a word's information, as measured by their contextual surprisal using a plug-in tree-structured probabilistic estimator. (Note that there are many possible reasons why the effects we demonstrate in this paper may arise, from the demands of lexical access to constraints on articulation. We provide no evidence for any of the possible explanations, evolutionary or otherwise, just methods for measuring the effect.) Although assessing a less-biased sample of words than Nooteboom (1981) (whose restriction to words completely identifiable by both their first and second halves in a large Dutch dictionary resulted in a study of only 14 words), this study is also limited to a single language (Dutch), hence cannot assess whether this is a general phenomenon or specific to that language.

Further, van Son and Pols (2003a,b) use absolute word positions in their analysis. Word length correlates strongly with frequency, hence while early positions are present in all words, later positions only exist for a much smaller sample of typically lower-frequency words. Thus this comparison amounts to asking if later positions in longer and infrequent words have lower surprisal than earlier positions in all (frequent or infrequent) words. We analyse this confounding factor in §6.

Wedel et al. (2019b) and King and Wedel (2020) applied a methodology similar to that of van Son and Pols (2003a) to show, for many diverse languages, that more frequent words contain less informative segments in word-initial positions, while less frequent types carry more informative ones. They further showed that segments in later word positions were less informative (given the previous ones) than average in rarer words. While controlling for length, King and Wedel (2020) also compared words' forward and backward uniqueness points—nodes in a trie from which only one leaf node can be reached, i.e., where the word is uniquely identified—showing they happened earlier in forward strings.

While these studies provide evidence from more diverse sets of languages, they follow van Son and Pols (2003a) in studying closed lexicons. (The closed lexicon assumption is incorporated implicitly in the probabilistic trie models used by van Son and Pols (2003a,b) and King and Wedel (2020)—i.e. they assign zero probability to any form not in their training sets—and in the uniqueness point analysis of King and Wedel (2020).) As we show in §3.3.1, the use of probabilistic trie models on a closed lexicon yields a trivial effect of higher informativity at word-initial positions. Furthermore, such studies cannot account for out-of-vocabulary words (e.g., nonce, proper name or otherwise unknown words) or derivational morphology, which are key parts of lexical recognition. Lexical access is also somewhat robust to segmental misordering (Toscano et al., 2013), and sounds later in a word help determine the perception of earlier ones (Gwilliams et al., 2018). In contrast, a trie over a closed lexicon is deterministic. Beyond this, Luce (1986) showed in a corpus study that the probability of a word type being uniquely identifiable before its last segment was low—and a notable share of types were identified only by the end of the word, being proper prefixes of other words, such as cat and cats.
They conclude that uniqueness point statistics may only be useful for long-word analysis.

In Pimentel et al. (2020), we analysed several languages' phonotactic distributions, focusing on presenting a trade-off between phonotactic entropy and word length across languages. As a control experiment we analysed the correlation between a segment's surprisal and its word position across 106 languages. We did not control for word length and did not run per-language experiments, though—so we could have just been capturing the effect that later positions will mostly be present in languages with longer words (which, as we find, have lower information on average).

While this last work avoids many of the issues raised earlier in this section, it fails to control for the key confound mentioned earlier: it relies on left-to-right conditional probabilities to calculate surprisal. Thus segments early in the word have less conditional information and hence are generally of lower probability—a trivial effect that does not indicate a segment's disambiguatory signal strength.
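As a concrete reference for the uniqueness-point analysis mentioned above, the sketch below (our illustration, not King and Wedel's (2020) code; the toy word list is made up) computes each word's forward and backward uniqueness point by scanning prefixes of a closed word list, which is equivalent to walking a trie.

```python
def uniqueness_point(word, lexicon):
    """1-based index of the first prefix of `word` shared with no other word in
    `lexicon`; returns len(word) if the word is never unique before its end."""
    for t in range(1, len(word) + 1):
        prefix = word[:t]
        if sum(1 for w in lexicon if w.startswith(prefix)) == 1:
            return t
    return len(word)

def backward_uniqueness_point(word, lexicon):
    """Same computation on reversed strings, i.e., reading the word right-to-left."""
    reversed_lexicon = [w[::-1] for w in lexicon]
    return uniqueness_point(word[::-1], reversed_lexicon)

lexicon = ["cat", "cats", "catalog", "dog", "dodge"]
for word in lexicon:
    fwd = uniqueness_point(word, lexicon)
    bwd = backward_uniqueness_point(word, lexicon)
    print(f"{word:8s} forward UP at segment {fwd}, backward UP at segment {bwd}")
# "cat", being a proper prefix of "cats" and "catalog", never becomes unique
# before (or even at) its final segment when read left-to-right, which is the
# situation Luce (1986) points out; read right-to-left it is unique immediately.
```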
In this work, instead of the lexicon itself, we investigate the probability distribution from which it is sampled. The distribution is unobserved, but we can get glimpses of it via the sampled lexicon:

\{ \mathbf{w}^{(n)} \}_{n=1}^{N} \sim p(\mathbf{w}) = \prod_{t=1}^{|\mathbf{w}|} p(w_t \mid \mathbf{w}_{<t})    (1)

where each wordform \mathbf{w} \in \Sigma^* is terminated by the end-of-word (EOW) symbol. For simplicity, we assume the alphabet includes EOW throughout the rest of the paper. This distribution should assign high probability to likely wordforms (attested or not) and low probability to unlikely ones. Using Chomsky and Halle's (1965) classic example from English, brick (attested) and blick (unattested) would have high probability, whereas *bnick (unattested) would have a low probability.

Shannon's entropy is a measure of how much information a random variable contains. Consider a segment w_t at word position t, which is a value of the random variable W_t. The average information (surprisal) relayed per segment is:

H(W_t) \equiv \sum_{w_t \in \Sigma} p(w_t) \log \frac{1}{p(w_t)}    (2)

A random variable is maximally entropic if it is uniformly distributed, in which case H(W_t) = log(|Σ|). Conditional entropy measures how much information the knowledge of a variable conveys, given some previous knowledge. The average information transmitted per segment, given the previous ones in a word, is H(W_t | W_{<t}).

We present a reductio ad absurdum which shows that van Son and Pols's (2003b) method will lead to the conclusion that word-initial segments are more informative even if all segments were equally entropic and sampled independently—a nonsensical finding. Accordingly, assume the probability distribution p(w_t | w_{<t}) = p(w_t) is identical at every position and independent of the preceding context, and consider a plug-in (count-based) estimator of the conditional entropy trained on a sample of N wordforms. Following Basharin (1959), the expected estimation error behaves approximately as

H(W_t \mid \mathbf{W}_{<t}) - \hat{H}(W_t \mid \mathbf{W}_{<t}) \approx \frac{|\Sigma|^{t-1}\,(|\Sigma|-1)\,\log e}{N}    (6)

where \hat{H} is a plug-in estimate of the entropy. The error grows exponentially in t due to the |Σ|^{t-1} factor. However, by assumption, H(W_t | W_{<t}) is constant—we have equally entropic and independent segments. Thus, the only way for this difference to increase is for the second term to decrease as a function of t. It follows that the estimated cross-entropies decrease as a function of t due to a methodological technicality. Indeed, in the extreme case, every position after a word's uniqueness point would be estimated to have zero entropy. Thus, van Son and Pols's (2003a) method only reveals a trivial effect. (This is in fact a simplification of van Son and Pols's (2003a) model, which in practice uses Katz smoothing.)

As previously mentioned, the conditional entropy measures how much information the knowledge of a variable conveys, given some previous information, and it is always smaller than or equal to the entropy. For this reason, relying on left-to-right conditional entropies to estimate the strength of disambiguatory signals yields straightforward results: the availability of larger conditioning contexts in a word's final segments will naturally reduce its conditional entropy. This will negatively skew the estimated informativeness of the later parts of a word, since

H(W_t) \geq H(W_t \mid W_{t-1}) \geq H(W_t \mid \mathbf{W}_{<t})

To control for the amount of conditioning information, we instead consider three measures:

• Unigram Surprisal H_θ(W_t): the surprisal of individual segments.
• Cloze Surprisal H_θ(W_t | W_{≠t}): the surprisal of a segment given all others in the same word.
• Position-Specific Surprisal H_θ(W_t | T = t, |W|): the surprisal of individual segments given their position in the wordform and the word's length.

The unigram surprisal captures the information provided by each segment when considering no context, while the cloze surprisal represents the information provided by a segment when one already knows the rest of the word. The position-specific surprisal represents a midway point between both, conditioning each segment only on its position and the word's length—being inspired by Nooteboom and van der Vlugt's (1988) experiments. These three measures of information control for the context size considered at each position, and are thus better suited to an investigation of disambiguatory strength.
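The reductio above is easy to reproduce numerically. The sketch below (our illustration; the alphabet size, word length and sample size are arbitrary) samples a lexicon whose segments are i.i.d. uniform, so the true entropy at every position, conditional or not, is 3 bits, and then compares a count-based plug-in estimate of the forward conditional entropy with a per-position unigram estimate.

```python
import math
import random
from collections import Counter, defaultdict

random.seed(0)
ALPHABET = "abcdefgh"          # |Sigma| = 8 (arbitrary)
WORD_LEN, N_WORDS = 6, 2000    # fixed-length toy lexicon (arbitrary sizes)

# Every segment is i.i.d. uniform, so the true entropy at every position,
# conditional or not, is log2(8) = 3 bits.
lexicon = ["".join(random.choice(ALPHABET) for _ in range(WORD_LEN))
           for _ in range(N_WORDS)]

def plugin_entropy(counter):
    """Plug-in (maximum-likelihood) entropy estimate, in bits."""
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

for t in range(WORD_LEN):
    # Unigram estimate: pool segments at position t across words, no context.
    unigram = plugin_entropy(Counter(w[t] for w in lexicon))

    # Forward conditional estimate: condition on the full left context, as a
    # count-based left-to-right model over a closed lexicon would.
    by_context = defaultdict(Counter)
    for w in lexicon:
        by_context[w[:t]][w[t]] += 1
    total = sum(sum(c.values()) for c in by_context.values())
    forward = sum(plugin_entropy(c) * sum(c.values())
                  for c in by_context.values()) / total

    print(f"position {t}: unigram {unigram:.2f} bits, "
          f"plug-in forward {forward:.2f} bits (true value 3.00)")

# The plug-in forward estimate collapses towards zero at later positions, the
# spurious 'front-loading' effect of the reductio, while the unigram estimate
# stays flat at roughly 3 bits.
```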
We used a unigram model (see §4) to estimate the unigram surprisal, and transformers (Vaswani et al., 2017) for the cloze and position-specific surprisals. We also use the LSTM (Long Short-Term Memory; Hochreiter and Schmidhuber, 1997) model from Pimentel et al. (2020) for two other entropy measures which do not control for the amount of conditional information:

• Forward Surprisal H_θ(W_t | W_{<t}): the surprisal of a segment given the preceding ones in the word.
• Backward Surprisal H_θ(W_t | W_{>t}): the surprisal of a segment given the following ones in the word.

Unigram model. This might be the simplest language model still in use in Natural Language Processing. We use its Laplace-smoothed variant

p_\theta(w_t) = \frac{\mathrm{count}(w_t) + 1}{\sum_{c' \in \Sigma} \mathrm{count}(c') + |\Sigma|}    (12)

LSTM. This architecture is the state of the art for character-level language modelling (Melis et al., 2020). Given a sequence of segments \mathbf{w} \in \Sigma^*, we use one-hot lookup embeddings to transform each of them into a vector \mathbf{z}_t \in \mathbb{R}^d. We then feed these vectors into a k-layer LSTM

\mathbf{h}_t = \mathrm{LSTM}(\mathbf{z}_{t-1}, \mathbf{h}_{t-1})    (13)

where \mathbf{h}_t \in \mathbb{R}^d, \mathbf{h}_0 is a vector with all zeros and w_0 is the beginning-of-word symbol. We then linearly transform these vectors before feeding them into a softmax non-linearity to obtain the distribution

p_\theta(w_t \mid \mathbf{w}_{<t}) = \mathrm{softmax}(W \mathbf{h}_t + \mathbf{b})    (14)

To get the backward surprisals we use models with the same architecture, but reverse all strings before feeding them to the models. As such, we get the analogous equations

\mathbf{h}_t = \mathrm{LSTM}(\mathbf{z}_{t+1}, \mathbf{h}_{t+1})    (15)
p_\theta(w_t \mid \mathbf{w}_{>t}) = \mathrm{softmax}(W \mathbf{h}_t + \mathbf{b})    (16)

Transformer. Transformers allow a segment to be conditioned on both future and previous symbols. Our implementation starts similarly to the LSTM one, getting embedding vectors \mathbf{z}_t for each segment in the string \mathbf{w} \in \Sigma^*, except that we replace segment w_t with a MASK symbol. We then feed these vectors through k multi-headed self-attention layers, as defined by Vaswani et al. (2017). Finally, the representations from the last layer are linearly transformed and fed into a softmax

p_\theta(w_t \mid \mathbf{w}_{\neq t}) = \mathrm{softmax}(W \mathbf{h}_t + \mathbf{b})    (17)

Position-Specific Transformer. To get position-specific surprisal values, we again use a transformer architecture, but instead of replacing a single segment with a MASK symbol, we replace all of them. This is equivalent to conditioning each segment's distribution on its position and the word length—i.e., estimating p_θ(w_t | t, |w|).
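To make the masked setup behind the cloze and position-specific estimators concrete, here is a minimal PyTorch sketch of a character-level masked transformer; the class name, layer sizes and untrained usage are illustrative assumptions, not the authors' implementation or hyperparameters.

```python
import torch
import torch.nn as nn

class MaskedSegmentTransformer(nn.Module):
    def __init__(self, alphabet_size, max_len=32, d_model=64, nhead=4, nlayers=3):
        super().__init__()
        self.mask_id = alphabet_size            # extra index reserved for MASK
        self.embed = nn.Embedding(alphabet_size + 1, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.out = nn.Linear(d_model, alphabet_size)

    def forward(self, segments, mask_all=False):
        # segments: (batch, time) integer segment ids
        batch, time = segments.shape
        positions = torch.arange(time).expand(batch, time)
        logits = []
        for t in range(time):
            masked = segments.clone()
            if mask_all:
                masked[:, :] = self.mask_id      # position-specific: hide every segment
            else:
                masked[:, t] = self.mask_id      # cloze: hide only segment t
            h = self.embed(masked) + self.pos_embed(positions)
            h = self.encoder(h)
            logits.append(self.out(h[:, t]))     # predict the segment at position t
        return torch.stack(logits, dim=1)        # (batch, time, alphabet)

def cloze_surprisal(model, segments):
    """Surprisal of each segment, in bits, under the cloze distribution p(w_t | w_!=t)."""
    logits = model(segments)                     # (batch, time, alphabet)
    log_probs = torch.log_softmax(logits, dim=-1)
    picked = log_probs.gather(-1, segments.unsqueeze(-1)).squeeze(-1)
    return -picked / torch.log(torch.tensor(2.0))

# Toy usage with a made-up 10-symbol alphabet and random "words" (untrained model).
model = MaskedSegmentTransformer(alphabet_size=10)
words = torch.randint(0, 10, (4, 7))             # batch of 4 words, 7 segments each
print(cloze_surprisal(model, words).shape)       # torch.Size([4, 7])
```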
In order to estimate the redundancy and informativeness of segments we use three different datasets, each with its own pros and cons. We focus on types instead of tokens—i.e., the datasets consist of lexicons—for a few different reasons. First, it is easier to get reliable samples of types than of tokens for a language, especially for low-resource ones. Second, it is a well-known result that token frequency correlates with both word length (Zipf, 1949) and phonotactic probability (Mahowald et al., 2018; Meylan and Griffiths, 2017), so that would be a strong confound in the results. Third, morphology is more easily modeled at the type level than at the token level (Goldwater et al., 2011).

CELEX (Baayen et al., 2015) allows us to experiment exclusively on monomorphemic words, but covers only three closely related languages. It contains both morphological and phonetic annotations for a large number of words in English, Dutch and German. We follow Dautriche et al. (2017) in using only words labeled as monomorphemic in our study, leaving us with , words in German, , words in English and , words in Dutch.

NorthEuraLex (Dellert et al., 2019) spans 107 languages from 21 language families in a unified IPA format. This database is composed of concept-aligned word lists for these languages, containing 1016 concepts, each of them translated in most languages. However, most of these languages are from Eurasia, hence the collection lacks the typological diversity we would ideally like.

Wikipedia allows us to investigate a broader and more diverse set of languages, but has no phonetic information (only graphemes), and lexicons extracted from it may be "contaminated" with foreign words. We fetch the Wikipedia for a set of 41 diverse languages (af, ak, ar, bg, bn, chr, de, el, en, es, et, eu, fa, fi, ga, gn, haw, he, hi, hu, id, is, it, kn, lt, mr, no, nv, pl, pt, ru, sn, sw, ta, te, th, tl, tr, tt, ur, zu), and tokenise their text using language-specific tokenisers from spaCy (Honnibal and Montani, 2017). When a language-specific tokeniser was not available, we used a multilingual one. We then filtered all non-word tokens—by removing the ones with any symbol not in the language's scripts—and kept only the , most frequent types in each language.

For each of the analysed datasets, we use 80% of the word types for training, with the rest being equally split between development and test sets; only test set surprisals and cross-entropies are used in our analysis.

Forward Surprisal. We first replicate the results from van Son and Pols (2003a,b), Wedel et al. (2019b), and Pimentel et al. (2020), which show that surprisal decreases as the word position advances. On average, forward surprisal, i.e. H_θ(W_t | W_{<t}), is higher in word-initial than in word-final positions (see Tab. 1 and the left graph in Fig. 2). (All statistical significance results in this work have been corrected for multiple tests with Benjamini and Hochberg (1995) corrections and use a confidence value of p < . .)

Backward Surprisal. If the result for forward surprisal is largely due to the amount of conditional information, then reversing the strings should lead to a roughly opposite effect. With this in mind, for each language, we again bin surprisals in word-initial vs. word-final positions, but now we evaluate languages using backward surprisal, i.e., H_θ(W_t | W_{>t}). When using backward surprisal, many of the analysed languages have significantly higher surprisals in word-final positions (see Tab. 1 and the right graph in Fig. 2). However, 11 languages in the NorthEuraLex dataset still have higher word-initial surprisals, suggesting that initial positions in these languages are indeed largely more informative than final ones. There does seem to be a large effect of the amount of conditional information and also some lexical effect of front-loading disambiguatory signals; however, it is difficult to determine whether there are cross-linguistic tendencies with these measures. (We note that King and Wedel (2020) also used backward surprisal, although with a different objective in mind; in one of their experiments, they presented aggregate results of a comparison between the forward and backward surprisal. We also ran the same experiments with a probabilistic trie model like the ones used in van Son and Pols (2003b) and Wedel et al. (2019b), which showed an even stronger result reversal when using backward surprisal.)

Figure 2: Word initial vs. final surprisals with: (left) Forward; (right) Backward. (Axes: initial surprisal in bits vs. final surprisal in bits; points for the wikipedia, northeuralex and celex datasets.)
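To illustrate the kind of per-language initial-vs-final comparison reported above, the sketch below assumes a hypothetical table of per-segment surprisals (random stand-ins here), splits each word into its first and second halves as one plausible binning, uses a Mann-Whitney U test as a stand-in for the paper's unspecified per-language test, and applies the Benjamini-Hochberg correction mentioned in the text.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Hypothetical data: for each language, a list of per-word surprisal sequences
# (one value per segment). A real run would use model outputs instead.
lang_surprisals = {
    f"lang{i}": [rng.gamma(2.0, 1.5, size=rng.integers(3, 9)) for _ in range(500)]
    for i in range(5)
}

p_values, directions = [], []
for lang, words in lang_surprisals.items():
    # Bin each word's segments into initial (first half) and final (second half).
    initial = np.concatenate([w[: len(w) // 2] for w in words])
    final = np.concatenate([w[len(w) // 2:] for w in words])
    stat, p = mannwhitneyu(initial, final, alternative="two-sided")
    p_values.append(p)
    directions.append("initial" if initial.mean() > final.mean() else "final")

# Benjamini-Hochberg correction across languages; the 0.05 threshold is our
# assumption, as the paper's exact value is not recoverable from this copy.
reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for lang, direction, p, sig in zip(lang_surprisals, directions, p_adj, reject):
    print(f"{lang}: higher in {direction} positions, adjusted p = {p:.3f}, "
          f"significant = {sig}")
```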
Unigram Surprisal. To control for the conditioning aspect of the question—do words front-load their disambiguatory signals?—we can look at unigram surprisal H_θ(W_t). This value tells us how uncommon the segments that appear in a certain position are, when analysed in isolation from the rest of the word—uncommon segments are more informative and provide a stronger signal for disambiguation. In NorthEuraLex, 71 of the languages have significantly higher informativity in word beginnings than in endings—nonetheless, one language (Kildin Saami) has higher surprisals in word endings. In CELEX, Dutch and German have higher surprisals in initial positions, but English does not. And in Wikipedia, all languages but Hebrew and Bengali have higher surprisal in initial positions—with Bengali having higher surprisal in word endings. This experiment suggests that indeed most languages are biased towards providing stronger disambiguatory signals in word beginnings, even when we control for the amount of conditional information. Nonetheless, this is not a universal characteristic which all languages share, and two analysed languages even had a statistically significant inverse effect.

Table 1: Number of languages in the analysed datasets with significantly larger surprisals in initial | final positions.

Position-Specific Surprisal. While cloze surprisal makes explicit the non-redundant informativity a segment conveys, unigram surprisal analyses the same segments in isolation. Position-specific surprisal provides a midway analysis, incorporating the position as previously-specified knowledge, but not conditioning on the other segments in the word. The position-specific surprisal is inspired by Nooteboom and van der Vlugt's (1988) experiments, which prime individuals on word length and position. As can be seen in Tab. 1, position-specific surprisal again seems to favour initial positions over final ones, but only slightly. Interestingly, most languages present no significant difference and some show the inverse effect (i.e. higher surprisal in final positions).

Position-specific Unigram models. To better understand the differences between the unigram and position-specific surprisal results, we trained position-specific unigram models—which count each segment's frequency per position—and then calculated their Kullback–Leibler (KL) divergence per position against the traditional unigram model:

KL(p(w_t \mid t) \,\|\, p(w_t)) = \sum_{w_t \in \Sigma} p(w_t \mid t) \log \frac{p(w_t \mid t)}{p(w_t)}    (18)

We compare these KL divergences and find that, for all but four languages, the KL is largest in either the first or second segment position. This suggests that one of the reasons for higher unigram surprisal in initial positions is that the first two segments usually differ from the rest of the positions, potentially serving as markers for word segmentation. (We use Laplacian smoothing in the position-specific unigrams and constrain the analysis to positions which appear in at least 75% of the analysed words in that language.)
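A count-based version of this per-position KL computation is straightforward. The sketch below (ours, using Laplace smoothing as in the note above; the toy word list is made up and the 75% position-coverage constraint is omitted for brevity) estimates p(w_t | t) and p(w_t) from counts and evaluates eq. (18) at each position.

```python
import math
from collections import Counter

def kl_per_position(lexicon, alphabet):
    """KL(p(w_t | t) || p(w_t)) per position, in bits, with Laplace smoothing."""
    def smoothed(counter, total):
        return {c: (counter[c] + 1) / (total + len(alphabet)) for c in alphabet}

    # Unigram distribution over all segments, pooled across positions.
    all_segments = [seg for word in lexicon for seg in word]
    p_unigram = smoothed(Counter(all_segments), len(all_segments))

    kls = []
    for t in range(max(len(w) for w in lexicon)):
        segments_at_t = [w[t] for w in lexicon if len(w) > t]
        p_position = smoothed(Counter(segments_at_t), len(segments_at_t))
        kl = sum(p_position[c] * math.log2(p_position[c] / p_unigram[c])
                 for c in alphabet)
        kls.append(kl)
    return kls

# Toy lexicon; a real experiment would use one language's word list instead.
lexicon = ["stone", "stack", "steam", "plane", "plant", "crane", "crown", "clamp"]
alphabet = sorted(set("".join(lexicon)))
for t, kl in enumerate(kl_per_position(lexicon, alphabet)):
    print(f"position {t}: KL = {kl:.3f} bits")
```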
Figure 3: Word initial vs. final cloze surprisals. (Axes: initial surprisal in bits vs. final surprisal in bits; points for the wikipedia, northeuralex and celex datasets.)

Cloze Surprisal. When we condition a segment on all others in the same word, we measure how much uncertainty is left about that individual segment when considering everything else—or, in other words, how much information is passed only by that segment, non-redundantly. Word-initial surprisal is higher in most analysed languages (see Tab. 1). Nonetheless, two languages in Wikipedia, Thai and Bengali, have significantly higher surprisal in their final segments—while English in CELEX and Hungarian in NorthEuraLex also present this same inverse effect. Front-loading disambiguatory information, thus, is not established to be the linguistic universal it is believed to be, with only roughly half the analysed languages showing this property when we control for morphology (CELEX and NorthEuraLex). Fig. 3 plots the results for all languages analysed.

When we compare these results, we find an interesting pattern. Morphology seems to reduce non-redundant (cloze) information later in words—while only half of the languages had significant surprisals in CELEX (which consists of monomorphemic words) and NorthEuraLex (base forms), most languages were significant in Wikipedia. Furthermore, English and Hungarian had significantly higher surprisals in word endings in CELEX and NorthEuraLex, while showing the opposite trend in Wikipedia—this is consistent with the fact that suffix morphemes are present in more types than word roots are, so morphology would make word endings less surprising.

                       EOW    Non-EOW
  Forward              1.14    3.55
  Backward             0.89    3.61
  Unigram              2.75    4.90
  Position-specific    0.00    4.36
  Cloze                0.00    3.23

Table 2: Average surprisal (in bits) of EOW vs. non-EOW segments averaged over all datasets.

Length as a Confounding Effect. We evaluate the impact of length as a confounding effect on previous methodologies. As mentioned in §2, by directly analysing surprisal–position pairs (as opposed to binning word-initial vs. word-final positions), previous work confounds position and word length—i.e., only long words will have later word positions. In this study, we analyse forward surprisal–length pairs; instead of pairing a segment's surprisal with its position, we pair it with its word's length. We then take the slope of a linear regression between these pairs of values and test for its significance per language by using a permutation test, in which we shuffle surprisal–length values. On the three datasets, all languages have statistically significant negative slopes, meaning long words have smaller surprisals on average than shorter ones. (King and Wedel (2020) indeed present a similar correlation in their Figure 2.) A caveat, though, is that now we are confounding position into our length analysis. Constraining our analysis to only the first two segments in each word, we still find the same effect—though now one language (Hebrew) in Wikipedia and seven in NorthEuraLex are not significant. We can thus conclude that longer words have smaller surprisal values than shorter ones, even when controlling for the same word positions. This implies that directly using surprisal–position pairs for such an analysis is not ideal.
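The permutation test described above is simple to spell out. The sketch below (our illustration; the surprisal–length pairs are synthetic stand-ins for real model outputs) fits the regression slope and compares it against slopes obtained after shuffling which surprisal is paired with which length.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins: a real run would pair each segment's forward surprisal
# with the length of the word it appears in, separately per language.
lengths = rng.integers(2, 12, size=5000).astype(float)
surprisals = 5.0 - 0.15 * lengths + rng.normal(0.0, 1.0, size=5000)

def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    return np.polyfit(x, y, deg=1)[0]

observed = slope(lengths, surprisals)

# Permutation test: shuffle which surprisal goes with which length and see how
# often a slope at least as extreme arises by chance.
n_perm = 1000
perm_slopes = np.array([slope(lengths, rng.permutation(surprisals))
                        for _ in range(n_perm)])
p_value = np.mean(np.abs(perm_slopes) >= abs(observed))

print(f"observed slope = {observed:.3f}, permutation p = {p_value:.3f}")
# A significantly negative slope means longer words carry less surprisal per
# segment, which is the length confound discussed in the text.
```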
The Effect of End of Word in Surprisal. The end-of-word (EOW) symbol is a special "segment" which symbolises the end of a string. It is necessary when modelling the probability distribution over strings w ∈ Σ*, to guarantee that the overall distribution sums to 1. Nonetheless, it is expected to behave in a different way from other segments. If a speaker wants to reduce their production effort, although changing from one phone to another may help, the most efficient way is usually just ending the string earlier. Furthermore, since all realisable strings must eventually end, EOW will be present in all words, making it a very frequent symbol—in fact, Tab. 2 shows its average surprisal is much lower than that of other segments. As such, it is only natural that it should be analysed on its own, separately from other segments. By the same logic, other segments should also be analysed separately from EOW—or else, lower word-final surprisals may be due to this symbol alone. As such, we analyse the surprisal of LSTM "language models" without the EOW symbol here. (To be more precise, we actually ignore the beginning-of-word symbol when estimating backward surprisal.)

                        EOW                         No EOW
                Initial  Final  Diff (%)    Initial  Final  Diff (%)
  Forward         3.85    2.65    31.1        3.83    3.00    21.6
  Backward        3.02    3.40   -11.3        3.63    3.39     6.7
  Unigram           -       -      -          4.85    4.40     9.3
  Position          -       -      -          4.36    4.17     4.3
  Cloze             -       -      -          3.26    2.81    13.9

Table 3: Average surprisal per segment in word initial and final positions with and without EOW symbols.

Unsurprisingly, Tab. 3 shows the difference between word-initial and word-final positions is considerably reduced when we remove the EOW symbol from the forward surprisal analysis. Surprisingly, we see that when we remove the beginning-of-word symbol from the backward surprisal analysis, instead of a larger word-final surprisal, we get a larger word-initial value—even though we are still conditioning the models right-to-left. This result further supports the hypothesis that disambiguatory signals are on average stronger in word-initial positions.

In this work, we analysed the distribution of disambiguatory information across word positions. We presented an in-depth critique of previous work, showing several confounding effects in their analyses. We then proposed the use of three new methods which correct for these biases—namely unigram, position-specific and cloze surprisal. These models control for the amount of conditional information across word positions, allowing for an unbiased analysis of the lexicon. Using these models, we show that the lexicons of most languages indeed front-load their disambiguatory signals. This effect, though, is not universal, and the difference in disambiguatory information between word-initial and word-final positions is much lower than previously estimated—ranging from 4% to 14%, depending on the metric used, instead of 31%.

References

R. H. Baayen, R. Piepenbrock, and L. Gulikers. 2015. CELEX2 LDC96L14.

William Chandler Bagley. 1900. The apperception of the spoken sentence: A study in the psychology of language. The American Journal of Psychology, 12(1):80–130.

Georgij P. Basharin. 1959. On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability & Its Applications, 4(3):333–336.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Jerome S. Bruner and Donald O'Dowd. 1958. A note on the informativeness of parts of words. Language and Speech, 1(2):98–101.

Noam Chomsky and Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics, 1(2):97–138.

Cynthia M. Connine, Dawn G. Blasko, and Debra Titone. 1993. Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language, 32(2):193–210.
Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T. Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145.

Johannes Dellert, Thora Daneyko, Alla Münch, Alina Ladygina, Armin Buch, Natalie Clarius, Ilja Grigorjew, Mohamed Balabel, Hizniye Isabella Boga, Zalina Baysarova, et al. 2019. NorthEuraLex: A wide-coverage lexical database of Northern Eurasia. Language Resources and Evaluation, pages 1–29.

David Fay and Anne Cutler. 1977. Malapropisms and the structure of the mental lexicon. Linguistic Inquiry, 8(3):505–520.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2011. Producing power-law distributions and damping word frequencies with two-stage language models. Journal of Machine Learning Research, 12(Jul):2335–2382.

Laura Gwilliams, Tal Linzen, David Poeppel, and Alec Marantz. 2018. In spoken word recognition, the future predicts the past. Journal of Neuroscience, 38(35):7585–7599.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Kathleen Houlihan. 1975. The Role of Word Boundary in Phonological Processes. Ph.D. thesis, University of Texas at Austin.

Adam King and Andrew Wedel. 2020. Greater early disambiguating information for less-probable words: The lexicon is shaped by incremental processing. Open Mind, pages 1–12.

Paul A. Luce. 1986. A computational analysis of uniqueness points in auditory word recognition. Perception & Psychophysics, 39(3):155–158.

Kyle Mahowald, Isabelle Dautriche, Edward Gibson, and Steven T. Piantadosi. 2018. Word forms are structured for efficient use. Cognitive Science, 42(8):3116–3134.

Stephen Michael Marcus. 1981. ERIS: context-sensitive coding in speech perception. Journal of Phonetics, 9(2):197–220.

William D. Marslen-Wilson. 1987. Functional parallelism in spoken word-recognition. Cognition, 25(1-2):71–102.

Gábor Melis, Tomáš Kočiský, and Phil Blunsom. 2020. Mogrifier LSTM. In International Conference on Learning Representations.

Stephan C. Meylan and Thomas L. Griffiths. 2017. Word forms—not just their lengths—are optimized for efficient communication. arXiv preprint arXiv:1703.01694.

John Morton. 1969. Interaction of information in word recognition. Psychological Review, 76(2):165.

S. G. Nooteboom and M. J. van der Vlugt. 1988. A search for a word-beginning superiority effect. The Journal of the Acoustical Society of America, 84(6):2018–2032.

Sieb G. Nooteboom. 1981. Lexical retrieval from fragments of spoken words: Beginnings vs endings. Journal of Phonetics, 9(4):407–424.

Tiago Pimentel, Brian Roark, and Ryan Cotterell. 2020. Phonotactic complexity and its trade-offs. Transactions of the Association for Computational Linguistics, 8:1–18.

Rob J. J. H. van Son and Louis C. W. Pols. 2003a. Information structure and efficiency in speech production. In Eighth European Conference on Speech Communication and Technology.

Rob J. J. H. van Son and Louis C. W. Pols. 2003b. How efficient is speech? In Proceedings of the Institute of Phonetic Sciences, volume 25, pages 171–184.

Joseph C. Toscano, Nathaniel D. Anderson, and Bob McMurray. 2013. Reconsidering the role of temporal order in spoken word recognition. Psychonomic Bulletin & Review, 20(5):981–987.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Andrew Wedel, Adam Ussishkin, and Adam King. 2019a. Crosslinguistic evidence for a strong statistical universal: Phonological neutralization targets word-ends over beginnings. Language, 95(4):e428–e446.

Andrew Wedel, Adam Ussishkin, and Adam King. 2019b. Incremental word processing influences the evolution of phonotactic patterns. Folia Linguistica, 40(1):231–248.

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.