A general solution to the preferential selection model
Jake Ryland Williams, Diana Solano-Oropeza, Jacob R. Hunsberger
Jake Ryland Williams∗
Department of Information Science, College of Computing and Informatics, Drexel University, 3675 Market St., Philadelphia, PA 19104
Diana Solano-Oropeza†
Department of Physics, Drexel University, 32 S. 32nd St., Philadelphia, PA 19104
Jacob R. Hunsberger‡
Department of Chemical and Biological Engineering, College of Engineering, Drexel University, 3100 Market St., Philadelphia, PA 19104
(Dated: August 10, 2020)

We provide a general analytic solution to Herbert Simon's 1955 model for time-evolving novelty functions. This has far-reaching consequences: Simon's is a precursor model for Barabási's 1999 preferential attachment model for growing social networks, and our general abstraction of it, moreover, considers attachment to be a form of link selection. We show that any system which can be modeled as instances of types—i.e., occurrence data (frequencies)—can be generatively modeled (and simulated) from a distributional perspective with an exceptionally high degree of accuracy.
I. PREFERENTIAL SELECTION
What are the mechanistic processes through which social agents make selection decisions, or more concisely, how do people pick things? Social agents might express themselves through selection, and one well-known mechanism for understanding these processes comes from the study of complex networks—it is known as preferential attachment [1, 2]. It traces its roots to the study of evolution [3], and for text, its analog is well known as language generation, i.e., an agent selects words [4]. This model for language came about in 1955 from the well-known social scientist Herbert Simon. It abstracts well to other selection contexts and can capture a variety of phenomena through modulation of its parameters [5–8], so we refer to this model more generally as preferential selection.

We continue with the development and extension of this model through a generalized analysis for arbitrary, time-evolving novelty rates, i.e., the capacity for the selection model to pick 'new' things. Previously, general solutions were only available for this model when systems were assumed to obey a constant novelty rate. This has largely obstructed the applicability and usefulness of Simon's selection model to real-world data. Our solution provides a functional form for essentially all parameterizations, which will result in much greater applicability. Hence, future work will include experiments that seek to uncover insights from a fuller breadth of selection data from different social contexts. Likewise, computational tools for conducting these analyses will be fully developed for open-source release.

∗ [email protected]
† [email protected]
‡ [email protected]

A. Model setup
Preferential selection describes a sequence of $M$ instances (words), $(x_m)_{m=1}^{M}$, that have an onto relationship to a set of types $\mathcal{W}$ (a vocabulary). So, let $N = |\mathcal{W}|$ be the number of types and index the set of types (surface forms) $w_n \in \mathcal{W}$ by their order of appearance, so that $w_n$ indicates the $n$th unique type in the stream. For any instance $m$, define $n_m$ to be the number of types observed 'so far', i.e., within $\{x_k\}_{k=1}^{m}$. Let $\alpha$ (without subscript) denote a fixed novelty rate and let $\alpha_m$ denote one which varies by instance. Without loss of generality, when selecting instance $x_m$, Simon's model can be succinctly understood as a trade-off between two dynamics:

• exploration: with probability $\alpha_{m-1}$, the selected instance is a 'new' (novel) type: $x_m = w_{n_{m-1}+1}$;

• exploitation: with probability $(1 - \alpha_{m-1})$, $x_m$'s type is assigned from that of an instance drawn randomly from $\{x_k\}_{k=1}^{m-1}$.

Note: indexing requires $n_0 = 0$ (there are zero instances if and only if there are zero types) and $\alpha_1 = 1$ (the model has no false starts).

B. Pre-existing solution
Now define $m_n$, $n = 1, \cdots, N$, as the instance at which the $n$th type is introduced. In line with the rate-equations approach from [7], consider the recursion equation:

$$f_m(w_n) = \left[\frac{m - \alpha_{m-1}}{m-1}\right] f_{m-1}(w_n). \quad (1)$$

Its expansion produces the following product:

$$f_m(w_n) = \prod_{j=m_n}^{m-1} \frac{j + 1 - \alpha_j}{j}. \quad (2)$$

When $\alpha_m = \alpha$ for all $m$ (novelty is held constant), the numerator in the product produces $\Gamma$- and $\beta$-function representations:

$$f_m(w_n) = \frac{\Gamma(m+\theta)\,\Gamma(m_n)}{\Gamma(m_n+\theta)\,\Gamma(m)} = \frac{B(m_n, \theta)}{B(m, \theta)}, \quad (3)$$

where $\theta = 1 - \alpha$ denotes the exploitation probability (for convenience). In the latter, we substitute the Stirling approximation for $\beta$ functions, and arrive at an analytic frequency approximation:

$$\hat{f}(w_n) \approx \left(\frac{m_n}{m}\right)^{-\theta}. \quad (4)$$

Following subsequent work [8], for any $n \geq 1$, $\langle m_{n+1} - m_n \rangle$ is the expectation of a geometric distribution with success probability $\alpha$, i.e., $\langle m_{n+1} - m_n \rangle = \alpha^{-1}$. This separates as $\langle m_{n+1} - m_n \rangle = \langle m_{n+1} \rangle - \langle m_n \rangle$, and results in another recursion equation, $\langle m_{n+1} \rangle = \langle m_n \rangle + \alpha^{-1}$, which provides:

$$\langle m_n \rangle = \frac{\alpha + n - 1}{\alpha} = \frac{n - \theta}{\alpha}. \quad (5)$$

Approximation of $m_n$ and $m$ by $\langle m_n \rangle = (n - \theta)/\alpha$ and $\langle m_N \rangle = (N - \theta)/\alpha$ within the numerator and denominator of Eq. 4 renders the pre-existing form for the preferential selection model's analytic frequencies:

$$\hat{f}(w_n) \approx \left(\frac{n - \theta}{N - \theta}\right)^{-\theta}. \quad (6)$$
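As a concrete illustration, the constant-rate model and its analytic frequencies (Eq. 6) can be sketched in a short simulation. This is a minimal sketch; Python, the function names, and the parameter choices are our own illustrative assumptions, not part of the model's specification:

```python
import random

def simulate(M, alpha):
    """Simulate preferential selection with a constant novelty rate:
    with probability alpha the next instance is a novel type (exploration);
    otherwise its type copies that of a uniformly random previous instance,
    which implements frequency-proportional selection (exploitation)."""
    instances = [0]           # x_1 is always novel (alpha_1 = 1); types are 0, 1, 2, ...
    n_types = 1
    for _ in range(1, M):
        if random.random() < alpha:
            instances.append(n_types)                    # novel type w_{n_{m-1}+1}
            n_types += 1
        else:
            instances.append(random.choice(instances))   # proportional copy
    return instances

def analytic_freq(n, N, alpha):
    """Eq. 6: approximate frequency of the n-th-appearing type, n = 1, ..., N."""
    theta = 1.0 - alpha
    return ((n - theta) / (N - theta)) ** (-theta)
```

Sorting the simulated type counts in descending order and comparing against `analytic_freq` (scaled by the first type's count) reproduces the characteristic power-law agreement described above.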
1. Non-constant novelty rates
Most works on the subject have acknowledged that preferential selection supports non-constant novelty rates, but the topic has remained largely unexplored in the literature. Some progress was made by [5], though only focusing on specifically parameterized, power-law-attenuating variation (with instance numbers). While this is empirically reasonable, the case was notably an example of a power-law-in/power-law-out phenomenon—assuming the power law in the novelty function analytically produced a secondary power law in the resulting frequency distribution. However, the general effects of non-constant novelty rates on frequency distributions are unknown (prior to our derivation, below).

Studies on the closely-related Growing Network (GN) model (and its variants) have focused on 'attachment kernel' mechanisms [2]. But GN always selects, i.e., 'links', a new type (node) to a pre-existing type (instance); for preferential selection this would be equivalent to a constant novelty rate of $\alpha = 0.5$, since a novel node is explored and an existing node is exploited with each link (instance). While it doesn't impact the resulting frequency distribution, the network picture has extra connectivity information that does not exist for preferential selection.

Our work (below) resolves two important limitations that have prevented the direct analysis of non-constant novelty rates. These are: 1) the challenges of adapting the rate-equations analysis to arbitrary novelty rates; and 2) a lack of data collection and/or representation of empirical novelty rates. We overcome both, and are thus able to formulate arbitrary frequency distributions implied by novelty rates through preferential selection.
C. Solving for non-constant novelty
Starting from the novelty rate, $\alpha_m$, we have a quantity that generally varies with each observed instance. But when [5] explored evolving novelty rates, $\alpha_n$ was defined as a function of $n$—the number of observed types, not instances. Likewise, we will work with novelty rates that vary by types. But our modification of Eq. 2 assumes novelty varies as a step function that 'steps' with new types, as opposed to each instance. In particular, we assume that the novelty rate can be written as $\alpha_m = \alpha_{n_m}$, for all $m$. Critically, if the $m+1$st instance is not a novel type, this assumption forces $\alpha_m = \alpha_{m+1}$.

Substituting our step function into Eq. 2 produces:

$$f_m(w_n) = \prod_{k=n}^{n_m-1} \prod_{j=m_k}^{m_{k+1}-1} \frac{j + 1 - \alpha_k}{j}. \quad (7)$$

Since the $j$-indexed product has the same form as Eq. 2, i.e., $\alpha_k$ is constant with respect to $j$, we can simply substitute the $j$-indexed product with a form analogous to the right-hand side of Eq. 4:

$$f_m(w_n) = \prod_{k=n}^{n_m-1} \left(\frac{m_k}{m_{k+1}}\right)^{-(1-\alpha_k)} \quad (8)$$

$$= \frac{m_n^{-(1-\alpha_n)}}{m_{n_m}^{-(1-\alpha_{n_m})}} \prod_{k=n}^{n_m-1} m_{k+1}^{\,\alpha_{k+1}-\alpha_k}, \quad (9)$$

and simplify (at right, above).
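To make the product form concrete, Eq. 8 can be evaluated directly from a step novelty function and its birthdays. A minimal sketch, assuming 1-indexed quantities are stored in 0-indexed Python lists (the function and variable names are ours, for illustration):

```python
def general_freq(n, alphas, birthdays):
    """Eq. 8: f_m(w_n) = prod_{k=n}^{n_m - 1} (m_k / m_{k+1})^{-(1 - alpha_k)}.

    alphas[k-1] holds the step novelty rate alpha_k, and birthdays[k-1]
    holds the birthday m_k; len(birthdays) plays the role of n_m, the
    number of types observed by instance m."""
    f = 1.0
    n_m = len(birthdays)
    for k in range(n, n_m):                       # k = n, ..., n_m - 1
        ratio = birthdays[k - 1] / birthdays[k]   # m_k / m_{k+1}
        f *= ratio ** (-(1.0 - alphas[k - 1]))
    return f
```

Holding `alphas` constant at $\alpha$ and setting `birthdays` to $\langle m_k \rangle = (k-\theta)/\alpha$ (Eq. 5) telescopes the product back to Eq. 6, which is the collapse-to-constant check described in the next subsection.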
1. This solution as a generalization
Provided $\alpha_k \to \alpha_{k+1}$ (i.e., when novelty attenuates smoothly), the trailing product in Eq. 9 tends toward unity and

$$f_m(w_n) \to \frac{m_n^{-(1-\alpha_n)}}{m_{n_m}^{-(1-\alpha_{n_m})}}, \quad (10)$$

which is quite similar to that of Eq. 4. However, to move this to a frequency representation based on $n$ (the generalization of Zipf's law [9]), we'll have to produce a separate approximation for $m_n$ based on the step-function novelty rate, $\alpha_n$, which we move onto next. But notably, we observe that holding $\alpha_n$ to a constant, $\alpha$, collapses Eq. 8 into Eq. 4, as any generalization should.

Focusing again on approximations, we consider how the inverse of the novelty rate—even when varying—still provides an estimate for the number of instances elapsed between observed novel types: $\alpha_n^{-1} \approx m_{n+1} - m_n$. This may be studied again through recursion:

$$m_{n+1} \approx \alpha_n^{-1} + m_n = m_1 + \sum_{k=1}^{n} \alpha_k^{-1} = 1 + \sum_{k=1}^{n} \alpha_k^{-1}. \quad (11)$$

Rearranging this equation, and noting the harmonic mean, $\langle \alpha \rangle_n = n / \sum_{k=1}^{n} \alpha_k^{-1}$, we are thus able to express:

$$m_n \approx \frac{\langle \alpha \rangle_{n-1} + n - 1}{\langle \alpha \rangle_{n-1}}. \quad (12)$$

Note this form's resemblance to (and generalization of) the form presented in the middle of Eq. 5. This form relies only on the novelty rate (parameters, essentially) and the number of observed types ($n$), so we may utilize it by inserting it for $m_n$ into Eq. 10 to obtain our approximation for the frequency scaling resulting from an arbitrary novelty rate:

$$f_m(w_n) \propto \left(\frac{n + \langle \alpha \rangle_{n-1} - 1}{\langle \alpha \rangle_{n-1}}\right)^{-(1-\alpha_n)} \quad (13)$$

(under the condition that novelty attenuates smoothly). But Eq. 13 is not the most desirable form for empirical exploration, and is rather best used to assess familiar functional characteristics. For a more accurate functional form that doesn't depend on smooth novelty attenuation, one need only substitute Eq. 12's approximation into Eq. 8, which is exact and amenable to computation and data, as our investigation continues to explore below.

To close this section and highlight one last aspect of generalization, we consider the specific, well-known novelty function from [5].
Letting $\alpha_n$ be a function of types with a power-law attenuation occurring after a break point, $b \geq 1$, we assume the rate of novelty's attenuation is controlled (after $b$) by a negative scaling, $\mu \geq 0$:

$$\alpha_n(\alpha_1, \mu, b) = \begin{cases} \alpha_1 & n \leq b \\ \alpha_1 \left(\frac{n}{b}\right)^{-\mu} & n > b \end{cases} \quad (14)$$

Where $n \leq b$, the novelty function remains a fixed constant, $\alpha_1$, and afterwards it attenuates as a power law, in and of itself. Inserting Eq. 14 into Eq. 13, the Euler–Maclaurin formula provides $\langle \alpha \rangle_n \to \alpha_1 (1+\mu)(n/b)^{-\mu}$, which allows us to note the limiting proportionality:

$$f_m(w_n) \propto n^{-(1+\mu)} \quad \text{as } n \to \infty,$$

which is equivalent to the same form derived differently in other work [5].

II. EMPIRICAL NOVELTY RATES
Here, we derive and explore several different possible methods for producing an empirical, $N$-parameter novelty rate from data, $(\hat\alpha_n)_{n=1}^{N}$.

A. Reading order
Inspired by the language context, this cognitively naïve approach focuses on computing the gap sizes, $\Delta_n$ (in number of selection instances), observed between the selection of novel types. Their reciprocals produce a noisy representation of the novelty function: $\hat\alpha_n = 1/\Delta_n$. A diagram is presented as Eq. 15 to illustrate this:

$$\underbrace{A}_{\Delta_2}\;\underbrace{B\;A}_{\Delta_3}\;\underbrace{C\;B\;A\;B}_{\Delta_4}\;\underbrace{D\;\cdots}_{\Delta_5} \quad (15)$$

FIG. 1. Empirical novelty rates are presented for the collected works of Georg Ebers.

This empirical novelty rate offers contextualized information about a selection's presented order, but without connection to frequency accumulation. Its noise is an exhibition of context, i.e., the subtly non-random order of empirical selection, or in a more modern parlance for language [10], the attention of word choice to context. In Fig. 1, an example of the reading-order representation (and some immediate limitations to its value) for a large document can be seen. Like other documents, reading-order novelty (grey points, main axis) can be seen to transform through Eq. 8 into a rank-frequency distribution (grey dotted, lower-right inset) that is extremely out of sync with the empirical frequencies, except perhaps at low frequencies.
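The reading-order computation amounts to recording first appearances and taking reciprocals of their gaps. A minimal sketch, assuming a tokenized stream as input (names are our own, illustrative choices):

```python
def reading_order_novelty(tokens):
    """Reading-order novelty: Delta_n is the number of instances elapsed
    between the (n-1)th and nth novel types; the empirical rate for each
    novel type is hat-alpha_n = 1 / Delta_n."""
    seen = set()
    birthdays = []                 # m_n: instance numbers of first appearances
    for m, tok in enumerate(tokens, start=1):
        if tok not in seen:
            seen.add(tok)
            birthdays.append(m)
    # gap before the first type is m_1 = 1 itself (alpha_1 = 1, no false starts)
    gaps = [birthdays[0]] + [b - a for a, b in zip(birthdays, birthdays[1:])]
    return [1.0 / d for d in gaps]
```

Run on the example sequence of Eq. 15 (A, B, A, C, B, A, B, D), the gaps are 1, 1, 2, 4, giving rates 1, 1, 1/2, 1/4.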
1. Interpreting the reading order novelty
If non-constant novelty rates can produce accurate rank-frequency distributions for language, they will have to be constructed out of sync with the way an author presents a document. We do so in the next sections, but this finding in and of itself is, empirically, critical. If the reading order doesn't capture the nature of accumulation, then what is the model failing to capture? It is possible that modeling an additional scrambling (shuffling) process could overcome this limitation. This would regard reading order as some kind of a useful randomization of selected instances—perhaps an artifact of the regularizations of language we experience through syntax, grammar, and semantics. But ultimately, more work is required to extract value from this empirical representation.
B. Birthday-derived novelty representations
Both of the following representations fit tightly to frequency because their formulation is directly based on the empirical frequencies. Both, however, descend from a central assumption about birthdays. These instance numbers, $m_n$, describe the total progress the system makes before the $n$th type emerges. Critically, assuming types appear in rank-frequency order, the proportional selection property assures that

$$m_n \approx \left.\frac{m}{f(w_r)}\right|_{n=r}. \quad (16)$$

This can be understood as follows: between birthdays, the model guarantees proportional selection events—only. At the $n$th birthday, the $n$th type's frequency is 1, so it is guaranteed that the relative proportion with any other (lower-ranked) type, i.e., with $k < n$, will hold:

$$\frac{f_{m_n}(w_k)}{f_{m_n}(w_n)} \approx \frac{f_m(w_k)}{f_m(w_n)}. \quad (17)$$

So this, and the fact that $f_{m_n}(w_n) = 1$, ensure the result:

$$m_n = \sum_{k=1}^{n} f_{m_n}(w_k) \approx \sum_{k=1}^{n} \frac{f_m(w_k)}{f_m(w_n)}. \quad (18)$$
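Eq. 18 translates directly into one cumulative pass over the rank-ordered frequencies. A minimal sketch, assuming counts sorted in descending rank order (the function name is our own):

```python
def birthday_estimates(freqs):
    """Eq. 18: given empirical frequencies in rank order
    (freqs[0] = f(w_1) >= freqs[1] >= ...), estimate the birthday of the
    n-th type as m_n ~ sum_{k<=n} f(w_k) / f(w_n)."""
    birthdays, cumulative = [], 0.0
    for f in freqs:
        cumulative += f
        birthdays.append(cumulative / f)
    return birthdays
```

Note that the first estimate is always $m_1 = 1$, consistent with the model's guarantee that the first instance is the first type.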
1. Boundary pivoted
As we know from [7], the first-appearing type introduced to a preferential selection system holds a special, distributional position. Proportionally, it is the most distributionally stable type (with respect to frequency and analytic limits) that the model produces, so it forms a useful pivot. When the novelty rate is constant (Eq. 4), one can apply basic logarithmic equation solving:

$$\hat\alpha \approx 1 - \frac{\log\left(f(w_1)/f(w_N)\right)}{\log\left(m_N/m_1\right)}, \quad (19)$$

where we have used the fact that $f(w_N) = m_1 = 1$ to highlight that this approximation characterizes novelty for the range of types (all of them) that were 'born' under this (constant) rate.

As a time-evolving novelty rate increases the number of parameters (degrees of freedom), solving for these requires more information—all of the frequency distribution. But because selection is proportional, we can leverage this same formulation over the steps of our $\alpha_n$ step function. Using the fixed proportions of types born immediately adjacent to one another, we can leverage a cancellation of factors in Eq. 8 to derive:

$$\hat\alpha_n \approx 1 - \frac{\log\left(f(w_{n-1})/f(w_n)\right)}{\log\left(m_n/m_{n-1}\right)}. \quad (20)$$

This now characterizes the novelty step for the ($n$th) range of types that were 'born' under this rate. An example novelty function derived from this analytic cancellation is likewise presented in Fig. 1 as the fit in green, which at this point produces a frequency prediction (bottom right) that is strong enough to be challenging to distinguish from the empirical frequencies by eye. In the lower left, we see it exhibits roughly an order of magnitude less error per word than reading-order novelty (a vast improvement).
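The boundary-pivoted estimator of Eq. 20 can be sketched as follows, taking rank-ordered frequencies and birthday estimates as inputs (names are ours; in practice, same-frequency plateaus warrant batching, as discussed in the interpretation subsection below, which this sketch omits):

```python
import math

def boundary_pivoted_novelty(freqs, birthdays):
    """Eq. 20, defined for n > 1:
    hat-alpha_n = 1 - log(f(w_{n-1}) / f(w_n)) / log(m_n / m_{n-1})."""
    rates = []
    for i in range(1, len(freqs)):             # list index i corresponds to type n = i + 1
        num = math.log(freqs[i - 1] / freqs[i])
        den = math.log(birthdays[i] / birthdays[i - 1])
        rates.append(1.0 - num / den)
    return rates
```

As a sanity check, frequencies and birthdays generated from the constant-rate forms (Eqs. 4 and 5) should recover $\hat\alpha_n = \alpha$ for every $n$.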
2. Proportioned gaps
This representation also utilizes the 'birthday' approximations for $m_n$, but does so in a more ad hoc fashion. Using only Eq. 16, we simply note that, by definition, $\Delta_n = m_n - m_{n-1}$, with

$$\hat\alpha_n \approx (\Delta_n)^{-1} = (m_n - m_{n-1})^{-1}. \quad (21)$$

An example novelty function derived from proportioned gaps is likewise presented in Fig. 1. There, it is the 'best fit' (in pink), exhibiting roughly an order of magnitude less error than the boundary-pivoted form.
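Given birthday estimates, the proportioned-gaps estimator of Eq. 21 is then a one-line pass over consecutive values (a sketch; the function name is our own):

```python
def proportioned_gaps_novelty(birthdays):
    """Eq. 21, defined for n > 1: hat-alpha_n = 1 / (m_n - m_{n-1}),
    with birthdays[i] holding the estimate of m_{i+1}."""
    return [1.0 / (b - a) for a, b in zip(birthdays, birthdays[1:])]
```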
3. Interpreting birthday-derived novelty
Both birthday-derived formulations have a starting condition, i.e., are only defined for $n > 1$. But since the first type is guaranteed to appear as the first instance ($\alpha_1 = 1$), $\alpha_n$ only meaningfully defines the novelty rate after the first type's appearance. So the model and these representations are all forward-looking—Eq. 8 technically doesn't utilize any 'last' value $\alpha_N$ as a result of the indexing. Additionally, both approximations are subject to finite-size effects as a result of integer frequencies in data. Particularly, there are many low-frequency types that have large, ambiguous ranks. So both representations are best computed by batching same-frequency types in aggregate calculations, i.e., treating each plateau in the rank-frequency distribution as though its types appeared under a single 'step' of constant novelty. This aligns to these types' indistinguishability by frequency.

C. Evaluating novelty representations
Considering how wildly divergent the reading-order representation is from corresponding empirical frequencies, this exploration compares the two birthday-derived novelty representations. Hence, we present a preliminary empirical characterization of the circumstances in which these representations form better and worse models—it turns out, neither is generally, empirically 'best'. In particular, across the Project Gutenberg eBooks collection (over 20,000 documents, from a variety of languages and topical sources) we: 1) compute both birthday-derived representations, $\hat\alpha^{(1)}$ and $\hat\alpha^{(2)}$; 2) apply them through Eq. 8 to produce corresponding frequency representations, $\hat f^{(1)}$ and $\hat f^{(2)}$; and 3) compute the average of absolute error for each, $\varepsilon^{(1)}$ and $\varepsilon^{(2)}$, where the superscripts indicate the analytic boundary-pivoted (1) and empirically-proportioned gaps (2) representations.

FIG. 2. Comparison of the performance of birthday-derived novelty representations as a function of vocabulary size, N.

While both birthday-derived novelty representations produce models that fit tightly to frequency (see Fig. 1 insets), we note that neither (according to derivation) appears to be objectively better (across the eBooks). The result of this experiment's application can be seen in Fig. 2, which presents $\varepsilon^{(1)} - \varepsilon^{(2)}$ as a function of vocabulary size, $N$. Interestingly, the proportioned gaps commonly outperform the boundary-pivoted representation (most documents fall below the green line). But for larger documents, a second regime emerges, where the boundary-pivoted representation begins to outperform the proportioned gaps. In particular, for just some large documents (never small), the proportioned gaps fail dramatically. This interestingly identifies two clusters of documents, with one being characterized by some large-scale effect. Further investigation characterizing this variation is thus warranted.

III. FUTURE WORK
This work is the beginning of an investigation that is now being directed towards empirical work. With the analytic solution in place and a number of accurate, empirical novelty representations, we will move on to regressing low-parameter characterizations of non-constant novelty rates, such as Eq. 14. As we've studied in other work, regressing these parameters ($\mu$) can produce a potent featurization for understanding qualitative characteristics of language generators, e.g., social bots [11]. Being able to regress directly from well-representing novelty distributions will provide a large boost to performance at detection. The generality of this solution and ability for modeling arbitrary novelty-evolving context likewise allows for this model's applicability to a diversity of categorical data streams. Hence, we intend to investigate the capacity for these modeling approaches to effectively describe the selection characteristics of other social selection processes that we expect to be strongly modulated by proportional attention towards historical frequency, such as Twitter users liking tweets, authors citing papers, or journalists referencing social media users.

ACKNOWLEDGMENTS
This document is based upon work supported by the National Science Foundation under grant no.

[1] A.-L. Barabási and R. Albert, Emergence of scaling in random networks, Science 286, 509 (1999).
[2] P. L. Krapivsky and S. Redner, Organization of growing random networks, Phys. Rev. E 63, 066123 (2001).
[3] G. U. Yule, A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S., Phil. Trans. B 213, 21 (1924).
[4] H. A. Simon, On a class of skew distribution functions, Biometrika 42, 425 (1955).
[5] M. Gerlach and E. G. Altmann, Stochastic model for the vocabulary growth in natural languages, Phys. Rev. X 3, 021006 (2013).
[6] J. R. Williams, J. P. Bagrow, C. M. Danforth, and P. S. Dodds, Text mixing shapes the anatomy of rank-frequency distributions, Phys. Rev. E 91, 052811 (2015).
[7] P. S. Dodds, D. R. Dewhurst, F. F. Hazlehurst, C. M. Van Oort, L. Mitchell, A. J. Reagan, J. R. Williams, and C. M. Danforth, Simon's fundamental rich-get-richer model entails a dominant first-mover advantage, Phys. Rev. E 95, 052301 (2017).
[8] J. R. Williams and G. C. Santia, Is space a word, too?, CoRR abs/1710.07729 (2017), arXiv:1710.07729.
[9] G. K. Zipf, The Psycho-Biology of Language (Houghton-Mifflin, 1935).
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017) pp. 5998–6008.
[11] E. M. Clark, J. R. Williams, C. A. Jones, R. A. Galbraith, C. M. Danforth, and P. S. Dodds, Sifting robotic from organic text: a natural language approach for detecting automation on Twitter, Journal of Computational Science 16, 1 (2016).