How to read probability distributions as statements about process
Steven A. Frank, Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697–2525 USA
Probability distributions can be read as simple expressions of information. Each continuous probability distribution describes how information changes with magnitude. Once one learns to read a probability distribution as a measurement scale of information, opportunities arise to understand the processes that generate the commonly observed patterns. Probability expressions may be parsed into four components: the dissipation of all information, except the preservation of average values, taken over the measurement scale that relates changes in observed values to changes in information, and the transformation from the underlying scale on which information dissipates to alternative scales on which probability pattern may be expressed. Information invariances set the commonly observed measurement scales and the relations between them. In particular, a measurement scale for information is defined by its invariance to specific transformations of underlying values into measurable outputs. Essentially all common distributions can be understood within this simple framework of information invariance and measurement scale.

a) email: [email protected], homepage: http://stevefrank.org
b) Published as Frank, S. A. 2014. How to read probability distributions as statements about process. Entropy.

CONTENTS

I. Introduction
II. Overview
III. The four components of probability patterns
IV. Reading probability expressions in terms of measurement and information
V. The log-linear scale
VI. The linear-log scale
VII. Relation between linear-log and log-linear scales
VIII. Dissipation of information on alternative scales
IX. Pairs of alternative scales by integral transform
X. Alternative descriptions of generative process
XI. Reading probability distributions
XII. Relations between probability patterns
XIII. Hierarchical families of measurement scales and distributions
XIV. Why do linear and logarithmic scales dominate?
XV. Asymptotic invariance
XVI. Discussion
XVII. Appendix: scale transformation

I. INTRODUCTION
Patterns of nature often follow probability distributions. Physical processes lead to an exponential distribution of energy levels among a collection of particles. Random fluctuations about mean values generate a Gaussian distribution. In biology, the age of cancer onset tends toward a gamma distribution. Economic patterns of income typically match variants of the Pareto distributions with power law tails.

Theories in those different disciplines attempt to fit observed patterns to an underlying generative process. If a generative model predicts the observed pattern, then the fit promotes the plausibility of the model. For example, the gamma distribution for the ages of cancer onset arises from a multistage process. If cancer requires k different rate-limiting events to occur, then, by classical probability theory, the simplest model for the waiting time for the kth event to occur is a gamma distribution.

Many other aspects of cancer biology tell us that the process indeed depends on multiple events. But how much do we really learn by this inverse problem, in which we start with an observed distribution of outcomes and then try to infer underlying process? How much does an observed distribution by itself constrain the range of underlying generative processes that could have led to that observed pattern?

The main difficulty of the inverse problem has to do with the key properties of commonly observed patterns. The common patterns are almost always those that arise by a wide array of different underlying processes. We may say that a common pattern has a wide basin of attraction, in the sense that many different initial starting conditions and processes lead to that same common outcome. For example, the central limit theorem is, in essence, the statement that adding up all sorts of different independent processes often leads to a Gaussian distribution of fluctuations about the mean value.

In general, the commonly observed patterns are common because they are consistent with so many different underlying processes and initial conditions. The common patterns are therefore particularly difficult with regard to the inverse problem of going from observed distributions to inferences about underlying generative processes. But an observed pattern does provide some information about the underlying generative process, because only certain generative processes lead to the observed outcome. How can we learn to read a mathematical expression of a probability pattern as a statement about the family of underlying processes that may generate it?

II. OVERVIEW
In this article, I will explain how to read continuous probability distributions as simple statements about underlying process. I presented the technical background in an earlier article, with additional details in other publications. Here, I focus on developing the intuition that allows one to read probability distributions as simple sentences. I also emphasize key unsolved puzzles in the understanding of commonly observed probability patterns.

The third section introduces the four components of probability patterns: the dissipation of all information, except the preservation of average values, taken over the measurement scale that relates changes in observed values to changes in information, and the underlying scale on which information dissipates relative to alternative scales on which probability pattern may be expressed.

The fourth section develops an information theory perspective. A distribution can be read as a simple statement about the scaling of information with respect to the magnitude of the observations. Because measurement has a natural interpretation in terms of information, we can understand probability distributions as pure expressions of measurement scales.

The fifth section illustrates the scaling of information by the commonly observed log-linear pattern. Information in observations may change logarithmically at small magnitudes and linearly at large magnitudes. The classic gamma distribution is the pure expression of the log-linear scaling of information.

The sixth section presents the inverse linear-log scale. The Lomax and generalized Student's distributions follow that scale. Those distributions include the classic exponential and Gaussian forms in their small-magnitude linear domain, but add power law tails in their large-magnitude logarithmic domain.

The seventh section shows that the commonly observed log-linear and linear-log scales form a dual pair through the Laplace transform. That transform changes addition of random variables into multiplication, and multiplication into addition. Those arithmetic changes explain the transformation between multiplicative log scaling and additive linear scaling. In general, integral transforms describe dualities between pairs of measurement scales, clarifying the relations between commonly observed probability patterns.

The eighth section considers cases in which information dissipates on one scale, but we observe probability pattern on a different scale. The log-normal distribution is a simple example, in which observations arise as products of perturbations. In that case, information dissipates on the additive log scale, leading to a Gaussian pattern on that log scale.

The eighth section continues with the more interesting case of extreme values, in which one analyzes the largest or smallest value of a sample. For extreme values, dissipation of information happens on the scale of cumulative probabilities, but we express probability pattern on the typical scale for the relative probability at each magnitude. Once one recognizes the change in scale for extreme value distributions, those distributions can easily be read in terms of my four basic components.

The ninth section returns to dual scales connected by integral transforms. In superstatistics, one evaluates a parameter of a distribution as a random variable rather than a fixed value. Averaging over the distribution of the parameter creates a special kind of integral transform that changes the measurement scale of a distribution, altering that original distribution into another form with a different scaling relation.

The tenth section considers alternative perspectives on generative process. We may observe pattern on one scale, but the processes that generated that pattern may have arisen on a dual scale. For example, we may observe the classic gamma probability pattern of log-linear scaling, in which we measure the time per event. However, the underlying generative process may have a more natural interpretation on the inverse linear-log scaling of the Lomax distribution. That inverse scale has dimensions of events per unit time, or frequency.

The eleventh section reiterates how to read probability distributions. I then introduce the Lévy stable distributions, in which dual scales relate to each other by the Fourier integral transform. The Lévy case connects log scaling in the tails of distributions to constraints in the dual domain on the average of power law expressions. The average of power law expressions describes fractional moments, which associate with the common stretched exponential probability pattern.

The twelfth section explains the relations between different probability patterns. Because a probability pattern is a pure expression of a measurement scale, the genesis of probability patterns and the relations between them reduce to understanding the origins of measurement scales. The key is that the dissipation of information and maximization of entropy set a particular invariance structure on measurement scales. That invariance strongly influences the commonly observed scales and thus the commonly observed patterns of nature.

The twelfth section continues by showing that particular aspects of invariance lead to particular patterns. For example, shift invariance with respect to the information in underlying values and transformed measured values leads to exponential scaling of information. By contrast, affine invariance leads to linear scaling. The distinctions between broad families of probability distributions turn on this difference between shift and affine invariance for the information in observations.

The thirteenth section presents a broad classification of measurement scales and associated probability patterns. Essentially all commonly observed distributions arise within a simple hierarchically generated sequence of measurement scales. That hierarchy shows one way to consider the genesis of the common distributions and the relations between them. I present a table that illustrates how the commonly observed distributions fit within this scheme.

The fourteenth section considers the most interesting unsolved puzzle: Why do linear and logarithmic scaling dominate the base scales of the commonly observed patterns? One possibility is that linear and log scaling express absolute and relative incremental information, the two most common ways in which information may scale. Linear and log scaling also have a natural association with addition and multiplication, suggesting a connection between common arithmetic operations and common scaling relations.

The fifteenth section suggests one potential solution to the puzzle of why commonly observed measurement scales are simple. Underlying values may often be transformed by multiple processes before measurement. Each transformation may be complex, but the aggregate transformation may smooth into a simple relation between initial inputs and final measured outputs. The scaling that defines the associated probability pattern must provide invariant information with respect to underlying values or final measured outputs. If the ultimate transformation of underlying values to final measured outputs is simple, then the required invariance may often define a simple information scaling and associated probability pattern.

The Discussion summarizes key points and emphasizes the major unsolved problems.

III. THE FOUR COMPONENTS OF PROBABILITY PATTERNS
To parse probability patterns, one must distinguish four properties. In this section, I begin by briefly describing each property. I then match the properties to the mathematical forms of different probability patterns, allowing one to read probability distributions in terms of the four basic components. Later sections develop the concepts and applications.

First, dissipation of information occurs because most observable phenomena arise by aggregation over many smaller scale processes. The multiple random, small scale fluctuations often erase the information in any particular lower level process, causing the aggregate observable probability pattern to be maximally random subject to constraints that preserve information.

Second, average values tend to be the only preserved information after aggregation has dissipated all else. Jaynes developed dissipation of information and constraint by average values as the key principles of maximum entropy, a widely used approach to understanding probability patterns. I extended Jaynesian maximum entropy by the following components.

Third, average values may arise on different measurement scales. For example, in large scale fluctuations, one might only be able to obtain information about the logarithm of the underlying values. The constrained average would be the mean of the logarithmic values, or the geometric mean. The information in measurements may change with magnitude. In some cases, the scale may be linear for small fluctuations but logarithmic for large fluctuations, leading to an observed linear-log scale of observations.

Fourth, the measurement scale on which information dissipates may differ from the scale on which one observes pattern. For example, a multiplicative process causes information to dissipate on the additive logarithmic scale, but we may choose to analyze the observed multiplicative pattern. Alternatively, information may dissipate by the multiplication of the cumulative probabilities that individual fluctuations fall below some threshold, but we may choose to analyze the extreme values of aggregates on a transformed linear scale.

The measurement scaling defines the various commonly observed probability distributions. By learning to parse the scaling relations of measurement implicit in the mathematical expressions of probability patterns, one can read those expressions as simple statements about underlying process. The previously hidden familial relations between different kinds of probability distributions become apparent through their related forms of measurement scaling.
A. Dissipation of information
Most observations occur on a macroscopic scale that arises by aggregation of many small scale phenomena. Each small scale process often has a random component. The greater the number of small scale fluctuations that combine to form an aggregate, the greater the total randomness in the macroscopic system. We may think of randomness as entropy or as the loss of information. Thus, aggregation dissipates information and increases entropy.

A typical measure of entropy or randomness is

E = −∫ p_y log(p_y) dy,    (1)

in which p_y describes the probability distribution for a variable y.

Information is the negative of the entropy, and so the dissipation of information is also given by the entropy. I use a continuous form of entropy throughout this article, and focus only on the continuous probability distributions. Discrete distributions follow a similar logic, but require different expressions and details of presentation.

We can find the probability distribution consistent with maximum entropy by maximizing the expression in Eq. (1), which requires solving ∂E/∂p_y = 0. The solution is p_y = c, where c is a constant. This uniform distribution describes the pattern in which the probability of observing any value is the same for all values of y. The maximum entropy uniform distribution has the least information, because all outcomes are equally likely.

B. Constraint by average values
Suppose that we are studying the distribution of energy levels in a population of particles. We want to know the probability that any particle has a certain level of energy. The probability distribution over the population describes the probability of different levels of energy per particle.

Typically, there is a certain total amount of energy to be distributed among the particles in the population. The fixed total amount of energy constrains the average energy per particle.

To find the distribution of energy, we could reasonably assume that many different processes operate at a small scale, influencing each particle in multiple ways. Each small scale process often has a random component. In the aggregate of the entire population, those many small scale random fluctuations tend to increase the total entropy in the population, subject to the constraint that the mean is set extrinsically.

For any pattern influenced by small-scale random fluctuations, the only constraint on randomness may be a given value for the mean. If so, then pattern follows maximum entropy subject to a constraint on the mean.

Constraint on the mean
When we maximize the entropy in Eq. (1) to find the probability distribution consistent with the inevitable dissipation of information and increase in entropy, we must also account for the constraint on the average value of observable events. The technical approach to maximizing a quantity, such as entropy, subject to a constraint is the method of Lagrange multipliers. In particular, we must maximize the quantity

Λ = E − κC_0 − λC_1,    (2)

in which the constraint on the average value is written as C_1 = ∫ p_y y dy − µ. The integral term of the constraint is the average value of y over the distribution p_y, and the term, µ, is the actual average value set by constraint. The method guarantees that we find a distribution, p_y, that satisfies the constraint, in particular that the average of the distribution that we find is indeed equal to the given constraint on the average, ∫ p_y y dy = µ. We must also set the total probability to be one, expressed by the constraint C_0 = ∫ p_y dy − 1. The maximization solves ∂Λ/∂p_y = 0 for the constants κ and λ that satisfy the constraint on total probability and the constraint on average value, yielding

p_y ∝ e^{−λy},    (3)

in which λ = 1/µ, and ∝ means "is proportional to."

The total probability over a distribution must be one. If we use that constraint on total probability, we can find κ such that ψe^{−λy} would be an equality rather than a proportionality for p_y for some constant, ψ. That is easy to do, but adds additional steps and a lot of notational complexity without adding any further insight. I therefore present distributions without the adjusting constants, and write the distributions as "p_y ∝" to express the absence of the constants and the proportionality of the expression.

The expression in Eq. (3) is known as the exponential distribution, or sometimes the Gibbs or Boltzmann distribution. We can read the distribution as a simple statement. The exponential distribution is the probability pattern for a positive variable that is most random, or has least information, subject to a constraint on the mean. Put another way, the distribution contains information only about the mean, and nothing else.
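To make the maximum entropy reading concrete, the following minimal numerical sketch (assuming only numpy; the comparison distribution is my choice, not from the original article) checks that, among distributions with the same mean, the exponential has the highest entropy.

```python
import numpy as np

y = np.linspace(1e-6, 80.0, 400_000)

def entropy(p):
    p = p / np.trapz(p, y)              # normalize total probability to one
    return -np.trapz(p * np.log(p + 1e-300), y)

mu = 2.0
p_exp = np.exp(-y / mu) / mu            # exponential with mean mu

# A competitor with the same mean mu = 2: a gamma distribution with k = 2.
p_gam = y * np.exp(-2.0 * y / mu)

print(entropy(p_exp))   # ~ 1 + log(mu) = 1.693, the maximum
print(entropy(p_gam))   # ~ 1.577, lower, as maximum entropy predicts
```

Any other distribution with the same mean gives a lower entropy; the exponential form is the unique maximum under that single constraint.

Constraint on the average fluctuations from the mean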
Sometimes we are interested in fluctuations about a mean value or central location. For example, what is the distribution of errors in measurements? How do average values in samples vary around the true mean value? In these cases, we may describe the intrinsic variability by the variance. If we constrain the variance, we are constraining the average squared distance of fluctuations about the mean.

We can find the distribution that is most random subject to a constraint on the variance by using the variance as the constraint in Eq. (2). In particular, let C_1 = ∫ p_y (y − µ)² dy − σ², in which σ² is the variance and µ is the mean. This expression constrains the squared distance of fluctuations, (y − µ)², averaged over the probability distribution of fluctuations, p_y, to be the given constraint, σ².

Without loss of generality, we can set µ = 0 and interpret y as a deviation from the mean, which simplifies the constraint to be C_1 = ∫ p_y y² dy − σ². We can then write the constraint on the mean or the constraint on the variance as a single general expression

C_1 = ∫ p_y f_y dy − f̄_y,    (4)

in which f_y is y or y² for constraints on the mean or variance, respectively, and f̄_y is the extrinsically set constraint on the mean or variance, respectively. Then the maximization of entropy subject to constraint takes the general form

p_y ∝ e^{−λf_y}.    (5)

If we constrain the mean, then f_y = y and λ = 1/µ, yielding the exponential form in Eq. (3). If we constrain the variance, then f_y = y², and λ = 1/2σ², which is the Gaussian distribution.

C. The measurement scale for average values
The constraint on randomness may be transformed by the measurement scale. We may write the transformation of the observable values, f_y, as T(f_y) ≡ T_f. Here, f_y is y or y² depending on whether we are interested in the average value or in the average distance from a central location, and T is the measurement scale. Thus, the constraint in Eq. (4) can be written as

C_1 = ∫ p_y T_f dy − T̄_f,    (6)

which generalizes the solution in Eq. (5) to

p_y ∝ e^{−λT_f}.    (7)

This form provides a simple way to express many different probability distributions, by simply choosing T_f to be a constraint that matches the form of a distribution. For example, the power law distribution, p_y ∝ y^{−λ}, corresponds to the measurement scale T_f = log(y). In general, finding the measurement scale and the associated constraint that lead to a particular form for a distribution is useful, because the constraint concisely expresses the information in a probability pattern.

Simply matching probability patterns to their associated measurement scales and constraints leaves open the problem of why particular scalings and constraints arise. What sort of underlying generative processes lead to a particular scaling relation, T_f, and therefore attract to the same probability pattern? I address that crucial question in later sections. For now, it is sufficient to note that we have a simple way to connect the dissipation of information and constraint to probability patterns.
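As a quick illustration of reading T_f as a constrained average, the sketch below (numpy assumed; the inverse-CDF sampling scheme is mine, not from the original article) draws from the power law p_y ∝ y^{−λ} on [1, ∞) and checks that the constrained quantity is the mean of log(y), the geometric mean, rather than the ordinary mean.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
u = rng.random(1_000_000)
y = (1.0 - u) ** (-1.0 / (lam - 1.0))   # inverse-CDF sample of p ∝ y^(-lam)

print(y.mean())          # ordinary mean, (lam-1)/(lam-2) = 2.0
print(np.log(y).mean())  # mean of T_f = log(y), equal to 1/(lam-1) = 0.5
```

The power law preserves information only about the average of log(y), exactly as the measurement scale T_f = log(y) says.

D. The scale on which information dissipates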
In some cases, information dissipates on one scale, but we wish to express the probability pattern on another scale. Suppose that information dissipates on the scale given by x, leading to the distribution p_x. After obtaining the distribution on the scale x by applying the theory for the dissipation of information and constraint, we may wish to transform the distribution to a different scale, y. Here, I briefly mention two distinct types of transformation. Later sections illustrate the crucial role of scale transformations in understanding several important probability patterns. The Methods provides technical details.

Change of variable
The relation between x and y is given by the transformation x = g(y), where g is some function of y. For example, we may have x = log(y). In general, we can use any transformation that has meaning for a particular problem. Several important probability distributions arise by dissipation of information on scales other than the one on which we typically express probability patterns. To understand those distributions, one must recognize the scale on which information dissipates and the transformed scale used to express the probability distribution.

Define m_y = |g′(y)|, where g′ is the derivative of g with respect to y. The notation m_y emphasizes the term as the measurement scale correction when observing pattern on the scale y. Because information dissipates on the scale x, we can often find the distribution p_x easily from Eq. (7), in which T_f is a function of f_x. Applying the change in measure, m_y, we obtain

p_y ∝ m_y e^{−λT_f},    (8)

in which we replace f_x by the transformed expression f_{g(y)} in the scaling relation T_f.

The key point is that we have simply made a change of variable from x to y. The term m_y adjusts the scaling of the probability pattern for that change of variable.
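The change of variable is easy to check by simulation. In the sketch below (numpy assumed; the parameter values are illustrative), information dissipates as a Gaussian on x = log(y), and the measure correction m_y = |g′(y)| = 1/y converts that Gaussian into the log-normal density on the observed scale y.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 0.5
x = rng.normal(mu, sigma, 1_000_000)    # Gaussian on the dissipation scale
y = np.exp(x)                           # observed scale, with x = g(y) = log(y)

edges = np.linspace(0.05, 5.0, 100)
hist, _ = np.histogram(y, bins=edges, density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

# p_y = m_y * Gaussian(log y), with m_y = 1/y the measure correction
p = (1.0 / mid) * np.exp(-(np.log(mid) - mu)**2 / (2 * sigma**2)) \
    / np.sqrt(2 * np.pi * sigma**2)

print(np.max(np.abs(hist - p)))   # small: histogram matches the corrected density
```

Dropping the 1/y factor gives a visibly wrong density, which is the usual symptom of forgetting the change of measure.

Integral transform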
If we take the average of e^{−xy} over the distribution of x, we obtain a new function for each value of y, as

h*(y) = ∫ e^{−xy} p_x dx,

which may be interpreted as a Laplace or Fourier transform of the original distribution, p_x. Under some conditions, we can think of the transformed function h*(y) as a distribution that has a paired relation with the original distribution p_x. The transformation creates a pair of related measurement scales that determines the associated pair of probability distributions. We may use other transformation functions besides e^{−xy} to create various pairs of measurement scales and probability distributions.

IV. READING PROBABILITY EXPRESSIONS IN TERMS OF MEASUREMENT AND INFORMATION
In this section, I show that probability distributions can be read as simple statements about the change in information with the magnitude of the observations. The essential scaling relation T_f expresses exactly how information changes with magnitude. Because measurement has a natural interpretation in terms of information, we can also think of T_f as an expression of the measurement scale associated with a particular probability distribution.

A. Information and surprise
The key step arises from interpreting

S_y = −log(p_y)    (9)

in Eq. (1) as the translation between probability, p_y, and information, S_y. This expression is sometimes called self-information, which describes the information in an event y in terms of the probability of that event, p_y. I use the symbol S_y because Tribus interpreted this quantity as the surprise associated with the magnitude of the observation, y.

The interpretation of S_y as surprise arises from the idea that relatively rare events are more surprising. For any initial value of p_y, the surprise, −log(p_y) = log(1/p_y), increases by log(2) as p_y decreases by half. Thus, the surprise increases linearly with relative rarity.

Surprise connects to information. If we are surprised by an observation, we learn a lot; if we are not surprised, we had already predicted the outcome to be relatively likely, and we gain little information.

Note that entropy in Eq. (1) is equivalent to ∫ p_y S_y dy, which is simply the average amount of surprise over a particular probability distribution. A uniform distribution, in which all values of y are equally likely, has a maximum amount of entropy and a minimum amount of information or surprise. The low surprise occurs because, with any value of y equally likely, we can never be relatively more surprised by observing one particular value of y rather than another.
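The linear growth of surprise with relative rarity is a one-line computation; the tiny check below (numpy assumed) halves a probability repeatedly and shows that each halving adds exactly log(2) to the surprise.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.0625])
S = -np.log(p)          # surprise for each probability
print(np.diff(S))       # each halving adds log(2), about 0.693
```

B. Scaling relations express the change in information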
The expression S_y relates information to the magnitude of observations, y. We can use that relation to develop an understanding of how information changes with magnitude. The change in information with magnitude captures the essential aspect of measurement scale. This notion of information in relation to scale turns out to be the key to understanding probability patterns for continuous variables.

I begin with the general expression for probability patterns in Eq. (7), altered slightly here as

p_y = ψe^{−λT_f},    (10)

in which ψ is a constant that sets the total probability of the distribution to one. In this section, we can ignore the scale transformations and the term m_y that led to Eq. (8). Those transformations change the original probability pattern from one scale to another. That change of scale does not alter the relation between information and magnitude on the original scale that determined the form of the probability distribution.

If we take the logarithm of both sides of Eq. (10), we obtain a general expression for probability patterns in terms of information as

S_y = −log(ψ) + λT_f.    (11)

Thus, the change in information, dS_y, compares with the change in the scaling relation for measurement, dT_f, as

|dS_y| = |λ dT_f|,    (12)

in which absolute values quantify the magnitude of change. Intuitively, we may think of this expression as the increment of information gained for measuring a change in magnitude on the scale T_f. The parameter λ is the relative rate of change of information compared with measured values.

Note that we can also write

dS_y ∝ dT_f,    (13)

which means that an increment on the measurement scale is proportional to an increment of information.

C. How to read the exponential and Gaussian distributions
The exponential distribution in Eq. (3) has T_f = y and dS_y = λ dy. The parameter λ = 1/µ is the inverse of the distribution's mean value. The exponential distribution describes a constant increase in information with magnitude, associated with a constant decline in relative probability with magnitude. The rate of increase in information with magnitude is the inverse of the mean.

For the Gaussian distribution with a mean of zero, T_f = y² and dS_y = 2λy dy. The parameter λ = 1/2σ² is the inverse of twice the distribution's average squared deviation, leading to dS_y = (y/σ²) dy. The Gaussian distribution describes a linearly increasing gain (constant acceleration) in information with magnitude, associated with a linearly increasing decline (constant deceleration) in relative probability with magnitude. The rate of the linearly increasing gain in information with magnitude is the inverse of the variance.

The following sections present the way in which to read a wide variety of common distributions in terms of the scaling relations of information and measurement. Later sections consider the underlying structure and familial relations between commonly observed distributions. That underlying structure arises from the information symmetries that relate different measurement scales to each other.
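These two readings can be verified by differentiating the surprise numerically. The sketch below (numpy assumed; parameter values are illustrative) shows dS_y/dy equal to the constant λ for the exponential and to the growing slope y/σ² for the Gaussian.

```python
import numpy as np

y = np.linspace(0.1, 5.0, 1001)
lam, sigma = 2.0, 1.5

S_exp = -np.log(lam * np.exp(-lam * y))                # exponential surprise
S_gau = -np.log(np.exp(-y**2 / (2 * sigma**2))
                / np.sqrt(2 * np.pi * sigma**2))       # Gaussian surprise

print(np.gradient(S_exp, y)[::250])                    # constant, about lam = 2.0
print(np.gradient(S_gau, y)[::250], y[::250] / sigma**2)  # matches y / sigma^2
```

V. THE LOG-LINEAR SCALE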
Cancer incidence illustrates how probability patterns may express simple scaling relations. For many cancers, the probability p_y that an individual develops disease near the age y, among all those born at age zero, is approximately

p_y ∝ y^{k−1} e^{−αy},    (14)

which is the gamma probability pattern. A simple generative model that leads to a gamma pattern is the waiting time for the kth event to occur. For example, if cancer developed only after k independent rate-limiting barriers or stages have been passed, then the process of cancer progression would lead to a gamma probability pattern. That match between a generative multistage model of process and the observed gamma pattern led many people to conclude that cancer develops by a multistage process of progression. By fitting the particular incidence data to a gamma pattern and estimating the parameter k, one could potentially estimate the number of rate-limiting stages required for cancer to develop. Although this simple model does not capture the full complexity of cancer, it does provide the basis for many attempts to connect observed patterns for the age of onset to the underlying generative processes that cause cancer.

Let us now read the gamma pattern as an expression about the scaling of probability in relation to magnitude. We can then compare the general scaling relation that defines the gamma pattern to the different kinds of processes that may generate a pattern matched to the gamma distribution.

The probability expression in Eq. (14) can be divided into two terms. The first term is

y^{k−1} = e^{(k−1)log(y)},    (15)

which matches our general expression for probability patterns in Eq. (7) with T_f = log(y). This equivalence associates the power law component of the gamma distribution with a logarithmic measurement scale.

For the second term, e^{−αy}, in Eq. (14), we have T_f = y, which expresses linear scaling in y. Thus, the two terms in Eq. (14) correspond to logarithmic and linear scaling

p_y ∝ y^{k−1} [log] × e^{−αy} [linear],    (16)

which leads to an overall measurement function that has the general log-linear form T_f = log(y) − by. For the parameters in this example, b = α/(k − 1). When y is small, T_f ≈ log(y), and the logarithmic term dominates changes in the information of the probability pattern, dS_y, and the measurement scale, dT_f. By contrast, when y is large, T_f ≈ −by, and the linear term dominates. Thus, the gamma probability pattern is simply the expression of logarithmic scaling at small magnitudes and linear scaling at large magnitudes. The value of b determines the magnitudes at which the different scales dominate.

Generative processes that create log-linear scaling typically correspond to a gamma probability pattern. Consider the classic generative process for the gamma, the waiting time for the kth independent event to occur. When the process begins, none of the events has occurred. For all k events to occur in the next time interval, all must happen essentially simultaneously.

The probability of multiple independent events to occur essentially simultaneously is the product of the probabilities for each event to occur. Multiplication leads to power law expressions and logarithmic scaling. Thus, at small magnitudes, the change in information scales with the change in the logarithm of time.

By contrast, at large magnitudes, after much time has passed, either the kth event has already happened, and the waiting is already over, or k − 1 events have already occurred, and we wait only for the single remaining event, which arises at a constant rate, leading to linear scaling of information with time. In the general expression of the gamma pattern, k is a continuous parameter that influences the magnitudes at which logarithmic or linear scaling dominate.

Later, I will return to this important link between generative process and measurement scale. For now, let us continue to follow the consequences of various scaling relations.

The log-linear scale contains the purely linear and the purely logarithmic as special cases. In Eq. (14), as k → 1, the probability pattern approaches the purely exponential form, the pure expression of linear scaling; as α → 0, the probability pattern approaches the power law form, the pure expression of logarithmic scaling.
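The two scaling regimes of the gamma pattern are easy to see numerically. In the sketch below (numpy assumed; parameter values are illustrative), the surprise S_y = −log p_y of the gamma form changes by (k − 1)log(2) when y doubles at small magnitudes, and changes approximately linearly at rate α at large magnitudes.

```python
import numpy as np

k, alpha = 3.0, 1.0
S = lambda y: -((k - 1) * np.log(y) - alpha * y)  # gamma surprise, up to a constant

# Small y: logarithmic regime, doubling y changes S by about (k-1) log 2.
print(abs(S(0.02) - S(0.01)), (k - 1) * np.log(2))

# Large y: linear regime, S grows at rate alpha per unit of y.
print(S(200.0) - S(100.0), alpha * 100.0)
```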
VI. THE LINEAR-LOG SCALE
Another commonly observed pattern follows a Lomax or Pareto Type II form

p_y ∝ (1 + y/α)^{−k},    (17)

which is associated with the measurement scale T_f = log(1 + y/α). This distribution describes linear-log scaling. For small values of y relative to α, we have T_f → y/α, and the distribution becomes

p_y ∝ e^{−(k/α)y},    (18)

which is the pure expression of linear scaling. For large values of y relative to α, we have T_f → log(y/α), and the distribution becomes

p_y ∝ y^{−k},    (19)

which is the pure expression of logarithmic scaling.

In these examples, I have used f_y = y in the scaling relation T_f = log(1 + f_y/α). We can add to the forms of the linear-log scale by using f_y = (y − µ)², describing squared deviations from the mean. To simplify the notation, let µ = 0. Then Eq. (17) becomes

p_y ∝ (1 + y²/α)^{−k},    (20)

which is called the generalized Student's or q-Gaussian distribution. When the deviations from the mean are relatively small compared with α, linear scaling dominates, and the distribution is Gaussian, p_y ∝ e^{−(k/α)y²}. When deviations from the mean are relatively large compared with α, logarithmic scaling dominates, causing power law tails, p_y ∝ y^{−2k}.
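The two limits of the Lomax form can be checked directly from its surprise function. In the sketch below (numpy assumed; parameter values are illustrative), S_y grows linearly at rate k/α when y is far below α, and gains k log(2) per doubling of y when y is far above α.

```python
import numpy as np

k, alpha = 2.0, 10.0
S = lambda y: k * np.log1p(y / alpha)    # Lomax surprise, up to a constant

# Small y: linear regime, S is approximately (k/alpha) * y.
print(S(0.01), (k / alpha) * 0.01)

# Large y: logarithmic regime, doubling y adds about k log 2.
print(S(2e4) - S(1e4), k * np.log(2))
```

VII. RELATION BETWEEN LINEAR-LOG AND LOG-LINEAR SCALES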
The specific way in which these two scales relate to each other provides much insight into pattern and process.
A. Common scales and common patterns
The log-linear and linear-log scales include most of the commonly observed probability patterns. The purely linear exponential and Gaussian distributions arise as special cases. Pure linearity is perhaps rare, because very large or very small values often scale logarithmically. For example, we measure distances in our immediate surroundings on a linear scale, but typically measure very large cosmological distances on a logarithmic scale, leading to a linear-log scaling of distance.

On the linear-log scale, positive variables often follow the Lomax distribution (Eq. 17). The Lomax expresses an exponential distribution with a power law tail. Over a sufficiently wide range of magnitudes, many seemingly exponential distributions may in fact grade into a power law tail, because of the natural tendency for the information at extreme magnitudes to scale logarithmically. Alternatively, many distributions that appear to be power laws may in fact grade into an exponential shape at small magnitudes.

When studying deviations from the mean, the linear-log scale leads to the generalized Student's form. That distribution has a primarily Gaussian shape but with power law tails. The tendency for the tails to grade into a power law may again be the rule when studying pattern over a sufficiently wide range of magnitudes.

In some cases, the logarithmic scaling regime occurs at small magnitudes rather than large magnitudes. Those cases of log-linear scaling typically lead to a gamma probability pattern. Many natural observations approximately follow the gamma pattern, which includes the chi-square pattern as a special case.

B. Relations between the scales
The linear-log and log-linear scales seem to be natural inverses of each other. But what does an inverse scaling mean? We obtain some clues by noting that the mathematical relation between the scales arises from

(1 + f_y/α)^{−k}  [linear-log]  ∝  ∫ e^{−x f_y} x^{k−1} e^{−αx} [log-linear] dx.    (21)

The right side is the Laplace transform of the log-linear gamma pattern in the variable x, here interpreted for real-valued f_y. That transform inverts the scale to the linear-log form, which is the Lomax distribution for f_y = y or the generalized Student's distribution for f_y = y².

This relation between scales is easily understood with regard to mathematical operations. The Laplace transform changes the addition of random variables into the multiplication of those variables, and it changes the multiplication of random variables into the addition of those variables. Logarithmic scaling can be thought of as the expression of multiplicative processes, and linear scaling can be thought of as the expression of additive processes.

The Laplace transform, by changing multiplication into addition, transforms log scaling into linear scaling, and by changing addition into multiplication, transforms linear scaling into log scaling. Thus, log-linear scaling changes to linear-log scaling. The inverse Laplace transform works in the opposite direction, changing linear-log scaling into log-linear scaling.

The fact that the Laplace transform connects two of the most important scaling relations is interesting. But what does it mean in terms of reading and understanding common probability patterns? The following sections suggest one possibility.
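Eq. (21) is easy to verify by numerical integration. The sketch below (assuming numpy and scipy; parameter values are illustrative) evaluates the Laplace transform of the gamma form at several points and confirms that it is proportional to the Lomax form, with a constant ratio Γ(k)/α^k.

```python
import numpy as np
from scipy.integrate import quad

k, alpha = 2.5, 1.5

def laplace_of_gamma(f):
    # Right side of Eq. (21): Laplace transform of x^(k-1) e^(-alpha x).
    val, _ = quad(lambda x: np.exp(-x * f) * x**(k - 1) * np.exp(-alpha * x),
                  0.0, np.inf)
    return val

f = np.array([0.1, 1.0, 5.0])
lhs = (1.0 + f / alpha) ** (-k)                       # linear-log (Lomax) form
rhs = np.array([laplace_of_gamma(fi) for fi in f])    # transformed log-linear form
print(rhs / lhs)    # constant ratio at every f: the two sides are proportional
```

VIII. DISSIPATION OF INFORMATION ON ALTERNATIVE SCALES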
It may be that information dissipates on one scale, but we observe pattern on a different scale. For example, information may dissipate on the frequency scale of events per unit time, but we may observe pattern on the inverse scale of time per event. Before developing that interpretation of the Laplace pair of inverse scales, it is useful to consider more generally the problem of analyzing pattern on one scale when information dissipates on a different scale.
A. Scale change for data analysis
Information may dissipate on the scale, x, but we may wish to observe or to analyze the data on the transformed scale, y. For example, the observations, y, may arise by the product of positive random values. Then x = log(y) would be the sum of the logarithms of those random values. The dissipation of information by the addition of random variables often leads to a Gaussian distribution. By application of Eq. (7), we have the distribution of x = log(y) as

p_x ∝ e^{−λ(x−µ)²},

where µ is the mean of x, and 1/λ is twice the variance of x. On the x scale, the Gaussian distribution has T_f = (x − µ)².

Suppose we want the distribution on the scale of the observations, y, rather than on the logarithmic scale x = log(y) on which information dissipates. Then we must apply Eq. (8) to transform from the scale, x, to the scale of interest, y, by using g(y) = log(y), and thus m_y = |g′(y)| = y^{−1}. Then, from Eq. (8), we have the log-normal distribution

p_y ∝ y^{−1} e^{−λ(log(y)−µ)²},

which we match to Eq. (8) by noting that m_y = y^{−1} and T_f = (log(y) − µ)².

Consider another example, in which information dissipates on the log-linear scale, x. By Eq. (7), log-linear scaling leads to a gamma distribution

p_x ∝ x^{k−1} e^{−αx},

in which the log-linear scale has the form −λT(x) = (k − 1)log(x) − αx.

Suppose that we wish to analyze the data on a logarithmic scale, or that we only have access to the logarithms of the observations. Then we must analyze the distribution of y = log(x), which means that the original scale for the dissipation of information was x = g(y) = e^y. Therefore

−λT(g(y)) = (k − 1)log(e^y) − αe^y.

Because m_y = |g′(y)| = e^y, by Eq. (8), we have

p_y ∝ m_y e^{−λT(g(y))} = e^y e^{(k−1)log(e^y) − αe^y},

which simplifies to

p_y ∝ e^{ky − αe^y}.    (22)

We read this as the dissipation of information on the log-linear scale, x, and a change of variable x = e^y, in order to analyze the log transformation of the underlying distribution as y = log(x). Data, such as the distribution of biological species abundances in samples, often have an underlying log-linear structure associated with the gamma distribution. Typically, such data are log-transformed before analysis, leading to the distribution in Eq. (22), which I call the exponential-gamma distribution.

Eq. (22) has the same form as the commonly observed Gumbel distribution that arises in extreme value theory. That theory turns out to be another way in which information dissipates on one scale, but we analyze pattern on a different scale.
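The exponential-gamma form can be confirmed by log-transforming simulated gamma data. The sketch below (numpy assumed; parameter values are illustrative) compares a histogram of y = log(x), for gamma-distributed x, against the normalized density of Eq. (22).

```python
import numpy as np
from math import gamma as gamma_fn

rng = np.random.default_rng(2)
k, alpha = 3.0, 2.0
y = np.log(rng.gamma(shape=k, scale=1.0 / alpha, size=1_000_000))

edges = np.linspace(-2.0, 2.0, 41)
hist, _ = np.histogram(y, bins=edges, density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

# Eq. (22) with its normalizing constant alpha^k / Gamma(k).
p = alpha**k * np.exp(k * mid - alpha * np.exp(mid)) / gamma_fn(k)
print(np.max(np.abs(hist - p)))   # small: the log of a gamma is exponential-gamma
```

B. Extreme values: dissipation on the cumulative scale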
Many problems depend only on the largest or smallest value of a sample. Extreme values determine much of the financial risk of disasters, the probability of structural failure, and the expectation of unacceptable traffic congestion. In biology, the most advantageous beneficial mutations may set the pace and extent of adaptation.

At first glance, it may seem that the most extreme values associated with rare events would be hard to predict. Although it is true that the extreme value in any particular case cannot be guessed with certainty, it turns out that the probability distribution of extreme values often follows a very regular pattern. That regularity of extreme values arises from the same sort of strong convergence by which the central limit theorem leads to the regularity of the Gaussian probability distribution.

I describe how the extreme value distributions can be understood by the dissipation of information and scale transformation. I focus on the largest value in a sample. The same logic applies to the smallest value. I emphasize an intuitive way in which to read the extreme value distributions as expressions about process.

Many sources provide background on the extreme value distributions. In my own work, I described the technical details for a maximum entropy interpretation of extreme values, and the scale transformations that connect extreme value forms to general measurement interpretations of probability patterns.

Dissipation of information
In a sufficiently large sample, the probability of an extreme value depends only on the chance that an observation falls in the upper tail of the underlying distribution from which the observations are drawn. All other information about the underlying process dissipates. The average tail probability of the underlying distribution sets the constraint on retained information, expressed as follows.

Let x be the upper tail probability of a distribution, p_z, defined as

x = ∫_y^∞ p_z dz,    (23)

in which y is a threshold value, and x is the cumulative probability in the upper tail of the distribution p_z above the value y. Thus, x is the probability of observing a value that is greater than y. The cumulative probability, x, tells us how likely it is to observe a value greater than y, and thus how likely it is that y would be near the extreme value in a sample of observations.

On the scale, x, the dissipation of information in repeated samples causes the distribution of upper tail probabilities to take on the general form of Eq. (7), in particular

p_x ∝ e^{−λx}.

The average value of x, which is the average upper tail probability, sets the only constraint that shapes the pattern of the distribution. We can, without loss of generality, rescale x so that λ = 1, and thus p_x is proportional to e^{−x}.

Scale transformation
Scale transformation describes how to go from tail probabilities, x, to the extreme value in a sample, y. Suppose tail probabilities, which are on the scale x, are related to extreme values, which are on the scale y. The relation between x and y is given by Eq. (23). We can express that relation as x = T(y) = T_f, in which T_f is the right-hand side of Eq. (23).

We can now use our general approach to scale transformation in Eq. (8), repeated here

p_y ∝ m_y e^{−λT_f}.

In this case, m_y = |T′_f|, which is the absolute value of the derivative of x with respect to y, yielding

p_y ∝ |T′_f| e^{−λT_f}.    (24)

This expression provides the general form of probability distributions when T_f describes the measurement scale for y in terms of the cumulative distribution, or tail probabilities, for some underlying distribution.

The form of T_f arises, as always, from the information constrained by an average value. For example, if in Eq. (23) the tail probability decays exponentially such that p_z → e^{−z}, then

x = ∫_y^∞ p_z dz ≈ e^{−y}.

The average tail probability is the average of e^{−y}, and x = T_f = e^{−y}. From Eq. (24), we have

p_y ∝ e^{−y−λe^{−y}},    (25)

which is the Gumbel form of the extreme value distributions. Alternatively, if the average tail probability is the average of y^{−γ}, from the tail of an underlying distribution that decays as a power law in y, then T_f = y^{−γ}, and

p_y ∝ y^{−(γ+1)} e^{−λy^{−γ}},

which is the Fréchet form of the extreme value distributions. In summary, the extreme value distributions follow the simple maximum entropy form. The constraint is the average tail probability of an underlying distribution. We transform from the scale, x, of the cumulative distribution of tail probabilities, to the scale, y, of the extreme value in a sample.
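A quick simulation shows the Gumbel form emerging from maxima of exponential samples. In the sketch below (numpy assumed; sample sizes are illustrative), the maximum of n exponential draws follows Eq. (25) with λ equal to n, because the expected tail weight above a threshold y is n e^{−y}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000
maxima = rng.exponential(1.0, size=(reps, n)).max(axis=1)

edges = np.linspace(3.0, 11.0, 41)
hist, _ = np.histogram(maxima, bins=edges, density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

lam = n                                        # average tail weight scales with n
p = lam * np.exp(-mid - lam * np.exp(-mid))    # normalized Gumbel form of Eq. (25)
print(np.max(np.abs(hist - p)))                # small: the maxima follow Gumbel
```

IX. PAIRS OF ALTERNATIVE SCALES BY INTEGRAL TRANSFORM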
The prior section discussed paired scales, in which information dissipates on one scale, but we observe pattern on a transformed scale. In those particular cases, the dual relation between scales was obvious. For example, we may explicitly choose to study pattern by a log or exponential transformation of the observations. Or information may dissipate on the cumulative scale of tail probabilities, but we transform to the scale of observed extreme values to express probability patterns.

I now return to the linear-log and log-linear scales, which lead to the most commonly observed probability patterns. How can we understand the duality between these inverted scales? Is there a general way in which to understand the pairing between inverted measurement scales?
A. Overview
The distributions based on linear-log and log-linear scales form naturally inverted pairs connected by the Laplace transform. Eq. (21) showed that connection, repeated here

(1 + f_y/α)^{−k}  [linear-log]  ∝  ∫ e^{−x f_y} x^{k−1} e^{−αx} [log-linear] dx.    (26)

In this section, I summarize two ways in which to understand this mathematical expression. First, the pair may arise from superstatistics, in which a parameter of a distribution is considered to vary rather than to be fixed. Second, the pair provides an example of a more general way in which dual measurement scales connect to each other through integral transformation, which changes one measurement scale into another. Fourier, Laplace, and superstatistics transformations can be understood as special cases of the more general integral transforms. Those general transforms include as special cases the classic characteristic functions and moment generating functions of probability theory.

The following section considers cases in which information dissipates on one of the scales, but we observe pattern on the inverted scale. This duality provides an essential way in which to connect the scaling and constraints of process on one scale to the patterns of nature that we observe on the dual scale. Reading probability patterns in terms of underlying process may often depend on recognizing this essential duality.

B. Superstatistics
The transformation between scales in Eq. (26) can be interpreted as averaging over a varying parameter. Assume that we begin with a distribution in the variable f_y, given by φ(f_y|x). Here, x is the parameter of the distribution. Typically, we think of a parameter x as a fixed constant. Suppose, instead, that x varies according to a distribution, h(x). For example, we may think of a composite population in which f_y varies according to φ(f_y|x) in different locations, with the mean of the distribution, 1/x, varying across locations.

If we measure the composite population, we study the distribution φ(f_y|x) when averaged over the different values of x, which vary according to h(x). The composite population then follows the distribution given by

h*(f_y) = ∫ φ(f_y|x) h(x) dx.    (27)

Averaging a distribution, such as φ, over a variable parameter, is sometimes called superstatistics. When the initial distribution, φ(f_y|x), is exponential, e^{−xf_y}, then superstatistical averaging over the variable parameter x in Eq. (27) is equivalent to the Laplace transform, of which Eq. (26) is an example.
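Superstatistics is easy to demonstrate by simulation. In the sketch below (numpy assumed; parameter values are illustrative), each observation comes from an exponential distribution whose rate x is itself gamma distributed; the composite population follows a Lomax pattern. Because the exponential density carries the prefactor x, the composite density has exponent −(k + 1) rather than the −k of the proportional form in Eq. (26).

```python
import numpy as np

rng = np.random.default_rng(4)
k, alpha = 2.5, 1.5
x = rng.gamma(shape=k, scale=1.0 / alpha, size=1_000_000)  # variable parameter
f = rng.exponential(scale=1.0 / x)                         # f | x is exponential

edges = np.linspace(0.0, 5.0, 26)
hist, _ = np.histogram(f, bins=edges, density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

mass = 1.0 - (1.0 + edges[-1] / alpha) ** (-k)   # Lomax mass within the range
p = (k / alpha) * (1.0 + mid / alpha) ** (-(k + 1)) / mass
print(np.max(np.abs(hist - p)))     # small: the composite population is Lomax
```

C. Integral transforms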
We may read Eq. (27) as an integral transform, which provides a general relation between a pair of measurement scales. Thus, we may think of Eq. (27) as a general way in which to express the duality between paired measurement scales, rather than a specific superstatistics process of averaging over a variable parameter.

In this general integral transform interpretation, we start with some distribution h(x), which has a scaling relation, T(x). Integrating over the transformation kernel φ(f_y|x) creates the distribution h*(f_y), with scaling relation T*(f_y). Thus, averaging over the transformation kernel φ changes the variable from x to f_y, and changes the measurement scale from T(x) to T*(f_y).

The interpretation of such scale transformations depends on the particular transformation kernel, which creates the particular properties of the dual relation. The Laplace transform, with the exponential transformation kernel e^{−xf_y}, has many special properties that connect paired measurement scales in interesting ways.

D. Scale inversion by the Laplace transform
Suppose the log-linear scaling pattern occurs for the variable x, as in Eq. (26). That equation shows that the Laplace transformation kernel, e^{−xf_y}, transforms the log-linear scaling relation of x into the linear-log scaling relation of f_y, for real values of f_y.

The Laplace change of variable from x to f_y often inverts the dimensional units. The exponent of the transformation kernel e^{−xf_y} is usually dimensionless, which means that the dimensions of x and f_y must cancel. Thus, the units of f_y are typically the inverse of the units of x. For example, if x has units of time per event, then f_y has units of events or repetitions per time, which is a kind of frequency. The units may also be changed inversely from frequency to time.

The Laplace transform changes the way in which independent observations combine to produce aggregate pattern. On one scale, the distribution of the sum (convolution) of independent observations from an underlying distribution transforms to multiplication of the distributions on the other scale. Inversely, the distribution of multiplied observations on one scale transforms to addition of variables on the other scale. This duality between addition and multiplication on inverted scales corresponds to the duality between linear and logarithmic measurement on the paired scales.

X. ALTERNATIVE DESCRIPTIONS OF GENERATIVE PROCESS
We often wish to associate an observed probability pattern with the underlying generative process. The generative process may dissipate information directly on the measurement scale associated with the observed probability pattern. Or, the generative process may dissipate information on a different scale, but we observe the pattern on a transformed scale.

Consider, as an example, the Laplace duality between the linear-log and log-linear scales in Eq. (26). Suppose that we observe the gamma pattern of log-linear scaling. We wish to associate that observed gamma pattern to the underlying generative process.

The generative process may directly create a log-linear scaling pattern. The classic example concerns waiting time for the kth independent event. For small times, the k events must happen nearly simultaneously. As noted earlier, the probability of multiple independent events to occur essentially simultaneously is the product of the probabilities for each event to occur. Multiplication leads to power law expressions and logarithmic scaling. Thus, at small magnitudes, the change in information scales with the change in the logarithm of time.

By contrast, at large magnitudes, after much time has passed, either the kth event has already happened, and the waiting is already over, or k − 1 events have occurred, and we wait only for the last event, leading to linear scaling of information with time.

Alternatively, the generative process may arise on the inverted scale of events per unit time, a frequency, with linear-log scaling and a Lomax pattern for f_y = y. If we observe the outcome of that process in terms of the inverted units of time per event, those inverted dimensions lead to log-linear scaling and a gamma pattern, or to a gamma pattern with a Gaussian tail if we measure squared deviations.

Is it meaningful to say that the generative process and dissipation of information arise on a linear-log scale of events per unit time, but we observe the pattern on the log-linear scale of time per event? That remains an open question.

On the one hand, the scaling relations and dissipation of information contain exactly the same information whether on the linear-log or log-linear scales. That equivalence suggests a single underlying generative process that may be thought of in alternative ways. In this case, we may consider constraints on average frequency or, equivalently, constraints on average time. More generally, constraints on either of a dual pair of scales with inverted dimensions would be equivalent.

On the other hand, the meaning of constraint by average value may make sense only on one of the scales. For example, it may be meaningful to consider only the average waiting time for an event to occur. That distinction suggests that we consider the underlying generative process strictly in terms of the log-linear scale. However, if our observations of pattern are confined to the inverse frequency scale, then the observed linear-log scaling would only be a reflection of the true underlying process on the dual log-linear scale.

All paired scales through integral transformation pose the same issues of duality and interpretation with regard to the connection between generative process and observed pattern.

XI. READING PROBABILITY DISTRIBUTIONS
In this section, I recap the four components of probability patterns. A clear sense of those four components allows one to read the mathematical expressions of probability distributions as sentences about underlying process.

The four components are: the dissipation of all information; except the preservation of average values; taken over the measurement scale that relates changes in observed values to changes in information; and the transformation from the underlying scale on which information dissipates to alternative scales on which probability pattern may be expressed.

Common probability patterns arise from those four components, described in Eq. (8) by

p_y ∝ m_y e^{−λT_f}.    (28)

I show how to read probability distributions in terms of the four components and this general expression. To illustrate the approach, I parse several commonly observed probability patterns. This section mostly repeats earlier results, but does so in an alternative way to emphasize the simplicity of form in common probability expressions.

A. Linear scale
The exponential and Gaussian are perhaps the most common of all distributions. They have the form

p_y ∝ e^{−λf_y}.    (29)

The exponential case, f_y = y, corresponds to the preservation of the average value, ȳ. The Gaussian case, f_y = (y − µ)², preserves the average squared distance from the mean, which is the variance. For convenience, I often set µ = 0 and write f_y = y² for the squared distance. The exponential and Gaussian express the dissipation of information and preservation of average values on a linear scale. We use either the average value itself or the average squared distance from the mean.
B. Combinations of linear and log scales

Purely linear scaling is likely to be rare over a sufficiently wide range of magnitudes. For example, one naturally plots geographic distances on a linear scale, but very large cosmological distances on a logarithmic scale. On a geographic scale, an increment of an additional meter in distance can be measured directly anywhere on earth. The equivalent measurement information obtained at any geographic distance leads to a linear scale. By contrast, the information that we can obtain about meter-scale increments tends to decrease with cosmological distance. The declining measurement information obtained at increasing cosmological distance leads to a logarithmic scale.

The measurement scaling of distances and other quantities may often grade from linear at small magnitudes to logarithmic at large magnitudes. The linear-log scale is given by T_f = log(1 + f_y/α). Using that measurement scale in Eq. (28), with m_y = 1 and λ = k, we obtain

p_y ∝ (1 + f_y/α)^{−k}.

When f_y is small relative to α, we get the standard exponential form of linear scaling in Eq. (29), which corresponds to the exponential or Gaussian pattern. The tail of the distribution, with f_y greater than α, is a power law in proportion to f_y^{−k}. An exponential pattern with a power law tail is the Lomax or Pareto type II distribution. A Gaussian with a power law tail is the generalized Student's distribution.

If one measures observations over a sufficiently wide range of magnitudes, many apparently exponential or Gaussian distributions will likely turn out to have the power law tails of the Lomax or generalized Student's forms. Similarly, observed power law patterns may often turn out to be exponential or Gaussian at small magnitudes, also leading to the Lomax or generalized Student's forms.

Other processes lead to the inverse log-linear scale, which changes logarithmically at small magnitudes and linearly at large magnitudes. The log-linear scale is given by T_f = log(f_y) − b f_y, in which b determines the transition between log scaling at small magnitudes and linear scaling at large magnitudes. Using that measurement scale in Eq. (28) with m_y = 1 and f_y = y, and adjusting the parameters to match earlier notation, we obtain the gamma distribution

p_y ∝ y^{k−1} e^{−αy},

which is a power law with logarithmic scaling for small magnitudes and an exponential with linear scaling for large magnitudes. The gamma distribution includes as a special case the widely used chi-square distribution. Thus, the chi-square pattern is a particular instance of log-linear scaling.

If we use the log-linear scale for squared deviations from zero, f_y = y², then we obtain

p_y ∝ y^{k−1} e^{−αy²},

which is a gamma pattern with a Gaussian tail, expressing log-linear scaling with respect to squared deviations. For k = 2, this is the well-known Rayleigh distribution.

In some cases, information scales logarithmically at both small and large magnitudes, with linearity dominating at intermediate magnitudes. In a log-linear-log scale, precision at the extremes may depend more strongly on magnitude, or there may be a saturating tendency of process at extremes that causes relative scaling of information with magnitude. Relative scaling corresponds to logarithmic measures.

Commonly observed log-linear-log patterns often lead to the beta family of distributions.
For example, we can modify the basic linear-log scale, T_f = log(1 + y/α), by adding a logarithmic component at small magnitudes, yielding the scale T_f = b log(y) − log(1 + y/α), for b = γ/k, which leads to a variant of the beta-prime distribution

p_y ∝ y^{γ}(1 + y/α)^{−k}.

This distribution is the basic linear-log pattern, (1 + y/α)^{−k}, with an additional log scale power law component, y^{γ}, that dominates at small magnitudes. Other forms of log-linear-log scaling often lead to variants from the beta family.
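The grading of the linear-log form between its two limits is easy to verify directly. The sketch below, with illustrative values of α and k, checks that (1 + y/α)^{−k} tracks an exponential when y ≪ α and a pure power law when y ≫ α.

```python
# Sketch: the linear-log form (1 + y/α)^{-k} is nearly exponential for
# y << α and a power law for y >> α. α and k are illustrative values.
import numpy as np

alpha, k = 10.0, 3.0
y = np.array([0.1, 1.0, 100.0, 1000.0])

lomax = (1.0 + y / alpha) ** (-k)
expo  = np.exp(-k * y / alpha)        # small-y limit: log(1+x) ≈ x
power = (y / alpha) ** (-k)           # large-y limit: pure power law

print(lomax / expo)    # ≈ 1 for y << α
print(lomax / power)   # → 1 as y >> α
```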
C. Direct change of scale

In many cases, process dissipates information and preserves average values on one scale, but we observe or analyze data on a different scale. When the scale change arises by simple substitution of one variable for another, the form of the probability distribution is easy to read if one directly recognizes the scale of change. Here, I repeat my earlier discussion of the way in which one reads the commonly observed log-normal distribution. Other direct scale changes follow this same approach.

If process causes information to dissipate on a scale x, preserving only the average squared distance from the mean (the variance), then x tends to follow the Gaussian pattern

p_x ∝ e^{−λ(x−µ)²},

in which the mean of x is µ, and the variance is 1/(2λ). If the scale, x, on which information dissipates is logarithmic, but we observe or analyze data on a linear scale, y, then x = log(y). The value of m_y in Eq. (8) is the change in x with respect to y, yielding d log(y)/dy = y^{−1}. Thus, the distribution on the y scale is

p_y ∝ y^{−1} e^{−λ(log(y)−µ)²},

which is simply the Gaussian pattern for log(y), corrected by m_y = y^{−1} to account for the fact that dissipation of information and constraint of average value are happening on the logarithmic scale, log(y), but we are analyzing pattern on the linear scale of y. Other direct changes of scale can be read in this way.
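A minimal Monte Carlo sketch of this change of scale, with illustrative values of µ and σ: a Gaussian on the log scale, observed on the linear scale, matches the density y^{−1} e^{−λ(log y − µ)²} with the correction factor m_y = 1/y.

```python
# Sketch: information dissipates on the log scale (Gaussian in x = log y),
# but we observe y on the linear scale. µ and σ are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu, sigma = 0.5, 0.3
lam = 1.0 / (2.0 * sigma**2)                 # λ = 1/(2σ²), as in the text

y = np.exp(rng.normal(mu, sigma, 100_000))   # Gaussian on the log scale

# Density on the y scale: m_y e^{-λ(log y - µ)²} with m_y = 1/y.
yy = 2.0
p_manual = (1.0 / yy) * np.sqrt(lam / np.pi) * np.exp(-lam * (np.log(yy) - mu)**2)
print(p_manual, stats.lognorm(s=sigma, scale=np.exp(mu)).pdf(yy))  # equal
print(((1.9 < y) & (y < 2.1)).mean() / 0.2)                        # ≈ same
```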
D. Extreme values and exponential scaling

Extreme values arise from the probability of observing a magnitude beyond some threshold. Probabilities beyond a threshold depend on the cumulative probability of all values beyond the cutoff. For an initially linear scale with f_x = x, cumulative tail probabilities typically follow the generic form e^{−λx} or, simplifying by using λ = 1, the exponential form e^{−x}. The cumulative tail probabilities above a threshold, y, define the scaling relation between x and y, as

x = ∫_y^∞ e^{−z} dz = e^{−y}.

Thus, extreme values that depend on tail probabilities tend to define an exponential scaling, x = e^{−y} = T_f. Because we have changed the scale from the cumulative probabilities, x, to the probability of some threshold, y, that determines the extreme value observed, we must account for that change of scale by m_y = |T′_f| = e^{−y}, where the prime is the derivative with respect to y. Using Eq. (8) for the generic method of direct change in scale, and using the form of m_y here for the change from the cumulative scale of tail probabilities to the direct scaling of threshold values, we obtain the general form of the extreme value distributions as

p_y ∝ |T′_f| e^{−λT_f}.

In this simple case, T_f = e^{−y}, thus

p_y ∝ e^{−y−λe^{−y}},

a form of the Gumbel extreme value distribution. Note that this form is just a direct change from linear to exponential scaling, x = e^{−y}.

Alternatively, we can obtain the same Gumbel form by any process that leads to exponential-linear scaling of the form λT(y) = y + λe^{−y}, in which the exponential term dominates for small values and the linear term dominates for large values. That scaling leads directly to the distribution

p_y ∝ e^{−λe^{−y}} e^{−y}.

The probability of a small value being the largest extreme value decreases exponentially in y, leading to the double exponential term e^{−λe^{−y}} dominating the probability. By contrast, the probability of observing large extreme values decreases linearly in y, leading to the exponential term e^{−y} dominating the probability.
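As a quick sanity check on the Gumbel form, the sketch below simulates maxima of exponential samples; after centering by log(n), the classical shift, the empirical distribution matches the Gumbel cumulative form exp(−e^{−y}). The sample sizes are arbitrary illustrative choices.

```python
# Sketch: maxima of exponential samples follow the Gumbel pattern after
# centering by log(n). Sample sizes are illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 20_000

maxima = rng.exponential(size=(reps, n)).max(axis=1)
y = maxima - np.log(n)           # center by log(n), the classical shift

# Gumbel: P(Y <= y) = exp(-e^{-y}); compare CDF at a few points.
for q in (-1.0, 0.0, 1.0, 2.0):
    print((y <= q).mean(), np.exp(-np.exp(-q)))
```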
Eq. (21) showed the connection between linear-log andlog-linear scales through the Laplace integral transform.The Laplace transform can often be thought of as in-verting the dimensional units. For example, we maychange from the time per event for a gamma distribu-tion with log-linear scaling to the number of events perunit time (frequency) according to a Lomax distributionwith linear-log scaling. Or we may start with a gammadistribution of frequencies and transform to a Lomax dis-tribution of time per event. The units do not have to bein terms of time and frequency. Any pair of inverteddimensions relates to each other in the same way.That connection between different scales helps to readprobability distributions in relation to underlying pro-cess. For example, an observation of frequencies dis-tributed according to the linear-log Lomax pattern maysuggest dissipation of information and constraint of av-erage values in the dual log-linear measurement domain.Scale inversion by the Laplace transform also has theinteresting property of switching between addition and5multiplication in the two domains. For example, multi-plicative aggregation of processes and a logarithmic pat-tern at small magnitudes on the scale of time per eventtransform to additive aggregation and a linear patternat small magnitudes on the frequency scale of events perunit time.This arithmetic duality of measurement scales clarifiesthe meaning of probability distributions with respect tounderlying generative mechanisms. It would be inter-esting to study pairs of scales connected by the generalintegral transform (Eq. 27) with respect to the interpre-tation of aggregation and pattern in dual domains.
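The gamma–Lomax pairing under the Laplace transform can be verified numerically. The sketch below, with illustrative k and α, checks the identity ∫₀^∞ e^{−xy} x^{k−1} e^{−αx} dx = Γ(k)(α + y)^{−k}: a gamma form in x transforms to a Lomax (Pareto type II) form in the dual variable y.

```python
# Sketch: the Laplace transform of a gamma-form density in x yields a
# Lomax (Pareto II) form in the dual variable y. k, α are illustrative.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma as Gamma

k, alpha = 3.0, 2.0

def laplace_of_gamma(y):
    val, _ = quad(lambda x: np.exp(-x * y) * x**(k - 1) * np.exp(-alpha * x),
                  0.0, np.inf)
    return val

for y in (0.0, 1.0, 5.0):
    print(laplace_of_gamma(y), Gamma(k) * (alpha + y) ** (-k))  # equal
```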
F. Lévy stable distributions
Another important family of common distributions arises by a similar scaling duality

(y² + ϕ²)^{−1} ∝ ∫ e^{−ixy} e^{−ϕ|x|} dx.   (30)

Consider each part in relation to the Laplace pair in Eq. (21). The left side is the Cauchy distribution, a special case of the linear-log generalized Student's distribution with k = 1 and α = ϕ². On the right, e^{−ϕ|x|} is a symmetric exponential distribution, because e^{−ϕx} is the classic exponential distribution for x > 0, and e^{ϕx} for x < 0 is its mirror image reflected about the x = 0 axis. The two distributions together form a new distribution over all positive and negative values of x.

Each positive and negative part of the symmetric exponential, by itself, expresses linearity in x. However, the sharp switch in direction and the break in smoothness at x = 0 induces a quasi-logarithmic scaling at small magnitudes, which corresponds to the linearity at small magnitudes in the transformed domain of the Cauchy distribution.

In this case, the integral transform is Fourier rather than Laplace, using the transformation kernel e^{−ixy} over all positive and negative values of x. For our purposes, we can consider the consequences of the Laplace and Fourier transforms as similar with regard to inverting the dimensions and scaling relations between a pair of measurement scales.

The Cauchy distribution is a particularly important probability pattern. In one simple generative model, the Cauchy arises by the same sort of summing up of random perturbations and dissipation of information that leads to the Gaussian distribution by the central limit theorem. The Cauchy differs from the Gaussian because the underlying random perturbations follow logarithmic scaling at large magnitudes.

Log scaling at large magnitudes causes power law tails, in which the distributions of the underlying random perturbations tend to have the form 1/|x|^{1+γ} at large magnitudes of x. When the tail of a distribution has that form, then the total probability in the tail above magnitudes of |x| is approximately 1/|x|^{γ}. The Cauchy is the particular distribution with γ = 1. Thus, one way to generate a Cauchy is to sum up random perturbations and constrain the average total probability in the tail to be 1/|x|.

Note that the constraint on the average tail probability of 1/|x| for the Cauchy distribution on the left side of Eq. (30) corresponds, in the dual domain on the right side of that equation, to e^{−ϕ|x|}, in which the measurement scale is T_f = |x|. The average of the scaling T_f corresponds to the preserved average constraint after the dissipation of information. In this case, the dual domain preserves only the average of |x|. Thus the dual scaling domains preserve the average of |x| in the symmetric exponential domain and the average total tail probability of 1/|x| in the dual Cauchy domain.

We can express a more general duality that includes the Cauchy as a special case by

p_y ∝ ∫ e^{−ixy} e^{−ϕ|x|^{γ}} dx.   (31)

The only difference from Eq. (30) is that in the symmetric exponential, I have written |x|^{γ}. The parameter γ creates a power law scaling T_f = |x|^{γ}, which corresponds to a distribution that is sometimes called a stretched exponential.

The distribution in the dual domain, p_y, is a form of the Lévy stable distribution. That distribution does not have a mathematical expression that can be written explicitly. The Lévy stable distribution, p_y, can be generated by dissipating all information by summation of random perturbations while constraining the average of the total tail probability to be 1/|x|^{γ} for γ < 2. For γ = 1, we obtain the Cauchy distribution. When γ = 2, the distributions in both domains become Gaussian, which is the only case in which domains paired by Laplace or Fourier transform inversion have the same distribution.

Note that the paired scales in Eq. (31) match a constraint on the average of |x|^{γ} with an inverse constraint on the average tail probability, 1/|x|^{γ}. Here, γ is not necessarily an integer, so the average of |x|^{γ} can be thought of as a fractional moment in the stretched exponential domain that pairs with the power law tail in the inverse Lévy domain.
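The Fourier pairing in Eq. (30) is also easy to confirm numerically: the transform of the symmetric exponential e^{−ϕ|x|} is 2ϕ/(ϕ² + y²), the Cauchy form. The value of ϕ below is an arbitrary illustrative choice.

```python
# Sketch: the Fourier transform of the symmetric exponential e^{-ϕ|x|}
# is proportional to the Cauchy form 1/(ϕ² + y²). ϕ is illustrative.
import numpy as np
from scipy.integrate import quad

phi = 1.5

def fourier_pair(y):
    # By symmetry: ∫ e^{-ixy} e^{-ϕ|x|} dx = 2 ∫_0^∞ cos(xy) e^{-ϕx} dx.
    val, _ = quad(lambda x: np.exp(-phi * x), 0.0, np.inf,
                  weight='cos', wvar=y)
    return 2.0 * val

for y in (0.5, 1.0, 3.0):
    print(fourier_pair(y), 2.0 * phi / (phi**2 + y**2))  # equal
```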
XII RELATIONS BETWEEN PROBABILITY PATTERNS

I have shown how to read probability distributions as statements about the dissipation of information, the constraint on average values, and the scaling relations of information and measurement. Essentially all common distributions have the form given in Eq. (8) as

p_y ∝ m_y e^{−λT_f}.   (32)

Dissipation of information and constraint on average values set the e^{−λf_y} form. Scaling measures transform the observables, f_y, to T_f ≡ T(f_y). The term m_y accounts for changes between dissipation of information on one scale and measurement of final pattern on a different scale.

The scaling measures, T_f, determine the differences between probability patterns. In this section, I discuss the scaling measures in more detail. What defines a scaling relation? Why are certain common scaling measures widely observed? How are the different scaling measures connected to each other to form families of related probability distributions?

A. Invariance and common scales
The form of the maximum entropy distributions influences the commonly observed scales and associated probability distributions. In particular, we obtain the same distribution in Eq. (32) for either the measurement function T_f or the affine transformed measurement function T_f ↦ a + bT_f. An affine transformation shifts the variable by the constant a and multiplies it by the constant b. The shift by a changes the constant of proportionality

e^{−λ(a+T_f)} = ξe^{−λT_f},

in which ξ = e^{−λa}. In maximum entropy, the final proportionality constant always adjusts to satisfy the constraint that the total probability is one (Eq. 2). Thus, the final adjustment of total probability erases any prior multiplication of the distribution by a constant. A shift transformation of T_f does not change the associated probability pattern.

Multiplication by b also has no effect on probability pattern, because

e^{−λbT_f} = e^{−λ̂T_f}

for λ̂ = bλ. In maximum entropy, the final value of the constant multiplier for T_f always adjusts so that the average value of T_f satisfies an extrinsic constraint, as given in Eq. (6).

Thus, maximum entropy distributions are invariant to affine transformations of the measurement scale. That affine invariance shapes the form of the common measurement scales. In particular, consider transformations of the observables, G(f_y), such that

T[G(f_y)] = a + bT(f_y).   (33)

Any scale, T, that satisfies this relation causes the transformed scale T[G(f_y)] to yield the same maximum entropy probability distribution as the original scale T_f ≡ T(f_y).

For example, suppose our only information about a probability distribution is that its form is invariant to a transformation of the observable values f_y by a process that changes f_y to G(f_y). Then it must be that the scaling relation of the measurement function T_f satisfies the invariance in Eq. (33). By evaluating how that invariance sets a constraint on T_f, we can find the form of the probability distribution.

The classic example concerns the invariance of logarithmic scaling to power law transformation. Let T(y) = log(y) and G(y) = cy^{γ}. Then by Eq. (33), we have

log(cy^{γ}) = log(c) + γ log(y),   (34)

which demonstrates that logarithmic scaling is affine invariant to power law transformations of the form cy^{γ}, in which affine invariance means that the scaling relation T and the associated transformation G satisfy Eq. (33).
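The affine invariance of the measurement scale can be checked directly: replacing T by a + bT leaves the normalized maximum entropy density unchanged once λ is rescaled to λ/b, because the shift is absorbed by normalization and the multiplier by the constraint. All parameter values in the sketch are illustrative.

```python
# Sketch: T ↦ a + bT leaves the normalized maxent density unchanged
# after rescaling λ ↦ λ/b. Scale and parameter values are illustrative.
import numpy as np
from scipy.integrate import quad

def maxent(y, T, lam):
    Z, _ = quad(lambda u: np.exp(-lam * T(u)), 0.0, np.inf)
    return np.exp(-lam * T(y)) / Z

T = lambda y: np.log(1.0 + y)          # a linear-log scale, for example
a, b, lam = 5.0, 2.0, 3.0
T_affine = lambda y: a + b * T(y)

print(maxent(1.0, T, lam), maxent(1.0, T_affine, lam / b))  # equal
```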
B. Affine invariance of measurement scaling

Put another way, a scaling relation, T, is defined by the transformations, G, that leave unchanged the information in the observables with respect to probability patterns. In maximum entropy distributions, unchanged means affine invariance. This affine invariance of measurement scaling in probability distributions is so important that I like to write the key expression in Eq. (33) in a more compact and memorable form

T ∼ T ∘ G.   (35)

Here, the circle means composition of functions, such that T ∘ G ≡ T[G(f_y)], and the symbol "∼" for similarity means equivalent with respect to affine transformation. Thus, the right side of Eq. (33) is similar to T with respect to affine transformation, and the left side of Eq. (33) is equivalent to T ∘ G. Reversing sides of Eq. (33) and using "∼" for affine similarity leads to Eq. (35).

Note, from Eq. (11) and Eq. (33), that S_y ≡ T ∘ G, showing that the information in a probability distribution, S_y, is invariant to affine transformation of T. Thus, we can also write

T ∼ T ∘ G ∼ S_y,

which emphasizes the fundamental role of invariant information in defining the measurement scaling, T, and the associated form of probability patterns.
C. Base scales and notation

Earlier, I defined f_y = f(y) as an arbitrary function of the variable of interest, y. I have used either y or y² or (y − µ)² for f_y to match the classical maximum entropy interpretation of average values constraining either the mean or the variance.

To express other changes in the underlying variable, y, I introduced the measurement functions or scaling relations, T_f ≡ T(f_y). In this section, I use an expanded notation to reveal the structure of the invariances that set the forms of scaling relations and probability distributions. In particular, let w ≡ w(f_y) be a function of f_y. Then, for example, we can write an exponential scaling relation as T(f_y) = e^{βw}. We may choose a base scale, w, such as a linear base scale, w(f_y) = f_y, or a logarithmic base scale, w(f_y) = log(f_y), or a linear-log base scale, w(f_y) = log(1 + f_y/α), or any other base scale. Typically, simple combinations of linear and log scaling suffice. Why such simple combinations suffice is an essential unanswered question, which I discuss later.

Previously, I have referred to f_y as the observable, in which we are interested in the distribution of y but only collect statistics on the function f_y. Now, we will consider w ≡ w(f_y) as the observable. We may, for example, be limited to collecting data on w = log(f_y) or on measurement functions T(f_y) that can be expressed as functions of the base scale w. We can always revert to the simpler case in which w ≡ f_y or w ≡ y.

In the following sections, the expanded notation reveals how affine invariance sets the structure of scaling relations and probability patterns.
D. Two distinct affine relations

All maximum entropy distributions satisfy the affine relation in Eq. (33), expressed compactly in Eq. (35). In that general affine relation, any measurement function, T, could arise, associated with its dual transformation, G, to which T is affine invariant. That general affine relation does not set any constraints on which measurement functions T may occur, although the general affine relation may favor certain scaling relations to be relatively common.

By contrast with the general affine form T ∼ T ∘ G, for any T and its associated G, we may consider how specific forms of G determine the scaling, T. Put another way, if we require that a probability pattern be invariant to transformations of the observables by a particular G, what does that tell us about the form of the associated scaling relation, T, and the consequent probability pattern?

Here we must be careful about potential confusion. It turns out that an affine form of G is itself important, in which, for example, G(w) = δ + θw. That specific affine choice for G is distinct from the general affine form of Eq. (35). With that in mind, the following sections explore the consequences of an affine transformation, G, or a shift transformation, which is a special case of an affine transformation.

E. Shift invariance and generalized exponential measurement scales
Suppose we know only that the information in probability patterns does not change when the observables undergo shift transformation, such that G(w) = δ + w. In other words, the form of the measurement scale, T, must be affine invariant to adding a constant to the base values, w. A shift transformation is a special case of an affine transformation G(w) = δ + θw, in which the affine transform becomes strictly a shift transformation for the restricted case of θ = 1.

The exponential scale

T_f = e^{βw}   (36)

maintains the affine invariance in Eq. (33) to a shift transformation, G. If we apply a shift transformation to the observables, w ↦ δ + w, then the exponential scale becomes e^{β(δ+w)}, which is equivalent to b e^{βw} for b = e^{βδ}. We can ignore the constant multiplier, b; thus, the exponential scale is shift invariant with respect to Eq. (33).

Using the shift invariant exponential form for T_f, the maximum entropy distributions in Eq. (32) become

p_y ∝ m_y e^{−λe^{βw}}.   (37)

This exponential scaling has a simple interpretation. Consider the example in which w is a linear measure of time, y, and β is a rate of exponential growth (or decay). Then the measurement scale, T_f, transforms each underlying time value, y, into a final observable value after exponential growth, e^{βy}. The random time values, y, become random values of final magnitudes, such as random population sizes after exponential growth for a random time period. In general, exponential growth or decay is shift invariant, because it expresses a constant rate of change independently of the starting point.

If the only information we have about a scaling relation is that the associated probability pattern is shift invariant to transformation of observables, then exponential scaling provides a likely measurement function, and the probability distribution may often take the form of Eq. (37).

The Gumbel extreme value distribution in Eq. (25) follows exponential scaling. In that case, the underlying observations, y, are transformed into cumulative exponential tail probabilities that, in aggregate, determine the probability that an observation is the extreme value of a sample. The exponential tail probabilities are shift invariant, in the sense that a shifted observation, δ + y, also yields an exponential tail probability. The magnitude of the cumulative tail probability changes with a shift, but the exponential form does not change.
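The shift invariance of the exponential scale can be checked numerically: shifting w by δ multiplies T = e^{βw} by e^{βδ}, and rescaling λ by e^{−βδ} restores exactly the same density. All parameter values in the sketch are illustrative.

```python
# Sketch: with T = e^{βw}, a shift w ↦ δ + w multiplies T by e^{βδ};
# rescaling λ ↦ λ e^{-βδ} restores the same density. Values illustrative.
import numpy as np
from scipy.integrate import quad

beta, lam, delta = 1.0, 2.0, 0.7

def p(y, lam, shift=0.0):
    T = lambda u: np.exp(beta * (shift + u))
    Z, _ = quad(lambda u: np.exp(-lam * T(u)), 0.0, np.inf)
    return np.exp(-lam * T(y)) / Z

print(p(1.0, lam), p(1.0, lam * np.exp(-beta * delta), shift=delta))  # equal
```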
F. Affine duality and linear scaling

Suppose probability patterns do not change when observables undergo an affine transformation G(w) = δ + θw. Affine transformation of observables allows a broader range of changes than does shift transformation. The broader the range of allowable transformations of observables, G, the fewer the measurement functions, T, that will satisfy the affine invariance in Eq. (33). Thus affine transformation of observables leads to a narrower range of compatible measurement functions than does shift transformation.

When G is affine with θ ≠ 1, then the associated measurement function T_f must itself be affine. Because T_f is invariant to shift and multiplication, we can say that invariance to affine G means that T_f = w, and thus the maximum entropy probability distribution in Eq. (32) becomes linear in the base measurement scale, w, as

p_y ∝ m_y e^{−λw}.   (38)

This form follows when the probability pattern is invariant to affine transformation of the observables, w. By contrast, invariance to a shift transformation of the observables leads to the broader class of distributions in Eq. (37), of which Eq. (38) is a special case for the more restrictive condition of invariance to affine transformation of observables.

To understand the relation between affine and shift transformations of observables, G, it is useful to write the expression for the measurement function in Eq. (36) more generally as

T_f = (1/β)(e^{βw} − 1),   (39)

noting that we can make any affine transformation of a measurement function, T_f ↦ a + bT_f, without changing the associated probability distribution. With this new measurement function for shift invariance, as β → 0, we have T_f → w, and we recover the measurement function associated with affine G.

Suppose, for example, that we interpret β as a rate of exponential change in the underlying observable, w, before the final measurement. Then, as β → 0,
the underlying observable and the final measurement become equivalent, T_f → w, because

T_f = lim_{β→0} (1/β)(e^{βw} − 1) = w.
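A one-line numerical check of this limit, with an arbitrary value of w:

```python
# Sketch: the measurement function (e^{βw} - 1)/β approaches w as β → 0.
import numpy as np

w = 2.5
for beta in (1.0, 0.1, 0.001):
    print((np.exp(beta * w) - 1.0) / beta)   # → w = 2.5
```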
G. Exponential and Gaussian distributions arise from affine invariance

Suppose we know only that the information in probability patterns does not change when the observables undergo affine transformation, w ↦ δ + θw. The invariance of probability pattern to affine transformation of observables leads to distributions of the form in Eq. (38). Thus, if the observable is the underlying value, w ≡ y, then the probability distribution is exponential,

p_y ∝ e^{−λy},

and if the observable is y², the squared distance of the underlying value from its mean, then the probability distribution is Gaussian,

p_y ∝ e^{−λy²}.

By contrast, if the probability pattern is invariant to a shift of the observables, but not to an affine transformation of the observables, then the distribution falls into the broader class based on exponential measurement functions in Eq. (37).
XIII HIERARCHICAL FAMILIES OF MEASUREMENT SCALES AND DISTRIBUTIONS
The general form for probability distributions in Eq. (37), repeated here,

p_y ∝ m_y e^{−λe^{βw}},

arises from a base measurement scale, w, and shift invariance of the probability pattern to changes w ↦ δ + w. Each base scale, w, defines a family of related probability distributions, including the linear form

p_y ∝ m_y e^{−λw}

as a special case when the probability pattern is invariant to affine changes w ↦ δ + θw, which corresponds to β → 0. Different choices for the base scale, w, create a variety of distinct measurement scales and families of distributions. Ultimately, we must consider how the base scales arise. However, it is useful first to study the commonly observed base scales. The relations between these common base scales form a hierarchical pattern of measurement scales and probability distributions.

A. A recursive hierarchy for the base scale
The base scales associated with common distributions typically arise as combinations of linear and logarithmic scaling. For example, the linear-log scale can be defined by log(c + x). This scale changes linearly in x when x is much smaller than c and logarithmically in x when x is much larger than c. As c → 0,
the scale becomes almost purely logarithmic, and for large c, the scale becomes almost purely linear.

We can generate a recursive hierarchy of linear-log scale deformations by

w^{(i)} = log(c_i + w^{(i−1)}).   (40)

The hierarchy begins with w^{(0)} = f_y, in which f_y denotes our underlying observable. Recursive expansion of the hierarchy yields: a linear scale, w^{(0)} = f_y; a linear-log deformation, w^{(1)} = log(c₁ + f_y); a linear-log deformation of the linear-log scale, w^{(2)} = log(c₂ + log(c₁ + f_y)); and so on. A log deformation of a log scale arises as a special case, leading to a double log scale.

Other scales, such as the log-linear scale, can be expanded in a similarly recursive manner. We may also consider log-linear-log scales and linear-log-linear scales. We can abbreviate a scale, w, by its recursive deformation and by its level in a recursive hierarchy. For example,

LinLog^{(2)} = log(c₂ + log(c₁ + f_y))   (41)

is the second recursive expansion of a linear-log deformation. The initial value for any recursive hierarchy, with a superscript of i = 0, associates with the base observable w^{(0)} = f_y, which I will also write as "Linear," because the base observable is always a linear expression of the underlying observable, f_y.

TABLE I. Some common probability distributions.*

Distribution             p_y                                      w                Notes and alternative names
Gumbel                   e^{βy−λe^{βy}}                           Linear           m_y = T′
Gibbs/Exponential        e^{−λy}                                  Linear           β → 0
Gauss/Normal             e^{−λy²}                                 Linear           β → 0; f_y = y²
Rayleigh                 y e^{−λy²}                               Linear           β → 0; f_y = y²; m_y = T′
Log-Normal               y^{−1} e^{−λ(log y)²}                    Linear           β → 0; f_y = y²; y → log y; m_y = y^{−1}
Stretched exponential    e^{−λy^{β}}                              Log^{(1)}        Gauss with β = 2
Fréchet/Weibull          y^{β−1} e^{−λy^{β}}                      Log^{(1)}        m_y = T′; Rayleigh with β = 2
Symmetric Lévy           e^{−λ|y|^{β}} (Fourier domain)           Log^{(1)}        f_y = |y|; β ≤ 2; Gauss (β = 2), Cauchy (β = 1); Eq. (31)
Pareto type I            y^{−λ}                                   Log^{(1)}        β → 0; m_y = 1 or m_y = T′
Log-Fréchet              y^{−1}(log y)^{β−1} e^{−λ(log y)^{β}}    Log^{(2)}        m_y = T′; also from Fréchet: y → log y, m_y = y^{−1}
??                       e^{−λ(log y)^{β}}                        Log^{(2)}        Also from stretched exponential with f_y = log y
Log-Pareto type I        y^{−1}(log y)^{−λ}                       Log^{(2)}        β → 0; m_y = T′; also from Pareto I: y → log y, m_y = y^{−1}
??                       (log y)^{−λ}                             Log^{(2)}        β → 0; also from Pareto I with f_y = log y
Pareto type II/Lomax     (c + y)^{−λ}                             LinLog^{(1)}     β → 0
Generalized Student's    (c + y²)^{−λ}                            LinLog^{(1)}     β → 0; f_y = y²; Pearson VII, Kappa; Cauchy for λ = 1
??                       (log(c + y))^{−λ}                        LinLog^{(2)}     β → 0; Log-Pareto type I for c = 0
Gamma                    y^{−λ} e^{−cλy}                          LogLin^{(1)}     β → 0; Pearson type III, includes chi-square
Gamma-Gauss              y^{−λ} e^{−cλy²}                         LogLin^{(1)}     β → 0; f_y = y²; m_y = 1 or m_y = T′; Rayleigh for λ = −1
Generalized gamma        y^{−γ(λ−1)−1} e^{−cλy^{γ}}               LogLin^{(1)}     β → 0; y → y^{γ}; m_y = y^{γ−1}; Chi for γ = 2 and cλ = 1/2
Beta                     (c₂ − y)^{−λ}(y − c₁)^{−bλ}              LogLinLog^{(1)}  β → 0; Pearson type I; c₁ ≤ y ≤ c₂
Beta prime/F             y^{−bλ}(1 + y)^{(b+1)λ−2}                LogLinLog^{(1)}  β → 0; y → y/(1 + y); m_y = (1 + y)^{−2}; y > 0; Pearson VI
Gamma variant            (c + y)^{−bλ} e^{−cλy}                   LinLogLin^{(1)}  β → 0; y > 0

* Assumptions: the base form for p_y is always m_y e^{−λe^{βw}}, in which T_f = e^{βw}, as given in Eq. (37). The w column describes the base scale, expressed as combinations of Lin (Linear) and Log scaling, with the superscript denoting the number of recursions as in Eq. (40). For example, Log^{(1)} implies that w(f_y) = log(f_y), and LinLog^{(1)} implies w(f_y) = log(c + f_y). Purely linear scaling is shown as "Linear," which implies w ≡ f_y. Recursive expansion of a linear scale remains linear, so no superscript is given for linear scales. Unless otherwise noted, f_y = y, shift invariance only is assumed for T with respect to G with β ≠ 0, and m_y = 1. When β → 0, the measurement scale follows the limiting linear form of Eq. (39). m_y = T′ abbreviates the proper change of scale, m_y = |T′_f|, in which information dissipates on the cumulative distribution scale. Change of variable is shown as y → g(y), which often leads to a change of scale, m_y = g′(y). Direct values y, possibly corrected by displacement from a central location, y − µ, are shown here as y without correction. Squared deviations (y − µ)² from a central location are shown here as y². Listings of distributions can be found in various texts. Many additional forms can be generated by varying the measurement function. In the first column, the question marks denote a distribution for which I did not find a commonly used name. Modified from Table 5 of Frank and Smith. See that article for additional details.

B. Examples of common probability distributions
Table I shows that commonly observed probability distributions arise from combinations of linear and logarithmic scaling. For example, the simple linear-log scale expresses linear scaling at small magnitudes and logarithmic scaling at large magnitudes. The distributions that associate with linear-log scaling include very common patterns.

For direct observables, f_y = y, the linear-log scale includes the purely linear exponential distribution as a limiting case, the purely logarithmic power law (Pareto type I) distribution as a limiting case, and the Lomax (Pareto type II) distribution that is exponential at small magnitudes and has a power law tail at large magnitudes.

For observables that measure the squared distance of fluctuations from a central location, f_y = (y − µ)², or y² for simplicity, the linear-log scale includes the purely linear Gaussian (normal) distribution as a limiting case, and the generalized Student's distribution that is a Gaussian linear pattern for small deviations from the central location and grades into a logarithmic power law pattern in the tails at large deviations.

Most of the commonly observed distributions arise from other simple combinations of linear and logarithmic scaling. To mention just two further examples among the many described in Table I, the log-linear scale leads to the gamma distribution, and the log-linear-log scale leads to the commonly observed beta distribution.
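To make the recipe of Table I concrete, the sketch below builds one entry from the base form: choosing the base scale w = log(c + y) in the β → 0 limit, so that p_y ∝ e^{−λw}, recovers the Lomax form (c + y)^{−λ}. The values of c and λ are illustrative.

```python
# Sketch: a Table I entry from the base form p_y ∝ m_y e^{-λ e^{βw}}.
# In the β → 0 limit this reduces to e^{-λw}; with w = log(c + y)
# (LinLog base scale) it gives the Lomax form. c, λ are illustrative.
import numpy as np
from scipy.integrate import quad

c, lam = 1.0, 3.0
w = lambda y: np.log(c + y)

def p(y):
    Z, _ = quad(lambda u: np.exp(-lam * w(u)), 0.0, np.inf)
    return np.exp(-lam * w(y)) / Z

yy = 2.0
print(p(yy), (lam - 1) * c**(lam - 1) * (c + yy) ** (-lam))  # both Lomax
```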
XIV WHY DO LINEAR AND LOGARITHMIC SCALES DOMINATE?

Processes in the natural world often cause highly nonlinear transformations of inputs into outputs. Why do those complex nonlinear transformations typically lead in the aggregate to simple combinations of linear and logarithmic base scales? Several possibilities exist. I mention a few in this section. However, I do not know of any general answer to this essential question. A clear answer would greatly enhance our understanding of the commonly observed patterns in nature.

A. Absolute versus relative incremental information
The scaling of information often changes between linear and logarithmic as magnitude changes. At some magnitudes, a fixed measurement increment provides about the same (linear) information over a varying range, whereas at other magnitudes, a fixed measurement provides less (logarithmic) information as values increase.

Consider the example of measuring distance. Start with a ruler that is about the length of your hand. With that ruler, you can measure the size of all the visible objects in your office. That scaling of objects in your office with the length of the ruler means that those objects have a natural linear scaling in relation to your ruler.

Now consider the distances from your office to various galaxies. If the distance is sufficiently great, your ruler is of no use, because you cannot distinguish whether a particular galaxy moves farther away by one ruler unit. Instead, for two distant galaxies, you can measure the ratio of distances from your office to each galaxy. You might, for example, find that one galaxy is twice as far as another, or, in general, that a galaxy is some percentage farther away than another. Percentage changes define a ratio scale of measure, which has natural units in logarithmic measure. For example, a doubling of distance always adds log(2) to the logarithm of the distance, no matter what the initial distance.

Measurement naturally grades from linear at local magnitudes to logarithmic at distant magnitudes when compared to some local reference scale. The transition between linear and logarithmic varies between problems, depending partly on measurement technology. Measures from some phenomena remain primarily in the linear domain, such as measures of height and weight in humans. Measures for other phenomena remain primarily in the logarithmic domain, such as large cosmological distances. Other phenomena scale between the linear and logarithmic domains, such as fluctuations in the price of financial assets or the distribution of income and wealth.

Consider the opposite direction of scaling, from local magnitude to very small magnitude. Your hand-length ruler is of no value for small magnitudes, because it cannot distinguish between a distance that is a fraction 10^{−3} of the ruler and a distance that is 2 × 10^{−3} of the ruler. At small distances, one needs a standard unit of measure that is the same order of magnitude as the distinctions to be made. A ruler of length 10^{−3} distinguishes between 10^{−3} and 2 × 10^{−3}, but does not distinguish between 10^{−6} and 2 × 10^{−6}. At small magnitudes, ratios can potentially be distinguished, causing the unit of informative measure to change with scale. Thus, small magnitudes naturally have a logarithmic scaling.

As we change from very small to intermediate to very large, the measurement scaling naturally grades from logarithmic to linear and then again to logarithmic, a log-linear-log scaling. The locus of linearity and the meaning of very small and very large differ between problems, but the overall pattern of the scaling relations remains the same.

B. Common arithmetic operations lead to common scaling relations
Perhaps linear and logarithmic scaling reflect aggregation by addition or multiplication of fluctuations. Adding fluctuations often tends in the limit to a smooth linear scaling relation. Multiplying fluctuations often tends in the limit to a smooth logarithmic scaling relation.

Consider the basic log-linear scale that leads to the gamma distribution. A simple generative model for the gamma distribution arises from the waiting time for the kth event to occur. At time zero, no events have occurred.

At small magnitudes of time, the occurrence of all k events requires essentially simultaneous occurrence of all of those events. Nearly simultaneous occurrence happens roughly in proportion to the product of the probability of any single event occurring in a small time interval. Multiplication associates with logarithmic scaling.

At large magnitudes of time, either all k events have occurred, or in most cases k − 1 events have occurred and one waits only for the last event, which follows the linear scaling of the classic exponential pattern. Thus, the waiting time for the kth of k events naturally follows a log-linear pattern.

Any process that requires simultaneity at extreme magnitudes leads to logarithmic scaling at those limits. Thus, a log-linear-log scale may be a very common underlying pattern. Special cases include log-linear, linear-log, purely log, and purely linear. For those variant patterns, the actual extreme tails may be logarithmic, although difficulty observing the extreme tail pattern may lead to many cases in which a linear tail is a good approximation over the range of observable magnitudes.

Other aspects of aggregation and limiting processes may also lead to the simple and commonly observed scaling relations. For example, fractal theory provides much insight into logarithmic scaling relations. However, I do not know of any single approach that matches the simplicity of the commonly observed combinations of linear and logarithmic scaling patterns to a single, simple underlying theory.

The invariances associated with simple scaling patterns may provide some clues. As noted earlier, shift invariance associates with exponential scaling, and affine invariance associates with linear scaling. It is easy to show that power law invariance associates with logarithmic scaling. For example, in the measurement scale invariance expression given in Eq. (33), the invariance holds for a log scale, T(y) = log(y), in relation to power law transformations of the observables, G(y) = cy^{γ}, as shown in Eq. (34).

We may equivalently say that a scaling relation satisfies power law invariance or that a scaling relation is logarithmic. Noting the invariance does not explain why the scaling relation and the associated invariance are common, but it does provide an alternative and potentially useful way in which to study the problem of commonness.

XV ASYMPTOTIC INVARIANCE
The measurement functions, T, that define maximum entropy distributions satisfy the affine invariance given in Eq. (35), repeated here

T ∼ T ∘ G.   (42)

One can think of G as an input-output function that transforms observations in a way that does not change information with respect to probability pattern.

Most of the commonly observed probability patterns have a simple form, associated with a simple measurement function composed of linear, logarithmic, and exponential components. I have emphasized the open problem of why the measurement functions, T, tend to be confined to those simple forms. That simplicity of measurement implies an associated simplicity for the form of G under which information remains invariant. If we can figure out why G tends to be simple, then perhaps we may understand the simplicity of T.

A. Multiple transformations of observations
At the microscopic scale, observations may tend to get transformed or filtered through a variety of complex processes represented by variable and complex forms of G. Then, for a simple measurement function, T, the fundamental affine invariance would not hold:

T ≁ T ∘ G.   (43)

However, the great lesson of statistical mechanics and maximum entropy is that, for complex underlying processes, aggregation often smooths ultimate pattern into a simple form. Perhaps multiple filtering of observations through input-output functions G would, in the aggregate, lead to a simple overall form for the transformation of initial observations into the actual values observed.

We can study how multiple applications of input-output transformations may influence the measurement function, T. Note that in the basic invariance of Eq. (42), application of G does not change the information in observations. Thus, we can apply G multiple times and still maintain invariant information. If we write G^n or G^s for n or s applications of input-output processing for n, s = 1, 2, 3, …, then we can write the more general expression for the fundamental measurement and information invariance as

T ∘ G^n ∼ T ∘ G^s.   (44)

B. Invariance in the limit
Suppose that, for a simple measurement function, T, and a complex input-output process, G, the basic invariance does not hold (Eq. 43). However, it may be that multiple rounds of processing by G ultimately lead to a relatively simple transformation of the initial inputs to the final outputs. In other words, G may be complex, but for sufficiently large n, the form of G^n may be simple. This aggregate simplicity may lead in the limit to asymptotic invariance

T ∘ G^n → T ∘ G^∞   (45)

as n becomes sufficiently large. It is not necessary for every G to be identical. Instead, each G may be a sample from a pool of alternative transformations. Each individual transformation may be complicated. But in the aggregate, the overall relation between the initial inputs and final outputs may smooth asymptotically into a simple form, such as a power law. If so, then the associated measurement scale smooths asymptotically into a simple logarithmic relation.

Other aggregates of input-output processing may smooth into affine or shift transformations, which associate with linear or exponential scales. When different invariances hold at different magnitudes of the initial inputs, then the measurement scale will change with magnitude. For example, a log-linear scale may reflect asymptotic power law and affine invariances at small and large magnitudes.

XVI DISCUSSION
Aggregation smooths underlying complexity into simple patterns. The common probability patterns arise by the dissipation of information in aggregates. Each additional random perturbation increases entropy until the distribution of observations takes on the maximum entropy form. That form has lost all information except the constraints on simple average values.

For each particular probability distribution, the constraint on average value arises on a characteristic measurement scale. That scaling relation, T, defines the form of the maximum entropy probability distributions

p_y ∝ m_y e^{−λT_f},

as initially presented in Eq. (8), for which T ≡ T_f. Here, m_y accounts for cases in which information dissipates on one scale, but we measure probability pattern on a different scale.

The common probability distributions tend to have simple forms for T that follow linear, logarithmic, or exponential scaling at different magnitudes. The way in which those three fundamental scalings grade into each other as magnitude changes sets the overall scaling relation.

A scaling relation defines the associated maximum entropy distribution. Thus, reading a probability distribution as a statement about process reduces to reading the embedded scaling relation, and trying to understand the processes that cause such scaling. Similarly, understanding the familial relations between probability patterns reduces to understanding the familial relations between different measurement scales.

The greatest open puzzle concerns why a small number of simple measurement scales dominate the commonly observed patterns of nature. I suggested that the solution may follow from the basic invariance that defines a measurement scale. Eq. (35) presented that invariance as

T ∼ T ∘ G.

The measurement scale, T, is affine invariant to transformation of the observations by G. In other words, the information in measurements with regard to probability pattern does not change if we use the directly measured observations or we measure the observations after transformation by G, when analyzed on the scale T.

In many cases, the small scale processes, G, that transform underlying values may have complex forms. If so, then the associated scaling relation, T, might also be complex, leaving open the puzzle of why observable forms of T tend to be simple. I suggested that underlying values may often be transformed by multiple processes before ultimate measurement. Those aggregate transformations may smooth into a simple form with regard to the relation between initial inputs and final measurable outputs. If we express a sequence of n transformations as G^n, then the asymptotic invariance of the aggregate processing may be simple in the sense that

T ∘ G^n → T ∘ G^∞   (46)

as given by Eq. (45). Here, the measurement scaling, T, and the aggregate input-output processing, G^n, are relatively simple and consistent with commonly observed patterns.

The puzzle concerns how aggregate input-output processing smooths into simple forms. In particular, how does a combination of transformations lead in the aggregate to a simple asymptotic invariance?

The scaling pattern for any aggregate input-output relation may have simple asymptotic properties. The application to probability patterns arises when we embed a simple asymptotic scaling relation into the maximum entropy process of dissipating information.
The dissipation of information in maximum entropy occurs as measurements are made on the aggregation of individual outputs.

Two particularly simple forms of invariance by T to input-output processing by G^n may be important. If G^n is a shift transformation, w ↦ δ + w, for some base scaling, w, then the associated measurement scale has the form T_f = e^{βw}. This exponential scaling corresponds to the fact that exponential growth or decay is shift invariant. With exponential scaling, the general maximum entropy form is

p_y ∝ m_y e^{−λe^{βw}}.

The extreme value distributions and other common distributions derive from that double exponential form. The particular distribution depends on the base scaling, w, as illustrated in Table I.

Shift transformation is a special case of the broader class of affine transformations, w ↦ δ + θw. If G^n causes affine changes, then the broader class of input-output relations leads to a narrower range of potential measurement scales that preserve invariance. In particular, an affine measurement scale is the only scale that preserves information about probability pattern in relation to affine transformations. For maximum entropy probability distributions, we may write T_f = w for the measurement scale that preserves invariance to affine G^n, leading to the simpler form for probability distributions

p_y ∝ m_y e^{−λw},

which includes most of the very common probability distributions. Thus, the distinction between asymptotic shift and affine changes of initial base scales before potential measurement may influence the general form of probability patterns.

In summary, the common patterns of nature follow a few generic forms. Those forms arise by the dissipation of information and the scaling relations of measurement. The measurement scales arise from the particular way in which the information in a probability pattern is invariant to transformation. Information invariance apparently limits the common measurement scales to simple combinations of linear, logarithmic, and exponential components. Common probability distributions express how those component scales grade into one another as magnitude changes.

XVII APPENDIX: SCALE TRANSFORMATION
In some cases, information dissipates on one scale, but we wish to express the probability pattern on another scale. For example, a process may lead to a final measured value that is the product of a series of underlying processes. The product of multiple values is equal to the sum of the logarithms of those values. So we may consider how information dissipates as the logarithm of each individual component is added to the total. The theory for the dissipation of information has a particularly simple interpretation as the sum of independent random processes.

The sum of random processes often converges to a Gaussian distribution, preserving information only about the average squared distance of fluctuations around the mean. Thus, we obtain a simple expression for the dissipation of information when we transform the final measured values, which arise by multiplication, to the additive logarithmic scale. After finding the shape of the distribution on the log transformed scale, it makes sense to transform the distribution of values back to the original scale of the measurements. In the case of a Gaussian distribution on the altered scale, the transformation back to the original scale leads to the pattern known as the log-normal distribution.

The transformations associated with the log-normal distribution are well known. Because the Gaussian distribution is a standard component of simple maximum entropy approaches, the log-normal also falls within that scope. But the Gaussian and log-normal transformation pair are sometimes considered to be a special case. Here, I emphasize that one must understand the structure of the transformation argument more generally. Information often dissipates on one scale, but we may wish to express probability patterns on another scale.

Once one recognizes the more general structure for the dissipation of information, many previously puzzling patterns fall naturally within the scope of a simple theory of probability patterns. In the main text, I discuss important examples, particularly the extreme value distributions that play a central role in many applications of risk analysis. In this appendix, I give the general form by which one can express the different scales for the dissipation of information and for measurement. That general form provides the key to reading the mathematical expressions of probability patterns as simple statements about process.

For continuous variables, probability expressions describe the chance that an observation is close to a value y. The chance of observing a value exactly equal to y must be close to zero, because there are essentially an infinite number of possible values that y can take on. So we describe probability in terms of the chance that an observation falls into a small interval between y and y + dy, where dy is a small increment. We write the probability of falling into a small increment near y as p_y |dy|.

We are interested in understanding the distribution on the scale y. But suppose that information dissipates on a different scale given by x, leading to the distribution p_x |dx|. After obtaining the distribution on the scale x by applying the theory for the dissipation of information and constraint, we often wish to transform the distribution to the original scale y. The relation between x and y is given by the transformation x = g(y), where g is some function of y. For example, we may have x = log(y).
In general, we can use any transformation that has meaning for a particular problem.

By standard calculus, we can write dx = g′(y) dy, where g′ is the derivative of g with respect to y. Define m_y = |g′(y)|, which gives a notation m_y that emphasizes the term as the translation between the measurement scales for x and y. Thus |dx| = m_y |dy|. Because x = g(y), we can also write p_x = p_{g(y)}, and so p_x |dx| = m_y p_{g(y)} |dy| = p_y |dy|, or

p_y = m_y p_x = m_y p_{g(y)}.   (47)

Because information dissipates on the scale x, we can often find the distribution p_x relatively easily. From Eq. (5), the form of that distribution is

p_x ∝ e^{−λf_x}.

Applying the change in measure in Eq. (47), we obtain

p_y ∝ m_y e^{−λf_{g(y)}}.   (48)

To illustrate, consider the log-normal example, in which x = g(y) = log(y) and m_y = y^{−1}. On the logarithmic scale, x, the distribution is Gaussian,

p_x ∝ e^{−λ(x−µ)²},

in which λ = 1/(2σ²). From Eq. (47), we obtain the distribution on the original scale, y, as

p_y ∝ y^{−1} e^{−λ(log(y)−µ)²},

which is the log-normal distribution. The relation between the Gaussian and the log-normal is widely known. But the general principle of studying the dissipation of information on one scale and then transforming to another scale is more general. That relation is an essential step in reading probability expressions in terms of process and in unifying the commonly observed distributions into a single general framework.

ACKNOWLEDGMENTS
National Science Foundation grant DEB–1251035 supports my research.

S. A. Frank, Dynamics of Cancer: Incidence, Inheritance, and Evolution (Princeton University Press, Princeton, NJ, 2007).
E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge University Press, New York, 2003).
S. A. Frank, "The common patterns of nature," Journal of Evolutionary Biology 22, 1563–1585 (2009).
S. A. Frank and E. Smith, "A simple derivation and classification of common probability distributions based on information symmetry and measurement scale," Journal of Evolutionary Biology 24, 469–484 (2011).
S. A. Frank, "Measurement scale in maximum entropy models of species abundance," Journal of Evolutionary Biology 24, 485–496 (2011).
S. A. Frank and E. Smith, "Measurement invariance, entropy, and probability," Entropy 12, 289–303 (2010).
E. T. Jaynes, "Information theory and statistical mechanics," Phys. Rev. 106, 620–630 (1957).
E. T. Jaynes, "Information theory and statistical mechanics. II," Phys. Rev. 108, 171–190 (1957).
J. W. Gibbs, Elementary Principles in Statistical Mechanics (Scribner, New York, 1902).
T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley, New York, 1991).
M. Tribus, Thermostatics and Thermodynamics: An Introduction to Energy, Information and States of Matter, with Engineering Applications (Van Nostrand, New York, 1961).
C. Tsallis, Introduction to Nonextensive Statistical Mechanics (Springer, New York, 2009).
R. N. Bracewell, The Fourier Transform and its Applications, 3rd ed. (McGraw Hill, Boston, 2000).
P. Embrechts, C. Kluppelberg, and T. Mikosch, Modelling Extremal Events: For Insurance and Finance (Springer-Verlag, Heidelberg, 1997).
S. Kotz and S. Nadarajah, Extreme Value Distributions: Theory and Applications (World Scientific, Singapore, 2000).
S. Coles, An Introduction to Statistical Modeling of Extreme Values (Springer, New York, 2001).
E. J. Gumbel, Statistics of Extremes (Dover Publications, New York, 2004).
S. A. Frank, "Generative models versus underlying symmetries to explain biological pattern," Journal of Evolutionary Biology 27, 1172–1178 (2014).
C. Beck and E. Cohen, "Superstatistics," Physica A: Statistical Mechanics and its Applications 322, 267–275 (2003).
S. A. Frank, "Input-output relations in biological systems: measurement, information and the Hill equation," Biology Direct 8, 31 (2013).
D. J. Hand, Measurement Theory and Practice (Arnold, London, 2004).
N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, 2nd ed., Vol. 1 (Wiley, New York, 1994).
N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, 2nd ed., Vol. 2 (Wiley, New York, 1995).
C. Kleiber and S. Kotz, Statistical Size Distributions in Economics and Actuarial Sciences (Wiley, New York, 2003).
F. Aparicio and J. Estrada, "Empirical distributions of stock returns: European securities markets, 1990–95," The European Journal of Finance 7, 1–21 (2001).
A. A. Dragulescu and V. M. Yakovenko, "Exponential and power-law probability distributions of wealth and income in the United Kingdom and the United States," Physica A 299, 213–221 (2001).
B. B. Mandelbrot, The Fractal Geometry of Nature (W. H. Freeman, New York, 1983).
D. Sornette,