Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications
THE TECHNICAL INCERTO COLLECTION

STATISTICAL CONSEQUENCES OF FAT TAILS
Real World Preasymptotics, Epistemology, and Applications
Papers and Commentary

NASSIM NICHOLAS TALEB

This format is based on André Miede's ClassicThesis, with adaptation from Lorenzo Pantieri's Ars Classica. With immense gratitude to André and Lorenzo.
STEM Academic Press operates under an academic journal-style board and publishes books containing peer-reviewed material in the mathematical and quantitative sciences. Authors must make electronic versions freely available to the general public.

Scribe Media helped organize the publication process; special thanks to Tucker Max, Ellie Cole, Zach Obront and Erica Hoffman.
Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications (The Technical Incerto Collection)
Keywords: Mathematical Statistics / Risk Analysis / Probability Theory
© Nassim Nicholas Taleb. All rights reserved.

COAUTHORS
Pasquale Cirillo
Raphael Douady
Andrea Fontanari
Hélyette Geman
Donald Geman
Espen Haug
The Universa Investments team

Available under Creative Commons License
[Figure: Genealogy of the Incerto project, with links to the various research traditions. A two-page diagram spanning mathematics, philosophy, social science, and legal theory, mapping: the skeptical empirical tradition in philosophy (Pyrrho of Elis and the "applied" Pyrrhonians: Menodotus of Nicomedia, Sextus Empiricus, Aenesidemus of Knossos, Antiochus of Laodicea, Herodotus of Tarsus; Carneades; Algazel and the Algazelist school, e.g. Nicolas d'Autrecourt; Montaigne and the essay method; the modern problem of induction from Simon Foucher, Bayle, Huet, Hume, Mill, Russell, Ayer, Goodman; negative empiricism via Brochard, Favier, Popper); the mathematics of fat tails (Pareto, Lévy, Mandelbrot, Polya, Feller, Zolotarev, Taqqu, Samorodnitsky); ruin problems and large deviation theory (Cramér, Lundberg, Dembo, Zeitouni, Varadhan); extreme value theory (Gnedenko, Resnick, Embrechts, Balkema, de Haan, Pickands); convergence laws (De Moivre, Bienaymé, Chebyshev, Markov, Bernstein, Kolmogorov, Luzin, Berry, Esseen, Petrov, the Nagaev brothers, Mikosch); stochastic contract theory, from the scholastics (Pierre de Jean Olivi) to mathematical finance, derivative theory, and stochastic calculus; loss models and insurance; heuristics and biases, decision theory, and the psychology of probability; probability in epistemology (Bayes, Peirce, Ramsey, Carnap, Levi, Kyburg, Jeffreys); complexity theory and the agency problem; the economics of uncertainty (Knight, Hayek, Shackle, econophysics); asymmetry/fragility/convexity, ethics, and skin in the game; and the real-world "Fat Tonyism" of trader lore and ergodicity. The diagram's annotations stress how little these traditions overlap: almost none between large deviation theory and fat tails (the Cramér condition requires exponential moments), none between heuristics-and-biases and fat tails (the overlap is limited to the psychology of induction), little between convergence laws (LLN) and the philosophical problem of induction, none between the economics of uncertainty and fat tails/skepticism/ergodicity, and "Knightian uncertainty" too coarse mathematically and philosophically to connect.]
CONTENTS

Nontechnical chapters are indicated with a star ⋆; discussion chapters are indicated with a dagger †; adaptations from published ("peer-reviewed") papers with a double dagger ‡. While chapters are indexed by Arabic numerals, expository and very brief mini-chapters (halfway between appendices and full chapters) use letters such as A, B, etc.

Prologue ⋆,†

Glossary, Definitions, and Notations
  General Notations and Frequently Used Symbols
  Catalogue Raisonné of General & Idiosyncratic Concepts: Power Law Class P; Law of Large Numbers (Weak); The Central Limit Theorem (CLT); Law of Medium Numbers or Preasymptotics; Kappa Metric; Elliptical Distribution; Statistical Independence; Stable (Lévy Stable) Distribution; Multivariate Stable Distribution; Karamata Point; Subexponentiality; Student T as Proxy; Citation Ring; Rent Seeking in Academia; Pseudo-empiricism or Pinker Problem; Preasymptotics; Stochasticizing; Value at Risk, Conditional VaR; Skin in the Game; MS Plot; Maximum Domain of Attraction (MDA); Substitution of Integral in the Psychology Literature; Inseparability of Probability (Another Common Error); Wittgenstein's Ruler; Black Swans; The Empirical Distribution Is Not Empirical; The Hidden Tail; Shadow Moment; Tail Dependence; Metaprobability; Dynamic Hedging

FAT TAILS AND THEIR EFFECTS, AN INTRODUCTION

A Non-Technical Overview: The Darwin College Lecture ⋆,‡
  On the Difference Between Thin and Thick Tails; Tail Wagging Dogs: An Intuition; A (More Advanced) Categorization and Its Consequences; The Main Consequences and How They Link to the Book (Forecasting; The Law of Large Numbers); Epistemology and Inferential Asymmetry; Naive Empiricism: Ebola Should Not Be Compared to Falls from Ladders (How Some Multiplicative Risks Scale); Primer on Power Laws (Almost Without Mathematics); Where Are the Hidden Properties?; Bayesian Schmayesian; X vs F(X): Exposures to X Confused with Knowledge About X; Ruin and Path Dependence; What To Do?

Univariate Fat Tails, Level 1, Finite Moments †
  A Simple Heuristic to Create Mildly Fat Tails (A Variance-Preserving Heuristic; Fattening of Tails with Skewed Variance); Does Stochastic Volatility Generate Power Laws?; The Body, the Shoulders, and the Tails (The Crossovers and Tunnel Effect); Fat Tails, Mean Deviation and the Rising Norms (The Common Errors; Some Analytics; Effect of Fatter Tails on the "Efficiency" of STD vs MD; Moments and the Power Mean Inequality; Comment: Why We Should Retire Standard Deviation, Now!); Visualizing the Effect of Rising p on Iso-Norms

Level 2: Subexponentials and Power Laws
  Revisiting the Rankings; What Is a Borderline Probability Distribution?; Let Us Invent a Distribution; Level 3: Scalability and Power Laws (Scalable and Nonscalable, a Deeper View of Fat Tails; Grey Swans); Some Properties of Power Laws (Sums of Variables; Transformations); Bell Shaped vs Non Bell Shaped Power Laws; Interpolative Powers of Power Laws: An Example; Super-Fat Tails: The Log-Pareto Distribution; Pseudo-Stochastic Volatility: An Investigation

Thick Tails in Higher Dimensions †
  Thick Tails in Higher Dimension, Finite Moments; Joint Fat-Tailedness and Ellipticality of Distributions; Multivariate Student T (Ellipticality and Independence under Thick Tails); Fat Tails and Mutual Information; Fat Tails and Random Matrices, a Rapid Interlude; Correlation and Undefined Variance; Fat Tailed Residuals in Linear Regression Models

Special Cases of Thick Tails
  Multimodality and Thick Tails, or the War and Peace Model; Transition Probabilities: What Can Break Will Break

THE LAW OF MEDIUM NUMBERS

Limit Distributions, a Consolidation ⋆,†
  Refresher: The Weak and Strong LLN; Central Limit in Action (The Stable Distribution; The Law of Large Numbers for the Stable Distribution); Speed of Convergence of the CLT: Visual Explorations (Fast Convergence: the Uniform Distribution; Semi-Slow Convergence: the Exponential; The Slow Pareto; The Half-Cubic Pareto and Its Basin of Convergence); Cumulants and Convergence; Technical Refresher: Traditional Versions of the CLT; The Law of Large Numbers for Higher Moments (Higher Moments); Mean Deviation for a Stable Distribution

How Much Data Do You Need? An Operational Metric for Fat-Tailedness ‡
  Introduction and Definitions; The Metric; Stable Basin of Convergence as Benchmark (Equivalence for Stable Distributions; Practical Significance for Sample Sufficiency); Technical Consequences (Some Oddities with Asymmetric Distributions; Rate of Convergence of a Student T Distribution to the Gaussian Basin; The Lognormal Is Neither Thin nor Fat Tailed; Can Kappa Be Negative?); Conclusion and Consequences (Portfolio Pseudo-Stabilization; Other Aspects of Statistical Inference; Final Comment); Appendix, Derivations, and Proofs (Cubic Student T (Gaussian Basin); Lognormal Sums; Exponential; Negative Kappa, Negative Kurtosis)

Extreme Values and Hidden Tails ⋆,†
  Preliminary Introduction to EVT (How Any Power Law Tail Leads to Fréchet; Gaussian Case; The Pickands-Balkema-de Haan Theorem); The Invisible Tail for a Power Law (Comparison with the Normal Distribution); Appendix: The Empirical Distribution Is Not Empirical

Growth Rate and Outcome Are Not in the Same Distribution Class
  The Puzzle; Pandemics Are Really Fat Tailed

The Large Deviation Principle, in Brief

Calibrating Under Paretianity
  Distribution of the Sample Tail Exponent

"It Is What It Is": Diagnosing the SP500 †
  Paretianity and Moments; Convergence Tests (Kurtosis under Aggregation; Maximum Drawdowns; Empirical Kappa; Excess Conditional Expectation; Instability of the Fourth Moment; MS Plot; Records and Extrema; Asymmetry, Right vs. Left Tail); Conclusion: It Is What It Is

The Problem with Econometrics
  Performance of Standard Parametric Risk Estimators; Performance of Standard Nonparametric Risk Estimators

Machine Learning Considerations
  Calibration via Angles

PREDICTIONS, FORECASTING, AND UNCERTAINTY

Probability Calibration Under Fat Tails ‡
  Continuous vs. Discrete Payoffs: Definitions and Comments (Away from the Verbalistic; There Is No Defined "Collapse", "Disaster", or "Success" under Fat Tails); Spurious Overestimation of Tail Probability in Psychology (Thin Tails; Fat Tails; Conflations; Distributional Uncertainty); Calibration and Miscalibration; Scoring Metrics (Deriving Distributions); Non-Verbalistic Payoff Functions / Machine Learning; Conclusion; Appendix: Proofs and Derivations (Distribution of the Binary Tally; Distribution of the Brier Score)

Election Predictions as Martingales: An Arbitrage Approach ‡
  Main Results; Organization; A Discussion on Risk Neutrality; The Bachelier-Style Valuation; Bounded Dual Martingale Process; Relation to De Finetti's Probability Assessor; Conclusion and Comments

INEQUALITY ESTIMATORS UNDER FAT TAILS

Gini Estimation Under Infinite Variance ‡
  Introduction; Asymptotics of the Nonparametric Estimator under Infinite Variance (A Quick Recap on α-Stable Random Variables; The α-Stable Asymptotic Limit of the Gini Index); The Maximum Likelihood Estimator; A Paretian Illustration; Small Sample Correction; Conclusions

On the Super-Additivity and Estimation Biases of Quantile Contributions ‡
  Introduction; Estimation for Unmixed Pareto-Tailed Distributions (Bias and Convergence); An Inequality About Aggregating Inequality; Mixed Distributions for the Tail Exponent; A Larger Total Sum Is Accompanied by Increases in the Sample Quantile Contribution; Conclusion and Proper Estimation of Concentration (Robust Methods and Use of Exhaustive Data; How Should We Measure Concentration?)

SHADOW MOMENTS PAPERS

Shadow Moments of Apparently Infinite-Mean Phenomena ‡
  Introduction; The Dual Distribution; Back to Y: The Shadow Mean (or Population Mean); Comparison to Other Methods; Applications

On the Tail Risk of Violent Conflict (with P. Cirillo) ‡
  Introduction/Summary; Summary Statistical Discussion (Results; Conclusion); Methodological Discussion (Rescaling Method; Expectation by Conditioning (Less Rigorous); Reliability of Data and Effect on Tail Estimates; Definition of an "Event"; Missing Events; Survivorship Bias); Data Analysis (Peaks over Threshold; Gaps in Series and Autocorrelation; Tail Analysis; An Alternative View on Maxima; Full Data Analysis); Additional Robustness and Reliability Tests (Bootstrap for the GPD; Perturbation Across Bounds of Estimates); Conclusion: Is the World More Unsafe than It Seems?; Acknowledgments

What Are the Chances of a Third World War? ⋆,†

METAPROBABILITY PAPERS

How Thick Tails Emerge from Recursive Epistemic Uncertainty †
  Methods and Derivations (Layering Uncertainties; Higher Order Integrals in the Standard Gaussian Case; Effect on Small Probabilities); Regimes of Decaying Parameters a(n) ("Bleed" of Higher Order Error; Second Method, a Non-Multiplicative Error Rate); Limit Distribution

Stochastic Tail Exponent for Asymmetric Power Laws †
  Background; One-Tailed Distributions with Stochastic Alpha (General Cases; Stochastic Alpha Inequality; Approximations for the Class P); Sums of Power Laws; Asymmetric Stable Distributions; Pareto Distribution with Lognormally Distributed α; Pareto Distribution with Gamma Distributed α; The Bounded Power Law in Cirillo and Taleb; Additional Comments; Acknowledgments

Meta-distribution of p-Values and p-Hacking ‡
  Proofs and Derivations; Inverse Power of Test; Application and Conclusion

Some Confusions in Behavioral Economics
  Case Study: How Myopic Loss Aversion Is Misspecified

OPTION TRADING AND PRICING UNDER FAT TAILS

Financial Theory's Failures with Option Pricing †
  Bachelier, not Black-Scholes (Distortion from Idealization; The Actual Replication Process; Failure: How Hedging Errors Can Be Prohibitive)

Unique Option Pricing Measure (No Dynamic Hedging / Complete Markets) ‡
  Background; Proof (Case 1: Forward as Risk-Neutral Measure; Derivations); Case Where the Forward Is Not Risk Neutral; Comment

Option Traders Never Use the Black-Scholes-Merton Formula ⋆,‡
  Breaking the Chain of Transmission; Introduction/Summary (Black-Scholes Was an Argument); Myth: Traders Did Not Price Options Before BSM; Methods and Derivations (Option Formulas and Delta Hedging); Myth: Traders Today Use Black-Scholes (When Do We Value?); On the Mathematical Impossibility of Dynamic Hedging (The (Confusing) Robustness of the Gaussian; Order Flow and Options; Bachelier-Thorp)

Option Pricing Under Power Laws: A Robust Heuristic ⋆,‡
  Introduction; Call Pricing Beyond the Karamata Constant (First Approach: S Is in the Regular Variation Class; Second Approach: S Has Geometric Returns in the Regular Variation Class); Put Pricing; Arbitrage Boundaries; Comments

Four Mistakes in Quantitative Finance ⋆,‡
  Conflation of Second and Fourth Moments; Missing Jensen's Inequality in Analyzing Option Returns; The Inseparability of Insurance and Insured; The Necessity of a Numéraire in Finance; Appendix (Betting on Tails of Distribution)

Tail Risk Constraints and Maximum Entropy (with D. and H. Geman) ‡
  Left Tail Risk as the Central Portfolio Constraint (The Barbell as Seen by E.T. Jaynes); Revisiting the Mean Variance Setting (Analyzing the Constraints); Revisiting the Gaussian Case (A Mixture of Two Normals); Maximum Entropy (Case A: Constraining the Global Mean; Case B: Constraining the Absolute Mean; Case C: Power Laws for the Right Tail; Extension to a Multi-Period Setting: A Comment); Comments and Conclusion; Appendix/Proofs

Bibliography and Index
PROLOGUE ⋆,†

"The less you understand the world, the easier it is to make a decision."

[Figure: The problem is not awareness of "fat tails," but the lack of understanding of their consequences. Saying "it is fat tailed" implies much more than changing the name of the distribution; it implies a general overhaul of the statistical tools and of the types of decisions made. Credit: Stefan Gasic.]

The main idea behind the Incerto project is that while there is a lot of uncertainty and opacity about the world, and an incompleteness of information and understanding, there is little, if any, uncertainty about what actions should be taken based on such incompleteness, in any given situation.

This book consists in (1) published papers and (2) (uncensored) commentary about classes of statistical distributions that deliver extreme events, and about how we should deal with them for both statistical inference and decision making. Most "standard" statistics come from theorems designed for thin tails: they need to be adapted preasymptotically to fat tails, which is not trivial, or abandoned altogether.

[Figure: Complication without insight: the clarity of mind of many professionals using statistics and data science without an understanding of the core concepts, of what it is fundamentally about. Credit: Wikimedia.]

So many times this author has been told "of course we know this," or the beastly portmanteau "nothing new about fat tails," by a professor or practitioner who had just produced an analysis using "variance," "GARCH," "kurtosis," "Sharpe ratio," or "value at risk," or produced some "statistical significance" that is clearly not significant.

More generally, this book draws on the author's multi-volume series, Incerto, and its associated technical research program, which is about how to live in the real world: a world with a structure of uncertainty that is too complicated for us.

The Incerto tries to connect five different fields related to tail probabilities and extremes: mathematics, philosophy, social science, contract theory, decision theory, and the real world. If you wonder why contract theory, the answer is: option theory is based on the notion of contingent and probabilistic contracts designed to modify and share classes of exposures in the tails of the distribution; in a way, option theory is mathematical contract theory. Decision theory is not about understanding the world, but about getting out of trouble and ensuring survival. This point is the subject of the next volume of the Technical Incerto, with the temporary working title Convexity, Risk, and Fragility.

A Word on Terminology

"Thick tails" is often used in academic contexts. For us, here, it maps to "much higher kurtosis than the Gaussian," to conform to the finance practitioner's lingo. As to "fat tails," we prefer to reserve it for both extreme thick tails and membership in the power law class (which, as we show later, cannot be disentangled). For many it is meant to be a narrower definition, limited to "power laws" or "regular variations," but we prefer to call power laws "power laws" (when we are quite certain about the process), so what we call "fat tails" may sometimes be, more technically, "extremely thick tails" for many.

[Figure: The classic response: a "substitute" is something that does not harm rent-seeking. Credit: Stefan Gasic.]

To avoid ambiguity, we stay away from designations such as "heavy tails" or "long tails." The next two chapters will clarify.

Acknowledgments
In addition to the coauthors mentioned earlier, the author is indebted to Zhuo Xi, Jean-Philippe Bouchaud, Robert Frey, Spyros Makridakis, Mark Spitznagel, Brandon Yarkin, Raphael Douady, Peter Carr, Marco Avellaneda, Didier Sornette, Paul Embrechts, Bruno Dupire, Jamil Baz, Damir Delic, Yaneer Bar-Yam, Diego Zviovich, Joseph Norman, Ole Peters, Chitpuneet Mann, Harry Crane, and of course endless, really endless discussions with the great Benoit Mandelbrot.

Social media volunteer editors such as Maxime Biette, Caio Vinchi, Jason Thorell, and Petri Helo cleared many typos. Kevin Van Horn sent an extensive list of typos and potential notational confusions.

Some of the papers that turned into chapters have been presented at conferences; the author thanks Laurens de Haan, Bert Zwart, and others for comments on extreme-value-related problems. More specific acknowledgments will be made within individual chapters. As usual, the author would like to express his gratitude to the staff at Naya restaurant in NY.

This author presented the present book and its main points at the monthly Bloomberg Quant Conference in New York in September. After the lecture, a prominent mathematical finance professor came to see me. "This is very typical Taleb," he said. "You show what's wrong but don't offer too many substitutes."

Clearly, in business or in anything subjected to the rigors of the real world, he would have been terminated. People who never had any skin in the game cannot figure out the necessity of circumstantial suspension of belief and the informational value of unreliability for decision making: don't give a pilot a faulty metric; learn to provide only reliable information; letting the pilot know that the plane is defective saves lives. Nor can they get the outperformance of via negativa: Popperian science works by removal.
The late David Freedman had tried, unsuccessfully, to tame vapid and misleading statistical modeling vastly outperformed by "nothing." But it is the case that the various chapters and papers here do offer solutions and alternatives, except that these aren't the most comfortable for some, as they require some mathematical work for re-derivation under fat tailed conditions.

GLOSSARY, DEFINITIONS, AND NOTATIONS

This is a catalogue raisonné of the main topics and notations. Notations are redefined in the text every time; this is an aid for the random peruser. Some chapters extracted from papers will have specific notations, as specified. Note that while our terminology may be at variance with that of some research groups, it aims at remaining consistent.

General Notations and Frequently Used Symbols

P is the probability symbol; typically, in P(X > x), X is the random variable and x the realization. More formal measure-theoretic definitions of events and other French matters appear in later chapters, where such formalism makes sense.

E is the expectation operator.

V is the variance operator.

M is the mean absolute deviation which, when centered, is centered around the mean (rather than the median).

φ(.) and f(.) are usually reserved for the PDF (probability density function) of a pre-specified distribution. In some chapters a distinction is made between f_X(x) and f_Y(y), particularly when X and Y follow two separate distributions.

n is usually reserved for the number of summands.

p is usually reserved for the moment order.

r.v. is short for random variable.

F(.) is reserved for the CDF (cumulative distribution function) P(X ≤ x); F̄(.), or S(.), is the survival function P(X > x).

∼ indicates that a random variable is distributed according to a certain specified law.

χ(t) = E(e^{itX}) is the characteristic function of a distribution. In some discussions the argument t ∈ ℝ is written as ω.

→^D denotes convergence in distribution, as follows. Let X_1, X_2, ..., X_n be a sequence of random variables; X_n →^D X means that the CDF F_n of X_n has the limit lim_{n→∞} F_n(x) = F(x) for every real x at which F is continuous.

→^P denotes convergence in probability: for every ε > 0, using the same sequence as before, lim_{n→∞} P(|X_n − X| > ε) = 0.

→^{a.s.} denotes almost sure convergence, the stronger form: P(lim_{n→∞} X_n = X) = 1.

S_n is typically a sum of n summands.

α and α_s: we shall typically use α_s ∈ (0, 2] to denote the tail exponent of the limiting, Platonic stable distribution, and α_p ∈ (0, ∞) the corresponding Paretian (preasymptotic) equivalent, but only in situations where there could be some ambiguity. Plain α should be understood in context.

N(μ, σ²) is the Gaussian distribution with mean μ and variance σ².

L(., .) or LN(., .) is the lognormal distribution, with PDF f^(L)(.), typically parametrized here as L(log(X) − σ²/2, σ) to get a mean of X and a variance of (e^{σ²} − 1)X².

S(α_s, β, μ, σ) is the stable distribution, with tail index α_s ∈ (0, 2], symmetry parameter β ∈ [−1, 1], centrality parameter μ ∈ ℝ, and scale σ > 0.

P is the power law class (see below). S is the subexponential class (see below).

δ(.) is the Dirac delta function.

θ(.) is the Heaviside theta function.

erf(.), the error function, is the integral of the Gaussian density: erf(z) = (2/√π) ∫_0^z e^{−t²} dt. erfc(.) is the complementary error function, 1 − erf(.).

‖.‖_p is a norm defined for (here) a real vector X = (x_1, ..., x_n)^T as ‖X‖_p ≜ ((1/n) Σ_{i=1}^n |x_i|^p)^{1/p}. Note the absolute value, and the 1/n normalization used in this text.

1F1(.; .; .) is the Kummer confluent hypergeometric function: 1F1(a; b; z) = Σ_{k=0}^∞ (a)_k z^k / ((b)_k k!). The generalized hypergeometric function pFq(a; b; z) has series expansion Σ_{k=0}^∞ ((a_1)_k ⋯ (a_p)_k)/((b_1)_k ⋯ (b_q)_k) z^k/k!, where (.)_k is the Pochhammer symbol; its regularized version divides it by Γ(b_1)⋯Γ(b_q). (a; q)_n is the q-Pochhammer symbol, (a; q)_n = Π_{i=0}^{n−1} (1 − aq^i).

Catalogue Raisonné of General & Idiosyncratic Concepts

Next is a duplication of the definitions of some central topics.

Power Law Class P

The power law class is conventionally defined by the property of the survival function, as follows. Let X be a random variable belonging to the class of distributions with a "power law" right tail, that is:

P(X > x) = L(x) x^{−α},

where L : [x_min, +∞) → (0, +∞) is a slowly varying function, defined by lim_{x→+∞} L(kx)/L(x) = 1 for any k > 0.

The survival function of X is then said to belong to the "regular variation" class RV_α. More specifically, a function f : ℝ⁺ → ℝ⁺ is regularly varying at infinity with index ρ (written f ∈ RV_ρ) when lim_{t→∞} f(tx)/f(t) = x^ρ.

More practically, there is a point where L(x) approaches its limit, l, becoming a constant, which we call the "Karamata constant"; the point where it does so is dubbed the "Karamata point." Beyond such a value, the tails of power laws are calibrated using such standard techniques as the Hill estimator. The distribution in that zone is dubbed the strong Pareto law by B. Mandelbrot. The same applies, when specified, to the left tail.

Law of Large Numbers (Weak)
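Before the formal statement, a minimal simulation sketch of the weak law; the Exponential(1) distribution, seed, and sample sizes below are illustrative assumptions, not from the text.

```python
import random
import statistics

# Weak LLN sketch: the sample average of i.i.d. draws with finite
# mean concentrates around that mean as n grows.
random.seed(42)

def sample_mean(n: int) -> float:
    """Average of n i.i.d. Exponential(1) draws (true mean = 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))  # approaches 1 as n grows
```

No finite variance is required for this convergence, though the exponential used here has one; a Pareto with tail exponent between 1 and 2 would converge far more slowly, which is the preasymptotic theme of the book.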
The standard presentation is as follows. Let X_1, X_2, ..., X_n be an infinite sequence of independent and identically distributed (Lebesgue integrable) random variables with expected value E(X_n) = μ (though one can somewhat relax the i.i.d. assumptions). The sample average X̄_n = (1/n)(X_1 + ⋯ + X_n) converges to the expected value, X̄_n → μ, as n → ∞.

Finiteness of variance is not necessary (though of course finite higher moments accelerate the convergence).

The strong law is discussed where needed.

The Central Limit Theorem (CLT)
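Ahead of the formal statement, a quick simulation sketch: standardized sample means of a thin-tailed variable behave like a standard normal. The Uniform(0,1) choice, sample size, and trial count are illustrative assumptions.

```python
import random
import statistics

# CLT sketch: sqrt(n) * (sample mean - mu) / sigma for i.i.d.
# Uniform(0,1) draws should be approximately standard normal.
random.seed(7)

n, trials = 400, 2_000
mu, sigma = 0.5, (1 / 12) ** 0.5   # mean and std of Uniform(0,1)

z = [
    (n ** 0.5) * (statistics.fmean(random.random() for _ in range(n)) - mu) / sigma
    for _ in range(trials)
]

# A standard normal puts about 95% of its mass in (-1.96, 1.96).
share = sum(abs(v) < 1.96 for v in z) / trials
print(share)
```

The convergence here is fast because the uniform is thin tailed; for Paretian summands the same experiment would need vastly larger n, which is the point of the chapters on preasymptotics.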
The standard (Lindeberg-Lévy) version of the CLT is as follows. Suppose a sequence of i.i.d. random variables with E(X_i) = μ and V(X_i) = σ² < +∞, and let X̄_n be the sample average for n. Then, as n approaches infinity, the recentered and rescaled average √n (X̄_n − μ) converges in distribution to a Gaussian:

√n (X̄_n − μ) →^d N(0, σ²).

Convergence in distribution here means that the CDF of √n (X̄_n − μ) converges pointwise to the CDF of N(0, σ²) for every real z:

lim_{n→∞} P(√n (X̄_n − μ) ≤ z) = lim_{n→∞} P(√n (X̄_n − μ)/σ ≤ z/σ) = Φ(z/σ), σ > 0,

where Φ(z) is the standard normal CDF evaluated at z.

There are many other versions of the CLT, presented as needed.

Law of Medium Numbers or Preasymptotics
This is pretty much the central topic of this book. We are interested in the behavior of the random variable for n large but not too large, that is, before the asymptotic regime. While this is not a big deal for the Gaussian, owing to extremely rapid convergence (by both LLN and CLT), it is for other random variables.

See Kappa next.

Kappa Metric
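Ahead of the formal definition, a Monte Carlo sketch of the metric, κ(n_0, n_1) = 2 − (log n_1 − log n_0)/log(M(n_1)/M(n_0)), with M(n) the mean absolute deviation of the n-summand partial sum; for the Gaussian benchmark it should come out near 0. Trial counts and seed are illustrative assumptions.

```python
import math
import random

# Kappa sketch:
#   kappa(n0, n1) = 2 - (log(n1) - log(n0)) / log(M(n1) / M(n0)),
# where M(n) = E|S_n - E(S_n)| for the partial sum S_n.
# Gaussian benchmark: kappa = 0. Cauchy (no mean): kappa = 1.
random.seed(1)

def mad_of_sum(n: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of M(n) for standard normal summands."""
    # E(S_n) = 0 here, so |S_n| is already the centered deviation.
    return sum(
        abs(sum(random.gauss(0.0, 1.0) for _ in range(n))) for _ in range(trials)
    ) / trials

def kappa(n0: int, n1: int) -> float:
    return 2 - (math.log(n1) - math.log(n0)) / math.log(
        mad_of_sum(n1) / mad_of_sum(n0)
    )

print(kappa(1, 2))  # near 0 for the Gaussian
```

For the Gaussian, M(n) grows like √n, so the log-ratio in the denominator is half the log-ratio of the sample sizes and κ vanishes; slower-than-√n shrinkage of the normalized deviation pushes κ toward 1.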
Metric here should not be interpreted in the mathematical sense of a distance function, but rather in its engineering sense, as a quantitative measurement.

Kappa, in [0, 1], developed by this author (in a later chapter here and in a published paper), gauges the preasymptotic behavior of a random variable; it is 0 for the Gaussian, taken as benchmark, and 1 for a Cauchy or an r.v. that has no mean.

Let X_1, ..., X_n be i.i.d. random variables with finite mean, that is, E(X) < +∞. Let S_n = X_1 + X_2 + ... + X_n be a partial sum. Let M(n) = E(|S_n − E(S_n)|) be the expected mean absolute deviation from the mean for n summands (recall that we do not use the median but center around the mean). Define the "rate" of convergence for additional summands beyond a starting n_0:

M(n_1)/M(n_0) = (n_1/n_0)^{1/(2 − κ_{n_0,n_1})}, n_1 = 1, 2, ..., n_1 > n_0 ≥ 1,

hence

κ(n_0, n_1) = 2 − (log(n_1) − log(n_0)) / log(M(n_1)/M(n_0)).

Further, for the baseline values n_1 = n_0 + 1, we use the shorthand κ_{n_0}.

Elliptical Distribution

A random vector X in ℝ^p is elliptically distributed if its characteristic function can be written, for some location vector μ, non-negative definite p × p matrix Σ, and scalar function Ψ, in the form exp(it′μ) Ψ(tΣt′).

In practical words, one must have a single covariance matrix for the joint distribution to be elliptical. Regime switching and stochastic covariances (or correlations) all prevent the distribution from being elliptical. We show in a later chapter that a linear combination of variables each following thin-tailed distributions can produce explosive thick-tailed properties when ellipticality is violated. This (in addition to fat tailedness) invalidates much of modern finance.

Statistical independence
Independence between two variables $X$ and $Y$ with marginal PDFs $f(x)$ and $f(y)$ and joint PDF $f(x,y)$ is defined by the identity
\[
\frac{f(x,y)}{f(x)\, f(y)} = 1,
\]
regardless of the correlation coefficient. In the class of elliptical distributions, the bivariate Gaussian with correlation coefficient $0$ is both independent and uncorrelated. This does not apply to the Student T or the Cauchy in their multivariate forms.

glossary, definitions, and notations

Stable (Lévy Stable) Distribution
This is a generalization of the CLT. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables. Consider their sum $S_n$. We have
\[
\frac{S_n - a_n}{b_n} \xrightarrow{D} X_s, \quad ( . )
\]
where $X_s$ follows a stable distribution $S$, $a_n$ and $b_n$ are norming constants, and, to repeat, $\xrightarrow{D}$ denotes convergence in distribution (the distribution of $X$ as $n \to \infty$). The properties of $S$ will be more properly defined and explored in the next chapter. Take it for now that a random variable $X_s$ follows a stable (or $\alpha$-stable) distribution, symbolically $X_s \sim \mathcal{S}(\alpha_s, \beta, \mu, \sigma)$, if its characteristic function $\chi(t) = \mathbb{E}(e^{i t X_s})$ is of the form
\[
\chi(t) = \exp\Big( i \mu t - |t \sigma|^{\alpha_s} \big( 1 - i \beta \tan\big( \tfrac{\pi \alpha_s}{2} \big)\, \mathrm{sgn}(t) \big) \Big) \quad \text{when } \alpha_s \neq 1. \quad ( . )
\]
The constraints are $-1 \leq \beta \leq 1$ and $0 < \alpha_s \leq 2$.

Multivariate Stable Distribution
A random vector $X = (X_1, \ldots, X_k)^T$ is said to have a multivariate stable distribution if every linear combination of its components, $Y = a_1 X_1 + \cdots + a_k X_k$, has a stable distribution. That is, for any constant vector $a \in \mathbb{R}^k$, the random variable $Y = a^T X$ should have a univariate stable distribution.

Karamata Point
See Power Law Class.

Subexponentiality
The natural boundary between Mediocristan and Extremistan occurs at the subexponential class, which has the following property. Let $X = X_1, \ldots, X_n$ be a sequence of independent and identically distributed random variables with support in $\mathbb{R}^+$ and cumulative distribution function $F$. The subexponential class of distributions is defined by (see [ ], [ ]):
\[
\lim_{x \to +\infty} \frac{1 - F^{*2}(x)}{1 - F(x)} = 2, \quad ( . )
\]
where $F^{*2} = F * F$ is the cumulative distribution of $X_1 + X_2$, the sum of two independent copies of $X$. This implies that the probability that the sum $X_1 + X_2$ exceeds a value $x$ is twice the probability that either one separately exceeds $x$. Thus, every time the sum exceeds $x$, for large enough values of $x$, the value of the sum is due to either one or the other exceeding $x$ (the maximum over the two variables), while the other contributes negligibly.

More generally, it can be shown that the sum of $n$ variables is dominated by the maximum of the values over those variables in the same way. Formally, the following two properties are equivalent to the subexponential condition [ ], [ ]. For a given $n \geq 2$, let $S_n = \sum_{i=1}^n x_i$ and $M_n = \max_{1 \leq i \leq n} x_i$; then

a) $\lim_{x \to \infty} \frac{P(S_n > x)}{P(X > x)} = n$,

b) $\lim_{x \to \infty} \frac{P(S_n > x)}{P(M_n > x)} = 1$.

Thus the sum $S_n$ has the same magnitude as the largest sample $M_n$, which is another way of saying that tails play the most important role.

Intuitively, tail events in subexponential distributions should decline more slowly than in an exponential distribution, for which large tail events should be irrelevant. Indeed, one can show that subexponential distributions have no exponential moments:
\[
\int_0^\infty e^{\epsilon x}\, dF(x) = +\infty \quad ( . )
\]
for all values of $\epsilon$ greater than zero. However, the converse isn't true, since distributions can have no exponential moments yet not satisfy the subexponential condition.

Student T as Proxy
We use the Student T with $\alpha$ degrees of freedom as a convenient two-tailed power law distribution. For $\alpha = 1$ it becomes a Cauchy and, of course, a Gaussian for $\alpha \to \infty$. The Student T is the main bell-shaped power law, that is, its PDF is continuous and smooth, asymptotically approaching zero for large negative/positive $x$, and with a single, unimodal maximum (further, the PDF is quasiconcave but not concave).

Citation Ring
A highly circular mechanism by which academic prominence is reached through discussions in which papers are considered prominent because other people are citing them, with no external filtering, thus causing research to concentrate and get stuck around "corners", focal areas of no real significance. This is linked to the operation of the academic system in the absence of adult supervision or the filtering of skin in the game.

Example of fields that are, practically, frauds, in the sense that their results are not portable to reality and only serve to feed additional papers that, in turn, will produce more papers: Modern Financial Theory, econometrics (particularly for macro variables), GARCH processes, psychometrics, stochastic control models in finance, behavioral economics and finance, decision making under uncertainty, macroeconomics, and a bit more.

Rent Seeking in Academia
There is a conflict of interest between a given researcher and the subject under consideration. The objective function of an academic department (and of the individual researcher) becomes collecting citations, honors, etc., at the expense of the purity of the subject: for instance, many people get stuck in research corners because it is more beneficial to their careers and to their department.

Pseudo-empiricism, or the Pinker Problem
Discussion of "evidence" that is not statistically significant, or use of metrics that are uninformative because they do not apply to the random variables under consideration, like, for instance, making inferences from the means and correlations of fat tailed variables. This is the result of:

i) the focus in statistical education on Gaussian or thin-tailed variables,

ii) the absence of probabilistic knowledge combined with memorization of statistical terms,

iii) complete cluelessness about dimensionality,

all of which are prevalent among social scientists. An example of pseudo-empiricism: comparing deaths from terrorist actions or epidemics such as Ebola (fat tailed) to falls from ladders (thin tailed). This confirmatory "positivism" is a disease of modern science; it breaks down under both dimensionality and fat-tailedness. Actually, one does not even need to distinguish between fat tailed and Gaussian variables to see the lack of rigor in these activities: simple criteria of statistical significance are not met, nor do these operators grasp the notion of such a concept as significance.

Preasymptotics
Mathematical statistics is largely concerned with what happens at $n = 1$ (where $n$ is the number of summands) and at $n = \infty$. What happens in between is what we call the real world, and it is the major focus of this book. Some distributions (say, those with finite variance) are Gaussian in behavior asymptotically, for $n = \infty$, but not for extremely large yet finite $n$.

Stochasticizing
Making a deterministic parameter stochastic, (i) in a simple way, or (ii) via a more complex continuous or discrete distribution.

(i) Let $\sigma$ be the deterministic parameter; we stochasticize (entry-level style) by creating a two-state Bernoulli with probability $p$ of taking a value $\sigma_1$ and probability $1-p$ of taking a value $\sigma_2$. The transformation is mean-preserving when $p \sigma_1 + (1-p) \sigma_2 = \sigma$, that is, when it preserves the mean of the $\sigma$ parameter. More generally, it can in a similar manner be variance-preserving, etc.

(ii) We can use a full probability distribution, typically a Gaussian if the variable is two-tailed, and the Lognormal or the exponential if the variable is one-tailed (rarely a power law). When $\sigma$ is a standard deviation, one can stochasticize $\sigma$, whereupon it becomes "stochastic volatility", with a variance or standard deviation typically dubbed "Vvol".

Value at Risk, Conditional VaR
The mathematical expression of the Value at Risk, VaR, for a random variable $X$ with distribution function $F$ and threshold $\lambda \in [0,1]$ is
\[
\mathrm{VaR}_\lambda(X) = -\inf \{ x \in \mathbb{R} : F(x) > \lambda \},
\]
and the corresponding CVaR, or Expected Shortfall, ES, at threshold $\lambda$:
\[
\mathrm{ES}_\lambda(X) = \mathbb{E}\left( -X \mid X \leq -\mathrm{VaR}_\lambda(X) \right),
\]
or, in the positive domain, by considering the tail of $X$ instead of that of $-X$. More generally, the expected shortfall for a threshold $K$ is $\mathbb{E}(X \mid X > K)$.

Skin in the Game
A filtering mechanism that forces cooks to eat their own cooking and be exposed to harm in the event of failure, thus throwing dangerous people out of the system. Fields that have skin in the game: plumbing, dentistry, surgery, engineering, activities where operators are evaluated by tangible results or subjected to ruin and bankruptcy. Fields where people have no skin in the game: circular academic fields where people rely on peer assessment rather than survival pressures from reality.

MS Plot
The MS plot, for "maximum to sum", allows us to see the behavior of the LLN for a given moment: consider the contribution of the maximum observation to the total, and see how it behaves as $n$ grows larger. For an r.v. $X$, an approach to detect whether $\mathbb{E}(X^p)$ exists consists in examining convergence according to the law of large numbers (or, rather, its absence), by looking at the behavior of higher moments in a given sample. One convenient approach is the Maximum-to-Sum plot, or MS plot, as shown in Figure . .

The MS plot relies on a consequence of the law of large numbers [ ] when it comes to the maximum of a variable. For a sequence $X_1, X_2, \ldots, X_n$ of nonnegative i.i.d. random variables, if $\mathbb{E}[X^p] < \infty$ for $p = 1, 2, 3, \ldots$, then
\[
R_n^p = \frac{M_n^p}{S_n^p} \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty,
\]
where $S_n^p = \sum_{i=1}^n X_i^p$ is the partial sum and $M_n^p = \max(X_1^p, \ldots, X_n^p)$ the partial maximum. (Note that we can take $X$ to be the absolute value of the random variable, in case the r.v. can be negative, to allow the approach to apply to odd moments.)

Maximum Domain of Attraction, MDA
The extreme value distribution concerns that of the maximum r.v., when $x \to x^*$, where $x^* = \sup \{ x : F(x) < 1 \}$ (the right "endpoint" of the distribution) is in the maximum domain of attraction, MDA [ ]. In other words,
\[
\max(X_1, X_2, \ldots, X_n) \xrightarrow{P} x^*.
\]

Substitution of Integral in the Psychology Literature
The verbalistic literature makes the following conflation. Let $K \in \mathbb{R}^+$ be a threshold, $f(\cdot)$ a density function, $p_K \in [0,1]$ the probability of exceeding it, and $g(x)$ an impact function. Let $I_1$ be the expected payoff above $K$:
\[
I_1 = \int_K^\infty g(x) f(x)\, dx,
\]
and let $I_2$ be the impact at $K$ multiplied by the probability of exceeding $K$:
\[
I_2 = g(K) \int_K^\infty f(x)\, dx = g(K)\, p_K.
\]
The substitution comes from conflating $I_1$ and $I_2$, which becomes an identity if and only if $g(\cdot)$ is constant above $K$ (say $g(x) = \theta_K(x)$, the Heaviside theta function). For $g(\cdot)$ a variable function with positive first derivative, $I_1$ can be close to $I_2$ only under thin-tailed distributions, not under fat tailed ones.

Inseparability of Probability (Another Common Error)
Let $F: A \to [0,1]$ be a probability distribution (with derivative $f$) and $g: \mathbb{R} \to \mathbb{R}$ a measurable function, the "payoff". Clearly, for $A'$ a subset of $A$:
\[
\int_{A'} g(x)\, dF(x) = \int_{A'} f(x)\, g(x)\, dx \neq \int_{A'} f(x)\, dx \;\; g\!\left( \int_{A'} dx \right). \quad ( . )
\]
In discrete terms, with $p(\cdot)$ a probability mass function:
\[
\sum_{x \in A'} p(x)\, g(x) \neq \sum_{x \in A'} p(x) \;\; g\!\left( \frac{1}{n} \sum_{x \in A'} x \right) = \text{probability of event} \times \text{payoff of average event}.
\]
The general idea is that probability is a kernel inside an equation, not an end product by itself outside of explicit bets.

Wittgenstein's Ruler

"Wittgenstein's ruler" is the following riddle: are you using the ruler to measure the table, or using the table to measure the ruler? Well, it depends on the results. Assume there are only two alternatives: a Gaussian distribution and a Power Law one. We show that a large deviation, say a "six sigma" event, indicates that the distribution is a power law.

Black Swans
Black Swans result from the incompleteness of knowledge, with effects that can be very consequential in fat tailed domains.
Basically, they are things that fall outside what you can expect and model, and they carry large consequences. The idea is not to predict them, but to be convex (or at least not concave) to their impact: fragility to a certain class of events is detectable, even measurable (by gauging second-order effects and asymmetry of responses), while the statistical attributes of these events may remain elusive. It is hard to explain to modelers that we need to learn to work with things we have never seen (or imagined) before, but it is what it is.

Note the epistemic dimension: Black Swans are observer-dependent: a Black Swan for the turkey is a White Swan for the butcher. September 11 was a Black Swan for the victims, but not for the terrorists. This observer dependence is a central property. An "objective" probabilistic model of Black Swans isn't just impossible; it defeats the purpose, owing to the incomplete character of information and its dissemination.

As Paul Portesi likes to repeat (attributing, or perhaps misattributing, to this author): "You haven't seen the other side of the distribution".
Grey Swans:
Large deviations that are both consequential and of very low frequency, but remain consistent with statistical properties, are called "Grey Swans". But of course the "greyness" depends on the observer: a Grey Swan for someone using a power law distribution will be a Black Swan to naive statisticians irremediably stuck within, and wading into, thin-tailed frameworks and representations.
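The observer dependence of swan color can be made concrete with a small numeric sketch (the 10-sigma threshold and the Student T with 2 degrees of freedom are illustrative assumptions, not from the text): under a Gaussian model the event is, for practical purposes, impossible, while under a power-law-tailed model it is merely rare.

```python
import math

# Survival functions in closed form (standard library only):
#   Gaussian:                   P(Z > x) = erfc(x / sqrt(2)) / 2
#   Student T, 2 degrees of     P(T > x) = (1 - x / sqrt(2 + x^2)) / 2
#   freedom (tail exponent 2)

def gaussian_sf(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def student_t2_sf(x):
    return 0.5 * (1.0 - x / math.sqrt(2.0 + x * x))

x = 10.0  # a "10 sigma" deviation
p_gauss = gaussian_sf(x)   # astronomically small: a Black Swan for this modeler
p_power = student_t2_sf(x) # a low-frequency but ordinary draw: a Grey Swan

print(f"Gaussian model:    P(X > {x}) = {p_gauss:.3e}")
print(f"Student T (2 df):  P(X > {x}) = {p_power:.3e}")
```

The same deviation that is a rare but expected draw under the power law should, under the Gaussian, essentially never happen; observing it measures the ruler (the model), not the table.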
Let us repeat: no, it is not about fat tails; it is just that fat tails make them worse. The connection between fat tails and Black Swans lies in the exaggerated impact of large deviations in fat tailed domains.

The Empirical Distribution is Not Empirical
The empirical distribution, or survival function, $\hat{F}(t)$ is as follows. Let $X_1, \ldots, X_n$ be independent, identically distributed real random variables with common cumulative distribution function $F(t)$. Then
\[
\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{x_i \geq t},
\]
where $\mathbb{1}_A$ is the indicator function.

By the Glivenko-Cantelli theorem, we have uniform convergence under the max norm to a specific distribution, the Kolmogorov-Smirnov, regardless of the initial distribution. We have:
\[
\sup_{t \in \mathbb{R}} \left| \hat{F}_n(t) - F(t) \right| \xrightarrow{a.s.} 0; \quad ( . )
\]
this distribution-independent convergence concerns probabilities, of course, not moments, a result this author has worked on and generalized for the "hidden moment" above the maximum. We note the main result (further generalized by Donsker into a Brownian Bridge, since we know the extremes are 0 and 1):
\[
\sqrt{n} \left( \hat{F}_n(t) - F(t) \right) \xrightarrow{D} \mathcal{N}\big( 0, F(t)(1 - F(t)) \big). \quad ( . )
\]
"The empirical distribution is not empirical" means that, since empirical distributions are necessarily censored on the interval $[x_{\min}, x_{\max}]$, for fat tails this can carry huge consequences, because we cannot analyze fat tails in probability space but in payoff space. Further, see the entry on the hidden tail (next).

The Hidden Tail
Consider $K_n$, the maximum of a sample of $n$ independent identically distributed variables: $K_n = \max(X_1, X_2, \ldots, X_n)$. Let $\phi(\cdot)$ be the density of the underlying distribution. We can decompose the moments into two parts, with the "hidden" moment above $K_n$:
\[
\mathbb{E}(X^p) = \underbrace{\int_L^{K_n} x^p \phi(x)\, dx}_{\mu_{L,p}} + \underbrace{\int_{K_n}^{\infty} x^p \phi(x)\, dx}_{\mu_{K,p}},
\]
where $\mu_L$ is the observed part of the distribution and $\mu_K$ the hidden one (above $K_n$). By Glivenko-Cantelli, the distribution of $\mu_{K,0}$ should be independent of the initial distribution of $X$, but the higher moments are not, hence there is a bit of a problem with Kolmogorov-Smirnov-style tests.

Shadow Moment
This is what is called in this book "plug-in" estimation. It is not done by measuring the directly observable sample mean, which is biased under fat-tailed distributions, but by using maximum likelihood parameters, say the tail exponent $\alpha$, and deriving the shadow mean or higher moments from them.

Tail Dependence
Let $X_1$ and $X_2$ be two random variables, not necessarily in the same distribution class. Let $F^{\leftarrow}(q)$ be the inverse CDF for probability $q$, that is, $F^{\leftarrow}(q) = \inf \{ x \in \mathbb{R} : F(x) \geq q \}$. The upper tail dependence $\lambda_u$ is defined as
\[
\lambda_u = \lim_{q \to 1} P\left( X_2 > F_2^{\leftarrow}(q) \mid X_1 > F_1^{\leftarrow}(q) \right). \quad ( . )
\]
Likewise for the lower tail dependence index.

Metaprobability
Comparing two probability distributions via tricks that include stochasticizing parameters. Or: stochasticize a parameter to get the distribution of a call price, or of a risk metric such as VaR (see entry), CVaR, etc., and check the robustness or convexity of the resulting distribution.

Dynamic Hedging
The payoff of a European call option $C$ on an underlying $S$ with expiration indexed at time $T$ should be replicated by the following stream of dynamic hedges between present time $t_0$ and $T$, the limit of which can be seen here:
\[
\lim_{\Delta t \to 0} \sum_{i=1}^{n = T / \Delta t} \left. \frac{\partial C}{\partial S} \right|_{S = S_{t_0 + (i-1)\Delta t},\; t = t_0 + (i-1)\Delta t} \left( S_{t_0 + i \Delta t} - S_{t_0 + (i-1)\Delta t} \right). \quad ( . )
\]
We break up the period into $n$ increments $\Delta t$. Here the hedge ratio $\frac{\partial C}{\partial S}$ is computed as of time $t_0 + (i-1)\Delta t$, but we get the nonanticipating difference between the price at the time the hedge was initiated and the resulting price at $t_0 + i \Delta t$. This is supposed to make the payoff deterministic in the limit $\Delta t \to 0$. In the Gaussian world, this would be an Ito-McKean integral. We show why this replication is never possible in a fat-tailed environment, owing to special preasymptotic properties.

Part I
FAT TAILS AND THEIR EFFECTS, AN INTRODUCTION
A NON-TECHNICAL OVERVIEW - THE DARWIN COLLEGE LECTURE *,‡

Abyssus abyssum invocat ("deep calls unto deep"), Psalms

This chapter presents a nontechnical yet comprehensive presentation of the entire statistical consequences of thick tails project. It compresses the main ideas in one place. Mostly, it provides a list of more than a dozen consequences of thick tails for statistical inference. We begin with the notion of thick tails and how it relates to extremes, using the two imaginary domains of Mediocristan (thin tails) and Extremistan (thick tails).
* Research and discussion chapter.

‡ A shorter version of this chapter was presented at Darwin College, Cambridge (UK) on January 27, 2017, as part of the Darwin College Lecture Series on Extremes. The author extends the warmest thanks to D.J. Needham and Julius Weitzdörfer, as well as their invisible assistants who patiently and accurately transcribed the lecture into a coherent text. The author is also grateful to Susan Pfannenschmidt and Ole Peters, who corrected some mistakes. Jamil Baz prevailed upon me to add more commentary to the chapter to accommodate economists and econometricians who, one never knows, may eventually identify with some of it.

• In Mediocristan, when a sample under consideration gets large, no single observation can really modify the statistical properties.
• In Extremistan, the tails (the rare events) play a disproportionately large role in determining the properties.

Another way to view it: assume a large deviation $X$.

• In Mediocristan, the probability of sampling higher than $X$ twice in a row is greater than the probability of sampling higher than $2X$ once.

• In Extremistan, the probability of sampling higher than $2X$ once is greater than the probability of sampling higher than $X$ twice in a row.

Let us randomly select two people in Mediocristan; assume we obtain a (very unlikely) combined height of 4.1 meters, a tail event. According to the Gaussian distribution (or, rather, its one-tailed siblings), the most likely combination of the two heights is 2.05 meters and 2.05 meters; not, say, a few centimeters and several meters. Simply, the probability of exceeding 3 sigmas is 0.00135. The probability of exceeding 6 sigmas, twice as much, is $9.86 \times 10^{-10}$. The probability of two 3-sigma events occurring is $1.8 \times 10^{-6}$. Therefore the probability of two 3-sigma events occurring is considerably higher than the probability of one single 6-sigma event. This is using a class of distribution that is not fat tailed.

Figure . shows that as we extend the ratio from the probability of two 3-sigma events divided by the probability of a 6-sigma event, to the probability of two 4-sigma events divided by the probability of an 8-sigma event, i.e., the further we go into the tail, a large deviation can only occur via a combination (a sum) of a large number of intermediate deviations: the right side of Figure . . In other words, for something bad to happen, it needs to come from a series of very unlikely events, not a single one. This is the logic of Mediocristan.

Let us now move to Extremistan and randomly select two people with combined wealth of $36 million. The most likely combination is not $18 million and $18 million. It should be approximately $35,999,000 and $1,000.

This highlights the crisp distinction between the two domains; for the class of subexponential distributions, ruin is more likely to come from a single extreme event than from a series of bad episodes. This logic underpins classical risk theory, as outlined by the actuary Filip Lundberg early in the 20th century [ ] and formalized in the 1930s by Harald Cramér [ ], but forgotten by economists in recent times. For insurability, losses need to be more likely to come from many events than from a single one, thus allowing for diversification. This indicates that insurance can only work in Mediocristan; you should never write an uncapped insurance contract if there is a risk of catastrophe. The point is called the catastrophe principle.

As we saw earlier, with thick tailed distributions, extreme events away from the centre of the distribution play a very large role. Black Swans are not "more frequent" (as is commonly misinterpreted); they are more consequential. The fattest tailed distribution has just one very large extreme deviation, rather than many departures from the norm. Figure . shows that if we take a distribution such as the Gaussian and start fattening its tails, the number of departures away from one standard deviation drops. The probability of an event staying within one standard deviation of the mean is 68 percent. As the tails fatten, to mimic what happens in financial markets for example, the probability of an event staying within one standard deviation of the mean rises to between 75 and 95 percent. So note that as we fatten the tails we get higher peaks, smaller shoulders, and a higher incidence of very large deviations. Because probabilities need to add up to 1 (even in France), increasing mass in one area leads to decreasing it in another.

Figure . : Ratio of survival functions, $S(K)^2 / S(2K)$, for two occurrences of size $K$ against one of size $2K$, for a Gaussian distribution ($K$ in sigmas).* The larger the $K$, that is, the more we are in the tails, the more likely the event is to come from two independent realizations of $K$ (hence $P(K)^2$), and the less likely from a single event of magnitude $2K$.

* This is fudging for pedagogical simplicity. The more rigorous approach would be to compare 2 occurrences of size $K$ to 1 occurrence of size $2K$ plus 1 regular deviation, but the end graph would not change at all.

.2 tail wagging dogs: an intuition

The tail wags the dog effect
Centrally, the thicker the tails of the distribution, the more the tail wags the dog, that is, the more the information resides in the tails and the less in the "body" (the central part) of the distribution. Effectively, for very fat tailed phenomena, all deviations become informationally sterile except for the large ones. The center becomes just noise. Although the "evidence based" science might not quite get it yet, under such conditions there is no evidence in the body.

This property also explains the slow functioning of the law of large numbers in certain domains, as tail deviations, where the information resides, are (by definition) rare. The property explains why, for instance, a million observations of white swans do not confirm the non-existence of black swans, or why a million confirmatory observations count less than a single disconfirmatory one. We will link this to the Popper-style asymmetries later in the chapter.

Figure . : Iso-densities for two independent Gaussian distributions. The line shows $x + y = 4.1$. Visibly the maximal probability is for $x = y = 2.05$.

Figure . : Iso-densities for two independent thick tailed distributions (in the power law class). The line shows $x + y = 36$. Visibly the maximal probability is for either $x = 36 - \epsilon$ or $y = 36 - \epsilon$, with $\epsilon$ going to 0 as the sum $x + y$ becomes larger.

Figure . : Same representation as in Figure . , but concerning power law distributions with support on the real line; we can see the iso-densities looking more and more like a cross for lower and lower probabilities. More technically, there is a loss of ellipticality.

It also explains why one should never compare random variables driven by the tails (say, pandemics) to ones driven by the body (say, the number of people who drown in their swimming pool). See Cirillo and Taleb ( ) [ ] for the policy implications of systemic risks.

.3 a (more advanced) categorization and its consequences

Let us now consider the degrees of thick tailedness, in a casual way for now (we will get deeper and deeper later in the book). The ranking is by severity.

Distributions:
Thick Tailed ⊃ Subexponential ⊃ Power Law (Paretian)
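The severity ranking can be checked numerically with the defining ratio of the subexponential class from the glossary, $P(X_1 + X_2 > x) / P(X > x)$: it approaches 2 for a Pareto (a single jump dominates the sum) but keeps growing with $x$ for a Gaussian (the sum exceeds the threshold only through two cooperating moderate deviations). A rough Monte Carlo sketch, with the particular distributions and thresholds chosen purely for illustration:

```python
import random

random.seed(42)
N = 10**6

# Pareto with survival P(X > x) = 1/x on [1, inf), via inverse transform
pareto = [1.0 / random.random() for _ in range(N)]
gauss = [random.gauss(0.0, 1.0) for _ in range(N)]

def tail_ratio(sample, x):
    """Estimate P(X1 + X2 > x) / P(X > x) by pairing up one i.i.d. sample."""
    n_pairs = len(sample) // 2
    p_sum = sum(a + b > x for a, b in zip(sample[0::2], sample[1::2])) / n_pairs
    p_one = sum(s > x for s in sample) / len(sample)
    return p_sum / p_one

r_pareto = tail_ratio(pareto, 100.0)  # subexponential: ratio near 2
r_gauss = tail_ratio(gauss, 3.0)      # thin tailed: ratio far above 2

print(f"Pareto  : {r_pareto:.2f}")
print(f"Gaussian: {r_gauss:.2f}")
```

For the Gaussian the ratio in fact diverges as $x$ grows, which is precisely why it sits outside the subexponential class; for the Pareto it settles at $n$ (here 2 summands), the signature of the single-jump, catastrophe-principle regime.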
First there are entry level thick tails. This is any distribution with fatter tails than the Gaussian, i.e., with more observations within $\pm 1$ standard deviation than $\mathrm{erf}\left(\frac{1}{\sqrt{2}}\right) \approx 68.2\%$,* and with kurtosis (a function of the fourth central moment)† higher than 3.

* The error function erf is the integral of the Gaussian distribution: $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt$.
† The moment of order $p$ for a random variable $X$ is the expectation of the $p$-th power of $X$, $\mathbb{E}(X^p)$.

Second, there are subexponential distributions, satisfying our earlier thought experiment (the one illustrating the catastrophe principle). Unless they enter the class of power laws, such distributions are not really thick tailed, because they do not have monstrous impacts from rare events. In other words, they can have all their moments.

Level three is what is called by a variety of names: power law, member of the regularly varying class, or the "Pareto tails" class; these correspond to real thick tails, but the fat-tailedness depends on the parametrization of their tail index. Without getting into a tail index for now, consider that there will be some moment that is infinite, and moments higher than that one will also be infinite.

Figure . : The law of large numbers, that is, how long it takes for the sample mean to stabilize, works much more slowly in Extremistan (here a Pareto distribution with a 1.14 tail exponent, corresponding to the "Pareto 80/20"). Both have the same mean absolute deviation. Note that the same applies to other forms of sampling, such as portfolio theory.

Let us now work our way from the bottom to the top of the central tableau in Figure . . At the bottom left we have the degenerate distribution, where there is only one possible outcome, i.e., no randomness and no variation. Then, above it, there is the Bernoulli distribution, which has two possible outcomes, not more. Then above it there are the two Gaussians.
There is the natural Gaussian (with support on minus and plus infinity), and Gaussians that are reached by adding random walks (with compact support, sort of, unless we have infinite summands).*

* Compact support means the real-valued random variable $X$ takes realizations in a bounded interval, say $[a,b]$, $(a,b]$, $[a,b)$, etc. The Gaussian has an exponential decline $e^{-x^2}$ that accelerates with deviations, so some people, such as Adrien Douady, consider it effectively of compact support.

Figure . : What happens to the distribution of an average as the number of observations $n$ increases? This is the same representation as in Figure . , seen in distribution/probability space. The fat tailed distribution does not compress as easily as the Gaussian. You need a much, much larger sample. It is what it is.

Figure . : The tableau of thick tails, along the various classifications for convergence purposes (i.e., convergence to the law of large numbers, etc.) and the gravity of inferential problems; entries range from Degenerate, Bernoulli, and Thin-Tailed (from convergence to the Gaussian, compact support, or lattice approximation) through Subexponential, Supercubic, and Stable with $\alpha < 2$, up to "Fuhgetaboudit", with markers for the Cramér condition, the (weak) law of large numbers, and Central Limit/Berry-Esseen convergence issues. Power Laws are in white, the rest in yellow. See Embrechts et al [ ].

These are completely different animals, since one can deliver infinity and the other
So here we have what are called power laws.We rank them by their tail index a , on which later; take for now that the lower thetail index, the fatter the tails. When the tail index is a (cid:20) a = 3 is cubic). That’s an informal borderline: the distribution has no momentother than the first and second, meaning both the laws of large number and thecentral limit theorem apply in theory.Then there is a class with a (cid:20) a less than 2 not explicitly in thatclass; but in theory, as we add add up variables, the sum ends up in that classrather than in the Gaussian one thanks to something called the generalized centrallimit theorem, GCLT ). From here up we are increasingly in trouble because thereis no variance. For 1 (cid:20) a (cid:20) Technical point: Let X be a random variable. The Cramer condition: for all r > E ( e rX ) < + ¥ ,where E is the expectation operator. Take for now the following definition for the law of large numbers: it roughly states that if a distributionhas a finite mean, and you add independent random variables drawn from it —that is, your sample getslarger— you eventually converge to the mean. How quickly? that is the question and the topic of thisbook. We will address ad nauseam the central limit theorem but here is the initial intuition. It states that n -summed independent random variables with finite second moment end up looking like a Gaussiandistribution. Nice story, but how fast? Power laws on paper need an infinity of such summands, meaningthey never really reach the Gaussian. Chapter deals with the limiting distributions and answers thecentral question: "how fast?" both for CLT and LLN. How fast is a big deal because in the real world wehave something different from n equals infinity. 
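The practical cost of leaving the yellow zone can be sketched with a quick simulation in the spirit of the LLN figure above; the Pareto tail exponent 1.14 (the "80/20" benchmark), the sample size, and the replication count are illustrative assumptions:

```python
import random
import statistics

random.seed(7)

ALPHA = 1.14                       # Pareto "80/20" tail exponent
TRUE_MEAN = ALPHA / (ALPHA - 1.0)  # mean of a Pareto on [1, inf)

def pareto_sample(n):
    # Inverse transform for survival P(X > x) = x**(-ALPHA), x >= 1
    return [random.random() ** (-1.0 / ALPHA) for _ in range(n)]

def gauss_sample(n):
    return [random.gauss(1.0, 1.0) for _ in range(n)]

def median_rel_error(sampler, true_mean, n, reps=30):
    """Median relative error of the sample mean across replications."""
    errs = [abs(statistics.fmean(sampler(n)) - true_mean) / true_mean
            for _ in range(reps)]
    return statistics.median(errs)

n = 10_000
err_gauss = median_rel_error(gauss_sample, 1.0, n)
err_pareto = median_rel_error(pareto_sample, TRUE_MEAN, n)

print(f"median relative error of the sample mean at n = {n}:")
print(f"  Gaussian(1, 1): {err_gauss:.4f}")   # around a percent or less
print(f"  Pareto(a=1.14): {err_pareto:.4f}")  # dozens of times larger
```

Adding data helps far less than Gaussian intuition suggests: with $\alpha$ just above 1, the sample mean converges so slowly that the preasymptotics, not the asymptote, is what the observer actually lives with.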
.3 a (more advanced) categorization and its consequences

Summary of the problem with overstandardized statistics

Statistical estimation is based on two elements: the central limit theorem (which is assumed to work for "large" sums, thus making about everything conveniently normal) and the law of large numbers, which reduces the variance of the estimation as one increases the sample size. However, things are not so simple; there are caveats. In Chapter , we show how sampling is distribution dependent and varies greatly within the same class. As shown by Bouchaud and Potters in [ ] and Sornette in [ ], the tails of some distributions with finite variance but infinite higher moments can, under summation, converge to the Gaussian only within a band of ±√(n log n): the center of the distribution inside such a band becomes Gaussian, but the remote parts, the tails, don't, and the remote parts determine so much of the properties.

Life happens in the preasymptotics.

Sadly, in the entry on estimators in the monumental Encyclopedia of Statistical Sciences [ ], W. Hoeffding writes:

"The exact distribution of a statistic is usually highly complicated and difficult to work with. Hence the need to approximate the exact distribution by a distribution of a simpler form whose properties are more transparent. The limit theorems of probability theory provide an important tool for such approximations. In particular, the classical central limit theorems state that the sum of a large number of independent random variables is approximately normally distributed under general conditions. In fact, the normal distribution plays a dominating role among the possible limit distributions. To quote from Gnedenko and Kolmogorov's text [[ ], Chap.
]: "Whereas for the convergence of distribution functions ofsums of independent variables to the normal law only restric-tions of a very general kind, apart from that of being infinites-imal (or asymptotically constant), have to be imposed on thesummands, for the convergence to another limit law somevery special properties are required of the summands" .Moreover, many statistics behave asymptotically like sums of in-dependent random variables. All of this helps to explain theimportance of the normal distribution as an asymptotic distribu-tion."Now what if we do not reach the normal distribution, as life happens beforethe asymptote? This is what this book is about. aa The reader is invited to consult a "statistical estimation" entry in any textbook or online encyclope-dia. Odds are that the notion of "what happens if we do not reach the asymptote" will never bediscussed –as in the pages of the monumental
Encyclopedia of Statistical Sciences. Further, ask a regular user of statistics how much data one needs for such and such distributions, and don't be surprised at the answer. The problem is that people have too many prepackaged statistical tools in their heads, ones they never had to rederive themselves. The motto here is: "statistics is never standard".
Figure . : In the presence of thick tails, we can fit markedly different regression lines to the same story (the Gauss-Markov theorem, necessary to allow for linear regression methods, doesn't apply anymore). Left: a regular (naïve) regression. Right: a regression line that tries to accommodate the large deviation, a "hedge ratio" so to speak, one that protects the agent from a large deviation but mistracks small ones. Missing the largest deviation can be fatal. Note that the sample doesn't include the critical observation, but it has been guessed using "shadow mean" methods.
Figure . : Inequality measures such as the Gini coefficient require completely different methods of estimation under thick tails, as we will see in Part III. Science is hard.
Here are some consequences of moving out of the yellow zone, the statistical comfort zone:

.4 the main consequences and how they link to the book

Consequence

The law of large numbers, when it works, works too slowly in the real world.
This is more shocking than you think, as it cancels most statistical estimators. See Figure . in this chapter for an illustration. The subject is treated in Chapter and distributions are classified accordingly.

Consequence

The mean of the distribution will rarely correspond to the sample mean; it will have a persistent small-sample effect (downward or upward), particularly when the distribution is skewed (or one-tailed).
This is another problem of sample insufficiency. In fact, there is no very thick-tailed, one-tailed distribution in which the population mean can be properly estimated directly from the sample mean: rare events determine the mean, and these, being rare, take a lot of data to show up. Consider that some power laws (like the one described as the "80/20" in common parlance) have the overwhelming majority of observations falling below the true mean. For the sample average to be informative, we need orders of magnitude more data than we have (people in economics still do not understand this, though traders have an intuitive grasp of the point). The problem is discussed briefly further down in . , and more formally in the "shadow mean" chapters, Chapters and . Further, we will introduce the notion of hidden properties in . . Clearly, by the same token, variance is likely to be underestimated.

Consequence

Metrics such as standard deviation and variance are not useable.
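The sample-mean shortfall just described is easy to simulate. A small sketch (my own, with hypothetical parameter choices, not code from the book): draw one-tailed Pareto samples with tail exponent α = 1.2, for which the true mean is α/(α − 1) = 6, and count how often the sample mean falls below it.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, trials = 1.2, 1000, 2000
true_mean = alpha / (alpha - 1)        # Pareto with minimum 1: mean = alpha/(alpha-1) = 6

# 2000 independent samples of 1000 one-tailed Pareto draws each
samples = rng.uniform(size=(trials, n)) ** (-1 / alpha)
sample_means = samples.mean(axis=1)

frac_below = np.mean(sample_means < true_mean)
print(f"true mean: {true_mean:.1f}")
print(f"fraction of sample means below the true mean: {frac_below:.2f}")
```

Most samples understate the mean; the occasional sample that overstates it does so wildly, which is the signature of the hidden (shadow) mean.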
They fail out of sample, even when they exist; even when all moments exist. Discussed in ample detail in Chapter . It is a scientific error that the notion of standard deviation (often mistaken for average deviation by its users) found its way as a measure of variation, as it is only very narrowly accurate in what it purports to do, and in the best of circumstances.

Consequence

Beta, Sharpe ratio and other common hackneyed financial metrics are uninformative.

Footnote: What we call preasymptotics is the behavior of a sum or sequence when n is large but not infinite. This is (sort of) the focus of this book.

Footnote: The population mean is the average if we sampled the entire population. The sample mean is, obviously, what we have in front of us. Sometimes, as with wealth or war casualties, we can have the entire population, yet the population mean isn't that of the sample. In these situations we use the concept of "shadow mean", which is the expectation as determined by the data generating process or mechanism.

Figure . : We plot the Sharpe ratio of hedge funds on the horizontal axis as computed up to the crisis of , and their subsequent losses expressed in standard deviations during the crisis. Not only does the Sharpe ratio completely fail to predict out-of-sample performance but, if anything, it can be seen as a weak predictor of failure. Courtesy Raphael Douady.
This is a simple consequence of the previous point: either they require much more data, many more orders of magnitude, or some different model than the one being used, of which we are not yet aware. Figure . shows how the Sharpe ratio, supposed to predict performance, fails out of sample: it acts in exact reverse of the intention. Yet it is still used, because people can be suckers for numbers.

Practically every single economic variable and financial security is thick tailed. Of the many thousands of securities examined, not one appeared to be thin-tailed. This is the main source of failure in finance and economics.

Financial theorists claim something highly unrigorous like "if the first two moments exist, then mean-variance portfolio theory works, even if the distribution has thick tails" (they add some conditions of ellipticality we will discuss later). The main problem is that even if variance exists, we don't know what it is with acceptable precision; it obeys a slow law of large numbers, because the second moment of a random variable is necessarily more thick tailed than the variable itself. Further, stochastic correlations or covariances also represent a form of thick tails (or loss of ellipticality), which invalidates these metrics.

Practically any paper in economics using covariance matrices is suspicious.

Details are in Chapter for the univariate case and Chapter for multivariate situations.

Consequence

Robust statistics is not robust and the empirical distribution is not empirical.

Footnote: Roughly, Beta is a metric showing how much an asset A is expected to move in response to a move in the general market (or a given benchmark or index), expressed as the ratio of the covariance between A and the market over the variance of the market. The Sharpe ratio expresses the average return (or excess return) of an asset or a strategy divided by its standard deviation.

The story of my life. Much like the Soviet official journal was named
Pravda, which means "truth" in Russian, almost as a joke, robust statistics are a type of prank, except that most professionals aren't aware of it.

First, robust statistics shoots for measures that can handle tail events (large observations) without changing much. This is the wrong idea of robustness: a metric that doesn't change in response to a tail event may be doing so precisely because it is uninformative. Further, these measures do not help with expected payoffs. Second, robust statistics is usually associated with a branch called "nonparametric" statistics, under the impression that the absence of parameters will make the analysis less distribution dependent. This book shows, all across, that it makes things worse.

The winsorization of the data, by removing outliers, distorts the expectation operation and actually reduces information (though it would be a good idea to check whether the outlier is real or a fake outlier of the type we call in finance a "bad print", some clerical error or computer glitch).

The so-called (nonparametric) "empirical distribution" is not empirical at all (it misrepresents the expected payoffs in the tails), as we will show in Chapter ; this is at least the case for the way it is used in finance and risk management. Take for now the following explanation: future maxima are poorly tracked by past data without some intelligent extrapolation.

Consider someone looking at building a flood protection system with levees. The naively obtained "empirical" distribution will show the worst past flood level, the past maximum. Any worse level will have zero probability (or so). But by definition, if it was a past maximum, it had to have exceeded the previous maximum in order to become one, and the empirical distribution would have missed it. Under thick tails, the gap between the past maximum and the future expected maximum is much larger than under thin tails.

Consequence

Linear least-square regression doesn't work (failure of the Gauss-Markov theorem).
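The flood-levee point about past versus future maxima can be checked by splitting a sample into a "past" and a "future" window. A rough sketch (illustrative parameters of my own choosing, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(3)
trials, n = 4000, 1000

def frac_future_max_doubles(x):
    # x has shape (trials, 2, n): a "past" window and a "future" window per trial
    past_max = x[:, 0, :].max(axis=1)
    future_max = x[:, 1, :].max(axis=1)
    return np.mean(future_max > 2 * past_max)

gauss = rng.normal(size=(trials, 2, n))
pareto = rng.uniform(size=(trials, 2, n)) ** (-1 / 1.5)   # tail exponent alpha = 1.5

frac_gauss = frac_future_max_doubles(gauss)
frac_pareto = frac_future_max_doubles(pareto)
print(f"Gaussian: future max more than doubles the past max in {frac_gauss:.1%} of runs")
print(f"Pareto  : future max more than doubles the past max in {frac_pareto:.1%} of runs")
```

Under thin tails the worst past observation is a decent guide to the worst future one; under a Pareto tail, a future maximum dwarfing the past one is routine, which is exactly what the naive "empirical" distribution assigns (near) zero probability to.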
See Figure . and the commentary. The logic behind the least-square minimization method is the Gauss-Markov theorem, which explicitly requires a thin-tailed distribution for the line going through the data points to be unique. So either we need a lot, a lot of data to minimize the squared deviations (in other words, the Gauss-Markov theorem applies, but not in our preasymptotic situations, as the real world has finite, not infinite, data), or we can't, because the second moment does not exist. In the latter case, if we minimize mean absolute deviations (MAD), as we see in . , not only may we still face an insufficiency of data for proper convergence, but the regression slope may not be unique.

We discuss the point in some detail in . and show how thick tails produce an in-sample coefficient of determination (R²) higher than the real one, because of the small-sample effect of thick tails. When variance is infinite, R² should be 0. But because samples are necessarily finite, it will show, deceivingly, higher numbers than 0. Effectively, to conclude, under thick tails R² is useless, uninformative, and often (as with IQ studies) downright fraudulent.

Consequence

Maximum likelihood methods can work well for some parameters of the distribution (good news).
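A sketch of the Gauss-Markov failure just described (my own illustration, not from the book): with Cauchy, i.e. infinite-variance, errors the dispersion of the least-squares slope does not shrink as the sample grows a hundredfold, while with Gaussian errors it shrinks at the usual 1/√n rate.

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 1000

def slope_iqr(n, noise):
    """Interquartile range, across trials, of the OLS slope of y = 0*x + noise."""
    x = rng.normal(size=(trials, n))
    e = noise((trials, n))
    beta = (x * e).sum(axis=1) / (x * x).sum(axis=1)   # per-trial least-squares slope
    q25, q75 = np.percentile(beta, [25, 75])
    return q75 - q25

iqr = {}
for n in (100, 10_000):
    iqr[("gauss", n)] = slope_iqr(n, lambda s: rng.normal(size=s))
    iqr[("cauchy", n)] = slope_iqr(n, lambda s: rng.standard_cauchy(size=s))
    print(f"n={n:6d}  slope IQR with Gaussian errors: {iqr[('gauss', n)]:.4f}   "
          f"with Cauchy errors: {iqr[('cauchy', n)]:.4f}")
```

With Gaussian errors the slope dispersion drops roughly tenfold when n goes from 100 to 10,000; with Cauchy errors it barely moves: more data does not buy convergence.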
Take a power law. We may estimate a parameter for its shape, the tail exponent (for which we use the symbol α in this book), which, with the addition of some other parameter (the scale), connects us back to the mean considerably better than going about it directly by sampling the mean. Example:
The mean of a simple Pareto distribution with minimum value L, tail exponent α, and PDF αL^α x^(−α−1) is αL/(α − 1), a function of α. So we can get the mean from these two parameters, one of which may already be known. This is what we call a "plug-in" estimator. One can estimate α with low error, with visual aid or using maximum likelihood methods with low variance (the estimator is inverse-gamma distributed), then get the mean. It beats the direct observation of the mean. The logic is worth emphasizing:

The tail exponent α captures, by extrapolation, the low-probability deviation not seen in the data, but that plays a disproportionately large share in determining the mean.

This generalized approach to estimators is also applied to the Gini and other inequality estimators. So we can produce more reliable (or at least less unreliable) estimators for, say, a function of the tail exponent in some situations. But, of course, not all.

Now a real-world question is warranted: what do we do when we do not have a reliable estimator? Better stay home. We must not expose ourselves to harm in the presence of fragility, but we can still take risky decisions if we are bounded for maximum losses (Figure . ).

Consequence

The gap between disconfirmatory and confirmatory empiricism is wider than in situations covered by common statistics, i.e., the difference between absence of evidence and evidence of absence becomes larger. (What is called "evidence based" science, unless rigorously disconfirmatory, is usually interpolative, evidence-free, and unscientific.)
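The plug-in logic above can be tested numerically. A hedged sketch (parameter choices are mine, not the book's): estimate α by maximum likelihood, map it to the mean via αL/(α − 1), and compare the error to that of the raw sample mean.

```python
import numpy as np

rng = np.random.default_rng(11)
alpha, L, n, trials = 1.2, 1.0, 1000, 1000
true_mean = alpha * L / (alpha - 1)             # = 6 for alpha = 1.2, L = 1

x = L * rng.uniform(size=(trials, n)) ** (-1 / alpha)

sample_means = x.mean(axis=1)
alpha_hat = n / np.log(x / L).sum(axis=1)       # maximum-likelihood estimate of the tail exponent
plug_in_means = alpha_hat * L / (alpha_hat - 1) # plug-in mean: alpha*L/(alpha-1)

mae_sample = np.abs(sample_means - true_mean).mean()
mae_plugin = np.abs(plug_in_means - true_mean).mean()
print(f"mean absolute error, direct sample mean: {mae_sample:.2f}")
print(f"mean absolute error, plug-in estimator : {mae_plugin:.2f}")
```

The tail exponent is a well-behaved, thin-tailed quantity to estimate; the mean is not. Going through α and plugging in beats sampling the mean directly.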
From a controversy the author had with the cognitive linguist and science writer Steven Pinker: making pronouncements (and generating theories) from recent variations in data is not acceptable unless one meets some standard of significance, which requires more data under thick tails (the same logic as that of the slow LLN). Stating "violence has dropped" because the number of people killed in wars has declined from the previous year or decade is not a scientific statement: a scientific claim distinguishes itself from an anecdote in that it aims at affecting what happens out of sample, hence the concept of statistical significance.

Let us repeat that non-statistically-significant statements are not the realm of science. However, saying violence has risen upon a single observation may be a rigorously scientific claim. The practice of reading into descriptive statistics may be acceptable under thin tails (as sample sizes do not have to be large), but never so under thick tails, except, to repeat, in the presence of a large deviation.

Consequence

Principal component analysis (PCA) and factor analysis are likely to produce spurious factors and loads.

Footnote: To clear up the terminology: in this book the tail exponent, commonly written α, is the limit of the quotient of the log of the survival function in excess of K over log K, which would be 1 for the Cauchy. Some researchers use α − 1.
This point is a bit technical; it adapts the notion of sample insufficiency to large random vectors seen via the dimension-reduction technique called principal component analysis (PCA). The issue is a higher-dimensional version of our law of large numbers complications. The story is best explained in Figure . , which shows the accentuation of what is called the "Wigner effect" from insufficiency of data for the PCA. Also, to be technical, note that the Marchenko-Pastur distribution is not applicable in the absence of a finite fourth moment (or, as has been shown in [ ], for a tail exponent in excess of ).

Figure . : Under thick tails (to the left), mistakes are terminal. Under thin tails (to the right) they can be great learning experiences. Source: You Had One Job.
Consequence The method of moments (MoM) fails to work. Higher moments are uninformativeor do not exist.
The same applies to the GMM, the generalized method of moments, crowned with a Bank of Sweden Prize known as a Nobel. This is a long story, but take for now that the estimation of a given distribution by moment matching fails if the higher moments are not finite, so every sample delivers a different moment, as we will soon see with the 4th moment of the S&P. To be even more technical, principal components are independent when correlations are zero. However, for fat tailed distributions, as we will see more technically in . . , absence of correlation does not imply independence.

Simply, higher moments for thick tailed distributions are explosive. Particularly in economics.
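The explosiveness of higher moments is easy to exhibit. A minimal illustration (my own choice of distribution, not data from the book): a Student-t with 3.5 degrees of freedom has a finite variance but an infinite fourth moment, so the sample kurtosis never settles, and every sample delivers a different "moment".

```python
import numpy as np

rng = np.random.default_rng(9)

def excess_kurtosis(x):
    z = x - x.mean()
    return (z**4).mean() / (z**2).mean() ** 2 - 3.0

# Student-t with 3.5 degrees of freedom: variance is finite, the fourth moment is not,
# so moment matching on the kurtosis has nothing stable to match.
kurts = np.array([excess_kurtosis(rng.standard_t(3.5, 50_000)) for _ in range(10)])
print("sample excess kurtosis across ten disjoint samples of 50,000 draws:")
print(np.round(kurts, 1))
```

Ten equally large samples from the same generator disagree by multiples on the "fourth moment", which is what moment matching would be trying to fit.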
Consequence There is no such thing as a typical large deviation.
Conditional on having a "large" move, the magnitude of such a move is not convergent, especially under serious thick tails (the power law class). This is associated with the catastrophe principle we saw earlier. In the Gaussian world, the expectation of a movement, conditional on the movement exceeding 4 standard deviations, is about 4 standard deviations. For a power law it will be a multiple of that. We call this the Lindy property, discussed particularly in Chapter .

Consequence

The Gini coefficient ceases to be additive.
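The "no typical large deviation" point can be made with closed forms rather than simulation. A small sketch (my own; it uses the standard inverse-Mills ratio for the Gaussian and the textbook result αK/(α − 1) for a Pareto tail):

```python
from math import erfc, exp, pi, sqrt

def gaussian_mean_excess(K):
    """E[X | X > K] for a standard Gaussian: phi(K) / (1 - Phi(K))."""
    phi = exp(-K * K / 2) / sqrt(2 * pi)
    tail = erfc(K / sqrt(2)) / 2
    return phi / tail

def pareto_mean_excess(K, alpha):
    """E[X | X > K] in a Pareto tail with exponent alpha: alpha*K/(alpha-1)."""
    return alpha * K / (alpha - 1)

K = 4.0
print(f"Gaussian    : E[X | X > 4 sigma] = {gaussian_mean_excess(K):.2f}")
print(f"Pareto(1.5) : E[X | X > 4]       = {pareto_mean_excess(K, 1.5):.1f}")
```

For the Gaussian the conditional expectation hugs the threshold (about 4.23 at K = 4); for a Pareto with α = 1.5 it is three times the threshold at every K, so there is no "typical" size of a large move.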
Methods of measuring sample data for the Gini are interpolative; they in effect have the same problem we saw earlier with the sample mean underestimating or overestimating the true mean. Here an additional complexity arises, as the Gini becomes super-additive under thick tails. As the sampling space grows, the conventional Gini measurements give an illusion of large concentrations of wealth. (In other words, inequality in a continent, say Europe, can be higher than the weighted average inequality of its members.) The same applies to other measures of concentration, such as the share of the total wealth owned by the top 1%, etc. The derivations are in Chapters and .

Consequence

Large deviation theory fails to apply to thick tails. I mean, it really doesn't apply.
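The small-sample bias behind the super-additivity can be illustrated as follows (a sketch of my own; the nonparametric Gini here is the standard sorted-index formula): under a Pareto with α = 1.2, the measured Gini grows with the sampling space toward its asymptotic value 1/(2α − 1), so pooled measurements exceed the average of subgroup measurements.

```python
import numpy as np

rng = np.random.default_rng(13)
alpha = 1.2
asymptotic_gini = 1 / (2 * alpha - 1)        # ~0.714 for a Pareto with alpha = 1.2

def sample_gini(x):
    xs = np.sort(x)
    n = len(xs)
    i = np.arange(1, n + 1)
    # Standard nonparametric Gini from the order statistics
    return 2 * np.sum(i * xs) / (n * xs.sum()) - (n + 1) / n

def mean_sample_gini(n, trials=200):
    return np.mean([sample_gini(rng.uniform(size=n) ** (-1 / alpha))
                    for _ in range(trials)])

g_small, g_large = mean_sample_gini(100), mean_sample_gini(10_000)
print(f"asymptotic Gini                  : {asymptotic_gini:.3f}")
print(f"mean nonparametric Gini, n=100   : {g_small:.3f}")
print(f"mean nonparametric Gini, n=10000 : {g_large:.3f}")
```

The interpolative estimator is downward biased in small samples and the bias shrinks only slowly with n, which is why measured inequality of an aggregate exceeds the weighted average of its parts.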
I really mean it doesn't apply. The methods behind the large deviation principle (Varadhan [ ], Dembo and Zeitouni [ ], etc.) are very useful, but in the thin-tailed world. And there only. See the discussion and derivations in Appendix C, as well as the limit theorem chapters, particularly Chapter .

Consequence

Risks of financial options are never mitigated by dynamic hedging.
This might be technical and uninteresting for nonfinance people, but the entire basis of financial hedging behind Black-Scholes rests on the possibility and the necessity of dynamic hedging, both of which will be shown to be erroneous in Chapters , , and . The required exponential decline of deviations away from the center requires the probability distribution to be outside the subexponential class. Again, we are talking about something related to the Cramér condition: it all boils down to that exponential moment.

Recall that the author has been an option trader; to option traders, dynamic hedging is not the way prices are derived, and it has been so, as shown by Haug and the author, for centuries.

Footnote: Do not confuse large deviation theory, LDT, with extreme value theory, EVT, which covers all major classes of distributions.

Consequence

Forecasting in frequency space diverges from expected payoff.
And also:
Consequence

Much of the claims in the psychology and decision-making literature concerning the "overestimation of tail probability" and irrational behavior with respect to rare events come from researchers' misunderstanding of tail risk, conflation of probability and expected payoff, misuse of probability distributions, and ignorance of extreme value theory (EVT).
This point is explored in the next section and in an entire chapter (Chapter ?? ): the foolish notion of focusing on frequency rather than expectation can carry a mild effect under thin tails; not under thick tails. Figures . and . show the effect.

Consequence

Ruin problems are more acute and ergodicity is required under thick tails.
This is a bit technical, but it is explained at the end of this chapter. Let us now discuss some of the points.

. . Forecasting

In Fooled by Randomness, a character is asked which is more probable: that a given market would go higher or lower by the end of the month. Higher, he said, much more probable. But then it was revealed that he was making trades that benefit if that particular market goes down. This, of course, appears paradoxical to the nonprobabilist, but is very ordinary for traders, particularly under nonstandard distributions (yes, the market is more likely to go up, but should it go down it will fall much, much more). This illustrates the common confusion between a forecast and an exposure (a forecast is a binary outcome; an exposure has more nuanced results and depends on the full distribution). This example shows one of the extremely elementary mistakes of talking about probability presented as
Figure . : Probabilistic calibration as seen in the psychology literature. The x axis shows the estimated probability produced by the forecaster, the y axis the actual realizations, so that if a weather forecaster predicts a given chance of rain, and rain occurs at that frequency, they are deemed "calibrated". We hold that calibration in frequency (probability) space is an academic exercise (in the bad sense of the word) that mistracks real-life outcomes outside narrow binary bets. It is particularly fallacious under thick tails. The point is discussed at length in Chapter .
Figure . : How miscalibration in probability corresponds to miscalibration in payoff under power laws. The distribution under consideration is a Pareto with tail index α = 1.15. Again, the point is discussed at length in Chapter .

single numbers, not distributions of outcomes; but when we go deeper into the subject, many less obvious, or less known, paradox-style problems occur. Simply, it is the opinion of the author that it is not rigorous to talk about "probability" as a final product, or even as a "foundation" of decisions.

In the real world one is not paid in probability, but in dollars (or in survival, etc.). The fatter the tails, the more one needs to worry about payoff space; the saying goes: "payoff swamps probability" (see box). One can be wrong very frequently if the cost is low, so long as one is convex to payoff (i.e., one makes large gains when one is right). Further, one can be forecasting with near-perfect accuracy and still go bust (in fact, be more likely to go bust: funds with impeccable track records were those that went bust during the rout). A point that may be technical for those outside quantitative finance: it is the difference between a vanilla option and a corresponding binary of the same strike, as discussed in Dynamic Hedging [ ]: counterintuitively, thick-tailedness lowers the value of the binary and raises that of the vanilla. This is expressed by the author's adage: "I've never seen a rich forecaster." We will examine in depth in . . how fattening the tails causes the probability of events beyond some number of standard deviations to drop, but their consequences to rise (in terms of contribution to moments, say the effect on the mean or other metrics). Figure . shows the extent of the problem.
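The claim that fattening the tails lowers the probability of moderate deviations while raising the weight of extremes can be checked in closed form. A toy sketch (my own construction, not the book's derivation): mix two Gaussian variances, 0.5 and 1.5, keeping the unconditional variance at 1, and compare tail probabilities with the plain unit-variance Gaussian.

```python
from math import erfc, sqrt

def tail_prob(K, var):
    """P(X > K) for a centered Gaussian with variance var."""
    return erfc(K / sqrt(2 * var)) / 2

def fattened_tail_prob(K):
    # Crude "stochastic volatility": variance is 0.5 or 1.5 with equal odds.
    # The unconditional variance stays 1, but the tails fatten (excess kurtosis > 0).
    return 0.5 * tail_prob(K, 0.5) + 0.5 * tail_prob(K, 1.5)

for K in (1.0, 4.0):
    print(f"K = {K}: thin-tailed P = {tail_prob(K, 1.0):.2e}, fattened P = {fattened_tail_prob(K):.2e}")
```

The moderate ("typical") deviation at one sigma becomes less frequent after fattening, while the four-sigma event becomes roughly an order of magnitude more likely: frequency falls where payoffs are tame and rises where payoffs are ruinous, which is why the binary loses value while the vanilla gains.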
Remark

Probabilistic forecast errors ("calibration") are in a different probability class from true real-world P/L variations (or true payoffs).

"Calibration", a measure of how accurate one's predictions are, lives in probability space, between 0 and 1. Any standard measure of such calibration will necessarily be thin-tailed (if anything, extra thin-tailed, since it is bounded), whether the random variable being predicted is thick tailed or not. On the other hand, payoffs in the real world can be thick tailed, hence the distribution of miscalibration in payoff space will follow the properties of the underlying random variable. We show full derivations and proofs in Chapter .

. . The Law of Large Numbers
Let us now discuss the law of large numbers, which is the basis of much of statistics. The law of large numbers tells us that as we add observations the mean becomes more stable, the rate being around √n. Figure . shows that it takes many more observations under a fat-tailed distribution (on the right hand side) for the mean to stabilize.

Footnote: The "equivalence" is not straightforward.

Footnote: R. Douady, data from Risk Data about funds that collapsed in the crisis, personal communication.

Payoff swamps probability in Extremistan: To see the main difference between Mediocristan and Extremistan, consider the event of a plane crash. A lot of people will lose their lives, something very sad, say a few hundred people, so the event is counted as a bad episode, a single one. For forecasting and risk management, we work on minimizing the probability of such an event to make it negligible.

Now, consider a type of plane crash that would kill all the people who ever rode the plane, even all passengers who ever rode planes in the past. All. Is it the same type of event? The latter event is in Extremistan and, for these, we don't talk about probability but focus instead on the magnitude of the event.
• For the first type, management consists in reducing the probability, the frequency, of such events. Remember that we count events and aim at reducing their counts.
• For the second type, it consists in reducing the effect should such an event take place. We do not count events, we measure impact.

If you think the thought experiment is a bit weird, consider that the money center banks once lost more money than they had ever made in their history, the Savings and Loan industry (now gone) later did the same, and the entire banking system lost every penny ever made in the crisis. One can routinely witness people lose everything they earned cumulatively in a single market event. The same applies to many, many industries (e.g., automakers and airlines).

But banks are only about money; consider that for wars we cannot afford the naive focus on event frequency without taking into account the magnitude, as done by the science writer Steven Pinker in [ ], discussed in Chapter . This is without even examining the ruin problems (and nonergodicity) presented at the end of this section. More technically, one needs to meet the Cramér condition of non-subexponentiality for a tally of events (taken at face value), for raw probability to have any meaning at all. The plane analogy was proposed by the insightful Russ Roberts during one of his EconTalk podcasts with the author.

One of the best known statistical phenomena is Pareto's 80/20, e.g., twenty percent of Italians own eighty percent of the land. Table . shows that while it takes 30 observations in the Gaussian to stabilize the mean up to a given level, it takes many orders of magnitude more observations in the Pareto to bring the sample error down by the same amount (assuming the mean exists).

Despite this being trivial to compute, few people compute it. You cannot make claims about the stability of the sample mean with a thick tailed distribution. There are other ways to do this, but not from observations on the sample mean.

.5 epistemology and inferential asymmetry

Figure . : Life is about payoffs, not forecasting, and the difference increases in Extremistan. (Why "Gabish" rather than "capisce"?
Gabish is the recreated pronunciation of Siculo-Galabrez (Calabrese); the "p" used to sound like a "b" and the "g" like a Semitic kof, a hard K, from Punic. Much like capicoli is "gabagool".)
Table . : Corresponding n_α, or how many observations are needed to get a given drop in the error around the mean for an equivalent α-stable distribution (the measure is discussed in more detail in Chapter ). The Gaussian case is α = 2. For the case with tails equivalent to the 80/20, one needs vastly more data than the Gaussian.

α      n_α (Symmetric)    n_α (Skewed)    n_α (One-tailed)
1      Fughedaboudit      -               -
—      567                613             737
—      165                171             186
—      75                 77              79
—      44                 44              44
2      30                 30              30
Definition . (Asymmetry in distributions)

It is much easier for a criminal to fake being an honest person than for an honest person to fake being a criminal. Likewise, it is easier for a fat-tailed distribution to fake being thin tailed than for a thin tailed distribution to fake being thick tailed.
Figure . : The Masquerade Problem (or Central Asymmetry in Inference). Left, an apparently degenerate case: a random variable taking seemingly constant values, with a histogram producing a Dirac stick; one cannot rule out nondegeneracy. Right: more data shows nondegeneracy; the plot exhibits more than one realization, so one can rule out degeneracy. This central asymmetry can be generalized to put some rigor into statements like "failure to reject", as the notion of what is rejected needs to be refined. We can use the asymmetry to produce rigorous rules.
Principle . (Epistemology: the invisibility of the generator)

• We do not observe probability distributions, just realizations.
• A probability distribution cannot tell you if the realization belongs to it.
• You need a meta-probability distribution to discuss tail events (i.e., the conditional probability for the variable to belong to a certain distribution vs. others).
Let us now examine the epistemological consequences. Figure . illustrates the Masquerade Problem (or Central Asymmetry in Inference). On the left is a degenerate random variable taking seemingly constant values, with a histogram producing a Dirac stick.

We have known at least since Sextus Empiricus that we cannot rule out nondegeneracy, but there are situations in which we can rule out degeneracy. If I see a distribution that has no randomness, I cannot say it is not random. That is, we cannot say there are no Black Swans. Let us now add one observation. I can now see it is random, and I can rule out degeneracy. I can say it is not "not random". On the right hand side we have seen a Black Swan; therefore the statement that there are no Black Swans is wrong. This is the negative empiricism that underpins Western science. As we gather information, we can rule things out. The distribution on the right can hide as the distribution on the left, but the distribution on the left cannot hide as the distribution on the right. This gives us a very easy way to deal with randomness. Figure . generalizes the problem to how we can eliminate distributions.
Figure . : "The probabilistic veil". Taleb and Pilpel [ ] cover the point from an epistemological standpoint with the "veil" thought experiment, by which an observer is supplied with data (generated by someone with "perfect statistical information", that is, producing it from a generator of time series). The observer, not knowing the generating process, and basing his information on data and data only, would have to come up with an estimate of the statistical properties (probabilities, mean, variance, value-at-risk, etc.). Clearly, the observer, having incomplete information about the generator and no reliable theory about what the data corresponds to, will always make mistakes, but these mistakes have a certain pattern. This is the central problem of risk management.

If we see a many-sigma event, we can rule out that the distribution is thin-tailed. If we see no large deviation, we cannot rule out that it is thick tailed, unless we understand the process very well. This is how we can rank distributions. If we reconsider Figure . we can start seeing deviations and ruling out progressively from the bottom. These ranks are based on how distributions can deliver tail events. Ranking distributions (by order of priority for the sake of inference) becomes very simple. Consider the logic: if someone tells you there was a ten-sigma event, it is much more likely that they have the wrong distribution than that you really have a ten-sigma event (we will refine the argument later in this chapter). Likewise, as we saw, thick tailed distributions do not deliver a lot of deviation from the mean; but once in a while you get a big deviation. So we can now rule out what is not Mediocristan: we can rule out Mediocristan. I can say this distribution is thick tailed by elimination. But I cannot certify that it is thin tailed. This is the Black Swan problem.
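The ten-sigma logic can be quantified with a likelihood ratio. A toy computation (my own; the unit-variance Student-t with 3 degrees of freedom stands in for "some fat-tailed alternative", not for any specific model in the book):

```python
from math import gamma, pi, sqrt, exp

def normal_pdf(x):
    return exp(-x * x / 2) / sqrt(2 * pi)

def student_t_pdf(x, df):
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def unit_variance_t_pdf(x, df):
    s = sqrt(df / (df - 2))      # t(df) has variance df/(df-2); rescale to unit variance
    return student_t_pdf(x * s, df) * s

x = 10.0                          # a "ten sigma" observation
lr = unit_variance_t_pdf(x, 3) / normal_pdf(x)
print(f"likelihood ratio, unit-variance Student-t(3) vs Gaussian, at 10 sigma: {lr:.1e}")
```

The ratio is astronomically large, so even a minuscule prior weight on the fat-tailed alternative makes "the distribution is wrong" overwhelmingly more plausible than "we really drew a ten-sigma Gaussian event".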
Application of the Masquerade Problem: Argentina's stock market before and after Aug , . For an illustration of the asymmetry of inference applied

a non-technical overview - the darwin college lecture

Figure . : Popper's solution of the problem of induction lies in asymmetry: relying on disconfirmatory empiricism, that is, focusing on "ruling out" what fails to work, via negativa style. We extend this approach to statistical inference with the probabilistic veil, by progressively ruling out entire classes of distributions.
Scientific Rigor and Asymmetries by the Russian School of Probability

One can believe in the rigor of mathematical statements about probability without falling into the trap of providing naive computations subjected to model error. There is a wonderful awareness of the asymmetry throughout the works of the Russian school of probability –and asymmetry here is the analog of Popper's idea in mathematical space.

Members across three generations: P.L. Chebyshev, A.A. Markov, A.M. Lyapunov, S.N. Bernshtein (i.e., Bernstein), E.E. Slutskii, N.V. Smirnov, L.N. Bol'shev, V.I. Romanovskii, A.N. Kolmogorov, Yu.V. Linnik, and the new generation: V. Petrov, A.N. Nagaev, A. Shyrayev, and a few more.

They had something rather potent in the history of scientific thought: they thought in inequalities, not equalities (most famously Markov, Chebyshev, Bernstein, Lyapunov). They used bounds, not estimates. Even their central limit version was a matter of bounds, which we exploit later by seeing what takes place outside the bounds. They were worlds apart from the new generation of users who think in terms of precise probability –or worse, mechanistic social scientists. Their method accommodates skepticism, one-sided thinking: "A is > x", "A is O(x)" [Big-O: "of order" x], rather than "A = x".

For those working on integrating mathematical rigor into risk bearing, they provide a great source. We always know one side, not the other. We know the lowest value we are willing to pay for insurance, not necessarily the upper bound (or vice versa).^a

a. The way this connects asymmetry to robustness is as follows. Robust is what does not produce variability across perturbations of the parameters of the probability distribution. If there is change, but with an asymmetry, i.e. a concave or convex response to such perturbations, the classification is fragility and antifragility, respectively, see [ ].

Figure . : The Problem of Induction.
The philosophical problem of enumerative induction, expressed in the question: "How many white swans do you need to count before ruling out the future occurrence of a black one?" maps surprisingly well to our problem of the workings of the law of large numbers: "How much data do you need before making a certain claim with an acceptable error rate?"
It turns out that the very nature of statistical inference reposes on a clear definition and quantitative measure of the inductive mechanism. It happens that, under thick tails, we need considerably more data; as we will see in Chapters and , there is a way to gauge the relative speed of the inductive mechanism, even if ultimately the problem of induction cannot be perfectly solved. The problem of induction is generally misattributed to Hume [ ].

Figure . : A Discourse to Show that Skeptical Philosophy is of Great Use in Science by François de La Mothe Le Vayer ( - ), apparently Bishop Huet's source. Every time I find an "original thinker" who figured out the skeptical solution to the Black Swan problem, it turns out that he may just be cribbing a predecessor –not maliciously, but we forget to dig to the roots. As we insist, "Hume's problem" has little to do with Hume, who carried the heavy multi-volume Dictionary of Pierre Bayle (his predecessor) across Europe. I thought it was Huet who was first, but as one digs, new predecessors crop up.

...to parameters of a distribution, or to how a distribution can masquerade as having thinner tails than it actually has: consider what we knew about the Argentinian market before and after the large drop of Aug , (shown in Figure . ).

Figure . : It is not possible to "accept" thin tails; it is very easy to reject thin-tailedness. One distribution can produce jumps, and quiet days will not help rule out their occurrence.

Figure . : A single day reveals the true tails of a distribution: Argentina's stock market before and after Aug , . You may suddenly revise the tails as thicker (lower tail parameter α), never the reverse –it would take a long, long time for that to happen.

Using this reasoning, any future parameter uncertainty should make tails fatter,
not thinner. (Data obtained thanks to Diego Zviovich.) Rafal Weron, in [ ], showed how we are more likely to overestimate the tail index when fitting a stable distribution (a lower tail index means fatter tails).

Let us illustrate one of the problems of thin-tailed thinking in the fat-tailed domain with a real world example. People quote so-called "empirical" data to tell us we are

.6 naive empiricism: ebola should not be compared to falls from ladders

Figure . : Naive empiricism: never compare thick tailed variables to thin tailed ones, since the means do not belong to the same class of distributions. This is a generalized mistake made by The Economist, but very common in the so-called learned discourse. Even the Royal Statistical Society fell for it once they hired a "risk communication" person with a sociology or journalism background to run it.

foolish to worry about ebola when only two Americans died of ebola in . We are told that we should worry more about deaths from diabetes or people tangled in their bedsheets. Let us think about it in terms of tails. If we were to read in the newspaper that billion people had died suddenly, it is far more likely that they died of ebola than of smoking or diabetes or being tangled in their bedsheets.

Principle . : Thou shalt not compare a multiplicative fat-tailed process in Extremistan in the subexponential class to a thin-tailed process from Mediocristan, particularly one that has Chernoff bounds.
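The principle can be made concrete numerically: in a thin-tailed sample the largest observation contributes a negligible share of the total, while under a fat-tailed (subexponential-style) distribution a single observation can dominate the sum. A minimal sketch, assuming Python with NumPy; the tail exponent and sample size are illustrative choices, not the book's:

```python
import numpy as np

def max_share_of_sum(sample: np.ndarray) -> float:
    """Share of the total sum contributed by the single largest observation."""
    return sample.max() / sample.sum()

rng = np.random.default_rng(1)
n = 10_000

# Thin tails (Mediocristan): absolute values of Gaussian draws.
thin = np.abs(rng.standard_normal(n))

# Thick tails (Extremistan): Pareto with tail exponent alpha = 1.1,
# sampled by inverting the survival function P(X > x) = x**(-alpha).
alpha = 1.1
thick = (1.0 - rng.random(n)) ** (-1.0 / alpha)

thin_share = max_share_of_sum(thin)
thick_share = max_share_of_sum(thick)
print(f"max/sum, thin tails : {thin_share:.5f}")
print(f"max/sum, thick tails: {thick_share:.5f}")
```

With a tail exponent near 1, the largest of ten thousand observations routinely accounts for a few percent of the entire sum (the exact figure varies run to run); the Gaussian analogue contributes a few hundredths of a percent.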
This is a simple consequence of the catastrophe principle we saw earlier, as illustrated in Figure . . Alas, few "evidence based" people get (at the time of writing) the tail-wags-the-dog effect.

Figure . : Bill Gates's Naive (Non-Statistical) Empiricism: the founder of Microsoft is promoting and financing the development of the above graph, yet at the same time claiming that the climate is causing an existential risk, not realizing that his arguments conflict, since existential risks are necessarily absent from past data. Furthermore, a closer reading of the graph shows that cancer, heart disease, and Alzheimer's, being ailments of age, do not require the attention on the part of young adults and middle-aged people that terrorism and epidemics warrant. Another logical flaw is that terrorism is precisely low because of the attention it commands. Relax your vigilance and it may go out of control. The same applies to homicide: fears lead to safety. If this map shows something, it is the rationality of common people, with a good tail risk detector, compared to the ignorance of "experts". People are more calibrated to consequences and properties of distributions than psychologists claim. (Microsoft is a technology company still in existence at the time of writing.)
Figure . : Because of the slowness of the law of large numbers, under thick tails, the past's past doesn't resemble the past's future; accordingly, today's past will not resemble today's future. Things are easier under thin tails. Credit: Stefan Gasic.
It is naive empiricism to compare these processes, to suggest that we worry too much about ebola (epidemics or pandemics) and too little about diabetes. In fact

Figure . : Beware the lobbyist using pseudo-empirical arguments. "Risk communications" shills such as the fellow here, with a journalism background, are hired by firms such as Monsanto (and car and tobacco companies) to engage in smear campaigns on their behalf using "science", "empirical arguments" and "evidence", and to downplay "public fears" they deem irrational. Lobbying organizations penetrate such centers as the "Harvard Center for Risk Analysis", whose fancy scholarly name helps convince the layperson. The shills' line of argument commonly revolves around "no evidence of harm" and "rationality". Other journalists, in turn, espouse such arguments owing to their ability to sway the statistically naive. Probabilistic and risk literacy, statistical knowledge, and journalism have suffered greatly from the spreading of misconceptions by nonscientists, or, worse, nonstatisticians.

it is the other way round. We worry too much about diabetes and too little about ebola and other ailments with multiplicative effects. This is an error of reasoning that comes from not understanding thick tails –sadly it is more and more common. What is worse, such errors of reasoning are promoted by empirical psychology, which does not appear to be empirical.
Such reasoning is also used by shills for industry passing for "risk communicators", selling us pesticides and telling us not to worry because harm appears to be minimal in past data (see Figure . ). The correct reasoning is generally absent in decision theory and risk circles, outside of the branches of extreme value theory and the works of the ABC group at Berlin's Max Planck Institute directed by Gerd Gigerenzer [ ], which tells you that your grandmother's instincts and teachings are not to be ignored and that, when her recommendations clash with psychologists and decision theorists, it is usually the psychologists and decision theorists who are unrigorous. A simple look at the summary by "most cited author" Baruch Fischhoff in
Risk: A Very Short Introduction [ ] shows no effort to disentangle the two classes of distributions. The problem is linked to the "risk calibration" and "probabilistic calibration" misunderstood by psychologists, and is discussed more technically in Chapter on expert calibration under thick tails.

. . How some multiplicative risks scale
The "evidence based" approach is still too primitive to handle second order effects (and risk management), and has certainly caused way too much harm with the COVID- pandemic to remain usable outside of single-patient issues. One of the problems is the translation between individual and collective risk (another is the mischaracterization of evidence and its conflation with absence of evidence).

At the beginning of the COVID- pandemic, many epidemiologists innocent of probability compared the risk of death from it to that of drowning in a swimming pool. For a single individual, this might have been true (although COVID- rapidly turned out to be the main source of fatality in many parts, and later even caused % of the fatalities in New York City). But conditional on having 1000 deaths, the odds of the cause being drowning in swimming pools are slim. This is because your neighbor having COVID increases the chances that you get it, whereas your neighbor drowning in her or his swimming pool does not increase your probability of drowning (if anything, like plane crashes, it decreases other people's chance of drowning).

This aggregation problem is discussed in more technical terms with ellipticality, see Section . –joint distributions are no longer elliptical, causing the sum to be fat-tailed even when individual variables are thin-tailed. It is also discussed as a problem in ethics [ ]: by contracting the disease you cause more deaths than your own.

The Gigerenzer school is not immune to mistakes, as evidenced by their misunderstanding of the risks of COVID- in early –the difference between Mediocristan and Extremistan has not reached them yet. But this author is optimistic that it will.
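The swimming pool comparison can be made concrete with a toy simulation: give both causes a comparable typical yearly toll, make one thin-tailed (independent accidents, Poisson) and the other multiplicative (lognormal, as a crude stand-in for epidemic dynamics), then condition on a large toll. All numbers below are hypothetical, for illustration only (Python with NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
n_years = 100_000  # simulated "years"

# Both causes have a mean yearly toll of roughly 400 deaths...
pool_deaths = rng.poisson(lam=400, size=n_years)                    # thin tails: independent events
epidemic_deaths = rng.lognormal(mean=4.0, sigma=2.0, size=n_years)  # thick tails: multiplicative
# (mean of this lognormal is exp(4 + 2**2 / 2) = exp(6), about 403)

# ...but very different probabilities of a 1000+ death year.
p_pool_extreme = (pool_deaths >= 1000).mean()
p_epidemic_extreme = (epidemic_deaths >= 1000).mean()

# With a 50/50 prior on the cause, conditional on observing >= 1000 deaths:
posterior_epidemic = p_epidemic_extreme / (p_epidemic_extreme + p_pool_extreme)

print(p_pool_extreme, p_epidemic_extreme, posterior_epidemic)
```

Conditional on a large death toll, the posterior overwhelmingly favors the multiplicative cause, even though the two risks look comparable "on average" to a single individual.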
Although the risk of death from a contagious disease can be smaller than, say, that from a car accident, it becomes psychopathic to follow "rationality" (that is, first order rationality models), as you will eventually cause systemic harm and even, eventually, certain self-harm.

.7 primer on power laws (almost without mathematics)

Let us now discuss the intuition behind the Pareto Law. It is simply defined as follows: say X is a random variable. For a realization x of X sufficiently large, the probability of exceeding 2x divided by the probability of exceeding x is "not too different" from the probability of exceeding 4x divided by the probability of exceeding 2x, and so forth. This property is called "scalability". So if we have a Pareto (or Pareto-style) distribution, the ratio of people with $ million compared to $ million is the same as the ratio of people with $ million and $ million. There is a constant inequality. This distribution has no characteristic scale, which makes it very easy to understand. Although this distribution often has no mean and no standard deviation, we can still understand it –in fact we can understand it much better than we do more standard statistical distributions. But because it has no mean we have to ditch the statistical textbooks and do something more solid, more rigorous, even if it seems less mathematical.

To put in some minimum mathematics: let X be a random variable belonging to the class of distributions with a "power law" right tail:

$$P(X > x) = L(x)\, x^{-\alpha} \qquad ( . )$$

where $L : [x_{\min}, +\infty) \to (0, +\infty)$ is a slowly varying function, defined as $\lim_{x \to +\infty} L(kx)/L(x) = 1$ for any $k > 0$.

Table . : An example of a power law
Richer than  million      in
Richer than  million      in
Richer than  million      in  ,
Richer than  million      in  ,
Richer than  million      in  ,
Richer than  million      in ?

Table . : Kurtosis from a single observation, for financial data: $\max\!\left(X_{t-i\Delta t}^4\right)_{i=0}^{n} \big/ \sum_{i=0}^{n} X_{t-i\Delta t}^4$

Security                 Max Q    Years
Silver                   0.94     46
SP500                    0.79     56
CrudeOil                 0.79     26
Short Sterling           0.75     17
Heating Oil              0.74     31
Nikkei                   0.72     23
FTSE                     0.54     25
JGB                      0.48     24
Eurodollar Depo  M       0.31     19
Sugar                    .        .
Yen                      0.27     38
Bovespa                  0.27     16
Eurodollar Depo  M       0.25     28
CT                       0.25     48
DAX                      .        .

A Pareto distribution has no higher moments: moments either do not exist or become statistically more and more unstable. So next we move on to a problem with economics and econometrics. In , I took years of data and looked at how much of the kurtosis (a function of the fourth moment) came from the largest observation –see Table . . For a Gaussian, the maximum contribution over the same time span should be around . ± . . For the S&P it was about percent. This tells us that we don't know anything about the kurtosis of these securities. Its sample error is huge; or it may not exist, so the measurement is heavily sample dependent. If we don't know anything about the fourth moment, we know nothing about the stability of the second moment. It means we are not in a class of distributions that allows us to work with the variance, even if it exists. Science is hard; quantitative finance is hard too.

For silver, in 46 years, 94 percent of the kurtosis came from one single observation. We cannot use standard statistical methods with financial data. GARCH (a method popular in academia) does not work because we are dealing with squares. The variance of the squares is analogous to the fourth moment. We do not know the variance. But we can work very easily with Pareto distributions. They give us less information, but, nevertheless, it is more rigorous if the data are uncapped or if there are any open variables.

Table . , for financial data, debunks all the college textbooks we are currently using. A lot of econometrics that deals with squares goes out of the window. This explains why economists cannot forecast what is going on –they are using the wrong methods and building the wrong confidence intervals. It will work within the sample, but it will not work outside the sample –and samples are by definition finite and will always have finite moments.
If we say that variance (or kurtosis) is infinite, we are not going to observe anything infinite within a sample.

Principal component analysis, PCA (see Figure . ), is a dimension reduction method for big data, and it works beautifully with thin tails (at least sometimes). But if there is not enough data there is an illusion of structure. As we increase the data (the n variables), the structure becomes flat (something called in some circles the "Wigner effect" for random matrices, after Eugene Wigner –do not confuse with Wigner's discoveries about the dislocation of atoms under radiation). In the simulation, the data have absolutely no structure: the principal components (PCs) should be all equal (asymptotically, as the data become large); but the small sample effect causes the ordered PCs to show a declining slope. We have zero correlation on the matrix. For a thick tailed distribution (the lower section), we need a lot more data for the spurious correlation to wash out; i.e., dimension reduction does not work with thick tails.

The following summarizes everything that I wrote in
The Black Swan (a message that somehow took more than a decade to go through without distortion). Distributions can be one-tailed (left or right) or two-tailed. If a distribution has a thick tail, it can be thick tailed in one tail or in both; and if it is thick tailed in one tail, that can be the left tail or the right tail.

See Figure . for the intuition: if it is thick tailed and we look at the sample mean, we observe fewer tail events. The common mistake is to think that we can naively derive the mean in the presence of one-tailed distributions. But there are unseen rare events, and with time these will fill in. By definition, though, they are low probability events.

It is easier to be fooled by randomness about the quality of the performance with a short volatility time series (left skewed, exposed to sharp losses) than with a long volatility one (right skewed, exposed to sharp gains). Simply, short volatility overestimates the performance, while the other underestimates it (see Fig. . ). This is another version of the asymmetry attributed to Popper that we saw earlier in the chapter.

.8 where are the hidden properties?

Figure . : Spurious PCAs Under Thick Tails:
A Monte Carlo experiment that shows how spurious correlations and covariances become more acute under thick tails. Principal components ranked by variance for Gaussian uncorrelated variables (above), with n = 100 (shaded) and n = 1000 data points (transparent), and principal components ranked by variance for stable-distributed variables (below, with tail exponent α = , symmetry β = 1, centrality μ = 0, scale σ = 1), with the same n = 100 (shaded) and n = 1000 (transparent). Both are "uncorrelated" identically distributed variables. We can see the "flatter" PCA structure with the Gaussian as n increases (the difference between PCs shrinks). Such flattening does not occur in reasonable time under fatter tails.

The trick is to estimate the distribution and then derive the mean (which implies extrapolation). This is called in this book "plug-in" estimation, see Table . . It is not done by measuring the directly observable sample mean, which is biased under fat-tailed distributions. This is why, outside a crisis, banks seem to make large profits. Then, once in a while, they lose everything and more and have to be bailed out by the taxpayer. The way we handle this is by differentiating the true mean (which I call the "shadow" mean) from the realized mean, as in the tableau in Table . . We can also do this for the Gini coefficient, estimating the "shadow" one rather than the naively observed one.

Figure . : A central asymmetry: the difference between absence of evidence and evidence of absence is compounded by thick tails. It requires a more elaborate understanding of random events –or a more naturalistic one. (Please do not take IQ points here as equivalent to the ones used in common psychometrics: the suspicion is that high-scoring people on IQ tests fail to get the asymmetry. IQ here should be interpreted as "real" intelligence, not the one from that test.) Courtesy Stefan Gasic.
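The plug-in idea can be sketched in a few lines: instead of averaging the observations, fit the tail exponent by maximum likelihood and derive the mean from the fitted distribution. For a Pareto sample the ML estimator of the exponent is in closed form; a sketch assuming Python with NumPy, with illustrative parameters rather than those of the book's tables:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha_true, x_min = 1.2, 1.0
true_mean = alpha_true / (alpha_true - 1.0) * x_min  # = 6.0 for alpha = 1.2

n, trials = 200, 2000
sample_means, plugin_means = [], []
for _ in range(trials):
    x = x_min * (1.0 - rng.random(n)) ** (-1.0 / alpha_true)  # Pareto sample
    sample_means.append(x.mean())
    alpha_hat = n / np.log(x / x_min).sum()  # maximum-likelihood tail exponent
    # Plug-in ("shadow") mean from the fitted tail; infinite if alpha_hat <= 1.
    plugin_means.append(alpha_hat / (alpha_hat - 1.0) * x_min if alpha_hat > 1 else np.inf)

med_sample = float(np.median(sample_means))
med_plugin = float(np.median(plugin_means))
print(f"true mean {true_mean:.2f}, median sample mean {med_sample:.2f}, "
      f"median plug-in mean {med_plugin:.2f}")
```

The typical sample mean sits well below the true mean (the unseen tail has not "filled in" yet), while the plug-in estimate concentrates around it.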
This is what we mean when we say that the "empirical" distribution is not "empirical". In other words: 1) there is a wedge between population and sample attributes, and 2) even exhaustive historical data must be seen as mere sampling from a broader phenomenon (the past is in-sample; inference is what works out-of-sample).

Table . : Shadow mean vs. sample mean, and their ratio, for different minimum thresholds. The shadow mean is obtained via maximum likelihood, ML (from plug-in estimators). In bold the values for the K threshold. Rescaled data. From Cirillo and Taleb [ ]. Details are explained in Chapters and .

L       Sample Mean        ML Mean        Ratio
 K      9.079 × 10^         × 10^          .
 K      9.82 × 10^          × 10^          .
 K      1.12 × 10^          × 10^          .
 K      1.34 × 10^          × 10^          .
 K      1.66 × 10^          × 10^          .
 K      2.48 × 10^          × 10^          .

Once we have figured out the distribution, we can estimate the statistical mean. This works much better than directly measuring the sample mean. For a Pareto distribution, for instance, % of observations fall below the mean. There is a bias in the observed mean. But once we know that we have a Pareto distribution, we should ignore the sample mean and look elsewhere; Chapters and discuss the techniques. Note that the field of Extreme Value Theory [ ][ ][ ] focuses on tail properties, not the mean or statistical inference.

WITTGENSTEIN'S RULER: WAS IT REALLY A " SIGMA EVENT"?

In the summer of 1998, the hedge fund called "Long Term Capital Management" (LTCM) proved to have a very short life; it went bust from some deviations in the markets –those "of an unexpected nature". The loss was a yuuuge deal because two of the partners had received the Swedish Riksbank Prize, marketed as the "Nobel" in economics. More significantly, the fund harbored a large number of finance professors; LTCM had imitators among professors (at least sixty finance PhDs blew up during that period from trades similar to LTCM's, and owing to risk management methods that were identical).
At least two of the partners made the statement that it was a " sigma" event ( standard deviations), hence they should be absolved of all accusations of incompetence (I was a first hand witness of two such statements). Let us apply what the author calls "Wittgenstein's ruler": are you using the ruler to measure the table, or using the table to measure the ruler?

Assume, to simplify, that there are only two alternatives: a Gaussian distribution and a power law one. For the Gaussian, the "event" –defined via the survival function at standard deviations– has probability 1 in $1.31 \times 10^{ }$. For the power law of the same scale, a Student t distribution with tail exponent 2, the survival function gives 1 in 203. What is the probability of the data being Gaussian, conditional on such an event, compared to the alternative? We start with Bayes' rule:

$$P(A \mid B) = \frac{P(A)\, P(B \mid A)}{P(B)}.$$

Replace $P(B) = P(A)\,P(B \mid A) + P(\bar{A})\,P(B \mid \bar{A})$ and apply to our case:

$$P(\text{Gaussian} \mid \text{Event}) = \frac{P(\text{Gaussian})\, P(\text{Event} \mid \text{Gaussian})}{\big(1 - P(\text{Gaussian})\big)\, P(\text{Event} \mid \text{NonGaussian}) + P(\text{Gaussian})\, P(\text{Event} \mid \text{Gaussian})}$$

For a prior $P(\text{Gaussian}) = 0.5$, the posterior $P(\text{Gaussian} \mid \text{Event})$ comes out around $2 \times 10^{- }$: effectively zero.

Moral:
If there is a tiny probability, less than $10^{- }$, that the data might not be Gaussian, one can firmly reject Gaussianity in favor of the thick tailed distribution. The heuristic is to reject Gaussianity in the presence of any event of sigmas or beyond.

a. The great Benoit Mandelbrot used to be extremely critical of methods that relied on a Gaussian and added jumps or other ad hoc tricks to explain what happened in the data (say, Merton's jump diffusion process [ ]) –one can always fit back jumps ex post. He used to cite the saying attributed to John von Neumann: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk."

Figure . : Shadow Mean at work. Below: the Inverse Turkey Problem –the unseen rare event is positive. When you look at a positively skewed (antifragile) time series and make (nonparametric) inferences about the unseen, you miss the good stuff and underestimate the benefits. Above: the opposite problem. The filled area corresponds to what we do not tend to see in small samples, from insufficiency of data points. Interestingly, the shaded area increases with model error (owing to the convexity of tail probabilities to uncertainty).
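The computation in the Wittgenstein's-ruler box can be reproduced directly, taking the event at 10 standard deviations (the value consistent with the box's surviving "1 in 203" figure for the Student t with tail exponent 2); plain Python:

```python
import math

sigma_event = 10.0

# Gaussian survival probability of a 10-standard-deviation event.
p_gaussian = 0.5 * math.erfc(sigma_event / math.sqrt(2.0))

# Student t with 2 degrees of freedom: closed-form survival function.
p_student = 0.5 * (1.0 - sigma_event / math.sqrt(2.0 + sigma_event**2))

def posterior_gaussian(prior: float) -> float:
    """P(Gaussian | Event) by Bayes' rule, for a given prior P(Gaussian)."""
    return (prior * p_gaussian
            / ((1.0 - prior) * p_student + prior * p_gaussian))

print(f"Gaussian : 1 in {1.0 / p_gaussian:.3g}")
print(f"Student t: 1 in {1.0 / p_student:.0f}")
print(f"posterior for prior 0.5     : {posterior_gaussian(0.5):.2g}")
print(f"posterior for prior 0.999999: {posterior_gaussian(0.999999):.2g}")
```

Even a prior of 0.999999 in favor of the Gaussian collapses to a posterior of order 10^-15: seeing the event, you revise the ruler, not the table.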
In the absence of reliable information, Bayesian methods can be of little help. This author has faced, since the publication of
The Black Swan, numerous questions concerning the use of something vaguely Bayesian to solve problems about the unknown under thick tails. Since one cannot manufacture information beyond what is available, no technique, Bayesian or Schmayesian, can help. The key is that one needs a reliable prior, something not readily observable (see Diaconis and Freedman [ ] on the difficulty for an agent of formulating a prior). A problem is the speed of updating, which, as we will cover in Chapter , is highly distribution dependent. The mistake in the rational expectations literature is to believe that two observers supplied with the same information would necessarily converge to the same view. Unfortunately, the conditions for that to happen in real time, or to happen at all, are quite specific. One of course can use Bayesian methods (under adequate priors) for the estimation of parameters if 1) one has a clear idea about the range of values (say from universality classes or other stable basins) and 2) these parameters follow a tractable distribution with low variance, such as, say, the tail exponent of a Pareto distribution (which is inverse-gamma distributed) [ ].

Moral hazard and rent seeking in financial education: One of the most depressing experiences this author had was when teaching a course on Fat Tails at the University of Massachusetts Amherst, at the business school, during a very brief stint there. One PhD student in finance bluntly said that he liked the ideas, but that a financial education career commanded "the highest salary in the land" (that is, among all other specialties in education). He preferred to use Markowitz methods (even if they failed in fat-tailed domains) as these were used by other professors, hence allowed him to get his papers published and get a high paying job. I was disgusted, but predicted he would subsequently have a very successful career writing non-papers. He did.
.10 x vs f(x): exposures to x confused with knowledge about x

Take X, a random or nonrandom variable, and F(X) the exposure, payoff, the effect of X on you –the end bottom line. (X is often in higher dimensions, but let us assume, to simplify, that it is a simple one-dimensional variable.)

Practitioners and risk takers often observe the following disconnect: people (nonpractitioners) talk about X (with the implication that practitioners should care about X in running their affairs), while practitioners think about F(X), nothing but F(X). And this straight confusion between X and F(X), chronic since Aristotle, is discussed in Antifragile [ ], which is written around that theme. Sometimes people mention F(X) as utility but miss the full payoff. And the confusion is at two levels: first, simple confusion; second, in the decision-science literature, seeing the difference but not realizing that action on F(X) is easier than action on X.

• The variable X can be unemployment in Senegal, F1(X) the effect on the bottom line of the IMF, and F2(X) the effect on your grandmother (which I assume is minimal).

• X can be a stock price, but you own an option on it, so F(X) is your exposure as an option value for X, or, even more complicated, the utility of the exposure to the option value.

• X can be changes in wealth, and F(X) the convex-concave way they affect your well-being. One can see that F(X) is vastly more stable or robust than X (it has thinner tails).

Convex vs. linear functions of a variable X. Consider Fig. . ; confusing F(X) (on the vertical) and X (on the horizontal) is more and more significant the more nonlinear F(X) is. The more convex F(X), the more the statistical and other properties of F(X) will be divorced from those of X. For instance, the mean of F(X) will be different from F(mean of X), by Jensen's inequality.
But beyond Jensen's inequality, the difference in risks between the two becomes more and more considerable. When it comes to probability, the more nonlinear F, the less the probabilities of X matter compared to those of F. Moral of the story: focus on F, which we can alter, rather than on the measurement of the elusive properties of X.

Figure . : The Conflation Problem: X (a random variable) and F(X), a function of it (or payoff). If F(X) is convex we don't need to know much about it –it becomes an academic problem. And it is safer to focus on transforming F(X) than X.

Figure . : The Conflation Problem: a convex-concave transformation of a thick tailed X produces a thin tailed distribution (above). A sigmoidal transformation (below), bounded, applied to a distribution on (−∞, ∞) produces an ArcSine distribution, with compact support.

Limitations of knowledge
What is crucial is that our limitations of knowledge apply to X, not necessarily to F(X). We have no control over X, some control over F(X) –in some cases, a very, very large control over F(X).

Figure . : A concave-convex transformation (of the style of a probit –an inverse CDF for the Gaussian– or of a logit) makes the tails of the distribution of f(x) thicker.

The danger with the treatment of the Black Swan problem is as follows: people focus on X ("predicting X"). My point is that, although we do not understand X, we can deal with it by working on F, which we can understand, while others work on predicting X, which we can't, because small probabilities are incomputable, particularly in thick tailed domains. F(x) is how the end result affects you. The probability distribution of F(X) is markedly different from that of X, particularly when F(X) is nonlinear. We need a nonlinear transformation of the distribution of X to get that of F(X). We had to wait until to start a discussion on "convex transformations of random variables", Van Zwet ( ) [ ] –the topic didn't seem important before.

Ubiquity of S curves. F is almost always nonlinear (actually I know of no exception to nonlinearity), often "S curved", that is, convex-concave (for an increasing function). See the longer discussion in F.

Fragility and Antifragility
When F(X) is concave (fragile), errors about X can translate into extreme negative values for F(X). When F(X) is convex, one is largely immune from severe negative variations. In situations of trial and error, or with an option, we do not need to understand X as much as our exposure to the risks. Simply, the statistical properties of X are swamped by those of F. The point of Antifragile is that exposure is more important than the naive notion of "knowledge", that is, understanding X. The more nonlinear F, the less the probabilities of X matter in the probability distribution of the final package F. Many people confuse the probabilities of X with those of F. I am serious: the entire literature reposes largely on this mistake. For Baal's sake, focus on F, not X.

Better be convex than right: In the fall of , a firm went bust betting against volatility –they were predicting lower realized market volatility (rather, variance) than that "expected" by the market.
They were correct in the prediction, but went bust nevertheless. They were just very concave in the payoff function. Recall that $x$ is not $f(x)$ and that in the real world there are almost no linear $f(x)$. The following example can show us how. Consider the payoff in the figure below. The payoff function is $f(x) = 1 - x^2$ daily, meaning that if $x$ moves by up to 1 unit (say, one standard deviation), there is a profit; losses lie beyond. This is a typical contract called a "variance swap".

[Figure: the payoff $f(x) = 1 - x^2$, positive for $|x| < 1$, with accelerating losses beyond.]

Now consider the following two successions of deviations of $x$ over seven days (expressed in standard deviations).

Succession (thin tails): $\{1, 1, 1, 1, 1, 0, 0\}$. Mean variation $= 0.71$. P/L $= 2$.

Succession (thick tails): $\{0, 0, 0, 0, 0, 0, 5\}$. Mean variation $= 0.71$ (same). P/L $= -18$ (bust, really bust).

In both cases they forecast right, but the lumping of the volatility, the fatness of tails, made a huge difference. This in a nutshell explains why, in the real world, "bad" forecasters can make great traders and decision makers, and vice versa; something every operator knows but that the mathematically and practically unsophisticated "forecasting" literature, centuries behind practice, misses.
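The arithmetic of the two successions can be checked directly (a minimal sketch using the text's payoff $f(x) = 1 - x^2$):

```python
# The two successions from the text; the daily payoff is f(x) = 1 - x**2.
thin  = [1, 1, 1, 1, 1, 0, 0]   # thin tails
thick = [0, 0, 0, 0, 0, 0, 5]   # thick tails: one lumped deviation

def mean_variation(xs):
    # Average absolute daily move.
    return sum(abs(x) for x in xs) / len(xs)

def pnl(xs):
    # Total profit and loss under the variance-swap-style payoff.
    return sum(1 - x**2 for x in xs)

print(round(mean_variation(thin), 2), pnl(thin))    # 0.71 2
print(round(mean_variation(thick), 2), pnl(thick))  # 0.71 -18
```

Same mean variation, radically different P/L: the quadratic payoff is what converts the lumping of volatility into ruin.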
Let us finish with path dependence and time probability. Our great-grandmothers did understand thick tails. These are not so scary; we figured out how to survive by making rational decisions based on deep statistical properties. Path dependence is as follows. If I iron my shirts and then wash them, I get vastly different results compared to when I wash my shirts and then iron them. My first work,
Dynamic Hedging [ ], was about how traders avoid the "absorbing barrier" since once you are bust, you can no longer continue: anything that will eventually go bust will lose all past profits.

.11 ruin and path dependence

The physicists Ole Peters and Murray Gell-Mann [ ] shed new light on this point, and revolutionized decision theory by showing that a key belief since the development of applied probability theory in economics was wrong. They pointed out that all economics textbooks make this mistake; the only exceptions are by information theorists such as Kelly and Thorp.

Let us explain ensemble probabilities. Assume that of us, randomly selected, go to a casino and gamble. If the 28th person is ruined, this has no impact on the 29th gambler. So we can compute the casino's return using the law of large numbers by taking the returns of the people who gambled. If we do this two or three times, then we get a good estimate of what the casino's "edge" is. The problem comes when ensemble probability is applied to us as individuals. It does not work, because if one of us goes to the casino and on day is ruined, there is no day . This is why Cramer showed insurance could not work outside what he called "the Cramer condition", which excludes possible ruin from single shocks. Likewise, no individual investor will achieve the alpha return on the market, because no single investor has infinite pockets (or, as Ole Peters has observed, is running his life across branching parallel universes). We can only get the return on the market under strict conditions.

Time probability and ensemble probability are not the same. This only works if the risk taker has an allocation policy compatible with the Kelly criterion

Figure . : Ensemble probability vs. time probability. The treatment by option traders is done via the absorbing barrier.
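A toy simulation of the distinction (my own illustrative numbers, not from the text): even a perfectly fair bet, benign on ensemble average, ruins a large fraction of individual paths over time once there is an absorbing barrier.

```python
import random

random.seed(3)

# Hypothetical fair bet: win 1 with probability 0.5, lose 1 otherwise.
# A gambler starts with 10 units and is absorbed (ruined) at 0 --
# there is no day n+1 after ruin.
def gamble(rounds=1000, capital=10):
    for _ in range(rounds):
        capital += 1 if random.random() < 0.5 else -1
        if capital <= 0:
            return 0          # absorbing barrier: play stops for good
    return capital

outcomes = [gamble() for _ in range(2_000)]
ruined = sum(o == 0 for o in outcomes) / len(outcomes)
print(round(ruined, 2))  # roughly 0.7-0.8 of individual paths end in ruin
```

The ensemble expectation of each round is zero, yet time, plus the barrier, destroys most individual gamblers: ensemble probability is not time probability.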
I have traditionally treated this in Dynamic Hedging [ ] and Antifragile [ ] as the conflation between $X$ (a random variable) and $f(X)$, a function of said random variable, which may include an absorbing state.

Figure . : A hierarchy for survival. Higher entities have a longer life expectancy, hence tail risk matters more for these. Lower entities such as you and I are renewable.

[ ], [ ], using logs. Peters wrote three papers on time probability (one with Murray Gell-Mann) and showed that a lot of paradoxes disappeared.

Let us see how we can work with these and what is wrong with the literature. If we visibly incur a tiny risk of ruin, but have a frequent exposure, it will go to probability one over time. If we ride a motorcycle we have a small risk of ruin, but if we ride that motorcycle a lot then we will reduce our life expectancy. The way to measure this is:

Principle . (Repetition of exposures) Focus only on the reduction of life expectancy of the unit assuming repeated exposure at a certain density or frequency.
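A minimal sketch of the principle (with a made-up per-exposure ruin probability, for illustration only): a tiny chance of hitting the absorbing barrier compounds to near-certain ruin under repetition.

```python
# Hypothetical number for illustration: each exposure carries a
# 1-in-1,000 chance of ruin (an absorbing barrier).
p_ruin = 0.001

def survival(n_exposures, p=p_ruin):
    # Probability of never hitting the absorbing barrier in n exposures.
    return (1 - p) ** n_exposures

for n in (1, 100, 1_000, 10_000):
    print(n, round(survival(n), 4))
# Survival decays toward zero: 0.999, 0.9048, 0.3677, 0.0
```

This is why no risky exposure can be analyzed in isolation: what matters is the density of repetition, not the single-shot probability.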
Behavioral finance so far draws conclusions from statics, not dynamics, hence misses the picture. It applies trade-offs out of context and develops the consensus that people irrationally overestimate tail risk (hence need to be "nudged" into taking more of these exposures). But the catastrophic event is an absorbing barrier. No risky exposure can be analyzed in isolation: risks accumulate. If we ride a motorcycle, smoke, fly our own propeller plane, and join the mafia, these risks add up to a near-certain premature death. Tail risks are not a renewable resource.

Every risk taker who managed to survive understands this. Warren Buffett understands this. Goldman Sachs understands this. They do not want small risks, they want zero risk, because that is the difference between the firm surviving and not surviving over twenty, thirty, one hundred years. This attitude to tail risk can explain why Goldman Sachs is years old; it ran as a partnership with unlimited liability for approximately the first years, but was bailed out once in , after it became a bank. This is not in the decision theory literature, but we (people with skin in the game) practice it every day. We take a unit, look at how long a life we wish it to have, and see by how much the life expectancy is reduced by repeated exposure.

Remark : Psychology of decision making
The psychological literature focuses on one-single episode exposures and narrowly defined cost-benefit analyses. Some analyses label people as paranoid for overestimating small risks, but don't get that if we had the smallest tolerance for collective tail risks, we would not have made it for the past several million years.

.12 what to do?
Next let us consider layering, why systemic risks are in a different category fromindividual, idiosyncratic ones. Look at the (inverted) pyramid in Figure . : theworst-case scenario is not that an individual dies. It is worse if your family, friendsand pets die. It is worse if you die and your arch enemy survives. They collectivelyhave more life expectancy lost from a terminal tail event.So there are layers. The biggest risk is that the entire ecosystem dies. The pre-cautionary principle puts structure around the idea of risk for units expected tosurvive.Ergodicity in this context means that your analysis for ensemble probability trans-lates into time probability. If it doesn’t, ignore ensemble probability altogether. To summarize, we first need to make a distinction between mediocristan and Ex-tremistan, two separate domains that about never overlap with one another. Ifwe fail to make that distinction, we don’t have any valid analysis. Second, if wedon’t make the distinction between time probability (path dependent) and ensem-ble probability (path independent), we don’t have a valid analysis.The next phase of the
Incerto project is to gain understanding of fragility, robustness, and, eventually, anti-fragility. Once we know something is fat-tailed, we can use heuristics to see how an exposure there reacts to random events: how much a given unit is harmed by them. It is vastly more effective to focus on being insulated from the harm of random events than to try to figure them out in the required detail (as we saw, the inferential errors under thick tails are huge). So it is more solid, much wiser, more ethical, and more effective to focus on detection heuristics and policies rather than fabricate statistical properties.

The beautiful thing we discovered is that everything that is fragile has to present a concave exposure [ ], similar, if not identical, to the payoff of a short option, that is, a negative exposure to volatility. It is nonlinear, necessarily. It has to have harm that accelerates with intensity, up to the point of breaking. If I jump m, I am harmed more than times as much as if I jump one meter. That is a necessary property of fragility. We just need to look at acceleration in the tails. We have built effective stress testing heuristics based on such an option-like property [ ].

In the real world we want simple things that work [ ]; we want to impress our accountant and not our peers. (My argument in the latest instalment of the
Incerto , Skin in the Game is that systems judged by peers and not evolution rot fromovercomplication). To survive we need to have clear techniques that map to ourprocedural intuitions.The new focus is on how to detect and measure convexity and concavity. This ismuch, much simpler than probability. next
The next three chapters will examine the technical intuitions behind thick tails indiscussion form, in not too formal a language. Derivations and formal proofs comelater with the adaptations of the journal articles.
UNIVARIATE FAT TAILS, LEVEL 1, FINITE MOMENTS †

The next two chapters are organized as follows. We look at three levels of fat tails, with more emphasis on the intuitions and heuristics than on formal mathematical differences, which will be pointed out later in the discussions of limit theorems. The three levels are:

• Fat tails, entry level (sort of), i.e., finite moments
• Subexponential class
• Power Law class

Level one will be the longest as we will use it to build intuitions. While this approach is the least used in mathematics papers (fat tails are usually associated with power laws and limit behavior), it is relied upon the most analytically and practically. We can get the immediate consequences of fat-tailedness with little effort, the equivalent of a functional derivative that provides a good grasp of local sensitivities. For instance, as a trader, the author was able to get most of the effect of fat-tailedness with a simple heuristic of averaging option prices across two volatilities, which proved sufficient in spite of its simplicity.
A couple of reminders about convexity and Jensen's inequality. Let $A$ be a convex set in a vector space over the reals, and let $\varphi : A \to \mathbb{R}$ be a function; $\varphi$ is called convex if

$$\forall x_1, x_2 \in A,\ \forall t \in [0, 1]:\quad \varphi\left(t x_1 + (1-t) x_2\right) \leq t\,\varphi(x_1) + (1-t)\,\varphi(x_2).$$

† Discussion chapter.

Figure . : How random volatility creates fatter tails, owing to the convexity of some parts of the density to the scale of the distribution.
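The figure's point can be sketched numerically (illustrative numbers of my own): at a tail point, the average of two scale-perturbed Gaussian densities exceeds the unperturbed density, because the tail density is convex in the scale.

```python
import math

def gauss_pdf(x, s):
    # Density of N(0, s^2) at x.
    return math.exp(-x * x / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

# Perturbate the scale: sigma +/- delta with equal probability
# (hypothetical values), and compare to the unperturbed density.
s, d, x_tail = 1.0, 0.25, 4.0
mixed = 0.5 * (gauss_pdf(x_tail, s - d) + gauss_pdf(x_tail, s + d))
plain = gauss_pdf(x_tail, s)

print(mixed > plain)            # True: Jensen's inequality in the tail
print(round(mixed / plain, 1))  # about 7: the tail fattens substantially
```

The same averaging *lowers* the density in the "shoulders", which is the crossover structure discussed later in the chapter.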
For a random variable $X$ and $\varphi(\cdot)$ a convex function, by Jensen's inequality [ ]: $\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)]$.

Remark : Fat Tails and Jensen's inequality
For a Gaussian distribution (and members of the location-scale family of distributions), tail probabilities are convex to the scale of the distribution, here the standard deviation $\sigma$ (and to the variance $\sigma^2$). This allows us to fatten the tails by "stochasticizing" either the standard deviation or the variance, hence checking the effect of Jensen's inequality on the probability distribution.

Heteroskedasticity is the general technical term often used in time series analysis to characterize a process with fluctuating scale. Our method "stochasticizes", that is, perturbates the variance or the standard deviation of the distribution under the constraint of conservation of the mean. ("Volatility" in the quant language means standard deviation, but "stochastic volatility" is usually stochastic variance.)

.1 a simple heuristic to create mildly fat tails

But note that any heavy tailed process, even a power law, can be described in sample (that is, a finite number of observations, necessarily discretized) by a simple Gaussian process with changing variance, a regime switching process, or a combination of a Gaussian plus a series of variable jumps (though not one where jumps are of equal size; see the summary in [ ]). This method will also allow us to answer the great question: "where do the tails start?" in . .

Let $f(\sqrt{v}; x)$ be the density of the normal distribution (with mean 0) as a function of the variance $v$, for a given point $x$ of the distribution. Compare $f\left(\frac{1}{2}\left(\sqrt{1-a} + \sqrt{1+a}\right); x\right)$ to $\frac{1}{2}\left(f\left(\sqrt{1-a}; x\right) + f\left(\sqrt{1+a}; x\right)\right)$; the difference between the two will be owed to Jensen's inequality. We assume the average variance constant, but the discussion works just as well if we assumed the average standard deviation constant: it is a long debate whether one should put a constraint on the average variance or on that of the standard deviation, but 1) it doesn't matter much so long as one remains consistent, and 2) for our illustrative purposes here there is no real fundamental difference.

Since higher moments increase under fat tails, though not necessarily lower ones, it should be possible to simply increase fat tailedness (via the fourth moment) while keeping lower moments (the first two or three) invariant.

. . A Variance-preserving heuristic
Keep $\mathbb{E}(X^2)$ constant and increase $\mathbb{E}(X^4)$ by "stochasticizing" the variance of the distribution, since $\mathbb{E}(X^4)$ is itself analog to the variance of $X^2$ measured across samples: $\mathbb{E}(X^4)$ is the noncentral equivalent of $\mathbb{E}\left(\left(X^2 - \mathbb{E}(X^2)\right)^2\right)$, so we will focus on the simpler version outside of situations where it matters. Further, we will do the "stochasticizing" in a more involved way in later sections of the chapter.

An effective heuristic to get some intuition about the effect of the fattening of tails consists in simulating a random variable set to be at mean 0, but with the following variance-preserving tail fattening trick: the random variable follows a distribution $\mathcal{N}\left(0, \sigma\sqrt{1-a}\right)$ with probability $p = \frac{1}{2}$ and $\mathcal{N}\left(0, \sigma\sqrt{1+a}\right)$ with the remaining probability $\frac{1}{2}$, with $0 \leq a < 1$. The characteristic function is

$$\phi(t, a) = \frac{1}{2}\, e^{-\frac{1}{2}(1+a)\, t^2 \sigma^2}\left(1 + e^{a\, t^2 \sigma^2}\right). \quad ( . )$$

Odd moments are nil. The second moment is preserved since

$$M(2) = (-i)^2\, \frac{\partial^2}{\partial t^2}\, \phi(t)\Big|_{t=0} = \sigma^2 \quad ( . )$$

and the fourth moment is

$$M(4) = (-i)^4\, \frac{\partial^4}{\partial t^4}\, \phi\Big|_{t=0} = 3\left(a^2 + 1\right)\sigma^4, \quad ( . )$$

which puts the traditional kurtosis at $3\left(a^2 + 1\right)$ (assuming we do not remove 3 to compare to the Gaussian). This means we can get an "implied $a$" from kurtosis. The value of $a$ is roughly the mean deviation of the stochastic volatility parameter, the "volatility of volatility" or Vvol in a more fully parametrized form.

(Footnotes: The jumps for such a process can be simply modeled as a regime characterized by a Gaussian with low variance and extremely large mean, and a low probability of occurrence; so, technically, Poisson jumps are mixed Gaussians. To repeat what we stated in the previous chapter, the literature sometimes separates "Fat tails" from "Heavy tails", the first term being reserved for power laws, the second for subexponential distributions (on which, later). Fughedaboutdit. We simply call "Fat Tails" something with a higher kurtosis than the Gaussian, even when kurtosis is not defined. The definition is functional, as used by practitioners of fat tails, that is, option traders, and lends itself to the operation of "fattening the tails", as we will see in this section. Note there is no difference between characteristic and moment generating functions when the mean is 0, a property that will be useful in later, more technical chapters.)

Limitations of the simple heuristic
This heuristic, while useful for intuition building, is of limited power, as it can only raise kurtosis to twice that of a Gaussian, so it should be used only pedagogically, to get some intuition about the effects of the convexity. Section . . will present a more involved technique.

Remark : Peaks
As Figure . shows, fat tails manifest themselves with higher peaks, a concentration of observations around the center of the distribution. This is usually misunderstood.

. . Fattening of Tails With Skewed Variance
We can improve on the fat-tail heuristic in . (which limited the kurtosis to twice that of the Gaussian) as follows. We switch between Gaussians with variance:

$$\begin{cases} \sigma^2 (1+a), & \text{with probability } p \\ \sigma^2 (1+b), & \text{with probability } 1-p \end{cases} \quad ( . )$$

with $p \in [0, 1)$ and $b = -a\,\frac{p}{1-p}$, giving a characteristic function:

$$\phi(t, a) = p\, e^{-\frac{1}{2}(a+1)\,\sigma^2 t^2} - (p - 1)\, e^{-\frac{\sigma^2 t^2 (a p + p - 1)}{2 (p-1)}}$$

with kurtosis $\frac{3\left(\left(1 - a^2\right) p - 1\right)}{p - 1}$, thus allowing polarized states and high kurtosis, all variance preserving. Thus with, say, $p = 1/1000$ and $a = 999$, kurtosis can reach as high a level as 3000.

This heuristic approximates quite well the effect on probabilities of a lognormal weighting for the characteristic function,

$$\phi(t, V) = \int_0^{\infty} e^{-\frac{t^2 v}{2}}\, g(v)\, dv, \quad ( . )$$

where $g(v)$ is the lognormal density for the variance $v$, calibrated to mean $V$, and $Vv$ is the second order variance, often called volatility of volatility. Thanks to integration by parts we can use the Fourier transform to obtain all varieties of payoffs (see Gatheral [ ]). But the absence of a closed-form distribution can be remedied as follows, with the use of distributions for the variance that are analytically more tractable.
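Since the mixture moments are exact, both heuristics can be verified with rational arithmetic; a small sketch using the chapter's parameters ($p = 1/1000$, $a = 999$ for the skewed case, and an illustrative $a = 4/5$ for the simple one):

```python
from fractions import Fraction as F

# Exact mixture moments for a zero-mean Gaussian with random variance:
# E[X^2] = v and E[X^4] = 3 v^2 conditionally on the variance v.
def mixture_kurtosis(v1, v2, p):
    m2 = p * v1 + (1 - p) * v2                  # mixture variance
    m4 = 3 * (p * v1**2 + (1 - p) * v2**2)      # mixture fourth moment
    return m2, m4 / m2**2                       # (variance, kurtosis)

# Simple heuristic: variances 1 -/+ a, probability 1/2 each.
a = F(4, 5)
m2_simple, kurt_simple = mixture_kurtosis(1 - a, 1 + a, F(1, 2))
print(m2_simple, kurt_simple)   # 1 123/25, i.e. 3*(1 + a**2) = 4.92

# Skewed heuristic: variances 1+a and 1+b, with b = -a*p/(1-p).
a, p = 999, F(1, 1000)
b = -a * p / (1 - p)
m2_skew, kurt_skew = mixture_kurtosis(1 + a, 1 + b, p)
print(m2_skew, kurt_skew)       # 1 3000 -- the polarized high-kurtosis state
```

Both mixtures preserve the variance exactly while the fourth moment, hence kurtosis, is free to explode; this is Jensen's inequality at work on the scale.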
Figure . : Stochastic Variance: Gamma distribution vs. Lognormal of the same mean and variance.
Gamma Variance
The gamma distribution applied to the variance of a Gaussian is a useful shortcut for a full distribution of the variance, which allows us to go beyond the narrow scope of heuristics [ ]. It is easier to manipulate analytically than the Lognormal. Assume that the variance of the Gaussian follows a gamma distribution:

$$\mathcal{G}_a(v) = \frac{v^{a-1} \left(\frac{V}{a}\right)^{-a} e^{-\frac{a v}{V}}}{\Gamma(a)},$$

with mean $V$ and variance $\frac{V^2}{a}$. Figure . shows the matching to a lognormal with the same first two moments, obtained by calibrating the mean and standard deviation of the log accordingly. The final distribution becomes (once again, assuming the same mean as in the fixed volatility situation):

$$f_{a,V}(x) = \int_0^{\infty} \frac{e^{-\frac{(x-\mu)^2}{2 v}}}{\sqrt{2\pi}\sqrt{v}}\, \mathcal{G}_a(v)\, dv. \quad ( . )$$

Figure . : Stochastic Variance using the Gamma distribution, by perturbating $a$ in equation . .

allora:

$$f_{a,V}(x) = \frac{2^{\frac{3}{4} - \frac{a}{2}}\, a^{\frac{a}{2} + \frac{1}{4}}\, V^{-\frac{a}{2} - \frac{1}{4}}\, |x - \mu|^{a - \frac{1}{2}}\, K_{a - \frac{1}{2}}\left(\frac{\sqrt{2}\sqrt{a}\, |x - \mu|}{\sqrt{V}}\right)}{\sqrt{\pi}\, \Gamma(a)}, \quad ( . )$$

where $K_n(z)$ is the Bessel K function, which satisfies the differential equation $-y\left(n^2 + z^2\right) + z^2 y'' + z y' = 0$.

Let us now get deeper into the different forms of stochastic volatility. We have not yet defined power laws; take for now the condition that at least one of the moments is infinite. And the answer: it depends on whether we are stochasticizing $\sigma$ or $\sigma^2$ on one hand, or $\frac{1}{\sigma}$ or $\frac{1}{\sigma^2}$ on the other.

Assume the base distribution is the Gaussian, the random variable $X \sim \mathcal{N}(\mu, \sigma)$. Now there are different ways to make $\sigma$, the scale, stochastic. Note that since $\sigma$ is nonnegative, we need it to follow some one-tailed distribution.

• We can make $\sigma^2$ (or, possibly, $\sigma$) follow a Lognormal distribution. It does not yield closed form solutions, but we can get the moments and verify it is not a power law.
• We can make $\sigma^2$ (or $\sigma$) follow a gamma distribution. It does yield closed form solutions, as we saw in the example above, in Eq. . .
• We can make $\frac{1}{\sigma^2}$, the precision parameter, follow a gamma distribution.
• We can make $\frac{1}{\sigma^2}$ follow a lognormal distribution.

The results shown in Table . come from the following simple properties of density functions and expectation operators.

.3 the body, the shoulders, and the tails

Table . : Transformations for stochastic volatility. We can see from the density of the transformations $\frac{1}{x}$ or $\frac{1}{\sqrt{x}}$ whether we have a power law on hand. LN, N, G and P are the Lognormal, Normal, Gamma, and Pareto distributions, respectively.

distr: $p(x)$ ; $p\left(\frac{1}{x}\right)$ ; $p\left(\frac{1}{\sqrt{x}}\right)$
$\mathcal{LN}(\mu, \sigma)$: $\frac{e^{-\frac{(\mu - \log(x))^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma x}$ ; $\frac{e^{-\frac{(\mu + \log(x))^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma x}$ ; $\frac{\sqrt{\frac{2}{\pi}}\, e^{-\frac{(\mu + 2\log(x))^2}{2\sigma^2}}}{\sigma x}$
$\mathcal{N}(\mu, \sigma)$: $\frac{e^{-\frac{(\mu - x)^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma}$ ; $\frac{e^{-\frac{\left(\mu - \frac{1}{x}\right)^2}{2\sigma^2}}}{\sqrt{2\pi}\,\sigma x^2}$ ; $\frac{\sqrt{\frac{2}{\pi}}\, e^{-\frac{\left(\mu - \frac{1}{x^2}\right)^2}{2\sigma^2}}}{\sigma x^3}$
$\mathcal{G}(a, b)$: $\frac{b^{-a} x^{a-1} e^{-\frac{x}{b}}}{\Gamma(a)}$ ; $\frac{b^{-a} x^{-a-1} e^{-\frac{1}{b x}}}{\Gamma(a)}$ ; $\frac{2\, b^{-a} x^{-2a-1} e^{-\frac{1}{b x^2}}}{\Gamma(a)}$
$\mathcal{P}(1, \alpha)$: $\alpha x^{-\alpha - 1}$ ; $\alpha x^{\alpha - 1}$ ; $2\alpha x^{2\alpha - 1}$

Table . : The p-moments of possible distributions for the variance.

distr: $E(X^p)$ ; $E\left(\left(\frac{1}{X}\right)^p\right)$ ; $E\left(\left(\frac{1}{\sqrt{X}}\right)^p\right)$
$\mathcal{LN}(\mu, \sigma)$: $e^{\mu p + \frac{p^2 \sigma^2}{2}}$ ; $e^{p\left(\frac{p \sigma^2}{2} - \mu\right)}$ ; $e^{\frac{p}{2}\left(\frac{p \sigma^2}{4} - \mu\right)}$
$\mathcal{G}(a, b)$: $b^p\, (a)_p$ ; $\frac{(-1)^p\, b^{-p}}{(1 - a)_p},\ p < a$ ; fughedaboudit
$\mathcal{P}(1, \alpha)$: $\frac{\alpha}{\alpha - p},\ p < \alpha$ ; $\frac{\alpha}{\alpha + p}$ ; $\frac{2\alpha}{2\alpha + p}$

Let $X$ be any random variable with PDF $f(\cdot)$ in the location-scale family, and $\lambda$ any random variable with PDF $g(\cdot)$; $X$ and $\lambda$ are assumed to be independent. By standard results, the moments of order $p$ for the product $X\lambda$ and the ratio $\frac{X}{\lambda}$ are:

$$\mathbb{E}\left((X \lambda)^p\right) = \mathbb{E}\left(X^p\right) \mathbb{E}\left(\lambda^p\right) \quad \text{and} \quad \mathbb{E}\left(\left(\frac{X}{\lambda}\right)^p\right) = \mathbb{E}\left(\left(\frac{1}{\lambda}\right)^p\right) \mathbb{E}\left(X^p\right)$$

(via the Mellin transform). Note that, as a property of the location-scale family, $\frac{1}{\lambda}\, f_X\left(\frac{x}{\lambda}\right) = f_{\lambda X}(x)$; so, for instance, if $x \sim \mathcal{N}(0, 1)$ (that is, normally distributed), then $\sigma x \sim \mathcal{N}(0, \sigma^2)$.

Where do the tails start? We assume the tails start at the level of convexity of the segment of the probability distribution to the scale of the distribution; in other words, at the part affected by the stochastic volatility effect.

. . The Crossovers and Tunnel Effect.
Notice in Figure . a series of crossover zones, invariant to $a$. Distributions called "bell shaped" have a convex-concave-convex shape (or quasi-concave shape).

Let $X$ be a random variable with PDF $p(x)$ from a general class of all unimodal one-parameter continuous pdfs $p_\sigma$, with support $\mathcal{D} \subseteq \mathbb{R}$ and scale parameter $\sigma$. Let $p(\cdot)$ be quasi-concave on the domain, but neither convex nor concave. The density function $p(x)$ satisfies: $p(x) \geq p(x + \epsilon)$ for all $\epsilon > 0$ and $x > x^*$, and $p(x) \geq p(x - \epsilon)$ for all $x < x^*$, with $x^* = \arg\max_x p(x)$; quasi-concavity means $p\left(\omega x + (1 - \omega) y\right) \geq \min\left(p(x), p(y)\right)$.

A. If the variable is "two-tailed", that is, its domain of support is $\mathcal{D} = (-\infty, \infty)$, and, where

$$p_\delta(x) \triangleq \frac{p(x, \sigma + \delta) + p(x, \sigma - \delta)}{2},$$

1. There exists a "high peak" inner tunnel, $A_T = (a_2, a_3)$, for which the $\delta$-perturbed-$\sigma$ probability distribution satisfies $p_\delta(x) \geq p(x)$ if $x \in (a_2, a_3)$.
2. There exist outer tunnels, the "tails", for which $p_\delta(x) \geq p(x)$ if $x \in (-\infty, a_1)$ or $x \in (a_4, \infty)$.
3. There exist intermediate tunnels, the "shoulders", where $p_\delta(x) \leq p(x)$ if $x \in (a_1, a_2)$ or $x \in (a_3, a_4)$.

Figure . : Where do the tails start? Fatter and fatter tails through perturbation of the scale parameter $\sigma$ for a Gaussian, made more stochastic (instead of being fixed). Some parts of the probability distribution gain in density, others lose. Intermediate events are less likely, tail events and moderate deviations are more likely. We can spot the crossovers $a_1$ through $a_4$: the "peak" is $(a_2, a_3)$, the "shoulders" are $(a_1, a_2)$ and $(a_3, a_4)$, and the "tails" proper start at $a_4$ on the right and $a_1$ on the left.

The Black Swan Problem: As we saw, it is not merely that events in the tails of the distributions matter, happen, play a large role, etc. The point is that these events play the major role and their probabilities are not (easily) computable, not reliable for any effective use. The implication is that Black Swans do not necessarily come from fat tails; the problem can result from an incomplete assessment of tail events.

Let $A = \{a_i\}$ be the set of solutions $\left\{x : \frac{\partial^2 p(x)}{\partial \sigma^2}\Big|_a = 0\right\}$.

For the Gaussian $(\mu, \sigma)$, the solutions, obtained by setting the second derivative with respect to $\sigma$ to 0, satisfy

$$\frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}} \left(2\sigma^4 - 5\sigma^2 (x - \mu)^2 + (x - \mu)^4\right)}{\sqrt{2\pi}\,\sigma^7} = 0, \quad ( . )$$

which produces the following crossovers:

$$\{a_1, a_2, a_3, a_4\} = \left\{\mu - \sqrt{\tfrac{1}{2}\left(5 + \sqrt{17}\right)}\,\sigma,\; \mu - \sqrt{\tfrac{1}{2}\left(5 - \sqrt{17}\right)}\,\sigma,\; \mu + \sqrt{\tfrac{1}{2}\left(5 - \sqrt{17}\right)}\,\sigma,\; \mu + \sqrt{\tfrac{1}{2}\left(5 + \sqrt{17}\right)}\,\sigma\right\}.$$

In Figure . , the crossovers for the intervals are numerically $\{-2.13\sigma, -0.66\sigma, 0.66\sigma, 2.13\sigma\}$.

As to a symmetric power law (as we will see further down), the Student T distribution with scale $\sigma$ and tail exponent $\alpha$,

$$p(x) \triangleq \frac{\left(\frac{\alpha}{\alpha + \frac{x^2}{\sigma^2}}\right)^{\frac{\alpha + 1}{2}}}{\sqrt{\alpha}\,\sigma\, B\left(\frac{\alpha}{2}, \frac{1}{2}\right)},$$

$$\{a_1, a_2, a_3, a_4\} = \left\{-\sigma\sqrt{\frac{5\alpha + 1 + \sqrt{(\alpha+1)(17\alpha+1)}}{2(\alpha - 1)}},\; -\sigma\sqrt{\frac{5\alpha + 1 - \sqrt{(\alpha+1)(17\alpha+1)}}{2(\alpha - 1)}},\; \sigma\sqrt{\frac{5\alpha + 1 - \sqrt{(\alpha+1)(17\alpha+1)}}{2(\alpha - 1)}},\; \sigma\sqrt{\frac{5\alpha + 1 + \sqrt{(\alpha+1)(17\alpha+1)}}{2(\alpha - 1)}}\right\},$$

where $B(\cdot)$ is the Beta function, $B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} = \int_0^1 t^{a-1}(1 - t)^{b-1}\, dt$.

When the Student is "cubic", that is, $\alpha = 3$:

$$\{a_1, a_2, a_3, a_4\} = \left\{-\sqrt{4 + \sqrt{13}}\,\sigma,\; -\sqrt{4 - \sqrt{13}}\,\sigma,\; \sqrt{4 - \sqrt{13}}\,\sigma,\; \sqrt{4 + \sqrt{13}}\,\sigma\right\}.$$

In Summary, Where Does the Tail Start?
For a general class of symmetric distributions with power laws, the tail starts at

$$\pm\,\sigma\,\sqrt{\frac{5\alpha + 1 + \sqrt{(\alpha+1)(17\alpha+1)}}{2(\alpha - 1)}},$$

with $\alpha$ infinite in the stochastic volatility Gaussian case, where $\sigma$ is the standard deviation. The "tail" is located between around 2 and 3 standard deviations. This flows from our definition: which part of the distribution is convex to errors in the estimation of the scale. But in practice, because historical measurements of STD will be biased lower owing to small sample effects (as we keep repeating, fat tails accentuate small sample effects), the deviations will be beyond 2-3 STDs.
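The Gaussian ($\alpha \to \infty$) crossovers can be recovered numerically from the quartic condition above (a small check, not an algorithm from the text):

```python
import math

# Gaussian crossovers: zeros of the second derivative of the density
# with respect to sigma, i.e. roots of
#   (x - mu)**4 - 5*sigma**2*(x - mu)**2 + 2*sigma**4 = 0,
# a quadratic in z = ((x - mu)/sigma)**2.
mu, sigma = 0.0, 1.0
z_inner = (5 - math.sqrt(17)) / 2   # inner pair (shoulder boundary)
z_outer = (5 + math.sqrt(17)) / 2   # outer pair: where the tails start
crossovers = sorted(mu + s * sigma * math.sqrt(z)
                    for z in (z_inner, z_outer) for s in (-1, 1))
print([round(c, 2) for c in crossovers])  # [-2.14, -0.66, 0.66, 2.14]
```

So the "tail" of a Gaussian under scale perturbation begins at roughly $2.13\sigma$ (here printed rounded to $2.14$), matching the crossovers quoted earlier.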
Figure . : We compare the behavior of $\sqrt{K + x^2}$ and $K + |x|$. The difference between the two weighting functions increases for large values of the random variable $x$, which explains the divergence of the two (and, more generally, of higher moments) under fat tails.

We can verify that when $\alpha \to \infty$, the crossovers become those of a Gaussian. For instance, for $a_2$:

$$\lim_{\alpha \to \infty} -\,\sigma\sqrt{\frac{5\alpha + 1 - \sqrt{(\alpha+1)(17\alpha+1)}}{2(\alpha - 1)}} = -\sqrt{\tfrac{1}{2}\left(5 - \sqrt{17}\right)}\,\sigma.$$

B. For some one-tailed distributions that have a "bell shape" of convex-concave-convex shape, under some conditions, the same crossover points hold. The Lognormal is a special case:

$$\{a_1, a_2, a_3, a_4\} = \left\{ e^{\mu - \sqrt{\frac{1}{2}\left(5 + \sqrt{17}\right)}\,\sigma},\; e^{\mu - \sqrt{\frac{1}{2}\left(5 - \sqrt{17}\right)}\,\sigma},\; e^{\mu + \sqrt{\frac{1}{2}\left(5 - \sqrt{17}\right)}\,\sigma},\; e^{\mu + \sqrt{\frac{1}{2}\left(5 + \sqrt{17}\right)}\,\sigma} \right\}.$$

Stochastic Parameters
The problem of elliptical distributions is that they do not map the return of securities, owing to the absence of a single variance at any point in time; see Bouchaud and Chicheportiche ( ) [ ]. When the scales of the distributions of the individuals move but not in tandem, the distribution ceases to be elliptical. Figure . shows the effect of applying the equivalent of stochastic volatility methods: the more annoying stochastic correlation. Instead of perturbating the correlation matrix $\Sigma$ as a unit, as in section , we perturbate the correlations, with surprising effect.

Next we discuss the beastly use of standard deviation and its interpretation.

.4 fat tails, mean deviation and the rising norms

. . The Common Errors
We start by looking at standard deviation and variance as properties of higher moments. Now, what is standard deviation? It appears that the same confusion about fat tails has polluted our understanding of standard deviation.

The difference between standard deviation (assuming mean and median of 0 to simplify)

$$\sigma = \sqrt{\frac{1}{n} \sum x_i^2}$$

and mean absolute deviation

$$\text{MAD} = \frac{1}{n} \sum |x_i|$$

increases under fat tails, as one can see in Figure . . This can provide a conceptual approach to the notion.

Dan Goldstein and the author [ ] put the following question to investment professionals and graduate students in financial engineering, people who work with risk and deviations all day long:

A stock (or a fund) has an average return of %. It moves on average % a day in absolute value; the average up move is % and the average down move is %. It does not mean that all up moves are %: some are . %, others . %, and so forth. Assume that we live in the Gaussian world in which the returns (or daily percentage moves) can be safely modeled using a Normal Distribution. Assume that a year has business days. What is its standard deviation of returns (that is, of the percentage moves), the "sigma" that is used for volatility in financial applications? What is the daily standard deviation? What is the yearly standard deviation?

Figure . : The ratio STD/MAD for the daily returns of the SP500 over the past years, seen with a monthly rolling window. We can consider the level $\sqrt{\frac{\pi}{2}} \approx 1.25$ (approximately the value for Gaussian deviations) as the cut point for fat tailedness.

As the reader can see, the question described mean deviation. And the answers were overwhelmingly wrong. For the daily question, almost all answered %. Yet a Gaussian random variable that has a daily percentage move in absolute terms of % has a standard deviation that is higher than that, about . %. It should be up to . % in empirical distributions. The most common answer for the yearly question was about %, which is about % of what would be the true answer. The professionals were scaling daily volatility to yearly volatility by multiplying by $\sqrt{256}$, which is correct provided one had the correct daily volatility.

So subjects tended to provide MAD as their intuition for STD. When professionals involved in financial markets and continuously exposed to notions of volatility talk about standard deviation, they use the wrong measure, mean absolute deviation (MAD), instead of standard deviation (STD), causing an average underestimation of between and %. In some markets it can be up to %. Further, responders rarely seemed to immediately understand the error when it was pointed out to them. However, when asked to present the equation for standard deviation, they effectively expressed it as the root mean square deviation. Some were puzzled as they were not aware of the existence of MAD.

Why this is relevant: here you have decision-makers walking around talking about "volatility" and not quite knowing what it means. We note some clips in the financial press to that effect, in which the journalist, while attempting to explain the "VIX", i.e., the volatility index, makes the same mistake. Even the website of the department of commerce misdefined volatility.

Further, there is an underestimation, as MAD is, by Jensen's inequality, lower than (or equal to) STD.

How the ratio rises
For a Gaussian the ratio is $\sim 1.25$, and it rises from there with fat tails.

Example: Take an extremely fat tailed distribution with $n = 10^6$ observations, all equal to $-1$ except for a single one of $10^6$:

$$X = \left\{-1, -1, \ldots, -1, 10^6\right\}.$$

The mean absolute deviation $\text{MAD}(X) \approx 2$. The standard deviation $\text{STD}(X) \approx 1000$. The ratio of standard deviation over mean absolute deviation is $\approx 500$.

. . Some Analytics

The ratio for thin tails
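Before the analytics, a quick numerical check of the numbers just given (the Gaussian baseline and the extreme example):

```python
import math

# Gaussian baseline: E|X| = sigma*sqrt(2/pi), so STD/MAD = sqrt(pi/2).
gauss_ratio = math.sqrt(math.pi / 2)
print(round(gauss_ratio, 3))               # 1.253

# The extreme example: n = 10**6 observations, all -1 except a single
# 10**6 (deviations taken around 0, as in the text's simplification).
n = 10**6
mad = ((n - 1) * 1 + 10**6) / n            # mean of |x_i|
std = (((n - 1) * 1 + 10**12) / n) ** 0.5  # sqrt of mean of x_i**2
print(round(mad, 3), round(std, 1))        # 2.0 1000.0
print(round(std / mad, 1))                 # 500.0
```

One single observation moves the ratio from the Gaussian's $1.25$ to $500$: the quadratic norm is dominated by the largest deviation.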
As a useful heuristic, consider the ratio $h$:

$$h = \frac{\sqrt{\mathbb{E}\left(X^2\right)}}{\mathbb{E}\left(|X|\right)},$$

where $\mathbb{E}$ is the expectation operator (under the probability measure of concern) and $X$ is a centered variable such that $\mathbb{E}(X) = 0$; the ratio increases with the fat tailedness of the distribution. (The general case corresponds to $\frac{\left(\mathbb{E}\left(|X|^p\right)\right)^{1/p}}{\mathbb{E}\left(|X|\right)}$, $p > 1$, under the condition that the distribution has finite moments up to $n$, and the special case here is $n = 2$.)

Simply, $x^p$ is a weighting operator that assigns a weight, $x^{p-1}$, which is large for large values of $X$ and small for smaller values. The effect is due to the convexity differential between both functions: $|X|$ is piecewise linear and loses the convexity effect except for a zone around the origin.

Mean Deviation vs Standard Deviation, more technical
Why the [REDACTED] did statistical science pick STD over mean deviation? Here is the story, with analytical derivations not seemingly available in the literature. In Huber [ ]:

    There had been a dispute between Eddington and Fisher, around 1920, about the relative merits of dₙ (mean deviation) and Sₙ (standard deviation). Fisher then pointed out that for exactly normal observations, Sₙ is 12% more efficient than dₙ, and this seemed to settle the matter. (My emphasis)

Let us rederive and see what Fisher meant. Let n be the number of summands:

Asymptotic Relative Efficiency (ARE) = lim_{n→∞} [ V(Std)/E(Std)² ] / [ V(Mad)/E(Mad)² ]

Assume we are certain that the Xᵢ, the components of the sample, follow a Gaussian distribution, normalized to mean 0 and a standard deviation of 1.

Relative Standard Deviation Error
The characteristic function Ψ(t) of the distribution of x²:

Ψ(t) = ∫_{−∞}^{∞} e^{−x²/2 + itx²} / √(2π) dx = 1/√(1 − 2it).

With the squared deviation z = x², the pdf for n summands becomes:

f_Z(z) = (1/2π) ∫_{−∞}^{∞} exp(−itz) (1 − 2it)^{−n/2} dt = 2^{−n/2} e^{−z/2} z^{n/2−1} / Γ(n/2), z > 0.

[Footnote: The word "infinite" moment is a bit ambiguous; it is better to present the problem as an "undefined" moment, in the sense that it depends on the sample and does not replicate outside of it. Say, for a two-tailed distribution (i.e., with support on the real line), the designation "infinite" might apply to the fourth moment, but not to the third.]

univariate fat tails, level 1, finite moments

Now take y = √z:

f_Y(y) = 2^{1−n/2} e^{−y²/2} y^{n−1} / Γ(n/2), y >
0, which corresponds to the Chi distribution with n degrees of freedom. Integrating to get the variance:

V_std(n) = n − 2 Γ((n+1)/2)² / Γ(n/2)².

And, with the mean equalling √2 Γ((n+1)/2)/Γ(n/2), we get

V(Std)/E(Std)² = n Γ(n/2)² / (2 Γ((n+1)/2)²) − 1.

Relative Mean Deviation Error
The characteristic function, again, for |x| is that of a folded Normal distribution, but let us redo it:

Ψ(t) = ∫₀^∞ √(2/π) e^{−x²/2 + itx} dx = e^{−t²/2} (1 + i erfi(t/√2)),

where erfi is the imaginary error function, erfi(z) = −i erf(iz).

The first moment (for the mean of n summands):

M₁ = −i ∂/∂t [ e^{−t²/(2n²)} (1 + i erfi(t/(√2 n))) ]ⁿ |_{t=0} = √(2/π).

The second moment:

M₂ = (−i)² ∂²/∂t² [ e^{−t²/(2n²)} (1 + i erfi(t/(√2 n))) ]ⁿ |_{t=0} = (2n + π − 2)/(π n).

Hence,

V(Mad)/E(Mad)² = (M₂ − M₁²)/M₁² = (π − 2)/(2n).

Finalmente, the Asymptotic Relative Efficiency for a Gaussian
ARE = lim_{n→∞} [ 2n ( n Γ(n/2)² / (2 Γ((n+1)/2)²) − 1 ) / (π − 2) ] = 1/(π − 2) ≈ .875,

which means that the standard deviation is 12.5% more "efficient" than the mean deviation conditional on the data being Gaussian, and these blokes bought the argument. Except that the slightest contamination blows up the ratio. We will show later why the norm ℓ² is not appropriate for about anything; but for now let us get a glimpse of how fragile the STD is.

Effect of Fatter Tails on the "efficiency" of STD vs MD
Consider a standard mixing model for volatility with an occasional jump occurring with probability p. We switch between Gaussians (keeping the mean constant and central at 0) with:

V(x) = σ²(1 + a) with probability p; σ² with probability (1 − p).

For ease, a simple Monte Carlo simulation would do. Using p = .01 and n = 1000, Figure . shows how the jump a causes the degradation: a minute presence of outliers makes MAD more "efficient" than STD, and small "outliers" of a few standard deviations cause MAD to be five times more efficient.

[Footnote: The natural way is to center MAD around the median; we find it more informative for many of our purposes here (and decision theory) to center it around the mean. We will make note when the centering is around the median.]

Figure .: Harald Cramér, of the Cramér condition, and the ruin problem.

Figure .: A simulation of the relative efficiency ratio of standard deviation over mean deviation when injecting a jump of size √(1 + a) × σ, as a multiple of σ the standard deviation.

Moments and The Power Mean Inequality
Let X ≜ (xᵢ)ᵢ₌₁ⁿ, ‖X‖_p ≜ ( Σᵢ₌₁ⁿ |xᵢ|^p / n )^{1/p}.

Figure .: Mean deviation (blue) vs standard deviation (yellow) for a finite variance power law. The result is expected (MD is the thinner distribution), complicated by the fact that the standard deviation has an infinite variance, since the square of a Paretian random variable with exponent α is Paretian with an exponent of α/2. In this example the mean deviation of the standard deviation is many times higher.

Figure .: For a Gaussian, there is small difference in distribution between MD and STD (adjusting for the mean for the purpose of visualization).
For any 1 ≤ p < q the following inequality holds:

( Σᵢ₌₁ⁿ wᵢ |xᵢ|^p )^{1/p} ≤ ( Σᵢ₌₁ⁿ wᵢ |xᵢ|^q )^{1/q}   ( . )

where the positive weights wᵢ sum to unity. (Note that we avoid p < 1.)

Proof.
The proof for positive p and q is as follows. Define the function f: ℝ⁺ → ℝ⁺, f(x) = x^{q/p}. f is a power function, so it has a second derivative,

f″(x) = (q/p)(q/p − 1) x^{q/p − 2},

which is strictly positive within the domain of f since q > p; hence f is convex. By Jensen's inequality,

f( Σᵢ₌₁ⁿ wᵢ xᵢ^p ) ≤ Σᵢ₌₁ⁿ wᵢ f(xᵢ^p), that is, ( Σᵢ₌₁ⁿ wᵢ xᵢ^p )^{q/p} ≤ Σᵢ₌₁ⁿ wᵢ xᵢ^q;

after raising both sides to the power 1/q (an increasing function, since 1/q is positive) we get the inequality.

What is critical for our exercise and the study of the effects of fat tails is that, for a given norm, dispersion of results increases values. For example, take a flat distribution, X = {1, 1}: ‖X‖₁ = ‖X‖₂ = ... = ‖X‖ₙ = 1. Perturbating while preserving ‖X‖₁, with X = {1/2, 3/2}, produces rising higher norms:
{‖X‖ₙ}ₙ₌₁⁴ = { 1, √5/2, (7/4)^{1/3}, 41^{1/4}/2 }.   ( . )
Trying again, with a wider spread, we get even higher values of the norms. With X = {0, 2},

{‖X‖ₙ}ₙ₌₁⁴ = { 1, √2, 2^{2/3}, 2^{3/4} }.   ( . )

So we can see (removing constraints and/or allowing for negative values) how higher moments become rapidly explosive. One property quite useful with power laws with infinite moments:

‖X‖_∞ = sup( |xᵢ| )ᵢ₌₁ⁿ   ( . )

Gaussian Case
For a Gaussian, where x ∼ N(0, σ) (we assume the mean is 0 without loss of generality), let E(.) be the expectation operator; then

E(X^p)/E(|X|) = 2^{(p−3)/2} ((−1)^p + 1) σ^{p−1} Γ((p+1)/2),   ( . )

where Γ(z) is the Euler gamma function, Γ(z) = ∫₀^∞ t^{z−1} e^{−t} dt. For odd moments, the ratio is 0. For even moments:

E(X²)/E(|X|) = √(π/2) σ,

hence

√(E(X²))/E(|X|) = STD/MD = √(π/2).

As to the fourth moment, E(X⁴)/E(|X|) equals 3 √(π/2) σ³.

For a power law distribution with tail exponent α = 3, say a Student T,

√(E(X²))/E(|X|) = STD/MD = π/2.

Pareto Case
For a standard Pareto distribution with minimum value (and scale) L, PDF f(x) = α L^α x^{−α−1} and standard deviation (L/(α−1)) √(α/(α−2)), we have, by centering around the mean,

STD/MD = (1/2) (α/(α−1))^{α−1} √(α/(α−2)).   ( . )

"Infinite" moments. Infinite moments, say infinite variance, always manifest themselves as computable numbers in the observed sample, yielding finite moments of all orders, simply because the sample is finite. A distribution, say, Cauchy, with undefined mean will always deliver a measurable mean in finite samples; but different samples will deliver completely different means. Figures . and . illustrate the "drifting" effect of the moments with increasing information.

Figure .: The mean of a series with undefined mean (Cauchy).

Comment: Why we should retire standard deviation, now!
The notion of standard deviation has confused hordes of scientists; it is time to retire it from common use and replace it with the more effective one of mean deviation. Standard deviation, STD, should be left to mathematicians, physicists and mathematical statisticians deriving limit theorems.

Figure .: The square root of the second moment of a series with infinite variance. We observe pseudo-convergence before a jump.

There is no scientific reason to use it in statistical investigations in the age of the computer, as it does more harm than good, particularly with the growing class of people in social science mechanistically applying statistical tools to scientific problems.

Say someone just asked you to measure the "average daily variations" for the temperature of your town (or for the stock price of a company, or the blood pressure of your uncle) over the past five days. How do you do it? Do you take every observation, square it, average the total, then take the square root? Or do you remove the sign and calculate the average? For there are serious differences between the two methods; the first produces the larger number. The first is technically called the root mean square deviation. The second is the mean absolute deviation, MAD. It corresponds to "real life" much better than the first, and to reality. In fact, whenever people make decisions after being supplied with the standard deviation number, they act as if it were the expected mean deviation.

It is all due to a historical accident: in 1893, the great Karl Pearson introduced the term "standard deviation" for what had been known as "root mean square error". The confusion started then: people thought it meant mean deviation.
The idea stuck: every time a newspaper has attempted to clarify the concept of market "volatility", it defined it verbally as mean deviation yet produced the numerical measure of the (higher) standard deviation.

But it is not just journalists who fall for the mistake: I recall seeing official documents from the Department of Commerce and the Federal Reserve partaking of the conflation, even regulators in statements on market volatility. What is worse, Goldstein and I found that a high number of data scientists (many with PhDs) also get confused in real life.

It all comes from bad terminology for something non-intuitive. By a psychological phenomenon called attribute substitution, some people mistake MAD for STD because the former comes to mind more easily; this is "Lindy", as it is well known by cheaters and illusionists.

1) MAD is more accurate in sample measurements, and less volatile than STD, since it is a natural weight, whereas standard deviation uses the observation itself as its own weight, imparting large weights to large observations, thus overweighing tail events.

2) We often use STD in equations but really end up reconverting it within the process into MAD (say, in finance, for option pricing). In the Gaussian world, STD is about 1.25 times MAD, that is, √(π/2). But we adjust with stochastic volatility, where the ratio of STD to MAD rises well above that.

3) Many statistical phenomena and processes have "infinite variance" (such as the popular Pareto 80/20 rule) but have finite, and sometimes very well behaved, mean deviations. Whenever the mean exists, MAD exists. The reverse (infinite MAD and finite STD) is never true.

4) Many economists have dismissed "infinite variance" models thinking these meant "infinite mean deviation". Sad, but true.
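Points 1) and 2) lend themselves to a quick numerical sketch. The code below is illustrative only: the contamination parameters (p = .01, jump multiplier a = 20) and the run counts are arbitrary assumptions, echoing the mixing model discussed earlier.

```python
import math
import random

def mad(xs):
    """Mean absolute deviation, centered around the mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def std(xs):
    """Root mean square deviation around the mean."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# The earlier extreme example: n observations, all -1 except a single 10**6.
n = 10**6
X = [-1.0] * (n - 1) + [float(n)]
ratio = std(X) / mad(X)   # close to 500

# Contaminated Gaussian: with probability p the variance jumps to (1 + a) sigma^2.
random.seed(7)

def sample(size, p, a, sigma=1.0):
    return [random.gauss(0.0, sigma * math.sqrt(1.0 + a)) if random.random() < p
            else random.gauss(0.0, sigma) for _ in range(size)]

def cv2(estimates):
    """Squared coefficient of variation of a list of estimates."""
    m = sum(estimates) / len(estimates)
    v = sum((e - m) ** 2 for e in estimates) / len(estimates)
    return v / m ** 2

runs = 2000
stds_c, mads_c, stds_d, mads_d = [], [], [], []
for _ in range(runs):
    s = sample(1000, 0.0, 0.0)      # pure Gaussian
    stds_c.append(std(s)); mads_c.append(mad(s))
    s = sample(1000, 0.01, 20.0)    # 1% contamination
    stds_d.append(std(s)); mads_d.append(mad(s))

# For the pure Gaussian, STD/MAD is close to sqrt(pi/2) ~ 1.25.
gauss_ratio = (sum(stds_c) / runs) / (sum(mads_c) / runs)

# Under contamination, STD is the noisier estimator: rel_eff > 1,
# reversing Fisher's "efficiency" argument.
rel_eff = cv2(stds_d) / cv2(mads_d)
```

The jump multiplier corresponds to outliers of roughly √21 ≈ 4.6 standard deviations; even this mild contamination flips the efficiency ranking.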
When the great Benoit Mandelbrot proposed his infinite variance models fifty years ago, economists freaked out because of the conflation.

It is sad that such a minor point can lead to so much confusion: our scientific tools are way too far ahead of our casual intuitions, which starts to be a problem with science. So I close with a statement by Sir Ronald A. Fisher: "The statistician cannot evade the responsibility for understanding the process he applies or recommends."

Note
The usual theory is that if random variables X₁, ..., Xₙ are independent, then

V(X₁ + ⋯ + Xₙ) = V(X₁) + ⋯ + V(Xₙ),

by the additivity of the variance for independent variables. But this does not mean that one cannot use another metric and, by a simple transformation, make it additive. As we will see, for the Gaussian md(X) = √(2/π) σ; for the Student T with 3 degrees of freedom, the factor is 2√3/π, etc.

[Footnotes: See a definition of "Lindy" in . . — Option pricing in the Black-Scholes formula is done using variance, but the price maps directly to MAD; an at-the-money straddle is just a conditional mean deviation. So we translate MAD into standard deviation, then back to MAD.]

Figure .: Rising norms and the unit circle/square: values of the iso-norm (|x₁|^p + |x₂|^p)^{1/p} = 1. We notice the area inside the norm (i.e., satisfying norm ≤ 1), v(p) = 4 Γ(1/p + 1)² / Γ(2/p + 1), with v(1) = 2 and v(∞) = 4.

Figure .: Rising norms and the unit cube: values of the iso-norm (|x₁|^p + |x₂|^p + |x₃|^p)^{1/p} = 1 for p = 1, 2, 3, 4, and ∞. The volume satisfying norm ≤ 1 increases from 4/3 for p = 1, through 4π/3 for p = 2 (the unit sphere), to 8 for p = ∞ (the unit cube), a much higher relative increase than in Figure . . We can see the operation of the curse of dimensionality in the smaller and smaller volume for p = 1 relative to the maximum when p = ∞.

Norms and Dimensionality
Figure .: The curse of dimensionality, with yuuuge applications across statistical areas, particularly model error in higher dimensions. As d increases, the ratio of V_∞ over V₁ blows up: for d = 2 it is 2; it is already six figures for d = 9.

Visualizing the effect of rising p on iso-norms

Consider the region R⁽ⁿ⁾(p), defined as X = (x₁, ..., xₙ) : ( Σᵢ₌₁ⁿ |xᵢ|^p )^{1/p} ≤ 1, with the border defined by the identity. As the norm rises, we calculate the following measure of the ball:

V_p^n = ∫...∫_{R⁽ⁿ⁾(p)} dX = (2 Γ(1/p + 1))ⁿ / Γ(n/p + 1)

Figures . and . show two effects. The first is how rising norms occupy a larger share of the space. The second gives us a hint of the curse of dimensionality, useful in many circumstances (and, centrally, for model error). Compare Figures . and .: you will notice that in the first case, for d = 2, the p = 1 ball occupies half the area of the square, while p = ∞ occupies all of it; the ratio is 2. But for d = 3, p = 1 occupies 1/6 of the space (again, p = ∞ occupies all of it). The ratio of higher norms to lower norms increases with dimensionality, as seen in Figure . .

Further Reading: We stop here and present probability books in general. For more general intuition about probability, the indispensable Borel [ ]; Kolmogorov [ ], Loève [ ], Feller [ ], [ ]. For measure theory, Billingsley [ ]. For subexponentiality:
Pitman [ ], Embrechts and Goldie [ ], Embrechts [ ] (which seems to be close to his doctoral thesis), Chistyakov [ ], Goldie [ ], and Teugels [ ].

For extreme value distributions: Embrechts et al. [ ], De Haan and Ferreira [ ].

For stable distributions: Uchaikin and Zolotarev [ ], Zolotarev [ ], Samorodnitsky and Taqqu [ ].

Stochastic processes:
Karatzas and Shreve [ ], Øksendal [ ], Varadhan [ ].

LEVEL 2: SUBEXPONENTIALS AND POWER LAWS

This chapter briefly presents the subexponential vs. the power law classes as "true fat tails" (already defined in Chapter ) and presents some wrinkles associated with them. Subexponentiality without scalability, that is, membership in the subexponential but not the power law class, is a small category: of the common distributions, only the borderline exponential (and gamma-associated distributions such as the Laplace) and the lognormal fall in that class.

Revisiting the Rankings
Table . reviews the rankings of Chapter . Recall that probability distributions range between extreme thin-tailed (Bernoulli) and extreme fat tailed. Among the categories of distributions that are often distinguished due to the convergence properties of moments are:

1. Having a support that is compact (but not degenerate)
2. Subgaussian
3. Subexponential
4. Power Law with exponent greater than 2
5. Power Law with exponent less than or equal to 2. In particular, Power Law distributions have a finite mean only if the exponent is greater than 1, and have a finite variance only if the exponent exceeds 2.
6. Power Law with exponent less than 1

Our interest is in distinguishing between cases where tail events dominate impacts, as a formal definition of the boundary between the categories of distributions to be considered as Mediocristan and Extremistan.

Centrally, a subexponential distribution is the cutoff between "thin" and "fat" tails. It is defined as follows.

Table .: Ranking distributions
Class: Description
True Thin Tails: compact support (e.g., Bernoulli, Binomial)
Thin Tails: Gaussian reached organically through summation of true thin tails, by Central Limit; compact support except at the limit n → ∞
Conventional Thin Tails: Gaussian approximation of a natural phenomenon
Starter Fat Tails: higher kurtosis than the Gaussian but rapid convergence to Gaussian under summation
Subexponential: e.g., lognormal
Supercubic: the Cramér condition does not hold: ∫ e^{tx} dF(x) = +∞ for t > 0
Infinite Variance: Lévy Stable, α < 2; ∫ e^{tx} dF(x) = +∞
Undefined First Moment: fuhgetaboutdit

The mathematics is crisp: the exceedance probability, or survival function, needs to be exponential in one, not the other. Where is the border?

The natural boundary between Mediocristan and Extremistan occurs at the subexponential class, which has the following property.

Let X = (X₁, ..., Xₙ) be a sequence of independent and identically distributed random variables with support in ℝ⁺, with cumulative distribution function F. The subexponential class of distributions is defined by (see [ ], [ ]):

lim_{x→+∞} (1 − F*²(x)) / (1 − F(x)) = 2,   ( . )

where F*² = F ∗ F is the cumulative distribution of X₁ + X₂, the sum of two independent copies of X. This implies that the probability that the sum X₁ + X₂ exceeds a value x is twice the probability that either one separately exceeds x. Thus, every time the sum exceeds x, for large enough values of x, the value of the sum is due to either one or the other exceeding x (the maximum over the two variables), and the other of them contributes negligibly.

More generally, it can be shown that the sum of n variables is dominated by the maximum of the values over those variables in the same way. Formally, the following two properties are equivalent to the subexponential condition [ ], [ ]. For a given n ≥
2, let Sₙ = Σᵢ₌₁ⁿ xᵢ and Mₙ = max_{1≤i≤n} xᵢ:

a) lim_{x→∞} P(Sₙ > x)/P(X > x) = n,

b) lim_{x→∞} P(Sₙ > x)/P(Mₙ > x) = 1.

Thus the sum Sₙ has the same magnitude as the largest sample Mₙ, which is another way of saying that tails play the most important role.

Intuitively, tail events in subexponential distributions should decline more slowly than for an exponential distribution, for which large tail events would be irrelevant. Indeed, one can show that subexponential distributions have no exponential moments:

∫₀^∞ e^{εx} dF(x) = +∞   ( . )

for all values of ε greater than zero. However, the converse isn't true, since distributions can have no exponential moments yet not satisfy the subexponential condition.

We note that if we choose to indicate deviations as negative values of the variable x, the same result holds by symmetry for extreme negative values, replacing x → +∞ with x → −∞. For two-tailed variables, we can separately consider the positive and negative domains.

What is a Borderline Probability Distribution?
The best way to figure out a probability distribution is to... invent one. In fact, in the next section we will build one that is the exact borderline between thin and fat tails by construction. Consider for now that the properties are as follows. Let F̄ be the survival function, F̄: ℝ → [0, 1], satisfying

lim_{x→+∞} F̄(x)ⁿ / F̄(nx) = 1,   ( . )

and

lim_{x→+∞} F̄(x) = 0, lim_{x→−∞} F̄(x) = 1.

Note: another property of the demarcation is the absence of the Lucretius fallacy from The Black Swan, mentioned earlier (i.e., future extremes will not be similar to past extremes under fat tails, and such dissimilarity increases with fat tailedness).

Let us look at the demarcation properties for now. Let X be a random variable that lives in either (0, ∞) or (−∞, ∞) and E the expectation operator under the "real world" (physical) distribution. By classical results [ ]:
0, then X is in the borderline exponential classThe first case is called the "Lindy effect" when the random variable X is time sur-vived. The subject is examined outside of this fat-tails project. See Iddo eliazar’sexposition [ ]. InventedGaussian x Figure . : Comparing theinvented distribution (at thecusp of subexponentiality)to the Gaussian of the samevariance (k = 1 ). It does nottake much to switch fromGaussian to subexponentialproperties. . . Let Us Invent a Distribution
While the exponential distribution is at the cusp of the subexponential class butwith support in [0, ¥ ), we can construct a borderline distribution with support in( (cid:0) ¥ , ¥ ), as follows . Find survival functions F : R ! [0, 1] that satisfy: x (cid:21)
0:

lim_{x→+∞} F̄(x)² / F̄(2x) = 1,  F̄′(x) ≤ 0,  lim_{x→+∞} F̄(x) = 0,  lim_{x→−∞} F̄(x) = 1.

[Footnote: The Laplace distribution, which doubles the exponential on both sides, does not fit the property, as the ratio of the square to the double is 1/2.]

Let us assume as a candidate function a sigmoid, using the hyperbolic tangent:

F̄_k(x) = (1/2) (1 − tanh(kx)), k > 0.

Let f(.) be the density function:

f(x) = −∂F̄(x)/∂x = (1/2) k sech²(kx).   ( . )

The characteristic function:

φ(t) = (π t / (2k)) csch(π t / (2k)).   ( . )

Given that it is all real, we can guess that the mean is 0, and so are all odd moments. The second moment will be lim_{t→0} (−i)² ∂²/∂t² φ(t) = π²/(12k²), and the fourth moment lim_{t→0} (−i)⁴ ∂⁴/∂t⁴ φ(t) = 7π⁴/(240k⁴); hence the kurtosis will be 21/5 = 4.2. The distribution we invented has slightly fatter tails than the Gaussian.

Now we get into the serious business.
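The construction above can be checked numerically in a few lines; this is a sketch with k = 1, where the survival function is evaluated in the algebraically equivalent form 1/(1 + e^{2kx}) to avoid floating-point cancellation in 1 − tanh(kx):

```python
import math

K = 1.0

def sf(x, k=K):
    # survival function (1 - tanh(k x)) / 2, in the stable form 1/(1 + e^{2kx})
    return 1.0 / (1.0 + math.exp(2.0 * k * x))

def pdf(x, k=K):
    # density (k/2) sech^2(k x)
    return 0.5 * k / math.cosh(k * x) ** 2

# Borderline property: sf(x)^2 / sf(2x) -> 1 as x grows
borderline = sf(20.0) ** 2 / sf(40.0)

# The Laplace tail (1/2) e^{-x} fails it: the same ratio equals 1/2
laplace = (0.5 * math.exp(-20.0)) ** 2 / (0.5 * math.exp(-40.0))

def moment(p, lo=-40.0, hi=40.0, n=40001):
    """p-th moment by Simpson's rule (tails beyond |x| = 40 are ~e^{-80})."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        w = 1 if i in (0, n - 1) else (4 if i % 2 else 2)
        total += w * (x ** p) * pdf(x)
    return total * h / 3.0

m2 = moment(2)                  # pi^2 / (12 k^2)
kurtosis = moment(4) / m2 ** 2  # 21/5 = 4.2, slightly fatter than the Gaussian's 3
```

The quadrature confirms both the second moment π²/12 and the kurtosis 4.2 derived from the characteristic function.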
Why power laws? There are a lot of theories on why things should be power laws, as sort of exceptions to the way things work probabilistically. But it seems that the opposite idea is never presented: power laws should be the norm, and the Gaussian a special case ([ ]); this is effectively the topic of Antifragile and the next volume of the Technical Incerto, owing to concave-convex responses (a sort of dampening of fragility and antifragility, bringing robustness, hence thinning the tails).

Scalable and Nonscalable, A Deeper View of Fat Tails
So far in the discussion on fat tails we stayed in the finite moments case. For a certain class of distributions, those with finite moments, P(X > nK)/P(X > K) depends on n and K. For a scale-free distribution, with K "in the tails", that is, large enough, P(X > nK)/P(X > K) depends on n, not K. These latter distributions lack a characteristic scale and will end up having a Paretian tail, i.e., for x large enough, P(X > x) = C x^{−α}, where α is the tail exponent and C is a scaling constant.

Note: as we can see from the scaling difference between the Student and the Pareto, the conventional definition of a power law tailed distribution is expressed more

Figure .: Three Types of Distributions. As we hit the tails, the Student remains scalable while the Standard Lognormal shows an intermediate position before eventually ending up getting an infinite slope on a log-log plot. But beware the lognormal, as it may have some surprises (Chapter ).

Table .: Scalability, comparing regularly varying functions/power laws to other distributions. For each k, the columns give 1/P(X > k) and the ratio P(X > k)/P(X > 2k) for the Gaussian, the Student T (3), and the Pareto (2):
k = 2: Gaussian 44 and 7.2 × 10²; Student (3) 14 and 5.0; Pareto (2) 4 and 4.
k = 4: Gaussian 3.2 × 10⁴ and 5.1 × 10¹⁰; Student (3) 71 and 6.9; Pareto (2) 16 and 4.
k = 8: Gaussian 1.6 × 10¹⁵ and 9.7 × 10⁴¹; Student (3) 4.9 × 10² and 7.7; Pareto (2) 64 and 4.
k = 16: Gaussian fughedaboudit; Student (3) 3.8 × 10³ and 7.9; Pareto (2) 256 and 4.

formally as P(X > x) = L(x) x^{−α}, where L(x) is a "slowly varying function" which satisfies

lim_{x→∞} L(tx)/L(x) = 1

for all constants t > 0. The quantity log P(X > x)/log x converges to a constant, namely the tail exponent −α: a scalable should produce the slope −α in the tails on a log-log plot, as x → ∞. Compare to the Gaussian (with STD σ and mean μ), by taking the PDF this time instead of the exceedance probability:

log(f(x)) = −(x − μ)²/(2σ²) − log(σ √(2π)) ≈ −x²/(2σ²),

which goes to −∞ faster than −α log(x) for ±x → ∞.

So far this gives us the intuition of the difference between classes of distributions. Only scalables have "true" fat tails, as others turn into a Gaussian under summation. And the tail exponent is asymptotic; we may never get there and what we may see is an intermediate version of it. The figure above drew from Platonic off-the-shelf distributions; in reality, processes are vastly more messy, with switches between exponents as deviations get larger.

Definition . (the class P). The P class of power laws (regular variation) is defined for r.v. X as follows:

P = { X : P(X > x) ∼ L(x) x^{−α} }   ( . )

Grey Swans
Figure .: The graph represents the log-log plot of GBP, the British currency. We can see the "Grey Swan" of Brexit (that is, the jump in the currency when the unexpected referendum results came out); when seen using a power law, the large deviation is rather consistent with the statistical properties.
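The scalability comparison in the table above can be recomputed directly from closed-form survival functions; the sketch below uses the standard closed forms for the Gaussian, the Student T with 3 degrees of freedom, and the Pareto with exponent 2 (scale L = 1):

```python
import math

def sf_gauss(x):
    """Survival function of the standard Gaussian."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def sf_student3(x):
    """Survival function of the Student T with 3 degrees of freedom."""
    return 0.5 - (math.atan(x / math.sqrt(3.0))
                  + math.sqrt(3.0) * x / (3.0 + x * x)) / math.pi

def sf_pareto2(x, L=1.0):
    """Survival function of the Pareto with tail exponent 2 and scale L."""
    return (L / x) ** 2 if x >= L else 1.0

ks = [2.0, 4.0, 8.0, 16.0]
gauss_ratios   = [sf_gauss(k) / sf_gauss(2 * k) for k in ks]        # explodes with k
student_ratios = [sf_student3(k) / sf_student3(2 * k) for k in ks]  # approaches 2**3 = 8
pareto_ratios  = [sf_pareto2(k) / sf_pareto2(2 * k) for k in ks]    # exactly 2**2 = 4, flat in k
```

Only the Pareto's ratio is constant in k (true scalability); the Student's converges to the Paretian behavior of its tail, while the Gaussian's ratio grows without bound.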
Why do we use Student T to simulate symmetric power laws?
For convenience, only for convenience. It is not that we believe that the generating process is Student T. Simply, the center of the distribution does not matter much for the properties involved in certain classes of decision making. The lower the exponent, the less the center plays a role; the higher the exponent, the more the Student T resembles the Gaussian, and the more justified its use will be accordingly.

Figure .: Book Sales: the near tail can be robust for estimation of sales from rank and vice versa; it works well and shows robustness so long as one doesn't compute general expectations or higher non-truncated moments.

More advanced methods involving the use of Lévy laws may help in the event of asymmetry, but the use of two different Pareto distributions with two different
exponents, one for the left tail and the other for the right, would do the job (without unnecessary complications).

Figure .: The Turkey Problem, where nothing in the past properties seems to indicate the possibility of the jump.
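The convenience can be sketched in code: simulate a symmetric power law via the Student T with 3 degrees of freedom (sampled as Z/√(W/3)) and recover the tail exponent with the Hill estimator. The sample size, the cutoff of k = 1000 order statistics, and the seed are arbitrary illustrative choices; note how rough the recovery is even in this idealized setting.

```python
import math
import random

random.seed(42)

def student_t3():
    """Student T with 3 degrees of freedom: Z / sqrt(W/3), W ~ chi-square(3)."""
    z = random.gauss(0.0, 1.0)
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / math.sqrt(w / 3.0)

n, k = 200_000, 1000
xs = sorted(abs(student_t3()) for _ in range(n))  # symmetric law: fold to |x|

threshold = xs[-(k + 1)]                          # (k+1)-th largest observation
# Hill estimator of the tail exponent over the k largest observations
alpha_hat = k / sum(math.log(x / threshold) for x in xs[-k:])
```

With 200,000 draws the estimate hovers around the true α = 3, but it moves visibly with the choice of k and the seed, a foretaste of the calibration difficulties with real data.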
Estimation issues
Note that there are many methods to estimate the tail exponent α from data, what is called "calibration". However, as we will see, the tail exponent is rather hard to guess, and its calibration is marred with errors, owing to the insufficiency of data in the tails. In general, the data will show a thinner tail than it should. We will return to the issue in more depth in later chapters.

Two central properties

Sums of variables

Property (tail exponent of a sum). Let X₁, X₂, ..., Xₙ be random variables, neither independent nor identically distributed, each Xᵢ following a distribution with a different asymptotic tail exponent αᵢ (we assume that random variables outside the power law class have an asymptotic alpha of +∞). Assume further we are concerned with the right tail of the distribution (the argument remains identical when we apply it to the left tail). See [ ] for further details.

Consider the weighted sum Sₙ = Σᵢ₌₁ⁿ wᵢXᵢ, with all weights wᵢ strictly positive, and let α_s be the tail exponent of the sum. For all wᵢ > 0,

α_s = min(αᵢ).

Clearly, if α₁ ≤ α₂ and w₁ > 0,

lim_{x→∞} − log( w₁ x^{−α₁} + w₂ x^{−α₂} ) / log(x) = α₁.

The implication is that adding a single summand with undefined (or infinite) mean, variance, or higher moments leads the total sum to have undefined (or infinite) mean, variance, or higher moments.

Principle (Power Laws + Thin Tails = Power Laws). Mixing power law distributed and thin tailed variables results in power laws, no matter the composition.

Transformations
The second property, while appearing benign, can be vastly more annoying:
Property. Let X be a random variable with tail exponent α. The tail exponent of X^p is α/p.

This tells us that the (sample) variance of a finite variance random variable with tail exponent α < 4 will itself have infinite variance, since X² then has tail exponent α/2 < 2.

Proof.
The general approach is as follows. Let p(.) be a probability density function and φ(.) a transformation (with some restrictions). The density of the transformed variable (assuming the support is conserved, i.e., stays the same) is

p_φ(y) = p( φ⁻¹(y) ) / φ′( φ⁻¹(y) ).   ( . )

Assume that x > l, where l is large (i.e., a point where the slowly varying function "ceases to vary" within some order of x). The PDF for these values of x can be written as p(x) ∝ K x^{−α−1}. Consider y = φ(x) = x^p: the inverse function is x = y^{1/p}, so

p_φ(y) = p( y^{1/p} ) (1/p) y^{1/p − 1} ∝ (K/p) y^{−α/p − 1}.

Integrating above l, the survival function will be: P(Y > y) ∝ y^{−α/p}.

The slowly varying function effect, a case study
The fatter the tails, the less the "body" matters for the moments (which become infinite, eventually). But for power laws with thinner tails, the zone that is not power law (the slowly varying part) plays a role; "slowly varying" is more or less formally defined in . . , . . and . . . This section will show how apparently equal distributions can have different shapes.

Let us compare a double Pareto distribution with PDF

f_P(x) = (α/2)(1 + x)^{−α−1} for x ≥ 0; (α/2)(1 − x)^{−α−1} for x < 0,

and the Student-like PDF

f_S(x) = α^{α/2} ( α + x²/s² )^{−(α+1)/2} / ( s B(α/2, 1/2) ),

where B(.) is the Euler beta function, B(a, b) = Γ(a)Γ(b)/Γ(a + b) = ∫₀¹ t^{a−1}(1 − t)^{b−1} dt.

We have two ways to compare distributions.

• Equalizing by tail ratio: setting lim_{x→∞} f_P(x)/f_S(x) = 1, we get the equivalent "tail" distribution with s = ( α^{1−α/2} B(α/2, 1/2) / 2 )^{1/α}.

• Equalizing by standard deviations (when finite): with α > 2 we have E(X_P²) = 2/((α−1)(α−2)) and E(X_S²) = α s²/(α−2), so we could set √(E(X_P²)) = √(E(X_S²)) and solve for s.

Finally, we have the comparison of the "bell shape" semi-concave density vs. the angular double-convex one, as seen in Figure . .

Figure .: Comparing two symmetric power laws of the same exponent, one with a brief slowly varying function, the other with an extended one. All moments eventually become the same in spite of the central differences in their shape for small deviations.
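The comparison can be sketched numerically. This is a sketch under stated assumptions: the double Pareto density (α/2)(1 + |x|)^{−α−1} and the Student-like form with scale s chosen by the tail-ratio recipe, with α = 3; the Simpson integration bounds are arbitrary.

```python
import math

alpha = 3.0
B = math.gamma(alpha / 2) * math.gamma(0.5) / math.gamma((alpha + 1) / 2)  # B(alpha/2, 1/2)

def f_p(x):
    """Double Pareto density: (alpha/2) (1 + |x|)^(-alpha-1)."""
    return 0.5 * alpha * (1.0 + abs(x)) ** (-alpha - 1.0)

# Scale equalizing the tail ratio: lim f_p / f_s = 1
s = (alpha ** (1.0 - alpha / 2.0) * B / 2.0) ** (1.0 / alpha)

def f_s(x):
    """Student-like density with tail exponent alpha and scale s."""
    return (alpha ** (alpha / 2.0)
            * (alpha + (x / s) ** 2) ** (-(alpha + 1.0) / 2.0) / (s * B))

def integrate(f, lo=-500.0, hi=500.0, n=200001):
    """Simpson's rule; the truncated tails contribute ~1e-8."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        w = 1 if i in (0, n - 1) else (4 if i % 2 else 2)
        total += w * f(lo + i * h)
    return total * h / 3.0

norm_p, norm_s = integrate(f_p), integrate(f_s)  # both ~ 1
tail_ratio = f_p(1000.0) / f_s(1000.0)           # ~ 1: identical far tails
center_ratio = f_p(0.0) / f_s(0.0)               # != 1: very different "bodies"
```

Identical tails, different centers: exactly the slowly-varying-function effect described above.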
Consider Jobless Claims during the COVID-19 pandemic: unemployment jumped many so-called standard deviations in March of 2020. But was the jump an outlier? Maybe, if you look at Fig. . and think like someone trained in thin tails. But not really. As Figure . shows, the tail exponent is hardly changed. The scale of the distribution could perhaps vary, but the exponent is patently robust to out-of-sample observations.

Log Changes in Jobless Claims
Figure . : Jobless claims: the jump looks like a surprise... but only to untrained economists. As Fig. . shows, it shouldn't be. And to the trained eyes (à la Benoit Mandelbrot), variations were mild but certainly never Gaussian.

The mother of all fat tails, the log-Pareto distribution, is not present in common lists of distributions but we can rederive it here. The log-Pareto is the Paretian analog of the lognormal distribution.

Figure . : Zipf plot for jobless claims: we did not need the abrupt jump during the COVID-19 pandemic (last point on the right) to realize it was a power law.

Remark : Rediscovering the log-Pareto distribution

If X ~ P(L, α), the Pareto distribution with PDF f^(P)(x) = α L^α x^{-α-1}, x ≥ L, and survival function S^(P)(x) = L^α x^{-α}, then:

e^X ~ LP(L, α), the log-Pareto distribution with PDF

f^(LP)(x) = α L^α log^{-α-1}(x)/x, x ≥ e^L,

and survival function S^(LP)(x) = L^α log^{-α}(x).

While for a regular power law we have an asymptotic linear slope on the log-log plot, i.e.,

lim_{x→∞} log(L^α x^{-α}) / log(x) = -α,

the slope for a log-Pareto goes to 0:

lim_{x→∞} log(L^α log(x)^{-α}) / log(x) = 0,

and clearly no moment can exist regardless of the value of the tail parameter α. The difference between asymptotic behaviors is visible in Fig. . .

We mentioned earlier in Chapter that a " sigma" statement means we are not in the Gaussian world. We also discussed the problem of nonobservability of probability distributions: we observe data, not generating processes.

.6 pseudo-stochastic volatility: an investigation

Figure . : Comparing log-log plots for the survival functions of the Pareto and log-Pareto
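The flattening log-log slope of the log-Pareto survival function, versus the constant slope of the Pareto, can be checked directly from the closed forms above; a small sketch with illustrative values α = 2, L = 1 (my choices, not the text's):

```python
import numpy as np

alpha, L = 2.0, 1.0

def surv_pareto(x):
    return L**alpha * x ** (-alpha)          # S(x) = L^a x^(-a)

def surv_logpareto(x):
    return L**alpha * np.log(x) ** (-alpha)  # S(x) = L^a log(x)^(-a)

def local_slope(surv, x, h=1e-4):
    """Local slope d log S / d log x of a survival function at x."""
    return (np.log(surv(x * np.e**h)) - np.log(surv(x))) / h

sp = local_slope(surv_pareto, 1e8)       # constant at -alpha
s1 = local_slope(surv_logpareto, 1e2)    # -alpha/log(x): already shallow
s2 = local_slope(surv_logpareto, 1e8)    # flatter still, drifting toward 0
```

The log-Pareto's local slope is -α/log(x), so on a Zipf plot it bends ever flatter instead of staying a straight line of slope -α.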
It is therefore easy to be fooled by a power law, mistaking it for a heteroskedastic process. In hindsight, we can always say: "conditional volatility was high; at such a standard deviation it is no longer a sigma, but a mere sigma deviation". The way to debunk these claims is to reason with the aid of an inverse problem: how a power law with a constant scale can masquerade as a heteroskedastic process. We will see in Appendix how econometrics' reliance on heteroskedasticity (i.e. moving variance) has severe defects, since the variance of that variance doesn't have a structure.
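The masquerade is easy to reproduce by simulation; a hypothetical sketch (Student T with α = 3, constant scale, a 22-day rolling window; all parameter choices are mine for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, df = 2500, 3                    # ~10 years of daily returns, tail exponent 3
r = rng.standard_t(df, size=n)     # constant scale: no stochastic volatility here

window = 22                        # "monthly" realized volatility
vol = np.array([r[i:i + window].std() for i in range(n - window)])

ratio = vol.max() / vol.min()
# A constant-scale power law already produces wildly varying realized volatility,
# mimicking a heteroskedastic (stochastic volatility) process.
```

Plotting `vol` reproduces the look of the figure below: apparent volatility regimes generated by a process whose scale never changes.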
Figure . : Running -day (i.e., corresponding to monthly) realized volatility (standard deviation) for Student T distributed returns sampled daily. It gives the impression of stochastic volatility when in fact the scale of the distribution is constant.

Fig. . shows the volatility of returns of a market that greatly resembles what one would obtain from a standard simple stochastic volatility process. By stochastic volatility we assume the variance is distributed randomly.

Let X be the returns with mean 0 and scale s, with PDF φ(.):

φ(x) = (α / (α + x²/s²))^{(α+1)/2} / (√α s B(α/2, 1/2)),  x ∈ (-∞, ∞).

Transforming to get Y = X² (to get the distribution of the second moment), the PDF for Y becomes

ψ(y) = (α s² / (α s² + y))^{(α+1)/2} / (s B(α/2, 1/2) √(α y)),  y ∈ [0, ∞),

which we can see transforms into a power law with asymptotic tail exponent α/2. The characteristic function c_Y(ω) = E(exp(iωY)) can be written in closed form in terms of the gamma function and the regularized confluent hypergeometric functions ₁F̃₁, from which we get the mean deviation of the second moment in closed form, as a combination of hypergeometric and gamma values.

One can have models with either stochastic variance or stochastic standard deviation. The two have different expectations.

As customary, we do not use standard deviation as a metric owing to its instability and its lack of information, but prefer mean deviation.

next
The next chapter will venture into higher dimensions. Some consequences are obvious, others less so –say, correlations exist even when covariances do not.
T H I C K T A I L S I N H I G H E R D I M E N S I O N S †

This discussion is about as simplified a handling of higher dimensions as possible. We will look at 1) the simple effect of fat-tailedness for multiple random variables, 2) ellipticality and distributions, 3) random matrices and the associated distribution of eigenvalues, 4) how we can look at covariance and correlations when moments don't exist (say, as in the Cauchy case).

Figure . : Thick tails in higher dimensions: for a -dimensional vector, thin tails (left) and thick tails (right) of the same variance. In place of a bell curve with higher peak (the "tunnel") of the univariate case, we see an increased density of points towards the center.

Discussion chapter.

We will build the intuitions of thick tails from convexity to scale as we did in the previous chapter, but using higher dimensions.

Let ⇀X = (X_1, X_2, ..., X_m) be an m × 1 random vector with joint density f(x_1, ..., x_m). We denote the m-variate multivariate Normal distribution by N(⇀μ, Σ), with mean vector ⇀μ, variance-covariance matrix Σ, and joint PDF

f(⇀x) = (2π)^{-m/2} |Σ|^{-1/2} exp(-½ (⇀x - ⇀μ)ᵀ Σ^{-1} (⇀x - ⇀μ)),   ( . )

where ⇀x = (x_1, ..., x_m) ∈ R^m, and Σ is a symmetric, positive definite (m × m) matrix. We can apply the same simplified variance-preserving heuristic as in . to fatten the tails:

f_a(⇀x) = ½ (2π)^{-m/2} |Σ_1|^{-1/2} exp(-½ (⇀x - ⇀μ)ᵀ Σ_1^{-1} (⇀x - ⇀μ)) + ½ (2π)^{-m/2} |Σ_2|^{-1/2} exp(-½ (⇀x - ⇀μ)ᵀ Σ_2^{-1} (⇀x - ⇀μ)),   ( . )

where a is a scalar that determines the intensity of stochastic volatility, Σ_1 = Σ(1 + a) and Σ_2 = Σ(1 - a).

Figure . : Elliptically Contoured Joint Returns of a Powerlaw (Student T).

We can simplify by assuming, as we did in the single dimension case, without any loss of generality, that ⇀μ = (0, ..., 0).
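The variance-preserving mixture above can be checked by simulation; in this sketch (dimension 2, correlation 0.5, a = 0.8, and the sample size are all illustrative assumptions of mine) the covariance matrix is recovered while the marginal kurtosis rises above the Gaussian 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, a = 500_000, 0.8
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# half the draws use Sigma*(1+a), half Sigma*(1-a): covariance preserved on average
z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
scale = np.where(rng.random(n) < 0.5, np.sqrt(1 + a), np.sqrt(1 - a))
x = z * scale[:, None]

cov = np.cov(x.T)                                          # recovers Sigma
kurt = np.mean(x[:, 0] ** 4) / np.mean(x[:, 0] ** 2) ** 2  # 3(1 + a^2) > 3
```

Under these assumptions the marginal kurtosis of the mixture is 3(1 + a²) = 4.92 while the covariance matrix is unchanged: fatter tails at a constant correlation structure.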
.1 thick tails in higher dimension, finite moments

Figure . : Nonelliptical joint returns, from stochastic correlations.
Notice in Figure . , as with the one-dimensional case, a concentration in the middle part of the distribution.

Figure . : Elliptically contoured joint returns for a multivariate distribution (x, y, z), solving to the same density.

We created thick tails by making the variances stochastic while keeping the correlations constant; this is to preserve the positive definite character of the matrix.

Figure . : Nonelliptical joint r.v., from stochastic correlations, for a multivariate distribution (x, y, z), solving to the same density.

There is another aspect, beyond our earlier definition(s) of fat-tailedness, once we increase the dimensionality into random vectors:
Figure . : History moves by jumps: a thick tailed historical process, in which events are distributed according to a power law that corresponds to the "80/20", with α ≃ , represented as a -D Lévy process.

.2 joint fat-tailedness and ellipticality of distributions

Figure . : What the proponents of the "great moderation" or "long peace" have in mind: history as a thin-tailed process.
What is an Elliptically Contoured Distribution?
From the standard definition [ ], X, a p × 1 random vector, is elliptically distributed with location vector μ, a non-negative definite matrix Σ, and some scalar function Ψ if its characteristic function φ is of the form

φ(t) = exp(i t′μ) Ψ(t Σ t′).   ( . )

There are equivalent definitions focusing on the density; take for now that the main attribute is that Ψ is a function of a single covariance matrix Σ. Intuitively, an elliptical distribution should show an ellipse for iso-density plots; see how we represented them in 2-D (for a bivariate) and 3-D (for a trivariate) in Figures . and . . A nonelliptical distribution would violate the shape, as shown in Figures . and . .

The main property of the class of elliptical distributions is that it is closed under linear transformation. Intuitively, as we saw in Chapter with the example of height vs wealth, it means (in a bivariate situation) that tails are less likely to come from one than from two marginal deviations.

Ellipticality and Central Flaws in Financial Theory
This closure under linear transformation leads to attractive properties in the building of portfolios, and in the results of portfolio theory (in fact one cannot have portfolio theory without ellipticality of distributions). Under ellipticality, all portfolios can be characterized completely by their location and scale, and any two portfolios with identical location and scale (in return space) have identical distributions of returns.

Note that (ironically) Lévy-Stable distributions are elliptical –but only in the way they are defined.

So ellipticality (under the condition of finite variance) allows the extension of the results of modern portfolio theory (MPT) under so-called "nonnormality", initially discovered by [ ], also see [ ]. However it appears (from those of us who work with stochastic covariances) that returns are not elliptical by any conceivable measure; see Chicheportiche and Bouchaud [ ] and simple visual graphs of the stability of correlation as in E. .

A simple pedagogical example uses the 1 ± a heuristic we presented in . . Consider the bivariate normal with characteristic function

Ψ(t_1, t_2) = e^{-ρ t_1 t_2 - ½ t_1² - ½ t_2²}.

Now let us stochasticize the ρ parameter, with probability p of ρ_1 and (1 - p) of ρ_2:

Ψ(t_1, t_2) = p e^{-ρ_1 t_1 t_2 - ½ t_1² - ½ t_2²} + (1 - p) e^{-ρ_2 t_1 t_2 - ½ t_1² - ½ t_2²}.   ( . )

Figure . shows the result with ρ_1 = -ρ_2. We can be more formal and show the difference, when Σ is stochastic, between Ψ(t E(Σ) t′) and E(Ψ(t Σ t′)) in Eq. . .

Diversification
Recall that financial theory fails under thick tails (and no patches have fixed the issue outside of the "overfitting" we discussed in earlier chapters). Absence of ellipticality closes the matter. The implication is that all methods based on Markowitz-style portfolio construction, that is, grounded in the idea of diversification, fail to reduce the risk, while managing to deceptively smooth out daily volatility. Adding leverage makes blowups certain in the long run.

This includes an abhorrent approach called "risk parity", largely used to raise money via pseudotheoretical and pseudoacademic smoke, a method called "asset gathering".
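The loss of ellipticality under stochastic correlation can be made concrete with a toy simulation (the ±0.8 correlation switch and the two portfolios below are my illustrative construction, not the text's): two portfolios with identical location and scale end up with different distributions, which an elliptical world would forbid:

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 1_000_000, 0.8

# stochastic correlation: rho = +0.8 or -0.8 with probability 1/2 each
sign = np.where(rng.random(n) < 0.5, 1.0, -1.0)
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x = z1
y = sign * rho * z1 + np.sqrt(1 - rho**2) * z2   # Corr(x, y | sign) = +/- rho

p1 = x                       # portfolio 1: exactly N(0, 1)
p2 = (x + y) / np.sqrt(2)    # portfolio 2: same mean 0, same variance 1...

kurt = lambda v: np.mean((v - v.mean()) ** 4) / np.var(v) ** 2
k1, k2 = kurt(p1), kurt(p2)  # ...but kurtosis 3 vs 3(1 + rho^2) = 4.92
```

Mean-variance analysis cannot distinguish p1 from p2, yet their tails differ: exactly the failure of location-scale sufficiency that the box above describes.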
The multivariate Student T is a convenient way to model, as it collapses to the Cauchy for α = 1. The alternative would be the multivariate stable, which, we will see, is devoid of density.

.3 multivariate student t

Figure . : Stochastic correlation for a standard binormal distribution: isodensities for different combinations. We use the very simple technique of Eq. . , with a switch between ρ = ρ_1 and ρ = -ρ_1 over the span with probability p = .

Let X be a (p ×
1) vector following a multivariate Student T distribution, X ~ St(M, Σ, α), where Σ is a (p × p) matrix, M a p-length vector and α a Paretian tail exponent (the degrees of freedom ν), with PDF

f(X) ∝ (1 + (X - M)ᵀ Σ^{-1} (X - M)/ν)^{-(ν+p)/2}.   ( . )

In the most simplified case, with p = 2, M = (0, 0), and Σ = (1 ρ; ρ 1),

f(x_1, x_2) = (1/(2π√(1 - ρ²))) (1 + (x_1² - 2ρ x_1 x_2 + x_2²)/(ν(1 - ρ²)))^{-(ν+2)/2}.   ( . )

. . Ellipticality and Independence under Thick Tails
Take the product of two Cauchy densities for x and y (what we used in Figure . ):

f(x) f(y) = 1/(π² (x² + 1)(y² + 1)),   ( . )

which, patently, as we saw in Chapter (with the example of the two randomly selected persons with a total net worth of $ million), is not elliptical. Compare it to the joint distribution f_ρ(x, y):

f_ρ(x, y) = (1/(2π√(1 - ρ²))) (1 + (x² - 2ρxy + y²)/(1 - ρ²))^{-3/2},   ( . )

and setting ρ = 0 to get no correlation,

f_0(x, y) = (1/(2π)) (x² + y² + 1)^{-3/2},   ( . )

which is elliptical. This illustrates how absence of correlation is not independence:

Independence between two variables X and Y is defined by the identity f(x, y)/(f(x) f(y)) = 1, regardless of the correlation coefficient. In the class of elliptical distributions, the bivariate Gaussian with coefficient 0 is both independent and uncorrelated. This does not apply to the Student T or the Cauchy.

The reason the multivariate stable distribution with correlation coefficient set to 0 is not independent is the following. A random vector X = (X_1, ..., X_k)′ is said to have the multivariate stable distribution if every linear combination of its components Y = a_1 X_1 + ··· + a_k X_k has a stable distribution. That is, for any constant vector a ∈ R^k, the random variable Y = aᵀX should have a univariate stable distribution. And to have a linear combination remain within the same class requires ellipticality. Hence, by construction, f(x, y) is not necessarily equal to f(x) f(y). Consider the Cauchy case, which has an explicit density function: the denominator of the product of densities includes an additional term, x² y², which pushes the iso-densities in one direction or another, as we saw in the introductory examples of Chapter .
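A two-line numerical check of this point, evaluating the ratio of the joint (ρ = 0) density to the product of the marginals at an arbitrary point (the point (2, 2) is an illustrative choice of mine):

```python
import numpy as np

def joint_rho0(x, y):
    """Bivariate Student T with alpha = 1 (elliptical Cauchy), rho = 0."""
    return (1.0 / (2 * np.pi)) * (1 + x**2 + y**2) ** -1.5

def cauchy(x):
    """Standard Cauchy marginal."""
    return 1.0 / (np.pi * (1 + x**2))

x, y = 2.0, 2.0
ratio = joint_rho0(x, y) / (cauchy(x) * cauchy(y))   # 25*pi/54 ~ 1.45, not 1
```

A ratio different from 1 at any single point settles it: zero correlation without independence.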
.4 fat tails and mutual information

We notice that, because of the artificiality in constructing multivariate distributions, mutual information is not 0 even in the absence of correlation, since the ratio of the joint density to the product of densities ≠ 1 under 0 "correlation" ρ. What is the mutual information of a Student T (which includes the Cauchy)?

I(X, Y) = E log( f(x, y) / (f(x) f(y)) ),
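This expectation can be estimated by plain Monte Carlo; a sketch for the Cauchy case (α = 1, ρ = 0), using the densities of the previous section (sample size and seed are my illustrative choices; under these assumptions the integral works out analytically to log π + 3 log 2 - 3 ≈ 0.224 nats):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# bivariate Student T with alpha = 1 and rho = 0: Z / sqrt(chi^2 with 1 dof)
z = rng.standard_normal((n, 2))
chi = np.abs(rng.standard_normal(n))
x, y = z[:, 0] / chi, z[:, 1] / chi

log_joint = -np.log(2 * np.pi) - 1.5 * np.log1p(x**2 + y**2)
log_prod = -2 * np.log(np.pi) - np.log1p(x**2) - np.log1p(y**2)
mi = np.mean(log_joint - log_prod)   # strictly positive despite zero correlation
```

For a bivariate Gaussian with ρ = 0 the integrand is identically zero; here it is not, quantifying the dependence that correlation misses.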
Panels: (a) Gaussian, (b) Stochastic volatility, (c) Student T, (d) Cauchy.
Figure . : The various shapes of the distribution of the eigenvalues for random matrices, which in the Gaussian case follow the Wigner semicircle distribution. The Cauchy case corresponds to the Student parametrized to have 1 degree of freedom.

where the expectation is taken under the joint distribution of X and Y. The mutual information, thanks to the log, is additive (note that one can use any logarithmic base, translating by dividing by log(2)). So

I(X, Y) = E(log f(x, y)) - E(log f(x)) - E(log f(y)),

or H(X) + H(Y) - H(X, Y), where H is the entropy and H(X, Y) the joint entropy. We note that -½ log(1 - ρ²) is the mutual information of a Gaussian regardless of parametrization. So for X, Y ~ Multivariate Student T (α, ρ), the mutual information is

I_α(X, Y) = -½ log(1 - ρ²) + λ_α,   ( . )

where λ_α depends only on the tail exponent: it has a closed form combining log(α), the cosecant csc(πα), the beta function B(., .), and generalized harmonic numbers H_n^(r) = Σ_{i=1}^n 1/i^r (with H_n = H_n^(1)). We note that λ_α → 0 as α → ∞, recovering the Gaussian case.

The eigenvalues of matrices themselves have an analog to Gaussian convergence: the semicircle distribution, as shown in Figure . . Let M be an (n, n) symmetric matrix. We have the eigenvalues λ_i, 1 ≤ i ≤ n, such that M.V_i = λ_i V_i, where V_i is the i-th eigenvector. The Wigner semicircle distribution with support [-R, R] has a PDF f presenting a semicircle of radius R centered at (0, 0), suitably normalized:

f(λ) = (2/(π R²)) √(R² - λ²) for -R ≤ λ ≤ R.   ( . )

This distribution arises as the limiting distribution of eigenvalues of (n, n) symmetric matrices with finite moments as the size n of the matrix approaches infinity. We will tour the "fat-tailedness" of the random matrix in what follows, as well as the convergence. This is the equivalent of thick tails for matrices. Consider for now that the 4th moment reaching Gaussian levels (i.e. 3) in a univariate situation is equivalent to the eigenvalues reaching Wigner's semicircle.

.6 correlation and undefined variance

Next we examine a paradox: while covariances can be infinite, correlation is finite. However, it will have a huge sampling error to be informative –the same problem we discussed with PCA in Chapter . Question: why is it that a fat tailed distribution in the power law class P, with infinite or undefined mean (and higher moments), would have, in higher dimensions, undefined (or infinite) covariance but finite correlation?

Consider a distribution with support in (-∞, ∞).
It has no moments: E(X) is indeterminate, E(X²) = ∞; no covariance: E(XY) is indeterminate. But the (noncentral) correlation for n variables is bounded, -1 ≤ ρ ≤ 1, where

ρ ≜ Σ_{i=1}^n x_i y_i / (√(Σ_{i=1}^n x_i²) √(Σ_{i=1}^n y_i²)), n = 2, 3, ...

By the subexponentiality property, we have P(X_1 + ... + X_n > x) ~ P(max(X_1, ..., X_n) > x) as x → ∞. We note that the power law class is included in the subexponential class S. Order the variables in absolute values such that |x_1| ≤ |x_2| ≤ ... ≤ |x_n|. Let k_1 = Σ_{i=1}^{n-1} x_i y_i, k_2 = Σ_{i=1}^{n-1} x_i², and k_3 = Σ_{i=1}^{n-1} y_i². Then

lim_{x_n→∞} (x_n y_n + k_1)/(√(x_n² + k_2) √(y_n² + k_3)) = y_n/√(k_3 + y_n²),

lim_{y_n→∞} (x_n y_n + k_1)/(√(x_n² + k_2) √(y_n² + k_3)) = x_n/√(k_2 + x_n²),

lim_{x_n→+∞, y_n→+∞} (x_n y_n + k_1)/(√(x_n² + k_2) √(y_n² + k_3)) = 1,

lim_{x_n→+∞, y_n→-∞} (x_n y_n + k_1)/(√(x_n² + k_2) √(y_n² + k_3)) = -1, n ≥ 2.

Figure . : Sample distribution of correlation for a sample of . The correlation exists for a bivariate T distribution (exponent , correlation ) but... is not useable.

An example of the distribution of the correlation is shown in Fig. . . Finite correlation doesn't mean low variance: it exists, but may not be useful for statistical purposes owing to the noise and slow convergence.

Figure . : The log-log plot of the survival function of the squared residuals ϵ² for the IQ-income linear regression using the standard Wisconsin Longitudinal Studies (WLS) data. We notice that the income variables are winsorized. Clipping the tails creates the illusion of a high R². Actually, even without clipping the tail, the coefficient of determination will show much higher values owing to the small sample properties of the variance of a power law.

Figure . 
: An infinite variance case that shows a high R² in sample; but it ultimately has a value of 0. Remember that R² is stochastic. The problem greatly resembles that of P values in Chapter , owing to the complication of a metadistribution in [0, 1].

.7 fat tailed residuals in linear regression models

We mentioned in Chapter that linear regression fails to inform under fat tails. Yet it is practiced. For instance, it is patent that income and wealth variables are power law distributed (with a spate of problems; see our Gini discussions in ). However IQ scores are Gaussian (seemingly by design). Yet people regress one on the other, failing to see that it is improper. Consider the following linear regression in which the dependent and independent variables are of different classes:

Y = aX + b + ϵ,

where X is standard Gaussian (N(0, 1)) and ϵ is power law distributed, with E(ϵ) = 0 and E(ϵ²) < +∞. There are no restrictions on the parameters. Clearly we can compute the coefficient of determination R² as 1 minus the ratio of the expectation of the sum of squared residuals over the total squared variations, so we get the more general answer to our idiosyncratic model. Since X ~ N(0, 1), aX + b ~ N(b, |a|), we have

R² = 1 - SS_res/SS_tot = 1 - Σ_{i=1}^n ϵ_i² / Σ_{i=1}^n (y_i - ȳ)².

We can show that, for large n,

R² = a²/(a² + E(ϵ_i²)) + O(1/n).   ( . )

And of course, for infinite variance: lim_{E(ϵ²)→+∞} E(R²) = 0. When ϵ is T-distributed with α degrees of freedom, clearly ϵ² will follow an F-ratio distribution F(1, α) –a power law with tail exponent α/2.

Figure . : A Cauchy regression with an expected R² = 0, faking it but showing higher values in small samples (here . ).

Note that we can also compute the same "expectation" by taking, simply, the square of the correlation between X and Y.
For instance, assume the distribution for ϵ is the Student T distribution with zero mean, scale σ and tail exponent α > 2. Then Cov(X, Y) = E((aX + b + ϵ)X) = a. The denominator (the standard deviation of Y) becomes √E((aX + ϵ)²) = √(a² + ασ²/(α - 2)). So

E(R²) = a²(α - 2) / (a²(α - 2) + ασ²),   ( . )

and the limit from above: lim_{α→2⁺} E(R²) = 0. We are careful here to use E(R²) rather than the seemingly deterministic R², because it is a stochastic variable that will be extremely sample dependent, and only stabilize for large n, perhaps even astronomically large n. Indeed, recall that in sample the expectation will always be finite, even if the ϵ are Cauchy! The point is illustrated in Figures . and . . Actually, when one uses the maximum likelihood estimation of R² via E(ϵ²) using α (the "shadow mean" method in Chapters and , among others), we notice that in the IQ example used in the graph, the mean of the sample residuals is about half of the maximum likelihood one, making R² even lower (that is, virtually 0).

The point invalidates many studies of the relations IQ-wealth and IQ-income of the kind [ ]; we can see the striking effect in Figure . . Given that R² is bounded in [0, 1], it will reach its true value very slowly –see the P-value problem in Chapter .

Property : When a fat tailed random variable is regressed against a thin tailed one, the coefficient of determination R² will be biased higher, and requires a much larger sample size to converge (if it ever does).

Note that sometimes people try to solve the problem by some nonlinear transformation of a random variable (say, the logarithm) to try to establish a linear relationship. If the required transformation is exact, things will be fine –but only if exact. Errors can arise from the discrepancy. For correlation is extremely delicate and, unlike mutual information, non-additive and often uninformative. The point has been explored by this author in [ ].
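The sample dependence of R² is easy to exhibit; a hypothetical sketch (slope a = 1, n = 300 per sample, 300 repetitions, Cauchy residuals for the infinite-variance case; all choices are mine for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)

def r2_once(noise):
    """In-sample R^2 of an OLS fit of y = x + noise on n = 300 points."""
    n = 300
    x = rng.standard_normal(n)
    y = x + noise(n)                  # true slope a = 1, intercept 0
    b1, b0 = np.polyfit(x, y, 1)
    res = y - (b1 * x + b0)
    return 1 - res.var() / y.var()

r2_gauss = np.array([r2_once(rng.standard_normal) for _ in range(300)])
r2_cauchy = np.array([r2_once(rng.standard_cauchy) for _ in range(300)])
# Gaussian residuals: R^2 concentrates near 0.5. Cauchy residuals: the population
# value is 0, yet every in-sample R^2 is positive and wildly sample dependent.
```

The in-sample R² is always finite and nonnegative, even when the population value is 0: exactly the bias named in the Property above.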
next

We will examine in Chapter the slow convergence of power law distributed variables under the law of large numbers (LLN): it can be orders of magnitude slower than for the Gaussian.

S P E C I A L C A S E S O F T H I C K T A I L S

Figure A. : A coffee cup is less likely to incur "small" than large harm. It shatters, hence is exposed to (almost) everything or nothing. The same type of payoff is prevalent in markets with, say, (reval)devaluations, where small movements beyond a barrier are less likely than larger ones.

For unimodal distributions, thick tails are the norm: one can look at tens of thousands of time series of socio-economic variables without encountering a single episode of "platykurtic" distributions. But for multimodal distributions, some surprises can occur.

a.1 multimodality and thick tails, or the war and peace model

We noted earlier in . that stochasticizing, ever so mildly, the variance (that is, making a deterministic variance stochastic) makes the distribution gain in thick-tailedness (as expressed by kurtosis), while we maintained the same mean. But should we stochasticize the mean as well (while preserving the initial average), and separate the potential outcomes wide enough so that we get many modes, the "kurtosis" (as measured by the fourth moment) would drop. And if we associate different variances with different means, we get a variety of "regimes", each with its set of probabilities.

Figure A. : The War and Peace model. Kurtosis = . , much lower than the Gaussian.

Figure A. : Negative (relative) kurtosis and bimodality ( is the Gaussian).

Either the very meaning of "thick tails" loses its significance under multimodality, or it takes on a new one where the "middle", around the expectation, ceases to matter [ , ]. Now, there are plenty of situations in real life in which we are confronted with many possible regimes, or states.
Assuming finite moments for all states, consider the following structure: s_1, a calm regime, with expected mean m_1 and standard deviation σ_1; s_2, a violent regime, with expected mean m_2 and standard deviation σ_2; or more such states. Each state has its probability p_i.

Now take the simple case of a Gaussian with switching means and variances: with probability ½, X ~ N(m_1, σ_1), and with probability ½, X ~ N(m_2, σ_2). The kurtosis will be

Kurtosis = 3 - 2 ((m_1 - m_2)⁴ - 6 (σ_1² - σ_2²)²) / ((m_1 - m_2)² + 2 (σ_1² + σ_2²))².   (A. )

As we see, the kurtosis is a function of d = m_1 - m_2. For situations where σ_1 = σ_2 and m_1 ≠ m_2, the kurtosis will be below that of the regular Gaussian, and our measure will naturally be negative. In fact, for the kurtosis to remain at 3,

|d| = 6^{1/4} √(max(σ_1², σ_2²) - min(σ_1², σ_2²)):

the stochasticity of the mean offsets the stochasticity of volatility.

Assume, to simplify, a one-period model, as if one were standing in front of a discrete slice of history, looking forward at outcomes. (Adding complications (transition matrices between different regimes) doesn't change the main result.) The characteristic function ϕ(t) for the mixed distribution becomes:

ϕ(t) = Σ_{i=1}^N p_i e^{-½ t² σ_i² + i t m_i}.

For N = 2, the moments simplify to the following:

M_1 = p m_1 + (1 - p) m_2
M_2 = p (m_1² + σ_1²) + (1 - p) (m_2² + σ_2²)
M_3 = p m_1 (m_1² + 3σ_1²) + (1 - p) m_2 (m_2² + 3σ_2²)
M_4 = p (m_1⁴ + 6 m_1² σ_1² + 3σ_1⁴) + (1 - p) (m_2⁴ + 6 m_2² σ_2² + 3σ_2⁴).

Let us consider the different varieties, all characterized by the condition p < (1 - p), m_1 < m_2, preferably m_1 < 0 and m_2 >
0, and, at the core, the central property: σ_1 > σ_2.

Variety 1: War and Peace. A calm period with positive mean and very low volatility, turmoil with negative mean and extremely high volatility.
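Eq. (A. ) can be verified by simulation. With equal variances the formula reduces to 3 - 2d⁴/(d² + 4σ²)²; the sketch below uses illustrative values of my choosing, m_1 = -2, m_2 = 2, σ = 1, for which the predicted kurtosis is 1.72, well below the Gaussian 3:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
m1, m2, s = -2.0, 2.0, 1.0            # well-separated regimes, equal variance

pick = rng.random(n) < 0.5            # regime switch with probability 1/2
x = np.where(pick, rng.normal(m1, s, n), rng.normal(m2, s, n))

kurt = np.mean((x - x.mean()) ** 4) / np.var(x) ** 2
d = m1 - m2
predicted = 3 - 2 * d**4 / (d**2 + 4 * s**2) ** 2   # bimodality lowers kurtosis
```

Stochasticizing the mean (rather than the variance) pushes the fourth moment down, not up: the "war and peace" shape is platykurtic.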
Variety 2: Conditional deterministic state. Take a bond B, paying interest r at the end of a single period. At termination, there is a high probability of getting B(1 + r), and a possibility of default. Getting exactly B is very unlikely. Think of it as having no intermediary steps between war and peace: these are separable and discrete states. Bonds don't just default "a little bit". Note the divergence: the probability of the realization being at or close to the mean is about nil. Typically, p(E(x)), the values of the PDF at the expectation, are smaller than at the different means of the regimes, so P(x = E(x)) < P(x = m_1) and < P(x = m_2); but in the extreme case (bonds), P(x = E(x)) becomes increasingly small. The tail event is the realization around the mean.

The same idea applies to currency pegs, as devaluations cannot be "mild", with all-or-nothing type of volatility and low density in the "valley" between the two distinct regimes.

Figure A. : The bond payoff/currency peg model. Absence of volatility stuck at the peg, deterministic payoff in one regime, mayhem in the other. Here the kurtosis K = . . Note that the coffee cup is a special case, with both regimes being degenerate.

Figure A. : Pressure on the peg, which may give a Dirac PDF in the "no devaluation" regime (or, equivalently, low volatility). It is typical for finance imbeciles to mistake the pegged regime for low volatility.

With option payoffs, this bimodality has the effect of raising the value of at-the-money options and lowering that of the out-of-the-money ones, causing the exact opposite of the so-called "volatility smile". Note the coffee cup has no state between broken and healthy. And the state of being broken can be considered to be an absorbing state (using Markov chains for transition probabilities), since broken cups do not end up fixing themselves. Nor are coffee cups likely to be "slightly broken", as we see in figure A. .
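The absorbing-state logic can be previewed numerically with a two-state transition matrix (the persistence p_{1,1} = 0.98 and the horizon of 1000 steps are illustrative assumptions of mine): whatever the persistence of the "healthy" state, the broken state eventually dominates:

```python
import numpy as np

p11 = 0.98                        # the "healthy" state persists...
P = np.array([[p11, 1 - p11],     # ...but can break,
              [0.0, 1.0]])        # and broken is absorbing: cups don't unbreak

Pn = np.linalg.matrix_power(P, 1000)
p_broken = Pn[0, 1]   # starting healthy, probability of being broken by step 1000
```

Here p_broken = 1 - 0.98^1000, which is 1 up to a billionth: what can break and is irreversible will eventually break.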
A brief list of other situations where bimodality is encountered:

1. Currency pegs
2. Mergers
3. Professional choices and outcomes
4. Conflicts: interpersonal, general, martial, any situation in which there is no intermediary between harmonious relations and hostility
5. Conditional cascades

a.2 transition probabilities: what can break will break

So far we looked at a single period model, which is the realistic way, since new information may change the bimodality going into the future: we have clarity over one step but not more. But let us go through an exercise that will give us an idea about fragility. Assuming the structure of the model stays the same, we can look at the longer term behavior under transition of states. Let P be the matrix of transition probabilities, where p_{i,j} is the transition from state i to state j over Δt (that is, where S(t) is the regime prevailing over period t, P(S(t + Δt) = s_j | S(t) = s_i)):

P = (p_{1,1} p_{1,2}; p_{2,1} p_{2,2}).

After n periods, that is, n steps,

P^n = (a_n b_n; c_n d_n),

where

a_n = ((p_{1,1} - 1)(p_{1,1} + p_{2,2} - 1)^n + p_{2,2} - 1) / (p_{1,1} + p_{2,2} - 2)

b_n = ((1 - p_{1,1})((p_{1,1} + p_{2,2} - 1)^n - 1)) / (p_{1,1} + p_{2,2} - 2)

c_n = ((1 - p_{2,2})((p_{1,1} + p_{2,2} - 1)^n - 1)) / (p_{1,1} + p_{2,2} - 2)

d_n = ((p_{2,2} - 1)(p_{1,1} + p_{2,2} - 1)^n + p_{1,1} - 1) / (p_{1,1} + p_{2,2} - 2)

Take the absorbing case p_{2,2} = 1 (hence, replacing p_{i,j≠i}|_{i=1,2} = 1 - p_{i,i}, we have p_{2,1} = 0):

P^n = (p_{1,1}^n  1 - p_{1,1}^n; 0  1),

and the "ergodic" probabilities:

lim_{n→∞} P^n = (0 1; 0 1).

The implication is that the absorbing state regime will end up dominating with probability 1: what can break and is irreversible will eventually break. With the "ergodic" matrix,

lim_{n→∞} P^n = π.1ᵀ,

where 1ᵀ is the transpose of the unitary vector {1, 1} and π the stationary probability vector (the normalized eigenvector associated with the unit eigenvalue). The eigenvalues are λ = {1, p_{1,1} + p_{2,2} - 1}, the second of which governs the speed of convergence to the ergodic state.

Part II  T H E L A W O F M E D I U M N U M B E R S
L I M I T D I S T R I B U T I O N S , A C O N S O L I D A T I O N *,†

In this expository chapter we proceed to consolidate the literature on limit distributions seen from our purpose, with some shortcuts where indicated. After introducing the law of large numbers, we show the intuition behind the central limit theorem and illustrate how it varies preasymptotically across distributions. Then we discuss the law of large numbers as applied to higher moments. A more formal and deeper approach will be presented in the next chapter.

Both the law of large numbers and the central limit theorem are partial answers to a general problem: "What is the limiting behavior of a sum (or average) of random variables as the number of summands approaches infinity?". And our law of medium numbers (or preasymptotics) asks: now what when the number of summands doesn't reach infinity?

The standard presentation is as follows. Let X_1, X_2, ... be an infinite sequence of independent and identically distributed (Lebesgue integrable) random variables with expected value E(X_n) = μ (we will see further down that one can somewhat relax the i.i.d. assumptions). For all n, the sample average X̄_n = (1/n)(X_1 + ··· + X_n) converges to the expected value, X̄_n → μ, for n → ∞.

Finiteness of variance is not necessary (though of course finite higher moments accelerate the convergence).

There are two modes of convergence: convergence in probability →^P (which implies convergence in distribution, though not always the reverse), and the stronger almost sure convergence →^{a.s.} (similar to pointwise convergence, also called almost everywhere or almost always). Applied here, the distinction corresponds to the weak and strong LLN respectively.

Discussion chapter (with some research).

The weak LLN
The weak law of large numbers (or Khinchin's law, or sometimes called Bernoulli's law) can be summarized as follows: the probability of a variation in excess of some threshold for the average becomes progressively smaller as the sequence progresses. In estimation theory, an estimator is called consistent if it thus converges in probability to the quantity being estimated:

$$\overline{X}_n \xrightarrow{P} \mu \quad \text{when } n \to \infty.$$

That is, for any positive number $\varepsilon$,

$$\lim_{n \to \infty} \mathbb{P}\left(\left|\overline{X}_n - \mu\right| > \varepsilon\right) = 0.$$

Note that standard proofs are based on Chebyshev's inequality: if $X$ has a finite non-zero variance $\sigma^2$, then for any real number $k > 0$,

$$\mathbb{P}\left(|X - \mu| \geq k\sigma\right) \leq \frac{1}{k^2}.$$

The strong LLN
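A minimal numerical illustration of the convergence just described — sample averages tightening around the expectation. The distribution and sample sizes here are arbitrary choices, not from the text.

```python
import numpy as np

# The sample average of i.i.d. Uniform(0, 1) draws concentrates around the
# expectation 1/2 as n grows -- the law of large numbers at work.
rng = np.random.default_rng(0)
for n in (10, 1_000, 100_000):
    xbar = rng.uniform(0, 1, n).mean()
    print(n, abs(xbar - 0.5))   # the deviation shrinks with n
```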
The strong law of large numbers states that, as the number of summands $n$ goes to infinity, the probability that the average converges to the expectation equals 1:

$$\overline{X}_n \xrightarrow{a.s.} \mu \quad \text{when } n \to \infty.$$

That is,

$$\mathbb{P}\left(\lim_{n \to \infty} \overline{X}_n = \mu\right) = 1.$$

Relaxations of i.i.d.
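As a numerical preview of such relaxations, a weakly dependent sequence still obeys the LLN. The AR(1) choice and parameters below are illustrative assumptions, not from the text.

```python
import numpy as np

# A weakly dependent AR(1) sequence, X_t = phi * X_{t-1} + eps_t: its
# covariances decay geometrically as phi^{|i-j|} (the kind of covariance
# decay discussed in this section), and the sample average still converges
# to the common mean, here 0.
rng = np.random.default_rng(6)
phi, n = 0.5, 200_000
eps = rng.standard_normal(n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]
print(abs(x.mean()))   # small: the LLN survives weak dependence
```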
Now one can relax the identically distributed assumption under some conditions: Kolmogorov proved that non-identical distributions for the summands $X_i$ require for each summand the existence of a finite second moment.

As to independence, some weak dependence is allowed. Traditionally the conditions are, again, 1) the usual finite variance, $V(X_i) \leq c$, and 2) some structure on the covariance matrix, $\lim_{|i-j| \to +\infty} \mathrm{Cov}(X_i, X_j) = 0$. However, it turns out that 1) can be weakened to $\sum_{i=1}^{n} V[X_i] = o(n^2)$, and 2) to $|\mathrm{Cov}(X_i, X_j)| \leq \varphi(|i-j|)$, where $\frac{1}{n}\sum_{i=1}^{n} \varphi(i) \to 0$. See Bernstein [ ] and Kozlov [ ] (in Russian).

Thanking "romanoved", a mysterious Russian-speaking helper on Mathematics Stack Exchange.

.2 central limit in action
Our Interest
Our concern in this chapter and the next one is clearly to look at the "speed" of such convergence. Note that under the stronger assumption of i.i.d. we do not need variance to be finite, so we can focus on mean absolute deviation as a metric for divergence.

Figure . : The fastest CLT: the Uniform becomes Gaussian in a few steps. We have, successively, , , , and  summands; with  summands we see a well formed bell shape.

We will start with a simplification of the generalized central limit theorem (GCLT), as formulated by Paul Lévy (the traditional approaches to the CLT, as well as the technical backbone, will be presented later):

. . The Stable Distribution
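A numerical glimpse of the stability property defined in this section, using the Cauchy member of the class. A sketch with arbitrary sizes; the quartile comparison is our own device (quantiles exist even when moments do not).

```python
import numpy as np

# Stability under summation: the average of n standard Cauchy draws (the
# alpha = 1 member of the stable class) is again standard Cauchy, with
# quartiles at -1 and +1 no matter how large n -- averaging buys nothing.
rng = np.random.default_rng(1)
n, trials = 100, 50_000
avg = rng.standard_cauchy((trials, n)).mean(axis=1)   # averages of n draws
one = rng.standard_cauchy(trials)                     # single draws
print(np.percentile(avg, [25, 75]))   # ≈ [-1, 1]
print(np.percentile(one, [25, 75]))   # ≈ [-1, 1]
```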
Using the same notation as above, let $X_1, \ldots, X_n$ be independent and identically distributed random variables. Consider their sum $S_n$. We have

$$\frac{S_n - a_n}{b_n} \xrightarrow{D} X_s, \qquad (\,.\,)$$

where $X_s$ follows a stable distribution $S$, $a_n$ and $b_n$ are norming constants, and, to repeat, $\xrightarrow{D}$ denotes convergence in distribution (the distribution of $X$ as $n \to \infty$). The properties of $S$ will be more properly defined and explored in the next chapter. Take it for now that a random variable $X_s$ follows a stable (or $\alpha$-stable) distribution, symbolically $X_s \sim S(\alpha_s, \beta, \mu, \sigma)$, if its characteristic function $\chi(t) = \mathbb{E}\left(e^{itX_s}\right)$ is of the form:

$$\chi(t) = e^{\left(i\mu t - |t\sigma|^{\alpha_s}\left(1 - i\beta \tan\left(\frac{\pi\alpha_s}{2}\right)\mathrm{sgn}(t)\right)\right)} \quad \text{when } \alpha_s \neq 1. \qquad (\,.\,)$$

The constraints are $-1 \leq \beta \leq 1$ and $0 < \alpha_s \leq 2$.

The designation stable distribution implies that the distribution (or class) is stable under summation: you sum up random variables following any of the various distributions that are members of the class $S$ explained next chapter (actually the same distribution with different parametrizations of the characteristic function), and you stay within the same distribution. Intuitively, $\chi(t)^n$ has the same form as $\chi(t)$, with $\mu \to n\mu$ and $\sigma \to n^{1/\alpha_s}\sigma$. The well known distributions in the class (or, as some call it, the "basin") are: the Gaussian, the Cauchy and the Lévy, with $\alpha = 2$, $1$, and $\frac{1}{2}$, respectively. Other distributions have no closed form density.

Figure . : Paul Lévy ( – ), who formulated the generalized central limit theorem.

. . The Law of Large Numbers for the Stable Distribution
Let us return to the law of large numbers.

We will try to use $\alpha_s \in (0, 2]$ to denote the exponent of the limiting and Platonic stable distribution and $\alpha_p \in (0, \infty)$ the corresponding Paretian (preasymptotic) equivalent, but only in situations where there could be some ambiguity. Plain $\alpha$ should be understood in context.

Actually, there are ways to use special functions; for instance one discovered accidentally by the author: for the Stable $S$ with standard parameters $\alpha = $ , $\beta = 1$, $\mu = 0$, $\sigma = 1$, the PDF can be written in closed form through the Airy function $\mathrm{Ai}$ and its derivative $\mathrm{Ai}'$; it is used further down in the example on the limit distribution for Pareto sums.

.3 speed of convergence of clt: visual explorations

Figure . : The law of large numbers shows a tightening distribution around the mean, leading to degeneracy: convergence to a Dirac stick at the exact mean.
By standard results, we can observe the law of large numbers at work for the stable distribution, as illustrated in Figure . :

$$\lim_{n \to +\infty} \chi\left(\frac{t}{n}\right)^n = e^{i\mu t}, \quad 1 < \alpha_s \leq 2, \qquad (\,.\,)$$

which is the characteristic function of a Dirac delta at $\mu$, a degenerate distribution, since the Fourier transform $\mathcal{F}$ (here parametrized to be the inverse of the characteristic function) is:

$$\frac{1}{\sqrt{2\pi}}\, \mathcal{F}_t\left(e^{i\mu t}\right)(x) = \delta(\mu + x). \qquad (\,.\,)$$

Further, we can observe the "real-time" operation for all $1 < n < +\infty$ in the following ways, as we will explore in the next sections.

We note that if $X$ has a finite variance, the stable-distributed random variable $X_s$ will be Gaussian. But note that $X_s$ is a limiting construct as $n \to \infty$, and there are many, many complications with "how fast" we get there. Let us consider cases that illustrate both the idea of the CLT and the speed of it.

. . Fast Convergence: the Uniform Dist.
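The fastest case can be checked numerically. Below is a sketch of the uniform-sum (Irwin–Hall) density — a standard reconstruction for support $[0, 1]$, cross-checked against a Monte Carlo histogram at the mode.

```python
import math
import numpy as np

# Density of a sum of n independent Uniform(0, 1) variables (Irwin-Hall):
# a finite alternating sum over shifted power terms.
def uniform_sum_pdf(x: float, n: int) -> float:
    s = sum((-1) ** k * math.comb(n, k)
            * (x - k) ** (n - 1) * math.copysign(1, x - k)
            for k in range(n + 1))
    return s / (2 * math.factorial(n - 1))

rng = np.random.default_rng(2)
n = 3
samples = rng.uniform(0, 1, (400_000, n)).sum(axis=1)
hist, _ = np.histogram(samples, bins=60, range=(0, n), density=True)
# At the mode x = 1.5 the exact density is 3/4; bin 30 covers [1.5, 1.55).
print(uniform_sum_pdf(1.5, 3), hist[30])   # both ≈ 0.75
```

With only three summands the shape is already nearly a bell, which is the point of this section.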
Consider a uniform distribution – the simplest of all. If its support is in $[0, 1]$, it will simply have a density of $\varphi_1(x) = 1$ for $0 \leq x \leq 1$. Now add another variable, $x_2$, identically distributed and independent. The sum $x_1 + x_2$ immediately changes in shape! Look at $\varphi_2(.)$, the density of the sum, in Figure . . It is now a triangle. Add one variable and now consider the density $\varphi_3$ of the distribution of $X_1 + X_2 + X_3$. It is already almost bell shaped, with $n = 3$ summands.

The uniform sum distribution, for summands supported on $[L, H]$, is

$$\varphi_n(x) = \frac{1}{2\,(H-L)\,(n-1)!} \sum_{k=0}^{n} (-1)^k \binom{n}{k} \left(\frac{x - nL}{H-L} - k\right)^{n-1} \mathrm{sgn}\left(\frac{x - nL}{H-L} - k\right) \quad \text{for } nL \leq x \leq nH.$$

Figure . : The exponential distribution, $\varphi_n$ indexed by the number of summands. Slower than the uniform, but good enough.

. . Semi-slow convergence: the exponential
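The "semi-slow" behavior can be previewed numerically: skewness is the obstacle, and for exponential summands it is known in closed form. A Monte Carlo sketch with arbitrary sample sizes:

```python
import numpy as np

# The skewness of a sum of n unit exponentials (a Gamma(n, 1) variable) is
# 2/sqrt(n): asymmetry -- and hence the departure from Gaussianity -- fades
# only at the sqrt(n) rate.
rng = np.random.default_rng(8)

def sample_skewness(x):
    z = x - x.mean()
    return (z ** 3).mean() / (z ** 2).mean() ** 1.5

skews = {}
for n in (1, 4, 100):
    s = rng.exponential(size=(100_000, n)).sum(axis=1)
    skews[n] = sample_skewness(s)
    print(n, skews[n])   # ≈ 2/sqrt(n)
```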
Let us consider a sum of exponential random variables. We have for initial density

$$\varphi_1(x) = \lambda e^{-\lambda x}, \quad x \geq 0,$$

and for $n$ summands

$$\varphi_n(x) = \frac{\lambda^n x^{n-1} e^{-\lambda x}}{\Gamma(n)}.$$

We have, centering around the mean $\frac{n}{\lambda}$ (and later, in the illustrations in Fig. . , $\lambda = 1$),

$$\frac{\lambda^n x^{n-1} e^{-\lambda x}}{\Gamma(n)} \;\xrightarrow[n \to \infty]{}\; \frac{\lambda\, e^{-\frac{\lambda^2\left(x - \frac{n}{\lambda}\right)^2}{2n}}}{\sqrt{2\pi}\sqrt{n}},$$

which is the density of the normal distribution with mean $\frac{n}{\lambda}$ and variance $\frac{n}{\lambda^2}$. We can see how we get more slowly to the Gaussian, as shown in Figure . , mostly on account of its skewness. Getting to the Gaussian requires symmetry.

Figure . : The Pareto distribution. Doesn't want to lose its skewness, although in this case it should converge to the Gaussian... eventually.

. . The slow Pareto
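How slow is "slow"? One way to see it for the coming Pareto example is to measure how the mean absolute deviation $M(n) = \mathbb{E}|S_n - \mathbb{E}S_n|$ grows with the number of summands. A Monte Carlo sketch; the trial counts are arbitrary choices, and the inverse-transform sampler $X = U^{-1/2}$ is our own device for the density $2x^{-3}$.

```python
import numpy as np

# For a Gaussian, M(n) grows exactly like sqrt(n); for the alpha = 2 Pareto
# with density 2 x^{-3} on [1, inf), the growth is visibly faster,
# reflecting the sqrt(n log n) norming of this section.
rng = np.random.default_rng(4)
trials, n = 50_000, 100

def mad_growth(draw):
    x1 = draw(trials)                    # single summand
    sn = draw((trials, n)).sum(axis=1)   # n-summed
    return np.abs(sn - sn.mean()).mean() / np.abs(x1 - x1.mean()).mean()

g_rate = mad_growth(rng.standard_normal)
p_rate = mad_growth(lambda size: rng.uniform(size=size) ** -0.5)
print(g_rate)   # ≈ 10 = sqrt(100): the Gaussian diffusive benchmark
print(p_rate)   # clearly above 10: superdiffusive growth
```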
Consider the simplest Pareto distribution on $[1, \infty)$:

$$\varphi_1(x) = 2 x^{-3}$$

and, inverting the characteristic function,

$$\varphi_n(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp(-itx)\left(2\, E_3(-it)\right)^n dt, \quad x \geq n,$$

where $E_{.}(.)$ is the exponential integral $E_n(z) = \int_1^{\infty} \frac{e^{-zt}}{t^n}\, dt$. (We derive the density of sums either by convolving, easy in this case, or, as we will see with the Pareto, via characteristic functions.) Clearly, the integration is done numerically (so far nobody has managed to pull out the distribution of a Pareto sum); it can be exponentially slow (up to  hours for $n = 50$ vs.  seconds for $n = 2$), so we have used Monte Carlo simulations for Figs. . .

Recall from Eq. . that the convergence requires norming constants $a_n$ and $b_n$. From Uchaikin and Zolotarev [ ], we have (narrowing the situation to $1 < \alpha_p \leq 2$):

$$\mathbb{P}(X > x) = c\, x^{-\alpha_p}$$

as $x \to \infty$ (assume here that $c$ is a constant; we will present more formally the "slowly varying function" in the next chapter), and

$$\mathbb{P}(X < x) = d\, |x|^{-\alpha_p}$$

as $x \to -\infty$. The norming constants become $a_n = n\,\mathbb{E}(X)$ for $\alpha_p > 1$ (we exclude the cases $\alpha_p \leq 1$ as these are not likely to occur in practice), and

$$b_n = \begin{cases} \left(\dfrac{\pi\,(c+d)\,n}{2 \sin\left(\frac{\pi\alpha_p}{2}\right)\Gamma(\alpha_p)}\right)^{1/\alpha_p} & \text{for } 1 < \alpha_p < 2 \\[2ex] \sqrt{(c+d)\; n \log(n)} & \text{for } \alpha_p = 2. \end{cases} \qquad (\,.\,)$$

And the symmetry parameter $\beta = \frac{c-d}{c+d}$. Clearly, the situation where the Paretian parameter $\alpha_p$ is greater than 2 leads to the Gaussian.

Figure . : The Pareto distribution, $\varphi$  and $\varphi$ ; not much improvement towards Gaussianity, but an $\alpha = 2$ will eventually get you there if you are patient and have a long, very long, life.

. . The half-cubic Pareto and its basin of convergence
Of interest is the case of $\alpha = \frac{3}{2}$, the half-cubic. Unlike the situations as in Figure . , the distribution ends up slowly becoming symmetric. But, as we will cover in the next chapter, it is erroneous to conflate its properties with those of a stable distribution. It is, in a sense, more fat-tailed.

Figure . : The half-cubic Pareto distribution never becomes symmetric in real life. Here $n = 10$ .

.4 cumulants and convergence

Since the Gaussian (as a basin of convergence) has skewness of 0 and (raw) kurtosis of 3, we can heuristically examine the convergence of these moments to establish the speed of the workings of the CLT.
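The heuristic is easy to run. Below, a Monte Carlo sketch of the fourth moment converging (arbitrary sample sizes; the exponential is our choice of thin-tailed benchmark).

```python
import numpy as np

# The excess kurtosis (raw kurtosis minus 3) of a sum of n unit exponentials
# is exactly 6/n, so the Gaussian value of 3 is approached at the 1/n rate.
rng = np.random.default_rng(5)

def excess_kurtosis(x):
    z = x - x.mean()
    return (z ** 4).mean() / (z ** 2).mean() ** 2 - 3

kurt = {}
for n in (1, 10):
    s = rng.exponential(size=(400_000, n)).sum(axis=1)
    kurt[n] = excess_kurtosis(s)
    print(n, kurt[n])   # ≈ 6/n
```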
Definition . (Excess p-cumulants)
Let $\chi(\omega)$ be the characteristic function of a given distribution, $n$ the number of summands (for independent random variables), $p$ the order of the moment. We define the ratio of cumulants for the corresponding $p$th moment, evaluated at $\omega = 0$:

$$K_n^p \triangleq \frac{(-i)^p\, \partial_\omega^p \log\left(\chi(\omega)^n\right)}{\left(-\partial_\omega^2 \log\left(\chi(\omega)^n\right)\right)^{p-1}}$$
[Figure panels, each plotted against lag n: Copper, Eurodollar Depo 3M, Gold, Live Cattle, Russia RTSI, Soy Meal, TY 10Y Notes, Australia TB 10y, Coffee NY.]
Figure . : Behavior of the  th moment under aggregation for a few financial securities deemed to converge to the Gaussian but in fact do not converge (backup data for [ ]). There is no conceivable way to claim convergence to the Gaussian for data sampled at a lower frequency.

$K_n^p$ is a metric of the excess $p$th moment over that of a Gaussian, $p > 2$; in other words, $K_n^p = 0$ denotes Gaussianity for $n$ independent summands.

Remark : We note that $\lim_{n \to \infty} K_n^p = 0$ for all probability distributions outside the Power Law class. We also note that $\lim_{p \to \infty} K_n^p$ is finite for the thin-tailed class. In other words, we face a clear-cut basin of converging vs. diverging moments.

For distributions outside the Power Law basin and $p > 2$, $K_n^p$ decays at a rate $n^{-(p-2)}$. A sketch of the proof can be done using the stable distribution as the limiting basin and the non-derivability at order $p$ greater than its tail index, using Eq. . .

Table . shows what happens to the cumulants $K^{(.)}$ for $n$-summed variables. We would expect a drop at a rate $n$  for stochastic volatility (gamma variance wlog). However, Figure . shows the drop does not take place at any such speed. Visibly we are not in the basin. As seen in [ ], there is an absence of convergence of kurtosis under summation across economic variables.

Table . : Table of Normalized Cumulants for Thin-Tailed Distributions – Speed of Convergence for $n$ Independent Summands.
Distr.: Poisson($\lambda$); Exponential($\lambda$); Gamma($a, b$); Symmetric 2-state vol ($\sigma_1, \sigma_2$ w.p. $p$); Gamma-Variance ($a, b$).
$K^{(3)}$: $\frac{1}{n\lambda}$; $\frac{2\lambda}{n}$; $\frac{2}{a b n}$; [entry illegible]; [entry illegible].
$K^{(4)}$: $\frac{1}{(n\lambda)^2}$; $\frac{6\lambda^2}{n^2}$; $\frac{6}{(a b n)^2}$; [entry illegible]; [entry illegible].

.5 technical refresher: traditional versions of clt

This is a refresher of the various approaches bundled under the designation CLT.
The Standard (Lindeberg-Lévy) version of CLT
Suppose as before a sequence of i.i.d. random variables with $\mathbb{E}(X_i) = \mu$ and $V(X_i) = \sigma^2 < +\infty$, and let $\overline{X}_n$ be the sample average for $n$ summands. Then as $n$ approaches infinity, the rescaled average $\sqrt{n}\left(\overline{X}_n - \mu\right)$ converges in distribution to a Gaussian [ ] [ ]:

$$\sqrt{n}\left(\overline{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}\left(0, \sigma^2\right).$$

Convergence in distribution means that the CDF (cumulative distribution function) of $\sqrt{n}\left(\overline{X}_n - \mu\right)$ converges pointwise to the CDF of $\mathcal{N}(0, \sigma^2)$ for every real $z$:

$$\lim_{n \to \infty} \mathbb{P}\left(\sqrt{n}\left(\overline{X}_n - \mu\right) \leq z\right) = \lim_{n \to \infty} \mathbb{P}\left(\frac{\sqrt{n}\left(\overline{X}_n - \mu\right)}{\sigma} \leq \frac{z}{\sigma}\right) = \Phi\left(\frac{z}{\sigma}\right), \quad \sigma > 0,$$

where $\Phi(z)$ is the standard normal CDF evaluated at $z$. Note that the convergence is uniform in $z$, in the sense that

$$\lim_{n \to \infty} \sup_{z \in \mathbb{R}} \left| \mathbb{P}\left(\sqrt{n}\left(\overline{X}_n - \mu\right) \leq z\right) - \Phi\left(\frac{z}{\sigma}\right) \right| = 0,$$

where $\sup$ denotes the least upper bound, that is, the supremum of the set.

Lyapunov's CLT
In Lyapunov's derivation, summands have to be independent, but not necessarily identically distributed. The theorem also requires that the random variables $|X_i|$ have moments of some order $(2 + \delta)$, and that the rate of growth of these moments is limited by the Lyapunov condition given below.

The condition is as follows. Define

$$s_n^2 = \sum_{i=1}^n \sigma_i^2.$$

If for some $\delta > 0$,

$$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^n \mathbb{E}\left(|X_i - \mu_i|^{2+\delta}\right) = 0,$$

then the sum of $\frac{X_i - \mu_i}{s_n}$ converges in distribution to a standard normal random variable, as $n$ goes to infinity:

$$\frac{1}{s_n} \sum_{i=1}^n \left(X_i - \mu_i\right) \xrightarrow{D} \mathcal{N}(0, 1).$$

If a sequence of random variables satisfies Lyapunov's condition, then it also satisfies Lindeberg's condition, which we cover next. The converse implication, however, does not hold.

Lindeberg's condition
Lindeberg allows us to reach the CLT under weaker assumptions. With the same notations as earlier: if

$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{i=1}^n \mathbb{E}\left(\left(X_i - \mu_i\right)^2 \cdot \mathbf{1}_{\{|X_i - \mu_i| > \varepsilon s_n\}}\right) = 0$$

for all $\varepsilon > 0$, where $\mathbf{1}_{\{.\}}$ is the indicator function, then the random variable $Z_n = \frac{\sum_{i=1}^n \left(X_i - \mu_i\right)}{s_n}$ converges in distribution to a Gaussian as $n \to \infty$.

Lindeberg's condition is sufficient, but not in general necessary. However, if the sequence under consideration satisfies

$$\max_{1 \leq i \leq n} \frac{\sigma_i^2}{s_n^2} \to 0, \quad \text{as } n \to \infty,$$

then Lindeberg's condition is both sufficient and necessary, i.e. it holds if and only if the result of the central limit theorem holds.

. . Higher Moments
A test of fat-tailedness can be seen by applying the law of large numbers to higher moments and seeing how they converge. A visual examination of the behavior of the cumulative mean of the moment can be done in a similar way to the standard visual tests of the LLN we saw in Chapter  – except that it applies to $X^p$ (raw or centered) rather than $X$. We check the functioning of the law of large numbers by seeing if adding observations causes a reduction of the variability of the average (or of its variance, if it exists). Moments that do not exist will show occasional jumps – or, equivalently, large subsamples will produce different averages. When moments exist, adding observations eventually prevents further jumps.

Another visual technique is to consider the contribution of the maximum observation to the total, and see how it behaves as $n$ grows larger. It is called the MS plot [ ], "maximum to sum", and is shown in Figure . .

Table . : Kurtosis $K(t)$ for $t$  daily, -day, and -day windows for the random variables, along with MaxQuartic (the share of the fourth moment attributable to the single largest observation) and the Years of data.
[Rows for ~45 securities, Australian Dollar/USD through Yen/USD; numerical entries not recoverable.]
Figure . : MS Plot showing the behavior of cumulative moments $p = 1, 2, 3, 4$ for the SP500 over the  years ending in  . The MS plot (maximum to sum) will be presented in  . .

Figure . : Gaussian control for the data in Figure . .

.7 mean deviation for a stable distribution

Let us prepare a result for the next chapter using the norm $L^1$ for situations of finite mean but infinite variance. Clearly we have no way to measure the compression of the distribution around the mean within the norm $L^2$.

The error of a sum in the norm $L^1$ is as follows. Let $\theta(x)$ be the Heaviside function (whose value is zero for negative arguments and one for positive arguments). Since $\mathrm{sgn}(x) = 2\theta(x) - 1$, its characteristic function will be:

$$\chi_{\mathrm{sgn}(x)}(t) = \frac{2}{it}. \qquad (\,.\,)$$

Let $\chi_d(.)$ be the characteristic function of any nondegenerate distribution. Convoluting $\chi_{\mathrm{sgn}(x)} \ast (\chi_d)^n$, we obtain the characteristic function for the positive variations for $n$ independent summands

$$\chi_m = \int_{-\infty}^{\infty} \chi_{\mathrm{sgn}(x)}(t)\, \chi_d(u - t)^n\, dt.$$

In our case of mean absolute deviation being twice that of the positive values of $X$:

$$\chi\left(|S_n|\right) = \frac{2}{i} \int_{-\infty}^{\infty} \frac{\chi(t - u)^n}{t}\, du,$$

which is the Hilbert transform of $\chi$ when $\int$ is taken in the p.v. sense (Pinelis,  ) [ ]. In our situation, given that all independent summands are copies from the same distribution, we can replace the product $\chi(t)^n$ with $\chi_s(t)$, which is the same characteristic function with $\sigma_s = n^{1/\alpha_s}\sigma$, $\beta$ remaining the same:

$$\mathbb{E}\left(|X|\right) = \frac{2}{i}\, \frac{\partial}{\partial u}\, \mathrm{p.v.} \int_{-\infty}^{\infty} \frac{\chi_s(t - u)}{t}\, dt\, \Big|_{u = 0}. \qquad (\,.\,)$$

Now, [ ] the Hilbert transform $H$,

$$(H f)(t) = \frac{1}{2\pi i} \int_0^{\infty} \frac{\chi_s(u + t) - \chi_s(u - t)}{t}\, dt,$$

can be rewritten as

$$(H f)(t) = -i\, \frac{\partial}{\partial u} \left( \chi_s(u) + \frac{1}{\pi i} \int_0^{\infty} \frac{\chi_s(u + t) - \chi_s(u - t) - \chi_s(t) + \chi_s(-t)}{t}\, dt \right). \qquad (\,.\,)$$

Consider the stable distribution defined in  . . . Deriving first inside the integral and using the change of variable $z = \log(t)$,

$$\mathbb{E}|X|\left(\tilde{\alpha}_s, \beta, \sigma_s, 0\right) = \int_{-\infty}^{\infty} \frac{\alpha_s}{\pi}\, e^{-\left(\sigma_s e^z\right)^{\alpha_s} - z} \left(\sigma_s e^z\right)^{\alpha_s} \left( \beta \tan\left(\tfrac{\pi\alpha_s}{2}\right) \sin\left(\beta \tan\left(\tfrac{\pi\alpha_s}{2}\right)\left(\sigma_s e^z\right)^{\alpha_s}\right) + \cos\left(\beta \tan\left(\tfrac{\pi\alpha_s}{2}\right)\left(\sigma_s e^z\right)^{\alpha_s}\right) \right) dz$$

which then integrates nicely to:

$$\mathbb{E}|X|\left(\tilde{\alpha}_s, \beta, \sigma_s, 0\right) = \frac{\sigma_s}{\pi}\, \Gamma\left(\frac{\alpha_s - 1}{\alpha_s}\right) \left( \left(1 + i\beta \tan\left(\tfrac{\pi\alpha_s}{2}\right)\right)^{1/\alpha_s} + \left(1 - i\beta \tan\left(\tfrac{\pi\alpha_s}{2}\right)\right)^{1/\alpha_s} \right). \qquad (\,.\,)$$

We say, again by convention, infinite for the situation where the random variable, say $X$ (or the variance of any random variable), is one-tailed – bounded on one side – and undefined in situations where the variable is two-tailed, e.g. the infamous Cauchy.
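A quick numerical check of the closed form just derived, at the Gaussian end of the class. For $\beta = 0$ the expression reduces to $\mathbb{E}|X| = \frac{2\sigma_s}{\pi}\Gamma\left(\frac{\alpha_s - 1}{\alpha_s}\right)$; at $\alpha_s = 2$ the stable law with scale $\sigma_s = 1$ has characteristic function $e^{-t^2}$, i.e. $\mathcal{N}(0, 2)$, whose mean absolute deviation is $2/\sqrt{\pi}$. A sketch; the Monte Carlo sample size is arbitrary.

```python
import math
import numpy as np

# Closed form for the mean absolute deviation of a symmetric (beta = 0)
# stable variable, specialized from the formula above.
def stable_mad(alpha_s: float, sigma_s: float = 1.0) -> float:
    return 2 * sigma_s / math.pi * math.gamma((alpha_s - 1) / alpha_s)

m_theory = stable_mad(2.0)                # should equal 2/sqrt(pi)
rng = np.random.default_rng(7)
m_mc = np.abs(math.sqrt(2) * rng.standard_normal(500_000)).mean()
print(m_theory, m_mc)                     # both ≈ 1.1284
```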
next

The next chapter presents a central concept: how to work with the law of medium numbers? How can we translate between distributions?
HOW MUCH DATA DO YOU NEED? AN OPERATIONAL METRIC FOR FAT-TAILEDNESS ‡

In this (research) chapter we discuss the laws of medium numbers. We present an operational metric for univariate unimodal probability distributions with finite first moment, in $[0, 1]$, where 0 is maximally thin-tailed (Gaussian) and 1 is maximally fat-tailed. It is based on "how much data does one need to make meaningful statements about a given dataset?"

Applications: Among others, it
• helps assess the sample size $n$ needed for statistical significance outside the Gaussian,
• helps measure the speed of convergence to the Gaussian (or stable basin),
• allows practical comparisons across classes of fat-tailed distributions,
• allows the assessment of the number of securities needed in portfolio construction to achieve a certain level of stability from diversification,
• helps understand some inconsistent attributes of the lognormal, depending on the parametrization of its variance.

The literature is rich for what concerns asymptotic behavior, but there is a large void for finite values of $n$, those needed for operational purposes.

Background: Conventional measures of fat-tailedness, namely 1) the tail index for the Power Law class, and 2) kurtosis for finite moment distributions, fail to apply to some distributions, and do not allow comparisons across classes and parametrization – that is, between power laws outside the Lévy-Stable basin, or power laws to distributions in other classes, or power laws for different numbers of summands. How can one compare a sum of Student T distributed random variables with  degrees of freedom to one in a Lévy-Stable or a Lognormal class? How can one compare a sum of  Student T with  degrees of freedom to a single Student T with  degrees of freedom?

We propose an operational and heuristic metric that allows us to compare $n$-summed independent variables under all distributions with finite first moment. The method is based on the rate of convergence of the law of large numbers for finite sums, $n$-summands specifically.

We get either explicit expressions or simulation results and bounds for the lognormal, exponential, Pareto, and the Student T distributions in their various calibrations – in addition to the general Pearson classes.

‡ Research chapter. The author owes the most to the focused comments by Michail Loulakis who, in addition, provided the rigorous derivations for the limits of the $\kappa$ for the Student T and lognormal distributions, as well as to the patience and wisdom of Spyros Makridakis. The paper was initially presented at Extremes and Risks in Higher Dimensions, Sept  –16, 2016, at the Lorentz Center, Leiden, and at Jim Gatheral's Festschrift at the Courant Institute, in October  . The author thanks Jean-Philippe Bouchaud, John Einmahl, Pasquale Cirillo, and others. Laurens de Haan suggested changing the name of the metric from "gamma" to "kappa" to avoid confusion. Additional thanks to Colman Humphrey, Michael Lawler, Daniel Dufresne and others for discussions and insights with derivations.

Figure . : The intuition of what $\kappa$ is measuring: how the mean deviation of the sum of identical copies of a r.v. $S_n = X_1 + X_2 + \ldots + X_n$ grows as the sample increases, and how we can compare preasymptotically distributions from different classes. (Curves shown, in decreasing degree of fat-tailedness: Cauchy ($\kappa = 1$), Pareto 1.14, Cubic Student T, Gaussian ($\kappa = 0$).)

How can one compare a Pareto distribution with tail $\alpha = 2.1$, that is, with finite variance, to a Gaussian?
Asymptotically, these distributions in the regular variation class with finite second moment, under summation, become Gaussian, but pre-asymptotically we have no standard way of comparing them, given that metrics that depend on higher moments, such as kurtosis, cannot be of help. Nor can we easily compare an infinite variance Pareto distribution to its limiting $\alpha$-Stable distribution (when both have the same tail index or tail exponent). Likewise, how can one compare the "fat-tailedness" of, say, a Student T with 3 degrees of freedom to that of a Lévy-Stable with tail exponent of 1.95? Both distributions have a finite mean; of the two, only the first has a finite variance but, for a small number of summands, behaves more "fat-tailed" according to some operational criteria.

Criterion for "fat-tailedness"
There are various ways to "define" Fat Tails and rank distributions according to each definition. In the narrow class of distributions having all moments finite, it is the kurtosis, which allows simple comparisons and measures departures from the Gaussian, which is used as a norm. For the Power Law class, it can be the tail exponent. One can also use extremal values, taking the probability of exceeding a maximum value, adjusted by the scale (as practiced in extreme value theory). For operational uses, practitioners' fat-tailedness is a degree of concentration, such as "how much of the statistical properties will be attributable to a single observation?", or, appropriately adjusted by the scale (or the mean dispersion), "how much of the total wealth of a country is in the hands of the richest individual?"

Here we use the following criterion for our purpose, which maps to the measure of concentration in the previous paragraph: "How much will additional data (under such a probability distribution) help increase the stability of the observed mean?" The purpose is not entirely statistical: it can equally mean: "How much will adding an additional security into my portfolio allocation (i.e., keeping the total constant) increase its stability?"

Our metric differs from the asymptotic measures (particularly the ones used in extreme value theory) in the fact that it is fundamentally preasymptotic. Real life, and real world realizations, are outside the asymptote.

Figure . : Watching the effect of the Generalized Central Limit Theorem: Pareto and Student T distributions, in the P class, with $\alpha$ exponent; $\kappa$ converges to $2 - \tilde{\alpha}$ (with $\tilde{\alpha} = \alpha$ for $\alpha < 2$ and $\tilde{\alpha} = 2$ for $\alpha \geq 2$), i.e., the Stable $S$ class. We observe how slow the convergence is, even after  summands. This discounts Mandelbrot's assertion that an infinite variance Pareto can be subsumed into a stable distribution.
The metric we propose, $\kappa$, does the following:
• Allows comparison of $n$-summed variables of different distributions for a given number of summands, or of the same distribution for different $n$, and assesses the preasymptotic properties of a given distribution.
• Provides a measure of the distance from the limiting distribution, namely the Lévy $\alpha$-Stable basin (of which the Gaussian is a special case).
• For statistical inference, allows assessing the "speed" of the law of large numbers, expressed in the change of the mean absolute error around the average thanks to the increase of sample size $n$.
• Allows comparative assessment of the "fat-tailedness" of two different univariate distributions, when both have finite first moment.
• Allows us to know ahead of time how many runs we need for a Monte Carlo simulation.
The state of statistical inference

The last point, the "speed", appears to have been ignored (see earlier comments in Chapter  about the ,  pages of the Encyclopedia of Statistical Science [ ]). It is very rare to find a discussion about how long it takes to reach the asymptote, or how to deal with $n$ summands that are large but perhaps not sufficiently so for the so-called "normal approximation". To repeat our motto, "statistics is never standard". This metric aims at showing how standard is "standard", and at measuring the exact departure from the standard from the standpoint of statistical significance.

Figure . : The lognormal distribution behaves like a Gaussian for low values of $\sigma$, but becomes rapidly equivalent to a power law. This illustrates why, operationally, the debate on whether the distribution of wealth was lognormal (Gibrat) or Pareto (Zipf) doesn't carry much operational significance.

Definition . (the $\kappa$ metric)
Let $X_1, \ldots, X_n$ be i.i.d. random variables with finite mean, that is, $\mathbb{E}(X) < +\infty$. Let $S_n = X_1 + X_2 + \ldots + X_n$ be a partial sum. Let $M(n) = \mathbb{E}\left(|S_n - \mathbb{E}(S_n)|\right)$ be the expected mean absolute deviation from the mean for $n$ summands. Define the "rate" of convergence for $n$ additional summands starting with $n_0$:

$$\kappa_{n_0, n} = \min\left\{ \kappa_{n_0, n} : \frac{M(n)}{M(n_0)} = \left(\frac{n}{n_0}\right)^{\frac{1}{2 - \kappa_{n_0, n}}} \right\}, \quad n_0, n = 1, 2, \ldots,$$

.2 the metric

Table . : Kappa for  summands, $\kappa_1$.
Student T($\alpha$) — closed form via the Gamma function.
Exponential/Gamma — $2 - \frac{\log(2)}{2\log(2) - 1} \approx .21$.
Pareto($\alpha$) — closed form via incomplete Beta functions.
Normal($\mu, \sigma$) with switching variance $\sigma^2 \pm a$ w.p. $p$ — closed form.
Lognormal($\mu, \sigma$) — approximate closed form via the error function erf.

Table . : Summary of main results.
Distribution — $\kappa_n$:
Exponential/Gamma — Explicit.
Lognormal($\mu, \sigma$) — No explicit $\kappa_n$, but explicit lower and higher bounds (low or high $\sigma$ or $n$); approximated with Pearson IV for $\sigma$ in between.
Pareto($\alpha$) (constant) — Explicit for $\kappa_1$ (lower bound for all $\alpha$).
Student T($\alpha$) (slowly varying function) — Explicit for $\kappa_1$, $\alpha = 3$.

Table . : Comparing Pareto to Student T (same tail exponent $\alpha$); columns give $\kappa$ for three numbers of summands for each distribution.
[Rows for $\alpha$ from 1.25 to 4; numerical $\kappa$ entries not recoverable.]

With $n > n_0 \geq 1$, hence

$$\kappa(n_0, n) = 2 - \frac{\log(n) - \log(n_0)}{\log\left(\frac{M(n)}{M(n_0)}\right)}. \qquad (\,.\,)$$

Further, for the baseline values $n = n_0 + 1$, we use the shorthand $\kappa_n$. We can also decompose $\kappa(n_0, n)$ in terms of "local" intermediate ones, similar to "local" interest rates, under the constraint

$$\kappa(n_0, n) = 2 - \frac{\log(n) - \log(n_0)}{\sum_{i=n_0}^{n-1} \frac{\log(i+1) - \log(i)}{2 - \kappa(i, i+1)}}. \qquad (\,.\,)$$

Use of Mean Deviation
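The metric is easily computed by simulation from mean deviations. Below, a sketch of $\kappa_1 = \kappa(1, 2)$ straight from the definition, $\kappa(1,2) = 2 - \log(2)/\log(M(2)/M(1))$; sample sizes are arbitrary choices.

```python
import math
import numpy as np

# Empirical kappa(1, 2) with M(n) = E|S_n - E S_n| estimated by Monte Carlo.
# The exponential should give roughly the 0.21 of the table; the Gaussian
# gives 0.
rng = np.random.default_rng(9)
trials = 1_000_000

def kappa12(draw):
    x1 = draw(trials)                     # single summand
    s2 = draw((trials, 2)).sum(axis=1)    # two summands
    m1 = np.abs(x1 - x1.mean()).mean()
    m2 = np.abs(s2 - s2.mean()).mean()
    return 2 - math.log(2) / math.log(m2 / m1)

k_expo = kappa12(lambda size: rng.exponential(size=size))
k_gauss = kappa12(rng.standard_normal)
print(k_expo)    # ≈ 0.21
print(k_gauss)   # ≈ 0
```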
Note that for the measure of dispersion around the mean we use the mean absolute deviation, to stay in norm $L^1$ in the absence of finite variance – actually, even in the presence of finite variance, under Power Law regimes distributions deliver an unstable and uninformative second moment. Mean deviation proves far more robust there. (Mean absolute deviation can be shown to be more "efficient" except in the narrow case of kurtosis equal to 3 (the Gaussian); see a longer discussion in [ ]; for other advantages, see [ ].)

Definition . (the class P)
The P class of power laws (regular variation) is defined for a r.v. $X$ as follows:

$$\mathcal{P} = \left\{ X : \mathbb{P}(X > x) \sim L(x)\, x^{-\alpha} \right\} \qquad (\,.\,)$$

where $\sim$ means that the limit of the ratio of rhs to lhs goes to 1 as $x \to \infty$. $L : [x_{\min}, +\infty) \to (0, +\infty)$ is a slowly varying function, defined as $\lim_{x \to +\infty} \frac{L(kx)}{L(x)} = 1$ for any $k > 0$. The constant $\alpha > 0$.

Next we define the domain of attraction of the sum of identically distributed variables, in our case with identical parameters.
Definition . (stable S class)
A random variable $X$ follows a stable (or $\alpha$-stable) distribution, symbolically $X \sim S(\tilde{\alpha}, \beta, \mu, \sigma)$, if its characteristic function $\chi(t) = \mathbb{E}\left(e^{itX}\right)$ is of the form:

$$\chi(t) = \begin{cases} e^{\left(i\mu t - |t\sigma|^{\tilde{\alpha}}\left(1 - i\beta \tan\left(\frac{\pi\tilde{\alpha}}{2}\right)\mathrm{sgn}(t)\right)\right)} & \tilde{\alpha} \neq 1 \\[1ex] e^{\left(it\left(\frac{2\beta\sigma\log(\sigma)}{\pi} + \mu\right) - |t\sigma|\left(1 + \frac{2 i\beta\, \mathrm{sgn}(t)\log(|t\sigma|)}{\pi}\right)\right)} & \tilde{\alpha} = 1, \end{cases} \qquad (\,.\,)$$

Next, we define the corresponding stable $\tilde{\alpha}$:

$$\tilde{\alpha} \triangleq \begin{cases} \alpha & \alpha < 2 \\ 2 & \alpha \geq 2 \end{cases} \quad \text{if } X \text{ is in } \mathcal{P}. \qquad (\,.\,)$$

Further discussions of the class $S$ are as follows.

. . Equivalence for Stable distributions
For all n₀ and n ≥ 1 in the S class with α̃ ≥ 1,

\kappa(n_0, n) = 2 - \tilde{\alpha},

simply from the property that

M(n) = n^{1/\tilde{\alpha}}\, M(1).  ( . )

This simply shows that κ_{n₀,n} = 0 for the Gaussian. The problem of the preasymptotics for n summands reduces to:

• What is the property of the distribution for n = 1 (or starting from a standard, off-the-shelf distribution)?
• What is the property of the distribution for n summands?
• How does κ_n → 2 − α̃, and at what rate?

. . Practical significance for sample sufficiency

how much data do you need? an operational metric for fat-tailedness ‡

Confidence intervals: As a simple heuristic, the higher κ, the more disproportionately insufficient the confidence interval. Any value of κ above . effectively indicates a high degree of unreliability of the "normal approximation". One can immediately doubt the results of numerous research papers in fat-tailed domains. Computations of the sort done in Table . , for instance, allow us to compare various distributions under various parametrizations (comparing various Pareto distributions to the symmetric Student T and, of course, the Gaussian, which has a flat kappa of 0).

As we mentioned in the introduction, the required sample size for statistical inference is driven by n, the number of summands. Yet the law of large numbers is often invoked in erroneous conditions; we need a rigorous sample size metric. Many papers, when discussing financial matters, say [ ], use finite variance as a binary classification for fat-tailedness: power laws with a tail exponent greater than 2 are therefore classified as part of the "Gaussian basin", hence allowing the use of variance and other such metrics for financial applications. A much more natural boundary is finiteness of expectation for financial applications [ ]. Our metric can thus be useful as follows:

Let X_{g,1}, X_{g,2}, …, X_{g,n_g} be a sequence of Gaussian variables with mean m and scale σ. Let X_{ν,1}, X_{ν,2}, …
, X_{ν,n_ν} be a sequence of some other variables scaled to be of the same M(1), namely M_ν(1) = M_g(1) = \sqrt{\frac{2}{\pi}}\,\sigma. We would be looking for values of n_ν corresponding to a given n_g.

κ_n is indicative of both the rate of convergence under the law of large numbers and, as κ_n → 0, of the rate of convergence of summands to the Gaussian under the central limit, as illustrated in Figure . .

n_{min} = \inf\left\{ n_\nu : E\left( \left| \sum_{i=1}^{n_\nu} \frac{X_{\nu,i} - m}{n_\nu} \right| \right) \leq E\left( \left| \sum_{i=1}^{n_g} \frac{X_{g,i} - m_g}{n_g} \right| \right) \right\}  ( . )

which can be computed using κ_n = 0 for the Gaussian and backing out κ_n for the target distribution, with the simple approximation (treating κ as constant):

n_\nu \approx n_g^{\frac{2 - \kappa}{2 - 2\kappa}}, \quad \kappa = \kappa(1, n_g).  ( . )

The approximation is owed to the slowness of convergence. So, for example, a Student T with 3 degrees of freedom (α = 3) requires considerably more observations to get the same drop in variance from averaging (hence confidence level) as a Gaussian sample. The one-tailed Pareto with the same tail exponent α = 3 requires still more observations than the Student to match a given Gaussian sample, which shows 1) that finiteness of variance is not an indication of fat-tailedness (in our statistical sense), 2) that neither are tail exponents good indicators, and 3) how the symmetric Student and the Pareto distribution are not equivalent, because of the "bell-shapedness" of the Student (from the slowly varying function) that dampens variations in the center of the distribution.

.4 technical consequences

We can also elicit quite counterintuitive results. From Eq. . , the "Pareto 80/20" in the popular mind, which maps to a tail exponent around α ≈ 1.16, requires vastly more observations than the Gaussian.

. . Some Oddities With Asymmetric Distributions
The stable distribution, when skewed, has the same κ index as a symmetric one (in other words, κ is invariant to the β parameter in Eq. . , which is conserved under summation). But a one-tailed simple Pareto distribution is fatter tailed (for our purpose here) than an equivalent symmetric one.

This is relevant because the stable is never really observed in practice and is used as some limiting mathematical object, while the Pareto is more commonly seen. The point is not well grasped in the literature. Consider the following use of the substitution of a stable for a Pareto. In Uchaikin and Zolotarev [ ]:

Mandelbrot called attention to the fact that the use of the extremal stable distributions (corresponding to β = 1) to describe empirical principles was preferable to the use of the Zipf-Pareto distributions for a number of reasons. It can be seen from many publications, both theoretical and applied, that Mandelbrot's ideas receive more and more wide recognition of experts. In this way, the hope arises to confirm empirically established principles in the framework of mathematical models and, at the same time, to clear up the mechanism of the formation of these principles.

These are not the same animals, even for a large number of summands.

. . Rate of Convergence of a Student T Distribution to the Gaussian Basin
We show in the appendix – thanks to the explicit derivation of κ for the sum of Students with α = 3, the "cubic" commonly noticed in finance – that the rate of convergence of κ to 0 under summation is 1/log(n). This (and the semi-closed form for the density of an n-summed cubic Student) complements the result in Bouchaud and Potters [ ] (see also [ ]), which is as follows. Their approach is to separate the "Gaussian zone", where the density is approximated by that of a Gaussian, and a "power law zone" in the tails, which retains the original distribution with power law decline. The "crossover" between the two moves right and left of the center at a rate of \sqrt{n \log(n)} standard deviations, which is excruciatingly slow. Indeed, one can note that more summands fall at the center of the distribution, and fewer outside of it; hence the speed of convergence according to the central limit theorem will differ according to whether the density concerns the center or the tails.

Further investigations would concern the convergence of the Pareto to a Lévy-stable, which so far we have only obtained numerically.

. . The Lognormal is Neither Thin Nor Fat Tailed
Naively, as we can see in Figure . , at low values of the parameter σ the lognormal behaves like a Gaussian, and at high σ it appears to have the behavior of a Cauchy of sorts (a one-tailed Cauchy, rather a stable distribution with α = 1, β = 1), as κ gets closer and closer to 1. This gives us an idea about some aspects of the debates as to whether some variable is Pareto or lognormally distributed, such as, say, the debates about wealth [ ], [ ], [ ]. Indeed, such debates can be irrelevant to the real world. As P. Cirillo [ ] observed, many cases of Paretianity are effectively lognormal situations with high variance; the practical statistical consequences, however, are smaller than imagined.

. . Can Kappa Be Negative?
Just as kurtosis for a mixed Gaussian (i.e., with stochastic mean, rather than stochastic volatility) can dip below 3 (or become "negative" when one uses the convention of measuring kurtosis as excess over the Gaussian, subtracting 3 from the measure), the kappa metric can become negative when kurtosis is "negative". These situations require bimodality (i.e., a switching process between means under fixed variance, with modes far apart in terms of standard deviation). They do not appear to occur with unimodal distributions. Details and derivations are presented in the appendix.
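A negative κ can be exhibited in closed form for the switching-mean Gaussian. A sketch from our own derivation (consistent with the appendix setup, using the standard identity E|N(a, s²)| = s√(2/π) e^{−a²/(2s²)} + a·erf(a/(s√2)); function names are ours):

```python
# kappa(1, 2) for a 50/50 mixture N(+d/2, sigma^2) / N(-d/2, sigma^2), closed form.
import math

def abs_moment(a, s):
    """E|Z| for Z ~ N(a, s^2), a >= 0."""
    return s * math.sqrt(2 / math.pi) * math.exp(-a * a / (2 * s * s)) \
        + a * math.erf(a / (s * math.sqrt(2)))

def kappa_1_2_mixture(d, sigma):
    """kappa(1, 2) when X ~ N(+d/2, sigma^2) or N(-d/2, sigma^2), each w.p. 1/2."""
    m1 = abs_moment(d / 2, sigma)            # one draw, deviation from the overall mean 0
    s2 = sigma * math.sqrt(2)                # two draws: components N(+-d, 2 sigma^2), N(0, 2 sigma^2)
    m2 = 0.5 * abs_moment(d, s2) + 0.5 * abs_moment(0.0, s2)
    return 2 - math.log(2) / math.log(m2 / m1)
```

At d = 0 the mixture collapses to a Gaussian and κ = 0 exactly; once the modes are several standard deviations apart, κ turns sharply negative.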
To summarize, while the limit theorems (the law of large numbers and the central limit) are concerned with the behavior as n → +∞, we are interested in finite and exact n, both small and large. We may draw a few operational consequences:

Figure . : In short, why the 1/n heuristic works: it takes many, many more securities to get the same risk reduction as via portfolio allocation according to Markowitz. We assume, to simplify, that the securities are independent, which they are not, something that compounds the effect.

.5 conclusion and consequences

. . Portfolio Pseudo-Stabilization
Our method can also naturally and immediately apply to portfolio construction and the effect of diversification, since adding a security to a portfolio has the same "stabilizing" effect as adding an additional observation for the purpose of statistical significance. "How much data do you need?" translates into "How many securities do you need?". Clearly, the Markowitz allocation method in modern finance [ ] (which seems not to have been used by Markowitz himself for his own portfolio [ ]) applies only for κ near 0; people use convex heuristics, otherwise they will underestimate tail risks and "blow up" the way the famed portfolio-theory-oriented hedge fund Long-Term Capital Management did in 1998 [ ], [ ].

We mentioned earlier that a Pareto distribution close to the "80/20" requires vastly more observations than a Gaussian; consider that the risk of a portfolio under such a distribution would be underestimated by at least 8 orders of magnitude if one uses modern portfolio criteria. Following such reasoning, one simply needs broader portfolios.

It has also been noted that there is practically no financial security that is not fatter tailed than the Gaussian, from the simple criterion of kurtosis [ ], meaning Markowitz portfolio allocation is never the best solution. It happens that agents wisely apply a noisy approximation to the 1/n heuristic, which has been classified as one of those biases by behavioral scientists but has in fact been debunked as false (a false bias is one in which, while the observed phenomenon is there, it does not constitute a "bias" in the bad sense of the word; rather, it is the researcher who is mistaken owing to using the wrong tools instead of the decision-maker). This tendency to "overdiversify" has been deemed a departure from optimal investment behavior by Benartzi and Thaler [ ], explained in [ ]: "when faced with n options, divide assets evenly across the options. We have dubbed this heuristic the '1/n rule.'" However, broadening one's diversification is effectively at least as optimal as standard allocation (see the critique by Windcliff and Boyle [ ] and [ ]). In short, an equally weighted portfolio outperforms the S&P 500 across a broad range of metrics. But even the latter two papers didn't conceive of the full effect and properties of fat tails, which we can see here with some precision. Fig. . shows the effect for securities compared to Markowitz.

This false bias is one among many examples of policy makers "nudging" people into the wrong rationality [ ] and driving them to increase their portfolio risk manyfold.

A few more comments on financial portfolio risks. The S&P 500 has a κ of around . , but one needs to take into account that it is itself a basket of n = 500 securities, albeit unweighted and consisting of correlated members, overweighing stable stocks. Single stocks have kappas between .3 and .7, meaning a policy of "overdiversification" is a must.

Likewise the metric gives us some guidance in the treatment of data for forecasting, by establishing sample sufficiency, to state such matters as how many years of data we need before stating whether climate conditions "have changed"; see [ ].

. . Other Aspects of Statistical Inference
So far we considered only univariate distributions. For higher dimensions, a potential area of investigation is an equivalent approach to the multivariate distribution of extremely fat tailed variables, the sampling of which is not captured by the Marchenko-Pastur (or Wishart) distributions. As in our situation, adding variables doesn't easily remove noise from random matrices.

. . Final comment
As we keep saying, "statistics is never standard"; however, there are heuristic methods to figure out where and by how much we depart from the standard.
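One such heuristic is the sample-sufficiency approximation sketched earlier. Since M(n) = M(1)·n^{1/(2−κ)}, matching the mean deviation of the sample mean of n_g Gaussian observations gives the exponent (2−κ)/(2−2κ). A minimal sketch of that back-of-the-envelope rule as we reconstruct it (the function name is ours, and κ is treated as constant):

```python
# Required sample size under fat tails: observations needed to match the
# "drop in variance" (mean deviation of the sample mean) of a Gaussian sample.
import math

def equivalent_sample_size(n_gaussian, kappa):
    """n such that a distribution with metric kappa matches n_gaussian
    Gaussian observations; from M(n) = M(1) * n**(1/(2 - kappa))."""
    if not 0 <= kappa < 1:
        raise ValueError("kappa must be in [0, 1)")
    return math.ceil(n_gaussian ** ((2 - kappa) / (2 - 2 * kappa)))
```

For κ = 0 this returns n_g itself; for κ = 0.5 a Gaussian sample of 30 already requires about 165 observations, and the requirement explodes as κ approaches 1 (the borderline of infinite mean deviation).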
We show here some derivations.

. . Cubic Student T (Gaussian Basin)
The Student T with 3 degrees of freedom is of special interest in the literature owing to its prevalence in finance [ ]. It is often mistakenly approximated to be Gaussian owing to the finiteness of its variance. Asymptotically, we end up with a Gaussian, but this doesn't tell us anything about the rate of convergence. Mandelbrot and Taleb [ ] remark that the cubic acts more like a power law in the distribution of the extremes, which we will elaborate on here thanks to an explicit PDF for the sum. Let X be a random variable distributed with density p(x):

p(x) = \frac{6\sqrt{3}}{\pi (x^2 + 3)^2}, \quad x \in (-\infty, \infty)  ( . )

Proposition .
Let Y be a sum of X_1, …, X_n, n identical copies of X. Let M(n) be the mean absolute deviation from the mean for n summands. The "rate" of convergence \kappa_n = \{\kappa : \frac{M(n)}{M(1)} = n^{\frac{1}{2-\kappa}}\} is:

\kappa_n = 2 - \frac{\log(n)}{\log\left( e^n n^{-n}\, \Gamma(n+1, n) - 1 \right)}  ( . )

where Γ(., .) is the incomplete gamma function, \Gamma(a, z) = \int_z^\infty t^{a-1} e^{-t}\, dt, since the mean deviation M(n) is:

M(n) = \begin{cases} \frac{2\sqrt{3}}{\pi} & n = 1 \\[4pt] \frac{2\sqrt{3}}{\pi}\left( e^n n^{-n}\, \Gamma(n+1, n) - 1 \right) & n > 1 \end{cases}  ( . )

.6 appendix, derivations, and proofs

The derivations are as follows. For the pdf and the MAD we followed different routes. We have the characteristic function for n summands:

\varphi(\omega) = \left(1 + \sqrt{3}\,|\omega|\right)^n e^{-n\sqrt{3}\,|\omega|}

The pdf of Y is given by:

p(y) = \frac{1}{\pi} \int_0^\infty \left(1 + \sqrt{3}\,\omega\right)^n e^{-n\sqrt{3}\,\omega} \cos(\omega y)\, d\omega

After arduous integration we get the result in . . Further, since the following result does not appear to be found in the literature, we have a useful side result: the PDF of Y can be written as

p(y) = \frac{e^{\,n - \frac{iy}{\sqrt{3}}} \left( e^{\frac{2iy}{\sqrt{3}}}\, E_{-n}\!\left(n + \frac{iy}{\sqrt{3}}\right) + E_{-n}\!\left(n - \frac{iy}{\sqrt{3}}\right) \right)}{2\sqrt{3}\,\pi}  ( . )

where E_{(.)}(.)
is the exponential integral E_n(z) = \int_1^\infty \frac{e^{-zt}}{t^n}\, dt. Note the following identities (from the update of Abramowitz and Stegun [ ]):

n^{-n-1}\, \Gamma(n+1, n) = E_{-n}(n) = e^{-n} n^{-n-1}\, n! \sum_{m=0}^{n} \frac{n^m}{m!}

As to the asymptotics, we have the following result (proposed by Michail Loulakis). Reexpressing Eq. . :

M(n) = \frac{2\sqrt{3}\; n!}{\pi\, n^n} \sum_{m=0}^{n-1} \frac{n^m}{m!}

Further,

e^{-n} \sum_{m=0}^{n-1} \frac{n^m}{m!} = \frac{1}{2} + O\left(\frac{1}{\sqrt{n}}\right)

(From the behavior of the sum of Poisson variables as they converge to a Gaussian by the central limit theorem: e^{-n} \sum_{m=0}^{n-1} \frac{n^m}{m!} = P(X_n < n), where X_n is a Poisson random variable with parameter n. Since the sum of n independent Poisson random variables with parameter 1 is Poisson with parameter n, the central limit theorem says the probability distribution of Z_n = (X_n - n)/\sqrt{n} approaches a standard normal distribution. Thus P(X_n < n) = P(Z_n < 0) \to \frac{1}{2} as n \to \infty. For another approach, see the proof by Robert Israel on Math Stack Exchange [ ] that 1 + n + \frac{n^2}{2!} + \cdots + \frac{n^{n-1}}{(n-1)!} \sim \frac{e^n}{2}.)

Using the property that \lim_{n \to \infty} \frac{n!\, e^n}{n^n \sqrt{n}} = \sqrt{2\pi}, we get the following exact asymptotics:

\lim_{n \to \infty} \log(n)\, \kappa_n = 2 \log\left(\frac{\pi}{2}\right)

thus κ goes to 0 (i.e., the average becomes Gaussian) at speed 1/log(n), which is excruciatingly slow. In other words, even with a very large number of summands the behavior cannot be summarized as that of a Gaussian, an intuition often expressed by B. Mandelbrot [ ].

. . Lognormal Sums
From the behavior of its cumulants for n summands, we can observe that a sum behaves like a Gaussian when σ is low, and as a lognormal when σ is high – and in both cases we know κ_n explicitly.

The lognormal (parametrized with μ and σ) doesn't have an explicit characteristic function. But we can get cumulants K_i of all orders i by recursion, and for our case of summed identical copies of the r.v. X_i, K_i^n = K_i(\sum^n X_i) = n K_i(X). Cumulants:

K_1^n = n\, e^{\mu + \frac{\sigma^2}{2}}, \quad K_2^n = n \left( e^{\sigma^2} - 1 \right) e^{2\mu + \sigma^2}, \quad K_3^n = n \left( e^{\sigma^2} - 1 \right)^2 \left( e^{\sigma^2} + 2 \right) e^{3\mu + \frac{3\sigma^2}{2}}, \quad K_4^n = \ldots

which allow us to compute:

\text{Skewness} = \frac{\sqrt{e^{\sigma^2} - 1}\, \left( e^{\sigma^2} + 2 \right)}{\sqrt{n}} \quad \text{and} \quad \text{Kurtosis} = 3 + \frac{e^{4\sigma^2} + 2 e^{3\sigma^2} + 3 e^{2\sigma^2} - 6}{n}

We can immediately prove from the cumulants/moments that

\lim_{n \to +\infty} \kappa_n = 0, \quad \lim_{\sigma \to 0} \kappa_n = 0

and our bound on κ becomes explicit. Let κ*_n be the situation under which sums of lognormals conserve the lognormal density, with the same first two moments. We have

0 \leq \kappa_n \leq \kappa^*_n, \qquad \kappa^*_n = 2 - \frac{\log(n)}{\log\left( \dfrac{n \operatorname{erf}\left( \frac{1}{2\sqrt{2}} \sqrt{\log\left( \frac{n + e^{\sigma^2} - 1}{n} \right)} \right)}{\operatorname{erf}\left( \frac{\sigma}{2\sqrt{2}} \right)} \right)}

Heuristic attempt
Among other heuristic approaches, we can see in two steps how 1) under high values of σ, κ_n → κ*_n, since the law of large numbers slows down, and 2) κ*_n → 1 as σ → ∞.

Loulakis' Proof
Proving the upper bound – that for high variance κ_n approaches 1 – has been done formally by Michail Loulakis, which we summarize as follows. We start with the identity E(|X - m|) = 2\int_m^\infty (x - m) f(x)\, dx = 2\int_m^\infty \bar{F}_X(t)\, dt, where f(.) is the density, m is the mean, and \bar{F}_X(.) is the survival function. Further, M(n) = 2\int_{nm}^\infty \bar{F}_{S_n}(x)\, dx. Assume m = 1, or X = \exp\left( \sigma Z - \frac{\sigma^2}{2} \right), where Z is a standard normal variate. Let S_n be the sum X_1 + \ldots + X_n; we get M(n) = 2\int_n^\infty P(S_n > t)\, dt. Using the property of subexponentiality ([ ]),

P(S_n > t) \geq P\left( \max_{0 < i \leq n}(X_i) > t \right) \geq n\, P(X_1 > t) - \binom{n}{2} P(X_1 > t)^2.

The quadratic term becomes negligible as σ → ∞, so \frac{M(n)}{M(1)} \geq n in the limit, while at the same time we need to satisfy the bound \frac{M(n)}{M(1)} \leq n. So for σ → ∞, \frac{M(n)}{M(1)} = n, hence κ_n → 1 as σ → ∞.

Pearson Family Approach for Computation
For computational purposes, for the σ parameter not too large (below ≈ 0.3), we can use the Pearson family for computational convenience – although the lognormal does not belong to the Pearson class (the normal does, but we are close enough for computation). Intuitively, at low σ the first four moments can be sufficient because of the absence of large deviations; not so at higher σ, for which conserving the lognormal would be the right method.

The use of the Pearson class is practiced in some fields such as information/communication theory, where there is a rich literature: for summation of lognormal variates see Nie and Chen [ ], and for Pearson IV, [ ], [ ].

The Pearson family is defined for an appropriately scaled density f satisfying the following differential equation:

f'(x) = -\frac{a_0 + a_1 x}{b_0 + b_1 x + b_2 x^2}\, f(x)  ( . )

We note that our parametrization of a_0, b_0, etc. determines the distribution within the Pearson class – which here appears to be the Pearson IV. Finally we get an expression of the mean deviation as a function of n, σ, and μ.

Let m be the mean. Diaconis et al. [ ], from an old trick by De Moivre, and Suzuki [ ] show that we can get an explicit mean absolute deviation. Using, again, the identity E(|X - m|) = 2\int_m^\infty (x - m) f(x)\, dx and integrating by parts,

E(|X - m|) = \frac{2 \left( b_0 + b_1 m + b_2 m^2 \right)}{a_1 - 2 b_2}\, f(m)  ( . )

(Review of the paper version; Loulakis proposed a formal proof in place of the heuristic derivation.)

We use cumulants of the n-summed lognormal to match the parameters.
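Stepping back from the Pearson machinery: the explicit upper bound κ*_n given above (sums forced to remain lognormal, first two moments matched) is directly computable. A minimal sketch, assuming the moment-matched log-scale σ_n² = log(1 + (e^{σ²} − 1)/n) and MAD(lognormal) = 2 e^{μ+σ²/2} erf(σ/(2√2)); the function name is ours:

```python
# kappa*_n: the kappa obtained if sums of lognormals conserved lognormality
# with the same first two moments.
import math

def kappa_star(n, sigma):
    s2 = sigma * sigma
    sigma_n = math.sqrt(math.log1p((math.exp(s2) - 1.0) / n))   # matched log-scale of the sum
    ratio = n * math.erf(sigma_n / (2 * math.sqrt(2))) / math.erf(sigma / (2 * math.sqrt(2)))
    return 2 - math.log(n) / math.log(ratio)
```

The bound behaves as the text describes: near 0 for small σ (Gaussian-like), and approaching 1 as σ grows (Cauchy-like slowdown of the law of large numbers).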
Setting a_1 = 1 and m = \frac{b_1 - a_0}{1 - 2 b_2}, matching a_0, b_0, b_1, b_2 to the first four cumulants of the n-summed lognormal yields explicit (if lengthy) closed-form expressions for each coefficient in terms of n, μ, and σ.

Polynomial Expansions
Other methods, such as Gram-Charlier expansions, as in Schleher [ ] and Beaulieu [ ], proved less helpful in obtaining κ_n. At high values of σ, the approximations become unstable as we include higher-order Hermite polynomials. See the review in Dufresne [ ] and [ ].

. . Exponential
The exponential is the "entry level" fat tails, just at the border. With density

f(x) = \lambda e^{-\lambda x}, \quad x \geq 0

and Z = X_1 + X_2 + \ldots + X_n, we get, by recursion, since f_2(y) = \int_0^y f(x) f(y - x)\, dx = \lambda^2 y e^{-\lambda y}:

f_n(z) = \frac{\lambda^n z^{n-1} e^{-\lambda z}}{(n-1)!}  ( . )

which is the gamma distribution; we get the mean deviation for n summands:

M(n) = \frac{2 e^{-n} n^n}{\lambda\, \Gamma(n)},  ( . )

hence:

\kappa_n = 2 - \frac{\log(n)}{n \log(n) - n - \log(\Gamma(n)) + 1}  ( . )

We can see that the asymptotic behavior is equally slow (similar to the Student), although the exponential distribution is sitting at the cusp of subexponentiality:

\lim_{n \to \infty} \log(n)\, \kappa_n = 4 - 2\log(2\pi)

Figure . : Negative kurtosis and the corresponding kappa.

. . Negative Kappa, Negative Kurtosis
Consider the simple case of a Gaussian with switching means and fixed variance: with probability 1/2, X ∼ N(μ_1, σ²), and with probability 1/2, X ∼ N(μ_2, σ²). These situations, with thinner tails than the Gaussian, are encountered in bimodal situations where μ_1 and μ_2 are separated; the effect becomes acute when they are separated by several standard deviations. Let d = μ_1 − μ_2 and σ_1 = σ_2 = σ (to achieve minimum kurtosis). Using E|N(a, s^2)| = s\sqrt{\frac{2}{\pi}}\, e^{-\frac{a^2}{2s^2}} + a \operatorname{erf}\left( \frac{a}{s\sqrt{2}} \right), the mean deviations for one and two summands are

M(1) = \sigma \sqrt{\frac{2}{\pi}}\, e^{-\frac{d^2}{8\sigma^2}} + \frac{d}{2} \operatorname{erf}\left( \frac{d}{2\sqrt{2}\,\sigma} \right), \qquad M(2) = \frac{1}{2}\left( \frac{2\sigma}{\sqrt{\pi}}\, e^{-\frac{d^2}{4\sigma^2}} + d \operatorname{erf}\left( \frac{d}{2\sigma} \right) \right) + \frac{\sigma}{\sqrt{\pi}},

so that

\kappa_1 = 2 - \frac{\log(2)}{\log\left( \frac{M(2)}{M(1)} \right)}  ( . )

which we see is negative for a wide range of values of μ_1 − μ_2.

next

Next we consider some simple diagnostics for power laws, with application to the S&P 500. We show the differences between naive methods and those based on ML estimators that allow extrapolation into the tails.
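Before moving to extreme values, the appendix's closed form for the cubic Student can also be checked numerically. A sketch based on our reading of the identity M(n)/M(1) = n!\,n^{-n}\sum_{m=0}^{n-1} n^m/m! (function name ours; valid for moderate n before float overflow):

```python
# kappa_n = kappa(1, n) for sums of the "cubic" Student T (alpha = 3).
import math

def kappa_student3(n):
    """kappa(1, n) for n >= 2 summed Student T variables with 3 degrees of freedom.
    Uses M(n)/M(1) = n! n^-n * sum_{m=0}^{n-1} n^m/m!  (keep n below a few hundred)."""
    term, s = 1.0, 1.0                 # running term n^m/m! and partial sum, starting at m = 0
    for m in range(1, n):
        term *= n / m                  # n^m/m! from n^(m-1)/(m-1)!
        s += term
    log_ratio = math.lgamma(n + 1) - n * math.log(n) + math.log(s)
    return 2 - math.log(n) / log_ratio

# Decay toward the Gaussian basin is only logarithmic in n
# (per the appendix asymptotics, log(n)*kappa_n tends to a constant).
```

For n = 2 the sum reduces to M(2)/M(1) = 3/2, giving κ_2 = 2 − log 2/log(3/2) ≈ 0.29, and the decay with n is painfully slow.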
EXTREME VALUES AND HIDDEN TAILS *,†

When the data is thick tailed, there is a hidden part of the distribution, not shown in past samples. Past extrema (maximum or minimum) are not good predictors of future extrema – visibly, records keep being broken, and the past high-water mark is a naive estimation. This is what is referred to in Chapter as the Lucretius fallacy, which, as we saw, can be paraphrased as: the fool believes that the tallest river and the tallest mountain there are equal the tallest ones he has personally seen.

This chapter, after a brief introduction to extreme value theory, focuses on its application to thick tails. When the data is power law distributed, the maximum of n observations follows a distribution easy to build from scratch. We show practically how the Fréchet distribution is, asymptotically, the maximum domain of attraction (MDA) of power law distributed variables. More generally, extreme value theory allows a rigorous approach to dealing with extremes and the extrapolation past the sample maximum. We present some results on the "hidden mean", as it relates to a variety of fallacies in the risk management literature.

Let X_1, \ldots, X_n be independent Pareto-distributed random variables with CDF F(.).

Exposition chapter with some research.

Lucretius in
De Rerum Natura:

Scilicet et fluvius qui visus maximus ei,
Qui non ante aliquem majorem vidit; et ingens
Arbor, homoque videtur, et omnia de genere omni
Maxima quae vidit quisque, haec ingentia fingit.

Figure . : The Roman philosophical poet Lucretius.
.1 a preliminary introduction to EVT

We can get an exact distribution of the max (or minimum). The CDF of the maximum of the n variables will be

P(X_{max} \leq x) = P(X_1 \leq x, \ldots, X_n \leq x) = P(X_1 \leq x) \cdots P(X_n \leq x) = F(x)^n,  ( . )

that is, the probability of all n values falling at or below x. The PDF is the first derivative: \psi(x) = \frac{\partial F(x)^n}{\partial x}.

The extreme value distribution concerns that of the maximum r.v., when x \to x^*, where x^* = \sup\{x : F(x) < 1\} (the right "endpoint" of the distribution) is in the maximum domain of attraction, MDA [ ]. In other words, \max(X_1, \ldots, X_n) \overset{P}{\to} x^*, where \overset{P}{\to} denotes convergence in probability. The central question becomes: what is the distribution of x^*? We said that we have the exact distribution, so as engineers we could be satisfied with the PDF from Eq. . . As a matter of fact, we could get all test statistics from there, provided we have patience, computer power, and the will to investigate – it is the only way to deal with preasymptotics, that is, "what happens when n is small enough so x is not quite x^*".

But it is quite useful for general statistical work to understand the general asymptotic structure. The Fisher-Tippett-Gnedenko theorem (Embrechts et al. [ ], de Haan and Ferreira [ ]) states the following. If there exist sequences of "norming" constants a_n > 0 and b_n \in \mathbb{R} such that

P\left( \frac{M_n - b_n}{a_n} \leq x \right) \to G(x) \quad \text{as } n \to \infty,  ( . )

then

G(x) \propto \exp\left( -\left(1 + \xi x\right)^{-1/\xi} \right),

where ξ is the extreme value index, which governs the tail behavior of the distribution. G is called the (generalized) extreme value distribution, GEV. The subfamilies defined by ξ = 0, ξ > 0, and ξ < 0 are as follows.

Gumbel distribution (Type I) Here ξ = 0; rather, \lim_{\xi \to 0} \exp\left( -(1 + \xi x)^{-1/\xi} \right):

G(x) = \exp\left( -\exp\left( -\frac{x - b_n}{a_n} \right) \right) \quad \text{for } x \in \mathbb{R},

when the distribution of M_n has an exponential tail.
Fréchet distribution (Type II) Here ξ = 1/α:

G(x) = \begin{cases} 0 & x \leq b_n \\[4pt] \exp\left( -\left( \frac{x - b_n}{a_n} \right)^{-\alpha} \right) & x > b_n, \end{cases}

when the distribution of M_n has a power law right tail, as we saw earlier. Note that α > 0.

Weibull distribution (Type III) Here ξ = −1/α:

G(x) = \begin{cases} \exp\left( -\left( -\frac{x - b_n}{a_n} \right)^{\alpha} \right) & x < b_n \\[4pt] 1 & x \geq b_n, \end{cases}

when the distribution of M_n has finite support on the right (i.e., a bounded maximum). Note here again that α > 0.

. . How Any Power Law Tail Leads to Fréchet
Figure . : Shows the ratio of the CDF of the exact distribution over that of a Fréchet. We can visualize the acceptable level of approximation and see how x reaches the maximum domain of attraction, MDA. Here α = 2, L = 1. We note that the ratio for the PDF shows the same picture, unlike the Gaussian, as we will see further down.

Let us proceed now like engineers rather than mathematicians, and consider two existing distributions, the Pareto and the Fréchet, and see how one can be made to converge to the other – in other words, rederive the Fréchet from the asymptotic properties of power laws. The reasoning we will follow next can be generalized to any Pareto-tailed variable considered above the point where the slowly varying function satisfactorily approximates a constant – the "Karamata point".

The CDF of the Pareto with minimum value (and scale) L and tail exponent α:

F(x) = 1 - \left( \frac{L}{x} \right)^{\alpha},

so the PDF of the maximum of n observations:

\psi(x) = \frac{\alpha n}{x} \left( \frac{L}{x} \right)^{\alpha} \left( 1 - \left( \frac{L}{x} \right)^{\alpha} \right)^{n-1}.  ( . )

The PDF of the Fréchet:

\varphi(x) = \alpha \beta^{\alpha} x^{-\alpha - 1}\, e^{-\beta^{\alpha} x^{-\alpha}}.  ( . )

Let us now look for x "very large" where the two functions equate, or \psi(x^*) \to \varphi(x^*):

\lim_{x \to \infty} \frac{\psi(x)}{\varphi(x)} = \frac{n L^{\alpha}}{\beta^{\alpha}}.  ( . )

Accordingly, for x deemed "large", we can use \beta = L\, n^{1/\alpha}. Equation . shows us how the tail α is conserved across transformations of distribution:

Property
The tail exponent of the maximum of i.i.d. random variables is the same as that of the random variables themselves.
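The tail matching above can be checked directly: deep in the tail, the exact CDF of the maximum, F(x)^n, and the Fréchet with scale β = L·n^{1/α} nearly coincide, while near the scale L they do not (the preasymptotics). A minimal sketch, with parameters of our choosing:

```python
# Exact CDF of the maximum of n Pareto(L, alpha) variables versus the
# Fréchet approximation with scale beta = L * n^(1/alpha).
import math

def pareto_max_cdf(x, n, alpha, L=1.0):
    return (1.0 - (L / x) ** alpha) ** n            # P(X_max <= x) = F(x)^n

def frechet_cdf(x, n, alpha, L=1.0):
    beta = L * n ** (1.0 / alpha)                   # norming scale from the tail matching
    return math.exp(-((beta / x) ** alpha))         # = exp(-n L^alpha x^(-alpha))
```

With α = 2, L = 1, n = 50, the two agree to a few parts in 10⁵ at x = 30, but disagree by orders of magnitude (in relative terms) at x = 1.5.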
Now, where in practice the approximation holds is shown in Figure . .

Property
We get an exact asymptotic fit for power law extrema.

. . Gaussian Case
The Fréchet case is quite simple – power laws are usually simpler analytically, and we can get limiting parametrizations. For the Gaussian and other distributions, more involved derivations and approximations are required to fit the norming constants a_n and b_n, usually entailing quantile functions. The seminal paper by Fisher and Tippett [ ] warns us that "from the normal distribution, the limiting distribution is approached with extreme slowness" (cited by Gasull et al. [ ]). In what follows we look for norming constants for a Gaussian, based on [ ] and later developments.
Figure . : The behavior of the Gaussian; it is hard to get a good parametrization, unlike with power laws. The y axis shows the ratio of the CDF of the exact maximum distribution for n variables over that of the parametrized EVT.
Figure . : The same as Figure . but using the PDF. It is not possible to obtain a good approximation in the tails.

Consider M_n = a_n x + b_n in Eq. . . We assume that M_n follows the extreme value distribution (the CDF is e^{-e^{-x}}, the mirror of the Gumbel distribution for minima, obtained by transforming the distribution of −M_n, where \frac{M_n - b_n}{a_n} follows a Gumbel with CDF 1 - e^{-e^{x}}). The parametrized CDF for M_n is e^{-e^{-\frac{x - b_n}{a_n}}}. An easy shortcut comes from the following approximation: a_n = \frac{b_n}{b_n^2 + 1} and

(Footnote: The convention we follow considers the Gumbel for minima only, with the properly parametrized EVT for the maxima.)

(Footnote: Embrechts et al. [ ] propose a_n = \frac{1}{\sqrt{2\log(n)}}, b_n = \sqrt{2\log(n)} - \frac{\log(\log(n)) + \log(4\pi)}{2\sqrt{2\log(n)}}, the second term for b_n only needed for large values of n. The approximation is of order \frac{1}{\sqrt{\log(n)}}.)
Property For tail risk and properties, it is vastly preferable to work with the exact distribu-tion for the Gaussian, namely for n variables, we have the exact distribution of themaximum from the CDF of the Standard Gaussian F ( g ) : ¶ F ( g ) ( K ) ¶ K = e (cid:0) K (cid:0) n n erfc ( (cid:0) K p ) n (cid:0) p p , ( . ) where erfc is the complementary error function. . . The Picklands-Balkema-de Haan Theorem
The conditional excess distribution function is the equivalent, in density, of the "Lindy" conditional expectation of excess deviation [ , ] – we will make use of it in Chapter . Consider an unknown distribution function F of a random variable X; we are interested in estimating the conditional distribution function F_u of the variable X above a certain threshold u, defined as

F_u(y) = P(X - u \leq y \mid X > u) = \frac{F(u + y) - F(u)}{1 - F(u)}  ( . )

for 0 \leq y \leq x^* - u, where x^* is the finite or infinite right endpoint of the underlying distribution F. Then there exists a measurable function σ(u) such that

\lim_{u \to x^*} \sup_{0 \leq x < x^* - u} \left| F_u(x) - G_{\xi, \sigma(u)}(x) \right| = 0,  ( . )

and vice versa, where G_{\xi, \sigma(u)}(x) is the generalized Pareto distribution (GPD):

G_{\xi, \sigma}(x) = \begin{cases} 1 - \left( 1 + \frac{\xi x}{\sigma} \right)^{-1/\xi} & \xi \neq 0 \\[4pt] 1 - \exp\left( -\frac{x}{\sigma} \right) & \xi = 0 \end{cases}  ( . )

If ξ > 0, G is a Pareto distribution. If ξ = 0, G (as we saw above) is an exponential distribution. If ξ = −1, G is uniform. The theorem allows us to do some data inference by isolating exceedances. More on it in our discussion of wars and trends of violence in Chapter .

.2 the invisible tail for a power law

Consider K_n, the maximum of a sample of n independent identically distributed variables in the power law class; K_n = \max(X_1, X_2, \ldots, X_n). Let \varphi(.) be the density of the underlying distribution. We can decompose the moments in two parts, with the "hidden" moment above K, as shown in Fig. . :

\mu_{K, p} = \int_K^\infty x^p \varphi(x)\, dx
Figure : The p-th moment above K.

$$E\left(X^{p}\right)=\underbrace{\int_{L}^{K_{n}}x^{p}\,\varphi(x)\,dx}_{m_{p}}+\underbrace{\int_{K_{n}}^{\infty}x^{p}\,\varphi(x)\,dx}_{m_{K,p}},$$

where m_p is the visible part of the distribution and m_{K,p} the hidden one. We can also consider using φ_e as the empirical distribution, by normalizing. Since

$$\underbrace{\left(\int_{L}^{K_{n}}\varphi_{e}(x)\,dx-\int_{K_{n}}^{\infty}\varphi(x)\,dx\right)}_{\text{Corrected}}+\int_{K_{n}}^{\infty}\varphi(x)\,dx=1,$$

we can use the Radon–Nikodym derivative:

$$E\left(X^{p}\right)=\int_{L}^{K_{n}}x^{p}\,\frac{\partial m(x)}{\partial m_{e}(x)}\,\varphi_{e}(x)\,dx+\int_{K_{n}}^{\infty}x^{p}\,\varphi(x)\,dx.$$
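Before moving to the distribution of the hidden moment, the Pickands–Balkema–de Haan approximation can be checked numerically: for an exact Pareto tail with exponent α, the GPD holds with ξ = 1/α and σ(u) = u/α. A minimal pure-Python sketch (the parameter choices are illustrative, not from the text):

```python
import random

random.seed(7)
alpha, u, n = 2.0, 5.0, 200_000

# Pareto with survival P(X > x) = x^(-alpha), x >= 1, via inverse transform
xs = [(1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n)]

# exceedances above the threshold u
excess = [x - u for x in xs if x > u]

def gpd_survival(y, xi, sigma):
    # survival function of the generalized Pareto, xi != 0 branch
    return (1.0 + xi * y / sigma) ** (-1.0 / xi)

# for an exact Pareto tail: xi = 1/alpha, sigma(u) = u/alpha
xi, sigma = 1.0 / alpha, u / alpha
m = len(excess)
worst_gap = max(
    abs(sum(1 for e in excess if e > y) / m - gpd_survival(y, xi, sigma))
    for y in (1.0, 5.0, 20.0)
)
print(m, round(worst_gap, 3))
```

The empirical survival of the exceedances should sit within Monte Carlo error of the GPD survival at every probe point.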
Figure : Proportion of the hidden mean in relation to the total mean, for different parametrizations of the tail exponent α.

Figure : Proportion of the hidden mean in relation to the total mean, for different sample sizes n.
Proposition
Let K* be the point beyond which the survival function of the random variable X can be satisfactorily approximated by a power, that is P(X > x) ≈ L^α x^{−α}. Under the assumption that K > K*, the distribution of the hidden moment m_{K,p} for n observations has density g_{n,p,α}(.):

$$g_{n,p,\alpha}(z)=n\,L^{\frac{\alpha p}{p-\alpha}}\left(\frac{z\,(\alpha-p)}{\alpha}\right)^{\frac{p}{\alpha-p}}\exp\left(-n\,L^{\frac{\alpha p}{p-\alpha}}\left(\frac{z\,(\alpha-p)}{\alpha}\right)^{\frac{\alpha}{\alpha-p}}\right)$$

for z ≥ 0, α > p, and L > 0.

The expectation of the p-th moment above K, with K > L > 0:

$$E\left(m_{K,p}\right)=\frac{\alpha\left(L^{p}-L^{\alpha}K^{p-\alpha}\right)}{\alpha-p}.$$

We note that the distribution of the sample survival function (that is, p = 0) is an exponential distribution with PDF

$$g_{n,0,\alpha}(z)=n\,e^{-nz},$$

which, as we can see, depends only on n: the exceedance probability for an empirical distribution does not depend on the fatness of the tails.

To get the mean, we just need the integral with a stochastic lower bound K > K_min:

$$\int_{K_{\min}}^{\infty}\underbrace{\int_{K}^{\infty}x^{p}\,\varphi(x)\,dx}_{m_{K,p}}\,f_{K}(K)\,dK.$$

For the full distribution g_{n,p,α}(z), let us decompose the mean of a Pareto with scale L, so K_min = L. By a standard change of variable, K follows a Fréchet distribution F(α, L n^{1/α}) with PDF

$$f_{K}(K)=\alpha\,n\,K^{-\alpha-1}\,L^{\alpha}\,e^{-n\left(\frac{L}{K}\right)^{\alpha}},$$

and we get the required result.
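The p = 0 result above — that the exceedance mass above the sample maximum is distributed independently of the tail exponent, with mean ≈ 1/n — lends itself to a quick Monte Carlo check (a sketch; for a continuous distribution the exact expectation is 1/(n+1)):

```python
import random

random.seed(42)

def hidden_tail_prob(alpha, n):
    # survival mass above the maximum of n Pareto(alpha, L=1) draws
    m = max((1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n))
    return m ** (-alpha)  # P(X > m) = m^(-alpha)

def mean_hidden(alpha, n, trials=20_000):
    return sum(hidden_tail_prob(alpha, n) for _ in range(trials)) / trials

n = 50
h_fat, h_mild = mean_hidden(1.2, n), mean_hidden(3.0, n)
print(round(h_fat, 4), round(h_mild, 4))
```

Both averages should land near 1/(n+1) ≈ 0.0196 despite the very different tail exponents.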
Figure : Proportion of the hidden mean in relation to the standard deviation, for different values of n (hidden tail for the Gaussian).

Comparison with the Normal Distribution
For a Gaussian with PDF φ^(g)(.), the moment above K is

$$m^{(g)}_{K,p}=\int_{K}^{\infty}x^{p}\,\varphi^{(g)}(x)\,dx=\frac{2^{\frac{p}{2}-1}\,\Gamma\!\left(\frac{p+1}{2},\frac{K^{2}}{2}\right)}{\sqrt{\pi}}.$$

As we saw earlier, without going through the Gumbel limit (rather, EVT or the "mirror-Gumbel"), it is preferable to use the exact distribution of the maximum from the CDF of the standard Gaussian F_(g):

$$\frac{\partial F_{(g)}^{n}(K)}{\partial K}=\frac{2^{\frac{1}{2}-n}\,n\,e^{-\frac{K^{2}}{2}}\,\operatorname{erfc}\!\left(-\frac{K}{\sqrt{2}}\right)^{n-1}}{\sqrt{\pi}},$$

where erfc is the complementary error function. For p = 0, the expectation of the "invisible tail" is ≈ 1/n:

$$\int_{0}^{\infty}2^{-n-\frac{1}{2}}\,n\,e^{-\frac{K^{2}}{2}}\,\Gamma\!\left(\frac{1}{2},\frac{K^{2}}{2}\right)\frac{\left(\operatorname{erf}\!\left(\frac{K}{\sqrt{2}}\right)+1\right)^{n-1}}{\pi}\,dK=\frac{1-(n+2)\,2^{-n-1}}{n+1}.$$

Figure : The base rate fallacy, revisited –or, rather, in the other direction. The "base rate" is an empirical evaluation that bases itself on the worst past observations, an error identified in [ ] as the fallacy noted by the Roman poet Lucretius in De rerum natura: thinking the tallest future mountain equals the tallest one has previously seen. Quoted without permission after warning the author.
Appendix: The Empirical Distribution Is Not Empirical

There is a prevalent confusion about the nonparametric empirical distribution, based on the following powerful property: as n grows, the errors around the empirical histogram for cumulative frequencies are Gaussian regardless of the base distribution, even if the true distribution is fat-tailed (assuming infinite support). This is because the probability integral transform of the CDF (or of the survival function) is uniform on [0, 1] and, further, by the Donsker theorem, the sequence √n (F_n(x) − F(x)) (where F_n is the observed CDF or survival function for n summands and F the true CDF or survival function) converges in distribution to a Normal distribution with mean 0 and variance F(x)(1 − F(x)) (one may find even stronger forms of convergence via the Glivenko–Cantelli theorem).

Owing to this remarkable property, one may mistakenly assume that the effect of the tails of the distribution converges in the same manner independently of the distribution. Further, and what contributes to the confusion, the variance F(x)(1 − F(x)), for both the empirical CDF and the survival function, drops at the extremes –though its corresponding payoff does not. In truth, and that is a property of extremes, the error effectively increases in the tails if one multiplies by the deviation that corresponds to the probability.

For the U.S. stock market indices, while the first method is deemed ludicrous, using the second method leads to an underestimation of the payoff in the tails of between 5 and 70 times, as can be seen in Figure . The topic is revisited in Chapter with our discussion of the difference between binary and continuous payoffs, and the conflation between probability and real world payoffs when said payoffs are from a fat tailed distribution.

Figure : The relative value of a tail CVaR-style measure compared to that from the (smoothed) empirical distribution.
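The asymmetry just described can be sketched under an assumed Pareto with α = 1.2: the empirical tail probability behaves well (its errors are small and roughly Gaussian), while the typical (median) sample estimate of the payoff-weighted tail ∫_K^∞ x φ(x) dx sits below its true value, since the mean is delivered by rare, often unobserved deviations. All parameters are illustrative:

```python
import random, statistics

random.seed(1)
alpha, n, K, trials = 1.2, 1_000, 10.0, 2_000

true_prob = K ** (-alpha)                               # P(X > K)
true_partial = alpha * K ** (1 - alpha) / (alpha - 1)   # E[X * 1{X > K}]

prob_errs, partial_ests = [], []
for _ in range(trials):
    xs = [(1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n)]
    tail = [x for x in xs if x > K]
    prob_errs.append(abs(len(tail) / n - true_prob) / true_prob)
    partial_ests.append(sum(tail) / n)

typical_ratio = statistics.median(partial_ests) / true_partial
print(round(statistics.median(prob_errs), 3), round(typical_ratio, 3))
```

The median relative error on the probability stays modest, while the median payoff-weighted estimate falls short of the true partial moment.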
The deep tail is underestimated manyfold by current methods, even those deemed "empirical".

GROWTH RATE AND OUTCOME ARE NOT IN THE SAME DISTRIBUTION CLASS

The author and Pasquale Cirillo showed that fatalities from pandemics follow power laws with a tail exponent patently lower than 1. This means that all the information resides in the tail. So unless one has some real reason to ignore general and unconditional statistics (of the style "this one is different"), one should not base risk management decisions on the behavior of the expected average or some point estimate.

The following paradox arose: X_t, the number of fatalities between periods t_0 and t, is Paretian with undefined mean; however, its exponential growth rate is not! The growth rate is going to be thin-tailed –exponentially distributed or so.

Cirillo and Taleb [ ] (CT) showed via extreme value theory that pandemics have a tail exponent α < 1. Consider X_T, the number of fatalities at some date T in the future, with survival function P(X > x) = L(x) x^{−α}. Assume, to simplify, that with a minimum value L, L(x) ~ L^α, so we get the survival function

$$P(X>x)=L^{\alpha}\,x^{-\alpha}. \tag{B. }$$

B.1 The Puzzle

Consider the usual model

$$X_{t}=X_{t_{0}}\,e^{r\,(t-t_{0})}, \tag{B. }$$

where

$$r=\frac{1}{t-t_{0}}\int_{t_{0}}^{t}r_{s}\,ds \tag{B. }$$

and r_s is the instantaneous rate. Normalize the distribution to L = 1. We can thus prove the following (under the assumption above that X_t has the survival function in Eq. B. ):

Figure B. : Above, a histogram of realizations of r from an exponential distribution; below, the histogram of X = e^r. We can see the difference between the two distributions: the sample kurtosis of the second is far larger (in fact it is theoretically infinite), and all its values are dominated by a single large deviation.
Theorem
If r has support in (−∞, ∞), then its PDF φ for the scaled rate r̄ = r (t − t_0) can be parametrized as

$$\varphi(\bar r)=\begin{cases}\dfrac{e^{-\bar r/b}}{2b} & \bar r\ge 0,\\[6pt] \dfrac{e^{\bar r/b}}{2b} & \text{otherwise,}\end{cases}\qquad b=\frac{1}{\alpha}.$$

If r has support in (0, ∞), then its PDF is

$$\varphi(\bar r)=\begin{cases}\alpha\,e^{-\alpha\bar r} & \bar r\ge 0,\\ 0 & \text{otherwise.}\end{cases}$$

What we have here are versions of the exponential or double-exponential (Laplace) distribution.

Figure B. : We take the largest pandemics and randomly subselect half. We normalize the data by today's population. The Paretian properties (and parametrization) are robust to these perturbations. EVT provides a slightly higher tail exponent, but firmly below one. This is about the lowest tail exponent the authors have ever seen in their careers.

Remark
Implication: one cannot naively translate properties between the rate of growth r and the outcome X_T, because errors in r could be small (but nonzero) yet explosive in translation, owing to the exponentiation.

The reverse is also true: if r follows an exponential distribution, then X_T must be Pareto distributed as in Eq. B. . The sketch of the derivation is as follows, via change of variables. Let r follow a distribution with density φ and support (a, b); under some standard conditions, u = g(r) follows a new distribution with density

$$\psi(u)=\frac{\varphi\!\left(g^{-1}(u)\right)}{g'\!\left(g^{-1}(u)\right)}$$

and support [g(a), g(b)].

B.2 Pandemics Are Really Fat Tailed

Figure B. shows how we get a power law with a low α no matter what random subsample of the data we select. We used extreme value theory in [ ]; the graphs show the preliminary analysis (not in the paper). This is the lowest tail exponent we have ever seen anywhere.
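The theorem can be illustrated by simulation: if X is Pareto with exponent α (here an assumed α = 0.7, chosen only to mimic a tail exponent below 1), then r = log X is exponential with rate α, hence thin-tailed:

```python
import math, random

random.seed(3)
alpha, n = 0.7, 100_000   # tail exponent below 1, as for pandemic fatalities

# if P(X > x) = x^(-alpha) for x >= 1, then r = log X is exponential(alpha)
xs = [(1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n)]
rs = [math.log(x) for x in xs]

mean_r = sum(rs) / n   # exponential mean is 1/alpha
surv_gap = max(
    abs(sum(1 for r in rs if r > q) / n - math.exp(-alpha * q))
    for q in (1.0, 3.0, 6.0)
)
print(round(mean_r, 2), round(surv_gap, 4))
```

X itself has an undefined mean, yet its log has all moments: the two are not in the same distribution class.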
The implication is that epidemiology studies can be used for research, but policy making must be done using EVT or simply by relying on precautionary principles –that is, cut the cancer when it is cheap to do so. A gross error is the reliance on single point forecasts for policy; in fact, as we show in Chapter , it is always wrong to use the forecast of the survival function –to gauge forecasting ability, thinking it "how science is done"– outside binary bets.

THE LARGE DEVIATION PRINCIPLE, IN BRIEF

Let us return to the Cramér bound with a rapid exposition of the surrounding literature. The idea behind the tall vs. rich outliers in Chapter is that, under some conditions, tail probabilities decay exponentially. Such a property is central in risk management: as we mentioned earlier, the catastrophe principle explains that for diversification to be effective, such exponential decay is necessary.

The large deviation principle helps us understand such tail behavior. It also helps us figure out why things do not blow up under thin-tailedness –but, more significantly, why they could under fat tails, or where the Cramér condition is not satisfied [ ].

Let M_N be the mean of a sequence of realizations (identically distributed) of N random variables. For large N, consider the tail probability

$$P(M_{N}>x)\approx e^{-N\,I(x)},$$

where I(.) is the Cramér (or rate) function (Varadhan [ ], Dembo and Zeitouni [ ]). If we know the distribution of X, then, by Legendre transformation,

$$I(x)=\sup_{\theta>0}\left(\theta x-\lambda(\theta)\right),$$

where λ(θ) = log E(e^{θX}) is the cumulant generating function. The behavior of the function θ(x) informs us on the contribution of a single event to the overall payoff. (It connects us to the Cramér condition, which requires the existence of exponential moments.)

A special case for Bernoulli variables is the Chernoff bound, which provides tight bounds for that class of discrete variables.
Simple Case: The Chernoff Bound

A binary payoff is subject to very tight bounds. Let (X_i)_{0<i≤n} be a sequence of independent Bernoulli trials taking values in {0, 1}, with P(X = 1) = p and P(X = 0) = 1 − p. Consider the sum S_n = Σ_{0<i≤n} X_i, with expectation E(S_n) = np = μ. Taking δ as a "distance from the mean", the Chernoff bounds give:

For any δ > 0,

$$P\left(S\ge(1+\delta)\mu\right)\le\left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu},$$

and for 0 < δ ≤ 1,

$$P\left(S\ge(1+\delta)\mu\right)\le e^{-\frac{\mu\delta^{2}}{3}}.$$

Let us compute the probability of coin flips having 20% more heads than the true mean, with p = 1/2 and μ = n/2:

$$P\left(S\ge\tfrac{6}{10}\,n\right)\le e^{-\frac{\mu\delta^{2}}{3}}=e^{-\frac{n}{150}},$$

which for n = 1000 gives a probability bounded by about 1.3 × 10⁻³.
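The two bounds can be computed directly, and a crude simulation confirms that they indeed dominate the (much smaller) true exceedance probability; a sketch with the coin-flip parameters above:

```python
import math, random

random.seed(5)
n, p, delta = 1_000, 0.5, 0.2
mu = n * p

# the two classical Chernoff bounds on P(S >= (1 + delta) * mu)
tight = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
loose = math.exp(-mu * delta ** 2 / 3)

# crude Monte Carlo estimate of the true exceedance probability
trials = 5_000
hits = sum(
    1 for _ in range(trials)
    if sum(1 for _ in range(n) if random.random() < p) >= (1 + delta) * mu
)
mc = hits / trials
print(f"{tight:.2e} {loose:.2e} {mc:.2e}")
```

The first bound is markedly tighter than the second, and both are loose relative to the true probability (about 2 × 10⁻¹⁰ here), which is the price of their generality.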
Proof. The Markov bound gives P(X ≥ c) ≤ E(X)/c, and allows us to substitute X with a positive function g(x), hence

$$P\left(g(X)\ge g(c)\right)\le\frac{E\left(g(X)\right)}{g(c)}.$$

We will use this property in what follows, with g(X) = e^{ωX}. Now consider (1 + δ), with δ > 0, as a "distance from the mean"; hence, with ω > 0,

$$P\left(S_{n}\ge(1+\delta)\mu\right)=P\left(e^{\omega S_{n}}\ge e^{\omega(1+\delta)\mu}\right)\le e^{-\omega(1+\delta)\mu}\,E\left(e^{\omega S_{n}}\right). \tag{C. }$$

Now E(e^{ωS_n}) = E(e^{ω Σ X_i}), which by independence becomes (E(e^{ωX}))^n. We have E(e^{ωX}) = 1 − p + p e^ω. Since 1 + x ≤ e^x,

$$E\left(e^{\omega S_{n}}\right)\le e^{\mu\left(e^{\omega}-1\right)}.$$

Substituting in Eq. C. , we get

$$P\left(e^{\omega S_{n}}\ge e^{\omega(1+\delta)\mu}\right)\le e^{-\omega(1+\delta)\mu}\,e^{\mu\left(e^{\omega}-1\right)}. \tag{C. }$$

We tighten the bound by picking the value of ω that minimizes the right-hand side:

$$\omega^{*}=\left\{\omega:\frac{\partial\,e^{\mu\left(e^{\omega}-1\right)-(\delta+1)\mu\omega}}{\partial\omega}=0\right\},$$

which yields ω* = log(1 + δ). This recovers the bound

$$e^{\delta\mu}\,(\delta+1)^{-(\delta+1)\mu}.$$

An extension of Chernoff bounds was made by Hoeffding [ ], who broadened them to bounded independent random variables, not necessarily Bernoulli.
CALIBRATING UNDER PARETIANITY
Figure D. : The great Benoit Mandelbrot linked fractal geometry to statistical distributions via self-affinity at all scales. When asked to explain his work, he said "rugosité", meaning "roughness" –it took him fifty years to realize that roughness was his specialty. (Seahorse created by Wolfgang Beyer, Wikipedia Commons.)
We start with a refresher:
Definition D. (Power Law Class P)
The r.v. X ∈ ℝ belongs to P, the class of power laws (a.k.a. Paretian-tailed or power law-tailed), if its survival function (for the variable taken in absolute value) decays asymptotically at a fixed exponent α, or α′, that is

$$P(X>x)=L(x)\,x^{-\alpha}\quad\text{(right tail)} \tag{D. }$$

or

$$P(-X>x)=L(x)\,x^{-\alpha'}\quad\text{(left tail)}, \tag{D. }$$

where α, α′ > 0 and L : (0, ∞) → (0, ∞) is a slowly varying function, defined by lim_{x→∞} L(kx)/L(x) = 1 for all k > 0.

The happy result is that the parameter α obeys an inverse gamma distribution that converges rapidly to a Gaussian and does not require a large n to get a good estimate. This is illustrated in Figure D.2, where we can see the difference in fit.
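A Monte Carlo sketch of this happy result, using the maximum-likelihood estimator α̂ = n / Σᵢ log xᵢ derived in the next section (sample sizes and the true α are illustrative):

```python
import math, random

random.seed(11)
alpha_true, n, trials = 1.5, 30, 20_000

def alpha_hat(sample):
    # maximum-likelihood tail exponent for a Pareto with minimum L = 1
    return len(sample) / sum(math.log(x) for x in sample)

raw, debiased = [], []
for _ in range(trials):
    sample = [(1.0 - random.random()) ** (-1.0 / alpha_true) for _ in range(n)]
    a = alpha_hat(sample)
    raw.append(a)
    debiased.append((n - 1) / n * a)   # removes the n/(n-1) multiplicative bias

mean_raw = sum(raw) / trials
mean_deb = sum(debiased) / trials
print(round(mean_raw, 3), round(mean_deb, 3))
```

Even with n = 30 the estimator is tightly concentrated, in sharp contrast with the sample mean of the same Pareto, which would require astronomically more data.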
Figure D. : Monte Carlo simulation comparing sample-mean estimators with the maximum-likelihood (tail-based) mean estimation for a Pareto distribution with α = 1.2 (yellow and blue respectively), for n = 100, 1000. We can see how the MLE tracks the distribution more reliably. We can also observe the bias, as the sample-mean methods underestimate the true mean in the presence of skewness in the data; they need more data to reach the same error rate.

As we saw, there is a problem with the so-called finite-variance power laws: finiteness of variance does not help, as we saw in Chapter .

D.1 Distribution of the Sample Tail Exponent

Consider the standard Pareto distribution for a random variable X with PDF

$$\varphi_{X}(x)=\alpha\,L^{\alpha}\,x^{-\alpha-1},\quad x>L. \tag{D. }$$

Assume L = 1 by scaling. The likelihood function is L = ∏_{i=1}^{n} α x_i^{−α−1}. Maximizing the log of the likelihood function (with the minimum value set),

$$\log(\mathcal{L})=n\left(\log(\alpha)+\alpha\log(L)\right)-(\alpha+1)\sum_{i=1}^{n}\log(x_{i}),$$

yields

$$\hat\alpha=\frac{n}{\sum_{i=1}^{n}\log(x_{i})}.$$

Now consider λ = Σ_{i=1}^{n} log(X_i)/n. Using the characteristic function to get the distribution of the average logarithm yields

$$\psi(t)^{n}=\left(\int_{1}^{\infty}\varphi(x)\exp\left(\frac{it\log(x)}{n}\right)dx\right)^{n}=\left(\frac{\alpha n}{\alpha n-it}\right)^{n},$$

which is the characteristic function of the gamma distribution with shape n and rate αn. A standard result is that α̂ ≜ 1/λ will follow the inverse gamma distribution with density

$$\varphi_{\hat\alpha}(a)=\frac{e^{-\frac{\alpha n}{a}}\left(\frac{\alpha n}{a}\right)^{n}}{a\,\Gamma(n)},\quad a>0.$$

Debiasing
Since E(α̂) = n α/(n − 1), we elect another –unbiased– random variable α̂′ = ((n − 1)/n) α̂ which, after scaling, has for distribution

$$\varphi_{\hat\alpha'}(a)=\frac{e^{-\frac{(n-1)\alpha}{a}}\left(\frac{(n-1)\alpha}{a}\right)^{n}}{a\,\Gamma(n)}.$$

Truncating for α̂′ ≥ ε, ε >
0. Our sampling now applies to lower-truncated values of the estimator, those strictly greater than a cut point ε >
0, that is, E(α̂ | α̂ > ε):

$$\varphi_{\hat\alpha''}(a)=\frac{\varphi_{\hat\alpha'}(a)}{\int_{\epsilon}^{\infty}\varphi_{\hat\alpha'}(a)\,da};$$

hence the distribution of the values of the exponent, conditional on it being greater than the cut point, becomes

$$\varphi_{\hat\alpha''}(a)=\frac{e^{-\frac{(n-1)\alpha}{a}}\left(\frac{(n-1)\alpha}{a}\right)^{n}}{a\left(\Gamma(n)-\Gamma\!\left(n,\frac{(n-1)\alpha}{\epsilon}\right)\right)},\quad a\ge\epsilon. \tag{D. }$$

So, as we can see in Figure D. , the "plug-in" mean via the tail α might be a good approach under one-tailed Paretianity.

"IT IS WHAT IT IS": DIAGNOSING THE SP500†

This is a diagnostics tour of the properties of the SP500 index over its history. We engage in a battery of tests and check what statistical picture emerges. Clearly, its returns are power law distributed (with some added complications, such as an asymmetry between upside and downside), which, again, invalidates common methods of analysis. We look, among other things, at:

• The behavior of kurtosis under aggregation (as we lengthen the observation window).
• The behavior of the conditional expectation E(X | X > K) for various values of K.
• The maximum-to-sum plot (MS plot).
• Drawdowns (that is, maximum excursions over a time window).
• Extremes and records, to see whether extremes are independent.

These diagnostics allow us to confirm that an entire class of analyses in L² fails under fat tails.

The Problem
As we said in the Prologue, switching from thin-tailed to fat-tailed is not just changing the color of the dress. The finance and economics rent seekers hold to the message "we know it is fat tailed", but then fail to grasp the consequences for many things, such as the slowness of the law of large numbers and the failure of sample means or higher moments to be sufficient statistics (as well as the ergodicity
(This is largely a graphical chapter, made to be read from the figures more than from the text, as the arguments largely rest on the absence of convergence in the graphs.)

effect, among others). Likewise, it leads to a bevy of uninformative analytics in the investment industry. Paretianity is clearly defined by the absence of some higher moment, exhibited by a lack of convergence under the LLN.

Figure : Visual identification of Paretianity on a standard log-log plot, with (absolute) returns on the horizontal axis and the survival function on the vertical axis. If one removes the data point corresponding to the crash of 1987, a lognormal would perhaps work, or some fat-tailed mixed distribution outside the power law class –for one can then see the survival function becoming vertical, indicative of an infinite asymptotic tail exponent. But, as the saying goes, all one needs is a single event...
Remark
Given that:
1) the regularly varying class has no higher moments than α; more precisely,
• if p > α, E(X^p) = ∞ if p is even or the distribution has one-tailed support, and
• E(X^p) is undefined if p is odd and the distribution has two-tailed support;
and
2) distributions outside the regularly varying class have all their moments: ∀p ∈ ℕ⁺, E(X^p) < ∞;
we have:

$$\exists\,p\in\mathbb{N}^{+}\ \text{s.t.}\ E(X^{p})\ \text{is either undefined or infinite}\ \Longrightarrow\ X\in\mathcal{P}.$$

Next we examine ways to detect "infinite" moments. Much confusion attends the notion of infinite moments and its identification, since by definition sample moments are finite and measurable under the counting measure. We will rely on the nonconvergence of moments. Let ‖X‖_p be the weighted p-norm

$$\|X\|_{p}\triangleq\left(\frac{1}{n}\sum_{i=1}^{n}|x_{i}|^{p}\right)^{1/p};$$

we have the property of power laws:

$$E(X^{p})\nless\infty\ \Longrightarrow\ \|x\|_{p}\ \text{is not convergent.}$$
Question
How does belonging to the class of power law tails (with α ≤ 4) cancel much of the methods in L²?

Figure . shows the distribution of the mean deviation of the second moment for a finite-variance power law. Simply, if the fourth moment does not exist, then the sample second moment has itself infinite variance, and we fall into the sampling problems seen before: just as, with a power law of α close to 1 (though slightly above it), the mean exists but will never be observed, so, with infinite higher moments, the observed second moment will fail to be informative, as it will almost never converge to its value. Convergence laws can help us exclude some classes of probability distributions.
Figure . : Visual convergence diagnostics for the kurtosis of the SP500 over past observations. We compute the kurtosis at different lags for the raw SP500 and for reshuffled data. While the 4th norm is not convergent for the raw data, it clearly is for the reshuffled series; we can thus attribute the extra "fat-tailedness" to the temporal structure of the data, particularly the clustering of its volatility. See Table . for the expected drop at speed 1/n for thin-tailed distributions.

Test : Kurtosis Under Aggregation

If kurtosis existed, it would end up converging to that of a Gaussian as one lengthens the time window. So we test the computation of returns over longer and longer lags, as we can see in Fig. . .

Result
The verdict, as shown in Figure . , is that the one-month kurtosis is not lower than the daily kurtosis and that, as we add data, no drop in kurtosis is observed; further, we would expect a drop ~ n⁻¹. This allows us to safely eliminate numerous classes, including stochastic volatility in its simple formulations, such as gamma variance. Next we will get into the technicals of the point and the strength of the evidence.

A typical misunderstanding is as follows. In a note, "What can Taleb learn from Markowitz" [ ], Jack L. Treynor, one of the founders of portfolio theory, defended the field with the argument that the data may be fat tailed "short term" but that, in something called the "long term", things become Gaussian. Sorry, it is not so.
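The logic of the test can be sketched on simulated data (an illustration, not the SP500 computation): a thin-tailed but kurtotic base distribution (Laplace, kurtosis 6) sees its kurtosis decay toward the Gaussian value of 3 under aggregation at roughly the 1/n speed, while a Student t with 3 degrees of freedom (theoretically infinite kurtosis) shows no such taming:

```python
import math, random, statistics

random.seed(9)

def kurtosis(xs):
    m = statistics.fmean(xs)
    m2 = statistics.fmean([(x - m) ** 2 for x in xs])
    m4 = statistics.fmean([(x - m) ** 4 for x in xs])
    return m4 / m2 ** 2

def aggregate(xs, lag):
    # non-overlapping sums over windows of length `lag`
    return [sum(xs[i:i + lag]) for i in range(0, len(xs) - lag + 1, lag)]

n = 120_000
thin = [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(n)]  # Laplace
fat = [random.gauss(0, 1) / math.sqrt(
          (random.gauss(0, 1) ** 2 + random.gauss(0, 1) ** 2
           + random.gauss(0, 1) ** 2) / 3)
       for _ in range(n)]                                                     # Student t, 3 dof

k_thin_1, k_thin_30 = kurtosis(thin), kurtosis(aggregate(thin, 30))
k_fat_1, k_fat_30 = kurtosis(fat), kurtosis(aggregate(fat, 30))
print(round(k_thin_1, 1), round(k_thin_30, 1), round(k_fat_1, 1), round(k_fat_30, 1))
```

The fat-tailed sample kurtosis is also unstable across reruns (dominated by the single largest observation), which is itself a symptom of the missing fourth moment.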
Figure . : MS plot (or "law of large numbers for p-th moments") for p = 4 for the SP500, compared to p = 4 for a Gaussian and for a stochastic volatility model with matching kurtosis, over the entire period; also the MS plot for p = 3 for the SP500. Convergence, if any, does not take place in any reasonable time. We can safely say that the 4th moment is infinite and the 3rd one is indeterminate.
3, the central limit theorem operates very slowly, requiring an n orders of magnitude beyond what we have in the history of markets to become acceptable. [ ]
Maximum Drawdowns

For a time series of an asset S taken over (t_0, t_0 + Δt, ..., t_0 + nΔt), we are interested in the behavior of

$$d(t_{0},\Delta t)=\min_{0\le i\le n}\left(\min_{i<j\le n}S_{t_{0}+j\Delta t}\;-\;S_{t_{0}+i\Delta t}\right). \tag{ . }$$

We can consider the relative drawdown by using the log of that minimum, as we do with returns. The window for the drawdown can be n = 5, 100, 252 days. As seen in Figure . , drawdowns are Paretian.

Figure . : The "Lindy test" (or Condexp), using the conditional expectation beyond K as K varies, as a test of scalability. As we move K, the measure should drop.

Figure . : The empirical distribution could conceivably fit a Lévy stable distribution with α = 1.62.
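A minimal implementation of the drawdown measure in the equation above (here applied to a hypothetical Gaussian random walk, purely to show the mechanics):

```python
import random

random.seed(2)

def max_drawdown(path):
    # most negative peak-to-trough excursion within the path
    peak, worst = path[0], 0.0
    for p in path:
        peak = max(peak, p)
        worst = min(worst, p - peak)
    return worst

# a hypothetical random-walk price path, for illustration only
path = [100.0]
for _ in range(252):
    path.append(path[-1] + random.gauss(0, 1))

dd = max_drawdown(path)
print(round(dd, 2))
```

Applied to log-prices, the same function yields the relative drawdown; computing it over rolling windows of 5, 100, or 252 days gives the samples whose Paretianity the figures examine.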
Figure . : The tails can even possibly fit an infinite-mean stable distribution with α = 1.
Empirical Kappa

From our kappa equation in Chapter :

$$\kappa(n_{0},n)=2-\frac{\log(n)-\log(n_{0})}{\log\left(\frac{M(n)}{M(n_{0})}\right)}. \tag{ . }$$

Figure . : SP500 squared returns. No GARCH(1,1) can produce such jaggedness, or what the great Benoit Mandelbrot called "rugosité".
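Taking M(n) to be the mean absolute deviation of the sum of n independent copies (per the kappa definition referenced above), the metric can be estimated by brute force; in this sketch the Gaussian should give κ ≈ 0 and a Pareto with α = 1.5 roughly κ ≈ 2 − α = 0.5 (all parameters illustrative):

```python
import math, random, statistics

random.seed(10)

def mad_of_sum(draw, n, trials=6_000):
    # M(n): mean absolute deviation of the sum of n independent draws
    sums = [sum(draw() for _ in range(n)) for _ in range(trials)]
    mu = statistics.fmean(sums)
    return statistics.fmean([abs(s - mu) for s in sums])

def kappa(draw, n0, n):
    return 2 - (math.log(n) - math.log(n0)) / math.log(
        mad_of_sum(draw, n) / mad_of_sum(draw, n0))

k_gauss = kappa(lambda: random.gauss(0, 1), 1, 30)
k_pareto = kappa(lambda: (1.0 - random.random()) ** (-1 / 1.5), 1, 30)
print(round(k_gauss, 2), round(k_pareto, 2))
```

On real data one bootstraps the returns instead of drawing from a parametric model, which is what the figure below does.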
Figure . : κ_n (positive and negative returns) estimated empirically,

with the shortcut κ_n = κ(1, n). We estimate it empirically via bootstrapping and can effectively see how it maps to that of a power law with a low tail exponent α.

Test : Excess Conditional Expectation

Result
The verdict from this test, as we can see in Figure . , is that the conditional expectation of X (and of −X), conditional on X being greater than some arbitrary value K, remains proportional to K.

Definition
Let K ∈ ℝ⁺; the relative excess conditional expectations are

$$\varphi^{+}_{K}\triangleq\frac{E(X\mid X>K)}{K},\qquad\varphi^{-}_{K}\triangleq\frac{E(-X\mid -X>K)}{K}.$$

Figure . : Drawdowns for windows from n = 5 to 252 days. Maximum drawdowns are the excursions mapped in Eq. . ; we use here the log of the minimum of S over a window of n days following a given S.
Figure . : Paretianity of drawdowns and scale (5-, 100-, and 252-day windows).
We have

$$\lim_{K\to\infty}\varphi_{K}=1$$

for distributions outside the power law basin, and

$$\lim_{K\to\infty}\varphi_{K}=\frac{\alpha}{\alpha-1}$$

for distributions satisfying Definition . Note van der Wijk's law [ ], [ ]. Figure . shows the following: the conditional expectation does not drop for large values, which is incompatible with non-Paretian distributions.

Figure . : Fitting a stable distribution to drawdowns.
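The two limits can be verified on simulated data: for an assumed Pareto with α = 3, φ_K stays near α/(α − 1) = 1.5 as K grows, while for an exponential it decays toward 1 (the exponential's φ_K = (K + 1)/K happens to also start at 1.5 for K = 2, making the subsequent divergence visible):

```python
import random

random.seed(8)
alpha, n = 3.0, 400_000

pareto = [(1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n)]  # alpha = 3
expo = [random.expovariate(1.0) for _ in range(n)]                      # thin-tailed

def phi(sample, K):
    # relative excess conditional expectation E(X | X > K) / K
    tail = [x for x in sample if x > K]
    return (sum(tail) / len(tail)) / K

p2, p8 = phi(pareto, 2.0), phi(pareto, 8.0)
e2, e8 = phi(expo, 2.0), phi(expo, 8.0)
print(round(p2, 2), round(p8, 2), round(e2, 2), round(e8, 2))
```

The Paretian φ_K is flat in K –the "Lindy" signature– whereas the thin-tailed one visibly drifts toward 1.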
Figure . : Correcting the empirical distribution function with a Fréchet for the SP500.

Test : Instability of the 4th Moment

A main argument in [ ] is that, over the decades of SP500 observations, a single one represents the bulk of the kurtosis. Similar effects are seen with other socioeconomic variables, such as gold, oil, silver, other stock markets, and soft commodities. Such sample dependence of the kurtosis means that the fourth moment does not have stability –that is, does not exist.

Test : MS Plot

An additional approach to detect whether E(X^p) exists consists in examining the convergence implied by the law of large numbers (or, rather, its absence), by looking at the behavior of higher moments in a given sample. One convenient approach is the maximum-to-sum plot, or MS plot, as shown in Figure . . The MS plot relies on a consequence of the law of large numbers [ ] when it comes to the maximum of a variable. For a sequence X_1, X_2, ..., X_n of nonnegative i.i.d. random variables, if E[X^p] < ∞ for p = 1, 2, 3, ..., then

$$R_{n}^{p}=\frac{M_{n}^{p}}{S_{n}^{p}}\ \xrightarrow{a.s.}\ 0\quad\text{as }n\to\infty,$$

where S_n^p = Σ_{i=1}^{n} X_i^p is the partial sum and M_n^p = max(X_1^p, ..., X_n^p) the partial maximum. (Note that we can take X to be the absolute value of the random variable, in case the r.v. can be negative, to allow the approach to apply to odd moments.)

We show, by comparison, the MS plot for a Gaussian and that for a Student t. We observe that the SP500 shows the typical characteristics of a steep power law: over the full sample it does not appear to drop to the point of allowing the law of large numbers to function.

Figure . : We separate positive and negative logarithmic returns and use overlapping cumulative returns. Clearly the negative returns appear to follow a power law, while the Paretianity of the right tail is more questionable.
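A minimal MS-plot computation (a sketch with assumed distributions: a Gaussian, whose fourth moment exists, against a Pareto with α = 1.2, whose second moment is infinite):

```python
import random

random.seed(4)

def ms_ratio(xs, p):
    # running max-to-sum ratio R_n = M_n^p / S_n^p
    s, m, out = 0.0, 0.0, []
    for x in xs:
        v = abs(x) ** p
        s += v
        m = max(m, v)
        out.append(m / s)
    return out

n = 50_000
thin = [random.gauss(0, 1) for _ in range(n)]
fat = [(1.0 - random.random()) ** (-1.0 / 1.2) for _ in range(n)]  # Pareto, alpha = 1.2

r_thin = ms_ratio(thin, 4)[-1]   # E(X^4) finite: ratio heads to 0
r_fat = ms_ratio(fat, 2)[-1]     # E(X^2) infinite: ratio does not vanish
print(round(r_thin, 4), round(r_fat, 3))
```

Plotting the full `ms_ratio` sequence against n reproduces the diagnostic: a decaying curve when the moment exists, a jumpy one pinned away from zero when it does not.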
"it is what it is": diagnosing the sp500 † - - - - Figure . : QQ Plot com-paring the Student T to theempirical distribution of theSP : the left tail fits, notthe right tail. . . Records and Extrema
The Gumbel record method is as follows (Embrechts et al. [ ]). Let X_1, X_2, ... be a discrete time series, with running maximum M_t = max(X_1, X_2, ..., X_t). The record counter N_t for t data points is

$$N_{t}=1+\sum_{k=2}^{t}\mathbb{1}_{X_{k}>M_{k-1}}. \tag{ . }$$

Regardless of the underlying distribution, the expectation E(N_t) is the harmonic number H_t, and the variance is H_t − H_t^{(2)}, where H_t^{(r)} = Σ_{i=1}^{t} 1/i^r. We note that the harmonic number is concave and very slow in growth –logarithmic– as it can be approximated by log(t) + γ, where γ is the Euler–Mascheroni constant. The approximation is such that 1/(2(t+1)) ≤ H_t − log(t) − γ ≤ 1/(2t) (Wolfram MathWorld [ ]).
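The record counter and its harmonic-number expectation can be checked in a few lines (any continuous i.i.d. series gives the same distribution of record counts, so uniform draws suffice; sizes are illustrative):

```python
import math, random

random.seed(6)

def count_records(xs):
    # number of running-maximum records in the series
    n_rec, running_max = 1, xs[0]
    for x in xs[1:]:
        if x > running_max:
            n_rec, running_max = n_rec + 1, x
    return n_rec

t, trials = 1_000, 4_000
mean_rec = sum(
    count_records([random.random() for _ in range(t)]) for _ in range(trials)
) / trials

harmonic = sum(1.0 / i for i in range(1, t + 1))  # E(N_t) = H_t
print(round(mean_rec, 2), round(harmonic, 2), round(math.log(t) + 0.5772, 2))
```

A series whose observed record count sits several standard deviations above H_t, as with the SP500 positive extremes, is inconsistent with independence.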
Figure . : The record test shows independence for extremes of negative returns, dependence for positive ones. The number of records for independent observations grows over time as the harmonic number H(t) (dashed line), approximately logarithmically, but here appears to grow several standard deviations faster for positive returns; hence we cannot assume independence for extremal gains. The test does not make claims about dependence outside the extremes.
Figure . : Running a shorter period, t = 1000 days, of overlapping observations for the records of maxima (top) and minima (bottom), compared to the expected harmonic number H(1000).

Remark
The Gumbel test of independence above is a sufficient condition for the convergence of extreme negative values of the log-returns of the SP500 to the maximum domain of attraction (MDA) of the extreme value distribution.
Entire series
We reshuffled the SP500 (i.e., bootstrapped without replacement, using a sample size equal to the original, over many repeats) and ran the record count across all of them. As shown in Figures . and . , the mean was well approximated by the harmonic number, with the corresponding standard deviation. The survival function S(.) at the observed N = 16 positive records is small enough to make the independence of positive extrema implausible. On the other hand, the negative extrema (9 counts) show realizations close to what is expected (10.3), diverging by less than a standard deviation from the expectation –enough to justify a failure to reject independence.

Subrecords
If, instead of taking the data as one block over the entire period, we break the period into subperiods –counting records N over windows (t + δ, t + Δ + δ), and accounting for the concavity of the measure and Jensen's inequality– we obtain T/δ observations. We took Δ = 10³ and δ = 10², thus getting 170 subperiods for the T ≈ 1.7 × 10⁴ days. The picture, as shown in Figure . , cannot reject independence for both positive and negative observations.

Figure . : The survival function of the records of positive maxima for the resampled SP500 (returns kept but reshuffled, thus removing the temporal structure), with the mass above the observed number of maxima records for the SP500 over the period.

Figure . : The CDF of the records of negative extrema for the resampled SP500, reshuffled as above, with the mass above the observed number of minima records over the period.
We can at least apply EVT methods to the negative observations.

Asymmetry of the Right and Left Tails
We note an asymmetry, as seen in Figure . , with the left tail considerably thicker than the right one. This may be a nightmare for modelers looking for some precise process, but not necessarily for people interested in risk and option trading.

This chapter allowed us to explore a simple point: returns on the SP500 index (which represents the bulk of U.S. stock market capitalization) are, simply, power law distributed –by Wittgenstein's ruler, it is irresponsible to model them in any other manner. Standard methods such as Modern Portfolio Theory (MPT) or "base rate crash" verbalisms (claims that people overestimate the probabilities of tail events) are totally bogus –we are talking of > 70,000 papers and entire cohorts of research, not counting about 10 papers in general economics with results depending on "variance" and "correlation". You need to live with the fact that these metrics are bogus. As the ancients used to say, dura lex sed lex, or, in more modern mafia terms: it is what it is.
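The record-counting test used in this chapter rests on a classical, distribution-free fact: for an i.i.d. (exchangeable) sequence of length n, the expected number of running maxima (records) is the harmonic number H_n = Σ_{k=1}^{n} 1/k. A minimal simulation sketch confirming it (not the author's code; sample sizes are illustrative):

```python
import random

def count_records(xs):
    """Count running maxima: x_i is a record if it exceeds all previous values."""
    best = float("-inf")
    records = 0
    for x in xs:
        if x > best:
            best = x
            records += 1
    return records

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the expected record count for i.i.d. data."""
    return sum(1.0 / k for k in range(1, n + 1))

random.seed(42)
n, trials = 1000, 2000
mean_records = sum(
    count_records([random.gauss(0, 1) for _ in range(n)]) for _ in range(trials)
) / trials

# For i.i.d. sequences the expectation is H_n (~ ln n + 0.577),
# regardless of the underlying distribution.
print(mean_records, harmonic(n))
```

Observing materially more records than H_n in the actual (unshuffled) series is what makes the independence of the positive extrema implausible.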
THE PROBLEM WITH ECONOMETRICS

There is something wrong with econometrics, as almost all papers don't replicate in the real world. Two reliability tests in Chapter , one about parametric methods, the other about robust statistics, show that there must be something rotten in econometric methods, fundamentally wrong, and that the methods are not dependable enough to be of use in anything remotely related to risky decisions. Practitioners keep spinning inconsistent ad hoc statements to explain failures. This is a brief nontechnical exposition of the results in [ ].

With economic variables, one single observation in , that is, one single day in years, can explain the bulk of the "kurtosis", the finite-moment standard measure of "fat tails", that is, both a measure of how much the distribution under consideration departs from the standard Gaussian and of the role of remote events in determining the total properties. For the U.S. stock market, a single day, the crash of , determined % of the kurtosis for the period between and . The same problem is found with interest and exchange rates, commodities, and other variables. Redoing the study at different periods with different variables shows a total instability of the kurtosis. The problem is not just that the data had "fat tails", something people knew but sort of wanted to forget; it was that we would never be able to determine "how fat" the tails were within standard methods. Never.

Macroeconomic variables, such as U.S. weekly jobless claims, have traditionally appeared to be tractable inside the (ugly and drab) buildings housing economics departments. They ended up breaking the models with a bang. Jobless claims experienced "unexpected" jumps with Covid (the coronavirus), described as "thirty standard deviations": the kurtosis (of the log changes) rose from 8 to > after a single observation in April .
Almost all in-sample higher moments are attributable to a single data point, and the higher the moment, the stronger the effect –hence one must accept that there are no higher moments, no informative lower moments, and that the variable must be power law distributed. Such a role for the tail cancels the entire history of macroeconomic modeling, as well as policies based on the conclusions of economists using Mediocristan-derived metrics. While economists in the citation-rings may not be aware of their fraudulent behavior, others are not missing the point. At the time of writing, people are starting to realize that the fatter the tails, the more policies should be based on the expected extrema, using extreme value theory (EVT), and the differences between Gaussian and power law models are even starker for the extremes.

Figure E. : Credit: Stefan Gasic
The implication is that those tools used in economics that are based on squaring variables (more technically, the L² norm), such as standard deviation, variance, correlation, regression –the kind of stuff you find in textbooks– are not valid scientifically (except in some rare cases where the variable is bounded). The so-called "p values" you find in studies have no meaning with economic and financial variables. Even the more sophisticated techniques of stochastic calculus used in mathematical finance do not work in economics except in selected pockets.

E.1 Performance of standard parametric risk estimators

The results of most papers in economics based on these standard statistical methods are thus not expected to replicate, and they effectively don't. Further, these tools invite foolish risk taking. Neither do alternative techniques yield reliable measures of rare events, except that we can tell if a remote event is underpriced, without assigning an exact value.

From [ ], using log returns, X_t ≜ log( P(t) / P(t − i∆t) ). Consider the n-sample maximum quartic observation Max( X⁴_{t−i∆t} )_{i=0}^{n}. Let Q(n) be the contribution of the maximum quartic variation over n samples at frequency ∆t:

Q(n) := Max( X⁴_{t−i∆t} )_{i=0}^{n} / Σ_{i=0}^{n} X⁴_{t−i∆t} .

Note that for our purposes, whether we use central or noncentral kurtosis makes no difference –results are nearly identical.

For a Gaussian (for which the quartic X⁴ is the square of a chi-square distributed variable), Q( ), the maximum contribution, should be around . ± . . Visibly, the observed distribution of the 4th moment has the property

P( X > max(x_i)_{i ≤ n} ) ≈ P( X > Σ_{i=1}^{n} x_i ).

Table E. : Maximum contribution to the fourth moment from a single daily observation
Security            Max Q   Years
Silver              0.94    46
SP500               0.79    56
Crude Oil           0.79    26
Short Sterling      0.75    17
Heating Oil         0.74    31
Nikkei              0.72    23
FTSE                0.54    25
JGB                 0.48    24
Eurodollar Depo 1M  0.31    19
Sugar #11           0.30    48
Yen                 0.27    38
Bovespa             0.27    16
Eurodollar Depo 3M  0.25    28
CT                  0.25    48
DAX                 0.20    18

Recall that, naively, the fourth moment expresses the stability of the second moment, and the second moment expresses the stability of the measure across samples. Note that taking the snapshot at a different period would show extremes coming from other variables, while the variables showing high maxima for the kurtosis here would drop –a mere result of the instability of the measure across series and time.

Description of the dataset
All tradable macro markets data available as of August , with "tradable" meaning actual closing prices corresponding to transactions (stemming from markets, not bureaucratic evaluations; this includes interest rates, currencies, and equity indices).
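The instability of Q(n) is easy to reproduce. The sketch below (illustrative, not the author's code; names and sample sizes are invented) computes the maximum single-observation contribution to the fourth moment for a thin-tailed (Gaussian) sample and a fat-tailed one (Student t with 3 degrees of freedom, whose fourth moment is infinite): the Gaussian share is tiny, while for the Student t a single point routinely carries a share of the same order as those in the table above.

```python
import random

def max_quartic_share(xs):
    """Q(n): largest single x_i^4 divided by the sum of all x_i^4."""
    quartics = [x ** 4 for x in xs]
    return max(quartics) / sum(quartics)

def student_t(df, rng):
    """Student-t variate built as normal / sqrt(chi2/df)."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(df))
    return z / (chi2 / df) ** 0.5

rng = random.Random(7)
n = 10_000
gauss_q = max_quartic_share([rng.gauss(0, 1) for _ in range(n)])
fat_q = max_quartic_share([student_t(3, rng) for _ in range(n)])

# Gaussian: the largest of 10,000 points contributes on the order of 1% of the
# fourth moment; under t(3) one point typically dominates the whole sum.
print(gauss_q, fat_q)
```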
Figure E. : Max quartic across securities in Table E. .

Figure E. : Kurtosis across nonoverlapping periods for Eurodeposits (EuroDepo 3M: annual kurtosis, 1981– ).
Figure E. : Monthly delivered volatility in the SP500 (as measured by standard deviations). The only structure it seems to have comes from the fact that it is bounded at . This is standard.

Figure E. : Monthly volatility of volatility from the same dataset in Table E. , predictably unstable.

E.2 Performance of standard nonparametric risk estimators

Does the past resemble the future in the tails? The following tests are nonparametric, that is, entirely based on empirical probability distributions.
Figure E. : Comparing one absolute deviation M[t] and the subsequent one M[t+1] over a certain threshold (here % in stocks): concentration of tail events without predecessors and without successors –large deviations have no (or few) predecessors and no (or few) successors over the past years of data.

Figure E. : The "regular" is predictive of the regular, that is, mean deviation. Comparing one absolute deviation M[t] and the subsequent one M[t+1] for macroeconomic data.

So far we stayed in dimension 1. When we look at higher dimensional properties, such as covariance matrices, things get worse. We will return to the point with the treatment of model error in mean-variance optimization. When the x_t are now in R^N, the sensitivity to changes in the covariance matrix makes the empirically observed moments and conditional moments extremely unstable. Tail events for a vector are vastly more difficult to calibrate, and the difficulty increases with the dimensions.

The responses so far by members of the economics/econometrics establishment
No answer as to why they still use STD, regressions, GARCH , value-at-risk andsimilar methods.
Peso problem
Benoit Mandelbrot used to insist that one can fit anything with Poisson jumps. This is similar to the idea that one can always perfectly fit n data points with a polynomial of degree n − 1.

Figure E. : Correlations are also problematic, which follows from the instability of single variances and the effect of multiplication of the values of random variables. Under such stochasticity of correlations it makes no sense, no sense whatsoever, to use covariance-based methods such as portfolio theory.
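The correlation instability in the figure above can be reproduced in a few lines: generate fat-tailed bivariate data from a fixed process, split it into subsamples, and watch the sample correlation swing. A sketch (the generating process and parameters are illustrative, not from the text):

```python
import random

def sample_corr(xs, ys):
    """Plain Pearson sample correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

def fat_pair(rng):
    """Two series loading on a common Student-t(2) factor (infinite variance)."""
    z = rng.gauss(0, 1)
    chi2 = rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2
    t = z / (chi2 / 2) ** 0.5
    return t + rng.gauss(0, 1), t + rng.gauss(0, 1)

rng = random.Random(3)
pairs = [fat_pair(rng) for _ in range(5000)]
# sample correlation on 10 nonoverlapping subsamples of 500 points each
corrs = [sample_corr(*zip(*pairs[i:i + 500])) for i in range(0, 5000, 500)]

# same process throughout, yet the measured correlation varies widely
print(min(corrs), max(corrs))
```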
Many researchers invoke "outliers" or the "peso problem" as acknowledgment of fat tails (or of the role of the tails in the distribution), yet ignore them analytically (outside of Poisson models, which are not possible to calibrate except after the fact: conventional Poisson jumps are thin-tailed). Our approach here is exactly the opposite: do not push outliers under the rug; rather, build everything around them. In other words, just like the FAA and the FDA, which deal with safety by focusing on catastrophe avoidance, we will set the ordinary aside and retain extremes as the sole sound approach to risk management. And this extends beyond safety, since much of the analytics and policies that can be destroyed by tail events are inapplicable.

Peso problem confusion about the Black Swan problem:

"(...) "Black Swans" (Taleb, ). These cultural icons refer to disasters that occur so infrequently that they are virtually impossible to analyze using standard statistical inference. However, we find this perspective less than helpful because it suggests a state of hopeless ignorance in which we resign ourselves to being buffeted and battered by the unknowable."

Andrew Lo, who obviously did not bother to read the book he was citing.

Lack of skin in the game.
Indeed one wonders why econometric methods keep being used while being so shockingly wrong, and how "University" researchers (adults) can partake in such acts of artistry. Basically, these methods capture the ordinary and mask higher order effects. Since blowups are not frequent, these events do not show in the data and the researcher looks smart most of the time while being fundamentally wrong. At the source, researchers, "quant" risk managers, and academic economists do not have skin in the game, so they are not hurt by wrong risk measures: other people are hurt by them. And the artistry should continue perpetually so long as people are allowed to harm others with impunity. (More in Taleb and Sandis [ ], Taleb [ ].)

The peso problem: originally the discovery of an outlier in money supply data; it became a name for outliers and unexplained behavior in econometrics.
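The "one observation drives the kurtosis" effect discussed in this chapter can be checked directly: compute the sample kurtosis of a fat-tailed series with and without its single largest observation. A sketch (the distribution and sizes are illustrative, not the author's dataset):

```python
import random

def kurtosis(xs):
    """Sample kurtosis m4 / m2^2 (noncentral vs central makes little difference here)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2

def t3(rng):
    """Student-t(3) variate: infinite population kurtosis."""
    z = rng.gauss(0, 1)
    chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(3))
    return z / (chi2 / 3) ** 0.5

rng = random.Random(11)
xs = [t3(rng) for _ in range(20_000)]
k_full = kurtosis(xs)
xs_trimmed = sorted(xs, key=abs)[:-1]      # drop the single largest |x|
k_trim = kurtosis(xs_trimmed)

# removing one point out of 20,000 changes the measured kurtosis materially
print(k_full, k_trim)
```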
MACHINE LEARNING CONSIDERATIONS

We have learned from option trading that you can express any one-dimensional function as a weighted linear combination of call or put options –smoothed by adding time value to the option. An option becomes a building block. A payoff constructed via options is, more precisely, as follows:

S = Σ_{i=1}^{n} w_i C(K_i, t_i),

where C is the call price (or, rather, valuation), w_i is a weight, K_i is the strike price, and t_i the time to expiration of the option. A European call C delivers max(S − K, 0) at expiration t.ᵃ Neural networks and nonlinear regression, the predecessors of machine learning, on the other hand, focused on the Heaviside step function, again smoothed to produce a sigmoid type "S" curve. A collection of different sigmoids would fit in sample.

ᵃ This appears to be an independent discovery by traders of the universal approximation theorem, initially for sigmoid functions, which are discussed further down (Cybenko [ ]).

Figure F. : The Heaviside θ function: note that it is the payoff of the "binary option" and can be decomposed as lim_{∆K→0} ( C(K) − C(K + ∆K) ) / ∆K.

So this discussion is about ...fattailedness and how the different building blocks can accommodate it. Statistical machine learning switched to "ReLu" or "ramp" functions that act exactly like call options rather than an aggregation of "S" curves. Researchers then discovered that ReLu units allow better handling of out of sample tail events (since there are by definition no unexpected tail events in sample) owing to their extrapolation properties.

What is a sigmoid? Consider a payoff function, as shown in Figure F. , that can be expressed with the formula S: (−∞, ∞) → (0, 1), S(x) = ( tanh(kx/π) + 1 ) / 2, or, more precisely, a three-parameter function S_i: (−∞, ∞) → (0, a_i),

S_i(x) = a_i / ( e^{(c_i − b_i x)} + 1 ).
It can also be the cumulative normal distribution, N(μ, σ), where σ controls the smoothness (it becomes the Heaviside of Figure F. in the limit σ → 0).

Figure F. : The sigmoid function; note that it is bounded on both the left and right sides owing to saturation: it looks like a smoothed Heaviside θ.

We can build composite "S" functions with n summands, c_n(x) = Σ_{i=1}^{n} w_i S_i(x), as in Figure F. . But:

Remark : For c_n(x) on [0, ∞) ∨ (−∞, 0) ∨ (−∞, ∞), we must have n → ∞.

We need an infinity of summands for an unbounded function: wherever the "empirical distribution" is maxed, the last observation will match the flat part of the sigmoid. For the definition of an empirical distribution see . .

Now let us consider option payoffs. Figure F. shows the payoff of a regular option at expiration –the definition of which matches a Rectifier Linear Unit (ReLu) in machine learning. Figure F. shows the following function: consider a function r: (−∞, ∞) → [k, ∞), with K ∈ R:

r(x, K, p) = k + log( e^{p(x−K)} + 1 ) / p.    (F. )

We can sum such functions, Σ_{i=1}^{n} r(x, K_i, p_i), to fit a nonlinear function, which in fact replicates what we did with call options –the parameters p_i allow smoothing of the time value.

Figure F. : A sum of sigmoids will always be bounded, so one needs an infinite sum to replicate an "open" payoff, one that is not subjected to saturation.

Figure F. : An option payoff at expiration, open on the right.

Figure F. : The r function, from Eq. F. , with k = 0. We calibrate and smooth the payoff with different values of p.

F. . Calibration via Angles
From Figure F. we can see that, in the equation S = Σ_{i=1}^{n} w_i C(K_i, t_i), the w_i corresponds to the arc tangent of the angle made –if positive (as illustrated in Figure F. )– or the negative of the arctan of the supplementary angle.
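The option-summation idea can be sketched numerically: approximate a nonlinear function by a sum of ReLu ("call payoff at expiration") units placed at knots K_i, with each weight equal to the change in slope across the knot –the tangents of the angles being fitted. The function and helper names below are illustrative, not from the text:

```python
def relu(x, K):
    """Call-option payoff at expiration with strike K (a ReLu unit)."""
    return max(x - K, 0.0)

def relu_fit(f, knots):
    """Piecewise-linear interpolant of f as a weighted sum of ReLu units.

    Each weight is the change in slope across a knot, i.e. the 'angle'
    calibration: w_i = tan(theta_i) - tan(theta_{i-1}).
    """
    slopes = [(f(b) - f(a)) / (b - a) for a, b in zip(knots, knots[1:])]
    weights = [slopes[0]] + [s1 - s0 for s0, s1 in zip(slopes, slopes[1:])]
    intercept = f(knots[0])

    def approx(x):
        return intercept + sum(w * relu(x, k) for w, k in zip(weights, knots))

    return approx

def target(x):
    """A smooth nonlinear payoff to replicate, chosen for illustration."""
    return x ** 2 * (2.0 - x)

knots = [i * 0.05 for i in range(41)]   # knots on [0, 2]
g = relu_fit(target, knots)
print(g(0.7), target(0.7))              # the ReLu sum tracks the target closely
```

Note the contrast with sigmoids: outside the last knot the ReLu sum keeps extrapolating linearly (an "open" payoff), whereas a sum of sigmoids saturates.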
Figure F. : A butterfly (built via a sum of options/ReLu, not sigmoids), with open tails on both sides and flipping first and second derivatives. This example is particularly potent as it has no verbalistic correspondence but can be understood by option traders and machine learning.

Figure F. : How w = arctan θ. By fitting angles we can translate a nonlinear function into its option summation.

Summary
We can express all nonlinear univariate functions using a weighted sum of call options of different strikes, which in machine learning applications maps to the tails better than a sum of sigmoids (themselves a net of a long and a short option at neighboring strikes). We can get the weights implicitly using the angles of the functions relative to Cartesian coordinates.

Part III
PREDICTIONS, FORECASTING, AND UNCERTAINTY

PROBABILITY CALIBRATION UNDER FAT TAILS ‡

What do binary (or probabilistic) forecasting abilities have to do with performance? We map the difference between (univariate) binary predictions or "beliefs" (expressed as: a specific "event" will happen/will not happen) and real-world continuous payoffs (numerical benefits or harm from an event), and show the effect of their conflation and mischaracterization in the decision-science literature. The effects are:

A) Spuriousness of psychological research, particularly the studies documenting that humans overestimate tail probabilities and rare events, or that they overreact to fears of market crashes, ecological calamities, etc. Many perceived "biases" are just mischaracterizations by psychologists. There is also a misuse of Hayekian arguments in promoting prediction markets.
B) Being a "good forecaster" in binary space doesn't lead to having a good performance, and vice versa, especially under nonlinearities. A binary forecasting record is likely to be a reverse indicator under some classes of distributions. Deeper uncertainty, or more complicated and realistic probability distributions, worsen the conflation.
C) Machine Learning:
Some nonlinear payoff functions, while not lending themselves to verbalistic expressions and "forecasts", are well captured by ML or expressed in option contracts.
D) M Competitions Methods:
The scores for the M4–M5 competitions appear to be closer to real world variables than the Brier score. The appendix shows the mathematical properties and exact distributions of the various payoffs, along with an exact distribution for the Brier score, helpful for significance testing and sample sufficiency.

Research chapter.

Figure . : "Typical patterns" ("normative" vs. "descriptive") as stated and described in [ ], a representative claim in psychology of decision making that people overestimate small probability events. The central findings are in [ ] and [ ]. We note that, to the left, in the estimation part, 1) events such as floods, tornados, botulism –mostly patently thick tailed variables, matters of severe consequences that agents might have incorporated in the probability– and 2) these probabilities are subjected to estimation error which, when endogenised, increases the estimation.

Example . ("One does not eat beliefs and (binary) forecasts")

In the first volume of the Incerto (Fooled by Randomness, [ ]), the narrator, a trader, is asked by the manager "do you predict that the market is going up or down?" "Up", he answered, with confidence. Then the boss got angry when, looking at the firm's exposures, he discovered that the narrator was short the market, i.e., would benefit from the market going down. The trader had difficulties conveying the idea that there was no contradiction: someone could hold the (binary) belief that the market had a higher probability of going up than down, but that, should it go down, there was a very small probability that it could go down considerably, hence a short position had a positive expected return and the rational response was to engage in a short exposure. "You do not eat forecasts, but P/L" (or "one does not monetize forecasts") goes the saying among traders.
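The trader's point is easy to make concrete: a market can go up most of the time yet carry a negative expected return, so the "binary" forecast (up/down) and the right exposure point in opposite directions. A minimal numerical sketch (the probabilities and returns are invented for illustration):

```python
# A toy market: small up-moves most days, occasional large crashes.
outcomes = [(+0.01, 0.70),   # +1% with probability 0.70
            (-0.002, 0.25),  # -0.2% with probability 0.25
            (-0.20, 0.05)]   # -20% crash with probability 0.05

p_up = sum(p for r, p in outcomes if r > 0)
expected_return = sum(r * p for r, p in outcomes)

print(p_up)             # 0.70: the binary belief says "market more likely up"
print(expected_return)  # negative: the rational exposure is short
```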
If exposures and beliefs do not go in the same direction, it is because beliefs are verbalistic reductions that contract a higher dimensional object into a single dimension. To express the manager's error in terms of decision-making research, there can be a conflation in something as elementary as the notion of a binary event (related to the zeroth moment), or the probability of an event, and the expected payoff from it (related to the first moment and, when nonlinear, to all higher moments), as the payoff functions of the two can be similar in some circumstances and different in others.
Commentary . In short, probabilistic calibration requires estimation of the zeroth moment, while the real world requires all moments (outside of gambling bets or artificial environments such as psychological experiments where payoffs are necessarily truncated), and it is a central property of thick tails that higher moments are explosive (even "infinite") and count more and more.

Away from the Verbalistic
While the trader story is mathematically trivial (though the mistake is committed a bit too often), more serious gaps are present in decision making and risk management, particularly when the payoff function is more complicated, or nonlinear (and related to higher moments). So once we map the contracts or exposures mathematically, rather than focus on words and verbal descriptions, some serious distributional issues arise.
Definition . (Event)

A (real-valued) random variable X: Ω → R defined on the probability space (Ω, F, P) is a function X(ω) of the outcome ω ∈ Ω. An event is a measurable subset (countable or not) of Ω, measurable meaning that it can be defined through the value(s) of one or several random variable(s).

Definition . (Binary forecast/payoff)

A binary forecast (belief, or payoff) is a random variable taking two values, X: Ω → {X_1, X_2}, with realizations X_1, X_2 ∈ R.

In other words, it lives in the binary set (say {0, 1}, {−1, 1}, etc.), i.e., the specified event will or will not take place and, if there is a payoff, such payoff will be mapped into two finite numbers (a fixed sum if the event happened, another one if it didn't). Unless otherwise specified, in this discussion we default to the {0, 1} set.

Examples of situations in the real world where the payoff is binary:

• Casino gambling, lotteries, coin flips, "ludic" environments, or binary options paying a fixed sum if, say, the stock market falls below a certain point and nothing otherwise –deemed a form of gambling.

• Elections where the outcome is binary (e.g., referenda, U.S. presidential elections), though not the economic effect of the result of the election.

• Medical prognoses for a single patient entailing survival or cure over a specified duration, though not the duration itself as a variable, or disease-specific survival expressed in time, or conditional life expectancy. Also excluded is anything related to epidemiology.

• Whether a given person who has an online profile will buy or not a unit or more of a specific product at a given time (not the quantity or units).
Commentary . (A binary belief is equivalent to a payoff)

A binary "belief" should map to an economic payoff (under some scaling or normalization, without necessarily constituting a probability), an insight owed to De Finetti [ ], who held that a "belief" and a "prediction" (when they are concerned with two distinct outcomes) map into the equivalent of the expectation of a binary random variable and bets with a payoff in {0, 1}. An "opinion" becomes a choice price for a gamble, one at which one is equally willing to buy or sell. Inconsistent opinions would therefore lead to a violation of arbitrage rules, such as the "Dutch book", where a combination of mispriced bets can guarantee a future loss.

We consider such banning as justified, since bets have practically no economic value, compared to financial markets that are widely open to the public, where natural exposures can be properly offset. Note the absence of spontaneously forming gambling markets with binary payoffs for continuous variables. The exception might have been binary options, but these did not remain in fashion for very long –from the experiences of the author, for a period between and , largely motivated by tax gimmicks.

Definition . (Real world open continuous payoff)

X: Ω → [a, ∞) ∨ (−∞, b] ∨ (−∞, ∞).

A continuous payoff "lives" in an interval, not a finite set. It corresponds to an unbounded random variable, either doubly unbounded or semi-bounded, with the bound on one side (a one-tailed variable).
Caveat

We are limiting, for the purposes of our study, the consideration to binary vs. continuous and open-ended payoffs (i.e., with no compact support). Many discrete payoffs are subsumed into the continuous class using standard arguments of approximation. We are also omitting triplets, that is, payoffs in, say, {−1, 0, 3}, as these obey the properties of binaries (and can be constructed using a sum of binaries). Further, many variables with a floor and a remote ceiling (hence, formally, with compact support), such as the number of victims of a catastrophe, are analytically and practically treated as if they were open-ended [ ].

Examples of situations in the real world where the payoff is continuous:

• War casualties, calamities due to earthquakes, medical bills, etc.
• Magnitude of a market crash, severity of a recession, rate of inflation
• Income from a strategy
• Sales and profitability of a new product
• In general, anything covered by an insurance contract
Figure . : Comparing the payoff of a binary bet (the Heaviside θ(.)) to a continuous open-ended exposure g(x). Visibly there is no way to match the (mathematical) derivatives for any form of hedging: the binary mistracks the exposure.

Most natural and socio-economic variables are continuous and their statistical distribution does not have a compact support, in the sense that we do not have a handle on an exact upper bound.

Figure . : Conflating probability and expected return is deeply entrenched in psychology and finance. Credit: Stefan Gasic.
Example .

Predictive analytics in binary space {0, 1} can be successful in forecasting whether, from his online activity, online consumer Iannis Papadopoulos will purchase a certain item, say a wedding ring, based solely on computation of the probability. But the probability of "success" for a potential new product might be –as with the trader's story– misleading. Given that company sales are typically thick tailed, a very low probability of success might still be satisfactory for making a decision. Consider venture capital or option trading –an out of the money option can often be attractive yet may have less than in probability of ever paying off. More significantly, the tracking error for probability guesses will not map to that of the performance; that of l(M) would.

This difference is well known by option traders, as there are financial derivative contracts called "binaries" that pay in the binary set {0, 1} (say, if the underlying asset S exceeds a strike price K), while others, called "vanilla", pay in [0, ∞), i.e., max(S − K, 0) (or, worse, in (−∞, 0) for the seller, who can be exposed to bankruptcy owing to the unbounded exposure). The considerable mathematical and economic difference between the two has been discussed and is the subject of Dynamic Hedging: Managing Vanilla and Exotic Options [ ]. Given that the former are bets paying a fixed amount and the latter have full payoff, one cannot be properly replicated (or hedged) using the other, especially under fat tails and parametric uncertainty –meaning performance in one does not translate into performance in the other. While this knowledge is well known in mathematical finance, it doesn't seem to have been passed on to the decision-theory literature.

Commentary . (Derivatives theory)

Our approach here is inspired by derivatives (or option) theory and practice, where there are different types of derivative contracts: 1) those with binary payoffs (that pay a fixed sum if an event happens) and 2) "vanilla" ones (standard options with continuous payoffs). It is practically impossible to hedge one with the other [ ]. Furthermore, a bet with a strike price K and a call option with the same strike K, with K in the tails of the distribution, almost always have their valuations react in opposite ways when one increases the kurtosis of the distribution (while preserving the first three moments) or, in an example further down in the lognormal environment, when one increases uncertainty via the scale of the distribution.

Commentary . (Term sheets)

Note that, thanks to "term sheets", which are necessary both legally and mathematically, financial derivatives practice provides a precise legalistic mapping of payoffs in a way that makes their mathematical, statistical, and economic differences salient.
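The opposite reactions of a binary and a vanilla contract can be verified in a lognormal environment. With S lognormal, its mean fixed at 1, and K deep in the tail, raising the scale σ lowers the binary value P(S > K) while raising the vanilla value E[max(S − K, 0)] toward E[S] = 1. A sketch using the standard closed forms (the parameters K = 5, σ = 2, 4 are illustrative):

```python
from math import erf, log, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binary_and_vanilla(K, sigma):
    """S = exp(sigma*Z - sigma^2/2): lognormal with E[S] = 1.

    binary  = P(S > K)
    vanilla = E[max(S - K, 0)]  (Black-Scholes-type closed form)
    """
    d2 = (-log(K) - sigma ** 2 / 2.0) / sigma
    d1 = d2 + sigma
    return Phi(d2), Phi(d1) - K * Phi(d2)

b_lo, v_lo = binary_and_vanilla(K=5.0, sigma=2.0)
b_hi, v_hi = binary_and_vanilla(K=5.0, sigma=4.0)
print(b_lo, b_hi)  # the binary value falls as sigma rises
print(v_lo, v_hi)  # the vanilla value rises toward E[S] = 1
```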
There has been a tension between prediction markets and real financial markets. As we show here, prediction markets may be useful for gamblers, but they cannot hedge economic exposures. The mathematics of the difference and the impossibility of hedging can be shown as follows. Let X be a random variable in R; we have the payoff of the bet or prediction θ_K: R → {0, 1},

θ_K(x) = 1_{x ≥ K},    ( . )

and g: R → R that of the natural exposure. Since ∂/∂x θ_K(x) is a Dirac delta function at K, δ(x − K), while ∂/∂x g(x) is at least once differentiable for x ≥ K (or constant in case the exposure is globally linear or, like an option, piecewise linear above K), matching derivatives for the purposes of offsetting variations is not a possible strategy. The point is illustrated in Fig . .

There is no defined "collapse", "disaster", or "success" under fat tails
The fact that an "event" has some uncertainty around its magnitude carries some mathematical consequences. Some verbalistic papers still commit the fallacy of binarizing an event that lives in [0, ∞): a recent paper on calibration of beliefs says "...if a person claims that the United States is on the verge of an economic collapse or that a climate disaster is imminent..." An economic "collapse" or a climate "disaster" must not be expressed as an event in {0, 1} when in the real world it can take many values. For that, a characteristic scale is required. In fact, under fat tails there is no "typical" collapse or disaster, owing to the absence of a characteristic scale, hence verbal binary predictions or beliefs cannot be used as gauges. We present the difference between thin tailed and fat tailed domains as follows.

Definition . (Characteristic scale)

Let X be a random variable that lives in either (0, ∞) or (−∞, ∞) and E the expectation operator under the "real world" (physical) distribution. By classical results [ ]:

lim_{K→∞} (1/K) E(X | X > K) = λ.    ( . )

• If λ = 1, X is said to be in the thin tailed class D_1 and to have a characteristic scale.
• If λ > 1, X is said to be in the fat tailed regular variation class D_2 and to have no characteristic scale.
• If lim_{K→∞} E(X | X > K) − K = μ, where μ > 0, then X is in the borderline exponential class.

The point can be made clear as follows. One cannot have a binary contract that adequately hedges someone against a "collapse", given that one cannot know in advance the size of the collapse or how much the face value of such a contract needs to be. On the other hand, an insurance contract or an option with continuous payoff would provide a satisfactory hedge. Another way to view it: reducing these events to a verbalistic "collapse" or "disaster" is equivalent to a health insurance payout of a lump sum if one is "very ill" –regardless of the nature and gravity of the illness– and 0 otherwise. And it is highly flawed to separate payoff and probability in the integral of expected payoff.

To replicate an open-ended continuous payoff with binaries, one needs an infinite series of bets, which cancels the entire idea of a prediction market by transforming it into a financial market. Distributions with compact support always have finite moments, which is not the case for those on the real line.
Some experiments of the type shown in Figure . ask agents what their estimates of deaths from botulism or some such disease are; agents are then blamed for misunderstanding the probability. This is rather a problem with the experiment: people do not necessarily separate probabilities from payoffs.

Definition . (Substitution of integral)

Let K ∈ R⁺ be a threshold, f(.) a density function, p_K ∈ [0, 1] the probability of exceeding it, and g(x) an impact function. Let I_1 be the expected payoff above K:

I_1 = ∫_K^∞ g(x) f(x) dx,

and let I_2 be the impact at K multiplied by the probability of exceeding K:

I_2 = g(K) ∫_K^∞ f(x) dx = g(K) p_K.

The substitution comes from conflating I_1 and I_2, which becomes an identity if and only if g(.) is constant above K (say g(x) = θ_K(x), the Heaviside theta function). For g(.) a variable function with positive first derivative, I_2 can be close to I_1 only under thin-tailed distributions, not under fat tailed ones. For the discussions and examples in this section, assume g(x) = x, as we will consider the more advanced nonlinear case in Section . .

Practically all economic and informational variables have been shown since the s to belong to the D_2 class, or at least to the intermediate subexponential class (which includes the lognormal) [ , , , , ], along with social variables such as the size of cities, words in languages, connections in networks, size of firms, incomes for firms, macroeconomic data, monetary data, victims from interstate conflicts and civil wars [ , ], operational risk, damage from earthquakes, tsunamis, hurricanes and other natural calamities, income inequality [ ], etc. Which leaves us with the more rational question: where are the Gaussian variables? These appear to be at best one order of magnitude fewer in decisions entailing formal predictions. This can also explain, as we will see in Chapter , why binary bets can never represent "skin in the game" under fat tailed distributions.
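The two integrals can be computed in closed form for a Gaussian and for a Pareto tail (with g(x) = x), confirming the two regimes λ = 1 and λ = α/(α − 1). A sketch (α and the thresholds are illustrative):

```python
from math import erfc, exp, pi, sqrt

def gaussian_ratio(K):
    """I_1/I_2 for a standard Gaussian above threshold K.

    I_1 = integral_K^inf x phi(x) dx = e^{-K^2/2} / sqrt(2 pi)
    I_2 = K * P(X > K) = K * (1/2) erfc(K / sqrt(2))
    """
    i1 = exp(-K ** 2 / 2.0) / sqrt(2.0 * pi)
    i2 = K * 0.5 * erfc(K / sqrt(2.0))
    return i1 / i2

def pareto_ratio(alpha):
    """I_1/I_2 for a Pareto tail P(X > x) = x^{-alpha}: equals alpha/(alpha-1), K-free."""
    return alpha / (alpha - 1.0)

print(gaussian_ratio(6.0))  # close to 1: thin tails, I_2 is a fine proxy for I_1
print(pareto_ratio(2.0))    # 2: the substitution misses half the expected payoff
```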
probability calibration under fat tails ‡

Theorem: Convergence of I₁/I₂
If X is in the thin tailed class D₁ as described above,

    lim_{K→∞} I₁/I₂ = 1.

If X is in the regular variation class D₂,

    lim_{K→∞} I₁/I₂ = λ > 1.

Proof. From the computations in the two sections that follow. Further comments:

Thin tails
By our very definition of a thin tailed distribution (more generally any distribution outside the subexponential class, indexed by (g)), where f^(g)(.) is the PDF:

    lim_{K→∞} [ ∫_K^∞ x f^(g)(x) dx ] / [ K ∫_K^∞ f^(g)(x) dx ] = I₁/I₂ = 1.

Special case of a Gaussian: let g(.) be the PDF of the predominantly used Gaussian distribution (centered and normalized). Then

    ∫_{K_p}^∞ x g(x) dx = e^(−K_p²/2) / √(2π)

and p = (1/2) erfc(K_p/√2), where erfc is the complementary error function and K_p is the threshold corresponding to the probability p. We note that I₁/I₂ corresponds to the inverse Mills ratio used in insurance.

Fat tails
For all distributions in the regular variation class, defined by their tail survival function: for K large,

    P(X > K) ≈ L K^(−α),  α > 1, L > 0,

and with f^(p)(.) the PDF of a member of that class:

    lim_{K_p→∞} [ ∫_{K_p}^∞ x f^(p)(x) dx ] / [ K_p ∫_{K_p}^∞ f^(p)(x) dx ] = α/(α − 1) > 1.

Conflations

Conflation of I₁ and I₂. In numerous experiments, which include the prospect theory paper by Kahneman and Tversky (1979), it has been repeatedly established that agents overestimate small probabilities in experiments where the odds are shown to them, and where the outcome corresponds to a single payoff. The well known Kahneman-Tversky result proved robust, but interpretations make erroneous claims from it. Practically all the subsequent literature relies on I₂ and conflates it with I₁, what this author has called the ludic fallacy in The Black Swan, as games necessarily truncate a dimension from reality. The psychological results might be robust, in the sense that they replicate when repeated under the exact same conditions, but all the claims outside these conditions, and extensions to real risks, are an exceedingly dubious generalization, given that our exposures in the real world rarely map to I₂. Furthermore, one can overestimate the probability yet underestimate the expected payoff.

Stickiness of the conflation. The misinterpretation is still made four decades after Kahneman-Tversky (1979). In a review of behavioral economics, with emphasis on miscalculation of probability, Barberis treats I₁ = I₂. And Arrow et al., a long list of decision scientists pleading for deregulation of the betting markets, also misrepresented the fitness of these binary forecasts to the real world (particularly in the presence of real financial markets).

Another stringent, and dangerous, example is the "default VaR" (Value at Risk), which is explicitly given as I₂, i.e. default probability × (1 − expected recovery rate), which can be quite different from the actual loss expectation in case of default. Finance presents erroneous approximations of CVaR, and the approximation is the risk-management flaw that may have caused the crisis of 2008. The fallacious argument is to compute the recovery rate as the expected value of the collateral, without conditioning on the default event. The expected value of the collateral conditional on a default is often far less than its unconditional expectation. In 2008, after a massive series of foreclosures, the value of most collateral dropped to a fraction of its expected value!

Misunderstanding of Hayek's knowledge arguments. "Hayekian" arguments for the consolidation of beliefs via prices do not lead to prediction markets, as discussed in such pieces as Sunstein's: prices exist in financial and commercial markets; prices are not binary bets. For Hayek, consolidation of knowledge is done via prices and arbitrageurs (his words), and arbitrageurs trade products, services, and financial securities, not binary bets.

(The mathematical expression of the Value at Risk, VaR, for a random variable X with distribution function F and threshold α ∈ [0, 1] is

    VaR_α(X) = − inf{ x ∈ ℝ : F_X(x) > α },

and the corresponding CVaR, or expected shortfall, is ES_α(X) = E( −X | X ≤ −VaR_α(X) ).)
Table: Gaussian pseudo-overestimation. (Columns: probability level p; threshold K_p; I₁ = ∫_{K_p}^∞ x f(x) dx; I₂ = K_p ∫_{K_p}^∞ f(x) dx; adjusted probability p* and the ratio p*/p.)

Definition (Corrected probability in binarized experiments)
Let p* be the equivalent probability that makes I₁ = I₂ and eliminates the effect of the error, so

    p* = { p : I₁ = I₂ }.

Now let us solve for K_p "in the tails", working with a probability p. For the Gaussian, K_p = √2 erfc^(−1)(2p); for the Paretian tailed distribution, K_p = p^(−1/α). Hence, for a Paretian distribution, the ratio of the real continuous probability to the binary one is

    p*/p = α/(α − 1),

which can allow in absurd cases p* to exceed 1 when the distribution is grossly misspecified.

The tables show, for a probability level p, the corresponding tail level K_p, such that

    K_p = inf{ K : P(X > K) > p },

and the corresponding adjusted probability p* that de-binarizes the event; probabilities here need to be in the bottom half, i.e., p < .5. Note that we are operating under the mild case of known probability distributions, as it gets worse under parametric uncertainty.

The most commonly known distribution among the public, the "Pareto 80/20" (based on Pareto discovering that 20 percent of the people in Italy owned 80 percent of the land), maps to a tail index α = 1.16, so the adjusted probability is p* = 7.25 p, more than seven times the binary probability.

Example of probability and expected payoff reacting in opposite directions under an increase in uncertainty
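The corrected probability and the tail thresholds can be sketched with the standard library alone (an illustrative sketch; function names are mine, and the Gaussian quantile comes from statistics.NormalDist):

```python
from statistics import NormalDist

def corrected_probability_pareto(p, alpha):
    """p* = p * alpha / (alpha - 1): continuous-payoff equivalent of a
    naive binary probability p under a Paretian tail with index alpha."""
    return p * alpha / (alpha - 1)

def tail_threshold_gaussian(p):
    """K_p such that P(X > K_p) = p for a standard Gaussian."""
    return NormalDist().inv_cdf(1 - p)

def tail_threshold_pareto(p, alpha):
    """K_p = p**(-1/alpha) for a survival function x**-alpha."""
    return p ** (-1 / alpha)

# The "Pareto 80/20" maps to alpha = 1.16, so p* is about 7.25 times p:
print(corrected_probability_pareto(0.01, 1.16))
print(tail_threshold_gaussian(0.01))
print(tail_threshold_pareto(0.01, 2.0))
```

Note how quickly the Paretian threshold outruns the Gaussian one for the same binary probability p.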
An example showing how, under a skewed distribution, the binary and the expectation react in opposite directions is as follows. Consider the risk-neutral lognormal distribution L(X₀ − σ²/2, σ) with PDF f_L(.), mean X₀ and variance (e^(σ²) − 1) X₀². We can increase its uncertainty with the parameter σ. We have the expectation of a contract above X₀, E_{>X₀}:

    E_{>X₀} = ∫_{X₀}^∞ x f_L(x) dx = (1/2) X₀ ( 1 + erf( σ/(2√2) ) ),

and the probability of exceeding X₀,

    P(X > X₀) = (1/2) ( 1 − erf( σ/(2√2) ) ),

where erf is the error function. As σ rises, erf(σ/(2√2)) → 1, with E_{>X₀} → X₀ and P(X > X₀) → 0.

(The analysis is invariant to whether we use the right or the left tail. By convention, finance uses negative values for losses, whereas other areas of risk management express the negative of the random variable, hence focus on the right tail. K_p is equivalent to the Value at Risk VaR_p in finance, where p is the probability of loss. Note also van der Wijk's law, see Cirillo: I₁/I₂ is related to what is called in finance the expected shortfall for K_p.)

Figure: Comparing the three payoffs (binary, thin tails, Pareto 80/20) under two distributions; the binary has the same profile regardless of whether the distribution is thin or fat tailed. The first two subfigures are to scale; the third (representing the Pareto 80/20 with α = 1.16) requires multiplying the scale by two orders of magnitude.

This example is well known by option traders (see
Dynamic Hedging) as the binary option struck at X₀ goes to 0 while the standard call of the same strike rises considerably to reach the level of the asset, regardless of strike. This is typically the case with venture capital: the riskier the project, the less likely it is to succeed, but the more rewarding in case of success. So the expectation can go to +∞ while the probability of success goes to 0.

Table: Paretian pseudo-overestimation. (Same columns as the Gaussian table: p, K_p, I₁ = ∫_{K_p}^∞ x f(x) dx, I₂ = K_p ∫_{K_p}^∞ f(x) dx, p*, p*/p.)

Distributional Uncertainty

Remark: Distributional uncertainty
Owing to Jensen's inequality, the discrepancy (I₁ − I₂) increases under parameter uncertainty, expressed in higher kurtosis, via stochasticity of σ, the scale of the thin-tailed distribution, or of α, the tail index of the Paretian one.

Proof. First, the Gaussian world. We consider the effect of I₁ − I₂ = ∫_K^∞ x f^(g)(x) dx − K ∫_K^∞ f^(g)(x) dx under stochastic volatility, that is, stochasticity of the scale parameter. Let σ be the scale of the Gaussian, with K constant:

    ∂²( ∫_K^∞ x f^(g)(x) dx )/∂σ² − K ∂²( ∫_K^∞ f^(g)(x) dx )/∂σ² = e^(−K²/(2σ²)) K² / ( √(2π) σ³ ),

which is positive for all values of K > 0: the discrepancy is convex in σ, hence grows under stochastic volatility.

Second, consider the sensitivity of the ratio I₁/I₂ to parameter uncertainty for α in the Paretian case (for which we can get a streamlined expression compared to the difference). For α > 1:

    ∂²( ∫_K^∞ x f^(p)(x) dx / ∫_K^∞ f^(p)(x) dx )/∂α² = 2K/(α − 1)³,

which is positive and increases markedly at lower values of α, meaning the fatter the tails, the worse the uncertainty about the expected payoff and the larger the difference between I₁ and I₂.

The psychology literature also examines the "calibration" of probabilistic assessment: an evaluation of how close someone providing odds of events turns out to be on average (under some operation of the law of large numbers deemed satisfactory), see the figure (as we saw in an earlier chapter).
The methods, for the reasons we have shown here, are highly flawed except in narrow circumstances of purely binary payoffs (such as those entailing a "win/lose" outcome), and generalizing from these payoffs is either not possible or produces misleading results. Accordingly, the figure makes little sense empirically.

At the core, calibration metrics such as the Brier score are always thin-tailed, even when the variable under measurement is fat-tailed, which worsens the tractability. To use again the saying "you do not eat forecasts": most businesses have severely skewed payoffs, so being calibrated in probability is meaningless.

Remark: Distributional differences
Binary forecasts and calibration metrics via the Brier score belong to the thin-tailed class.
We will show proofs next.
This section, summarized in the table that follows, compares the probability distributions of the various metrics used to measure performance, either by explicit formulation or by linking them to a certain probability class. Clearly one may be mismeasuring performance if the random variable is in the wrong probability class. Different underlying distributions will require different sample sizes, owing to the differences in the way the law of large numbers operates across distributions. A series of binary forecasts will converge very rapidly to a thin-tailed Gaussian even if the underlying distribution is fat-tailed, but an economic P/L tracking performance for someone with a real exposure will require a considerably larger sample size if, say, the underlying is Pareto distributed.

We start with precise expressions for the four possible metrics:

1. Real world performance under conditions of survival, or, in other words, P/L or a quantitative cumulative score.
2. A tally of bets, the naive sum of how often a person's binary prediction is correct.
3. De Finetti's Brier score λ^(B)_n.
4. The M4 score λ^(M4)_n for n observations used in the M4 competition, and its proposed sequel M5.

Table: Scoring Metrics for Performance Evaluation
Metric | Name | Fitness to reality
P^(r)(T) | Cumulative P/L | Adapted to real world distributions, particularly under a survival filter.
P^(p)(n) | Tally of bets | Misrepresents the performance under fat tails; works only for binary bets and/or thin tailed domains.
λ^(B)_n | Brier score | Misrepresents performance precision under fat tails; ignores higher moments.
λ^(M4)_n | M4 score | Represents precision, not exactly real world performance, but maps to the real distribution of the underlying variables.
λ^(M5)_n | Proposed M5 score | Represents both precision and survival conditions by predicting extrema of time series.
g(.) | Machine learning nonlinear payoff function (not a metric) | Expresses exposures without verbalism and reflects true economic or other P/L; resembles financial derivatives term sheets.
P/L in Payoff Space (under survival condition). The "P/L" is short for the natural profit and loss index, that is, a cumulative account of performance. Let x_t be realizations of a unidimensional generic random variable X with support in ℝ and t = 1, 2, . . . , n. Real world payoffs P^(r)(.) are expressed in a simplified way as

    P^(r)(n) = P^(r)(0) + Σ_{t≤n} g(x_t),

where g : ℝ → ℝ is a measurable function representing the payoff; g may be path dependent (to accommodate a survival condition), that is, a function of the preceding periods τ < t or of the cumulative sum Σ_{τ≤t} g(x_τ), to introduce an absorbing barrier, say, bankruptcy avoidance, in which case we write:

    P^(r)(T) = P^(r)(0) + Σ_{t≤n} 1( Σ_{τ<t} g(x_τ) > b ) g(x_t),

where b is any arbitrary number in ℝ that we call the survival mark and 1(.) an indicator function taking values in {0, 1}.

The last condition, from the indicator function in the equation above, is meant to handle ergodicity or the lack of it.

Commentary. P/L tautologically corresponds to the real world distribution, with an absorbing barrier at the survival condition.
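The survival-filtered P/L above can be sketched as follows (a toy illustration; the linear payoff g(x) = x and the barrier level are my choices):

```python
import numpy as np

def cumulative_pl(payoffs, b=-10.0, p0=0.0):
    """Cumulative P/L with an absorbing barrier: once the running sum
    breaches the survival mark b, subsequent payoffs no longer count."""
    pl = p0
    for x in payoffs:
        if pl <= b:            # indicator turns 0: absorbed, stop counting
            break
        pl += x                # here g(x) = x
    return pl

rng = np.random.default_rng(1)
path = rng.standard_normal(1000)
print(cumulative_pl(path), path.sum())  # survival-filtered vs naive sum
```

The two printed quantities differ precisely when the path hits the barrier: the naive sum keeps counting gains that a bankrupt agent never collects, which is the ergodicity point above.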
Frequency Space
The standard psychology literature has two approaches.
A. When tallying forecasts as a counter:

    P^(p)(n) = (1/n) Σ_{i≤n} 1_{X_{t_i} ∈ χ},

where 1_{X_t ∈ χ} ∈ {0, 1} is the indicator that the random variable x_t falls in the "forecast range" χ, and n the total number of such forecasting events.

B. When dealing with a score (calibration method), in the absence of a visible net performance, researchers produce some more advanced metric or score to measure calibration. We select below the "gold standard", De Finetti's Brier score. It is favored since it doesn't allow arbitrage and requires perfect probabilistic calibration: someone betting that an event has a probability 1 of occurring will get a perfect score only if the event occurs all the time.

    λ^(B)_n = (1/n) Σ_{t≤n} ( f_t − 1_{X_t ∈ χ} )²,

where f_t ∈ [0, 1] is the probability announced by the forecaster for event t; the score needs to be minimized for a perfect probability assessor.
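De Finetti's Brier score can be computed directly; a minimal sketch (function name mine):

```python
import numpy as np

def brier_score(forecast_probs, outcomes):
    """lambda_B = (1/n) * sum_t (f_t - 1_{event})**2 ; lower is better."""
    f = np.asarray(forecast_probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return np.mean((f - o) ** 2)

# A perfectly calibrated, perfectly sharp assessor scores 0:
print(brier_score([1.0, 0.0, 1.0], [1, 0, 1]))
# Hedging at 0.5 on every binary event scores 0.25:
print(brier_score([0.5, 0.5], [1, 0]))
```

The no-arbitrage property is visible in the quadratic: announcing f = 1 pays off only if the event always occurs; any hedging is penalized symmetrically.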
Applications: M4 and M5 Competitions. The M series (Makridakis et al.) evaluate forecasters using various methods to predict a point estimate (along with a range of possible values). The most recent completed competition, M4, largely relied on a series of scores, λ^(M4)_j, which work well in situations where one has to forecast the first moment of the distribution and the dispersion around it.

Definition (The M4 first moment forecasting scores)
The M4 competition precision score (Makridakis et al.) judges competitors on the following metrics, indexed by j = 1, 2:

    λ^(M4)_{j,n} = (1/n) Σ_{i≤n} |X^f_i − X^r_i| / s_j,

where s₁ = (1/2)(|X^f_i| + |X^r_i|) and s₂ is (usually) the raw mean absolute deviation for the observations available up to period i (i.e., the mean absolute error from either "naive" forecasting or from in-sample tests), X^f_i is the forecast for variable i as a point estimate, X^r_i is the realized variable, and n the number of experiments under scrutiny.

In other words, it is an application of the Mean Absolute Scaled Error (MASE) and the symmetric Mean Absolute Percentage Error (sMAPE).

The proposed M5 score adds the forecasts of extrema of the variables under consideration and repeats the same tests as the ones for raw variables in the definition above.

Deriving Distributions

Distribution of P^(p)(n)

Remark
The tally of binary forecasts P^(p)(n) is asymptotically normal with mean p and standard deviation √((p − p²)/n), regardless of the distribution class of the random variable X.

The results are quite standard, but see the appendix for the re-derivations.
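The Remark can be cross-checked by simulation (an illustrative sketch; the closed-form kurtosis follows from the Bernoulli cumulants derived later in the chapter):

```python
import numpy as np

def tally_kurtosis(p, N):
    """Kurtosis of the average of N Bernoulli(p) draws, from the cumulants:
    3 + (1 - 6 p (1 - p)) / (N p (1 - p))."""
    return 3 + (1 - 6 * p * (1 - p)) / (N * p * (1 - p))

rng = np.random.default_rng(2)
p, N, trials = 0.3, 50, 200_000
tallies = rng.binomial(N, p, size=trials) / N

mean = tallies.mean()
kurt = np.mean((tallies - mean) ** 4) / np.var(tallies) ** 2
print(mean, tallies.std(), tally_kurtosis(p, N), kurt)
```

Even for a modest N, the empirical kurtosis is already indistinguishable from the Gaussian's 3: the tally is thin tailed no matter what X is.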
Distribution of the Brier score λ_n

Theorem
Regardless of the distribution of the random variable X, without even assuming independence of (f₁ − 1_{A₁}), . . . , (f_n − 1_{A_n}), for n < +∞ the score λ_n has all moments of order q, E(λ_n^q) < +∞.

Proof. For all i, (f_i − 1_{A_i})² ≤ 1.

Assume the f_i are independent and follow a beta distribution B(a, b) (which approximates or includes all unimodal distributions in [0, 1], plus a Bernoulli via two Dirac delta functions), and let p be the rate of success, p = E(1_{A_i}). The characteristic function φ_n(t) of λ_n for n evaluations of the Brier score can then be written in closed form in terms of the regularized generalized hypergeometric function ₂F̃₂, where ₚF̃_q(a; b; z) = ₚF_q(a; b; z)/(Γ(b₁). . .Γ(b_q)), ₚF_q(a; b; z) has the series expansion Σ_{k≥0} ((a₁)_k . . . (a_p)_k)/((b₁)_k . . . (b_q)_k) z^k/k!, and (a)_k is the Pochhammer symbol.

Hence we can prove the following: under the conditions of independence of the summands stated above,

    λ_n →^D N(μ, σ_n),

where N denotes the Gaussian distribution with for first argument the mean and for second argument the standard deviation.

The proof and the parametrization of μ and σ_n are in the appendix.

Distribution of the economic P/L or quantitative measure P^(r)

Remark
Conditional on survival to time T, the distribution of the quantitative measure P^(r)(T) will follow the distribution of the underlying variable g(x).

The discussion is straightforward if there is no absorbing barrier (i.e., no survival condition).
Distribution of the M4 score. The distribution of an absolute deviation is in the same probability class as the variable itself. The Brier score is in the norm L² and is based on the second moment (which always exists), as De Finetti has shown that it is more efficient to judge a probability via square deviations. However, for nonbinaries it is vastly more efficient under fat tails to rely on absolute deviations, even when the second moment exists.

Earlier examples focused on simple payoff functions, with some cases where the conflation of I₁ and I₂ can be benign (under the condition of being in a thin tailed environment).

Inseparability of probability under nonlinear payoff function
Now when we introduce a payoff function g(.) that is nonlinear, that is, where the economic or other quantifiable response to the random variable X varies with the level of X, the discrepancy becomes greater and the conflation worse.

Commentary (Probability as an integration kernel)
Probability is just a kernel inside an integral or a summation, not a real thing on its own. The economic world is about quantitative payoffs.

Remark: Inseparability of probability
Let F : A → [0, 1] be a probability distribution (with derivative f) and g : ℝ → ℝ a measurable function, the "payoff". Clearly, for A′ a subset of A:

    ∫_{A′} g(x) dF(x) = ∫_{A′} f(x) g(x) dx ≠ ( ∫_{A′} f(x) dx ) g( ∫_{A′} x dx / ∫_{A′} dx ).

In discrete terms, with p(.) a probability mass function:

    Σ_{x ∈ A′} p(x) g(x) ≠ ( Σ_{x ∈ A′} p(x) ) g( (1/n) Σ_{x ∈ A′} x ) = probability of event × payoff of average event.

Proof. Immediate by Jensen's inequality.

In other words, the probability of an event is an expected payoff only when, as we saw earlier, g(x) is a Heaviside theta function. Next we focus on functions tractable mathematically or legally, but not reliable verbalistically via "beliefs" or "predictions".

Misunderstanding g. The figure showing the mishedging story of Morgan Stanley is illustrative of verbalistic notions such as "collapse" mis-expressed in nonlinear exposures. In 2007 the Wall Street firm Morgan Stanley decided to "hedge" against a real estate "collapse", before the market in real estate started declining. The problem is that they didn't realize that "collapse" could take many values, some worse than they expected, and set themselves up to benefit if there were a mild decline, but to lose much if there were a larger one. They ended up right in predicting the crisis, but lost $10 billion from the "hedge".

Figure: The Morgan Stanley story: an example of an elementary nonlinear payoff that cannot be described verbalistically. This exposure is called in derivatives traders' jargon a "Christmas tree", achieved by purchasing a put with strike K and selling a put with lower strike K − ∆K₁ and another with even lower strike K − ∆K₂, with ∆K₂ ≥ ∆K₁ ≥ 0. (Panel annotations: benefits from decline; serious harm from decline; starting point.)

A second figure shows a more complicated payoff, dubbed a "butterfly".

The function g and machine learning. We note that g maps to various machine learning functions that produce exhaustive nonlinearities via the universal approximation theorem (Cybenko), or to the generalized option payoff decompositions (see Dynamic Hedging).

Consider the function r : (−∞, ∞) → [K, ∞), with K ∈ ℝ and the r.v. x ∈ ℝ:

    r_{K,p}(x) = K + (1/p) log( e^(p(x−K)) + 1 ).

We can express all nonlinear payoff functions g as, with weights w_i ∈ ℝ:

    g(x) = Σ_i w_i r_{K_i, p}(x).

By some similarity, r_{K,p}(x) maps to the value of a call price with strike K and time t to expiration normalized to 1, all rates set at 0, with sole other parameter σ, the standard deviation of the underlying. We note that the expectation of g(.) is the sum of the expectations of the ReLU functions:

    E( g(x) ) = Σ_i w_i E( r_{K_i, p}(x) ).

The variance and other higher order statistical measurements are harder to obtain in closed or simple form.

Commentary. Risk management is about changing the payoff function g(.) rather than making "good forecasts".
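The decomposition g(x) = Σᵢ wᵢ r_{Kᵢ,p}(x) can be sketched as follows (weights and strikes are arbitrary illustrations; the naive softplus form below overflows when p·(x − K) is very large):

```python
import math

def r(x, K, p=10.0):
    """Smoothed ReLU floored at K: tends to max(x, K) as p -> infinity."""
    return K + math.log(math.exp(p * (x - K)) + 1) / p

def payoff(x, weights, strikes, p=10.0):
    """g(x) = sum_i w_i * r_{K_i, p}(x)."""
    return sum(w * r(x, K, p) for w, K in zip(weights, strikes))

# A call-spread-like exposure: long at K = 100, short at K = 110.
w, K = [1.0, -1.0], [100.0, 110.0]
print(payoff(90.0, w, K))    # about -10: both "options" out of the money
print(payoff(105.0, w, K))   # about -5: only the long strike is in the money
print(payoff(120.0, w, K))   # about 0: gains above 110 cancel out
```

This is the sense in which a term sheet of weights and strikes, rather than a verbal "belief", expresses an exposure: the whole shape of g, not a single probability, is what gets paid.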
We note that λ is not a metric but a target to which one can apply various metrics.
Decision making is sequential. Accordingly, miscalibration may be a good idea if it reduces the odds of being absorbed. See the appendix of Skin in the Game, which shows the difference between ensemble probability and time probability. The expectation of the sum of n gamblers over a given day is different from that of a single gambler over n days, owing to the conditioning. In that sense, measuring the performance of an agent who will eventually go bust (with probability one) is meaningless.

Finally, in the real world it is the net performance (economic or other) that counts, and making "calibration" mistakes where it doesn't matter, or where it can be helpful, should be encouraged, not penalized. The bias-variance argument is well known in machine learning as a means to increase performance, in discussions of rationality (see Skin in the Game) as a necessary mechanism for survival, and as a very helpful psychological adaptation (Brighton and Gigerenzer show a potent argument that if it is a bias, it is a pretty useful one). If a mistake doesn't cost you anything, or helps you survive or improve your outcomes, it is clearly not a mistake. And if it costs you something, and has been present in society for a long time, consider that there may be hidden evolutionary advantages to these types of mistakes, of the following sort: mistaking a bear for a stone is worse than mistaking a stone for a bear.

We have shown that, in risk management, one should never operate in probability space.

Distribution of Binary Tally P^(p)(n)

We are dealing with an average of Bernoulli random variables, with well known results, but worth redoing. The characteristic function of a Bernoulli distribution with parameter p is ψ(t) = 1 − p + e^(it) p. We are concerned with the N-summed cumulant generating function ψ′(ω) = log ψ(ω/N)^N. We have κ(j), the cumulant of order j:

    κ(j) = −i^j ∂^j ψ′/∂t^j |_{t→0}.

So: κ(1) = p, κ(2) = (1 − p) p / N, κ(3) = p (1 − p)(1 − 2p) / N², κ(4) = (1 − p) p (6(p² − p) + 1) / N³, which proves that P^(p)(N) converges by the law of large numbers at speed √N, and by the central limit theorem arrives at the Gaussian at a rate of 1/N (since, from the cumulants above, its kurtosis is 3 − (6(p − p²) − 1)/(N (p − p²))).

(The M5 competition is expected to correct for that by making "predictors" predict the minimum (or maximum) in a time series.)

Distribution of the Brier Score

Base probability f. First, we consider the distribution of f, the base probability. We use a beta distribution that covers both the conditional and unconditional case (it is a matter of parametrization of a and b).

Distribution of the probability
Let us refresh a standard result behind nonparametric discussions and tests, dating from Kolmogorov, to show the rationale behind the claim that the probability distribution of probability (sic) is robust; in other words, the distribution of the probability of X doesn't depend on the distribution of X.

The probability integral transform is as follows. Let X have a continuous distribution for which the cumulative distribution function (CDF) is F_X. Then, in the absence of additional information, the random variable U defined as U = F_X(X) is uniform between 0 and 1. The proof is as follows: for u ∈ [0, 1],

    P(U ≤ u) = P( F_X(X) ≤ u ) = P( X ≤ F_X^(−1)(u) ) = F_X( F_X^(−1)(u) ) = u,

which is the cumulative distribution function of the uniform. This is the case regardless of the probability distribution of X.

Clearly we are dealing with 1) f beta distributed (either, as a special case, the uniform distribution when purely random, as derived above, or a beta distribution when one has some accuracy, for which the uniform is a special case), and 2) 1_A a Bernoulli variable with probability p.

Let us consider the general case. Let g_{a,b} be the PDF of the beta:

    g_{a,b}(x) = x^(a−1) (1 − x)^(b−1) / B(a, b),  0 < x < 1.

The mean and standard deviation of λ_n are then

    μ = ( a(a + 1)(1 − p) + b(b + 1) p ) Γ(a + b) / Γ(a + b + 2)

and σ_n² = (1/n) ( E( (f − 1_A)⁴ ) − μ² ), where E( (f − 1_A)⁴ ) = (1 − p) E(f⁴) + p E( (1 − f)⁴ ) with E(f^k) = B(a + k, b)/B(a, b).

We can further verify that the Brier score has thinner tails than the Gaussian, as its kurtosis is lower than 3.

Proof.
We start with y_j = (f_j − 1_{A_j}), the difference between a continuous beta distributed random variable and a discrete Bernoulli one, both indexed by j. The characteristic function of y_j is

    Ψ^(y)(t) = ( 1 − p (1 − e^(−it)) ) ₁F₁( a; a + b; it ),

where ₁F₁(.; .; .) is the Kummer confluent hypergeometric function, ₁F₁(a; b; z) = Σ_{k≥0} (a)_k z^k / ( (b)_k k! ).

From here we get the characteristic function for y_j² = (f_j − 1_{A_j})², which can be written in closed form in terms of the regularized generalized hypergeometric function ₂F̃₂, where ₚF̃_q(a; b; z) = ₚF_q(a; b; z)/(Γ(b₁). . .Γ(b_q)) and (a)_k is the Pochhammer symbol.

We can proceed to prove directly from there the convergence in distribution for the average (1/n) Σ_{i≤n} y_i²:

    lim_{n→∞} Ψ^(y²)(t/n)^n = exp( it ( a(a + 1) + p (b − a)(a + b + 1) ) / ( (a + b)(a + b + 1) ) ),

which is that of a degenerate Gaussian (Dirac) with location parameter ( a(a + 1) + p (b − a)(a + b + 1) ) / ( (a + b)(a + b + 1) ).

We can finally assess the speed of convergence, the rate at which higher moments map to those of a Gaussian distribution; consider the behavior of the fourth cumulant κ₄ = −i⁴ ∂⁴ log Ψ(.)/∂t⁴ |_{t→0}:

1) in the maximum entropy case of a = b = 1 (uniform f), κ₄ = −32/(4725 n³), regardless of p;

2) in the maximum variance case, using l'Hôpital's rule, the limit as a, b → 0 recovers the Bernoulli-type fourth cumulant, also of order 1/n³.

So we have κ₄/κ₂² → 0 as n → ∞, at rate n^(−1).

Further, we can extract the probability density function of the Brier score for n = 1: for 0 < z < 1,

    p(z) = Γ(a + b) ( (1 − p) z^((a−1)/2) (1 − √z)^(b−1) + p (1 − √z)^(a−1) z^((b−1)/2) ) / ( 2 √z Γ(a) Γ(b) ).

ELECTION PREDICTIONS AS MARTINGALES: AN ARBITRAGE APPROACH ‡

We examine the effect of uncertainty on binary outcomes, with application to elections. A standard result in quantitative finance is that when the volatility of the underlying security increases, arbitrage pressures push the corresponding binary option to trade closer to 50%, and it becomes less variable over the remaining time to expiration. Counterintuitively, the higher the uncertainty of the underlying security, the lower the volatility of the binary option. This effect should hold in all domains where a binary price is produced, yet we observe severe violations of these principles in many areas where binary forecasts are made, in particular those concerning the U.S. presidential election of 2016. We observe stark errors among political scientists and forecasters, for instance with 1) assessors giving the candidate D. Trump chances of success ranging from a fraction of a percent to a few percent, 2) large jumps in the revisions of forecasts, both made while invoking uncertainty.

Conventionally, the quality of election forecasting has been assessed statically by De Finetti's method, which consists in minimizing the Brier score, a metric of divergence from the final outcome (the standard for tracking the accuracy of probability assessors across domains, from elections to weather). No intertemporal evaluations of changes in estimates appear to have been imposed outside the

Research chapter. The author thanks Dhruv Madeka and Raphael Douady for detailed and extensive discussions of the paper as well as thorough auditing of the proofs across the various iterations, and, worse, the numerous changes of notation. Peter Carr helped with discussions on the properties of a bounded martingale and the transformations.
I thank David Shimko, Andrew Lesniewski, and Andrew Papanicolaou for comments. I thank Arthur Breitman for guidance with the literature on numerical approximations of the various logistic-normal integrals. I thank participants of the Tandon School of Engineering and Bloomberg Quantitative Finance Seminars. I also thank Bruno Dupire, Mike Lawler, the Editors-in-Chief of Quantitative Finance, and various friendly people on social media. Dhruv Madeka, then at Bloomberg, while working on a similar problem, independently came up with the same relationships between the volatility of an estimate and its bounds, and the same arbitrage bounds. All errors are mine.
In this chapter we take a dynamic, continuous-time approach based on the principles of quantitative finance and argue that a probabilistic estimate of an election outcome by a given "assessor" needs to be treated like a tradable price, that is, as a binary option value subjected to arbitrage boundaries (particularly since binary options are actually used in betting markets). Future revised estimates need to be compatible with martingale pricing; otherwise intertemporal arbitrage is created, by "buying" and "selling" from the assessor.

A mathematical complication arises as we move to continuous time and apply the standard martingale approach: namely that, as a probability forecast, the underlying security lives in [0,1]. Our approach is to create a dual (or "shadow") martingale process Y, in an interval [L, H], from an arithmetic Brownian motion X in (−∞, ∞), and price elections accordingly. The dual process Y can for example represent the numerical votes needed for success. A complication is that, because of the transformation from X to Y, if Y is a martingale, X cannot be a martingale (and vice versa).

The process for Y allows us to build an arbitrage relationship between the volatility of a probability estimate and that of the underlying variable, e.g. the vote number. Thus we are able to show that when there is high uncertainty about the final outcome, 1) the arbitrage value of the forecast (as a binary option) indeed gets closer to 50%, and 2) the estimate should not undergo large changes even if polls or other bases show significant variations.

The pricing links are between 1) the binary option value (that is, the forecast probability), 2) the estimation of Y, and 3) the volatility of the estimation of Y over the remaining time to expiration (see Figures . and . ).

. . Main results
For convenience, we start with our notation.
Notation

A central property of our model is that it prevents B(.) from varying more than the estimated Y: in a two-candidate contest, it will be capped (floored) at Y if lower (higher) than ½. In practice, we can observe probabilities of winning of …% vs. …% from a narrower spread of estimated votes of …% vs. …%; our approach prevents, under high uncertainty, the probabilities from diverging away from the estimated votes. But it remains conservative enough to not give a higher proportion.

Y₀ — the observed estimated proportion of votes expressed in [0,1] at time t₀. These can be either popular or electoral votes, so long as one treats them with consistency.

T — period when the irrevocable final election outcome Y_T is revealed, or expiration.

t₀ — present evaluation period, hence T − t₀ is the time until the final election, expressed in years.

s — annualized volatility of Y, or uncertainty attending outcomes for Y in the remaining time until expiration. We assume s is constant without any loss of generality – but it could be time dependent.

B(.) — "forecast probability", or estimated continuous-time arbitrage evaluation of the election results, establishing arbitrage bounds between B(.), Y₀ and the volatility s.

Main results

$$B(Y_0,\sigma,t_0,T)=\frac{1}{2}\operatorname{erfc}\!\left(\frac{l-\operatorname{erf}^{-1}(2Y_0-1)\,e^{\sigma^2(T-t_0)}}{\sqrt{e^{2\sigma^2(T-t_0)}-1}}\right), \qquad ( . )$$

where

$$\sigma\approx\frac{\sqrt{\log\!\left(2\pi s^2\, e^{2\operatorname{erf}^{-1}(2Y_0-1)^2}+1\right)}}{\sqrt{2}\,\sqrt{T-t_0}}, \qquad ( . )$$

l is the threshold needed (defaults to ½), and erfc(.) is the standard complementary error function, 1 − erf(.), with erf(z) = (2/√π) ∫₀^z e^{−t²} dt.

We find it appropriate here to answer the usual comment by statisticians and people operating outside of mathematical finance: "why not simply use a Beta-style distribution for Y?".
The answer is that 1) the main purpose of the paper is establishing (arbitrage-free) time consistency in binary forecasts, and 2) we are not aware of a continuous-time stochastic process that accommodates a beta distribution or a similarly bounded conventional one.

. . Organization
The remaining parts of the paper are organized as follows. First, we show the process for Y and the needed transformations from a specific Brownian motion. Second, we derive the arbitrage relationship used to obtain equation ( . ). Finally, we discuss De Finetti's approach and show how a martingale valuation relates to minimizing the conventional standard in the forecasting industry, namely the Brier score.

A comment on absence of closed form solutions for s. We note that for Y we lack a closed form solution for the integral reflecting the total variation:
$$\int_{t_0}^{T}\frac{\sigma}{\sqrt{\pi}}\,e^{-\operatorname{erf}^{-1}(2y_s-1)^2}\,ds,$$
though the corresponding one for X is computable. Accordingly, we have relied on propagation of uncertainty methods to obtain a closed form solution for the probability density of Y, though not explicitly its moments, as the logistic-normal integral does not lend itself to simple expansions [ ].

Time slice distributions for X and Y. The time slice distribution is the probability density function of Y from time t₀, that is, the one-period representation, starting at t₀ with y₀ = ½ + ½ erf(x₀). Inversely, for X given y₀ and the corresponding x₀, X may be found to be normally distributed for the period T − t₀ with
$$E(X,T)=X_0\, e^{\sigma^2(T-t_0)},\qquad V(X,T)=\frac{e^{2\sigma^2(T-t_0)}-1}{2};$$
the corresponding distribution of Y with initial value y₀ is given by
$$\varphi(y;y_0,t)=\frac{1}{\sqrt{e^{2\sigma^2(t-t_0)}-1}}\exp\Big\{\operatorname{erf}^{-1}(2y-1)^2-\tfrac{1}{2}\big(\coth\!\big(\sigma^2(t-t_0)\big)-1\big)\big(\operatorname{erf}^{-1}(2y-1)-\operatorname{erf}^{-1}(2y_0-1)\,e^{\sigma^2(t-t_0)}\big)^2\Big\}, \qquad ( . )$$
and we have E(Y_t) = Y₀.

As to the variance of Y, as mentioned above, it does not lend itself to a closed-form solution derived from φ(.), nor from the stochastic integral; but it can be easily estimated from the closed form distribution of X using methods of propagation of uncertainty for the first two moments (the delta method). Since the variance of a function f of a finite-moment random variable X can be approximated as V(f(X)) = f′(E(X))² V(X), with x₀ = erf⁻¹(2Y₀ − 1):
$$s\approx\sqrt{\frac{e^{-2\operatorname{erf}^{-1}(2Y_0-1)^2}\left(e^{2\sigma^2(T-t_0)}-1\right)}{2\pi}}. \qquad ( . )$$
Likewise for calculations in the opposite direction, we find
$$\sigma\approx\frac{\sqrt{\log\!\left(2\pi s^2\, e^{2\operatorname{erf}^{-1}(2Y_0-1)^2}+1\right)}}{\sqrt{2}\,\sqrt{T-t_0}},$$
which is ( . ) in the presentation of the main result.

Note that expansions including higher moments do not bring a material increase in precision – although s is highly nonlinear around the center, the range of values for the volatility of the total or, say, the electoral college is too low to affect higher order terms in a significant way, in addition to the boundedness of the sigmoid-style transformations.
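The main valuation formula and the two volatility mappings above are straightforward to implement. Below is a minimal numerical sketch (ours, not the authors' code; parameter values are illustrative), using only the standard library, with the inverse error function built from the normal quantile:

```python
import math
from statistics import NormalDist

def erfinv(z):
    # inverse error function via the standard normal quantile:
    # erf(w) = z  <=>  w = Phi^{-1}((z + 1)/2) / sqrt(2)
    return NormalDist().inv_cdf((z + 1) / 2) / math.sqrt(2)

def x_vol_from_y_vol(y0, s, tau):
    # the inverse volatility relation: X-volatility sigma implied by the
    # dispersion s of the vote share Y over the remaining time tau
    x0 = erfinv(2 * y0 - 1)
    return math.sqrt(math.log(2 * math.pi * s**2 * math.exp(2 * x0**2) + 1) / (2 * tau))

def y_vol_from_x_vol(y0, sigma, tau):
    # delta-method counterpart, used here to check the inversion
    x0 = erfinv(2 * y0 - 1)
    return math.sqrt(math.exp(-2 * x0**2) * (math.exp(2 * sigma**2 * tau) - 1) / (2 * math.pi))

def binary_value(y0, sigma, tau, l=0.0):
    # forecast probability; l is the X-space threshold (l = 0 maps to Y = 1/2)
    x0 = erfinv(2 * y0 - 1)
    mean = x0 * math.exp(sigma**2 * tau)                  # E(X_T | X_t0)
    spread = math.sqrt(math.exp(2 * sigma**2 * tau) - 1)  # sqrt(2 V(X_T))
    return 0.5 * math.erfc((l - mean) / spread)
```

With y0 = 0.6 and one year to expiration, `binary_value` moves from near-certainty at tiny sigma toward the estimated share itself at very large sigma — i.e., toward ½ relative to a 0-or-1 call, as described in the text — while composing the two volatility functions recovers the input, confirming they are inverses.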
Figure . : Theoretical approach (top) vs practice (bottom). Shows how the estimation process cannotbe in sync with the volatility of the estimation of (electoral or other) votes as it violates arbitrageboundaries. . . A Discussion on Risk Neutrality
We apply risk-neutral valuation, for lack of conviction regarding another way, as a default option. Although Y may not necessarily be tradable, adding a risk premium for the process involved in determining the arbitrage valuation would necessarily imply a negative one for the other candidate(s), which is hard to justify. Further, option values, or binary bets, need to satisfy a no Dutch Book argument (the De Finetti form of no-arbitrage) (see [ ]), i.e. properly priced binary options interpreted as probability forecasts give no betting "edge" in all outcomes without loss. Finally, any departure from risk neutrality would degrade the Brier score (about which, below), as it would represent a diversion from the final forecast. Also note the absence of the assumptions of a financing rate usually present in financial discussions.

Let F(.) be a function of a variable X satisfying
$$dX_t=\sigma^2 X_t\, dt+\sigma\, dW_t. \qquad ( . )$$
We wish to show that X has a simple Bachelier option price B(.). The idea of no arbitrage is that a continuously made forecast must itself be a martingale. Applying Itô's Lemma to F ≜ B for X satisfying ( . ) yields
$$dF=\left[\sigma^2 X\frac{\partial F}{\partial X}+\frac{1}{2}\sigma^2\frac{\partial^2 F}{\partial X^2}+\frac{\partial F}{\partial t}\right]dt+\sigma\frac{\partial F}{\partial X}\,dW;$$
so that, for B to be a martingale, F must satisfy the partial differential equation
$$\frac{1}{2}\sigma^2\frac{\partial^2 F}{\partial X^2}+\sigma^2 X\frac{\partial F}{\partial X}+\frac{\partial F}{\partial t}=0, \qquad ( . )$$
which is the driftless condition that makes B a martingale.

For a binary (call) option, we have for terminal conditions B(X, t) ≜ F, F_T = θ(x − l), where θ(.)
is the Heaviside theta function and l is the threshold:
$$\theta(x):=\begin{cases}1, & x\ge l\\ 0, & x< l,\end{cases}$$
with initial condition x₀ at time t₀ and the terminal condition above; the solution is
$$\frac{1}{2}\operatorname{erfc}\!\left(\frac{l-x_0\, e^{\sigma^2(T-t_0)}}{\sqrt{e^{2\sigma^2(T-t_0)}-1}}\right),$$
which is, simply, the survival function of the Normal distribution parametrized under the process for X.

Likewise we note from the earlier one-to-one argument (one can use Borel set arguments) that
$$\theta(y):=\begin{cases}1, & y\ge S(l)\\ 0, & y< S(l),\end{cases}$$
so we can price the alternative process B(Y, t) = P(Y > ½) (or any other similarly obtained threshold l) by pricing B(Y₀, t₀) = P(x > S⁻¹(l)). The pricing from the proportion of votes is given by:
$$B(Y_0,\sigma,t_0,T)=\frac{1}{2}\operatorname{erfc}\!\left(\frac{l-\operatorname{erf}^{-1}(2Y_0-1)\,e^{\sigma^2(T-t_0)}}{\sqrt{e^{2\sigma^2(T-t_0)}-1}}\right),$$
the main equation ( . ), which can also be expressed less conveniently as
$$B(y_0,\sigma,t_0,T)=\frac{1}{\sqrt{e^{2\sigma^2(T-t_0)}-1}}\int_{l}^{1}\exp\Big\{\operatorname{erf}^{-1}(2y-1)^2-\tfrac{1}{2}\big(\coth\!\big(\sigma^2(T-t_0)\big)-1\big)\big(\operatorname{erf}^{-1}(2y-1)-\operatorname{erf}^{-1}(2y_0-1)\,e^{\sigma^2(T-t_0)}\big)^2\Big\}\,dy.$$
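As a sanity check on the derivation — a sketch of ours with illustrative parameter values, not part of the original text — one can verify numerically that the closed-form binary value satisfies the driftless partial differential equation above:

```python
import math

SIGMA, T, L_THRESH = 0.8, 1.0, 0.0   # illustrative parameters; l = 0 maps to Y = 1/2

def F(x, t):
    # closed-form binary value for the process dX = sigma^2 X dt + sigma dW
    tau = T - t
    mean = x * math.exp(SIGMA**2 * tau)
    spread = math.sqrt(math.exp(2 * SIGMA**2 * tau) - 1)
    return 0.5 * math.erfc((L_THRESH - mean) / spread)

def pde_residual(x, t, h=1e-3):
    # 0.5 sigma^2 F_xx + sigma^2 x F_x + F_t, via central differences;
    # should vanish if F is a martingale value
    F_x  = (F(x + h, t) - F(x - h, t)) / (2 * h)
    F_xx = (F(x + h, t) - 2 * F(x, t) + F(x - h, t)) / h**2
    F_t  = (F(x, t + h) - F(x, t - h)) / (2 * h)
    return 0.5 * SIGMA**2 * F_xx + SIGMA**2 * x * F_x + F_t
```

The residual is at the level of finite-difference noise at any interior point, consistent with the driftless condition.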
Figure . : Process X and dual process Y (sample paths over time t).

Y_T is the terminal value of a process on election day. It lives in [0,1] but can be generalized to the broader [L, H], L, H ∈ [0, ∞). The threshold for a given candidate to win is fixed at l. Y can correspond to raw votes, electoral votes, or any other metric. We assume that Y_{t₀} is an intermediate realization of the process at t₀, either produced synthetically from polls (corrected estimates) or other such systems.

Next, we create, for an unbounded arithmetic stochastic process, a bounded "dual" stochastic process using a sigmoidal transformation. It can be helpful to map processes such as a bounded electoral process to a Brownian motion, or to map a bounded payoff to an unbounded one, see Figure . .

Proposition .
Under sigmoidal-style transformations S: x ↦ y, ℝ → [0,1], of the form a) ½ + ½ erf(x), or b) 1/(1 + e^{−x}), if X is a martingale, Y is only a martingale for Y = ½, and if Y is a martingale, X is only a martingale for X = 0.

Proof. The proof is sketched as follows. From Itô's lemma, the drift term for dX_t becomes 1) σ²X(t), or 2) (σ²/2) tanh(X(t)/2), where σ denotes the volatility, respectively with transformations of the forms a) and b) of X_t, under a martingale for Y. The drift for dY_t becomes 1) −σ² erf⁻¹(2Y−1) e^{−erf⁻¹(2Y−1)²}/√π, or 2) ½σ² Y(1−Y)(1−2Y), under a martingale for X.

We therefore select the case of Y being a martingale and present the details of transformation a). The properties of the process have been developed by Carr [ ]. Let X be the arithmetic Brownian motion ( . ), with X-dependent drift and constant scale σ:
$$dX_t=\sigma^2 X_t\, dt+\sigma\, dW_t,\qquad 0<t<T<+\infty.$$
We note that this has similarities with the Ornstein-Uhlenbeck process normally written dX_t = θ(μ − X_t) dt + σ dW, except that we have μ = 0 and violate the rules by using a negative mean-reversion coefficient, rather more adequately described as "mean repelling", θ = −σ².
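The martingale property in the Proposition can be illustrated by simulation — a sketch of ours assuming transformation a) and the Gaussian transition of X given earlier (mean x₀ e^{σ²τ}, variance (e^{2σ²τ} − 1)/2); parameter values are illustrative:

```python
import math
import random
from statistics import NormalDist, fmean

def erfinv(z):
    return NormalDist().inv_cdf((z + 1) / 2) / math.sqrt(2)

def S(x):
    # transformation a): maps R one-to-one onto (0, 1)
    return 0.5 + 0.5 * math.erf(x)

def sample_Y_T(y0, sigma, tau, rng):
    # exact draw of Y_T given Y_t0 = y0, via the Gaussian transition of X
    x0 = erfinv(2 * y0 - 1)
    mean = x0 * math.exp(sigma**2 * tau)
    sd = math.sqrt((math.exp(2 * sigma**2 * tau) - 1) / 2)
    return S(rng.gauss(mean, sd))

rng = random.Random(42)
draws = [sample_Y_T(0.55, 0.5, 1.0, rng) for _ in range(20000)]
```

The sample mean of the draws stays at the initial estimate 0.55 — the martingale property E(Y_t) = Y₀ — and every draw remains strictly inside (0,1), even though X itself is unbounded.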
We map from X ∈ (−∞, ∞) to its dual process Y as follows. With S: ℝ → [0,1], Y = S(x),
$$S(x)=\frac{1}{2}+\frac{1}{2}\operatorname{erf}(x),$$
the dual process (by unique transformation, since S is one-to-one) becomes, for y ≜ S(x), using Itô's lemma (since S(.) is twice differentiable and ∂S/∂t = 0):
$$dS=\left(\sigma^2 x\frac{\partial S}{\partial x}+\frac{1}{2}\sigma^2\frac{\partial^2 S}{\partial x^2}\right)dt+\sigma\frac{\partial S}{\partial x}\,dW,$$
which with zero drift can be written as a process
$$dY_t=\sigma(Y)\,dW_t,$$
for all t > t₀, E(Y_t | Y_{t₀}) = Y_{t₀}, and scale
$$\sigma(Y)=\frac{\sigma}{\sqrt{\pi}}\,e^{-\operatorname{erf}^{-1}(2y-1)^2},$$
which, as we can see in Figure . , can be approximated by the quadratic function y(1 − y) times a constant.

Figure . : The instantaneous volatility of Y as a function of the level of Y, for two different methods of transformation of X, which appear to not be substantially different. We compare to the quadratic form y(1 − y) scaled by a constant. The volatility declines as we move away from ½ and collapses at the edges, thus maintaining Y in (0,1). For simplicity we assumed σ = 1, T − t₀ = 1.

We can recover equation ( . ) by inverting, namely S⁻¹(y) = erf⁻¹(2y − 1); the pricing can be done on either X or Y, even if one process has a drift while the other is a martingale. In other words, one may apply one's estimation to the electoral threshold, or to the more complicated X, with the same results. And, to summarize our method, pricing an option on X is familiar, as it is exactly a Bachelier-style option price.

This section provides a brief background for the conventional approach to probability assessment. The great De Finetti [ ] has shown that the "assessment" of the "probability" of the realization of a random variable in {
0, 1} requires a nonlinear loss function – which makes his definition of probabilistic assessment differ from that of the P/L of a trader engaging in binary bets.

Figure . : Bruno de Finetti (1906–1985). A probabilist, philosopher, and insurance mathematician, he formulated the Brier score for probabilistic assessment, which we show is compatible dynamically with a martingale. Source: DeFinetti.org

Assume that a betting agent in an n-repeated two-period model, t₁ and t₂, produces a strategy S of bets b_i ∈ [0,1] indexed by i = 1, 2, ..., n, with the realization of the binary r.v. ξ_{t₂,i}. If we take the absolute variation of his P/L over n bets, it will be
$$L_1(S)=\frac{1}{n}\sum_{i=1}^{n}\left|\xi_{t_2,i}-b_{t_1,i}\right|.$$
For example, assume that E(ξ_{t₂}) = ½. Betting on the probability, here ½, produces a loss of ½ in expectation, which is the same as betting either 0 or 1 – hence not favoring the agent to bet on the exact probability.

If we work with the same random variable and non-time-varying probabilities, the L₂ metric would be appropriate:
$$L_2(S)=\frac{1}{n}\left|\sum_{i=1}^{n}\xi_{t_2,i}-\sum_{i=1}^{n}b_{t_1,i}\right|.$$
De Finetti proposed a "Brier score" type function, a quadratic loss function in L₃:
$$L_3(S)=\frac{1}{n}\sum_{i=1}^{n}\left(\xi_{t_2,i}-b_{t_1,i}\right)^2,$$
the minimum of which is reached for b_{t₁,i} = E(ξ_{t₂,i}).

In our world of continuous-time derivative valuation, where, in place of a two-period lattice model, we are interested, for the same final outcome at t₂, in the stochastic process b_t, t₀ ≤ t ≤ t₂, the arbitrage "value" of a bet on a binary outcome needs to match the expectation; hence, again, we map to the Brier score – by an arbitrage argument. Although there is no quadratic loss function involved, the fact that the bet is a function of a martingale, which is required to be itself a martingale, i.e.
that the conditional expectation remains invariant to time, does not allow an arbitrage to take place. A "high" price can be "shorted" by the arbitrageur, a "low" price can be "bought", and so on repeatedly. The consistency between bets at period t and other periods t + Δt enforces the probabilistic discipline. In other words, someone can "buy" from the forecaster then "sell" back to him, generating a positive expected "return" if the forecaster is out of line with martingale valuation.

As to the current practice by forecasters, although some election forecasters appear to be aware of the need to minimize their Brier score, the idea that the revisions of estimates should also be subjected to martingale valuation is not well established. As can be seen in Figure . , a binary option reveals more about uncertainty than about the true estimation, a result well known to traders, see [ ].

In the presence of more than two candidates, the process can be generalized with the following heuristic approximation. Establish the stochastic process for Y_{1,t}, and, just as Y_{1,t} is a process in [0,1], let Y_{2,t} be a process in (Y_{1,t}, 1], with Y_{3,t} the residual 1 − Y_{2,t} − Y_{1,t}; more generally, Y_{n−1,t} is a process in (Y_{n−2,t}, 1] and Y_{n,t} is the residual Y_{n,t} = 1 − Σ_{i=1}^{n−1} Y_{i,t}. For n candidates, the nth is the residual.

addendum: all roads lead to quantitative finance

Background
Aubrey Clayton sent a letter to the editor complaining about the previous piece on grounds of "errors" in the above methodology. The author answered, with Dhruv Madeka, not quite to Clayton, but rather to express the usefulness of quantitative finance methods in life.
We are happy to respond to Clayton's (non-reviewed) letter, in spite of its confusions, as it will give us the opportunity to address more fundamental misunderstandings of the role of quantitative finance in general, and arbitrage pricing in particular, and proudly show how "all roads lead to quantitative finance", that is, that arbitrage approaches are universal and applicable to all manner of binary forecasting. It also allows the second author to comment from his paper, Madeka ( ) [ ], which independently and simultaneously obtained similar results to Taleb ( ) [ ].

Incorrect claims
The statements "Taleb's criticism of popular forecast probabilities, specifically the election forecasts of FiveThirtyEight..." and "He [Taleb] claims this means the FiveThirtyEight forecasts must have violate[d] arbitrage boundaries" are factually incorrect. There is no mention of FiveThirtyEight in [ ], and Clayton must be confusing scientific papers with Twitter debates. The paper is an attempt at addressing elections in a rigorous manner, not journalistic discussion, and only mentions the election in one illustrative sentence. Let us however continue probing Clayton's other assertions, in spite of his confusion and the nature of the letter.
Incorrect arbitrage valuation
Clayton claims either an error ("First, one of the 'standard results' of quantitative finance that his election forecast assessments rely on is false", he initially writes) or, as he confusingly retracts, something "only partially true". Again, let us set aside that Taleb ( ) [ ] makes no "assessment" of FiveThirtyEight's record, and outline his reasoning.

Clayton considers three periods: t₀ = 0, an intermediate period t₁, and a terminal one T, with t₀ ≤ t₁ < T. Clayton shows a special case of the distribution of the forward probability, seen at t₀, for the period starting at t₁ and ending at T. It is a uniform distribution for that specific time period. In fact, under his construction, using the probability integral transform, one can show that the probabilities follow what resembles a symmetric beta distribution with parameters a and b, with a = b. For the specific period he considers, we have a = b = 1 (hence the uniform distribution). Before it, the distribution is ∩-shaped, collapsing to a Dirac at t₁ = t₀; beyond it, the distribution is ∪-shaped, ending with two Dirac sticks at 0 and 1 (like a Bernoulli) when t₁ is close to T (and close to an arcsine distribution, a = b = ½, somewhere in between).

Clayton's construction is indeed misleading, since he analyzes the distribution of the price at time t₁ with the filtration at time t₀, particularly when discussing arbitrage pricing and arbitrage pressures. Agents value options between t₁ and T at time t₁ (not period t₀), with an underlying price: under such a constraint, the binary option automatically converges towards ½ as σ → ∞, and that for any value of the underlying price, no matter how far away from the strike price (or threshold). The σ here is never past realized, only future unrealized, volatility. This can be seen within the framework presented in Taleb ( ) [ ], but also by taking any binary option pricing model. A price is not a probability (less even a probability distribution), but an expectation.
Simply, as arbitrage operators, we look at future volatility given information about the underlying when pricing a binary option, not the distribution of probability itself in the unconditional abstract. At infinite σ, it becomes all noise, and such a level of noise drowns all signals.

Incidentally, the problem with FiveThirtyEight isn't changing probabilities from .55 to .85 within a months-long period; it is performing abrupt changes within a much shorter timespan – and that was discussed in Madeka ( ) [ ].

Another way to view the pull of uncertainty towards ½ is in using information theory and the notion of maximum entropy under deep uncertainty: the entropy (I) of a Bernoulli distribution with probabilities p and (1 − p),
$$I=-\big((1-p)\log(1-p)+p\log(p)\big),$$
is maximal at ½. To beat a pricing, one needs to have enough information to beat the noise. As we will see in the next section, it is not easy.

Arbitrage matters
Another result from quantitative finance that puts bounds on the volatility of forecasting is as follows. Since election forecasts can be interpreted as a European binary option, we can exploit the fact that the price process of this option is bounded between 0 and 1 to make claims about the volatility of the price itself. Essentially, if the price of the binary option varies too much, a simple trading strategy of buying low and selling high is guaranteed to produce a profit. The argument can be summed up by noting that if we consider an arithmetic Brownian motion that is bounded between [L, H]:
$$dB_t=\sigma\, dW_t, \qquad ( . )$$
the stochastic integral
$$2\int_0^T (B_0-B_t)\,dB_t=\sigma^2 T-(B_T-B_0)^2$$
can be replicated at zero cost, indicating that the value of σ²T is bounded by the maximum value of the squared difference on the right-hand side of the equation. That is, a forecaster who produces excessively volatile probabilities – if he or she is willing to trade on such a forecast (i.e. they have skin in the game) – can be arbitraged by following a strategy that sells (proportionally) when the forecast is too high and buys (proportionally) when the forecast is too low.

To conclude, any numerical probabilistic forecasting should be treated like a choice price – De Finetti's intuition is that forecasts should have skin in the game. Under these conditions, binary forecasting belongs to the rules of arbitrage and derivative pricing, well mapped in quantitative finance. Using a quantitative finance approach to produce binary forecasts does not prevent Bayesian methods (Taleb ( ) does not say probabilities should be ½, only that there is a headwind towards that level owing to arbitrage pressures and constraints on how variable a forecast can be). It is just that there is one price that counts at the end, 1 or 0, which puts a structure on the updating.
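On any discrete rebalancing grid the replication argument is pure algebra: 2 Σ_t (B₀ − B_t)(B_{t+1} − B_t) = Σ_t (B_{t+1} − B_t)² − (B_T − B₀)² holds path by path. A sketch of ours with a hypothetical, deliberately over-volatile forecast path (not real forecast data):

```python
import random

def strategy_pnl(path):
    # P&L of holding 2 (B_0 - B_t) units of the binary over each rebalance:
    # sell (proportionally) when the forecast is high, buy when it is low
    b0 = path[0]
    return sum(2 * (b0 - bt) * (bn - bt) for bt, bn in zip(path, path[1:]))

def quadratic_variation(path):
    return sum((bn - bt)**2 for bt, bn in zip(path, path[1:]))

# a deliberately over-volatile "forecast" path, clipped to [0, 1]
rng = random.Random(1)
path = [0.5]
for _ in range(250):
    path.append(min(1.0, max(0.0, path[-1] + rng.choice([-0.15, 0.15]))))
```

Since the price is confined to [0, 1], the squared terminal difference is at most 1; once the path's realized quadratic variation exceeds that bound, the zero-cost strategy's profit is strictly positive — the forecaster is arbitraged.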
We take this result from Bruno Dupire's notes for his continuous-time finance class at NYU's Courant Institute, particularly his final exam for the Spring of .

Another way to see it, from outside our quantitative finance models: consider a standard probabilistic score. Let X₁, ..., X_n be random variables in [0,1] and B_T a constant in {0, 1}; we have the λ score
$$\lambda_n=\frac{1}{n}\sum_{i=1}^{n}(x_i-B_T)^2,$$
which needs to be minimized (on a single outcome B_T). For any given B_T and an average forecast x̄ = (1/n) Σ_{i=1}^n x_i, the minimum value of λ_n is reached for x₁ = ... = x_n. To beat a Dirac forecast x₁ = ... = x_n = ½, for which λ = ¼, with a high-variance strategy, one needs to have 75% accuracy. (Note that a uniform forecast has a score of ⅓.) This shows us the trade-off between volatility and signal.

The reason Clayton might have trouble with quantitative finance could be that probabilities and underlying polls may not be martingales in real life; traded probabilities (hence real forecasts) must be martingales. Which is why in Taleb ( ) [ ] the process for the polls (which can be vague and nontradable) needs to be transformed into a process for a probability in [0, 1].

acknowledgments

Raphael Douady, students at NYU's Tandon School of Engineering, participants at the Bloomberg Quantitative Finance Seminar in New York.

Part IV
INEQUALITY ESTIMATORS UNDER FAT TAILS

GINI ESTIMATION UNDER INFINITE VARIANCE ‡

This chapter is about the problems related to the estimation of the Gini index in the presence of a fat-tailed data generating process, i.e. one in the stable distribution class with finite mean but infinite variance (i.e. with tail index α ∈ (1, 2)). We show that, in such a case, the Gini coefficient cannot be reliably estimated using conventional nonparametric methods, because of a downward bias that emerges under fat tails. This has important implications for the ongoing discussion about economic inequality.

We start by discussing how the nonparametric estimator of the Gini index undergoes a phase transition in the symmetry structure of its asymptotic distribution, as the data distribution shifts from the domain of attraction of a light-tailed distribution to that of a fat-tailed one, especially in the case of infinite variance. We also show how the nonparametric Gini bias increases with lower values of α. We then prove that maximum likelihood estimation outperforms nonparametric methods, requiring a much smaller sample size to reach efficiency. Finally, for fat-tailed data, we provide a simple correction mechanism to the small-sample bias of the nonparametric estimator, based on the distance between the mode and the mean of its asymptotic distribution.

Wealth inequality studies represent a field of economics, statistics and econophysics exposed to fat-tailed data generating processes, often with infinite variance [ , ]. This is not at all surprising if we recall that the prototype of fat-tailed distributions, the Pareto, was proposed for the first time to model household incomes [ ].

Research chapter. (With A. Fontanari and P. Cirillo), coauthors.
However, the fat-tailedness of data can be problematic in the context of wealth studies, as the property of efficiency (and, partially, consistency) does not necessarily hold for many estimators of inequality and concentration [ , ]. The scope of this work is to show how fat tails affect the estimation of one of the most celebrated measures of economic inequality, the Gini index [ , , ], often used (and abused) in the econophysics and economics literature as the main tool for describing the distribution and the concentration of wealth around the world [ , ].

The literature concerning the estimation of the Gini index is wide and comprehensive (e.g. [ , ] for a review); however, strangely enough, almost no attention has been paid to its behavior in the presence of fat tails, and this is curious if we consider that: 1) fat tails are ubiquitous in the empirical distributions of income and wealth [ , ], and 2) the Gini index itself can be seen as a measure of variability and fat-tailedness [ , , , ].

The standard method for the estimation of the Gini index is nonparametric: one computes the index from the empirical distribution of the available data using Equation ( . ) below. But, as we show in this paper, this estimator suffers from a downward bias when we deal with fat-tailed observations. Therefore our goal is to close this gap by deriving the limiting distribution of the nonparametric Gini estimator in the presence of fat tails, and to propose possible strategies to reduce the bias. We show how the maximum likelihood approach, despite the risk of model misspecification, needs far fewer observations to reach efficiency when compared to a nonparametric one.

Our results are relevant to the discussion about wealth inequality, recently rekindled by Thomas Piketty in [ ], as the estimation of the Gini index under fat tails and infinite variance may cause several economic analyses to be unreliable, if not markedly wrong. Why should one trust a biased estimator?
Figure . : The Italian statistician Corrado Gini, 1884–1965. Source: Bocconi.

A similar bias also affects the nonparametric measurement of quantile contributions, i.e. those of the type "the top 1% owns x% of the total wealth" [ ]. This paper extends the problem to the more widespread Gini coefficient, and goes deeper by making links with the limit theorems.

By fat-tailed data we indicate those data generated by a positive random variable X with cumulative distribution function (c.d.f.) F(x), which is regularly varying of order α [ ], that is, for F̄(x) := 1 − F(x), one has
$$\bar F(x)=L(x)\,x^{-\alpha}, \qquad ( . )$$
where L(x) is a slowly varying function, such that lim_{x→∞} L(cx)/L(x) = 1 for c > 0, and where α > 0 is the tail exponent, the key parameter when dealing with the probabilistic behavior of maxima and minima. As pointed out in [ ], regularly varying and fat-tailed are indeed synonyms. It is known that, if X₁, ..., X_n are i.i.d. observations with a c.d.f. F(x) in the regularly varying class, as defined in Equation ( . ), then their data generating process falls into the maximum domain of attraction of a Fréchet distribution with parameter ρ; in symbols, X ∈ MDA(Φ(ρ)) [ ]. This means that, for the partial maximum M_n = max(X₁, ..., X_n), one has
$$P\!\left(a_n^{-1}(M_n-b_n)\le x\right)\xrightarrow{d}\Phi(\rho)=e^{-x^{-\rho}},\qquad \rho>0, \qquad ( . )$$
with a_n > 0 and b_n ∈ ℝ two normalizing constants. Clearly, the connection between the regularly varying coefficient α and the Fréchet distribution parameter ρ is given by α = ρ [ ].

The Fréchet distribution is one of the limiting distributions for maxima in extreme value theory, together with the Gumbel and the Weibull; it represents the fat-tailed and unbounded limiting case [ ]. The relationship between regularly varying random variables and the Fréchet class thus allows us to deal with a very large family of random variables (and empirical data), and allows us to show how the Gini index is highly influenced by maxima, i.e. extreme wealth, as clearly suggested by intuition [ , ], especially under infinite variance. Again, this recommends some caution when discussing economic inequality under fat tails.

It is worth remembering that the existence (finiteness) of the moments for a fat-tailed random variable X depends on the tail exponent α:
$$E(X^{\delta})<\infty \text{ if } \delta<\alpha,\qquad E(X^{\delta})=\infty \text{ if } \delta>\alpha. \qquad ( . )$$
In this work we restrict our focus to data generating processes with finite mean and infinite variance, therefore, according to Equation ( . ), to the class of regularly varying distributions with tail index α ∈ (1, 2).

Table . and Figure . present numerically and graphically our story, already suggesting its conclusion, on the basis of artificial observations sampled from a Pareto distribution (Equation ( . ) below) with tail parameter α equal to 1.1. Table . compares the nonparametric Gini index of Equation ( . ) with the maximum likelihood (ML) tail-based one of Section . . For the different sample sizes in Table . , we have generated 10 samples, averaging the estimators via Monte Carlo.
As the first column shows, the convergence of the nonparametric estimator to the true Gini value (g = 0.8333) is extremely slow and monotonically increasing; this suggests an issue not only in the tail structure of the distribution of the nonparametric estimator but also in its symmetry.

Figure . provides some numerical evidence that the limiting distribution of the nonparametric Gini index loses its properties of normality and symmetry [ ], shifting towards a skewed and fatter-tailed limit, when data are characterized by an infinite variance. As we prove in Section . , when the data generating process is in the domain of attraction of a fat-tailed distribution, the asymptotic distribution of the Gini index becomes a skewed-to-the-right α-stable law. This change of behavior is responsible for the downward bias of the nonparametric Gini under fat tails. However, the knowledge of the new limit allows us to propose a correction for the nonparametric estimator, improving its quality, and thus reducing the risk of badly estimating wealth inequality, with all the possible consequences in terms of economic and social policies [ , ].

Table . : Comparison of the Nonparametric (NonPar) and the Maximum Likelihood (ML) Gini estimators, using Paretian data with tail α = 1.1 (finite mean, infinite variance) and different sample sizes. Number of Monte Carlo simulations: .

n (number of obs.) | NonPar Mean | NonPar Bias | ML Mean | ML Bias | Error Ratio
122 0 . . . - .
083 0 . . - .
058 0 . . . - .
043 0 . . - .
031 0 . + Figure . : Histograms for theGini nonparametric estimatorsfor two Paretian (type I) distri-butions with different tail indices,with finite and infinite variance(plots have been centered to easecomparison). Sample size: .Number of samples: for eachdistribution. The rest of the paper is organized as follows. In Section . we derive the asymp-totic distribution of the sample Gini index when data possess an infinite variance.In Section . we deal with the maximum likelihood estimator; in Section . weprovide an illustration with Paretian observations; in Section . we propose asimple correction based on the mode-mean distance of the asymptotic distributionof the nonparametric estimator, to take care of its small-sample bias. Section . closes the paper. A technical Appendix contains the longer proofs of the mainresults in the work. We now derive the asymptotic distribution for the nonparametric estimator of theGini index when the data generating process is fat-tailed with finite mean butinfinite variance.The so-called stochastic representation of the Gini g is g = 12 E ( j X ′ (cid:0) X ” j ) m [0, 1], ( . )where X ′ and X ” are i.i.d. copies of a random variable X with c.d.f. F ( x ) [ c , ¥ ), c >
0, and with finite mean $\mathbb{E}(X) = \mu$. The quantity $\mathbb{E}(|X' - X''|)$ is known as the "Gini Mean Difference" (GMD) [ ]. For later convenience we also define $g = \frac{\theta}{\mu}$ with $\theta = \frac{\mathbb{E}(|X' - X''|)}{2}$. The Gini index of a random variable $X$ is thus the mean expected deviation between any two independent realizations of $X$, scaled by twice the mean [ ].

The most common nonparametric estimator of the Gini index for a sample $X_1, \dots, X_n$ is defined as
$$G^{NP}(X_n) = \frac{\sum_{1 \le i < j \le n} |X_i - X_j|}{(n-1)\sum_{i=1}^{n} X_i}, \qquad ( . )$$
which can also be expressed as
$$G^{NP}(X_n) = \frac{\sum_{i=1}^{n} \left(2\frac{i-1}{n-1} - 1\right) X_{(i)}}{\sum_{i=1}^{n} X_{(i)}} = \frac{\frac{1}{n}\sum_{i=1}^{n} Z_{(i)}}{\frac{1}{n}\sum_{i=1}^{n} X_i}, \qquad ( . )$$
where $X_{(1)}, X_{(2)}, \dots, X_{(n)}$ are the order statistics of $X_1, \dots, X_n$, such that $X_{(1)} < X_{(2)} < \dots < X_{(n)}$, and $Z_{(i)} = \left(2\frac{i-1}{n-1} - 1\right) X_{(i)}$. The asymptotic normality of the estimator in Equation ( . ) under the hypothesis of finite variance for the data generating process is known [ , ]; the result follows directly from the properties of the U-statistics and the L-estimators involved in Equation ( . ).

A standard methodology to derive the limiting distribution of the estimator in Equation ( . ), and more generally of a linear combination of order statistics, is to show that, in the limit for $n \to \infty$, the sequence of order statistics can be approximated by a sequence of i.i.d. random variables [ , ]. However, this usually requires some $L^2$ integrability of the data generating process, something we are not assuming here. Lemma . (proved in the Appendix) shows how to deal with the case of sequences of order statistics generated by fat-tailed, $L^1$-only integrable random variables.

Lemma .
Consider the sequence $R_n = \frac{1}{n}\sum_{i=1}^{n} \left(\frac{i}{n} - U_{(i)}\right) F^{-1}(U_{(i)})$, where the $U_{(i)}$ are the order statistics of a uniformly distributed i.i.d. random sample. Assume that $F^{-1}(U) \in L^1$. Then the following results hold:
$$R_n \xrightarrow{\;L^1\;} 0, \qquad ( . )$$
and
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)}\, R_n \xrightarrow{\;L^1\;} 0, \qquad ( . )$$
with $\alpha \in (1, 2)$ and $L(n)$ a slowly-varying function.

. . A Quick Recap on α-Stable Random Variables

We here introduce some notation for α-stable distributions, as we need them to study the asymptotic limit of the Gini index. A random variable $X$ follows an α-stable distribution, in symbols $X \sim S(\alpha, \beta, \gamma, \delta)$, if its characteristic function is
$$\mathbb{E}\left(e^{itX}\right) = \begin{cases} e^{-\gamma^{\alpha}|t|^{\alpha}\left(1 - i\beta \operatorname{sign}(t) \tan\frac{\pi\alpha}{2}\right) + i\delta t} & \alpha \neq 1, \\ e^{-\gamma|t|\left(1 + i\beta \frac{2}{\pi}\operatorname{sign}(t) \ln|t|\right) + i\delta t} & \alpha = 1, \end{cases}$$
where $\alpha \in (0, 2)$ governs the tails, $\beta \in [-1, 1]$ is the skewness parameter, $\gamma \in \mathbb{R}^+$ is the scale parameter, and $\delta \in \mathbb{R}$ is the location parameter. This is the standard $S^1$ parametrization of α-stable distributions [ , ].

Interestingly, there is a correspondence between the $\alpha$ parameter of an α-stable random variable and the $\alpha$ of a regularly-varying random variable as per Equation ( . ): as shown in [ , ], a regularly-varying random variable of order $\alpha$ is in the domain of attraction of an α-stable law with the same tail coefficient. This is why we make no distinction in the use of $\alpha$ here. Since we aim at dealing with distributions characterized by finite mean but infinite variance, we restrict our focus to $\alpha \in (1, 2)$, where the two $\alpha$'s coincide.

Recall that, for $\alpha \in (1, 2]$, the expected value of an α-stable random variable $X$ equals the location parameter $\delta$, i.e. $\mathbb{E}(X) = \delta$. For more details we refer to [ , ]. The standardized α-stable random variable is
$$S_{\alpha,\beta} \sim S(\alpha, \beta, 1, 0). \qquad ( . )$$
We note that α-stable distributions are a subclass of infinitely divisible distributions. Thanks to their closure under convolution, they can be used to describe the limiting behavior of (rescaled) partial sums, $S_n = \sum_{i=1}^{n} X_i$, in the generalized central limit theorem (GCLT) setting [ ]. For $\alpha = 2$ we obtain the normal distribution as a special case, the limit distribution of the classical CLTs under the hypothesis of finite variance.

In what follows we indicate that a random variable is in the domain of attraction of an α-stable distribution by writing $X \in DA(S_\alpha)$. Just observe that this condition for the limit of partial sums is equivalent to the one given in Equation ( . ) for the limit of partial maxima [ , ].

. . The α-Stable Asymptotic Limit of the Gini Index

Consider a sample $X_1, \dots, X_n$ of i.i.d. observations with a continuous c.d.f. $F(x)$ in the regularly-varying class, as defined in Equation ( . ), with tail index $\alpha \in (1, 2)$. The data generating process for the sample is then in the maximum domain of attraction of a Fréchet distribution with $\rho = \alpha$. For the asymptotic distribution of the Gini index estimator of Equation ( . ), when the data generating process is characterized by infinite variance, we make use of the following two theorems: Theorem . deals with the limiting distribution of the Gini Mean Difference (the numerator in Equation ( . )), while Theorem . extends the result to the complete Gini index. Proofs for both theorems are in the Appendix.

Theorem .
Consider a sequence $(X_i)_{1 \le i \le n}$ of i.i.d. random variables from a distribution $X$ on $[c, +\infty)$, $c > 0$, such that $X$ is in the domain of attraction of an α-stable random variable, $X \in DA(S_\alpha)$, with $\alpha \in (1, 2)$. Then the sample Gini mean deviation (GMD) $\frac{1}{n}\sum_{i=1}^{n} Z_{(i)}$ satisfies the following limit in distribution:
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)} \left( \frac{1}{n}\sum_{i=1}^{n} Z_{(i)} - \theta \right) \xrightarrow{\;d\;} S_{\alpha,1}, \qquad ( . )$$
where $Z_i = (2F(X_i) - 1) X_i$, $\mathbb{E}(Z_i) = \theta$, $L(n)$ is a slowly-varying function such that Equation ( . ) holds (see the Appendix), and $S_{\alpha,1}$ is a right-skewed standardized α-stable random variable defined as in Equation ( . ). Moreover, the statistic $\frac{1}{n}\sum_{i=1}^{n} Z_{(i)}$ is an asymptotically consistent estimator of the GMD, i.e. $\frac{1}{n}\sum_{i=1}^{n} Z_{(i)} \xrightarrow{\;P\;} \theta$.

Note that Theorem . could be restated in terms of the maximum domain of attraction $MDA(\Phi(\rho))$ as defined in Equation ( . ).

Theorem .
Given the same assumptions of Theorem . , the estimated Gini index $G^{NP}(X_n) = \frac{\sum_{i=1}^{n} Z_{(i)}}{\sum_{i=1}^{n} X_i}$ satisfies the following limit in distribution:
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)} \left( G^{NP}(X_n) - \frac{\theta}{\mu} \right) \xrightarrow{\;d\;} Q, \qquad ( . )$$
where $\mathbb{E}(Z_i) = \theta$, $\mathbb{E}(X_i) = \mu$, $L(n)$ is the same slowly-varying function defined in Theorem . , and $Q$ is a right-skewed α-stable random variable $S\left(\alpha, 1, \frac{1}{\mu}, 0\right)$.
Furthermore, the statistic $\frac{\sum_{i=1}^{n} Z_{(i)}}{\sum_{i=1}^{n} X_i}$ is an asymptotically consistent estimator of the Gini index, i.e. $\frac{\sum_{i=1}^{n} Z_{(i)}}{\sum_{i=1}^{n} X_i} \xrightarrow{\;P\;} \frac{\theta}{\mu} = g$.

In the case of fat tails with $\alpha \in (1, 2)$, Theorem . tells us that the asymptotic distribution of the Gini estimator is always right-skewed, notwithstanding the distribution of the underlying data generating process. Therefore heavily fat-tailed data not only induce a fatter-tailed limit for the Gini estimator, but they also change the shape of the limit law, which definitely moves away from the usual symmetric Gaussian. As a consequence, the Gini estimator, whose asymptotic consistency is still guaranteed [ ], will approach its true value more slowly, and from below. Some evidence of this was already given in Table . .

Theorem . indicates that the usual nonparametric estimator of the Gini index is not the best option when dealing with infinite-variance distributions, owing to the skewness and the fatness of its asymptotic limit. The aim is to find estimators that preserve their asymptotic normality under fat tails, which is not possible with nonparametric methods, as they all fall into the α-stable central limit case [ , ]. Hence the solution is to use parametric techniques.

Theorem . shows how, once a parametric family for the data generating process has been identified, the Gini index can be estimated via maximum likelihood (MLE). The resulting estimator is not just asymptotically normal, but also asymptotically efficient. In Theorem . we deal with random variables $X$ whose distribution belongs to the large and flexible exponential family [ ], i.e. whose density can be represented as
$$f_\theta(x) = h(x)\, e^{\,\eta(\theta) T(x) - A(\theta)},$$
with $\theta \in \mathbb{R}$, and where $T(x)$, $\eta(\theta)$, $h(x)$, $A(\theta)$ are known functions.

Theorem .
Let $X \sim F_\theta$, where $F_\theta$ is a distribution belonging to the exponential family. Then the Gini index obtained by plugging in the maximum likelihood estimator of $\theta$, $G^{ML}(X_n)_\theta$, is asymptotically normal and efficient. Namely,
$$\sqrt{n}\left( G^{ML}(X_n)_\theta - g_\theta \right) \xrightarrow{\;D\;} N\left(0,\; (g'_\theta)^2\, I^{-1}(\theta)\right), \qquad ( . )$$
where $g'_\theta = \frac{d g_\theta}{d\theta}$ and $I(\theta)$ is the Fisher information.

Proof.
The result follows from the asymptotic efficiency of the maximum likelihood estimators of the exponential family, and from the invariance principle of MLE. In particular, the validity of the invariance principle for the Gini index is granted by the continuity and the monotonicity of $g_\theta$ with respect to $\theta$. The asymptotic variance is then obtained by an application of the delta method [ ].

We provide an illustration of the obtained results using artificial fat-tailed data. We choose a Pareto I distribution [ ], with density
$$f(x) = \alpha\, c^{\alpha}\, x^{-\alpha-1}, \qquad x \ge c. \qquad ( . )$$
It is easy to verify that the corresponding survival function $\bar{F}(x)$ belongs to the regularly-varying class with tail parameter $\alpha$ and slowly-varying function $L(x) = c^{\alpha}$. We can therefore apply the results of Section . to obtain the following corollaries.

Corollary .
Let $X_1, \dots, X_n$ be a sequence of i.i.d. observations with Pareto distribution with tail parameter $\alpha \in (1, 2)$. The nonparametric Gini estimator is characterized by the following limit:
$$D_n^{NP} = G^{NP}(X_n) - g \sim S\left(\alpha,\; 1,\; C_\alpha^{-\frac{1}{\alpha}}\, n^{\frac{1-\alpha}{\alpha}}\, \frac{\alpha-1}{\alpha},\; 0\right). \qquad ( . )$$

Proof. Without loss of generality we can assume $c = 1$ in Equation ( . ). The result is a mere application of Theorem . , remembering that a Pareto distribution is in the domain of attraction of α-stable random variables with slowly-varying function $L(x) = 1$. The sequence $c_n$ satisfying Equation ( . ) becomes $c_n = n^{\frac{1}{\alpha}}\, C_\alpha^{-\frac{1}{\alpha}}$, so that $\tilde{L}(n) = C_\alpha^{-\frac{1}{\alpha}}$, which is independent of $n$. Additionally, the mean of the distribution is also a function of $\alpha$, namely $\mu = \frac{\alpha}{\alpha-1}$.

Corollary .
Let the sample $X_1, \dots, X_n$ be distributed as in Corollary . , and let $G^{ML}_\theta$ be the maximum likelihood estimator of the Gini index as defined in Theorem . . Then the MLE Gini estimator, centered at its true value $g$, has the following limit:
$$D_n^{ML} = G^{ML}_\alpha(X_n) - g \sim N\left(0,\; \frac{4\alpha^2}{n\,(2\alpha-1)^4}\right), \qquad ( . )$$
where $N$ indicates a Gaussian.

Proof. The functional form of the maximum likelihood estimator of the Gini index is known to be $G^{ML}_\theta = (2\hat{\alpha}^{ML} - 1)^{-1}$ [ ]. The result then follows from the fact that the Pareto distribution (with known minimum value $x_m$) belongs to an exponential family and therefore satisfies the regularity conditions necessary for the asymptotic normality and efficiency of the maximum likelihood estimator. Also notice that the Fisher information of a Pareto distribution is $I(\alpha) = \alpha^{-2}$.

Now that we have worked out both asymptotic distributions, we can compare the quality of the convergence in the MLE and in the nonparametric case when dealing with Paretian data, which we use as the prototype of the more general class of fat-tailed observations. In particular, we can approximate the distribution of the deviations of each estimator from the true value $g$ of the Gini index, for finite sample sizes, using Equations ( . ) and ( . ).
Figure . : Comparisons between the maximum likelihood and the nonparametric asymptotic distributions for different values of the tail index (panels: $\alpha = 1.8, 1.6, 1.4, 1.2$; horizontal axis: deviation from the mean value; curves: MLE with $n = 100$ and the nonparametric estimator with $n = 100, 500, 1000$). The number of observations for MLE is fixed at $n = 100$. Note that, even if all distributions have mean zero, the mode of the distributions of the nonparametric estimator is different from zero, because of the skewness.

Figure . shows how the deviations around the mean of the two different types of estimators are distributed, and how these distributions change as the number of observations increases. In particular, to facilitate the comparison between the maximum likelihood and the nonparametric estimators, we fix the number of observations in the MLE case, while letting it vary in the nonparametric one. We perform this study for different tail indices to show how large the impact is on the consistency of the estimator. It is worth noticing that, as the tail index decreases towards 1 (the threshold value for an infinite mean), the mode of the distribution of the nonparametric estimator moves farther away from the mean of the distribution (centered on 0 by definition, given that we are dealing with deviations from the mean). This effect is responsible for the small-sample bias observed in applications. No such phenomenon is present in the MLE case, thanks to the normality of the limit for every value of the tail parameter.

We can make our argument more rigorous by assessing the number of observations $\tilde{n}$ needed for the nonparametric estimator to be as good as the MLE one, under different tail scenarios. Consider the likelihood-ratio-type function
$$r(c, n) = \frac{P_S\left(|D_n^{NP}| > c\right)}{P_N\left(|D^{ML}| > c\right)}, \qquad ( . )$$
where $P_S(|D_n^{NP}| > c)$ and $P_N(|D^{ML}| > c)$ are the probabilities (α-stable and Gaussian respectively) that the centered estimators, in the nonparametric and in the MLE case, exceed the thresholds $\pm c$, as per Equations ( . ) and ( . ).
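The finite-sample gap that this ratio formalizes can also be previewed by direct simulation rather than through the analytic tail probabilities; a rough sketch, in which the tail index, sample sizes and replication count are illustrative choices of ours:

```python
import numpy as np

def gini_np(x):
    # nonparametric estimator, order-statistic form of Equation ( . )
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / ((n - 1) * np.sum(x))

def gini_ml(x, c=1.0):
    # plug-in MLE for Pareto I data with known minimum c:
    # alpha_hat = n / sum(log(x/c)),  Gini = 1 / (2 * alpha_hat - 1)
    a_hat = len(x) / np.sum(np.log(x / c))
    return 1.0 / (2.0 * a_hat - 1.0)

rng = np.random.default_rng(0)
alpha, n, reps = 1.1, 1000, 300
g = 1.0 / (2.0 * alpha - 1.0)          # true Gini, ~0.8333
err_np, err_ml = [], []
for _ in range(reps):
    x = rng.pareto(alpha, n) + 1.0     # Pareto I sample with minimum 1
    err_np.append(abs(gini_np(x) - g))
    err_ml.append(abs(gini_ml(x) - g))
print(float(np.mean(err_np)), float(np.mean(err_ml)))
```

In this setting the mean absolute error of the plug-in MLE comes out markedly smaller than that of the nonparametric estimator at the same sample size, consistent with the error-ratio column of Table . .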
In the nonparametric case the number of observations $n$ is allowed to change, while in the MLE case it is fixed to . We then look for the value $\tilde{n}$ such that $r(c, \tilde{n}) = 1$ for fixed $c$.

Table . displays the results for different thresholds $c$ and tail parameters $\alpha$. In particular, we can see how the MLE estimator outperforms the nonparametric one, which requires a much larger number of observations to obtain the same tail probability as the MLE with $n$ fixed. For example, we need at least $80 \times$ · observations for the nonparametric estimator to obtain the same probability of exceeding the threshold $\pm c$ when $\alpha = 1.2$.

Table . : The number of observations $\tilde{n}$ needed for the nonparametric estimator to match the tail probabilities of the maximum likelihood estimator with fixed $n = 100$, for different threshold values $c$ and different values of the tail index $\alpha$.

Threshold $c$ as per Equation ( . ):
α          0.005        0.01         0.015        ·
·          ·×10^·       ·×10^·       ·×10^·       ·×10^·
·          ·×10^·       ·×10^·       ·×10^·       ·×10^·
·          ·×10^·       ·×10^·       ·×10^·       ·×10^·

Interestingly, the number of observations needed to match the tail probabilities in Equation ( . ) does not vary monotonically with the threshold. This is expected, since as the threshold goes to infinity or to zero the two tail probabilities approach the same values for every $n$. Therefore, given the unimodality of the limit distributions, we expect that there will be a threshold maximizing the number of observations needed to match the tail probabilities, while for all the other levels the number of observations will be smaller. We conclude that, in the presence of fat-tailed data with infinite variance, a plug-in MLE-based estimator should be preferred over the nonparametric one.

Theorem . can also be used to provide a correction for the bias of the nonparametric estimator in small samples. The key idea is to recognize that, for unimodal distributions, most observations come from around the mode. In symmetric distributions the mode and the mean coincide, so most observations will be close to the mean value as well; not so for skewed distributions: for right-skewed continuous unimodal distributions the mode is lower than the mean. Therefore, given that the asymptotic distribution of the nonparametric Gini index is right-skewed, we expect the observed value of the Gini index to be usually lower than the true one (placed at the mean level). We can quantify this difference (i.e. the bias) by looking at the distance between the mode and the mean, and once this distance is known, we can correct our Gini estimate by adding it back. (In writing the paper we also tested another idea, using the distance between the median and the mean; the performances are comparable.)

Formally, we aim to derive a corrected nonparametric estimator $G^{C}(X_n)$ such that
$$G^{C}(X_n) = G^{NP}(X_n) + \left\| m\left(G^{NP}(X_n)\right) - \mathbb{E}\left(G^{NP}(X_n)\right) \right\|, \qquad ( . )$$
where $\| m(G^{NP}(X_n)) - \mathbb{E}(G^{NP}(X_n)) \|$ is the distance between the mode $m$ and the mean of the distribution of the nonparametric Gini estimator $G^{NP}(X_n)$. Performing the correction described in Equation ( . ) is equivalent to shifting the distribution of $G^{NP}(X_n)$ so as to place its mode on the true value of the Gini index.

Ideally, we would like to measure this mode-mean distance on the exact distribution of the Gini index, to get the most accurate correction. However, the finite-sample distribution is not always easily derivable, as it requires assumptions on the parametric structure of the data generating process (which, in most cases, is unknown for fat-tailed data [ ]). We therefore propose to use the limiting distribution of the nonparametric Gini obtained in Section . to approximate the finite-sample distribution, and to estimate the mode-mean distance with it. This procedure allows for more freedom in the modeling assumptions and potentially decreases the number of parameters to be estimated, given that the limiting distribution only depends on the tail index and the mean of the data, which can usually be assumed to be a function of the tail index itself, as in the Paretian case where $\mu = \frac{\alpha}{\alpha-1}$.

By exploiting the location-scale property of α-stable distributions and Equation ( . ), we approximate the distribution of $G^{NP}(X_n)$ for finite samples by
$$G^{NP}(X_n) \sim S\left(\alpha, 1, \gamma(n), g\right), \qquad ( . )$$
where $\gamma(n) = \frac{n^{\frac{1-\alpha}{\alpha}} L(n)}{\mu}$ is the scale parameter of the limiting distribution. As a consequence, thanks to the linearity of the mode for α-stable distributions, we have
$$\left\| m\left(G^{NP}(X_n)\right) - \mathbb{E}\left(G^{NP}(X_n)\right) \right\| \approx \left\| m(\alpha, \gamma(n)) + g - g \right\| = \left\| m(\alpha, \gamma(n)) \right\|,$$
where $m(\alpha, \gamma(n))$ is the mode function of an α-stable distribution with zero mean. The implication is that, in order to obtain the correction term, knowledge of the true Gini index is not necessary, given that $m(\alpha, \gamma(n))$ does not depend on $g$. We then estimate the correction term as
$$\hat{m}(\alpha, \gamma(n)) = \arg\max_x\, s(x), \qquad ( . )$$
where $s(x)$ is the numerical density of the associated α-stable distribution in Equation ( . ), but centered on 0. This is because, for α-stable distributions, the mode is not available in closed form, but it can easily be computed numerically [ ], using the unimodality of the law. The corrected nonparametric estimator is thus
$$G^{C}(X_n) = G^{NP}(X_n) + \left\| \hat{m}(\alpha, \gamma(n)) \right\|, \qquad ( . )$$
whose asymptotic distribution is
$$G^{C}(X_n) \sim S\left(\alpha, 1, \gamma(n), g + \left\| \hat{m}(\alpha, \gamma(n)) \right\|\right). \qquad ( . )$$
Note that the correction term $\hat{m}(\alpha, \gamma(n))$ is a function of the tail index $\alpha$ and is connected to the sample size $n$ through the scale parameter $\gamma(n)$ of the associated limiting distribution. It is important to point out that $\|\hat{m}(\alpha, \gamma(n))\|$ is decreasing in $n$, and that $\lim_{n \to \infty} \|\hat{m}(\alpha, \gamma(n))\| = 0$. This happens because, as $n$ increases, the distribution described in Equation ( . ) becomes more and more centered around its mean value, shrinking to zero the distance between the mode and the mean. This ensures the asymptotic equivalence of the corrected estimator and the nonparametric one. Just observe that
$$\lim_{n \to \infty} \left| G^{C}(X_n) - G^{NP}(X_n) \right| = \lim_{n \to \infty} \left\| \hat{m}(\alpha, \gamma(n)) \right\| = 0,$$
while in finite samples $G^{C}(X_n)$ will always behave better. Consider also that, from Equation ( . ), the distribution of the corrected estimator now has mean $g + \|\hat{m}(\alpha, \gamma(n))\|$, which converges to the true Gini $g$ as $n \to \infty$.

From a theoretical point of view, the quality of this correction depends on the distance between the exact distribution of $G^{NP}(X_n)$ and its α-stable limit: the closer the two are to each other, the better the approximation. However, given that, in most cases, the exact distribution of $G^{NP}(X_n)$ is unknown, it is not possible to say more. From what we have written so far, it is also clear that the correction term depends on the tail index of the data, and possibly on their mean. If these parameters are not assumed to be known a priori, they must be estimated, and the additional uncertainty due to the estimation will reflect on the quality of the correction.

We conclude this section with a discussion of the effect of the correction procedure in a simple example. In a Monte Carlo experiment, we simulate 1000 Paretian samples of increasing size, from $n = 10$ to $n = 2000$, and for each sample size we compute both the original nonparametric estimator $G^{NP}(X_n)$ and the corrected $G^{C}(X_n)$. We repeat the experiment for different $\alpha$'s. Figure . presents the results. It is clear that the corrected estimators always perform better than the uncorrected ones in terms of absolute deviation from the true Gini value.
In particular,our numerical experiment shows that for small sample sizes with n (cid:20) a (1, 2). However, as ex-pected, the difference between the estimators decreases with the sample size, asthe correction term decreases both in n and in the tail index a . Notice that, whenthe tail index equals 2, we obtain the symmetric Gaussian distribution and thetwo estimators coincide, given that, thanks to the finiteness of the variance, thenonparametric estimator is no longer biased. . . . . . . Corrected vs Original Estimator, data Tail index = 1.8
Sample size E s t i m a t o r V a l ue s Corrected EstimatorOriginal EstimatorTrue Value (a) a = 1.8 . . . . . . Corrected vs Original Estimator, data Tail index = 1.6
Sample size E s t i m a t o r V a l ue s Corrected EstimatorOriginal EstimatorTrue Value (b) a = 1.6 . . . . . . Corrected vs Original Estimator, data Tail index = 1.4
Sample size E s t i m a t o r V a l ue s Corrected EstimatorOriginal EstimatorTrue Value (c) a = 1.4 . . . . . . Corrected vs Original Estimator, data Tail index = 1.2
Sample size E s t i m a t o r V a l ue s Corrected EstimatorOriginal EstimatorTrue Value (d) a = 1.2Figure . : Comparisons between the corrected nonparametric estimator (in red, the one on top) andthe usual nonparametric estimator (in black, the one below). For small sample sizes the corrected oneclearly improves the quality of the estimation.
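The correction procedure of Equations ( . )-( . ) can be sketched numerically for the Paretian case. In the sketch below, the characteristic-function inversion, the integration grids, and the helper names are ad hoc choices of ours; the mode-mean distance is added in absolute value, as in Equation ( . ):

```python
import math
import numpy as np

def stable_mode_distance(alpha, gamma):
    """Mode-mean distance of a mean-zero S(alpha, beta=1, gamma, 0) law,
    obtained by inverting its characteristic function numerically
    (S1 parametrization, alpha != 1) and locating the density argmax."""
    t = np.linspace(1e-6, 50.0, 4000)          # |CF| decays like exp(-t**alpha)
    dt = t[1] - t[0]
    cf = np.exp(-t**alpha * (1.0 - 1j * np.tan(np.pi * alpha / 2.0)))
    x = np.linspace(-6.0, 6.0, 1201)           # bulk of the standardized law
    # density: f(x) = (1/pi) * Re \int_0^inf exp(-i t x) phi(t) dt
    pdf = np.real(np.exp(-1j * np.outer(x, t)) * cf).sum(axis=1) * dt / np.pi
    return abs(x[np.argmax(pdf)]) * gamma      # mode scales linearly with gamma

def gini_corrected(x, alpha):
    """Nonparametric Gini plus the mode-mean correction, Paretian case."""
    x = np.sort(x)
    n = len(x)
    i = np.arange(1, n + 1)
    g_np = np.sum((2 * i - n - 1) * x) / ((n - 1) * np.sum(x))
    # scale gamma(n) of the limiting stable law; for Pareto I, L(n) = C_alpha**(-1/alpha)
    C = math.gamma(2 - alpha) * abs(math.cos(math.pi * alpha / 2)) / (alpha - 1)
    gamma_n = C ** (-1 / alpha) * n ** ((1 - alpha) / alpha) * (alpha - 1) / alpha
    return g_np + stable_mode_distance(alpha, gamma_n)
```

The added term is positive, of the order of the observed downward bias for small $n$, and shrinks like $n^{(1-\alpha)/\alpha}$, consistent with the asymptotic equivalence of the corrected and uncorrected estimators.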
In this chapter we addressed the asymptotic behavior of the nonparametric estimator of the Gini index in the presence of a distribution with infinite variance, an issue that has been curiously ignored by the literature. The central mistake of the nonparametric methods in wide use is to believe that asymptotic consistency translates into equivalent preasymptotic properties. We showed that a parametric approach provides better asymptotic results, thanks to the properties of maximum likelihood estimation. Hence we strongly suggest that, if the collected data are suspected to be fat-tailed, parametric methods be preferred. In situations where a fully parametric approach cannot be used, we propose a simple correction mechanism for the nonparametric estimator, based on the distance between the mode and the mean of its asymptotic distribution. Even if the correction works nicely, we suggest caution in its use, owing to the additional uncertainty from the estimation of the correction term.

technical appendix
Proof of Lemma .
Let $U = F(X)$ be the standard uniform integral probability transform of the random variable $X$. For the order statistics we then have [ ]: $X_{(i)} \stackrel{a.s.}{=} F^{-1}(U_{(i)})$. Hence
$$R_n = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{i}{n} - U_{(i)} \right) F^{-1}(U_{(i)}). \qquad ( . )$$
Now by the definition of the empirical c.d.f. it follows that
$$R_n = \frac{1}{n} \sum_{i=1}^{n} \left( F_n(U_{(i)}) - U_{(i)} \right) F^{-1}(U_{(i)}), \qquad ( . )$$
where $F_n(u) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{U_i \le u}$ is the empirical c.d.f. of uniformly distributed random variables. To show that $R_n \xrightarrow{L^1} 0$, we construct an upper bound that goes to zero. First we notice that
$$\mathbb{E}|R_n| \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left| \left( F_n(U_{(i)}) - U_{(i)} \right) F^{-1}(U_{(i)}) \right|. \qquad ( . )$$
To build a bound for the right-hand side (r.h.s.) of ( . ), we can exploit the fact that, while $F^{-1}(U_{(i)})$ might be just $L^1$-integrable, $F_n(U_{(i)}) - U_{(i)}$ is $L^\infty$-integrable, therefore we can use Hölder's inequality with $q = \infty$ and $p = 1$. It follows that
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left| \left( F_n(U_{(i)}) - U_{(i)} \right) F^{-1}(U_{(i)}) \right| \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\sup_{U_{(i)}} \left| F_n(U_{(i)}) - U_{(i)} \right|\right] \mathbb{E}\left| F^{-1}(U_{(i)}) \right|. \qquad ( . )$$
Then, thanks to the Cauchy-Schwarz inequality, we get
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\sup_{U_{(i)}} \left| F_n(U_{(i)}) - U_{(i)} \right|\right] \mathbb{E}\left| F^{-1}(U_{(i)}) \right| \le \left( \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{E}\sup_{U_{(i)}} \left| F_n(U_{(i)}) - U_{(i)} \right| \right)^2 \right)^{\frac{1}{2}} \left( \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{E}\, F^{-1}(U_{(i)}) \right)^2 \right)^{\frac{1}{2}}. \qquad ( . )$$
Now, first recall that $\sum_{i=1}^{n} F^{-1}(U_{(i)}) \stackrel{a.s.}{=} \sum_{i=1}^{n} F^{-1}(U_i)$, with $U_i$, $i = 1, \dots, n$, an i.i.d. sequence, then notice that $\mathbb{E}(F^{-1}(U_i)) = \mu$, so that the second factor of Equation ( . ) is controlled by $\mu$ and the bound becomes
$$\mu \left( \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{E}\sup_{U_{(i)}} \left| F_n(U_{(i)}) - U_{(i)} \right| \right)^2 \right)^{\frac{1}{2}}. \qquad ( . )$$
The final step is to show that Equation ( . ) goes to zero as $n \to \infty$. We know that $F_n$ is the empirical c.d.f. of uniform random variables. Using the triangle inequality, the inner term of Equation ( . ) can be bounded as
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\sup_{U_{(i)}} \left| F_n(U_{(i)}) - U_{(i)} \right| \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\sup_{U_{(i)}} \left| F_n(U_{(i)}) - F(U_{(i)}) \right| + \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\sup_{U_{(i)}} \left| F(U_{(i)}) - U_{(i)} \right|. \qquad ( . )$$
Since we are dealing with uniforms, $F(u) = u$, and the second term on the r.h.s. of ( . ) vanishes. We can then bound $\mathbb{E}\sup_{U_{(i)}} |F_n(U_{(i)}) - F(U_{(i)})|$ using the so-called Vapnik-Chervonenkis (VC) inequality, a uniform bound for empirical processes [ , , ], getting
$$\mathbb{E}\sup_{U_{(i)}} \left| F_n(U_{(i)}) - F(U_{(i)}) \right| \le \sqrt{\frac{\log(n+1) + \log 2}{n}}. \qquad ( . )$$
Combining Equation ( . ) with Equation ( . ) we obtain
$$\mu \left( \frac{1}{n} \sum_{i=1}^{n} \left( \mathbb{E}\sup_{U_{(i)}} \left| F_n(U_{(i)}) - U_{(i)} \right| \right)^2 \right)^{\frac{1}{2}} \le \mu \sqrt{\frac{\log(n+1) + \log 2}{n}},$$
which goes to zero as $n \to \infty$, thus proving the first claim. For the second claim, it is sufficient to observe that the r.h.s. of ( . ) still goes to zero when multiplied by $\frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)}$ if $\alpha \in (1, 2)$, since $n^{\frac{\alpha-1}{\alpha} - \frac{1}{2}}\sqrt{\log n} \to 0$ for $\alpha < 2$.

Proof of Theorem .
The first part of the proof consists in showing that we can rewrite Equation ( . ) as a function of i.i.d. random variables in place of order statistics, in order to apply a central limit argument. Let's start by considering the sequence
$$\frac{1}{n} \sum_{i=1}^{n} Z_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left( 2\frac{i-1}{n-1} - 1 \right) F^{-1}(U_{(i)}). \qquad ( . )$$
Using the integral probability transform $X \stackrel{d}{=} F^{-1}(U)$, with $U$ standard uniform, and adding and removing $\frac{1}{n}\sum_{i=1}^{n} \left(2U_{(i)} - 1\right) F^{-1}(U_{(i)})$, the r.h.s. of Equation ( . ) can be rewritten as
$$\frac{1}{n} \sum_{i=1}^{n} Z_{(i)} = \frac{1}{n} \sum_{i=1}^{n} \left(2U_{(i)} - 1\right) F^{-1}(U_{(i)}) + \frac{1}{n} \sum_{i=1}^{n} 2\left( \frac{i-1}{n-1} - U_{(i)} \right) F^{-1}(U_{(i)}). \qquad ( . )$$
Then, by using the properties of order statistics [ ], we obtain the following almost sure equivalence:
$$\frac{1}{n} \sum_{i=1}^{n} Z_{(i)} \stackrel{a.s.}{=} \frac{1}{n} \sum_{i=1}^{n} Z_i + R_n, \qquad ( . )$$
with $Z_i = (2U_i - 1) F^{-1}(U_i)$ and $R_n = \frac{1}{n}\sum_{i=1}^{n} 2\left( \frac{i-1}{n-1} - U_{(i)} \right) F^{-1}(U_{(i)})$; the first term on the r.h.s. is a function of i.i.d. random variables, as desired, while the second is just a remainder.

Given Equation ( . ) and exploiting the decomposition given in ( . ), we can rewrite our claim as
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)} \left( \frac{1}{n}\sum_{i=1}^{n} Z_{(i)} - \theta \right) = \frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)} \left( \frac{1}{n}\sum_{i=1}^{n} Z_i - \theta \right) + \frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)}\, R_n. \qquad ( . )$$
From the second claim of Lemma . and Slutsky's theorem, the convergence in Equation ( . ) can be proven by looking at the behavior of the sequence
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L(n)} \left( \frac{1}{n}\sum_{i=1}^{n} Z_i - \theta \right), \qquad ( . )$$
where $Z_i = (2U_i - 1) F^{-1}(U_i) = (2F(X_i) - 1) X_i$. This reduces to proving that $Z_i$ is in the fat-tailed domain of attraction.

Recall that by assumption $X \in DA(S_\alpha)$ with $\alpha \in (1, 2)$. This assumption enables us to use a generalized CLT argument for the convergence of the sum of fat-tailed random variables. However, we first need to prove that $Z \in DA(S_\alpha)$ as well, that is, $P(|Z| > z) \sim L(z) z^{-\alpha}$, with $\alpha \in (1, 2)$ and $L(z)$ slowly-varying. Notice that
$$P(|\tilde{Z}| > z) \le P(|Z| > z) \le P(2X > z),$$
where $\tilde{Z} = (2U - 1)X$ with $U \perp X$. The first bound holds because of the positive dependence between $X$ and $F(X)$, and it can be proven rigorously by noting that $2UX \le 2F(X)X$ in the sense of the so-called rearrangement inequality [ ].
The upper bound conversely is trivial. Using the properties of slowly varying functions, we have
$$P(2X > z) \sim 2^{\alpha} L(z)\, z^{-\alpha}.$$
To show that $\tilde{Z} \in DA(S_\alpha)$, we use Breiman's theorem, which ensures the stability of the $\alpha$-stable class under product, as long as the second random variable is not too fat-tailed [ ].

To apply the theorem we re-write $P(|\tilde{Z}| > z)$ as
$$P(|\tilde{Z}| > z) = P(\tilde{Z} > z) + P(-\tilde{Z} > z) = P(\tilde{U}X > z) + P(-\tilde{U}X > z),$$
where $\tilde{U} = 2U - 1$ is uniform on $(-1, 1)$, with $\tilde{U} \perp X$.

We focus on $P(\tilde{U}X > z)$ since the procedure is the same for $P(-\tilde{U}X > z)$. We have, for $z \to +\infty$,
$$P(\tilde{U}X > z) = P(\tilde{U}X > z \mid \tilde{U} > 0)\,P(\tilde{U} > 0) + P(\tilde{U}X > z \mid \tilde{U} \le 0)\,P(\tilde{U} \le 0).$$
Now, we have that $P(\tilde{U}X > z \mid \tilde{U} \le 0)\,P(\tilde{U} \le 0) \to 0$, while, by applying Breiman's theorem, $P(\tilde{U}X > z \mid \tilde{U} > 0)\,P(\tilde{U} > 0)$ becomes
$$P(\tilde{U}X > z \mid \tilde{U} > 0)\,P(\tilde{U} > 0) \to \frac{1}{2}\,E(\tilde{U}^{\alpha} \mid \tilde{U} > 0)\,P(X > z).$$
Hence
$$P(|\tilde{Z}| > z) \to \frac{1}{2}E(\tilde{U}^{\alpha} \mid \tilde{U} > 0)\,P(X > z) + \frac{1}{2}E((-\tilde{U})^{\alpha} \mid \tilde{U} \le 0)\,P(X > z).$$
From this
$$P(|\tilde{Z}| > z) \to P(X > z)\left[\frac{1}{2}E(\tilde{U}^{\alpha}\mid \tilde{U} > 0) + \frac{1}{2}E((-\tilde{U})^{\alpha}\mid \tilde{U} \le 0)\right] = \frac{1}{\alpha+1}\,P(X > z) \sim \frac{1}{\alpha+1}\,L(z)\,z^{-\alpha}.$$
We can then conclude that, by the squeeze theorem [ ], $P(|Z| > z) \sim L(z)\,z^{-\alpha}$ as $z \to \infty$. Therefore $Z \in DA(S_\alpha)$.

We are now ready to invoke the Generalized Central Limit Theorem (GCLT) [ ] for the sequence $Z_i$, i.e.
$$\frac{n}{c_n}\left(\frac{1}{n}\sum_{i=1}^n Z_i - E(Z_i)\right) \overset{d}{\to} S_{\alpha,\beta}, \quad (\,.\,)$$
with $E(Z_i) = \theta$, $S_{\alpha,\beta}$ a standardized $\alpha$-stable random variable, and where $c_n$ is a sequence which must satisfy
$$\lim_{n\to\infty} \frac{n\,L(c_n)}{c_n^{\alpha}} = \frac{\Gamma(2-\alpha)\left|\cos\left(\frac{\pi\alpha}{2}\right)\right|}{\alpha - 1} \triangleq C_{\alpha}. \quad (\,.\,)$$
Notice that $c_n$ can be represented as $c_n = n^{1/\alpha}L_0(n)$, where $L_0(n)$ is another slowly-varying function, possibly different from $L(n)$.

The skewness parameter $\beta$ is such that
$$\frac{P(Z > z)}{P(|Z| > z)} \to \frac{1+\beta}{2}.$$
Given that $Z \in [-c, +\infty)$, the above expression reduces to
$$\frac{P(Z > z)}{P(Z > z) + P(-Z > z)} \to \frac{P(Z > z)}{P(Z > z)} = 1, \quad (\,.\,)$$
therefore $\beta = 1$. This, combined with Equation ( . ), the result for the remainder $R_n$ of Lemma . and Slutsky's theorem, allows us to conclude that the same weak limit holds for the ordered sequence of $Z_{(i)}$ in Equation ( . ) as well.

Proof of Theorem
The first step of the proof is to show that the ordered sequence $\frac{\sum_{i=1}^n Z_{(i)}}{\sum_{i=1}^n X_i}$, characterizing the Gini index, is equivalent in distribution to the i.i.d. sequence $\frac{\sum_{i=1}^n Z_i}{\sum_{i=1}^n X_i}$. In order to prove this, it is sufficient to apply the factorization in Equation ( . ) to Equation ( . ), getting
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\left(\frac{\sum_{i=1}^n Z_i}{\sum_{i=1}^n X_i} - \frac{\theta}{\mu}\right) + \frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\,\frac{R_n}{\frac{1}{n}\sum_{i=1}^n X_i}. \quad (\,.\,)$$
By Lemma . and the application of the continuous mapping and Slutsky theorems, the second term in Equation ( . ) goes to zero at least in probability. Therefore to prove the claim it is sufficient to derive a weak limit for the following sequence
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\left(\frac{\sum_{i=1}^n Z_i}{\sum_{i=1}^n X_i} - \frac{\theta}{\mu}\right). \quad (\,.\,)$$
Expanding Equation ( . ) and recalling that $Z_i = (2F(X_i) - 1)X_i$, we get
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\,\frac{1}{\frac{1}{n}\sum_{i=1}^n X_i}\left(\frac{1}{n}\sum_{i=1}^n X_i\left(2F(X_i) - 1 - \frac{\theta}{\mu}\right)\right). \quad (\,.\,)$$
The term $\frac{1}{n}\sum_{i=1}^n X_i$ in Equation ( . ) converges in probability to $\mu$ by an application of the continuous mapping theorem, and the fact that we are dealing with positive random variables $X$. Hence it will contribute to the final limit via Slutsky's theorem.

We first start by focusing on the study of the limit law of the term
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\,\frac{1}{n}\sum_{i=1}^n X_i\left(2F(X_i) - 1 - \frac{\theta}{\mu}\right). \quad (\,.\,)$$
Set $\hat{Z}_i = X_i\left(2F(X_i) - 1 - \frac{\theta}{\mu}\right)$ and note that $E(\hat{Z}_i) = 0$, since $E(Z_i) = \theta$ and $E(X_i) = \mu$. In order to apply a GCLT argument to characterize the limit distribution of the sequence $\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\frac{1}{n}\sum_{i=1}^n \hat{Z}_i$ we need to prove that $\hat{Z} \in DA(S_\alpha)$. If so, then we can apply the GCLT to
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\left(\frac{\sum_{i=1}^n \hat{Z}_i}{n} - E(\hat{Z}_i)\right). \quad (\,.\,)$$
Note that, since $E(\hat{Z}_i) = 0$, Equation ( . ) equals Equation ( . ).

To prove that $\hat{Z} \in DA(S_\alpha)$, remember that $\hat{Z}_i = X_i\left(2F(X_i) - 1 - \frac{\theta}{\mu}\right)$ is just $Z_i = X_i(2F(X_i) - 1)$ shifted by the term $-\frac{\theta}{\mu}X_i$. Therefore the same argument used in Theorem . for $Z$ applies here to show that $\hat{Z} \in DA(S_\alpha)$. In particular we can point out that $\hat{Z}$ and $Z$ (therefore also $X$) share the same $\alpha$ and slowly-varying function $L(n)$.

Notice that by assumption $X \in [c, \infty)$ with $c > 0$, hence $\hat{Z} \in \left[-c\left(1 + \frac{\theta}{\mu}\right), \infty\right)$. As a consequence the left tail of $\hat{Z}$ does not contribute to changing the limit skewness parameter $\beta$, which remains equal to 1 (as for $Z$) by an application of Equation ( . ).

Therefore, by applying the GCLT we finally get
$$\frac{n^{\frac{\alpha-1}{\alpha}}}{L_0(n)}\left(\frac{\sum_{i=1}^n Z_i}{\sum_{i=1}^n X_i} - \frac{\theta}{\mu}\right) \overset{d}{\to} \frac{1}{\mu}\,S(\alpha, 1, 1, 0). \quad (\,.\,)$$
We conclude the proof by noting that, as proven in Equation ( . ), the weak limit of the Gini index is characterized by the i.i.d. sequence $\frac{\sum_{i=1}^n Z_i}{\sum_{i=1}^n X_i}$ rather than the ordered one, and that an $\alpha$-stable random variable is closed under scaling by a constant [ ].

O N T H E S U P E R - A D D I T I V I T Y A N D E S T I M A T I O N B I A S E S O F Q U A N T I L E C O N T R I B U T I O N S ‡

Sample measures of top centile contributions to the total (concentration) are downward biased, unstable estimators, extremely sensitive to sample size and concave in accounting for large deviations. This makes them particularly unfit in domains with power law tails, especially for low values of the exponent. These estimators can vary over time and increase with the population size, as shown in this article, thus providing the illusion of structural changes in concentration. They are also inconsistent under aggregation and mixing distributions, as the weighted average of concentration measures for $A$ and $B$ will tend to be lower than that from $A \cup B$. In addition, it can be shown that under such thick tails, increases in the total sum need to be accompanied by increased sample size of the concentration measurement. We examine the estimation superadditivity and bias under homogeneous and mixed distributions.

With R. Douady
Vilfredo Pareto noticed that 80% of the land in Italy belonged to 20% of the population, and vice versa, thus both giving birth to the power law class of distributions and the popular saying 80/20. The self-similarity at the core of the property of power laws [ ] and [ ] allows us to recurse and reapply the 80/20 to the remaining 20%, and so forth until one obtains the result that the top 1 percent of the population will own about 53 percent of the total wealth.

It looks like such a measure of concentration can be seriously biased, depending on how it is measured, so it is very likely that the true ratio of concentration of what Pareto observed, that is, the share of the top percentile, was closer to 70%, hence changes year-on-year would drift higher to converge to such a level from a larger sample. In fact, as we will show in this discussion, for, say, wealth, more complete samples resulting from technological progress, and also larger population and economic growth, will make such a measure converge by increasing over time, for no other reason than expansion in sample space or aggregate value.

Research chapter.

Figure . : The young Vilfredo Pareto, before he discovered power laws.

The core of the problem is that, for the class of one-tailed fat-tailed random variables, that is, bounded on the left and unbounded on the right, where the random variable $X \in [x_{\min}, \infty)$, the in-sample quantile contribution is a biased estimator of the true value of the actual quantile contribution.

Let us define the quantile contribution
$$\kappa_q = q\,\frac{E[X \mid X > h(q)]}{E[X]}$$
where $h(q) = \inf\{h \in [x_{\min}, +\infty) : P(X > h) \le q\}$ is the exceedance threshold for the probability $q$.
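As a quick sanity check of this definition, the Python sketch below evaluates $\kappa_q$ for a pure Pareto tail (an assumption of this example; the closed form $q^{(\alpha-1)/\alpha}$ for that case appears later in this section) directly from the conditional-expectation definition:

```python
import numpy as np

def kappa_pareto(q, alpha, x_min=1.0):
    # kappa_q = q * E[X | X > h(q)] / E[X] for a Pareto(alpha) with scale x_min
    h = x_min * q ** (-1.0 / alpha)           # threshold with P(X > h) = q
    cond_mean = alpha * h / (alpha - 1.0)     # E[X | X > h] for a Pareto tail
    mean = alpha * x_min / (alpha - 1.0)      # E[X]
    return q * cond_mean / mean

# the definition collapses to the closed form q^((alpha - 1)/alpha)
for alpha in (1.1, 1.5, 2.5):
    assert np.isclose(kappa_pareto(0.01, alpha), 0.01 ** ((alpha - 1) / alpha))

print(kappa_pareto(0.01, 1.1))   # ≈ 0.657933: the top 1% holds about 66%
```

The value for $\alpha = 1.1$, $q = 0.01$ matches the $\kappa = 0.657933$ quoted in the bias table further down.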
For a given sample $(X_k)_{1 \le k \le n}$, its "natural" estimator $\hat{\kappa}_q \equiv \frac{q\text{th percentile}}{\text{total}}$, used in most academic studies, can be expressed as
$$\hat{\kappa}_q \equiv \frac{\sum_{i=1}^n \mathbb{1}_{X_i > \hat{h}(q)}\, X_i}{\sum_{i=1}^n X_i}$$
where $\hat{h}(q)$ is the estimated exceedance threshold for the probability $q$:
$$\hat{h}(q) = \inf\left\{h : \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{x_i > h} \le q\right\}.$$
We shall see that the observed variable $\hat{\kappa}_q$ is a downward biased estimator of the true ratio $\kappa_q$, the one that would hold out of sample, and such bias is in proportion to the fatness of tails and, for very thick tailed distributions, remains significant even for very large samples.

Let $X$ be a random variable belonging to the class of distributions with a "power law" right tail, that is:
$$P(X > x) = L(x)\, x^{-\alpha} \quad (\,.\,)$$
where $L : [x_{\min}, +\infty) \to (0, +\infty)$ is a slowly varying function, defined as $\lim_{x \to +\infty} \frac{L(kx)}{L(x)} = 1$ for any $k > 0$. There is little difference for small exceedance quantiles between the various possible distributions such as Student's t, Lévy $\alpha$-stable, Dagum [ ],[ ], Singh-Maddala [ ], or straight Pareto.

For exponents $1 \le \alpha \le 2$, as observed in [ ] (Chapter in this book), the law of large numbers operates, though extremely slowly. The problem is acute for $\alpha$ around, but strictly above, 1 and severe, as it diverges, for $\alpha = 1$.

Bias and Convergence

Simple Pareto Distribution
Let us first consider $\varphi_\alpha(x)$, the density of an $\alpha$-Pareto distribution bounded from below by $x_{\min} > 0$, in other words:
$$\varphi_\alpha(x) = \alpha\, x_{\min}^{\alpha}\, x^{-\alpha-1}, \quad x \ge x_{\min},$$
and $P(X > x) = \left(\frac{x_{\min}}{x}\right)^{\alpha}$. Under these assumptions, the cutpoint of exceedance is $h(q) = x_{\min}\, q^{-1/\alpha}$ and we have:
$$\kappa_q = \frac{\int_{h(q)}^{\infty} x\, \varphi(x)\,dx}{\int_{x_{\min}}^{\infty} x\, \varphi(x)\,dx} = \left(\frac{h(q)}{x_{\min}}\right)^{1-\alpha} = q^{\frac{\alpha-1}{\alpha}}. \quad (\,.\,)$$
If the distribution of $X$ is $\alpha$-Pareto only beyond a cut-point $x_{\text{cut}}$, which we assume to be below $h(q)$, so that we have $P(X > x) = \left(\frac{\lambda}{x}\right)^{\alpha}$ for some $\lambda > 0$, then we still have $h(q) = \lambda\, q^{-1/\alpha}$ and
$$\kappa_q = \frac{\alpha}{\alpha-1}\,\frac{\lambda}{E[X]}\, q^{\frac{\alpha-1}{\alpha}}.$$
The estimation of $\kappa_q$ hence requires that of the exponent $\alpha$ as well as that of the scaling parameter $\lambda$, or at least its ratio to the expectation of $X$.

Table . shows the bias of $\hat{\kappa}_q$ as an estimator of $\kappa_q$ in the case of an $\alpha$-Pareto distribution for $\alpha = 1.1$, a value chosen to be compatible with practical economic measures, such as the wealth distribution in the world or in a particular country, including developed ones. In such a case, the estimator is extremely sensitive to "small" samples, "small" meaning in practice $10^8$. We ran up to a trillion simulations across varieties of sample sizes. While $\kappa \approx 0.657933$, even a sample size of 100 million remains severely biased, as seen in the table.

Naturally the bias is rapidly (and nonlinearly) reduced for $\alpha$ further away from 1, and becomes weak in the neighborhood of 2 for a constant $\alpha$, though not under a mixture distribution for $\alpha$, as we shall see later. It is also weaker outside the top 1% centile, hence this discussion focuses on the famed "one percent" and on low values of the $\alpha$ exponent.

Table . : Biases of Estimator of $\kappa = 0.657933$ From Monte Carlo Realizations

$\hat{\kappa}(n)$ | Mean | Median | STD across MC runs
$\hat{\kappa}(10^{\,})$ | . | . | .
$\hat{\kappa}(10^{\,})$ | . | . | .
$\hat{\kappa}(10^{\,})$ | . | . | .
$\hat{\kappa}(10^{\,})$ | . | . | .
$\hat{\kappa}(10^{\,})$ | . | . | .
$\hat{\kappa}(10^{\,})$ | . | . | .

In view of these results and of a number of tests we have performed around them, we can conjecture that the bias $\kappa_q - \hat{\kappa}_q(n)$ is "of the order of" $c(\alpha, q)\, n^{-b(q)(\alpha-1)}$, where the constants $b(q)$ and $c(\alpha, q)$ need to be evaluated. Simulations suggest that $b(q) = 1$, whatever the value of $\alpha$ and $q$, but the rather slow convergence of the estimator and of its standard deviation to 0 makes precise estimation difficult.

General Case
In the general case, let us fix the threshold $h$ and define:
$$\kappa_h = \frac{P(X > h)\, E[X \mid X > h]}{E[X]} = \frac{E\left[X\, \mathbb{1}_{X > h}\right]}{E[X]}$$
so that we have $\kappa_q = \kappa_{h(q)}$. We also define the $n$-sample estimator:
$$\hat{\kappa}_h \equiv \frac{\sum_{i=1}^n \mathbb{1}_{X_i > h}\, X_i}{\sum_{i=1}^n X_i}$$
where the $X_i$ are $n$ independent copies of $X$.

This value, which is lower than the estimated exponents one can find in the literature, is, following [ ], a lower estimate which cannot be excluded from the observations.

The intuition behind the estimation bias of $\kappa_q$ by $\hat{\kappa}_q$ lies in a difference of concavity of the concentration measure with respect to an innovation (a new sample value), whether it falls below or above the threshold. Let $A_h(n) = \sum_{i=1}^n \mathbb{1}_{X_i > h}\, X_i$ and $S(n) = \sum_{i=1}^n X_i$, so that $\hat{\kappa}_h(n) = \frac{A_h(n)}{S(n)}$, and assume a frozen threshold $h$. If a new sample value $X_{n+1} < h$, then the new value is $\hat{\kappa}_h(n+1) = \frac{A_h(n)}{S(n) + X_{n+1}}$. This value is convex in $X_{n+1}$, so that uncertainty on $X_{n+1}$ increases its expectation. At variance, if the new sample value $X_{n+1} > h$, the new value is $\hat{\kappa}_h(n+1) \approx \frac{A_h(n) + X_{n+1} - h}{S(n) + X_{n+1} - h} = 1 - \frac{S(n) - A_h(n)}{S(n) + X_{n+1} - h}$, which is now concave in $X_{n+1}$, so that uncertainty on $X_{n+1}$ reduces its value. The competition between these two opposite effects is in favor of the latter, because of a higher concavity with respect to the variable, and also of a higher variability (whatever its measurement) of the variable conditionally to being above the threshold than to being below. The fatter the right tail of the distribution, the stronger the effect.
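The resulting downward bias is easy to exhibit by simulation. In the Python sketch below, the choices $\alpha = 1.1$, $q = 0.01$, $n = 1000$ and the run count are illustrative assumptions, not the chapter's trillion-run study:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, q, n, runs = 1.1, 0.01, 1_000, 2_000

kappa_true = q ** ((alpha - 1.0) / alpha)      # ≈ 0.657933 for this Pareto case
est = np.empty(runs)
for r in range(runs):
    x = rng.uniform(size=n) ** (-1.0 / alpha)  # Pareto(alpha), x_min = 1
    top = np.sort(x)[-int(q * n):]             # the top-q share of the sample
    est[r] = top.sum() / x.sum()

print(kappa_true, est.mean())   # the sample estimator sits well below the truth
```

Averaging over many runs does not remove the bias: it is a property of the estimator under fat tails, not of any single unlucky sample.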
Overall, we find that $E\left[\hat{\kappa}_h(n)\right] \le \frac{E[A_h(n)]}{E[S(n)]} = \kappa_h$ (note that unfreezing the threshold $\hat{h}(q)$ also tends to reduce the concentration measure estimate, adding to the effect when introducing one extra sample, because of a slight increase in the expected value of the estimator $\hat{h}(q)$, although this effect is rather negligible). We have in fact the following:

Proposition
Let $X = (X_i)_{i=1}^n$ be a random sample of size $n > \frac{1}{q}$, $Y = X_{n+1}$ an extra single random observation, and define:
$$\hat{\kappa}_h(X \sqcup Y) = \frac{\sum_{i=1}^n \mathbb{1}_{X_i > h}\, X_i + \mathbb{1}_{Y > h}\, Y}{\sum_{i=1}^n X_i + Y}.$$
We remark that, whenever $Y > h$, one has:
$$\frac{\partial\, \hat{\kappa}_h(X \sqcup Y)}{\partial Y} \le 0.$$
This inequality is still valid with $\hat{\kappa}_q$, as the value $\hat{h}(q, X \sqcup Y)$ doesn't depend on the particular value of $Y > \hat{h}(q, X)$.

We face a different situation from the common small sample effect resulting from the high impact of rare observations in the tails that are less likely to show up in small samples, a bias which goes away by repetition of sample runs. The concavity of the estimator constitutes an upper bound for the measurement in finite $n$, clipping large deviations, which leads to problems of aggregation as we will state below in Theorem .

In practice, even in very large samples, the contribution of very large rare events to $\kappa_q$ slows down the convergence of the sample estimator to the true value. For a better, unbiased estimate, one would need to use a different path: first estimating the distribution parameters $(\hat{\alpha}, \hat{\lambda})$ and only then estimating the theoretical tail contribution $\kappa_q(\hat{\alpha}, \hat{\lambda})$. Falk [ ] observes that, even with a proper estimator of $\alpha$ and $\lambda$, the convergence is extremely slow, namely of the order of $\frac{n^{-\delta}}{\ln n}$, where the exponent $\delta$ depends on $\alpha$ and on the tolerance of the actual distribution vs. a theoretical Pareto, measured by the Hellinger distance. In particular, $\delta \to 0$ as $\alpha \to 1$, making the convergence really slow for low values of $\alpha$.

Figure . : Effect of additional observations on $\kappa$.

Figure . : Effect of additional observations on $\kappa$; we can see convexity on both sides of $h$, except for values of no effect to the left of $h$, an area of order $1/n$.

For the estimation of the mean of a fat-tailed r.v. $(X)_i^j$, in $m$ sub-samples of size $n_i$ each for a total of $n = \sum_{i=1}^m n_i$, the allocation of the total number of observations $n$ between $i$ and $j$ does not matter so long as the total $n$ is unchanged. Here the allocation of $n$ samples between $m$ sub-samples does matter, because of the concavity of $\kappa$. Next we prove that global concentration as measured by $\hat{\kappa}_q$ on a broad set of data will appear higher than local concentration, so aggregating European data, for instance, would give a $\hat{\kappa}_q$ higher than the average measure of concentration across countries – an "inequality about inequality". In other words, we claim that the estimation bias when using $\hat{\kappa}_q(n)$ is even increased when dividing the sample into sub-samples and taking the weighted average of the measured values $\hat{\kappa}_q(n_i)$.

The same concavity – and general bias – applies when the distribution is lognormal, and is exacerbated by high variance.

Theorem
Partition the $n$ data into $m$ sub-samples $N = N_1 \cup \ldots \cup N_m$ of respective sizes $n_1, \ldots, n_m$, with $\sum_{i=1}^m n_i = n$, and let $S_1, \ldots, S_m$ be the sum of variables over each sub-sample, and $S = \sum_{i=1}^m S_i$ be that over the whole sample. Then we have:
$$E\left[\hat{\kappa}_q(N)\right] \ge \sum_{i=1}^m E\left[\frac{S_i}{S}\right] E\left[\hat{\kappa}_q(N_i)\right].$$
If we further assume that the distribution of the variables $X_j$ is the same in all the sub-samples, then we have:
$$E\left[\hat{\kappa}_q(N)\right] \ge \sum_{i=1}^m \frac{n_i}{n}\, E\left[\hat{\kappa}_q(N_i)\right].$$
In other words, averaging concentration measures of subsamples, weighted by the total sum of each subsample, produces a downward biased estimate of the concentration measure of the full sample.
Proof.
An elementary induction reduces the question to the case of two sub-samples. Let $q \in (0, 1)$ and $(X_1, \ldots, X_m)$ and $(X'_1, \ldots, X'_n)$ be two samples of positive i.i.d. random variables, the $X_i$'s having distribution $p(dx)$ and the $X'_j$'s having distribution $p'(dx')$. For simplicity, we assume that both $qm$ and $qn$ are integers. We set
$$S = \sum_{i=1}^m X_i \quad \text{and} \quad S' = \sum_{i=1}^n X'_i.$$
We define $A = \sum_{i=1}^{mq} X_{[i]}$, where $X_{[i]}$ is the $i$-th largest value of $(X_1, \ldots, X_m)$, and $A' = \sum_{i=1}^{nq} X'_{[i]}$, where $X'_{[i]}$ is the $i$-th largest value of $(X'_1, \ldots, X'_n)$. We also set $S'' = S + S'$ and
$$A'' = \sum_{i=1}^{(m+n)q} X''_{[i]},$$
where $X''_{[i]}$ is the $i$-th largest value of the joint sample $(X_1, \ldots, X_m, X'_1, \ldots, X'_n)$.

The $q$-concentration measures for the samples $X = (X_1, \ldots, X_m)$, $X' = (X'_1, \ldots, X'_n)$ and $X'' = (X_1, \ldots, X_m, X'_1, \ldots, X'_n)$ are:
$$\kappa = \frac{A}{S}, \quad \kappa' = \frac{A'}{S'}, \quad \kappa'' = \frac{A''}{S''}.$$
We must prove that the following inequality holds for expected concentration measures:
$$E[\kappa''] \ge E\left[\frac{S}{S''}\right]E[\kappa] + E\left[\frac{S'}{S''}\right]E[\kappa'].$$
We observe that:
$$A = \max_{J \subset \{1,\ldots,m\},\, |J| = qm}\; \sum_{i \in J} X_i,$$
and, similarly, $A' = \max_{J' \subset \{m+1,\ldots,m+n\},\, |J'| = qn} \sum_{i \in J'} X_i$ and $A'' = \max_{J'' \subset \{1,\ldots,m+n\},\, |J''| = q(m+n)} \sum_{i \in J''} X_i$, where we have denoted $X_{m+i} = X'_i$ for $i = 1 \ldots n$. If $J \subset \{1, \ldots, m\}$, $|J| = qm$ and $J' \subset \{m+1, \ldots, m+n\}$, $|J'| = qn$, then $J'' = J \cup J'$ has cardinal $qm + qn = q(m+n)$, hence
$$A + A' = \sum_{i \in J''} X_i \le A'',$$
whatever the particular sample. Therefore $\kappa'' \ge \frac{S}{S''}\kappa + \frac{S'}{S''}\kappa'$ and we have:
$$E[\kappa''] \ge E\left[\frac{S}{S''}\kappa\right] + E\left[\frac{S'}{S''}\kappa'\right].$$
Let us now show that:
$$E\left[\frac{S}{S''}\kappa\right] = E\left[\frac{A}{S''}\right] \ge E\left[\frac{S}{S''}\right]E\left[\frac{A}{S}\right].$$
If this is the case, then we identically get for $\kappa'$:
$$E\left[\frac{S'}{S''}\kappa'\right] = E\left[\frac{A'}{S''}\right] \ge E\left[\frac{S'}{S''}\right]E\left[\frac{A'}{S'}\right],$$
hence we will have:
$$E[\kappa''] \ge E\left[\frac{S}{S''}\right]E[\kappa] + E\left[\frac{S'}{S''}\right]E[\kappa'].$$
Let $T = X_{[mq]}$ be the cut-off point (where $[mq]$ is the integer part of $mq$), so that $A = \sum_{i=1}^m X_i\, \mathbb{1}_{X_i \ge T}$, and let $B = S - A = \sum_{i=1}^m X_i\, \mathbb{1}_{X_i < T}$. Conditionally to $T$, $A$ and $B$ are independent: $A$ is a sum of $mq$ samples constrained to being above $T$, while $B$ is the sum of $m(1-q)$ independent samples constrained to being below $T$. They are also independent of $S'$. Let $p_A(t, da)$ and $p_B(t, db)$ be the distributions of $A$ and $B$ respectively, given $T = t$. We recall that $p'(ds')$ is the distribution of $S'$ and denote $q(dt)$ that of $T$. We have:
$$E\left[\frac{S}{S''}\kappa\right] = \int \frac{a}{a + b + s'}\; p_A(t, da)\, p_B(t, db)\, q(dt)\, p'(ds').$$
For given $b$, $t$ and $s'$, $a \mapsto \frac{a+b}{a+b+s'}$ and $a \mapsto \frac{a}{a+b}$ are two increasing functions of the same variable $a$, hence conditionally to $T$, $B$ and $S'$, we have:
$$E\left[\frac{S}{S''}\kappa \,\Big|\, T, B, S'\right] = E\left[\frac{A}{A+B+S'} \,\Big|\, T, B, S'\right] \ge E\left[\frac{A+B}{A+B+S'} \,\Big|\, T, B, S'\right] E\left[\frac{A}{A+B} \,\Big|\, T, B, S'\right].$$
This inequality being valid for any values of $T$, $B$ and $S'$, it is valid for the unconditional expectation, and we have:
$$E\left[\frac{S}{S''}\kappa\right] \ge E\left[\frac{S}{S''}\right]E\left[\frac{A}{S}\right].$$
If the two samples have the same distribution, then we have:
$$E[\kappa''] \ge \frac{m}{m+n}E[\kappa] + \frac{n}{m+n}E[\kappa'].$$
Indeed, in this case, we observe that $E\left[\frac{S}{S''}\right] = \frac{m}{m+n}$. Indeed $S = \sum_{i=1}^m X_i$ and the $X_i$ are identically distributed, hence $E\left[\frac{S}{S''}\right] = m\, E\left[\frac{X_1}{S''}\right]$. But we also have $E\left[\frac{S''}{S''}\right] = 1 = (m+n)\, E\left[\frac{X_1}{S''}\right]$, therefore $E\left[\frac{X_1}{S''}\right] = \frac{1}{m+n}$. Similarly, $E\left[\frac{S'}{S''}\right] = \frac{n}{m+n}$, yielding the result.

This ends the proof of the theorem.

Let $X$ be a positive random variable and $h > 0$. We recall the theoretical $h$-concentration measure, defined as:
$$\kappa_h = \frac{P(X > h)\, E[X \mid X > h]}{E[X]},$$
whereas the $n$-sample $h$-concentration measure is $\hat{\kappa}_h(n) = \frac{A(n)}{S(n)}$, where $A(n)$ and $S(n)$ are defined as above for an $n$-sample $X = (X_1, \ldots, X_n)$ of i.i.d. variables with the same distribution as $X$.

Theorem
For any $n \in \mathbb{N}$, we have:
$$E\left[\hat{\kappa}_h(n)\right] < \kappa_h$$
and
$$\lim_{n \to +\infty} \hat{\kappa}_h(n) = \kappa_h \quad \text{a.s. and in probability.}$$

Proof. The above corollary shows that the sequence $n\, E\left[\hat{\kappa}_h(n)\right]$ is super-additive, hence $E\left[\hat{\kappa}_h(n)\right]$ is an increasing sequence. Moreover, thanks to the law of large numbers, $\frac{1}{n}S(n)$ converges almost surely and in probability to $E[X]$ and $\frac{1}{n}A(n)$ converges almost surely and in probability to $E\left[X\,\mathbb{1}_{X > h}\right] = P(X > h)\,E[X \mid X > h]$, hence their ratio also converges almost surely to $\kappa_h$. On the other hand, this ratio is bounded by 1.
Lebesgue's dominated convergence theorem concludes the argument about the convergence in probability.

Consider now a random variable $X$, the distribution of which, $p(dx)$, is a mixture of parametric distributions with different values of the parameter: $p(dx) = \sum_{i=1}^m w_i\, p_{\alpha_i}(dx)$. A typical $n$-sample of $X$ can be made of $n_i = w_i n$ samples of $X_{\alpha_i}$ with distribution $p_{\alpha_i}$. The above theorem shows that, in this case, we have:
$$E\left[\hat{\kappa}_q(n, X)\right] \ge \sum_{i=1}^m E\left[\frac{S(w_i n, X_{\alpha_i})}{S(n, X)}\right] E\left[\hat{\kappa}_q(w_i n, X_{\alpha_i})\right].$$
When $n \to +\infty$, each ratio $\frac{S(w_i n, X_{\alpha_i})}{S(n, X)}$ converges almost surely to $w_i$ respectively, therefore we have the following convexity inequality:
$$\kappa_q(X) \ge \sum_{i=1}^m w_i\, \kappa_q(X_{\alpha_i}).$$
The case of the Pareto distribution is particularly interesting. Here, the parameter $\alpha$ represents the tail exponent of the distribution. If we normalize expectations to 1, the cdf of $X_\alpha$ is $F_\alpha(x) = 1 - \left(\frac{x}{x_{\min}}\right)^{-\alpha}$ and we have:
$$\kappa_q(X_\alpha) = q^{\frac{\alpha-1}{\alpha}} \quad \text{and} \quad \frac{d^2}{d\alpha^2}\,\kappa_q(X_\alpha) = q^{\frac{\alpha-1}{\alpha}}\, \frac{\log q}{\alpha^3}\left(\frac{\log q}{\alpha} - 2\right) > 0,$$
so $\kappa_q(X_\alpha)$ is a convex function of $\alpha$ and we can write:
$$\kappa_q(X) \ge \sum_{i=1}^m w_i\, \kappa_q(X_{\alpha_i}) \ge \kappa_q(X_{\bar{\alpha}}),$$
where $\bar{\alpha} = \sum_{i=1}^m w_i\, \alpha_i$.

Figure . : Pierre Simon, Marquis de Laplace. He got his name on a distribution and a few results, but was behind both the Cauchy and the Gaussian distributions (see the Stigler law of eponymy [ ]). Posthumous portrait by Jean-Baptiste Paulin Guérin.

Suppose now that $X$ is a positive random variable with unknown distribution, except that its tail decays like a power law with unknown exponent.
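The convexity inequality above can be illustrated directly. In the Python sketch below, the two exponents, the equal weights, and $q = 0.01$ are arbitrary assumptions:

```python
import numpy as np

q = 0.01
kappa = lambda a: q ** ((a - 1.0) / a)   # Pareto q-concentration, expectations normalized

a_lo, a_hi, w = 1.2, 1.8, 0.5            # a two-point mixture of tail exponents
mix = w * kappa(a_lo) + (1 - w) * kappa(a_hi)
avg = kappa(w * a_lo + (1 - w) * a_hi)   # kappa evaluated at the averaged exponent

print(mix, avg)   # mix > avg: uncertainty about alpha raises true concentration
```

Since $\kappa_q$ is convex in $\alpha$, the mixture's concentration exceeds the concentration computed at the average exponent, which is Jensen's inequality at work.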
An unbiased estimation of the exponent, with necessarily some amount of uncertainty (i.e., a distribution of possible true values around some average), would lead to a downward biased estimate of $\kappa_q$.

Because the concentration measure only depends on the tail of the distribution, this inequality also applies in the case of a mixture of distributions with a power decay, as in Equation ( . ):
$$P(X > x) = \sum_{j=1}^N w_j\, L_j(x)\, x^{-\alpha_j}. \quad (\,.\,)$$
The slightest uncertainty about the exponent increases the concentration index. One can get an actual estimate of this bias by considering an average $\bar{\alpha}$ of two exponents, $\alpha^+ = \bar{\alpha} + \delta$ and $\alpha^- = \bar{\alpha} - \delta$. The convexity inequality writes as follows:
$$\kappa_q(\bar{\alpha}) = q^{1-\frac{1}{\bar{\alpha}}} < \frac{1}{2}\left(q^{1-\frac{1}{\bar{\alpha}+\delta}} + q^{1-\frac{1}{\bar{\alpha}-\delta}}\right).$$
So in practice, an estimated $\bar{\alpha}$ of around 3/2, sometimes called the "half-cubic" exponent, would produce similar results as values of $\alpha$ much closer to 1, as we used in the previous section. Simply, $\kappa_q(\alpha)$ is convex, and dominated by the second order effect $\frac{\log q}{(\alpha+\delta)^3}\, q^{1-\frac{1}{\alpha+\delta}}\left(\frac{\log q}{\alpha+\delta} - 2\right)$, an effect that is exacerbated at lower values of $\alpha$.

To show how unreliable the measures of inequality concentration from quantiles are, consider that a standard error of . in the measurement of $\alpha$ causes $\kappa_q(\alpha)$ to rise by . .

There is a large dependence between the estimator $\hat{\kappa}_q$ and the sum $S = \sum_{j=1}^n X_j$: conditional on an increase in $\hat{\kappa}_q$, the expected sum is larger. Indeed, as shown in Theorem , $\hat{\kappa}_q$ and $S$ are positively correlated.

For the case in which the random variables under concern are wealth, we observe, as in Figure . , such conditional increase; in other words, since the distribution is of the class of thick tails under consideration, the maximum is of the same order as the sum, and additional wealth means more measured inequality. Under such dynamics, it is quite absurd to assume that additional wealth will arise from the bottom or even the middle. (The same argument can be applied to wars, pandemics, size of companies, etc.)

Figure . : Effect of additional wealth on $\hat{\kappa}$.

Concentration can be high at the level of the generator, but in small units or sub-sections we will observe a lower $\kappa_q$. So, examining time series, we can easily get a historical illusion of a rise in, say, wealth concentration when it has been there all along at the level of the process; and an expansion in the size of the unit measured can be part of the explanation.
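The sub-unit effect just described can be seen in a short simulation (Python; the split into two equal halves, the exponent $\alpha = 1.1$, and the run count are arbitrary choices): the weighted average of sub-sample measures systematically sits below the whole-sample measure.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, q, runs = 1.1, 0.01, 1_500
n1 = n2 = 500                                    # two sub-samples, same distribution

def k_hat(x):
    # sample q-concentration: share of the total held by the top-q observations
    return np.sort(x)[-max(1, int(q * len(x))):].sum() / x.sum()

whole = np.empty(runs)
parts = np.empty(runs)
for r in range(runs):
    x1 = rng.uniform(size=n1) ** (-1.0 / alpha)  # Pareto(alpha) halves
    x2 = rng.uniform(size=n2) ** (-1.0 / alpha)
    whole[r] = k_hat(np.concatenate([x1, x2]))
    parts[r] = (n1 * k_hat(x1) + n2 * k_hat(x2)) / (n1 + n2)

print(whole.mean(), parts.mean())  # whole-sample concentration reads higher
```

The gap is the aggregation bias of the theorem: measuring concentration country by country and averaging understates the concentration of the pooled population.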
Even the estimation of $\alpha$ can be biased in some domains where one does not see the entire picture: in the presence of uncertainty about the "true" $\alpha$, it can be shown that, unlike other parameters, the one to use is not the probability-weighted exponents (the standard average) but rather the minimum across a section of exponents.

One must not perform analyses of year-on-year changes in $\hat{\kappa}_q$ without adjustment. It did not escape our attention that some theories are built based on claims of such "increase" in inequality, as in [ ], without taking into account the true nature of $\kappa_q$, and promulgating theories about the "variation" of inequality without reference to the stochasticity of the estimation – and the lack of consistency of $\kappa_q$ across time and sub-units. What is worse, rejection of such theories also ignored the size effect, by countering with data of a different sample size, effectively making the dialogue on inequality uninformational statistically.

Accumulated wealth is typically thicker tailed than income, see [ ].

The mistake appears to be commonly made in common inference about fat-tailed data in the literature. The very methodology of using concentration and changes in concentration is highly questionable. For instance, in the thesis by Steven Pinker [ ] that the world is becoming less violent, we note a fallacious inference about the concentration of damage from wars from a $\hat{\kappa}_q$ with minutely small population in relation to the fat-tailedness. Owing to the fat-tailedness of war casualties and consequences of violent conflicts, an adjustment would rapidly invalidate such claims that violence from war has statistically experienced a decline.

Robust methods and use of exhaustive data
We often face arguments of the type "the method of measuring concentration from quantile contributions $\hat{\kappa}$ is robust and based on a complete set of data". Robust methods, alas, tend to fail with fat-tailed data, see Chapter . But, in addition, the problem here is worse: even if such "robust" methods were deemed unbiased, a method of direct centile estimation is still linked to a static and specific population and does not aggregate. Accordingly, such techniques do not allow us to make statistical claims or scientific statements about the true properties, which should necessarily carry out of sample.

Take an insurance (or, better, reinsurance) company. The "accounting" profits in a year in which there were few claims do not reflect on the "economic" status of the company, and it is futile to make statements on the concentration of losses per insured event based on a single-year sample. The "accounting" profits are not used to predict variations year-on-year, rather the exposure to tail (and other) events, analyses that take into account the stochastic nature of the performance. This difference between "accounting" (deterministic) and "economic" (stochastic) values matters for policy making, particularly under thick tails. The same with wars: we do not estimate the severity of a (future) risk based on past in-sample historical data.

How Should We Measure Concentration?
Practitioners of risk management now tend to compute CVaR and other metrics, methods that are extrapolative and nonconcave, such as the information from the $\alpha$ exponent, taking the one closer to the lower bound of the range of exponents, as we saw in our extension to Theorem and rederiving the corresponding $\kappa$, or, more rigorously, integrating the functions of $\alpha$ across the various possible states. Such methods of adjustment are less biased and do not get mixed up with problems of aggregation – they are similar to the "stochastic volatility" methods in mathematical finance that consist in adjustments to option prices by adding a "smile" to the standard deviation, in proportion to the variability of the parameter representing volatility and the errors in its measurement. Here it would be "stochastic alpha" or "stochastic tail exponent". By extrapolative, we mean the built-in extension of the tail in the measurement by taking into account realizations outside the sample path that are in excess of the extrema observed.

acknowledgment
The late Benoît Mandelbrot, Branko Milanovic, Dominique Guéguan, Felix Salmon, Bruno Dupire, the late Marc Yor, Albert Shiryaev, an anonymous referee, the staff at Luciano Restaurant in Brooklyn and Naya in Manhattan.

Financial Times, May , "Piketty findings undercut by errors" by Chris Giles.

Using Richardson's data, [ ]: "(Wars) followed an 80:2 rule: almost eighty percent of the deaths were caused by two percent (his emph.) of the wars". So it appears that both Pinker and the literature cited for the quantitative properties of violent conflicts are using a flawed methodology, one that produces a severe bias, as the centile estimation has extremely large biases with fat-tailed wars. Furthermore, claims about the mean become spurious at low exponents.

Also note that, in addition to the centile estimation problem, some authors such as [ ], when dealing with censored data, use Pareto interpolation for insufficient information about the tails (based on the tail parameter), filling in the bracket with the conditional average bracket contribution, which is not the same thing as using a full power law extension; such a method retains a significant bias.

Even using a lognormal distribution, by fitting the scale parameter, works to some extent, as a rise of the standard deviation extrapolates probability mass into the right tail.

We also note that the theorems would also apply to Poisson jumps, but we focus on the power law case in the application, as the methods for fitting Poisson jumps are interpolative and have proved to be easier to fit in-sample than out of sample.

Part V

S H A D O W M O M E N T S PA P E R S

S H A D O W M O M E N T S O F A P PA R E N T LY I N F I N I T E - M E A N P H E N O M E N A ‡

This chapter proposes an approach to compute the conditional moments of fat-tailed phenomena that, only looking at data, could be mistakenly considered as having infinite mean. This type of problem manifests itself when a random variable $Y$ has a heavy-tailed distribution with an extremely wide yet bounded support.

We introduce the concept of dual distribution, by means of a log-transformation that smoothly removes the upper bound. The tail of the dual distribution can then be studied using extreme value theory, without making excessive parametric assumptions, and the estimates one obtains can be used to study the original distribution and compute its moments by reverting the transformation.

The central difference between our approach and a simple truncation is in the smoothness of the transformation between the original and the dual distribution, allowing use of extreme value theory.

War casualties, operational risk, environmental blight, complex networks and many other econophysics phenomena are possible fields of application.

Research chapter, with P. Cirillo.

Consider a heavy-tailed random variable $Y$ with finite support $[L, H]$. W.l.o.g. set $L \ge 0$, while for the upper bound $H$, assume that its value is remarkably large, yet finite. It is so large that the probability of observing values in its vicinity is extremely small, so that in data we tend to find observations only below a certain $M \ll H < \infty$.

Figure . gives a graphical representation of the problem. For our random variable $Y$ with remote upper bound $H$, the real tail is represented by the continuous line. However, if we only observe values up to $M \ll H$, and – willing or not – we ignore the existence of $H$, which is unlikely to be seen, we could be inclined to believe that the tail is the dotted one, the apparent one.
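This apparent-tail illusion is easy to reproduce numerically. The Python sketch below uses arbitrary assumptions (an apparent exponent $\alpha = 0.7 < 1$, a remote bound $H$, and an observation cutoff $M$): a Pareto truncated at $H$ has all moments finite, yet below $M$ its survival function matches the unbounded Pareto.

```python
import numpy as np

rng = np.random.default_rng(11)
alpha, L, H, M = 0.7, 1.0, 1e9, 1e3    # apparent exponent < 1, remote bound H

# inverse-CDF sampling from a Pareto truncated at H (all moments finite)
trunc = 1.0 - (L / H) ** alpha
y = L * (1.0 - rng.uniform(size=200_000) * trunc) ** (-1.0 / alpha)

surv_obs = (y > M).mean()              # empirical survival at the cutoff M
surv_pareto = (L / M) ** alpha         # unbounded-Pareto survival at M

print(surv_obs, surv_pareto, y.max() <= H)
```

With $H$ this remote, the truncation correction $(L/H)^{\alpha}$ is of order $10^{-7}$: nothing observable below $M$ distinguishes the bounded variable from a genuinely infinite-mean one.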
The two tails are indeed essentially indistinguishable in most cases, as the divergence only becomes evident when we approach H.

Now assume we want to study the tail of Y and, since it is fat-tailed and despite H < ∞, we take it to belong to the so-called Fréchet class. In extreme value theory [ ], a distribution F of a random variable Y is said to be in the Fréchet class if F̄(y) = 1 − F(y) = y^(−α) L(y), where L(y) is a slowly varying function. In other terms, the Fréchet class is the class of all distributions whose right tail behaves as a power law.

Looking at the data, we could be led to believe that the right tail is the dotted line in Figure . , and our estimation of α shows it to be smaller than 1. Given the properties of power laws, this means that E[Y] is not finite (nor are any of the higher moments). This also implies that the sample mean is essentially useless for making inference, in addition to any considerations about robustness [ ]. But if H is finite, this cannot be true: all the moments of a random variable with bounded support are finite.

A solution to this situation could be to fit a parametric model that allows for fat tails and bounded support, such as for example a truncated Pareto [ ]. But what happens if Y only shows Paretian behavior in the upper tail, and not over the whole distribution? Should we fit a mixture model? In the next section we propose a simple general solution, which does not rely on strong parametric assumptions.

Instead of altering the tails of the distribution, we find it more convenient to transform the data and rely on distributions with well-known properties. In Figure . , the real and the apparent tails are indistinguishable to a great extent. We can use this fact to our advantage, by transforming Y to remove its upper bound H, so that the new random variable Z, the dual random variable, has the same tail as the apparent tail.
We can then estimate the shape parameter α of the tail of Z and come back to Y to compute its moments or, to be more exact, its excess moments, the conditional moments above a given threshold, given that we will just extract the information from the tail of Z.

Take Y with support [L, H], and define the function

    φ(Y) = L − H log( (H − Y) / (H − L) ). ( . )

(Note that treating Y as belonging to the Fréchet class is a mistake. If a random variable has a finite upper bound, it cannot belong to the Fréchet class, but rather to the Weibull class [ ].)
Figure . : Graphical representation of what may happen if one ignores the existence of the finite upper bound H, since only M is observed.
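The transformation and its inverse are straightforward to code (a direct transcription of φ above; the numerical values are ours, for illustration):

```python
import math

def phi(y, L, H):
    """Dual transform: maps the bounded support [L, H] onto [L, infinity)."""
    return L - H * math.log((H - y) / (H - L))

def phi_inv(z, L, H):
    """Inverse transform, mapping the dual variable back to [L, H]."""
    return (L - H) * math.exp((L - z) / H) + H

# Example values (ours): a quantity bounded by a world population of 7.2e9
L, H = 1.0, 7.2e9
y = 1e6                  # an observation far from the upper bound
z = phi(y, L, H)         # the dual observation: barely different from y
back = phi_inv(z, L, H)  # the round trip recovers y
```

Because φ(y) ≈ y far from H, fitting the tail of Z is operationally the same as fitting the apparent tail of Y; the difference only matters near H, where Z diverges while Y saturates.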
We can verify that φ is "smooth": φ ∈ C^∞, φ^(−1)(∞) = H, and φ^(−1)(L) = φ(L) = L. Then Z = φ(Y) defines a new random variable with lower bound L and an infinite upper bound. Notice that the transformation induced by φ(·) does not depend on any of the parameters of the distribution of Y.

By construction, z = φ(y) ≈ y for very large values of H. This means that for a very large upper bound, unlikely to be touched, the results we get for the tails of Y and Z = φ(Y) are essentially the same, as long as we do not approach H. But while Y is bounded, Z is not. Therefore we can safely model the unbounded dual distribution of Z as belonging to the Fréchet class, study its tail, and then come back to Y and its moments, which under the dual distribution of Z could not exist.

The tail of Z can be studied in different ways; see for instance [ ] and [ ]. Our suggestion is to rely on the Pickands, Balkema and de Haan theorem [ ]. This theorem allows us to focus on the right tail of a distribution, without caring too much about what happens below a given threshold u. In our case u ≥ L. (Note that the use of a logarithmic transformation is quite natural in the context of utility.)

Consider a random variable Z with distribution function G, and call G_u the conditional distribution function of Z above a given threshold u. We can then define the r.v. W, representing the rescaled excesses of Z over the threshold u, so that

    G_u(w) = P(Z − u ≤ w | Z > u) = (G(u + w) − G(u)) / (1 − G(u)),

for 0 ≤ w ≤ z_G − u, where z_G is the right endpoint of G.

Pickands, Balkema and de Haan have shown that for a large class of distribution functions G, and a large u, G_u can be approximated by a Generalized Pareto distribution, i.e. G_u(w) → GPD(w; ξ, σ) as u → ∞, where

    GPD(w; ξ, σ) = 1 − (1 + ξw/σ)^(−1/ξ)   if ξ ≠ 0,
    GPD(w; ξ, σ) = 1 − e^(−w/σ)            if ξ = 0,

for w ≥ 0. ( . )

The parameter ξ, known as the shape parameter and corresponding to 1/α, governs the fatness of the tails, and thus the existence of moments. The moment of order p of a Generalized Pareto distributed random variable exists if and only if ξ < 1/p, or α > p [ ]. Both ξ and σ can be estimated using MLE or the method of moments [ ].

The shadow mean (or population mean)

With f and g we indicate the densities of Y and Z. We know that Z = φ(Y), so that Y = φ^(−1)(Z) = (L − H) e^((L−Z)/H) + H. Now, let us assume we found u = L* ≥ L such that G_u(w) ≈ GPD(w; ξ, σ). This implies that the tail of Y, above the same value L* that we find for Z, can be obtained from the tail of Z, i.e. from G_u.

First we have

    ∫_{L*}^∞ g(z) dz = ∫_{L*}^{φ^(−1)(∞)} f(y) dy. ( . )

And we know that

    g(z; ξ, σ) = (1/σ) (1 + ξ(z − L*)/σ)^(−1/ξ − 1),   z ∈ [L*, ∞). ( . )

Setting α = ξ^(−1), we get

    f(y; α, σ) = (H / (σ(H − y))) (1 + H(log(H − L) − log(H − y))/(ασ))^(−α−1),   y ∈ [L*, H], ( . )

or, in terms of distribution function,

    F(y; α, σ) = 1 − (1 + H(log(H − L) − log(H − y))/(ασ))^(−α). ( . )

(Footnote: There are alternative methods to deal with finite (or concave) upper bounds, i.e., the use of tempered power laws (with exponential dampening) [ ] or stretched exponentials [ ]; while being of the same nature as our exercise, these methods do not allow for immediate application of extreme value theory or similar methods of parametrization.)

Figure . : C. F. Gauss, painted by Christian Albrecht Jensen. Gauss has his name on the distribution, generally attributed to Laplace.

Clearly, given that φ is a one-to-one transformation, the parameters of f and g obtained by maximum likelihood methods will be the same; the likelihood functions of f and g differ by a scaling constant.

We can derive the shadow mean of Y, conditionally on Y > L*, as

    E[Y | Y > L*] = ∫_{L*}^{H} y f(y; α, σ) dy, ( . )

obtaining

    E[Y | Y > L*] = (H − L*) e^(ασ/H) (ασ/H)^α Γ(1 − α, ασ/H) + L*. ( . )

The conditional mean of Y above L* ≥ L can then be estimated by simply plugging in the estimates α̂ and σ̂, as resulting from the GPD approximation of the tail of Z. It is worth noticing that if L* = L, then E[Y | Y > L*] = E[Y], i.e. the conditional mean of Y above L is exactly the mean of Y. Naturally, in a similar way, we can obtain the other moments, even if we may need numerical methods to compute them.

(Footnote: We call the population average, as opposed to the sample one, the "shadow" mean, as it is not immediately visible from the data.)

Our method can be used in general, but it is particularly useful when, from the data, the tail of Y appears so fat that no single moment is finite, as is often the case when dealing with operational risk losses, the degree distribution of large complex networks, or other econophysical phenomena. For example, assume that for Z we have ξ > 1. Then both E[Z | Z > L*] and E[Z] are not finite. Figure . tells us that we might be inclined to assume that E[Y] is also infinite, and this is what the data are likely to tell us if we estimate ξ̂ from the tail of Y. But this cannot be true because H < ∞, and even for ξ > 1 we can compute the shadow mean E[Y | Y > L*] using equation ( . ).

Value-at-Risk and Expected Shortfall
Thanks to equation ( . ), we can compute by inversion the quantile function of Y when Y ≥ L*, that is,

    Q(p; α, σ, H, L) = e^(−γ(p)) ( L* e^(ασ/H) + H e^(γ(p)) − H e^(ασ/H) ), ( . )

where γ(p) = ασ(1 − p)^(−1/α)/H and p ∈ [0, 1]. Again, this quantile function is conditional on Y being larger than L*.

From equation ( . ), we can easily compute the Value-at-Risk (VaR) of Y | Y ≥ L* for whatever confidence level. For example, the 95% VaR of Y, if Y represents operational losses over a -year time horizon, is simply VaR_Y = Q(0.95; α, σ, H, L).

Another quantity we might be interested in when dealing with the tail risk of Y is the so-called expected shortfall (ES), that is, E[Y | Y > u ≥ L*]. This is nothing more than a generalization of equation ( . ). We can obtain the expected shortfall by first computing the mean excess function of Y | Y ≥ L*, defined as

    e_u(Y) = E[Y − u | Y > u] = ∫_u^∞ (y − u) f(y; α, σ) dy / (1 − F(u)),

for y ≥ u ≥ L*.

(Footnote: Remember that for a GPD random variable Z, E[Z^p] < ∞ iff ξ < 1/p. Because of the similarities between 1 − F(y) and 1 − G(z), at least up until M, the GPD approximation will give two statistically indistinguishable estimates of ξ for both tails [ ].)

Using equation ( . ), we get

    e_u(Y) = (H − L) e^(ασ/H) (ασ/H)^α ( H log((H − L)/(H − u))/(ασ) + 1 )^α Γ( 1 − α, ασ/H + log((H − L)/(H − u)) ). ( . )

The expected shortfall is then simply computed as E[Y | Y > u ≥ L*] = e_u(Y) + u. As in finance and risk management, ES and VaR can be combined. For example, we could be interested in computing the 95% ES of Y when Y ≥ L*. This is simply given by VaR_Y + e_{VaR_Y}(Y).
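The shadow mean and the conditional quantile function lend themselves to a direct numerical check (a sketch, not the authors' code; the parameter values are made up, and the incomplete gamma function is bypassed by integrating the survival function after a change of variable):

```python
import math

def shadow_mean(alpha, sigma, L, H, steps=200_000):
    """E[Y | Y > L]: integrate the survival function of Y.

    With t = log((H - L)/(H - y)) the mean becomes
    L + (H - L) * Int_0^inf (1 + H t / (alpha sigma))**(-alpha) * exp(-t) dt;
    a further substitution t = s**5 tames the integrable singularity at 0.
    """
    c = H / (alpha * sigma)
    s_max = 40.0 ** 0.2          # integrate t up to 40: exp(-40) is negligible
    ds = s_max / steps
    acc = 0.0
    for i in range(steps):
        s = (i + 0.5) * ds       # midpoint rule on the graded grid
        t = s ** 5
        acc += (1.0 + c * t) ** (-alpha) * math.exp(-t) * 5.0 * s ** 4 * ds
    return L + (H - L) * acc

def quantile(p, alpha, sigma, L, H):
    """Conditional quantile of Y above L, inverting
    F(y) = 1 - (1 + H log((H-L)/(H-y)) / (alpha sigma))**(-alpha)."""
    t = (alpha * sigma / H) * ((1.0 - p) ** (-1.0 / alpha) - 1.0)
    return H - (H - L) * math.exp(-t)

# Made-up parameters: an apparently infinite-mean tail (alpha < 1), remote bound H
alpha, sigma, L, H = 0.8, 2.0, 1.0, 7.2e9
m = shadow_mean(alpha, sigma, L, H)         # finite, despite alpha < 1
var95 = quantile(0.95, alpha, sigma, L, H)  # 95% VaR of Y | Y > L
```

With α < 1 the sample mean of Y diverges as more data arrive, yet the shadow mean is a perfectly finite number; the quantile function here is algebraically the same as Q(p) above, merely written through the inversion of F.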
There are three ways to go about explicitly cutting a Paretian distribution in the tails (not counting methods to stretch or "temper" the distribution).

1) The first consists in hard truncation, i.e. in setting a single endpoint for the distribution and normalizing. For instance the distribution would be normalized between L and H, distributing the excess mass across all points.

2) The second would assume that H is an absorbing barrier, so that all the realizations of the random variable in excess of H would be compressed into a Dirac delta function at H, as practiced in derivative models. In that case the distribution would have the same density as a regular Pareto except at the point H.

3) The third is the one presented here.

The same problem has cropped up in quantitative finance over the use of the truncated normal (to correct for Bachelier's use of a straight Gaussian) vs. the logarithmic transformation (Sprenkle [ ]), with the standard model opting for the logarithmic transformation and the associated one-tailed lognormal distribution. Aside from the additivity of log-returns and other such benefits, these models do not produce a "cliff", that is, an abrupt change in density below or above a point, with the instability associated with risk measurements on non-smooth functions.

As to the use of extreme value theory, Beirlant et al. ( ) [ ] truncate the distribution by having an excess in the tails with the transformation Y^(−α) → (Y^(−α) − H^(−α)), and apply EVT to the result. Given that the transformation includes the estimated parameter, a new MLE for the parameter α is required. We find issues with such a non-smooth transformation: the same problem occurs as with financial asset models, namely the presence of an abrupt "cliff" below which there is a density, and above which there is none. The effect is that the expectation obtained in such a way will be higher than ours, particularly at values of α < 1, as seen in Figure . .

We can demonstrate the last point as follows. Assume we observe a distribution that is in fact a truncated Pareto but treat it as a (full) Pareto. The density is

    f(x) = (1/σ) ((x − L)/(ασ) + 1)^(−α−1),   x ∈ [L, ∞).

The truncation gives

    g(x) = (1/σ) ((x − L)/(ασ) + 1)^(−α−1) / (1 − (ασ/(ασ + H − L))^α),   x ∈ [L, H].

Moments of order p of the truncated Pareto (i.e. what is seen from realizations of the process), M(p), are:

    M(p) = α e^(−iπp) (ασ)^α (ασ − L)^(p−α) ( B_{H/(L−ασ)}(p + 1, −α) − B_{L/(L−ασ)}(p + 1, −α) ) / ( (ασ/(ασ + H − L))^α − 1 ), ( . )

where B_·(·, ·) is the (incomplete) Euler Beta function, B(a, b) = Γ(a)Γ(b)/Γ(a + b) = ∫_0^1 t^(a−1) (1 − t)^(b−1) dt.

We end up with r(H, α), the ratio of the mean of the soft truncated distribution to that of the hard truncated Pareto:

    r(H, α) = e^(−α/H) (α/H)^α (α/(α + H))^(−α) ((α + H)/α)^(−α) ( −((α + H)/α)^α + H + 1 ) / ( α − ((α/H)^α − ((α + H)/H)^α) ) E_α(α/H), ( . )

where E_α(α/H) is the exponential integral, E_n(z) = ∫_1^∞ e^(−zt) t^(−n) dt.

Operational risk
The losses for a firm are bounded by the capitalization, with well-known maximum losses.
Capped reinsurance contracts
Reinsurance contracts almost always have caps (i.e., a maximum claim); but a reinsurer can have many such contracts on the same source of risk, and the addition of each contract pushes the upper bound in such a way as to cause larger potential cumulative harm.
Violence
While wars are extremely fat-tailed, the maximum effect from any such event cannot exceed the world's population.
Credit risk
A loan has a finite maximum loss, in a way similar to reinsurance contracts.
City size
While cities have been shown to be Zipf distributed, the size of a given city cannot exceed that of the world's population.

Figure . : Ratio of the expectation of the smooth transformation to that of the truncated one, E[X_smooth]/E[X_truncated], as a function of α, for different values of H.
Environmental harm
While these variables are exceedingly fat-tailed, the risk is confined by the size of the planet (or the continent on which they take place) as a firm upper bound.
Complex networks
The number of connections is finite.
Company size
The sales of a company are bounded by the GDP.
Earthquakes
The maximum harm from an earthquake is bounded by the energy.
Hydrology
The maximum level of a flood can be determined.

ON THE TAIL RISK OF VIOLENT CONFLICT (WITH P. CIRILLO) ‡

We examine all possible statistical pictures of violent conflicts over common-era history, with a focus on dealing with the incompleteness and unreliability of data. We apply methods from extreme value theory on log-transformed data to remove compact support, then, owing to the boundedness of maximum casualties, retransform the data and derive expected means. We find the estimated mean likely to be at least three times larger than the sample mean, meaning a severe underestimation of the severity of conflicts from naive observation. We check for robustness by sampling between high and low estimates and jackknifing the data. We study inter-arrival times between tail events and find (first-order) memorylessness of events. The statistical pictures obtained are at variance with the claims about "long peace".

This study is as much about new statistical methodologies for thick-tailed (and unreliable) data, and for bounded random variables with local power law behavior, as it is about the properties of violence.

(Research chapter. Acknowledgments: Captain Mark Weisenborn engaged in the thankless and gruesome task of compiling the data, checking across sources and linking each conflict to a narrative on Wikipedia (see Appendix). We also benefited from generous help on social networks, where we put the data up for scrutiny, as well as advice from historians thanked in the same appendix. We also thank the late Benoit Mandelbrot for insights on the tail properties of wars and conflicts, as well as Yaneer Bar-Yam, Raphael Douady...)

Figure . : Values of the tail exponent α from the Hill estimator obtained across rescaled casualty numbers uniformly selected between low and high estimates of each conflict. The exponent is slightly (but not meaningfully) different from the maximum likelihood estimate for all data, as we focus on top deviations.

Figure . : Q-Q plot of the rescaled data in the near-tail, plotted against a Pareto II (Lomax)-style distribution.

Figure . : Death toll from "named conflicts" over time. Conflicts lasting more than years are disaggregated into two or more conflicts, each one lasting years.

Violence is much more severe than it seems from conventional analyses and the prevailing "long peace" theory, which claims that violence has declined. Adapting methods from extreme value theory, and adjusting for errors in the reporting of conflicts and in historical estimates of casualties, we look at the various statistical pictures of violent conflicts, with focus for the parametrization on those with more than k victims (in equivalent ratio of today's population, which would correspond to ≈ k in the 18th C.). Contrary to current discussions, all statistical pictures thus obtained show that 1) the risk of violent conflict has not been decreasing, but is rather underestimated by techniques relying on naive year-on-year changes in the mean, or using the sample mean as an estimator of the true mean of an extremely fat-tailed phenomenon; 2) armed conflicts have memoryless inter-arrival times, thus incompatible with the idea of a time trend.

Figure . : Rescaled death toll of armed conflicts and regimes over time. Data are rescaled w.r.t. today's world population. Conflicts lasting more than years are disaggregated into two or more conflicts, each one lasting years.

Figure . : Observed "journalistic" mean compared to MLE mean (derived from rescaling back the data to compact support) for different values of α (hence for permutations of the pair (σ_α, α)). The "range of α" is the one we get from possible variations of the data from bootstrap and reliability simulations.

Our analysis uses 1) raw data, as recorded and estimated by historians; 2) a naive transformation, used by certain historians and sociologists, which rescales past conflicts and casualties with respect to the actual population; 3) more importantly, a log transformation to account for the fact that the number of casualties in a conflict cannot be larger than the world population. (This is similar to the transformation of data into log-returns in mathematical finance in order to use distributions with support on the real line.)

All in all, among the different classes of data (raw and rescaled), we observe that 1) casualties are power law distributed; in the case of log-rescaled data we observe 0.4 ≤ α ≤ 0.7, thus indicating an extremely fat-tailed phenomenon with an undefined mean (a result that is robustly obtained); 2) the inter-arrival times of conflicts above the k threshold follow a homogeneous Poisson process, indicating no particular trend, and therefore contradicting a popular narrative about the decline of violence; 3) the true mean to be expected in the future, and the most compatible with the data, though highly stochastic, is ≈ 3× higher than the past mean.

Further, we explain how the mean (in terms of expected casualties) is severely underestimated by conventional data analyses, as the observed mean is not an estimator of the true mean (unlike the tail exponent, which provides a picture with smaller noise); and how misconceptions arise from the deceivingly lengthy (and volatile) inter-arrival times between large conflicts.

To remedy the inaccuracies of historical numerical assessments, we provide a standard bootstrap analysis of our estimates, in addition to Monte Carlo checks for the unreliability of wars and the absence of events from currently recorded history.

(Footnote: Many earlier studies have found Paretianity in the data, [ ], [ ]. Our study, aside from the use of extreme value techniques, reliability bootstraps, and compact support transformations, varies in both calibrations and interpretation.)

. . Results

Paretian tails
Peak-Over-Threshold methods show that both raw and rescaled variables exhibit strong Paretian tail behavior, with survival probability P(X > x) = λ(x) x^(−α), where λ: [L, +∞) → (0, +∞) is a slowly varying function, defined as lim_{x→+∞} λ(kx)/λ(x) = 1 for any k > 0. The Generalized Pareto fit of the excesses gives G(x) = 1 − (1 + ξ y/β)^(−1/ξ), with ξ ≈ 1.88 ± .14 for rescaled data, which corresponds to a tail exponent α = 1/ξ = .53 ± .04.

Memorylessness of onset of conflicts
Tables . and . show the inter-arrival times, meaning one can wait more than a hundred years for an event such as WWII without changing one's expectation. There is no visible autocorrelation, no statistically detectable temporal structure (i.e. we cannot see the imprint of a self-exciting process); see Figure . .

Full distribution(s)
The rescaled data fits a Lomax-style distribution with the same tail as obtained by POT, with strong goodness of fit. For events with casualties > L = 10K, 25K, 50K, etc., we fit different Pareto II (Lomax) distributions with corresponding tail α (fit from the GPD), with scale σ = 84, 360, i.e., with density

    α ((−L + σ + x)/σ)^(−α−1) / σ,   x ≥ L.

We also consider a wider array of statistical "pictures" from pairs (α, σ_α) across the data, from potential alternative values of α, with recalibration of the maximum likelihood σ; see Figure . .

Difference between the sample mean and the maximum likelihood mean: Table . shows the true mean using the parametrization of the Pareto distribution above and inverting the transformation back to compact support. The "true", or maximum likelihood, or "statistical", mean is between and times the observed mean. This means that the "journalistic" observation of the mean, aside from the conceptual mistake of relying on the sample mean, underestimates the true mean by at least 3 times, and higher future observations would not allow the conclusion that violence has "risen".

Table . : Sample means and estimated maximum likelihood mean across minimum values L – Rescaled data.
L      Sample Mean      ML Mean      Ratio
  K    9.079 × 10^      × 10^        .
  K    9.82 × 10^       × 10^        .
  K    1.12 × 10^       × 10^        .
  K    1.34 × 10^       × 10^        .
  K    1.66 × 10^       × 10^        .
  K    2.48 × 10^       × 10^        .

. . Conclusion
History as seen from tail analysis is far more risky, and conflicts far more violent, than acknowledged by naive observation of the behavior of averages in historical time series.
Table . : Average inter-arrival times and their mean absolute deviation for events with more than , , and million casualties, using actual estimates.

Threshold    Average    MAD
    .           71        31
    .          662        42
    .           19        47
    .          315        57
    .           74        68
    .           58       144

Table . : Average inter-arrival times and their mean absolute deviation for events with more than , , , , , and million casualties, using rescaled amounts.

Threshold    Average    MAD
    .           27        12
    .          592        16
    .           84        18
    .          135        26
    .           31        27
    .           39        41
    .           47        52
    .           88        78

Table . : Estimates (and standard errors) of the Generalized Pareto Distribution parameters for casualties over a k threshold. For both actual and rescaled casualties, we also provide the number of events lying above the threshold (the total number of events in our data is ).

Data               Nr. Excesses    ξ          β
Raw Data           307             1.  ( . )  .  ( . )
Naive Rescaling    524             1.  ( . )  .  ( . )
Log-rescaling      524             1.  ( . )  .  ( . )

. . Rescaling Method
We remove the compact support so as to be able to use power laws, as follows (see earlier chapters). Using X_t as the r.v. for the number of incidences from conflict at time t, consider first a naive rescaling X′_t = X_t/H_t, where H_t is the total human population at period t. (See the appendix for methods of estimation of H_t.)

Next, with today's maximum population H and L the naively rescaled minimum for our definition of conflict, we introduce a smooth rescaling function φ: [L, H] → [L, ∞) satisfying:

i. φ is "smooth": φ ∈ C^∞;
ii. φ^(−1)(∞) = H;
iii. φ^(−1)(L) = φ(L) = L.

In particular, we choose

    φ(x) = L − H log( (H − x)/(H − L) ). ( . )

We can perform appropriate analytics on x_r = φ(x), given that it is unbounded, and properly fit power law exponents; then we can rescale back for the properties of X. Notice also that φ(x) ≈ x for very large values of H. This means that for a very large upper bound, the results we get for x and φ(x) will be essentially the same. The big difference is only from a philosophical/methodological point of view, in the sense that we remove the upper bound (which is unlikely to be reached). In what follows we use the naively rescaled casualties as input for the φ(·) function, and we pick H = P_2014 for the exercise.

The distribution of x can be rederived from the distribution of x_r as follows:

    ∫_L^∞ f(x_r) dx_r = ∫_L^{φ^(−1)(∞)} g(x) dx, ( . )

where φ^(−1)(u) = (L − H) e^((L−u)/H) + H. In this case, from the Pareto-Lomax selected:

    f(x_r) = α ((−L + σ + x_r)/σ)^(−α−1) / σ,   x_r ∈ [L, ∞), ( . )

    g(x) = α H ((σ − H log((H − x)/(H − L)))/σ)^(−α−1) / (σ (H − x)),   x ∈ [L, H],

which verifies ∫_L^H g(x) dx = 1. Hence the expectation

    E_g(X; L, H, σ, α) = ∫_L^H x g(x) dx, ( . )

    E_g(X; L, H, σ, α) = α H ( 1/α − ((H − L)/H) e^(σ/H) E_{α+1}(σ/H) ), ( . )

where E_n(z) = ∫_1^∞ e^(−zt) t^(−n) dt is the exponential integral. Note that we rely on the invariance property:

Remark: If θ̂ is the maximum likelihood estimator (MLE) of θ, then for an absolutely continuous function ϕ, ϕ(θ̂) is the MLE estimator of ϕ(θ). For further details see [ ].

. . Expectation by Conditioning (less rigorous)
We would be replacing the smooth function φ ∈ C^∞ by a Heaviside step function, that is, the indicator function 1_{[L,H]}: R → {0, 1}, and write

    E(X 1_{[L,H]}) = ∫_L^H x f(x) dx / ∫_L^H f(x) dx,

which for the Pareto-Lomax becomes:

    E(X 1_{[L,H]}) = ( L + σ/(α − 1) − (σ/(H − L + σ))^α ( H + (H − L + σ)/(α − 1) ) ) / ( 1 − (σ/(H − L + σ))^α ). ( . )

. . Reliability of Data and Effect on Tail Estimates
Data from violence is largely anecdotal, spreading via citations, often based on some vague estimate, without anyone's ability to verify the assessments using period sources. An event that took place in the seventh century, such as the An Lushan rebellion, is "estimated" to have killed million people, with no precise or reliable methodology to allow us to trust the number. The independence war of Algeria has various estimates, some from France, others from the rebels, and nothing scientifically or professionally obtained.

As said earlier, in this chapter we use different data: raw data, data naively rescaled w.r.t. the current world population, and log-rescaled data that avoids the theoretical problem of the upper bound.

For some observations, together with the estimated number of casualties resulting from historical sources, we also have a lower and an upper bound available. Let X_t be the number of casualties in a given conflict at time t. In principle, we can define triplets like:

• {X_t, X^l_t, X^u_t} for the actual estimates (raw data), where X^l_t and X^u_t represent the lower and upper bound, if available;

• {Y_t = X_t P_2014/P_t, Y^l_t = X^l_t P_2014/P_t, Y^u_t = X^u_t P_2014/P_t} for the naively rescaled data, where P_2014 is the world population in 2014 and P_t is the population at time t = 1, ..., 2014;

• {Z_t = φ(Y_t), Z^l_t = φ(Y^l_t), Z^u_t = φ(Y^u_t)} for the log-rescaled data.

To prevent possible criticism about the use of middle estimates, when bounds are present, we have decided to use the following Monte Carlo procedure (for more details see [ ]), obtaining no significant difference in the estimates of all the quantities of interest (such as the tail exponent α = 1/ξ):

1. For each event X for which bounds are present, we have assumed casualties to be uniformly distributed between the lower and the upper bound, i.e. X ∼ U(X^l, X^u). The choice of the uniform distribution is to keep things simple.
All other bounded distributions would in fact generate the same results in the limit, thanks to the central limit theorem.

2. We have then generated a large number of Monte Carlo replications, and in each replication we have assigned a random value to each event X according to U(X^l, X^u).

3. For each replication we have computed the statistics of interest, typically the tail exponent, obtaining values that we have later averaged.

This procedure has shown that the precision of the estimates does not affect the tail of the distribution of casualties, as the tail exponent is rather stable.

For those events for which no bound is given, the options were to use them as they are, or to perturb them by creating fictitious bounds around them (and then treat them as the other bounded ones in the Monte Carlo replications). We have chosen the second approach.

The above also applies to Y_t and Z_t. Note that the tail α derived from an average is different from an average α across different estimates, which is the reason we perform the various analyses across estimates.

Technical comment
These simulations are largely looking for a "stochastic alpha" bias from errors and unreliability of data (see Chapter ). With a sample size of n, a parameter θ̂_m will be the average parameter obtained across a large number m of Monte Carlo runs. Let X^j be a given Monte Carlo simulated vector indexed by j, and let X^m be the middle estimate between the high and low bounds. While, across runs, (1/m) ∑_{j≤m} ‖X^j‖ ≈ ‖X^m‖, for an individual run j we have ‖X^j‖ ≠ ‖X^m‖, so that

    θ̂_m = (1/m) ∑_{j≤m} θ̂(X^j) ≠ θ̂(X^m).

For instance, consider the maximum likelihood estimation of a Paretian tail,

    α̂(X) ≜ n ( ∑_{1≤i≤n} log(x_i/L) )^(−1).

With ∆ ≥ x_m, define

    α̂(X ⊔ ∆) ≜ (n/2) ( ( ∑_{i=1}^n log(x_i/L) − log(∆/L) )^(−1) + ( ∑_{i=1}^n log(x_i/L) + log(∆/L) )^(−1) ),

which, owing to the concavity of the logarithmic function, gives the inequality: for ∆ ≥ x_m,

    α̂(X ⊔ ∆) ≥ α̂(X).

. . Definition of An "Event"

"Named" conflicts are an arbitrary designation that, often, does not make sense statistically: a conflict can have two or more names; two or more conflicts can have the same name; and we found no satisfactory hierarchy between war and conflict. For uniformity, we treat events as the shorter of the event or its disaggregation into units with a maximum duration of years each. Accordingly, we treat the Mongolian wars, which lasted more than a century and a quarter, as more than a single event. It makes little sense otherwise, as it would be the equivalent of treating the period from the Franco-Prussian war to WW II as "German(ic) wars", rather than as multiple events, because these wars had individual names in contemporary sources.
Effec-tively the main sources such as the Encyclopedia of War [ ] list numerous conflictsin place of "Mongol Invasions" –the more sophisticated the historians in a givenarea, the more likely they are to break conflicts into different "named" events and,depending on historians, Mongolian wars range between and conflicts.What controversy about the definition of a "name" can be, once again, solvedby bootstrapping. Our conclusion, incidentally, is invariant to the bundling orunbundling of the Mongolian wars.Further, in the absence of a clearly defined protocol in historical studies, it hasbeen hard to disentangle direct death from wars and those from less direct effectson populations (say blocades, famine). For instance the First Jewish War has con-fused historians as an estimated K death came from the war, and a considerablyhigher (between
 K and  M according to Josephus) from the famine or civilian casualties.

on the tail risk of violent conflict (with p. cirillo) ‡

. . Missing Events
We can assume that there are numerous wars that are not part of our sample, even if we doubt that such events are in the "tails" of the distribution, given that large conflicts are more likely to be reported by historians. Further, we also assume that their occurrence is random across the data (in the sense that they do not have an effect on clustering). But we are aware of a bias from a differential in both accuracy and reporting across time: events are more likely to be recorded in modern times than in the past. Raising the minimum value L, the number of such "missed" events and their impact are likely to drop rapidly. Indeed, as a robustness check, raising the bar to a minimum L = 500 K does not change our analysis.

A simple jackknife procedure, performed by randomly removing a proportion of events from the sample and repeating the analyses, shows us the dependence of our analysis on missing events, a dependence that we have found to be insignificant when focusing on the tail of the distribution of casualties. In other words, given that we are dealing with extremes, if removing  % of events and checking the effects on the parameters produces no divergence from the initial results, then we do not need to worry about having missed  % of events, as missing events are not likely to cause thinning of the tails.

. . Survivorship Bias
We did not take into account survivorship biases in the analysis, assuming them to be negligible before , as the probability of a conflict affecting all of mankind was negligible. Such a probability (and risk) has become considerably higher since, especially because of nuclear and other weapons of mass destruction.
Figures . and . graphically represent our data: the number of casualties over time. Figure . refers to the estimated actual number of victims, while Figure . shows the rescaled amounts, obtained by rescaling the past observations with respect to the world population in  (around . billion people). Figure . might suggest an increase in the death toll of armed conflicts over time, thus supporting the idea that war violence has increased. Figure . , conversely, seems to suggest a decrease in the (rescaled) number of victims, especially in the last hundred years, and possibly in violence as well. In what follows we show that both interpretations are surely naive, because they do not take into consideration the fact that we are dealing with extreme events.

The opposite is not true, which is at the core of the Black Swan asymmetry: such a procedure does not remedy the absence of tail, "Black Swan" events from the record. A single "Black Swan" event can considerably fatten the tail. In this case the tail is fat enough, and no missing information seems able to make it thinner.

Notice that, in equation ( . ), for H = 7.2 billion, φ(x) ≈ x. Therefore Figure . is also representative of log-rescaled data.

. . Peaks over Threshold
Given the fat-tailed nature of the data, which can be easily observed with some basic graphical tools like histograms on the logs and QQ plots (Figure . shows the QQ plot of actual casualties against an exponential distribution: the clear concavity is a signal of a fat-tailed distribution), it seems appropriate to use a well-known method of extreme value theory to model war casualties over time: the Peaks-over-Threshold method, or POT [ ].

According to the POT method, excesses of an i.i.d. sequence over a high threshold u (which we have to identify) occur at the times of a homogeneous Poisson process, while the excesses themselves can be modeled with a Generalized Pareto Distribution (GPD). Arrival times and excesses are assumed to be independent of each other. In our case, assuming the independence of the war events does not seem a strong assumption, given the time and space separation among them. The other assumptions, on the contrary, we have to check.

We start by identifying the threshold u above which the GPD approximation may hold. Different heuristic tools can be used for this purpose, from the Zipf plot to the mean excess function plot, where one looks for the linearity which is typical of fat-tailed phenomena [ , ]. Figure . shows the mean excess function plot for actual casualties: an upward trend is clearly present, already starting with a threshold equal to  k victims. For the goodness of fit, it might be appropriate to choose a slightly larger threshold, like u = 50 k.

Figure . : QQ plot of actual casualties against standard exponential quantiles. The concave curvature of the data points is a clear signal of heavy tails.

Similar results hold for the rescaled amounts (naive and log). For the sake of brevity we always show plots for one of the two variables, unless a major difference is observed. This idea has also been supported by subsequent goodness-of-fit tests.

Figure . 
: Mean excess function plot (MEPLOT) for actual casualties. An upward trend –almost linear in the first part of the graph– is present, suggesting the presence of a fat right tail. The variability of the mean excess function for higher thresholds is due to the small number of observations exceeding those thresholds and should not be taken into consideration.

. . Gaps in Series and Autocorrelation
To check whether events over time occur according to a homogeneous Poisson process, a basic assumption of the POT method, we can look at the distribution of the inter-arrival times, or gaps, which should be exponential. Gaps should also show no autocorrelation.
Figure . : ACF plot of gaps for actual casualties; no significant autocorrelation is visible.
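The three checks in this section can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it uses simulated exponential gaps (an assumed mean of 25 time units, unrelated to the chapter's dataset) and verifies the signatures of a homogeneous Poisson process: a coefficient of variation near 1, negligible lag-1 autocorrelation, and the memoryless survival probability e^(−3) ≈ 0.05 for a gap of three times the mean.

```python
# Minimal sketch, not the authors' code: signatures of a homogeneous
# Poisson process in the inter-arrival ("gap") series.
import math
import random
import statistics

random.seed(3)
MEAN_GAP = 25.0                          # assumed mean gap, arbitrary units
gaps = [random.expovariate(1 / MEAN_GAP) for _ in range(5000)]

# 1) exponentiality check: coefficient of variation should be ~ 1
cv = statistics.pstdev(gaps) / statistics.fmean(gaps)

# 2) no serial dependence: lag-1 sample autocorrelation should be ~ 0
def lag1_autocorr(xs):
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

rho = lag1_autocorr(gaps)

# 3) memorylessness: a gap of three times the mean occurs with prob. e^-3 ~ 0.05
frac_long = sum(g > 3 * MEAN_GAP for g in gaps) / len(gaps)
```

The last quantity is the one the chapter's footnote relies on: a lull of three mean inter-arrival times is only a ~5% event, hence not evidence of a structural change.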
Figure . clearly shows the absence of autocorrelation. The plausibility of an exponential distribution for the inter-arrival times can be positively checked using both heuristic and analytic tools. Here we omit the positive results for brevity.

However, in order to provide some extra useful information, in Tables . and . we provide some basic statistics about the inter-arrival times of very catastrophic events in terms of casualties. The simple evidence contained there should already be sufficient to underline how unreliable the statement that war violence has been decreasing over time can be. For events with more than  million victims, if we refer to actual estimates, the average time delay is .  years, with a mean absolute deviation of .  years. This means that it is totally plausible that in the last few years we have not observed such a large event; it could simply happen tomorrow or at some time in the future. It also means that trend extrapolation makes no great sense for this type of extreme event. Finally, we have to consider that an event as large as WW  happened only once in  years, if we deal with actual casualties (for rescaled casualties we can consider the An Lushan rebellion); in this case the possible waiting time is even longer.

. . Tail Analysis
Given that the POT assumptions about the Poisson process seem to be confirmed by the data, it is finally time to fit a Generalized Pareto Distribution to the exceedances. Consider a random variable X with distribution function F, and call F_u the conditional distribution function of X above a given threshold u. We can then define a r.v. Y, representing the rescaled excesses of X over the threshold u, getting [ ]

F_u(y) = P(X − u ≤ y | X > u) = (F(u + y) − F(u)) / (1 − F(u))

for 0 ≤ y ≤ x_F − u, where x_F is the right endpoint of the underlying distribution F. Pickands [ ], Balkema and de Haan [ ], [ ] and [ ] showed that for a large class of underlying distribution functions F (falling in the so-called domain of attraction of the GEV distribution [ ]), and a large u, F_u can be approximated by a Generalized Pareto distribution: F_u(y) → G(y) as u → ∞, where

G(y) = 1 − (1 + ξ y/β)^(−1/ξ) if ξ ≠ 0;  G(y) = 1 − e^(−y/β) if ξ = 0.   ( . )

It can be shown that the GPD interpolates between the exponential distribution (for ξ = 0) and a class of Pareto distributions. We refer to [ ] for more details. The parameters in ( . ) can be estimated using methods like maximum likelihood or probability weighted moments [ ]. The goodness of fit can then be tested using bootstrap-based tests [ ].

Table . does not show the average delay for events with  M ( M) or more casualties. This is due to the limited number of these observations in the actual, non-rescaled data. In particular, all the events with more than  million victims have occurred during the last  years, and the average inter-arrival time is below  years. Are we really living in a more peaceful world? For rescaled amounts, inter-arrival times are shorter, but the interpretation is the same.

Table . contains our MLE estimates for actual and rescaled casualties above a  k victims threshold. 
This threshold is in fact the one providing the best compromise between goodness of fit and a sufficient number of observations, so that standard errors are reliable. The actual and the two rescaled data sets show different sets of estimates, but their interpretation is strongly consistent. For this reason we just focus on actual casualties for the discussion.

The parameter ξ is the most important for us: it is the parameter governing the fatness of the right tail. A ξ greater than 1 (we have 1.5886) signifies that no moment is defined for our Generalized Pareto: a very fat-tailed situation. Naturally, in the sample, we can compute all the moments we are interested in, but from a theoretical point of view they are completely unreliable and their interpretation is extremely flawed (a very common error though). According to our fitting, very catastrophic events are not at all improbable. It is worth noticing that the estimate is significant, given that its standard error is 0.1467.

Figures . and . compare our fittings to actual data. In both figures it is possible to see the goodness of the GPD fit for most of the observations above the  k victims threshold. Some problems arise for the very large events, like WW  and the An Lushan rebellion. In this case it appears that our fitting expects larger events to have happened. This is a well-known problem for extreme data [ ]. The very large event could just be around the corner. Similarly, events with  to  million victims (not at all minor ones!) seem to be slightly more frequent than what is expected by our GPD fit. This is another signal of the extreme character of war casualties, which does not allow for the extrapolation of simplistic trends.

Figure . : GPD tail fit to actual casualties' data (in  k). Parameters as per Table . , first line.

If we remove the two largest events from the data, the GPD hypothesis cannot be rejected at the 5% significance level.
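The threshold-excess fit can be sketched with off-the-shelf tools. The snippet below is a hedged illustration, not the authors' pipeline: it uses scipy as a stand-in for the MLE machinery and simulated GPD draws with an assumed shape ξ = 1.5 (in the same infinite-mean zone as the chapter's 1.5886), rather than the war dataset; a small resampling loop mimics the stability checks done later in the chapter.

```python
# Hedged sketch of the POT fitting step on simulated data (assumed xi = 1.5),
# with scipy.stats.genpareto standing in for the authors' MLE code.
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
xi_true, beta_true = 1.5, 2.0
data = genpareto.rvs(xi_true, loc=0.0, scale=beta_true, size=20_000,
                     random_state=rng)

u = np.quantile(data, 0.90)              # pick a high threshold
excesses = data[data > u] - u            # excesses over u are again GPD
xi_hat, _, beta_hat = genpareto.fit(excesses, floc=0.0)   # MLE with loc fixed

# quick bootstrap of the shape, in the spirit of the robustness checks:
m = int(0.9 * excesses.size)
boot = []
for _ in range(100):
    sample = rng.choice(excesses, size=m, replace=True)
    c, _, _ = genpareto.fit(sample, floc=0.0)
    boot.append(c)
share_infinite_mean = np.mean(np.array(boot) > 1.0)   # xi > 1: no finite mean
```

With ξ > 1 in essentially every resample, the "no theoretical mean" conclusion is insensitive to which observations happen to be in the sample, which is the point the chapter makes about its own estimates.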
Figure . : GPD cumulative distribution fit to actual casualties' data (in  k). Parameters as per Table . , first line.

. . An Alternative View on Maxima
Another method is the block-maxima approach of extreme value theory. In this approach data are divided into blocks, and within each block only the maximum value is taken into consideration. The Fisher-Tippett theorem [ ] then guarantees that the normalized maxima converge in distribution to a Generalized Extreme Value Distribution, or GEV.
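This convergence can be sketched numerically. The block below is an illustration on simulated Pareto data (an assumed tail exponent α = 1.25, so the GEV shape should come out near 1/α = 0.8), not the chapter's dataset; note that scipy's genextreme parameterization uses the opposite sign convention, c = −ξ, so a Fréchet (fat) tail shows up as c < 0.

```python
# Sketch of the block-maxima route on simulated Pareto data (assumed
# alpha = 1.25); scipy's genextreme uses c = -xi, hence the sign flip below.
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(2)
alpha = 1.25
data = (1.0 - rng.random(60_000)) ** (-1.0 / alpha)   # Pareto(alpha) sample

maxima = data.reshape(600, 100).max(axis=1)           # 600 blocks of 100 obs.
c_hat, loc_hat, scale_hat = genextreme.fit(maxima)    # GEV maximum likelihood
xi_hat = -c_hat                                       # back to the xi convention
```

A negative fitted c (positive ξ) places the data in the Fréchet maximum domain of attraction, which is the diagnosis the chapter reaches for the casualty data.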
GEV(x; ξ) = exp(−(1 + ξx)^(−1/ξ)) if ξ ≠ 0;  exp(−exp(−x)) if ξ = 0,

for 1 + ξx > 0. We refer to [ ] for more details.

If we divide our data into -year blocks, we obtain  observations (the last block is the residual one, from  to ). Maximum likelihood estimation gives a ξ larger than 1, indicating that we are in the so-called Fréchet maximum domain of attraction, compatible with very heavy-tailed phenomena. A value of ξ greater than 1 under the GEV distribution further confirms the idea of the absence of moments, a clear signal of a very heavy right tail.

. . Full Data Analysis
Naturally, being aware of its limitations, we can try to fit all our data: for casualties in excess of  , we fit the Pareto distribution from Equation . with α ≈ . The fit (for L =  K) can be seen in Figure . . Similar results to Figure . are obtained for the different values of L in the table below, all with the same goodness of fit.

[Table: thresholds L and the corresponding MLE scale σ (in K).]
σ_α can be calculated across different set values of α, with one single degree of freedom: the corresponding σ is an MLE estimate using such an α as fixed: for a sample size n, and x_i the observations higher than L,

σ_α = { s : αn/s − (α + 1) ∑_{i=1}^n 1/(x_i − L + s) = 0,  s > 0 }.

The sample average for L = 10 K is 9.12 × 10^ , across  K simulations, with the spread in values shown in Figure . . The "true" mean from Equation . yields 3.1 × 10^ , and we repeated for L =  K,  K,  K,  K,  K, and
 K, finding ratios of the true (estimated) mean to the observed one safely between  and  ; see Table . . Notice that this value for the mean, derived from the α estimate, is more rigorous and has a smaller error, since the estimate of α is asymptotically Gaussian, while the average of a power law, even when it exists, is considerably more stochastic. See the discussion on the "slowness of the law of large numbers" in  in connection with this point. We get the mean by truncation for L =  K a bit lower, under equation . : around 1.8835 × 10^ .

We finally note that, for the values of L considered,  % of conflicts with more than  ,  victims are below the mean: where m is the mean,

P(X < m) = 1 − ( −H log( α e^(s/H) E_{α+1}(s/H) ) / s )^(−α),

with E_{α+1} the exponential integral.

. . Bootstrap for the GPD
In order to check our sensitivity to the quality and precision of our data, we decided to perform some bootstrap analysis. For both the raw data and the rescaled ones we generated
 100 K new samples by randomly selecting  % of the observations, with replacement. Figures . , . and . show the stability of our ξ estimates. In particular, ξ > 1 in virtually all samples: the ξ estimates in Table . appear to be good approximations of the real GPD shape parameters, notwithstanding imprecisions and missing observations in the data.

Figure . : Distribution of the ξ parameter over 100 K bootstrap samples for actual data. Each sample is randomly selected with replacement, using  % of the original observations.

Figure . : Distribution of the ξ parameter over 100 K bootstrap samples for naively rescaled data. Each sample is randomly selected with replacement, using  % of the original observations.

Figure . : Distribution of the ξ parameter over 100 K bootstrap samples for log-rescaled data. Each sample is randomly selected with replacement, using  % of the original observations.

. . Perturbation Across Bounds of Estimates
We performed analyses for the "near tail" using the Monte Carlo techniques discussed in section . . We look at second order "p-values", that is, the sensitivity of the p-values across different estimates in Figure . –practically all results meet the same statistical significance and goodness of fit. In addition, we look at the values of both the sample means and the alpha-derived MLE mean across permutations; see Figures . and . .

Figure . : P-Values of Pareto-Lomax across 
 K combinations. This is not to ascertain the p-value, but rather to check robustness by looking at the variations across permutations of the estimates.

Figure . : Rescaled sample mean across 
 K estimates between the high and low bounds.

Figure . : Rescaled MLE mean across 
 K estimates between the high and low bounds.

Figure . : Log-log plot comparison of f and g, showing a pasting-boundary style capping around H.
To put our conclusion in the simplest of terms: the occurrence of events that would raise the average violence by a multiple of  would not cause us to rewrite this chapter, nor to change the parameters calibrated within.

• Indeed, from statistical analysis alone, the world is more unsafe than casually examined numbers suggest. Violence is underestimated by journalistic, nonstatistical looks at the mean, and by a lack of understanding of the stochasticity of inter-arrival times.

•
The transformation into compact support allowed us to perform the analyses and gauge such underestimation, which, albeit noisy, gives us an idea of the underestimation and its bounds.

•
In other words, a large event, and even a rise in observed mean violence, would not be inconsistent with the statistical properties, meaning it would justify a "nothing has changed" reaction.

•
We avoided discussions of homicide, since we limited L to values > 10,000; the homicide rate doesn't appear to have a particular bearing on the tails. It could be a drop in the bucket: it obeys different dynamics. We may have observed a lower rate of homicide in societies, but most risks of death come from violent conflict. (Casualties from homicide, rescaling from the rate of 70 per 100 k, get us 5.04 ×  casualties per annum at today's population. A drop to minimum levels stays below the difference between errors on the mean of violence from conflicts with higher than  ,  casualties.)

• We ignored survivorship bias in the data analysis (that is, the fact that had the world been more violent, we wouldn't be here to talk about it). Adding it would increase the risk. The presence of tail effects today makes further analysis require taking it into account. Since , a single conflict –which almost happened– has had the ability to reach the maximum number of casualties, something we did not have before. (We can rewrite the model with one of fragmentation of the world, constituted of n "separate" isolated independent random variables X_i, each with a maximum value H_i, with the total ∑^n w_i H_i = H, with all w_i > 0 and ∑^n w_i = 1. In that case the maximum (that is, the worst conflict) would require the joint probability that all X_1, X_2, ⋯, X_n are near their maximum value, which, under subexponentiality, is an event of much lower probability than having a single variable reach its maximum.)

The data were compiled by Captain Mark Weisenborn. We thank Ben Kiernan for comments on East Asian conflicts.

How long do we have to wait before making a scientific pronouncement about the drop in the incidence of wars of a certain magnitude? Simply, because inter-arrival times follow a memoryless exponential distribution, the survival function of a deviation of three times the mean is roughly e^(−3) ≈ .05.
It means we must wait for three times as long as the mean inter-arrival time before saying something scientific. For large wars such as WW  and WW , wait  years. It is what it is.

WHAT ARE THE CHANCES OF A THIRD WORLD WAR? ∗,†

This is from an article that is part of the debate with public intellectuals who claim that violence has dropped "from data", without realizing that science is hard; significance requires further data under fat tails and more careful examination. Our response (by the author and P. Cirillo) provides a way to summarize the main problem with naive empiricism under fat tails.

In a recent issue of Significance
 Mr. Peter McIntyre asked what the chances are that World War III will occur this century. Prof. Michael Spagat wrote that nobody knows, nobody can really answer –and we totally agree with him on this. Then he adds that "a really huge war is possible but, in my view, extremely unlikely." To support his statement, Prof. Spagat relies partly on the popular science work of Prof. Steven Pinker, expressed in The Better Angels of our Nature and in journalistic venues. Prof. Pinker claims that the world has experienced a long-term decline in violence, suggesting a structural change in the level of belligerence of humanity. It is unfortunate that Prof. Spagat, in his answer, refers to our paper (this volume, Chapter ), which is part of a more ambitious project we are working on related to fat-tailed variables.

What characterizes fat-tailed variables? They have their properties (such as the mean) dominated by extreme events, those "in the tails". The most popularly known version is the Pareto "80/20".

We show that, simply, the data do not support the idea of a structural change in human belligerence. So Prof. Spagat's first error is to misread our claim: we are making neither pessimistic nor optimistic declarations: we just believe that statisticians should abide by the foundations of statistical theory and avoid telling data what to say. Let us go back to first principles.

Discussion chapter.

Figure G. : After Napoleon, there was a lull in Europe. Until nationalism came to change the story.
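The "mean dominated by the tails" property has simple closed forms for the Pareto case. The sketch below is purely illustrative (textbook Pareto identities, no data from this chapter): it computes the share of observations sitting below the mean, and the share of the total carried by the top fraction q of observations, which reproduces the 80/20 rule at α ≈ 1.16.

```python
# Illustrative closed forms for a Pareto with minimum 1 and tail exponent alpha:
#   mean = alpha / (alpha - 1)
#   P(X < mean) = 1 - mean**(-alpha)          (most observations below the mean)
#   top fraction q of observations carries q**(1 - 1/alpha) of the total
def share_below_mean(alpha):
    mean = alpha / (alpha - 1.0)
    return 1.0 - mean ** (-alpha)

def top_share(alpha, q):
    return q ** (1.0 - 1.0 / alpha)

below = {a: share_below_mean(a) for a in (3.0, 2.0, 1.5, 1.1)}
top20 = {a: top_share(a, 0.20) for a in (3.0, 2.0, 1.5, 1.1)}
# top_share(1.16, 0.20) ~ 0.80: the "80/20" rule corresponds to alpha ~ 1.16
```

As α approaches 1 (the infinite-mean boundary), both quantities approach 1: almost everything is below the mean, and the mean is almost entirely a tail phenomenon, which is why uncertainty about the tails is uncertainty about the mean.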
Foundational Principles
Fundamentally, statistics is about ensuring people do not build scientific theories from hot air, that is, without significant departure from random. Otherwise, it is patently being "fooled by randomness". Further, for fat-tailed variables, the conventional mechanism of the law of large numbers is considerably slower, and significance requires more data and longer periods. Ironically, there are claims that can be made on little data: inference is asymmetric under fat-tailed domains.
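The slowness of the law of large numbers can be sketched numerically. The snippet below is an illustration under assumed parameters (a Pareto with α = 1.1 and true mean 11 versus a standard Gaussian), not an analysis of any dataset: at the same sample size, the Gaussian sample mean is already stable, while the Pareto sample mean is still far off and, tellingly, usually below the true mean.

```python
# Sketch of the slow LLN under fat tails: Pareto(alpha = 1.1, minimum 1)
# vs. a standard Gaussian, at the same sample size (assumed parameters).
import random
random.seed(11)

alpha = 1.1
true_mean = alpha / (alpha - 1.0)        # = 11 for minimum 1
n, repeats = 10_000, 200

below = 0
for _ in range(repeats):
    s = sum((1.0 - random.random()) ** (-1.0 / alpha) for _ in range(n))
    if s / n < true_mean:
        below += 1
frac_below = below / repeats             # most runs underestimate the true mean

gauss_mean = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n   # ~ 0 already
```

The asymmetry is visible in `frac_below`: the typical fat-tailed sample quietly understates its own mean, so "the observed average dropped" is far weaker evidence than "the observed average rose".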
We require more data to assert that there are no Black Swans than to assert that there are Black Swans; hence we would need much more data to claim a drop in violence than to claim a rise in it. Finally, statements that are not deemed statistically significant –and shown to be so– should never be used to construct scientific theories. These foundational principles are often missed because, typically, social scientists' statistical training is limited to mechanistic tools from thin-tailed domains [ ]. In physics, one can often claim evidence from small data sets, bypassing standard statistical methodologies, simply because the variance for these variables is low. The higher the variance, the more data one needs to make statistical claims. For fat tails, the variance is typically high and underestimated in past data.

The second –more serious– error Spagat and Pinker made is to believe that tail events and the mean are somehow different animals, not realizing that the mean includes these tail events. For fat-tailed variables, the mean is almost entirely determined by extremes. If you are uncertain about the tails, then you are uncertain about the mean. It is thus incoherent to say that violence has dropped but maybe not the risk of tail events; it would be like saying that someone is "extremely virtuous except during the school shooting episode when he killed  students".

Robustness
Our study tried to draw the most robust statistical picture of violence, relying on methods from extreme value theory and on statistical methods adapted to fat tails. We also ran robustness checks to deal with the imperfection of data collected some thousands of years ago: our results need to hold even if a third (or more) of the data were wrong.
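That kind of robustness check can be sketched as follows. The snippet is illustrative only, on simulated data with an assumed infinite-mean tail (α = 0.55), not the chapter's dataset: drop a large share of the observations at random, refit the tail exponent, and watch the estimate stay put.

```python
# Sketch of a deletion-robustness check on a simulated fat-tailed sample:
# the tail-exponent estimate should survive losing ~30% of the data.
import math
import random
random.seed(42)

alpha_true, L, n = 0.55, 10.0, 2000      # assumed tail exponent < 1, threshold L
data = [L * (1.0 - random.random()) ** (-1.0 / alpha_true) for _ in range(n)]

def tail_alpha(sample, L):
    logs = [math.log(x / L) for x in sample if x > L]
    return len(logs) / sum(logs)          # MLE (Hill-type) tail exponent

full = tail_alpha(data, L)
trials = []
for _ in range(200):                      # each trial is missing ~30% of events
    kept = [x for x in data if random.random() > 0.30]
    trials.append(tail_alpha(kept, L))
spread = max(trials) - min(trials)        # stays small: the tail is insensitive
```

Because the tail exponent is driven by the relative spacing of the large observations, randomly missing events thin the sample but not the tail, which is the sense in which the conclusions survive bad or incomplete recording.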
Inter-arrival times
We show that the inter-arrival times among major conflicts are extremely long, and consistent with a homogeneous Poisson process: therefore no specific trend can be established; we as humans cannot be deemed less belligerent than usual. For a conflict generating at least  million casualties, an event less bloody than WW  or WW , the waiting time is on average  years, with a mean absolute deviation of  (or  years and  deviations for data rescaled to today's population). The seventy years of what is called the "Long Peace" are clearly not enough to state much about the possibility of WW  in the near future.

Underestimation of the mean
We also found that the average violence observed in the past underestimates the true statistical average by at least half. Why? Consider that about – % of the observations fall below the mean, which requires some corrections with the help of extreme value theory. (Under extreme fat tails, the statistical mean can be closer to the past maximum observation than to the sample average.)

A common mistake
Similar mistakes have been made in the past. In , one H.T. Buckle used the same unstatistical reasoning as Pinker and Spagat:

That this barbarous pursuit is, in the progress of society, steadily declining, must be evident, even to the most hasty reader of European history. If we compare one country with another, we shall find that for a very long period wars have been becoming less frequent; and now so clearly is the movement marked, that, until the late commencement of hostilities, we had remained at peace for nearly forty years: a circumstance unparalleled (...) The question arises, as to what share our moral feelings have had in bringing about this great improvement.

Moral feelings or not, the century following Mr. Buckle's prose turned out to be the most murderous in human history.

Buckle, H.T. ( ) History of Civilization in England, Vol. , London: John W. Parker and Son.

We conclude by saying that we find it fitting –and are honored– to expose fundamental statistical mistakes in a journal called
Significance, as the problem is precisely about significance and about conveying notions of statistical rigor to the general public.

Part VI
METAPROBABILITY PAPERS

HOW THICK TAILS EMERGE FROM RECURSIVE EPISTEMIC UNCERTAINTY †

The Opposite of Central Limit: with the Central Limit Theorem, we start with a specific distribution and end with a Gaussian. The opposite is more likely to be true. Recall how we fattened the tail of the Gaussian by stochasticizing the variance? Now let us use the same metaprobability method, putting additional layers of uncertainty.
The Regress Argument (Error about Error)
The main problem behind
The Black Swan is the limited understanding of model (or representation) error and, for those who get it, a lack of understanding of second order errors (about the methods used to compute the errors) and, by a regress argument, an inability to continuously reapply the thinking all the way to its limit (particularly when one provides no reason to stop). Again, there is no problem with stopping the recursion, provided it is accepted as a declared a priori that escapes quantitative and statistical methods.
Epistemic not statistical re-derivation of power laws
Note that previous derivations of power laws have been statistical (cumulative advantage, preferential attachment, winner-take-all effects, criticality), and the properties derived by Yule, Mandelbrot, Zipf, Simon, Bak, and others result from structural conditions or from breaking the independence assumptions in the sums of random variables, allowing for the application of the central limit theorem [ ] [ ] [ ] [ ] [ ]. This work is entirely epistemic, based on standard philosophical doubts and regress arguments.

Discussion chapter. A version of this chapter was presented at Benoit Mandelbrot's Scientific Memorial on April , , in New Haven, CT.

Figure . : A version of this chapter was presented at Benoit Mandelbrot's memorial.

. . Layering Uncertainties
Take a standard probability distribution, say the Gaussian. The measure of dispersion, here σ, is estimated, and we need to attach some measure of dispersion around it. The uncertainty about the rate of uncertainty, so to speak, or higher order parameter, is similar to what is called the "volatility of volatility" in the lingo of option operators (see Taleb, , Derman, , Dupire, , Hull and White, ) –here it would be the "uncertainty rate about the uncertainty rate". And there is no reason to stop there: we can keep nesting these uncertainties into higher orders, with the uncertainty rate of the uncertainty rate of the uncertainty rate, and so forth. There is no reason to have certainty anywhere in the process.

. . Higher Order Integrals in the Standard Gaussian Case
We start with the case of a Gaussian and focus the uncertainty on the assumed standard deviation. Define φ(μ, σ; x) as the Gaussian PDF for value x with mean μ and standard deviation σ. A 2nd order stochastic standard deviation is the integral of φ across values of σ ∈ R⁺, under the PDF f_1(σ̄, σ_1; σ), with σ_1 its scale parameter (our approach to tracking the error of the error), not necessarily its standard deviation; the expected value of σ is σ̄:

f(x)_1 = ∫_0^∞ φ(μ, σ, x) f_1(σ̄, σ_1, σ) dσ

Generalizing to the Nth order, the density function f(x) becomes

f(x)_N = ∫_0^∞ … ∫_0^∞ φ(μ, σ, x) f_1(σ̄, σ_1, σ) f_2(σ_1, σ_2, σ_1) … f_N(σ_{N−1}, σ_N, σ_{N−1}) dσ dσ_1 dσ_2 … dσ_N   ( . )

The problem is that this approach is parameter-heavy and requires the specification of the subordinated distributions (in finance, the lognormal has been traditionally used for σ, or the Gaussian for the ratio log[σ_t/σ], since the direct use of a Gaussian allows for negative values). We would need to specify a measure f for each layer of error rate. Instead, this can be approximated by using the mean deviation for σ, as we will see next.

Discretization using nested series of two-states for σ –a simple multiplicative process. We saw in the last chapter a quite effective simplification to capture the convexity, the ratio of (or difference between) φ(μ, σ, x) and ∫_0^∞ φ(μ, σ, x) f_1(σ̄, σ_1, σ) dσ (the first order standard deviation), by using a weighted average of values of σ, say, for a simple case of one-order stochastic volatility:

σ(1 ± a(1)), with 0 ≤ a(1) < 1,

where a(1) is the proportional mean absolute deviation for σ, in other words the measure of the absolute error rate for σ. We use 1/2 as the probability of each state. Unlike the earlier situation, we are not preserving the variance, but rather the standard deviation. Thus the distribution using the first order stochastic standard deviation can be expressed as:

f(x)_1 = 1/2 ( φ(μ, σ(1 + a(1)), x) + φ(μ, σ(1 − a(1)), x) )   ( . )

Now assume uncertainty about the error rate a(1), expressed by a(2), in the same manner as before. Thus, in place of a(1) we have a(1)(1 ± a(2)).

Figure . : Three levels of error rates for σ following a multiplicative process.

The second order stochastic standard deviation:

f(x)_2 = 1/4 ( φ(μ, σ(1 + a(1))(1 + a(2)), x) + φ(μ, σ(1 − a(1))(1 + a(2)), x) + φ(μ, σ(1 + a(1))(1 − a(2)), x) + φ(μ, σ(1 − a(1))(1 − a(2)), x) )   ( . )

and the Nth order:

f(x)_N = 2^(−N) ∑_{i=1}^{2^N} φ(μ, σ M_i^N, x)

where M_i^N is the ith scalar (line) of the matrix M^N (2^N × 1):

M^N = ( ∏_{j=1}^N ( a(j) T_{i,j} + 1 ) )_{i=1}^{2^N}

and T_{i,j} is the element of the ith line and jth column of the matrix of the exhaustive combination of n-tuples of the set {−
1, 1}, that is, the sequences of length n of the form (1, 1, 1, ...) representing all combinations of 1 and −1. For n = 3:

T =
( −1  −1  −1
  −1  −1   1
  −1   1  −1
  −1   1   1
   1  −1  −1
   1  −1   1
   1   1  −1
   1   1   1 )

and

M^3 =
( (1 − a(1))(1 − a(2))(1 − a(3))
  (1 − a(1))(1 − a(2))(a(3) + 1)
  (1 − a(1))(a(2) + 1)(1 − a(3))
  (1 − a(1))(a(2) + 1)(a(3) + 1)
  (a(1) + 1)(1 − a(2))(1 − a(3))
  (a(1) + 1)(1 − a(2))(a(3) + 1)
  (a(1) + 1)(a(2) + 1)(1 − a(3))
  (a(1) + 1)(a(2) + 1)(a(3) + 1) )

So M_1^3 = (1 − a(1))(1 − a(2))(1 − a(3)), etc. Note that the various error rates a(i) are not similar to sampling errors, but rather projections of error rates into the future. They are, to repeat, epistemic.

The Final Mixture Distribution
The mixture weighted average distribution (recall that φ is the ordinary Gaussian PDF with mean m and standard deviation s for the random variable x):

f(x | m, s, M, N) = 2^(−N) Σ_{i=1}^{2^N} φ(m, s M_i^N, x)

It could be approximated by a lognormal distribution for s, with the corresponding V as its own variance. But it is precisely the V that interests us, and V depends on how the higher-order errors behave.

Figure: Thicker tails (higher peaks) for higher values of N; here N = 0, 5, 10, 25, 50, all with a = 1/10.
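A minimal numerical sketch of the mixture above (the parameter values m = 0, s = 1 and the flat error rate a(j) = 1/10 are illustrative, echoing the figure): it enumerates the 2^N scale multipliers M_i^N over all ±1 tuples and shows that tail probabilities rise with N while the mixture stays a proper density.

```python
from itertools import product
from math import erfc, sqrt

def multipliers(a):
    """All 2^N scale multipliers M_i = prod_j (1 + a[j] * t_j), t in {-1, +1}^N."""
    out = []
    for t in product((-1.0, 1.0), repeat=len(a)):
        m = 1.0
        for aj, tj in zip(a, t):
            m *= 1.0 + aj * tj
        out.append(m)
    return out

def tail_prob(K, a, s=1.0):
    """P(X > K) under the equal-weight Gaussian scale mixture f(x)_N."""
    ms = multipliers(a)
    return sum(0.5 * erfc(K / (sqrt(2.0) * s * m)) for m in ms) / len(ms)

if __name__ == "__main__":
    gauss = 0.5 * erfc(6.0 / sqrt(2.0))  # standard Gaussian P(X > 6)
    for N in (0, 5, 10):
        # the ratio to the Gaussian tail grows with the number of layers N
        print(N, tail_prob(6.0, [0.1] * N) / gauss)
```

Note that N = 0 recovers the plain Gaussian, so the recursion only adds mass to the tails.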
Next let us consider the different regimes for the higher-order errors.

regime 1 (explosive): case of a constant parameter a

Special case of constant a: Assume that a(1) = a(2) = ... = a(N) = a, i.e. the case of a flat proportional error rate a. The matrix M collapses into a conventional binomial tree for the dispersion at level N:

f(x | m, s, M, N) = 2^(−N) Σ_{j=0}^{N} C(N, j) φ(m, s (1 + a)^j (1 − a)^(N−j), x)

Because of the linearity of the sums, when a is constant, we can use the binomial distribution as weights for the moments (note again the artificial effect of constraining the first moment m in the analysis to a set, certain, and known a priori):

Moment 1:  m
Moment 2:  s²(a² + 1)^N + m²
Moment 3:  3 m s²(a² + 1)^N + m³
Moment 4:  3 s⁴(a⁴ + 6a² + 1)^N + 6 m² s²(a² + 1)^N + m⁴

Note again the oddity that, in spite of the explosive nature of the higher moments, the expectation of the absolute value of x is independent of both a and N, since the perturbations of s do not affect the first absolute moment √(2/π) s (that is, the initially assumed s). The situation would be different under an additive perturbation of x.

Every recursion multiplies the variance of the process by (1 + a²). The process is similar to a stochastic volatility model, with the standard deviation (not the variance) following a lognormal distribution, the volatility of which grows with M, hence will reach infinite variance at the limit.

Consequences

For a constant a > 0, and in the more general case with variable a where a(n) ≥ a(n−1), the moments explode.

A- Even the smallest value of a > 0 leads, since (1 + a²)^N is unbounded, to the second moment going to infinity (though not the first) as N → ∞. So an error rate even a tiny fraction of a percent will still lead to the explosion of the moments and the invalidation of the use of the class of L² distributions.

B- In these conditions, we need to use power laws for epistemic reasons, or, at least, distributions outside the L² norm, regardless of observations of past data. Note that we need an a priori reason (in the philosophical sense) to cut off the N somewhere, hence bound the expansion of the second moment.

Convergence to Properties Similar to Power Laws

We can see in the Log-Log plot that follows how, at higher orders of stochastic volatility with an equally proportional stochastic coefficient (where a(1) = a(2) = ... = a(N)), the density approaches that of a power law (just like the lognormal distribution at higher variance), as shown in the flatter density on the Log-Log plot. The probabilities keep rising in the tails as we add layers of uncertainty until they seem to reach the boundary of the power law, while, ironically, the first moment remains invariant.

The same effect takes place as a increases towards 1: at the limit, the tail exponent of P(X > x) approaches 1 but remains above 1.

Effect on Small Probabilities
Next we measure the effect on the thickness of the tails. The obvious effect is the rise of small probabilities.

Take the exceedance probability, that is, the probability of exceeding K, given N, for a constant parameter a:

P(X > K | N) = Σ_{j=0}^{N} 2^(−N−1) C(N, j) erfc( K / ( √2 s (1 + a)^j (1 − a)^(N−j) ) )

where erfc(.) is the complementary error function 1 − erf(.), with erf(z) = (2/√π) ∫₀^z e^(−t²) dt.

Figure: Log-Log plot of the probability of exceeding x, showing power law-style flattening as N rises; here a = 1/10.

Convexity effect
The next tables show the exceedance probability under different values of N divided by the corresponding probability for a standard Gaussian (N = 0).
Table: Case of a = 1/100

N     P(X>3|N)/P(X>3|N=0)     P(X>5|N)/P(X>5|N=0)     P(X>10|N)/P(X>10|N=0)
5     1.01724                 1.155                   7
10    1.0345                  1.326                   45
15    1.05178                 1.514                   221
20    1.06908                 1.720                   922
25    1.0864                  1.943                   3347
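Ratios of this kind can be reproduced directly from the exceedance formula above. A minimal sketch (the thresholds K = 3, 5, 10 and error rates are illustrative choices):

```python
from math import comb, erfc, sqrt

def p_exceed(K, N, a, s=1.0):
    """P(X > K | N) for the binomial mixture of Gaussian scales."""
    return sum(
        comb(N, j) * 0.5 * erfc(K / (sqrt(2.0) * s * (1 + a) ** j * (1 - a) ** (N - j)))
        for j in range(N + 1)
    ) / 2 ** N

def ratio(K, N, a):
    """Exceedance probability relative to the standard Gaussian (N = 0)."""
    return p_exceed(K, N, a) / p_exceed(K, 0, a)

if __name__ == "__main__":
    for N in (5, 10, 15, 20, 25):
        print(N, [ratio(K, N, 0.01) for K in (3, 5, 10)])
```

The deeper in the tail the threshold K, the more explosive the effect of the layered uncertainty, exactly as in the tables.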
Table: Case of a = 1/10

N     P(X>3|N)/P(X>3|N=0)     P(X>5|N)/P(X>5|N=0)     P(X>10|N)/P(X>10|N=0)
5     2.74                    146                     …
10    4.43                    805                     …
15    5.98                    1980                    …
20    7.38                    3529                    …
25    8.64                    5321                    …

(The ratios in the last column are of the order of 10^12 and beyond.)

As we said, we may have (actually we need to have) a priori reasons to decrease the parameter a or to stop N somewhere. When the higher-order rates a(i) decline, the moments tend to be capped (the inherited tails will come from the lognormality of s).

Regime 2-a: "Bleed" of Higher Order Errors

Take a "bleed" of higher-order errors at the rate λ, 0 ≤ λ < 1, such that a(N) = λ a(N−1), hence a(N) = λ^(N−1) a(1), with a(1) the conventional intensity of the stochastic standard deviation. Assume m = 0.

With N = 2, the second moment becomes:

M₂(2) = (a(1)² + 1) s² (a(1)² λ² + 1)

With N = 3,

M₂(3) = s² (1 + a(1)²)(1 + λ² a(1)²)(1 + λ⁴ a(1)²)

and, for general N:

M₂(N) = (a(1)² + 1) s² ∏_{i=1}^{N−1} ( a(1)² λ^(2i) + 1 )

We can re-express this using the q-Pochhammer symbol (a; q)_N = ∏_{i=0}^{N−1} (1 − a q^i):

M₂(N) = s² (−a(1)²; λ²)_N

Since the product converges for |λ| < 1, this allows us to get to the limit:

lim_{N→∞} M₂(N) = s² (−a(1)²; λ²)_∞

As to the fourth moment, by recursion:

M₄(N) = 3 s⁴ ∏_{i=0}^{N−1} ( a(1)⁴ λ^(4i) + 6 a(1)² λ^(2i) + 1 )

which factors as

M₄(N) = 3 s⁴ ( (2√2 − 3) a(1)²; λ² )_N ( −(2√2 + 3) a(1)²; λ² )_N

and

lim_{N→∞} M₄(N) = 3 s⁴ ( (2√2 − 3) a(1)²; λ² )_∞ ( −(2√2 + 3) a(1)²; λ² )_∞

So the limiting second moment for λ = 0.9 and a(1) = 0.2 is just about 1.23 s², a significant but relatively benign convexity bias. The limiting fourth moment is just 9.88 s⁴, more than three times the Gaussian's (3 s⁴), but still a finite fourth moment. For small values of a(1) and values of λ close to 0, the fourth moment collapses to that of a Gaussian.

Regime 2-b: Second Method, a Non-Multiplicative Error Rate

For N recursions the scale becomes s(1 ± a(1)(1 ± a(2)(1 ± a(3)( ... )))), and

P(x, m, s, N) = (1/L) Σ_{i=1}^{L} f( x, m, s (1 + (T^N . A^N)_i) )

where (T^N . A^N)_i is the ith component of the (L × 1) dot product of T^N, the matrix of tuples above, with the vector of parameters

A^N = ( a^j )_{j=1,...,N}

and L = 2^N is the length of the matrix. So, for instance, for N = 3, A = (a, a², a³) and

T³ . A³ =
( −a − a² − a³ )
( −a − a² + a³ )
( −a + a² − a³ )
( −a + a² + a³ )
(  a − a² − a³ )
(  a − a² + a³ )
(  a + a² − a³ )
(  a + a² + a³ )

The moments are as follows:

M₁(N) = m
M₂(N) = m² + 2 s²
M₄(N) = m⁴ + 12 m² s² + 12 s⁴ Σ_{i=0}^{N} a^(2i)

At the limit:

lim_{N→∞} M₄(N) = 12 s⁴/(1 − a²) + m⁴ + 12 m² s²

which is very mild.

See Taleb and Cirillo [ ] for the treatment of the limit distribution, which will be a lognormal under the right conditions. In fact, lognormal approximations work well when errors on errors are in constant proportion.

STOCHASTIC TAIL EXPONENT FOR ASYMMETRIC POWER LAWS †

We examine random variables in the power law/slowly varying class with a stochastic tail exponent, the exponent α having its own distribution. We show the effect of the stochasticity of α on the expectation and higher moments of the random variable. For instance, the moments of a right-tailed or right-asymmetric variable, when finite, increase with the variance of α; those of a left-asymmetric one decrease. The same applies to the conditional shortfall (CVar), or mean-excess functions.

We prove the general case and examine the specific situation of a lognormally distributed α with support in [b, ∞), b > 1; a higher variance of α translates into a higher expected mean.

The bias is conserved under summation, even upon a large enough number of summands to warrant convergence to the stable distribution. We establish inequalities related to the asymmetry.

We also consider the situation of capped power laws (i.e. with compact support), and apply it to the study of violence by Cirillo and Taleb [ ]. We show that uncertainty concerning the historical data increases the true mean.

Research chapter. Conference: Extremes and Risks in Higher Dimensions, Lorentz Center, Leiden, The Netherlands, September.
Stochastic volatility has been introduced heuristically in mathematical finance by traders looking for biases in option valuation, whereby a Gaussian distribution is considered to have several possible variances, either locally or at some specific future date. Options far from the money (i.e. concerning tail events) increase in value with uncertainty about the variance of the distribution, as they are convex to the standard deviation.

This led to a family of models of Brownian motion with stochastic variance (see the review in Gatheral [ ]) that proved useful in tracking the distributions of the underlying and the effect of the non-Gaussian character of random processes on functions of the process (such as option prices).

Just as options are convex to the scale of the distribution, we find many situations where expectations are convex to the power law tail exponent α. This note examines two cases:
• The standard power laws, one-tailed or asymmetric.
• The pseudo-power law, where a random variable appears to be power law distributed but has compact support, as in the study of violence [ ], where wars have the number of casualties capped at some maximum value.

General Cases

Definition: Let X be a random variable belonging to the class of distributions with a "power law" right tail, that is, with support in [x₀, +∞), x₀ ∈ ℝ:

Subclass P₁:  { X : P(X > x) = L(x) x^(−α),  ∂^q L(x)/∂x^q = 0 for q ≥ 1 }

We note that x₀ can be negative by shifting, so long as x₀ > −∞.

Class P:  { X : P(X > x) ~ L(x) x^(−α) }

where ~ means that the limit of the ratio of the right-hand side to the left-hand side goes to 1 as x → ∞. L : [x_min, +∞) → (0, +∞) is a slowly varying function, defined by lim_{x→+∞} L(kx)/L(x) = 1 for any k > 0; L′(x) is monotone. The constant α > 0.

We further assume that:

lim_{x→∞} L′(x) x = 0
lim_{x→∞} L″(x) x² = 0

We have P₁ ⊂ P. We note that the first class corresponds to the Pareto distributions (with proper shifting and scaling), where L is a constant, and P to the more general one-sided power laws.

Stochastic Alpha Inequality
Throughout the rest of the paper we use the notation X′ for the stochastic-alpha version of the random variable X, the latter denoting the constant-α case.

Proposition 1: Let p = 1, 2, ..., and let X′ be the same random variable as X above in P₁ (the one-tailed regular variation class), with x₀ ≥ 0, except with stochastic α, all realizations of which exceed p, preserving the mean ᾱ. Then E(X′^p) ≥ E(X^p).

Proposition 2: Let K be a threshold. With X in the P class, we have for the expected conditional shortfall (CVar): lim_{K→∞} E(X′ | X′ > K) ≥ lim_{K→∞} E(X | X > K).

The sketch of the proof is as follows. We remark that E(X^p) is convex to α, in the following sense. Let X_{α_i} be the random variable distributed with constant tail exponent α_i, with α_i > p for all i, and let w_i be normalized positive weights: Σ_i w_i = 1, 0 ≤ w_i ≤ 1, Σ_i w_i α_i = ᾱ. By Jensen's inequality:

Σ_i w_i E(X^p_{α_i}) ≥ E( Σ_i w_i X^p_{α_i} ).

As the classes are defined by their survival functions, we first need to solve for the corresponding density, φ(x) = α x^(−α−1) L(x, α) − x^(−α) L^(1,0)(x, α), and get the normalizing constant:

L(x₀, α) = x₀^α − x₀ L^(1,0)(x₀, α)/(α − 1) − x₀² L^(2,0)(x₀, α)/((α − 1)(α − 2)),  α ≠ 1, 2,

when the first and second derivatives exist, respectively. The slot notation L^(p,0)(x₀, α) is short for ∂^p L(x, α)/∂x^p |_{x = x₀}.

By the Karamata representation theorem [ ], [ ], a function L on [x₀, +∞) is slowly varying (in the sense of the Definition) if and only if it can be written in the form

L(x) = exp( ∫_{x₀}^x (ϵ(t)/t) dt + η(x) )

where η(.) is a bounded measurable function converging to a finite number as x → +∞, and ϵ(x) is a bounded measurable function converging to zero as x → +∞. Accordingly, L′(x) goes to 0 as x → ∞. (We further assumed above that L′(x) goes to 0 faster than 1/x and L″(x) faster than 1/x².) Integrating by parts,

E(X^p) = x₀^p + p ∫_{x₀}^∞ x^(p−1) F̄(x) dx

where F̄ is the survival function above. Integrating by parts three additional times and eliminating derivatives of L(.) of order higher than 2:

E(X^p) = x₀^(p−α) L(x₀, α) α/(α − p) − x₀^(p−α+1) L^(1,0)(x₀, α)/((p − α)(p − α + 1)) + x₀^(p−α+2) L^(2,0)(x₀, α)/((p − α)(p − α + 1)(p − α + 2))

which, for the special case of X in P₁, reduces to:

E(X^p) = x₀^p α/(α − p)

As to Proposition 2, we can approach the proof from the property that lim_{x→∞} L′(x) = 0. This allows a proof of van der Wijk's law, that Paretian inequality is invariant to the threshold in the tail, that is, E(X | X > K)/K converges to a constant as K → +∞. The equation above presents the exact conditions on the functional form of L(x) for the convexity to extend to sub-classes between P₁ and P.

Our results hold for distributions that are transformed by shifting and scaling, of the sort x ↦ x − μ + x₀ (Pareto II), or with further transformations to Pareto types III and IV. We note that the representation P₁ uses the same parameter x₀ for both the scale and the minimum value, as a simplification.

We can verify that the expectation in the Pareto case above is convex to α:

∂²E(X^p)/∂α² = 2 p x₀^p/(α − p)³ > 0.

Approximations for the Class P

For P \ P₁, our results hold when we can write an approximation of the expectation of X as a constant multiplying the integral of x^(−α), namely E(X) ≈ k n(α)/(α − 1), where k is a positive constant that does not depend on α and n(.) is approximated by a linear function of α (plus a threshold). The expectation will then be convex to α.

Example: Student T Distribution
For the Student T distribution with tail exponent α, the "sophisticated" slowly varying function in common use for symmetric power laws in quantitative finance, the half-mean, or the mean of the one-sided distribution (i.e. with support on ℝ⁺), becomes

(1/2) E|X| = √α Γ((α + 1)/2) / ( √π (α − 1) Γ(α/2) )

so that n(α) = 2√α Γ((α + 1)/2)/(√π Γ(α/2)) ≈ α(1 + log 4)/π, where Γ(.) is the gamma function.

As we are dealing from here on with convergence to the stable distribution, we consider situations of 1 < α < 2, hence p = 1, and will be concerned solely with the mean.

We observe that the convexity of the mean is invariant under summations of power law distributed variables such as X above. The stable distribution has a mean that, in conventional parameterizations, does not appear to depend on α, but in fact does depend on it.

Let Y be distributed according to a Pareto distribution with density f(y) ≜ α λ^α y^(−α−1), y ≥ λ > 0, 1 < α < 2. Now, let Y₁, Y₂, ..., Y_n be identical and independent copies of Y. Let χ(t) be the characteristic function of f(y). We have χ(t) = α(−iλt)^α Γ(−α, −iλt), where Γ(., .) is the incomplete gamma function. We can get the mean from the characteristic function of the average of the n summands (1/n)(Y₁ + Y₂ + ... + Y_n), namely χ(t/n)^n. Taking the first derivative and letting n → ∞:

lim_{n→∞} −i ∂χ(t/n)^n/∂t |_{t=0} = λ α/(α − 1)

Thus we can see how the converging asymptotic distribution for the average will have for mean the scale multiplied by α/(α − 1), which does not depend on n.

Let χ_S(t) be the characteristic function of the corresponding stable distribution S_{α,β,μ,1}, obtained from the distribution of infinitely summed copies of Y. By the Lévy continuity theorem,

• (appropriately normalized) Σ_{i≤n} Y_i →_D S, with distribution S_{α,β,μ,1}, where →_D denotes convergence in distribution, and
• χ_S(t) = lim_{n→∞} χ(t/n)^n

are equivalent. So we are dealing with the standard result [ ], [ ], for exact Pareto sums [ ], replacing the conventional μ with the mean from above:

χ_S(t) = exp( i μ t − |λt|^α ( 1 − iβ tan(πα/2) sgn(t) ) ),  μ = λ α/(α − 1).

We can verify by symmetry that, effectively, flipping the distribution in subclasses P₁ and P around y₀ to make it negative yields a negative value of the mean and of the higher moments, hence a degradation of them from stochastic α. The central question becomes:

Remark (Preservation of Asymmetry): A normalized sum of one-tailed distributions in P₁, with expectation depending on α in the form above, will necessarily converge in distribution to an asymmetric stable distribution S_{α,β,μ,1}, with β ≠ 0.

Remark: Let Y′ be Y under mean-preserving stochastic α. The convexity effect becomes sgn( E(Y′) − E(Y) ) = sgn(β).

The sketch of the proof is as follows. Consider two slowly varying functions, one on each side of the tails:

L(y) = 1_{y < y_q} L⁻(y) + 1_{y ≥ y_q} L⁺(y),  with L⁺ : [y_q, +∞), lim_{y→+∞} L⁺(y) = c, and L⁻ : (−∞, y_q], lim_{y→−∞} L⁻(y) = d.

From [ ], if P(X > x) ~ c x^(−α) as x → +∞ and P(X < x) ~ d |x|^(−α) as x → −∞, then Y converges in distribution to S_{α,β,μ,1} with the coefficient β = (c − d)/(c + d).

We can show that the mean can be written as (λ⁺ − λ⁻) α/(α − 1), where λ⁺ ≥ λ⁻ if ∫_{y_q}^∞ L⁺(y) dy ≥ ∫_{−∞}^{y_q} L⁻(y) dy.

Now assume α follows a shifted lognormal distribution with mean ᾱ and minimum value b; that is, α − b follows a lognormal L(log(ᾱ − b) − σ²/2, σ). The parameter b allows us to work with a lower bound on the tail exponent in order to satisfy finite expectation. We know that the tail exponent will eventually converge to b, but the process may be quite slow.

Proposition 3: Assuming finite expectation for X′, and for the exponent the lognormally distributed shifted variable α − b with law L(log(ᾱ − b) − σ²/2, σ), b ≥ 1 the minimum value for α, and scale λ:

E(Y′) = E(Y) + λ (e^(σ²) − b)/(ᾱ − b)

We need b ≥ 1 to avoid problems of infinite expectation. Let φ(y; α) be the density with stochastic tail exponent. With ᾱ > b ≥ 1, σ > 0, Y ≥ λ > 0:

E(Y′) = ∫_b^∞ ∫_λ^∞ y φ(y; α) dy dα
      = ∫_b^∞ ( λ α/(α − 1) ) · ( 1/( √(2π) σ (α − b) ) ) exp( −( log(α − b) − log(ᾱ − b) + σ²/2 )² / (2σ²) ) dα
      = λ ( ᾱ + e^(σ²) − b )/( ᾱ − b )

Approximation of the Density
With b = 1 (which is the lower bound for b), we can obtain the density with stochastic α, φ(y; ᾱ, σ), as a convergent series: expanding α around its lower bound b = 1 and integrating each summand yields terms in powers of log(λ) − log(y); we omit the lengthy expression.

Proposition 4: Assuming finite expectation for X′, scale λ, and for the exponent a gamma distributed shifted variable α − 1 with density φ(.), mean ᾱ and variance σ², all values of α greater than 1:

E(X′) = E(X) + λ σ² / ( (ᾱ − 1)(ᾱ − σ − 1)(ᾱ + σ − 1) )

Proof. With φ(α) the density of the shifted gamma variable, the mean under stochastic α is

E(X′) = ∫_1^∞ ( λ α/(α − 1) ) φ(α) dα

which integrates to the result above; the condition ᾱ > σ + 1 guarantees convergence.

In [ ] and [ ], the studies make use of bounded power laws, applied to violence and operational risk, respectively. Although with α < 1, Z has finite expectation owing to the upper bound.

The method offered was a smooth transformation of the variable as follows: we start with z ∈ [L, H), L > 0, and transform it into x ∈ [L, ∞), the latter legitimately being power law distributed, via the smooth logarithmic transformation

x = φ(z) = L − H log( (H − z)/(H − L) )

with

f(x) = (α/s) ( (x − L)/s + 1 )^(−α−1).

We thus get the distribution of Z, which will have a finite expectation for all positive values of α. The derivative ∂E(Z)/∂α is a lengthy closed-form expression involving the Meijer G function and the exponential integral E_α, and it appears to be positive in the range of numerical perturbations in [ ]. At such a low level of α, the expectation is extremely convex and the bias will accordingly be extremely pronounced.

This convexity has the following practical implication. Historical data on violence over the past two millennia is fundamentally unreliable [ ]. Hence an imprecision about the tail exponent, from errors embedded in the data, needs to be present in the computations. The above shows that uncertainty about α is more likely to make the "true" statistical mean (that is, the mean of the process, as opposed to the sample mean) higher rather than lower, hence it supports the statement that more uncertainty increases the estimation of violence.
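The convexity of the mean to α is easy to check numerically. A minimal sketch (illustrative parameters, λ = 1): the Pareto mean λα/(α − 1) averaged over a mean-preserving two-point spread of α exceeds the mean evaluated at ᾱ, and the gap grows with the dispersion, as Jensen's inequality dictates.

```python
def pareto_mean(alpha, lam=1.0):
    """Mean of a Pareto with scale lam and tail exponent alpha > 1."""
    return lam * alpha / (alpha - 1.0)

def stochastic_alpha_mean(abar, spread, lam=1.0):
    """Two-point mean-preserving spread: alpha = abar +/- spread, each w.p. 1/2."""
    return 0.5 * (pareto_mean(abar - spread, lam) + pareto_mean(abar + spread, lam))

if __name__ == "__main__":
    abar = 1.5
    for spread in (0.0, 0.1, 0.2, 0.3):
        # the excess mean induced by uncertainty about alpha
        print(spread, stochastic_alpha_mean(abar, spread) - pareto_mean(abar))
```

The same comparison with the spread flipped to the left tail (a negatively supported Pareto) gives a decrease, matching the asymmetry result sgn(E(Y′) − E(Y)) = sgn(β).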
The bias in the estimation of the mean and of shortfalls from uncertainty in the tail exponent can be added to analyses where data are insufficient, unreliable, or simply prone to forgeries.

In addition to statistical inference, these results can extend to processes, whether a compound Poisson process with power law subordination [ ] (i.e. a Poisson arrival time and a jump that is power law distributed) or a Lévy process. The latter can be analyzed by considering successive "slice distributions" or discretizations of the process [ ]. Since the expectation of a sum of jumps is the sum of the expectations, the same convexity will appear as the one we obtained above.

acknowledgment

Marco Avellaneda, Robert Frey, Raphael Douady, Pasquale Cirillo.

META-DISTRIBUTION OF P-VALUES AND P-HACKING ‡

We present an exact probability distribution (meta-distribution) for p-values across ensembles of statistically identical phenomena, as well as the distribution of the minimum p-value among m independent tests. We derive the distribution for small samples 2 < n ≤ n* ≈ 30 as well as the limiting one as the sample size n becomes large. We also look at the properties of the "power" of a test through the distribution of its inverse for a given p-value and parametrization.

P-values are shown to be extremely skewed and volatile, regardless of the sample size n, and to vary greatly across repetitions of exactly the same protocols under statistically identical copies of the phenomenon; such volatility makes the minimum p-value diverge significantly from the "true" one. Setting the power is shown to offer little remedy unless the sample size is increased markedly or the p-value is lowered by at least one order of magnitude.

The formulas allow the investigation of the stability of the reproduction of results and of "p-hacking" and other aspects of meta-analysis, including a meta-distribution of p-hacked results.

From a probabilistic standpoint, neither a p-value of .05 nor a "power" at .9 appears to make the slightest sense.

Research chapter.

Assume that we know the "true" p-value, p_s. What would its realizations look like across various attempts on statistically identical copies of the phenomenon? By the true value p_s, we mean its expected value by the law of large numbers across an ensemble of m possible samples for the phenomenon under scrutiny, that is, (1/m) Σ_{i≤m} p_i →_P p_s (where →_P denotes convergence in probability). A similar convergence argument can also be made for the corresponding "true median" p_M. The main result of the paper is that the distribution of p-values from small samples of size n can be made explicit (albeit with special inverse functions), as well as its parsimonious limiting one for n large, with no other parameter than the median value p_M. We were unable to get an explicit form for p_s, but we go around it with the use of the median.
Finally, the distribution of the minimum p-value under m trials can be made explicit, in a parsimonious formula allowing for the understanding of biases in scientific studies.

Figure: The different values for the meta-distribution at increasing sample sizes n, showing convergence to the limiting distribution.

It turned out, as we can see in the Figure, that the distribution is extremely asymmetric (right-skewed), to the point where some 75% of the realizations of a borderline "true" p-value will fall below .05 (a borderline situation is 3× as likely to pass than to fail a given protocol), and, what is worse, a substantial share of the realizations of the same true p-value will come out an order of magnitude lower. Although with compact support, the distribution exhibits the attributes of extreme fat-tailedness. For a low observed p-value, the "true" p-value is likely to be several times higher, with a standard deviation and a mean absolute deviation of the same order as the value itself (sic). Because of the excessive skewness, measures of dispersion in L¹ and L² (and higher norms) vary hardly at all with p_s, so the standard deviation is not proportional, meaning an in-sample .01 p-value has a significant probability of having a true value > .3.

So clearly we don't know what we are talking about when we talk about p-values.
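These claims are easy to reproduce by simulation. A minimal sketch, with all parameters illustrative (a one-sided test with known unit variance, n = 10, and an effect size chosen so the median p-value is borderline): it shows the right skew, with the mean p-value far above the median and a large share of realizations below .05.

```python
import random
from math import erfc, sqrt
from statistics import mean, median

def one_sided_p(sample):
    """One-tailed p-value of a z-test with known unit variance (limiting case)."""
    n = len(sample)
    z = mean(sample) * sqrt(n)
    return 0.5 * erfc(z / sqrt(2.0))

def simulate(mu=0.5, n=10, trials=20000, seed=1):
    """p-values across statistically identical copies of the same phenomenon."""
    rng = random.Random(seed)
    return [one_sided_p([rng.gauss(mu, 1.0) for _ in range(n)])
            for _ in range(trials)]

if __name__ == "__main__":
    ps = simulate()
    print("median p:", median(ps))
    print("mean   p:", mean(ps))   # mean >> median: extreme right skew
    print("share < .05:", sum(p < 0.05 for p in ps) / len(ps))
```

Note that a borderline median p-value near .05 already produces realizations scattered over more than an order of magnitude, which is the point of the chapter.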
Earlier attempts at an explicit meta-distribution in the literature were found in [ ] and [ ], though for situations of Gaussian subordination and less parsimonious parametrization. The severity of the problem of the significance of the so-called "statistically significant" has been discussed in [ ], and a remedy offered via Bayesian methods in [ ], which in fact recommends the same tightening of standards to p-values ≈ .01. But the gravity of the extreme skewness of the distribution of p-values becomes apparent only when one looks at the meta-distribution.

For notation, we use n for the sample size of a given study and m for the number of trials leading to a p-value.

Proposition 1: Let P be a random variable on [0, 1] corresponding to the sample-derived one-tailed p-value from the paired T-test statistic (unknown variance) with median value M(P) = p_M ∈ [0, 1], derived from a sample of size n. The distribution across the ensemble of statistically identical copies of the sample has for PDF

φ(p; p_M) = ( (z_p² + n) / ( (z_p − z_{p_M})² + n ) )^((n+1)/2)

where z_p is the Student T quantile corresponding to the survival probability p,

z_p = √( n (1 − λ_p)/λ_p )  for p < 1/2,  z_p = −√( n λ′_p/(1 − λ′_p) )  for p ≥ 1/2,

z_{p_M} = √( n λ_{p_M}/(1 − λ_{p_M}) )  for p_M < 1/2,

with λ_p = I⁻¹_{2p}(n/2, 1/2), λ_{p_M} = I⁻¹_{1−2p_M}(1/2, n/2), λ′_p = I⁻¹_{2p−1}(1/2, n/2), and I⁻¹(., .) the inverse beta regularized function.

Remark 1: For p_M = 1/2 the distribution does not exist in theory, but does in practice, and we can work around it with the sequence p_{M_k} = 1/2 ± 1/k, as in the graph showing a convergence to the uniform distribution on [0, 1]. Also note that what is called the "null" hypothesis is effectively a set of measure 0.
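The limiting (large-n) form of the meta-distribution lends itself to a quick numerical check. A minimal sketch using only the standard library (the bisection inverse of erfc is a convenience of this sketch, not from the text): it implements the limiting density and verifies that it integrates to one, puts half its mass below the median p_M, and reduces to the uniform distribution at p_M = 1/2.

```python
from math import erfc, exp

def erfc_inv(y, lo=-6.0, hi=6.0, tol=1e-12):
    """Inverse of erfc on (0, 2), by bisection (erfc is decreasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if erfc(mid) > y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def phi_limit(p, p_m):
    """Limiting (large-n) meta-distribution of the p-value, with median p_m."""
    zm = erfc_inv(2.0 * p_m)
    zp = erfc_inv(2.0 * p)
    return exp(zm * (2.0 * zp - zm))

if __name__ == "__main__":
    p_m = 0.10
    k = 40000
    # midpoint rule: the density integrates to ~1 over (0, 1) ...
    total = sum(phi_limit((i + 0.5) / k, p_m) for i in range(k)) / k
    # ... and the mass below the median p_m is ~1/2
    below = sum(phi_limit((i + 0.5) / k, p_m) for i in range(int(k * p_m))) / k
    print(total, below)
```

The median-preservation check mirrors the closed-form CDF: F(p_M; p_M) = (1/2) erfc(0) = 1/2.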
Proof.
Let Z be a random normalized variable with realizations z, from a vector ⃗v of n realizations, with sample mean m_v and sample standard deviation s_v, z = (m_v − m_h)/(s_v/√n) (where m_h is the level it is tested against), hence assumed to follow a Student T distribution with n degrees of freedom, and, crucially, supposed to deliver a mean of z̄:

f(z; z̄) = ( n/( (z̄ − z)² + n ) )^((n+1)/2) / ( √n B(n/2, 1/2) )

where B(., .) is the standard beta function. Let g(.) be the one-tailed survival function of the Student T distribution with zero mean and n degrees of freedom:

g(z) = P(Z > z) = (1/2) I_{n/(z²+n)}(n/2, 1/2)  for z ≥ 0;  (1/2) ( I_{z²/(z²+n)}(1/2, n/2) + 1 )  for z < 0,

where I(., .) is the incomplete beta function.

We now look for the distribution of p = g(Z). Given that g(.) is a legitimate Borel function, and naming p the probability as a random variable, we have by a standard result for the transformation:

φ(p; z̄) = f( g⁻¹(p) ) / | g′( g⁻¹(p) ) |

We can convert z̄ into the corresponding median survival probability because of the symmetry of Z. Since one half of the observations fall on either side of z̄, we can ascertain that the transformation is median preserving: the median of P is g(z̄) = p_M. Hence we end up having { z̄ : (1/2) I_{n/(z̄²+n)}(n/2, 1/2) = p_M } (positive case) and { z̄ : (1/2)( I_{z̄²/(z̄²+n)}(1/2, n/2) + 1 ) = p_M } (negative case). Replacing, we get the result of Proposition 1.

We note that a high n does not increase significance, since p-values are computed from normalized variables (hence the universality of the meta-distribution); a high n corresponds to an increased convergence to the Gaussian. For large n, we can prove the following proposition:

Proposition 2: Under the same assumptions as above, the limiting distribution for φ(.) is:

lim_{n→∞} φ(p; p_M) = e^( −erfc⁻¹(2 p_M) ( erfc⁻¹(2 p_M) − 2 erfc⁻¹(2 p) ) )

where erfc(.) is the complementary error function and erfc⁻¹(.) its inverse.

The limiting CDF F(.):
F(k; p_M) = (1/2) erfc( erf⁻¹(1 − 2k) − erf⁻¹(1 − 2 p_M) )

Proof.
For large n, the distribution of Z = m_v/(s_v/√n) becomes that of a Gaussian, and the one-tailed survival function becomes g(z) = (1/2) erfc(z/√2), so z(p) = √2 erfc⁻¹(2p). Replacing in the transformation above delivers Proposition 2, and the CDF follows by integration.

Figure: The probability distribution of a one-tailed p-value with a given expected value, generated by Monte Carlo (histogram) as well as analytically with φ(.) (the solid line), marking the median, the .05 cut point, and the shares of realizations below .05 and below .01. We draw all possible subsamples from an ensemble with given properties. The excessive skewness of the distribution makes the average value considerably higher than most observations, hence causing illusions of "statistical significance".

This limiting distribution applies for paired tests with known or assumed sample variance, since the test statistic becomes a Gaussian variable, matching the convergence of the T-test (Student T) to the Gaussian when n is large.

Figure: The probability distribution of p at different values of p_M (.025, .1, .15, .5). We observe how p_M = 1/2 leads to a uniform distribution.

Remark 2: For values of p close to 0, φ in Proposition 2 can be usefully approximated in closed form; the approximation is more precise in the band of relevant (small) values of p. From this we can get numerical results for convolutions of φ using the Fourier transform or similar methods.

We can thus get the distribution of the minimum p-value per m trials across statistically identical situations, and thereby get an idea of "p-hacking", defined as attempts by researchers to get the lowest p-value of many experiments, or to try until one of the tests produces statistical significance.

Proposition 3: The distribution of the minimum of m observations of statistically identical p-values becomes (under the limiting distribution of Proposition 2):

φ_m(p; p_M) = m e^( erfc⁻¹(2 p_M) ( 2 erfc⁻¹(2 p) − erfc⁻¹(2 p_M) ) ) ( 1 − (1/2) erfc( erfc⁻¹(2 p) − erfc⁻¹(2 p_M) ) )^(m−1)

Proof. P(p₁ > p, p₂ > p, ..., p_m > p) = ∏_{i=1}^m F̄(p_i) = F̄(p)^m.
Taking the first derivative we get the result.

Outside the limiting distribution, we integrate numerically for different values of m, as shown in the next figure. So, more precisely, for m trials, the expectation is calculated as:

E(p_min) = ∫₀¹ p · m φ(p; p_M) ( ∫_p¹ φ(u; p_M) du )^{m−1} dp.
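The expected minimum can also be obtained by direct simulation. Below is a minimal sketch of ours (not from the text): in the limiting regime each experiment's standardized statistic is Gaussian with unit variance, so we draw it, convert to a one-tailed p-value, and take the minimum over m trials; the value Z_M is chosen so that the median p-value p_M is about .05.

```python
import math
import random
import statistics

def pval(z):
    # one-tailed p-value of a standardized (Gaussian-limit) test statistic
    return 0.5 * math.erfc(z / math.sqrt(2))

Z_M = 1.6449  # median statistic; pval(Z_M) ~ .05 = p_M (our choice)

def min_p(m):
    # minimum p-value over m statistically identical experiments;
    # in the limiting distribution each realized statistic is N(Z_M, 1)
    return min(pval(random.gauss(Z_M, 1)) for _ in range(m))

random.seed(7)
for m in (1, 5, 15):
    draws = [min_p(m) for _ in range(20_000)]
    print(m, round(statistics.median(draws), 4), round(statistics.fmean(draws), 4))
```

For m = 1 the median sits near .05 while the mean is well above it (the skewness illusion discussed in the figure); as m grows the expected minimum collapses, which is the mechanics of p-hacking.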
Figure: The "p-hacking" value across m trials, for p_M = .15 and p_s = .22.

Let β be the power of a test for a given p-value p, for random draws X from an unobserved parameter θ and a sample size of n. To gauge the reliability of β as a true measure of power, we perform an inverse problem: given the realizations X_{θ,p,n}, we study the distribution of β and of its inverse β⁻¹(X).

Proposition
Let β_c be the projection of the power of the test from the realizations assumed to be Student T distributed and evaluated under the parameter θ. We have

F(β_c) = F(β_c)_L for β_c < 1/2, and F(β_c) = F(β_c)_H for β_c ≥ 1/2,

where F(β_c)_L and F(β_c)_H are closed forms in the incomplete beta function B(·,·), the sample size n, and the quantities g₁ = I⁻¹_{2β_c}(n/2, 1/2), g₂ = I⁻¹_{2β_c−1}(1/2, n/2), and g₃ = I⁻¹_{2p_s−1}(n/2, 1/2), with I⁻¹(·,·) the inverse of the regularized incomplete beta function.

• One can safely see that under such stochasticity for the realizations of p-values and the distribution of their minimum, to get what people mean by 95% confidence (and the inferences they draw from it), they need a p-value at least one order of magnitude smaller.

• Attempts at replicating papers, such as the Open Science project [ ], should consider a margin of error in their own procedure and a pronounced bias towards favorable results (Type-I error). There should be no surprise that a previously deemed significant test fails during replication; in fact it is the replication of results deemed significant at a close margin that should be surprising.

• The "power" of a test has the same problem unless one either lowers p-values or sets the test at higher levels.

acknowledgment
Marco Avellaneda, Pasquale Cirillo, Yaneer Bar-Yam, friendly people on twitter...
SOME CONFUSIONS IN BEHAVIORAL ECONOMICS

We saw in earlier chapters that the problem of "overestimation of the tails" by agents is more attributable to the use of a wrong "normative" model by psychologists and decision scientists who are innocent of fat tails. Here we use two cases illustrative of such improper use of probability, uncovered with our simple heuristic of inducing a second order effect and seeing the effect of Jensen's inequality on the expectation operator. One such unrigorous use of probability (the equity premium puzzle) involves the promoter of "nudging", an invasive and sinister method devised by psychologists that aims at manipulating the decisions of citizens.

h.1 case study: how myopic loss aversion is misspecified

The so-called "equity premium puzzle", originally detected by Mehra and Prescott [ ], is called so because equities have historically yielded too high a return over fixed income investments; the puzzle is why it isn't arbitraged away. We can easily figure out that the analysis misses the absence of ergodicity in such a domain, as we saw in an earlier chapter: agents do not really capture market returns unconditionally; it is foolish to use ensemble probabilities and the law of large numbers for individual investors who only have one life. Also, "positive expected returns" for a market is not a sufficient condition for an investor to obtain a positive expectation; a certain Kelly-style path scaling strategy, or path dependent dynamic hedging, is required. Benartzi and Thaler [ ] claim that the Kahneman-Tversky prospect theory [ ] explains such behavior owing to myopia.
This might be true, but such an analysis falls apart under thick tails. So here we fatten the tails of the distribution with stochasticity of, say, the scale parameter, and can see what happens to some results in the literature that seem absurd at face value, and in fact are absurd under more rigorous use of probabilistic analyses.

Myopic loss aversion
Figure H.1: The effect of H_{a,1/2}(t), the "utility" or prospect theory valuation under a second order effect on variance. Here σ = 1, μ = 1 and t is variable.

Figure H.2: The ratio H_{a,1/2}(t)/H(t), or the degradation of "utility" under second order effects.

Take the prospect theory value function w for changes in wealth x, parametrized with λ and α:

w_{λ,α}(x) = x^α if x ≥ 0, and −λ(−x)^α if x < 0.

Let φ_{μt,σ√t}(x) be the Normal distribution density with corresponding mean μt and standard deviation σ√t. The expected "utility" (in the prospect sense) is:

H(t) = ∫_{−∞}^{∞} w_{λ,α}(x) φ_{μt,σ√t}(x) dx,

which has a closed form combining Γ((α+1)/2) and Kummer's confluent hypergeometric function ₁F₁ (lengthy, hence omitted here). We can see from this closed form that more frequent sampling of the performance translates into worse utility. So what Benartzi and Thaler did was to look for the sampling period "myopia" that translates into the sampling frequency that causes the "premium"; the error being that they missed second order effects.

Now, under variations of σ with stochastic effects, heuristically captured, the story changes: what if there is a very small probability that the variance gets multiplied by a large number, with the total variance remaining the same? The key here is that we are not even changing the variance at all: we are only shifting the distribution to the tails. We are here generously assuming that by the law of large numbers it was established that the "equity premium puzzle" was true and that stocks really outperformed bonds. So we switch between two states: variance (1+a)σ² with probability p and (1−a)σ² with probability 1−p. Rewriting H(t):
H_{a,p}(t) = ∫_{−∞}^{∞} w_{λ,α}(x) ( p φ_{μt,√(1+a) σ√t}(x) + (1−p) φ_{μt,√(1−a) σ√t}(x) ) dx

Result
Conclusively, as can be seen in Figures H.1 and H.2, second order effects cancel the statements made from "myopic" loss aversion. This doesn't mean that myopia has no effect; rather, it cannot explain the "equity premium", not from the outside (i.e., the distribution might have different returns), but from the inside, owing to the structure of the Kahneman-Tversky value function v(x).

Comment
We used the (1 ± a) heuristic largely for illustrative reasons; we could use a full distribution for the variance with similar results. For instance, the gamma distribution with density

f(v) = v^{γ−1} e^{−γv/V} (V/γ)^{−γ} / Γ(γ),

with expectation V matching the variance used in the "equity premium" theory. Rewriting the expectation under that form:

∫_{−∞}^{∞} ∫_0^∞ w_{λ,α}(x) φ_{μt,√(vt)}(x) f(v) dv dx,

which has a closed form solution (though a bit lengthy for here).
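The second order effect can also be checked numerically. The sketch below is our illustration, not the book's computation: λ = 2.25 and α = 0.88 are the classic Tversky-Kahneman estimates, μ = σ = 1 follow the figures, and t = 0.25 is an arbitrary sampling period. It integrates the prospect value function against the single-variance Gaussian and against the p = 1/2 variance-switching mixture:

```python
import math

LAM, ALPHA = 2.25, 0.88  # classic Tversky-Kahneman estimates (our assumption)

def w(x):
    # Kahneman-Tversky prospect-theory value function
    return x**ALPHA if x >= 0 else -LAM * (-x)**ALPHA

def phi(x, mu, sd):
    # Normal density
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def H(t, a=0.0, mu=1.0, sigma=1.0, n=20001):
    # expected prospect "utility" when the variance switches between
    # (1+a) sigma^2 and (1-a) sigma^2 with probability 1/2 each
    s_hi = math.sqrt(1 + a) * sigma * math.sqrt(t)
    s_lo = math.sqrt(1 - a) * sigma * math.sqrt(t)
    lo, hi = mu * t - 10 * s_hi, mu * t + 10 * s_hi
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        total += w(x) * 0.5 * (phi(x, mu * t, s_hi) + phi(x, mu * t, s_lo)) * dx
    return total

print(H(0.25), H(0.25, a=0.9))  # fattening the tails changes the prospect value
```

With these parameters the mixed (fat-tailed) case yields a materially different prospect value (here, a higher one), undermining the calibration behind the "myopia" story even though the total variance is unchanged.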
True problem with Benartzi and Thaler
Of course the problem has to do with thick tails and the convergence under the law of large numbers, which we treat separately.
Time Preference Under Model Error
Another example of the effect of the randomization of a parameter, i.e. the creation of an additional layer of uncertainty, so to speak. This author once watched with a great deal of horror one Laibson [ ] at a conference at Columbia University present the idea that preferring one massage today to two tomorrow, but reversing the preference a year from now, is irrational (or something of the sort) and that we need to remedy it with some policy. (For a review of time discounting and intertemporal preferences, see [ ], as economists tend to impart to agents what seems to be a varying "discount rate", derived in a simplified model.)

Intuitively, what if I introduce the probability that the person offering the massage is full of baloney? It would clearly make me both prefer immediacy at almost any cost and, conditionally on his being around at a future date, reverse the preference. This is what we will model next. First, time discounting has to have a geometric form, so preference doesn't become negative: linear discounting of the form 1 − Ct, where C is a constant and t is time into the future, is ruled out: we need something like C^{−t} or, to extract the rate, (1+k)^{−t}, which can be mathematically further simplified into an exponential by taking it to the continuous-time limit. Exponential discounting has the form e^{−kt}. Effectively, such a discounting method using a shallow model prevents "time inconsistency", so with δ < t:

lim_{t→∞} e^{−kt} / e^{−k(t−δ)} = e^{−kδ}.

Now add another layer of stochasticity: the discount parameter, for which we use the symbol λ, is now stochastic. So we now can only treat H(t) as

H(t) = ∫ e^{−λt} φ(λ) dλ.

It is easy to prove the general case that under a symmetric stochasticization of intensity Δλ (that is, with probabilities ½ around the center of the distribution), using the same technique as before:

Farmer and Geanakoplos [ ] have applied a similar approach to hyperbolic discounting.
H′(t, Δλ) = ½ ( e^{−(λ−Δλ)t} + e^{−(λ+Δλ)t} )

H′(t, Δλ) / H′(t, 0) = ½ e^{λt} ( e^{(Δλ−λ)t} + e^{(−Δλ−λ)t} ) = cosh(Δλ t),

where cosh is the hyperbolic cosine function, which will converge to a certain value where intertemporal preferences are flat in the future.

Example: Gamma Distribution
Under the gamma distribution with support in R⁺, with parameters α and β,

φ(λ) = λ^{α−1} β^α e^{−λβ} / Γ(α),

we get

H(t, α, β) = ∫₀^∞ e^{−λt} λ^{α−1} β^α e^{−λβ} / Γ(α) dλ = β^α (β + t)^{−α},

so

lim_{t→∞} H(t, α, β) / H(t − δ, α, β) = 1,

meaning that preferences become flat in the future no matter how steep they are in the present, which explains the drop in discount rate in the economics literature. Further, fudging the distribution and normalizing it, when φ(λ) = e^{−λ/k}/k, we get the normatively obtained so-called hyperbolic discounting

H(t) = 1/(1 + kt),

which turns out not to be the empirical "pathology" that naive researchers have claimed it to be. It is just that their model missed a layer of uncertainty.

Part VII
OPTION TRADING AND PRICING UNDER FAT TAILS

FINANCIAL THEORY'S FAILURES WITH OPTION PRICING †

Let us discuss why option theory, as seen according to so-called "neoclassical economics", fails in the real world. How does financial theory price financial products? The principal difference in paradigm between the one presented by Bachelier in 1900 [ ] and the modern finance one known as Black-Scholes-Merton [ ] and [ ] lies in a few central assumptions, by which Bachelier was closer to reality and to the way traders do business and have done business for centuries.

Figure: The hedging errors for an option portfolio (under a daily revision regime) over a long period, under a constant volatility Student T with tail exponent α = 3. Technically the errors should not converge in finite time as their distribution has infinite variance.

Bachelier's model is based on an actuarial expectation of final payoffs, not dynamic hedging. It means you can use any distribution! A more formal proof using
Discussion chapter.

Figure: Hedging errors for an option portfolio (daily revision) under an equivalent (rather fictional) "Black-Scholes" world.
Figure: Portfolio hedging errors including the stock market crash of 1987.

measure theory is provided in a later chapter, so for now let us just get the intuition without too much mathematics. The same method was later used by a series of researchers, such as Sprenkle [ ], Boness [ ], Kassouf and Thorp [ ], and Thorp [ ] (whose derivation was only published later). They all encountered the following problem: how to produce a risk parameter, a risky asset discount rate, to make it compatible with portfolio theory? The Capital Asset Pricing Model requires that securities command an expected rate of return in proportion to their riskiness. In the Black-Scholes-Merton approach, an option price is derived from continuous-time dynamic hedging, and only from properties obtained from continuous-time dynamic hedging; we will describe dynamic hedging in some detail further down. Thanks to such a method, an option collapses into a deterministic payoff and provides returns independent of the market; hence it does not require any risk premium.
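How well the "deterministic payoff" claim survives fat tails can be probed with a toy simulation, ours rather than the book's (all parameters, σ = 0.2, one-year at-the-money call, 250 daily rebalancings, are arbitrary choices): delta-hedge the call once under Gaussian daily returns and once under variance-matched Student T (α = 3) returns, and compare the dispersion of the terminal hedging error.

```python
import math
import random

def bs_delta(S, K, sigma, tau):
    # Black-Scholes call delta with zero rates, used as the hedge ratio
    if tau <= 0:
        return 1.0 if S > K else 0.0
    d1 = (math.log(S / K) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2)))

def t3_unit():
    # Student T with 3 degrees of freedom, scaled to unit variance
    z = random.gauss(0, 1) / math.sqrt(random.gammavariate(1.5, 2) / 3)
    return z / math.sqrt(3.0)

def hedged_pnl(draw, sigma=0.2, K=100.0, steps=250):
    # short one call, rebalance the delta hedge daily; terminal P&L
    # (the constant initial premium is omitted; it shifts all paths equally)
    dt, S, pnl = 1.0 / steps, 100.0, 0.0
    for i in range(steps):
        delta = bs_delta(S, K, sigma, (steps - i) * dt)
        dS = S * sigma * math.sqrt(dt) * draw()
        if dS < -0.95 * S:          # floor: price stays positive under extreme draws
            dS = -0.95 * S
        pnl += delta * dS
        S += dS
    return pnl - max(S - K, 0.0)

random.seed(11)
runs = 1000
gauss = sorted(hedged_pnl(lambda: random.gauss(0, 1)) for _ in range(runs))
fat = sorted(hedged_pnl(t3_unit) for _ in range(runs))
spread = lambda xs: xs[-10] - xs[9]   # ~98% interquantile range
print(round(spread(gauss), 2), round(spread(fat), 2))
```

Under the Gaussian the hedged P&L clusters tightly (the errors shrink with the hedging frequency); under the fat-tailed returns the dispersion stays large, dominated by the jump days, which is the point of the figures above.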
Distortion from Idealization

The problem we have with the Black-Scholes-Merton approach is that the requirements for dynamic hedging are extremely idealized, requiring the following strict conditions. The operator is assumed to be able to buy and sell in a frictionless market, incurring no transaction costs. The procedure does not allow for the price impact of the order flow: if an operator sells a quantity of shares, it should not have consequences on the subsequent price. The operator knows the probability distribution, which is the Gaussian, with fixed and constant parameters through time (all parameters do not change). Finally, the most significant restriction: no scalable jumps. In a subsequent revision [Merton, ] allows for jumps, but these are deemed to have Poisson arrival times and to be fixed or, at the worst, Gaussian. The framework does not allow the use of power laws, both in practice and mathematically. Let us examine the mathematics behind the stream of dynamic hedges in the Black-Scholes-Merton equation.

Assume the risk-free interest rate r = 0 with no loss of generality. The canonical Black-Scholes-Merton model consists in selling a call and purchasing shares of stock that provide a hedge against instantaneous moves in the security. Thus the portfolio π, locally "hedged" against exposure to the first moment of the distribution, is the following:

π = −C + (∂C/∂S) S,

where C is the call price and S the underlying security. Take the change in the values of the portfolio:

Δπ = −ΔC + (∂C/∂S) ΔS.

By expanding around the initial values of S, we have the changes in the portfolio in discrete time. Conventional option theory applies to the Gaussian, in which all orders higher than ΔS² and Δt disappear rapidly:

Δπ = −(∂C/∂t) Δt − ½ (∂²C/∂S²) ΔS² + O(ΔS³).

Taking expectations on both sides, we can see very strict requirements on moment finiteness: all moments need to converge.
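The requirement is easy to probe numerically. In the sketch below (ours; parameters arbitrary), draws follow a Student T with tail exponent α = 3: the second moment is finite, so its running average settles, but the third absolute moment is infinite, so its running average never does.

```python
import math
import random

def student_t3():
    # Student T with 3 degrees of freedom: Z / sqrt(Chi2_3 / 3)
    return random.gauss(0, 1) / math.sqrt(random.gammavariate(1.5, 2) / 3)

random.seed(3)
s2, s3 = 0.0, 0.0
n = 0
for n in range(1, 1_000_001):
    x = student_t3()
    s2 += x * x
    s3 += abs(x) ** 3
    if n % 200_000 == 0:
        # the second-moment average hovers near the variance (= 3),
        # while the third-moment average never settles
        print(n, round(s2 / n, 2), round(s3 / n, 2))
```

This is the practical meaning of "all moments need to converge": with α = 3, the cubic term in the expansion above already has no expectation.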
If we include another term, −(1/6)(∂³C/∂S³) ΔS³, it may be of significance in a probability distribution with significant cubic or quartic terms. Indeed, although the nth derivative with respect to S can decline very sharply, for options that have a strike K away from the center of the distribution, the moments are rising disproportionately fast for that to carry a mitigating effect. So here we mean that all moments need to be finite and losing in impact; no approximation. Note here that the jump diffusion model (Merton, ) does not cause much trouble since it has all the moments. And the annoyance is that a power law will have every moment higher than α infinite, causing the equation of the Black-Scholes-Merton portfolio to fail. As we said, the logic of the Black-Scholes-Merton so-called solution, thanks to Itô's lemma, was that the portfolio collapses into a deterministic payoff. But let us see how quickly or effectively this works in practice.

The Actual Replication Process
The payoff of a call should be replicated with the following stream of dynamic hedges, the limit of which can be seen here, between t and T:

lim_{Δt→0} Σ_{i=1}^{n=T/Δt} (∂C/∂S)|_{S=S_{t+(i−1)Δt}, t=t+(i−1)Δt} ( S_{t+iΔt} − S_{t+(i−1)Δt} )

We break up the period into n increments Δt. Here the hedge ratio ∂C/∂S is computed as of time t+(i−1)Δt, but we get the nonanticipating difference between the price at the time the hedge was initiated and the resulting price at t+iΔt. This is supposed to make the payoff deterministic at the limit of Δt → 0. In the Gaussian world, this would be an Itô-McKean integral.

Failure: How Hedging Errors Can Be Prohibitive
As a consequence of the mathematical property seen above, hedging errors in a cubic α appear to be indistinguishable from those from an infinite variance process. Furthermore, such an error has a disproportionately large effect on strikes away from the money. In short: dynamic hedging in a power law world removes no risk.

next
The next chapter will use measure theory to show why options can still be risk-neutral.

UNIQUE OPTION PRICING MEASURE (NO DYNAMIC HEDGING/COMPLETE MARKETS) ‡

We present the proof that under simple assumptions, such as the constraints of Put-Call Parity, the probability measure for the valuation of a European option has the mean derived from the forward price, which can, but does not have to be, the risk-neutral one, under any general probability distribution, bypassing the Black-Scholes-Merton dynamic hedging argument, and without the requirement of complete markets and other strong assumptions. We confirm that the heuristics used by traders for centuries are more robust, more consistent, and more rigorous than held in the economics literature. We also show that options can be priced using infinite variance (finite mean) distributions.

Option valuation methodologies have been used by traders for centuries, in an effective way (Haug and Taleb [ ]). In addition, valuation by expectation of terminal payoff forces the mean of the probability distribution used for option prices to be that of the forward, thanks to Put-Call Parity; and should the forward be risk-neutrally priced, so will the option be. The Black-Scholes argument (Black and Scholes [ ], Merton [ ]) is held to allow risk-neutral option pricing thanks to dynamic hedging, as the option becomes redundant (since its payoff can be built as a linear combination of cash and the underlying asset dynamically revised through time).
Research chapter.

This is a puzzle, since: 1) dynamic hedging is not operationally feasible in financial markets owing to the dominance of portfolio changes resulting from jumps; 2) the dynamic hedging argument doesn't stand mathematically under fat tails; it requires a very specific "Black-Scholes world" with many impossible assumptions, one of which requires finite quadratic variations; 3) traders use the same Black-Scholes "risk neutral argument" for the valuation of options on assets that do not allow dynamic replication; 4) traders trade options consistently in domains where the risk-neutral arguments do not apply; 5) there are fundamental informational limits preventing the convergence of the stochastic integral.

There have been a couple of predecessors to the present thesis arguing that Put-Call parity is a sufficient constraint to enforce some structure at the level of the mean of the underlying distribution, such as Derman and Taleb [ ] and Haug and Taleb [ ]. These approaches were heuristic, robust though deemed hand-waving (Ruffino and Treussard [ ]). In addition they showed that operators need to use the risk-neutral mean. What this chapter does is:
• It goes beyond the "hand-waving" with formal proofs.
• It uses a completely distribution-free, expectation-based approach and proves the risk-neutral argument without dynamic hedging, and without any distributional assumption.
• Beyond risk-neutrality, it establishes the case of a unique pricing distribution for option prices in the absence of such an argument. The forward (or futures) price can embed expectations and deviate from the arbitrage price (owing to, say, regulatory or other limitations), yet the options can still be priced at a distribution corresponding to the mean of such a forward.
• It shows how one can practically have an option market without "completeness" and without having the theorems of financial economics hold.

These are done with solely two constraints: "horizontal", i.e. put-call parity, and "vertical", i.e. the different valuations across strike prices deliver a probability measure, which is shown to be unique. The only economic assumption made here is that the forward exists and is tradable; in the absence of such a unique forward price it is futile to discuss standard option pricing. We also require the probability measures to correspond to distributions with finite first moment.

Preceding works in that direction are as follows. Breeden and Litzenberger [ ] and Dupire [ ] show how option spreads deliver a unique probability measure; there are papers establishing a broader set of arbitrage relations between options, such as Carr and Madan [ ]. However: 1) none of these papers made the bridge between calls and puts via the forward, thus translating the relationships from arbitrage relations between options delivering a probability distribution into the necessity of lining up to the mean of the distribution of the forward, hence the risk-neutral one (in case the forward is arbitraged); 2) nor did any paper show that in the absence of a second moment (say, infinite variance), we can price options very easily; 3) our method is vastly simpler, more direct, and robust to changes in assumptions.

Further, in a case of scientific puzzle, the exact formula called "Black-Scholes-Merton" was written down (and used) by Edward Thorp in a heuristic derivation by expectation that did not require dynamic hedging; see Thorp [ ]. See also Green and Jarrow [ ] and Nachman [ ].
We have known about the possibility of risk-neutral pricing without dynamic hedging since Harrison and Kreps [ ], but the theory necessitates extremely strong, and severely unrealistic, assumptions, such as strictly complete markets and a multiperiod pricing kernel.
We make no assumption of general market completeness. Options are not redundant securities and remain so. The table below summarizes the gist of the paper.

Define C(S_{t₀}, K, t) and P(S_{t₀}, K, t) as European-style call and put with strike price K, respectively, with expiration t, S₀ as an underlying security at time t₀, t ≥ t₀, and S_t the possible value of the underlying security at time t.

Case 1: Forward as risk-neutral measure

Define r = (1/(t−t₀)) ∫_{t₀}^t r_s ds, the return of a risk-free money market fund, and d = (1/(t−t₀)) ∫_{t₀}^t d_s ds, the payout of the asset (continuous dividend for a stock, foreign interest for a currency). We have the arbitrage forward price F_t^Q:

F_t^Q = S₀ (1+r)^{t−t₀} / (1+d)^{t−t₀} ≈ S₀ e^{(r−d)(t−t₀)},

by arbitrage; see Keynes [ ]. We thus call F_t^Q the future (or forward) price obtained by arbitrage, at the risk-neutral rate. Let F_t^P be the future requiring a risk-associated "expected return" m, with expected forward price

F_t^P = S₀ (1+m)^{t−t₀} ≈ S₀ e^{m(t−t₀)}.

Remark:
By arbitrage, all tradable values of the forward price given S_{t₀} need to be equal to F_t^Q. "Tradable" here does not mean "traded", only subject to arbitrage replication by "cash and carry", that is, borrowing cash and owning the security yielding d if the embedded forward return diverges from r.

Derivations
In the following we take F as having dynamics on its own, irrelevant to whether we are in case 1 or case 2, hence a unique probability measure Q.

The famed Hakansson paradox is as follows: if markets are complete and options are redundant, why would someone need them? If markets are incomplete, we may need options but how can we price them? This discussion may have provided a solution to the paradox: markets are incomplete and we can price options.

Option prices are not unique in the absolute sense: the premium over intrinsic can take an entire spectrum of values; it is just that the put-call parity constraint forces the measures used for the puts and the calls to be the same and to have the same expectation as the forward. As far as securities go, options are securities on their own; they just have a strong link to the forward.

Table: Main practical differences between the dynamic hedging argument and the static Put-Call parity with spreading across strikes.
(Black-Scholes-Merton | Put-Call Parity with Spreading)

Type: Continuous rebalancing. | Interpolative static hedge.

Limit: Law of large numbers in time (horizontal). | Law of large numbers across strikes (vertical).

Market assumptions: 1) Continuous markets, no gaps, no jumps; 2) ability to borrow and lend the underlying asset for all dates; 3) no transaction costs in trading the asset. | 1) Gaps and jumps acceptable; possibility of continuous strikes, or an acceptable number of strikes; 2) ability to borrow and lend the underlying asset for a single forward date; 3) low transaction costs in trading options.

Probability distribution: Requires all moments to be finite; excludes the class of slowly varying distributions. | Requires a finite 1st moment (infinite variance is acceptable).

Market completeness: Achieved through dynamic completeness. | Not required (in the traditional sense).

Realism of assumptions: Low. | High.

Convergence: Uncertain; one large jump changes the expectation. | Robust.

Fitness to reality: Only used after "fudging" standard deviations per strike. | Portmanteau, using a specific distribution adapted to reality.

Define Ω = [0, ∞) = A_K ∪ A_K^c, where A_K = [0, K] and A_K^c = (K, ∞). Consider a class of standard (simplified) probability spaces (Ω, μ_i) indexed by i, where μ_i is a probability measure, i.e., satisfying ∫_Ω dμ_i = 1.

Theorem
For a given maturity T, there is a unique measure μ_Q that prices European puts and calls by expectation of terminal payoff.

This measure can be risk-neutral in the sense that it prices the forward F_t^Q, but it does not have to be, and it imparts the rate of return to the stock embedded in the forward.

Lemma 1
For a given maturity T, there exist two measures μ₁ and μ₂ for European calls and puts of the same maturity and same underlying security, associated with the valuation by expectation of terminal payoff, which are unique such that, for any call and put of strike K, we have

C = ∫_Ω f_C dμ₁ and P = ∫_Ω f_P dμ₂,

respectively, where f_C = (S_t − K)⁺ and f_P = (K − S_t)⁺.

Proof. For clarity, set r and d to 0 without loss of generality. By Put-Call Parity arbitrage, a positive holding of a call ("long") and a negative one of a put ("short") replicates a tradable forward; because of P/L variations, using a positive sign for long and a negative sign for short:

C(S_{t₀}, K, t) − P(S_{t₀}, K, t) + K = F_t^P,

necessarily, since F_t^P is tradable. Put-Call Parity holds for all strikes, so

C(S_{t₀}, K+ΔK, t) − P(S_{t₀}, K+ΔK, t) + K + ΔK = F_t^P

for all K ∈ Ω. Now a call spread in quantities 1/ΔK, expressed as C(S_{t₀}, K, t) − C(S_{t₀}, K+ΔK, t), delivers $1 if S_t > K + ΔK (that is, corresponds to the indicator function of S_t > K + ΔK), 0 if S_t ≤ K, and the quantity 1/ΔK times (S_t − K) if K < S_t ≤ K + ΔK, that is, a value between 0 and $1 (see Breeden and Litzenberger [ ]).
Likewise, consider the converse argument for a put, with ΔK < S_t. At the limit, for ΔK → 0,

∂C(S_{t₀}, K, t)/∂K = −P(S_t > K) = −∫_{A_K^c} dμ₁.

By the same argument,

∂P(S_{t₀}, K, t)/∂K = ∫_{A_K} dμ₂ = 1 − ∫_{A_K^c} dμ₂.

As semi-closed intervals generate the whole Borel σ-algebra on Ω, this shows that μ₁ and μ₂ are unique.

Lemma 2
The probability measures of puts and calls are the same: for each Borel set A in Ω, μ₁(A) = μ₂(A).

Proof. Combining the two equations above, dividing by ΔK and taking ΔK → 0:

−∂C(S_{t₀}, K, t)/∂K + ∂P(S_{t₀}, K, t)/∂K = 1

for all values of K, so

∫_{A_K^c} dμ₁ = ∫_{A_K^c} dμ₂,

hence μ₁(A_K) = μ₂(A_K) for all K ∈ [0, ∞). This equality being true for any semi-closed interval, it extends to any Borel set.

Lemma 3
Puts and calls are required, by static arbitrage, to be evaluated at the same measure μ_Q as the tradable forward.

Proof. F_t^P = ∫_Ω F_t dμ_Q; from Lemma 1,

∫_Ω f_C(K) dμ₁ − ∫_Ω f_P(K) dμ₂ = ∫_Ω F_t dμ_Q − K.

Taking derivatives on both sides, and since f_C − f_P = S_t − K, we get the Radon-Nikodym derivative

dμ_Q/dμ₁ = 1

for all values of K.

Consider the case where F_t is observable, tradable, and used solely as an underlying security with dynamics on its own. In such a case we can completely ignore the dynamics of the nominal underlying S, or use a non-risk-neutral "implied" rate linking cash to forward, m* = log(F/S₀)/(t − t₀). The rate m can embed a risk premium, difficulties in financing, structural or regulatory impediments to borrowing, with no effect on the final result. In that situation, it can be shown that the exact same results as before apply, by replacing the measure μ_Q by another measure μ_Q*. Option prices remain unique.
We have replaced the complexity and intractability of dynamic hedging with a simple, more benign interpolation problem, and explained the performance of pre-Black-Scholes option operators using simple heuristics and rules, bypassing the structure of the theorems of financial economics. Options can remain non-redundant and markets incomplete: we are just arguing here for a form of arbitrage pricing (which includes risk-neutral pricing at the level of the expectation of the probability measure), nothing more. But this is sufficient for us to use any probability distribution with a finite first moment, which includes the Lognormal, which recovers Black-Scholes.

A final comparison. In dynamic hedging, missing a single hedge, or encountering a single gap (a tail event), can be disastrous; as we mentioned, it requires a series of assumptions beyond the mathematical, in addition to severe and highly unrealistic constraints on the mathematics. Under the class of fat tailed distributions, increasing the frequency of the hedges does not guarantee a reduction of risk. Further, the standard dynamic hedging argument requires the exact specification of the risk-neutral stochastic process between t₀ and t, something econometrically unwieldy, and which is generally reverse-engineered from the price of options, as an arbitrage-oriented interpolation tool rather than as a representation of the process. Here, in our Put-Call Parity based methodology, our ability to track the risk-neutral distribution is guaranteed by adding strike prices, and since probabilities add up to 1, the degrees of freedom that the recovered measure μ_Q has in the gap area between a strike price K and the next strike up, K + ΔK, are severely reduced, since the measure in the interval is constrained by the difference ∫_{A_K^c} dμ − ∫_{A_{K+ΔK}^c} dμ. In other words, no single gap between strikes can significantly affect the probability measure, even less the first moment, unlike with dynamic hedging.
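The "vertical" recovery of the measure from strikes can be illustrated with a tiny numerical sketch of ours (the Pareto terminal distribution, with tail α = 1.5, hence infinite variance but finite mean, is an assumption chosen purely for illustration): call spreads recover the survival probabilities, and put-call parity pins the mean to the forward, with no use of the variance.

```python
import math

# Terminal-price distribution: Pareto with tail alpha = 1.5 and minimum XM
# (infinite variance, finite mean) -- an illustrative assumption of ours.
ALPHA, XM = 1.5, 1.0

def survival(k):
    # P(S_T > k), valid for k >= XM
    return (XM / k) ** ALPHA

def call(k):
    # C(k) = E[(S_T - k)^+] = integral of the survival function from k
    return XM**ALPHA * k**(1 - ALPHA) / (ALPHA - 1)

forward = ALPHA * XM / (ALPHA - 1)   # E[S_T], finite despite infinite variance

# vertical spreads recover the measure: -dC/dK = P(S_T > K)
for k in (1.5, 2.0, 5.0):
    h = 1e-5
    spread = -(call(k + h) - call(k - h)) / (2 * h)
    assert abs(spread - survival(k)) < 1e-6

# horizontal constraint (put-call parity): C - P + K = F, so P = C - F + K
def put(k):
    return call(k) - forward + k

print(forward, put(2.0))
```

Note that nothing here required a second moment: the option prices, the recovered measure, and the parity constraint only used the finite first moment, which is exactly the claim of the chapter.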
We assumed a zero discount rate for the proofs; in the case of a nonzero rate, premia are discounted at the rate of the arbitrage operator.

In fact it is no different from standard kernel smoothing methods for statistical samples, but applied to the distribution across strikes. The assumption about the presence of strike prices constitutes a natural condition: conditional on having a practical discussion about options, option strikes need to exist. Further, as is the experience of the author, market-makers can add over-the-counter strikes at will, should they need to do so.

acknowledgments
Peter Carr, Marco Avellaneda, Hélyette Geman, Raphael Douady, Gur Huberman, Espen Haug, and Hossein Kazemi. For methods of interpolation of implied probability distributions between strikes, see Avellaneda et al. [ ].

OPTION TRADERS NEVER USE THE BLACK-SCHOLES-MERTON FORMULA

Option traders use a heuristically derived pricing formula which they adapt by fudging and changing the tails and skewness by varying one parameter, the standard deviation of a Gaussian. Such formula is popularly called "Black-Scholes-Merton" owing to an attributed eponymous discovery (though changing the standard deviation parameter is in contradiction with it). However, we have historical evidence that: (1) the said Black, Scholes and Merton did not invent any formula, just found an argument to make a well known (and used) formula compatible with the economics establishment, by removing the risk parameter through dynamic hedging; (2) option traders use (and evidently have used since 1902) sophisticated heuristics and tricks more compatible with the previous versions of the formula of Louis Bachelier and Edward O. Thorp (that allow a broad choice of probability distributions), and removed the risk parameter using put-call parity; (3) option traders did not use the Black-Scholes-Merton formula or similar formulas after 1973 but continued their bottom-up heuristics, more robust to the high impact rare event. The chapter draws on historical trading methods and 19th and early 20th century references ignored by the finance literature. It is time to stop using the wrong designation for option pricing.

For us, practitioners, theories should arise from practice. This explains our concern with the "scientific" notion that practice should fit theory. Option hedging, pricing, and trading is neither philosophy nor mathematics. It is a rich craft with

Research chapter.
For us, in this discussion, a "practitioner" is deemed to be someone involved in repeated decisions about option hedging, that is, with a risk-P/L and skin in the game, not a support quant who writes pricing software or an academic who provides consulting advice.

traders learning from traders (or traders copying other traders) and tricks developing under evolutionary pressures, in a bottom-up manner. It is techne, not episteme. Had it been a science it would not have survived, for the empirical and scientific fitness of the pricing and hedging theories offered are, we will see, at best defective and unscientific (and, at the worst, the hedging methods create more risks than they reduce). Our approach in this chapter is to ferret out historical evidence of techne showing how option traders went about their business in the past.

Options, we will show, have been extremely active in the pre-modern finance world. Tricks and heuristically derived methodologies in option trading and risk management of derivatives books have been developed over the past century, and used quite effectively by operators. In parallel, many derivations were produced by mathematical researchers. The economics literature, however, did not recognize these contributions, substituting the rediscoveries or subsequent reformulations done by (some) economists. There is evidence of an attribution problem with the Black-Scholes-Merton option formula, which was developed, used, and adapted in a robust way by a long tradition of researchers and used heuristically by option book runners. Furthermore, in a case of scientific puzzle, the exact formula called Black-Scholes-Merton was written down (and used) by Edward Thorp: a formula which, paradoxically, while being robust and realistic, has been considered unrigorous.
This raises the following: 1) the Black-Scholes-Merton innovation was just a neoclassical finance argument, no more than a thought experiment; 2) we are not aware of traders using their argument or their version of the formula. It is high time to give credit where it belongs.

Black-Scholes was an argument
Option traders call the formula they use the Black-Scholes-Merton formula without being aware that, by some irony, of all the possible option formulas that have been produced in the past century, what is called the Black-Scholes-Merton formula (after Black and Scholes, 1973, and Merton, 1973) is the one furthest away from what they are using. In fact, of the formulas written down in a long history, it is the only formula that is fragile to jumps and tail events.

First, something seems to have been lost in translation: Black and Scholes [ ] and Merton [ ] actually never came up with a new option formula, but only a theoretical economic argument built on a new way of deriving, rather re-deriving, an already existing and well known formula. The argument, we will see, is extremely fragile to assumptions. The foundations of option hedging and pricing were already far more firmly laid down before them. The Black-Scholes-Merton

Here we question the notion of confusing thought experiments in a hypothetical world, of no predictive power, with either science or practice. The fact that the Black-Scholes-Merton argument works in a Platonic world and appears to be elegant does not mean anything, since one can always produce a Platonic world in which a certain equation works, or in which a rigorous proof can be provided, a process called reverse-engineering.
Figure: Louis Bachelier, who came up with an option formula based on expectation. It is based on more rigorous foundations than the Black-Scholes dynamic hedging argument, as it does not require a thin-tailed distribution. Few people are aware of the fact that the Black-Scholes so-called discovery was an argument to remove the expectation of the underlying security, not the derivation of a new equation.

argument, simply, is that an option can be hedged using a certain methodology called dynamic hedging and then turned into a risk-free instrument, as the portfolio would no longer be stochastic. Indeed what Black, Scholes and Merton did was marketing, finding a way to make a well-known formula palatable to the economics establishment of the time, little else, and in fact distorting its essence.

Such an argument requires strange far-fetched assumptions: some liquidity at the level of transactions, knowledge of the probabilities of future events (in a neoclassical Arrow-Debreu style), and, more critically, a certain mathematical structure that requires thin tails, or mild randomness, on which, later. The entire argument is indeed quite strange and rather inapplicable for someone clinically and observation-driven standing outside conventional neoclassical economics. Simply, the dynamic hedging argument is dangerous in practice as it subjects you to blowups; it makes no sense unless you are concerned with neoclassical economic theory. The Black-Scholes-Merton argument and equation flow from a top-down general equilibrium theory, built upon the assumptions of operators working in full knowledge of the probability distribution of future outcomes, in addition to a collection of assumptions that, we will see, are highly invalid mathematically, the main one being the ability to cut the risks using continuous trading, which only works in the very narrowly special case of thin-tailed distributions.
But it is not just these flaws that make it inapplicable: option traders do not buy theories, particularly speculative general equilibrium ones, which they find too risky for them and extremely lacking in standards of reliability. A normative theory is, simply, not good for

Of all the misplaced assumptions of Black-Scholes that cause it to be a mere thought experiment, though an extremely elegant one, a flaw shared with modern portfolio theory is the certain knowledge of future delivered variance for the random variable (or, equivalently, all the future probabilities). This is what makes it clash with practice; the rectification by the market, fattening the tails, is a negation of the Black-Scholes thought experiment.

decision-making under uncertainty (particularly if it is in chronic disagreement with empirical evidence). People may take decisions based on speculative theories, but avoid the fragility of theories in running their risks. Yet professional traders, including the authors (and, alas, the Swedish Academy of Science), have operated under the illusion that it was the Black-Scholes-Merton formula they actually used; we were told so. This myth has been progressively reinforced in the literature and in business schools, as the original sources have been lost or frowned upon as anecdotal (Merton [ ]).

Figure: The typical "risk reduction" performed by the Black-Scholes-Merton argument. These are the variations of a dynamically hedged portfolio (and a quite standard one). BSM indeed "smoothes" out variations but exposes the operator to massive tail events reminiscent of such blowups as LTCM. Other option formulas are robust to the rare event and make no such claims.
This discussion will present our real-world, ecological understanding of option pricing and hedging based on what option traders actually do and did for more than a hundred years.

This is a very general problem. As we said, option traders develop a chain of transmission of techne, like many professions. But the problem is that the chain is often broken, as universities do not store the skills acquired by operators. Effectively, plenty of robust heuristically derived implementations have been developed over the years, but the economics establishment has refused to quote them or acknowledge them. This makes traders need to relearn matters periodically. The failure of dynamic hedging in 1987, by such firm as Leland O'Brien Rubinstein, for instance, does not seem to appear in the academic literature published after the event (Merton [ ], Rubinstein [ ], Ross [ ]); to the contrary, dynamic hedging is held to be a standard operation.

There are central elements of the real world that can escape them; academic research without feedback from practice (in a practical and applied field) can cause the diversions we witness between laboratory and ecological frameworks. This explains why so many finance academics have had the tendency to produce smooth returns, then blow up using their own theories. We started the other way around, first by years of option trading, doing millions of hedges and thousands of option trades.

For instance, on how mistakes never resurface into the consciousness: Mark Rubinstein was awarded in 1995 the Financial Engineer of the Year award by the International Association of Financial Engineers. There was no mention of portfolio insurance and the failure of dynamic hedging.

For a standard reaction to a rare event, see the following: "Wednesday is the type of day people will remember in quant-land for a very long time," said Mr. Rothman, a University of Chicago Ph.D. who ran
This, in combination with investigating the forgotten and ignored ancient knowledge in option pricing and trading, will allow us to explain some common myths about option pricing and hedging. There are indeed two myths:
• That we had to wait for the Black-Scholes-Merton options formula to trade the product, price options, and manage option books. In fact the introduction of the Black, Scholes and Merton argument increased our risks and set us back in risk management. More generally, it is a myth that traders rely on theories, even less a general equilibrium theory, to price options.
• That we use the Black-Scholes-Merton options pricing formula. We simply don't.

In our discussion of these myths we will focus on the bottom-up literature on option theory that has been hidden in the dark recesses of libraries. And that addresses only recorded matters, not the actual practice of option trading that has been lost.
It is assumed that the Black-Scholes-Merton theory is what made it possible for option traders to calculate their delta hedge (against the underlying) and to price options. This argument is highly debatable, both historically and analytically. Options were actively trading at least already in the 1600s, as described by Joseph De La Vega, implying some form of techne, a heuristic method to price them and deal with their exposure. De La Vega describes option trading in the Netherlands, indicating that operators had some expertise in option pricing and hedging. He diffusely points to the put-call parity, and his book was not even meant to teach people about the technicalities of option trading. Our insistence on the use of put-call parity is critical for the following reason: Black-Scholes-Merton's claim to fame is removing the necessity of a risk-based drift from the underlying security to make the trade risk-neutral. But one does not need dynamic hedging for that: simple put-call parity can suffice (Derman and Taleb, 2005), as we will discuss later. And it is this central removal of the risk-premium that apparently was behind the decision by the Nobel committee to grant Merton and Scholes the (then called) Bank of Sweden Prize in Honor of Alfred Nobel: "Black, Merton and Scholes made a vital contribution by showing that it is in fact not necessary to use any risk premium when valuing an option. This does not mean that the risk premium disappears; instead it is already included in the stock price." It is for having removed the effect of the drift on the value of the option, using a thought experiment, that their work was originally cited, something that was mechanically present by any form of trading and converting using far simpler techniques.

a quantitative fund before joining Lehman Brothers. "Events that models only predicted would happen once in 10,000 years happened every day for three days." One "Quant" Sees Shakeout For the Ages – "10,000 Years," by Kaja Whitehouse,
Wall Street Journal
August 11, 2007.

Options have a much richer history than shown in the conventional literature. Forward contracts seem to date all the way back to Mesopotamian clay tablets, as far back as 1750
B.C. Gelderblom and Jonker [ ] show that Amsterdam grain dealers had used options and forwards already in 1550. In the late 1800s and the early 1900s there were active option markets in London and New York, as well as in Paris and several other European exchanges. Markets, it seems, were active and extremely sophisticated. Kairys and Valerio (1997) discuss the market for equity options in the USA in the 1870s, indirectly showing that traders were sophisticated enough to price for tail events. There was even active option arbitrage trading taking place between some of these markets. There is a long list of missing treatises on option trading: we traced at least ten German treatises on options written between the late 1800s and the hyperinflation episode.

One informative extant source, Nelson [ ], speaks volumes. An option trader and arbitrageur, S.A. Nelson published a book, The A B C of Options and Arbitrage, based on his observations around the turn of the twentieth century. According to Nelson (1904), up to 500 messages per hour, and typically 2000 to 3000 messages per day, were sent between the London and the New York markets through the cable companies. Each message was transmitted over the wire system in less than a minute. In a heuristic method that was repeated in Dynamic Hedging [ ], Nelson describes in a theory-free way many rigorously clinical aspects of his arbitrage business: the cost of shipping shares, the cost of insuring shares, interest expenses, the possibility to switch shares directly between someone being long securities in New York and short in London, in this way saving shipping and insurance costs, and many more tricks.

The historical description of the market is informative until Kairys and Valerio [ ] try to gauge whether options in the 1870s were underpriced or overpriced (using Black-Scholes-Merton style methods). There was one tail event in this period, the great panic of September 1873.
Kairys and Valerio find that holding puts was profitable, but deem that the market panic was just a one-time event: "However, the put contracts benefit from the financial panic that hit the market in September, 1873. Viewing this as a one-time event, we repeat the analysis for puts excluding any unexpired contracts written before the stock market panic." Using references to the economic literature that also conclude that options in general were overpriced, they conclude: "Our analysis shows that option contracts were generally overpriced and were unattractive for retail investors to purchase." They add: "Empirically we find that both put and call options were regularly overpriced relative to a theoretical valuation model." These results are contradicted by the practitioner Nelson (1904): "The majority of the great option dealers who have found by experience that it is the givers, and not the takers, of option money who have gained the advantage in the long run."

Here is a partial list: Bielschowsky, R.: Ueber die rechtliche Natur der Prämiengeschäfte, Bresl. Genoss.-Buchdr.; Granichstaedten-Czerva, R.: Die Prämiengeschäfte an der Wiener Börse, Frankfurt am Main; Holz, L.: Die Prämiengeschäfte, Thesis (doctoral), Universität Rostock; Kitzing, C.: Prämiengeschäfte: Vorprämien-, Rückprämien-, Stellagen- u. Nochgeschäfte; Die solidesten Spekulationsgeschäfte mit Versicherg auf Kursverlust, Berlin; Leser, E.: Zur Geschichte der Prämiengeschäfte; Szkolny, I.: Theorie und praxis der prämiengeschäfte nach einer originalen methode dargestellt, Frankfurt am Main; Author Unknown: Das Wesen der Prämiengeschäfte, Berlin: Eugen Bab & Co., Bankgeschäft.
Figure: Espen Haug (coauthor of this chapter) with Mandelbrot and this author.

The formal financial economics canon does not include historical sources from outside economics, a mechanism discussed in Taleb [ ]. The put-call parity was, according to the formal option literature, first fully described by Stoll [ ], but neither he nor others in the field even mention Nelson. Not only was the put-call parity argument fully understood and described in detail by Nelson, but he, in turn, makes frequent reference to Higgins (1902) [ ]. Just as an example, Nelson (1904), referring to Higgins (1902), writes:

It may be worthy of remark that calls are more often dealt than puts, the reason probably being that the majority of punters in stocks and shares are more inclined to look at the bright side of things, and therefore more often see a rise than a fall in prices. This special inclination to buy calls and to leave the puts severely alone does not, however, tend to make calls dear and puts cheap, for it can be shown that the adroit dealer in options can convert a put into a call, a call into a put, a call or more into a put-and-call, in fact any option into another, by dealing against it in the stock. We may therefore assume, with tolerable accuracy, that the call of a stock at any moment costs the same as the put of that stock, and half as much as the Put-and-Call.

The Put-and-Call was simply a put plus a call with the same strike and maturity, what we today would call a straddle. Nelson describes the put-call parity over many pages in full detail.
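The parity argument can be sketched numerically (toy parameters; the lognormal draws are purely for illustration, any terminal distribution with finite first moment would do): since max(S−K, 0) − max(K−S, 0) = S − K path by path, the call-minus-put price pins the pricing measure's mean regardless of the real-world drift, with no dynamic hedging anywhere.

```python
import math, random

random.seed(1)
S0, K, mu, sigma, n = 100.0, 105.0, 0.12, 0.3, 10_000   # hypothetical inputs
# terminal prices under some drift mu (the drift the parity will neutralize)
paths = [S0 * math.exp((mu - 0.5 * sigma ** 2) + sigma * random.gauss(0, 1))
         for _ in range(n)]
C = sum(max(s - K, 0.0) for s in paths) / n   # call premium under the measure
P = sum(max(K - s, 0.0) for s in paths) / n   # put premium under the measure
mean = sum(paths) / n
# parity holds path by path, hence in expectation, for ANY drift mu:
assert abs((C - P) - (mean - K)) < 1e-6
```

So if quoted calls and puts line up with the forward, the drift is removed mechanically, which is the converters' trick the chapter describes.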
Static market-neutral delta hedging was also known at that time; in his book Nelson, for example, writes:

Sellers of options in London as a result of long experience, if they sell a Call, straightway buy half the stock against which the Call is sold; or if a Put is sold, they sell half the stock immediately.

We must interpret the value of this statement in the light that standard options in London at that time were issued at-the-money (as explicitly pointed out by Nelson); furthermore, all standard options in London were European style. In London, in- or out-of-the-money options were only traded occasionally and were known as fancy options. It is quite clear from this and the rest of Nelson's book that the option dealers were well aware that the delta for at-the-money options was approximately 50%. As a matter of fact, at-the-money options trading in London at that time were adjusted to be struck at-the-money forward, in order to make puts and calls of the same price. We know today that options that are at-the-money forward and do not have a very long time to maturity have a delta very close to 50% (naturally minus 50% for puts). The options in London at that time typically had one month to maturity when issued.

Nelson also diffusely points to dynamic delta hedging, and that it worked better in theory than in practice (see Haug [ ]). It is clear from all the details described by Nelson that options in the early 1900s traded actively and that option traders at that time in no way felt helpless in either pricing or hedging them.

Herbert Filer was another option trader involved in option trading from 1919 to the 1960s. Filer (1959) describes what must be considered a reasonably active option market in New York and Europe in the early 1920s and 1930s. Filer mentions, however, that due to World War II there was no trading on the European exchanges, for they were closed. Further, he mentions that London option trading did not resume before 1958.
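The dealers' half-the-stock rule can be checked with a short sketch (a Black-Scholes delta is used here only as a modern yardstick; the 20% volatility and one-month maturity are assumptions for illustration): an at-the-money-forward option with a short maturity indeed has a delta very close to 50%.

```python
import math

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call_delta(S, K, sigma, T, r=0.0):
    # textbook lognormal-model call delta, used as a yardstick only
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return norm_cdf(d1)

# one-month option struck at-the-money forward (zero rates: forward = spot)
delta = bs_call_delta(S=100.0, K=100.0, sigma=0.2, T=1 / 12)
assert abs(delta - 0.5) < 0.02   # within two points of the dealers' 50%
```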
In the early 1900s, option traders in London were considered to be the most sophisticated, according to [ ]. It could well be that World War II and the subsequent shutdown of option trading for many years was the reason known robust arbitrage principles about options were forgotten and almost lost, to be partly re-discovered by finance professors such as Stoll.

Earlier, in 1908, Vinzenz Bronzin published a book deriving several option pricing formulas, and a formula very similar to what today is known as the Black-Scholes-Merton formula; see also Hafner and Zimmermann [ ]. Bronzin based his risk-neutral option valuation on robust arbitrage principles such as the put-call parity and the link between the forward price and call and put options, in a way that was rediscovered by Derman and Taleb (2005). Indeed, the put-call parity restriction is sufficient to remove the need to incorporate a future return in the underlying security; it forces the lining up of options to the forward price.

Again, Henry Deutsch describes the put-call parity, but in less detail than Higgins and Nelson. In 1961
Reinach again described the put-call parity in quite some detail (another text typically ignored by academics). Traders at the New York Stock Exchange specializing in using the put-call parity to convert puts into calls or calls into puts were at that time known as Converters. Reinach (1961) [ ]:

The argument of Derman and Taleb (2005) [ ] was present in [ ] but remained unnoticed.

Ruffino and Treussard [ ] accept that one could have solved the risk-premium by happenstance, not realizing that put-call parity was so extensively used in history. But they find it insufficient. Indeed the argument may not be sufficient for someone who subsequently complicated the representation of the world with some implements of modern finance such as "stochastic discount rates", while simplifying it at the same time to make it limited to the Gaussian and allowing dynamic hedging. They write that the use of a non-stochastic discount rate common to both the call and the put options is inconsistent with modern equilibrium capital asset pricing theory. Given that we have never seen a practitioner use a stochastic discount rate, we, like our option trading predecessors, feel that put-call parity is sufficient and does the job. The situation is akin to that of scientists lecturing birds on how to fly, and taking credit for their subsequent performance, except that here it would be lecturing them the wrong way.
Although I have no figures to substantiate my claim, I estimate that over 60 percent of all Calls are made possible by the existence of Converters.

In other words, the converters (dealers), who basically operated as market makers, were able to operate and hedge most of their risk by statically hedging options with options. Reinach wrote that he was an option trader (Converter) and gave examples of how he and his colleagues tended to hedge and arbitrage options against options by taking advantage of options embedded in convertible bonds:

Writers and traders have figured out other procedures for making profits writing Puts & Calls. Most are too specialized for all but the seasoned professional. One such procedure is the ownership of a convertible bond and then writing of Calls against the stock into which the bonds are convertible. If the stock is called, the bond is converted and the stock is delivered.

Higgins, Nelson and Reinach all describe the great importance of the put-call parity and of hedging options with options. Option traders were in no way helpless in hedging or pricing before the Black-Scholes-Merton formula. Based on simple arbitrage principles they were able to hedge options more robustly than with Black-Scholes-Merton. As already mentioned, static market-neutral delta hedging was described by Higgins and Nelson in 1902 and 1904. Also, W. D. Gann (1937) discusses market-neutral delta hedging for at-the-money options, but in much less detail than Nelson (1904). Gann also indicates some forms of auxiliary dynamic hedging.

Mills (1927) illustrates how jumps and fat tails were present in the literature in the pre-Modern Portfolio Theory days. He writes: "(...) distribution may depart widely from the Gaussian type because the influence of one or two extreme price changes."

Option formulas and Delta Hedging
Which brings us to option pricing formulas. The first identifiable one was Bachelier (1900) [ ]. Sprenkle (1961) [ ] extended Bachelier's work to assume a lognormal rather than a normally distributed asset price. It also avoids discounting (to no significant effect, since in many markets, particularly the U.S., option premia were paid at expiration).

James Boness (1964) [ ] also assumed a lognormal asset price. He derives a formula for the price of a call option that is actually identical to the Black-Scholes-Merton formula, but because Black, Scholes and Merton derived their formula based on continuous dynamic delta hedging, or alternatively based on CAPM, they were able to make it independent of the expected rate of return. It is, in other words, not the formula itself that is considered the great discovery done by Black, Scholes and Merton, but how they derived it. This is, among several others, also pointed out by Rubinstein [ ]:

The real significance of the formula to the financial theory of investment lies not in itself, but rather in how it was derived. Ten years earlier the same formula had been derived by Case M. Sprenkle [ ] and A. James Boness [ ].

Samuelson (1965) and Thorp (1969) published somewhat similar option pricing formulas to Boness and Sprenkle. Thorp claims that he actually had an identical formula to the Black-Scholes-Merton formula programmed into his computer years before Black, Scholes and Merton published their theory.

Now, delta hedging. As already mentioned, static market-neutral delta hedging was clearly described by Higgins and Nelson in 1902 and 1904. Thorp and Kassouf (1967) presented market-neutral static delta hedging in more detail, not only for at-the-money options, but for options with any delta. In his 1969 paper Thorp briefly describes market-neutral static delta hedging, and also points in the direction of some dynamic delta hedging, not as a central pricing device, but as a risk-management tool.
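To illustrate how close the expectation-based and lognormal prices sit near the money (zero rates; the absolute-volatility matching convention σ_abs = S·σ and the sample parameters are assumptions for illustration):

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bachelier_call(S, K, sigma_abs, T):
    # Bachelier's expectation-based price under a normal terminal price
    d = (S - K) / (sigma_abs * math.sqrt(T))
    return (S - K) * Phi(d) + sigma_abs * math.sqrt(T) * phi(d)

def lognormal_call(S, K, sigma, T):
    # lognormal (Sprenkle/Boness/Black-Scholes-style) price, zero rates
    d1 = (math.log(S / K) + 0.5 * sigma ** 2 * T) / (sigma * math.sqrt(T))
    return S * Phi(d1) - K * Phi(d1 - sigma * math.sqrt(T))

S, K, sigma, T = 100.0, 100.0, 0.2, 0.25   # hypothetical inputs
b = bachelier_call(S, K, S * sigma, T)
l = lognormal_call(S, K, sigma, T)
assert abs(b - l) / l < 0.01   # near the money, short maturity: nearly equal
```

The derivation, not the number the formula produces, is what separates the 1973 argument from its predecessors.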
Filer also points to dynamic hedging of options, but without showing much knowledge about how to calculate the delta. Another ignored and forgotten text is a book/booklet published by Arnold Bernhard & Co. The authors are clearly aware of market-neutral static delta hedging, or what they name "balanced hedge", for any level in the strike or asset price. This book has multiple examples of how to buy warrants or convertible bonds and construct a market-neutral delta hedge by shorting the right amount of common shares. Arnold Bernhard & Co also published deltas for a large number of warrants and convertible bonds that they distributed to investors on Wall Street.

Referring to Thorp and Kassouf (1967), Black, Scholes and Merton took the idea of delta hedging one step further; Black and Scholes (1973):

If the hedge is maintained continuously, then the approximations mentioned above become exact, and the return on the hedged position is completely independent of the change in the value of the stock. In fact, the return on the hedged position becomes certain. This was pointed out to us by Robert Merton.

This may be a brilliant mathematical idea, but option trading is not mathematical theory. It is not enough to have a theoretical idea so far removed from reality that it is far from robust in practice. What is surprising is that the only principle option traders do not use and cannot use is the approach named after the formula, which is a point we discuss next.

Traders don't do Valuation.

First, operationally, a price is not quite valuation. Valuation requires a strong theoretical framework with its corresponding fragility to both assumptions and the structure of a model. For traders, a price produced to buy an option when one has no knowledge of the probability distribution of the future is not valuation, but an expedient. Such a price could change. Their beliefs do not enter such a price. It can also be determined by the trader's inventory.
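The coherence such an expedient price must maintain with neighboring quotes can be sketched as a static checker (hypothetical quotes; two of the standard restrictions, calls decreasing and convex in the strike):

```python
def violations(strikes, calls):
    # flag static-arbitrage violations in a ladder of call quotes
    out = []
    for i in range(len(strikes) - 1):
        # calls must be non-increasing in strike (no negative call spreads)
        if calls[i] < calls[i + 1]:
            out.append(("call spread", strikes[i]))
    for i in range(len(strikes) - 2):
        # calls must be convex in strike (no negative butterflies)
        if calls[i] + calls[i + 2] < 2.0 * calls[i + 1]:
            out.append(("butterfly", strikes[i + 1]))
    return out

strikes = [90, 100, 110]
assert violations(strikes, [12.0, 6.0, 2.5]) == []   # a coherent ladder
# a middle call quoted too dear: the 100-strike butterfly is negative
assert violations(strikes, [12.0, 8.0, 2.5]) == [("butterfly", 100)]
```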
This distinction is critical: traders are engineers, whether boundedly rational (or even uninterested in any form of probabilistic rationality); they are not privy to informational transparency about the future states of the world and their probabilities. So they do not need a general theory to produce a price, merely the avoidance of Dutch-book style arbitrages against them, and compatibility with some standard restrictions: in addition to put-call parity, a call of a certain strike K cannot trade at a lower price than a call struck at K + ΔK (avoidance of negative call and put spreads); a call struck at K + ΔK cannot be more expensive than the average of a call struck at K and a call struck at K + 2ΔK (avoidance of negative butterflies); horizontal calendar spreads cannot be negative (when interest rates are low); and so forth. The degrees of freedom for traders are thus reduced: they need to abide by put-call parity and compatibility with other options in the market.

In that sense, traders do not perform valuation with some pricing kernel until the expiration of the security, but, rather, produce a price of an option compatible with other instruments in the markets, with a holding time that is stochastic. They do not need top-down science.

When do we value?
If traders operated solo, on a desert island, having for some reason to produce an option price and hold it to expiration, in a market in which the forward is absent, then some valuation would be necessary; but then their book would be minuscule. And this thought experiment is a distortion: people would not trade options unless they are in the business of trading options, in which case they would need to have a book with offsetting trades. For without offsetting trades, we doubt traders would be able to produce a position beyond a minimum (and negligible) size, as dynamic hedging is not possible. (Again, we are not aware of many non-blown-up option traders and institutions who have managed to operate in the vacuum of the Black-Scholes-Merton argument.) It is to the impossibility of such hedging that we turn next.
Finally, we discuss the severe flaw in the dynamic hedging concept. It assumes, nay, requires all moments of the probability distribution to exist.

Assume that the distribution of returns has a scale-free or fractal property that we can simplify as follows: for x large enough (i.e., in the tails), $\frac{P(X>nx)}{P(X>x)}$ depends on n, not on x. In financial securities, say, where X is a daily return, there is no reason for P(X>20%)/P(X>10%) to be different from P(X>15%)/P(X>7.5%). This self-similarity at all scales generates power law, or Paretian, tails, i.e., above a crossover point, $P(X>x) = K x^{-\alpha}$. It happens, looking at millions of pieces of data, that such property holds in all markets, barring sample error. For overwhelming empirical evidence, see Mandelbrot (1963), which predates Black-Scholes-Merton (1973) and the jump-diffusion of Merton (1976); see also Stanley et al. and Gabaix et al. The argument to assume the scale-free property is as follows: the distribution might have thin tails at some point (say above some value of X). But we do not know where such a point is; we are epistemologically in the dark as to where to put the boundary, which forces us to use infinity.

Some criticism of these "true fat tails" accepts that such a property might apply for daily returns, but, owing to the Central Limit Theorem, the distribution is held to become Gaussian under aggregation for cases in which α is deemed higher than 2.

Merton seemed to accept the inapplicability of dynamic hedging but he perhaps thought that these ills would be cured thanks to his prediction of the financial world "spiraling towards dynamic completeness". Fifteen years later, we have, if anything, spiraled away from it.
Such an argument does not hold, owing to the preasymptotics of scalable distributions: Bouchaud and Potters ( ) and Mandelbrot and Taleb ( ) argue that the preasymptotics of fractal distributions are such that the effect of the Central Limit Theorem is exceedingly slow in the tails, in fact irrelevant. Furthermore, there is sampling error, as we have fewer data for longer periods, hence fewer tail episodes, which gives an in-sample illusion of thinner tails. In addition, the point that aggregation thins out the tails does not hold for dynamic hedging, in which the operator depends necessarily on high frequency data and their statistical properties. So long as the distribution is scale-free at the time period of the dynamic hedge, higher moments become explosive, "infinite", disallowing the formation of a dynamically hedged portfolio. Simply, a Taylor expansion is impossible, as moments of higher order matter critically and one of those moments is going to be infinite.

The mechanics of dynamic hedging are as follows. Assume a risk-free interest rate of 0, with no loss of generality. The canonical Black-Scholes-Merton package consists in selling a call and purchasing shares of stock that provide a hedge against instantaneous moves in the security. Thus the portfolio π, locally "hedged" against exposure to the first moment of the distribution, is the following:

π = -C + (∂C/∂S) S,

where C is the call price and S the underlying security. Take the discrete-time change in the values of the portfolio,

Δπ = -ΔC + (∂C/∂S) ΔS.

By expanding around the initial values of S, we have the changes in the portfolio in discrete time. Conventional option theory applies to the Gaussian, in which all orders higher than ΔS² disappear rapidly. Taking expectations on both sides, we can see very strict requirements on moment finiteness: all moments need to converge. If we include another term, of order ΔS³, such a term may be of significance in a probability distribution with significant cubic or quartic terms.
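The mechanics above can be put to work in a small simulation (our own sketch, not the authors' code; all parameter values are illustrative): delta-hedge a short call at discrete intervals and look at the dispersion of the final profit and loss, once with Gaussian returns and once with Student-t (ν = 3) returns scaled to the same variance. The residual risk refuses to shrink in the fat-tailed case.

```python
import math, random

def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def bs_call(S, K, sigma, t):
    d1 = (math.log(S / K) + 0.5 * sigma * sigma * t) / (sigma * math.sqrt(t))
    return S * norm_cdf(d1) - K * norm_cdf(d1 - sigma * math.sqrt(t))

def bs_delta(S, K, sigma, t):
    d1 = (math.log(S / K) + 0.5 * sigma * sigma * t) / (sigma * math.sqrt(t))
    return norm_cdf(d1)

def t3_unit_var():
    # Student-t, nu = 3, rescaled to unit variance (Var = nu/(nu-2) = 3)
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return (z / math.sqrt(chi2 / 3.0)) / math.sqrt(3.0)

def pnl_std(draw, S0=100.0, K=100.0, sigma=0.2, T=0.25, n=50, paths=2000):
    """Dispersion of the terminal P&L of a discretely delta-hedged short call."""
    dt = T / n
    pnls = []
    for _ in range(paths):
        S, pnl = S0, bs_call(S0, K, sigma, T)      # collect the premium
        for i in range(n):
            delta = bs_delta(S, K, sigma, T - i * dt)
            S_next = S * math.exp(sigma * math.sqrt(dt) * draw()
                                  - 0.5 * sigma * sigma * dt)
            pnl += delta * (S_next - S)            # gain/loss on the hedge leg
            S = S_next
        pnl -= max(S - K, 0.0)                     # settle the option
        pnls.append(pnl)
    m = sum(pnls) / len(pnls)
    return math.sqrt(sum((p - m) ** 2 for p in pnls) / len(pnls))

random.seed(3)
err_gauss = pnl_std(lambda: random.gauss(0.0, 1.0))
random.seed(3)
err_t = pnl_std(t3_unit_var)
print(err_gauss, err_t)   # the fat-tailed hedging error is markedly larger
```

Increasing n shrinks the Gaussian error toward zero; under the fat-tailed draws the error is dominated by the largest single move and refuses to behave.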
Indeed, although the nth derivative with respect to S can decline very sharply for options that have a strike K away from the center of the distribution, it remains that the delivered higher orders of ΔS rise disproportionately fast for that decline to carry a mitigating effect on the hedges. So here we mean all moments, no approximation. The logic of the Black-Scholes-Merton so-called solution thanks to Ito's lemma was that the portfolio collapses into a deterministic payoff. But let us see how quickly or effectively this works in practice.

The actual replication process is as follows: the payoff of a call should be replicated with the following stream of dynamic hedges, the limit of which can be seen here, between t0 and T:

lim_{Δt→0} Σ_{i=1}^{n=T/Δt} (∂C/∂S)|_{S=S_{t0+(i-1)Δt}, t=t0+(i-1)Δt} (S_{t0+iΔt} - S_{t0+(i-1)Δt})   ( . )

Such a policy does not match the call value: the difference remains stochastic (while according to Black-Scholes it should shrink), unless one lives in a fantasy world in which such risk reduction is possible.

Further, there is an inconsistency in the works of Merton making us confused as to what the theory finds acceptable: in Merton ( ) he agrees that we can use Bachelier-style option derivation in the presence of jumps and discontinuities, with no dynamic hedging, but only when the underlying stock price is uncorrelated to the market. This seems to be an admission that the dynamic hedging argument applies only to some securities: those that do not jump and are correlated to the market.

. . The (confusing) Robustness of the Gaussian
The success of the formula last developed by Thorp, and called Black-Scholes-Merton, was due to a simple attribute of the Gaussian: you can express any probability distribution in terms of a Gaussian, even if it has fat tails, by varying the standard deviation σ at the level of the density of the random variable. It does not mean that you are using a Gaussian, nor does it mean that the Gaussian is particularly parsimonious (since you have to attach a σ for every level of the price). It simply means that the Gaussian can express anything you want if you add a function for the parameter σ, making it a function of strike price and time to expiration.

This "volatility smile", i.e., varying one parameter to produce σ(K), or "volatility surface", varying two parameters, σ(S, t), is effectively what was done in different ways by Dupire ( , ) [ , ] and Derman [ , ]; see Gatheral ( ) [ ]. They assume a volatility process not because there is necessarily such a thing, but only as a method of fitting option prices to a Gaussian. Furthermore, although the Gaussian has a finite second moment (and finite higher moments as well), you can express a scalable distribution with infinite variance using a Gaussian volatility surface. One strong constraint on the σ parameter is that it must be the same for a put and a call with the same strike (if both are European-style), and the drift should be that of the forward.

Indeed, ironically, the volatility smile is inconsistent with the Black-Scholes-Merton theory. This has led to hundreds if not thousands of papers trying to extend (what was perceived to be) the Black-Scholes-Merton model to incorporate stochastic volatility and jump-diffusion. Several of these researchers have been surprised that so few traders actually use stochastic volatility models.
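As a toy illustration of this point (our own sketch, with made-up numbers, not the authors' code): take call prices with a Paretian-tail shape, C(K) ∝ (K - S0)^(1-α), and recover for each strike the Black-Scholes σ(K) that reproduces them by bisection. The Gaussian "expresses" the fat-tailed prices, at the cost of a strike-dependent σ.

```python
import math

def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def bs_call(S, K, sigma, t=1.0):
    d1 = (math.log(S / K) + 0.5 * sigma * sigma * t) / (sigma * math.sqrt(t))
    return S * norm_cdf(d1) - K * norm_cdf(d1 - sigma * math.sqrt(t))

def implied_vol(price, S, K, t=1.0, lo=1e-6, hi=5.0):
    """Bisection on sigma: the Black-Scholes price is monotone in volatility."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, mid, t) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical tail call prices with alpha = 3 (illustrative values):
S0, alpha, scale = 100.0, 3.0, 400.0
vols = {}
for K in (110.0, 120.0, 130.0):
    C = scale * (K - S0) ** (1.0 - alpha) / (alpha - 1.0)
    vols[K] = implied_vol(C, S0, K)
    # Round trip: the Gaussian with sigma(K) reproduces the fat-tailed price
    assert abs(bs_call(S0, K, vols[K]) - C) < 1e-6
    print(K, round(C, 4), round(vols[K], 4))
```

The map works strike by strike; nothing about it implies the underlying is Gaussian, which is the "confusing robustness" of the section title.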
It is not a model that says how the volatility smile should look, or how it evolves over time; it is a hedging method that is robust and consistent with an arbitrage-free volatility surface that evolves over time.

In other words, you can use the volatility surface as a map, not a territory. However, it is foolish to justify Black-Scholes-Merton on grounds of its use: we repeat that the Gaussian bans the use of probability distributions that are not Gaussian, whereas the non-dynamic-hedging derivations (Bachelier, Thorp) are not grounded in the Gaussian.

. . Order Flow and Options
It is clear that option traders are not necessarily interested in the probability distribution at expiration time, given that this is abstract, even metaphysical for them. In addition to the put-call parity constraints, which according to the evidence were fully developed already in , we can hedge away inventory risk in options with other options. One very important implication of this method is that if you hedge options with options, then option pricing will be largely demand and supply based. This is in strong contrast with the Black-Scholes-Merton ( ) theory, based on the idealized world of geometric Brownian motion with continuous-time delta hedging, under which demand and supply for options simply should not affect the price of options: if someone wants to buy more options, the market makers can simply manufacture them by dynamic delta hedging, which would be a perfect substitute for the option itself.

This raises a critical point: option traders do not estimate the odds of rare events by pricing out-of-the-money options. They just respond to supply and demand. The notion of an implied probability distribution is merely a Dutch-book-compatibility type of proposition.

. . Bachelier-Thorp
The argument often casually propounded attributing the success of option volume to the quality of the Black-Scholes formula is rather weak. It is particularly weakened by the fact that options had been successful at many different times and places.

Furthermore, there is evidence that while both the Chicago Board Options Exchange and the Black-Scholes-Merton formula came about in , the model was "rarely used by traders" before the s (O'Connell, ). When one of the authors (Taleb) became a pit trader in , almost two decades after Black-Scholes-Merton, he was surprised to find that many traders still priced options "sheets free", pricing off the butterfly and off the conversion, without recourse to any formula.

Even a book written in by a finance academic appears to credit Thorp and Kassouf ( ) rather than Black-Scholes ( ), although the latter was present in its bibliography. Auster ( ):

Sidney Fried wrote on warrant hedges before , but it was not until that the book Beat the Market by Edward O. Thorp and Sheen T. Kassouf rigorously, but simply, explained the short warrant/long common hedge to a wide audience.

We conclude with the following remark. Sadly, all the equations, from the first (Bachelier) to the last pre-Black-Scholes-Merton (Thorp), accommodate a scale-free distribution. The notion of explicitly removing the expectation from the forward was present in Keynes ( ) and later in Blau ( ), and long a call, short a put of the same strike equals a forward. These arbitrage relationships appeared to be well known in .

One could easily attribute the explosion in option volume to the computer age and the ease of processing transactions, added to the long stretch of peaceful economic growth and the absence of hyperinflation. From the evidence (once one removes the propaganda), the development of scholastic finance appears to be an epiphenomenon rather than a cause of option trading.
Once again, lecturing birds how to fly does not allow one to take subsequent credit.

This is why we call the equation Bachelier-Thorp. We were using it all along and gave it the wrong name, after the wrong method and with attribution to the wrong persons. It does not mean that dynamic hedging is out of the question; it is just not a central part of the pricing paradigm. It led to the writing down of a certain stochastic process that may have its uses, some day, should markets spiral towards dynamic completeness. But not in the present.

O P T I O N P R I C I N G U N D E R P O W E R L A W S : A R O B U S T H E U R I S T I C ⋆,‡

In this (research) chapter, we build a heuristic that takes a given option price in the tails with strike K and extends it (for calls, to all strikes > K; for puts, to all strikes < K), assuming the continuation falls into what we define as the "Karamata constant" or "Karamata point", beyond which the strong Pareto law holds. The heuristic produces relative prices for options, with the tail index α as sole parameter, under some mild arbitrage constraints.

Usual restrictions such as finiteness of variance are not required.

The heuristic allows us to scrutinize the volatility surface and test theories of relative tail option mispricing and overpricing usually built on thin-tailed models and modifications of the Black-Scholes formula.

Figure . : The Karamata point where the slowly varying function is safely replaced by a constant, L(S) = l. The constant varies whether we use the price S or its geometric return, but not the asymptotic slope, which corresponds to the tail index α.

Research chapter, with the Universa team: Brandon Yarckin, Chitpuneet Mann, Damir Delic, and Mark Spitznagel.
Figure . : We show a straight Black-Scholes option price (constant volatility), one with a volatility "smile", i.e. the scale increases in the tails, and power law option prices. Under the simplified case of a power law distribution for the underlying, option prices are linear to strike.
The power law class is conventionally defined by the property of the survival function, as follows. Let X be a random variable belonging to the class of distributions with a "power law" right tail, that is:

P(X > x) = L(x) x^(-α)   ( . )

where L : [x_min, +∞) → (0, +∞) is a slowly varying function, defined as lim_{x→+∞} L(kx)/L(x) = 1 for any k > 0.

The survival function of X is then said to belong to the "regular variation" class RV_α. More specifically, a function f : R+ → R+ is index varying at infinity with index ρ (f ∈ RV_ρ) when

lim_{t→∞} f(tx)/f(t) = x^ρ.

More practically, there is a point where L(x) approaches its limit, l, becoming a constant, as in Figure . ; we call it the "Karamata constant". Beyond such a value the tails of power laws are calibrated using such standard techniques as the Hill estimator. The distribution in that zone is dubbed the strong Pareto law by B. Mandelbrot [ ], [ ].

Now define a European call price C(K) with a strike K and an underlying price S, K, S ∈ (0, +∞), as (S - K)^+, with its valuation performed under some probability measure P, thus allowing us to price the option as E_P (S - K)^+ = ∫_K^∞ (S - K) dP. This allows us to immediately prove the following.

. . First approach, S is in the regular variation class

We start with a simplified case, to build the intuition. Let S have a survival function in the regular variation class RV_α as per Eq. ( . ). For all K > l and α > 1,

C(K) = K^(1-α) l^α / (α - 1).   ( . )

Remark
We note that the parameter l, when derived from an existing option price, contains all necessary information about the probability distribution below S = l, which under a given α parameter makes it unnecessary to estimate the mean, the "volatility" (that is, scale) and other attributes.

Let us assume that α is exogenously set (derived from fitting distributions or, simply, from experience; in both cases α is supposed to fluctuate minimally [ ]). We note that C(K) is invariant to distribution calibrations, and the only parameter needed is l which, being constant, disappears in ratios. Now take as set the market price of an "anchor" tail option, C_m, with strike K_1, defined as an option relative to the strike of which other options are priced. We can simply generate all further strikes from

l = ((α - 1) C_m K_1^(α-1))^(1/α)

and applying Eq. ( . ).

Result 1: Relative Pricing under Distribution for S
For K_1, K_2 ≥ l,

C(K_2) = (K_2 / K_1)^(1-α) C(K_1).   ( . )

The advantage is that all parameters in the distributions are eliminated: all we need is the price of the tail option and the α to build a unique pricing mechanism.

Remark: Avoiding confusion about L and α
The tail index α and the Karamata constant l should correspond to the assigned distribution for the specific underlying. A tail index α for S in the regular variation class as per Eq. ( . ), leading to Eq. ( . ), is different from that for r = (S - S_0)/S_0 ∈ RV_α.
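A quick numerical sketch of Result 1 (ours; α, l and the strikes are illustrative values, not from the chapter): the call price is the integral of the survival function above the strike, and the ratio of two such prices collapses to (K_2/K_1)^(1-α), with l dropping out.

```python
# Strong-Pareto survival beyond the Karamata constant l:
# P(S > s) = (l/s)**alpha for s >= l, so C(K) = K**(1-alpha)*l**alpha/(alpha-1).
alpha, l = 3.0, 20.0

def survival(s):
    return (l / s) ** alpha if s >= l else 1.0

def call_closed_form(K):
    return K ** (1.0 - alpha) * l ** alpha / (alpha - 1.0)

def call_numeric(K, upper=1.0e5, n=500_000):
    # C(K) = integral of P(S > s) over s in [K, infinity), midpoint rule
    h = (upper - K) / n
    return sum(survival(K + (i + 0.5) * h) for i in range(n)) * h

K1, K2 = 40.0, 80.0
c1, c2 = call_closed_form(K1), call_closed_form(K2)
assert abs(call_numeric(K1) - c1) / c1 < 1e-3          # closed form checks out
assert abs(c2 / c1 - (K2 / K1) ** (1 - alpha)) < 1e-12  # Result 1 ratio
print(c1, c2, c2 / c1)
```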
For consistency, each should have its own Zipf plot and other representations.

1. If P(X > x) = L_a(x) x^(-α) and P((X - X_0)/X_0 > (x - X_0)/X_0) = L_b(x) x^(-α), the α constant will be the same, but the various L(.) will be reaching their constant level at a different rate.
2. If r_c = log(S/S_0), it is not in the regular variation class; see the Theorem.

The reason α stays the same is owing to the scale-free attribute of the tail index.

Theorem 1: Log returns
Let S be a random variable with survival function φ(s) = L(s) s^(-α) ∈ RV_α, where L(.) is a slowly varying function. Let r_l be the log return r_l = log(s/s_0). Then φ_{r_l}(r_l) is not in the RV_α class.

Proof. Immediate. Under the transformation s = s_0 e^{r_l}, the survival function becomes φ_{r_l}(r_l) = L(s_0 e^{r_l}) s_0^(-α) e^{-α r_l}, which decays exponentially in r_l rather than as a power, hence falls outside the regular variation class.

We note, however, that in practice, although we may need continuous compounding to build dynamics [ ], our approach assumes such dynamics are contained in the anchor option price selected for the analysis (or l). Furthermore, there is no tangible difference, outside the far tail, between log(S/S_0) and (S - S_0)/S_0.

. . Second approach, S has geometric returns in the regular variation class

Let us now apply this to real world cases where the returns (S - S_0)/S_0 are Paretian. Consider, for r > l, S = (1 + r) S_0, where S_0 is the initial value of the underlying and r ∼ P(l, α) (Pareto I distribution), so that the strike has survival function

((K - S_0)/(l S_0))^(-α),  K > S_0 (1 + l),   ( . )

and fit to the anchor price C_m using l = ((α - 1) C_m)^(1/α) (K - S_0)^((α-1)/α) / S_0, which, as before, shows that practically all information about the distribution is embedded in l.

Let (S - S_0)/S_0 be in the regular variation class. For K ≥ S_0(1 + l),

C(K, S_0) = (l S_0)^α (K - S_0)^(1-α) / (α - 1).   ( . )

We can thus rewrite Eq. ( . )
to eliminate l:

Result 2: Relative Pricing under Distribution for (S - S_0)/S_0
For K_1, K_2 ≥ (1 + l) S_0,

C(K_2) = ((K_2 - S_0)/(K_1 - S_0))^(1-α) C(K_1).   ( . )

Figure . : Put prices in the SP using a fixed K as anchor (from the Dec , settlement), generating option prices using a tail index α that matches the market (blue) ("model"), and in red prices for α = 2.75. We can see that market prices tend to 1) fit a power law (matching stochastic volatility with fudged parameters), 2) but with an α that thins the tails. This shows how models claiming overpricing of tails are grossly misspecified.

Remark
Unlike the pricing methods in the Black-Scholes modification class (stochastic and local volatility models; see the expositions of Dupire, Derman and Gatheral, [ ], [ ], [ ]), finiteness of variance is not required for our model, or for option pricing in general, as shown in [ ]. The only requirement is α > 1, that is, a finite first moment.

Figure . : Same results as in Fig. . but expressed using implied volatility. We match the price to implied volatility for downside strikes (anchors , , and ) using our model vs. the market, in ratios. We assume α = 2.75.
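A sketch of Result 2 (our own illustration; α, l and S_0 are assumed values): with Paretian returns, the call price follows the closed form of the previous equation, the price ratio depends only on the moneyness distances and α, and the anchoring formula for l round-trips exactly.

```python
alpha, l, S0 = 3.0, 0.05, 100.0

def call(K):
    # Valid above the Karamata point S0*(1+l)
    return (l * S0) ** alpha * (K - S0) ** (1.0 - alpha) / (alpha - 1.0)

K1, K2 = 110.0, 130.0
ratio = call(K2) / call(K1)
# Result 2: l has dropped out of the ratio
assert abs(ratio - ((K2 - S0) / (K1 - S0)) ** (1.0 - alpha)) < 1e-9

# Anchoring: recover l from a (hypothetical) market price C_m at K1; all
# other tail strikes then follow without any further distributional input.
C_m = call(K1)
l_fit = (((alpha - 1.0) * C_m) ** (1.0 / alpha)
         * (K1 - S0) ** ((alpha - 1.0) / alpha) / S0)
assert abs(l_fit - l) < 1e-9
print(call(K1), call(K2), ratio)
```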
Figure . : The intuition of the log-log plot (log K against log option price) for the second calibration: Black-Scholes prices against power law prices for several values of α.
We now consider the put strikes (or the corresponding calls in the negative tail, which should be priced via put-call parity arbitrage). Unlike with calls, we can only consider the variations of (S - S_0)/S_0, not the logarithmic returns (nor those of S taken separately). We construct the negative side with a negative return for the underlying. Let r be the rate of return, S = (1 - r) S_0, and let r > l > 0 with density f_r(r) = α l^α r^(-α-1). We have, by probabilistic transformation and rescaling, the PDF of the underlying:

f_S(S) = λ α ((S_0 - S)/(l S_0))^(-α-1) / (l S_0),  S ∈ [0, (1 - l) S_0),

where the scaling constant λ = (1 - l^α)^(-1) is set in a way to get f_S(S) to integrate to 1. The parameter λ, however, is close to 1, making the correction negligible in applications where σ√t is small (σ being the Black-Scholes equivalent implied volatility and t the time to expiration of the option).

Remarkably, both the parameter l and the scaling λ are eliminated.

Result 3: Put Pricing
For K_1, K_2 ≤ (1 - l) S_0,

P(K_2) = P(K_1) (S_0^(-α) ((α - 1) K_2 + S_0) - (S_0 - K_2)^(1-α)) / (S_0^(-α) ((α - 1) K_1 + S_0) - (S_0 - K_1)^(1-α)).   ( . )

Obviously, there is no arbitrage for strikes higher than the baseline one, K_1, in the previous equations. For calls we can verify the Breeden-Litzenberger result [ ], where the density is recovered from the second derivative of the option with respect to the strike: ∂²C(K)/∂K² |_{K≥K_1} = α K^(-α-1) l^α ≥ 0.

Arbitrage between the strikes K_1 + ΔK, K_1, and K_1 - ΔK is avoided by respecting the following boundary: let BSC(K, σ(K)) be the Black-Scholes value of the call for strike K with volatility σ(K), a function of the strike, and t the time to expiration. We require

C(K_1 + ΔK) + BSC(K_1 - ΔK) ≥ 2 C(K_1),   ( . )

where BSC(K_1, σ(K_1)) = C(K_1). For inequality ( . ) to be satisfied, we further need an inequality of call spreads, taken to the limit:

∂BSC(K, σ(K))/∂K |_{K=K_1} ≥ ∂C(K)/∂K |_{K=K_1}.   ( . )

Such an arbitrage puts a lower bound on the tail index α. Assuming 0 rates to simplify:

α ≥ log( Φ(d_2) - S_0 √t φ(d_1) σ′(K_1) ) / ( log(l) + log(S_0) - log(K_1 - S_0) ),   ( . )

where Φ and φ are the standard normal distribution and density functions and d_{1,2} = (log(S_0/K_1) ± σ(K_1)² t/2)/(σ(K_1)√t).

As we can see in Figure . , stochastic volatility models and similar adaptations (say, jump-diffusion or standard Poisson variations) eventually fail "out in the tails", outside the zone for which they were calibrated. There have been poor attempts to extrapolate option prices using a fudged thin-tailed probability distribution rather than a Paretian one; hence the numerous claims in the finance literature on "overpricing" of tail options, combined with some psycholophastering on "dread risk", are unrigorous on that basis. The proposed methods allow us to approach such claims with more realism.

Finally, note that our approach is not about absolute mispricing of tail options, but about mispricing relative to a given strike closer to the money.

acknowledgments Bruno Dupire, Peter Carr, students at NYU Tandon School of Engineering.

F O U R M I S T A K E S I N Q U A N T I T A T I V E F I N A N C E ⋆,‡

We discuss Jeff Holman's (who at the time was, surprisingly, a senior risk officer for a large hedge fund) comments in
Quantitative Finance to illustrate four critical errors students should learn to avoid:

1. Mistaking tails (4th moment and higher) for volatility (2nd moment)
2. Missing Jensen's inequality when calculating return potential
3. Analyzing hedging results without the performance of the underlying
4. The necessity of a numéraire in finance.

The review of Antifragile by Mr Holman (Dec , ) is replete with factual, logical, and analytical errors. We will only list here the critical ones, and the ones with generality for the risk management and quantitative finance communities; these should be taught to students in quantitative finance as central mistakes to avoid, so beginner quants and risk managers can learn from these fallacies.

It is critical for beginners not to fall for the following elementary mistake. Mr Holman gets the relation of the VIX (volatility contract) to betting on "tail events" backwards. Let us restate the notion of "tail events" (we saw it earlier in the book): it means a disproportionate role of the tails in determining the properties of the distribution, which, mathematically, means a smaller one for the "body".

Discussion chapter.
The point is staring at every user of spreadsheets: kurtosis, or scaled fourth moment, the standard measure of fat-tailedness, entails normalizing the fourth moment by the square of the variance.

Mr Holman seems to get the latter part of the attributes of fat-tailedness in reverse. It is an error to mistake the VIX for tail events. The VIX is mostly affected by at-the-money options, which correspond to the center of the distribution, closer to the second moment than the fourth (at-the-money options are actually linear in their payoff and correspond to the conditional first moment). As explained about seventeen years ago in
Dynamic Hedging (Taleb, ) (see appendix), in the discussion of such tail bets, or "fourth moment bets", betting on the disproportionate role of tail events, on fat-tailedness, is done by selling the around-the-money options (the VIX) and purchasing options in the tails, in order to extract the second moment and achieve neutrality to it (sort of becoming "market neutral"). Such a neutrality requires some type of "short volatility" in the body, because higher kurtosis means lower action in the center of the distribution.

A more mathematical formulation is in the technical version of the
Incerto: fat tails mean "higher peaks" for the distribution since, the fatter the tails, the more time markets spend inside a band around the mean m that is narrow relative to the standard deviation s (we used the Gaussian here as a base for ease of presentation, but the argument applies to all unimodal distributions with "bell-shaped" curves, known as semiconcave). And "higher peaks" mean fewer variations that are not tail events, more quiet times, not less. For the consequence for option pricing, the reader might be interested in a quiz I routinely give students after the first lecture on derivatives: "What happens to at-the-money options when one fattens the tails?", the answer being that they should drop in value.

Effectively, but in a deeper argument, in the QF paper (Taleb and Douady), our measure of fragility has an opposite sensitivity to events around the center of the distribution, since, by an argument of survival probability, what is fragile is sensitive to tail shocks and, critically, should not vary in the body (otherwise it would be broken).
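A Monte Carlo sketch of the quiz (ours, not from the book; the volatility level and ν = 4 are illustrative assumptions): fatten the tails while keeping the variance constant and the at-the-money option loses value, because the fat-tailed density has a higher peak and less action in the body.

```python
import math, random

def gaussian():
    return random.gauss(0.0, 1.0)

def student_t4_unit_var():
    # t_nu = Z / sqrt(ChiSq_nu / nu); Var(t_4) = nu/(nu-2) = 2, so divide by sqrt(2)
    z = random.gauss(0.0, 1.0)
    chi2 = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(4))
    return (z / math.sqrt(chi2 / 4.0)) / math.sqrt(2.0)

def atm_call(draw, n=300_000, S0=100.0, vol=0.2):
    """One-period at-the-money call E[(S - K)+], K = S0, same scale for both draws."""
    random.seed(1)
    return sum(max(S0 * vol * draw(), 0.0) for _ in range(n)) / n

thin, fat = atm_call(gaussian), atm_call(student_t4_unit_var)
print(thin, fat)   # the fat-tailed ATM option is the cheaper one
```

The out-of-the-money strikes show the opposite effect, which is exactly the "sell the body, buy the tails" structure of the fourth moment bet described above.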
Here is an error to avoid at all costs in discussions of volatility strategies or, for that matter, anything in finance. Mr Holman seems to miss the existence of Jensen's inequality, which is the entire point of owning an option, a point that has been belabored in Antifragile. One manifestation of missing the convexity effect is a critical miscalculation in the way one can naively assume options respond to the VIX.

Technical Point: Where Does the Tail Start?
As we saw in Section . , for a general class of symmetric distributions with power laws, the tail starts at ±√((α + √((α + 1)(17α + 1)) + 1)/(α − 1)) · s/√2, with α infinite in the stochastic volatility Gaussian case and s the standard deviation. For typical values of α the "tail" is thus located between around 2 and 3 standard deviations. This flows from the heuristic definition of fragility as a second order effect: the tail is the part of the distribution that is convex to errors in the estimation of the scale. But in practice, because historical measurements of the STD will be biased lower because of small sample effects (as we repeat, fat tails accentuate small sample effects), the deviations will be larger than those cutoffs.

"A $ investment on January , in a strategy of buying and rolling short-term VIX futures would have peaked at $ . on November , , and then subsequently lost % of its value over the next four and a half years, finishing under $ . as of May , ."

This mistake in the example given underestimates option returns by up to several orders of magnitude. Mr Holman analyzes the performance of a tail strategy using investments in financial options by using the VIX (or VIX futures) as a proxy, which is mathematically erroneous owing to second-order effects, as the link is tenuous (it would be like evaluating investments in ski resorts by analyzing temperature futures). Assume a periodic rolling of an option strategy: an option several standard deviations away from the money gains a large multiple of its value when its implied volatility rises, while losing only a bounded amount when volatility falls; and, to show the acceleration, the multiple grows explosively with the distance of the strike from the money.
There is a second critical mistake in the discussion: Mr Holman's calculations here exclude the payoff from actual in-the-moneyness.

One should remember that the VIX is not a price, but an inverse function, an index derived from a price: one does not buy "volatility" like one can buy a tomato; operators buy options corresponding to such an inverse function, and there are severe, very severe nonlinearities in the effect. Although more linear than tail options, the VIX is still convex to actual market volatility, somewhere between variance and standard deviation, since a strip of options spanning all strikes should deliver the variance (Gatheral, ). The reader can go through a simple exercise. Let's say that the VIX is "bought" at some level, that is, the component options are purchased at a combination of volatilities that corresponds to a VIX at that level. Because returns accrue in squares, the package benefits from an alternation of low and high volatility episodes whose arithmetic average matches the purchase level: the root mean square exceeds the arithmetic mean. Mr Holman believes, or wants the reader to believe, that the resulting few percentage points should be treated as a loss, when in fact second-order unevenness in volatility changes is more relevant than the first-order effect.

One should never calculate the cost of insurance without offsetting it with returns generated from packages that one would not have purchased otherwise. Even had he gotten the sign right on the volatility, Mr Holman in the example above analyzes the performance of a strategy buying options to protect against a tail event without adding the performance of the portfolio itself, like counting the cost side of the insurance without the performance of what one is insuring that would not have been bought otherwise. Over the same period he discusses, the market rose more than %: a healthy approach would be to compare dollar-for-dollar what

In the above discussion Mr Holman also shows evidence of dismal returns on index puts which, as we said before, respond to volatility, not tail events.
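A numerical illustration of the convexity point (the episode values are our own hypothetical numbers, not Mr Holman's): two episodes of realized volatility, 10% then 30%, average 20% arithmetically, but a squared-returns payoff delivers the root mean square, which is higher.

```python
import math

# Hypothetical volatility episodes (assumed values for illustration)
episodes = [0.10, 0.30]

arithmetic_mean = sum(episodes) / len(episodes)
root_mean_square = math.sqrt(sum(v * v for v in episodes) / len(episodes))

print(arithmetic_mean)               # 0.2
print(round(root_mean_square, 4))    # 0.2236: Jensen's inequality at work
```

The roughly 2.4 percentage point gap is a gain to the holder of the variance-style package, not a loss, even though the "average volatility" never moved.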
These are called, in the lingo, "sucker puts". We are using implied volatility as a benchmark for its STD. An event this author witnessed, in the liquidation of Victor Niederhoffer: options sold for a few cents were purchased back at many multiples of that, which bankrupted Refco, and, which is remarkable, without the options getting close to the money: it was just a panic rise in implied volatility.

an investor would have done (and, of course, getting rid of this "VIX" business and focusing on very small dollars invested in tail options that would allow such an aggressive stance). Many investors (such as this author) would have stayed out of the market, or would not have added funds to the market, without such insurance.

There is a deeper analytical error. A barbell is defined as a bimodal investment strategy, presented as investing a portion of your portfolio in what is explicitly defined as a "numéraire repository of value" (
Antifragile), and the rest in risky securities (Antifragile indicates that such a numéraire would be, among other things, inflation protected). Mr Holman goes on and on in a nihilistic discourse on the absence of such a riskless numéraire (of the sort that can lead to such sophistry as "he is saying one is safer on terra firma than at sea, but what if there is an earthquake?").

The familiar Black and Scholes derivation uses a riskless asset as a baseline; but the literature since around has substituted the notion of "cash" with that of a numéraire, along with the notion that one can have different currencies, which technically allows for changes of probability measure. A numéraire is defined as the unit to which all other units relate. (Practically, the numéraire is a basket the variations of which do not affect the welfare of the investor.) Alas, without a numéraire, there is no probability measure, and no quantitative in quantitative finance, as one needs a unit to which everything else is brought back. In this (emotional) discourse, Mr Holman is not just rejecting the barbell per se, but any use of the expectation operator with any economic variable, meaning he should go attack the tens of thousands of research papers and the existence of the journal Quantitative Finance itself.

Clearly, there is a high density of other mistakes or incoherent statements in the outpour of rage in Mr Holman's review; but I have no doubt these have been detected by the Quantitative Finance reader and, as we said, the object of this discussion is the prevention of analytical mistakes in quantitative finance.

To conclude, this author welcomes criticisms from the finance community provided they are not straw man arguments or, as in the case of Mr Holman, violate the foundations of the field itself.
From Dynamic Hedging, pages - :

A fourth moment bet is long or short the volatility of volatility. It could be achieved either with out-of-the-money options or with calendars. Example: A ratio "backspread" or reverse spread is a method that includes the buying of out-of-the-money options in large amounts and the selling of smaller amounts of at-the-money, but making sure the
Figure . : First Method to Extract the Fourth Moment, from Dynamic Hedging, .
Figure . : Second Method to Extract the Fourth Moment, from Dynamic Hedging, .

trade satisfies the "credit" rule (i.e., the trade initially generates a positive cash flow). The credit rule is more difficult to interpret when one uses in-the-money options. In that case, one should deduct the present value of the intrinsic part of every option using the put-call parity rule to equate them with out-of-the-money ones.

The trade shown in Figure . was accomplished with the purchase of both out-of-the-money puts and out-of-the-money calls and the selling of smaller amounts of at-the-money straddles of the same maturity.

Figure . shows the second method, which entails the buying of longer-dated options in some amount and the selling of shorter-dated options on a fraction of the amount. Both trades show the position benefiting from the fat tails and the high peaks. Both trades, however, will have different vega sensitivities, but close to flat modified vega.

See "The Body, The Shoulders, and The Tails" in Section . , where we assume tails start at the level of convexity of the segment of the probability distribution to the scale of the distribution.

T A I L R I S K C O N S T R A I N T S A N D M A X I M U M E N T R O P Y ( W . D . & H . G E M A N ) ‡

Portfolio selection in the financial literature has essentially been analyzed under two central assumptions: full knowledge of the joint probability distribution of the returns of the securities that will comprise the target portfolio; and investors' preferences expressed through a utility function. In the real world, operators build portfolios under risk constraints which are expressed both by their clients and by regulators, and which bear on the maximal loss that may be generated over a given time period at a given confidence level (the so-called Value at Risk of the position).
Interestingly, in the finance literature, a serious discussion of how much or little is known from a probabilistic standpoint about the multi-dimensional density of the assets' returns seems to be of limited relevance. Our approach, in contrast, is to highlight these issues and then adopt throughout a framework of entropy maximization to represent the real-world ignorance of the "true" probability distributions, both univariate and multivariate, of traded securities' returns. In this setting, we identify the optimal portfolio under a number of downside risk constraints. Two interesting results are exhibited: (i) the left-tail constraints are sufficiently powerful to override all other considerations in the conventional theory; (ii) the "barbell portfolio" (maximal certainty/low risk in one set of holdings, maximal uncertainty in another), which is quite familiar to traders, naturally emerges in our construction. Customarily, when working in an institutional framework, operators and risk takers principally use regulatorily mandated tail-loss limits to set risk levels in their
‡ Research chapter.

portfolios (obligatorily for banks since Basel II). They rely on stress tests, stop-losses, value at risk (VaR), expected shortfall (i.e., the expected loss conditional on the loss exceeding VaR, also known as CVaR), and similar loss curtailment methods, rather than utility. In particular, the margining of financial transactions is calibrated by clearing firms and exchanges on tail losses, seen both probabilistically and through stress testing. (In the risk-taking terminology, a stop loss is a mandatory order that attempts to terminate all or a portion of the exposure upon a trigger, a certain pre-defined nominal loss. Basel II is a generally used name for recommendations on banking laws and regulations issued by the Basel Committee on Banking Supervision. The Value at Risk, VaR, is defined as a threshold loss value K such that the probability that the loss on the portfolio over the given time horizon exceeds this value is ϵ. A stress test is an examination of the performance upon an arbitrarily set deviation in the underlying variables.) The information embedded in the choice of the constraint is, to say the least, a meaningful statistic about the appetite for risk and the shape of the desired distribution. Operators are less concerned with portfolio variations than with the drawdown they may face over a time window. Further, they are in ignorance of the joint probability distribution of the components in their portfolio (except for a vague notion of association and hedges), but can control losses organically with allocation methods based on maximum risk. (The idea of substituting variance for risk can appear very strange to practitioners of risk-taking.
The aim by Modern Portfolio Theory at lowering variance is inconsistent with the preferences of a rational investor, regardless of his risk aversion, since it also minimizes the variability in the profit domain, except in the very narrow situation of certainty about the future mean return, and in the far-fetched case where the investor can only invest in variables having a symmetric probability distribution, and/or only have a symmetric payoff. Stop losses and tail risk controls violate such symmetry.) The conventional notions of utility and variance may be used, but not directly, as information about them is embedded in the tail loss constraint. Since the stop loss, the VaR (and expected shortfall) approaches, and other risk-control methods concern only one segment of the distribution, the negative side of the loss domain, we can get a dual approach akin to a portfolio separation, or "barbell-style" construction, as the investor can have opposite stances on different parts of the return distribution. Our definition of barbell here is the mixing of two extreme properties in a portfolio, such as a linear combination of maximal conservatism for a fraction w of the portfolio, with w ∈ (0, 1), on one hand, and maximal (or high) risk on the (1 − w) remaining fraction. Historically, finance theory has had a preference for parametric, less robust, methods. The idea that a decision-maker has clear and error-free knowledge about the distribution of future payoffs has survived in spite of its lack of practical and theoretical validity; for instance, correlations are too unstable to yield precise measurements. It is an approach that is based on distributional and parametric certainties, one that may be useful for research but does not accommodate responsible risk taking. (Correlations are unstable in an unstable way, as joint returns for assets are not elliptical, see Bouchaud and Chicheportiche ( ) [ ].)
There are roughly two traditions: one based on highly parametric decision-making by the economics establishment (largely represented by Markowitz [ ]) and the other based on somewhat sparse assumptions and known as the Kelly criterion (Kelly [ ], see Bell and Cover [ ]). (In contrast to the minimum-variance approach, Kelly's method, developed around the same period as Markowitz's, requires no joint distribution or utility function. In practice one needs the ratio of expected profit to worst-case return dynamically adjusted to avoid ruin. Obviously, model error is of smaller consequence under the Kelly criterion: Thorp ( ) [ ], Haigh ( ) [ ], MacLean, Ziemba and Blazenko [ ]. For a discussion of the differences between the two approaches, see Samuelson's objection to the Kelly criterion and logarithmic sizing in Thorp [ ].) Kelly's method is also related to left-tail control due to proportional investment, which automatically reduces the portfolio in the event of losses; but the original method requires a hard, non-parametric worst-case scenario, that is, securities that have a lower bound in their variations, akin to a gamble in a casino, which is something that, in finance, can only be accomplished through binary options. The Kelly criterion, in addition, requires some precise knowledge of future returns, such as the mean. Our approach goes beyond the latter method in accommodating more uncertainty about the returns, whereby an operator can only control his left-tail via derivatives and other forms of insurance or dynamic portfolio construction based on stop-losses. (Xu, Wu, Jiang, and Song ( ) [ ] contrast mean variance to maximum entropy and use entropy to construct robust portfolios.) In a nutshell, we hardwire the curtailments on loss but otherwise assume maximal uncertainty about the returns.
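As a toy numerical aside (the bet and its parameters are invented for illustration, not from the text), the left-tail property of proportional Kelly-style sizing mentioned above can be seen in a few lines: because the stake is a fixed fraction of current wealth, losses shrink exposure, and no finite losing streak produces ruin.

```python
import random

# Kelly fraction for a bounded binary bet: win probability p, net odds b
# (win b per unit staked, lose the unit staked otherwise).
def kelly_fraction(p, b):
    return p - (1 - p) / b

random.seed(7)
p, b = 0.55, 1.0           # assumed illustrative parameters
f = kelly_fraction(p, b)   # 0.55 - 0.45/1 = 0.10

# Proportional sizing: the stake shrinks with the bankroll, so a losing
# streak reduces exposure instead of wiping out the gambler.
wealth = 1.0
for _ in range(10_000):
    stake = f * wealth
    wealth += stake * b if random.random() < p else -stake

assert wealth > 0  # ruin is impossible in finitely many proportional bets
```

Note the contrast with fixed (additive) bet sizing, under which the same losing streak can drive wealth through zero; this is the "automatic reduction of the portfolio in the event of losses" referred to above.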
More precisely, we equate the return distribution with the maximum entropy extension of constraints expressed as statistical expectations on the left-tail behavior as well as on the expectation of the return or log-return in the non-danger zone. (Note that we use Shannon entropy throughout. There are other information measures, such as Tsallis entropy [ ], a generalization of Shannon entropy, and Rényi entropy [ ], some of which may be more convenient computationally in special cases. However, Shannon entropy is the best known and has a well-developed maximization framework.) Here, the "left-tail behavior" refers to the hard, explicit, institutional constraints discussed above. We describe the shape and investigate other properties of the resulting so-called maxent distribution. In addition to a mathematical result revealing the link between acceptable tail loss (VaR) and the expected return in the Gaussian mean-variance framework, our contribution is then twofold: 1) an investigation of the shape of the distribution of returns from portfolio construction under more natural constraints than those imposed in the mean-variance method, and 2) the use of stochastic entropy to represent residual uncertainty.

VaR and CVaR methods are not error free; parametric VaR is known to be ineffective as a risk control method on its own. However, these methods can be made robust using constructions that, upon paying an insurance price, no longer depend on parametric assumptions. This can be done using derivative contracts or by organic construction (clearly if someone has % of his portfolio in numéraire securities, the risk of losing more than % is zero independent from all possible models of returns, as the fluctuations in the numéraire are not considered risky). We use "pure robustness" or both VaR and zero shortfall via the "hard stop" or insurance, which is the special case in our paper of what we called earlier a "barbell" construction.

It is worth mentioning that it is an old idea in economics that an investor can build a portfolio based on two distinct risk categories, see Hicks ( ) [ ]. Modern Portfolio Theory proposes the mutual fund theorem or "separation" theorem, namely that all investors can obtain their desired portfolio by mixing two mutual funds, one being the riskfree asset and one representing the optimal mean-variance portfolio that is tangent to their constraints; see Tobin ( ) [ ], Markowitz ( ) [ ], and the variations in Merton ( ) [ ], Ross ( ) [ ]. In our case a riskless asset is the part of the tail where risk is set to exactly zero. Note that the risky part of the portfolio needs to be minimum variance in traditional financial economics; for our method the exact opposite representation is taken for the risky one.

. . The Barbell as seen by E.T. Jaynes
Our approach to constrain only what can be constrained (in a robust manner) and to maximize entropy elsewhere echoes a remarkable insight by E.T. Jaynes in "How should we use entropy in economics?" [ ]:

"It may be that a macroeconomic system does not move in response to (or at least not solely in response to) the forces that are supposed to exist in current theories; it may simply move in the direction of increasing entropy as constrained by the conservation laws imposed by Nature and Government."
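As a numerical aside before setting up the portfolio problem (the comparison distributions are our choice for illustration), the entropy-maximizing property of the Gaussian invoked in the next section can be checked against closed-form differential entropies at a common variance:

```python
import math

sigma = 1.7  # any common standard deviation

# Closed-form differential entropies for distributions sharing variance sigma^2:
h_normal = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
h_uniform = math.log(math.sqrt(12) * sigma)            # uniform on a width-sqrt(12)*sigma interval
h_laplace = 1 + math.log(2 * sigma / math.sqrt(2))     # Laplace with scale b = sigma/sqrt(2)

# The normal maximizes entropy at a fixed mean and variance...
assert h_normal > h_uniform and h_normal > h_laplace
# ...and its entropy equals the form H = (1/2)(1 + log(2*pi*sigma^2)) used in the text.
assert math.isclose(h_normal, 0.5 * (1 + math.log(2 * math.pi * sigma**2)))
```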
Let (X₁, ..., X_m) denote m asset returns over a given single period with joint density g(x), mean returns (μ₁, ..., μ_m) and m × m covariance matrix S: S_ij = E(X_i X_j) − μ_i μ_j, 1 ≤ i, j ≤ m. Assume that the mean vector and S can be reliably estimated from data.

The return on the portfolio with weights w = (w₁, ..., w_m) is then X = Σ_{i=1}^{m} w_i X_i, which has mean and variance E(X) = w μᵀ, V(X) = w S wᵀ.

In standard portfolio theory one minimizes V(X) over all w subject to E(X) = μ for a fixed desired average return μ. Equivalently, one maximizes the expected return E(X) subject to a fixed variance V(X). In this framework variance is taken as a substitute for risk.

To draw connections with our entropy-centered approach, we consider two standard cases:

(1) Normal World: The joint distribution g(x) of asset returns is multivariate Gaussian N(μ, S). Assuming normality is equivalent to assuming g(x) has maximum (Shannon) entropy among all multivariate distributions with the given first- and second-order statistics μ and S. Moreover, for a fixed mean E(X), minimizing the variance V(X) is equivalent to minimizing the entropy (uncertainty) of X. (This is true since joint normality implies that X is univariate normal for any choice of weights, and the entropy of a N(μ, σ²) variable is H = (1/2)(1 + log(2πσ²)).) This is natural in a world with complete information. (The idea of entropy as mean uncertainty is in Philippatos and Wilson ( ) [ ]; see Zhou et al. ( ) [ ] for a review of entropy in financial economics and Georgescu-Roegen ( ) [ ] for economics in general.)

(2) Unknown Multivariate Distribution: Since we assume we can estimate the second-order structure, we can still carry out the Markowitz program, i.e., choose the portfolio weights to find an optimal mean-variance performance, which determines E(X) = μ and V(X) = σ².
However, we do not know the distribution of the return X. Observe that assuming X is normally distributed N(μ, σ²) is equivalent to assuming the entropy of X is maximized since, again, the normal maximizes entropy at a given mean and variance; see [ ]. Our strategy is to generalize the second scenario by replacing the variance σ² by two left-tail value-at-risk constraints and to model the portfolio return as the maximum entropy extension of these constraints together with a constraint on the overall performance or on the growth of the portfolio in the non-danger zone.

. . Analyzing the Constraints
Let X have probability density f(x). In everything that follows, let K < 0, ϵ ∈ (0, 1), and ν⁻ < K be fixed. The value-at-risk constraints are:

(1) Tail probability: P(X ≤ K) = ∫_{−∞}^{K} f(x) dx = ϵ.

(2) Expected shortfall (CVaR): E(X | X ≤ K) = ν⁻.

Assuming (1) holds, constraint (2) is equivalent to E(X I_{X≤K}) = ∫_{−∞}^{K} x f(x) dx = ϵν⁻.

Given the value-at-risk parameters θ = (K, ϵ, ν⁻), let Ω_var(θ) denote the set of probability densities f satisfying the two constraints. Notice that Ω_var(θ) is convex: f₁, f₂ ∈ Ω_var(θ) implies αf₁ + (1 − α)f₂ ∈ Ω_var(θ). Later we will add another constraint involving the overall mean.

Suppose we assume X is Gaussian with mean μ and variance σ². In principle it should be possible to satisfy the VaR constraints since we have two free parameters. Indeed, as shown below, the left-tail constraints determine the mean and variance; see Figure . . However, satisfying the VaR constraints imposes interesting restrictions on μ and σ and leads to a natural inequality of a "no free lunch" style.

Figure . : By setting K (the value at risk), the probability ϵ of exceeding it, and the shortfall when doing so, there is no wiggle room left under a Gaussian distribution: σ and μ are determined, which makes construction according to portfolio theory less relevant.

Let η(ϵ) be the ϵ-quantile of the standard normal distribution, i.e., η(ϵ) = Φ⁻¹(ϵ), where Φ is the c.d.f. of the standard normal density φ(x). In addition, set

B(ϵ) = (1/(ϵ η(ϵ))) φ(η(ϵ)) = (1/(√(2π) ϵ η(ϵ))) exp{−η(ϵ)²/2}.

Proposition. If X ∼ N(μ, σ²) and satisfies the two VaR constraints, then the mean and variance are given by:

μ = (ν⁻ + K B(ϵ))/(1 + B(ϵ)), σ = (K − ν⁻)/(η(ϵ)(1 + B(ϵ))).

Moreover, B(ϵ) < −1 and lim_{ϵ→0} B(ϵ) = −1.
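The Proposition can be verified numerically; here is a sketch using Python's standard-library normal distribution, with assumed illustrative parameters K, ϵ, ν⁻:

```python
from statistics import NormalDist
import math

# Assumed illustrative VaR parameters: nu_minus < K < 0.
K, eps, nu_minus = -1.0, 0.05, -1.5

std = NormalDist()                 # standard normal
eta = std.inv_cdf(eps)             # eta(eps) = Phi^{-1}(eps), negative here
B = std.pdf(eta) / (eps * eta)     # B(eps) = phi(eta(eps)) / (eps * eta(eps))

# The Proposition: mu and sigma are pinned down by the two tail constraints.
mu = (nu_minus + K * B) / (1 + B)
sigma = (K - nu_minus) / (eta * (1 + B))

X = NormalDist(mu, sigma)
# Constraint (1): P(X <= K) = eps.
assert math.isclose(X.cdf(K), eps, rel_tol=1e-9)
# Constraint (2): E(X | X <= K) = mu - sigma*phi(eta)/eps = nu_minus.
shortfall = mu - sigma * std.pdf(eta) / eps
assert math.isclose(shortfall, nu_minus, rel_tol=1e-9)
assert B < -1  # as stated in the Proposition
```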
The proof is in the Appendix. The VaR constraints lead directly to two linear equations in μ and σ:

μ + η(ϵ)σ = K, μ − η(ϵ)B(ϵ)σ = ν⁻.

Consider the conditions under which the VaR constraints allow a positive mean return μ = E(X) >
0. First, from the above linear equations in μ and σ in terms of η(ϵ) and K, we see that σ increases as ϵ increases for any fixed mean μ, and that μ > 0 if and only if σ > K/η(ϵ), i.e., we must accept a lower bound on the variance which increases with ϵ, which is a reasonable property. Second, from the expression for μ in the Proposition, we have μ > 0 if and only if |ν⁻| > K B(ϵ).

Consequently, the only way to have a positive expected return is to accommodate a sufficiently large risk expressed by the various tradeoffs among the risk parameters θ satisfying the inequality above. (This type of restriction also applies more generally to symmetric distributions since the left tail constraints impose a structure on the location and scale. For instance, in the case of a Student T distribution with scale s, location m, and tail exponent α, the same linear relation between s and m applies: s = (K − m)κ(α), where κ(α) is expressed through the inverse I⁻¹ of the regularized incomplete beta function I, and s solves ϵ = (1/2) I_{αs²/((K−m)² + αs²)}(α/2, 1/2).)

. . A Mixture of Two Normals
In many applied sciences, a mixture of two normals provides a useful and natural extension of the Gaussian itself; in finance, the Mixture Distribution Hypothesis (denoted as MDH in the literature) refers to a mixture of two normals and has been very widely investigated (see for instance Richardson and Smith ( ) [ ]). H. Geman and T. Ané ( ) [ ] exhibit how an infinite mixture of normal distributions for stock returns arises from the introduction of a "stochastic clock" accounting for the uneven arrival rate of information flow in the financial markets. In addition, option traders have long used mixtures to account for fat tails, and to examine the sensitivity of a portfolio to an increase in kurtosis ("DvegaDvol"); see Taleb ( ) [ ]. Finally, Brigo and Mercurio ( ) [ ] use a mixture of two normals to calibrate the skew in equity options.

Consider the mixture

f(x) = λ N(μ₁, σ₁²) + (1 − λ) N(μ₂, σ₂²).

An intuitively simple and appealing case is to fix the overall mean μ, and take λ = ϵ and μ₁ = ν⁻, in which case μ₂ is constrained to be (μ − ϵν⁻)/(1 − ϵ). It then follows that the left-tail constraints are approximately satisfied for σ₁, σ₂ sufficiently small. Indeed, when σ₁ = σ₂ ≈
0, the density is effectively composed of two spikes (small variance normals), with the left one centered at ν⁻ and the right one centered at (μ − ϵν⁻)/(1 − ϵ). The extreme case is a Dirac function on the left, as we see next.

Dynamic Stop Loss, A Brief Comment
One can set a level K below which there is no mass, with results that depend on the accuracy of the execution of such a stop. The distribution to the right of the stop-loss no longer looks like the standard Gaussian, as it builds positive skewness in accordance to the distance of the stop from the mean. We limit any further discussion to the illustrations in Figure . .
Figure . : A dynamic stop loss acts as an absorbing barrier, with a Dirac function at the executed stop.
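A minimal simulation sketch of this comment (assuming, unrealistically, perfect execution of the stop, and illustrative parameters of our choosing):

```python
import random
import statistics

random.seed(42)
K = -1.0             # assumed stop level, in return units
mu, sigma = 0.0, 1.0

# One-period Gaussian returns with a perfectly executed stop at K: any outcome
# below K is terminated exactly at K, piling a point mass (the "Dirac") there.
stopped = [max(random.gauss(mu, sigma), K) for _ in range(100_000)]

share_at_K = sum(1 for r in stopped if r == K) / len(stopped)
mean_r = statistics.mean(stopped)

assert share_at_K > 0.1   # roughly P(X <= K) = Phi(-1), about 0.16
assert mean_r > mu        # truncating the left side builds positive skewness
```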
From the comments and analysis above, it is clear that, in practice, the density f of the return X is unknown; in particular, no theory provides it. Assume we can adjust the portfolio parameters to satisfy the VaR constraints, and perhaps another constraint on the expected value of some function of X (e.g., the overall mean). We then wish to compute probabilities and expectations of interest, for example P(X > 0), or the probability of losing more than 2K, or the expected return given X > 0. One strategy is to make such estimates and predictions under the most unpredictable circumstances consistent with the constraints. That is, use the maximum entropy extension (MEE) of the constraints as a model for f(x).

The "differential entropy" of f is h(f) = −∫ f(x) ln f(x) dx. (In general, the integral may not exist.) Entropy is concave on the space of densities for which it is defined. In general, the MEE is defined as

f_MEE = argmax_{f ∈ Ω} h(f),

where Ω is the space of densities which satisfy a set of constraints of the form E φ_j(X) = c_j, j = 1, ..., J. Assuming Ω is non-empty, it is well known that f_MEE is unique and (away from the boundary of feasibility) is an exponential distribution in the constraint functions, i.e., is of the form

f_MEE(x) = C⁻¹ exp( Σ_j λ_j φ_j(x) ),

where C = C(λ₁, ..., λ_J) is the normalizing constant. (This form comes from differentiating an appropriate functional J(f) based on entropy, forcing the integral to be unity, and imposing the constraints with Lagrange multipliers.) In the special cases below we use this characterization to find the MEE for our constraints.
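As a small worked instance of this characterization (the comparison density is our choice for illustration): with the single constraint E(X) = c on (0, ∞), the MEE is the exponential density with mean c, and it indeed dominates in entropy an alternative density with the same mean.

```python
import math

c = 2.0  # fixed mean constraint E(X) = c on (0, infinity)

def entropy(f, lo, hi, n=200_000):
    # crude midpoint Riemann sum for h(f) = -integral of f ln f
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        fx = f(lo + (i + 0.5) * dx)
        if fx > 0:
            total -= fx * math.log(fx) * dx
    return total

exp_pdf = lambda x: math.exp(-x / c) / c                     # MEE: Exp with mean c
gamma_pdf = lambda x: (4 / c**2) * x * math.exp(-2 * x / c)  # Gamma(k=2, theta=c/2), same mean c

h_exp = entropy(exp_pdf, 0.0, 60.0)
h_gamma = entropy(gamma_pdf, 0.0, 60.0)

assert math.isclose(h_exp, 1 + math.log(c), abs_tol=1e-3)  # closed form 1 + ln c
assert h_exp > h_gamma  # the exponential is the MEE under a mean constraint
```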
In our case we want to maximize entropy subject to the VaR constraints together with any others we might impose. Indeed, the VaR constraints alone do not admit an MEE since they do not restrict the density f(x) for x > K. The entropy can be made arbitrarily large by allowing f to be identically (1 − ϵ)/(N − K) over K < x < N and letting N → ∞. Suppose, however, that we have adjoined one or more constraints on the behavior of f which are compatible with the VaR constraints in the sense that the set of densities Ω satisfying all the constraints is non-empty. Here Ω would depend on the VaR parameters θ = (K, ϵ, ν⁻) together with those parameters associated with the additional constraints.

. . Case A: Constraining the Global Mean
The simplest case is to add a constraint on the mean return, i.e., fix E(X) = μ. Since E(X) = P(X ≤ K) E(X | X ≤ K) + P(X > K) E(X | X > K), adding the mean constraint is equivalent to adding the constraint E(X | X > K) = ν⁺ where ν⁺ satisfies ϵν⁻ + (1 − ϵ)ν⁺ = μ.

Define

f⁻(x) = (1/(K − ν⁻)) exp[−(K − x)/(K − ν⁻)] if x < K, and 0 if x ≥ K,

and

f⁺(x) = (1/(ν⁺ − K)) exp[−(x − K)/(ν⁺ − K)] if x > K, and 0 if x ≤ K.

It is easy to check that both f⁻ and f⁺ integrate to one. Then

f_MEE(x) = ϵ f⁻(x) + (1 − ϵ) f⁺(x)

is the MEE of the three constraints. First, evidently:

1. ∫_{−∞}^{K} f_MEE(x) dx = ϵ;
2. ∫_{−∞}^{K} x f_MEE(x) dx = ϵν⁻;
3. ∫_{K}^{∞} x f_MEE(x) dx = (1 − ϵ)ν⁺.

Hence the constraints are satisfied. Second, f_MEE has an exponential form in our constraint functions:

f_MEE(x) = C⁻¹ exp[−(λ₁x + λ₂ I(x ≤ K) + λ₃ x I(x ≤ K))].

The shape of f⁻ depends on the relationship between K and the expected shortfall ν⁻. The closer ν⁻ is to K, the more rapidly the tail falls off. As ν⁻ → K, f⁻ converges to a unit spike at x = K (Figures . and . ).
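The three constraints can be checked by direct numerical integration; a sketch with assumed illustrative parameters (ν⁻ < K < 0 < ν⁺):

```python
import math

K, eps, nu_minus, nu_plus = -1.0, 0.05, -1.5, 0.5

def f_minus(x):  # exponential left piece, supported on x < K, mean nu_minus
    return math.exp(-(K - x) / (K - nu_minus)) / (K - nu_minus) if x < K else 0.0

def f_plus(x):   # exponential right piece, supported on x > K, mean nu_plus
    return math.exp(-(x - K) / (nu_plus - K)) / (nu_plus - K) if x > K else 0.0

f_mee = lambda x: eps * f_minus(x) + (1 - eps) * f_plus(x)

def integrate(g, lo, hi, n=200_000):  # midpoint rule
    dx = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * dx) for i in range(n)) * dx

p_tail = integrate(f_mee, -40, K)                       # constraint 1
m_tail = integrate(lambda x: x * f_mee(x), -40, K)      # constraint 2
m_right = integrate(lambda x: x * f_mee(x), K, 40)      # constraint 3

assert math.isclose(p_tail, eps, abs_tol=1e-6)
assert math.isclose(m_tail, eps * nu_minus, abs_tol=1e-5)
assert math.isclose(m_right, (1 - eps) * nu_plus, abs_tol=1e-5)
```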
Figure . : Case A: Effect of different values of ϵ on the shape of the distribution.
Figure . : Case A: Effect of different values of ν⁻ on the shape of the distribution.

. . Case B: Constraining the Absolute Mean
If instead we constrain the absolute mean, namely

E|X| = ∫ |x| f(x) dx = μ,

then the MEE is somewhat less apparent but can still be found. Define f⁻(x) as above, and let

f⁺(x) = (λ/(2 − exp(λK))) exp(−λ|x|) if x ≥ K, and 0 if x < K.

Then λ can be chosen such that

ϵν⁻ + (1 − ϵ) ∫_{K}^{∞} |x| f⁺(x) dx = μ.

. . Case C: Power Laws for the Right Tail
If we believe that actual returns have "fat tails," in particular that the right tail decays as a power law rather than exponentially (as with a normal or exponential density), then we can add this constraint to the VaR constraints instead of working with the mean or absolute mean. In view of the exponential form of the MEE, the density f⁺(x) will have a power law, namely

f⁺(x) = (1/C(α)) (1 + |x|)^(−(1+α)), x ≥ K,

for α > 0, where the adjoined constraint is E(log(1 + |X|) | X > K) = A.

Moreover, again from the MEE theory, we know that the parameter is obtained by minimizing the logarithm of the normalizing function. In this case, it is easy to show that

C(α) = ∫_{K}^{∞} (1 + |x|)^(−(1+α)) dx = (1/α)(2 − (1 − K)^(−α)).

It follows that A and α satisfy the equation

A = 1/α − log(1 − K)/(2(1 − K)^α − 1),

which can be read as determining α for a given A or, alternatively, as determining the constraint value A necessary to obtain a particular power law α.

The final MEE extension of the VaR constraints together with the constraint on the log of the return is then:

f_MEE(x) = ϵ I(x ≤ K) (1/(K − ν⁻)) exp[−(K − x)/(K − ν⁻)] + (1 − ϵ) I(x > K) (1 + |x|)^(−(1+α))/C(α)

(see Figures . and . ).

. . Extension to a Multi-Period Setting: A Comment
Consider the behavior in multi-periods. Using a naive approach, we sum up the performance as if there were no response to previous returns. We can see how Case A approaches the regular Gaussian, but not Case C (Figure . ).

Figure . : Case C: Effect of different values of α on the shape of the fat-tailed maximum entropy distribution.
Figure . : Case C: Effect of different values of α on the shape of the fat-tailed maximum entropy distribution (closer K).
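The aggregation claim for Case A can be sketched with a quick Monte Carlo (illustrative parameters assumed): draws from f_MEE are generated as a mixture of shifted exponentials on either side of K, and the n-period average concentrates at the overall mean ϵν⁻ + (1 − ϵ)ν⁺.

```python
import random
import statistics

random.seed(1)
K, eps, nu_minus, nu_plus = -1.0, 0.05, -1.5, 0.5

def draw_case_a():
    # With probability eps, the left exponential piece (mean nu_minus);
    # otherwise the right exponential piece (mean nu_plus).
    if random.random() < eps:
        return K - (K - nu_minus) * random.expovariate(1.0)
    return K + (nu_plus - K) * random.expovariate(1.0)

n, paths = 252, 2_000
averages = [statistics.mean(draw_case_a() for _ in range(n)) for _ in range(paths)]

target = eps * nu_minus + (1 - eps) * nu_plus  # location of the limiting Dirac
assert abs(statistics.mean(averages) - target) < 0.02
assert statistics.stdev(averages) < 0.15  # averages concentrate (law of large numbers)
```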
For Case A the characteristic function, being that of a mixture of two exponentials shifted to K, can be written:

Ψ_A(t) = e^{iKt} [ ϵ/(1 + it(K − ν⁻)) + (1 − ϵ)/(1 − it(ν⁺ − K)) ].

So we can derive from convolutions that the function Ψ_A(t)ⁿ converges to that of an n-summed Gaussian. Further, the characteristic function of the limit of the average of strategies, namely

lim_{n→∞} Ψ_A(t/n)ⁿ = e^{it(ν⁺ + ϵ(ν⁻ − ν⁺))}, ( . )

is the characteristic function of the Dirac delta, visibly the effect of the law of large numbers delivering the same result as the Gaussian with mean ν⁺ + ϵ(ν⁻ − ν⁺).

As to the power law in Case C, convergence to Gaussian only takes place for α ≥
2, and rather slowly.

Figure . : Average return for multiperiod naive strategy for Case A, that is, assuming independence of "sizing," as position size does not depend on past performance. They aggregate nicely to a standard Gaussian and (as shown in Equation ( . )) shrink to a Dirac at the mean value.

We note that the stop loss plays a larger role in determining the stochastic properties than the portfolio composition. Simply, the stop is not triggered by individual components, but by variations in the total portfolio. This frees the analysis from focusing on individual portfolio components when the tail, via derivatives or organic construction, is all we know and can control.

To conclude, most papers dealing with entropy in the mathematical finance literature have used minimization of entropy as an optimization criterion. For instance, Fritelli ( ) [ ] exhibits the unicity of a "minimal entropy martingale measure" under some conditions and shows that minimization of entropy is equivalent to maximizing the expected exponential utility of terminal wealth. We have, instead, and outside any utility criterion, proposed entropy maximization as the recognition of the uncertainty of asset distributions. Under VaR and Expected Shortfall constraints, we obtain in full generality a "barbell portfolio" as the optimal solution, extending to a very general setting the approach of the two-fund separation theorem.

Proof of the Proposition: Since X ∼ N(μ, σ²), the tail probability constraint is

ϵ = P(X < K) = P(Z < (K − μ)/σ) = Φ((K − μ)/σ).

By definition, Φ(η(ϵ)) = ϵ. Hence,

K = μ + η(ϵ)σ. ( . )
For the shortfall constraint,

E(X; X < K) = ∫_{−∞}^{K} x (1/(√(2π)σ)) exp(−(x − μ)²/(2σ²)) dx
= μϵ + σ ∫_{−∞}^{(K−μ)/σ} x φ(x) dx
= μϵ − (σ/√(2π)) exp(−(K − μ)²/(2σ²)).

Since E(X; X < K) = ϵν⁻, and from the definition of B(ϵ), we obtain

ν⁻ = μ − η(ϵ)B(ϵ)σ. ( . )

Solving ( . ) and ( . ) for μ and σ gives the expressions in the Proposition.

Finally, by symmetry to the "upper tail inequality" of the standard normal, we have, for x < 0, Φ(x) ≤ φ(x)/(−x). Choosing x = η(ϵ) = Φ⁻¹(ϵ) yields ϵ = P(X < η(ϵ)) ≤ −ϵB(ϵ), or 1 + B(ϵ) ≤ 0. Since the upper tail inequality is asymptotically exact as x → ∞, we have B(0) = −1, which concludes the proof.
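The stated behavior of B(ϵ), namely B(ϵ) < −1 with B(ϵ) → −1 as ϵ → 0, can also be checked numerically (a standard-library sketch):

```python
from statistics import NormalDist

std = NormalDist()

def B(eps):
    eta = std.inv_cdf(eps)            # eps-quantile of the standard normal
    return std.pdf(eta) / (eps * eta)

# B(eps) < -1 for eps in the left tail...
for eps in (0.2, 0.1, 0.05, 0.01):
    assert B(eps) < -1

# ...and B(eps) approaches -1 as eps -> 0, the upper-tail inequality
# becoming asymptotically exact.
assert abs(B(1e-10) + 1) < 0.05
```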
BIBLIOGRAPHY AND INDEX
BIBLIOGRAPHY

[ ] Inmaculada B. Aban, Mark M. Meerschaert, and Anna K. Panorska. Parameter estimation for the truncated pareto distribution. Journal of the American Statistical Association, ( ): – , .
[ ] Thierry Ané and Hélyette Geman. Order flow, transaction clock, and normality of asset returns. The Journal of Finance, ( ): – , .
[ ] Kenneth J. Arrow, Robert Forsythe, Michael Gorham, Robert Hahn, Robin Hanson, John O. Ledyard, Saul Levmore, Robert Litan, Paul Milgrom, Forrest D. Nelson, et al. The promise of prediction markets. Science, ( ): , .
[ ] Marco Avellaneda, Craig Friedman, Richard Holmes, and Dominick Samperi. Calibrating volatility surfaces via relative-entropy minimization. Applied Mathematical Finance, ( ): – , .
[ ] L. Bachelier. Theory of speculation. In: P. Cootner, ed., The Random Character of Stock Market Prices. MIT Press, Cambridge, Mass., .
[ ] Louis Bachelier. Théorie de la spéculation. Gauthier-Villars, .
[ ] Kevin P. Balanda and H.L. MacGillivray. Kurtosis: a critical review. The American Statistician, ( ): – , .
[ ] August A. Balkema and Laurens De Haan. Residual life time at great age. The Annals of Probability, pages – , .
[ ] August A. Balkema and Laurens De Haan. Limit distributions for order statistics. I. Theory of Probability & Its Applications, ( ): – , .
[ ] August A. Balkema and Laurens de Haan. Limit distributions for order statistics. II. Theory of Probability & Its Applications, ( ): – , .
[ ] Shaul K. Bar-Lev, Idit Lavi, and Benjamin Reiser. Bayesian inference for the power law process. Annals of the Institute of Statistical Mathematics, ( ): – , .
[ ] Nicholas Barberis. The psychology of tail events: Progress and challenges. American Economic Review, ( ): – , .
[ ] Jonathan Baron. Thinking and Deciding, th ed. Cambridge University Press, .
[ ] Norman C. Beaulieu, Adnan A. Abu-Dayya, and Peter J. McLane. Estimating the distribution of a sum of independent lognormal random variables.
Communications, IEEE Transactions on, ( ): , .
[ ] Robert M. Bell and Thomas M. Cover. Competitive optimality of logarithmic investment. Mathematics of Operations Research, ( ): – , .
[ ] Shlomo Benartzi and Richard Thaler. Heuristics and biases in retirement savings behavior. Journal of Economic Perspectives, ( ): – , .
[ ] Shlomo Benartzi and Richard H. Thaler. Myopic loss aversion and the equity premium puzzle. The Quarterly Journal of Economics, ( ): – , .
[ ] Shlomo Benartzi and Richard H. Thaler. Naive diversification strategies in defined contribution saving plans. American Economic Review, ( ): – , .
[ ] Sergei Natanovich Bernshtein. Sur la loi des grands nombres. Communications de la Société mathématique de Kharkow, ( ): – , .
[ ] Patrick Billingsley. Probability and Measure. John Wiley & Sons, .
[ ] Patrick Billingsley. Convergence of Probability Measures. John Wiley & Sons, .
[ ] Nicholas H. Bingham, Charles M. Goldie, and Jef L. Teugels. Regular Variation, volume . Cambridge University Press, .
[ ] Giulio Biroli, J.-P. Bouchaud, and Marc Potters. On the top eigenvalue of heavy-tailed random matrices. EPL (Europhysics Letters), ( ): , .
[ ] Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. : – , May–June .
[ ] Fischer Black and Myron Scholes. The pricing of options and corporate liabilities. The Journal of Political Economy, pages – , .
[ ] A.J. Boness. Elements of a theory of stock-option value. : – , .
[ ] Jean-Philippe Bouchaud, Marc Mézard, Marc Potters, et al. Statistical properties of stock order books: empirical results and models. Quantitative Finance, ( ): – , .
[ ] Jean-Philippe Bouchaud and Marc Potters. Theory of Financial Risk and Derivative Pricing: From Statistical Physics to Risk Management. Cambridge University Press, .
[ ] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages – . Springer, .
[ ] George Bragues.
Prediction markets: The practical and normative possibilities for the social production of knowledge. Episteme, ( ): – , .
[ ] D.T. Breeden and R.H. Litzenberger. Prices of state-contingent claims implicit in option prices. : – , .
[ ] Douglas T. Breeden and Robert H. Litzenberger. Prices of state-contingent claims implicit in option prices. Journal of Business, pages – , .
[ ] Henry Brighton and Gerd Gigerenzer. Homo heuristicus and the bias–variance dilemma. In Action, Perception and the Brain, pages – . Springer, .
[ ] Damiano Brigo and Fabio Mercurio. Lognormal-mixture dynamics and calibration to market volatility smiles. International Journal of Theoretical and Applied Finance, ( ): – , .
[ ] Peter Carr. Bounded brownian motion. NYU Tandon School of Engineering, .
[ ] Peter Carr, Hélyette Geman, Dilip B. Madan, and Marc Yor. Stochastic volatility for Lévy processes. Mathematical Finance, ( ): – , .
[ ] Peter Carr and Dilip Madan. Optimal positioning in derivative securities. .
[ ] Lars-Erik Cederman. Modeling the size of wars: from billiard balls to sandpiles. American Political Science Review, ( ): – , .
[ ] Bikas K. Chakrabarti, Anirban Chakraborti, Satya R. Chakravarty, and Arnab Chatterjee. Econophysics of Income and Wealth Distributions. Cambridge University Press, .
[ ] David G. Champernowne. A model of income distribution. The Economic Journal, ( ): – , .
[ ] Shaohua Chen, Hong Nie, and Benjamin Ayers-Glassey. Lognormal sum approximation with a variant of type IV Pearson distribution. IEEE Communications Letters, ( ), .
[ ] Rémy Chicheportiche and Jean-Philippe Bouchaud. The joint distribution of stock returns is not elliptical. International Journal of Theoretical and Applied Finance, ( ), .
[ ] V.P. Chistyakov. A theorem on sums of independent positive random variables and its applications to branching random processes. Theory of Probability & Its Applications, ( ): – , .
[ ] Pasquale Cirillo. Are your data really Pareto distributed?
Physica A: Statistical Mechanics and its Applications, ( ): – , .
[ ] Pasquale Cirillo and Nassim Nicholas Taleb. Expected shortfall estimation for apparently infinite-mean models of operational risk. Quantitative Finance, pages – , .
[ ] Pasquale Cirillo and Nassim Nicholas Taleb. On the statistical properties and tail risk of violent conflicts. Physica A: Statistical Mechanics and its Applications, : – , .
[ ] Pasquale Cirillo and Nassim Nicholas Taleb. What are the chances of war? Significance, ( ): – , .
[ ] Pasquale Cirillo and Nassim Nicholas Taleb. Tail risk of contagious diseases. Nature Physics, .
[ ] Open Science Collaboration et al. Estimating the reproducibility of psychological science. Science, ( ): aac , .
[ ] Rama Cont and Peter Tankov. Financial Modelling with Jump Processes, volume . CRC Press, .
[ ] Harald Cramér. On the Mathematical Theory of Risk. Centraltryckeriet, .
[ ] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, ( ): – , .
[ ] Camilo Dagum. Inequality measures between income distributions with applications. Econometrica, ( ): – , .
[ ] Camilo Dagum. Income distribution models. Wiley Online Library, .
[ ] Anirban DasGupta. Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Science & Business Media, .
[ ] Herbert A David and Haikady N Nagaraja. Order statistics. .
[ ] Bruno De Finetti. Probability, induction, and statistics. .
[ ] Bruno De Finetti. Philosophical Lectures on Probability: Collected, Edited, and Annotated by Alberto Mura, volume . Springer Science & Business Media, .
[ ] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications, volume . Springer Science & Business Media, .
[ ] Kresimir Demeterfi, Emanuel Derman, Michael Kamal, and Joseph Zou. A guide to volatility and variance swaps.
The Journal of Derivatives, ( ): – , .
[ ] Kresimir Demeterfi, Emanuel Derman, Michael Kamal, and Joseph Zou. More than you ever wanted to know about volatility swaps. Working paper, Goldman Sachs, .
[ ] Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the /n portfolio strategy? The Review of Financial Studies, ( ): – , .
[ ] E. Derman and N. Taleb. The illusion of dynamic delta replication. Quantitative Finance, ( ): – , .
[ ] Emanuel Derman. The perception of time, risk and return during periods of speculation. Working paper, Goldman Sachs, .
[ ] Marco Di Renzo, Fabio Graziosi, and Fortunato Santucci. Further results on the approximation of log-normal power sum via Pearson type IV distribution: a general formula for log-moments computation. IEEE Transactions on Communications, ( ), .
[ ] Persi Diaconis and David Freedman. On the consistency of Bayes estimates. The Annals of Statistics, pages – , .
[ ] Persi Diaconis and Sandy Zabell. Closed form summation for classical distributions: variations on a theme of De Moivre. Statistical Science, pages – , .
[ ] Cornelius Frank Dietrich. Uncertainty, Calibration and Probability: The Statistics of Scientific and Industrial Measurement. Routledge, .
[ ] NIST Digital Library of Mathematical Functions. http://dlmf.nist.gov/, Release . . of - - . F. W. J. Olver, A. B. Olde Daalhuis, D. W. Lozier, B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller and B. V. Saunders, eds.
[ ] Daniel Dufresne. Sums of lognormals. In Proceedings of the rd Actuarial Research Conference. University of Regina, .
[ ] Daniel Dufresne et al. The log-normal approximation in financial and other computations. Advances in Applied Probability, ( ): – , .
[ ] Bruno Dupire. Pricing with a smile. ( ), .
[ ] Bruno Dupire. Exotic option pricing by calibration on volatility smiles. In Advanced Mathematics for Derivatives: Risk Magazine Conference, .
[ ] Bruno Dupire et al.
Pricing with a smile. Risk, ( ): – , .
[ ] Danny Dyer. Structural probability bounds for the strong Pareto law. Canadian Journal of Statistics, ( ): – , .
[ ] Iddo Eliazar. Inequality spectra. Physica A: Statistical Mechanics and its Applications, : – , .
[ ] Iddo Eliazar. Lindy's law. Physica A: Statistical Mechanics and its Applications, : – , .
[ ] Iddo Eliazar and Morrel H Cohen. On social inequality: Analyzing the rich–poor disparity. Physica A: Statistical Mechanics and its Applications, : – , .
[ ] Iddo Eliazar and Igor M Sokolov. Maximization of statistical heterogeneity: From Shannon's entropy to Gini's index. Physica A: Statistical Mechanics and its Applications, ( ): – , .
[ ] Iddo I Eliazar and Igor M Sokolov. Gini characterization of extreme-value statistics. Physica A: Statistical Mechanics and its Applications, ( ): – , .
[ ] Iddo I Eliazar and Igor M Sokolov. Measuring statistical evenness: A panoramic overview. Physica A: Statistical Mechanics and its Applications, ( ): – , .
[ ] Paul Embrechts. Modelling Extremal Events: For Insurance and Finance, volume . Springer, .
[ ] Paul Embrechts and Charles M Goldie. On convolution tails. Stochastic Processes and their Applications, ( ): – , .
[ ] Paul Embrechts, Charles M Goldie, and Noël Veraverbeke. Subexponentiality and infinite divisibility. Probability Theory and Related Fields, ( ): – , .
[ ] M Émile Borel. Les probabilités dénombrables et leurs applications arithmétiques. Rendiconti del Circolo Matematico di Palermo ( - ), ( ): – , .
[ ] Michael Falk et al. On testing the extreme value index via the POT method. The Annals of Statistics, ( ): – , .
[ ] Michael Falk, Jürg Hüsler, and Rolf-Dieter Reiss. Laws of Small Numbers: Extremes and Rare Events. Springer Science & Business Media, .
[ ] Kai-Tai Fang. Elliptically contoured distributions. Encyclopedia of Statistical Sciences, .
[ ] Doyne James Farmer and John Geanakoplos.
Hyperbolic discounting is rational: Valuing the far future with uncertain discount rates. .
[ ] J Doyne Farmer and John Geanakoplos. Power laws in economics and elsewhere. In Santa Fe Institute, .
[ ] William Feller. An Introduction to Probability Theory and Its Applications, vol. .
[ ] William Feller. An introduction to probability theory. .
[ ] Baruch Fischhoff, John Kadvany, and John David Kadvany. Risk: A Very Short Introduction. Oxford University Press, .
[ ] Ronald Aylmer Fisher and Leonard Henry Caleb Tippett. Limiting forms of the frequency distribution of the largest or smallest member of a sample. In Mathematical Proceedings of the Cambridge Philosophical Society, volume , pages – . Cambridge University Press, .
[ ] Andrea Fontanari, Pasquale Cirillo, and Cornelis W Oosterlee. From concentration profiles to concentration maps. New tools for the study of loss distributions. Insurance: Mathematics and Economics, : – , .
[ ] Shane Frederick, George Loewenstein, and Ted O'Donoghue. Time discounting and time preference: A critical review. Journal of Economic Literature, ( ): – , .
[ ] David A Freedman. Notes on the Dutch book argument. , .
[ ] Marco Frittelli. The minimal entropy martingale measure and the valuation problem in incomplete markets. Mathematical Finance, ( ): – , .
[ ] Xavier Gabaix. Power laws in economics and finance. Technical report, National Bureau of Economic Research, .
[ ] Xavier Gabaix. Power laws in economics: An introduction. Journal of Economic Perspectives, ( ): – , .
[ ] Armengol Gasull, Maria Jolis, and Frederic Utzet. On the norming constants for normal maxima. Journal of Mathematical Analysis and Applications, ( ): – , .
[ ] Jim Gatheral. The Volatility Surface: A Practitioner's Guide. John Wiley & Sons, .
[ ] Jim Gatheral. The Volatility Surface: A Practitioner's Guide. New York: John Wiley & Sons, .
[ ] Oscar Gelderblom and Joost Jonker. Amsterdam as the cradle of modern futures and options trading, - .
William Goetzmann and K. Geert Rouwenhorst, .
[ ] Andrew Gelman and Hal Stern. The difference between "significant" and "not significant" is not itself statistically significant. The American Statistician, ( ): – , .
[ ] Donald Geman, Hélyette Geman, and Nassim Nicholas Taleb. Tail risk constraints and maximum entropy. Entropy, ( ): , .
[ ] Nicholas Georgescu-Roegen. The entropy law and the economic process, . Cambridge, Mass, .
[ ] Gerd Gigerenzer and Daniel G Goldstein. Reasoning the fast and frugal way: models of bounded rationality. Psychological Review, ( ): , .
[ ] Gerd Gigerenzer and Peter M Todd. Simple Heuristics That Make Us Smart. Oxford University Press, New York, .
[ ] Corrado Gini. Variabilità e mutabilità. Reprinted in Memorie di metodologica statistica (Ed. Pizetti E, Salvemini, T). Rome: Libreria Eredi Virgilio Veschi, .
[ ] BV Gnedenko and AN Kolmogorov. Limit Distributions for Sums of Independent Random Variables ( ).
[ ] Charles M Goldie. Subexponential distributions and dominated-variation tails. Journal of Applied Probability, pages – , .
[ ] Daniel Goldstein and Nassim Taleb. We don't quite know what we are talking about when we talk about volatility. Journal of Portfolio Management, ( ), .
[ ] Richard C Green, Robert A Jarrow, et al. Spanning and completeness in markets with contingent claims. Journal of Economic Theory, ( ): – , .
[ ] Emil Julius Gumbel. Statistics of extremes. .
[ ] Laurens Haan and Ana Ferreira. Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering, .
[ ] Wolfgang Hafner and Heinz Zimmermann. Amazing discovery: Vincenz Bronzin's option pricing models. : – , .
[ ] Torben Hagerup and Christine Rüb. A guided tour of Chernoff bounds. Information Processing Letters, ( ): – , .
[ ] John Haigh. The Kelly criterion and bet comparisons in spread betting. Journal of the Royal Statistical Society: Series D (The Statistician), ( ): – , .
[ ] Peter Hall.
On the rate of convergence of normal extremes. Journal of Applied Probability, ( ): – , .
[ ] Mahmoud Hamada and Emiliano A Valdez. CAPM and option pricing with elliptically contoured distributions. Journal of Risk and Insurance, ( ): – , .
[ ] Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities. Cambridge University Press, .
[ ] J Michael Harrison and David M Kreps. Martingales and arbitrage in multiperiod securities markets. Journal of Economic Theory, ( ): – , .
[ ] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, .
[ ] Espen G. Haug. Derivatives: Models on Models. New York: John Wiley & Sons, .
[ ] Espen Gaarder Haug and Nassim Nicholas Taleb. Option traders use (very) sophisticated heuristics, never the Black–Scholes–Merton formula. Journal of Economic Behavior & Organization, ( ): – , .
[ ] Friedrich August Hayek. The use of knowledge in society. The American Economic Review, ( ): – , .
[ ] John R Hicks. Value and Capital, volume . Clarendon Press, Oxford, .
[ ] Leonard R. Higgins. The Put-and-Call. London: E. Wilson, .
[ ] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, ( ): – , .
[ ] P. J. Huber. Robust Statistics. Wiley, New York, .
[ ] HM James Hung, Robert T O'Neill, Peter Bauer, and Karl Kohne. The behavior of the p-value when the alternative hypothesis is true. Biometrics, pages – , .
[ ] Rob J Hyndman and Anne B Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, ( ): – , .
[ ] E.T. Jaynes. How should we use entropy in economics? .
[ ] Johan Ludwig William Valdemar Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, ( ): – , .
[ ] Anders Hedegaard Jessen and Thomas Mikosch. Regularly varying functions.
Publications de l'Institut Mathématique, ( ): – , .
[ ] Petr Jizba, Hagen Kleinert, and Mohammad Shefaat. Rényi's information transfer between financial time series. Physica A: Statistical Mechanics and its Applications, ( ): – , .
[ ] Valen E Johnson. Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, ( ): – , .
[ ] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision under risk. Econometrica, ( ): – , .
[ ] Joseph P Kairys Jr and Nicholas Valerio III. The market for equity options in the s. The Journal of Finance, ( ): – , .
[ ] Ioannis Karatzas and Steven E Shreve. Brownian Motion and Stochastic Calculus. Springer-Verlag, New York, .
[ ] John L Kelly. A new interpretation of information rate. Information Theory, IRE Transactions on, ( ): – , .
[ ] Gideon Keren. Calibration and probability judgements: Conceptual and methodological issues. Acta Psychologica, ( ): – , .
[ ] Christian Kleiber and Samuel Kotz. Statistical Size Distributions in Economics and Actuarial Sciences, volume . John Wiley & Sons, .
[ ] Andrei Nikolaevich Kolmogorov. On logical foundations of probability theory. In Probability Theory and Mathematical Statistics, pages – . Springer, .
[ ] Andrey Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., : – , .
[ ] Samuel Kotz and Norman Johnson. Encyclopedia of Statistical Sciences. Wiley, .
[ ] VV Kozlov, T Madsen, and AA Sorokin. Weighted means of weakly dependent random variables. Moscow University Mathematics Bulletin (c/c of Vestnik-Moskovskii Universitet Mathematika), ( ): , .
[ ] Jean Laherrere and Didier Sornette. Stretched exponential distributions in nature and economy: "fat tails" with characteristic scales. The European Physical Journal B-Condensed Matter and Complex Systems, ( ): – , .
[ ] David Laibson. Golden eggs and hyperbolic discounting.
The Quarterly Journal of Economics, ( ): – , .
[ ] Deli Li, M Bhaskara Rao, and RJ Tomkins. The law of the iterated logarithm and central limit theorem for L-statistics. Technical report, Pennsylvania State University, University Park, Center for Multivariate Analysis, .
[ ] Sarah Lichtenstein, Baruch Fischhoff, and Lawrence D Phillips. Calibration of probabilities: The state of the art. In Decision Making and Change in Human Affairs, pages – . Springer, .
[ ] Sarah Lichtenstein, Paul Slovic, Baruch Fischhoff, Mark Layman, and Barbara Combs. Judged frequency of lethal events. Journal of Experimental Psychology: Human Learning and Memory, ( ): , .
[ ] Michel Loève. Probability Theory. Foundations. Random Sequences. New York: D. Van Nostrand Company, .
[ ] Filip Lundberg. I. Approximerad framställning af sannolikhetsfunktionen. II. Återförsäkring af kollektivrisker. Akademisk afhandling... af Filip Lundberg,...
Almqvist och Wiksells boktryckeri, .
[ ] HL MacGillivray and Kevin P Balanda. Mixtures, myths and kurtosis. Communications in Statistics-Simulation and Computation, ( ): – , .
[ ] LC MacLean, William T Ziemba, and George Blazenko. Growth versus security in dynamic investment analysis. Management Science, ( ): – , .
[ ] Dhruv Madeka. Accurate prediction of electoral outcomes. arXiv preprint arXiv: . , .
[ ] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M competition: Results, findings, conclusion and way forward. International Journal of Forecasting, ( ): – , .
[ ] Spyros Makridakis and Nassim Taleb. Decision making and planning under low levels of predictability, .
[ ] Benoit Mandelbrot. A note on a class of skew distribution functions: Analysis and critique of a paper by HA Simon. Information and Control, ( ): – , .
[ ] Benoit Mandelbrot. The Pareto-Lévy law and the distribution of income. International Economic Review, ( ): – , .
[ ] Benoit Mandelbrot. The stable Paretian income distribution when the apparent exponent is near two. International Economic Review, ( ): – , .
[ ] Benoit B Mandelbrot. New methods in statistical economics. In Fractals and Scaling in Finance, pages – . Springer, .
[ ] Benoît B Mandelbrot and Nassim Nicholas Taleb. Random jump, not random walk, .
[ ] Harry Markowitz. Portfolio selection. The Journal of Finance, ( ): – , .
[ ] Harry M Markowitz. Portfolio Selection: Efficient Diversification of Investments, volume . Wiley, .
[ ] Ricardo A Maronna, Douglas Martin, and Victor Yohai. Robust Statistics. John Wiley & Sons, Chichester. ISBN, .
[ ] R. Mehra and E. C. Prescott. The equity premium: a puzzle. Journal of Monetary Economics, : – , .
[ ] Robert C Merton. An analytic derivation of the efficient portfolio frontier. Journal of Financial and Quantitative Analysis, ( ): – , .
[ ] Robert C. Merton. The relationship between put and call prices: Comment. ( ): – , .
[ ] Robert C. Merton.
Theory of rational option pricing. : – , Spring .
[ ] Robert C. Merton. Option pricing when underlying stock returns are discontinuous. : – , .
[ ] Robert C Merton and Paul Anthony Samuelson. Continuous-time finance. .
[ ] David C Nachman. Spanning and completeness with options. The Review of Financial Studies, ( ): – , .
[ ] S. A. Nelson. The A B C of Options and Arbitrage.
The Wall Street Library, New York, .
[ ] S. A. Nelson. The A B C of Options and Arbitrage. New York: The Wall Street Library, .
[ ] Hansjörg Neth and Gerd Gigerenzer. Heuristics: Tools for an uncertain world. Emerging Trends in the Social and Behavioral Sciences: An Interdisciplinary, Searchable, and Linkable Resource, .
[ ] Donald J Newman. A Problem Seminar. Springer Science & Business Media, .
[ ] Hong Nie and Shaohua Chen. Lognormal sum approximation with type IV Pearson distribution. IEEE Communications Letters, ( ), .
[ ] John P Nolan. Parameterizations and modes of stable distributions. Statistics & Probability Letters, ( ): – , .
[ ] Bernt Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer Science & Business Media, .
[ ] Joel Owen and Ramon Rabinovitch. On the class of elliptical distributions and their applications to the theory of portfolio choice. The Journal of Finance, ( ): – , .
[ ] P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling Extremal Events. Springer, .
[ ] Vilfredo Pareto. La courbe des revenus. Travaux de Sciences Sociales, pages – , ( ).
[ ] O. Peters and M. Gell-Mann. Evaluating gambles using dynamics. Chaos, ( ), .
[ ] T Pham-Gia and TL Hung. The mean and median absolute deviations. Mathematical and Computer Modelling, ( - ): – , .
[ ] George C Philippatos and Charles J Wilson. Entropy, market risk, and the selection of efficient portfolios. Applied Economics, ( ): – , .
[ ] Charles Phillips and Alan Axelrod. Encyclopedia of Wars ( -Volume Set). Infobase Pub., .
[ ] James Pickands III. Statistical inference using extreme order statistics. The Annals of Statistics, pages – , .
[ ] Thomas Piketty. Capital in the st century, .
[ ] Thomas Piketty and Emmanuel Saez. The evolution of top incomes: a historical and international perspective. Technical report, National Bureau of Economic Research, .
[ ] Iosif Pinelis.
Characteristic function of the positive part of a random variable and related results, with applications. Statistics & Probability Letters, : – , .
[ ] Steven Pinker. The Better Angels of Our Nature: Why Violence Has Declined. Penguin, .
[ ] Dan Pirjol. The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, ( ): – , .
[ ] EJG Pitman. Subexponential distribution functions. J. Austral. Math. Soc. Ser. A, ( ): – , .
[ ] Svetlozar T Rachev, Young Shin Kim, Michele L Bianchi, and Frank J Fabozzi. Financial Models with Lévy Processes and Volatility Clustering, volume . John Wiley & Sons, .
[ ] Anthony M. Reinach. The Nature of Puts & Calls. New York: The Bookmailer, .
[ ] Lewis F Richardson. Frequency of occurrence of wars and other fatal quarrels. Nature, ( ): , .
[ ] Matthew Richardson and Tom Smith. A direct test of the mixture of distributions hypothesis: Measuring the daily flow of information. Journal of Financial and Quantitative Analysis, ( ): – , .
[ ] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, .
[ ] Stephen A Ross. Mutual fund separation in financial theory – the separating distributions. Journal of Economic Theory, ( ): – , .
[ ] Stephen A Ross. Neoclassical Finance. Princeton University Press, .
[ ] Francesco Rubino, Antonello Forgione, David E Cummings, Michel Vix, Donatella Gnuli, Geltrude Mingrone, Marco Castagneto, and Jacques Marescaux. The mechanism of diabetes control after gastrointestinal bypass surgery reveals a role of the proximal small intestine in the pathophysiology of type diabetes. Annals of Surgery, ( ): – , .
[ ] Mark Rubinstein. Rubinstein on Derivatives. Risk Books, .
[ ] Mark Rubinstein. A History of the Theory of Investments. New York: John Wiley & Sons, .
[ ] Doriana Ruffino and Jonathan Treussard. Derman and Taleb's 'The illusions of dynamic replication': a comment.
Quantitative Finance, ( ): – , .
[ ] Harold Sackrowitz and Ester Samuel-Cahn. P values as random variables – expected P values. The American Statistician, ( ): – , .
[ ] Gennady Samorodnitsky and Murad S Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance, volume . CRC Press, .
[ ] D Schleher. Generalized Gram-Charlier series with application to the sum of log-normal variates (corresp.). IEEE Transactions on Information Theory, ( ): – , .
[ ] Jun Shao. Mathematical Statistics. Springer, .
[ ] Herbert A Simon. On a class of skew distribution functions. Biometrika, ( / ): – , .
[ ] SK Singh and GS Maddala. A function for size distribution of incomes: reply. Econometrica, ( ), .
[ ] Didier Sornette. Critical Phenomena in Natural Sciences: Chaos, Fractals, Self-organization, and Disorder: Concepts and Tools. Springer, .
[ ] C.M. Sprenkle. Warrant prices as indicators of expectations and preferences. Yale Economics Essays, ( ): – , .
[ ] C.M. Sprenkle. Warrant prices as indicators of expectations and preferences. In P. Cootner, ed., The Random Character of Stock Market Prices. MIT Press, Cambridge, Mass, .
[ ] AJ Stam. Regular variation of the tail of a subordinated probability distribution. Advances in Applied Probability, pages – , .
[ ] Stephen M Stigler. Stigler's law of eponymy. Transactions of the New York Academy of Sciences, ( Series II): – , .
[ ] Hans R Stoll. The relationship between put and call option prices. The Journal of Finance, ( ): – , .
[ ] Cass R Sunstein. Deliberating groups versus prediction markets (or Hayek's challenge to Habermas). Episteme, ( ): – , .
[ ] Giitiro Suzuki. A consistent estimator for the mean deviation of the Pearson type distribution. Annals of the Institute of Statistical Mathematics, ( ): – , .
[ ] S. Yitzhaki and E. Schechtman. The Gini Methodology: A Primer on a Statistical Methodology. Springer, .
[ ] N N Taleb and R Douady. Mathematical definition, mapping, and detection of (anti)fragility.
Quantitative Finance, .
[ ] Nassim N Taleb and G Martin. The illusion of thin tails under aggregation (a reply to Jack Treynor). Journal of Investment Management, .
[ ] Nassim Nicholas Taleb. Dynamic Hedging: Managing Vanilla and Exotic Options. John Wiley & Sons (Wiley Series in Financial Engineering), .
[ ] Nassim Nicholas Taleb. Incerto: Antifragile, The Black Swan, Fooled by Randomness, The Bed of Procrustes, Skin in the Game. Random House and Penguin, - .
[ ] Nassim Nicholas Taleb. Black swans and the domains of statistics. The American Statistician, ( ): – , .
[ ] Nassim Nicholas Taleb. Errors, robustness, and the fourth quadrant. International Journal of Forecasting, ( ): – , .
[ ] Nassim Nicholas Taleb. Finiteness of variance is irrelevant in the practice of quantitative finance. Complexity, ( ): – , .
[ ] Nassim Nicholas Taleb. Antifragile: Things That Gain from Disorder. Random House and Penguin, .
[ ] Nassim Nicholas Taleb. Four points beginner risk managers should learn from Jeff Holman's mistakes in the discussion of Antifragile. arXiv preprint arXiv: . , .
[ ] Nassim Nicholas Taleb. The meta-distribution of standard p-values. arXiv preprint arXiv: . , .
[ ] Nassim Nicholas Taleb. Stochastic tail exponent for asymmetric power laws. arXiv preprint arXiv: . , .
[ ] Nassim Nicholas Taleb. Election predictions as martingales: an arbitrage approach. Quantitative Finance, ( ): – , .
[ ] Nassim Nicholas Taleb. How much data do you need? An operational, preasymptotic metric for fat-tailedness. International Journal of Forecasting, .
[ ] Nassim Nicholas Taleb. Skin in the Game: Hidden Asymmetries in Daily Life. Penguin (London) and Random House (N.Y.), .
[ ] Nassim Nicholas Taleb. Technical Incerto, Vol : The Statistical Consequences of Fat Tails, Papers and Commentaries. Monograph, .
[ ] Nassim Nicholas Taleb. Common misapplications and misinterpretations of correlation in social "science".
Preprint, Tandon School of Engineering, New York University, .
[ ] Nassim Nicholas Taleb. The Statistical Consequences of Fat Tails. STEM Academic Press, .
[ ] Nassim Nicholas Taleb, Elie Canetti, Tidiane Kinda, Elena Loukoianova, and Christian Schmieder. A new heuristic measure of fragility and tail risks: application to stress testing. International Monetary Fund, .
[ ] Nassim Nicholas Taleb and Pasquale Cirillo. Branching epistemic uncertainty and thickness of tails. arXiv preprint arXiv: . , .
[ ] Nassim Nicholas Taleb and Raphael Douady. On the super-additivity and estimation biases of quantile contributions. Physica A: Statistical Mechanics and its Applications, : – , .
[ ] Nassim Nicholas Taleb and Daniel G Goldstein. The problem is beyond psychology: The real world is more random than regression analyses. International Journal of Forecasting, ( ): – , .
[ ] Nassim Nicholas Taleb and George A Martin. How to prevent other financial crises. SAIS Review of International Affairs, ( ): – , .
[ ] Nassim Nicholas Taleb and Avital Pilpel. I problemi epistemologici del risk management. In Daniele Pace (a cura di), Economia del rischio. Antologia di scritti su rischio e decisione economica. Giuffrè, Milano, .
[ ] Nassim Nicholas Taleb and Constantine Sandis. The skin in the game heuristic for protection against tail events. Review of Behavioral Economics, : – , .
[ ] NN Taleb and J Norman. Ethics of precaution: Individual and systemic risk, .
[ ] Jozef L Teugels. The class of subexponential distributions. The Annals of Probability, ( ): – , .
[ ] Edward Thorp. A corrected derivation of the Black-Scholes option model. Based on private conversation with Edward Thorp and a copy of a page paper Thorp wrote around , with disclaimer that I understood Ed. Thorp correctly. .
[ ] Edward O Thorp. Optimal gambling systems for favorable games. Revue de l'Institut International de Statistique, pages – , .
[ ] Edward O Thorp. Extensions of the Black-Scholes option model.
Proceedings of the th Session of the International Statistical Institute, Vienna, Austria, pages – , .
[ ] Edward O Thorp. Understanding the Kelly criterion. In The Kelly Capital Growth Investment Criterion: Theory and Practice. World Scientific Press, Singapore, .
[ ] Edward O. Thorp and S. T. Kassouf. Beat the Market. New York: Random House, .
[ ] James Tobin. Liquidity preference as behavior towards risk. The Review of Economic Studies, pages – , .
[ ] Jack L Treynor. Insights: What can Taleb learn from Markowitz? Journal of Investment Management, ( ): , .
[ ] Constantino Tsallis, Celia Anteneodo, Lisa Borland, and Roberto Osorio. Nonextensive statistical mechanics and economics. Physica A: Statistical Mechanics and its Applications, ( ): – , .
[ ] Vladimir V Uchaikin and Vladimir M Zolotarev. Chance and Stability: Stable Distributions and their Applications. Walter de Gruyter, .
[ ] Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak Convergence and Empirical Processes, pages – . Springer, .
[ ] Willem Rutger van Zwet. Convex Transformations of Random Variables, volume . Mathematisch Centrum, .
[ ] SR Srinivasa Varadhan. Large Deviations and Applications, volume . SIAM, .
[ ] SR Srinivasa Varadhan. Stochastic Processes, volume . American Mathematical Soc., .
[ ] José A Villaseñor-Alva and Elizabeth González-Estrada. A bootstrap goodness of fit test for the generalized Pareto distribution. Computational Statistics & Data Analysis, ( ): – , .
[ ] Eric Weisstein. Wolfram MathWorld.
[ ] Rafał Weron. Lévy-stable distributions revisited: tail index > does not exclude the Lévy-stable regime. International Journal of Modern Physics C, ( ): – , .
[ ] Heath Windcliff and Phelim P Boyle. The /n pension investment puzzle. North American Actuarial Journal, ( ): – , .
[ ] Yingying Xu, Zhuwu Wu, Long Jiang, and Xuefeng Song. A maximum entropy method for a robust portfolio problem. Entropy, ( ): – , .
[ ] Yingying Yang, Shuhe Hu, and Tao Wu.
The tail probability of the product of dependent random variables from max-domains of attraction. Statistics & Probability Letters, ( ): – , .
[ ] Jay L Zagorsky. Do you have to be smart to be rich? The impact of IQ on wealth, income and financial distress. Intelligence, ( ): – , .
[ ] IV Zaliapin, Yan Y Kagan, and Federic P Schoenberg. Approximating the distribution of Pareto sums. Pure and Applied Geophysics, ( - ): – , .
[ ] Rongxi Zhou, Ru Cai, and Guanqun Tong. Applications of entropy in finance: A review. Entropy, ( ): – , .
[ ] Vladimir M Zolotarev. One-dimensional Stable Distributions, volume . American Mathematical Soc., .
[ ] VM Zolotarev. On a new viewpoint of limit theorems taking into account large deviations. Selected Translations in Mathematical Statistics and Probability, : , .

INDEX

k metric
Antifragility
Bad print (fake outlier)
Bayes' rule
Bayesian methods
Beta (finance)
Bimodality
Black Swan
Black Swan Problem
Black-Scholes
Bootstrap
Brier score
Catastrophe principle
Central limit theorem (CLT)
Characteristic function
Characteristic scale
Chebyshev's inequality
Chernoff bound
Citation ring
Compact support
Concavity/Convexity
Convolution
COVID-19 pandemic
Cramer condition
CVaR (Conditional Value at Risk)
De Finetti
Degenerate distribution
Dose-response (S curve)
Dynamic hedging
Econometrics
Eigenvalues
Elliptical distribution (Ellipticality)
Empirical distribution
Entropy
Ergodic probabilities
Ergodicity
Expert calibration
Extreme value distribution
Extreme value theory
Fréchet class
Fragility
Fughedaboudit
Gamma variance
GARCH econometric models
Gauss-Markov theorem
Generalized central limit theorem (GCLT)
Generalized extreme value distribution
Generalized Pareto distribution (GPD)
Gini coefficient
Grey Swans
Heteroskedasticity
Hidden properties
Hidden tail
Higher dimensions (thick tailedness)
Hilbert transform
Hoeffding's inequality
Independence
Inseparability of probability
Invisibility of the generator
IQ
Itô's lemma
Jensen's inequality
Jobless claims (jump in variable)
Kappa metric
Karamata point
Karamata Representation Theorem
Kurtosis
Large deviation principle
Large deviation theory
Law of large numbers (LLN)
Law of large numbers (weak vs. strong)
Law of large numbers for higher moments
Law of medium numbers
Lévy, Paul
Lindy effect
Linear regression under fat tails
Location-scale family
Log-Pareto distribution
Long Term Capital Management (LTCM)
Lucretius (fallacy)
Machine learning
Mandelbrot, Benoit
Marchenko-Pastur distribution
Markov's inequality
Martingale
Maximum domain of attraction
Mean absolute deviation
Mean-variance portfolio theory
Mediocristan vs. Extremistan
Mellin transform
Metadistribution
Method of moments
Modern Portfolio Theory (Markowitz)
MS plot
Multivariate stable distribution
Mutual information
Myopic loss aversion
Norming constants
Norms L^p
Numéraire
Pandemics
Paretian, Paretianity
Pareto, Vilfredo
Peak-over-threshold
Peso problem (confusion)
Pickands-Balkema-de Haan theorem
Pinker pseudo-statistical inference
Poisson jumps
Popper, Karl
Power law
Power law basin
Preasymptotics
Principal component analysis
Pseudo power law
Pseudo-empiricism (pseudo-statistical inference)
Pseudoconvergence
Psychometrics
R-square
Radon-Nikodym derivative
Random matrices
Regular variation class
Rent seeking
Residuals (regression)
Risk Parity
Robust statistics
Ruin
Shadow mean
Shadow moment
Sharpe ratio (coefficient of variation)
Sigmoid
Skin in the game
Slowly varying function
S&P 500
Stable (Lévy stable) distribution
Stochastic process
Stochastic volatility
Stochasticization (of variance)
Stochasticizing
Strong Pareto law
Subexponential class
Subexponentiality
Tail dependence
Tail exponent
The tail wags the dog effect
Transition probability
Universal approximation theorem
Value at Risk (VaR)
van der Wijk's law
Violence (illusion of drop in)
von Neumann, John
Wigner semicircle distribution
Winsorizing
Wittgenstein's ruler