Cognitive Constructivism and the Epistemic Significance of Sharp Statistical Hypotheses in Natural Sciences
Julio Michael Stern
IME-USP, Institute of Mathematics and Statistics of the University of São Paulo
Version 2.31, May 01, 2013.

A Marisa e a nossos filhos, Rafael, Ana Carolina, e Deborah.
To Marisa and to our children, Rafael, Ana Carolina, and Deborah.

“Remanso de rio largo, viola da solidão:
Quando vou p’ra dar batalha, convido meu coração.”

Gentle backwater of wide river, fiddle to solitude:
When going to do battle, I invite my heart.

João Guimarães Rosa (1908-1967), Grande Sertão: Veredas.

“Sertão é onde o homem tem de ter a dura nuca e a mão quadrada.
(Onde quem manda é forte, com astúcia e com cilada.)
Mas onde é bobice a qualquer resposta, é aí que a pergunta se pergunta.”
“A gente vive repetido, o repetido...
Digo: o real não está na saída nem na chegada:
ele se dispõe para a gente é no meio da travessia.”

Sertão is where a man’s might must prevail, where he has to be strong, smart and wise.
But there, where any answer is wrong, there is where the question asks itself.
We live repeating the repeated...
I say: the real is neither at the departure nor at the arrival:
it presents itself to us in the middle of the journey.

Contents
Preface
    Cognitive Constructivism
    Basic Tools for the (Home) Works
    Acknowledgements and Final Remarks

Epilog

References

A  FBST Review
    A.1  Introduction
    A.2  Bayesian Statistical Models
    A.3  The Epistemic e-values
    A.4  Reference, Invariance and Consistency
    A.5  Loss Functions
    A.6  Belief Calculi and Support Structures
    A.7  Sensitivity and Inconsistency
    A.8  Complex Models and Compositionality

B  Binomial, Dirichlet, Poisson and Related Distributions
    B.1  Introduction and Notation
    B.2  The Bernoulli Process
    B.3  Multinomial Distribution
    B.4  Multivariate Hypergeometric Distribution
    B.5  Dirichlet Distribution
    B.6  Dirichlet-Multinomial
    B.7  Dirichlet of the Second Kind
    B.8  Examples
    B.9  Functional Characterizations
    B.10 Final Remarks

C  Model Miscellanea

D  Deterministic Evolution and Optimization
    D.1  Convex Sets and Polyhedra
    D.2  Linear Programming
        D.2.1  Primal and Dual Simplex Algorithms
        D.2.2  Decomposition Methods
    D.3  Non-Linear Programming
        D.3.1  GRG: Generalized Reduced Gradient
        D.3.2  Line Search and Local Convergence
        D.3.3  The Gradient ParTan Algorithm
        D.3.4  Global Convergence
    D.4  Variational Principles

E  Entropy and Asymptotics
    E.1  Boltzmann-Gibbs-Shannon Entropy
    E.2  Csiszár's ϕ-divergence
    E.3  Minimum Divergence under Constraints
    E.4  Fisher's Metric and Jeffreys' Prior
    E.5  Posterior Asymptotic Convergence

F  Matrix Factorizations
    F.1  Matrix Notation
    F.2  Dense LU, QR and SVD Factorizations
    F.3  Sparse Factorizations
        F.3.1  Sparsity and Graphs
        F.3.2  Sparse Cholesky Factorization
    F.4  Bayesian Networks

G  Monte Carlo Miscellanea
    G.1  Pseudo, Quasi and Subjective Randomness
    G.2  Integration and Variance Reduction
    G.3  MCMC - Markov Chain Monte Carlo
    G.4  Estimation of Ratios
    G.5  Monte Carlo for Linear Systems

H  Stochastic Evolution and Optimization
    H.1  Inhomogeneous Markov Chains
    H.2  Simulated Annealing
    H.3  Genetic Programming
    H.4  Ontogenic Development

I  Research Projects

J  Image and Art Gallery


Preface

“Life is like riding a bicycle. To keep your balance you must keep moving.”
Albert Einstein.

The main goals of this book are to develop an epistemological framework based on Cognitive Constructivism, and to provide a general introduction to the Full Bayesian Significance Test (FBST). The FBST was first presented in Pereira and Stern (1999) as a coherent Bayesian method for assessing the statistical significance of sharp or precise statistical hypotheses. A review of the FBST is given in the appendices, including:
a) Some examples of its practical application;
b) The basic computational techniques used in its implementation;
c) Its statistical properties;
d) Its logical or formal algebraic properties.
The items above have already been explored in previous presentations and courses. In this book we shall focus on presenting
e) A coherent epistemological framework for precise statistical hypotheses.

The FBST grew out of the necessity of testing sharp statistical hypotheses in several instances of the consulting practice of its authors. By the end of the year 2003, various interesting applications of this new formalism had been published by members of the Bayesian research group at IME-USP, some of which outperformed previously published solutions based on alternative methodologies; see for example Stern and Zacks (2002). In some applications, the FBST offered simple, elegant and complete solutions, whereas alternative methodologies offered only partial solutions and/or required convoluted problem manipulations; see for example Lauretto et al. (2003).

The FBST measures the significance of a sharp hypothesis in a way that differs completely from that of Bayes Factors, the method of choice of orthodox Bayesian statistics. These methodological differences fired interesting debates that motivated us to investigate more thoroughly the logical and algebraic properties of the new formalism. These investigations also gave us the opportunity to interact with people in communities interested in more general belief calculi, mostly from the areas of Logic and Artificial
Intelligence; see for example Stern (2003, 2004) and Borges and Stern (2007).

However, as both Orthodox Bayesian Statistics and Frequentist Statistics have their own well established epistemological frameworks, namely Decision Theory and Popperian Falsificationism, respectively, there was still one major gap to be filled: the establishment of an epistemological framework for the FBST formalism. Despite the fact that the daily practice of Statistics rarely leads to epistemological questions, the distinct formal properties of the FBST repeatedly brought forward such considerations. Consequently, defining an epistemological framework fully compatible with the FBST became an unavoidable task, as part of our effort to answer the many interesting questions posed by our colleagues.

Besides compatibility with the FBST logical properties, this new epistemological framework was also required to fully support sharp (precise or lower dimensional) statistical hypotheses. In fact, contrasting with the decision theoretic epistemology of the orthodox Bayesian school, which is usually hostile or at least unsympathetic to this kind of hypothesis, this new epistemological framework actually puts, as we will see in the following chapters, sharp hypotheses at the center stage of the philosophy of science.
Cognitive Constructivism
The epistemological framework chosen for the aforementioned task was Cognitive Constructivism, as presented in chapters 1 to 4, which constitute the core lectures of this course. The central epistemological concept supporting the notion of a sharp statistical hypothesis is that of a systemic eigen-solution. According to Heinz von Foerster, the four essential attributes of such eigen-solutions are: discreteness (sharpness), stability, separability (decoupling) and composability. Systemic eigen-solutions correspond to the “objects” of knowledge, which may, in turn, be represented by sharp hypotheses in appropriate statistical models. These are the main topics discussed in chapter 1. Within the FBST setup, the e-value of a hypothesis, H, defines the measure of its Epistemic Value, or the Value of the Evidence in support of H provided by the observations. This measure corresponds, in turn, to the “reality” of the object described by the statistical hypothesis. The FBST formalism is reviewed in Appendix A.

In chapter 2 we delve into this epistemological framework from a broader perspective, linking it to the philosophical schools of Objective Idealism and Pragmatism. The general approach of this chapter can be summarized by the “wire walking” metaphor, according to which one strives to keep balanced at a center of equilibrium, avoiding the dangers of extreme positions far away from it; see Figure J.1. In this context, such extreme positions correspond to the epistemological positions of Dogmatic Realism and Solipsistic Subjectivism.
Basic Tools for the (Home) Works
The fact that the focus of this summer course will be on epistemological questions should not be taken as an excuse for not working hard on statistical modeling, data analysis, computer implementation, and the like. After all, this course will give successful students 4 full credits in the IME-USP graduate programs!

In the core lectures we will illustrate the topics under discussion with several ‘concrete’ mathematical and statistical models. We have made a conscious effort to choose illustration models involving only mathematical concepts already familiar to our prospective students. Actually, most of these models entail mathematical techniques that are used in the analysis and the computational implementation of the FBST, or that are closely related to them. Appendices A through K should help the students with their homework. We point out, however, that the presentation quality of these appendices is very heterogeneous. Some are (I hope) didactic and well prepared, some are only snapshots from slide presentations, and finally, some are just commented computer codes.
Acknowledgements and Final Remarks
The main goal of this book is to explore the FBST formalism and Bayesian statistics from a constructivist epistemological perspective. In order to accomplish this, ideas from many great masters have been used, including philosophers like Peirce, Maturana, von Foerster, and Luhmann; statisticians like Peirce, Fisher, de Finetti, Savage, Good, Kempthorne, Jaynes, Jeffreys and Basu; and physicists like Boltzmann, Planck, de Broglie, Bohr, Heisenberg, and Born. I hope it is clear from the text how much I admire and feel I owe to these giants, even when my attitude is less than reverential. By that I mean that I always felt free to borrow from them the many ideas I like, and was also unashamed to reject the few I do not. The progress of science has always relied on the free and open discussion of ideas, in contrast to rigid cults of personality. I only hope to receive from the reader the same treatment and that, among the ideas presented in this work, he or she finds some that are interesting and worthy of being kept in mind.

Chapters 1 to 4, released as Stern (2005a) and the Technical Reports Stern (2006a-c), were used in January-February of 2007 (and again in 2008) in the IME-USP Summer Program for the discipline MAE-5747, Comparative Statistical Inference. Chapter 5, released as the Technical Report Stern (2007c), was also used in the second semester of 2007 in the discipline MAP-427, Nonlinear Programming. A short “no-math” article based on part of the material in Chapter 1 has been published (in Portuguese) in the journal Scientiae Studia. Revised and corrected versions of articles based on the material presented in Chapters 1, 2 and 3 have also been either published or accepted for publication in the journal Cybernetics & Human Knowing. In the main text and the ...
[email protected]
Julio Michael Stern
São Paulo, 20/12/2007.
Version Control

- Version 1.0 - December 20, 2007.
- Version 1.1 - April 9, 2008. Several minor corrections to the main text and some bibliographic updates. The appendices have been reorganized as follows: Appendix A presents a short review of the FBST, including its definition and main statistical and logical properties; Appendix B fully reviews the distribution theory used to build Multinomial-Dirichlet statistical models; Appendix C summarizes several statistical models used to illustrate the core lectures; Appendix D (previously a separate handout) gives a short introduction to deterministic optimization; Appendix E reviews some important concepts related to the Maximum Entropy formalism and asymptotic convergence; Appendix F, on sparse factorizations, provides some technical details related to the discussions on decoupling procedures in chapter 3; Appendix G presents a technical miscellanea on Monte Carlo methods; Appendix H provides a short derivation of some stochastic optimization algorithms and evolution models; Appendix I lists some open research programs; Appendix J contains all bitmap figures; and, finally, Appendix K collects pieces of difficult-to-get reading material. They will be posted at my web page, subject to the censorship of our network administrator and his understanding of Brazilian copyright laws and regulations. All computer code was removed from the text and is now available at my web page, ~jstern. This version has been used for a tutorial at MaxEnt-2008, the 28th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, held on July 6-11 at Boracéia, São Paulo, Brazil.
- Version 1.2 - December 10, 2008. Minor corrections to the main text and appendices, and some bibliographic updates. New section F.1 on dense matrix factorizations. This section also defines the matrix notation now used consistently throughout the book.
- Version 2.0 - December 19, 2009. New section 4.5 and chapter 6, presented at the conference MBR’09 - Model Based Reasoning in Science and Technology - held at Campinas, Brazil. Most of the figures on exhibition in the art gallery are now in a separate file.
- Version 2.3 - November 02, 2012. New sections D.3.1 on Quadratic and Linear Complementarity Problems and E.6 on Reaction Networks and Kirchhoff’s Laws. Updated references. Minor corrections throughout the text.


Chapter 1

Eigen-Solutions and Sharp Statistical Hypotheses

“Eigenvalues have been found ontologically to be discrete, stable, separable and composable ...”

Heinz von Foerster (1911-2002), Objects: Tokens for Eigen-Behaviours.
In this chapter, a few epistemological, ontological and sociological questions concerning the statistical significance of sharp hypotheses in the scientific context are investigated within the framework provided by Cognitive Constructivism, or the Constructivist Theory (ConsTh), as presented in Maturana and Varela (1980), Foerster (2003) and Luhmann (1989, 1990, 1995). Several conclusions of the study, however, remain valid, mutatis mutandis, within various other organizations and systems; see for example Bakken and Hernes (2002), Christis (2001), Mingers (2000) and Rasch (1998).

The author’s interest in this research topic emerged from his involvement in the development of the Full Bayesian Significance Test (FBST), a novel Bayesian solution to the statistical problem of measuring the support of sharp hypotheses, first presented in Pereira and Stern (1999). The problem of measuring the support of sharp hypotheses poses several conceptual and methodological difficulties for traditional statistical analysis under both the frequentist (classical) and the orthodox Bayesian approaches. The solution provided by the FBST has significant advantages over traditional alternatives in terms of its statistical and logical properties. Since these properties have already been thoroughly analyzed in previous papers (see references), the focus herein is directed exclusively to epistemological and ontological questions.
Despite the fact that the FBST is fully compatible with Decision Theory (DecTh), as shown in Madruga et al. (2001), which, in turn, provides a strong and coherent epistemological framework for orthodox Bayesian statistics, its logical properties open the possibility of using and benefiting from alternative epistemological settings. In this chapter, the epistemological framework of ConsTh is counterposed to that of DecTh. The contrast, however, is limited in scope by our interest in statistics and is carried out in a rather exploratory and non-exhaustive form. The epistemological framework of ConsTh is also counterposed to that of Falsificationism, the epistemological framework within which classical frequentist statistical tests of hypotheses are often presented, as shown in Boyd (1991) and Popper (1959, 1963).

In section 2, the fundamental notions of Autopoiesis and Eigen-Solutions in autopoietic systems are reviewed. In section 3, the same is done with the notions of Social Systems and Functional Differentiation, and in section 4, a ConsTh view of science is presented. In section 5, the material presented in sections 2, 3 and 4 is related to the statistical significance of sharp scientific hypotheses, and the findings therein are counterposed to traditional interpretations such as those of DecTh. In section 6, a few sociological analyses of dedifferentiation phenomena are reviewed. In sections 7 and 8, the final conclusions are established.

In sections 2, 3, 4, and 6, well established concepts of ConsTh are presented. However, in order to overcome an unfortunately common scenario, an attempt is made to make them accessible to a scientist or statistician who is somewhat familiar with traditional frequentist and decision-theoretic statistical interpretations, but unfamiliar with the constructivist approach to epistemology. Rephrasing these concepts (once again) is also avoided. Instead, quoting the primary sources is preferred whenever it can be clearly (in our context) and synthetically done. The contributions in sections 5, 7 and 8 relate mostly to the analysis of the role of quantitative methods specifically designed to measure the statistical support of sharp hypotheses. A short review of the FBST is presented in Appendix A.
1.2 Autopoiesis

The concept of autopoiesis tries to capture an essential characteristic of living organisms (auto = self, poiesis = production). Its purpose and definition are stated in Maturana and Varela (1980, p.84 and 78-79):

“Our aim was to propose the characterization of living systems that explains the generation of all the phenomena proper to them. We have done this by pointing at Autopoiesis in the physical space as a necessary and sufficient condition for a system to be a living one.”

“An autopoietic system is organized (defined as a unity) as a network of processes of production (transformation and destruction) of components that produces the components which:
(i) through their interactions and transformations continuously regenerate and realize the network of processes (relations) that produced them; and
(ii) constitute it (the machine) as a concrete unity in the space in which they (the components) exist by specifying the topological domain of its realization as such a network.”

Autopoietic systems are non-equilibrium (dissipative) dynamical systems exhibiting (meta) stable structures, whose organization remains invariant over (long periods of) time, despite the frequent substitution of their components. Moreover, these components are produced by the same structures they regenerate. For example, the macromolecular population of a single cell can be renewed thousands of times during its lifetime; see Bertalanffy (1969). The investigation of these regeneration processes in the autopoietic system production network leads to the definition of the cognitive domain, Maturana and Varela (1980, p.10):

“The circularity of their organization continuously brings them back to the same internal state (same with respect to the cyclic process). Each internal state requires that certain conditions (interactions with the environment) be satisfied in order to proceed to the next state. Thus the circular organization implies the prediction that an interaction that took place once will take place again. If this does not happen the system maintains its integrity (identity with respect to the observer) and enters into a new prediction. In a continuously changing environment these predictions can only be successful if the environment does not change in that which is predicted. Accordingly, the predictions implied in the organization of the living system are not predictions of particular events, but of classes of interactions. Every interaction is a particular interaction, but every prediction is a prediction of a class of interactions that is defined by those features of its elements that will allow the living system to retain its circular organization after the interaction, and thus, to interact again. This makes living systems inferential systems, and their domain of interactions a cognitive domain.”
The characteristics of these circular (cyclic or recursive) regenerative processes and their eigen- (auto, equilibrium, fixed, homeostatic, invariant, recurrent, recursive) states, both in concrete and abstract autopoietic systems, are further investigated in Foerster (2003) and Segal (2001):
“The meaning of recursion is to run through one’s own path again. One of its results is that under certain conditions there exist indeed solutions which, when reentered into the formalism, produce again the same solution. These are called “eigen-values”, “eigen-functions”, “eigen-behaviors”, etc., depending on which domain this formation is applied - in the domain of numbers, in functions, in behaviors, etc.”
Segal (2001, p.145).

The concept of eigen-solution for an autopoietic system is the key to distinguishing specific objects in a cognitive domain. Von Foerster also establishes four essential attributes of eigen-solutions that will support the analyses conducted in this chapter and the conclusions established herein.

“Objects are tokens for eigen-behaviors. Tokens stand for something else. In exchange for money (a token itself for gold held by one’s government, but unfortunately no longer redeemable), tokens are used to gain admittance to the subway or to play pinball machines. In the cognitive realm, objects are the token names we give to our eigen-behavior. This is the constructivist’s insight into what takes place when we talk about our experience with objects.”
Segal (2001, p.127).

“Eigenvalues have been found ontologically to be discrete, stable, separable and composable, while ontogenetically to arise as equilibria that determine themselves through circular processes. Ontologically, Eigenvalues and objects, and likewise, ontogenetically, stable behavior and the manifestation of a subject’s “grasp” of an object cannot be distinguished.”
Foerster (2003, p.266).

The arguments used in this study rely heavily on two qualitative properties of eigen-solutions, referred to by von Foerster by the terms “discrete” and “equilibria”. In what follows, the meanings of these qualifiers, as they are understood by von Foerster and used herein, are examined:

a- Discrete (or sharp):

“There is an additional point I want to make, an important point. Out of an infinite continuum of possibilities, recursive operations carve out a precise set of discrete solutions. Eigen-behavior generates discrete, identifiable entities. Producing discreteness out of infinite variety has incredibly important consequences. It permits us to begin naming things. Language is the possibility of carving out of an infinite number of possible experiences those experiences which allow stable interactions of your-self with yourself.”
Segal (2001, p.128).

For example, if we define an eigenvector $x$, with corresponding eigenvalue $c$, of a linear transformation $T(\,)$ only by its essential property of directional invariance, $T(x) = cx$, we obtain one-dimensional sub-manifolds which, in this case, are subspaces or lines through the origin. Only if we add the usual (but non-essential) normalization condition, $||x|| = 1$, do we get discrete eigenvectors.

b- Equilibria (or stable):

A stable eigen-solution of an operator $Op(\,)$, defined by the fixed-point or invariance equation, $x_{inv} = Op(x_{inv})$, can be found, built or computed as the limit, $x_\infty$, of the sequence $\{x_n\}$, defined by recursive application of the operator, $x_{n+1} = Op(x_n)$. Under appropriate conditions, such as within a domain of attraction, the convergence of the process and its limit eigen-solution will not depend on the starting point, $x_0$. In the linear algebra example, starting from almost any point, the sequence generated by the recursive relation $x_{n+1} = T(x_n)/||T(x_n)||$, i.e. the application of $T$ followed by normalization, converges to the unit eigenvector corresponding to the largest eigenvalue.

In sections 4 and 5 it is shown, for statistical analysis in a scientific context, how the property of sharpness indicates that many, and perhaps some of the most relevant, scientific hypotheses are sharp, and how the property of stability indicates that considering these hypotheses is natural and reasonable. The statistical consequences of these findings will be discussed in sections 7 and 8. Before that, however, a few other ConsTh concepts must be introduced in sections 3 and 6.

Autopoiesis found its name in the work of Maturana and Varela (1980), together with a simple, powerful and elegant formulation using the modern language of systems theory. Nevertheless, some of the basic theoretical concepts, such as those of self-organization and autonomy of living organisms, have long historical grounds that some authors trace back to Kant. As seen in Kant (1790, sec. 65), for example, a (self-organized) “Organism” is characterized as an entity in which,

“... every part is thought as ‘owing’ its presence to the ‘agency’ of all the remaining parts, and also as existing ‘for the sake of the others’ and of the whole, that is as an instrument, or organ.”

“Its parts must in their collective unity reciprocally produce one another alike as to form and combination, and thus by their own causality produce a whole, the conception of which, conversely, -in a being possessing the causality according to conceptions that is adequate for such a product- could in turn be the cause of the whole according to a principle, so that, consequently, the nexus of ‘efficient causes’ (progressive causation, nexus effectivus) might be no less estimated as an ‘operation brought about by final causes’ (regressive causation, nexus finalis).”
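Returning to the linear-algebra illustration of stability given above: the following minimal sketch (in Python; the particular matrix $T$ and starting point are arbitrary illustrative assumptions, not taken from the text) runs the recursion $x_{n+1} = T(x_n)/||T(x_n)||$ and converges to the discrete eigen-solution.

```python
import numpy as np

# Illustrative operator T: any diagonalizable matrix with a unique
# dominant eigenvalue works; this particular matrix is an assumption.
T = np.array([[2.0, 1.0],
              [1.0, 3.0]])

x = np.array([1.0, 0.7])          # almost any starting point x_0 works
for _ in range(100):
    Tx = T @ x
    x = Tx / np.linalg.norm(Tx)   # x_{n+1} = T(x_n) / ||T(x_n)||

# x is now (up to sign) the unit eigenvector of the largest eigenvalue;
# each componentwise ratio below approximates the eigenvalue c.
print(x, (T @ x) / x)
```

Starting from a different $x_0$ inside the domain of attraction yields the same limit, which is precisely why the eigen-solution, and not any particular trajectory, is the stable “object”.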
Baruch Spinoza, in his Ethics (1677, Part III, Propositions 6, 7 and 8), defines the Conatus (effort, endeavour, impetus) of self-preservation as the true essence of a being. This concept has also been regarded as a remote precursor of autopoiesis.

Prop. III-6: Everything, in so far as it is in itself, endeavours to persist in its own being.
Prop. III-7: The endeavour, wherewith everything endeavours to persist in its own being, is nothing else but the actual essence of the thing in question.
Prop. III-8: The endeavour, whereby a thing endeavours to persist in its being, involves no finite time, but an indefinite time.

For further historical comments we refer the reader to Zeleny (1980).
1.3 Functional Differentiation

In order to give appropriate answers to environmental complexities, autopoietic systems can be hierarchically organized as Higher Order Autopoietic Systems. As in Maturana and Varela (1980, p.107-109), this notion is defined via the concept of Coupling:

“Whenever the conduct of two or more units is such that there is a domain in which the conduct of each one is a function of the conduct of the others, it is said that they are coupled in that domain.”

“Such a composite system will necessarily be defined as a unity by the coupling relations of its component autopoietic systems in the space that the nature of the coupling specifies, and will remain as a unity as long as the component systems retain their autopoiesis which allows them to enter into those coupling relations.”

“An autopoietic system whose autopoiesis entails the autopoiesis of the coupled autopoietic units which realize it, is an autopoietic system of higher order.”
A typical example of a hierarchical system is a beehive, a third order autopoietic system, formed by the coupling of individual bees, the second order systems, which, in turn, are formed by the coupling of individual cells, the first order systems.

The philosopher and sociologist Niklas Luhmann applied this notion to the study of modern human societies and their systems. Luhmann’s basic abstraction is to look at social systems as constituted by communications:

“Social systems use communication as their particular mode of autopoietic (re)production. Their elements are communications that are recursively produced and reproduced by a network of communications, and that cannot exist outside of such a network. Communications are not living units, they are not conscious units, they are not actions. Their unity requires a synthesis of three selections, namely information, utterance and understanding (including misunderstanding).”
Luhmann (1990b, p.3).

For Luhmann, society’s best strategy to deal with increasing complexity is the same as one observes in most biological organisms, namely, differentiation. Biological organisms differentiate into specialized systems, such as organs and tissues of a pluricellular life form (non-autopoietic or allopoietic systems), or specialized individuals in an insect colony (an autopoietic system). In fact, societies and organisms can be characterized by the way in which they differentiate into systems. For Luhmann, modern societies are characterized by a vertical differentiation into autopoietic functional systems, where each system is characterized by its code, program and (generalized) media. The code gives a bipolar reference to the system, of what is positive, accepted, favored or valid, versus what is negative, rejected, disfavored or invalid. The program gives a specific context where the code is applied, and the media is the space in which the system operates.

Standard examples of social systems are:
- Science: with a true/false code, working in a program set by a scientific theory, and having articles in journals and proceedings as its media;
- Judicial: with a legal/illegal code, working in a program set by existing laws and regulations, and having certified legal documents as its media;
- Religion: with a good/evil code, working in a program set by sacred and hermeneutic texts, and having study, prayer and good deeds as its media;
- Economy: with a property/lack-thereof code, working in a program set by economic planning scenarios and pricing methods, and having money and money-like financial assets as its media.

Before ending this section, a notion related to the breakdown of autopoiesis is introduced: Dedifferentiation (Entdifferenzierung) is the degradation of the system’s internal coherence, through adulteration, disruption, or dissolution of its own autopoietic relations. One form of dedifferentiation (in either biological or social systems) is the system’s penetration by external agents who try to use the system’s resources in a way that is not
compatible with the system’s autonomy. In Luhmann’s conception of modern society each system may be aware of events in other systems, that is, be cognitively open, but is required to maintain its differentiation, that is, be operationally closed. In Luhmann’s (1989, p.109) words:

“With functional differentiation... Extreme elasticity is purchased at the cost of the peculiar rigidity of its contextual conditions. Every binary code claims universal validity, but only for its own perspective. Everything, for example, can be either true or false, but only true or false according to the specific theoretical programs of the scientific system. Above all, this means that no function system can step in for any other. None can replace or even relieve any other. Politics cannot be substituted for economy, nor economy for science, nor science for law or religion, nor religion for politics, etc., in any conceivable intersystem relations.”
1.4 Eigen-Solutions and Scientific Hypotheses

The interpretation of scientific knowledge as an eigen-solution of a research process is part of a constructive approach to epistemology. Figure 1 presents an idealized structure and dynamics of knowledge production. This diagram represents, on the Experiment side (left column), the laboratory or field operations of an empirical science, where experiments are designed and built, observable effects are generated and measured, and the experimental data bank is assembled. On the Theory side (right column), the diagram represents the theoretical work of statistical analysis, interpretation and (hopefully) understanding according to accepted patterns. If necessary, new hypotheses (including whole new theories) are formulated, motivating the design of new experiments. Theory and experiment constitute a double feed-back cycle, making it clear that the design of experiments is guided by the existing theory and its interpretation, which, in turn, must be constantly checked, adapted or modified in order to cope with the observed experiments. The whole system constitutes an autopoietic unit, as seen in Krohn and Küppers (1990, p.214):

“The idea of knowledge as an eigensolution of an operationally closed combination between argumentative and experimental activities attempts to answer the initially posed question of how the construction of knowledge binds itself to its construction in a new way. The coherence of an eigensolution does not refer to an objectively given reality but follows from the operational closure of the construction. Still, different decisions on the selection of couplings may lead to different, equally valid eigensolutions. Between such different solutions no reasonable choice is possible unless a new operation of knowledge is constructed exactly upon the differences of the given solutions. But again, this frame of reference for explicitly relating different solutions to each other introduces new choices with respect to the coupling of operations and explanations. It does not reduce but enhances the dependence of knowledge on decisions. On the other hand, the internal restrictions imposed by each of the chosen couplings do not allow for any arbitrary construction of results. Only few are suitable to mutually serve as inputs in a circular operation of knowledge.”

1.5 Sharp Statistical Hypotheses

Statistical science is concerned with inference and the application of probabilistic models. From what has been presented in the preceding sections, it becomes clear what the role of Statistics in scientific research is, at least in the ConsTh view of scientific research: Statistics has a dual task, to be performed on both the Theory and the Experiment sides of the diagram in Figure 1:

    Experiment                                               Theory

    Operationalization  ⇐  Experiment design    ⇐  Hypotheses formulation
          ⇓                                                    ⇑
    Effects observation    True/False               Creative interpretation
                           eigen-solution
          ⇓                                                    ⇑
    Data acquisition    ⇒  Mnemetic explanation  ⇒  Statistical analysis
     (Sample space)                                  (Parameter space)

    Figure 1: Scientific production diagram.

- On the Experiment side of the diagram, the task of statistics is to make probabilistic statements about the occurrence of pertinent events, i.e. to describe probabilistic distributions for what, where, when or which events can occur. If the events are to occur in the future, these descriptions are called predictions, as is often the case in the natural sciences. It is also possible (more often in the social sciences) to deal with observations related to past events, that may or may not be experimentally generated or repeated, imposing limitations on the quantity and/or quality of the available data. Even so, the habit of calling this type of statement “predictive probabilities” will be maintained.

- On the Theory side of the diagram, the role of statistics is to measure the statistical support of hypotheses, i.e. to measure, quantitatively, the hypotheses’ plausibility or possibility in the theoretical framework where they were formulated, given the observed data. From the material presented in the preceding sections, it is also clear that, in this role, statistics is primarily concerned with measuring the statistical support of sharp hypotheses, for sharpness (precision or discreteness) is an essential attribute of eigen-solutions.

Let us now examine how well the traditional statistical paradigms, and in contrast the FBST, are able to take care of this dual task. In order to examine this question, the first step is to distinguish what kinds of probabilistic statements can be made. We make use of three statement categories: frequentist, epistemic and Bayesian.

Frequentist probabilistic statements are made exclusively on the basis of the frequency of occurrence of an event in a (potentially) infinite sequence of observations generated by a random variable.

Epistemic probabilistic statements are made on the basis of the epistemic status (degree of belief, likelihood, truthfulness, validity) of an event from the possible outcomes generated by a random variable. This generation may be actual or potential, that is, may have been realized or not, may be observable or not, may be repeated an infinite or finite number of times.

Bayesian probabilistic statements are epistemic probabilistic statements generated by the (in practice, always finite) recursive use of Bayes’ formula:

$$ p_n(\theta) \propto p_{n-1}(\theta)\, p(x_n \,|\, \theta) . $$

In standard models, the parameter $\theta$, a non-observed random variable, and the sample $x$, an observed random variable, are related through their joint probability distribution, $p(x, \theta)$. The prior distribution, $p_0(\theta)$, is the starting point for the Bayesian recursion; it represents the initial available information about $\theta$. In particular, the prior may represent no available information, like distributions obtained via the maximum entropy principle; see Dugdale (1996) and Kapur (1989). The posterior distribution, $p_n(\theta)$, represents the available information on the parameter after the n-th “learning step”, in which Bayes’ formula is used to incorporate the information carried by observation $x_n$. Because of the recursive nature of the procedure, the posterior distribution in a given step is used as the prior in the next step.

Frequentist statistics dogmatically demands that all probabilistic statements be frequentist. Therefore, any direct probabilistic statement on the parameter space is categorically forbidden.
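As a concrete illustration of the recursion $p_n(\theta) \propto p_{n-1}(\theta)\, p(x_n|\theta)$ described above, here is a minimal sketch in Python for the conjugate Beta-Bernoulli model; the uniform prior and the data are illustrative assumptions, not taken from the text.

```python
from scipy import stats

# Recursive Bayesian updating for the Beta-Bernoulli model:
# each observation x_n in {0, 1} turns the current Beta(a, b)
# posterior into the prior of the next learning step.
a, b = 1.0, 1.0                  # p_0(theta) = Beta(1, 1), a uniform prior
data = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical observations x_1, ..., x_8

for x_n in data:
    # p_n(theta) ∝ p_{n-1}(theta) * theta^{x_n} * (1 - theta)^{1 - x_n}
    a, b = a + x_n, b + (1 - x_n)

posterior = stats.beta(a, b)
print("posterior mean of theta:", posterior.mean())
```

The same scheme holds for any model: the posterior of step $n-1$ plays the role of the prior of step $n$; conjugacy merely keeps each step in closed form.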
Scientific hypotheses are epistemic statements about the parameters of a statistical model. Figure 2 displays, over all possible samples of a given size, several measures of significance for a sharp hypothesis H. In this example H is the independence hypothesis in a 2 × 2 contingency table, with sample size n = 16; see sections A1 and B1. The horizontal axis shows the “diagonal asymmetry” statistic (the difference between the diagonal products). The statistic D is an estimator of an unnormalized version of Pearson’s correlation coefficient, ρ. For detailed explanations, see Irony et al. (1995, 2000), Stern and Zacks (2002) and Madruga, Pereira and Stern (2003).

$$ D = x_{1,1}\, x_{2,2} - x_{1,2}\, x_{2,1} \,, \qquad \rho = \frac{\sigma_{1,2}}{\sqrt{\sigma_{1,1}\, \sigma_{2,2}}} = \frac{\theta_{1,1}\, \theta_{2,2} - \theta_{1,2}\, \theta_{2,1}}{\sqrt{\theta_{1,\bullet}\, \theta_{2,\bullet}\, \theta_{\bullet,1}\, \theta_{\bullet,2}}} \,, $$

where $\theta_{i,\bullet}$ and $\theta_{\bullet,j}$ denote the row and column marginal probabilities.

Samples that are “perfectly compatible with the hypothesis”, that is, having no asymmetry, are near the center of the plot, with increasingly incompatible samples to the sides. The envelope curve for the resulting FBST e-values, to be commented on later in this section, is smooth and therefore level at its maximum, where it reaches the value 1.

In contrast, the envelope curves for the p-values take the form of a cusp, i.e. a pointed curve that is broken (non-differentiable) at its maximum, where it also reaches the value one. The acuteness of the cusp also increases with increasing sample size. In the case of the NPW p-values we see, at the top of the cusp, a “ladder” or “spike”, with several samples with no asymmetry, but having different outcome probabilities, “competing” for the higher p-value.

This is a typical collateral effect of the artifice that converts a question about the significance of H, asking for a probability in the parameter space as an answer, into a question, conditional on H being true, about the outcome probability of the observed sample, offering a probability in the sample space as an answer.

Figure 2: Independence hypothesis, n = 16. Four panels, plotted against the diagonal asymmetry statistic D: posterior probability, NPW p-value, Chi-square p-value, and FBST e-value.

This qualitative analysis of the p-value methodology gives us an insight into typical abuses of the expression “increase sample size to reject”. In the words of I.J. Good (1983, p.135):

“Very often the statistician doesn’t bother to make it quite clear whether his null hypothesis is intended to be sharp or only approximately sharp. ... It is hardly surprising then that many Fisherians (and Popperians) say that - you can’t get (much) evidence in favor of the null hypothesis but can only refute it.”

In Bayesian statistics we are allowed to make probabilistic statements in the parameter space, and also, of course, in the sample space. Thus it seems that Bayesian statistics is the right tool for the job, and so it is! Nevertheless, we must first examine the role played by sharp hypotheses in the orthodox decision-theoretic Bayesian framework, as stated in Savage (1972):

“Gambling problems in which the distributions of various quantities are prominent in the description of the gambler’s fortune seem to embrace the whole of theoretical statistics according to one view (which might be called the decision-theoretic Bayesian view) of the subject. ... From the point of view of decision-theoretic statistics, the gambler in this problem is a person who must ultimately act in one of two ways (the two guesses), one of which would be appropriate under one hypothesis (H) and the other under its negation. ... Many problems, of which this one is an instance, are roughly of the following type. A person’s opinion about unknown parameters is described by a probability distribution; he is allowed successively to purchase bits of information about the parameters, at prices that may depend (perhaps randomly) upon the unknown parameters themselves, until he finally chooses a terminal action for which he receives an award that depends upon the action and parameters.”

“I turn now to a different and, at least for me, delicate topic in connection with applications of the theory of testing. Much attention is given in the literature of statistics to what purport to be tests of hypotheses, in which the null hypothesis is such that it would not really be accepted by anyone. ... extreme (sharp) hypotheses, as I shall call them... ... The unacceptability of extreme (sharp) null hypotheses is perfectly well known; it is closely related to the often heard maxim that science disproves, but never proves, hypotheses. The role of extreme (sharp) hypotheses in science and other statistical activities seems to be important but obscure. In particular, though I, like everyone who practices statistics, have often “tested” extreme (sharp) hypotheses, I cannot give a very satisfactory analysis of the process, nor say clearly how it is related to testing as defined in this chapter and other theoretical discussions.”
As can be clearly seen, in the DecTh framework we speak about the betting odds for “the hypothesis winning a gamble taking place in the parameter space”. But since sharp hypotheses are zero (Lebesgue) measure sets, our betting odds must be null, i.e. sharp hypotheses must be (almost surely) false. If we accept the ConsTh view that an important class of hypotheses concerns the identification of eigen-solutions, and that those are ontologically sharp, we have a paradox!

From these considerations it is not surprising that frequentist and DecTh orthodoxy consider sharp hypotheses at best as anomalous crude approximations, used when the scientist is incapable of correctly specifying error bounds, cost, loss or utility functions, etc., or simply consider them to be “just plain silly”. In the words of D. Williams (2002, p.234):

“Bayesian significance of sharp hypothesis: a plea for sanity: ... It astonishes me therefore that some Bayesians now assign non-zero prior probability that a sharp hypothesis is exactly true to obtain results which seem to support strongly null hypotheses which frequentists would very definitely reject. (Of course, it is blindingly obvious that such results must follow).”
But no matter how many times statisticians reprehend scientists for their sloppiness and incompetence, they keep formulating sharp hypotheses, as if they were magnetically attracted to them. From the ConsTh plus FBST perspective they are, of course, just doing the right thing!

Decision-theoretic statistics has also developed methods to deal with sharp hypotheses, posting sometimes a scary caveat emptor for those willing to use them. The best known of such methods are Jeffreys’ tests, based on Bayes Factors, which assign a positive prior probability mass to the sharp hypothesis. This positive prior mass is supposed to work like a handicap system designed to balance the starting odds and make the game “fair”. Out of that we only get new paradoxes, like the well documented Lindley’s paradox.

In contrast, the FBST e-value, ev(H), was specially designed to effectively evaluate the support for a sharp hypothesis, H. This support function is based on the posterior probability measure of a set called the tangential set, T(H), which is a non-zero measure set (so no null probability paradoxes); see Pereira and Stern (1999), Madruga et al. (2003) and subsection A1 of the appendix.

Although ev(H) is a probability in the parameter space, it is also a possibilistic support function. The word possibilistic carries a heavy load, implying that ev(H) complies with a very specific logical (or algebraic) structure, as seen in Darwiche and Ginsberg (1992), Stern (2003, 2004), and subsection A3 of the appendix. Furthermore, the e-value has many necessary or desirable properties for a statistical support function, such as:
1- Give an intuitive and simple measure of significance for the hypothesis under test, ideally a probability defined directly in the original or natural parameter space.
2- Have an intrinsically geometric definition, independent of any non-geometric aspect, like the particular parameterization of the (manifold representing the) null hypothesis being tested, or the particular coordinate system chosen for the parameter space, i.e., be an invariant procedure.
3- Give a measure of significance that is smooth, i.e. continuous and differentiable, in the hypothesis parameters and sample statistics, under appropriate regularity conditions of the model.
4- Obey the likelihood principle, i.e., the information gathered from observations should be represented by, and only by, the likelihood function.
5- Require no ad hoc artifice like assigning a positive prior probability to zero measure sets, or setting an arbitrary initial belief ratio between hypotheses.
6- Be a possibilistic support function.
7- Be able to provide a consistent test for a given sharp hypothesis.
8- Be able to provide compositionality operations in complex models.
9- Be an exact procedure, not requiring “large sample” asymptotic approximations.
10- Allow the incorporation of previous experience or expert opinion via (subjective) prior distributions.

For a careful and detailed explanation of the FBST definition, its computational implementation, statistical and logical properties, and several already developed applications, the reader is invited to consult some of the articles in the reference list. Appendix A provides a short review of the FBST, including its definition and main properties.
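To fix ideas, the following minimal Monte Carlo sketch computes an e-value in the simplest possible setting; the binomial model, the uniform prior and the flat reference density are illustrative assumptions, not the only choices supported by the FBST (see Appendix A for the general definitions).

```python
import numpy as np
from scipy import stats

# e-value sketch: binomial data (n trials, x successes) with a uniform
# prior, so the posterior is Beta(x+1, n-x+1); sharp hypothesis
# H: theta = 1/2.  With a flat reference density, the surprise function
# is just the posterior density itself.
n, x = 16, 11
post = stats.beta(x + 1, n - x + 1)

s_star = post.pdf(0.5)             # supremum of the posterior density on H

# T(H) = {theta : p(theta|x) > s*} is the tangential set; the evidence
# against H is its posterior measure, estimated here by Monte Carlo.
theta = post.rvs(size=200_000, random_state=0)
ev_bar = np.mean(post.pdf(theta) > s_star)

print("e-value ev(H):", 1.0 - ev_bar)
```

Note that the tangential set has positive posterior measure, so the computation never conditions on, nor assigns prior mass to, the zero-measure set H itself.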
1.6 Semantic Degradation

In this section some constructivist analyses of dedifferentiation phenomena in social systems are reviewed. If the conclusions in the last section are correct, it is surprising how many times DecTh, sometimes with a very narrow pseudo-economic interpretation, has been misused in scientific statistical analysis. The difficulties of testing sharp hypotheses in the traditional statistical paradigms are well documented and extensively discussed in the literature; see for example the articles in Harlow et al. (1997). We hope the material in this section can help us understand these difficulties as symptoms of problems with much deeper roots. By no means is the author the first to point out the danger of analyses carried out by blind transplantation of categories between heterogeneous systems. In particular, regarding the abuse of economical analyses, Luhmann (1989, p.164) states:

“In this sense, it is meaningless to speak of “non-economic” costs. This is only a metaphorical way of speaking that transfers the specificity of the economic mode of thinking indiscriminately to other social systems.”
For a sociological analysis of this phenomenon in the context of science, see for example Fuchs (1996, p.310) and DiMaggio and Powell (1991, p.63):

“...higher-status sciences may, more or less aggressively, colonize lower-status fields in an attempt at reducing them to their own First Principles. For particle physics, all is quarks and the four forces. For neurophysiology, consciousness is the aggregate outcome of the behavior of neural networks. For sociobiology, philosophy is done by ants and rats with unusually large brains that utter metaphysical nonsense according to acquired reflexes. In short, successful and credible chains of reductionism usually move from the top to the bottom of disciplinary prestige hierarchies.”

“This may explain the popularity of giving an “economical understanding” to processes in functionally distinct areas even if (or perhaps because) this semantics is often hidden by statistical theory and methods based on decision-theoretic analysis. This also may explain why some areas, like ecology, sociology or psychology, are (or were) far more prone to suffer this kind of dedifferentiation by semantic degradation than others, like physics.”
Once the forces pushing towards systemic degradation are clearly exposed, we hope one can understand the following corollary of von Foerster’s famous ethical and aesthetical imperatives:
- Theoretical imperative: Preserve systemic autopoiesis and semantic integrity, for dedifferentiation is in-sanity itself.
- Operational imperative: Choose the right tool for each job: “If you only have a hammer, everything looks like a nail.”

1.7 Competing Sharp Hypotheses

In this section we examine the concept of
Competing Sharp Hypotheses.
This concept has several variants, but the basic idea is that a good scientist should never test a single sharp hypothesis, for it would be an unfair fate for the poor sharp hypothesis to stand all alone against everything else in the world. Instead, a good scientist should always confront a sharp hypothesis with a competing sharp hypothesis, making the test a fair game. As seen in Good (1983, p.167, 135, 126):

“Since I regard refutation and corroboration as both valid criteria for this demarcation it is convenient to use another term, Checkability, to embrace both processes. I regard checkability as a measure to which a theory is scientific, where checking is to be taken in both its positive and negative senses, confirming and disconfirming.”

“...If by the truth of Newtonian mechanics we mean that it is approximately true in some appropriate well defined sense we could obtain strong evidence that it is true; but if we mean by its truth that it is exactly true then it has already been refuted.”

“...I think that the initial probability is positive for every self-consistent scientific theory with consequences verifiable in a probabilistic sense. No contradiction can be inferred from this assumption since the number of statable theories is at most countably infinite (enumerable).”

“...It is very difficult to decide on numerical values for the probabilities, but it is not quite so difficult to judge the ratio of the subjective initial probabilities of two theories by comparing their complexities. This is one reason why the history of science is scientifically important.”
The competing sharp hypotheses argument does not directly contradict the epistemological framework presented in this chapter, and it may be appropriate under certain circumstances. It may also mitigate or partially remediate the paradoxes pointed out in the previous sections when testing sharp hypotheses in the traditional frequentist or orthodox Bayesian settings. However, the author believes that having competing sharp hypotheses is neither a necessary condition for good scientific practice, nor an accurate description of the history of science.

Just to stay with Good’s example, let us quickly examine the very first major incident in the tumultuous debacle of Newtonian mechanics. This incident was Michelson’s experiment on the effect of the “aethereal wind” on the speed of light; see Michelson and Morley (1887) and Lorentz et al. (1952). A clear and lively historical account of this experiment can be found in Jaffe (1960). Actually, Michelson found no such effect, i.e. he found the speed of light to be constant, invariant with the relative speed of the observer.
This result, a contradiction in Newtonian mechanics, is easily explained by Einstein’s special theory of relativity. The fundamental difference between the two theories lies in their symmetry or invariance groups: Galileo’s group for Newtonian mechanics, Lorentz’ group for special relativity. A fundamental result of physics, Noether’s Theorem, states that for every continuous symmetry in a physical theory, there must exist an invariant quantity or conservation law. For details the reader is referred to Byron and Fuller (1969, V-I, Sec. 2.7), Doncel et al. (1987), Gruber et al. (1980-98), Houtappel et al. (1965), French (1968), Landau and Lifchitz (1966), Noether (1918), Wigner (1970), and Weyl (1952). Conservation laws are sharp hypotheses ideally suited for experimental checking. Hence, it seems that we are exactly in the situation of competing sharp hypotheses, and so we are today, from a faraway historical perspective. But this is a post-mortem analysis of Newtonian mechanics. At the time of the experiment there was no competing theory. Instead of confirming an effect, specified only within an order of magnitude, Michelson found, to his and everybody else’s astonishment, an (up to the experiment’s precision) null effect.

Complex experiments like Michelson’s require a careful analysis of experimental errors, identifying all significant sources of measurement noise and fluctuation. This kind of analysis is usual in experimental physics, and motivates a brief comment on a secondary source of criticism of the use of sharp hypotheses. In the past, one often had to work with oversimplified statistical models. This situation was usually imposed by limitations such as the lack of better or more realistic models, or the unavailability of the necessary numerical algorithms or the computer power to use them. Under these limitations, one often had to use minimalist statistical models or approximation techniques, even when these models or techniques were not recommended. These models or techniques were instrumental in providing feasible tools for statistical analysis, but made it very difficult to work (or proved very ineffective) with complex systems, scarce observations, very large data sets, etc. The need to work with complex models, and other difficult situations requiring the use of sophisticated statistical methods and techniques, is very common (and many times inescapable) in research areas dealing with complex systems like biology, medicine, the social sciences, psychology, and many other fields, some of them distinguished with the mysterious appellation of “soft” science. A colleague once put it to me like this: “It seems that physics got all the easy problems...”

If there is one area where the computational techniques of Bayesian statistics have made dramatic contributions in the last decades, it is the analysis of complex models. The development of advanced statistical computational techniques like Markov Chain Monte Carlo (MCMC) methods, Bayesian and neural networks, random fields models, and many others, makes us hope that most of the problems related to the use of oversimplified models can now be overcome. Today good statistical practice requires all statistically significant influences to be incorporated into the model, and one seldom finds an acceptable excuse not to do so; see also Pereira and Stern (2001).

1.8 Final Remarks

It should once more be stressed that most of the material presented in sections 2, 3, 4, and 6 is not new in ConsTh.
Unfortunately, ConsTh has had a minor impact in statistics, and has sometimes provoked a hostile reaction from the ill-informed. One possible explanation of this state of affairs may be found in the historical development of ConsTh. The constructivist reaction to the dogmatic realism prevalent in the hard sciences, especially in the 19th and the beginning of the 20th century, raised a very outspoken rhetoric intended to make explicitly clear how naive and fragile the foundations of this oversimplistic realism were. This rhetoric was extremely successful, quickly awakening and forever changing the minds of those directly interested in the fields of history and philosophy of science, and it spread rapidly into many other areas. Unfortunately, the same rhetoric could, in a superficial reading, make ConsTh be perceived as either hostile or intrinsically incompatible with the use of quantitative and statistical methods, or as leading to extreme forms of subjectivism.

In ConsTh, or (objective) Idealism as presented in this chapter, neither does one claim to have access to a “thing in itself” or “Ding an sich” in the external environment, see Caygill (1995), as do dogmatic forms of realism, nor does one surrender to solipsism, as do skeptic forms of subjectivism, including some representatives of the subjectivist school of probability and statistics, as seen in Finetti (1974, 1.11, 7.5.7). In fact, it is the role of the external constraints imposed by the environment, together with the internal autopoietic relations of the system, to guide the convergence of the learning process to precise eigen-solutions, these being, in the end, the ultimate or real objects of scientific knowledge. As stated by Luhmann (1990a, 1995):

“...constructivism maintains nothing more than the unapproachability of the external world “in itself” and the closure of knowing - without yielding, at any rate, to the old skeptical or “solipsistic” doubt that an external world exists at all-...”
Luhmann (1990a, p.65).

“...at least in systems theory, they (statements) refer to the real world. Thus the concept of system refers to something that in reality is a system and thereby incurs the responsibility of testing its statements against reality.”
Luhmann (1995, p.12).

“...both subjectivist and objectivist theories of knowledge have to be replaced by the system / environment distinction, which then makes the distinction subject / object irrelevant.”
Luhmann (1990a, p.66).

The author hopes to have shown that ConsTh not only gives a balanced and effective view of the theoretical and experimental aspects of scientific research, but also that it is well suited (or even better suited) to give the necessary epistemological foundations for the
use of quantitative methods of statistical analysis needed in the practice of science. It should also be stressed, according to the author’s interpretation of ConsTh, how important it is to measure the statistical support for sharp hypotheses. In this setting, the author believes that, due to its statistical and logical characteristics, the FBST is the right tool for the job, and hopes to have motivated the reader to learn more about the FBST definition, its theoretical properties, its efficient computational implementation, and several of the already developed applications, in some of the articles in the reference list. This perspective opens interesting areas for further research. Among them, we mention the following two.
The first area for further research has to do with some similarities between Noether-type theorems in physics and de Finetti-type theorems in statistics. Noether theorems provide invariant physical quantities or conservation laws from the symmetry transformation groups of a physical theory, and conservation laws are sharp hypotheses par excellence. In a similar way, de Finetti-type theorems provide invariant distributions from the symmetry transformation groups of a statistical model. Those invariant distributions can in turn provide prototypical sharp hypotheses in many application areas. Physics has its own heavy apparatus to deal with the all-important issues of invariance and symmetry. Statistics, via de Finetti theorems, can provide such an apparatus for other areas, even in situations that are not naturally embedded in a heavy mathematical formalism, see Feller (1968, ch.7) and also Diaconis (1987, 1988), Eaton (1989), Nachbin (1965), Renyi (1970) and Ressel (1987).
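As a minimal numeric illustration of the link between symmetry and invariant distributions, consider de Finetti’s classical representation: a binary sequence whose joint distribution is invariant under permutations (exchangeable) is a mixture of i.i.d. Bernoulli sequences. The sketch below, with an illustrative Beta mixing distribution (a choice made here for illustration, not taken from this text), verifies that the probability of a sequence depends only on its number of successes:

```python
# Sketch: de Finetti representation for an exchangeable binary sequence.
# Assumption (illustrative): the mixing distribution is Beta(a, b). Then
# P(x_1, ..., x_n) = B(a + k, b + n - k) / B(a, b), with k = sum(x_i),
# so any two sequences with the same k (hence all permutations of a
# sequence) are equiprobable.
import numpy as np
from scipy.special import betaln

def seq_prob(x, a=2.0, b=3.0):
    """Exact probability of a binary sequence under the Beta-Bernoulli mixture."""
    x = np.asarray(x)
    n, k = x.size, x.sum()
    return np.exp(betaln(a + k, b + n - k) - betaln(a, b))

print(seq_prob([1, 1, 0, 0, 1]))  # k = 3 ...
print(seq_prob([0, 1, 1, 1, 0]))  # ... same k, hence the same probability
```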
The second area for further research has to do with one of the properties of eigen-solutions mentioned by von Foerster that has not been directly explored in this chapter, namely that eigen-solutions are “composable”, see Borges and Stern (2005) and section A4. Compositionality properties concern the relationship between the credibility, or truth value, of a complex hypothesis, H, and those of its elementary constituents, H_j, j = 1...k. Compositionality questions play a central role in analytical philosophy. According to Wittgenstein (2001, 2.0201, 5.0, 5.32):

- Every complex statement can be analyzed into its elementary constituents.
- Truth values of complex statements are the results of truth-functions (Wahrheitsfunktionen) of their elementary statements.
- All truth-functions are results of successive applications, to elementary constituents, of a finite number of truth-operations (Wahrheitsoperationen).

The same compositional concern appears in the mathematical theory of reliability (see the sketch after the following quotations):

“One of the main purposes of a mathematical theory of reliability is to develop means by which one can evaluate the reliability of a structure when the reliability of its components are known. The present study will be concerned with this kind of mathematical development. It will be necessary for this purpose to rephrase our intuitive concepts of structure, component, reliability, etc. in more formal language, to restate carefully our assumptions, and to introduce an appropriate mathematical apparatus.”

In Luhmann (1989, p.79) we find the following remark on the evolution of science that directly hints at the importance of this property:

“After the (science) system worked for several centuries under these conditions it became clear where it was leading. This is something that idealization, mathematization, abstraction, etc. do not describe adequately. It concerns the increase in the capacity of decomposition and recombination, a new formulation of knowledge as the product of analysis and synthesis. In this case analysis is what is most important because the further decomposition of the visible world into still further decomposable molecules and atoms, into genetic structures of life or even into the sequence human/role/action/action-components as elementary units of systems uncovers an enormous potential for recombination.”
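As a minimal sketch of compositionality in reliability theory, assuming independent components and made-up component reliabilities (illustrative values, not from this text), a series structure works only if every component works, while a parallel structure fails only if every component fails:

```python
# Sketch: composing system reliability from component reliabilities,
# assuming independent components (reliability values are illustrative).
from math import prod

def series(reliabilities):
    """Series structure: works only if every component works."""
    return prod(reliabilities)

def parallel(reliabilities):
    """Parallel structure: fails only if every component fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

r = [0.95, 0.90, 0.99]
print(f"series:   {series(r):.4f}")    # 0.8465
print(f"parallel: {parallel(r):.4f}")  # 0.99995
```

In the FBST setting, Borges and Stern (2005) study analogous composition rules for the truth values (e-values) of complex hypotheses built from elementary constituents.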
In the author’s view, the composition (or recombination) of scientific knowledge and its use, so relevant in technology development and engineering, can give us a different perspective (perhaps a bottom-up, as opposed to the top-down perspective adopted in this chapter) on the importance of sharp hypotheses in science and technology practice. It can also provide some insight into the valid forms of interaction of science with other social systems or, in Luhmann’s terminology, into how science does (or should) “resonate” in human society.
Chapter 2: Language and the Self-Reference Paradox

“If the string is too tight it will snap, but if it is too loose it will not play.”
Siddhartha Gautama. “The most beautiful thing we can experience is the mysterious.It is the source of all true art and all science. He to whomthis emotion is a stranger, who can no longer pause to wonderand stand rapt in awe, is as good as dead: His eyes are closed.”
Albert Einstein (1879 - 1955).
2.1 Introduction

In Chapter 1 it is shown how the eigen-solutions found in the practice of science are naturally represented by statistical sharp hypotheses. Statistical sharp hypotheses are routinely stated as natural “laws”, conservation “principles” or invariant “transforms”, and most often take the form of functional equations, like h(x) = c. Chapter 1 also discusses why the eigen-solutions’ essential attributes of discreteness (sharpness), stability, and composability indicate that considering such hypotheses in the practice of science is natural and reasonable. Surprisingly, the two standard statistical theories for testing hypotheses, the classical (frequentist p-values) and the orthodox Bayesian (Bayes factors), have well-known and documented problems in handling or interpreting sharp hypotheses.
These problems are thoroughly reviewed, from statistical, methodological, systemic and epistemological perspectives.

Chapter 1 and Appendix A present the FBST, or Full Bayesian Significance Test, an unorthodox Bayesian significance test specifically designed for this task. The mathematical and statistical properties of the FBST are carefully analyzed. In particular, it is shown how the FBST fully supports the test and identification of eigen-solutions in the practice of science, using procedures that take into account all the essential attributes pointed out by von Foerster. In contrast to some alternative belief calculi or logical formalisms based on discrete algebraic structures, the FBST is based on continuous statistical models. This makes it easy to support concepts like sharp hypotheses, asymptotic convergence and stability, and these are essential concepts in the representation of eigen-solutions. The same chapter presents cognitive constructivism as a coherent epistemological framework that is compatible with the FBST formalism, and vice-versa. I will refer to this setting as the Cognitive Constructivism plus FBST formalism, or CogCon+FBST framework for short.

The discussion in Chapter 1 raised some interesting questions, some of which we will try to answer in the present chapter. The first question relates to the role and the importance of language in the emergence of eigen-solutions and is discussed in section 2. In answering it, we make extensive use of William Rasch’s “two-front war” metaphor of cognitive constructivism, as expounded in Rasch (2000). As explained in section 4, this is the war against dogmatic realism on one front, and against skepticism or solipsism on the second. The results of the first part of the chapter are summarized in section 5. To illustrate his arguments, Rasch uses some ideas of Niels Bohr concerning quantum mechanics. In section 3, we use some of the same ideas to give concrete examples of the topics under discussion. The importance (and also the mystery) related to the role of language in the practice of science was one of the major concerns of Bohr’s philosophical writings, see Bohr (1987, I-IV), as exemplified by his famous “dirty dishes” metaphor:

“Washing dishes and language can in some respects be compared. We have dirty dishwater and dirty towels and nevertheless finally succeed in getting the plates and glasses clean. Likewise, we have unclear terms and a logic limited in an unknown way in its field of application - but nevertheless we succeed in using it to bring clearness to our understanding of nature.”
Bohr (2007).

The second question, posed by Søren Brier, which asks whether the CogCon+FBST framework is compatible with and can benefit from the concepts of Semiotics and Peircean philosophy, is addressed in section 6. In section 7 I present my final remarks.

Before ending this section, a few key definitions related to the concept of eigen-solution are reviewed. As stated in Maturana and Varela (1980, p.10), the concept of recurrent state is the key to understanding the concept of cognitive domain in an autopoietic system.

“Living systems as units of interaction specified by their conditions of being living systems cannot enter into interactions that are not specified by their organization. The circularity of their organization continuously brings them back to the same internal state (same with respect to the cyclic process). Each internal state requires that certain conditions (interactions with the environment) be satisfied in order to proceed to the next state. Thus the circular organization implies the prediction that an interaction that took place once will take place again. If this does not happen the system maintains its integrity (identity with respect to the observer) and enters into a new prediction. In a continuously changing environment these predictions can only be successful if the environment does not change in that which is predicted. Accordingly, the predictions implied in the organization of the living system are not predictions of particular events, but of classes of interactions. Every interaction is a particular interaction, but every prediction is a prediction of a class of interactions that is defined by those features of its elements that will allow the living system to retain its circular organization after the interaction, and thus, to interact again. This makes living systems inferential systems, and their domain of interactions a cognitive domain.”

The epistemological importance of these circular (cyclic or recursive) regenerative processes and their eigen- (auto, equilibrium, fixed, homeostatic, invariant, recurrent, recursive) states, both in concrete and abstract autopoietic systems, is further investigated in Foerster and Segal (2001, p.145, 127-128):

“The meaning of recursion is to run through one’s own path again. One of its results is that under certain conditions there exist indeed solutions which, when reentered into the formalism, produce again the same solution. These are called “eigen-values”, “eigen-functions”, “eigen-behaviors”, etc., depending on which domain this formation is applied - in the domain of numbers, in functions, in behaviors, etc.”

“Objects are tokens for eigen-behaviors. Tokens stand for something else. In exchange for money (a token itself for gold held by one’s government, but unfortunately no longer redeemable), tokens are used to gain admittance to the subway or to play pinball machines. In the cognitive realm, objects are the token names we give to our eigen-behavior. When you speak about a ball, you are talking about the experience arising from your recursive sensorimotor behavior when interacting with that something you call a ball. The “ball” as object becomes a token in our experience and language for that behavior which you know how to do when you handle a ball. This is the constructivist’s insight into what takes place when we talk about our experience with objects.”
2.2 Eigen-Solutions and Language

Von Foerster also establishes several essential attributes of these eigen-solutions, as quoted in the following paragraph from Foerster (2003c, p.266). These essential attributes can be translated into very specific mathematical properties, which are of prime importance when investigating several aspects of the CogCon+FBST framework.

“Eigenvalues have been found ontologically to be discrete, stable, separable and composable, while ontogenetically to arise as equilibria that determine themselves through circular processes. Ontologically, Eigenvalues and objects, and likewise, ontogenetically, stable behavior and the manifestation of a subject’s “grasp” of an object cannot be distinguished.”
Goudsmit (1998, sec.2.3.3, Objects as warrants for eigenvalues) finds an apparent disagreement between the forms in which eigen-solutions emerge according to von Foerster and according to Maturana:

“Generally, von Foersters concept of eigenvalue concerns the value of a function after a repeated (iterative) application of a particular operation. ... This may eventually result in a stable performance, which is an eigenvalue of the observers behavior. The emerging objects are warrants of the existence of these eigenvalues.

... contrary to von Foerster, Maturana considers the consensuality of distinctions as necessary for the bringing forth of objects. It is through the attainment of consensual distinctions that individuals are able to create objects in language.”
Confirmation of the position attributed by Goudsmit to von Foerster can be found in several of his articles. In Foerster (2003a, p.3), for example, one finds:

“... I propose to continue the use of the term ‘self-organizing system,’ whilst being aware of the fact that this term becomes meaningless, unless the system is in close contact with an environment, which possesses available energy and order, and with which our system is in a state of perpetual interaction, such that it somehow manages to ‘live’ on the expenses of this environment. ...

... So both the self-organizing system plus the energy and order of the environment have to be given some kind of pre-given objective reality for this viewpoint to function.”

Maturana’s contrasting standpoint, in turn, is expressed in statements like the following:

“Objectivity. Objects arise in language as consensual coordinations of actions that in a domain of consensual distinctions are tokens for more basic coordinations of actions, which they obscure. Without language and outside language there are no objects because objects only arise as consensual coordinations of actions in the recursion of consensual coordinations of actions that languaging is. For living systems that do not operate in language there are no objects; or in other words, objects are not part of their cognitive domains. ... Objects are operational relations in languaging.”
The standpoint of Maturana is further characterized in the following paragraphs from Brier (2005, p.374):

“The process of human knowing is the process in which we, through languaging, create the difference between the world and ourselves; between the self and the non-self, and thereby, to some extent, create the world by creating ourselves. But we do it by relating to a common reality which is in some way before we made the difference between ‘the world’ and ‘ourselves’ make a difference, and we do it on some kind of implicit belief in a basic kind of order ‘beneath it all’. I do agree that it does not make sense to claim that the world exists completely independently of us. But on the other hand it does not make sense to claim that it is a pure product of our explanations or conscious imagination.”

“...it is clear that we do not create the trees and the mountains through our experiencing or conversation alone. But Maturana is close to claiming that this is what we do.”
In order to understand the above comments, one must realize that Maturana’s viewpoints, or at least his rhetoric, changed greatly over time, ranging from the measured and precise statements in Maturana and Varela (1980) to some extreme positions assumed in Maturana (1991, p.36-44), see the next paragraph. Maturana must have had in mind the celebrated quote by Albert Einstein at the beginning of this chapter.

“Einstein said, and many other scientists have agreed with him, that scientific theories are free creations of the human mind, and he marveled that through them one could understand the universe. The criterion of validation of scientific explanation as operations in the praxis of living of the observer, however, permit us to see how it is that the first reflection of Einstein is valid, and how it is that there is nothing marvelous in that it is so.”

“Scientific explanations arise operationally as generative mechanisms accepted by us as scientists through operations that do not entail or imply any supposition about an independent reality, so that in fact there is no confrontation with one, nor is it necessary to have one even if we believe that we can have one.”

“Quantification (or measurements) and predictions can be used in the generation of a scientific explanation but do not constitute the source of its validity. The notions of falsifiability (Popper), verificability, or confirmation would apply to the validation of scientific knowledge only if this were a cognitive domain that revealed, directly or indirectly, by denotation or connotation, a transcendental reality independent of what the observer does...”

“Nature is an explanatory proposition of our experience with elements of our experience. Indeed, we human beings constitute nature with our explaining, and with our scientific explaining we constitute nature as the domain in which we exist as human beings (or languaging living systems).”
Brier (2005, p.375) further contrasts the standpoint of Maturana with that of von Foerster:

“Von Foerster is more aware of the philosophical demand that to put up a new epistemological position one has to deal with the problem of solipsism and of pure social constructivism.”

“The Eigenfunctions do not just come out of the blue. In some, yet only dimly viewed, way the existence of nature and its ‘things’ and our existence are intertwined in such a way that makes it very difficult to talk about. Von Foerster realizes that to accept the reality of the biological systems of the observer leads into further acceptance about the structure of the environment.”
While the position adopted by von Foerster appears to be more realistic or objective, the one adopted by Maturana seems more idealistic or (inter)subjective. Can these two different positions, which may seem so discrepant, be reconciled? Do we have to choose between an idealistic and a realistic position, or can we rather have both? This is one of the questions we address in the next sections.

In Chapter 1 we used an example of a physical eigen-solution (a physical invariant) to illustrate the ideas under discussion, namely, the speed of light constant, c. Historically, this example is tied to the birth of Special Relativity theory and the debacle of classical physics. In this chapter we will illustrate them with another important historical example, namely, the Einstein-Podolsky-Rosen paradox. Historically, this example is tied to questions concerning the interpretation of quantum mechanics. This is one of the main topics of the next section.

2.3 The Languages of Science

At the end of the 19th century, classical physics was the serene sovereign of science. Its glory was consensual and uncontroversial. However, at the beginning of the 20th century, a few experimental results challenged the explanatory power of classical physics. The problems appeared on two major fronts that, from a historical perspective, can be linked to the theories (at that time still nonexistent) of Special Relativity and quantum mechanics.

At that time, the general perception of the scientific community was that these few open problems could, should and would be accommodated in the framework of classical physics. Crafting sophisticated structural models, such as those for the structure of ether (the medium in which light was supposed to propagate) and those for atomic structure, was typical of the effort to circumvent these open problems by artfully maneuvering classical physics. But physics and engineering laboratories insisted, building up a barrage of new and challenging experimental results.

The difficulties with the explanations offered by classical physics not only persisted, but also grew in number and strength. By 1940 the consensus was that classical physics had been brutally defeated, and Relativity and quantum mechanics were acclaimed as the new sovereigns. Let us closely examine some facts concerning the development of quantum mechanics (QM).

One of the first steps in the direction of a comprehensive QM theory was taken in 1924 by Louis de Broglie, who postulated the particle-wave duality principle, which states that every moving particle has an associated pilot wave of wavelength λ = h/mv, where h is Planck’s constant and mv is the particle’s momentum, i.e., the product of its mass and velocity. In 1926 Erwin Schrödinger stated his wave equation, capable of explaining all known quantum phenomena and of predicting several new ones that were later confirmed by new experiments. Schrödinger’s theory is known as Orthodox QM, see Tomonaga (1962) and Pais (1988) for detailed historical accounts. Orthodox QM uses a mathematical formalism based on a complex wave equation, and shares much of the descriptive language of de Broglie’s particle-wave duality principle.

There is, however, something odd in the wave-particle descriptions of orthodox QM. When describing a model we speak of each side of a double-faced wave-particle entity as if each side existed by itself, and then inextricably fuse them together in the mathematical formalism. Quoting Cohen (1989, p.87),
“Notice how our language shapes our imagination. To say that a particle is moving in a straight line really means that we can set up particle detectors along the straight line and observe the signals they send. These signals would be consistent with a model of the particle as a single chunk of mass moving (back and forth) in accordance with Newtonian particle physics. It is important to emphasize that we are not claiming that we know what the particle is, but only what we would observe if we set up those particle detectors.”
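As a small numeric aside on the de Broglie relation λ = h/mv mentioned above, the following sketch computes the pilot-wave wavelength of an electron; the chosen speed is purely illustrative:

```python
# Sketch: de Broglie wavelength, lambda = h / (m * v).
# Constants are CODATA values; the velocity is an illustrative choice.
h = 6.62607015e-34      # Planck's constant (J s)
m_e = 9.1093837015e-31  # electron mass (kg)
v = 1.0e6               # electron speed (m/s), illustrative

wavelength = h / (m_e * v)
print(f"lambda = {wavelength:.3e} m")  # about 7.27e-10 m, i.e. ~0.73 nm
```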
From Schrödinger’s equation we can derive Heisenberg’s uncertainty principle, which states that we cannot go around measuring everything we want until we pin down every single detail about (the classical entities in our wave-particle model of) reality. One instance of the Heisenberg uncertainty principle states that we cannot simultaneously measure a particle’s position and momentum beyond a certain accuracy (in modern notation, ∆x · ∆p ≥ ħ/2, where ħ = h/2π). One way of interpreting this instance of the Heisenberg uncertainty principle goes as follows: In classical Newtonian physics our particles are “big enough” that our measurement devices can obtain the information we need about the particle without disturbing it. In QM, on the other hand, the particles are so small that the measurement operation will always disturb the particle. For example, the light we have to use in order to illuminate the scene, so that we can see where the particle is, has to be so strong, relative to the particle’s size, that it “blows” the particle away, changing its velocity. The consequence is that we cannot (neither in practice nor even in principle) simultaneously measure, with arbitrary precision, both the particle’s position and its momentum. Hence, we have to learn how to tame our imagination and constrain our language.

The need to exercise a strict discipline over what kinds of statements to use was a lesson learned by 20th-century physics - a lesson that mathematics had to learn a bit earlier. A classical example from set theory of a statement that cannot be allowed is Russell’s catalog (class, set), defined in Robert (1988, p.x) as:

“The ‘catalogue of all catalogues not mentioning themselves.’ Should one include this catalogue in itself? ... Both decisions lead to a contradiction!”
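As an aside, the self-referential structure of Russell’s catalogue can be mimicked computationally. The following toy sketch (an illustration of the analogy, not anything from this text) defines a predicate that holds of exactly those predicates that do not hold of themselves, and then applies it to itself; the evaluation can settle on neither True nor False:

```python
# Sketch: a computational analogue of Russell's paradox. Asking whether
# the predicate holds of itself leads to an evaluation that never
# terminates; Python reports this as a RecursionError.
def russell(pred):
    return not pred(pred)

try:
    russell(russell)  # "does russell hold of itself?"
except RecursionError:
    print("no consistent answer: the self-application does not terminate")
```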
Robert (1988) indicates several ways of avoiding this paradox (or antinomy). All of them imply imposing a (very reasonable) set of rules on how to form valid statements. Under any of these rules, Russell’s definition becomes an invalid or ill posed statement and, as such, should be disregarded, see Halmos (1998, ch.1 and 2) and Dugundji (1966, ch.1) for introductory texts and Aczel (1988) for an alternative view. Measure theory (of Borel, Lebesgue, Haar, etc.) was a fundamental achievement of 20th-century mathematics. It defines measures (notions such as mass, volume and probability) for parts of R^n. However, not all parts of R^n are included, and we must refrain from speaking about the measure of inadmissible (non-measurable) sets, see Ulam (1943) for a short article, Kolmogorov and Fomin (1982) for a standard text, and Nachbin (1965) and Bernardo (1993) for extensions pertinent to the FBST formalism. The main subject in Robert (1988) is Non Standard Analysis, a form of extending the languages of both Set Theory and Real Analysis, see the observations in section 6.6 and also Davis (1977, sec.3.4), Goldblatt (1998) and Nelson (1987).

The historical development of the QM formalisms is summarized in the following passage:

“Historically, ... quantum mechanics developed in three stages. First came a collection of ad hoc assumptions and then a cookbook of equations known as (orthodox) quantum mechanics. The equations and their philosophical underpinning were then collected into a model based on mathematics of Hilbert space. From the Hilbert space model came the abstraction of quantum logics.”

From the above historical comments we draw the following conclusions:
3.1. Each of the QM formalisms discussed in this section, namely, de Broglie’s wave-particle duality principle, Schrödinger’s orthodox QM and Hilbert space abstract QM, operates like a language. Maturana stated that objects arise in language. He seems to be right.

3.2. It seems also that new languages must be created (or discovered) to provide us with the objects corresponding to the structure of the environment, as stated by von Foerster.

3.3. Exercising a strict discipline concerning what kinds of statements can be used in a given language and context seems to be vital in many areas.

3.4. It is far from trivial to create, craft, discover, find and/or use a language so that “it works”, providing us with the “right” objects (eigen-solutions).

3.5. Even when everything looks (to the entire community) fine and well, new empirical evidence can bring our theories down like a castle of cards.

As indicated by an anonymous referee, abstract formalisms or languages do not exist in a vacuum, but sit on top of (or are embedded in) natural (or less abstract) languages. This brings us to the interesting and highly relevant issues of hierarchical language structures and constructive ladders of objects, including interdependence analyses between objects at different levels of such complex structures, see Piaget (1975) for an early reference. For a recent concrete example of the scientific relevance of such interdependences in the field of Psychology, using a Factor Analysis statistical model, see Shedler and Westen (2004, 2005). These issues are among the main topics addressed in chapter 3 and forthcoming articles.
2.4 The Self-Reference Paradox

The conclusions established in the previous section may look reasonable. In 3.4, however, what exactly are the “right” objects? Clearly, the “right” objects are “those” objects we more or less clearly see and can point at, using as reference the language we currently use.

There! I have just fallen, head-on, into the quicksands of the self-reference paradox. Don’t worry (or do worry), but note this: The self-reference paradox is unavoidable, especially as long as we use English or any other natural human language.

Rasch (2000, p.73, 85) has produced a very good description of the self-reference paradox and some of its consequences:

“having it both ways seems a necessary consequence... One cannot just have it dogmatically one way, nor skeptically the other... One oscillates, therefore, between the two positions, neither denying reality nor denying reality’s essentially constructed nature. One calls this not idealism or realism, but (cognitive) constructivism.”

“What do we call this oscillation? We call it paradox. Self-reference and paradox - sort of like love and marriage, horse and carriage.”

Cognitive constructivism implies a double rejection: that of a solipsist denial of reality, and that of any dogmatic knowledge of the same reality. Rasch uses the “two-front war” metaphor to describe this double rejection. Carrying the metaphor a bit further, the enemies of cognitive constructivism could be portrayed, or caricatured, as follows:

- Dogmatism despotically requires us to believe in its (latest) theory. Its statements and reasons should be passively accepted, with fanatic resignation, as infallible truth.
- Solipsism’s anarchic distrust wishes to preclude any established order in the world. Solipsism wishes to transform us into autistic skeptics, incapable of establishing any stable knowledge about the environment in which we live.

We refer to Caygill (1995, dogmatism) for a historical perspective on the Kantian use of some of the above terms.

Any military strategist will be aware of the danger in the oscillation described by Rasch, which alternately exposes a weak front. The enemy at our strong front will be subjugated, but the enemy at our weak front will hit us hard. Rasch sees a solution to this conundrum, even while recognizing that this solution may be difficult to achieve, Rasch (2000, p.85):

“There is a third choice: to locate oneself directly on the invisible line that must be drawn for there to be a distinction mind / body (system / environment) in the first place. Yet when one attempts to land on that perfect center, one finds oneself oscillating wildly from side to side, perhaps preferring the mind (system) side, but overcompensating to the body (environment) side - or vice versa.

The history of post-Kantian German idealism is a history of the failed search for this perfect middle, this origin or neutral ground outside both mind and body that would nevertheless actualize itself as a perfect transparent mind/body within history. Thus, much of contemporary philosophy that both follows and rejects that tradition has become fascinated by, even if trapped in, the mind/body oscillation.”
So, the question is: How do we land on Rasch’s fine (invisible) line, finding the perfect center and avoiding dangerous oscillations? This is the topic of the next section.
2.5 Objective Idealism and Pragmatism

We are now ready for a few definitions of basic epistemological terms. These definitions should help us build epistemic statements in a clear and coherent form, according to the CogCon+FBST perspective.
Definition 5.1 (Known Object): An actual (potential) eigen-solution of a given system’s interaction with its environment. In the sequel, we may use a somewhat more friendly terminology by simply using the term Object.

Definition 5.2 (Objectivity): Degree of conformance of an object to the essential attributes of an eigen-solution.

Definition 5.3 (Reality): A (maximal) set of objects, as recognized by a given system, when interacting with single objects or with compositions of objects in that set.

Definition 5.4 (Idealism): Belief that a system’s knowledge of an object is always dependent on the system’s autopoietic relations.

Definition 5.5 (Realism): Belief that a system’s knowledge of an object is always dependent on the environment’s constraints.

Definition 5.6 (Solipsism or Skepticism): Idealism without Realism.

Definition 5.7 (Dogmatic Realism): Realism without Idealism.

Definition 5.8 (Objective Idealism): Idealism and Realism.
Definition 5.9 (Something in itself): This expression, used in reference to a specific object, is a marker or label for ill posed statements.

CogCon+FBST assumes an objective and idealistic epistemology. Definition 5.9 labels some ill posed dogmatic statements. Often, the description of the method used to access something in itself looks like:
- Something that an observer would observe if the (same) observer did not exist, or
- Something that an observer could observe if he made no observations, or
- Something that an observer should observe in the environment without interacting with it (or disturbing it in any way),
and many other equally nonsensical variations.

Some readers may not like this way of labeling this kind of invalid statement, preferring to use, instead, a more elaborate terminology, such as “object in parenthesis” (approximately) for object, “object without parenthesis” (approximately) for something in itself, etc. There may be good reasons for doing so; for example, this elaborate language has the advantage of automatically stressing the differences between constructivist and dogmatic epistemologies, see Maturana (1988), Maturana and Poerksen (2004) and Steier (1991). Nevertheless, we have chosen our definitions in agreement with some very pragmatic advice given in Bopry (2002):

“Objectivity as defined by a (dogmatic) realist epistemology may not exist within a constructivist epistemology; but, part of making that alternative epistemology acceptable is gaining general acceptance of its terminology. As long as the common use of the terms is at odds with the concepts of an epistemological position, that position is at a disadvantage. Alternative forms of inquiry need to coopt terminology in a way that is consistent with its own epistemology. I suggest that this is not so difficult. The term objective can be taken back...”

Among definitions 5.1 to 5.9, definition 5.2 plays a key role. It allows us to say how well an eigen-solution manifests von Foerster’s essential attributes and, consequently, how good (objective) our knowledge of it is. However, the degree of objectivity cannot be assessed in the abstract; it must be assessed by the means and methods of a given empirical science, namely the one within which the eigen-solution is presented. Hence, definition 5.2 relies on an “operational approach”, and not on metaphysical arguments. Such an operational approach may be viewed with disdain by some philosophical schools. Nevertheless, for C.S.Peirce it is “The Kernel of Pragmatism”, CP 5.464-465:

“Suffice it to say once more that pragmatism is, in itself, no doctrine of metaphysics, no attempt to determine any truth of things. It is merely a method of ascertaining the meanings of hard words and of abstract concepts. ... All pragmatists will further agree that their method of ascertaining the meanings of words and concepts is no other than that experimental method by which all the successful sciences (in which number nobody in his senses would include metaphysics) have reached the degrees of certainty that are severally proper to them today; this experimental method being itself nothing but a particular application of an older logical rule, ‘By their fruits ye shall know them’.”
Definition 5.2 also requires a belief calculus specifically designed to measure statistical significance, that is, the degree of support given by empirical data to the existence of an eigen-solution. In Chapter 1 we showed why confirming the existence of an eigen-solution naturally corresponds to testing a sharp statistical hypothesis, and why the mathematical properties of FBST e-values correspond to the essential attributes of an eigen-solution as stated by von Foerster. In this sense, the FBST calculus is perfectly adequate to support the use of the term Objective and correlated terms in scientific language. Among the most important properties of the e-value mentioned in Chapter 1 and Appendix A, we find:
- Continuity: the e-value gives a measure of significance that is smooth, i.e., continuous and differentiable, in the hypothesis parameters and the sample statistics, under appropriate regularity conditions of the statistical model.
- Consistency: the e-value provides a consistent, that is, asymptotically convergent, significance measure for a given sharp hypothesis.

Therefore, the FBST calculus is a formalism that allows us to assess, continuously and consistently, the objectivity of an eigen-solution, by means of a convergent significance measure, see Chapter 1. We should stress, once more, that achieving comparable goals using alternative formalisms based on discrete algebraic structures may be, in general, rather difficult. Hence, our answer to the question of how to land on Rasch’s perfect center is: Replace unstable oscillation with stable convergence!

Any dispute about objectivity (the epistemic quality or value of an object of knowledge) should be critically examined and evaluated within this pragmatic program. This program (in Luhmann’s sense) includes the means and methods of the empirical science in which the object of knowledge is presented, and the FBST belief calculus, used to evaluate the empirical support of an object, given the available experimental data.

Even if over-optimistic (actually hopelessly utopian), it is worth restating Leibniz’s motto of
Calculemus, as found in Gerhardt (1890, v.7, p.64-65):

“Quo facto, quando orientur controversiae, non magis disputatione opus erit inter duos philosophos, quam inter duos Computistas. Sufficiet enim calamos in manus sumere sedereque ad abacos, et sibi mutuo (accito si placet amico) dicere: Calculemus.”
A contemporary translation could read:
Actually, if controversies were to arise, there would be no more need for dispute between two philosophers than between two statisticians. It would suffice for them to reach for their computers and, in friendly understanding, say to each other: Let us calculate!
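Taking the Calculemus motto literally, here is a minimal Monte Carlo sketch of an e-value computation, following the FBST definition reviewed in Appendix A. The model (Bernoulli trials with a uniform prior), the data, and the sharp hypothesis H: p = 1/2 are all illustrative choices, not taken from the text:

```python
# Sketch: FBST e-value for H: p = 1/2 under a Beta posterior (uniform prior).
# The tangential set is T = {p : f(p | data) > f*}, where f* is the supremum
# of the posterior density over H; the e-value is the posterior mass outside T.
import numpy as np
from scipy.stats import beta

n, x = 100, 58                      # illustrative data: 58 successes in 100 trials
posterior = beta(x + 1, n - x + 1)  # Beta(x + 1, n - x + 1) posterior

f_star = posterior.pdf(0.5)         # posterior density at the single point of H
samples = posterior.rvs(size=200_000, random_state=42)
ev = np.mean(posterior.pdf(samples) <= f_star)

print(f"e-value supporting H: p = 1/2 is about {ev:.3f}")
```

An e-value near zero indicates that almost all posterior mass lies on points of higher posterior density than any point of H, casting doubt on the sharp hypothesis; an e-value near one supports it.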
2.6 Semiotics and the Philosophy of C.S.Peirce

In the previous sections we presented an epistemological perspective based on a pragmatic objective idealism. Objective idealism and pragmatism are also distinctive characteristics of the philosophy of C.S.Peirce. Hence the following question, posed by Søren Brier, which we examine in this section: Is the CogCon+FBST framework compatible with, and can it benefit from, the concepts of Semiotics and Peircean philosophy?

In Chapter 1 we had already explored the idea that eigen-solutions, as discrete entities, can be named, i.e., become signs in a language system, as pointed out by von Foerster in Segal (2001, p.128):

“There is an additional point I want to make, an important point. Out of an infinite continuum of possibilities, recursive operations carve out a precise set of discrete solutions. Eigen-behavior generates discrete, identifiable entities. Producing discreteness out of infinite variety has incredibly important consequences. It permits us to begin naming things. Language is the possibility of carving out of an infinite number of possible experiences those experiences which allow stable interactions of your-self with yourself.”

We believe that the process of recursively “discovering” objects of knowledge, identifying them by signs in language systems, and using these languages to “think” and structure our lives as self-conscious beings, is the key to understanding concepts such as signification and meaning. These ideas are explored, in a great variety of contexts, in Bakken and Hernes (2002), Brier (1995), Ceruti (1989), Efran et al. (1990), Eibel-Eibesfeldt (1970), Ibri (1992), Piaget (1975), Wenger et al. (1999), Winograd and Flores (1987) and many others. Conceivably, the key underlying common principle is stated in Brier (2005, p.395):

“The key to the understanding of understanding, consciousness, and communication is that both the animals and we humans live in a self-organized signification sphere which we not only project around us but also project deep inside our systems. Von Uexküll calls it “Innenwelt” (Brier 2001). The organization of signs and the meaning they get through the habits of mind and body follow very much the principles of second order cybernetics in that they produce their own Eigenvalues of sign and meaning and thereby create their own internal mental organization. I call this realm of possible sign processes for the signification sphere. In humans these signs are organized into language through social self-conscious communication, and accordingly our universe is organized also as and through texts. But of course that is not an explanation of meaning.”
When studying the organization of self-conscious beings and trying to understand semantic concepts such as signification and meaning, or teleological concepts such as finality, intent and purpose, we move towards domains concerning systems of increasing complexity that are organized as higher hierarchical structures, like the domains of the phenomenological, psychological or sociological sciences. In so doing, we leave the domains of the natural and technical sciences behind, at least for a moment; see Brent and Bruck (2006) and Muggleton (2006), in last month’s issue of Nature (March 2006, when this article was written), for two perspectives on future developments.

As observed in Brier (2001), the perception of the objects of knowledge changes from more objective or realistic to more idealistic or (inter)subjective as we progress to higher hierarchical levels. Nevertheless, we believe that the fundamental nature of objects of knowledge as eigen-solutions, with all the essential attributes pointed out by von Foerster, remains just the same. Therefore, a sign, as understood in the CogCon+FBST framework, always stands for the following triad:
S-1. Some perceived aspects, characteristics, etc. concerning the organization of the autopoietic system.

S-2. Some perceived aspects, characteristics, etc. concerning the structure of the system’s environment.

S-3. Some object (a discrete, separable, stable and composable eigen-solution based on the particular aspects stated in S-1 and S-2) concerning the interaction of the autopoietic system with its environment.

This triadic character of signs brings us, once again, close to the semiotic theory of C.S.Peirce, offering many opportunities for further theoretical and applied research. For example, we are currently using statistical psychometric analyses in an applied semiotic project for the development of software user interfaces; for related examples see Ferreira (2006). We defer, however, the exploration of these opportunities to forthcoming articles. In the remainder of this section we focus on a more basic investigation that, we believe, is a necessary preliminary step to be undertaken in order to acquire a clear conceptual horizon, one that will assist a sound and steady progress in our future research. The purpose of this investigation is to find out whether the CogCon+FBST framework can find a truly compatible ground in the basic concepts of Peircean philosophy. We proceed by establishing a conceptual mapping of the fundamental concepts used to define the CogCon+FBST epistemological framework into analogous concepts in Peircean philosophy. Before we start, however, a word of caution: the work of C.S.Peirce is extremely rich, and open to many alternative interpretations. Our goal is to establish the compatibility of CogCon+FBST with one possible interpretation, and not to ascertain reductionist deductions in any direction.

The FBST is a continuous statistical formalism. Our first step in constructing this conceptual mapping addresses the following questions: Is such a formalism amenable to a Peircean perspective? If so, which concepts in Peircean philosophy can support the use of such a formalism?
The FBST is a probability-theory-based statistical formalism. Can the probabilistic concepts of the FBST find the necessary support in concepts of Peircean philosophy? We believe that Tychism is such a concept in Peircean philosophy, providing the first element in our conceptual mapping. In CP 6.201, Tychism is defined as:

“... the doctrine that absolute chance is a factor of the universe.”
As stated in the previous section, the CogCon+FBST program pursues the stable convergence of the epistemic e-values given by the FBST formalism. The fact that the FBST is a belief calculus based on continuous mathematics is essential for its consistency and convergence properties. Again we have to ask: Does the continuity of the FBST formalism find analogous support in Peircean philosophy? We believe that it does, in Peirce’s concept of Synechism, defined as

“that tendency of philosophical thought which insists upon the idea of continuity as of prime importance in philosophy and, in particular, upon the necessity of hypotheses involving true continuity.”
A key epistemological concept in the CogCon+FBST perspective is the notion of eigen-solution. Although the system-theoretic concept of eigen-solution cannot possibly have an exact correspondent in Peirce’s philosophy, we believe that Peirce’s fundamental concept of “Habit” or “Insistency” offers an adequate analog. Habit, and reality, are defined as:

“The existence of things consists in their regular behavior.”, CP 1.411.

“Reality is insistency. That is what we mean by ‘reality’. It is the brute irrational insistency that forces us to acknowledge the reality of what we experience, that gives us our conviction of any singular.”, CP 6.340.

However, the CogCon+FBST concept of eigen-solution is characterized by von Foerster through several essential attributes. Consequently, for the conceptual mapping under construction to be coherent, these characteristics have to be mapped accordingly. In the following paragraphs we show that the essential attributes of sharpness (discreteness), stability and compositionality can indeed be adequately represented.
The first essential attribute of eigen-solutions stated by von Foerster is discreteness or sharpness. As stated in Chapter 1, it is important to realize that the term ‘discrete’, used by von Foerster to qualify eigen-solutions in general, should be replaced, depending on the specific context, by terms such as lower-dimensional, precise, sharp, singular, etc. As physical laws or physical invariants, sharp hypotheses are formulated as mathematical equations.

Can Peircean philosophy offer good support for sharp hypotheses? Again, we believe that the answer is in the affirmative. The following quotations should make that clear. The first three passages are taken from Ibri (1992, p.84-85) and the next two from CP 1.487 and CP 1.415; see also NEM 4, p.136-137 and CP 6.203.

“an object (a thing) IS only in comparison with a continuum of possibilities from which it was selected.”

“Existence involves choice; the dice of infinite faces, from potential to actual, will have the concreteness of one of them.”

“...as a plane is a bi-dimensional singularity, relative to a tri-dimensional space, a line in a plane is a topic discontinuity, but each of this elements is continuous in its proper dimension.”

“Whatever is real is the law of something less real. Stuart Mill defined matter as a permanent possibility of sensation. What is a permanent possibility but a law?”

“In fact, habits, from the mode of their formation, necessarily consist in the permanence of some relation, and therefore, on this theory, each law of nature would consist in some permanence, such as the permanence of mass, momentum, and energy. In this respect, the theory suits the facts admirably.”
The second essential attribute of eigen-solutions stated by von Foerster is stability. As stated in Stern (2005), a stable eigen-solution of an operator, defined by a fixed-point or invariance equation, can be found (built or computed) as the limit of a sequence of recursive applications of the operator. Under appropriate conditions (such as within a domain of attraction, for instance) the convergence of the process, and its limiting eigen-solution, will not depend on the starting point.

A similar notion of stability for an object-sign complex is given by Peirce, as stated in CP 1.339:

“That for which it (a sign) stands is called its object; that which it conveys, its meaning; and the idea to which it gives rise, its interpretant. The object of representation can be nothing but a representation of which the first representation is the interpretant. But an endless series of representations, each representing the one behind it, may be conceived to have an absolute object at its limit.”
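A minimal sketch of this notion of stability: iterating the operator x -> cos(x) (an arbitrary textbook choice, not an example from this text) converges to the same fixed point from any starting point, illustrating an eigen-solution found as the limit of recursive application:

```python
# Sketch: an eigen-solution as the limit of recursive operation. Iterating
# x -> cos(x) converges to the same fixed point x = cos(x) regardless of
# the starting point, since cos is a contraction near its fixed point.
import math

def fixed_point(op, x0, tol=1e-12, max_iter=10_000):
    """Iterate op until successive values agree within tol."""
    x = x0
    for _ in range(max_iter):
        x_next = op(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    raise RuntimeError("no convergence within max_iter")

# Different starting points converge to the same eigen-solution (~0.739085).
for x0 in (0.0, 1.0, -3.0, 10.0):
    print(x0, "->", fixed_point(math.cos, x0))
```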
The third essential attribute of eigen-solutions stated by von Foerster is compositionality. As stated in Chapter 1 and Appendix A, compositionality properties concern the relationship between the credibility, or truth value, of a complex hypothesis, H, and those of its elementary constituents, H_j, j = 1...k. Compositionality is at the very heart of any theory of language, see Noeth (1995). As an example of compositionality, see CP 1.366 and CP 6.23, where Peirce discusses the composition of forces, that is, how the components are combined using the parallelogram law.

“If two forces are combined according to the parallelogram of forces, their resultant is a real third... Thus, intelligibility, or reason objectified, is what makes Thirdness genuine.”

“A physical law is absolute. What it requires is an exact relation. Thus, a physical force introduces into a motion a component motion to be combined with the rest by the parallelogram of forces;”

In order to establish a minimal mapping, there are two more concepts in CogCon+FBST to which we must assign adequate analogs in Peircean philosophy. First, Chapter 1 analyzes the importance of incorporating into the statistical model all sources of noise and fluctuation, i.e., all the extra variability statistically significant to the problem under study. The following excerpt from CP 1.175 indicates that Peirce’s notion of fallibilism may be used to express the need for allowing and embracing all relevant (and in practice inevitable) sources of extra variability. According to Peirce, fallibilism is “the doctrine that there is no absolute certainty in knowledge”.

“There is no difficulty in conceiving existence as a matter of degree. The reality of things consists in their persistent forcing themselves upon our recognition. If a thing has no such persistence, it is a mere dream. Reality, then, is persistence, is regularity. ... as things (are) more regular, more persistent, they (are) less dreamy and more real. Fallibilism will at least provide a big pigeon-hole for facts bearing on that theory.”

Second, the FBST is an unorthodox Bayesian statistical formalism, and Peirce has a strong and unfavorable opinion about Laplace’s theory of inverse probabilities.

“...the majority of mathematical treatises on probability follow Laplace in results to which a very unclear conception of probability led him. ... This is an error often appearing in the books under the head of ‘inverse probabilities’.”
CP 2.785.

Due to his theory of inverse probabilities, Laplace is considered one of the earliest precursors of modern Bayesian statistics. Is there a conflict between CogCon+FBST and Peirce’s philosophy? We believe that a careful analysis of Peirce’s arguments not only dissipates potential conflicts, but also reinforces some of the arguments used in Chapter 1. Two main arguments are presented by Peirce against Laplace’s inverse probabilities. In the following paragraphs we identify these arguments and present an up-to-date analysis based on the FBST (unorthodox) Bayesian view:

“Laplace maintains that it is possible to draw a necessary conclusion regarding the probability of a particular determination of an event based on not knowing anything at all [about it]; that is, based on nothing. ... Laplace holds that for every man there is one law (and necessarily but one) of dissection of each continuum of alternatives so that all the parts shall seem to that man to be ‘également possibles’ in a quantitative sense, antecedently to all information.”, CP 2.764.

The dogmatic rhetoric used at the time of Laplace to justify ad hoc prior distributions can easily backfire, as it apparently did for Peirce. Contemporary arguments for the choice of prior distributions are based on the MaxEnt formalism or on symmetry relations, see Dugdale (1996), Eaton (1989), Kapur (1989) and Nachbin (1965). Contemporary treatments also examine the initial choice of priors by sensitivity analysis, for finite samples, and give asymptotic dissipation theorems for large samples, see DeGroot (1970), Gelman et al. (2003) and Stern (2004). We can only hope that Peirce would be pleased with the contemporary state of the art. These powerful theories have rendered ad hoc priors unnecessary, and shed early dogmatic arguments into oblivion.

“Laplace was of the opinion that the affirmative experiments impart a definite probability to the theory; and that doctrine is taught in most books on probability to this day, although it leads to the most ridiculous results, and is inherently self-contradictory. It rests on a very confused notion of what probability is. Probability applies to the question whether a specified kind of event will occur when certain predetermined conditions are fulfilled; and it is the ratio of the number of times in the long run in which that specified result would follow upon the fulfillment of those conditions to the total number of times in which those conditions were fulfilled in the course of experience.”, CP 5.169.

In the second part of the above excerpt, Peirce expresses a classical (frequentist) understanding that places probability in the sample space, and not in the parameter space; that is, he admits predictive probability statements but does not admit epistemic probability statements. The FBST is a Bayesian formalism that uses both predictive and epistemic probability statements, as explained in Chapter 1. However, when we examine the reason presented by Peirce for adopting this position, in the first part of the excerpt, we find a remarkable coincidence with the arguments presented in Stern (2003, 2004, 2006, 2007) against the orthodox Bayesian methodology for testing sharp hypotheses: the FBST does not attribute a probability to the theory (sharp hypothesis) being tested, as orthodox Bayesian tests do, but rather a degree of possibility.
In Stern (2003, 2004, 2006, 2007) we analyze procedures that attribute a probability to a given theory, and come to the exact same conclusion as Peirce did, namely, that those procedures are absurd.

Let us now return to the Peircean concept of Synechism, to discuss a technical point of contention between orthodox Bayesian statistics and the FBST unorthodox Bayesian approach. The FBST formalism relies on some form of measure theory, see the comments in section 3. De Finetti, the founding father of the orthodox school of Bayesian statistics, feels very uncomfortable having to admit the existence of non-measurable sets when using measure theory to deal with probabilities, in which valid statements are called events, see Finetti (1975, 3.11, 4.18, 6.3 and appendix). Dubins and Savage (1976, p.8) present similar objections, using the colorful gambling metaphors that are so characteristic of orthodox (decision-theoretic) Bayesian statistics. In order to escape the constraint of having non-measurable sets, de Finetti (1975, v.2, p.259) readily proposes a deal: to trade off other standard properties of a measure, like countable (σ-) additivity:

“Events are restricted to be merely a subclass (technically a σ-ring with some further conditions) of the class of all subsets of the base space. In order to make σ-additivity possible, but without any real reason that could justify saying to one set ‘you are an event’, and to another ‘you are not’.”

In order to proceed with our analysis, we have to search for the roots of de Finetti’s argument, roots that, we believe, lie outside de Finetti’s own theory, for they hinge on the perceived structure of the continuum. Bell (1998, p.2) states:

“the generally accepted set-theoretical formulation of mathematics (is one) in which all mathematical entities, being synthesized from collections of individuals, are ultimately of a discrete or punctate nature. This punctate character is possessed in particular by the set supporting the ‘continuum’ of real numbers - the ‘arithmetical continuum’.”
Among the alternatives to arithmetical punctiform perspectives of the continuum, there are more geometrical perspectives. Such geometrical perspectives allow us to use an arithmetical set as a coordinate (localization) system in the continuum, but the 'ultimate parts' of the continuum, called infinitesimals, are essentially nonpunctiform, i.e., not point-like. Among the proponents of infinitesimal perspectives for the continuum one should mention G.W. Leibniz, I. Kant, C.S. Peirce, H. Poincaré, L.E.J. Brouwer, H. Weyl, R. Thom, F.W. Lawvere, A. Robinson, E. Nelson, and many others. Excellent historical reviews are presented in Bell (1998, 2005), for a general view, and in Robertson (2001), for the ideas of C.S. Peirce. In the infinitesimal perspective, see Bell (1998, p.3),

“any of its (the continuum) connected parts is also a continuum and, accordingly, divisible. A point, on the other hand, is by its nature not divisible, and so (as stated by Leibniz) cannot be part of the continuum.”
In Peirce's doctrine of synechism, the infinitesimal geometrical structure of the continuum acts like “the ‘glue’ causing points on a continuous line to lose their individual identity”, see Bell (1998, p.208, 211). According to Peirce, “The very word continuity implies that the instants of time or the points of a line are everywhere welded together.”
De Finetti's argument on non-measurable sets implicitly assumes that all point subsets of R^n have equal standing, i.e., that the continuum has no structure. Under the arithmetical punctiform perspective of the continuum, de Finetti's objection makes perfect sense, and we should abstain from measure theory in favor of alternative formalisms, as does orthodox Bayesian statistics. This is how Peirce's concept of synechism helps us to overcome a major obstacle (for the FBST) presented by orthodox Bayesian philosophy, namely, the objections against the use of measure theory.

At this point it should be clear that my answer to Brier's question is emphatically affirmative. From Brier's comments and suggestions it is also clear how well he knew the answer when he asked me the question. As a maieutic teacher, however, he let me look for the answers in my own way. I can only thank him for the invitation that brought me for the first time into contact with the beautiful world of semiotics and Peircean philosophy.

2.7 Final Remarks

The physician Rambam, Moshe ben Maimon (1135-1204) of (the then caliphate of) Cordoba, wrote Shmona Perakim, a book on psychology (medical procedures for healing the human soul) based on fundamental principles expounded by Aristotle in the Nicomachean Ethics, see Olitzky (2000) and Rackham (1926). Rambam explains how the health of the human soul depends on always finding the straight path (derech y'shara) or golden way (shvil ha-zahav), at the perfect center between the two opposite extremes of excess (odef) and scarcity (choser), see Maimonides (2001, v.1: Knowledge, ch.2: Temperaments, sec.1-2):

“The straight path is the middle one, that is equidistant from both extremes. ... Neither should a man be a clown or jokester, nor sad or mourning, but he should be happy all his days in serenity and pleasantness. And so with all the other qualities a man possesses. This is the way of the scholars. Every man whose virtues reflect the middle is called a chacham... a wise man.”
Rambam explains that an (always imperfect) human soul, at a given time and situation, may be more prone to fall victim to one extreme than to its opposite, and should try to protect itself accordingly. One way of achieving this protection is to offset its position in order to (slightly over-) compensate for an existing or anticipated bias.

At the dawn of the 20th century, humanity had in classical physics a paradigm of science handing out unquestionable truth, and faced the brutality of many totalitarian regimes.
Chapter 3
Decoupling, Randomization, Sparsity, and Objective Inference

“The light dove, that at her free flight cleaves the air, therefore feeling its resistance, could perhaps imagine that she would succeed even better in the empty space.”
Immanuel Kant (1724-1804), Critique of Pure Reason (1787, B-8).
Step by step the ladder is ascended.
George Herbert (1593-1633), Jacula Prudentium (1651).
H. von Foerster characterizes “known” objects as eigen-solutions for an autopoietic system, that is, as discrete (sharp), separable (decoupled), stable and composable states of the interaction of the system with its environment. Previous chapters have presented the Full Bayesian Significance Test (FBST) as a mathematical formalism specifically designed to assess the support for sharp statistical hypotheses, and have shown that these hypotheses correspond, from a constructivist perspective, to systemic eigen-solutions in the practice of science, as seen in chapter 1. In this chapter, the role and importance of one of these four essential attributes indicated by von Foerster, namely, separation or decoupling, are studied.
Decoupling is the general principle that allows us to understand the world step by step, 'looking' at it a piece at a time, localizing single features, isolating basic components or identifying simple objects, out of the immense complexity of the whole universe. In statistical models, decoupling is often introduced by means of no association assumptions, such as independence, zero covariance, etc. In this context, decoupling relations are sharp statistical hypotheses that can be tested, see for example Stern and Zacks (2002). Decoupling relations in statistical models can also be introduced a priori by means of special Design of Statistical Experiments (DSEs) techniques, the best known of which is randomization.

In chapter 2 the general meaning of the term “objective” (as in how, less, or more objective) is defined as the “degree of conformance of an object to the essential attributes of an eigen-solution”. One of the common uses of the word objective, as opposed to “subjective”, stresses the decoupling or separation of a given systemic eigen-solution, such as an object of a scientific program, from the peculiarities of a second system, such as a specific human observer. It is this restricted meaning, focusing on the decoupling property of systemic eigen-solutions, that justifies the use of the term objective in this chapter's title.

The decoupling principle, and one of its most celebrated examples in Physics, the vibrating chord, are presented in section 2. In the vibrating chord model, a basic linear algebra operation, the eigen-value factorization, is the key to obtaining the decoupling operator. In addition, the importance of eigen-solutions and decoupling operations is discussed from a constructivist epistemological perspective. Herein, we shall focus on decoupling operators related to another basic linear algebra operation, namely, the Cholesky factorization. In section 3 we show how the Cholesky factorization can be used to decouple covariance structure models. In section 4, Simpson's paradox and some strategies for DSEs, such as control and randomization, are discussed. These strategies can be used to induce independence relations, which are expressed in the sparsity structure of the model, and which can, in turn, be used for efficient decoupling. In section 5, the role of C.S. Peirce in the introduction of control and randomization in DSEs is reviewed from an historical perspective. This review will help us set the stage for the discussion, in section 6, of a controversial issue: randomization in Bayesian statistics. In section 7 some epistemological consequences of randomization are discussed, and the underlying themata of constructivism and objective knowledge are revisited.

The Cholesky factorization operator is presented in section 3, in conjunction with the computational concepts of sparse and structured matrices. Covariance structure and Bayesian networks are some of the most basic and widely used statistical models. Therefore, understanding their decoupling properties is important, not only from a computational point of view, but also from theoretical and epistemological perspectives. Furthermore, one could argue that the usefulness of these statistical models is due exactly to their decoupling properties. Final remarks are presented in section 8.
3.2 The Decoupling Principle

Understanding the entire universe, with all its intricate constituents, relations and interconnections, can be a daunting task, as stated by Schlick (1979, v.1, p.292):

“The most important (of these) difficulties arises from the recognition of the unending linkage of all natural processes one with another. Its effect is that, on an exact view, every occurrence in the world is dependent on every other; the fall of a leaf is ultimately influenced by the motions of the stars, and it would be a task utterly beyond fulfillment to assign its ‘cause’ with absolute completeness to any given process that we suppose determined down to the last detail. For this purpose we should have to adduce nothing less than all of the circumstances of the universe that have so far occurred.
Now fortunately this boundlessness is at once considerably restricted by experience, which teaches us that the reciprocal interdependence of all events in nature is subject to certain easily formulable conditions.”
L. Sadun has written an exceptionally clear book on linear algebra, emphasizing the idea of decoupling, i.e., the strategy of breaking down complicated multivariate systems into simple 'modes' by a suitable change of coordinates, see also Rijsbergen (2004). Sadun (2001, p.1) states the goal of his book as follows:

“In this book we cover a variety of linear evolution equations, beginning with the simplest equations in one variable, moving to coupled equations in several variables, and culminating in problems such as wave propagation that involve an infinite number of degrees of freedom. Along the way we develop techniques, such as Fourier analysis, that allow us to decouple the equations into a set of scalar equations that we already know how to solve.
The general strategy is always the same. When faced with coupled equations involving variables x_1, ..., x_n, we define new variables y_1, ..., y_n. These variables can always be chosen so that the evolution of y_1 depends only on y_1 (and not on y_2, ..., y_n), the evolution of y_2 depends only on y_2, and so on. To find x_1(t), ..., x_n(t) in terms of the initial conditions x_1(0), ..., x_n(0), we convert x(0) to y(0), then solve for y(t), then convert to x(t).”

As an example of paramount theoretical and historical importance in Physics, we consider the discrete chord. The chord is kept at tension h, with n particles of mass m at equally spaced positions js, j = 1...n. The extremes of the chord, at positions 0 and (n+1)s, are kept fixed, and x = [x_1, x_2, ..., x_n]′ denotes the particles' vertical
displacements, see French (1974, ch.5, Coupled Oscillators and Normal Modes, p.119-160), Marion (1999, ch.9) and Franklin (1968, ch.7). Figure 1 shows the discrete chord for n = 2.

Figure 1: Eigen-Solutions of Continuous and Discrete Chords.

The second order differential equation of classical mechanics, below, provides a linear approximation for the discrete chord system's dynamics:

ẍ + Kx = 0 ,   K = w_0^2 *
[  2  -1   0  ...   0
  -1   2  -1  ...   0
       ...
   0  ...  -1   2  -1
   0  ...   0  -1   2 ] ,   w_0^2 = h / (m s).

As it is, the discrete chord differential equation is difficult to solve, since the n coordinates of vector x are coupled by matrix K. In the following paragraphs we show how to decouple this differential equation.

Suppose that an orthogonal matrix Q is known to diagonalize matrix K, that is, Q^{-1} = Q′ and Q′KQ = D = diag(d), d = [d_1, d_2, ..., d_n]′. Substituting x = Qy and pre-multiplying the above differential equation by Q′, we obtain the matrix equation

Q′(Q ÿ) + Q′K(Qy) = ÿ + Dy = 0 ,

that is, n decoupled scalar equations for harmonic oscillators, ÿ_k + d_k y_k = 0, in the new 'normal' coordinates, y = Q′x. The solution of each harmonic oscillator, as a function of time, t, has the form y_k(t) = sin(ϕ_k + w_k t), with phase 0 ≤ ϕ_k ≤ 2π and angular frequency w_k = √d_k.

The columns of matrix Q, the decoupling operator, are the eigenvectors of matrix K, which are, as one can easily check, multiples of the un-normalized vectors z_k below. Their corresponding eigenvalues are d_k = w_k^2, where, for j, k = 1...n,

z_{k,j} = sin( j k π / (n+1) ) ,   w_k = 2 w_0 sin( k π / (2(n+1)) ).

The decoupled modes of oscillation, for n = 2, are depicted in Figure 1. They are called 'normal' modes in physics, 'standing' modes in engineering, and eigen-solutions in mathematics. The discrete chord with n particles will have n normal modes, and the limit case, n → ∞, is called the continuous chord. The normal modes of the continuous chord are given by trigonometric functions, the first few of which are depicted in Figure 1. They are also called 'standing' waves or eigen-functions of the chord, and constitute the basis of Fourier analysis.

In either the discrete or the continuous chord, we can 'excite', i.e., give energy to or 'put in motion', one of the normal modes, without affecting any other normal mode. This is the physical meaning of decoupling, i.e., to have 'separate' eigen-solutions. Since the differential equation describing the system is linear, distinct normal modes can also be superposed. This is called the 'superposition' principle, which renders the compositionality rule for the eigen-solutions of the chord.

In the original coordinate system, x, coupling made it hard to follow the system's evolution. In the normal coordinate system, y, based on the system's eigen-solutions, decoupling and superposition make it easier to understand the system's behavior. But are these eigen-solutions “just” a formal basis for an alternative coordinate system, or do they represent “real objects” within the system under study?

Obviously, this is not a mathematical or physical question, but rather an epistemological one. From a constructivist perspective, we can consider these eigen-solutions “objectively known” entities in the system. Nevertheless, the meaning of the term objective in a constructivist epistemology is distinct from its meaning in a dogmatic realist epistemology, as explained in Stern (2006b, 2007a,b).

From a constructivist perspective, systemic eigen-solutions can be identified and “named” by an observer. Indeed, the eigen-solutions of the vibrating chord were identified and named thousands of years before mankind knew anything about differential equations. The eigen-values of the chord are known in music as the 'fundamental tone' and its 'higher harmonics', and constitute the basis for all known musical systems, see Benade (1992).
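To make the decoupling operation concrete, the following numerical sketch (ours, not part of the original text; Python with NumPy, and the values of n, h, m and s are arbitrary illustrations) builds the matrix K of the discrete chord, diagonalizes it, and checks the analytic expressions for the normal modes given above:

```python
import numpy as np

# Discrete chord with n masses: x'' + K x = 0, with K = w0^2 * tridiag(-1, 2, -1)
# and w0^2 = h/(m s). The numerical values below are illustrative only.
n, h, m, s = 8, 1.0, 1.0, 1.0
w0 = np.sqrt(h / (m * s))
K = w0**2 * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))

# Numerical eigen-decomposition K = Q D Q', with Q orthogonal (the decoupling
# operator) and D = diag(d), eigenvalues returned in ascending order.
d, Q = np.linalg.eigh(K)

# Check the analytic normal modes: z_kj = sin(j k pi/(n+1)),
# w_k = 2 w0 sin(k pi/(2(n+1))), d_k = w_k^2.
j = np.arange(1, n + 1)
for k in range(1, n + 1):
    z_k = np.sin(j * k * np.pi / (n + 1))
    z_k /= np.linalg.norm(z_k)                 # normalize, to compare with columns of Q
    w_k = 2 * w0 * np.sin(k * np.pi / (2 * (n + 1)))
    assert np.isclose(d[k - 1], w_k**2)                    # d_k = w_k^2
    assert np.allclose(np.abs(Q[:, k - 1]), np.abs(z_k))   # same mode, up to sign

# In normal coordinates y = Q' x, the system splits into n decoupled
# scalar harmonic oscillators, y_k'' + d_k y_k = 0.
```

Exciting a single normal coordinate y_k, while leaving the others at rest, puts in motion exactly one eigen-solution of the chord.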
The linear model for the vibrating chord is a paradigmatic example of the fact that, despite being simple to understand and manipulate, linear models often give excellent approximations for complex systems. Also, since linear operators are represented by matrices in standard matrix algebra, the importance of certain matrix operations in the decoupling of such models should not be surprising at all. In the vibrating chord model, the eigen-value factorization, K = QDQ′, was the key to obtaining the decoupling operator, Q. The eigen-value factorization plays the same role in many important statistical procedures, such as spectral analysis of time series, wavelet signal analysis, and kernel methods.

Related operations of linear algebra, like the Singular Value Decomposition (SVD) and Nonnegative Matrix Factorizations (NNMF), are important in principal components analysis and latent structure models, see for example Bertsekas and Tsitsiklis (1989), Censor and Zenios (1998), Cichocki et al. (2006), Dhillon and Sra (2005) and Hoyer (2004). Distinct decoupling operators have distinct characteristics, relying upon stronger or weaker structural properties of the model, requiring more or less computational work, and having different capabilities for handling sparse data.
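As an illustration of this family of decoupling operators, here is a minimal sketch (ours; Python/NumPy, with an arbitrary covariance matrix) of the SVD at work in principal components analysis: rotating centered data into principal coordinates renders the sample covariance matrix diagonal, i.e., decoupled.

```python
import numpy as np

# SVD as a decoupling operator: rotate centered data into principal
# coordinates, where the sample covariance matrix becomes diagonal.
rng = np.random.default_rng(1)
V_true = np.array([[4.0, 2.0, 0.0],
                   [2.0, 3.0, 1.0],
                   [0.0, 1.0, 2.0]])          # an arbitrary covariance matrix
X = rng.multivariate_normal(np.zeros(3), V_true, size=10_000)
X = X - X.mean(axis=0)                        # center the data

U, svals, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt.T                                  # data in principal coordinates

# Decoupled: the sample covariance of Y is diagonal (up to rounding),
# with variances svals**2 / (n - 1) on the diagonal.
print(np.round(np.cov(Y.T), 3))
```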
In this chapter, we will be mainly interested in the decoupling of statistical models. More precisely, we shall focus on decoupling methods related to an important basic linear algebra operation, namely, the Cholesky factorization. In the next section we show how the Cholesky factorization can be used to decouple covariance structure statistical models.

The decoupling principle emerges, sometimes under different denominations, in virtually every area of the hard sciences. In Systems Theory and Mathematical Programming, for example, it arises under the name of Decomposition Methods. In the optimization of large systems, for example, there are two basic approaches to decomposition:
- High level methods focus on the underlying structure of the optimization problems. High level decomposition strategies replace the original large or complex problem by several hierarchically interconnected small or simple optimization problems, see for example Geoffrion (1972), Lasdon (1970) and Wismer (1971).
- Low level methods look at the matrix representation of the optimization problems. Low level decomposition strategies benefit from tailor made computational linear algebra subroutines that take advantage of the underlying sparse matrix structure. Some of these techniques are discussed in the next section.

3.3 Covariance Structure Models

Covariance structure, multivariate regression, Kalman filter and several other related linear statistical models are widely used in the practice of science. They provide a powerful analytical tool in which the association, coupling or dependence between multiple variables can be represented and estimated. For a random vector x, its covariance matrix, V, is defined as the expected square distance to its expected (mean) value, β, that is,

β = E(x) ,   V = Cov(x) = E( (x − β) ⊗ (x − β)′ ).

The diagonal elements, or variances, Var(x_i) = V_{i,i}, give the most usual scalar measure of error, dispersion or uncertainty used in statistics, while the off-diagonal elements, Cov(x_i, x_j) = V_{i,j}, give a measure of association between two scalar random variables, x_i and x_j, see Hocking (1985) for a general reference.

Also recall that, since the expectation operator, E, is linear, that is, E(Ax + b) = A E(x) + b for any random vector x, matrix A and vector b, we have

Cov(Ax + b) = A Cov(x) A′.

The standard deviation, σ_i = √V_{i,i}, is a dispersion measure given in the same unit as x, and the correlation, C_{i,j} = V_{i,j} / (σ_i σ_j), is a measure of association normalized in the
[−1, 1] interval.

As is usual in the covariance structure literature, we can write a covariance matrix as V(γ) = Σ_t γ_t G_t, in which the matrices G_t constitute a basis for the space of symmetric matrices of dimension n × n, see Lauretto et al. (2002). For example, for dimension n = 4,

V(γ) = Σ_{t=1}^{10} γ_t G_t =
[ γ_1   γ_2   γ_3   γ_4
  γ_2   γ_5   γ_6   γ_7
  γ_3   γ_6   γ_8   γ_9
  γ_4   γ_7   γ_9   γ_10 ] .

Using the above notation, we can easily express hypotheses concerning structural properties, including sparsity patterns, in the standard form of vector functional equations, h(β, γ) = 0. Details on how to use the FBST to test such general hypotheses in some particular settings can be found in Lauretto et al. (2002).
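A minimal sketch (ours; Python/NumPy) of this parametrization for n = 4 follows; the particular ordering of the basis matrices G_t is our own choice, and any fixed ordering of the ten free positions would do:

```python
import numpy as np

# Basis {G_t} for the 4x4 symmetric matrices: one G_t per position (i,j),
# i <= j, with ones at (i,j) and (j,i); positions ordered row by row,
# matching the layout of V(gamma) displayed above.
n = 4
basis = []
for i in range(n):
    for j in range(i, n):
        G = np.zeros((n, n))
        G[i, j] = G[j, i] = 1.0
        basis.append(G)

gamma = np.arange(1.0, len(basis) + 1.0)       # gamma_1, ..., gamma_10 (illustrative)
V = sum(g * G for g, G in zip(gamma, basis))   # V(gamma) = sum_t gamma_t G_t

assert len(basis) == n * (n + 1) // 2          # ten free parameters for n = 4
assert np.allclose(V, V.T)                     # symmetric by construction
# A sharp structural hypothesis such as Cov(x_1, x_4) = 0 is the functional
# equation h(gamma) = gamma_4 = 0 in this parametrization.
```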
Once we have established the structural properties of the model, we can estimate the parameters β and γ accordingly. Following the general line of investigation adopted herein, a question that arises naturally is: How can we decouple the estimated model?

One possible answer to this question can be given in terms of the Cholesky factorization, LL′ = V, where L is lower triangular. Such a factorization is available for any full rank covariance matrix V, as shown in Golub and van Loan (1989). Let V = LL′ be the Cholesky factorization of the covariance matrix, V, and let us consider the transformation of variables y = L^{-1} x, or x = Ly. The covariance matrix of the new variables can be computed as

Cov(y) = L^{-1} V L^{-t} = L^{-1} L L′ L^{-t} = I.

Hence, the transformed model has been decoupled, i.e., it has uncorrelated random components.

Consider now a simple numerical example of Cholesky factorization, V = LL′, with some peculiarities: the matrix V is sparse, i.e., it has several zero elements (in contrast, a matrix with few or no zero elements is said to be dense). Suppose, moreover, that V is structured, i.e., that its zeros are arranged in a nice pattern, such as a banded or block pattern, and that the Cholesky factor, L, preserves the sparsity and structure of V, that is, no position with a zero in V is filled with a non-zero in L. A factorization (or elimination) resulting in no fill-in is called perfect. Perfect eliminations are not always possible; however, there are several techniques that can be used to obtain sparse (and structured) Cholesky factorizations in which the fill-in is minimized, that is, the sparsity of the Cholesky factor is maximized. Pertinent references on sparse factorizations include Blair and Peyton (1993), Bunch and Rose (1976), George et al. (1978, 1981, 1989, 1993), Golumbic (1980), Pissanetzky (1984), Rose (1972), Rose and Willoughby (1972), Stern (1992, 1994), Stern and Vavasis (1993, 1994) and van der Vorst and van Dooren (1990).

Large models may have millions of sparsely coupled variables. A sparse and structured factorization of such a model gives a 'simple' decoupling operator, L. This is a matter of vital importance when designing efficient computational procedures. In practice, large models can only be computed with the help of these techniques. Another important class of statistical models, Bayesian networks, relies on sparse factorization techniques that, from an abstract graph theoretical perspective, are almost identical to sparse Cholesky factorization, see for example Lauritzen (2006) and Stern (2006a, sec.9-11).
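A minimal sketch (ours; Python/NumPy; the tridiagonal V below is an arbitrary illustrative choice) of the two points just made: a structured sparse V can admit a perfect, no fill-in, Cholesky factorization, and the transformation y = L^{-1}x decouples the random components.

```python
import numpy as np

# Decoupling by Cholesky factorization: if V = L L', then y = L^{-1} x
# satisfies Cov(y) = L^{-1} V L^{-t} = I.
rng = np.random.default_rng(0)

# An illustrative sparse, structured (tridiagonal) covariance matrix.
V = (np.diag(np.full(5, 2.0))
     + np.diag(np.full(4, 0.5), k=1)
     + np.diag(np.full(4, 0.5), k=-1))
L = np.linalg.cholesky(V)

# Perfect elimination: no zero position in the lower triangle of V is
# filled in; the Cholesky factor of a tridiagonal matrix is bidiagonal.
assert np.all((np.tril(V) != 0) | (L == 0))

# Empirical check of the decoupling on a large sample x ~ N(0, V).
x = rng.multivariate_normal(np.zeros(5), V, size=100_000)
y = np.linalg.solve(L, x.T).T                 # y = L^{-1} x, row by row
print(np.round(np.cov(y.T), 2))               # approximately the identity
```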
In the next section we continue to examine the role of covariance, or more general forms of association, in statistical modeling. In particular, we examine some situations leading to spurious associations, destroying a model's presumed sparsity and structure. In the following sections we review, from historical and epistemological perspectives, some techniques of Design of Statistical Experiments (DSE), used to induce (no) association relations in statistical models. These relations translate into sparsity and structural patterns that, in turn, can be used by efficient factorization algorithms.

3.4 Simpson's Paradox and the Control of Confounding Variables

Lindley (1991, p.47-48) illustrates Simpson's paradox with a medical trial example. Of the 80 patients in the study, 40 received treatment, T, and 40 received a placebo with no effect, NT. Some patients recovered from their illness, R, and some did not, NR. The recovery rates, R%, are given in Table 1, where the experimental data is shown, both in aggregate form for All patients, and separated or disaggregated according to Sex. Looking at the table one concludes that the treatment is bad for either male or female patients, but good for all of them together! This is Simpson's paradox: the association between two variables, T and R in Lindley's example, is reversed if the data is aggregated / disaggregated over a confounding variable, Sex in Lindley's example.

Table 1: Simpson's Paradox.

Sex    T/NT    R    NR   Tot   R%
All     T     20    20    40   50%
All     NT    16    24    40   40%
Male    T     18    12    30   60%
Male    NT     7     3    10   70%
Fem     T      2     8    10   20%
Fem     NT     9    21    30   30%

Lindley provides the following scenario for the situation illustrated by this example: The physician responsible for the experiment did not trust the treatment and was also aware that the illness under study affects females most severely. Hence, he decided to try it mainly on males, who would probably recover anyway. This illustrates the general Simpson's paradox situation, generated by the association of the confounding variable with both the explained variable and one (or more) of the explaining variables. Additional references on several aspects related to Simpson's paradox include Blyth (1972), Cobb (1998), Good and Mittal (1987), Gotzsche (2002), Greenland et al. (1999, 2001), Heydtmann (2002), Hinkelmann (1984), Pearl (2004) and Reintjes et al. (2000).
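A few lines of arithmetic (ours, in Python) reproduce the reversal in Table 1:

```python
# Recovery counts (R, NR) from Table 1, by sex and treatment (T) / placebo (NT).
data = {
    ("Male", "T"): (18, 12), ("Male", "NT"): (7, 3),
    ("Fem",  "T"): (2, 8),   ("Fem",  "NT"): (9, 21),
}

def rate(r, nr):
    return r / (r + nr)

# Disaggregated by sex, the treatment looks worse in each group...
for sex in ("Male", "Fem"):
    print(sex, rate(*data[(sex, "T")]), "<", rate(*data[(sex, "NT")]))  # 0.6 < 0.7, 0.2 < 0.3

# ...but aggregating over the confounding variable reverses the association.
agg = {t: tuple(map(sum, zip(data[("Male", t)], data[("Fem", t)]))) for t in ("T", "NT")}
print("All", rate(*agg["T"]), ">", rate(*agg["NT"]))                    # 0.5 > 0.4
```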
The obvious question then is: How can we design a statistical experiment in order to avoid spurious associations? Two strategies are self-evident:
1. Control possible confounding variables in order to impose some form of invariance (constancy, equality) on the experiment, or
2. Measure possible confounding variables, so that the relevant ones can be included in the statistical model.
The simplest form of the first strategy would be to test the treatment in a set of 'clones', individuals that are, using the words of Fisher (1966, sec.9, Randomization; the Physical Basis of Validity of the Test, p.17-19),

“exactly alike, in every respect except that to be tested”.

This strategy, however, is too strict. Even if feasible, the conclusions of the study would only apply to the 'clone population', not to individuals from a population with natural variability.

A more general form of the first strategy is known as blocking, defined in Box et al. (1978, p.102-103, sec.4.3, Blocking and Randomization) as:

“The device of pairing observations is a special case of ‘blocking’ that has important applications in many kinds of experiments. A block is a portion of the experimental material (the two shoes of one boy in this example) that is expected to be more homogeneous than the aggregate (all shoes of all the boys). By confining treatment comparisons within such blocks, greater precision can often be obtained.”
Blocking is a very important strategy in the design of statistical experiments (DSEs), used to increase, whenever possible, the precision of the study's conclusions.

As for the second strategy, it looks like a sure thing! No statistician would ever refuse more information, in a larger and richer data bank. Nevertheless, we have to ask whether we want to control and/or measure SOME of the possibly confounding variables, i.e., those perceived as the most important or even just those we are aware of, or ALL of them?

Keeping everything under control in a statistical experiment (or in life in general) constitutes, in the words of Fisher,

“a totally impossible requirement in our example, and equally in all other forms of experimentation”.

Not only would the cost and complexity of trying to do so for a very large set of variables be prohibitive in any practical circumstance, but also

“it would be impossible to present an exhaustive list of such possible differences (variables) appropriate for any one kind of experiment, because the uncontrolled causes which may influence the result are always strictly innumerable”.

Hacking (1988) describes

“(the) notion of random assignment of treatment to a subset of the plots or persons, leaving the rest as controls. ... I shall speak of an experiment using randomization in this way as involving a randomized design. ...
... There is a related but distinguishable idea of (random) representative sampling.”

As is usual in the statistical literature, Hacking distinguishes between two intended uses of randomization, namely random design and random sampling. Random design aims to eliminate bias coming from systematic design problems, including several forms of uncontrolled influence, either conscious or unconscious, received from and exerted by agents participating in the experiment. Random sampling, on the other hand, is intended to justify, somehow, assumptions concerning the functional form of a distribution in the statistical model of the experiment. The distinction between random design and random sampling will be kept here, even though, as briefly mentioned in section 6, a deeper probabilistic analysis of randomization shows that, from a theoretical point of view, the two concepts can greatly overlap.

Our immediate interest in randomization (and control) is in whether it can assist the design of experiments by inducing independence relations. This strategy is pinpointed in the following quote from Pearl (2000, p.340, 348, Epilogue: The Art and Science of Cause and Effect):

“...Fisher's ‘randomized experiment’... consists of two parts, ‘randomization’ and ‘intervention’.”

“Intervention means that we change the natural behavior of the individual: we separate subjects into two groups, called treatment and control, and we convince the subjects to obey the experimental policy. We assign treatment to some patients who, under normal circumstances, will not seek treatment, and give placebo to patients who otherwise would receive treatment. That, in our new vocabulary, means ‘surgery’ - we are severing one functional link and replacing it with another. Fisher's great insight was that connecting the new link to a random coin flip ‘guarantees’ that the link we wish to break is actually broken. The reason is that a random coin is assumed to be unaffected by anything we can measure on a macroscopic level...”
3.5 C.S. Peirce and Randomization

We believe that many fine points about the role of randomization in DSEs can be better understood by following its development from an historical perspective. This is the topic of this section.

In the period of 1850 to 1880 the quantitative analysis of human sensation in response to physical (tactile, acoustic or visual) stimuli was the main goal of 'psychophysics'. A typical hypothesis in this research program was Fechner's law, see Hernstein and Boring (1966, p.72), which stated that,

“The magnitude of sensation (γ) is not proportional to the absolute value of the stimulus (β), but rather to the logarithm of the magnitude of the stimulus when this is expressed in terms of its threshold value (b), i.e. that magnitude considered as unit at which the sensation begins and disappears.”

In modern mathematical notation, γ = k log(β/b) I(β > b).

In his psychophysical experiments Fechner tested his own ability to distinguish the stronger in a pair of stimuli. For example, he would prepare two objects of masses µ and µ + δ, later lift them, and 'answer' which one appeared to him to be the heavier. A quantitative analysis would later relate the proportion of right and wrong answers to the values of µ and δ, see Stigler (1986, ch.7, Psychophysics as a Counterpoint, p.239-261). Fechner was well aware of the potential difficulties resulting from the fact that the experiments were not performed blindly, that is, since he prepared the experiment himself, he could know in advance the right answer. Nevertheless, he claimed to be able to control himself, be objective, and overcome this difficulty.

According to Dehue (1997), in the decade of 1870, G.E. Müller and several researchers at Tübingen and Göttingen Universities began to improve the design of psychophysical experiments. The first major improvement was blinding: the stimuli were prepared or administered by an 'Experimenter' or 'Operator' and applied to a distinct person, the 'Observer', 'Patient' or 'Subject', who was kept unaware of the actual intensity of the stimuli.

The second major improvement was the precaution of presenting the stimuli in 'irregular order' (buntem Wechsel). This irregularity was introduced to prevent the patient from becoming habituated to patterns in the sequence of stimuli presented to him or, in other words, to keep him from building expectations and guessing the right answers. Nevertheless, there was, at that time, neither a general theory defining 'irregularity', nor a systematic method for providing an 'irregular order'.

In 1885, Charles Sanders Peirce and his student Joseph Jastrow presented randomization as a practical solution, in this context, to the question of irregularity:
We believe that there are several entangled reasons toexplain such a twisted historical process. The psychopysics community raised objectionsagainst some of the hypotheses, and also against some methodological aspects presentedin Peirce’s paper. Besides, there is also a confounding factor generated by a secondrole played by randomization in Peirce’s paper, namely, ‘randomization to measure fainteffects’. We shall briefly discuss these aspects in the next paragraphs.Fechner assumed the existence of a threshold (Schwelle), b , bellow which small differ-ences could no longer be discerned. Peirce wanted to refute the existence of this thresholdassuming, instead, a continuously decreasing sensitivity to smaller and smaller differences.We should remark that for Peirce this should not have been a fortuitous hypothesis, sinceit can be related to his general philosophical ideas, most specially with the concept ofsynechism, see chapter 2, Hartshorne et al. (1992) and Eisele (1976).Peirce postulated that the patients’ sensitivity could be adequately measured by theprobability of correct answers, even when the difference was too faint to be consciouslydiscerned by the same patients. Hence, in experiments similar to Fechner’s, Peirce askedthe patient always to guess the correct answer. Peirce also asked the patient to givethe answer a confidence score from 0 to 3. Peirce analyzed his experimental data andderived empirical formulae relating the (rounded) ‘subjective’ confidence scores, m , andthe ‘objective’ probability of correct answers, p , as in Peirce and Jastrow (1884, p.122):8 CHAPTER 3: DECOUPLING AND OBJECTIVE INFERENCE “ The average marks seem to conform to the formula m = c log( p/ (1 − p )) ,where m denotes the degree of confidence on the scale, p denotes the probabilityof the answer being right, and c is a constant which may be called the index ofconfidence.” At the time of Peirce’s experiments, the psychophysical community gave great impor-tance to the analysis of the patient’s subjective ‘introspections’. According to this view,Peirce’s experiments were criticized by asking the patient to guess the correct answereven when he expressed low confidence. Of course, if one understands Peirce’s researchprogram, it is clear that that the experimental design he used is perfectly coherent. Un-fortunately, this was not the judgment of his contemporaries.The same techniques and experimental designs used by Peirce were subsequently usedby several researchers in attempts to measure faint effects, including effects produced by‘below the consciousness threshold’, sub-conscious, or sub-liminal stimuli. Some of thesestudies were really misconceived, and that may have been yet another contributing factorfor the reactions against the use of randomization. Whatever the explanation might be,Peirce’s paper fell into oblivion, and the progress of DSEs was delayed by half a century.
3.6 Bayesian Analysis of Randomization

The work of Ronald Aylmer Fisher can undoubtedly be held responsible for disseminating the modern approach to DSEs, including randomization, to almost any area of empirical research, see for example Fisher (1926, 1935). The idea of randomization, however, was later contested by some members of the Bayesian school. Commenting on the use of randomization after Fisher, Hacking (1988, p.429-430) states:

“Undoubtedly Fisher won the day, at least for the following generation, but then a new, although not completely unrelated, challenge to randomized design arose. This came from the revival of the ‘Bayesian’ school, typically associated with L.J. Savage's theory of what he called personal probability. Here the object is to form an initial assessment of one's personal beliefs about a subject and to modify them in the light of experience and a theoretical analysis formally modeled by the calculus of probability and a theory of personal utility. It is widely held to be an almost immediate consequence of this approach that randomization is of no value at all (except perhaps to eliminate some kind of fraud).”
This erroneous notion of incompatibility between the use of randomization and Bayesian statistics is now completely outdated. One of the most prestigious textbooks in contemporary Bayesian statistics, Gelman et al. (2003, ch.7, p.198), states:

“A naive student of Bayesian inference might claim that because all inference is conditional on the observed data, it makes no difference how those data were collected. This misplaced appeal to the likelihood principle would assert that given (1) a fixed model (including the prior distribution) for the underlying data and (2) fixed observed values of the data, Bayesian inference is determined regardless of the design for the collection of the data. Under this view there would be no formal role for randomization in either sample surveys or experiments. The essential flaw in the argument is that a complete definition of ‘the observed data’ should include information on how the observed values arose, and in many situations such information has a direct bearing on how these values should be interpreted. Formally then, the data analyst needs to incorporate the information describing the data collection process in the probability model used for analysis.”

Indeed, the classical argument using the likelihood principle against randomization in DSEs assumes a fixed, given statistical model and, as concisely stated by Kempthorne (1977, p.16):

“The assertion that one does not need randomization in the context of the assumed (linear) model (above) is an empty one because an intrinsic role of randomization is to ‘insure’ against model inadequacies.”
Gelman et al. (2003, ch.7, p.223-225) proceed to offer a much deeper analysis of the role of randomization from a Bayesian perspective, see also Rubin (1978). The key concept of “ignorable design” specifies decoupling conditions between the sampling (or censoring) process, described by an indicator variable, I, and the distribution of the observed variables, y_obs. If the experiment has an ignorable design, we can build a statistical model that explicitly considers y_obs alone. Finally, it is ironic that perhaps one of the best arguments for incorporating randomization in Bayesian experimental design is a consequence of de Finetti's theorem on exchangeability. As mentioned in section 4, this argument also blurs the distinction between the concepts of randomized design and randomized sampling. We quote, once again, from Gelman et al. (2003, ch.7, p.223-225):

“How does randomization fit into this picture? First, consider the situation with no fully observed covariates x, in which case the ‘only’ way to have an invariant to permutation design is to randomize.”

“In this sense, there is a benefit to using different patterns of treatment assignment for different experiments; if nothing else about the experiments is specified, they are exchangeable, and the global treatment assignment is necessarily randomized over the set of experiments.”
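To illustrate this discussion, here is a toy simulation (ours, in Python; all probabilities are illustrative assumptions, in the spirit of Lindley's scenario of section 4) in which a 'biased physician' design manufactures a spurious treatment effect that a randomized design does not:

```python
import numpy as np

# Toy simulation: recovery depends on sex only (the treatment has no effect
# at all), and a 'biased physician' assigns the treatment mostly to males,
# who recover more often. All probabilities are illustrative assumptions.
rng = np.random.default_rng(2)
n = 200_000
male = rng.random(n) < 0.5
recover = rng.random(n) < np.where(male, 0.65, 0.25)    # true recovery rates

biased = rng.random(n) < np.where(male, 0.75, 0.25)     # confounded assignment
randomized = rng.random(n) < 0.5                        # coin-flip assignment

for name, treat in (("biased design", biased), ("randomized design", randomized)):
    effect = recover[treat].mean() - recover[~treat].mean()
    print(name, round(effect, 3))   # biased: spurious ~ +0.20; randomized: ~ 0.00
```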
3.7 Randomization, Epistemic Considerations

Several researchers currently concerned with epistemological questions in Bayesian statistics are engaged in a reductionist program dedicated to translating every statistical test or inference problem into a decision theoretic procedure. One of the main proponents of, and early contributors to, this program, but one who also had a much broader perspective, clearly articulating his epistemological insights and motivations, was Bruno de Finetti.

In statistical models our knowledge of the world is encoded in probability distributions. Hence, it is vital to clarify the epistemological or ontological status of probability. Let us examine de Finetti's position, based on his own words, beginning with Finetti (1972, p.189) and Finetti (1980, p.212):

“Any assertion concerning probabilities of events is merely the expression of somebody's opinion and not itself an event. There is no meaning, therefore, in asking whether such an assertion is true or false, or more or less probable.”

“Each individual making a ‘coherent’ evaluation of probability (in the sense I shall define later) and desiring it to be ‘objectively exact’, does not hurt anyone: everyone will agree that this is his subjective evaluation and his ‘objectivist’ statement will be a harmless boast in the eyes of the subjectivist, while it will be judged as true or false by the objectivist who agrees with it or who, on the other hand, had a different one. This is a general fact, which is obvious but insignificant: ‘Each in his own way.’”
Solipsism, from the Latin solus (alone) + ipse (self), can be defined as the epistemological thesis that the individual's subjective states of mind are the only proper or possible basis of knowledge. Metaphysical solipsism goes even further, stating that nothing really 'exists' outside of one's own mind. From the two above quotations, it is clear that de Finetti stands, if not from a metaphysical, at least from an epistemological perspective, as a true solipsist. This goes farther than many theorists of the Bayesian subjectivist school would venture, but de Finetti charges ahead, with a program that is not only anti-realist, but also anti-idealist. In (1974, V1, sec.1.11, p.21-22, The Tyranny of Language), de Finetti launches a full-fledged attack against the vain and futile desire for any objective knowledge:

“Much more serious is the reluctance to abandon the inveterate tendency of the savages to objectivize and mythologize everything (1); a tendency that, unfortunately, has been, and is, favored by many more philosophers than have struggled to free us from it (2).
(1) The main responsibility for the objectivizationistic fetters inflicted on thought by everyday language rests with the verb ‘to be’ or ‘to exist’, and this is why we drew attention to it in the exemplifying sentences. From it derives the swarm of pseudoproblems from ‘to be or not to be’, to ‘cogito ergo sum’, from the existence of ‘cosmic ether’ to that of ‘philosophical dogmas’.
(2) This is what distinguishes acute minds, who enlivened thought and stimulated its progress, from narrow-minded spirits who mortified and tried to mummify it ... ‘great thinkers’ (like Socrates and Hume) and ‘school philosophers’ (like Plato and Kant).”

De Finetti was also aware of the dangers of 'objective contamination', that is, that any 'objective' (probabilistic) statement can potentially 'infect' and spread its objectivity to other statements, see De Finetti (1974, V2, sec.7.5.7, p.41-42, Explanations based on 'homogeneity'):

“There is no way, however, in which the individual can avoid the burden of his own evaluations. The key cannot be found that will unlock the enchanted garden wherein, among the fairy-rings and the shrubs of magic wands, beneath the trees laden with monads and noumena, blossom forth the flowers of ‘Probabilitas realis’. With the fabulous blooms safely in our button-holes we would be spared the necessity of forming opinions, and the heavy loads we bear upon our necks would be rendered superfluous once and for all.”
As we have seen in the last sections, a randomization device is built so as to provide legitimate 'objective' probabilistic statements about some events, and randomization procedures in DSEs are conceived exactly in order to spread this objectivity around.

I.J. Good was another leading figure of the early days of the Bayesian revival movement. Contrary to de Finetti, Good has always been aware of the dangers of an extreme subjectivist position, see for example Good (1983, ch.8, Random Thoughts about Randomness, p.93):

“Some of you might have expected me, as a confirmed Bayesian, to restrict the meaning of the word ‘probability’ to subjective (personal) probability. That I have not done so is because I tend to believe that physical probability exists and is in any case a useful concept. I think physical probability can be measured only with the help of subjective probability, whereas de Finetti believes that it can be ‘defined’ in terms of subjective probability. De Finetti showed that if a person has a consistent set of subjective or logical probabilities, then he will behave ‘as if’ there were physical probabilities, where the physical probability has an initial subjective probability distribution. It seems to me that, if we are going to act as if the physical probability exists, then we don't lose anything practical if we assume it really ‘does’ exist. In fact I am not sure that existence means more than that there are no conceivable circumstances in which the assumption of existence would be misleading. But this is perhaps too glib a definition. The philosophical impact of de Finetti's theorem is that it supports the view that solipsism cannot be logically disproved. Perhaps it is the mathematical theorem with most potential philosophical impact.”
In our terminology we would have used the expression 'objective probability' instead of Good's expression, 'physical probability'. In 1962 Good edited a collection of speculative essays, including some on the foundations of statistics. The following short essay by Christopher S.O'D. Scott offers an almost direct answer to de Finetti, see Good (1962, sec.114, p.364-365):

“Scientific Inference: You are given a large number of identical inscrutable boxes. You are to select one, the ‘target box’, by any means you wish which does not involve opening any boxes, and you then have to say something about what is in it. You may do this by any means you wish which does not involve opening the target box.
This apparent miracle can easily be performed. You only have to select the target box at random, and then open a random sample of other boxes. The contents of the sample boxes enable you to make an estimate of the contents of the target box which will be better than a chance guess. To take an extreme case, if none of the sample boxes contains a rabbit and your sample is large, you can state with considerable confidence: ‘The target box does not contain a rabbit.’ In saying this, you make no assumption whatever about the principles which may have been used in filling the boxes.
This process epitomizes scientific induction at its simplest, which is the basis of all scientific inference. It depends only on the existence of a method of randomization, that is, on the assumption that events can be found which are unrelated (or almost) to given events.
It is usually thought that scientific inference depends upon nature being orderly. The above shows that a seemingly weaker condition will suffice: Scientific inference depends upon our knowing ways in which nature is disorderly.”
In the preceding chapters we discussed general conditions validating objective knowledge, from a constructivist epistemological perspective. In this chapter we discuss the use of randomization devices, which can generate observable events with distributions that are independent of the distribution of any event relevant to a given statistical study. For example, the statistical study could be concerned with the reaction of human patients affected by a given disease to alternative medical treatments, whereas a “good” randomization device could be a generic 'coin flipping machine', like a regular die or a mechanical device.
3.8 Final Remarks

As analyzed in this chapter, the randomization method, introduced by C.S. Peirce and J. Jastrow (1884), is the fundamental decoupling technique used in the design of statistical experiments (DSEs). Nevertheless, only after the work of R.A. Fisher (1935) were randomized designs used regularly in practice. Today, randomization is one of the basic backbones of statistical theory and methods. Meanwhile, the pioneering work of Peirce had been virtually forgotten by the Statistics community, until rediscovered by the historical research of Stigler (1978) and Hacking (1988). Nevertheless, even today, the work of Peirce is presented as an isolated and ad hoc contribution. As briefly indicated in section 5, it is plausible that Peirce and Jastrow's experimental and methodological work had motivations related to more general ideas of Peircean philosophy. In particular, we believe that the faint effects psychophysical hypothesis can be linked to the concept of synechism, while the randomized design solution can be embedded in the epistemological framework of Peirce's objective idealism. We believe that these topics deserve the attention of further research.

In this chapter we have examined some aspects of DSEs, such as blocking, control and randomization, from an epistemological perspective. However, in many applications, most noticeably in medical studies, several other aspects have to be taken into account, including the well-being of the patients taking part in the study. In our view, such complex situations require a thorough, open and honest discussion of all the moral and ethical aspects involved. Typically they also demand sound protocols and complex statistical models, suited to the fine quantitative analyses needed to balance multiple objectives and competing goals. For the Placebo, Nocebo, Kluge Hans, and similar effects, and the importance of blinding and randomization in clinical trials, see Kotz et al. (2005), under the entries Clinical Trials I, by N.E. Breslow, v.2, p.981-989, and Clinical Trials II, by R. Simon, v.2, p.989-998. For additional references on statistical randomization procedures, see Folks (1984), Kadane and Seidenfeld (1990), Kaptchuk and Kerr (2004), Karlowski et al. (1975), Kempthorne (1977, 1980), Noseworthy et al. (1994), Pfeffermann
et al. (1998) and Skinner and Chambers (2003).

Chapter 4
Metaphor and Metaphysics: The Subjective Side of Science

“Why? - That is what my name asks! And there He blessed him.”
Genesis, XXXII, 30.

“Metaphor is perhaps one of man's most fruitful potentialities. Its efficacy verges on magic, and it seems a tool for creation which God forgot inside His creatures when He made them.”
José Ortega y Gasset, The Dehumanization of Art, 1925.

“There is nothing as practical as a good theory.”
Attributed to Ludwig Boltzmann (1844-1906).
In this chapter we proceed with the exploration of the Cognitive Constructivism epistemological framework (Cog-Con), continuing the work developed in the previous chapters, which is briefly reviewed in section 5. In the previous chapters, we analyzed questions concerning
How objects (eigen-solutions) emerge, that is, How they become known in the interaction processes of a system with its environment. These questions had to do with laws, patterns, etc., expressed as sharp or precise hypotheses, and we argued that statistical hypothesis testing plays an important role in their validation. It is then natural to ask - Why? Why are these objects (the way they are), and why do they interact the way they do? Why-questions call for a causal nexus in a chain of events.
Therefore, their answers must be theoretical constructs based on interpretations of the laws used to describe these events. This chapter is devoted to the investigation of these issues. Likewise, the interplay between the How and Why levels of inquiry which, in the constructivist perspective, are not neatly stratified in separate hierarchical layers, but interact in complex (often circular) patterns, will also be analyzed. As in the previous chapters, the discussion is illustrated by concrete mathematical models. In the process, we raise some interesting questions related to the practice of statistical modeling.

Section 2 examines the dictum “Statistics is Prediction”. The importance of accurate prediction is obvious for any statistics practitioner, but is that all there is? The investigation of the importance of model interpretability begins in section 3, where the rhetorical power of mathematical models, self-fulfilling prophecies and some related issues are discussed, and a practical consulting case in Finance is presented, concerning the detection of trading opportunities for intraday operations in the BOVESPA and BM&F financial markets. In this example, the REAL classification tree algorithm, a statistical technique presented in Lauretto et al. (1998), is used.

Section 4 is devoted to the issue of language dependence. Therein, the investigation of model interpretability continues with an analysis of the eternal counterpoint between models for prediction and models for insight. An example from Psychology, concerning dimensional personality models, is also presented. These models are based on a dimension reduction technique known as Factor Analysis.

In section 6, the necessary or “only world” vs. optimal or “best world” formulations of optics and mechanics are discussed. Simple examples related to the calculus of variations are presented, which bridge to the epistemological discussion in the following sections. Section 7 discusses efficient and final causal relations, teleological explanations, necessary and best world arguments, and the possibility or desirability of having multiple interpretations for the same model or multiple models for the same phenomenon. In section 8, the form of modern metaphysical arguments in the construction of physical theories is addressed.

In section 9, some simple but widely applicable models based on averages computed over all “possible worlds”, or more specifically, path integrals over all possible trajectories of a system, are presented. The first example in this section relates to the linear system Monte Carlo solution of the Dirichlet problem, a technique driven by a stochastic process known as Gaussian Random Walk or Brownian Motion. Section 9 also points to a generalization of this process known as Fractional Brownian Motion. In sections 7 to 9 we also try to examine the interrelations between “only world”, “best world” and “possible worlds” forms of explanation, as well as their role and purpose in the light of cognitive constructivism, since they are at the core of modern metaphysics.

Section 10 discusses how hypothetical models, mathematical equations, etc., relate to the “true nature” of “real objects”. The importance of this relationship in the history of science is illustrated therein with two cases: the Galileo affair, and the atomic or molecular hypothesis.

4.2 Statistics is Prediction. Is That All There Is?
As a first example for discussion, we present a consulting case in finance. The goal ofthis project was to implement a model for the detection of trading opportunities forintraday operations in both the
BOVESPA and the
BM&F financial markets. For details we refer to Lauretto et al. (1998). The first algorithms implemented were based on Polynomial Networks, as presented in Farlow (1984) and Madala and Ivakhnenko (1994), combined with standard time series pre-processing and analysis techniques such as de-trending, de-seasonalization, differencing, stabilization and linear transformation, as exposed in Box and Jenkins (1976) and Brockwell and Davis (1991). A similar model is presented in Lauretto et al. (2009). The predictive power of the Polynomial Network model was considered good enough to render a profitable return / risk performance.

According to decision theory, and its gambling metaphor as presented in section 1.5, the fundamental purpose of a statistical model is to help the user in a specific gambling operation, or decision problem. Hence, at least according to the orthodox Bayesian view, predictive power is the basic criterion by which to judge the quality of a statistical model. This conclusion is accepted with no reservations by most experts in decision theory, orthodox Bayesian epistemologists, and even by many general practitioners. As typical examples, consider the following statements:

“We assume that the primary aim of [statistical] analysis is prediction.”
Robert (1995, p.456).

“Although association with theory is reassuring, it does not mean that a statistical fitted model is more true or more useful. All models should stand or fall based on their predictive power.”
Newman and Strojan (1998, p.168).

"The only useful function of a statistician is to make predictions, and thus to provide a basis for action."
W. E. Deming, as quoted in W. A. Wallis (1980).

"It is my contention that the ultimate aim of any statistical analysis is to forecast, and that this determines which techniques apply in particular circumstances... The idea that statistics is all about making forecasts based on probabilistic models of 'reality' provides a unified approach to the subject. In the literary sense, it provides a consistent authorial 'voice'... the underlying purpose, often implicit rather than explicit, of every statistical analysis is to forecast future values of a variable."
A. L. McLean (1998).

Few theaters of operation so closely resemble a real casino as the stock market; hence, we were convinced that our model would be a success. Unfortunately, our Polynomial Network model was not well accepted by the client, that is, it was seldom used for actual trading. The main complaint was the model's lack of interpretability. The model was perceived as cryptic, a "black box" capable of selecting strategic operations and computing predicted margins and success rates, but incapable of providing an explanation of why the selection was recommended at a particular juncture. This state of affairs was quite frustrating indeed: First, the client had never explicitly required such functionality during the specification stage of the project, hence the model was not conceived to provide explanatory statements. Second, as a fresh Ph.D. in Operations Research, I was well trained in the minutiae of Measure Theory and Hilbert Spaces, but had very little experience in how to make a model that could be easily interpreted by somebody else. Nevertheless, since (good) customers are always right, a second model was specified, developed and implemented, as explained in the next section.
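Before moving on, a small illustration of the kind of time-series pre-processing mentioned above may be useful. The following is a minimal sketch, on a synthetic price series, of two of the steps listed earlier (de-trending and differencing); the series and all numbers are illustrative, and the project's actual pipeline, which followed Box and Jenkins (1976), is not reproduced here.

import numpy as np

rng = np.random.default_rng(1)
t = np.arange(250.0)
# Synthetic price series: linear drift plus random-walk noise.
price = 100 + 0.05 * t + np.cumsum(rng.normal(0.0, 0.5, t.size))

# De-trending: subtract a least-squares linear trend.
slope, intercept = np.polyfit(t, price, 1)
detrended = price - (slope * t + intercept)

# Differencing: first differences, the standard device for turning an
# integrated (drifting) series into an approximately stationary one.
diffs = np.diff(price)

print(detrended.std(), diffs.std())  # transformed series, ready for model fitting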
4.3 Rhetoric and Self-Fulfilling Prophecies

The first step to develop a new model for the problem presented in the last section was to find out what the client meant by an interpretable model. After a few brainstorm sessions with the client, we narrowed it down to two main conditions: understandable I/O and understandable rules. The first condition (understandable I/O) called for the model's input and output data to be already known, familiar or directly interpretable. The second condition (understandable rules) called for the model's transformation functions, re-presentation maps or derivation rules to also be based on already known, familiar or directly interpretable principles.

Technical Indicators, derived from pre-processed price and volume trading data, constituted the input to the second model. Further details on their nature will be given later in this section. For now, it is enough to know that they are widely used in financial markets, and that the client possessed ample expertise in technical analysis. The model's statistical data processing, on the other hand, was based on a classification tree algorithm specially developed for the application, the Real Attribute Learning Algorithm, or REAL, as presented in Lauretto et al. (1998); a schematic sketch of a generic tree learner of this kind is given after the quotation below. For general classification tree algorithms, we refer to Breiman (1993), Denison et al. (2002), Michie et al. (1994), Mueller and Wysotzki (1994), and Unger and Wysotzki (1981).

"The Need for Anchors: When confronted with decisions, it is human nature to begin with the familiar and use it to make judgments. ...
The Power of the Story: For better or worse, human actions tend to be based not on quantitative factors but on story telling. People tend to look for simple reasons for their decisions, and will often base their decision on whether these reasons exist."
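As promised above, here is a schematic sketch of a generic classification tree learner on real-valued attributes. It is emphatically not the REAL algorithm of Lauretto et al. (1998), whose details are in the cited paper; it only illustrates how such a tree yields the kind of human-readable rules the client asked for. The indicator names and the toy data are hypothetical.

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(data):
    # Scan every attribute and every threshold; keep the split that most
    # reduces the size-weighted Gini impurity of the two children.
    best = None
    for attr in data[0][0]:
        for t in sorted({x[attr] for x, _ in data}):
            left = [y for x, y in data if x[attr] <= t]
            right = [y for x, y in data if x[attr] > t]
            if not left or not right:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, attr, t)
    return best

def grow(data, rule=()):
    labels = [y for _, y in data]
    split = best_split(data) if gini(labels) > 0 else None
    if split is None:
        # Leaf: print the accumulated path as an "understandable rule".
        print("IF " + " AND ".join(rule or ("always",)) + f" THEN {labels[0]}")
        return
    _, attr, t = split
    grow([(x, y) for x, y in data if x[attr] <= t], rule + (f"{attr} <= {t}",))
    grow([(x, y) for x, y in data if x[attr] > t], rule + (f"{attr} > {t}",))

# Tiny made-up sample with two hypothetical technical indicators.
data = [({"momentum": 0.8, "rel_strength": 1.2}, "buy"),
        ({"momentum": 0.9, "rel_strength": 0.7}, "buy"),
        ({"momentum": -0.5, "rel_strength": 1.1}, "hold"),
        ({"momentum": -0.7, "rel_strength": 0.6}, "hold")]
grow(data)

On this toy data the sketch prints rules of the form "IF momentum <= -0.5 THEN hold", which is exactly the kind of statement a domain expert can inspect, criticize and combine with his or her own judgment.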
The rhetorical purpose and power of statistical models have been able to conquer, within the statistical literature, only a small fraction of their relative importance in the consulting practice. There are, nevertheless, some remarkable exceptions; see, for example, Abelson (1995, p.xiii):

"The purpose of statistics is to organize a useful argument from quantitative evidence, using a form of principled rhetoric. The word principled is crucial. Just because rhetoric is unavoidable, indeed acceptable, in statistical presentations does not mean that you should say anything you please."

"Beyond the rhetorical function, statistical analysis also has a narrative role. Meaningful research tells a story with some point to it, and statistics can sharpen the story."
Let us now turn our attention to the inputs to the REAL-based model, the Technical Indicators, also known as Charting Patterns. For a general description, see Damodaran (2003, ch.7). For some of the indicators used in the REAL project, see Colby (1988) and Murphy (1986). Technical indicators are primarily interpreted as behavioral patterns in the markets or, more appropriately, as behavioral patterns of the market players. Damodaran defines five groups that categorize the indicators according to the dominant aspects of the behavioral pattern. A concise description of these five groups of indicators is given in Damodaran (2003, ch.7, p.46-47):

1 - External Forces / Large Scale Indicators: "If you believe that there are long-term cycles in stock prices, your investment strategy may be driven by the cycle you subscribe to and where you believe you are in the cycle."

2 - "If you believe that there are some traders who trade ahead of the market, either because they have better analysis tools or information, your indicators will follow these traders - specialist short sales and insider buying/selling, for instance - with the objective of piggy-backing on their trades."

3 - "With momentum indicators, such as relative strength and trend lines, you are assuming that markets often learn slowly and that it takes time for prices to adjust to true values."

4 - "Contrarian indicators such as mutual fund holdings or odd lot ratios, where you track what investors are buying and selling with the intention of doing the opposite, are grounded in the belief that markets over react."

5 - Change of Mind / Price-Value Volatility Indicators: "A number of technical indicators are built on the presumption that investors often change their views collectively, causing shifts in demand and prices, and that patterns in charts - support and resistance lines, price relative to a moving average - can predict these changes."
At this point, it is important to emphasize the dual nature of technical indicators: they disclose some things that may be happening with the trading market and also some things that may be happening with the traders themselves. In other words, they portray dynamical patterns of the market that reflect behavioral patterns of the traders.

Two characteristics of the REAL-based model, of vital importance to the success of the consulting case presented, relate to the rhetorical and psychological aspects commented on so far:
- Its good predictive and rhetorical power, which motivated the client to trade on the basis of the analyses provided by the model;
- The possibility of combining and integrating the analyses provided by the model with expert opinion.

Technical indicators often carry the blame of being based on self-fulfilling prophecies, over-simplified formulas, superficial and naive behavioral patterns, unsound economic grounds, etc. From a pragmatic perspective, market analysts do not usually care about the compatibility of technical analysis with sound economic theories, its mathematical sophistication, etc. Its ability to detect trading opportunities is what counts. From a conceptual perspective, consider the following remark:

"The self-fulfilling prophecy (argument) is generally listed as a criticism of charting. It might be more appropriate to label it as a compliment."
The importance of the psychological aspects of the models studied in this section motivates us to take a look, in the sequel, at some psychological models of personality.
4.4 Language, Metaphor and Insight

In chapter 1, the dual role played by Statistics in scientific research, namely, predicting experimental events and testing hypotheses, was pointed out. It was also emphasized that, under a constructivist perspective, these hypotheses are often expressed as equations of a mathematical model. In the last section we began to investigate the importance of the interpretability of these models. The main goal of this section is to further investigate subjective aspects of a statistical or mathematical model, specifically, the understanding or insight it provides.

We start with three different versions of the well-known motto of Richard Hamming:
- "The purpose of models is insight, not numbers."
- "The purpose of computing is insight, not numbers."
- "The purpose of numbers is insight, not numbers."

Dictionary definitions of Insight include:
- A penetrating, deep or clear perception of a complex situation;
- Grasping the inner or hidden nature of things;
- An intuitive or sudden understanding.

The illustrative case presented in this section is based on psychological models of personality. Many of these models rely on symmetric configurations known as "mandala" schemata, see for example Jung (1968), and a good example is provided by the five elements model of traditional Chinese alchemy and their associated personality traits:

1- Fire: Extroverted, emotional, empathic, self-aware, sociable, eloquent.
2- Earth: Caring, supporting, stable, protective, worried, attached.
3- Metal: Analytical, controlling, logical, meticulous, precise, zealous.
4- Water: Anxious, deep, insecure, introspective, honest, nervous.
5- Wood: Angry, assertive, creative, decisive, frustrated, leading.

Interactions between elements are conceived as a double feed-back cycle, represented by a pentagram inscribed in a pentagon. The pentagon or external cycle represents the creation, stimulus or positive feed-back in the system, while the pentagram or internal cycle represents the destruction, control or negative feed-back in the system. The traditional representations of these systemic generative mechanisms or causal relations are:

Pentagon: fire [calcinates to⟩ earth [harbors⟩ metal [condenses⟩ water [nourishes⟩ wood [fuels⟩ fire.
Pentagram: fire [melts⟩ metal [cuts⟩ wood [incorporates⟩ earth [absorbs⟩ water [extinguishes⟩ fire.

This double feed-back structure allows the representation of systems with complex interconnections and nontrivial dynamical properties. In fact, the systemic interconnections are considered the key for understanding a general five-element model, rather than any superficial analogy with the five elements' traditional labels.

It is an entertaining exercise to compare and relate the five alchemical elements listed above with the five groups of technical indicators presented in the last section, or with the big-five personality factors presented next, even if some of these models are considered pre-scientific. Why, for example, do these models employ exactly five factors? That is, why is it that "four are few and six are many"? Is there an implicit mechanism in the model, see Hargittai (1992), Hotchkiss (1998) or Philips (1995, ch.2), or is this an empirical statement supported by research data?

Scientific psychometric models must be based on solid statistical analysis of testable hypotheses. Factor Analysis has been one of the preferred techniques used in the construction of modern psychometric models and it is the one used in the examples we discuss next. In section C.5, the basic structure of factor analysis statistical models is reviewed.

In Allport and Odbert (1936) the authors presented their Lexical Hypothesis. According to them, important aspects of human life correspond to words in the spoken language. Also, the number of corresponding terms in the lexicon is supposed to reflect the importance of each aspect:

"Those individual differences that are most salient and socially relevant in people's lives will eventually become encoded into their language; the more important such a difference, the more likely is it to become expressed as a single word."

"Applying the Lexical Hypothesis to Personality Disorders:
Ultimately, the five-factor model is a model of personality derived from the constructs and observations of lay-people, and it provides an excellent map of the domains of personality to which the average layperson attends. However, the present findings suggest that the five-factor model is not sufficiently comprehensive for describing personality disorders or sophisticated enough for clinical purposes.
In contrast to laypeople, practicing clinicians devote their professional lives to understanding the intricacies of personality. They develop intimate knowledge of others' lives and inner experience in ways that may not be possible in everyday social interaction.
Moreover, they treat patients with variants of personality pathology that laypeople encounter only infrequently (and are likely to avoid when they do encounter it). One would therefore expect expert clinicians to develop constructs more differentiated than those of lay observers.
Indeed, if this were not true, it would violate the lexical hypothesis on which the five-factor model rests: that language evolves over time to reflect what is important. To the extent that mental health professionals observe personality with particular goals and expertise, and observe the more pathological end of the personality spectrum, the constructs they consider important should differ from those of the average layperson."
The issue of language dependence is very important in cognitive constructivism. For further discussion, see Maturana (1988, 1991). Thus far we have stressed the lexical aspect of language, that is, the importance of the available vocabulary in our description of reality. In the remaining part of this section we shall focus on the symbolic or figurative use of language constructs in these descriptions. We proceed by examining in more detail the factor analysis model.

Factor analysis is a dimension reduction technique. Its application renders a 'simple' object, the factor model, capable of efficiently "coding", into a space of reduced dimension, a complex 'real' object from a full or high dimensional space. In other words, a dimension reduction technique presumes some form of valid knowledge transference, back and forth, between the complex (high dimensional) object and its simple (low dimensional) model. Hence, the process of using and interpreting factor analysis models can be conceived as metaphorical. Recall that the Greek word metaphor stands for transport or transfer, so that a linguistic metaphor transfers some of the characteristics of one object, called the source or vehicle, onto a second distinct object, called the target, tenor or topic; for a comprehensive reference see Lakoff and Johnson (2003).

For reasons similar to those studied in the last section, most users of a personality model require it to be statistically sound. Many of them further demand it to be interpretable, in order to provide good insights into their patients' personality and problems. A good model should not only be useful in predicting recovery rates or drug effectiveness, but also help in supplying good counseling or therapeutics.

Paraphrasing Vega-Rodríguez (1998):
The metaphorical mechanism should provide an articulation point between the empirical and the hypothetical, the rational and the intuitive, between calculation and insight.
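To fix ideas, the following minimal sketch extracts and rotates factor loadings on synthetic data. Principal-component extraction of the loadings stands in here for the ML or MAP estimation discussed next, and the rotation step is the standard SVD-based varimax algorithm; the data, the number of factors and all numeric values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-factor data: 6 observed variables, each loading on one factor.
true_L = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
F = rng.standard_normal((1000, 2))                        # latent factor scores
X = F @ true_L.T + 0.3 * rng.standard_normal((1000, 6))   # observed variables

# Principal-factor extraction: leading eigenvectors of the correlation matrix.
C = np.corrcoef(X, rowvar=False)
w, V = np.linalg.eigh(C)              # eigenvalues in ascending order
L = V[:, -2:] * np.sqrt(w[-2:])       # unrotated loadings of the 2 leading factors

def varimax(L, n_iter=100, tol=1e-8):
    # Orthogonal rotation maximizing the variance of the squared loadings,
    # which drives the loading matrix toward a sparse, interpretable pattern.
    p, k = L.shape
    R, d = np.eye(k), 0.0
    for _ in range(n_iter):
        LR = L @ R
        G = L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(G)
        R = u @ vt
        if s.sum() < d * (1 + tol):
            break
        d = s.sum()
    return L @ R

print(np.round(varimax(L), 2))   # recovers a near-sparse pattern, up to sign and order

The unrotated loadings mix the two factors; after varimax each observed variable loads essentially on a single factor, which is precisely the gain in heuristic explanatory power discussed below.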
The main reason for choosing factor analysis to illustrate this section is its capability of efficiently and transparently building sound statistical models that, at the same time, provide intuitive interpretations. While soundness is the result of "estimation and identification tools", such as ML (maximum likelihood) or MAP (maximum a posteriori) optimization, hypothesis testing and model selection, interpretability results from "representation tools", such as orthogonal and oblique factor rotation techniques.

Factor rotation tools are meant to reconfigure the structure of a given factor analysis model, so as to maintain its probabilistic explanatory power while maximizing its heuristic explanatory power. Factor rotations are performed to implement an objective optimization criterion, such as sparsity or entropy maximization. The optimal solution (for each criterion) is unique and is expected to greatly enhance model interpretability.

4.5 Constructive Ontology and Metaphysics

How important are heuristic arguments in other areas of science? Should statistical or mathematical models play a similar rhetorical role in other fields of application? We will try to answer these questions by discussing the role played by similar heuristic arguments in physics. In sections 2 and 3 we dealt with application areas in which text(ure) manufacture comprised, to a great extent, the very spinning of the threads. Nevertheless, one can have the false impression that the constructivist approach suits better high level, soft science areas, rather than low level, rock bottom Physics. This widespread misconception is certainly not the case. In sections 7 through 10 we analyze the role played in science by metaphysics, a very special form of heuristic argumentation.

The example presented in section 4, together with the corresponding Factor Analysis modeling technique, is at an intermediate point of the soft-hard science scale used herein to (approximately) order the examples. Therefore, as previously stated in the introduction, we shall use the current section to make a pause in the exposition, take a deep breath, and try to get a bird's eye view of the scenario. This section also reviews some concepts of Cog-Con ontology defined in previous chapters and discusses some insights on Cog-Con metaphysics.

The Cog-Con framework rests upon two basic metaphors: Heinz von Foerster's metaphor of
Object as token for an eigensolution, which is the key to Cog-Con ontology, and Humberto Maturana and Francisco Varela's metaphor of
Autopoiesis and cognition, the key to Cog-Con metaphysics. Below we review these two metaphors, as they were used in chapter 1.
Autopoiesis and Cognition
Autopoietic systems are non-equilibrium (dissipative) dynamical systems exhibiting (meta)stable structures, whose organization remains invariant over (long periods of) time, despite the frequent substitution of their components. Moreover, these components are produced by the same structures they regenerate. As an example, take the macromolecular population of a single cell, which can be renewed thousands of times during its lifetime, see Bertalanffy (1969). However, in spite of the fact that autopoiesis was a metaphor developed to suit the essential characteristics of organic life, the concept of autopoietic system has been applied in the analysis of many other concrete or abstract autonomous systems such as social systems and corporate organizations, see for example Luhmann (1989) and Zeleny (1980).

The regeneration processes in the autopoietic system production network require the acquisition of resources such as new materials, energy and neg-entropy (order), from the system's environment. Efficient acquisition of the needed resources demands selective
(inter)actions which, in turn, must be based on suitable inferential processes (predictions). Moreover, these inferential processes characterize the agent's domain of interaction as a cognitive domain. For more details see the comments in chapter 1 and, more importantly, the original statements in Maturana and Varela (1980, p.10):

"The circularity of their organization continuously brings them back to the same internal state (same with respect to the cyclic process). ... Thus the circular organization implies the prediction that an interaction that took place once will take place again. ... Accordingly, the predictions implied in the organization of the living system are not predictions of particular events, but of classes of inter-actions. ... This makes living systems, inferential systems, and their domain of interactions a cognitive domain."
Objects as Tokens for Eigen-Solutions
The circular (cyclic or recursive) characteristic of autopoietic regenerative processes and their eigen (auto, equilibrium, fixed, homeostatic, invariant, recurrent, recursive) -states, both in concrete and abstract autopoietic systems, are investigated in Foerster (2003) and Segal (2001).

"The meaning of recursion is to run through one's own path again. One of its results is that under certain conditions there exist indeed solutions which, when reentered into the formalism, produce again the same solution. These are called "eigen-values", "eigen-functions", "eigen-behaviors", etc., depending on which domain this formation is applied - in the domain of numbers, in functions, in behaviors, etc." Segal (2001, p.145).

The concept of eigen-solution for an autopoietic system is the key to distinguish specific objects in a cognitive domain.

"Objects are tokens for eigen-behaviors. Tokens stand for something else. In exchange for money (a token itself for gold held by one's government, but unfortunately no longer redeemable), tokens are used to gain admittance to the subway or to play pinball machines. In the cognitive realm, objects are the token names we give to our eigen-behavior. ... When you speak about a ball, you are talking about the experience arising from your recursive sensorimotor behavior when interacting with that something you call a ball. The "ball" as object becomes a token in our experience and language for that behavior which you know how to do when you handle a ball. This is the constructivist's insight into what takes place when we talk about our experience with objects." Segal (2001, p.127).
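The idea of a solution that, re-entered into the formalism, reproduces itself can be given a toy mathematical form. The following sketch (an illustration of the concept, not taken from the text) iterates the cosine map until it converges to its eigen-value, the fixed point of the recursion.

import math

x = 1.0                      # arbitrary starting value
for _ in range(100):
    x = math.cos(x)          # "run through one's own path again"
print(x)                     # ~0.739085..., the eigen-value of the cosine map
print(math.cos(x) - x)       # ~0: the eigen-solution, re-entered, yields itself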
Constructive Ontology
The Cog-Con framework also includes the following conception of reality and some relatedterms, as defined in chapter 2:
1. Known (knowable) Object: An actual (potential) eigen-solution of a given system's interaction with its environment. In the sequel, we may use a somewhat more friendly terminology by simply using the term Object.

2. Objective (how, less, more): Degree of conformance of an object to the essential attributes of an eigen-solution (to be precise, stable, separable and composable).

3. Reality: A (maximal) set of objects, as recognized by a given system, when interacting with single objects or with compositions of objects in that set.

The Cog-Con framework assumes that an object is always observed by an observer, just like a living organism or a more abstract system, interacting with its environment. Therefore, this framework asserts that the manifestation of the corresponding eigen-solution and the properties of the object are respectively driven and specified by both the system and its environment. More concisely, Cog-Con sustains:

4. Idealism: The belief that a system's knowledge of an object is always dependent on the system's autopoietic relations.

5. Realism: The belief that a system's knowledge of an object is always dependent on the environment's constraints.

Consequently, the Cog-Con perspective requires a fine equilibrium, called
Realistic or Objective Idealism. Solipsism or Skepticism are symptoms of an epistemological analysis that loses the proper balance by putting too much weight on the idealistic side. Conversely, Dogmatic Realism is a symptom of an epistemological analysis that loses the proper balance by putting too much weight on the realistic side. Dogmatic realism has been, from the Cog-Con perspective, a very common (but mistaken) position in modern epistemology. Therefore, it is useful to have a specific expression, namely, something in
itself, to be used as a marker or label for such ill-posed dogmatic statements. The method used to access something in itself is often described as:
- Something that an observer would observe if the (same) observer did not exist, or
- Something that an observer could observe if he made no observations, or
- Something that an observer should observe in the environment without interacting with it (or disturbing it in any way),
and many other equally senseless variations.

Although the application of the Cog-Con framework is as general as that of autopoiesis, this paper is focused on scientific activities. The interpretation of scientific knowledge as an eigensolution of a research process is part of a Cog-Con approach to epistemology. Figure 1 presents an idealized structure and dynamics of knowledge production, see Krohn and Küppers (1990) and chapters 1 and 6. The diagram represents, on the Experiment side (left column), the laboratory or field operations of an empirical science, where experiments are designed and built, observable effects are generated and measured, and an experimental data bank is assembled. On the Theory side (right column), the diagram represents the theoretical work of statistical analysis, interpretation and (hopefully) understanding according to accepted patterns. If necessary, new hypotheses (including whole new theories) are formulated, motivating the design of new experiments. Theory and experimentation constitute a double feed-back cycle, making it clear that the design of experiments is guided by the existing theory and its interpretation, which, in turn, must be constantly checked, adapted or modified in order to cope with the observed experiments. The whole system constitutes an autopoietic unit.

      Experiment                                          Theory
Operationalization   ⇐  Experiment design      ⇐  Hypotheses formulation
        ⇓                                                   ⇑
Effects observation     True/False eigen-solution   Creative interpretation
        ⇓                                                   ⇑
Data acquisition     ⇒  Mnemetic explanation   ⇒  Statistical analysis
   Sample space                                        Parameter space

Figure 1: Scientific production diagram.

Fact or Fiction?
At this point it is useful to (re)turn our attention to a specific model, namely, factor analysis, as discussed in section 4, and consider the following questions raised by Brian Everitt (1984, p.92, emphases are ours) concerning the appropriate interpretation of factors:

"Latent variables - fact or fiction? One of the major criticisms of factor analysis has been the tendency for investigators to give names to factors, and subsequently, to imply that these factors have a reality of their own over and above the manifest variables.
This tendency continues with the use of the term latent variables, since it suggests that they are existing variables and that there is simply a problem of how they should be measured. In truth, of course, latent variables will never be anything more than is contained in the observed variables and will never be anything beyond what has been specified in the model. For example, in the statement that verbal ability is whatever certain tests have in common, the empirical meaning is nothing more than a shorthand for the observations of the correlations. It does not mean that verbal ability is a variable that is measurable in any manifest sense.
However, the concept of latent variable may still be extremely helpful. A scientist may have a number of hypothetical constructs in terms of which some theory is formulated, and he is willing to assume that the latent variables used in specifying the structural models of interest are the operational equivalents to theoretical constructs. As long as it is remembered that in most cases there is no empirical way to prove this correspondence, then such an approach can lead to interesting and informative theoretical insights."

Ontology is a term used in philosophy in reference to a systematic account of existence or reality. We have already established the Cog-Con approach to objects as tokens for eigen-solutions, and explained their four essential attributes, namely, discreteness (preciseness, sharpness or exactness), stability, separability and composability. Therefore, in the Cog-Con framework, assessing the ontological status of an object, or saying how objective it is, is to ascertain how well it manifests the four essential attributes of an eigen-solution.

The Full Bayesian Significance Test, or FBST, is a possibilistic belief calculus, based on (posterior) probabilistic measures, that was conceived as a statistical significance test to assess the objectivity of an eigen-solution, that is, to measure how well a given object manifests or conforms to von Foerster's four essential attributes. The FBST belief or credal value, ev(H | X), the e-value of hypothesis H given the observed data X, is interpreted as the epistemic value of hypothesis H (given X), or the evidence value of data X (supporting H). The formal definition of the FBST and several of its implementations in specific problems can be found in the author's previous publications, and are reviewed in appendix A.
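As an illustration, the e-value of a sharp hypothesis can be computed by straightforward posterior simulation. The sketch below is not taken from the author's implementations; it handles the simplest case, H: θ = θ₀ in a binomial model, and assumes a uniform prior (so the posterior is a Beta distribution) together with a flat reference density, in which case the posterior surprise function reduces to the posterior density itself.

from scipy import stats

def fbst_evalue(x, n, theta0, n_sim=200_000, seed=42):
    post = stats.beta(x + 1, n - x + 1)        # posterior under a uniform prior
    f0 = post.pdf(theta0)                      # posterior density on the sharp H
    theta = post.rvs(size=n_sim, random_state=seed)
    # Tangential set T: parameter points less "surprising" than H, i.e., with
    # posterior density above the supremum of the density on H.
    pr_tangential = (post.pdf(theta) > f0).mean()
    return 1.0 - pr_tangential                 # ev(H|X) = 1 - Pr(T|X)

# Example: testing H: theta = 0.5 for two data sets.
print(fbst_evalue(52, 100, 0.5))   # high e-value: H is not discredited
print(fbst_evalue(70, 100, 0.5))   # low e-value: the data strongly discredit H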
Greek or Latin? Latent or Manifest?
We have already discussed the ontological status of an object. This discussion assumes testing hypotheses in a statistical model which, in order to be built, requires one to know how to distinguish concrete measurable entities from abstract concepts, observed values from model parameters, latent from manifest variables, etc. When designing and conducting an experiment, a scientist must have a well-defined statistical model, and keep these distinctions crisp and clear. This is so important in the experimental sciences that statisticians have the habit of using Latin letters for observables, and Greek letters for parameters. When a statistician questions whether a letter is Latin or Greek, he or she is not asking for help with foreign alphabets, but rather seeking information about the aforementioned distinctions.

According to the positivist philosophical school, measurable entities, observed values, manifest variables, etc. are the true, first class entities of a hard science, while abstract concepts, model parameters, latent variables, etc. should be considered second class entities. One reason for downgrading the latter class is that the positivist school assumes a nominalist perspective. Nominalism (at least in its strictest form) considers abstract concepts as mere names (nomina), that may stand as proxy for a "really existing item", denoting "that singular thing" (supponere pro illa re singulari). The Cog-Con perspective has no part in the positivist dream. This issue will be further investigated in the next sub-section, as well as in sections 8 and 11. For now we offer the following argument:

Although for a given model the aforementioned distinctions between what to write using Latin or Greek letters should always be crisp and clear, we may have to simultaneously work with several models. For example, we may need to use several models hierarchically organized to cope with phenomena at different scales or levels of granularity, like models in physics, chemistry, biology, and psychology, see chapters 5 and 6. We may also need different models for competing theories trying to explain a given phenomenon. Finally, we may need different models providing equivalent or compatible laws for given phenomena that, nevertheless, use distinct theoretical approaches, see sections 8, 9 and 10. The positivist dream quickly turns into a nightmare when one realizes that an entity corresponding to a Greek letter variable in one model corresponds to a Latin letter variable in another, and vice-versa.

It is also important to realize that in the Cog-Con approach the ontological status of an object is a reference to the properties of the corresponding eigen-solution emerging in a cyclic process. This leads to an intrinsically dynamic approach to ontology, in sharp contrast with other analyses based on static categories. A consequence of this dynamical setting is that, in the Cog-Con approach, a statement about the ontological status of an object always refers back to this underlying cyclic process.
Constructive Metaphysics
Metaphysics, in its gnosiological sense, is a philosophical term we use to refer to a systematic account of possible forms of understanding, valid forms of explanation or rational principles of intelligibility. In science, such explanations are often well represented in a schematic diagram describing the organization of a conceptual network. A link in such a diagram expresses a theoretical relation like, for example, a causal nexus, that is, a cause and effect relation. In modern science, such explanations must also include the symbolic derivation of scientific hypotheses from general scientific laws, the formulation of new laws in an existing theory, and even the conception of new theories, as well as their general understanding based on general metaphysical principles.

In this context, it is natural to ask questions like: What do we mean by the intuitive quality or theoretical importance of a concept or, more generally, of a sub-network? How interesting are the insights we gain from it? How can we assess its explanatory power or heuristic value? We will try to answer these questions in the following sections, most especially in section 8, on modern metaphysics. In this section we provide only a preliminary discussion of the importance of metaphysical entities in the constructivist perspective.

We now return to Humberto Maturana and Francisco Varela's metaphor of autopoiesis and cognition. As stated at the beginning of this section, this metaphor is the key for Cog-Con metaphysics. From details of this metaphor we conclude that the autopoietic relations of a system not only define who or what it "is", but also limit the class of interactions in which it can possibly engage or the class of events it can possibly perceive. An adaptive system can learn, that is, it can reconfigure its internal organization, reshape its architecture, in order to enlarge its scope of inference or make better predictions. Nevertheless, learning is an evolutive process, and any evolutionary path to the future has to progress from the system's present (or initial) configuration. From the above considerations it is clear that, from a constructivist perspective, the specification of autopoietic relations is of vital importance, since it literally defines the scope and possibilities of the system's life.
Theoretical Insights
Cog-Con approaches science as an autopoietic system whose organization is coded by symbolic laws, causal relations, and metaphysical principles. Consequently, we must give them the greatest importance. Nevertheless, such metaphysical entities are even more abstract than the latent variables discussed in the last subsection. In contrast with the constructivist approach, the positivist school is thus quite hostile to metaphysical concepts.

In the Cog-Con perspective, metaphysics provides meaning to objects in a given reality, explaining why the corresponding eigen-solutions manifest themselves the way they do. Accordingly, theoretical concepts become building blocks in the coding of systemic knowledge and reference marks in the mapping of the system's environment. Conceptual relations are translated into inference tools, thus becoming, by definition, the basis of autopoietic cognition. In the Cog-Con perspective, better understanding will strengthen a given theoretical architecture or entail its evolution. In so doing, the importance of the pertinent concepts is enhanced, their scope is enlarged and their utility increased. The whole process enables richer and wider connections in the web of knowledge, embedding theory even deeper in the system's life, revealing more links in the great chain of being!
4.6 Necessary and Best Worlds

In sections 7 through 10 we analyze the role played in modern science by metaphysics, a very special form of heuristic argumentation. Such arguments often explain why a system follows a given trajectory or evolves along a given path. These arguments may explain why a system must follow a necessary path or is effectively forced along a single trajectory; these are "only world" explanations. Teleological arguments explain why a system chooses the best trajectory according to some optimality criterion; these are "best world" explanations. Stochastic or integral arguments explain why the system evolution takes into account, including, averaging, summing or integrating over, all possible or admissible trajectories; these are "possible worlds" explanations.

In sections 7 to 9 we also try to examine the interrelations between "only world", "best world" and "possible worlds" forms of explanation, as well as their role and purpose in the light of cognitive constructivism, since they are at the core of modern metaphysics. We begin this journey by studying in this section a simple and seemingly innocent mathematical puzzle. The puzzle, which will be solved directly by elementary calculus, is in fact used by Richard Feynman as an allegory to present an important variational problem.

Consider a beach with shore line represented by x = a, in the standard Cartesian plane. A lifeguard, at position (x, y) = (0, 0), must reach a swimmer in distress at position (x, y) = (a + b, d). While on the athletic track the lifeguard can run at top speed c; on the sand he can run only at speed c/ν₁. Once in the water, the lifeguard can only swim at speed c/ν₂, 1 < ν₁ < ν₂. Letting (x, y) = (a, y(a)) be the point where he enters the water, what is the optimal value y(a) = z if he wants to reach position (a + b, d) as fast as possible?

Since the shortest path in a homogeneous medium is a straight line, the optimal trajectory is a broken line, from (0,
0) to (a, z), and then from (a, z) to (a + b, d). The total travel time is J(z)/c, where

J(z) = ν₁ √(a² + z²) + ν₂ √(b² + (d − z)²).

Since we want J(z) at a minimum, we set

dJ/dz = ν₁ z / √(a² + z²) − ν₂ (d − z) / √(b² + (d − z)²) = 0,

so that we should have

ν₁ sin(θ₁) = ν₂ sin(θ₂),

where sin(θ₁) = z / √(a² + z²) and sin(θ₂) = (d − z) / √(b² + (d − z)²) are the sines of the angles that the two segments make with the normal to the shore line.

Professional lifeguards claim that this simple model can be improved by dividing the sand into a dry band, V₁, and a wet band, V₂, and the water into a shallow band, V₃, and a deep band, V₄, with respective different media 'resistance' indices, ν₁, ν₂, ν₃, ν₄, satisfying ν₄ > ν₃ > ν₂ > ν₁ >
1. Although the solution for the improved model can be similarly obtained, a general formalism to solve 'variational' problems of this kind exists, which is known as the Euler-Lagrange equation. For an instructive introduction see Krasnov et al. (1973), Leech (1963) and Marion (1970).

The trigonometric relation, ν(x) sin(θ(x)) = K, obtained in the last equation, is known in optics as Snell-Descartes' law. It explains the refraction (bending) of a light ray incident to a surface separating two distinct optic media. In this relation, ν is the medium refraction index. The variational problem solved above was proposed by Pierre de Fermat in 1662 to 'explain' Snell-Descartes' law. Fermat's principle of least time states that a ray of light, going from one point to another, follows the path which is traversed in the smallest time.

Notice that Fermat enounced this principle before any measurement of the speed of light. The first quantitative estimate of the speed of light, in sidereal space, was obtained by O. Roemer in 1676. He measured the Doppler effect on the period of Io, a satellite of Jupiter discovered by Galileo in 1610. More precisely, he measured the violet and red shifts, i.e., the variation for shorter and longer in the observed periods of Io, as the Earth traveled in its orbit towards and away from Jupiter. Roemer's final estimate was c = 1 au/11′, that is, one astronomical unit (the length of the semi-major axis of the Earth's elliptical orbit around the Sun, approximately 150 million kilometres) per 11 minutes. Today's value is around 1 au/8′19″. The first direct measurements of the comparative speed of light in distinct material media (air and water) were obtained by Léon J. B. Foucault, almost two centuries later, in 1850, using a rotating mirror device.
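Returning to the two-media model, the least-time solution is easy to verify numerically. The following sketch, with illustrative values for the geometry and the resistance indices (not taken from the text), minimizes J(z) by golden-section search and checks Snell-Descartes' law at the optimum.

import math

a, b, d = 30.0, 20.0, 40.0       # sand width, water width, lateral offset (illustrative)
nu1, nu2 = 1.2, 2.0              # resistance indices, 1 < nu1 < nu2

def J(z):
    # J(z)/c is the total travel time of the broken-line trajectory.
    return nu1 * math.hypot(a, z) + nu2 * math.hypot(b, d - z)

lo, hi = 0.0, d                  # the minimizer lies in (0, d)
phi = (math.sqrt(5) - 1) / 2
for _ in range(100):             # golden-section search for the minimum of J
    m1, m2 = hi - phi * (hi - lo), lo + phi * (hi - lo)
    if J(m1) < J(m2):
        hi = m2
    else:
        lo = m1
z = (lo + hi) / 2

sin1 = z / math.hypot(a, z)              # angles measured from the normal
sin2 = (d - z) / math.hypot(b, d - z)    # to the shore line x = a
print(f"z* = {z:.4f}, nu1*sin(th1) = {nu1 * sin1:.6f}, nu2*sin(th2) = {nu2 * sin2:.6f}")

At the computed optimum the two products agree to numerical precision, which is exactly the content of the refraction law derived above.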
For details on these measurements, see Tobin (1993) and Jaffe (1960). For a historical perspective of several competing theories of light we refer to Ronchi (1970) and Sabra (1981).

Snell-Descartes' "law" is an example of a mathematical model that dictates a "necessary world", stating, plain and simple, how things "have to be". In contrast, Fermat's "principle" is a theoretical construct that elects a "best world" according to some criterion used to compare "possible worlds".

Fermat's principle is formulated by minimizing the integral of ν ds = c dt, that is, the total travel time. In a similar way, Leibniz, Euler, Maupertuis, Lagrange, Jacobi, Hamilton, and many others were able to reformulate Newtonian mechanics by minimizing the integral of a quantity called action, dS = L dt, where the Lagrangian, L, is the difference between the kinetic energy (Leibniz's vis viva), (1/2)mv², and the potential energy of the system (Leibniz's vis morta). Hence, these formulations are called in physics principles of minimum action or principles of least action.

4.7 Efficient and Final Causes

In the XVII century, several models of light and its propagation were developed to explain Snell-Descartes' law, see Sabra (1981). The discussion of these models, and the necessary versus best world formulations of optics and mechanics discussed in the last section, are historically connected to the discussion of the metaphysical concepts of efficient and final causes.

This terminology dates back to Aristotle, who distinguishes, in Metaphysics, four forms of causation, that is, four types of answers that can be given to a Why-question. Namely:
- Material cause: Because it is made of, or its constituent parts are ...
- Formal cause: Because it has the form of, or is shaped like ...
- Efficient cause: Because it is produced, or accomplished by ...
- Final cause: Because it is intended to, or has the purpose of ...

Efficient and final causes are the subject of this section. For a general overview of the theme in the history of 17th and 18th century Physics, see Brunet (1938), Dugas (1988), Pulte (1989), Goldstine (1980), Wiegel (1986) and Yourgrau and Mandelstam (1979).

Newtonian mechanics is formulated only in terms of efficient causes - an existing force acts on a particle (or body) producing a movement described by the Newtonian differential equations. Least action principles, on the other hand, are formulated through the use of a final cause: the trajectory followed by the particle (or light ray) is that which optimizes a certain characteristic, given its original and final positions. This is why these formulations are also called teleological, from the Greek τέλος, aim, goal or purpose. Leibniz, for example, writes:

"In fact, as I have shown by the remarkable example of the principles of optics, ....(that) final causes may be introduced with great fruitfulness even into the special problems of physics, not merely to increase our admiration for the most beautiful works of the supreme Author, but also to help us make predictions by means of them which could not be as apparent, except perhaps hypothetically, through the use of efficient cause... It must be maintained in general that all existent facts can be explained in two ways - through a kingdom of power or efficient causes and through a kingdom of wisdom or final causes... Thus these two kingdoms everywhere permeate each other, yet their laws are never confused and never disturbed, so the maximum in the kingdom of power, and the best in the kingdom of wisdom, take place together."
Euler and Maupertuis generalized the arguments of Fermat and Leibniz, deriving Newtonian mechanics from the least action principle. The Principle of Least Action was stated in Maupertuis (1756, IV, p.36), as his
Lois du Mouvement, Principe Général:

"Laws of Movement, General Principle: When a change occurs in Nature, the quantity of action necessary for that change is as small as possible. The quantity of action is the product of the mass of the bodies times their speed and the distance they travel. When a body is transported from one place to another, the action is proportional to the mass of the body, to its speed and to the distance over which it is transported."

Maupertuis also used the same theological arguments of Leibniz regarding the harmony between efficient and final causes. In Maupertuis (1756, IV, p.20-23 of
Accord de Différentes Lois de la Nature, qui avoient jusqu'ici paru incompatibles), for example, we find:

"Accord Between Different Laws of Nature, that seemed incompatible. ... I know the distaste that many mathematicians have for final causes applied to physics, a distaste that I share up to some point. I admit, it is risky to introduce such elements; their use is dangerous, as shown by the errors made by Fermat (and Leibniz(?)) in following them. Nevertheless, it is perhaps not the principle that is dangerous, but rather the hastiness in taking as a basic principle that which is merely a consequence of a basic principle.

One cannot doubt that everything is governed by a supreme Being who has imposed forces on material objects, forces that show his power, just as he has fated those objects to execute actions that demonstrate his wisdom. The harmony between these two attributes is so perfect, that undoubtedly all the effects of Nature could be derived from each one taken separately. A blind and deterministic mechanics follows the plans of a perfectly clear and free Intellect. If our spirits were sufficiently vast, we would also see the causes of all physical effects, either by studying the properties of material bodies or by studying what would be most suitable for them to do.

The first type of studies is more within our power, but does not take us far. The second type may lead us astray, since we do not know enough of the goals of Nature and we can be mistaken about the quantity that is truly the expense of Nature in producing its effects.

To unify the certainty of our research with its breadth, it is necessary to use both types of study. Let us calculate the motion of bodies, but also consult the plans of the Intelligence that makes them move.

It seems that the ancient philosophers made the first attempts at this sort of science, in looking for metaphysical relationships between numbers and material bodies. When they said that God occupies himself with geometry, they surely meant that He unites in that science the works of His power with the perspectives of His wisdom."
Some of the metaphysical explanations given by Leibniz and Maupertuis are based on theological arguments which can be regarded as late inheritances of medieval philosophy. This form of metaphysical argument, however, faded away from the mainstream of science after the 18th century. Nevertheless, in the following century, the (many variations of the) least action principle disclosed more powerful formalisms and found several new applications in physics. For details, see Goldstine (1980) and Wiegel (1986). As stated in Yourgrau and Mandelstam (1979, ch.14 of
The Significance of Variational Principles in Natural Philosophy),

"Towards the end of the (XIX) century, Helmholtz invoked, on purely scientific grounds, the principle of least action as a unifying scientific natural law, a 'leit-motif' dominating the whole of physics, Helmholtz (1887).
'From these facts we may even now draw the conclusion that the domain of validity of the principle of least action has reached far beyond the boundaries of the mechanics of ponderable bodies. Maupertuis' high hopes for the absolute general validity of his principle appear to be approaching their fulfillment, however slender the mechanical proofs and however contradictory the metaphysical speculation which the author himself could at the time adduce in support of his new principle. Even at this stage, it can be considered as highly probable that it is the universal law pertaining to all processes in nature. ... In any case, the general validity of the principle of least action seems to me assured, since it may claim a higher place as a heuristic and guiding principle in our endeavor to formulate the laws governing new classes of phenomena. Helmholtz (1887).' "

4.8 Modern Metaphysics

In this section we continue the investigation on the use and nature of metaphysical principles in theoretical Physics. Like many other adjectives, the word metaphysical has acquired both a positive (meliorative, eulogistic, appreciative) and a negative (pejorative, derogatory, unappreciative) connotation.

Logical positivism or logical empiricism was a mainstream school in the philosophy of science of the early 20th century. One of the objectives of the positivist school was to build science from empirical (observable) concepts only. According to this point of view, every metaphysical, that is, non-empirical or non-directly observable, entity is cognitively meaningless, and all teleological principles were perceived to fall in this category.

Teleological arguments were also perceived as problematic in Biology and related fields, due to the frequent abuse of phony teleological arguments, usually in the form of crude fallacies or obvious tautologies, given to provide support to whatever statement was in need of it. Maupertuis himself, the proponent of the first general least action principle, was aware of such problems, as clearly stated in the text of his quoted in the previous section. Why then did important theoretical physicists insist on keeping teleological arguments and other kinds of principles perceived as metaphysical among the regular tools of the trade?

Yourgrau and Mandelstam (1979, p.10) emphasize the heuristic importance of metaphysical principles in the early development of prominent physical theories:

"In conformity with the scope of our subject, the speculative facets of the thinkers under review have been emphasized. Historically by far more consequential were the positive contributions to natural science, contributions which transferred the emphasis from 'a priori' reasoning to theories based upon observation and experiment. Hence, while the future exponents of least principles may have been guided in their metaphysical outlook (1) by the idealistic background we have described, they had, nevertheless, to present their formulations in such fashion that the data of experience would thus be explained.
A systematic scrutiny of the individual chronological stages in the evolution of minimum principles can furnish us with profound insight into continuous transformation of a metaphysical canon to an exact natural law.
(1) By 'metaphysical outlook' we comprehend nothing but those general assumptions which are accepted by the scientist."
The definition of Metaphysics used by Yourgrau is perhaps a bit too vague, or too humble. We believe that a deeper understanding of the role played by metaphysics in modern theoretical physics can be found (emphases are ours) in Einstein (1950):

"We have become acquainted with concepts and general relations that enable us to comprehend an immense range of experiences and make them accessible to mathematical treatment. ...
(but) Why do we devise theories at all? The answer to the latter question is simply: Because we enjoy comprehending, i.e., reducing phenomena by the process of logic to something already known or (apparently) evident. ...
This is the striving toward unification and simplification of the premises of the theory as a whole (Mach's principle of economy, interpreted as a logical principle). ...
There exists a passion for comprehension, just as there exists a passion for music. That passion is rather common in children, but gets lost in most people later on. Without this passion, there would be neither mathematics nor natural science. Time and again the passion for understanding has led to the illusion that man is able to comprehend the objective world rationally, by pure thought, without any empirical foundations, in short, by metaphysics. I believe that every true theorist is a kind of tamed metaphysicist, no matter how pure a 'positivist' he may fancy himself. The metaphysicist believes that the logically simple is also the real. The tamed metaphysicist believes that not all that is logically simple is embodied in experienced reality, but that the totality of all sensory experience can be 'comprehended' on the basis of a conceptual system built on premises of great simplicity. The skeptic will say that this is a 'miracle creed.' Admittedly so, but it is a miracle creed which has been borne out to an amazing extent by the development of science."
Even more resolute statements are made by Max Planck (emphases are ours) in the encyclopedia Die Kultur der Gegenwart (1915, p.68), and in Planck (1915, p.71-72):

"As long as there exists physical science, its highest desirable goal had been the solution of the problem to integrate all natural phenomena observed and still to be observed into a single simple principle which permits to calculate all past and, in particular, all future processes from the present ones. It is natural that this goal has not been reached to date, nor ever will it be reached entirely. It is well possible, however, to approach it more and more, and the history of theoretical physics demonstrates that on this way a rich number of important successes could already be gained; which clearly indicates that this ideal problem is not merely utopical, but eminently fertile. Among the more or less general laws which manifest the achievements of physical science in the course of the last centuries, the Principle of Least Action is probably the one which, as regards form and content, may claim to come nearest to that final ideal goal of theoretical research."

"Who instead seeks for higher connections within the system of natural laws which are most easy to survey, in the interest of the aspired harmony will, from the outset, also admit those means, such as reference to the events at later instances of time, which are not utterly necessary for the complete description of natural processes, but which are easy to handle and can be interpreted intuitively."

From the last quoted statements of Einstein and Planck we can draw the following four-point list of motivations for the use of (or for defining the characteristics of) good metaphysical principles:
1- Simplicity;
2- Generality;
3- Interpretability; and
4- Derivation of powerful and easy to handle (calculate, compute) symbolic (mathematical) formalisms.

The first three of these points are very similar to the characteristics of good metaphorical arguments, as analyzed in section 3. In this particular context, generality means the ability of crossing over different areas or transferring knowledge between multiple fields to integrate the understanding of different natural phenomena. Since the least action principle clearly conforms with all four criteria in the above list, it is easy to understand why it is so endeared by physicists, despite the objections to its teleological nature.

Up to this point we have been arguing that the laws of mechanics in integral form, stated in terms of the least action principle, and its associated teleological metaphysical concepts, should be accepted alongside the "standard" formulation of mechanics in differential form, that is, the differential equations of Newtonian mechanics. However, Schlick (1979, V.1, p.297) proposes a complete inversion of the empirical / metaphysical status of the two formulations, see also Muntean (2006) and Stöltzner (2003). According to Schlick's view, while the integral or macro-law formulation has its grounds in observable quantities, the differential or micro-law formulation is based on non-empirical concepts:

"That the event at a point depends only on those processes occurring in its immediate temporal and spatial neighborhood is expressed in the fact that space and time appear in the formulae of natural laws as infinitely small quantities; these formulae, that is, are differential equations. We can also describe them in a readily intelligible terminology as micro-laws.
Through the mathematical process of integration, there emerge from them the macro-laws (or integral laws), which now state natural dependencies in their extension over spatial and temporal distances. Only the latter fall within experience, for the infinitely small is not observable. The differential laws prevailing in nature can therefore be conjectured and inferred only from the integral laws, and these inferences are never, strictly speaking, univocal, since one can always account for the observed macro-laws by various hypotheses about the underlying micro-laws. Among the various possibilities we naturally choose that marked by the greatest simplicity. It is the final aim of exact science to reduce all events to the fewest and simplest possible differential laws."
From this and other examples presented in sections 6 to 9, we come to the conclusion that metaphysical concepts are unavoidable, regardless of the formulation in use. Positivists, on the other hand, envision the exclusive use of metaphysics-free scientific concepts, grounded in pure empirical experience. In the end, it seems that the latter devote themselves to the worthless pursuit of chasing chimeras. Moreover, metaphysical arguments are essential to build our intuition. Without intuition, physical reasoning would be downgraded to merely cranking the formalism, either by algebraic manipulation of the symbolic machinery or by sheer number crunching. Planck (1950, p.171-172) states that:

"To be sure, it must be agreed that the positivistic outlook possesses a distinctive value; for it is instrumental to a conceptual clarification of the significance of physical laws, to a separation of that which is empirically proven from that which is not, to an elimination of emotional prejudices nurtured solely by customary views, and it thus helps to clear the road for the onward drive of research. But Positivism lacks the driving force for serving as a leader on this road. True, it is able to eliminate obstacles, but it cannot turn them into productive factors. For its activity is essentially critical, its glance is directed backward. But progress, advancement requires new associations of ideas and new queries, not based on the results of measurements alone, but going beyond them, and toward such things the fundamental attitude of Positivism is one of aloofness.
Therefore, up to quite recently, positivists of all hues have also put up the strongest resistance to the introduction of atomic hypotheses...."
At this point it is opportune to remember Kant's allegory of breathing, that offers a counterpoint in contrast and complement to his allegory of the dove (Prolegomena to Any Future Metaphysics; How Is Metaphysics Possible As a Science?):

"That the human mind will ever give up metaphysical researches is as little to be expected as that we should prefer to give up breathing altogether, to avoid inhaling impure air."

4.9 Averaging over All Possible Worlds
The last example quoted by Planck provides yet another excellent illustration to enlighten not only the issue currently under discussion, but also other topics we want to address. In the next section we shortly introduce one of the most important models related to the debate concerning the atomic hypothesis, namely, Brownian motion.

We are interested in the Dirichlet problem of describing the steady state temperature at a two dimensional plate, given the temperature at its border. The partial differential equation that the temperatures, u(x, y), must obey in the Dirichlet problem is known as the 2-dimensional Laplace equation,

div grad u = ∂²u/∂x² + ∂²u/∂y² = 0 ,

as in Butkov (1968, Ch.8). From elementary calculus, see Demidovich and Maron (1976), we have the forward and backward finite difference approximations for a partial derivative,

∂u/∂x ≈ ( u(x+h, y) − u(x, y) ) / h ≈ ( u(x, y) − u(x−h, y) ) / h .

Using these approximations twice, we obtain the symmetric or central finite difference approximations for the second derivatives,

∂²u/∂x² ≈ ( u(x+h, y) − 2u(x, y) + u(x−h, y) ) / h² ,
∂²u/∂y² ≈ ( u(x, y+h) − 2u(x, y) + u(x, y−h) ) / h² .

Substitution in the Laplace equation gives the "next neighbors' mean value" equation,

u(x, y) = (1/4) ( u(x+h, y) + u(x−h, y) + u(x, y+h) + u(x, y−h) ) .

From the last equation we can set a linear system for the temperatures in a rectangular grid. The unknown variables, in the left hand side, are the temperatures at the interior points of the grid; in the right hand side we have the known temperatures at the boundary points.

From the temperatures at the four neighboring points of a given grid point, [x, y], an estimate of the temperature, u(x, y), at this point is the expected value of the random variable Z(x, y) whose value is uniformly sampled from

{ u(x+h, y), u(x−h, y), u(x, y+h), u(x, y−h) } ,

the north, south, east and west neighbors. Also, if we did not know the temperature at the neighboring point sampled, we could estimate the neighbor's temperature by sampling one of the neighbor's neighbors. Using this argument recursively, we could estimate the temperature u(x, y) through the following Monte Carlo algorithm:

Consider a "particle" undergoing a symmetric random (or drunken sailor) walk, that is, a stochastic trajectory, T = [T(1), ... T(m)], such that starting at position T(1) = [x(1), y(1)], it jumps to positions T(1), T(2), ... T(m) by uniformly sampling among the neighboring points of its current position, until it eventually hits the boundary. More precisely, from a given position, T(k) = [x(k), y(k)], at step k, the particle will equally likely jump to one of its neighboring positions at step k+1, that is, T(k+1) = [x(k+1), y(k+1)] is randomly selected from the set

{ [x(k)+h, y(k)], [x(k)−h, y(k)], [x(k), y(k)+h], [x(k), y(k)−h] } .

The journey ends when a boundary point, T(m) = [x(m), y(m)], is hit by the "particle" at (random) step m. Defining the random variable Z(T) = u(x(m), y(m)), it can be shown that the expected value of Z(T), for T starting at T(1) = [x(1), y(1)], equals u(x(1), y(1)), the solution to the Dirichlet problem at [x(1), y(1)].

The above algorithm is only a particular case of more general Monte Carlo algorithms for solving linear systems.
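To make the procedure concrete, the following minimal sketch (our own illustration, not the implementation used in the references below; the grid size, boundary conditions and all names are hypothetical choices) estimates the temperature at one interior node of a square plate whose "north" edge is held at 100 degrees and whose other edges are held at 0:

import random

def walk_to_boundary(i, j, n, boundary_temp, rng):
    """Symmetric random walk on the interior nodes (1..n, 1..n) of a
    grid with boundary nodes at indices 0 and n+1; returns the
    prescribed temperature at the boundary node eventually hit."""
    while 0 < i < n + 1 and 0 < j < n + 1:
        di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        i, j = i + di, j + dj
    return boundary_temp(i, j)

def dirichlet_mc(i, j, n, boundary_temp, n_walks=20000, seed=1):
    """Monte Carlo estimate of u(i, j): the average boundary
    temperature over many independent random walks."""
    rng = random.Random(seed)
    return sum(walk_to_boundary(i, j, n, boundary_temp, rng)
               for _ in range(n_walks)) / n_walks

n = 9
north_hot = lambda i, j: 100.0 if j == n + 1 else 0.0
# At the central node the exact answer is 25, by symmetry among the
# four edges; the estimate converges at the usual rate O(1/sqrt(n_walks)).
print(dirichlet_mc(5, 5, n, north_hot))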
For details see Demidovich and Maron (1976), Hammersley and Handscomb (1964), Halton (1970) and Ripley (1987). Hence, these Monte Carlo algorithms allow us to obtain the solution of many continuous problems in terms of an expected (average) value of a discrete stochastic flow of particles. More precisely, efficient Monte Carlo algorithms are available for solving linear systems, and many of the mathematical models in Physics, or science in general, are (or can be approximated by) linear equations. Consequently, one should not be surprised to find interpretations of physical models in terms of particle flows.

In 1827, Robert Brown observed the movement of plant spores (pollen) immersed in water. He noted that the spores were in perpetual movement, following an erratic or chaotic path. Since the motion persisted over long periods of time in different liquid media, and powder particles of inorganic minerals also exhibited the same motion pattern, he discarded the hypothesis of live or self propelled motion. This "Brownian motion" was the object of several subsequent studies, linking the intensity of the motion to the temperature of the liquid medium. For further readings, see Brush (1968) and Haw (2002).

In 1905 Einstein published a paper in which he explains Brownian motion as a fluctuation phenomenon caused by the collision of individual water molecules with the particle in suspension. Using a simplified argument, we can model the particle's motion by a random path in a rectangular grid, like the one used to solve the Dirichlet problem. In this model, each step is interpreted as a molecule collision with the particle, causing it to move, equally likely, to the north, south, east or west. Stating the formal mathematical properties of this stochastic process, known as a random walk, was one of the contributions of Einstein's paper. In its simplest one-dimensional version, the particle's position, starting at y_0 = 0, undergoes incremental unitary steps, that is, y_{t+1} = y_t + x_t, with x_t = ±1.
The steps are assumed unbiased and uncorrelated, that is, E(x_t) = 0 and Cov(x_s, x_t) = 0 for s ≠ t. Also, Var(x_t) = 1. From the linearity of the expectation operator, we conclude that E(y_t) = 0. Also,

E(y_t²) = E( Σ_{j=1}^t x_j )² = E Σ_{j=1}^t x_j² + E Σ_{j≠k} x_j x_k = t + 0 = t ,

so that at time t, the standard deviation of the particle's position is

√( E(y_t²) ) = t^H , for H = 1/2 .

From this simple model an important characteristic, expressed as a sharp statistical hypothesis to be experimentally verified, can be derived: Brownian motion is a self-similar process, with scaling factor, or Hurst exponent, H = 1/2.
One possible interpretation of the last statement is that, in order to make coherent observations of a Brownian motion, if time is rescaled by a factor φ, then space should also be rescaled by a factor φ^H. The generalization of this stochastic process for 0 < H < 1 is known as fractional Brownian motion.
The sharp hypothesis H = 1/2 thus gives the theory a characteristic signature that can be checked empirically.
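The sharp hypothesis H = 1/2 is also easy to probe numerically. The following sketch (an illustration of our own, with illustrative parameter choices) simulates a batch of symmetric random walks and recovers the Hurst exponent from the log-log slope of the standard deviation of y_t against t:

import random
import math

def random_walk(t_max, rng):
    """One symmetric random walk path: y_0 = 0, y_{t+1} = y_t + x_t,
    with unbiased, uncorrelated steps x_t = +1 or -1."""
    y, path = 0, [0]
    for _ in range(t_max):
        y += rng.choice((-1, 1))
        path.append(y)
    return path

def estimate_hurst(n_paths=2000, t_max=1024, seed=0):
    """Estimate H in sd(y_t) ~ t^H by least squares on log-log scale."""
    rng = random.Random(seed)
    paths = [random_walk(t_max, rng) for _ in range(n_paths)]
    ts = [2 ** k for k in range(2, 11)]         # t = 4, 8, ..., 1024
    # sample standard deviation of the position at each time t
    sds = [math.sqrt(sum(p[t] ** 2 for p in paths) / n_paths) for t in ts]
    lx, ly = [math.log(t) for t in ts], [math.log(s) for s in sds]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
            / sum((a - mx) ** 2 for a in lx))

print(estimate_hurst())   # close to the sharp value H = 1/2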
4.10 Hypothetical versus Factual Models

The Monte Carlo algorithms introduced in the last section are based on the stochastic flow of particles. Yet, these particles can be regarded as mere imaginary entities in a computational procedure. On the other hand, some models based on similar ideas, such as the kinetic theories of gases, or the random walk model for the Brownian motion, seem to give these particles a higher ontological status. It is thus worthwhile to discuss the epistemological or ontological status of an entity in a computational procedure, like the particles in the above example.

This discussion is not as trivial, innocent and harmless as it may seem at first sight. In 1632 Galileo Galilei published in Florence his Dialogue Concerning the Two Chief World Systems. At that time it was necessary to have a license to publish a book, the imprimatur. Galileo had obtained the imprimatur from the ecclesiastical authorities two years earlier, under the explicit condition that some of the theses presented in the book, dangerously close to the heliocentric heretical ideas of Nicolaus Copernicus, should be presented as a "hypothetical model" or as a "calculation expedient", as opposed to a "truthful" or "factual" description of "reality".

Galileo not only failed to fulfill the imposed condition, but also ridiculed the official doctrine. He presented his theories in a dialogue form. In these dialogues, Simplicio, the character defending the orthodox geocentric ideas of Aristotle and Ptolemy, was constantly mocked by his opponent, Salviati, a zealot of the views of Galileo. In 1633 Galileo was prosecuted by the Roman Inquisition, under the accusation of making heretical statements, as quoted from Santillana (1955, p.306-310):

"The proposition that the Sun is the center of the world and does not move from its place is absurd and false philosophically and formally heretical, because it is expressly contrary to Holy Scripture. The proposition that the Earth is not the center of the world and immovable but that it moves, and also with a diurnal motion, is equally absurd and false philosophically and theologically considered at least erroneous in faith."
In the Italian renaissance, one of the most open and enlightened societies of its time, but still within a pre-modern era, where subsystems were only incipient and not clearly differentiated, the consequences of mixing scientific and religious arguments could be dire. Galileo even uses some arguments that resemble the concept of systemic differentiation, for example:

"Therefore, it would perhaps be wise and useful advice not to add without necessity to the articles pertaining to salvation and to the definition of faith, against the firmness of which there is no danger that any valid and effective doctrine could ever emerge. If this is so, it would really cause confusion to add them upon request from persons about whom not only do we not know whether they speak with heavenly inspiration, but we clearly see they are deficient in the intelligence necessary first to understand and then to criticize the demonstrations by which the most acute sciences proceed in confirming similar conclusions."
Finocchiaro (1991, p.97). The paragraph above is from a letter of 1615 from Galileo to Her Serene Highness Grand Duchess Cristina but, as usual, Galileo's rhetoric is anything but serene. In 1633 Galileo is sentenced to prison for an indefinite term. After he abjures his allegedly heretical statements, the sentence is commuted to house-arrest at his villa. Legend has it that, after his formal abjuration, Galileo muttered the now celebrated phrase,
Eppur si muove, "But indeed it (the earth) moves (around the sun)".

Around 1610 Galileo built a telescope (an invention coming from the Netherlands) that he used for astronomical observations. Among his findings were four satellites of planet Jupiter, namely, Io, Europa, Ganymede and Callisto. He also observed phases (such as the lunar phases) exhibited by planet Venus. Both facts are either compatible with or explained by the Copernican heliocentric theory, but problematic or incompatible with the orthodox Ptolemaic geocentric theory. During his trial, Galileo tried to use these observations to corroborate his theories, but the judges would not, literally, even 'look' at them. The church's chief astronomer, Christopher Clavius, refused to look through Galileo's telescope, stating that there was no point in 'seeing' some objects through an instrument that had been made just in order to 'create' them. Nevertheless, only a few years after the trial, the same Clavius was building fine telescopes, used to make new astronomical observations. He took care, of course, not to upset his boss with "theologically incorrect" explanations for what he was observing.

From the late 19th century to 1905 the world witnessed yet another trial, perhaps not so famous, but even more dramatic. Namely, that of the atomistic ideas of Ludwig Boltzmann. For an excellent biography of Boltzmann, intertwined (as it ought to be) with the history of his scientific ideas, see Cercignani (1998). The final verdict on this controversy was given by Albert Einstein in his annus mirabilis paper about Brownian Motion, together with the subsequent experimental work of Jean Perrin. For details see Einstein (1956) and Perrin (1950). A simplified version of these models was presented in the previous section, including a "testable" sharp statistical hypothesis, H = 1/2,
to empirically check the theory. As quoted in Brush (1968), in his Autobiographical Notes, Einstein states that:

"The agreement of these considerations with experience together with Planck's determination of the true molecular size from the law of radiation (for high temperatures) convinced the skeptics, who were quite numerous at that time (Ostwald, Mach) of the reality of atoms. The antipathy of these scholars towards atomic theory can indubitably be traced back to their positivistic philosophical attitude. This is an interesting example of the fact that even scholars of audacious spirit and fine instinct can be obscured in the interpretation of facts by philosophical prejudices. The prejudice - which has by no means died out in the meantime - consists in the faith that facts themselves can and should yield scientific knowledge without free conceptual construction.
Such a misconception is possible only because one does not easily become aware of the free choice of such concepts, which, through verification and long usage, appear to be immediately connected with the empirical material."
Let us follow Perrin's perception of the "empirical connection" between the concepts used in the molecular theory, as contrasted to that of the rival energetic theory, during the first decade of the 20th century. In 1903 Perrin was already an advocate of the molecular hypothesis, as can be seen in Perrin (1903). According to Brush (1968, p.30-31), Perrin refused the positivist demand for using only directly observable entities. Perrin referred to an analogous situation in biology where,

"the germ theory of disease might have been developed and successfully tested before the invention of the microscope; the microbes would have been hypothetical entities, yet, as we know now, they could eventually be observed."
But only three years later, Perrin (1906) was confident enough to reverse the attack, accusing the energetic view rivaling the atomic theory of having "degenerated into a pseudo-religious cult". It was the energetic theory, claimed Perrin, that was making use of non-observable entities! To begin with, classical thermodynamics had a differential formulation, with the functions describing the evolution of a system assumed to be continuous and differentiable (notice the similarity between the argument of Perrin and that of Schlick, presented in section 8). Perrin based his argument on the contemporary evolution of mathematical analysis: until late in the 19th century, continuous functions were naturally assumed to be differentiable. Nevertheless, the development of mathematical analysis, at the turn to the 20th century, proved this to be a rather naive assumption. Referring to this background material, Perrin argues:

"But they still thought the only interesting functions were the ones that can be differentiated. Now, however, an important school, developing with rigor the notion of continuity, has created a new mathematics, within which the old theory of functions is only the study (profound, to be sure) of a group of singular cases. It is curves with derivatives that are now the exception; or, if one prefers the geometrical language, curves with no tangents at any point become the rule, while familiar regular curves become some kind of curiosities, doubtless interesting, but still very special."

"In view of the ocular confirmation of the picture which the kinetic theory provides us of the world of molecules, one must admit that this theory begins to lose its hypothetical character."
4.11 Magic, Miracles and Final Remarks

In several incidents analyzed in the last sections, one can repeatedly find the occurrence of theoretical "phase transitions" in the history of science. In these transitions, we observe a dominant and strongly supported theory being challenged by an alternative point of view. In a first moment, the cheerleaders of the dominant group come up with a variety of "disqualifying arguments", to show why the underdog theory, plagued by phony concepts and faulty constructions, should not even be considered as a serious contestant. In a second moment, the alternative theory is kept alive by a small minority, that is able to foster its progress. In a third and final moment, the alternative theory becomes, quite abruptly, the dominant view, and many wonder how it is that the old, now abandoned theory could ever have had so much support. This process is captured in the following quotation, from the preface to the first edition of Schopenhauer (1818):

"To truth only a brief celebration of victory is allowed between the two long periods during which it is condemned as paradoxical, or disparaged as trivial."
Perhaps this is the basis for the gloomier statement found in Planck (1950, p.33-34):

"A new scientific truth does not triumph by convincing its opponents and by making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it."
As for the abruptness of the transition between the two phases, representing the two theoretical paradigms, this is a phenomenon that has been extensively studied, from sociological, systemic and historical perspectives, by Thomas Kuhn (1996, 1977). See also Hoyningen-Huene (1993) and Lakatos (1978a,b). For similar ideas presented within an approach closer to the orthodox Bayesian theory, see Zupan (1991).

We finish this section with a quick and simple alternative explanation, possibly just as a hint, that I believe can shed some light on the nature of this phenomenon. Elucidations of this kind were used many times by von Foerster (2003b,e) who was, among many other things, a skilful magician and illusionist.
An Ambigram, or ambiguous picture, is a picture that can be looked at in two (or more) different ways. Looking at an ambigram, the observer's interpretation or resolution of the image can be attracted to one of two or more distinct eigen-solutions. A memorable instance of an ambigram is the Duck-Rabbit, born in 1892, in the humble pages of the German tabloid Fliegende Blätter. It was studied in 1899 by the psychologist Joseph Jastrow in an article anticipating several aspects of cognitive constructivism, and finally made famous by the philosopher Ludwig Wittgenstein in 1953. For a historical account of this ambigram, see Kihlstrom (2006), as well as several nice figures. In case anyone wonders, Jastrow was Peirce's Ph.D. student and coauthor of the 1885 paper introducing randomization, and Wittgenstein is no other than von Foerster's uncle Ludwig. According to Jastrow (1899), an ambigram demonstrates how

"True seeing, observing, is a double process, partly objective or outward - the thing seen and the retina - and partly subjective or inward - the picture mysteriously transferred to the mind's representative, the brain, and there received and affiliated with other images."
Still according to Jastrow, in an ambigram,

"...a single outward impression changes its character according as it is viewed as representing one thing or another. In general we see the same thing all the time, and the image on the retina does not change. But as we shift the attention from one portion of the view to another, or as we view it with a different mental conception of what the figure represents, it assumes a different aspect, and to our mental eye becomes quite a different thing."
Jastrow also describes some characteristics of the mental process of shifting between the eigen-solutions of an ambigram, that is, how in "The Mind's Eye" one changes from one interpretation to the other. Two of these characteristics are especially interesting in our context:

First, in the beginning, "It may require a little effort to bring about this change, but it is very marked when once realized."
Second, after both interpretations are known, "Most observers find it difficult to hold either interpretation steadily, the fluctuation being frequent, and coming as a surprise."
The first characteristic can help us understand either Nernst's "ocular readiness" or, in contrast, Clavius' "ocular blindness". After all, the satellites of Jupiter were quite tangible objects, ready to be watched through Galileo's telescope, whereas the grains of colloidal suspension that could be observed with the lunette of Perrin's apparatus provided much more indirect evidence for the existence of molecules. Or maybe not; after all, it all depends on what one is capable, ready, or willing to see...

[...] gives an encyclopaedic view of these constants and their inter-relations. Planck (1950, Ch.6) comments on their epistemological significance. But far beyond their practical utility or even their scientific interest, these eigen-solutions are not magical illusions, but true miracles. Why "true" miracles? Because the more they are explained and the better they are understood, the more wonderful they become!
Chapter 5
Complex Structures, Modularity, and Stochastic Evolution

"Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses."
"The time required for the evolution of a complex form from simple elements depends critically on the number and distribution of potential intermediate stable subassemblies."
Herbert Simon (1916-2001), The Sciences of the Artificial.

"In order to make some sense here, we must keep an open mind about the possibility that for sufficiently complex systems, amplitudes become probabilities."
Richard Feynman (1918-1988), Lecture Notes on Gravitation.
The expression stochastic evolution may seem an oxymoron. After all, evolution indicates progress towards complexity and order, while a stochastic (probabilistic, random) process seems to be only capable of generating confusion or disorder. The etymology of the word stochastic, from στόχος, meaning aim, goal or target, and its current use, meaning chancy or noisy, seems to incorporate this apparent contradiction. An alternative use of the same root, στοχαστικός, meaning skillful at guessing, conjecturing, or divining the truth, may offer a bridge between the two meanings.

The main goal of this chapter is to study how the concepts of stochastic process and evolution of complex systems can be reconciled. Sections 2 and 3 examine two prototypical algorithms: Simulated Annealing and Genetic Programming. The ideas behind these two algorithms will be used as a basis for most of the arguments used in this chapter. The mathematical details of some of these algorithms are presented in appendix H. Section 4 presents the concept of modularity, and explains its importance in the evolution of complex systems.

While sections 2, 3 and 4 are devoted to the study of general systems, including applications to biological organisms and technological devices, section 5 pays closer attention to the evolution of complex hypotheses and scientific theories. Section 5 also examines the idea of complementarity, developed by the physicist and philosopher Niels Bohr as a general framework for the reconciliation of two concepts that appear to be incompatible but are, at the same time, indispensable to the understanding of a given system. Section 6 explores the connection between complementarity and probability, presenting Heisenberg's uncertainty principle. Section 7 extends the discussion to general theories of evolution and returns to the pervasive theme of probabilistic causation. Section 8 presents our final remarks.
Most human societies are organized as hierarchical structures. Universities are organized in research groups, departments, institutes and schools; armies in platoons, battalions, regiments and brigades; and so on. This has been the way of doing business as described in the earliest historical records. Deuteronomy (1:15) describes the ancient hierarchical structure of Israel:

"So I took the heads (ROSh) of your tribes, men wise and known, and made them heads over you, leaders (ShR) of thousands, hundreds, fifties and tens, and officers (ShTR) for your tribes."

This verse gives us some idea of the criteria used to appoint leaders (knowledge and wisdom), but gives us no hint on the criteria and methods used to form the groups (of 10, 50, 100 and 1000). Perhaps that was obvious from the family and tribal structure already in place. There are many situations, however, where organizing groups to obtain an optimal structure is far from trivial. In this section we study such a case: the block partition problem.
The matrix block partition problem arises in many practical situations in engineering design, operations research and management science. In some applications, the elements of a rectangular matrix, A, may represent the interaction between people, corresponding to columns, and activities, corresponding to rows, that is, A_i^j, the element in row i and column j, represents the intensity of the interaction between person j and activity i. The block partition problem asks for an optimal ordering or permutation of rows and columns taking the permuted matrix to Block Angular Form (BAF), so that each one of b diagonal blocks bundles a group of strongly coupled people and activities. Only a small number of activities are left outside the diagonal blocks, in a special (b+1)-th block of residual rows. Also, only a small number of people interact with more than one of the b diagonal activities; these correspond to residual columns, see Figure 1.

Figure 1a,b: Two Matrices in Block Angular Form.

A matrix in BAF is in Row Block Angular Form (RBAF) if it has only residual rows, and is in Column Block Angular Form (CBAF) if it has only residual columns. Each angular block can, in turn, exhibit again a BAF, thus creating a recursive or Nested Block Angular Form (NBAF). Figure 1a exhibits a matrix in NBAF. In this figure, zero elements of the matrix are represented by blank spaces. The number at the position of a non-zero element (NZE) is not the corresponding matrix element's value, but rather a class tag or "color" indicating the block to which the row belongs. Residual rows receive the special color b+1. The first block has a nested CBAF structure, shown in Figure 1b. For the sake of simplicity, this chapter will focus on the BAF partition problem, although all our conclusions can be generalized to the NBAF case.

We motivate the block partition problem further with an application related to numerical linear algebra. Gaussian elimination is the name of a simple method for solving linear systems of order n, by reducing the matrix of the original system to (upper) triangular form. This is accomplished by successively subtracting multiples of rows 1 through n−1 from the rows below them, so as to eliminate (zero) the elements below each diagonal element (or pivot element). The example in Figure 2 illustrates the Gaussian elimination algorithm, where the original system, Ax = b, is transformed into an upper triangular system, Ux = c. The matrix L stores the multipliers used in the process. Each multiplier is stored at the position of the element it was used to eliminate, that is, at the position of the zero it was used to create. It is easy to check that A = LU, hence the alternative name of the algorithm: LU Factorization.

The example in Figure 2 also displays some structural peculiarities. Matrix A is in BAF, with two diagonal blocks, one residual row (at the bottom or south side of the matrix) and one residual column (at the right or east side of the matrix). This structure is preserved in the L and U factors. This structure and its preservation is of paramount importance in the design of efficient factorization algorithms. Notice that the elimination process in Figure 2 can be done in parallel. That is, the factorization of each diagonal block can be done independently of and simultaneously with the factorization of the other blocks; for more details see Stern and Vavasis (1994).
Figure 2: A = LU Factorization of a CBAF Matrix.

A classic combinatorial formulation for the CBAF partition problem, for a rectangular matrix A, m by n, is the Hypergraph Partition Problem (HPP). In the HPP formulation, we paint all nonzero elements (NZE's) in a vertex i ∈ {1, . . . , m} (corresponding to row A_i) with a color x_i ∈ {1, . . . , b}. The color q_j(x) of an edge j ∈ {1, . . . , n} (corresponding to column A^j) is then the set of all its NZE's colors. Multicolored edges of the hypergraph (corresponding to columns of the matrix containing NZE's of several colors) are the residual columns in the CBAF. The formulation for the general BAF problem also allows some residual rows to receive the special color b+1.

The BAF applications typically require:
1. Roughly the same number of rows in each block.
2. Only a few residual rows or columns.

From 1 and 2 it is natural to consider the minimization of the objective or cost function

f(x) = α Σ_{k=1}^b h_k(x)² + β c(x) + γ r(x) , where
h_k(x) = s_k(x) − m/b ,
q_j(x) = { k ∈ {1, . . . , b} : ∃ i, A_i^j ≠ 0 ∧ x_i = k } ,
s_k(x) = |{ i ∈ {1, . . . , m} : x_i = k }| ,
c(x) = |{ j ∈ {1, . . . , n} : |q_j(x)| ≥ 2 }| ,
r(x) = |{ i ∈ {1, . . . , m} : x_i = b+1 }| .

The term c(x) is the number of residual columns, and the term r(x) is the number of residual rows. The constraint functions h_k(x) measure the deviation of each block from the ideal size m/b. Since we want to enforce these constraints only approximately, we use quadratic penalty functions, h_k(x)², that (only) penalize large deviations. If we wanted to enforce the constraints more strictly, we could use exact penalty functions, like |h_k(x)|, that penalize even small deviations, see Bertsekas and Tsitsiklis (1989) and Luenberger (1984).

5.2 The Ergodic Path: One for All

The HPP stated in the last section is very difficult to solve exactly. Technically it is an NP-hard problem, see Cook (1997). Consequently, we try to develop heuristic procedures to find approximate or almost optimal solutions. Simulated Annealing (SA) is a powerful meta-heuristic, well suited to solve many combinatorial problems. The theory behind SA also has profound epistemological implications, that we explore later on in this chapter.

The first step to define an SA procedure is to define a neighborhood structure in the problem's state or configuration space. The neighborhood, N(x), of a given initial state, x, is the set of states, y, that can be reached from x by a single move. In the HPP, a single move is defined as changing the color of a single row, x_i ↦ y_i. In this problem, the neighborhood size is therefore the same for any state x, namely, the product of the number of rows and colors, that is, |N(x)| = mb for CBAF, and |N(x)| = m(b+1) for BAF. This neighborhood structure provides good mobility in the state space, in the sense that it is easy to find a path (made by a succession of single moves) from any chosen initial state, x, to any other final state, y. This property is called irreducibility or strong connectivity. There is also a second technical requirement for good mobility, namely, this set of paths should be aperiodic. If the length (the number of single moves) of any path from x to y is a multiple of an integer k > 1, k is called the period of this set. Further details are given in appendix H.1.
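For concreteness, the cost function f(x) and the single-move neighborhood just defined translate almost literally into code. The sketch below is a direct, unoptimized transcription of our own (all function and variable names are ours); the efficient incremental update of the cost is discussed next.

import random

def baf_cost(A, x, b, alpha=1.0, beta=1.0, gamma=1.0):
    """Cost f(x) = alpha*sum_k h_k(x)^2 + beta*c(x) + gamma*r(x) of a
    row coloring x, with x[i] in {1, ..., b+1} (b+1 marks a residual
    row) and A an m-by-n matrix given as a list of rows."""
    m, n = len(A), len(A[0])
    # q_j(x): set of block colors among the nonzero elements of column j
    q = [{x[i] for i in range(m) if A[i][j] != 0 and x[i] <= b}
         for j in range(n)]
    # s_k(x): number of rows painted with block color k
    s = [sum(1 for xi in x if xi == k) for k in range(1, b + 1)]
    h = [sk - m / b for sk in s]              # deviation from ideal size
    c = sum(1 for qj in q if len(qj) >= 2)    # residual columns
    r = sum(1 for xi in x if xi == b + 1)     # residual rows
    return alpha * sum(hk ** 2 for hk in h) + beta * c + gamma * r

def single_move(x, b, rng):
    """Propose a uniformly chosen neighbor: repaint a single row with
    one of the b+1 colors (the BAF neighborhood of the text)."""
    y = list(x)
    y[rng.randrange(len(x))] = rng.randrange(1, b + 2)
    return y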
In an SA, it is convenient to have an easy way to update the cost function, computed at a given state, x, to the cost of a neighboring state, y. The column color weight matrix, W, is defined so that the element W_j^k counts the number of NZE's in column j in rows of color k, that is,

W_j^k ≡ |{ A_i^j : A_i^j ≠ 0 ∧ x_i = k }| .

The weight matrix can be easily updated at any single move and, from W, it is easy to compute the cost function or a cost differential, δ ≡ f(y) − f(x). The internal loop of the SA is a Metropolis sampler, where single moves are chosen at random (uniformly among all possible moves) and then accepted with the Metropolis
probability,

M(δ, θ) ≡ { 1 , if δ ≤ 0 ; exp(−θ δ) , if δ > 0 } .

The parameter θ is known as the inverse temperature, which has a natural interpretation in statistical physics; see MacDonald (2006), Nash (1974) and Rosenfeld (2005) for intuitive introductions, and Thompson (1972) for a rigorous text.

The Gibbs distribution, g(θ)_x, is the invariant distribution for the Metropolis sampling process, given by

g(θ)_x = (1/Z(θ)) exp(−θ f(x)) , with Z(θ) = Σ_x exp(−θ f(x)) .

The symbol g(θ) represents a row vector, where the column index, x, spans the possible states of the system.

Consider a system prepared (shuffled) in such a way that the probability of starting the system in initial state x is g(θ)_x. If we move the system to a neighboring state, y, according to the Metropolis sampling procedure, the invariance property of the Gibbs distribution assures that the probability that the system will land (after the move) in any given state, y, is g(θ)_y; that is, the probability distribution of the final (after the move) state remains unchanged.

Under appropriate regularity conditions, see appendix H.1, the process is also ergodic. Ergodicity means that even if the system is prepared (shuffled) with an arbitrary probability distribution, v(0), for the initial state, for example, the uniform distribution, the probability distribution, v(t), of the final system state after t moves chosen according to the Metropolis sampling procedure will be sufficiently close to g(θ) for sufficiently large t. In other words, the probability distribution of the final system state converges to the process' invariant distribution. Consequently, we can find out the process' invariant distribution by following, for a long time, the trajectory of a single system evolving according to the Metropolis sampling procedure. Hence the expression, The Ergodic Path: One for All. From the history of an individual system we can recover important information about the whole process guiding its evolution.

Let us now study how the Metropolis process can help us find the optimal (minimum cost) configuration for such a system. The behavior of the Gibbs distribution, g(θ), changes according to the inverse temperature parameter, θ:
- In the high temperature extreme, 1/θ → ∞, the Gibbs distribution approaches the uniform distribution.
- In the low temperature extreme, 1/θ →
0, the Gibbs distribution is concentrated in the states with minimum cost only.

Correspondingly, the Metropolis process behaves as follows:
- At the high temperature extreme, the Metropolis process becomes insensitive to the cost function, accepting virtually every proposed move.
- At the low temperature extreme, the process only accepts moves that do not increase the cost, behaving as a greedy local search.
Figure 3a: L, G - local and global minimum; M - maximum; S - short-cut; h, H - local and global escape energy.
Figure 3b: A difficult problem, with steep cliffs and flat plateaus.

Simulated annealing plays the trick of combining both regimes: it explores freely while hot and optimizes greedily when cold. The secret to play this trick is in the external loop of the SA algorithm, the Cooling Schedule. The cooling schedule initiates the temperature high enough so that most of the proposed moves are accepted, and then slowly cools down the process, until it freezes at an optimum state. The theory of SA is presented in appendix H.1.

The most important result concerning the theory of SA states that, under appropriate regularity conditions, the process converges to the system's optimal solution as long as we use the Logarithmic Cooling Schedule. This schedule draws the t-th move according to the Metropolis process using the inverse temperature

θ(t) = ln(t) / (n ∆) ,

where ∆ is the maximum objective function differential in a single move and n is the minimum number of steps needed to connect any two states. Hence, the cooling constant, n∆, can be interpreted as an estimate of how high a mountain we may need to climb in order to reach the optimal position, see Figure 3a(h).
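The inner Metropolis loop and the cooling schedule fit in a few lines. The sketch below is a minimal illustration of our own, with a toy objective function and a cooling constant set by a rough estimate; it uses the logarithmic schedule θ(t) = ln(t)/(n∆), while the geometric cooling used in practice is discussed next.

import math
import random

def simulated_annealing(x0, propose, f, n_delta, t_max=200000, seed=0):
    """Generic SA: Metropolis acceptance M(delta, theta) with the
    logarithmic cooling schedule theta(t) = ln(t) / (n * Delta)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for t in range(2, t_max):
        theta = math.log(t) / n_delta        # inverse temperature
        y = propose(x, rng)                  # uniformly chosen single move
        delta = f(y) - fx
        # accept with probability 1 if delta <= 0, exp(-theta*delta) otherwise
        if delta <= 0 or rng.random() < math.exp(-theta * delta):
            x, fx = y, f(y)
        if fx < fbest:
            best, fbest = x, fx
    return best, fbest

# Toy usage: a rugged function on the integer segment 0..100.
f = lambda x: (x - 37) ** 2 / 50.0 + 5.0 * math.cos(x)
propose = lambda x, rng: min(100, max(0, x + rng.choice((-1, 1))))
print(simulated_annealing(50, propose, f, n_delta=100 * 14.0))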
Practical implementations of SA usually cool the temperature geometrically, θ ← (1+ε)θ, after each batch of Metropolis sampling. The SA is terminated when it freezes, that is, when the acceptance rate in the Metropolis sampling drops below a pre-established threshold. Further details on such an implementation are given in the next section.

The Standard Simulated Annealing (SSA), described in the last section, behaves poorly in the BAF problem mainly because it is very difficult to sense the proximity of low cost states, see Figure 3b. That is:
1. Most of the neighbors of a low cost state, x, may have much higher costs; and
2. The problem is highly degenerate, in the sense that there are states, x, with a large (sub)neighborhood of equal cost states, S(x) = { y ∈ N(x) | f(y) = f(x) }. In this case, even rejecting all the proposals that would take us out of S would still give us a significant acceptance rate.

Difficulty 2, in particular, implies the failure of the SSA termination criterion: a degenerate local minimum (or meta-stable minimum) could trap the SSA forever, sustaining an acceptance rate above the established threshold.

The best way we found to overcome these difficulties is to use a heuristic temperature-dependent cost function, designed to accelerate the SA convergence to the global optimum and to avoid premature convergence to locally optimal solutions:

f(x, µ(θ)) ≡ f(x) + (1/µ(θ)) u(x) , u(x) ≡ Σ_{j : |q_j(x)| > 1} |q_j(x)| .

The state dependent factor in the additional term of the cost function, u(x), can be interpreted as a heuristic merit or penalty function that rewards multicolored columns for using fewer colors. This penalty function, and some possible variants, have the effect of softening the landscape, eroding sharp edges, such as in Figure 3b, into rounded hills and valleys, such as in Figure 3a. The actual functional form of this penalty function is inspired by the tally function used in the P4 LU factorization. The temperature dependent parameter, µ(θ), gives the inverse weight of the heuristic penalty function in the cost function f(x, µ). Function f(x, µ) also has the following properties: (1) f(x, ∞) = f(x);
(2) f(x, µ) is linear in 1/µ. Properties 1 and 2 suggest that we can cool the weight 1/µ as we cool the temperature, much in the same way we control a parameter of the barrier functions in some constrained optimization algorithms, see McCormick (1983).

A possible implementation of this Heuristic Simulated Annealing, HSA, is as follows:
• Initialize parameters µ and θ, set a random partition, x, and initialize the auxiliary variables W, q, c, r, s, and the cost and penalty functions, f and h;
• For each proposed move, x → y, compute the cost differentials δ = f(y) − f(x) and δ_µ = f(y, µ) − f(x, µ);
• Accept the move with the Metropolis probability, M(δ_µ, θ). If the move is accepted, update x, W, q, c, r, s, f and h;
• After each batch of Metropolis sampling steps, perform a cooling step update, θ ← (1+ε₁)θ, µ ← (1+ε₂)µ, with 0 < ε₁ < ε₂ ≪ 1.

Computational experiments show that the HSA successfully overcomes the difficulties undergone by the SSA, as shown in Stern (1991). As far as we know, this was the first time this kind of perturbative heuristic had been considered for SA. Pflug (1996) gives a detailed analysis of the convergence of such perturbed processes. These results are shortly reviewed in section H.1.

In the next section we are going to extend the idea of stochastic optimization to that of evolution of populations, following insights from biology. In zoology, there are many examples of heuristic merit or penalty functions, often called fitness or viability indicators, that are used as auxiliary objective functions in mate selection, see Miller (2000, 2001) and Zahavi (1975). The most famous example of such an indicator, the peacock's tail, was given by Charles Darwin himself, who stated: "The sight of a feather in a peacock's tail, whenever I gaze at it, makes me feel sick!"
For Darwin, this case was an apparent counterexample to natural selection, since the large and beautiful feathers have no adaptive value for survival but are, quite on the contrary, a handicap to the peacock's camouflage and flying abilities. However, the theory presented in this section gives us a key to unlock this mystery and understand the tale of the peacock's tail.
5.3 The Way of Sex: All for One

From the interpretation of the cooling constant given in the last section, it is clear that we would have a lower constant, resulting in a faster cooling schedule, if we used a richer set of single moves; especially if the additional moves could provide short-cuts in the configuration space, such as the moves indicated by the dashed line in Figure 3a. This is one of the arguments that can be used to motivate another important class of stochastic evolution algorithms, namely, Genetic Programming, the subject of the following sections. We will focus on a special class of problems known as functional trees. The general conclusions, however, remain valid in many other applications.
In this section, we deal with methods of finding the correct specification of a complex function. This complex function must be composed recursively from a finite set, OP = {op_1, op_2, . . . op_p}, of primitive functions or operators, and from a set, A = {a_1, a_2, . . .}, of atoms. The k-th operator, op_k, takes a specific number, r(k), of arguments, also known as the arity of op_k. We use three representations for (the value returned by) the operator op_k computed on the arguments x_1, x_2, . . . x_{r(k)}: the usual mathematical form, op_k(x_1, . . . x_{r(k)}); the tree representation, which displays the operator at a node whose children are (the trees of) its arguments; and the prefix, preorder or LISP style representation, (op_k x_1 . . . x_{r(k)}), which is a compact form of the tree representation.

As a first problem, let us consider the specification of a Boolean function of q variables, f(x_1, . . . x_q), to match a target table, g(x_1, . . . x_q), see Angeline (1996) and Banzhaf et al. (1998). The primitive set of operators and atoms for this problem are:

OP = { ∼, ∧, ∨, →, ⊙, ⊗ } and A = { x_1, . . . x_q, 0, 1 } .

Notice that while the first operator (not) is unary, the last five (and, or, imply, nand, xor) are binary. Their truth tables are:

x y | ∼x | x∧y | x∨y | x→y | x⊙y | x⊗y
0 0 |  1 |  0  |  0  |  1  |  1  |  0
0 1 |  1 |  0  |  1  |  1  |  1  |  1
1 0 |  0 |  0  |  1  |  0  |  1  |  1
1 1 |  0 |  1  |  1  |  1  |  0  |  0

The set, OP, of Boolean operators defined above is clearly redundant. Notice, for example, that

x → y = ∼(x ∧ ∼y) , ∼x = x ⊙ x and x ∧ y = ∼(x ⊙ y) .

This redundancy may, nevertheless, facilitate the search for the best configuration in the problem's functional space.

Example 1a shows a target table, g(a, b, c). As is usual when the target function is an experimentally observed variable, the target function is not completely specified. Unspecified values in the target table are indicated by the don't-care symbol ∗. The two functional trees, f_1 and f_2, match the table in all specified cases. Solution f_1, however, is simpler and for that may be preferred, see section 4 for further comments.

[Target table g(a, b, c), with two don't-care (∗) entries, and the tree diagrams of f_1 and f_2.]

f_1 = (∼a) ∨ c , f_2 = (∼a ∧ ∼b) ∨ (a ∧ c) ;
in prefix form, f_1 = (∨ (∼a) c) , f_2 = (∨ (∧ (∼a) (∼b)) (∧ a c)) .

Example 1a: Two Boolean functional trees for the target g(a, b, c).
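A minimal evaluator for such functional trees is only a few lines long. In the sketch below (a representation of our own: trees as nested tuples in prefix form, with ASCII names '~', '&', '|', '->', 'nand', 'xor' for the operators above), f_1 and f_2 of Example 1a are tabulated over all eight inputs; they agree on six of them, and the remaining two would correspond to the don't-care entries of the target table.

OPS = {
    '~':    lambda x: 1 - x,
    '&':    lambda x, y: x & y,
    '|':    lambda x, y: x | y,
    '->':   lambda x, y: (1 - x) | y,
    'nand': lambda x, y: 1 - (x & y),
    'xor':  lambda x, y: x ^ y,
}

def ev(tree, env):
    """Evaluate a prefix functional tree: a tree is an atom (0, 1,
    or a variable name) or a tuple (op, arg_1, ..., arg_r)."""
    if isinstance(tree, tuple):
        op, *args = tree
        return OPS[op](*(ev(a, env) for a in args))
    return env.get(tree, tree)      # variable name, or constant 0/1

f1 = ('|', ('~', 'a'), 'c')                                  # (v (~a) c)
f2 = ('|', ('&', ('~', 'a'), ('~', 'b')), ('&', 'a', 'c'))   # Example 1a
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            env = {'a': a, 'b': b, 'c': c}
            print(a, b, c, ev(f1, env), ev(f2, env))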
As a second problem, let us consider the specification of a function for an integer numerical sequence, such as the Fibonacci sequence, presented in Koza (1983):

g(j) ≡ { j , if j = 0 ∨ j = 1 ; g(j−1) + g(j−2) , if j ≥ 2 } .

The following array, g_j, 0 ≤ j ≤ 20, lists the first 21 elements of the Fibonacci sequence:

g = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765] .

In this problem, the primitive set of operators and atoms are:

OP = { +, −, ×, σ } , A = { j, 0, 1 } ,

where j is an integer number, and the first three operators are the usual arithmetic operators. The specified function is used to compute the first n+1 elements of the array f_j, seeking to match the target array g_j, 0 ≤ j ≤ n. The last primitive function is the recursive operator, σ(i, d), that behaves as follows: When computing the j-th element, f(j), σ(i, d) returns the already computed element f_i, if i is in the range 0 ≤ i < j, or a default value, d, if i is out of the range.

In the functional space of this problem, possible specifications for the Fibonacci function, in prefix representation, are

(+ (σ (− j 1) 1) (σ (− j (+ 1 1)) 0)) ,
(+ (σ (− j 1) 1) (+ 0 (σ (− j (+ 1 1)) 0))) .

Example 2a: Two functional trees for the Fibonacci sequence.
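The same prefix-tree machinery extends to this problem once the operator σ is given access to the already computed elements. The sketch below is our own reading of the first tree of Example 2a; the parenthesization and the σ default constants were garbled in extraction, so they are reconstructed guesses. With these defaults the recursion closes and generates the Fibonacci numbers.

def eval_prefix(expr, j, f_prev):
    """Evaluate a prefix expression over OP = {+, -, *, sigma} and
    atoms {j, 0, 1}.  sigma(i, d) returns the already computed
    element f_prev[i] if 0 <= i < j, and the default d otherwise."""
    if isinstance(expr, tuple):
        op, *a = expr
        v = [eval_prefix(x, j, f_prev) for x in a]
        if op == '+':     return v[0] + v[1]
        if op == '-':     return v[0] - v[1]
        if op == '*':     return v[0] * v[1]
        if op == 'sigma': return f_prev[v[0]] if 0 <= v[0] < j else v[1]
    return j if expr == 'j' else expr

# (+ (sigma (- j 1) 1) (sigma (- j (+ 1 1)) 0)), defaults 1 and 0 assumed
tree = ('+', ('sigma', ('-', 'j', 1), 1),
             ('sigma', ('-', 'j', ('+', 1, 1)), 0))
f = []
for j in range(21):
    f.append(eval_prefix(tree, j, f))
print(f)   # 1, 1, 2, 3, 5, 8, ... -- the Fibonacci recursion closes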
Since the two expressions in Example 2a are functionally equivalent, the first one may be preferable for being simpler, see section 4 for further comments.

As a third problem, we mention Polynomial Network models. These functional trees use as primitive operators linear, quadratic or cubic polynomials in one, two or three variables. For several examples and algorithmic details, see Farlow (1984), Madala and Ivakhnenko (1994) and Nikolaev and Iba (2006). Figure 4 shows a simple network used for sales forecast; a detailed report is given in Lauretto et al. (1995). One input variable is a magazine's sales forecast obtained by a VARMA time series model using historic sales, econometric and calendric data. The other input variables are qualitative variables (in the scale: Bad, Weak, Average, Good, Excellent) assessing the appeal or attractiveness of an individual issue of the magazine, namely: (1) cover impact; (2) editorial content; (3) promotional items; and (4) point of sale marketing.

Figure 4: Polynomial Network. Rings on a node: 1 - linear; 2 - (incomplete) quadratic; 3 - (incomplete) cubic.

Of course, the optimization of a Polynomial Network is far more complex than the optimization of Boolean or algebraic networks, since not only the topology has to be optimized (identification problem) but also, given a topology, the parameters of the polynomial functions have to be optimized (estimation problem). Parameter optimization of sub-trees can be based on Tikhonov regularization, ridge regression, steepest descent or ParTan gradient rules. For several examples and algorithmic details, see Farlow (1984), Madala and Ivakhnenko (1994), Nikolaev and Iba (2001, 2003, 2006), and Stern (2008).
Starting from a given random tree, one can start an SA type search in the problem's (topological) space. In GP terminology, the individual's functional specification is called its genotype. The individual's expressed behavior, or computed solutions, is called its phenotype. Changing a genotype to a neighboring one is called a mutation. The quality of a phenotype, its performance, merit or adaptation, is measured by a fitness function.

While SA looks at the evolution of a single individual, GP looks at the evolution of a population. A time parameter, t, indexes the successive generations of the evolving population. In GP, individuals typically have short lives, surviving only a few generations before dying. Meanwhile, populations may evolve for a very long time.

In GP an individual may, during its ephemeral life, share information, that is, swap (copies) of its (partial) genome, with other individuals. This genomic sharing process is called sex. In GP an individual, called a parent, may also participate in the creation of a new individual, called its child, in a process called reproduction. In the reproduction process, an individual gives (partial) copies of its genotype to its offspring. Reproduction involving only one parent is called asexual; otherwise it is called sexual reproduction. In the following list, a set of possible mutation and sex operators are given:

1- Point leaf mutation: Replace a leaf atom by another atom.
2- Point operator mutation: Replace a node operator by a compatible operator.
3- Shrink mutation: Replace a sub-tree by a leaf with a single atom.
4- Grow mutation: Replace the atom at a leaf by a random tree.
5- Permutation: Change the order of the children of a given node.
6- Gene duplication: Replace a leaf by a copy of a sub-tree.
7- Gene inversion: Switch two sub-trees.
8- Crossover: Share or exchange sub-trees between individuals.

The first five operators, involving only one sub-tree, are sometimes called (proper) mutations, while the last three operators, involving two or more separate sub-trees, are called recombinations. Also notice that the first seven operators involve only one individual, while crossover involves two or more. This list of mutation and recombination operators is redundant but, again, this redundancy may also facilitate the search for the best configuration in the problem's functional space.

We should mention that the terms used to name these operators are not standard in the field of GP, and even less so in biology, genetics, zoology and botany. We should also mention that the forms of GP presented in this section do not explore the possibility of allowing individuals to carry a (redundant) set of two or more homologous (similar but not identical) specifications (genes), a phenomenon known as diploidy or multiploidy. Diploidy is common in eukaryotic (biological) life, and can provide a much richer structure and better performance to GP.

Sexual reproduction can be performed by crossover, with parents giving (partial) copies of their genome to the children. The following examples show a pair of parents and children generated by a single crossover, for some of the problems considered in the last section. A square bracket in the prefix representation indicates a crossover point. The tree representation would indicate the same crossover points by broken edges (=). Notice that in these examples there is a child corresponding to a solution presented in the last section.
Parents: (∨ (∧ (∼a) (∼b)) [a]) , (∨ [∧ a c] b) ;
Children: (∨ (∧ (∼a) (∼b)) [∧ a c]) , (∨ [a] b) ,

the first child being the tree f_2 of Example 1a.

Example 1b: Crossover between Boolean functional trees.
Parents: (∗ [σ (− j 1) 1] (∗ j j)) , (+ (σ (− j (+ 1 1)) 0) [− j 1]) ;
Children: (∗ [− j 1] (∗ j j)) , (+ (σ (− j (+ 1 1)) 0) [σ (− j 1) 1]) .

Example 2b: Crossover between arithmetic functional trees.

Finally, the reproduction and survival selection processes in GP assume that individuals are chosen from the general population according to sampling probabilities called the mating (or representation) distribution and the survival distribution, respectively. Some general policies used to specify these probability distributions, based on the individual's fitness, are given below; a sampling sketch follows the list.

1- Top Rank Selection: The highest ranking (best fit) individual is selected.
2- High Pressure Selection: An individual is selected from the population with a probability that increases sharply (super-linearly) with its fitness or fitness' rank.
3- Fitness Proportional Selection: An individual is selected from the population with a probability that is proportional to its fitness.
4- Rank Proportional Selection: An individual is selected from the population with a probability that is proportional to its fitness' rank.
5- Low Pressure Selection: An individual is selected from the population with a probability that increases modestly (sub-linearly) with its fitness or fitness' rank.
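As a sketch of how such selection policies can be realized (the exact super- and sub-linear pressure functions below are illustrative choices of our own, not a standard):

import random

def select_one(population, fitnesses, policy, rng):
    """Sample one individual according to one of the policies above."""
    n = len(population)
    if policy == 'top-rank':
        return max(zip(fitnesses, population))[1]
    if policy == 'fitness-proportional':
        weights = fitnesses
    else:
        # rank 1 for the worst individual, rank n for the best
        ranks = {i: r for r, i in enumerate(
            sorted(range(n), key=lambda i: fitnesses[i]), start=1)}
        if policy == 'rank-proportional':
            weights = [ranks[i] for i in range(n)]
        elif policy == 'high-pressure':
            weights = [ranks[i] ** 3 for i in range(n)]     # super-linear
        elif policy == 'low-pressure':
            weights = [ranks[i] ** 0.5 for i in range(n)]   # sub-linear
    return rng.choices(population, weights=weights, k=1)[0]

rng = random.Random(0)
pop, fit = ['p1', 'p2', 'p3', 'p4'], [1.0, 2.0, 4.0, 8.0]
print([select_one(pop, fit, 'rank-proportional', rng) for _ in range(5)])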
A possible motivation for developing populational evolutionary algorithms like GP, instead of single individual evolutionary algorithms like straight SA, is to consider a richer and better neighborhood structure. The additional moves made available should provide short-cuts in the problem's configuration space, lowering the cooling constant and allowing a faster convergence of the algorithm.

The intrinsic parallelism argument, first presented in Holland (1975), proves that, under appropriate conditions, GP is likely to succeed in providing such a rich neighborhood structure. The mathematical analysis of this argument is presented in section H.2; see also Reeves (1993, Ch.4, Genetic Algorithms). According to Reeves,

"The underlying concept Holland used to develop a theoretical analysis of his GA [GP] was that of schema. The word comes from the past tense of the Greek verb ἔχω, echo, to have, whence it came to mean shape or form; its plural is schemata." (p.154)

Schemata are partially specified patterns in a program, like partially specified segments of prefix expressions, or partial code for functional sub-trees. The length and order of a schema are the distance between the first and last defined position on the schema, and the number of defined positions, respectively, see section H.2. The Intrinsic Parallelism theorem states that the number of schemata (of order l and length 2l, in binary coded programs, in individuals of size n) present in a population of size m is proportional to m³. The crossover operator enriches the neighborhood of an individual with the schemata present in other individuals of the population. If, as suggested by the implicit parallelism theorem, the number of such schemata is large, GP is likely to be an effective strategy. Schaffer (1987, p.89) celebrates this theorem stating that:

"this [intrinsic parallelism] constitutes the only known example of combinatorial explosion working to advantage instead of disadvantage."
Indeed, Schaffer has ample reason to praise Holland's result. Nevertheless, we must analyze this important theorem carefully, in order to understand its consequences correctly. In particular, we should pay close attention to the unit, u, used to measure the population size, m. As shown in detail in section H.2, this unit, u = 2^l, is itself exponential in the schemata order. Therefore, the combinatorial explosion works to our advantage as long as we use short schemata, relative to the log-size of the population. This situation is described by Reeves as:

"Thus the ideal situations for a GA [GP] are those where short, low-order schemata combine with each other to form better and better solutions. The assumption that this will work is called by Goldberg (1989) the building-block hypothesis. Empirical evidence is strong that this is a reasonable assumption in many problems." (p.158)

One key question we must face in order to design a successful GP application is, therefore: How can we organize our working space so that our programming effort can rely on short schemata?

The solution to this question is well known to computer scientists and software engineers: Organize the programs hierarchically (recursively) as self-contained (encapsulated) building-blocks (modules, functions, objects, sub-routines, etc.). The next section is dedicated to the study of modular organization, and its spontaneous emergence in complex systems.

5.4 Simple Life

The biological world is an endless source of inspiration for improvements and variations in GP (of course, one should also be careful not to be carried away by superficial analogies). A nice anthology of introductory articles can be found in the book by Michod and Levin (1988),
The biological world is an endless source of inspiration for improvements and variations in GP (of course, one should also be careful not to be carried away by superficial analogies). A nice anthology of introductory articles can be found in the book by Michod and Levin (1988), The Evolution of Sex: An Examination of Current Ideas. Let us begin this section with an interesting biological example.

It is a well known phenomenon that bacteria can develop antibiotic resistance. Among the most common mechanisms conferring resistance to new antibiotics, one can list: agents that modify or destroy the antibiotic molecular structure; agents that modify or protect the antibiotic targets; new pathways offering alternatives to those blocked by the antibiotic action; etc. However, all these mechanisms entail a fitness cost to the modified individuals. At the very least, there is the cost of complexity, that is, the cost of building and maintaining these new mechanisms. Hence, if the selective pressure of the antibiotic presence is interrupted, resistant bacterial populations will often revert to non-resistant, see for example Björkholm et al. (2001).

5.4 Simple Life
Ockham's razor, or lex parsimoniae, is an epistemological principle stated by the 14th-century English logician friar William of Ockham, in the following forms:
- Entia non sunt multiplicanda praeter necessitatem, or
- Pluralitas non est ponenda sine necessitate,
that is, entities should not be created or multiplied without necessity.

In section 4.1 we will see how well this principle applies to statistical models, and how it can be enforced. In section 4.2 we will examine introns, a phenomenon that at first glance appears to contradict Ockham's razor. Nevertheless, we will also see how introns allow building blocks to appear spontaneously as an emergent feature in GP.
This section discusses the use of Ockham's razor in statistical modeling. As an illustrative example, we use a standard normal multiple linear regression model. This model states that y = Xβ + u, X n×k, where n is the number of observations, k is the number of independent variables, β ∈ ]−∞, ∞[^k is the vector of regression coefficients, and u is a Gaussian white noise such that E(u) = 0 and Cov(u) = σ²I, σ ∈ [0, ∞[, see DeGroot (1970), Hocking (1985) and Zellner (1971). Using the standard diffuse prior p(β, σ) = 1/σ, the joint posterior probability density, f(β, σ | y, X), and the MAP (maximum a posteriori) estimators for the parameters are given by:

f(β, σ | y, X) = (1/σ^(n+1)) exp( −(1/(2σ²)) ( (n−k) s² + (β − β̂)′ X′X (β − β̂) ) ) ,

β̂ = (X′X)^(−1) X′y ,  ŷ = X β̂ ,  s² = (y − ŷ)′(y − ŷ) / (n − k) .

In the polynomial multiple linear regression model of order k, the dependent variable y is explained by the powers 0 through k of the independent variable x, i.e., the regression matrix element at row i and column j is X_ij = (x_i)^(j−1), i = 1...n, j = 1...k+1. Note that the model of order k has dimension d = k + 2, with parameters β_0, β_1, ..., β_k, and σ.

In the classical example presented in Sakamoto et al. (1986, ch.8), we want to fit a linear regression polynomial model of order k, y = β_0 + β_1 x + β_2 x² + ... + β_k x^k + N(0, σ²I), through the n = 21 points, (x_i, y_i), in Table 1.

[Table 1: the 21 data points (x_i, y_i); numerical values lost in this transcription. The data were generated as y_i = g(x_i) plus Gaussian noise, where the target function, g(x), cannot be expressed exactly as a finite order linear regression polynomial model.]

Figure 5 presents the target function in the example's range, the data set (Sakamoto's set in 5a and a second set generated by the same stochastic process in 5b), and the regression polynomials of orders 0 through 5. In this example, all the available data points are used to fit the model. An alternative procedure would be to divide the available data into two sets: the training set, used to adjust the model, and the test set, used to test the model's predictive or extrapolation power.

Just by visual inspection, one can come to the following conclusions:
- If the model is too simple, it fails to capture important information available in the data, making poor predictions.
- If the model is too complex, it overfits the training data, that is, the fitted curve tends to become an interpolation curve, but the curve becomes unstable and predicted values become meaningless.

The polynomial regression model family presented in the example is typical, in the sense that it offers a class of models of increasing dimension, or complexity. This poses a model selection problem, that is, deciding which among all models in the family is the “best” adapted to the data. It is natural to look for a model that accomplishes a small empirical error, R_emp, the estimated model error on the training data. A regression model is estimated by minimizing the 2-norm empirical error. However, we cannot select the “best” model based only on the empirical error, because we would usually select a model of very high complexity. In general, when the dimensionality of the model is high enough, the empirical error can be made equal to zero by simple interpolation. It is a well known fact in statistics (or learning theory) that the prediction (or generalization) power of such high dimension models is poor.
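A minimal sketch (not from the original text; NumPy only, with a hypothetical data set standing in for Table 1, whose values are not reproduced here) of the MAP / least-squares estimator and the empirical error just discussed:

    import numpy as np

    def polynomial_map_fit(x, y, k):
        """MAP / least-squares fit of a polynomial regression of order k:
        beta_hat = (X'X)^{-1} X'y, with X[i, j] = x_i ** j, j = 0..k."""
        X = np.vander(x, k + 1, increasing=True)
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        y_hat = X @ beta_hat
        r_emp = float((y - y_hat) @ (y - y_hat)) / len(y)  # ||y - y_hat||^2 / n
        return beta_hat, r_emp

    # hypothetical stand-in for Sakamoto's 21 points
    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 21)
    y = np.exp(-(x - 0.5) ** 2) + 0.1 * rng.standard_normal(x.size)
    for k in range(6):
        print(k, polynomial_map_fit(x, y, k)[1])  # R_emp shrinks as k grows

As expected, the empirical error decreases monotonically with the order k, which is precisely why it cannot be used alone as a selection criterion.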
[Figure 5a,b: Target function, data points, and polynomial regressions of order 0 to 5; ◦: data points; □: target function; ∗: best (quadratic) polynomial regression.]

Therefore the selection criterion also has to penalize the model dimension. This is known as a regularization mechanism.

Some model selection criteria define a penalized (or regularized) error, R_pen = r(d, n) R_emp, using a regularization factor, r(d, n), where d is the model dimension and n the number of training data points. Common regularization factors, using p = d/n, are:
• Akaike's final prediction error: FPE = (1 + p) / (1 − p);
• Schwartz' Bayesian criterion: SBC = 1 + ln(n) p / (2(1 − p));
• Generalized cross validation: GCV = (1 − p)^(−2);
• Shibata model selector: SMS = 1 + 2p.

All these regularization factors are supported by theoretical arguments as well as by empirical performance; other common regularization methods are the Akaike information criterion (AIC) and the Vapnik-Chervonenkis (VC) prediction error. For more details, see Akaike (1970, 1974), Barron (1984), Breiman (1984), Cherkassky (1998), Craven (1979), Michie (1994), Mueller (1994), Shibata (1981), Swartz (1978), Unger (1981) and Vapnik (1995, 1998).

We can also use the FBST as a model selection criterion, by testing the hypothesis that some of its parameters are null, as detailed in Pereira and Stern (2001). The FBST version of Ockham's razor states:
- Do not include in the model a new parameter unless there is strong evidence that it is not null.
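A minimal, self-contained sketch (not from the original text; NumPy only, data assumed given) of model selection with the regularization factors listed above:

    import numpy as np

    def empirical_error(x, y, k):
        """2-norm empirical error R_emp of the order-k polynomial LS fit."""
        X = np.vander(x, k + 1, increasing=True)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        res = y - X @ beta
        return float(res @ res) / len(y)

    def regularization_factors(d, n):
        """Penalization factors r(d, n), with p = d/n, as listed above."""
        p = d / n
        return {'FPE': (1 + p) / (1 - p),
                'SBC': 1 + np.log(n) * p / (2 * (1 - p)),
                'GCV': (1 - p) ** -2,
                'SMS': 1 + 2 * p}

    def select_order(x, y, max_k, criterion='FPE'):
        """Pick the order minimizing R_pen = r(d, n) * R_emp, with d = k + 2."""
        n = len(x)
        scores = {k: regularization_factors(k + 2, n)[criterion]
                     * empirical_error(x, y, k) for k in range(max_k + 1)}
        return min(scores, key=scores.get)

    # usage, e.g. with the hypothetical data of the previous sketch:
    # select_order(x, y, max_k=5, criterion='GCV')

On data of the kind described above, such criteria typically select a low order, in agreement with the best (quadratic) fit shown in Figure 5.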
Table 2 presents the empirical error, EMP = ||y − ŷ||²/n, for models of order k ranging from 0 to 5, the several regularization criteria previously mentioned, as well as the Akaike information criterion (AIC), as computed by Sakamoto. Table 2 also presents the e-value supporting the hypothesis H: β_k = 0, that is, the hypothesis stating that the model is in fact of order k − 1.

As seen in section 3, GP can produce polynomial networks that are very similar to the polynomial regression models presented in the last section. The main difference between the polynomial networks and the regression models lies in their generation process: While the regression models are computed by a deterministic algorithm, the GP networks are generated by a random evolutionary search. However, if one uses compatible measures of performance for the GP fitness function and the regression (penalized or regularized) error, the two approaches can be meaningfully compared. In this context, Angeline noticed that evolved programs often carry extraneous code, that is, code segments that, if removed, do not (significantly) alter the solution computed by the network. Trivial examples of extraneous code segments are (+ s 0) and (∗ s 1), where s is a sub-expression.
By their very definition, extraneous code segments cannot (significantly) contribute to an individual's fitness, and hence to its survival or mating probabilities. However, Angeline noticed that the presence of extraneous code could significantly contribute to the expected fitness of the individual's descendents! Apparently, the role of these (sometimes very large) patches of inert code is to isolate important blocks of working code, and to protect these blocks from being broken at recombination (destructive crossover).

In biological organisms, the genetic code of eukaryotes exhibits similar regions of code (DNA) that are or are not expressed in protein synthesis; these regions are called exons and introns, respectively. Introns do not directly code amino-acid sequences in proteins; nevertheless, they seem to have an important role in the meta-control of the expression and reproduction of the genetic material.

Subsequent work of several authors tried to incorporate meta-control parameters into GP. Iba and Sato (1993, p.548), for example, propose a meta-level strategy for GP based on a self-referential representation, where

“[a] self-referential representation maintains a meta-description, or meta-prescription, for crossover. These meta-genetic descriptions are allowed to co-evolve with the gene pool. Hence, genetic and meta-genetic code variations are jointly selected. How well the genetic code is adapted to the environment is translated by the merit or objective function which, in turn, is used for the immediate, short-term or individual selection process. How well the genetic and meta-genetic code are adapted to each other impacts on the system's evolvability, a characteristic of paramount importance in long-run survival of the species.”

Functional trees, for example, can incorporate edge annotations, like probability weights, linkage compatibility or affinity, etc. Such annotations are meta-parameters used to control the recombination of the sub-tree directly below a given edge. For example, weights may be used to specify the probability that a recombination takes place at that edge, while linkage compatibility or affinity tags may be used to identify homologous or compatible genes, specifying the possibility or probability of swapping two sub-trees. Other annotations, like context labels, variable types, etc., may provide additional information about the possibility or probability of recombination or crossover, the need of type-cast operations, etc. When such meta-control annotations coevolve in the stochastic optimization process,
they may be interpreted as a spontaneously emergent semantics. Any semantic information may, in turn, be used in the design of acceleration procedures based on heuristic merit functions, like the example studied in section 5.2.3.

Banzhaf (1998, ch.6, p.164) gives a simple example of functional tree annotation:

“Recently, we introduced the explicitly defined introns (EDI) into GP. An integer value is stored between every two nodes in the GP individual. This integer value is referred to as the EDI value (EDIV). The crossover operator is changed so that the probability that crossover occurs between any two nodes in the GP program is proportional to the integer value between the nodes. That is, the EDIV integer value strongly influences the crossover sites chosen by the modified GP algorithm, Nordin et al. (1996).

The idea behind EDIVs was to allow the EDIV vector to evolve during the GP run to identify the building blocks in the individual as an emergent phenomenon. Nature may have managed to identify genes and to protect them against crossover in a similar manner. Perhaps if we gave the GP algorithm the tools to do the same thing, GP, too, would learn how to identify and protect the building blocks. If so, we would predict that the EDIV values within a good building block should become low and, outside the good block, high.”
Let us finish this section presenting two interpretations for the role of modularity in genetic evolutionary processes. These interpretations are common in biology, computer science and engineering, an indication that they provide powerful insights. The two metaphors are commonly referred to as:
- New technology dissemination or component design substitution, and
- Damage control or repair mechanism.

The first interpretation is perhaps the more evident. In a modular system, a new design for an old component can be easily incorporated and, if successful, be rapidly disseminated. A classical example is the replacement of mechanical carburetors by electronic injection as the standard technology for this component of gasoline engines in the automotive industry. The large assortment of upgrade kits available in any automotive or computer store gives strong evidence of how much these industries rely on modular design. The second interpretation explains the possibility for the “continued evolution of germlines otherwise destined to extinction”, see Michod and Levin (1988). A classic illustration related to the damage control and repair mechanisms offered by modular organization is given by the Hora and Tempus parable of Simon (1996), presented in section 6.4.

The lessons learned in this section may be captured by the following dicta of Herbert Simon:

“The time required for the evolution of a complex form from simple elements depends critically on the number and distribution of potential intermediate stable subassemblies.” Simon (1996, p.190).

“Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses.”
Simon (1996, p.184).
5.5 Evolution of Theories

The last sections presented a general framework for the stochastic evolution of complex systems. Figure 6 presents a systemic diagram of biological production, according to this framework. This diagram is also compatible with current biological theories of the evolution of life, provided it is considered as a schematic simplification focusing on our particular interests.

The comparison of this biological production diagram with the scientific production diagram presented in section 1.5 motivates several analogies, which may receive further encouragement from a comment by Davis and Steenstrup (1987, p.2):

“The metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The ‘knowledge’ that each species has gained is embodied in the makeup of the chromosomes of its members.”
According to this view, computational (or biological genetic) programs are perceived as coded knowledge acquired by a population. An immediate generalization of this idea is to consider the evolution of other corpora of knowledge, embodied in a variety of media. Our main interest, given the scope of this book, is in the evolution of scientific theories and their supporting statistical models. This is the topic discussed in this and the next sections. For some very interesting qualitative analyses related to this subject see Richards (1989, appendix II) and Lakatos (1978a,b).

Section 5.1 considers several ways in which statistical models can be nested, mixed and separated. It also analyzes the series-parallel composition of several simpler and (nearly) independent models. Section 5.2 is devoted to complementary models. Complementarity is a basic form of model composition in quantum mechanics that has received, so far, little attention in other application areas. All these forms of model transformation and combination should provide a basic set of mutation and recombination operators in an abstract modeling space. In this section we focus on the statistical operations themselves, leaving some of the required epistemological analyses and historical comments to sections 6 and 7.
[Figure 6: Biological production diagram. A closed loop connects the phenotypic space (individual ontogenesis and somatic behavior, live/dead organism, fittest survival, mating competition) with the genotypic space (individual genotypes and genetic code, genetic mutation, sexual recombination), linked by epigenetic development and reproductive representation.]

In this subsection we use some examples involving the (two-parameter) Weibull (W2) and Gompertz (G2) probability models. The hazard (or failure rate) functions, h_W and h_G, the reliability (or survival) functions, r_W and r_G, and the density functions, f_W and f_G, of these models are given by:

h_W(x | β, γ) = β γ^(−β) x^(β−1) ;  r_W = exp( −(x/γ)^β ) ;  f_W = β γ^(−β) x^(β−1) exp( −(x/γ)^β ) ;

h_G(x | α, λ) = λ α^x ;  r_G = exp( −(λ/log α)(α^x − 1) ) ;  f_G = λ α^x exp( −(λ/log α)(α^x − 1) ) .

The parameters γ, for the Weibull model, and λ, for the Gompertz model, are known as scale parameters, while β and α are the corresponding shape parameters. Notice that h = f/r, and r = 1 − F, that is, the reliability function is the complement of the cumulative distribution function, F.

These probability models are used in reliability theory to study the characteristics of the survival (or life) time of a system, until it first fails (or dies). It can be shown, see Barlow and Proschan (1981), Gavrilov (1991, 2001) and appendix H.3, that the Weibull distribution is adequate to describe the survival time of many allopoietic, manufactured or industrial systems, while the Gompertz distribution is adequate to describe the life time of many autopoietic, biological or organic systems. In this setting, the key difference between autopoietic and allopoietic systems is the nature of their ontogenesis or assembling process, as described in the next paragraphs. Reasonable assumptions concerning the systems' ontogenesis will render either the Weibull or the Gompertz distribution as asymptotic eigen-solutions.

The Weibull scale parameter, γ, also known as the characteristic life, is approximately the 63rd lifetime percentile, regardless of the value of the shape parameter. By altering its shape parameter, β, the (two-parameter) Weibull distribution can take a variety of forms, see Figure 7 and Dodson (1994). Some particular values of the shape parameter are important special cases: for β = 1, it is the Exponential distribution; for β = 2, it is the Rayleigh distribution; for β = 2.5, it approximates the lognormal distribution; for β = 3.6, it approximates the normal distribution; and for β = 5.0, it approximates the peaked normal distribution. The flexibility of the Weibull distribution makes it very useful for empirical modeling, specially in quality control and reliability. The regions β < 1, β = 1, and β > 1 correspond, respectively, to decreasing, constant, and increasing failure rates. For β = 1, the Weibull degenerates into the Exponential distribution. This (no) aging regime represents a simple element with no structure, exhibiting, therefore, the memoryless property of constant failure rate, h_E(x | γ) = 1/γ.

The affine transformation x = x′ + α leads to the (three-parameter) Truncated Weibull distribution. A location (or threshold) parameter, α > 0, represents the age of a system that begins to be observed at time t = 0, after it has already survived the period [−α, 0]:

r_E(x | γ) = exp( −(x/γ) ) ;  r_W(x | β, γ) = exp( −(x/γ)^β ) ;

r_W(x | α, β, γ) = (1 / r_W(α | β, γ)) exp( −((x + α)/γ)^β ) .
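A minimal sketch (not from the original text; NumPy only) of the Weibull and Gompertz functions above, using the identities f = h·r and r = 1 − F:

    import numpy as np

    def weibull_hrf(x, beta, gamma):
        """Hazard, reliability and density of the two-parameter Weibull."""
        h = (beta / gamma) * (x / gamma) ** (beta - 1)
        r = np.exp(-(x / gamma) ** beta)
        return h, r, h * r          # f = h * r

    def gompertz_hrf(x, alpha, lam):
        """Hazard, reliability and density of the two-parameter Gompertz."""
        h = lam * alpha ** x
        r = np.exp(-(lam / np.log(alpha)) * (alpha ** x - 1))
        return h, r, h * r

    x = np.linspace(0.01, 3, 300)
    h, r, f = weibull_hrf(x, beta=2.0, gamma=1.0)
    # the characteristic life gamma is approximately the 63rd percentile:
    print(1 - np.exp(-1.0))         # F(gamma) = 1 - exp(-1) ~ 0.632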
[Figure 7: Shapes of the Weibull distribution, h, r and f, for γ = 1 and several values of the shape parameter β.]

The Exponential, Weibull and Truncated Weibull distributions provide an example of nested models, in which a distribution with fewer parameters (or degrees of freedom) is a special case (a sub-manifold in the parameter space) of a distribution with more parameters (or degrees of freedom): The (one-parameter) Exponential distribution is a special case of the (two-parameter) Weibull distribution which, in turn, is a special case of the (three-parameter) Truncated Weibull distribution. Nesting is one of the basic modes of relating different statistical models. For examples of the FBST used for model selection in nested models see Irony et al. (2002), Lauretto et al. (2003), and Stern and Zacks (2002).

The (two-parameter) Weibull distribution also has an important theoretical property: Its functional form is invariant by serial composition. If n i.i.d. random variables have Weibull distribution, X_i ∼ f(x | β, γ), then the first failure is a Weibull variate with characteristic life γ/n^(1/β), i.e. X_[1,n] ∼ f(x | β, γ/n^(1/β)). This is a key property for its characterization as a stable distribution, that is, for the characterization of the Weibull distribution as an (asymptotic) eigen-solution. For applications in the context of extreme value theory, see Barlow and Proschan (1981).

While a series system fails when its first element fails, a parallel system fails when its last element fails. Figure 8 gives the standard graphical representation of series and parallel systems. This representation is inspired by circuit theory: While in a series system the current flow is cut if a single element is cut, in a parallel system the current flow is cut only if all elements are cut. Series and parallel composition are the two basic modes used in Reliability Engineering for structuring and analyzing complex systems. Some of the statistical properties of these structures are captured in the form of algebraic lattices, see Barlow and Proschan (1981) and Kaufmann et al. (1977).

[Figure 8: Standard graphical representation of series and parallel systems.]

Consider now a scientist observing the lifetimes of systems that may be either allopoietic, H_1, or autopoietic, H_2. Since hypotheses H_1 and H_2 imply life distributions with distinct functional forms, the scientist could use his/her observed life data to decide which hypothesis is correct (or more adequate). This situation is known in statistics as the problem of separate hypotheses. The scientist could also be faced with a mixed population, a situation in which a fraction w_1 of the individuals are allopoietic, and a fraction w_2 of the individuals are autopoietic. In this situation the scientist could use his/her observed data to infer the fractions or weights, w_1 and w_2, in the mixture model.

In mixture models in general, the p.d.f. of the data is a convex linear combination of fixed candidate densities. Writing the model's vector parameter as θ = [w, ψ_1, ..., ψ_m],

f(x | θ) = w_1 f_1(x | ψ_1) + ... + w_m f_m(x | ψ_m) ,  w ≥ 0 ,  1′w = 1 ,

and the model's likelihood function is

f(X | θ) = ∏_{j=1..n} ∑_{k=1..m} w_k f_k(x_j | ψ_k) .
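A minimal sketch (not from the original text; NumPy only, with hypothetical parameter values) of the mixture likelihood above, using a Weibull component for the allopoietic class and a Gompertz component for the autopoietic class:

    import numpy as np

    def weibull_pdf(x, beta, gamma):
        return (beta / gamma) * (x / gamma) ** (beta - 1) * np.exp(-(x / gamma) ** beta)

    def gompertz_pdf(x, alpha, lam):
        return lam * alpha ** x * np.exp(-(lam / np.log(alpha)) * (alpha ** x - 1))

    def mixture_loglik(x, weights, component_pdfs):
        """log f(X | theta) = sum_j log( sum_k w_k f_k(x_j | psi_k) )."""
        comps = np.stack([w * pdf(x) for w, pdf in zip(weights, component_pdfs)])
        return float(np.log(comps.sum(axis=0)).sum())

    x = np.array([0.3, 0.7, 1.1, 1.6, 2.2])        # hypothetical lifetimes
    pdfs = [lambda t: weibull_pdf(t, 1.0, 1.0),    # allopoietic component
            lambda t: gompertz_pdf(t, 2.0, 0.2)]   # autopoietic component
    print(mixture_loglik(x, [0.4, 0.6], pdfs))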
Figure 9a,b: Mixture models with 1 and 2 bivariate-Normal components.

In mixture analysis for unsupervised classification, we assume that the data come from two or more subpopulations (classes), distributed under distinct densities. Statistical mixture models may also be able to infer the classification probabilities for each data point, see Figure 9. In a heterogeneous mixture model, the components of the mixture have distinct functional forms. In a homogeneous mixture model, all components of the mixture have the same functional form. For several applications of these models, see Fraley (1999), Lauretto et al. (2006, 2007), Robert (1996) and Stephens (1997).
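The classification probabilities mentioned above follow from Bayes' rule; a minimal sketch (not from the original text; NumPy only, with hypothetical Gaussian components) of component membership:

    import numpy as np

    def membership_probs(x, weights, component_pdfs):
        """P(class k | x_j) = w_k f_k(x_j) / sum_l w_l f_l(x_j)."""
        comps = np.stack([w * pdf(x) for w, pdf in zip(weights, component_pdfs)])
        return comps / comps.sum(axis=0)

    def gauss(mu, s):
        # returns a univariate Normal density as a callable
        return lambda t: np.exp(-(t - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

    x = np.array([-1.0, 0.2, 2.5])
    print(membership_probs(x, [0.5, 0.5], [gauss(0, 1), gauss(2, 1)]))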
According to Bohr, the word complementarity is used

“...to characterize the relationship between experiences obtained by different experimental arrangements and visualizable only by mutually exclusive ideas...” (N. Bohr II, Natural Philosophy and Human Cultures, p.30)

“Information regarding the behavior of an object obtained under definite experimental conditions may, however, ...be adequately characterized as complementary to any information about the same object obtained by some other experimental arrangement excluding the fulfillment of the first conditions. Although such kinds of information cannot be combined into a single picture by means of ordinary concepts, they represent indeed equally essential aspects of any knowledge of the object in question which can be obtained in this domain.” (Bohr 1938, p.26)

In Quantum Mechanics, at least from a historical perspective, the most important complementarity relations are those implied by the wave-particle complementarity or duality principle. We have mentioned these complementarity relations in section 3.3, and we will examine them again in sections 6 and 7. This principle states that microparticles exhibit the properties of both particles and waves, even considering that, in classical physics, these categories are mutually exclusive. At the dawn of the XX century, physics had an assortment of phenomena that could not be appropriately explained by classical physics. In order to explain one of these phenomena, known as the photoelectric effect, Albert Einstein postulated in 1905, his annus mirabilis, a model in which light, conceived in classical physics as electromagnetic waves, should also be seen as a rain of tiny particles, now called photons. Einstein's basic hypothesis was that a photon's energy is proportional to the light's frequency, E = hν, where the proportionality constant, h, is Planck's constant.

In 1924, Louis de Broglie generalized Einstein's hypothesis. Using Einstein's relativistic relation, E = mc², the photon's wavelength, λ = c/ν, can be written as λ = h/(mc), where m = E/c² is the effective mass attributed to the photon. A moving particle's momentum is defined as the product of its mass and velocity, p = mv. Hence, de Broglie conjectured that any moving particle has associated to itself a “pilot wave” of wavelength λ = h/p = h/(mv), see Broglie (1946, ch.IV, Wave Mechanics) for the original argument. Just two years later, in 1926, Erwin Schrödinger published the paper “Quantization as an Eigenvalue Problem”, further generalizing these ideas into his (Schrödinger's) wave equation, the basis for a general theory of Quantum Mechanics, see the next section. The details of the early developments of Quantum Mechanics can be found in Tomonaga (1962) and Pais (1988, ch.12), but from this brief history it is clear that the general idea of complementarity was a cornerstone in the birth of modern physics.

Nevertheless, Bohr believed that complementarity could be a useful concept in many other areas. Folse (1985) gives an interesting essay about Bohr's ideas on complementarity, including its application to fields outside quantum mechanics. Possible examples of such applications are given next:

“...the lesson with respect to the role which the tools of observation play in defining the elementary physical concepts gives a clue to the logical applications of notions like purposiveness foreign to physics, but lending themselves so readily to the description of organic phenomena.
Indeed, on this background it is evident that the attitudes termed mechanistic and finalistic do not present contradictory views on biological problems, but rather stress the mutually exhaustive observational conditions equally indispensable in our search for an ever richer description of life.” (Bohr II, Physical Science and Problems of Life, p.100)

“For describing our mental activity, we require, on one hand, an objectively given content to be placed in opposition to a perceiving subject, while, on the other hand, as is already implied in such an assertion, no sharp separation between object and subject can be maintained, since the perceiving subject also belongs to our mental content. From these circumstances follows not only the relative meaning of every concept, or rather of every word, the meaning depending upon our arbitrary choice of view point, but also we must, in general, be prepared to accept the fact that a complete elucidation of one and the same object may require diverse points of view which defy a unique description. Indeed, strictly speaking, the conscious analysis of any concept stands in a relation of exclusion to its immediate application. The necessity of taking recourse to a complementary, or reciprocal, mode of description is perhaps most familiar to us from psychological problems. In opposition to this, the feature which characterizes the so-called exact sciences is, in general, the attempt to attain to uniqueness by avoiding all reference to the perceiving subject. This endeavor is found most consciously, perhaps, in the mathematical symbolism which sets up for our contemplation an ideal of objectivity to the attainment of which scarcely any limits are set, so long as we remain within a self-contained field of applied logic. In the natural sciences proper, however, there can be no question of a strictly self-contained field of application of the logical principles, since we must continually count on the appearance of new facts, the inclusion of which within the compass of our earlier experience may require a revision of our fundamental concepts.” (Bohr I, The Quantum of Action, p.96-97)

Examining some basic concepts of quantum mechanics, L. V. Tarasov (1980, p.153) poses a question concerning the concept of complementarity that is very pertinent in our context:

“A microparticle is neither a corpuscle, nor a wave, but still we employ both these images, which mutually exclude each other, for describing a microparticle. ... Naturally, this could give rise to a ticklish question: Doesn't this mean an alienation of the image from the object, which is fraught with a transition to the position of subjectivism? A negative answer to this question is given by the principle of complementarity itself. From the position of this principle, pictures mutually excluding one another are used as mutually complementary pictures, adequately representing various sides of the objective reality called the microparticle.”
Even considering that Tarasov makes his point from a very different epistemological perspective, his statement fits admirably well into our constructivist framework. Within it, the objectivity of a complementarity model can be interpreted as follows: Although complementary, the several views employed to describe an object should still render objective, that is, sharp, stable, separable and composable, eigen-solutions.

5.6 Varieties of Probability
This section presents some basic ideas of Quantum Mechanics, providing simple heuristic derivations for a few of its basic principles. Its main objective is to discuss the impact of Quantum Mechanics on the concept and interpretation of probability models.
In this section we present Werner Heisenberg's uncertainty principle, derived directly from de Broglie's wave-particle complementarity principle.

A particle with a precise momentum, p, has associated to it a pilot wave that is monochromatic, that is, has a single wavelength, λ. Hence, this wave is homogeneously distributed in space. Let us think of a particle with an uncertain momentum, specified by a probability distribution, φ(p). What would the distribution, ψ(x), of the location of its associated pilot wave be? Assuming that the composition rule for pilot waves is the standard linear superposition principle, see section 4.2, the answer to this question is given by the mathematics of Fourier series and transforms, see Butkov (1968, ch.4 and 7), Byron and Fuller (1969, ch.4 and 5) or Sadun (2001, ch.8 and 10).

The Fourier synthesis of a function, f(x), in the interval [0, L] is given by the Fourier series

f(x) = a_0/2 + ∑_{n=1..∞} ( a_n cos(2nπx/L) + b_n sin(2nπx/L) ) ,

a_n = (2/L) ∫_0^L f(x) cos(2nπx/L) dx ,  b_n = (2/L) ∫_0^L f(x) sin(2nπx/L) dx .

The following examples give the Fourier series for the rectangular and triangular spike functions, R_h(x) and T_h(x). In order to obtain simpler expressions, the spikes are presented at the center of the interval [−π, +π], the standard interval of length L = 2π shifted to be centered at the origin. Figure 10a displays the first 5 even harmonics, cos(nx), for wave numbers n = 1...5; Figure 10b displays the Fourier coefficients, a_n, in the synthesis of the triangular spike T_h(x), for h = 1.0; Figures 10c and 10d display the triangular spike and its Fourier syntheses with the first 2 and the first 5 harmonics.
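A minimal numerical sketch (not from the original text; NumPy only) of this synthesis, reconstructing the triangular spike from its first N harmonics with the coefficients given below:

    import numpy as np

    def triangular_spike(x, h):
        return np.where(np.abs(x) < h, 1 - np.abs(x) / h, 0.0)

    def triangular_synthesis(x, h, N):
        """Partial Fourier sum of T_h on [-pi, pi], first N harmonics."""
        s = np.full_like(x, h / (2 * np.pi))            # a_0 / 2
        for n in range(1, N + 1):
            a_n = (4 / (np.pi * h)) * (np.sin(n * h / 2) / n) ** 2
            s += a_n * np.cos(n * x)
        return s

    x = np.linspace(-np.pi, np.pi, 401)
    for N in (2, 5):   # reproduces the qualitative content of Figures 10c-d
        err = np.max(np.abs(triangular_synthesis(x, 1.0, N) - triangular_spike(x, 1.0)))
        print(N, err)  # the error shrinks as more harmonics are added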
Figure 10: Monochromatic waves and superposition packets.

R_h(x) = { 1, if −h < x < +h ; 0, otherwise in [−π, π] }
       = h/π + (2/π) ∑_{n=1..∞} (sin(nh)/n) cos(nx) .

T_h(x) = { 1 − |x|/h, if |x| < h ; 0, otherwise in [−π, π] }
       = h/(2π) + (4/(πh)) ∑_{n=1..∞} (sin(nh/2)/n)² cos(nx) .

It is also possible to express the Fourier series in complex form. Using the complex exponential notation, exp(ix) = cos(x) + i sin(x), we write

f(x) = ∑_{n=−∞..+∞} c_n e^(i2nπx/L) ,  c_n = (1/L) ∫_0^L f(x) e^(−i2nπx/L) dx .

The trigonometric and complex exponential Fourier coefficients are related as follows:

c_0 = a_0/2 ,  c_n = (a_n − i b_n)/2 ,  c_{−n} = (a_n + i b_n)/2 ,  n = 1, ... ∞ .

The orthogonality relations,

∫_0^L e^(i2nπx/L) e^(−i2mπx/L) dx = ∫_0^L e^(i(n−m)2πx/L) dx = { L, if n = m ; 0, if n ≠ m } ,

are the key for interpreting the set of complex exponentials, { e^(i2nπx/L) }, for wave numbers n = −∞ ... +∞, as an orthogonal basis for the appropriate functional vector space in the interval [0, L].

If we want to synthesize functions on the entire real line, not just in a finite interval, we must replace Fourier series by Fourier transforms. The Fourier transform, f̂(k), of a function, f(x), and its inverse transform are defined, respectively, by

f̂(k) = (1/√(2π)) ∫_{−∞..∞} f(x) exp(−ikx) dx  and  f(x) = (1/√(2π)) ∫_{−∞..∞} f̂(k) exp(ikx) dk .

In the Fourier transform, the propagation number (or angular frequency), k = 2nπ/L, replaces the wave number, n, used in the Fourier series. The new normalization constants are defined to stress the duality between the complementary representations of the function in state and frequency spaces, x and k.

As an important example, let us compute the Fourier transform of a Gaussian distribution with mean μ = 0 and standard deviation (uncertainty) σ_x = σ:

f(x) = (1/(σ√(2π))) exp( −x²/(2σ²) ) ,  f̂(k) = (1/√(2π)) exp( −σ²k²/2 ) .

This computation can be checked using the analytic formula of the Gaussian integral,

∫_{−∞..∞} exp( −ax² + bx + c ) dx = √(π/a) exp( b²/(4a) + c ) .
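The Gaussian pair above can also be verified numerically; the following minimal sketch (not from the original text; NumPy only) approximates f̂(k) by direct quadrature and checks that the product of the conjugate standard deviations equals one, as discussed next:

    import numpy as np

    sigma = 1.5
    dx, dk = 0.01, 0.02
    x = np.arange(-12.0, 12.0, dx)
    k = np.arange(-8.0, 8.0, dk)
    f = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

    # fhat(k) = (1 / sqrt(2 pi)) * integral of f(x) exp(-i k x) dx
    fhat = np.abs(np.exp(-1j * np.outer(k, x)) @ f) * dx / np.sqrt(2 * np.pi)

    def sd(grid, w, d):
        """Standard deviation of a nonnegative weight function on a grid."""
        w = w / (w.sum() * d)
        mu = (grid * w).sum() * d
        return np.sqrt((((grid - mu) ** 2) * w).sum() * d)

    print(sd(x, f, dx) * sd(k, fhat, dk))   # -> approximately 1.0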
[Figure 11: Uncertainty relation for Fourier conjugates — a Gaussian density and its Fourier transform.]
Hence, the Fourier transform of a Gaussian distribution with standard deviation σ_x = σ is again a Gaussian distribution, but with standard deviation σ_k = 1/σ, that is, σ_x σ_k = 1. Figure 11 displays the case σ = 1.5. It is also possible to show that this example is a best case, in the sense that, for any other function, f(x), the standard deviations of the conjugate functions, f(x) and f̂(k), obey the inequality of the uncertainty principle, σ_x σ_k ≥ 1, see Sadun (2001, sec.10.5).

In the context of Quantum Mechanics, the best known instance of the uncertainty principle gives a lower bound on the product of the standard deviations of the position and momentum of a particle,

σ_x σ_p ≥ ℏ/2 ,  ℏ = h/(2π) ,  h = 6.626 068 96(33) × 10^(−34) J s .
Heisenberg's bound is written as a function of the momentum, p, instead of the frequency, k; this is why on the right hand side of the inequality we have half the reduced Planck's constant, ℏ/2, instead of 1, as in the case of Fourier transform conjugate functions.

Planck's constant has the dimension of action, an energy-time product, like joule-second or electron-volt-second. The values above present the best current (2006) estimates for this fundamental physical constant, in the format recommended by the Committee on Data for Science and Technology, CODATA. The two digits in parentheses denote the standard deviation of the last two significant digits of the constant's value. The importance of this constant and its representation are further analyzed in the next sections.
In the last sections we have analyzed de Broglie's complementarity principle, which states that any moving particle has associated to itself a “pilot wave” of wavelength λ = h/(mv). In section 4.2 we analyzed some of the basic properties of the classical wave equation, displayed below on the left hand side:

d²ψ/dx² + ω²ψ = 0 ,  ω² = (2m/ℏ²) (E − V(x)) .

In the classical equation, ω = 2π/λ is the wave's angular wave number. What should a quantum wave equation look like? Schrödinger's idea was to replace the classical wavelength by de Broglie's, that is, to use ω = 2πmv/h = mv/ℏ. Using the definition of the kinetic energy of a particle, T = (1/2)mv², and its relation to V(x) and E, the particle's potential and total energy, T = E − V(x), we find the expression for ω² displayed above on the right.

This is Schrödinger's (time independent) wave equation, which established a firm basis for the development of Quantum Mechanics, also known in its early days as “wave mechanics”. One of the immediate successes of Quantum Mechanics was to provide elegant solutions to several problems that had resisted classical analysis. The probabilistic interpretation of the wave function, ψ, was given only a few months later by Max Born. According to Born's interpretation: The probability density of “finding” the particle at position x is proportional to the square of the absolute amplitude of the wave function, |ψ(x)|². Since, in the general case, ψ is a complex function, the last quantity can also be written as the product of the wave function by its complex conjugate, that is, |ψ(x)|² = ψ*ψ.

From this interpretation of the wave function, we can understand Max Born's formulation of ‘the core metaphor of wave mechanics’, as quoted in Pais (1988, ch.12, sec.d, p.258):

“The essence of wave mechanics: ‘The motion of particles follows probability laws but the probability itself propagates according to the law of causality.’”

This is a revolutionary interpretation, which attributes to the concept of probability a new and distinct ‘objective’ character. Hence, it is interesting to have some insight into the genesis of Born's interpretation. Born's own recollections are presented in Pais (1988, ch.12, sec.d, p.258-259):

“What made Born take his step?
In 1954 Born was awarded the Nobel Prize ‘for his fundamental research, specially for his statistical interpretation of the wave function’. In his acceptance speech Born, then in his seventies, ascribed his inspiration for the statistical interpretation to ‘an idea of Einstein's [who] had tried to make the duality of particles - light-quanta or photons - and waves comprehensible by interpreting the square of the optical wave amplitudes as probability density for the occurrence of photons. This concept could at once be carried over to the ψ-function: |ψ|² ought to represent the probability density of electrons.’”

One of the favorite metaphors used by the orthodox Bayesian school describes the scientist's work as a game against nature, with the objective of scoring a good guess on “nature's true state”. Implicit in this metaphor is the assumption that such a “true state of nature” exists and is, at least in principle, accessible. In this paradigm, omniscience is usually a matter of money, that is, with enough economic resources all pertinent information can, at least in principle, be acquired, see Blackwell and Girshick (1954), for example.
“Statistics can be viewed as a game against nature.” (p.75)

“...games where one of the players is not faced with an intelligent opponent but rather with an unknown state of nature.” (p.121)

“The same theory that served to delineate optimal strategies in games played against an intelligent opponent will serve to delineate classes of optimal strategies in games played against nature.” (p.123)

“What prevents the statistician from getting full knowledge of ω [the state of nature] by unlimited experimentation is the cost of experiments.” (p.78)

This paradigm seems incompatible with, or at least very unfriendly to, Born's probabilistic interpretation of Quantum Mechanics and Heisenberg's uncertainty principle. We believe that, in the context of quantum mechanics, the strictly subjective interpretation of probability is, please forgive the pun, a very risky metaphor, and that pushing this metaphor where it does not belong will lead to endless paradoxes. In Chapter 7 of his book, The Physics of Chance, for example, Charles Ruhla presents the adventures of the simple-minded hero Monsieur de La Palice, struggling to understand some basic quantum experiments.

For a strict subjectivist the situation is even worse, and the use of Quantum Mechanics is at risk of being considered illegal. A statement giving the current best estimate of h (Planck's constant) together with its standard deviation was presented in section 5.6.1. Since h appears on the right hand side of Heisenberg's uncertainty principle, an uncertainty about the value of h implies a second order uncertainty. The propagation of the uncertainty about the value of fundamental physical constants generates similar second order probabilistic statements about the detection, measurement or observation of quantum phenomena. For an orthodox subjectivist, however, such second order statements are meaningless:

“Does it make sense to ask what is the probability that the probability of a given event has a given value, p_i? ... It makes no sense to state that the probability of an event E is to be regarded as unknown in that its true value is one of the p_i's, but we do not know which one.”

“Speaking of unknown probabilities [or of probability of a probability] must be forbidden as meaningless.”

A similar statement of de Finetti was analyzed in section 4.7. Such an awkward position, at least for a modern physicist, was seen by the founding fathers of orthodox Bayesian statistics as an unavoidable consequence of the subjectivist doctrine, according to which,

“Probabilities are states of mind, not of nature.”
Savage (1981, p.674).

From a constructivist perspective, fundamental physical constants, including of course Planck's constant, correspond to very objective (very sharp, stable, separable and composable) eigenvalues of Physics' research program, and it is perfectly admissible to speak about the uncertainty of their estimated values. Of course, that is what physicists need to do, and have done for almost a century, regardless of being disapproved of by the Bayesian orthodoxy (theoretically coherent, but understandably very shy and timid). There have also been some attempts to reconcile a strict subjectivist position with modern physics, through long and sophisticated translations of simple “crude” statements like the ones quoted above. Some of these translations are as bizarre and/or intricately involved as similar attempts to translate epistemic probabilistic statements, which are categorically forbidden in frequentist statistics, into “acceptable” frequentist probabilistic statements, see section 2.5 and Rouanet et al. (1998, Preamble). Richard Feynman (2002, p.14) makes the following comments on some of the ideas behind such interpretations:

“Now, the philosophical question before us is, when we make an observation of our track in the past, does the result of our observation become real in the same sense that the final state would be defined if an outside observer were to make the observation? This is all very confusing, especially when we consider that even though we may consistently consider ourselves always to be the outside observer when we look at the rest of the world, the rest of the world is at the same time observing us, and that often we agree on what we see in each other. Does this mean that my observations become real only when I observe an observer observing something as it happens? This is a horrible viewpoint. Do you seriously entertain the thought that without observer there is no reality? Which observer? Any observer? Is a fly an observer? Is a star an observer? Was there no reality before 10⁹ B.C., before life began? Or are you the observer? Then there is no reality to the world after you are dead? I know a number of otherwise respectable physicists who have bought life insurance. By what philosophy will the universe without man be understood? In order to make some sense here, we must keep an open mind about the possibility that for sufficiently complex systems, amplitudes become probabilities...”
In order to provide deeper insight into the meaning of Heisenberg's uncertainty principle, let us link it to Noether's theorems, already discussed in section 2.8.1. The central point of Noether's theorems lies in the existence of an invariant physical quantity for each continuous symmetry group in a physical theory. Heisenberg's uncertainty relation, presented in section 6.1, sets a bound on the accuracy with which we can access, by means of physical measurements, such symmetry / invariant dual or conjugate pairs. This point is further analyzed by Bohr:

“...we admire Planck's happy intuition in coining the term ‘quantum of action’ which directly indicates a renunciation of the action principle, the central position of which in the classical description of nature he himself has emphasized on more than one occasion. This principle symbolizes, as it were, the peculiar reciprocal symmetry relation between the space-time description and the laws of conservation of energy and momentum, the great fruitfulness of which, already in classical physics, depends upon the fact that one may extensively apply them without following the course of the phenomena in space and time.” (p.94 or 210)

“Indeed, the inevitability of using, for atomic phenomena, a mode of description which is fundamentally statistical arises from a closer investigation of the information which we are able to obtain by direct measurement of these phenomena and the meaning we may ascribe, in this connection, to the application of the fundamental physical concepts...
Such considerations lead immediately to the reciprocal uncertainty relations set up by Heisenberg and applied by him as the basis of a thorough investigation of the logical consistency of quantum mechanics.” (p.113-114 or 247-248)
In the article Space-Time Continuity and Atomic Physics, Bohr (1935, p.370) further explores the relation between quantization and our use of probabilistic language:

“With the foregoing analysis we have described the new point of view brought forward by the quantum theory. Sometimes one has described it as leaving aside the idea of causality. I think we should rather say that in the quantum theory we try to express some laws of nature that lie so deep that they can not be visualized, or, which cannot be accounted for by the usual description in terms of motion. This state of affairs brings about the fact that we must use to a great extent statistical methods and speak of nature making choices between possibilities.”
The correct interpretation of probability has been one of the key conceptual problems of modern physics. The importance of this problem can be further appreciated in the following statement by Paul Dirac, found in Pais (1986, p.255), regarding the early development of quantum mechanics:

“This problem of getting the interpretation proved to be rather more difficult than just working out the equations.”
The “correct” interpretation or “best” metaphysics for quantum mechanics, including the ontological and epistemological status of probability and the understanding of its role in the theory, is an area of strong academic interest and current research; see for example Albert (1993, ch.7) for an exposition of David Bohm's interpretation of QM. Richard Feynman's path integral formalism, see for example Feynman and Hibbs (1965), Honerkamp (1993) and Wiegel (1986), makes it possible to support other alternative interpretations.

Perhaps the most important lesson to be learned from this section is that one must be aware of the several possible meanings and interpretations of the concept of probability, and that distinct situations may require or benefit from distinct approaches. In the best spirit of complementarity, we should even consider the possibility of studying the same situation under different perspectives, each one of them providing a positive and irreplaceable contribution to our understanding of a whole that is beyond the grasp of a single picture.

The following quote was brought to my attention by Jean-Yves Beziau: “The ordinary man has always been sane because the ordinary man has always been a mystic... He has always cared for truth more than for consistency. If he saw two truths that seemed to contradict each other, he would take the two truths and the contradiction along with them.” Gilbert Keith Chesterton (1874-1936).

5.7 Theories of Evolution

The objective of this section is to highlight the importance of three key concepts that are essential to modern theories explaining the evolution of complex systems, and to follow
some points in their development and interconnection, namely: (1) the systemic view; (2) modularity; and (3) stochastic evolution and/or probabilistic causation. Probabilistic causation is by far the most troublesome of these concepts. On one hand, it is absolutely essential, at least in the framework presented in this chapter, to the evolution of complex systems; on the other hand, it was not easy for stochastic evolution to make its way as a “legitimate” concept in modern science. We believe that the historical progress and acceptance of the ontological status of these probabilistic concepts is closely related to the evolution of epistemological frameworks that can, in turn, strongly influence and be influenced by the corresponding statistical theories giving them operational support.

The systemic view has always been part of biological thinking. The teleomechanics school gave particular importance to a systemic view of living organisms, see Lenoir (1989) for an excellent historical account. As quoted in Lenoir (1989, p.220-221), for example, the XVIII century biologist C. Reichert states:

“...‘we have a systemic product before us, ...in which the intimate interconnections of the constituent parts have reached their highest degree. When we think about a system, we normally picture ourselves precisely this form of systematic product. Concerning such systems Kant said that the parts only exist with reference to the whole and the whole, on the other hand, only appears to exist for the sake of the parts.’

In order to investigate the systematic character of biological organisms Reichert reminded the readers that it was necessary to have a method appropriate to the subject... Reichert could envision only one method for the investigation of the living organism which avoids disrupting the intimate interconnections of its parts:

‘The systematist is aware both that he proceeds genetically and that he must proceed genetically. He is aware that the structure of an organism consists in the systematic division or dissection of the germ, which receives a particular systematic unity through inheritance, makes it explicit through development and transmits it further through procreation.’ ”
These statements express one of the core methodological doctrines of the teleomechanics school, namely, that to understand the systemic character of the organism, one must examine its development. The systemic approach of the teleomechanics school greatly contributed to the study of many fields in “Biology” (a word coined within this school), facilitating complex analyses and multiscale interconnections. C. F. Kielmeyer, another great representative of the teleomechanics school, for example, linked individual and populational development, anticipating the principle later made famous by the dictum:

“Ontogeny recapitulates phylogeny.”
The teleomechanics research program, however, could never overcome (perceived) incompatibility conflicts among some of its basic principles, such as, for example, the conflict between the teleological organization of organic systems, on one hand, and the need to use only scientifically accepted forms of causal explanation, on the other. Consequently, the scientists in this program found themselves struggling between deterministic reductionist mechanisms and vitalistic explanations, both unable to offer significant scientific knowledge or acceptable understanding for the phenomena under study.

According to the framework for evolution presented in this chapter, the diagnosis of this failure is quite obvious, namely, the lack of key conceptual probabilistic ingredients. This situation is analyzed in Lenoir (1989, p.239-241):

“Only in a universe operating according to probabilistic laws, a universe grounded in non-deterministic causal processes, is it possible to harmonize the evolution of sequences of more highly organized beings with the principles of mechanics.

Two paths lay open for providing a consistent and rigorous solution to this dilemma. One alternative is that of twentieth century science. It is simply to abandon the classical notion of cause in favor of a non-deterministic conception of causality. In the late nineteenth century this was not an acceptable strategy. To be sure, statistical methods were being introduced into physics with great success, but prior to the quantum revolution in mechanics no one was prepared to assert the probabilistic nature of physical causes.

...A second solution to this dilemma is that proposed by the teleomechanists. According to this interpretation rigidly determined causality can be retained, but then limits must be placed on the analysis of the ultimate origins of biological organization, and certain ground states of purposive or zweckmässig organization must be introduced.

In the final analysis the only resolution of their impasse was the construction of an entirely new set of conceptual foundations for both the biological and the physical sciences which could cut the Gordian knot of chance and necessity.”
The breakthrough of introducing stochastic dynamics into modern theories of evolution is perhaps the greatest merit of Charles Darwin. According to Peirce (1893, p.183-184):

“(In) The Origin of Species, published toward the end of 1859... the idea that chance begets order, which is one of the cornerstones of modern physics... was at that time put into its clearest light.”
The role of probability in Darwin's theories can be best appreciated in his own words:

“Throughout this chapter and elsewhere I have spoken of selection as the paramount power, yet its action absolutely depends on what we in our ignorance call spontaneous or accidental variability. Let an architect be compelled to build an edifice with uncut stones, fallen from a precipice. The shape of each fragment may be called accidental; yet the shape of each has been determined by the force of gravity, the nature of the rock, and the slope of the precipice, - events and circumstances, all of which depend on natural laws; but there is no relation between these laws and the purpose for which each fragment is used by the builder. In the same manner the variations of each creature are determined by fixed and immutable laws; but these bear no relation to the living structure which is slowly built up through the power of selection, whether this be natural or artificial selection.

If our architect succeeded in rearing a noble edifice, using the rough wedge-shaped fragments for the arches, the longer stones for the lintels, and so forth, we should admire his skill even in a higher degree than if he had used stones shaped for the purpose. So it is with selection, whether applied by man or by nature; for although variability is indispensably necessary, yet, when we look at some highly complex and excellently adapted organism, variability sinks to a quite subordinate position in importance in comparison with selection, in the same manner as the shape of each fragment used by our supposed architect is unimportant in comparison with his skill.”
Darwin (1887, ch.XXI, p.236)

In the above passage, the importance given to the systemic view, that is, to the living structure of the organism, is evident. At the same time, randomness is added as an essential provider of raw materials in the evolutionary process. However, there are some important points of divergence between the way randomness plays a role in Darwinian evolution and in contemporary theories. We highlight three of them: (1) Darwin uses only pseudo-randomness; (2) Genetic and somatic components of variation are not clearly distinguished; (3) Darwinian variations are continuous. Let us examine these three points more carefully:

1- Darwin used pseudo-randomness, not essential uncertainty. S. J. Gould (p.684) assesses this point as follows:

“The Victorian age, basking in triumph of an industrial and military might rooted in technology and mechanical engineering, granted little conceptual space to random events... Darwin got into enough trouble by invoking randomness for sources of raw material; he wasn't about to propose stochastic causes for change as well!”

2- Genetic and somatic components of variation:

“Darwin's ideas on variation, heredity, and development differ significantly from twentieth-century views. First, Darwin held that environmental changes, acting on the reproductive organs or the body, were necessary to generate variation. Second, heredity was a developmental, not a transmissional process...”
At the time of Darwin, the available technology could not, of course, reveal the biochemical mechanisms of heredity. Nevertheless, scientists like Hugo de Vries and Erwin Schrödinger had powerful insights into these mechanisms, even before the necessary technology became available. de Vries (1900), for example, advanced the following hypotheses:

“1. Protoplasm is made up of numerous small units, which are bearers of the hereditary characters. 2. These units are to be regarded as identical with molecules.”

In his book What is Life?, Schrödinger (1945) advanced more detailed hypotheses about the genetic coding mechanisms, based on far-reaching theoretical insights provided by quantum mechanics. This small book was a declared source of inspiration for both James Watson and Francis Crick, who, in 1953, discovered the double-helix molecular structure of DNA, opening the possibility of deciphering the genetic code and its expression mechanisms.

3- Continuous variations. From several passages of Darwin’s works, it is clear that he saw actual variations as coming from a continuum of potential possibilities:

“[as] I have attempted to show in my work on variation... they [are] extremely slight and gradual.”
Darwin (1959, p.86).

“On the slow and successive appearance of new species: ...organic beings accord best with the common view of the immutability of species, or with that of their slow and gradual modification, through variation and natural selection.”
Darwin (1959, p.167).

“It is indeed manifest that multitudes of species are related in the closest manner to other species that still exist, or have lately existed; and it will hardly be maintained that such species have been developed in an abrupt or sudden manner. Nor should it be forgotten, when we look to the special parts of allied species, instead of to distinct species, that numerous and wonderfully fine gradations can be traced, connecting together widely different structures.”
Darwin (1959, p.117).

The first modern reference for discrete or modular genetic variations can be found in the work of Gregor Mendel (1865); see the next paragraph. It was unfortunate that the ideas of Mendel, working at a secluded monastery in Brünn (Brno), were not immediately appreciated. For a contemporary view of evolution and modularity, see Margulis (1999) and Margulis and Sagan (2003).

“The Forms of the Hybrids: With some characters... one of the two parental characters is so preponderant that it is difficult, or quite impossible, to detect the other in the hybrid. This is precisely the case with the Pea hybrids. In the case of each of the 7 crosses the hybrid-character resembles that of one of the parental forms so closely that the other either escapes observation completely or cannot be detected with certainty. This circumstance is of great importance in the determination and classification of the forms under which the offspring of the hybrids appear. Henceforth in this paper those characters which are transmitted entire, or almost unchanged in the hybridization, and therefore in themselves constitute the characters of the hybrid, are termed the dominant, and those which become latent in the process recessive. The expression “recessive” has been chosen because the characters thereby designated withdraw or entirely disappear in the hybrids, but nevertheless reappear unchanged in their progeny, as will be demonstrated later on.”
The third point of divergence, the discreteness of variations, is, of course, closely linked with the second, the nature of genetic coding. However, its implications are much deeper, as examined in the next section.
The ideas of Herbert Simon about modularity, examined in section 3.2, seem to receive empirical support everywhere we look in the biological world. Ksenzhek and Volkov (1998, p.80), also quoted in Souza and Manzatto (2000), for example, give the following example from Botany:

“A plant is a complicated, multilevel, hierarchical system, which provides a very high degree of integration, beginning from the elementary process of catching light quanta and ultimately resulting in the functioning of a macroscopic
plant as an entire organism. The hierarchical structure of plants may be examined in a variety of aspects. (The following) table shows seven hierarchical levels of mass and energy.”

Table 3. Plant energetics: hierarchical and modular structure (Ksenzhek and Volkov, 1998). Columns: Level (1 to 7), Size (m), Structure, Transfer mechanism, Integration; the levels range from the photosynthetic unit up to the whole tree (size 10 m, hydraulic transfer mechanism, integration 1E4).
As an example of how to interpret this table, we give further details concerning its first line: in a thylakoid membrane, about 300 chlorophyll molecules act like an antenna in a reaction center or photosynthetic unit, capable of absorbing light quanta at a rate of about 1K cycles/second. This energy conversion cycle absorbs photons of about 1.77 eV (430 THz or 700 nm, since E = hc/λ ≈ 1240 eV·nm / 700 nm), synthesizing compounds, carbohydrates and oxygen, at an energy level about 1 eV higher than its input compounds, carbon dioxide and water.

Ksenzhek and Volkov (1998, p.80), see the next quotation, also make an important remark concerning the need for a specific and non-reductionist interpretation of each line in the above table, or structural level in the organism. For related aspects in Biology, see Buss (2007). Niels Bohr (1987b, Light and Life, p.3-12; Biology and Atomic Physics, p.13-22) presents a similar argument based on the general concept of complementarity.

“It should be noted that any hierarchical level that is above another level cannot be considered as the simple sum of the elements belonging to that lower level. In all cases, each step from a given level of the hierarchical staircase to the next one is followed by the development of new features not inherent in the elements of the lower level.”

The table stops at somewhat arbitrary levels and could be extended further up or down. Higher levels in the table would enter the domains of Ecology. Lower levels would penetrate the domains of Chemistry, and then Physics. At this point, we make an astonishing observation: Classical Physics cannot accommodate stable atomic models. Classical Physics gives no support for discreteness or modularity of any kind. Hence, our modular view of the world would be, within classical Physics, a giant with feet of clay! Werner Heisenberg (1958, p.5-6) describes the situation as follows:

“In 1911 Rutherford’s observations... resulted in his famous atomic model.
The atom is pictured as consisting of a nucleus, which is positively charged and contains nearly the total mass of the atom, and electrons, which circle around the nucleus like planets circle around the sun. The chemical bond between atoms of different elements is explained as an interaction between the outer electrons of the neighboring atoms; it has not directly to do with the atomic nucleus. The nucleus determines the chemical behavior of the atom through its charge, which in turn fixes the number of electrons in the neutral atom. Initially this model of the atom could not explain the most characteristic feature of the atom, its enormous stability. No planetary system following the laws of Newton’s mechanics would ever go back to its original configuration after a collision with another such system. But an atom of the element carbon, for instance, will still remain a carbon atom after any collision or interaction in chemical binding.

The explanation of this unusual stability was given by Bohr in 1913, through the application of Planck’s quantum hypothesis. An atom can change its energy only by discrete energy quanta; this must mean that the atom can exist only in discrete stationary states, the lowest of which is the normal state of the atom. Therefore, after any kind of interaction, the atom will finally always fall back into its normal state.”

Figure 12: Orbital eigensolutions for the Hydrogen atom.
Figure 13: Orbital transitions for Hydrogen spectral lines. Series: Lyman, n = 1; Balmer, n = 2; Paschen, n = 3; with m = n+1, ..., ∞.

Bohr’s model is based on the quantization of the angular momentum of the electron in the planetary atomic model. The wave-particle duality metaphor can give us a simple visualization of Bohr’s model. As already mentioned in section 4.2, a string of length L can sustain standing waves only of wavelengths λ such that L = nλ, n = 1, 2, 3, ... The first one (n = 1, longer wavelength, lower frequency) is called the fundamental frequency of the string, and the others (n = 2, 3, ..., shorter wavelengths, higher frequencies) are called its harmonics.

Putting together de Broglie’s duality principle and the planetary atomic model, we can think of the electron’s orbit as a circular string of length L = 2πr. Plugging in de Broglie’s equation, λ = h/(m_e v), and imposing the condition of having stable eigenfunctions or standing waves, see Enge (1972) and Figure 12, we have

\[ 2\pi r = n\lambda = \frac{nh}{m_e v}, \quad \mbox{that is,} \quad m_e v r = n \frac{h}{2\pi} = n\hbar . \]

Planck’s constant equals 6.63E−34 joule-seconds or 4.14E−15 electron-volt-seconds, and the electron mass is 9.11E−28 grams. Since m_e v r is the angular momentum of the orbiting electron, de Broglie’s wave-particle duality principle imposes its quantization.

Bohr’s atomic model was also able, for the first time, to provide an explanation for another intriguing phenomenon, namely:
(a) Atoms only emit light at sharply defined frequencies, known as spectral lines;
(b) The frequencies, ν, or wavelengths, λ, of these spectral lines are related by integer algebraic expressions, like the Balmer-Rydberg-Ritz-Paschen (BRRP) empirical formula,

\[ \frac{\nu_{n,m}}{c} = \frac{1}{\lambda_{n,m}} = R \left( \frac{1}{n^2} - \frac{1}{m^2} \right), \]

where R ≈ 1.1E7 m⁻¹ is Rydberg’s constant.

Distinct combinations of integer numbers, 0 < n < m, in the BRRP formula give distinct wavelengths of the spectrum, see Enge (1972). It so happens that these frequencies are in precise correspondence with the differences of energy levels of orbital eigen-solutions, see Figure 13. These are the Hydrogen spectral series of Lyman, n = 1, Balmer, n = 2, Paschen, n = 3, and Brackett, n = 4, for m = n+1, ..., ∞. Similar spectral series have been known for other elements, and used by chemists and astronomers to identify the composition of matter from the light it radiates. Rydberg’s constant can be written as R = m_e e⁴ / (8 ε₀² h³ c), where m_e is the rest mass of the electron, e is the elementary charge, ε₀ is the permittivity of free space, h is Planck’s constant, and c is the speed of light in vacuum.

The importance attributed by Bohr to the emergence of these sharp (discrete) eigen-solutions out of a higher dimensional continuum of possibilities is emphasized in Bohr (2007):

“Your physical aha-experience? Wavelengths of a complete series of spectral lines in the hydrogen spectrum can be expressed with the aid of integers. This information, he [Bohr] said, left an indelible impression on him.”
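To illustrate the BRRP formula with a worked example, consider the first Balmer line (n = 2, m = 3), assuming the standard value R ≈ 1.097E7 m⁻¹:

\[ \frac{1}{\lambda_{2,3}} = R \left( \frac{1}{2^2} - \frac{1}{3^2} \right) = 1.097 \times 10^7 \cdot \frac{5}{36} \approx 1.524 \times 10^6 \ \mathrm{m}^{-1}, \quad \mbox{so} \quad \lambda_{2,3} \approx 656 \ \mathrm{nm}, \]

which is the familiar red Hα line, prominent, for example, in the spectra of emission nebulae.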
Approximation methods of perturbation theory can be used to compute probabilities of spontaneous and induced transitions between the different orbitals or energy states of an atom, and these transition rates can be observed as intensities of the respective spectral lines, see Enge (1972, ch.8), Landshoff (1998, ch.7) and McGervey (1995, ch.14). Comparative analyses between the value and accuracy of these theoretical calculations and empirical observations are of obvious interest. However, the natural interpretation of these analyses immediately generates statements about the uncertainty of transition rates, expressed as probabilities of probabilities. Hence, as explained in section 5.6.3, these statements collide with the canons of the subjectivist epistemological framework and are therefore unacceptable in orthodox Bayesian statistics.
An objective form of probability is at the core of quantum mechanics theory, as seen in previous sections. However, probabilistic explanations or probabilistic causation have been, at least from a historical perspective, very controversial concepts. This has been so since the earliest times. Aristotle (Physics, II,4,195b-196a) discusses events resulting from coincidences or incidental circumstances. If such an event serves a conscious human purpose, it is called τύχη, translated as luck or fortune. If it serves the “unconscious purposiveness of nature”, it is called αὐτόματον, translated as chance or accident.

“We must inquire therefore in what manner luck and chance are present among the causes enumerated, and whether they are the same or different, and generally what luck and chance are.

Some people even question whether they are real or not. They say that nothing happens by chance, but that everything which we ascribe to luck or chance has some definite cause.

Others there are who, indeed, believe that chance is a cause, but that it is inscrutable to human intelligence, as being a divine thing and full of mystery.”

Aristotle (Physics, II,4,195b-196a) also reports some older philosophical traditions that made positive use of probabilistic causation, such as a stochastic development or evolution theory due to Empedocles:

“Wherever then all the parts came about just what they would have been if they had come to be for an end, such things survived, being organized spontaneously in a fitting way; whereas those which grew otherwise perished and continue to perish, as Empedocles says...”
Many other ancient cultures accepted probabilistic arguments and/or made use of randomized procedures, see Davis (1969), Kaptchuk and Kerr (2004) and Rabinovitch (1973). Even the biblical narrative, so averse to magic of any sort, presents the idea that destiny is ultimately inscrutable to human understanding, see for example Exodus (XXXIII, 18-23). Moses, who is always willing to speak his mind, asks God for perfect knowledge:

And Moses said: I pray You, show me Your glory!

In response, God, Who is always ready to explain to Moses Who makes the rules, tells him that perfect knowledge cannot be achieved by a living creature. This verse may also allegorically indicate that temporal irreversibility is a necessary consequence of such a veil of uncertainty:

And the Lord said: You cannot see My face, for no man can see Me and live!... I will enclose and confine you, and protect you in My manner... (so that) You shall see My back, but My face shall not be seen.
Nevertheless, the concepts of stochastic evolution and probabilistic causation lost prestige along the centuries. From the comments of Gould and Lenoir in section 7.1, we may conclude that in the XVIII and early XIX centuries their status reached the lowest level ever. It is ironic, then, that stochastic evolution is the concept at the eye of the storm of some of the most important scientific revolutions of the late XIX and XX centuries.

As seen in section 6, Quantum Mechanics entails Heisenberg’s uncertainty principle, stating that we cannot measure (in practice or in theory) the classical variables describing the motion of a particle with a precision beyond a hard threshold given by Planck’s constant. Hence, the available information about a physical system is, in quantum mechanics, governed by laws that are essentially probabilistic in nature, or, as stated in Ruhla (1992, p.162),

“No longer is it chance as a matter of ignorance or of incompetence: it is chance quintessential and unavoidable.”
The path leading to an essentially stochastic world-view (or, in more detail, a Weltanschauung including random systemic interactions) was first foreseen by people far ahead of their time, like C.S. Peirce and L. Boltzmann. This path was then advanced by reluctant revolutionaries like M. Planck, A. Einstein, and E. Schrödinger, who played a major part in forging the new concept of probability, but who were, at the same time, still emotionally attached to classical concepts. Finally, a third generation, including N. Bohr, W. Heisenberg and M. Born, fully embraced the new concept of objective probability. Of course, as with all truly innovative concepts, it will take mankind at least a few generations to truly assimilate and incorporate the new idea.
The “objectification of probability” and the consequent rise of the ontological status of stochastic evolution and/or probabilistic causation was arguably one of the two greatest innovations of modern physics. The other great innovation is the “geometrization of space-time” in Einstein’s theories of special and general relativity, see French (1968) and Martin (1988) for intuitive introductions, Sachs and Wu (1977) for a rigorous treatment, and Misner et al. (1973) for an encyclopedic treatise.

The manifestation of physical quantization and (special) relativistic geometry is regulated by Planck’s constant and the speed of light. The values of these constants in standard (international) metric units, h = 6.63E−34 J s and c = 3.0E+8 m/s, have, respectively, a tiny and a huge order of magnitude, making it easy to understand why most of the effects of modern physics are not immediately perceptible in our ordinary life experience and, therefore, why classical physics can offer acceptable approximations in many circumstances of common statistical practice. However, modern physics has forever changed some of our most basic concepts related to space, time, causality and probability. Moreover, we have seen in this chapter how some of these concepts, like modularity and probabilistic causation, are essential to our theories and to understanding phenomena in many other fields. We have also seen how quantization and stochastic evolution have a direct or indirect bearing on areas much closer to our daily life, like Biology and Engineering. Hence, it is of vital importance to incorporate these new concepts into a contemporary epistemology or, at least, to use an epistemological framework that is not incompatible with these new ideas.

Chapter 6

The Living and Intelligent Universe

“Cybernetics is the science of defensible metaphors.”
Gordon Pask (1928-1996).

“You, with all these words....”
Marisa Bassi Stern (my wife, when I speak too much).

“Yes I think to myself: What a wonderful world!”
B. Thiele and G.D. Weiss, in the voice of L. Armstrong.

In the article Mirror Neurons, Mirror Houses, and the Algebraic Structure of the Self, by Ben Goertzel, Onar Aam, F. Tony Smith and Kent Palmer (2008), and the companion article of Goertzel (2007), the authors provide an intuitive explanation for the logic of mirror houses, that is, they study symmetry conditions for specular systems entailing the generation of kaleidoscopic images. In these articles, the authors share (in my opinion) several important insights on autopoietic systems and constructivist philosophy. A more prosaic kind of mirror house used to be a popular attraction in funfairs and amusement parks. The entertainment then came from misperceptions about oneself or other objects. More precisely, from the misleading ways in which a subject sees how or where the objects inside the mirror house are, or how or where he himself stands in relation to other objects.

The main objective of this chapter is to show how similar misperceptions in science can lead to ill-posed problems, paradoxical situations and even misconceived philosophical dilemmas.
The epistemological framework of this discussion will be that of cognitive constructivism, as presented in previous chapters. In this framework, objects within a scientific theory are tokens for eigen-solutions, which Heinz von Foerster characterized by four essential attributes, namely those of being discrete (precise, sharp or exact), stable, separable and composable. The Full Bayesian Significance Test (FBST) is a possibilistic belief calculus based on a (posterior) probabilistic measure, originally conceived as a statistical significance test to assess the objectivity of such eigen-solutions, that is, to measure how well a given object manifests or conforms to von Foerster’s four essential attributes.

The FBST belief or credal value of hypothesis H given the observed data X is the e-value, ev(H|X), interpreted as the epistemic value of hypothesis H (given X), or the evidence value of data X (supporting H). A formal definition of the FBST and several of its implementations for specific problems can be found in the author’s previous articles, and are summarized in appendix A. From now on, we will refer to Cognitive Constructivism accompanied by Bayesian statistical theory and its tool boxes, as laid down in the aforementioned articles, as the Cog-Con epistemological framework.

Instead of reviewing the formal definitions of the essential attributes of eigen-solutions, we analyze the Origami example, a didactic case presented by Richard Dawkins. This is done in section 1. The origami example is so simple that it may look trivial and, in some sense, it is so. In subsequent sections we analyze in which ways the eigen-solutions found in the practice of science can be characterized as non-trivial, and also highlight some (in my view) common misconceptions about the nature of these non-trivial objects, just like distinct forms of illusion in a mirror house.

In section 2 we contrast the control, precision and stability of morphogenic folding processes in autopoietic and allopoietic systems. In section 3 we concentrate on object orientation and code reuse, inter-modular adaptation and resonance, and also analyze the yoyo diagnostic problem. In section 4 we explore auto-catalytic and hypercyclic networks, as well as some related bootstrapping paradoxes. This section is heavily influenced by the work of Manfred Eigen. Section 5 focuses on explanations of specific components, single links or partial chains in long cyclic networks, including the meaning of some forms of directional (such as upward or downward) causation. In section 6 we study the emergence of asymptotic eigen-solutions such as thermodynamic variables or market prices, and in section 7 we analyze the ontological status of such entities. In section 8 we study the limitations in the role and scope of conceptual distinctions used in science, and the importance of probabilistic causation as a mechanism to overcome, in a constructive way, some of the resulting dilemmas. In short, sections 2 to 8 discuss autopoiesis, modularity, hypercycles, emergence, and probability as sources of complexity and forms of non-trivial organization. Our final remarks are presented in section 9.

In this chapter we have made a conscious effort to use examples that can be easily understood.

6.1 The Origami Example

The Origami example, from the following text in Blackmore (1999, p.x-xii, emphasis ours), was given by Richard Dawkins to present the notion of reliable replication mechanisms in the context of evolutionary systems.
Dawkins’ example contrasts two versions of the Chinese Whispers game, using distinct copy mechanisms.

“Suppose we assemble a line of children. A picture, say, a Chinese junk, is shown to the first child, who is asked to draw it. The drawing, but not the original picture, is then shown to the second child, who is asked to make her own drawing of it. The second child’s drawing is shown to the third child, who draws it again, and so the series proceeds until the twentieth child, whose drawing is revealed to everyone and compared with the first. Without even doing the experiment, we know what the result will be. The twentieth drawing will be so unlike the first as to be unrecognizable. Presumably, if we lay the drawings out in order, we shall note some resemblance between each one and its immediate predecessor and successor, but the mutation rate will be so high as to destroy all semblance after a few generations. A trend will be visible as we walk from one end of the series of drawings to the other, and the direction of the trend will be degeneration...

High fidelity is not necessarily synonymous with digital. Suppose we set up our Chinese Whispers Chinese Junk game again, but this time with a crucial difference. Instead of asking the first child to copy a drawing of the junk, we teach her, by demonstration, to make an origami model of a junk. When she has mastered the skill, and made her own junk, the first child is asked to turn around to the second child and teach him how to make one. So the skill passes down the line to the twentieth child. What will be the result of this experiment? What will the twentieth child produce, and what shall we observe if we lay the twenty efforts out in order along the ground? ...

In several of the experiments, a child somewhere along the line will forget some crucial step in the skill taught him by the previous child, and the line of phenotypes will suffer an abrupt macromutation which will presumably then be copied to the end of the line, or until another discrete mistake is made. The end result of such mutated lines will not bear any resemblance to a Chinese junk at all. But in a good number of experiments the skill will correctly pass all along the line, and the twentieth junk will be no worse and no better, on average, than the first junk. If we then lay the twenty junks out in order, some will be more perfect than others, but imperfections will not be copied on down the line...

Here are the first five instructions... for making a Chinese junk:
1. Take a square sheet of paper and fold all four corners exactly into the middle.
2. Take the reduced square so formed, and fold one side into the middle.
3. Fold the opposite side into the middle, symmetrically.
4. In the same way, take the rectangle so formed, and fold its two ends into the middle.
5. Take the small square so formed, and fold it backwards, exactly along the straight line where your last two folds met...

These instructions, though I would not wish to call them digital, are potentially of very high fidelity, just as if they were digital. This is because they all make reference to idealized tasks like ‘fold the four corners exactly into the middle’... The instructions are self-normalizing. The code is error-correcting...”

Dawkins recognizes that instructions for constructing an origami have remarkable properties, providing for the long term survival of the subjacent meme, i.e. a specific model or single idea, expressed as an origami. Nevertheless, Dawkins is not sure how he “wishes to call” these properties (digital? high fidelity?). What adjectives should we use to appropriately describe the desirable characteristics that Dawkins perceives in these instructions? I claim that von Foerster’s four essential attributes of eigen-solutions offer an accurate description of the properties relevant to the process under study.

The instructions and the corresponding (instructed) operations are precise, stable, separable and composable. A simple interpretation of the meaning of these four attributes in the origami example is the following (a numerical sketch of the fidelity contrast follows the list):
Precision: An instruction like “fold a paper joining two opposite corners of the square” implies that the folding must be done along a diagonal of the square. A diagonal is a specific line, a 1-dimensional object in the 2-dimensional sheet of paper. In this sense the instruction is precise or exact.
Stability: By interactively adjusting and correcting the position of the paper (before making a crease) it is easy to come very close to what the instruction specifies. Even if the resulting fold is not absolutely perfect (in practice it actually never is), it will probably still work as intended.
Composability and Separability: We can compose or superpose multiple creases in the same sheet of paper. Moreover, adding a new crease will not change or destroy the existing ones. Hence, we can fold them one at a time, that is, separately.

These four essential attributes are of fundamental importance in order to understand scientific activity in the Cog-Con framework. Moreover, Dawkins’ origami example illustrates these attributes with striking clarity and simplicity. In the following sections we will examine other examples, which are less simple, not so clear or non-trivial in a distinct and characteristic way. We will also draw attention to some confusions and mistakes often made when analyzing systems with similar characteristics.
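As announced above, here is a minimal simulation of Dawkins’ contrast between drawing-copying and instruction-copying. The parameters and the scalar “likeness” scale are invented for this sketch; they are not taken from Dawkins or Blackmore:

```python
import random

random.seed(1)

def analog_chain(generations=20, noise=0.05):
    """Each child copies the previous drawing, adding a small continuous
    error; deviations from the original accumulate as a random walk."""
    likeness = 0.0  # 0.0 = perfect likeness of the original picture
    for _ in range(generations):
        likeness += random.gauss(0.0, noise)
    return abs(likeness)

def discrete_chain(generations=20, p_macromutation=0.05):
    """Each child learns the folding instructions; transmission is exact
    (self-normalizing) except for a rare, all-or-nothing macromutation."""
    for _ in range(generations):
        if random.random() < p_macromutation:
            return 0.0  # a forgotten step propagates to the end of the line
    return 1.0

analog = [analog_chain() for _ in range(10000)]
print("analog: mean drift from the original = %.3f" % (sum(analog) / len(analog)))

discrete = [discrete_chain() for _ in range(10000)]
print("discrete: fraction of perfect junks = %.3f" % (sum(discrete) / len(discrete)))
# Analog copies are all somewhat degraded, and the degradation grows with the
# length of the chain; discrete copies are bimodal: either a perfect junk or
# no junk at all, exactly as in Dawkins' account.
```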
6.2 Autopoietic Control, Precision, Stability

The origami folding is performed and controlled by an external agent, the person folding the paper. In contrast, organic development processes are self-organized. These processes are not driven by an external agent, do not require external supervision, and usually are not even amenable to external corrections. While artifacts and machines manufactured like an origami are called allopoietic, from άλλο + ποίησις, external production, living organisms are called autopoietic, from αυτό + ποίησις, self production.

Autopoiesis is a non-trivial process in many interesting ways. For example, the inexistence of external supervision or correction mechanisms requires an autopoietic process to be stable. Moreover, typical biological processes occur in environments with high levels of noise and have large (extra) variability. Hence the process must be intrinsically self-correcting and redundant, so that its noisy implementation does not compromise the viability of the final product.

In this section we make some considerations about morphogenic biological processes; namely, we study examples of tissue folding in early embryonic development. This process naturally invites not only strong analogies, but also sharp contrasts with the origami example. At a macroscopic (supra-cellular) level, the organism’s organs and structures are built by tissue movements, as described in Forgacs and Newman (2005, p.109) and Saltzman (2004, p.38). The main types of tissue movements in morphogenic processes are:
- Epiboly: spreading of a sheet of cells over deeper layers.
- Emboly: inward movement of cells, which is of various types, such as:
- Invagination: infolding or insinking of a layer.
- Involution: inturning, inside rotation or inward movement of a tissue.
- Delamination: splitting of a tissue into 2 or more parallel layers.
- Convergent/Divergent Extension: stretching together/apart of two distinct tissues.

The blastula is an early stage in the embryonic development of most animals. It is produced by cleavage of a fertilized ovum and consists of a hollow sphere of around 128 cells surrounding a central cavity. From this point on, morphogenesis unfolds by successive tissue movements. The very first of such moves is known as gastrulation, a deep invagination producing a tube, the archenteron or primitive digestive tract. This tube may extend all the way to the pole opposing the invagination point, producing a second opening. The opening(s) of the archenteron become mouth and anus of the developing embryo.

Gastrulation produces three distinct (germ) layers, that will further differentiate into several body tissues. Ectoderm, the exterior layer, will further differentiate into skin and nervous systems. Endoderm, the innermost layer at the archenteron, generates the digestive system. Mesoderm, between the ectoderm and endoderm, differentiates into muscles, connective tissues, skeleton, kidneys, circulatory and reproductive organs. We will use this example to highlight some important topics, some of which will be explored more thoroughly in further sections.
Discrete vs. Exact or Precise Symmetries
Notice that origami instructions, which implicitly rely on the symmetries characterizing the shape of the paper, require foldings at sharp edges or creases. Hence, a profile of the folded paper sheet may look like it breaks (is non-differentiable) at a discrete or singular point. Organic tissue foldings have no sharp edges. Nevertheless, the (idealized) symmetries of the folded tissues, like the spherical symmetry of the blastula, or the cylindrical symmetry of the gastrula, can be described by equations just as exact or precise, see Beloussov (2008), Nagpal (2002), Odel et al. (1980), Tarasov (1986), and Weliky and Oster (1990). This is why we usually prefer the adjectives precise or exact to the adjective discrete used by von Foerster in his original definition of the four essential properties of an eigen-solution.

Centralized vs. Decentralized Control
In morphogenesis, there is no agent acting like a central controller, dispatching messages ordering every cell what to do. Quite the opposite: the complex forms and tissue movements emerge from local interactions, without any centralized supervision.

6.3 Object Orientation and Code Reuse

At the microscopic level, cells at the several organic tissues studied in the last section are differentiated by distinct metabolic reaction patterns. However, the genetic code of any individual cell in an organism is identical (as always in biology, there are exceptions, but they are not relevant to this analysis), and cellular differentiation at distinct tissues is the result of differentiated (genetic) expressions of this sophisticated program.

As studied in Chapter 5, complex systems usually have a modular hierarchical structure or, in computer science jargon, an object oriented design. In allopoietic systems object orientation is achieved by explicit design, that is, it has to be introduced by a knowledgeable and disciplined programmer, see Budd (1999). In autopoietic systems modularity is an implicit and emergent property, as analyzed in Angeline (1996), Banzaff (1998), Iba (1992), Lauretto et al. (2009) and Chapter 5.

Object oriented design entails the reuse, over and over, of the same modules (genes, functions or sub-routines) as control mechanisms for different processes. The ability to easily implement this kind of feature was actively pursued in computer science and software engineering. Object orientation was also discovered, with some surprise, to be naturally occurring in developmental biology, see Carrol (2005).

However, like any abused feature, code reuse can also become a burden in some circumstances. The difficulty of locating the source of a functionality (or a bug) in an intricate inheritance hierarchy, represented by a complex dependency graph, is known in computer science as the yoyo problem. According to the glossary in Budd (1999, p.408), “Yoyo problem: Repeated movements up and down the class hierarchy that may be required when the execution of a particular method invocation is traced.”

Systems undergoing many changes or modifications, under repeated adaptation or expansion, or in rapid evolution are specially vulnerable to yoyo effects. Unfortunately, the design of the human brain and its mental abilities are under all of the above conditions. In the next subsection we study some examples in this area, related to biological neural networks and language. These examples also include some mental dissociative phenomena that can be considered as manifestations of the yoyo problem.
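A toy illustration of the yoyo problem, in Python (the class names and methods are invented for this sketch; they are not from Budd (1999)):

```python
class Organism:
    def develop(self):
        # Template method: the top of the hierarchy drives the process...
        return [self.fold(), self.differentiate()]

    def fold(self):
        return "generic folding"

class Vertebrate(Organism):
    def differentiate(self):
        # ...but this step is only defined here, one level down...
        return "vertebrate tissues via " + self.fold()

class Frog(Vertebrate):
    def fold(self):
        # ...and the folding it invokes is overridden two levels further down.
        return "gastrulation, frog style"

# Tracing Frog().develop() bounces Organism -> Frog -> Vertebrate -> Frog:
# the yoyo problem in miniature.
print(Frog().develop())
```

Even in this three-class hierarchy, locating where “folding” actually happens requires repeated movements up and down the class hierarchy; in a large, rapidly evolving system the bouncing becomes correspondingly harder to trace.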
In this section we study some human capabilities related to doing (acting), listening (linguistic understanding) and answering (dialogue). The capabilities we have chosen to study are related to the phylogenetic acquisition and the ontogenetic development of:
- Mechanisms for precision manipulation, production of speech and empathic feeling;
- Syntax for complex manipulation procedures, language articulation and behavioral simulation;
- Semantics for action, communication and dialogue; and the learning of
- Technological know-how, social awareness and self-awareness.

When considering an action in a modern democratic society, we usually deliberate what to do (unless there is already a tacit agreement). We then communicate with other agents involved to coordinate this action, so that we are finally able to do what has to be done. Evolution, it seems, took exactly the other way around. Phylogenetically, the path taken by our species follows a stepwise development of several mechanisms (that were neither independent nor strictly sequential), including:

1. A mechanism for 3-dimensional vision and precision measurement, fine motor control of hands and mouth, and visual-motor coordination for the complex procedures of precision manipulation.
2. Mechanisms for imitating, learning and simulating the former procedures or actions.
3. Mechanisms for simulating (possible) actions taken by other individuals, their consequences and motivations, that is, mechanisms for awareness and (behavioral) understanding of other individuals.
4. A mechanism for communicating (possible) actions, used for commanding, controlling and coordinating group actions. The use of such a mechanism implies a degree of awareness of others, that is, some ability to communicate, explain, listen and learn what you do, you - an agent like me.
5. Mechanisms for dialoguing and deliberating, that is, for negotiating, goal selecting and non-trivial social planning. The use of such mechanisms implies some self-awareness or consciousness, that is, the conceptualization of an ego, an abstract I - an agent like you.

In a living individual, all of these mechanisms must be well integrated. Consequently, it is natural that they work using coherent implicit grammars, reflecting compatible subjacent rules of composition for action, language and inter-individual interaction. Indeed, recent research in neuroscience confirms the coherence of these mechanisms. Moreover, this research shows that this coherence is based not just on compatible designs of separate systems, but on intricate schemes of use and reuse of the same structures, namely, the firmware code or circuits implemented as biological neural networks.

Mirror neuron is a concept of neuroscience that highlights the reuse of the same circuits for distinct functions. A mirror neuron is part of a circuit which is activated (fires) when an individual executes an action, and also when the individual observes another individual executing the same action, as if he, the observer, were performing the action himself. The following passages, from important contemporary neuroscientists, give some hints on how the mechanisms mentioned in the last paragraph are structured.

The first group of quotes, from Hesslow (2002, p.245), states the mirror neuron simulation hypothesis, according to which the same circuits used to control our actions are used to learn, simulate, and finally “understand” possible actions taken by other individuals.
According to the simulation hypothesis, we are then naturally endowed with the capability of observing, listening, and “reading the mind” of (that is, understanding, by simulation, the meaning or intent of the possible actions taken by) our fellow human beings.

“...the simulation hypothesis states that thinking consists of simulated interaction with the environment and rests on the following three core assumptions:
(1) simulation of actions: we can activate motor structures of the brain in a way that resembles activity during a normal action but does not cause any overt movement;
(2) simulation of perception: imagining perceiving something is essentially the same as actually perceiving it, only the perceptual activity is generated by the brain itself rather than by external stimuli;
(3) anticipation: there exist associative mechanisms that enable both behavioral and perceptual activity to elicit other perceptual activity in the sensory areas of the brain. Most importantly, a simulated action can elicit perceptual activity that resembles the activity that would have occurred if the action had actually been performed. (p.5).

In order to understand the mental state of another when observing the other acting, the individual imagines herself/himself performing the same action, a covert simulation that does not lead to an overt behavior. (p.5).”

The second group of quotes, from Rizzolatti and Arbib (1998), states the mirror neuron linguistic hypothesis, according to which the same structures used for action simulation are reused to support human language.

“Our proposal is that the development of the human lateral speech circuit is a consequence of the fact that the precursor of Broca’s area was endowed, before speech appearance, with a mechanism for recognizing actions made by others. This mechanism was the neural prerequisite for the development of inter-individual communication and finally of speech. We thus view language in a more general setting than one that sees speech as its complete basis. (p.190).

...a ‘pre-linguistic grammar’ can be assigned to the control and observation of actions. If this is so, the notion that evolution could yield a language system ‘atop’ of the action system becomes much more plausible. (p.191).

In conclusion, the discovery of the mirror system suggests a strong link between speech and action representation. ‘One sees a distinctly linguistic way of doing things down among the nuts and bolts of action and perception, for it is there, not in the remote recesses of cognitive machinery, that the specifically linguistic constituents make their first appearance’. (p.193-194).”

Finally, a third group of quotes, from Ramachandran (2007), states the mirror neuron self-awareness hypothesis, according to which the same structures used for action simulation are reused, over again, to support abstract concepts related to consciousness and self-awareness. According to this perspective, perhaps the most important of such concepts, that of an abstract self-identity or ego, is built upon one’s already developed simulation capability for looking at oneself as if looking at another individual.

“I suggest that ‘other awareness’ may have evolved first and then counterintuitively, as often happens in evolution, the same ability was exploited to model one’s own mind - what one calls self awareness.

How does all this lead to self awareness? I suggest that self awareness is simply using mirror neurons for ‘looking at myself as if someone else is looking at me’ (the word ‘me’ encompassing some of my brain processes, as well).

The mirror neuron mechanism - the same algorithm - that originally evolved to help you adopt another’s point of view was turned inward to look at your own self. This, in essence, is the basis of things like ‘introspection’.

This in turn may have paved the way for more conceptual types of abstraction; such as metaphor (‘get a grip on yourself’).”
Yoyo Effects and the Human Mind
From our analyses in the preceding sections, one should expect, as a consequence of the heavy reuse of code under fast development and steady evolution, the sporadic occurrence of some mental yoyo problems. Such yoyo effects break the harmonious way in which the same code is (or circuits are) supposed to work as an integral part with several functions used to do, listen and answer, that is, to control action performance, language communication, and self or other kinds of awareness. In psychology, many of such effects are known as dissociative phenomena. For carefully controlled studies of low level dissociative phenomena related to corporal action-perception, see Schooler (2002) and Johansson et al. (2008).

In the following paragraphs we give a glimpse of possible neuroscience perspectives on some high level dissociative phenomena. Simulation mechanisms are (re)used to simulate one’s actions, as well as other agents’ actions. Contextualized action simulation is the basis for intentional and motivational inference. From there, one can access even higher abstraction levels such as tactical and strategic thinking, or even ethics and morality. But these capabilities must rely on some principle of decomposition, that is, the ability to separate, to some meaningful degree, one’s own mental state from the mental state of those whose behavior is being simulated. This premise is clearly stated in Decety and Grèzes (2005, p.5):

“One critical aspect of the simulation theory of mind is the idea that in trying to impute mental states to others, an attributor has to set aside her own current mental states and substitute those of the target.”

Unfortunately, as seen in the preceding section, the same low level circuits used to support simulation are also used to support language. This can lead to conflicting requests to use the same resources. For example, verbalization requires introspection, a process that conflicts with the need to set aside one’s own current mental state. This conflict leads to verbal overshadowing - the phenomenon by which verbally describing or explaining an experienced or simulated situation somehow modifies or impairs its correct identification (like recognition or recollection), or distorts its understanding (like contextualization or meaning). Some causes and consequences of this kind of conflict are addressed by Iacoboni (2008, p.270):

“Mirror neurons are pre-motor neurons, remember, and thus are cells not really concerned with our reflective behavior. Indeed, mirroring behaviors such as the chameleon effect seem implicit, automatic, and pre-reflexive. Meanwhile, society is obviously built on explicit, deliberate, reflexive discourse. Implicit and explicit mental processes rarely interact; indeed, they can even dissociate.”

Psychoanalysis can teach us a lot about high level dissociations such as emotional/rational psychological mismatches and individual/social behavioral misjudgments. For a constructivist perspective of psychotherapy see Efran et al. (1990), and further comments in section 7.

We end this section by posing a tricky question capable of inducing the most spectacular yoyo bouncings. This provocative question is related to the role played by division algebras;
Goertzel’s articles mentioned in the introduction provide a good source of references. Division algebras capture the structure of eigen-solutions entailed by symmetry conditions for the recursively generated systems of specular images in a mirror house. The same division algebras are of fundamental importance in many physical theories, see Dion et al. (1995), Dixon (1994) and Lounesto (2001). Finally, division algebras capture the structure of 2-dimensional (complex numbers) and 3-dimensional (quaternion numbers) rotations and translations governing human manipulation of objects, see Hanson (2006). We can thus ask: Do we keep finding division algebras everywhere out there when trying to understand the physical universe because we already have the appropriate hardware to see them, or is it the other way around? We can only suspect that any trivial choice in the dilemma posed by this trick question will only result in an inappropriate answer. We shall revisit this theme in sections 7 and 8.
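To make the rotation claim concrete, here is the standard construction (textbook material, see e.g. Hanson (2006); the symbols θ, α and u are generic, not taken from the articles above). A unit complex number rotates the plane, and a unit quaternion rotates 3-dimensional space by conjugation:

\[ z \mapsto e^{i\theta} z, \quad z \in \mathbb{C} \cong \mathbb{R}^2; \qquad v \mapsto q\, v\, q^{-1}, \quad q = \cos\frac{\alpha}{2} + \sin\frac{\alpha}{2}\,(u_x i + u_y j + u_z k), \]

where v = v_x i + v_y j + v_z k is a pure quaternion standing for a vector in R³ and u is a unit vector; the conjugation map rotates v by the angle α around the axis u. In both algebras every nonzero element is invertible, and it is exactly this division property that makes them suitable tokens for the mirror-house eigen-solutions discussed above.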
We can make the ladder of hierarchical complexity in the systems analyzed in the last sections go even further up, as if it did not climb high enough, by including new steps in the socio-cultural realms that stand above the level of simple or direct inter-individual interaction, such as art, law, religion, science, etc. The origami example of section 1 is used by Richard Dawkins as a prototypical meme or unit of imitation. The term mneme, derived from μνήμη, the muse of memory, was used by Richard Semon as a unit of retrievable memory. Yet another variant of this term, mime, is derived from μίμησις, or imitation. All these terms have been used to suggest a basic model, a single concept, an elementary idea, a memory trace or unit, or to convey related meanings, see Blackmore and Dawkins (1999), Dawkins (1976), van Driem (2007), Schacter (2001), Schacter et al. (1978), and Semon (1904, 1909, 1921, 1923).

Richard Semon’s theory was able to capture many important characteristics concerning the storage or memorization, retrieval, propagation, reproduction and survival of mnemes. Semon was also able to foresee many important details and interconnections, at a time when there were no experimental techniques suitable for an empirical investigation of the relevant neural processes. Unfortunately, Semon’s analysis also suffers from the yoyo effect in some aspects. That is not surprising at all, given the complexity of the systems he was studying and the lack of suitable experimental tools. These yoyo problems were related to some mechanisms, postulated by Semon, for mnemetic propagation across generations, or mnemetic heredity. Such mechanisms had a Lamarckian character, since they implied the possibility of hereditary transmission of learned or acquired characteristics.

In modern Computer Science, the term memetic algorithm is used to describe evolutive programming based on populational evolution by code (genetic) propagation that combines a Darwinian or selection phase and a local optimization or Lamarckian learning phase, see Moscato (1989). Such algorithms were inspired by the evolution of ideas and memes; a minimal sketch is given below.
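The following minimal memetic algorithm is a sketch in Python; the objective function and all parameter values are placeholders of our own, not taken from Moscato (1989). The Darwinian phase selects and recombines, while the Lamarckian phase lets each individual improve by local search and pass the improvement on:

```python
import random

random.seed(0)

def fitness(x):
    # Placeholder objective: maximize a smooth unimodal function of one real.
    return -(x - 3.14) ** 2

def local_search(x, step=0.05, tries=10):
    """Lamarckian phase: hill-climb around x and keep any improvement."""
    for _ in range(tries):
        candidate = x + random.uniform(-step, step)
        if fitness(candidate) > fitness(x):
            x = candidate
    return x

def memetic_step(population, elite=0.5, mutation=0.2):
    """Darwinian phase (selection + variation) followed by local learning."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(2, int(elite * len(ranked)))]
    offspring = []
    while len(offspring) < len(population):
        parent = random.choice(survivors)
        child = parent + random.gauss(0.0, mutation)   # inherited variation
        offspring.append(local_search(child))          # acquired improvement
    return offspring

population = [random.uniform(-10.0, 10.0) for _ in range(20)]
for generation in range(30):
    population = memetic_step(population)
print("best individual: %.3f" % max(population, key=fitness))
```

The acquired (learned) improvement is inherited by the next generation, which is precisely the Lamarckian ingredient that distinguishes memetic from plain genetic algorithms.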
6.4 Hypercyclic Bootstrapping

On March 1st, 2009, the Wikipedia definition for bootstrapping read:

“Bootstrapping or booting refers to a group of metaphors that share a common meaning, a self-sustaining process that proceeds without external help. The term is often attributed to Rudolf Erich Raspe’s story The Adventures of Baron Münchausen, where the main character pulls himself out of a swamp, though it’s disputed whether it was done by his hair or by his bootstraps.”

The attributed origin of this metaphor, the (literally) incredible adventures of Baron Münchhausen, well known as a compulsive liar, makes us suspect that there may be something wrong with some of its uses. There are, however, many examples where bootstrapping explanations can be rightfully applied. Let us analyze a few examples:

1. The Tostines mystery: Does Tostines sell more because it is always fresh and crunchy, or is it always fresh and crunchy because it sells more?

This slogan was used in a very successful marketing campaign that launched the relatively unknown brand Tostines, from Nestlé, to a leading position in the Brazilian market of biscuits, crackers and cookies. The expression Tostines mystery became idiomatic in Brazilian Portuguese, playing a role similar to that of the expression bootstrapping in English.
2. The C computer language and the UNIX operating system: Perhaps the most successful and influential computer language ever designed, C was conceived with bootstrapping in mind. The core language is powerful but spartan. Many capabilities that are an integral part of other programming languages are provided by functions in external standard libraries, including all device dependent operations such as input-output, string and file manipulation, mathematical computations, etc. C was part of a larger project to write UNIX as a portable operating system. In order to bring UNIX and all of its goodies to a new machine (device drivers should already be there), we only have to translate the assembly code for a core C compiler, compile a full C compiler, compile the entire UNIX system, compile all the application programs we want, and voilà, we are done. Bootstrapping, as a technological approach, is of fundamental importance for the computer industry, as it allows the development of ever more powerful software and the rapid substitution of hardware.

3. The virtuous cycle of open source software: An initial or starting code contribution is made available at an open source code repository. Developer communities can use the resources at the repository according to the established open source license. Developers create software or application programs according to their respective business models, affected by the open source license agreements and the repository governance policy. The use of existing software motivates new applications or extensions to the existing ones, generating the development of new programs and new contributions to the open source repository. Code contributions to the repository are filtered by a controlling committee according to a governance model. The full development cycle works using the highlighted elements as catalysts, and is fuelled by the work of self-interested individuals acting according to their own motivations, see Heiss (2007).

4. The Bethe-Weizsäcker main catalytic cycle (CNO-I):
¹²C + ¹H → ¹³N + γ + 1.95 MeV;   ¹³N → ¹³C + e⁺ + ν + 2.22 MeV;
¹³C + ¹H → ¹⁴N + γ + 7.54 MeV;   ¹⁴N + ¹H → ¹⁵O + γ + 7.35 MeV;
¹⁵O → ¹⁵N + e⁺ + ν + 2.75 MeV;   ¹⁵N + ¹H → ¹²C + ⁴He + 4.96 MeV.

This example presents the nuclear synthesis of one atom of Helium from four atoms of Hydrogen. Carbon, Nitrogen and Oxygen act as catalysts in this cyclic reaction, which also produces gamma rays, positrons and neutrinos; the six steps add up to the net reaction 4 ¹H → ⁴He + 2e⁺ + 2ν, releasing about 26.8 MeV. Note that the Carbon-12 atom used in the first reaction is regenerated in the last one. The CNO nuclear fusion cycle is the main source of energy in stars with mass twice as large or more than that of the sun. We have included this example from nuclear physics in order to stress the fact that catalytic cycles play an important role in phenomena occurring at spatial and temporal scales much smaller than those typical of chemistry or biology, where some readers may find them more familiar.

5. RNA and DNA replication: DNA and RNA duplication, translation, and copying may, in general, be considered the core cycle of life, since it is the central cycle of biological reproduction. An autocatalytic cycle is a (chemical) reaction cycle that, using additional resources available in its environment, produces an excess of one or more of its own reactants. A hypercycle is an autocatalytic reaction of second or higher order, that is, an autocatalytic cycle connecting autocatalytic units. In a more general context, a hypercycle indicates self-reproduction of second or higher order, that is, a second or higher order cyclic production network including lower order self-replicative units. In the prototypical hypercycle architecture, a lower order self-replicative unit then plays a dual catalytic role: first, it has an auto-catalytic function in its own reproduction; second, it acts like a catalyst promoting an intermediate step of the higher order cycle.
Let us now examine some ways in which the bootstrapping metaphor is wrongfully applied, that is, used to generate incongruent or inconsistent arguments, supposed to accommodate contradictory situations or to explain the existence of impossible processes. We will focus on four cases of historical interest and great epistemological importance.
Perpetuum Mobile
Perhaps the best known paradox related to the bootstrapping metaphor is connected to a class of examples known as Perpetuum Mobile machines. These machines are supposed to operate forever without any external help, or even to produce some useful energy output. Unfortunately, perpetuum mobiles are only wishful thinking, since the existence of such a machine would violate the first, second and third laws of thermodynamics. These are essentially “no free lunch” principles, formulated as inequalities for the flow (balance or transfer) of matter, energy and information in a general system, see Atkins (1984), Dugdale (1996) and Tarasov (1988).

Hypercyclical processes are not magical and must rely on energy, information (order or neg-entropy) and raw materials available in their environment. In fact, the use of external sources of energy and information is so important that it entails the definition of metabolism used in Eigen (1977):

“Metabolism: (The process) can become effective only for intermediate states which are formed from energy-rich precursors and which are degraded to some energy-deficient waste. The ability of the system to utilize the free energy and the matter required for this purpose is called metabolism. The necessity of maintaining the system far enough from equilibrium by a steady compensation of entropy production has been first clearly recognized by Erwin Schrödinger (1945).”

The need for metabolism may come as a disappointment to professional wishful thinkers, engineers of perpetuum mobile machines, narcissistic philosophers and other anorexic designers. Nevertheless, it is important to realize that metabolic chains are in fact an integral part of the hypercycle concept. Hypercycles are built upon the possibility that the raw material that is supposed to be freely available in the environment for one autocatalytic reaction may very well be the product of another catalytic cycle. Moreover, the same thermodynamic laws that prevent the existence of a perpetuum mobile are fully compatible with a truly wonderful property of hypercycles, namely, their almost miraculous efficiency, as stated in Eigen (1977):

“Under the stated conditions, the product of the plain catalytic process will grow linearly with time, while the autocatalytic system will show exponential growth.”
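Eigen’s contrast, and the hyperbolic growth of hypercycles mentioned in the next subsection, can be summarized by three elementary rate equations (a standard sketch; the rate constant k > 0 and the initial value x₀ are generic symbols, not taken from Eigen’s papers):

\[ \dot{x} = k \;\Rightarrow\; x(t) = x_0 + kt \quad \mbox{(plain catalysis: linear growth)}, \]
\[ \dot{x} = kx \;\Rightarrow\; x(t) = x_0\, e^{kt} \quad \mbox{(autocatalysis: exponential growth)}, \]
\[ \dot{x} = kx^2 \;\Rightarrow\; x(t) = \frac{x_0}{1 - k x_0 t} \quad \mbox{(hypercyclic coupling: hyperbolic growth, diverging as } t \to 1/(k x_0)\mbox{)}. \]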
Evolutionary View
The exponential or hyperbolic (super-exponential) growth of processes based on autocatalytic cycles and hypercycles has profound evolutionary implications. Populations growing exponentially in environments with limited resources, or even with resources growing at a linear or polynomial rate, find themselves in the Malthusian conundrum of ever increasing individual or group competition for evermore scarce resources. In this setting, selection rules applied to a population of individuals struggling to survive and reproduce inexorably lead to an evolutive process. This qualitative argument goes back to Thomas Robert Malthus, Alfred Russel Wallace, and Charles Darwin, see Ingraham (1982) and Richards (1989).

Several alternative mathematical models for evolutive processes only confirm the soundness of the original Malthus-Wallace-Darwin argument. Eigen (1977, 1978a,b) analyses evolutionary processes on the basis of dynamical systems models using the language of ordinary differential equations. In Chapter 5 we take a completely different approach, analyzing evolutionary processes on the basis of stochastic optimization algorithms using the language of inhomogeneous Markov chains. For other possible approaches see Jantsch and Waddington (1976) and Jantsch (1980, 1981). It is remarkable, however, that the qualitative conclusions of these distinct alternative analyses are in complete agreement.
Building Blocks and Modularity
Another consequence of the analysis of evolutionary processes, using either the dynamical systems approach, see Eigen (1977, 1978a,b), or the stochastic optimization approach, see Chapter 5, is the spontaneous emergence of modular structures and hierarchical organization of complex systems.

A classic illustration of the need for modular organization is given by the Hora and Tempus parable of Simon (1996), see also Growney (1982). This is a parable about two watchmakers, named Hora and Tempus, both of whom are respected manufacturers and, under ideal conditions, produce watches of similar quality and price. Each watch requires the assemblage of n = 1000 elementary pieces. However, while Hora uses a hierarchical modular design, Tempus does not. Hora builds each watch with 10 large blocks, each made of 10 small modules of 10 single parts each. Consequently, in order to make a watch, Hora needs to assemble m = 111 modules with r = 10 parts each, while Tempus needs to assemble only m = 1 module with r = 1000 parts. It takes either Hora or Tempus one minute to put a part in its proper place. Hence, while Tempus can assemble a watch in 1000 minutes, Hora can only do it in 1110 minutes. However, both work in a noisy environment, being subject to interruptions (like receiving a telephone call). While placing a part, an interruption occurs with probability p = 0.01. Partially assembled modules are unstable, breaking down at an interruption. Under these conditions, the expected time to assemble a watch is

(m/p) ( (1 - p)^{-r} - 1 ).

Replacing p, m and r by the values in the parable, one finds that Hora's manufacturing process is a few thousand times more efficient than Tempus' (a numerical check is sketched after the following quotations). After this analysis, it is not difficult to understand why Tempus struggles while Hora prospers.

Closing yet another cycle, we thus come to the conclusion that the evolution of complex structures requires modular design. The need for modular organization is captured by the following dicta of Herbert Simon:

"Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses." Simon (1996, p.184).
"The time required for the evolution of a complex form from simple elements depends critically on the number and distribution of potential intermediate stable subassemblies."
Simon (1996, p.190).

"The claim is that the potential for rapid evolution exists in any complex system that consists of a set of subsystems, each operating nearly independently of the detailed process going on within the other subsystems, hence influenced mainly by the net inputs and outputs of the other subsystems. If the near-decomposability condition is met, the efficiency of one component (hence its contribution to organism fitness) does not depend on the detailed structure of other components."
Simon (1996, p.193).
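A direct numerical check of the expected assembly times, under the parable's stated assumptions (interruption probability p = 0.01, one minute per part), can be sketched as follows.

```python
# Expected time, in minutes, to finish m modules of r parts each, when each
# placement is interrupted with probability p and an interruption destroys
# the partially assembled module (the formula stated above).
def expected_time(m, p, r):
    return (m / p) * ((1.0 - p) ** (-r) - 1.0)

p = 0.01
hora = expected_time(m=111, p=p, r=10)      # about 1.2e3 minutes
tempus = expected_time(m=1, p=p, r=1000)    # about 2.3e6 minutes
print(hora, tempus, tempus / hora)          # ratio ~ 2000: "a few thousand times"
```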
Standards and Once-Forever Choices
An important consequence of emerging modularity in evolutive processes is the recurrent commitment to once-forever choices and the spontaneous establishment of standards. This organizational side effect is responsible for mirror-house effects related to many misleading questions leading to philosophical dead-ends. Why do (almost all) nations use the French meter, m, as the standard unit of length, instead of the older Portuguese vara (≈ 1.1 m) or the British yard (≈ 0.91 m)? Why did the automotive industry select 87 octane as "regular" gasoline and settle on 12V as the standard voltage for vehicles? Why do we have chiral symmetry breaks, that is, why do we find only one specific type among two or more possible isomeric molecular forms in organic life? What is so special about the DNA-RNA genetic code that it is shared by all living organisms on planet Earth?

In this mirror house we must accept that the deepest truth is often pretty shallow. Refusing to do so, insisting on the extraction by forceps of more elaborate explanations, can take us seriously astray into foggy illusions, far away from clear reason and real understanding. Eigen (1977, p.541-542) makes the following comments:

The Paradigm of Unity and Diversity in Evolution: Why do millions of species, plants and animals, exist, while there is only one basic molecular machinery of the cell, one universal genetic code and unique chiralities of the macromolecules?

This code became finally established, not because it was the only alternative, but rather due to a peculiar 'once-forever' selection mechanism, which could start from any random assignment. Once-forever selection is a consequence of hypercyclic organization.

Ouroboros is a Greek name,
οὐροβόρος ὄφις, meaning the tail-devouring snake, see Eleazar (1760).

... polysemy, the reuse of the same tags in different contexts. This is due to semantic contamination or spill-over, that is, unwanted or unforeseen transfers of meaning induced by polysemic overloading.

We can ask several questions concerning the relative importance of specific links in causal networks. For example: Can we or should we by any means establish precedences between the links in our diagram? Do upward causes precede or have higher status than downward causes, or vice versa? Do rightward causes explain or have preponderance over leftward causes, or vice versa? Do any of the possible answers imply a progressive or
revolutionary view? Do the opposite answers imply a conservative or reactionary view? The same questions can be asked with respect to a similar diagram for scientific production presented in section 7. Do any of the possible answers imply an empiricist or Aristotelian view? Do the opposite answers imply an idealistic or Platonic view?

To some degree these can be legitimate questions and, consequently, to the same degree, motivate appropriate answers. Nevertheless, following the main goal of this chapter, namely, the exploration of mirror-house illusions, we want to stress that extreme forms of these questions often lead to ill-posed problems. Therefore, extreme answers to the same questions often give an oversimplified, one-sided, biased, or distorted view of reality. The dangerous consequences of acceding to the temptation of having an appetizing ouroboros slice for supper are depicted, in the field of psychology, by the following quotations from Efran (1990, p.99,47):

Using language, any cycle can be broken into causes and purposes... Note that inventing purposes - and they are invented - is usually an exercise in creating tautologies. A description is turned into a purpose that is then asked to account for the description. [A typical example] starts with the defining characteristic of life, self-perpetuation, and states that it is the purpose for which the characteristic exists. Such circular renamings are not illegal, but they do not advance the cause (no pun intended). (p.99)

For a living system there is a unity between product and process: In other words, the major line of work for a living system is creating more of itself. Autopoiesis is neither a promise nor a purpose - it is an organizational characteristic. This means that life lasts as long as it lasts. It doesn't come with guarantees. In contrast to what we are tempted to believe, people do not stay alive because of their strong survival instincts or because they have an important job to complete. They stay alive because their autopoietic organization happens to permit it. When the essentials of that organization are lost, a person's career comes to an end - he or she disintegrates. (p.47)
Emergence and Asymptotics

Asymptotic entities emerge in a model as a law of large numbers, that is, as a stable behavior of a quantity in the limiting case of model parameters corresponding to a system with very many (asymptotically infinite) components. The familiar mathematical notation used in these cases takes the form lim_{n→∞} g(n) or lim_{ε→0} f(ε). Typically, the underlying model describes a local interaction on a small or microscopic scale, while the resulting limit corresponds to a global behavior on a large or macroscopic scale. The paradigmatic examples in this class express the behavior of thermodynamic variables.

... a valid example of emergence, that is, to describe a possible local interaction model from which the global behavior emerges when the flock has a large number of individuals. We use the model programmed by Craig Reynolds (1987):

In 1986 I made a computer model of coordinated animal motion such as bird flocks and fish schools. It was based on three dimensional computational geometry of the sort normally used in computer animation or computer aided design. I called the generic simulated flocking creatures boids. The basic flocking model consists of three simple steering behaviors which describe how an individual boid maneuvers based on the positions and velocities of its nearby flockmates:

Separation: steer to avoid crowding local flockmates
Alignment: steer towards the average heading of local flockmates
Cohesion: steer to move toward the average position of local flockmates

Each boid has direct access to the whole scene's geometric description, but flocking requires that it reacts only to flockmates within a certain small neighborhood around itself.
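The three rules translate almost directly into code. The sketch below is a minimal two-dimensional rendition of my own; the weights, neighborhood radius and time step are arbitrary choices, not Reynolds' original parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, RADIUS = 50, 1.5                  # flock size and neighborhood radius (arbitrary)
pos = rng.uniform(0.0, 10.0, (N, 2))
vel = rng.normal(0.0, 1.0, (N, 2))

def step(pos, vel, dt=0.1, w_sep=0.05, w_ali=0.05, w_coh=0.01):
    """One synchronous update of Reynolds' three steering rules."""
    new_vel = vel.copy()
    for i in range(N):
        d = pos - pos[i]
        near = np.hypot(d[:, 0], d[:, 1]) < RADIUS
        near[i] = False              # a boid is not its own flockmate
        if not near.any():
            continue
        sep = -d[near].sum(axis=0)               # separation: steer away from crowding
        ali = vel[near].mean(axis=0) - vel[i]    # alignment: match average heading
        coh = pos[near].mean(axis=0) - pos[i]    # cohesion: move toward average position
        new_vel[i] = vel[i] + w_sep * sep + w_ali * ali + w_coh * coh
    return pos + dt * new_vel, new_vel

for _ in range(200):                 # the flock self-organizes from a random start
    pos, vel = step(pos, vel)
```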
The second point is to explain why being part of a flock can reduce the risk of predation. Many predators, like a falcon hunting a sparrow, need to single out and focus on a chosen individual in order to strike accurately. However, the rapid change of relative positions of individuals in the flock makes it difficult to isolate a single individual as the designated target and follow it inside the moving flock. Computer simulation models show that this confusion effect greatly reduces the killing (success) rate in this kind of hunt.

The third point in our analysis is to contrast the hunting of single individuals, as analyzed in the previous paragraph, with other forms of predation based on the capture of the entire flock, or a large chunk of it. The focus of such alternative hunting techniques is, in the relative topology of the flock, not on local but on global variables describing the collective entity. For example, as explained in Diachok (2006) and Leighton et al. (2004, 2007), humpback whales collaborate using sophisticated strategies for hunting herring, including specific tactics for:
Detection:
Whales use active sonar detection techniques, using specific frequencies that resonate with and are attenuated by the swim bladders of the herring. In this way, the whales can detect schools over long distances, and also measure their pertinent characteristics.
Steering:
Some whales broadcast loud sounds below the herring school, driving them to the surface. Other whales blow a bubble-net around the school, spiraling in as the school rises. The herring are afraid of the loud sounds at the bottom, and also afraid of swimming through the bubble-net, and are thus forced into a dense pack at a compact killing zone near the surface.
Capture:
Finally, the whales take turns at the killing zone, rising to the surface with their mouths wide open, catching hundreds of fish at a time or, so to speak, "biting off" large chunks of the school.

Finally, let us propose two short statements that can be distilled from our examples. They are going to carry us to the next section.

- Flocking makes it difficult for a predator to use local tactics, tracking the trajectory of a single individual; consequently, for a hunter that focuses on local variables it is hard to know what exactly is going on.

- On the other hand, the same collective behavior creates the opportunity for global strategies that track and manipulate the entire flock. These hunting techniques may be very efficient, in which case we can say that the hunters know very well what they are doing.

Constructive Ontologies

Operation-     ⇐   Experiment     ⇐   Hypotheses
alization          design             formulation
    ⇓                                     ⇑
Effects            True/False         Creative
observation        eigen-solution     interpretation
    ⇓                                     ⇑
Data           ⇒   Mnemetic       ⇒   Statistical
acquisition        explanation        analysis
(Sample space)                     (Parameter space)

Figure 1: Scientific production diagram.

From the several examples mentioned in sections 2, 4 and 6, we can suspect that the emergence of properties, behaviors, organizational forms and other entities is the rule rather than the exception for many non-trivial systems. Hence it is natural to ask about the ontological status of such entities. Ontology is a term used in philosophy referring to a systematic account of existence or reality. In this section we analyze the ontological status of emergent entities according to the Cog-Con epistemological framework. The following paragraphs give a brief summary of this perspective, as well as of some specific epistemological terms as they are used in the Cog-Con framework.

The interpretation of scientific knowledge as an eigen-solution of a research process is part of a Cog-Con approach to epistemology. Figure 1 presents an idealized structure and dynamics of knowledge production. This diagram represents, on the Experiment side (left column), the laboratory or field operations of an empirical science, where experiments are designed and built, observable effects are generated and measured, and the experimental data bank is assembled. On the Theory side (right column), the diagram represents the theoretical work of statistical analysis, interpretation and (hopefully) understanding according to accepted patterns. If necessary, new hypotheses (including whole new theories) are formulated, motivating the design of new experiments. Theory and experiment constitute a double feedback cycle, making it clear that the design of experiments is guided by the existing theory and its interpretation, which, in turn, must be constantly checked, adapted or modified in order to cope with the observed experiments. The whole system constitutes an autopoietic unit.

The Cog-Con framework also includes the following definition of reality and some related terms:
1. Known (knowable) Object:
An actual (potential) eigen-solution of a given system's interaction with its environment. In the sequel, we may use a somewhat more friendly terminology by simply using the term Object.
2. Objective (how, less, more):
Degree of conformance of an object to the essential attributes of an eigen-solution (namely, to be precise, stable, separable and composable).
3. Reality:
A (maximal) set of objects, as recognized by a given system, when interacting with single objects or with compositions of objects in that set.

The Cog-Con framework assumes that an object is always observed by an observer, like a living organism or a more abstract system, interacting with its environment. Therefore, this framework asserts that the manifestation of the corresponding eigen-solution and the properties of the object are respectively driven and specified by both the system and its environment. More concisely, Cog-Con sustains:
4. Idealism:
The belief that a system's knowledge of an object is always dependent on the system's autopoietic relations.
5. Realism:
The belief that a system's knowledge of an object is always dependent on the environment's constraints.
Realistic or Objective Idealism. Solipsism or Skepticism are symptoms of an epistemological analysis that loses the proper balance by putting too much weight on the idealistic side. Conversely,
Dogmatic Realism is a symptom of an epistemological analysis that loses the proper balance by putting too much weight on the realistic side. Dogmatic realism has been, from the Cog-Con perspective, a very common (but mistaken) position in modern epistemology. Therefore, it is useful to have a specific expression, namely, something in itself, to be used as a marker or label for such ill-posed dogmatic statements. The method used to access something in itself is often described as:
- Something that an observer would observe if the (same) observer did not exist; or
- Something that an observer could observe if he made no observations; or
- Something that an observer should observe in the environment without interacting with it (or disturbing it in any way);
and many other equally senseless variations.

From the preceding considerations, it should become clear that, from the Cog-Con perspective, the ontological status of emergent entities can be perfectly fine, as long as these objects correspond to precise, stable, separable and composable eigen-solutions. However, there is a long list of historical objections and complaints concerning such entities. The following quotations from Pihlstrom and El-Hani (2002) elaborate on this point:

Emergent properties are not metaphysically real independently of our practices of inquiry but gain their ontological status from the practice-laden ontological commitments we make.

... why the corresponding eigen-solutions manifest
themselves the way they do. Such explanations include, especially in modern science, the symbolic derivation of scientific hypotheses from general scientific laws, the formulation of new laws in an existing theory, and even the conception of new theories, as well as their general understanding based on accepted metaphysical principles. In the Cog-Con perspective, the understanding of an entity can only strengthen its ontological status, embedding it even deeper in the system's life, endowing it with even wider connections in the web of concepts, revealing more of its links with the great chain of being!
Distinctions and Probability

In the last two sections we have analyzed emergent objects and their properties. In many of the examples used in our discussions, probability mechanisms were at the core of the emergence process. In this section, other ways in which probability mechanisms can generate complex or non-trivial structures will be presented. This section is also dedicated to the study of the ontological status of probability, and of the role played by explanations given by probabilistic mechanisms and stochastic causal relations. We begin our discussion examining the concept of mixed strategies in game theory, due to von Neumann and Morgenstern.

Let us consider the matching pennies game, played by Odd and Even. Each of the players has to show, simultaneously, a bit (0 or 1). If both bits agree (i.e., 00 or 11), Even wins. If both bits disagree (i.e., 01 or 10), Odd wins. Both players have only two pure or deterministic strategies available from which to choose: s0 - show a 0, or s1 - show a 1.

A solution, equilibrium or saddlepoint of a game is a set of strategies that leaves each player at a local optimum, that is, a point at which each player, having full knowledge of all the other players' strategies at that equilibrium point, has nothing to gain by unilaterally changing his own strategy. It is easy to see that, considering only the two deterministic strategies, the game of matching pennies has no equilibrium point. If Even knows the strategy chosen by Odd, he can just take the same strategy and win the game. In the same way, Odd can take the opposite choice of Even's, and win the game.

Let us now expand the set of strategies available to each player considering mixed or randomized strategies, where each player picks among the pure strategies according to a set of probabilities he specifies. We assume that a proper randomization device, like a die, a roulette or a computer with a random number generator program, is available. In the example at hand, Even and Odd can each specify a probability, respectively pe and po, for showing a 1, and qe = 1 - pe and qo = 1 - po, for showing a 0. It is easy to check that pe = po = 1/2 yields the (unique) equilibrium point of the game.
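A small brute-force check of this equilibrium (an illustration of my own; the payoff convention of +1 to Even on agreement follows the description above):

```python
import itertools

# Expected payoff to Even (who wins +1 when the bits agree, -1 otherwise),
# when Even shows 1 with probability pe and Odd shows 1 with probability po.
def payoff_even(pe, po):
    return sum((1 if be == bo else -1)
               * (pe if be else 1 - pe) * (po if bo else 1 - po)
               for be, bo in itertools.product((0, 1), repeat=2))

# Against po = 1/2 every pe gives Even the same expected payoff (zero), and
# symmetrically for Odd: neither player gains by deviating unilaterally.
grid = [i / 10 for i in range(11)]
print(round(payoff_even(0.5, 0.5), 12))                     # 0.0
print(round(max(payoff_even(pe, 0.5) for pe in grid), 12))  # 0.0: Even cannot improve
print(round(min(payoff_even(0.5, po) for po in grid), 12))  # 0.0: Odd cannot improve
```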
A convex combination of two points, p0 and p1, is a point lying on the line segment joining them, that is, a point of the form p(λ) = (1 - λ)p0 + λp1, 0 ≤ λ ≤ 1. A convex set is a set that contains all convex combinations of its points. The extreme points of a convex set are those that cannot be expressed as (non-trivial) convex combinations of other points in the set. A function f(x) is convex if its epigraph, epi(f) - the set of all points above the graph of f(x) - is convex. A convex optimization problem consists of minimizing a convex function over a convex region. The properties of convex geometry warrant that a convex optimization problem has an optimal solution, i.e. a minimum, f(x*). Moreover, the minimum argument, x*, is easy to compute using a procedure such as the steepest descent algorithm, which can be informally stated as follows: place a particle at some point over the graph of f(x), and let it "roll down the hill" to the bottom of the valley, until it finds its lowest point at x*; see Luenberger (1984) and Minoux (1986).
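The "roll down the hill" procedure, in a minimal one-dimensional sketch (my own illustration; the numerical gradient, step size and tolerance are arbitrary choices):

```python
# Bare-bones steepest descent for a smooth convex f of one variable.
def steepest_descent(f, x, step=0.1, tol=1e-8, h=1e-6):
    while True:
        grad = (f(x + h) - f(x - h)) / (2 * h)   # finite-difference slope
        if abs(grad) < tol:                      # flat enough: bottom of the valley
            return x
        x -= step * grad                         # roll a little further downhill

# A particle released at x = 0 rolls down to the minimum at x* = 3.
print(steepest_descent(lambda x: (x - 3.0) ** 2 + 1.0, x=0.0))
```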
In the matching pennies game, let us consider a convex combination of the two pure strategies, that is, a strategy of the form s(λ) = (1 - λ)s0 + λs1, 0 ≤ λ ≤ 1. Since the pure strategies form a discrete set, such a continuous combination of pure strategies is not even well defined, except for the trivial extreme cases, λ = 0 or λ = 1. The introduction of randomization gives a coherent definition for convex combinations of existing strategies and, in so doing, it expands the set of available (mixed) strategies to a convex set where pure strategies become extreme points. In this setting, a game equilibrium point can be characterized as the solution of a convex optimization problem. Therefore, such an equilibrium point exists and is easy to compute. This is one way of gaining a geometric understanding of von Neumann and Morgenstern's theorems, as well as of subsequent extensions in game theory due to John F. Nash, see Bonassi et al. (2009), Mesterton-Gibbons (1992) and Thomas (1986).

The matching pennies example poses a δίλημμα, dilemma - a problem offering two possibilities, none of which is acceptable. The conceptual dichotomy created by constraining the players to only two deterministic strategies creates an ambush. Caught in this ambush, both players would be trapped, forever changing their minds between extreme options. Randomization expands the universe of available possibilities and, in so doing, allows the players to escape the perpetual flip-flopping of this discrete-logic decision trap. In section 8.2, we extrapolate this example and generalize these conclusions. However, before proceeding in this direction, we shall analyze, in the next section, some objections to the concepts of probability, statistics and randomization posed by George Spencer-Brown, a philosopher of great influence in the field of radical constructivism.
Spencer-Brown (1953, 1957) analyzed some apparent paradoxes involving the concept of randomness, and concluded that the language of probability and statistics is inappropriate for the practice of scientific inference. In subsequent work, Spencer-Brown (1969) reformulates classical logic using only a generalized nor operator (marked not and unmarked or), which he represents à la mode of Charles Sanders Peirce or John Venn, using a graphical boundary or distinction mark, see Edwards (2004), Kauffmann (2001, 2003), Meguire (2003), Peirce (1880), Sheffer (1913). Making distinctions is, according to Spencer-Brown, the basic (if not the only) operation of human knowledge, an idea that has either influenced or been directly explored by several authors in the radical constructivist movement. Some typical arguments used by Spencer-Brown in his rejection of probability and statistics are given in the next quotations from Spencer-Brown (1957, p.66,105,113):

We have found so far that the concept of probability used in statistical science is meaningless in its own terms; but we have found also that, however meaningful it might have been, its meaningfulness would nevertheless have remained fruitless because of the impossibility of gaining information from experimental results, however significant. This final paradox, in some ways the most beautiful, I shall call the Experimental Paradox. (p.66)

The essence of randomness has been taken to be absence of pattern. But what has not hitherto been faced is that the absence of one pattern logically demands the presence of another. It is a mathematical contradiction to say that a series has no pattern; the most we can say is that it has no pattern that anyone is likely to look for. The concept of randomness bears meaning only in relation to the observer: If two observers habitually look for different kinds of pattern they are bound to disagree upon the series which they call random. (p.105)

In Section G.1 I carefully explain why I disagree with Spencer-Brown's analysis of probability and statistics. In some of my arguments I dissent from Spencer-Brown's interpretation of measures of order-disorder in sequential signals. These arguments are based on information theory and the notion of entropy. Atkins (1984), Attneave (1959), Dugdale (1996), Krippendorff (1986) and Tarasov (1988) review some of the basic concepts.

... ex post facto "fishing expeditions" for interesting outcomes, or simple post hoc "subgroup analysis" in experimental data banks. This kind of retroactive or retrospective data analysis is considered a questionable statistical practice, and is pointed to as the culprit of many misconceived studies, misleading arguments and mistaken conclusions. The literature of statistical methodology for clinical trials has been particularly active in warning against this kind of practice, see Tribble (2008) and Wang (2007) for two interesting papers addressing this specific issue, published in high-impact medical journals less than a year before I began writing this chapter. When consulting for pharmaceutical companies or advising on the design of statistical experiments, I often find it useful to quote Conan Doyle's Sherlock Holmes, in The Adventure of Wisteria Lodge:

Still, it is an error to argue in front of your data.
You find yourself insensibly twisting them around to fit your theories.

Finally, I am suspicious or skeptical about some of the intended applications of Spencer-Brown's research program, including the use of extrasensory empathic perception for coded message communication, exercises on object manipulation using paranormal powers, etc. Unable to reconcile his psychic research program with statistical science, Spencer-Brown had no regrets in disqualifying the latter, as he clearly stated in the prestigious scientific journal Nature, Spencer-Brown (1953b, p.594-595):

[On telepathy:] Taking the psychical research data (that is, the residuum when fraud and incompetence are excluded), I tried to show that these now threw more doubt upon existing pre-suppositions in the theory of probability than in the theory of communication.

[On psychokinesis:] If such an 'agency' could thus 'upset' a process of randomizing, then all our conclusions drawn through the statistical tests of significance would be equally affected, including the conclusions about the 'psychokinesis' experiments themselves. (How are the target numbers for the die throws to be randomly chosen? By more die throws?) To speak of an 'agency' which can 'upset' any process of randomization in an uncontrollable manner is logically equivalent to speaking of an inadequacy in the theoretical model for empirical randomness, which, like the luminiferous ether of an earlier
controversy, becomes, with the obsolescence of the calculus in which it occurs, a superfluous term.

Spencer-Brown's (1953, 1957) conclusions, including his analysis of probability, were considered to be controversial (if not unreasonable or extravagant) even by his own colleagues at the Society for Psychical Research, see Scott (1958) and Soal (1953). It seems that current research in this area, even if not free (or afraid) of criticism, has abandoned the path of naïve confrontation with statistical science, see Atmanspacher (2005) and Ehm (2005). For additional comments, see Henning (2006), Kaptchuk and Kerr (2004), Utts (1991), and Wassermann (1955).

Curiously, Charles Sanders Peirce and his student Joseph Jastrow, who introduced the idea of randomization in statistical trials, struggled with some of the very same dilemmas faced by Spencer-Brown, namely, the eventual detection of distinct patterns or seemingly ordered (sub)strings in a long random sequence. Peirce and Jastrow did not have at their disposal the heavy mathematical artillery I have cited in the previous paragraphs. Nevertheless, like experienced explorers who, when traveling in the desert, are not lured by the mirage of a misplaced oasis, these intrepid pioneers were able to avoid the conceptual pitfalls that led Spencer-Brown so far astray. For more details see Bonassi et al. (2008), Dehue (1997), Hacking (1988), and Peirce and Jastrow (1885).

As stated in the introduction, the Cog-Con framework is supported by the FBST, a formalism based on a non-decision-theoretic form of Bayesian statistics. The FBST was conceived as a tool for validating objective knowledge and, in this role, it can be easily integrated into the Cog-Con epistemological framework in the practice of scientific research. Contrasting our distinct views of cognitive constructivism, it is not at all surprising that I have come to conclusions concerning the use of probability and statistics, and also the relation between probability and logic, that are fundamentally different from those of Spencer-Brown.
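One of the simplest order-disorder measures from that literature is the block entropy of a sequence. The sketch below is my own illustration of the idea (the block length k and the test sequences are arbitrary choices): a patternless coin-toss sequence scores close to the maximum of one bit per symbol, while a perfectly patterned sequence scores far below it, whichever observer is looking.

```python
from collections import Counter
from math import log2
import random

# Empirical entropy, in bits per symbol, of the length-k blocks of a sequence:
# a standard information-theoretic measure of order-disorder.
def block_entropy(seq, k=8):
    blocks = [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    n = len(blocks)
    return sum(-c / n * log2(c / n) for c in Counter(blocks).values()) / k

random.seed(1)
coin = [random.randint(0, 1) for _ in range(10000)]   # patternless
periodic = [i % 2 for i in range(10000)]              # fully patterned: 0101...
print(block_entropy(coin))        # close to 1 bit per symbol
print(block_entropy(periodic))    # 1/8 bit per symbol (only two distinct blocks)
```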
As stated by William James, our ways of understanding require us to split reality with conceptual distinctions. The non-trivial consequences of the resulting dichotomies are captured, almost poetically, by James (1909, Lecture VI) in the following passage from
A Pluralistic Universe:

The essence of life is its continuously changing character; but our concepts are all discontinuous and fixed, and the only mode of making them coincide with life is by arbitrarily supposing positions of arrest therein. With such arrests our concepts may be made congruent. But these concepts are not parts of reality, not real positions taken by it, but suppositions rather, notes ...
In an empirical science, from a pragmatic perspective, probability reasoning seems to be an efficient tool for overcoming artificial dichotomies, allowing us to bridge the gaps created by our own conceptual distinctions. Such probabilistic models have been able to generate new eigen-solutions with very good characteristics, that is, eigen-solutions that are very objective (precise, stable, separable and composable). These new objects can then be used as stepping stones or building blocks for the construction of new, higher-order theories. In this context, we thus assign, coherently with the Cog-Con epistemological framework, a high ontological status to probabilistic concepts and causation mechanisms, that is, we use a notion of probability that has a distinctively objective character.
Final Remarks and Future Research

The objective of this chapter was to use the Cog-Con framework for the understanding of massively complex and non-trivial systems. We have analyzed several forms of system complexity, several ways in which systems become non-trivial, and some interesting consequences, side effects and paradoxes generated by such non-triviality. What can we call the massive non-triviality found in nature? I call it
The Living and Intelligent Universe.
I could also call it
Deus sive natura or, according to Einstein,
Spinoza’s God, a God who reveals himself in the orderly harmony of whatexists...
In future research we would like to extend the use of the same Cog-Con framework to the analysis of the ethical conduct of agents that are conscious and (to some degree) self-aware. The definition of ethics given by Russell (1999, p.67) reads:
The problem of Ethics is to produce a harmony and self-consistency in conduct, but mere self-consistency within the limits of the individual might be attained in many ways. There must therefore, to make the solution definite, be a universal harmony; my conduct must bring satisfaction not merely to myself, but to all whom it affects, so far as that is possible.
Hence, in this setting, such a research program should be concerned with the understanding and evaluation of choices and decisions made by agents acting in a system to which they belong. Such an analysis should provide criteria for addressing the coherence and consistency of the behavior of such agents, including the direct, indirect and reflexive consequences of their actions. Moreover, since we consider conscious agents, their values, beliefs and ideas should also be included in the proposed models. The importance of pursuing this line of research, and also the inherent difficulties of this task, are summarized by Eigen (1992, p.126).
Epilog
In six chapters and ten appendices, we have presented our case in defense of a constructivist epistemological framework and the use of compatible statistical theory and inference tools. In these final remarks, we shall try to wrap up, as concisely as possible, the reasons for adopting the constructivist world-view.

The basic metaphor of decision theory is the maximization of a gambler's expected fortune, according to his own subjective utility, prior beliefs and learned experiences. This metaphor has proven to be very useful, leading the development of Bayesian statistics since its XXth-century revival, rooted in the work of de Finetti, Savage and others.

The basic metaphor presented in this text, as a foundation for cognitive constructivism, is that of an eigen-solution, and the verification of its objective epistemic status. The FBST is the cornerstone of a set of statistical tools conceived to assess the epistemic value of such eigen-solutions, according to their four essential attributes, namely, sharpness, stability, separability and composability. We believe that this alternative perspective, complementary to the one offered by decision theory, can provide powerful insights and make pertinent contributions in the context of scientific research.

To fulfill our promise of concision, we finish here this summer course / tutorial. We sincerely thank the readers for their attention and welcome their constructive comments. May the blessings of the three holy knights in Figure J.2-4 protect and guide you on your way. Farewell and goodbye!
"E aquela era a hora do mais tarde. O céu vem abaixando. Narrei ao senhor. No que narrei, o senhor talvez até ache, mais do que eu, a minha verdade. Fim que foi."

And it was already the time of later on, the time of sun-down. My story I have told, my lord, so that you may find, perhaps even better than me, the truth I wanted to tell. The End (that already was).

"Vivendo, se aprende; mas o que se aprende, mais, é só a fazer outras maiores perguntas."

Living one learns, but what one learns, is only how to ask even bigger questions.

João Guimarães Rosa (1908-1967). Grande Sertão: Veredas.
References

- E.Aarts, J.Korst (1989). Simulated Annealing and Boltzmann Machines. Chichester: John Wiley.
- J.Abadie, J.Carpentier (1969). Generalization of the Wolfe Reduced Gradient Method to the Case of Nonlinear Constraints. p.37-47 in R.Fletcher (ed), Optimization. London: Academic Press.
- K.M.Abadir, J.R.Magnus (2005). Matrix Algebra. Cambridge University Press.
- J.M.Abe, B.C.Avila, J.P.A.Prado (1998). Multi-Agents and Inconsistence. ICCIMA'98, 2nd International Conference on Computational Intelligence and Multimedia Applications. Traralgon, Australia.
- S.Abe, Y.Okamoto (2001). Nonextensive Statistical Mechanics and Its Applications. NY: Springer.
- R.P.Abelson (1995). Statistics as Principled Argument. LEA.
- Abraham Eleazar (1760). Uraltes chymisches Werk.
- British Journal for the Philosophy of Science.
- Concepts of Science. A Philosophical Analysis. Baltimore.
- D.H.Ackley (1987). A Connectionist Machine for Genetic Hillclimbing. Boston: Kluwer.
- J.Aczél (1966). Lectures on Functional Equations and their Applications. NY: Academic Press.
- P.Aczel (1988). Non-Well-Founded Sets. Stanford, CA: CSLI - Center for the Study of Language and Information.
- P.S.Addison (1997). Fractals and Chaos: An Illustrated Course. Philadelphia: Institute of Physics.
- D.Aigner, K.Lovell, P.Schmidt (1977). Formulation and Estimation of Stochastic Frontier Production Function Models. Journal of Econometrics, 6, 21-37.
- J.Aitchison (2003). The Statistical Analysis of Compositional Data (2nd edition). Caldwell: Blackburn Press.
- J.Aitchison, S.M.Shen (1980). Logistic-Normal Distributions: Some Properties and Uses. Biometrika, 67, 261-272.
- H.Akaike (1969). Fitting Autoregressive Models for Prediction. Ann. Inst. Stat. Math.
- Ann. Inst. Stat. Math., 22, 203-217.
- H.Akaike (1974). A New Look at the Statistical Model Identification. IEEE Trans. Autom. Control, 19, 716-723.
- J.H.Albert (1985). Bayesian Estimation Methods for Incomplete Two-Way Contingency Tables using Prior Belief of Association. In Bayesian Statistics 2, 589-602, J.M.Bernardo, M.H.DeGroot, D.V.Lindley, A.F.M.Smith (eds). Amsterdam: North Holland.
- J.H.Albert, A.K.Gupta (1983). Bayesian Estimation Methods for 2x2 Contingency Tables using Mixtures of Dirichlet Distributions. JASA, 78, 831-841.
- D.Z.Albert (1993). Quantum Mechanics and Experience. Harvard University Press.
- R.Albright, J.Cox, D.Dulling, A.N.Langville, C.D.Meyer (2006). Algorithms, Initializations, and Convergence for the Nonnegative Matrix Factorization.
- J.Alcantara, C.V.Damasio, L.M.Pereira (2002). Paraconsistent Logic Programs. JELIA-02, 8th European Conference on Logics in Artificial Intelligence. Lecture Notes in Computer Science.
- Portfolio Analysis. Englewood Cliffs, NJ: Prentice-Hall.
- G.W.Allport, H.S.Odbert (1936). Trait Names: A Psycho-Lexical Study. Psychological Monographs, 47, No.211.
- S.I.Amari, O.E.Barndorff-Nielsen, R.E.Kass, S.L.Lauritzen, C.R.Rao (1987). Differential Geometry in Statistical Inference. IMS Lecture Notes Monograph, v.10. Hayward, CA: Inst. Math. Statist.
- S.I.Amari (2007). Methods of Information Geometry. American Mathematical Society.
- E.Anderson (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2-5.
- T.W.Anderson (1969). Statistical Inference for Covariance Matrices with Linear Structure. In P.Krishnaiah (ed), Multivariate Analysis II. NY: Academic Press.
- P.Angeline (1996). Two Self-Adaptive Crossover Operators for Genetic Programming. ch.5, p.89-110 in Angeline and Kinnear (1996).
- P.J.Angeline, K.E.Kinnear (1996). Advances in Genetic Programming. Vol.2, Complex Adaptive Systems. MIT.
- M.Anthony, N.Biggs (1992). Computational Learning Theory. Cambridge Univ. Press.
- M.Aoyagi, A.Namatame (2005). Massive Individual Based Simulation: Forming and Reforming of Flocking Behaviors. Complexity International.
- M.A.Arbib, E.J.Conklin, J.C.Hill (1987). From Schemata Theory to Language. Oxford University Press.
- M.A.Arbib, M.B.Hesse (1986). The Construction of Reality. Cambridge University Press.
- O.Arieli, A.Avron (1996). Reasoning with Logical Bilattices. Journal of Logic, Language and Information, 5, 25-63.
- S.Assmann, S.Pocock, L.Enos, L.Kasten (2000). Subgroup Analysis and Other (Mis)uses of Baseline Data in Clinical Trials. The Lancet.
- P.W.Atkins (1984). The Second Law. NY: The Scientific American Books.
- A.C.Atkinson (1970). A Method for Discriminating Between Models. J. Royal Statistical Soc. B, 32, 323-354.
- H.Atmanspacher (2005). Non-Physicalist Physical Approaches. Guest Editorial. Mind and Matter, 3, 2, 3-6.
- F.Attneave (1959). Applications of Information Theory to Psychology: A Summary of Basic Concepts, Methods, and Results. New York: Holt, Rinehart and Winston.
- A.Aykac, C.Brumat, eds. (1977). New Developments in the Application of Bayesian Methods. Amsterdam: North Holland.
- J.Baggott (1992). The Meaning of Quantum Theory. Oxford University Press.
- L.H.Bailey (1894). Neo-Lamarckism and Neo-Darwinism. The American Naturalist, 28, 332, 661-678.
- T.Bakken, T.Hernes (2002). Autopoietic Organization Theory: Drawing on Niklas Luhmann's Social Systems Perspective. Copenhagen Business School.
- G.van Balen (1988). The Darwinian Synthesis: A Critique of the Rosenberg/Williams Argument. British Journal for the Philosophy of Science, 39, 4, 441-448.
- J.D.Banfield, A.E.Raftery (1993). Model Based Gaussian and non-Gaussian Clustering. Biometrics, 803-821.
- W.Banzhaf, P.Nordin, R.E.Keller, F.D.Francone (1998). Genetic Algorithms.
- D.Barbieri (1992). Is Genetic Epistemology of Any Interest for Semiotics? Scripta Semiotica, 1, 1-6.
- R.E.Barlow, F.Proschan (1981). Statistical Theory of Reliability and Life Testing Probability Models. Silver Spring: To Begin With.
- G.A.Barnard (1947). The Meaning of Significance Level. Biometrika, 34, 179-182.
- G.A.Barnard (1949). Statistical Inference. J. Roy. Statist. Soc. B, 11, 115-149.
- A.R.Barron (1984). Predicted Squared Error: A Criterion for Automatic Model Selection. In Farlow (1984).
- V.Bryant, H.Perfect (1980). Independence Theory in Combinatorics: An Introductory Account with Applications to Graphs and Transversals. London: Chapman and Hall.
- D.Basu (1988). Statistical Information and Likelihood. Edited by J.K.Ghosh. Lect. Notes in Statistics, 45.
- D.Basu, J.K.Ghosh (1988). Statistical Information and Likelihood. Lecture Notes in Statistics.
- JSPI, 6, 345-362.
- D.Basu, C.A.B.Pereira (1983). A Note on Blackwell Sufficiency and a Skibinsky Characterization of Distributions. Sankhya A, 45, 1, 99-104.
- M.S.Bazaraa, H.D.Sherali, C.M.Shetty (1993). Nonlinear Programming: Theory and Algorithms. NY: Wiley.
- J.L.Bell (1998). A Primer of Infinitesimal Analysis. Cambridge Univ. Press.
- J.L.Bell (2005). The Continuous and the Infinitesimal in Mathematics and Philosophy. Milano: Polimetrica.
- L.V.Beloussov (2008). Mechanically Based Generative Laws of Morphogenesis. Phys. Biol., 5, 1-19.
- A.H.Benade (1992). Horns, Strings, and Harmony. Mineola: Dover.
- C.H.Bennett (1976). Efficient Estimation of Free Energy Differences from Monte Carlo Data. Journal of Computational Physics, 22, 245-268.
- J.Beran (1994). Statistics of Long-Memory Processes. London: Chapman and Hall.
- H.C.Berg (1993). Random Walks in Biology. Princeton Univ. Press.
- J.O.Berger (1993). Statistical Decision Theory and Bayesian Analysis.
- J.O.Berger, J.M.Bernardo (1992). On the Development of Reference Priors. Bayesian Statistics 4 (J.M.Bernardo, J.O.Berger, D.V.Lindley and A.F.M.Smith, eds). Oxford: Oxford University Press, 35-60.
- J.O.Berger, R.L.Wolpert (1988). The Likelihood Principle.
- Educational and Psychological Measurement, 65(5), 676-696.
- J.M.Bernardo, A.F.M.Smith (2000). Bayesian Theory. NY: Wiley.
- L.von Bertalanffy (1969). General System Theory. NY: George Braziller.
- A.Bertoni, M.Dorigo (1993). Implicit Parallelism in Genetic Algorithms. Artificial Intelligence, 61, 2, 307-314.
- D.P.Bertsekas, J.N.Tsitsiklis (1989). Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs: Prentice Hall.
- D.P.Bertsekas (1996). Thevenin Decomposition and Large Scale Optimization. JOTA, 89, 1-15.
- P.J.Bickel, K.A.Doksum (2001). Mathematical Statistics, 2nd ed. USA: Prentice Hall.
- C.Biernacki, G.Govaert (1998). Choosing Models in Model-based Clustering and Discriminant Analysis. Technical Report INRIA-3509-1998.
- K.Binder (1986). Monte Carlo Methods in Statistical Physics. Topics in Current Physics 7. Berlin: Springer.
- K.Binder, D.W.Heermann (2002). Monte Carlo Simulation in Statistical Physics, 4th ed. NY: Springer.
- E.G.Birgin, R.Castillo, J.M.Martinez (2004). Numerical Comparison of Augmented Lagrangian Algorithms for Nonconvex Problems. To appear in Computational Optimization and Applications.
- A.Birnbaum (1962). On the Foundations of Statistical Inference. J. Amer. Statist. Assoc.
- J. Amer. Statist. Assoc.
- PNAS, 98, 4, 14607-14612.
- S.J.Blackmore (1999). The Meme Machine. Oxford University Press.
- D.Blackwell, M.A.Girshick (1954). Theory of Games and Statistical Decisions. NY: Dover reprint (1976).
- J.R.S.Blair, B.Peyton (1993). An Introduction to Chordal Graphs and Clique Trees. In George et al. (1993).
- C.R.Blyth (1972). On Simpson's Paradox and the Sure-Thing Principle. Journal of the American Statistical Association, 67, p.364.
- R.D.Bock, R.E.Bargmann (1966). Analysis of Covariance Structure. Psychometrika, 31, 507-534.
- N.Bohr (1935). Space-Time Continuity and Atomic Physics. H.H.Wills Memorial Lecture, Univ. of Bristol, Oct. 5, 1931. In Niels Bohr Collected Works, 6, 363-370. Complementarity, p.369-370.
- N.H.D.Bohr (1987a). The Philosophical Writings of Niels Bohr. V.I - Atomic Theory and the Description of Nature. Woodbridge, Connecticut: Ox Bow Press.
- N.H.D.Bohr (1987b). The Philosophical Writings of Niels Bohr. V.II - Essays 1932-1957 on Atomic Physics and Human Knowledge. Woodbridge, Connecticut: Ox Bow Press.
- N.H.D.Bohr (1987c). The Philosophical Writings of Niels Bohr. V.III - Essays 1958-1962 on Atomic Physics and Human Knowledge. Woodbridge, Connecticut: Ox Bow Press.
- N.H.D.Bohr (1999), J.Faye, H.J.Folse, eds. The Philosophical Writings of Niels Bohr. V.IV - Causality and Complementarity: Supplementary Papers. Woodbridge, Connecticut: Ox Bow Press.
- N.H.D.Bohr (1985), J.Kalckar ed. Collected Works. V.6 - Foundations of Quantum Physics I (1926-1932). Elsevier Scientific.
- N.H.D.Bohr (1996), J.Kalckar ed. Collected Works. V.7 - Foundations of Quantum Physics II (1933-1958). Elsevier Scientific.
- N.H.D.Bohr (2000), D.Favrholdt ed. Collected Works. V.10 - Complementarity beyond Physics (1928-1962). Elsevier Scientific.
- N.H.D.Bohr (2007). Questions Answered by Niels Bohr (1885-1962). Physikalisch-Technische Bundesanstalt.
- L.Boltzmann (1890). Über die Bedeutung von Theorien. Translated and edited by B.McGuinness (1974), Theoretical Physics and Philosophical Problems: Selected Writings. Dordrecht: Reidel.
- E.Bonabeau, M.Dorigo, G.Theraulaz (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press.
- J.A.Bonaccini (2000). Kant e o Problema da Coisa Em Si no Idealismo Alemão. SP: Relume Dumará.
- F.V.Bonassi, R.B.Stern, S.Wechsler (2008). The Gambler's Fallacy: A Bayesian Approach. MaxEnt 2008, AIP Conference Proceedings, v.1073, 8-15.
- F.V.Bonassi, R.Nishimura, R.B.Stern (2009). In Defense of Randomization: A Subjectivist Bayesian Approach. To appear in MaxEnt 2009, AIP Conference Proceedings.
- W.Boothby (2002). An Introduction to Differential Manifolds and Riemannian Geometry. NY: Academic Press.
- J.Bopry (2002). Semiotics, Epistemology, and Inquiry. Teaching & Learning, 17, 1, 5-18.
- K.C.Border (1989). Fixed Point Theorems with Applications to Economics and Game Theory. Cambridge University Press.
- W.Borges, J.M.Stern (2005). On the Truth Value of Complex Hypothesis. CIMCA-2005, International Conference on Computational Intelligence for Modelling Control and Automation. USA: IEEE.
- W.Borges, J.M.Stern (2007). The Rules of Logic Composition for the Bayesian Epistemic e-Values. Logic Journal of the IGPL, 15, 5-6, 401-420. doi:10.1093/jigpal/jzm032.
- G.E.P.Box, W.G.Hunter, J.S.Hunter (1978). Statistics for Experimenters: An Introduction to Design, Data Analysis and Model Building. NY: Wiley.
- G.E.Box, G.M.Jenkins (1976). Time Series Analysis, Forecasting and Control. Oakland: Holden-Day.
- G.E.P.Box, G.C.Tiao (1973). Bayesian Inference in Statistical Analysis. London: Addison-Wesley.
- P.J.Bowler (1974). Darwin's Concept of Variation. Journal of the History of Medicine and Allied Sciences, 29, 196-212.
- J.Boyar (1989). Inferring Sequences Produced by Pseudo-Random Number Generators. Journal of the ACM, 36, 1, 129-141.
- R.Boyd, P.Gasper, J.D.Trout (1991). The Philosophy of Science. MIT Press.
- L.M.Bregman (1967). The Relaxation Method for Finding the Common Point of Convex Sets and its Application to the Solution of Problems in Convex Programming. USSR Computational Mathematics and Mathematical Physics, 7, 200-217.
- L.Breiman, J.H.Friedman, C.J.Stone (1993). Classification and Regression Trees. Chapman and Hall.
- R.Brent, J.Bruck (2006). Can Computers Help to Explain Biology? Nature, 440/23, 416-417.
- S.Brier (1995). Cyber-Semiotics: On Autopoiesis, Code-Duality and Sign Games in Bio-Semiotics. Cybernetics and Human Knowing, 3, 1, 3-14.
- S.Brier (2001). Cybersemiotics and Umweltlehre. Semiotica, Special issue on Jakob von Uexküll's Umweltsbiologie, 134 (1/4), 779-814.
- S.Brier (2005). The Construction of Information and Communication: A Cyber-Semiotic Re-Entry into Heinz von Foerster's Metaphysical Construction of Second Order Cybernetics. Semiotica.
- Time Series: Theory and Methods. NY: Springer.
- L.de Broglie (1946). Matter and Light. NY: Dover.
- M.W.Browne (1974). Gradient Methods for Analytical Rotation. British J. of Mathematical and Statistical Psychology, 27, 115-121.
- M.W.Browne (2001). An Overview of Analytic Rotation in Exploratory Factor Analysis. Multivariate Behavioral Research, 36, 111-150.
- P.Brunet (1938). Étude Historique sur le Principe de la Moindre Action. Paris: Hermann.
- S.G.Brush (1961). Functional Integrals in Statistical Physics. Review of Modern Physics, 33, 79.
- S.Brush (1968). A History of Random Processes: Brownian Movement from Brown to Perrin. Arch. Hist. Exact Sci., 5, 1-36.
- T.Budd (1999). Understanding Object-Oriented Programming With Java. Addison Wesley. (Glossary, p.408.)
- A.M.S.Bueno, C.A.B.Pereira, M.N.Rabelo-Gay, J.M.Stern (2002). Environmental Genotoxicity Evaluation: Bayesian Approach for a Mixture Statistical Model. Stochastic Environmental Research and Risk Assessment, 16, 267-278.
- J.R.Bunch, D.J.Rose (1976). Sparse Matrix Computations. NY: Academic Press.
- A.Bunde, S.Havlin (1994). Fractals in Science. NY: Springer.
- L.W.Buss (2007). The Evolution of Individuality. Princeton University Press.
- E.Butkov (1968). Mathematical Physics. Addison-Wesley.
- F.W.Byron Jr., R.W.Fuller (1969). Reading, MA: Addison-Wesley.
- H.B.Callen (1960). Thermodynamics: An Introduction to the Physical Theories of Equilibrium Thermostatics and Irreversible Thermodynamics. NY: John Wiley.
- C.A.Callaghan (2006). Kinetics and Catalysis of the Water-Gas-Shift Reaction: A Microkinetic and Graph Theoretic Approach. Dissertation, Worcester Polytechnic Institute, Dept. of Chemical Engineering.
- T.Y.Cao (2003). Structural Realism and the Interpretation of Quantum Field Theory. Synthese, 136, 1, 3-24.
- T.Y.Cao (2003). Ontological Relativity and Fundamentality - Is Quantum Field Theory the Fundamental Theory? Synthese, 136, 1, 25-30.
- T.Y.Cao (2003). Can We Dissolve Physical Entities into Mathematical Structures? Synthese, 136, 1, 57-71.
- T.Y.Cao (2003). What is Ontological Synthesis? A Reply to Simon Saunders. Synthese, 136, 1, 107-126.
- T.Y.Cao (2004). Ontology and Scientific Explanation. In Cornwell (2004).
- M.Carmeli, S.M.Malin (1976). Representation of the Rotation and Lorentz Groups. Basel: Marcel Dekker.
- M.P.do Carmo (1976). Differential Geometry of Curves and Surfaces. NY: Prentice Hall.
- S.B.Carrol (2005). Endless Forms Most Beautiful: The New Science of Evo Devo. NY: W.W.Norton.
- A.Caticha, A.Giffin (2007). Updating Probabilities with Data and Moments. 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering. AIP Conf. Proc. 872, 74-84.
- A.Caticha (2007). Information and Entropy. 27th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering. AIP Conf. Proc. 872.
- A.Caticha (2008). Lectures on Probability, Entropy and Statistical Physics. Tutorial book for MaxEnt 2008, the 28th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, July 6-11, 2008, Boracéia, São Paulo, Brazil.
- H.Caygill (1995). A Kant Dictionary. Oxford: Blackwell.
- G.Celeux, G.Govaert (1995). Gaussian Parsimonious Clustering Models. Pattern Recog.
- Journal of Statistical Computation and Simulation, 55, 287-314.
- Y.Censor, S.Zenios (1994). Introduction to Methods of Parallel Optimization. IMPA, Rio de Janeiro.
- Y.Censor, S.A.Zenios (1997). Parallel Optimization: Theory, Algorithms, and Applications. NY: Oxford.
- C.Cercignani (1998). Ludwig Boltzmann: The Man who Trusted Atoms. Oxford Univ.
- F.V.Cerezetti, J.M.Stern (2012). Non-arbitrage in Financial Markets: A Bayesian Approach for Verification. AIP Conf. Proc., 1490, 87-96.
- M.Ceruti (1989). La Danza che Crea. Milano: Feltrinelli.
- G.Chaitin (2004). On the Intelligibility of the Universe and the Notions of Simplicity, Complexity and Irreducibility. pp.517-534 in Grenzen und Grenzüberschreitungen, XIX. Berlin: Akademie Verlag.
- L.Chang (2005). Generalized Constraint-Based Inference. M.S. Thesis, Univ. of British Columbia.
- V.Cherkassky, F.Mulier (1998). Learning from Data. NY: Wiley.
- M.Chester (1987). Primer of Quantum Mechanics. John Wiley.
- U.Cherubini, E.Luciano, W.Vecchiato (2004). Copula Methods in Finance. NY: Wiley.
- J.Y.Ching, A.K.C.Wong, K.C.C.Chan (1995). Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 7, 641-651.
- J.Christis (2001). Luhmann's Theory of Knowledge: Beyond Realism and Constructivism? Soziale Systeme, 7, 328-349.
- G.C.Chow (1983). Econometrics. Singapore: McGraw-Hill.
- A.Cichocki, R.Zdunek, S.I.Amari (n.d.). New Algorithms for Non-Negative Matrix Factorization in Applications to Blind Source Separation.
- A.Cichocki, S.I.Amari, R.Zdunek, R.Kompass, G.Hori, Z.He (n.d.). Extended SMART Algorithms for Non-Negative Matrix Factorization.
- A.Cichocki, R.Zdunek, S.I.Amari (n.d.). Csiszár's Divergences for Non-Negative Matrix Factorization: Family of New Algorithms.
- G.W.Cobb (1998). Introduction to Design and Analysis of Experiments. NY: Springer.
- C.Cockburn (1996). The Interaction of Social Issues and Software Architecture. Communications of the ACM, 39, 10, 40-46.
- D.W.Cohen (1989). An Introduction to Hilbert Space and Quantum Logic. NY: Springer.
- R.W.Colby (1988). The Encyclopedia of Technical Market Indicators. Homewood: Dow Jones-Irwin.
- E.C.Colla (2007). Aplicação de Técnicas de Fatoração de Matrizes Esparsas para Inferência em Redes Bayesianas. M.Sc. Thesis, Institute of Mathematics and Statistics, University of São Paulo.
- E.C.Colla, J.M.Stern (2009). Factorization of Bayesian Networks. Studies in Computational Intelligence, 199, 275-285.
- N.E.Collins, R.W.Eglese, B.L.Golden (1988). Simulated Annealing: An Annotated Bibliography. In Johnson (1988).
- M.L.L.Conde (1998). Wittgenstein: Linguagem e Mundo. SP: Annablume.
- P.C.Consul (1989). Generalized Poisson Distributions. Basel: Marcel Dekker.
- W.J.Cook, W.H.Cunningham, W.R.Pulleyblank, A.Schrijver (1997). Combinatorial Optimization. NY: Wiley.
- Cornwell (2004). Explanations: Styles of Explanation in Science.
- N.C.A.Costa (1963). Calculs Propositionnels pour les Systèmes Formels Inconsistants. Comptes Rendus Acad. des Sciences.
- N.C.A.da Costa (1986). Pragmatic Probability. Erkenntnis, 25, 141-162.
- N.C.A.da Costa (1993). Lógica Indutiva e Probabilidade. São Paulo: Hucitec-EdUSP.
- N.C.A.da Costa, D.Krause (2004). Complementarity and Paraconsistency. In Rahman (2004, 557-568).
- N.C.A.Costa, V.S.Subrahmanian (1989). Paraconsistent Logics as a Formalism for Reasoning about Inconsistent Knowledge Bases. Artificial Intelligence in Medicine, 1, 167-174.
- N.C.A.Costa, C.A.Vago, V.S.Subrahmanian (1991). Paraconsistent Logics Pτ. Zeitschrift für Mathematische Logik und Grundlagen der Mathematik, 37, 139-148.
- N.C.A.Costa, J.M.Abe, V.S.Subrahmanian (1991). Remarks on Annotated Logic. Zeitschrift für Mathematische Logik und Grundlagen der Mathematik, 37, 561-570.
- N.C.A.Costa, J.M.Abe, A.C.Murolo, J.I.da Silva, C.F.S.Casemiro (1999). Lógica Paraconsistente Aplicada. São Paulo: Atlas.
- F.G.Cozman (2000). Generalizing Variable Elimination in Bayesian Networks. Proceedings of the Workshop on Probabilistic Reasoning in Artificial Intelligence. Atibaia.
- J.F.Crow (1988). The Importance of Recombination. ch.4, p.57-75 in Michod and Levin (1988).
- I.Csiszar (1974). Information Measures. 2, 73-86.
2, 73-86.- T.van Cutsem. “Decision Trees for Detecting Emergency Voltage Conditions.”
- A.Damodaran (2003). Investment Philosophies: Successful Investment Philosophies and the Greatest Investors Who Made Them Work. NY: Wiley.
- A.Y.Darwiche, M.L.Ginsberg (1992). A Symbolic Generalization of Probability Theory. AAAI-92, 10th Conf. American Association for Artificial Intelligence.
- A.Y.Darwiche (1993). A Symbolic Generalization of Probability Theory. Ph.D. Thesis, Stanford Univ.
- C.Darwin (1860). Letter to Asa Gray, dated 3 April 1860. In F.Darwin ed. (1911). The Life and Letters of Charles Darwin. London: John Murray.
- C.Darwin (1883). The Variation of Animals and Plants under Domestication. V.2, Portland, OR: Book News Inc. Reprint by Kessinger Press, 2004.
- C.Darwin (1859). On the Origin of Species by Means of Natural Selection. Reprinted as Great Books of the Western World V.49, Chicago: Encyclopaedia Britannica Inc., 1952.
- F.N.David (1969). Games, Gods and Gambling. A History of Probability and Statistical Ideas. London: Charles Griffin.
- L.Davis ed. (1987). Genetic Algorithms and Simulated Annealing. Pitman, 1987.
- L.Davis, M.Steenstrup (1987). Genetic Algorithms and Simulated Annealing: An Overview. p.1-11 in Davis (1987).
- M.Davis (1977). Applied Nonstandard Analysis. NY: Dover.
- R.Dawkins (1989). The Selfish Gene. 2nd ed. Oxford University Press.
- I.Deák (1990). Random Number Generators and Simulation. Budapest: Akadémiai Kiadó.
- J.Decety, J.Grèzes (2005). The power of simulation: Imagining one's own and other's behavior. Brain Research,
- Acta Psychol, 73, 13-24.
- M.H.DeGroot (1970). Optimal Statistical Decisions. NY: McGraw-Hill.
- T.Dehue (1997). Deception, Efficiency, and Random Groups: Psychology and the Gradual Origination of the Random Group Design. Isis, 88, 4, p.653-673.
- A.Deitmar (2005). A First Course in Harmonic Analysis, 2nd ed. NY: Springer.
- B.P.Demidovich, I.A.Maron (1976). Computational Mathematics. Moscow: MIR.
- M.Delgado, S.Moral (1987). On the Concept of Possibility-Probability Consistency. Fuzzy Sets and Systems, 21, 3, 311-318.
- A.P.Dempster, N.M.Laird, D.B.Rubin (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. J. of the Royal Statistical Society B, 39, 1-38.
- D.G.T.Denison, C.C.Holmes, B.K.Mallick, A.F.M.Smith (2002). Bayesian Methods for Nonlinear Classification and Regression. John Wiley.
- I.S.Dhillon, S.Sra (0000). Generalized Nonnegative Matrix Approximations with Bregman Divergences.
- O.Diachok (2006). Do Humpback Whales Detect and Classify Fish by Transmitting Sound Through Schools? Science, 201, 131-136.
- P.Diaconis, D.Freedman (1987). A Dozen de Finetti Style Results in Search of a Theory. Ann. Inst. Poincaré Probab. Stat., 23, 397-423.
- P.Diaconis (1988). Group Representation in Probability and Statistics. Hayward: IMS.
- J.M.Dickey (1983). Multiple Hypergeometric Functions: Probabilistic Interpretations and Statistical Uses. JASA, 78, 628-37.
- J.M.Dickey, T.J.Jiang, J.B.Kadane (1987). Bayesian Methods for Categorical Data. JASA
- New Institutionalism in Organizational Analysis. Chicago Univ.
- M.Diniz, C.A.B.Pereira, J.M.Stern (2008). FBST for Cointegration Problems. AIP Conference Proceedings, v. 1073, p. 157-164.
- M.Diniz, C.A.B.Pereira, J.M.Stern (2011). Unit Roots: Bayesian Significance Test. Communications in Statistics - Theory and Methods, 40, 4200-4213.
- M.Diniz, C.A.B.Pereira, J.M.Stern (2012). Cointegration: Bayesian Significance Test. Communications in Statistics - Theory and Methods, 41, 3562-3574.
- M.Diniz, C.A.B.Pereira, A.Polpo, J.M.Stern, S.Wechsler (2012). Relationship Between Bayesian and Frequentist Significance Indices. International Journal for Uncertainty Quantification, 2, 2, 161-172.
- S.M.Dion, J.L.A.Pacca, N.J.Machado (1995). Quaternions: Sucessos e Insucessos de um Projeto de Pesquisa. Estudos Avançados, 9, 25, 251-262.
- G.Dixon (1994). Division Algebras: Octonions, Quaternions, Complex Numbers and the Algebraic Design of Physics.
- B.Dodson (1994). Weibull Analysis. Milwaukee: ASQC Quality Press.
- C.S.Dodson, M.K.Johnson, J.W.Schooler (1997). The verbal overshadowing effect: Why descriptions impair face recognition. Memory and Cognition, 25 (2), 129-139.
- M.G.Doncel, A.Hermann, L.Michel, A.Pais (1987). Symmetries in Physics (1600-1980). Seminari d'Història de les Ciències, Universitat Autònoma de Barcelona.
- I.M.L.D'Otaviano, M.E.Q.Gonzales (2000). Auto-Organização, Estudos Interdisciplinares. Campinas, Brazil: CLE-UNICAMP.
- G.van Driem (2007). Symbiosism, Symbiomism and the Leiden definition of the Meme. Keynote lecture delivered at the pluridisciplinary symposium on Imitation Memory and Cultural Change: Probing the Meme Hypothesis, hosted by the Toronto Semiotic Circle at the University of Toronto, 4 May 2007. Retrieved from
- L.E.Dubins, L.J.Savage (1965).
How to Gamble If You Must. Inequalities for Stochastic Processes. NY: McGraw-Hill.
- D.Dubois, H.Prade, S.Sandri (1993). On Possibility-Probability Transformations. p.103-112 in Proceedings of Fourth IFSA Conference, Kluwer Academic Publ.
- I.S.Duff (1986). Direct Methods for Sparse Matrices. Oxford: Clarendon Press.
- R.Dugas (1988). A History of Mechanics. Dover.
- J.S.Dugdale (1996). Entropy and Its Physical Meaning. London: Taylor and Francis.
- J.Dugundji (1966). Topology. Boston: Allyn and Bacon.
- M.L.Eaton (1989). Group Invariance Applications in Statistics. Hayward: IMS.
- G.T.Eble (1999). On the Dual Nature of Chance in Evolutionary Biology and Paleobiology. Paleobiology, 25, 75-87.
- A.W.F.Edwards (2004). Cogwheels of the Mind. The Story of Venn Diagrams. Baltimore: The Johns Hopkins University Press.
- J.S.Efran, M.D.Lukens, R.J.Lukens (1990). Language, Structure and Change: Frameworks of Meaning in Psychotherapy. NY: W.W.Norton.
- I.Eibl-Eibesfeldt (1970). Ethology, The Biology of Behavior. NY: Holt, Rinehart and Winston.
- M.Eigen (1992). Steps Towards Life. Oxford University Press.
- M.Eigen, P.Schuster (1977). The Hypercycle: A Principle of Natural Self-Organization. Part A: Emergence of the Hypercycle. Die Naturwissenschaften, 64, 11, 541-565.
- M.Eigen, P.Schuster (1978a). The Hypercycle: A Principle of Natural Self-Organization. Part B: The Abstract Hypercycle. Die Naturwissenschaften, 65, 1, 7-41.
- M.Eigen, P.Schuster (1978b). The Hypercycle: A Principle of Natural Self-Organization. Part C: The Realistic Hypercycle. Die Naturwissenschaften, 65, 7, 341-369.
- C.Eisele edt. (1976). The New Elements of Mathematics of Charles S. Peirce. The Hague: Mouton.
- A.Einstein (1905a). Über einen die Erzeugung und Verwandlung des Lichtes betreffenden heuristischen Gesichtspunkt. (On a heuristic viewpoint concerning the production and transformation of light). Annalen der Physik, 17, 132-148.
- A.Einstein (1905b). Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. (On the motion of small particles suspended in liquids at rest required by the molecular-kinetic theory of heat). Annalen der Physik, 17, 549-560.
- A.Einstein (1905c). Zur Elektrodynamik bewegter Körper. (On the Electrodynamics of Moving Bodies). Annalen der Physik, 17, 891-921.
- A.Einstein (1905d). Ist die Trägheit eines Körpers von seinem Energiegehalt abhängig? (Does the Inertia of a Body Depend Upon Its Energy Content?). Annalen der Physik, 18, 639-641.
- A.Einstein (1905, 1956). Investigations on the Theory of the Brownian Movement. Dover.
- A.Einstein (1950). On the Generalized Theory of Gravitation. Scientific American, 182, 4, 13-17. Reprinted in Einstein (1950, 341-355).
- A.Einstein (1954). Ideas and Opinions. Wings Books.
- A.Einstein (1991). Autobiographical Notes: A Centennial Edition. Open Court Publishing Company.
- W.Ehm (2005). Meta-Analysis of Mind-Matter Experiments: A Statistical Modeling Perspective. Mind and Matter, 3, 1, 85-132.
- P.Embrechts (2002). Selfsimilar Processes. Princeton University Press.
- C.Emmeche, J.Hoffmeyer (1991). From Language to Nature: The Semiotic Metaphor in Biology. Semiotica, 84, 1/2, 1-42.
- H.A.Enge, M.R.Wehr, J.A.Richards (1972). Introduction to Atomic Physics. NY: Addison-Wesley.
- T.Elfving (1980). On Some Methods for Entropy Maximization and Matrix Scaling. Linear Algebra and its Applications, 34, 321-339.
- A.G.Expósito, L.G.Franquelo (1987). A New Contribution to the Cluster Problem. IEEE Transactions on Circuits and Systems, 34, 546-552.
- M.Evans (1997). Bayesian Inference Procedures Derived via the Concept of Relative Surprise. Communications in Statistics, 26, 1125-1143.
- M.Evans, T.Swartz (2000). Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press.
- B.S.Everitt (1984). Latent Variable Models. London: Chapman and Hall.
- R.Falk, C.Konold (1997). Making Sense of Randomness: Implicit Encoding as a Basis for Judgment. Psychological Review,
- Encyclopedia of Statistical Sciences,
- Entropy Optimization and Mathematical Programming. Dordrecht: Kluwer.
- S.J.Farlow (1984). Self-Organizing Methods in Modeling: GMDH-type Algorithms. Basel: Marcel Dekker.
- A.Faulstich-Brady (1993). A Taxonomy of Inheritance Semantics. Proceedings of the Seventh International Workshop on Software Specification and Design,
- Fractals. NY: Plenum.
- J.D.Fehribach (2009). Vector-Space Methods and Kirchhoff Graphs for Reaction Networks. SIAM J. on Applied Mathematics, 70, 2, 543-562.
- W.Feller (1957). An Introduction to Probability Theory and Its Applications (2nd ed.), V.I. NY: Wiley.
- W.Feller (1966). An Introduction to Probability Theory and Its Applications (2nd ed.), V.II. NY: Wiley.
- T.S.Ferguson (1996). A Course in Large Sample Theory. NY: Chapman & Hall.
- P.J.Fernandes, J.M.Stern, M.S.Lauretto (2007). A New Media Optimizer Based on the Mean-Variance Model. Presented at ARF'05 - Advertising Research Foundation Conference. Pesquisa Operacional, 27, 427-456.
- J.Ferreira (2006). Semiotic Explorations in Computer Interface Design. Wellington: Victoria University.
- R.P.Feynman, A.R.Hibbs (1965). Quantum Mechanics and Path Integrals. NY: McGraw-Hill.
- P.Feyerabend (1993). Against Method. Verso Books.
- C.M.Fiduccia, R.M.Mattheyses (1982). A Linear Time Heuristic for Improving Network Partitions. IEEE Design Automation Conferences, 19, 175-181.
- E.C.Fieller (1954). Some Problems in Interval Estimation. Journal of the Royal Statistical Society B, 16, 175-185.
- B.de Finetti (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de l'Institut Henri Poincaré, 7, 1-68. English translation: Foresight: Its Logical Laws, its Subjective Sources, in Kyburg and Smokler Eds. (1963), Studies in Subjective Probability, p.93-158, NY: Wiley.
- B.de Finetti (1972). Probability, Induction and Statistics. NY: Wiley.
- B.de Finetti (1974). Theory of Probability, V1 and V2. London: Wiley.
- B.de Finetti (1975). Theory of Probability. A Critical Introductory Treatment. London: Wiley.
- B.de Finetti (1977). Probabilities of Probabilities: A Real Problem or a Misunderstanding? In A.Aykac and C.Brumat (1977).
- B.de Finetti (1980). Probability: Beware of Falsifications. p.193-224 in: H.Kyburg, H.E.Smokler (1980). Studies in Subjective Probability. NY: Krieger.
- B.de Finetti (1981). Scritti. V1: 1926-1930. Padova: CEDAM.
- B.de Finetti (1991). Scritti. V2: 1931-1936. Padova: CEDAM.
- B.de Finetti (1993). Probabilità e Induzione. Bologna: CLUEB.
- D.Finkelstein (1993). Thinking Quantum. Cybernetics and Systems, 24, 139-149.
- M.A.Finocchiaro (1991). The Galileo Affair: A Documented History. NY: The Notable Trials Library.
- R.A.Fisher (1935). The Design of Experiments. 8th ed. (1966). London: Oliver and Boyd.
- R.A.Fisher (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188.
- R.A.Fisher (1926). The Arrangement of Field Experiments. Journal of the Ministry of Agriculture, 33, 503-513.
- R.A.Fisher (1934). Randomisation, and an Old Enigma of Card Play. Mathematical Gazette, 18, 294-297.
- G.Fishman (1996). Monte Carlo. Concepts, Algorithms and Applications. NY: Springer.
- I.Fishtik, C.A.Callaghan, R.Datta (2004). Reaction Route Graphs I: Theory and Algorithm. J. Phys. Chem. B, 108, 5671-5682.
- I.Fishtik, C.A.Callaghan, R.Datta (2006). Wiring Diagrams for Complex Reaction Networks. Ind. Eng. Chem. Res., 45, 6468-6476.
- H.Flanders (1989). Differential Forms with Applications to the Physical Sciences. NY: Dover.
- H.Fleming (1979). As Simetrias como Instrumento de Obtenção de Conhecimento. Ciência e Filosofia, 1, 99-110.
- R.M.T.Fleming, C.M.Maes, M.A.Saunders, Y.Ye, B.O.Palsson (2012). A Variational Principle for Computing Nonequilibrium Fluxes and Potentials in Genome-Scale Biochemical Networks. Journal of Theoretical Biology, 292, 71-77.
- M.Fleming (1962). Domestic Financial Policies under Fixed and under Floating Exchange Rates. International Monetary Fund Staff Papers, 9, 369-79.
- A.Flew (1959). Probability and Statistical Inference by G.Spencer-Brown (review). The Philosophical Quarterly, 9, 37, 380-381.
- H.von Foerster (2003). Understanding Understanding: Essays on Cybernetics and Cognition. NY: Springer Verlag. The following articles in this anthology are of special interest: (a) On Self-Organizing Systems and their Environments, p.1-19; (b) On Constructing a Reality, p.211-227; (c) Objects: Tokens for Eigen-Behaviors, p.261-271; (d) For Niklas Luhmann: How Recursive is Communication?, p.305-323; (e) Introduction to Natural Magic, p.339-338.
- J.L.Folks (1984). Use of Randomization in Experimental Research. p.17-32 in Hinkelmann (1984).
- H.Folse (1985). The Philosophy of Niels Bohr. Elsevier.
- G.Forgacs, S.A.Newman (2005). Biological Physics of the Developing Embryo. Cambridge University Press.
- C.Fraley, A.E.Raftery (1999). Mclust: Software for Model-Based Cluster Analysis. J. Classif., 16, 297-306.
- J.N.Franklin (1968). Matrix Theory. Englewood Cliffs: Prentice-Hall.
- M.L.von Franz (1981). Alchemy: An Introduction to the Symbolism and the Psychology. Studies in Jungian Psychology, Inner City Books.
- A.P.French (1968). Special Relativity. NY: Chapman and Hall.
- A.P.French (1974). Vibrations and Waves. M.I.T. Introductory Physics Series.
- R.Frigg (2005). Models and Representation: Why Structures Are Not Enough. Tech.Rep. 25/02, Center for Philosophy of Natural and Social Science,
- S.Fuchs (1996). The new Wars of Truth: Conflicts over science studies as differential modes of observation. Social Science Information,
- Erkenntnis, 45, 253-265.
- M.V.P.Garcia, C.Humes, J.M.Stern (2002). Generalized Line Criterion for Gauss-Seidel Method. Journal of Computational and Applied Mathematics, 22, 1, 91-97.
- M.R.Garey, D.S.Johnson (1979). Computers and Intractability, A Guide to the Theory of NP-Completeness. NY: Freeman and Co.
- R.H.Gaskins (1992). Burdens of Proof in Modern Discourse. Yale Univ. Press.
- L.A.Gavrilov and N.S.Gavrilova (1991). The Biology of Life Span: A Quantitative Approach. New York: Harwood Academic Publisher.
- L.A.Gavrilov and N.S.Gavrilova (2001). The Reliability Theory of Aging and Longevity. J. Theor. Biol. http://statistik.wu-wien.ac.at/arvag/software.html
- M.Gell-Mann (1994). The Quark and the Jaguar: Adventures in the Simple and the Complex. New York: Freeman.
- A.Gelman, J.B.Carlin, H.S.Stern, D.B.Rubin (2003). Bayesian Data Analysis, 2nd ed. NY: Chapman and Hall / CRC.
- S.Geman, D.Geman (1984). Stochastic Relaxation, Gibbs Distribution and Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
- J.E.Gentle (1998). Random Number Generation and Monte Carlo Methods. NY: Springer.
- A.M.Geoffrion ed. (1972). Perspectives on Optimization: A Collection of Expository Articles. NY: Addison-Wesley.
- A.George, J.W.H.Liu (1978). A Quotient Graph Model for Symmetric Factorization. p.154-175 in: I.S.Duff, G.W.Stewart (1978). Sparse Matrix Proceedings. Philadelphia: SIAM.
- A.George, J.W.H.Liu, E.Ng (1989). Solution of Sparse Positive Definite Systems on a Hypercube. In: Vorst and van Dooren (1990).
- A.George, J.R.Gilbert, J.W.H.Liu (ed.) (1993).
Graph Theory and Sparse Matrix Computation. NY: Springer.
- A.George, J.W.H.Liu (1981). Computer Solution of Large Sparse Positive-Definite Systems. NY: Prentice-Hall.
- C.J.Gerhardt (1890). Die philosophischen Schriften von Gottfried Wilhelm Leibniz. Berlin: Weidmannsche Buchhandlung.
- D.T.Gillespie (1992). A Rigorous Derivation of the Chemical Master Equation. Physica A, 188, 404-425.
- W.R.Gilks, S.Richardson, D.J.Spiegelhalter (1996). Markov Chain Monte Carlo in Practice. NY: CRC Press.
- M.Ginsberg (1986). Multivalued Logics. AAAI-86, 6th National Conference on Artificial Intelligence.
- The EM Algorithm for Mixtures of Factor Analyzers. Tech.Rep. CRG-TR-96-1. Dept. of Computer Science, Univ. of Toronto.
- G.J.Chaitin (1975). Randomness and Mathematical Proof. Scientific American, 232, 47-52.
- G.J.Chaitin (1988). Randomness in Arithmetic. Scientific American, 259, 80-85.
- B.Goertzel, O.Aam, F.T.Smith, K.Palmer (2008). Mirror Neurons, Mirrorhouses, and the Algebraic Structure of the Self. Cybernetics and Human Knowing, 15, 1, 9-28.
- B.Goertzel (2007). Multiboundary Algebra as Pregeometry.
Electronic Journal of Theoretical Physics, 16, 11, 173-186.
- D.E.Goldberg (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
- R.Goldblatt (1998). Lectures on the Hyperreals: An Introduction to Nonstandard Analysis. NY: Springer.
- L.Goldstein, M.Waterman (1988). Neighborhood Size in the Simulated Annealing Algorithm. In Johnson (1988).
- H.H.Goldstine (1980). A History of the Calculus of Variations from the Seventeenth Through the Nineteenth Century. Studies in the History of Mathematics and the Physical Sciences. NY: Springer.
- M.C.Golumbic (1980). Algorithmic Graph Theory and Perfect Graphs. NY: Academic Press.
- D.V.Gokhale (1975). Maximum Entropy Characterization of some Distributions. In G.P.Patil, S.Kotz, J.K.Ord. Statistical Distributions in Scientific Work. V-3, 299-304.
- G.H.Golub, C.F.van Loan (1989). Matrix Computations. Baltimore: Johns Hopkins.
- I.J.Good (1958). Probability and Statistical Inference by G.Spencer-Brown (review). The British Journal for the Philosophy of Science, 9, 35, 251-255.
- I.J.Good (ed.) (1962). The Scientist Speculates. An Anthology of Partly-Baked Ideas. NY: Basic Books.
- I.J.Good (1983). Good Thinking: The Foundations of Probability and its Applications. Minneapolis: University of Minnesota Press.
- I.J.Good (1988). The Interface Between Statistics and Philosophy of Science. Statistical Science, 3, 4, 386-397.
- I.J.Good, Y.Mittal (1987). The Amalgamation and Geometry of Two-by-Two Contingency Tables. Annals of Statistics, 15, p.695.
- P.C.Gotzsche (2002). Assessment of Bias. In S.Kotz, ed. (2006). The Encyclopedia of Statistics, 1, 237-240.
- A.L.Goudsmit (1988). Towards a Negative Understanding of Psychotherapy. Ph.D. Thesis, Groningen University.
- M.Goupil (1991). Du Flou au Clair? Histoire de l'Affinité Chimique de Cardan à Prigogine. Paris: CTHS.
- A.N.Gorban, M.Shahzad (2011). The Michaelis-Menten-Stueckelberg Theorem. Entropy, 13, 966-1019.
- S.Greenland, J.Pearl, J.M.Robins (1999). Confounding and Collapsibility in Causal Inference. Statistical Science, 14, 1, 29-46.
- S.Greenland, H.Morgenstern (2001). Confounding in Health Research. Annual Review of Public Health, 22, 189-212.
- J.S.Growney (1998). Planning for Interruptions. Mathematics Magazine, 55, 4, 213-219.
- B.Gruber et al. edit. (1980-98). Symmetries in Science, I-X. NY: Plenum.
- E.Gunel (1984). A Bayesian Analysis of the Multinomial Model for a Dichotomous Response with Non-Respondents. Communications in Statistics - Theory and Methods, 13, 737-51.
- M.Günther, A.Jüngel (2003, p.117). Finanzderivate mit MATLAB. Mathematische Modellierung und numerische Simulation. Wiesbaden: Vieweg Verlag.
- I.Hacking (1988). Telepathy: Origins of Randomization in Experimental Design. Isis, 79, 3, 427-451.
- G.Hadley (1964). Nonlinear and Dynamic Programming. NY: Addison-Wesley.
- O.Häggström (2002). Finite Markov Chains and Algorithmic Applications. Cambridge Univ.
- P.R.Halmos (1998). Naive Set Theory. NY: Springer.
- J.H.Halton (1970). A Retrospective and Prospective Survey of the Monte Carlo Method. SIAM Review, 12, 1, 1-63.
- W.D.Hamilton (1971). Geometry of the Selfish Herd. J. Theoretical Biology, 31, 295-311.
- J.M.Hammersley, D.C.Handscomb (1964).
Monte Carlo Methods. London: Chapman and Hall.
- J.Hanc, S.Tuleja, M.Hancova (2004). Symmetries and Conservation Laws: Consequences of Noether's Theorem. American Journal of Physics, 72, 428-435.
- A.J.Hanson (2006). Visualizing Quaternions. San Francisco, CA: Morgan Kaufmann - Elsevier.
- I.Hargittai (1992). Fivefold Symmetry. Singapore: World Scientific.
- J.A.Hartigan (1983). Bayes Theory. NY: Springer.
- C.Hartshorne, P.Weiss, A.Burks, edts. (1992). Collected Papers of Charles Sanders Peirce. Charlottesville: InteLex Corp.
- E.J.Haupt (1998). G.E.Müller as a Source of American Psychology. In R.W.Rieber, K.Salzinger, eds. (1998). Psychology: Theoretical-Historical Perspectives. American Psychological Association.
- L.L.Harlow, S.A.Mulaik, J.H.Steiger (1997). What If There Were No Significance Tests? London: LEA - Lawrence Erlbaum Associates.
- D.A.Harville (1997). Matrix Algebra from a Statistician's Perspective. NY: Springer.
- W.K.Hastings (1970). Monte Carlo Sampling Methods Using Markov Chains and their Applications. Biometrika, 57, 97-109.
- M.Haw (2002). Colloidal Suspensions, Brownian Motion, Molecular Reality: A Short History. J. Phys. Condens. Matter, 14, 7769-7779.
- W.Heisenberg (1958). Physics and Philosophy. London: Penguin Classics reprint (2000).
- J.J.Heiss (2007). The Meanings and Motivations of Open-Source Communities. Sun Developer Network, August 2007.
- W.Heitler (1956). Elementary Wave Mechanics with Applications to Quantum Chemistry. Oxford University Press.
- E.Hellerman, D.C.Rarick (1971). Reinversion with the Preassigned Pivot Procedure. Mathematical Programming, 1, 195-216.
- Helmholtz (1887a). Über die physikalische Bedeutung des Princips der kleinsten Wirkung. Journal für reine und angewandte Mathematik, 100, 137-166, 213-222.
- Helmholtz (1887b). Zur Geschichte des Princips der kleinsten Action. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften zu Berlin, I, 225-236.
- N.D.Hemkumar, J.R.Cavallo (1994). Redundant and On-Line CORDIC for Unitary Transformations. IEEE Transactions on Computers, 43, 8, 941-954.
- C.Henning (2006). Falsification of Propensity Models by Statistical Tests and the Goodness-of-Fit Paradox. Technical Report, Department of Statistical Science, University College, London.
- R.J.Herrnstein, E.G.Boring (1966). A Source Book in Psychology. Harvard Univ.
- M.B.Hesse (1966). Models and Analogies in Science. University of Notre Dame Press.
- G.Hesslow (2002). Conscious thought as simulation of behavior and perception. Trends Cogn. Sci., 6, 242-247.
- M.Heydtmann (2002). The nature of truth: Simpson's Paradox and the Limits of Statistical Data. QJM: An International Journal of Medicine, 95, 4, 247-249.
- D.M.Himmelblau (1972). Applied Nonlinear Programming. NY: McGraw-Hill.
- K.Hinkelmann (ed.) (1984). Experimental Design, Statistical Models and Genetic Statistics. Essays in Honor of Oscar Kempthorne. Basel: Marcel Dekker.
- Hitzer (2003). Geometric Algebra - Leibniz' Dream. Innovative Teaching of Mathematics with Geometric Algebra. Advances in Applied Clifford Algebras, 13, 157-181.
- J.S.U.Hjorth (1984). Computer Intensive Statistical Methods. London: Chapman and Hall.
- R.R.Hocking (1985). The Analysis of Linear Models. Monterey: Brooks Cole.
- J.H.Holland (1975). Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.
- J.Honerkamp (1993). Stochastic Dynamical Systems: Concepts, Numerical Methods, Data Analysis. Wiley-VCH.
- F.H.C.Hotchkiss (1998). A "Rays-as-Appendages" Model for the Origin of Pentamerism in Echinoderms. Paleobiology, 24, 2, 200-214.
- R.Houtappel, H.van Dam, E.P.Wigner (1965). The Conceptual Basis and Use of the Geometric Invariance Principles. Reviews of Modern Physics, 37, 595-632.
- P.O.Hoyer (2004). Non-Negative Matrix Factorizations with Sparseness Constraints. J. of Machine Learning Research, 5, 1457-1469.
- P.Hoyningen-Huene (1993). Reconstructing Scientific Revolutions. Thomas S. Kuhn's Philosophy of Science. University of Chicago Press.
- C.Huang, A.Darwiche (1994). Inference in Belief Networks: A Procedural Guide. Int. J. of Approximate Reasoning, 11, 1-58.
- M.D.Huang, F.Romeo, A.Sangiovanni-Vincentelli (1986). An Efficient General Cooling Schedule for Simulated Annealing. IEEE International Conference on Computer-Aided Design, 381-384.
- P.C.Hubert, M.Lauretto, J.M.Stern (2009). FBST for a Generalized Poisson Distribution. AIP Conference Proceedings, accepted.
- R.I.G.Hughes (1992). The Structure and Interpretation of Quantum Mechanics. Harvard University Press.
- C.Humes, M.S.Lauretto, F.Nakano, C.A.B.Pereira, G.F.G.Rafare, J.M.Stern (2012). TORC3: Token-ring Clearing Heuristic for Currency Circulation. AIP Conf.Proc., 1490, 179-188.
- T.P.Hutchinson (1991).
The engineering statistician's guide to continuous bivariate distributions. Sydney: Rumsby Scientific Pub.
- M.Iacoboni (2008). Mirroring People. NY: FSG.
- H.Iba, T.Sato (1992). Meta-Level Strategy for Genetic Algorithms Based on Structured Representation. p.548-554 in Proc. of the Second Pacific Rim International Conference on Artificial Intelligence.
- I.A.Ibri (1992). Kosmos Noetos. A Arquitetura Metafísica de Charles S. Peirce. São Paulo: Perspectiva.
- R.Ingraham ed. (1982). Evolution: A Century after Darwin. Special issue of San Jose Studies, VIII, 3.
- B.Ingrao, G.Israel (1990). The Invisible Hand. Economic Equilibrium in the History of Science. Cambridge, MA: MIT Press.
- R.Inhasz, J.M.Stern (2010). Emergent Semiotics in Genetic Programming and the Self-Adaptive Semantic Crossover. Studies in Computational Intelligence, 314, 381-392.
- T.Z.Irony, M.Lauretto, C.A.B.Pereira, J.M.Stern (2002). A Weibull Wearout Test: Full Bayesian Approach. In: Y.Hayakawa, T.Irony, M.Xie, edit. Systems and Bayesian Reliability, 287-300. Quality, Reliability & Engineering Statistics, 5, Singapore: World Scientific.
- T.Z.Irony, C.A.B.Pereira (1994). Motivation for the Use of Discrete Distributions in Quality Assurance. Test, 3, 2, 181-93.
- T.Z.Irony, C.A.B.Pereira (1995). Bayesian Hypothesis Test: Using Surface Integrals to Distribute Prior Information Among the Hypotheses. Resenhas, São Paulo, 2(1), 27-46.
- T.Z.Irony, C.A.B.Pereira, R.C.Tiwari (2000). Analysis of Opinion Swing: Comparison of Two Correlated Proportions. The American Statistician, 54, 57-62.
- A.N.Iusem, A.R.De Pierro (1986). Convergence Results for an Accelerated Nonlinear Cimmino Algorithm. Numerische Mathematik, 46, 367-378.
- A.N.Iusem (1995). Proximal Point Methods in Optimization. Rio de Janeiro: IMPA.
- A.J.Izzo (1992). A Functional Analysis Proof of the Existence of Haar Measure on Locally Compact Abelian Groups. Proceedings of the American Mathematical Society, 115, 2, 581-583.
- B.Jaffe (1960). Michelson and the Speed of Light. NY: Anchor.
- W.James (1909, 2004).
A Pluralistic Universe. The Project Gutenberg, E-Book 11984, released April 10, 2004.
- L.Jánossy, A.Rényi, J.Aczél (1950). On Composed Poisson Distributions. Acta Math. Hungarica, 1, 209-224.
- E.Jantsch (1980). Self Organizing Universe: Scientific and Human Implications. Pergamon.
- E.Jantsch ed. (1981). The Evolutionary Vision. Toward a Unifying Paradigm of Physical, Biological and Sociocultural Evolution. Washington DC: AAAS - American Association for the Advancement of Science.
- E.Jantsch, C.H.Waddington, eds. (1976). Evolution and Consciousness. Human Systems in Transition. London: Addison-Wesley.
- J.Jastrow (1899). The mind's eye. Popular Science Monthly, 54, 299-312. Reprinted in Jastrow (1900).
- J.Jastrow (1900). Fact and Fable in Psychology. Boston: Houghton Mifflin.
- J.Jastrow (1888). A Critique of Psycho-Physic Methods. American Journal of Psychology, 1, 271-309.
- E.T.Jaynes (1980). The Minimum Entropy Production Principle. Ann. Rev. Phys. Chem., 31, 579-601.
- E.T.Jaynes (1990). Probability Theory as Logic. Maximum-Entropy and Bayesian Methods, ed. P.F.Fougère, Kluwer.
- E.T.Jaynes (2003). Probability Theory: The Logic of Science. Cambridge University Press.
- H.Jeffreys (1961). Theory of Probability. Oxford: Clarendon Press. (First ed. 1939).
- R.I.Jennrich (2001). A Simple General Method for Orthogonal Rotation. Psychometrika, 66, 289-306.
- R.I.Jennrich (2002). A Simple Method for Oblique Rotation. Psychometrika, 67, 1, 7-20.
- R.I.Jennrich (2004). Rotation to Simple Loadings using Component Loss Functions: The Orthogonal Case. Psychometrika, 69, 257-274.
- J.M.Jeschke, R.Tollrian (2007). Prey swarming: Which predators become confused and why. Animal Behaviour, 74, 387-393.
- G.Jetschke (1989). On the Convergence of Simulated Annealing. pp.208-215 in Voigt et al. (1989).
- T.J.Jiang, J.B.Kadane, J.M.Dickey (1992). Computation of Carlson's Multiple Hypergeometric Function R for Bayesian Applications.
Journal of Computational and Graphical Statistics, 1, 231-51.
- G.Jiang, S.Sarkar (1998). Some Asymptotic Tests for the Equality of Covariance Matrices of Two Dependent Bivariate Normals. Biometrical Journal, 40, 205-225.
- G.Jiang, S.Sarkar, F.Hsuan (1999). A Likelihood Ratio Test and its Modifications for the Homogeneity of the Covariance Matrices of Dependent Multivariate Normals. J. Stat. Plan. Infer., 81, 95-111.
- G.Jiang, S.Sarkar (2000a). Some Combination Tests for the Equality of Covariance Matrices of Two Dependent Bivariate Normals. Proc. ISAS-2000, Information Systems Analysis and Synthesis.
- G.Jiang, S.Sarkar (2000b). The Likelihood Ratio Test for Homogeneity of the Variances in a Covariance Matrix with Block Compound Symmetry. Commun. Statist. Theory Meth.
- Operations Research, 37, 865-892.
- M.E.Johnson (1987). Multivariate Statistical Simulation. NY: Wiley.
- M.E.Johnson (ed.) (1988). Simulated Annealing & Optimization. Syracuse: American Science Press. This book is also volume 8 of the American Journal of Mathematical and Management Sciences.
- P.Johansson, L.Hall, S.Sikström, A.Olsson (2008). Failure to Detect Mismatches Between Intention and Outcome in Simple Decision Task. Science,
- Commun. Statist. Simula. Computa., 14, 511-514.
- K.G.Jöreskog (1970). A General Method for Analysis of Covariance Structures. Biometrika, 57, 239-251.
- C.G.Jung (1968). Man and His Symbols. Laurel.
- M.Kac (1983). What is Random? American Scientist, 71, 405-406.
- J.B.Kadane (1985). Is Victimization Chronic? A Bayesian Analysis of Multinomial Missing Data. Journal of Econometrics, 29, 47-67.
- J.Kadane, T.Seidenfeld (1990). Randomization in a Bayesian Perspective. J. of Statistical Planning and Inference, 25, 329-345.
- J.B.Kadane, R.L.Winkler (1987). De Finetti's Methods of Elicitation. In Viertl (1987).
- I.Kant (1790). Critique of Teleological Judgment. In Kant's Critique of Judgement, Oxford: Clarendon Press, 1980.
- I.Kant. The Critique of Pure Reason; The Critique of Practical Reason; The Critique of Judgment. Encyclopaedia Britannica Great Books of the Western World, v.42, 1952.
- S.Kaplan, C.Lin (1987). An Improved Condensation Procedure in Discrete Probability Distribution Calculations. Risk Analysis, 7, 15-19.
- T.J.Kaptchuk, C.E.Kerr (2004). Commentary: Unbiased Divination, Unbiased Evidence, and the Patulin Clinical Trial. International Journal of Epidemiology, 33, 247-251.
- J.N.Kapur (1989). Maximum Entropy Models in Science and Engineering. New Delhi: John Wiley.
- J.N.Kapur, H.K.Kesavan (1992). Entropy Optimization Principles with Applications. Boston: Academic Press.
- T.R.Karlowski, T.C.Chalmers, T.C.Frankel, L.D.Kapikian, T.L.Lewis, J.M.Lynch (1975). Ascorbic acid for the common cold: a prophylactic and therapeutic trial. JAMA, 231, 1038-1042.
- A.Kaufmann, D.Grouchko, R.Cruon (1977). Mathematical Models for the Study of the Reliability of Systems. NY: Academic Press.
- L.H.Kauffman (2001). The Mathematics of Charles Sanders Peirce. Cybernetics and Human Knowing, 8, 79-110.
- L.H.Kauffman (2006). Laws of Form: An Exploration in Mathematics and Foundations.
- M.J.Kearns, U.V.Vazirani (1994). Computational Learning Theory. Cambridge: MIT Press.
- R.Keller, L.A.Davidson, D.R.Shook (2003). How we are Shaped: The Biomechanics of Gastrulation. Differentiation, 71, 171-205.
- O.Kempthorne, L.Folks (1971). Probability, Statistics and Data Analysis. Ames: Iowa State Univ. Press.
- O.Kempthorne (1976). Of what Use are Tests of Significance and Tests of Hypothesis. Comm. Statist., A5, 763-777.
- O.Kempthorne (1977). Why Randomize? J. of Statistical Planning and Inference, 1, 1-25.
- O.Kempthorne (1980). Foundations of Statistical Thinking and Reasoning. Australian CSIRO-DMS Newsletter, 68, 1-5; 69, 3-7.
- M.G.Kendall (2004).
A Course in the Geometry of n-Dimensions. Mineola: Dover.
- B.W.Kernighan, S.Lin (1970). An Efficient Heuristic Procedure for Partitioning Graphs. The Bell System Technical Journal, 49, 291-307.
- A.I.Khinchin (1953). Mathematical Foundations of Information Theory. NY: Dover.
- J.F.Kihlstrom (2006). Joseph Jastrow and His Duck - Or Is It a Rabbit? On-line document, University of California at Berkeley.
- D.A.Klain and G.C.Rota (1997). Introduction to Geometric Probability. Cambridge Univ. Press.
- G.J.Klir, T.A.Folger (1988). Fuzzy Sets, Uncertainty and Information. NY: Prentice Hall.
- C.J.W.Kloesel (1993). Writings of Charles S. Peirce. A Chronological Edition.
- S.Kocherlakota, K.Kocherlakota (1992). Bivariate Discrete Distributions. Basel: Marcel Dekker.
- M.A.R.Koehl (1990). Biomechanical Approaches to Morphogenesis. Sem. Dev. Biol.
- The Burden of Proof in Comparative and International Human Rights Law. Hague: Kluwer.
- V.B.Kolmanovskii, V.R.Nosov (1986). Stability of Functional Differential Equations. London: Academic Press.
- A.N.Kolmogorov (1965). Three Approaches to the Quantitative Definition of Information. Problems in Information Transmission, 1, 1-7.
- A.N.Kolmogorov, S.V.Fomin (1982, Portuguese translation). Elements of the Theory of Functions and Functional Analysis. Moscow: MIR.
- B.O.Koopman (1940a). Axioms and Algebra of Intuitive Probability. Annals of Mathematics, 41, 269-292.
- B.O.Koopman (1940b). Bases of Probability. Bulletin of the American Mathematical Society, 46, 763-774.
- F.H.H.Kortlandt (1985). A Parasitological View of Non-Constructible Sets. Studia Linguistica Diachronica et Synchronica,
- Encyclopedia of Statistical Sciences,
- Cálculo Variacional. MIR, Moscow.
- K.Krippendorff (1986). Information Theory: Structural Models for Qualitative Data. Quantitative Applications in the Social Sciences V.62. Beverly Hills: Sage.
- W.Krohn, G.Küppers, H.Nowotny (1990). Selforganization. Portrait of a Scientific Revolution. Dordrecht: Kluwer.
- W.Krohn, G.Küppers (1990). The Selforganization of Science - Outline of a Theoretical Model. In Krohn (1990), 208-222.
- P.Krugman (1999). O Canada: A Neglected Nation Gets its Nobel. Slate, October 19, 1999.
- A.Krzysztof Kwaśniewski (2008). Glimpses of the Octonions and Quaternions History and Today's Applications in Quantum Physics. eprint arXiv:0803.0119.
- O.S.Ksenzhek, A.G.Volkov (1998). Plant Energetics. NY: Academic Press.
- T.S.Kuhn (1977). The Essential Tension: Selected Studies in Scientific Tradition and Change. University of Chicago Press.
- T.S.Kuhn (1996). The Structure of Scientific Revolutions. University of Chicago Press.
- H.Kunz, T.Züblin, C.K.Hemelrijk (1000). On Prey Grouping and Predator Confusion in Artificial Fish Schools.
- P.J.M.van Laarhoven, E.H.L.Aarts (1987). Simulated Annealing: Theory and Applications. Dordrecht: Reidel Publishing Co.
- C.L.Lanczos (1986). The Variational Principles of Mechanics. Mineola: Dover. Noether's Invariant Variational Problems, Appendix II, p.401-405.
- D.Landau, K.Binder (2000). A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge University Press.
- L.D.Landau, E.M.Lifchitz (1966). Cours de Physique Théorique. Moscou: MIR.
- P.V.Landshoff, A.Metherell, W.G.Rees (1998). Essential Quantum Physics. Cambridge University Press.
- K.Lange (2000). Numerical Analysis for Statisticians. NY: Springer.
- I.Lakatos (1978a). The Methodology of Scientific Research. Cambridge Univ. Press.
- I.Lakatos (1978b). Mathematics, Science and Epistemology. Cambridge Univ. Press.
- G.Lakoff, M.Johnson (2003). Metaphors We Live By. University of Chicago Press.
- L.S.Lasdon (1970). Optimization Theory for Large Systems. NY: MacMillan.
- J.Laurent (1999). A note on the origin of memes / mnemes. Journal of Memetics, 3, 1, 20-21.
- M.Lauretto, F.Nakano, C.A.B.Pereira, J.M.Stern (2009). Hierarchical Forecasting with Polynomial Nets. Studies in Computational Intelligence, 199, 305-315.
- M.Lauretto, F.Nakano, S.R.Faria, C.A.B.Pereira, J.M.Stern (2009). A Straightforward Multiallelic Significance Test for the Hardy-Weinberg Equilibrium Law. Genetics and Molecular Biology, 32, 3, 619-625.
- M.S.Lauretto, F.Nakano, C.A.B.Pereira, J.M.Stern (2012). Intentional Sampling by Goal Optimization with Decoupling by Stochastic Perturbation. AIP Conf.Proc., 1490, 189-201.
- M.Lauretto, C.A.B.Pereira, J.M.Stern, S.Zacks (2003). Full Bayesian Significance Test Applied to Multivariate Normal Structure Models. Brazilian Journal of Probability and Statistics,
- American Institute of Physics Conference Proceedings, 803, 121-128.
- M.Lauretto, S.R.de Faria Jr., B.B.Pereira, C.A.B.Pereira, J.M.Stern (2007). The Problem of Separate Hypotheses via Mixture Models. To appear, American Institute of Physics Conference Proceedings.
- M.S.Lauretto, C.A.B.Pereira, J.M.Stern (2008). MaxEnt 2008 - Bayesian Inference and Maximum Entropy Methods in Science and Engineering. July 6-11, Boracéia, São Paulo, Brazil. American Institute of Physics Conference Proceedings, v.1073.
- S.L.Lauritzen (2006). Fundamentals of Graphical Models. Saint Flour Summer School.
- J.W.Leech (1963). Classical Mechanics. London: Methuen.
- E.L.Lehmann (1959). Testing Statistical Hypotheses. NY: Wiley.
- T.G.Leighton, S.D.Richards, P.R.White (2004). Trapped within a 'Wall of Sound': A Possible Mechanism for the Bubble Nets of Humpback Whales. Acoustics Bulletin, 29, 1, 24-29.
- T.Leighton, D.Finfer, E.Grover, P.White (2007). An Acoustical Hypothesis for the Spiral Bubble Nets of Humpback Whales, and the Implications for Whale Feeding. Acoustics Bulletin, 32, 1, 17-21.
- D.S.Lemons (2002). An Introduction to Stochastic Processes in Physics. Baltimore: Johns Hopkins Univ. Press.
- T.Lenoir (1982). The Strategy of Life. Teleology and Mechanics in Nineteenth-Century German Biology. Univ. of Chicago Press.
- I.Levi (1974). Gambling with Truth: An Essay on Induction and the Aims of Science. MIT Press.
- K.Lewin (1951). Field Theory in Social Science: Selected Theoretical Papers. New York: Harper and Row.
- A.M.Liberman (1993). Haskins Laboratories Status Report on Speech Research,
- Biometrika, 44, 187-192.
- D.V.Lindley (1991).
Making Decisions. NY: John Wiley.
- D.V.Lindley, M.R.Novick (1981). The Role of Exchangeability in Inference. The Annals of Statistics, 9, 1, 45-58.
- R.J.A.Little, D.B.Rubin (1987). Statistical Analysis with Missing Data. New York: Wiley.
- J.S.Liu (2001). Monte Carlo Strategies in Scientific Computing. NY: Springer.
- D.Loemker (1969). G.W.Leibniz Philosophical Papers and Letters. Reidel.
- L.L.Lopes (1982). Doing the Impossible: A Note on Induction and the Experience of Randomness. Journal of Experimental Psychology: Learning, Memory, and Cognition, 8, 626-636.
- L.L.Lopes, G.C.Oden (1987). Distinguishing Between Random and Nonrandom Events. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13, 392-400.
- H.A.Lorentz, A.Einstein, H.Minkowski and H.Weyl (1952). The Principle of Relativity: A Collection of Original Memoirs on the Special and General Theory of Relativity. NY: Dover.
- R.H.Loschi, S.Wechsler (2002). Coherence, Bayes's Theorem and Posterior Distributions. Brazilian Journal of Probability and Statistics, 16, 169-185.
- LosDoggies web page (2010). Retrieved from
- P.Lounesto (2001). Clifford Algebras and Spinors.
- Linear and Nonlinear Programming. Reading: Addison-Wesley.
- N.Luhmann (1989). Ecological Communication. Chicago Univ. Press.
- N.Luhmann (1990a). The Cognitive Program of Constructivism and a Reality that Remains Unknown. In Krohn (1990), 64-86.
- N.Luhmann (1990b). Essays on Self-Reference. NY: Columbia Univ. Press.
- N.Luhmann (1995). Social Systems. Stanford Univ. Press.
- M.Lundy, A.Mees (1986). Convergence of an Annealing Algorithm. Mathematical Programming, 34, 111-124.
- I.J.Lustig (1987). An Analysis of an Available Set of Linear Programming Test Problems. Tech. Rep. SOL-87-11, Dept. Operations Research, Stanford University.
- D.K.C.MacDonald (1962). Noise and Fluctuations: An Introduction. NY: Wiley.
- H.R.Madala, A.G.Ivakhnenko (1994). Inductive Learning Algorithms for Complex Systems Modeling. Boca Raton: CRC Press.
- M.R.Madruga, L.G.Esteves, S.Wechsler (2001). On the Bayesianity of Pereira-Stern Tests. Test, 10, 291-299.
- M.R.Madruga, C.A.B.Pereira, J.M.Stern (2003). Bayesian Evidence Test for Precise Hypotheses. Journal of Statistical Planning and Inference,
- Betting on Theories. Cambridge Univ. Press.
- M.Maimonides (2001). Mishne Torah: Yad hachazakah. NY: Yeshivath Beth Moshe.
- N.I.Mann, K.A.Dingess, P.J.B.Slater (2006). Antiphonal four-part synchronized chorusing in a Neotropical Wren. Biol. Lett., 2, 1-4.
- V.T.L.Maranhão, M.S.Lauretto, J.M.Stern (2012). FBST for Covariance Structures of Generalized Gompertz Models. AIP Conf.Proc., 1490, 202-211.
- L.Margulis (1999). Symbiotic Planet: A New Look At Evolution. Basic Books.
- L.Margulis, D.Sagan (2003). Acquiring Genomes: The Theory of the Origins of the Species. Basic Books.
- D.D.Mari, S.Kotz (2001). Correlation and Dependence. Singapore: World Scientific.
- J.B.Marion (1970). Classical Dynamics of Particles and Systems. NY: Academic Press.
- J.B.Marion (1975). Classical Dynamics of Particles and Systems. NY: Academic Press.
- H.M.Markowitz (1952). Portfolio Selection. The Journal of Finance, 7(1), 77-91.
- H.M.Markowitz (1956). The Optimization of a Quadratic Function Subject to Linear Constraints. Naval Research Logistics Quarterly, 3, 111-133.
- H.M.Markowitz (1987). Mean-Variance Analysis in Portfolio Choice and Capital Markets. Cambridge, MA: Basil Blackwell.
- G.Marsaglia (1968). Random Numbers Fall Mainly in the Planes. Proceedings of the National Academy of Sciences, 61, 25-28.
- J.J.Martin (1975). Bayesian Decision Problems and Markov Chains.
- J.L.Martin (1988). General Relativity. A Guide to its Consequences for Gravity and Cosmology. Chichester: Ellis Horwood - John Wiley.
- P.Martin-Löf (1966). The Definition of Random Sequences. Information and Control, 9, 602-619.
- P.Martin-Löf (1969). Algorithms and Randomness. Review of the Intern. Statistical Institute, 37, 3, 265-272.
- J.M.Martinez (1999). A Direct Search Method for Nonlinear Programming. ZAMM,
- Computational and Applied Mathematics, 19, 31-56.
- J.Matoušek (1991).
Geometric Discrepancy. Berlin: Springer.
- M.Matsumoto, T.Nishimura (1998). Mersenne Twister: A 623-dimensionally Equidistributed Uniform Pseudorandom Number Generator. ACM Trans. Model. Comput. Simul., 8, 3-30.
- M.Matsumoto, Y.Kurita (1992, 1994). Twisted GFSR Generators. ACM Trans. Model. Comput. Simul., I: 2, 179-194; II: 4, 254-266.
- H.R.Maturana, F.J.Varela (1980). Autopoiesis and Cognition. The Realization of the Living. Dordrecht: Reidel.
- H.R.Maturana (1988). Ontology of Observing. The Biological Foundations of Self Consciousness and the Physical Domain of Existence. pp.18-23 in Conference Workbook: Texts in Cybernetics. Felton, CA: American Society for Cybernetics.
- H.R.Maturana (1991). Science and Reality in Daily Life: The Ontology of Scientific Explanations. In Steier (1991).
- H.R.Maturana, B.Poerksen (2004). Varieties of Objectivity. Cybernetics and Human Knowing, 11, 4, 63-71.
- P.L.M.de Maupertuis (1965). Oeuvres, I-IV. Hildesheim: Georg Olms Verlagsbuchhandlung.
- G.P.McCormick (1983). Nonlinear Programming: Theory, Algorithms and Applications. Chichester: John Wiley.
- D.K.C.MacDonald (1962). Noise and Fluctuations. NY: Dover.
- R.P.McDonald (1962). A Note on the Derivation of the General Latent Class Model. Psychometrika, 27, 203-206.
- R.P.McDonald (1974). Testing Pattern Hypotheses for Covariance Matrices. Psychometrika, 39, 189-201.
- R.P.McDonald (1975). Testing Pattern Hypotheses for Correlation Matrices. Psychometrika, 40, 253-255.
- R.P.McDonald, H.Swaminathan (1973). A Simple Matrix Calculus with Applications to Multivariate Analysis. General Systems, 18, 37-54.
- A.L.McLean (1998). The Forecasting Voice: A Unified Approach to Teaching Statistics. In Proceedings of the Fifth International Conference on Teaching of Statistics (eds L. Pereira-Mendoza et al.), 1193-1199. Singapore: Nanjing University.
- G.McLachlan, D.Peel (2000). Finite Mixture Models. NY: Wiley.
- J.D.McGervey (1995). Quantum Mechanics: Concepts and Applications. San Diego: Academic Press.
- W.H.McRea (1954). Relativity Physics. London: Methuen.
- E.J.McShane. The Calculus of Variations. Ch.7, p.125-130 in: J.W.Brewer, M.K.Smith (1981). Emmy Noether.
- P.Meguire (2003). Discovering Boundary Algebra: A Simple Notation for Boolean Algebra and the Truth Functions. Int. J. General Systems, 32, 25-87.
- J.G.Mendel (1866). Versuche über Pflanzenhybriden. Verhandlungen des naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen: 3-47. For the English translation, see: C.T.Druery and William Bateson (1901). Experiments in Plant Hybridization. Journal of the Royal Horticultural Society, 26, 1-32.
- M.B.Mendel (1989). Development of Bayesian Parametric Theory with Application in Control. PhD Thesis, MIT, Cambridge: MA.
- X.L.Meng, W.H.Wong (1996). Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration. Statistica Sinica, 6, 831-860.
- R.Merkel (2005). Analysis and Enhancements of Adaptive Random Testing. Ph.D. Thesis, Swinburne University of Technology in Melbourne. Melbourne: Australia.
- M.Mesterton-Gibbons (1992). An Introduction to Game-Theoretic Modelling. Redwood, CA: Addison-Wesley.
- N.Metropolis, S.Ulam (1949). The Monte Carlo method. J. Amer. Statist. Assoc., 44, 335-341.
- N.Metropolis, A.W.Rosenbluth, M.N.Rosenbluth, A.H.Teller, E.Teller (1953). Equations of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21, 6, 1087-1092.
- A.A.Michelson, E.W.Morley (1887). On the Relative Motion of the Earth and the Luminiferous Ether. American Journal of Science, 34, 333-345.
- D.Michie, D.J.Spiegelhalter, C.C.Taylor (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood.
- R.E.Michod, B.R.Levin (1988). The Evolution of Sex: An Examination of Current Ideas. Sunderland, MA: Sinauer Associates.
- W.Millar (1951). Some General Theorems for Non-Linear Systems Possessing Resistance. Philosophical Magazine, 7, 42 (333), 1150-1160.
- G.Miller (2000). Mental traits as fitness indicators - expanding evolutionary psychology's adaptationism. Evolutionary Perspectives on Human Reproductive Behaviour. Annals of the New York Academy of Sciences, 907, 62-74.
- G.F.Miller (2001). The Mating Mind: How Sexual Choice Shaped the Evolution of Human Nature. London: Vintage.
- G.F.Miller, P.M.Todd (1995). The role of mate choice in biocomputation: Sexual selection as a process of search, optimization, and diversification. In: W.Banzhaf, F.H.Eeckman (Eds.). Evolution and Biocomputation: Computational Models of Evolution (pp.169-204). Berlin: Springer.
- J.Miller (2006). Earliest Known Uses of Some of the Words of Mathematics. http://members.aol.com/jeff570/mathword.html
- J.Mingers (1995). Self-Producing Systems: Implications and Applications of Autopoiesis. NY: Plenum.
- M.Minoux, S.Vajda (1986). Mathematical Programming. John Wiley.
- C.W.Misner, K.S.Thorne, J.A.Wheeler (1973). Gravitation. W.H.Freeman.
- O.Morgenstern (2008). Entry Game Theory at the Dictionary of the History of Ideas (v.2, p.264-275). Retrieved from http://etext.virginia.edu/cgi-local/DHI/dhi.cgi?id=dv2-32
- O.Morgenstern, J.von Neumann (1947). The Theory of Games and Economic Behavior. Princeton University Press.
- W.J.Morokoff (1998). Generating Quasi-Random Paths for Stochastic Processes. SIAM Review, 40, 4, 765-788.
- P.Moscato (1989). On Evolution, Search, Optimization, Genetic Algorithms and Martial Arts: Towards Memetic Algorithms. Caltech Concurrent Computation Program, Tech. Report 826.
- A.Mosleh, V.M.Bier (1996). Uncertainty about Probability: A Reconciliation with the Subjectivist Viewpoint. IEEE Transactions on Systems, Man and Cybernetics, A, 26, 3, 303-311.
- W.Mueller, F.Wysotzki (1994). Automatic Construction of Decision Trees for Classification. Ann. Oper. Res., 52, 231-247.
- S.H.Muggleton (2006). Exceeding Human Limits. Nature, 440/23, 409-410.
- P.Muir (1907). A History of Chemical Theories and Laws. NY: John Wiley. Reprint, NY: Arno Press, 1975.
- R.Mundell (1963). Capital Mobility and Stabilization Policy under Fixed and Flexible Exchange Rates. Canadian Journal of Economic and Political Science, 29, 475-85.
- C.W.K.Mundle (1959). Probability and Statistical Inference by G.Spencer-Brown (review). Philosophy, 34, 129, 150-154.
- I.L.Muntean (2006). Beyond Mechanics: Principle of Least Action in Maupertuis and Euler. On-line doc., University of California at San Diego.
- J.J.Murphy (1986). Technical Analysis of the Futures Markets: A Comprehensive Guide to Trading Methods and Applications. NY: New York Institute of Finance.
- B.A.Murtagh (1981). Advanced Linear Programming. NY: McGraw Hill.
- T.Mikosch (1998). Elementary Stochastic Calculus with Finance in View. Singapore: World Scientific.
- L.Nachbin (1965). The Haar Integral. Van Nostrand.
- R.Nagpal (2002). Self-assembling Global Shape using Concepts from Origami. p.219-231 in T.C.Hull (2002). Origami3: Proceedings of the 3rd International Meeting of Origami Mathematics, Science, and Education. Natick, Massachusetts: A.K.Peters Ltd.
- J.Nash (1951). Non-Cooperative Games.
The Annals of Mathematics, 54, 2, 286-295.
- L.K.Nash (1974). Elements of Statistical Thermodynamics. NY: Dover.
- R.B.Nelsen (2006, 2nd ed.). An Introduction to Copulas. NY: Springer.
- E.Nelson (1987). Radically Elementary Probability Theory. AM-117. Princeton University Press.
- W.Nernst (1909). Theoretische Chemie vom Standpunkte der Avogadroschen Regel und der Thermodynamik. Stuttgart: F.Enke.
- J.von Neumann (1928). Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100, 295-320. English translation in R.D.Luce, A.W.Tucker eds. (1959). Contributions to the Theory of Games IV, pp.13-42. Princeton University Press.
- M.C.Newman, C.Strojan (1998). Risk Assessment: Logic and Measurement. CRC.
- S.A.Newman, W.D.Comper (1990). Generic Physical Mechanisms of Morphogenesis and Pattern Formation. Development,
- IEEE Transactions on Evolutionary Computation, 5, 4, 359-375. Recombinative Guidance.
- N.Y.Nikolaev, H.Iba (2003). Learning Polynomial Feedforward Neural Networks by Genetic Programming and Backpropagation. IEEE Transactions on Neural Networks, 14, 2, 337-350.
- N.Y.Nikolaev, H.Iba (2006). Adaptive Learning of Polynomial Networks. Genetic and Evolutionary Computation. NY: Springer.
- W.Noeth (1995). Handbook of Semiotics. Indiana University Press.
- E.Noether (1918). Invariante Variationsprobleme. Nachrichten der Königlichen Gesellschaft der Wissenschaften zu Göttingen.
- Transport Theory and Statistical Physics,
- Neurology, 44, 16-20.
- M.F.Ochs, R.S.Stoyanova, F.Arias-Mendoza, T.R.Brown (1999). A New Method for Spectral Decomposition Using a Bilinear Bayesian Approach. J. of Magnetic Resonance, 137, 161-176.
- G.M.Odell, G.Oster, P.Alberch, B.Burnside (1980). The Mechanical Basis of Morphogenesis. I - Epithelial Folding and Invagination. Dev. Biol., 85, 446-462.
- G.Ökten (1999). Contributions to the Theory of Monte Carlo and Quasi-Monte Carlo Methods. Ph.D. Thesis, Claremont University. Claremont, CA: USA.
- K.Olitzky edt. (2000). Shemonah Perakim: A Treatise on the Soul by Moshe ben Maimon. URJ Press.
- Y.S.Ong, N.Krasnogor, H.Ishibuchi (2007). Special Issue on Memetic Algorithms. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37, 1, 2-5.
- D.Ormoneit, V.Tresp (1995). Improved Gaussian Mixtures Density Estimates Using Bayesian Penalty Terms and Network Averaging. Advances in Neural Information Processing Systems 8, 542-548. MIT.
- J.Ortega y Gasset (1914). Ensayo de Estética a Manera de Prólogo. In El Pasajero by J.Moreno Villa. Reprinted in p.152-174 of J.Ortega y Gasset (2006). La Deshumanización del Arte. Madrid: Revista de Occidente en Alianza Editorial.
- R.H.J.M.Otten, L.P.P.P.van Ginneken (1989). The Annealing Algorithm. Boston: Kluwer.
- C.C.Paige, M.A.Saunders (1977). Least Squares Estimation of Discrete Linear Dynamic Systems using Orthogonal Transformations. SIAM J. Numer. Anal.
- Inward Bound: Of Matter and Forces in the Physical World. Oxford University Press.
- C.D.M.Paulino, C.A.B.Pereira (1992). Bayesian Analysis of Categorical Data Informatively Censored. Communications in Statistics - Theory and Methods, 21, 2689-705.
- C.D.M.Paulino, C.A.B.Pereira (1995). Bayesian Methods for Categorical Data under Informative General Censoring. Biometrika, 82, 2, 439-446.
- Y.Pawitan (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press.
- J.Pearl (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
- J.Pearl (2004). Simpson's Paradox: An Anatomy. Tech.Rep., Cognitive Systems Lab., Computer Science Dept., Univ. of California at Los Angeles.
- C.S.Peirce (1880). A Boolean Algebra with One Constant. In Hartshorne et al. (1992), 4, 12-20.
- C.S.Peirce (1883). The Johns Hopkins Studies in Logic. Boston: Little, Brown and Co.
- C.S.Peirce, J.Jastrow (1885). On small Differences of Sensation. Memoirs of the National Academy of Sciences, 3 (1884), 75-83. Also in Kloesel (1993), v.5 (1884-1886), p.122-135.
- P.Penfield, R.Spence, S.Duinker (1970a). A Generalized Form of Tellegen's Theorem. IEEE Transactions on Circuit Theory, CT-17, 3, 302-305.
- P.Penfield, R.Spence, S.Duinker (1970b). Tellegen's Theorem and Electrical Networks. Cambridge, MA: MIT Press.
- C.A.B.Pereira, D.V.Lindley (1987). Examples Questioning the use of Partial Likelihood. The Statistician, 36, 15-20.
- C.A.B.Pereira, J.M.Stern (1999a). A Dynamic Software Certification and Verification Procedure. Proc. ISAS-99, Int. Conf. on Systems Analysis and Synthesis, 2, 426-435.
- C.A.B.Pereira, J.M.Stern (1999b). Evidence and Credibility: Full Bayesian Significance Test for Precise Hypotheses. Entropy Journal, 1, 69-80.
- C.A.B.Pereira, J.M.Stern (2001a). Full Bayesian Significance Tests for Coefficients of Variation. In: George, E.I. (Editor). Bayesian Methods with Applications to Statistics, 391-400. Monographs of Official Statistics, EUROSTAT.
- C.A.B.Pereira, J.M.Stern (2001b). Model Selection: Full Bayesian Approach. Environmetrics, 12, (6), 559-568.
- C.A.B.Pereira, J.M.Stern (2005). Inferência Indutiva com Dados Discretos: Uma Visão Genuinamente Bayesiana. COMCA-2005. Chile: Universidad de Antofagasta.
- C.A.B.Pereira, J.M.Stern (2008). Special Characterizations of Standard Discrete Models. REVSTAT Statistical Journal, 6, 3, 199-230.
- C.A.B.Pereira, S.Wechsler (1993). On the Concept of p-value. Brazilian Journal of Probability and Statistics, 7, 159-177.
- C.A.B.Pereira, M.A.G.Viana (1982). Elementos de Inferência Bayesiana. 5o Sinape, São Paulo.
- C.A.B.Pereira, S.Wechsler, J.M.Stern (2008). Can a Significance Test be Genuinely Bayesian? Bayesian Analysis, 3, 1, 79-100.
- P.Perny, A.Tsoukias (1998). On the Continuous Extension of a Four Valued Logic for Preference Modelling. IPMU-98, 302-309. 7th Conf. on Information Processing and Management of Uncertainty in Knowledge Based Systems. Paris, France.
- J.Perrin (1903). Traité de Chimie Physique. Paris: Gauthier-Villars.
- J.Perrin (1906). La discontinuité de la Matière. Revue du Mois, 1, 323-343.
- J.B.Perrin (1909). Mouvement Brownien et Réalité Moléculaire. Annales de Chimie et de Physique, VIII 18, 5-114. Also in p.171-239 of Perrin (1950). Translation: Brownian Movement and Molecular Reality. London: Taylor and Francis.
- J.B.Perrin (1913). Les Atomes. Paris: Alcan. Translation: Atoms. NY: Van Nostrand.
- J.Perrin (1950). Oeuvres Scientifiques. Paris: CNRS.
- L.Peusner (1986). Studies in Network Thermodynamics. Amsterdam: Elsevier.
- D.Pfeffermann, A.M.Krieger, Y.Rinott (1998). Parametric Distributions of Complex Survey Data under Informative Probability Sampling. Statistica Sinica, 8, 1087-1114.
- D.Pfeffermann, M.Sverchkov (2003). Fitting Generalized Linear Models under Informative Sampling. In C.Skinner, R.Chambers (2003), 175-195.
- G.C.Pflug (1996). Optimization of Stochastic Models: The Interface Between Simulation and Optimization. Boston: Kluwer.
- L.Phlips (1995). Competition Policy: A Game-Theoretic Perspective. Cambridge University Press.
- J.Piaget (1975). L'équilibration des Structures Cognitives: Problème Central du Développement. Paris: PUF.
- J.Piaget (1985). Equilibration of Cognitive Structures: The Central Problem of Intellectual Development. Univ. of Chicago.
- J.Piaget, B.Inhelder (1951). The Origin of the Idea of Chance in Children. Translated by L.Leake, E.Burrell, H.D.Fishbein (1975), New York: Norton.
- S.D.Pietra, V.Pietra, J.Lafferty (2001). Duality and Auxiliary Functions for Bregman Distances. Tec.Rep. CMU-CS-01-109R, Carnegie Mellon.
- S.Pihlstrom, C.N.El-Hani (2002). Emergence Theories and Pragmatic Realism. Essays in Philosophy. Arcata, CA, USA: Humboldt State University.
- S.Pissanetzky (1984). Sparse Matrix Technology. NY: Academic Press.
- M.Planck (1915). Das Prinzip der kleinsten Wirkung. Kultur der Gegenwart. Also in p.25-41 of Planck (1944).
Kultur der Gegenwart . Also in p.25-41of Planck (1944).- M.Planck (1944) Wege zur physikalischen Erkenntnis. Reden und Vortr¨age. Leipzig: S.Hirzel.- M.Planck (1937). Religion and Natural Science. Also in Planck (1950).- M.Planck (1950). Scientific Autobiography and other Papers. London: Williams and Norgate.- R.J.Plemmons, R.E.White (1990). Substructuring Methods for Computing the Nullspace ofEquilibrium Matrices.
SIAM Journal on Matrix Analysis and Applications , 11, 1-22.- K.R.Popper (1959).
The Logic of Scientific Discovery.
NY: Routledge.- K.R.Popper (1963).
Conjectures and Refutations: The Growth of Scientific Knowledge.
NY:Routledge.- I.Prigoine (1961).
Introduction to the Thermodynamics of Irreversible Processes , 2nd ed. NY:Interscience.- H.Pulte (1989). Das Prinzip der kleinsten Wirkung und die Kraftkonzeptionen der rationalenMechanik: Eine Untersuchung zur Grundlegungsproblemematik bei Leonhard Euler, PierreLouis Moreau de Maupertuis und Joseph Louis Lagrage.
Studia Leibnitiana , sonderheft 19.- J.R.Quinlan (1986). Induction of Decision Trees.
Machine Learning
1, 221-234.- N.L.Rabinovitch (1973).
Probability and Statistical Inference in Ancient and Medieval JewishLiterature.
University of Toronto Press.- H.Rackham (1926).
Aristotle, Nicomachean Ethics.
Harvard University Press.- S.Rahman, J.Symons, D.M.Gabbay J.P. van Bendegem, eds. (2004).
Logic, Epistemology, andthe Unity of Science . NY: Springer.- V.S.Ramachandran (2007). The Neurology of Self-Awareness. The Edge 10-th AnniversaryEssay.- W.Rasch (1998). Luhmann’s Widerlegung des Idealismus: Constructivism as a two-front war.
Soziale Systeme,
4, 151–161.- W.Rasch (2000) Niklas Luhmanns Modernity. Paradoxes of Differentiation. Stanford Univ.Press.Specially chapter 3 and 4, also published as: W.Rasch (1998). Luhmanns Widerlegung des Ide-alismus: Constructivism as a Two-Front War.
Soziale Systeme,
4, 151-159; and W.Rasch (1994).In Search of Lyotard Archipelago, or: How to Live a Paradox and Learn to Like It.
New GermanCritique,
61, 55-75.- A.Recski (1989).
Matroid Theory and its Applications in Electrical Network Theory and inStatics.
Budapest: Akad´emiai Kiad´o.- C.R.Reeves (1993).
Modern Heuristics for Combinatorial Problems.
Blackwell Scientific.- C.R.Reeves, J.E.Rowe (2002).
Genetic Algorithms - Principles and Perspectives: A Guide toGA Theory.
Berlin: Springer.- F.Reif (1965).
Statistical Physics.
NY: McGraw-Hill.- R.Reintjes, A.de Boer, W.van Pelt, J.M.de Groot (2000). Simpson’s Paradox: An Examplefrom Hospital Epidemiology.
Epidemiology , 11, 1, 81-83.- A.Renyi (1970).
Probability Theory.
Amsterdam: North-Holland.- A.Renyi (1961). On Measures of Entropy and Information.
Proc. 4-th Berkeley Symp. onMath Sats. and Prob.
V-I, 547-561.- H.L.Resnikoff, R.O.Wells (2002).
Wavelet Analysis: The Scalable Structure of Information.
Springer Verlag.- P.Ressel (1985). DeFinetti Type Theorems: Analytical approach.
Annals Probability,
EFERENCES - P.Ressel (1987). A Very General De Finetti Type Theorem. In: Viertl (1987).- P.Ressel (1988). Integral Representations for Distributions of Symmetric Processes.
ProbabilityTheory and Related Fields,
79, 451–467.- C.Reynolds (1987). Flocks Herds and Schools: A Distributed Behavioral Model.
ComputerGraphics,
21, 25-34. Updated version at - R.J.Richards (1989).
Darwin and the Emergence of Evolutionary Theories of Mind and Be-havior.
University Of Chicago Press.- C.J.van Rijsbergen (2004).
The Geometry of Information Retrieval.
Cambridge UniversityPress.- B.D.Ripley (1987).
Stochastic Simulation.
NY: Wiley.- B.D.Ripley (1996).
Pattern Recognition and Neural Networks.
Cambridge University Press.- J.Rissanen (1978). Modeling by Shortest Data Description.
Automatica,
14, 465–471.- J.Rissanen (1989).
Stochastic Complexity in Statistical Inquiry.
NY: World Scientific.- G.Rizzolatti, M.A.Arbib (1998). Language within our grasp.
TINS,
21, 5, 188-194.- G.Rizzolatti, C.Sinigalia (2006).
Mirrors in the Brain. How our Minds Share Actions andEmotions.
Oxford University Press.- A.M.Robert (2003).
Nonstandard Analysis.
Mineola: Dover.- C.P.Robert (1996). Mixture of Distributions: Inference and Estimation. in Gilks et al. (1996).- R.Robertson (2001). One, Two, Three, Continuity. C.S.Peirce and the Nature of the Contin-uum.
Cybernetics & Human Knowing,
8, 7-24.- V.Ronchi (1970).
Nature of Light: An Historical Survey.
Harvard Univ. Press.- H.Rouanet, J.M.Bernard, M.C.Bert, B.Lecoutre, M.P.Lecoutre, B.Le Roux (1998).
New Waysin Statistical Methodology. From Significance Tests to Bayesian Inference.
Berne: Peter Lang.- D.J.Rose (1972)
Sparse Matrices and Their Applications.
NY: Springer.- D.J.Rose, R.A.Willoughby (1972).
Sparse Matrices.
NY: Plenum Press.- L.Rosenfeld (2005).
Classical Statistical Mechanics.
S˜ao Paulo: CBPF - Livraria da F´ısica.- J.Ross, S.R.Berry (2008).
Thermodynamics and Fluctuations far from Equilibrium . NY:Springer.- R.Royall (1997).
Statistical Evidence: A Likelihood Paradigm . London: Chapman & Hall.- E.Rubin (1915).
Visuell wahrgenommene Figuren . Copenhagen: Cyldenalske Boghandel.- D.B.Rubin (1978). Bayesian Inference for Causal Effects: The Role of Randomization.
TheAnnals of Statistics,
6, 34-58.- D.Rubin, D.Thayer (1982). EM Algorithm for ML Factor Analysis.
Psychometrika,
47, 1,69-76.- H.Rubin (1987). A Weak System of Axioms for “Rational” Behaviour and the Non-Separabilityof Utility from Prior.
Statistics and Decisions , 5, 47–58.- R.Y.Rubinstein, D.P.Kroese (2004).
The Cross-Entropy Method: A Unified Approach to Com-binatorial Optimization, Monte-Carlo Simulation and Machine Learning.
NY: Springer.- C.Ruhla (1992).
The Physics of Chance: From Blaise Pascal to Niels Bohr.
Oxford UniversityPress.- B.Russell (1894). Cleopatra or Maggie Tulliver? Lecture at the Cambridge ConversazioneSociety. Reprinted as Ch.8, p.57-67, in C.R.Pigden, ed. (1999).
Russell on Ethics.
London:Routledge.- S.Russel (1988). Machine Learning: The EM Algorithm. Unpublished note.- S.Russell (1998).
The EM Algorithm.
On line doc, Univ. of California at Berkeley. REFERENCES - Ruta v. Breckenridge-Remy Co., USA, 1982.- A.I.Sabra (1981).
Theories of Light: From Descartes to Newton.
Cambridge University Press.- R.K.Sacks, H.Wu (1977).
Genearl Relativity for Mathematicians.
NY: Springer.- L.Sadun (2001).
Applied Linear Algebra: The Decoupling Principle.
NY: Prentice Hall.- Sakamoto,Y. Ishiguro,M. Kitagawa,G. (1986).
Akaike Information Criterion Statistics.
Dor-drecht: Reidel - Kluwer.- V.H.S.Salinas-Torres, C.A.B.Pereira, R.C.Tiwari (1997). Convergence of Dirichlet MeasuresArising in Context of Bayesian Analysis of Competing Risks Models.
J. Multivariate Analysis ,62,1, 24-35.- V.H.S.Salinas-Torres, C.A.B.Pereira, R.C.Tiwari (2002). Bayesian Nonparametric Estimationin a Series System or a Competing-Risks Model.
J.of Nonparametric Statistics , 14,4, 449-58.- M.Saltzman (2004).
Tissue Engineering.
Oxford University Press.- A. Sangiovanni-Vincentelli, L.O. Chua (1977). An Efficient Heuristic Cluster Algorithm forTearing Large-Scale Networks.
IEEE Transactions on Circuits and Systems , 24, 709-717.- L.Santaella (2002).
Semi´otica Aplicada.
S˜ao Paulo: Thomson Learning.- L.A.Santal´o (1973).
Vectores y Tensores.
Buenos Aires: Eudeba.- G.de Santillana (1955).
The Crime of Galileo . University of Chicago Press.- L.A.Santalo (1976).
Integral Geometry and Geometric Probability.
London: Addison-Wesley.- J.Sapp, F.Carrapio, M.Zolotonosov (2002). Symbiogenesis: The Hidden Face of ConstantinMerezhkowsky.
History and Philosophy of the Life Sciences,
24, 3-4, 413-440.- S.Sarkar (1988). Natural Selection, Hypercycles and the Origin of Life. Proceedings of theBiennial Meeting of the Philosophy of Science Association, Vol.1, 197-206. The University ofChicago Press.- L.J.Savage (1954):
The Foundations of Statistics.
Reprint 1972. NY: Dover.- L.J.Savage (1981).
The writings of Leonard Jimmie Savage: A memorial selection.
Instituteof Mathematical Statistics.- D.Schacter (2001).
Forgotten Ideas, Neglected Pioneers: Richard Semon and the Story ofMemory.
Philadelphia: Psychology Press.- D.L.Schacter, J.E.Eich, E.Tulving, (1978). Richard Semon’s Theory of Memory.
Journal ofVerbal Learning and Verbal Behavior,
17, 721-743.- J.D.Schaffer (1987). Some Effects of Selection Procedures on Hyperplane Sampling by GeneticAlgorithms. p. 89-103 in L.Davis (1987).- J.M.Schervich (1995).
Theory of Statistics.
Berlin, Springer.- M.Schlick (1920). Naturphilosophische Betrachtungen ¨uber das Kausalprintzip.
Die Naturwissenschaften , 8, 461-474. Translated as, Philosophical Reflections on the Causal Principle,ch.12, p.295-321, v.1 in M.Schlick (1979).- M.Schlick (1979). Philosophical Papers. Dordrecht: Reidel.- H.Scholl (1998). Shannon optimal priors on independent identically distributed statisticalexperiments converge weakly to Jeffreys’ prior.
Test , 7,1, 75-94.- J.W.Schooler (2002). Re-Representing Consciousness: Dissociations between Experience andMetaconsciousness.
Trends in Cognitive Sciences,
6, 8, 339-344.- A.Schopenhauer (1818, 1966).
The World as Will and Representation.
NY: Dover.- E.Schr¨odinger (1926). Quantisierung als Eigenwertproblem. (Quantisation as an EigenvalueProblem).
Annalen der Physic , 489, 79-79.
Physical Review , 28, 1049-1049.- E.Schr¨odinger (1945).
What Is Life?
Cambridge University Press.- G.Schwarz (1978). Estimating the Dimension of a Model.
Ann. Stat. , 6, 461-464.
EFERENCES - C.Scott (1958). G.Spencer-Brown and Probability: A Critique.
J.Soc. Psychical Research,
Proc. IEEE International Conference on Computer-Aided Design , 478-481.- L.Segal (2001).
The Dream of Reality. Heintz von Foerster’s Constructivism.
NY: Springer.- R.W.Semon (1904).
Die Mneme . Leipzig: W. Engelmann. Translated (1921),
The Mneme .London: Allen and Unwin.- R.W.Semon (1909).
Die Mnemischen Empfindungen . Leipzig: Leipzig: W.Engelmann. Trans-lated (1923),
Mnemic psychology.
London: Allen and Unwin.- S.K.Sen, T.Samanta, A.Reese (2006). Quasi Versus Pseudo Random Generators: Discrep-ancy, Complexity and Integration-Error Based Comparisson.
Int.J. of Innovative Computing,Information and Control,
2, 3, 621-651.- S.Senn (1994). Fisher’s game with the devil.
Statistics in Medicine,
13, 3, 217-230.- G.Shafer (1982), Lindley’s Paradox.
J. American Statistical Assoc. , 77, 325–51.- G.Shafer, V.Vovk (2001).
Probability and Finance, It’s Only a Game!
NY: Wiley.- B.V.Shah, R.J.Buehler, O.Kempthorne (1964). Some Algorithms for Minimizing a Functionof Several Variables.
J. Soc. Indust. Appl. Math.
12, 74–92.- J.Shedler, D.Westen (2004). Dimensions of Personality Pathology: An Alternative to theFive-Factor Model.
American Journal of Psychiatry , 161, 1743-1754.- J.Shedler, D.Westen (2005). Reply to T.A.Widiger, T.J.Trull. A Simplistic Understanding ofthe Five-Factor Model
American J.of Psychiatry , 162,8, 1550-1551.- H.M.Sheffer (1913). A Set of Five Independent Postulates for Boolean Algebras, with Appli-cation to Logical Constants.
Trans. Amer. Math. Soc. , 14, 481-488.- Y.Shi (2001).
Swarm Intelligence.
Morgan Kaufmann.- R.Shibata (1981). An Optimal Selection of Regression Variables.
Biometrika,
68, 45–54.- G.Shwartz (1978). Estimating the Dimension of a Model.
Annals of Statistics,
6, 461–464.- B.Simon (1996).
Representations of Finite and Compact Groups.
AMS Graduate Studies inMathematics, v.10.- H.A.Simon (1996).
The Sciences of the Artificial . MIT Press.- E.H.Simpson (1951). The Interpretation of Interaction in Contingency Tables.
Journal of theRoyal Statistical Society , Ser.B, 13, 238-241.- S.Singh, M.K.Singh (2007). ‘Impossible Trinity’ is all about Problems of Choice: Of the ThreeOptions of a Fixed Exchange Rate, Free Capital Movement, and an Independent MonetaryPolicy: One can choose only two at a time. LiveMint.com, The Wall Street Journal. Posted:Mon, Nov 5 2007. 12:30 AM IST - J.Skilling (1988). The Axioms of MaximumEntropy. Maximum-Entropy and BayesianMethodsin Science and Engineering, G. J. Erickson and C. R. Smith (eds.) Dordrecht: Kluwer.- J.E.Smith (2007). Coevolving Memetic Algorithms: A Review and Progress Report.
IEEETransactions on Systems Man and Cybernetics, part B, 37, 1, 6-17.- P.J.Smith, E.Gunel (1984). Practical Bayesian Approaches to the Analysis of 2x2 ContingencyTable with Incompletely Categorized Data.
Communication of Statistics - Theory and Methods ,13, 1941-63.- C.Skinner, R.Chambers (2003).
Analysis of survey Data , New York: Wiley, 175-195.- S.G.Soal, F.J.Stratton, R.H.Thouless (1953). Statistical Significance in Psychical Research.
Nature,
Hierarchquia Auto-Organizada em Sistemas Biol´ogicos. REFERENCES p.153-173 in D’Otaviano and Gonzales (2000).- J.C.Spall (2003).
Introduction to Stochastic Search and Optimization.
Hoboken: Wiley.- G.Spencer-Brown (1953a). Statistical Significance in Psychical Research.
Nature,
Nature,
Probability and Scientific Inference.
London: Longmans Green.- G.Spencer-Brown (1969).
Laws of Form.
Allen and Unwin.- M.D.Springer (1979).
The Algebra of Random Variables.
NY: Wiley.- F.Steier, edt. (1991)
Research and Reflexivity.
SAGE Publications.- M.Stephens (1997).
Bayesian Methods for Mixtures of Normal Distributions.
Oxford Univer-sity.- J.Stenmark, C.S.P.Wu (2004). Simpsons Paradox, Confounding Variables and InsuranceRatemaking.- C.Stern (1959). Variation and Hereditary Transmission.
Proceedings of the American Philo-sophical Society , 103, 2, 183-189.- J.M.Stern (1992). Simulated Annealing with a Temperature Dependent Penalty Function.
ORSA Journal on Computing,
4, 311-319.- J.M.Stern (1994).
Esparsidade, Estrutura, Estabilidade e Escalonamento em ´Algebra LinearComputacional.
Recife: UFPE, IX Escola de Computa¸c˜ao.- J.M.Stern (2001) The Full Bayesian Significant Test for the Covariance Structure Problem.Proc.
ISAS-01, Int.Conf.on Systems Analysis and Synthesis,
7, 60-65.- J.M.Stern (2003a). Significance Tests, Belief Calculi, and Burden of Proof in Legal and Sci-entific Discourse. Laptec-2003,
Frontiers in Artificial Intelligence and its Applications,
Lecture Notes Artificial Intelligence,
American Institute of Physics Proceedings , 735, 581–588.- J.M.Stern (2006a). Decoupling, Sparsity, Randomization, and Objective Bayesian Inference.Tech.Rep. MAC-IME-USP-2006-07.- J.M.Stern (2006b). Language, Metaphor and Metaphysics: The Subjective Side of Science.Tech.Rep. MAC-IME-USP-2006-09.- J.M.Stern (2007a). Cognitive Constructivism, Eigen-Solutions, and Sharp Statistical Hypothe-ses.
Cybernetics and Human Knowing , 14, 1, 9-36. Early version in Proceedings of FIS-2005,61, 1–23. Basel: MDPI.- J.M.Stern (2007b). Language and the Self-Reference Paradox.
Cybernetics and Human Know-ing , 14, 4, 71-92.- J.M.Stern (2007c). Complex Structures, Modularity and Stochastic Evolution. Tech.Rep.IME-USP MAP-0701.- J.M.Stern (2008a). Decoupling, Sparsity, Randomization, and Objective Bayesian Inference.
Cybernetics and Human Knowing , 15, 2, 49-68.- J.M.Stern (2008b).
Cognitive Constructivism and the Epistemic Significance of Sharp Statisti-cal Hypotheses.
Tutorial book for MaxEnt 2008, The 28th International Workshop on BayesianInference and Maximum Entropy Methods in Science and Engineering. July 6-11 of 2008, Bo-rac´eia, S˜ao Paulo, Brazil.- J.M.Stern (2011). Spencer-Brown vs. Probability and Statistics: Entropys Testimony onSubjective and Objective Randomness.
Information , 2, 2, 277-301.
EFERENCES - J.M.Stern (2011). Constructive Verification, Empirical Induction, and Falibilist Deduction: AThreefold Contrast.
Information , 2, 635-650- J.M.Stern (2011). Symmetry, Invariance and Ontology in Physics and Statistics.
Symmetry ,3, 3, 611-635.- J.M.Stern, C.Dunder, M.S.Laureto, F.Nakano, C.A.B.Pereira, C.O.Ribeiro (2006).
Otimiza¸c˜aoe Processos Estoc´asticos Aplicados `a Economia e Finan¸cas.
S˜ao Paulo: IME-USP.- J.M.Stern, M.S.Lauretto, A.Polpo, M.A.Diniz (2012). EBEB 2012 - XI Brazilian Meeting onBayesian Statistics. AIP Conference Proceedings v.1490. Melville, NY: American Institute ofPhysics.- J.M.Stern, C.O.Ribeiro, M.S.Lauretto, F.Nakano (1998). REAL: Real Attribute LearningAlgorithm.
Proc. ISAS/SCI-98
2, 315–321.- J.M.Stern, S.A.Vavasis (1994). Active Set Algorithms for Problems in Block Angular Form.
Computational and Applied Mathemathics , 12, 3, 199-226.- J.M.Stern, S.A.Vavasis (1993). Nested Dissection for Sparse Nullspace Bases.
SIAM Journalon Matrix Analysis and Applications,
14, 3, 766-775.- J.M.Stern, S.Zacks (2002). Testing Independence of Poisson Variates under the Holgate Bi-variate Distribution. The Power of a New Evidence Test.
Statistical and Probability Letters , 60,313–320.- J.M.Stern, S.Zacks (2003).
Sequential Estimation of Ratios, with Applications to BayesianAnalysis.
Tech. Rep. RT-MAC-2003-10.- R.B.Stern (2007).
An´alise da Responsabilidade Civil do Estado com base nos Princ´ıpios daIgualdade e da Legalidade.
Graduation Thesis. Faculdade de Direito da Pontif´ıcia UniversidadeCat´olica de S˜ao Paulo.- R.B.Stern, C.A.B.Pereira (2008). A Possible Foundation for Blackwell’s Equivalence.
AIPConference Proceedings, v. 1073, 90-95.- S.M.Stigler (1978). Mathematical Statistics in the Early States.
The Annals of Statistics,
The History of Statistics: The Measurement of Uncertainty before 1900 .Harvard Univ.Press.- M.St¨oltzner (2003). The Principle of Least Action as the Logical Empiricist’s Shibboleth.
Studies in History and Philosophy of Modern Physics , 34, 285-318.- R.D.Stuart (1966).
An Introduction to Fourier Analysis . London: Methuen.- M.N.S.Swamy, K.Thulasiraman (1981).
Graphs, Networks and Algorithms . NY: Wiley.- L Szilard (1929). ¨Uber die Entropieverminderung in einem Thermodynamischen System beiEingriffen Intelligenter Wesen.
Zeitschrift f¨ur Physik , 53, 840.- Taenzer, Ganti, and Podar (1989). Object-Oriented Software Reuse: The Yoyo Problem.
Journal of Object-Oriented Programming,
2, 3, 30-35.- H.Takayasu (1992).
Fractals in Physical Science.
NY: Wiley.- L.Tarasov (1988).
The World is Built on Probability.
Moscow: MIR.- L.Tarasov (1986).
This Amazingly Symmetrical World.
Moscow: MIR.- M.Teboulle (1992). Entropic Proximal Mappings with Applications to Nonlinear Programming.
Mathematics of operations Research,
17, 670-690.- L.C.Thomas (1986).
Games, Theory and Applications.
Chichester, England: Ellis Horwood.- C.J.Thompson (1972).
Mathematical Statistical Mechanics.
Princeton University Press.- G.L.Tian, K.W.Ng, Z.Geng (2003). Bayesian Computation for Contingency Tables with In-complete Cells-Counts.
Statistica Sinica , 13, 189-206.- W.Tobin (1993). Toothed Wheels and Rotating Mirrors.
Vistas in Astronomy , 36, 253-294. REFERENCES - S.Tomonaga (1962).
Quantum Mechanics. V.1 Old Quantum Theory; V.2, New quantumtheory.
North Holland and Interscience Publishers.- C.A. Tovey (1988).
Simulated Simulated Annealing . In Johnson (1988).- C.G.Tribble (2008). Industry-Sponsored Negative Trials and the Potential Pitfalls of Post HocAnalysis.
Arch Surg,
Thermostatics and Thermodynamics: An Introduction to Energy, Informa-tion and States of Matter, with Engineering Applications . Princeton: van Nostrand.- M.Tribus, E.C.McIrvine (1971). Energy and Information.
Scientific American , 224, 178-184.- P.K.Trivedi, D.M.Zimmer (2005).
Copula Modeling: An Introduction for Practitioners.
Boston:NOW.- C.Tsallis (2001). Nonextensive Statistical Mechanics and Termodynamics: Historical Back-ground and Present Status. p. 3-98 in Abe and Okamoto (2001)- S.M.Ulam (1943). What is a Measure?
The American Mathematical Monthly.
50, 10, 597-602.- S.Unger, F.Wysotzki (1981).
Lernfachige Klassifizirungssysteme.
Berlin: Akademie Verlag.- J.Uffink (1995). The Constraint Rule of the Maximum Entropy Principle.
Studies in theHistory and Philosophy of Modern Physics,
Studies in the History and Philosophy of Modern Physics,
27, 47-79.- J.Utts (1991). Replication and Meta-Analysis in Parapsychology.
Statistical Science,
6, 4,363-403, with comments by M.J.Bayarri, J.Berger, R.Dawson, P.Diaconis, J.B.Greenhouse,R.Hayman. R.L.Morris and F.Mosteller.- I.V´ag´o (1985).
Graph Theory: Applications to the Calculation of Electrical Networks.
Ams-terdam: Elsevier.- V.N.Vapnik (1995).
The Nature of Statistical Learning Theory.
NY: Springer.- V.N.Vapnik (1998).
Statistical Learning Theory: Inference for Small Samples.
NY: Wiley.- F.Varela (1978).
Principles of Biological Autonomy.
North-Holland.- A.M.Vasilyev (1980).
An Introduction to Statistical Physics.
Moscow: MIR.- M.Vega Rod´ıguez, (1998). La Actividad Metaf´orica: Entre la Raz´on Calculante y la Raz´onIntuitiva.
Es´eculo, Revista de estudios literarios . Madrid: Universidad Complutense.- E.S.Ventsel (1980).
Elements of Game Theory.
Moscow: MIR.- M.Viana (2003).
Symmetry Studies, An Introduction.
Rio de Janeiro: IMPA.- B.Vidakovic (1999).
Statistical Modeling by Wavelets.
Wiley-Interscience.- F.S.Vieira, C.N.El-Hani (2009). Emergence and Downward Determination in the NaturalSciences.
Cybernetics and Human Knowing,
15, 101-134.- R.Viertl (1987).
Probability and Bayesian Statistics.
NY: Plenum.- M.Vidyasagar (1997).
A Theory of Learning and Generalization.
Springer, London.- H.M.Voigt, H.Muehlenbein, H.P.Schwefel (1989).
Evolution and Optimization.
Berlin: AkademieVerlag.- H.A.van der Vorst, P.van Dooren, eds. (1990).
Parallel Algorithms for Numerical LinearAlgebra.
Amsterdam: North-Holland.- Hugo de Vries (1889).
Intracellular Pangenesis Including a paper on Fertilization and Hy-bridization . Translated by C.S.Gager (1910). Chicago: The Open Court Publishing Co.- H.de Vries (1900). Sur la loi de disjonction des hybrides.
Comptes Rendus de l’Academie desSciences , 130, 845-847. Translated as Concerning the law of segregation of hybrids.
Genetics ,(1950), 35, 30-32.- S.Walker(1986). A Bayesian Maximum Posteriori Algorithm for Categorical Data under In-
EFERENCES formative General Censoring.
The Statistician , 45, 293-8.- C.S.Wallace, D.M.Boulton (1968), An Information Measure for Classification.
Computer Jour-nal , 11,2, 185-194.- C.S.Wallace, D.Dowe (1999). Minimum Message Length and Kolmogorov Complexity.
Com-puter Journal , 42,4, 270-283.- C.S.Wallace (2005).
Statistical and Inductive Inference by Minimum Message Lenght.
NY:Springer.- W.A.Wallis (1980). The Statistical Research Group, 1942-1945.
Journal of the AmericanStatistical Association , 75, 370, 320-330.- R.Wang, S.W.Lagakos,J.H.Ware, D.J.Hunter, J.M.Drazen (2007). Statistics in Medicine -Reporting of Subgroup Analyses in Clinical Trials
The New England Journal of Medicine,
The British Journal for the Philosophy of Science,
6, 22, 122-140.- L.Wasserman (2004).
All of Statistics: A Concise Course in Statistical Inference.
NY:Springer.- L.Wasserman (2005).
All of Nonparametric Statistics.
NY: Springer.- S.Wechsler, L.G.Esteves, A.Simonis, C.Peixoto (2005). Indiference, Neutrality and Informa-tiveness: Generalizing the Three Prisioners Paradox.
Synthese , 143, 255-272.- S.Wechsler, C.A.B.Pereira, P.C.Marques (2008). Birnbaum’s Theorem Redux.
AIP ConferenceProceedings,
Management Science , 18, 1, 98-108.- M.Weliky, G.Oster (1990). The Mechanical Basis of Cell Rearrangement.
Development,
Communities of Practice: Learning, Meaning,and Identity.
Cambridge Univ. Press.- S.R. White (1984). Concepts of Scale in Simulated Annealing.
American Institute of PhysicsConference Proceedings , 122, 261-270.- T.A.Widiger, E.Simonsen (2005). Alternative Dimensional Models of Personality Disorder:Finding a Common Ground.
J.of Personality Disorders , 19, 110-130- F.W.Wiegel (1986).
Introduction to Path-Integral Methods in Physics and Polymer Science.
Singapore: World Scientific.- E.P.Wigner (1960). The Unreasonable Effectiveness of Mathematics in the Natural Sciences.
Communications in Pure and Applied mathematics,
Symmetries and Reflections.
Bloomington: Indiana University Press.- D.Williams (2001)
Weighing the Odds.
Cambridge Univ. Press.- R.C.Williamson (1989).
Probabilistic Arithmetic.
Univ. of Queensland.- S.S.Wilks (1962).
Mathematical Statistics.
NY: Wiley.- R.L.Winkler (1975).
Statistics: Probability, Inference, and Decision.
Harcourt School.- T.Winograd, F.Flores (1987).
Understanding Computers and Cognition: A New Foundationfor Design
NY: Addison-Wesley. - H.Weyl (1952).
Symmetry.
Princeton Univ. Press.- R.G.Winther (2000). Darwin on Variation and Hereditarity.
Journal of the History of Biology ,33, 425-455.- A.Wirfs-Brock, B.Wilkerson (1989). Variables Limit Reusability,
Journal of Object-OrientedProgramming,
2, 1, 34-40.- D.A.Wismer, ed. (1971).
Optimization Methods for Large-Scale Systems with Applications. REFERENCES
NY: McGaw-Hill.- S.Wi´sniewski, B.Staniszewski, R.Szymanik (1976). Thermodynamics of Nonequilibrium Pro-cesses. Dirdrecht: Reidel.- L.Wittgenstein (1921).
Tractatus Logico Philosophicus (Logisch-Philosophische Abhandlung).(Ed.1999) NY: Dover.- L.Wittgenstein (1953).
Philosophische Untersuchungen.
Philosophical Investigations, englishtransl. by G.E.M.Anscombe. Oxford: Blackwell.- P.Wolfe (1959). The Simplex Method for Quadratic Programming.
Econometrica , 27, 383–398.- W.Yourgrau S.Mandelstam (1979).
Variational Principles in Dynamics and Quantum Theory.
NY: Dover.- S.Youssef (1994). Quantum Mechanics as Complex Probability Theory.
Mod. Physics Lett.
A, 9, 2571-2586.- S.Youssef (1995). Quantum Mechanics as an Exotic Probability Theory. Proceedings of the Fif-teenth International Workshop on Maximum Entropy and Bayesian Methods, ed. K.M.Hansonand R.N.Silver, Santa Fe.- S.L. Zabell (1992). The Quest for Randomness and its Statistical Applications. In E.Gordon,S.Gordon (Eds.), Statistics for the Twenty-First Century (pp. 139-150). Washington, DC:Mathematical Association of America.- L.A.Zadeh (1987).
Fuzzy Sets and Applications . NY: Wiley.- A.Zahavi (1975). Mate selection: A selection for a handicap.
Journal of Theoretical Biology ,53, 205-214.- W.I.Zangwill (1969).
Nonlinear Programming: A Unified Approach.
NY: Prentice-Hall.- W.I.Zangwill, C.B.Garcia (1981).
Pathways to Solutions, Fixed Points, and Equilibria.
NY:Prentice-Hall.- M.Zelleny (1980).
Autopoiesis, Dissipative Structures, and Spontaneous Social Orders.
Wash-ington: American Association for the Advancement of Science.- A.Zellner (1971).
Introduction to Bayesian Inference in Econometrics.
NY:Wiley.- A.Zellner (1982). Is Jeffreys a Necessarist?
American Statistician , 36, 1, 28-30.- H.Zhu (1998).
Information Geometry, Bayesian Inference, Ideal Estimates and Error Decom-position.
Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501.- V.I.Zubov (1983).
Analytical Dynamics of Systems of Bodies.
Leningrad Univ.- M.A.Zupan (1991). Paradigms and Cultures: Some Economic Reasons for Their Stickness.
The American Journal of Economics and Sociology , 50, 99-104. ppendix AFBST Review “(A) man’s logical method should be loved and reverenced ashis bride, whom he has chosen from all the world. He need notcontemn the others; on the contrary, he may honor them deeply,and in doing so he honors her more. But she is the one that hehas chosen, and he knows that he was right in making that choice.”
C.S.Peirce (1839 - 1914),The Fixation of Belief (1877). “Make everything as simple as possible, but not simpler.”
Albert Einstein (1879 - 1955).
A.1 Introduction
The FBST was specially designed to give a measure of the epistemic value of a sharpstatistical hypothesis H , given the observations, that is, to give a measure of the valueof evidence in support of H given by the observations. This measure is given by thesupport function ev ( H ), the FBST e-value . Furthermore the e-value has many necessaryor desirable properties for a statistical support function, such as:(I) Give an intuitive and simple measure of significance for the hypothesis in test,ideally, a probability defined directly in the original or natural parameter space .(II) Have an intrinsically geometric definition, independent of any non-geometric as-pect, like the particular parameterization of the (manifold representing the) null hypoth-esis being tested, or the particular coordinate system chosen for the parameter space, i.e.,be an invariant procedure.(III) Give a measure of significance that is smooth, i.e. continuous and differentiable ,24546 APPENDIX A. FBST REVIEW on the hypothesis parameters and sample statistics, under appropriate regularity condi-tions for the model.(IV) Obey the likelihood principle , i.e., the information gathered from observationsshould be represented by, and only by, the likelihood function, see Berger and Wolpert(1988), Pawitan (2001, ch.7) and Wechsler et al. (2008).(V) Require no ad hoc artifice like assigning a positive prior probability to zero measuresets, or setting an arbitrary initial belief ratio between hypotheses.(VI) Be a possibilistic support function, where the support of a logical disjunction isthe maximum support among the support of the disjuncts.(VII) Be able to provide a consistent test for a given sharp hypothesis.(VIII) Be able to provide compositionality operations in complex models.(IX) Be an exact procedure, i.e., make no use of “large sample” asymptotic approxi-mations when computing the e -value.(X) Allow the incorporation of previous experience or expert’s opinion via (subjective) prior distributions .The objective of this section is to provide a very short review of the FBST theoreticalframework, summarizing the most important statistical properties of its support function,the e -value. It also summarizes the logical (algebraic) properties of the e -value, andits relations to other classical support calculi, including possibilistic calculus and logic,paraconsistent and classical. Further details, demonstrations of theoretical properties,comparison with other statistical tests for sharp hypotheses, and an extensive list ofreferences can be found in the author’s previous papers. A.2 Bayesian Statistical Models
A standard model of (parametric) Bayesian statistics concerns an observed (vector) ran-dom variable, x , that has a sampling distribution with a specified functional form, p ( x | θ ),indexed by the (vector) parameter θ . This same functional form, regarded as a function ofthe free variable θ with a fixed argument x , is the model’s likelihood function. In frequen-tist or classical statistics, one is allowed to use probability calculus in the sample space,but strictly forbidden to do so in the parameter space, that is, x is to be considered asa random variable, while θ is not to be regarded as random in any way. In frequentiststatistics, θ should be taken as a ‘fixed but unknown quantity’ (whatever that means).In the Bayesian context, the parameter θ is regarded as a latent (non-observed) randomvariable. Hence, the same formalism used to express credibility or (un)certainty, namely,probability theory, is used in both the sample and the parameter space. Accordingly, thejoint probability distribution, p ( x, θ ) should summarize all the information available in a .2. BAYESIAN STATISTICAL MODELS x and θ can be factorized either as the likelihood function of the parameter given theobservation times the prior distribution on θ , or as the posterior density of the parametertimes the observation’s marginal density, p ( x, θ ) = p ( x | θ ) p ( θ ) = p ( θ | x ) p ( x ) . The prior probability distribution p ( θ ) represents the initial information availableabout the parameter. In this setting, a predictive distribution for the observed randomvariable, x , is represented by a mixture (or superposition) of stochastic processes, all ofthem with the functional form of the sampling distribution, according to the prior mixing(or weights) distribution, p ( x ) = (cid:90) θ p ( x | θ ) p ( θ ) dθ . If we now observe a single event, x , it follows from the factorizations of the joint dis-tribution above that the posterior probability distribution of θ , representing the availableinformation about the parameter after the observation, is given by p ( θ ) ∝ p ( x | θ ) p ( θ ) . In order to replace the ‘proportional to’ symbol, ∝ , by an equality, it is necessary todivide the right hand site by the normalization constant, c = (cid:82) θ p ( x | θ ) p ( θ ) dθ . This isthe Bayes rule , giving the (inverse) probability of the parameter given the data. That isthe basic learning mechanism of Bayesian statistics. Computing normalization constantsis often difficult or cumbersome. Hence, especially in large models, it is customary towork with unormalized densities or potentials as long as possible in the intermediatecalculations, computing only the final normalization constants. It is interesting to observethat the joint distribution function, taken with fixed x and free argument θ , is a potentialfor the posterior distribution.Bayesian learning is a recursive process, where the posterior distribution after a learn-ing step becomes the prior distribution for the next step. Assuming that the observationsare i.i.d. (independent and identically distributed) the posterior distribution after n ob-servations, x (1) , . . . x ( n ) , becomes, p n ( θ ) ∝ p ( x ( n ) | θ ) p n − ( θ ) ∝ (cid:89) ni = i p ( x ( i ) | θ ) p ( θ ) . 
If possible, it is very convenient to use a conjugate prior , that is, a mixing distributionwhose functional form is invariant by the Bayes operation in the statistical model at hand.For example, the conjugate priors for the Normal and Multivariate models are, respec-tively, Wishart and the Dirichlet distributions. The explicit form of these distributions isgiven in the next sections.48
APPENDIX A. FBST REVIEW
The ‘beginings and the endings’ of the Bayesian learning process really need furtherdiscussion, that is, we should present some rationale for choosing the prior distributionused to start the learning process, and some convergence theorems for the posterior asthe number observations increases. In order to do so, we must access and measure theinformation content of a (posterior) distribution. Appendix E is dedicated to the conceptof entropy, the key that unlocks many of the mysteries related to the problems at hand. Inparticular, Sections E.5 and E.6 discuss some fine details about criteria for prior selectionand posterior convergence properties.
A.3 The Epistemic e -values Let θ ∈ Θ ⊆ R p be a vector parameter of interest, and p ( x | θ ) be the likelihood associatedto the observed data x , as in the standard statistical model. Under the Bayesian paradigmthe posterior density, p n ( θ ), is proportional to the product of the likelihood and a priordensity, p n ( θ ) ∝ p ( x | θ ) p ( θ ) . The (null) hypothesis H states that the parameter lies in the null set, defined byinequality and equality constraints given by vector functions g and h in the parameterspace. Θ H = { θ ∈ Θ | g ( θ ) ≤ ∧ h ( θ ) = } From now on, we use a relaxed notation, writing H instead of Θ H . We are particularlyinterested in sharp (precise) hypotheses, i.e., those in which there is at least one equalityconstraint and hence, dim( H ) < dim(Θ).The FBST defines ev ( H ), the e -value supporting (in favor of) the hypothesis H , andev ( H ), the e -value against H , as s ( θ ) = p n ( θ ) r ( θ ) , s ∗ = s ( θ ∗ ) = sup θ ∈ H s ( θ ) , (cid:98) s = s ( (cid:98) θ ) = sup θ ∈ Θ s ( θ ) ,T ( v ) = { θ ∈ Θ | s ( θ ) ≤ v } , W ( v ) = (cid:90) T ( v ) p n ( θ ) dθ , ev ( H ) = W ( s ∗ ) ,T ( v ) = Θ − T ( v ) , W ( v ) = 1 − W ( v ) , ev ( H ) = W ( s ∗ ) = 1 − ev ( H ) . The function s ( θ ) is known as the posterior surprise relative to a given referencedensity, r ( θ ). W ( v ) is the cumulative surprise distribution. The surprise function wasused, among other statisticians, by Good [23], Evans [16] and Royall [48]. Its role inthe FBST is to make ev ( H ) explicitly invariant under suitable transformations on thecoordinate system of the parameter space, see next section. .3. THE EPISTEMIC E -VALUES T = T ( s ∗ ), is a Highest Relative Surprise Set(HRSS). It contains the points of the parameter space with higher surprise, relative tothe reference density, than any point in the null set H . When r ( θ ) ∝
1, the possiblyimproper uniform density, T is the Posterior’s Highest Density Probability Set (HDPS)tangential to the null set H . Small values of ev ( H ) indicate that the hypothesis traverseshigh density regions, favoring the hypothesis.Notice that, in the FBST definition, there is an optimization step and an integrationstep. The optimization step follows a typical maximum probability argument, according towhich, “a system is best represented by its highest probability realization”. The integra-tion step extracts information from the system as a probability weighted average. Manyinference procedures of classical statistics rely basically on maximization operations, whilemany inference procedures of Bayesian statistics rely on integration (or marginalization)operations. In order to achieve all its desired properies, the FBST procedure has to useboth, as explained in this appendix.The evidence value, defined above, has a simple and intuitive geometric characteriza-tion. We now illustrate the above definitions with two simple but non-trivial examples.These two exemples are easy to visualize, since they have a two dimensional parameterspace, and are also non-trivial, in the sense that they have a non-linear hypothesis. Coefficient of Variation
The Coefficient of Variation (CV) of a random variable X is defined as the ratio CV ( X ) = σ ( X ) /E ( X ), i.e. the ratio of its standard deviation to its mean. Let X be a normalrandom variable, with unknown mean and variance. We want to compute the evidencevalue supporting the hypothesis that the coefficient of variation of X is equal to a givenconstant, X ∼ N ( β, σ ) , H : σ/β = c The conjugate family for this problem is the family of bivariate distributions, wherethe conditional distribution of the mean β , for a fixed precision ρ = 1 /σ , is normal,and the marginal distribution of the precision ρ is gamma, DeGroot (1970). Using thestandard improper priors, uniform on ] − ∞ , + ∞ [ for β , and 1 /ρ on ]0 , + ∞ [ for ρ , we getthe posterior joint distribution for β and ρ : p n ( β, ρ | x ) ∝ √ ρ exp ( − nρ ( β − ¯ x ) / ρ n − exp ( − ρsn/ x = [ x . . . x n ] , ¯ x = 1 n n (cid:88) i =1 x i , s = n (cid:88) i =1 ( x i − ¯ x ) Figure A.1 shows the null set H , the tangential HRSS T , and the points of constrainedand unconstrained maxima, θ ∗ and (cid:98) θ , for testing the hypothesis at hand with the following50 APPENDIX A. FBST REVIEW po s t e r i o r p r e c i s i on , r n=16 m=10 c=0.1std=1.0 evid=0.93 m n=16 m=10 c=0.1std=1.1 evid=0.67 n=16 m=10 c=0.1std=1.5 evid=0.01 Figure A.1: FBST for H: CV=0.1numerical example: CV = 0 . n = 16, mean ¯ x = 10 and standarddeviations std = 1 . std = 1 . std = 1 .
5. We can see the tangent set expanding asthe sample standard deviation over mean ratio gets farther away from the coefficient ofvariation being tested, CV ( X ) = σ ( X ) /E ( X ) = 0 .
1. In this example we use the standardimproper prior density and the uniform reference density. In the first plot, the samplestandard deviation over mean ratio equals the coefficient of variation tested. Nevertheless,the evidence against the null hypothesis is not zero; this is because of the non uniformprior. In order to test other hypotheses we only have to change the constraint(s) passedto the optimizer. Constraints for the hypothesis β = c and σ = c would be representedby, respectively, vertical and horizontal lines. All the details for these and other simpleexamples, as well as comparisons with standard frequentist and Bayesian tests, can befound in Irony et al. (2001), Pereira and Stern (1999b, 2000a,b) and Pereira and Wechsler(1993). Hardy-Weinberg equilibrium
Figure A.2 shows the null set H , the tangential HRSS T , and the points of constrainedand unconstrained maxima, θ ∗ and (cid:98) θ , for testing Hardy-Weinberg equilibrium law in apopulation genetics problem, as discussed in Pereira and Stern (1999). In this biologicalapplication n is the sample size, x and x are the two homozygote sample counts and .4. REFERENCE, INVARIANCE AND CONSISTENCY HTo θ * Figure A.2: H-W Hypothesis and Tangential Set x = n − x − x is heterozygote sample count. θ = [ θ , θ , θ ] is the parameter vector.The posterior and maximum entropy reference densities for this trinomial model, theparameter space and the null set are: p n ( θ | x ) ∝ θ x + y − θ x + y − θ x + y − , r ( θ ) ∝ θ y − θ y − θ y − , y = [0 , , , Θ = { θ ≥ | θ + θ + θ = 1 } , H = { θ ∈ Θ | θ = (1 − (cid:112) θ ) } . Nuisance Parameters
Let us consider the situation where the hypothesis constraint, H : h ( θ ) = h ( δ ) = 0 , θ =[ δ, λ ] is not a function of some of the parameters, λ . This situation is described by D.Basuin Ghosh (1988): “If the inference problem at hand relates only to δ , and if informationgained on λ is of no direct relevance to the problem, then we classify λ as theNuisance Parameter. The big question in statistics is: How can we eliminatethe nuisance parameter from the argument?” Basu goes on listing at least 10 categories of procedures to achieve this goal, like using max λ or (cid:82) dλ , the maximization or integration operators, in order to obtain a projectedprofile or marginal posterior function, p ( δ | x ). The FBST does not follow the nuisanceparameters elimination paradigm, working in the original parameter space, in its fulldimension. A.4 Reference, Invariance and Consistency
In the FBST the role of the reference density, r ( θ ) is to make ev ( H ) explicitly invariantunder suitable transformations of the coordinate system. The natural choice of reference52 APPENDIX A. FBST REVIEW density is an uninformative prior, interpreted as a representation of no information inthe parameter space, or the limit prior for no observations, or the neutral ground statefor the Bayesian operation. Standard (possibly improper) uninformative priors includethe uniform and maximum entropy densities, see Dugdale (1996) and Kapur (1989) for adetailed discussion. Invariance, as used in statistics, is a metric concept. The referencedensity can be interpreted as induced by the information metric in the parameter space, dl = dθ (cid:48) G ( θ ) dθ . Jeffreys’ invariant prior is given by p ( θ ) = (cid:112) det G ( θ ), see Section E.5.In the H-W example, using the notation above, the uniform density can be representedby y = [1 , ,
1] observation counts, and the standard maximum entropy density can berepresented by y = [0 , ,
0] observation counts.Let us consider the cumulative distribution of the evidence value against the hypoth-esis, V ( c ) = Pr( ev ≤ c ), given θ , the true value of the parameter. Under appropriateregularity conditions, for increasing sample size, n → ∞ , we can say the following:- If H is false, θ / ∈ H , then ev converges (in probability) to 1, that is, V (0 ≤ c < → H is true, θ ∈ H , then V ( c ), the confidence level, is approximated by the function QQ ( t, h, c ) = Q (cid:0) t − h, Q − ( t, c ) (cid:1) , whereQ( k, x ) = Γ( k/ , x/ k/ , ∞ ) , Γ( k, x ) = (cid:90) x y k − e − y dy ,t = dim(Θ), h = dim( H ) and Q( k, x ) is the cumulative chi-square distribution with k degrees of freedom. Figure A.3 portrays QQ ( t, h, c ) Q( t − h, Q − ( t, c )) for t = 2 . . . h = 0 . . . t − c ( n ), provides a consistent test, τ c , that rejects the hypothesis if ev ( H ) > c . Theempirical power analysis developed in Stern and Zacks (2002) and Lauretto et al. (2003),provides critical levels that are consistent and also effective for small samples. C o n f i d e n c e L e v e l t=2; h=0,1; t=3; h=0,1,2; t=4; h=0,1,2,3; Figure A.3: Test τ c critical level vs. confidence level .4. REFERENCE, INVARIANCE AND CONSISTENCY Proof of invariance:
Consider a proper (bijective, integrable, and almost surely continuously differentiable)reparameterization ω = φ ( θ ). Under the reparameterization, the Jacobian, surprise,posterior and reference functions are: J ( ω ) = (cid:20) ∂ θ∂ ω (cid:21) = (cid:20) ∂ φ − ( ω ) ∂ ω (cid:21) = ∂ θ ∂ ω . . . ∂ θ ∂ ω n ... . . . ... ∂ θ n ∂ ω . . . ∂ θ n ∂ ω n (cid:101) s ( ω ) = (cid:101) p n ( ω ) (cid:101) r ( ω ) = p n ( φ − ( ω )) | J ( ω ) | r ( φ − ( ω )) | J ( ω ) | Let Ω H = φ (Θ H ). It follows that (cid:101) s ∗ = sup ω ∈ Ω H (cid:101) s ( ω ) = sup θ ∈ Θ H s ( θ ) = s ∗ hence, the tangential set, T (cid:55)→ φ ( T ) = (cid:101) T , and (cid:101) ev( H ) = (cid:90) (cid:101) T (cid:101) p n ( ω ) dω = (cid:90) T p n ( θ ) dθ = ev ( H ) . Proof of consistency:
Let V ( c ) = Pr( ev ≤ c ) be the cumulative distribution of the evidence value againstthe hypothesis, given θ . We stated that, under appropriate regularity conditions, forincreasing sample size, n → ∞ , if H is true, i.e. θ ∈ H , then V ( c ), is approximated bythe function QQ ( t, h, c ) = Q (cid:0) t − h, Q − ( t, c ) (cid:1) . Let θ , (cid:98) θ and θ ∗ be the true value, the unconstrained MAP (Maximum A Posteriori),and constrained (to H ) MAP estimators of the parameter θ .Since the FBST is invariant, we can chose a coordinate system where, the (likeli-hood function) Fisher information matrix at the true parameter value is the identity,i.e., J ( θ ) = I . From the posterior Normal approximation theorem, see Section 5 of Ap-pendix E, we know that the standarized total difference between (cid:98) θ and θ converges indistribution to a standard Normal distribution, i.e. √ n ( (cid:98) θ − θ ) → N (cid:0) , J ( θ ) − J ( θ ) J ( θ ) − (cid:1) = N (cid:0) , J ( θ ) − (cid:1) = N (0 , I )This standarized total difference can be decomposed into tangent (to the hypothesismanifold) and transversal orthogonal components, i.e. d t = d h + d t − h , dt = √ n ( (cid:98) θ − θ ) , d h = √ n ( θ ∗ − θ ) , d t − h = √ n ( (cid:98) θ − θ ∗ ) . APPENDIX A. FBST REVIEW
Hence, the total, tangent and transversal distances ( L norms), || d t || , || d h || and || d t − h || ,converge in distribution to chi-square variates with, respectively, t , h and t − h degrees offreedom.Also from, the MAP consistency, we know that the MAP estimate of the Fisher infor-mation matrix, (cid:98) J , converges in probability to true value, J ( θ ).Now, if X n converges in distribution to X , and Y n converges in probability to Y , weknow that the pair [ X n , Y n ] converges in distribution to [ X, Y ]. Hence, the pair [ || d t − h || , (cid:98) J ]converges in distribution to [ x, J ( θ )], where x is a chi-square variate with t − h degreesof freedom. So, from the continuous mapping theorem, the evidence value against H ,ev ( H ), converges in distribution to e = Q( t, x ), where x is a chi-square variate with t − h degrees of freedom.Since the cumulative chi-square distribution is an increasing function, we can invertthe last formula, i.e., e = Q( t, x ) ≤ c ⇔ x ≤ Q − ( t, c ). But, since x in a chi-square variatewith t − h degrees of freedom,Pr( e ≤ c ) = QQ ( t, h, c ) = Q.E.D.A similar argument, using a non-central chi-square distribution, proves the other asymp-totic statement.If a random variable, x , has a continuous and increasing cumulative distribution func-tion, F ( x ), the random variable u = F ( x ) has uniform distribution. Hence, the tran-formation sev = QQ ( t, h, ev ), defines a “standarized e -value”, sev = 1 − sev, that canbe used somewhat in the same way as a p -value of classical statistics. This standarized e -value may be a convenient form to report, since its asymptotically uniform distributionprovides a large-sample limit interpretation, and many researchers will feel already fa-miliar with consequent diagnostic procedures for scientific hypotheses based on abundantempirical data-sets. A.5 Loss Functions
In orthodox decision theoretic Bayesian statistics, a significance test is legitimate if andonly if it can be characterized as an Acceptance (A) or Rejection (R) decision proceduredefined by the minimization of the posterior expectation of a loss function, Λ. Madruga(2001) gives the following family of loss functions characterizing the FBST. This lossfunction is based on indicator functions of θ being or not in the tangential set T :Λ( R, θ ) = a I ( θ / ∈ T ) , Λ( A, θ ) = b + d I ( θ ∈ T )The interpretation of this loss function is as follows: If θ ∈ T we want to reject H , for θ ismore probable than anywhere on H ; If θ ∈ T we want to accept H , for θ is less probable .6. BELIEF CALCULI AND SUPPORT STRUCTURES H . The minimization of this loss function gives the optimal test:Accept H iff ev ( H ) ≥ ϕ = ( b + c ) / ( a + c ) . Note that this loss function is dependent on the observed sample (via the likelihoodfunction), on the prior, and on the reference density, stressing the important point ofnon-separability of utility and probability, see Kadane and Winkler (1987) and Rubin(1987).This type of loss function can be easily adapted in order to provide an asymptotic in-dicator checking if the true parameter belongs to the hypothesis set, I ( θ ∈ H ). Considerthe tangential reference mass , m = (cid:20)(cid:90) T ( s ∗ ) r ( θ ) dθ (cid:21) γ If γ = 1, m is the reference density mass of the tangencial set. If γ = 1 /t , m is a pseudo-distance from (cid:98) θ to θ ∗ . Consider also a threshold of form ϕ = bm or ϕ = bm/ ( a + m ), a, b >
0, in the expression of the optimal test above.If θ / ∈ H , then (cid:98) θ → θ and θ ∗ → θ ∗ , where θ ∗ (cid:54) = θ , therefore || (cid:98) θ − θ ∗ || → c > p n , converges to a normal distribution centered on θ .Hence, m → c > ϕ → c >
0. Finally, since ev ( H ) →
0, Pr( ev ( H ) > ϕ ) → θ ∈ H , then (cid:98) θ → θ and θ ∗ → θ , therefore || (cid:98) θ − θ ∗ || →
0. Hence, m → ϕ →
0. But ev ( H ) converges to a propper distribution, see section A.3, and, therefore,Pr( ev ( H ) > ϕ ) → A.6 Belief Calculi and Support Structures
Many standard Belief Calculi can be formalized in the context of Abstract Belief Calcu-lus, ABC, see Darwiche and Ginsberg (1992), Darwiche (1993) and Stern (2003). In aSupport Structure, (cid:104) Φ , ⊕ , (cid:11)(cid:105) , the first element is a Support Function, Φ, on a universeof statements, U . Null and full support values are represented by and . The sec-ond element is a support Summation operator, ⊕ , and the third is a support Scaling orConditionalization operator, (cid:11) . A Partial Support Structure, (cid:104) Φ , ⊕(cid:105) , lacks the scalingoperation.The Support Summation operator, ⊕ , gives the support value of the disjunction ofany two logically disjoint statements from their individual support values, i.e., ¬ ( A ∧ B ) ⇒ Φ( A ∨ B ) = Φ( A ) ⊕ Φ( B ) . The support scaling operator updates an old state of belief to the new state of be-lief resulting from making an observation. Hence it can be interpreted as predicting or56
APPENDIX A. FBST REVIEW propagating changes of belief after a possible observation. Formally, the support scalingoperator, (cid:11) , gives the conditional support value of B given A from the unconditionalsupport values of A and the conjunction C = A ∧ B , i.e.,Φ A ( B ) = Φ( A ∧ B ) (cid:11) Φ( A ) . The support unscaling operator reconstitutes the old state of belief from a new stateof belief and the observation that has led to it. Hence it can be interpreted as explainingor back-propagating changes of belief for a given observation. If Φ does not reject A , thesupport unscaling operatior, ⊗ , gives the inverse of the scaling operator, i.e.,Φ( A ∧ B ) = Φ A ( B ) ⊗ Φ( A ) . Support structures for some standard belief calculi are given in Table A.1, where thesupport value of two statements their conjunction are given by a = Φ( A ), b = Φ( B ), c = Φ( C = A ∧ B ). In Table A.1, the relation a (cid:22) b indicates that the value a representsa stringer support than the value b . Darwiche and Ginsberg (1992) and Darwiche (1993)also give a set o axioms defining the essential functional properties of a (partial) supportfunction. Stern (2003) shows that the support Φ( H ) = ev ( H ) complies with all theseaxioms. Table A.1: Support structures for some belief calculi, a = Φ( A ), b = Φ( B ), c = Φ( C = A ∧ B ). Φ( U ) a ⊕ b a (cid:22) b c (cid:11) a a ⊗ b Calculus[0 , a + b a ≤ b c/a a × b Probability[0 ,
1] max( a, b ) 0 1 a ≤ b c/a a × b Possibility { , } max( a, b ) 0 1 a ≤ b min( c, a ) min( a, b ) Classic.Logic[0 , a + b − b ≤ a ( c − a ) / (1 − a ) a + b − ab Improbablty { .. ∞} min( a, b ) ∞ b ≤ a c − a a + b Disbelief
In the FBST, the support values, Φ( H ) = ev ( H ), are computed using standard prob-ability calculus on Θ which has an intrinsic conditionalization operator. The computedevidences, on the other hand, have a possibilistic summation, i.e., the value of evidencein favor of a composite hypothesis H = A ∨ B , is the most favorable value of evidence infavor of each of its terms, i.e., ev ( H ) = max { ev ( A ) , ev ( B ) } . It is impossible howeverto define a simple scaling operator for this possibilistic support that is compatible withthe FBST’s evidence, ev , as it is defined.Hence, two belief calculi are in simultaneous use in the FBST setup: ev ( H ) consti-tutes a possibilistic partial support structure coexisting in harmony with the probabilistic .7. SENSITIVITY AND INCONSISTENCY p n ( θ ), in the parameterspace, see Dubois et al. (1993), Delgado and Moral (1987).Requirements (V) and (VI), i.e. no ad hoc artifice and possibilistic support , find a richinterpretation in the juridical or legal context, where they correspond to the some of themost basic juridical principles, see Stern (2003). Onus Probandi is a basic principle of legal reasoning, also known as Burden of Proof,see Gaskins (1992) and Kokott (1998). It also manifests itself in accounting through theSafe Harbor Liability Rule: “There is no liability as long as there is a reasonable basis for belief, ef-fectively placing the burden of proof (Onus Probandi) on the plaintiff, who,in a lawsuit, must prove false a defendant’s misstatement, without makingany assumption not explicitly stated by the defendant, or tacitly implied by anexisting law or regulatory requirement.”
The Most Favorable Interpretation principle, which, depending on the context, is alsoknown as Benefit of the Doubt,
In Dubito Pro Reo , or Presumption of Innocence, isa consequence of the Onus Probandi principle, and requires the court to consider theevidence in the light of what is most favorable to the defendant. “Moreover, the party against whom the motion is directed is entitled tohave the trial court construe the evidence in support of its claim as truthful,giving it its most favorable interpretation, as well as having the benefit of allreasonable inferences drawn from that evidence.”
A.7 Sensitivity and Inconsistency
For a given prior, likelihood and reference density, let η = ev ( H ; p , L x , r ) denote the e -value supporting H . Let η (cid:48) , η (cid:48)(cid:48) . . . denote the e -value with respect to references r (cid:48) , r (cid:48)(cid:48) . . . .The degree of inconsistency of the e -value supporting H , induced by a set of references, { r, r (cid:48) , r (cid:48)(cid:48) . . . } is defined by the index I { η, η (cid:48) , η (cid:48)(cid:48) . . . } = max { η, η (cid:48) , η (cid:48)(cid:48) . . . } − min { η, η (cid:48) , η (cid:48)(cid:48) . . . } The same index can be used to study the degree of inconsistency of the e -value inducedby a set of priors, { p , p (cid:48) , p (cid:48)(cid:48) . . . } . One could also study the sensitivity of the e -value to a setof vitual sample sizes, { n, γ (cid:48) n, γ (cid:48)(cid:48) n . . . } , γ ∈ [0 , { L, L γ (cid:48) , L γ (cid:48)(cid:48) . . . } . This intuitive measure of inconsistency can be made rigorous in thecontext of paraconsistent logic and bilattice structures, see Abe et al. (1998), Alcantara58 APPENDIX A. FBST REVIEW et al. (2002), Arieli and Avron (1996), Costa (1963), Costa and Subrahmanian (1989)and Costa et al. (1991), (1999).The bilattice B ( C, D ) = (cid:104) C × D, ≤ k , ≤ t (cid:105) , given two complete lattices, (cid:104) C, ≤ c (cid:105) , and (cid:104) D, ≤ d (cid:105) , has two orders, the knowledge order, ≤ k , and the truth order, ≤ t , given by: (cid:104) c , d (cid:105) ≤ k (cid:104) c , d (cid:105) ⇔ c ≤ c c and d ≤ d d (cid:104) c , d (cid:105) ≤ t (cid:104) c , d (cid:105) ⇔ c ≤ c c and d ≤ d d The standard interpretation is that C provides the “credibility” or value in favor of ahypothesis (or statement) H , and D provides the “doubt” or value against H . If (cid:104) c , d (cid:105) ≤ k (cid:104) c , d (cid:105) , then we have more information (even if inconsistent) about situation 2 than 1.Analogously, if (cid:104) c , d (cid:105) ≤ t (cid:104) c , d (cid:105) , then we have more reason to trust (or believe) situation2 than 1 (even if with less information).For each of the bilattice orders we define a join and a meet operator, based on the joinand the meet operators of the single lattices orders. More precisely, (cid:116) k and (cid:117) k , for theknowledge order, and (cid:116) t and (cid:117) t , for the truth order, are defined by the folowing equations: (cid:104) c , d (cid:105) (cid:116) k (cid:104) c , d (cid:105) = (cid:104) c (cid:116) c c , d (cid:116) d d (cid:105) , (cid:104) c , d (cid:105) (cid:117) k (cid:104) c , d (cid:105) = (cid:104) c (cid:117) c c , d (cid:117) d d (cid:105)(cid:104) c , d (cid:105) (cid:116) t (cid:104) c , d (cid:105) = (cid:104) c (cid:116) c c , d (cid:117) d d (cid:105) , (cid:104) c , d (cid:105) (cid:117) t (cid:104) c , d (cid:105) = (cid:104) c (cid:117) c c , d (cid:116) d d (cid:105) The “unit square” bilattice, (cid:104) [0 , × [0 , , ≤ , ≤(cid:105) has been routinely used to representfuzzy or rough pertinence relations, logical probabilistic annotations, etc. The lattice (cid:104) [0 , , ≤(cid:105) is the standard unit interval, where the join and meet coincide with the maxand min operators, (cid:116) = max and (cid:117) = min.In the unit square bilattice the “truth”, “false”, “inconsistency” and “indetermination”extremes are t , f , (cid:62) , ⊥ , whose coordinates are given in Figure A.4. 
As a simple example,let region R be the convex hull of the four vertices n , s , e and w , given in Figure A.4.Points kj , km , tj and tm are the knowledge and truth join and meet, over r ∈ R .In the unit square bilattice, the degree of trust and degree of inconsistency for a point x = (cid:104) c, d (cid:105) are given by BT ( (cid:104) c, d (cid:105) ) = c − d , and BI ( (cid:104) c, d (cid:105) ) = c + d −
1, a convenient linearreparameterization of [0 , , to [ − , +1] . Figure A.4 also compares the credibility-doubtand trust-inconsistency coordinates.Let η = ev ( H ), and η = ev ( H ) = 1 − ev ( H ). The point x = (cid:104) η, η (cid:105) in the unitsquare bilattice, represents herein a single evidence. Since BI( x ) = 0, such a point isconsistent. It is also easy to verify that for the multiple e -values, the definition of degreeof inconsistency given above, is the degree of inconsistency of the knowledge join of allthe single evidence points, i.e., I ( η, η (cid:48) , η (cid:48)(cid:48) . . . ) = BI ( (cid:104) η, η (cid:105) (cid:116) k (cid:104) η (cid:48) , η (cid:48) (cid:105) (cid:116) k (cid:104) η (cid:48)(cid:48) , η (cid:48)(cid:48) (cid:105) . . . ) . .7. SENSITIVITY AND INCONSISTENCY ¬ (cid:104) c, d (cid:105) = (cid:104) d, c (cid:105) , and conflation as − (cid:104) c, d (cid:105) = (cid:104) − c, − d (cid:105) , so that negation reverses trust, but preserves knowledge, and conflationreverses knowledge, but preserves trust. w n estmkm kjtjf TtL c d Some points, as (c,d) −1 −0.5 0 0.5 1−1−0.75−0.5−0.2500.250.50.751 wn estm kmkj tjf L tT BT B I Same points, as (BT,BI)
Figure A.4: credibility-doubt and trust-inconsistency coordinatesAs an example of sensitivity analysis we use the HW model with the standard uni-formative references, the uniform and the maximum entropy densities, represented by[1 , ,
1] and [0 , ,
0] observation counts. For a motivation for this particular analysis, seethe observations at the end of section E.5. Between these two uninformative references,we also consider perturbation references corresponding to [0 , , , ,
1] and [1 , ,
0] ob-servation counts. Each of these references can be interpreted as the exclusion of a singleobservation of the corresponding type from the observed data set. E v ( H ) Hardy−Weinberg symmetry: Yes3 3.5 4 4.5 5 5.5 6 6.5 700.050.10.150.20.250.3 log of sample size, log2(n) E v ( H ) Hardy−Weinberg symmetry: No
Figure A.5: Sensitivity analysis60
APPENDIX A. FBST REVIEW
The e -values in the example are calculated using two sample proportions, [ x , x , x ]= n [1 / , / , /
2] and = n [1 / , / , / ( n ), ranged from 3 to 7. In Figure A.5,the e -values corresponding to each choice of reference, are given by an interpolated dashedline. The interpretation of the vertical interval (solid bars) between the dashed lines issimilar to that of the usual statistical error bars. However, the uncertainty representedby these bars does not have a probabilistic nature, being rather a possibilistic measure ofinconsistency, defined in the partial support structure given by the FBST evidence value,see Stern (2004). A.8 Complex Models and Compositionality
The relationship between the credibility of a complex hypothesis, H , and those of itsconstituent elementary hypothesis, H ( i,j ) , in the independent setup, can be analyzed underthe FBST, see Borges and Stern (2006) for precise definitions, and detailed interpretation.Let us consider elementary hypotheses, H ( i,j ) , in k independent constituent models, M j , and the complex or composit hypothesis H , equivalent to a (homogeneous) logi-cal composition (disjunction of conjunctions) of elementary hypotheses, in the compositproduct model, M .The possibilistic nature of the e -value measure makes it easy to compute the supportfor disjunctive complex hypotheses. Conjunction of elementary hypotheses require a moresophisticated analysis. First we must observe that knowing the e -values of the elementaryhypotheses is not enough to know the e -value of the conjunction; Elementary e -valuescan give only lower and upper bounds to the support for the conjunction. Figure A.6illustrates these bounds, and also the following results, for further details see Borges andStern (2006). For conjunctive compositions, the models’ truth functions, W j , are the keyelement for the required algebraic manipulation, as stated in the next result.If H is expressed in HDNF or Homogeneous Disjunctive Normal Form, H = (cid:95) qi =1 (cid:94) kj =1 H ( i,j ) , M ( i,j ) = { Θ j , H ( i,j ) , p j , p jn , r j } ,M = { Θ , H, p , p n , r } , Θ = (cid:89) kj =1 Θ j , p n = (cid:89) kj =1 p jn , r = (cid:89) kj =1 r j ;then the e-value supporting H isev ( H ) = ev (cid:16)(cid:95) qi =1 (cid:94) kj =1 H ( i,j ) (cid:17) = W (cid:16) q max i =1 (cid:89) kj =1 s ∗ ( i,j ) (cid:17) = W (cid:16) q max i =1 s ∗ i (cid:17) = q max i =1 W (cid:0) s ∗ i (cid:1) = q max i =1 ev (cid:16)(cid:94) kj =1 H ( i,j ) (cid:17) = q max i =1 ev (cid:0) H i (cid:1) ; .8. COMPLEX MODELS AND COMPOSITIONALITY W ( v ), is given by theMellin convolution operation, see Springer (1979), defined as W = (cid:79) ≤ j ≤ k W j , W ⊗ W ( v ) = (cid:90) ∞ W ( v/y ) W ( dy ) . The probability distribution of the product of two independent positive random vari-ables is the Mellin convolution of each of their distributions. From this interpretation,the we immediately see that ⊗ is a commutative and associative operator.Mirroring Wittgenstein, in the FBST context, we can call the e-value, ev ( H ), thecumulative surprise distribution, W ( v ), and the Mellin convolution operation, ⊗ , respec-tively, truth value, truth function, and truth operation.Finally, we observe that, in the extreme case of null-or-full support, that is, when, for1 ≤ i ≤ q and 1 ≤ j ≤ k , s ∗ ( i,j ) = 0 or s ∗ ( i,j ) = (cid:98) s j , the evidence values (or, in thiscontext, truth values) of the constituent elementary hypotheses are either 0 or 1, and theconjunction and disjunction composition rules of classical logic hold. Numerical Aspects
In appendix G we detail an efficient Monte Carlo algorithm for computing ev ( H ; p n , r ).In this algorithm, the bulk of the work consists in generating random points in the pa-rameter space, θ j ∈ Θ, and evaluating the surprise function, s j = s ( θ j ). The MonteCarlo algorithm proceeds updating several accumulators based on the tangential set “hitindicator”, I ∗ ( θ j ; p n , r ) = ( θ j ∈ T ) = ( s ( θ j ) > s ∗ ) . In order to compute a k -step function approximation of W ( v ), we only have to splitthe surprise range interval, [0 , (cid:98) s ] with a vector of k intermediate points, 0 < s < s <. . . s h < s ∗ < s h +1 < . . . s k < (cid:98) s , and set up a set of vector accumulators based on the vectorthreshold indicator, I k ( θ j ; p n , r ) = ( s ( θ j ) > s k ). Updating the vector accumulatorsusualy imposes only a small overhead on the Monte Carlo algorithm.Numerical convolutions of step functions can be easily computed with the help ofgood condensation procedures, see Kaplan and Lin (1987). For alternative approachesto numerical convolution see Springer (1979) and Williamson (1989). In the case ofdependent models, the composite truth function can be solved with the help of analyticaland numerical copulas, see Cherubini (2004), Mari and Kotz (2001) and Nelsen (2006).62 APPENDIX A. FBST REVIEW W s *1 ev(H )=W (s *1 ) W s *2 ev(H )=W (s *2 ) W ⊗ W s *1 s *2 ev(H ∧ H )=W ⊗ W (s *1 s *2 )ev(H )*ev(H )1−(1−ev(H ))*(1−ev(H )) W ⊗ W s *2 s *2 ev(H ∧ H )=W ⊗ W (s *2 s *2 ) Fig A.6. Subplots 1,2: W j , s ∗ j , and ev ( H j ), for j = 1 , W ⊗ W , s ∗ s ∗ , ev ( H ∧ H ) and bounds;Subplot 4: Structure M is an independent replica of M ,ev ( H ) < ev ( H ), but ev ( H ∧ H ) > ev ( H ∧ H ). ppendix BBinomial, Dirichlet, Poisson andRelated Distributions This essay has been published as Pereira and Stern (2008).The matrix notation used in this section is defined in section F.1.
B.1 Introduction and Notation
This essay presents important properties of the distributions used for categorical dataanalysis. Regardless of the population size being known or unknown, or the specificobservational stopping rule, the Bernoulli Processes generates the sampling distributionsconsidered. On the other hand, the Gamma distribution generates the prior and posteriordistributions obtained: Gamma, Gamma-Poisson, Dirichlet, and Dirichlet-Multinomial.The Poisson Processes as generator of sampling distributions is also considered.The generation form of the discrete sampling distributions presented in Section 2is, in fact, a characterization method of such distributions. If one recalls that all thedistribution classes being mixed are complete classes and are Blackwell sufficient for theBernoulli processes, the mixing distributions are unique. This characterization method iscompletely described in Basu and Pereira (1983).Section 9 describes the Reny-Aczel characterization of the Poisson distribution. Al-though it could be thought as a de Finetti type characterization this characterizationis based on alternative requirements. While de Finetti chaparcterization is based ona permutable infinite 0-1 process, Reny-Aczek characterization is based on a homoge-neous Markov process in a finite interval, generating finite discrete Markov Chains. UsingReny-Aczel characterization, together with Theorem 4, one can obtain a characterizationof Multinomial distributions. 26364
APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
Section 7 describes the Dirichlet of Second Kind. In this section we also show how touse a multivariate normal approximation to the logarithm of a random vector distributedas Dirichlet of Second Kind, and a log-normal approximation to a Gamma distribution,see Aitchison and Shen (1980). In many examples of the authors’ consulting practice theseapproximations proved to be a powerful modeling tool, leading to efficient computationalprocedures.The development of the theory in this essay is self contained, seeking a unified treat-ment of a large variety of problems, including finite and infinite populations, contingencytables of arbitrary dimension, deficiently categorized data, logistic regressions, etc. Thesemodels also present a way of introducing non parametric solutions.The singular representation adopted is unusual in statistical texts. This singular rep-resentation makes it simpler to extend and generalize the results and greatly facilitatesnumerical and computational implementation. In this essay, corollaries, lemmas, propo-sitions and theorems are numbered sequentially.We introduce the following notation for observation matrices, and respective summa-tion vectors: U = [ u , u , . . . ] , U : n = [ u , u , . . . u n ] , x n = U : n = (cid:88) nj =1 u j . The tilde accent indicates some form of normalization like, for example, (cid:101) x = (1 / (cid:48) x ) x . Lemma 1: If u , . . . u n are i.i.d random vectors, x = U : n ⇒ E( x ) = n E( u ) and Cov( x ) = n Cov( u ) . The first result is trivial. For the second result, we only have to remember the transfor-mation properties of for the expectation and covariance operators by a linear operationon their argument, E ( AY + b ) = AE ( Y ) + b , Cov( AY + b ) = A Cov( Y ) A (cid:48) , and write Cov( x ) = Cov( U : n )= Cov (cid:0) ( (cid:48) ⊗ I ) Vec( U : n ) (cid:1) = ( (cid:48) ⊗ I ) (cid:0) I ⊗ Cov( u ) (cid:1) ( ⊗ I )= (cid:0) (cid:48) ⊗ Cov( u ) (cid:1) ( ⊗ I ) = n Cov( u ) . B.2 The Bernoulli Process
Let us consider a sequence of random vectors u , u , ... where, ∀ u i can assume only twovalues I = (cid:20) (cid:21) or I = (cid:20) (cid:21) where I = (cid:20) (cid:21) .2 BERNOULLI PROCESS u i can assume the value of any column of theidentity matrix, I. We say that u i is of class k , c ( u i ) = k , iff u i = I k , k ∈ [1 , p =[ p (1) , p (2) , . . . p ( n )] is a permutation of [1 , , . . . n ], than, ∀ n, p ,Pr (cid:0) u , ...u n (cid:1) = Pr (cid:0) u p (1) , ...u p ( n ) (cid:1) . Just from this exchangeability constraint, that can be interpreted as saying that the indexlabels are non informative, de Finetti Theorem establishes the existence of an unknownvector θ ∈ Θ = { ≤ θ = (cid:20) θ θ (cid:21) ≤ | (cid:48) θ = 1 } such that, conditionally on θ , u , u , . . . are mutually independent, and the conditionalprobability of Pr( u i = I k | θ ) is θ k , i.e.( u (cid:113) u (cid:113) . . . ) | θ or ∞ (cid:97) i =1 u i | θ , and Pr( u i = I k | θ ) = θ k . Vector θ is characterized as the limit of proportions θ = lim n →∞ n x n , x n = U : n = (cid:88) nj =1 u j . Conditionally on θ , the sequence u , u , . . . receives the name of Bernoulli process. Aswe shall see, many well known discrete distributions can be obtained from transformationsof this process.The expectation and covariance (conditionally on θ ) of any vector in the sequence are: • E( u i ) = θ , • Cov( u i ) = E ( u i ⊗ ( u i ) (cid:48) ) − E ( u i ) ⊗ E (( u i ) (cid:48) ) = diag( θ ) − θ ⊗ θ (cid:48) .When the summation domain 1 : n , is understood, we may use the relaxed notation x instead of x n . We also define the Delta operator, or “pointwise power product” betweentwo vectors of same dimension: Given θ , and x , n × θ (cid:52) x ≡ n (cid:89) i =1 ( θ i ) x i . A stopping rule, δ , establishes, for every n = 1 , , . . . , a decision of observing (or not) u n +1 , after the observations u , . . . u n .For a good understanding of this text, it is necessary to have a clear interpretation ofconditional expressions like x n | n or x n | x n . In both cases we are referring to a unknown66 APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS vector, x n , but with a different partial information. In the first case, we know n , andtherefore we know the sum of components, x n + x n = n ; however, we know neithercomponent x n nor x n . In the second case we only know the first component, of x n , x n ,and do not know the second component, x n , obviously we also do not know the sum, n = x n + x n . Just pay attention: We list what we know to the right of the bar and,(unless we have some additional information) everything that can not be deduced fromthis list is unknown.The first distribution we are going to discuss is the Binomial. Let δ ( n ) be the stoppingrule where n is the pre-established number of observations. The (conditional) probabilityof the observation sequence U : n isPr( U : n | θ ) = θ (cid:52) x n . The summation vector, x n , has Binomial distribution with parameters n and θ , andwe write x n | [ n, θ ] ∼ Bi( n, θ ). When n (or δ ( n )) is implicit in the context we may write x | θ instead of x n | [ n, θ ]. The Binomial distribution has the following expression:Pr( x n | n, θ ) = (cid:18) nx n (cid:19) ( θ (cid:52) x n )where (cid:18) nx (cid:19) ≡ Γ( n + 1)Γ( x + 1) Γ( x + 1) = n ! x ! x ! and n = (cid:48) x . 
A good exercise for the reader is to check that expectation vector and the covariancematrix of x n | [ n, θ ] have the following expressions:E( x n ) = nθ and Cov( x n ) = n ( θ (cid:52) ) (cid:20) − − (cid:21) . The second distribution we discuss is the Negative Binomial. Let δ ( x n ) be the ruleestablishing to stop at observation u n when obtaining a pre-established number of x n successes. The random variable x n , the number of failures he have when we obtain therequired x n successes, is called a Negative Binomial with parameters x n and θ . It isnot hard to prove that the Negative Binomial distribution x n | [ x n , θ ] ∼ NB( x n , θ ), hasexpression, ∀ x n ∈ IN ,Pr( x n | x n , θ ) = x n n (cid:18) nx n (cid:19) ( θ (cid:52) x n ) = θ Pr (cid:0) ( x n − I ) | ( n − , θ ) (cid:1) . Note that, from the definition this distribution, x n is a positive integer number. Nev-ertheless, we can extend the definition above for any real positive value a , and still obtaina probability function. For this, we use ∞ (cid:88) j =0 Γ( a + j )Γ( a ) j ! (1 − π ) j = π − a , ∀ a ∈ [0 , ∞ [ and π ∈ ]0 , . .2 BERNOULLI PROCESS x n :E ( x n | x n , θ ) = x n θ θ and Var ( x n | x n , θ ) = x n θ ( θ ) . In the special case of δ ( x n = 1), the Negative Binomial distribution is also known asthe Geometric distribution with parameter θ . If a random variables are independent andidentically distributed (i.i.d.) as a geometric distribution with parameter θ , then the sumof these variables has Negative Binomial distribution with parameters a and θ .The third distribution studied in this essay is the Hypergeometric. Going back tothe original sequence, u , u , ... , assume that a first observer knows the first N obser-vations, while a second observer knows only a subsequence of n < N of these observa-tions. Since the original sequence, u , u , . . . , is exchangeable, we can assume, withoutloss of generality, that the subsequence known to the second observer is the subsequenceof the first n observations, u , . . . u n . Using de Finetti theorem, we have that x n and x N − x n = U n +1 : N are conditionally independent, given θ . That is, x n (cid:113) ( x N − x n ) | θ .Moreover, we can write x n | [ n, θ ] ∼ Bi( n, θ ) , x N | [ N, θ ] ∼ Bi(
N, θ ) , and( x N − x n ) | [( N − n ) , θ ] ∼ Bi( N − n, θ ) . Our goal is to find the distribution function of x n | x N . Note that x N is sufficient for U : N given θ , and x n is sufficient for U : n . Moreover x n | [ n, x N ] has the same distributionof x n | [ n, x N , θ ]. Using the basic rules of probability calculus and the properties above,we have that Pr( x n | n, x N , θ )= Pr( x n , x N | n, N, θ )Pr( x N | n, N, θ ) = Pr( x n , ( x N − x n ) | n, N, θ )Pr( x N | n, N, θ )= Pr( x n | n, N, θ ) Pr( x N − x n | n, N, θ )Pr( x N | n, N, θ ) . Hence, x n | [ n, x N ] has distribution functionPr( x n | n, x N ) = (cid:18) nx n (cid:19) (cid:18) N − nx N − x n (cid:19)(cid:18) Nx N (cid:19) where ≤ x n ≤ x N ≤ N , (cid:48) x n = n , (cid:48) x N = N . APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
This is the vector representation of the Hypergeometric probability distribution. x n | [ n, x N ] ∼ Hy( n, N, x N ) . The reader is asked to check the following expressions for the expectation and (condi-tional) covariance of x n | [ n, N, x N ], and covariance of u i and u j , i, j ≤ n :E( x n ) = nN x N and Cov( x n ) = n ( N − n )( N −
1) ( x N (cid:52) ) (cid:20) − − (cid:21) Cov( u i , u j | x N ) = 1( N − N ( x N (cid:52) ) (cid:20) − − (cid:21) . We finish this section presenting the derivation of the Beta-Binomial distribution. Letus assume that the first observer observed x n failures, until observing a pre-establishednumber of x n successes. A second observer makes more observations, observing x N failuresuntil completing the pre-established number of x N successes, x n < x N .Since x n and x N are pre-established, we can write x N | θ ∼ NB( x N , θ ) , x n | θ ∼ NB( x n , θ )( x N − x n ) | θ ∼ NB( x N − x n , θ ) and x n (cid:113) ( x N − x n ) | θ . As before, our goal is to describe the distribution of x n | [ x n , x N ]. If one notices that[ x n , x N ] is sufficient for [ x n , ( x N − x n )], with respect to θ , the problem becomes similar tothe Hypergeometric case, and one can obtainPr( x n | x n , x N ) = x N ! Γ( x N )Γ( x N + x N ) Γ( x n + x n ) x n ! Γ( x n ) Γ( x N − x n + x N − x n )( x N − x n )! Γ( x N − x n ) ,x n ∈ { , , ..., x N } . This is the distribution function of a random variable called Beta Binomial with param-eters x n and x N . x n | ( x n , x N ) ∼ BB( x n , x N ) . The properties of this distribution will be studied in the general case of the Dirichlet-Multinomial, in the following sections.Generalized categories for k > I , I , . . . I k ,i.e., the columns of the k -dimensional identity matrix. The Multinomial and Hypergeo-metric multivariate distributions, presented in the next sections, are distributions derivedof this basic generalization. .3. MULTINOMIAL DISTRIBUTION B.3 Multinomial Distribution
Let u i , i = 1 , , . . . be random vectors with possible results in the set of columns of the m -dimensional identity matrix, I k , k ∈ m . We say that u i is of class k , c ( u i ) = k , iff u i = I k .Let θ ∈ [0 , m be the vector of probabilities for an observation of class k in a m -variateBernoulli process, i.e., Pr( u i = I k | θ ) = θ k , ≤ θ ≤ , (cid:48) θ = 1 . Like in the last section, let UU = [ u , u , . . . ] and x n = U : n . Definition:
If the knowledge of θ makes the vectors u i independent, then the (condi-tional) distribution of x n given θ is the Multinomial distribution of order m with param-eters n and θ , given by Pr( x n | n, θ ) = (cid:18) nx n (cid:19) ( θ (cid:52) x n )where (cid:18) nx (cid:19) ≡ Γ( n + 1)Γ( x + 1) . . . Γ( x m + 1) = n ! x ! . . . x m ! and n = (cid:48) x . We represent the m -Multinomial distribution writing x n | [ n, θ ] ∼ Mn m ( n, θ ) . When m = 2, we have the binomial case.Let us now examine some properties of the Multinomial distribution. Lemma 2: If x | θ ∼ Mn m ( n, θ ) then the (conditional) expectation and covariance of x are E( x ) = nθ and Cov( x ) = n (diag( θ ) − θ ⊗ θ (cid:48) ) . Proof:
Analogous to the binomial case.The next result presents a characterization of the Multinomial in terms of the Poissondistribution.
Lemma 3:
Reproductive property of the Poisson distribution. x i ∼ Ps( λ i ) ⇒ (cid:48) x | λ ∼ Ps( (cid:48) λ ) . that is, the sum of (independent) Poisson variates is also Poisson.70 APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
Theorem 4:
Characterization of the Multinomial by the Poisson.Let x = [ x , ..., x m ] (cid:48) be a vector with independent Poisson distributed components withparameters in the known vector λ = [ λ , ..., λ m ] (cid:48) >
0. Let n be a positive integer. Then,given λ , x | [ n = (cid:48) x, λ ] ∼ Mn m ( n, θ ) where θ = 1 (cid:48) λ λ . Proof:
The joint distribution of x , given λ isPr( x | λ ) = m (cid:89) k =1 e − λ k λ x k i x k ! . Using the Poisson reproductive property,Pr ( x | (cid:48) x = n, λ )= Pr ( (cid:48) x = n ∧ x | λ )Pr ( (cid:48) x = n | λ ) = δ ( n = (cid:48) x ) Pr( x | λ )Pr( (cid:48) x = n | λ ) . The following results state important properties of the Multinomial distribution. Theproof of these properties is simple, using the characterization of the Multinomial by thePoisson, and the Poisson reproductive property.
Theorem 5:
Multinomial Class PartitionLet 1 : m be the index domain for the classes of a order m Multinomial distribution. Let T be a partition matrix breaking the m -classes into s -super-classes. Let x ∼ Mn m ( n, θ ),then y = T x ∼ Mn s ( n, T θ ). Theorem 6:
Multinomial Conditioning on the Partial Sum.If x ∼ Mn m ( n, θ ), then the distribution of part of the vector x conditioned on its sumhas Multinomial distribution, having as parameter the corresponding part of the original(normalized) parameters. In more detail, conditioning on the t first components, we have: x : t | ( (cid:48) x : t = j ) ∼ Mn t (cid:18) j, (cid:48) θ : t θ : t (cid:19) where 0 ≤ j ≤ n . Theorem 7:
Multinomial–Binomial Decomposition.Using the last two theorems, if x ∼ Mn m ( n, θ ),Pr( x | n, θ ) = n (cid:88) j =0 Pr (cid:18) x : t | j, (cid:48) θ : t θ : t (cid:19) Pr (cid:18) x t +1 : m | ( n − j ) , (cid:48) θ t +1 : m θ t +1 : m (cid:19) Pr (cid:18)(cid:20) j ( n − j ) (cid:21) | n, (cid:20) (cid:48) θ : t (cid:48) θ t +1 : m (cid:21)(cid:19) . .4. MULTIVARIATE HYPERGEOMETRIC DISTRIBUTION m -nomial- s -nomial decomposition for the partition of the m class indices into s super-classes. B.4 Multivariate Hypergeometric Distribution
In the first section we have shown how an Hypergeometric variate can be generated froma Bernoulli process. The natural generalization of this result is obtained considering aMultinomial process. As in the last section, we say that u i is of class k , c ( u i ) = k , iff u i = I k .We take a sample of size n from a finite population of size N ( > n ), that is partitionedinto m classes. The population frequencies (number of elements in each category) arerepresented by [ ψ , . . . ψ m ], hence N = (cid:48) ψ . Based on the sample, we want to make aninference on ψ . x k ´e is the sample frequency of class k .One way of describing this problem is to consider an urn with N balls of m differentcolors, indexed by 1 , . . . m . ψ k is the number of balls of color k . Assume that the N balls are separated into two smaller boxes, so that box 1 has n balls and box 2 has theremaining N − n balls. The statistician can observe the composition of box 1, representedby vector x of sample frequencies. The quantity of interest for the statistician is the vector ψ − x representing the composition of box 2.As in the bivariate case, we assume that U : N is a finite sub-sequence in an ex-changeable process and, therefore, any sub-sequence extracted from U : N has the samedistribution of U : n . Hence, x = U : n has the same distribution of the frequency vectorfor a sample of size n .As in the bivariate case, our objective is to find the distribution of x | ψ . Again, usingde Finetti theorem, there is a vector ≤ θ ≤ , (cid:48) θ = 1, such that (cid:96) Nj =0 u j | θ andPr ( c ( u j ) = k ) = θ k . Theorem 8:
As in the Multinomial case, the following results follow: • ψ | θ ∼ Mn m ( N, θ ) ; • x | θ ∼ Mn m ( n, θ ) ; • ( ψ − x ) | θ ∼ Mn m (( N − n ) , θ ) ; • ( ψ − x ) (cid:113) x | θ .Using the results of the last section and following the same steps as in the Hy casein the first section, we obtain the following expression for m -variate Hypergeometric72 APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS distribution, x n | [ n, N, ψ ] ∼ Hy m ( n, N, ψ ) :Pr( x n | n, ψ ) = (cid:18) nx n (cid:19) (cid:18) N − nψ − x n (cid:19)(cid:18) Nψ (cid:19) where ≤ x n ≤ ψ ≤ N , (cid:48) x n = n , (cid:48) ψ = N .
This is the vector representation of the Hypergeometric probability distribution. x n | [ n, x N ] ∼ Hy( n, N, x N ) . Alternatively, we can write the more usual formula,Pr( x | ψ ) = (cid:18) ψ x (cid:19) (cid:18) ψ x (cid:19) · · · (cid:18) ψ m x m (cid:19)(cid:18) Nn (cid:19) . Theorem 9:
The expectation and covariance of a random vector with Hypergeometricdistribution, x ∼ Hy m ( n, N, ψ ), are:E( x ) = n (cid:101) ψ , Cov( x ) = n N − nN − (cid:16) diag( (cid:101) ψ ) − (cid:101) ψ ⊗ (cid:101) ψ (cid:48) (cid:17) where (cid:101) ψ = 1 N ψ .
Proof:
Use thatCov( x n ) = n Cov( u ) + n ( n − u , u )Cov( u ) = E (cid:0) u ⊗ ( u ) (cid:48) (cid:1) − E( u ) ⊗ E( u ) (cid:48) = diag( (cid:101) ψ ) − (cid:101) ψ ⊗ (cid:101) ψ (cid:48) Cov( u , u ) = E (cid:0) u ⊗ ( u ) (cid:48) (cid:1) − E( u ) ⊗ E( u ) (cid:48) . The second term of the last two equations are equal, and the first term of the last equationis E (cid:0) u i u j (cid:1) = (cid:40) ψ i N ψ i − N − if i = j ψ i N ψ j N − if i (cid:54) = j Algebraic manipulation yields the result.Note that, as in the order 2 case, the diagonal elements of Cov( u ) are positive, whilethe diagonal elements of Cov( u , u ) are negative. In the off diagonal elements, the signsare reversed. .5. DIRICHLET DISTRIBUTION B.5 Dirichlet Distribution
In the second section we presented the multinomial distribution, Mn m ( n, θ ). In this sectionwe present the Dirichlet distribution for the parameter θ . Let us first recall the univariatePoisson and Gamma distributions.A random variable has Gamma distribution, x | [ a, b ] ∼ G ( a, b ) , a, b >
0, if its distri-bution is continuous with density f ( x | a, b ) = b a Γ( a ) x a − exp( − bx ) , x > . The expectation and variance of this variate are E ( x ) = ab and Var( x ) = ab . Lemma 10:
Reproductive property for the Gamma distribution.If n independent random variables x i | a i , b ∼ G ( a i , b ), then (cid:48) x ∼ G ( (cid:48) a, b ) . Lemma 11:
The Gamma distribution is conjugate to the Poisson distribution.
Proof: If y | λ ∼ Ps( λ ) and λ has prior λ | a, b ∼ G ( a, b ), then f ( λ | y, a, b ) ∝ L ( λ | y ) f ( λ )= exp( − λ ) λ y y ! b a Γ( a ) λ a − exp( − bλ ) ∝ λ y + a − exp ( − ( b + 1) λ ) . That is, the posterior distribution of λ is Gamma with parameters [ a + y, b + 1]. Definition:
Dirichlet distribution.A random vector y ∈ S m − ≡ { y ∈ R m | ≤ y ≤ ∧ (cid:48) y = 1 } has Dirichlet distribution of order m with positive a ∈ R m if its density isPr( y | a ) = y (cid:52) ( a − ) B ( a ) . Note that S m − , the m − R m subject to the“constraint”, (cid:48) y = 1. Hence, a point in the Simplex has only m − y , . . . y m − ] (cid:48) , known74 APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS as the Multivariate Beta distribution, but at the cost of obtaining a convoluted algebraicformulation that also loses the natural geometric interpretation of the singular form.The normalization factor for the Dirichlet distribution is B ( a ) ≡ (cid:90) y ∈S m − ( y (cid:52) ( a − dy . Lemma 12:
Beta function.The normalization factor for the Dirichlet distribution defined above is the Beta function,defined as B ( a ) = (cid:81) mk =1 Γ( a k )Γ( (cid:48) a ) . The proof is given at the end of this section.
Theorem 13:
Dirichlet as Conjugate of the Multinomial:If θ ∼ Di m ( a ) and x | θ ∼ Mn m ( n, θ ) then θ | x ∼ Di m ( a + x ) . Proof:
We only have to remember that the Multinomial likelihood is proportional to θ (cid:52) x ,and that a Dirichlet prior is proportional to θ (cid:52) ( a − ). Hence, the posterior is propor-tional to θ (cid:52) ( x + a − B ( a + x ) is the normalization factor, i.e.,equal to the integral on θ of θ (cid:52) ( x + a − Lemma 14:
Dirichlet Moments.If θ ∼ Di m ( a ) and p ∈ IN m , thenE ( θ (cid:52) p ) = B ( a + p ) B ( a ) . Proof: (cid:90) Θ ( θ (cid:52) p ) f ( θ | a ) dθ = 1 B ( a ) (cid:90) Θ ( θ (cid:52) p ) ( θ (cid:52) ( a − dθ =1 B ( a ) (cid:90) Θ ( θ (cid:52) ( a + p − dθ = B ( a + p ) B ( a ) . .5 DIRICHLET DISTRIBUTION p , appropriately, we have Corollary 15: If θ ∼ Di m ( a ) , thenE( θ ) = (cid:101) a ≡ (cid:48) a a Cov( θ ) = 1 (cid:48) a + 1 (diag( (cid:101) a ) − (cid:101) a ⊗ (cid:101) a (cid:48) ) . Theorem 16:
Characterization of the Dirichlet by the Gamma:Let the components of the random vector x ∈ R m be independent variables with distri-bution G ( a k , b ). Then, the normalized vector y = 1 (cid:48) x x ∼ Di m ( a ) , (cid:48) x ∼ Ga( (cid:48) a ) and y (cid:113) (cid:48) x . Proof:
Consider the normalization, y = 1 t x , t = (cid:48) x , x = ty , as a transformation of variables. Note that one of the new variables, say y m ≡ t (1 − y . . . − y m − ), becomes redundant.The Jacobian matrix of this transformation is J = ∂ ( x , x , . . . x m − , x m ) ∂ ( y , y , . . . y m − , t ) = t · · · y t · · · y ... ... . . . ... ...0 0 · · · t y m − − t − t · · · − t − y · · · − y m − . By elementary operations (see appendix F) that add all rows to the last one, we obtainthe LU factorization the Jacobian matrix, J = LU , where L = · · · · · · · · · − − · · · − and U = t · · · y t · · · y ... ... . . . ... ...0 0 · · · t y m − · · · . A triangular matrix determinant is equal to the product of the elements in its maindiagonal, hence | J | = | L | | U | = 1 t m − .76 APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
At the other hand, the joint distribution of x is f ( x ) = m (cid:89) k =1 Ga( x k | a k , b ) = m (cid:89) k =1 b a k Γ( a k ) e − bx k ( x k ) a k − . and the joint distribution in the new system of coordinates is g ([ y, t ]) = | J | f (cid:0) x − ([ y, t ]) (cid:1) = t m − m (cid:89) k =1 b a k Γ( a k ) e − bx k ( x k ) a k − = t m − m (cid:89) k =1 b a k Γ( a k ) e − bty k ( ty k ) a k − = (cid:32) m (cid:89) k =1 ( y k ) a k − Γ( a k ) (cid:33) b (cid:48) a e − bt t (cid:48) a − m t m − = (cid:32) m (cid:89) k =1 ( y k ) a k − Γ( a k ) (cid:33) b (cid:48) a e − bt t (cid:48) a − . Hence, the marginal distribution of y = [ y , . . . y k ] (cid:48) is g ( y ) = (cid:90) ∞ t =0 g ([ y, t ]) dt = (cid:32) m (cid:89) k =1 ( y k ) a k − Γ( a k ) (cid:33) (cid:90) ∞ t =0 b (cid:48) a e − bt t (cid:48) a − dt = (cid:32) m (cid:89) k =1 ( y k ) a k − Γ( a k ) (cid:33) Γ( (cid:48) a ) = y (cid:52) ( a − B ( a ) . In the last passage, we have replaced the integral by the normalization factor of a Gammadensity, Ga( (cid:48) a, b ). Hence, we obtain a density proportional to y (cid:52) ( a − Lemma 17:
Bipartition of Indices for the Dirichlet.Let 1 : t , t + 1 : m be a bipartition of the class index domain, 1 : m , of an order m Dirichlet,in two super-classes. Let y ∼ Di m ( a ), and z = 1 (cid:48) y : t y : t , z = 1 (cid:48) y t +1 : m y t +1 : m , w = (cid:20) (cid:48) y : t (cid:48) y t +1 : m (cid:21) . We than have, z (cid:113) z (cid:113) w and z ∼ Di t ( a : t ) , z ∼ Di m − t ( a t +1 : m ) and w ∼ Di (cid:18)(cid:20) (cid:48) a : t (cid:48) a t +1 : m (cid:21)(cid:19) . .6. DIRICHLET-MULTINOMIAL Proof:
From the Dirichlet characterization by the Gamma we can imagine that the vector y isbuilt by normalizing of a vector x , as follows, y = 1 (cid:48) x x , x k ∼ Ga( a k , b ) , m (cid:97) k =1 x k . Considering isolatetly each one of the super-classes, we build the vectors z and z thatare distributed as z = 1 (cid:48) y : t y : t = 1 (cid:48) x : t x : t ∼ Di t ( a : t ) z = 1 (cid:48) y t +1 : m y t +1 : m = 1 (cid:48) x t +1 : m x t +1 : m ∼ Di m − t ( a t +1 : m ) .z (cid:113) z , that are in turn independent of the partial sums (cid:48) x : t ∼ Ga( (cid:48) a : t , b ) and (cid:48) x t +1 : m ∼ Ga( (cid:48) a t +1 : m , b ) . Using again the theorem characterizing the Dirichlet by the Gamma distribution forthese two Gamma variates, we obtain the result, Q.E.D.We can generalize this result for any partition of the set of classes, as follows. If y ∼ Di m ( a ) and T ´e is a s -partition of the m classes, the intra and extra super-classdistributions are independent Dirichlets, as follows z r = 1 T r y r P y ∼ Di T r ( r P a ) w = T y ∼ Di s ( T a ) . B.6 Dirichlet-Multinomial
We say that a random vector x ∈ IN n | (cid:48) x = n has Dirichlet-Multinomial (DM) distribu-tion with parameters n and a ∈ R m , iffPr( x | n, a ) = B ( a + x ) B ( a ) (cid:18) nx (cid:19) = B ( a + x ) B ( a ) B ( x ) 1 x (cid:52) . Theorem 18:
Characterization of the DM as a Dirichlet mixture of Multinomials.Se θ ∼ Di m ( a ) and x | θ ∼ Mn( n, θ ) then x | [ n, a ] ∼ DM m ( n, a ) . Proof: APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
The joint distribution of θ, x is proportional to θ (cid:52) ( a + x − θ is B ( a + x ). Hence, multiplying by the joint distribution constants, we have the marginalfor x , Q.E.D. Therefore, we have also proved that the function DM is normalized, that isPr( x ) = (cid:90) θ ∈S m − (cid:18) nx (cid:19) ( θ (cid:52) x ) 1 B ( a ) θ (cid:52) ( a − ) dθ = 1 B ( a ) (cid:18) nx (cid:19) (cid:90) θ ∈S m − ( θ (cid:52) ( x + a − )) dθ = B ( x + a ) B ( a ) (cid:18) nx (cid:19) . Theorem 19:
Characterization of the DM by m Negative Binomials.Let a ∈ IN m + , and x ∈ IN m , be a vector whose components are independent randomvariables, a k ∼ NB( a k , θ ). Then x | [ (cid:48) x = n, a ] ∼ DM m ( n, a ) . Proof:
Pr( x | θ, a ) = m (cid:89) k =1 (cid:18) a k + x k − x k (cid:19) θ a k (1 − θ ) x k Pr( (cid:48) x | θ, a ) = (cid:18) (cid:48) a + (cid:48) x − (cid:48) x (cid:19) θ (cid:48) a (1 − θ ) (cid:48) a . Then, Pr( x | (cid:48) x = n, θ, a ) = Pr( x | a, θ )Pr( (cid:48) x = n | θ ) = (cid:81) mk =1 (cid:18) a k + x k − x k (cid:19)(cid:18) (cid:48) a + (cid:48) x − (cid:48) x (cid:19) . Hence, Pr( x | (cid:48) x = n, θ, a ) = Pr( x | (cid:48) x = n, a )= m (cid:89) k =1 Γ( a k + x k ) x !Γ( a k ) / Γ( (cid:48) a + n )Γ( (cid:48) a ) n ! = B ( a + x ) B ( a ) (cid:18) nx (cid:19) . Theorem 20:
The DM as Pseudo-Conjugate for the HypergeometricSe x ∼ Hy m ( n, N, ψ ) and ψ ∼ DM m ( N, a ) then ( ψ − x ) | x ∼ DM m ( N − n, a ) . Proof:
Using the properties of the Hypergeometric already presented, we have the inde-pendence relation, ( ψ − x ) (cid:113) x | θ . We can therefore use the Multinomial sample x | θ forupdating the prior and obtain the posterior θ | x ∼ Di m ( a + x ) . .6 DIRICHLET-MULTINOMIAL ψ − x , given the sample x , is a mixture of ( ψ − x ) θ buy the posterior for θ . By the characterization of the DM asa mixture of Multinomials by a Dirichlet, the theorem follows, i.e.,( ψ − x ) | [ θ, x ] ∼ ( ψ − x ) | θ ∼ Mn m ( N − n, θ ) θ | x ∼ Di m ( a + x ) (cid:27) ⇒⇒ ( ψ − x ) | x ∼ Di m ( N − n, a + x ) . Lemma 21:
DM Expectation and Covariance.If x ∼ DM m ( n, a ) then E( x ) = n (cid:101) a ≡ (cid:48) a a Cov( x ) = n ( n + (cid:48) a ) (cid:48) a + 1 (diag( (cid:101) a ) − (cid:101) a ⊗ (cid:101) a (cid:48) ) . Proof: E( x ) = E θ ( E x ( x | θ )) = E θ ( nθ ) = n (cid:101) a E( x ⊗ x (cid:48) ) = E θ ( E x ( x ⊗ x (cid:48) | θ ))= E θ ( E( x | θ ) ⊗ E( x | θ ) (cid:48) + Cov( x | θ ))= E θ (cid:0) n (diag( θ ) − θ ⊗ θ (cid:48) ) + n θ ⊗ θ (cid:48) (cid:1) = n E θ (diag( θ )) + n ( n −
1) E θ ( θ ⊗ θ (cid:48) )= n diag( (cid:101) a ) + n ( n −
1) ( E( θ ) ⊗ E( θ ) (cid:48) + Cov( θ ))= n diag( (cid:101) a ) + n ( n − (cid:18)(cid:101) a ⊗ (cid:101) a (cid:48) + 1 (cid:48) a + 1 (diag( (cid:101) a ) − (cid:101) a ⊗ (cid:101) a (cid:48) ) (cid:19) = n diag( (cid:101) a ) + n ( n − (cid:18) (cid:48) a + 1 diag( (cid:101) a ) + (cid:48) a (cid:48) a + 1 (cid:101) a ⊗ (cid:101) a (cid:48) (cid:19) Cov( x ) = E( x ⊗ x (cid:48) ) − E( x ) ⊗ E( x ) (cid:48) = E( x ⊗ x (cid:48) ) − n (cid:101) a ⊗ (cid:101) a (cid:48) = (cid:18) n + n ( n − (cid:48) a + 1 (cid:19) diag( (cid:101) a ) + (cid:18) n ( n − (cid:48) a (cid:48) a + 1 − n (cid:19) (cid:101) a ⊗ (cid:101) a (cid:48) = n ( n + (cid:48) a ) (cid:48) a + 1 (diag( (cid:101) a ) − (cid:101) a ⊗ (cid:101) a (cid:48) ) Q.E.D. Theorem 22:
DM Class BipartitionLet 1 : t , t + 1 : m a bipartition of the index domain for the classes of an order m DM,1 : m , in two super-classes. Then, the following conditions (i) to (iii) are equivalent tocondition (iv):i: x t (cid:113) x t +1: m | n = (cid:48) x t ;ii-1: x t | n = (cid:48) x t ∼ DM t ( n , a t ) ;80 APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS ii-2: x t +1: m | n = (cid:48) x t +1: m ∼ DM m − t ( n , a t +1: m ) ;iii: (cid:20) n n (cid:21) ∼ DM (cid:18) n, (cid:20) (cid:48) a t (cid:48) a t +1: m (cid:21)(cid:19) ;iv: x ∼ DM m ( n, a ) . Proof:
We only have to show that the joint distribution can be factored in this form. Bythe DM characterization as a mixture, we can write it as Dirichlet mixture of Multinomials.By the bipartition theorems, we can factor both, the Multinomials and the Dirichlet, sothe theorem follows.
B.7 Dirichlet of the Second Kind
Consider y ∼ Di m +1 ( a ). The vector z = (1 /y m +1 ) y : m has Dirichlet of the Second Kind(D2K) distribution. Theorem 23:
Characterization of D2K by the Gamma distribution.Using the characterization of the Dirichlet by the Gamma, we can write the D2K variateas a function of m + 1 independent Gamma variates, z : m ∼ (1 /x m +1 ) x : m where x k ∼ Ga ( a k , b ) . Similar to what we did for the Dirichlet (of the first kind), we can write the D2Kdistribution and its moments as: f ( z | a ) = z (cid:52) ( a : m − (cid:48) z ) (cid:48) a B ( a ) ,E ( z ) = e = (1 /a m +1 ) a : m , Cov( z ) = 1 a m +1 − e ) + e ⊗ e (cid:48) ) . The logarithm of a Gamma variate is well approximated by a Normal variate, seeAitchison & Shen (1980). This approximation is the key to several efficient computationalprocedures, and motivates the computation of the first two moments of the log-D2Kdistribution. For that, we use the Digamma, ψ ( ), and Trigamma function, ψ (cid:48) ( ), definedas: ψ ( a ) = dda ln Γ( a ) = Γ (cid:48) ( a )Γ( a ) , ψ (cid:48) ( a ) = dda ψ ( a ) . Lemma 24:
The expectation and covariance of a log-D2K variate are: E (log( z )) = ψ ( a : m ) − ψ ( a m +1 ) , .8. EXAMPLES z )) = diag ( ψ (cid:48) ( a : m )) + ψ (cid:48) ( a m +1 ) ⊗ (cid:48) . Proof:
Consider a Gamma variate, x ∼ G ( a,
1) :1 = (cid:90) ∞ f ( x ) dx = (cid:90) ∞ a ) x a − exp( − x ) dx . Taking the derivative with respect to parameter a , we have0 = (cid:90) ∞ ln( x ) x a − exp( − x )Γ( a ) dx − Γ (cid:48) ( a )Γ ( a ) Γ( a ) = E (ln( x )) − ψ ( a ) . Taking the derivative with respect to parameter a a second time, ψ (cid:48) ( a ) = dda E (ln( x )) = dda (cid:90) ∞ ln( x )Γ( a ) x a − exp( − x ) dx = (cid:90) ∞ ln( x ) x a − exp( − x )Γ( a ) dx − Γ (cid:48) ( a )Γ( a ) E (ln( x ))= E (ln( x ) ) − E (ln( x )) = Var(ln( x )) . The lemma follows from the D2K characterization by the Gamma.
B.8 Examples
Example 1:
Let A , B be two attributes, each one of them present or absent in theelements of a population. Then each element of this population can be classified inexactly one of 2 = 4 categoriesA B k I k present present 1 [1 , , , (cid:48) present absent 2 [0 , , , (cid:48) absent present 3 [0 , , , (cid:48) absent absent 4 [0 , , , (cid:48) According to the notation above, we can write x | n, θ ∼ Mn ( n, θ ).If θ = [0 . , . , . , .
15] and n = 10, thenPr( x | n, θ ) = (cid:18) x (cid:19) ( θ (cid:52) x ) . Hence, in order to compute the probability of x = [1 , , , (cid:48) given θ , we use the expressionabove, obtaining Pr | . . . . = 0 . . APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
Example 2: If X | θ ∼ Mn (10 , θ ), θ = [0 . , . , . X ) = (2 , , . , while the covariance matrix isΣ = . − . − . − . . − . − . − .
45 1 . . Example 3:
Assume that X | θ ∼ Mn (10 , θ ), with θ = [0 . , . , . A = { , } , A = { , } . Then, (cid:88) A X i | θ = X + X | θ ∼ Mn (10 , θ + θ ) , or X + X | θ ∼ Mn (10 , . . Analogously, X + X | θ ∼ Mn (10 , . ,X + X | θ ∼ Mn (10 , . ,X | θ ∼ Mn (10 , . . Note that, in general, if X | θ ∼ Mn k ( n, θ ) then X i | θ ∼ Mn ( n, θ i ), i = 1 , ..., k . Example 4: X | θ ∼ Mn ( n, θ ), as in a 3x3 Contingency Tables: x x x x • x x x x • x x x x • x • x • x • n Applying Theorem 5 we get( X • , X • ) | θ ∼ Mn ( n, θ (cid:48) ) , θ (cid:48) = ( θ • , θ • ) , θ (cid:48) = θ . This result tell us that ( X i , X i , X i ) | θ ∼ Mn ( n, θ (cid:48) i ) , with θ (cid:48) i = ( θ i , θ i , θ i ) , θ (cid:48) i = 1 − θ i • , i = 1 , , . .8 EXAMPLES X i , X i ) | x i • , θ ∼ Mn ( x i • , θ (cid:48) i )with θ (cid:48) i = ( θ il , θ i ) θ i • , θ (cid:48) i = θ i θ i • . The next result expresses the distribution of X | θ in term of the conditional distri-butions, of each row of the table, in its sum, and in term of the distribution of thesesums. Proposition 25: If X | θ ∼ Mn r − ( n, θ ), as in an r × r , contingency table, then P ( X | θ )can be written as P ( X | θ ) = (cid:34) r (cid:89) i =1 P ( X i , ..., X i,r − | x i • , θ ) (cid:35) P ( X • , ..., X r − • | θ ) . Proof:
We have: P ( X | θ ) = n ! r (cid:89) i =1 θ x i i x i ! = n ! θ x ... θ x rr rr x ! ... x rr != (cid:34) r (cid:89) i =1 x i • ! x i ! ... x ir ! (cid:18) θ i θ i • (cid:19) x i ... (cid:18) θ ir θ i • (cid:19) x ir (cid:35) n ! x i • ! ... x r • ! θ x • • ... θ x r • r • . From Theorems 5 and 6, as in the last example, we recognize each of the first r factorsabove as the probabilities of each row in the table , conditioned on its sum, and recognizethe last factor as the joint probability distribution of sum of these r rows. Corollary 26: If X | θ ∼ Mn r − ( n, θ ), as in Theorems 5 and 6, then P ( X | x • , ..., x r − • , θ ) = r (cid:89) i =1 P ( X i , ..., X i,r − | x i • , θ )and, knowing θ, x • , ..., x r − • ,( X , ..., X ,r − ) (cid:113) ... (cid:113) ( X r , ..., X r,r − ) . APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
Proof:
Since P ( X | θ ) = P ( X | x • , ..., x r − • , θ ) P ( X • , X • , ..., X r − • | θ ) , from Theorems 5 and 6 we get the proposed equality.The following result will be used next to express Theorem 7 as a canonical represen-tation for P ( X | θ ). Proposition 27: If X | θ ∼ Mn r − ( n, θ ), as in Proposition, then a transformation T : ( θ , ..., θ r , ..., θ r , ..., θ r,r − ) → ( λ , ..., λ ,r − , ..., λ r , ..., λ r,r − , η , ..., η r − )given by λ = θ θ • , ... , λ ,r − = θ ,r − θ • ... λ r = θ r θ r • , ... , λ r,r − = θ r,r − θ r • η = θ • , η = θ • , ..., η r − = θ ( r − • is a onto transformation defined in { < θ + ... + θ r,r − < < θij < } over theunitary cube of dimension r −
1. Moreover, the Jacobian of this transformation, t , is J = η r − η r − ... η r − r − (1 − η − ... − η r − ) r − . The proof is left as an exercise.
Example 5:
Let us examine the case of a 2 × x x x x n θ θ θ θ P ( X | θ ) we use the transformation T in the case r = 2: λ = θ θ + θ ,λ = θ θ + θ ,η = θ + θ , .9. FUNCTIONAL CHARACTERIZATIONS P ( X | θ ) == (cid:18) x • x (cid:19) λ x (1 − λ ) x (cid:18) x • x (cid:19) λ x (1 − λ ) x (cid:18) nx • (cid:19) η x • (1 − η ) x • , < θ < , < θ < , < η < . B.9 Functional Characterizations
The objective of this section is to derive the general form of a homogeneous Markov ran-dom process. Theorem 28, by Reny and Aczel, states that such a process is described by amixture of Poisson distributions. Our presentation follows Aczel (1966, sec. 2.1 and 2.3)and Janossy, Reny and Aczel (1950). It follows from the characterization of the Multino-mial by the Poisson distribution given in theorem 4, that Reny-Aczel characterization ofa homogeneous and local time point process is analogous to de Finetti characterizationof an infinite exchangeable 0-1 process as a mixture of Bernoulli distributions, see forexample Feller (1971, v.2, ch.VII, sec. 4).
Cauchy’s Functional Equations
Cauchy’s additive functional equation has the form f ( x + y ) = f ( x ) + f ( y ) . The following argument from Cauchy (1821) shows that a continuous solution of thisfunctional equation must have the form f ( x ) = cx . Repeating the sum of the same argument, x , n times, we must have f ( nx ) = nf ( x ).If x = ( m/n ) t , then nx = mt and nf ( x ) = f ( nx ) = f ( mt ) = mf ( t ) hence ,f (cid:16) mn t (cid:17) = mn f ( t ) , taking c = f (1), and x = m/n , it follows that f ( x ) = cx , over the rationals, x ∈ IQ . Fromthe continuity condition for f ( x ), the last result must also be valid over the reals, x ∈ R .Q.E.D.Cauchy’s multiplicative functional equation has the form f ( x + y ) = f ( x ) f ( y ) , ∀ x, y > , f ( x ) ≥ . APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS
The trivial solution of this equation is f ( x ) ≡
0. Assuming f ( x ) >
0, we take thelogarithm, reducing the multiplicative equation to the additive equation,ln f ( x y ) = ln f ( x ) + ln f ( y ) , hence , ln f ( x ) = cx , or f ( x ) = exp( cx ) . Homogeneous Discrete Markov Processes
We seek the general form of a homogeneous discrete Markov process. Let w k ( t ), for t ≥ k events. Let us also assume the followinghypotheses:Time Locality: If t ≤ t ≤ t ≤ t then, the number of events in [ t , t [ is independentsof the number of events in [ t , t [.Time Homogeneity: The distribution for the number of events occurring in [ t , t [depends only on the interval length, t = t − t .From time locality and homogeneity, we can decompose the occurrence of no (zero)events in [0 , t + u [ as , w ( t + u ) = w ( t ) w ( u ) . Hence, w ( t ) must obey Cauchy’s functional equation, and w ( t ) = exp( ct ) = exp( − λt ) . Since w ( t ) is a probability distribution, w ( t ) ≤
1, and λ > v ( t ) = 1 − w ( t ) = 1 − exp( − λt ), the probability of one or more events occurringbefore t >
0, must be the familiar exponential distribution.For k ≥ t + u , the general decomposition relation is w n ( t + u ) = n (cid:88) k =0 w k ( t ) w n − k ( u ) . Theorem 28: (Reny-Aczel) The general (non trivial) solution of this this system offunctional equations has the form: w k ( t ) = e − λt (cid:88)
By induction: The theorem is true for k = 0. Let us assume, as inductionhypothesis, that it is true to k < n . The last equation in the recursive system is w n ( t + u ) = n (cid:88) k =0 w k ( t ) w n − k ( u ) = w n ( t ) e − λu + w n ( u ) e − λt + e − λ ( t + u ) n − (cid:88) k =1 (cid:88) k (cid:89) i =1 ( c i t ) r i r i ! k (cid:89) j =1 ( c j u ) s j s j ! . Defining f n ( t ) = e λt w n ( t ) − (cid:88)
This work is in memory of Professor D Basu who was the supervisor of the first author PhDdissertation, the starting point for the research in Bayesian analysis of categorical datapresented here. A long list of papers follows Basu and Pereira (1982). We have chosena few that we recommend for additional reading: Albert (1985), Gunel (1984), Irony,Pereira and Tiwari (2000), Paulino and Pereira (1992, 1995) and Walker (1996). To makethe analysis more realistic, extensions and mixtures of Dirichlet also were considered. For88
APPENDIX B: BINOMIAL, DIRICHLET AND RELATED DISTRIBUTIONS instance see Albert and Gupta (1983), Carlson (1977), Dickey (1983), Dickey, Jiang andKadane (1987), and Jiang, Kadane and Dickey (1992).Usually the more complex distributions are used to realistic represent situations forwhich the strong properties of Dirichlet seems to be not realistic. For instance, in a 2 x 2contingency table, the first line to be conditional independent of the second line given themarginal seems to be unrealistic in some situations. Mixtures of Dirichlet in some casestake care of the situation as shown by Albert and Gupta (1983).The properties presented here are also important in non-parametric Bayesian statisticsin order to understand the Dirichlet process for the competitive risk survival problem. Seefor instance Salinas-Torres, Pereira and Tiwari (1997, 2002). In order to be historicallycorrect we cannot forget the important book of Wilks, published in 1962, where one canfind the definition of Dirichlet distribution.The material presented in this essay adopts a singular representation for several dis-tributions, as in Pereira and Stern (2005). This representation is unusual in the statisticalliterature, but the singular representation makes it simpler to extend and generalize theresults and greatly facilitates numerical and computational implementations.We end this essay presenting the Reny-Aczel characterization of the Poisson mixture.This result can be interpreted as an alternative to de Finetti characterization theoremintroduced in Finetti (1937). Using the characterization of binomial distributions byPoisson processes conditional arguments, as given by Theorem 4, and Blackwell (minimal)sufficiency properties discussed in Basu and Pereira (1983), Section 9 leads in fact to aDe Finetti characterization for Binomial distributions. Also, if one recall the indifferenceprinciple (Mendel, 1989) the finite version of Finetti argument can simply be obtained.See also Irony and Pereira (1994) for the motivation of these arguments. The considerationof Section 9 could be viewed as a very simple formulation of the binomial distributionfinite characterization. ppendix CModel Miscellanea “Das Werdende, das ewig wirkt und lebt,Umfass euch mit der Liebe holden Schranken,Und was in schwankender Erscheinung schwebt,Befestiget mit dauernden Gedanken!”
The becoming, which forever works and lives,Holds you in love’s gracious bonds,And what fluctuates in apparent oscillations,Fix it in place with enduring thoughts!Johann Wolfgang von Goethe (1749-1832),The Lord, in Faust, prologue in heaven. “Randomness and order do not contradict eachother; more or less both may be true at once.The randomness controls the world and due tothis in the world there is order and law, whichcan be expressed in measures of random eventsthat follow the laws of probability theory.”
Alfr´ed R´enyi (1921 - 1970).This appendix collects the material in some slide presentations on a miscellanea ofstatistical models used during the curse to illustrate several aspects of the FBST use andimplementation. This appendix is not intended to be a self sufficient reading material,but rather a guide or further study. Section 1, on contingency table models, is (I hope)fully supplemented by the material on the Multinomial-Dirichlet distribution presentedin appendix B. These models are of great practical importance, and also relatively simple28990
APPENDIX C. MODEL MISCELLANEA to implement and easy interpret. These characteristics make them ideal for the severalstatistical “experiments” required in the home works. Section 2, on a Wibull model,should require only minor additional reading, for further details see Barlow and Prochan(1981) and Ironi et al. (2002). This model highlights the importance of being able toincorporate expert opinion as prior information.Sections 3 to 5, presenting several models based on the Normal-Wishard distribution,may require extensive additional readings. Some epistemological aspects of these modelsare discussed in chapters 4 and 5. The material in these sections is presented for comple-tude, but its reading is optional, and only recommended for those students with a degreein statistics or equivalent knowledge. Of course, it is also possible to combine Normal-Wishad and Multinomial-Dirichlet models, in the form of mixture models, see section 6and Lauretto and Stern (2005). Section 7 presents an overview of the REAL classificationtree algorithm, for further details see Lauretto et al. (1998).
C.1 Contingency Table Models
Homogeneity test in × contingency table This model is useful in many applications, like comparison of two communities with re-lation to a disease incidence, consumer behavior, electoral preference, etc. Two samplesare taken from two binomial populations, and the objective is to test whether the successratios are equal. Let x and y be the number of successes of two independent binomialexperiments of sample sizes m and n , respectively. The posterior density for this multi-nomial model is, f ( θ | x, y, n, m ) ∝ θ x θ n − x θ y θ m − y The parameter space and the null hypothesis set are:Θ = { ≤ θ ≤ | θ + θ = 1 ∧ θ + θ = 1 } Θ = { θ ∈ Θ | θ = θ } The Bayes Factor considering a priori
P r { H } = P r { θ = θ } = 0 . and Θ − Θ is given in the equation below. See [ ? ] and [ ? ] for detailsand discussion about properties. BF = (cid:18) mx (cid:19) (cid:18) ny (cid:19)(cid:18) m + nx + y (cid:19) ( m + 1)( n + 1) m + n + 1 .2. WEIBULL WEAROUT MODEL Independence test in a × contingency table Suppose that laboratory test is used to help in the diagnostic of a disease. It shouldbe interesting to check if the test results are really related to the health conditions of apatient. A patient chosen from a clinic is classified as one of the four states of the set { ( h, t ) | h, t = 0 or } in such a way that h is the indicator of the occurrence or not of the disease and t is theindicator for the laboratory test being positive or negative. For a sample of size n werecord ( x , x , x , x ), the vector whose components are the sample frequency of eachthe possibilities of ( t, h ). The parameter space is the simplexΘ = { ( θ , θ , θ , θ ) | θ ij ≥ ∧ (cid:88) i,j θ ij = 1 } and the null hypothesis, h and t are independent, is defined byΘ = { θ ∈ Θ | θ = θ • θ • , θ • = θ + θ , θ • = θ + θ } . The Bayes Factor for this case is discussed by [Iro 95] and has the following expression: BF = (cid:18) x • x (cid:19) (cid:18) x • x (cid:19)(cid:18) nx • (cid:19) (cid:26) ( n + 2) { ( n + 3) − ( n + 2)[ P (1 − P ) + Q (1 − Q )] } n + 1) (cid:27) where x i • = x i + x i , x • j = x j + x j , P = x • n +2 , and Q = x • n +2 . C.2 Weibull Wearout Model
We where faced with the problem of testing the wearout of a lot of used display panels.A panel displays 12 to 18 characters. Each character is displayed as a 5 × APPENDIX C. MODEL MISCELLANEA informed the mean life of the panels at those machines. Only working panels were ac-quired. The acquired panels were installed as components on machines of a differenttype. The use intensity of the panels at each type of machine corresponds to a differenttime scale, so mean lifes are not directly comparable. The shape parameter however isan intrinsic characteristic of the panel. The used time over mean life ratio, ρ = α/µ , isadimensional, and can therefore be used as an intrinsic measure of wearout. We haverecorded the time to failure, or times of withdrawal with no failure, of the panels at thenew machines, and want to use this data to corroborate (or not) the wearout informationprovided by the surplus equipment dealer. Weibull Distribution
The two parameter Weibull probability density, reliability (or survival probability) andhazard functions, for a failure time t ≥
0, given the shape, and characteristic life (or scale)parameters, β >
0, and γ >
0, are: w ( t | β, γ ) = ( β t β − /γ β ) exp ( − ( t/γ ) β ) r ( t | β, γ ) = exp ( − ( t/γ ) β ) z ( t | β, γ ) ≡ w ( ) /r ( ) = β t β − /γ β The mean and variance of a Weibull variate are given by: µ = γ Γ(1 + 1 /β ) σ = γ (Γ(1 + 2 /β ) + Γ (1 + 1 /β ))By altering the parameter, β , W ( t | β, γ ) takes a variety of shapes, Dodson(1994).Some values of shape parameter are important special cases: for β = 1, W is the exponen-tial distribution; for β = 2, W is the Rayleigh distribution; for β = 2 . W approximatesthe lognormal distribution; for β = 3 . W approximates the normal distribution; and for β = 5 . W approximates the peaked normal distribution. The flexibility of the Weibulldistribution makes it very useful for empirical modeling, specially in quality control andreliability. The regions β < β = 1, and β > γ is approximately the 63rd percentile of the life time,regardless of the shape parameter.The Weibull also has important theoretical properties. If n i.i.d. random variableshave Weibull distribution, X i ∼ w ( t | β, γ ), then the first failure is a Weibull variate withcharacteristic life γ/n /β , i.e. X [1 ,n ] ∼ w ( t | β, γ/n /β ). This kind of property allows acharacterization of the Weibull as a limiting life distribution in the context of extremevalue theory, Barlow and Prochan (1975). .2. WEIBULL WEAROUT MODEL t = t (cid:48) + α leads to the three parameter truncated Weibulldistribution. A location (or threshold) parameter, α > t = 0, after it has already survived the period [ − α, w ( t | α, β, γ ) = ( β ( t + α ) β − /γ β ) exp ( − (( t + α ) /γ ) β ) /r ( α | β, γ ) r ( t | α, β, γ ) = exp ( − (( t + α ) /γ ) β ) /r ( α | β, γ ) Wearout Model
Wearout Model

The problem described in the preceding sections can be tested using the FBST, with parameter space, hypothesis and posterior joint density:
$$ \Theta = \{ (\alpha, \beta, \gamma) \in\ ]0, \infty[\ \times\ [1, \infty[\ \times\ ]0, \infty[\ \} $$
$$ \Theta_0 = \{ (\alpha, \beta, \gamma) \in \Theta \mid \alpha = \rho\, \mu(\beta, \gamma) \} $$
$$ f(\alpha, \beta, \gamma \mid D) \propto \prod_{i=1}^{n} w(t_i \mid \alpha, \beta, \gamma) \prod_{j=1}^{m} r(t_j \mid \alpha, \beta, \gamma) $$
where the data $D$ are all the recorded failure times, $t_i > 0$, and the times of withdrawal with no failure, $t_j > 0$. The numerical procedures work with the log-posterior kernel, $fl(\,)$. Given a sample with $n$ recorded failures and $m$ withdrawals,
$$ wl_i = \log(\beta) + (\beta - 1)\log(t_i + \alpha) - \beta \log(\gamma) - ((t_i + \alpha)/\gamma)^{\beta} + (\alpha/\gamma)^{\beta} $$
$$ rl_j = -((t_j + \alpha)/\gamma)^{\beta} + (\alpha/\gamma)^{\beta} $$
$$ fl = \sum_{i=1}^{n} wl_i + \sum_{j=1}^{m} rl_j $$
the hypothesis being represented by the constraint
$$ h(\alpha, \beta, \gamma) = \rho\, \gamma\, \Gamma(1 + 1/\beta) - \alpha = 0 $$
The gradients of $fl(\,)$ and $h(\,)$, whose analytical expressions are to be given to the optimizer, are:
$$ dwl = \left[\ \frac{\beta-1}{t+\alpha} - \left(\frac{t+\alpha}{\gamma}\right)^{\beta} \frac{\beta}{t+\alpha} + \left(\frac{\alpha}{\gamma}\right)^{\beta} \frac{\beta}{\alpha}\ ,\ \ \frac{1}{\beta} + \log(t+\alpha) - \log(\gamma) - \left(\frac{t+\alpha}{\gamma}\right)^{\beta} \log\frac{t+\alpha}{\gamma} + \left(\frac{\alpha}{\gamma}\right)^{\beta} \log\frac{\alpha}{\gamma}\ ,\ \ -\frac{\beta}{\gamma} + \left(\frac{t+\alpha}{\gamma}\right)^{\beta} \frac{\beta}{\gamma} - \left(\frac{\alpha}{\gamma}\right)^{\beta} \frac{\beta}{\gamma}\ \right] $$
$$ drl = \left[\ -\left(\frac{t+\alpha}{\gamma}\right)^{\beta} \frac{\beta}{t+\alpha} + \left(\frac{\alpha}{\gamma}\right)^{\beta} \frac{\beta}{\alpha}\ ,\ \ -\left(\frac{t+\alpha}{\gamma}\right)^{\beta} \log\frac{t+\alpha}{\gamma} + \left(\frac{\alpha}{\gamma}\right)^{\beta} \log\frac{\alpha}{\gamma}\ ,\ \ \left(\frac{t+\alpha}{\gamma}\right)^{\beta} \frac{\beta}{\gamma} - \left(\frac{\alpha}{\gamma}\right)^{\beta} \frac{\beta}{\gamma}\ \right] $$
$$ dh = \left[\ -1\ ,\ \ -\rho\, \gamma\, \Gamma(1 + 1/\beta)\, \Psi(1 + 1/\beta) / \beta^2\ ,\ \ \rho\, \Gamma(1 + 1/\beta)\ \right] $$
For efficient algorithms for the gamma and digamma functions, see Spanier and Oldham (1987). In this model, some prior distribution for the shape parameter is needed to stabilize the model. Knowing the color elements' life time to be approximately normal, we consider $\beta$ restricted to an interval around the normal-approximating value, $\beta \approx 3.6$.
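The log-posterior kernel and the constraint translate directly into code. The sketch below, with names of our own choosing, is what would be handed to a numerical optimizer; the analytical gradients are omitted for brevity.

import numpy as np
from scipy.special import gamma as Gamma

def fl(alpha, beta, gam, t_fail, t_cens):
    # fl = sum_i wl_i + sum_j rl_j, for failure times t_fail
    # and withdrawal (censoring) times t_cens, both numpy arrays
    wl = (np.log(beta) + (beta - 1) * np.log(t_fail + alpha)
          - beta * np.log(gam)
          - ((t_fail + alpha) / gam)**beta + (alpha / gam)**beta)
    rl = -((t_cens + alpha) / gam)**beta + (alpha / gam)**beta
    return wl.sum() + rl.sum()

def h(alpha, beta, gam, rho):
    # wearout hypothesis: alpha = rho * mu(beta, gam)
    return rho * gam * Gamma(1 + 1 / beta) - alpha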
C.3 The Normal-Wishart Distribution

The matrix notation used in this section is defined in Section F.1. The Bayesian research group at IME-USP has developed several applications based on multidimensional normal models, including structure models, mixture models and factor analysis models. In this appendix we review the core theory of some of these models, since they are used in some of the illustrative examples in chapters 4 and 5. For implementation details, practical applications, case studies, and further comments, see Lauretto et al. (2003).

The conjugate family of priors for multivariate normal distributions is the Normal-Wishart family of distributions, DeGroot (1970). Consider the random matrix $X$ with elements $X_i^j$, $i = 1 \ldots k$, $j = 1 \ldots n$, $n > k$, where each column, $x^j$, contains a sample vector from a $k$-multivariate normal distribution with parameters $\beta$ (mean vector) and $V$ (covariance matrix), or $R = V^{-1}$ (precision matrix). Let $\bar{x}$ and $W$ denote, respectively, the statistics:
$$ \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x^j = \frac{1}{n} X \mathbf{1}\ , \qquad W = \sum_{j=1}^{n} (x^j - \beta)(x^j - \beta)' = (X - \beta)(X - \beta)' $$
The random matrix $W$ has Wishart distribution with $n$ degrees of freedom and precision matrix $R$. The Normal and Wishart pdfs have the expressions:
$$ f(\bar{x} \mid n, \beta, R) = \left( \frac{n}{2\pi} \right)^{k/2} |R|^{1/2} \exp\left( -\frac{n}{2} (\bar{x} - \beta)' R (\bar{x} - \beta) \right) $$
$$ f(W \mid n, \beta, R) = c\, |W|^{(n-k-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(W R) \right)\ , \qquad c^{-1} = |R|^{-n/2}\, 2^{nk/2}\, \pi^{k(k-1)/4} \prod_{j=1}^{k} \Gamma\left( \frac{n+1-j}{2} \right) $$
Now consider the matrix $X$ as above, with unknown mean $\beta$ and unknown precision matrix $R$, and the statistic
$$ S = \sum_{j=1}^{n} (x^j - \bar{x})(x^j - \bar{x})' = (X - \bar{x})(X - \bar{x})' $$
Take as prior distribution for the precision matrix $R$ the Wishart distribution with $a > k - 1$ degrees of freedom and precision matrix $\dot{S}$ and, given $R$, take as prior for $\beta$ a multivariate normal with mean $\dot{\beta}$ and precision $\dot{n} R$, i.e.
$$ p(\beta, R) = p(R)\, p(\beta \mid R) $$
$$ p(R) \propto |R|^{(a-k-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(R \dot{S}) \right)\ , \qquad p(\beta \mid R) \propto |R|^{1/2} \exp\left( -\frac{\dot{n}}{2} (\beta - \dot{\beta})' R (\beta - \dot{\beta}) \right) $$
The posterior distribution for the parameters $\beta$ and $R$ has the form:
$$ p_n(\beta, R \mid n, \bar{x}, S) = p_n(R \mid n, \bar{x}, S)\, p_n(\beta \mid R, n, \bar{x}, S) $$
$$ p_n(R \mid n, \bar{x}, S) \propto |R|^{(a+n-k-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(R \ddot{S}) \right)\ , \qquad p_n(\beta \mid R, n, \bar{x}, S) \propto |R|^{1/2} \exp\left( -\frac{\ddot{n}}{2} (\beta - \ddot{\beta})' R (\beta - \ddot{\beta}) \right) $$
$$ \ddot{\beta} = (n \bar{x} + \dot{n} \dot{\beta}) / \ddot{n}\ , \qquad \ddot{n} = n + \dot{n}\ , \qquad \ddot{S} = S + \dot{S} + \frac{n \dot{n}}{n + \dot{n}} (\dot{\beta} - \bar{x})(\dot{\beta} - \bar{x})' $$
Hence, the posterior distribution for $R$ is a Wishart distribution with $a + n$ degrees of freedom and precision $\ddot{S}$, and the conditional distribution for $\beta$, given $R$, is $k$-Normal with mean $\ddot{\beta}$ and precision $\ddot{n} R$. All covariance and precision matrices are supposed to be positive definite, $n > k$, $a > k - 1$, and $\dot{n} > 0$. Non-informative improper priors are given by $\dot{n} = 0$, $\dot{\beta} = 0$, $a = 0$, $\dot{S} = 0$, i.e. we take a Wishart with 0 degrees of freedom as prior for $R$, and a constant prior for $\beta$, Box and Tiao (1973), DeGroot (1970), Zellner (1971). Then, the posterior for $R$ is a Wishart with $n$ degrees of freedom and precision $S$, and the posterior for $\beta$, given $R$, is $k$-Normal with mean $\bar{x}$ and precision $n R$.

We can now write the simplified log-posterior kernels:
$$ fl(\beta, R \mid n, \bar{x}, S) = fl(R \mid n, \bar{x}, S) + fl(\beta \mid R, n, \bar{x}, S) $$
$$ fl(R \mid n, \bar{x}, S) = flr = \frac{a+n-k-1}{2} \log(|R|) - \frac{1}{2} \mathrm{tr}(R \ddot{S}) $$
$$ fl(\beta \mid R, n, \bar{x}, S) = flb = \frac{1}{2} \log(|R|) - \frac{\ddot{n}}{2} (\beta - \ddot{\beta})' R (\beta - \ddot{\beta}) $$
For the surprise kernel, relative to the uninformative prior, we only have to replace the factor $(a+n-k-1)/2$ by $(a+n)/2$.
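The conjugate update above is a few lines of code. The sketch below is our own (function name and conventions are assumptions); it takes the data matrix and the prior hyper-parameters and returns the posterior quantities.

import numpy as np

def nw_update(X, n_dot, beta_dot, a, S_dot):
    # X is k x n, one sample vector per column; returns
    # (n_ddot, beta_ddot, posterior degrees of freedom a + n, S_ddot)
    k, n = X.shape
    xbar = X.mean(axis=1)
    D = X - xbar[:, None]
    S = D @ D.T
    n_ddot = n + n_dot
    beta_ddot = (n * xbar + n_dot * beta_dot) / n_ddot
    dev = beta_dot - xbar
    S_ddot = S + S_dot + (n * n_dot / n_ddot) * np.outer(dev, dev)
    return n_ddot, beta_ddot, a + n, S_ddot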
C.4 Structural Models

In this section we study the dose-equivalence hypothesis. The dose-equivalence hypothesis, $H$, asserts a proportional response of a pair of response measurements to two different stimuli. The hypothesis also asserts proportional standard deviations, and equivalent correlations for each response pair. The proportionality coefficient, $\delta$, is interpreted as the second stimulus' dose equivalent to one unit of the first.

This can be seen as a simultaneous generalization of the linear mean structure, the linear covariance structure, and the Behrens-Fisher problems. The test proved to be useful when comparing levels of genetic expression, as well as for calibrating microarray equipment at BIOINFO, the genetic research task force at the University of Sao Paulo. The application of the dose-equivalence model is similar to the much simpler bio-equivalence model used in pharmacology, and closely related to several other classic covariance structure models used in biology, psychology, and the social sciences, as described in Anderson (1969), Bock and Bargmann (1966), Jiang and Sarkar (1998, 1999, 2000a,b), Jöreskog (1970), and McDonald (1962, 1974, 1975). We are not aware of any alternative test for the dose-equivalence hypothesis published in the literature.

C.4.1 Mean and Covariance Structure
As is usual in the covariance structure literature, we write $V(\gamma) = \sum_h \gamma_h G\{h\}$, where the matrices $G\{h\}$, $h = 1, \ldots, k(k+1)/2$, form a basis of the space of $k \times k$ symmetric matrices; in our case, $k = 4$. The matrix notation is presented in Section F.1.
$$ V(\gamma) = \sum_{h=1}^{10} \gamma_h G\{h\} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 & \gamma_4 \\ \gamma_2 & \gamma_5 & \gamma_6 & \gamma_7 \\ \gamma_3 & \gamma_6 & \gamma_8 & \gamma_9 \\ \gamma_4 & \gamma_7 & \gamma_9 & \gamma_{10} \end{bmatrix} $$
where each basis matrix $G\{h\}$ has Kronecker-delta entries $\delta_{jh}$ selecting the positions occupied by $\gamma_h$, with $\delta_{jh} = 1$ if $h = j$ and $\delta_{jh} = 0$ if $h \neq j$.

The dose-equivalence hypothesis, $H$, asserts a proportional response of a pair of response measurements to two different stimuli. Each pair of response measurements is supposed to be a bivariate normal variate. $H$ also asserts proportional standard deviations, and equivalent correlations for each pair of response measurements. The proportionality coefficient, $\delta$, is interpreted as the dose, calibration or proportionality coefficient.

In order to get simpler expressions for the log-likelihood, the constraints and their gradients, we use in the numerical procedures an extended parameter space including the coefficient $\delta$, and state the dose-equivalence optimization problem on the extended 15-dimensional space, with a 5-dimensional constraint:
$$ \Theta = \{ \theta = [\gamma', \beta', \delta]' \in R^{15}\ ,\ V(\gamma) > 0 \}\ , \qquad \Theta_0 = \{ \theta \in \Theta \mid h(\theta) = 0 \} $$
$$ h(\theta) = \left[\ \delta^2 \gamma_1 - \gamma_8\ ,\ \ \delta^2 \gamma_2 - \gamma_9\ ,\ \ \delta^2 \gamma_5 - \gamma_{10}\ ,\ \ \delta \beta_1 - \beta_3\ ,\ \ \delta \beta_2 - \beta_4\ \right]' $$
In order to be able to compute some gradients needed in the next section, we recall some matrix derivative identities, see Anderson (1969), Harville (1997), McDonald and Swaminathan (1973), Rogers (1980). We use $V = V(\gamma)$, $R = V^{-1}$, and $C$ for a constant matrix.
$$ \frac{\partial V}{\partial \gamma_h} = G\{h\}\ , \qquad \frac{\partial R}{\partial \gamma_h} = -R\, G\{h\}\, R\ , \qquad \frac{\partial\, \beta' C \beta}{\partial \beta} = 2\, C \beta\ , \qquad \frac{\partial \log(|V|)}{\partial \gamma_h} = \mathrm{tr}(R\, G\{h\})\ , $$
$$ \frac{\partial\, \mathrm{frob2}(V - C)}{\partial \gamma_h} = 2 \sum_{i,j} \left[ (V - C) \odot G\{h\} \right]_{ij}\ . $$
We also define the auxiliary matrices: $P\{h\} = R\, G\{h\}$, $Q\{h\} = P\{h\}\, R$.
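The basis $G\{h\}$ and the identity $\partial \log|V| / \partial \gamma_h = \mathrm{tr}(R\, G\{h\})$ are easy to verify numerically. The sketch below is ours, with one possible ordering of the basis; the chosen values of $\gamma$ are arbitrary, picked only to make $V(\gamma)$ positive definite.

import numpy as np

def sym_basis(k):
    # one symmetric basis matrix G{h} per upper-triangular position
    basis = []
    for i in range(k):
        for j in range(i, k):
            G = np.zeros((k, k)); G[i, j] = G[j, i] = 1.0
            basis.append(G)
    return basis

def V_of(gamma, basis):
    return sum(g * G for g, G in zip(gamma, basis))

basis = sym_basis(4)
gamma = np.zeros(10); gamma[[0, 4, 7, 9]] = [4., 3., 2., 1.]  # diagonal entries
R = np.linalg.inv(V_of(gamma, basis))
h, eps = 1, 1e-6
g2 = gamma.copy(); g2[h] += eps
fd = (np.linalg.slogdet(V_of(g2, basis))[1]
      - np.linalg.slogdet(V_of(gamma, basis))[1]) / eps
assert abs(fd - np.trace(R @ basis[h])) < 1e-4   # d log|V| / d gamma_h = tr(R G{h})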
C.4.2 Numerical Optimization
To find $\theta^*$ we use an objective function, to be minimized on the extended parameter space, given by a centralization term minus the log-posterior kernel,
$$ f(\theta \mid n, \bar{x}, S) = c_n\, \mathrm{frob2}(V - C) - flr - flb = c_n\, \mathrm{frob2}(V - C) - \frac{a+n-k}{2} \log(|R|) + \frac{1}{2} \mathrm{tr}(R \ddot{S}) + \frac{\ddot{n}}{2} (\beta - \ddot{\beta})' R (\beta - \ddot{\beta}) $$
Large enough centralization factors, $c_n$, times the squared Frobenius norm of $(V - C)$, where $C$ are intermediate approximations of the constrained minimum, make the first points of the optimization sequence remain in the neighborhood of the empirical covariance (the initial $C$). As the optimization proceeds, we relax the centralization factor, i.e. make $c_n \to 0$, and maximize the pure posterior function. This is a standard optimization procedure following the regularization strategy of Proximal-Point algorithms, see Bertsekas and Tsitsiklis (1989), Iusem (1995), Censor and Zenios (1997). In practice this strategy lets us avoid handling explicitly the difficult constraint $V(\gamma) > 0$. The components of the gradient, $\partial f / \partial \theta$, are
$$ \frac{\partial f}{\partial \gamma_h} = \frac{a+n-k}{2} \mathrm{tr}(P\{h\}) - \frac{1}{2} \mathrm{tr}(Q\{h\} \ddot{S}) - \frac{\ddot{n}}{2} (\beta - \ddot{\beta})' Q\{h\} (\beta - \ddot{\beta}) + 2\, c_n \sum_{i,j=1}^{n} \left[ (V - C) \odot G\{h\} \right]_{ij} $$
$$ \frac{\partial f}{\partial \beta} = \ddot{n}\, R\, (\beta - \ddot{\beta}) $$
For the surprise kernel and its gradient, relative to the uninformative prior, we only have to replace the factor $(a+n-k)/2$ by $(a+n+1)/2$. Each row of the Jacobian of the constraint, $\partial h / \partial \theta$, is sparse: row 1, for instance, has $\delta^2$ in the $\gamma_1$ column, $-1$ in the $\gamma_8$ column and $2 \delta \gamma_1$ in the $\delta$ column, and the remaining rows are analogous.

At the optimization step, Variable-Metric Proximal-Point algorithms, working with the explicit analytical derivatives given above, proved to be very stable, in contrast with the often unpredictable behavior of some methods found in most statistical software, like Newton-Raphson or "Scoring". Optimization problems of small dimension, like the one above, allow us to use dense matrix representation without significant loss, Stern (1994). In order to handle several other structural hypotheses, we only have to replace the constraint, and its Jacobian, passed to the optimizer. Hence, many different hypotheses about the mean and covariance or correlation structure can be treated in a coherent, efficient, exact, robust, simple, and unified way. The derivation of the Monte Carlo procedure for the numerical integrations required to implement the FBST in this model is presented in appendix G.
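The outer loop of the centralization strategy can be sketched generically. Everything below (the function names, the BFGS inner solver, the shrinking schedule) is an illustrative assumption of ours, not the original implementation.

import numpy as np
from scipy.optimize import minimize

def centralized_minimize(neg_log_post, V_of, theta0, c0=10.0, shrink=0.5, n_outer=15):
    # minimize c * frob2(V(theta) - C) + neg_log_post(theta), relaxing c -> 0,
    # with C reset to the current V(theta) after each outer iteration
    theta, c = np.asarray(theta0, float), c0
    for _ in range(n_outer):
        C = V_of(theta)
        obj = lambda th: c * np.sum((V_of(th) - C)**2) + neg_log_post(th)
        theta = minimize(obj, theta, method="BFGS").x
        c *= shrink
    return theta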
C.5 Factor Analysis
This section reviews the most basic facts about FA models. For a synthetic introduction to factor analysis, see Ghahramani and Hinton (1997) and Everitt (1984). For some of the matrix analytic and algorithmic details, see Abadir and Magnus (2005), Golub and van Loan (1989), Harville (2000), Rubin and Thayer (1982), and Russel (1998). For the technical issue of factor rotation, see Browne (1974, 2001), Jennrich (2001, 2002, 2004) and Bernaards and Jennrich (2005).

The generative model for Factor Analysis (FA) is $x = \Lambda z + u$, where $x$ is a $p \times 1$ vector of observed variables, $z$ is a $k \times 1$ vector of latent variables or factors, and $\Lambda$ is the $p \times k$ matrix of factor loadings, or weights. FA is used as a dimensionality reduction technique, so $k < p$. The vector variates $z$ and $u$ are assumed to be distributed as $N(0, I)$ and $N(0, \Psi)$, where $\Psi$ is diagonal. Hence, the observed and latent variables' joint distribution is
$$ \begin{bmatrix} x \\ z \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda \Lambda' + \Psi & \Lambda \\ \Lambda' & I \end{bmatrix} \right)\ . $$
For two jointly distributed Gaussian (vector) variates,
$$ \begin{bmatrix} x \\ z \end{bmatrix} \sim N\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C' & D \end{bmatrix} \right)\ , $$
the distribution of $z$ given $x$ is given by, see Zellner (1971),
$$ z \mid x \sim N\left( b + C' A^{-1} (x - a)\ ,\ D - C' A^{-1} C \right)\ . $$
Hence, in the FA model,
$$ z \mid x \sim N(B x\ ,\ I - B \Lambda)\ , \quad \mbox{where} \quad B = \Lambda' (\Lambda \Lambda' + \Psi)^{-1} = \Lambda' \left( \Psi^{-1} - \Psi^{-1} \Lambda \left( I + \Lambda' \Psi^{-1} \Lambda \right)^{-1} \Lambda' \Psi^{-1} \right) $$
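The two expressions for $B$ agree by the Woodbury identity, which is cheap to check numerically. The sketch below is ours, with arbitrary dimensions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
p, k = 6, 2
Lam = rng.normal(size=(p, k))
Psi_inv = np.diag(1.0 / rng.uniform(0.5, 1.5, size=p))   # Psi diagonal

B1 = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.linalg.inv(Psi_inv))
B2 = Lam.T @ (Psi_inv - Psi_inv @ Lam
              @ np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)
              @ Lam.T @ Psi_inv)
assert np.allclose(B1, B2)

x = rng.normal(size=p)
z_mean, z_cov = B1 @ x, np.eye(k) - B1 @ Lam   # moments of z | x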
C.5.1 EM Algorithm

In order to obtain the Maximum Likelihood (ML) estimator of the parameters, one can use the EM algorithm, see Rubin and Thayer (1982) and Russel (1998). The E-step for the FA model computes the expected first and second moments of the latent variables, for each observation, $x$:
$$ E(z \mid x) = B x\ , \quad \mbox{and} \quad E(z z' \mid x) = \mathrm{Cov}(z \mid x) + E(z \mid x)\, E(z \mid x)' = I - B \Lambda + B x x' B' $$
The M-step optimizes the parameters $\Lambda$ and $\Psi$ of the expected log-likelihood for the FA (completed data) model,
$$ q(\Lambda, \Psi) = E\left( \log \prod_{j=1}^{n} f(x \mid z, \Lambda, \Psi) \right) = E\left( \log \prod_{j=1}^{n} (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left( -\frac{1}{2} (x^j - \Lambda z)' \Psi^{-1} (x^j - \Lambda z) \right) \right) $$
$$ = c - \frac{n}{2} \log|\Psi| - \sum_{j=1}^{n} E\left( \frac{1}{2} x^{j\prime} \Psi^{-1} x^j - x^{j\prime} \Psi^{-1} \Lambda z + \frac{1}{2} z' \Lambda' \Psi^{-1} \Lambda z \right) $$
Using the results computed in the E-step, the last summation can be written as
$$ \sum_{j=1}^{n} \left( \frac{1}{2} x^{j\prime} \Psi^{-1} x^j - x^{j\prime} \Psi^{-1} \Lambda\, E(z \mid x^j) + \frac{1}{2} \mathrm{tr}\left( \Lambda' \Psi^{-1} \Lambda\, E(z z' \mid x^j) \right) \right) $$
The ML estimator, $(\Lambda^*, \Psi^*)$, is a stationary point in $\Lambda^*$, therefore
$$ \frac{\partial q}{\partial \Lambda} = \sum_{j=1}^{n} \Psi^{-1} x^j E(z \mid x^j)' - \sum_{j=1}^{n} \Psi^{-1} \Lambda\, E(z z' \mid x^j) = 0\ , \quad \mbox{hence} \quad \Lambda^* = \left( \sum_{j=1}^{n} x^j E(z \mid x^j)' \right) \left( \sum_{j=1}^{n} E(z z' \mid x^j) \right)^{-1} $$
The stationary point condition also holds in $\Psi^*$, or in its inverse; therefore, substituting the stationary value of $\Lambda^*$ computed in the last equation,
$$ \frac{\partial q}{\partial \Psi^{-1}} = \frac{n}{2} \Psi - \sum_{j=1}^{n} \left( \frac{1}{2} x^j x^{j\prime} - \Lambda^* E(z \mid x^j)\, x^{j\prime} + \frac{1}{2} \Lambda^* E(z z' \mid x^j)\, \Lambda^{*\prime} \right) = 0 $$
Solving for $\Psi$, and using the diagonality constraint,
$$ \Psi^* = \frac{1}{n} \mathrm{diag}\left( \sum_{j=1}^{n} x^j x^{j\prime} - \Lambda^* \sum_{j=1}^{n} E(z \mid x^j)\, x^{j\prime} \right) $$
The equation for $\Lambda^*$, in the M-step of the EM algorithm for FA, formally resembles the equation giving the LS estimation in a Linear Regression model, $\beta' = y' X (X' X)^{-1}$. This is why, in the FA literature, the matrix $\Lambda^*$ is sometimes interpreted as "the linear regression coefficients of the $z$'s on the $x$'s".
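Collecting the E- and M-step formulas gives a compact iteration. The sketch below is an illustrative ML implementation under conventions of our own (columns of X as centered observations), not the authors' code.

import numpy as np

def fa_em(X, k, n_iter=200, seed=0):
    p, n = X.shape
    rng = np.random.default_rng(seed)
    Lam, psi = rng.normal(size=(p, k)), np.ones(p)
    for _ in range(n_iter):
        # E-step: B = Lam'(Lam Lam' + Psi)^{-1}; moments of z | x
        B = Lam.T @ np.linalg.inv(Lam @ Lam.T + np.diag(psi))
        Ez = B @ X                                    # E(z | x_j), one column per j
        sum_Ezz = n * (np.eye(k) - B @ Lam) + Ez @ Ez.T
        # M-step: Lam* = (sum_j x_j E(z|x_j)') (sum_j E(zz'|x_j))^{-1}
        Lam = (X @ Ez.T) @ np.linalg.inv(sum_Ezz)
        psi = np.diag(X @ X.T - Lam @ (Ez @ X.T)) / n  # diagonality constraint
    return Lam, psi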
C.5.2 Orthogonal and Oblique Rotations

Given a FA model and a non-singular coordinate transform, $T$, it is possible to obtain transformed factors together with transformed loadings giving an equivalent FA model. Both a direct and an inverse form of the factor loadings transform are common in the literature.

In the direct form, $\tilde{z} = T^{-1} z$ and $\tilde{\Lambda} = \Lambda T$; hence, in the new model,
$$ x = \Lambda z + u = \Lambda T\, T^{-1} z + u = \tilde{\Lambda} \tilde{z} + u \quad \mbox{and} \quad \mathrm{Cov}(x) = \tilde{\Lambda} \tilde{\Lambda}' + \Psi = \Lambda T (T^{-1} I T^{-t}) T' \Lambda' + \Psi = \Lambda \Lambda' + \Psi\ . $$
In the inverse form, $\tilde{z} = T' z$ and $\tilde{\Lambda} = \Lambda T^{-t}$; hence, in the new model,
$$ x = \Lambda z + u = \Lambda T^{-t}\, T' z + u = \tilde{\Lambda} \tilde{z} + u \quad \mbox{and} \quad \mathrm{Cov}(x) = \tilde{\Lambda} \tilde{\Lambda}' + \Psi = \Lambda T^{-t} (T' I T) T^{-1} \Lambda' + \Psi = \Lambda \Lambda' + \Psi\ . $$
This shows that the FA model is only determined by the $k$-dimensional subspace of $R^p$ spanned by the factors. Any change of coordinates in this (sub)space, given by $T$, leads to an equivalent model.

An operator $T$ is an orthogonal rotation iff $T' T = I$. Hence, orthogonally transformed factors are still normalized and uncorrelated. An operator $T$ is an oblique rotation (in the inverse form) iff $\mathrm{diag}(T' T) = I$. Hence, obliquely transformed factors are still normalized, but correlated. We want to choose either an orthogonal or an oblique rotation, $T$, so as to minimize a complexity criterion function of $\tilde{\Lambda}$. Before discussing appropriate criteria and how to use them, we examine some technical details concerning matrix norms and projections in the following subsection.
C.5.3 Frobenius Norm and Projections
The matrix Frobenius product and the matrix Frobenius norm are defined as follows:
$$ \langle A \mid B \rangle_F = \mathrm{tr}(A' B) = \mathbf{1}' (A \odot B)\, \mathbf{1}\ , \qquad ||A||_F^2 = \langle A \mid A \rangle_F = \sum_j ||A_j||^2\ . $$
Lemma 1: The projection, $T$, with respect to the Frobenius norm, of a square matrix $A$ into the algebraic sub-manifold of the oblique rotation matrices, is given as follows:
$$ T = A\ \mathrm{diag}(A' A)^{-1/2}\ . $$
A matrix $T$ represents an oblique rotation iff it has normalized columns, that is, iff $\mathrm{diag}(T' T) = I$. We want to minimize
$$ ||A - T||_F^2 = \sum_j ||A_j - T_j||^2\ . $$
But, in the 2-norm, the normalized vector $T_j$ that is closest to $A_j$ is the one that has the same direction as the vector $A_j$, that is,
$$ T_j = \frac{1}{||A_j||}\, A_j = \frac{1}{(A_j' A_j)^{1/2}}\, A_j\ , $$
hence the lemma.

Lemma 2: The projection, $Q$, with respect to the Frobenius norm, of a square matrix $A$ into the algebraic sub-manifold of the orthogonal rotation matrices, is given by its SVD factorization, as follows:
$$ Q = U V' \quad \mbox{where} \quad U' (A) V = \mathrm{diag}(s)\ . $$
In order to prove the second lemma, we will consider the following problem. The orthogonal Procrustes problem seeks the orthogonal rotation, $Q \mid Q' Q = I$, that minimizes the distance between a target matrix, $A$, $m \times p$, and the rotation of a second matrix, $B$. Formally, the problem is stated as
$$ \min_{Q \mid Q' Q = I}\ ||A - B Q||_F^2 $$
The norm function being minimized can be restated as
$$ ||A - B Q||_F^2 = \mathrm{tr}(A' A) + \mathrm{tr}(B' B) - 2\, \mathrm{tr}(Q' B' A) $$
Hence the problem asks for the maximum of the last term. Let $Z$ be an orthogonal matrix defined by $Q$ and the SVD factorization of $B' A$ as follows:
$$ U' (B' A) V = S = \mathrm{diag}(s)\ , \qquad Z = V' Q' U\ . $$
We have
$$ \mathrm{tr}(Q' B' A) = \mathrm{tr}(Q' U S V') = \mathrm{tr}(Z S) = s'\, \mathrm{diag}(Z) \leq s' \mathbf{1}\ . $$
But the last inequality is tight if $Z = I$; hence the optimal solution for the orthogonal Procrustes problem is
$$ Q = U V' \quad \mbox{where} \quad U' (B' A) V = \mathrm{diag}(s)\ . $$
In order to prove Lemma 2, just consider the case $B = I$.
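Both projections of the lemmas are one-liners. The sketch below, under our own naming, also verifies the defining properties of each projection.

import numpy as np

def proj_oblique(A):
    # Lemma 1: T = A diag(A'A)^{-1/2}, i.e. normalize the columns of A
    return A / np.sqrt(np.diag(A.T @ A))

def proj_orthogonal(A):
    # Lemma 2: Q = U V' from the SVD of A
    U, s, Vt = np.linalg.svd(A)
    return U @ Vt

A = np.random.default_rng(1).normal(size=(3, 3))
Q = proj_orthogonal(A)
assert np.allclose(Q.T @ Q, np.eye(3))          # orthogonal rotation
T = proj_oblique(A)
assert np.allclose(np.diag(T.T @ T), np.ones(3))  # oblique rotation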
C.5.4 Sparsity Optimization

In the FA literature, minimizing the complexity of the factor loadings, $\tilde{\Lambda}$, is accomplished by maximizing a measure of its sparsity, $f(\tilde{\Lambda})$. A natural sparsity measure in engineering applications is the Minimum Entropy measure. This measure and its (matrix) derivative are given by
$$ f_{me}(\Lambda) = -\langle \Lambda 2 \mid \log(\Lambda 2) \rangle_F\ , \quad (\Lambda 2)_{ji} = \left( \Lambda_{ji} \right)^2\ , \qquad \frac{d f_{me}(\Lambda)}{d \Lambda} = -2\, \Lambda \odot \log(\Lambda 2) - 2\, \Lambda\ . $$
Several variations of the entropy sparsity measure are used in the literature, see Bernaards and Jennrich (2005). Hoyer (2004) proposes the following sparsity measure for a vector $x \in R_+^k$, based on the difference of two $p$-norms, namely $p = 1$ and $p = 2$:
$$ f_{ho}(x) = \frac{1}{\sqrt{k} - 1} \left( \sqrt{k} - \frac{||x||_1}{||x||_2} \right)\ . $$
From the Cauchy-Schwartz inequality, we have the bounds
$$ \frac{1}{\sqrt{k}}\, ||x||_1 \leq ||x||_2 \leq ||x||_1\ , \quad \mbox{hence} \quad 0 \leq f_{ho}(x) \leq 1\ . $$
Similar interpretations can be given to Carroll's Oblimin family, on the parameter $\gamma$, and to the Crawford-Ferguson family, on the parameter $\kappa$, of sparsity measures. These measures, for $\Lambda$, $p \times k$, and their (matrix) derivatives are given by
$$ f_{\gamma}(\Lambda) = \frac{1}{4} \langle \Lambda 2 \mid B(\gamma) \rangle_F\ , \quad \mbox{where} \quad B(\gamma) = (I - \gamma C)\, \Lambda 2\, N\ , $$
$$ (\Lambda 2)_{ji} = \left( \Lambda_{ji} \right)^2\ , \quad C_{ji} = \frac{1}{p}\ , \quad N_{ji} = 1 - \delta_{ji}\ , \quad \delta_{ji} = \mathrm{I}(i = j)\ , \qquad \frac{d f_{\gamma}(\Lambda)}{d \Lambda} = \Lambda \odot B(\gamma)\ ; $$
$$ f_{\kappa}(\Lambda) = \frac{1}{4} \langle \Lambda 2 \mid B(\kappa) \rangle_F\ , \quad \mbox{where} \quad B(\kappa) = (1 - \kappa)\, \Lambda 2\, N + \kappa\, M\, \Lambda 2\ , $$
$$ M_{ji} = 1 - \delta_{ji}\ \ (p \times p)\ , \quad N_{ji} = 1 - \delta_{ji}\ \ (k \times k)\ , \qquad \frac{d f_{\kappa}(\Lambda)}{d \Lambda} = \Lambda \odot B(\kappa)\ . $$
These parametric families include many sparsity measures, or simplicity criteria, traditionally used in psychometric studies; for example, setting $\gamma$ to $0$, $1/2$, or $1$, we have the Quartimin, Biquartimin or Covarimin criterium, and setting $\kappa$ to $0$, $1/p$, $k/(2p)$ or $(k-1)/(p+k-2)$, we have the Quartimax, Varimax, Equamax or Parsimax criterium.

In order to find an optimal rotation, $T^*$, we need to express the sparsity function and its matrix derivative as functions of $T$. In the direct form,
$$ \frac{d f(\tilde{\Lambda})}{d T} = \frac{d f(\Lambda T)}{d T} = \Lambda' \frac{d f(\tilde{\Lambda})}{d \tilde{\Lambda}}\ . $$
In the inverse form,
$$ \frac{d f(\tilde{\Lambda})}{d T} = \frac{d f(\Lambda T^{-t})}{d T} = -\left( T^{-1} \Lambda' \frac{d f(\tilde{\Lambda})}{d \tilde{\Lambda}}\, T^{-1} \right)'\ . $$
These expressions, together with the projectors obtained in the last section, can be used in standard gradient projection optimization algorithms, like the Generalized Reduced Gradient (GRG) or other standard primal optimization algorithms, see Bernaards and Jennrich (2005), Jennrich (2002), Luenberger (1984), Minoux and Vajda (1986), Shah et al. (1964), and Stern et al. (2006).

The Cayley transform gives a correspondence between the skew-symmetric matrices, $K$, $K_{ji} = -K_{ij}$, and the orthogonal operators, $Q$, that do not have $-1$ as an eigenvalue. Writing $J = K + I$,
$$ K = (I - Q)(I + Q)^{-1} = 2 (I + Q)^{-1} - I\ , \qquad Q = (I - K)(I + K)^{-1} = 2 J^{-1} - I\ . $$
The sparsity measure derivatives of the direct orthogonal rotation of the factor loadings, using the Cayley representation, are given by
$$ f(\tilde{\Lambda}) = f(\Lambda T)\ , \quad T = 2 J^{-1} - I\ , $$
$$ \frac{\partial f(\tilde{\Lambda})}{\partial J_{ji}} = \mathrm{tr}\left( \frac{\partial f(\tilde{\Lambda})}{\partial T}' \frac{\partial T}{\partial J_{ji}} \right) = -2\, \mathrm{tr}\left( \frac{\partial f(\tilde{\Lambda})}{\partial T}' J^{-1} \frac{\partial J}{\partial J_{ji}} J^{-1} \right) = 2 (Y_{ji} - Y_{ij})\ , \quad \mbox{where} \quad Y = J^{-1} \frac{\partial f(\tilde{\Lambda})}{\partial T}' J^{-1}\ . $$
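As a sanity check of the minimum-entropy measure and its matrix derivative, the sketch below (our own naming) compares the analytical derivative against a finite difference.

import numpy as np

def f_me(L):
    L2 = L**2
    return -np.sum(L2 * np.log(L2))        # -<L2 | log L2>_F

def df_me(L):
    return -2 * L * np.log(L**2) - 2 * L   # matrix derivative given above

L = np.random.default_rng(2).uniform(0.1, 1.0, size=(5, 2))
E = np.zeros_like(L); E[0, 0] = 1e-7
fd = (f_me(L + E) - f_me(L)) / 1e-7
assert abs(fd - df_me(L)[0, 0]) < 1e-3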
C.6 Mixture Models

The matrix notation used in this section is defined in Section F.1. In this section, $h, i$ are indices in the range $1:d$, $k$ is in $1:m$, and $j$ is in $1:n$.

In a $d$-dimensional multivariate finite mixture model with $m$ components (or classes), and sample size $n$, any given sample $x^j$ is of class $k$ with probability $w_k$; the weights, $w_k$, give the probability that a new observation is of class $k$. A sample $j$ of class $k = c(j)$ is distributed with density $f(x^j \mid \psi_k)$.

The classifications $z_{jk}$ are boolean variables indicating whether or not $x^j$ is of class $k$, i.e. $z_{jk} = 1$ iff $c(j) = k$. $Z$ is not observed, being therefore named a latent variable or missing data. Conditioning on the missing data, we get:
$$ f(x^j \mid \theta) = \sum_{k=1}^{m} f(x^j \mid \theta, z_{jk})\, f(z_{jk} \mid \theta) = \sum_{k=1}^{m} w_k\, f(x^j \mid \psi_k) $$
$$ f(X \mid \theta) = \prod_{j=1}^{n} f(x^j \mid \theta) = \prod_{j=1}^{n} \sum_{k=1}^{m} w_k\, f(x^j \mid \psi_k) $$
Given the mixture parameters, $\theta$, and the observed data, $X$, the conditional classification probabilities, $P = f(Z \mid X, \theta)$, are:
$$ p_{jk} = f(z_{jk} \mid x^j, \theta) = \frac{f(z_{jk}, x^j \mid \theta)}{f(x^j \mid \theta)} = \frac{w_k\, f(x^j \mid \psi_k)}{\sum_{k=1}^{m} w_k\, f(x^j \mid \psi_k)} $$
We use $y_k$ for the number of samples of class $k$, i.e. $y_k = \sum_j z_{jk}$, or $y = Z' \mathbf{1}$. The likelihood for the "completed" data, $X, Z$, is:
$$ f(X, Z \mid \theta) = \prod_{j=1}^{n} f(x^j \mid \psi_{c(j)})\, f(z_{jk} \mid \theta) = \prod_{k=1}^{m} \left( w_k^{y_k} \prod_{j \mid c(j) = k} f(x^j \mid \psi_k) \right) $$
We will see in the following sections that considering the missing data $Z$, and the conditional classification probabilities $P$, is the key to successfully solving the numerical integration and optimization steps of the FBST. In this article we focus on Gaussian finite mixture models, where $f(x^j \mid \psi_k) = N(x^j \mid b^k, R\{k\})$, a normal density with mean $b^k$ and variance matrix $V\{k\}$, or precision $R\{k\} = (V\{k\})^{-1}$. Next we specialize the theory of general mixture models to the Dirichlet-Normal-Wishart case.

C.6.1 Dirichlet-Normal-Wishart Mixtures
Consider the random matrix $X_i^j$, $i$ in $1:d$, $j$ in $1:n$, $n > d$, where each column contains a sample element from a $d$-multivariate normal distribution with parameters $b$ (mean) and $V$ (covariance), or $R = V^{-1}$ (precision). Let $u$ and $S$ denote the statistics:
$$ u = \frac{1}{n} \sum_{j=1}^{n} x^j = \frac{1}{n} X \mathbf{1}\ , \qquad S = \sum_{j=1}^{n} (x^j - b) \otimes (x^j - b)' = (X - b)(X - b)' $$
The random vector $u$ has normal distribution with mean $b$ and precision $n R$. The random matrix $S$ has Wishart distribution with $n$ degrees of freedom and precision matrix $R$. The Normal and Wishart pdfs have the expressions:
$$ N(u \mid n, b, R) = \left( \frac{n}{2\pi} \right)^{d/2} |R|^{1/2} \exp\left( -\frac{n}{2} (u - b)' R (u - b) \right) $$
$$ W(S \mid e, R) = c^{-1} |S|^{(e-d-1)/2} \exp\left( -\frac{1}{2} \mathrm{tr}(S R) \right) $$
with normalization constant $c = |R|^{-e/2}\, 2^{ed/2}\, \pi^{d(d-1)/4} \prod_{i=1}^{d} \Gamma((e - i + 1)/2)$.

Now consider the matrix $X$ as above, with unknown mean $b$ and unknown precision matrix $R$, and the statistic
$$ S = \sum_{j=1}^{n} (x^j - u) \otimes (x^j - u)' = (X - u)(X - u)' $$
The conjugate family of priors for multivariate normal distributions is the Normal-Wishart, see DeGroot (1970). Take as prior distribution for the precision matrix $R$ the Wishart distribution with $\dot{e} > d - 1$ degrees of freedom and precision matrix $\dot{S}$ and, given $R$, take as prior for $b$ a multivariate normal with mean $\dot{u}$ and precision $\dot{n} R$, i.e. let us take the Normal-Wishart prior $NW(b, R \mid \dot{n}, \dot{e}, \dot{u}, \dot{S})$. Then, the posterior distribution for $R$ is a Wishart distribution with $\ddot{e}$ degrees of freedom and precision $\ddot{S}$, and the posterior for $b$, given $R$, is $d$-Normal with mean $\ddot{u}$ and precision $\ddot{n} R$, i.e., we have the Normal-Wishart posterior:
$$ NW(b, R \mid \ddot{n}, \ddot{e}, \ddot{u}, \ddot{S}) = W(R \mid \ddot{e}, \ddot{S})\, N(b \mid \ddot{n}, \ddot{u}, R) $$
$$ \ddot{n} = \dot{n} + n\ , \quad \ddot{e} = \dot{e} + n\ , \quad \ddot{u} = (n u + \dot{n} \dot{u}) / \ddot{n}\ , \quad \ddot{S} = S + \dot{S} + (n \dot{n} / \ddot{n})(u - \dot{u}) \otimes (u - \dot{u})' $$
All covariance and precision matrices are supposed to be positive definite, and proper priors have $\dot{e} \geq d$ and $\dot{n} \geq 1$. Non-informative Normal-Wishart improper priors are given by $\dot{n} = 0$, $\dot{u} = 0$, $\dot{e} = 0$, $\dot{S} = 0$, i.e. we take a Wishart with 0 degrees of freedom as prior for $R$, and a constant prior for $b$, see DeGroot (1970). Then, the posterior for $R$ is a Wishart with $n$ degrees of freedom and precision $S$, and the posterior for $b$, given $R$, is $d$-Normal with mean $u$ and precision $n R$.

The conjugate prior for a multinomial distribution is a Dirichlet distribution:
$$ M(y \mid n, w) = \left( n! \left/ y_1! \ldots y_m! \right. \right) w_1^{y_1} \ldots w_m^{y_m} $$
$$ D(w \mid y) = \left( \Gamma(y_1 + \ldots + y_m) \left/ \Gamma(y_1) \ldots \Gamma(y_m) \right. \right) \prod_{k=1}^{m} w_k^{y_k - 1} $$
with $w > 0$ and $\mathbf{1}' w = 1$. Prior information given by $\dot{y}$, and observation $y$, result in the posterior parameter $\ddot{y} = \dot{y} + y$. A non-informative prior is given by $\dot{y} = \mathbf{1}$.

Finally, we can write the posterior and completed posterior for the model as:
$$ f(\theta \mid X, \dot{\theta}) = f(X \mid \theta)\, f(\theta \mid \dot{\theta})\ , \qquad f(X \mid \theta) = \prod_{j=1}^{n} \sum_{k=1}^{m} w_k\, N(x^j \mid b^k, R\{k\}) $$
$$ f(\theta \mid \dot{\theta}) = D(w \mid \dot{y}) \prod_{k=1}^{m} NW(b^k, R\{k\} \mid \dot{n}_k, \dot{e}_k, \dot{u}^k, \dot{S}\{k\}) $$
$$ p_{jk} = w_k\, N(x^j \mid b^k, R\{k\}) \left/ \sum_{k=1}^{m} w_k\, N(x^j \mid b^k, R\{k\}) \right. $$
$$ f(\theta \mid X, Z, \dot{\theta}) = f(\theta \mid X, Z)\, f(\theta \mid \dot{\theta}) = D(w \mid \ddot{y}) \prod_{k=1}^{m} NW(b^k, R\{k\} \mid \ddot{n}_k, \ddot{e}_k, \ddot{u}^k, \ddot{S}\{k\}) $$
$$ y = Z' \mathbf{1}\ , \quad \ddot{y} = \dot{y} + y\ , \quad \ddot{n} = \dot{n} + y\ , \quad \ddot{e} = \dot{e} + y $$
$$ u^k = \frac{1}{y_k} \sum_{j=1}^{n} z_{jk}\, x^j\ , \qquad S\{k\} = \sum_{j=1}^{n} z_{jk}\, (x^j - u^k) \otimes (x^j - u^k)' $$
$$ \ddot{u}^k = \frac{1}{\ddot{n}_k} (\dot{n}_k \dot{u}^k + y_k u^k)\ , \qquad \ddot{S}\{k\} = S\{k\} + \dot{S}\{k\} + (\dot{n}_k y_k / \ddot{n}_k)(u^k - \dot{u}^k) \otimes (u^k - \dot{u}^k)' $$
C.6.2 Gibbs Sampling and Integration
In order to integrate a function over the posterior measure, we use an ergodic Markov chain. The form of the chain below is known as Gibbs sampling, and its use for numerical integration is known as Markov Chain Monte Carlo, or MCMC.

Given $\theta$, we can compute $P$. Given $P$, $f(z^j \mid p^j)$ is a simple multinomial distribution. Given the latent variables, $Z$, we have simple conditional posterior density expressions for the mixture parameters:
$$ f(w \mid Z, \dot{y}) = D(w \mid \ddot{y})\ , \qquad f(R\{k\} \mid X, Z, \dot{e}_k, \dot{S}\{k\}) = W(R \mid \ddot{e}_k, \ddot{S}\{k\}) $$
$$ f(b^k \mid X, Z, R\{k\}, \dot{n}_k, \dot{u}^k) = N(b \mid \ddot{n}_k, \ddot{u}^k, R\{k\}) $$
Gibbs sampling is nothing but the MCMC generated by cyclically updating the variables $Z$, $\theta$, and $P$, drawing $\theta$ and $Z$ from the above distributions, see Gilks et al. (1996) and Häggström (2002). A uniform generator is all that is needed for the multinomial variate. A Dirichlet variate $w$ can be drawn using a gamma generator with shape and scale parameters $\alpha$ and $\beta$, $G(\alpha, \beta)$, see Gentle (1998). Johnson (1987) describes a simple procedure to generate the Cholesky factor of a Wishart variate $W = U' U$ with $n$ degrees of freedom, from the Cholesky factorization of the covariance $V = R^{-1} = C' C$, and a chi-square generator: a) $g_k = G(\ddot{y}_k, 1)$; b) $w_k = g_k / \sum_{k=1}^{m} g_k$; c) for $i < j$, $B_{i,j} = N(0, 1)$; d) $B_{i,i} = \sqrt{\chi^2(n - i + 1)}$; and e) $U = B C$. All subsequent matrix computations proceed directly from the Cholesky factors, see Jones (1985).
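A sketch of the variate generators in steps (a)-(e), under naming of our own; the triangularity conventions assumed for B and C follow the quoted recipe and may need adjustment against Johnson (1987).

import numpy as np

def draw_dirichlet(y_ddot, rng):
    g = rng.gamma(y_ddot)            # a) g_k ~ G(y_k, 1)
    return g / g.sum()               # b) w_k = g_k / sum_k g_k

def draw_wishart_chol(n, C, rng):
    # c)-e): Cholesky factor U of W = U'U with n degrees of freedom,
    # from the factor C of the covariance V = R^{-1} = C'C
    d = C.shape[0]
    B = np.zeros((d, d))
    for i in range(d):
        B[i, i] = np.sqrt(rng.chisquare(n - i))    # chi2(n - i + 1), 1-based index
        B[i, i + 1:] = rng.normal(size=d - i - 1)  # N(0, 1) for i < j
    return B @ C

rng = np.random.default_rng(3)
w = draw_dirichlet(np.array([2.0, 5.0, 3.0]), rng)
U = draw_wishart_chol(10, np.eye(2), rng)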
Label Switching and Forbidden States
Given a mixture model, we obtain an equivalent model by renumbering the components $1:m$ by a permutation $\sigma([1:m])$. This symmetry must be broken in order to have an identifiable model, see Stephens (1997). Let us assume there is an order criterion that can be used when numbering the components. If the components are not in the correct order, Label Switching is the operation of finding the permutation $\sigma([1:m])$ and renumbering the components so that the order criterion is satisfied. If we want to look consistently at the classifications produced during an MCMC run, we must enforce a label switching to break all non-identifiability symmetries. For example, in the Dirichlet-Normal-Wishart mixture model, we could choose to order the components (switch labels) according to the rank given by: 1) a given linear combination of the vector means, $c' * b^k$; or 2) the variance determinant $|V\{k\}|$. The choice of a good label switching criterion should consider not only the model structure and the data, but also the semantics and interpretation of the model.

The semantics and interpretation of the model may also dictate that some states, like certain configurations of the latent variables $Z$, are either meaningless or invalid, and must therefore be excluded from the Markov chain as forbidden states.

C.6.3 EM Algorithm for ML and MAP Estimation
The EM algorithm optimizes the log-posterior function $fl(X \mid \theta) + fl(\theta \mid \dot{\theta})$, see Dempster (1977), Ormoneit (1995) and Russel (1988). The EM is derived from the conditional log-likelihood and the Jensen inequality: if $w, y > 0$ and $\mathbf{1}' w = 1$, then $\log(w' y) \geq w' \log(y)$.

Let $\theta$ and $\tilde{\theta}$ be our current and next estimates of the MAP (Maximum a Posteriori), and $p_{jk} = f(z_{jk} \mid x^j, \theta)$ the conditional classification probabilities. At each iteration, the log-posterior improvement is:
$$ \delta(\tilde{\theta}, \theta \mid X, \dot{\theta}) = fl(\tilde{\theta} \mid X, \dot{\theta}) - fl(\theta \mid X, \dot{\theta}) = \delta(\tilde{\theta}, \theta \mid X) + \delta(\tilde{\theta}, \theta \mid \dot{\theta}) $$
$$ \delta(\tilde{\theta}, \theta \mid \dot{\theta}) = fl(\tilde{\theta} \mid \dot{\theta}) - fl(\theta \mid \dot{\theta})\ , \qquad \delta(\tilde{\theta}, \theta \mid X) = fl(X \mid \tilde{\theta}) - fl(X \mid \theta) = \sum_j \delta(\tilde{\theta}, \theta \mid x^j) $$
$$ \delta(\tilde{\theta}, \theta \mid x^j) = fl(x^j \mid \tilde{\theta}) - fl(x^j \mid \theta) = \log \sum_k \tilde{w}_k\, f(x^j \mid \tilde{\psi}_k) - fl(x^j \mid \theta) $$
$$ = \log \sum_k p_{jk}\, \frac{\tilde{w}_k\, f(x^j \mid \tilde{\psi}_k)}{p_{jk}\, f(x^j \mid \theta)} \ \geq\ \Delta(\tilde{\theta}, \theta \mid x^j) = \sum_k p_{jk} \log \frac{\tilde{w}_k\, f(x^j \mid \tilde{\psi}_k)}{p_{jk}\, f(x^j \mid \theta)} $$
Hence, $\Delta(\tilde{\theta}, \theta \mid X, \dot{\theta}) = \Delta(\tilde{\theta}, \theta \mid X) + \delta(\tilde{\theta}, \theta \mid \dot{\theta})$ is a lower bound to $\delta(\tilde{\theta}, \theta \mid X, \dot{\theta})$. Also $\Delta(\theta, \theta \mid X, \dot{\theta}) = \delta(\theta, \theta \mid X, \dot{\theta}) = 0$. So, under mild differentiability conditions, both surfaces are tangent, assuring convergence of EM to the nearest local maximum. But maximizing $\Delta(\tilde{\theta}, \theta \mid X, \dot{\theta})$ over $\tilde{\theta}$ is the same as maximizing
$$ Q(\tilde{\theta}, \theta) = \sum_{k,j} p_{jk} \log\left( \tilde{w}_k\, f(x^j \mid \tilde{\psi}_k) \right) + fl(\tilde{\theta} \mid \dot{\theta}) $$
and each iteration of the EM algorithm breaks down into two steps:
E-step: Compute $P = E(Z \mid X, \theta)$.
M-step: Optimize $Q(\tilde{\theta}, \theta)$, given $P$.
For the Gaussian mixture model, with a Dirichlet-Normal-Wishart prior,
$$ Q(\tilde{\theta}, \theta) = \sum_{k=1}^{m} \sum_{j=1}^{n} p_{jk} \left( \log \tilde{w}_k + \log N(x^j \mid \tilde{b}^k, \tilde{R}\{k\}) \right) + fl(\tilde{\theta} \mid \dot{\theta}) $$
$$ fl(\tilde{\theta} \mid \dot{\theta}) = \log D(\tilde{w} \mid \dot{y}) + \sum_{k=1}^{m} \log NW(\tilde{b}^k, \tilde{R}\{k\} \mid \dot{n}_k, \dot{e}_k, \dot{u}^k, \dot{S}\{k\}) $$
The Lagrange optimality conditions give simple analytical solutions for the M-step:
$$ y = P' \mathbf{1}\ , \qquad \tilde{w}_k = (y_k + \dot{y}_k - 1) \left/ \left( n - m + \sum_{k=1}^{m} \dot{y}_k \right) \right. $$
$$ u^k = \frac{1}{y_k} \sum_{j=1}^{n} p_{jk}\, x^j\ , \qquad S\{k\} = \sum_{j=1}^{n} p_{jk}\, (x^j - \tilde{b}^k) \otimes (x^j - \tilde{b}^k)' $$
$$ \tilde{b}^k = \frac{\dot{n}_k \dot{u}^k + y_k u^k}{\dot{n}_k + y_k}\ , \qquad \tilde{V}\{k\} = \frac{S\{k\} + \dot{n}_k (\tilde{b}^k - \dot{u}^k) \otimes (\tilde{b}^k - \dot{u}^k)' + \dot{S}\{k\}}{y_k + \dot{e}_k - d} $$
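For the Gaussian case, the E- and M-steps reduce to a few matrix operations. The sketch below is a plain ML version of our own; the prior terms are dropped, and re-inserting them follows the $\tilde{w}$, $\tilde{b}$, $\tilde{V}$ formulas above.

import numpy as np

def gmm_em(X, m, n_iter=100, seed=0):
    d, n = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(m, 1.0 / m)
    b = X[:, rng.choice(n, size=m, replace=False)].T.copy()  # m x d initial means
    V = np.array([np.cov(X) + 1e-6 * np.eye(d) for _ in range(m)])
    for _ in range(n_iter):
        P = np.zeros((n, m))
        for k in range(m):                      # E-step: p_jk ~ w_k N(x_j | b_k, V_k)
            D = X.T - b[k]
            Rk = np.linalg.inv(V[k])
            quad = np.sum((D @ Rk) * D, axis=1)
            P[:, k] = w[k] * np.exp(-0.5 * quad) / np.sqrt(
                (2 * np.pi)**d * np.linalg.det(V[k]))
        P /= P.sum(axis=1, keepdims=True)
        y = P.sum(axis=0)                       # M-step: weighted counts, means, covs
        w = y / n
        for k in range(m):
            b[k] = (X @ P[:, k]) / y[k]
            D = X.T - b[k]
            V[k] = (D.T * P[:, k]) @ D / y[k] + 1e-9 * np.eye(d)
    return w, b, V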
Global Optimization

In more general (non-Gaussian) mixture models, if an analytical solution for the M-step is not available, a robust local optimization algorithm can be used, for example Martinez (2000). The EM is only a local optimizer, but the MCMC provides plenty of good starting points, so we have the basic elements for a global optimizer. To avoid using many starting points going to the same local maximum, we can filter the top portion (ranked by the posterior) of the MCMC output using a clustering algorithm, and select a starting point from each cluster. For better efficiency, or for more complex problems, the Stochastic EM, or SEM, algorithm can be used to provide starting points near each important local maximum, see Celeux (1995), Pflug (1996) and Spall (2003).
C.6.4 Experimental Tests and Final Remarks
The test case used in this study is given by a sample $X$ assumed to follow a mixture of bivariate normal distributions with unknown parameters, including the number of components. $X$ is the Iris virginica data set, with sepal and petal length of 50 specimens (1 discarded outlier). The botanical problem consists of determining whether or not there are two distinct subspecies in the population, see Anderson (1935), Fisher (1936) and McLachlan (2000). Figure 1 presents the dataset and posterior density level curves for the parameters, $\theta^*$ and $\hat{\theta}$, optimized for the 1 and 2 component models. The prior parameters are set to $\dot{y} = \mathbf{1}$, $\dot{n} = 1$, $\dot{u} = u$, $\dot{e} = 3$, $\dot{S} = (1/n) S$. Robert (1996) uses, with similar effects, $\dot{e} = 6$ and a similarly scaled $\dot{S}$.

The FBST selects the 2 component model, rejecting $H$, if the evidence against the hypothesis is above a given threshold, $ev(H) > \tau$, and selects the 1 component model, accepting $H$, otherwise. The threshold $\tau$ is chosen by empirical power analysis, see Stern and Zacks (2002) and Lauretto et al. (2003). Let $\theta^*$ and $\hat{\theta}$ represent the constrained (1 component) and unconstrained (2 component) maximum a posteriori (MAP) parameters optimized to the Iris dataset. Next, generate two collections of $t$ simulated datasets of size $n$, the first collection at $\theta^*$, and the second at $\hat{\theta}$. $\alpha(\tau)$ and $\beta(\tau)$, the empirical type 1 and type 2 statistical errors, are the rejection rate in the first collection and the acceptance rate in the second collection. A small, $t = 500$, calibration run sets the threshold $\tau$ so as to minimize the total error, $(\alpha(\tau) + \beta(\tau))/2$. Other methods, like sensitivity analysis, see Stern (2004a,b), and loss functions, see Madruga (2001), could also be used.

Biernacki and Govaert (1998) studied similar mixture problems and compared several selection criteria, pointing to the best overall performers: AIC - Akaike Information Criterion, AIC3 - Bozdogan's modified AIC, and BIC - Schwartz' Bayesian Information Criterion. These are regularization criteria, weighting the model fit against the number of parameters, see Pereira and Stern (2001). If $\lambda$ is the model log-likelihood, $\kappa$ its number of parameters, and $n$ the sample size, then
$$ AIC = -2\lambda + 2\kappa\ , \qquad AIC3 = -2\lambda + 3\kappa\ , \qquad BIC = -2\lambda + \kappa \log(n)\ . $$
Figure 2 shows $\alpha$, $\beta$, and the total error $(\alpha + \beta)/2$. The FBST outperforms all the regularization criteria. For small samples, BIC is very biased, always selecting the 1 component model. AIC is the second best criterion, catching up with the FBST for sample sizes larger than $n = 150$.
Figure 2: Criteria O = FBST, X = AIC, + = AIC3, * = BIC; type 1, type 2 and total error rates for different sample sizes.

Finally, let us point out a related topic for further research: the problem of discriminating between models consists of determining which of $m$ alternative models, $f_k(x, \psi_k)$, more adequately fits or describes a given dataset. In general the parameters $\psi_k$ have distinct dimensions, and the models $f_k$ have distinct (unrelated) functional forms. In this case it is usual to call them "separate" models (or hypotheses). Atkinson (1970), although in a very different theoretical framework, was the first to analyse this problem using a mixture formulation,
$$ f(x \mid \theta) = \sum_{k=1}^{m} w_k\, f_k(x, \psi_k)\ . $$
The general theory for mixture models presented in this article can be adapted to analyse the problem of discriminating between separate hypotheses. This is the subject of the authors' ongoing research with Carlos Alberto de Bragança Pereira and Basílio de Bragança Pereira, to be presented in forthcoming articles.

The authors are grateful for the support of CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, CNPq - Conselho Nacional de Desenvolvimento Científico e Tecnológico, and FAPESP - Fundação de Apoio à Pesquisa do Estado de São Paulo.
C.7 REAL Classification Trees
This section presents an overview of REAL, the Real Attribute Learning Algorithm for automatic construction of classification trees. The REAL project started as an application to be used at the Brazilian BOVESPA and BM&F financial markets, trying to provide a good algorithm for predicting the adequacy of operation strategies. In this context, the examples are given as an $n \times (m+1)$ matrix $A$. Each row, $A(i,:)$, represents a different example, and each column, $A(:,j)$, a different attribute. The first $m$ columns in each row are real-valued attributes, and the last column, $A(i, m+1)$, is the example's class. Part of these samples, the training set, is used by the algorithm to generate a classification tree, which is then tested with the remaining examples. The error rate in the classification of the examples in the test set is a simple way of evaluating the classification tree.

A market operation strategy is a predefined set of rules determining an operator's actions in the market. The strategy shall have a predefined criterion for classifying a strategy application as success or failure. As a simple example, let us define the strategy $buysell(t, d, l, u, c)$, simulated in the sketch below:
• At time $t$ buy a given asset $A$, at its price $p(t)$.
• Sell $A$ as soon as: 1. $t' = t + d$, or 2. $p(t') = p(t) * (1 + u/100)$, or 3. $p(t') = p(t) * (1 - l/100)$.
• The strategy application is successful if $c \leq 100 * (p(t') - p(t)) / p(t) \leq u$.
The parameters $u$, $l$, $c$ and $d$ can be interpreted as the desired and worst accepted returns (upper and lower bounds), the strategy application cost, and a time limit.
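A toy simulation of the buysell strategy, as a sketch of our own; the /100 percentage scaling and the success test follow the reconstruction above and are assumptions.

import numpy as np

def buysell(p, t, d, l, u, c):
    # buy at p[t]; sell at the first stop-price hit or at the time limit t + d
    s = t + d                                   # assumes t + d < len(p)
    for s2 in range(t + 1, t + d + 1):
        if p[s2] >= p[t] * (1 + u / 100) or p[s2] <= p[t] * (1 - l / 100):
            s = s2
            break
    ret = 100 * (p[s] - p[t]) / p[t]            # realized return, in percent
    return ret, (c <= ret <= u)                 # (return, success?)

rng = np.random.default_rng(4)
p = 100 * np.cumprod(1 + rng.normal(0, 0.01, size=300))   # synthetic price path
ret, ok = buysell(p, t=10, d=20, l=3, u=5, c=1)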
Tree Construction

Each main iteration of the REAL algorithm corresponds to the branching of a terminal node in the tree. The examples at that node are classified according to the value of a selected attribute, and new branches are generated for each specific interval. The partition of a real-valued attribute's domain into adjacent non-overlapping (sub)intervals is the discretization process. Each main iteration of REAL includes:
1. The discretization of each attribute, and its evaluation by a loss function.
2. Selecting the best attribute, and branching the node accordingly.
3. Merging adjacent intervals that fail to reach a minimum conviction threshold.
C.7.1 Conviction, Loss and Discretization
Given a node of class $c$ with $n$ examples, $k$ of which are misclassified and $(n - k)$ of which are correctly classified, we need a single scalar parameter, $cm$, to measure the probability of misclassification and its confidence level. Such a simplified conviction (or trust) measure was a demand of REAL users operating at the stock market.

Let $q$ be the misclassification probability for an example at a given node, let $p = (1 - q)$ be the probability of correct classification, and assume we have a Bayesian distribution for $q$, namely
$$ D(c) = Pr(q \leq c) = Pr(p \geq 1 - c) $$
We define the conviction measure $100 * (1 - cm)\%$, where
$$ cm = \min\ c \mid Pr(q \leq c) \geq 1 - g(c) $$
and $g(\,)$ is a monotonically increasing bijection of $[0, 1]$ onto itself. From our experience in the stock market application we learned to be extra cautious about making strong statements, so we make $g(\,)$ a convex function.

In this paper $D(c)$ is the posterior distribution for a sample taken from the Bernoulli distribution, with a uniform prior for $q$:
$$ B(n, k, q) = \mathrm{comb}(n, k) * q^k * p^{n-k} $$
$$ D(c, n, k) = \int_{q=0}^{c} B(n, k, q) \left/ \int_{q=0}^{1} B(n, k, q) \right. = \mathrm{betainc}(c, k+1, n-k+1) $$
Also in this paper, we focus our attention on $g(c) = g(c, r) = c^r$, $r \geq 1$, and call $r$ the convexity parameter. With these choices, the posterior is the easily computed incomplete beta function, and $cm$ is the root of a monotonically decreasing function (see the sketch at the end of this subsection):
$$ cm(n, k, r) = c \mid f(c) = 0\ , \qquad f(c) = 1 - g(c) - D(c, n, k) = 1 - c^r - \mathrm{betainc}(c, k+1, n-k+1) $$
Finally, we want a loss function for the discretizations, based on the conviction measure. In this paper we use the overall sum of each example's classification conviction, that is,
$$ loss = \sum_i n_i * cm_i $$
Given an attribute, the first step of the discretization procedure is to order the examples in the node by the attribute's value, and then to join together the neighboring examples of the same class. So, at the end of this first step, we have the best ordered discretization for the selected attribute with uniform class clusters.

In the subsequent steps, we join intervals together, in order to decrease the overall loss function of the discretization. The gain of joining $J$ adjacent intervals, $I_{h+1}, I_{h+2}, \ldots I_{h+J}$, is the relative decrease in the loss function,
$$ gain(h, J) = \sum_j loss(n_j, k_j, r) - loss(n, k, r) $$
where $n = \sum_j n_j$ and $k$ counts the minorities' examples in the new cluster (at the second step $k_j = 0$, because we begin with uniform class clusters). At each step we perform the cluster joining operation with maximum gain. The discretization procedure stops when there are no more joining operations with positive gain.

The next examples show some clusters that would be joined together at the first step of the discretization procedure. The notation $(n, k, m, r, \pm)$ means that we have two uniform clusters of the same class, of sizes $n$ and $m$, separated by a uniform cluster of size $k$ of a different class; $r$ is the convexity parameter, and $+$ ($-$) means we would (not) join the clusters together.

( 2,1, 2,2,+)
( 6,2, 7,2,-) ( 6,2, 8,2,+) ( 6,2,23,2,+) ( 6,2,24,2,-)
( 7,2, 6,2,-) ( 7,2, 7,2,+) ( 7,2,42,2,+) ( 7,2,43,2,-)
(23,3,23,2,-) (23,3,43,2,-) (23,3,44,2,+)
(11,3,13,3,-) (11,3,14,3,+) (11,3,39,3,+) (11,3,40,3,-)
(12,3,12,3,-) (12,3,13,3,+) (12,3,54,3,+) (12,3,55,3,-)

In these examples we see that it takes extreme clusters of a balanced and large enough size, $n$ and $m$, to "absorb" the noise or impurity in the middle cluster of size $k$. A larger convexity parameter, $r$, implies a larger loss at small clusters, and therefore makes it easier for sparse impurities to be absorbed.
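The root $cm(n, k, r)$ is immediate with a regularized incomplete beta function and a bracketing root finder. A minimal sketch, with function names of our own; note that scipy's betainc takes its arguments in the order (k+1, n-k+1, c).

from scipy.special import betainc
from scipy.optimize import brentq

def cm(n, k, r):
    # root of f(c) = 1 - c^r - betainc(c; k+1, n-k+1), decreasing on ]0, 1[
    f = lambda c: 1.0 - c**r - betainc(k + 1, n - k + 1, c)
    return brentq(f, 1e-12, 1.0 - 1e-12)

def loss(clusters, r):
    # overall discretization loss: sum_i n_i * cm_i over (n_i, k_i) clusters
    return sum(n * cm(n, k, r) for n, k in clusters)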
C.7.2 Branching and Merging

For each terminal node in the tree, we
1. perform the discretization procedure for each available attribute,
2. measure the loss function of the final discretization,
3. select the minimum loss attribute, and
4. branch the node according to this attribute's discretization.
If no attribute discretization decreases the loss function by a numerical precision threshold $\epsilon > 0$, no branching takes place.

A premature discretization by a parameter selected at a given level may preclude further improvement of the classification tree by the branching process. For this reason we establish a conviction threshold, $ct$, and after each branching step we merge all adjacent intervals that do not achieve $cm < ct$. To prevent an infinite loop, the loss function value assigned to the merged interval is the sum of the losses of the merging intervals. At the final leaves, this merging is undone. The conviction threshold naturally stops the branching process, so there is no need for an external pruning procedure, as in most TDIDT algorithms.

In the straightforward implementation, REAL spends most of the execution time computing the function $cm(n, k, r)$. We can greatly accelerate the algorithm by using precomputed tables of $cm(n, k, r)$ values for small $n$, and precomputed tables of $cm(n, k, r)$ polynomial interpolation coefficients for larger $n$. To speed up the algorithm we can also restrict the search for join operations at the discretization step to small neighborhoods, i.e. to join only $3 \leq J \leq J_{max}$ clusters: doing so expedites the algorithm without any noticeable consistent degradation. For further details on the numerical implementation, benchmarks, and the specific market application, see Lauretto et al. (1998).

Appendix D

Deterministic Evolution and Optimization
This chapter presents some methods of deterministic optimization. Section 1 presents the fundamentals of Linear Programming (LP), its duality theory, and some variations of the Simplex algorithm. Section 2 presents some basic facts of constrained and unconstrained Non-Linear Programming (NLP), the Generalized Reduced Gradient (GRG) algorithm for constrained NLP problems, the ParTan method for unconstrained NLP problems, and some simple line search algorithms for uni-dimensional problems. Sections 1 and 2 also present some results about the local and global convergence properties of these algorithms. Section 3 is a very short introduction to variational problems and the Euler-Lagrange equation.

The algorithms presented in sections 1 and 2 are within the class of active set, or active constraint, algorithms. The choice of concentrating on this class is motivated by some properties of active set algorithms that make them especially useful in statistical applications, namely:
- Active set algorithms maintain feasibility throughout the search path for the optimal solution. This is important if the objective function can only be computed at (nearly) feasible arguments, as is often the case in statistics or simulation problems. This feature also makes active set algorithms relatively easy to explain and implement.
- The general convergence theory of active set algorithms, and the analysis of specific problems, may offer a constructive proof of the existence, or the verification of stability conditions, for an equilibrium or fixed point representing a systemic eigen-solution; see, for example, Border (1989), Ingrao and Israel (1990) and Zangwill (1964).
- Active set algorithms are particularly efficient for small or medium size re-optimization problems, that is, for optimization problems where the initial solution or starting point for the optimization procedure is (nearly) feasible and already close to the optimal solution,
so that the optimization algorithm is only used to fine-tune the solution. In FBST applications, such good starting points can be obtained from an exploratory search performed by the Monte Carlo or Markov Chain Monte Carlo procedures used to numerically integrate the FBST e-value, $ev(H)$, or truth function $W(v)$, see appendices A and G.

D.1 Convex Sets and Polyhedra
The matrix notation used in this book is defined in Section F.1.
Convex Sets
A point $y(l)$ is a convex combination of $m$ points of $R^n$, given by the columns of the matrix $X$, $n \times m$, iff
$$ \forall i\ , \quad y(l)_i = \sum_{j=1}^{m} l_j * X_i^j\ , \quad l_j \geq 0 \ \Big|\ \sum_{j=1}^{m} l_j = 1\ , $$
or, equivalently, in matrix notation, iff
$$ y(l) = \sum_{j=1}^{m} l_j * X^j\ , \quad l_j \geq 0 \ \Big|\ \sum_{j=1}^{m} l_j = 1\ , $$
or, in yet another equivalent form, replacing the summations by inner products,
$$ y(l) = X l\ , \quad l \geq 0 \ \Big|\ \mathbf{1}' l = 1\ . $$
In particular, the point $y(\lambda)$ is a convex combination of two points, $z$ and $w$, if
$$ y(\lambda) = (1 - \lambda) z + \lambda w\ , \quad \lambda \in [0, 1]\ . $$
Geometrically, these are the points in the line segment from $z$ to $w$.

A set, $C \in R^n$, is convex iff it contains all convex combinations of any two of its points. A set, $C \in R^n$, is bounded iff the distance between any two of its points is bounded:
$$ \exists\, \delta \mid \forall\, x^1, x^2 \in C\ , \ ||x^1 - x^2|| \leq \delta $$
Figure D.1 presents some sets exemplifying the definitions above. An extreme point of a convex set $C$ is a point $x \in C$ that can not be represented as a convex combination of two other points of $C$. The profile of a convex set $C$, $\mathrm{ext}(C)$, is the set of its extreme points. The convex hull and the closed convex hull of a set $C$, $\mathrm{ch}(C)$ and $\mathrm{cch}(C)$, are the intersection of all convex sets, and closed convex sets, containing $C$.

Theorem: A compact (closed and bounded) convex set is equal to the closed convex hull of its profile, that is, $C = \mathrm{cch}(\mathrm{ext}(C))$.

Figure D.1: (a) non-convex set, (b,c) bounded and unbounded polyhedron, (d-f) degenerate vertex perturbed to a single or two nondegenerate ones.

The epigraph of a curve in $R^2$, $y = f(x)$, $x \in [a, b]$, is the set defined as $\mathrm{epig}(f) \equiv \{(x, y) \mid x \in [a, b] \wedge y \geq f(x)\}$. A curve is said to be convex iff its epigraph is convex. A curve is said to be concave iff $-f(x)$ is convex.
Theorem: A curve, $y = f(x)$, $R \mapsto R$, that is continuously differentiable and has a monotonically increasing first derivative is convex.

Theorem:
The convex hull of a finite set of points, $V$, is the set of all convex combinations of points of $V$; that is, if $V = \{x^i, i = 1 \ldots n\}$, then $\mathrm{ch}(V) = \{x \mid x = [x^1, \ldots x^n]\, l\ ,\ l \geq 0\ ,\ \mathbf{1}' l = 1\}$.

A (non-linear) constraint, in $R^n$, is an inequality of the form $g(x) \leq 0$, $g: R^n \mapsto R$. The feasible region defined by $m$ constraints, $g(x) \leq 0$, $g: R^n \mapsto R^m$, is the set of feasible (or viable) points $\{x \mid g(x) \leq 0\}$. At the feasible point $x$, the constraint $g_i(x)$ is said to be active or tight if the equality, $g_i(x) = 0$, holds, and it is said to be inactive or slack if the strict inequality, $g_i(x) < 0$, holds.

Polyhedra

A polyhedron in $R^n$ is a feasible region defined by linear constraints: $A x \leq d$. We can always compose an equality constraint, $a' x = \delta$, with two inequality constraints, $a' x \leq \delta$ and $a' x \geq \delta$.

Theorem:
Polyhedra are convex, but not necessarily bounded. A face of dimension $k$, of a polyhedron in $R^n$ with $m$ equality constraints, is a feasible region that obeys tightly $n - m - k$ of the polyhedron's inequality constraints. Equivalently, a point that obeys $r$ active inequality constraints is at a face of dimension $k = n - m - r$. A vertex is a face of dimension 0. An edge is a face of dimension 1. An interior point of the polyhedron has all inequality constraints slack or inactive, that is, $k = n - m$. A facet is a face of dimension $k = n - m - 1$.

Now consider a point where $n - m + 1$ inequality constraints are active. This point is "superdetermined", since it is a point in $R^n$ that obeys $n + 1$ equations: $m$ equality constraints and $n - m + 1$ active inequality constraints. Such a point is said to be degenerate. From now on we assume the non-degenerescence hypothesis, stating that such points do not exist in the optimization problem at hand. This hypothesis is very reasonable, since the slightest perturbation to a degenerate problem transforms a degenerate point into one or more vertices, see Figure D.1.

A polyhedron in standard form, $P_{A,d} \subset R^n$, is defined by $n$ sign constraints, $x_i \geq 0$, and $m < n$ equality constraints, that is,
$$ P_{A,d} = \{x \geq 0 \mid A x = d\}\ , \quad A\ \ m \times n\ . $$
We can always rewrite a polyhedron in standard form (in a higher dimensional space) using the following artifices:
1. Replace an unconstrained variable $x_i$ by the difference of two positive ones, $x_i^+ - x_i^-$, where $x_i^+ = \max\{0, x_i\}$ and $x_i^- = \max\{0, -x_i\}$.
2. Add a slack variable, $\chi \geq 0$:
$$ a' x \leq \delta \ \Leftrightarrow\ \begin{bmatrix} a' & 1 \end{bmatrix} \begin{bmatrix} x \\ \chi \end{bmatrix} = \delta\ . $$
From the definition of vertex we can see that, in a polyhedron in standard form, $P_{A,d}$, a vertex is a feasible point where $n - m$ sign constraints are active. Hence, $n - m$ variables are null; these are the residual variables of this vertex. Let us permute the vector $x$ so as to place the residual variables in the last $n - m$ positions. Hence, the remaining (non-null) variables, the basic variables, will be in the first $m$ positions. Applying the same permutation to the columns of the matrix $A$, the block of the first $m$ columns is called the basis, $B$, of this vertex, while the block of the remaining $n - m$ columns of $A$ is called the residual matrix, $R$. That is, given vectors $b$ and $r$ with the basic and residual indices, the permuted matrix $A$ can be partitioned as
$$ \begin{bmatrix} A_b & A_r \end{bmatrix} = \begin{bmatrix} B & R \end{bmatrix} $$
In this form, it is easy to write the non-null variables explicitly:
$$ \begin{bmatrix} x_b \\ x_r \end{bmatrix} \geq 0 \ \Big|\ \begin{bmatrix} B & R \end{bmatrix} \begin{bmatrix} x_b \\ x_r \end{bmatrix} = d\ , \quad \mbox{hence} \quad x_b = B^{-1} [d - R x_r]\ . $$
Equating the residual variables to zero, it follows that $x_b = B^{-1} d$. From the definition of degenerescence we see that the vertex of a polyhedron in standard form is degenerate iff it has a null basic variable.
D.2 Linear Programming
This section presents Linear Programming, the simplest optimization problem studied in multi-dimensional mathematical programming. The simple structure of LP allows the formal development of relatively simple solution algorithms, namely, the primal and dual simplex. This section also presents some decomposition techniques used for solving LP problems in special forms.
D.2.1 Primal and Dual Simplex Algorithms
An LP problem in standard form asks for the minimum of a linear function inside a polyhedron in standard form, that is,
$$ \min\ c x\ , \quad x \geq 0 \mid A x = d\ . $$
Assume we know which are the residual (zero) variables of a given vertex. In this case we can form the basic and residual index vectors, $b$ and $r$, and obtain the basic (non-zero) variables of this vertex. Permuting and partitioning all objects of the LP problem according to the order established by the basic and residual index vectors, the LP problem is written as
$$ \min\ \begin{bmatrix} c_b & c_r \end{bmatrix} \begin{bmatrix} x_b \\ x_r \end{bmatrix}\ , \quad x \geq 0 \ \Big|\ \begin{bmatrix} B & R \end{bmatrix} \begin{bmatrix} x_b \\ x_r \end{bmatrix} = d\ . $$
Using the notation $\tilde{d} \equiv B^{-1} d$ and $\tilde{R} \equiv B^{-1} R$, the basic solution corresponding to this vertex is $x_b = \tilde{d}$ (and $x_r = 0$).

Let us now proceed with an analysis of the sensitivity of this basic solution to a perturbation of a single residual variable. If we change a single residual variable, say the $j$-th element of $x_r$, allowing it to become positive, that is, making $x_{r(j)} > 0$, the basic solution, $x_b$, becomes
$$ x_b = \tilde{d} - \tilde{R} x_r = \tilde{d} - x_{r(j)} \tilde{R}^j $$
This solution remains feasible as long as it remains non-negative. Using the non-degenerescence hypothesis, $\tilde{d} > 0$, and we know that it is possible to increase the value of $x_{r(j)}$, while keeping the basic solution feasible, up to a threshold $\epsilon > 0$, when some basic variable becomes null. The value of this perturbed basic solution is
$$ c x = c_b x_b + c_r x_r = c_b B^{-1} [d - R x_r] + c_r x_r = c_b \tilde{d} + (c_r - c_b \tilde{R}) x_r \equiv \varphi - z x_r = \varphi - z_j x_{r(j)} $$
The vector $z$ is called the reduced cost of this basis. The sensitivity analysis suggests the following algorithm, used to generate a sequence of vertices of decreasing values, starting from an initial vertex, $[x_b \mid x_r]$.

Simplex Algorithm:
1. Find a residual index $j$, such that $z_j > 0$.
2. For $k \in K \equiv \{k \mid \tilde{R}^j_k > 0\}$, compute $\epsilon_k = \tilde{d}_k / \tilde{R}^j_k$, and take $i \in \mathrm{Argmin}_{k \in K}\ \epsilon_k$, i.e. $\epsilon_i = \min_k \epsilon_k$.
3. Make the variable $x_{r(j)}$ basic, and $x_{b(i)}$ residual.
4. Go back to step 1.
The Simplex can not proceed at step 1 if $z \leq 0$; the correctness theorem at the end of this section shows that the current vertex is then an optimal solution. The exchange of a basic and a residual variable, at step 3, is said to pivot the basis. After each pivoting operation the basis inverse needs to be recomputed, that is, the basis needs to be reinverted. Numerically efficient implementations of the Simplex do not actually keep the basis inverse; instead, the basis inverse is represented by a numerical factorization, like $B = LU$ or $B = QR$. At each pivot operation the basis is changed by a single column, and there are efficient numerical algorithms used to update the numerical factorization representing the basis inverse, see Murtagh (1981) and Stern (1994).

Example 1: Let us illustrate the Simplex algorithm by solving the simple LP problem
$$ \min\ [-1, -2]\, x\ , \quad 0 \leq x \leq 1\ , $$
which, with slack variables $x_3$ and $x_4$, is written in standard form with
$$ c = \begin{bmatrix} -1 & -2 & 0 & 0 \end{bmatrix}\ , \quad A = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}\ , \quad d = \begin{bmatrix} 1 \\ 1 \end{bmatrix} $$
0] is assumed to be known.Step 1: r = [1 , b = [3 , B = A (: , b ) = I , R = A (: , r ) = I , − z = c r − c b ˜ R = [ − , − − [0 , ⇒ z = [1 , j = 1, r ( j ) = 1, x b = ˜ d − (cid:15) ˜ R j = (cid:20) (cid:21) − (cid:15) (cid:20) (cid:21) ⇒ (cid:15) ∗ = 1 , i = 1 , b ( i ) = 3Step 2: r = [3 , b = [1 , B = A (: , b ) = I , R = A (: , r ) = I , − z = c r − c b ˜ R = [0 , − − [ − , ⇒ z = [ − , j = 2, r ( j ) = 2, x b = ˜ d − (cid:15) ˜ R j = (cid:20) (cid:21) − (cid:15) (cid:20) (cid:21) ⇒ (cid:15) ∗ = 1 , i = 2 , b ( i ) = 4Step 3: r = [3 , b = [1 , B = A (: , b ) = I , R = A (: , r ) = I , − z = c r − c b ˜ R = [0 , − [ − , − ⇒ z = [ − , − < Obtaining the initial vertex
Obtaining the Initial Vertex

In order to obtain an initial vertex, used to start the Simplex, we can use the auxiliary LP problem
$$ \min\; [\, 0' \;\; 1' \,] \begin{bmatrix} x \\ y \end{bmatrix} \;\Big|\; \begin{bmatrix} x \\ y \end{bmatrix} \geq 0 \;\wedge\; [\, A \;\; \mathrm{diag}(\mathrm{sign}(d)) \,] \begin{bmatrix} x \\ y \end{bmatrix} = d\;. $$
An initial vertex for the auxiliary problem is given by $[\, 0\,,\; \mathrm{abs}(d)' \,]'$. If the auxiliary problem has an optimal solution of value zero, this optimal solution gives a feasible vertex for the original problem; if not, the original problem is unfeasible.
Duality
Given an LP problem, called the primal LP problem, we define a second problem, the dual problem (of the primal problem). Duality theory establishes important relations between the solution of the primal LP and the solution of its dual. Given an LP problem in canonical form,
$$ \min\; cx \;\mid\; x \geq 0 \;\wedge\; Ax \geq d\,, $$
its dual LP problem is defined as
$$ \max\; y'd \;\mid\; y \geq 0 \;\wedge\; y'A \leq c\;. $$
The primal and dual problems in canonical form have an intuitive economic interpretation. The primal problem can be interpreted as the classic ration problem: $A_{j,i}$ is the quantity of nutrient of type $j$ found in one unit of aliment of type $i$, $c_i$ is the cost of one unit of aliment $i$, and $d_j$ is the minimum daily need of nutrient $j$. The primal optimal solution is a nutritionally feasible ration of minimum cost. The dual problem can be interpreted from the standpoint of a manufacturer of synthetic nutrients, looking for the "market value" of its nutrient line. The manufacturer's income per synthetic ration is the objective function to be maximized. In order to keep its line of synthetic nutrients competitive, no natural aliment should provide nutrients cheaper than the corresponding synthetic mixture; these are the dual problem's constraints. The optimal prices for the synthetic nutrients, $y^*$, can also be interpreted as marginal prices, giving the differential price increment of aliment $i$ per differential increase of its content of nutrient $j$. The correctness of these interpretations is demonstrated by the duality properties discussed next.

Lemma 1:
The dual of the dual is the primal LP problem.

Proof: Just observe that the dual of the primal LP in canonical form is equivalent to
$$ \min\; -y'd \;\mid\; y \geq 0 \;\wedge\; -y'A \geq -c\;. $$
This problem is again in canonical form, and can be immediately dualized, yielding a problem equivalent to the original LP problem.
Weak Duality Theorem: If $x$ and $y$ are, respectively, feasible solutions of the primal and dual problems, then there is a non-negative gap between their values as solutions of these problems, that is, $cx \geq y'd$.

Proof: By feasibility, $Ax \geq d$ and $y \geq 0$; hence $y'Ax \geq y'd$. In the same way, $y'A \leq c$ and $x \geq 0$; hence $y'Ax \leq cx$. Therefore, $cx \geq y'd$. QED.
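As a quick numeric illustration of weak duality (a minimal sketch; the ration-problem data below are made up for the example), we can sample feasible primal and dual points and check that the duality gap is never negative:

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical ration-problem data: A[j,i] = nutrient j per unit of aliment i
    A = np.array([[2., 1., 0.],
                  [1., 3., 2.]])
    c = np.array([3., 4., 2.])   # cost per unit of each aliment
    d = np.array([4., 6.])       # minimum daily need of each nutrient

    hits = 0
    for _ in range(20000):
        x = rng.uniform(0, 5, size=3)
        y = rng.uniform(0, 2, size=2)
        if np.all(A @ x >= d) and np.all(y @ A <= c):
            assert c @ x >= y @ d      # weak duality holds for every pair
            hits += 1
    print(hits, "feasible pairs checked, no duality gap violated")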
Corollary 1: If we have a pair of feasible solutions, $x^*$ for the primal LP problem and $y^*$ for its dual, and their values as primal and dual solutions coincide, that is, $cx^* = (y^*)'d$, then both solutions are optimal.

Corollary 2:
If the primal LP problem is unbounded, its dual is unfeasible.

As we can rewrite any LP problem in standard form, we can also rewrite any LP problem in canonical form. Hence, from Lemma 1, we know that the duality relation is defined between pairs of LP problems, whatever the form in which they have been written.
Lemma 2:
Given a primal in standard form,
$$ \min\; cx \;\mid\; x \geq 0 \;\wedge\; Ax = d\,, $$
its dual is
$$ \max\; y'd \;\mid\; y \in \mathbb{R}^m \;\wedge\; y'A \leq c\;. $$

Theorem (Simplex proof of correctness): We shall prove that the Simplex stops at an optimal vertex. At the Simplex halting point we have $z = -(c_r - c_b B^{-1} R) \leq 0$. Let us consider $y' = c_b B^{-1}$ as a candidate solution for the dual:
$$ [\, c_b \;\; c_r \,] - y' [\, B \;\; R \,] = [\, c_b \;\; c_r \,] - c_b B^{-1} [\, B \;\; R \,] = [\, c_b \;\; c_r \,] - [\, c_b \;\; c_b \tilde R \,] = [\, 0 \;\; -z \,] \geq 0\,, $$
hence $y$ is a feasible dual solution. Moreover, its value (as a dual solution) is $y'd = c_b B^{-1} d = c_b \tilde d = \varphi$, and, by Corollary 1, both solutions are optimal.

Theorem (Strong Duality): If the primal problem is feasible and bounded, so is its dual. Moreover, the values of the primal and dual optimal solutions coincide.

Proof: Constructive, by the Simplex algorithm.
Theorem (Complementary Slackness): Let $x$ and $y'$ be feasible solutions of an LP in standard form and of its dual. These solutions are optimal iff $w'x = 0$, where $w = (c - y'A)'$. The vectors $x$ and $w$ represent the slackness in the inequality constraints of the primal and dual LP problems. Since $x \geq 0$ and $w \geq 0$, the scalar product $w'x$ is null iff each of its terms, $w_j x_j$, is null; or, equivalently, iff for each slack inequality constraint in the primal, the corresponding inequality constraint in the dual is tight, and vice-versa. Hence the name complementary slackness.

Proof: If the solutions are optimal, we could have obtained them by the Simplex algorithm. As in the Simplex proof of correctness,
$$ (c - y'A)\, x = [\, 0 \;\; -z \,] \begin{bmatrix} x_b \\ 0 \end{bmatrix} = 0\;. $$
Conversely, if $(c - y'A)\, x = 0$, then $y'(Ax) = cx$, or $y'd = cx$, and by the first corollary of the weak duality theorem, both solutions are optimal.
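A small numeric check of complementary slackness on Example 1 (our own sketch, reusing the data above):

    import numpy as np

    # Data of Example 1 (standard form) and its optimal basis b = [1, 2]
    c = np.array([-1., -1., 0., 0.])
    A = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.]])
    d = np.array([1., 1.])
    b_idx = [0, 1]                       # 0-based indices of x1, x2

    B = A[:, b_idx]
    x = np.zeros(4)
    x[b_idx] = np.linalg.solve(B, d)     # primal optimal vertex [1, 1, 0, 0]
    y = c[b_idx] @ np.linalg.inv(B)      # dual solution y' = c_b B^{-1}
    w = c - y @ A                        # dual slacks

    print(c @ x, y @ d)                  # equal values: both optimal
    print(w @ x)                         # complementary slackness: 0.0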
General Form of Duality

The following is the LP problem in its most general form, together with its dual. An asterisk, $*$, indicates an unconstrained sub-vector.

The general primal LP problem:
$$ \min\; [\, c^{\{1\}} \; c^{\{2\}} \; c^{\{3\}} \,] \begin{bmatrix} x^{\{1\}} \\ x^{\{2\}} \\ x^{\{3\}} \end{bmatrix}, \quad x^{\{1\}} \geq 0\,,\; x^{\{2\}}\, *\,,\; x^{\{3\}} \leq 0 \;\Bigg|\; \begin{bmatrix} A^{\{1,1\}} & A^{\{1,2\}} & A^{\{1,3\}} \\ A^{\{2,1\}} & A^{\{2,2\}} & A^{\{2,3\}} \\ A^{\{3,1\}} & A^{\{3,2\}} & A^{\{3,3\}} \end{bmatrix} \begin{bmatrix} x^{\{1\}} \\ x^{\{2\}} \\ x^{\{3\}} \end{bmatrix} \begin{matrix} \leq d^{\{1\}} \\ = d^{\{2\}} \\ \geq d^{\{3\}} \end{matrix} $$
and its dual LP problem:
$$ \max\; [\, d^{\{1\}\prime} \; d^{\{2\}\prime} \; d^{\{3\}\prime} \,] \begin{bmatrix} y^{\{1\}} \\ y^{\{2\}} \\ y^{\{3\}} \end{bmatrix}, \quad y^{\{1\}} \leq 0\,,\; y^{\{2\}}\, *\,,\; y^{\{3\}} \geq 0 \;\Bigg|\; \begin{bmatrix} A^{\{1,1\}} & A^{\{1,2\}} & A^{\{1,3\}} \\ A^{\{2,1\}} & A^{\{2,2\}} & A^{\{2,3\}} \\ A^{\{3,1\}} & A^{\{3,2\}} & A^{\{3,3\}} \end{bmatrix}' \begin{bmatrix} y^{\{1\}} \\ y^{\{2\}} \\ y^{\{3\}} \end{bmatrix} \begin{matrix} \leq c^{\{1\}\prime} \\ = c^{\{2\}\prime} \\ \geq c^{\{3\}\prime} \end{matrix} $$

The following interesting special case is known as the standard linear programming problem with box constraints:
$$ \text{Primal:}\; \min\; c'x \mid Ax = d \wedge l \leq x \leq u\,, \qquad \text{Dual:}\; \max\; d'y + l'p - u'q \mid A'y + p - q = c \wedge p, q \geq 0\;. $$
Dual Simplex Algorithm

The Dual Simplex algorithm is analogous to the standard Simplex, but it carries a basis that is dual feasible, and works to achieve primal feasibility. The Dual Simplex is very useful in several situations in which we solve an LP problem, and subsequently have to alter some constraints, losing primal feasibility. We work with a standard LP problem and its dual,
$$ P: \min\; cx\,,\; x \geq 0\,,\; Ax = d \quad \text{and} \quad D: \max\; y'd\,,\; y'A \leq c\;. $$
In a dual feasible basis, $y' = c_b B^{-1}$ is a dual feasible solution, that is,
$$ [\, c_b \;\; c_r \,] - y' [\, B \;\; R \,] = [\, c_b \;\; c_r \,] - c_b B^{-1} [\, B \;\; R \,] = [\, c_b \;\; c_r \,] - [\, c_b \;\; c_b \tilde R \,] = [\, 0 \;\; -z \,] \geq 0\;. $$
Introducing slack variables $w = [w_b, w_r] \geq 0$ for the dual constraints, the dual problem can be written, relative to the partition $[B \mid R]$, as follows:
$$ \max\; d'y\,,\; A'y \leq c' \;\Leftrightarrow\; \max\; d'y\,,\; \begin{bmatrix} B' \\ R' \end{bmatrix} y \leq \begin{bmatrix} c_b' \\ c_r' \end{bmatrix} \;\Leftrightarrow\; \max\; d'y\,,\; \begin{bmatrix} B' & I & 0 \\ R' & 0 & I \end{bmatrix} \begin{bmatrix} y \\ w_b \\ w_r \end{bmatrix} = \begin{bmatrix} c_b' \\ c_r' \end{bmatrix},\; w \geq 0\;. $$
Solving the first block of equations for $y$, and substituting into the second,
$$ y = B^{-t} c_b' - B^{-t} w_b \quad \text{and} \quad w_r = c_r' - R'B^{-t} c_b' + R'B^{-t} w_b\;. $$
Note that the indices in $b$ and $r$ correspond to basic and residual indices in the primal, the situation being reversed in the dual. As in the standard Simplex, we can increase a zero element of the residual vector in order to obtain a better dual solution,
$$ d'y = d'B^{-t}(c_b' - w_b) = \mathrm{const} - \tilde d'\, w_b\;. $$
If $\tilde d \geq 0$ the current basis is already primal feasible, hence optimal. Otherwise, if some $\tilde d_i < 0$, we can increase the value of the dual solution by increasing $w_{b(i)}$. We can increase $w_{b(i)} = \nu$, without losing dual feasibility, as long as we maintain
$$ w_r = c_r' - R'B^{-t} c_b' + \nu R'B^{-t} I_i = -z' + \nu\, \tilde R_i' \geq 0\;. $$
Taking $j = \arg\min \{\, \nu(j) = z_j / \tilde R_{i,j}\,,\; j \mid \tilde R_{i,j} < 0 \,\}$, we have the index that leaves the dual basis. Hence, in the new list of primal basic indices $b$, we can exclude $b(i)$, include $r(j)$, update the basis inverse, and proceed to a new Dual Simplex iteration, until we reach dual optimality or, equivalently, primal feasibility.

D.2.2 Decomposition Methods
Suppose we have a LP problem in the form
$$ \min\; cx\,,\; x \geq 0 \;\mid\; Ax = b\,, \quad \text{where} \quad A = \begin{bmatrix} \dot A \\ \ddot A \end{bmatrix}, $$
the polyhedron described by $\ddot A x = \ddot b$ has a very "simple" structure, while $\dot A x = \dot b$ imposes only a "few" additional constraints that, unfortunately, greatly complicate the problem. For example, let $\ddot A x = \ddot b$ describe a set of separate LP problems, while $\dot A x = \dot b$ imposes global constraints coupling the variables of the several LP problems. This structure is known as Row Block Angular Form (RBAF), see Section 5.2.

We now study the Dantzig-Wolfe method, which allows us to solve the original LP problem by successive iterations between a "small" main or master problem, and a large but "simple" subproblem or slave problem. We assume that the simple polyhedron is bounded, hence being the convex hull of its vertices,
$$ \ddot X = \{\, x \geq 0 \mid \ddot A x = \ddot b \,\} = \mathrm{ch}(V) = \{\, Vl\,,\; l \geq 0 \mid 1'l = 1 \,\}\;. $$
The original LP problem is equivalent to the following master problem:
$$ M: \min\; cVl\,,\; l \geq 0 \;\Big|\; \begin{bmatrix} \dot A V \\ 1' \end{bmatrix} l = \begin{bmatrix} \dot b \\ 1 \end{bmatrix}. $$
Obviously this representation has only theoretical interest, for it is not practical to find the many vertices of $V$. A given basis $B$ is optimal iff
$$ -z = [cV]_R - \left([cV]_B B^{-1}\right) \begin{bmatrix} \dot A V \\ 1' \end{bmatrix}_R \equiv [cV]_R - [y, \gamma] \begin{bmatrix} \dot A V \\ 1' \end{bmatrix}_R \geq 0\,, $$
that is, iff, for every vertex (column) $j$,
$$ cV_j - [y, \gamma] \begin{bmatrix} \dot A V_j \\ 1 \end{bmatrix} \geq 0\,, \;\text{or}\; \gamma \leq cV_j - y\dot A V_j = (c - y\dot A)\, V_j\,, \;\text{or}\; \gamma \leq \min\, (c - y\dot A)\, v\,,\; v \in \ddot X\;. $$
Hence, we define the sub-problem
$$ S: \min\; (c - y\dot A)\, v\,,\; v \geq 0 \;\mid\; \ddot A v = \ddot b\;. $$
If the optimal solution of $S$, $v^*$, has optimal value $(c - y\dot A)\, v^* \geq \gamma$, the basis $B$ is optimal for $M$. If not, $v^*$ gives us the next column to enter the basis, $[\, (\dot A v^*)' \;\; 1 \,]'$.

The optimal solution of the auxiliary problem also gives us a lower bound for the original problem. Let $x$ be any feasible solution of the original problem, that is, $x \in \ddot X \mid \dot A x = \dot b$. Since $x$ is more constrained, $(c - y\dot A)\, x \geq (c - y\dot A)\, v^*$, hence $cx \geq y\dot b + (c - y\dot A)\, v^*$. Note that the current master solution provides the corresponding upper bound. Also note that the lower bound need not increase monotonically; hence we must keep track of the best lower bound found so far.

As we have seen, the Dantzig-Wolfe method works very well for LP problems in RBAF. If we had a problem in CBAF (Column Block Angular Form), we could use the Dantzig-Wolfe decomposition method on the problem's dual. This is essentially Benders' decomposition method, which can be efficiently implemented using the Dual Simplex algorithm.

Exercises
1. Geometry and simple lemmas:
a- Draw the simplex, $S_n$, and the cube, $C_n$, of dimension 2 and 3, where $S_n = \{ x \geq 0 \mid 1'x \leq 1 \}$ and $C_n = \{ x \geq 0 \mid Ix \leq 1 \}$.
b- Rewrite $S_2$, $S_3$, $C_2$ and $C_3$ as standard form polyhedra in $\mathbb{R}^n$, where $n = 3, 4, 4, 6$, respectively.

2. Consider all arrangements, $b = [b(1), \ldots b(m)]$, of $m$ indices from $1:n$, in increasing order, that is, $b(j) > b(i)$ for $j > i$. For each arrangement $b$, form the basis $B = A_b$. Check if $B$ is invertible and, if so, check if the basic solution is a vertex, that is, if it is feasible, $\tilde d = B^{-1} d > 0$.

4. Adapt and implement the Simplex algorithm for LP problems with box constraints, $\min\, cx$, $l \leq x \leq u \mid Ax = d$. Hint: Consider a given feasible basis, $B$, and a partition $[\, B \;\; R \;\; S \,]$, where $l_b < x_b < u_b$, $x_r = l_r$, $x_s = u_s$, so that
$$ x_b = B^{-1} d - B^{-1} R\, x_r - B^{-1} S\, x_s $$
and
$$ cx = c_b B^{-1} d + (c_r - c_b B^{-1} R)\, x_r + (c_s - c_b B^{-1} S)\, x_s = \varphi + z_r x_r + z_s x_s\;. $$
If $z_{r(k)} < 0$, we can improve the current solution by increasing this residual variable at its lower bound, $x_{r(k)} = l_{r(k)} + \delta_{r(k)}$, making
$$ x_b = B^{-1} d - B^{-1} R\, l_r - B^{-1} S\, u_s - \delta_{r(k)}\, B^{-1} R_k\;. $$
However, $\delta_{r(k)}$ must respect the following bounds:
1- $x_{r(k)} = l_{r(k)} + \delta_{r(k)} \leq u_{r(k)}$; 2- $x_b \geq l_b$; 3- $x_b \leq u_b$.
In a similar way, if $z_{s(k)} > 0$, we can improve the current solution by decreasing this residual variable at its upper bound, $x_{s(k)} = u_{s(k)} - \delta_{s(k)}$.

5. Adapt and implement the Dual Simplex for LP problems with box constraints.

6. Implement the Dantzig-Wolfe decomposition method for RBAF problems.

D.3 Non-Linear Programming
Optimality and Lagrange Multipliers
We start this section by giving an intuitive explanation of Lagrange's optimality conditions for a Non-Linear Programming (NLP) problem, given as
$$ \min\; f(x)\,,\; x \;\mid\; g(x) \leq 0 \;\wedge\; h(x) = 0\,, \quad f: \mathbb{R}^n \mapsto \mathbb{R}\,,\; g: \mathbb{R}^n \mapsto \mathbb{R}^m\,,\; h: \mathbb{R}^n \mapsto \mathbb{R}^k\;. $$
We can imagine the function $f$ as a potential, or the "height" of a surface. An equipotential is a manifold where the function is constant, $f(x) = c$. The gradient,
$$ \nabla f \equiv \partial f / \partial x = [\, \partial f / \partial x_1\,,\; \partial f / \partial x_2\,, \ldots\; \partial f / \partial x_n \,]\,, $$
gives the steepest ascent direction of the function at point $x$. Hence, the gradient $\nabla f(x)$ is orthogonal to the equipotential at this point.

Imagine a particle pulled "downhill" by the force $-\nabla f(x)$. The optimal solution must be a point of equilibrium for the particle. Hence, either the force pulling the particle down is null, or else this force must be equilibrated by "reaction" forces exerted by the constraints. The reaction force exerted by an inequality constraint $g_i(x) \leq 0$ must obey the following conditions:
a) It is a force orthogonal to the equipotential curve of this constraint (since only the value of $g_i(x)$ is relevant for this constraint);
b) It is a force pulling the particle "inwards", that is, to the inside of the feasible region;
c) Moreover, an inequality constraint can only exert a reaction force if it is tight; otherwise there is a slack allowing the particle to move even closer to this constraint.
An equality constraint, $h_i(x) = 0$, can be seen as a pair of inequality constraints, $h_i(x) \leq 0 \wedge h_i(x) \geq 0$, but unlike an inequality constraint, an equality constraint is always active.

Our intuitive discussion can be summarized analytically by the following conditions, known as Lagrange's optimality conditions: If $x^* \in \mathbb{R}^n$ is an optimal point, then
$$ \exists\, u \in \mathbb{R}^m,\, v \in \mathbb{R}^k \;\mid\; u\, \nabla g(x^*) + v\, \nabla h(x^*) - \nabla f(x^*) = 0\,, \quad \text{where} \quad u \leq 0 \;\wedge\; u\, g(x^*) = 0\;. $$
The condition $u \leq 0$ states that the reaction forces point inwards, and the complementarity condition, $u g = 0$, implies that only active constraints can exert reaction forces. The vectors $u$ and $v$ are known as Lagrange multipliers.

These necessary conditions can also be presented by means of the Lagrangean function,
$$ L(x, \lambda) = f(x) + u\, g(x) + v\, h(x)\,, $$
where $\lambda = [u, v]$, $u \in \mathbb{R}^m_+$ and $v \in \mathbb{R}^k$ (the multiplier $u$ here carries the opposite sign of the reaction coefficient above), as a necessary condition for a saddle point:
$$ \frac{\partial L(x^*, \lambda^*)}{\partial x} = 0\,, \quad \frac{\partial L(x^*, \lambda^*)}{\partial \lambda} = 0\;. $$
The Lagrangean function can be used to define a duality theory for non-linear optimization problems, see Luenberger (1984) and Minoux and Vajda (1986).
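As a minimal numeric sketch of these conditions (the toy problem below is ours, not the author's): minimizing $f(x) = x_1^2 + x_2^2$ subject to $h(x) = x_1 + x_2 - 1 = 0$, the gradient of $f$ at the optimum is a multiple of the gradient of $h$:

    import numpy as np

    # Toy problem: min x1^2 + x2^2  subject to  h(x) = x1 + x2 - 1 = 0
    grad_f = lambda x: 2.0 * x                 # gradient of the objective
    grad_h = lambda x: np.array([1.0, 1.0])    # gradient of the constraint

    x_star = np.array([0.5, 0.5])              # optimum, found analytically
    v = -1.0                                   # Lagrange multiplier for h

    # Stationarity of L(x, v) = f(x) + v h(x): grad f + v grad h = 0 at x*
    print(grad_f(x_star) + v * grad_h(x_star))   # -> [0. 0.]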
Quadratic and Linear Complementarity Problems
Quadratic Programming (QP) is an important problem in its own right, and is also frequently used as a subproblem in methods designed to solve more general problems, like, for example, Sequential Quadratic Programming, see Luenberger (1984) and Minoux and Vajda (1986).

The QP problem with linear constraints is stated as
$$ \min\; f(x) \equiv (1/2)\, x'Qx - \eta\, p'x \;\mid\; x \geq 0 \;\wedge\; Te\, x = te \;\wedge\; Tl\, x \leq tl\,, $$
where the matrix dimensions are $Te$, $me \times n$, $me < n$, and $Tl$, $ml \times n$, $ml < n$, with index sets $Ml = 1, 2, \ldots ml$, $Me = 1, 2, \ldots me$, $N = 1, 2, \ldots n$. We assume that the quadratic form defining the problem is symmetric and positive definite, that is, $Q = Q'$, $Q > 0$. The gradient of the objective function is $\nabla f = x'Q - \eta p'$, and the gradients of the constraint functions $g_i(x) = T_i x \leq t_i$ are $\nabla g_i = T_i$. Hence, the Lagrange optimality conditions are
$$ x \in \mathbb{R}^n_+\,,\; s \in \mathbb{R}^n_+\,,\; l \in \mathbb{R}^{ml}_+\,,\; e \in \mathbb{R}^{me} \;\mid\; -(x'Q - \eta p') + s' - l'\, Tl + e'\, Te = 0 \;\wedge\; \forall i \in N,\; x_i s_i = 0 \;\wedge\; \forall k \in Ml,\; (Tl\, x - tl)_k\, l_k = 0\,, $$
or
$$ Qx - s + Tl'\, l + Te'\, e = \eta p \;\wedge\; \forall i \in N,\; x_i s_i = 0 \;\wedge\; \forall k \in Ml,\; yl_k\, l_k = 0\,, \quad \text{where} \quad yl = (tl - Tl\, x)\;. $$
The Complementarity Conditions (CC), $x's = 0$ and $yl'\,l = 0$, indicate that only active constraints can help to equilibrate non-negative components of the objective function's gradient. Using the change of variables $e = ep - en$, $ep, en \geq 0$, the optimal solution is characterized by the Viability and Optimality Conditions (VOC):
$$ \begin{bmatrix} x \\ l \\ ep \\ en \\ s \\ yl \end{bmatrix} \geq 0 \;\Bigg|\; \begin{bmatrix} Tl & 0 & 0 & 0 & 0 & I \\ Te & 0 & 0 & 0 & 0 & 0 \\ Q & Tl' & Te' & -Te' & -I & 0 \end{bmatrix} \begin{bmatrix} x \\ l \\ ep \\ en \\ s \\ yl \end{bmatrix} = \begin{bmatrix} tl \\ te \\ \eta p \end{bmatrix}, \quad x's = 0\,,\; yl'\,l = 0\;. $$
Tabu Simplex

Observe that: (1) the VOC system (viability and optimality conditions) stated above formally resembles an LP (linear programming) problem in standard form; (2) all the non-linearity of the original QP (quadratic programming) problem is encapsulated and locked inside the CC (complementarity conditions); (3) the CC take the logical form of mutual exclusion. These observations are the key to adapting the Simplex to solve a (positive definite) QP, see Hadley (1964), Stern et al. (2006) and Wolfe (1959).

The VOC plus CC stated above imply that, at an optimal solution, there must be many null elements; more specifically: there are at least $n$ null elements among $x$ and $s$, and there are at least $ml$ null elements among $yl$ and $l$. Moreover, $ep$ and $en$ are, respectively, the positive and negative parts of the unconstrained vector $e$, so that, by construction, there are at least $me$ null elements among $ep$ and $en$. Hence, in the optimal solution, there are no more than $ml + me + n$ non-zero variables, which can be written as a basic solution of the VOC linear system. This suggests using the Simplex algorithm for solving QP.

Let us assume, for convenience and without loss of generality, that $tl \geq 0$. The need to solve the VOC system, respecting the CCs, is expressed by the following Linear Complementarity (LC) problem:
$$ \min\; [\, 0\; 0\; 0\; 0\; 0\; 0\; 1'\; 1' \,] \begin{bmatrix} x \\ l \\ ep \\ en \\ s \\ yl \\ ye \\ yq \end{bmatrix}, \quad \begin{bmatrix} x \\ \vdots \\ yq \end{bmatrix} \geq 0 \;\Bigg|\; \begin{bmatrix} Tl & 0 & 0 & 0 & 0 & I & 0 & 0 \\ Te & 0 & 0 & 0 & 0 & 0 & De & 0 \\ Q & Tl' & Te' & -Te' & -I & 0 & 0 & Dq \end{bmatrix} \begin{bmatrix} x \\ l \\ ep \\ en \\ s \\ yl \\ ye \\ yq \end{bmatrix} = \begin{bmatrix} tl \\ te \\ \eta p \end{bmatrix}, $$
where the CCs, the new matrix blocks, and the initial vertex are given by, respectively,
$$ x's = 0\,,\; yl'\,l = 0\,, \quad Dq = \mathrm{diag}(\mathrm{sign}(\eta p))\,,\; De = \mathrm{diag}(\mathrm{sign}(te))\,,\; tl \geq 0\,, $$
$$ [\, x\; l\; ep\; en\; s\; yl\; ye\; yq \,]' = [\, 0\; 0\; 0\; 0\; 0\; tl\; |te|\; |\eta p| \,]'\;. $$
At the initial solution the CCs are satisfied, for $x's = 0'\,0 = 0$ and $yl'\,l = tl'\,0 = 0$. The first phase of the algorithm drives the artificial variables $ye$ out of the basis, as in a phase-0 Simplex for a standard LP problem. The next phase of the algorithm consists of driving the remaining artificial variables, $yq$, out of the basis, respecting however the CC. In order to ensure that the CC continue to be satisfied as the Simplex progresses, we use the following prohibition or Tabu rules:
- Forbid variable $x_i$ to enter the basis if $s_i$ is currently basic, and vice-versa;
- Forbid variable $yl_i$ to enter the basis if $l_i$ is currently basic, and vice-versa.

The Tabu Simplex is an efficient algorithm for Parametric Quadratic Programming (PQP). The original version of the Tabu Simplex is presented in Wolfe (1959); Hadley (1964) gives a simple proof of the algorithm subject only to equality constraints, and Stern et al. (2006) details this proof including inequality constraints. A prototypical application of PQP is the computation of
Efficient Frontiers, see Alexander and Francis (1986) and Markowitz (1952, 1956, 1987). Many theoretical aspects of financial portfolio analysis can be easily stated based on optimality conditions of the underlying optimization problems, see Stern et al. (2006).
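For the equality-constrained core of the problem (dropping the inequality constraints and sign restrictions), the Lagrange conditions reduce to a single linear KKT system; the sketch below, with made-up data of our own, solves it directly:

    import numpy as np

    # Equality-constrained QP: min (1/2) x'Qx - eta*p'x  s.t.  Te x = te
    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])         # symmetric positive definite
    p = np.array([1., 1.])
    eta = 1.0
    Te = np.array([[1., 1.]])          # single budget-type constraint
    te = np.array([1.])

    # KKT system:  [Q  Te'] [x]   [eta*p]
    #              [Te  0 ] [e] = [ te  ]
    n, me = Q.shape[0], Te.shape[0]
    K = np.block([[Q, Te.T],
                  [Te, np.zeros((me, me))]])
    sol = np.linalg.solve(K, np.concatenate([eta * p, te]))
    x, e = sol[:n], sol[n:]
    print(x, e)
    print(Q @ x + Te.T @ e - eta * p)  # stationarity check -> ~[0. 0.]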
D.3.1 GRG: Generalized Reduced Gradient
Let us consider a NLP problem with non-linear equality constraints, plus box constraints on the variables' range,
$$ \min\; f(x)\,,\; f: \mathbb{R}^n \mapsto \mathbb{R}\,, \quad l \leq x \leq u \;\mid\; h(x) = 0\,,\; h: \mathbb{R}^n \mapsto \mathbb{R}^m\;. $$
The Generalized Reduced Gradient (GRG) method emulates the behaviour of the Simplex method on a local linearization of the NLP problem, see Abadie and Carpentier (1969) and Minoux and Vajda (1986); for an intuitive presentation see Himmelblau (1972). Let $x$ be an initial feasible point. As for LP, we assume a non-degeneracy hypothesis, that is, we assume that, at a given feasible point, a maximum of $(n - m)$ box constraints can be active. Hence, we can take $m$ of the variables with slack box constraints as basic variables, and the remaining $n - m$ variables as residual (or independent) variables. As in the Simplex algorithm, we permute and partition all vector and matrix objects to better display this distinction,
$$ x = \begin{bmatrix} x_b \\ x_r \end{bmatrix},\; l = \begin{bmatrix} l_b \\ l_r \end{bmatrix},\; u = \begin{bmatrix} u_b \\ u_r \end{bmatrix},\; \nabla f(x) = [\, \nabla_b f(x) \;\; \nabla_r f(x) \,]\,, $$
$$ J(x) = [\, J_b(x) \;\; J_r(x) \,] = \begin{bmatrix} \nabla_b h_1(x) & \nabla_r h_1(x) \\ \vdots & \vdots \\ \nabla_b h_m(x) & \nabla_r h_m(x) \end{bmatrix}. $$
Let us consider the effect of a small alteration of the current feasible point, $x + \delta$, assuming that the functions $f$ and $h$ are continuous and differentiable. The corresponding alteration of the solution's value is
$$ \Delta f = f(x + \delta) - f(x) \approx \nabla f(x)\, \delta = [\, \nabla_b f(x) \;\; \nabla_r f(x) \,] \begin{bmatrix} \delta_b \\ \delta_r \end{bmatrix}. $$
We also want the altered solution, $x + \delta$, to remain (approximately) feasible, that is,
$$ \Delta h = h(x + \delta) - h(x) \approx J(x)\, \delta = [\, J_b(x) \;\; J_r(x) \,] \begin{bmatrix} \delta_b \\ \delta_r \end{bmatrix} = 0\;. $$
Isolating $\delta_b$, and assuming that the basis $J_b(x)$ is invertible,
$$ \delta_b = -\big(J_b(x)\big)^{-1} J_r(x)\, \delta_r\,, \quad \Delta f \approx \nabla_b f(x)\, \delta_b + \nabla_r f(x)\, \delta_r = \Big( \nabla_r f(x) - \nabla_b f(x) \big(J_b(x)\big)^{-1} J_r(x) \Big)\, \delta_r = z(x)\, \delta_r\;. $$
Since the problem is non-linear, we cannot assure that an optimal solution has all residual variables with one active constraint, that is, at one side of the box, as in a standard LP problem. Therefore, there is no motivation to restrict $\delta_r$ to have only one non-zero component, as in the Simplex. Instead, we suggest moving the current solution (in the space of residual variables) along the direction given by the vector $v_r$, opposed to the reduced gradient, as long as the corresponding box constraint is slack, that is,
$$ v_{r(i)} = \begin{cases} -z_i & \text{if } z_i > 0 \wedge x_{r(i)} > l_{r(i)}\,, \\ -z_i & \text{if } z_i < 0 \wedge x_{r(i)} < u_{r(i)}\,, \\ 0 & \text{otherwise.} \end{cases} $$
A discontinuous dependence of $v_r$ on the box constraints' slacks is undesirable. Hence, we shall use a continuous version of the search direction like, for example,
$$ v_{r(i)} = \begin{cases} -\gamma(x_{r(i)} - l_{r(i)})\, z_i & \text{if } z_i > 0 \wedge x_{r(i)} > l_{r(i)}\,, \\ -\gamma(u_{r(i)} - x_{r(i)})\, z_i & \text{if } z_i < 0 \wedge x_{r(i)} < u_{r(i)}\,, \\ 0 & \text{otherwise,} \end{cases} \quad \gamma(x) = \begin{cases} x/\epsilon & \text{if } 0 \leq x \leq \epsilon\,, \\ 1 & \text{otherwise.} \end{cases} $$
The basic idea of one iteration of the GRG method is to move the feasible point by a step $x + \delta$ with $\delta = \eta v$, where $v_b = -\big(J_b(x)\big)^{-1} J_r(x)\, v_r$, that is, a step (in the space of residual variables) of size $\eta$ in the direction $v_r$. In order to determine the step size, $\eta$, we need to perform a line search, always respecting the box constraints.

Note that the direction in the space of basic variables, $v_b$, has been chosen so that $x + \eta v$ remains (approximately) feasible, since we are moving inside a hyperplane that is tangent to the algebraic manifold defined by $h(x) = 0$. The new nearly feasible point shall then receive a correction $\Delta x$ in order to regain exact feasibility for the non-linear constraints, that is, so that $h(x + \Delta x) = 0$. The nearly feasible point $x$ can be used as the starting point for a recursive method used to regain exact feasibility, like the Newton-Raphson method, which uses the basic Jacobian, $J_b(x)$, to compute the correction
$$ \Delta x_b = -\big(J_b(x)\big)^{-1} h(x_b, x_r)\;. $$
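A minimal numeric sketch of the reduced gradient computation (toy problem and partition chosen by us): for $\min f(x) = x_1^2 + x_2^2 + x_3^2$ with $h(x) = x_1 + x_2 + x_3 - 1 = 0$, take $x_1$ basic and $x_2, x_3$ residual:

    import numpy as np

    f_grad = lambda x: 2.0 * x            # gradient of f(x) = sum of x_i^2
    J = np.array([[1.0, 1.0, 1.0]])       # Jacobian of h(x) = x1+x2+x3-1

    x = np.array([0.6, 0.3, 0.1])         # feasible point, h(x) = 0
    b, r = [0], [1, 2]                    # basic / residual partition

    Jb, Jr = J[:, b], J[:, r]
    gb, gr = f_grad(x)[b], f_grad(x)[r]
    # Reduced gradient: z = grad_r f - grad_b f Jb^{-1} Jr
    z = gr - gb @ np.linalg.inv(Jb) @ Jr
    print(z)   # -> [-0.6 -1. ]; moving along -z decreases f within h(x)=0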
D.3.2 Line Search and Local Convergence

This section analyses the problem of minimizing a unidimensional function, $f(x)$. First, let us consider the problem of finding the root (zero) of a differentiable function, approximated by its first order Taylor expansion, $g(x) \approx g(x_k) + g'(x_k)(x - x_k)$. This approximation implies that $g(x_{k+1}) \approx 0$, where
$$ x_{k+1} = x_k - g'(x_k)^{-1} g(x_k)\;. $$
This is Newton's method for finding the root of a unidimensional function. If a function $f(x)$ is differentiable, its minimum is at a point where the function's first derivative is null. Hence, we can use Newton's method for minimizing $f(x)$,
$$ x_{k+1} = x_k - f''(x_k)^{-1} f'(x_k)\;. $$
Let us examine how fast the sequence generated by Newton's method approaches the optimal solution, $x^*$, assuming the starting point, $x_0$, is already close enough to $x^*$. Assuming third order differentiability, we can write
$$ 0 = f'(x^*) = f'(x_k) + f''(x_k)(x^* - x_k) + (1/2)\, f'''(y_k)(x^* - x_k)^2\,, \quad \text{or} $$
$$ x^* = x_k - f''(x_k)^{-1} f'(x_k) - (1/2)\, f''(x_k)^{-1} f'''(y_k)(x^* - x_k)^2\;. $$
Subtracting the equation that defines Newton's method, we have
$$ (x_{k+1} - x^*) = (1/2)\, f''(x_k)^{-1} f'''(y_k)\, (x_k - x^*)^2\;. $$
As we shall see in the following, this result implies that Newton's method converges very fast (quadratically), if we are already close enough to the optimal solution. However, Newton's method needs a lot of differential information about the function, something that may be hard to obtain. Moreover, far from the optimum, one cannot be sure about the method's convergence. The following methods overcome these difficulties.

Let us now examine the Golden Ratio search method, for minimizing a unidimensional and unimodal function, $f(x)$, in the interval $[x_1, x_4]$. Assume we know the function's values at four points, the extremes of the interval and two interior points, $x_1 < x_2 < x_3 < x_4$. From the unimodality hypothesis we know that the point of minimum, $x^*$, is in one of the sub-intervals, that is,
$$ f(x_2) \leq f(x_3) \Rightarrow x^* \in [x_1, x_3]\,, \quad f(x_2) > f(x_3) \Rightarrow x^* \in [x_2, x_4]\;. $$
Without loss of generality, let us consider how to divide the interval $[0, 1]$: a ratio $r$ defines a symmetric division of the form $0 < 1 - r < r < 1$.
Dividing the subinterval $[0, r]$ by the same ratio $r$, we obtain the points $0 < r(1 - r) < r^2 < r$. We want the points $r^2$ and $1 - r$ to coincide, so that it will only be necessary to evaluate the function at one additional point, that is, we want $r^2 + r - 1 = 0$. The positive root of this equation, $r = (\sqrt 5 - 1)/2$, is the golden ratio, $r \approx 0.618$.
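A minimal implementation of the golden section search (our own sketch; the test function is made up):

    import math

    def golden_section(f, x1, x4, tol=1e-8):
        """Minimize a unimodal f on [x1, x4] by golden section search."""
        r = (math.sqrt(5) - 1) / 2          # golden ratio, ~0.618
        x2, x3 = x4 - r * (x4 - x1), x1 + r * (x4 - x1)
        f2, f3 = f(x2), f(x3)
        while x4 - x1 > tol:
            if f2 <= f3:                    # minimum in [x1, x3]
                x4, x3, f3 = x3, x2, f2
                x2 = x4 - r * (x4 - x1)
                f2 = f(x2)
            else:                           # minimum in [x2, x4]
                x1, x2, f2 = x2, x3, f3
                x3 = x1 + r * (x4 - x1)
                f3 = f(x3)
        return (x1 + x4) / 2

    print(golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0))  # -> ~2.0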
Other line search methods for minimizing $f(x + \eta)$, on $\eta \geq 0$, rely on a polynomial that locally approximates $f$, and on the subsequent minimization of the adjusted polynomial. The simplest of these methods is the quadratic adjustment. Assume we know, at three points, $\eta_1, \eta_2, \eta_3$, the respective function values, $f_i = f(x + \eta_i)$. Solving the equations for the interpolating polynomial $q(\eta) = a\eta^2 + b\eta + c$, $q(\eta_i) = f_i$, we obtain the coefficients
$$ a = \frac{f_1(\eta_2 - \eta_3) + f_2(\eta_3 - \eta_1) + f_3(\eta_1 - \eta_2)}{-(\eta_1 - \eta_2)(\eta_2 - \eta_3)(\eta_3 - \eta_1)}\,, \quad b = \frac{f_1(\eta_2^2 - \eta_3^2) + f_2(\eta_3^2 - \eta_1^2) + f_3(\eta_1^2 - \eta_2^2)}{(\eta_1 - \eta_2)(\eta_2 - \eta_3)(\eta_3 - \eta_1)}\,, $$
$$ c = \frac{f_1(\eta_2^2 \eta_3 - \eta_3^2 \eta_2) + f_2(\eta_3^2 \eta_1 - \eta_1^2 \eta_3) + f_3(\eta_1^2 \eta_2 - \eta_2^2 \eta_1)}{-(\eta_1 - \eta_2)(\eta_2 - \eta_3)(\eta_3 - \eta_1)}\;. $$
Equating the first derivative of the interpolating polynomial to zero, $q'(\eta) = 2a\eta + b = 0$, we obtain its point of minimum, $\eta_4 = -b/(2a)$, or, directly from the function's values,
$$ \eta_4 = \frac{1}{2}\, \frac{f_1(\eta_2^2 - \eta_3^2) + f_2(\eta_3^2 - \eta_1^2) + f_3(\eta_1^2 - \eta_2^2)}{f_1(\eta_2 - \eta_3) + f_2(\eta_3 - \eta_1) + f_3(\eta_1 - \eta_2)}\;. $$
We should try to use initial points in the "interpolating pattern" $\eta_1 < \eta_2 < \eta_3$ and $f_1 \geq f_2 \leq f_3$, that is, three points where the intermediate point has the smallest function value. So doing, we know that the minimum of the interpolating polynomial is inside the initial search interval, that is, $\eta_4 \in [\eta_1, \eta_3]$. In this situation we are interpolating, and not extrapolating, the function, favoring the numerical stability of the procedure.

Choosing $\eta_4$ and two more points from the initial three, we have a new set of three points in the desired interpolating pattern, and are ready to proceed to the next iteration. Note that, in general, we cannot guarantee that $\eta_4$ is the best point in the new set of three. However, $\eta_4$ will always replace the worst point in the old set. Hence, the sum $z = f_1 + f_2 + f_3$ is monotonically decreasing. In Section D.3.4 we shall see that these properties assure the global convergence of the quadratic adjustment line search algorithm.

Let us now consider the errors relative to the minimizing argument, $\epsilon_i = x^* - x_i$. We can write $\epsilon_4 = g(\epsilon_1, \epsilon_2, \epsilon_3)$, where the function $g$ is a second order polynomial, because $\eta_4$ is obtained by a quadratic adjustment; $g$ is also symmetric in its arguments, since the order of the first three points is irrelevant. Moreover, it is not hard to check that $\epsilon_4$ is zero if two of the three initial errors are zero. Hence, close to the minimum, $x^*$, we have the following approximation for the fourth error:
$$ \epsilon_4 = C\, (\epsilon_1 \epsilon_2 + \epsilon_1 \epsilon_3 + \epsilon_2 \epsilon_3)\;. $$
Assuming that the process is converging, the $k$-th error obeys, approximately, $\epsilon_{k+4} = C\, \epsilon_{k+1} \epsilon_{k+2}$, or $C\epsilon_{k+3} = (C\epsilon_{k+1})(C\epsilon_{k+0})$. Let us now assume a power-law convergence, $C\epsilon_k \approx \delta^{r_k}$, so that we have $\delta^{r_{k+3}} = \delta^{r_{k+1}}\, \delta^{r_{k+0}}$ or, taking logarithms, $r_{k+3} = r_{k+1} + r_{k+0}$. The general solution of this finite difference equation has the form $r_k = A\lambda_1^k + B\lambda_2^k + C\lambda_3^k$, where the $\lambda_i$ are the roots of the characteristic equation $\lambda^3 - \lambda - 1 = 0$, whose only real root is $\lambda_1 \approx 1.32$. Notice that $1.32^3 \approx 2.30$, making three steps of this method "as good as" one step of the quadratically convergent Newton method, with the advantages of being globally convergent and of not requiring the computation of expensive derivatives.
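A sketch of the quadratic adjustment line search (our own implementation and test function), maintaining the interpolating pattern at each iteration:

    import math

    def quad_line_search(f, e1, e2, e3, iters=20):
        """Quadratic adjustment line search: keep three points in the
        interpolating pattern and replace the worst one at each step."""
        pts = sorted([(e, f(e)) for e in (e1, e2, e3)])
        for _ in range(iters):
            (e1, f1), (e2, f2), (e3, f3) = pts
            num = f1*(e2**2 - e3**2) + f2*(e3**2 - e1**2) + f3*(e1**2 - e2**2)
            den = f1*(e2 - e3) + f2*(e3 - e1) + f3*(e1 - e2)
            if den == 0.0:                    # flat parabola: converged
                break
            e4 = 0.5 * num / den              # minimizer of the parabola
            pts.append((e4, f(e4)))
            pts.remove(max(pts, key=lambda p: p[1]))  # drop the worst point
            pts.sort()
        return min(pts, key=lambda p: p[1])[0]

    # Test on f(x) = x - log(x), whose minimum is at x = 1
    print(quad_line_search(lambda x: x - math.log(x), 0.5, 1.5, 3.0))  # ~1.0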
We say that a sequence of real numbers $r_k \rightarrow r^*$ converges at least with order $p > 0$ iff
$$ 0 \leq \lim_{k \rightarrow \infty} \frac{|r_{k+1} - r^*|}{|r_k - r^*|^p} = \beta < \infty\;. $$
The sequence's order of convergence is the supremum of such constants $p > 0$. If $p = 1$ and $\beta < 1$, we say that the sequence has linear convergence with rate $\beta$. If $\beta = 0$, we say that the sequence has super-linear convergence.

For $0 < a < 1$ and $c \geq 1$, $c$ is the order of convergence of the sequence $a^{(c^k)}$. We can also see that $1/k$ converges with order 1, although it is not linearly convergent, because $r_{k+1}/r_k \rightarrow 1$. Finally, $(1/k)^k$ converges with order 1, because for any $p > 1$, $r_{k+1}/(r_k)^p \rightarrow \infty$. However, this convergence is super-linear, because $r_{k+1}/r_k \rightarrow 0$.

D.3.3 The Gradient ParTan Algorithm
In this section we present the method of Parallel Tangents, ParTan, developed by Shah, Buehler and Kempthorne (1964) for solving the problem of minimizing an unconstrained convex function. We present a particular case of the general ParTan algorithm, the Gradient ParTan, following the presentation in Luenberger (1983).

The ParTan algorithm was developed to solve exactly, after $n$ steps, a general quadratic function $f(x) = x'Ax + b'x + c$. If $A$ is a real, symmetric and full rank matrix, it is possible to find the eigenvalue decomposition $V'AV = D = \mathrm{diag}(d)$, see Section F.2. If we had the eigenvector matrix, $V$, we could consider the coordinate transformation $y = V'x$, $x = Vy$, $f(y) = y'V'AVy + b'Vy + c = y'Dy + e'y + c$. The coordinate transformation given by the (orthogonal) matrix $V$ can be interpreted as a decoupling operator, see Chapter 3, for it transforms an $n$-vector optimization problem into $n$ independent scalar optimization problems, $y_i \in \arg\min\, d_i y_i^2 + e_i y_i + c$. However, finding the eigenvalue decomposition of $A$ is even harder than solving the original optimization problem. A set of vectors (or directions), $w^k$, is $A$-conjugate iff, for $k \neq j$, $(w^k)'Aw^j = 0$. A (non-orthogonal) matrix of $n$ $A$-conjugate vectors, $W = [w^1 \ldots w^n]$, provides an alternative, and much cheaper, decoupling operator for the quadratic optimization problem. The ParTan algorithm finds, on the fly, a set of $n$ $A$-conjugate vectors $w^k$.

To simplify the notation we assume, without loss of generality, a quadratic function that is centered at the origin, $f(x) = x'Ax$. Therefore, $\mathrm{grad}\, f(x) \propto Ax$, so that $y'Ax = 0$ iff $y'\, \mathrm{grad}\, f(x) = 0$, and vectors $x$ and $y$ are $A$-conjugate iff $y$ is orthogonal to $\mathrm{grad}\, f(x)$. The ParTan algorithm is defined as follows, progressing through points $x^0, x^1, y^1, x^2, \ldots, x^{k-1}, y^{k-1}, x^k$, see Figure D.2 (left). The algorithm is initialized by choosing an arbitrary starting point, $x^0$, by an initial Cauchy step to find $y^0$, and by taking $x^1 = y^0$.

$N$-Dimensional (Gradient) ParTan Algorithm:
- Cauchy step: For $k = 0, 1, \ldots n$, find $y^k = x^k + \alpha_k g^k$ by an exact line search along the $k$-th steepest descent direction, $g^k = -\mathrm{grad}\, f(x^k)$.
- Acceleration step: For $k = 1, \ldots n - 1$, find $x^{k+1} = y^k + \beta_k (y^k - x^{k-1})$ by an exact line search along the $k$-th acceleration direction, $(y^k - x^{k-1})$.

In order to prove the correctness of the ParTan algorithm, we prove, by induction, two statements:
(1) The directions $w^k = (x^{k+1} - x^k)$ are $A$-conjugate.
(2) Although the ParTan never performs the conjugate direction line search, $x^{k+1} = x^k + \gamma_k w^k$, this is what implicitly happens; that is, the point $x^{k+1}$, actually found at the acceleration step, would also solve the (hypothetical) conjugate direction line search.

The basis of the induction, $k = 1$, is trivially true. Let us assume the statements are true up to $k - 1$, and prove the induction step for index $k$, see Figure D.2 (right).

[Figure D.2: The Gradient ParTan Algorithm.]

By the induction hypothesis, $x^k$ is the minimum of $f(x)$ on the $k$-dimensional hyperplane through $x^0$ spanned by all previous conjugate directions, $w^j$, $j < k$. Hence, $g^k = -\mathrm{grad}\, f(x^k)$ is orthogonal to all $w^j$, $j < k$. All previous search directions lie in the same $k$-hyperplane; hence, $g^k$ is also orthogonal to them. In particular, $g^k$ is orthogonal to $g^{k-1} = -\mathrm{grad}\, f(x^{k-1})$. Also, from the exact Cauchy step from $x^k$ to $y^k$, we know that $g^k$ must be orthogonal to $\mathrm{grad}\, f(y^k)$. Since $\mathrm{grad}\, f(x)$ is a linear function, it must be orthogonal to $g^k$ at any point in the line search $x^{k+1} = y^k + \beta_k (y^k - x^{k-1})$. Since this line search is exact, $\mathrm{grad}\, f(x^{k+1})$ is orthogonal to $(y^k - x^{k-1})$. Hence $\mathrm{grad}\, f(x^{k+1})$ is orthogonal to any linear combination of $g^k$ and $(y^k - x^{k-1})$, including $w^k$. For all other products, $(w^j)'Aw^k$, $j < k - 1$, we only have to write $w^k$ as a linear combination of $g^k$ and $w^{k-1}$ to see that they vanish. This is enough to conclude the induction step for statements (1) and (2). QED.

Since a full rank matrix $A$ can have at most $n$ simultaneous $A$-conjugate directions, the Gradient ParTan must find the optimal solution of a quadratic function in at most $n$ steps. This fact can be used to show that, if the quadratic model of the objective function is good, the ParTan algorithm converges quadratically. Nevertheless, even if the quadratic model for the objective function is poor, the Cauchy (steepest descent) steps can make good progress. This explains the Gradient ParTan's robustness as an optimization algorithm, even when it starts far away from the optimal solution.

The ParTan needs two line searches in order to obtain each conjugate direction. Far away from the optimal solution, a Cauchy method would use only one line search. Close to the optimal solution, alternative versions of the ParTan algorithm, known as Conjugate Gradient algorithms, achieve quadratic convergence using only one line search per dimension.
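A minimal sketch of the Gradient ParTan on a quadratic test function (our own data; the exact line searches are computed analytically for $f(x) = (1/2)\, x'Ax - b'x$):

    import numpy as np

    def exact_step(A, b, p, v):
        """Exact line search for f(x) = 0.5 x'Ax - b'x from p along v."""
        g = A @ p - b                     # gradient at p
        return -(v @ g) / (v @ A @ v)

    def partan(A, b, x0):
        """Gradient ParTan: alternate Cauchy and acceleration steps."""
        n = len(b)
        x_prev, x = None, x0
        for k in range(n):
            g = -(A @ x - b)              # steepest descent direction
            if np.allclose(g, 0):
                return x                  # already optimal
            y = x + exact_step(A, b, x, g) * g        # Cauchy step
            if x_prev is None:
                x_prev, x = x, y          # x^1 = y^0
            else:
                v = y - x_prev            # acceleration direction
                x_prev, x = x, y + exact_step(A, b, y, v) * v
        return x

    A = np.array([[4., 1., 0.],
                  [1., 3., 1.],
                  [0., 1., 2.]])
    b = np.array([1., 2., 3.])
    x = partan(A, b, np.zeros(3))
    print(x, np.linalg.solve(A, b))   # ParTan reaches the exact minimizer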
D.3.4 Global Convergence

In this section we give some conditions that assure global convergence for a NLP algorithm. We follow the ideas of Zangwill (1964); similar analyses are presented in Luenberger (1984) and Minoux and Vajda (1986).

We define an Algorithm as an iterative process generating a sequence of points, $x^0, x^1, x^2, \ldots$, that obeys a recursion equation of the form $x^{k+1} \in A^k(x^k)$, where the point-to-set map $A^k(x^k)$ defines the possible successors of $x^k$ in the sequence. The idea of using a point-to-set map, instead of an ordinary function or point-to-point map, allows us to study in a unified way a whole class of algorithms, including alternative implementations of several details, approximate or inexact computations, randomized steps, etc. The basic property we require of the maps defining an algorithm is closure, defined as follows.

A point-to-set map from space $X$ to space $Y$ is closed at $x$ if the following condition holds: if a sequence $x^k$ converges to $x \in X$, and the sequence $y^k$ converges to $y \in Y$, where $y^k \in A(x^k)$, then the limit $y$ is also in the image $A(x)$, that is,
$$ x^k \rightarrow x\,,\; y^k \rightarrow y\,,\; y^k \in A(x^k) \;\Rightarrow\; y \in A(x)\;. $$
The map is closed in $C \subseteq X$ if it is closed at every point of $C$. Note that if we replace, in the definition of closed map, the inclusion relation by the equality relation, we get the definition of continuity for point-to-point functions. Therefore, the closure property is a generalization of continuity. Indeed, a continuous function is closed, although the converse is not necessarily true.

The basic idea of Zangwill's global convergence theorem is to find some characteristic that is continually "improved" at each iteration of the algorithm. This characteristic is represented by the concept of descendence function. Let $A$ be an algorithm in $X$ for solving the problem $P$, and let $S \subset X$ be the solution set for $P$. A function $Z(x)$ is a descendence function for $(X, A, S)$ if the composition of $Z$ and $A$ is always decreasing outside the solution set, and does not increase inside the solution set, that is,
$$ x \notin S \wedge y \in A(x) \Rightarrow Z(y) < Z(x) \quad \text{and} \quad x \in S \wedge y \in A(x) \Rightarrow Z(y) \leq Z(x)\;. $$

In optimization problems, sometimes the objective function itself is a good descendence function. Other times, more complex descendence functions have to be used, for example, the objective function with auxiliary terms, like penalties for constraint violations.

Before we state Zangwill's theorem, let us review two basic concepts of set topology: An accumulation point of a sequence is a limit point of one of its subsequences. A set is compact iff any (infinite) sequence has an accumulation point inside the set. In $\mathbb{R}^n$, a set is compact iff it is closed and bounded.

Zangwill's Global Convergence Theorem: Let $Z$ be a descendence function for the algorithm $A$ defined in $X$ with solution set $S$, and let $x^0, x^1, x^2, \ldots$ be a sequence generated by this algorithm such that:
A) The map $A$ is closed at any point outside $S$;
B) All points in the sequence remain inside a compact set $C \subseteq X$; and
C) $Z$ is continuous.
Then, any accumulation point of the sequence is in the solution set.

Proof: By the compactness of $C$, a sequence generated by the algorithm has a limit point, $x \in C \subseteq X$, for some subsequence, $x^{s(k)}$. From the continuity of $Z$ in $X$, the limit value of $Z$ on the subsequence coincides with the value of $Z$ at the limit point, that is, $Z(x^{s(k)}) \rightarrow Z(x)$. But the complete sequence, $Z(x^k)$, is monotonically decreasing; hence, if $s(k) \leq j \leq s(k+1)$ then $Z(x^{s(k)}) \geq Z(x^j) \geq Z(x^{s(k+1)})$, and the value of $Z$ on the complete sequence also converges to the value of $Z$ at the accumulation point, that is, $Z(x^k) \rightarrow Z(x)$.

Let us now imagine, for a proof by contradiction, that $x \notin S$, so that $Z(A(x)) < Z(x)$. Consider the subsequence of the successors of the points in the first subsequence, $x^{s(k)+1}$. This second subsequence, again by compactness, also has an accumulation point, $x'$; since $x^{s(k)} \rightarrow x$, $x^{s(k)+1} \in A(x^{s(k)})$, and the map $A$ is closed outside $S$, we have $x' \in A(x)$, so that $Z(x') < Z(x)$. But, from the result in the last paragraph, the value of the descendence function on both subsequences converges to the limit value of the whole sequence, that is, $\lim Z(x^{s(k)+1}) = \lim Z(x^k) = \lim Z(x^{s(k)})$, contradicting $Z(x') < Z(x)$. We have thus proved the impossibility of $x$ not being a solution.

Several algorithms are formulated as a composition of several steps. Hence, the map describing the whole algorithm is the composition of several maps, one for each step. A typical example would be a step for choosing a search direction, followed by a step for a line search. The following lemmas are useful in the construction of such composite maps.

First Composition Lemma: Let $A$ from $X$ to $Y$, and $B$ from $Y$ to $Z$, be point-to-set maps, $A$ closed at $x \in X$, $B$ closed in $A(x)$. If, for any sequence $x^k$ converging to $x$, $y^k \in A(x^k)$ has an accumulation point $y$, then the composite map $B \circ A$ is closed at $x$.

Second Composition Lemma: Let $A$ from $X$ to $Y$, and $B$ from $Y$ to $Z$, be point-to-set maps, $A$ closed at $x \in X$, $B$ closed in $A(x)$. If $Y$ is compact, then the composite map $B \circ A$ is closed at $x$.

Third Composition Lemma: Let $A$ be a point-to-point map from $X$ to $Y$, and $B$ a point-to-set map from $Y$ to $Z$. If $A$ is continuous at $x$, and $B$ is closed in $A(x)$, then the composite map $B \circ A$ is closed at $x$.

D.4 Variational Principles
The variational problem asks for the function $q(t)$ that minimizes a global functional (a function of a function), $J(q)$, with fixed boundary conditions, $q(a)$ and $q(b)$, as shown in Figure D.3. Its general form is given by a local functional, $F(t, q, q')$, and an integral or global functional,
$$ J(q) = \int_a^b F(t, q, q')\, dt\,, $$
where the prime indicates, as usual, the simple derivative with respect to $t$, that is, $q' = dq/dt$.

[Figure D.3: Variational problem, $q(x)$, $\eta(x)$, $q(x) + \eta(x)$.]

Euler-Lagrange Equation
Consider a 'variation' of $q(t)$ given by another curve, $\eta(t)$, satisfying the fixed boundary conditions, $\eta(a) = \eta(b) = 0$,
$$ q = q(\epsilon, t) = q(t) + \epsilon\, \eta(t) \quad \text{and} \quad J(\epsilon) = \int_a^b F(t, q(\epsilon, t), q'(\epsilon, t))\, dt\;. $$
A minimizing $q(t)$ must be stationary, that is,
$$ \frac{\partial J}{\partial \epsilon} = \frac{\partial}{\partial \epsilon} \int_a^b F(t, q(\epsilon, t), q'(\epsilon, t))\, dt = 0\;. $$
Since the boundary conditions are fixed, the differential operator affects only the integrand, hence
$$ \frac{\partial J}{\partial \epsilon} = \int_a^b \left( \frac{\partial F}{\partial q} \frac{\partial q}{\partial \epsilon} + \frac{\partial F}{\partial q'} \frac{\partial q'}{\partial \epsilon} \right) dt\;. $$
From the definition of $q(\epsilon, t)$ we have
$$ \frac{\partial q}{\partial \epsilon} = \eta(t)\,,\; \frac{\partial q'}{\partial \epsilon} = \eta'(t)\,, \quad \text{hence} \quad \frac{\partial J}{\partial \epsilon} = \int_a^b \left( \frac{\partial F}{\partial q}\, \eta(t) + \frac{\partial F}{\partial q'}\, \eta'(t) \right) dt\;. $$
Integrating the second term by parts, we get
$$ \int_a^b \frac{\partial F}{\partial q'}\, \eta'(t)\, dt = \left. \frac{\partial F}{\partial q'}\, \eta(t) \right|_a^b - \int_a^b \frac{d}{dt} \left( \frac{\partial F}{\partial q'} \right) \eta(t)\, dt\,, $$
where the first term vanishes, since the extreme points, $\eta(a) = \eta(b) = 0$, are fixed. Hence
$$ \frac{\partial J}{\partial \epsilon} = \int_a^b \left( \frac{\partial F}{\partial q} - \frac{d}{dt} \frac{\partial F}{\partial q'} \right) \eta(t)\, dt\;. $$
Since $\eta(t)$ is arbitrary and the integral must be zero, the expression in parentheses in the integrand must be zero. This is the Euler-Lagrange equation:
$$ \frac{\partial F}{\partial q} - \frac{d}{dt} \frac{\partial F}{\partial q'} = 0\;. $$
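As a numeric illustration (our own discretization, not from the text): for $F(t, q, q') = \sqrt{1 + q'^2}$, the functional $J(q)$ is the curve length, and the Euler-Lagrange equation forces the minimizer to be a straight line; minimizing a discretized $J$ recovers this:

    import numpy as np

    # Discretized J(q) = integral of sqrt(1 + q'^2) on [0,1], q(0)=0, q(1)=2
    n = 10
    t = np.linspace(0.0, 1.0, n + 1)
    q = np.zeros(n + 1)
    q[-1] = 2.0                               # boundary conditions

    def J(q):
        return np.sum(np.sqrt(np.diff(t)**2 + np.diff(q)**2))  # arc length

    # Coordinate descent on interior points: with equal spacing, the
    # optimal q[i] given its neighbors is their midpoint
    for sweep in range(500):
        for i in range(1, n):
            q[i] = 0.5 * (q[i - 1] + q[i + 1])

    print(J(q))                               # -> sqrt(5) ~ 2.2360
    print(np.max(np.abs(q - 2.0 * t)))        # -> ~0: minimizer is a line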
Noether Theorems

Noether's theorems establish very general conditions under which the existence of a symmetry of the system, described by invariance under the action of a continuous group, implies the existence of a quantity that remains constant in the system's evolution, that is, a conservation law, see for example Byron and Fuller (1969, v.1, Sec. 2.7). For example, consider a functional $F(t, q, q')$ that does not depend explicitly on $q$. This situation reveals a symmetry: the system is invariant under a translation of the coordinate $q$. From the Euler-Lagrange equation, it follows that the quantity $p = \partial F / \partial q'$ is conserved. In the language of classical mechanics, $q$ would be called a "cyclic coordinate", while $p$ would be called a "generalized momentum".

Let us consider the lifeguard's problem from Section 5.5. Using the variable $t$ instead of $x$, and $q$ instead of $y$, the length of an infinitesimal arc is $ds = \sqrt{dt^2 + dq^2}$, and we can build the total travel time using the functional
$$ F(t, q, q') = \nu(t)\, \sqrt{1 + q'^2}\;. $$
Since $F$ does not depend explicitly on $q$, the Euler-Lagrange equation reduces to $\partial F / \partial q' = K$, where $K$ is a constant. Hence, the lifeguard's problem solution is
$$ \frac{\nu(t)\, q'}{\sqrt{1 + q'^2}} = K\;. $$
If the resistance index $\nu(t)$ is also independent of $t$, $q'$ must be a constant, so that $q$ is a straight line, as we have guessed in our very informal solution. In general, the solution of the lifeguard's problem is given by
$$ \frac{\nu(t)\, \tan(\theta)}{\sqrt{1 + \tan^2(\theta)}} = \nu(t)\, \sin(\theta) = K\;. $$

Appendix E

Entropy and Asymptotics

"...we can identify that quantity which we commonly designate as (thermodynamic) entropy with the probability of the actual state."
Ludwig Boltzmann (1844-1906). Wärmetheorie und der Wahrscheinlichkeitsrechnung, 1877.

The origins of the entropy concept lie in the fields of Thermodynamics and Statistical Physics, but its applications have extended far and wide to many other phenomena, physical or not. The entropy of a probability distribution, $H(p(x))$, is a measure of uncertainty (or impurity, confusion) in a system whose states, $x \in X$, have $p(x)$ as probability distribution. We follow closely the presentation in the following references. For the basic concepts: Csiszár (1974), Dugdale (1996), Kinchine (1957) and Renyi (1961, 1970). For MaxEnt characterizations: Gokhale (1975), Kapur (1989), and Kapur and Kesavan (1992). For MaxEnt optimization: Bertsekas and Tsitsiklis (1989), Censor and Zenios (1994, 1997), Elfving (1980), Fang et al. (1997) and Iusem and Pierro (1987). For posterior asymptotic convergence: Gelman (1995).

For a detailed analysis of the connection between MaxEnt optimization and the formalisms of Bayesian statistics, that is, for a deeper view of the relation between MaxEnt and Bayes' rule updates, see Caticha and Giffin (2007) and Caticha (2007).

Convexity
We first introduce the concept of convexity, which will be important throughout this chapter.
Definition:
A region $S \subseteq \mathbb{R}^n$ is convex iff, for any two points, $x^1, x^2 \in S$, and weights $0 \leq l_1, l_2 \leq 1 \mid l_1 + l_2 = 1$, the convex combination of these two points remains in $S$, i.e. $l_1 x^1 + l_2 x^2 \in S$.
Theorem
Finite Convex Combination: A region $S \subseteq \mathbb{R}^n$ is convex iff any (finite) convex combination of its points remains in the region, i.e., $\forall\, l \geq 0 \mid 1'l = 1$, $X = [x^1, x^2, \ldots x^m]$, $x^j \in S$,
$$ Xl = \begin{bmatrix} x^1_1 & x^2_1 & \ldots & x^m_1 \\ x^1_2 & x^2_2 & \ldots & x^m_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^1_n & x^2_n & \ldots & x^m_n \end{bmatrix} \begin{bmatrix} l_1 \\ l_2 \\ \vdots \\ l_m \end{bmatrix} \in S\;. $$

Proof:
By induction on the number of points, $m$.

Definition:
The Epigraph of the function $\varphi: \mathbb{R}^n \rightarrow \mathbb{R}$ is the region of $\mathbb{R}^{n+1}$ "above the graph" of $\varphi$, i.e.
$$ \mathrm{Epi}(\varphi) = \left\{ x \in \mathbb{R}^{n+1} \mid x_{n+1} \geq \varphi\big( [x_1, x_2, \ldots, x_n]' \big) \right\}\;. $$

Definition:
A function $\varphi$ is convex iff its epigraph is convex. A function $\varphi$ is concave iff $-\varphi$ is convex.

Theorem:
A differentiable function, $\varphi: \mathbb{R} \rightarrow \mathbb{R}$, with non-negative second derivative is convex.

Proof:
Consider $x_0 = l_1 x_1 + l_2 x_2$, and the Taylor expansion around $x_0$,
$$ \varphi(x) = \varphi(x_0) + \varphi'(x_0)(x - x_0) + (1/2)\, \varphi''(x^*)(x - x_0)^2\,, $$
where $x^*$ is an appropriate intermediate point. If $\varphi''(x^*) \geq 0$, taking $x = x_1$ and $x = x_2$ we have, respectively,
$$ \varphi(x_1) \geq \varphi(x_0) + \varphi'(x_0)\, l_2 (x_1 - x_2) \quad \text{and} \quad \varphi(x_2) \geq \varphi(x_0) + \varphi'(x_0)\, l_1 (x_2 - x_1)\;. $$
Multiplying the first inequality by $l_1$, the second by $l_2$, and adding them, we obtain the desired result.

Theorem
Jensen Inequality: If $\varphi$ is a convex function, then $\mathrm{E}(\varphi(x)) \geq \varphi(\mathrm{E}(x))$. For discrete distributions the Jensen inequality is a special case of the finite convex combination theorem. Arguments of Analysis allow us to extend the result to continuous distributions.

E.1 Boltzmann-Gibbs-Shannon Entropy

If $H(p(x))$ is to be a measure of uncertainty, it is reasonable that it should satisfy the following list of requirements. For the sake of simplicity, we present the theory for finite spaces.

1) Given $n$ possible states, $x_1, \ldots x_n$, the entropy of the system with a given distribution, $p_i \equiv p(x_i)$, is a function $H = H_n(p_1, \ldots, p_n)$.
2) $H$ is a continuous function.
3) $H$ is a function symmetric in its arguments.
4) The entropy is unchanged if an impossible state is added to the system, i.e., $H_n(p_1, \ldots p_n) = H_{n+1}(p_1, \ldots p_n, 0)$.
5) The entropy of a system with a certain (sure) state is null, i.e., $H_n(0, \ldots, 0, 1, 0, \ldots, 0) = 0$.
6) The system's entropy is maximal when all states are equally probable, i.e., $\frac{1}{n}\mathbf{1} = \arg\max H_n$.
7) A system's maximal entropy increases with the number of states, i.e.,
$$ H_{n+1}\left( \frac{1}{n+1}\mathbf{1} \right) > H_n\left( \frac{1}{n}\mathbf{1} \right)\;. $$
8) Entropy is an extensive quantity, i.e., given two independent systems, with distributions $p$ and $q$, the entropy of the composite system is additive, i.e.,
$$ H_{nm}(r) = H_n(p) + H_m(q)\,, \quad r_{i,j} = p_i\, q_j\;. $$

The Boltzmann-Gibbs-Shannon measure of entropy,
$$ H_n(p) = -I_n(p) = -\sum_{i=1}^n p_i \log(p_i) = -\mathrm{E}\, \log(p_i)\,, $$
satisfies all these requirements; $I(p) = -H(p)$, the Neguentropy, is a measure of the Information available about the system. For the Boltzmann-Gibbs-Shannon entropy we can extend requirement 8, and compute the composite Neguentropy even without independence:
$$ I_{nm}(r) = \sum_{i=1, j=1}^{n, m} r_{i,j} \log(r_{i,j}) = \sum_{i=1, j=1}^{n, m} p_i \Pr(j \mid i) \log\big( p_i \Pr(j \mid i) \big) $$
$$ = \sum_{i=1}^n p_i \log(p_i) \sum_{j=1}^m \Pr(j \mid i) + \sum_{i=1}^n p_i \sum_{j=1}^m \Pr(j \mid i) \log\big( \Pr(j \mid i) \big) = I_n(p) + \sum_{i=1}^n p_i\, I_m(q^i)\,, \quad \text{where } q^i_j = \Pr(j \mid i)\;. $$
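A quick numeric check of requirement 8 (the example distributions are our own):

    import numpy as np

    H = lambda p: -np.sum(p * np.log(p))   # Boltzmann-Gibbs-Shannon entropy

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.6, 0.4])
    r = np.outer(p, q)                     # independent composite system

    print(np.isclose(H(r.ravel()), H(p) + H(q)))   # additivity -> True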
If we add this last identity as item number 9 in the list of requirements, we have a characterization of the Boltzmann-Gibbs-Shannon entropy, see Kinchine (1957) and Renyi (1961, 1970).

Like many important concepts, this measure of entropy was discovered and re-discovered several times in different contexts, and sometimes the uniqueness and identity of the concept was not immediately recognized. A well known anecdote concerns the answer given by von Neumann, after Shannon asked him what to call a "newly" discovered concept in Information Theory. As reported by Shannon in Tribus and McIrvine (1971, p.180):

"My greatest concern was what to call it. I thought of calling it information, but the word was overly used, so I decided to call it uncertainty. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage."
A simple proof that requirement (6) is satisfied can be obtained directly from the convexity of $I_n(p, q)$ as a function of $p$, see Kapur and Kesavan (1992, Sec. IV.2). Convexity properties of $I_n(p, q)$, in either of its vector arguments, can, in turn, be asserted from the gradient vectors and positive definite Hessian matrices given by the following derivatives:
$$ \frac{\partial I_n(p, q)}{\partial p_i} = 1 + \log\left( \frac{p_i}{q_i} \right)\,, \quad \frac{\partial I_n(p, q)}{\partial q_i} = -\frac{p_i}{q_i}\,, $$
$$ \frac{\partial^2 I_n(p, q)}{\partial p_i\, \partial p_j} = \frac{\delta_i^j}{p_i}\,, \quad \frac{\partial^2 I_n(p, q)}{\partial q_i\, \partial q_j} = \delta_i^j\, \frac{p_i}{q_i^2}\;. $$
These Hessian matrices are not only positive definite, but also diagonal. This observation is the basis for several analogies between minimum divergence problems and generalized network flow problems, see the observations in Section D.3.1.
E.2 Csiszár's ϕ-divergence

We present an alternative demonstration that requirement (6) is satisfied (just take $q \propto \mathbf{1}$ in the Shannon inequality below).

Lemma:
Shannon Inequality: If $p$ and $q$ are two distributions over a system with $n$ possible states, and $q_i \neq 0$, then the Information Divergence of $p$ relative to $q$, $I_n(p, q)$, is positive, except if $p = q$, when it is null,
$$ I_n(p, q) \equiv \sum_{i=1}^n p_i \log\left( \frac{p_i}{q_i} \right)\,, \quad I_n(p, q) \geq 0\,, \quad I_n(p, q) = 0 \Rightarrow p = q\;. $$
By the Jensen inequality, if $\varphi$ is a convex function, $\mathrm{E}(\varphi(x)) \geq \varphi(\mathrm{E}(x))$. Taking $\varphi(t) = t \log(t)$ and $t_i = p_i / q_i$, we have
$$ \mathrm{E}_q(t) = \sum_{i=1}^n q_i\, \frac{p_i}{q_i} = 1\,, \quad \text{hence} \quad I_n(p, q) = \sum_i q_i\, t_i \log t_i = \mathrm{E}_q(\varphi(t)) \geq \varphi(\mathrm{E}_q(t)) = \varphi(1) = 0\;. $$
Csiszár's ϕ-divergence: Given a convex function $\varphi$,
$$ d_\varphi(p, q) = \sum_{i=1}^n q_i\, \varphi\left( \frac{p_i}{q_i} \right)\,, \quad \text{where} \quad \varphi(1) = 0 \quad \text{and} \quad 0\, \varphi\left( \frac{c}{0} \right) \equiv c \lim_{t \rightarrow \infty} \frac{\varphi(t)}{t}\;. $$
For example, we can define the quadratic and the absolute divergences as
$$ \xi(p, q) = \sum_i \frac{(p_i - q_i)^2}{q_i}\,,\; \text{for } \varphi(t) = (t - 1)^2\,, \quad \text{and} \quad \mathrm{Ab}(p, q) = \sum_i |p_i - q_i|\,,\; \text{for } \varphi(t) = |t - 1|\;. $$
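A small numeric check (our own example vectors) that these divergences are non-negative and vanish when $p = q$:

    import numpy as np

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.4, 0.4, 0.2])

    kl = np.sum(p * np.log(p / q))         # I_n(p, q), phi(t) = t log t
    xi = np.sum((p - q)**2 / q)            # quadratic, phi(t) = (t-1)^2
    ab = np.sum(np.abs(p - q))             # absolute, phi(t) = |t-1|

    print(kl, xi, ab)                      # all > 0
    print(np.sum(q * np.log(q / q)))       # I_n(q, q) = 0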
Given a prior distribution, $q$, we would like to find a vector $p$ that minimizes the Information Divergence $I_n(p, q)$, where $p$ is under the constraint of being a probability distribution, and maybe also under additional constraints on the expectation of functions taking values on the system's states, that is, we want
$$ p^* \in \arg\min I_n(p, q)\,, \quad p \geq 0 \;\mid\; \mathbf{1}'p = 1 \;\text{and}\; Ap = b\,,\; A\;\, (m-1) \times n\;. $$
$p^*$ is the Minimum Information Divergence distribution, relative to $q$, given the constraints $\{A, b\}$. We can write the probability normalization constraint as a generic linear constraint, including $\mathbf{1}'$ and $1$ as the $m$-th (or 0-th) row of matrix $A$ and element of vector $b$. So doing, we do not need to keep any distinction between the normalization and the other constraints. In this chapter, the operators $\odot$ and $\oslash$ indicate the pointwise (element-wise) product and division between matrices of the same dimension.

The Lagrangean function of this optimization problem, and its derivatives, are:
$$ L(p, w) = p' \log(p \oslash q) + w'(b - Ap)\,, \quad \frac{\partial L}{\partial p_i} = \log(p_i / q_i) + 1 - w'A^i\,, \quad \frac{\partial L}{\partial w_k} = b_k - A_k\, p\;. $$
Equating the $n + m$ derivatives to zero, we have a system with $n + m$ unknowns and equations, giving viability and optimality conditions (VOCs) for the problem:
$$ p_i = q_i \exp\left( w'A^i - 1 \right) \quad \text{or} \quad p = q \odot \exp\left( (w'A)' - 1 \right)\,, \qquad A_k\, p = b_k\,,\; p \geq 0\;. $$
Substituting $p_i$, we can write the VOCs on $w$ alone, the dual variables (Lagrange multipliers),
$$ h_k(w) \equiv A_k \left( q \odot \exp\left( (w'A)' - 1 \right) \right) - b_k = 0\;. $$
The last form of the VOCs motivates the use of iterative algorithms of Gauss-Seidel type, solving the problem by cyclic iteration. In this type of algorithm, one cyclically "fits" one equation of the system, for the current value of the other variables. For a detailed analysis of this type of algorithm, see Bertsekas and Tsitsiklis (1989), Censor and Zenios (1994, 1997), Elfving (1980), Garcia et al. (2002) and Iusem and Pierro (1987).

Bregman Algorithm:
Initialization: Take $t = 0$, $w^t \in \mathbb{R}^m$, and $p^t_i = q_i \exp\left( w^{t\prime} A^i - 1 \right)$.

Iteration step: For $t = 1, 2, \ldots$, take $k = (t \bmod m)$ and $\nu \mid \varphi(\nu) = 0$, where
$$ w^{t+1} = \left[\, w^t_1, \ldots w^t_{k-1},\; w^t_k + \nu,\; w^t_{k+1}, \ldots w^t_m \,\right]'\,, $$
$$ p^{t+1}_i = q_i \exp\left( w^{(t+1)\prime} A^i - 1 \right) = p^t_i \exp\left( \nu A_{k,i} \right)\,, \quad \varphi(\nu) = A_k\, p^{t+1} - b_k\;. $$
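A minimal sketch of this cyclic fitting scheme (our own implementation and data; for simplicity, the inner equation $\varphi(\nu) = 0$ is solved by bisection on a fixed bracket, and the normalization appears as an ordinary row of $A$):

    import numpy as np

    def bregman(A, b, q, sweeps=200):
        """Cyclic (Gauss-Seidel / Bregman) fitting of the VOC system:
        p = q * exp(w'A - 1) with A p = b, one constraint at a time."""
        m, n = A.shape
        w = np.zeros(m)
        p = q * np.exp(w @ A - 1.0)
        for t in range(sweeps * m):
            k = t % m
            # Solve phi(nu) = A_k (p * exp(nu*A_k)) - b_k = 0 by bisection
            lo, hi = -50.0, 50.0
            for _ in range(80):
                nu = 0.5 * (lo + hi)
                if A[k] @ (p * np.exp(nu * A[k])) > b[k]:
                    hi = nu
                else:
                    lo = nu
            w[k] += nu
            p = p * np.exp(nu * A[k])
        return p

    # Example: uniform prior on 4 states; constraints: sum p = 1, E[a] = 1
    A = np.array([[1., 1., 1., 1.],
                  [0., 1., 2., 3.]])
    b = np.array([1., 1.])
    q = np.full(4, 0.25)
    p = bregman(A, b, q)
    print(p, A @ p)    # minimum divergence solution, with A p = b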
From our discussion of entropy optimization under linear constraints, it should be clear that the minimum information divergence distribution for a system under constraints on the expectation of functions taking values on the system's states,
$$ \mathrm{E}_{p(x)}\, a_k(x) = \int a_k(x)\, p(x)\, dx = b_k $$
(including the normalization constraint, $a_0 = \mathbf{1}$, $b_0 = 1$), has the form
$$ p(x) = q(x) \exp\left( -\theta_0 - \theta_1 a_1(x) - \theta_2 a_2(x) \ldots \right)\;. $$
Note that we took $\theta_0 = -(w_0 - 1)$ and $\theta_k = -w_k$, and we have also indexed the state $i$ by the variable $x$, so as to write the last equation in the standard form used in the statistical literature.

Several distributions commonly used in Statistics can be interpreted as minimum information (or MaxEnt) densities (relative to the uniform distribution, if not otherwise stated) given some constraints on the expected value of state functions. For example: The Normal distribution is characterized as the distribution of maximum entropy on $\mathbb{R}^n$, given the expected values of its first and second moments, i.e., mean vector and covariance matrix.
The Wishart distribution,
$$ f(S \mid \nu, V) \equiv c(\nu, V) \exp\left( \frac{\nu - d - 1}{2} \log(\det(S)) - \sum_{i,j} V_{i,j}\, S_{i,j} \right)\,, $$
is characterized as the distribution of maximum entropy in the support $S > 0$, given the expected values of the elements and of the log-determinant of the matrix $S$. That is, writing $\Gamma'$ for the digamma function,
$$ \mathrm{E}(S_{i,j}) = V_{i,j}\,, \quad \mathrm{E}(\log(\det(S))) = \sum_{k=1}^d \Gamma'\left( \frac{\nu - k + 1}{2} \right)\;. $$
The Dirichlet distribution,
$$ f(x \mid \theta) = c(\theta) \exp\left( \sum_{k=1}^m (\theta_k - 1) \log(x_k) \right)\,, $$
is characterized as the distribution of maximum entropy in the support $x \geq 0 \mid \mathbf{1}'x = 1$, given the expected values of the log-coordinates, $\mathrm{E}(\log(x_k))$.

Jeffrey's Rule:
Richard Jeffrey considered the problem of updating an old probability distribution, q, to a new distribution, p, given new constraints on the probabilities of a partition, that is,

Σ_{i ∈ S_k} p_i = α_k,  Σ_k α_k = 1,  S_1 ∪ ... ∪ S_m = {1, ..., n},  S_l ∩ S_k = ∅, l ≠ k.

His solution to this problem, known as
Jeffrey's rule, coincides with the minimum information divergence distribution, relative to q, given the new constraints. This solution can be expressed analytically as

p_i = α_k q_i / Σ_{j ∈ S_k} q_j,  k | i ∈ S_k.
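A short numeric illustration of the rule, sketched with invented numbers (the partition is encoded by a class label per state):

% Jeffrey's rule update: p_i = alpha_k * q_i / sum_{j in S_k} q_j.
q     = [0.1 0.2 0.3 0.4];     % old distribution (illustrative numbers)
S     = [1 1 2 2];             % S(i) = k iff state i belongs to class S_k
alpha = [0.7 0.3];             % new probabilities of the partition classes
p = zeros(size(q));
for k = 1:numel(alpha)
   idx = (S == k);
   p(idx) = alpha(k) * q(idx) / sum(q(idx));
end
disp(p)                        % [0.2333 0.4667 0.1286 0.1714], sums to 1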
E.4 Fisher’s Metric and Jeffreys’ Prior
The Fisher Information Matrix, J(θ), is defined as minus the expected Hessian of the log-likelihood. Under appropriate regularity conditions, the information geometry is defined by the metric in the parameter space given by the Fisher information matrix; that is, the geometric length of a curve is computed integrating the form dl² = dθ′ J(θ) dθ.

Lemma: The Fisher information matrix can also be written as the covariance matrix of the gradient of the same log-likelihood, i.e.,

J(θ) ≡ −E_X ∂² log p(x|θ) / ∂θ² = E_X ( (∂ log p(x|θ)/∂θ) (∂ log p(x|θ)/∂θ)′ ).

Proof:

∫_X p(x|θ) dx = 1 ⇒ ∫_X ∂p(x|θ)/∂θ dx = 0 ⇒
∫_X (∂p(x|θ)/∂θ) (p(x|θ)/p(x|θ)) dx = ∫_X (∂ log p(x|θ)/∂θ) p(x|θ) dx = 0.

Differentiating again relative to the parameter,

∫_X ( (∂² log p(x|θ)/∂θ²) p(x|θ) + (∂ log p(x|θ)/∂θ) (∂p(x|θ)/∂θ)′ ) dx = 0,

and observing that the second term can be written as

∫_X (∂ log p(x|θ)/∂θ) (∂p(x|θ)/∂θ)′ (p(x|θ)/p(x|θ)) dx = ∫_X (∂ log p(x|θ)/∂θ) (∂ log p(x|θ)/∂θ)′ p(x|θ) dx,

we obtain the lemma. Q.E.D.
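The lemma is easy to check numerically; a Monte Carlo sketch for the Normal location model p(x|θ) = N(θ, 1), where both sides of the identity equal J(θ) = 1 (all names illustrative):

% Check of the information matrix identity for x ~ N(theta, 1):
% log p = -(x - theta)^2/2 + const, score = (x - theta), Hessian = -1.
theta = 1.5;  N = 1e6;
x = theta + randn(N, 1);
score = x - theta;                 % gradient of the log-likelihood
J_cov  = mean(score.^2)            % E(score^2)   -> approx. 1
J_hess = -mean(-ones(N, 1))        % -E(Hessian)  -> exactly 1 for this model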
Harold Jeffreys used the Fisher metric to define a class of prior distributions, proportional to the square root of the determinant of the information matrix, p(θ) ∝ |J(θ)|^{1/2}.

Lemma: Jeffreys' priors are geometric objects, in the sense of being invariant under a continuous and differentiable change of coordinates in the parameter space, η = f(θ). The proof follows Zellner (1971, p.41-54):

Proof:

J(θ) = [∂η/∂θ] J(η) [∂η/∂θ]′, hence |J(θ)|^{1/2} = |∂η/∂θ| |J(η)|^{1/2}, and |J(θ)|^{1/2} dθ = |J(η)|^{1/2} dη. Q.E.D.

As an example, consider the multinomial distribution:

p(y|θ) = n! Π_{i=1}^m θ_i^{x_i} / Π_{i=1}^m x_i!,  θ_m = 1 − Σ_{i=1}^{m−1} θ_i,  x_m = n − Σ_{i=1}^{m−1} x_i,

L = log p(θ|x) = Σ_{i=1}^m x_i log θ_i,

∂L/∂θ_i = x_i/θ_i − x_m/θ_m,  ∂²L/∂θ_i ∂θ_j = −x_m/θ_m², i ≠ j, i, j = 1 ... m−1,

−E_X ∂²L/(∂θ_i)² = n/θ_i + n/θ_m,  −E_X ∂²L/∂θ_i ∂θ_j = n/θ_m,

|J(θ)| ∝ (θ_1 θ_2 ... θ_m)^{−1},  p(θ) ∝ (θ_1 θ_2 ... θ_m)^{−1/2},

p(θ|x) ∝ θ_1^{x_1 − 1/2} θ_2^{x_2 − 1/2} ... θ_m^{x_m − 1/2}.

Hence, in the multinomial example, Jeffreys' prior "discounts" half an observation of each kind, while the maxent prior discounts one full observation, and the flat prior discounts none. Similarly, slightly different versions of uninformative priors for the multivariate normal distribution are shown in section C.3. This situation leads to the possible criticism stated in Berger (1993, p.89):

"Perhaps the most embarrassing feature of noninformative priors, however, is simply that there are often so many of them."

One response to this criticism, to which Berger (1993, p.90) explicitly subscribes, is that

"it is rare for the choice of a noninformative prior to markedly affect the answer... so that any reasonable noninformative prior can be used. Indeed, if the choice of noninformative prior does have a pronounced effect on the answer, then one is probably in a situation where it is crucial to involve subjective prior information."
The robustness of the inference procedures to variations in the form of the uninformative prior can be tested using sensitivity analysis, as discussed in section A.6. For alternative approaches to robustness and sensitivity analysis, see Berger (1993, sec.4.7).

In general, Jeffreys' priors are not minimally informative in any sense. However, Zellner (1971, p.41-54, Appendix to chapter 2: Prior Distributions Representing "Knowing Little") gives the following argument (attributed to Lindley) to present Jeffreys' priors as asymptotically minimally informative. The information measure of p(x|θ), I(θ); the prior average information, A; the information gain, G, that is, the prior average information associated with an observation, A, minus the prior information measure; and the asymptotic information gain, G_a, are defined as follows:

I(θ) = ∫ p(x|θ) log p(x|θ) dx;  A = ∫ I(θ) p(θ) dθ;

G = A − ∫ p(θ) log p(θ) dθ;  G_a = ∫ p(θ) log √(n |J(θ)|) dθ − ∫ p(θ) log p(θ) dθ.

Although Jeffreys' priors do not in general maximize the information gain, G, the asymptotic convergence results presented in the next section imply that Jeffreys' priors maximize the asymptotic information gain, G_a. For further details and generalizations, see Amari (2007), Amari et al. (1987), Berger and Bernardo (1992), Berger (1993), Bernardo and Smith (2000), Hartigan (1983), Jeffreys (1961), Scholl (1998), and Zhu (1998).

E.5 Posterior Asymptotic Convergence
The Information Divergence, I(p, q), can be used to prove several asymptotic results that are fundamental to Bayesian Statistics. We present in this section two of these basic results, following Gelman (1995, Ap.B).
Theorem (Posterior Consistency for Discrete Parameters): Consider a model where f(θ) is the prior in a discrete parameter space, Θ = {θ_1, θ_2, ...}, X = [x_1, ..., x_n] is a series of observations, and the posterior is given by

f(θ_k | X) ∝ f(θ_k) p(X | θ_k) = f(θ_k) Π_{i=1}^n p(x_i | θ_k).

Further, assume that in this model there is a single value of the vector parameter, θ_0, that gives the best approximation to the "true" predictive distribution g(x), in the sense that it minimizes the information divergence:

{θ_0} = arg min_k I(g(x), p(x | θ_k)),

I(g(x), p(x | θ_k)) = ∫_X g(x) log( g(x) / p(x | θ_k) ) dx = E_X log( g(x) / p(x | θ_k) ).

Then, lim_{n→∞} f(θ_k | X) = δ(θ_k, θ_0).

Heuristic Argument:
Consider the logarithmic coefficient

log( f(θ_k | X) / f(θ_0 | X) ) = log( f(θ_k) / f(θ_0) ) + Σ_{i=1}^n log( p(x_i | θ_k) / p(x_i | θ_0) ).

The first term is a constant, and the second term is a sum whose terms all have negative expected value (relative to x, for k ≠ 0) since, by our hypotheses, θ_0 is the unique minimizer of I(g(x), p(x|θ_k)). Hence (for k ≠ 0), the right-hand side goes to minus infinity as n increases. Therefore, on the left-hand side, f(θ_k | X) must go to zero. Since the total probability adds to one, f(θ_0 | X) must go to one, Q.E.D.

We can extend this result to continuous parameter spaces, assuming several regularity conditions, like continuity, differentiability, and having the argument θ_0 as an interior point of Θ with the appropriate topology. In such a context, we can state that, given a pre-established small neighborhood around θ_0, like C(θ_0, ε), the cube of side size ε centered at θ_0, this neighborhood concentrates almost all the mass of f(θ | X) as the number of observations grows to infinity. Under the same regularity conditions, we also have that the Maximum a Posteriori (MAP) estimator is a consistent estimator, i.e., θ̂ → θ_0.

The next results show the convergence in distribution of the posterior to a Normal distribution. For that, we need the Fisher information matrix identity from the last section.
Theorem (Posterior Normal Approximation): The posterior distribution converges to a Normal distribution with mean θ_0 and precision nJ(θ_0).

Proof (heuristic): We only have to write the second-order log-posterior Taylor expansion centered at θ̂,

log f(θ|X) = log f(θ̂|X) + (∂ log f(θ̂|X)/∂θ)(θ − θ̂) + (1/2)(θ − θ̂)′ (∂² log f(θ̂|X)/∂θ²)(θ − θ̂) + O(‖θ − θ̂‖³).

The term of order zero is a constant. The linear term is null, for θ̂ is the MAP estimator at an interior point of Θ. The Hessian in the quadratic term is

H(θ̂) = ∂² log f(θ̂|X)/∂θ² = ∂² log f(θ̂)/∂θ² + Σ_{i=1}^n ∂² log p(x_i|θ̂)/∂θ².

The Hessian is negative definite, by the regularity conditions and because θ̂ is the MAP estimator. The first term is constant, and the second is the sum of n i.i.d. random variables. On the other hand, we have already shown that the MAP estimator is consistent, and also that all the posterior mass concentrates around θ_0. We also see that the Hessian grows (in average) linearly with n, and that the higher-order terms cannot grow super-linearly. Also, for a given n and θ → θ̂, the quadratic term dominates all higher-order terms. Hence, the quadratic approximation of the log-posterior is increasingly precise, Q.E.D.
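A quick numeric illustration of the theorem, sketched for a Bernoulli(θ_0) model with a flat prior (all constants illustrative); the exact posterior is compared with the Normal density of mean θ̂ and precision nJ(θ̂), where J(θ) = 1/(θ(1 − θ)):

% Posterior vs. its Normal approximation for n Bernoulli observations.
theta0 = 0.3;  n = 2000;
x = rand(n, 1) < theta0;  s = sum(x);
th = linspace(0.2, 0.4, 401);
logpost = s*log(th) + (n - s)*log(1 - th);       % flat prior
post = exp(logpost - max(logpost));
post = post / trapz(th, post);                   % normalized posterior
thhat = s/n;  J = 1/(thhat*(1 - thhat));         % Fisher information
napp = sqrt(n*J/(2*pi)) * exp(-0.5*n*J*(th - thhat).^2);
max(abs(post - napp))                            % small, and shrinking with n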
Given the importance of this result, we present an alternative proof, also giving the reader an alternative way to visualize the convergence process; see Figure 1.

Theorem (MLE Normal Approximation):
The Maximum Likelihood Estimator (MLE) is asymptotically Normal, with mean θ_0 and precision nJ(θ_0).

Proof (schematic): Assuming all needed regularity conditions, from the first-order optimality conditions,

(1/n) Σ_{i=1}^n ∂ log p(x_i|θ̂)/∂θ = 0;

hence, by the mean value theorem, there is an intermediate point θ̃ such that

(1/n) Σ_{i=1}^n ∂ log p(x_i|θ_0)/∂θ = (1/n) Σ_{i=1}^n (∂² log p(x_i|θ̃)/∂θ²) (θ_0 − θ̂),

or, equivalently,

√n (θ̂ − θ_0) = −[ (1/n) Σ_{i=1}^n ∂² log p(x_i|θ̃)/∂θ² ]^{−1} (1/√n) Σ_{i=1}^n ∂ log p(x_i|θ_0)/∂θ.

We assume the regularity conditions are enough to assure that

−[ (1/n) Σ_{i=1}^n ∂² log p(x_i|θ̃)/∂θ² ]^{−1} → J(θ_0)^{−1},

for the MLE is consistent, θ̂ → θ_0, and hence so is the mean value point, θ̃ → θ_0; and

(1/√n) Σ_{i=1}^n ∂ log p(x_i|θ_0)/∂θ → N(0, J(θ_0)),

because we have the sum of n i.i.d. vectors with mean 0 and, by the Information Matrix Identity lemma, covariance J(θ_0). Hence, we finally have

√n (θ̂ − θ_0) → N(0, J(θ_0)^{−1} J(θ_0) J(θ_0)^{−1}) = N(0, J(θ_0)^{−1}). Q.E.D.
Exercises:
1) Implement Bregman's algorithm. It may be more convenient to number the rows of A from 1 to m, and take k = (t mod m) + 1.

2) I was given a die that I assumed to be fair. A friend of mine borrowed the die and reported playing it 60 times, obtaining 4 i's, 8 ii's, 11 iii's, 14 iv's, 13 v's and 10 vi's.
A) What is my Bayesian posterior?
Bi) What was the mean face value? (3.9).

Appendix F

Matrix Factorizations
F.1 Matrix Notation
Let us first define some matrix notation. The operator f : s : t, to be read "from f to t with step s", indicates the vector [f, f+s, f+2s, ..., t] or the corresponding index domain. f : t is shorthand for f : 1 : t. The element in the i-th row and j-th column of matrix A is written as A(i, j) or, with subscript row index and superscript column index, as A_i^j. Index vectors can be used to build a matrix by extracting from a larger matrix a given subset of rows and columns. For example, A(1 : m/2, n/2 + 1 : n) is the northeast block, i.e. the block with the first rows and last columns of A. The next example shows a more general case of this notation,
A = [11 12 13; 21 22 23; 31 32 33],  r = [1 3],  s = [3 1 2],
A_r^s = A(r, s) = [13 11 12; 33 31 32].

The suppression of an index vector indicates that the corresponding index spans all values in its current context. Hence, A(i, :) or A_i indicates the i-th row, and A(:, j) or A^j indicates the j-th column of matrix A.

A single or multiple list of matrices is referenced by one or more indices in braces, like A{k} or A{p, q}. As for element indices, for double lists we may also use the subscript-superscript alternative notation for A{p, q}, namely, A{_p^q}. This compact notation is especially useful for building block matrices, as in the following example,

A = [ A{1,1} A{1,2} ... A{1,s} ;
      A{2,1} A{2,2} ... A{2,s} ;
      ...
      A{r,1} A{r,2} ... A{r,s} ].
Hence, A{p, q}(i, j) or A{_p^q}_i^j indicates the element in the i-th row and j-th column of the block situated at the p-th block of rows and q-th block of columns of matrix A; A{p, q}(:, j) or A{_p^q}^j indicates the j-th column of the same block, and so on.

An upper case letter usually stands for (or starts) a matrix name, while lower case letters are used for vectors or scalars. Whenever recommended by style or tradition, we may slightly abuse the notation, using upper case for the name of a matrix and lower case for some of its parts. For example, we may write x^j, instead of X^j, for the j-th column of matrix X.

The vectors of zeros and ones, with appropriate dimension given by the context, are 0 and 1. The transpose of matrix M is M′, and the transpose inverse, M^{−t}. In (M + v), where v is a column (row) vector of compatible dimension, v is added to each column (row) of matrix M.

A tilde accent, Ã, indicates some simple transformation of matrix A. For example, it may indicate a row and/or column permutation, see the next subsection. A tilde accent may also indicate a normalization, like x̃ = (1/‖x‖) x.

The p-norm of a vector x is given by ‖x‖_p = (Σ |x_i|^p)^{1/p}. Hence, for a non-negative vector x, we can write its 1-norm as ‖x‖_1 = 1′x.

The Hadamard, or elementwise, product, ⊙, is defined by M = A ⊙ B ⇔ M_i^j = A_i^j B_i^j. The squared Frobenius norm of a matrix is defined by frob2(M) = Σ_{i,j} (M_i^j)².

The diagonal operator, diag, if applied to a square matrix, extracts the main diagonal as a vector and, if applied to a vector, produces the corresponding diagonal matrix:

diag(A) = [A_1^1; A_2^2; ...; A_n^n],  diag(a) = the square matrix with a on the main diagonal and zeros elsewhere,  diag²(A) = diag(diag(A)).

The Kronecker product of two matrices is a block matrix where block {i, j} is the second matrix multiplied by element (i, j) of the first matrix:

A ⊗ B = [ A_1^1 B  A_1^2 B  ... ;
          A_2^1 B  A_2^2 B  ... ;
          ...               ... ].

The following properties are easy to check:
• (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)
• (A ⊗ B)′ = A′ ⊗ B′
• (A ⊗ B)^{−1} = A^{−1} ⊗ B^{−1}

The Vec operator stacks the columns of a matrix into a single vector: if A is m × n, Vec(A) = [A^1; A^2; ...; A^n]. The following properties are easy to check:
• Vec(A + B) = Vec(A) + Vec(B)
• Vec(AB) = [AB^1; ...; AB^n] = (I ⊗ A) Vec(B)
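These identities are easy to verify numerically, as in the following sketch (random data, illustrative only):

% Numeric check of the Kronecker and Vec properties listed above.
A = rand(3); B = rand(3); C = rand(3); D = rand(3);
norm(kron(A, B)*kron(C, D) - kron(A*C, B*D))      % ~ 0
norm(kron(A, B)' - kron(A', B'))                  % ~ 0
norm(inv(kron(A, B)) - kron(inv(A), inv(B)))      % ~ 0
E = rand(2, 3); F = rand(3, 4);
norm(reshape(E*F, [], 1) - kron(eye(4), E)*F(:))  % Vec(EF) = (I (x) E) Vec(F)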
Permutations and Partitions

We now introduce some concepts and notations related to the permutation and partition of an m × n matrix A. A permutation matrix is a matrix obtained by permuting rows and columns of the identity matrix, I. Performing on I a given row (column) permutation yields the corresponding row (column) permutation matrix.

Given row and column permutation matrices, P and Q, the corresponding vectors of permuted row and column indices are

p = (P (1 : m)′)′,  q = (1 : n) Q.

To perform a row (column) permutation on a matrix A, obtaining the permuted matrix Ã, is equivalent to multiplying it at the left (right) by the corresponding row (column) permutation matrix. Moreover, if p (q) is the corresponding vector of permuted row (column) indices,

A_p = P A = I_p A,  A^q = A Q = A I^q.

Example: Given the matrices

A = [11 12 13; 21 22 23; 31 32 33],  p = q = [3 1 2],  P = I_p,  Q = I^q,

P A = [31 32 33; 11 12 13; 21 22 23],  A Q = [13 11 12; 23 21 22; 33 31 32].

A square matrix, A, is symmetric iff it is equal to its transpose, that is, iff A = A′. A symmetric permutation of a square matrix A is a permutation of the form Ã = P A P′ or Ã = Q′ A Q, where P or Q are (row or column) permutation matrices. A square matrix, A, is orthogonal iff its inverse equals its transpose, that is, iff A^{−1} = A′. The following statements are easy to check:
(a) A permutation matrix is orthogonal.
(b) A symmetric permutation of a symmetric matrix is still symmetric.

A permutation vector, p, and a termination vector, t, define a partition of the m original indices in s classes:

p(1) ... p(t(1)) | p(t(1) + 1) ... p(t(2)) | ... | p(t(s − 1) + 1) ... p(t(s)),

where t(0) = 0 < t(1) < ... < t(s − 1) < t(s) = m. We define the corresponding permutation and partition matrices, P and T, as

P = I_{p(1:m)} = [P{1}; P{2}; ...; P{s}],  P{r} = I_{p(t(r−1)+1 : t(r))},
T_r = 1′ (P{r})  and  T = [T_1; ...; T_s].

These matrices facilitate writing functions of a given partition, like:
• The indices in class r: P{r} (1 : m)′ = [p(t(r − 1) + 1); ...; p(t(r))];
• The number of indices in class r: T_r 1 = t(r) − t(r − 1);
• A sub-matrix with the row indices in class r: P{r} A = [A_{p(t(r−1)+1)}; ...; A_{p(t(r))}];
• The summation of the rows of a submatrix with row indices in class r: T_r A = 1′ (P{r} A);
• The rows of a matrix, added over each class: T A = [T_1 A; ...; T_s A].

Note that a matrix T represents a partition of m indices into s classes iff T has dimension s × m, T_h^j ∈ {0, 1}, and T has orthogonal rows. The element T_h^j indicates whether the index j ∈ 1 : m is in class h ∈ 1 : s.
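The following sketch illustrates this notation numerically (the example data are illustrative):

% Permutation and partition matrices in action.
A = [11 12 13; 21 22 23; 31 32 33];
p = [3 1 2];  I3 = eye(3);
P = I3(p, :);                          % row permutation matrix P = I_p
norm(P*A  - A(p, :))                   % ~ 0: P A extracts rows in order p
norm(A*P' - A(:, p))                   % ~ 0: column permutation
t = [1 3];                             % termination vector: classes {p(1)}, {p(2:3)}
P1 = P(1, :);  P2 = P(2:3, :);         % P{1}, P{2}: rows of each class
T  = [ones(1,1)*P1; ones(1,2)*P2];     % T_r = 1' P{r}
norm(T*A - [A(3,:); A(1,:) + A(2,:)])  % ~ 0: rows added over each class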
F.2 Dense LU, QR and SVD Factorizations
Vector Spaces and Projectors
Given two vectors, x, y ∈ R^n, their scalar product is defined as

x′y = Σ_{i=1}^n x_i y_i.

With this definition in mind, it is easy to check that the scalar product satisfies the following properties of the inner product operator:
1. <x|y> = <y|x>, symmetry.
2. <αx + βy | z> = α<x|z> + β<y|z>, linearity.
3. <x|x> ≥ 0 ∧ <x|x> = 0 ⇔ x = 0, positivity.

A given inner product defines the following norm,

‖x‖ ≡ <x|x>^{1/2},

that can in turn be used to define the angle between two vectors:

Θ(x, y) ≡ arccos( <x|y> / (‖x‖ ‖y‖) ).

Let us consider the linear subspace generated by the columns of a matrix A, m by n, m ≥ n:

C(A) = { y = Ax, x ∈ R^n }.

C(A) is called the image of A, and the orthogonal complement of C(A), N(A), is called the null space of A,

N(A) = { y | A′y = 0 }.

The projection of a vector b ∈ R^m in the column space of A is defined by the relations:

y = P_{C(A)} b ↔ y ∈ C(A) ∧ (b − y) ⊥ C(A),

or, equivalently,

y = P_{C(A)} b ↔ y = Ax ∧ A′(b − y) = 0.

In the sequel we assume that A has full rank, i.e., that its columns are linearly independent. It is easy to check that the projection of b in C(A) is given by the linear operator P_A = A(A′A)^{−1}A′. If y = A((A′A)^{−1}A′ b), then it is obvious that y ∈ C(A). On the other hand,

A′(b − y) = A′(I − A(A′A)^{−1}A′) b = (A′ − A′) b = 0.

Orthogonal Matrices
A real square matrix Q is said to be orthogonal iff its inverse is equal to its transpose, that is, Q′Q = I. The columns of an orthogonal matrix Q are an orthonormal basis for R^n. The quadratic norm of a vector v, given by

‖v‖² ≡ Σ (v_i)² = v′v,

is not changed by an orthogonal transform, since

(Qv)′(Qv) = v′Q′Qv = v′Iv = v′v.

Given a vector [x_1; x_2] in R², a rotation of this vector by an angle θ is given by the linear transform

G{θ} x = [cos(θ) sin(θ); −sin(θ) cos(θ)] [x_1; x_2].

It is easy to check that G{θ} is orthogonal:

G{θ}′ G{θ} = [cos²(θ) + sin²(θ), 0; 0, cos²(θ) + sin²(θ)] = [1 0; 0 1].

The Givens rotation is a linear operator whose matrix is the identity, except for the insertion of a bidimensional rotation matrix at rows and columns i and j:

G{i, j, θ}: the identity, except for the entries G_i^i = G_j^j = cos(θ), G_i^j = sin(θ), G_j^i = −sin(θ).

The left multiplication of matrix A by a Givens transform, G′A, rotates rows i and j of A counterclockwise by an angle θ. Since the product of orthogonal transforms is still orthogonal, we can use a sequence of Givens rotations to build more complex orthogonal transforms.

We now define some simple bidimensional rotations that will be used as building blocks in the construction of several algorithms. Let us consider, in R², a vector v, a symmetric matrix S, and an asymmetric matrix A,

v = [x; y],  S = [p q; q r],  A = [a b; c d].

In order to set to zero the second component of vector v by means of a left rotation, G{θ_v}′ v, it is possible to use the angle

θ_v = arctan(y/x).

In order to diagonalize the symmetric matrix by a symmetric rotation, G{θ_diag}′ S G{θ_diag}, it is possible to use the angle

θ_diag = (1/2) arctan( 2q / (r − p) ).

In order to symmetrize the asymmetric matrix by means of a left rotation, G{θ_sym}′ A, it is possible to use the angle

θ_sym = arctan( (b − c) / (a + d) ).

Hence, it is possible to diagonalize the asymmetric matrix by means of a symmetrization followed by a diagonalization operation. Alternatively, it is possible to use left and right Jacobi rotations, J{θ_r}′ A J{θ_l}, defined by

θ_sum = θ_r + θ_l = arctan( (c + b) / (d − a) ),  θ_dif = θ_r − θ_l = arctan( (c − b) / (d + a) ),

so that θ_r = (θ_sum + θ_dif)/2 and θ_l = (θ_sum − θ_dif)/2.

When computing the rotation matrices there is no need to make explicit use of the rotation angles, nor is it necessary to use trigonometric functions; it suffices to compute the factors c = cos(θ) and s = sin(θ) directly, as

c = x / √(x² + y²),  s = −y / √(x² + y²).

In order to avoid numerical overflow, one can use the procedure:
• If y == 0, then c = 1, s = 0.
• If |y| ≥ |x|, then t = −x/y, s = 1/√(1 + t²), c = st.
• If |y| < |x|, then t = −y/x, c = 1/√(1 + t²), s = ct.

QR Factorization
Given a full rank real matrix A, m × n, m ≥ n, it is always possible to find an orthogonal matrix Q such that

A = Q [R; 0],

where R is a square upper triangular matrix. This is the QR factorization (or decomposition) of matrix A. The orthogonal factor, Q = [C | N], gives an orthonormal basis for R^m, where the first n columns give an orthonormal basis for C(A), and the last m − n columns give an orthonormal basis for N(A), as can be easily checked from the identity Q′A = [R; 0]. In the sequel, a QR factorization algorithm is presented.

The following example illustrates a rotation sequence that takes a dense 5 × 3 matrix to upper triangular form; the pair (i, j) indicates a rotation used to zero the position at row i, column j, and the sparsity pattern of the matrix is shown as the algorithm progresses. [Sparsity-pattern diagrams omitted.]
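A minimal sketch of this Givens-based QR factorization in MATLAB (the function name is illustrative, and no attempt is made to exploit sparsity):

function [Q, R] = givens_qr(A)
% QR factorization by a sequence of Givens rotations: A = Q*[R; 0].
[m, n] = size(A);
Q = eye(m);  R = A;
for j = 1:n                            % zero the sub-diagonal of column j
   for i = m:-1:j+1
      x = R(j, j);  y = R(i, j);
      if y == 0, continue; end
      r = hypot(x, y);
      c = x/r;  s = y/r;               % cos and sin of the rotation angle
      G2 = [c s; -s c];                % rotation acting on rows j and i
      R([j i], :) = G2 * R([j i], :);  % zeros the position (i, j)
      Q(:, [j i]) = Q(:, [j i]) * G2'; % accumulate Q, keeping Q*R = A
   end
end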
Least Squares

Given an over-determined system Ax = b, where A is m × n, m > n, the vector x* is a least squares solution to the system iff x* minimizes the quadratic norm of the residual, that is,

x* = arg min_{x ∈ R^n} ‖Ax − b‖².

Since an orthogonal rotation does not change the square norm of a vector, one can seek the least squares solution to this system by minimizing the residual of the system transformed by the orthogonal factor of the QR factorization of A,

‖Q′(Ax − b)‖² = ‖ [R; 0] x − [c; d] ‖² = ‖Rx − c‖² + ‖0x − d‖².

From the last expression one can see that the solution and the residual of the original problem are given by

x* = R^{−1} c,  y = Ax*  and  z = Q [0; d].

Since the last m − n columns of Q are an orthonormal basis of N(A), we see that z ⊥ C(A), and can therefore conclude that y = P_A b.
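A usage sketch, assuming the givens_qr function from the previous block is available (illustrative data):

% Least squares via QR: minimize ||A*x - b||^2 for A (m x n), m > n.
A = rand(5, 3);  b = rand(5, 1);
[Q, R] = givens_qr(A);
z = Q' * b;                            % [c; d] in the notation above
x = R(1:3, 1:3) \ z(1:3);              % x* = R^{-1} c, by back substitution
norm(x - A\b)                          % ~ 0: agrees with MATLAB's backslash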
LU and Cholesky Factorizations

Given a matrix A, the elementary operation given by the multiplier m_i^j is the operation of subtracting from row i the row j multiplied by m_i^j. The elementary operation applied to the identity matrix generates the corresponding elementary matrix, M{i, j}: the identity matrix with the extra entry −m_i^j at position (i, j). Applying an elementary operation to matrix A is equivalent to multiplying A from the left by the corresponding elementary matrix.

In the Gaussian elimination algorithm we use a sequence of elementary operations to bring A to upper triangular form,

M A = M{n, n−1} ( M{n−1, n−2} M{n, n−2} ) ··· ( M{2, 1} ··· M{n−1, 1} M{n, 1} ) A = U.

Multiplier m_i^j is computed as the current matrix element at position (i, j) divided by the pivot element at the diagonal position (j, j). Elementary operation M{i, j} is used to eliminate (zero) the position (i, j). The elementary operations are performed in an order that prevents the zeros created at previous steps from being filled again. In a worked example, the multipliers, in italics, are stored at the positions corresponding to the zeros they created. [Numerical example omitted.]

The inverse of the product of this sequence of elementary matrices has lower triangular form, that is,

M^{−1} = M^{−1}{n, 1} M^{−1}{n−1, 1} ··· M^{−1}{2, 1} ··· M^{−1}{n, n−2} M^{−1}{n−1, n−2} M^{−1}{n, n−1},

L = M^{−1} = the unit lower triangular matrix with the multipliers m_i^j below the diagonal.

Therefore the algorithm finds the LU factorization, A = LU. The lower and upper triangular forms of L and U allow us to easily compute L^{−1}z and U^{−1}z by simple forward and backward substitution. Hence, A^{−1}z = U^{−1}(L^{−1}z) can be computed in just two substitution steps.

In case we factor a symmetric matrix V = LU, we can collect the diagonal elements of U in a diagonal matrix D, and write V = LDL′. If V is positive definite we can take the square roots of the diagonal elements and write D = D^{1/2} D^{1/2}. Defining C = L D^{1/2}, we have V = CC′, the Cholesky factorization of V. For reasons of numerical stability, it is recommended to take the square root of each diagonal element just before it is used as a pivot, and then eliminate the elements of its column; see Pissanetzky (1984).
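A minimal dense Cholesky sketch in MATLAB, computing the lower factor column by column (function name illustrative; no pivoting or sparsity handling):

function C = chol_lower(V)
% Cholesky factorization V = C*C' for a symmetric positive definite V.
n = size(V, 1);
C = zeros(n);
for j = 1:n
   % square root of the pivot, taken just before it is used:
   C(j, j) = sqrt(V(j, j) - C(j, 1:j-1) * C(j, 1:j-1)');
   for i = j+1:n                       % eliminate the column below the pivot
      C(i, j) = (V(i, j) - C(i, 1:j-1) * C(j, 1:j-1)') / C(j, j);
   end
end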
Quadratic Programming

The quadratic programming problem with equality constraints is the minimization of the objective function

f(y) ≡ (1/2) y′Wy + c′y,  W = W′,

with the constraints

g_i(y) ≡ N_i′ y = d_i.

The gradients of f and g_i are given by ∇_y f = y′W + c′ and ∇_y g_i = N_i′. The Lagrange (first-order) optimality conditions state that the constraints are in effect, and that the objective function gradient equals a linear combination of the gradients of the constraint functions. Hence, the solution may be obtained from the Lagrange multipliers, i.e., the vector l with the coefficients of the aforementioned linear combination:

N′y = d ∧ y′W + c′ = l′N′,

or, in matrix form,

[−W N; N′ 0] [y; l] = [c; d].

These equations are known as the normal system, with a symmetric coefficient matrix. If the quadratic form W is positive definite, i.e. if ∀x, x′Wx ≥ 0 ∧ x′Wx = 0 ⇔ x = 0, and the constraint matrix N is full rank, the coefficient matrix of the normal system is nonsingular, so the system has a unique solution.
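A direct solve of the normal system, sketched with illustrative problem data:

% Equality-constrained QP: min (1/2) y'*W*y + c'*y  s.t.  N'*y = d.
W = [2 0; 0 4];  c = [-1; -2];         % illustrative data
N = [1; 1];      d = 1;
K = [-W, N; N', zeros(size(N, 2))];    % symmetric normal-system matrix
sol = K \ [c; d];
y = sol(1:2)                           % primal solution: [0.5; 0.5]
l = sol(3)                             % Lagrange multiplier: 0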
SVD Factorization

The SVD factorization takes a real matrix A, m × n, m ≥ n, to a diagonal matrix, D, by left and right multiplication by orthogonal matrices, D = U′AV. Let us first consider the case m = n, i.e. a square matrix.

The Jacobi algorithm is an iterative procedure that, at each iteration, "concentrates" the matrix in the diagonal by a Jacobi rotation,

J{i, j, θ_r}′ A{k} J{i, j, θ_l} = A{k+1},

where the rotation is chosen so as to zero the off-diagonal elements at positions (i, j) and (j, i) of A{k+1}.

Let us consider the sum of squares of the off-diagonal elements of A, Off(A). We can see that

Off(A{k+1}) = Off(A{k}) − (A{k}_i^j)² − (A{k}_j^i)².

Hence, choosing at each iteration the index pair that maximizes the sum of squares of the corresponding elements, the algorithm converges linearly to a diagonal matrix.

The Jacobi algorithm gives a constructive proof of the existence of the SVD factorization, and is the basis of several efficient numerical algorithms. If A is a rectangular matrix, one can first find its QR factorization, and then apply the Jacobi algorithm to the upper triangular R factor. If A is square and symmetric, the obtained factorization is known as the eigenvalue decomposition of A.

The orthogonal matrices U and V can be interpreted as orthonormal bases in the respective m- and n-dimensional spaces. The values at the diagonal of D are called the singular values of matrix A, and can be interpreted geometrically as the scaling factors of the map A = UDV′, taking each versor of the basis V to a scaled versor of the basis U.
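For the symmetric case, a minimal cyclic Jacobi sketch using the angle θ_diag defined above (function name and sweep count illustrative):

function [V, S] = jacobi_eig(A, sweeps)
% Cyclic Jacobi iteration for symmetric A: S = V'*A*V converges to diagonal.
n = size(A, 1);  S = A;  V = eye(n);
for sweep = 1:sweeps
   for i = 1:n-1
      for j = i+1:n
         if S(i, j) == 0, continue; end
         % theta_diag = (1/2) arctan(2q/(r - p)), p = S(i,i), q = S(i,j), r = S(j,j):
         theta = 0.5 * atan2(2*S(i, j), S(j, j) - S(i, i));
         c = cos(theta);  s = sin(theta);
         G = eye(n);  G([i j], [i j]) = [c s; -s c];
         S = G' * S * G;               % zeros S(i,j) and S(j,i)
         V = V * G;
      end
   end
end

After a few sweeps the diagonal of S holds the eigenvalues, and the columns of V the corresponding eigenvectors.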
Complex Matrices

Many techniques developed in this section for real matrices can be generalized to complex matrices. Practical and elegant methods of obtaining and describing such generalizations are described by Hemkumar (1994) using
Cordic transforms (COordinate Rotation DIgital Computer). Such a transform is applied to a 2 × 2 complex matrix M in the form of internal and external rotation pairs,

[c(φ) −s(φ); s(φ) c(φ)] [e^{iα} 0; 0 e^{iβ}] [Ae^{ia} Be^{ib}; Ce^{ic} De^{id}] [e^{iγ} 0; 0 e^{iδ}] [c(ψ) −s(ψ); s(ψ) c(ψ)].

The elegance of these Cordic transforms comes from the following observations: The internal transform affects only the imaginary exponents of the matrix elements, while the external transform can be independently applied to the real and the imaginary parts of the matrix, that is,

[e^{iα} 0; 0 e^{iβ}] [Ae^{ia} Be^{ib}; Ce^{ic} De^{id}] [e^{iγ} 0; 0 e^{iδ}] = [Ae^{ia′} Be^{ib′}; Ce^{ic′} De^{id′}] = [Ae^{i(a+α+γ)} Be^{i(b+α+δ)}; Ce^{i(c+β+γ)} De^{i(d+β+δ)}],

[c(φ) −s(φ); s(φ) c(φ)] [A′_r + iA′_i, B′_r + iB′_i; C′_r + iC′_i, D′_r + iD′_i] [c(ψ) −s(ψ); s(ψ) c(ψ)] =
[c(φ) −s(φ); s(φ) c(φ)] [A′_r B′_r; C′_r D′_r] [c(ψ) −s(ψ); s(ψ) c(ψ)] + i ( [c(φ) −s(φ); s(φ) c(φ)] [A′_i B′_i; C′_i D′_i] [c(ψ) −s(ψ); s(ψ) c(ψ)] ).

The following table defines some useful internal and external transforms. Type I transforms change the imaginary exponents of the matrix elements at one of the diagonals. Transforms of types R, C and D make real the elements in a row, column or diagonal.

Type      Value
I main    α = −β = γ = −δ = (d − a)/2
I off     α = −β = −γ = δ = (c − b)/2
R up      α = β = −(b + a)/2,  γ = −δ = (b − a)/2
R low     α = β = −(d + c)/2,  γ = −δ = (d − c)/2
C left    α = −β = (c − a)/2,  γ = δ = −(c + a)/2
C right   α = −β = (d − b)/2,  γ = δ = −(d + b)/2
D main    α = β = −(d + a)/2,  γ = −δ = (d − a)/2
D off     α = β = −(b + c)/2,  γ = −δ = (b − c)/2

A transform of type I applies R low followed by a rotation, making the matrix upper triangular. A transform of type II applies D main and I off followed by a diagonalization. For Hermitian (self-adjoint) matrices, the diagonalization is obtained using only one transform of type III.

Type   Internal                                   External
I      α = β = −(d + c)/2,  γ = −δ = (d − c)/2    φ = 0;  ψ = arctan(C/D)
II     α = −(a + b)/2,  β = γ = −δ = (b − a)/2    φ ± ψ = arctan(B/(D ∓ A))
III    α = −β = −γ = δ = −b/2                     φ = ψ = arctan(2B/(D − A))/2

Exercises
1. Use the fundamental properties of the inner product to prove:
(a) The Cauchy-Schwarz inequality: |<x|y>| ≤ ‖x‖ ‖y‖. Suggestion: Compute ‖x − αy‖² for α = <x|y>/‖y‖².
(b) The triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖.
(c) In which case do we have equality or strict Cauchy-Schwarz inequality? Relate your answer to the definition of the angle between two vectors.

2. Use the definition of inner product in R^n to prove the parallelogram law: ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖².

3. A matrix is idempotent, or a non-orthogonal projector, iff P² = P. Prove that:
(a) R = (I − P) is idempotent.
(b) R^n = C(P) + C(R).
(c) All eigenvalues of P are 0 or +1. Suggestion: Show that if λ is a root of the characteristic polynomial of P, φ_P(λ) ≡ det(P − λI), then (1 − λ) is a root of φ_R(λ).

4. Prove that, for every P idempotent and symmetric, P = P_{C(P)}. Suggestion: Show that P′(I − P) = 0.

5. Prove that the projection operator into a given vector subspace V, P_V, is unique and symmetric.

6. Prove Pythagoras' theorem: ∀b ∈ R^m, u ∈ V, we have ‖b − u‖² = ‖b − P_V b‖² + ‖P_V b − u‖².

7. Assume we have the QR factorization of a matrix A. Consider a new matrix, Ã, obtained from A by the substitution of a single column. How could we update our orthogonal factorization using only 3n rotations? Suggestion: (a) Remove the altered column of A and update the factorization using at most n rotations. (b) Rotate the new column by the current orthogonal factor, ã = Q′a = R^{−t}A′a. (c) Add ã as the last column of Ã, and update the factorization using 2n rotations.

8. Compute the LDL′ and Cholesky factorizations of matrix .
9. Prove that:
(a) (AB)′ = B′A′.
(b) (AB)^{−1} = B^{−1}A^{−1}.
(c) A^{−t} ≡ (A^{−1})′ = (A′)^{−1}.

10. Describe four algorithms to compute L^{−1}x and L^{−t}x, accessing the unit diagonal and lower triangular matrix L row by row or column by column.
As indicated in chapter 4, we present in this appendix some aspects related to sparse factorizations. This material has strong connections with the issues discussed in chapter 4, but is more mathematical in nature, and can be omitted by the reader interested mostly in the purely epistemological aspects of decoupling.

Computing the Cholesky factorization of an n × n matrix involves on the order of n³ arithmetical operations. Large models may have thousands of variables, so it seems that decoupling large models requires a lot of work. Nevertheless, in practice, matrices appearing in large models are typically sparse and structured. A matrix is called sparse if it has many zero elements; otherwise it is called dense. A sparse matrix is called structured if its non-zero elements (NZEs) are arranged in a "nice" pattern. As we will see in the next sections, we may be able to obtain a Cholesky factor, L, of a (permuted) sparse and structured matrix V that "preserves" some of its sparsity and structure, hence decreasing the computational work.
In the discussion of sparsity and structure, the language of graph theory is very helpful. This section gives a quick review of some basic concepts on directed and undirected graphs, and also defines the process of vertex elimination.

A Directed Graph, or DG, G = (V, A), has a set of vertices or nodes, V, indexed by natural numbers, and a set of directed arcs, A, where each arc joins two vertices. We say that arc (i, j) ∈ A goes from node i to node j. When drawing a graphical representation of a DG, it is usual to represent vertices by dots, and arcs by arrows between the dots. In a DG, we say that i is a parent of j, i ∈ pa(j), or that j is a child of i, j ∈ ch(i), if there is an arc going from i to j. The children of i, the children of its children, and so on, are the descendants of i. If j is a descendant of i, we say that there is a path in G going from i to j. A cycle is a path from a given vertex to itself. An arc from a vertex to itself, (j, j), is called a loop. In some situations we spare the effort of multiple definitions of essentially the same objects by referring to the same graph with or without all possible loops.

There is yet another representation for a DG, G, given by (V, B), where the adjacency matrix, B, is the Boolean matrix with B(i, j) = 1 if arc (i, j) ∈ A, and 0 otherwise. The key element relating the topics presented in this and the previous section is the Boolean matrix B indicating the non-zero elements of the numerical matrix A, B_i^j = I(A_i^j ≠ 0). In this way, the graph G = (V, B) is used to represent the sparsity pattern of a numerical matrix A.

A Directed Acyclic Graph, DAG, has no cycles. A separator S ⊂ V separates i from j if any path from i to j goes through a vertex in S. A vertex j is a spouse of vertex i, j ∈ f(i), if they have a child in common. A tree is a DAG where each vertex has exactly one parent, except for the root vertex, that has no parent. The leaves of a tree are the vertices with no children. A graph composed of several trees is a forest.

An Undirected Graph, or UG, is a DG where, if arc (i, j) is in the graph, so is its opposite, (j, i). An UG can also be represented as G = (V, E), where each undirected edge, {i, j} ∈ E, stands for the pair of opposite directed arcs, (i, j) and (j, i). Obviously, the adjacency matrix of a UG is a symmetric matrix, and vice-versa.

The moral graph of G, M(G), is the undirected graph with the same nodes as G, and edges joining nodes i and j if they are immediate relatives in G. The immediate relatives of a node in G include its parents, children and spouses (but not brothers or sisters). The set of immediate relatives of i is also called the Markov blanket of i, m(i); hence, j ∈ m(i) if j is a neighbor of i in the moral graph. [Figure 2, representing a DAG, its moral graph, and the Markov blanket of one of its vertices, is omitted.]

Sometimes it is important to consider an order on the vertex set, established by an "index vector" q, in (a subset of) V = {1, 2, ..., N}. For example, we can consider the natural order q = [1, 2, ..., N], or the order given by a permutation, q = [q(1), q(2), ..., q(N)]. In order not to make language and notation too heavy, we may refer to the vertex "set" q, meaning the set of elements in vector q. Also, given two index vectors, a = [a(1), ..., a(A)] and b = [b(1), ..., b(B)], the index vector c = a ∪ b has all the indices in a or b.
Similarly, c = a \ b has all the indices in a that are not in b. These are essentially set operations but, since an index vector also establishes an order of its elements, c = [c(1), ..., c(C)], this order, if not otherwise indicated, has somehow to be chosen.

We define the elimination process in the UG, G = (V, E), V = {1, ..., N}, given an elimination order q = [q(1), ..., q(N)], as the sequence of elimination graphs G_k = (V_k, E_k) where, for k = 1 ... n, V_k = {q(k), q(k+1), ..., q(n)}, E_1 = E and, for k > 1,

{i, j} ∈ E_k ⇔ {i, j} ∈ E_{k−1}, or {q(k−1), i} ∈ E_{k−1} and {q(k−1), j} ∈ E_{k−1};

that is, when eliminating vertex q(k), we make its neighbors a clique, adding all missing edges between them.

The filled graph is the graph (V, F), where F = ∪_{k=1}^n E_k. The original edges and the filled edges in F are, respectively, the edges in E and in F \ E. [Figure 3, showing a graph with 6 vertices, the elimination graphs, and the filled graph for the elimination order q = [1, 2, 3, 4, 5, 6], is omitted.]

Simplified elimination: In the simplified version of the elimination graphs, G*_k, when eliminating vertex q(k), we add only the clique edges incident to its neighbor, q(l), that is next in the elimination order. [Figure 4, showing the simplified elimination graphs and the filled graph corresponding to the elimination process of Figure 3, is omitted; the vertex being eliminated is shown in boldface, and its next neighbor in the elimination order in italics.]
F.3.2 Sparse Cholesky Factorization
Let us begin with some matrix notation. Given a matrix A, and index vectors p and q , the equivalent notations A ( p, q ) or A qp indicate the (sub) matrix of rows and columnsextracted from A according to the indices in p and q . In particular, if p and q have singleindices, i and j , A ( i, j ) or A ji indicate the element of A in row i and column j . The nextexample shows a more general case: p = , q = (cid:20) (cid:21) , A =
Let us begin with some matrix notation. Given a matrix A and index vectors p and q, the equivalent notations A(p, q) or A_p^q indicate the (sub)matrix of rows and columns extracted from A according to the indices in p and q. In particular, if p and q have single indices, i and j, A(i, j) or A_i^j indicates the element of A in row i and column j. The next example shows a more general case:

p = [2 3 1],  q = [3 2],  A = [11 12 13; 21 22 23; 31 32 33],  A_p^q = [23 22; 33 32; 13 12].

If q = [q(1), ..., q(N)] is a permutation of [1, ..., N], and I is the identity matrix, Q = I_q and Q′ = I^q are the corresponding row and column permutation matrices. Moreover, if A is an N × N matrix, A_q = QA and A^q = AQ′. The symmetric permutation of A in order q is A(q, q) = QAQ′.

Let us consider the covariance structure model of section 3. If we write the variables of the model in a permuted order, q, the new covariance matrix is V(q, q). The statistical model is of course the same, but the Cholesky factors of the two matrices may have quite different sparsity structures.

Figure 5 shows the positions filled in the Cholesky factorization of a matrix A, and in the Cholesky factorization of two symmetric permutations of the same matrix, A(q, q). Initial Non-Zero Elements, NZEs, are represented by x, initial zeros filled during the factorization are represented by 0, and initial zeros left unfilled are represented by blank spaces. [Figure 5: Filled Positions in Cholesky Factorization — diagrams omitted.]

The next lemma connects the numerical elimination process in the Cholesky factorization of a symmetric matrix A to the vertex elimination process in the UG having as adjacency matrix, B, the sparsity pattern of A.

Lemma: When we eliminate the j-th column in the Cholesky factorization of matrix A(q, q) = LL′, we fill the positions in L corresponding to the edges filled in F at the elimination of vertex q(j).

Given a matrix A, G = (V, E), an elimination order q, and the respective filled graph, let us consider the set of row indices of NZEs in L^j, the j-th column of the Cholesky factor, L | QAQ′ = LL′:

nze(L^j) = { i | i > j ∧ {q(i), q(j)} ∈ F } ∪ {j}.

We define the elimination tree, H, by

h(j) = j, if nze(L^j) = {j}, or  h(j) = min{ i > j | i ∈ nze(L^j) }, otherwise,

where h(j), the parent of j in H, is the first (non-diagonal) NZE in column j of L. [Figure 6: Elimination Trees — the trees corresponding to the examples in Figure 5; diagrams omitted.]

Elimination Tree Theorem:
For any row index i below the diagonal in column j of L, j is a descendant of i in the elimination tree; that is, for any i > j | i ∈ nze(L^j), there is a path in H going from i to j.

Proof: If i = h(j), the result is trivial. Otherwise (see Figure 7), let k = h(j). But L_i^j ≠ 0 ∧ L_k^j ≠ 0 ⇒ L_i^k ≠ 0, because {q(j), q(i)}, {q(j), q(k)} ∈ E_j ⇒ {q(k), q(i)} ∈ E_{j+1}. Now, either i = h(k) or, applying the argument recursively, we trace a branch of H, (i, l, ..., k, j), i > l > ... > k > j. Q.E.D. [Figure 7: A Branch in the Elimination Tree — diagram omitted.]

From the proof of the last theorem we see that the elimination tree portrays the dependencies among the columns in the numeric factorization process. More exactly, we can eliminate column j of A, i.e. compute all the multipliers in column j, M^j, and update all the elements affected by these multipliers, if and only if we have already eliminated all the descendants of j in the elimination tree.

If we are able to perform parallel computations, we can simultaneously eliminate all the columns at a given level of the elimination tree, beginning with the leaves and finishing at the root. Example 4 considers the elimination of a matrix with the same sparsity pattern as the last permutation in example 1; its elimination tree is the last one presented in Figure 6. This elimination tree has three levels, with three, two and one vertices respectively, from the leaves to the root. Hence, we can perform a Cholesky factorization with this sparsity pattern in only 2 steps. [Numerical example omitted.]
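The elimination tree itself can be read directly off the factor; a sketch using MATLAB's built-in chol, with a toy positive definite matrix (all names illustrative):

% Parent function h of the elimination tree, from the Cholesky factor of V(q,q).
V = diag(4*ones(5, 1));  V(1, 5) = 1;  V(5, 1) = 1;  V(2, 4) = 1;  V(4, 2) = 1;
q = 1:5;                               % elimination order (here, the identity)
L = chol(V(q, q), 'lower');
n = size(L, 1);  h = (1:n)';           % h(j) = j marks a root
for j = 1:n-1
   below = find(L(j+1:n, j)) + j;      % NZEs below the diagonal in column j
   if ~isempty(below), h(j) = below(1); end   % the first one is the parent of j
end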
The sparse matrix literature has many heuristics designed for finding good elimination orders. The example in Figures 8 and 9 shows a good elimination order for a 13 × 13 sparse matrix. [Figure 8: Gibbs Heuristic's Elimination Order — sparsity diagram omitted.]

The elimination order in Figure 8 was found using the Gibbs heuristic, described in Stern (1994, ch.6) or Pissanetzky (1984, ch.x). The intuitive idea of the Gibbs heuristic (see Figure 9) is as follows: 1- Start from a "peripheral" vertex, in our example, vertex 3; 2- Grow a breadth-first tree T in G. Notice that the vertices at a given level, l, of T form a separator, S_l, in the graph G. 3- Choose a separator, S_l, that is "small", i.e. with few vertices, and "central", i.e. dividing G in "balanced" components. 4- Place in q first the indices of each component separated by S_l and, at last, the vertices in S_l. 5- Proceed recursively, separating each large component into smaller ones. In our example, we first use a two-vertex separator, dividing G in three components (two with five vertices and one with a single vertex); next, we use one-vertex separators inside each of the two larger components.

The main goal of the techniques studied in this and the last section is to find an elimination order filling as few positions as possible in the Cholesky factor. Once the elimination order has been chosen, simplified elimination can be used to prepare in advance all the data structures holding the sparse matrices, hence separating the symbolic (combinatorial) and numerical steps of the factorization. This separation is important in the production of high performance computer programs.
[Figure 9: Nested Dissection by Gibbs Heuristic — the breadth-first tree T grown from vertex 3, with separator levels l = 1, ..., 9; diagram omitted.]

F.4 Bayesian Networks
The objective of this section is to show that the sparsity techniques described in the last two sections can be applied, almost immediately, to another important statistical model, namely, Bayesian networks. The presentation in this section follows very closely Cozman (2000). A Bayesian network is represented by a DAG. Each node, i, represents a random variable, x_i. Using the notation established in section 9, we write i ∈ n, where n is the index vector n = [1, 2, ..., N]. The DAG representing the Bayesian network has an arc from node i to node j if the probability distribution of variable x_j is directly dependent on variable x_i. In many statistical models, such an arc is interpreted as a direct influence or causal effect of x_i on x_j. Technically, we assume that the joint distribution of the vector x is given in the following product form:

p(x) = Π_{j ∈ n} p(x_j | x_pa(j)).

The important property of Markov blankets in a Bayesian network is that, given the variables in its Markov blanket, a variable x_i is conditionally independent of any other variable, x_j, in the network; that is, the Markov blanket of a variable "decouples" this variable from the rest of the network,

p(x_i | x_m(i), x_j) = p(x_i | x_m(i)).

Inference in Bayesian networks is based on queries, where the distribution of some "query" variables, x_q, q = [q(1), ..., q(Q)], is computed given the observed values of some "evidence" variables, x_e, e = [e(1), ..., e(E)]. Such queries are performed eliminating, that is, marginalizing, integrating or summing out, all the remaining variables, x_s, that is,

p(x_q | x_e) = Σ_{x_s} p(x) = Σ_{x_s} Π_{j ∈ r} p(x_j | x_pa(j)).

We place the indices of the variables to be eliminated in the elimination index vector, s = r \ (q ∪ e). For now, let us consider the "requisite" index vector, r, as being just a permutation (reordering) of the original indices in the network, that is, r = [r(1), ..., r(R)], R = N. The "elimination order" or "elimination sequence", s = [s(1), ..., s(S)], will play an important role in what follows.

Let us mention two technical points: First, not all variables of the original network may be needed for a given query. If so, the indices of the unnecessary ones can be removed from the requisite index vector, and the query is performed involving only a proper subset of the original variables, hence R < N. For example, if the network has disconnected components, all the vertices in components having no query variables are unnecessary. Second, the normalization constants of distributions that appear in intermediate computations are costly to obtain and, more important, not needed. Hence, we can perform these intermediate computations with un-normalized distributions, also called "potentials".

Making explicit use of the elimination order, s = [s(1), ..., s(S)], we can write the last equation as

p(x_q | x_e) = Σ_{x_s(S)} ··· Σ_{x_s(1)} p(x_r(1) | x_pa(r(1))) × ... × p(x_r(R) | x_pa(r(R))).

Because x_s(1) can only appear in densities p(x_j | x_pa(j)) for j = s(1) or j ∈ ch(s(1)), we can separate the first summation, writing

p(x_q | x_e) = Σ_{x_s(S)} ··· Σ_{x_s(2)} ( Π_{j ∈ r \ (ch(s(1)) ∪ s(1))} p(x_j | x_pa(j)) ) × ( Σ_{x_s(1)} Π_{j ∈ ch(s(1)) ∪ s(1)} p(x_j | x_pa(j)) ).

Eliminating, i.e.
integrating out, the first variable in the elimination order, x_s(1), we create a new (joint) potential of the children of the eliminated variable, given its parents and spouses, that is,

p(x_ch(s(1)) | x_pa(s(1)), x_f(s(1))) = Σ_{x_s(1)} Π_{j ∈ ch(s(1)) ∪ s(1)} p(x_j | x_pa(j)).

Next we eliminate x_s(2); that is, we collect all potentials containing x_s(2), form their joint product, and marginalize on x_s(2). We proceed in the elimination order, eliminating x_s(3), x_s(4), ..., x_s(S), at which point the normalized potentials left give us the distribution p(x_q | x_e).
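A tiny worked sketch of this elimination process for a three-node chain x1 → x2 → x3 with binary variables (all probability tables invented):

% Variable elimination on the chain x1 -> x2 -> x3, querying p(x3).
p1  = [0.6 0.4];                 % p(x1)
p21 = [0.7 0.3; 0.2 0.8];        % p(x2|x1): row i is the distribution given x1 = i
p32 = [0.9 0.1; 0.5 0.5];        % p(x3|x2)
pot = p1 * p21;                  % eliminate x1: sum_x1 p(x1) p(x2|x1) = [0.5 0.5]
p3  = pot * p32                  % eliminate x2: sum_x2 pot(x2) p(x3|x2) = [0.7 0.3]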
We refer to the variables appearing in a joint potential as that potential's cluster. Forming a joint potential is a computation of complexity exponential in the size of its cluster. Hence, it is vital to choose an elimination order that keeps the cluster sizes as small as possible. But the clusters formed in the elimination process of a BN correspond to the cliques appearing in the elimination graphs, as defined in the last two sections. Hence, all techniques and heuristics used for finding a good elimination order for Cholesky factorization can be used to obtain a good elimination order for querying a BN. Also, all the abstract combinatorial structures appearing in sparse Cholesky factorization, like elimination trees, have their analogues for computation in BNs. Cozman (2000) develops the complete theory of BNs in a very simple and intuitive way, a way that naturally highlights this analogy. Other authors have already commented on the similarities between several graph decomposition algorithms; see, for example, Lauritzen (2006, Lecture 4, Probability propagation and related algorithms) for a very general and abstract, but highly mathematical, overview.

Appendix G

Monte Carlo Miscellanea
Monte Carlo or, if necessary, Markov Chain Monte Carlo, is the basic tool we use for numerical integration. There are several excellent books on the subject. Hammersley and Handscomb (1964) is a short and intuitive introduction, including some important topics not usually covered at this level, like pseudo-random and quasi-random generators, importance sampling and other variance reduction techniques, and the solution of linear systems. This book is now out of print, but has the advantage of being freely available for download on the internet. Ripley (1987) is another excellent text covering this material that is still in print. Gilks et al. (1996) gives several excellent and up-to-date review papers on areas that are of interest for statistical modeling. There is a vast literature on MC and MCMC written by physicists. It contains many original, interesting and useful ideas, but sometimes it employs a terminology that is unfamiliar to statisticians. The article of Meng and Wong (1996) can help to overcome this gap.
G.1 Pseudo, Quasi and Subjective Randomness
The implementation of Monte Carlo methods, as described in the following sections, requires a random number generator of i.i.d. (independent and identically distributed) random variables uniformly distributed in the unit interval, [0, 1]. From these one builds uniform generators on the d-dimensional unit box, [0, 1]^d, and, from there, non-linear generators for many other multivariate distributions.

Random and Pseudo-Random Generators
The concept of randomness is usually applied to a variable (to be) generated, or a process (to be) observed, involving some uncertainty, as in the definition presented by Hammersley and Handscomb (1964, p.10):
Monte Carlo, and several other applications, require a random number generator.With the last definition in mind, engineering devices based on sophisticated physicalprocesses have been built in the hope of offering a source of “true” random numbers.However, these special devices were cumbersome, expensive, not portable nor universallyavailable, and often unreliable. Moreover, practitioners soon realized that simple deter-ministic sequences could successfully be used to emulate a random generator, as statedin the following quotes (our emphasis) by Hammersley and Handscomb (1964, p.26) andRipley (1987, p.15): “For electronic digital computer it is most convenient to calculate a se-quence of numbers one at a time as required, by a completely specified rulewhich is, however, so devised that no reasonable statistical test will detectany significant departure from randomness. Such a sequence is called pseudo-random . The great advantage of a specified rule is that the sequence can beexactly reproduced for purposes of computational checking.”“A sequence of pseudorandom numbers ( U i ) is a deterministic sequence ofnumbers in [0 , having the same relevant statistical properties as a sequenceof random numbers.” Many deterministic random emulators used today are Linear Congruential Pseudo-Random Generators (LCPRG), as in the following example: x i +1 = ( ax i + c ) mod m , where the multiplier a , the increment c and the modulus m should obey the conditions:(i) c and m are relatively prime; (ii) a − m ; (iii) a − m is a multiple of 4. LCPRG’s are fast and easy to implement if m is taken as the computer’s word range, 2 s , where s is the computer’s word size, typically s = 32 or s = 64. The LCPRG’s starting point, x , is called the seed. Given the sameseed the LCPG will reproduce the same sequence, what is very important for tracing,debugging and verifying application programs.However, LCPRG’s are not an universal solution. For example, it is trivial to devisesome statistics that will be far from random, see Marsaglia (1968). There the impor-tance of the words reasonable and relevant in the last quotations becomes clear: Formost Monte Carlo applications these statistics are irrelevant. LCPRG’s can also exhibitvery long range auto-correlations and, unfortunately, these are more likely to affect longsimulated time series required in some special applications. The composition of several .1. PSEUDO, QUASI AND SUBJECTIVE RANDOMNESS Chance is Lumpy - Quasi-Random Generators “Chance is Lumpy” is Robert Abelson’s First Law of Statistics, see Abelson (1995, p.xv).The probabilistic expectation is a linear operator, that is, E ( Ax + b ) = AE ( x ) + b , where x in random vector and A and b are a determined matrix and vector. The Covarianceoperator is defined as Cov( x ) = E (( x − E ( x )) ⊗ ( x − E ( x ))). Hence, Cov( Ax + b ) = A Cov( x ) A (cid:48) . Therefore, given n i.i.d. scalar variables, x i | Var( x i ) = σ , the variance oftheir mean, m = (1 /n ) (cid:48) x , is given by1 n (cid:48) diag( σ ) 1 n = (cid:2) n n . . . n (cid:3) σ . . . σ . . . . . . σ n n ... n = σ /n . Hence, the mean’s standard deviation is std( m ) = σ/ (cid:112) ( n ). So, mean values of iid randomvariables converge to their expected values at a rate of 1 / (cid:112) ( n ).Quasi-random sequences are deterministic sequences built not to emulate random se-quences, as pseudo-random sequences do, but to achieve faster convergence rates. 
Chance is Lumpy - Quasi-Random Generators

“Chance is Lumpy” is Robert Abelson's First Law of Statistics, see Abelson (1995, p.xv). The probabilistic expectation is a linear operator, that is, $E(Ax + b) = A E(x) + b$, where $x$ is a random vector and $A$ and $b$ are a determined matrix and vector. The covariance operator is defined as $\mathrm{Cov}(x) = E((x - E(x)) \otimes (x - E(x)))$. Hence, $\mathrm{Cov}(Ax + b) = A\, \mathrm{Cov}(x)\, A'$. Therefore, given $n$ i.i.d. scalar variables, $x_i \mid \mathrm{Var}(x_i) = \sigma^2$, the variance of their mean, $m = (1/n) \mathbf{1}' x$, is given by
$$ \frac{1}{n}\mathbf{1}' \, \mathrm{diag}(\sigma^2) \, \frac{1}{n}\mathbf{1} \;=\; \sigma^2 / n . $$
Hence, the mean's standard deviation is $\mathrm{std}(m) = \sigma/\sqrt{n}$. So, mean values of iid random variables converge to their expected values at a rate of $1/\sqrt{n}$.

Quasi-random sequences are deterministic sequences built not to emulate random sequences, as pseudo-random sequences do, but to achieve faster convergence rates. For $d$-dimensional quasi-random sequences, an appropriate measure of fluctuation, called discrepancy, only grows at a rate of $\log(n)^d$, hence growing much slower than $\sqrt{n}$. Therefore, the convergence rate corresponding to quasi-random sequences, $\log(n)^d / n$, is much faster than the one corresponding to (pseudo) random sequences, $\sqrt{n}/n = 1/\sqrt{n}$. Figure G.1 allows the visual comparison of typical (pseudo) random (left) and quasi-random (right) sequences in $[0,1]^2$. By visual inspection we see that the points of the quasi-random sequence are more “homogeneously scattered”, that is, they do not “clump together”, as the points of the (pseudo) random sequence often do.

Let us consider axis-parallel rectangles in the unit box,
$$ R = [a_1, b_1[ \times [a_2, b_2[ \times \ldots [a_d, b_d[ \; \subseteq \; [0,1[^d . $$
The discrepancy of the sequence $s^n$ in box $R$, and the overall discrepancy of the sequence, are defined as
$$ D(s^n, R) = n\, \mathrm{Vol}(R) - | s^n \cap R | , \quad D(s^n) = \sup_{R \subseteq [0,1[^d} | D(s^n, R) | . $$
It is possible to prove that the discrepancy of the Halton-Hammersley sequence, defined next, is of order $O(\log(n)^{d-1})$, see Matoušek (1991, ch.2).
Halton-Hammersley sets: Given $d-1$ prime numbers $p(1), p(2), \ldots p(d-1)$, the $i$-th point, $x_i$, in the Halton-Hammersley set, $\{ x_1, x_2, \ldots x_n \}$, is
$$ x_i = \big[\, i/n , \; r_{p(1)}(i) , \; r_{p(2)}(i) , \ldots r_{p(d-1)}(i) \,\big]' , \ \text{for} \ i = 1 : n , \ \text{where} $$
$$ i = a_0 + p(k)\, a_1 + p(k)^2 a_2 + p(k)^3 a_3 + \ldots , \quad r_{p(k)}(i) = a_0\, p(k)^{-1} + a_1\, p(k)^{-2} + a_2\, p(k)^{-3} + \ldots $$
That is, the $(k+1)$-th coordinate of $x_i$, $x_i^{k+1} = r_{p(k)}(i)$, is obtained by the digit reversal of $i$ written in $p(k)$-ary or base $p(k)$ notation.

The Halton-Hammersley set is a generalization of the van der Corput set, built in the bidimensional unit square, $d = 2$, using the first prime number, $p = 2$. The following example, from Hammersley (1964, p.33) and Günther and Jüngel (2003, p.117), builds the 8-point van der Corput set, expressed in binary and decimal notation.

function x = corput(n,b)
% n-point base-b van der Corput set
m = floor(log(n)/log(b));
u = 1:n; D = [];
for i = 0:m
   d = rem(u,b);      % i-th base-b digit of each point index
   u = (u-d)/b;
   D = [D; d];
end
x = ((1/b).^(1:(m+1)))*D;   % digit reversal across the radix point

Decimal         Binary
i   r(i)        i      r(i)
1   0.5         1      0.1
2   0.25        10     0.01
3   0.75        11     0.11
4   0.125       100    0.001
5   0.625       101    0.101
6   0.375       110    0.011
7   0.875       111    0.111
8   0.0625      1000   0.0001

Figure G.1: (Pseudo)-random and quasi-random point sets on the unit box.

Quasi-random sequences, also known as low-discrepancy sequences, can substitute pseudo-random sequences in some applications of Monte Carlo methods, achieving higher accuracy. There are, however, important limitations:

“First, quasi-Monte Carlo methods are valid for integration problems, but may not be directly applicable to simulations, due to the correlations between the points of a quasi-random sequence. ... A second limitation: the improved accuracy of quasi-Monte Carlo methods is generally lost for problems of high dimension or problems in which the integrand is not smooth.”
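To make the construction concrete, a minimal sketch (ours, not from the original text) reuses corput above to assemble a bidimensional Hammersley set and plot it side by side with a pseudo-random sample, as in Figure G.1:

n = 256;
x1 = (1:n)/n;                  % first coordinate: i/n
x2 = corput(n,2);              % second coordinate: base-2 digit reversal
subplot(1,2,1); plot(rand(1,n), rand(1,n), '.');   % pseudo-random points
subplot(1,2,2); plot(x1, x2, '.');                 % quasi-random points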
Subjective Randomness and its Paradoxes
When asked to look at patterns like those in Figure G.1, many subjects perceive the quasi-random set as “more random” than the (pseudo) random set. How can this paradox be explained? This was the topic of many psychological studies in the field of subjective randomness. The quotation in the next paragraph is from one of these studies, namely, Falk and Konold (1997, p.306, emphasis is ours):

“One major source of confusion is the fact that randomness involves two distinct ideas: process and pattern (Zabell, 1992). It is natural to think of randomness as a process that generates unpredictable outcomes (stochastic process according to Gell-Mann, 1994). Randomness of a process refers to the unpredictability of the individual event in the series (Lopes, 1982). This is what Spencer Brown (1957) calls primary randomness. However, one usually determines the randomness of the process by means of its output, which is supposed to be patternless. This kind of randomness refers, by definition, to a sequence. It is labeled secondary randomness by Spencer Brown. It requires that all symbol types, as well as all ordered pairs (digrams), ordered triplets (trigrams)... n-grams in the sequence be equiprobable. This definition could be valid for any n only in infinite sequences, and it may be approximated in finite sequences only up to ns much smaller than the sequence's length. The entropy measure of randomness (Attneave, 1959, chaps. 1 and 2) is based on this definition.

These two aspects of randomness are closely related. We ordinarily expect outcomes generated by a random process to be patternless. Most of them are. Conversely, a sequence whose order is random supports the hypothesis that it was generated by a random mechanism, whereas sequences whose order is not random cast doubt on the random nature of the generating process.”
Spencer-Brown was intrigued by the apparent incompatibility of the notions of primary and secondary randomness. The apparent collision of these two notions generates several interesting paradoxes, leading Spencer-Brown to question the applicability of the concept of randomness to probability and statistical analysis, see Spencer-Brown (1953, 1957) and Flew (1959), Good (1958) and Mundle (1959). See also Henning (2006), Kaptchuk and Kerr (2004), Utts (1991), and Wassermann (1955). In fact, several subsequent psychological studies were able to confirm that, for many subjects, the intuitive or common-sense perceptions of primary and secondary randomness are quite discrepant. However, a careful mathematical analysis makes it possible to reconcile the two notions of randomness. These are the topics discussed in this section.

The relation between the joint and conditional entropy for a pair of random variables, see appendix E.2,
$$ H(i,j) = H(j) + H(i \mid j) = H(i) + H(j \mid i) , $$
motivates the definition of first, second and higher order entropies, defined over the distribution of words of size $m$ in a string of letters from an alphabet of size $a$:
$$ H_1 = -\sum_j p(j) \log p(j) , \quad H_2 = -\sum_{i,j} p(i)\, p(j \mid i) \log p(j \mid i) , $$
$$ H_3 = -\sum_{i,j,k} p(i)\, p(j \mid i)\, p(k \mid i,j) \log p(k \mid i,j) , \ \ldots $$
It is possible to use these entropy measures to assess the disorder or lack of pattern in a given finite sequence, using the empirical probability distributions of single letters, pairs, triplets, etc. However, in order to have a significant empirical distribution of $m$-plets, any possible $m$-plet must be well represented in the sequence, that is, the word size, $m$, is required to be very short relative to the sequence log-size, $m \ll \log_a(n)$.

In the article of Falk and Konold (1997), Figure 2 displays the typical perceived or apparent randomness of Boolean (0-1) bit sequences or black-and-white pixel grids versus the second order entropy of the same strings and grids, see also Attneave (1959). Clearly, there is a remarkable bias of the apparent randomness relative to the entropic measure.

“When people invent superfluous explanations because they perceive patterns in random phenomena, they commit what is known in statistical parlance as Type I error. The other way of going awry, known as Type II error, occurs when one dismisses stimuli showing some regularity as random. The numerous randomization studies in which participants generated too many alternations and viewed this output as random, as well as the judgments of overalternating sets as maximally random in the perception studies, were all instances of Type II error in research results.” Falk and Konold (1997, p.303).

[Figure: $H_2$-entropy vs. AR, apparent randomness.]

This effect is also known as the gambler's fallacy when betting on cool spots, expecting the random sequence to “compensate” finite average fluctuations from expected values. Of course, some gamblers exhibit the opposite behavior, preferring to bet on hot spots, expecting the same fluctuations to reoccur. These effects are the consequence of a perceived coupling, by a negative or positive correlation or other measure of association, between non-overlapping segments that are in fact supposed to be decoupled, uncorrelated or have no association, that is, to be Markovian. For a statistical analysis, see Bonassi et al. (2008).
A possible psychological explanation of the gambler's fallacy is given by the constructivist theory of Jean Piaget, see Piaget and Inhelder (1951), in which any “lump” in the sequence is (mis)perceived as non-random order:

“In analogy to Piaget's operations, which are conceived as internalized actions, perceived randomness might emerge from hypothetical action, that is, from a thought experiment in which one describes, predicts, or abbreviates the sequence. The harder the task in such a thought experiment, the more random the sequence is judged to be.” Falk and Konold (1997, p.316).

The same hierarchical decomposition scheme used for higher order conditional entropy measures can be adapted to measure the disorder or lack of pattern of a sequence, relative to a given subject's model of “computer” or generation mechanism. In the case of a discrete string, this generation model could be, for example, a deterministic or probabilistic Turing machine, a fixed or variable length Markov chain, etc. It is assumed that the model is regulated by a code, program or vector parameter, $\theta$, and outputs a data vector or observed string, $x$. The hierarchical complexity measure of such a model emulates the Bayesian prior and conditional likelihood decomposition, $H(p(\theta,x)) = H(p(\theta)) + H(p(x \mid \theta))$, that is, the total complexity is given by the complexity of the program plus the complexity of the output given the program. This is the starting point for several complexity models, like Andrey Kolmogorov, Ray Solomonoff and Gregory Chaitin's computational complexity models, Jorma Rissanen's Minimum Description Length (MDL), and Chris Wallace and David Boulton's Minimum Message Length (MML). All these alternative complexity models can also be used to successfully reconcile the notions of primary and secondary randomness, showing that they are asymptotically equivalent, see Chaitin (1975, 1988), Kac (1983), Kolmogorov (1965), Martin-Löf (1966, 1969).
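As a concrete illustration (ours, not from the original text), the following minimal sketch computes the empirical first and second order entropies defined above for a binary string; here $H_2$ is obtained as the conditional entropy $H(j \mid i) = H(i,j) - H(i)$:

s = double(rand(1,1000) > 0.5);          % example binary string
p1 = histc(s, [0 1]) / numel(s);         % single-letter frequencies
H1 = -sum(p1(p1>0).*log2(p1(p1>0)));     % first order entropy
w = s(1:end-1)*2 + s(2:end);             % encode ordered pairs as 0..3
p2 = histc(w, 0:3) / numel(w);           % pair (digram) frequencies
Hij = -sum(p2(p2>0).*log2(p2(p2>0)));    % joint entropy H(i,j)
H2 = Hij - H1;                           % second order entropy H(j|i)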
G.2 Integration and Variance Reduction
This section presents the derivation of generic Monte Carlo procedures for numerical integration. We follow the presentation of Hammersley (1964). Let us consider the integration of a bounded function, $0 \le f(x) \le 1$, for $x \in [0,1]$. The crude Monte Carlo unbiased estimate of this integral is the mean value of the function evaluated at uniformly distributed iid random points, $x_i \in [0,1]$, $i = 1:n$, with variance $\sigma_c^2$:
$$ \gamma = \int_0^1 f(x)\, dx , \quad \hat\gamma_c = \frac{1}{n} \sum_{i=1}^n f(x_i) , \quad \sigma_c^2 = \frac{1}{n} \int_0^1 ( f(x) - \gamma )^2 dx . $$
An alternative unbiased estimator is the hit-or-miss Monte Carlo, defined by the auxiliary hit indicator function, $h$,
$$ h(x,y) = I( f(x) \ge y ) , \quad \gamma = \int_0^1 \!\! \int_0^1 h(x,y)\, dx\, dy , \quad \hat\gamma_h = \frac{1}{n} \sum_{i=1}^n h(x_i, y_i) = \frac{n^*}{n} . $$
The variance of this method is that of a Bernoulli variate. Simple manipulation shows that
$$ \sigma_h^2 = \frac{\gamma (1-\gamma)}{n} , \quad \sigma_h^2 - \sigma_c^2 = \frac{1}{n} \int_0^1 f(x) (1 - f(x))\, dx > 0 . $$
Hence, hit-or-miss MC is worse than crude MC, as one could guess from the fact that it is using far less information about $f$ at any given point, $x_i$.

Another alternative is importance sampling MC, defined by an auxiliary sampling distribution, $g$, in the integration interval,
$$ \gamma = \int f(x)\, dx = \int \frac{f(x)}{g(x)}\, g(x)\, dx = \int \frac{f(x)}{g(x)}\, dG(x) , $$
$$ \hat\gamma_s = \frac{1}{n} \sum_{i=1}^n \frac{f(x_i)}{g(x_i)} , \ x_i \sim g , \ i = 1:n ; \quad \sigma_s^2 = \frac{1}{n} \int \left( \frac{f(x)}{g(x)} - \gamma \right)^2 dG(x) . $$
The importance sampling method can be used on an arbitrary integration interval, as long as we know how to draw the points $x_i$ according to the sampling distribution. The variance of this method is minimized if $g \propto f$, that is, if the sampling distribution mimics the integrand. In practice, it is important to keep the ratio $f/g$ bounded, $f/g \le c$. In particular, if the integration interval is unbounded, the tails of the sampling distribution should “cover” the tails of the integrand.

The formula for $\sigma_s^2$ suggests yet another strategy of variance reduction. Let $\varphi(x)$ be a function that closely emulates or mimics $f(x)$, but is easy to integrate analytically (or even numerically). Such a $\varphi(x)$ is known as a control variate for $f(x)$. The desired integral can be computed as
$$ \gamma = \int \varphi(x)\, dx + \int ( f(x) - \varphi(x) )\, dx = \gamma' + \int ( f(x) - \varphi(x) )\, dx . $$
Consider the following estimators and variances,
$$ \hat\gamma = \frac{1}{n} \sum_{i=1}^n f(x_i) , \quad \hat\gamma' = \frac{1}{n} \sum_{i=1}^n \varphi(x_i) , \quad \mathrm{Var}( \hat\gamma - \hat\gamma' ) = \mathrm{Var}( \hat\gamma ) + \mathrm{Var}( \hat\gamma' ) - 2\, \mathrm{Cov}( \hat\gamma, \hat\gamma' ) . $$
That is, the method is useful if the integrand and the control variate are strongly (positively) correlated.
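As a concrete illustration (ours, not from the original text), the following minimal sketch compares the crude and hit-or-miss estimators on the toy integrand $f(x) = x^2$, for which $\gamma = 1/3$:

f = @(x) x.^2; n = 1e5;
x = rand(n,1); y = rand(n,1);
g_crude = mean(f(x));              % crude MC estimate of gamma
g_hm    = mean(f(x) >= y);         % hit-or-miss estimate, n*/n
v_crude = var(f(x))/n;             % estimated variance of crude MC
v_hm    = g_hm*(1-g_hm)/n;         % Bernoulli variance of hit-or-miss

Running the sketch shows v_hm > v_crude, in agreement with the variance inequality above.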
Non-Uniform Random Generators
This section considers some elementary methods for producing i.i.d. non-uniform variates, $x_i$, from a source of uniform variates in the unit interval, $u_i \sim U\,]0,1[$. Perhaps the simplest example is to produce a Bernoulli variate:

(a) If $0 \le u_i \le p$, then $x_i = 1$, else ($p < u_i \le 1$) $x_i = 0$.

If $F(x)$ is the cumulative distribution of $f(x)$, and $x_i \sim f$, then $u = F(x_i) \sim U[0,1]$. Hence, if $F(x)$ is invertible, we can just take $x_i = F^{-1}(u_i)$ as a mechanism for generating $f$-distributed variates. For example:

(b) The exponential distribution with mean $1/\lambda$ is given by $f(t) = \lambda \exp(-\lambda t)$, and $F(t) = 1 - \exp(-\lambda t)$. Hence, $t = (-1/\lambda) \ln(u)$ produces an exponential variate.
(c) The Cauchy distribution with location and scale parameters, $a, b$, is given by $1/f(x) = \pi b ( 1 + ((x-a)/b)^2 )$, $F(x) = (1/2) + (1/\pi) \arctan((x-a)/b)$. Hence, $x = a + b \tan( \pi (u - 1/2) )$ produces a Cauchy variate.

(d) The chi-square distribution with 2 degrees of freedom, $\chi^2_2$, is a particular case of the exponential with mean $(1/\lambda) = 2$. Hence, $x = -2 \ln(u)$ generates a $\chi^2_2$ variate.

(e) A $\chi^2_d$ variate is characterized as a sum of squares of $d$ standard normal variates. Hence, if $d$ is even, we can generate a $\chi^2_d$ variate as $x = -2 \ln( u_1 u_2 \ldots u_{d/2} )$.

(f) Counting consecutive $\lambda$-exponential arrivals until the threshold $t_1 + t_2 + \ldots + t_{k+1} \ge 1$, that is, returning the count $k$ of complete arrivals before the threshold, generates a $\lambda$-Poisson variate, $f(k) = \exp(-\lambda) \lambda^k / k!$.

(g) Appendix B presents characterizations of many discrete distributions by the Poisson, hence providing implicit generation mechanisms for those distributions.

(h) The following two dimensional transformation method generates two i.i.d. standard Normal variates, see Ripley (1987):
$$ u, v \sim U[0,1] , \quad \theta = 2\pi u , \ r = \sqrt{ -2 \ln(v) } , \quad x = r \cos(\theta) , \ y = r \sin(\theta) . $$
To check the method consider the transformation to polar coordinates, $[r, \theta]$, of a standard bivariate normal $[x,y] \sim (1/2\pi) \exp( (-1/2)(x^2 + y^2) )$:
$$ [r, \theta] \sim \frac{1}{2\pi} \exp\left( -\frac{r^2}{2} \right) \begin{vmatrix} \cos(\theta) & \sin(\theta) \\ -r \sin(\theta) & r \cos(\theta) \end{vmatrix} = \frac{1}{2\pi}\, r \exp\left( -\frac{r^2}{2} \right) . $$
Hence, $r$ and $\theta$ are independent, $\theta$ is uniformly distributed in $[0, 2\pi]$, and $r^2$ is a $\chi^2_2$ variate. Finally, we see that $r^2$ is produced by the transformation defined in item (d) above to generate a $\chi^2_2$ variate.

If the scaled density, $\kappa g$, can be used as an envelope dominating the density $f$, that is, $f \le \kappa g$, the following acceptance-rejection method due to von Neumann can be used: (1) Generate $[y_i, u_i] \sim g \times U[0,1]$ until $\kappa u_i \le f(y_i)/g(y_i)$. (2) Take $x_i = y_i$.

The Gamma distribution with parameter $c$ is $f(x) = x^{c-1} \exp(-x) / \Gamma(c)$. For $c = 1$ this is the exponential distribution; also, the sum of two gamma variates with parameters $c_1$, $c_2$ is a gamma variate with parameter $c_1 + c_2$. The following results given in Deák (1990, sec.4.5) provide implicit acceptance-rejection generation methods:

(i) For $c < 1$, $f(x)$ is dominated by the following density $g(x)$ scaled by the factor $\kappa = ( c \Gamma(c) )^{-1} + ( e \Gamma(c) )^{-1}$. Moreover, $G^{-1}$ has an easy analytic form:
$$ g(x) = \begin{cases} \frac{ec}{e+c}\, x^{c-1} , & \text{if } x \in [0,1[ \\ \frac{ec}{e+c}\, e^{-x} , & \text{if } x \in [1,\infty[ \end{cases} , \quad G(x) = \begin{cases} \frac{e}{e+c}\, x^{c} , & \text{if } x \in [0,1[ \\ \frac{e}{e+c} + \frac{c}{e+c} \big( 1 - e^{1-x} \big) , & \text{if } x \in [1,\infty[ \end{cases} $$

(ii) For $c > 1$, $f(x)$ is dominated by a Cauchy with parameters $a = c - 1$, $b = \sqrt{2c-1}$, scaled by the factor $\kappa = \pi \sqrt{2c-1}\; e^{-c+1} (c-1)^{c-1} / \Gamma(c)$.
(iii) For $c > 1$, $f(x)$ is also dominated by the envelope density, $g_c(x)$, and scale factor, $\kappa_c$, described as follows. First, let us consider an auxiliary variate distributed as the t-density with 2 degrees of freedom. The auxiliary density, $g(y)$, cumulative distribution, $G(y)$, and generation method by direct inversion are as follows:
$$ g(y) = \frac{1}{2\sqrt{2}} \left( 1 + \frac{y^2}{2} \right)^{-3/2} , \quad G(y) = \frac{1}{2} \left( 1 + \frac{ y/\sqrt{2} }{ \sqrt{ 1 + y^2/2 } } \right) , \quad y \sim \sqrt{2}\, \frac{ u - 1/2 }{ \sqrt{ u (1-u) } } . $$
The envelope density and scale factor are
$$ \kappa_c\, g_c(x) = \frac{1}{\Gamma(c)} \left( \frac{c-1}{e} \right)^{c-1} \left( 1 + \frac{ ( x - (c-1) )^2 }{ 3c - 3/4 } \right)^{-3/2} , \quad \kappa_c = \frac{2}{\Gamma(c)} \sqrt{ 3c - \frac{3}{4} } \left( \frac{c-1}{e} \right)^{c-1} , $$
a factor that remains bounded for all $c > 1$.
The envelope variate can be generated from the auxiliary variate as $x \sim (c-1) + y \sqrt{ 3c/2 - 3/8 }$.

(iv) It is easy to check that if $y$ is a gamma variate with parameter $c + 1$ and $u$ is uniform in $[0,1]$, then $x = y\, u^{1/c}$ is a gamma variate with parameter $c$. This property can be used to turn a gamma generator for the domain $c < 1$ into one for $c > 1$, and vice-versa.
Appendix B presents characterizations of the Beta and Dirichlet distributions by the Gamma, hence providing implicit generation mechanisms for those distributions. For more non-uniform random generation methods see Deák (1990), Gentle (1998), Lange (2000), Ripley (1987), and the encyclopedic work of Fishman (1996).
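As a concrete illustration (ours, not from the original text), the following minimal sketch implements the inversion examples (b) and (c) and the polar transformation of item (h); the parameter values are arbitrary:

n = 1e4; u = rand(n,1); v = rand(n,1);
lambda = 2;
t  = -log(u)/lambda;                  % (b) exponential with mean 1/lambda
xc = 3 + 0.5*tan(pi*(u - 1/2));       % (c) Cauchy with a = 3, b = 0.5
theta = 2*pi*u; r = sqrt(-2*log(v));  % (h) polar transformation
x = r.*cos(theta); y = r.*sin(theta); % two iid standard Normal variates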
G.3 MCMC - Markov Chain Monte Carlo
This section uses the matrix notation and the basic facts about homogeneous Markov chains reviewed in section H.1.

Markov Chain Monte Carlo, Conditional Monte Carlo, etc. are common names for methods that generate indirect random sampling for a discrete target density $g$. MCMC sampling is based on a Markov chain that has the target density as its limit distribution. Our presentation follows ch.1 of Gilks et al. (1996). For the original papers, see Geman and Geman (1984), Hastings (1970), Metropolis and Ulam (1949), and Metropolis et al. (1953).

The basic idea of the MCMC algorithms is to adapt a general (irreducible and aperiodic) sampling kernel, $Q$, to the desired target distribution, $g > 0$.
Starting from state $i$, the MCMC algorithm proceeds as follows:
(1) A candidate for the next state, $j$, is proposed with probability $Q_{ji}$.
(2a) The chain moves to the candidate $j$ with acceptance probability $\alpha(i,j)$.
(2b) Otherwise, candidate $j$ is rejected, and the chain remains at state $i$.
(3) Go to step 1.
Formally, the MCMC transition kernel, $P$, has the form
$$ P_{ji} = Q_{ji}\, \alpha(i,j) + I(j = i) \Big( 1 - \sum_j Q_{ji}\, \alpha(i,j) \Big) , $$
where the first term corresponds to the acceptance of new state $j$, while the second term corresponds to the rejection of the proposed candidate, indicating that the chain remains at state $i$.

In order to obtain the target distribution, $g$, as the limit distribution of the MCMC, we want to choose an acceptance probability, $\alpha(i,j)$, that enforces the detailed balance equation,
$$ g_i P_{ji} = g_i Q_{ji}\, \alpha(i,j) = g_j Q_{ij}\, \alpha(j,i) = g_j P_{ij} . $$
It is easy to check that the acceptance probabilities suggested by Metropolis-Hastings and Barker accomplish the goal. They are, respectively,
$$ \alpha(i,j) = \min\left( 1 , \frac{ g_j Q_{ij} }{ g_i Q_{ji} } \right) \quad \text{and} \quad \alpha(i,j) = \frac{ g_j Q_{ij} }{ g_i Q_{ji} + g_j Q_{ij} } . $$
In Bayesian statistics, MCMC methods are typically used to compute $\hat f$, the expected value of a function, $f(\theta)$, on a specific region of the parameter space, $T \subseteq \Theta$, with respect to the posterior density, $p_n(\theta)$. In standard Bayesian models, $p_n(\theta) = c(y)^{-1} L(\theta \mid y)\, p(\theta)$, where $p(\theta)$ is the prior distribution of the parameter $\theta$, $L(\theta \mid y)$ is the likelihood of $\theta$ given the observed data $y$, and $c(y)$ is the posterior normalization constant. Hence,
$$ \hat f = \frac{1}{c(y)} \int_T f(\theta)\, g(\theta \mid y)\, d\theta , \quad g(\theta \mid y) = L(\theta \mid y)\, p(\theta) , \quad c(y) = \int_\Theta g(\theta \mid y)\, d\theta . $$
Notice that $\alpha(i,j)$, the acceptance probabilities defined above, can be computed from posterior ratios $p_n(\theta_j)/p_n(\theta_i) = g_j/g_i$. Hence, actual implementations of these MCMC algorithms do not require the explicit knowledge of the target distribution normalization constant, $c(y)$. It suffices to have an un-normalized function that is proportional to the target distribution, $g(\theta) \propto p_n(\theta)$, as is the case for the likelihood-prior product.

The original Metropolis algorithm uses a symmetric sampling kernel, $Q_{ji} = Q_{ij}$, see Metropolis et al. (1953). In this case, the Metropolis-Hastings acceptance probability can be simplified to the form $\alpha(i,j) = \min(1, g_j/g_i)$. In statistical physics, the density of interest, $g_i$, often takes the form of the Boltzmann distribution, $g_i = \exp(-\beta H(i))$, where the Hamiltonian function, $H(i)$, gives the energy of the corresponding state. In this case, a new state of lower energy, $j \mid \Delta H = H(j) - H(i) \le 0$, is accepted for sure, while a state of higher energy is accepted with probability $\exp(-\beta\, \Delta H)$.
In section H.2, the same acceptance-rejection mechanism reappears in the Metropolis version of Simulated Annealing.

Random Walk Metropolis algorithms use a symmetric kernel that is a function only of the random walk step, $z = y - x$, that is, $Q(x,y) = Q(z) = Q(-z)$. A common option in practical implementations is to choose the random walk step from a multivariate Normal distribution, $z \sim N(0, \Sigma)$. The covariance matrix, $\Sigma$, scales the random walk steps. If the steps are too large, the proposed steps would often result in a sharp decrease of the target density, so the acceptance rate is low, making the MCMC inefficient. If the steps are too small, the chain moves slowly, and many steps are needed to explore the support of the target distribution. One common choice is to take $\Sigma$ proportional to the inverse Hessian of the log-target, $( -\partial^2 \log g(x) / \partial x' \partial x )^{-1}$, computed at the estimated mode, $\hat x = \arg\max g(x)$. Alternatively, one can take $\Sigma$ proportional to a convex combination of the diagonal matrix $D$, a prior estimate of marginal variances, and the current estimate of the sampled covariance matrix,
$$ \Sigma \propto (1 - \lambda) S + \lambda D , \quad S = \frac{1}{n} \sum_{j=1}^n ( x_j - \bar x )( x_j - \bar x )' = \frac{1}{n} ( X - \bar x )( X - \bar x )' . $$
In both cases, the proportionality constant is iteratively adapted in order to obtain an acceptance rate in a specified range. If the target distribution has heavy tails, this sampling kernel may be modified to a multivariate Student's t distribution. For further details, see Gilks et al. (1996).

Cyclic MCMC schemes use a “composite kernel” that updates, one by one, the individual components of a $k$-dimensional vector state, $x$. That is, a cyclic MCMC goes from the current state, $x$, to the next state, $y$, by $k$ intermediate steps, $x = [x_1, x_2, \ldots x_k]$, $[y_1, x_2, \ldots x_k]$, $[y_1, y_2, \ldots x_k]$, $\ldots$ $[y_1, y_2, \ldots y_k] = y$. Cyclic schemes include the Gibbs sampler, popularized by Geman and Geman (1984), and many useful variations.
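As a concrete illustration (ours, not from the original text), the following minimal sketch runs a random walk Metropolis chain for a one-dimensional un-normalized target; the standard normal kernel and the step scale are arbitrary example choices:

g = @(x) exp(-0.5*x.^2);           % un-normalized target density
n = 1e4; s = 1.0;                  % chain length and step scale
x = zeros(n,1);
for t = 1:n-1
  y = x(t) + s*randn;              % symmetric proposal, z ~ N(0, s^2)
  if rand < min(1, g(y)/g(x(t)))   % Metropolis acceptance probability
    x(t+1) = y;                    % accept the candidate
  else
    x(t+1) = x(t);                 % reject: the chain stays put
  end
end

Note that only the ratio g(y)/g(x) is ever evaluated, so the normalization constant of the target is indeed not needed.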
G.4 Estimation of Ratios

This section presents the derivation of the Monte Carlo procedure for the numerical integrations required to implement the FBST. The symbol $X$ represents the observed data or some sufficient statistics. The best approach to the numerical integration step required by the FBST is approximation by Monte Carlo (MC) simulation, see Appendix A for the FBST definition, and Evans and Swartz (2000) and Zacks and Stern (2003) for the Monte Carlo approach to this integration problem. We want an estimate of the ratio
$$ ev(H; X) = \frac{ \int_T f(\theta; X)\, d\theta }{ \int_\Theta f(\theta; X)\, d\theta } , \quad T = T(s^*) , \ T(v) = \{ \theta \in \Theta \mid s(\theta) > v \} . $$
Since the space $\Theta$ is unbounded, we randomly choose the values of $\theta$ according to an “importance sampling” density $g(\theta)$, which is positive on $\Theta$. The evidence function is equivalent to
$$ ev(H; X) = \frac{ \int_\Theta Z^*_g(\theta; X)\, g(\theta)\, d\theta }{ \int_\Theta Z_g(\theta; X)\, g(\theta)\, d\theta } , $$
where
$$ Z_g(\theta; X) = \frac{ f(\theta; X) }{ g(\theta) } \quad \text{and} \quad Z^*_g(\theta; X) = I^*(\theta; X)\, Z_g(\theta; X) , \quad I^*(\theta; X) = 1( \theta \in T ) . $$
Thus, a Monte Carlo estimate of the evidence is
$$ \hat{Ev}_{g,m}(X) = \frac{ \sum_{j=1}^m Z^*_g(\theta_j; X) }{ \sum_{j=1}^m Z_g(\theta_j; X) } , $$
where $\theta_j$, $j = 1 \ldots m$, are iid, chosen in $\Theta$ according to the importance sampling density $g(\theta)$. Thus,
$$ \hat{Ev}_{g,m}(X) \ \xrightarrow{ m \to \infty } \ ev(H; X) \quad \text{a.s.} \ [g] . $$
The goodness of the MC estimation depends on the choice of $g$ and $m$. Standard statistical software libraries have univariate random generators for most common distributions. These univariate generators can also be used to build vector variates from multivariate distributions. Appendix D describes how to generate a Dirichlet vector variate from univariate Gammas.

Johnson (1980) describes a simple procedure to generate the Cholesky factor of a Wishart variate $W = U'U$ with $n$ degrees of freedom, from the Cholesky factorization of the covariance parameter $V = R^{-1} = C'C$:
$$ L_{ji} = N(0,1) , \ i > j ; \quad L_{ii} = \sqrt{ \chi^2(n - i + 1) } ; \quad U = L'C . $$
At the integration step it is important to perform all matrix computations directly from Cholesky factors, Golub and van Loan (1989), Jones (1985). In this problem we can therefore use “exact sampling”, which simplifies substantially the integration step, i.e., $Z_g(\theta; X) = 1$.
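As a concrete illustration (ours, with a hypothetical toy model), suppose the posterior is $N(1,1)$, the hypothesis fixes $\theta = 0$, and $s(\theta)$ is the un-normalized posterior itself, so that exact sampling gives $Z_g = 1$ and $\hat{Ev}$ reduces to the sampled fraction of points in $T$:

m = 1e5;
theta = 1 + randn(m,1);           % exact sampling from the posterior
s = @(t) exp(-0.5*(t-1).^2);      % un-normalized posterior as s(theta)
sstar = s(0);                     % supremum of s over the hypothesis
Ev = sum(s(theta) > sstar)/m;     % \hat{Ev}: sampled mass of T = {s > s*}

In this toy model, Ev should approach $\Pr( |\theta - 1| < 1 ) \approx 0.683$.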
Precision of the MC Simulation

In order to control the number of points, $m$, used at each MC simulation, we need an estimate of MC precision for evidence estimation. For a fixed large value $m$, the asymptotic distribution of $\hat{Ev}_{g,m}(X)$ is normal with mean $ev(H; X)$ and asymptotic variance $V_g(X)$. According to the delta method, Bickel and Doksum (2001), we obtain that
$$ V_g(X) = \frac{1}{m} \left( \frac{ \sigma^{*2}_g }{ \mu_g^2 } + \frac{ \sigma^2_g\, \mu^{*2}_g }{ \mu_g^4 } - 2\, \frac{ \mu^*_g\, \gamma_g }{ \mu_g^3 } \right) , $$
where
$$ \mu_g = \int_\Theta Z_g(\theta; X)\, g(\theta)\, d\theta , \quad \mu^*_g = \int_\Theta Z^*_g(\theta; X)\, g(\theta)\, d\theta , $$
$$ \sigma^2_g = \int_\Theta ( Z_g(\theta; X) - \mu_g )^2\, g(\theta)\, d\theta , \quad \sigma^{*2}_g = \int_\Theta \big( Z^*_g(\theta; X) - \mu^*_g \big)^2\, g(\theta)\, d\theta , $$
$$ \gamma_g = \int_\Theta ( Z_g(\theta; X) - \mu_g ) \big( Z^*_g(\theta; X) - \mu^*_g \big)\, g(\theta)\, d\theta $$
are the expected values, variances and covariance of $Z(\theta; X)$ and $Z^*(\theta; X)$ with respect to $g(\theta)$.

Define the coefficients
$$ \xi_g = \frac{ \sigma_g }{ \mu_g } , \quad \xi^*_g = \frac{ \sigma^*_g }{ \mu_g } . $$
For abbreviation, let $\eta = ev(H; X)$. Also note that $\eta = \mu^*_g / \mu_g$. Then the asymptotic variance is
$$ V_g(X) = \frac{1}{m} \left( \xi^{*2}_g + \eta^2 \xi^2_g - 2 \eta\, \frac{ \gamma_g }{ \mu_g^2 } \right) . $$
Let us define the complementary variables
$$ Z^c_g(\theta; X) = I^c(\theta; X)\, Z_g(\theta; X) , \quad I^c(\theta; X) = 1 - I^*(\theta; X) , \quad \sigma^{c2}_g = V_g( Z^c(\theta; X) ) , \quad \xi^c_g = \frac{ \sigma^c_g }{ \mu_g } . $$
Some algebraic manipulation gives us $V_g(X)$ in terms of $\xi^*_g$ and $\xi^c_g$, namely
$$ V_g(X) = \frac{1}{m} \left( \xi^{*2}_g (1-\eta)^2 + \xi^{c2}_g\, \eta^2 + 2\, \eta^2 (1-\eta)^2 \right) . $$
For large values of $m$, the asymptotic $(1-\beta)$-level confidence interval for $\eta$ is $\hat{Ev}_{g,m}(X) \pm \Delta_{g,m,\beta}$, where
$$ \Delta_{g,m,\beta} = \sqrt{ \frac{ F_{1-\beta}(1,m) }{ m } \left( \hat\xi^{*2}_g (1-\hat\eta)^2 + \hat\xi^{c2}_g\, \hat\eta^2 + 2\, \hat\eta^2 (1-\hat\eta)^2 \right) } , $$
where $F_{1-\beta}(1,m)$ is the $1-\beta$ quantile of the $F(1,m)$ distribution, and $\hat\eta$, $\hat\xi^*_g$ and $\hat\xi^c_g$ are consistent estimators of the respective quantities.

For large $m$, we can also use the approximation
$$ \Delta_{g,m,\beta} = \sqrt{ \frac{ \chi^2_{1-\beta}(1) }{ m } \left( \hat\xi^{*2}_g (1-\hat\eta)^2 + \hat\xi^{c2}_g\, \hat\eta^2 + 2\, \hat\eta^2 (1-\hat\eta)^2 \right) } , $$
since $F(1,m)$ converges in distribution to the chi-square distribution with 1 degree of freedom, as $m \to \infty$.

If we wish to have $\Delta_{g,m,\beta} \le \delta$, for a prescribed value of $\delta$, then $m$ should be such that
$$ m \ge \frac{ \chi^2_{1-\beta}(1) }{ \delta^2 } \left( \hat\xi^{*2}_g (1-\hat\eta)^2 + \hat\xi^{c2}_g\, \hat\eta^2 + 2\, \hat\eta^2 (1-\hat\eta)^2 \right) . $$
G.5 Monte Carlo for Linear Systems

We want to solve the simultaneous matrix equation, $x = Hx + b$, with $H$ of dimension $n \times n$. The (Direct) Monte Carlo methods of von Neumann and Ulam (NU) and of Wasow (WS) are based on probability transitions, $P_{ji}$, and multipliers or weights, $V_{ji}$, satisfying the following conditions:
$$ V_{ji} = ( H_{ji} / P_{ji} )\, I( P_{ji} > 0 ) , \quad H_{ji} \ne 0 \Rightarrow P_{ji} > 0 , \quad \sum_{j=1}^n P_{ji} < 1 . $$
We also define the extended stochastic matrix,
$$ \bar P = \begin{bmatrix} P_{11} & \cdots & P_{1n} & 0 \\ \vdots & & \vdots & \vdots \\ P_{n1} & \cdots & P_{nn} & 0 \\ P_{n+1,1} & \cdots & P_{n+1,n} & 1 \end{bmatrix} , \quad P_{n+1,i} = 1 - \sum_{j=1}^n P_{ji} . $$
$\bar P$ defines a Markov chain in a space of $n+1$ states, $\{ 1, 2, \ldots n, n+1 \}$, where the last state, $n+1$, is an absorbing state.

We want to consider a random path or trajectory, $T$, of a “particle” starting at state $i$, until the particle is absorbed at step $m+1$, that is,
$$ T = [\, T(1) = i,\ T(2), \ldots, T(m),\ T(m+1) = n+1 \,] . $$
We define a random variable, $X(T)$, associated to each trajectory. First we define the multiplier products
$$ v_1 = 1 \quad \text{and} \quad v_k = v_{k-1}\, V_{T(k)\,T(k-1)} , \ 2 \le k \le m . $$
Von Neumann - Ulam's and Wasow's versions of the Monte Carlo algorithm use $X(T)$ equal to, respectively,
$$ NU(T) = v_m\, b_{T(m)} / P_{n+1\,T(m)} \quad \text{and} \quad WS(T) = \sum_{k=1}^m v_k\, b_{T(k)} . $$
The key to these Monte Carlo algorithms is that the expected value of the variable $X(T)$, over all trajectories starting at state $i$, is the solution of the simultaneous equation, provided these expected values are well defined, that is, if
$$ e_i = E( X(T) \mid T(1) = i ) \quad \text{then} \quad e = He + b . $$
Let us prove the statement above for Wasow's version. By definition,
$$ \Pr(T) = \prod_{k=1}^m P_{T(k+1)\,T(k)} \quad \text{and} \quad e_i = \sum_T X(T)\, \Pr(T) , $$
where the sum ranges over all trajectories $T = [\, T(1) = i, T(2) = j, \ldots T(m+1) = n+1 \,]$, $m = 1, 2, \ldots \infty$. Given a trajectory $T$, we can separate the terms in $X(T)$ with index 1, that is, $X(T) = b_{T(1)} + V_{T(2)\,T(1)}\, X( T(2:m+1) )$, hence,
$$ e_i = \sum_{j=1}^{n+1} P_{ji} \sum_{S = [j, \ldots n+1]} \big( b_i + V_{ji}\, X(S) \big) \Pr(S) = \sum_{j=1}^{n} P_{ji} \sum_{S = [j, \ldots n+1]} \big( b_i + V_{ji}\, X(S) \big) \Pr(S) + P_{n+1,i}\, b_i $$
$$ = \sum_{j=1}^n P_{ji} V_{ji} \sum_{S = [j, \ldots n+1]} X(S)\, \Pr(S) + b_i \Big( P_{n+1,i} + \sum_{j=1}^n P_{ji} \sum_S \Pr(S) \Big) = \sum_{j=1}^n H_{ji}\, e_j + b_i , \quad \text{Q.E.D.} $$
The Reverse or Adjoint Monte Carlo methods of von Neumann and Ulam (NU) and of Wasow (WS) are based on probability transitions, $Q_{ji}$, and multipliers or weights, $W_{ji}$, satisfying the following conditions:
$$ W_{ji} = ( H_{ij} / Q_{ji} )\, I( Q_{ji} > 0 ) , \quad H_{ij} \ne 0 \Rightarrow Q_{ji} > 0 , \quad \sum_{j=1}^n Q_{ji} < 1 . $$
We also define the extended stochastic matrix $\bar Q$, exactly as before, with absorption probabilities $Q_{n+1,i} = 1 - \sum_{j=1}^n Q_{ji}$.

We want to consider a random path or trajectory, $T$, of a “particle” starting at state $i$, chosen at random with probability $r_i$, until the particle is absorbed at step $m+1$, just after visiting state $T(m)$ at step $m$, that is, $T = [\, T(1) = i, T(2), \ldots, T(m), T(m+1) = n+1 \,]$. The multiplier products are now
$$ w_1 = b_i / r_i \quad \text{and} \quad w_k = w_{k-1}\, W_{T(k)\,T(k-1)} , \ 2 \le k \le m . $$
Von Neumann - Ulam's and Wasow's versions of the reverse or adjoint Monte Carlo algorithm use $X_j(T)$ equal to, respectively,
$$ NU_j(T) = w_m\, \delta_{j\,T(m)} / Q_{n+1\,T(m)} \quad \text{and} \quad WS_j(T) = \sum_{k=1}^m w_k\, \delta_{j\,T(k)} . $$
Again, the key to these Monte Carlo algorithms is that the expected value of the variable $X_j(T)$, over all trajectories ending at state $j$, is the solution of the simultaneous equation, provided these expected values are well defined. The proof for the reverse method is similar to the direct case.
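As a concrete illustration (ours, not from the original text), the following minimal sketch implements the direct method with Wasow's estimator, using the simple choice $P_{ji} = |H_{ji}|$, which satisfies the conditions above whenever the column sums of $|H|$ are less than one; H and b are arbitrary example data:

H = [0.2 0.1; 0.3 0.4]; b = [1; 2];
n = numel(b); N = 1e4; e = zeros(n,1);
for i = 1:n
  acc = 0;
  for trial = 1:N
    s = i; w = 1; score = b(s);            % v_1 = 1, first term b_{T(1)}
    while true
      p = abs(H(:,s)); pstop = 1 - sum(p); % absorption probability
      u = rand;
      if u < pstop, break; end             % particle absorbed
      j = find(u - pstop <= cumsum(p), 1); % next state drawn from P
      w = w * H(j,s) / p(j);               % multiplier product v_k
      s = j; score = score + w * b(s);     % Wasow's running sum WS(T)
    end
    acc = acc + score;
  end
  e(i) = acc / N;                          % e_i = E(X(T) | T(1) = i)
end
% compare with the exact solution: (eye(n) - H) \ b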
Appendix H

Stochastic Evolution and Optimization

“God does not play dice (with the universe).”
Albert Einstein (1879 - 1955).

“Einstein, stop telling God what to do (with his dice).”
Niels Bohr (1885 - 1962).

“God not only plays dice, He also sometimes throws the dice where they are not seen.”
Stephen Hawking (1942 - ).

This section gives a condensed introduction to inhomogeneous Markov chains, the theory that is needed to formalize Simulated Annealing (SA) and related algorithms presented in chapter 5. We follow the presentations in Jetschke (1989) and Pflug (1996, ch.2), and assume some familiarity with homogeneous Markov chains, as presented in Feller (1957, ch.15) or Häggström (2002).
H.1 Inhomogeneous Markov Chains
We begin by introducing some notation for this chapter. First, a notational idiosyncrasy: in almost all areas of mathematics it is usual to write a $d$-dimensional vector as a $d \times 1$ column matrix, $x$, and a linear transformation as the left multiplication of $x$ by a $d \times d$ square matrix $A$, that is, $Ax$. However, in the literature of Markov chains, it is usual to write a $d$-dimensional vector as a $1 \times d$ or row matrix, $v$, and a linear transformation as the right multiplication of $v$ by a $d \times d$ square matrix $P$, that is, $vP$. Herein, we make use of the two forms, according to the context.

$d$-Dimensional vectors are written in lower case format, $v$. A density or probability vector is a vector in the simplex support, $v \ge 0$, $\|v\|_1 = 1$. $d$-Dimensional (square) matrices, on the other hand, are written in upper case format, $P$. In particular, $I$ is reserved to denote the $d$-dimensional identity matrix. A $d$-dimensional kernel or transition probability matrix has its rows in the simplex support. Right subscripts and superscripts will index matrix rows and columns. For instance, $P_i$, $P^j$ and $P_i^j$ will indicate the $i$-th row, the $j$-th column, and the element or entry $(i,j)$ of matrix $P$, respectively. In the same way, $x_i$ and $v^j$ denote, respectively, the $i$-th element of the column vector $x$, and the $j$-th element of the row vector $v$.

Braces are used to index a sequence of objects, such as $P\{1\}, P\{2\}, \ldots P\{t\}$. The symbol $P\{s :: t\}$ will denote the product of the objects indexed from $s$ to $t$, that is,
$$ P\{s :: t\} \equiv \prod_{k=s}^t P\{k\} . $$
Finally, given scalars, $\alpha$ and $\beta$, we have, as usual, $\alpha \wedge \beta = \min(\alpha, \beta)$, $\alpha \vee \beta = \max(\alpha, \beta)$, $\alpha^+ = 0 \vee \alpha$, $\alpha^- = 0 \vee -\alpha$.
In a Markov chain with kernel or transition matrix $P$, $P_i^j \ge 0$, $\|P_i\|_1 = 1$, the entry $P_i^j$ represents the transition probability from state $x\{i\}$ to state $x\{j\}$ in a finite state space, $S = \{ x\{1\}, x\{2\}, \ldots x\{d\} \}$. For the sake of simplicity, we often write the index, $i$, instead of the indexed state, $x\{i\}$, that is, we identify the state space with its index set, $S = \{ 1, 2, \ldots d \}$.

A trajectory or path of length $t$ from an initial state $i$ to a final state $j$ is given by $\tau = [\, \tau(1) = i, \tau(2), \ldots \tau(t), \tau(t+1) = j \,]$. If a Markov chain is initially at state $i$, the probability that it will follow the trajectory $\tau$ is
$$ \Pr(\tau) = \prod_{k=1}^t P_{\tau(k)}^{\tau(k+1)} . $$
If we select the initial state, $i$, from distribution $v$, $v \ge 0$, $\|v\|_1 = 1$, the probability that the chain is at state $j$ after $t$ transitions, following any possible trajectory through intermediate states, is given by $w^j$, where
$$ w = v \prod_{k=1}^t P . $$
A trajectory $\tau$ is possible if it has non-zero probability. A Markov chain is irreducible if there is a possible trajectory connecting any initial state, $i$, to any final state, $j$. A cycle is a trajectory with the same initial and final states. State $i$ has period $k > 1$ if any cycle starting and ending at $i$ has length multiple of $k$. Otherwise, state $i$ is aperiodic. A Markov chain is aperiodic if it has no periodic states.

The probability distribution $g$ is invariant by kernel $P$ if $g = gP$. An invariant distribution is also known as an eigen-solution, equilibrium or stable distribution for $P$. It can be shown that an irreducible and aperiodic Markov chain has a unique invariant distribution, see Feller (1957). Under the same regularity conditions, it can also be shown that the invariant distribution is the chain's limiting distribution, that is,
$$ \lim_{t \to \infty} \prod_{k=1}^t P = \begin{bmatrix} g \\ g \\ \vdots \\ g \end{bmatrix} = \begin{bmatrix} g^1 & g^2 & \ldots & g^d \\ g^1 & g^2 & \ldots & g^d \\ \vdots & \vdots & & \vdots \\ g^1 & g^2 & \ldots & g^d \end{bmatrix} . $$
Hence, for any initial distribution, $v$,
$$ v \Big( \lim_{t \to \infty} \prod_{k=1}^t P \Big) = g . $$
Given the irreducible and aperiodic kernel $P$, having the stable distribution $g$, the reverse kernel, $R$, is defined as $R_i^j = g^j P_j^i / g^i$. The reverse kernel can be interpreted, using Bayes theorem, as the kernel of the Markov chain $P$ going backwards in time, that is,
$$ R_i^j = \Pr( x\{t\} = j \mid x\{t+1\} = i ) = \frac{ \Pr( x\{t+1\} = i \mid x\{t\} = j )\, \Pr( x\{t\} = j ) }{ \Pr( x\{t+1\} = i ) } = \frac{ P_j^i\, g^j }{ g^i } . $$
Kernel $P$ is reversible if there is a distribution $g$ satisfying the detailed balance equation, $g^i P_i^j = g^j P_j^i$. Summing both sides of the detailed balance equation over index $i$, we obtain $g^j = \sum_i g^i P_i^j$, showing that this is a sufficient condition for $g$ to be an invariant distribution. Hence, for a reversible chain, the forward and backward kernels are identical, $R = P$.

Vector and Matrix Norms

A norm, in a vector space $E$, is a function $\| \cdot \| : E \to R$ such that, for all $x, y \in E$ and $\alpha \in R$:
1. $\|x\| \ge 0$, and $\|x\| = 0 \Leftrightarrow x = 0$;
2. $\|\alpha x\| = |\alpha|\, \|x\|$;
3. $\|x + y\| \le \|x\| + \|y\|$, the triangular inequality.
In particular, for $x \in R^n$ and $p \ge 1$,
$$ \|x\|_p = \Big( \sum_{i=1}^n |x_i|^p \Big)^{1/p} , \quad \|x\|_\infty = \max_{i=1}^n |x_i| , $$
defines the standard $L_p$ norms in $R^n$. Given a normed vector space, $(E, \|\cdot\|)$,
$$ \|T\| \equiv \max_{x \ne 0} \big( \|T(x)\| / \|x\| \big) $$
defines the induced norm on the vector space of linear transformations, $T : E \to E$, for which $\exists\, \alpha \in R \mid \|T(x)\| \le \alpha \|x\|$, $\forall x \in E$, that is, the vector space of bounded linear transformations on $E$. By linearity,
$$ \|T\| \equiv \max_{x \,:\, \|x\| = 1} \|T(x)\| . $$
In $(R^n, \|\cdot\|)$ the induced norm on the set of bounded linear transformations, $T : R^n \to R^n$, defines the matrix norm in $(R^n, \|\cdot\|)$. Specifically, for an $n \times n$ matrix $A$, $\|A\| = \|T\|$, where $T(x) = Ax$.

Lemma 1:
The matrix norm in $(R^n, \|\cdot\|)$ has the following properties: if $A$ and $B$ are $n \times n$ matrices,
1. $\|A\| \ge 0$, and $\|A\| = 0 \Leftrightarrow A = 0$;
2. $\|A + B\| \le \|A\| + \|B\|$;
3. $\|AB\| \le \|A\|\, \|B\|$.

Lemma 2: ($L_1$ and $L_\infty$ explicit expressions).
$$ \|A\|_1 = \max_{j=1}^n \sum_{i=1}^n |A_{ij}| , \quad \|A\|_\infty = \max_{i=1}^n \sum_{j=1}^n |A_{ij}| . $$

Proof:
To check the expressions for $L_1$ and $L_\infty$, observe that
$$ \|Ax\|_1 = \sum_{i=1}^n \Big| \sum_{j=1}^n A_{ij}\, x_j \Big| \le \sum_{i=1}^n \sum_{j=1}^n |A_{ij}|\, |x_j| \le \sum_{j=1}^n |x_j| \; \max_{j=1}^n \sum_{i=1}^n |A_{ij}| = \|A\|_1\, \|x\|_1 , $$
$$ \|Ax\|_\infty = \max_{i=1}^n \Big| \sum_{j=1}^n A_{ij}\, x_j \Big| \le \max_{i=1}^n \sum_{j=1}^n |A_{ij}|\, |x_j| \le \max_{j=1}^n |x_j| \; \max_{i=1}^n \sum_{j=1}^n |A_{ij}| = \|x\|_\infty\, \|A\|_\infty . $$
If $k$ is the index that realizes the maximum in the norm definition, then the equality is realized by the vector $x = I_k$ for $L_1$, and by the vector $x \mid x_j = \mathrm{sign}(A_{kj})$ for $L_\infty$.

One can check that $\|x\|_\infty \le \|x\|_1 \le n \|x\|_\infty$ and $\|x\|_\infty \le \|x\|_2 \le n^{1/2} \|x\|_\infty$. In fact, any given $p$ norm can provide a bound to another $q$ norm and, in this sense, they are all equivalent. In the remainder of this section the $L_1$ norm will be used throughout, so we will write $\|x\|$ for $\|x\|_1$.

Dobroushin's Contraction Coefficient

Lemma 3 (Total Variation). Given two probability (non-negative, unitary, row) vectors, $v$ and $w$, their total variation or $L_1$ difference has the alternative expressions:
$$ \|v - w\| = 2 \Big( 1 - \sum_k v^k \wedge w^k \Big) = 2 \sum_k \big( v^k - w^k \big)^+ . $$

Proof:
Just notice that
$$ 2 - 2 \sum_k v^k \wedge w^k = \sum_k v^k + \sum_k w^k - 2 \sum_k v^k \wedge w^k = \sum_k | v^k - w^k | , $$
and $\big( v^k - w^k \big)^+ = \big( v^k - v^k \wedge w^k \big)$, hence
$$ \sum_k \big( v^k - w^k \big)^+ = 1 - \sum_k v^k \wedge w^k . $$
The Dobroushin Contraction Coefficient or Ergodicity Coefficient of a transition probability matrix, $P$, is defined as
$$ \rho(P) \equiv \frac{1}{2} \max_{i,j} \sum_k | P_i^k - P_j^k | = \frac{1}{2} \max_{i,j} \| I_i P - I_j P \| . $$
It is clear from the definition that $\rho(P)$ measures the maximum $L_1$ distance between the rows of $P$. If a sequence of kernels, $P\{k\}$, is clear from the context, we shall also write
$$ \rho\{k\} \equiv \rho( P\{k\} ) , \quad \text{and} \quad \rho\{s :: t\} \equiv \prod_{k=s}^t \rho\{k\} . $$

Lemma 4 (Vector Contraction). Two probability vectors, $v$ and $w$, are contracted by the transition matrix $P$ in the sense that:
$$ \| vP - wP \| \le \rho(P)\, \| v - w \| . $$
Proof: If $v = w$ or if $v = I_i$ and $w = I_j$, the result is trivial. Otherwise, let $v \ne w$ and $m = v \wedge w$. Defining
$$ G_i^j = \frac{ 2\, ( v^i - m^i )( w^j - m^j ) }{ \| v - w \| } , $$
it is easy to check that:
(a) $G_i^j \ge 0$;
(b) $v - w = \sum_{i,j} G_i^j ( I_i - I_j )$; and
(c) $\frac{1}{2} \| v - w \| = \sum_{i,j} G_i^j$.
Hence,
$$ \| vP - wP \| = \Big\| \sum_{i,j} G_i^j ( I_i - I_j ) P \Big\| \le \Big( \sum_{i,j} G_i^j \Big) \max_{i,j} \| ( I_i - I_j ) P \| = \frac{1}{2} \| v - w \| \; 2 \rho(P) = \rho(P)\, \| v - w \| . $$

Lemma 5 (Matrix Contraction). Two transition matrices, $P$ and $Q$, are contracted in the sense that: $\rho(PQ) \le \rho(P)\, \rho(Q)$.
Proof:
$$ \rho(PQ) = \frac{1}{2} \max_{i,j} \| ( I_i - I_j ) P Q \| \le \rho(Q)\, \frac{1}{2} \max_{i,j} \| ( I_i - I_j ) P \| = \rho(P)\, \rho(Q) . $$

Theorem 6 (Weak Ergodicity (loss of memory)).
$$ \lim_{t \to \infty} \rho\{1 :: t\} = 0 \ \Rightarrow \ \lim_{t \to \infty} \| (v - w)\, P\{1 :: t\} \| = 0 . $$
Proof:
Immediate, from Lemmas 4 and 5.
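As a concrete illustration (ours, not from the original text), the following minimal sketch computes the Dobroushin coefficient of a row-stochastic kernel directly from the definition; it can be used, for example, to check the contraction property of Lemma 5 on random kernels:

function r = dobroushin(P)
% Dobroushin contraction coefficient of a row-stochastic matrix P
n = size(P,1); r = 0;
for i = 1:n
  for j = 1:n
    r = max(r, 0.5*sum(abs(P(i,:) - P(j,:))));  % max L1 row distance
  end
end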
Lemma 7 (Strong Ergodicity). Assume that the following conditions hold:
(a) each $P\{k\}$ has a unique invariant distribution, $v\{k\} = v\{k\} P\{k\}$;
(b) $\sum_{k=1}^\infty \| v\{k+1\} - v\{k\} \| < \infty$; and
(c) $\rho\{k\} > 0$ and, for any $s$, $\lim_{t \to \infty} \rho\{s :: t\} = 0$.
Then, there is a limiting distribution, $v\{\infty\}$, such that, for any distribution $w$,
$$ \lim_{t \to \infty} \| w\, P\{1 :: t\} - v\{\infty\} \| = 0 . $$
Proof.
Conditions (a) and (b) ensure that, with respect to the $L_1$ norm, $v\{k\}$ is a Cauchy sequence in the compact simplex support. Hence, the sequence has a unique accumulation point, $v\{\infty\} = \lim_{k \to \infty} v\{k\}$. Since, for $1 < s < t < \infty$,
$$ v\{\infty\} P\{s :: t\} - v\{\infty\} = ( v\{\infty\} - v\{s\} )\, P\{s :: t\} + v\{s\}\, P\{s :: t\} - v\{\infty\} $$
$$ = ( v\{\infty\} - v\{s\} )\, P\{s :: t\} + \sum_{k=s}^{t-1} ( v\{k\} - v\{k+1\} )\, P\{k+1 :: t\} + v\{t\} - v\{\infty\} , $$
it follows that
$$ \| w\, P\{1 :: t\} - v\{\infty\} \| \le \| ( w\, P\{1 :: s-1\} - v\{\infty\} )\, P\{s :: t\} \| + \| v\{\infty\}\, P\{s :: t\} - v\{\infty\} \| $$
$$ \le 2 \rho\{s :: t\} + \| v\{\infty\} - v\{s\} \| + \sum_{k=s}^{t-1} \| v\{k\} - v\{k+1\} \| + \| v\{\infty\} - v\{t\} \| $$
$$ \le 2 \rho\{s :: t\} + 2 \sup_{k \ge s} \| v\{\infty\} - v\{k\} \| + \sum_{k=s}^{t-1} \| v\{k\} - v\{k+1\} \| . $$
Letting $t \to \infty$, all terms in the right hand side can be made arbitrarily small for an appropriately large value of $s$. Consequently, the left hand side converges to zero, Q.E.D.

Theorem 8 (Small Perturbations). It is possible to use a perturbed sequence of kernels, $Q\{k\}$, instead of $P\{k\}$, and still obtain convergence to the same invariant distribution provided that
$$ \sum_{k=1}^\infty \| P\{k\} - Q\{k\} \| < \infty . $$
Proof.
The result follows from the inequality
$$ \| P\{s :: t\} - Q\{s :: t\} \| \le \sum_{k=s}^t \| P\{k\} - Q\{k\} \| . $$
The Small Perturbations theorem plays an important role in the design of efficient algorithms based on heuristic perturbations, a technique that can greatly expedite the annealing process, see Stern (1991) and Pflug (1996, ch.2).
H.2 Simulated Annealing
The Metropolis Algorithm
Consider a system, $X$, where the system state is parameterized by a $d$-dimensional coordinate vector $x = [x_1, \ldots x_d] \in X$. The neighborhood $N(x)$ is defined as the set of states $y$ that are adjacent to $x$, that is, the set of states that can be reached directly from $x$, that is, with one move, or in a single step. The neighborhood size is $n(x) = |N(x)| \le n = \max_x n(x)$. We assume that the neighborhood structure is symmetric, that is, $y \in N(x) \Rightarrow x \in N(y)$, and that any two states, $x$ and $y$, are linked by a path with at most $m$ steps. Our aim is to minimize a finite and positive objective function, $H(x)$, with a unique global minimum attained at $x^*$. The system's Lipschitz constant, $\Delta$, is the maximum difference in the value of $H$ for adjacent states, that is,
$$ \Delta = \max_x \max_{y \in N(x)} | H(y) - H(x) | . $$
The Gibbs distribution is defined as
$$ g(\theta)_x = \frac{ n(x) }{ Z(\theta) } \exp( -\theta H(x) ) , \ \text{with} \ Z(\theta) = \sum_x n(x) \exp( -\theta H(x) ) . $$
The Gibbs distribution specifies state probabilities in many systems of Statistical Physics, where the Hamiltonian function, $H$, represents state energies, and the parameter $\theta$ is the system's inverse temperature. The normalization constant, $Z(\theta)$, is called the partition function.

The Metropolis kernel is defined by
$$ P(\theta)_{yx} = \begin{cases} \frac{1}{n(x)} \exp\big( -\theta\, ( H(y) - H(x) )^+ \big) , & \text{if } y \in N(x) \\ 1 - \sum_{y \in N(x)} P(\theta)_{yx} , & \text{if } y = x \end{cases} . $$

Theorem 9 (Metropolis sampling). The Gibbs distribution $g(\theta)$ is invariant for the Metropolis kernel $P(\theta)$.
Proof.
It suffices to prove the detailed balance equation
$$ g(\theta)_x\, P(\theta)_{yx} = g(\theta)_y\, P(\theta)_{xy} . $$
If $y \notin N(x)$, balance is trivial. Otherwise, we use
$$ \frac{1}{n(x)} \exp\big( -\theta\, ( H(y) - H(x) )^+ \big) = \frac{1}{n(x)} \left( \frac{ g(\theta)_y\, n(x) }{ g(\theta)_x\, n(y) } \wedge 1 \right) . $$
Assuming that $( g(\theta)_y\, n(x) ) / ( g(\theta)_x\, n(y) ) \ge 1$,
$$ g(\theta)_x\, P(\theta)_{yx} = \frac{ g(\theta)_x }{ n(x) } \quad \text{and} \quad g(\theta)_y\, P(\theta)_{xy} = \frac{ g(\theta)_y }{ n(y) }\, \frac{ g(\theta)_x\, n(y) }{ g(\theta)_y\, n(x) } = \frac{ g(\theta)_x }{ n(x) } . $$
The case $( g(\theta)_y\, n(x) ) / ( g(\theta)_x\, n(y) ) < 1$ is analogous.

Consider now an increasing sequence of inverse temperatures, $\theta_1, \theta_2, \ldots$, for the Simplified Metropolis Algorithm where, at each temperature $1/\theta_t$, we take $m$ steps using the kernel $P\{t\} = P(\theta_t)$, or a single step using the kernel $Q\{t\} = P\{t\}^m$.

Theorem 10 (Logarithmic Cooling). In the simplified Metropolis algorithm, for any monotone decreasing cooling schedule
$$ \frac{1}{\theta_t} \ge \frac{ \Delta\, m \ln(n) }{ \ln(t) } $$
and any initial distribution $w$,
$$ \lim_{t \to \infty} \| w\, Q\{1\} Q\{2\} \ldots Q\{t\} - v\{\infty\} \| = 0 . $$
Proof.
From the definition of the system's Lipschitz constant, and from the fact that any two states of the system are connected by a path of length at most $m$, it follows that, for any two states, $x$ and $y$,
$$ Q\{t\}_{yx} \ge \left( \frac{1}{n} \exp( -\Delta \theta_t ) \right)^m = \frac{1}{n^m} \exp( -m \Delta \theta_t ) . $$
Hence,
$$ \rho\{t\} = \rho( Q\{t\} ) = \max_{x,y} \Big( 1 - \sum_z \big( Q\{t\}_{zx} \wedge Q\{t\}_{zy} \big) \Big) \le \max_{x,y} \Big( 1 - \big( Q\{t\}_{z^* x} \wedge Q\{t\}_{z^* y} \big) \Big) $$
$$ \le 1 - \frac{1}{n^m} \exp( -m \Delta \theta_t ) \le 1 - \frac{1}{n^m} \exp\left( -\frac{ m \Delta \ln(t) }{ \Delta\, m \ln(n) } \right) = 1 - \frac{1}{n^m}\, t^{-1/\ln(n)} \le 1 - \frac{1}{n^m\, t} . $$
Condition (c) of the strong ergodicity lemma follows from
$$ \sum_{t=1}^\infty \frac{1}{n^m\, t} = \infty \ \Rightarrow \ \rho\{1 :: \infty\} = \prod_{t=1}^\infty \left( 1 - \frac{1}{n^m\, t} \right) = 0 . $$
Finally, in order to check condition (b) of the strong ergodicity lemma, we must show that the invariant measures $v\{t\} = v(\theta_t)$ have summable increments. However, as $\theta$ increases, the elements of $v\{t\}$ are either increasing, for $x = x^*$, or decreasing, for $x \ne x^*$ and sufficiently large $\theta$. Hence, for $l$ large enough and $t \ge l$,
$$ \sum_{t=l}^\infty \| v\{t+1\} - v\{t\} \| = 2 \sum_{t=l}^\infty \sum_{x \in X} \big( v\{t+1\} - v\{t\} \big)^+ \le 2 \sum_{x \in X} \big( v\{\infty\} - v\{l\} \big)^+ < \infty . $$
There is an implicit choice of scale in the unit taken to measure the Hamiltonian or objective function, $H(x)$. An adequate scale should start the annealing process with a good acceptance rate for hill climbing moves. The step size of the logarithmic cooling schedule is inversely proportional to the cooling constant, $\Delta\, m \ln(n)$. An alternative to the simplified Metropolis algorithm, taking $m$ steps at each temperature $\theta_t$, is to implement the standard Metropolis algorithm using the cooling constant $\Delta\, m \ln(n)$.
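As a concrete illustration (ours, not from the original text), the following minimal sketch runs the simplified Metropolis algorithm with a logarithmic cooling schedule on a toy one-dimensional state space $\{1, \ldots, d\}$ with neighbors $x-1$ and $x+1$; the objective function Hobj and all parameter values are arbitrary example choices, and boundary proposals are simply rejected:

Hobj = @(x) abs(x - 7);                 % toy objective, minimum at x* = 7
d = 20; x = 1; Delta = 1; m = d; nbar = 2;
for t = 2:5000
  theta = log(t)/(Delta*m*log(nbar));   % logarithmic cooling schedule
  for k = 1:m                           % m Metropolis steps per level
    y = x + sign(rand - 0.5);           % propose a uniform neighbor
    if y < 1 || y > d, y = x; end       % stay inside the state space
    if rand < exp(-theta*max(Hobj(y) - Hobj(x), 0))
      x = y;                            % Metropolis acceptance
    end
  end
end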
H.3 Genetic Programming

The Intrinsic Parallelism Argument
Consider programs coded as binary (0-1) arrays of length $n$. A pattern or schema of length $l$ is a partial specification of a binary array of length $l$,
$$ s[i] \in \{ 0, 1, * \ (\text{don't-care}) \} , \ 1 \le i \le l . $$
The number of specified positions or loci, that is, $l$ minus the number of don't-cares, defines the schema's order. The program's sub-array $p[j]$ in the window $k < j \le k + l$ is an instance of schema $s$ iff they coincide in the specified loci, that is, iff $p[k+i] = s[i]$, for all $s[i] \ne *$.
The intrinsic parallelism argument, presented in chapter 6, requires an estimate of how many schemata of order $l$ and length $2l$ can be represented in a program of length $n$. Following Reeves (1991), consider the window of length $2l$ at the beginning or leftmost locus, $1 \le j \le 2l$, and let $B(2l, l)$ be the number of choices for the specified loci, $l$, among the $2l$ available positions. This first window can obviously represent $B(2l,l)\, 2^l$ distinct schemata, for once the $l$ loci have been chosen, there are $2^l$ possible 0-1 attributions to their values.

Now slide the window $2l$ positions to the right, so as to span positions $2l + 1 \le j \le 4l$. This new window has no positions in common with the previous one and can, therefore, represent the same number of schemata. If we keep sliding the window $2l$ positions to the right until positions $n - 2l + 1 \le j \le n$ are spanned, it follows from Stirling's approximation that the total count of possible represented schemata satisfies the relation
$$ \frac{n}{2l}\, B(2l,l)\, 2^l \approx \frac{n}{2l}\, 2^{3l} \propto m^3 , $$
where the population size is taken as $m = c\, 2^l$. The constant $c$ is interpreted as the expected number of instances of any given schema (of order $l$ and length $2l$) present in this population. Hence, under all the conditions above, we can (under)estimate the number of schemata present in the population as proportional to $m^3$. For generalizations of the implicit parallelism theorem, see Bertoni and Dorigo (1993).

Stirling's Approximation
For large $n$,
$$ \ln n! = \sum_{j=1}^n \ln j \approx \int_1^n \ln j\, dj = [\, j \ln j - j \,]_1^n = n \ln n - n + 1 . $$
A more detailed analysis of the remainder gives us
$$ \ln n! \approx n \ln n - n + O(\ln n) . $$
From Stirling's approximation, the following Binomial approximations hold:
$$ \ln \binom{n}{np} \approx n H(p) , \ \text{where} \ H(p) = -p \ln p - (1-p) \ln(1-p) , \quad \text{and} \quad \ln \binom{2l}{l} \approx 2l\, H(1/2) . $$
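As a quick numeric check (ours, not from the original text) of the two approximations above:

n = 50;
exact  = sum(log(1:n));                   % ln n!
approx = n*log(n) - n + 1;                % integral approximation
l = 20;
lnbin  = gammaln(2*l+1) - 2*gammaln(l+1); % ln C(2l,l), exact
[exact, approx; lnbin, 2*l*log(2)]        % compare with 2l*H(1/2)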
H.4 Ontogenic Development

Autopoietic and allopoietic systems, living organisms and artificial machines, both have to be built up and have their basic components maintained. However, there are profound differences in their development processes. In this section we examine the structural similarities and differences between such systems, and how such structures can explain some properties of systemic development.

Herein, the adult or after-construction systemic feature known as aging receives special attention. Elementary or simple components have no structure, no internal states, and hence no memory. They can, therefore, exhibit no aging. Complex systems, however, exhibit some form of aging. We will see how the aging process of a complex system can reflect systemic structure. We will contrast, in particular, bottom-up and top-down system construction, and their respective aging processes. Our analysis will follow Gavrilov (1981, 2001, 2006).

“The first fundamental feature of biosystems is that, in contrast to technical (artificial) devices which are constructed out of previously manufactured and tested components, organisms form themselves in ontogenesis through a process of self-assembly out of de novo forming and externally untested elements (cells). The second property of organisms is the extraordinary degree of miniaturization of their components (the microscopic dimensions of cells, as well as the molecular dimensions of information carriers like DNA and RNA), permitting the creation of a huge redundancy in the number of elements. Thus, we expect that for living organisms, in distinction to many technical (manufactured) devices, the reliability of the system is achieved not by the high initial quality of all the elements but by their huge numbers (redundancy).”
Gavrilov (2001, p.531).
Aging Processes
In this section we follow Gavrilov (1981, 2001, 2006) to analyse the aging process of some redundant series / parallel reliability systems.

As usual in reliability theory, $t$ will denote failure time, $f(t)$ and $F(t)$ the density and cumulative distribution functions of the failure time, $S(t) = 1 - F(t)$ the survival function, and
$$ h(t) = -\frac{ d S(t) }{ S(t)\, dt } = -\frac{ d \ln S(t) }{ dt } $$
the hazard function, failure rate, or mortality force, see Barlow and Proschan (1981).

Simple, memoryless or non-aging components are characterized by exponentially distributed failure times. In this case, the failure time has constant hazard rate, $h(t) = \kappa$, and $S(t) = \exp(-\kappa t)$, $\kappa, t \ge 0$.
Complex systems are characterized by different aging regimes which, in turn, reflect their structural characteristics. Two aging regimes are of
special interest to us:
1- The Weibull or power law regime, with $h(t) = \kappa t^\alpha$, $\kappa, \alpha > 0$,
characteristic of complex top-down, external assembly or allopoietic systems, and
2- The Gompertz-Makeham regime, with $h(t) = A + R \exp(\alpha t)$, $A, R, \alpha > 0$,
characteristic of complex bottom-up, self-assembly or autopoietic systems. In biological models, the Makeham parameter, $A$, indicates an external mortality force, whereas the pure Gompertz regime, for $A = 0$, models the internal or systemic hazard function. In what follows, we will see some structural models that explain these two regimes and test them on some engineering and biological systems.

The two basic structures in reliability theory are parallel and series compositions. Complex systems, in general, are recursive compositions of series and parallel blocks. A parallel block fails if all its components fail, whereas a series block fails if any one of its components fails; alternatively, a parallel block fails with its last failing component, whereas a series block fails with its first failing component. Hence the series-parallel reliability compositional rules:
- The cumulative distribution function of a parallel system with independent components equals the product of its components' cumulative distribution functions.
- The hazard function of a series system with independent components equals the sum of its components' hazard functions.

Let us now consider the “simplest complex system” modeling an organism or machine with multiple, $m$, functions, where each function is performed by an independent block of redundant simple components. That is, a system is assembled as a series of $m$ blocks, $b_j$, $j = 1 \ldots m$, such that block $j$ is assembled as a parallel (sub) system with $n_j$ simple components.

Top-down projects typically use a small number of redundant units, in order to optimize production costs as well as to meet other project constraints such as maximum space or weight. Hence, components have to comply with strict standards, achieved by several forms of quality control tasks in the manufacturing process. In such systems all components are initially alive, operational or working, since they would have been otherwise rejected by quality control. They are typically depicted in block diagrams such as shown in Figure 1A. In this example each block has the same number, $n_j = i$, of redundant components.

Since each simple component has an exponential failure distribution, the reliability compositional rules lead to the following block and systemic hazard functions:
$$ F_j = \big( 1 - e^{-\kappa t} \big)^i , \quad h_j(t) = \frac{ i \kappa\, e^{-\kappa t} ( 1 - e^{-\kappa t} )^{i-1} }{ 1 - ( 1 - e^{-\kappa t} )^i } ; \quad h(t) = \sum_{j=1}^m h_j(t) = \frac{ m i \kappa\, e^{-\kappa t} ( 1 - e^{-\kappa t} )^{i-1} }{ 1 - ( 1 - e^{-\kappa t} )^i } . $$
Using the early-life and late-life asymptotic approximations, $1 - \exp(-\kappa t) \approx \kappa t$, for $t \ll 1/\kappa$, and $1 - \exp(-\kappa t) \approx$
$1$, for $t \gg 1/\kappa$, the $i$-element parallel block and systemic hazard functions can be approximated as
$$ h_i(t) \approx \begin{cases} i \kappa^i t^{i-1} & \text{if } t \ll 1/\kappa \\ \kappa & \text{if } t \gg 1/\kappa \end{cases} , \quad h(t) \approx \begin{cases} m i \kappa^i t^{i-1} & \text{if } t \ll 1/\kappa \\ m \kappa & \text{if } t \gg 1/\kappa \end{cases} . $$
Let us now consider self-assembled blocks where the number $i$ of initially working elements follows a Poisson distribution with parameter $\lambda = nq$, $P(i) = \exp(-\lambda) \lambda^i / i!$. We should also truncate the Poisson distribution, to account for the facts that the organism is initially alive, implying the exclusion of the $i = 0$ case, and that the organism is finite, implying a cut-off $\Pr(i > n) = 0$. The corrected normalization constant for this truncated Poisson is
$$ c^{-1} = 1 - \exp(-\lambda) - \exp(-\lambda) \sum_{i=n+1}^\infty \lambda^i / i! \ . $$
As in the previous model, the systemic hazard function is the sum of those of its blocks, where each block begins with $i$, Poisson distributed, working elements. Hence, the expected systemic hazard function can be written as:
$$ h(t) = \sum_{j=1}^m h_j(t) = m \sum_{i=1}^n P(i)\, h_i(t) . $$
Substitution of $h_i(t)$ yields the following systemic hazard rate and approximations:
$$ h(t) = c\, m \kappa \lambda\, e^{-\lambda} e^{-\kappa t} \sum_{i=1}^n \frac{ \lambda^{i-1} ( 1 - e^{-\kappa t} )^{i-1} }{ (i-1)!\, \big( 1 - ( 1 - e^{-\kappa t} )^i \big) } , $$
$$ h(t) \approx \begin{cases} c\, m \kappa \lambda\, e^{-\lambda} \sum_{i=1}^n \frac{ ( \kappa \lambda t )^{i-1} }{ (i-1)! } = R\, \big( e^{\alpha t} - \epsilon(t) \big) & \text{if } t \ll 1/\kappa \\ m \kappa & \text{if } t \gg 1/\kappa \end{cases} . $$
In the last expression, $R = c\, m \kappa \lambda \exp(-\lambda)$, $\alpha = \kappa \lambda$, and $\epsilon(t) = \sum_{i=n+1}^\infty ( \kappa \lambda t )^{i-1} / (i-1)!$. For fixed $\kappa$ and $\lambda$ and sufficiently small $t$, $\epsilon(t)$ is close to zero. Hence, in early life, $h(t) \approx R \exp(\alpha t)$, as in the pure Gompertz regime.
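As a concrete illustration (ours, not from the original text), the following minimal sketch plots the hazard function of a series of m parallel blocks of i exponential components against its early-life Weibull approximation; the parameter values are arbitrary:

kappa = 0.01; m = 5; i = 4;
t  = linspace(1, 200, 200);
F1 = 1 - exp(-kappa*t);                        % component failure CDF
h  = m*i*kappa.*exp(-kappa*t).*F1.^(i-1) ...
     ./ (1 - F1.^i);                           % systemic hazard
hW = m*i*kappa^i*t.^(i-1);                     % early-life power law
semilogy(t, h, t, hW, '--');                   % curves agree for t << 1/kappa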
In the last courses we have had classes of very diverse students. As expected, we had students coming from the programs in Applied Mathematics, Physics and, of course, Statistics, but we also had some students with quite different backgrounds, such as Computer Science, Economics, Law, Logic and Philosophy. This appendix proposes some research projects that may be especially interesting to some of these students. I do believe, of course, that most of them will also be interesting to students of Statistics. If you are interested in one of these projects, send me an e-mail, or stop by my office, and let us talk about how to proceed.
Bayesian and other Credal Networks
The sparse factorization techniques described in Appendix F can be transposed to Bayesian networks and other belief propagation networks as well.

1- Symbolic phase: Implement the algorithms used to find a good elimination order, like the Gibbs heuristics, the Bayes-ball algorithm, and the other graph algorithms mentioned in Appendix F. A language such as C or C++, providing good support for dynamic data structures, is recommended. (A small sketch of one such ordering heuristic is given after this list.)

2- Numeric phase: Once the elimination order, requisite variables, etc. are determined, implement the numerical elimination process using static data structures. A language such as Fortran, providing good support for automatic parallelization, is recommended.

3- Investigate the potential for parallelization of the sequential codes implemented in steps 1 and 2. Discuss the possibility, difficulties and advantages of developing tailor-made parallel code versus the use of automatic parallelization tools.

4- Implement efficient MC or MCMC processes for computing the evidence supporting the existence of a given causal link, that is, the existence of a given arrow in (a) a given Bayes network topology; (b) all, or a given subset, of topologies.
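As a concrete starting point for the symbolic phase, here is a minimal Python sketch (illustrative only; the project itself recommends C or C++) of a greedy minimum-degree elimination ordering, a classic heuristic of the same family as those mentioned above:

```python
# Greedy minimum-degree elimination ordering on an undirected (moral) graph.
# The graph is represented as a dict mapping node -> set of neighbours.

def min_degree_order(graph):
    """Repeatedly eliminate a node of minimum current degree, connecting
    its remaining neighbours into a clique (the fill-in edges)."""
    g = {v: set(nbrs) for v, nbrs in graph.items()}
    order = []
    while g:
        v = min(g, key=lambda u: len(g[u]))  # node of minimum current degree
        nbrs = g.pop(v)
        for u in nbrs:
            g[u].discard(v)                  # remove the eliminated node
            g[u] |= (nbrs - {u})             # add fill-in edges
        order.append(v)
    return order

# Example: the moral graph of a small four-variable network.
g = {'A': {'B', 'D'}, 'B': {'A', 'C', 'D'},
     'C': {'B', 'D'}, 'D': {'A', 'B', 'C'}}
print(min_degree_order(g))                   # -> ['A', 'B', 'C', 'D']
```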
Mixture of Factor Analyzers
Extend the theory and methods for Mixtures of Multivariate Gaussians, as described in Appendix B, to Mixtures of Factor Analyzers (FA's). The geometric interpretation of these models is very similar, but whereas all components of a Mixture of Gaussians lie in the same d-dimensional space, each component of a Mixture of FA's concentrates around a different hyperplane of the full d-dimensional space; a small generative sketch of this geometry is given below. In particular:
a) Test the existence of a given component in the mixture.
b) Test the existence of the least significant dimension of a given component.
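A minimal sketch, with invented dimensions and parameter values, of how each FA component generates data near its own affine subspace μ_j + range(Λ_j):

```python
# Sampling from a two-component mixture of factor analyzers (MFA).
# All dimensions and parameter values below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 1                                   # observed / latent dimensions
weights = [0.5, 0.5]
mus = [np.zeros(d), np.array([5.0, 0.0, 0.0])]
lambdas = [np.array([[1.0], [1.0], [0.0]]),   # component 0 spans the line x = y
           np.array([[0.0], [0.0], [1.0]])]   # component 1 spans a shifted z axis
psis = [0.01 * np.ones(d)] * 2                # small diagonal idiosyncratic noise

def sample_mfa(n):
    """Draw x = mu_j + Lambda_j z + eps, z ~ N(0, I_k), eps ~ N(0, diag(psi_j))."""
    xs = []
    for j in rng.choice(len(weights), size=n, p=weights):
        z = rng.standard_normal(k)
        eps = np.sqrt(psis[j]) * rng.standard_normal(d)
        xs.append(mus[j] + lambdas[j] @ z + eps)
    return np.array(xs)

X = sample_mfa(1000)   # points concentrate near two different affine lines in R^3
print(X.shape, X.mean(axis=0))
```

Polynomial Networks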
1- Discuss the use of edge annotations and heuristic merit functions in the synthesis of sub-networks, that is, the use of heuristic "recombinative guidance", in the terminology of Nikolaev and Iba (2001, 2006). (A toy version of merit-guided node selection is sketched after this list.)

2- Discuss the use of time-dependent objective functions, as in Section 5.2, to guide the synthesis of the entire network.

3- Discuss how to test (sub)topologies of a given network.
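The following toy sketch conveys only the flavor of merit-guided synthesis in GMDH-style polynomial networks, not Nikolaev and Iba's actual algorithms: each candidate node is a quadratic polynomial of two inputs, and a validation-error merit function selects the surviving node. The data and target function are invented.

```python
# One merit-guided synthesis step for a polynomial-network node (toy example).
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 4))                      # four candidate inputs (invented)
y = X[:, 0] * X[:, 1] + 0.1 * X[:, 2]         # hidden target function (invented)
tr, va = slice(0, 100), slice(100, 200)       # training / validation split

def design(a, b):
    """Quadratic node basis [1, a, b, ab, a^2, b^2]."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])

def merit(i, j):
    """Validation MSE of the candidate node built on inputs i and j."""
    w, *_ = np.linalg.lstsq(design(X[tr, i], X[tr, j]), y[tr], rcond=None)
    return np.mean((design(X[va, i], X[va, j]) @ w - y[va]) ** 2)

pairs = itertools.combinations(range(X.shape[1]), 2)
print(min(pairs, key=lambda p: merit(*p)))    # -> (0, 1), the informative pair
```

(De)Coupling, (De)Composition, Complementarity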
1- Discuss the possibility of using complementary models in contexts other than Quantum Mechanics. Give examples of such applications.

2- Discuss the possibility of extending the results of Borges and Stern (2007) to models with limited dependence using, for example, the formalism of Copulas.

3- Investigate the meaning and interpretation of decoupling or separation schemes generated by alternative sparse and/or structured matrix factorizations.

4- Using wavelet or other self-similar representations, it is possible to overcome the strict version of the Heisenberg uncertainty relation, see Vidakovic (1999, p.xxx). However, these representations may introduce non-local, delayed, integral, long-range, long-memory or other forms of coupling or dependence. Investigate how to obtain generalized Heisenberg-type relations for such cases.

5- Give suitable interpretations of, and implement statistical models for, the "necessary or consequential randomness" implied in the following examples:

5a- Morgenstern and von Neumann (1947) and Nash (1951) proved the existence of equilibrium strategies for non-cooperative games. However, in general, these equilibria are not at deterministic or pure strategies, but at randomized or mixed strategies. (A small numerical illustration is given after this list.)

5b- The concept of the Impossible (or Inconsistent, or Unholy) Trinity, also known as the Mundell-Fleming trilemma, is a hypothesis of international economics stating the impossibility of simultaneously achieving the following goals: 1- fixed exchange rate; 2- free capital movement; and 3- independent monetary policy.
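For item 5a, here is a minimal Python sketch (game and method chosen here purely for illustration): fictitious play on matching pennies, a game with no pure-strategy equilibrium, whose empirical play frequencies converge to the mixed equilibrium (1/2, 1/2).

```python
# Fictitious play on matching pennies (zero-sum, no pure equilibrium).
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])      # row player's payoff matrix
nr, nc = np.ones(2), np.ones(2)               # empirical action counts
for _ in range(20000):
    p, q = nr / nr.sum(), nc / nc.sum()       # empirical mixed strategies
    nr[np.argmax(A @ q)] += 1                 # row best-responds to q
    nc[np.argmin(p @ A)] += 1                 # column best-responds to p
print(nr / nr.sum(), nc / nc.sum())           # both -> approximately [0.5 0.5]
```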
Economics
The economic system may be characterized by eigen-solutions, equilibria or fixed points resulting from the collective interaction of many economic agents. Some of the most important of such eigen-values are prices; see, for example, Ingrao and Israel (1990). (A toy fixed-point illustration is given after this list.)

1- Give concrete examples of such situations that are well suited for experimental research.

2- Discuss how to measure the epistemic value of such an economic or financial eigen-value.

3- Discuss how to assess the consistency of such eigen-values, for example, by means of sensitivity analyses.

4- Discuss the need for regulatory mechanisms protecting such eigen-solutions, such as, for example, anti-trust laws.

5- Discuss the consequences of Zangwill's global convergence theorem for the design of good regulatory policies; see, for example, Border (1989), Ingrao and Israel (1990) and Zangwill (1964).
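As a toy illustration (with entirely invented numbers, not a serious economic model) of a price vector arising as the fixed point of an iterative adjustment map, in the spirit of the Zangwill-style convergence analysis mentioned in item 5:

```python
# Tatonnement-style price adjustment converging to a fixed point p*.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])        # invented substitution matrix
b = np.array([4.0, 5.0])

def z(p):
    """Linear excess-demand-like map with unique zero at p* = (1, 2)."""
    return b - A @ p

p, gamma = np.array([3.0, 0.5]), 0.2          # initial prices, step size
for _ in range(100):
    p = p + gamma * z(p)   # raise prices where demand exceeds supply
print(p)                   # -> approximately [1. 2.]
```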
Law
The Objective / Subjective dichotomy manifests itself in the legal arena via the notion of responsibility. Responsibility may require either two or three conditions, namely:

a) Damage: A loss suffered by the victim (or offended party).

b) Causal relation: A causal nexus linking an action (or lack thereof) of the accused (or defendant, offending party, perpetrator) to the damage suffered by the victim.

c) Illicitness: An explanation of why the action (or lack thereof) of the accused was illegal or unlawful.

While the programs and codes (in Luhmann's sense) needed for checking condition (c) are internal ones, that is, programs and codes within the legal system itself, the programs and codes needed for checking conditions (a) and (b) are often external, that is, programs and codes of other systems, such as science or economics, for example.

Hence, it is not surprising that a responsibility entailed by conditions (a) and (b) alone is called "objective", while one requiring conditions (a), (b) and (c) is called "subjective"; see Stern (2007).
R.B. Stern (2007) suggests the following principle, hereby named "Transference of Objectivity" (TrOb), for systems characterized by the existence of eigen-solutions resulting from complex collective interactions:

If an individual agent (or a small group of agents) in the system disrupts such an eigen-solution, hence destroying its objective character, then this agent becomes, in the same measure, objectively responsible for consequential damages caused by the disruption.

1- Discuss the plausibility of the TrOb principle.
2- Discuss possible justifications for the TrOb principle.
3- Discuss the applicability of the TrOb principle in: a) Economic law; b) Environmental law.
4- Discuss the applicability of TrOb to state actions.
5- Discuss the applicability of TrOb to lost revenues.
6- Discuss the applicability of TrOb to the loss of a chance.
Experiment Design and Philosophy
1) Discuss the possibility of reconciling the objective inference entailed by randomization methods with biased allocation and selection procedures.

2) Discuss the possibility of using optimal selections or allocations obtained by Multi-Objective or Goal Programming, where some (fake or artificial) explanatory variables have randomly generated values.

3) Discuss the possibility of using low-discrepancy selections or allocations obtained by quasi-random or hybrid (scrambled quasi-random) lattices. (A small sketch comparing discrepancies is given after this list.)

4) How can we corroborate the objective character of such inference procedures? For example, what is the importance of sensitivity analyses in these allocations?

5) What kind of protocols are appropriate for such inference procedures?

6) What criteria can be used in balancing the epistemic value of a clinical study against the well-being of the participants? What kind of moral, ethical and legal arguments can be used to support these criteria?
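A minimal sketch of item 3, assuming scipy >= 1.7 for its scipy.stats.qmc module: compare the discrepancy of a scrambled Halton allocation of two covariates with that of a plain pseudo-random allocation. The sample size is arbitrary.

```python
# Low-discrepancy (scrambled Halton) versus pseudo-random allocation.
import numpy as np
from scipy.stats import qmc

n = 64
halton = qmc.Halton(d=2, scramble=True, seed=0).random(n)
pseudo = np.random.default_rng(0).random((n, 2))
print("Halton discrepancy:", qmc.discrepancy(halton))   # typically much lower
print("pseudo discrepancy:", qmc.discrepancy(pseudo))
```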
Art
Make your contribution to the Art Gallery.

Appendix J: Image and Art Gallery
The images in this gallery are somehow related to topics discussed in the main text. They are provided with no fixed or definite interpretation, and are only meant as a stimulus to imagination and creativity. Paraphrasing an aphorism of the poet Fernando Pessoa, – There is no good science that is vague, nor good art or poetry that is not.
Alternatively, paraphrasing Fernando Gomide's interpretation of Luís de Camões: – Good navigation is precise; good life is imprecise.
Additional contributions to the art gallery, many made by students or interested readers, can be found at ∼jstern/books/gallery2.pdf.

Figure JA.1: Wire Walking. The most important thing is not to fear at all.
Figure JA.2: Ludwig Boltzmann, cartoon by K. Przibram. Moving ahead, no matter what.

Figure JA.3: Albert Einstein on his bicycle. Following the gentle curvature of the garden's geometry.

Figure JA.4: Niels Bohr on his bicycle. Complementary pedals must be pushed one at a time.

Figure JA.5: Empirical Science: All at Once! Caution: Do this only at a fully equipped laboratory.