A General Strategy for Physics-Based Model Validation Illustrated with Earthquake Phenomenology, Atmospheric Radiative Transfer, and Computational Fluid Dynamics
Didier Sornette: Institute of Geophysics and Planetary Physics and Department of Earth & Space Sciences, University of California, Los Angeles, CA 90095, USA; Laboratoire de Physique de la Matière Condensée (CNRS UMR 6622), Université de Nice-Sophia Antipolis, 06108 Nice Cedex 2, France; now at D-MTEC, ETH Zürich, CH-8032 Zürich, Switzerland. [email protected]
Anthony B. Davis: Los Alamos National Laboratory, Space & Remote Sensing Group (ISR-2), Los Alamos, NM 87545, USA. [email protected]
James R. Kamm: Los Alamos National Laboratory, Applied Science & Methods Development Group (X-1), Los Alamos, NM 87545, USA. [email protected]
Kayo Ide: Institute of Geophysics and Planetary Physics and Department of Atmospheric & Oceanic Sciences, University of California, Los Angeles, CA 90095, USA. [email protected]
This article is to be published in Lecture Notes in Computational Science and Engineering (Vol. TBD), Proceedings of Computational Methods in Transport, Granlibakken 2006, F. Graziani and D. Swesty (Eds.), Springer-Verlag, New York (NY), 2007.
This article is an augmented version of Ref. [1] by Sornette et al. that appeared in the Proceedings of the National Academy of Sciences.
Summary.
Validation is often defined as the process of determining the degree to which a model is an accurate representation of the real world from the perspective of its intended uses. Validation is crucial as industries and governments depend increasingly on predictions by computer models to justify their decisions. In this article, we survey the model validation literature and propose to formulate validation as an iterative construction process that mimics the process occurring implicitly in the minds of scientists. We thus offer a formal representation of the progressive build-up of trust in the model, and thereby replace incapacitating claims on the impossibility of validating a given model by an adaptive process of constructive approximation. This approach is better adapted to the fuzzy, coarse-grained nature of validation. Our procedure factors in the degree of redundancy versus novelty of the experiments used for validation as well as the degree to which the model predicts the observations. We illustrate the new methodology first with the maturation of Quantum Mechanics as the arguably best established physics theory and then with several concrete examples drawn from some of our primary scientific interests: a cellular automaton model for earthquakes, an anomalous diffusion model for solar radiation transport in the cloudy atmosphere, and a computational fluid dynamics code for the Richtmyer–Meshkov instability.
At the heart of the scientific endeavor, model building involves a slow and arduous selection process, which can be roughly represented as proceeding according to the following steps:

1. start from observations and/or experiments;
2. classify them according to regularities that they may exhibit: the presence of patterns, of some order, also sometimes referred to as structures or symmetries, is begging for "explanations" and is thus the nucleation point of modeling;
3. use inductive reasoning, intuition, analogies, and so on, to build hypotheses from which a model is constructed;
4. test the model obtained in step 3 with available observations, and then extract predictions that are tested against new observations or by developing dedicated experiments.

The model is then rejected or refined by an iterative process, a loop going from step 1 to step 4. A given model is progressively validated by the accumulated confirmations of its predictions by repeated experimental and/or observational tests.

Building and using a model requires a language, i.e., a vocabulary and syntax, to express it. The language can be English or French, for instance, to obtain predicates specifying the properties of and/or relation with the subject(s). It can be mathematics, which is arguably the best language to formalize the relations between quantities, structures, space, and change. It can be a computer language to implement a set of relations and instructions logically linked in a computer code to obtain quantitative outputs in the form of strings of numbers. In this latter version, our primary interest here, validation must be distinguished from verification. Whereas verification deals with whether the simulation code correctly solves the model equations, validation carries an additional degree of trust in the value of the model vis-à-vis experiment and, therefore, may convince one to use its predictions to explore beyond known territories [2].

The validation of models is becoming a major issue as humans are increasingly faced with decisions involving complex tradeoffs in problems with large uncertainties, as for instance in attempts to control the growing anthropogenic burden on the planet within a risk-cost framework [3, 4] based on predictions of models. For policy decisions, national, regional, and local governments increasingly depend on computer models that are scrutinized by scientific agencies to attest to their legitimacy and reliability. Cognizance of this trend and its scientific implications is not lost on the engineering [5] and physics [6] communities.

Our purpose here is to clarify from a physics-based perspective what validation is and to propose a roadmap for the development of a systematic approach to physics-based validation with broad applications. We will focus primarily on the needs of computational fluid dynamics and particle/radiation transport codes. By model, we understand an abstract conceptual construction based on axioms and logical relations developed to extract logical propositions and predictions.

In the remainder of this section, we first review different definitions and approaches found in the literature, positioning ourselves with respect to selected topics or practices pertaining to validation; we then show how the validation problem is related to the mathematical statistics of hypothesis testing and discuss some problems associated with emergent behaviors in complex systems.
In section 2, we list and describe qualitatively the elements required in our vision of model validation as an iterative process where one strives to build trust in the model going from one experiment to the next; however, one must also be prepared to uncover in the model a flaw, which may or may not be fatal. We offer in sections 3–4 our quantitative physics-based approach to model validation, where the relevance of the experiment to the validation process is represented explicitly. (An appendix explores the model validation problem more formally and in a broader context.) Section 5 demonstrates the general strategy for model validation using the historical development of quantum physics—a remarkably clear ideal case. Section 6 uses some research interests of the present authors to further illustrate the validation procedure using less-than-perfect models in geophysics, computational fluid dynamics (CFD), and radiative transfer. We summarize in section 7.
The following definitions are given by the American Institute of Aeronautics and Astronautics [7]:

• Model: A representation of a physical system or process intended to enhance our ability to predict, control, and eventually to understand its behavior.
• Calibration: The process of adjusting numerical or physical modeling parameters in the computational model for the purpose of improving agreement with experimental data.
• Verification: The process of determining that a model implementation accurately represents the developer's conceptual description of the model and the solution of the model.
• Validation: The process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model.

Figure 1, sometimes called a Sargent diagram, shows where validation and several other of the above constructs and stages enter into a complete modeling project. In the concise phrasing of Roache [2], "Verification consists in solving the equations right while validation is solving the right equations."
In the context of the validation of astrophysical simulation codes, Calder et al. [11] add: "Verification and validation are fundamental steps in developing any new technology. For simulation technology, the goal of these testing steps is assessing the credibility of modeling and simulation."

Verification of complex CFD codes usually comprises a suite of standard test problems in the field of fluid dynamics [11]. These include Sod's test [12], the strong shock tube problem [13], the Sedov explosion problem [14], the interacting blast wave problem [15], a shock forced through a jump in mesh refinement, and so on.
Validation of complex CFD codes is usually done by comparison with experiments testing a variety of physical phenomena, including instabilities, turbulent mixing, shocks, etc. Validation requires that the numerical simulations recover the salient qualitative features of the experiments, such as the instabilities, their nonlinear development, the determination of the most unstable modes, and so on. See, for instance, Gnoffo et al. [16].

Fig. 1. Schematic representation of the conventional position of validation in model construction according to Schlesinger [8] and Sargent [9, 10].

Considerable work on verification and validation of simulations has been done in the field of CFD, and in this literature the terms verification and validation have precise, technical meanings [7, 2, 17, 9, 10]. Verification is taken to mean demonstrating that a code or simulation accurately represents the conceptual model. Roache [18] stresses the importance of distinguishing between (i) verification of codes and (ii) verification of calculations. The former is concerned with the correctness of the code. The latter deals with the correctness of the physical equations used in the code. The programming and methods of solution can be correct (verification (i) successful) but they can solve erroneous equations (verification (ii) failure). Validation of a simulation means demonstrating that the simulation appropriately describes Nature. The scope of validation is therefore much larger than that of verification and includes comparison of numerical results with experimental or observational data. In astrophysics, where it is difficult to obtain observations suitable for comparison to numerical simulations, this process can present unique challenges. Roache [op. cit.] goes on to offer the optimistic prognosis that "the problems of Verification of Codes and Verification of Calculations are essentially solved for the case of structured grids, and for structured refinement of unstructured grids. It would appear that one higher level of algorithm/code development is required in order to claim a complete methodology for Verification of Codes and Calculations. I expect this to happen. Within 10 years, and likely much less, Verification of Codes and Calculations ought to be settled questions. I expect that Validation questions will always be with us."
We fully endorse this last sentence, as we will argue further on that validation is akin to the development of "trust" in theories of real phenomena, a never-ending quest.
For these reasons, the possibility of validating numerical models of natural phenomena, often endorsed either implicitly or identified as a reachable goal by natural scientists in their daily work, has been challenged; quoting from Oreskes et al. [19]: "Verification and validation of numerical models of natural systems is impossible. This is because natural systems are never closed and because model results are always non-unique."
According to this view, the impossibility of "verifying" or "validating" models is not limited to computer models and codes but extends to all theories that rely necessarily on imperfectly measured data and auxiliary hypotheses. As Sterman [20] puts it: "Any theory is underdetermined and thus unverifiable, whether it is embodied in a large-scale computer model or consists of the simplest equations."
Accordingly, many uncertainties undermine the predictive reliability of any model of a complex natural system in advance of its actual use. Such "impossibility" statements are reminiscent of other "impossibility theorems." Consider the mathematics of algorithmic complexity [25], which provides one approach to the study of complex systems. Following reasoning related to that underpinning Gödel's incompleteness theorem, most complex systems have been proved to be computationally irreducible, i.e., the only way to predict their evolution is to actually let them evolve in time. Accordingly, the future time evolution of most complex systems appears inherently unpredictable. Such sweeping statements turn out to have basically no practical value. This is because, in physics and other related sciences, one aims at predicting coarse-grained properties. Only by ignoring most molecular detail, for example, did researchers ever develop the laws of thermodynamics, fluid dynamics, and chemistry. Physics works and is not hampered by computational irreducibility because we only ask for approximate answers at some coarse-grained level [26]. By developing exact but coarse-grained procedures on computationally irreducible cellular automata, Israeli and Goldenfeld [27] have demonstrated that prediction may simply depend on finding the right level for describing the system. More generally, we argue that only coarse-grained scales are of interest in practice but their description requires "effective" laws which are in general based on finer scales. In other words, real understanding must be rooted in the ability to predict coarser scales from finer scales, i.e., a real understanding solves the universal micro-macro challenge. Similarly, we propose that validation is possible, to some degree, as explained further on.
Calder et al. [11] also write: "We note that verification and validation are necessary but not sufficient tests for determining whether a code is working properly or a modeling effort is successful. These tests can only determine for certain that a code is not working properly."
This last statement is important because it points to a bridge between the problem of validation and some of the most central questions of mathematical statistics [28], namely, hypothesis testing and statistical significance tests. This connection has been made previously by several other authors [29, 30, 31, 32]. In showing the usefulness of the concepts and framework of hypothesis testing, we depart from Oberkampf and Trucano [33] who mistakenly state that hypothesis testing is a true or false issue, only. Every test of significance begins with a "null" hypothesis H_0, which represents a theory that has been put forward, either because it is believed to be true or because it is to be used, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be: "the new drug is no better, on average, than the current drug." We would write H_0: "there is no difference between the two drugs on average." The alternative hypothesis H_1 is a statement of what a statistical hypothesis test is set up to establish. In the example of a clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write H_1: "the two drugs have different effects, on average." The alternative hypothesis might also be that the new drug is better, on average, than the current drug. Once the test has been carried out, the final conclusion is always given in terms of the null hypothesis. We either "reject H_0 in favor of H_1" or "do not reject H_0." We never conclude "reject H_1," or even "accept H_1." If we conclude "do not reject H_0," this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against H_0 in favor of H_1; rejecting the null hypothesis then suggests that the alternative hypothesis may be true, or is at least better supported by the data. Thus, one can never prove that a hypothesis is true, only that it is wrong by comparing it with another hypothesis. One can also conclude that "hypothesis H_1 is not necessary and another, more parsimonious, one H_2 should be favored." The alternative hypothesis H_1 is not rejected, strictly speaking, but is found unnecessary or redundant with respect to H_2. This is the situation when there are two (or several) alternative hypotheses H_1 and H_2, which can be composite, nested, or non-nested.

Within this framework, the above-mentioned statement by Oreskes et al. [19] that verification and validation of numerical models of natural systems is impossible is hardly news: the theory of statistical hypothesis testing has taught mathematical and applied statisticians for decades that one can never prove a hypothesis or a model to be true. (For further debate and commentary by Oreskes and her co-authors, see refs. [21, 22, 23]; also noteworthy is the earlier paper by Konikov and Bredehoeft [24] for a statement about validation impossibility in the context of groundwater models.) One can only develop an increasing trust in a model by subjecting it to more and more tests that "do not reject it." We attempt to formalize below how such trust can be increased to lead to an asymptotic validation.
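To make this "reject or fail to reject" logic concrete, here is a minimal sketch of the sign test mentioned later in this article. The helper name, the residual convention (observation minus model prediction), and the numbers are ours, chosen purely for illustration:

```python
# Under the null hypothesis H0 that the model is unbiased, each residual is
# equally likely to be positive or negative, so the number of positive signs
# is Binomial(n, 1/2).
from math import comb

def sign_test_p_value(residuals):
    """Two-sided sign test: p-value for H0 'median residual is zero'."""
    nonzero = [r for r in residuals if r != 0.0]   # ties carry no sign information
    n = len(nonzero)
    k = sum(1 for r in nonzero if r > 0.0)         # number of positive residuals
    # Two-sided tail probability of a result at least as extreme as k.
    k_extreme = max(k, n - k)
    tail = sum(comb(n, j) for j in range(k_extreme, n + 1)) / 2.0**n
    return min(1.0, 2.0 * tail)

# Example: 10 residuals, 9 of them positive, so the model under-predicts.
residuals = [0.8, 1.2, 0.3, 2.1, 0.9, 1.7, 0.4, 1.1, 0.6, -0.2]
print(f"p = {sign_test_p_value(residuals):.3f}")
# p ~ 0.021: "reject H0" at the 5% level; we never conclude "accept H1".
```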
The above definitions are useful in recasting the role of code comparison in verification and validation (the Code Comparison Principle or CCP). Trucano et al. [35] are unequivocal on this practice: "the use of code comparisons for validation is improper and dangerous."
We propose to interpret the meaning of CCP for code verification activities (which has been proposed in this literature) as parallel to the problem of hypothesis testing: Can one reject Code […] other steps described below. (For the standard terminology of statistical testing, we refer the reader to V.J. Easton and J.H. McColl, Statistics Glossary. The technical difficulties of hypothesis testing depend on the nested structures of the competing hypotheses; see, for instance, Gourieroux and Monfort [34].) The analogy with hypothesis testing illuminates what CCP actually is: CCP allows the selection of one code among several codes (at least two) but does not help one to draw conclusions about the validity of a given code or model when considered as a unique entity independent of other codes or models. (We should stress that the Sandia Report [35] by Trucano et al. presents an even more negative view of code comparisons because it addresses the common practice in the simulation community that turns to code comparisons rather than bona fide verification or validation, without any independent referents.)

Thus, the fundamental problem of validation is more closely associated with the other class of problems addressed by the theory of hypothesis testing, which consists in the so-called "tests of significance" where one considers only a single hypothesis H_0, and the alternative is "all the rest," i.e., all hypotheses that differ from H_0. In that case, the conclusion of a test can be the following: "this data sample does not contradict the hypothesis H_0," which is not the same as "the hypothesis H_0 is true." In other words, a hypothesis cannot be excluded when it is found sufficient at some confidence level for explaining the available data. This is not to say that the hypothesis is true. It is just that the available data are unable to reject said hypothesis. Restating the same thing in a positive way, the result of a test of significance is that the hypothesis H_0 is "compatible with the available data."

It is implicit in the above discussion that, to compare codes quantitatively in a meaningful way, they must solve the same set of equations using different algorithms, and not just model the same physical system. Indeed, there is nothing wrong with "validating" a numerical implementation of a knowingly approximate approach to a given physical problem. For instance, a (duly verified) diffusion/P_1 transport code can be validated against a detailed Monte Carlo or S_n code. The more detailed model must in principle be validated against real-world data. In turn, it provides validation "data" to the coarser model. Naturally, the coarser (say, P_1 transport) model still needs to establish its relevance to the real-world problem of interest, preferably by comparison with real observations, or at least be invoked only in regimes where it is known a priori to be sufficiently accurate based on comparison with a finer (say, Monte Carlo transport) model.

Two noteworthy initiatives in transport model comparison for non-nuclear applications are the Intercomparison of 3D Radiation Codes (I3RC) [36] (i3rc.gsfc.nasa.gov) and the RAdiation Model Intercomparison (RAMI) [37, 38] (rami-benchmark.jrc.it). The former is focused on the challenge of 3D radiative transfer in the cloudy atmosphere while the latter is about 3D radiative transfer inside plant canopies; both efforts are motivated by issues in remote sensing (especially from space) and radiative energy budget estimation (either in the framework of climate modeling or using observational diagnostics, which typically means more remote sensing). Much has been learned by the modelers participating in these code comparison studies, and the models have been improved on average [39].
Although not connected so far to the engineering community that is at the forefront of V&V standardization and methodology, the I3RC and RAMI communities talk much about "testing," and sometimes "certification," and not so much about "verification" (which would be appropriate) or "validation" (which would not). (In remote sensing science, transport theory for photons plays a central role and "validation" has a special meaning, namely, the estimation of uncertainty for remote sensing products based on "ground-truth," i.e., field measurements of the very same geophysical variables (e.g., surface temperature or reflectivity, vegetation productivity, soil moisture) that the satellite instrument is designed to quantify. These data are collected at the same location as the imagery, if possible at the precision of a single pixel. This type of validation exercise tests both the "forward" radiation transport theory and its "inversion." Atmospheric remote sensing, particularly of clouds, poses a special challenge because, strictly speaking, there is no counterpart of ground-truthing. One must therefore often make do with comparisons of ground-based and space-based remote sensing (say, of the column-integrated aerosol burden) to quantify uncertainty in both operations. In-situ measurements (temperature, humidity, cloud liquid water, etc.) from airborne platforms—balloon or aircraft—are always welcome but collocation is rarely close enough for point-to-point comparisons; statistical agreement is then all that is to be expected, and residuals provide the required uncertainty.)

What about multi-physics codes such as those used routinely in astrophysics, nuclear engineering, or climate modeling? CCP, along with the stern warnings of Trucano et al. [35], applies here, too. Even assuming that all the model components are properly verified or even individually validated, the aggregated model is likely to be too complex to talk about clean verification through output comparison. Finding some level of agreement between two or more complex multi-physics models will naturally build confidence in the whole (community-wide) modeling enterprise. However, this is not to be interpreted as validation of any or all of the individual models.

There are many reasons for wanting to have not just one model on hand but a suite of more or less elaborate ones. A typical collection can range from the mathematically and physically exact but numerically intractable to the analytically solvable, possibly even on the proverbial back-of-an-envelope. We elaborate on and illustrate this kind of hierarchical modeling effort in section A.2 of the Appendix, offering it as an approach where model development is basically simultaneous with its validation.
As previously stated, validation can be characterized as the act of quantifying the credibility of a model to represent phenomena of interest. Virtually all such models contain numerical parameters, the precise values of which are not known a priori and, therefore, must be assigned. Calibration is the process of adjusting those parameters to optimize (in some sense) the agreement between the model results and a specific set of experimental data. Such data necessarily have uncertainties associated with them, e.g., due to natural variability in physical phenomena as well as to unavoidable imprecision of diagnostics. Likewise, there are intrinsic errors associated with the numerical methods used to evaluate many models, e.g., in the approximate solutions obtained from discretization schemes applied to partial differential equations. The approach of defensibly prescribing parameters for complex physical phenomena while incorporating the inescapable variability in these values is called "calibration under uncertainty" [40], a field that poses non-trivial challenges in its own right.

However calibration is approached, it must be undertaken using a set of data—ideally from specifically chosen calibration experiments/observations [41]—that differs from the physical configurations of ultimate interest (i.e., against which the model will be validated). In order to ensure that validation remains independent of calibration, it is imperative that these data sets be disjoint. In the case of large, complex, and costly experiments encountered in many real-world applications, it can be difficult to maintain a scientific "demilitarized zone" between calibration and validation. To not do so, however, risks undermining the scientific integrity of the associated modeling enterprise, the potential predictive power of which may rapidly wither as the validation study devolves into a thinly disguised exercise in calibration.
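The discipline of keeping calibration and validation data disjoint can be illustrated with a toy sketch; the linear model, function names, and data below are invented for illustration and are not from the paper:

```python
# Fit free parameters on a calibration set, then assess predictive skill on a
# separate validation set that was never used in the fit.
def fit_linear(xs, ys):
    """Least-squares calibration of the toy model y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def rms_error(params, xs, ys):
    a, b = params
    return (sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

# Disjoint data sets: calibrate on one, validate on the other.
cal_x, cal_y = [0.0, 1.0, 2.0, 3.0], [0.1, 1.9, 4.2, 5.8]
val_x, val_y = [4.0, 5.0, 6.0], [8.1, 9.7, 12.3]
params = fit_linear(cal_x, cal_y)
print("calibration residual:", rms_error(params, cal_x, cal_y))
print("validation residual: ", rms_error(params, val_x, val_y))  # the honest score
```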
For complex systems, there are many choices to be made regarding experimental and numerical studies in both validation and calibration. The high-level approach of the Phenomena Identification and Ranking Table (PIRT) [42] can be used to heuristically characterize the nature of one's interest in complicated systems. This approach uses expert knowledge to identify the phenomenological components in a system of interest, to rank their (relative) perceived importance in the overall system, and to gauge the (relative) degree to which these component phenomena are perceived to be understood. This rough-and-ready approach can be used to target the choice of validation experiments for the greatest scientific payoff on fixed experimental and simulation budgets. To help guide calibration activities, one can apply the quantitative techniques of sensitivity analysis to rank the relative impact of input parameters on model outcome. Such considerations are particularly important for complex models containing many adjustable parameters, for which it may prove impossible to faithfully calibrate all input parameters.

Saltelli et al. [43, 44] have championed "sensitivity analysis" methods, which come in two basic flavors and many variations. One class of methods uses exact or numerical evaluation of partial derivatives of model output deemed important with respect to input parameters to seek regions of parameter space that might need closer examination from the standpoints of calibration and/or validation. If the model has time dependence, one can follow the evolution of how parameter choices influence the outcome. The alternate methodology uses adjoint dynamical equations to determine the relative importance of various parameters. The publications of Saltelli et al. provide numerous examples illustrating the value and practical impact of sensitivity analysis, as well as references to the wide scientific literature on this subject. The results of numerical studies guided by sensitivity analysis can be used both to focus experimental resources on high-impact experimental studies and to steer future model development efforts.
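The derivative-based flavor of sensitivity analysis just described can be sketched in a few lines; the finite-difference estimator, function names, and toy model below are our own illustrative choices, not an implementation from Saltelli et al.:

```python
# Rank input parameters of a model by the magnitude of the (normalized)
# partial derivative of its output, estimated by central differences.
def local_sensitivities(model, params, rel_step=1e-4):
    """Return {name: dimensionless elasticity dOutput/dParam * param/output}."""
    base = model(params)                 # assumes a nonzero baseline output
    sens = {}
    for name, value in params.items():
        h = rel_step * (abs(value) if value != 0.0 else 1.0)
        up, dn = dict(params), dict(params)
        up[name] = value + h
        dn[name] = value - h
        derivative = (model(up) - model(dn)) / (2.0 * h)
        sens[name] = derivative * value / base
    return sens

# Toy model standing in for an expensive simulation.
def toy_model(p):
    return p["a"] ** 2 * p["b"] + 0.1 * p["c"]

ranked = sorted(local_sensitivities(toy_model,
                                    {"a": 2.0, "b": 1.5, "c": 3.0}).items(),
                key=lambda kv: abs(kv[1]), reverse=True)
print(ranked)  # "a" dominates: focus calibration/validation effort there
```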
In dynamical modeling, initial conditions can be viewed as parameters and, as such, they need to be determined optimally from data. If the dynamical system in question is evolving continuously over time and data become available along the trajectory of the dynamical system, the problem of finding a single initial condition for the entire trajectory becomes increasingly difficult as the time window of the trajectory extends. In fact, it is practically impossible for systems like the atmosphere or ocean, whose dynamics are highly nonlinear, whose high-dimensional models are undoubtedly imperfect, and whose inhomogeneous and sporadic data are subject to (poorly understood) errors.

Data assimilation is an approach that attends to this problem by breaking up the trajectory over (fixed-length) time windows and solving the initialization problem sequentially over one time window at a time as data become available. A novelty of data assimilation is that, rather than solving the initialization problem from scratch, it uses the model forecast as the first guess (the prior) of the initialization (optimization) problem. Once the optimization is completed, the optimal solution (the posterior) becomes the initial condition for the next model forecast.

This iterative Bayesian approach to data assimilation is most effective when the uncertainties in both the prior and the data are accurately quantified, as the system evolves over time and the data assimilation iterates one cycle after another. This is a non-trivial problem, because it requires the estimate of not only the model state but also the uncertainties associated with it, as well as the proper description of the uncertainties in the data.

Numerical weather prediction (NWP) is one of the most familiar application areas of data assimilation—one with major societal impact. The considerable progress in skill of NWP in recent decades has been due to improvements in all aspects of data assimilation [45], i.e., modeling of the atmosphere, quality and quantity of data, and data assimilation methods.

At the time of writing, most operational NWP centers use the so-called "three-dimensional variational method" (3D-Var) [46], which is an economical and accurate statistical interpolation scheme that does not include the effect of uncertainty in the forecast. Some centers have switched to the "four-dimensional variational method" (4D-Var) [47], which incorporates the evolution of uncertainty in a linear sense by the use of the adjoint of the highly nonlinear model. These variational methods always call for the minimization of a cost function (cf. Appendix) that measures the difference between model results and observations throughout some relevant region of space and time. Currently active research areas in data assimilation include the effective and efficient quantification of the time-dependent uncertainties of both the prior and posterior in the analysis. To this end, ensemble Kalman filter methods have recently received considerable attention motivated by future integration into operational environments [48, 49, 50]. As the importance of the uncertainties in data assimilation has become clear, many NWP centers perform ensemble prediction along with the single analysis obtained by the variational methods [51, 52, 53].

Clearly, considerable similarities exist between the data assimilation problem and the model validation problem. Can successful data assimilation be construed as validation of the model? In our opinion, that would be unjustified because the objectives are clearly different for these problems. As stated above, data assimilation admits the imperfection of the model. It explicitly makes use of the knowledge from the previous data assimilation cycle. As the initialization problem is solved iteratively over relatively short time windows, deviations of the model trajectory from the true evolution of the dynamical system in question tend to be small and data can be assimilated into the model without much discrepancy. Moreover, the operational centers perform careful quality-control of the data to eliminate any isolated "outliers" with respect to the model trajectory. Thus, the data assimilation problem differs from the validation problem by design. Nevertheless, it is important to recognize that the resources offered by data assimilation can ensure that models perform well enough for their intended use.
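The forecast/analysis alternation described above can be reduced to a minimal sketch for a scalar state with a standard Kalman update; all dynamics, noise levels, and names are invented for illustration and are far simpler than any operational 3D-Var/4D-Var or ensemble scheme:

```python
# Sequential data assimilation for a scalar, linear toy system: the forecast
# serves as the prior, an observation corrects it (Kalman gain), and the
# posterior (the "analysis") initializes the next forecast.
import random

def assimilation_cycle(x0, P0, obs, M=0.95, Q=0.02, H=1.0, R=0.25):
    """x0, P0: initial state and variance; obs: one observation per window;
    M: model propagator; Q: model-error variance; H: observation operator;
    R: observation-error variance."""
    x, P = x0, P0
    for y in obs:
        # Forecast step: propagate state and uncertainty (the prior).
        x, P = M * x, M * P * M + Q
        # Analysis step: Kalman update (the posterior).
        K = P * H / (H * P * H + R)          # Kalman gain
        x = x + K * (y - H * x)              # innovation-weighted correction
        P = (1.0 - K * H) * P
        yield x, P

random.seed(0)
truth, ys = 1.0, []
for _ in range(20):                          # synthetic truth and noisy data
    truth = 0.95 * truth + random.gauss(0.0, 0.1)
    ys.append(truth + random.gauss(0.0, 0.5))
for i, (x, P) in enumerate(assimilation_cycle(0.0, 1.0, ys)):
    print(f"window {i:2d}: analysis = {x:+.3f}, variance = {P:.3f}")
```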
A qualitatively new class of problems arises in fields such as the geosciences that deal with the construction of knowledge of a unique object, planet Earth, whose full scope and range of processes can be replicated or controlled neither in the laboratory nor in a supercomputer. This has led recently to championing the relevance of "systemic" (meaning "system approach"), also called "complex system," approaches to the geosciences. In this framework, positive and negative feedbacks (and even more complicated nonlinear multiplicative noise processes) entangle many different mechanisms, whose impact on the overall organization can be neither assessed nor understood in isolation. How does one validate a model using the systemic approach? This very interesting and difficult question is at the core of the problem of validation. How does one validate a model when it is making predictions on objects that are not fully replicated in the laboratory, either in the range of variables, of parameters, or of scales? For instance, this question is crucial

• in scaling the physics of material and rock rupture tested in the laboratory to the scale of earthquakes;
• in scaling the knowledge of hydrodynamical processes quantified in the laboratory to the length and time scales relevant to atmospheric/oceanic weather and climate, not to mention astrophysical systems;
• in the science-based stewardship of the nuclear arsenal, where the challenge is to go from many component models tested at small scales in the laboratory to the full-scale explosion of an aging nuclear weapon.

The same issue arises in the evaluation of electronic circuits. In 2003, Allen R. Hefner, Founder and Chairman of the NIST/IEEE Working Group on Model Validation, wrote in its description: "The problem is that there is no systematic way to determine the range of applicability of the models provided within circuit simulator component libraries." See the full-page boxed text for the complete version of this interesting text, as well as Ref. [54]. This example of validation of electronic circuits is particularly interesting because it stresses the origin of the difficulties inherent in validation: the fact that the dynamics are nonlinear and complex with threshold effects and do not allow for a simple-minded analytic approach consisting in testing a circuit component by component. Extrapolating, this same difficulty is found in validating general circulation models of the Earth's climate or computer codes of nuclear explosions. The problem is thus fundamentally a "system" problem. The theory of systems, sometimes referred to as the theory of complex systems, is still in its infancy but has shown the existence of surprises. The biggest surprise may be the phenomenon of "emergence," in which qualitatively new processes or phenomena appear in the collective behavior of the system, while they cannot be derived or guessed from the behavior of each element. The phenomenon of "emergence" is similar to the philosophical law on the "transfer of the quantity into the quality." How does one validate a model of such a system? Validation therefore requires an understanding of this emergence phenomenon.

From another angle, the problem is that of extrapolating a body of knowledge, which is firmly established only in some limited ranges of variables, parameters, and scales, beyond this clear domain into a fuzzier zone of unknowns.
This problem has appeared, and appears again and again, in different guises in practically all scientific fields. A particularly notable domain of application is risk assessment; see, for instance, Kaplan and Garrick's classic paper on risks [55], and the instructive history of quantitative risk analysis in US regulatory practice [56], especially in the US nuclear power industry [57, 58, 59, 60]. An acute question in risk assessment is that of quantifying the potential for a catastrophic event (earthquake, tornado, hurricane, flood, huge solar mass ejection, large meteorite, industrial plant explosion, ecological disaster, financial crash, economic collapse, etc.) of amplitude never yet sampled, from the knowledge of past history and present understanding. To tackle this enduring question, each discipline has developed its own strategies, often being unaware of the approaches of others. Here, we attempt a formulation of the problem, and outline some general directions of attack, that hopefully will transcend the specificities of each discipline. Our goal is to formulate the validation problem in a way that may encourage productive crossings of disciplinary lines between different fields by recognizing the commonalities of the blocking points, and suggest useful guidelines.
In a generic exercise in model validation, one performs an experiment and, in parallel, runs the calculations with the available model. A comparison between the measurements of the experiment and the outputs of the model calculations is then performed. This comparison uses some metrics controlled by experimental feasibility, i.e., what can actually be measured. One then iterates by refining the model until (admittedly subjective) satisfactory agreement is obtained. Then, another set of experiments is performed, which is compared with the corresponding predictions of the model. If the agreement is still satisfactory without modifying the model, this is considered progress in the validation of the model. Iterating with experiments testing different features of the model corresponds to mimicking the process of construction of a theory in physics [61]. As the model is exposed to increasing scrutiny and testing, the testers develop a better understanding of the reliability (and limitations) of the model in predicting the outcome of new experimental and/or observational set-ups. This implies that "validation activity should be organized like a project, with goals and requirements, a plan, resources, a schedule, and a documented record" [6].

Extending previous work [29, 30, 31, 32], we thus propose to formulate the validation problem of a given model as an iterative construction that embodies the often implicit process occurring in the minds of scientists:

1. One starts with an a priori trust quantified by the value V_prior in the potential value of the model. This quantity captures the accumulated evidence thus far. If the model is new or the validation process is just starting, take V_prior = 1. As we will soon see, the absolute value of V_prior is unimportant but its relative change is important.

2. An experiment is performed, the model is set up to calculate what should be the outcome of the experiment, and the comparison between these predictions and the actual measurements is made either in model space or in observation space. The comparison requires a choice of metrics.

3. Ideally, the quality of the comparison between predictions and observations is formulated as a statistical test of significance in which a hypothesis (the model) is tested against the alternative, which is "all the rest." Then, the formulation of the comparison will be either "the model is rejected" (it is not compatible with the data) or "the model is compatible with the data." In order to implement this statistical test, one needs to attribute a likelihood p(M|y_obs) or, more generally, a metric-based "grade" that quantifies the quality of the comparison between the predictions of the model M and observations y_obs. This grade is compared with the reference likelihood q of "all the rest." Examples of implementations include the sign test and the tolerance interval methods. In many cases, one does not have the luxury of a likelihood; one has then to resort to more empirical assessments of how well the model explains crucial observations. In the most complex cases, the outcome can be binary (accepted or rejected).

4. The posterior value of the model is obtained according to a formula of the type

   V_posterior / V_prior = F[p(M|y_obs), q; c_novel].   (1)

In this expression, V_posterior is the posterior potential, or coefficient, of trust in the value of the model after the comparison between the predictions of the model and the new observations has been performed.
By the action of F[···], V_posterior can be either larger or smaller than V_prior: in the former case, the experimental test has increased our trust in the validity of the model; in the latter case, the experimental test has signaled problems with the model. One could call V_prior and V_posterior the evolving "potential value of our trust" in the model or, loosely paraphrasing the theory of decision making in economics, the "utility" of the model [63].

The transformation from the potential value V_prior of the model before the experimental test to V_posterior after the test is embodied in the multiplier F, which can be either larger than 1 (towards validation) or smaller than 1 (towards invalidation). We postulate that F depends on the grade p(M|y_obs), to be interpreted as proportional to the probability of the model M given the data y_obs. It is natural to compare this probability with the reference likelihood q that one or more of all other conceivable models is compatible with the same data. (Pal and Makai [62] have used the mathematical statistics of hypothesis testing as a way to validate the correctness of a code simulating the operation of a complex system with respect to a level of confidence for safety problems. Their main conclusion is that testing the input variables separately may lead to incorrect safety-related decisions with unforeseen consequences. They have used two statistical methods, the sign test and the tolerance interval method, for testing more than one mutually dependent output variable. We propose to use these and similar tests delivering a probability level p which can then be compared with a pre-defined likelihood level q.)

Our multiplier F depends also on a parameter c_novel that quantifies the importance of the test. In other words, c_novel is a measure of the impact of the experiment or of the observation, that is, how well the new observation explores novel "dimensions" of the parameter and variable spaces of both the process and the model that can reveal potential flaws. A fundamental challenge is that the determination of c_novel requires, in some sense, a pre-existing understanding of the physical processes so that the value of a new experiment can be fully appreciated. In concrete situations, one has only a limited understanding of the physical processes, and the value of a new observation is only assessed after a long learning phase, after comparison with other observations and experiments, as well as after comparison with the model, making c_novel possibly self-referencing. Thus, we consider c_novel as a judgment-based weighting of experimental referents, in which judgment (for example, by a subject matter expert) is dominant in its determination. The fundamental problem is to quantify the relevance of a new experimental referent for validation to a given decision-making problem, given that the experimental domain of the test does not overlap with the application domain of the decision. Assignment of c_novel requires the judgment of subject matter experts, whose opinions will likely vary. This variability must be acknowledged (if not accounted for, however naively) in assigning c_novel. Thus, providing an a priori value for c_novel, as required in expression (1), remains a difficult and key step in the validation process.
This difficulty is similar to specifying the utility function in decision making [63].

Repeating an experiment twice is a special degenerate case since it amounts ideally to increasing the size of the statistical sample. In such a situation, one should aggregate the two experiments 1 and 2 (yielding the relative likelihoods p_1/q_1 and p_2/q_2, respectively) graded with the same c_novel into an effective single test with the same c_novel and likelihood (p_1/q_1)(p_2/q_2). This is the ideal situation, as there are cases where repeating an experiment may wildly increase the evidence of systemic uncertainty or demonstrate uncontrolled variability or other kinds of problems. When this occurs, it means that the assumption that there is no surprise, no novelty, in repeating the experiment is incorrect. Then, the two experiments should be treated so as to contribute two multipliers F, because they reveal different kinds of uncertainty that can be generated by ensembles of experiments.

One experimental test corresponds to an entire loop 1–4 described above, transforming a V_prior into a V_posterior according to (1). This V_posterior becomes the new V_prior for the next test, which will transform it into another V_posterior and so on, according to the following iteration process:

   V_prior^(1) → V_posterior^(1) = V_prior^(2) → V_posterior^(2) = V_prior^(3) → ··· → V_posterior^(n).   (2)

After n validation loops, we have a posterior trust in the model given by

   V_posterior^(n) / V_prior^(1) = F[p^(1)(M|y_obs^(1)), q^(1); c_novel^(1)] ··· F[p^(n)(M|y_obs^(n)), q^(n); c_novel^(n)],   (3)

where the product is time-ordered since the sequence of values for c_novel^(j) depends on preceding tests, with c_novel^(j) quantifying the novelty of the j-th test with respect to those preceding it. (In full generality, each new F multiplier should be a function of all previous tests.) Validation can be said to be asymptotically satisfied when the number of steps n and the final value V_posterior^(n) are sufficiently high. How high is high enough is subjective and may depend on both the application and programmatic constraints. The concrete examples discussed below offer some insight on this issue. (This sequence is reminiscent of a branching process: most of the time, after the first or second validation loop, the model will be rejected if V_posterior^(n) becomes much smaller than V_prior^(1). The occurrence of a long series of validation tests is specific to those rare models/codes that happen to survive. We conjecture that the nature of models and their tests makes the probability of survival up to level n a power law decaying as a function of validation generation number n: Pr[V_posterior^(n) ≥ V_prior^(1)] ∼ 1/n^τ for large n, with exponent τ = 3/2 […].)
The loop 1–4 is iterated until V_posterior has grown to a level at which most experts will be satisfied and will believe in the validity of (i.e., be inclined to trust) the model. This formulation has the advantage of viewing the validation process as a convergence or divergence built on a succession of steps, mimicking the construction of a theory of reality. (It is conceivable that a new and radically different observation/experiment may arise and challenge the built-up trust in a model; such a scenario exemplifies how any notion of validation "convergence" is inherently local.) Expression (3) embodies the progressive build-up of trust in a model or theory. This formulation provides a formal setting for discussing the difficulties that underlie the so-called impossibilities [19, 21] in validating a given model. Here, these difficulties are not only partitioned but quantified:

• in the definition of "new" non-redundant experiments (parameter c_novel),
• in choosing the metrics and the corresponding statistical tests quantifying the comparison between the model and the measurements of this experiment (leading to the likelihood ratio p/q), and
• in iterating the procedure so that the product of the gain/loss factors F[···] obtained after each test eventually leads to a clear-cut conclusion after several tests.

This formulation makes clear why and how one is never fully convinced that validation has been obtained: it is a matter of degree, of confidence level, of decision making, as in statistical testing. But this formulation helps in quantifying what new confidence (or distrust) is gained in a given model. It emphasizes that validation is an ongoing process, similar to the never-ending construction of a theory of reality.

The general formulation proposed here in terms of iterated validation loops is intimately linked with decision theory based on limited knowledge: the decision to "go ahead" and use the model is fundamentally a decision problem based on the accumulated confidence embodied in V_posterior. The "go/no-go" decision must take into account conflicting requirements and compromise between different objectives. Decision theory was created by the statistician Abraham Wald in the late forties [65], but is based ultimately on game theory [63, 66]. Wald used the term loss function, which is the standard terminology used in mathematical statistics. In mathematical economics, the opposite of the loss (or cost) function gives the concept of the utility function, which quantifies (in a specific functional form) what is considered important and robust in the fit of the model to the data. We use V_posterior in an even more general sense than "utility," as a decision- and information-based valuation that supports risk-informed decision-making based on "satisficing" (see the concrete examples discussed below).

It may be tempting to interpret the above formulation of the validation problem in terms of Bayes' theorem

   p_posterior(M|Data) = p_prior(M) × Pr(Data|M) / Pr(Data),   (4)

where Pr(Data|M) is the likelihood of the data given the model M, and Pr(Data) is the unconditional likelihood of the data. However, we cannot make immediate sense of Pr(Data). Only when a second model M′ is introduced can we actually calculate

   Pr(Data) = p_prior(M) Pr(Data|M) + p_prior(M′) Pr(Data|M′).   (5)
In other words, Bayes' formulation requires that we set a model/hypothesis in opposition to another or other ones, while we examine here the case of a single hypothesis in isolation.

We therefore stress that one should resist the urge to equate our V_prior and V_posterior with p_prior and p_posterior because they are not probabilities. It is not possible to assign a probability to an experiment in an absolute way, and thus Bayes' theorem is mute on the validation problem as we have chosen to formulate it. Rather, we propose that the problem of validation is fundamentally a problem of decision theory: at what stage is one willing to bet that the code will work for its intended use? At what stage are you ready to risk your reputation, your job, the lives of others, your own life on the fact that the model/code will predict correctly the crucial aspect of the real-life test? One must therefore incorporate ingredients of decision theory, and not only fully objective probabilities. Coming from a Bayesian perspective, p_prior and p_posterior could then be called the potential value or trust in the model/code or, as we prefer, to move closer to the application of decision theory in economics, the utility of the model/code [63].

To summarize the discussion so far, expression (1) may be reminiscent of a Bayesian analysis; however, it does not manipulate probabilities. (Instead, they appear as independent variables, viz., p(M|y_obs) and q.) In the Bayesian methodology of validation [69, 70], only comparisons between models can be performed, due to the need to remove the unknown probability of the data in Bayes' formula. In contrast, our approach provides a value for each single model independently of the others. In addition, it emphasizes the importance of quantifying the novelty of each test and takes a more general view on how to use the information provided by the goodness-of-fit. The valuation (1) of a model uses probabilities as partial inputs, not as the qualifying criteria for model validation. This does not mean, however, that there are no uncertainties in these quantities or in the terms F, q, or c_novel, or that aleatory and epistemic uncertainties are ignored, as discussed below. (In economics, satisficing is a behavior that attempts to achieve at least some minimum level of a particular variable, but that does not strive to achieve its maximum possible value. The verb "to satisfice" was coined by Herbert A. Simon in his theory of bounded rationality [67, 68]. For an in-depth discussion of aleatory versus systemic (a.k.a. epistemic) uncertainties, see for example the Review of Recommendations for Probabilistic Seismic Hazard Analysis: Guidance on Uncertainty and Use of Experts.)
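A worked numeric instance of Eqs. (4)-(5) makes the need for a second model explicit; all numbers below are arbitrary illustrations of ours:

```python
# Bayes' rule only yields a posterior once a second model M' is introduced
# to normalize Pr(Data), as in Eq. (5).
p_prior_M, p_prior_Mp = 0.5, 0.5      # prior weights on models M and M'
lik_M, lik_Mp = 0.08, 0.02            # Pr(Data | M) and Pr(Data | M')

# Eq. (5): unconditional likelihood of the data over the two-model universe.
pr_data = p_prior_M * lik_M + p_prior_Mp * lik_Mp     # = 0.05

# Eq. (4): posterior for each model.
p_post_M = p_prior_M * lik_M / pr_data                # = 0.8
p_post_Mp = p_prior_Mp * lik_Mp / pr_data             # = 0.2
print(p_post_M, p_post_Mp)  # posteriors are only relative to the pair {M, M'}
```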
The multiplier F[p(M|y_obs), q; c_novel] should have the following properties:

1. If the statistical test(s) performed on the given observations is (are) passed at the reference level q, then the posterior potential value is larger than the prior potential value: F > 1 (resp. F ≤ 1) for p > q (resp. p ≤ q), which can be written succinctly as ln F / ln(p/q) > 0.

2. The multiplier increases with the grade of the model: ∂F/∂p > 0, for a given q. There could be a saturation of the growth of F for large p/q, which can be either that F < ∞ as p/q → ∞ or of the form of a concavity requirement ∂²F/∂p² < 0 for large p/q: obtaining a quality of fit beyond a certain level should not be attempted.

3. The larger the statistical level at which the test(s) performed on the given observations is (are) passed, the larger the impact of a "novel" experiment on the multiplier enhancing the prior into the posterior potential value of the model: ∂F/∂c_novel > 0 (resp. ≤ 0) for p > q (resp. p ≤ q).

A very simple multiplier that obeys these properties (not including the saturation of the growth of F) is given by

   F[p(M|y_obs), q; c_novel] = (p/q)^(c_novel),   (6)

and is illustrated in the upper panel of Fig. 2 as a function of p/q and c_novel. This form provides an intuitive interpretation of the meaning of the experiment impact parameter c_novel. A non-committal evaluation of the novelty of a test would be c_novel = 1, thus F = p/q and the chain (3) reduces to a product of normalized likelihoods, as in standard statistical tests.
A value c_novel > 1 (resp. < 1) for a given experiment describes a nonlinearly rapid (resp. slow) updating of our trust V as a function of the grade p/q of the model with respect to the observations. In particular, a large value of c_novel corresponds to the case of "critical" tests. (A momentous example is the Michelson-Morley experiment for the Theory of Special Relativity. For the Theory of General Relativity, it was the observation during the famous 1919 solar eclipse of the bending of light rays from distant stars by the Sun's mass, and the elegant explanation of the anomalous precession of the perihelion of Mercury's orbit.) Note that the parameterization of c_novel in (6) should account for the decreased novelty noted above occurring when the same experiment is repeated two or more times. The value of c_novel should be reduced for each repetition of the same test; moreover, the value of c_novel should approach unity as the number of repetitions increases.
Fig. 2. The multipliers defined by (6) and (7) are plotted as functions of p/q and c_novel in the upper and lower panels, respectively. Note the vertical log scale used for the multiplier (6) in the top panel.

An alternative multiplier,

   F[p(M|y_obs), q; c_novel] = [tanh(p/q + 1/c_novel) / tanh(1 + 1/c_novel)]^4,   (7)

is plotted in the lower panel of Fig. 2 as a function of p/q and c_novel. It emphasizes that F saturates as a function of p/q and c_novel as either one or both of them grow large. A completely new experiment corresponds to c_novel → ∞ so that 1/c_novel = 0 and thus F tends to [tanh(p/q)/tanh(1)]^4, i.e., V_posterior/V_prior is only determined by the quality of the "fit" of the data by the model quantified by p/q. A finite c_novel thus implies that one already takes a restrained view on the usefulness of the experiment since one limits the amplitude of the gain F = V_posterior/V_prior, whatever the quality of the fit of the data by the model. The exponent 4 in (7) has been chosen so that the maximum confidence gain F is equal to tanh(1)^(−4) ≈ 3, attained for a completely new experiment (c_novel = ∞) and a perfect fit (p/q → ∞). In contrast, the multiplier F can be arbitrarily small as p/q → 0 (for c_novel → ∞). For a finite novelty c_novel, a test that fails the model miserably
0) does not necessarily reject the model completely: unlike the expression in(6), F remains greater than zero. Indeed, if the novelty c novel is small, the worst-case multiplier (attained for p/q = 0) is [tanh (1 /c novel ) / tanh (1 + 1 /c novel )] ≈ − . − /c novel , which is only slightly less than unity if c novel ≪
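The saturation and asymmetry properties just described are easy to verify numerically. Here is a minimal sketch (function name ours) of the multiplier (7), checking the ceiling tanh(1)^{−4} ≈ 2.9 and the near-neutral worst case of a low-novelty test:

```python
import math

def multiplier_tanh(p_over_q: float, c_novel: float) -> float:
    """Multiplier of Eq. (7):
    F = [tanh(p/q + 1/c_novel) / tanh(1 + 1/c_novel)]^4.
    Saturates for large p/q and large c_novel; never reaches zero
    for finite novelty, unlike the power-law form (6).
    """
    inv_c = 1.0 / c_novel
    return (math.tanh(p_over_q + inv_c) / math.tanh(1.0 + inv_c)) ** 4

# Ceiling: completely new experiment (c_novel -> infinity) and
# perfect fit (p/q -> infinity) give F -> tanh(1)^(-4) ~ 2.9.
print(math.tanh(1.0) ** -4)        # 2.97...
print(multiplier_tanh(1e6, 1e6))   # approaches the same ceiling
# Worst case of an unimportant test (p/q = 0, c_novel << 1): nearly 1.
print(multiplier_tanh(0.0, 0.1))   # ~1.0, i.e., essentially neutral
```

Note the built-in asymmetry: gains are capped at roughly a factor of 2.9 per test, whereas a failed critical test (large c_novel with p/q ≈ 0) drives F arbitrarily close to zero.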
In short, this formulation does not heavily weight unimportant tests, as seems intuitively appropriate.

In the framework of decision theory, expression (1), with one of the specific expressions (6) or (7), provides a parametric form for the utility or decision "function" of the decision maker. It is clear that many other forms of the utility function can be used, with the constraint, however, of keeping the salient features of expression (1) with (6) or (7): the impact of a new test given past tests, and the quality of the comparison between the model predictions and the data. This indetermination is helpful, since it mirrors the inherent variability of the validation landscape. For instance, what comprises adequate validation for phenomena at one (e.g., macro-) scale may prove inadequate for related phenomena at another (e.g., micro-) scale.

Finally, we remark that the proposed form for the multiplier (7) contains an important asymmetry between gains and losses: the failure of a single test with strong novelty and significance cannot be compensated by the success of all the other tests combined. In other words, a single test is enough to reject a model. (See, e.g., the impact of localized seismicity on faults in the case of the Olami-Feder-Christensen model discussed below, or that of the "leverage" effect in quantitative finance for the Multifractal Random Walk model described and evaluated in Ref. [1].) This encapsulates the common lore that reputation gain is a slow process requiring constancy and tenacity, while its loss can occur suddenly with one single failure and is difficult to re-establish. We believe that the same applies to the build-up of trust in, and thus validation of, a model.

Determining p/q and c_novel

These two crucial elements of a validation step are conditioned by four basic problems, over which one can exert at least partial control. In particular, they address the two sources of uncertainty: "reducible" or epistemic (i.e., due to lack of knowledge) and "irreducible" or aleatory (i.e., due to variability inherent in the phenomenon under consideration). In a nutshell, as becomes clearer below, the comparison between p and q is more concerned with the aleatory uncertainty, while c_novel deals in part with the epistemic uncertainty. In the following, as in the two examples (6) and (7), we consider that p and q enter only in the form of their ratio p/q. This need not generally be the case but, given the many uncertainties, this restriction simplifies the analysis by removing one degree of freedom.

1. How to model?
This addresses model construction: it involves the structure of the elementary contributions and their hierarchical organization, and it requires dealing with uncertainties and fuzziness. This concerns the epistemic uncertainty.

2. What to measure?
This relates to the nature of c_novel: ideally, one should adaptively target the observations to "sensitive" parts of the system and the model (as, e.g., Palmer et al. [72] did for atmospheric dynamics). Targeting observations could be directed by the desire to access the most "relevant" information, as well as to get information that is the most reliable, i.e., contaminated by the smallest errors. This is also the stance of Oberkampf and Trucano [33]: "A validation experiment is conducted for the primary purpose of determining the validity, or predictive accuracy, of a computational modeling and simulation capability. In other words, a validation experiment is designed, executed, and analyzed for the purpose of quantitatively determining the ability of a mathematical model and its embodiment in a computer code to simulate a well-characterized physical process." In practice, we view c_novel as an estimate of the importance of the new observation and of the degree of "surprise" it brings to the validation step. Being the cornerstone of our formal approach to validation, we eventually want to see its determination grounded in sensitivity and/or PIRT analysis (cf. section 1.6). The epistemic uncertainty alluded to above is partially addressed in the choice of the empirical data and its rating with c_novel (see the examples of application discussed below).

3. How to measure?
For given measurements or experiments, the problem is to find the "optimal" metric or cost function (involved in the quality-of-fit measure p) for the intended use of the model. The notion of optimality needs to be defined. It could capture a compromise between fitting best the important features of the data (what is "important" may be decided on the basis of previous studies and understanding, of other processes, or of programmatic concerns) and minimizing the extraction of spurious information from noise. The latter requires one to have a precise idea of the statistical properties of the noise; if such knowledge is not available, the cost function should be chosen accordingly. The choice of the cost function involves the choice of how to look at the data. For instance, one may want to expand the measurements at multiple scales using wavelet decompositions and compare predictions and observations scale by scale, or in terms of multifractal spectra of the physical fields estimated from these wavelet decompositions [73] or from other methods. The general idea here is that, given complex observation fields, it is appropriate to unfold the data on a variety of "metrics," which can then be used in the comparison between observations and model predictions. The question is then: How well is the model able to reproduce the salient multi-scale and multifractal properties derived from the observations? The physics of turbulent fields and of complex systems has offered many such new tools with which to unfold complex fields according to different statistics. Each of these statistics offers a metric with which to compare observations with model predictions and is associated with a cost function focusing on a particular feature of the process. Since these metrics derive from the understanding that turbulent fields obey strong constraints in their organization, they can justifiably be called "physics-based." In practice, p, and eventually p/q, has to be inferred as an estimate of the degree of matching between the model output and the observation. This can be done following the concept of fuzzy logic, in which one replaces the yes/no pass test by a more gradual quantification of matching [74, 75]. We thus concur with Oberkampf and Barone [76], while our general methodology goes beyond theirs. Note that this discussion relates primarily to the aleatory uncertainty.

4. How to interpret the results?
This question relates to defining the test and the reference probability level q at which any model other than the one under scrutiny could explain the data. The interpretation of the results should aim at detecting the "dimensions" that are missing, misrepresented, or erroneous in the model (systemic/epistemic uncertainty). What tests can be used to betray the existence of hidden degrees of freedom and/or dimensions? This is the hardest problem. It can sometimes possess an elegant solution, when a given model is embedded in a more general one: the limitation of the more restricted model then becomes clear from the vantage of the more general model.

We refer to the Appendix for further thoughts on these four basic steps in model construction and validation, in a broader context than our present formulation.

We now illustrate our algorithmic approach to model validation using the historical development of quantum mechanics and three examples based on the authors' research activities. In these crude but revealing examples, we will use the form (7) and consider three finite values: c_novel = 1 (marginally useful new test), c_novel = 10 (substantially new test), and c_novel = 100 (important new test). When a likelihood test is not available, we propose to use three possible marks: p/q = 0.1 (poor fit), p/q = 1 (marginally good fit), and p/q = 10 (good fit). Extreme values (c_novel or p/q equal to 0 or ∞) have already been discussed. Given our limited experience with this approach, we propose these ad hoc values in the following examples of its application.

The Maturation of Quantum Mechanics

Quantum mechanics (QM) offers a vivid incarnation of how a model can turn progressively into a theory held "true" by almost all physicists. Since its birth, QM has been tested again and again, because it presents a view of "reality" that is shockingly different from the classical view experienced at the macroscopic scale. QM prescriptions and predictions often go against (classically trained) intuition. Nevertheless, we can state that, by a long and thorough process of confirmed predictions of QM in experiments, fueled by the imaginative set-up of paradoxes, QM has been validated as a correct description of nature. It is fair to say that the overwhelming majority of physicists have developed a strong trust in the validity of QM: if someone comes up with a new test based on a new paradox, most physicists would bet that QM will give the right answer with a very high probability. It is thus by the on-going testing, and by the compatibility of the predictions of QM with the observations, that QM has been validated. As a consequence, one can use it with strong confidence to make predictions in novel directions. This is ideally the situation one would like to attain for the problem of validation of all models, those discussed in the following section in particular. We now give a very partial list of selected tests that established the trust of physicists in QM.

1. Pauli's exclusion principle states that no two identical fermions (particles with half-integer values of spin) may occupy the same quantum state simultaneously [77]. It is one of the most important principles in quantum physics, primarily because the three types of particle from which ordinary matter is made, electrons, protons, and neutrons, are all subject to it. With c_novel = 100 and perfect agreement in numerous experiments (p/q = ∞), this leads to F^{(1)} = 2.9.

2. The Einstein-Podolsky-Rosen paradox and the subsequent experimental tests of Bell's inequalities, pitting QM against local hidden-variables theories, deserve c_novel = 100 at a minimum.
The QM prediction turned out to be correct, winning over the hidden-variables theories [80, 81] (p/q = ∞), leading again to F^{(2)} = 2.9 (c_novel = 100).

3. The prediction of the Aharonov-Bohm effect provided another strongly counter-intuitive test (c_novel = 100). The Aharonov-Bohm oscillations were observed in ordinary (i.e., not superconducting) metallic rings, showing that electrons can maintain quantum-mechanical phase coherence in ordinary materials [82, 83]. This yields p/q = ∞ and thus F^{(3)} = 2.9.

4. The prediction of the Josephson effect also deserves c_novel = 100, and the numerous verifications and applications (for instance in SQUIDs, Superconducting QUantum Interference Devices) argue for p/q = ∞ and thus F^{(4)} = 2.9, as usual.

5. The prediction of the possible collapse of a gas of atoms at low temperature into a single quantum state, known as Bose-Einstein condensation, again goes so much against classical intuition (c_novel = 100). Atoms are indeed bosons (particles with integer values of spin), which are not subjected to the Pauli exclusion principle evoked in the first test above. Bose-Einstein condensation was eventually observed at temperatures of order 10^{−7} K [84] (p/q = ∞), leading once more to F^{(5)} = 2.9 (c_novel = 100).

6. Nonlinear generalizations of quantum mechanics have been proposed, which make predictions departing from those of the standard (linear) theory. Experimental tests of the neutron prediction rejected the nonlinear version in favor of standard QM [85] (p/q = ∞), leading to F^{(6)} = 2.9.

7. A stringent upper bound was later placed on the fraction of the energy of the rf transition in Be ions that could be due to nonlinear corrections to quantum mechanics [86]. We assign c_novel = 10, with p/q = 10, to this result, leading to F^{(7)} = 2.4.
Although less than F^{(1−6)} = 2.9, this is still an impressive score. Combining the multipliers according to (3) leads to V^{(8)}_posterior/V^{(1)}_prior ≃ 1.4 × 10^3.

A Cellular Automaton Model for Earthquakes: the Olami-Feder-Christensen (OFC) Model

This is perhaps the simplest sand-pile model of self-organized criticality that exhibits a phenomenology resembling real seismicity [88]. Figure 3 shows a "stress" map generated by the OFC model immediately after a large avalanche (main shock), at two magnifications, to illustrate the rich organization of almost-synchronized regions [89]. To validate the OFC model, we examine the properties and predictions of the model that can be compared with real seismicity, together with our assessment of their c_novel and quality of fit. We are careful to state these properties in an ordered way, as specified in the above sequences (2)–(3).

1. The statistical physics community recognized the discovery of the OFC model as an important step in the development of a theory of earthquakes: without a conservation law (which had previously been thought to be an essential condition), it nevertheless exhibits a power-law distribution of avalanche sizes resembling the Gutenberg-Richter law [88]. On the other hand, many other models with different mechanisms can explain observed power-law distributions [91]. We thus attribute only c_novel = 10 to this evidence. Because the power-law distribution obtained by the model is of excellent quality for a suitable value of the conservation parameter α, we attribute p/q = ∞ (perfect fit). Expression (7) then gives F^{(1)} = 2.4.
Fig. 3. Map of the "stress" field generated by the OFC model immediately after a large avalanche (main shock), at two magnifications. The upper panel shows the whole grid of size 1024, and the lower plot represents a subset of the grid delineated by the square in the upper plot. Adapted from Ref. [90].

2. Predictions of the OFC model concerning foreshocks and aftershocks, and their exponents for the inverse and direct Omori laws. These predictions are twofold [90]: (i) the finding of foreshocks and aftershocks with similar qualitative properties, and (ii) their inverse and direct Omori rates. The first aspect deserves a large c_novel = 100, as the observation of foreshocks and aftershocks came as a rather big surprise in such sand-pile models [92]. The clustering in time and space of the foreshocks and aftershocks is qualitatively similar to real seismicity [90], which warrants p/q = 10, and thus F^{(2a)} = 2.9. The second aspect is secondary compared with the first one (c_novel = 1). Since the exponents are only qualitatively reproduced (with no formal likelihood test available), we take p/q = 0.1. This leads to F^{(2b)} = 0.47.

3. A further statistical property of the model's aftershock sequences can be compared with real catalogs. We attribute c_novel = 10, as this observation is rather new but not completely independent of the Omori law. The fit is good, so we grant a grade p/q = 10, leading to F^{(3)} = 2.4.

4. A fourth test bears on an effect whose very presence in real data is debated (c_novel = 1); the large uncertainties suggest a grade p/q = 1 (to reflect the different viewpoints on the absence of the effect in real data), leading to F^{(4)} = 1 (a neutral test).

5. Most aftershocks are found to nucleate at "asperities" located on the main-shock rupture plane or on the boundary of the avalanche, in agreement with observations [90]: c_novel = 10 and p/q = 10, leading to F^{(5)} = 2.4.

6. Real seismicity is localized in space on a complex, hierarchical network of faults, a property of central importance (c_novel = 100) but one that is absolutely not reproduced by the OFC model (p/q = 0.1), so that F^{(6)} = 4 × 10^{−4}.

Combining the multipliers according to (3) up to the fifth test gives V^{(6)}_posterior/V^{(1)}_prior = 18.8,
suggesting that the OFC model is validated as a useful model of the statistical properties of seismic catalogs, at least with respect to the properties examined in these first five tests. Adding the crucial last test strongly fails the model, since V^{(7)}_posterior/V^{(1)}_prior = 7.5 × 10^{−3}. The model cannot be used as a realistic predictor of seismicity. The results of our quantitative validation process indicate that it can nevertheless be useful to illustrate certain statistical properties and to help formulate new questions and hypotheses.

An Anomalous Diffusion Model for Solar Radiation Transport in the Cloudy Atmosphere

To improve our modeling skill for climate dynamics, it is essential to reduce the significant uncertainty associated with clouds. In particular, estimation of the radiation budget in the presence of clouds needs improvement, since current operational models for the most part ignore all variability below the scale of the climate model's grid (∼100 km).
A considerable effort has therefore been expended to derive more realistic mean-field radiative transfer models [93], mostly by considering only the one-point variability of clouds, that is, irrespective of their actual structure as captured by 2-point (or higher) correlation statistics. However, it has been widely recognized that the Earth's cloudiness is fractal over a wide range of scales [94]. This is the motivation for modeling the paths of solar photons at non-absorbing wavelengths in the cloudy atmosphere as Lévy walks [91], which are characterized by frequent small steps (inside clouds) and occasional large jumps (typically between clouds), as represented schematically in Fig. 4. These (on-average downward) paths start at the top of the highest clouds and end in escape to space or in absorption at the surface, respectively cooling and warming the climate system. In contrast with most other mean-field models for solar radiative transfer, this diffusion model with anomalous scaling can be subjected to a battery of observational tests.

1. The original goal of this phenomenological model, which accounts for the clustering of cloud water droplets into broken and/or multi-layered cloudiness, was to predict the increase in steady-state flux transmitted to the surface compared to what would filter through a fixed amount of condensed water in a single unbroken cloud layer [96]. This property is common to all mean-field photon transport models that do anything at all about unresolved variability [93]. Thus, we assign only c_novel = 1 to this test and, given that all models in this class are successful, we have to take p/q = 1, hence F^{(1)} = 1. The outcome of this first test is neutral.
Fig. 4. Schematic representation of the anomalous diffusion model of solar photon transport at non-absorbing wavelengths in the cloudy atmosphere. In this model, solar beams follow convoluted Lévy walks, which are characterized by frequent small steps (inside clouds) and occasional large jumps (between clouds or between clouds and the surface). The partition between small and large jumps is controlled by the Lévy index α (the PDF of the jump sizes ℓ has a tail decaying as a power law ∼ 1/ℓ^α). Reproduced from Ref. [95].

2. The first real test for this model occurred in the late 1990s, when it became possible to accurately estimate the mean total path cumulated by solar radiation that reaches the surface. This breakthrough was enabled by access to spectroscopy at medium (high) resolution of oxygen bands (lines) [97, 98]. There was already remote-sensing technology to infer simultaneously cloud optical depth, which is column-integrated water in g (or cm^3) per cm^2 multiplied by the average cross-section for scattering or absorption in cm^2 per g (or cm^3). The observed trends between mean path and optical depth were explained only by the new model, in spite of relatively large instrumental error bars. So we assign c_novel = 100 to this highly discriminating test and p/q = 10 (even though other models were generally not in a position to compete), hence F^{(2)} = 2.9.

3. A third test warrants c_novel = 10 for the observation but only p/q = 1 for the model performance. Indeed, this test is arguably only about the finite horizontal extent of the rain clouds resulting from deep convection: one can exclude only the most simplistic cloud models, based on uniform plane-parallel slabs. So, again, we obtain F^{(3)} = 1 for an interesting but presently neutral test that needs refinement.

4. Min et al. [100] developed an oxygen-line spectrometer with sufficient resolution to estimate not just the mean path but also its root-mean-square (RMS) value. They found the prediction by Davis and Marshak [101] for normal diffusion to be an extreme (envelope) case for the empirical scatter plot of mean vs. RMS path, which is indicative that the anomalous diffusion model will cover the bulk of the data. Because of some overlap with item 2, we assign c_novel = 10 to the test, and p/q = 10 for the model performance, since the anomalous diffusion model had not yet made a prediction for the RMS path (although we note that other models have yet to make one for the mean path). We therefore receive F^{(4)} = 2.4.

5. A new generation of mean- and RMS-path observations deserves c_novel = 100. The new mean- and RMS-path data were explained by Scholl et al. by creating an ad hoc hybrid between normal diffusion theory (which indeed has a prediction for the RMS path [101]) and its anomalous counterpart (which still has none). This modification of the basic model can be viewed as significant, meaning that we are in principle back to validation step 1 with the new model. However, this exercise uncovered something quite telling about the original anomalous diffusion model, namely, that the simple asymptotic (large optical depth) form used in all the above tests is not generally valid: for typical cloud covers, the pre-asymptotic terms computed explicitly for the normal-diffusion case prove to be important, irrespective of whether the diffusion is normal or not.
Consequently, in its original form (resulting in a simple scaling law for the mean path with respect to cloud thickness and optical depth), the anomalous diffusion model fails to reproduce the new data even for the mean path. (As a consequence, previous fits yielded only "effective" anomaly parameters and were misleading if taken too literally.) So we assign p/q = 0.1, hence F^{(5)} = 4 × 10^{−4}. Thus, V^{(6)}_posterior/V^{(1)}_prior = 3 × 10^{−3}, a fatal blow for the anomalous diffusion model in its simple asymptotic form, even though V^{(5)}_posterior/V^{(1)}_prior = 7.0 after the first four tests. Efforts are under way to extend the anomalous diffusion model to pre-asymptotic regimes. More recently, a model for anomalous transport (i.e., where angular details matter) has been proposed that fits all of the new oxygen spectroscopy results [95].

In summary, the first and simplest incarnation of the anomalous diffusion model for solar photon transport ran its course and demonstrated the power of oxygen-line spectroscopy as a test of the performance of radiative transfer models required in climate modeling for large-scale average responses to solar illumination. Eventually, new and interesting tests will become feasible when we obtain dedicated oxygen-line spectroscopy from space with NASA's Orbiting Carbon Observatory (OCO) mission, planned for launch in 2008. Indeed, we already know that the asymptotic scaling for reflected photon paths [104] is different from that of their transmitted counterparts [101] in standard diffusion theory, for both mean and RMS.

A Computational Fluid Dynamics Code for the Richtmyer–Meshkov Instability

So far, our examples of models for complex phenomena have hailed from quantum and statistical physics. In the latter case, they are stochastic models composed of: (1) simple code (hence rather trivial verification procedures) to generate realizations, and (2) analytical expressions for the ensemble-average properties (that are used in the above validation exercises). We now turn to gas dynamics codes, which have a broad range of applications, from astrophysical and geophysical flow simulation to the design and performance analysis of engineering systems. Specifically, we discuss the validation of the "Cuervo" code developed at Los Alamos National Laboratory [105, 106] for use as a simulation tool in the complex physics of compressible mixing. This software generates solutions of the Euler equations for flows of inviscid, non-heat-conducting, compressible gas. Cuervo has been verified against a suite of test problems including, e.g., those discussed by Liska and Wendroff [107]. As clearly stated by Oberkampf and Trucano [33], however, such verification differs from, and does not guarantee, validation against experimental data. A standard validation scenario involves the Richtmyer–Meshkov (RM) instability [108, 109], which arises when a density gradient in a fluid is subjected to an impulsive acceleration, e.g., due to passage of a shock wave (see Fig. 5). Evolution of the RM instability is nonlinear and hydrodynamically complex, and hence defines an excellent problem-space to assess CFD code performance for more general mixing scenarios.
Fig. 5. Schematic of the interaction between a weak shock (Mach number ≈ 1.2) in air and a column of dense gas (SF6). The Richtmyer–Meshkov instability arises from the mismatch between the pressure gradient (at the shock front) and the density gradient (between the light and dense gases), which acts as a source of baroclinic vorticity. The column of dense gas "rolls up" into a double-spiral form under the action of the evolving vorticity.

In the series of shock-tube experiments described in [110], RM dynamics are realized by preparing one or more cylinders with approximately identical axisymmetric Gaussian concentration profiles of dense sulfur hexafluoride (SF6) in air. This (or these) vertical "gas cylinder(s)" is (are) subjected to a weak shock (Mach number ≈ 1.2). The instability is driven by the mismatch between the density gradient (SF6 has a density approximately five times that of air) and the pressure gradient through the shock wave; this mismatch acts as the source for baroclinic vorticity generation. Moreover, the flow evolution is strongly two-dimensional up to the final times considered. Visualization of the density field is obtained using a planar laser-induced fluorescence (PLIF) technique, which provides high-resolution quantitative concentration measurements in a plane that cross-cuts the cylinders. The velocity field is diagnosed using particle image velocimetry (PIV), based on correlation measurements of small-scale particles that are seeded in the initial flow field. Careful post-processing of images from 130 µs to 1000 µs after shock passage yields planar concentration and velocity fields with error bars.

1. This RM flow is dominated at early times by a vortex pair; later, secondary instabilities rapidly transition the flow to a mixed state. We rate c_novel = 10 for the observations of these two instabilities. The Cuervo code correctly captures both instabilities, best observed and modeled with a single cylinder. At this qualitative level, we rate p/q = 10 (good fit), which leads to F^{(1)} = 2.4.

2. A second, more quantitative comparison of the observed flow features rates c_novel = 10 and p/q = 10, yielding F^{(2)} = 2.4.

3. A third comparison, again rated c_novel = 10 and p/q = 10, yields F^{(3)} = 2.4.

4. A comparison of the measured and simulated fluctuation spectra is less novel (c_novel = 1). The Cuervo code correctly accounts for the low-wavenumber part of the spectrum but underestimates the high-wavenumber part (beyond the deterministic-stochastic transition wavenumber) by a factor of 2 to 5. We capture this by setting p/q = 0.1, so that F^{(4)} = 0.47.

Combining the four multipliers according to (3) gives V^{(5)}_posterior/V^{(1)}_prior = 6.5,
a significant gain, but still not sufficient to compellingly validate the Cuervo code for inviscid shock-induced hydrodynamic instability simulations, at least in 2D. Clearly, validation against this single set of experiments is inadequate to address all intended uses of a CFD code such as Cuervo. (Intricate experiments with three gas cylinders have since been performed [111], and others are currently under way, to further challenge compressible flow codes.)

Discussion and Concluding Remarks

The above three examples illustrate the utility of representing the validation process as a succession of steps, each of them characterized by the two parameters c_novel and p/q. The determination of c_novel requires expert judgment, and that of p/q a careful statistical analysis, which is beyond the scope of the present report (see Ref. [76] for a detailed case study). The parameter q is ideally imposed as a confidence level, say 95% or 99%, as in standard statistical tests. In practice, it may depend on the experimental test and requires a case-by-case examination.

The uncertainties of c_novel and of p/q need to be assessed. Indeed, different statistical estimations or metrics may yield different p/q's, and different experts will likely rate the novelty c_novel of a new test differently. As a result, the trust gain V^{(n+1)}_posterior/V^{(1)}_prior after n tests necessarily has a range of possible values that grows geometrically with n. In certain cases, a drastic difference can be obtained by a change of c_novel. For instance, if instead of attributing c_novel = 100 to the sixth OFC test we put c_novel = 10 (resp. 1), while keeping p/q = 0.1, then F^{(6)} changes from 4 × 10^{−4} to 4 × 10^{−3} (resp. 0.47), so that V^{(7)}_posterior/V^{(1)}_prior = 0.07 (resp. ≃ 9). However, c_novel = 1 is arguably unrealistic, given the importance of faults in seismology. The two remaining choices, c_novel = 100 and c_novel = 10, then give similar conclusions on the invalidation of the OFC model. In our examples, V^{(n+1)}_posterior/V^{(1)}_prior provides a qualitatively robust measure of the gain in trust after n steps; this robustness has been built in by imposing a coarse-grained quality on p/q and c_novel.

The validation of numerical simulations continues to become more important as computational power grows, as the complexity of modeled systems increases, and as increasingly important decisions are influenced by computational models. We have proposed an iterative, constructive approach to validation using quantitative measures and expert knowledge to assess the relative state of validation of a model instantiated in a computer code. In this approach, the increase or decrease in validation is mediated through a function that incorporates the results of the model vis-à-vis the experiment, together with a measure of the impact of that experiment on the validation process. While this function is not uniquely specified, it is not arbitrary: certain asymptotic trends, consistent with heuristically plausible behavior, must be observed. In four fundamentally different examples, we have illustrated how this approach might apply to a validation process for physics or engineering models. We believe that the multiplicative decomposition of trust gains or losses (given in Eq. 3), using a suitable functional prescription (such as Eq. 7), provides a reasoned and principled description of the key elements, and fundamental limitations, of validation. It should be equally applicable to the biological and social sciences, especially since it is built upon the decision-making processes of the latter. We believe that our procedure transforms paralyzing criticisms in Popper's style, that "we cannot validate, we can only invalidate" [19], into a practical constructive algorithm. This strategy addresses specifically both the problem of distinguishing between competing models and that of transforming the vicious circle of lack of suitable data into a virtuous spiral path: each cycle is marked by a quantified increment of the evolving trust we put in a model, based on the novelty and relevance of new data and the quality of fits.

We have also surveyed and commented extensively on the V&V literature. We hope this digest will help the reader as much as its collation helped us deepen our understanding of the challenge of model validation, including a new perspective on some of our own work. We close with these far-reaching thoughts by Patrick J. Roache [112]:
In an age of spreading pseudoscience and anti-rationalism, it behooves those of us who believe in the good of science and engineering to be above reproach whenever possible. Public confidence is further eroded with every error we make. Although many of society's problems can be solved with a simple change of values, major issues such as radioactive waste disposal and environmental modeling require technological solutions that necessarily involve computational physics. As Robert Laughlin [113] noted in this magazine, "there is a serious danger of this power [of simulations] being misused, either by accident or through deliberate deception." Our intellectual and moral traditions will be served well by conscientious attention to verification of codes, verification of calculations, and validation, including the attention given to building new codes or modifying existing codes with specific features that enable these activities.
Acknowledgments
This work was supported by the LDRD 20030037DR project "Physics-Based Analysis of Dynamic Experimentation and Simulation" and the US DOE Atmospheric Radiation Measurement (ARM) program. We acknowledge stimulating interactions and discussions with the other members of the project, including Bob Benjamin, Mike Cannon, Karen Fisher, Andy Fraser, Sanjay Kumar, Vladilen Pisarenko, Kathy Prestridge, Bill Rider, Chris Tomkins, Kevin Vixie, Peter Vorobieff, and Cindy Zoldi. We were particularly impressed by Timothy G. Trucano as our reviewer for [1], who provided an extremely insightful and helpful report, with many thoughtful comments and constructive suggestions. As authors, we count ourselves very fortunate to have had such a strong audience to scrutinize and improve our contribution.
Appendix: A More Formal Look at the Role of Validation in the Modeling Enterprise
We deal with models that possess two aspects: a conceptual part based on the physical laws of nature (such as the Navier–Stokes conservation equations for fluid dynamics) and a computational part (as in CFD). Mathematically, a model and the observations are defined formally, as described in section A.1 below:

• The model M maps the set {A} of parameters and of initial and boundary conditions to a forecast of state variables assembled in a formal vector X^f;
• An observation projection G maps the true dynamics or physics X^t to raw measurements y^o.

Such definitions may seem abstract and of little use, but they are important foundations on which to build a comprehensive roadmap for physics-based model validation. In the following section, we refine the above definitions and introduce a few more operators and quantities. In section A.2, we revisit the key steps in a validation loop with this notation in hand. Finally, we discuss some fundamental limitations on model validation in section A.3, using some of our own research in time-series analysis for illustration.

A.1 Definitions
Let us denote by X^t(r, t) the true physical field. Observations y^o(r, t) are obtained via a possibly nonlinear operator G acting on X^t(r, t):

    y^o(r, t) = G{ X^t(r′, t′) }.    (A.1)

The observations at position r and time t may be a combination of past values obtained over some finite region, hence our use of (r′, t′), which differ from (r, t). The operator G may thus be non-local and (causally) time-dependent. In addition, any measurement has noise and uncertainties; therefore, G is a stochastic operator. The simplest specification beyond ignoring noise is to consider an additive noise.

A model M provides a forecast X^f(r, t), either in the actual future or in terms of what will lead (via another operator) to the value of the measurements beyond a certain fiducial point in time. This is expressed by

    X^f(r, t) = M({A}).    (A.2)

M is the model operator, which contains, for instance, the equations of state and the formulation in terms of ODEs, PDEs, discrete maps, and so on, which are supposed to embody the known physics of the underlying processes. {A} contains the parameters of the model as well as the boundary and initial conditions. The model operator M has a non-random part. It can also contain an additive or multiplicative noise component to represent the forecast errors, as well as possible intrinsic stochastic components of the dynamics. The forecast errors may stem from computational errors, numerical instabilities and uncertainties, the existence of multiple branches in the solution, and so on. The simplest specification is again to consider an additive noise.

The output M({A}) of the model is translated into physical quantities that can be compared with the observations via another operator H, which models mathematically and in code the observation process. In general, one would like to compare y^o(r, t) given by (A.1) with H[X^f(r, t)], that is, G{X^t(r′, t′)} with H[M({A})]. The intended use of the model is key to "objective model validation," because it turns the "subjectiveness" of model validation into an "object" using hypothesis testing and decision theory. To implement this idea, it is natural to introduce a cost function (see below) for the intended use of the model:

    C( G{X^t(r′, t′)} ; H[M({A})] ),

which is a measure of how well the model accounts for the observations. In this expression, the cost function is evaluated in the "physical space" of observations/measurements. An alternative is to evaluate the cost function in the "model space," i.e.,

    C( G^{−1}{ y^o(r′, t′) } ; M({A}) ),

where G^{−1} is the formal inverse operator to G, which maps the observations y^o onto the model space of X^f. In data assimilation, an explicit form of G^{−1} does not exist in general, due to rank deficiency. However, this alternative representation corresponds, within the linear theory, to the duality between Kalman filtering and 3D-Var [114].

We propose to define the validation problem as a decision problem, in which one uses the loss function to infer/decide how much confidence one has in the reliability of the model to function in the range in which it is supposed to apply. The interesting and challenging situation occurs when this range extends beyond the region of parameter space in which all reasonably stringent controls have been performed. Validation requires the build-up of trust in the model or code, so that it is believed to be resilient and to work in complex real situations combining the simple regimes that have been tested. The cost function is just an alternative way of constructing the statistical test that provides the probability level p defined in the main text.
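To fix ideas, here is a toy, self-contained rendering of these definitions; the specific field, sampling locations, and noise levels are invented for illustration and are not taken from any of the applications above:

```python
import numpy as np

rng = np.random.default_rng(0)

def M(A):
    """Toy model operator (Eq. A.2): map parameters {A} to a forecast
    field X^f on a grid (a decaying sine wave), plus additive noise
    standing in for forecast error."""
    amplitude, decay = A
    x = np.linspace(0.0, 1.0, 101)
    return (amplitude * np.sin(2.0 * np.pi * x) * np.exp(-decay * x)
            + 0.01 * rng.standard_normal(x.size))

def G(X_true):
    """Toy observation operator (Eq. A.1): sample the true field at a
    few locations, with measurement noise (G is stochastic)."""
    idx = [10, 35, 60, 85]
    return X_true[idx] + 0.05 * rng.standard_normal(len(idx))

def H(X_forecast):
    """Model of the observation process: same sampling as G but
    noise-free, mapping the forecast into observation space."""
    return X_forecast[[10, 35, 60, 85]]

def C(y_obs, y_model):
    """Quadratic cost in 'physical space': C(G{X^t}; H[M({A})])."""
    return float(np.sum((y_obs - y_model) ** 2))

X_true = M((1.0, 0.5))   # stand-in for the (unknown) true field X^t
y_obs = G(X_true)        # raw measurements y^o
X_f = M((0.9, 0.6))      # forecast X^f with imperfect parameters {A}
print(C(y_obs, H(X_f)))  # how well the model accounts for the data
```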
A.2 Four Recurring Types of Problem in Physically-Based Model Validation

Our overarching goal is to advocate approaches to validation that are grounded in physics. The term "physics-based" embodies two strategies:

(a) use physical reasoning to improve modeling, target experiments and loss functions, and detect missed "dimensions;"
(b) use concepts from statistical physics to formulate (in the spirit of Brown and Sethna [115]) a validation process of complex models with complex data in the form of an N-body problem.

Following this roadmap, we find ourselves asking the same four questions again and again:

1. How to model? (the question of model construction)
2. What to measure? (the question of estimating c_novel in the main text)
3. How to measure it? (the question of choosing and estimating the cost function or "metric")
4. How to interpret the results? (the question of estimating p in the main text)

We view these four defining questions as the crucial steps within the validation loop described in sections 2–4 of the main text.

Problem 1: Targeting model development (How to model?)
Our discussion so far may give the impression that the modeling step is "homogeneous." It may actually be advantageous to develop a hierarchical modeling framework. In this respect, Oden et al. [116] proposed to use hierarchical modeling as a mathematical structure that can be useful in directing validation studies. In this construction, a class of models of the events of interest is defined, in which one identifies a "fine" model that possesses a level of sophistication high enough to adequately capture the event of interest with good accuracy. This model may be intractable, even computationally. Hierarchical modeling consists in identifying a family of coarse models that are solvable. Using the fine-model output as a datum, the error in the solution of ever coarser models can be estimated and controlled, with the goal of obtaining a model best suited for the simulation goal at hand. The essential components of this program are the following [116]:

1. Experimental data are collected to fully characterize the fine model.
2. Quantities G(X) of interest are specified as the essential physical entities to be predicted in the simulation (for instance, in the form of the probability of the predicted values of the quantity).
3. The coarsest model is used to extract a preliminary estimate of G(X), and modeling and approximation errors are computed.
4. If the estimated error exceeds the prescribed tolerance, the model is enhanced and the calculation repeated, until a model yielding results within the preset bounds is obtained.
5. The truncation error of the perturbation expansion is estimated: if the total error exceeds a preset tolerance, the data set and the fine-model definition must be updated; if not, the predicted G(X), and the probability that it will take on values in a given interval, are produced as output.

A concrete implementation of this program has been performed by Israeli and Goldenfeld [27]. Using elementary cellular automata as an example, Israeli and Goldenfeld show how to coarse-grain cellular automata in all categories of Wolfram's exhaustive classification [117]. The main discovery is that computationally irreducible physical processes can be predictable, and even computationally reducible, at a coarse-grained level of description. The coarse-grained cellular automata constructed with this procedure emulate the large-scale behavior of the original systems without accounting for small-scale details. These results remind us that it is advantageous to develop a view of complex physical processes at different scales, as predictability may depend on the scale of observation.

A related approach has been discussed recently by Brown and Sethna [115], who consider models defined in terms of sets of nonlinear ODEs applied to systems that have large numbers of poorly known parameters, simplified dynamics, and uncertain connectivity. They call models possessing these three features "sloppy models." Sloppy models characterize many other high-dimensional, multi-parameter nonlinear models. Brown and Sethna propose to use the maximum likelihood method to frame the problem of parameter estimation and model validation in the form of a statistical ensemble method. In our language, the problem boils down to a study of the cost function C and of its stiff and soft directions, determined from the eigenvalue problem of the Hessian of C (with respect to the parameters of the model). In practice, Brown and Sethna propose to estimate the Hessian of C in terms of the so-called "Levenberg-Marquardt" Hessian (thus called because of its use in that popular minimization algorithm); that quantity is defined simply as a sum of pairwise products of first-order derivatives of the residuals with respect to the model parameters. Stiff modes correspond to large eigenvalues. Similarly to a decomposition in principal components, retaining the stiff modes allows one to get a more robust signature of the coarse-grained properties of the dynamics. This constitutes a concrete implementation of our Problem 4 below on "targeting model errors." This procedure also addresses the problem of defining the operator H that selects the output of the model for comparison with the experimental data.

There is an interesting avenue for research here: rather than performing the principal component decomposition in one step, it may be advantageous to perform a series of sub-system analyses, or cluster analyses, retaining the stiff modes of each sub-system and then aggregating them at the next level of the hierarchy.
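A minimal sketch of the stiff/soft-mode analysis just described, with an invented Jacobian standing in for the residual derivatives of an actual sloppy model:

```python
import numpy as np

def lm_hessian(J):
    """'Levenberg-Marquardt' Hessian of a sum-of-squares cost C:
    H = J^T J, i.e., sums of pairwise products of the first-order
    derivatives of the residuals with respect to the parameters."""
    return J.T @ J

def stiff_and_soft_modes(J):
    """Eigendecomposition of J^T J: large eigenvalues mark stiff
    parameter directions (well constrained by the data); small
    eigenvalues mark soft, sloppy directions."""
    eigval, eigvec = np.linalg.eigh(lm_hessian(J))
    order = np.argsort(eigval)[::-1]   # stiffest first
    return eigval[order], eigvec[:, order]

# Invented Jacobian of residuals for a 2-parameter model whose
# parameters act almost redundantly on the data.
J = np.array([[1.0, 0.99],
              [1.0, 1.01],
              [1.0, 1.00]])
eigval, eigvec = stiff_and_soft_modes(J)
print(eigval)        # one large (stiff) eigenvalue, one near-zero (soft)
print(eigvec[:, 0])  # the stiff combination of the two parameters
```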
Problem 2: Targeting the observations (What to measure?)

Objective: Find the G (and the associated H) that reveals the most about critical model behavior.

The problem has been addressed specifically in these terms by Palmer et al. [72], to target adaptive observations to "sensitive" parts of the atmosphere. Targeting observations could be directed by the desire to get access to the most relevant information that is also the most reliable (e.g., contaminated by the smallest errors). It may be worth mentioning that the targeting of observations depends not only on G, but also on M and {A}, as well as on C (along with its own parameters, discussed below). The targeting of the observations is the problem of maximizing the coefficient c_novel introduced in the main text, so that the new experiment/observation explores novel dimensions of the parameter and variable spaces of both the process and the model that can best reveal potential flaws that could compromise the important applications. In general, one targets observations by developing experiments that are thought to provide, in some sense, the most relevant tests of the physics.

Oberkampf and Trucano [33] suggest that traditional experiments can generally be grouped into three categories:

1. experiments conducted primarily for the purpose of improving the fundamental understanding of some physical process;
2. experiments conducted primarily for constructing or improving mathematical models of fairly well-understood flows;
3. experiments that determine or improve the reliability, performance, or safety of components, subsystems, or complete systems.

These authors argue that validation experiments constitute a fourth type of experiment: "A validation experiment is conducted for the primary purpose of determining the validity, or predictive accuracy, of a computational modeling and simulation capability. In other words, a validation experiment is designed, executed, and analyzed for the purpose of quantitatively determining the ability of a mathematical model and its embodiment in a computer code to simulate a well-characterized physical process." This leads them to propose the following guidelines:

• Guideline 1: A validation experiment should be jointly designed by experimentalists, model developers, code developers, and code users working closely together throughout the program, from inception to documentation, with complete candor about the strengths and weaknesses of each approach.
• Guideline 2: A validation experiment should be designed to capture the essential physics of interest, including all relevant physical modeling data and initial and boundary conditions required by the code.
• Guideline 3: A validation experiment should strive to emphasize the inherent synergism between computational and experimental approaches.
• Guideline 4: Although the experimental design should be developed cooperatively, independence must be maintained in obtaining both the computational and experimental results.
• Guideline 5: A hierarchy of experimental measurements of increasing computational difficulty and specificity should be made, for example, from globally integrated quantities to local measurements.
• Guideline 6: The experimental design should be constructed to analyze and estimate the components of random (precision) and bias (systematic) experimental errors.
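As a crude illustration of what "targeting" can mean operationally, the following toy scoring rule (entirely our own construction, not a prescription from Refs. [33] or [72]) ranks candidate experiments by how orthogonal their parameter sensitivities are to those of past tests, i.e., by a rough proxy for c_novel:

```python
import numpy as np

def novelty_score(candidate_sensitivity, past_sensitivities):
    """Crude c_novel proxy: how orthogonal is the candidate experiment's
    sensitivity vector (d(observable)/d(parameters)) to those of the
    experiments already performed?  1 = completely new direction,
    0 = fully redundant with past tests."""
    if not past_sensitivities:
        return 1.0
    c = candidate_sensitivity / np.linalg.norm(candidate_sensitivity)
    overlaps = [abs(c @ (s / np.linalg.norm(s))) for s in past_sensitivities]
    return 1.0 - max(overlaps)

past = [np.array([1.0, 0.0, 0.0])]          # already probed parameter 1
candidates = {
    "repeat of old test": np.array([0.9, 0.1, 0.0]),
    "new regime":         np.array([0.0, 0.7, 0.7]),
}
for name, s in candidates.items():
    print(name, round(novelty_score(s, past), 2))
# The "new regime" experiment scores highest and would be targeted first.
```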
Problem 3: Targeting the cost function (How to estimate the penalty on imperfect models and measurements using their discrepancies?)
For given measurements or experiments, that is, for a given G, the problem is to find the optimal cost function C for the intended use of the model. The notion of optimality needs to be defined. It could capture a compromise between the following requirements:

• fit best the important features of the data (what is "important" may be decided on the basis of previous studies and understanding, of other processes, or of programmatic concerns);
• minimize the extraction of spurious information from noise, which requires one to have a precise idea of the statistical properties of the noise (if such knowledge is not available, the cost function should take this into account).

The choice of the cost function involves the choice of how to look at the data. For instance, one may want to expand the measurements at multiple scales using wavelet decompositions and compare predictions and observations scale by scale, or in terms of multifractal spectra of the physical fields estimated from these wavelet decompositions or from other methods. The general idea here is that, given complex observation fields, it is appropriate to "project" the data onto a variety of "metrics" designed to detect and characterize phenomena of particular interest. For instance, wavelet-based scaling properties can be used in the comparison between observations and model predictions; the question is then: How well is the model/code able to reproduce the salient multi-scale properties derived from the observations? The physics of turbulent fields and of complex systems has offered many such new tools with which to unfold complex fields according to different statistics. Each of these statistics provides the basis for a metric with which to compare observations with model predictions, and each thus leads to a cost function focusing on a particular feature of the process. These metrics are derived from the understanding that turbulent fields can be analyzed with them, revealing strong constraints in their organization (spatial structure and temporal evolution). These metrics can therefore be described as "physics-based."

Furthermore, the choice of the cost function should take into account that the diagnostics of the experiments may lead to spurious results [11]. For example, in laser-driven shock experiments, because the laser-induced fluorescence method illuminates the mixing zone with a planar sheet of light, this diagnostic can lead to aliasing of long-wavelength structures into short-wavelength features in the images, thus affecting the interpretation of observed small-scale structures in the mixing zone. Also, because of the dynamic limits on diagnostic resolution, the formation of small-scale structure cannot be completely determined.

As emphasized by Noam Chomsky in his own field of work [118], the danger with the Popperian strategy [119] is that one might prematurely reject a theory based on "falsification" using data that are themselves poorly understood. For instance, lack of quality control in the experiments can result in premature rejection of the model. On these issues, Stein [120] discusses means for controlling and understanding sample selection and variability, which can compromise conclusions drawn from validation tests.

The problem of the choice of the cost function C seems, however, to be of lesser importance than Problem 2 above and Problem 4 below. In fact, almost all classical results on the limit properties and efficiency of statistical inference are valid (and proved) for a whole general family of cost functions C(· ; ·) satisfying the following conditions (see, e.g., Ibragimov and Hasminskii [121]):

(a) C(x, y) = c(|x − y|);
(b) c(z) is a positive, monotonically increasing function (including, e.g., the power-law functions |z|^q with q > 0);
(c) c(z) should not increase too fast (its mean with respect to the Gaussian distribution must remain finite).

Thus, statistical limit theorems are proved for the whole class of power-law cost functions (including the classic choice q = 2).

As an example, it may be appropriate to consider the cost function in the following form. Let us assume we are interested in some functional

    Z(R, T | G{X^t(r, t)}, r ∈ D(R), t ≤ T)

depending on the past true physical field X^t(r, t) in some region D(R). In this case, the cost function can be chosen as

    C( Z[R, T | G{X^t(r, t)}, r ∈ D(R), t ≤ T] ; Z[R, T | H{X^f(r, t)}, r ∈ D(R), t ≤ T] ),    (A.3)

where C(· ; ·) is some function satisfying the above conditions (a)–(c). The formulation (A.3) for C(· ; ·) should be a function not only of G and M, but also of those parameters that correspond to our best guess for the uncertainties, errors, and noise. Indeed, in most cases, we can never know the real uncertainties, errors, and noise in G and M (or even H); hence, we must parameterize them based on our best guess. In data assimilation (described in the main text in relation to model calibration and validation), the accuracy of such a parameterization is known to influence the results significantly.

Generalizations of (A.3) allowing for different fields in the two sets of variables of C are needed for some problems, such as the validation of meteorological models. For instance, consider a model state vector X (of dimension on the order of 10^6 or more) computed on a fixed spatial grid. In general, the locations of the observations are not on the computational grid (for example, consider measurements with weather balloons released from the surface). Thus, the observation Y is a function of X, but is not an attempt to estimate X itself. Hence, if the cost function is quadratic, it has the form (Y − H(X))^T O^{−1} (Y − H(X)), where H acts on the interpolation function to pick up the model variables at the grid points close to the observed location, and O is related to the error covariance. Let us imagine a validation case using satellite infrared images for Y and the atmospheric radiative state for X. Observations are quasi-uniform in space at a given time; at each time, however, the available observations and their quality (represented by O) may change. In this case, the cost function must take into account the mapping between X and Y, so that we have C(X, Y) = C(|H(X) − Y|) rather than C(X, Y) = C(|X − Y|), and therefore (Y − H(X))^T O^{−1} (Y − H(X)) when C is quadratic. In addition, for heterogeneous observations (satellite images, weather balloon measurements, airplane sampling, and so on), the cost function should take all these data into account, e.g.,

    C(x, y) = C_satellite(x, y) + C_balloon(x, y) + C_airplane(x, y) + ···,

where each C may have a complex, idiosyncratic observation function H. See Courtier et al. [122] for a discussion of cost functions for atmospheric models and observation systems.
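A minimal numerical rendering of the quadratic cost just discussed, with an invented interpolating observation function H and a diagonal error covariance O:

```python
import numpy as np

def quadratic_cost(Y, X, H, O):
    """C = (Y - H(X))^T O^{-1} (Y - H(X)), with H the (possibly
    interpolating) observation function and O the observation-error
    covariance; O^{-1} down-weights the noisier instruments."""
    d = Y - H(X)
    return float(d @ np.linalg.solve(O, d))

# Model state on a 5-point grid; observations at two off-grid locations.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

def H(X):
    # Linear interpolation picking up model values near the obs locations
    return np.array([0.5 * (X[1] + X[2]),    # obs between grid points 1, 2
                     0.5 * (X[3] + X[4])])   # obs between grid points 3, 4

Y = np.array([1.6, 3.4])                     # raw measurements y^o
O = np.diag([0.1**2, 0.2**2])                # instrument error variances
print(quadratic_cost(Y, X, H, O))
# Heterogeneous data would add terms: C_satellite + C_balloon + ...
```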
Problem 4: Targeting model errors (How to interpret the results?)
The problem here is to find the "dimensions" of the model that are missing, misrepresented, or erroneous. The question of how to interpret the results thus leads to the discussion of the missing or misrepresented elements in the model. What tests can be used to betray the existence of hidden degrees of freedom and/or dimensions?

This is the hardest problem of all. It can sometimes find an elegant solution when a given model is embedded in a more general model: the limitation of the "small" model then becomes clear from the vantage of the more general model. Well-known examples are:

• Newtonian mechanics as part of special relativity, when v ≪ c, where v (resp. c) is the velocity of the body (resp. of light);
• classical mechanics as part of quantum mechanics, when h/mc ≪ L, where h is Planck's constant, m and L are the mass and size of the body, and h/mc is the associated Compton wavelength;
• Eulerian hydrodynamics as part of Navier-Stokes hydrodynamics, with its rich phenomenology of turbulent motion (when the Reynolds number goes to infinity or, equivalently, the viscosity goes to zero);
• classical thermodynamics as part of the statistical physics of N ≫ 1 particles, in the limit N → ∞.

The challenge of targeting model errors is to develop diagnostics of missing dimensions even in the absence of a more encompassing model. This could be done by adding random new dimensions to the model and studying its robustness. In what sense can one detect that a model is missing some essential ingredient or some crucial mechanism, or that the number of variables or dimensions is inadequate? To use a metaphor, this question is similar to asking ants living and walking on a plane to gain awareness that there is a third dimension. (This question, raised already by the German philosopher Kant, actually has an answer that was studied and solved by Ehrenfest in 1917 [123]; see also Whitrow's 1956 article [124]. The answer is based on the analysis of several fundamental physical laws in R^n spaces, comparing their predictions as a function of n. The value n = 3 turns out to be very special! Thus, ants studying gravitation or electromagnetic fields would see that there is more to space than their plane.)
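The "add random dimensions" idea can be given a concrete, if simplistic, form: random probe directions supply a null distribution for how much an extra dimension helps the fit by chance, against which a genuinely missing dimension stands out. Everything below (the toy data, the hidden variable z, the probe count) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def residual_variance(X, y):
    """Least-squares fit y ~ X b; return the variance of the residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r) / len(y)

# Toy data secretly generated with a hidden second variable z.
n = 200
x = rng.uniform(-1, 1, n)
z = rng.uniform(-1, 1, n)                    # the missing dimension
y = 2.0 * x + 1.5 * z + 0.1 * rng.standard_normal(n)

X_model = np.column_stack([np.ones(n), x])   # model ignorant of z
base = residual_variance(X_model, y)

# Null distribution: fit improvement from purely random probe directions.
null_gains = [1.0 - residual_variance(
                  np.column_stack([X_model, rng.standard_normal(n)]), y) / base
              for _ in range(200)]

# A candidate extra dimension is flagged if its gain exceeds chance level.
gain_z = 1.0 - residual_variance(np.column_stack([X_model, z]), y) / base
print(round(gain_z, 3), round(float(np.percentile(null_gains, 99)), 3))
# gain_z (~0.99) dwarfs the chance level (~0.03): a dimension was missing.
```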
A.3 Fundamental Limits on Model Validation

Before, while, and after engaging in model validation, it is wise to reflect frequently and systematically on what is not known. Two examples using the formalization introduced in section A.1 are the following.
Ignorance of the model M({A})

As quoted in the main text, Roache [2] states, in a nutshell, that validation is about solving the right equations for the problem of immediate concern. How do we know the right equations? Consider, for instance, point-vortex models, and let us perform "twin experiments," i.e., (1) first generate the "simulated observations" by a "true" point-vortex system that is unknown to the make-believe observer and modeler; (2) use the procedure of section A.1 and construct a "validated" point-vortex system. The problem is that, even before we start model validation, we are already using one of the most critical pieces of information, namely that the system is based on point vortices. Similar criticism of the use of "simulated observations" has been raised in data assimilation studies using OSSEs (Observing-System Simulation Experiments). This criticism is crucial for model validation.

For this unavoidable issue of model errors, we suggest that one needs a hierarchy of investigations:

1. Look at the statistical or global properties of the time series and/or fields generated by the models as well as from the data, such as distributions, correlation functions, n-point statistics, and the fractal and multifractal properties of the attractors and emergent structures, in order to characterize how much of the data our model fits. Part of this approach is the use of maximum likelihood theory to determine the most probable values of the parameters of the model, conditioned on the realization of the time series.
2. We can bring to bear on the problem the modern methods of computational intelligence (or machine learning), including pattern classification and recognition methods, ranging from the already classical ones (e.g., neural networks, K-means) to the most recent advances (e.g., support vector machines, "random forests").
3. Lastly, a qualification of the model is obtained by testing and quantifying how well it predicts the "future" beyond the interval used for calibration/initialization.

Levels of ignorance of the observation G

• First level: The characteristics of the noise are known, such as its distribution, covariance, and maybe higher-order statistics.
• Second level: It may happen that the statistical properties of the noise are poorly known or constrained.
• Third level: A worse situation is when some noise components are not known to exist and are thus simply not considered in the treatment. For instance, imagine that one forgets, in climate modeling, the impact of biological variability in time and space on the distribution of CO2 sequestration sites.
• Fourth level: Finally, there is the representation error in G itself, i.e., how G is modeled mathematically in H.
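A toy twin experiment makes two of the points above concrete: the calibration/prediction split implements the level-3 qualification of the hierarchy, while the comments flag the structural information we smuggled in. All numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Twin experiment: "simulated observations" from a hidden "true" system.
# Caveat, as noted above: by fitting the same model class, we have
# already injected the most critical information -- the model structure.
true_phi, noise = 0.8, 0.1
x = np.empty(400)
x[0] = 0.0
for k in range(1, x.size):
    x[k] = true_phi * x[k-1] + noise * rng.standard_normal()

calib, test = x[:200], x[200:]

# Level-3 qualification: calibrate on one interval, then score one-step
# predictions on the interval the fit never saw.
phi_hat = float(calib[1:] @ calib[:-1]) / float(calib[:-1] @ calib[:-1])
rmse = np.sqrt(np.mean((test[1:] - phi_hat * test[:-1]) ** 2))
print(phi_hat, rmse)   # rmse near the noise level 0.1: the skill generalizes
```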
Consequences of the sensitivity to initial conditions and nonlinearity in the model

Even an accurate forecast is limited by the inherent predictability of the system; in the same way, validation may be hindered by limited access to testing. The predictability of a system refers to the fundamental limits on prediction for that system. For instance, if a system is pure noise, there is no possibility of forecasting it better than chance. Similarly, there may be limits on the possibility of testing the performance of a model, for instance because of limits in measurements or limits in access to key parameters. With such limitations, it may be impossible to fully validate a model.

A well-known source that limits predictability is the property of sensitivity to initial conditions, which is one of the ingredients leading to chaotic behavior. Validation has to be made immune to this sensitivity to initial conditions by using a variety of methods, including the properties of attractors, their invariant measures, the properties of Lyapunov exponents, and so on. Pisarenko and Sornette [125] have shown that the sensitivity to initial conditions leads to a limit of testability in simple toy models of chaotic dynamical systems, such as the logistic map. They addressed the possibility of applying standard statistical methods (the least-squares method, the maximum likelihood estimation method, the method of statistical moments for the estimation of parameters) to a deterministically chaotic, low-dimensional dynamical system containing additive dynamical noise. First, the nature of the system is found to require that any statistical method of estimation combine the estimation of the structural parameter with the estimation of the initial value (see the sketch at the end of this subsection). This is potentially an important lesson for this class of systems. In addition, in such systems, one needs a trade-off between the need to use a large number of data points in the statistical estimation method to decrease the bias (i.e., to guarantee the consistency of the estimation) and the unstable nature of dynamical trajectories, with their exponentially fast loss of memory of the initial condition. In this simple example, the limit of testability is reflected in the absence of theorems on the consistency and efficiency of maximum likelihood estimation (MLE) methods [125]. We can use MLE, sometimes with good practical results, in controlled situations for which past experience has been accumulated, but there is no guarantee that the MLE will not go astray in some cases.

This work has also shown that the usual Bayesian approach to parameter estimation of chaotic deterministic systems is incorrect and probably suboptimal. The Bayesian approach usually assumes non-informative priors for the structural parameters of the model, for the initial value, and for the standard deviation of the noise. This approach turns out to be incorrect because it amounts to assuming a stochastic model, thus referring to quite another problem, since the correct model is fundamentally deterministic (only with the addition of some noise).

This negative conclusion on the use of the Bayesian approach should be contrasted with the Bayesian approach of Hanson and Hemez [126] to modeling the plastic-flow characteristics of a high-strength steel by combining data from basic material tests. The use of a Bayesian approach to this latter problem seems warranted because the priors reflect the intrinsic heterogeneity of the samples and the large dispersion of the experiments.
In this particular problem concerning material properties, the use of Bayesian priors is warranted by the fact that the structural parameters of the model can be viewed as drawn from a population. It is very important to stress this point: Bayesian approaches to structural parameter determination are justified only in problems with random distributions of the parameters. For the previous problem of deterministic nonlinear dynamics, the approach turns out to be fundamentally incorrect. We therefore view the proper partition of the problem at hand between deterministic and random components as an essential part of validation.
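The coupling between the structural parameter and the initial condition described above can be made concrete with a small twin experiment in the spirit of Ref. [125] (our sketch, using observational rather than dynamical noise for simplicity; all numerical values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def trajectory(a, x0, n):
    """Deterministic logistic-map orbit x_{t+1} = a x_t (1 - x_t)."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = a * x[t] * (1.0 - x[t])
    return x

# Twin experiment: the "true" parameter and initial value are hidden
# from the estimator, which sees only a short, noisy record.
rng = np.random.default_rng(2)
a_true, x0_true, n, sigma = 3.8, 0.3, 20, 0.01
y = trajectory(a_true, x0_true, n) + sigma * rng.standard_normal(n)

def nll(theta):
    """Joint negative log-likelihood in (a, x0): the chaotic dynamics
    couple the structural parameter to the unobserved initial value."""
    a, x0 = theta
    if not (3.5 < a < 4.0 and 0.0 < x0 < 1.0):
        return 1e12                      # crude box constraint
    return 0.5 * np.sum((y - trajectory(a, x0, n)) ** 2) / sigma ** 2

fit = minimize(nll, x0=[3.7, 0.4], method="Nelder-Mead")
print("estimated (a, x0):", fit.x)
```

Already for moderate record lengths n, the exponential divergence of nearby trajectories makes this likelihood surface extraordinarily rugged in the x0 direction, so that generic optimizers stall in spurious local minima; this is a direct numerical manifestation of the testability limit discussed above.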
Extrapolating beyond the range of available data
In the previous discussion, the limit of testability was solely due to the phenomenon of sensitive dependence upon initial conditions, since the model itself was assumed to be known (the logistic map in the above example). In general, we do not have such luxury. Let us illustrate the much more difficult general problem with two examples stressing the possible existence of “indistinguishable states.” Consider a map f that generates a time series. Assuming that f is unknown a priori, let us construct/constrain a map f̂ whose initial condition and parameters can be tuned in such a way that trajectories of f̂ follow the data generated by f for a while, although the two maps eventually diverge. Suppose that the time series of f is too short to explore the range expressing the divergence between the two maps. How can we (in)validate f̂ as an incorrect model of f?

This problem arises in the characterization of the tails of distributions of stochastic variables. For instance, Malevergne, Pisarenko, and Sornette [127] have shown that, based on available data, the best tests and efforts cannot distinguish between a power-law tail and a stretched-exponential tail for the distribution of financial returns (a minimal numerical illustration is sketched below). The two classes of models are indistinguishable, given the amount of data. This fundamental limitation unfortunately has severe consequences, because choosing one or the other model leads to different predictions for the frequency of very large losses lying beyond the range sampled by historical data (the analog of the f versus f̂ problem above). The practical consequences are significant, in terms of the billions of dollars banks should put (or not) aside to cover large market swings that are outside the data set available from the known past history.

This example illustrates a crucial aspect of model validation, namely that it requires the issuance of predictions outside the domain of parameters and/or of variables that has been tested “in-sample” to establish the (calibrated or “tuned”) model itself.
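The tail-indistinguishability result of Ref. [127] is easy to reproduce in caricature (our sketch; the Pareto exponent, sample size, and the chosen parameterization of the stretched exponential are illustrative assumptions). Both families are fitted by maximum likelihood to the same sample, and their per-point log-likelihoods are compared:

```python
import numpy as np
from scipy.optimize import minimize

# A modest sample from a pure Pareto (power-law) tail with exponent
# mu = 3 on x >= 1, standing in for the empirical tail of returns.
rng = np.random.default_rng(3)
n, mu = 500, 3.0
x = rng.pareto(mu, size=n) + 1.0

# Power-law fit, pdf = mu x^-(mu+1) on x >= 1: the MLE is analytic.
mu_hat = n / np.log(x).sum()
ll_pl = n * np.log(mu_hat) - (mu_hat + 1.0) * np.log(x).sum()

# Stretched-exponential fit on x >= 1,
# pdf = (c/d) x^(c-1) exp(-(x^c - 1)/d), fitted numerically in (c, d).
def nll_se(theta):
    c, d = theta
    if c <= 0.0 or d <= 0.0:
        return 1e12
    return -np.sum(np.log(c / d) + (c - 1.0) * np.log(x) - (x**c - 1.0) / d)

ll_se = -minimize(nll_se, x0=[0.5, 1.0], method="Nelder-Mead").fun

print(f"mean log-likelihood, power law:      {ll_pl / n:.4f}")
print(f"mean log-likelihood, stretched exp.: {ll_se / n:.4f}")
```

On typical realizations of this size, the two mean log-likelihoods are nearly equal, far too close to reject either family, even though the two fitted tails extrapolate to very different probabilities for extreme events.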
References

1. D. Sornette, A.B. Davis, K. Ide, K.R. Vixie, V. Pisarenko, and J.R. Kamm. Algorithm for model validation: Theory and applications. Proc. Nat. Acad. Sci., 104:6562–6567, 2007.
2. P.J. Roache. Verification and Validation in Computational Science and Engineering. Hermosa Publishers, Albuquerque, NM, 1998.
3. R. Costanza, R. d'Arge, R. deGroot, S. Farber, M. Grasso, B. Hannon, K. Limburg, S. Naeem, R.V. O'Neill, J. Paruelo, R.G. Raskin, P. Sutton, and M. vandenBelt. The value of the world's ecosystem services and natural capital. Nature, 387:253–260, 1997.
4. S. Pimm. The value of everything. Nature, 387:231–232, 1997.
5. I. Babuska and J.T. Oden. Verification and validation in computational engineering and science: Basic concepts. Comput. Methods Appl. Mech. Engrng., 193:4057–4066, 2004.
6. D.E. Post and L.G. Votta. Computational science demands a new paradigm. Phys. Today, 58:35–41, 2005.
7. AIAA. Guide for the verification and validation of computational fluid dynamics simulations. Technical Report AIAA G-077-1998, American Institute of Aeronautics and Astronautics, 1998.
8. S. Schlesinger. Terminology for model credibility. Simulation, 32(3):103–104, 1979.
9. R.G. Sargent. Verification and validation of simulation models. In D.J. Medeiros, E.F. Watson, J.S. Carson, and M.S. Manivannan, editors, Proceedings of the 1998 Winter Simulation Conference, pages 121–130, 1998.
10. R.G. Sargent. Some approaches and paradigms for verifying and validating simulation models. In B.A. Peters, J.S. Smith, D.J. Medeiros, and M.W. Rohrer, editors, Proceedings of the 2001 Winter Simulation Conference, pages 106–114, 2001.
11. A.C. Calder, B. Fryxell, T. Plewa, R. Rosner, L.J. Dursi, V.G. Weirs, T. Dupont, H.F. Robey, J.O. Kane, B.A. Remington, R.P. Drake, G. Dimonte, M. Zingale, F.X. Timmes, K. Olson, P. Ricker, P. MacNeice, and H.M. Tufo. On validating an astrophysical simulation code. The Astrophysical Journal Supplement Series, 143:201–229, 2002.
12. G. Sod. A survey of several finite difference methods for systems of nonlinear hyperbolic conservation laws. J. Comput. Phys., 27:1–31, 1978.
13. W.J. Rider. An adaptive Riemann solver using a two-shock approximation. Computers & Fluids, 28:741–777, 1999.
14. L.I. Sedov. Similarity and Dimensional Methods in Mechanics. Academic Press, New York, NY, 1959.
15. P. Woodward and P. Colella. The numerical simulation of two-dimensional fluid flow with strong shocks. J. Comput. Phys., 54:115–173, 1984.
16. P.A. Gnoffo, R.D. Braun, K.J. Weilmuenster, R.A. Mitcheltree, W.C. Engelund, and R.W. Powell. Prediction and validation of Mars Pathfinder hypersonic aerodynamic data base. In Proceedings of the 7th AIAA/ASME Joint Thermophysics and Heat Transfer Conference, June 15–18, 1998, Albuquerque, NM, 1998.
17. P.J. Roache. Fundamentals of Computational Fluid Dynamics. Hermosa Publishers, Albuquerque, NM, 1998.
18. P.J. Roache. Recent contributions to verification and validation methodology. In Proceedings of the Fifth World Congress on Computational Mechanics, Vienna, Austria, 2002.
19. N. Oreskes, K. Shrader-Frechette, and K. Belitz. Verification, validation and confirmation of numerical models in the Earth sciences. Science, 263:641–646, 1994.
20. J.D. Sterman. The meaning of models. Science, 264:329–330, 1994.
21. E.J. Rykiel, Jr. The meaning of models. Science, 264:330–331, 1994.
22. N. Oreskes, K. Belitz, and K. Shrader-Frechette. The meaning of models—Response. Science, 264:331, 1994.
23. N. Oreskes. Evaluation (not validation) of quantitative models. Environmental Health Perspectives Supplements, 106(Suppl. 6):1453–1460, 1998.
24. L.F. Konikow and J.D. Bredehoeft. Groundwater models cannot be validated. Adv. Water Res., 15:75–83, 1992.
25. G.J. Chaitin. Algorithmic Information Theory. Cambridge University Press, New York, NY, 1987.
26. M. Buchanan. Revealing order in the chaos. New Scientist, 2488, 2005.
27. N. Israeli and N. Goldenfeld. On computational irreducibility and the predictability of complex physical systems. Phys. Rev. Lett., 92:074105, 2004.
28. A.A. Borovkov. Mathematical Statistics. Taylor & Francis, Amsterdam, The Netherlands, 1998.
29. H.W. Coleman and F. Stern. Uncertainties and CFD code validation. J. Fluids Engrng., 119:795–803, 1997.
30. R.G. Hills and T.G. Trucano. Statistical validation of engineering and scientific models: Background. Technical Report SAND99-1256, Sandia National Laboratory, 1999.
31. R.G. Hills and T.G. Trucano. Statistical validation of engineering and scientific models with application to CTH. Technical Report SAND2001-0312, Sandia National Laboratory, 2000.
32. R.G. Easterling. Measuring the predictive capability of computational models: Principles and methods, issues and illustrations. Technical Report SAND2001-0243, Sandia National Laboratory, 2001.
33. W.L. Oberkampf and T.G. Trucano. Verification and validation in computational fluid dynamics. Progress in Aerospace Sciences, 38:209–272, 2002.
34. C. Gourieroux and A. Monfort. Testing non-nested hypotheses. In R.F. Engle and D. McFadden, editors, Handbook of Econometrics, volume IV, chapter 44, pages 2583–2637. Elsevier Science B.V., Amsterdam, The Netherlands, 1994.
35. T.G. Trucano, M. Pilch, and W.L. Oberkampf. On the role of code comparisons in verification and validation. Technical Report SAND2003-2752, Sandia National Laboratory, 2003.
36. R.F. Cahalan, L. Oreopoulos, A. Marshak, K.F. Evans, A.B. Davis, R. Pincus, K.H. Yetzer, B. Mayer, R. Davies, T.P. Ackerman, H.W. Barker, E.E. Clothiaux, R.G. Ellingson, M.J. Garay, E. Kassianov, S. Kinne, A. Macke, W. O'Hirok, P.T. Partain, S.M. Prigarin, A.N. Rublev, G.L. Stephens, F. Szczap, E.E. Takara, T. Varnai, G.Y. Wen, and T.B. Zhuravleva. The I3RC: Bringing together the most advanced radiative transfer tools for cloudy atmospheres. Bull. Amer. Meteor. Soc., 86(9):1275–1293, 2005.
37. B. Pinty, N. Gobron, J.L. Widlowski, S.A.W. Gerstl, M.M. Verstraete, M. Antunes, C. Bacour, F. Gascon, J.P. Gastellu, N. Goel, S. Jacquemoud, P. North, W.H. Qin, and R. Thompson. RAdiation Transfer Model Intercomparison (RAMI) exercise. J. Geophys. Res., 106(D11):11937–11956, 2001.
38. B. Pinty, J.L. Widlowski, M. Taberner, N. Gobron, M.M. Verstraete, M. Disney, F. Gascon, J.P. Gastellu, L. Jiang, A. Kuusk, P. Lewis, X. Li, W. Ni-Meister, T. Nilson, P. North, W. Qin, L. Su, S. Tang, R. Thompson, W. Verhoef, H. Wang, J. Wang, G. Yan, and H. Zang. RAdiation Transfer Model Intercomparison (RAMI) exercise: Results from the second phase. J. Geophys. Res., 109(D6):D06210, doi:10.1029/2003JD004252, 2004.
39. J.L. Widlowski, M. Taberner, B. Pinty, V. Bruniquel-Pinel, M. Disney, R. Fernandes, J.P. Gastellu-Etchegorry, N. Gobron, A. Kuusk, T. Lavergne, S. Leblanc, P.E. Lewis, E. Martin, M. Mottus, P.R.J. North, W. Qin, M. Robustelli, N. Rochdi, R. Ruiloba, C. Soler, R. Thompson, W. Verhoef, M.M. Verstraete, and D. Xie. Third RAdiation Transfer Model Intercomparison (RAMI) exercise: Documenting progress in canopy reflectance models. J. Geophys. Res., 112(D9):D09111, doi:10.1029/2006JD007821, 2007.
40. T.G. Trucano, L.P. Swiler, T. Igusa, W.L. Oberkampf, and M. Pilch. Calibration, validation, and sensitivity analysis: What's what. Reliab. Engrng. Syst. Safety, 91:1331–1357, 2006.
41. W.L. Oberkampf. What are validation experiments? Experimental Techniques, 25:35–40, 2002.
42. T.G. Trucano, M. Pilch, and W.L. Oberkampf. General concepts for experimental validation of ASCI code applications. Technical Report SAND2002-0341, Sandia National Laboratory, 2002.
43. A. Saltelli, K. Chan, and E.M. Scott (eds.). Sensitivity Analysis. John Wiley & Sons, Chichester, UK, 2000.
44. A. Saltelli, S. Tarantola, F. Campolongo, and M. Ratto. Sensitivity Analysis in Practice. John Wiley & Sons, Chichester, UK, 2004.
45. E. Kalnay. Atmospheric Modeling, Data Assimilation, and Predictability. Cambridge University Press, New York, NY, 2003.
46. D.F. Parrish and J.F. Derber. The National Meteorological Center's spectral statistical-interpolation analysis system. Mon. Wea. Rev., 120:1747–1763, 1992.
47. F. Rabier, H. Jarvinen, E. Klinker, J.-F. Mahfouf, and A. Simmons. The ECMWF implementation of four-dimensional variational assimilation (4D-Var): I. Experimental results with simplified physics. Quart. J. Roy. Meteor. Soc., 126:1143–1170, 2000.
48. P.L. Houtekamer and H.L. Mitchell. Ensemble Kalman filtering. Quart. J. Roy. Meteor. Soc., 131:3269–3289, 2005.
49. I. Szunyogh, E.J. Kostelich, G. Gyarmati, E. Kalnay, B.R. Hunt, E. Ott, E. Satterfield, and J.A. Yorke. A local ensemble transform Kalman filter data assimilation system for the NCEP global model. Tellus A: Dynamic Meteorology and Oceanography, 59, doi:10.1111/j.1600-0870.2007.00274, 2007. In press.
50. J.S. Whitaker, T.M. Hamill, X. Wei, Y. Song, and Z. Toth. Ensemble data assimilation with the NCEP global forecast system. Mon. Wea. Rev., 2007. In press.
51. Z. Toth and E. Kalnay. Ensemble forecasting at NMC: The generation of perturbations. Bull. Amer. Meteor. Soc., 74:2317–2330, 1993.
52. P.L. Houtekamer and J. Derome. Methods for ensemble prediction. Mon. Wea. Rev., 123:2181–2196, 1995.
53. F. Molteni, R. Buizza, T.N. Palmer, and T. Petroliagis. The ECMWF ensemble prediction system: Methodology and validation. Quart. J. Roy. Meteor. Soc., 122:73–119, 1996.
54. D.W. Berning and A.R. Hefner, Jr. IGBT model validation. IEEE Industry Applications Magazine, 4(6):23–34, 1998.
55. S. Kaplan and B.J. Garrick. On the quantitative definition of risk. Risk Analysis, 1:11–27, 1981.
56. R.P. Rechard. Historical background on performance assessment for the Waste Isolation Pilot Plant. Reliab. Engrng. Syst. Safety, 69:5–46, 2000.
57. R.L. Keeney and D. von Winterfeldt. Eliciting probabilities from experts in complex technical problems. IEEE Trans. Eng. Management, 38:191–201, 1991.
58. J.C. Helton. Treatment of uncertainty in performance assessments for complex systems. Risk Analysis, 14:483–511, 1994.
59. J.C. Helton. Uncertainty and sensitivity analysis in performance assessment for the Waste Isolation Pilot Plant. Comp. Phys. Comm., 117:156–180, 1999.
60. J.C. Helton, D.R. Anderson, G. Basabilvazo, H.-N. Jow, and M.G. Marietta. Conceptual structure of the 1996 performance assessment for the Waste Isolation Pilot Plant. Reliab. Engrng. Syst. Safety, 69:151–165, 2000.
61. F. Dyson. Infinite in All Directions. Penguin, London, UK, 1988.
62. L. Pal and M. Makai. Remarks on statistical aspects of safety analysis of complex systems. 2003.
63. J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1944.
64. T.E. Harris. The Theory of Branching Processes. Dover, New York, NY, 1988.
65. A. Wald. Statistical Decision Functions. J. Wiley & Sons, New York, NY, 1950.
66. S. Kotz and N. Johnson. Breakthroughs in Statistics (Foundations and Theory), volume 1. Springer-Verlag, Heidelberg, Germany, 1993.
67. H.A. Simon. A behavioral model of rational choice. Quarterly Journal of Economics, 69(1):99–118, 1955.
68. H.A. Simon. Rational decision making in business organizations. American Economic Review, 69(4):493–513, 1979.
69. R. Zhang and S. Mahadevan. Bayesian methodology for reliability model acceptance. Reliab. Engrng. Syst. Safety, 80:95–103, 2003.
70. S. Mahadevan and R. Rebba. Validation of reliability computational models using Bayesian networks. Reliab. Engrng. Syst. Safety, 87:223–232, 2005.
71. Board on Earth Sciences and Resources, Panel on Seismic Hazard Evaluation, Committee on Seismology, Commission on Geosciences, Environment, and Resources, National Research Council. Review of Recommendations for Probabilistic Seismic Hazard Analysis: Guidance on Uncertainty and Use of Experts. National Academy Press, Washington, DC, 1997.
72. T.N. Palmer, R. Gelaro, J. Barkmeijer, and R. Buizza. Singular vectors, metrics and adaptive observations. J. Atmos. Sci., 55:633–653, 1998.
73. J.-F. Muzy, E. Bacry, and A. Arnéodo. The multifractal formalism revisited with wavelets. Int. J. of Bifurcation and Chaos, 4:245–302, 1994.
74. R.E. Bellman and L.A. Zadeh. Decision-making in a fuzzy environment. Management Science, 17:141–164, 1970.
75. L.A. Zadeh. Knowledge representation in fuzzy logic. In R.R. Yager and L.A. Zadeh, editors, An Introduction to Fuzzy Logic Applications in Intelligent Systems, pages 2–25. Kluwer Academic, Norwell, MA, 1992.
76. W.L. Oberkampf and M.F. Barone. Measures of agreement between computation and experiment: Validation metrics. Technical Report SAND2005-4302, Sandia National Laboratory, 2005.
77. M. Massimi. Pauli's Exclusion Principle: The Origin and Validation of a Scientific Principle. Cambridge University Press, Cambridge, UK, 2005.
78. A. Einstein, B. Podolsky, and N. Rosen. Can quantum-mechanical description of physical reality be considered complete? Phys. Rev., 47:777–780, 1935.
79. J.S. Bell. On the Einstein-Podolsky-Rosen paradox. Physics, 1:195–200, 1964.
80. A. Aspect, C. Imbert, and G. Roger. Absolute measurement of an atomic cascade rate using a two-photon coincidence technique. Application to the 4p² ¹S₀ → 4s4p ¹P₁ → 4s² ¹S₀ cascade of calcium excited by two-photon absorption. Optics Comm., 34:46–52, 1980.
81. J.G. Rarity and P.R. Tapster. Experimental violation of Bell's inequality based on phase and momentum. Phys. Rev. Lett., 64:2495–2498, 1990.
82. R. Webb, S. Washburn, C. Umbach, and R. Laibowitz. Observation of h/e Aharonov-Bohm oscillations in normal-metal rings. Phys. Rev. Lett., 54:2696–2699, 1985.
83. B. Schwarzschild. Currents in normal-metal rings exhibit Aharonov-Bohm effect. Phys. Today, 39:17–20, 1986.
84. M.H. Anderson, J.R. Ensher, M.R. Matthews, C.E. Wieman, and E.A. Cornell. Observation of Bose-Einstein condensation in a dilute atomic vapor. Science, 269:198–201, 1995.
85. R. Gähler, A.G. Klein, and A. Zeilinger. Neutron optical test of nonlinear wave mechanics. Phys. Rev. A, 23:1611–1617, 1981.
86. S. Weinberg. Precision tests of quantum mechanics. Phys. Rev. Lett., 62:485–488, 1989.
87. A.J. Leggett. Testing the limits of quantum mechanics: Motivation, state of play, prospects. J. Phys. Condens. Matter, 14:R415–R451, 2002.
88. Z. Olami, H.J.S. Feder, and K. Christensen. Self-organized criticality in a continuous, nonconservative cellular automaton modeling earthquakes. Phys. Rev. Lett., 68:1244–1247, 1992.
89. B. Drossel. Complex scaling behavior of nonconserved self-organized critical systems. Phys. Rev. Lett., 89:238701, 2002.
90. A. Helmstetter, S. Hergarten, and D. Sornette. Properties of foreshocks and aftershocks of the non-conservative self-organized critical Olami-Feder-Christensen model. Phys. Rev. E, 70:046120, 2004.
91. D. Sornette. Critical Phenomena in Natural Sciences. Springer-Verlag, Heidelberg, Germany, 2004.
92. S. Hergarten and H.J. Neugebauer. Foreshocks and aftershocks in the Olami-Feder-Christensen model. Phys. Rev. Lett., 88:238501, 2002.
93. H.W. Barker and A.B. Davis. Approximation methods in atmospheric 3D radiative transfer, Part 2: Unresolved variability and climate applications. In A. Marshak and A.B. Davis, editors, 3D Radiative Transfer in Cloudy Atmospheres, pages 343–383. Springer-Verlag, Heidelberg, Germany, 2005.
94. S. Lovejoy. Area-perimeter relation for rain and cloud areas. Science, 216:185–187, 1982.
95. A.B. Davis. Effective propagation kernels in structured media with broad spatial correlations, illustration with large-scale transport of solar photons through cloudy atmospheres. In F. Graziani, editor, Computational Transport Theory - Granlibakken 2004, pages 84–140. Springer-Verlag, New York, NY, 2006.
96. A.B. Davis and A. Marshak. Lévy kinetics in slab geometry: Scaling of transmission probability. In M.M. Novak and T.G. Dewey, editors, Fractal Frontiers, pages 63–72. World Scientific, Singapore, 1997.
97. K. Pfeilsticker. First geometrical pathlengths probability density function derivation of the skylight from spectroscopically highly resolving oxygen A-band observations, 2. Derivation of the Lévy index for the skylight transmitted by mid-latitude clouds. J. Geophys. Res., 104:4101–4116, 1999.
98. Q.-L. Min, L.C. Harrison, and E.E. Clothiaux. Joint statistics of photon path length and cloud optical depth: Case studies. J. Geophys. Res., 106:7375–7385, 2001.
99. A.B. Davis, D.M. Suszcynski, and A. Marshak. Shortwave transport in the cloudy atmosphere by anomalous/Lévy photon diffusion: New diagnostics using FORTE lightning data. In Proceedings of the 10th Atmospheric Radiation Measurement (ARM) Science Team Meeting, San Antonio, Texas, March 13–17, 2000, 2000.
100. Q.-L. Min, L.C. Harrison, P. Kiedron, J. Berndt, and E. Joseph. A high-resolution oxygen A-band and water vapor band spectrometer. J. Geophys. Res., 109:D02202–D02210, 2004.
101. A.B. Davis and A. Marshak. Space-time characteristics of light transmitted through dense clouds: A Green function analysis. J. Atmos. Sci., 59:2714–2728, 2002.
102. T. Scholl, K. Pfeilsticker, A.B. Davis, H.K. Baltink, S. Crewell, U. Löhnert, C. Simmer, J. Meywerk, and M. Quante. Path length distributions for solar photons under cloudy skies: Comparison of measured first and second moments with predictions from classical and anomalous diffusion theories. J. Geophys. Res., 111:D12211–D12226, 2006.
103. S.V. Buldyrev, M. Gitterman, S. Havlin, A.Ya. Kazakov, M.G.E. da Luz, E.P. Raposo, H.E. Stanley, and G.M. Viswanathan. Properties of Lévy flights on an interval with absorbing boundaries. Physica A, 302:148–161, 2001.
104. A.B. Davis, R.F. Cahalan, J.D. Spinhirne, M.J. McGill, and S.P. Love. Off-beam lidar: An emerging technique in cloud remote sensing based on radiative Green-function theory in the diffusion domain. Phys. Chem. Earth (B), 24:177–185 (Erratum 757–765), 1999.
105. W.J. Rider, J.A. Greenough, and J.R. Kamm. Combining high-order accuracy with non-oscillatory methods through monotonicity preservation. Int. J. Num. Meth. Fluids, 47:1253–1259, 2005.
106. W.J. Rider, J.A. Greenough, and J.R. Kamm. Accurate monotonicity- and extrema-preserving methods through adaptive nonlinear hybridizations. J. Comput. Phys., 225:1827–1848, 2007.
107. R. Liska and B. Wendroff. Comparison of several difference schemes on 1D and 2D test problems for the Euler equations. SIAM J. Sci. Comput., 25:995–1017, 2003.
108. R.D. Richtmyer. Taylor instability in shock acceleration of compressible fluids. Commun. Pure Appl. Math., 13:297–319, 1960.
109. E.E. Meshkov. Instability of the interface of two gases accelerated by a shock wave. Izv. Akad. Nauk SSSR, Mekh. Zhidk. Gaza, 5:151–158, 1969.
110. R.F. Benjamin. An experimenter's perspective on validating codes and models with experiments having shock-accelerated fluid interfaces. Comput. Sci. Engrng., 6:40–49, 2004.
111. S. Kumar, G. Orlicz, C. Tomkins, C. Goodenough, K. Prestridge, P. Vorobieff, and R.F. Benjamin. Stretching of material lines in shock-accelerated gaseous flows. Phys. Fluids, 17:082107, 2005.
112. P.J. Roache. Building PDE codes to be verifiable and validatable. Comput. Sci. Engrng., 6:30–38, 2004.
113. R.B. Laughlin. The physical basis of computability. Comput. Sci. Engrng., 4:27–30, 2002.
114. K. Ide, P. Courtier, M. Ghil, and A.C. Lorenc. Unified notation for data assimilation: Operational, sequential and variational. Journal of the Meteorological Society of Japan, 75(1B):181–189, 1997.
115. K.S. Brown and J.P. Sethna. Statistical mechanical approaches to models with many poorly known parameters. Phys. Rev. E, 68:021904, 2003.
116. J.T. Oden, J.C. Browne, I. Babuska, K.M. Liechti, and L.F. Demkowicz. A computational infrastructure for reliable computer simulations. In P.M.A. Sloot, D. Abramson, A.V. Bogdanov, J. Dongarra, A.Y. Zomaya, and Yu.E. Gorbachev, editors, Computational Science - ICCS 2003, International Conference on Computational Science, Melbourne, Australia and St. Petersburg, Russia, June 2-4, 2003, Proceedings, Part IV, volume 2660 of Lecture Notes in Computer Science, pages 385–390, New York, NY, 2003. Springer-Verlag.
117. S. Wolfram. Cellular Automata and Complexity. Westview Press, Boulder, CO, 2002.
118. N. Chomsky. Lectures on Government and Binding. Foris, Dordrecht, The Netherlands, 1981.
119. K.R. Popper. The Logic of Scientific Discovery. Basic Books, New York, NY, 1959.
120. R.M. Stein. Benchmarking Default Prediction Models: Pitfalls and Remedies in Model Validation (Technical Report 020305). Moody's KMV, New York, NY, 2002.
121. I.A. Ibragimov and R.Z. Hasminskii. Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York, NY, 1981.
122. P. Courtier, E. Andersson, W. Heckley, J. Pailleux, D. Vasiljevic, M. Hamrud, A. Hollingsworth, F. Rabier, and M. Fisher. The ECMWF implementation of three-dimensional variational assimilation (3D-Var), Part 1: Formulation. Quart. J. Roy. Meteor. Soc., 124:1783–1807, 1998.
123. P. Ehrenfest. In what way does it become manifest in the fundamental laws of physics that space has three dimensions? In M.J. Klein, editor, Collected Scientific Papers. Interscience, New York, NY, 1959.
124. G.J. Whitrow. Why physical space has three dimensions. British Journal for the Philosophy of Science, 6:13–31, 1956.
125. V.F. Pisarenko and D. Sornette. On statistical methods of parameter estimation for deterministically chaotic time series. Phys. Rev. E, 69:036122, 2004.
126. K.M. Hanson and F.M. Hemez. Uncertainty quantification of simulation codes based on experimental data. In Proceedings of the 41st AIAA Aerospace Sciences Meeting and Exhibit, pages 1–10, Reston, VA, 2003. American Institute of Aeronautics and Astronautics.
127. Y. Malevergne, V.F. Pisarenko, and D. Sornette. Empirical distributions of log-returns: Between the stretched exponential and the power law? Quant. Finance, 5(4):379–401, 2005.