Six textbook mistakes in computational physics
Alexandros Gezerlis
Department of Physics, University of Guelph, Guelph, Ontario N1G 2W1, Canada
Martin Williams
Office of Teaching and Learning & Department of Physics, University of Guelph, Guelph, Ontario N1G 2W1, Canada
(Dated: June 16, 2020)

This article discusses several erroneous claims which appear in textbooks on numerical methods and computational physics. These are not typos or mistakes an individual author has made, but widespread misconceptions. In an attempt to stop these issues from further propagating, we discuss them here, along with some background comments. In each case, we also provide a correction, which is aimed at summarizing material that is known to experts but is frequently mishandled in the introductory literature. To make the mistakes and corrections easy to understand, we bring up specific examples drawn from elementary physics and math. We also take this opportunity to provide pointers to the specialist literature for readers who wish to delve into a given topic in more detail.
I. INTRODUCTION
There is no shortage of mistakes in the research literature on computational physics (or any other subject, for that matter). This is inevitable when the field (or a given tool) is nascent, since investigators have to work “in the dark.” Textbooks are another matter: they are designed to condense the techniques and results that have stood the test of time. It is therefore important to ensure that mistaken claims that have somehow made it into several textbooks be corrected. The present work studies a set of six misconceptions on themes related to computational science. A selection criterion for the specific mistakes discussed has been that they appear in at least two standard textbooks. Mistakes by individual textbook authors are sometimes also important, but have a different flavor: we are here interested in misconceptions which have themselves. . . stood the test of time and are therefore more pernicious. Another criterion we have employed is that the mistake should be sufficiently important that both stating it and correcting it can be understood with a minimum of effort. Generally speaking, this implies that the question it touches upon is of wider import. In other words, we are not talking about typos or other minor issues that students can figure out on their own while working through a textbook. Crucially, the point of the present article is not to criticize authors of respected textbooks. The aim of our work is to make sure future generations of students learn these topics right the first time, and that is a goal that is better served by elaborating on the mistake and detailing how to fix it, not by dwelling on the specific wording employed by those who committed the mistake.

When possible, we provide the relevant equations and a plot that emphasizes how one may be led astray. When the specific misconception requires a more detailed argument than we can provide in this short article, we give pointers to the literature.
One of the authors has recently finished writing a textbook on numerical methods in physics. The points below are handled correctly in that work, but readers may benefit from seeing alternative viewpoints, so we cite several related references.
II. MISTAKES AND CORRECTIONS
Since our goal is not to identify the guilty parties, here we cite, instead, a superset consisting of several standard textbooks on the subject. (Detailed bibliographic information has been made available to the referee(s) of the present manuscript.) For each of our six themes, we start with some general background material and then provide a mistake, namely a specific quote from the literature exemplifying a misconception. After some discussion of what went wrong, we conclude with a correction, i.e., a concise statement that overcomes the problems with the earlier version.
A. Roundoff error in an operation
Floating-point numbers are omnipresent: they are used in the overwhelming majority of computations in physics (or other technical fields). The representation of real numbers on the computer is a mature field, which has been studied extensively. While in the past there existed several conflicting conventions, things are much better today: after the adoption of the IEEE 754 standard for floating-point arithmetic, floating-point numbers can be handled dependably and in a portable manner. Unfortunately, the treatment of rounding error in introductory textbooks is often misleading, bringing us to the first misconception:

FIG. 1: Value of rational function from Eq. (2.1).
Mistake “There is a useful model for approximating how round-off error accumulates in a calculation involving a large number of steps [...] we view the error in each step of a calculation as a literal ‘step’ in a random walk, that is, a walk for which each step is in a random direction.”

Crucially, the claim here is that the rounding error in a mathematical operation (e.g., a subtraction) can be approximated by assuming that the errors in the two numbers are uncorrelated. Such discussions implicitly invoke the central limit theorem, assuming that rounding errors are random and weakly correlated. A related assumption is that rounding error accumulates as you include more and more numbers in a given calculation, i.e., it arises when you are studying many numbers but (presumably) doesn’t obtain when you only manipulate two or three numbers. Judging by students’ feedback in class, it appears that in their perception this tends to pull in two opposing directions. On the one hand, students hear the term “random” and are usually overcome by a sense of impotence: if the error is random, there is probably nothing one can do about it. That may imply that it’s not worth investigating in more detail exactly what can go wrong (and how to avoid such problems). On the other hand, if rounding error is random, then one can produce a general theory (as above, employing the central limit theorem), which suggests that the problem arises only when you carry out many calculations in a row.

To see why both of these impressions are incorrect, let us introduce a simple example: using (double-precision) floating-point numbers, -1.e30 + (1.e30 + 1.) gives an answer of 0., which is obviously wrong. If we simply add these numbers in a different order, i.e., (-1.e30 + 1.e30) + 1., we get 1., which is the right answer.
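In Python, say, the two orderings can be checked directly (a minimal sketch):

```python
# Double-precision floats: the same three numbers, added in two orders.
a = -1.0e30
b = 1.0e30
c = 1.0

wrong = a + (b + c)   # b + c rounds to 1.0e30 (c is 30 orders of magnitude smaller)
right = (a + b) + c   # the two large terms cancel exactly, leaving 1.0

print(wrong)  # 0.0
print(right)  # 1.0
```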
In both scenarios, we were faced with only three numbers: there is no mysterious round-off error accumulation taking place. In the first scenario, 1. and 1.e30 are separated by 30 orders of magnitude, so the answer for that addition is simply 1.e30; we would have needed to keep track of 31 significant figures to correctly compute (and store) the answer, and double precision only provides us with 15-16 of them. In the second scenario, the two large numbers cancel each other out, so we are left with the correct final answer.

One could bring up many more cases along the same lines. An example due to W. Kahan, who was instrumental to the creation of the IEEE 754 standard, is shown in Fig. 1: Horner’s rule (the standard technique used for accurate polynomial evaluation) has been used twice, once for the numerator and once for the denominator of:

r(x) = (4x^4 − 59x^3 + 324x^2 − 751x + 622) / (x^4 − 14x^3 + 72x^2 − 151x + 112)    (2.1)

Notice that this plot is quite zoomed in, so we are essentially seeing the effects of rounding error: far from being “noise” (i.e., uniformly random), this is obeying unmistakable patterns. These are not hand-picked counter-examples; to the contrary, it’s rather uncommon in practice to see rounding errors combine randomly. (Of course, this does not mean that a probabilistic/statistical analysis cannot be a useful tool in the hands of experts.) The non-randomness of rounding error is discussed eloquently and unambiguously in N. J. Higham’s book. To summarize the main point:
Correction
Rounding errors are not random and are typically correlated. When rounding error leads to trouble, instead of thinking about a mysterious rounding-error accumulation, you should look for the one or two operations that are to blame. In other words, there is no substitute for a detailed investigation of where things went wrong.
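Incidentally, the pattern of Fig. 1 takes only a few lines to reproduce. The sketch below assumes the standard form of Kahan's example, matching Eq. (2.1), with the numerator and denominator each evaluated by Horner's rule:

```python
def horner(coeffs, x):
    """Evaluate a polynomial by Horner's rule; the coefficients
    are given from the highest degree down."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

def r(x):
    # Rational function of Eq. (2.1): numerator over denominator.
    num = horner([4.0, -59.0, 324.0, -751.0, 622.0], x)
    den = horner([1.0, -14.0, 72.0, -151.0, 112.0], x)
    return num / den

# Sample r(x) on an extremely fine grid near x = 1.606, as in Fig. 1:
# plotting ys reveals structured, decidedly non-random deviations
# in the last few bits of the computed values.
xs = [1.606 + i * 1e-14 for i in range(-300, 301)]
ys = [r(x) for x in xs]
```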
B. Determinant value and ill-conditioning
Linear algebra is employed in most of computational physics. The two main problems that linear algebra studies can be trivially stated: (a) Ax = b is a concise way of stating a linear system of n equations in n unknowns, and (b) Av = λv is the standard form of the matrix eigenvalue problem. While these two problems are very easy to write down, entire monographs have been dedicated to their efficient solution; see, e.g., Ref. 25. Given the emphasis that numerical analysts have placed on the subject, it is not surprising to hear that state-of-the-art libraries (like BLAS and LAPACK) are robust and dependable. However, even a robust library cannot help you if the problem you are trying to solve is pathological. Thus, it is always good to have an a priori estimate of how tricky a given problem is, i.e., of knowing what headaches may arise before one launches into a detailed study. This is precisely the study of the conditioning of a problem; in what follows, we limit ourselves to the solution of linear systems, Ax = b. It is easy to understand how the following misconception arose, but that does not make it any less problematic:

Mistake “For such [ill-conditioned] systems a small change in coefficients will produce large changes in the result. In particular, this situation occurs when the determinant for A is close to zero.”

On the face of it, this is plausible: a matrix A which is singular has zero determinant, det(A) = 0. It is then tempting (but wrong) to say that a near-singular matrix is one that has a near-zero determinant. The first thing that comes to mind when investigating this tentative criterion is how one could quantify “close to zero.” One should be able to relativize such a determination: a number which encapsulates the entire matrix (such as the determinant) may be large or small in comparison to the magnitude of A’s matrix elements.
Quantitatively, one introduces the concept of the matrix norm, which is a number summarizing how large or small the matrix elements are. For example:

‖A‖ = sqrt( Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} |A_ij|^2 )    (2.2)

is the Euclidean or Frobenius norm. (Other definitions of the matrix norm also exist, but the intuitive point we are making here applies regardless.) Since a matrix norm is non-negative, a plausible way to cast the above tentative criterion is as follows: |det(A)| ≪ ‖A‖, i.e., the absolute value of the matrix determinant is much smaller than the matrix norm. This automatically adjusts for the fact that the matrix elements themselves could be small, in order to see if the determinant is even smaller (or not). Let’s apply this criterion to an example. For the matrix:

A = | 1×10^−5  2×10^−5 |
    | 3×10^−5  4×10^−5 |    (2.3)

we find |det(A)| = 2×10^−10 and ‖A‖ ≈ 5.5×10^−5. Thus, since the criterion |det(A)| ≪ ‖A‖ is clearly satisfied, you may be thinking that this is an ill-conditioned matrix, i.e., a matrix for which many methods will have a hard time (and unstable methods will simply fail). At this point, we recall that this matrix A arises in the study of a linear system of equations of the form Ax = b (here a 2 × 2 problem). Obviously, nothing stops us from multiplying all n = 2 equations by a constant, without changing anything in the statement of the problem. In other words, multiplying the matrix in Eq. (2.3) with, say, 10^10 should not lead to a fundamental change. For the matrix:

B = | 1×10^5  2×10^5 |
    | 3×10^5  4×10^5 |    (2.4)

we find |det(B)| = 2×10^10 and ‖B‖ ≈ 5.5×10^5. In other words, we now find |det(B)| ≫ ‖B‖, so this appears to be a (very) well-conditioned matrix.
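Both verdicts of the determinant criterion, and their resolution, can be checked in a few lines of NumPy. The sketch below uses a representative pair of 2 × 2 matrices that differ only by an overall rescaling; κ is the condition number discussed next:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]]) * 1e-5
B = np.array([[1.0, 2.0], [3.0, 4.0]]) * 1e5   # a rescaled copy of A

def kappa(M):
    """Condition number ||M|| * ||M^-1|| in the Frobenius norm."""
    return np.linalg.norm(M, 'fro') * np.linalg.norm(np.linalg.inv(M), 'fro')

# The determinant "criterion" flips its verdict under rescaling...
print(abs(np.linalg.det(A)), np.linalg.norm(A, 'fro'))   # det << norm
print(abs(np.linalg.det(B)), np.linalg.norm(B, 'fro'))   # det >> norm

# ...while the condition number is scale-invariant and small (15):
print(kappa(A), kappa(B))
```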
This is the opposite conclusion from that drawn in the previous paragraph! There is obviously something wrong with the criterion: multiplying all of our equations with a constant changes the norm, but it has a more dramatic impact on the determinant. However, such a simple rescaling has nothing to do with how well-conditioned or ill-conditioned our problem is. The resolution is to employ matrix perturbation theory and carefully study the effect a small change of our matrix has on the solution vector, i.e., what x + Δx looks like when you start with A + ΔA. A quick derivation here shows that:

‖Δx‖/‖x‖ ≤ ‖A‖ ‖A^−1‖ (‖ΔA‖/‖A‖)    (2.5)

In words, this relates an error bound on ‖Δx‖/‖x‖ with an error bound on ‖ΔA‖/‖A‖. The coefficient in front of ‖ΔA‖/‖A‖ tells us whether or not a small perturbation will tend to get magnified. Equation (2.5) naturally leads us to introduce the condition number:

κ(A) = ‖A‖ ‖A^−1‖    (2.6)

When κ(A) is of order unity we are dealing with a well-conditioned problem, i.e., a small perturbation is not amplified. Conversely, when κ(A) is large, a perturbation is (typically) amplified, so our problem is ill-conditioned. Crucially, this condition number involves both the matrix A and its inverse, A^−1. (Equally crucially, this condition number has nothing to do with the determinant.) As a result, an overall scaling has no effect: the condition number for both A and B, from Eq. (2.3) and Eq. (2.4) respectively, is 15, i.e., not very large. The study of problem conditioning, as well as its relationship to the stability of a given method, is nicely discussed in the book by N. Trefethen and D. Bau. To summarize:
Correction
To find out whether or not the problem Ax = b is ill-conditioned, you should compute the condition number κ(A) = ‖A‖ ‖A^−1‖ (and forget about the determinant).

C. Lagrange interpolation at many points
In physics, we are very often provided with a table of values: we know that these are correct, but it is very costly (or impossible) to produce more points. Even so, in practical applications we typically need to have access to values “in between” the table’s rows. For example, we may be given a table of points, (x_j, y_j) for j = 0, 1, . . . , n − 1 (where y_j = f(x_j)), but we require the values of the function f(x)’s derivative, or integral, and so on. This is where interpolation comes in: one picks a set of n basis functions φ_k(x) and uses them to produce a linear form:

p(x) = Σ_{k=0}^{n−1} c_k φ_k(x)    (2.7)

The (possibly non-linear) φ_k(x) are considered known and one attempts to determine the n parameters c_k in such a way that the interpolating polynomial p(x) goes through our input data points. The first idea that comes to mind is to use polynomials (or monomials) as basis functions; a time-tested technique that does this goes by the name of “Lagrange interpolation.” (This is not least-squares fitting, since we trust that the input is correct.) Given how fundamental this problem is, you would be justifiably surprised to learn that it is handled quite poorly in the literature:

Mistake “[T]his doesn’t work because very high order polynomials tend to have a lot of wiggles in them and can deviate from the fitted points badly in the intervals between points. It’s better in this case to fit many lower-order polynomials such as quadratics or cubics to smaller sets of adjacent points.”

As it so happens, Lagrange interpolation has been subject to more than one misconception. Starting with the least pernicious one, we quote from the excellent volume by Dahlquist and Björck (Ref. 5, p. 284): Lagrange’s interpolation formula is said to be “more suitable for deriving theoretical results than for practical computation.” In other words, Lagrange interpolation is taken to be a tool for proving theorems, not for practical computation (F.
Acton says something similar in his classic book, Ref. 2, pp. 95-96.) As a result, many volumes on numerical methods (or on numerical analysis) briefly mention the existence of Lagrange interpolation, before moving on to other approaches (most commonly, via Newton’s divided differences). As explained in N. Trefethen’s book on approximation theory, the criticisms of Lagrange interpolation are unwarranted: using the barycentric interpolation formula allows one to separate the construction stage (with a cost of O(n^2) operations) from the evaluation stage (which costs O(n) operations). In other words, once you pick the interpolation nodes, evaluating the interpolating polynomial in between the nodes is very easy.

However, whether or not Lagrange interpolation is efficient is not the main problem. The quote given in the mistake above argues
FIG. 2: Data points and interpolated values for the complete elliptic integral of the first kind. The left panel shows the result of employing Chebyshev nodes, whereas in the right panel we use equally spaced nodes.

against employing such an approach. Thus, in the literature it is customary to read that one should, instead, use cubic (or other) splines to connect a few points at a time, without attempting to capture the global behavior of the function. As discussed in N. Trefethen’s aforementioned text, this is an issue that is not limited to computational-physics books: similar claims are widespread in the mathematical literature, also. The most straightforward way of seeing that these claims are wrong is to investigate a specific example. Since interpolation arises when the function f(x) is difficult to evaluate “on-the-fly,” we decide to look at the complete elliptic integral of the first kind:

K(m) = ∫_0^{π/2} dφ / sqrt(1 − m sin^2 φ)    (2.8)

This comes up in the study of the classical pendulum, as soon as you go beyond the small-angle approximation. To find K(m) for a given m, one needs to compute a definite integral; assuming you are not an expert in numerical integration, you may decide to use Gauss-Legendre quadrature here, an approach with a (deservedly) good reputation. Unfortunately, estimating the error in computing the definite integral is difficult that way, so you would be reduced to computing the integral with an increasing number of points (and checking if the answer changes). In short, this is a good example of a costly function: if we need to employ K(m) at millions of m’s, we would really prefer not to have to compute a definite integral for each one of them separately. In the spirit of this article, here we don’t go into the details of how Lagrange interpolation works. All you need to know in order to grasp the qualitative point is that we would like to use Eq.
(2.8) to compute the value of the function at, say, n = 80 points: we will then produce a polynomial of 79-th degree that not only goes through the points, but hopefully also captures the underlying function in between. Again, without going through the details of the derivation, we state the resolution: you would be well advised to pick the x_j’s in your table of points to be Chebyshev points, namely the extrema of Chebyshev polynomials:

x_j = −cos( jπ/(n − 1) ),   j = 0, 1, . . . , n − 1    (2.9)

These are defined on the interval [−1, 1]; for a general interval [a, b] you can scale appropriately. Crucially, these points cluster near −1 and +1; as we will now see, using unequally spaced points does not lead to any conceptual or numerical problems.

Without further ado, we show the result of employing 80 Chebyshev points for the complete elliptic integral of the first kind in the left panel of Fig. 2. We see that our interpolating polynomial exhibits no wiggles whatsoever, despite the fact that we employed a high degree intentionally; actually, for this example, we could have produced a decent interpolant even with fewer points, but we wanted to emphasize that using a single polynomial of a high degree is an excellent tool. The troubles mentioned in the literature arise only if you employ equidistant nodes (which don’t cluster at the ends of the interval): you can see this in the right panel of the figure, where wild fluctuations appear near 0 and 1. If you are in control of the node selection, as is often the case in practice, you can simply choose Chebyshev nodes and avoid any problems. As it so happens, for analytic functions, you get geometric convergence: adding 10 more points improves the interpolation’s quality by an order of magnitude (!). To summarize:

Correction
If the interpolation nodes are at your disposal, you should “bunch” them near the ends of the interval. You can then employ a single polynomial (of potentially high degree) to capture a function’s complicated behavior. For an analytic function, the interpolant converges geometrically, i.e., the error is a straight line on a semilog scale.
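If you want to experiment with this yourself, barycentric interpolation at Chebyshev points fits in a few lines of NumPy. As a sketch, we interpolate Runge's function 1/(1 + 25x²), the classic trouble-maker for equally spaced nodes, standing in for a costly function such as K(m):

```python
import numpy as np

def cheb_nodes(n):
    """Chebyshev points x_j = -cos(j*pi/(n-1)), j = 0, ..., n-1;
    they cluster near the endpoints -1 and +1."""
    return -np.cos(np.arange(n) * np.pi / (n - 1))

def bary_interp(xj, yj, x):
    """Barycentric Lagrange interpolation at the Chebyshev points above,
    whose weights are known in closed form: (-1)^j, halved at the ends."""
    w = (-1.0) ** np.arange(len(xj))
    w[0] *= 0.5
    w[-1] *= 0.5
    diff = x[:, None] - xj[None, :]      # we assume x avoids the nodes
    return (w * yj / diff).sum(axis=1) / (w / diff).sum(axis=1)

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)

xj = cheb_nodes(80)                      # a single degree-79 polynomial
xs = np.linspace(-0.99, 0.99, 1001)
err = np.max(np.abs(bary_interp(xj, f(xj), xs) - f(xs)))
# err is tiny: no wiggles anywhere, despite the high degree.
```

With 80 equally spaced nodes instead, the same function exhibits the wild endpoint oscillations shown in the right panel of Fig. 2.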
D. Discrete Fourier transform and trigonometric interpolation
Trigonometric interpolation is a close relative of the topic we discussed in the previous subsection, namely polynomial interpolation. Essentially, the problem is that of interpolation for the case of periodic functions/signals, so it should not come as a surprise that Lagrange interpolation could also be used here. Of course, since our new problem is inherently periodic, it is more appropriate to employ sines and cosines as the basis functions φ_k(x) in Eq. (2.7). Obviously, an equivalent formulation would be to employ complex exponentials, instead. The latter approach can be framed to take advantage of one of the most successful algorithms of the twentieth century, the fast Fourier transform (FFT). The notation and mathematical expressions on the subject can get messy, and as a result several misleading (if not outright wrong) statements appear in the literature. This brings us to:

Mistake “[W]e only need consider a finite linear combination

q(x) = (1/n) Σ_{k=0}^{n−1} ỹ_k e^{ikx}    (2.10)

[. . . ] the function f(x) and the sum q(x) agree on the sample points: f(x_j) = q(x_j), j = 0, 1, . . . , n − 1. Therefore, q(x) can be viewed as a (complex-valued) interpolating trigonometric polynomial [. . . ] we will usually discard its imaginary component.”

We have tweaked the notation here, to make the following discussion intelligible. Before elaborating on why the quoted statement is wrong, let us spend some time seeing whence it arose. The discrete Fourier transform (DFT) starts from n numbers (presumably representing the value of a function at a set of grid points, i.e., f(x_j) = y_j) and uses those to produce n Fourier parameters ỹ_k:

ỹ_k = Σ_{j=0}^{n−1} y_j e^{−2πikj/n},   k = 0, . . . , n − 1    (2.11)

This relation can be undone via the inverse discrete Fourier transform:

y_j = (1/n) Σ_{k=0}^{n−1} ỹ_k e^{2πikj/n},   j = 0, . . . , n − 1    (2.12)

which takes the Fourier parameters ỹ_k and produces the function values y_j.
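numpy's FFT routines follow exactly the conventions of the DFT pair above, so the n-to-n round trip can be verified directly (a minimal sketch):

```python
import numpy as np

# n = 8 samples on a grid: the DFT produces n Fourier parameters and
# the inverse DFT maps them straight back. No continuous x anywhere.
rng = np.random.default_rng(0)
y = rng.standard_normal(8)

yt = np.fft.fft(y)         # sum_j y_j exp(-2 pi i k j / n)
y_back = np.fft.ifft(yt)   # inverse transform, including the 1/n factor
```

The round trip recovers the original samples to machine precision.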
Observe that both the direct and the inverse DFT refer to a grid of values: there is no continuous x variable anywhere in sight. There are only discrete indices, j and k; as a matter of fact, we are here employing the shifted or standard order of the DFT, whereby both the j and the k indices take on the same values (from 0 to n − 1). If we introduce the grid points x_j = 2πj/n, the inverse DFT of Eq. (2.12) can be rewritten as:

y_j = (1/n) Σ_{k=0}^{n−1} ỹ_k e^{ikx_j}    (2.13)
FIG. 3: Incorrect (left) and correct (right) way to carry out discrete Fourier interpolation.

It should now be easy to see what gave rise to the expression in Eq. (2.10): this looks like a generalization of Eq. (2.13) for any x value. In other words, this “interpolating” function q(x) certainly does go through the grid values, i.e., q(x_j) = y_j, as it should. The question that one should consider, however, is whether or not it is legitimate to consider q(x) as an interpolating polynomial, i.e., whether it is effective in capturing the underlying behavior even outside the grid points. To answer this question, we consider the Mathieu equation, which appears in the study of string vibrations; since our example in the previous subsection corresponded to the classical pendulum, it is fitting that the Mathieu equation also shows up in the case of the quantum pendulum:

f″(x) = (2a cos 2x − s) f(x),   f(0) = f(2π)    (2.14)

Mathematically, this is an eigenvalue problem, i.e., a messier version of a boundary-value problem. This would need to be solved numerically: that is not the focus of our discussion here, though. Just take it for granted that you have managed to solve this equation (for a given value of a) at a small number of points, say n = 8, but you would like to interpolate between those points so that you can access the underlying function at millions of distinct x’s.

In the left panel of Fig. 3 we show our input data points y_j; they correctly represent the second even Mathieu function, ce(x), i.e., at those x values we know that we get an accurate solution, ce(x_j) = y_j. (The following argument would hold just as well if we were discussing a different solution.) In the same plot, we also show the real and imaginary parts of q(x).
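The spurious imaginary part is easy to demonstrate with a toy signal (a sketch; we use a simple cosine rather than the Mathieu function itself):

```python
import numpy as np

def q_naive(y, x):
    """The naive q(x) of Eq. (2.10): the inverse DFT with the grid
    points blindly replaced by a continuous variable x."""
    n = len(y)
    yt = np.fft.fft(y)
    k = np.arange(n)
    return np.exp(1j * np.outer(np.atleast_1d(x), k)) @ yt / n

n = 8
xj = 2 * np.pi * np.arange(n) / n
yj = np.cos(2 * xj)                       # real input samples

on_grid = q_naive(yj, xj)                 # real: reproduces the y_j ...
off_grid = q_naive(yj, xj + np.pi / n)    # ... but complex in between
```

Halfway between the nodes, q(x) picks up an imaginary part of order one, even though the underlying signal is real everywhere.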
You can immediately see that q(x) has a sizable imaginary part (off the grid points) and several fluctuations in between the data points; these fluctuations do not appear to be supported by the data. In short, the fact that q(x) goes through the points (i.e., q(x_j) = y_j) is not enough: you could come up with another function which oscillates even faster and also goes through the points. Neither of these functions can be taken as “the” interpolating trigonometric polynomial.

To drive the point home, the right panel of Fig. 3 shows the exact solution ce(x), along with the same 8 data points as in the left panel. As we had suspected, the true solution does not exhibit strong oscillations in between the points. In the right panel we also took the opportunity to show the result of employing “DFT interpolation”: this does not refer to Eq. (2.10) but to the following, undoubtedly more complicated, equation:

p(x) = (1/n) Σ_{k=0}^{m−1} ỹ_k e^{ikx} + (1/n) ỹ_m cos(mx) + (1/n) Σ_{k=m+1}^{n−1} ỹ_k e^{i(k−n)x}    (2.15)

where m = n/2 and n is even. Note that this is a general expression for the case of trigonometric interpolation when you have access to the Fourier parameters ỹ_k, i.e., applying this to the Mathieu equation is only a specific illustration of more general principles. The details of how Eq. (2.15) came to be do not interest us here (for more, see Ref. 1, pp. 358-359). The important takeaway is that p(x) manages to capture the complicated behavior of ce(x), even though we provided it with only n = 8 input data points. Even better, it accomplishes this task without ever giving rise to an imaginary part: since our input data points were real, any interpolating function should also be real. If you were feeling uncomfortable about dropping the imaginary part of q(x) “by hand,” as the quote above instructed you to do, you can relax now. To summarize:

Correction
The discrete Fourier transform takes you from the n values y_j to the n values ỹ_k. The inverse DFT goes the other way around, but still only takes you from n values to n values. If you wish to interpolate between these n data points, you will need to carefully examine what constitutes trigonometric interpolation, instead of blindly generalizing the definition of the inverse discrete Fourier transform.

E. Acceptance rate in the Metropolis algorithm
Quadrature techniques like Gauss-Legendre (mentioned above) get into serious trouble when the dimensionality of the problem grows. Essentially the only approach that works for such problems is stochastic integration, also known as Monte Carlo integration. Multidimensional Monte Carlo integration almost always employs the Metropolis algorithm, which is (like the FFT) one of the most important algorithms developed in the twentieth century. More details will be provided below, but a crucial aspect of this algorithm is that it involves proposed steps in a multidimensional random walk: these steps may be either accepted or rejected. This brings us to:
Mistake “The actual value of the step size h is determined from the desired accepting rate (the ratio of the accepted to the attempted steps). A large h will result in a small accepting rate. In practice, h is commonly chosen so that the accepting rate of moves is around 50%.”

This seems harmless enough: in many textbooks it is even explicitly called a “rule of thumb,” implying that it’s not really based on theory. The problem with this recommendation, as experience mentoring undergraduate and beginning graduate students has shown, is that newcomers treat this value of 50% as a goal, i.e., consider a computation as having succeeded if the acceptance rate is near 50% (and, therefore, as having failed if the value of this quantity is, say, 30%). As a consequence, students may be overconfident about the trustworthiness of their results or, even worse, may spend many hours of work trying to tune something which does not add to their understanding of the physical processes at play. Their time would be better spent exploring detailed features of their Monte Carlo run.

Let’s elucidate what’s at stake here. The goal of the Metropolis algorithm is to sample from a (known) d-dimensional distribution w(X). It accomplishes this by starting from a location X_{i−1} and proposing a step as follows:

Y_i = X_{i−1} + θ × U_i    (2.16)

Here U_i is a d-dimensional set of random numbers from −1 to 1, θ is the “step size” in this d-dimensional space, and Y_i is the proposed configuration. The heart of the Metropolis algorithm consists of computing the ratio w(Y_i)/w(X_{i−1}) and using that to determine whether to accept or reject the proposed step. The question naturally arises of how to pick the value of the step size, θ. The aforementioned rule of thumb is to pick θ such that roughly half of the total steps are accepted (and therefore half are rejected); the ratio of accepted states over total proposed steps is known as the acceptance rate.
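The tradeoff is easy to see in a one-dimensional toy version of the algorithm (a sketch; w is an unnormalized Gaussian, and the proposal mimics Eq. (2.16)):

```python
import math
import random

def metropolis(w, theta, n_steps, x0=0.0, seed=1):
    """1d Metropolis sampler for an unnormalized density w(x);
    returns the chain and the fraction of accepted proposals."""
    rng = random.Random(seed)
    x, chain, accepted = x0, [], 0
    for _ in range(n_steps):
        y = x + theta * rng.uniform(-1.0, 1.0)    # proposed step
        if rng.random() < min(1.0, w(y) / w(x)):  # accept with prob w(y)/w(x)
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / n_steps

w = lambda x: math.exp(-0.5 * x * x)              # standard Gaussian, unnormalized

_, rate_small = metropolis(w, theta=0.2, n_steps=20000)   # tiny steps: high rate
_, rate_large = metropolis(w, theta=10.0, n_steps=20000)  # huge steps: low rate
```

Neither extreme samples efficiently, so θ involves a tradeoff; as discussed next, though, the acceptance rate alone does not tell you whether you have picked well.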
The thinking behind this is as follows: if you make tiny steps, then you will get a large acceptance rate but you end up not moving very far from where you started, so your sampling is not very efficient. If you make very large steps, then your acceptance rate will be small, but that means that you are wasting your time computing steps which are mostly not accepted and therefore don’t take you anywhere. The idea is that you should pick the step size somewhere between these two extreme options.

While this qualitative argument is valid, it doesn’t actually justify using an acceptance rate of 50%. Crucially, the configurations referred to in the previous paragraph are correlated since that’s how they are produced: you start with one configuration and use it to find the next one (this is known as a Markov chain). The question then arises how to go from this Markov chain, consisting of correlated samples, to a set of uncorrelated samples, which have lost all memory of earlier states. In other words, one must examine the autocorrelation time T, after which the memory/correlation gets washed away. If your full simulation runs for much longer than T, then you can view it as being made up of several uncorrelated samples. In equation form, we have:

T = 1 + 2 Σ_{k=1}^{N−1} C(k)    (2.17)

where C(k) is known as the “autocorrelation function,” defined as follows:

C(k) = [ Σ_{i=0}^{N−k−1} ( f(X_i) f(X_{i+k}) − ⟨f⟩^2 ) ] / [ (N − k)( ⟨f^2⟩ − ⟨f⟩^2 ) ]    (2.18)

FIG.
4: Autocorrelation function (left) and autocorrelation time (right) in variational Monte Carlo.

Qualitatively, the autocorrelation time tells us how the variance of the mean for statistically independent values (i.e., the ⟨f^2⟩ − ⟨f⟩^2 in the denominator) gets amplified due to the fact that the samples are correlated; the numerator is known as the “autocovariance.” The autocorrelation function measures the correlation between configurations which are k steps apart; as shown in the first equation, to get the autocorrelation time we sum up all the possibilities. For a given problem, a small T means that the Markov chain decorrelates fast, i.e., the entire calculation’s error bar gets smaller more quickly; that leads to an overall more efficient Monte Carlo run.

This may feel too abstract if you’ve never encountered these concepts before. Let’s pick a specific example: four particles (of mass m) in three spatial dimensions, in the presence of a harmonic oscillator potential, also interacting with each other in the context of quantum mechanics:

Ĥ = −(ℏ^2/2m) Σ_{j=0}^{3} ∇_j^2 + (m/2) Σ_{j=0}^{3} { ω_x^2 [(r_j)_x]^2 + ω_y^2 [(r_j)_y]^2 + ω_z^2 [(r_j)_z]^2 } + g Σ_{j<k} v(r_j − r_k)    (2.19)

where v is the pair interaction (whose detailed form does not matter for the qualitative point being made here). Figure 4 shows the resulting autocorrelation function for three different acceptance rates (70%, 29%, and 11%), as well as the autocorrelation time as a function of the acceptance rate. To summarize:

Correction

To properly determine what acceptance rate is optimal, you should try to minimize the autocorrelation time. If you don’t want to do that, you will typically be OK if your acceptance rate is from 15% to 50%, so you should not spend time trying to tune it to be around 50%.

F. Global error when solving initial-value problems

Differential equations pop up everywhere in physics, see for example Newton’s second law or the Schrödinger equation. The study of dynamics is one of the most important (if not the most important) aspects of physical theory; mathematically, this gives rise to an initial-value problem (IVP): starting from a given point, one tries to determine the solution(s) at later points.
This is known as the "numerical integration" of differential equations, which shares similarities with numerical quadrature, i.e., the computation of definite integrals, but is also conceptually distinct. There are several ways of grouping the different techniques designed to tackle ordinary differential equations: one can distinguish between single-step methods (e.g., Euler's method or Runge-Kutta methods) and multistep methods (e.g., Adams-Bashforth or Adams-Moulton methods). In what follows, we will specialize to the case of single-step methods. In introductory discussions of such methods, one often hears the terms local error and global error being mentioned. Statements like the following one are commonly made; to make matters worse, such an argument and result is often said to be "intuitive":

Mistake: "If our problem is defined on a grid $x \in [0, 1]$, say (which would be typical), then $\Delta x = 1/N$ and we make $O(\Delta x^p)$ error per step for $N$ steps, so the global truncation error will be $O(\Delta x^p)\, N = O(\Delta x^{p-1})$."

Let us try to unpack this a little. Generally speaking, we are interested in solving an IVP:
$$y'(x) = f(x, y(x)), \qquad y(a) = c \quad (2.20)$$
Here $y' = dy/dx$ and $f(x, y)$ is known. We want to solve for $y(x)$, with $x$ going from $a$ to $b$. It is standard (though certainly not mandatory) to employ a discretization scheme with equally spaced abscissas: $x_j = a + jh$, where $j = 0, 1, \ldots, n-1$. The step size $h$ is given by $h = (b-a)/(n-1)$. At a given $x_j$, we call the exact solution of our differential equation $y(x_j)$ and use the symbol $y_j$ for the approximate solution resulting from one of our single-step methods (e.g., Euler's method). Qualitatively, the local error tells you how close to or how far away from a given $y(x_j)$ the corresponding $y_j$ is, when you take a single step. Similarly, the global error tells you how poorly you have done overall, i.e., how close $y_{n-1}$ is to $y(b)$.
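These two notions can be made concrete with a short numeric experiment. The sketch below is an illustration of our own, not an example from the article: it applies the simplest single-step method, forward Euler (whose update is $y_{j+1} = y_j + h f(x_j, y_j)$), to the test problem $y' = y$, $y(0) = 1$, and checks how the global error at $x = 1$ shrinks as the step size is halved.

```python
import math

def euler_global_error(h):
    """Integrate y' = y, y(0) = 1 up to x = 1 with the forward Euler
    update y_{j+1} = y_j + h*f(x_j, y_j), and return |y_n - y(1)|.
    The exact solution is y(x) = e^x, so y(1) = e."""
    n = round(1.0 / h)  # number of steps of size h covering [0, 1]
    y = 1.0
    for _ in range(n):
        y += h * y  # f(x, y) = y for this test problem
    return abs(y - math.e)

# Halving the step size roughly halves the global error, i.e., it is
# O(h) -- one order lower than the single-step (local) error.
print(euler_global_error(0.01) / euler_global_error(0.005))  # ratio near 2
```

Running this prints a ratio close to 2, which is the empirical signature of a first-order global error; we return to why this happens below.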
We will come back to the distinction between these two concepts below. For now, let us go over a likely explanation of why the quoted misconception arose in the first place. To do so, we first make a quick digression into Newton-Cotes methods, which use equally spaced abscissas to compute a definite integral. These approaches slice up the total area under a curve into little rectangles, trapeziums, and so on. For concreteness, let us stick to the simplest such approach, namely the rectangle rule. As you can see by taking a simple Taylor series and integrating it from $x_j$ to $x_{j+1}$, the absolute error incurred by the rectangle rule in a single panel is $E_j = h^2 f'(\xi_j)/2$, where $\xi_j$ is a point between $x_j$ and $x_{j+1}$. Crucially, the error budget in each panel is totally independent from the error budget in each other panel. Thus, it is straightforward to compute the error in the composite rectangle rule, simply as the sum of the one-panel errors; this leads to $E = h(b-a)f'(\xi)/2$, where $\xi$ lies somewhere between $a$ and $b$. Overall, our result is that the leading error in the composite rule is $O(h)$. This is to be compared with the one-panel error, which we saw above was $O(h^2)$. So far, things behave exactly as our quoted statement said they would.

Of course, our quote in the Mistake above referred to initial-value problems, not quadrature. The simplest single-step method for an IVP is the forward Euler method:
$$y_{j+1} = y_j + h f(x_j, y_j), \qquad j = 0, 1, \ldots, n-2, \qquad y_0 = c \quad (2.21)$$
It's easy to understand the geometrical interpretation of this method: the right-hand side of Eq. (2.21) has a term proportional to $f(x_j, y_j)$; we know from Eq. (2.20) that $f(x_j, y_j)$ is an approximation to the slope of the tangent to the true solution at $x_j$. This is illustrated in Fig. 5: for now, focus only on the first step, for which we have assumed that $y(x_0) = y_0$, since that's where we start.

As you may recall, in the rectangle rule the one-panel error $E_j$ was arrived at by integrating over one panel.
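The rectangle-rule error budget just described is easy to verify numerically. The sketch below is our own illustration (the integrand $e^x$ on $[0,1]$ is an arbitrary choice, not one from the article): since the one-panel errors simply add up, the composite error is $O(h)$, so halving $h$ (doubling the number of panels) roughly halves the error.

```python
import math

def rectangle_rule(f, a, b, n):
    """Composite (left-endpoint) rectangle rule with n panels of width h."""
    h = (b - a) / n
    return h * sum(f(a + j * h) for j in range(n))

# Integrate f(x) = e^x on [0, 1]; the exact answer is e - 1.
exact = math.e - 1.0
err_100 = abs(rectangle_rule(math.exp, 0.0, 1.0, 100) - exact)
err_200 = abs(rectangle_rule(math.exp, 0.0, 1.0, 200) - exact)

# Composite error E = h*(b - a)*f'(xi)/2 is O(h): halving h halves it.
print(err_100 / err_200)  # ratio near 2
```

The one-panel error, by contrast, scales as $O(h^2)$; it is only because the panels are independent that summing $n \sim 1/h$ of them cleanly costs one power of $h$.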
The analogous quantity for initial-value problems, the local truncation error, can be defined as follows:
$$t_j = y(x_{j+1}) - y(x_j) - h f(x_j, y(x_j)) \quad (2.22)$$

FIG. 5: Carrying out two integration steps using the forward Euler method.

This employs the exact quantities ($y(x_j)$, not $y_j$), because it homes in on the error made in a single step, under the assumption that one starts from the exact solution. As above, using a Taylor-series expansion you can show that the local truncation error for the Euler method is:
$$t_j = \frac{1}{2} h^2 y''(\xi_j) \quad (2.23)$$
In other words, the local discretization error of this approach is $O(h^2)$. So far, things are fully analogous to the rectangle rule; the fact that there we had a first derivative in the prefactor whereas here we have a second derivative is immaterial for our purposes.

Unfortunately, when you turn to the global error, the analogy with the rectangle rule is no longer helpful. When computing the error in the composite rectangle rule, we took advantage of the fact that the errors in each panel were independent. This is no longer true for the errors incurred in different steps of Euler's method (or other single-step methods). Recall that when defining the local truncation error in Eq. (2.22) we assumed that we started from the exact solution. While this is true in the first step, it is no longer true after that: the actual error at the $j$-th step is therefore not given by Eq. (2.23), since $y(x_j)$ is actually different from $y_j$ and therefore $f(x_j, y(x_j))$ is also different from $f(x_j, y_j)$, as a result of the errors made in previous iterations; have a look at the second step in Fig. 5 to see that there the starting point is $y_1$, which is not the same as $y(x_1)$ (and therefore also $f(x_1, y_1)$ is not the same as $f(x_1, y(x_1))$).
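Before moving on, the $O(h^2)$ scaling of the local truncation error itself is easy to check numerically. In the sketch below (the test problem $y' = y$, $y(0) = 1$ is again an illustrative assumption of ours, not an example from the article), the one-step error $t_0$ shrinks by a factor of about four when $h$ is halved.

```python
import math

def local_truncation_error(h):
    """t_0 = y(x_1) - y(x_0) - h*f(x_0, y(x_0)) for y' = y, y(0) = 1,
    whose exact solution is y(x) = e^x (so y(x_1) = e^h)."""
    return math.exp(h) - 1.0 - h * 1.0

# Halving h divides the single-step error by about 4, i.e., t_j = O(h^2).
print(local_truncation_error(0.01) / local_truncation_error(0.005))  # near 4
```

Note that this experiment deliberately restarts from the exact solution, which is precisely the assumption that fails after the first step of an actual Euler run.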
To go beyond the local truncation error, we introduce:
$$e_j = y(x_j) - y_j \quad (2.24)$$
where, obviously, the total error incurred after all $n-1$ steps, i.e., the global error, would be $e_{n-1} = \mathcal{E}$. Again, this is not the same as the local truncation error, which assumes you have the correct answer at the starting point. It is reasonably straightforward to show that $e_j$ and $t_j$ are related as follows:
$$e_{j+1} = e_j + h\left[f(x_j, y(x_j)) - f(x_j, y_j)\right] + t_j \quad (2.25)$$
To compute the global error $\mathcal{E} = e_{n-1}$, one has to carefully step through this formula, i.e., it is not enough to simply add up the $t_j$, similarly to what one does for the composite rectangle rule. The argument that accomplishes what we need is not too difficult, but it does get a bit technical. The end result is:
$$|\mathcal{E}| \le \frac{h M}{2L}\left(e^{L(b-a)} - 1\right) \quad (2.26)$$
where $L$ is the Lipschitz constant of $f$ with respect to $y$, and we assumed that the second derivatives (to be found in $t_j$) are bounded by $M$. Overall, we have arrived at an error bound for the global error in Euler's method that is $O(h)$. This was the same as that claimed in our quote above but, as seen by the complicated prefactor in Eq. (2.26), this error bound cannot be trivially related to the fact that the local truncation error was $O(h^2)$. In other words, the analogy between methods that handle numerical quadrature and those designed for numerical integration of IVPs breaks down. To summarize:

Correction: When solving initial-value problems, the single-step errors are not independent from each other. Thus, simply adding up the local truncation errors won't do. You have to carefully see the cumulative effect the error in one step has on the calculation of the next step, and so on, until the end. Practically speaking, the result does end up being that the global error is one order lower than the local error; for example, in the forward Euler method (which has a local error $O(h^2)$) the global error obeys an error bound that is $O(h)$.

III.
SUMMARY AND CONCLUSION

Above, we saw a sampling of mistaken claims in the introductory literature on computational science. Thematically, these touched upon floating-point numbers, linear algebra, polynomial or trigonometric interpolation, as well as stochastic integration and the solution of initial-value problems. Together, they constitute a good cross section of basic themes with which every physicist should have a passing familiarity. In each of the six cases, we tried to provide some context that we hope helped both to explain why the misconception came about and why it is, indeed, a mistake. With the partial exception of Mistake "students cannot interrogate what they don't know." In short, we hope to have made a small contribution toward the replacement of error propagation by error cessation.

ACKNOWLEDGMENTS

This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Canada Foundation for Innovation (CFI), and the Early Researcher Award (ERA) program of the Ontario Ministry of Research, Innovation and Science.

A. Gezerlis, Numerical Methods in Physics with Python, (Cambridge University Press, 2020).
F. S. Acton, Numerical Methods That Work, (The Mathematical Association of America, 1990).
U. M. Ascher and C. Greif, A First Course in Numerical Methods, (Society for Industrial and Applied Mathematics, 2011).
J. F. Boudreau and E. S. Swanson, Applied Computational Physics, (Oxford University Press, 2018).
G. Dahlquist and Å. Björck, Numerical Methods, (Prentice-Hall, 1974).
J. Franklin, Computational Methods for Physics, (Cambridge University Press, 2013).
A. L. Garcia, Numerical Methods for Physics, (CreateSpace, 2017).
N. J. Giordano and H. Nakanishi, Computational Physics, 2nd ed., (Pearson, 2005).
A. Gilat and V. Subramaniam, Numerical Methods for Engineers and Scientists, 3rd ed., (Wiley, 2013).
R. W. Hamming, Numerical Methods for Scientists and Engineers, 2nd ed., (McGraw-Hill, 1973).
M. T.
Heath, Scientific Computing, 2nd ed., (McGraw-Hill, 2002).
J. D. Hoffman, Numerical Methods for Engineers and Scientists, 2nd ed., (Marcel Dekker, 2001).
J. Izaac and J. Wang, Computational Quantum Mechanics, (Springer, 2018).
J. Kiusalaas, Numerical Methods in Engineering with Python 3, (Cambridge University Press, 2013).
A. Klein and A. Godunov, Introductory Computational Physics, (Cambridge University Press, 2006).
S. Koonin and D. C. Meredith, Computational Physics, (Addison-Wesley, 1990).
R. H. Landau, M. J. Páez, and C. C. Bordeianu, Computational Physics, 3rd ed., (Wiley-VCH, 2015).
M. Newman, Computational Physics, Rev. ed., (CreateSpace, 2012).
P. Olver and C. Shakiban, Applied Linear Algebra, 2nd ed., (Springer, 2018).
T. Pang, An Introduction to Computational Physics, 2nd ed., (Cambridge University Press, 2006).
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in Fortran, 2nd ed., (Cambridge University Press, 1992).
S. Širca and M. Horvat, Computational Methods in Physics, 2nd ed., (Springer, 2018).
W. Kahan, "The Improbability of Probabilistic Error Analysis," Third ICIAM Congress, (1995).
N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., (Society for Industrial and Applied Mathematics, 2002), pp. 25-26.
J. H. Wilkinson, The Algebraic Eigenvalue Problem, (Oxford University Press, 1965).
L. N. Trefethen and D. Bau III, Numerical Linear Algebra, (Society for Industrial and Applied Mathematics, 1997).
L. N. Trefethen, Approximation Theory and Approximation Practice, (Society for Industrial and Applied Mathematics, 2012), pp. 39-49.
R. M. Martin, L. Reining, and D. M. Ceperley, Interacting Electrons: Theory and Computational Approaches, (Cambridge University Press, 2016), p. 583.
G. O. Roberts, A. Gelman, and W. R. Gilks, "Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms," Ann. Appl. Probab., 110, (1997).
T. Judt with T. Snyder,