[PDF] Bias-Variance Trade-off and Model Selection for Proton Radius Extractions

Abstract

Intuitively, a scientist might assume that a more complex regression model will necessarily yield a better predictive model of experimental data. Herein, we disprove this notion in the context of extracting the proton charge radius from charge form factor data. Using a Monte Carlo study, we show that a simpler regression model can in certain cases be the better predictive model. This is especially true with noisy data where the complex model will fit the noise instead of the physical signal. Thus, in order to select the appropriate regression model to employ, a clear technique should be used such as the Akaike information criterion or Bayesian information criterion, and ideally selected previous to seeing the results. Also, to ensure a reasonable fit, the scientist should also make regression quality plots, such as residual plots, and not just rely on a single criterion such as reduced chi2. When we apply these techniques to low four-momentum transfer cross section data, we find a proton radius that is consistent with the muonic Lamb shift results. While presented for the case of proton radius extraction, these concepts are applicable in general and can be used to illustrate the necessity of balancing bias and variance when building a regression model and validating results, ideas that are at the heart of modern machine learning algorithms.

Full PDF

BBias-Variance Trade-oﬀ and Model Selection for Proton Radius Extractions

Douglas W. Higinbotham, Pablo Giuliani, Randall E. McClellan, Simon ˇSirca,

3, 4 and Xuefei Yan Jeﬀerson Lab, Newport News, VA 23606 Department of Physics, Florida State University, Tallahassee, Florida 32306, USA Faculty of Mathematics and Physics, University of Ljubljana, SI-1000 Ljubljana, Slovenia Joˇzef Stefan Institute, SI-1000 Ljubljana, Slovenia Duke University, Durham, NC 27708

Intuitively, a scientist might assume that a more complex regression model will necessarily yielda better predictive model of experimental data. Herein, we disprove this notion in the context ofextracting the proton charge radius from charge form factor data. Using a Monte Carlo study, weshow that a simpler regression model can in certain cases be the better predictive model. This isespecially true with noisy data where the complex model will ﬁt the noise instead of the physicalsignal. Thus, in order to select the appropriate regression model to employ, a clear technique shouldbe used such as the Akaike information criterion or Bayesian information criterion, and ideallyselected previous to seeing the results. Also, to ensure a reasonable ﬁt, the scientist should alsomake regression quality plots, such as residual plots, and not just rely on a single criterion such asreduced χ . When we apply these techniques to low four-momentum transfer cross section data,we ﬁnd a proton radius that is consistent with the muonic Lamb shift results. While presentedfor the case of proton radius extraction, these concepts are applicable in general and can be usedto illustrate the necessity of balancing bias and variance when building a regression model andvalidating results, ideas that are at the heart of modern machine learning algorithms. I. INTRODUCTION

High-precision Lamb shift experiments on muonic hy-drogen atoms have determined the proton radius tobe 0.84087(39) fm [1, 2]. This result is in starkcontrast to the CODATA-2014 recommended value of0.8751(61) fm [3] which comes from electron scatteringresults and atomic transition frequencies [4–8]. The dis-crepancy between these radius values has become knownas the proton radius puzzle [9–14]. While initial ef-forts to understand this puzzle focused on the details ofthe muonic experiment, attention has now turned to re-examining the atomic and electron scattering results [15–18].For the electron scattering data, the proton charge ra-dius, r p , is extracted from cross section data by deter-mining the slope of the electric form factor, G E , in thelimit of four-moment transfer, Q , approaching zero: r p ≡ (cid:32) − dG E ( Q ) dQ (cid:12)(cid:12)(cid:12)(cid:12) Q =0 (cid:33) / . (1)This deﬁnition of the radius has been shown to bethe same as the one used by the Lamb shift measure-ments [19]. Unfortunately, electron scattering experi-ments cannot reach the Q = 0 limit; thus, an extrapola-tion is required to determine the charge radius from theexperimental data.The various proton radius values that have been ex-tracted from electron scattering data are shown in Fig. 1.In general, r p extractions using simple statistical model-ing of the low Q data [20–23] tend towards a smallerradius while the more complex statistical model, withmany free parameters, tend to extract a large proton ra-dius [24–28]. Nuclear theory constrained extractions of the radius tend to favor a smaller radius [29–31]. Onetechnical detail that is aﬀecting all these results is howthe normalization of the experimental data is handled. FIG. 1. Shown are the radii extracted from electron scatter-ing data [21–55]. The vertical bands indicate the value anduncertainty of the proton radius from muonic hydrogen [1, 2]and CODATA-2014 [3].

In general, regardless of the extracted radius, these dif-ferent ﬁts tend to agree rather well at moderate Q andthe diﬀerences between functions can only be clearly seen a r X i v : . [ phy s i c s . d a t a - a n ] A p r examining the very low Q region where slope and nor-malization are systematically linked together. A generalcriticism of the small radius extractions of the protonradius is the presence of statistical bias [56, 57], withan implication that bias needs to be avoided in order tosuccessfully extract the true radius from the data. Theuse of Monte Carlo methods to ﬁnd bias and then rejectsimple proton radius extraction methods originated withthe classic Monte Carlo study of Borkowski et al. [49]where linear extrapolations were ﬂatly rejected in favorof quadratic extrapolations. Interestingly, that work ig-nored the variance of the more complex function.We will show in this work that when using a MonteCarlo study to test a model’s ability to extract the protonradius one needs to consider not only bias but also vari-ance, and ﬁnd an appropriate balance between the two.After all, it is better to have a slightly biased watch totell the hour, than a broken one that is unbiased becauseit overestimates and underestimates the time symmetri-cally.We will also illustrate that one must consider therange, quantity and precision of the data when determin-ing the best predictive statistical model and show thatsimple biased statistical models can have a higher pre-dictive validity than unbiased more complex models [58].We will then apply these ideas to model selection withreal data, where instead of millions of Monte Carlo re-sults, we get but a single realization of the possible out-comes. II. BIAS

In the English language, bias is often used as a pejo-rative term. In the context of regression, it is simply anoﬀset of the mean from the true central value. Since itis part of a distribution, it is not a property of a singlerealization but can be determined by repeated sampling.In the context of the proton radius extractions, bias wasnicely illustrated by Borkowski et al. [49] and we willdescribe their procedure in the following paragraphs.Form factor pseudo-data is ﬁrst systematically gener-ated from 0 . − to 0.4, 0.8, 1.2, and 1 . − in stepsof 0 .

05 fm − using the standard dipole function: G D ( Q ) = (1 + Q / (18 .

23 fm − )) − , (2)where the cutoﬀ parameter of 18.23 fm − corresponds toa radius of 0.8113 fm.Next, to mimic real data, the pseudo data points areeach randomly shifted using a normal distribution withmean zero and a sigma of 0.5%. Then the pseudo datasets are then ﬁt using both linear and quadratic func-tions: f linear ( Q ) = a + a Q , (3) f quadratic ( Q ) = a + a Q + a Q . (4)These functions are written so that they are linear in theﬁt coeﬃcients, which allows the χ minimization to be TABLE I. The mean a and radius from doing 10 MonteCarlo simulations for each interval in Q where Eq. 2 was usedto generate pseudo data in 0 .

05 fm − steps with each datapoint smeared by a randomly generated, normally distributedpoint-to-point uncertainty of 0.5%. The results clearly indi-cate that the linear ﬁts are biased. The input radius was0.8113 fm (an a /a term of 0 . − ) and a = 1.interval linear ﬁt quadratic ﬁtfm − a Radius [fm] a Radius [fm]0.1 – 0.4 1.000 0.79 1.000 0.810.1 – 0.8 0.999 0.78 1.000 0.810.1 – 1.2 0.997 0.77 1.000 0.810.1 – 1.6 0.996 0.76 1.000 0.81 performed exactly while also allowing the normalizationto ﬂoat. To obtain the physical G E (0) = 1 behavior, onesimply divides these functions by the normalization term, a , to ﬁnd the slope of G E (0) given by a /a which canbe used in Eq. 1 to determine the radius.This procedure was repeated with 10 sets of pseudodata to precisely determine the mean of the extractedradii for these two functions. Since the standard dipolewas used as the input function, one would expect an un-biased function to return a radius of 0.8113 fm. Table Ireproduces the original result [49]. As the table shows,the mean values of a and r p show a clear bias. Based onthis study, the authors of the original work erroneouslyconcluded that the linear models should always be re-jected in favor of the lower-bias quadratic function. III. VARIANCE

While the linear ﬁt does exhibit a bias, bias is not theonly quantity that must be considered when selecting anappropriate statistical model to use. In particular, alongwith the oﬀset from the true value, bias, one must alsoconsider the distribution of the outcomes, the variance.Where variance is the square of the standard deviation, σ , of the statistical distribution. Table II shows a morecomplete picture of the simulation results where the σ ofthe results is shown along with the bias and is graphicallyrepresented in Fig. 2.For all Q intervals, the linear ﬁts provide signiﬁcantlysmaller σ than the more complex quadratic ﬁts. Thoughpicking the function to use based solely on variance wouldalso be incorrect. Thus, this seemingly simple exam-ple has turned into a nearly textbook illustration of thetrade-oﬀ between variance and bias with the linear ﬁtshaving a relatively high bias with a low variance, whilethe quadratic ﬁts have a low bias and high variance. Ofcourse, to calculate the bias requires we know the truevalue which often isn’t the case in a real experiment. TABLE II. An expanded version of Table I where instead of just showing the mean oﬀset of the ﬁt results for a /a , the bias,we also indicate the width of the ﬁt results, σ . Recall that the point uncertainty is ﬁxed at 0.5%. For the simulated radius of0.8113 fm, one would expect an unbias ﬁt to give a /a of 0.1097 fm ; thus, the diﬀerence of that value from the mean ﬁttedvalue of a /a is the bias and the width of the distribution, σ . Also shown is the root mean square error, RMSE, which can beused to quantify the best function for a given interval taking into account both bias and variance.linear ﬁt quadratic ﬁtData Interval a Radius a /a Bias σ RMSE a Radius a /a Bias σ RMSEPoints fm − [fm] [fm ] [fm ] [fm ] [fm ] [fm] [fm ] [fm ] [fm ] [fm ]7 0.1 – 0.4 0.9995 0.7949 − . − . − . − . − . − . − . − . − . − . − . − . − . − . − . − . a Radius a /a Bias σ RMSE a Radius a /a Bias σ RMSEPoints fm − [fm] [fm ] [fm ] [fm ] [fm ] [fm] [fm ] [fm ] [fm ] [fm ]31 0.1 – 0.4 0.9995 0.7952 − . − . − . − . − . − . − . − . − . − . − . − . − . − . − . − . IV. GOLDILOCKS DILEMMA

As was noted by George Box, all models are wrong,thus, the goal is to ﬁnd the most useful model [59]. Forany given set of statistical models, the goal is to ﬁnd theoptimal balance between bias and variance. In general,this can be written as d Bias d Complexity ≈ − d Variance d Complexity , (5)as illustrated in Fig. 4. Thus, to quantify the goodness ofthe ﬁts, we choose Root Mean Square Error (RMSE) [60,61],RMSE = (cid:113) bias + Sigma = (cid:112) Bias + Variance . (6)Using the RMSE values in Table II, one can now quan-tify that for this example the 0.1–0.8 fm − interval isthe preferred range for the linear model while the 0.1–1.6 fm − interval is the preferred range for the quadraticmodel. Going to even higher Q will require even morecomplex models as illustrated in Fig. 3. This is in con-trast to the conclusion one draws when one only considersbias as presented in Table I, though consistent with theobservation that the optimal speciﬁc form of the param-eterization may depend on the Q region being ﬁt [62].It is interesting to repeat the Monte Carlo simulationfor equal number of data points within each range es-pecially since, for elastic scattering, cross sections aresigniﬁcantly higher at lower values of Q and thus, it iseasy to obtain more low Q data. This is shown in Ta-ble III and now the picture is even grayer as the RMSEof the linear ﬁt is nearly equal to the quadratic, thus, assuming that the standard dipole was the true gener-ating function, an experiment with 31 data points andan uncertainty of 0.005 per point over a range of 0.1 to0.8 fm − and a diﬀerent experiment over a range of 0.1 to1.6 fm − would have an equal probability of reproducingthe correct radii if all other things were equal (given thatthe modeler in the ﬁrst group adjusts a line and the onein the second a parabola). This is visualized in Fig. 5where the linear ﬁt is clearly biased but has a small vari-ance compared to the unbiased, large variance quadraticﬁt.The choice of the parsimonious modeler to use the low Q data would likely be driven by the recognition of thefact that as Q increases the extraction of the chargeform factor is complicated by the growing inﬂuence ofthe magnetic form factor. The choice to use a larger Q range would likely be driven by a desire to form a morecomplete picture of the proton’s structure. For example,the parsimonious modeler may only be interested in theproton radius while another modeler may be interestedin higher order moments [63]. Thus, the tension in theextractions of the proton radius from electron scatter-ing data is really about the fact that modelers using thelow Q are generally getting a systematically diﬀerent re-sult than the modelers doing ﬁts which include high Q data, and perhaps points to a systematic problem withour knowledge of the magnetic form factor and/or thefunctional form of the form factors. V. THE BEST PREDICTIVE MODEL

Selecting between a linear or quadratic regression ofthe more complex standard dipole function may seem abit contrived, as one might naively think that just using

FIG. 2. A graphic representation of the Monte Carlo resultsshowing how the linear ﬁts tend to have a relatively high biasthough a low variance, while the quadratic ﬁts tend to have arelatively low bias but a large variance. The black line/pointrepresents the true value. the generating function itself would always yield the bestresults. In section D we prove that for non linear ﬁtsthere is an induced bias even when using the generatingfunction.When taking into account the variance, we can alsoshow that this is not always the case with a simple set up:we use the lowest Q range, 0.1 – 0.4 fm − , and replacethe quadratic function with the generating function anda ﬂoating normalization term: f DipoleFit ( Q ) = n (1 − b Q / − , (7)where n is the normalization factor and the radius isgiven by √− b . Pseudo data is generated for absoluterandom errors of 0.01, 0.005 and 0.003 with three diﬀer-ent spacings: 0.05 fm − spacing with 7 points, 0.01 fm − spacing with 31 points, and 0.005 fm − spacing with 61 Cubic ﬁtQuadratic ﬁtLinear ﬁt Q [1 / fm ] R M S E FIG. 3. A graphic showing how as one goes to higher andhigher Q with the same point-to-point uncertainty and spac-ing, an ever more complex model is needed to represent thedata. Complexity E rr o r bias variancemean square error FIG. 4. An illustration of the trade-oﬀ between bias and vari-ance when selecting a statistical model. Simple models willhave low variance but high bias (under-ﬁtting) while complexmodels will have low bias but high variance (over-ﬁtting). It isthis trade-oﬀ that one seeks to balance. While with repeatedMonte Carlo simulations it is trivial to ﬁnd the optimal pre-dictive model for a given set of data, in the real world the truemodel is typically unknown; one only gets to perform a verylimited number of experiments and thus, one relies on usingreal data and statistical methods for model selection [60]. points. The results of ﬁtting these pseudo data sets areshown in Table IV. For a given row in the Table, thelinear ﬁt has the greater bias and the dipole ﬁt alwayshas the greater variance; bringing the root mean squareerror very close for all the test cases. This example alsomakes it clear that it is not just the number of points thatmatter, but the size of the uncertainties and the rangeof the data that will also be critical parameters in model

FIG. 5. The result of a million simulations and ﬁts of linearﬁts over the Q range of 0.1 – 0.8 fm − and quadratic ﬁts over0.1 – 1.6 fm − , both with 31 uniformly spaced data points.Using root mean square error as the metric, neither exampleis signiﬁcantly better than the other for exacting the protonradius. This is analogous to a dart game between two equallyskilled players: one who hits the bull’s eye more often yethas a large spread (low bias but high variance), and another,equally skilled, player who has a tighter cluster of hits but anoﬀset (high bias but low variance). selection. In the cases where the MSE of the line wassmaller than the ones of the Dipole, the line was beinga better “predictive” model for the proton radius, whilethe Dipole was a better “descriptive” model of the formfactor as whole.Of course for real data, nature hides the true generat-ing function from us, so perhaps it is reassuring to knowthat a reasonable approximation is able to reveal the un-derlying physics just as well as, if not better than, thetrue function. To be clear, the lesson is not that onefunction is better than another; it is that for a given setof data, the scientist is challenged to use the most ap-propriate model (either descriptive or predictive) for thetask at hand. Further details on the general mathematicsbehind these example problems can be found in [58].Using an inappropriate model can lead to erroneousconclusions (e.g. due to diﬀerent normalziations of thedata). In Fig. 6 we show an example of one set of thepseudo data ﬁt with a linear and quadratic where theprior that the G E (0) = 1 has been applied (i.e. thedata has been divided by the normalization term fromthe regression so the function goes to the known limitingvalue at the origin). Since here we know the function thatgenerated the pseudo data, it is clear that the quadraticﬁt can cause the data to be inappropriately shifted andcan generate an large variations in the radius. Of coursein the real world, the true function is not known; thus,with real data a model selection technique is required inorder to determine the appropriate function. VI. MODEL SELECTION

While this classic Monte Carlo example problem is over40 years old, it actually points to exactly the split inthe current electron scattering proton radius extractionprocedures. The parsimonious modelers, who are focusedsolely on extracting a radius, have focused on the low Q G E = 30.2reduced = 1.04AIC = 3.1BIC = 6.0 True FunctionLinear FitLinear Fit - Initial Radius = 0.02 fm0.0 0.1 0.2 0.3 0.4 0.5 Q [fm ] G E = 29.6reduced = 1.06AIC = 4.6BIC = 8.9 True FunctionQuadratic FitQuadratic Fit - Initial Radius = 0.11 fm FIG. 6. Illustration of the eﬀect of renormalizing data. Theblack points show one set of pseudo data, with absolute0.3% random point-to-point uncertainty and 0.01 fm − spac-ing that have been renormalized by applying the prior that G E (0) = 1. While the prior is true, using an inappropriatemodel can cause both the extracted radius and normalizationto be dramatically shifted from the true function. The greybands were created by preforming of 250 simulations and arepresented as an animation in the supplemental material [64]. a G E = 30.2reduced = 1.04AIC = 3.1BIC = 6.0 True FunctionLinear FitLinear Fit - Initial Radius = 0.02 fm0.0 0.1 0.2 0.3 0.4 0.5 Q [fm ] a G E = 29.6reduced = 1.06AIC = 4.6BIC = 8.9 True FunctionQuadratic FitQuadratic Fit - Initial Radius = 0.11 fm FIG. 7. Instead of immediately applying the prior that G E (0) = 1 as done in Fig. 6, it is more intuitive to simple ﬁtthe data allowing the end-point to ﬂoat. In fact, a frequen-tist would correctly simply check that the function reasonablypoints to one within the normaliztion uncertainty. The greybands were created using the same pseudo data as used inFig. 6. and are presented as an animation in the supplemen-tal material [64]. While these two ﬁgures (Fig. 7 and 8) maylook very diﬀerent, as far as the extracted radius goes, thesetwo diﬀerent ways of presenting the pseudo data give exactlythe same results. region accepting a slightly higher level of bias in exchangefor low variance; on the other hand, those modelers whoare interested in extracting more information about theproton (e.g. higher order moments) ﬁt longer Q rangesand have focused on complex models which, while lowerin bias, come at a cost of higher variance. TABLE IV. For the lowest Q interval, 0.1 to 0.4 fm − , we compare regressions with a linear function (Eq. 3) to the dipole ﬁtfunction (Eq. 7). Keeping the range ﬁxed, a spacing of 0.05 fm − (7 points), 0.01 fm − (31 points) and 0.005 fm − (61 points)was used with various absolute random errors. In several of these cases, the simple linear function provides a better predictivemodel then the true functional form and is never far from the true function.linear ﬁt dipole ﬁtData Random a Radius a /a Bias σ RMSE a Radius a /a Bias σ RMSEPoints Error [fm] [fm ] [fm ] [fm ] [fm ] [fm] [fm ] [fm ] [fm ] [fm ]7 0.01 0.9995 0.7941 − . − . − . − . − . − . − . . − . − . − . . − . − . − . . − . − . − . − . − . − . − . − . − . − . − . − . − . − . − . Also, since we do not know the true model, one cannotin general calculate the RMSE. So, while Monte Carloexercises like the one described herein are extremely use-ful for ﬁnding reasonable models to consider and under-standing expected uncertainties, the data must be usedto select the appropriate model. For this one can rely onstatistical modeling selection techniques such as an F -test for nested models [65–67] or the more general Akaikeinformation criterion (AIC) [68] or Bayesian informationcriterion (BIC) [69] to guide our selection of the mostappropriate model to describe a given set of data. Thesestatistical criteria are calculated as follows: χ = N (cid:88) n =1 ((data i − model) /σ i ) , (8) reduced χ = χ / ( N − N var ) , (9)AIC = N log( χ /N ) + 2 N var , (10)BIC = N log( χ /N ) + log( N ) N var , (11)where N is the number of data points, data i and σ i aremeasured values and estimated uncertainties, and N var isthe number of model parameters. Further details aboutcurrent model selection techniques can be found in [70].One should also keep in mind that the input models inMonte Carlo simulations are always just an approxima-tion and one needs to be careful about drawing too stronginferences from the simulated results. For example, justbecause the linear model has a negative bias when com-pared to the standard dipole, does not imply that it hasa negative bias with respect to all possible models. VII. REAL DATA

This brings us to the real data and the current pro-ton radius extractions. Many diﬀerent functions havebeen tried over the years from simple linear ﬁts [53, 55]and continued fractions [43] to high order polynomialswith [25] and without constraints [26]. Since obtaining sub-percent level absolute cross sections is nearly im-possible, a normalization parameter is included to allowan entire set of data to shift as was done in the MonteCarlo simulations. Again, this just allows that the prior, G E (0) = 1, can be applied. To be clear, in this work,when referring to a set of experimental data we are re-ferring to a group of data with a single normalizationparameter.The measured cross sections, σ Meas , are related to thecharge and electric form factors via σ Meas σ Mott = n ε (1 + Q M ) (cid:20) εG E ( Q ) + Q M G M ( Q ) (cid:21) , (12)where n is a normalization factor and the kinematicquantities Q and ε are given by Q = 2 M E (1 − cos θ ) M + E (1 − cos θ ) , (13) ε = (cid:20) (cid:18) Q M (cid:19) tan θ (cid:21) − (14)where E is the incident electron beam energy, M is theproton mass, θ is the measured electron scattering an-gle. The Mott cross section, σ Mott , with the recoil factorincluded, is given by σ Mott = α E cos ( θ/ ( θ/ EM (1 − cos θ )) . (15)From Eq. 12 is it clear that as Q goes to zero and ε goes to one (forward angle electron scattering), the termswhich include the magnetic form factor, G M , becomerelatively unimportant.To illustrate the current tension between ﬁts done withdiﬀerent models, Fig. 8 shows 104 data points from onefull subset from a modern electron scattering experi-ment [71] that covers a range similar to the range studiedherein. By using a single set, only one ﬂoating normaliza-tion parameter is required. This one set of data covers Q [fm ] M e a s / n / M o tt Data Normalization n = 1.005 Data Normalization n = 0.998 th Order Polynomial0.88 fm 10 th Order Polynomial0.84 fm Continued Fraction0.84 fm Dipole

FIG. 8. Shown is one set of modern cross section data to il-lustrate the current tension between diﬀerent models and theeﬀect of the normalization parameter. In electron scatteringdata, the point-to-point uncertainties can be rather small,compared to the typical normalization uncertainty of a fewpercents. To make the curves, the standard dipole magneticform factor has been used along with charge form factors fromseveral recent publications: a bounded 13 th -order polynomialwith a radius of 0.88 fm [72], a unbounded 10 th -order polyno-mial with a radius of 0.88 fm [25], a continued fraction witha radius of 0.84 fm [22], and a dipole function with a radiusof 0.84 fm [23]. a range similar to the 85 data point ﬁts with 6 ﬂoat-ing normalizations [21, 40]. Along with the data, fourrepresentative functions are shown using Eq. 12 with astandard magnetic form factor. Two functions give radiithat agree with the CODATA value for the proton ra-dius [25, 72] and two that agree with the muonic Lambshift measurements [22, 23]. We note that this ﬁgurelooks oddly similar to the pseudo data in Fig. 6 but herethe true function is unknown so it is not clear whichcurves are shifted with respect to the true reduced crosssection values.Following the logic of this work, we try ﬁtting the crosssection data shown in Fig. 8 with Eq. 12, where G E hasbeen approximated by Eq. 3 and Eq. 4, though with thenormalization now subsumed in Eq. 12. Additionally,we performed ﬁts with the following two commonly usedfunctions: f cubic ( Q ) = 1 + a Q + a Q + a Q , (16) f rational ( Q ) = 1 + n Q m Q , (17)where the radius for the rational function is given by (cid:112) − n − m ) and for the polynomials by √− a . Thelow order rational function ( n = m = 1) can easily beextended to give the expected asymptotic behavior as Q → ∞ by using a more complex rational functionwhere m = n + 2 such as in [42, 73, 74].In order to check on the inﬂuence of the magnetic formfactor on the results, several diﬀerent magnetic form fac-tor functions were used. First, the ﬁts were done with a standard dipole magnetic form factor and then repeatedwith the magnetic form factor from [25] and [72]. The F -test, AIC, or BIC model selection techniques all slightlyprefer ﬁts with Eq. 4. The use of statistical criteria formodel selection helps avoids conﬁrmation bias, thoughone could still be using an inappropriate function forthe problem at hand. Uncertainties were determined byapplying a statistical bootstrap to the data [75]. Thisis done by repeatedly randomly sampling the true datawith replacement to generate thousands of new sets ofN points and reﬁtting those new sets. This allows oneto get uncertainty distributions using the data itself andavoids a number of assumptions that are required for χ uncertainty techniques to be valid [75] and, unlike χ techniques, is also sensitive to over-ﬁtting [76].It is of particular note that the quadratic ﬁt is thesame function over a similar range as found in the original1976 work opting for the quadratic ﬁt [49] though hereinwe use a single ﬂoating normalization instead of threeand the point-to-point errors are smaller. It is perhapsdistressing that published values of the radius extractedfrom electron scattering remained basically unchangedsince the 1976 work [49] while the functions used to makethe extrapolation and obtain that same radius becameincreasingly complex and convoluted. Oddly enough thestandard dipole magnetic form factor quadratic ﬁt hasthe lowest AIC and BIC values and gives a result con-sistent with the muonic Lamb shift; though the rationalfunction is nearly as good and is nearly exactly betweenthe CODATA and muonic Lamb shift values. VIII. GRAPHS IN STATISTICAL ANALYSIS

Beyond simply checking statistic criteria, it is impor-tant to check the statistical analysis graphically [77].The graphs help ensure that our underlying assumptionsabout the data are reasonably correct and help establishthat the results aren’t being overly inﬂuenced by a sin-gle point. This is particularly important when doing χ minimizations where a clear outlier can easily have anundue weight, given the quadratic structure of χ in theerrors. This is not to imply that one should simply re-move an outlier, but it does mean that one might wishto study the eﬀect of ﬁtting with and without the out-lining point to clearly show its inﬂuence on the result.The researcher can also review their notes to ensure thatnothing odd happened during the taking of that datapoint. To illustrate these points clearly, one can ﬁnd inAppendix A an updated version of the classic Anscombeexample [77] where we have added uncertainties so all theexamples give a reduced χ of unity yet the distributionof the data are in fact all clearly diﬀerent.In the top half of Fig. 9, the reported data is shownalong with the quadratic ﬁt: best ﬁt from Table V basedon AIC and BIC. In the bottom half, the residual, a dif-ference between the data and the model, is shown. Whilethese two plots are perhaps the most common statistical TABLE V. Using four diﬀerent magnetic form factor parameterizations, we extract the normalization and electric radius usinga linear, quadratic, cubic, and rational approximation for the G E function in Eq. 12. To avoid multiple ﬂoating normalizations,the single set of 104 data points shown in Fig. 8 is used. The parameter uncertainties were obtained by performing statisticalbootstraps of the data. As seen during the Monte Carlo studies, the linear ﬁt over this interval produces a small variance;but is clearly biased from the true 0.84–0.88 fm proton radius, whereas the quadratic ﬁt has a larger variance but gives lessbiased result. The cubic ﬁt function over-ﬁts and thus, produces a huge variance. The rational function is nearly as good asthe quadratic through produces a systematically larger radius though nicely in the range we expect. In the Table df stands fordegrees of freedom.Standard Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 102 162.6 1.594 50.47 55.74 0.988(1) 0.785(2) 167.2 1.639 53.35 58.64 0.985(1) 0.790(2)quadratic 101 119.4 1.182 20.35 28.29 1.000(2) 0.852(10) 119.0 1.178 20.00 27.93 0.998(2) 0.860(10)cubic 100 117.3 1.173 20.51 31.13 0.990(6) 0.785(57) 117.1 1.171 20.33 30.90 0.990(6) 0.797(57)rational 101 120.0 1.188 20.83 28.76 1.001(2) 0.860(10) 119.6 1.184 20.50 28.64 0.999(2) 0.869(10)Kelly Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 102 168.1 1.648 53.95 59.24 0.986(1) 0.780(2) 173.0 1.669 53.35 58.64 0.985(1) 0.785(2)quadratic 101 119.4 1.182 20.33 28.27 1.000(2) 0.852(10) 119.0 1.178 20.00 27.91 0.998(2) 0.860(10)cubic 100 117.1 1.171 20.36 30.94 0.985(6) 0.598(57) 116.9 1.169 20.14 30.72 0.983(6) 0.613(57)rational 101 120.0 1.188 20.89 28.82 1.001(2) 0.861(10) 119.6 1.188 20.87 28.80 1.001(2) 0.870(10)Bernauer Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 102 163.3 1.601 53.96 56.24 0.988(1) 0.786(2) 168.0 1.647 53.85 59.14 0.985(1) 0.791(2)quadratic 101 119.5 1.183 20.44 28.37 1.000(2) 0.854(10) 119.0 1.179 20.08 28.00 1.000(4) 0.862(10)cubic 100 117.3 1.173 20.52 31.13 0.992(6) 0.786(56) 117.1 1.171 20.33 30.91 0.990(6) 0.797(56)rational 102 120.0 1.189 20.93 28.86 1.001(2) 0.871(13) 119.7 1.185 20.60 28.54 0.999(2) 0.871(13)Ye Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 102 167.2 1.639 53.35 58.64 0.987(1) 0.781(2) 172.0 1.686 56.33 61.62 0.984(1) 0.786(2)quadratic 101 119.4 1.182 20.33 28.26 1.000(2) 0.852(10) 119.0 1.190 19.97 27.91 0.998(2) 0.860(10)cubic 100 117.3 1.173 20.55 31.12 0.992(6) 0.785(55) 117.1 1.171 20.33 30.91 0.992(6) 0.797(55)rational 102 119.8 1.188 20.87 28.80 1.001(2) 0.860(14) 119.6 1.184 20.54 28.48 0.999(2) 0.870(14) analysis graphs, they are not found in many cited pro-ton radius papers. In fact, some papers have no dataquality plots at all (e.g. [21]). As the human eye willﬁnd patterns in statistical noise, the normal Q-Q plot isan important tool to check whether or not the data isnormally distributed [78]. This is done in Fig. 10, wherethe sorted residuals are plotted against, and are shown tofollow, a normal distribution. There are of course moreformal statistical tests one can apply, but in general thevisualizations of the data can quickly reveal if there areany major problems. In particular, with least squaresﬁtting, one should be mindful that a single outlier canskew the results.

IX. LOWEST Q DATA

While the set of Mainz data studied above had a rangesimilar to the large range of our Monte Carlo studies, itis not the lowest Q set of data. The lowest set comesfrom a run with 180 MeV [79] and covers a range from0.090 to 0.330 fm − . This range is particularly intrigu-ing as it has been shown to be low enough that evena linear regression should be able to extract the radius;though instead of just relying on the Monte Carlo studyto make that choice, we once again systematically look atthe data with various function and variance choices forthe magnetic form factor in Table VI. The residuals andnormal Q-Q plots for the AIC and BIC selected modelsare shown in Fig. 11 and 12 respectively.Although these ﬁts look beautiful, the extraction of theradius depends on our belief that we can reasonably ex- TABLE VI. Using four diﬀerent magnetic form factor parameterizations, we extract the normalization and electric radius usinga linear, quadratic, cubic, and rational approximation for the G E function in Eq. 12. To avoid multiple ﬂoating normalizations,the single set of 106 data points shown in Fig. 8 is used. The parameter uncertainties were obtained by performing statisticalbootstraps of the data. As seen during the Monte Carlo studies, the linear ﬁt over this interval produces a small variance; butis clearly biased from the true 0.84–0.88 fm proton radius, whereas the quadratic has a larger variance but gives less biasedresult. The cubic ﬁt function is over-ﬁtting which is why it produces a huge variance. The rational function is nearly as goodas the quadratic through produces a systematically larger radius though nicely in the range we expect. In the Table df standsfor degrees of freedom.Standard Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 104 69.2 0.666 − . − .

81 1.000(1) 0.826(8) 69.6 0.670 − . − .

21 0.997(1) 0.842(8)quadratic 103 67.0 0.651 − . − .

59 1.006(3) 0.923(50) 66.9 0.649 − . − .

81 1.004(3) 0.948(50)cubic 102 66.3 0.650 − . − .

13 0.996(6) 0.598(nan) 66.2 0.649 − . − .

30 0.994(6) 0.645(nan)rational 103 63.8 0.632 − . − .

46 0.934(3) 0.936(55) 67.0 0.650 − . − .

64 1.004(3) 0.964(55)Kelly Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 104 69.3 0.666 − . − .

70 1.000(1) 0.824(8) 69.7 0.670 − . − .

08 0.998(1) 0.840(8)quadratic 103 67.0 0.651 − . − .

59 1.000(2) 0.923(50) 66.9 0.649 − . − .

81 1.004(2) 0.948(50)cubic 102 66.2 0.649 − . − .

21 0.995(6) 0.525(nan) 66.1 0.648 − . − .

38 0.993(6) 0.578(nan)rational 103 67.1 0.652 − . − .

44 1.006(3) 0.937(55) 67.0 0.651 − . − .

63 1.005(3) 0.965(55)Bernauer Magnetic Form Factorwithout Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 104 69.2 0.665 − . − .

87 1.001(1) 0.826(8) 69.9 0.669 − . − .

26 0.998(1) 0.842(8)quadratic 103 67.0 0.651 − . − .

59 1.006(2) 0.923(50) 66.9 0.649 − . − .

81 1.004(4) 0.948(50)cubic 102 66.2 0.650 − . − .

13 0.996(6) 0.598(nan) 66.2 0.649 − . − .

30 0.994(6) 0.645(nan)rational 103 67.1 0.652 − . − .

44 1.007(3) 0.936(55) 67.0 0.650 − . − .

64 0.999(3) 0.964(55)Ye Magnetic Form Factor without Coulomb correction with Coulomb correction G E Fit df χ reduced AIC BIC Norm Extrapolated χ reduced AIC BIC Norm ExtrapolatedFunction χ Radius [fm] χ Radius [fm]linear 104 69.3 0.666 − . − .

71 1.000(1) 0.824(8) 69.7 0.670 − . − .

09 0.998(1) 0.841(8)quadratic 103 67.0 0.651 − . − .

59 1.006(2) 0.923(50) 66.9 0.649 − . − .

81 1.004(2) 0.948(50)cubic 102 66.3 0.650 − . − .

13 0.996(6) 0.598(nan) 66.2 0.649 − . − .

30 0.994(6) 0.645(nan)rational 103 67.1 0.652 − . − .

44 1.007(3) 0.936(55) 67.0 0.651 − . − .

63 1.005(3) 0.964(55) pect our function to extrapolate well. This belief is basedon the Monte Carlo study at the beginning of the paper.And while it is certainly reasonable to assume the chargeform factor will smoothly continue to Q = 0, naturewill do as it wishes. New data in the less then 0.1 fm − region is certainly desirable to ensure that our anzats iscorrect [80–83]. Also, strictly speaking, the uncertaintiesfrom the regressions are only valid over the range of thedata, so again our belief that we are obtaining reason-able uncertainties links back to the Monte Carlo studyof the radius extraction using diﬀerent functions and therelatively short distance of the extrapolation.As has been shown by theory calculations such as Alar-con and Weiss [63, 84] the moments of the generatingfunction are likely far too complex to be constrained bydescriptive ﬁtting of experimental data. Also, as pointedout on page 378 of [67], moments do not uniquely de- ﬁne functions, so the best we can do for the higher orderterms is determine the shape of the data and then com-pare with theory.One can of course try to combine multiple sets of data,though this quickly turns into a Bayesian exercise with nounique solution and has an inherent human-in-the-loopfactor [85]. One could also try using a physical modeland just simply compare the model with a descriptive ﬁtof the data [63, 84, 86–89].In addition, in Appendix B we show an example ofusing a machine learning signiﬁcance testing technique,forward stepwise regression, to ﬁnd the best polynomialﬁt function. And in Appendix C we also show the ef-fect of conformal mapping in order to clearly show thateven after doing a mapping one will still need to apply arigorous model selection criteria.0 Q fm R e s i d u a l M e a s / M o tt = 119.4 reduced = 1.182AIC = 20.33 BIC = 28.26 Best Fit by AIC & BIC FIG. 9. The cross section data and the quadratic ﬁt curveare shown in the upper plot and the residual, the diﬀerencebetween the data and ﬁt, are shown in the lower plot. Ifan appropriate ﬁt function was used, the residuals should berandomly dispersed around the horizontal axis.

Normal Distribution Quantiles S a m p l e Q u a n t il e s FIG. 10. In order to check if the distribution of the residu-als are plausibly normally distributed, one can make a normalquantile-quantile, Q-Q, plot. This is done by sorting the resid-uals from lowest to highest and plotting the ordered residualsagainst a theoretical normal distribution. The resulting plotis a beautiful example of normally distributed residuals.

X. SUMMARY

This work was not meant to be an exhaustive studyof the world electron scattering data; but instead an ex-ploration of bias-variance trade-oﬀ as it relates to theextraction of the proton radius from electron scatteringdata. To do this, we have revisited a classic Monte Carlostudy [49] and shown that simply rejecting models with Q fm R e s i d u a l M e a s / M o tt Radius

AIC = 0.948 fmRadius

BIC = 0.843 fmAIC = -42.80 BIC = -35.26 Best Fit by AICBest Fit by BIC

FIG. 11. The cross section data and the quadratic ﬁt curveare shown in the upper plot and the residual, the diﬀerencebetween the data and ﬁt, are shown in the lower plot.

Normal Distribution Quantiles S a m p l e Q u a n t il e s FIG. 12. The low Q data quite as normally distributedresiduals though still reasonable close to what one would ex-pect. a bias is incorrect. Although adding more regression pa-rameters usually reduces the bias, it comes at the costof increased variance. We also illustrated how parsimo-nious models can have better predictive power than eventhe true underlying model in certain situations (see alsosection D for semi analytical calculations on this regard).Next, we carefully studied high precision sets of Mainzlow Q data [24] which covered a range very similar tothe Monte Carlo study. Here we have deﬁned a set tobe a group of data with a single normalization param-eter. The model selection techniques presented hereinprovide a rigorous method of selecting an appropriatemodel to describe a given set of data. We have illus-1trated that these model selection techniques along withsome key statistical analysis plots can help to ensure areasonable ﬁt.These ideas and techniques of model selection are notlimited to the physical sciences, but also extend to quan-titative analysis [90], and sit at the heart of statisticallearning [60]. It has been argued that over-ﬁtting is per-haps more problematic now than in the past due moderncomputing power [91]. It is amusing to note it was just1985 when Feynman noted that computers could not beathumans at the game of Go [92], while today computersdominate even this complicated game [93, 94]. With allthis computer power available, it is perhaps more neces-sary then ever to keep in mind the power and importanceof parsimonious modeling as nicely summarized by therenowned statistician George Box: “Since all models arewrong the scientist cannot obtain a ‘correct’ one by exces-sive elaboration. On the contrary, following William ofOccam, [the scientist] should seek an economical descrip-tion of natural phenomena. Just as the ability to devisesimple but evocative models is the signature of the greatscientist so over-elaboration and over-parameterization isoften the mark of mediocrity.” [59] ACKNOWLEDGMENTS

We would like to thank Miha Mihoviloviˇc for manyextremely useful discussions and help developing a num-ber of the ﬁgures. The regressions done in the body ofthis work were done in Python making use of the out-standing LMFIT package [95] to interface with SciPy li-braries [96]. Non-linear regressions were done using theLevenbergMarquardt method [97, 98]. Special thanks toEdward Tufte and his eﬀorts to help improve the visu-alization of evidence [99–102]. This work was supportedby the U.S. Department of Energy contract DE-AC05-06OR23177 under which Jeﬀerson Science Associates op-erates the Thomas Jeﬀerson National Accelerator Facilityand contract DE-FG02-03ER4123 at Duke University.

Appendix A: Anscombe’s Quartet

With the power of modern computing, one can betempted to blindly assume the results of a calculationare correct; but this can be extremely misleading [103].To illustrate this point we use the Anscombe quartet [77],though as nuclear physicists tend to use reduced χ in-stead of R , we have taken the 1973 example problem andadded uncertainties to the points as shown in Table VII.These four sets of (x,y,dy) values give to three signif-icant ﬁgures the same statistical quantities: mean, vari-ance, χ , reduced χ , etc. So if one fails to make graph-ical checks, one can be completely fooled into thinkingthe ﬁts are all equally good; but by simply graphing (seeFig. 13) one can see that only the ﬁrst set of data isdistributed in an ideal way around the ﬁt function. TABLE VII. Four data sets of (x,y,dy) values.Data set 1-3 1 2 3 4 4 1-4Variable x y y y x y dyObs. no. 1: 10.0 8.04 9.14 7.46 8.0 6.58 1.2352: 8.0 6.95 8.14 6.77 8.0 5.76 1.2353: 13.0 7.58 8.74 12.74 8.0 7.71 1.2354: 9.0 8.81 8.77 7.11 8.0 8.84 1.2355: 11.0 8.33 9.26 7.81 8.0 8.47 1.2356: 14.0 9.96 8.10 8.84 8.0 7.04 1.2357: 6.0 7.24 6.13 6.08 8.0 5.25 1.2358: 4.0 4.26 3.10 5.39 19.0 12.50 1.2359: 12.0 10.84 9.13 8.15 8.0 5.56 1.23510: 7.0 4.82 7.26 6.42 8.0 7.91 1.23511: 5.0 5.68 4.74 5.73 8.0 6.89 1.235 / df = 1.00 0 5 10 15 20x051015y Data set 2m = 0.50b = 3.00 / df = 1.000 5 10 15 20x051015y Data set 3m = 0.50b = 3.00 / df = 1.00 0 5 10 15 20x051015y Data set 4m = 0.50b = 3.00 / df = 1.00 FIG. 13. Graphs of the four sets of data.

Data set two clearly has a curved residual yet has ex-actly the same mean, errors, χ as the ﬁrst ﬁt. This sug-gests that the ﬁtter should likely add a quadratic termto the regression as well as check that the uncertaintieshave been correctly reported.Data set three illustrates the eﬀect of an outlier on theregression. Of course, the scientist doesn’t simply throwout an outlier. Instead one should report on the outlier’sinﬂuence on the result. For example, in data set three itwould be worth noting that if the outlier is removed, thedata exactly follow a line of y = 4 + 0 . x and that thatmeasurement should be repeated.Set four looks unsatisfactory, since all the informationabout the slope comes from one observation. This is verydiﬀerent from data set one where any one point can be re-moved and one will obtain nearly the same result. Thus,for this data set four it should be pointed out that asingle observation plays a critical role in the result.2 Appendix B: Stepwise Regression

In this paper, we have proceeded as is natural for ahuman statistical modeler, starting from a simple linearapproximation and then adding additional terms in Q as needed. Of course nothing says that this type of ﬁttingdetermines the terms of the moments the true generatingfunction but simply that we have found an appropriatefunction for ﬁtting the data. In fact, using experimen-tally determined “moments” as a regression constraintcan lead to circular logic as the normalization parame-ters and moments are linked in a very complex way.In Ref. [23] the authors used forward stepwise regres-sion on the Mainz Rosenbluth data [24] and the codereturned a beautiful alternating sign power series whichis what one would expect if the power series ﬁt was infact extracting moments of a generating function. Nev-ertheless, without a physical theory in mind, no claimwas made that these were the moments of the generatingfunction.In fact, if one applies the same stepwise regression codeto very low Q data it can give some seemly nonsensicalresults. While the linear term will remain essentially un-changed one no longer get the series of alternating . Asan example, using R [104] with the CAR package [105]to preform stepwise regression with AIC as the model se-lection criteria on the G E data from Griﬃoen, Carlsonand Maddox [22] with Q < . − , to match a rangestudied in the Monte Carlo simulations, one obtains thefollowing result: Start: AIC=36.47data$y ~ data$xDf Sum of Sq RSS AIC+ I(data$x^4) 1 9.1204 357.98 30.090+ I(data$x^3) 1 9.1063 358.00 30.104+ I(data$x^5) 1 8.9767 358.13 30.224+ I(data$x^2) 1 8.8697 358.23 30.324+ I(data$x^6) 1 8.7322 358.37 30.451+ I(data$x^7) 1 8.4320 358.67 30.730+ I(data$x^8) 1 8.1084 359.00 31.030+ I(data$x^9) 1 7.7822 359.32 31.333+ I(data$x^10) 1 7.4655 359.64 31.626+ I(data$x^11) 1 7.1643 359.94 31.905 367.10 36.468Step: AIC=30.09data$y ~ data$x + I(data$x^4)Df Sum of Sq RSS AIC 357.98 30.090+ I(data$x^2) 1 0.028238 357.96 32.064+ I(data$x^8) 1 0.021000 357.96 32.071+ I(data$x^7) 1 0.020418 357.96 32.071+ I(data$x^9) 1 0.020399 357.96 32.071+ I(data$x^3) 1 0.018965 357.96 32.073+ I(data$x^6) 1 0.018854 357.96 32.073+ I(data$x^10) 1 0.018801 357.96 32.073+ I(data$x^5) 1 0.017067 357.97 32.075+ I(data$x^11) 1 0.016573 357.97 32.075Weighted Residuals: Min 1Q Median 3Q Max-3.01533 -0.73635 -0.08819 0.66435 3.08064Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) 0.9988295 0.0003569 2798.9 < 2e-16data$x -0.1172160 0.0011112 -105.5 < 2e-16I(data$x^4) 0.0061991 0.0021379 2.9 0.00399Residual standard error: 1.042 on 330 degrees of freedomMultiple R-squared: 0.9931, Adjusted R-squared: 0.993F-statistic: 2.37e+04 on 2 and 330 DF, p-value: < 2.2e-16

Unlike the previous studies in this work where test weresimply made by increasing the order of the power seriesby one, the forward stepwise regression scans over all pos-sible next terms in the power series up to a user deﬁned11 th order. The results shown above are in the algebraicnotation of R and the x term is Q . Starting from a linearfunction, diﬀerent combinations of higher order functionsare tried by adding and removing terms, in order to ﬁndthe best performing predictive model. The ﬁnal result ofthe ﬁt is illustrated in Fig. 14 and code to preform theﬁts can be found on github [106]. FIG. 14. The G E data of [22] shown along with a residual ofthe best ﬁt using forward stepwise regression. In the context of assigning meaning to the extractedmoments, the stepwise regression result is laughable. Onthe other hand, considering these ﬁts as only descriptiveapproximations valid over the range of the data, the re-sult is a function with uncertainty that describes the datawell.It is the fact that there is nearly an inﬁnite set of func-tions that can ﬁt the data within the error bands thatmakes the Monte Carlo modeling and the use of modelselection criteria so important; otherwise one is simplydoing a deep search in order to ﬁnd a particular solution.

Appendix C: Conformal Mapping

It has been argued that the use of conformal mappingcan improve the situation; but while some authors ofthis technique conﬁne themselves to low Q and modelconstrained parameters, others seem to use the mappingto obfuscate clear over-ﬁtting. For example, in Table 1of [40] the authors do various ﬁts up to ﬁfth order with noclear model selection technique in mind. By simply using3 Q [fm ]0.00.20.40.60.81.0 G E Standard Dipole G E Standard Dipole Q [fm ]0.900.920.940.960.981.00 G E Standard Dipole G E Standard Dipole

FIG. 15. Illustration of the result of performing the z trans-formation on the standard dipole function shown the solidline. For the standard dipole at low Q , this turns a slightlyconcave function into a convex function. The dashed line issimply a straight line to make the concavity of the functionsclear. As described in the text, this same eﬀect is seen withreal G E data. the selection criteria provided herein, AIC or BIC, oneﬁnds that in general the two parameter ﬁts are suﬃcientand the data used therein give 0.86(2)fm for the radius.By model selection criteria, all the ﬁts beyond secondorder are simply over-ﬁtting the data.On the other hand, the idea of conformal mapping isvery powerful. From a pure mathematical point of view,it can take a function, like G E ( Q ) that goes from zeroin inﬁnity and map it onto a ﬁnite range where one coulduse orthogonal polynomial to describe data. It can alsobe used to avoid the eﬀects of poles when doing polyno-mial regressions, though it is worth noting that rationalfunctions also have that property [107].The transformation from Q to the mapped parameter z is done by: z ( t, t cut , t ) = √ t cut − t − √ t cut − t √ t cut − t + √ t cut − t , (C1)where t = − Q , t cut = 4 m π , and t is a free parameterrepresenting the point being map onto z = 0. Hereinwe have used t = 0. It is argued without proof that“the curvature is smaller in the z variable than in the Q variable” [40], yet by taking exactly the same data asbefore and ﬁtting with stepwise regression one gets thefollowing result: Start: AIC=62.02data$y ~ data$xDf Sum of Sq RSS AIC+ I(data$x^2) 1 36.992 358.23 31.779+ I(data$x^3) 1 35.577 359.65 33.092+ I(data$x^4) 1 33.518 361.71 34.993+ I(data$x^5) 1 31.295 363.93 37.034+ I(data$x^6) 1 29.136 366.09 39.003+ I(data$x^7) 1 27.125 368.10 40.827+ I(data$x^8) 1 25.277 369.95 42.495+ I(data$x^9) 1 23.582 371.64 44.017+ I(data$x^10) 1 22.026 373.20 45.408+ I(data$x^11) 1 20.595 374.63 46.683 395.23 62.019Step: AIC=31.78data$y ~ data$x + I(data$x^2)Df Sum of Sq RSS AIC 358.23 31.779+ I(data$x^7) 1 0.46895 357.77 33.828+ I(data$x^6) 1 0.46636 357.77 33.830+ I(data$x^8) 1 0.46557 357.77 33.831+ I(data$x^9) 1 0.46025 357.77 33.836+ I(data$x^10) 1 0.45531 357.78 33.840+ I(data$x^5) 1 0.45194 357.78 33.844+ I(data$x^11) 1 0.45177 357.78 33.844+ I(data$x^4) 1 0.41901 357.82 33.874+ I(data$x^3) 1 0.36310 357.87 33.926Weighted Residuals:Min 1Q Median 3Q Max-3.07706 -0.75955 -0.09171 0.67092 3.04833Coefficients: Estimate Std. Error t value Pr(>|t|)(Intercept) 0.9992836 0.0005757 1735.683 < 2e-16data$x -0.9741340 0.0256596 -37.964 < 2e-16I(data$x^2) -1.5340681 0.2627947 -5.838 1.27e-08Residual standard error: 1.042 on 330 degrees of freedomMultiple R-squared: 0.9932, Adjusted R-squared: 0.9932F-statistic: 2.41e+04 on 2 and 330 DF, p-value: < 2.2e-16

FIG. 16. The z-transformed G E data shown along with aresidual of the best ﬁt. Thus, the stepwise regression shows that the trans-formation has increased the curvature of the data andturned the slightly convex data in Q to clearly concavedata in z . Assuming that a quadratic can be used tomake a reasonable extrapolation, one can use this R re-4gression result to extract a radius using the formula: r p = (cid:112) − data $ x )4 m π . (C2)Thus, one ﬁnds 0.85(1) fm, a result again consistent withthe muonic Lamb shift result, as well as the other resultsshown in this work. These ideas can be taken furtherand one could use ideas such as Gaussian process regres-sion to calculate many possible extrapolating paths andthen assign relative probabilities [108] as recently doneby Zhou et al. [109]. Appendix D: Semi-analytical Calculations

In this section we demonstrate a semi-analytical pro-cedure to calculate the bias and variance induced by thenoise in our estimation of the slope at zero. This frame-work can be used to reproduce the graph shown in Fig. 2and Fig. 3 without the repeated Monte Carlo procedureof creating many noisy data sets and sampling from theﬁtted functions. The virtue of this method is that it cangive us explicitly the inﬂuence each data point is havingon our bias and variance and, thus, can be used to iden-tify the points that maximize the information gained.Before proceeding we must realize that the bias in ourestimation of the slope has two sources. The ﬁrst one isdue to the noise in the data and will increase as the noiseincreases. The second is due to the possible fact that weare not using the “true” function to ﬁt the data, as wasshown in Table IV when the line had a clear bias evenwhen the noise was very small. We therefore split ourbias as: bias = bias σ + bias , (D1)where bias σ denotes the bias that scales with the noise,while bias will always be present even when σ → σ represents the size of the noise.In this section we will show how to calculate the ﬁrstsource, while the second can only be calculated if we knowthe “true” value of the slope. Let us deﬁne our quantityof interest, the slope at zero, as m ( X, Y, (cid:15) ), where X and Y are the lists of n data points and (cid:15) is a particularrealization of the noise, which we assume to be Gaussianwith mean zero and deviation σ . Here m ( X, Y, (cid:15) ) can bea closed expression for the slope given the data, as is thecase for linear models, or it can be treated as a numericalroutine that returns the slope when ﬁtting a non linearmodel, like the Dipole.

1. Noise Bias

We want to ﬁnd the average value of m ( X, Y, (cid:15) ), (cid:104) m ( X, Y ) (cid:105) , once all the possible realizations of the noise have been taken into account, weighted correctly by theirGaussian distribution. The result is given by the integral: (cid:104) m ( X, Y ) (cid:105) ≡ (cid:90) P ( (cid:15) ) m ( X, Y, (cid:15) ) d(cid:15), (D2)where d(cid:15) = d(cid:15) d(cid:15) ... d(cid:15) n , and P ( (cid:15) ) denotes the Gaussianprobability distribution for the noise given by: P ( (cid:15) ) = (cid:32) √ πσ ) (cid:33) n e ( (cid:15) + (cid:15) + ... (cid:15) n ) / σ . (D3)The integral in Eq. D2 is eﬀectively taking into ac-count all possible noise realizations, weighted by theircorresponding probabilities.Since in most cases we do not have an available expres-sion for m ( X, Y, (cid:15) ), we can expand it using a multivariateTaylor expansion and evaluate each term directly underthe integral sign in Eq. D2. The ﬁrst three terms in thisexpansion are: m ( X, Y, (cid:15) ) ≈ m + (cid:104) −→∇ (cid:15) m (cid:105) · −→ (cid:15) + 12 −→ (cid:15) T · [ H m ] · −→ (cid:15) , (D4)where m = m ( X, Y, (cid:15) = 0) is the value of the slopewhen the noise is zero. We denote (cid:104) −→∇ (cid:15) m (cid:105) the gra-dient of m taking the errors (cid:15) as variables, −→∇ (cid:15) m ≡ (cid:16) ∂m∂(cid:15) , ∂m∂(cid:15) , ... ∂m∂(cid:15) n (cid:17) , evaluated at (cid:15) = 0. Finally, [ H m ] denotes the Hessian matrix of m , H m i,j ≡ ∂ m∂(cid:15) i ∂(cid:15) j , againtaking the (cid:15) as variables and evaluating at (cid:15) = 0.Equation D4 says that for some small realization ofthe errors (cid:15) i , the value of m is approximately its valuewhen (cid:15) = 0, plus a linear correction in (cid:15) by the gradi-ent and ﬁnally a second order correction proportional to (cid:15) that involves the Hessian. Both of these quantities,the gradient and the Hessian are evaluated at (cid:15) = 0 andare therefore just a list and a matrix of ﬁxed numbers,respectively. These two groups of numbers can be ob-tained numerically by taking ﬁnite diﬀerences on m asan approximation to derivatives.Once we have expanded our slope function m we pro-ceed to calculate the integral D2: (cid:104) m ( X, Y ) (cid:105) ≈ (cid:90) (cid:32) m + (cid:24)(cid:24)(cid:24)(cid:24)(cid:24)(cid:24)(cid:58) (cid:104) −→∇ (cid:15) m (cid:105) · −→ (cid:15) + 12 −→ (cid:15) T · [ H m ] · −→ (cid:15) (cid:33) P ( (cid:15) ) d(cid:15). (D5)The linear correction, being proportional to (cid:15) , wouldintegrate to zero. The term m does not depend onthe noise and since P ( (cid:15) ) integrates to 1 it would justappear as it is in the ﬁnal result. The last term withthe Hessian is a quadratic form that would look like (cid:15) H + 2 (cid:15) (cid:15) H + ... + (cid:15) n H nn , where the constants5 H i,j are the ( i, j ) elements of the Hessian. Since ournoise additions are mutually independent ( (cid:15) i and (cid:15) j donot correlate) the integration of terms involving mixingof diﬀerent (cid:15) i will also yield zero. The integration overterms that involve the same (cid:15) i are by deﬁnition the vari-ance of the noise, σ . Therefore we have: (cid:104) m ( X, Y ) (cid:105) ≈ m + 12 (cid:90) (cid:0) −→ (cid:15) T · [ H m ] · −→ (cid:15) (cid:1) P ( (cid:15) ) d(cid:15) = m + 12 σ (cid:20) ∂ m∂(cid:15) + ∂ m∂(cid:15) + ... + ∂ m∂(cid:15) n (cid:21) . (D6)This equation shows that there will be a noise-inducedbias in the average estimation of our quantity m , whichgrows proportionally to σ . This bias will not be presentif the function that estimates m in terms of the data islinear in the observations Y (and therefore linear in thenoise (cid:15) ), since second derivatives of m will be zero fromthe start. This means that, regardless of the size of thenoise, linear ﬁts like a straight line, or any polynomial,will not gain a noise-induced bias, and the distributionsfor m will always center at the zero noise point.For non linear ﬁts on the other hand, there will be abias that will grow quadratically in the noise. This canapply to the actual function that generated the data, likethe Dipole in our study. Once the noise gets too high, thetrue generating function might not be the most reliableﬁt to the data, bias-wise. Nevertheless, the eﬀect willusually be very small, since it is proportional to σ .The term accompanying σ , the trace of the Hessianmatrix, can be interpreted as the Laplacian of m at zeronoise. Consistent with our ﬁndings, the Laplacian oper-ator of f ( x ) at a point a can been related to the rate ofchange of the average of f , (cid:104) f (cid:105) , at a small sphere centeredat a compared to f ( a ) [110].

2. Standard Deviation

In order to calculate the standard deviation s = (cid:113) (cid:104) m (cid:105) − (cid:104) m (cid:105) of our distribution for m we must com-pute the expected value of m , since we already have (cid:104) m (cid:105) .We can calculate this quantity up to second order in σ ,using expression D4: (cid:104) m ( X, Y ) (cid:105) ≈ (cid:90) (cid:20) m + (cid:104) −→∇ (cid:15) m (cid:105) · −→ (cid:15) + 12 −→ (cid:15) T · [ H m ] · −→ (cid:15) (cid:21) P ( (cid:15) ) d(cid:15) = (cid:90) (cid:20) m + (cid:24)(cid:24)(cid:24)(cid:24)(cid:24)(cid:24)(cid:24)(cid:24)(cid:58) m (cid:104) −→∇ (cid:15) m (cid:105) · −→ (cid:15) + (cid:16)(cid:104) −→∇ (cid:15) m (cid:105) · −→ (cid:15) (cid:17) +2 m −→ (cid:15) T · [ H m ] · −→ (cid:15) + O ( (cid:15) ) (cid:21) P ( (cid:15) ) d(cid:15). (D7)The ﬁrst term does not depend on (cid:15) and can go out ofthe integral. The second term is proportional to (cid:15) and integrates to zero. The Hessian term we already knowhow to integrate and we have neglected its square, whichwill be order (cid:15) . The third term, the gradient square,reads: (cid:16)(cid:104) −→∇ (cid:15) m (cid:105) · −→ (cid:15) (cid:17) = (cid:18) ∂m∂(cid:15) (cid:12)(cid:12)(cid:12)(cid:12) (cid:15) + ∂m∂(cid:15) (cid:12)(cid:12)(cid:12)(cid:12) (cid:15) + ... + ∂m∂(cid:15) n (cid:12)(cid:12)(cid:12)(cid:12) (cid:15) n (cid:19) . (D8)Once we expand the square we will have two typesof terms, those of the form (cid:16) ∂m∂(cid:15) i (cid:12)(cid:12) (cid:17) (cid:15) i and those withmixed (cid:15) , ( ∂m∂(cid:15) i ∂m∂(cid:15) j ) (cid:12)(cid:12) (cid:15) i (cid:15) j , i (cid:54) = j . Since our noise compo-nents are independent, only the ﬁrst type of terms willgive non zero results under integration: σ . Therefore wehave: (cid:104) m ( X, Y ) (cid:105) ≈ m + m σ T r [ H m ] + σ (cid:34)(cid:18) ∂m∂(cid:15) (cid:19) + (cid:18) ∂m∂(cid:15) (cid:19) ... + (cid:18) ∂m∂(cid:15) n (cid:19) (cid:35) . (D9)The ﬁrst coeﬃcient accompanying σ is the trace of theHessian matrix, which will cancel the same exact termthat appears in the square of the mean value for m : (cid:104) m (cid:105) ≈ ( m + σ T r [ H m ] ) ≈ m + m σ T r [ H m ] .The second coeﬃcient accompanying σ is the magni-tude of the gradient of m , evaluated at zero noise. Ourresult for the standard deviation s , if we only keep termsup to ﬁrst order in σ for (cid:104) m (cid:105) then reads: s = (cid:113) (cid:104) m (cid:105) − (cid:104) m (cid:105) = σ (cid:107)−→∇ (cid:15) m (cid:107) . (D10)Therefore, our calculations show that no matter howsmall the noise in the data, there would be an apprecia-ble, proportional, standard deviation in our estimate of m . This is true unless the gradient is zero, meaning thatour quantity m is not sensitive to change in the data toﬁrst order.All the calculations presented here can easily be gen-eralized to include correlations between the (cid:15) i and to in-clude diﬀerent sizes of noise for each point: σ i instead ofa single σ . The only requirement is that the noise is con-sidered Gaussian and the integrals would still be easilysolvable.It is also possible to extend the Taylor expansion of m in equation D4 to include more orders in the noise (cid:15) . Since the Taylor expansion will contain some combi-nation of diﬀerent (cid:15) i , the Gaussian integrals can still beperformed in closed form, term by term.

3. Maximizing Information Gained

Some fundamental questions that arises in experimen-tal designs and theory of information are: for a givenrelation f ( x ) = y between two quantities, what are the6most relevant values of x such that, when y is measured,the relation f is constrained the most? How much newinformation one gains by adding more points or reduc-ing the statistical uncertainties on the existing ones? Inother words, what are the ( x, y ) values that, consideredas a set, contain the most amount of information regard-ing f , for a given set of experimental uncertainties.With respect to the proton radius extraction that wehave studied here, a question of interest would be: fora given range in Q , what speciﬁc values Q i give us themost amount of information regarding the slope at zero,or the proton radius.Using the semi-analytical framework described, we canattempt to give a quantitative answer to this question.Both Eqs. D6 and D9 have explicit the contribution eachpoint makes to the noise bias and the standard deviation,respectively. Therefore, we could directly identify how areduction in uncertainty on each particular point wouldpropagate to our estimate of the radius.Alternatively, if we are stuck on some level of un-certainty, we could estimate where the n measurementsshould be made such that, again, our information gain ismaximal. In order to test these ideas, we minimized theRMSE deﬁned in Eq. 6 as a function of the location of nQ points, and studied how much our uncertainty in theslope at zero is reduced as n increased.Our y points were again generated by the Dipole func-tion deﬁned in Eq. 2 and we studied the ﬁt of a line inthe range 0 . − . − and the ﬁt of a parabola in therange 0 . − . − , both of which have proven to beoptimal in our simulations on section IV.Since the slope estimate from both models is linearin the observations y , we know that there will be nonoise-induced bias, and the only source of bias will bea constant term, see Eq. D1. By “constant” we meanthat it will not grow with the noise, but it would changedepending on the points Q i used.We seek therefore to minimize:RMSE ( X ) = bias ( X ) + s = ( m true − m ( X )) + σ (cid:107)−→∇ (cid:15) m ( X ) (cid:107) , (D11)where X = ( Q , ... Q n ) are the n locations of the “ob-served” points, and we have omitted the Y dependenceon our quantities since we are calculating Y directly fromour points X , and m true = − . .Now, the bias quantity ( m true − m ( X )) will indeeddepend on X , and is a number that is usually hiddento us by nature, since we do not know the true value.Nevertheless, this exercise is helpful in showing that somepoints have more information than others, and we will seehow both terms evolve as n grows.Figures 17 and 18 show the obtained results when thepoints Q i are not distributed uniformly but are rathermoved to positions that, respecting the assigned range,minimize the RMSE given by Eq. D11. It is very inter-esting to note that, as we expected from our analysis in FIG. 17. Line RMSE, absolute Bias and Sigma, s , for the op-timal location of points as a function of the number of pointsavailable, in the range 0.1-0.8 fm − . The dashed horizontallines show the respective values for the conﬁguration of 31equidistant points showed in Table II.FIG. 18. Parabola RMSE, absolute Bias and Sigma, s , forthe optimal location of points as a function of the number ofpoints available, in the range 0.1-0.1.6 fm − . The dashed hor-izontal lines show the respective values for the conﬁgurationof 31 equidistant points showed in Table II. section IV, the RMSE of the parabola is primarily drivenby the standard deviation contribution, while the line ismore balanced, although towards higher bias.The main result that we can extract from these twographs is that it takes only 10 points in the case of theline and 13 in the case of the parabola to obtain thesame value of RMSE compared to 31 uniformly spacedpoints, see Table III. If we use 31 points but locate themin optimal positions we obtain an RMSE of 0.0068 fm for the line and an RMSE of 0.0056 fm for the Parabola.The ﬁrst one represents an improvement of 21% while thesecond represents and improvement of 31%.As we have already mentioned, calculating (and opti-mizing with respect to) the bias requires the true valueto be known, which is usually not the case in interestingproblems. Nevertheless, the standard deviation we have7deﬁned is computable only as a function of the observedvalues and errors, and therefore it is a quantity we canuse when facing the real data. As can be seen in Figure 18 the bias of the parabola does not seem to change muchfor any conﬁguration of points, but the σ , and thereforethe RMSE, can be improved substantially. [1] R. Pohl et al. , Nature , 213 (2010).[2] A. Antognini et al. , Science , 417 (2013).[3] P. J. Mohr, D. B. Newell, and B. N. Taylor,Rev. Mod. Phys. , 035009 (2016), arXiv:1507.07956[physics.atom-ph].[4] M. G. Boshier, P. E. G. Baird, C. J. Foot, E. A. Hinds,M. D. Plimmer, D. N. Stacey, J. B. Swan, D. A. Tate,D. M. Warrington, and G. K. Woodgate, Phys. Rev. A40 , 6169 (1989).[5] M. Weitz, A. Huber, F. Schmidt-Kaler, D. Leibfried,and T. W. Hansch, Phys. Rev. Lett. , 328 (1994).[6] D. J. Berkeland, E. A. Hinds, and M. G. Boshier, Phys.Rev. Lett. , 2470 (1995).[7] S. Bourzeix, B. de Beauvoir, F. Nez, M. D. Plimmer,F. de Tomasi, L. Julien, F. Biraben, and D. N. Stacey,Phys. Rev. Lett. , 384 (1996).[8] T. Udem, A. Huber, B. Gross, J. Reichert,M. Prevedelli, M. Weitz, and T. W. Hansch, Phys. Rev.Lett. , 2646 (1997).[9] R. Pohl, R. Gilman, G. A. Miller, and K. Pachucki,Ann. Rev. Nucl. Part. Sci. , 175 (2013),arXiv:1301.0905 [physics.atom-ph].[10] C. E. Carlson, Prog. Part. Nucl. Phys. , 59 (2015),arXiv:1502.05314 [hep-ph].[11] H. Gao, T. Liu, C. Peng, Z. Ye, and Z. Zhao, TheUniverse , 18 (2015).[12] R. Pohl (CREMA), J. Phys. Soc. Jap. , 091003(2016).[13] F. Nez et al. , Phil. Trans. Roy. Soc. Lond. A369 , 4064(2011).[14] J. J. Krauth et al. , (2017), arXiv:1706.00696[physics.atom-ph].[15] N. G. Kelkar, T. Mart, and M. Nowakowski, MakaraJournal of Science (2016), 10.7454/mss.v20i3.6242,arXiv:1609.04456 [hep-ph].[16] A. Beyer, L. Maisenbacher, A. Matveev, R. Pohl,K. Khabarova, A. Grinin, T. Lamour, D. C. Yost, T. W.H¨ansch, N. Kolachevsky, and T. Udem, Science ,79 (2017).[17] H. Fleurbaey, Frequency metrology of the 1S-3S tran-sition of hydrogen: contribution to the proton chargeradius puzzle , Theses, Universit´e Pierre et Marie Curie(UPMC) (2017).[18] H. Fleurbaey, S. Galtier, S. Thomas, M. Bonnaud,L. Julien, F. Biraben, F. Nez, M. Abgrall, and J. Guena,Phys. Rev. Lett. , 183001 (2018), arXiv:1801.08816[physics.atom-ph].[19] G. A. Miller, Phys. Rev.

C99 , 035202 (2019),arXiv:1812.02714 [nucl-th].[20] M. Horbatsch and E. A. Hessels, Phys. Rev.

C93 ,015204 (2016), arXiv:1509.05644 [nucl-ex].[21] R. Rosenfelder, Phys. Lett.

B479 , 381 (2000),arXiv:nucl-th/9912031 [nucl-th].[22] K. Griﬃoen, C. Carlson, and S. Maddox, Phys. Rev.

C93 , 065207 (2016), arXiv:1509.06676 [nucl-ex].[23] D. W. Higinbotham, A. A. Kabir, V. Lin, D. Meekins, B. Norum, and B. Sawatzky, Phys. Rev.

C93 , 055207(2016), arXiv:1510.01293 [nucl-ex].[24] J. C. Bernauer et al. (A1), Phys. Rev. Lett. , 242001(2010), arXiv:1007.5076 [nucl-ex].[25] J. C. Bernauer et al. (A1), Phys. Rev.

C90 , 015206(2014), arXiv:1307.6227 [nucl-ex].[26] G. Lee, J. R. Arrington, and R. J. Hill, Phys. Rev.

D92 , 013013 (2015), arXiv:1505.01489 [hep-ph].[27] K. M. Graczyk and C. Juszczak, Phys. Rev.

C90 ,054334 (2014), arXiv:1408.0150 [hep-ph].[28] I. T. Lorenz and Ulf-G. Meißner, Phys. Lett.

B737 , 57(2014), arXiv:1406.2962 [hep-ph].[29] G. Hohler, E. Pietarinen, I. Sabba Stefanescu,F. Borkowski, G. G. Simon, V. H. Walther, and R. D.Wendling, Nucl. Phys.

B114 , 505 (1976).[30] M. A. Belushkin, H. W. Hammer, and Ulf-G. Meißner,Phys. Rev.

C75 , 035202 (2007), arXiv:hep-ph/0608337[hep-ph].[31] M. Horbatsch, E. A. Hessels, and A. Pineda, Phys. Rev.

C95 , 035203 (2017), arXiv:1610.09760 [nucl-th].[32] J. Arrington and I. Sick, J. Phys. Chem. Ref. Data ,031204 (2015), arXiv:1505.02680 [nucl-ex].[33] I. T. Lorenz, Ulf-G. Meißner, H. W. Hammer, and Y. B.Dong, Phys. Rev. D91 , 014023 (2015), arXiv:1411.1704[hep-ph].[34] I. T. Lorenz, H. W. Hammer, and Ulf-G. Meißner, Eur.Phys. J.

A48 , 151 (2012), arXiv:1205.6628 [hep-ph].[35] C. Adamuscin, S. Dubnicka, and A. Z. Dubnickova,

From quarks and gluons to hadrons and nuclei. Proceed-ings, International Workshop on Nuclear Physics, 33rdcourse, Erice, Italy, September 16-24, 2011 , Prog. Part.Nucl. Phys. , 479 (2012).[36] I. Sick, From quarks and gluons to hadrons and nu-clei. Proceedings, International Workshop on NuclearPhysics, 33rd course, Erice, Italy, September 16-24,2011 , Prog. Part. Nucl. Phys. , 473 (2012).[37] X. Zhan et al. , Phys. Lett. B705 , 59 (2011),arXiv:1102.0318 [nucl-ex].[38] G. Ron et al. (Jeﬀerson Lab Hall A), Phys. Rev.

C84 ,055204 (2011), arXiv:1103.5784 [nucl-ex].[39] D. Borisyuk, Nucl. Phys.

A843 , 59 (2010),arXiv:0911.4091 [hep-ph].[40] R. J. Hill and G. Paz, Phys. Rev.

D82 , 113005 (2010),arXiv:1008.4619 [hep-ph].[41] P. G. Blunden and I. Sick, Phys. Rev.

C72 , 057601(2005), arXiv:nucl-th/0508037 [nucl-th].[42] J. J. Kelly, Phys. Rev.

C70 , 068202 (2004).[43] I. Sick, Phys. Lett.

B576 , 62 (2003), arXiv:nucl-ex/0310008 [nucl-ex].[44] I. M. Gough Eschrich et al. (SELEX), Phys. Lett.

B522 ,233 (2001), arXiv:hep-ex/0106053 [hep-ex].[45] P. Mergell, U. G. Meißner, and D. Drechsel, Nucl. Phys.

A596 , 367 (1996), arXiv:hep-ph/9506375 [hep-ph].[46] C. W. Wong, Int. J. Mod. Phys. E3 , 821 (1994).[47] M. McCord, H. Crannell, L. W. Fagg, J. T. O’Brien,D. I. Sober, J. W. Lightbody, X. K. Maruyama, and P. A. Treado,

Application of accelerators in researchand industry. Proceedings, 11th International Confer-ence, Denton, USA, November 5-8, 1990 , Nucl. In-strum. Meth.

B56/57 , 496 (1991).[48] G. G. Simon, C. Schmitt, F. Borkowski, and V. H.Walther, Nucl. Phys.

A333 , 381 (1980).[49] F. Borkowski, G. G. Simon, V. H. Walther, and R. D.Wendling, Z. Phys.

A275 , 29 (1975).[50] F. Borkowski, P. Peuser, G. G. Simon, V. H. Walther,and R. D. Wendling, Nucl. Phys.

A222 , 269 (1974).[51] F. Borkowski, P. Peuser, G. G. Simon, V. H. Walther,and R. D. Wendling, Nucl. Phys.

B93 , 461 (1975).[52] Yu. K. Akimov et al. , Sov. Phys. JETP , 651 (1972),[Zh. Eksp. Teor. Fiz.62,1231(1972)].[53] J. J. Murphy, Y. M. Shin, and D. M. Skopik, Phys.Rev. C9 , 2125 (1974), [Erratum: Phys. Rev. C10 , 2111(1974)].[54] D. Frerejacque, D. Benaksas, and D. J. Drickey, Phys.Rev. , 1308 (1966).[55] L. N. Hand, D. G. Miller, and R. Wilson, Rev. Mod.Phys. , 335 (1963).[56] I. Sick and D. Trautmann, Phys. Rev. C95 , 012501(R)(2017), arXiv:1701.01809 [nucl-ex].[57] I. Sick, Atoms (2018), 10.3390/atoms6010002,arXiv:1801.01746 [nucl-ex].[58] G. Shmueli, Statist. Sci. , 289 (2010).[59] G. E. P. Box, Journal of the American Statistical Asso-ciation , 791 (1976).[60] T. Hastie, R. Tibshirani, and J. H. Friedman, The el-ements of statistical learning: data mining, inference,and prediction, 2nd Edition , Springer series in statistics(Springer, 2009).[61] G. James, D. Witten, T. Hastie, and R. Tibshirani,

An Introduction to Statistical Learning: With Appli-cations in R (Springer Publishing Company, Incorpo-rated, 2014).[62] W. M. Alberico, S. M. Bilenky, C. Giunti, andK. M. Graczyk, Phys. Rev.

C79 , 065204 (2009),arXiv:0812.3539 [hep-ph].[63] J. M. Alarc´on and C. Weiss, Phys. Rev.

C97 , 055203(2018), arXiv:1710.06430 [hep-ph].[64] See Supplemental Material at [URL will be inserted bypublisher] for animations of select images.[65] P. R. Bevington and D. Robinson,

Data Reduction andError Analysis for the Physical Sciences (McGraw-Hill,2003) 3rd Edition.[66] F. James,

Statistical Methods in Experimental Physics (World Scientiﬁc, 2006) 2nd Edition.[67] S. ˇSirca,

Probability for Physicists , Graduate Texts inPhysics (Springer International Publishing, 2016).[68] H. Akaike, IEEE Transactions on Automatic Control , 716 (1974).[69] G. Schwarz, Ann. Statist. , 461 (1978).[70] E. Wit, E. v. d. Heuvel, and J.-W. Romeijn, StatisticaNederlandica , 217 (2012).[71] Data from Mainz spectrometer B with a beam energy of315 MeV with radiative and Coulomb corrections [25].[72] Z. Ye, J. Arrington, R. J. Hill, and G. Lee, Phys. Lett. B777 , 8 (2018), arXiv:1707.09063 [nucl-ex].[73] A. J. R. Puckett et al. , Phys. Rev.

C96 , 055203 (2017),arXiv:1707.08587 [nucl-ex].[74] T. Gutsche, V. E. Lyubovitskij, and I. Schmidt, Phys.Rev.

D97 , 054011 (2018), arXiv:1712.08410 [hep-ph].[75] B. Efron, Ann. Statist. , 1 (1979). [76] R. Andrae, T. Schulze-Hartung, and P. Melchior, ArXive-prints (2010), arXiv:1012.3754 [astro-ph.IM].[77] F. J. Anscombe, The American Statistician , 17(1973).[78] M. B. Wilk and R. Gnanadesikan, Biometrika , 1(1968).[79] D. W. Higinbotham and R. E. McClellan, (2019),arXiv:1902.08185 [physics.data-an].[80] A. Gasparian (PRad at JLab), Proceedings, 13th In-ternational Conference on Meson-Nucleon Physics andthe Structure of the Nucleon (MENU 2013): Rome,Italy, September 30-October 4, 2013 , EPJ Web Conf. , 07006 (2014).[81] C. Peng and H. Gao, Proceedings, 21st InternationalConference on Few-Body Problems in Physics (FB21):Chicago, IL, USA, May 18-22, 2015 , EPJ Web Conf. , 03007 (2016).[82] M. Mihoviloviˇc et al. , Phys. Lett.

B771 , 194 (2017),arXiv:1612.06707 [nucl-ex].[83] X. Yan et al. , Phys. Rev.

C98 , 025204 (2018),arXiv:1803.01629 [nucl-ex].[84] J. M. Alarc´on and C. Weiss, Phys. Lett.

B784 , 373(2018), arXiv:1803.09748 [hep-ph].[85] P. Daee, T. Peltola, A. Vehtari, and S. Kaski, in ,IUI ’18 (ACM, New York, NY, USA, 2018) pp. 305–310.[86] J. M. Alarc´on, D. W. Higinbotham, C. Weiss,and Z. Ye, Accepted by Phys. Rev. C (2019),arXiv:1809.06373 [hep-ph].[87] J. M. Alarc´on and C. Weiss, Phys. Rev.

C96 , 055206(2017), arXiv:1707.07682 [hep-ph].[88] J. M. Alarcn,

Proceedings, 17th International Confer-ence on Hadron Spectroscopy and Structure (Hadron2017): Salamanca, Spain, September 25-29, 2017 , PoS

Hadron2017 , 155 (2018).[89] J. M. Alarc´on, J. Martin Camalich, and J. A. Oller,Annals Phys. , 413 (2013), arXiv:1210.4450 [hep-ph].[90] H. Brighton and G. Gigerenzer, Journal of Business Re-search , 1772 (2015), special Issue on Simple VersusComplex Forecasting.[91] G. C. Cawley and N. L. Talbot, J. Mach. Learn. Res. , 2079 (2010).[92] R. P. Feynman, “The computing machines in the fu-ture,” in Nishina Memorial Lectures: Creators of Mod-ern Physics (Springer Japan, Tokyo, 2008) pp. 99–114.[93] D. Silver, A. Huang, C. J. Maddison, A. Guez,L. Sifre, G. van den Driessche, J. Schrittwieser,I. Antonoglou, V. Panneershelvam, M. Lanctot,S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu,T. Graepel, and D. Hassabis, Nature (London) ,484 (2016).[94] D. Barradas-Bautista and M. Alvarado, ArXiv e-prints(2018), arXiv:1803.05983 [cs.AI].[95] M. Newville, T. Stensitzki, D. B. Allen, and A. Ingar-giola, “LMFIT: Non-Linear Least-Square Minimizationand Curve-Fitting for Python,” (2014).[96] E. Jones, T. Oliphant, P. Peterson, et al. , “SciPy: Opensource scientiﬁc tools for Python,” (2001–).[97] K. Levenberg, Quart. Appl. Math. , 164 (1944).[98] D. W. Marquardt, Journal of the Society for In-dustrial and Applied Mathematics , 431 (1963),https://doi.org/10.1137/0111030. [99] E. R. Tufte, The Visual Display of Quantitative Infor-mation (Graphics Press, Cheshire, CT, USA, 1986).[100] E. Tufte,

Envisioning Information (Graphics Press,Cheshire, CT, USA, 1990).[101] E. R. Tufte,

Visual Explanations: Images and Quanti-ties, Evidence and Narrative (Graphics Press, Cheshire,CT, USA, 1997).[102] E. R. Tufte,

Beautiful Evidence (Graphics Press,Cheshire, CT, 2006).[103] S. ˇSirca and M. Horvat,

Computational Methods forPhysicists (Springer, 2012).[104] R Core Team,

R: A Language and Environment for Sta-tistical Computing , R Foundation for Statistical Com-puting, Vienna, Austria (2013).[105] J. Fox and S. Weisberg,

An R Companion to Applied Regression , 2nd ed. (Sage, Thousand Oaks CA, 2011).[106] D. Higinbotham, “Model selection with stepwiseregression,” https://github.com/JeffersonLab/model-selection (2016).[107] W. H. Press, S. A. Teukolsky, W. T. Vetterling, andB. P. Flannery,

Numerical Recipes 3rd Edition: The Artof Scientiﬁc Computing , 3rd ed. (Cambridge UniversityPress, New York, NY, USA, 2007).[108] C. E. Rasmussen and C. K. I. Williams,

Gaussian Pro-cesses for Machine Learning (Adaptive Computationand Machine Learning) (The MIT Press, 2005).[109] S. Zhou, P. Giuliani, J. Piekarewicz, A. Bhattacharya,and D. Pati, (2018), arXiv:1808.05977 [nucl-th].[110] D. F. Styer, American Journal of Physics83