Experimental noise in small-angle scattering can be assessed and corrected using the Bayesian Indirect Fourier Transformation
Andreas Haahr Larsen † and Martin Cramer Pedersen †
Department of Biochemistry, University of Oxford (Linacre College)
Niels Bohr Institute, University of Copenhagen
* These authors contributed equally to the presented work
† Corresponding authors: [email protected], [email protected]
Abstract
Small-angle X-ray and neutron scattering are widely used to investigate soft matter and biophysical systems. The experimental errors are essential when assessing how well a hypothesized model fits the data. Likewise, they are important when weights are assigned to multiple datasets used to refine the same model. Therefore, it is problematic when experimental errors are over- or underestimated. We present a method, using Bayesian Indirect Fourier Transformation for small-angle scattering data, to assess whether or not a given small-angle scattering dataset has over- or underestimated experimental errors. The method is effective on both simulated and experimental data, and can be used to assess and rescale the errors accordingly. Even if the estimated experimental errors are appropriate, it is ambiguous whether or not a model fits sufficiently well, as the “true” reduced $\chi^2$ of the data is not necessarily unity. This is particularly relevant for approaches where overfitting is an inherent challenge, such as reweighting of a simulated molecular dynamics trajectory against small-angle scattering data or ab initio modelling. Using the outlined method, we show that one can determine what reduced $\chi^2$ to aim for when fitting a model against small-angle scattering data. The method is easily accessible via a web interface.

Small-angle X-ray and neutron scattering (SAXS and SANS) are valuable tools for obtaining structural information in the nanometer regime for a range of materials, from inorganic nanoparticles, through polymer gels and colloids, to biomolecules in solution. However, some challenges remain to be dealt with. In this paper, we aim to overcome two central challenges.

The first challenge is the issue of over- or underestimated experimental errors. To retrieve structural information from SAXS or SANS data, the data are usually compared to a theoretical model.
The assessment of such models relies in most cases on the estimated experimental errors, with some notable exceptions being the runs test and the derived CorMap test [1].

Experimental error estimates are also important when combining a SAXS or SANS dataset with other sources of experimental data, to assess which weights should be assigned to each type of data. This could be, e.g., SAXS and SANS [2] or SAXS and nuclear magnetic resonance [3]. Therefore, appropriate experimental errors are essential when interpreting SAXS and SANS data. Unfortunately, these experimental errors are not always well-determined or well-behaved. This has been demonstrated for the data in the small-angle scattering biological data bank (SASBDB) [4].

The second challenge is the question of how tightly a model should be fitted to a dataset. The information content in small-angle scattering data is small compared to the three-dimensional structural models that the scientist wishes to refine from the data. To avoid overfitting, prior knowledge or constraints are usually introduced, e.g. from other experiments, from simulations, or from more general assumptions [5]. As an example, molecular dynamics (MD) simulations can be used in combination with SAXS data to refine a model, e.g. by reweighting the frames of the trajectory such that the calculated scattering from the trajectory matches the data. However, such simulated trajectories contain information about the position and movement of thousands of atoms, which far exceeds the … to … free parameters that can be retrieved from a typical SAXS dataset [6, 7]. A good model must therefore alter the prior trajectory as little as possible, but still fit the data sufficiently well. But how well is sufficiently well [8]?

One solution is to aim for a reduced $\chi^2$ of one. $\chi^2_r$, which we will discuss in more detail later, is a measure of the goodness of fit and has an expectation value of unity, so aiming for $\chi^2_r = 1$ is appealing and simple.
There are, however, several problems with this approach [9], even under the assumption that the experimental errors are appropriate. One problem is that the correct or “true” $\chi^2_r$ is not unity for all datasets. Even if we knew the true underlying model and could calculate a corresponding SAXS or SANS model intensity profile accurately, we would not expect a $\chi^2_r$ of exactly unity, but rather one that followed the $\chi^2_r$ distribution. So, in principle, one should aim for the true, underlying $\chi^2_r$ for that specific dataset when fitting data. The second challenge can therefore be solved by estimating the unknown value of the true $\chi^2_r$ for a given dataset, sometimes referred to as the “noise level” of the dataset [10].

In the presented study, we demonstrate how this quantity can be estimated with the Bayesian Indirect Fourier Transformation (BIFT) algorithm [10], which in turn provides an answer to the question “when does the model fit the data sufficiently well?”: it does so when the $\chi^2_r$ of the fit matches the true $\chi^2_r$ of that particular dataset, as estimated by the method presented here. For a specific dataset this may be … or …, i.e. relatively far from the textbook-mandated target value of 1. This is particularly relevant for data with few data points, and hence few degrees of freedom, resulting in a wide distribution for $\chi^2_r$. The method is useful for combining SAXS or SANS with MD simulations, and when doing ab initio modeling [11] or other types of approaches that involve models with many degrees of freedom.

arXiv: … [physics.data-an], … Dec …

The central technique investigated in this study is small-angle scattering, which obtains structural information on a sample as follows. A sample is irradiated by an intense, monochromatic, well-collimated beam of X-rays or neutrons. The scattered radiation is recorded on a position-sensitive detector. For samples without a preferred orientation, such as molecules in solution, the detection pattern exhibits rotational symmetry, and the data can be azimuthally averaged into a one-dimensional set of data.

We record the scattered intensity, $I$, as a function of the momentum transfer of the incoming radiation, $q$, which is given by $q = 4\pi \sin(\theta)/\lambda$, where $\lambda$ is the nominal wavelength of the X-rays or neutrons, and $\theta$ is the scattering angle. Additionally, data reduction software assigns an estimate of the experimental error, $\sigma$, to the intensity recorded for each value of $q$. Summing up, a datapoint in a small-angle scattering dataset is a triplet consisting of $\{q, I, \sigma\}$.

For this study, we simulated thousands of SAXS datasets using a simple, virtual SAXS instrument with dimensions based on the beamline BM29 [12] at ESRF in Grenoble. The simulations were done using the X-ray instrument simulation software package McXtrace [13] using methods outlined in the literature [14, 15].

In the simulations, we assume a nominal X-ray wavelength of … Å, a collimation length of …, … pinholes, a beamport-to-sample distance of …, a sample-to-detector distance of ….43 m, and a beam stop covering the incoming beam. The signal is recorded on a two-dimensional position-sensitive detector and azimuthally averaged for a final range in scattering momentum transfer of approximately … Å$^{-1}$ to … Å$^{-1}$.

We conduct virtual experiments for three distinct samples in solution: the proteins lysozyme and bovine serum albumin (BSA), described by their respective crystal structures (i.e. the entries 2LYZ and 4F5S in the Protein Data Bank [16]), as well as an n-dodecyl-β-D-maltoside (DDM) detergent micelle at … °C, described by the mathematical model implemented in WillItFit [17] with dimensions taken from experimental reports in the literature [18]. The scattering profiles were calculated using the software package CRYSOL [19] for the proteins, whereas scattering from the solution of micelles was computed using geometric form factor amplitudes [20].

Our virtual experiments represent three different exposure times: short, medium, and long, the durations of which differ by a factor of ten. Examples of simulated data are shown in Figure 1, and all source code for these components is available through online repositories.

Since its introduction in the seventies [21], the Indirect Fourier Transform has been a staple in the preliminary analysis of solution small-angle scattering data by producing pair distance distributions of recorded data for an initial glance at the sample’s structure. Popular implementations of the algorithm rely on Bayesian statistics and optimization for unbiased estimates of the parameters needed for the transformation.

Among these is the BIFT algorithm [10] with its associated web-based implementation, BayesApp [22, 23]. The methodology is readily extended to structural models [5]. A GenApp-based [24] implementation of BayesApp is currently available online.
The source code can be found in the associated repository. (Links: McXtrace simulation components, https://github.com/McStasMcXtrace/McCode; BayesApp web interface, https://somo.chem.utk.edu/bayesapp; BayesApp source code, https://github.com/Niels-Bohr-Institute-XNS-StructBiophys/BayesApp.)

The central objective of the algorithm is to estimate the pair distance distribution, $p(r)$, of the sample from the measured intensity. We remind the reader that a small-angle scattering intensity profile, $I(q)$, is related to $p(r)$ by:

$$p(r) = \frac{1}{2\pi^2 n} \int_0^\infty \mathrm{d}q \, (qr)^2 \, I(q) \, \frac{\sin(qr)}{qr}, \tag{1}$$

where $n$ is the number density of the given particle in the sample. In an experiment, intensity is measured only in a limited range, and at discrete points with associated errors, so the integral cannot be evaluated directly. Instead, the BIFT algorithm estimates $p(r)$ in an indirect manner.

First, from an initial guess of $p(r)$, the inverted version of Equation (1) is used to calculate the intensity:

$$I(q) = 4\pi n \int_0^\infty \mathrm{d}r \, p(r) \, \frac{\sin(qr)}{qr}. \tag{2}$$

In practice, the integral can be truncated at the largest distance between two scatterers, $D_\mathrm{max}$. $p(r)$ is usually represented via an expansion on a suitable set of basis functions, such as cubic splines or cardinal sine functions. The coefficients of the basis functions are adjusted until the intensity calculated by Equation (2) matches the measured intensity. To avoid overfitting, the fitting process is done under a smoothness constraint on $p(r)$ [21, 25]. Specifically, the BIFT algorithm minimizes the functional:

$$Q = \chi^2 + \alpha S, \tag{3}$$
where $\chi^2$ measures the overlap between a set of $M$ datapoints, $(q_j, I_{\mathrm{exp},j}, \sigma_j)$, and a model function, $I_\mathrm{mod}(q)$, evaluated at $q_j$:

$$\chi^2 = \sum_{j=1}^{M} \left( \frac{I_\mathrm{mod}(q_j) - I_{\mathrm{exp},j}}{\sigma_j} \right)^2, \tag{4}$$

and $S$ is the prior smoothness constraint:

$$S = \int_0^\infty \mathrm{d}r \, \left( p''(r) \right)^2, \tag{5}$$

where the prime denotes the derivative with respect to $r$.

As before, the upper limit of the integral can be truncated at $D_\mathrm{max}$. The hyperparameter $\alpha$ in Equation (3) weighs the two contributions; finding the optimal value of $\alpha$ is part of the BIFT algorithm objective [10]. Note that alongside $\alpha$, $D_\mathrm{max}$ is also an estimated hyperparameter in this approach.

BIFT also provides an estimate for the number of good parameters, $N_g$, in the dataset, i.e. how many degrees of freedom were used to fit the data [6]. From the fitting process, one obtains a Hessian consisting of the matrix elements $B_{ij} = \partial^2 \chi^2 / \partial c_i \partial c_j$, where $c_i$ are parameters in the model. $N_g$ is given via $\alpha$ and the eigenvalues, $\lambda_i$, of the Hessian:

$$N_g = \sum_{i=1}^{N_b} \frac{\lambda_i}{\alpha + \lambda_i}, \tag{6}$$

where $N_b$ is the number of basis functions used to represent $p(r)$.

Figure 1: Examples of the simulated datasets for each of our three systems: lysozyme (top), BSA (middle), and DDM micelles (bottom). As shown, we simulate three different exposure times for each of these systems. Each panel shows intensity, $I$ (arb. units), against momentum transfer, $q$ (Å$^{-1}$). On the right, the structural models from which the data are simulated are rendered. A quarter of the oblate DDM micelle has been removed to reveal the interior core-shell structure.

Of particular importance to this study is the notion of the reduced $\chi^2$, usually dubbed $\chi^2_r$. Any statistics textbook will teach us to normalize the quantity in Equation (4) using the degrees of freedom, $N_\mathrm{DoF}$:

$$\chi^2_r = \frac{1}{N_\mathrm{DoF}} \sum_{j=1}^{M} \left( \frac{I_\mathrm{mod}(q_j) - I_{\mathrm{exp},j}}{\sigma_j} \right)^2. \tag{7}$$

The pertinent question for further application of this quantity is: what is $N_\mathrm{DoF}$?
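Equations (2), (4), and (7) translate almost directly into code. The following is a minimal Python sketch, not taken from the BIFT or BayesApp source; the function names, the polynomial $p(r)$, the $q$-grid, and the toy "data" are all assumptions chosen for illustration. $N_\mathrm{DoF}$ is deliberately left as an explicit argument, since its appropriate value is exactly the question at hand.

```python
import math

def sinc(x):
    """sin(x)/x, with the x -> 0 limit handled explicitly."""
    return 1.0 if abs(x) < 1e-12 else math.sin(x) / x

def intensity_from_pr(q, r, p, n=1.0):
    """Equation (2), truncated at D_max = r[-1], using the trapezoidal rule."""
    total = 0.0
    for k in range(len(r) - 1):
        left = p[k] * sinc(q * r[k])
        right = p[k + 1] * sinc(q * r[k + 1])
        total += 0.5 * (left + right) * (r[k + 1] - r[k])
    return 4.0 * math.pi * n * total

def chi2(i_mod, i_exp, sigma):
    """Equation (4): sum of squared, error-normalised residuals."""
    return sum(((m - e) / s) ** 2 for m, e, s in zip(i_mod, i_exp, sigma))

def chi2_reduced(i_mod, i_exp, sigma, n_dof):
    """Equation (7): chi-square divided by the chosen degrees of freedom."""
    return chi2(i_mod, i_exp, sigma) / n_dof

# Hypothetical smooth p(r) on [0, D_max]; it vanishes at both ends.
d_max = 50.0
r = [d_max * k / 200 for k in range(201)]
p = [x ** 2 * (d_max - x) ** 2 for x in r]

# Hypothetical q-grid with M = 50 points.
q_values = [0.01 + 0.005 * j for j in range(50)]
i_mod = [intensity_from_pr(q, r, p) for q in q_values]

# Toy "data": every point sits exactly one sigma above the model.
sigma = [0.02 * i_mod[0]] * len(q_values)
i_exp = [m + s for m, s in zip(i_mod, sigma)]

M = len(q_values)
print(chi2(i_mod, i_exp, sigma))                  # close to M
print(chi2_reduced(i_mod, i_exp, sigma, M))       # close to 1
print(chi2_reduced(i_mod, i_exp, sigma, M - 2))   # slightly above 1
```

The last two lines show that the same residuals give different values of $\chi^2_r$ depending on the choice of $N_\mathrm{DoF}$, which matters most when $M$ is small.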
For a simple model with $K$ independent parameters, $N_\mathrm{DoF}$ is simply equal to the number of datapoints minus the number of fitting parameters, $M - K$.

As an example, this could be a model where a known scattering form factor is fitted to a dataset with a scaling parameter and a constant background, i.e. with two free parameters. In that case, $N_\mathrm{DoF} = M - 2$ gives the correct distribution for $\chi^2_r$ (Figure 2). In this paper, we will discuss how to estimate $N_\mathrm{DoF}$ for a $p(r)$ distribution based on an expansion on a set of (correlated) basis functions, as in BIFT.

As described above, the BIFT algorithm performs an analysis resembling that of singular value decomposition to provide the number of good parameters, $N_g$ [10, 6]. This quantity is the effective number of free parameters, taking the smoothness prior into account, and is therefore a good estimate for the number of effective parameters in the model.

BIFT has additional degrees of freedom, as it estimates a constant background, $B$, and a maximum diameter, $D_\mathrm{max}$. If $D_\mathrm{max}$ is too low, the data cannot be fitted well, and that is also the case if $B$ is too high. Therefore, we argue that these parameters contribute one additional degree of freedom to the model. We therefore suggest that $N_\mathrm{DoF} = M - N_g - 1$. This improves the distribution for $\chi^2_r$ (Figure 2 and Figure SI.3) when compared to simply using $N_\mathrm{DoF} = M$.

Figure 2: SAXS datasets of lysozyme were simulated with long exposure time in McXtrace. Each dataset contained $M = 50$ points; examples of these datasets are shown in Figure SI.1. (A-B) Models were fitted to the data with a scaling factor and a constant background as free parameters with (A) $M$ or (B) $M - 2$ as degrees of freedom in the expression for $\chi^2_r$ in Equation (7). The average value of $\chi^2_r$, $\langle \chi^2_r \rangle$, is listed in the legend, and the theoretical probability distribution function (PDF) for the given number of degrees of freedom is shown. (C-D) Histograms of $\chi^2_r$ values from the BIFT fits with (C) $M$ or (D) $M - N_g - 1$ as $N_\mathrm{DoF}$ in $\chi^2_r$. Legend values of $\langle \chi^2_r \rangle$: (A) 0.97 ± 0.01, (B) 0.99 ± 0.01, (C) 0.89 ± 0.01, (D) 0.98 ± 0.01.

We investigated whether the noise level of a dataset can be found using the BIFT algorithm. For that purpose, we simulated an extensive amount of virtual data. Unlike experimental data, the noise level can be recovered from simulated data.
As we used a model in the form of a calculated SAXS scattering curve to generate the simulated data, we can determine the noise level of the simulated data, $\chi^2_{r,\mathrm{Model}}$, by fitting the same curve to the simulated data.

Following this, we ran BIFT for each dataset, which gave our estimated value for the noise level, $\chi^2_{r,\mathrm{BIFT}}$, and monitored the correlation between $\chi^2_{r,\mathrm{Model}}$ and $\chi^2_{r,\mathrm{BIFT}}$.

The correlation is strong for all tested systems (lysozyme, BSA, and a micelle) and all tested exposure times (Figure 3), which supports the notion that $\chi^2_{r,\mathrm{BIFT}}$ is a direct expression of the noise level of the dataset, $\chi^2_{r,\mathrm{Model}}$. That is a central observation in this study and necessary for being able to further assess and correct over- or underestimated errors.
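To make the notion of a dataset-specific noise level concrete, the following Python sketch draws Gaussian noise around a known curve and evaluates the reduced $\chi^2$ of that true curve against each noisy realisation. The values scatter around unity from dataset to dataset, with a spread close to $\sqrt{2/M}$. The sketch also evaluates Equation (6) and the suggested $N_\mathrm{DoF} = M - N_g - 1$. Everything here is an assumption for illustration: the Guinier-like curve, the error model, the eigenvalues, and $\alpha$ are hypothetical, no parameters are fitted, and none of this is the McXtrace or BIFT code.

```python
import math
import random

def n_good(eigenvalues, alpha):
    """Equation (6): number of good parameters from Hessian eigenvalues."""
    return sum(lam / (alpha + lam) for lam in eigenvalues)

def n_dof_bift(m, n_g):
    """Degrees of freedom suggested in the text: M - N_g - 1."""
    return m - n_g - 1

def noise_level(i_true, sigma, rng):
    """Reduced chi-square of the true curve against one noisy realisation.

    No parameters are fitted, so N_DoF = M is used here.
    """
    i_sim = [rng.gauss(i, s) for i, s in zip(i_true, sigma)]
    chi2 = sum(((i - d) / s) ** 2 for i, d, s in zip(i_true, i_sim, sigma))
    return chi2 / len(i_true)

rng = random.Random(1)
M = 50
q = [0.01 + 0.005 * j for j in range(M)]
i_true = [math.exp(-(x * 20.0) ** 2 / 3.0) for x in q]  # assumed Guinier-like curve
sigma = [0.05 * i + 1e-4 for i in i_true]                # assumed error model

levels = [noise_level(i_true, sigma, rng) for _ in range(2000)]
mean = sum(levels) / len(levels)
spread = math.sqrt(sum((x - mean) ** 2 for x in levels) / len(levels))
print(f"mean = {mean:.3f}, spread = {spread:.3f}")
print(f"expected spread sqrt(2/M) = {math.sqrt(2 / M):.3f}")

# Hypothetical Hessian eigenvalues and regularisation weight.
eigenvalues = [1e4, 1e3, 1e2, 1e1, 1e0, 1e-1]
alpha = 10.0
n_g = n_good(eigenvalues, alpha)
print(f"N_g = {n_g:.2f}, N_DoF = {n_dof_bift(M, n_g):.2f}")
```

Eigenvalues much larger than $\alpha$ each contribute nearly one good parameter, while those much smaller than $\alpha$ contribute almost nothing, so $N_g$ is always below the number of basis functions.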
Since BIFT can determine the noise level of the data, we propose that it can also be used to identify over- or underestimated errors. If $\chi^2_{r,\mathrm{BIFT}}$ is much larger than unity, it is an indication that errors are underestimated, and if $\chi^2_{r,\mathrm{BIFT}}$ is much smaller than unity, errors are likely overestimated. In those cases, a better estimate of the errors can be achieved by simply rescaling the experimental errors on the individual datapoints:

$$\sigma_\mathrm{Corrected} = \sigma_\mathrm{Recorded} \sqrt{\chi^2_{r,\mathrm{BIFT}}}, \tag{8}$$

where $\sigma_\mathrm{Recorded}$ are the over- or underestimated errors. We note that the BIFT algorithm should be run on each setting in a SANS dataset independently, and that each of these datasets should be rescaled with its own factor.

To extend our simulations further, we rescaled the errors of our simulated datasets with factors between … and … to mimic over- or underestimated experimental errors. The BIFT algorithm was run on each dataset, and from that a factor to rescale these artificially over- or underestimated errors was obtained using Equation (8). As shown in Figure 4, the artificially introduced rescaling factors were accurately recovered by BIFT, demonstrating that BIFT can identify over- or underestimated errors.

To illustrate the method on real experimental data, we used small-angle scattering data measured on a sample of nanodiscs with the phospholipid DLPC (1,2-dilauroyl-sn-glycero-3-phosphocholine) and the membrane scaffolding protein MSP1D1 [26]. The structural model has been described previously [27, 28]. The data consist of five datasets: one SAXS dataset, two SANS datasets measured in … D$_2$O-based buffer (high-$q$ and low-$q$ setting), and two SANS datasets measured in … D$_2$O-based buffer, where the protein rim of the nanodisc is matched out.

All datasets were Fourier transformed using the BIFT algorithm, which provides a $\chi^2_r$ for each dataset (Table 1). The resulting probabilities show that the $\chi^2_r$ values are very unlikely given correctly estimated errors.
The SAXS data have underestimated errors ($\chi^2_r$ significantly larger than 1), whereas all the SANS data have overestimated errors ($\chi^2_r$ significantly smaller than 1).

The data can simultaneously be fitted with a nanodisc model (Figure 5A). Despite good fits, the resulting $\chi^2_r$ values from the model fits are, respectively, very small (SANS) or very high (SAXS). We corrected the errors using the protocol presented here, and after the correction, the $\chi^2_r$ values corresponded well with the visual assessment, namely that of a good fit (Figure 5B). The calculated errors on the refined parameters were also more reasonable. Before rescaling the errors, the estimated errors on the refined parameters were underestimated from the fit to SAXS data, and overestimated from the fit to SANS data (Table SI.2).

When fitting all datasets simultaneously, such a rescaling will assign a more appropriate weight to each dataset. In this case, more weight is given to the SANS data after rescaling the errors, and accordingly less to the SAXS data. For these specific data and this model, the SAXS data nevertheless dominate the refinement process, even after rescaling (Table SI.2), but for other data and models that is not the case (see e.g. [29]). Another experimental example can be found in an early use of the method [30] with SANS data on the transmembrane protein holo-translocon.

Figure 3: Correlation plot of the calculated values of $\chi^2_{r,\mathrm{Model}}$ and $\chi^2_{r,\mathrm{BIFT}}$ (using the normalisation introduced in this paper via Figure 2). Pearson correlation coefficients, $r$, are listed in each legend of the plots. Examples of these simulated datasets are shown in Figure 1. Legend values: lysozyme, $r = 0.997$ for all exposure times (mean $\chi^2_{r,\mathrm{Model}}$ and mean $\chi^2_{r,\mathrm{BIFT}}$ of 1.00 for short and 1.01 for medium and long exposure); BSA, $r = 0.995$ (both means 1.00); micelle, $r = 0.996$ (both means 1.00).

Table 1: $\chi^2_r$ values from BIFT for each dataset, along with the probabilities for getting these values given that errors are correct (see definition in Section SI.5). There are two settings for each contrast of the SANS data, as seen in Figure 5.

Data                        BIFT, $\chi^2_r$    Probability
SAXS                        …                   ∼ …
SANS, … D$_2$O, low-$q$     …                   ∼ …
SANS, … D$_2$O, high-$q$    …                   ∼ …
SANS, … D$_2$O, low-$q$     …                   ∼ …
SANS, … D$_2$O, high-$q$    …                   ∼ …

Figure 4: The errors on simulated data were multiplied by a factor, effectively changing them to over- or underestimated errors. This factor was estimated using the outlined approach. The main panel (lysozyme, medium exposure) shows the rescaling suggested by BIFT against the artificial rescaling of errors on the data. On top and below, we show examples of datasets rescaled by factors of … and …, respectively. For completeness, the data from the remaining simulations have been subjected to the same numerical experiment (Figure SI.2).

Figure 5: SAXS and SANS data for a sample of phospholipid bilayer nanodiscs. A geometrical nanodisc model refined from the data is shown as inset, and the model fit is shown in black, with resulting $\chi^2_r$ values given on the plots. (A) Data and fits before rescaling the errors with BIFT. Inset shows the nanodisc model. (B) Data and fits after rescaling of the errors.

In this paper, we outlined a proposed approach to two central challenges in the interpretation of SAXS and SANS data. The first challenge is the issue of over- or underestimated experimental errors. The BIFT algorithm provides a valuable tool that can identify such experimental errors and rescale them to obtain a better estimate of the experimental errors. However, there are some important limitations for the method:
• As discussed in the introduction, $\chi^2_r$ may be different from unity even though errors are correct. That is, one has to assess whether the $\chi^2_r$ is so unlikely that it is reasonable to believe that experimental errors are over- or underestimated. For that purpose, one can use the probability of getting a certain value of $\chi^2_{r,\mathrm{BIFT}}$ given that the errors are correct. This is now given as output in the GenApp implementation of the BIFT algorithm. As a rule of thumb, we recommend that if the probability for the $\chi^2_{r,\mathrm{BIFT}}$ value is below … (corresponding to three $\sigma$, see Section SI.5), the errors should be rescaled to achieve a better estimate.
• Adjusting the experimental errors using Equation (8) assumes that uniformly rescaling these across the full $q$-range is appropriate.
This may not be the case.
• If the experimental errors are corrected, and a model is subsequently fitted to the data, the resulting $\chi^2_{r,\mathrm{Corrected}}$ will not follow a $\chi^2_r$ distribution. The BIFT algorithm accurately determines the noise level of data as shown in Figure 3, so $\chi^2_{r,\mathrm{Corrected}}$ will follow a very narrow distribution around unity.

Rather than rescaling the experimental errors, one could aim for a $\chi^2_r$ that is equal to $\chi^2_{r,\mathrm{BIFT}}$ when refining a model from the data. This is a safe choice but lacks the convenient and familiar property of $\chi^2_r$ being close to unity for a good fit. This approach is also not applicable when combining several datasets, as mis-estimated errors would give an incorrect weight between the different types of data.

The second challenge we address in this paper is the inherent statistical variation in data. This implies that a good fit should not necessarily result in a $\chi^2_r$ of unity, as shown in Figure 2, even if the model in question is known to be able to adequately represent the “ground truth”. This is a challenge for approaches in which the model has many degrees of freedom, such as ab initio modeling or reweighting of an ensemble of structures against a SAXS dataset. The BIFT algorithm provides a measure for how well data should be fitted; specifically, which value of $\chi^2_r$ to strive for.

The BIFT algorithm makes for an attractive component in a pipeline for assessing noise in small-angle scattering data, as the algorithm is fast, automated, and performs well using default settings for generic small-angle scattering data. The approach is applicable for SAXS and SANS data.

Our findings and associated recommendations can be summed up in the following points:
• The target $\chi^2_r$ for model refinement should be that of the BIFT algorithm.
• Alternatively, error bars can be rescaled using the BIFT algorithm. In that case, the target $\chi^2_r$ is one. This is particularly useful for simultaneous refinement of several datasets, e.g. SAXS and SANS, in order to obtain the correct weighting between all data. Note that the emerging distribution of $\chi^2_r$ after rescaling is not $\chi^2$-distributed.

The method is accessible and effective. The interface for BayesApp has been updated to support the ideas presented here, where $\chi^2_{r,\mathrm{BIFT}}$ will be given along with the probability of that value given that experimental errors are correct. Additionally, a dataset with rescaled experimental errors is produced.
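The rescaling in Equation (8), and its effect on $\chi^2_r$, can be sketched in a few lines of Python. The function names and toy numbers below are assumptions for illustration only. In practice, $\chi^2_{r,\mathrm{BIFT}}$ would be taken from the BIFT output rather than from a model fit as in this toy case, and each SANS setting would be rescaled with its own factor.

```python
import math

def rescale_errors(sigma_recorded, chi2r_bift):
    """Equation (8): multiply every error bar by sqrt(chi2_r,BIFT)."""
    factor = math.sqrt(chi2r_bift)
    return [s * factor for s in sigma_recorded]

def chi2_reduced(i_mod, i_exp, sigma, n_dof):
    """Equation (7): chi-square divided by the degrees of freedom."""
    chi2 = sum(((m - e) / s) ** 2 for m, e, s in zip(i_mod, i_exp, sigma))
    return chi2 / n_dof

# Toy dataset: every residual is two recorded sigmas, i.e. the recorded
# errors are underestimated by a factor of two.
i_mod = [10.0, 8.0, 5.0, 2.0]
sigma_recorded = [0.10, 0.10, 0.08, 0.05]
i_exp = [m + 2.0 * s for m, s in zip(i_mod, sigma_recorded)]
n_dof = len(i_mod)

# Stand-in for the BIFT estimate of the noise level (assumed here).
chi2r_before = chi2_reduced(i_mod, i_exp, sigma_recorded, n_dof)
sigma_corrected = rescale_errors(sigma_recorded, chi2r_before)
chi2r_after = chi2_reduced(i_mod, i_exp, sigma_corrected, n_dof)

print(f"before: {chi2r_before:.3f}")   # close to 4
print(f"after:  {chi2r_after:.3f}")    # close to 1
```

Because the rescaling is a single multiplicative factor, it changes the relative weight of a dataset in a joint refinement without changing the shape of its error profile across $q$, which is the uniform-rescaling assumption noted in the limitations above.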
The authors would like to thank Steen Laugesen Hansen, who conceptualized the idea and wrote the original implementation of the BIFT algorithm. Furthermore, the authors would like to thank Emre Brookes for great support in updating the web interface for BayesApp available through GenApp. The authors also thank the Carlsberg Foundation grant CF19-0288 for funding AHL, as well as the Lundbeck Foundation for the Brainstruc grant R155-2015-2666 and the Novo Nordisk Foundation Synergy grant NNF15OC0016670 for funding to MCP.

https://somo.chem.utk.edu/bayesapp

References
[1] Franke, D., Jeffries, C. M. & Svergun, D. I. Correlation Map, a goodness-of-fit test for one-dimensional X-ray scattering spectra. Nature Methods, 419–422 (2015).
[2] Larsen, A. H. et al. Combining molecular dynamics simulations with small-angle X-ray and neutron scattering data to study multi-domain proteins in solution. PLoS Computational Biology, 1–29 (2020).
[3] Mertens, H. D. & Svergun, D. I. Combining NMR and small angle X-ray scattering for the study of biomolecular structure and dynamics. Archives of Biochemistry and Biophysics, 33–41 (2017).
[4] Kikhney, A. G., Borges, C. R., Molodenskiy, D. S., Jeffries, C. M. & Svergun, D. I. SASBDB: Towards an automatically curated and validated repository for biological scattering data. Protein Science, 66–75 (2020).
[5] Larsen, A. H., Arleth, L. & Hansen, S. Analysis of small-angle scattering data using model fitting and Bayesian regularization. Journal of Applied Crystallography, 1151–1161 (2018).
[6] Vestergaard, B. & Hansen, S. Application of Bayesian analysis to indirect Fourier transformation in small-angle scattering. Journal of Applied Crystallography, 797–804 (2006).
[7] Konarev, P. V. & Svergun, D. I. A posteriori determination of the useful data range for small-angle scattering experiments on dilute monodisperse systems. IUCrJ, 352–360 (2015).
[8] Orioli, S., Larsen, A. H., Bottaro, S. & Lindorff-Larsen, K. How to learn from inconsistencies: Integrating molecular simulations with experimental data. In Strodel, B. & Barz, B. (eds.) Progress in Molecular Biology and Translational Science: Computational Approaches for Understanding Dynamical Systems: Protein Folding and Assembly, vol. 170, chap. 3, 123–176 (Elsevier, 2020).
[9] Andrae, R., Schulze-Hartung, T. & Melchior, P. Dos and don’ts of reduced chi-squared. arXiv, 1–12 (2010).
[10] Hansen, S. Bayesian estimation of hyperparameters for indirect Fourier transformation in small-angle scattering. Journal of Applied Crystallography, 1415–1421 (2000).
[11] Franke, D. & Svergun, D. I. DAMMIF, a program for rapid ab-initio shape determination in small-angle scattering. Applied Crystallography, 342–346 (2009).
[12] Pernot, P. et al. Upgraded ESRF BM29 beamline for SAXS on macromolecules in solution. Journal of Synchrotron Radiation, 1–5 (2013).
[13] Knudsen, E. et al. McXtrace: a Monte Carlo software package for simulating X-ray optics, beamlines and experiments. Journal of Applied Crystallography, 679–696 (2013).
[14] Kynde, S. et al. A compact time-of-flight SANS instrument optimised for measurements of small sample volumes at the European Spallation Source. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 133–141 (2014).
[15] Pedersen, M. C., Hansen, S. L., Markussen, B., Arleth, L. & Mortensen, K. Quantification of the information in small-angle scattering data. Journal of Applied Crystallography, 2000–2010 (2014).
[16] Berman, H. M. et al. The protein data bank. Nucleic Acids Research, 235–242 (2000).
[17] Pedersen, M. C., Arleth, L. & Mortensen, K. WillItFit: A framework for fitting of constrained models to small-angle scattering data. Journal of Applied Crystallography, 1894–1898 (2013).
[18] Oliver, R. C. et al. Dependence of Micelle Size and Shape on Detergent Alkyl Chain Length and Head Group. PLoS ONE, e62488 (2013).
[19] Svergun, D. I., Barberato, C. & Koch, M. H. J. CRYSOL - a program to evaluate X-ray solution scattering of biological macromolecules from atomic coordinates. Journal of Applied Crystallography, 768–773 (1995).
[20] Pedersen, J. S. Analysis of small-angle scattering data from colloids and polymer solutions: Modeling and least-squares fitting. Advances in Colloid and Interface Science, 171–210 (1997).
[21] Glatter, O. A new method for the evaluation of small-angle scattering data. Journal of Applied Crystallography, 415–21 (1977).
[22] Hansen, S. BayesApp: a web site for indirect transformation of small-angle scattering data. Journal of Applied Crystallography, 35–6 (2012).
[23] Hansen, S. Update for BayesApp: a web site for analysis of small-angle scattering data. Journal of Applied Crystallography, 1469–1471 (2014).
[24] Brookes, E. & Savelyev, A. GenApp Integrated with OpenStack Supports Elastic Computing on Jetstream. Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact …
[25] … Solution of Ill-Posed Problems (Wiley, New York, 1977).
[26] Bayburt, T. H., Grinkova, Y. V. & Sligar, S. G. Self-Assembly of Discoidal Phospholipid Bilayer Nanoparticles with Membrane Scaffold Proteins. Nano Letters, 853–856 (2002).
[27] Skar-Gislinge, N. et al. Elliptical structure of phospholipid bilayer nanodiscs encapsulated by scaffold proteins: casting the roles of the lipids and the protein. Journal of the American Chemical Society, 13713–22 (2010).
[28] Skar-Gislinge, N. & Arleth, L. Small-angle scattering from phospholipid nanodiscs: derivation and refinement of a molecular constrained analytical model form factor. Physical Chemistry Chemical Physics, 3161–70 (2011).
[29] Heller, W. T. Small-angle neutron scattering and contrast variation: a powerful combination for studying biological structures. Acta Crystallographica Section D Biological Crystallography D66, 1213–1217 (2010).
[30] Martin, R. et al. Structure and Dynamics of the Central Lipid Pool and Proteins of the Bacterial Holo-Translocon. Biophysical Journal, 1931–1940 (2019).

Supporting Information

SI.1 Examples of simulated data with M = 50 points

Figure SI.1 shows examples of simulated datasets with a detector binning the data into 50 bins. The data form the basis for Figures 2 and SI.3.
Figure SI.1: Examples of simulated data with M = 50 datapoints.

SI.2 Full version of the correlation plots
The full version of the correlation plot in Figure 4, for the versions of our data with rescaled errors, is shown in Figure SI.2.
[Figure SI.2: nine panels (Lysozyme, BSA and Micelle; short, medium and long exposure). Axes: suggested rescaling by BIFT versus artificial rescaling of errors on data, both spanning 0.1 to 10.]
Figure SI.2: Full version of Figure 4.
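The relation between an artificial rescaling of the errors and the resulting reduced $\chi^2$ can be illustrated with a minimal sketch. It does not reproduce the BIFT algorithm. It only shows that multiplying all errors by a factor $c$ divides $\chi^2_r$ by $c^2$, and it assumes, for illustration, that a correction factor of $\sqrt{\chi^2_r}$ is applied to the errors. The function names, the toy values and the choice $N_\mathrm{DoF} = M$ are hypothetical and not taken from the paper.

```python
import math


def reduced_chi2(data, fit, sigma, n_dof):
    """Sum of squared, error-normalised residuals divided by n_dof."""
    chi2 = sum(((d - f) / s) ** 2 for d, f, s in zip(data, fit, sigma))
    return chi2 / n_dof


def rescale_errors(sigma, factor):
    """Multiply every error bar by the same factor."""
    return [factor * s for s in sigma]


# Hypothetical toy values, for illustration only (not data from the paper).
data = [1.00, 0.82, 0.61, 0.45, 0.30]
fit = [0.97, 0.85, 0.63, 0.41, 0.32]
sigma = [0.03] * len(data)
n_dof = len(data)  # assumption: N_DoF = M

chi2_r = reduced_chi2(data, fit, sigma, n_dof)

# Artificially underestimate the errors by a factor of 2:
# the reduced chi-square grows by a factor of 2**2 = 4.
sigma_small = rescale_errors(sigma, 0.5)
chi2_r_small = reduced_chi2(data, fit, sigma_small, n_dof)

# Assumed correction: multiply the errors by sqrt(chi2_r),
# which brings the reduced chi-square back to unity.
suggested_factor = math.sqrt(chi2_r_small)
sigma_corrected = rescale_errors(sigma_small, suggested_factor)
chi2_r_corrected = reduced_chi2(data, fit, sigma_corrected, n_dof)
```

In this sketch the fit is held fixed while the errors are rescaled, which is a simplification: in a real refinement the fit itself may change when the weights change.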
SI.3 Distributions of $\chi^2_r$ for different choices of $N_\mathrm{DoF}$
All distributions of $\chi^2_r$ for our datasets with M = 50 datapoints, for the BIFT algorithm and for the structural model with different choices of $N_\mathrm{DoF}$, are shown in Figure SI.3. Corresponding statistics can be found in Table SI.1.

[Figure SI.3: histograms of $\chi^2_r$ for Lysozyme, BSA and Micelle at short, medium and long exposure, with panels labelled by the choice of $N_\mathrm{DoF}$.]
Figure SI.3: Full version of Figure 2. Average values for each histogram are given in Table SI.1.

Model      Exposure time   Choice of $\chi^2_r$
                           $\chi^2/M$   $\chi^2/(M - …)$   $\chi^2/M$   $\chi^2/(M - N_g - …)$
Lysozyme   Short           0.95         0.99               0.87         0.95
-          Medium          0.93         0.97               0.86         0.95
-          Long            0.95         0.99               0.89         0.98
BSA        Short           0.93         0.97               0.78         0.87
-          Medium          0.94         0.98               0.80         0.90
-          Long            0.95         0.99               0.81         0.90
Micelle    Short           0.94         0.94               0.82         0.90
-          Medium          0.95         0.95               0.82         0.92
-          Long            0.95         0.95               0.83         0.94
Mean of average values     0.94         0.98               0.83         0.92
Standard deviation         0.01         0.01               0.03         0.03

Table SI.1: Average of the $\chi^2_r$ values for each sample and exposure time. Histograms for the same values are displayed in Figure SI.3. M = 50 is the number of datapoints, and $N_g$ is the number of “good parameters” from the BIFT algorithm.

SI.4 Structural parameters for the presented SAS fits
The parameters refined from the fits in Figure 5 can be found in Table SI.2.
                                                SAXS                     SAXS and SANS            SANS
                                                Original     Rescaled    Original     Rescaled    Original     Rescaled
$\chi^2_r$                                      …            …           …            …           ….08         0.…

Structural parameters
Axis ratio of bilayer                           … ± …        … ± …       … ± …        … ± …       … ± …        … ± …
Area per lipid headgroup, Å^2                   … ± …        … ± …       … ± …        … ± …       … ± 222      70 ± …
Number of lipids                                … ± 12       149 ± 27    152 ± 10     153 ± …     … ± 499      143 ± …
Radius of gyration of histidine tag, Å          … ± …        … ± 16      10 ± 11      11 ± 17     43 ± …       … ± …
Volume of membrane scaffolding protein, Å^3     … ± 598      26494 ± …   … ± 779      26621 ± 991 24315 ± …    … ± …
Volume of phospholipid, DLPC, Å^3               … ± …        … ± 11      679 ± …      … ± …       … ± …        … ± …

Contrast-specific parameters
SAXS Roughness, Å                               … ± …        … ± …       … ± …        … ± …       –            –
SAXS Background, … cm^-1                        … ± …        … ± …       … ± …        … ± …       –            –
SANS Roughness, Å                               –            –           … ± …        … ± …       … ± …        … ± …
SANS (… D2O) Background, … cm^-1                –            –           … ± …        … ± …       … ± …        … ± …
SANS (… D2O) Background, … cm^-1                –            –           … ± 20       25 ± …      … ± 16       25 ± …

Table SI.2: Parameters refined during the fitting of a phospholipid nanodisc model to the data in Figure 5, before and after rescaling the errors using BIFT. The model is described in the literature [27, 28]. The nanodisc model was refined from, respectively, SAXS data alone, SAXS and SANS data together, or SANS data alone (including SANS samples with … D2O and … D2O in the buffer).
SI.5 Details on the probability of $\chi^2_{r,\mathrm{BIFT}}$
We use the probability of a given value of $\chi^2_{r,\mathrm{BIFT}}$ to assess whether the experimental errors in a given dataset are appropriate. This is the probability of obtaining a particular $\chi^2_{r,\mathrm{BIFT}}$ or any more extreme value, which can be computed as:

\begin{equation}
P(\chi^2_r) =
\begin{cases}
2 \int_{0}^{\chi^2_r} \mathrm{d}\bar{\chi}^2_r \, p(\bar{\chi}^2_r) & \text{for } \chi^2_r \leq \widetilde{\chi^2_r} \\
2 \int_{\chi^2_r}^{\infty} \mathrm{d}\bar{\chi}^2_r \, p(\bar{\chi}^2_r) & \text{for } \chi^2_r > \widetilde{\chi^2_r}
\end{cases}
\tag{SI.1}
\end{equation}

where $p(\bar{\chi}^2_r)$ is a probability density that depends on the variable $\bar{\chi}^2_r$, and $\widetilde{\chi^2_r}$ is the median, which can be approximated by:

\begin{equation}
\widetilde{\chi^2_r} \approx \left( 1 - \frac{2}{9 N_\mathrm{DoF}} \right)^3 ,
\tag{SI.2}
\end{equation}

where $N_\mathrm{DoF}$ is the number of degrees of freedom. The median approaches unity for large $N_\mathrm{DoF}$. $P$ is the two-tailed $P$-value for $\chi^2_{r,\mathrm{BIFT}}$ given the null hypothesis that the experimental errors in question are appropriate (Figure SI.4). The probability is unity when $\chi^2_r = \widetilde{\chi^2_r}$, i.e. in most practical cases when $\chi^2_r$ is unity, as all other values are more extreme.
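The two-tailed probability of Eq. SI.1 and the median approximation of Eq. SI.2 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the web interface. It assumes that $p$ is the density of a $\chi^2$-distributed variable with $N_\mathrm{DoF}$ degrees of freedom divided by $N_\mathrm{DoF}$, and it evaluates the cumulative distribution with a standard series and continued-fraction expansion of the regularized incomplete gamma function, using only the standard library. All function names are hypothetical.

```python
import math


def _gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x) for a > 0."""
    if x <= 0.0:
        return 0.0
    log_prefactor = a * math.log(x) - x - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion, which converges quickly for x < a + 1.
        term = 1.0 / a
        total = term
        denom = a
        for _ in range(1000):
            denom += 1.0
            term *= x / denom
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper tail Q(a, x).
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return 1.0 - h * math.exp(log_prefactor)


def chi2_r_cdf(chi2_r, n_dof):
    """CDF of the reduced chi-square, assuming chi2_r * n_dof is chi-square distributed."""
    return _gamma_p(0.5 * n_dof, 0.5 * chi2_r * n_dof)


def median_chi2_r(n_dof):
    """Approximate median of the reduced chi-square (Eq. SI.2)."""
    return (1.0 - 2.0 / (9.0 * n_dof)) ** 3


def two_tailed_p(chi2_r, n_dof):
    """Two-tailed P-value of Eq. SI.1: twice the tail area on the side of chi2_r."""
    cdf = chi2_r_cdf(chi2_r, n_dof)
    if chi2_r <= median_chi2_r(n_dof):
        tail = cdf
    else:
        tail = 1.0 - cdf
    # The median is only approximate, so cap the result at unity.
    return min(1.0, 2.0 * tail)
```

Calling `two_tailed_p(1.2, 40)` and `two_tailed_p(1.8, 40)` gives values that can be compared with those quoted for the two panels of Figure SI.4.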
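[Figure SI.4: two panels showing the probability density of $\chi^2_r$ for $N_\mathrm{DoF} = 40$, with median $\widetilde{\chi^2_r}(N_\mathrm{DoF} = 40) = 0.98$. Left panel: $\chi^2_r = 1.2$, $P = 0.36$. Right panel: $\chi^2_r = 1.8$, $P = 0.0028$. In each panel, Area 1 is the right-side tail and Area 2 (equal to Area 1) is the corresponding left-side tail.]
Figure SI.4: Probability of a given $\chi^2_r$ value, illustrated for two different $\chi^2_r$ values and $N_\mathrm{DoF} = 40$. The probability equals the area under the graph for all $\bar{\chi}^2_r \geq \chi^2_r$ plus the same area from the left-side tail. On the left, $\chi^2_r = 1.2$ gives $P = 0.36$, which is far above our suggested significance level of 0.003, so the errors are probably correct. On the right, $\chi^2_r = 1.8$ and $P = 0.0028$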