zfit: scalable pythonic fitting
Jonas Eschle, Albert Puig Navarro, Rafael Silva Coutinho, Nicola Serra
Physik-Institut, Universität Zürich, Zürich (Switzerland)
Abstract
Statistical modeling is a key element in many scientific fields and especially in High-Energy Physics (HEP) analysis. The standard framework to perform this task in HEP is the C++ ROOT/RooFit toolkit, whose Python bindings are only loosely integrated into the scientific Python ecosystem. In this paper, zfit, a new alternative to RooFit written in pure Python, is presented. Most of all, zfit provides a well-defined high-level API and workflow for advanced model building and fitting, together with an implementation on top of TensorFlow, allowing a transparent usage of CPUs and GPUs. It is designed to be extendable in a very simple fashion, allowing the usage of cutting-edge developments from the scientific Python ecosystem in a transparent way. The main features of zfit are introduced, and its extension to data analysis, especially in the context of HEP experiments, is discussed.
Keywords:
Model fitting, data analysis, statistical inference, Python
1. Introduction
Data collected by experiments such as the Large Hadron Collider (LHC) at CERN or ultra-energetic cosmic ray detectors are analysed in terms of physics-motivated mathematical models. Typically, these models have a parametric form, and the values of their parameters are the quantities of interest in a given analysis. Once a model has been defined, a fitting procedure that minimizes the disagreement between the data and the model is applied to find the optimal values of these parameters, a crucial step in almost every analysis.

Model fitting as done in High-Energy Physics (HEP) has some peculiarities and differs from other fields. Models are often multidimensional, built by composing complicated, custom shapes, and have a non-trivial normalization. In addition, two further requirements need to be fulfilled: on one hand, reasonable scaling with data size and model complexity, motivated by the large data samples being collected at the LHC; and, on the other hand, the capability not only to find optimal values for the parameters, but also to calculate their uncertainties with enough flexibility, taking all correlations into account, to allow for advanced statistical treatment, as uncertainties of measurements are as important as their central values.

In recent years, HEP analysis has been moving more and more towards the usage of Python over the more traditional C++. The moving force behind this paradigm shift is the existence of an extremely rich scientific computing Python ecosystem [1, 2], which provides a collection of efficient, easy-to-use tools for data analysis that are shared across a very large community, including scientists from various fields and industry, especially in data science, and on which one can build new applications.

The de facto standard for model fitting in HEP is RooFit [3], which is integrated in the ROOT toolkit [4] and provides a very powerful object-oriented architecture, as well as plotting capabilities and an advanced statistics module. It is written in C++ and can be accessed in Python through the ROOT Python bindings. These bindings come with their own set of non-negligible problems: complex memory management that easily leads to memory leaks, difficult integration with the scientific computing Python stack, lack of extendability from Python, and a complex installation procedure due to the need to install the full ROOT toolkit.

General-purpose Python-based libraries such as SciPy [5], lmfit [6] or TensorFlow Probability [7] do not fulfill all HEP fitting requirements, and therefore cannot be used as-is.

Email addresses: [email protected] (Jonas Eschle), [email protected] (Albert Puig Navarro), [email protected] (Rafael Silva Coutinho), [email protected] (Nicola Serra)

Preprint submitted to SoftwareX May 21, 2020
Python-based HEP-specific fitting packages, such as probfit [8], TensorProb [9] or TensorFlowAnalysis [10], have tried to address these issues and can be considered as proofs-of-concept that HEP-centric model fitting is possible in Python; none of them, however, offers the same fitting functionalities and ease of use as RooFit does while still integrating well with the scientific computing Python ecosystem.

The goal of zfit is to fill this important gap in the HEP software ecosystem and provide a well-defined API and workflow for model fitting together with a Python-based package that is fast, scalable and extensible.
2. Software description
The structure of a generic model fitting workflow is illustrated in Fig. 1. Its basic goal is to maximize the agreement of a set of data points to a probability distribution described by a set of parameters. The metric of the goodness (or, more precisely, the disagreement) of the predicted model with respect to the experimental data is referred to as the loss function. One of the most used loss functions in HEP is based on the likelihood function, in particular the negative log-likelihood, which is minimized in order to estimate the parameters of the model. Once the optimal values of the model parameters have been determined, several statistical techniques can be used to calculate their uncertainties.

Figure 1: Schematic zfit workflow view of the individual structural blocks.

The structure defined by model - data - loss - minimize can also be used to describe a typical Deep Learning workflow, even more clearly by replacing model by "Neural Network" and minimize by "Training". This similarity has profound implications in the design of the zfit implementation, as it allows the use of an industry-standard Deep Learning tool such as TensorFlow [11] (TF) as its computational backend, through the low-level API that consists of mathematical functions. The immediate benefits for zfit in terms of performance arise from key features of TF such as the capacity to run on heterogeneous architectures, mathematical optimizations such as automatic differentiation (required by the most used minimization algorithms in HEP), and operation caching. This puts the performance of zfit in the same order of magnitude as the existing standard, as shown in Appendix A.

zfit does not intend to be a large, monolithic package and therefore has a clear, restricted scope: model fitting and sampling with a well-defined and consistent API which is mostly independent of the implementation.
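The model - data - loss - minimize structure described above can be sketched with plain NumPy and SciPy. This is an illustrative sketch only, not the zfit API; all numerical values are made up:

```python
import numpy as np
from scipy import optimize, stats

# Sketch of the generic workflow: model (Gaussian), data (toy sample),
# loss (negative log-likelihood), minimize (SciPy optimizer).
rng = np.random.default_rng(42)
data = rng.normal(loc=5.2, scale=0.3, size=10_000)   # "data" step

def nll(params):                                      # "loss" step
    mu, sigma = params
    # "model" step: Gaussian pdf evaluated on the data
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# "minimize" step: bounds keep sigma strictly positive
result = optimize.minimize(nll, x0=[5.0, 0.5],
                           bounds=[(4.5, 6.0), (1e-3, 2.0)])
mu_hat, sigma_hat = result.x
```

The estimates mu_hat and sigma_hat end up close to the generating values (5.2 and 0.3); their uncertainties would come from the curvature of the loss at the minimum.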
Tasks such as data processing, plotting and statistical analysis are left for other packages to tackle.

The abstraction of the fitting process in the steps shown in Fig. 1 is one of the main features of the architecture of zfit: by developing each step as a separate "brick" of the library, it becomes possible to tightly integrate with the scientific computing Python ecosystem, and to use the available tools to improve performance, reduce code duplication and add extra functionality. Each of these bricks (model, data, loss, minimizer and fit result) provides override hooks for all its core functionality, allowing to completely customize or replace any aspect of their behavior. As hinted at above, these objects deal internally with TF. Since using TF in the correct way adds another layer of difficulty, most of these additional complications are hidden from the user, and the library offers a similar user experience as the one found in other model fitting libraries, independent of the actual backend.

Model fitting handles multiple dimensions. There are observables, the dependent variables, over which a model is usually normalized, and parameters, which are adjusted during the minimization of the loss. In contrast to other frameworks such as RooFit, zfit makes a strong distinction between observables and parameters; this removes any ambiguity in terms of normalization and allows zfit to stay closer to Deep Learning and to the optimized way TF works. The observables and corresponding limits in which the model fit is performed are handled by the Space object. Space objects can be single- or multi-dimensional, and, thanks to the use of observables, can be combined using the Python arithmetic operators to tackle arbitrarily complex use cases.

Models are defined inside a given Space and implement a shape function that returns a value given some input data.
Custom models are easily implemented using their functional form and by default receive standard attributes such as numerical integration and sampling. These generic functionalities can also be extended with analytical integration and specific sampling if provided by the user. Models can be combined to build more complex models (including increasing the number of dimensions) while keeping automatic integration, observable handling and event sampling. The normalization is done over a range specified in the obs, or further customized by overriding the _normalization method manually. Models are parametrized by Parameter objects, which can be used in the implementation of the shape function like any other scalar. Independent parameters are the only type of objects that are left free in the fit. Composed parameters can be built by combining parameters through mathematical operators and can be used like independent parameters.

The loading, ordering and processing of data in zfit is handled uniformly, no matter their source, by the Data object. While only the most important formats are currently implemented (ROOT trees, NumPy arrays and Pandas DataFrames, as well as TF tensors), the simple API of the Data object makes it easy to add additional data handling capabilities.

Data and Models are combined to build a Loss function, which can either be pre-defined, such as the negative log-likelihood, or completely arbitrary. The computation of the value of the loss for a given configuration of the model is efficiently implemented in zfit thanks to TF. It is also possible to add constraints to the loss in order to modify its behavior. Constraints are additional terms to the likelihood that often represent some external information on a parameter. While the most used constraints in HEP, such as Gaussian constraints, are directly implemented in zfit, arbitrary custom functions can also be used.

Minimization is performed through stateless Minimizer objects. Since the loss calculation is usually the most intensive part of the minimization problem, in practice it is possible to wrap any existing Python minimizer to be used by zfit. Several minimizers are included in zfit, such as the SciPy optimizers, the most-used minimizer in HEP, called Minuit [12], or the TF implementation of the Adam optimizer [13].

The results of the minimization process are returned in a FitResult object, which stores all the information about the fit parameters and the minimization process, as well as the used instances of the Minimizer and the Loss. This includes the function minimum and the covariance matrix. While approximate uncertainties of the fit parameters can be calculated directly inside zfit using the Hessian matrix or a simple profiling method, the calculation of precise parameter uncertainties and confidence intervals is left to specialized packages, such as those in Refs. [14, 15], which do not need any object other than the FitResult.
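The effect of a Gaussian constraint on a loss can be sketched with plain NumPy/SciPy (again illustrative only, not the zfit API, with made-up numbers): a hypothetical external measurement mu_ext ± sigma_ext adds the term ((mu − mu_ext)/sigma_ext)²/2 to the negative log-likelihood, pulling the fitted value towards it:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
data = rng.normal(5.2, 0.3, size=1_000)
mu_ext, sigma_ext = 5.0, 0.01   # hypothetical external measurement

def nll(params, constrained):
    mu, sigma = params
    loss = -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
    if constrained:
        # Gaussian constraint: extra -log of a normal term in the likelihood
        loss += 0.5 * ((mu - mu_ext) / sigma_ext) ** 2
    return loss

bounds = [(4.0, 6.0), (1e-3, 2.0)]
free = optimize.minimize(nll, x0=[5.0, 0.5], args=(False,), bounds=bounds)
constrained = optimize.minimize(nll, x0=[5.0, 0.5], args=(True,), bounds=bounds)
```

The constrained minimum lands between the data-only estimate and mu_ext, with weights given by the respective uncertainties.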
3. Illustrative examples
The architecture and features of zfit are better visualized with a typical HEP example: a fit to the invariant mass of multiple particles including both signal and background events, extended to multiple dimensions, and introducing a control channel.
Let’s assume the domain of interest of the mass observable, in this case the B meson mass, is the range [4.5, 6.0] GeV. In zfit this is expressed with a Space defining our domain:

    obs = zfit.Space(obs="mass", limits=(4.5, 6.0))
The best interpretation of the observable at this stage is that it defines the name and range of the observable axis.

Data can be loaded, for example, from a NumPy array, where the obs of the Space are matched by order with the columns of the array:

    data = zfit.Data.from_numpy(array=numpy_array_data, obs=obs)

with numpy_array_data being a NumPy array holding our data. The signal component of the fit will be modelled using a Gaussian (normal) distribution, which has two free parameters, its mean and its width:

    mu = zfit.Parameter("mu", 5.2, 4.5, 6.0)
    sigma = zfit.Parameter("sigma", 20, 1, 40)

where the arguments for the parameters are: a unique name, an initial value, and a lower and an upper bound. The bounds are not mandatory; the parameter will be floating in both cases. These can be used to instantiate the Model as:

    signal = zfit.pdf.Gauss(obs=obs, mu=mu, sigma=sigma)

While a selection of common shapes is provided in the pdf module for convenience, the philosophy of zfit is to provide a clear API with powerful mechanisms to facilitate that the users and the community can implement their own. Making use of another built-in model, the background is parametrized as an exponential:

    bkg = zfit.pdf.Exponential(-0.001, obs=obs)
In order to build the sum of these two models, an additional parameter frac is used to describe the fraction of each species:

    frac_sig = zfit.Parameter("signal fraction", 0.5, 0, 1)
    model = zfit.pdf.SumPDF(pdfs=[signal, bkg], fracs=frac_sig)

With the model and the data built, a loss function can be constructed. In this case, the goal is to perform an unbinned maximum likelihood fit, so the corresponding Loss object is initialized as

    nll = zfit.loss.UnbinnedNLL(model=model, data=data)
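What a sum with a single fraction computes can be checked numerically: the density frac * signal(x) + (1 - frac) * background(x) stays normalized over the fit range when both components are. A plain NumPy sketch (illustrative shapes with a hypothetical signal width, not zfit objects):

```python
import numpy as np
from scipy import stats

lo, hi = 4.5, 6.0                      # fit range, as in the Space above
x = np.linspace(lo, hi, 100_001)
dx = x[1] - x[0]

def signal_pdf(x):
    # Gaussian renormalized to the fit range (width 0.3 is made up)
    g = stats.norm(loc=5.2, scale=0.3)
    return g.pdf(x) / (g.cdf(hi) - g.cdf(lo))

def background_pdf(x, lam=-0.001):
    # exponential exp(lam * x), normalized on [lo, hi]
    norm = (np.exp(lam * hi) - np.exp(lam * lo)) / lam
    return np.exp(lam * x) / norm

frac = 0.5
total = frac * signal_pdf(x) + (1 - frac) * background_pdf(x)
# trapezoidal integral of the summed density over the fit range
integral = np.sum((total[:-1] + total[1:]) / 2) * dx
```

The integral evaluates to 1 up to numerical precision, independently of the value of frac.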
Finally, the minimization process is executed, first instantiating a Minimizer object, in this case using the Minuit minimizer, and then running the minimization algorithm:

    minimizer = zfit.minimize.MinuitMinimizer()
    result = minimizer.minimize(nll)
The outcome of this minimization is stored in a FitResult object, which provides access to all details of the convergence process and to the best values estimated for the free parameters, together with a nice printout of the parameters and uncertainties. For example, one could obtain the fitted value of the signal mean as

    if result.converged:
        mu_value = result.params[mu]["value"]

Notice that mu, the parameter object, and not "mu", its name, is used as the key. The params object stores additional information, such as estimated uncertainties if they have been calculated. Following a similar approach to the reference implementation of RooFit, the fraction of the last species in a SumPDF corresponds to 1 − Σ frac_i.

3.2. Advanced usage

The next step is to extend the model to take into account extra dimensions of the problem, in this case the angular distributions of the decay products of the B meson. To do so, the signal distribution is modelled with a custom model, which is created by inheriting from one of the base classes provided by zfit:

    from zfit import z

    class AngularSignalPDF(zfit.pdf.ZPDF):
        _PARAMS = ["FH", "AFB"]
        _N_OBS = 1

        def _unnormalized_pdf(self, x):
            costhetal = z.unstack_x(x)
            FH = self.params["FH"]
            AFB = self.params["AFB"]
            pdf = (3 / 4 * (1 - FH) * (1 - z.square(costhetal))
                   + 0.5 * FH + AFB * costhetal)
            return pdf
Instead of BasePDF, we use here an even simpler class, ZPDF, which only requires specifying the naming of the parametrization via the _PARAMS attribute and allows, optionally, defining the dimensionality through _N_OBS. The latter also allows creating n-dimensional Models, as x will be a list of length _N_OBS containing the data. Internally, the class handles the management of the free parameters, which can be accessed by their parametrization name through the params attribute. It also provides a simple way to access the input data with z.unstack_x. The module z provides some TF functions wrapped for better handling of dtypes, as well as additional methods, and can be mixed with pure TF functions.

    fh = zfit.Parameter("FH", 1.)
    afb = zfit.Parameter("AFB", 0.5)
    obs_costhetal = zfit.Space("thetal", limits=(-1, 1))
    angular = AngularSignalPDF(obs=obs_costhetal, FH=fh, AFB=afb)

To create a two-dimensional model, one way is to multiply the models describing the different dimensions. Given that their observables are different, the new model will automatically be defined in the combined space of the two:

    combined_model = angular * model

By adding a control channel that shares parameters with the signal model, an improved determination of mu can be obtained. To do so, a second Gaussian is created to model this control mode:

    obs_control = zfit.Space("control", limits=(5.2, 5.6))
    sigma_control = zfit.Parameter("sigma control channel", 2, 0, 5)
    gauss_control = zfit.pdf.Gauss(obs=obs_control, mu=mu, sigma=sigma_control)

A simultaneous Loss can be built either by adding two independent Loss functions or by providing both models and datasets in a list as

    nll = zfit.loss.UnbinnedNLL([combined_model, gauss_control],
                                [data, data_control])

where data_control has been created analogously to data. This simultaneous Loss can be minimized as shown before.
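Incidentally, the custom angular shape above is already normalized on cos θ_l ∈ [−1, 1] for any FH and AFB, since ∫ 3/4 (1−FH)(1−c²) dc = 1−FH, ∫ FH/2 dc = FH and ∫ AFB·c dc = 0. A quick NumPy check of this (pure Python, independent of zfit):

```python
import numpy as np

def angular_pdf(c, FH, AFB):
    # same functional form as _unnormalized_pdf above
    return 3 / 4 * (1 - FH) * (1 - c**2) + 0.5 * FH + AFB * c

c = np.linspace(-1.0, 1.0, 200_001)
dc = c[1] - c[0]
for FH, AFB in [(1.0, 0.5), (0.2, -0.1), (0.0, 0.0)]:
    vals = angular_pdf(c, FH, AFB)
    integral = np.sum((vals[:-1] + vals[1:]) / 2) * dc  # trapezoid rule
    assert abs(integral - 1.0) < 1e-9
```

The automatic numerical normalization that zfit applies by default therefore leaves this particular shape essentially unchanged.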
4. Impact and conclusions
The zfit package is a crucial piece required to complete the transition of HEP analysis to a stack fully based on the scientific computing Python ecosystem. The importance of this paradigm change should not be underestimated: given the limited availability of resources in HEP, especially in areas such as software development, the possibility of integrating physics analysis within a well-developed and well-supported software ecosystem, allowing to focus mainly on the HEP-specific parts of the analysis software, will have long-term repercussions on the quality and usability of HEP analysis software.

The package itself is not only an implementation of a versatile and powerful model fitting library, but also specifies a carefully designed and well-discussed API for the most crucial components and the workflow. With the aim of becoming a standard of its kind, other packages, such as high-level statistics and plotting tools, can be built against the standardized API instead of a concrete implementation, strongly reducing package dependencies. In addition, other packages can focus on the specific implementation of a certain task, such as building a specialized Loss, Model or Minimizer, without the need to also implement the other parts of the workflow themselves.

Finally, fits in HEP analysis often involve large datasets and complicated models, and require far more computing power than a single core. Scalability in terms of size and complexity of the data and models, and the possibility to run on large computing infrastructures, including accelerators such as GPUs, is therefore a necessity. TensorFlow, the graph-based computing backend of zfit, was built for the sole purpose of efficient large-scale computing and takes care of the non-trivial task of optimal workload parallelization.
This not only yields state-of-the-art performance, but also enables the HEP community to keep up with the latest algorithm implementations and low-level computing instructions at virtually no cost.

In summary, a new high-level statistical modeling and fitting library purely designed within, and for, the Python ecosystem is presented. Future plans for zfit include full support for binned models, further extension of baseline HEP features, as well as standardized, human-readable serialization of objects.

Acknowledgements
We are grateful to Anton Poluektov, Chris Burr and Igor Babuschkin for demonstrating the potential of unbinned model fitting within the context of TensorFlow, which inspired this work. We also thank the Zurich LHCb group, Matthieu Marinangeli, Josh Bendavid, Lukas Heinrich and the HSF community, especially the Scikit-HEP project members, for useful discussions. A. Puig, R. Silva Coutinho, J. Eschle and N. Serra gratefully acknowledge the support of the Swiss National Science Foundation (SNF) under contracts 168169, 174182 and 182622.

Appendices

Appendix A. Performance
A series of studies is performed to evaluate the execution time of zfit in comparison to the conventional RooFit library. The performance depends on two aspects: the complexity of the fit (given by the number of free parameters) and the data sample size. In this study, the sum of 9 Gaussian functions with a common mean and width parameter, but with different central values, is used as the baseline model.

Ensembles of Monte Carlo pseudoexperiments are generated and fitted with different setup configurations:

i. Nominal zfit implementation on CPU, using an analytic gradient provided by TF. In this case, an initial run is done to remove the graph compilation time. While not significant, this provides a more realistic estimation.

ii. Alternatively, the analytic gradient provided by TF is disabled, denoted by the addition "nograd". Instead, the Minuit minimizer calculates a numerical approximation of the gradient internally.

iii. The default CPU zfit setting is replaced by the GPU implementation.

iv. A standard RooFit implementation using the Python bindings with PyROOT (version 6.16.00). The parallelization is enabled when invoking the fitTo method and equals the number of cores available.

For each setup, twenty toys are generated with sample sizes from 128 to 8 million events (except for the GPU, where it goes only up to 4 million) and the average fitting time is used as reference point. Note that for simplicity the Minuit algorithm is used in all cases.

Figure A.2 compares the performance of the different fit scenarios, which can be summarized as follows:

- As the number of events increases, the execution time of RooFit monotonically increases.

- There is no advantage in using parallelization for very few calculations, since the overhead of splitting and collecting the results is dominant. With an increasing number of events this becomes negligible, and thus the execution time of zfit increases much more slowly than that of RooFit.
This also comes from the fact that more events mean a more stable loss shape, so the minimizer used in zfit performs better. The fitting performance is examined using a 12-core Intel i7 8850H at 2.60 GHz with 32 GB RAM, and a Nvidia P1000 with 4 GB RAM. The GPU used in these tests has a relatively small memory; performing larger-than-memory computations and multi-GPU support are still work in progress. It is worth mentioning that the RooFit parallelization strategy is rather simple and has very little overhead; several improvements in this direction are ongoing. This, however, hints at an important point of using TF in zfit, namely that this kind of work is factored out and superior performance can come for free.

Figure A.2: Averaged time per fit for a sum of Gaussians with shared mean and width as a function of the number of generated events.

- The GPU is highly efficient in computing thousands of events in parallel. For only a few data points, though, the overhead of moving data back and forth dominates, resulting in a worse performance in this regime.

As a conclusion, the speed on a multicore CPU system is comparable between zfit and RooFit. For a large number of events, zfit typically outperforms RooFit, with a similar response for both GPU and a multicore system.

Required Metadata

Current code version

C1 Current code version: 0.3.6
C2 Permanent link to code/repository used for this code version: https://github.com/zfit/zfit
C3 Legal Code License: BSD-3
C4 Code versioning system used: Git
C5 Software code languages, tools, and services used: Python, TensorFlow
C6 Compilation requirements, operating environments & dependencies: Linux and MacOS supported (Windows functionality unknown); Python 3.6+; dependencies with version bounds as specified in requirements.txt: tensorflow, tensorflow-probability, scipy, uproot, pandas, numpy, iminuit, typing, colorlog, texttable, ordered-set
C7 If available, link to developer documentation/manual: https://zfit.readthedocs.io/en/0.3.6/
C8 Support email for questions: [email protected]

Table A.1: Code metadata.

Current executable software version

S1 Current software version: 0.3.6
S2 Permanent link to executables of this version: https://github.com/zfit/zfit/releases/tag/0.3.6
- Dependencies (as specified in requirements.txt): tensorflow, tensorflow-probability, scipy, uproot, pandas, numpy, iminuit, typing, colorlog, texttable, ordered-set
S6 If available, link to user manual (if formally published, include a reference to the publication in the reference list): https://zfit.readthedocs.io/en/0.3.6/
S7 Support email for questions: [email protected]
Table A.2: Software metadata.

References

[1] T. E. Oliphant, Python for Scientific Computing, Comput. Sci. Eng. 9 (3) (2007) 10–20. doi:10.1109/MCSE.2007.58.

[2] K. J. Millman, M. Aivazis, Python for Scientists and Engineers, Comput. Sci. Eng. 13 (2) (2011) 9–12. doi:10.1109/MCSE.2011.36.

[3] W. Verkerke, D. P. Kirkby, The RooFit toolkit for data modeling, eConf C0303241 (2003) MOLT007. arXiv:physics/0306116.

[4] R. Brun, F. Rademakers, ROOT: An object oriented data analysis framework, Nucl. Instrum. Meth. A389 (1997) 81–86. doi:10.1016/S0168-9002(97)00048-X.

[5] E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, [Online; accessed 11/07/2019] (2001–).

[6] M. Newville, R. Otten, A. Nelson, A. Ingargiola, T. Stensitzki, D. Allan, A. Fox, F. Carter, Michał, D. Pustakhod, Y. Ram, Glenn, C. Deil, Stuermer, A. Beelen, O. Frost, N. Zobrist, G. Pasquevich, A. L. R. Hansen, T. Spillane, S. Caldwell, A. Polloreno, andrewhannum, J. Zimmermann, J. Borreguero, J. Fraine, deep42-thought, B. F. Maier, B. Gamari, A. Almarza, lmfit/lmfit-py (Dec. 2019). doi:10.5281/zenodo.598352.

[7] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, et al., TensorFlow Distributions (Nov. 2017). arXiv:1711.10604.

[8] P. Ongmongkolkul, C. Deil, C. hsiang Cheng, A. Pearce, E. Rodrigues, H. Schreiner, M. Marinangeli, L. Geiger, H. Dembinski, scikit-hep/probfit (Nov. 2018). doi:10.5281/zenodo.1477852.

[9] I. Babuschkin, C. Burr, S. Neubert, K. Brügge, TensorProb (Apr. 2019). URL https://tensorprob.github.io/tensorprob/

[10] A. Poluektov, A. Mauri, A. Merli, A. Mathad, M. Martinelli, A. Morris, Using TensorFlow for amplitude fits - the TensorFlowAnalysis package (Jul. 2018). doi:10.5281/zenodo.1415412.

[11] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015).

[12] F. James, M. Roos, Minuit: A System for Function Minimization and Analysis of the Parameter Errors and Correlations, Comput. Phys. Commun. 10 (1975) 343–367. doi:10.1016/0010-4655(75)90039-9.

[13] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: 3rd International Conference on Learning Representations, San Diego, 2015. arXiv:1412.6980.

[14] M. Marinangeli, lauztat (Jun. 2019). doi:10.5281/zenodo.3249383.

[15] M. Marinangeli, B. Pollack, E. Rodrigues, J. Eschle, scikit-hep/hepstats (Nov. 2019). doi:10.5281/zenodo.3519200.