A Faster, More Intuitive RooFit
Stephan Hageböck∗, CERN, Esplanade des Particules 1, 1211 Geneva 23, Switzerland
Abstract. RooFit and RooStats, the toolkits for statistical modelling in ROOT, are used in most searches and measurements at the Large Hadron Collider as well as at B factories. Larger datasets to be collected at e.g. the High-Luminosity LHC will enable measurements with higher precision, but will require faster data processing to keep fitting times stable. In this work, a simplification of RooFit's interfaces and a redesign of its internal dataflow is presented. Interfaces are being extended to look and feel more STL-like, making them more accessible from both C++ and Python and improving interoperability and ease of use, while maintaining compatibility with old code. The redesign of the dataflow improves cache locality and data loading, and can be used to process batches of data with vectorised SIMD computations. This reduces the time for computing unbinned likelihoods by a factor of four to 16, making it possible to fit the larger datasets of the future in the same time as, or faster than, today's fits.
RooFit [1] is a C++ package for statistical modelling distributed with ROOT [2]. RooFit allows users to define computation graphs that connect observables, parameters, functions and PDFs into likelihood models, which can be fitted to data and used for statistical tests. RooFit is shipped with RooStats, a toolkit for performing statistical tests with RooFit models. It provides tools such as Toy Monte Carlo tests, setting limits and computing significances. Further, HistFactory provides tools to create RooFit models from a collection of ROOT histograms.
RooFit was originally developed for the BaBar collaboration, and is now used for statistical inference across many experiments in High-Energy Physics, e.g. at B factories and at the Large Hadron Collider. It is crucial for the final steps of most analyses. It was designed for single-core processors, and was optimised neither for large caches nor for SIMD computations. This work stands at the beginning of efforts to modernise RooFit to speed up fits, and to make it easier to use from both C++ and Python. This will enable researchers to analyse larger datasets and devise more elaborate statistical models.

RooFit's Interfaces

In RooFit, any collection of mathematical entities such as parameters, observables, functions or
PDFs are saved or passed to functions using the classes RooArgSet and RooArgList, sets and lists of RooFit objects. The most common operation is iterating through these collections, both during fitting and when e.g. inspecting the values of parameters on the user side. This favours array-like data structures such as std::vector, but internally, RooFit's collections were using a linked list with optional hash lookup. Iterating through a RooFit collection used to require the following C++ code:

  TIterator* it = pdf.getParameters(obs)->createIterator();
  RooAbsArg* p;
  while ((p = (RooAbsArg*)it->Next())) {
    p->Print();
  }
  delete it;

To speed up iterating and to provide an STL-like interface for RooFit's collections, the linked list was replaced by a std::vector. Functions such as begin(), end(), size() and operator[] were implemented to allow for STL-like handling of the collections. Iterating through the parameters of a PDF as above can therefore be achieved as follows:

  for (auto p : *pdf.getParameters(obs))
    p->Print();

The STL-like interface reduces heap allocations, and replaces while (p = Next()) loops by range-based for loops. This reduces code clutter, memory leaks, dangling pointers and variable shadowing, and speeds up iterating through collections by 20% to 25%. Random access, which was slow with large linked lists, now completes in constant time. Depending on how often collections are iterated, typical workflows in
RooFit are sped up from 5% to 21% [3]. Fits with a binned ATLAS likelihood model [4] completed 19% faster while yielding identical results.

Modernising the C++ interfaces is also beneficial for using RooFit from Python. Since ROOT has a C++ interpreter, it can dynamically generate Python bindings for C++ objects [5]. Iterating through a RooFit collection from Python therefore used to require code analogous to the C++ code in the first listing. However, with an STL-like interface, Python iterators are generated automatically. The equivalent Python loop is simply:

  for p in pdf.getParameters(obs):
      p.Print()

RooFit's category classes were modernised in a similar way. A RooCategory, which maps state names to state indices, can now be defined and iterated with STL-like code:

  RooCategory cat("cat", "Lep. mult.");
  cat.defineTypes({"0Lep", "1Lep", "2Lep", "3Lep"},
                  {      0,      1,      2,      3 });
  for (const auto& name_idx : cat) {
    std::cout << name_idx.first << ", " << name_idx.second << std::endl;
  }

Previously, the equivalent required one call per state and a manual iterator:

  RooCategory cat("cat", "Lep. mult.");
  cat.defineType("0Lep", 0);
  cat.defineType("1Lep", 1);
  cat.defineType("2Lep", 2);
  cat.defineType("3Lep", 3);

  TIterator* typeIt = cat.typeIterator();
  RooCatType* catType;
  while ((catType = dynamic_cast<RooCatType*>(typeIt->Next()))) {
    std::cout << catType->GetName() << ", " << catType->getVal() << std::endl;
  }
  delete typeIt;

Since the widely used legacy interfaces of
RooAbsCollection and RooAbsCategory (and derived classes) are in use, they cannot be deprecated without forcing users to update existing code. Therefore, the new interfaces are provided as an addition, while the old interfaces remain supported. Updating to the new interfaces leads to faster, shorter and more type-safe code, but it is not required.

To enable this kind of backward compatibility, legacy iterators and functions were implemented, which mimic the functionality of the original objects. Occasionally, this requires extra virtual calls or heap allocations, leading to slowdowns of 1% to 3%, whereas the new interfaces allow for a 20% speed up. The most critical iterators in RooFit have been modernised, and more iterators will be replaced as the modernisation of RooFit continues. To detect uses of inefficient interfaces in user code, users can define a dedicated preprocessor macro before including ROOT headers, which will trigger deprecation warnings with the clang++, g++ and MSVC compilers. Further, old interfaces will be documented in special sections of RooFit's reference guide [6] to aid users in modernising existing code.

More updates of the interfaces are planned, such as easier importing of data from e.g. ROOT's RDataFrame, STL containers or numpy arrays. The release of the new PyROOT [5] will further enable designing more pythonic interfaces that don't have to closely imitate the C++ syntax.

Built-in PDFs

Figure 1. The Johnson distribution.
In modernising RooFit, built-in implementations of distributions such as the Hypatia2 [7] and the Johnson distribution (fig. 1) were added. Although any C++ function can be used as a PDF in RooFit (see section 4), providing optimised implementations for these frequently used PDFs is beneficial. Apart from providing a citeable default, built-in PDFs can be extended with extra checks, and optimised computations can be applied. The advantages of built-in PDFs are:
Analytic Integrals. In order to normalise functions, these must be integrated over the definition range of their observables. RooFit integrates all functions numerically, unless a function for analytic integration is provided by overriding the corresponding virtual function. This allows for faster and more accurate normalisation.
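The difference between the two normalisation strategies can be illustrated with a small standalone sketch (plain C++, not RooFit code; all function names here are illustrative). For an exponential shape exp(−x/τ) on [0, L], the closed-form integral τ·(1 − exp(−L/τ)) replaces hundreds of numeric evaluations:

```cpp
#include <cmath>

// Unnormalised exponential shape, as a PDF's evaluate() would compute it.
inline double expoShape(double x, double tau) { return std::exp(-x / tau); }

// Analytic integral over [0, L]: one exponential call, exact result.
inline double expoIntegralAnalytic(double L, double tau) {
  return tau * (1.0 - std::exp(-L / tau));
}

// Numeric fallback (trapezoidal rule): nSteps evaluations, approximate result.
inline double expoIntegralNumeric(double L, double tau, int nSteps = 1000) {
  const double h = L / nSteps;
  double sum = 0.5 * (expoShape(0., tau) + expoShape(L, tau));
  for (int i = 1; i < nSteps; ++i)
    sum += expoShape(i * h, tau);
  return sum * h;
}
```

The analytic form is both cheaper and free of discretisation error, which is why built-in PDFs that override the integration hook normalise faster and more accurately.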
Generating Samples. RooFit generates data samples using the accept/reject method, unless a generator function is implemented. For many distributions, more efficient sampling strategies are known, which can only be employed for built-in PDFs.

Checking Parameters.
Just-in-time compiled PDFs as in section 4 will accept parameters and observables with an arbitrary definition range. When a function is evaluated outside of its definition range, computations might yield negative probabilities, infinity or NaN. This can slow down or entirely prohibit fitting distributions to data, since the minimiser has no means of determining whether a parameter is in the allowed range. Starting with recent ROOT releases, the parameters passed to built-in PDFs can be checked by the PDF implementation.
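As a standalone illustration of why such checks matter (plain C++ sketch, not the actual RooFit implementation; function names are illustrative): evaluating a Gaussian with a non-positive width silently yields a negative "probability", which a parameter check can intercept before it misleads the minimiser.

```cpp
#include <cmath>

constexpr double kSqrt2Pi = 2.5066282746310002;  // sqrt(2*pi)

// Gaussian density; sigma <= 0 is outside the definition range.
inline double gaussProb(double x, double mean, double sigma) {
  const double z = (x - mean) / sigma;
  return std::exp(-0.5 * z * z) / (sigma * kSqrt2Pi);
}

// Sketch of a parameter check: reject an invalid width instead of returning
// a negative probability. A real implementation would also report the error.
inline double gaussProbChecked(double x, double mean, double sigma) {
  if (!(sigma > 0.))
    return 0.;
  return gaussProb(x, mean, sigma);
}
```

With sigma = −1, the unchecked version returns a negative value, exactly the kind of result that can derail a likelihood minimisation.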
SIMD Computations. Starting with recent ROOT releases, built-in PDFs support batched SIMD computations, discussed in section 5. For the built-in Johnson distribution, for example, previously unstable fits were found to converge six times faster than with the commonly used interpreted Johnson formula in LHCb.
Just-in-time Compiled PDFs

Although implementing a PDF as a RooFit class is the fastest and most accurate way of building likelihood models, RooFit has long supported interpreted PDFs through the class RooGenericPdf. It takes strings of function expressions and interprets them using an instance of ROOT's TFormula. With the arrival of cling, ROOT's just-in-time C++ compiler and interpreter, TFormula was updated to use just-in-time compiled expressions. The updated TFormula was subsequently integrated in RooFit, allowing for faster computations (compiled with optimisations instead of interpreted). This also enables using previously defined functions or functions loaded from a library, as in the following example:

  // In a library or in included code
  double func(double x, double a) { return a*x*x + 1.; }

  [...]

  // When building the fit model
  RooRealVar x("x", "Observable", 2.);  // Define observable
  RooRealVar a("a", "Parameter", 3.);   // Define parameter
  RooGenericPdf pdf("pdf", "func(x, a)", {x, a});  // Evaluate func and normalise
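Conceptually, what happens to such an expression can be sketched in plain C++ (illustrative names, numeric normalisation only, not RooFit internals): the jitted function is evaluated for the current parameter values and divided by its integral over the observable's range.

```cpp
#include <cmath>
#include <functional>

// The user-supplied function from the example above.
inline double func(double x, double a) { return a * x * x + 1.; }

// Sketch of RooGenericPdf's job: evaluate the expression and normalise it
// numerically (trapezoidal rule) over the observable range [lo, hi].
inline double normalisedPdf(const std::function<double(double)>& f,
                            double x, double lo, double hi,
                            int nSteps = 1000) {
  const double h = (hi - lo) / nSteps;
  double integral = 0.5 * (f(lo) + f(hi));
  for (int i = 1; i < nSteps; ++i)
    integral += f(lo + i * h);
  integral *= h;
  return f(x) / integral;
}
```

For a = 3 on x ∈ [0, 5], the integral of 3x² + 1 is 130, so the normalised density at x = 2 is 13/130 = 0.1.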
Computations in RooFit

When RooFit fits PDFs to data, an expression is evaluated for each entry in the dataset to compute the likelihood of observing an event. RooFit achieves this by writing values from the rows of a dataset into the leaves of a computation graph; the probability of the top node is then evaluated by calling the evaluate and normalisation functions of the daughter nodes. Each node caches its last value, so constant branches of the computation graph (e.g. branches that only depend on parameters) are only computed once. This cycle, however, repeats for every entry in the dataset, so all branches that depend on observables have to be recomputed every time. The total number of function calls is therefore proportional to the number of entries in the dataset and to the number of (non-constant) nodes in the graph, N_Data · N_Nodes. For a small mathematical expression with 10 nodes and one million events, this already amounts to considerably more than 10 million function calls, because additional calls for the normalisation of PDFs and for invalidating the node-local caches are necessary. Furthermore, loading only single values into the nodes of the computation graph is hostile to CPU caches: by the time a node is revisited to load the next entry, its data may have been evicted from the cache(s). This means that RooFit runs inefficiently on modern CPUs because of poor data locality and inefficient memory access patterns.
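The N_Data · N_Nodes scaling and its batched alternative can be sketched with a toy node (plain C++ with illustrative class and method names, not RooFit's actual interfaces): the classic mode pays one function call per dataset entry, while the batch mode processes a whole contiguous array in a single call.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy computation-graph node evaluating an unnormalised Gaussian.
struct GaussNode {
  double mean = 0., sigma = 1.;
  std::size_t nCalls = 0;  // counts how often the node is entered

  // Classic single-value interface: called once per dataset entry.
  double evaluate(double x) {
    ++nCalls;
    const double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z);
  }

  // Batch interface: one call fills a contiguous output array, which keeps
  // data in cache and gives the compiler a loop it can vectorise.
  void evaluateBatch(const std::vector<double>& in, std::vector<double>& out) {
    ++nCalls;
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
      const double z = (in[i] - mean) / sigma;
      out[i] = std::exp(-0.5 * z * z);
    }
  }
};
```

For a dataset of N entries and a graph of M such nodes, the classic mode enters nodes N·M times, while the batch mode enters them only M times.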
RooFit runs ine ffi cient on modernCPUs because of poor data locality and ine ffi cient memory access patterns. To improve the data locality and reduce the number of function calls, a
RooFit -internal inter-face for batched likelihood computations was implemented. Data are passed between nodesusing a std::span . Inputs for computations are directly read from arrays, while outputs arewritten to contiguous memory that is owned by the node that is running a computation.Instead of computing only one probability per node, all probabilities for all entries inthe dataset can be computed in a few function calls. Since such computations operate oncontiguous floating point numbers, data locality, caching and data prefetching improve. This RooFit also supports a multi-process mode, in which case smaller batches of contiguous data are processed,which are divided among multiple workers. clang9 i7-7820X AVX512clang8 i7-4790 AVX2clang9 i7-7820X AVX2gcc9 i7-7820X AVX2gcc9 i7-7820X AVX512clang9 i7-7820X AVX512clang8 i7-4790 AVX2clang9 i7-7820X AVX2gcc9 i7-7820X AVX2gcc9 i7-7820X AVX512
Speed up using vectorisation
Figure 2.
Speed up for computing the likelihoods of datasets of 100 000 to 300 000 events for variouslikelihood models. Using
ROOT
RooFit batch interface is timed against the normal single-value computations for di ff erent CPUs, compilers and instruction sets. On an intermediate-level CPUthat supports AVX2 instructions ( ), a speed up of 4x to 9x can be expected. On CPUs supporting
AVX512 instruction sets, the speed up ranges from 4x to 16x. For the “ChiSquarePdf” (top), almost no
SIMD functions were available. The speed up of 3x is the result of batched computations with fasterdata loading. speeds up computations three to four times without loss of precision. Figure 2 shows thecombined speed up of batched computations and additionally vectorisation with
SIMD instructions (section 5.3) against classic single-value RooFit computations. For the topmost entry ("ChiSquarePdf"), almost no SIMD instructions could be used; its speed up is mostly due to faster data loading and fewer function calls.

When PDFs that support the fast interface are used together with legacy PDFs, compatibility is ensured by a generic batch computation function that runs RooFit's classic single-value computations for each entry in the dataset, writes the results into an array, and passes them on to the next node of the graph using a std::span.

SIMD Computations
By converting RooFit's data access patterns to reading from array-like structures, RooFit could be extended with single-instruction-multiple-data (SIMD) computations. Optimisation reports of the clang++-8 compiler were used to design computation functions that allow for automatic vectorisation with AVX2 instructions. This increases throughput, since four double-precision numbers can be processed simultaneously (eight with AVX512). It requires auto-vectorisable, inlinable mathematical functions, and entries in arrays must not depend on other entries. Auto-vectorisable functions are provided by the VDT [9] package, of which the logarithm and the exponential function were used most frequently. VDT approximates standard library functions using Padé polynomials, which can be inlined into other computations and optimised by the compiler.
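The loop structure that such inlinable functions enable can be sketched as follows (plain C++, illustrative function name, not RooFit code): contiguous input and output, no dependencies between elements, loop-invariant terms hoisted out, and only cheap arithmetic left in the loop body, so the compiler's auto-vectoriser can emit AVX2 or AVX512 instructions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Auto-vectorisable batch kernel computing the negative log of a Gaussian
// density: each out[i] depends only on x[i], and the expensive log() calls
// are hoisted out of the loop, leaving only multiply-add in the body.
inline void negLogGauss(const std::vector<double>& x, double mean, double sigma,
                        std::vector<double>& out) {
  const double invSigma = 1.0 / sigma;
  const double logNorm =
      std::log(sigma) + 0.5 * std::log(2.0 * 3.141592653589793);
  out.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double z = (x[i] - mean) * invSigma;
    out[i] = 0.5 * z * z + logNorm;  // no branches, no cross-element dependency
  }
}
```

Had the loop body called a non-inlinable library function instead, the compiler would usually refuse to vectorise it, which is the situation described for builds without VDT.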
ROOT can be compiled without VDT, though, in which case the standard (non-inlinable) implementations are used. This usually prevents automatic vectorisation.

To further aid the compiler optimiser, computations were rewritten such that data dependencies between elements are eliminated, and such that as many intermediate results as possible can be held in CPU registers. Excessive branching in complicated functions was replaced by branch-less computations such as 1·A + 0·B to select A, and its counterpart 0·A + 1·B to select B. Although more CPU cycles are needed because both branches are computed, higher throughput is achieved because SIMD instructions process four or eight entries simultaneously. Finally, calls to non-inlinable functions and complicated reductions were removed where possible.

The result of this work is shown in fig. 2. For various
RooFit PDFs and composite models that use multiple PDFs, the run time of the optimised SIMD computation with VDT functions is compared to classic RooFit single-value computations. Depending on the complexity of the computations and on the level of automatic compiler optimisations, computations speed up 3x to 16x. The theoretical speed up of 12x (3x from data loading, 4x for AVX2) cannot be reached because of Amdahl's law: not all parts of RooFit's code can be replaced with SIMD code, and those parts were therefore not sped up. Moreover, in order to use SIMD instructions, more instructions may have to be issued (e.g. computing two Padé polynomials to approximate the exp function instead of calling the standard implementation).

The level of automatic compiler optimisations has a strong effect on the speed up. Figure 2 demonstrates that clang and gcc achieve similar speed ups with AVX2 instructions, but gcc significantly outperforms clang when ROOT is compiled with AVX512 instructions. It can therefore be expected that as compiler optimisers improve, the speed up for RooFit increases, and that differences between the compilers will shrink.

Batch computations with vectorisation have been released in ROOT. They are enabled with a single flag:

  pdf.fitTo(data, BatchMode(true));  // Evaluate the likelihood using the fast batch mode

Unit tests ensure that the optimised computation functions yield the same results as the classic RooFit functions. When the fast VDT approximations are used, the relative differences of the probabilities, log-likelihoods and fit parameters are, except for corner cases, orders of magnitude smaller than the statistical errors of the fit parameters.

A drawback of the current implementation is that for maximal performance, ROOT has to be compiled targeting the instruction set that the CPU supports. The possibility of shipping a small library with computation functions for a few different architectures (e.g. SSE4 + AVX2) is being investigated.
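The branch-less selection described in section 5.3 (1·A + 0·B to select A, 0·A + 1·B to select B) can be sketched as follows (plain C++; the function and the piecewise shape are made up for the example). Both branches are computed for every entry and blended with a 0/1 mask, so the loop body contains no data-dependent branch and all SIMD lanes can proceed in lockstep:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Branch-less piecewise evaluation: compute both candidate results and
// select between them with a 0/1 mask instead of an if/else.
inline void piecewiseBranchless(const std::vector<double>& x,
                                std::vector<double>& out) {
  out.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double mask = x[i] >= 0. ? 1. : 0.;  // compiles to a SIMD compare mask
    const double a = std::exp(-x[i]);          // branch A, used where x >= 0
    const double b = 1.0 + x[i];               // branch B, used where x <  0
    out[i] = mask * a + (1.0 - mask) * b;      // 1*A + 0*B or 0*A + 1*B
  }
}
```

As noted above, this costs extra cycles (both branches are always evaluated) but wins overall because the loop vectorises cleanly.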
RooFit ’s interfaces are being modernised, and computations are being sped up with fast dataloading and
SIMD computations. The single-thread performance of unbinned fits in
RooFit has been increased by several factors without requiring code changes on the user side. Inconjunction with work on parallelising computations [10], a speed up by more than an orderof magnitude can be expected.The work on modernising
RooFit ’s interfaces and improving its performance will con-tinue, especially for binned fits such as
HistFactory models.

References

[1] W. Verkerke, D.P. Kirkby, eConf C0303241, MOLT007 (2003), physics/0306116
[2] R. Brun, F. Rademakers, Nucl. Instrum. Methods Phys. Res. A389, 81 (1997)
[3] S. Hageboeck, L. Moneta, Making RooFit Ready for Run 3 (2020), unpublished
[4] ATLAS Collaboration, JHEP, 069 (2015)
[5] E. Tejedor Saavedra, S. Wunsch, M. Galli, A new PyROOT: Modern, Interoperable and more Pythonic, EPJ Web Conf. (this volume) (2020)
[6] The ROOT Team, RooFit reference guide