A Faster, More Intuitive RooFit
Stephan Hageböck∗, CERN, Esplanade des Particules 1, 1211 Geneva 23, Switzerland
Abstract. RooFit and RooStats, the toolkits for statistical modelling in ROOT, are used in most searches and measurements at the Large Hadron Collider as well as at B factories. Larger datasets to be collected at e.g. the High-Luminosity LHC will enable measurements with higher precision, but will require faster data processing to keep fitting times stable. In this work, a simplification of RooFit's interfaces and a redesign of its internal dataflow is presented. Interfaces are being extended to look and feel more STL-like, making them more accessible from both C++ and Python and improving interoperability and ease of use, while maintaining compatibility with old code. The redesign of the dataflow improves cache locality and data loading, and can be used to process batches of data with vectorised SIMD computations. This reduces the time for computing unbinned likelihoods by a factor of four to 16, making it possible to fit the larger datasets of the future in the same time as, or faster than, today's fits.
RooFit [1] is a C++ package for statistical modelling distributed with ROOT [2]. RooFit allows users to define computation graphs that connect observables, parameters, functions and PDFs into likelihood models, which can be fitted to data and used for statistical tests. RooFit is shipped with RooStats, a toolkit for performing statistical tests with RooFit models. It provides tools such as Toy Monte Carlo tests, setting limits and computing significances. Further, HistFactory provides tools to create RooFit models from a collection of ROOT histograms.
RooFit was originally developed for the BaBar collaboration, and is now used for statistical inference across many experiments in High-Energy Physics, e.g. at B factories and at the Large Hadron Collider. It is crucial for the final steps of most analyses. It was designed for single-core processors, and was optimised neither for large caches nor for SIMD computations. This work stands at the beginning of efforts to modernise RooFit to speed up fits, and to make it easier to use from both C++ and Python. This will enable researchers to analyse larger datasets and devise more elaborate statistical models.

RooFit's Interfaces

In RooFit, any collection of mathematical entities such as parameters, observables, functions or
PDFs are saved or passed to functions using the classes RooArgSet and RooArgList, sets and lists of RooFit objects. The most common operation is iterating through these collections, both during fitting and when e.g. inspecting the values of parameters on the user side. This favours array-like data structures such as std::vector, but internally, RooFit's collections were using a linked list with optional hash lookup. Iterating through a RooFit collection used to require the following C++ code:

  TIterator* it = pdf.getParameters(obs)->createIterator();
  RooAbsArg* p;
  while ((p = (RooAbsArg*)it->Next())) {
    p->Print();
  }
  delete it;

To speed up iterating and to provide an STL-like interface for RooFit's collections, the linked list was replaced by a std::vector. Functions such as begin(), end(), size() and operator[] were implemented to allow for STL-like handling of the collections. Iterating through the parameters of a PDF as above can therefore be achieved as follows:

  for (auto p : *pdf.getParameters(obs))
    p->Print();

The STL-like interface reduces heap allocations, and replaces while (p = Next()) loops by range-based for loops. This reduces code clutter, memory leaks, dangling pointers and variable shadowing, and speeds up iterating through collections by 20% to 25%. Random access, which was slow with large linked lists, now completes in constant time. Depending on how often collections are iterated, typical workflows in
RooFit are sped up from 5% to 21% [3]. Fits with a binned ATLAS likelihood model [4] completed 19% faster while yielding identical results.

Modernising the C++ interfaces is also beneficial for using RooFit from Python. Since ROOT has a C++ interpreter, it can dynamically generate Python bindings for C++ objects [5]. Iterating through a RooFit collection from Python therefore used to require code analogous to the C++ code in the first listing. However, with an STL-like interface, Python iterators are generated automatically. The equivalent Python loop is simply:

  for p in pdf.getParameters(obs):
      p.Print()

RooFit's category classes were modernised in a similar way. A RooCategory, which maps state names to state indices, can now be defined and iterated with STL-like code:

  RooCategory cat("cat", "Lep. mult.");
  cat.defineTypes({"0Lep", "1Lep", "2Lep", "3Lep"},
                  {      0,      1,      2,      3 });
  for (const auto& name_idx : cat) {
    std::cout << name_idx.first << ", " << name_idx.second << std::endl;
  }

Previously, the equivalent required one call per state and a manual iterator:

  RooCategory cat("cat", "Lep. mult.");
  cat.defineType("0Lep", 0);
  cat.defineType("1Lep", 1);
  cat.defineType("2Lep", 2);
  cat.defineType("3Lep", 3);

  TIterator* typeIt = cat.typeIterator();
  RooCatType* catType;
  while ((catType = dynamic_cast<RooCatType*>(typeIt->Next()))) {
    std::cout << catType->GetName() << ", " << catType->getVal() << std::endl;
  }
  delete typeIt;

Since the widely used legacy interfaces of
RooAbsCollection and RooAbsCategory (and derived classes) are in use, they cannot be deprecated without forcing users to update existing code. Therefore, the new interfaces are provided as an addition, while the old interfaces remain supported. Updating to the new interfaces leads to faster, shorter and more type-safe code, but it is not required.

To enable this kind of backward compatibility, legacy iterators and functions were implemented, which mimic the functionality of the original objects. Occasionally, this requires extra virtual calls or heap allocations, leading to slowdowns of 1% to 3%, whereas the new interfaces allow for a 20% speed up. The most critical iterators in RooFit have been modernised, and more iterators will be replaced as the modernisation of RooFit continues. To detect uses of inefficient interfaces in user code, users can define a dedicated preprocessor macro before including ROOT headers, which will trigger deprecation warnings with the clang++, g++ and MSVC compilers. Further, old interfaces will be documented in special sections of RooFit's reference guide [6] to aid users in modernising existing code.

More updates of the interfaces are planned, such as easier importing of data from e.g. ROOT's RDataFrame, STL containers or numpy arrays. The release of the new PyROOT [5] will further enable designing more pythonic interfaces that don't have to closely imitate the C++ syntax.

Built-in PDFs

Figure 1. The Johnson distribution.
In modernising RooFit, built-in implementations of distributions such as the Hypatia2 [7] and the Johnson distribution (fig. 1) were added. Although any C++ function can be used as a PDF in RooFit (see section 4), providing optimised implementations for these frequently used PDFs is beneficial. Apart from providing a citeable default, built-in PDFs can be extended with extra checks, and optimised computations can be applied. The advantages of built-in PDFs are:
Analytic Integrals. In order to normalise functions, these must be integrated over the definition range of their observables. RooFit integrates all functions numerically, unless a function for analytic integration is provided by overriding the corresponding virtual function. This allows for faster and more accurate normalisation.
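The difference between the two normalisation strategies can be illustrated with a small standalone sketch (plain C++, not RooFit code; all function names here are illustrative). For an exponential shape exp(−x/τ) on [0, L], the closed-form integral τ·(1 − exp(−L/τ)) replaces hundreds of numeric evaluations:

```cpp
#include <cmath>

// Unnormalised exponential shape, as a PDF's evaluate() would compute it.
inline double expoShape(double x, double tau) { return std::exp(-x / tau); }

// Analytic integral over [0, L]: one exponential call, exact result.
inline double expoIntegralAnalytic(double L, double tau) {
  return tau * (1.0 - std::exp(-L / tau));
}

// Numeric fallback (trapezoidal rule): nSteps evaluations, approximate result.
inline double expoIntegralNumeric(double L, double tau, int nSteps = 1000) {
  const double h = L / nSteps;
  double sum = 0.5 * (expoShape(0., tau) + expoShape(L, tau));
  for (int i = 1; i < nSteps; ++i)
    sum += expoShape(i * h, tau);
  return sum * h;
}
```

The analytic form is both cheaper and free of discretisation error, which is why built-in PDFs that override the integration hook normalise faster and more accurately.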
Generating Samples. RooFit generates data samples using the accept/reject method, unless a generator function is implemented. For many distributions, more efficient sampling strategies are known, which can only be employed for built-in PDFs.

Checking Parameters.
Just-in-time compiled PDFs as in section 4 will accept parameters and observables with an arbitrary definition range. When a function is evaluated outside of its definition range, computations might yield negative probabilities, infinity or NaN. This can slow down or entirely prohibit fitting distributions to data, since the minimiser has no means of determining whether a parameter is in the allowed range. Starting with recent ROOT releases, the parameters passed to built-in PDFs can be checked by the PDF implementation.
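As a standalone illustration of why such checks matter (plain C++ sketch, not the actual RooFit implementation; function names are illustrative): evaluating a Gaussian with a non-positive width silently yields a negative "probability", which a parameter check can intercept before it misleads the minimiser.

```cpp
#include <cmath>

constexpr double kSqrt2Pi = 2.5066282746310002;  // sqrt(2*pi)

// Gaussian density; sigma <= 0 is outside the definition range.
inline double gaussProb(double x, double mean, double sigma) {
  const double z = (x - mean) / sigma;
  return std::exp(-0.5 * z * z) / (sigma * kSqrt2Pi);
}

// Sketch of a parameter check: reject an invalid width instead of returning
// a negative probability. A real implementation would also report the error.
inline double gaussProbChecked(double x, double mean, double sigma) {
  if (!(sigma > 0.))
    return 0.;
  return gaussProb(x, mean, sigma);
}
```

With sigma = −1, the unchecked version returns a negative value, exactly the kind of result that can derail a likelihood minimisation.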
SIMD Computations. Starting with recent ROOT releases, built-in PDFs support batched SIMD computations, discussed in section 5. For the built-in Johnson distribution, for example, previously unstable fits were found to converge six times faster than with the commonly used interpreted Johnson formula in LHCb.
Just-in-time Compiled PDFs

Although implementing a PDF as a RooFit class is the fastest and most accurate way of building likelihood models, RooFit has long supported interpreted PDFs through the class RooGenericPdf. It takes strings of function expressions and interprets them using an instance of ROOT's TFormula. With the arrival of cling, ROOT's just-in-time C++ compiler and interpreter, TFormula was updated to use just-in-time compiled expressions. The updated TFormula was subsequently integrated in RooFit, allowing for faster computations (compiled with optimisations instead of interpreted). This also enables using previously defined functions or functions loaded from a library, as in the following example:

  // In a library or in included code
  double func(double x, double a) { return a*x*x + 1.; }

  [...]

  // When building the fit model
  RooRealVar x("x", "Observable", 2.);  // Define observable
  RooRealVar a("a", "Parameter", 3.);   // Define parameter
  RooGenericPdf pdf("pdf", "func(x, a)", {x, a});  // Evaluate func and normalise
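Conceptually, what happens to such an expression can be sketched in plain C++ (illustrative names, numeric normalisation only, not RooFit internals): the jitted function is evaluated for the current parameter values and divided by its integral over the observable's range.

```cpp
#include <cmath>
#include <functional>

// The user-supplied function from the example above.
inline double func(double x, double a) { return a * x * x + 1.; }

// Sketch of RooGenericPdf's job: evaluate the expression and normalise it
// numerically (trapezoidal rule) over the observable range [lo, hi].
inline double normalisedPdf(const std::function<double(double)>& f,
                            double x, double lo, double hi,
                            int nSteps = 1000) {
  const double h = (hi - lo) / nSteps;
  double integral = 0.5 * (f(lo) + f(hi));
  for (int i = 1; i < nSteps; ++i)
    integral += f(lo + i * h);
  integral *= h;
  return f(x) / integral;
}
```

For a = 3 on x ∈ [0, 5], the integral of 3x² + 1 is 130, so the normalised density at x = 2 is 13/130 = 0.1.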
Computations in RooFit

When RooFit fits PDFs to data, an expression is evaluated for each entry in the dataset to compute the likelihood of observing an event. RooFit achieves this by writing values from the rows of a dataset into the leaves of a computation graph; the probability of the top node is then evaluated by calling the evaluate and normalisation functions of the daughter nodes. Each node caches its last value, so constant branches of the computation graph (e.g. branches that only depend on parameters) are only computed once. This cycle, however, repeats for every entry in the dataset, so all branches that depend on observables have to be recomputed every time. The total number of function calls is therefore proportional to the number of entries in the dataset and to the number of (non-constant) nodes in the graph, N_Data · N_Nodes. For a small mathematical expression with 10 nodes and one million events, this already amounts to considerably more than 10 million function calls, because additional calls for the normalisation of PDFs and for invalidating the node-local caches are necessary. Furthermore, loading only single values into the nodes of the computation graph is hostile to CPU caches: by the time a node is revisited to load the next entry, its data may have been evicted from the cache(s). This means that RooFit runs inefficiently on modern CPUs because of poor data locality and inefficient memory access patterns.
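The N_Data · N_Nodes scaling and its batched alternative can be sketched with a toy node (plain C++ with illustrative class and method names, not RooFit's actual interfaces): the classic mode pays one function call per dataset entry, while the batch mode processes a whole contiguous array in a single call.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy computation-graph node evaluating an unnormalised Gaussian.
struct GaussNode {
  double mean = 0., sigma = 1.;
  std::size_t nCalls = 0;  // counts how often the node is entered

  // Classic single-value interface: called once per dataset entry.
  double evaluate(double x) {
    ++nCalls;
    const double z = (x - mean) / sigma;
    return std::exp(-0.5 * z * z);
  }

  // Batch interface: one call fills a contiguous output array, which keeps
  // data in cache and gives the compiler a loop it can vectorise.
  void evaluateBatch(const std::vector<double>& in, std::vector<double>& out) {
    ++nCalls;
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
      const double z = (in[i] - mean) / sigma;
      out[i] = std::exp(-0.5 * z * z);
    }
  }
};
```

For a dataset of N entries and a graph of M such nodes, the classic mode enters nodes N·M times, while the batch mode enters them only M times.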
RooFit runs ine ffi cient on modernCPUs because of poor data locality and ine ffi cient memory access patterns. To improve the data locality and reduce the number of function calls, a
RooFit -internal inter-face for batched likelihood computations was implemented. Data are passed between nodesusing a std::span . Inputs for computations are directly read from arrays, while outputs arewritten to contiguous memory that is owned by the node that is running a computation.Instead of computing only one probability per node, all probabilities for all entries inthe dataset can be computed in a few function calls. Since such computations operate oncontiguous floating point numbers, data locality, caching and data prefetching improve. This RooFit also supports a multi-process mode, in which case smaller batches of contiguous data are processed,which are divided among multiple workers. clang9 i7-7820X AVX512clang8 i7-4790 AVX2clang9 i7-7820X AVX2gcc9 i7-7820X AVX2gcc9 i7-7820X AVX512clang9 i7-7820X AVX512clang8 i7-4790 AVX2clang9 i7-7820X AVX2gcc9 i7-7820X AVX2gcc9 i7-7820X AVX512
Speed up using vectorisation
Figure 2.
Speed up for computing the likelihoods of datasets of 100 000 to 300 000 events for variouslikelihood models. Using
ROOT
RooFit batch interface is timed against the normal single-value computations for di ff erent CPUs, compilers and instruction sets. On an intermediate-level CPUthat supports AVX2 instructions ( ), a speed up of 4x to 9x can be expected. On CPUs supporting
AVX512 instruction sets, the speed up ranges from 4x to 16x. For the “ChiSquarePdf” (top), almost no
SIMD functions were available. The speed up of 3x is the result of batched computations with fasterdata loading. speeds up computations three to four times without loss of precision. Figure 2 shows thecombined speed up of batched computations and additionally vectorisation with
SIMD instructions (section 5.3) against classic single-value RooFit computations. For the topmost entry ("ChiSquarePdf"), almost no SIMD instructions could be used; its speed up is mostly due to faster data loading and fewer function calls.

When PDFs that support the fast interface are used together with legacy PDFs, compatibility is ensured by a generic batch computation function that runs RooFit's classic single-value computations for each entry in the dataset, writes the results into an array, and passes them on to the next node of the graph using a std::span.

SIMD Computations
By converting RooFit's data access patterns to reading from array-like structures, RooFit could be extended with single-instruction-multiple-data (SIMD) computations. Optimisation reports of the clang++-8 compiler were used to design computation functions that allow for automatic vectorisation with AVX2 instructions. This increases throughput, since four double-precision numbers can be processed simultaneously (eight with AVX512). It requires auto-vectorisable, inlinable mathematical functions, and entries in arrays must not depend on other entries. Auto-vectorisable functions are provided by the VDT [9] package, of which the logarithm and the exponential function were used most frequently. VDT approximates standard library functions using Padé polynomials, which can be inlined into other computations and optimised by the compiler.
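The loop structure that such inlinable functions enable can be sketched as follows (plain C++, illustrative function name, not RooFit code): contiguous input and output, no dependencies between elements, loop-invariant terms hoisted out, and only cheap arithmetic left in the loop body, so the compiler's auto-vectoriser can emit AVX2 or AVX512 instructions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Auto-vectorisable batch kernel computing the negative log of a Gaussian
// density: each out[i] depends only on x[i], and the expensive log() calls
// are hoisted out of the loop, leaving only multiply-add in the body.
inline void negLogGauss(const std::vector<double>& x, double mean, double sigma,
                        std::vector<double>& out) {
  const double invSigma = 1.0 / sigma;
  const double logNorm =
      std::log(sigma) + 0.5 * std::log(2.0 * 3.141592653589793);
  out.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double z = (x[i] - mean) * invSigma;
    out[i] = 0.5 * z * z + logNorm;  // no branches, no cross-element dependency
  }
}
```

Had the loop body called a non-inlinable library function instead, the compiler would usually refuse to vectorise it, which is the situation described for builds without VDT.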
ROOT can be compiled without VDT, though, in which case the standard (non-inlinable) implementations are used. This usually prevents automatic vectorisation.

To further aid the compiler optimiser, computations were rewritten such that data dependencies between elements are eliminated, and such that as many intermediate results as possible can be held in CPU registers. Excessive branching in complicated functions was replaced by branch-less computations such as 1·A + 0·B to select A, and its counterpart 0·A + 1·B to select B. Although more CPU cycles are needed because both branches are computed, higher throughput is achieved because SIMD instructions process four or eight entries simultaneously. Finally, calls to non-inlinable functions and complicated reductions were removed where possible.

The result of this work is shown in fig. 2. For various
RooFit PDFs and composite models that use multiple PDFs, the run time of the optimised SIMD computation with VDT functions is compared to classic RooFit single-value computations. Depending on the complexity of the computations and on the level of automatic compiler optimisations, computations speed up 3x to 16x. The theoretical speed up of 12x (3x from data loading, 4x for AVX2) cannot be reached because of Amdahl's law: not all parts of RooFit's code can be replaced with SIMD code, and those parts were therefore not sped up. Moreover, in order to use SIMD instructions, more instructions may have to be issued (e.g. computing two Padé polynomials to approximate the exp function instead of calling the standard implementation).

The level of automatic compiler optimisations has a strong effect on the speed up. Figure 2 demonstrates that clang and gcc achieve similar speed ups with AVX2 instructions, but gcc significantly outperforms clang when ROOT is compiled with AVX512 instructions. It can therefore be expected that as compiler optimisers improve, the speed up for RooFit increases, and that differences between the compilers will shrink.

Batch computations with vectorisation have been released in ROOT. They are enabled with a single flag:

  pdf.fitTo(data, BatchMode(true));  // Evaluate the likelihood using the fast batch mode

Unit tests ensure that the optimised computation functions yield the same results as the classic RooFit functions. When the fast VDT approximations are used, the relative differences of the probabilities, log-likelihoods and fit parameters are, except for corner cases, orders of magnitude smaller than the statistical errors of the fit parameters.

A drawback of the current implementation is that for maximal performance, ROOT has to be compiled targeting the instruction set that the CPU supports. The possibility of shipping a small library with computation functions for a few different architectures (e.g. SSE4 + AVX2) is being investigated.
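The branch-less selection described in section 5.3 (1·A + 0·B to select A, 0·A + 1·B to select B) can be sketched as follows (plain C++; the function and the piecewise shape are made up for the example). Both branches are computed for every entry and blended with a 0/1 mask, so the loop body contains no data-dependent branch and all SIMD lanes can proceed in lockstep:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Branch-less piecewise evaluation: compute both candidate results and
// select between them with a 0/1 mask instead of an if/else.
inline void piecewiseBranchless(const std::vector<double>& x,
                                std::vector<double>& out) {
  out.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double mask = x[i] >= 0. ? 1. : 0.;  // compiles to a SIMD compare mask
    const double a = std::exp(-x[i]);          // branch A, used where x >= 0
    const double b = 1.0 + x[i];               // branch B, used where x <  0
    out[i] = mask * a + (1.0 - mask) * b;      // 1*A + 0*B or 0*A + 1*B
  }
}
```

As noted above, this costs extra cycles (both branches are always evaluated) but wins overall because the loop vectorises cleanly.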
RooFit ’s interfaces are being modernised, and computations are being sped up with fast dataloading and
SIMD computations. The single-thread performance of unbinned fits in
RooFit has been increased by several factors without requiring code changes on the user side. Inconjunction with work on parallelising computations [10], a speed up by more than an orderof magnitude can be expected.The work on modernising
RooFit ’s interfaces and improving its performance will con-tinue, especially for binned fits such as
HistFactory models.

References

[1] W. Verkerke, D.P. Kirkby, eConf C0303241, MOLT007 (2003), physics/0306116
[2] R. Brun, F. Rademakers, Nucl. Instrum. Methods Phys. Res. A389, 81 (1997)
[3] S. Hageboeck, L. Moneta, Making RooFit Ready for Run 3 (2020), unpublished
[4] ATLAS Collaboration, JHEP, 069 (2015)
[5] E. Tejedor Saavedra, S. Wunsch, M. Galli, A new PyROOT: Modern, Interoperable and more Pythonic, EPJ Web Conf. (this volume) (2020)
[6] The ROOT Team, RooFit reference guide