Making RooFit Ready for Run 3
S Hageböck and L Moneta
CERN, 1211 Geneva 23, Switzerland
E-mail: [email protected]
Abstract.
RooFit and RooStats, the toolkits for statistical modelling in ROOT, are used in most searches and measurements at the Large Hadron Collider. The data to be collected in Run 3 will enable measurements with higher precision and models with larger complexity, but also require faster data processing. In this work, first results on modernising RooFit's collections, restructuring data flow and vectorising likelihood fits in RooFit will be discussed. These improvements will enable the LHC experiments to process larger datasets without having to compromise on model complexity, as fitting times would otherwise increase significantly with the large datasets expected in Run 3.
1. Introduction
RooFit [1] is a C++ package for statistical modelling distributed with ROOT [2]. RooFit allows defining computation graphs connecting observables, parameters, functions and PDFs in order to compute likelihoods and perform fits to data. An example of such a computation graph is shown in figure 1. Every node of the graph can be evaluated to a real value, which for a PDF denotes the probability of finding a data event with the given values of the observables, for a given model and its set of parameters.
RooStats is a collection of statistical tools to perform statistical tests with RooFit models (e.g. toy Monte Carlo studies, setting limits). Further, HistFactory provides tools to create RooFit models from a collection of ROOT histograms.
RooFit was originally developed for the BaBar collaboration, but was later picked up by many others. Its central parts were designed for single-core processors, and were neither optimised for large caches nor for SIMD computations. This work stands at the beginning of efforts to modernise RooFit to speed up fits, and to make it more accessible from both C++ and Python. The features described in the following two sections will be released in ROOT 6.18.
2. Modernising RooFit's Internal Collections
In RooFit, computation graphs and sets/lists of functions, PDFs, observables and parameters are saved with the help of RooAbsCollection, the base class of RooFit's main collections RooArgSet and RooArgList. Internally, these were using a linked list with optional hash lookup. The most common operation on these collections during fitting, i.e. repeated evaluation of the computation graph, is iteration. Less frequent operations are appending and finding elements, and collections are very rarely sorted or modified. This favours array-like data structures, and indeed the linked lists were identified as a bottleneck for fits.

Figure 1. A RooFit likelihood model. This model represents the sum of a signal and a background PDF, where the former is a convolution of other PDFs and the latter is a polynomial distribution. Likelihood models are implemented as tree-like structures of nodes that can be evaluated to real values. Blue nodes represent parameters or observables; red nodes, which depend on the values of other nodes, represent functions or PDFs.

Therefore, the linked list in RooAbsCollection was replaced by a std::vector, mostly to speed up iterating.
RooAbsCollection was further provided with an STL-like interface (size, begin, end) to enable range-based for loops. This speeds up forward iteration by about 20%, and random access completes in constant time. Figures 2 and 3 compare the old and new interfaces. The STL-like interface makes it possible to reduce heap allocations, and replaces while loops with non-local variables by range-based for loops. This reduces code clutter and the danger of memory leaks, dangling pointers and variable shadowing. It will also facilitate generating ROOT's automatic Python bindings to iterate through RooFit's collections in Python. Figure 4 compares run times for typical RooFit workflows between ROOT 6.16 and 6.18. Depending on how often collections are iterated, the speed up varies between 5 and 21%. Tests with an ATLAS likelihood model [3] yielded a speed up of 19%.
Figure 2. Iterating through a RooFit collection with the old interface.

Figure 3. Iterating through a RooFit collection with the new interface.
The old interface of RooAbsCollection and its subclasses exposed three kinds of iterators to users, one of which is shown in figure 2. All three are being used in RooFit, and it is likely that RooFit's users also use all three. Removing any of these would therefore break user code. To provide backward compatibility, these legacy iterators were re-implemented to also work with the new collections.

(Figure 1 was produced using the graphVizTree() export supported by all nodes in a RooFit graph.)

Figure 4. Change in run time vs. ROOT 6.16 in % for typical RooFit workflows (a selection of representative RooFit tutorials), comparing ROOT 6.16 and 6.18. Frequent use of STL-like iterators leads to a speed up of 20%; heavy use of legacy iterators leads to a slow down.

Since the old RooFit collections were based on a linked list, the legacy iterators need to remain valid also when reallocations happen. Therefore, the legacy iterators hold a reference to the new collection, and use index access to iterate through it. This makes them tolerant against reallocations, but inserting or deleting elements before the current position is not supported, unlike for linked lists. Not a single instance of such usage was found in
RooFit, though. Nevertheless, run-time checks were added to warn users if they insert or delete before the current iterator. These are performed only if ROOT is compiled with assertions enabled, and only for legacy iterators.

All legacy iterators further need to work both with the original RooLinkedList and with the STL-based collections. (The RooLinkedList will be deprecated, but not all instances of its usage have been removed, and it might further be used in user code.) They were therefore implemented as adapters to a polymorphic iterator interface, which supports both the RooLinkedList and the counting iterator. This means that all legacy iterators will work irrespective of reallocations and of the actual type of the collection, but they are slower than the fastest original RooFit iterator, because they require a heap allocation to polymorphically switch between different backends. The slow down observed for one workflow in figure 4 is caused by this. Nevertheless, instances of slow legacy iterators are easily found using profiling tools, and can be replaced by STL iterators by changing the code as shown in figures 2 and 3, which leads to a speed up of 20%. The most critical iterators in RooFit have already been replaced, and less critical iterators will be replaced as the modernisation of RooFit continues.
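The index-based approach that keeps legacy iterators valid across reallocations can be sketched as follows. This is a hypothetical simplification of the idea, not RooFit's actual iterator implementation, and it uses a plain std::vector<int> in place of a RooFit collection:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a reallocation-tolerant legacy iterator: instead of holding a
// pointer into the collection's storage (which a std::vector invalidates
// when it reallocates), it holds a reference to the collection and a
// position index. Hypothetical simplification, not RooFit's actual code.
class LegacyIterator {
  const std::vector<int>* coll_;
  std::size_t pos_ = 0;
public:
  explicit LegacyIterator(const std::vector<int>& coll) : coll_(&coll) {}

  // Mimics the old next-element contract: returns the next element,
  // or nullptr once the end of the collection is reached.
  const int* next() {
    return pos_ < coll_->size() ? &(*coll_)[pos_++] : nullptr;
  }
};
```

Because only an index is stored, appending elements after the current position (even when this triggers a reallocation) leaves the iterator valid, while inserting or deleting before the current position would silently shift what "position" means, which is exactly the unsupported case guarded by the run-time checks described above.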
3. Faster HistFactory Models
HistFactory [4] is a toolkit to create RooFit models from a collection of histograms. It supports multiple channels, multiple signal and background samples, sample scale factors, and systematic uncertainties for shape and normalisation differences in histograms, and it allows implementing combined measurements of parameters such as a signal strength.

To parametrise systematic uncertainties using histograms, users supply three histograms of the same distribution: the nominal distribution and two histograms representing the ±1σ uncertainty. Given that these have to be evaluated for multiple (sometimes hundreds of) systematic uncertainties, for multiple samples and multiple channels, several thousands of histograms might have to be analysed.

When HistFactory was implemented, move semantics and shared pointers were not available. The authors therefore resorted to copying histograms, leading to a large overhead of copying and deleting histograms. Performance-critical sections of the HistFactory code were therefore revisited, and move semantics as well as shared pointers were implemented. This speeds up creating a likelihood model for an ATLAS measurement [3] by more than ten times, with identical results. This model comprises 10 832 histograms, 28 channels and 253 systematic uncertainties, and was constructed in 150 s instead of 1 800 s. Fits using this model furthermore converged 20% faster because of the optimisations discussed in section 2.
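The difference between the two ownership patterns can be illustrated with a minimal sketch. The Hist, SampleCopying and SampleSharing names are invented for this illustration; the real code deals with ROOT histograms, not this toy struct:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy histogram: the expensive part is the bin-content array.
struct Hist {
  std::vector<double> bins;
};

// Old pattern (pre-C++11 style): every sample and systematic variation
// stores its own deep copy of the bin contents, which then also has to
// be deleted again. std::move avoids the copy where the caller gives
// the histogram up.
struct SampleCopying {
  Hist nominal;
  explicit SampleCopying(Hist h) : nominal(std::move(h)) {}
};

// New pattern: share ownership, so many model components can refer to the
// same histogram without the bin contents ever being copied.
struct SampleSharing {
  std::shared_ptr<const Hist> nominal;
  explicit SampleSharing(std::shared_ptr<const Hist> h)
      : nominal(std::move(h)) {}
};
```

With thousands of histograms per model, eliminating these copies is what turns the 1 800 s construction time into 150 s in the measurement quoted above.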
4. Batched Likelihood Computations
A bottleneck for likelihood computations in RooFit is the repeated evaluation of the computation graph, such as the one shown in figure 1. To compute a likelihood, the probability of observing each event in the dataset has to be computed. RooFit achieves this by loading the values of the observables for a single event into the leaves of the computation graph, and evaluating the probability at the top node. Each node caches its last value, and therefore constant branches of the computation graph (e.g. branches that only depend on parameters) are computed only once. Yet, all branches that depend on observables have to be recomputed for each entry in the dataset. In figure 1, for example, the node "t" in the centre of the graph is the observable, whereas the other leaves are parameters. This means that the majority of the nodes has to be recomputed for each entry in the dataset. Evaluating a node involves virtual function calls, and the number of such calls is proportional both to the number of entries in the dataset and to the number of (non-constant) nodes in the graph, N_Data · N_Nodes. For a small graph of 10 nodes and one million events, this already amounts to considerably more than 10 million function calls, because additional calls for the normalisation of PDFs and for invalidating (node-local) caches are necessary.

Furthermore, loading single values into the nodes of the computation graph is hostile to CPU caches. There is a high likelihood that, when a node is being revisited to load the next entry, its data have been thrashed from the highest-level cache(s). This means that RooFit runs inefficiently on modern CPUs because of poor data locality and inefficient memory access patterns.
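The per-node caching described above can be sketched with a dirty-flag mechanism. This is a minimal illustration of the idea, not RooFit's actual RooAbsArg caching code; the class and method names are invented:

```cpp
#include <cassert>

// Minimal sketch of per-node value caching: a node recomputes its value only
// when an input has changed, i.e. when it has been marked dirty.
class Node {
  mutable double cache_ = 0.;
  mutable bool dirty_ = true;
  mutable int nEvaluations_ = 0;
public:
  virtual ~Node() = default;
  virtual double compute() const = 0;  // virtual call per (re)evaluation

  double getVal() const {
    if (dirty_) {
      cache_ = compute();
      dirty_ = false;
      ++nEvaluations_;
    }
    return cache_;
  }
  void setDirty() { dirty_ = true; }  // called when an input value changes
  int nEvaluations() const { return nEvaluations_; }
};

// A "constant branch": it depends only on parameters, so once evaluated it is
// not recomputed until a parameter changes. A branch depending on an
// observable would be marked dirty for every entry in the dataset.
class ParamSum : public Node {
  double a_, b_;
public:
  ParamSum(double a, double b) : a_(a), b_(b) {}
  double compute() const override { return a_ + b_; }
};
```

In a fit, every change of an observable marks the observable-dependent branches dirty, so the virtual compute() calls repeat for every entry in the dataset, which is the N_Data · N_Nodes cost discussed above.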
To demonstrate that a likelihood evaluation can be sped up by loading batches of data, a preliminary interface using std::span was implemented for a few selected PDFs (Gaussian, Poisson and exponential distribution, summation of PDFs). Instead of computing only one probability per node, all probabilities for all entries in the dataset were computed in a single function call for each node. Since such computations operate on array-like structures, data locality, caching and prefetching improve. Figure 5 shows that run times for fitting simple models decrease by a factor of 2 to 3.5. The speed up is expected to be even larger for larger models.
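The batched evaluation can be sketched as follows. Pointer-and-length arguments stand in for std::span to keep the sketch self-contained, and the function name and signature are illustrative, not the actual RooFit batch interface:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of batched evaluation: a single call computes the (unnormalised)
// Gaussian probabilities for a whole array of observable values, instead of
// one virtual getVal() call per event. The tight loop over contiguous arrays
// is what improves data locality, caching and prefetching.
void gaussBatch(const double* x, double* out, std::size_t n,
                double mean, double sigma) {
  const double invTwoSigmaSq = 1. / (2. * sigma * sigma);
  for (std::size_t i = 0; i < n; ++i) {
    const double d = x[i] - mean;
    out[i] = std::exp(-d * d * invTwoSigmaSq);
  }
}
```

One such call replaces N_Data virtual-function invocations of the node, reducing the call count from N_Data · N_Nodes to roughly N_Nodes per likelihood evaluation.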
Figure 5. Run time for fitting different likelihood models to datasets of two million events (Intel i7-4790). Top: using AVX2 SIMD instructions and batch data processing increases the speed by 7×. Middle: batch data processing leads to a speed up of 3.5×. Bottom: current RooFit.

When computations on array-like structures are performed, single-instruction-multiple-data (SIMD) computations can be used. The automatic vectorisation optimisation of the clang compiler was used to vectorise computations for an AVX2 architecture for the Gaussian and exponential distributions, as well as for the addition of PDFs and the normalisation of the three. This requires auto-vectorisable mathematical functions, which need to be inlinable and must not have any data dependencies between elements of the underlying arrays. Such functions are provided by the VDT [5] package, of which the logarithm and exponential function were used. Further, the computations were rewritten such that there are no data dependencies, that they can be executed entirely in registers, and that they use only limited branches, no (non-inlinable) functions and only simple reductions. Auto-vectorisation of these selected PDFs with AVX2 instructions increased the speed up to 6× to 7×.
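A computation satisfying these constraints, with no branches, no dependencies between array elements and only a simple sum reduction, can be sketched as a negative log-likelihood accumulation. Here std::log stands in for the inlinable VDT logarithm; whether the compiler actually emits AVX2 code depends on flags such as clang's -O2 -mavx2, and relaxed floating-point semantics (e.g. -ffast-math) are typically needed before the reduction itself is vectorised:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of an auto-vectorisable likelihood reduction: each loop iteration
// depends only on its own array element, there are no branches, and the only
// cross-iteration interaction is a simple sum reduction.
double negLogLikelihood(const double* prob, std::size_t n) {
  double sum = 0.;
  for (std::size_t i = 0; i < n; ++i)
    sum -= std::log(prob[i]);
  return sum;
}
```

A branch inside the loop body, or a call to a non-inlinable function, would force the compiler back to scalar code, which is why the PDF computations had to be rewritten before auto-vectorisation paid off.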
5. Summary
The improvements discussed in sections 2 and 3 will be released in ROOT 6.18, and the work on batched and vectorised computations will continue. The batch interface will be refined to work with any PDF, and a fall-back implementation for PDFs that have not been modified will be provided. Auto-vectorisable batch computations will be implemented for a growing number of PDFs to increase the single-thread performance of RooFit by several factors without requiring code changes on the user side. In conjunction with work on parallelising computations [6], a speed up by more than an order of magnitude can be expected.
References
[1] Verkerke W and Kirkby D P 2003 eConf C0303241 MOLT007 (Preprint physics/0306116)
[2] Brun R and Rademakers F 1997 Nucl. Instrum. Methods Phys. Res. A URL https://root.cern
[3] ATLAS Collaboration 2015 JHEP 069 (Preprint)
[4] Cranmer K, Lewis G, Moneta L, Shibata A and Verkerke W 2012 HistFactory: a tool for creating statistical models for use with RooFit and RooStats Tech. Rep. CERN-OPEN-2012-016 CERN URL https://cds.cern.ch/record/1456844
[5] Piparo D, Innocente V and Hauth T 2014 J. Phys. Conf. Ser.