Making RooFit Ready for Run 3
S Hageböck and L Moneta
CERN, 1211 Geneva 23, Switzerland
E-mail: [email protected]
Abstract.
RooFit and RooStats, the toolkits for statistical modelling in ROOT, are used in most searches and measurements at the Large Hadron Collider. The data to be collected in Run 3 will enable measurements with higher precision and models with larger complexity, but also require faster data processing. In this work, first results on modernising RooFit's collections, restructuring data flow and vectorising likelihood fits in RooFit will be discussed. These improvements will enable the LHC experiments to process larger datasets without having to compromise on model complexity, as fitting times would otherwise increase significantly with the large datasets expected in Run 3.
1. Introduction
RooFit [1] is a C++ package for statistical modelling distributed with ROOT [2]. RooFit allows defining computation graphs connecting observables, parameters, functions and PDFs in order to compute likelihoods and perform fits to data. An example of such a computation graph is shown in figure 1. Every node of the graph can be evaluated to a real value, which for a PDF denotes the probability of finding a data event with the given values of the observables, for a given model and its set of parameters.
RooStats is a collection of statistical tools to perform statistical tests with RooFit models (e.g. toy Monte Carlo studies, setting limits). Further, HistFactory provides tools to create RooFit models from a collection of ROOT histograms.
RooFit was originally developed for the BaBar collaboration, but was later picked up by many others. Its central parts were designed for single-core processors, and were neither optimised for large caches nor for SIMD computations. This work stands at the beginning of efforts to modernise RooFit to speed up fits, and to make it more accessible from both C++ and Python. The features described in the following two sections will be released in ROOT 6.18.
2. Modernising RooFit's Internal Collections
In RooFit, computation graphs and sets/lists of functions, PDFs, observables and parameters are saved with the help of RooAbsCollection, the base class of RooFit's main collections RooArgSet and RooArgList. Internally, these were using a linked list with optional hash lookup. The most common operation on these collections during fitting, i.e. repeated evaluation of the computation graph, is iteration. Less frequent operations are appending and finding elements, and collections are very rarely sorted or modified. This favours array-like data structures, and indeed the linked lists were identified as a bottleneck for fits.

Figure 1. A RooFit likelihood model. This model represents the sum of a signal and a background PDF, where the former is a convolution of other PDFs and the latter is a polynomial distribution. Likelihood models are implemented as tree-like structures of nodes that can be evaluated to real values. Blue nodes represent parameters or observables; red nodes, which depend on the values of other nodes, represent functions or PDFs.

Therefore, the linked list in RooAbsCollection was replaced by a std::vector, mostly to speed up iterating.
RooAbsCollection was further provided with an STL-like interface (size, begin, end) to enable range-based for loops. This speeds up forward iteration by about 20%, and random access completes in constant time. Figures 2 and 3 compare the old and new interfaces. The STL-like interface makes it possible to reduce heap allocations, and replaces while loops with non-local variables by range-based for loops. This reduces code clutter and the danger of memory leaks, dangling pointers and variable shadowing. It will also facilitate generating ROOT's automatic Python bindings to iterate through RooFit's collections in Python. Figure 4 compares run times for typical RooFit workflows between ROOT 6.16 and 6.18. Depending on how often collections are iterated, the speed up varies between 5 and 21%. Tests with an ATLAS likelihood model [3] yielded a speed up of 19%.
Figure 2. Iterating through a RooFit collection with the old interface.

Figure 3. Iterating through a RooFit collection with the new interface.
The old interface of RooAbsCollection and its subclasses exposed three kinds of iterators to users, one of which is shown in figure 2. All three are being used in RooFit, and it is likely that RooFit's users also use all three. Removing any of these would therefore break user code. To provide backward compatibility, these legacy iterators were re-implemented to also work with the new collections.

(Figure 1 was produced using the graphVizTree() export supported by all nodes in a RooFit graph.)

Figure 4. Change in run time vs. ROOT 6.16 in % for typical RooFit workflows (a selection of representative RooFit tutorials), comparing ROOT 6.16 and 6.18. Frequent use of STL-like iterators leads to a speed up of 20%; heavy use of legacy iterators leads to a slow down.

Since the old RooFit collections were based on a linked list, the legacy iterators need to remain valid also when reallocations happen. Therefore, the legacy iterators hold a reference to the new collection, and use index access to iterate through it. This makes them tolerant against reallocations, but inserting or deleting elements before the current position is not supported, unlike for linked lists. Not a single instance of such usage was found in
RooFit, though. Nevertheless, run-time checks were added to warn users if they insert or delete before the current iterator. These are performed only if ROOT is compiled with assertions enabled, and only for legacy iterators.

All legacy iterators further need to work both with the original RooLinkedList and with the STL-based collections. (The RooLinkedList will be deprecated, but not all instances of its usage have been removed, and it might further be used in user code.) They were therefore implemented as adapters to a polymorphic iterator interface, which supports both the RooLinkedList and the counting iterator. This means that all legacy iterators will work irrespective of reallocations and of the actual type of the collection, but they are slower than the fastest original RooFit iterator, because they require a heap allocation to polymorphically switch between different backends. The slow down observed for one workflow in figure 4 is caused by this. Nevertheless, instances of slow legacy iterators are easily found using profiling tools, and can be replaced by STL iterators by changing the code as shown in figures 2 and 3, which leads to a speed up of 20%. The most critical iterators in RooFit have already been replaced, and less critical iterators will be replaced as the modernisation of RooFit continues.
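The index-based approach that keeps legacy iterators valid across reallocations can be sketched as follows. This is a hypothetical simplification of the idea, not RooFit's actual iterator implementation, and it uses a plain std::vector<int> in place of a RooFit collection:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a reallocation-tolerant legacy iterator: instead of holding a
// pointer into the collection's storage (which a std::vector invalidates
// when it reallocates), it holds a reference to the collection and a
// position index. Hypothetical simplification, not RooFit's actual code.
class LegacyIterator {
  const std::vector<int>* coll_;
  std::size_t pos_ = 0;
public:
  explicit LegacyIterator(const std::vector<int>& coll) : coll_(&coll) {}

  // Mimics the old next-element contract: returns the next element,
  // or nullptr once the end of the collection is reached.
  const int* next() {
    return pos_ < coll_->size() ? &(*coll_)[pos_++] : nullptr;
  }
};
```

Because only an index is stored, appending elements after the current position (even when this triggers a reallocation) leaves the iterator valid, while inserting or deleting before the current position would silently shift what "position" means, which is exactly the unsupported case guarded by the run-time checks described above.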
3. Faster HistFactory Models
HistFactory [4] is a toolkit to create RooFit models from a collection of histograms. It supports multiple channels, multiple signal and background samples, sample scale factors, and systematic uncertainties for shape and normalisation differences in histograms, and it allows implementing combined measurements of parameters such as a signal strength.

To parametrise systematic uncertainties using histograms, users supply three histograms of the same distribution: the nominal distribution and two histograms representing the ±1σ uncertainty. Given that these have to be evaluated for multiple (sometimes hundreds of) systematic uncertainties, for multiple samples and multiple channels, several thousands of histograms might have to be analysed.

When HistFactory was implemented, move semantics and shared pointers were not available. The authors therefore resorted to copying histograms, leading to a large overhead of copying and deleting histograms. Performance-critical sections of the HistFactory code were therefore revisited, and move semantics as well as shared pointers were implemented. This speeds up creating a likelihood model for an ATLAS measurement [3] by more than ten times, with identical results. This model comprises 10 832 histograms, 28 channels and 253 systematic uncertainties, and was constructed in 150 s instead of 1 800 s. Fits using this model furthermore converged 20% faster because of the optimisations discussed in section 2.
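The difference between the two ownership patterns can be illustrated with a minimal sketch. The Hist, SampleCopying and SampleSharing names are invented for this illustration; the real code deals with ROOT histograms, not this toy struct:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy histogram: the expensive part is the bin-content array.
struct Hist {
  std::vector<double> bins;
};

// Old pattern (pre-C++11 style): every sample and systematic variation
// stores its own deep copy of the bin contents, which then also has to
// be deleted again. std::move avoids the copy where the caller gives
// the histogram up.
struct SampleCopying {
  Hist nominal;
  explicit SampleCopying(Hist h) : nominal(std::move(h)) {}
};

// New pattern: share ownership, so many model components can refer to the
// same histogram without the bin contents ever being copied.
struct SampleSharing {
  std::shared_ptr<const Hist> nominal;
  explicit SampleSharing(std::shared_ptr<const Hist> h)
      : nominal(std::move(h)) {}
};
```

With thousands of histograms per model, eliminating these copies is what turns the 1 800 s construction time into 150 s in the measurement quoted above.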
4. Batched Likelihood Computations
A bottleneck for likelihood computations in RooFit is the repeated evaluation of the computation graph, such as the one shown in figure 1. To compute a likelihood, the probability of observing each event in the dataset has to be computed. RooFit achieves this by loading the values of the observables for a single event into the leaves of the computation graph, and evaluating the probability at the top node. Each node caches its last value, and therefore constant branches of the computation graph (e.g. branches that only depend on parameters) are computed only once. Yet, all branches that depend on observables have to be recomputed for each entry in the dataset. In figure 1, for example, the node "t" in the centre of the graph is the observable, whereas the other leaves are parameters. This means that the majority of the nodes has to be recomputed for each entry in the dataset. Evaluating a node involves virtual function calls, and the number of such calls is proportional both to the number of entries in the dataset and to the number of (non-constant) nodes in the graph, N_Data · N_Nodes. For a small graph of 10 nodes and one million events, this already amounts to considerably more than 10 million function calls, because additional calls for the normalisation of PDFs and for invalidating (node-local) caches are necessary.

Furthermore, loading single values into the nodes of the computation graph is hostile to CPU caches. There is a high likelihood that, when a node is being revisited to load the next entry, its data have been thrashed from the highest-level cache(s). This means that RooFit runs inefficiently on modern CPUs because of poor data locality and inefficient memory access patterns.
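The per-node caching described above can be sketched with a dirty-flag mechanism. This is a minimal illustration of the idea, not RooFit's actual RooAbsArg caching code; the class and method names are invented:

```cpp
#include <cassert>

// Minimal sketch of per-node value caching: a node recomputes its value only
// when an input has changed, i.e. when it has been marked dirty.
class Node {
  mutable double cache_ = 0.;
  mutable bool dirty_ = true;
  mutable int nEvaluations_ = 0;
public:
  virtual ~Node() = default;
  virtual double compute() const = 0;  // virtual call per (re)evaluation

  double getVal() const {
    if (dirty_) {
      cache_ = compute();
      dirty_ = false;
      ++nEvaluations_;
    }
    return cache_;
  }
  void setDirty() { dirty_ = true; }  // called when an input value changes
  int nEvaluations() const { return nEvaluations_; }
};

// A "constant branch": it depends only on parameters, so once evaluated it is
// not recomputed until a parameter changes. A branch depending on an
// observable would be marked dirty for every entry in the dataset.
class ParamSum : public Node {
  double a_, b_;
public:
  ParamSum(double a, double b) : a_(a), b_(b) {}
  double compute() const override { return a_ + b_; }
};
```

In a fit, every change of an observable marks the observable-dependent branches dirty, so the virtual compute() calls repeat for every entry in the dataset, which is the N_Data · N_Nodes cost discussed above.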
To demonstrate that a likelihood evaluation can be sped up by loading batches of data, a preliminary interface using std::span was implemented for a few selected PDFs (Gaussian, Poisson and exponential distribution, summation of PDFs). Instead of computing only one probability per node, all probabilities for all entries in the dataset were computed in a single function call for each node. Since such computations operate on array-like structures, data locality, caching and prefetching improve. Figure 5 shows that run times for fitting simple models decrease by a factor of 2 to 3.5. The speed up is expected to be even larger for larger models.
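The batched evaluation can be sketched as follows. Pointer-and-length arguments stand in for std::span to keep the sketch self-contained, and the function name and signature are illustrative, not the actual RooFit batch interface:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of batched evaluation: a single call computes the (unnormalised)
// Gaussian probabilities for a whole array of observable values, instead of
// one virtual getVal() call per event. The tight loop over contiguous arrays
// is what improves data locality, caching and prefetching.
void gaussBatch(const double* x, double* out, std::size_t n,
                double mean, double sigma) {
  const double invTwoSigmaSq = 1. / (2. * sigma * sigma);
  for (std::size_t i = 0; i < n; ++i) {
    const double d = x[i] - mean;
    out[i] = std::exp(-d * d * invTwoSigmaSq);
  }
}
```

One such call replaces N_Data virtual-function invocations of the node, reducing the call count from N_Data · N_Nodes to roughly N_Nodes per likelihood evaluation.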
Figure 5. Run time for fitting different likelihood models to datasets of two million events (Intel i7-4790). Top: using AVX2 SIMD instructions and batch data processing increases the speed by 7×. Middle: batch data processing leads to a speed up of 3.5×. Bottom: current RooFit.

When computations on array-like structures are performed, single-instruction-multiple-data (SIMD) computations can be used. The automatic vectorisation optimisation of the clang compiler was used to vectorise computations for an AVX2 architecture for the Gaussian and exponential distributions, as well as for the addition of PDFs and the normalisation of the three. This requires auto-vectorisable mathematical functions, which need to be inlinable and must not have any data dependencies between elements of the underlying arrays. Such functions are provided by the VDT [5] package, of which the logarithm and exponential function were used. Further, the computations were rewritten such that there are no data dependencies, that they can be executed entirely in registers, and that they use only limited branches, no (non-inlinable) functions and only simple reductions. Auto-vectorisation of these selected PDFs with AVX2 instructions increased the speed up to 6× to 7×.
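A computation satisfying these constraints, with no branches, no dependencies between array elements and only a simple sum reduction, can be sketched as a negative log-likelihood accumulation. Here std::log stands in for the inlinable VDT logarithm; whether the compiler actually emits AVX2 code depends on flags such as clang's -O2 -mavx2, and relaxed floating-point semantics (e.g. -ffast-math) are typically needed before the reduction itself is vectorised:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of an auto-vectorisable likelihood reduction: each loop iteration
// depends only on its own array element, there are no branches, and the only
// cross-iteration interaction is a simple sum reduction.
double negLogLikelihood(const double* prob, std::size_t n) {
  double sum = 0.;
  for (std::size_t i = 0; i < n; ++i)
    sum -= std::log(prob[i]);
  return sum;
}
```

A branch inside the loop body, or a call to a non-inlinable function, would force the compiler back to scalar code, which is why the PDF computations had to be rewritten before auto-vectorisation paid off.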
5. Summary
The improvements discussed in sections 2 and 3 will be released in ROOT 6.18, and the work on batched and vectorised computations will continue. The batch interface will be refined to work with any PDF, and a fall-back implementation for PDFs that have not been modified will be provided. Auto-vectorisable batch computations will be implemented for a growing number of PDFs to increase the single-thread performance of RooFit by several factors without requiring code changes on the user side. In conjunction with work on parallelising computations [6], a speed up by more than an order of magnitude can be expected.
References
[1] Verkerke W and Kirkby D P 2003 eConf C0303241 MOLT007 (Preprint physics/0306116)
[2] Brun R and Rademakers F 1997 Nucl. Instrum. Methods Phys. Res. A URL https://root.cern
[3] ATLAS Collaboration 2015 JHEP 069 (Preprint)
[4] Cranmer K, Lewis G, Moneta L, Shibata A and Verkerke W 2012 HistFactory: a tool for creating statistical models for use with RooFit and RooStats Tech. Rep. CERN-OPEN-2012-016 CERN URL https://cds.cern.ch/record/1456844
[5] Piparo D, Innocente V and Hauth T 2014 J. Phys. Conf. Ser.