Software Challenges For HL-LHC Data Analysis
The ROOT Team*: Kim Albertsson Brann, Guilherme Amadio, Sitong An, Bertrand Bellenot, Jakob Blomer, Philippe Canal, Olivier Couet, Massimiliano Galli, Enrico Guiraud, Stephan Hageboeck, Sergey Linev, Pere Mato Vila, Lorenzo Moneta, Alja Mrak Tadel, Axel Naumann, Vincenzo Eduardo Padulano, Fons Rademakers, Oksana Shadura, Matevz Tadel, Enric Tejedor Saavedra, Xavier Valls Pla, Vassil Vassilev, and Stefan Wunsch

Luleå University of Technology, Sweden; CERN, Geneva, Switzerland; Carnegie-Mellon University, USA; Fermi National Accelerator Laboratory, Batavia, USA; GSI, Darmstadt, Germany; University of California San Diego, USA; University of Nebraska Lincoln, USA; Princeton University, USA; Karlsruhe Institute of Technology, Germany

April 2020

* [email protected]
Abstract
The high energy physics community is discussing where investment is needed to prepare software for the HL-LHC and its unprecedented challenges. The ROOT project has been one of the central software players in high energy physics for decades. From its experience and expectations, the ROOT team has distilled a comprehensive set of areas that should see research and development in the context of data analysis software, for making best use of the HL-LHC's physics potential. This work shows what these areas could be, why the ROOT team believes investing in them is needed, which gains are expected, and where related work is ongoing. It can serve as an indication for future cooperation and research proposals.
HL-LHC's high-precision studies of standard model phenomena and BSM searches will require processing of huge data samples and comparing them to theoretical models with an explosion of parameters. Reducing systematic uncertainties (such as those introduced through correlations) to a level matching the much reduced statistical uncertainties of HL-LHC data requires more accurate and CPU-intensive simulations, data-driven estimations, and testing of high-dimensional models built using a large number of input features and parameters. We thus expect a "superscalar" demand in analysis throughput and data storage that extends well beyond the increase due to higher data statistics.

ROOT [1] is a fundamental ingredient of virtually all HEP workflows, in areas such as data persistency, modeling, graphics, and analysis. The project demonstrates its community role by its high number of contributors (more than 100 in 2019); more than 1 exabyte of physics data stored in ROOT files; excellent, active connections with the experiments, including direct investment by the experiments; and active exchange with physicists, such as more than 50 support messages per average work day in 2019.

ROOT is an open source project following software development's best practices. All contributions are public, as a prerequisite for publicly documenting and recognizing contributions. This openness makes ROOT attractive for funding agencies, as demonstrated by its many contributors, and allows ROOT to serve as a core component of an ecosystem where demonstrators and prototypes for R&D can "plug in". The project's expertise, its tradition of innovation, and its excellent connections to stakeholders allow the project to establish where investment is needed to harvest the physics potential of the HL-LHC.
Based on the ROOT team's experience and expertise, and based on discussions with physicists and experiment representatives, the ROOT project predicts the following main challenges, which are reflected in this short input document:

• Reading data. The final steps of analyses are generally limited by the rate at which events are read. This affects computing efficiency as well as physicists' efficiency (time-to-response).

• Data size. Storage needs of data samples and available simulation capacities will limit the availability of simulation samples and, as a consequence, possibly the quality of physics extracted from HL-LHC's data.

• Efficient use of available compute silicon. Two main aspects will limit efficiency: the lack of accessible programming models for transparent use of heterogeneous and distributed compute resources (accelerators, HPC, but also super-fast, near-CPU memory), and the use of slow, interpreted languages in performance-critical code, at least in analyses.

• Significant use of tools (also from industry) that are inefficient for HEP, for instance for data deserialization, machine learning, modeling, and graphics.

• Visualizations to communicate complex models, correlations, and uncertainties.
The ROOT project believes that the physics potential of the HL-LHC data can only be exploited by disproportionally increasing the size of simulation samples. This is caused to first order by the high demand for sampling high-dimensional parameter spaces of more elaborate models [2]. Or, turning this around: the high statistical power of HL-LHC data enables exclusions of high-dimensional parameter spaces of complex models, which in turn require a higher ratio of simulation over real data than at the LHC. This entails higher demands for analysis and storage, where even today's disk storage demands come at a considerable cost for WLCG, estimated at CHF 50M/year.

While ROOT files have been compared many times over the past decades against possible alternative formats, their applicability and performance characteristics for HEP data remain unrivaled [3] to this day. Although ROOT files (and specifically ROOT's columnar format, TTree) outperform their competitors for HEP workflows, the ROOT team has identified several improvements so significant that they warrant an evolution of the file format [4]. This research and development effort led to a prototype labeled RNTuple. It incorporates many of the successful design decisions of TTree, such as its columnar data layout or horizontal column expansion ("friend trees").

Lossy compression is currently carried out by each experiment separately, tweaking each stored value to match its expected precision in an ad-hoc effort. The community should invest in a sustainable, general, and (at least semi-) automatic approach that is central to the common I/O subsystem. Notable research in this area, in collaboration with the ROOT team, is Accelogic's Compressive Computing [5].

File format changes (RNTuple) together with improvements in lossless and lossy compression should enable general space savings of 25%, for all experiments and data formats, without any cost to the quality of the physics results.
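To give a flavor of the new format, the following is a minimal sketch of writing columnar event data with the experimental RNTuple classes, combined with zstd compression. The field and file names are hypothetical, and the RNTuple API is experimental and has been evolving, so details may differ between ROOT versions.

```cpp
// Minimal sketch: writing columnar event data with the experimental RNTuple
// classes and zstd compression. Field and file names are hypothetical.
#include <ROOT/RNTuple.hxx>
#include <ROOT/RNTupleModel.hxx>
#include <ROOT/RNTupleOptions.hxx>
#include <utility>

using ROOT::Experimental::RNTupleModel;
using ROOT::Experimental::RNTupleWriter;
using ROOT::Experimental::RNTupleWriteOptions;

void write_ntuple() {
   auto model = RNTupleModel::Create();
   auto fldPt  = model->MakeField<float>("pt");   // one column per field
   auto fldEta = model->MakeField<float>("eta");

   RNTupleWriteOptions options;
   options.SetCompression(505);  // ROOT compression setting: zstd, level 5

   auto writer = RNTupleWriter::Recreate(std::move(model), "Events",
                                         "data.ntuple", options);
   for (int i = 0; i < 1000; ++i) {
      *fldPt  = 20.f + i % 50;                 // toy values for illustration
      *fldEta = -2.5f + 0.005f * (i % 1000);
      writer->Fill();                          // append one entry
   }
}  // writer goes out of scope and commits the dataset
```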
Efficient analyses are dominated by the read throughput of the input data. The need for high-throughput reads increases with the use of accelerators and highly parallel analysis workflows.

The ROOT project has determined that two main challenges must be addressed: automatic optimization of parameters for data serialization and deserialization, so as not to rely on physicists knowing the optimal software configuration, and bulk data processing, to increase the amount of data per deserialization instruction count. The latter, together with simple event data models as favored by RNTuple, enables high-throughput data transfer also to GPUs ("structs of arrays" with no or minimal host CPU manipulation) and matches data transfer patterns commonly available in High Performance Computing environments.

Other optimizations related to read throughput can have a considerable impact, too. Examples include caching of intermediate analysis results [6] and adaptation of the data format / layout to facilities' storage systems (key-value stores, distributed multi-node I/O, high-latency remote I/O such as through xrootd), as well as data placement and efficient handling of meta-data (derivation history, calibration constants, luminosity information, or quickly locating a given event). Being able to save such meta-data with, but not necessarily in, a file would enable experiments to easily update it, and would make bookkeeping easier.

Reduced file sizes also mean higher deserialization throughput in "events per second". With benefits from caching of intermediary results and optimized data paths (bulk data processing, optimizations for tomorrow's storage systems) we predict possible throughput increases of a factor 2-5 compared to the LHC data throughput, depending on the storage system and analysis workflow.
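To illustrate the kind of configuration that today relies on physicists knowing the right knobs (and that should become automatic), the following sketch manually tunes TTree's read cache for a remote file. The file, tree, and branch names are hypothetical.

```cpp
// Minimal sketch of manual read-throughput tuning with TTreeCache.
// File, tree, and branch names are hypothetical.
#include <TFile.h>
#include <TTree.h>
#include <memory>

void read_tuned() {
   std::unique_ptr<TFile> file(
      TFile::Open("root://eos.example.org//data.root"));  // high-latency remote I/O
   auto tree = file->Get<TTree>("Events");

   tree->SetCacheSize(100 * 1024 * 1024);   // 100 MB read cache
   tree->AddBranchToCache("Muon_*", true);  // prefetch only the branches used
   tree->SetCacheLearnEntries(100);         // let the cache learn the access pattern

   float pt[64];                            // assumes at most 64 muons per event
   tree->SetBranchStatus("*", false);       // deserialize only what the analysis needs
   tree->SetBranchStatus("Muon_pt", true);
   tree->SetBranchAddress("Muon_pt", pt);

   const Long64_t n = tree->GetEntries();
   for (Long64_t i = 0; i < n; ++i)
      tree->GetEntry(i);
}
```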
Today more than ever, physicists expect to be able to focus on the physics analysis rather than on its coding. To some extent this was triggered by the recent evolution of the Python scientific ecosystem, which demonstrated that complex analyses can be written in a compact style, using for instance efficient high-level Python packages such as NumPy.

Similar patterns can be used in other languages. C++'s traditional verbosity can be alleviated by deferring type information to runtime, thanks to ROOT's C++ just-in-time compiler cling. This allowed ROOT to create a declarative analysis interface, RDataFrame, for writing compact yet efficient analyses in either C++ or Python, exposing the "what" to physicists while hiding the "how" in its implementation details, as the sketch below illustrates.
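As a minimal illustration of this declarative style (with hypothetical tree, file, and branch names), an analysis might look as follows; the strings passed to Filter and Define are C++ expressions compiled at runtime by cling.

```cpp
// Minimal RDataFrame sketch: declarative cuts and derived quantities.
// Tree, file, and branch names are hypothetical.
#include <ROOT/RDataFrame.hxx>
#include <TCanvas.h>
#include <TROOT.h>

void analyse() {
   ROOT::EnableImplicitMT();  // transparently parallelize the event loop
   ROOT::RDataFrame df("Events", "data.root");

   auto h = df.Filter("nMuon == 2 && Muon_charge[0] != Muon_charge[1]",
                      "two opposite-charge muons")
              .Define("m_inv",
                      "InvariantMass(Muon_pt, Muon_eta, Muon_phi, Muon_mass)")
              .Histo1D({"m_inv", ";m_{#mu#mu} [GeV];events", 120, 60., 120.},
                       "m_inv");

   TCanvas c;
   h->Draw();           // the event loop runs lazily, on first access to a result
   c.SaveAs("m_inv.png");
}
```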
We encourage the approach where slow, interpreted languages such as Python are used to compose nonetheless efficient analyses from calls to optimized libraries such as RDataFrame, rather than having the complete analysis written in a slow, interpreted language. Real-world, production analysis mini-frameworks not following this guideline are regularly observed to be 100 times slower than standard workflows.

We are convinced that investing in ROOT's unique, extremely powerful (automatic) Python bindings can greatly facilitate and accelerate Python analyses. It is an enabling component to create a larger ecosystem with Python and C++ elements. It makes performant C++ code more accessible thanks to simple Python interfaces, allowing more users to rely on these high-performance libraries. This entails defining abstractions that shield the performance-critical parts. Given such abstractions, the use of accelerators is a natural extension.

For optimal throughput and efficiency, a data layout has to take into account the hardware's requirements. It should be implemented behind the scenes of a simpler analysis interface such as RDataFrame, where the engine carrying out the analysis steps knows how to optimally schedule and lay out data and transfers. The determination of what is "optimal" can happen at runtime, based on the available hardware and on characteristics of the analysis. This is a significant R&D task, with equally significant potential performance improvements. See for instance https://indi.to/gQL7P.

Domain-Specific Languages

Even though Python is the language of choice for many analyses, its performance (or lack thereof) and its verbosity when dealing with nested iterations pose a challenge. Domain-specific languages (DSLs) promise to solve this to some extent by providing a more compact way of coding. One of the major concerns with DSLs is the inability to debug that language: in general, any DSL invented and exclusively adopted by HEP cannot benefit from an existing tooling market. Nonetheless, ROOT's past use of DSLs (such as those of TTree::Draw and TFormula) proves that DSLs can be successful with limited scope, such as for cuts.

An alternative exists: for instance RDataFrame and RVec (a vector-manipulation and computation library), being high-level interfaces, introduce their own concise expression "language" for analysis steps, while still staying in a well-known computing language with its tooling and training ecosystem (see the sketch below). We are convinced that simple Python interfaces together with performant C++ libraries and just-in-time compilation are superior alternatives to large-scale use of DSLs.
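As an example of this expression style, RVec lets selections read close to mathematical notation while remaining ordinary, debuggable C++. The quantities below are made up for illustration.

```cpp
// Minimal RVec sketch: concise, vectorized selections in plain C++.
// The input values are made up for illustration.
#include <ROOT/RVec.hxx>
#include <cstdio>

void rvec_demo() {
   ROOT::RVec<float> pt  = {42.f, 7.f, 19.f, 55.f};
   ROOT::RVec<float> eta = {0.3f, -2.1f, 1.4f, -0.7f};

   // Boolean mask and masked selection, no explicit loops:
   auto central = abs(eta) < 1.5;              // element-wise mask
   auto good_pt = pt[central && pt > 20.f];    // select matching entries

   printf("selected %zu candidates, leading pt %.1f\n",
          (size_t)good_pt.size(), (double)Max(good_pt));
}
```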
An integral component of the community's software is ROOT's interpreter cling. It quickly converts C++ code to an executable, in-memory, binary program. Among other roles, it provides the information needed to store data structures; it is a prerequisite for Python bindings as well as ROOT's C++ and CUDA interpretation; and it enables web-based GUI interaction. ROOT's interpreter also allows efficient evaluation of DSLs by transforming them into C++, a mechanism currently used by ROOT's TFormula. It makes simpler analysis interfaces with runtime type determination possible, which is crucial for writing simple yet highly efficient analyses.

We are convinced that the community should invest in cling's just-in-time compilation to further unlock its enormous potential, for instance by improving the interaction between Python and performant C++ libraries to facilitate their use in analyses, and by optimizing code at runtime based on the available hardware (for instance through cling's CUDA backend). As cling is the backbone of the community's data serialization, guaranteeing its maintenance is paramount.
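A small illustration of runtime C++ compilation through cling, here via ROOT's TInterpreter, and of TFormula's expression DSL being translated to jitted C++; the declared function is made up for illustration.

```cpp
// Minimal sketch: compiling and calling C++ at runtime through cling.
// The declared function is made up for illustration.
#include <TFormula.h>
#include <TInterpreter.h>
#include <cstdio>

void jit_demo() {
   // Hand an arbitrary C++ definition to cling at runtime:
   gInterpreter->Declare("double shifted_square(double x) { return x*x + 1.; }");
   // Call it through the interpreter:
   gInterpreter->ProcessLine(
      "printf(\"shifted_square(3.0) = %f\\n\", shifted_square(3.0));");

   // TFormula's small expression DSL is translated to C++ and jitted by cling:
   TFormula f("f", "sin([0]*x)/x");
   f.SetParameter(0, 2.0);
   printf("f(1.5) = %f\n", f.Eval(1.5));
}
```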
Analysis jobs are typically run numerous times: for testing and bug fixes, to obtain the results, and for evaluating uncertainties and correlations. This wastes computing resources and the time of physicists, because it is often easier to rerun everything than to write an efficient implementation that only computes quantities that changed, or that pools common computations in a computation graph.

To make matters worse, analysis frameworks have mushroomed to help with handling categories and computing uncertainties and correlations. We believe that this can be provided centrally, to benefit from common investment. Such tools can increase CPU efficiency by optimizing data flow; by reducing processing runs over all input data; or by caching relevant parts of the input data and intermediary results, for re-use in consecutive runs of the analysis.
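The computation-graph approach mentioned above is already visible in RDataFrame: several results can be booked up front and produced in a single pass over the data, rather than one pass per quantity. A sketch, with hypothetical tree, file, and branch names:

```cpp
// Minimal sketch: pooling computations in one event loop via a computation graph.
// Tree, file, and branch names are hypothetical.
#include <ROOT/RDataFrame.hxx>
#include <cstdio>

void one_pass() {
   ROOT::RDataFrame df("Events", "data.root");
   auto common = df.Filter("nJet > 0").Define("ht", "Sum(Jet_pt)");

   // Booking results is lazy; nothing runs yet:
   auto hHt    = common.Histo1D({"ht", ";H_{T} [GeV];events", 100, 0., 1000.}, "ht");
   auto meanHt = common.Mean("ht");
   auto nPass  = common.Count();

   // First access triggers a single event loop that fills all three results:
   printf("%lld events, <H_T> = %.1f GeV\n",
          (long long)*nPass, (double)*meanHt);
   (void)hHt;  // histogram is filled by the same loop
}
```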
Modeling

High energy physics analyses use complex statistical models, correlations, and uncertainties, a challenge that not many sciences have taken upon themselves. RooFit has turned out to be the tool of choice. Now, many alternative solutions are on the market, either wrapping industry libraries or re-implementing parts of RooFit from first principles. All of the currently available solutions have a limited feature set; to the best of our knowledge these competing solutions cannot (and are not claiming to be able to) replace RooFit.

Instead of causing community splits by the adoption of limited competing tools, the community should invest in the renovation of its existing tools, to benefit from existing expertise and from shared maintenance synergy. ROOT has recently shown that this is extremely beneficial, with accelerations of common RooFit operations by factors of five and beyond [7]. This is crucial for HL-LHC's complex models used in analyses and combinations.

We believe that these requirements can be addressed by engaging and coordinating with developers of community tools, and by providing much needed sustainability. Developments should cover streamlined model building, offloading of computations to accelerators, and increased throughput by bulk processing of data.
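For reference, a minimal RooFit sketch of the kind of model building and fitting discussed here, using toy data and made-up parameter names:

```cpp
// Minimal RooFit sketch: build a Gaussian model, generate toy data, fit it.
// Parameter names and ranges are made up for illustration.
#include <RooRealVar.h>
#include <RooGaussian.h>
#include <RooDataSet.h>
#include <RooArgSet.h>
#include <memory>

void fit_demo() {
   RooRealVar x("x", "observable", -10., 10.);
   RooRealVar mean("mean", "mean", 0., -5., 5.);
   RooRealVar sigma("sigma", "width", 1., 0.1, 5.);
   RooGaussian model("model", "Gaussian PDF", x, mean, sigma);

   // Generate a toy dataset and fit the model back to it:
   std::unique_ptr<RooDataSet> data(model.generate(RooArgSet(x), 10000));
   model.fitTo(*data);

   mean.Print();   // print fitted parameters
   sigma.Print();
}
```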
Many related research elements on the topics above have already started [8], usually centered around ROOT's RDataFrame [9], the recent declarative analysis interface, which has already been adopted by a large number of Run 2 LHC analyses. ROOT proposes to invest in RDataFrame, extending it to handle for instance automatic categorization and derivation of uncertainties, while hiding implementation details and enabling optimizations, data transfer, and scheduling on heterogeneous and distributed compute backends.
After more than 20 years of community investment, ROOT provides much of the common key functionality required by analyses and HEP software. Its key parts have been continuously optimized and measured against alternatives; it has been extended to cover functionality that is of general relevance to the community.

Many of these extensions, such as RooFit or TMVA, were initiated by members of the HEP community. The mechanism enabling this is crucial for HEP and its ability to effectively and inclusively share development. Here, ROOT serves as an open, accessible, and extensible core ingredient. Implementing new features is possible with incremental effort, by extending or replacing parts of the functionality provided by ROOT. This in turn brings such R&D into scope for university groups and their grant requests. Where such R&D endeavors are successful, ROOT can enable adoption in the community as a catalyst and distribution mechanism.
We are convinced that adoption of tools external to HEP can have tremendous benefits. Some require a considerable development effort to make them suitable for HEP, for instance to match existing software ingredients, or to satisfy physicists' expectations and traditions. Other parts of the software ecosystem have properties that are very specific to HEP and best addressed by common HEP software ingredients, for reasons of performance, features, or keeping in-house expertise. Where this is the case, pivotal HEP solutions (such as the cling interpreter or ROOT's data format) should be shared with industry to aid sustainability and to share development efforts.

For ROOT, recent examples of adopting industry tools include zstd compression, NumPy Python array management, the OpenUI5 GUI library, CuDNN as a CUDA machine learning library, and the MIXMAX random number generator. Each of these required effort to provide highly optimized interoperability with existing software. This effort paid off, as these tools were perfect matches for HEP's requirements.
Competition is a prerequisite for progress. It is best created by duplicating parts of the existing ecosystem's functionality and competing in that specific area. This enables smooth integration and adoption by the community. It also allows for benchmarking based on technical merits, by comparing existing functionality with a competing implementation.

The ROOT project sees more and more competition taking a different route, without integration into the existing ecosystem but instead based on external tools. This gets a prototype product out quickly, with minimal investment but without consideration of sustainability; it benefits from extra attention through the use of well-known names; it creates the impression of relevance by benefiting from the external tool's relevance; and it can use the often disputed argument that physicists using the external tool will have higher "market value" in industry (industry seems to value physicists as experts on statistical modeling and data analysis, rather than for their knowledge of any given tool that is currently perceived as state of the art; physicists are best employed as data experts, not tooling engineers). Adoption of these prototypes creates isolated islands of competing technical solutions with limited sustainability and relevance for the community as a whole (see for instance https://github.com/diana-hep/spark-root or https://github.com/diana-hep/rootconverter, which are no longer actively developed). Investment becomes scattered, not for the benefit of all; technology expertise gets lost.

HL-LHC is lucky enough to have strong software projects. They can play a coordinating role for contributions. We have demonstrated for decades that this model is beneficial for contributing institutes ("owning" certain software parts), for the projects, and for the community as a whole.

The community should not develop its own fundamental machine learning tools. It should collaborate with other sciences on improving and growing toolsets. Development efforts should focus on HEP-specific usability and optimization layers for model building and features, such as sculpting, and on interfacing with HEP's optimized ecosystem, such as ROOT I/O for fast inference.

We believe that the community should embrace TMVA as that bridge between ROOT and external machine learning tools such as scikit-learn, XGBoost, TensorFlow, Keras, mxnet, or PyTorch. The community should invest in TMVA so it can provide customized and targeted interfaces for HEP, with sustainability, performance, and ease of use in mind, for instance through production-scale, grid-deployable inference of unrivaled performance. ROOT's just-in-time compilation can offer unique features here, hiding much of the underlying complexity.
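As an example of such a bridge, applying a previously trained classifier with TMVA's reader interface might look as follows; the weight file and variable names are hypothetical.

```cpp
// Minimal sketch: applying a trained MVA with TMVA's Reader interface.
// The weight file and variable names are hypothetical.
#include <TMVA/Reader.h>
#include <cstdio>

void infer_demo() {
   TMVA::Reader reader("!Color:Silent");

   // Variables must be registered in the same order as during training:
   float pt = 0.f, eta = 0.f;
   reader.AddVariable("pt",  &pt);
   reader.AddVariable("eta", &eta);

   // Load a classifier trained beforehand (e.g. a BDT, possibly trained
   // through one of the external libraries interfaced by TMVA):
   reader.BookMVA("BDT", "weights/TMVAClassification_BDT.weights.xml");

   pt = 42.f; eta = 0.3f;                     // one candidate event
   double score = reader.EvaluateMVA("BDT");  // per-event response
   printf("BDT score = %f\n", score);
}
```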
Communication of results is an integral part of any physics analysis, and graphics playa crucial role here. A good visualization engine with good defaults noticeably improvesthe productivity of physicists. A suitable visual language improves the effectiveness ofphysics reviews.
We are convinced that web-based graphics and adoption of external, web-based toolsreduce maintenance load, give the community access to a larger pool of potential de-velopers, and make graphics more sustainable and usable (platform independence, localvs. remote). Web-based graphics allow for trivial embedding for instance in onlinemonitoring applications or web-based analysis tools (”notebooks”).
To our knowledge, no alternative solution offers a comparable feature set. Physicists writing analyses in Python are tempted to use the Python packages matplotlib or seaborn. We see the start of a separation of the community, making investment in graphics (better ROOT graphics, or better integration of matplotlib or seaborn for HEP's purposes) only relevant to a fraction of the community. While ROOT's graphics system addresses multiple usage patterns (application, online, monitoring, analyses, utmost configurability, publication-ready plots), alternatives are mostly used in Python-based analyses. We propose to unify the community again by improving the usability of ROOT's new graphics system to a point where defaults are just right, and the effort of using it (especially from Python) vanishes compared to adjusting alternatives to HEP's needs: doing easy things must be easy, and doing hard things must be readily possible.
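Already today, standard canvases can be rendered through ROOT's web-based graphics backend; a sketch follows (the exact behavior depends on the ROOT version and on the configured web display):

```cpp
// Minimal sketch: rendering standard ROOT graphics in a web browser.
#include <TROOT.h>
#include <TCanvas.h>
#include <TH1F.h>

void web_demo() {
   gROOT->SetWebDisplay();  // route canvases to the web-based graphics backend
   auto h = new TH1F("h", "toy data;x;entries", 64, -4., 4.);
   h->FillRandom("gaus", 10000);
   auto c = new TCanvas("c", "web canvas");
   h->Draw();
   c->Update();             // the canvas appears in the system web browser
}
```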
Visualization becomes even more relevant as the community sets out to refine its "visual language", communicating a high number of uncertainties and correlations in complex models. HL-LHC will require advances in this area; even today's visual communication of results, as seen at the LHC, has progressed to a complexity that motivates a rethinking of some of our visual language. It is well known that alternative solutions have performance issues with complex graphics (see e.g. https://matplotlib.org/3.2.1/tutorials/introductory/usage.html).

Healthy software projects must have a long queue of innovation, ranging from R&D to optimizations. These items are generally motivated by sustainability aspects or by insights from providing support or training. Delivering them improves usability and performance.
Agreeing on common software has prevented segregation within both the analysis and developer communities. It allowed synergy, preventing needless duplication of efforts and making investments available to all experiments to maximize their return. It rationalized software engineering, infrastructure, maintenance, and sustainability costs that would otherwise have been spread and repeated, instead of allowing for synergy effects across projects. It enabled incremental R&D with focused, reasonable effort.

The community's trust and investment in common projects should not come for free: we need competition to measure ourselves against and as an additional source of innovation. But we foresee a schism in the analysis community, centered around the main software players in the field (see e.g. https://indico.fnal.gov/event/21067/contribution/11/material/slides/0.pdf, slide 39). While this started off as "old vs. new", this division now shows multiple facets that favor the spread of uncertainty and misinformation. This carries a cost simply due to the non-technical part of the competition, and causes a fruitless duplication of efforts.

We believe that WLCG / EP-SFT should be strengthened as the community's central hub for common software, computing, and data management solutions. This will enable continued sharing of responsibilities, addressing topics such as sustainability and maintenance, recognition, community-wide adoption, and, very importantly, support. This model is extremely successful to this day, with examples such as ROOT's new histogramming being developed at LAL; ROOT's new graphics system being developed at GSI; ROOT's new event display being developed at UCSD; ROOT's lossy compression R&D taking place at BNL; RooFit being developed at NIKHEF; and ROOT's I/O subsystem being coordinated at Fermilab.

We absolutely encourage competition on a technological basis. Many of ROOT's recent advances came from comparing its performance and usability with that of alternatives. Where appropriate, this resulted in the adoption of the superior tool (for instance from CINT to clang, from zlib to zstd), or in an implementation that was optimized for interplay with other parts (notably I/O, such as for CuDNN or xgboost) or for sustainability (Xtensor, dataframes from the R language).

We believe that ROOT will remain one of the most important software ingredients in HEP for HL-LHC: ROOT's role in the community, its collected and collective expertise, and its ongoing innovation warrant the community's continued trust. ROOT sees significant challenges for HL-LHC workflows such as analyses. We are convinced that they can be solved best by the community investing decisively in common software.

References
[1] Ilka Antcheva et al. "ROOT — A C++ framework for petabyte data storage, statistical analysis and visualization". In: Computer Physics Communications 180.12 (2009), pp. 2499-2512.

[2] Johannes Albrecht et al. (HEP Software Foundation). "A Roadmap for HEP Software and Computing R&D for the 2020s". In: Computing and Software for Big Science 3 (2019), p. 7.

[3] Jakob Blomer. "A quantitative review of data formats for HEP analyses". In: Journal of Physics: Conference Series. Vol. 1085. 3. IOP Publishing, 2018, p. 032020.

[4] Jakob Blomer et al. "Evolution of the ROOT Tree I/O". In: Proceedings for CHEP 2019 (in print). EPJ Web of Conferences, 2020.

[5] Jérôme Lauret et al. "Extreme compression for Large Scale Data store". In: Proceedings for CHEP 2019 (in print). EPJ Web of Conferences, 2020.

[6] Gordon Watts. "Using Functional Languages and Declarative Programming to analyze ROOT data: LINQtoROOT". In: Journal of Physics: Conference Series. Vol. 608. 1. IOP Publishing, 2015, p. 012024.

[7] Stephan Hageboeck. "A Faster, More Intuitive RooFit". In: Proceedings for CHEP 2019 (in print). EPJ Web of Conferences, 2020.

[8] Javier Cervantes Villanueva. "Parallelization and optimization of a High Energy Physics analysis with ROOT's RDataFrame and Spark". PhD thesis. Murcia U.

[9] Danilo Piparo et al. "RDataFrame: Easy Parallel ROOT Analysis at 100 Threads". In: EPJ Web of Conferences. Vol. 214. EDP Sciences, 2019, p. 06029.

[10] Lorenzo Moneta et al. "Machine Learning with ROOT/TMVA". In: Proceedings for CHEP 2019 (in print). EPJ Web of Conferences, 2020.

[11] Kim Albertsson et al. "Fast Inference for Machine Learning in ROOT / TMVA". In: Proceedings for CHEP 2019 (in print). EPJ Web of Conferences, 2020.