Evolution of the ROOT Tree I/O
Jakob Blomer∗, Philippe Canal, Axel Naumann, and Danilo Piparo
CERN, Geneva, Switzerland
Fermilab, Chicago, U.S.
∗ e-mail: [email protected]
Abstract.
The ROOT TTree data format encodes hundreds of petabytes of High-Energy and Nuclear Physics events. Its columnar layout drives rapid analyses, as only those parts ("branches") that are really used in a given analysis need to be read from storage. Its unique feature is the seamless C++ integration, which allows users to directly store their event classes without explicitly defining data schemas. In this contribution, we present the status and plans of the future ROOT 7 event I/O. Along with the ROOT 7 interface modernization, we aim for robust, where possible compile-time safe C++ interfaces to read and write event data. On the performance side, we show first benchmarks using ROOT's new experimental I/O subsystem that combines the best of TTrees with recent advances in columnar data formats. A core ingredient is a strong separation of the high-level logical data layout (C++ classes) from the low-level physical data layout (storage-backed nested vectors of simple types). We show how the new, optimized physical data layout speeds up serialization and deserialization and facilitates parallel, vectorized and bulk operations. This lets ROOT I/O run optimally on the upcoming ultra-fast NVRAM storage devices, as well as file-less storage systems such as object stores.
The data describing a High Energy Physics (HEP) event is typically represented by a record containing variable-length collections of sub-records. An event can, for instance, contain a collection of particles with certain scalar properties (p_t, E, etc.), another collection of jets, a collection of tracks, and so on. A typical physics analysis uses a large number of events but processes only a subset of the available properties. Therefore, ROOT's TTree storage format supports a columnar physical data layout for nested sub-records and collections [1]. Values of a single property of many events (e.g., p_t for events 1 to 1000) are stored consecutively on disk. Thus, only those parts that are required for an analysis need to be read. Similar values are likely to be grouped together, which is beneficial for compression.

More than 1 EB of data is stored in the TTree format. For HEP use cases, the TTree I/O speed and storage efficiency have been shown to be significantly better than those of many industry products [2]. Furthermore, ROOT provides the unique feature of seamless C++ and Python integration where users do not need to write or generate a data schema. Yet, the TTree implementation limits the optimal use of new storage systems and storage device classes, such as object stores and flash memory, and it shows shortcomings when it comes to multi-threaded and GPU-supported analysis tasks and fail-safe APIs.

In this contribution, we present the design and first benchmarks of the RNTuple set of classes. The RNTuple classes provide a new, experimental columnar event I/O system that is backwards-incompatible to TTree both on the file format level and on the API level. Breaking backwards compatibility allows us to use contemporary nomenclature and to design the ROOT event I/O from the ground up for next-generation devices and the increased data rates expected from the HL-LHC.
This section describes key design choices of the RNTuple data format as well as the class design and interfaces.
Compared to the TTree binary data layout, the RNTuple data layout is modestly modernized and borrows some ideas from Apache Arrow [3] (see Figure 1). Data is stored in columns of fundamental types, supporting arbitrarily deeply nested collections (TTree drops the "columnar-ness" for deeply nested collections). Columns are partitioned into compressed pages, typically a few tens of kilobytes in size. As in TTree, clusters are sets of pages that contain all the data of a certain event range. They are typically a few tens of megabytes in size and a natural unit of processing for a single thread or task.

Figure 1 uses the following example event model; each scalar field is stored in a separate column, and C++ collections become offset columns:

struct Event {
   int fId;
   vector<Particle> fPtcls;
};
struct Particle {
   float fE;
   vector<int> fIds;
};

Approximate translation between TTree and RNTuple concepts: Basket ≈ Page, Leaf ≈ Column, Cluster ≈ Cluster.

Figure 1.
Breakdown of the RNTuple data layout into a dataset/file with header, pages, clusters and footer. Each scalar field of the event struct is stored in a separate column.
A collection's representation contains an offset column whose elements indicate the start index within the columns that store the collection content; this allows for random access of individual events. The indexing is local to the cluster, such that clusters can be written in parallel and freely concatenated into a larger data set. This also allows for "fast merging", where several RNTuple files can be concatenated by only adjusting the header and footer. In contrast to TTree, offset pages and value pages are always separated, which should improve the compression ratio (to be confirmed). Integers and floating point numbers in columns are stored in little-endian format (TTree: big-endian) in order to allow for memory mapping of pages on most contemporary architectures. Boolean values, such as trigger bits, are stored as bitmaps (TTree: byte arrays), which improves the compression.

The RNTuple meta-data are stored in a header and a footer. The header contains the schema of the RNTuple; the footer contains the locations of the pages. In contrast to TTree, RNTuple currently does not support row-wise storage. At a later point, we will extend the meta-data with a regularly written checkpoint footer (e.g. every 100 MB) in order to allow for data recovery in case of an application crash during data taking. We will also extend the meta-data with a user-accessible, namespace-scoped map of key-value pairs, such that the experiment data management systems can maintain relevant information (checksums, replica locations, etc.) together with the data.

The pages, header and footer do not necessarily need to be written consecutively in a single file. The container for pages, header and footer can be a ROOT file where data is interleaved with other objects such as histograms. The container can also be an RNTuple bare file or an object store. It is also conceivable to store header and footer in a different file than the pages to avoid backward seeks.
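To make the offset-column mechanism concrete, the following stand-alone sketch (our illustration, not RNTuple code) shows how a per-event std::vector<float> could be split into a value column and a cluster-local offset column, and how a single event is recovered by random access. The SplitCollection class and its members are hypothetical names used only for this example.

#include <cstdint>
#include <vector>

// Illustrative only: a value column plus a cluster-local offset column,
// mimicking how a collection such as std::vector<float> is stored.
struct SplitCollection {
   std::vector<float>         values;   // concatenated collection contents
   std::vector<std::uint32_t> offsets;  // offsets[i] = end index of event i, local to the cluster

   void Append(const std::vector<float> &event) {
      values.insert(values.end(), event.begin(), event.end());
      offsets.push_back(static_cast<std::uint32_t>(values.size()));
   }

   // Random access to event i without touching any other event.
   std::vector<float> Read(std::size_t i) const {
      const std::uint32_t begin = (i == 0) ? 0 : offsets[i - 1];
      const std::uint32_t end   = offsets[i];
      return std::vector<float>(values.begin() + begin, values.begin() + end);
   }
};

Because the offsets restart at every cluster boundary, clusters can be written independently and later concatenated without rewriting any column data.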
The RNTuple class design comprises four layers (see Figure 2). The RNTuple classes make use of templates, such that for simple types (e.g., vectors of floats) that are known at compile time, the compiler can inline a fast path from the highest to the lowest layer without additional value copies or virtual calls.

Figure 2 shows the four layers, from top to bottom:
– Event iteration layer: reading and writing in event loops and through RDataFrame (RNTupleDataSource, RNTupleView, RNTupleReader/RNTupleWriter).
– Logical layer / C++ objects: mapping of C++ types onto columns, e.g. std::vector (RField).
– Primitives layer / simple types: "columns" containing elements of fundamental types (float, int, ...) grouped into (compressed) pages and clusters (RColumn, RColumnElement, RPage).
– Storage layer / byte ranges: RPageStorage, RCluster, RNTupleDescriptor.

Approximate translation between TTree and RNTuple classes: TTree ≈ RNTupleReader/RNTupleWriter, TTreeReader ≈ RNTupleView, TBranch ≈ RField, TBasket ≈ RPage, TTreeCache ≈ RClusterPool.
Figure 2.
Layers and key classes of RNTuple and their approximate counterparts in TTree.
The event iteration layer provides the user-facing interfaces to read and write events, either through RDataFrame [4] or as hand-written event loops. The user interface is presented in more detail in Section 3.

The logical layer splits C++ objects into columns of fundamental types. Its central class is the RField, which provides a C++ template specialization for reading and writing of an I/O-supported type. Currently there is support for boolean, integer and floating point types, std::vector and std::array containers, std::string, std::variant, and user-defined classes with a ROOT dictionary. In the future, we will provide support for additional types (e.g., std::map, std::chrono) and possibly for intra-event object references as a limited form of pointers. While RNTuple limits I/O support to an explicit subset of C++ types, those types are fully composable (e.g., a user-defined class containing a vector of arrays of another user-defined class).

The primitives layer governs the pool of uncompressed and deserialized pages in memory and the representation of fundamental types on disk. For most fundamental types, the memory layout equals the RNTuple on-disk layout. In some circumstances, pages need to be packed and unpacked, for instance in order to store booleans as bitmaps or in order to store floating point values with reduced precision (see the bit-packing sketch below).

The storage layer provides access to the byte ranges containing a page on a physical or virtual device. The storage layer manages compression and reads and writes to and from the I/O device. It also allocates memory for pages in order to allow for direct page mappings. Currently there is support for a storage layer that uses a ROOT file as an RNTuple data container and a storage layer that uses a bare file for comparison and testing. We plan to add another implementation that uses an object store. We will also add virtual storage layers that combine RNTuple data sets, similar to TTree's chains and friend trees.

An RNTuple cluster pool provides I/O scheduling capabilities. The cluster pool spawns an I/O thread that asynchronously preloads upcoming pages of active columns. The cluster pool can linearize, merge and split requests to optimize the read pattern for the storage device at hand (e.g. spinning disk, flash memory, remote server).
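As an illustration of the kind of packing the primitives layer performs, the following minimal sketch packs an in-memory array of booleans into a bitmap and unpacks it again. It demonstrates the idea only; it is not RNTuple's actual packing code, which additionally handles page boundaries, endianness and reduced-precision floats.

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack one boolean per bit; trigger-bit style data compresses much better this way.
std::vector<std::uint8_t> PackBits(const std::vector<bool> &flags) {
   std::vector<std::uint8_t> bitmap((flags.size() + 7) / 8, 0);
   for (std::size_t i = 0; i < flags.size(); ++i) {
      if (flags[i])
         bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
   }
   return bitmap;
}

// Unpack back into one boolean per element for in-memory use.
std::vector<bool> UnpackBits(const std::vector<std::uint8_t> &bitmap, std::size_t n) {
   std::vector<bool> flags(n);
   for (std::size_t i = 0; i < n; ++i)
      flags[i] = (bitmap[i / 8] >> (i % 8)) & 1u;
   return flags;
}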
The RNTuple user-facing API is supposed to be easy to use correctly, so as to minimize the likelihood of application crashes and wrong results. To this end, RNTuple provides an RDataFrame data source so that RDataFrame analysis code can be used unmodified with RNTuple data.

The RNTuple interface for implementing hand-written event loops uses modern standard techniques, including smart pointers, event traversal by C++ iterators and compile-time safety through templated interfaces (see Figure 3). For the type-unsafe interface, a runtime check verifies that the on-disk type and the in-memory type of fields match.

auto ntpl = RNTupleReader::Open("Events", "f.root");
auto viewPt = ntpl->GetView<float>("pt");
for (auto i : ntpl->GetEntryRange()) {
   hist.Fill(viewPt(i));
}

auto model = RNTupleModel::Create();
auto fldPt = model->MakeField<float>("pt");
// Note: there is also a void* based, runtime type-safe API
auto ntpl = RNTupleReader::Open(std::move(model), "Events", "f.root");
for (auto entryId : *ntpl) {
   ntpl->LoadEntry(entryId);
   hist.Fill(*fldPt);
}

Figure 3.
RNTuple interface sketch for reading data. The first listing shows the zero-copy interface, where the memory pointed to by views is managed by RNTuple. The second listing shows the interface that copies values into generated or user-provided memory locations.
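Figure 3 only shows the reading side; for completeness, the following hedged sketch outlines what writing looks like with the same model-based approach. It assumes a writer counterpart to the reader shown in Figure 3 (here RNTupleWriter::Recreate, matching the experimental API at the time of writing); ComputePt and ComputeEnergy are hypothetical user functions, and names or signatures may differ in current ROOT releases.

// Sketch only: writing an RNTuple with the model-based interface.
auto model = RNTupleModel::Create();
auto fldPt = model->MakeField<float>("pt");
auto fldE  = model->MakeField<float>("energy");
auto writer = RNTupleWriter::Recreate(std::move(model), "Events", "f.root");
for (int i = 0; i < 1000; ++i) {
   *fldPt = ComputePt(i);     // ComputePt / ComputeEnergy: hypothetical user functions
   *fldE  = ComputeEnergy(i);
   writer->Fill();            // appends the current field values as one entry
}
// header, pages and footer are committed when 'writer' goes out of scope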
The RNTuple classes are thread-friendly, i.e. multiple threads can safely use their own copy of RNTuple classes to read the same data concurrently. In the future, we envision support for multi-threaded writing (one cluster per thread or task) as well as support for multiple threads reading concurrently from the same range of clusters of an RNTuple. In single-threaded analyses, available idle cores should be used for decompression. We believe that these changes will require very little change to the user-facing API.

Error handling, for instance in case of device faults or malformed input data, is an important aspect of I/O interfaces. While it is often difficult to recover gracefully from I/O errors, the I/O layer should reliably detect errors and produce an error report as close as possible to the root cause. To this end, RNTuple throws C++ exceptions for I/O errors.

At a later point, we intend to add a limited C API for RNTuple in order to facilitate ROOT data being transferred to 3rd-party consumers, such as numpy arrays or machine learning toolkits. To this end, most of RNTuple is implemented not to depend on core ROOT classes, such that a minimal, stand-alone RNTuple I/O library can be built. The functionality of this library will initially be limited to reading simple numerical type fields and vectors thereof.
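As a concrete reading pattern implied by this thread-friendliness, the sketch below gives each thread its own RNTupleReader over a disjoint entry range and surfaces I/O problems via exceptions. This is a usage sketch under stated assumptions, not code from the paper: the ROOT header providing RNTupleReader, the GetNEntries() call for the total entry count, and DoSomething are assumptions or placeholders.

#include <cstdint>
#include <exception>
#include <iostream>
#include <thread>
#include <vector>
// #include <ROOT/RNTuple.hxx>  // assumed header providing the experimental RNTupleReader

void DoSomething(float pt);  // hypothetical analysis code, e.g. filling a thread-local histogram

// Each thread opens its own reader (thread-friendly usage) and processes a
// disjoint slice of entries; I/O errors surface as C++ exceptions.
void ProcessSlice(std::uint64_t first, std::uint64_t last) {
   try {
      auto ntpl = RNTupleReader::Open("Events", "f.root");
      auto viewPt = ntpl->GetView<float>("pt");
      for (std::uint64_t i = first; i < last; ++i)
         DoSomething(viewPt(i));
   } catch (const std::exception &err) {
      std::cerr << "I/O error: " << err.what() << std::endl;
   }
}

void ParallelRead(unsigned nThreads) {
   // GetNEntries() is assumed here for the total number of entries.
   const std::uint64_t nEntries = RNTupleReader::Open("Events", "f.root")->GetNEntries();
   std::vector<std::thread> workers;
   for (unsigned t = 0; t < nThreads; ++t)
      workers.emplace_back(ProcessSlice, nEntries * t / nThreads, nEntries * (t + 1) / nThreads);
   for (auto &w : workers)
      w.join();
}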
Table 1.
Sample analyses for performance evaluation. The main differences are the data model (flat or with collections) and the number of required branches (dense or sparse reading).

LHCb run 1 open data B2HHH: 18 / 26 branches (> 75%)
H1 micro DST [×10]: 16 / 152 branches (~ 10%)
CMS nanoAOD June 2019: 6 / … branches (< … %)

Table 2.
Overview of the benchmarking hardware.

Hardware         Machine 1                        Machine 2
CPU              Xeon Platinum 8260 @ 2.4 GHz     Xeon E5-2630 v3 @ 2.4 GHz
Memory           DDR4 RDIMM 2933 MHz              DDR4 RDIMM 2133 MHz
Optane (NVRAM)   Optane DC 2666 MHz (ext4 / DAX)  —
SSD (flash)      Intel DC P4510, PCIe 3.1         —
Spinning disk    —                                SAS 7200 RPM (RAID1)
Network          —                                1 GbE
In this section, we analyze the RNTuple performance in terms of read throughput and file size for typical, single-threaded analysis tasks. We use three sample analyses for the benchmarks (see Table 1). Each analysis requires a subset of the available event properties, uses some properties to filter events, and calculates an invariant mass from the selected events. The analyses were implemented using both TTree and RNTuple, each variant optimized for best performance with hand-written event loops. Basket/page sizes and cluster sizes are comparable between TTree and RNTuple files. The "LHCb" sample is derived from an LHCb Open Data course [5]. The "H1" sample is derived from the ROOT "H1 analysis" tutorial with the original data cloned ten times. The "CMS" sample is derived from the ROOT "dimuon" tutorial using the 2019 nanoAOD format [6] with simulated data. Two dedicated physical nodes, "machine 1" and "machine 2", are used for running the benchmarks (see Table 2). Both machines run CentOS 7 and have ROOT compiled with gcc 7.3. A third dedicated node runs XRootD in version 4.10 and is configured to hold the data on a RAM disk.

Figure 4 shows the file format efficiency for the input data of the sample analyses. As expected, the TTree and RNTuple efficiency is very similar for the "LHCb" flat data model. For "H1" and "CMS", RNTuple shows significantly better efficiency due to the more efficient storage of collections and boolean values. (Approximately half of the difference in file size could be eliminated by using TTree's experimental kGenerateOffsetMap I/O flag.) Space savings of RNTuple remain even after compression. For the implementation, see https://github.com/jblomer/iotools/tree/ntuple-chep-2019 and the ROOT branch https://github.com/jblomer/root/tree/ntuple-chep-2019.
[Figure 4 panels: average event size (TTree vs. RNTuple) for uncompressed data and for lz4, zstd, zlib and lzma compression; panel titles "Storage Efficiency LHCb Run 1 Open Data B2HHH", "Storage Efficiency H1 micro DST [x10]", and "Storage Efficiency CMS nanoAOD TTJet 13TeV June 2019"; the annotated RNTuple/TTree size ratios range from 66% to 83% (H1) and 69% to 87% (CMS).]
Figure 4.
File size comparison of the sample analysis input data in TTree and RNTuple format with different compression algorithms.

Figure 5 shows the event throughput for running the sample analyses. When reading from warm file system buffers, the performance is dominated by deserialization and decompression. The data deserialization in RNTuple is significantly faster compared to TTree. With stronger compression algorithms, the performance is dominated more by decompression than by deserialization. Still, even for LZMA-compressed data, reading RNTuple data is faster for the H1 and CMS samples.

When reading with cold file system buffers, as shown in the lower half of Figure 5, the performance depends not only on the deserialization and decompression speed but also on the I/O throughput of the device. The additional CPU time spent on strong compression can be more than compensated by a smaller transfer volume. For RNTuple, there is a sweet spot for the recent zstd compression algorithm, in particular when taking into account the smaller file size as compared to zlib and lz4.

Figure 6 compares cold-cache read performance for different, frequently used physical data sources. For the slow devices HDD and 10 GbE, the performance is dominated by the I/O scheduler, i.e. by TTreeCache and RClusterPool, respectively. The I/O scheduler linearizes requests, merges nearby requests, and issues vector reads in order to minimize the overall number of requests sent to a device and the total transfer volume. In these benchmarks, RNTuple's I/O scheduler shows a performance at least as good as the TTreeCache.
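The request linearization and merging described here can be illustrated with a small, generic sketch that is independent of the actual TTreeCache or RClusterPool implementations: outstanding page requests are sorted by offset, and requests that are adjacent or close to each other are merged before being issued, at the cost of reading a few gap bytes.

#include <algorithm>
#include <cstdint>
#include <vector>

struct ByteRange {
   std::uint64_t offset;
   std::uint64_t size;
};

// Illustrative I/O-scheduler coalescing step: sort the outstanding page requests
// and merge requests separated by less than 'maxGap' bytes, so that fewer
// (vector) reads are issued to the device.
std::vector<ByteRange> CoalesceRequests(std::vector<ByteRange> requests, std::uint64_t maxGap) {
   std::sort(requests.begin(), requests.end(),
             [](const ByteRange &a, const ByteRange &b) { return a.offset < b.offset; });
   std::vector<ByteRange> merged;
   for (const auto &r : requests) {
      if (!merged.empty() && r.offset <= merged.back().offset + merged.back().size + maxGap) {
         const auto end = std::max(merged.back().offset + merged.back().size, r.offset + r.size);
         merged.back().size = end - merged.back().offset;
      } else {
         merged.push_back(r);
      }
   }
   return merged;
}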
In contrast to spinning disks, SSDs are inherently parallel devices that benefit from a large queue depth, so that they can read from multiple flash cells concurrently. Figure 7 shows the effect of reading with multiple concurrent streams. To this end, we extend the RNTuple I/O scheduler to read with multiple threads (one stream per thread). Where the read performance is limited by I/O and not by decompression and deserialization, increasing the number of streams can yield another speed improvement of around a factor of 2.5. The gains max out at around 16 streams. The lower gains for the uncompressed LHCb and CMS samples are due to a limitation in the current RNTuple implementation that only preloads a single cluster. It therefore does not provide enough concurrent requests to fill the parallel streams. This limitation will be removed with the implementation of multi-cluster read-ahead. An interesting topic of future work is investigating automatic ways for the I/O scheduler to adjust to the underlying physical hardware.
[Figure 5 panels: read throughput in events/s (TTree vs. RNTuple, with 95% CL error bars) for the LHCb Run 1 Open Data B2HHH, H1 micro DST [x10], and CMS nanoAOD TTJet 13TeV June 2019 samples, uncompressed and with lz4, zstd, zlib and lzma compression; the upper row shows memory-cached (warm buffer) reads, the lower row SSD reads, together with the RNTuple/TTree throughput ratios.]

Figure 5.
Read throughput in events per second on machine 1. The upper half shows the results for warm file system buffers. The lower half shows the results for reading from SSD with a cold cache. See Figure 7 for further improvements for SSDs.

Figure 8 shows the performance when reading RNTuple data from Optane DC NV-RAM. The performance characteristics of NV-RAM are in between those of RAM and SSDs (here, we are not exploiting the non-volatility). In the future, NV-RAM might become a more widespread additional cache layer or be installed as a dedicated performance storage tier, e.g. in analysis facilities. The results show no significant difference between reading from warm file system caches and reading from NV-RAM. As we also do not reach the peak throughput of the NV-RAM modules, the results suggest a bottleneck in the I/O deserialization or plotting part of the analysis run. Further optimizations of the RNTuple I/O path are subject of future work.

Because the RNTuple on-disk layout matches the in-memory layout, we can compare reading data explicitly with POSIX read() and implicitly by memory mapping. On (byte-addressable) NV-RAM and warm file system buffers, both mechanisms yield comparable results. When reading sparsely from SSDs, the RNTuple I/O scheduler optimizations bring a significant performance gain. Further investigation reveals that the I/O scheduling that underpins the memory mapping in Linux issues an order of magnitude more requests to the device than the RNTuple scheduler.
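For reference, the two access paths being compared can be sketched in plain POSIX terms: an explicit pread() into an application buffer versus mapping the page's byte range and letting the kernel fault it in. This is a generic illustration of the comparison, not RNTuple's storage-layer code; the file name, offsets and sizes are placeholders.

#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Explicit read: copy a page's byte range into an application buffer.
std::vector<char> ReadPageExplicit(int fd, off_t offset, size_t nbytes) {
   std::vector<char> buffer(nbytes);
   if (pread(fd, buffer.data(), nbytes, offset) < 0)
      perror("pread");
   return buffer;
}

// Memory mapping: the kernel faults the byte range in on first access;
// the offset must be page-aligned.
const char *MapPage(int fd, off_t offset, size_t nbytes) {
   void *addr = mmap(nullptr, nbytes, PROT_READ, MAP_PRIVATE, fd, offset);
   return (addr == MAP_FAILED) ? nullptr : static_cast<const char *>(addr);
}

int main() {
   int fd = open("f.root", O_RDONLY); // placeholder file name
   if (fd < 0) { perror("open"); return 1; }
   auto page = ReadPageExplicit(fd, 0, 4096);
   const char *mapped = MapPage(fd, 0, 4096);
   // ... deserialize column elements from either page.data() or mapped ...
   if (mapped)
      munmap(const_cast<char *>(mapped), 4096);
   close(fd);
   return 0;
}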
[Figure 6 panels: read throughput in events/s (TTree vs. RNTuple, 95% CL) for the "LHCb", "H1" and "CMS" samples with zstd-compressed data, read from a solid state disk, a spinning disk, and via XRootD and HTTP over a 1 GbE link with 10 ms latency, together with the RNTuple/TTree ratios.]
Figure 6.
Read speed with different bandwidth and latency profiles. SSD benchmarks from machine 1, HDD and HTTP benchmarks from machine 2 connected to a dedicated, third XRootD server. Note that the SSD results on the left-hand side are identical to the SSD/zstd results in Figure 5.

[Figure 7 panels: speed-up with respect to a single stream for the "LHCb", "H1" and "CMS" samples, uncompressed and zstd-compressed, with 95% CL error bars; throughputs of 700 MB/s, 1.2 GB/s and 680 MB/s are annotated; panel title "RNTuple SSD READ throughput using concurrent streams".]
Figure 7.
Full exploitation of SSDs by concurrent streams on machine 1. For comparison, the right-hand side repeats the warm cache results from Figure 5. Single-stream performance is identical to the SSD/zstd results in Figure 5.

[Figure 8 panels: read throughput in events/s for uncompressed data accessed with read() and mmap(), for memory-cached, Optane, and SSD (16 streams) data sources, with 95% CL error bars; samples: LHCb run 1 open data B2HHH, H1 micro DST, CMS nanoAOD TTJet 13TeV June 2019; panel title "RNTuple OPTANE NVDIMM READ throughput, uncompressed data with read() and mmap()".]
Figure 8.
Read performance using Optane DC NV-RAM in "App Direct" mode on machine 1. The NV-RAM block device is formatted with ext4 with DAX optimization. Uncompressed input data is used in order to allow for a comparison between POSIX read() and mmap(). Note that the SSD read() results are identical to the 16-streams result in Figure 7.

In this contribution we presented the design and a first performance evaluation of RNTuple, ROOT's new experimental event I/O system. The RNTuple I/O system is a backwards-incompatible redesign of TTree, based on the many years of experience of the TTree development. It is designed from the ground up to work well in concurrent environments and to optimally support modern storage hardware and systems, such as SSDs, NV-RAM, and object stores.

Our benchmarks suggest that, compared to TTree, RNTuple can yield read speed improvements between a factor of 1.5 and 5 in realistic analysis scenarios, while at the same time reducing data sizes by 10% to 20%. We will gradually move the RNTuple code from a prototype to a ROOT production component. The RNTuple classes are already available as ROOT::Experimental::RNTuple if ROOT is compiled with the root7 cmake option. Tutorials are available to demonstrate the RNTuple functionality. We consider these developments and the associated future R&D topics essential building blocks for coping with data rates at the HL-LHC.
Acknowledgements
We would like to thank Fons Rademakers and Luca Atzori from CERN openlab for giving us access to NV-RAM devices. We would like to thank Dirk Düllmann and Michal Simon from CERN IT for providing us an XRootD test node. We would like to thank Oksana Shadura, Brian Bockelman, and Jim Pivarski for many fruitful discussions and suggestions.
References
[1] R. Brun, F. Rademakers, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 389, 81 (1997)
[2] J. Blomer, Journal of Physics: Conference Series (2018)
[3] The Apache Software Foundation, Apache Arrow (2019), https://arrow.apache.org
[4] G. Amadio, J. Blomer, P. Canal, G. Ganis, E. Guiraud, P.M. Vila, L. Moneta, D. Piparo, E. Tejedor, X.V. Pla, Journal of Physics: Conference Series (2018)
[5] A. Rogozhnikov, A. Ustyuzhanin, C. Parkes, D. Derkach, M. Litwinski, M. Gersabeck, S. Amerio, S. Dallmeier-Tiessen, T. Head, G. Gilliver (2016), talk at the 22nd Int. Conf. on Computing in High Energy Physics (CHEP'16)
[6] A. Rizzi, G. Petrucciani, M. Peruzzi, EPJ Web Conf. 214 (2019)