Awkward Arrays in Python, C++, and Numba
Jim Pivarski, Peter Elmer, and David Lange
Princeton University
Abstract.
The Awkward Array library has been an important tool for physics analysis in Python since September 2018. However, some interface and implementation issues have been raised in Awkward Array's first year that argue for a reimplementation in C++ and Numba. We describe those issues and the new architecture, and present some examples of how the new interface will look to users. Of particular importance is the separation of kernel functions from data structure management, which allows a C++ implementation and a Numba implementation to share kernel functions, and the algorithm that transforms record-oriented data into columnar Awkward Arrays.

Columnar data structures, in which identically typed data fields are contiguous in memory, are a good fit to physics analysis use-cases. This was recognized as early as 1989 when column-wise ntuples were added to PAW and in 1997 when "splitting" was incorporated in the ROOT file format [1]. In the past decade, with the Google Dremel paper [2], the Parquet file format [3], the Arrow memory interchange format [4], and the inclusion of "ragged tensors" in TensorFlow [5], the significance of hierarchical columnar data structures has been recognized beyond particle physics.

With the exception of the Columnar Objects experiment of T. Mattis et al. [6] and the XND library [7], all of these projects focus on representing, storing, and transmitting columnar data structures, rather than operating on them. Physicists need to apply structure-changing transformations to search for decay topology candidates and other tasks that can change the level of nesting and multiplicity of their data. Operations of this complexity can be defined as a suite of primitives, allowing for NumPy-like convenience in Python [8].

The Awkward Array library [9] was created to provide these operations on array objects that are easily convertible to the other libraries (zero-copy in some cases).
Since its release in September 2018, Awkward Array has become one of the most widely pip-installed packages for particle physics (see Figure 1).

Figure 1. Number of pip-installations per day (smoothed by a 60-day moving average) for popular data analysis libraries (numpy, scipy, pandas, matplotlib) and particle physics libraries (root-numpy, iminuit, rootpy, uproot, awkward, coffea) on operating systems not used for batch jobs (MacOS and Windows).

Feedback from physicists, both from the interviews we reported previously [10] and from private conversations at a series of tutorials, has revealed that physicists appreciate the NumPy-like interface when it's easy to see how an analysis task can be expressed that way, but still need an interface for imperative programming. In addition, some names were poorly chosen, leading to confusion and name-conflicts, and more of the library's internal structure should be hidden from end-users. Also, the original library's pure NumPy implementation has been hard to extend and maintain.

All of these issues argue for a redesign of the library, keeping the core concepts that made it successful, restructuring the internals for maintenance, and presenting a simpler, more uniform interface to the user. This reimplementation project was dubbed "Awkward 1.x" and was completed in March 2020.

The principle of Awkward Array is that an array of any data structure can be constructed from a composition of nodes that each provide one feature. The prototypical example is a jagged array, which represents an array of unequal-length subarrays with an array of integer offsets and a contiguous array of content. If the content is one-dimensional, the jagged array is two-dimensional, where the second dimension has unequal lengths. To make a three-dimensional jagged array (unequal lengths in both inner dimensions), one jagged array node can be used as the content for another. With an appropriate set of generators, any data structure can be assembled.

In the original Awkward Array library, the nodes were Python classes with special methods that NumPy recognizes to pass array-at-a-time operations through the data structure. Although that was an easy way to get started and respond rapidly to users' needs, some operations are difficult to implement using NumPy calls alone.
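The offsets-and-content representation of jagged arrays described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's implementation: the function names are hypothetical, and Python lists stand in for the contiguous buffers that the library would hold.

```python
# Illustrative sketch only: hypothetical helper names, and plain Python
# lists standing in for contiguous array buffers.

def jagged_to_lists(offsets, content):
    """Expand a jagged array (offsets + content) into nested Python lists.

    Sublist i is content[offsets[i]:offsets[i + 1]], so an array of N
    sublists needs N + 1 offsets.
    """
    return [content[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]


def doubly_jagged_to_lists(outer_offsets, inner_offsets, content):
    """Use one jagged node as the content of another (two jagged levels)."""
    inner = jagged_to_lists(inner_offsets, content)
    return jagged_to_lists(outer_offsets, inner)


# One contiguous content buffer and one offsets buffer describe
# three sublists of lengths 3, 0, and 2.
content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]
singly = jagged_to_lists(offsets, content)

# A second offsets buffer groups those three sublists into two outer lists.
outer_offsets = [0, 2, 3]
doubly = doubly_jagged_to_lists(outer_offsets, offsets, content)
```

Note that adding a level of nesting adds only one more offsets buffer; the content buffer is never copied or rearranged.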
For complete generality, Awkward 1.x nodes are implemented as C++ classes, operated upon by specially compiled code.

We can satisfy the need for imperative access by adding Numba [11] extensions to Awkward Array, but this would amount to rewriting the entire library, once in precompiled code (C++), and once in JIT-compiled code (Numba). To ease maintenance burdens, we have separated the code that implements operations from the code that manages data structures. Data structures are implemented twice (in C++ and in Numba), but both implementations call the same suite of operations. In total, there are four layers:

1. High-level user interface in Python, which presents a single awkward.Array class.
2. Nested data structure nodes: C++ classes wrapped in Python with pybind11.
3. Two versions of the data structures, one in C++ and one in Numba.
4. Awkward Array operations in specialized, precompiled code with a pure C interface (can be called from C++ and Numba), called "kernel functions."

With one exception to be discussed in Section 3, all loops that scale with the number of array elements are in the kernel functions layer. All allocation and memory ownership is in the C++ and Numba layer. This separation mimics NumPy itself, which uses Python reference counting to manage array ownership and precompiled code for all operations that scale with the size of the arrays. Also, like CuPy and array libraries for machine learning, adding GPU support would only require a new implementation of the kernel functions, not all layers.

From a data analyst's perspective, the new Awkward Array library has only one important data type, awkward.Array, and a suite of functions operating on that type.

    >>> import awkward as ak
    >>> array = ak.Array([[{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.0, 0.2]}],
    ...                   [], [{"x": 3, "y": [3.0, 0.3, 3.3]}]])
    >>> array
These arrays can be sliced like NumPy arrays, with a mix of integers, slices, arrays of booleans and integers, and jagged arrays of booleans and integers, but for any data structure.

    >>> array["y", [0, 2], :, 1:]
LorentzVectors (addition, boosting, ΔR distances, etc.) provided as methods. However, this feature was implemented using Python class inheritance, which was fragile because new Python objects for the same data are frequently created, and it was easy to lose the necessary superclasses in these transformations.

Figure 2. Structure of the array discussed in Section 2.1: hierarchical layout nodes are wrapped in a single, user-facing awkward.Array. (Diagram nodes: awkward.Array, ListOffsetArray64 with offsets and content, RecordArray with contents["x"] and contents["y"], NumpyArray.)

This feature is implemented in Awkward 1.x by instead applying the interpretation only when creating the high-level wrapper, and keeping track of how to interpret each layout node with JSON-formatted parameters that pass through C++ and Numba. For example,

    >>> class Point(ak.Record):
    ...     def __repr__(self):
    ...         return "Point({} {})".format(self["x"], self["y"])
    >>> ak.namespace["Point"] = Point
    >>> array.layout.content.setparameter("__class__", "Point")
    >>> array.layout.content.setparameter("__str__", "P")
    >>> array
ListOffsetArray32 versus ListOffsetArray64), not for building nested structures.

The use of shared pointers and virtual inheritance might, at first, seem to be a performance bottleneck, but it is not. An operation on an Awkward Array only needs to step through the shared pointers and inheritance that define the data type, which is several to hundreds of nodes at most. The same operation loops over the values in the array, which can number in the billions, in the kernel functions, which involve no smart pointers or inheritance. Thus, optimization efforts should focus on the kernel functions, rather than the C++ layer.

Numba is an opt-in JIT-compiler for a subset of Python, extensively covering NumPy arrays and their operations. Since Numba-compiled code looks like familiar, imperative Python, users can debug algorithms without compilation and only JIT-compile those functions when they are ready to scale up to large datasets.

Numba has an extension mechanism that allows third-party libraries to inform the Numba compiler of new data types. Awkward 1.x uses this extension mechanism to implement Awkward Arrays and their operations in Numba-compiled functions. This is a second implementation of the node data structures and their memory-ownership, but not the kernel functions, which C++ and Numba both call.

All operations that transform arrays are implemented in a suite of kernel functions, which are written in C++ but exported as extern "C". Only C-language features can be used in the function signatures, which excludes classes, dispatch by argument types, and templates. Although this is inconvenient, Numba can only call external C functions, not C++, and thus this is a requirement for C++ and Numba to use the same kernel functions.

Apart from internal template specialization on some argument types, the kernel function implementations also resemble pure C functions because they consist entirely of for loops that fill preallocated arrays (allocated and owned by C++ or Numba).
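The division of labor described above, in which kernels are plain loops that fill preallocated arrays while the data structure layer allocates and owns them, can be sketched as follows. This is a Python stand-in for illustration only: the real kernels are C++ functions exported as extern "C", and the names `kernel_offsets_to_counts` and `ListOffsetNode` are hypothetical, not taken from the library.

```python
# Illustrative sketch only: a Python stand-in for the kernel/data-structure
# split. All names here are hypothetical.

def kernel_offsets_to_counts(tocounts, fromoffsets, length):
    """Kernel: fill the preallocated tocounts with the length of each list.

    The signature uses only flat buffers and integers (C-like), and the
    body is a single loop with no allocation and no knowledge of classes.
    """
    for i in range(length):
        tocounts[i] = fromoffsets[i + 1] - fromoffsets[i]


class ListOffsetNode:
    """Data structure layer: owns buffers and allocates kernel outputs."""

    def __init__(self, offsets, content):
        self.offsets = offsets
        self.content = content

    def counts(self):
        length = len(self.offsets) - 1
        out = [0] * length  # allocation happens here, not in the kernel
        kernel_offsets_to_counts(out, self.offsets, length)
        return out


node = ListOffsetNode([0, 3, 3, 5], [1.1, 2.2, 3.3, 4.4, 5.5])
counts = node.counts()
```

Because the kernel sees only flat buffers and a length, a second data structure layer (as Numba provides in the library) could call the same function without any changes to it.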
Our use of the word "kernel" derives from the fact that this separation between slow bookkeeping in C++ and fast math in simple, C-like code resembles the separation of CPU-bound and GPU-bound code in GPU applications. Thus, the library is already organized in a GPU-friendly way; all that remains is to provide GPU-native implementations of each kernel function.

All foreseeable optimization effort will be focused on the kernel functions, rather than the bookkeeping and interface code in C++, Numba, and Python, with one exception: record-oriented → columnar data transformations discussed in the next section.

3 Record-oriented → columnar

Transformations of Awkward Arrays to and from columnar formats like Arrow and "split" ROOT branches are either single-malloc array copies or zero-copy data casting. Record-oriented data, however, require significant processing to transform into any columnar format.

Such a function, named fromiter in the original Awkward Array library, had many important use-cases. We have therefore moved the fromiter implementation from Python into C++ and observe a 10–20× speed-up for typical data structures (see Figure 3). Any record-oriented → columnar transformation of data whose type is not known at compile-time must include virtual method indirection, so further optimization is only possible for specialized types. This is the exception to the rule that all operations that scale with the size of the dataset must be implemented in kernel functions, because the accumulated arrays are dynamically typed.

Our record-oriented → columnar algorithm discovers the data's type during the data transformation pass.
For example, if a particular field has always been filled with integers, filling it with a floating-point value for the first time triggers a conversion of the previous integer data into floating-point, and the floating-point array is used henceforth. In the same example, later filling that field with a string triggers a replacement of the floating-point array with a tagged union of floating-point and strings (reusing the floating-point array).

Figure 3. Rate of reading listN(float) data from Python objects ("pyobj") and from JSON strings in the old and new Awkward fromiter, from ROOT's old TTree, which only needs the transformation for doubly jagged and above, and from ROOT's new RNTuple, which never needs the transformation. (Vertical axis: rate in millions of floats/sec, higher is better. Categories: flat NumPy, jagged array, doubly jagged, triply jagged.)

Awkward 1.x provides three interfaces to this algorithm: (a) Python objects → Awkward Arrays, like the old fromiter, (b) JSON data → Awkward Arrays using the RapidJSON C++ library's SAX interface, and (c) a builder pattern in which the user can fill individual values via method calls. The latter is the most powerful interface, and it is provided in Python, C++, and Numba. In C++, it would allow Awkward interfaces for established C++ projects, and in Numba, it provides a convenient way for physicists to build complex data structures.

One particularly important special case of record-oriented → columnar transformation is unequal-length lists (of any depth of nesting) of numbers. Many analysis-level ROOT files contain data of this type, though ROOT's TTree serialization only stores singly jagged arrays in a columnar format: deeper levels are record-oriented. Since the data type is partially known, a specialized implementation improves upon both the old and new fromiter, as shown in Figure 3. The more long-term solution, however, is ROOT's new RNTuple serialization, which is columnar at all levels, making this transformation unnecessary.
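The type-discovery behavior described in this section (integers widened to floating-point, then wrapped in a tagged union when a string arrives) can be illustrated with a small pure-Python builder for a single field. This is a simplified sketch under stated assumptions, not the library's algorithm: the class `FieldBuilder`, its attributes, and the tags/index layout of the union are hypothetical choices made for the example, and only integers, floats, and strings are handled.

```python
# Illustrative sketch only: FieldBuilder and its attributes are hypothetical,
# and only int, float, and str values are supported.

class FieldBuilder:
    """Accumulate values for one field, widening its type as needed."""

    def __init__(self):
        self.number_kind = None  # None, "int", or "float"
        self.is_union = False    # becomes True once a string is seen
        self.numbers = []        # column of integers or floats
        self.strings = []        # column of strings
        self.tags = []           # union only: 0 = numbers, 1 = strings
        self.index = []          # union only: position in the tagged column

    def _append_number(self, value):
        if self.is_union:
            self.tags.append(0)
            self.index.append(len(self.numbers))
        self.numbers.append(value)

    def fill(self, value):
        if isinstance(value, str):
            if not self.is_union:
                # First string: wrap the existing number column in a tagged
                # union, reusing that column rather than copying it.
                self.tags = [0] * len(self.numbers)
                self.index = list(range(len(self.numbers)))
                self.is_union = True
            self.tags.append(1)
            self.index.append(len(self.strings))
            self.strings.append(value)
        elif isinstance(value, float):
            if self.number_kind != "float":
                # First float: convert the previously filled integers.
                self.numbers = [float(x) for x in self.numbers]
                self.number_kind = "float"
            self._append_number(value)
        elif isinstance(value, int):
            if self.number_kind is None:
                self.number_kind = "int"
            if self.number_kind == "float":
                value = float(value)
            self._append_number(value)
        else:
            raise TypeError("unsupported value: {!r}".format(value))

    def kind(self):
        if self.is_union:
            return "union[{}, string]".format(self.number_kind)
        return self.number_kind


builder = FieldBuilder()
builder.fill(1)
builder.fill(2)
kind_after_ints = builder.kind()
builder.fill(3.5)
kind_after_float = builder.kind()
builder.fill("four")
builder.fill(5)
kind_after_string = builder.kind()
```

The type is never declared in advance: each call to `fill` either appends to the current column or upgrades the representation before appending, which is why this step needs dynamic dispatch rather than a fixed-type kernel loop.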
The reimplementation of Awkward Array in C++ and Numba provides new features, a more unified interface, and higher performance in some cases. These improvements are being introduced to users as a new library to ease the transition. The old library can still be pip-installed and imported as awkward, whereas the new one is available as awkward1. Uproot is transitioning in a similar way, with a new uproot4 using awkward1 and the original uproot using awkward.

Once adoption of the new libraries increases, they will become defaults by renaming awkward1/uproot4 as awkward/uproot, with the originals becoming awkward0 and uproot3. Thus, legacy scripts (perhaps necessary for a student's graduation) can be kept functional by adding

    import awkward0 as awkward
    import uproot3 as uproot

while new scripts default to the new libraries. We expect this transition to be complete by the end of 2020.

Support for this work was provided by NSF cooperative agreement OAC-1836650 (IRIS-HEP), grant OAC-1450377 (DIANA/HEP) and PHY-1520942 (US-CMS LHC Ops).
References

[1] R. Brun, N. Buncic, and F. Rademakers, http://root.cern.ch/root/HowtoWriteTree.html, retrieved Feb 25, 1997.
[2] S. Melnik, A. Gubarev, J.J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive Analysis of Web-Scale Datasets," Proc. of the 36th Int'l Conf on Very Large Data Bases, pp. 330–339 (2010).
[3] D. Vohra, Apache Parquet. In: Practical Hadoop Ecosystem. Apress, Berkeley, CA (2016).
[4] http://arrow.apache.org, retrieved Feb 20, 2016.
[5] , retrieved Oct 16, 2019.
[6] T. Mattis, J. Henning, P. Rein, R. Hirschfeld, and M. Appeltauer, "Columnar Objects: Improving the Performance of Analytical Applications," ACM Int'l Symp. on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward!), pp. 197–210 (2015).
[7] https://xnd.io, retrieved Aug 16, 2018.
[8] J. Pivarski, J. Nandi, D. Lange, and P. Elmer, "Columnar data processing for HEP analysis," European Physical Journal Web of Conferences , 06026 (2019).
[9] J. Pivarski, "Vectorized processing of nested data," ROOT User's Workshop (2018).
[10] J. Pivarski and P. Elmer, "Nested data structures in array and SIMD frameworks," 19th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (2019).
[11] S. K. Lam, A. Pitrou, S. Seibert, "Numba: a LLVM-based Python JIT compiler," LLVM '15: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC (2015).
[12] https://datashape.readthedocs.io, retrieved Jun 26, 2016.
[13] W. Jakob, J. Rhinelander, and D. Moldovan "pybind11 — Seamless operability between C ++++