Array Programming with NumPy
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, Travis E. Oliphant
June 19, 2020

Abstract

Array programming provides a powerful, compact, expressive syntax for accessing, manipulating, and operating on data in vectors, matrices, and higher-dimensional arrays [1]. NumPy is the primary array programming library for the Python language [2, 3, 4, 5]. It plays an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance, and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves [6] and the first imaging of a black hole [7]. Here we show how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring, and analyzing scientific data. NumPy is the foundation upon which the entire scientific Python universe is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Because of its central position in the ecosystem, NumPy increasingly plays the role of an interoperability layer between these new array computation libraries.
Two Python array packages existed before NumPy. The Numeric package began in the mid-1990s and provided an array object and array-aware functions in Python, written in C, and linking to standard fast implementations of linear algebra [8, 9]. One of its earliest uses was to steer C++ applications for inertial confinement fusion research at Lawrence Livermore National Laboratory [10]. To handle large astronomical images coming from the Hubble Space Telescope, a reimplementation of Numeric, called Numarray, added support for structured arrays, flexible indexing, memory mapping, byte-order variants, more efficient memory use, flexible IEEE error handling capabilities, and better type casting rules [11]. While Numarray was highly compatible with Numeric, the two packages had enough differences that they divided the community until 2005, when NumPy emerged as a "best of both worlds" unification [12], combining Numarray's features with Numeric's performance on small arrays and its rich C Application Programming Interface (API).

Now, fifteen years later, NumPy underpins almost every Python library that does scientific or numerical computation, including SciPy [13], Matplotlib [14], pandas [15], scikit-learn [16], and scikit-image [17]. It is a community-developed, open-source library, which provides a multidimensional Python array object along with array-aware functions that operate on it. Because of its inherent simplicity, the NumPy array is the de facto exchange format for array data in Python.

NumPy operates on in-memory arrays using the CPU. To utilize modern, specialized storage and hardware, there has been a recent proliferation of Python array packages. Unlike with the Numarray and Numeric divide, it is now much harder for these new libraries to fracture the user community, given how much work already builds on top of NumPy. However, to provide the ecosystem with access to new and exploratory technologies, NumPy is transitioning into a central coordinating mechanism that specifies a well-defined array programming API and dispatches it, as appropriate, to specialized array implementations.
NumPy arrays
The NumPy array is a data structure that efficiently stores and accesses multidimensional arrays [18], also known as tensors, and enables a wide variety of scientific computation. It consists of a pointer to memory, along with metadata used to interpret the data stored there, notably data type, shape, and strides (Fig. 1a).

The data type describes the nature of the elements stored in an array. An array has a single data type, and each array element occupies the same number of bytes in memory. Examples of data types include real and complex numbers (of lower and higher precision), strings, timestamps, and pointers to Python objects.

Fig. 1: The NumPy array incorporates several fundamental array concepts. a, The NumPy array data structure and its associated metadata fields. b, Indexing an array with slices and steps. These operations return a view of the original data. c, Indexing an array with masks, scalar coordinates, or other arrays, so that it returns a copy of the original data. In the bottom example, an array is indexed with other arrays; this broadcasts the indexing arguments before performing the lookup. d, Vectorization efficiently applies operations to groups of elements. e, Broadcasting in the multiplication of two-dimensional arrays. f, Reduction operations act along one or more axes. In this example, an array is summed along select axes to produce a vector, or along two axes consecutively to produce a scalar. g, Example NumPy code, illustrating some of these concepts.

The shape of an array determines the number of elements along each axis, and the number of axes is the array's dimensionality. For example, a vector of numbers can be stored as a one-dimensional array of shape N, while color videos are four-dimensional arrays of shape (T, M, N, 3).
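These metadata fields are exposed directly as attributes of every NumPy array. A minimal sketch (the concrete array and its values are illustrative):

```python
import numpy as np

# A 4-by-3 array of double-precision floats.
a = np.zeros((4, 3), dtype=np.float64)

print(a.dtype)     # float64 -- every element shares this single data type
print(a.shape)     # (4, 3)  -- number of elements along each axis
print(a.ndim)      # 2       -- the array's dimensionality
print(a.itemsize)  # 8       -- bytes occupied by each element
```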
Strides are necessary to interpret computer memory, which stores elements linearly, as multidimensional arrays. They describe the number of bytes to move forward in memory to jump from row to row, column to column, and so forth. Consider, for example, a 2-D array of floating-point numbers with shape (4, 3), where each element occupies 8 bytes in memory. To move between consecutive columns, one needs to jump forward 8 bytes, and to access the next row, 3 × 8 = 24 bytes; the strides of that array are therefore (24, 8).

The user interacts with NumPy arrays through indexing (to access subarrays or individual elements), operators (e.g., +, −, and × for vectorized operations, and @ for matrix multiplication), as well as array-aware functions; together, these provide an easily readable, expressive, high-level API for array programming, while NumPy deals with the underlying mechanics of making operations fast.

Indexing an array returns single elements, subarrays, or elements that satisfy a specific condition (Fig. 1b). Arrays can even be indexed using other arrays (Fig. 1c). Wherever possible, indexing that retrieves a subarray returns a view on the original array, such that data is shared between the two arrays. This provides a powerful way to operate on subsets of array data while limiting memory usage.

To complement the array syntax, NumPy includes functions that perform vectorized calculations on arrays, including arithmetic, statistics, and trigonometry (Fig. 1d). Vectorization, operating on whole arrays rather than their individual elements, is essential to array programming. Operations that would take many tens of lines to express in languages such as C can often be implemented as a single, clear Python expression. This results in concise code and frees users to focus on the details of their analysis, while NumPy handles looping over array elements near-optimally, taking into consideration, for example, strides, to best utilize the computer's fast cache memory.

When performing a vectorized operation (such as addition) on two arrays with the same shape, it is clear what should happen. Through broadcasting, NumPy allows the dimensions to differ, while still producing results that appeal to intuition.
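The stride arithmetic and view semantics described above can be sketched as follows (the concrete array is illustrative; the strides assume NumPy's default C memory order):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(4, 3)

# For this C-ordered array, moving to the next row skips 3 * 8 = 24
# bytes and moving to the next column skips 8 bytes.
print(a.strides)   # (24, 8)

# Slicing returns a view: b shares memory with a, so a write
# through b is visible in a.
b = a[1:3, ::2]    # rows 1-2, every second column
b[0, 0] = 99.0
print(a[1, 0])     # 99.0

# Indexing with a mask (or another array) returns a copy instead,
# so modifying the result leaves a untouched.
c = a[a > 50]
c[:] = 0.0
print(a[1, 0])     # still 99.0
```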
A trivial example is the addition of a scalar value to an array, but broadcasting also generalizes to more complex examples such as scaling each column of an array or generating a grid of coordinates. In broadcasting, one or both arrays are virtually duplicated (that is, without copying any data in memory), so that the shapes of the operands match (Fig. 1d). Broadcasting is also applied when an array is indexed using arrays of indices (Fig. 1c).

Other array-aware functions, such as sum, mean, and maximum, perform element-by-element reductions, aggregating results across one, multiple, or all axes of a single array. For example, summing an n-dimensional array over d axes results in an (n − d)-dimensional array (Fig. 1f).

NumPy also includes array-aware functions for creating, reshaping, concatenating, and padding arrays; searching, sorting, and counting data; and reading and writing files. It provides extensive support for generating pseudorandom numbers, includes an assortment of probability distributions, and performs accelerated linear algebra, utilizing one of several backends such as OpenBLAS [19, 20] or Intel MKL optimized for the CPUs at hand.

Fig. 2: NumPy is the base of the scientific Python ecosystem. Essential libraries and projects that depend on NumPy's API gain access to new array implementations that support NumPy's array protocols (Fig. 3).

Altogether, the combination of a simple in-memory array representation, a syntax that closely mimics mathematics, and a variety of array-aware utility functions forms a productive and powerfully expressive array programming language.
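The broadcasting and reduction behaviour described above can be sketched as (the arrays are illustrative):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the scalar and the length-3 row are virtually
# duplicated (no data is copied) so the operand shapes match.
b = a + 1.0                               # scalar added to every element
c = a * np.array([10.0, 20.0, 30.0])      # scales each column

# Reductions aggregate along one or more axes: summing this 2-D
# array over one axis yields a 1-D array; summing over both axes
# yields a scalar.
col_sums = a.sum(axis=0)                  # array([3., 5., 7.])
total = a.sum()                           # 15.0
```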
Scientific Python ecosystem
Python is an open-source, general-purpose, interpreted programming language well-suited to standard programming tasks such as cleaning data, interacting with web resources, and parsing text. Adding fast array operations and linear algebra allows scientists to do all their work within a single language, and one that has the advantage of being famously easy to learn and teach, as witnessed by its adoption as a primary learning language in many universities.

Even though NumPy is not part of Python's standard library, it benefits from a good relationship with the Python developers. Over the years, the Python language has added new features and special syntax so that NumPy would have a more succinct and easier to read array notation. Since it is not part of the standard library, NumPy is able to dictate its own release policies and development patterns.

SciPy and Matplotlib are tightly coupled with NumPy in terms of history, development, and use. SciPy provides fundamental algorithms for scientific computing, including mathematical, scientific, and engineering routines. Matplotlib generates publication-ready figures and visualizations. The combination of NumPy, SciPy, and Matplotlib, together with an advanced interactive environment like IPython [21] or Jupyter [22], provides a solid foundation for array programming in Python. The scientific Python ecosystem (Fig. 2) builds on top of this foundation to provide several widely used technique-specific libraries [16, 17, 23], which in turn underlie numerous domain-specific projects [24, 25, 26, 27, 28, 29]. NumPy, at the base of the ecosystem of array-aware libraries, sets documentation standards, provides array testing infrastructure, and adds build support for Fortran and other compilers.

Many research groups have designed large, complex scientific libraries that add application-specific functionality to the ecosystem.
For example, the eht-imaging library [30], developed by the Event Horizon Telescope collaboration for radio interferometry imaging, analysis, and simulation, relies on many lower-level components of the scientific Python ecosystem. NumPy arrays are used to store and manipulate numerical data at every step in the processing chain: from raw data through calibration and image reconstruction. SciPy supplies tools for general image processing tasks such as filtering and image alignment, while scikit-image, an image processing library that extends SciPy, provides higher-level functionality such as edge filters and Hough transforms. The scipy.optimize module performs mathematical optimization. NetworkX [23], a package for complex network analysis, is used to verify image comparison consistency. Astropy [24, 25] handles standard astronomical file formats and computes time and coordinate transformations. Matplotlib is used to visualize data and to generate the final image of the black hole.

The interactive environment created by the array programming foundation, along with the surrounding ecosystem of tools inside of IPython or Jupyter, is ideally suited to exploratory data analysis. Users fluidly inspect, manipulate, and visualize their data, and rapidly iterate to refine programming statements. These statements are then stitched together into imperative or functional programs, or notebooks containing both computation and narrative. Scientific computing beyond exploratory work is often done in a text editor or an integrated development environment (IDE) such as Spyder. This rich and productive environment has made Python popular for scientific research.

To complement this facility for exploratory work and rapid prototyping, NumPy has developed a culture of employing time-tested software engineering practices to improve collaboration and reduce error [31]. This culture is not only adopted by leaders in the project but also enthusiastically taught to newcomers.
The NumPy team was early in adopting distributed revision control and code review to improve collaboration on code, and continuous testing that runs an extensive battery of automated tests for every proposed change to NumPy. The project also has comprehensive, high-quality documentation, integrated with the source code [32, 33, 34].

This culture of using best practices for producing reliable scientific software has been adopted by the ecosystem of libraries that build on NumPy. For example, in a recent award given by the Royal Astronomical Society to Astropy, they state:
The Astropy Project has provided hundreds of junior scientists with experience in professional-standard software development practices including use of version control, unit testing, code review and issue tracking procedures. This is a vital skill set for modern researchers that is often missing from formal university education in physics or astronomy.
Community members explicitly work to address this lack of formal education through courses and workshops [35, 36, 37].

The recent rapid growth of data science, machine learning, and artificial intelligence has further and dramatically boosted the usage of scientific Python. Examples of its significant application, such as the eht-imaging library, now exist in almost every discipline in the natural and social sciences. These tools have become the primary software environment in many fields. NumPy and its ecosystem are commonly taught in university courses, boot camps, and summer schools, and are the focus of community conferences and workshops worldwide. NumPy and its API have become truly ubiquitous.
Array proliferation and interoperability
NumPy provides in-memory, multidimensional, homogeneously typed (i.e., single pointer and strided) arrays on CPUs. It runs on machines ranging from embedded devices to the world's largest supercomputers, with performance approaching that of compiled languages. For most of its existence, NumPy addressed the vast majority of array computation use cases.

However, scientific datasets now routinely exceed the memory capacity of a single machine and may be stored on multiple machines or in the cloud. In addition, the recent need to accelerate deep learning and artificial intelligence applications has led to the emergence of specialized accelerator hardware, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). Due to its in-memory data model, NumPy is currently unable to utilize such storage and specialized hardware directly. However, both distributed data and the parallel execution of GPUs, TPUs, and FPGAs map well to the paradigm of array programming: a gap, therefore, existed between available modern hardware architectures and the tools necessary to leverage their computational power.

The community's efforts to fill this gap led to a proliferation of new array implementations. For example, each deep learning framework created its own arrays; PyTorch [38], TensorFlow [39], Apache MXNet [40], and JAX arrays all have the capability to run on CPUs and GPUs, in a distributed fashion, utilizing lazy evaluation to allow for additional performance optimizations. SciPy and PyData/Sparse both provide sparse arrays, which typically contain few non-zero values and store only those in memory for efficiency. In addition, there are projects that build on top of NumPy arrays as a data container and extend its capabilities. Distributed arrays are made possible that way by Dask, and labeled arrays, which refer to dimensions of an array by name rather than by index for clarity (compare x[:, 1] with x.loc[:, 'time']), by xarray [41].

Such libraries often mimic the NumPy API, because it lowers the barrier to entry for newcomers and provides the wider community with a stable array programming interface. This, in turn, prevents disruptive schisms like the divergence of Numeric and Numarray. But exploring new ways of working with arrays is experimental by nature and, in fact, several promising libraries, such as Theano and Caffe, have already ceased development. And each time a user decides to try a new technology, they must change import statements and ensure that the new library implements all the parts of the NumPy API they currently use.

Ideally, operating on specialized arrays using NumPy functions or semantics would simply work, so that users could write code once and then benefit from switching between NumPy arrays, GPU arrays, distributed arrays, and so forth, as appropriate. To support array operations between external array objects, NumPy therefore added the capability to act as a central coordination mechanism with a well-specified API (Fig. 2).

To facilitate this interoperability, NumPy provides "protocols" (or contracts of operation) that allow specialized arrays to be passed to NumPy functions (Fig. 3). NumPy, in turn, dispatches operations to the originating library, as required. Over four hundred of the most popular NumPy functions are supported. The protocols are implemented by widely used libraries such as Dask, CuPy, xarray, and PyData/Sparse. Thanks to these developments, users can now, for example, scale their computation from a single machine to distributed systems using Dask.

Fig. 3:
NumPy’s API and array protocols expose new arrays to the ecosystem.
In this example, NumPy's mean function is called on a Dask array. The call succeeds by dispatching to the appropriate library implementation (in this case, Dask) and results in a new Dask array. Compare this code to the example code in Fig. 1g.

The protocols also compose well, allowing users to redeploy NumPy code at scale on distributed, multi-GPU systems via, for instance, CuPy arrays embedded in Dask arrays. Using NumPy's high-level API, users can leverage highly parallel code execution on multiple systems with millions of cores, all with minimal code changes [42].

These array protocols are now a key feature of NumPy, and are expected to only increase in importance. As with the rest of NumPy, we iteratively refine and add protocol designs to improve utility and simplify adoption.
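The dispatch described above is driven by NumPy's __array_function__ protocol. The sketch below defines a hypothetical DiagonalArray class (the class and its internals are illustrative, not part of NumPy or any library mentioned here) that intercepts np.mean with a specialized implementation while falling back to NumPy's own routines, via a dense copy, for everything else:

```python
import numpy as np

class DiagonalArray:
    """Toy array-like class storing only the diagonal of an n-by-n matrix."""

    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array__(self, dtype=None):
        # Dense conversion; used by the fallback path below.
        return self._value * np.eye(self._n, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation,
        # letting the library that owns the array supply a
        # specialized version of the operation.
        if func is np.mean:
            # Mean of an n-by-n matrix with constant diagonal.
            return self._n * self._value / self._n**2
        # Fall back to NumPy's implementation on a dense copy.
        args = tuple(np.asarray(a) if isinstance(a, DiagonalArray) else a
                     for a in args)
        return func(*args, **kwargs)

d = DiagonalArray(5, 2.0)
print(np.mean(d))   # dispatches to DiagonalArray's implementation: 0.4
print(np.sum(d))    # falls back to NumPy on the dense array: 10.0
```

Real implementations such as Dask's register many such overrides, which is how np.mean on a Dask array returns a new Dask array.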
Discussion
NumPy combines the expressive power of array programming, the performance of C, and the readability, usability, and versatility of Python in a mature, well-tested, well-documented, and community-developed library. Libraries in the scientific Python ecosystem provide fast implementations of most important algorithms. Where extreme optimization is warranted, compiled languages and tools such as Cython [43], Numba [44], and Pythran [45], which extend Python and transparently accelerate bottlenecks, can be used. Because of NumPy's simple memory model, it is easy to write low-level, hand-optimized code, usually in C or Fortran, to manipulate NumPy arrays and pass them back to Python. Furthermore, using array protocols, it is possible to utilize the full spectrum of specialized hardware acceleration with minimal changes to existing code.

NumPy was initially developed by students, faculty, and researchers to provide an advanced, open-source array programming library for Python, which was free to use and unencumbered by license servers, dongles, and the like. There was a sense of building something consequential together, for the benefit of many others. Participating in such an endeavor, within a welcoming community of like-minded individuals, held a powerful attraction for many early contributors.

These user-developers frequently had to write code from scratch to solve their own or their colleagues' problems, often in low-level languages that precede Python, like Fortran [46] and C. To them, the advantages of an interactive, high-level array library were evident. The design of this new tool was informed by other powerful interactive programming languages for scientific computing such as Basis [47], Yorick [48], R [49], and APL [50], as well as commercial languages and environments like IDL and MATLAB.

What began as an attempt to add an array object to Python became the foundation of a vibrant ecosystem of tools.
Now, a large amount of scientific work depends on NumPy being correct, fast, and stable. It is no longer a small community project, but core scientific infrastructure.

The developer culture has matured: while initial development was highly informal, NumPy now has a roadmap and a process for proposing and discussing large changes. The project has formal governance structures and is fiscally sponsored by NumFOCUS, a nonprofit that promotes open practices in research, data, and scientific computing. Over the past few years, the project attracted its first funded development, sponsored by the Moore and Sloan Foundations, and received an award as part of the Chan Zuckerberg Initiative's Essential Open Source Software program. With this funding, the project was (and is) able to sustain focus over multiple months to implement substantial new features and improvements. That said, it still depends heavily on contributions made by graduate students and researchers in their free time.

NumPy is no longer just the foundational array library underlying the scientific Python ecosystem, but has also become the standard API for tensor computation and a central coordinating mechanism between array types and technologies in Python. Work continues to expand on and improve these interoperability features.

Over the next decade, we will face several challenges. New devices will be developed, and existing specialized hardware will evolve, to meet diminishing returns on Moore's law. There will be more, and a wider variety of, data science practitioners, a significant proportion of whom will be using NumPy. The scale of scientific data gathering will continue to expand, with the adoption of devices and instruments such as light-sheet microscopes and the Large Synoptic Survey Telescope (LSST) [51].
New generation languages, interpreters, and compilers, such as Rust [52], Julia [53], and LLVM [54], will invent and determine the viability of new concepts and data structures.

Through various mechanisms described in this paper, NumPy is poised to embrace such a changing landscape, and to continue playing a leading role in interactive scientific computation. To do so will require sustained funding from government, academia, and industry. But, importantly, it will also need a new generation of graduate students and other developers to engage, to build a NumPy that meets the needs of the next decade of data science.
References

[1] K. E. Iverson, "Notation as a tool of thought," Communications of the ACM, vol. 23, pp. 444-465, Aug. 1980.
[2] P. F. Dubois, "Python: Batteries included," Computing in Science & Engineering, vol. 9, no. 3, pp. 7-9, 2007.
[3] T. E. Oliphant, "Python for scientific computing," Computing in Science & Engineering, vol. 9, pp. 10-20, May-June 2007.
[4] K. J. Millman and M. Aivazis, "Python for scientists and engineers," Computing in Science & Engineering, vol. 13, no. 2, pp. 9-12, 2011.
[5] F. Pérez, B. E. Granger, and J. D. Hunter, "Python: an ecosystem for scientific computing," Computing in Science & Engineering, vol. 13, no. 2, pp. 13-21, 2011.
[6] B. P. Abbott, R. Abbott, T. Abbott, M. Abernathy, F. Acernese, K. Ackley, et al., "Observation of gravitational waves from a binary black hole merger," Physical Review Letters, vol. 116, no. 6, p. 061102, 2016.
[7] A. A. Chael, M. D. Johnson, R. Narayan, S. S. Doeleman, J. F. Wardle, and K. L. Bouman, "High-resolution linear polarimetric imaging for the Event Horizon Telescope," The Astrophysical Journal, vol. 829, no. 1, p. 11, 2016.
[8] P. F. Dubois, K. Hinsen, and J. Hugunin, "Numerical Python," Computers in Physics, vol. 10, no. 3, pp. 262-267, 1996.
[9] D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. E. Oliphant, "An open source project: Numerical Python," 2001.
[10] T.-Y. Yang, G. Furnish, and P. F. Dubois, "Steering object-oriented scientific computations," in Proceedings of TOOLS USA 97: International Conference on Technology of Object Oriented Systems and Languages, pp. 112-119, IEEE, 1997.
[11] P. Greenfield, J. T. Miller, J. Hsu, and R. L. White, "numarray: A new scientific array package for Python," PyCon DC, 2003.
[12] T. E. Oliphant, Guide to NumPy. Trelgol Publishing USA, 1st ed., 2006.
[13] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, et al., "SciPy 1.0: fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, pp. 261-272, 2020.
[14] J. D. Hunter, "Matplotlib: A 2D graphics environment," Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
[15] W. McKinney, "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference (S. van der Walt and J. Millman, eds.), pp. 51-56, 2010.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[17] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, et al., "scikit-image: image processing in Python," PeerJ, vol. 2, p. e453, 2014.
[18] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: a structure for efficient numerical computation," Computing in Science & Engineering, vol. 13, no. 2, pp. 22-30, 2011.
[19] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, "AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs," in SC'13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1-12, IEEE, 2013.
[20] Z. Xianyi, W. Qian, and Z. Yunquan, "Model-driven level 3 BLAS performance optimization on Loongson 3A processor," pp. 684-691, IEEE, 2012.
[21] F. Pérez and B. E. Granger, "IPython: a system for interactive scientific computing," Computing in Science & Engineering, vol. 9, no. 3, pp. 21-29, 2007.
[22] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, et al., "Jupyter Notebooks: a publishing format for reproducible computational workflows," in Positioning and Power in Academic Publishing: Players, Agents and Agendas (F. Loizides and B. Schmidt, eds.), pp. 87-90, IOS Press, 2016.
[23] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), Pasadena, CA, USA, pp. 11-15, 2008.
[24] Astropy Collaboration, T. P. Robitaille, E. J. Tollerud, P. Greenfield, M. Droettboom, E. Bray, et al., "Astropy: A community Python package for astronomy," Astronomy & Astrophysics, vol. 558, p. A33, Oct. 2013.
[25] A. M. Price-Whelan, B. M. Sipőcz, H. M. Günther, P. L. Lim, S. M. Crawford, S. Conseil, et al., "The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package," The Astronomical Journal, vol. 156, p. 123, Sept. 2018.
[26] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, et al., "Biopython: freely available Python tools for computational molecular biology and bioinformatics," Bioinformatics, vol. 25, no. 11, pp. 1422-1423, 2009.
[27] K. J. Millman and M. Brett, "Analysis of functional Magnetic Resonance Imaging in Python," Computing in Science & Engineering, vol. 9, no. 3, pp. 52-55, 2007.
[28] The SunPy Community, S. J. Mumford, S. Christe, D. Pérez-Suárez, J. Ireland, A. Y. Shih, et al., "SunPy: Python for solar physics," Computational Science and Discovery, vol. 8, p. 014009, Jan. 2015.
[29] J. Hamman, M. Rocklin, and R. Abernathy, "Pangeo: A Big-data Ecosystem for Scalable Earth System Science," in EGU General Assembly Conference Abstracts, p. 12146, Apr. 2018.
[30] A. A. Chael, K. L. Bouman, M. D. Johnson, R. Narayan, S. S. Doeleman, J. F. Wardle, et al., "ehtim: Imaging, analysis, and simulation software for radio interferometry," Astrophysics Source Code Library, 2019.
[31] K. J. Millman and F. Pérez, "Developing open-source scientific practice," in Implementing Reproducible Research, pp. 149-183, CRC Press, Boca Raton, FL, 2014.
[32] S. van der Walt, "The SciPy documentation project (technical overview)," in Proceedings of the 7th Python in Science Conference (SciPy 2008) (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), pp. 27-28, 2008.
[33] J. Harrington, "The SciPy documentation project," in Proceedings of the 7th Python in Science Conference (SciPy 2008) (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), pp. 33-35, 2008.
[34] J. Harrington and D. Goldsmith, "Progress report: NumPy and SciPy documentation in 2009," in
Proceedings of the 8th Python in Sci-ence Conference (SciPy 2009) (G. Varoquaux,S. van der Walt, and K. J. Millman, eds.),pp. 84–87, 2009.[35] G. Wilson, “Software carpentry: Getting scien-tists to write better code by making them moreproductive,”
Computing in Science & Engineer-ing , November–December 2006.[36] J. E. Hannay, H. P. Langtangen, C. MacLeod,D. Pfahl, J. Singer, and G. Wilson, “How doscientists develop and use scientific software?,”in
Proc. 2009 ICSE Workshop on Software En-gineering for Computational Science and Engi-neering , 2009.[37] K. J. Millman, M. Brett, R. Barnowski, andJ.-B. Poline, “Teaching computational repro-ducibility for neuroimaging,”
Frontiers in Neu-roscience , vol. 12, p. 727, 2018. [38] A. Paszke, S. Gross, F. Massa, A. Lerer,J. Bradbury, G. Chanan, T. Killeen, Z. Lin,N. Gimelshein, L. Antiga, A. Desmaison,A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Te-jani, S. Chilamkurthy, B. Steiner, L. Fang,J. Bai, and S. Chintala, “Pytorch: An imper-ative style, high-performance deep learning li-brary,” in
Advances in Neural Information Pro-cessing Systems 32 (H. Wallach, H. Larochelle,A. Beygelzimer, F. d ' Alch´e-Buc, E. Fox, andR. Garnett, eds.), pp. 8024–8035, Curran As-sociates, Inc., 2019.[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo,Z. Chen, C. Citro, G. S. Corrado, A. Davis,J. Dean, M. Devin, et al. , “Tensorflow:Large-scale machine learning on heteroge-neous distributed systems,” arXiv preprintarXiv:1603.04467 , 2016.[40] T. Chen, M. Li, Y. Li, M. Lin, N. Wang,M. Wang, T. Xiao, B. Xu, C. Zhang,and Z. Zhang, “Mxnet: A flexible and ef-ficient machine learning library for hetero-geneous distributed systems,” arXiv preprintarXiv:1512.01274 , 2015.[41] S. Hoyer and J. Hamman, “xarray: N-D labeledarrays and datasets in Python,”
Journal of OpenResearch Software , vol. 5, no. 1, 2017.[42] P. Entschev, “Distributed multi-GPU comput-ing with Dask, CuPy and RAPIDS.” EuroPy-thon 2019, 2019.[43] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin,D. S. Seljebotn, and K. Smith, “Cython: Thebest of both worlds,”
Computing in Science &Engineering , vol. 13, no. 2, pp. 31–39, 2011.[44] S. K. Lam, A. Pitrou, and S. Seibert, “Numba:A LLVM-based Python JIT compiler,” in
Pro-ceedings of the Second Workshop on the LLVMCompiler Infrastructure in HPC , LLVM ’15,(New York, NY, USA), pp. 7:1–7:6, ACM, 2015.[45] S. Guelton, P. Brunet, M. Amini, A. Merlini,X. Corbillon, and A. Raynaud, “Pythran: En-abling static optimization of scientific python11rograms,”
Computational Science & Discovery ,vol. 8, no. 1, p. 014001, 2015.[46] J. Dongarra, G. H. Golub, E. Grosse, C. Moler,and K. Moore, “Netlib and na-net: Buildinga scientific computing community,”
IEEE An-nals of the History of Computing , vol. 30, no. 2,pp. 30–41, 2008.[47] P. F. Dubois, “The basis system,” tech. rep.,Lawrence Livermore National Laboratory, CA(USA), 1989. UCRL-MA-118543, Parts I-VI.[48] D. H. Munro and P. F. Dubois, “Using the yorickinterpreted language,”
Computers in Physics ,vol. 9, no. 6, pp. 609–615, 1995.[49] R. Ihaka and R. Gentleman, “R: a language fordata analysis and graphics,”
Journal of Compu-tational and Graphical Statistics , vol. 5, no. 3,pp. 299–314, 1996.[50] K. E. Iverson, “A programming language,” in
Proceedings of the May 1-3, 1962, Spring JointComputer Conference , pp. 345–351, 1962.[51] T. Jenness, F. Economou, K. Findeisen, F. Her-nandez, J. Hoblitt, K. S. Krughoff, K. Lim,R. H. Lupton, F. Mueller, W. O’Mullane, et al. ,“Lsst data management software developmentpractices and tools,” in
Software and Cyber-infrastructure for Astronomy V , vol. 10707,p. 1070709, International Society for Optics andPhotonics, 2018.[52] N. D. Matsakis and F. S. Klock, “The rust lan-guage,”
Ada Letters , vol. 34, pp. 103–104, Oct.2014.[53] J. Bezanson, A. Edelman, S. Karpinski, andV. B. Shah, “Julia: A fresh approach to numer-ical computing,”
SIAM Review , vol. 59, no. 1,pp. 65–98, 2017.[54] C. Lattner and V. Adve, “LLVM: A compila-tion framework for lifelong program analysis andtransformation,” (San Jose, CA, USA), pp. 75–88, Mar 2004. [55] P. Peterson, “F2PY: a tool for connecting For-tran and Python programs,”
International Jour-nal of Computational Science and Engineering ,vol. 4, no. 4, pp. 296–305, 2009.[56] The NumPy Project Community, “NumPyproject governance,” 2015.[57] The NumPy Project Community, “NumPy codeof conduct,” 2018.[58] D. Holth, “Pep 427 – the wheel binary packageformat 1.0,” 2012.[59] Brett, M. et al, “multibuild,” 2016.[60] B. Griffith, P. Virtanen, N. Smith, M. van Kerk-wijk, and S. Hoyer, “NEP 13 – a mechanism foroverriding ufuncs,” 2013.[61] S. Hoyer, M. Rocklin, M. van Kerkwijk, H. Ab-basi, and E. Wieser, “NEP 18 – a dispatch mech-anism for numpy’s high level array functions,”2018.[62] M. E. O’Neill, “Pcg: A family of simple fastspace-efficient statistically good algorithms forrandom number generation,” Tech. Rep. HMC-CS-2014-0905, Harvey Mudd College, Clare-mont, CA, Sept. 2014.[63] J. K. Salmon, M. A. Moraes, R. O. Dror, andD. E. Shaw, “Parallel random numbers: As easyas 1, 2, 3,” in
Proceedings of 2011 InternationalConference for High Performance Computing,Networking, Storage and Analysis , SC ’11, (NewYork, NY, USA), pp. 16:1–16:12, ACM, 2011.[64] C. Doty-Humphrey, “Practrand, version 0.94.”[65] M. Matsumoto and T. Nishimura, “MersenneTwister: A 623-dimensionally equidistributeduniform pseudo-random number generator,”
ACM Transactions on Modeling and ComputerSimulation , vol. 8, pp. 3–30, Jan. 1998.[66] K. Sheppard, B. Duvenhage, P. de Buyl, andD. A. Ham, “bashtage/randomgen: Release1.16.2,” Apr. 2019.1267] G. Marsaglia and W. W. Tsang, “The zigguratmethod for generating random variables,”
Jour-nal of Statistical Software, Articles , vol. 5, no. 8,pp. 1–7, 2000.[68] D. Lemire, “Fast random integer generation inan interval,”
ACM Transactions on Modelingand Computer Simulation , vol. 29, pp. 1–12, Jan2019.[69] top500, “Top 10 sites for november 2019,” 2019.[70] wikichip, “Astra - supercomputers,” 2019.[71] Wikipedia, “Arm architecture,” 2019.[72] NumPy Developers, “Numpy roadmap,” 2019.[73] Dustin Ingram, “Pep 599 – the manylinux2014platform tag,” 2019.
Methods
We use Git for version control and GitHub as the public hosting service for our official upstream repository (https://github.com/numpy/numpy). We each work in our own copy (or fork) of the project and use the upstream repository as our integration point. To get new code into the upstream repository, we use GitHub's pull request (PR) mechanism. This allows us to review code before integrating it, as well as to run a large number of tests on the modified code to ensure that the changes do not break expected behavior.

We also use GitHub's issue tracking system to collect and triage problems and proposed improvements.
Library organization
Broadly, the NumPy library consists of the following parts: the NumPy array data structure ndarray; the so-called universal functions; a set of library functions for manipulating arrays and doing scientific computation; infrastructure libraries for unit tests and Python package building; and the program f2py for wrapping Fortran code in Python [55]. The ndarray and the universal functions are generally considered the core of the library. In the following, we give a brief summary of these components of the library.
Core.
The ndarray data structure and the universal functions make up the core of NumPy.

The ndarray is the data structure at the heart of NumPy. It stores regularly strided homogeneous data types inside a contiguous block of memory, allowing for the efficient representation of n-dimensional data. More details about the data structure are given in “The NumPy array: a structure for efficient numerical computation” [18].

The universal functions, or more concisely, ufuncs, are functions written in C that implement efficient looping over NumPy arrays. An important feature of ufuncs is the built-in implementation of broadcasting. For example, the function arctan2(y, x) is a ufunc that accepts two values and computes tan⁻¹(y/x). When arrays are passed in as the arguments, the ufunc will take care of looping over the dimensions of the inputs in such a way that if, say, x is a 1-D array with length 3, and y is a 2-D array with shape 2 × 1, the output will be an array with shape 2 × 3. Broadcasting applies equally inside compound expressions such as x + y * z.

Computing libraries.
NumPy provides a large library of functions for array manipulation and scientific computing, including functions for: creating, reshaping, concatenating, and padding arrays; searching, sorting and counting data in arrays; computing elementary statistics, such as the mean, median, variance, and standard deviation; file I/O; and more.

A suite of functions for computing the fast Fourier transform (FFT) and its inverse is provided.

NumPy's linear algebra library includes functions for: solving linear systems of equations; computing various functions of a matrix, including the determinant, the norm, the inverse, and the pseudo-inverse; computing the Cholesky, eigenvalue, and singular value decompositions of a matrix; and more.

The random number generator library in NumPy provides alternative bit stream generators that provide the core function of generating random integers. A higher-level generator class that implements an assortment of probability distributions is provided. It includes the beta, gamma and Weibull distributions, the univariate and multivariate normal distributions, and more.
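As a brief sketch of a few of these pieces in action (the function names are from NumPy's public API; the particular values are illustrative only):

```python
import numpy as np

# A ufunc broadcast: a (3,) array against a (2, 1) array gives a (2, 3) result.
x = np.array([1.0, 2.0, 3.0])
y = np.array([[1.0], [2.0]])
print(np.arctan2(y, x).shape)  # (2, 3)

# Linear algebra: solve the system A @ v = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
v = np.linalg.solve(A, b)      # array([2., 3.])

# The FFT and its inverse form a round trip.
signal = np.array([0.0, 1.0, 0.0, -1.0])
restored = np.fft.ifft(np.fft.fft(signal)).real

# Probability distributions from the higher-level generator class.
rng = np.random.default_rng(12345)
draws = rng.weibull(1.5, size=1000)
print(draws.mean())
```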
Infrastructure libraries.
NumPy provides utilities for writing tests and for building Python packages.

The testing subpackage provides functions such as assert_allclose(actual, desired) that may be used in test suites for code that uses NumPy arrays.

NumPy provides the subpackage distutils, which includes functions and classes to facilitate configuration, installation, and packaging of libraries depending on NumPy. These can be used, for example, when publishing to the PyPI website.
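As a sketch of how these testing helpers are typically used (test_scaling is a hypothetical test function, not part of NumPy):

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

def test_scaling():
    # Exact comparison is appropriate for integer results...
    assert_array_equal(np.arange(3) * 2, np.array([0, 2, 4]))
    # ...while floating-point results are compared within a tolerance.
    A = np.array([[2.0, 0.0], [0.0, 4.0]])
    assert_allclose(np.linalg.inv(A) @ A, np.eye(2), atol=1e-12)

test_scaling()  # raises AssertionError on mismatch; silent on success
```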
F2PY.
The program f2py is a tool for building NumPy-aware Python wrappers of Fortran functions. NumPy itself does not use any Fortran code; F2PY is part of NumPy for historical reasons.
Governance
NumPy adopted an official Governance Document on October 5, 2015 [56]. Project decisions are usually made by consensus of interested contributors. This means that, for most decisions, everyone is entrusted with veto power. A Steering Council, currently composed of 12 members, facilitates this process and oversees daily development of the project by contributing code and reviewing contributions from the community.

NumPy's official Code of Conduct was approved on September 1, 2018 [57]. In brief, we strive to: be open; be empathetic, welcoming, friendly, and patient; be collaborative; be inquisitive; and be careful in the words that we choose. The Code of Conduct also specifies how breaches can be reported and outlines the process for responding to such reports.

Funding
In 2017, NumPy received its first large grants, totaling 1.3M USD, from the Gordon & Betty Moore and the Alfred P. Sloan foundations. Stéfan van der Walt is the PI and manages four programmers working on the project. These two grants focus on addressing the technical debt accrued over the years and on setting in place standards and architecture to encourage more sustainable development.

NumPy received a third grant, for 195K USD, from the Chan Zuckerberg Initiative at the end of 2019, with Ralf Gommers as the PI. This grant focuses on better serving NumPy's large number of beginning to intermediate level users and on growing the community of NumPy contributors. It will also provide support to OpenBLAS, on which NumPy depends for accelerated linear algebra.

Finally, since May 2019 the project receives a small amount annually from Tidelift, which is used to fund things like documentation and website improvements.
Developers
NumPy is currently maintained by a group of 23 contributors with commit rights to the NumPy code base. Of these, 17 maintainers were active in 2019, 4 of whom were paid to work on the project full-time. Additionally, there are a few long-term developers who contributed and maintain specific parts of NumPy but are not officially maintainers.

Over the course of its history, NumPy has attracted PRs by 823 contributors. However, its development relies heavily on a small number of active maintainers, who share more than half of the contributions among themselves.

At a release cycle of about every half year, the five recent releases in the years 2018 and 2019 have averaged about 450 PRs each, with each release attracting more than a hundred new contributors. Figure 4 shows the number of PRs merged into the NumPy master branch. Although the number of PRs being merged fluctuates, the plot indicates an increased number of contributions over the past years.

Community calls
The massive number of scientific Python packages built on NumPy meant that it had an unusually high need for stability. So to guide our development we formalized the feature proposal process and constructed a development roadmap with extensive input and feedback from the community.

Weekly community calls alternate between triage and higher-level discussion. The calls not only involve developers from the community, but provide a venue for vendors and other external groups to provide input. For example, after Intel produced a forked version of NumPy, one of their developers joined a call to discuss community concerns.

(Note that before mid 2011, NumPy development did not happen on github.com. All data provided here is based on the development which happened through GitHub PRs. In some cases contributions by maintainers may not be categorized as such.)

Fig. 4: Number of pull requests merged into the NumPy master branch for each quarter since 2012. The total number of PRs is indicated with the lower blue area showing the portion contributed by current or previous maintainers.
NumPy enhancement proposals
Given the complexity of the codebase and the massive number of projects depending on it, large changes require careful planning and substantial work. NumPy Enhancement Proposals (NEPs) are modeled after Python Enhancement Proposals (PEPs) for “proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python” (see https://numpy.org/neps/nep-0000.html). Since their introduction there have been 19 proposed NEPs: 6 have been implemented, 4 have been accepted and are being implemented, 4 are under consideration, 3 have been deferred or superseded, and 2 have been rejected or withdrawn.

Central role

NumPy plays a central role in building and standardizing much of the scientific Python community infrastructure. NumPy's docstring standard is now widely adopted. We are also now using the NEP system as a way to help coordinate the larger scientific Python community.

Wheels build system
A Python wheel [58] is a standard file format for distributing Python libraries. In addition to Python code, a wheel may include compiled C extensions and other binary data. This is important because many libraries, including NumPy, require a C compiler and other build tools to build the software from the source code, making it difficult for many users to install the software on their own. The introduction of wheels to the Python packaging system has made it much easier for users to install precompiled libraries.

A GitHub repository containing scripts to build NumPy wheels has been configured so that a simple commit to the repository triggers an automated build system that creates NumPy wheels for several computer platforms, including Windows, macOS, and Linux. The wheels are uploaded to a public server and made available for anyone to use. This system makes it easy for users to install precompiled versions of NumPy on these platforms.

The technology that is used to build the wheels evolves continually. At the time this paper is being written, a key component is the multibuild suite of tools developed by Matthew Brett and other developers [59]. Currently, scripts using multibuild are written for the continuous integration platforms Travis-CI (for Linux and macOS) and Appveyor (for Windows).
Recent technical improvements
With the recent infusion of funding and a clear process for coordinating with the developer community, we have been able to tackle a number of important large-scale changes. We highlight two of those below, as well as changes made to our testing infrastructure to support hardware platforms used in large-scale computing.
Array function protocol
A vast number of projects are built on NumPy; these projects are consumers of the NumPy API. Over the last several years, a growing number of projects are providers of a NumPy-like API and array objects targeting audiences with specialized needs beyond NumPy's capabilities. For example, the NumPy API is implemented by several popular tensor computation libraries, including CuPy (https://cupy.chainer.org/), JAX (https://jax.readthedocs.io/en/latest/jax.numpy.html), and Apache MXNet (https://numpy.mxnet.io/). PyTorch (https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html) and TensorFlow provide tensor APIs with NumPy-inspired semantics. It is also implemented in packages that support sparse arrays, such as scipy.sparse and PyData/Sparse. Another notable example is Dask, a library for parallel computing in Python. Dask adopts the NumPy API and therefore presents a familiar interface to existing NumPy users, while adding powerful abilities to parallelize and distribute tasks.

The multitude of specialized projects creates the difficulty that consumers of these NumPy-like APIs write code specific to a single project and do not support all of the above array providers. This is a burden for users relying on the specialized array-like, since a tool they need may not work for them. It also creates challenges for end-users who need to transition from NumPy to a more specialized array. The growing multitude of specialized projects with NumPy-like APIs threatened to again fracture the scientific Python community.

To address these issues NumPy has the goal of providing the fundamental API for interoperability between the various NumPy-like APIs. An earlier step in this direction was the implementation of the __array_ufunc__ protocol in NumPy 1.13, which enabled interoperability for most mathematical functions [60]. In 2019 this was expanded more generally with the introduction of the __array_function__ protocol in NumPy 1.17. These two protocols allow providers of array objects to be interoperable with the NumPy API: their arrays work correctly with almost all NumPy functions [61].
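As a sketch of how the protocol operates (DiagonalArray is a toy class invented here for illustration; the hook signature follows NEP 18 [61]):

```python
import numpy as np

class DiagonalArray:
    """Toy array-like: an n x n matrix with a constant value on the diagonal."""
    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation.
        if func is np.sum:  # the diagonal holds the only nonzero entries
            return self._n * self._value
        return NotImplemented  # defer for functions we do not handle

d = DiagonalArray(5, 2.0)
print(np.sum(d))  # handled by DiagonalArray, not by NumPy: 10.0
```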
For the users relying on specialized array projects, it means that even though much code is written specifically for NumPy arrays and uses the NumPy API as import numpy as np, it can nevertheless work for them. For example, here is how a CuPy GPU array can be passed through NumPy for processing, with all operations being dispatched back to CuPy:

```python
import numpy as np
import cupy as cp

x_gpu = cp.array([1, 2, 3])
y = np.sum(x_gpu)  # dispatched to CuPy; the result stays on the GPU
```

Similarly, user-defined functions composed using NumPy can now be applied to, e.g., multi-node distributed Dask arrays:

```python
import numpy as np
import dask.array as da

def f(x):
    """Function using NumPy API calls"""
    y = np.tensordot(x, x.T)
    return np.mean(np.log(y + 1))

x_local = np.random.random([10000, 10000])            # plain NumPy array
x_dask = da.from_array(x_local, chunks=(1000, 1000))  # distributed Dask array

s_local = f(x_local)          # evaluated eagerly by NumPy
s_dask = f(x_dask).compute()  # the same function, evaluated lazily by Dask
```
Random number generation
The NumPy random module provides pseudorandom numbers from a wide range of distributions. In legacy versions of NumPy, simulated random values are produced by a RandomState object that: handles seeding and state initialization; wraps the core pseudorandom number generator based on a Mersenne Twister implementation (to be precise, the standard 32-bit version of MT19937); interfaces with the underlying code that transforms random bits into variates from other distributions; and supplies a singleton instance exposed in the root of the random module.

The RandomState object makes a compatibility guarantee so that a fixed seed and sequence of function calls produce the same set of values. This guarantee has slowed progress, since improving the underlying code requires extending the API with additional keyword arguments. This guarantee continues to apply to RandomState.

NumPy 1.17 introduced a new API for generating random numbers that uses a more flexible structure that can be extended by libraries or end-users. The new API is built using components that separate the steps required to generate random variates. Pseudorandom bits are generated by a bit generator. These bits are then transformed into variates from complex distributions by a generator. Finally, seeding is handled by an object that produces sequences of high-quality initial values.

Bit generators are simple classes that manage the state of an underlying pseudorandom number generator. NumPy ships with four bit generators. The default bit generator is a 64-bit implementation of the Permuted Congruential Generator [62] (PCG64). The three other bit generators are a 64-bit version of the Philox generator [63] (Philox), Chris Doty-Humphrey's Small Fast Chaotic generator [64] (SFC64), and the 32-bit Mersenne Twister [65] (MT19937), which has been used in older versions of NumPy. Bit generators provide functions, exposed both in Python and C, for generating random integer and floating point numbers.

The Generator consumes one of the bit generators and produces variates from complicated distributions. Many improved methods for generating random variates from common distributions were implemented, including the Ziggurat method for normal, exponential and gamma variates [67], and Lemire's method for bounded random integer generation [68]. The Generator is more similar to the legacy RandomState, and its API is substantially the same. (The randomgen project supplies a wide range of alternative bit generators, such as cryptographic counter-based generators (AESCtr) and generators that expose hardware random number generators (RDRAND) [66].) Generator does not, however, make the same stream guarantee as the RandomState object, and so variates may differ across versions as improved generation algorithms are introduced.

Finally, a SeedSequence is used to initialize a bit generator. The seed sequence can be initialized with no arguments, in which case it reads entropy from a system-dependent provider, or with a user-provided seed. The seed sequence then transforms the initial set of entropy into a sequence of high-quality pseudorandom integers, which can be used to initialize multiple bit generators deterministically. The key feature of a seed sequence is that it can be used to spawn child SeedSequences to initialize multiple distinct bit generators. This capability allows a seed sequence to facilitate large distributed applications where the number of workers required is not known. The sequences generated from the same initial entropy and spawns are fully deterministic to ensure reproducibility.

The three components are combined to construct a complete random number generator:

```python
from numpy.random import (
    Generator,
    PCG64,
    SeedSequence,
)

seq = SeedSequence(1030424547444117993331016959)
pcg = PCG64(seq)
gen = Generator(pcg)
```

This approach retains access to the seed sequence, which can then be used to spawn additional generators:

```python
children = seq.spawn(2)
gen_0 = Generator(PCG64(children[0]))
gen_1 = Generator(PCG64(children[1]))
```
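When this level of control is not needed, NumPy also ships a convenience constructor, np.random.default_rng, which wraps these pieces in a single call (a sketch; the seed value here is arbitrary):

```python
import numpy as np

# default_rng returns a Generator backed by the default bit generator
# (PCG64); called with no argument, it seeds itself from system entropy.
rng = np.random.default_rng(12345)
sample = rng.standard_normal(3)
print(sample.shape)  # (3,)
```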
While this approach retains complete flexibility, the method np.random.default_rng can be used to instantiate a Generator when reproducibility is not needed. (Despite the removal of the compatibility guarantee, simple reproducibility across versions is encouraged, and minor changes that do not produce meaningful performance gains or fix underlying bugs are not generally adopted.)

The final goal of the new API is to improve extensibility.
RandomState is a monolithic object that obscures all of the underlying state and functions. The component architecture is one part of the extensibility improvements. The underlying functions (written in C) which transform the output of a bit generator into other distributions are available for use in CFFI. This allows the same code to be run in both NumPy and dependent libraries that can consume CFFI, e.g., Numba. Both the bit generators and the low-level functions can also be used in C or Cython code.

Testing on multiple architectures
At the time of writing, the two fastest supercomputers in the world, Summit and Sierra, both have IBM POWER9 architectures [69]. In late 2018, Astra, the first ARM-based supercomputer to enter the TOP500 list, went into production [70]. Furthermore, over 100 billion ARM processors have been produced as of 2017 [71], making it the most widely used instruction set architecture in the world.

Clearly there are motivations for a large scientific computing software library to support POWER and ARM architectures. We've extended our continuous integration (CI) testing to include ppc64le (POWER8 on Travis CI) and ARMv8 (on the Shippable service). We also test with the s390x architecture (IBM Z CPUs on Travis CI) so that we can probe the behavior of our library on a big-endian machine. This satisfies one of the major components of improved CI testing laid out in a version of our roadmap [72]: specifically, “CI for more exotic platforms.”

PEP 599 [73] lays out a plan for new Python binary wheel distribution support, manylinux2014, that adds support for a number of architectures supported by the CentOS Alternative Architecture Special Interest Group, including ARMv8, ppc64le, as well as s390x. We are thus well-positioned for a future where provision of binaries on these architectures will be expected for a library at the base of the scientific Python ecosystem.

(As of 1.18.0, the low-level random-generation scenario described above requires access to the NumPy source. Alternative approaches that avoid this extra step are being explored.)
Acknowledgments
We thank Ross Barnowski, Paul Dubois, Michael Eickenberg, and Perry Greenfield, who suggested text and provided helpful feedback on the manuscript. We also thank the many members of the community who provided feedback, submitted bug reports, made improvements to the documentation, code, or website, promoted NumPy's use in their scientific fields, and built the vast ecosystem of tools and libraries around NumPy. We also gratefully acknowledge the Numeric and Numarray developers on whose work we built.

Jim Hugunin wrote Numeric in 1995, while a graduate student at MIT. Hugunin based his package on previous work by Jim Fulton, then working at the US Geological Survey, with input from many others. After he graduated, Paul Dubois at the Lawrence Livermore National Laboratory became the maintainer. Many people contributed to the project, including T.E.O. (a co-author of this paper), David Ascher, Tim Peters, and Konrad Hinsen.

In 1998 the Space Telescope Science Institute started using Python and in 2000 began developing a new array package called Numarray, written almost entirely by Jay Todd Miller, starting from a prototype developed by Perry Greenfield. Other contributors included Richard L. White, J. C. Hsu, Jochen Krupper, and Phil Hodge. The Numeric/Numarray split divided the community, yet ultimately pushed progress much further and faster than would otherwise have been possible.

Shortly after Numarray development started, T.E.O. took over maintenance of Numeric. In 2005, he led the effort and did most of the work to unify Numeric and Numarray, and produce the first version of NumPy.

Eric Jones co-founded (along with T.E.O. and P.P.) the SciPy community, gave early feedback on array implementations, and provided funding and travel support to several community members. Numerous people contributed to the creation and growth of the larger SciPy ecosystem, which gives NumPy much of its value. Others injected new energy and ideas by creating experimental array packages.

K.J.M. and S.J.v.d.W. were funded in part by the Gordon and Betty Moore Foundation through Grant GBMF3834 and by the Alfred P. Sloan Foundation through Grant 2013-10-27 to the University of California, Berkeley. S.J.v.d.W., S.B., M.P., and W.W. were funded in part by the Gordon and Betty Moore Foundation through Grant GBMF5447 and by the Alfred P. Sloan Foundation through Grant G-2017-9960 to the University of California, Berkeley.
Author Contributions Statement
K.J.M. and S.J.v.d.W. composed the manuscript with input from others. S.B., R.G., K.S., W.W., M.B., and T.J.R. contributed text. All authors have contributed significant code, documentation, and/or expertise to the NumPy project. All authors reviewed the manuscript.