Array Programming with NumPy
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, Travis E. Oliphant
June 19, 2020

Abstract

Array programming provides a powerful, compact, expressive syntax for accessing, manipulating, and operating on data in vectors, matrices, and higher-dimensional arrays [1]. NumPy is the primary array programming library for the Python language [2, 3, 4, 5]. It plays an essential role in research analysis pipelines in fields as diverse as physics, chemistry, astronomy, geoscience, biology, psychology, materials science, engineering, finance, and economics. For example, in astronomy, NumPy was an important part of the software stack used in the discovery of gravitational waves [6] and the first imaging of a black hole [7]. Here we show how a few fundamental array concepts lead to a simple and powerful programming paradigm for organizing, exploring, and analyzing scientific data. NumPy is the foundation upon which the entire scientific Python universe is constructed. It is so pervasive that several projects, targeting audiences with specialized needs, have developed their own NumPy-like interfaces and array objects. Because of its central position in the ecosystem, NumPy increasingly plays the role of an interoperability layer between these new array computation libraries.
Two Python array packages existed before NumPy. The Numeric package began in the mid-1990s and provided an array object and array-aware functions in Python, written in C, and linking to standard fast implementations of linear algebra [8, 9]. One of its earliest uses was to steer C++ applications for inertial confinement fusion research at Lawrence Livermore National Laboratory [10]. To handle large astronomical images coming from the Hubble Space Telescope, a reimplementation of Numeric, called Numarray, added support for structured arrays, flexible indexing, memory mapping, byte-order variants, more efficient memory use, flexible IEEE error handling capabilities, and better type casting rules [11]. While Numarray was highly compatible with Numeric, the two packages had enough differences that they divided the community until 2005, when NumPy emerged as a "best of both worlds" unification [12], combining Numarray's features with Numeric's performance on small arrays and its rich C Application Programming Interface (API).

Now, fifteen years later, NumPy underpins almost every Python library that does scientific or numerical computation, including SciPy [13], Matplotlib [14], pandas [15], scikit-learn [16], and scikit-image [17]. It is a community-developed, open-source library, which provides a multidimensional Python array object along with array-aware functions that operate on it. Because of its inherent simplicity, the NumPy array is the de facto exchange format for array data in Python.

NumPy operates on in-memory arrays using the CPU. To utilize modern, specialized storage and hardware, there has been a recent proliferation of Python array packages. Unlike with the Numarray and Numeric divide, it is now much harder for these new libraries to fracture the user community, given how much work already builds on top of NumPy. However, to provide the ecosystem with access to new and exploratory technologies, NumPy is transitioning into a central coordinating mechanism that specifies a well-defined array programming API and dispatches it, as appropriate, to specialized array implementations.
NumPy arrays
The NumPy array is a data structure that efficiently stores and accesses multidimensional arrays [18], also known as tensors, and enables a wide variety of scientific computation. It consists of a pointer to memory, along with metadata used to interpret the data stored there, notably data type, shape, and strides (Fig. 1a).

The data type describes the nature of the elements stored in an array. An array has a single data type, and each array element occupies the same number of bytes in memory. Examples of data types include real and complex numbers (of lower and higher precision), strings, timestamps, and pointers to Python objects.

Fig. 1: The NumPy array incorporates several fundamental array concepts. a, The NumPy array data structure and its associated metadata fields. b, Indexing an array with slices and steps. These operations return a view of the original data. c, Indexing an array with masks, scalar coordinates, or other arrays, so that it returns a copy of the original data. In the bottom example, an array is indexed with other arrays; this broadcasts the indexing arguments before performing the lookup. d, Vectorization efficiently applies operations to groups of elements. e, Broadcasting in the multiplication of two-dimensional arrays. f, Reduction operations act along one or more axes. In this example, an array is summed along select axes to produce a vector, or along two axes consecutively to produce a scalar. g, Example NumPy code, illustrating some of these concepts.

The shape of an array determines the number of elements along each axis, and the number of axes is the array's dimensionality. For example, a vector of numbers can be stored as a one-dimensional array of shape N, while color videos are four-dimensional arrays of shape (T, M, N, 3).
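These metadata fields are exposed directly as attributes of every NumPy array. A minimal sketch (the concrete array and its values are illustrative):

```python
import numpy as np

# A 4-by-3 array of double-precision floats.
a = np.zeros((4, 3), dtype=np.float64)

print(a.dtype)     # float64 -- every element shares this single data type
print(a.shape)     # (4, 3)  -- number of elements along each axis
print(a.ndim)      # 2       -- the array's dimensionality
print(a.itemsize)  # 8       -- bytes occupied by each element
```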
Strides are necessary to interpret computer memory, which stores elements linearly, as multidimensional arrays. They describe the number of bytes to move forward in memory to jump from row to row, column to column, and so forth. Consider, for example, a 2-D array of floating-point numbers with shape (4, 3), where each element occupies 8 bytes in memory. To move between consecutive columns, one needs to jump forward 8 bytes, and to access the next row, 3 × 8 = 24 bytes; the strides of that array are therefore (24, 8).

The user interacts with NumPy arrays through indexing (to access subarrays or individual elements), operators (e.g., +, −, and × for vectorized operations, and @ for matrix multiplication), as well as array-aware functions; together, these provide an easily readable, expressive, high-level API for array programming, while NumPy deals with the underlying mechanics of making operations fast.

Indexing an array returns single elements, subarrays, or elements that satisfy a specific condition (Fig. 1b). Arrays can even be indexed using other arrays (Fig. 1c). Wherever possible, indexing that retrieves a subarray returns a view on the original array, such that data is shared between the two arrays. This provides a powerful way to operate on subsets of array data while limiting memory usage.

To complement the array syntax, NumPy includes functions that perform vectorized calculations on arrays, including arithmetic, statistics, and trigonometry (Fig. 1d). Vectorization, operating on whole arrays rather than their individual elements, is essential to array programming. Operations that would take many tens of lines to express in languages such as C can often be implemented as a single, clear Python expression. This results in concise code and frees users to focus on the details of their analysis, while NumPy handles looping over array elements near-optimally, taking into consideration, for example, strides, to best utilize the computer's fast cache memory.

When performing a vectorized operation (such as addition) on two arrays with the same shape, it is clear what should happen. Through broadcasting, NumPy allows the dimensions to differ, while still producing results that appeal to intuition.
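The stride arithmetic and view semantics described above can be sketched as follows (the concrete array is illustrative; the strides assume NumPy's default C memory order):

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(4, 3)

# For this C-ordered array, moving to the next row skips 3 * 8 = 24
# bytes and moving to the next column skips 8 bytes.
print(a.strides)   # (24, 8)

# Slicing returns a view: b shares memory with a, so a write
# through b is visible in a.
b = a[1:3, ::2]    # rows 1-2, every second column
b[0, 0] = 99.0
print(a[1, 0])     # 99.0

# Indexing with a mask (or another array) returns a copy instead,
# so modifying the result leaves a untouched.
c = a[a > 50]
c[:] = 0.0
print(a[1, 0])     # still 99.0
```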
A trivial example is the addition of a scalar value to an array, but broadcasting also generalizes to more complex examples such as scaling each column of an array or generating a grid of coordinates. In broadcasting, one or both arrays are virtually duplicated (that is, without copying any data in memory), so that the shapes of the operands match (Fig. 1d). Broadcasting is also applied when an array is indexed using arrays of indices (Fig. 1c).

Other array-aware functions, such as sum, mean, and maximum, perform element-by-element reductions, aggregating results across one, multiple, or all axes of a single array. For example, summing an n-dimensional array over d axes results in an (n − d)-dimensional array (Fig. 1f).

NumPy also includes array-aware functions for creating, reshaping, concatenating, and padding arrays; searching, sorting, and counting data; and reading and writing files. It provides extensive support for generating pseudorandom numbers, includes an assortment of probability distributions, and performs accelerated linear algebra, utilizing one of several backends such as OpenBLAS [19, 20] or Intel MKL optimized for the CPUs at hand.

Fig. 2: NumPy is the base of the scientific Python ecosystem. Essential libraries and projects that depend on NumPy's API gain access to new array implementations that support NumPy's array protocols (Fig. 3).

Altogether, the combination of a simple in-memory array representation, a syntax that closely mimics mathematics, and a variety of array-aware utility functions forms a productive and powerfully expressive array programming language.
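The broadcasting and reduction behaviour described above can be sketched as (the arrays are illustrative):

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]

# Broadcasting: the scalar and the length-3 row are virtually
# duplicated (no data is copied) so the operand shapes match.
b = a + 1.0                               # scalar added to every element
c = a * np.array([10.0, 20.0, 30.0])      # scales each column

# Reductions aggregate along one or more axes: summing this 2-D
# array over one axis yields a 1-D array; summing over both axes
# yields a scalar.
col_sums = a.sum(axis=0)                  # array([3., 5., 7.])
total = a.sum()                           # 15.0
```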
Scientific Python ecosystem
Python is an open-source, general-purpose, interpreted programming language well-suited to standard programming tasks such as cleaning data, interacting with web resources, and parsing text. Adding fast array operations and linear algebra allows scientists to do all their work within a single language, and one that has the advantage of being famously easy to learn and teach, as witnessed by its adoption as a primary learning language in many universities.

Even though NumPy is not part of Python's standard library, it benefits from a good relationship with the Python developers. Over the years, the Python language has added new features and special syntax so that NumPy would have a more succinct and easier to read array notation. Since it is not part of the standard library, NumPy is able to dictate its own release policies and development patterns.

SciPy and Matplotlib are tightly coupled with NumPy in terms of history, development, and use. SciPy provides fundamental algorithms for scientific computing, including mathematical, scientific, and engineering routines. Matplotlib generates publication-ready figures and visualizations. The combination of NumPy, SciPy, and Matplotlib, together with an advanced interactive environment like IPython [21] or Jupyter [22], provides a solid foundation for array programming in Python. The scientific Python ecosystem (Fig. 2) builds on top of this foundation to provide several widely used technique-specific libraries [16, 17, 23], which in turn underlie numerous domain-specific projects [24, 25, 26, 27, 28, 29]. NumPy, at the base of the ecosystem of array-aware libraries, sets documentation standards, provides array testing infrastructure, and adds build support for Fortran and other compilers.

Many research groups have designed large, complex scientific libraries that add application-specific functionality to the ecosystem.
For example, the eht-imaging library [30], developed by the Event Horizon Telescope collaboration for radio interferometry imaging, analysis, and simulation, relies on many lower-level components of the scientific Python ecosystem. NumPy arrays are used to store and manipulate numerical data at every step in the processing chain: from raw data through calibration and image reconstruction. SciPy supplies tools for general image processing tasks such as filtering and image alignment, while scikit-image, an image processing library that extends SciPy, provides higher-level functionality such as edge filters and Hough transforms. The scipy.optimize module performs mathematical optimization. NetworkX [23], a package for complex network analysis, is used to verify image comparison consistency. Astropy [24, 25] handles standard astronomical file formats and computes time and coordinate transformations. Matplotlib is used to visualize data and to generate the final image of the black hole.

The interactive environment created by the array programming foundation, along with the surrounding ecosystem of tools inside of IPython or Jupyter, is ideally suited to exploratory data analysis. Users fluidly inspect, manipulate, and visualize their data, and rapidly iterate to refine programming statements. These statements are then stitched together into imperative or functional programs, or notebooks containing both computation and narrative. Scientific computing beyond exploratory work is often done in a text editor or an integrated development environment (IDE) such as Spyder. This rich and productive environment has made Python popular for scientific research.

To complement this facility for exploratory work and rapid prototyping, NumPy has developed a culture of employing time-tested software engineering practices to improve collaboration and reduce error [31]. This culture is not only adopted by leaders in the project but also enthusiastically taught to newcomers.
The NumPy team was early in adopting distributed revision control and code review to improve collaboration on code, and continuous testing that runs an extensive battery of automated tests for every proposed change to NumPy. The project also has comprehensive, high-quality documentation, integrated with the source code [32, 33, 34].

This culture of using best practices for producing reliable scientific software has been adopted by the ecosystem of libraries that build on NumPy. For example, in a recent award given by the Royal Astronomical Society to Astropy, they state:
The Astropy Project has provided hundreds of junior scientists with experience in professional-standard software development practices including use of version control, unit testing, code review and issue tracking procedures. This is a vital skill set for modern researchers that is often missing from formal university education in physics or astronomy.
Community members explicitly work to address this lack of formal education through courses and workshops [35, 36, 37].

The recent rapid growth of data science, machine learning, and artificial intelligence has further and dramatically boosted the usage of scientific Python. Examples of its significant application, such as the eht-imaging library, now exist in almost every discipline in the natural and social sciences. These tools have become the primary software environment in many fields. NumPy and its ecosystem are commonly taught in university courses, boot camps, and summer schools, and are the focus of community conferences and workshops worldwide. NumPy and its API have become truly ubiquitous.
Array proliferation and interoperability
NumPy provides in-memory, multidimensional, homogeneously typed (i.e., single pointer and strided) arrays on CPUs. It runs on machines ranging from embedded devices to the world's largest supercomputers, with performance approaching that of compiled languages. For most of its existence, NumPy addressed the vast majority of array computation use cases.

However, scientific datasets now routinely exceed the memory capacity of a single machine and may be stored on multiple machines or in the cloud. In addition, the recent need to accelerate deep learning and artificial intelligence applications has led to the emergence of specialized accelerator hardware, including graphics processing units (GPUs), tensor processing units (TPUs), and field-programmable gate arrays (FPGAs). Due to its in-memory data model, NumPy is currently unable to utilize such storage and specialized hardware directly. However, both distributed data and the parallel execution of GPUs, TPUs, and FPGAs map well to the paradigm of array programming: a gap, therefore, existed between available modern hardware architectures and the tools necessary to leverage their computational power.

The community's efforts to fill this gap led to a proliferation of new array implementations. For example, each deep learning framework created its own arrays; PyTorch [38], TensorFlow [39], Apache MXNet [40], and JAX arrays all have the capability to run on CPUs and GPUs, in a distributed fashion, utilizing lazy evaluation to allow for additional performance optimizations. SciPy and PyData/Sparse both provide sparse arrays, which typically contain few non-zero values and store only those in memory for efficiency. In addition, there are projects that build on top of NumPy arrays as a data container and extend its capabilities. Distributed arrays are made possible that way by Dask, and labeled arrays, which refer to dimensions of an array by name rather than by index for clarity (compare x[:, 1] with x.loc[:, 'time']), by xarray [41].

Such libraries often mimic the NumPy API, because it lowers the barrier to entry for newcomers and provides the wider community with a stable array programming interface. This, in turn, prevents disruptive schisms like the divergence of Numeric and Numarray. But exploring new ways of working with arrays is experimental by nature and, in fact, several promising libraries, such as Theano and Caffe, have already ceased development. And each time a user decides to try a new technology, they must change import statements and ensure that the new library implements all the parts of the NumPy API they currently use.

Ideally, operating on specialized arrays using NumPy functions or semantics would simply work, so that users could write code once and then benefit from switching between NumPy arrays, GPU arrays, distributed arrays, and so forth, as appropriate. To support array operations between external array objects, NumPy therefore added the capability to act as a central coordination mechanism with a well-specified API (Fig. 2).

To facilitate this interoperability, NumPy provides "protocols" (or contracts of operation) that allow specialized arrays to be passed to NumPy functions (Fig. 3). NumPy, in turn, dispatches operations to the originating library, as required. Over four hundred of the most popular NumPy functions are supported. The protocols are implemented by widely used libraries such as Dask, CuPy, xarray, and PyData/Sparse. Thanks to these developments, users can now, for example, scale their computation from a single machine to distributed systems using Dask.

Fig. 3:
NumPy’s API and array protocols expose new arrays to the ecosystem.
In this example, NumPy's mean function is called on a Dask array. The call succeeds by dispatching to the appropriate library implementation (in this case, Dask) and results in a new Dask array. Compare this code to the example code in Fig. 1g.

The protocols also compose well, allowing users to redeploy NumPy code at scale on distributed, multi-GPU systems via, for instance, CuPy arrays embedded in Dask arrays. Using NumPy's high-level API, users can leverage highly parallel code execution on multiple systems with millions of cores, all with minimal code changes [42].

These array protocols are now a key feature of NumPy, and are expected to only increase in importance. As with the rest of NumPy, we iteratively refine and add protocol designs to improve utility and simplify adoption.
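The dispatch described above is driven by NumPy's __array_function__ protocol. The sketch below defines a hypothetical DiagonalArray class (the class and its internals are illustrative, not part of NumPy or any library mentioned here) that intercepts np.mean with a specialized implementation while falling back to NumPy's own routines, via a dense copy, for everything else:

```python
import numpy as np

class DiagonalArray:
    """Toy array-like class storing only the diagonal of an n-by-n matrix."""

    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array__(self, dtype=None):
        # Dense conversion; used by the fallback path below.
        return self._value * np.eye(self._n, dtype=dtype)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation,
        # letting the library that owns the array supply a
        # specialized version of the operation.
        if func is np.mean:
            # Mean of an n-by-n matrix with constant diagonal.
            return self._n * self._value / self._n**2
        # Fall back to NumPy's implementation on a dense copy.
        args = tuple(np.asarray(a) if isinstance(a, DiagonalArray) else a
                     for a in args)
        return func(*args, **kwargs)

d = DiagonalArray(5, 2.0)
print(np.mean(d))   # dispatches to DiagonalArray's implementation: 0.4
print(np.sum(d))    # falls back to NumPy on the dense array: 10.0
```

Real implementations such as Dask's register many such overrides, which is how np.mean on a Dask array returns a new Dask array.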
Discussion
NumPy combines the expressive power of array programming, the performance of C, and the readability, usability, and versatility of Python in a mature, well-tested, well-documented, and community-developed library. Libraries in the scientific Python ecosystem provide fast implementations of most important algorithms. Where extreme optimization is warranted, compiled languages and tools such as Cython [43], Numba [44], and Pythran [45], which extend Python and transparently accelerate bottlenecks, can be used. Because of NumPy's simple memory model, it is easy to write low-level, hand-optimized code, usually in C or Fortran, to manipulate NumPy arrays and pass them back to Python. Furthermore, using array protocols, it is possible to utilize the full spectrum of specialized hardware acceleration with minimal changes to existing code.

NumPy was initially developed by students, faculty, and researchers to provide an advanced, open-source array programming library for Python, which was free to use and unencumbered by license servers, dongles, and the like. There was a sense of building something consequential together, for the benefit of many others. Participating in such an endeavor, within a welcoming community of like-minded individuals, held a powerful attraction for many early contributors.

These user-developers frequently had to write code from scratch to solve their own or their colleagues' problems, often in low-level languages that precede Python, like Fortran [46] and C. To them, the advantages of an interactive, high-level array library were evident. The design of this new tool was informed by other powerful interactive programming languages for scientific computing such as Basis [47], Yorick [48], R [49], and APL [50], as well as commercial languages and environments like IDL and MATLAB.

What began as an attempt to add an array object to Python became the foundation of a vibrant ecosystem of tools.
Now, a large amount of scientific work depends on NumPy being correct, fast, and stable. It is no longer a small community project, but core scientific infrastructure.

The developer culture has matured: while initial development was highly informal, NumPy now has a roadmap and a process for proposing and discussing large changes. The project has formal governance structures and is fiscally sponsored by NumFOCUS, a nonprofit that promotes open practices in research, data, and scientific computing. Over the past few years, the project attracted its first funded development, sponsored by the Moore and Sloan Foundations, and received an award as part of the Chan Zuckerberg Initiative's Essential Open Source Software program. With this funding, the project was (and is) able to sustain focus over multiple months to implement substantial new features and improvements. That said, it still depends heavily on contributions made by graduate students and researchers in their free time.

NumPy is no longer just the foundational array library underlying the scientific Python ecosystem, but has also become the standard API for tensor computation and a central coordinating mechanism between array types and technologies in Python. Work continues to expand on and improve these interoperability features.

Over the next decade, we will face several challenges. New devices will be developed, and existing specialized hardware will evolve, to meet diminishing returns on Moore's law. There will be more, and a wider variety of, data science practitioners, a significant proportion of whom will be using NumPy. The scale of scientific data gathering will continue to expand, with the adoption of devices and instruments such as light-sheet microscopes and the Large Synoptic Survey Telescope (LSST) [51].
New generation languages, interpreters, and compilers, such as Rust [52], Julia [53], and LLVM [54], will invent and determine the viability of new concepts and data structures.

Through various mechanisms described in this paper, NumPy is poised to embrace such a changing landscape, and to continue playing a leading role in interactive scientific computation. To do so will require sustained funding from government, academia, and industry. But, importantly, it will also need a new generation of graduate students and other developers to engage, to build a NumPy that meets the needs of the next decade of data science.
References

[1] K. E. Iverson, "Notation as a tool of thought," Communications of the ACM, vol. 23, pp. 444-465, Aug. 1980.
[2] P. F. Dubois, "Python: Batteries included," Computing in Science & Engineering, vol. 9, no. 3, pp. 7-9, 2007.
[3] T. E. Oliphant, "Python for scientific computing," Computing in Science & Engineering, vol. 9, pp. 10-20, May-June 2007.
[4] K. J. Millman and M. Aivazis, "Python for scientists and engineers," Computing in Science & Engineering, vol. 13, no. 2, pp. 9-12, 2011.
[5] F. Pérez, B. E. Granger, and J. D. Hunter, "Python: an ecosystem for scientific computing," Computing in Science & Engineering, vol. 13, no. 2, pp. 13-21, 2011.
[6] B. P. Abbott, R. Abbott, T. Abbott, M. Abernathy, F. Acernese, K. Ackley, et al., "Observation of gravitational waves from a binary black hole merger," Physical Review Letters, vol. 116, no. 6, p. 061102, 2016.
[7] A. A. Chael, M. D. Johnson, R. Narayan, S. S. Doeleman, J. F. Wardle, and K. L. Bouman, "High-resolution linear polarimetric imaging for the Event Horizon Telescope," The Astrophysical Journal, vol. 829, no. 1, p. 11, 2016.
[8] P. F. Dubois, K. Hinsen, and J. Hugunin, "Numerical Python," Computers in Physics, vol. 10, no. 3, pp. 262-267, 1996.
[9] D. Ascher, P. F. Dubois, K. Hinsen, J. Hugunin, and T. E. Oliphant, "An open source project: Numerical Python," 2001.
[10] T.-Y. Yang, G. Furnish, and P. F. Dubois, "Steering object-oriented scientific computations," in Proceedings of TOOLS USA 97: International Conference on Technology of Object Oriented Systems and Languages, pp. 112-119, IEEE, 1997.
[11] P. Greenfield, J. T. Miller, J. Hsu, and R. L. White, "numarray: A new scientific array package for Python," PyCon DC, 2003.
[12] T. E. Oliphant, Guide to NumPy. Trelgol Publishing USA, 1st ed., 2006.
[13] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, et al., "SciPy 1.0: fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, pp. 261-272, 2020.
[14] J. D. Hunter, "Matplotlib: A 2D graphics environment," Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
[15] W. McKinney, "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference (S. van der Walt and J. Millman, eds.), pp. 51-56, 2010.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[17] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, et al., "scikit-image: image processing in Python," PeerJ, vol. 2, p. e453, 2014.
[18] S. van der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: a structure for efficient numerical computation," Computing in Science & Engineering, vol. 13, no. 2, pp. 22-30, 2011.
[19] Q. Wang, X. Zhang, Y. Zhang, and Q. Yi, "AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs," in SC'13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1-12, IEEE, 2013.
[20] Z. Xianyi, W. Qian, and Z. Yunquan, "Model-driven level 3 BLAS performance optimization on Loongson 3A processor," pp. 684-691, IEEE, 2012.
[21] F. Pérez and B. E. Granger, "IPython: a system for interactive scientific computing," Computing in Science & Engineering, vol. 9, no. 3, pp. 21-29, 2007.
[22] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. Granger, M. Bussonnier, J. Frederic, et al., "Jupyter Notebooks: a publishing format for reproducible computational workflows," in Positioning and Power in Academic Publishing: Players, Agents and Agendas (F. Loizides and B. Schmidt, eds.), pp. 87-90, IOS Press, 2016.
[23] A. A. Hagberg, D. A. Schult, and P. J. Swart, "Exploring network structure, dynamics, and function using NetworkX," in Proceedings of the 7th Python in Science Conference (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), Pasadena, CA, USA, pp. 11-15, 2008.
[24] Astropy Collaboration, T. P. Robitaille, E. J. Tollerud, P. Greenfield, M. Droettboom, E. Bray, et al., "Astropy: A community Python package for astronomy," Astronomy & Astrophysics, vol. 558, p. A33, Oct. 2013.
[25] A. M. Price-Whelan, B. M. Sipőcz, H. M. Günther, P. L. Lim, S. M. Crawford, S. Conseil, et al., "The Astropy Project: Building an Open-science Project and Status of the v2.0 Core Package," The Astronomical Journal, vol. 156, p. 123, Sept. 2018.
[26] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, et al., "Biopython: freely available Python tools for computational molecular biology and bioinformatics," Bioinformatics, vol. 25, no. 11, pp. 1422-1423, 2009.
[27] K. J. Millman and M. Brett, "Analysis of functional Magnetic Resonance Imaging in Python," Computing in Science & Engineering, vol. 9, no. 3, pp. 52-55, 2007.
[28] The SunPy Community, S. J. Mumford, S. Christe, D. Pérez-Suárez, J. Ireland, A. Y. Shih, et al., "SunPy: Python for solar physics," Computational Science and Discovery, vol. 8, p. 014009, Jan. 2015.
[29] J. Hamman, M. Rocklin, and R. Abernathy, "Pangeo: A Big-data Ecosystem for Scalable Earth System Science," in EGU General Assembly Conference Abstracts, p. 12146, Apr. 2018.
[30] A. A. Chael, K. L. Bouman, M. D. Johnson, R. Narayan, S. S. Doeleman, J. F. Wardle, et al., "ehtim: Imaging, analysis, and simulation software for radio interferometry," Astrophysics Source Code Library, 2019.
[31] K. J. Millman and F. Pérez, "Developing open-source scientific practice," in Implementing Reproducible Research, pp. 149-183, CRC Press, Boca Raton, FL, 2014.
[32] S. van der Walt, "The SciPy documentation project (technical overview)," in Proceedings of the 7th Python in Science Conference (SciPy 2008) (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), pp. 27-28, 2008.
[33] J. Harrington, "The SciPy documentation project," in Proceedings of the 7th Python in Science Conference (SciPy 2008) (G. Varoquaux, T. Vaught, and K. J. Millman, eds.), pp. 33-35, 2008.
[34] J. Harrington and D. Goldsmith, "Progress report: NumPy and SciPy documentation in 2009," in
Proceedings of the 8th Python in Sci-ence Conference (SciPy 2009) (G. Varoquaux,S. van der Walt, and K. J. Millman, eds.),pp. 84–87, 2009.[35] G. Wilson, “Software carpentry: Getting scien-tists to write better code by making them moreproductive,”
Computing in Science & Engineer-ing , November–December 2006.[36] J. E. Hannay, H. P. Langtangen, C. MacLeod,D. Pfahl, J. Singer, and G. Wilson, “How doscientists develop and use scientific software?,”in
Proc. 2009 ICSE Workshop on Software En-gineering for Computational Science and Engi-neering , 2009.[37] K. J. Millman, M. Brett, R. Barnowski, andJ.-B. Poline, “Teaching computational repro-ducibility for neuroimaging,”
Frontiers in Neu-roscience , vol. 12, p. 727, 2018. [38] A. Paszke, S. Gross, F. Massa, A. Lerer,J. Bradbury, G. Chanan, T. Killeen, Z. Lin,N. Gimelshein, L. Antiga, A. Desmaison,A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Te-jani, S. Chilamkurthy, B. Steiner, L. Fang,J. Bai, and S. Chintala, “Pytorch: An imper-ative style, high-performance deep learning li-brary,” in
Advances in Neural Information Pro-cessing Systems 32 (H. Wallach, H. Larochelle,A. Beygelzimer, F. d ' Alch´e-Buc, E. Fox, andR. Garnett, eds.), pp. 8024–8035, Curran As-sociates, Inc., 2019.[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo,Z. Chen, C. Citro, G. S. Corrado, A. Davis,J. Dean, M. Devin, et al. , “Tensorflow:Large-scale machine learning on heteroge-neous distributed systems,” arXiv preprintarXiv:1603.04467 , 2016.[40] T. Chen, M. Li, Y. Li, M. Lin, N. Wang,M. Wang, T. Xiao, B. Xu, C. Zhang,and Z. Zhang, “Mxnet: A flexible and ef-ficient machine learning library for hetero-geneous distributed systems,” arXiv preprintarXiv:1512.01274 , 2015.[41] S. Hoyer and J. Hamman, “xarray: N-D labeledarrays and datasets in Python,”
Journal of OpenResearch Software , vol. 5, no. 1, 2017.[42] P. Entschev, “Distributed multi-GPU comput-ing with Dask, CuPy and RAPIDS.” EuroPy-thon 2019, 2019.[43] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin,D. S. Seljebotn, and K. Smith, “Cython: Thebest of both worlds,”
Computing in Science &Engineering , vol. 13, no. 2, pp. 31–39, 2011.[44] S. K. Lam, A. Pitrou, and S. Seibert, “Numba:A LLVM-based Python JIT compiler,” in
Pro-ceedings of the Second Workshop on the LLVMCompiler Infrastructure in HPC , LLVM ’15,(New York, NY, USA), pp. 7:1–7:6, ACM, 2015.[45] S. Guelton, P. Brunet, M. Amini, A. Merlini,X. Corbillon, and A. Raynaud, “Pythran: En-abling static optimization of scientific python11rograms,”
Computational Science & Discovery ,vol. 8, no. 1, p. 014001, 2015.[46] J. Dongarra, G. H. Golub, E. Grosse, C. Moler,and K. Moore, “Netlib and na-net: Buildinga scientific computing community,”
IEEE An-nals of the History of Computing , vol. 30, no. 2,pp. 30–41, 2008.[47] P. F. Dubois, “The basis system,” tech. rep.,Lawrence Livermore National Laboratory, CA(USA), 1989. UCRL-MA-118543, Parts I-VI.[48] D. H. Munro and P. F. Dubois, “Using the yorickinterpreted language,”
Computers in Physics ,vol. 9, no. 6, pp. 609–615, 1995.[49] R. Ihaka and R. Gentleman, “R: a language fordata analysis and graphics,”
Journal of Compu-tational and Graphical Statistics , vol. 5, no. 3,pp. 299–314, 1996.[50] K. E. Iverson, “A programming language,” in
Proceedings of the May 1-3, 1962, Spring JointComputer Conference , pp. 345–351, 1962.[51] T. Jenness, F. Economou, K. Findeisen, F. Her-nandez, J. Hoblitt, K. S. Krughoff, K. Lim,R. H. Lupton, F. Mueller, W. O’Mullane, et al. ,“Lsst data management software developmentpractices and tools,” in
Software and Cyber-infrastructure for Astronomy V , vol. 10707,p. 1070709, International Society for Optics andPhotonics, 2018.[52] N. D. Matsakis and F. S. Klock, “The rust lan-guage,”
Ada Letters , vol. 34, pp. 103–104, Oct.2014.[53] J. Bezanson, A. Edelman, S. Karpinski, andV. B. Shah, “Julia: A fresh approach to numer-ical computing,”
SIAM Review , vol. 59, no. 1,pp. 65–98, 2017.[54] C. Lattner and V. Adve, “LLVM: A compila-tion framework for lifelong program analysis andtransformation,” (San Jose, CA, USA), pp. 75–88, Mar 2004. [55] P. Peterson, “F2PY: a tool for connecting For-tran and Python programs,”
International Jour-nal of Computational Science and Engineering ,vol. 4, no. 4, pp. 296–305, 2009.[56] The NumPy Project Community, “NumPyproject governance,” 2015.[57] The NumPy Project Community, “NumPy codeof conduct,” 2018.[58] D. Holth, “Pep 427 – the wheel binary packageformat 1.0,” 2012.[59] Brett, M. et al, “multibuild,” 2016.[60] B. Griffith, P. Virtanen, N. Smith, M. van Kerk-wijk, and S. Hoyer, “NEP 13 – a mechanism foroverriding ufuncs,” 2013.[61] S. Hoyer, M. Rocklin, M. van Kerkwijk, H. Ab-basi, and E. Wieser, “NEP 18 – a dispatch mech-anism for numpy’s high level array functions,”2018.[62] M. E. O’Neill, “Pcg: A family of simple fastspace-efficient statistically good algorithms forrandom number generation,” Tech. Rep. HMC-CS-2014-0905, Harvey Mudd College, Clare-mont, CA, Sept. 2014.[63] J. K. Salmon, M. A. Moraes, R. O. Dror, andD. E. Shaw, “Parallel random numbers: As easyas 1, 2, 3,” in
Proceedings of 2011 InternationalConference for High Performance Computing,Networking, Storage and Analysis , SC ’11, (NewYork, NY, USA), pp. 16:1–16:12, ACM, 2011.[64] C. Doty-Humphrey, “Practrand, version 0.94.”[65] M. Matsumoto and T. Nishimura, “MersenneTwister: A 623-dimensionally equidistributeduniform pseudo-random number generator,”
ACM Transactions on Modeling and ComputerSimulation , vol. 8, pp. 3–30, Jan. 1998.[66] K. Sheppard, B. Duvenhage, P. de Buyl, andD. A. Ham, “bashtage/randomgen: Release1.16.2,” Apr. 2019.1267] G. Marsaglia and W. W. Tsang, “The zigguratmethod for generating random variables,”
Jour-nal of Statistical Software, Articles , vol. 5, no. 8,pp. 1–7, 2000.[68] D. Lemire, “Fast random integer generation inan interval,”
ACM Transactions on Modelingand Computer Simulation , vol. 29, pp. 1–12, Jan2019.[69] top500, “Top 10 sites for november 2019,” 2019.[70] wikichip, “Astra - supercomputers,” 2019.[71] Wikipedia, “Arm architecture,” 2019.[72] NumPy Developers, “Numpy roadmap,” 2019.[73] Dustin Ingram, “Pep 599 – the manylinux2014platform tag,” 2019.
Methods
We use Git for version control and GitHub as the public hosting service for our official upstream repository (https://github.com/numpy/numpy). We each work in our own copy (or fork) of the project and use the upstream repository as our integration point. To get new code into the upstream repository, we use GitHub's pull request (PR) mechanism. This allows us to review code before integrating it, as well as to run a large number of tests on the modified code to ensure that the changes do not break expected behavior.

We also use GitHub's issue tracking system to collect and triage problems and proposed improvements.
Library organization
Broadly, the NumPy library consists of the following parts: the NumPy array data structure ndarray; the so-called universal functions; a set of library functions for manipulating arrays and doing scientific computation; infrastructure libraries for unit tests and Python package building; and the program f2py for wrapping Fortran code in Python [55]. The ndarray and the universal functions are generally considered the core of the library. In the following, we give a brief summary of these components of the library.
Core.
The ndarray data structure and the universal functions make up the core of NumPy.

The ndarray is the data structure at the heart of NumPy. It stores regularly strided homogeneous data types inside a contiguous block of memory, allowing for the efficient representation of n-dimensional data. More details about the data structure are given in “The NumPy array: a structure for efficient numerical computation” [18].

The universal functions, or more concisely, ufuncs, are functions written in C that implement efficient looping over NumPy arrays. An important feature of ufuncs is the built-in implementation of broadcasting. For example, the function arctan2(y, x) is a ufunc that accepts two values and computes tan⁻¹(y/x). When arrays are passed in as the arguments, the ufunc will take care of looping over the dimensions of the inputs in such a way that if, say, x is a 1-D array with length 3, and y is a 2-D array with shape 2 × 1, the output will be an array with shape 2 × 3. Broadcasting applies equally inside compound expressions such as x + y * z.

Computing libraries.
NumPy provides a large library of functions for array manipulation and scientific computing, including functions for: creating, reshaping, concatenating, and padding arrays; searching, sorting and counting data in arrays; computing elementary statistics, such as the mean, median, variance, and standard deviation; file I/O; and more.

A suite of functions for computing the fast Fourier transform (FFT) and its inverse is provided.

NumPy's linear algebra library includes functions for: solving linear systems of equations; computing various functions of a matrix, including the determinant, the norm, the inverse, and the pseudo-inverse; computing the Cholesky, eigenvalue, and singular value decompositions of a matrix; and more.

The random number generator library in NumPy provides alternative bit stream generators that provide the core function of generating random integers. A higher-level generator class that implements an assortment of probability distributions is provided. It includes the beta, gamma and Weibull distributions, the univariate and multivariate normal distributions, and more.
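As a brief sketch of a few of these pieces in action (the function names are from NumPy's public API; the particular values are illustrative only):

```python
import numpy as np

# A ufunc broadcast: a (3,) array against a (2, 1) array gives a (2, 3) result.
x = np.array([1.0, 2.0, 3.0])
y = np.array([[1.0], [2.0]])
print(np.arctan2(y, x).shape)  # (2, 3)

# Linear algebra: solve the system A @ v = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
v = np.linalg.solve(A, b)      # array([2., 3.])

# The FFT and its inverse form a round trip.
signal = np.array([0.0, 1.0, 0.0, -1.0])
restored = np.fft.ifft(np.fft.fft(signal)).real

# Probability distributions from the higher-level generator class.
rng = np.random.default_rng(12345)
draws = rng.weibull(1.5, size=1000)
print(draws.mean())
```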
Infrastructure libraries.
NumPy provides utilities for writing tests and for building Python packages.

The testing subpackage provides functions such as assert_allclose(actual, desired) that may be used in test suites for code that uses NumPy arrays.

NumPy provides the subpackage distutils, which includes functions and classes to facilitate configuration, installation, and packaging of libraries depending on NumPy. These can be used, for example, when publishing to the PyPI website.
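As a sketch of how these testing helpers are typically used (test_scaling is a hypothetical test function, not part of NumPy):

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

def test_scaling():
    # Exact comparison is appropriate for integer results...
    assert_array_equal(np.arange(3) * 2, np.array([0, 2, 4]))
    # ...while floating-point results are compared within a tolerance.
    A = np.array([[2.0, 0.0], [0.0, 4.0]])
    assert_allclose(np.linalg.inv(A) @ A, np.eye(2), atol=1e-12)

test_scaling()  # raises AssertionError on mismatch; silent on success
```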
F2PY.
The program f2py is a tool for building NumPy-aware Python wrappers of Fortran functions. NumPy itself does not use any Fortran code; F2PY is part of NumPy for historical reasons.
Governance
NumPy adopted an official Governance Document on October 5, 2015 [56]. Project decisions are usually made by consensus of interested contributors. This means that, for most decisions, everyone is entrusted with veto power. A Steering Council, currently composed of 12 members, facilitates this process and oversees daily development of the project by contributing code and reviewing contributions from the community.

NumPy's official Code of Conduct was approved on September 1, 2018 [57]. In brief, we strive to: be open; be empathetic, welcoming, friendly, and patient; be collaborative; be inquisitive; and be careful in the words that we choose. The Code of Conduct also specifies how breaches can be reported and outlines the process for responding to such reports.

Funding
In 2017, NumPy received its first large grants, totaling 1.3M USD, from the Gordon & Betty Moore and the Alfred P. Sloan foundations. Stéfan van der Walt is the PI and manages four programmers working on the project. These two grants focus on addressing the technical debt accrued over the years and on setting in place standards and architecture to encourage more sustainable development.

NumPy received a third grant, for 195K USD, from the Chan Zuckerberg Initiative at the end of 2019, with Ralf Gommers as the PI. This grant focuses on better serving NumPy's large number of beginning to intermediate level users and on growing the community of NumPy contributors. It will also provide support to OpenBLAS, on which NumPy depends for accelerated linear algebra.

Finally, since May 2019 the project receives a small amount annually from Tidelift, which is used to fund things like documentation and website improvements.
Developers
NumPy is currently maintained by a group of 23 contributors with commit rights to the NumPy code base. Of these, 17 maintainers were active in 2019, 4 of whom were paid to work on the project full-time. Additionally, there are a few long-term developers who contributed and maintain specific parts of NumPy but are not officially maintainers.

Over the course of its history, NumPy has attracted PRs by 823 contributors. However, its development relies heavily on a small number of active maintainers, who share more than half of the contributions among themselves.

At a release cycle of about every half year, the five recent releases in the years 2018 and 2019 have averaged about 450 PRs each, with each release attracting more than a hundred new contributors. Figure 4 shows the number of PRs merged into the NumPy master branch. Although the number of PRs being merged fluctuates, the plot indicates an increased number of contributions over the past years.

Community calls
The massive number of scientific Python packages built on NumPy meant that it had an unusually high need for stability. So to guide our development we formalized the feature proposal process and constructed a development roadmap with extensive input and feedback from the community.

Weekly community calls alternate between triage and higher-level discussion. The calls not only involve developers from the community, but provide a venue for vendors and other external groups to provide input. For example, after Intel produced a forked version of NumPy, one of their developers joined a call to discuss community concerns.

(Note that before mid 2011, NumPy development did not happen on github.com. All data provided here is based on the development which happened through GitHub PRs. In some cases contributions by maintainers may not be categorized as such.)

Fig. 4: Number of pull requests merged into the NumPy master branch for each quarter since 2012. The total number of PRs is indicated with the lower blue area showing the portion contributed by current or previous maintainers.
NumPy enhancement proposals
Given the complexity of the codebase and the massive number of projects depending on it, large changes require careful planning and substantial work. NumPy Enhancement Proposals (NEPs) are modeled after Python Enhancement Proposals (PEPs) for “proposing major new features, for collecting community input on an issue, and for documenting the design decisions that have gone into Python” (see https://numpy.org/neps/nep-0000.html). Since their introduction there have been 19 proposed NEPs: 6 have been implemented, 4 have been accepted and are being implemented, 4 are under consideration, 3 have been deferred or superseded, and 2 have been rejected or withdrawn.

Central role

NumPy plays a central role in building and standardizing much of the scientific Python community infrastructure. NumPy's docstring standard is now widely adopted. We are also now using the NEP system as a way to help coordinate the larger scientific Python community.

Wheels build system
A Python wheel [58] is a standard file format for distributing Python libraries. In addition to Python code, a wheel may include compiled C extensions and other binary data. This is important because many libraries, including NumPy, require a C compiler and other build tools to build the software from the source code, making it difficult for many users to install the software on their own. The introduction of wheels to the Python packaging system has made it much easier for users to install precompiled libraries.

A GitHub repository containing scripts to build NumPy wheels has been configured so that a simple commit to the repository triggers an automated build system that creates NumPy wheels for several computer platforms, including Windows, macOS, and Linux. The wheels are uploaded to a public server and made available for anyone to use. This system makes it easy for users to install precompiled versions of NumPy on these platforms.

The technology that is used to build the wheels evolves continually. At the time this paper is being written, a key component is the multibuild suite of tools developed by Matthew Brett and other developers [59]. Currently, scripts using multibuild are written for the continuous integration platforms Travis-CI (for Linux and macOS) and Appveyor (for Windows).
Recent technical improvements
With the recent infusion of funding and a clear process for coordinating with the developer community, we have been able to tackle a number of important large-scale changes. We highlight two of those below, as well as changes made to our testing infrastructure to support hardware platforms used in large-scale computing.
Array function protocol
A vast number of projects are built on NumPy; these projects are consumers of the NumPy API. Over the last several years, a growing number of projects are providers of a NumPy-like API and array objects targeting audiences with specialized needs beyond NumPy's capabilities. For example, the NumPy API is implemented by several popular tensor computation libraries, including CuPy (https://cupy.chainer.org/), JAX (https://jax.readthedocs.io/en/latest/jax.numpy.html), and Apache MXNet (https://numpy.mxnet.io/). PyTorch (https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html) and TensorFlow provide tensor APIs with NumPy-inspired semantics. It is also implemented in packages that support sparse arrays, such as scipy.sparse and PyData/Sparse. Another notable example is Dask, a library for parallel computing in Python. Dask adopts the NumPy API and therefore presents a familiar interface to existing NumPy users, while adding powerful abilities to parallelize and distribute tasks.

The multitude of specialized projects creates the difficulty that consumers of these NumPy-like APIs write code specific to a single project and do not support all of the above array providers. This is a burden for users relying on the specialized array-like, since a tool they need may not work for them. It also creates challenges for end-users who need to transition from NumPy to a more specialized array. The growing multitude of specialized projects with NumPy-like APIs threatened to again fracture the scientific Python community.

To address these issues NumPy has the goal of providing the fundamental API for interoperability between the various NumPy-like APIs. An earlier step in this direction was the implementation of the __array_ufunc__ protocol in NumPy 1.13, which enabled interoperability for most mathematical functions [60]. In 2019 this was expanded more generally with the introduction of the __array_function__ protocol in NumPy 1.17. These two protocols allow providers of array objects to be interoperable with the NumPy API: their arrays work correctly with almost all NumPy functions [61].
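As a sketch of how the protocol operates (DiagonalArray is a toy class invented here for illustration; the hook signature follows NEP 18 [61]):

```python
import numpy as np

class DiagonalArray:
    """Toy array-like: an n x n matrix with a constant value on the diagonal."""
    def __init__(self, n, value):
        self._n = n
        self._value = value

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation.
        if func is np.sum:  # the diagonal holds the only nonzero entries
            return self._n * self._value
        return NotImplemented  # defer for functions we do not handle

d = DiagonalArray(5, 2.0)
print(np.sum(d))  # handled by DiagonalArray, not by NumPy: 10.0
```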
For the users relying on specialized array projects, it means that even though much code is written specifically for NumPy arrays and uses the NumPy API as import numpy as np, it can nevertheless work for them. For example, here is how a CuPy GPU array can be passed through NumPy for processing, with all operations being dispatched back to CuPy:

```python
import numpy as np
import cupy as cp

x_gpu = cp.array([1, 2, 3])
y = np.sum(x_gpu)  # dispatched to CuPy; the result stays on the GPU
```

Similarly, user-defined functions composed using NumPy can now be applied to, e.g., multi-node distributed Dask arrays:

```python
import numpy as np
import dask.array as da

def f(x):
    """Function using NumPy API calls"""
    y = np.tensordot(x, x.T)
    return np.mean(np.log(y + 1))

x_local = np.random.random([10000, 10000])            # plain NumPy array
x_dask = da.from_array(x_local, chunks=(1000, 1000))  # distributed Dask array

s_local = f(x_local)          # evaluated eagerly by NumPy
s_dask = f(x_dask).compute()  # the same function, evaluated lazily by Dask
```
Random number generation
The NumPy random module provides pseudorandom numbers from a wide range of distributions. In legacy versions of NumPy, simulated random values are produced by a RandomState object that: handles seeding and state initialization; wraps the core pseudorandom number generator based on a Mersenne Twister implementation (to be precise, the standard 32-bit version of MT19937); interfaces with the underlying code that transforms random bits into variates from other distributions; and supplies a singleton instance exposed in the root of the random module.

The RandomState object makes a compatibility guarantee so that a fixed seed and sequence of function calls produce the same set of values. This guarantee has slowed progress, since improving the underlying code requires extending the API with additional keyword arguments. This guarantee continues to apply to RandomState.

NumPy 1.17 introduced a new API for generating random numbers that uses a more flexible structure that can be extended by libraries or end-users. The new API is built using components that separate the steps required to generate random variates. Pseudorandom bits are generated by a bit generator. These bits are then transformed into variates from complex distributions by a generator. Finally, seeding is handled by an object that produces sequences of high-quality initial values.

Bit generators are simple classes that manage the state of an underlying pseudorandom number generator. NumPy ships with four bit generators. The default bit generator is a 64-bit implementation of the Permuted Congruential Generator [62] (PCG64). The three other bit generators are a 64-bit version of the Philox generator [63] (Philox), Chris Doty-Humphrey's Small Fast Chaotic generator [64] (SFC64), and the 32-bit Mersenne Twister [65] (MT19937), which has been used in older versions of NumPy. Bit generators provide functions, exposed both in Python and C, for generating random integer and floating point numbers.

The Generator consumes one of the bit generators and produces variates from complicated distributions. Many improved methods for generating random variates from common distributions were implemented, including the Ziggurat method for normal, exponential and gamma variates [67], and Lemire's method for bounded random integer generation [68]. The Generator is more similar to the legacy RandomState, and its API is substantially the same. (The randomgen project supplies a wide range of alternative bit generators, such as cryptographic counter-based generators (AESCtr) and generators that expose hardware random number generators (RDRAND) [66].) Generator does not, however, make the same stream guarantee as the RandomState object, and so variates may differ across versions as improved generation algorithms are introduced.

Finally, a SeedSequence is used to initialize a bit generator. The seed sequence can be initialized with no arguments, in which case it reads entropy from a system-dependent provider, or with a user-provided seed. The seed sequence then transforms the initial set of entropy into a sequence of high-quality pseudorandom integers, which can be used to initialize multiple bit generators deterministically. The key feature of a seed sequence is that it can be used to spawn child SeedSequences to initialize multiple distinct bit generators. This capability allows a seed sequence to facilitate large distributed applications where the number of workers required is not known. The sequences generated from the same initial entropy and spawns are fully deterministic to ensure reproducibility.

The three components are combined to construct a complete random number generator:

```python
from numpy.random import (
    Generator,
    PCG64,
    SeedSequence,
)

seq = SeedSequence(1030424547444117993331016959)
pcg = PCG64(seq)
gen = Generator(pcg)
```

This approach retains access to the seed sequence, which can then be used to spawn additional generators:

```python
children = seq.spawn(2)
gen_0 = Generator(PCG64(children[0]))
gen_1 = Generator(PCG64(children[1]))
```
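When this level of control is not needed, NumPy also ships a convenience constructor, np.random.default_rng, which wraps these pieces in a single call (a sketch; the seed value here is arbitrary):

```python
import numpy as np

# default_rng returns a Generator backed by the default bit generator
# (PCG64); called with no argument, it seeds itself from system entropy.
rng = np.random.default_rng(12345)
sample = rng.standard_normal(3)
print(sample.shape)  # (3,)
```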
While this approach retains complete flexibility, the method np.random.default_rng can be used to instantiate a Generator when reproducibility is not needed. (Despite the removal of the compatibility guarantee, simple reproducibility across versions is encouraged, and minor changes that do not produce meaningful performance gains or fix underlying bugs are not generally adopted.)

The final goal of the new API is to improve extensibility.
RandomState is a monolithic object that obscures all of the underlying state and functions. The component architecture is one part of the extensibility improvements. The underlying functions (written in C) which transform the output of a bit generator into other distributions are available for use in CFFI. This allows the same code to be run in both NumPy and dependent libraries that can consume CFFI, e.g., Numba. Both the bit generators and the low-level functions can also be used in C or Cython code.

Testing on multiple architectures
At the time of writing, the two fastest supercomputers in the world, Summit and Sierra, both have IBM POWER9 architectures [69]. In late 2018, Astra, the first ARM-based supercomputer to enter the TOP500 list, went into production [70]. Furthermore, over 100 billion ARM processors have been produced as of 2017 [71], making it the most widely used instruction set architecture in the world.

Clearly there are motivations for a large scientific computing software library to support POWER and ARM architectures. We've extended our continuous integration (CI) testing to include ppc64le (POWER8 on Travis CI) and ARMv8 (on the Shippable service). We also test with the s390x architecture (IBM Z CPUs on Travis CI) so that we can probe the behavior of our library on a big-endian machine. This satisfies one of the major components of improved CI testing laid out in a version of our roadmap [72]: specifically, “CI for more exotic platforms.”

PEP 599 [73] lays out a plan for new Python binary wheel distribution support, manylinux2014, that adds support for a number of architectures supported by the CentOS Alternative Architecture Special Interest Group, including ARMv8, ppc64le, as well as s390x. We are thus well-positioned for a future where provision of binaries on these architectures will be expected for a library at the base of the scientific Python ecosystem.

(As of 1.18.0, the low-level random-generation scenario described above requires access to the NumPy source. Alternative approaches that avoid this extra step are being explored.)
Acknowledgments
We thank Ross Barnowski, Paul Dubois, Michael Eickenberg, and Perry Greenfield, who suggested text and provided helpful feedback on the manuscript. We also thank the many members of the community who provided feedback, submitted bug reports, made improvements to the documentation, code, or website, promoted NumPy's use in their scientific fields, and built the vast ecosystem of tools and libraries around NumPy. We also gratefully acknowledge the Numeric and Numarray developers on whose work we built.

Jim Hugunin wrote Numeric in 1995, while a graduate student at MIT. Hugunin based his package on previous work by Jim Fulton, then working at the US Geological Survey, with input from many others. After he graduated, Paul Dubois at the Lawrence Livermore National Laboratory became the maintainer. Many people contributed to the project, including T.E.O. (a co-author of this paper), David Ascher, Tim Peters, and Konrad Hinsen.

In 1998 the Space Telescope Science Institute started using Python and in 2000 began developing a new array package called Numarray, written almost entirely by Jay Todd Miller, starting from a prototype developed by Perry Greenfield. Other contributors included Richard L. White, J. C. Hsu, Jochen Krupper, and Phil Hodge. The Numeric/Numarray split divided the community, yet ultimately pushed progress much further and faster than would otherwise have been possible.

Shortly after Numarray development started, T.E.O. took over maintenance of Numeric. In 2005, he led the effort and did most of the work to unify Numeric and Numarray, and produce the first version of NumPy.

Eric Jones co-founded (along with T.E.O. and P.P.) the SciPy community, gave early feedback on array implementations, and provided funding and travel support to several community members. Numerous people contributed to the creation and growth of the larger SciPy ecosystem, which gives NumPy much of its value. Others injected new energy and ideas by creating experimental array packages.

K.J.M. and S.J.v.d.W. were funded in part by the Gordon and Betty Moore Foundation through Grant GBMF3834 and by the Alfred P. Sloan Foundation through Grant 2013-10-27 to the University of California, Berkeley. S.J.v.d.W., S.B., M.P., and W.W. were funded in part by the Gordon and Betty Moore Foundation through Grant GBMF5447 and by the Alfred P. Sloan Foundation through Grant G-2017-9960 to the University of California, Berkeley.
Author Contributions Statement
K.J.M. and S.J.v.d.W. composed the manuscript with input from others. S.B., R.G., K.S., W.W., M.B., and T.J.R. contributed text. All authors have contributed significant code, documentation, and/or expertise to the NumPy project. All authors reviewed the manuscript.