AI Data Wrangling with Associative Arrays

Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, Siddharth Samsi

MIT Lincoln Laboratory Supercomputing Center, MIT Computer Science & AI Laboratory, MIT Mathematics Department, Vanderbilt University, MIT Department of Earth, Atmospheric and Planetary Sciences
Abstract—The AI revolution is data driven. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference). Significant time is devoted to translating diverse data representations supporting the many query and analysis steps found in an AI pipeline. Rigorous mathematical representations of these data enable data translation and analysis optimization within and across steps. Associative array algebra provides a mathematical foundation that naturally describes the tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding inference/training calculations used by neural networks are also well described by associative arrays. More surprisingly, a general denormalized form of hierarchical formats, such as XML and JSON, can be readily constructed. Finally, pivot tables, which are among the most widely used data analysis tools, naturally emerge from associative array constructors. A common foundation in associative arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.
I. INTRODUCTION

AI models require data for their construction and training [1]. The broad applicability of AI to diverse domains involves integration and conditioning of diverse data. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference) [2]. Significant time is devoted to translating diverse data representations supporting the many query and analysis steps found in an AI pipeline [3]. Typical steps include parsing data from a raw hierarchical format into approximately tabular spreadsheet files, ingesting into databases, querying for specific AI analysis, constructing vectors for training models, and storing matrix representations of neural networks. Each of these steps can be represented using a plethora of data structures and formats, and a major component of AI data wrangling is harmonizing these representations to align with the different operations of an AI pipeline.

Theory can play a significant role in harmonizing data representations. Set theory forms the basis of the mathematical duality enabling the declarative SQL user interface to be implemented with any procedural language in a relational database [4]–[6]. Graph theory and linear algebra are the basis of the matrix representation of AI neural networks [7]–[10]. Associative array algebra generalizes set theory and matrices to further unify SQL, NoSQL, and NewSQL databases [11]–[13]. Polystore databases leverage associative array mathematics to simplify the interchange of data among diverse databases [14]–[16].

Hierarchical data structures, such as JavaScript Object Notation (JSON) [17] and Extensible Markup Language (XML) [18], are important to AI pipelines. Spreadsheets, and their corresponding pivot tables, are available to over 100 million users and are an important tool for presenting AI results. This work presents rigorous mathematical representations of these data as associative arrays, enabling data translation and analysis optimization within and across AI pipeline steps.

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001 and National Science Foundation CCF-1533644. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering or the National Science Foundation.

II. ASSOCIATIVE ARRAY MATHEMATICS
The full mathematics of associative arrays and the ways they encompass matrix mathematics and relational algebra are described in the aforementioned references [12], [13], [16]. The essence of associative array algebra is three operations: element-wise addition (database table union), element-wise multiplication (database table intersection), and array multiplication (database table transformation). These operations are illustrated as spreadsheet operations in Figure 1. In brief, an associative array A is defined as a mapping from sets of keys to values

A : K₁ × K₂ → V,

where K₁ are the row keys and K₂ are the column keys; each can be any sortable set, such as sets of integers, real numbers, and strings. The row keys are equivalent to the sequence ID in a relational database table. The column keys are equivalent to the column names in a database table. V is a set of values that forms a semiring (V, ⊕, ⊗, 0, 1) with addition operation ⊕, multiplication operation ⊗, additive identity/multiplicative annihilator 0, and multiplicative identity 1. The values can take on many forms, such as numbers, strings, and sets. One of the most powerful features of associative arrays is that addition and multiplication can be a wide variety of operations. Some of the common combinations of addition and multiplication operations that have proven valuable are standard arithmetic addition and multiplication +.×, union and intersection ∪.∩ used in database operations, and various tropical algebras that are important in finance and neural networks: max.+, min.+, max.×, min.×, max.min, and min.max. Because associative arrays are typically constructed as semirings, their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.

Fig. 1. Core associative array operations and their corresponding spreadsheet operations depicted in Microsoft Excel [19]–[22]: construct/deconstruct between triples (k₁, k₂, v) and an array A, transpose C = Aᵀ, addition C = A ⊕ B, element-wise multiplication C = A ⊗ B, and array multiplication C = AB = A ⊕.⊗ B.

Fig. 2. Hierarchical JSON metadata records from Data.Gov Small Business Administration [23] and their corresponding denormalized sparse representation, with column keys such as dataset/accessLevel, dataset/keyword, dataset/modified, dataset/description, dataset/distribution/title, dataset/identifier, and dataset/publisher/name.
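The three operations and the semiring flexibility just described can be sketched in a few lines of Python. This is a hypothetical toy implementation for illustration only; the class and method names (AssocArray, add, mult, matmul) are ours and do not come from any published associative array library.

```python
# Toy associative array over a user-supplied semiring (plus, times, zero).
# Absent (row, column) keys implicitly hold the additive identity, so the
# representation is sparse.

class AssocArray:
    def __init__(self, triples, plus, times, zero):
        self.plus, self.times, self.zero = plus, times, zero
        self.data = {}
        for r, c, v in triples:          # key collisions combined with plus
            k = (r, c)
            self.data[k] = plus(self.data[k], v) if k in self.data else v

    def add(self, other):
        """Element-wise addition (database table union)."""
        out = dict(self.data)
        for k, v in other.data.items():
            out[k] = self.plus(out[k], v) if k in out else v
        return AssocArray([(r, c, v) for (r, c), v in out.items()],
                          self.plus, self.times, self.zero)

    def mult(self, other):
        """Element-wise multiplication (database table intersection)."""
        keys = self.data.keys() & other.data.keys()
        return AssocArray([(r, c, self.times(self.data[(r, c)],
                                             other.data[(r, c)]))
                           for (r, c) in keys],
                          self.plus, self.times, self.zero)

    def matmul(self, other):
        """Array multiplication A ⊕.⊗ B (database table transformation)."""
        triples = [(r, c, self.times(v1, v2))
                   for (r, k1), v1 in self.data.items()
                   for (k2, c), v2 in other.data.items() if k1 == k2]
        # The constructor's collision rule performs the ⊕ reduction.
        return AssocArray(triples, self.plus, self.times, self.zero)

# Standard arithmetic +.× semiring:
A = AssocArray([("r1", "c1", 2), ("r1", "c2", 3)],
               lambda a, b: a + b, lambda a, b: a * b, 0)
B = AssocArray([("c1", "x", 5), ("c2", "x", 7)],
               lambda a, b: a + b, lambda a, b: a * b, 0)
C = A.matmul(B)   # C.data[("r1", "x")] == 2*5 + 3*7 == 31
```

Swapping in max for plus and operator.add for times yields the tropical max.+ algebra with no change to the array code, which is the interchangeability of semiring operations the text emphasizes.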
III. AI PIPELINES

Hierarchical data structures, such as JSON and XML, are playing an increasingly important role in AI pipelines because of their ability to capture diverse data of the type that lends itself to AI applications. JSON and XML also align well with object-oriented programming, as arrays of objects built from other arrays of objects can be written in a human-readable format. However, most data analysis environments and AI pipelines depend upon data being in an approximately tabular format. Fortunately, hierarchical data can also be represented as sparse associative arrays by traversing the hierarchy and incrementing/appending the row counters and column keys to emit row/column/value triples directly passable to an associative array constructor (see Figure 2). As sparse associative arrays, hierarchical data can be readily ingested into AI pipelines.

Spreadsheets are used by over 100 million people each day. The results of AI pipelines are often presented in spreadsheet form. Figure 1 illustrates the spreadsheet equivalents of the core associative array operations. Interestingly, pivot tables, which are among the most widely used spreadsheet analysis tools, naturally emerge from associative array constructors. Associative arrays span the data and operations found in most AI pipelines, providing a powerful tool for harmonizing and optimizing AI pipeline data flow.
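The hierarchy traversal that emits row/column/value triples can be sketched as follows. This is a simplified illustration: the function names are ours, and it assumes scalar leaf values with '/'-joined column keys in the style of Figure 2.

```python
def flatten(record, prefix=""):
    """Recursively traverse one hierarchical record, emitting
    (column_key, value) pairs with '/'-joined hierarchical keys."""
    pairs = []
    if isinstance(record, dict):
        for key, value in record.items():
            pairs += flatten(value, f"{prefix}/{key}" if prefix else key)
    elif isinstance(record, list):
        for value in record:
            pairs += flatten(value, prefix)
    else:
        pairs.append((prefix, record))
    return pairs

def to_triples(records):
    """Emit row/column/value triples, one row key per record, directly
    passable to an associative array constructor."""
    return [(row, col, val)
            for row, rec in enumerate(records, start=1)
            for col, val in flatten(rec)]

records = [{"dataset": {"accessLevel": "public",
                        "publisher": {"name": "SBA"}}}]
triples = to_triples(records)
# [(1, "dataset/accessLevel", "public"), (1, "dataset/publisher/name", "SBA")]
```

A fuller version of this sketch would also append counters to repeated column keys, in line with the row-counter/column-key scheme described above, so that entries of a JSON list map to distinct columns rather than colliding.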
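The pivot-table observation can also be made concrete: constructing an associative array from triples, where values landing on the same (row key, column key) cell are combined with ⊕, is precisely the aggregation a spreadsheet pivot table performs. A minimal sketch, with invented sales data for illustration:

```python
from collections import defaultdict

def pivot(triples, combine):
    """Associative array constructor viewed as a pivot table:
    duplicate (row, column) keys are reduced with the combine (⊕) rule."""
    table = defaultdict(dict)
    for row, col, val in triples:
        cell = table[row]
        cell[col] = combine(cell[col], val) if col in cell else val
    return dict(table)

# Raw transaction triples: (region, product, amount).
sales = [("East", "widget", 10), ("East", "widget", 5),
         ("East", "gadget", 2), ("West", "widget", 7)]

totals = pivot(sales, lambda a, b: a + b)   # ⊕ = +, a sum pivot table
# {"East": {"widget": 15, "gadget": 2}, "West": {"widget": 7}}
maxima = pivot(sales, max)                  # ⊕ = max, a max pivot table
```

Changing the collision rule ⊕ switches the aggregation (sum, max, min, set union) without touching the construction logic, mirroring the semiring flexibility of Section II.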
ACKNOWLEDGEMENT

The authors wish to acknowledge the following individuals for their contributions and support: Bob Bond, Alan Edelman, Laz Gordon, Charles Leiserson, Dave Martinez, Mimi McClure, Michael Wright, William Arcand, David Bestor, William Bergeron, Chansup Byun, Matthew Hubbell, Michael Houle, Michael Jones, Anna Klein, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, and Albert Reuther.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] W. McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, 2017.
[4] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.
[5] D. Maier, The Theory of Relational Databases, vol. 11. Computer Science Press, Rockville, 1983.
[6] E. F. Codd, The Relational Model for Database Management: Version 2. Addison-Wesley Longman Publishing Co., Inc., 1990.
[7] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. SIAM, 2011.
[8] J. Kepner, P. Aaltonen, D. Bader, A. Buluç, F. Franchetti, J. Gilbert, D. Hutchison, M. Kumar, A. Lumsdaine, H. Meyerhenke, S. McMillan, C. Yang, J. D. Owens, M. Zalewski, T. Mattson, and J. Moreira, "Mathematical foundations of the GraphBLAS," in IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–9, Sep. 2016.
[9] J. Kepner, M. Kumar, J. Moreira, P. Pattnaik, M. Serrano, and H. Tufo, "Enabling massive deep neural networks with the GraphBLAS," in IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–10, Sep. 2017.
[10] M. Kumar, W. P. Horn, J. Kepner, J. E. Moreira, and P. Pattnaik, "IBM POWER9 and cognitive computing," IBM Journal of Research and Development, vol. 62, pp. 10:1–10:12, July 2018.
[11] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and J. Kepner, "Graphulo: Linear algebra graph kernels for NoSQL databases," in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International, pp. 822–830, IEEE, 2015.
[12] J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, and A. Reuther, "Associative array model of SQL, NoSQL, and NewSQL databases," in IEEE High Performance Extreme Computing Conference (HPEC), IEEE, 2016.
[13] J. Kepner and H. Jananthan, Mathematics of Big Data. MIT Press, 2018.
[14] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker, "The BigDAWG polystore system and architecture," in IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6, Sep. 2016.
[15] T. Mattson, V. Gadepally, Z. She, A. Dziedzic, and J. Parkhurst, "Demonstrating the BigDAWG polystore system for ocean metagenomics analysis," in CIDR, 2017.
[16] H. Jananthan, Z. Zhou, V. Gadepally, D. Hutchison, S. Kim, and J. Kepner, "Polystore mathematics of relational algebra," in