Rethinking Abstractions for Big Data: Why, Where, How, and What
Mary Hall, Robert M. Kirby, Feifei Li, Miriah Meyer, Valerio Pascucci, Jeff M. Phillips, Rob Ricci, Jacobus Van der Merwe, Suresh Venkatasubramanian
University of Utah
October 8, 2018
Big data refers to large and complex data sets that, under existing approaches, exceed the capacity and capability of current compute platforms, systems software, analytical tools, and human understanding [7]. Numerous lessons on the scalability of big data can already be found in asymptotic analysis of algorithms and from the high-performance computing (HPC) and applications communities. However, scale is only one aspect of current big data trends; fundamentally, current and emerging problems in big data are a result of unprecedented complexity: in the structure of the data and how to analyze it, in dealing with unreliability and redundancy, in addressing the human factors of comprehending complex data sets, in formulating meaningful analyses, and in managing the dense, power-hungry data centers that house big data.

The computer science solution to complexity is finding the right abstractions, those that hide as much triviality as possible while revealing the essence of the problem being addressed. The "big data challenge" has disrupted computer science by stressing to the very limits the familiar abstractions which define the relevant subfields in data analysis, data management, and the underlying parallel systems. Efficient processing of big data has shifted systems towards increasingly heterogeneous and specialized units, with resilience and energy becoming important considerations. The design and analysis of algorithms must now incorporate emerging costs in communicating data driven by IO costs, distributed data, and the growing energy cost of these operations. Data analysis representations as structural patterns and visualizations surpass human visual bandwidth, structures studied at small scale are rare at large scale, and large-scale high-dimensional phenomena cannot be reproduced at small scale.

As a result, not enough of these challenges are revealed by isolating abstractions in a traditional software stack or standard algorithmic and analytical techniques, and attempts to address complexity either oversimplify or require low-level management of details. The authors believe that the abstractions for big data need to be rethought, and this reorganization needs to evolve and be sustained through continued cross-disciplinary collaboration.

In what follows, we first consider the question of why big data and why now. We then describe the where (big data systems), the how (big data algorithms), and the what (big data analytics) challenges that we believe are central and must be addressed as the research community develops these new abstractions. We equate the biggest challenges that span these areas of big data with big mythological creatures, namely cyclops, that should be conquered.
Why Big Data, And Why Now?
We argue that big data is not simply the familiar outcome of constant improvements to processing, storage, and networking; it is something qualitatively different, with a novel set of problems that are the product of three factors: (1) The availability of large data sets, not only to communities that are accustomed to dealing with them, but also to commercial and academic groups that have historically been unable to collect or store large amounts of data. The consequence is that what constitutes "big" varies greatly among communities; the common theme is that the data is large enough to place a strain on the existing techniques used to manage, process, and interpret it. (2) The capacity to process and store large data sets, a result not only of predictable advances in hardware technology, but also of disruptive changes to the models of how this capacity is made available. The ability to rent or borrow computing and storage capacity at a variety of timescales changes the basic economics of the value of data; data becomes more valuable when this capacity is available on demand and at the exact scale required for the problem. (3) The realization that there is economic or social benefit to be gained from these data sets, whether that be for the advancement of technology, science, commerce, or the public good. All of these factors already exist in one form or another; what is new is that advances in each of these areas have converged. The result is a much broader cross-disciplinary interest that extends across and beyond the communities that have traditionally looked at problems related to data.

Figure 1: The why, where, how, and what of big data.
Big Data Systems

Big data places new strains on the compute platforms where the analysis will be performed: the layers of architecture, OS, compiler, and programming languages which allow programmers to take advantage of a variety of underlying platforms. The high-performance computing (HPC) community has historically been the mainstay of scalable compute systems, along with closely related work in grid computing [4]. More recently, a second community, led by companies such as Google, Yahoo!, and Amazon, has been building a new set of systems and abstractions for big data processing. MapReduce [3] is the most notable example of this latter trend. These two communities have developed what may appear to be distinct and contrasting perspectives on extreme-scale systems, but both approaches have clear strengths. Perhaps surprisingly, these domains are converging towards common concerns, and we believe that a mixing of their abstractions brings both challenges and opportunities.
Performance vs. Productivity
A key area where divergence exists today is in programming models. The HPC community uses fairly low-level programming models that provide control of hardware features such as message passing across distributed memories and explicit threading to control processor cores, designed for expert users who demand high performance over programmability. In contrast, MapReduce [3] is a commonly used large-scale data analysis programming model, where programmers work with two simple abstractions: "mapping" computation across individual key-value items, and "reducing" the intermediate results to a final output or another intermediate state. The simplicity of this programming model, which favors programmer productivity over performance, has resulted in rapid and widespread adoption.
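To make the two abstractions concrete, here is a minimal single-process sketch of MapReduce-style word counting in Python. The function names and the in-memory grouping step are our own illustration; a real framework such as Hadoop distributes these phases across machines and handles shuffling, sorting, and fault tolerance on the programmer's behalf.

```python
from collections import defaultdict

# A minimal, single-process sketch of the MapReduce abstractions (word count).
# A real system distributes these phases and hides failures, sorting, and I/O.

def map_phase(record):
    # Emit (key, value) pairs for one input record.
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Combine all intermediate values that share a key.
    return (key, sum(values))

def run_mapreduce(records):
    groups = defaultdict(list)
    for record in records:                  # "map" over individual records
        for key, value in map_phase(record):
            groups[key].append(value)       # shuffle: group by key
    return [reduce_phase(k, vs) for k, vs in groups.items()]   # "reduce"

if __name__ == "__main__":
    docs = ["big data big systems", "big abstractions"]
    print(sorted(run_mapreduce(docs)))      # [('abstractions', 1), ('big', 3), ...]
```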
While today’s systems make different choices in the spectrum between performance and productivity, thetradeoffs these communities are facing in the future are bringing them closer together.
The HPC community, in its march to exascale (the ability to process a quintillion operations per second), is exploring more productive programming environments that hide the increased complexity of new energy-efficient hardware features such as heterogeneous processing logic, non-uniform latency memory structures, configurability, and billion-way parallelism. The data analysis community finds the MapReduce paradigm too restrictive for some computations, and is exploring new programming models to facilitate other data analytics solutions, e.g., Dremel [8] (for queries), GraphLab [6] (for machine learning and optimization), and Giraph (for graph processing). Performance and efficiency will also become more important for extreme-scale data analysis, particularly in data centers. Thus, the need to "dial in" to appropriate levels of performance, productivity, and abstraction exists in both communities.
Multi-resolution programming models have been proposed to permit different users with varying programming expertise to use the same programming system at different levels of abstraction; experts can control how the application maps to hardware, while more naïve users can lean more heavily on automation and general-purpose solutions [2]. A common example of such a multiresolution system relies on a domain-specific programming language or library that specializes the expression and mapping of particular domains, so that more naïve users can capitalize on the efforts of expert users to manage system complexity. Meanwhile, Hadoop [1] (the open-source version of MapReduce and its relatives) now permits some version of message passing to allow advanced users to further optimize some key operations outside the basic, but restricted, paradigms. Thus, while the applications and programming solutions may vary significantly, the current and future experiences of successfully deploying domain-specific systems should provide general lessons towards addressing these challenges in both the HPC and data analysis communities.
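As a toy illustration of the multi-resolution idea (not drawn from any particular system), the sketch below exposes a one-line high-level entry point for casual users while letting experts substitute their own low-level kernel; all names and the distance computation are hypothetical.

```python
import numpy as np

# Hypothetical "multi-resolution" API: casual users call the high-level
# function with defaults; experts override the low-level kernel to control
# how the computation is mapped onto the hardware.

def _default_kernel(x, y):
    # Vectorized NumPy implementation chosen by the library.
    return np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1))

def pairwise_distances(x, y, kernel=None):
    """High-level entry point; `kernel` is the expert-level escape hatch."""
    kernel = kernel or _default_kernel
    return kernel(np.asarray(x, float), np.asarray(y, float))

# Naive use: rely on the library's defaults.
d = pairwise_distances([[0, 0], [1, 1]], [[1, 0]])

# Expert use: plug in a blocked (or GPU) kernel when the defaults fall short.
def blocked_kernel(x, y, block=1024):
    out = np.empty((len(x), len(y)))
    for i in range(0, len(x), block):
        out[i:i + block] = _default_kernel(x[i:i + block], y)
    return out

d2 = pairwise_distances([[0, 0], [1, 1]], [[1, 0]], kernel=blocked_kernel)
```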
Toward Resilience and Energy-Efficiency
The application programming environment is only one part of the overall computing environment; it is layered on the run-time system, operating system, file system, and networking layers. While the choices and abstractions made for these layers today appear to differ between the HPC and data analytics communities, we again see initial steps towards convergence. Historically, the HPC community has used expensive hardware with low failure rates, and relies on heavyweight checkpoint and restart of long-running applications to tolerate infrequent errors. In contrast, data centers are designed to expect frequent failures; a typical strategy is to use low-cost hardware components with higher failure rates, and to build flexible, distributed failure-recovery schemes into the systems or application layers. With supercomputers growing dramatically in component counts and checkpoint-restart becoming too expensive relative to compute time, the HPC community is by necessity moving towards a more resilient approach that is tolerant of, or able to recover from, errors [5, 2]. Similarly, for both supercomputing facilities and data centers, power consumption will dominate operational costs, making energy efficiency a first-class concern for both communities. Rather than relying completely on hardware solutions for energy efficiency, data movement (which dominates energy costs [5]) must be managed more aggressively at various levels of the software stack. The clear conclusion is that, regardless of the specific application-level abstractions used for computation on big data, there will be common needs for both resiliency and power efficiency; this points to the need to re-architect systems software to address these challenges.
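To illustrate the checkpoint-and-restart strategy in its simplest form, the following sketch periodically serializes the state of a long-running loop so it can resume after a failure. The file name, interval, and "work" are illustrative only; production HPC checkpointing must coordinate state across thousands of nodes, which is precisely why its cost is becoming prohibitive.

```python
import os
import pickle

# Minimal checkpoint/restart sketch: periodically persist loop state so a
# long-running computation can resume after a crash.

CHECKPOINT = "state.ckpt"        # illustrative path
INTERVAL = 1000                  # iterations between checkpoints

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "accum": 0.0}       # fresh start

def save_state(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)            # atomic swap avoids torn checkpoints

state = load_state()
for step in range(state["step"], 10_000):
    state["accum"] += step * 0.5           # stand-in for real work
    state["step"] = step + 1
    if state["step"] % INTERVAL == 0:
        save_state(state)
save_state(state)
```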
Big Data Algorithms
The rise of big data has disrupted the algorithmic paradigms that model how data is represented and processed. Fundamentally, algorithm design is about solving concrete problems within the confines of an abstract framework. This framework should elegantly capture the key trade-offs between resources, whether they be time, space, precision, random bits, or communication, but hide the implementation details from the algorithm designer. For years, computer systems have managed to handle these details behind the scenes (from an algorithmist's perspective) and keep the balance in check. However, the evolution of the true trade-offs has reached a breaking point. Most evidently, new computational models are now needed, ones that may deal with an ever-growing list of competing resource types. In addition, the accuracy of input data, long taken for granted and masked by wise choices of concrete problems, can no longer be assumed; inaccuracy is becoming a first-order concern. Both of these issues are forcing changes in design principles, and for algorithm designers' efforts to be most effective, new abstractions that are at once simpler and richer are required.
New Models for Processing Large Data
For nearly 70 years, algorithms have been developed under variants of the Von Neumann model [11]. The key feature of this model is the assumption that accessing data and performing an operation on it take the same time. While this assumption has never held on any actual architecture, cache hierarchies and fast memory were able to make this an accurate model of reality... until now.

Increasing data sizes and the dramatic increase in compute speeds have changed the balance between computation and access. Processing data is now far cheaper than accessing it. As a consequence, data access has itself become either an expensive resource to be optimized (as seen in the very successful external memory and cache-oblivious models of computation) or one that is extremely constrained, as in the streaming model [9], in which algorithms are only permitted to make a single pass over the data and can only store a tiny fraction of what they read.

While these models have been quite effective at modeling the problems of data access on a single processor, it has been much harder to adapt them to the challenges of dealing with multiple processors operating on vast amounts of distributed data. Whether we are programming a GPU, a multicore processor, a cluster of thousands of processing nodes, or even computations across distributed data servers, we have to model communication (between compute nodes, or between the memory hierarchy and nodes) rather than access. Modeling the interplay between communication and computation is perhaps the biggest challenge for algorithm design in the modern era.
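To make the single-pass, small-memory constraint concrete, here is a sketch of the classical Misra-Gries frequent-items summary, one example of the kind of streaming algorithm alluded to above; the stream and parameter values are illustrative.

```python
# Misra-Gries frequent-items summary: one pass over the stream, at most k
# counters kept in memory. Any item occurring more than n/(k+1) times in a
# stream of length n is guaranteed to survive (counts are underestimates).

def misra_gries(stream, k=3):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Decrement everything; drop counters that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
    print(misra_gries(stream, k=2))   # 'a' is kept; rare items are pruned
```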
Valiant points out [10] that an ideal computational model "bridges in a performance-faithful manner what the hardware executes and what is in the mind of the software writer." Where do we stand with respect to this ideal? As we discussed in the context of big data systems above, the tradeoff between performance and productivity reflects the disconnect between the "truth" of the hardware and the various programming abstractions used to interface with it. This disconnect transfers to theoretical models as well. While there are now numerous models for thinking abstractly about big data systems, they are inspired by specific programming interfaces in use rather than by a deeper understanding of underlying hardware and software realities. One notable exception is Valiant's own recent work on bridging models for multicore systems. But even this work is limited to a homogeneous hierarchy of (memory) layers and does not capture the full complexity of modern large-scale distributed parallel systems.

We should note here that there are many other factors that play an important role in the design and modeling of large-data systems; in the discussion of big data systems, for example, we touch on heterogeneity, resilience, and energy efficiency, among others. While we believe that communication is the most natural resource for modeling, we expect that algorithmic principles that address the above concerns will continue to be of interest. Indeed, as we have seen with the very successful streaming model, the key challenge in designing an effective theoretical framework will not be exhaustiveness, but a focus on the "right" resources to optimize.

Algorithmically Managing Inaccuracies
Big data is often noisy. Actually, data has always been noisy, but the problem is more apparent with large data since we often observe the same object multiple times, and these observations can have conflicting values. Furthermore, the data has reached a size where it is possible to effectively model this uncertainty. This realization has two consequences. First, precision of solutions beyond the error tolerance of the input data is meaningless; approximate solutions can be used instead, as long as they carry guarantees within this tolerance. Second, this input data uncertainty should be analyzed, in particular with regard to its effect on the output of a given task. That is, the input, the intermediate structure, and the output of these noisy data sets should have rich representations, either describing the distribution of possible solutions or the tolerances of the worst-case values.

However, computing on such complex representations, especially when the data is big, requires new ideas. One approach is to squash the data into a convenient smaller representation that also captures this data uncertainty. This process will often introduce further inaccuracies, but these can often be carefully bounded and modeled. This approach yields two challenges: the first is efficiently computing this complex but concise representation from an enormous data set, and the second is maintaining such a rich representation as the data is processed through several analysis steps. With each step, we would ideally like to rely on an existing algorithmic technique, but not have the inaccuracies increase and propagate at a larger level. Currently, there are Monte Carlo approaches, which are quite general, as well as drastically more efficient techniques (often growing out of streaming algorithms [9]) that typically apply only to specific scenarios. A general and efficient algorithmic framework that can be abstracted to many types of complex data is an important challenge.
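A minimal sketch of the Monte Carlo style of uncertainty propagation mentioned above: each noisy observation is modeled by a distribution, and repeated resampling yields a distribution over the output rather than a single falsely precise number. The Gaussian noise model and the output statistic are illustrative assumptions.

```python
import random
import statistics

# Monte Carlo propagation of input uncertainty (illustrative sketch).
# Each observation is a (mean, std) pair describing our belief about the
# true value; we resample inputs many times and report the spread of the
# output statistic instead of a single number.

observations = [(10.2, 0.5), (9.8, 0.4), (10.5, 0.6), (9.9, 0.5)]

def output_statistic(values):
    return statistics.mean(values)          # stand-in for a real analysis step

def monte_carlo(observations, trials=10_000, seed=0):
    rng = random.Random(seed)
    outputs = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for mu, sigma in observations]
        outputs.append(output_statistic(sample))
    return statistics.mean(outputs), statistics.stdev(outputs)

est, spread = monte_carlo(observations)
print(f"estimate = {est:.2f} +/- {spread:.2f}")
```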
Big Data Analytics

Data science, the emerging scientific paradigm of discovering hypotheses from large data corpora, has turned the natural progression of science on its head. Instead of painstakingly designing hypotheses and testing them, it is now possible to generate hypotheses automatically by sifting through giant data sets.

The danger of this approach is the phenomenon of multiple testing (or, more colorfully, the green jelly bean problem, http://xkcd.com/882), where if enough hypotheses are considered separately, eventually one observed effect may look statistically significant without being true. This problem is all the more serious with large and complex data because the algorithms that generate these hypotheses can be opaque, and the data itself can overwhelm our ability to process and visualize it. Moreover, the number of features in the data can overwhelm most procedures designed to analyze them.

The challenge of big data analytics therefore is to determine what information and structure really lie in these large, feature-rich data sets, and which models can be evaluated efficiently and accurately, and visualized to provide confirmation of the learned phenomena.
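The simulation below illustrates the multiple-testing hazard on purely synthetic data: hundreds of hypotheses are tested on noise with no real effect, several appear "significant" at the usual 0.05 level, and a Bonferroni-style correction (one standard remedy, not one prescribed here) removes most of the false alarms.

```python
import random
import statistics
from math import sqrt, erf

# Multiple-testing simulation: test many hypotheses on pure noise and count
# how many look "significant" before and after a Bonferroni correction.

def z_test_pvalue(sample, mu0=0.0):
    # Two-sided z-test p-value for the sample mean under H0: mean == mu0.
    n = len(sample)
    z = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / sqrt(n))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # normal tail approx

rng = random.Random(42)
num_hypotheses, alpha = 500, 0.05
pvals = []
for _ in range(num_hypotheses):
    noise = [rng.gauss(0, 1) for _ in range(50)]          # no real effect anywhere
    pvals.append(z_test_pvalue(noise))

naive_hits = sum(p < alpha for p in pvals)
bonferroni_hits = sum(p < alpha / num_hypotheses for p in pvals)
print(f"naive 'discoveries': {naive_hits}, after Bonferroni: {bonferroni_hits}")
```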
Structure that is not Everywhere
The curse of dimensionality refers to the exponential increase in the complexity of algorithms as the number of features (the dimensions) of data increases. But this phenomenon is not just algorithmic; in high dimensions the meaningful patterns in data become harder to distinguish from random artifacts, regardless of the efficiency of the algorithm used. However, most natural processes that generate data tend to be less complex, and this has spurred the development of methods that assume the data lives in a low-dimensional subspace, and find patterns conditioned on this or other assumptions. While it can be hard to verify that the data satisfies these constraints, regularization methods can help nudge algorithms to look for such structures, and in some cases cross-validation can confirm the validity of found structures.

Another path for rigorous analysis of complex data is through the study of which summaries with worst-case guarantees can be attained, and understanding the trade-off between error tolerance and size. Limits on these summaries can imply that even big data sets cannot provide more than a fixed error tolerance for certain properties, and can limit overzealous modeling and over-fitting. In simulation data, modeling error can be much more dramatic, since the data sets are the output of the models. This requires efficient and early detection of structural anomalies to short-circuit the expense of regenerating the data, and efficiently maintainable summaries can be the key to this.

As data becomes more complex, the models must adapt as well. Over the years we have seen models for measuring data evolve from simple linear spaces to infinite-dimensional function spaces. We are now seeing a further evolution into multi-scale representations of data, where "low-dimensional spaces" combine "low dimensionally" to create complex structure. But it remains a challenge to automatically adapt this model complexity appropriately to the scale and structure of the data.
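As a small illustration of hunting for low-dimensional structure without over-fitting, the sketch below builds synthetic data with a known two-dimensional core plus noise and uses held-out reconstruction error, with a simple elbow rule, to choose the rank of a PCA-style model; the data, the rule, and the threshold are our own illustrative choices.

```python
import numpy as np

# Choose the rank of a low-dimensional (PCA-style) model by held-out
# reconstruction error rather than trusting the fit on all of the data.
# The synthetic data has an intrinsic dimension of 2 plus small noise.

rng = np.random.default_rng(0)
n, d, true_rank = 500, 20, 2
latent = rng.normal(size=(n, true_rank))
mixing = rng.normal(size=(true_rank, d))
X = latent @ mixing + 0.1 * rng.normal(size=(n, d))

train, test = X[:400], X[400:]
train_mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - train_mean, full_matrices=False)

def heldout_error(rank):
    basis = vt[:rank]                        # top principal directions
    centered = test - train_mean
    recon = centered @ basis.T @ basis       # project, then reconstruct
    return np.mean((centered - recon) ** 2)

errors = [heldout_error(r) for r in range(1, 9)]
# Pick the smallest rank after which extra components stop helping much
# (a simple "elbow" rule; a fraction-of-variance criterion would also work).
gains = [errors[i - 1] - errors[i] for i in range(1, len(errors))]
best = next((i + 1 for i, g in enumerate(gains) if g < 0.05 * errors[0]), len(errors))
print("held-out error by rank:", [round(e, 4) for e in errors])
print("selected rank:", best)                # typically matches the true rank, 2
```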
Learning Complex Structures
Advanced data models represent one component of the challenge of dealing with high-volume, feature-rich data. In addition, the model learning strategies themselves are being revisited in the face of the scale and complexity of data. A learning task typically proceeds by training a model and then testing it (in batch mode) or by building a model that is updated as new data appears (in an online setting). The training involves nontrivial optimization and often relies on labeled data and a selected set of features on which to build a model. But managing the choice of features and labeling decisions among large, complex, and distributed data requires new insights beyond the related algorithmic challenges.

Without a global view of the data, ensemble methods will become very important; these methods combine several models, perhaps learned on different views of the data, into a single global model. These have already been successful in large-scale learning applications, such as the Netflix challenge. Another approach is transfer learning, in which model parameters (rather than data) are transferred between distributed entities. For instance, the GraphLab system [6] has (among other things) implemented parallel versions of belief propagation in a way that generalizes MapReduce. Beyond these approaches, active learning and multi-task learning make use of auxiliary data sources to minimize the cost of acquiring and using labeled data, and a challenge in ongoing work is to adapt them to large distributed data sets.
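A toy sketch of the ensemble idea: independent models are fit on disjoint partitions of the data, as they might be on separate machines, and their predictions are averaged into one global model. The linear model, data, and partitioning are purely illustrative.

```python
import numpy as np

# Ensemble over data partitions: fit one simple model (least-squares linear
# regression) per partition, then average predictions into a global model.

rng = np.random.default_rng(1)
n, d = 3000, 5
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=n)

def fit_linear(Xp, yp):
    # Ordinary least squares on one partition.
    w, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
    return w

partitions = np.array_split(np.arange(n), 10)        # e.g., 10 "machines"
local_models = [fit_linear(X[idx], y[idx]) for idx in partitions]

def ensemble_predict(Xq):
    return np.mean([Xq @ w for w in local_models], axis=0)

X_test = rng.normal(size=(100, d))
err = np.mean((ensemble_predict(X_test) - X_test @ true_w) ** 2)
print(f"ensemble test MSE: {err:.4f}")
```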
Information Bandwidth Overload
When analyzing big data, classic two-dimensional statistical plots are often insufficient for exploring and understanding the complex patterns and relationships embedded within. Furthermore, deciphering raw data and computational results through visual representations is too often done as the final step in a complex research process, with tools that are rarely specific to the task.
To truly close the loop in data science, human-understandable representations of data must be madeavailable throughout the analysis pipeline to help guide analysts in making decisions and discoveries.
Interactive visualizations support this process by allowing vast amounts of information to be encoded, and they rely on our powerful perceptual systems to pull out interesting trends and structures. Interactive, flexible, and sophisticated visualization tools allow analysts to validate data and models, to derive new hypotheses, and to make important discoveries.

Interactive visualizations have started to replace the classic static images printed on paper, giving the viewer the ability to navigate multiple views of a data set. By linking these different views together through user interactions, new paradigms of data exploration become possible; for example, a large, multidimensional data set can be visualized by integrating a high-level summary view with more detailed, fine-grained views of subsets of the data. This allows deeper, more specific exploration of the data without compromising its breadth. This design pattern, called overview+detail, is found in many visualization tools for big data. Abstractions like this are vital for designing effective visualization tools, but they need to be specialized within the heterogeneous landscape of applications and data, need to have safeguards warning of multiple testing problems, and need to be built on top of efficient indexing structures to allow for usable levels of manipulation.

Effective visualization tools need to carefully straddle the competing demands of generality and specificity. On one hand, we need to devote time and resources to creating tools that will support many different analysts in many different application areas; that is, we need to find broadly applicable visualization abstractions. On the other hand, individual questions and inquiries often require specialized data and visual abstractions to tackle a specific problem, and individual cognitive differences between analysts can affect the interpretation of a visualization. Understanding when, where, and how visualizations can impact the analysis of big data remains an ongoing research question that draws on knowledge from computer science, cognitive science, and design.
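A minimal sketch of the overview+detail pattern using matplotlib (our choice of library; the data and the "selection" are simplified stand-ins): an overview panel shows a downsampled view of the full series, and a detail panel shows the highlighted sub-range at full resolution.

```python
import numpy as np
import matplotlib.pyplot as plt

# Overview+detail sketch: the top panel summarizes the whole (downsampled)
# series; the bottom panel shows a selected sub-range at full resolution.
# A real tool would make the selection interactive (e.g., a draggable brush).

rng = np.random.default_rng(7)
n = 1_000_000
series = np.cumsum(rng.normal(size=n))           # large synthetic time series

fig, (overview, detail) = plt.subplots(2, 1, figsize=(8, 5))

# Overview: downsample aggressively so the full extent stays cheap to draw.
stride = 1000
overview.plot(np.arange(0, n, stride), series[::stride], lw=0.8)
overview.set_title("Overview (downsampled)")

# Detail: full-resolution view of the currently "selected" window.
lo, hi = 420_000, 425_000
overview.axvspan(lo, hi, color="orange", alpha=0.3)   # mark the selection
detail.plot(np.arange(lo, hi), series[lo:hi], lw=0.8)
detail.set_title("Detail (full resolution of selected range)")

plt.tight_layout()
plt.show()
```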
Challenges
As a take-away message, we summarize the big challenges outlined in this document on resolving the influx of complex big data with new abstractions. To be clear, there are many other challenges associated with big data, especially those dealing with its social, legal, and economic aspects.

The cyclops were big one-eyed creatures from Greek mythology that needed to be conquered by "heroes" such as Odysseus. Thus we identify the big challenges of big data with cyclops:
Brontes the “thunderer,”
Steropes the “flasher,”
Polyphemus the “shepherd,” and
Arges the “brightener.”
Cyclops 1: [Brontes: System Abstractions] Design abstractions and languages for big data systems that "slide" gracefully between exposing a simple programming model that delivers adequate performance for a broad class of programmers and making lower-level details available when necessary to maximize performance.
Cyclops 2: [Steropes: Algorithmic Models] Converge to computational abstractions that translate closely to and between the various evolving big data systems, and that capture whichever emerging costs (e.g., communication, power, resiliency, precision, heterogeneity) dominate this new landscape.
Cyclops 3: [Polyphemus: Uncertainty Management] Assess the uncertainty and confidence in big data corpora, and develop a framework for efficiently managing, processing, representing, and visualizing its effects up to the resolution at which the data is reliable.
Cyclops 4: [Arges: Reliable Structure] Efficiently identify inherent low-dimensional or core structure from complexdata without over-fitting, and represent this structure so it can be easily verified, analyzed, and visualized atmultiple scales.
Although these problems have been separated into categories, clearly the way forward should be a joint effort along all of these fronts. Breakthroughs or resolutions in one area will have tremendous influence in others. So it is paramount that system designers, algorithm experts, and data analysts work closely together to bring forth a new and exciting era of big data computing.
References

[1] Apache Hadoop NextGen MapReduce (YARN). http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html, 2011.
[2] S. Amarasinghe, M. Hall, R. Lethin, K. Pingali, D. Quinlan, V. Sarkar, J. Shalf, R. Lucas, K. Yelick (editor), P. Balaji, P. Diniz, A. Koniges, M. Snir, and S. Sachs (editor). Exascale programming challenges. Report of the 2011 Workshop on Exascale Programming Challenges, Marina del Rey, July 27-29, 2011.
[3] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 2008.
[4] Ian Foster and Carl Kesselman. Globus: A metacomputing infrastructure toolkit. The International Journal of High Performance Computing Applications, 11:115–128, 1997.
[5] P. Kogge. Next-generation supercomputers. IEEE Spectrum, February 2011.
[6] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. PVLDB, 2012.
[7] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big data: The next frontier for innovation, competition and productivity. Technical report, McKinsey Global Institute, May 2011.
[8] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. In Proceedings of the 36th International Conference on Very Large Data Bases, 2010.
[9] S. Muthukrishnan. Data Streams: Algorithms and Applications. Now Publishers, 2005.
[10] Leslie Valiant. A bridging model for multi-core computing. Journal of Computer and System Sciences, 77:154–166, 2011.
[11] John von Neumann. First draft of a report on the EDVAC. University of Pennsylvania, W-670-ORD-4926, 1945.