"Big Data" and its Origins

Abstract

Against the background of explosive growth in data volume, velocity, and variety, I investigate the origins of the term "Big Data". Its origins are a bit murky and hence intriguing, involving both academics and industry, statistics and computer science, ultimately winding back to lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid 1990s. The Big Data phenomenon continues unabated, and the ongoing development of statistical machine learning tools continues to help us confront it.

arXiv [econ.EM]

On the Origin(s) of the Term “Big Data”

Francis X. Diebold
University of Pennsylvania
This Version, September 8, 2020

Abstract: I investigate the origin(s) of the term “Big Data,” in industry and academics, and in computer science and econometrics. The term probably originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid 1990s, in which John Mashey figured prominently. The first significant (and independent) academic references are arguably Weiss and Indurkhya (1998) in computer science and Diebold (2000) in econometrics. An unpublished 2001 research note by Douglas Laney at Gartner enriched the concept significantly. Big Data the phenomenon continues unabated.

Acknowledgments: For useful communications I thank – without implicating in any way – Larry Brown†, David Cannadine, Xu Cheng, Flavio Cunha, Susan Diebold, Melissa Fitzgerald, Dean Foster, Michael Halperin, Steve Lohr, John Mashey, Tom Nickolas, Lauris Olson, Mallesh Pai, Marco Pospiech, Frank Schorfheide, Minchul Shin, Mike Steele, and Stephen Stigler.

Key words: Data science, computing, statistics, econometrics

JEL codes: C81, C82

Contact Info: [email protected]

Introduction

The Big Data phenomenon, by which I mean explosive growth in data volume, velocity, and variety, is at the heart of modern science. (Big Data is similarly central to modern business. For an overview, see Andersen et al. (2013).) Indeed the necessity of grappling with Big Data, and the desirability of unlocking the information hidden within it, is now a key theme in all the sciences – arguably the key scientific theme of our times. Parts of my field of econometrics, to take a tiny example, are working furiously to develop methods for learning from the massive amount of tick-by-tick financial market data now available. In response to a question like “How big is your dataset?” in a financial econometric context, an answer like “90 observations on each of 10 variables” would have been common fifty years ago, but now it’s comically quaint. A modern answer is likely to be a file size rather than an observation count, and it’s more likely to be 200 GB than the 5 KB (say) of fifty years ago. Moreover, someone reading this in twenty years will surely have a good laugh at my implicit assertion that a 200 GB dataset is large. (And of course the assertion that 200 GB is large by today’s standards is with reference to my field of econometrics.) Indeed in other disciplines like physics, 200 GB is already small. The Large Hadron Collider experiments that led to discovery of the Higgs boson, for example, produce a petabyte of data (10^15 bytes) per second.

My interest in the historical origins of the term “Big Data” was piqued in 2012 when Marco Pospiech, at the time a Ph.D. student studying the Big Data phenomenon at the Technical University of Freiberg, informed me in private correspondence that he had traced the use of the term (in the modern sense) to my paper, “‘Big Data’ Dynamic Factor Models for Macroeconomic Measurement and Forecasting,” presented at the Eighth World Congress of the Econometric Society in Seattle in August 2000, and subsequently published as Diebold (2003). (The November 2000 post-conference working paper is Diebold (2000).) Intrigued, I did a bit more digging. As regards my paper, what’s true with near certainty is that it is the first academic reference to Big Data in a title or abstract in the statistics, econometrics, or additional x-metrics (insert your favorite x) literatures. (Moreover, as progressively more searches find nothing, it’s becoming progressively more likely that it’s the first reference in those literatures, whether in the title, abstract or elsewhere.) But deeper investigation reveals that the situation is more nuanced than it first appears: the origins of the term are intriguing and a bit murky, involving both industry and academics, computer science and econometrics. I play a very early role, but I am not alone.

I stumbled on the term Big Data innocently enough, via discussion of two papers that took a new approach to macro-econometric dynamic factor models (DFMs), Reichlin (2003) and Watson (2003), presented back-to-back in an invited session of the 2000 World Congress of the Econometric Society. Older dynamic factor analyses included just a few variables, because parsimony was essential for tractability of numerical likelihood optimization. The new work by Reichlin and Watson, in contrast, showed how DFMs could be estimated using principal components, thereby dispensing with numerical optimization and opening the field to analysis of much larger datasets while nevertheless retaining a likelihood-based approach.

My discussion had two overarching goals. First, I wanted to contrast the old and new macro-econometric DFM environments. Second, I wanted to emphasize that the driver of the new macro-econometric DFM developments matched the driver of many other recent scientific developments: explosive growth in available data.
(See, for example, Massive Data Sets: Proceedings of a Workshop, Committee on Applied and Theoretical Statistics, National Research Council (National Academies Press, 1997).) To that end, I wanted a concise term that conjured a stark image. I came up with “Big Data,” which seemed apt and resonant and intriguingly Orwellian (especially when capitalized), and which helped to promote both goals.

But I was not alone. There are issues of Big Data interpretation and context, and things get murkier if one includes unpublished and/or non-academic references. Academics were aware of the emerging phenomenon but not the term. Conversely, a few pre-2000 references, both academic and non-academic, are intriguing but ultimately unconvincing, using the term but not thoroughly aware of the phenomenon.

On the academic side, Tilly (1984) mentions Big Data, but his article is not about the Big Data phenomenon and demonstrates no awareness of it; rather, it is a discourse on whether statistical data analyses are of value to historians. On the non-academic side, the margin comments of a computer program posted to a newsgroup in 1987 mention a programming technique called “small code, big data.” (See https://groups.google.com/forum/?fromgroups.) Fascinating, but off-mark. Next, Eric Larson provides an early popular-press mention in a 1989 Washington Post article about firms that assemble and sell lists to junk-mailers. He notes in passing that “The keepers of Big Data say they do it for the consumer’s benefit.” (See Eric Larson, “They’re Making a List: Data Companies and the Pigeonholing of America,” Washington Post, July 27, 1989.) Again fascinating, but again off-mark. Finally, a 1996 PR Newswire, Inc. release mentions network technology “for CPU clustering and [...]

(http://static.usenix.org/event/usenix99/invited_talks/mashey.pdf. Mashey notes in private communication that the deck was for a “living talk” and hence updated regularly, so that the 1998 version is not the earliest. The earliest deck of which he is aware (and hence I am aware) is from 1997.)

Related, SGI ran an ad that featured the term Big Data in Black Enterprise (March 1996, p. 60), several times in InfoWorld (starting November 17, 1997, p. 30), and several times in CIO (starting February 15, 1998, p. 5). Clearly then, Mashey and the SGI community were on to Big Data early, using it both as a unifying theme for technical seminars and as an advertising hook.

There is also at least one more relevant pre-2000 Big Data reference in computer science. It is subsequent to Mashey et al., but interestingly, it comes from the academic as opposed to industry part of the computer science community, and it not only uses the term but also demonstrates some awareness of the phenomenon. Weiss and Indurkhya (1998), in particular, note that “... very large collections of data ... are now being compiled into centralized data warehouses, allowing analysts to make use of powerful methods to examine data more comprehensively. In theory, ‘Big Data’ can lead to much stronger conclusions for data-mining applications, but in practice many difficulties arise.”

Finally, arriving on the scene later but also going beyond previous work in compelling ways, Laney (2001) highlighted the “Three V’s” of Big Data (Volume, Variety and Velocity) in an unpublished 2001 research note at META Group. (META is now part of Gartner.) Laney’s note (http://goo.gl/Bo3GS) is clearly relevant, and it goes beyond my exclusive focus on volume, producing a significantly enriched conceptualization of the Big Data phenomenon. In short, if Laney arrived slightly late, he nevertheless brought more to the table.

The rest, as they say, is history. As described by Cannadine (2020):

“In 2012, Big Data entered the mainstream when it was discussed at the World Economic Forum in Davos. In March that year, the American government pro- [...]
[...] discipline.

At first pass, Big Data as a discipline sounds like marketing fluff, as do other information technology sub-disciplines with catchy names like “artificial intelligence,” “data mining,” “neural networks,” and “machine learning.” Indeed it’s hard to resist smirking when told that major firms are rushing to create new executive titles like “Vice President for Big Data.” (Steve Lohr reports the title “Vice President for Big Data” in his New York Times piece.) But as I have argued, the phenomenon behind the term is very real, so it may be natural and desirable for a corresponding new business discipline to emerge, whatever its executive titles.

On the other hand, if Big Data is arguably a new business discipline, it’s still not obvious that it’s a new scientific discipline. Skeptics will argue that traditional disciplines like computer science and statistics are perfectly capable of confronting the new phenomenon, so that Big Data is not a new discipline, but rather just a box drawn around some traditional ones. But it’s hard not to notice that the whole of the emerging Big Data (or “data science”) discipline seems greater than the sum of its parts. That is, by drawing on perspectives from a variety of traditional disciplines, Big Data is not merely taking us to bigger traditional places; rather, it’s taking us to very new places, unimaginable only a short time ago. Indeed one could argue that Big Data is emerging as a major interdisciplinary triumph.

We are now confronted with both Big Data opportunities and Big Data pitfalls. Cannadine (2020) highlights some of the opportunities:

“... it isn’t so much the data that’s important, it’s what you do with it that counts. With the evolution of Big Data came ... new ways of analyzing the new data sets to which we now have access. As a result, Big Data has been hailed for its potential to improve decision-making in fields from business to medicine, allowing judgments and evaluations to be based increasingly on information and analysis rather than intuition and insight.”

On the other hand, pitfalls lurk, for example, in the emergence of Orwellian surveillance. Cannadine (2020) takes a somewhat sanguine view: [...], published in 1949:

“Always eyes watching you and the voice enveloping you. Asleep or awake, indoors or outdoors, in the bath or bed – no escape. Nothing was your own except the few cubic centimeters in your skull.”

Time will reveal how Big Data opportunities and pitfalls evolve, but there is no turning back.

References

Andersen, T.G., T. Bollerslev, P.F. Christoffersen, and F.X. Diebold (2013), “Financial Risk Measurement for Financial Risk Management,” In M. Harris, G. Constantinides and R. Stulz (eds.), Handbook of the Economics of Finance, Volume 2, Part B, Elsevier, 1127-1220.

Cannadine, D. (2020), “Big Data,” Behind the Buzzwords, BBC Radio 4 Podcast, 4 August 2020.

Diebold, F.X. (2000), “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting,” Eighth World Congress of the Econometric Society, Seattle, August.

Diebold, F.X. (2003), “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting: A Discussion of the Papers by Reichlin and Watson,” In M. Dewatripont, L.P. Hansen and S. Turnovsky (eds.), Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress of the Econometric Society, Cambridge University Press, 115-122.

Laney, D. (2001), “3-D Data Management: Controlling Data Volume, Velocity and Variety,” META Group Research Note, February 6. http://goo.gl/Bo3GS.

Reichlin, L. (2003), “Factor Models in Large Cross Sections of Time Series,” In M. Dewatripont, L.P. Hansen and S. Turnovsky (eds.), Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress of the Econometric Society, Cambridge University Press, 47-86.

Tilly, C. (1984), “The Old New Social History and the New Old Social History,” Review (Fernand Braudel Center), 7, 363–406.

Watson, M.W. (2003), “Macroeconomic Forecasting Using Many Predictors,” In M. Dewatripont, L.P. Hansen and S. Turnovsky (eds.), Advances in Economics and Econometrics: Theory and Applications, Eighth World Congress of the Econometric Society, Cambridge University Press, 87-115.

Weiss, S.M. and N. Indurkhya (1998),