Creating a Relational Distributed Object Store
Robert Primmer, Scott Nyman, Wayzen Lin
Hitachi Data Systems
{bob.primmer, scott.nyman, wayzen.lin}@hds.com
June 2013
Abstract
In and of itself, data storage has apparent business utility. But when we can convert data to information, the utility of stored data increases dramatically. It is the layering of relation atop the data mass that is the engine for such conversion. Frank relation amongst discrete objects sporadically ingested is rare, making the process of synthesizing such relation all the more challenging, but the challenge must be met if we are ever to see an equivalent business value for unstructured data as we already have with structured data. This paper describes a novel construct, referred to as a relational distributed object store (RDOS), that seeks to solve the twin problems of how to persistently and reliably store petabytes of unstructured data while simultaneously creating and persisting relations amongst billions of objects.
Databases have proven to be a useful and versatile container for housing structured data, consisting of relatively simple but well-defined data types, referred to as scalars (mostly numbers and strings). In the 1970s databases evolved beyond a basic storage container to provide functions critical to the business by allowing relations amongst the stored data to be expressed and persisted [6]. This is fundamental to how business operates: data from one part of the business must be correlated to other parts of the business if insights and efficiencies are to be realized.

As a simple example, consider operations at a retailer such as Walmart where a database has two tables: one records point-of-sale transactions, the other holds inventory data. At the point of sale it's useful to have a record of all the items purchased; however, it's far more useful to be able to then relate this back to the inventory system to automatically reorder goods as needed. This relation, and the automation it enabled, let Walmart eliminate manual shelf-stock planning, lower cost, and prevent over- and under-stocks. This basic example shows where adding relation provides significant efficiency to the business.

The container used to store such transactions is typically a relational database, comprised of rows (tuples) and columns of scalar data. A substantial strength of relational databases comes in the form of a standard query and transformation logic, SQL, which allows applications outside the creating application to query the data [2]. This ability to separate data from the bounds of the creating application amplifies and extends the value the data can provide the business.
In contrast to the simple scalar types typical of structured data, unstructured data is comprised of rich and expressive data types (e.g. PowerPoint presentations or full-motion video) that do not fit well into the traditional database paradigm.

An object store is essentially a database for unstructured data, composed of two parts: a distributed database that holds object references and a distributed file store that stores the user data, referred to as data objects or blobs. The database is typically modeled as a NoSQL "shared nothing" data store for horizontal scaling, replicating and partitioning data over many servers [4].

Public cloud examples include key-value stores such as Amazon's Dynamo [8] and Project Voldemort used by LinkedIn [27]. In the enterprise and service provider sectors, the Hitachi Content Platform (HCP) provides a distributed object store that typically resides behind a firewall [20]. HCP is conceptually similar to the combination of Google's Bigtable [5] for storing object references and Google File System (GFS) [10] for storing data objects.
Fundamental to object storage is that the detail of the distributed database and underlying file systems is abstracted from both the client applications (users) and system administrators. In the process, object stores shift the client model, essentially presenting storage as a service rather than requiring clients to be directly involved in data storage decisions, such as properly balancing directory trees. While some object stores only support a single flat namespace [19], others allow the global namespace to be logically partitioned for greater security, e.g. with a collection of buckets in Amazon S3 [29], or namespaces in HCP [22].

Such abstraction allows for comparatively naïve users and administrators, as the object store takes care of the detail of how and where user data is stored, protected, geo-replicated, de-duplicated, versioned, garbage collected, and so forth. This greatly simplifies application development and deployment, as evidenced by the plethora of start-ups who are able in short order to get a worldwide service up, running, and generating revenue by leveraging a public object storage service, something never seen before in history.

Likewise, we see where limited compute platforms, such as smartphones and tablets, have full access to a universe of data without supporting a single one of the traditional storage protocols (Fibre Channel, SCSI, NFS, CIFS) made popular in the last century for LAN-attached disk [11]. Instead, both simple and complex data types are served by nothing more than the basic web protocol, HTTP.
This paper describes a mechanism for adding a relational layer on top of the object storage layer, without destroying the simplicity gained through the abstractions for which object stores are noted. Our goal is for users to be able to manage relations between data in similar fashion to a relational database. However, since unstructured data is intrinsically different from structured data, it's necessary that we provide a substrate for defining and describing such relation. Further, we envision creating a mechanism by which users can query an RDOS in a manner similar to the way they query a relational database today, for similar benefit.
The remainder of the document is organized as follows. Section 2 defines unstructured data and touches on some of the associated challenges this datatype brings. Metadata is essential to the challenge of creating a relational layer on top of unstructured data. Section 3 defines metadata, explains why it is of such value, and identifies its potential to transform unstructured data from an undifferentiated data mass to a highly correlated data set that provides genuine value to the business. Section 4 describes HCP in its present form and section 5 describes the ideal of extending it to be a relational distributed object store and the business value it provides. Related work is described in section 6 and section 7 concludes with a summary of the topics covered in this paper.
The awkward term unstructured data is intended to complement the term structured data, which commonly refers to data stored in a database (DB) of some variety. Logically, unstructured data can be thought of as all data not stored in a DB, but most commonly it is used to refer to files housed in a hierarchical file system. Compared to scalars arranged by row and column within a DB, files can be far more expressive, comprising such varied formats as office documents, digital images, and full-motion video, collectively referred to as rich data types.

The notion of files is further abstracted to objects. While there isn't a canonical definition specifying precisely what constitutes an object in the context of storage, object store implementations tend to represent an object as the union of a data object, file system metadata, and (user created) custom metadata. Here data object is simply the user data; for example, a Microsoft Word document that contains a travel itinerary. The file system metadata may include things such as the name of the file, the time it was created, and when it was last updated. Thus far a basic file system would suffice to house both the data (the Word document) and file system metadata (the filename, time created, etc.). An object store provides the ability to add a third element, custom metadata, to the mix.

Custom metadata allows the user to annotate the base data with arbitrary text, often in the form of key-value pairs. Continuing our example, when storing our travel itinerary in an object store we may want to annotate this document with select key data such as: Year=2013, Department=Sales, Territory=US, Status=Approved.

From this example we see that custom metadata provides a mechanism to make the core file more useful by allowing the user to abbreviate the file with select information that can later be used to logically group documents.
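The key-value annotation scheme just described, and the kind of predicate query it enables, can be sketched with a minimal in-memory model. The store layout, object names, and `query` helper below are illustrative assumptions, not the actual API of any object store.

```python
# Minimal in-memory sketch of custom metadata as key-value pairs.
# Object names and the query helper are invented for illustration.

store = {
    "itinerary-001.docx": {"Year": "2013", "Department": "Sales",
                           "Territory": "US", "Status": "Approved"},
    "itinerary-002.docx": {"Year": "2013", "Department": "Sales",
                           "Territory": "EMEA", "Status": "Approved"},
    "itinerary-003.docx": {"Year": "2012", "Department": "Legal",
                           "Territory": "US", "Status": "Rejected"},
}

def query(store, **predicates):
    """Return object IDs whose metadata satisfies every key=value predicate."""
    return [oid for oid, md in store.items()
            if all(md.get(k) == v for k, v in predicates.items())]

matches = query(store, Year="2013", Department="Sales",
                Territory="US", Status="Approved")
print(matches)  # only itinerary-001.docx satisfies all four predicates
```

Note that the query touches only the small metadata records, never the document content itself, which is the essence of the lightweight search discussed next.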
Additionally, we can perform a lightweight search against the object store for all travel requests by Sales in 2013 that were approved for travel within the US and quickly return a list of all objects that meet these criteria.

This is exactly the type of function we'd expect to perform against a sophisticated RDBMS. However, instead of being restricted to storing scalars, we gain the full expressiveness possible with rich data file types without sacrificing the ability to convert that data into information through mechanisms such as the ability to perform predicate searches. Better still, searching metadata stored within the object store is a far lighter weight operation than searching the full content of every file stored and attempting to piece together logical groupings ad hoc. The value metadata brings to unstructured data is substantial. In short, it's metadata that allows us to add structure where none would otherwise exist in the unstructured data world.

Throughout the last two decades a recurring topic in IT is how to manage the exponential increase in the volume of data that must be stored as societies move increasingly to a digital world [1]. What's more, while the growth of data to date has been tremendous, the rate of increase is projected to grow greater still. IDC projects that the
Digital Universe will grow by a factor of 300, to 40,000 exabytes, by 2020, and that enterprises will have "liability or responsibility for 80% of the information" in this digital universe. Further, unstructured data is expected to constitute 90% of the 40,000 exabytes [9].

This staggering growth correlates to a shift from text to rich data types such as digital images and video. To get a sense of how such data growth can be rationalized, in [21] we see where the capacity required to store a high definition movie trailer is 20,000 times greater than that required to store a traditional movie review. It's this continued shift to rich data types that fuels such spectacular growth; and this growth shows few signs of subsiding.
Such growth begs the question: where are we to store this mountain of data? While RDBMSs are used to hold structured data, file systems and object stores are common containers for unstructured data.

Much of the R&D over the last 50 years for file systems has focused on how to improve storage efficiency (e.g. optimizing file system overhead), durability, and reliability. However, from the perspective of user interface, the basic paradigm of the hierarchical file system remains largely unchanged since its introduction in the 1960s.

For enterprise-class object stores, a cardinal focus has been on answering the question: how do we achieve extreme scale beyond that provided by traditional file systems? Since their genesis, the target market for object stores has been massive data stores, where the feature of cataloging all objects stored has greater value. It's comparatively easy to keep track of thousands of files for a few years; it's far more difficult to manage billions of them for decades.

The Internet and the falling price of magnetic storage have shifted our expectations of digital storage. We now expect data to live in perpetuity, instead of being periodically culled for usefulness. This new behavior further highlights the need for large-scale data management, both from a capacity and a file count standpoint.

A common misperception is that object stores are a replacement for file systems; instead, they are an augmentation. The file system is tightly coupled to the operating system and provides a well established mechanism for organizing files within a hierarchy of directories. By contrast, an object store focuses on changing the presentation layer to the storage consumer through a simplified interface while achieving enormous scale by aggregating many file systems into a single, higher-order grouping.

Figure 1 presents an abstract view of the storage stack on a single node.
The function of each superior layer is to aggregate and abstract the layer beneath, permitting greater sophistication and specialization in each layer without increasing complexity to clients of upper layers. The object storage layer creates a distributed storage service for client applications without requiring the clients to manage data distribution.

Figure 1: Storage Stack
In the past the term metadata was not widely known, understood mainly by technologists, and for good reason: file systems provide metadata in parsimonious form, providing data about a file, such as when it was created and last updated. Object stores by contrast are comparatively lavish in what they allow for metadata, allowing the user to associate any arbitrary text with a data object.

Today, grandmothers think nothing of storing their photos online in globally available cloud repositories while annotating the pictures with metadata key-value pairs such as who is in the photo, where it was taken, and at what event. Young children know how to assign metadata tags to logically group items on Facebook and Twitter. In the 21st century, metadata has moved from the exotic to the pedestrian.
Metadata is the connective tissue that binds objects to one another. Additionally, metadata allows users to apply semantic meaning to otherwise opaque data objects, while providing a means to abbreviate large files with a very small amount of data. In short, it is metadata that allows us to add structure to unstructured data.

How important is metadata? In modern systems, if the original designers fail to provide for it, the users do so themselves. Hashtags are a metadata convention amongst users of the microblogging service Twitter [14], yet when it was first released Twitter had no support for metadata. Instead, hashtags were proposed by a user in 2007 (in a 140-character Twitter post), and became so popular that the engineers at Twitter added software support for them. Today, Twitter detects "trending topics" using popular hashtags [3].
Absent the ability to create logical groupings, a mass of objects housed in a data store is of limited value. In such a case the data store provides essentially equivalent value as tape: files are persisted, but there's not much you can do with them. Logical grouping creates tacit relation amongst an otherwise indistinct set of independent objects. Since this is a logical operation, the grouping is elastic and does not require expensive disk operations to move files into specific containers to define groupings.
Storing data is one thing; having that same data hold genuine meaning for the user is quite another. For example, a photo stored on disk is valuable, but it's the ability to add semantic meaning to these photos (e.g., who is in the photograph at what event) that brings life to the data.
Indexing the content of very large files consumes substantial capacity to hold the resulting index. Through metadata it's possible to provide a comparatively small subset of data, effectively creating an abbreviation of the larger file content. The capacity required to index this metadata can be orders of magnitude less than that required to index the full set of data objects, with a commensurate reduction in the compute required to create the search index.
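The magnitude of that reduction can be illustrated with back-of-envelope arithmetic. All numbers below are illustrative assumptions (object sizes, index overhead ratio), not measurements from any real system, and they presume the full content would be text-indexable at all.

```python
# Back-of-envelope comparison of index footprint: full-content indexing
# versus indexing only a small metadata abbreviation. All figures are
# assumed for illustration.

object_count   = 1_000_000
avg_object_mb  = 50.0      # rich data type, e.g. a video clip
avg_meta_kb    = 2.0       # a few key-value pairs per object
index_overhead = 0.10      # assume index ~10% the size of indexed content

full_index_gb = object_count * avg_object_mb * index_overhead / 1024
meta_index_gb = object_count * avg_meta_kb * index_overhead / (1024 * 1024)

print(f"full-content index: ~{full_index_gb:,.0f} GB")
print(f"metadata index:     ~{meta_index_gb:,.2f} GB")
print(f"reduction factor:   ~{full_index_gb / meta_index_gb:,.0f}x")
```

With these assumptions the metadata index is smaller by exactly the ratio of object size to metadata size (here 50 MB / 2 KB, a factor of 25,600), independent of the index-overhead constant.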
Hitachi Content Platform (HCP) is a multipurpose distributed object store designed to support large-scale repositories of unstructured data. Physically, HCP is a collection of storage servers (nodes), referred to as a cluster, that use a private back-end network for inter-node communications and a public front-end network for client communication. A gossip protocol is used to determine active cluster membership. In the event of temporary failure of any node or service, requests are automatically vectored to active nodes for fulfillment.

The system is logically divided into a collection of tenants and namespaces; tenants are the administrative unit, namespaces are the storage unit. Tenants are critical for cloud / service provider deployments where it's important not only to logically separate client data, but also to divide the administration of each virtual instance of HCP. The system is designed so that no one administrator has dominion over the system as a whole, but only over their assigned tenant. A tenant administrator can create a collection of namespaces that will hold user data.

The value proposition of a multitenant system is largely the same as any shared pool of physical resources such as SANs: sharing reduces cost by pooling physical resources. In the process, however, shared storage also exposes data to access by unauthorized users and to overwrites by multiple clients [35]. To counter this HCP employs a complex of security measures: administrative functions are strictly divided, user access rights and restrictions are defined, access controls are granular to individual objects, and all data can optionally be encrypted, both in flight and on disk.

Client access to the system is via a number of software gateways, distinguished by protocol. Presently the following protocols are supported: HTTP (REST), S3, CIFS, NFS, SMTP and WebDAV. Objects that were ingested using any protocol are immediately accessible through any other supported protocol.
These protocols can be used to access the data with a web browser, HCP client tools, third-party applications, Windows Explorer, or native Windows or Unix tools [22].

HCP provides high availability with strong consistency guarantees. Replication comes in two forms: intra- and inter-cluster. Intra-cluster replication is synchronous to the write path to ensure that all data ingested is successfully persisted to disk before returning acknowledgement to the client. Consistent hashing is used to distribute data among the nodes. Within the strictures of assuring replicas are assigned separate fault domains, writes bias toward lightly loaded nodes. Inter-cluster replication is asynchronous to the write path to ensure low write latency while providing geographic dispersion of objects.

HCP uses XML as the serialization format for persisting custom metadata, and XPath structured queries against that metadata. This allows the system to return specific answers to user queries, rather than a collection of potential matches.

HCP employs redundancy at multiple levels: the object reference database, user data (data objects), and system and custom (user) metadata. Likewise, all hardware components are redundant so there is no single point of failure. Commonly the focus for data protection centers on protecting user data. However, for database systems it's actually the protection of the index that's most important, as a loss of the pointer to the data is tantamount to the loss of the data itself. HCP shards the object reference database amongst distinct fault domains, both for performance and protection. Further, in the event of catastrophic failure, a scavenging service is employed to reconstruct the database from the constituent metadata persisted to disk as part of the write path.

The data model is one of immutable data objects with mutable metadata.
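The combination of XML-serialized metadata and XPath predicate queries can be sketched with the Python standard library, which supports a small XPath subset. The XML layout and object URIs below are invented for illustration and do not reflect HCP's actual metadata schema.

```python
import xml.etree.ElementTree as ET

# Sketch: custom metadata serialized as XML, queried with an XPath
# predicate. The schema and URIs are illustrative assumptions only.
annotation = """
<metadata>
  <object uri="uri://travel/itinerary-001">
    <Department>Sales</Department><Status>Approved</Status>
  </object>
  <object uri="uri://travel/itinerary-002">
    <Department>Legal</Department><Status>Rejected</Status>
  </object>
</metadata>
"""

root = ET.fromstring(annotation)
# The [tag='text'] predicate returns specific answers, not a list of
# candidate matches to be filtered by the client.
hits = [obj.get("uri")
        for obj in root.findall(".//object[Department='Sales']")]
print(hits)  # ['uri://travel/itinerary-001']
```

A structured query like this returns exactly the objects satisfying the predicate, which is the distinction the text draws against full-text search engines that return ranked candidates.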
To simulate in-place updates of data objects, HCP supports object versioning, i.e. the capability of a namespace to create, store, and manage multiple versions of objects within the repository. Capacity efficiency is achieved through a combination of compression and duplicate elimination.

In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, content verification, version control, retention policies, disposition of objects no longer under retention, compression, encryption, failure recovery, replica synchronization, overload handling, state transfer, garbage collection, concurrency and job scheduling, request marshaling, request routing, system monitoring and alarming, and configuration management.

The result is a system that provides data management as a service, referred to as Storage as a Service, in a manner that makes both client development and management of the system simple. The design goal is that neither clients nor system administrators should be aware of the detail of the multiple software services needed to keep a multi-petabyte repository housing billions of objects coherent and responsive to client requests.
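The replica-placement behavior described above, consistent hashing constrained so that replicas land in distinct fault domains, can be sketched as follows. The node names, rack layout, and distance function are invented for illustration; this is not HCP's actual placement algorithm.

```python
import hashlib

# Sketch of replica placement: hash-based node ordering with the
# constraint that each replica occupies a distinct fault domain (rack).
# Node and rack names are illustrative assumptions.
NODES = [("node1", "rackA"), ("node2", "rackA"),
         ("node3", "rackB"), ("node4", "rackB"),
         ("node5", "rackC"), ("node6", "rackC")]

def ring_position(key):
    """Deterministic position on the hash ring for a node or object key."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def place_replicas(object_id, replicas=3):
    # Order nodes by hash distance from the object, then walk that order,
    # skipping any fault domain that already holds a replica.
    oid_pos = ring_position(object_id)
    ordered = sorted(NODES, key=lambda n: ring_position(n[0]) ^ oid_pos)
    chosen, used_domains = [], set()
    for node, domain in ordered:
        if domain not in used_domains:
            chosen.append(node)
            used_domains.add(domain)
        if len(chosen) == replicas:
            break
    return chosen

placement = place_replicas("object-42")
print(placement)  # three nodes, each in a different rack
```

A real system would additionally bias toward lightly loaded nodes within these constraints, as the text notes; that refinement is omitted here for brevity.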
The challenges of creating a relational object store are multiple and substantial. The first deals with the most basic question of how we obtain the needed metadata in the first place.

To date there is relatively little metadata being associated with the data objects in object stores. There are a number of reasons for this.
Application Reluctance: Legacy applications were built before a time when there was a well-defined means for storing and associating metadata with data objects. However, even in the presence of such mechanisms today, a paucity of metadata persists. First, a general application by itself is unlikely to know what constitutes salient metadata for a given file. Second, most commercial applications, recognizing the value of metadata to the customer, are reluctant to give up control of the metadata to be used outside the confines of the application itself. Commercial application providers may not be motivated to put extensive metadata into a data store, as it allows the customer to detach the data from the application, in the process reducing the stickiness of the application. This creates a financial disincentive to allow the user free and ready access to this metadata without the need to involve the creating application.

Further, even if a commercial application provider is properly motivated to do so for the good of the customer, it's hard to determine the right set of data to include in metadata, as each customer is different. Just as with an RDBMS, where individual customers create relations amongst data sets as appropriate to their own needs, the customer needs to be able to define the metadata that is apropos to their specific environment.
Tools for Association: Few tools exist to enable the systematic association of metadata with data objects. Business agents with the necessary domain expertise for determining what constitutes relevant metadata are unlikely to have the technical expertise required for tool development.
Sophisticated Metadata Stores: Even when modern applications wish to associate meaningful metadata with data objects, few object stores have the sophistication to provide the necessary foundation for good programming practice.

For example, for metadata to be a first-class entity it's necessary that we are able to partition the address space so that different users can create, modify and delete their own metadata in isolation. Today, most systems provide a single bucket where all metadata is housed. Changes by one user therefore affect every other user, often in unpredictable ways. To protect user data from such unintended manipulations, all data stores partition the address space for data objects; few provide equivalent functionality for metadata. For these reasons HCP allows metadata partitions for each data object. Figure 2 illustrates this concept with a data object that has eight distinct metadata partitions.

The first step is to define what metadata should be extracted or applied to data objects. Search engines can crack many file types and extract not only data from the core file, but also in some cases metadata embedded in the file's header (e.g. in the case of DICOM images). However, standard data is made far more useful by applying semantic meaning to the content. This step is difficult for a general search engine to do as, by its very nature, semantic meaning is often particular to a user or organization and therefore doesn't lend itself to a generic solution.

Creating an ontology, where semantic meaning can be mapped to data, is referred to as a Data Dictionary. Multiple such dictionaries can be created, each with meaning peculiar to a function within an organization. For example, one dictionary may contain a grammar common to a particular vertical, such as the field of healthcare, while other dictionaries will be particular to the organization itself, such as codenames used within a company. The sum of the individual data dictionaries creates the compendium necessary to inform the extraction and application step.
The data dictionaries defined in the prior step act as extractors or applicators run across a field of previously ingested data objects, or against individual objects during ingestion. When applied as a filter, objects are scanned for relevant key-value pairs within the data object, with the results applied as metadata scalars. Of course, such extraction is only possible where the data object is searchable. In cases where it is not, such as for image files, applicators apply metadata as defined in the data dictionaries to relevant data objects.

Since there can be multiple dictionaries, an object O is run through a pipeline of size p where the rules of each dictionary d are applied in turn. For extraction operations we have f_e = Σ_{i=1}^{p} (O ∩ d_i); for applicator operations we have f_a = Σ_{i=1}^{p} (O ∪ d_i). In cases where a particular step in the pipeline has no rules to apply against an object, that step becomes the identity function, that is O_input = O_output. Note that while it's possible that every step in a pipeline will be exclusively of type f_e or f_a, this is not a requirement; that is, f_e and f_a are not mutually exclusive.

Once run through the pipeline, data objects are coupled with relevant metadata and stored as scalars, which can be used for subsequent queries by client applications and to inform relations between objects.

Figure 3: Metadata Generation Pipeline

In this model the extraction or application steps in the pipeline are referred to as a metadata generation module, or MGM. As depicted in Figure 3, a data object progresses through a series of MGMs, where an MGM comprises a script language specific to the task. At every stage a decision is made whether there is metadata to be applied.
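The dictionary-driven pipeline just described, extraction (O ∩ d_i) pulling key-value pairs out of searchable content, application (O ∪ d_i) stamping predefined tags on, and identity behavior when a stage has no matching rules, can be sketched as below. The class and function names are illustrative assumptions, not an API of the system.

```python
# Sketch of a metadata generation pipeline. Each stage applies one
# dictionary's rules in turn; a stage with nothing to apply acts as
# the identity function. Names here are invented for illustration.

class DataObject:
    def __init__(self, content, metadata=None):
        self.content = content
        self.metadata = dict(metadata or {})

def make_extractor(keys):
    # Extraction: pull key=value pairs out of searchable content.
    def mgm(obj):
        for key in keys:
            marker = key + "="
            if marker in obj.content:
                obj.metadata[key] = obj.content.split(marker, 1)[1].split()[0]
        return obj              # identity when no rule matches
    return mgm

def make_applicator(tags):
    # Application: stamp predefined tags onto the object.
    def mgm(obj):
        obj.metadata.update(tags)
        return obj
    return mgm

pipeline = [
    make_extractor({"Department", "Status"}),
    make_applicator({"Namespace": "travel", "Reviewed": "false"}),
]

doc = DataObject("Itinerary Department=Sales Status=Approved for Q3 travel")
for mgm in pipeline:
    doc = mgm(doc)
print(doc.metadata)
```

Because each stage is an independent function taking and returning the object, stages can be reordered, added, or replaced without touching the others, which mirrors the independence property the design calls for.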
While each MGM is independent, the ordering of MGMs can matter, as the algorithm applied at any individual MGM may base a decision of whether to add, change or delete a metadata section not only on the data object itself, but rather on the union of the data object and the metadata accumulated in the pipeline to that point. Nonetheless, fundamental to this design is that an MGM is independent, to allow the flexibility of adding or updating MGMs in existing pipelines.

In HCP the address space is partitioned into namespaces that essentially act as a virtual instance of the system as a whole. This segregation allows different data management and access policies to be applied to each set of data objects housed within a particular namespace. It is a natural extension that each namespace may have a different set of MGMs to analyze data objects specific to a namespace (Figure 4). This allows finer-grain decisions to be applied to objects already grouped by user.

Figure 4: Unique Pipeline per Namespace

Database programmers will likely recognize that the MGM construct is conceptually similar to a stored procedure in an RDBMS. A stored procedure is a subroutine stored in the database data dictionary that runs on the database server itself rather than directly on the client. In this model the MGM acts as a stored procedure, where common code can be run directly on the RDOS cluster rather than the client.

Note that a client application could choose to perform all the functions of the MGM pipeline directly by reading an object, applying metadata, and then rewriting this object back to the RDOS cluster. The downside of this method is that network bandwidth is consumed by the roundtrip for the read/write operations. It can therefore be more efficient to perform this same function on the cluster by running objects through the MGM pipeline. MGM scripts can be created either by the application provider or constructed by the system administrator through a GUI.

Armed with the metadata from the prior steps, consider an example: when an insurance claim form is ingested into RDOS, a claim identifier (CID) is assigned (CID=1234). Subsequently the photos from the appraiser and the scan of the police report are uploaded. For all objects there is a metadata tag set where CID=1234, providing a common key amongst the objects. At ingest, the client application can choose to add a second metadata field indicating that the images are related to the claim form. As objects are identified by URI, this translates to: RelTo=URI{Object1}. Through this structure we have a means of relating the full set of objects which are all associated with the same insurance claim.

This example depends upon the application explicitly creating the relation between the objects by setting the RelTo metadata tag, which is the preferred method for new object ingest. However, the same mechanism described above can also be used to apply relation metadata to previously ingested objects.
Graph databases make it easyto discover centrality, where we measure individualnodes against a full graph. (The most famous cen-trality algorithm is Google PageRank [17]. In § and the Disease Ontology database . Freebase, cre- Graph theory is a branch of discrete mathematics. Nodesare referred to as vertices and what we call links are referredto as edges , where edges connect pairs of distinct vertices. Asimple graph (such as we’ll be using in our descriptions) with V vertices has at most V ( V − / http://disease-ontology.org Once the data is available and programmatically ac-cessible through well-defined APIs, modeling typesthat users can perform can be considered.Object associations can be definitive or manu-factured through a probabilistic generative modelto guide inference from incomplete data. The for-mer can be prescribed by the user by linking ob-jects through a GUI or template; the latter throughBayesian inference [18], which provides a rationalframework for updating beliefs about latent variablesin generative models given observed data [16].Creating Bayesian models through graphs andpredicate logic is more commonly the domain of datascientists than IT users. However, the goal for oursystem isn’t to mandate a specific model of relation(definitive or derived), but rather to take care in de-sign to not implicitly restrict the user to a particularmodel. Our system must provide the flexibility topersist and mutate object relation to meet a varietyof user requirements .As such, our goal is to provide an abstract frame-work over which schemas can be applied with suffi-cient flexibility to allow these relations to span theprosaic, e.g. linking monotonically increasing in-stances of an object (creating a version tree), to theexpressive, e.g. 
a graph schema where nodes represent variables and directed edges between nodes represent probabilistic causal links.

For example, in an epidemiology study nodes might represent whether a patient has a cold, a cough, a fever or other conditions, and the presence or absence of links indicates that colds tend to cause coughing and sinus inflammation but not fever; sinus inflammation tends to cause headache but not fever; and so on [28]. The probability of a causal relation can be further refined by applying a weighting (0–1.0) to each link; e.g., there's a 60% likelihood that a cold will result in a cough. As depicted in Figure 5, graphs provide a ready means for users to interpret data and visualize relationships. Nate Silver discusses the breadth of problems to which Bayesian reasoning can be applied in [26].
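The weighted cold-to-cough link can be worked through numerically. In this sketch the 60% link weight comes from the text; the prior P(cold) = 0.2 and the background rate P(cough | no cold) = 0.1 are illustrative assumptions, not figures from the paper:

```python
# Toy two-node causal graph (cold -> cough) with a weighted link.
# The 0.6 link weight is from the running example; the other
# probabilities are assumed for illustration.
p_cold = 0.2               # assumed prior: P(cold)
p_cough_given_cold = 0.6   # link weight: 60% likelihood cold -> cough
p_cough_given_no_cold = 0.1  # assumed background cough rate

# Marginal probability of a cough (law of total probability):
p_cough = (p_cold * p_cough_given_cold
           + (1 - p_cold) * p_cough_given_no_cold)

# Bayesian inference: update belief in "cold" after observing a cough.
p_cold_given_cough = p_cold * p_cough_given_cold / p_cough

print(round(p_cough, 2))             # 0.2
print(round(p_cold_given_cough, 2))  # 0.6
```

Observing a cough triples the belief that the patient has a cold (from 0.2 to 0.6), which is exactly the style of update a Bayesian model over such a graph performs at scale.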
Figure 5: Epidemiology Directed Graph

Through the proper application of multi-predicate constraints over the field of objects in a domain, the user is able to resolve to classes of objects, such as all objects associated with the progenitor of a disease. For example, a user might want to view all outcomes for patients over 60 in Africa with diminished respiratory function in the presence of the bacterium Legionella, and contrast this to those in the presence of adenovirus, to determine whether better outcomes are seen in patients with bacterial versus viral pneumonia. Such analytic questions can be compute intensive when run over the full mass of unstructured data (what we refer to as data objects in § …).

The NoSQL space has been quite active, with marked similarity among implementations. Distributed systems can be distinguished by how they choose to bias availability versus consistency. Brewer's CAP theorem [12] states that a distributed system can provide for any two of the three attributes of Consistency, Availability and Partition tolerance. Since networks (particularly WANs) will always partition (i.e., over time there will be some parts of the network that are temporarily unreachable), for distributed systems this distills to a choice between a design that favors strong consistency and one that is highly available in the face of partitions.

This is the heart of the distinction between SQL and NoSQL systems. SQL systems adhere to ACID properties while most NoSQL stores follow BASE properties. Many NoSQL implementations provide eventual consistency [31], where writes are permitted in the presence of partitions. In such systems a client can update any replica of an object, and all updates to an object will eventually be applied, but potentially in different orders at different replicas, thus creating temporal inconsistencies that must be sorted out by client applications [7].

Amazon's Dynamo [8] is the classic example of the eventually consistent model. Dynamo, a key-value store, is a zero-hop distributed hash table, where each node maintains enough routing information locally to route a request to the appropriate node directly. Dynamo shares a number of characteristics with other distributed object stores used by public cloud vendors that stem from their operational model, which is a closed-loop system where both clients and all services are under the control of a single company. These systems, while used to support public-facing services, can assume that their internal operating environment is non-hostile and therefore have few security requirements such as authentication and authorization. Further, since these NoSQL object stores permit inconsistency, users (client applications) must share an agreed understanding of how to resolve conflicts during reads.
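The divergence such stores permit, and the agreed client-side resolution rule it demands, can be sketched as follows (the timestamps and last-writer-wins rule are one common convention, assumed here for illustration):

```python
# Illustrative sketch: two replicas of an object receive the same
# updates in different orders, as an eventually consistent store
# permits. Each update carries a timestamp; reads resolve conflicts
# with a last-writer-wins rule that all clients must agree on.
replica_orders = [
    ("replica-A", [(1, "v1"), (2, "v2")]),   # updates in issue order
    ("replica-B", [(2, "v2"), (1, "v1")]),   # delivered out of order
]

def read_resolve(applied):
    """Client-side conflict resolution: highest timestamp wins."""
    return max(applied, key=lambda update: update[0])[1]

resolved = [read_resolve(order) for _, order in replica_orders]
print(resolved)  # ['v2', 'v2'] -- both replicas converge on read
```

Stores like Dynamo track causality with vector clocks rather than bare timestamps, but the burden is the same: the resolution rule lives in the client, not the store.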
As a general-purpose system with non-specific clients, HCP can make neither assumption and must design for hostile environments and provide strong consistency guarantees.

Other systems that use an eventual consistency model include Facebook's Cassandra (now Apache) and Google's Bigtable. Bigtable is a distributed storage system for managing structured data. It maintains a sparse, multi-dimensional sorted map and allows applications to access their data using multiple attributes [5]. Cassandra has been described as a marriage of Dynamo and Bigtable [15]. Yahoo's PNUTS [7] supports Yahoo! web properties such as Flickr. It uses per-record timeline consistency, where all replicas of a given record apply all updates to the record in the same order. Data objects (blobs) follow the eventually consistent model.

Providing strong consistency guarantees, Microsoft's Windows Azure Storage (WAS) [13] mixes strong consistency within a local stamp (the analog of a local cluster for HCP) with subsequent replication to remote geographies. This model is the most similar to HCP, which provides strong write guarantees through synchronous replicas locally before asynchronously dispersing replicas geographically. Such systems might be referred to as eventually distributed.

(ACID stands for Atomicity, Consistency, Isolation and Durability; BASE stands for Basically Available, Soft state, Eventual consistency. Descriptions of ACID and BASE can be found in [34] and [4] respectively.)

Rather than requiring the object store to deal with collections of file systems, some implementations provide a single (logical) distributed file system, such as the Google File System [10] and Ceph [33]. By contrast, HCP uses a confederation of local file systems distributed over a collection of nodes to persist data objects. This detail is abstracted from clients in favor of a storage-service key-value model, where the key is a URI that indicates a unique object.
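One simple way such a key-value abstraction over a confederation of file systems can work is to hash the URI key to select a backing file system. This is an assumption-laden sketch, not HCP's actual placement scheme; all names and the use of SHA-256 here are illustrative:

```python
# Illustrative sketch (not HCP's actual layout): local file systems
# are hidden behind a key-value interface whose key is a URI. A
# stable hash of the URI picks the file system that persists the
# object; clients never see this placement detail.
import hashlib

FILE_SYSTEMS = ["/fs0", "/fs1", "/fs2", "/fs3"]  # hypothetical mounts

def locate(uri: str) -> str:
    """Map an object URI to the local file system holding it."""
    digest = hashlib.sha256(uri.encode()).digest()
    return FILE_SYSTEMS[int.from_bytes(digest[:8], "big") % len(FILE_SYSTEMS)]

def put(store: dict, uri: str, blob: bytes) -> None:
    store.setdefault(locate(uri), {})[uri] = blob

def get(store: dict, uri: str) -> bytes:
    return store[locate(uri)][uri]

store = {}
uri = "https://tenant.example/ns/claims/1234/report.pdf"
put(store, uri, b"claim data")
print(get(store, uri) == b"claim data")  # True
```

Because the hash is deterministic, any node can compute an object's location from its URI alone, which is what lets the store present one flat namespace over many independent file systems.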
In this paper we described object stores in general and HCP in particular, showing how they serve as a data container for unstructured data. We discussed the importance of metadata, both to create logical collections of objects and to provide the basis for establishing relation among objects. We then described an idealized relational distributed object store, along with mechanisms for obtaining and applying metadata to existing objects and a method of associating and persisting relations amongst objects, providing a framework for data analytics to be performed against the object store.
References

[1] Adams, A., and Mishra, N. User survey analysis: Key trends shaping the future of data center infrastructure through 2011. Gartner Market Analysis and Statistics (Oct. 2010).
[2] Angles, R., and Gutierrez, C. Survey of graph database models. ACM Comput. Surv. 40, 1 (Feb. 2008), 1:1–1:39.
[3] Badia, A., and Lemire, D. A call to arms: revisiting database design. SIGMOD Rec. 40, 3 (Nov. 2011), 61–69.
[4] Cattell, R. Scalable SQL and NoSQL data stores. SIGMOD Rec. 39, 4 (May 2011), 12–27.
[5] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2 (June 2008), 4:1–4:26.
[6] Codd, E. F. A relational model of data for large shared data banks. Commun. ACM 13, 6 (June 1970), 377–387.
[7] Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.-A., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!'s hosted data serving platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277–1288.
[8] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev. 41, 6 (Oct. 2007), 205–220.
[9] Gantz, J., and Reinsel, D. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the Far East. Tech. rep., IDC, Dec. 2012.
[10] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 29–43.
[11] Gibson, G., Nagle, D., Amiri, K., Butler, J., Chang, F., Gobioff, H., Hardin, C., Riedel, E., Rochberg, D., and Zelenka, J. A cost-effective, high-bandwidth storage architecture. In ACM SIGOPS Operating Systems Review (1998), vol. 32, ACM, pp. 92–103.
[12] Gilbert, S., and Lynch, N. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News 33, 2 (June 2002), 51–59.
[13] Huang, C., Simitci, H., Xu, Y., Ogus, A., Calder, B., Gopalan, P., Li, J., and Yekhanin, S. Erasure coding in Windows Azure Storage. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (Berkeley, CA, USA, 2012), USENIX ATC '12, USENIX Association, pp. 2–2.
[14] Kwak, H., Lee, C., Park, H., and Moon, S. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (New York, NY, USA, 2010), WWW '10, ACM, pp. 591–600.
[15] Lakshman, A., and Malik, P. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev. 44, 2 (Apr. 2010), 35–40.
[16] MacKay, D. J. C. Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA, 2002.
[17] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, Nov. 1999. Previous number = SIDL-WP-1999-0120.
[18] Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[19] Primmer, R. Efficient long-term data storage utilizing object abstraction and content addressing. Tech. rep., EMC, July 2003.
[20] Primmer, R. Distributed object store principles of operation. Tech. rep., Hitachi Data Systems, May 2010.
[21] Primmer, R. Structured vs. unstructured data, 2012.
[22] Ratner, M. Hitachi Content Platform: Concepts and features. Tech. rep., Hitachi Data Systems, Feb. 2013.
[23] Redmond, E., and Wilson, J. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement. O'Reilly and Associates Series. O'Reilly Vlg. GmbH & Company, 2012.
[24] Schriml, L. M., Arze, C., Nadendla, S., Chang, Y.-W. W., Mazaitis, M., Felix, V., Feng, G., and Kibbe, W. A. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Research 40, D1 (2012), D940–D946.
[25] Sedgewick, R. Algorithms in C++, Part 5: Graph Algorithms (3rd ed.). Addison-Wesley-Longman, 2002.
[26] Silver, N. The Signal and the Noise: Why So Many Predictions Fail—but Some Don't. Penguin Group US, 2012.
[27] Sumbaly, R., Kreps, J., Gao, L., Feinberg, A., Soman, C., and Shah, S. Serving large-scale batch computed data with Project Voldemort. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2012), FAST '12, USENIX Association, pp. 18–18.
[28] Tenenbaum, J., Kemp, C., Griffiths, T., and Goodman, N. How to grow a mind: Statistics, structure, and abstraction. Science 331, 6022 (2011), 1279–1285.
[29] Varia, J. Cloud architectures. Tech. rep., Amazon Web Services, June 2008.
[30] Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., and Wilkins, D. A comparison of a graph database and a relational database: a data provenance perspective. In Proceedings of the 48th Annual Southeast Regional Conference (New York, NY, USA, 2010), ACM SE '10, ACM, pp. 42:1–42:6.
[31] Vogels, W. Eventually consistent. Queue 6, 6 (Oct. 2008), 14–19.
[32] Webber, J. A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity (New York, NY, USA, 2012), SPLASH '12, ACM, pp. 217–218.
[33] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: a scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 2006), OSDI '06, USENIX Association, pp. 307–320.
[34] Wright, C. P., Spillane, R., Sivathanu, G., and Zadok, E. Extending ACID semantics to the file system. Trans. Storage 3, 2 (June 2007).
[35] Yoshida, H. LUN security considerations for storage area networks. Hitachi Data Systems Paper XP2185193 (1999), 1–7.