Bridging BAD Islands: Declarative Data Sharing at Scale

Abstract

In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our prior work, we developed a Big Active Data (BAD) system for enabling Big Data subscriptions and analytics with millions of subscribers. Based on that, we introduce a new mechanism for enabling the sharing of Big Data at scale declaratively so that developers can easily create and provide data sharing services using declarative statements and can benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data sharing data flow among multiple systems, and present a prototype system with experimental results.


Xikui Wang, Michael J. Carey
Donald Bren School of Information and Computer Sciences
University of California Irvine
Irvine, United States
{xikuiw, mjcarey}@ics.uci.edu

Vassilis J. Tsotras
Department of Computer Science and Engineering
University of California Riverside
Riverside, United States


Index Terms: data warehouses, database systems, distributed information systems

I. INTRODUCTION

Advances in information technology have created large collections of data [1]. Such large volumes of data (Big Data) also come with big challenges. In order to transmit, process, and persist Big Data, researchers and experts from academia and industry have developed a plethora of systems [2]–[5]. However, most of them are passive in nature: they passively answer users’ requests to process and return data rather than actively processing and delivering data of interest to users. In many applications, users not only want to analyze data, but also to subscribe to and actively receive data of interest. Their interests may include the data’s content as well as its relationships to other data. For example, in-field officers may want to receive nearby threatening tweets whenever they are posted. There can be millions of users having similar requests. We refer to the enabling of Big Data subscriptions and analytics as Big Active Data (BAD).

Traditional pub/sub systems [6] often lack the capability of data processing and handling complex subscription requests that involve data’s relationships (such as “send me tweets near my current location”). More recent stream processing engines [7], [8] usually do not persist data for historical data analytics (such as “show me the average threatening rating of tweets in the past five months grouped by their location”). In order to accommodate BAD challenges, we have created a BAD system that supports Big Data subscriptions and analytics at scale [9]–[13].

In a big BAD world, the data to be analyzed and delivered often needs to be processed and enriched with additional information so that interested users can obtain more insights from the data. Such additional information may be managed by different organizations. Developers often need to share data between different systems for supporting BAD applications (e.g., threatening tweets detected at the Department of Homeland Security need to be shared with local police departments). Besides the ethical and legal issues, data sharing can be difficult because of the challenges in management, interoperability, security, and infrastructure [14]. Researchers from academia have developed projects that unify institutional repositories from different organizations for sharing research datasets [15], [16]. Companies have also created platforms based on Big Data projects to improve business efficiency and consolidate resources for better services [17]. Nevertheless, providing efficient, reliable, and scalable data sharing services requires dedicated infrastructures and collaborative efforts from developers and organizations.

In this work, we focus on enabling the active sharing of Big Data declaratively in a BAD world. In particular, we characterize a BAD world as a group of BAD islands, where each organization runs an independent BAD system as an island. We discuss how to “bridge” different BAD islands using scalable data sharing services without additional programming from developers.

II. BIG ACTIVE DATA IN A NUTSHELL

Our BAD system has been built as an extension of Apache AsterixDB, a Big Data Management System (BDMS) that provides distributed data management for large-scale, semi-structured data [18]. The BAD system [10], [13] can enable millions of users to subscribe to data of interest and receive updates continuously, and it also supports Big Data analytics with a declarative language, SQL++ (a SQL-inspired query language for semi-structured data) [19]. An overview of the BAD system is shown in Figure 1. Due to space limits, here we focus on two key components: Data Feeds and Data Channels. For more details about our project we refer to [9]–[13].

[Figure 1 labels: Developers, Subscribers, Analysts; Incoming Data, Additional Information; a BAD System comprising Data Feeds, Persistent Storage, Analytical Engine, and Data Channels; Broker Network; BAD Applications]

Fig. 1. An overview of the BAD system

A. Data Feeds

Data feeds help the BAD system to ingest rapidly incoming data from various external data sources in different formats. Users can create a data feed using SQL++ statements. As an example, in Figure 2, we define a data type Tweet to describe the incoming data’s minimum required attributes and an active dataset Tweets to persist the incoming data. Active datasets, different from normal datasets, enable continuous query semantics [20] in channels (discussed next) [10], [12]. Here we create a TweetFeed using a socket adapter and specify the incoming data’s format as JSON. This allows the BAD system to use a socket server to intake incoming JSON data. The TweetFeed is connected to the dataset Tweets so that the ingested data can be persisted in storage (partitioned across all nodes of a cluster) directly for later use. There are two types of data feeds: static feeds, which maximize ingestion throughput, and dynamic feeds, which allow users to enrich incoming data using user-defined functions (UDFs) [21]–[24].

CREATE TYPE Tweet AS { tid: bigint, uid: bigint, text: string };
CREATE ACTIVE DATASET Tweets(Tweet) PRIMARY KEY tid;
CREATE FEED TweetFeed WITH {
  "type-name" : "Tweet",
  "adapter-name": "socket_adapter",
  "format" : "JSON",
  "sockets": "FEED_HOST:FEED_PORT",
  "address-type": "IP",
  "dynamic": false };
CONNECT FEED TweetFeed TO DATASET Tweets;
START FEED TweetFeed;

Fig. 2. A sample data feed connected to an active dataset

B. Data Channels

Data channels allow developers to activate parameterized queries as services for millions of users to subscribe to and continuously receive their data of interest. When creating a channel, developers can construct a channel query to describe the data of interest for subscribers and specify a channel period to indicate how often the channel query should be evaluated for subscribed users. All subscriptions of a channel are evaluated together to allow the system to exploit shared computations among them (e.g., many subscribers could be interested in tweets from Orange County), and increasing the channel period could lead to a bigger batch size and thus allow computing complex data of interest for more subscribers at scale.

For example, we can create a NearbyThreateningTweets channel, as shown in Figure 3, to allow in-field officers to subscribe to nearby threatening tweets. In the channel query, we use the is_new function to look for new threatening tweets near a subscribed officer’s location and return those tweets to subscribers every 10 seconds. The active dataset Tweets provides continuous query semantics to make sure every qualified new tweet will be delivered to subscribed officers.

The threatening tweets for subscribers are sent to brokers registered as HTTP endpoints in the BAD system. A user can subscribe to a data channel on a broker and thus receive updates from it. As shown in Figure 4, we can register a broker and make two separate subscriptions (on behalf of in-field officers) on this broker so that the threatening tweets near these two in-field officers are sent to this broker and then delivered to them. Data channels provide two modes for delivering data: push and pull. In the push mode, the data of interest is pushed to brokers directly. In the pull mode, a broker having new data of interest for its subscribers will receive a notification from the channel, and then the broker can pull that data from BAD storage later.

// Similar to TweetFeed and Tweets, we have a LocationFeed connected
// to an OfficerLocations dataset to receive and store the live
// location updates from in-field officers
// CREATE TYPE OfficerLocation AS { oid: int, location: point };
// CREATE ACTIVE DATASET OfficerLocations(OfficerLocation)
//   PRIMARY KEY oid;
CREATE CONTINUOUS CHANNEL NearbyThreateningTweets(oid)
PERIOD duration("PT10S") {
  SELECT t FROM OfficerLocations o, Tweets t
  WHERE spatial_distance(t.location, o.location) < 5
    AND o.oid = oid
    AND t.threatening_rating > 0
    AND is_new(t) };

Fig. 3. A sample continuous channel for nearby threatening tweets

CREATE BROKER BROKER_A AT "http://BROKER_A_HOST:BROKER_A_PORT/API";
SUBSCRIBE TO NearbyThreateningTweets("0907") ON BROKER_A;
SUBSCRIBE TO NearbyThreateningTweets("1226") ON BROKER_A;

Fig. 4. Registering a broker and making subscriptions
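The benefit of evaluating all subscriptions of a channel together can be illustrated outside the BAD system. The following Java sketch is not the BAD implementation: the class ChannelBatch, its method names, and the stand-in query function are hypothetical. It assumes each subscription carries a single parameter value and groups subscriptions by that value, so that the stand-in channel query runs once per distinct parameter rather than once per subscription.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not part of the BAD code base.
final class ChannelBatch {
    // Number of times the stand-in channel query has been evaluated.
    int evaluations = 0;

    // subscriptions maps a subscription id to its channel parameter.
    // query stands in for the channel query evaluated for one parameter value.
    // Returns the results to deliver, keyed by subscription id.
    Map<String, List<String>> evaluate(Map<String, String> subscriptions,
                                       Function<String, List<String>> query) {
        // Group subscription ids by their parameter value.
        Map<String, List<String>> byParameter = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : subscriptions.entrySet()) {
            byParameter.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Evaluate once per distinct parameter and fan the result out.
        Map<String, List<String>> deliveries = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : byParameter.entrySet()) {
            List<String> result = query.apply(group.getKey());
            evaluations++;
            for (String subscriptionId : group.getValue()) {
                deliveries.put(subscriptionId, result);
            }
        }
        return deliveries;
    }
}
```

With three subscriptions of which two share the parameter "OC", the stand-in query is evaluated twice instead of three times. A longer channel period would collect more subscriptions and more new data per batch, which is the trade-off described above.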

III. BAD ISLANDS

In the following sections, we discuss how we can connect BAD systems managed by different organizations (islands) in a BAD world together to enable data sharing among them. We use a three-island example with the following organizations for illustration: the Department of Homeland Security, the Orange County Sheriff’s Department, and the University of California-Irvine. Each organization hosts an independent BAD system and serves its own BAD users with localized information.

A. BAD Island 1: Department of Homeland Security

The Department of Homeland Security (DHS) is a federal agency responsible for ensuring public security. In our example, DHS has access to all tweets posted in the United States. These tweets cannot be shared with other organizations directly due to licensing and privacy concerns, except for the tweets that are related to potential threats. The BAD system at DHS needs to provide data analytics on collected tweets and serve tweets to its agents through data channels.

Since raw tweets from Twitter may not contain all necessary information, DHS might need to enrich them with other relevant data. As an example, DHS could collect weapon registration information for some sensitive Twitter account holders and attach that to tweets to provide important additional information for interested subscribers. In addition, DHS could also utilize Machine Learning algorithms to estimate the threatening rating of the tweets’ text and use that for later analysis. An overview of the DHS island is shown in Figure 5.

(Footnote to Section II-B: One could also apply the is_new function on OfficerLocations to look for nearby threatening tweets only for officers actively updating their locations. Interested readers may refer to [13] for more continuous channel examples.)

[Figure 5 labels: Tweets; Weapon Registration Information; Threatening Rating Detection Algorithm; Processing; Enriched Threatening Tweets; Data Analytics; ThreateningTweets]

Fig. 5. An overview of the DHS Island

B. BAD Island 2: Orange County Sheriff’s Department

The Orange County Sheriff’s Department (OCSD) is the local law enforcement agency that ensures safety and responds to potential crimes in Orange County, CA. In our use case, OCSD wants to monitor major local events and ensure the safety of the event and its participants. In-field officers who patrol around the county continuously report their locations back to OCSD so that OCSD can send them instructions based on their locations (e.g., when an emergency happens, send nearby officers for help).

To prevent potential threats to local events, OCSD would like to obtain the threatening tweets posted in Orange County. When a local threatening tweet is detected, OCSD can find important events close to the tweet and then notify the nearby in-field officers about the event and the tweet so they can further investigate it. Additionally, OCSD wants to support data analytics on data stored in the system. An overview of the OCSD island is shown in Figure 6.

[Figure 6 labels: ThreateningTweets; Event Information; In-field Officers’ Locations; Threatening Events Processing; Data Analytics; Location Updates; Event Notifications; In-field Officers]

Fig. 6. An overview of the OCSD island

C. BAD Island 3: University of California-Irvine

The University of California-Irvine (UCI) is a public university located in Irvine, a city in Orange County. The university often hosts various activities and events in different buildings on campus. To ensure students’ and visitors’ safety, the university has its own university police officers placed at various security stations on campus, from whom students and visitors can seek help when an emergency happens. The buildings on campus have notice boards for showing important notifications and alerts. The university also has an alerting service, zotALERT, which delivers important messages to people (subscribers) on campus through text messages and emails.

UCI would like to acquire the threatening tweets posted near the UCI campus and notify people in the buildings around those tweets to raise attention. An alert could include information about the security stations near the tweet so that people in an emergency situation could quickly seek help. Data analytics on threatening tweets and other data in the system are also to be supported for school officials. An overview of the UCI island is shown in Figure 7.

[Figure 7 labels: ThreateningTweets; On-campus Building Information; On-campus Security Information; On-campus Alerts Processing; Data Analytics; Alert Notifications; School Officials; Building Notice Board; zotALERT]

Fig. 7. An overview of the UCI island

IV. ISLAND HOPPING: CONNECTING BAD ISLANDS

In order to support the BAD services at OCSD and UCI described in Section III, we need to enable the sharing of threatening tweets detected at DHS with OCSD and UCI. These tweets can be combined with local information at Orange County and UCI, respectively, and then be used for creating localized notifications for subscribers on each island. We consider three options for sharing threatening tweets among these islands, namely: (1) combining all islands together into one (a BAD Continent), (2) creating direct connections between the individual islands as needed (BAD Ferries), and (3) utilizing the channel idea to allow islands to subscribe to what they need from one another (BAD Bridges). Below we discuss the three options in detail.

A. Option 1: A BAD Continent

Instead of sharing threatening tweets between multiple BAD islands, one could create a big BAD island, namely a BAD continent, that holds not only the data at DHS but also the local data from OCSD and UCI, as shown in Figure 8. In this case, all services at OCSD and UCI could be integrated into this BAD continent, and all subscribers then would subscribe to this BAD continent directly. All information is now in the same system. Developers from different organizations could easily create BAD services without having to share data.

In principle, a one-for-all BAD continent could be easy to build, and it avoids the complexity of connecting different BAD islands. Although the resulting BAD system could be scaled to support the volume of data and users from multiple organizations, such global integration would introduce significant management and administration overheads, especially for the service provider (DHS in this case). For the three-island example, not only would all local information (including local events, campus building layouts, etc.) need to be stored in the BAD continent, but all updates (location updates, event updates, etc.) would need to be forwarded to the system. Managing all local data at DHS could be very complex and would require sophisticated access control. When more organizations join, such a database would have to manage all kinds of additional local information while receiving updates from multiple parties; this system would quickly become impractical to maintain by one organization. Additionally, such global information sharing may not be permitted (by law) between different agencies in all cases.

[Figure 8 labels: Tweets; Weapon Registration Information; Threatening Rating Detection Algorithm; Event Information; In-field Officers’ Locations; On-campus Building Info.; On-campus Security Info.; Processing; School Officials; Building Notice Board; zotALERT; Location Updates; Event Notifications; In-field Officers]

Fig. 8. An illustration of a BAD continent

B. Option 2: BAD Ferries

A different way of supporting the required BAD services at OCSD and UCI, without combining everything together, would be to programmatically send the requested data from DHS to OCSD and UCI, as shown in Figure 9. DHS could send the threatening tweets detected in Orange County and near the UCI campus to OCSD and UCI, respectively, and OCSD BAD and UCI BAD could then combine those tweets with their local information to produce localized notifications for their subscribers.

[Figure 9 labels: BAD@DHS with Dedicated Server Programs with Glue; BAD@OCSD with a Dedicated Client Program; BAD@UCI with a Dedicated Client Program]

Fig. 9. An illustration of BAD ferries

In order to share the data cleanly and efficiently, DHS would need to create a dedicated server program that allows other organizations to access the shared data in DHS. Also, OCSD and UCI would need to develop corresponding client programs that connect to the DHS server program and obtain the shared data. Data exchanges between the server and clients could be frequent, and there could be many more clients who would like to access the shared data. Thus, the server program would need to be efficient, reliable, and scalable for handling a large number of clients and a large volume of data. Implementing and extending the server and client programs would require significant efforts from these organizations.

C. Option 3: BAD Bridges

An important observation is that this data exchange pattern, where we have an island serving data and multiple islands constantly requesting data of interest, resonates well with the original BAD user model, where subscribers subscribe to data and constantly receive updates. Inspired by this, we could characterize a BAD island as being a BAD subscriber of another island and connect these islands using BAD bridges built on data channels and data feeds to share data at scale, as shown in Figure 10. One might characterize this architecture as: “One man’s channel is another man’s feed.”

[Figure 10 labels: BAD@DHS with the threateningTweetsAt channel; BAD@OCSD and BAD@UCI, each with a localThreateningTweet feed; Threatening Tweet @ OCSD; Threatening Tweet @ UCI]

Fig. 10. An illustration of BAD bridges

Following our example, we could first create a data channel on DHS BAD, which serves threatening tweets by areas, namely via a threateningTweetsAt channel, and other islands interested in local threatening tweets from an area could then subscribe to this channel with the area name of interest. OCSD BAD, as a subscriber, can subscribe to this channel with the parameter “OC”, and UCI BAD, as another subscriber, can also subscribe to this channel with the parameter “UCI”. We could use a push channel to push threatening tweets to OCSD and UCI BAD so they can receive local threatening tweets from the channel at DHS directly, process them with local information, and then produce localized notifications to their own subscribers.

On OCSD and UCI BAD, we could utilize data feeds to receive threatening tweets detected by the threateningTweetsAt channel on DHS BAD. Taking OCSD BAD as an example, we could create an HTTP feed and connect it to a local OCSD dataset for persisting the threatening tweets. We could register the feed’s HTTP address as a broker in DHS BAD and then subscribe to the threateningTweetsAt channel with the parameter “OC”. With this feed, broker, and subscription, threatening tweets posted at Orange County and detected by DHS would then be sent to the feed’s endpoint from the threateningTweetsAt channel. Similarly, we could repeat this process for other BAD systems to obtain threatening tweets from their areas of interest. Since the BAD system is scalable and can support a large number of subscribers with a large volume of data, bridging BAD systems using data channels and feeds can be scaled out to support many more islands connecting to DHS. This allows developers to declaratively create data sharing services, without additional programming and gluing together multiple systems, as we will see next.

V. BUILDING BAD BRIDGES

Given the advantages of the BAD Bridges approach, we now introduce BAD brokers, which further simplify and enhance data exchanges between BAD islands, and BAD feeds, which help users create bridges and manage their life-cycles.

A. BAD Brokers

The broker sub-system in BAD manages the communication between the BAD system and its subscribers. A broker registers itself as an HTTP endpoint in the BAD system. Notifications containing data of interest produced by the BAD backend are delivered to this broker endpoint and then disseminated to subscribers who subscribed on this broker. In order to allow general brokers to parse the incoming notifications, data channels produce notifications as JSON objects, and more complex data types supported in BAD in the AsterixDB Data Model (ADM) (such as datetimes, points, etc.) are encoded as strings, arrays, and other JSON data types. Since BAD islands are “brokers” that can also directly process ADM data, we can instead deliver their notifications as ADM records to maintain the richer data type information and avoid additional data encoding and decoding overheads.

To allow brokers to process ADM data and to become extensible for future use cases, we introduce a new notion of BAD brokers and a simple new syntax for creating brokers in BAD, as shown in Figure 11. Users can add an optional WITH clause for providing additional information about the broker. While we only support “broker-type” for now, this can be further extended to support other features in the future. When there is no WITH clause or when the broker-type is set to “general”, we create a general broker that takes JSON data. When the broker-type is set to “BAD”, we create a BAD broker that takes ADM records. In general, a channel can have subscriptions from both types of brokers. In that case, channel executions will send JSON formatted data to the general brokers and ADM formatted data to the BAD brokers.

CREATE BROKER BROKER_NAME AT "http://BROKER_HOST:PORT_NUM"
WITH { "broker-type" : "BAD" };

Fig. 11. Creating a BAD broker
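The rule for choosing a notification format from the broker type can be summarized in a few lines of Java. This is a minimal sketch of the rule as described in the text, not the BAD implementation: the class BrokerFormat and the method formatFor are hypothetical names, the WITH clause is modeled as a plain map, and rejecting unknown broker types is our assumption.

```java
import java.util.Map;

// Hypothetical illustration only; not part of the BAD code base.
final class BrokerFormat {
    // Returns the notification format for a broker, given the options of
    // its optional WITH clause; null models a statement without WITH.
    static String formatFor(Map<String, String> withOptions) {
        String type = (withOptions == null) ? null : withOptions.get("broker-type");
        if (type == null || type.equals("general")) {
            return "JSON"; // general broker: notifications encoded as JSON
        }
        if (type.equals("BAD")) {
            return "ADM"; // BAD broker: notifications delivered as ADM records
        }
        // Assumption: any other broker-type value is treated as an error.
        throw new IllegalArgumentException("Unknown broker-type: " + type);
    }
}
```

A channel with subscriptions from both kinds of brokers would apply this rule per broker, so one execution may produce both JSON and ADM output.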

B. BAD Feeds

Bridging from a BAD island A to another BAD island B and sharing data to island B requires several steps: create a data feed on island B; register the feed with island A as a BAD broker; and create a subscription on island A on the created broker. Also, removing the bridge between island A and island B requires unsubscribing from the channel and removing the BAD broker on island A. In order to simplify the process of bridging BAD islands and help users manage the life-cycles of bridges, we also introduce the notion of BAD feeds.

One can create a BAD feed on island B and connect it to a channel on island A using the statement in Figure 12. Unlike regular data feeds, users would need to specify several additional configuration entries for connecting to a data channel on the other BAD island. In particular, the “bad-host”, “bad-channel”, and “bad-dataverse” configuration parameters help the system locate the data channel on the other island, while “bad-channel-parameters” contains subscription parameters as a quote-escaped string for subscribing to the channel. When a channel takes multiple parameters, we use commas to separate them. If a data feed wants to subscribe to a channel with several different sets of parameters (e.g., OCSD BAD wants to subscribe to threatening tweets from both Orange County and UCI), we can concatenate them using semicolons.

CREATE FEED A_SAMPLE_BAD_FEED_ON_ISLAND_B WITH {
  "adapter-name" : "http_adapter",
  "address-type" : "IP",
  "format" : "ADM",
  "addresses" : "ISLAND_B_FEED_HOST:ISLAND_B_FEED_PORT",
  "type-name" : "INCOMING_DATA_TYPE",
  "bad-host" : "ISLAND_A_HOST",
  "bad-channel" : "ISLAND_A_CHANNEL_NAME",
  "bad-channel-parameters": "PARAM_1-1,PARAM_1-2;PARAM_2-1,PARAM_2-2",
  "bad-dataverse": "ISLAND_A_CHANNEL_DATAVERSE" };

Fig. 12. Creating a BAD feed on island B
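To make the “bad-channel-parameters” convention concrete, the following Java sketch splits such a string into one parameter list per subscription. It is an illustration under simplifying assumptions and not the parser used in BAD: the class ChannelParameters is hypothetical, and the sketch assumes that parameter values never contain a comma or semicolon themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the BAD code base.
final class ChannelParameters {
    // Splits "p1,p2;q1,q2" into [[p1, p2], [q1, q2]]: semicolons separate
    // subscriptions, commas separate the parameters of one subscription.
    // Assumption: parameter values contain no ',' or ';' characters.
    static List<List<String>> parse(String raw) {
        List<List<String>> subscriptions = new ArrayList<>();
        for (String group : raw.split(";")) {
            List<String> params = new ArrayList<>();
            for (String param : group.split(",")) {
                params.add(param.trim());
            }
            subscriptions.add(params);
        }
        return subscriptions;
    }
}
```

For example, the placeholder value in Figure 12 yields two subscriptions with two parameters each. A parser that accepts arbitrary quoted strings would need to track quoting rather than split naively.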

The bridge’s information is persisted in the BAD system’s metadata with a feed’s configuration when the feed is created. When a user starts a BAD feed on a local BAD system (island B), it automatically registers a broker on the specified remote BAD system (island A) using island B’s feed endpoint and subscribes to island A’s channel using the provided parameters. When a user stops the BAD feed, island B unsubscribes from island A’s channel and then removes the broker from island A. We tie the start and stop events of a data feed on the local BAD system (island B) to the subscribe and unsubscribe actions on the remote BAD system (island A), so when the feed is not running, the remote BAD system will not need to compute and deliver data to this BAD feed.

VI. A PROTOTYPE OF BAD ISLANDS

We now describe a complete prototype of BAD islands that supports the use cases described in Section III. We show how to create and connect three BAD islands (the BAD trinity) using declarative statements and show how data flows between these different islands. The BAD system organizes data and other entities under dataverses (similar to databases in an RDBMS). To differentiate organizations, we use different dataverses for different organizations (using the USE statement).

A. BAD@DHS

DHS BAD intakes tweets from external data sources. We can create a TweetFeed like the one in Figure 2 and configure it as dynamic to enrich the incoming tweets with additional information needed using UDFs. We first enrich an incoming tweet with the tweet’s user’s weapon registration records (if any). To hold the weapon registration records of sensitive tweet users, we create a data type WeaponRegistration and a dataset WeaponRegistrations, as shown in Figure 13. (A user may have multiple weapons.)

USE dhs;
CREATE TYPE WeaponRegistration AS { wrid: uuid, uid: bigint, weapon_name: string };
CREATE DATASET WeaponRegistrations(WeaponRegistration)
PRIMARY KEY wrid AUTOGENERATED;

Fig. 13. Data type and dataset definition for weapon registration information

Second, we create a Java UDF to detect the threatening rating of a tweet’s text using a list of threatening words, as shown in Figure 14. In this UDF, we load an external list of threatening words and we use the number of threatening words in the given text as its threatening rating.

...
@Override
public void evaluate(IFunctionHelper functionHelper) throws Exception {
    JString input = (JString) functionHelper.getArgument(0);
    JInt output = (JInt) functionHelper.getResultObject();
    String tweetText = input.getValue();
    int threateningRating = 0;
    String[] words = tweetText.split(" ");
    for (String word : words) {
        // The threateningWordList is initialized with a file when function starts
        if (threateningWordList.contains(word.replaceAll("[,.]", ""))) {
            threateningRating++;
        }
    }
    output.setValue(threateningRating);
    functionHelper.setResult(output);
}
...

Fig. 14. A Java UDF for determining the threatening rating
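The counting logic of the UDF can be exercised without an AsterixDB installation. The following stand-alone Java sketch mirrors the loop in Figure 14; the class ThreateningRating and its constructor are hypothetical, and the word list is passed in directly rather than loaded from an external file as in the UDF.

```java
import java.util.Set;

// Hypothetical stand-alone version of the counting logic in Figure 14.
final class ThreateningRating {
    private final Set<String> threateningWords;

    // Assumption: the caller supplies the word list; the UDF in Figure 14
    // instead loads it from a file when the function starts.
    ThreateningRating(Set<String> threateningWords) {
        this.threateningWords = threateningWords;
    }

    // Counts the space-separated words of the text that appear in the
    // word list, after stripping commas and periods from each word.
    int rate(String tweetText) {
        int rating = 0;
        for (String word : tweetText.split(" ")) {
            if (threateningWords.contains(word.replaceAll("[,.]", ""))) {
                rating++;
            }
        }
        return rating;
    }
}
```

As in the UDF, matching is case-sensitive and only commas and periods are stripped, so a word such as "Bomb" or "bomb!" would not match a list entry "bomb". Whether to normalize case or other punctuation is a design choice the sketch leaves open.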

To add the desired set of enrichments to incoming tweets, we can create a SQL++ UDF EnrichTweet and attach it to the TweetFeed when connecting to the Tweets dataset, as shown in Figure 15. In this UDF, we also transform the epoch time of a tweet’s “created_at” attribute into a datetime attribute “timestamp” and we create a point attribute “location” using the array of coordinates. These ADM attributes can be useful, as they do not need to be constructed every time in computations like spatial joins. Here we use the Java UDF defined in Figure 14 to extract the threatening rating of the tweet’s text and attach it as a “threatening_rating” attribute. We use a sub-query to look for the weapon registration information of the tweet’s user and nest the registered weapons into a “user_registered_weapon” attribute. These new attributes are merged into the tweet and will be persisted for producing notifications.

USE dhs;
CREATE FUNCTION EnrichTweet(tweet) {
  object_merge(tweet, {
    "timestamp" : datetime_from_unix_time_in_ms(tweet.created_at),
    "location" : create_point(tweet.coordinates[0], tweet.coordinates[1]),
    "threatening_rating" : threateningRating(tweet.text),
    "user_registered_weapon": (
      SELECT VALUE w.weapon_name
      FROM WeaponRegistrations w
      WHERE w.uid = tweet.uid)
  })
};
CONNECT FEED TweetFeed TO DATASET Tweets
APPLY FUNCTION EnrichTweet;
START FEED TweetFeed;

Fig. 15. Enriching tweets with additional information

With these enriched threatening tweets, we can serve threatening tweets from areas by creating the continuous data channel “ThreateningTweetsAt” shown in Figure 16. To put everything together, a detailed overview of the entire DHS BAD system is shown in Figure 17.

USE dhs;
CREATE CONTINUOUS PUSH CHANNEL ThreateningTweetsAt(area_name)
PERIOD duration("PERIOD_DURATION") {
  SELECT t.area_name, t.text, t.location, t.threatening_rating,
         t.user_registered_weapon
  FROM Tweets t
  WHERE t.area_name = area_name
    AND t.threatening_rating > 0
    AND is_new(t) };

Fig. 16. Definition of the ThreateningTweetsAt channel

B. BAD@OCSD

Fig. 17. The internal details of the DHS BAD system

OCSD BAD in this prototype receives threatening tweets not only from Orange County but also from UCI to demonstrate how a BAD feed can connect to a channel with two sets of parameters. OCSD BAD notifies in-field officers about nearby threatening tweets that are close to important local events. To persist event information in OCSD BAD, we can create a data type “Event” and a dataset “Events” as shown in Figure 18.

USE ocsd;
CREATE TYPE Event AS { eid: uuid, name: string, location: point,
  event_duration: duration, radius_km: double };
CREATE DATASET Events(Event)
PRIMARY KEY eid;

Fig. 18. Data type and dataset definition for events

To store threatening tweets coming from DHS BAD, we create a data type LocalThreateningTweet and an active dataset LocalThreateningTweets in Figure 19. (We use an active dataset for threatening tweets to ensure continuous query semantics in later local channel computations.) We create a BAD feed in Figure 20 to obtain local threatening tweets from DHS. This BAD feed subscribes to the DHS threateningTweetsAt channel with parameters “OC” and “UCI”, which correspond to two separate subscriptions in DHS BAD. Since there is no further data enrichment during ingestion, we use a static data feed here and connect it to LocalThreateningTweets directly.

CREATE TYPE LocalThreateningTweet AS {
    channelExecutionEpochTime: bigint,
    dataverseName: string,
    channelName: string
};

CREATE ACTIVE DATASET LocalThreateningTweets(LocalThreateningTweet)
PRIMARY KEY channelExecutionEpochTime;

Fig. 19. Data type and dataset definition for local threatening tweets at Orange County

USE ocsd;

CREATE FEED LocalThreateningTweetFeed WITH {
    "adapter-name": "http_adapter",
    "addresses": "OCSD_HOST:10013",
    "address-type": "IP",
    "type-name": "LocalThreateningTweet",
    "format": "adm",
    "bad-host": "DHS_HOST",
    "bad-channel": "ThreateningTweetsAt",
    "bad-channel-parameters": "\"OC\";\"UCI\"",
    "bad-dataverse": "dhs",
    "dynamic": false
};

CONNECT FEED LocalThreateningTweetFeed TO DATASET LocalThreateningTweets;

START FEED LocalThreateningTweetFeed;

Fig. 20. Definition, connect and start feed statements for LocalThreateningTweetFeed
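To make the "two separate subscriptions" concrete, the Python sketch below expands the feed configuration into one subscription request per parameter group. The parsing rule is an assumption inferred from the single example in Figure 20 (groups separated by semicolons, each group a comma-separated list of JSON values); the function names and the shape of the request dictionaries are hypothetical and do not describe the actual BAD feed adapter.

```python
import json


def parse_channel_parameters(raw):
    """Split a "bad-channel-parameters" value into one parameter list per subscription."""
    groups = [g.strip() for g in raw.split(";") if g.strip()]
    # Assumption: each group is a comma-separated list of JSON values.
    return [json.loads("[" + g + "]") for g in groups]


def subscription_requests(feed_config):
    """Build one hypothetical subscription request per parameter group."""
    return [
        {
            "host": feed_config["bad-host"],
            "dataverse": feed_config["bad-dataverse"],
            "channel": feed_config["bad-channel"],
            "parameters": params,
        }
        for params in parse_channel_parameters(feed_config["bad-channel-parameters"])
    ]


# Values copied from Fig. 20 (after SQL++ string unescaping).
feed_config = {
    "bad-host": "DHS_HOST",
    "bad-dataverse": "dhs",
    "bad-channel": "ThreateningTweetsAt",
    "bad-channel-parameters": '"OC";"UCI"',
}
subs = subscription_requests(feed_config)
for s in subs:
    print(s["channel"], s["parameters"])
```

A naive split on semicolons would break for string parameters that themselves contain a semicolon; a real adapter would need a proper tokenizer.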

In-field officers from OCSD also continuously send their location updates to the OCSD BAD system so that OCSD can notify the officers about nearby threatening tweets based on their current location. We can use the data type, dataset, and feed described in Figure 3 for ingesting and persisting the location updates. As there is no further enrichment for location updates, the LocationFeed can be static as well.

With the local threatening tweets, event information, and officers’ locations, we can now create a continuous channel for in-field officers to subscribe to nearby threatening tweets close to local events (a.k.a. threatening events), as shown in Figure 21. The notifications from DHS contain threatening tweets as an array in the “results” attribute, so we use the UNNEST operation to access each individual threatening tweet. We calculate the distance between the officer and the tweet, the event and the tweet, and the officer and the event. If the officer is near a threatening tweet and the threatening tweet is near an event, we send a notification to the officer. The notification contains the tweet’s content, the event information, the distance between the officer and the tweet, and the distance between the officer and the event to help the officer take further actions. A detailed overview of the OCSD BAD system is shown in Figure 22.

USE ocsd;

CREATE CONTINUOUS PUSH CHANNEL ThreateningEventsNear(oid)
PERIOD duration("PERIOD_DURATION") {
    FROM LocalThreateningTweets tn, OfficerLocations o, Events e
    UNNEST tn.results threatening_tweet
    LET tweet_loc = threatening_tweet.result.location,
        officer_tweet_dist = spatial_distance(o.location, tweet_loc),
        event_tweet_dist = spatial_distance(e.location, tweet_loc),
        officer_event_dist = spatial_distance(o.location, e.location)
    WHERE is_new(tn)
      AND oid = o.oid
      AND officer_tweet_dist < 0.1
      AND event_tweet_dist < e.radius_km / 100
    SELECT oid, threatening_tweet.result tweet_content, e event_info,
           officer_tweet_dist * 100 as tweet_distance_km,
           officer_event_dist * 100 as event_distance_km
};

Fig. 21. Definition of the ThreateningEventsNear channel

Fig. 22. The internal details of the OCSD BAD system
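The join logic of the ThreateningEventsNear channel can be traced with a small Python sketch. It rests on assumptions: `spatial_distance` is modeled as the Euclidean distance between coordinate pairs (in degrees), the `is_new` filter is omitted, and the officer locations below are made-up values. The tweet and event values are taken from the sample data shown later in the paper; the factor of 100 mirrors the degree-to-kilometer scaling used in the channel body.

```python
import math


def spatial_distance(p, q):
    """Euclidean distance between two (x, y) points, in coordinate units."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def threatening_events_near(oid, notifications, officers, events):
    """Evaluate the channel's query body for one subscribed officer id."""
    matches = []
    subscribed = [o for o in officers if o["oid"] == oid]   # AND oid = o.oid
    for tn in notifications:                                # FROM LocalThreateningTweets tn
        for item in tn["results"]:                          # UNNEST tn.results
            tweet = item["result"]
            for o in subscribed:
                for e in events:
                    officer_tweet = spatial_distance(o["location"], tweet["location"])
                    event_tweet = spatial_distance(e["location"], tweet["location"])
                    officer_event = spatial_distance(o["location"], e["location"])
                    if officer_tweet < 0.1 and event_tweet < e["radius_km"] / 100:
                        matches.append({
                            "oid": oid,
                            "tweet_content": tweet,
                            "event_info": e,
                            "tweet_distance_km": officer_tweet * 100,
                            "event_distance_km": officer_event * 100,
                        })
    return matches


notifications = [{"results": [{"result": {
    "text": "sample threatening tweet",
    "location": (33.64921228736088, -117.84181977473024),
}}]}]
events = [{"name": "OC Marathon",
           "location": (33.66100302712824, -117.83950620703125),
           "radius_km": 3.57746886883645}]
officers = [{"oid": 0, "location": (33.70, -117.85)},    # hypothetical, nearby
            {"oid": 1, "location": (34.00, -118.00)}]    # hypothetical, far away

near = threatening_events_near(0, notifications, officers, events)
far = threatening_events_near(1, notifications, officers, events)
```

Only the nearby officer produces a match, because the far officer fails the `officer_tweet_dist < 0.1` condition.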

C. BAD@UCI

UCI BAD receives threatening tweets posted at UCI and checks whether a threatening tweet is near an on-campus building. If so, it creates a notification about the threatening tweet together with the nearby security stations’ information. Like OCSD BAD, to persist threatening tweets at UCI, we need to create a data type LocalThreateningTweet and a dataset LocalThreateningTweets on UCI BAD. To receive threatening tweets at UCI from DHS, we need to create a BAD feed, like the one in Figure 20, connected to the ThreateningTweetsAt channel but using the parameter “UCI”.

To provide more information for UCI BAD’s subscribers, we store on-campus buildings, for checking whether there is a threatening tweet nearby, and security stations, for students to seek help from, in UCI BAD. In Figure 23, we create the data types and datasets for them respectively.

USE uci;

CREATE TYPE Building AS {
    bid: uuid,
    name: string
};

CREATE TYPE SecurityStation AS {
    sid: bigint,
    location: point
};

CREATE DATASET Buildings(Building) PRIMARY KEY bid AUTOGENERATED;

CREATE DATASET SecurityStations(SecurityStation) PRIMARY KEY sid;

Fig. 23. Data type and dataset definition of buildings and security stations

With the local threatening tweets, on-campus building information, and security station information, we can create a continuous channel called “AlertsOnCampus” to provide on-campus alerts about threatening tweets near buildings with security stations’ information attached using the statement shown in Figure 24. Like the ThreateningEventsNear channel in OCSD BAD, we first UNNEST threatening tweets from the incoming notifications. Then, we check whether a threatening tweet is posted at an on-campus building. If so, we attach the security station information to the threatening tweet, with stations ordered by their distances to the tweet’s location, and generate an alert. A detailed overview of the UCI BAD system is shown in Figure 25.

USE uci;

CREATE CONTINUOUS PUSH CHANNEL AlertsOnCampus()
PERIOD duration("PERIOD_DURATION") {
    FROM LocalThreateningTweets tn, Buildings b
    UNNEST tn.results threatening_tweet
    LET tweet_loc = threatening_tweet.result.location,
        station_dist = (
            FROM SecurityStations s
            LET dist = spatial_distance(tweet_loc, s.location)
            SELECT s stationInfo, dist * 100 dist_km
            ORDER BY dist)
    WHERE is_new(tn)
      AND spatial_intersect(tweet_loc, b.area)
    SELECT threatening_tweet.result tweet_content,
           b building_info, station_dist
};

Fig. 24. Definition of the AlertsOnCampus channel

Fig. 25. The internal details of the UCI BAD system
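The AlertsOnCampus logic can be checked against the sample data with a short Python sketch. It makes two assumptions: `spatial_distance` is modeled as Euclidean distance in coordinate units, and `spatial_intersect` between a point and a rectangle is modeled as an inclusive point-in-rectangle test. The helper names are hypothetical; the coordinates are those of the sample tweet, the "Student Center" building, and the two security stations in the paper's example alert.

```python
import math


def spatial_distance(p, q):
    """Euclidean distance between two (x, y) points, in coordinate units."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def point_in_rectangle(p, rect):
    """Inclusive containment test; rect is a pair of opposite corner points."""
    (x1, y1), (x2, y2) = rect
    return min(x1, x2) <= p[0] <= max(x1, x2) and min(y1, y2) <= p[1] <= max(y1, y2)


def alerts_on_campus(tweets, buildings, stations):
    """Attach distance-ordered stations to each tweet located inside a building area."""
    alerts = []
    for tweet in tweets:
        loc = tweet["location"]
        station_dist = sorted(
            ({"stationInfo": s, "dist_km": spatial_distance(loc, s["location"]) * 100}
             for s in stations),
            key=lambda d: d["dist_km"],
        )
        for b in buildings:
            if point_in_rectangle(loc, b["area"]):
                alerts.append({"tweet_content": tweet,
                               "building_info": b,
                               "station_dist": station_dist})
    return alerts


tweets = [{"location": (33.64921228736088, -117.84181977473024)}]
buildings = [{"name": "Student Center",
              "area": ((33.64811430275051, -117.84332027249145),
                       (33.649382536086605, -117.84153928570557))}]
stations = [{"sid": 0, "location": (33.646866723393266, -117.84170161534618)},
            {"sid": 1, "location": (33.64792551859947, -117.84013290702327)}]

alerts = alerts_on_campus(tweets, buildings, stations)
for entry in alerts[0]["station_dist"]:
    print(entry["stationInfo"]["sid"], round(entry["dist_km"], 4))
```

Under these assumptions the sketch yields one alert with station 1 listed before station 0, in line with the ordering in the sample on-campus alert.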

D. The Trip of A Threatening Tweet

In order to illustrate how BAD islands interact with BAD bridges, we pick a sample tweet and show how it flows through the three islands and their bridges and produces notifications with local information for the subscribers on each island. An overview of our three-island prototype is shown in Figure 26. The circled numerical labels in the figure will be used later for illustrating the data content at different stages of the workflow.

[Figure 26 diagram. BAD @ DHS: Raw Tweets, Tweet Feed, Tweets, Weapon Registrations, Threatening Word List, ThreateningTweetAt(area_name) Channel. BAD @ OCSD: LocalThreateningTweetFeed, LocalThreateningTweets, Officer Locations, Events, ThreateningEventsNear(officer_id) Channel. BAD @ UCI: LocalThreateningTweetFeed, LocalThreateningTweets, Buildings, SecurityStations, AlertsOnCampus() Channel. Bridges are labeled ThreatfulTweetsAt(OC) and ThreatfulTweetsAt(UCI); circled labels 1 to 5 mark the workflow stages.]

Fig. 26. An overview of BAD islands

We will use the raw tweet in Figure 27 (labeled 1 in Figure 26) as the example. This tweet is posted at UCI, and it contains the tweet’s geolocation as a JSON array of coordinates and the epoch timestamp of when the tweet was created as a JSON number. This raw tweet is ingested by the TweetFeed defined in Figure 2 and then enriched by the UDF defined in Figure 15. After that, the enriched tweet is persisted in the Tweets dataset as shown in Figure 28 (labeled 2 in Figure 26). Enriched tweets contain a threatening rating detected by the Java UDF, an array of registered weapons for the tweet’s user obtained by looking in the WeaponRegistrations dataset using the “uid” attribute, a timestamp as a datetime attribute, and a location as a point attribute.

{
  "tid": 1593142018123,
  "uid": 73,
  "area_name": "UCI",
  "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
  "coordinates": [ 33.64921228736088, -117.84181977473024 ],
  "created_at": 1593142018123
}

Fig. 27. A sample raw threatening tweet

{
  "tid": 1593142018123,
  "uid": 73,
  "area_name": "UCI",
  "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
  "coordinates": [ 33.64921228736088, -117.84181977473024 ],
  "created_at": 1593142018123,
  "threatening_rating": 2,
  "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ],
  "timestamp": datetime("2020-06-26T03:26:58.123Z"),
  "location": point("33.64921228736088,-117.84181977473024")
}

Fig. 28. The enriched threatening tweet

Since both the OCSD and UCI BAD systems subscribe to threatening tweets at UCI, they will each receive a notification from DHS BAD about this threatening tweet. Figure 29 shows the notification sent to OCSD BAD (labeled 3 in Figure 26). If there was also a threatening tweet posted in Orange County at the same time, the “results” array would include that tweet but with a different subscription ID, as OCSD BAD has two subscriptions to the ThreateningTweetsAt channel with parameters “OC” and “UCI”, respectively. Since UCI BAD also subscribes to the channel, but with a different subscription on another broker (pointed to UCI’s BAD feed), the notification for UCI BAD will be produced and sent separately.

In the OCSD BAD ThreateningEventsNear channel, threatening tweets are combined with local event information and officer location information to produce the nearby threatening event notifications for in-field officers.
There is one local event “OC Marathon” near the threatening tweet in Figure 28, and there is an in-field officer 0 nearby, so OCSD BAD produces one notification about the tweet and the event for this officer. Figure 30 shows this threatening event notification (labeled 4 in Figure 26). It contains the event information as the “event_info” attribute, the threatening tweet’s information as the “tweet_content” attribute, and the distances from officer 0 to the tweet and to the event as the “tweet_distance_km” and “event_distance_km” attributes respectively.

{
  "dataverseName": "dhs",
  "channelName": "ThreateningTweetsAt",
  "channelExecutionEpochTime": 1593142019521,
  "results": [
    {
      "result": {
        "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
        "area_name": "UCI",
        "location": point("33.64921228736088,-117.84181977473024"),
        "threatening_rating": 2,
        "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ]
      },
      "channelExecutionTime": datetime("2020-06-26T03:26:59.521Z"),
      "subscriptionId": uuid("82e61d25-f7ad-0632-3b9a-9c26e681ad84"),
      "deliveryTime": datetime("2020-06-26T03:26:59.522Z")
    }
  ]
}

Fig. 29. The generated threatening tweet notification from DHS

{
  "dataverseName": "ocsd",
  "channelName": "ThreateningEventsNear",
  "channelExecutionEpochTime": 1593142020436,
  "results": [
    {
      "result": {
        "event_info": {
          "eid": uuid("82e61d25-4cad-0632-3d8d-148e71cb50bf"),
          "name": "OC Marathon",
          "location": point("33.66100302712824, -117.83950620703125"),
          "event_duration": duration("PT10S"),
          "radius_km": 3.57746886883645
        },
        "tweet_distance_km": 4.854786471222485,
        "event_distance_km": 5.6839370484947755,
        "oid": 0,
        "tweet_content": {
          "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
          "area_name": "UCI",
          "location": point("33.64921228736088,-117.84181977473024"),
          "threatening_rating": 2,
          "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ]
        }
      },
      "channelExecutionTime": datetime("2020-06-26T03:27:00.436Z"),
      "subscriptionId": uuid("82e61d25-47ad-0632-3e5c-22b3cb7d7df4"),
      "deliveryTime": datetime("2020-06-26T03:27:00.437Z")
    }
  ]
}

Fig. 30. The generated threatening event notification from OCSD

In the UCI BAD AlertsOnCampus channel, threatening tweets are combined with on-campus building information and security station information to produce alerts. The threatening tweet in Figure 28 is near the building “Student Center”, so UCI BAD produces a notification to alert people around this building as shown in Figure 31 (labeled 5 in Figure 26). The building information is attached to the notification. There are two security stations nearby, so the system attaches their information with their distances, ordered by their distances to the threatening tweet. Everyone subscribing to the AlertsOnCampus channel will receive this notification.

VII. BAD ISLANDS TOUR AND EVALUATION

To illustrate how BAD applications can be built with BAD islands and to visualize the process of data flowing through multiple systems and becoming notifications for subscribers, we have created three dashboards, one for each organization, based on our prototype, as shown in Figure 32.

{
  "dataverseName": "uci",
  "channelName": "AlertsOnCampus",
  "channelExecutionEpochTime": 1593142024344,
  "results": [
    {
      "result": {
        "buildingInfo": {
          "bid": uuid("82e61d25-43ad-0632-45d0-0ba5366832d9"),
          "name": "Student Center",
          "area": rectangle("33.64811430275051,-117.84332027249145 33.649382536086605,-117.84153928570557")
        },
        "stationDist": [
          {
            "stationInfo": {
              "sid": 1,
              "location": point("33.64792551859947, -117.84013290702327"),
              "name": "Station
            },
            "dist_km": 0.21216259109805177
          },
          {
            "stationInfo": {
              "sid": 0,
              "location": point("33.646866723393266, -117.84170161534618"),
              "name": "Station
            },
            "dist_km": 0.23485382616041114
          }
        ],
        "tweetContent": {
          "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
          "area_name": "UCI",
          "location": point("33.64921228736088,-117.84181977473024"),
          "threatening_rating": 2,
          "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ]
        }
      },
      "channelExecutionTime": datetime("2020-06-26T03:27:04.344Z"),
      "subscriptionId": uuid("82e61d25-0ead-0632-4717-e17b6a912fa6"),
      "deliveryTime": datetime("2020-06-26T03:27:04.345Z")
    }
  ]
}

Fig. 31. The generated on-campus alert from UCI

Fig. 32. An overview of BAD islands dashboards

Due to space limits, instead of describing the features on each dashboard in detail, we will focus on the Visualization Panel of the OCSD Dashboard, shown in Figure 33, to illustrate how threatening tweets go from DHS BAD to OCSD BAD and how OCSD BAD combines threatening tweets with other local information for its subscribers.

The Visualization Panel contains a map that shows the incoming threatening tweets, local events, produced threatening events, and in-field officers’ movements. The map contains a control bar at the top (highlighted in a blue box) so dashboard users can navigate the map, add a new event, and add a new in-field officer. A new event can be added by drawing a circle on the map indicating the event’s area. An officer can be added by dropping an officer icon on a preferred location on the map. Information about the created events and officers is updated in the underlying OCSD BAD system accordingly. An added officer moves around the map randomly and continuously sends its current location to the OCSD BAD system. One can change an officer’s location by dragging the officer’s icon to a new place on the map.

Fig. 33. Visualization panel of OCSD dashboard

We use a red tweet icon to mark the threatening tweets received from DHS BAD and a black tweet icon for the threatening events detected by OCSD BAD. When an in-field officer receives a threatening event notification (as highlighted in the red circles in the figure), the officer randomly decides whether to go to the threatening event’s location for further investigation or to stay at his or her current location. The officer’s decision pops up as a small information window, as shown in the figure. If the officer decides to go, he or she moves gradually towards the tweet’s location, as the upper officer does in the figure.

In addition to the dashboards, we also conducted a simple experiment to measure the tweet propagation delays in our prototype system, starting from the posting of new tweets to the receipt of the localized notifications by subscribers on each island. We deployed the prototype on a three-node cluster, one node per island, where each node had a Dual-Core AMD Opteron Processor 2212 2.0 GHz, 8 GB of RAM, and a 900 GB hard disk drive. We used the statements described in Section VI to configure the nodes.

The information propagation times for BAD islands depend on the complexity of the computations in the pipeline (data enrichment and channel computation) and on the specified channel period durations. In our experiments, we used the same channel period for all three channels, testing two different channel periods (1s and 2s). Since channels execute once per channel period, for each channel execution, we measured the average delay for the threatening tweets delivered to subscribers in that channel execution. We let all channels complete 50 executions and kept track of the average delays throughout the process. On DHS BAD, tweets were set to arrive at 10 tweets per second, and half of the tweets contained at least one threatening word.
On OCSD BAD, every threatening tweet had an event nearby. OCSD had 100 in-field officers constantly updating their locations and subscribing to nearby threatening events. On UCI BAD, every threatening tweet was close to an on-campus building. UCI had 5 on-campus security stations and 100 subscriptions subscribing to on-campus alerts. The delays are shown in Figure 34.

[Figure 34 plot. Y axis: Average Delay (Seconds); X axis: Channel Execution; series: OCSD - 2 seconds, UCI - 2 seconds, OCSD - 1 second, UCI - 1 second.]

Fig. 34. Delays of threatening tweets for OCSD and UCI BAD subscribers
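The per-execution averaging used in this measurement can be sketched in Python. This is an idealized model, not the experiment's measurement code: it assumes evenly spaced arrivals, a single channel stage, and zero enrichment, computation, and network time, so it captures only the delay caused by waiting for the next channel execution. The function name and the synthetic arrival times are hypothetical, and the numbers it produces are not the measured values plotted in Figure 34.

```python
import math
from statistics import mean


def average_delays(created_times, period, executions):
    """Return the mean delay of the tweets delivered by each channel execution.

    Execution k fires at time k * period and delivers every tweet created
    since the previous execution.
    """
    buckets = {k: [] for k in range(1, executions + 1)}
    for t in created_times:
        k = max(1, math.ceil(t / period))   # first execution at or after t
        if k <= executions:
            buckets[k].append(k * period - t)
    return [mean(b) if b else None for b in buckets.values()]


# 10 tweets per second, every other tweet threatening (as in the setup above).
threatening = [i / 10 for i in range(1, 1000, 2)]

one_second = average_delays(threatening, period=1, executions=50)
two_seconds = average_delays(threatening, period=2, executions=50)
print(one_second[0], two_seconds[0])
```

In this model the waiting component of the delay is about half the channel period for a single channel, so doubling the period doubles that component; the end-to-end delays in the prototype additionally include the work done by the enrichment UDF, the DHS channel, the bridge, and the local channel.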

Clearly, the subscribers at OCSD and UCI are able to receive localized threatening tweets of interest in a timely manner, while the delays are relatively stable, especially for the 1s channel period. When the channel period is increased to 2s, the delays increase since the system batches more incoming tweets per channel execution; while this would increase the delay for subscribers, it would also increase the system scalability under higher loads. UCI BAD in general has higher delays than OCSD BAD due to a more complex computation and more local information added to local threatening tweets.

VIII. CONCLUSIONS

In this work, we have focused on enabling users to declaratively create scalable data sharing services between different BAD systems. We looked at an example use case in which two local organizations (OCSD and UCI) would like to get data from a third organization (DHS) in order to provide BAD services to their subscribers. We discussed several possible ways of supporting this use case and proposed using data feeds and data channels for bridging BAD systems. We extended the BAD system with BAD brokers to simplify data exchanges between channels and feeds and BAD feeds to help users create bridges between different BAD systems. We detailed a three-island prototype to show how BAD islands can be bridged together. We demonstrated how users can easily build such systems with declarative statements, and we used an example to show how data and events flow within the system. We built a set of dashboards based on our prototype to concretely illustrate how BAD islands share data and support BAD applications with localized information, and we conducted an experiment to examine the delays in the prototype system.

ACKNOWLEDGMENT

This research was partially supported by NSF grants IIS-1447826, IIS-1447720, IIS-1838222, IIS-1838248, CNS-1924694 and CNS-1925610.