Bridging BAD Islands: Declarative Data Sharing at Scale
Xikui Wang, Michael J. Carey
Donald Bren School of Information and Computer Sciences, University of California Irvine
Irvine, United States
{xikuiw, mjcarey}@ics.uci.edu
Vassilis J. Tsotras
Department of Computer Science and Engineering, University of California Riverside
Riverside, United States
[email protected]
Abstract: In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our prior work, we developed a Big Active Data (BAD) system for enabling Big Data subscriptions and analytics with millions of subscribers. Based on that, we introduce a new mechanism for enabling the sharing of Big Data at scale declaratively so that developers can easily create and provide data sharing services using declarative statements and can benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data sharing data flow among multiple systems, and present a prototype system with experimental results.
Index Terms: data warehouses, database systems, distributed information systems
I. INTRODUCTION
Advances in information technology have created large collections of data [1]. Such large volumes of data (Big Data) also come with big challenges. In order to transmit, process, and persist Big Data, researchers and experts from academia and industry have developed a plethora of systems [2]–[5]. However, most of them are passive in nature: they answer users' requests to process and return data rather than actively processing and delivering data of interest to users. In many applications, users want not only to analyze data but also to subscribe to and actively receive data of interest. Their interests may include the data's content as well as its relationships to other data. For example, in-field officers may want to receive nearby threatening tweets whenever they are posted. There can be millions of users with similar requests. We refer to the enabling of Big Data subscriptions and analytics as Big Active Data (BAD). Traditional pub/sub systems [6] often lack the capability to process data and to handle complex subscription requests that involve the data's relationships (such as "send me tweets near my current location"). More recent stream processing engines [7], [8] usually do not persist data for historical data analytics (such as "show me the average threatening rating of tweets in the past five months, grouped by their location"). In order to accommodate BAD challenges, we have created a BAD system that supports Big Data subscriptions and analytics at scale [9]–[13].
In a big BAD world, the data to be analyzed and delivered often needs to be processed and enriched with additional information so that interested users can obtain more insights from the data. Such additional information may be managed by different organizations. Developers often need to share data between different systems to support BAD applications (e.g., threatening tweets detected at the Department of Homeland Security need to be shared with local police departments). Besides the ethical and legal issues, data sharing can be difficult because of challenges in management, interoperability, security, and infrastructure [14]. Researchers from academia have developed projects that unify institutional repositories from different organizations for sharing research datasets [15], [16]. Companies have also created platforms based on Big Data projects to improve business efficiency and consolidate resources for better services [17]. Nevertheless, providing efficient, reliable, and scalable data sharing services requires dedicated infrastructure and collaborative efforts from developers and organizations. In this work, we focus on enabling the active sharing of Big Data declaratively in a BAD world. In particular, we characterize a BAD world as a group of BAD islands, where each organization runs an independent BAD system as an island.
We discuss how to “bridge” different BAD islands using scalable data sharing services without additional programming from developers.
II. BIG ACTIVE DATA IN A NUTSHELL
Our BAD system has been built as an extension of Apache AsterixDB, a Big Data Management System (BDMS) that provides distributed data management for large-scale, semi-structured data [18]. The BAD system [10], [13] can enable millions of users to subscribe to data of interest and receive updates continuously, and it also supports Big Data analytics with a declarative language, SQL++ (a SQL-inspired query language for semi-structured data) [19]. An overview of the BAD system is shown in Figure 1. Due to space limits, here we focus on two key components: Data Feeds and Data Channels. For more details about our project, we refer the reader to [9]–[13].
[Figure 1 labels: Developers, Subscribers, Analysts; Incoming Data, Additional Information; A BAD System: Persistent Storage, Data Channels, Analytical Engine, Data Feeds; Broker Network; BAD Applications]
Fig. 1. An overview of the BAD system
A. Data Feeds
Data feeds help the BAD system to ingest rapidly incoming data from various external data sources in different formats. Users can create a data feed using SQL++ statements. As an example, in Figure 2, we define a data type Tweet to describe the incoming data's minimum required attributes and an active dataset Tweets to persist the incoming data. Active datasets, unlike normal datasets, enable continuous query semantics [20] in channels (discussed next) [10], [12]. Here we create a TweetFeed using a socket adapter and specify the incoming data's format as JSON. This allows the BAD system to use a socket server to intake incoming JSON data. The TweetFeed is connected to the dataset Tweets so that the ingested data can be persisted in storage (partitioned across all nodes of a cluster) directly for later use. There are two types of data feeds: static feeds, which maximize ingestion throughput, and dynamic feeds, which allow users to enrich incoming data using user-defined functions (UDFs) [21]–[24].
CREATE TYPE Tweet AS {
    tid: bigint,
    uid: bigint,
    text: string
};
CREATE ACTIVE DATASET Tweets(Tweet) PRIMARY KEY tid;
CREATE FEED TweetFeed WITH {
    "type-name": "Tweet",
    "adapter-name": "socket_adapter",
    "format": "JSON",
    "sockets": "FEED_HOST:FEED_PORT",
    "address-type": "IP",
    "dynamic": false
};
CONNECT FEED TweetFeed TO DATASET Tweets;
START FEED TweetFeed;
Fig. 2. A sample data feed connected to an active dataset
B. Data Channels
Data channels allow developers to activate parameterized queries as services that millions of users can subscribe to in order to continuously receive their data of interest. When creating a channel, developers construct a channel query that describes the data of interest for subscribers and specify a channel period that indicates how often the channel query should be evaluated for subscribed users. All subscriptions of a channel are evaluated together to allow the system to exploit shared computations among them (e.g., many subscribers could be interested in tweets from Orange County). Increasing the channel period could lead to a bigger batch size and thus allow computing complex data of interest for more subscribers at scale. For example, we can create a NearbyThreateningTweets channel, as shown in Figure 3, to allow in-field officers to subscribe to nearby threatening tweets. In the channel query, we use the is_new function to look for new threatening tweets near a subscribed officer's location and return those tweets to subscribers every 10 seconds. The active dataset Tweets provides continuous query semantics to make sure every qualified new tweet will be delivered to subscribed officers.
The threatening tweets for subscribers are sent to brokers registered as HTTP endpoints in the BAD system. A user can subscribe to a data channel on a broker and thus receive updates from it. As shown in Figure 4, we can register a broker and make two separate subscriptions (on behalf of in-field officers) on this broker so that the threatening tweets near these two in-field officers are sent to this broker and then delivered to them. Data channels provide two modes for delivering data: push and pull. In the push mode, the data of interest is pushed to brokers directly. In the pull mode, a broker having new data of interest for its subscribers receives a notification from the channel, and the broker can then pull that data from BAD storage later.
// Similar to TweetFeed and Tweets, we have a LocationFeed connected
// to an OfficerLocations dataset to receive and store the live
// location updates from in-field officers
// CREATE TYPE OfficerLocation AS { oid: int, location: point };
// CREATE ACTIVE DATASET OfficerLocations(OfficerLocation)
//     PRIMARY KEY oid;
CREATE CONTINUOUS CHANNEL NearbyThreateningTweets(oid)
PERIOD duration("PT10S") {
    SELECT t
    FROM OfficerLocations o, Tweets t
    WHERE spatial_distance(t.location, o.location) < 5
      AND o.oid = oid
      AND t.threatening_rating > 0
      AND is_new(t)
};
Fig. 3. A sample continuous channel for nearby threatening tweets
CREATE BROKER BROKER_A AT "http://BROKER_A_HOST:BROKER_A_PORT/API";
SUBSCRIBE TO NearbyThreateningTweets("0907") ON BROKER_A;
SUBSCRIBE TO NearbyThreateningTweets("1226") ON BROKER_A;
Fig. 4. Registering a broker and making subscriptions
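To make the benefit of evaluating all subscriptions of a channel together more concrete, the following Java sketch groups subscriptions by their parameter value, so that a channel query is run once per distinct parameter and the shared result is handed to every subscription that used it. This only illustrates the batching idea under simplifying assumptions; the class `SubscriptionBatcher`, its methods, and the in-memory maps are hypothetical and do not describe how the BAD system executes channels internally.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of shared channel evaluation; not BAD's actual code.
final class SubscriptionBatcher {
    // parameter value -> ids of the subscriptions that used that parameter
    private final Map<String, List<String>> byParameter = new LinkedHashMap<>();

    void subscribe(String subscriptionId, String parameter) {
        byParameter.computeIfAbsent(parameter, p -> new ArrayList<>()).add(subscriptionId);
    }

    int distinctParameters() {
        return byParameter.size();
    }

    // Runs the (assumed) channel query once per distinct parameter and
    // hands the shared result to every subscription with that parameter.
    Map<String, List<String>> execute(Function<String, List<String>> channelQuery) {
        Map<String, List<String>> perSubscription = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : byParameter.entrySet()) {
            List<String> shared = channelQuery.apply(entry.getKey());
            for (String subscriptionId : entry.getValue()) {
                perSubscription.put(subscriptionId, shared);
            }
        }
        return perSubscription;
    }
}
```

With three subscriptions over two distinct parameters, the query function in this sketch is invoked twice rather than three times; a longer channel period would accumulate more new data per invocation.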
III. BAD ISLANDS
In the following sections, we discuss how we can connect BAD systems managed by different organizations (islands) in a BAD world to enable data sharing among them. We use a three-island example with the following organizations for illustration: the Department of Homeland Security, the Orange County Sheriff's Department, and the University of California-Irvine. Each organization hosts an independent BAD system and serves its own BAD users with localized information.
A. BAD Island 1: Department of Homeland Security
The Department of Homeland Security (DHS) is a federal agency responsible for ensuring public security. In our example, DHS has access to all tweets posted in the United States. These tweets cannot be shared with other organizations directly due to licensing and privacy concerns, except for the tweets that are related to potential threats. The BAD system at DHS needs to provide data analytics on collected tweets and serve tweets to its agents through data channels.
Since raw tweets from Twitter may not contain all necessary information, DHS might need to enrich them with other relevant data. As an example, DHS could collect weapon registration information for some sensitive Twitter account holders and attach that to tweets to provide important additional information for interested subscribers. In addition, DHS could also utilize machine learning algorithms to estimate the threatening rating of the tweets' text and use that for later analysis. An overview of the DHS island is shown in Figure 5.
Footnote (to the channel query in Section II-B): One could also apply the is_new function on OfficerLocations to look for nearby threatening tweets only for officers actively updating their locations. Interested readers may refer to [13] for more continuous channel examples.
[Figure 5 labels: Tweets; Weapon Registration Information; Threatening Rating Detection Algorithm; Processing; Enriched Threatening Tweets; Threatening Tweets; Data Analytics]
Fig. 5. An overview of the DHS Island
B. BAD Island 2: Orange County Sheriff’s Department
The Orange County Sheriff's Department (OCSD) is the local law enforcement agency that ensures safety and responds to potential crimes in Orange County, CA. In our use case, OCSD wants to monitor major local events and ensure the safety of each event and its participants. In-field officers who patrol around the county continuously report their locations back to OCSD so that OCSD can send them instructions based on their locations (e.g., when an emergency happens, send nearby officers for help).
To prevent potential threats to local events, OCSD would like to obtain the threatening tweets posted in Orange County. When a local threatening tweet is detected, OCSD can find important events close to the tweet and then notify the nearby in-field officers about the event and the tweet so they can further investigate it. Additionally, OCSD wants to support data analytics on data stored in the system. An overview of the OCSD island is shown in Figure 6.
[Figure 6 labels: Threatening Tweets; Event Information; In-field Officers' Locations; Processing; Threatening Events; Data Analytics; Location Updates; Event Notifications; In-field Officers]
Fig. 6. An overview of the OCSD island
C. BAD Island 3: University of California-Irvine
The University of California-Irvine (UCI) is a public university located in Irvine, a city in Orange County. The university often hosts various activities and events in different buildings on campus. To ensure students' and visitors' safety, the university has its own police officers placed at various security stations on campus, from whom students and visitors can seek help when an emergency happens. The buildings on campus have notice boards for showing important notifications and alerts. The university also has an alerting service, zotALERT, which delivers important messages to people (subscribers) on campus through text messages and emails.
UCI would like to acquire the threatening tweets posted near the UCI campus and notify people in the buildings around those tweets to raise attention. An alert could include information about the security stations near the tweet so that people in an emergency situation could quickly seek help. Data analytics on threatening tweets and other data in the system are also to be supported for school officials. An overview of the UCI island is shown in Figure 7.
[Figure 7 labels: Threatening Tweets; On-campus Building Information; On-campus Security Information; Processing; On-campus Alerts; Data Analytics; Alert Notifications; School Officials; Building Notice Board; zotALERT]
Fig. 7. An overview of the UCI island
IV. ISLAND HOPPING: CONNECTING BAD ISLANDS
In order to support the BAD services at OCSD and UCI described in Section III, we need to enable the sharing of threatening tweets detected at DHS with OCSD and UCI. These tweets can be combined with local information at Orange County and UCI, respectively, and then be used for creating localized notifications for subscribers on each island. Below we consider three options for sharing threatening tweets among these islands: (1) combining all islands together into one (a BAD Continent), (2) creating direct connections between the individual islands as needed (BAD Ferries), and (3) utilizing the channel idea to allow islands to subscribe to what they need from one another (BAD Bridges). We discuss each option in turn.
A. Option 1: A BAD Continent
Instead of sharing threatening tweets between multiple BAD islands, one could create a big BAD island, namely a BAD continent, that holds not only the data at DHS but also the local data from OCSD and UCI, as shown in Figure 8. In this case, all services at OCSD and UCI could be integrated into this BAD continent, and all subscribers would then subscribe to this BAD continent directly. All information is now in the same system. Developers from different organizations could easily create BAD services without having to share data.
In principle, a one-for-all BAD continent could be easy to build, and it avoids the complexity of connecting different BAD islands. Although the resulting BAD system could be scaled to support the volume of data and users from multiple organizations, such global integration would introduce significant management and administration overheads, especially for the service provider (DHS in this case). For the three-island example, not only would all local information (including local events, campus building layouts, etc.) need to be stored in the BAD continent, but all updates (location updates, event updates, etc.) would need to be forwarded to the system. Managing all local data at DHS could be very complex and would require sophisticated access control. When more organizations join, such a database would have to manage all kinds of additional local information while receiving updates from multiple parties; this system would quickly become impractical for one organization to maintain. Additionally, such global information sharing may not be permitted (by law) between different agencies in all cases.
[Figure 8 labels: Tweets; Weapon Registration Information; Threatening Rating Detection Algorithm; Event Information; In-field Officers' Locations; On-campus Building Info.; On-campus Security Info.; Processing; School Officials; Building Notice Board; zotALERT; Location Updates; Event Notifications; In-field Officers]
Fig. 8. An illustration of a BAD continent
B. Option 2: BAD Ferries
A different way of supporting the required BAD services at OCSD and UCI, without combining everything together, would be to programmatically send the requested data from DHS to OCSD and UCI, as shown in Figure 9. DHS could send the threatening tweets detected in Orange County and near the UCI campus to OCSD and UCI, respectively, and OCSD BAD and UCI BAD could then combine those tweets with their local information to produce localized notifications for their subscribers.
[Figure 9 labels: BAD@DHS with Dedicated Server Programs with Glue; BAD@OCSD with Dedicated Client Program; BAD@UCI with Dedicated Client Program]
Fig. 9. An illustration of BAD ferries
In order to share the data cleanly and efficiently, DHS would need to create a dedicated server program that allows other organizations to access the shared data in DHS. Also, OCSD and UCI would need to develop corresponding client programs that connect to the DHS server program and obtain shared data. Data exchanges between the server and clients could be frequent, and there could be many more clients who would like to access the shared data. Thus, the server program would need to be efficient, reliable, and scalable for handling a large number of clients and a large volume of data. Implementing and extending the server and client programs would require significant efforts from these organizations.
C. Option 3: BAD Bridges
An important observation is that this data exchange pattern, where we have an island serving data and multiple islands constantly requesting data of interest, resonates well with the original BAD user model, where subscribers subscribe to data and constantly receive updates. Inspired by this, we could characterize a BAD island as being a BAD subscriber of another island and connect these islands using BAD bridges built on data channels and data feeds to share data at scale, as shown in Figure 10. One might characterize this architecture as: “One man's channel is another man's feed.”
[Figure 10 labels: BAD@DHS with threateningTweetsAt Channel; BAD@OCSD with localThreateningTweet Feed (Threatening Tweet @ OCSD); BAD@UCI with localThreateningTweet Feed (Threatening Tweet @ UCI)]
Fig. 10. An illustration of BAD bridges
Following our example, we could first create a data channel on DHS BAD that serves threatening tweets by area, namely a ThreateningTweetsAt channel, and other islands interested in local threatening tweets from an area could then subscribe to this channel with the area name of interest. OCSD BAD, as a subscriber, can subscribe to this channel with the parameter “OC”, and UCI BAD, as another subscriber, can subscribe to this channel with the parameter “UCI”. We could use a push channel to push threatening tweets to OCSD and UCI BAD so they can receive local threatening tweets from the channel at DHS directly, process them with local information, and then produce localized notifications for their own subscribers.
On OCSD and UCI BAD, we could utilize data feeds to receive threatening tweets detected by the ThreateningTweetsAt channel on DHS BAD. Taking OCSD BAD as an example, we could create an HTTP feed and connect it to a local OCSD dataset for persisting the threatening tweets. We could register the feed's HTTP address as a broker in DHS BAD and then subscribe to the ThreateningTweetsAt channel with the parameter “OC”. With this feed, broker, and subscription, threatening tweets posted in Orange County and detected by DHS would then be sent to the feed's endpoint from the ThreateningTweetsAt channel. Similarly, we could repeat this process for other BAD systems to obtain threatening tweets from their areas of interest. Since the BAD system is scalable and can support a large number of subscribers with a large volume of data, bridging BAD systems using data channels and feeds can be scaled out to support many more islands connecting to DHS. This allows developers to declaratively create data sharing services, without additional programming and gluing together multiple systems, as we will see next.
V. BUILDING BAD BRIDGES
Given the advantages of the BAD Bridges approach, we now introduce BAD brokers, which further simplify and enhance data exchanges between BAD islands, and BAD feeds, which help users create bridges and manage their life-cycles.
A. BAD Brokers
The broker sub-system in BAD manages the communication between the BAD system and its subscribers. A broker registers itself as an HTTP endpoint in the BAD system. Notifications containing data of interest produced by the BAD back-end are delivered to this broker endpoint and then disseminated to subscribers who subscribed on this broker. In order to allow general brokers to parse the incoming notifications, data channels produce notifications as JSON objects, and the more complex data types that BAD supports in the AsterixDB Data Model (ADM), such as datetimes and points, are encoded as strings, arrays, and other JSON data types. Since BAD islands are “brokers” that can also directly process ADM data, we can instead deliver their notifications as ADM records to maintain the richer data type information and avoid additional data encoding and decoding overheads.
To allow brokers to process ADM data and to be extensible for future use cases, we introduce a new notion of BAD brokers and a simple new syntax for creating brokers in BAD, as shown in Figure 11. Users can add an optional WITH clause to provide additional information about the broker. While we only support “broker-type” for now, this can be further extended to support other features in the future. When there is no WITH clause or when the broker-type is set to “general”, we create a general broker that takes JSON data. When the broker-type is set to “BAD”, we create a BAD broker that takes ADM records. In general, a channel can have subscriptions from both types of brokers. In that case, channel executions will send JSON formatted data to the general brokers and ADM formatted data to the BAD brokers.
CREATE BROKER BROKER_NAME AT "http://BROKER_HOST:PORT_NUM"
WITH { "broker-type" : "BAD" };
Fig. 11. Creating a BAD broker
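The broker-type rule described above can be summarized in a few lines of Java. The sketch below is a hypothetical restatement of that rule, not code from the BAD system: the class `BrokerFormats`, the exact-match comparison of the option value, and the exception thrown for an unknown broker-type are all assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical sketch of the broker-type rule described above; not BAD's actual code.
final class BrokerFormats {
    enum Format { JSON, ADM }

    // `withOptions` stands for the optional WITH clause of CREATE BROKER;
    // a null or empty map models a statement without a WITH clause.
    static Format formatFor(Map<String, String> withOptions) {
        String type = (withOptions == null) ? null : withOptions.get("broker-type");
        if (type == null || type.equals("general")) {
            return Format.JSON;   // general brokers take JSON data
        }
        if (type.equals("BAD")) {
            return Format.ADM;    // BAD brokers take ADM records
        }
        // Assumption: any other value is rejected; the paper does not specify this case.
        throw new IllegalArgumentException("unsupported broker-type: " + type);
    }
}
```

A channel with subscriptions on both kinds of brokers would, under this rule, serialize each batch of results once per format and send each broker the encoding it expects.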
B. BAD Feeds
Bridging from a BAD island A to another BAD island B and sharing data with island B requires several steps: create a data feed on island B; register the feed with island A as a BAD broker; and create a subscription on island A on the created broker. Also, removing the bridge between island A and island B requires unsubscribing from the channel and removing the BAD broker on island A. In order to simplify the process of bridging BAD islands and help users manage the life-cycles of bridges, we also introduce the notion of BAD feeds.
One can create a BAD feed on island B and connect it to a channel on island A using the statement in Figure 12. Unlike regular data feeds, a BAD feed requires several additional configuration entries for connecting to a data channel on the other BAD island. In particular, the “bad-host”, “bad-channel”, and “bad-dataverse” configuration parameters help the system locate the data channel on the other island, while “bad-channel-parameters” contains subscription parameters as a quote-escaped string for subscribing to the channel. When a channel takes multiple parameters, we use commas to separate them. If a data feed wants to subscribe to a channel with several different parameters (e.g., OCSD BAD wants to subscribe to threatening tweets from both Orange County and UCI), we can concatenate them using semicolons.
CREATE FEED A_SAMPLE_BAD_FEED_ON_ISLAND_B WITH {
    "adapter-name" : "http_adapter",
    "address-type" : "IP",
    "format" : "ADM",
    "addresses" : "ISLAND_B_FEED_HOST:ISLAND_B_FEED_PORT",
    "type-name" : "INCOMING_DATA_TYPE",
    "bad-host" : "ISLAND_A_HOST",
    "bad-channel" : "ISLAND_A_CHANNEL_NAME",
    "bad-channel-parameters": "PARAM_1-1,PARAM_1-2;PARAM_2-1,PARAM_2-2",
    "bad-dataverse": "ISLAND_A_CHANNEL_DATAVERSE"
};
Fig. 12. Creating a BAD feed on island B
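As a rough illustration of how the “bad-channel-parameters” string maps onto the bridging steps listed earlier in this subsection, the Java sketch below splits the string into one parameter list per subscription and then produces the statements that would be issued on island A, using the CREATE BROKER and SUBSCRIBE TO syntax of Figures 4 and 11. The class `BridgePlan`, the broker name passed in, the "http://" prefix added to the feed address, and the naive splitting (which assumes that parameter values contain no commas or semicolons) are all assumptions of this sketch rather than details of the BAD implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of setting up a bridge; not BAD's actual code.
final class BridgePlan {
    // Splits "a,b;c,d" into [[a, b], [c, d]]: semicolons separate
    // subscriptions, commas separate the parameters of one subscription.
    // Assumption: parameter values contain no ',' or ';' themselves.
    static List<List<String>> parseParameters(String badChannelParameters) {
        List<List<String>> subscriptions = new ArrayList<>();
        for (String group : badChannelParameters.split(";")) {
            subscriptions.add(Arrays.asList(group.split(",")));
        }
        return subscriptions;
    }

    // Statements island B would ask island A to run when the bridge is set up:
    // one BAD broker for the feed endpoint, then one subscription per parameter list.
    static List<String> onStart(String broker, String feedAddress, String channel, String params) {
        List<String> statements = new ArrayList<>();
        statements.add("CREATE BROKER " + broker + " AT \"http://" + feedAddress
                + "\" WITH { \"broker-type\" : \"BAD\" };");
        for (List<String> p : parseParameters(params)) {
            statements.add("SUBSCRIBE TO " + channel + "(" + String.join(", ", p)
                    + ") ON " + broker + ";");
        }
        return statements;
    }
}
```

Removing the bridge would undo these steps in the opposite order: first the subscriptions are dropped, then the broker is removed from island A.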
The bridge's information is persisted in the BAD system's metadata with the feed's configuration when the feed is created. When a user starts a BAD feed on a local BAD system (island B), it automatically registers a broker on the specified remote BAD system (island A) using island B's feed endpoint and subscribes to island A's channel using the provided parameters. When a user stops the BAD feed, island B unsubscribes from island A's channel and then removes the broker from island A. We tie the start and stop events of a data feed on the local BAD system (island B) to the subscribe and unsubscribe actions on the remote BAD system (island A), so when the feed is not running, the remote BAD system does not need to compute and deliver data to this BAD feed.
VI. A PROTOTYPE OF BAD ISLANDS
We now describe a complete prototype of BAD islands that supports the use cases described in Section III. We show how to create and connect three BAD islands (the BAD trinity) using declarative statements and show how data flows between these different islands. The BAD system organizes data and other entities under dataverses (similar to databases in an RDBMS). To differentiate organizations, we use different dataverses for different organizations (using the USE statement).
A. BAD@DHS
DHS BAD ingests tweets from external data sources. We can create a TweetFeed like the one in Figure 2 and configure it as dynamic to enrich the incoming tweets with the additional information needed using UDFs. First, we enrich an incoming tweet with the weapon registration records (if any) of the tweet's user. To hold the weapon registration records of sensitive tweet users, we create a data type WeaponRegistration and a dataset WeaponRegistrations, as shown in Figure 13. (A user may have multiple weapons.)
USE dhs;
CREATE TYPE WeaponRegistration AS {
    wrid: uuid,
    uid: bigint,
    weapon_name: string
};
CREATE DATASET WeaponRegistrations(WeaponRegistration)
    PRIMARY KEY wrid AUTOGENERATED;
Fig. 13. Data type and dataset definition for weapon registration information
Second, we create a Java UDF to detect the threatening rating of a tweet's text using a list of threatening words, as shown in Figure 14. In this UDF, we load an external list of threatening words and use the number of threatening words in the given text as its threatening rating.
...
@Override
public void evaluate(IFunctionHelper functionHelper) throws Exception {
    JString input = (JString) functionHelper.getArgument(0);
    JInt output = (JInt) functionHelper.getResultObject();
    String tweetText = input.getValue();
    int threateningRating = 0;
    String[] words = tweetText.split(" ");
    for (String word : words) {
        // The threateningWordList is initialized with a file when function starts
        if (threateningWordList.contains(word.replaceAll("[,.]", ""))) {
            threateningRating++;
        }
    }
    output.setValue(threateningRating);
    functionHelper.setResult(output);
}
...
Fig. 14. A Java UDF for determining the threatening rating
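For readers who want to try the counting logic of Figure 14 outside of AsterixDB, the following standalone Java sketch restates it without the UDF interfaces. The class name `ThreatRating` and the choice to pass the word list as a `Set` argument are assumptions made for this illustration; the tokenization and punctuation handling follow the figure.

```java
import java.util.Set;

// Standalone restatement of the counting logic in Fig. 14, without the
// AsterixDB UDF interfaces. The class and method names are hypothetical.
final class ThreatRating {
    static int rate(String tweetText, Set<String> threateningWords) {
        int rating = 0;
        for (String word : tweetText.split(" ")) {
            // Same normalization as Fig. 14: drop commas and periods only.
            if (threateningWords.contains(word.replaceAll("[,.]", ""))) {
                rating++;
            }
        }
        return rating;
    }
}
```

Because matching is case-sensitive and only commas and periods are stripped, a token such as "Bomb!" would not be counted against a lowercase word list; a production detector would need more careful normalization.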
To add the desired set of enrichments to incoming tweets, we can create a SQL++ UDF EnrichTweet and attach it to the TweetFeed when connecting to the Tweets dataset, as shown in Figure 15. In this UDF, we also transform the epoch time of a tweet's “created_at” attribute into a datetime attribute “timestamp”, and we create a point attribute “location” using the array of coordinates. Materializing these ADM attributes is useful because they then do not need to be constructed every time in computations like spatial joins. Here we use the Java UDF defined in Figure 14 to extract the threatening rating of the tweet's text and attach it as a “threatening_rating” attribute. We use a sub-query to look for the weapon registration information of the tweet's user and nest the registered weapons into a “user_registered_weapon” attribute. These new attributes are merged into the tweet and will be persisted for producing notifications.
USE dhs;
CREATE FUNCTION EnrichTweet(tweet) {
    object_merge(tweet, {
        "timestamp" : datetime_from_unix_time_in_ms(tweet.created_at),
        "location" : create_point(tweet.coordinates[0], tweet.coordinates[1]),
        "threatening_rating" : threateningRating(tweet.text),
        "user_registered_weapon": (
            SELECT VALUE w.weapon_name
            FROM WeaponRegistrations w
            WHERE w.uid = tweet.uid)
    })
};
CONNECT FEED TweetFeed TO DATASET Tweets APPLY FUNCTION EnrichTweet;
START FEED TweetFeed;
Fig. 15. Enriching tweets with additional information
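The effect of EnrichTweet on a single record can be mimicked in plain Java as follows. This is a hypothetical in-memory analogue intended only to show which attributes are added and that the incoming record's own fields are kept; the class `TweetEnricher`, the use of `Map` and `List` to stand in for ADM objects and points, and passing the rating and weapon list in as arguments (instead of calling a UDF and a sub-query) are assumptions of the sketch.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of EnrichTweet (Fig. 15); not AsterixDB code.
final class TweetEnricher {
    static Map<String, Object> enrich(Map<String, Object> tweet, int threateningRating,
                                      List<String> registeredWeapons) {
        @SuppressWarnings("unchecked")
        List<Double> coordinates = (List<Double>) tweet.get("coordinates");
        // Copy first so the incoming record is left untouched.
        Map<String, Object> enriched = new LinkedHashMap<>(tweet);
        // Epoch milliseconds -> datetime, as datetime_from_unix_time_in_ms does.
        enriched.put("timestamp", Instant.ofEpochMilli((Long) tweet.get("created_at")));
        // A two-element list stands in for the ADM point built by create_point.
        enriched.put("location", List.of(coordinates.get(0), coordinates.get(1)));
        enriched.put("threatening_rating", threateningRating);
        enriched.put("user_registered_weapon", registeredWeapons);
        return enriched;
    }
}
```

As in the SQL++ version, the derived attributes are computed once at ingestion time, so later queries can read “location” and “timestamp” directly.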
With these enriched tweets, we can serve threatening tweets by area by creating the continuous data channel “ThreateningTweetsAt” shown in Figure 16. To put everything together, a detailed overview of the entire DHS BAD system is shown in Figure 17.
USE dhs;
CREATE CONTINUOUS PUSH CHANNEL ThreateningTweetsAt(area_name)
PERIOD duration("PERIOD_DURATION") {
    SELECT t.area_name, t.text, t.location, t.threatening_rating,
           t.user_registered_weapon
    FROM Tweets t
    WHERE t.area_name = area_name
      AND t.threatening_rating > 0
      AND is_new(t)
};
Fig. 16. Definition of the ThreateningTweetsAt channel
B. BAD@OCSD
Fig. 17. The internal details of the DHS BAD system
OCSD BAD in this prototype receives threatening tweets not only from Orange County but also from UCI to demonstrate how a BAD feed can connect to a channel with two sets of parameters. OCSD BAD notifies in-field officers about nearby threatening tweets that are close to important local events. To persist event information in OCSD BAD, we can create a data type “Event” and a dataset “Events” as shown in Figure 18.
USE ocsd;
CREATE TYPE Event AS {
    eid: uuid,
    name: string,
    location: point,
    event_duration: duration,
    radius_km: double
};
CREATE DATASET Events(Event) PRIMARY KEY eid;
Fig. 18. Data type and dataset definition for events
To store threatening tweets coming from DHS BAD, we create a data type LocalThreateningTweet and an active dataset LocalThreateningTweets in Figure 19. (We use an active dataset for threatening tweets to ensure continuous query semantics in later local channel computations.) We create a BAD feed in Figure 20 to obtain local threatening tweets from DHS. This BAD feed subscribes to the DHS ThreateningTweetsAt channel with parameters “OC” and “UCI”, which correspond to two separate subscriptions in DHS BAD. Since there is no further data enrichment during ingestion, we use a static data feed here and connect it to LocalThreateningTweets directly.
CREATE TYPE LocalThreateningTweet AS {
    channelExecutionEpochTime: bigint,
    dataverseName: string,
    channelName: string
};
CREATE ACTIVE DATASET LocalThreateningTweets(LocalThreateningTweet)
    PRIMARY KEY channelExecutionEpochTime;
Fig. 19. Data type and dataset definition for local threatening tweets at Orange County
USE ocsd;
CREATE FEED LocalThreateningTweetFeed WITH {
    "adapter-name" : "http_adapter",
    "addresses" : "OCSD_HOST:10013",
    "address-type" : "IP",
    "type-name" : "LocalThreateningTweet",
    "format" : "adm",
    "bad-host" : "DHS_HOST",
    "bad-channel" : "ThreateningTweetsAt",
    "bad-channel-parameters": "\"OC\";\"UCI\"",
    "bad-dataverse": "dhs",
    "dynamic": false
};
CONNECT FEED LocalThreateningTweetFeed TO DATASET LocalThreateningTweets;
START FEED LocalThreateningTweetFeed;
Fig. 20. Definition, connect and start feed statements for LocalThreateningTweetFeed
In-field officers from OCSD also continuously send their location updates to the OCSD BAD system so that OCSD can notify the officers about nearby threatening tweets based on their current location. We can use the data type, dataset, and feed described in Figure 3 for ingesting and persisting the location updates. As there is no further enrichment for location updates, the LocationFeed can be static as well.
With the local threatening tweets, event information, and officers' locations, we can now create a continuous channel for in-field officers to subscribe to nearby threatening tweets close to local events (a.k.a. threatening events), as shown in Figure 21. The notifications from DHS contain threatening tweets as an array in the “results” attribute, so we use the UNNEST operation to access each individual threatening tweet. We calculate the distance between the officer and the tweet, the event and the tweet, and the officer and the event. If the officer is near a threatening tweet and the threatening tweet is near an event, we send a notification to the officer. The notification contains the tweet's content, the event information, the distance between the officer and the tweet, and the distance between the officer and the event to help the officer take further actions. A detailed overview of the OCSD BAD system is shown in Figure 22.
USE ocsd;
CREATE CONTINUOUS PUSH CHANNEL ThreateningEventsNear(oid)
PERIOD duration("PERIOD_DURATION") {
  FROM LocalThreateningTweets tn, OfficerLocations o, Events e
  UNNEST tn.results threatening_tweet
  LET tweet_loc = threatening_tweet.result.location,
      officer_tweet_dist = spatial_distance(o.location, tweet_loc),
      event_tweet_dist = spatial_distance(e.location, tweet_loc),
      officer_event_dist = spatial_distance(o.location, e.location)
  WHERE is_new(tn)
    AND oid = o.oid
    AND officer_tweet_dist < 0.1
    AND event_tweet_dist < e.radius_km / 100
  SELECT oid, threatening_tweet.result tweet_content, e event_info,
         officer_tweet_dist * 100 as tweet_distance_km,
         officer_event_dist * 100 as event_distance_km
};
Fig. 21. Definition of the ThreateningEventsNear channel
Fig. 22. The internal details of the OCSD BAD system
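To make the channel's join and filter logic concrete, the following Python sketch mimics the UNNEST step and the two distance predicates of Figure 21 on plain dictionaries. It is only an approximation under stated assumptions: spatial_distance is modeled as Euclidean distance on the raw coordinates, the is_new check and the per-subscription oid parameter are omitted, and the officer location used below is hypothetical. The tweet and event values are taken from the sample notifications in Section VI-D.

```python
import math


def spatial_distance(p, q):
    """Euclidean distance between two (x, y) points in coordinate units (assumed)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def threatening_events_near(notification, officers, events):
    """Return one record per (tweet, officer, event) triple that passes the filter."""
    matches = []
    for item in notification["results"]:  # corresponds to UNNEST tn.results
        tweet = item["result"]
        for o in officers:
            for e in events:
                officer_tweet = spatial_distance(o["location"], tweet["location"])
                event_tweet = spatial_distance(e["location"], tweet["location"])
                officer_event = spatial_distance(o["location"], e["location"])
                # Same thresholds as the WHERE clause in Figure 21.
                if officer_tweet < 0.1 and event_tweet < e["radius_km"] / 100:
                    matches.append({
                        "oid": o["oid"],
                        "tweet_content": tweet,
                        "event_info": e,
                        "tweet_distance_km": officer_tweet * 100,
                        "event_distance_km": officer_event * 100,
                    })
    return matches


notification = {"results": [{"result": {
    "area_name": "UCI",
    "location": (33.64921228736088, -117.84181977473024),
}}]}
events = [{"name": "OC Marathon",
           "location": (33.66100302712824, -117.83950620703125),
           "radius_km": 3.57746886883645}]
officers = [{"oid": 0, "location": (33.65, -117.84)}]  # hypothetical position

results = threatening_events_near(notification, officers, events)
print(len(results), results[0]["oid"])
```

The factor of 100 converts coordinate units to approximate kilometers in the same way the channel does, so the 0.1 threshold corresponds to roughly 10 km.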
C. BAD@UCI
UCI BAD receives threatening tweets posted at UCI and checks whether a threatening tweet is near an on-campus building. If so, it creates a notification about the threatening tweet together with the nearby security stations’ information. Like OCSD BAD, to persist threatening tweets at UCI, we need to create a data type LocalThreateningTweet and a dataset LocalThreateningTweets on UCI BAD. To receive threatening tweets at UCI from DHS, we need to create a BAD feed, like the one in Figure 20, connected to the ThreateningTweetsAt channel but using the parameter “UCI”.

To provide more information for UCI BAD’s subscribers, we store two additional kinds of local data in UCI BAD: on-campus buildings, used for checking whether a threatening tweet is nearby, and security stations, from which students can seek help. In Figure 23, we create the data types and datasets for them, respectively.
USE uci;
CREATE TYPE Building AS { bid: uuid, name: string };
CREATE TYPE SecurityStation AS { sid: bigint, location: point };
CREATE DATASET Buildings(Building) PRIMARY KEY bid AUTOGENERATED;
CREATE DATASET SecurityStations(SecurityStation) PRIMARY KEY sid;
Fig. 23. Data type and dataset definition of buildings and security stations
With the local threatening tweets, on-campus building information, and security station information, we can create a continuous channel called “AlertsOnCampus” to provide on-campus alerts about threatening tweets near buildings, with security stations’ information attached, using the statement shown in Figure 24. Like the ThreateningEventsNear channel in OCSD BAD, we first UNNEST threatening tweets from the incoming notifications. Then, we check whether a threatening tweet is posted at an on-campus building. If so, we attach the security station information to the threatening tweet, with stations ordered by their distances to the tweet’s location, and generate an alert. A detailed overview of the UCI BAD system is shown in Figure 25.
USE uci;
CREATE CONTINUOUS PUSH CHANNEL AlertsOnCampus()
PERIOD duration("PERIOD_DURATION") {
  FROM LocalThreateningTweets tn, Buildings b
  UNNEST tn.results threatening_tweet
  LET tweet_loc = threatening_tweet.result.location,
      station_dist = (
        FROM SecurityStations s
        LET dist = spatial_distance(tweet_loc, s.location)
        SELECT s stationInfo, dist * 100 dist_km
        ORDER BY dist)
  WHERE is_new(tn)
    AND spatial_intersect(tweet_loc, b.area)
  SELECT threatening_tweet.result tweet_content,
         b building_info, station_dist
};
Fig. 24. Definition of the AlertsOnCampus channel
Fig. 25. The internal details of the UCI BAD system
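The building check and the station ordering in Figure 24 can be traced by hand with the sample data from Section VI-D. The Python sketch below is an illustration rather than the system's implementation: it assumes spatial_distance is Euclidean distance on raw coordinates and models spatial_intersect for a point and a rectangle as a simple bounds check. The tweet location, station locations, and rectangle corners are the values read from Figures 28 and 31; the function names are hypothetical.

```python
import math

tweet_loc = (33.64921228736088, -117.84181977473024)
stations = [
    {"sid": 0, "location": (33.646866723393266, -117.84170161534618)},
    {"sid": 1, "location": (33.64792551859947, -117.84013290702327)},
]
# Lower-left and upper-right corners of the "Student Center" rectangle,
# as read from Figure 31.
student_center = ((33.64811430275051, -117.84332027249145),
                  (33.649382536086605, -117.84153928570557))


def spatial_distance(p, q):
    """Euclidean distance between two (x, y) points in coordinate units (assumed)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def point_in_rectangle(p, rect):
    """Bounds check standing in for spatial_intersect(point, rectangle)."""
    (x1, y1), (x2, y2) = rect
    return x1 <= p[0] <= x2 and y1 <= p[1] <= y2


def station_distances(loc, all_stations):
    """Attach dist_km to each station and order by distance, like the subquery."""
    rows = [{"stationInfo": s,
             "dist_km": spatial_distance(loc, s["location"]) * 100}
            for s in all_stations]
    return sorted(rows, key=lambda row: row["dist_km"])


ranked = station_distances(tweet_loc, stations)
print(point_in_rectangle(tweet_loc, student_center))
for row in ranked:
    print(row["stationInfo"]["sid"], row["dist_km"])
```

Under these assumptions the sketch places station 1 first at about 0.212 km and station 0 second at about 0.235 km, which agrees with the "dist_km" values shown in Figure 31.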
D. The Trip of A Threatening Tweet
In order to illustrate how BAD islands interact with BAD bridges, we pick a sample tweet and show how it flows through the three islands and their bridges and produces notifications with local information for the subscribers on each island. An overview of our three-island prototype is shown in Figure 26. The circled numerical labels in the figure will be used later for illustrating the data content at different stages of the workflow.

We will use the raw tweet in Figure 27 (labeled 1 in Figure 26) as the example. This tweet is posted at UCI, and it contains the tweet’s geolocation as a JSON array of coordinates and the epoch timestamp of when the tweet was created as a JSON number. This raw tweet is ingested by the TweetFeed defined in Figure 2 and then enriched by the UDF defined in Figure 15. After that, the enriched tweet is persisted in the Tweets dataset as shown in Figure 28 (labeled 2 in Figure 26). Enriched tweets contain a threatening rating detected by the Java UDF, an array of registered weapons for the tweet’s user obtained by looking in the WeaponRegistrations dataset using the “uid” attribute, a timestamp as a datetime attribute, and a location as a point attribute.

[Figure 26: diagram of the three islands BAD @ DHS, BAD @ OCSD, and BAD @ UCI. Labels in the figure: Raw Tweets, Tweet Feed, Tweets, Weapon Registrations, Threatening Word List, ThreateningTweetAt(area_name) Channel, ThreatfulTweetsAt(OC), ThreatfulTweetsAt(UCI), LocalThreateningTweetFeed, LocalThreateningTweets, Officer Locations, Events, ThreateningEventsNear(officer_id) Channel, Buildings, SecurityStations, AlertsOnCampus() Channel, and stage markers ① to ⑤.]
Fig. 26. An overview of BAD islands

{
  "tid": 1593142018123,
  "uid": 73,
  "area_name": "UCI",
  "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
  "coordinates": [ 33.64921228736088, -117.84181977473024 ],
  "created_at": 1593142018123
}
Fig. 27. A sample raw threatening tweet

{
  "tid": 1593142018123,
  "uid": 73,
  "area_name": "UCI",
  "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
  "coordinates": [ 33.64921228736088, -117.84181977473024 ],
  "created_at": 1593142018123,
  "threatening_rating": 2,
  "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ],
  "timestamp": datetime("2020-06-26T03:26:58.123Z"),
  "location": point("33.64921228736088,-117.84181977473024")
}
Fig. 28. The enriched threatening tweet

Since both the OCSD and UCI BAD systems subscribe to threatening tweets at UCI, they will each receive a notification from DHS BAD about this threatening tweet. Figure 29 shows the notification sent to OCSD BAD (labeled 3 in Figure 26). If there were also a threatening tweet posted in Orange County at the same time, the “results” array would include that tweet but with a different subscription ID, as OCSD BAD has two subscriptions to the ThreateningTweetsAt channel with parameters “OC” and “UCI”, respectively. Since UCI BAD also subscribes to the channel, but with a different subscription on another broker (pointed to UCI’s BAD feed), the notification for UCI BAD will be produced and sent separately.

In the OCSD BAD ThreateningEventsNear channel, threatening tweets are combined with local event information and officer location information to produce the nearby threatening event notifications for in-field officers.
There is one local event, “OC Marathon”, near the threatening tweet in Figure 28, and there is an in-field officer 0 nearby, so OCSD BAD produces one notification about the tweet and the event for this officer. Figure 30 shows this threatening event notification (labeled 4 in Figure 26). It contains the event information as the “event_info” attribute, the threatening tweet’s information as the “tweet_content” attribute, and the distances from officer 0 to the tweet and to the event as the “tweet_distance_km” and “event_distance_km” attributes, respectively.

{
  "dataverseName": "dhs",
  "channelName": "ThreateningTweetsAt",
  "channelExecutionEpochTime": 1593142019521,
  "results": [
    {
      "result": {
        "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
        "area_name": "UCI",
        "location": point("33.64921228736088,-117.84181977473024"),
        "threatening_rating": 2,
        "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ]
      },
      "channelExecutionTime": datetime("2020-06-26T03:26:59.521Z"),
      "subscriptionId": uuid("82e61d25-f7ad-0632-3b9a-9c26e681ad84"),
      "deliveryTime": datetime("2020-06-26T03:26:59.522Z")
    }
  ]
}
Fig. 29. The generated threatening tweet notification from DHS

{
  "dataverseName": "ocsd",
  "channelName": "ThreateningEventsNear",
  "channelExecutionEpochTime": 1593142020436,
  "results": [
    {
      "result": {
        "event_info": {
          "eid": uuid("82e61d25-4cad-0632-3d8d-148e71cb50bf"),
          "name": "OC Marathon",
          "location": point("33.66100302712824, -117.83950620703125"),
          "event_duration": duration("PT10S"),
          "radius_km": 3.57746886883645
        },
        "tweet_distance_km": 4.854786471222485,
        "event_distance_km": 5.6839370484947755,
        "oid": 0,
        "tweet_content": {
          "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
          "area_name": "UCI",
          "location": point("33.64921228736088,-117.84181977473024"),
          "threatening_rating": 2,
          "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ]
        }
      },
      "channelExecutionTime": datetime("2020-06-26T03:27:00.436Z"),
      "subscriptionId": uuid("82e61d25-47ad-0632-3e5c-22b3cb7d7df4"),
      "deliveryTime": datetime("2020-06-26T03:27:00.437Z")
    }
  ]
}
Fig. 30. The generated threatening event notification from OCSD

In the UCI BAD AlertsOnCampus channel, threatening tweets are combined with on-campus building information and security station information to produce alerts. The threatening tweet in Figure 28 is near the building “Student Center”, so UCI BAD produces a notification to alert people around this building as shown in Figure 31 (labeled 5 in Figure 26). The building information is attached to the notification. There are two security stations nearby, so the system attaches their information with their distances, ordered by their distances to the threatening tweet. Everyone subscribing to the AlertsOnCampus channel will receive this notification.

VII. BAD ISLANDS TOUR AND EVALUATION
To illustrate how BAD applications can be built with BAD islands and to visualize the process of data flowing through multiple systems and becoming notifications for subscribers, we have created three dashboards, one for each organization, based on our prototype, as shown in Figure 32.

{
  "dataverseName": "uci",
  "channelName": "AlertsOnCampus",
  "channelExecutionEpochTime": 1593142024344,
  "results": [
    {
      "result": {
        "buildingInfo": {
          "bid": uuid("82e61d25-43ad-0632-45d0-0ba5366832d9"),
          "name": "Student Center",
          "area": rectangle("33.64811430275051, -117.84332027249145 33.649382536086605,-117.84153928570557")
        },
        "stationDist": [
          {
            "stationInfo": {
              "sid": 1,
              "location": point("33.64792551859947, -117.84013290702327"),
              "name": "Station
            },
            "dist_km": 0.21216259109805177
          },
          {
            "stationInfo": {
              "sid": 0,
              "location": point("33.646866723393266, -117.84170161534618"),
              "name": "Station
            },
            "dist_km": 0.23485382616041114
          }
        ],
        "tweetContent": {
          "text": "Saul Goodman builds SKS, and Todd Alquist fires AK47, but Skyler White sells Cabbage.",
          "area_name": "UCI",
          "location": point("33.64921228736088,-117.84181977473024"),
          "threatening_rating": 2,
          "user_registered_weapon": [ "AR10", "AK47", "GLOCK21" ]
        }
      },
      "channelExecutionTime": datetime("2020-06-26T03:27:04.344Z"),
      "subscriptionId": uuid("82e61d25-0ead-0632-4717-e17b6a912fa6"),
      "deliveryTime": datetime("2020-06-26T03:27:04.345Z")
    }
  ]
}
Fig. 31. The generated on-campus alert from UCI
Fig. 32. An overview of BAD islands dashboards
Due to space limits, instead of describing the features of each dashboard in detail, we will focus on the Visualization Panel of the OCSD Dashboard, shown in Figure 33, to illustrate how threatening tweets go from DHS BAD to OCSD BAD and how OCSD BAD combines threatening tweets with other local information for its subscribers.

The Visualization Panel contains a map that shows the incoming threatening tweets, local events, produced threatening events, and in-field officers’ movements. The map contains a control bar at the top (highlighted in a blue box) so dashboard users can navigate the map, add a new event, and add a new in-field officer. A new event can be added by drawing a circle on the map indicating the event’s area. An officer can be added by dropping an officer icon on a preferred location on the map. Information about the created events and officers is updated in the underlying OCSD BAD system accordingly. An added officer moves around the map randomly and continuously sends its current location to the OCSD BAD system. One can change an officer’s location by dragging the officer’s icon to a new place on the map.

Fig. 33. Visualization panel of OCSD dashboard

We use a red tweet icon to mark the threatening tweets received from DHS BAD and a black tweet icon for the threatening events detected by OCSD BAD. When an in-field officer receives a threatening event notification (as highlighted in the red circles in the figure), the officer randomly decides whether to go to the threatening event’s location for further investigation or to stay at his or her current location. The officer’s decision pops up as a small information window, as shown in the figure. If the officer decides to go, he or she moves gradually towards the tweet’s location, as the upper officer does in the figure.

In addition to the dashboards, we also conducted a simple experiment to measure the tweet propagation delays in our prototype system, starting from the posting of new tweets to the receipt of the localized notifications by subscribers on each island. We deployed the prototype on a three-node cluster, one node per island, where each node had a Dual-Core AMD Opteron Processor 2212 2.0 GHz, 8 GB of RAM, and a 900 GB hard disk drive. We used the statements described in Section VI to configure the nodes.

The information propagation times for BAD islands depend on the complexity of the computations in the pipeline (data enrichment and channel computation) and on the specified channel period durations. In our experiments, we used the same channel period for all three channels, testing two different channel periods (1s and 2s). Since channels execute once per channel period, for each channel execution, we measured the average delay for threatening tweets delivered to subscribers in that channel execution. We let all channels complete 50 executions and kept track of the average delays throughout the process. On DHS BAD, tweets were set to arrive at 10 tweets per second, and half of the tweets contained at least one threatening word.
On OCSD BAD, every threatening tweet had an event nearby. OCSD had 100 in-field officers constantly updating their locations and subscribing to nearby threatening events. On UCI BAD, every threatening tweet was close to an on-campus building. UCI had 5 on-campus security stations and 100 subscriptions subscribing to on-campus alerts. The delays are shown in Figure 34.

[Figure 34 plot: y-axis Average Delay (Seconds), x-axis Channel Execution; series: OCSD - 2 seconds, UCI - 2 seconds, OCSD - 1 second, UCI - 1 second]
Fig. 34. Delays of threatening tweets for OCSD and UCI BAD subscribers
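As a small worked example of what a propagation delay looks like, the Python sketch below computes the end-to-end delay for the single sample tweet traced in Section VI-D, using its "created_at" value (Figure 27) and the "deliveryTime" values in the OCSD and UCI notifications (Figures 30 and 31). It assumes that these two fields are reasonable proxies for the posting time and the subscriber receipt time; the measurement code used in the experiment is not shown in the paper, and the sample trip is not one of the measured runs. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime without float rounding."""
    return EPOCH + timedelta(milliseconds=ms)


def parse_datetime(text: str) -> datetime:
    """Parse the ISO-8601 string found inside a datetime("...") value (UTC, 'Z' suffix)."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def delay_ms(created_at_ms: int, delivery_time: str) -> int:
    """Milliseconds between tweet creation and notification delivery."""
    delta = parse_datetime(delivery_time) - from_epoch_ms(created_at_ms)
    return delta // timedelta(milliseconds=1)


created_at = 1593142018123                       # "created_at" in Figure 27
ocsd_delivery = "2020-06-26T03:27:00.437Z"       # "deliveryTime" in Figure 30
uci_delivery = "2020-06-26T03:27:04.345Z"        # "deliveryTime" in Figure 31

print(from_epoch_ms(created_at).isoformat(timespec="milliseconds"))
print(delay_ms(created_at, ocsd_delivery), delay_ms(created_at, uci_delivery))
```

For this one tweet the sketch gives 2314 ms for OCSD and 6222 ms for UCI. The experiment reports, per channel execution, the average of such delays over all threatening tweets delivered in that execution.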
Clearly, the subscribers at OCSD and UCI are able to receive localized threatening tweets of interest in a timely manner, and the delays are relatively stable, especially for the 1s channel period. When the channel period is increased to 2s, the delays increase since the system batches more incoming tweets per channel execution; while this increases the delay for subscribers, it also increases the system’s scalability under higher loads. UCI BAD in general has higher delays than OCSD BAD due to a more complex computation and more local information added to local threatening tweets.

VIII. CONCLUSIONS
In this work, we have focused on enabling users to declaratively create scalable data sharing services between different BAD systems. We looked at an example use case in which two local organizations (OCSD and UCI) would like to get data from a third organization (DHS) in order to provide BAD services to their subscribers. We discussed several possible ways of supporting this use case and proposed using data feeds and data channels for bridging BAD systems. We extended the BAD system with BAD brokers to simplify data exchanges between channels and feeds and with BAD feeds to help users create bridges between different BAD systems. We detailed a three-island prototype to show how BAD islands can be bridged together. We demonstrated how users can easily build such systems with declarative statements, and we used an example to show how data and events flow within the system. We built a set of dashboards based on our prototype to concretely illustrate how BAD islands share data and support BAD applications with localized information, and we conducted an experiment to examine the delays in the prototype system.

ACKNOWLEDGMENT
This research was partially supported by NSF grants IIS-1447826, IIS-1447720, IIS-1838222, IIS-1838248, CNS-1924694, and CNS-1925610.