The Webdamlog System
Managing Distributed Knowledge on the Web ∗

arXiv [cs.DB]

Serge Abiteboul, Inria Saclay & ENS Cachan, France
Émilien Antoine, Inria Saclay & ENS Cachan, France
Julia Stoyanovich, Drexel University, USA, and Skoltech
We study the use of WebdamLog, a declarative high-level language in the style of datalog, to support the distribution of both data and knowledge (i.e., programs) over a network of autonomous peers. The main novelty of WebdamLog compared to datalog is its use of delegation, that is, the ability for a peer to communicate a program to another peer. We present results of a user study, showing that users can write WebdamLog programs quickly and correctly, and with a minimal amount of training. We present an implementation of the WebdamLog inference engine relying on the Bud datalog engine. We describe an experimental evaluation of the WebdamLog engine, demonstrating that WebdamLog can be implemented efficiently. We conclude with a discussion of ongoing and future work.
A number of works have argued for developing a holistic approach to distributed content management, e.g., P2P Content Warehouse [1], Dataspaces [11] and Data rings [6]. The goal is to facilitate the collaboration of autonomous peers towards solving content management tasks. Such situations arise for instance in personal information management (PIM), which is often given as an important motivating example [11]. In [6], the authors argued for founding such data exchange on declarative languages, to facilitate the design of applications, notably by non-technical users.

In the present work, we propose an approach for managing data and knowledge (i.e., programs) over a network of autonomous peers. From a system viewpoint, the different actors are autonomous and heterogeneous in the style of P2P [6, 11]. However, we do not see the system we developed as an alternative to existing network services such as Facebook or Flickr. Rather, we view our system as the means of seamlessly managing distributed knowledge residing in any of these services, as well as in a wide variety of systems managing personal or social data.

Our system uses the WebdamLog language [4], a declarative high-level language in the style of datalog, to support the distribution of both data and knowledge (i.e., programs) over a network of autonomous peers. In recent years, there has been renewed interest in using languages in the datalog family in a large range of applications, from program analysis, to security and privacy protocols, to natural language processing, to multi-player games. The arguments in favor of datalog-style languages are familiar ones: a declarative approach alleviates the conceptual complexity for the user, while at the same time allowing for powerful performance optimizations on the part of the system.

WebdamLog is a datalog-style language that emphasizes cooperation between autonomous peers communicating in an asynchronous manner. The WebdamLog language extends datalog in a number of ways, supporting updates [9], distribution [3], negation [12], and, importantly, a novel feature called delegation [4]. As a result, WebdamLog is neither as simple nor as beautiful as datalog. It is also more procedural, which is needed to capture real Web applications with the peers' knowledge evolving over time.

We illustrate by example (Section 2) that the language (formally recalled in Section 3) is indeed well adapted to specifying realistic distributed content management tasks, notably in PIM. Our technical contributions are described in the following sections:

• We present results of a user study, showing that users can write WebdamLog programs quickly and correctly, and with a minimal amount of training (Section 4).
• We present an implementation of the WebdamLog engine relying on the Bud datalog engine (Section 5). Our implementation supports novel linguistic features such as peer and predicate variables and rule delegation.
• We describe an experimental evaluation of the WebdamLog engine (Section 6).

We discuss related work in Section 7, outline future research directions and conclude in Section 8.

∗ This work has been partially funded by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013); ERC grant Webdam, agreement 226513. http://webdam.inria.fr/
Suppose that Alice and Bob are getting married, and their friends want to offer them an album of photos in which the bride and groom appear together. Such photos may be owned by friends and family members of Alice and Bob. Owners of the photos may store them on a variety of services and devices, including, e.g., desktop computers, smartphones, Picasa, and Flickr.

Making a photo album for Alice and Bob involves the following steps: (1) Identify friends of Alice and Bob using Facebook and Google+; (2) Find out where each friend keeps his/her photos and how to access them; (3) From among all photos that are obtained, select those that feature both Alice and Bob, using, e.g., tags or face recognition software; and (4) Ask Sue, a friend of Alice, to verify that the selected photos are appropriate for the photo album and to possibly exclude some from this album.

As should be clear from the example, such a task would be much more manageable if it were executed automatically. Its execution involves a certain amount of simple reasoning on the part of the system, which can be naturally specified with declarative rules. For example, for Step (1), the following WebdamLog rule computes the union of Alice's and Bob's Facebook contacts in a relation allFriends on Sue's peer:

  [rule at sue]
  allFriends@sue($name) :- friends@aliceFB($name)
  allFriends@sue($name) :- friends@bobFB($name)

using wrappers to Facebook for Alice and Bob.

In general, a peer name such as aliceFB or sueIPhone denotes a system or a device associated to a particular URL. Also, a relation name such as friends or contacts denotes the name of a relation or a service on the corresponding system/device.

For simplicity, we assume that a person's name, e.g. alice, corresponds to the name of the peer that the particular friend uses as entry point to the Webdam system. (This name is thus associated to a particular URL.) We assume that each such peer keeps localization data for the corresponding person. For instance, relation photoLocation in that peer tells where (i.e., at which peers) this person keeps her photos. The following rule, at peer sue, delegates Steps (2) and (3) of the photo album task to the peers corresponding to her friends:

  [rule at sue]
  album@sue($photo,$name) :-
    allFriends@sue($name),
    photoLocation@$name($peer),
    photos@$peer($photo),
    features@$peer($photo,alice),
    features@$peer($photo,bob)

The key feature of this rule is the use of the WebdamLog language to share the work. Let Dan be a friend, and so a possible source. Then Sue's peer will delegate the following rule to Dan's peer:

  [rule at dan]
  album@sue($photo,dan) :-
    photoLocation@dan($peer),
    photos@$peer($photo),
    features@$peer($photo,alice),
    features@$peer($photo,bob)

Now suppose that Dan uses both Picasa and Flickr. Then, Dan's peer will delegate to danPicasa (a wrapper for Dan's account on Picasa) the following rule:

  [rule at danPicasa]
  album@sue($photo,dan) :-
    photos@danPicasa($photo),
    features@danPicasa($photo,alice),
    features@danPicasa($photo,bob)

and similarly for Flickr.

Note how the tasks are automatically shared by many peers. Observe that when new friends of Alice or Bob are discovered (e.g., proposed by some known friends), Sue's album, which is defined intentionally, is automatically updated. Observe also that, to simplify, we assume here that all peers use a similar organization (ontology). This constraint may easily be removed at the cost of slightly more complicated rules.

Now consider Step (4) in the photo album task. Sue may decide, for instance, that photos of the couple from Dave's Flickr stream are inappropriate, and that Dave should be excluded from the set of sources. Such manual curation by Sue may be accomplished by modifying the definition of allFriends:

  [rule at sue]
  allFriends@sue($name) :- friends@aliceFB($name),
    not blocked@sue($name)
  allFriends@sue($name) :- friends@bobFB($name),
    not blocked@sue($name)
By inserting/removing facts in blocked@sue, Sue now controls who can participate. A similar control can also be added at the photo or photo location level.

Observe that updates result in modifying the programs running at the participating peers. For instance, the sets of rules at the various peers evolve, controlled by Sue's updates as well as by the discovery of new friends of Alice or Bob, and of new sources of photos. Consequently, the album evolves as well.

We will use the example of this section throughout the paper to demonstrate the salient features of our approach.

In this section, we briefly recall the language WebdamLog from [4].

We assume the existence of a countable set of variables and of a countable set of data values that includes a set of relation names and a set of peer names. (Relation and peer names are part of the data.) Variables start with the symbol $, e.g. $x.

Schema. A relation in our context is an expression m@p where m is a relation name and p a peer name. A schema is an expression (π, E, I, σ) where π is a possibly infinite set of peer names, E is a set of extensional relations of the form m@p for p ∈ π, I is a set of intentional relations of the form m@p for p ∈ π, and σ, the sorting function, specifies for each relation m@p an integer σ(m@p) that is its sort. A relation cannot be at the same time intentional and extensional.

Facts. A fact is an expression of the form m@p(a_1, ..., a_n), where n = σ(m@p) and a_1, ..., a_n are data values. An example of a fact is:

  pictures@myalbum( .jpg, "Timbuktu", / / )

Rules. A term is a constant or a variable. A rule in a peer p is an expression of the form:

  [at p] $R@$P($U) :- (¬) $R_1@$P_1($U_1), ..., (¬) $R_n@$P_n($U_n)

where $R, $R_i are relation terms, $P, $P_i are peer terms, and $U, $U_i are vectors of terms. The following safety condition is imposed: $R and $P must appear positively bound in the body, and each variable occurring in a negative literal must also appear positively bound in the body. In addition, rules are evaluated from left to right, and each peer name $P_i must be positively bound in a previous atom.

Semantics.
At a particular point in time, each peer p has a state consisting of some facts, some rules specified locally, and possibly of some rules that have been delegated to p by other peers. Peers evolve by updating their base of facts, by sending facts to other peers, and by updating their delegations to other peers. So, both the set of facts and the set of delegated rules evolve over time. (To simplify, we follow [4] in assuming that the set of rules specified locally is fixed.)

The semantics of a rule with head m@p(u) in a peer p′ depends on the nature of the relation in its head: whether it is extensional (m@p in E) or intentional (m@p in I), and whether it is local (p = p′) or not. We first consider rules in which all relations occurring in the body are local; we call such rules local rules. A subtlety lies in the use of variables for peer names. The nature of a rule may depend on the instantiation of its variables, i.e., one instantiation of a particular rule may be local, whereas another may not be.

We distinguish 5 cases, each identified by a letter, into which we classify the rules.

A. Local rule with local intentional head (datalog). These rules define local intentional predicates, as in classic datalog.

B. Local rule with local extensional head (local database updates). Facts derived by this kind of rule are inserted into the local database. Note that, by default, as in Dedalus [9], facts are not persistent. To have them persist, we use rules of the form m@p(U) :- m@p(U). Deletion can be captured by controlling the persistence of facts.

The two previous kinds of rules, containing only predicates of the local peer, do not require network communication, and are not affected by problems due to asynchronicity of the network.

C. Local rule with non-local extensional head (messaging). Facts derived by rules of this kind are sent to other peers. For example, the rule:

  [at mi] $m@$p($name, "Happy birthday!") :- today@mi($date), birthday@mi($name, $m, $p, $date)

where mi stands for my iPhone, results in sending a Happy Birthday message to a contact on the day of his birthday. Observe that the name $p of the peer and the name $m of the message vary depending on the person.

D. Local rule with non-local intentional head (view delegation). Such a rule results in installing a view remotely. For instance, the rule

  [at mi] boyMeetsGirl@gossipsite($girl, $boy) :- girls@mi($girl, $loc), boys@mi($boy, $loc)

installs a join of two mi relations at gossipsite.

Finally, we consider non-local rules.

E. Non-local (general delegation). Consider the rule

  [at mi] boyMeetsGirl@gossipsite($girl, $boy) :- girls@mi($girl, $loc), boys@ai($boy, $loc)

where ai stands for Alice's iPhone. This results in installing, at ai, a view t_{r@mi}@ai and a rule, defined as follows:

  [at mi] t_{r@mi}@ai($girl, $loc) :- girls@mi($girl, $loc)
  [at ai] boyMeetsGirl@gossipsite($girl, $boy) :- t_{r@mi}@ai($girl, $loc), boys@ai($boy, $loc)

Note that both rules are now local. Note also that, when girls@mi changes, this modifies the view at Alice's iPhone, possibly changing the semantics of boyMeetsGirl@gossipsite.

In [4], we formally define the semantics of WebdamLog. We show that, unless all peers and programs are known in advance, delegation strictly increases the expressive power of the model. If they are known in advance, delegation does not bring any extra power. Of course, delegation is also useful in practice, because it enables obtaining logic (rules) from other sites, and deploying logic (rules) to other sites. Conditions for systems to be deterministic are shown in [4], and are extremely restrictive. Even in the absence of negation, a WebdamLog system will typically not be deterministic because of asynchronicity.
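The five cases above can be told apart mechanically once the peer and relation variables of a rule have been instantiated. The following Python sketch illustrates this classification. It is only an illustration: the `Atom` class, the `classify` function and the `intentional` set are hypothetical names introduced here, not part of the WebdamLog implementation, and every relation not listed as intentional is assumed to be extensional.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class Atom:
    relation: str  # relation name, e.g. "girls"
    peer: str      # peer name, e.g. "mi"


def classify(at_peer: str, head: Atom, body: Iterable[Atom],
             intentional: Set[Tuple[str, str]]) -> str:
    """Return the case letter (A to E) of a fully instantiated rule at `at_peer`.

    `intentional` lists the (relation, peer) pairs declared intentional
    (an assumption of this sketch); all other relations are extensional.
    """
    # Case E: some body atom lives at another peer, so the rule is non-local.
    if any(atom.peer != at_peer for atom in body):
        return "E"
    head_is_local = head.peer == at_peer
    head_is_intentional = (head.relation, head.peer) in intentional
    if head_is_local:
        return "A" if head_is_intentional else "B"
    return "D" if head_is_intentional else "C"


# The two boyMeetsGirl rules from the text, with boyMeetsGirl@gossipsite intentional.
views = {("boyMeetsGirl", "gossipsite")}
head = Atom("boyMeetsGirl", "gossipsite")
case_d = classify("mi", head, [Atom("girls", "mi"), Atom("boys", "mi")], views)
case_e = classify("mi", head, [Atom("girls", "mi"), Atom("boys", "ai")], views)
```

Because the classification is applied to an instantiated rule, the same rule text may fall into different cases for different bindings of its peer variables, which is the subtlety noted above.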
We argued in the introduction that WebdamLog can be used to declaratively specify distributed tasks in a variety of applications, including personal data management. We conducted a user study to demonstrate the usability of WebdamLog in this particular domain.

Participants. We recruited 27 participants for the user study. We present a break-down of results by two groups.

Group 1 consisted of 16 participants with training in Computer Science. Among them, 5 had basic database background, and 4 were familiar with advanced database concepts, including datalog. The group had the following break-down by highest completed education level: 2 high school, 3 BS, 9 MS, and 2 PhD.

Group 2 consisted of 11 participants with no CS training, and with the following break-down by highest completed education level: 3 vocational school, 6 BS, 2 MS.

Study design. All participants were given a brief tutorial in which basic features of WebdamLog were explained informally, and demonstrated through examples. The tutorial took 15-20 minutes for Group 1 and 25 minutes for Group 2. Following the tutorial, all participants were asked to take a written test. The test consisted of three problems that tested comprehension of different features of WebdamLog, including local and non-local rules, rules with variable relation and peer names, and delegation. In the tutorial and the test, we did not make an explicit distinction between intentional and extensional relations, and we ignored recursion.

The user study test had the following contents, reproduced here literally, apart from formatting.
Problem 1. Consider the following relations and facts.

  schema: songs(fileName,content) // the same at all peers
  songs@lastFM("song1.mp3", "...")
  songs@lastFM("song2.mp3", "...")
  songs@lastFM("song3.mp3", "...")
  songs@pandora("song4.mp3", "...")
  songs@pandora("song5.mp3", "...")

1. Write one or several rules that copy all songs from lastFM and Pandora into relation songs at peer myLaptop.
2. Suppose now that relation peers@myLaptop contains names of peers on which to look for music. You can assume that each peer stores songs in a relation called songs, with the same schema as above. Write a WebdamLog program that copies songs from all peers into songs@myLaptop.
3. Write a rule that copies songs from songs@myLaptop into the songs relation on each peer whose name is listed in peers@myLaptop.

Problem 2. Consider the following relations and facts.

  schema: friends(friendName)
          photos(fileName,content)
          inPhoto(fileName, friendName)
  friends@facebook("ann")
  friends@facebook("sue")
  friends@facebook("zoe")
  photos@ann("sunset.jpg", "...")
  photos@ann("vacation.jpg", "...")
  photos@ann("party.jpg", "...")
  photos@sue("image1.jpg","...")
  photos@sue("image2.jpg","...")
  inPhoto@ann("vacation.jpg", "jane")
  inPhoto@ann("vacation.jpg", "ann")
  inPhoto@ann("party.jpg", "jane")
  inPhoto@ann("party.jpg", "zoe")
  inPhoto@ann("party.jpg", "sue")
  inPhoto@sue("image2.jpg", "sue")
  inPhoto@sue("image2.jpg", "jane")

Assume that photos and inPhoto relations at all peers have the same schema. Consider now the following WebdamLog rule.

  photos@myLaptop($X,$Z) :- friends@facebook($Y),
    photos@$Y($X,$Z), inPhoto@$Y($X,"jane")

1. Explain in words what this rule computes.
2. List the facts that are in photos@myLaptop after the rule above is executed.
3. List the facts that are in photos@myLaptop if the following rule is executed instead:

  photos@myLaptop($X,$Z) :- friends@facebook($Y),
    photos@$Y($X,$Z), inPhoto@$Y($X,"jane"),
    inPhoto@$Y($X,"sue")

Problem 3. Recall the example from the tutorial, in which we looked at subscribing the peer myLaptop to CNN news. This example is reproduced below.

  schema: news@cnn(text)
          news@myLaptop(source, text)
          subscribers@cnn(peer)
  news@cnn("US Olympic gold")
  news@cnn("Higgs boson seen in action")
  subscribers@cnn("myLaptop")
  [at cnn] news@$X("cnn", $Y) :- subscribers@cnn($X),
    news@cnn($Y)

Suppose that you would now like to receive CNN news on peer myPhone, and to store them in relation news, with the schema source,text. Describe at least 1 method for doing this. You may assume that you can add rules at peers cnn, myLaptop and myPhone, and that you can insert facts into relations on any of these peers.

Results.
The results of the study were very encouraging.

Group 1. On Problem 1, 3 participants received a score of 2.5 out of 3, while 13 participants received a perfect score. All participants received a perfect score on Problem 2. Problem 3 was open-ended, and all participants gave at least one correct answer. 4 participants gave 3 correct answers, 4 gave 2 correct answers (2 of these also gave 1 incorrect answer each), and the remaining 8 participants each gave 1 correct answer.

We also asked participants to record how long it took them to answer each problem, in minutes. Problem 1 took between 2.5 and 6 minutes, Problem 2 between 2 and 9 minutes, and Problem 3 between 1 and 8 minutes. We did not observe any correlation between the time it took to answer questions and the participant background in data management or even datalog.

Group 2. On Problem 1, the average score was 2.3, with the following break-down: 6 participants received a perfect score, 3 received 2 out of 3, 1 had a score of 1, and 2 were not able to solve the problem. On Problem 2, 10 participants received a perfect score and 1 got a score of 2 out of 3. On Problem 3, 1 gave 5 good answers, 6 gave 3 good answers, 3 gave 2 good answers, and 2 gave no correct answer. The same two participants failed to answer Problems 1 and 3.

The test took longer for the participants without CS training. Problem 1 took between 6 and 8 minutes to solve in this group, Problem 2 took between 5 and 8 minutes, and Problem 3 took between 4 and 12 minutes.

Figure 1: WebdamLog engine in a full Webdam peer

Remark. We considered alternative ways in which a user can interact with a WebdamLog system. We are currently developing an interface in which users will be able to write WebdamLog programs, but will also have access to customizable canned queries implementing common functionality. A SQL-based approach is not a natural choice, since SQL does not accommodate distribution, which is central to WebdamLog.

In summary, all technical and the majority of non-technical participants of our study were able to both understand and write WebdamLog programs correctly, with a minimal amount of training. We observed a difference between the technical and non-technical groups in terms of both correctness and time to solution. Two members of the non-technical group were able to understand WebdamLog programs but were not able to write programs on their own. We believe that this issue will be alleviated once an appropriate user interface becomes available.
In this section, we describe the architecture of the WebdamLog system. We describe the implementation of the system, stressing the novel features compared to standard datalog engines.

Figure 1 shows the architecture of a WebdamLog peer. Facts and rules are stored in a persistent store. The WebdamLog engine, described in greater detail in the remainder of this section, retrieves these facts and rules to process updates and answer queries coming from the top layers. The Security module provides facilities for standard access control mechanisms such as encryption, signatures and other authentication protocols. The Communication module is responsible for exchanging facts and rules with other peers.

Datalog evaluation has been intensively studied, and several open-source implementations are available. We chose not to implement yet another datalog engine, but instead to extend an existing one. In particular, we considered two open-source systems that are currently being supported, namely, Bud [21] from UC Berkeley and IRIS [19] from the University of Innsbruck. The IRIS system is implemented in Java and supports the main strategies for efficient evaluation of standard local datalog. The Bud system is implemented in the Ruby scripting language, and initially seemed less promising from a performance viewpoint. However, Bud provides mechanisms for asynchronous communication between peers, an essential feature for WebdamLog. In the absence of a real performance comparison, the choice was not easy. We finally decided in favor of Bud, both because of its support for asynchronous communication, and because its scalability has been demonstrated in real-life scenarios such as Internet routing.
The Bud system supports a powerful datalog-like language introduced in [8]. Indeed, we see Bud (and use it) as a distributed datalog engine with updates and asynchronous communications.

A WebdamLog computation consists semantically of a sequence of stages, with each stage involving a single peer. Each stage of a WebdamLog peer computation is in turn performed by a three-step Bud computation, described next. Note that we use the word stage for WebdamLog and step for Bud:

  ...   Stage at peer p            Stage at peer q            ...
        Step 1  Step 2  Step 3     Step 1  Step 2  Step 3          (1)

The 3 steps of a WebdamLog stage are as follows:

1. Inputs are collected, including input messages from other peers, clock interrupts and host language calls.
2. Time is frozen; the union of the local store and of the batch of events received since the last stage is taken as an extensional database, and a Bud program is run to fixpoint.
3. Outputs are obtained as side effects of the program, including output messages to other peers, updates to the local store, and host language callbacks.

Observe that a fixpoint computation is performed at Step 2 by the local datalog engine (namely the Bud engine). This computation is based on a fixed program with no deletion over a fixed set of extensional relations. In Step 3, deletion messages may be produced, along with updates to the set of rules and to the set of extensional relations (for different reasons, which we will explain further). Note that all this occurs outside the datalog fixpoint computation.

Relations appearing in the rules are implemented as Bud collections. Bud distinguishes between three kinds of key-value sets:

1. A table keeps a fact until an explicit delete order is received. We use tables to support WebdamLog extensional relations.
2. A scratch is used for storing results of intermediate computation. We use scratch collections to implement WebdamLog local intentional relations. It is emptied at Step 1 and receives facts during fixpoint computation at Step 2.
3. A channel provides support for asynchronous communications. It records facts that have to be sent to other peers. We use channels for this purpose, and in particular for messages related to installing or removing delegations.

As in WebdamLog, facts in a peer are consumed by the engine at each firing of the peer (each stage). To make facts persistent, they have to be re-derived by the peer at each stage. This is captured in our implementation by assuming that rules re-derive extensional facts implicitly, unless a deletion message has been received.

We observe a subtle point that led us to not fully adopt the original semantics of WebdamLog, as described in [4]. There, we assumed for simplicity that messages are transmitted instantaneously. This assumption is not realistic in practice, and does not hold in our implementation. Since communications are asynchronous, there is no guarantee in WebdamLog as to when a fact written to a channel will be received by a remote peer.
We now describe how WebdamLog rules are implemented on top of Bud. We distinguish between 4 cases. This brings us to revisit the semantics of WebdamLog (from Section 3) with a focus on implementation. As in Section 3, whether a rule in a peer p is local (i.e., all relations occurring in the rule body are p-relations) plays an important role. The last case (Case F) focuses on the use of variables for relation and peer names. For the first 3 cases, we ignore such variables.

A-B-C. Simple local rules. In these cases, i.e., local rules with either an extensional relation or a local intentional relation in the head, WebdamLog rules can be directly supported by identical Bud rules. (This takes care of local deduction as in datalog (A), messages for local updates (B) and messages to other peers (C).)
D. Local with non-local intentional head. From an implementation viewpoint, this case is trickier. We illustrate it with an example. Consider an intentional relation s@q defined in the distributed setting by the following two rules:

  [at p1] s@q(X, Y) :- r1@p1(X, Y)
  [at p2] s@q(X, Y) :- r2@p2(X, Y)

Intuitively, the two rules specify a view relation s@q at q that is the union of two relations r1@p1 and r2@p2 from peers p1 and p2, respectively. Consider a possible naive implementation that would consist in materializing relation s at q, and having p1 and p2 send update messages to q. Now suppose that a tuple ⟨a, b⟩ is in both r1@p1 and r2@p2. Then it is correctly in s@q. Now suppose that this tuple is deleted from r1@p1. Then a deletion message is sent to q, resulting in wrongly deleting the fact from s@q.

The problem arises because the tuple ⟨a, b⟩ originally had two reasons to be in s, and only one of the reasons disappeared. To avoid this problem, we could use the provenance of the fact ⟨a, b⟩ in s@q.

A general approach for tracking provenance in our setting, and for using it as a basis for performance optimizations, is part of ongoing work, and is outlined in Section 5.5. For now, we can implement the following Bud rules at p1, p2 and q to correctly support the two rules:

  [at p1] s1@q(X, Y) :- r1@p1(X, Y)
  [at p2] s2@q(X, Y) :- r2@p2(X, Y)
  [at q] s@q(X, Y) :- s1@q(X, Y)
  [at q] s@q(X, Y) :- s2@q(X, Y)

Note that relations s1 and s2 may be either intentional, in which case the view is computed on demand, or extensional, in which case the view is materialized.

E. Non-local rules. We consider non-local rules with extensional head. (Non-local rules with intentional head are treated similarly.) An example of such a rule is:

  [at p] r@q(X) :- r_1@q_1(X_1), ..., r_i@q_i(X_i), ...

with q_1 = ... = q_{i-1} = p, q_i = q′ ≠ p, and with each X_j denoting a tuple of terms. If we consider atoms in the body from left to right, we can process the rule at p until we reach r_i@q′(X_i). Peer p does not know how to evaluate this atom, but it knows that the atom is in the realm of q′. Therefore, peer p rewrites the rule into two rules, as specified by the formal definition of delegation in WebdamLog:

  [at p] mid@q′(X_mid) :- r_1@p(X_1), ..., r_{i-1}@p(X_{i-1})
  [at q′] r@q(X) :- mid@q′(X_mid), r_i@q′(X_i), ...

where mid identifies the message, and notably encodes (i) the identifier of the original rule, (ii) that the rule was delegated by p to q′, and (iii) the split position in the original rule. The tuple X_mid includes the variables that are needed for the evaluation of the second part of the rule, or for the head. Observe that the first rule (at p) is now local. If the second rule, installed at q′, is also local, no further rewriting is needed. Otherwise, a new rewriting happens, again splitting the rule at q′, delegating the second part of the rule as appropriate, and so on.

Observe that an evolution of the state of p may result in installing new rules at q′, or in removing some delegations. Deletion of a delegation is simply captured by updating the predicate guarding the rule. Insertion of a new delegation modifies the program at q′. Note that in Bud the program of a peer is fixed, and so adding and removing delegations is a novel feature in WebdamLog. Implementing this feature requires us to modify the Bud program of a peer. This happens during Step 1 of the WebdamLog stage.
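The rewriting of a non-local rule can be illustrated with a short Python sketch that splits a rule at its first non-local body atom. This is a sketch under assumptions rather than the engine's code: peer and relation names are taken to be constants (as in cases A to E), variables are strings starting with `$`, and the `Atom`, `Rule`, `variables` and `split_rule` names, as well as the naming scheme of the intermediate relation, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Atom:
    relation: str
    peer: str
    args: Tuple[str, ...]  # terms; variables start with "$"


@dataclass(frozen=True)
class Rule:
    at: str                 # peer where the rule is installed
    head: Atom
    body: Tuple[Atom, ...]


def variables(atoms) -> Tuple[str, ...]:
    """Variables of the given atoms, in order of first occurrence."""
    seen = dict.fromkeys(
        a for atom in atoms for a in atom.args if a.startswith("$"))
    return tuple(seen)


def split_rule(rule: Rule, rule_id: str) -> Optional[Tuple[Rule, Rule]]:
    """Split `rule` at its first non-local body atom.

    Returns (local_rule, delegated_rule), or None if the body is fully local.
    """
    for i, atom in enumerate(rule.body):
        if atom.peer != rule.at:
            break
    else:
        return None  # every body atom is local: nothing to delegate
    prefix, suffix = rule.body[:i], rule.body[i:]
    remote = suffix[0].peer
    # Keep only the variables bound so far that the rest of the rule needs.
    needed = set(variables(suffix + (rule.head,)))
    mid_args = tuple(v for v in variables(prefix) if v in needed)
    # The name encodes the rule identifier, the delegating peer and the
    # split position (hypothetical naming scheme).
    mid = Atom(f"mid_{rule_id}_{rule.at}_{i}", remote, mid_args)
    return Rule(rule.at, mid, prefix), Rule(remote, rule.head, (mid,) + suffix)


# The general delegation example of Section 3.
rule = Rule(
    at="mi",
    head=Atom("boyMeetsGirl", "gossipsite", ("$girl", "$boy")),
    body=(Atom("girls", "mi", ("$girl", "$loc")),
          Atom("boys", "ai", ("$boy", "$loc"))),
)
local_rule, delegated_rule = split_rule(rule, "r1")
```

On this example the sketch yields a local rule at `mi` that feeds an intermediate relation at `ai`, and a delegated rule at `ai` whose body is entirely local, so a second call to `split_rule` on the delegated rule finds nothing more to split.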
F. Relation and peer variables.
Finally, we consider re-lation and peer variables. In all cases presented so far,
WebdamLog rules could be compiled statically into Bud rules. This is no longer possible in this last case. To see this, consider an atom in the body of a rule. Observe that, if the peer name in this atom is a variable, then the system cannot tell before the variable is instantiated whether the rule is local or not. Also, observe that, if the relation name in this atom is a variable, then the system cannot know whether that relation already exists or not. In general, we cannot compile a WebdamLog rule into Bud until all peer and relation variables are instantiated.

To illustrate this situation more precisely, consider a rule of the form:

r@p(X) :- r@p($X), ..., $X@p(Xi), ...

where r@p is extensional and $X is a variable. This particular rule is relatively simple since, no matter how the variable is instantiated, the rule falls into the simple case B. However, it is not a Bud rule because of the variable relation name $X. Note that WebdamLog rules are evaluated from left to right, and a constraint is that each relation and peer variable must be bound in a previous atom. (This constraint is imposed by the language.) Therefore, when we reach the atom $X@p(Xi), the variable $X has been instantiated.

To evaluate this rule, we use two WebdamLog stages of the peer. In the first stage, we bind $X with values found by instantiating r@p($X). Suppose that we find two values for $X, say t1 and t2. We always wait for the next stage to introduce new rules (there are two new rules in this case). More precisely, new rules are introduced during Step 1 of the WebdamLog computation of the next stage. In the example, the following rules are added to the Bud program at p:

r@p(X) :- t1@p(Xi), ...
r@p(X) :- t2@p(Xi), ...

Observe that, even in the absence of delegation, having variable relation and peer names allows the WebdamLog engine to produce new rules at run time, possibly leading to the creation of new relations. This is a distinguishing feature of our approach, and is novel to WebdamLog and to our implementation.

This example uses a relation name variable. Peer name variables are treated similarly. Observe that having a peer name variable, and instantiating it to thousands of peer names, allows us to delegate a rule to thousands of peers. This makes distributing computation very easy from the point of view of the user, but also underscores the need for powerful security mechanisms. Developing such mechanisms is in our immediate plans for future work.
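The two-stage treatment of relation variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine's code: the `Atom`, `Peer`, and `instantiate` names are hypothetical, a rule is modeled as a plain (head, body) pair, and the relation and peer names simply mirror the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    relation: str  # a relation name, or a relation variable such as "$X"
    peer: str
    args: tuple


def instantiate(head, body, var, facts):
    """Return one concrete (head, body) rule per value bound to `var`.

    body[0] is the atom that binds `var`; `facts` are the tuples currently
    stored in that atom's relation.
    """
    binder, rest = body[0], body[1:]
    position = binder.args.index(var)
    rules = []
    for value in sorted({fact[position] for fact in facts}):
        new_body = tuple(
            Atom(value if atom.relation == var else atom.relation,
                 atom.peer, atom.args)
            for atom in rest
        )
        rules.append((head, new_body))
    return rules


class Peer:
    """Hypothetical stand-in for a peer; not the WebdamLog engine."""

    def __init__(self):
        self.relations = {}  # relation name -> set of tuples
        self.templates = []  # (head, body, var) rules with a relation variable
        self.rules = []      # fully instantiated rules, ready for the engine
        self.pending = []    # rules found in this stage, installed in the next

    def stage(self):
        # Step 1: install the rules produced during the previous stage.
        for rule in self.pending:
            if rule not in self.rules:
                self.rules.append(rule)
        self.pending = []
        # Later in the stage: bind relation variables; results wait one stage.
        for head, body, var in self.templates:
            facts = self.relations.get(body[0].relation, set())
            self.pending.extend(instantiate(head, body, var, facts))


p = Peer()
p.relations["r"] = {("t1",), ("t2",)}
p.templates.append((
    Atom("r", "p", ("X",)),
    (Atom("r", "p", ("$X",)), Atom("$X", "p", ("Xi",))),
    "$X",
))
p.stage()  # stage 1: $X is bound to t1 and t2, nothing is installed yet
rules_after_stage_1 = list(p.rules)
p.stage()  # stage 2: the two new rules are installed
installed = [body[0].relation for _head, body in p.rules]
```

After the first call to `stage()` no rule is installed; after the second, the peer holds one rule over `t1@p` and one over `t2@p`, matching the two rules listed in the example.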
The Bud engine evaluates the fixpoint using the semi-naive algorithm, i.e., Bud saturates one stratum after another according to a stratification given by the dependency graph. The dependency graph is a directed hyper-graph with relations as nodes, and with a hyper-edge from relations s_i to relation r if there is a rule in which all s_i appear in the body and r appears in the head. Since this is classic material, we omit the details but observe that, since WebdamLog rules may be added or removed at run-time, the program evolves, leading to changes in the dependency graph. Therefore, the dependency graph is recomputed at step 1 of a WebdamLog stage when receiving new rules, and remains fixed for the following step 2. The WebdamLog engine pushes the differentiation technique that serves as the basis of the semi-naive algorithm further.

Although, according to WebdamLog semantics, facts are consumed and possibly re-derived, it would be inefficient to recompute the proof of existence of all facts at each stage. Between two consecutive stages, each relation keeps a cache of its previous contents. This cache may be invalidated by WebdamLog if a newly installed rule creates a new dependency for this relation. Note that Bud already performs cache invalidation propagation for facts, which we adapt to fit WebdamLog semantics. This incremental optimization across stages allows us to run the fixpoint computation only on the relations that may have changed since the previous stage.

A WebdamLog system executes in a highly dynamic environment, where peer state frequently changes, in terms of both data and program, and where peers may come and go. This is a strong departure from datalog-based systems such as Bud that assume the set of peers and rules to be fixed. As part of our ongoing work, we are focusing on efficiently supporting dynamic changes in peer state, with the help of a novel kind of provenance graph.

We use provenance graphs to record the derivations of WebdamLog facts and rules, and to capture fine-grained dependencies between facts, rules, and peers. We build on the formalism proposed in [13], where each tuple in the database is annotated with an element of a provenance semiring, and annotations are propagated through query evaluation. Provenance can be used for a number of purposes such as explaining query results or system behavior, and for debugging. Our primary use of provenance is to optimize performance of WebdamLog evaluation in the presence of deletions. We are also currently investigating the use of provenance for enforcing access control and for detecting access control violations.
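The text does not spell out the structure of its provenance graph, so the following Python sketch is purely hypothetical. It illustrates one simple way recorded derivations can help with deletions: each derived fact keeps the sets of facts that support it, and a deletion removes only the derivations that used the deleted fact. The `ProvenanceGraph` class and its methods are assumptions made for this example, and the sketch assumes derivations are acyclic (facts that only support each other in a cycle would not be cleaned up).

```python
from collections import defaultdict


class ProvenanceGraph:
    """Records, for each derived fact, the sets of facts that support it."""

    def __init__(self):
        self.base = set()                    # extensional facts
        self.derivations = defaultdict(set)  # fact -> {frozenset(support)}

    def add_base(self, fact):
        self.base.add(fact)

    def add_derivation(self, fact, support):
        self.derivations[fact].add(frozenset(support))

    def holds(self, fact):
        return fact in self.base or bool(self.derivations.get(fact))

    def delete(self, fact):
        """Delete a base fact; return the derived facts that no longer hold."""
        self.base.discard(fact)
        removed, queue = [], [fact]
        while queue:
            gone = queue.pop()
            for derived, supports in list(self.derivations.items()):
                remaining = {s for s in supports if gone not in s}
                if remaining == supports:
                    continue  # this fact did not depend on the deleted one
                self.derivations[derived] = remaining
                if not remaining and derived not in self.base:
                    removed.append(derived)
                    queue.append(derived)  # propagate the deletion further
        return removed


g = ProvenanceGraph()
a1 = ("rel1@alice", 1, 2)
a2 = ("rel1@alice", 4, 2)
b = ("rel2@bob", 2, 3)
j = ("join@sue", 3)
for fact in (a1, a2, b):
    g.add_base(fact)
g.add_derivation(j, [a1, b])  # first way to derive join@sue(3)
g.add_derivation(j, [a2, b])  # second, independent derivation

removed_first = g.delete(a1)  # one derivation of j survives
still_holds = g.holds(j)
removed_second = g.delete(b)  # now no derivation of j is left
```

In this toy run, deleting `a1` leaves `j` in place because a second derivation remains, whereas deleting `b` removes it, without recomputing anything that did not depend on the deleted facts.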
The goal of the experimental evaluation is to verify that WebdamLog programs can be executed efficiently. We show here that rewriting and delegation can be implemented efficiently. In the experiments, we used synthetically generated data. All experiments were conducted on up to 50 Amazon EC2 micro instances, with 2 WebdamLog peers per instance. Micro instances are virtual machines with two process units, Intel(R) Xeon(R) CPU E5507 @2.27GHz with 613 MB of RAM, running Ubuntu server 12.04 (64-bit). All experiments were executed 4 times with a warm start. We report averages over these 4 executions.

The cost of delegation. We now focus on measuring WebdamLog overhead in dealing with delegations. Recall the Bud steps performed by each peer at each WebdamLog stage, described in Section 5.2. We can break down each step into WebdamLog-specific and Bud-specific tasks as follows:

1. Inputs are collected
   a) Bud reads the input from the network and populates its channels.
   b) WebdamLog parses the input in channels and updates the dependency graph with new rules. The dependency graph is used to control the rules that are used in the semi-naive evaluation (see Section 5.4).
2. Time is frozen
   a) Bud invalidates each ∆ (used by the semi-naive evaluation) that has to be reevaluated because it corresponds to a relation that may have changed.
   b) WebdamLog invalidates ∆ according to program updates. Moreover, WebdamLog propagates deletions. (Recall that the semi-naive evaluation deals only with tuple additions.)
   c) Bud performs semi-naive fixpoint evaluation for all invalidated relations, taking the last ∆ for differentiation.
3. Outputs are obtained
   a) WebdamLog builds packets of rules and updates to send.
   b) Bud sends packets.

We report the running time of WebdamLog as the sum of Steps 1b, 2b and 3a, and the running time of Bud as the sum of Steps 1a, 2a, 2c and 3b. All running times are expressed as a percentage of the total running time, which is measured in seconds. For each experiment, we will see that the running time of WebdamLog-specific phases is reasonable compared to the overall running time.
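The accounting described above is simple arithmetic, and the Python sketch below makes it explicit. The step labels follow the list above; the per-step timings are invented solely to exercise the function and are not measurements from the experiments.

```python
WEBDAMLOG_STEPS = ("1b", "2b", "3a")
BUD_STEPS = ("1a", "2a", "2c", "3b")


def breakdown(step_seconds):
    """Split the total running time into WebdamLog and Bud percentages."""
    total = sum(step_seconds.values())
    webdamlog = sum(step_seconds[step] for step in WEBDAMLOG_STEPS)
    bud = sum(step_seconds[step] for step in BUD_STEPS)
    return {
        "webdamlog_pct": 100 * webdamlog / total,
        "bud_pct": 100 * bud / total,
        "total_seconds": total,
    }


# Made-up timings in seconds, for illustration only.
example = {
    "1a": 0.25, "1b": 0.125,
    "2a": 0.125, "2b": 0.25, "2c": 1.0,
    "3a": 0.125, "3b": 0.125,
}
report = breakdown(example)
```

With these invented timings, the WebdamLog-specific steps account for 25% of a 2-second total and the Bud-specific steps for the remaining 75%.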
Non-local rules. In the first experiment, we evaluate the running time of a non-local rule with an extensional head. Rules of this kind lead to delegations. We use the following rule:

[at alice]
join@sue($Z) :- rel1@alice($X,$Y), rel2@bob($Y,$Z)

This rule computes the join of two relations at distinct peers (rel1@alice and rel2@bob), and then installs the result, projected on the last column, at the third peer (join@sue). Relations rel1@alice and rel2@bob each contain 1 000 tuples that are pairs of integers, with values drawn uniformly at random from the 1 to 100 range. In the next table, we report the total running time of the program at each peer, as well as the breakdown of the time into Bud and WebdamLog.

[Table: WebdamLog, Bud, and total running time at peers alice, bob, and sue]

The WebdamLog share of the computation on alice is fairly high: 10.8%. This is because that peer's work is essentially to delegate the join to bob. Peer bob spends most of its time computing the join, a Bud computation. Peer sue has little to do. As can be seen from these numbers, the overhead of delegation is small.
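To make the division of work between the three peers concrete, here is a small single-process Python simulation of the rule above. It is a sketch under assumptions, not the system's implementation: the function names are hypothetical, no network is involved, and it assumes that alice sends bob one residual rule per distinct binding of $Y. The data follows the description in the text (1 000 pairs of integers drawn from 1 to 100), with a fixed seed chosen here for repeatability.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
rel1_alice = [(rng.randint(1, 100), rng.randint(1, 100)) for _ in range(1000)]
rel2_bob = [(rng.randint(1, 100), rng.randint(1, 100)) for _ in range(1000)]


def stage_at_alice(rel1):
    """Evaluate rel1@alice($X,$Y) locally; delegate one residual rule per $Y."""
    # Each y stands for the rule  join@sue($Z) :- rel2@bob(y,$Z)  sent to bob.
    return sorted({y for (_x, y) in rel1})


def stage_at_bob(delegated_ys, rel2):
    """Evaluate the delegated rules at bob; the derived facts go to sue."""
    wanted = set(delegated_ys)
    return {z for (y, z) in rel2 if y in wanted}


delegations = stage_at_alice(rel1_alice)
join_at_sue = stage_at_bob(delegations, rel2_bob)

# Reference answer: the same join computed in one place.
expected = {z for (_x, y) in rel1_alice for (y2, z) in rel2_bob if y == y2}
```

In this sketch alice only produces delegations, bob does the actual matching, and sue merely receives facts, which is consistent with the roles described above.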
Relation and peer variables. In the second experiment, we evaluate the execution time of a WebdamLog program for the distributed computation of a union. The following rule uses relation and peer variables and executes at peer sue:

[at sue]
union@sue($X) :- peers@sue($Y,$Z), $Y@$Z($X)

The relation peers@sue contains 12 tuples corresponding to 3 peers (including sue) with 4 relations per peer. Thus, the rule specifies a union of 12 relations. Each relation participating in the union contains 1 000 tuples, each with a single integer column, and with values for the attribute drawn independently at random between 1 and 10 000.

[Table: WebdamLog, Bud, and total running time at peers sue, remote1, and remote2]

Peer sue does most of the work, both delegating rules and computing the union. The WebdamLog overhead is 9.9%, which is still reasonable. The running time on remote peers is very small, and the WebdamLog portion of the computation is negligible.

[Figure 2: Distributed QSQ optimization. Waiting time at sue (sec) as a function of the % of matched facts, for QSQ evaluation and full materialization.]
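The union rule with relation and peer variables can be illustrated with the Python sketch below, which splits the 12 instantiated rules into those that are local to sue and those that would be delegated. Everything here is a stand-in: the relation names `r1` to `r4`, the `plan_union` and `evaluate_union` helpers, and the in-memory `store` are assumptions for this example, and the remote evaluation is simulated by a dictionary lookup rather than by message exchange.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
peer_names = ["sue", "remote1", "remote2"]
relation_names = ["r1", "r2", "r3", "r4"]  # hypothetical relation names

# Each relation holds integers drawn between 1 and 10 000 (duplicates collapse).
store = {
    (relation, peer): {rng.randint(1, 10_000) for _ in range(1000)}
    for peer in peer_names
    for relation in relation_names
}
# Contents of peers@sue: one ($Y, $Z) tuple per relation to include.
peers_at_sue = [(relation, peer) for peer in peer_names
                for relation in relation_names]


def plan_union(peers_relation, local_peer):
    """Instantiate $Y@$Z and split the resulting rules into local and remote."""
    local, delegated = [], {}
    for relation, peer in sorted(peers_relation):
        if peer == local_peer:
            local.append(relation)
        else:
            delegated.setdefault(peer, []).append(relation)
    return local, delegated


def evaluate_union(local, delegated, local_peer, data):
    """Union the local relations with the facts each remote peer would send."""
    result = set()
    for relation in local:
        result |= data[(relation, local_peer)]
    for peer, relations in delegated.items():
        for relation in relations:
            result |= data[(relation, peer)]  # stands in for facts sent to sue
    return result


local_rules, delegated_rules = plan_union(peers_at_sue, "sue")
union_at_sue = evaluate_union(local_rules, delegated_rules, "sue", store)
```

Of the 12 instantiated rules, 4 stay at sue and 4 go to each remote peer, which matches the observation that sue both delegates rules and computes the union.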
QSQ-style optimization. In this experiment, we measure the effectiveness of an optimization that can be viewed as a distributed version of query-subquery (QSQ) [22], where only the relevant data are communicated at query time. More precisely, we consider the following view union2 on peer sue, defined as the union of two relations.

[at sue]
union2@sue($name,$X) :- friendPhotos@alice($name,$X)
union2@sue($name,$X) :- friendPhotos@bob($name,$X)

Suppose we want to obtain the photos of Charlie, i.e., the tuples in union2 that have the value "Charlie" for the first attribute. We vary the number of facts in friendPhotos@alice and friendPhotos@bob that match the query. We compare the cost of materializing the entire view to answer the query with that of installing only the necessary delegations, computed at query time, to compute the answer.

Results of this experiment are presented in Figure 2. We report the waiting time at sue. As expected, QSQ-style optimization brings substantial performance improvements (except when almost all facts are selected). This shows its usefulness in such a distributed setting.

Related work
The WebdamLog language is motivated by previous work on the WebdamExchange system [5]. The system described there could automatically adapt to a variety of protocols and access methods found on the Web, notably for localizing data and for access control [10]. In developing toy applications with WebdamExchange [10], we realized the need for a logic that could be used (i) to declaratively specify applications and (ii) to exchange application logic between peers. This motivated the introduction of WebdamLog [4], a language based on rules that can run locally and be exchanged between peers.

Distributed data management has been studied since the earliest days of databases [20]. The ability to access data from several data sources has been studied under various names, notably multi-databases or federated databases. The setting we consider is in the spirit of peer-to-peer databases with autonomous and heterogeneous data sources. Of course, standard query optimization techniques developed for distributed database systems are relevant here. We focus in particular on the techniques that are most relevant to our setting, which is based on datalog. There have also been a number of works on parallel or distributed evaluation of datalog, e.g., [2, 16].

The use of declarative languages, in particular datalog extensions, for distributed data management has already been advocated, e.g., in [3, 6]. There has recently been renewed interest in this approach [15]. Several systems have been developed based on the declarative paradigm [14, 18, 17], with performance comparable to that of systems based on imperative languages. Our implementation uses the Bud system [21]. The language Dedalus [9] has been proposed as a formal foundation for Bud. We prefer here to use the language WebdamLog, in particular because it features delegation.

Most classic optimization techniques for datalog are relevant to our work, in particular, semi-naive evaluation, which is supported by Bud. We also considered the query-subquery optimization [22] as adapted to the distributed context in [2].
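As a rough illustration of why a QSQ-style strategy helps in the distributed setting, the Python sketch below contrasts shipping every friendPhotos tuple to sue with shipping only the tuples that match the query on the first attribute. It is a toy cost model under stated assumptions: the data is invented, the function names are hypothetical, and "cost" is simply the number of tuples that would cross the network, which ignores the cost of installing delegations.

```python
def full_materialization(sources, name):
    """Ship every tuple to sue, build union2, then select on the name."""
    shipped = [t for relation in sources.values() for t in relation]
    answer = {t for t in shipped if t[0] == name}
    return answer, len(shipped)


def qsq_style(sources, name):
    """Push the selection to the sources; only matching tuples are shipped."""
    shipped = [t for relation in sources.values()
               for t in relation if t[0] == name]
    return set(shipped), len(shipped)


# Invented data: 10 photos per peer, 5 of the 20 belong to Charlie.
sources = {
    "friendPhotos@alice": [("Charlie", f"a{i}.jpg") for i in range(3)]
                          + [("Dana", f"a{i}.jpg") for i in range(3, 10)],
    "friendPhotos@bob": [("Charlie", f"b{i}.jpg") for i in range(2)]
                        + [("Erin", f"b{i}.jpg") for i in range(2, 10)],
}
full_answer, full_cost = full_materialization(sources, "Charlie")
qsq_answer, qsq_cost = qsq_style(sources, "Charlie")
```

Both strategies return the same answer, but in this toy instance the selective one ships 5 tuples instead of 20; the gap narrows as the fraction of matching facts grows.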
This paper presents an implementation of the WebdamLog language, introduced in [4]. The two main challenges for such an approach are (i) the difficulty, for non-technical users, of writing rules and (ii) the difficulty of offering good performance:

• With respect to (i), we present a user study that shows, promisingly, that the participants (many of them not computer scientists) are able to understand and write simple rules.

• With respect to (ii), we benefit from previous datalog optimization techniques and efficient network communication by relying on the Bud system to support the basic functionality of distributed datalog. We show that the higher level features of WebdamLog, notably delegation, can be supported efficiently using logical rule rewriting.

All this demonstrates the feasibility of an approach based on WebdamLog to support exchanges of data and rules between rapidly evolving peers in a distributed and dynamic environment.

In the future, we are considering the following directions:
Access control. One of the bases of WebdamLog is that a peer can locally install rules that are specified by another peer. Clearly, this is potentially very risky. Access control is therefore of paramount importance. We plan to work on access control, and in particular investigate the use of provenance for enforcing access control and for detecting access control violations.

Interface. Our user study demonstrated that WebdamLog is appropriate for specifying distributed data management tasks. We are in the process of developing a user interface for the WebdamLog system. We also plan to conduct a follow-up user study (i) drawing from a larger pool of participants, (ii) including more participants without any CS training, and (iii) testing the usability of other aspects of the language, notably intensional vs. extensional predicates.

Application. We intend to demonstrate the use of our system with complete applications, e.g., for social networks and personal data management.

Optimization. We are currently developing a provenance-based approach for efficiently supporting changes in program state. Also, an optimization technique based on map-reduce and intense parallelism has been proposed for datalog [7]. It would be interesting to consider such an approach in our distributed setting.
References

[1] S. Abiteboul. Managing an XML warehouse in a P2P context. In CAiSE, pages 4–13, 2003.
[2] S. Abiteboul, Z. Abrams, S. Haar, and T. Milo. Diagnosis of asynchronous discrete event systems: datalog to the rescue! In PODS, pages 358–367, 2005.
[3] S. Abiteboul, O. Benjelloun, and T. Milo. The Active XML project: an overview. VLDB J., 17(5):1019–1040, 2008.
[4] S. Abiteboul, M. Bienvenu, A. Galland, and E. Antoine. A rule-based language for Web data management. In PODS, 2011.
[5] S. Abiteboul, A. Galland, and N. Polyzotis. A model for web information management with access control. In WebDB Workshop, 2011.
[6] S. Abiteboul and N. Polyzotis. The data ring: Community content sharing. In CIDR, pages 154–163, 2007.
[7] F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT, pages 1–8, 2011.
[8] P. Alvaro, N. Conway, J. Hellerstein, and W. R. Marczak. Consistency analysis in bloom: a calm and collected approach. In CIDR, pages 249–260, 2011.
[9] P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. C. Sears. Dedalus: Datalog in Time and Space. Technical Report UCB/EECS-2009-173, EECS Department, University of California, Berkeley, December 2009.
[10] E. Antoine, A. Galland, K. Lyngbaek, A. Marian, and N. Polyzotis. [Demo] Social Networking on top of the WebdamExchange System. In ICDE, 2011.
[11] M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
[12] A. V. Gelder. Negation as failure using tight derivations for general logic programs. J. Log. Program., 6(1&2):109–133, 1989.
[13] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[14] S. Grumbach and F. Wang. Netlog, a rule-based language for distributed programming. In PADL, pages 88–103, 2010.
[15] J. M. Hellerstein. The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Record, 39(1):5–19, 2010.
[16] M. A. W. Houtsma, P. M. G. Apers, and S. Ceri. Distributed transitive closure computations: The disconnection set approach. In VLDB, 1990.
[17] B. T. Loo, T. Condie, M. N. Garofalakis, D. E. Gay, J. M. Hellerstein, P. Maniatis, R. Ramakrishnan, T. Roscoe, and I. Stoica. Declarative networking: language, execution and optimization. In SIGMOD, pages 97–108, 2006.
[18] B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing declarative overlays. In SOSP, volume 39, pages 75–90, 2005.
[19] U. of Innsbruck. Iris - integrated rule inference system. http://iris-reasoner.org/.
[20] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
[21] B. O. O. M. project. Bloom programming language.
[22] L. Vieille. Recursive Axioms in Deductive Databases: The Query/Sub-query Approach. In