SocialScope: Enabling Information Discovery on Social Content Sites
CIDR Perspectives 2009
Sihem Amer-Yahia
Yahoo! Research, New York, NY
[email protected]

Laks V.S. Lakshmanan
Univ. of British Columbia, Vancouver, Canada
[email protected]

Cong Yu
Yahoo! Research, New York, NY
[email protected]
ABSTRACT
Recently, many content sites have started encouraging their users to engage in social activities such as adding buddies on Yahoo! Travel and sharing articles with their friends on New York Times. This has led to the emergence of social content sites, which is being facilitated by initiatives like OpenID and OpenSocial. These community standards enable the open access to users’ social profiles and connections by individual content sites and are bringing content-oriented sites and social networking sites ever closer. The integration of content and social information raises new challenges for information management and discovery over such sites. We propose a logical architecture, named SocialScope, consisting of three layers, for tackling the challenges. The content management layer is responsible for integrating, maintaining and physically accessing the content and social data. The information discovery layer takes care of analyzing content to derive interesting new information, and interpreting and processing the user’s information need to identify relevant information. Finally, the information presentation layer explores the discovered information and helps users better understand it in a principled way. We describe the challenges in each layer and propose solutions for some of those challenges. In particular, we propose a uniform algebraic framework, which can be leveraged to uniformly and flexibly specify many of the information discovery and analysis tasks and provide the foundation for the optimization of those tasks.
1. INTRODUCTION
Web 2.0 is leading to an increasing integration of content information with the social information (profiles, connections and activities) of users, giving rise to social content sites. Sites like Flickr and del.icio.us started out as such sites by enabling users to tag and share contents like photos and bookmarks. More recently, however, sites that started as pure content oriented or pure social networking focused are increasingly marching toward such an integration. For example, content sites like Amazon and Yahoo!Travel are becoming more social: users can now become friends, share content with each other, and tag the content with their own descriptions. Similarly, social sites like MySpace and Facebook are adding more content: users can add contents like photos and news items to their personal spaces, making the site more practically useful in their daily lives.

This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/3.0/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2009. Biennial Conference on Innovative Data Systems Research (CIDR), January 4-7, 2009, Asilomar, California, USA.

We envision this integration of social sites and content sites to be greatly helped by initiatives like OpenID and OpenSocial. Social sites and content sites can now collaborate with each other to form virtual social content sites, where the social sites manage the user’s social life and the content sites manage the detailed content information. We are already witnessing this in many domains. For example, on most online news sites (e.g., New York Times), each article is accompanied by buttons corresponding to Facebook, del.icio.us, etc., which allow you to quickly post the article to your favorite social site and share it with friends. Together, the social site(s) and the content site form a powerful virtual social content site that can engage a larger number of users much more deeply than each individual site.

Another important trend is the increasing structural richness of information about the users and the content. On one hand, social sites know more about us through the rich information we voluntarily provide (e.g., name, interests, etc.). On the other hand, content sites are generating more structured information as a result of advances in information extraction and wiki-style mass collaborations. Only a few years ago, almost all the Wikipedia pages were pure text articles. Now, nearly all of the highly visited ones contain some structured information (e.g., Infoboxes) and are organized within a set of loosely defined category hierarchies.

The ways in which those sites help their users discover information, however, have evolved little from the traditional keyword-based search paradigm.
The rapidly growing social graphs underlying those social content sites are rarely leveraged to better serve the user’s information needs. There is still a dichotomy between information retrieval, which focuses on locating information that is semantically relevant to a user query, and information recommendation, which focuses on identifying information a user might prefer based on her social profile activities and those of her social connections. Finally, results are still ranked and presented in a predominantly list-based fashion, not taking advantage of the rich structure and social provenance embedded in them.

In this paper, we identify the research challenges and opportunities associated with managing and discovering information on social content sites. We present an architectural vision, SocialScope, as a platform where those challenges can be addressed. But first, we provide some motivating examples in the context of a real-world social content site, Yahoo!Travel.
                  general                 categorical       specific
                  (e.g., things to do)    (e.g., family)
with locations    32.36%                  22.52%            8.37%
w/o locations     21.38%                  5.34%

Table 1: Summary Statistics of 10 Million Y!Travel Queries.
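
The class totals implied by Table 1 can be derived with a few lines of arithmetic. The following Python sketch is only a convenience check: the dictionary layout and all names are illustrative assumptions, and the single figure reported for specific queries is treated as that class's total.

```python
# Query shares from Table 1, in percent of the 10 million analyzed queries.
# The dictionary layout and names are illustrative, not from the original study.
TABLE1 = {
    "general": {"with_locations": 32.36, "without_locations": 21.38},
    "categorical": {"with_locations": 22.52, "without_locations": 5.34},
    "specific": {"all": 8.37},  # Table 1 reports a single figure for this class
}


def class_share(query_class):
    """Total share (in percent) of one query class."""
    return sum(TABLE1[query_class].values())


general = class_share("general")
categorical = class_share("categorical")
specific = class_share("specific")

# Fraction of general queries that mention a location.
general_with_location = TABLE1["general"]["with_locations"] / general

# Whatever the three classes do not cover was left unclassified.
unclassified = 100.0 - (general + categorical + specific)

print(f"general:      {general:.2f}%")
print(f"  of which with a location: {general_with_location:.1%}")
print(f"categorical:  {categorical:.2f}%")
print(f"specific:     {specific:.2f}%")
print(f"unclassified: {unclassified:.2f}%")
```

The totals come to 53.74% general (about 60% of which mention a location), 27.86% categorical, and 8.37% specific, which leaves roughly 10% of the queries unclassified.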
2. CASE STUDY WITH YAHOO! TRAVEL
Y!Travel is a typical content site that is gradually evolving into a social content site. Initially built as a portal on travel destinations, it has been incorporating various social features including allowing users to tag travel destinations and showing them similar travelers. It has also been interacting with Y!Local and Flickr to provide more structured information and social data about destinations. These days, users visit Y!Travel to look for information about travel destinations as well as to learn about their friends and other travelers. We briefly describe the data and queries in Y!Travel.

Y!Travel Data: Y!Travel maintains a comprehensive set of travel objects: cities, restaurants, etc., each with its own structure. Various semantic links are established between objects. For example, Fisherman’s Wharf and San Francisco are connected through geographical containment. Users on Y!Travel provide detailed information about themselves, including self-tags, interests, etc., and they are also connected in various ways. For example, they can be friends on Flickr or contacts on the Instant Messenger network. Finally, users browse travel objects, tag them with keywords, and provide ratings and reviews on them, creating connections between users and objects.
Y!Travel Queries: Users interact with Y!Travel (http://travel.yahoo.com/) through a search interface, where they enter a set of keywords and obtain a list of travel objects considered relevant to their queries. We conducted a comprehensive analysis on 10 million recent Y!Travel queries to better understand the user behavior. The results are summarized in Table 1. By leveraging the domain knowledge we have about geographical locations and travel destinations, we detect location terms in queries and classify each query into one of three classes: general, categorical, and specific. (There are about 10% of the queries that we were unable to classify.) General queries are those containing terms like “things to do”, “attraction”, or just a location by itself. Over 50% of the queries fall into this class, and about 60% of those queries contain a location. Categorical queries refer to those containing terms like “hotel”, “family”, “historic”, etc. About 30% of the queries fall into this class and a majority of them mention a location. Finally, there are also about 8% of the queries looking for specific destinations like “Disneyland” and “Yosemite Park”.

The distributions of Y!Travel queries indicate that the main information needs of users are not specific destinations, but rather the set of interesting destinations among a large group that are loosely constrained by the general or categorical queries. This is in contrast with web search engines, where users are mostly searching for specific information. As a result, the search paradigm is a poor fit for Y!Travel because it is inherently hard to discriminate among a large group of results based purely on the query keywords. For example, almost all destinations in Y!Travel are attractions. To address this problem, Y!Travel manually creates guides of “things to do”, “hotels”, and “restaurants” for popular destinations, which are extensively browsed by users. However, it is impossible to manually create the “right” guide for each user on all destinations and queries. A new information discovery paradigm is therefore needed for Y!Travel and other similar social content sites. In the rest of this section, we describe our vision of this new paradigm through three hypothetical examples involving
Y!Travel that are synthesized from our extensive conversations with actual users.

EXAMPLE 1. John is in Denver for a conference. Having one day free, he visits Y!Travel and searches for “Denver attractions”. John has in the past visited quite a few baseball fields on Y!Travel and has many friends on Facebook with interests in “baseball”. With this knowledge, Y!Travel recommends to him “B’s Ballpark Museum” (a small baseball museum in the suburb), “Coors Field” (home field of the Rockies), as well as the upcoming baseball game “Yankees vs Rockies” to be played at Coors Field, which is fetched from Y!Sports.

Example 1 represents one out of three queries on Y!Travel. However, the traditional information retrieval approach fails for it because there are often many objects that are semantically relevant to John’s query and no ranking mechanism (e.g., tf-idf measure) based on pure semantic relevance can differentiate them. It is therefore imperative for the system to incorporate social relevance, which considers John’s social profile and connections, to decide which attractions he will prefer. Essentially, information discovery on social content sites requires the integration of two major paradigms: semantic relevance with respect to a query and social relevance in the spirit of recommendations. The former scopes the discovery to information relevant to John’s current needs as expressed by him, while the latter identifies the information most appealing to John as a user. Indeed, “B’s Ballpark Museum” may not be a major attraction, yet John, being a baseball fan, is likely to enjoy a visit to it. Example 1 further illustrates another important desideratum: the need to retrieve relevant information from external social or content sites that are physically and administratively separate from Y!Travel, which is becoming possible because of various initiatives like OpenSocial.

EXAMPLE 2. Selma, a young musician with two babies, is planning a family trip to Barcelona. She searches for “Barcelona family trip with babies” on Y!Travel. As in John’s case, Y!Travel searches for attractions that are semantically and socially relevant to Selma. While Selma is well-connected to her musician friends, very few of them have kids and are suitable for trip recommendation to Selma in this case. Instead, Y!Travel analyzes her other friends who have made similar family trips before and uses them as the social basis for recommending baby-friendly attractions like the “Parc de la Ciutadella”.
Selma’s example illustrates the importance of analyzing the social connections of users and choosing the right subset of the connections as the basis for discovering socially-relevant results. Unlike in John’s case, Y!Travel has to understand the distinct groups of friends that Selma has and pick the right group for her family oriented query. This analysis, however, is non-trivial since the nature of social activities and connections is often implicit and noisy. Determining whether a social connection is suitable for answering a particular query is a significant challenge for a social content site and may even require an interaction with the user. Furthermore, it is not always possible or necessary to “personalize” social relevances. Even if Selma does not have any friend with young babies,
Y!Travel should still be able to identify a group of “experts” on the topic to help answer Selma’s query. This would require extensive data analysis to identify topics within the data and users with expertise on the topics.

[Figure: architecture diagram. Labels recoverable from the original: Facebook, Y! Sports, Y! IM, OpenSocial API, Activities, Social Content Graph, Query / Result, Content Management, Information Discovery, Information Presentation, Social Content Admin, User.]

Figure 1: Architecture of SocialScope.

EXAMPLE 3. Alexia is a high school student planning a summer field trip for an assignment from her history class. She comes to Y!Travel and searches for “American history” to find places for research on the subject. As in previous examples, Y!Travel leverages Alexia’s social information and finds a set of semantically and socially relevant places. However, the results this time contain places from throughout the country and fall into many different topics. Recognizing the chaotic nature of the results, Y!Travel moves away from the traditional list-based presentation, and automatically groups results along multiple dimensions: geographically, or by who endorses them (her classmates in the history class or her friends on the soccer team). Furthermore, Y!Travel analyzes the result destinations and presents related topics (e.g., Independence War) and users (e.g., Jane, who left comments on many result destinations) to Alexia.
Example 3 illustrates another important aspect in information discovery on social content sites: result presentation. Similar to Alexia’s case, there can be many equally relevant (semantically and socially) results to a user and her query, and alternative presentation mechanisms need to be employed for the users to effectively explore the results. For example, grouping can be accomplished based on the rich structure information associated with each object in a way similar to faceted search. More interestingly, in social content sites, each result also has an associated social provenance that can be explored for more sophisticated grouping. Furthermore, users like Alexia are often not just specifically looking for objects: they are also interested in exploring other information (e.g., similar users and associated topics) related to their information need. Y!Travel needs to detect when such explorations are warranted and how to facilitate them.
As all three examples have emphasized, effective social content discovery calls for an integration of three major paradigms: keyword search, database-style querying, and recommendations. The search/query paradigm helps users express their need and narrow down the discovery scope, while the recommendation paradigm enables the system to guide and expand the discovery process socially. The fact that social and semantic relevances play an equally important role distinguishes this from personalized search [21], where personalization is achieved by re-ranking semantically relevant results in a post-processing step.

Second, the social information on social content sites is much more complex than in traditional recommendation systems [24]. Instead of being characterized simply by what items they have read or bought previously, users can participate in various social activities and establish connections with different semantics. These connections need to be analyzed and selectively applied to help the discovery process.

Finally, the fact that most user queries are exploratory in nature calls for effective ways to present the results to facilitate information exploration. While faceted search [14] has made strides in helping users explore query results, it does not account for social provenances. Discovering the most effective groupings, either based on structural attributes or on social relevance, is a significant new challenge.
3. ARCHITECTURAL VISION
Figure 1 describes the logical architecture of SocialScope. At its core is the social content graph, which represents users, objects, and various connections among them. Information in the graph may be locally owned (e.g., destinations in Y!Travel), externally integrated (e.g., friendship connection obtained from Facebook) or derived (e.g., links describing similarities between users). Three layers, Content Management, Information Discovery, and Information Presentation, form the entire SocialScope system, and we briefly describe each next.
Information Discovery (Section 5): This layer consists of two components: Content Analyzer and Information Discoverer. The Content Analyzer derives new nodes (e.g., topics) and links (e.g., similarities between users) through various analyses (e.g., Latent Dirichlet Allocation [8], association rule mining [3]) of the raw social content graph in an off-line fashion. Those analyses can be specified and triggered automatically by the system itself or by a Social Content Administrator. The Information Discoverer parses the user query, constructs its internal representations (based on various semantic and social relevance computations), and evaluates them on the social content graph. The result is a social content sub-graph, called Meaningful Social Graph (MSG), that is semantically and socially relevant to a given user and query. One major vision of SocialScope is to enable uniform manipulation of social content graphs, leading to declarative, flexible, and optimizable graph analysis and information discovery processes. We accomplish this by proposing a logical algebraic framework for social content graph manipulation.
Content Management (Section 6): This layer handles two main tasks. First, it facilitates the incorporation of social information from remote sites through the Content Integrator. This has become increasingly important as open standards like OpenSocial become widely accepted, which allow the core social content sites to leverage the large amount of information within social sites. The second task is the maintenance and retrieval of the social content graph through the Data Manager, which abstracts away the physical implementation of the graph. In addition, given that much social content is created and maintained externally, the Data Manager needs to make decisions on when and how to refresh parts of the social graph efficiently. The Activity Manager helps in that regard by categorizing users based on their activities.
Information Presentation (Section 7): This layer provides a comprehensive result exploration framework. It admits as input the MSG from the Information Discovery layer and dynamically organizes the results for effective exploration by the user. There are two key primitives: grouping and ranking, managed by Information Organizer and Result Selector, respectively. The former identifies appropriate (structural or social) criteria for grouping results, while the latter identifies appropriate mechanisms for ranking and selecting results within or across groups. When multiple presentation groups are available, Information Organizer also makes decisions on which group is more relevant to the user and her current information needs.

Let’s begin by introducing the data and query models adopted by SocialScope.
4. DATA AND QUERY MODEL
Nodes and Links: We adopt a graph model for representing social content. Intuitively, nodes in the graph represent physical and abstract entities like users and topics, and links represent connections and activities between entities such as friendship and tagging actions. Each node or link has a unique id. It is worth noting that the graph model described here is a logical model that is not tied to any specific physical implementation.

Nodes and links contain structural attributes, including a mandatory type attribute. We adopt a flexible (i.e., schema-less) typing system and allow the type attribute to have multiple values. For example:

$n_1$ = { id=1; type=‘user, traveler’; name=‘John’ } and
$n_2$ = { id=2; type=‘item, city’; name=‘Denver’; keywords=‘skiing’ }

are two nodes representing our traveler John and the city Denver, respectively, in Example 1. Similarly,

$l(n_1, n_2)$ = { id=12; type=‘act, tag’; date=‘2008-8-2’; tags=‘rockies baseball’ }

is a link recording the activity that John tagged Denver with tags ‘rockies baseball’. Our typing system gives us the flexibility of creating new types through content analysis. We also maintain an evolving catalog of basic types, including user, item, topic, group for nodes and connect (e.g., friend), act (e.g., tag, review, click, etc.), match, belong for links. Those basic types are adequate for modeling most of the social content sites we have encountered.

Social Content Graphs: We model an instance of a social content site as a social content graph. A social content graph consists of nodes and links as described above. It is sometimes convenient to view the social content graph as an overlay of sub-graphs, namely the activity graph, which maintains users’ activities on items, the network graph, which maintains social connections, and the topical graph, which maintains links from users or items to derived semantic groups or topics.
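
The node, link, and overlay notions above can be made concrete with a small in-memory sketch. This is only one possible encoding of the logical model, written in Python under stated assumptions: the class names, the `overlay` helper, and the choice of selecting a sub-graph by a basic link type are ours, and the user Mary and her friendship link are hypothetical additions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    type: set                      # multi-valued, schema-less type attribute
    attrs: dict = field(default_factory=dict)


@dataclass
class Link:
    id: int
    src: int                       # id of the source node
    tgt: int                       # id of the target node
    type: set
    attrs: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node id -> Node
    links: dict = field(default_factory=dict)   # link id -> Link

    def add(self, *elements):
        """Store each element under its id, in the matching dictionary."""
        for e in elements:
            (self.links if isinstance(e, Link) else self.nodes)[e.id] = e

    def overlay(self, basic_type):
        """Sub-graph made of the links carrying basic_type and their endpoints."""
        sub = Graph()
        for link in self.links.values():
            if basic_type in link.type:
                sub.add(link, self.nodes[link.src], self.nodes[link.tgt])
        return sub


g = Graph()
john = Node(1, {"user", "traveler"}, {"name": "John"})
denver = Node(2, {"item", "city"}, {"name": "Denver", "keywords": "skiing"})
mary = Node(3, {"user"}, {"name": "Mary"})      # hypothetical friend, not in the paper
tagged = Link(12, 1, 2, {"act", "tag"},
              {"date": "2008-8-2", "tags": "rockies baseball"})
friend = Link(13, 1, 3, {"connect", "friend"})  # hypothetical friendship link
g.add(john, denver, mary, tagged, friend)

activity_graph = g.overlay("act")       # users' activities on items
network_graph = g.overlay("connect")    # social connections
print(sorted(activity_graph.nodes), sorted(network_graph.nodes))
```

Keeping `type` as a set mirrors the multi-valued type attribute, so a derived type can be added to a node or link without changing any schema.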
Queries: Users interact with SocialScope by specifying a (possibly empty) query on content and structure. (Here and elsewhere, the term structure refers to the attribute/value pairs associated with nodes and links.) Structural predicates are interpreted in the usual Boolean sense, while content conditions are used to compute semantic relevance which, combined with social relevance, results in a single relevance score. The system generates recommendations within the scope defined by the query, treating the structural predicates as the constraints defining the scope. When the structural predicates are absent in the query, only semantic relevance and social relevance are taken into account. And when a query is empty, only social relevance is accounted for.
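
The three query cases described above (structure plus content, content only, empty query) can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the text does not say how semantic and social relevance are computed or combined, so the keyword-overlap measure, the weighted sum with `alpha`, the item list, and the social scores are all hypothetical.

```python
def satisfies(item, predicates):
    """Boolean check: every required value of every attribute must be present."""
    return all(set(required) <= set(item.get(att, ()))
               for att, required in predicates.items())


def semantic_relevance(item, keywords):
    """ASSUMED measure: fraction of query keywords found in the item's keywords."""
    wanted = set(keywords)
    return len(wanted & set(item.get("keywords", ()))) / len(wanted)


def relevance(item, keywords, social, alpha=0.5):
    """Single relevance score; the weighted sum and alpha are assumptions."""
    soc = social.get(item["id"], 0.0)
    if not keywords:               # no content condition: social relevance only
        return soc
    return alpha * semantic_relevance(item, keywords) + (1 - alpha) * soc


def discover(items, predicates, keywords, social):
    """Filter by structural predicates, then rank by the combined score."""
    scope = [it for it in items if satisfies(it, predicates)]
    ranked = sorted(scope, key=lambda it: -relevance(it, keywords, social))
    return [it["id"] for it in ranked]


# Hypothetical items and made-up social relevance scores for one user.
items = [
    {"id": 1, "type": ["item", "attraction"],
     "keywords": ["denver", "attraction", "baseball"]},
    {"id": 2, "type": ["item", "attraction"],
     "keywords": ["denver", "attraction", "baseball", "museum"]},
    {"id": 3, "type": ["item", "attraction"],
     "keywords": ["denver", "attraction", "art"]},
    {"id": 4, "type": ["item", "hotel"], "keywords": ["denver", "hotel"]},
]
social = {1: 0.9, 2: 0.8, 3: 0.1, 4: 0.5}

scoped = discover(items, {"type": ["attraction"]}, ["denver", "attraction"], social)
content_only = discover(items, {}, ["denver", "attraction"], social)
empty_query = discover(items, {}, [], social)
print(scoped, content_only, empty_query)
```

In the scoped query all three attractions match both keywords equally, so only the social component separates them, which is the situation Example 1 describes.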
5. INFORMATION DISCOVERY: AN ALGEBRAIC FRAMEWORK
Developing an efficient and flexible mechanism to manipulate the social content graph is a major goal of the Information Discovery layer. While social network graphs have been the subject of a number of social network analyses [23, 27, 29] and social activity graphs have been leveraged in various recommendation algorithms [2, 24], most of those works are designed for simple graphs (i.e., no complex structures on nodes or links) and adopt ad-hoc methods. This leaves the system with few opportunities for reuse, customization and optimization. Furthermore, while searching for objects based on content relevance has been extensively studied before, it has never been integrated into a social context in a principled way. We believe a uniform algebraic framework is needed to manipulate the kind of complex social content graphs we encounter in social content sites and to provide flexibility in the manner in which information is analyzed and discovered.

We thus propose a logical algebra that is capable of expressing sophisticated tasks for data analysis and discovering socially and semantically relevant results. Each operator in the algebra takes social content graphs as input and outputs a social content graph. In the next section, we present our algebra formally and demonstrate the expressive power of the algebra by showing a comprehensive set of tasks that can be expressed in the algebra.
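
Because every operator maps graphs to a graph, operators can be nested freely. The Python sketch below illustrates this closure property using the two Minus variants (node-driven and link-driven) that are defined later in this section, applied to the three-link example given with them. The encoding is an assumption made for brevity: a graph is a pair (nodes, links), a link is identified by its (source, target) pair, and the function names are ours.

```python
def graph(links, extra_nodes=()):
    """Build a graph as (nodes, links); nodes are those on links plus extras."""
    links = frozenset(links)
    nodes = frozenset(extra_nodes) | {n for link in links for n in link}
    return nodes, links


def minus_node_driven(g1, g2):
    """Node-driven Minus: subgraph of g1 induced by g1's nodes absent from g2."""
    nodes = g1[0] - g2[0]
    links = frozenset((s, t) for (s, t) in g1[1] if s in nodes and t in nodes)
    return nodes, links


def minus_link_driven(g1, g2):
    """Link-driven Minus: links of g1 not in g2, plus the nodes they induce."""
    links = g1[1] - g2[1]
    nodes = frozenset(n for link in links for n in link)
    return nodes, links


g1 = graph([("a", "b"), ("a", "c"), ("b", "c")])
g2 = graph([("a", "b")])

node_driven = minus_node_driven(g1, g2)
link_driven = minus_link_driven(g1, g2)

# Closure: the results are graphs again, so they can feed another operator.
nested = minus_node_driven(link_driven, node_driven)

print(sorted(node_driven[0]), sorted(node_driven[1]))
print(sorted(link_driven[0]), sorted(link_driven[1]))
print(sorted(nested[0]), sorted(nested[1]))
```

Since the output of one call is a valid input to the next, a discovery task can be written as an expression tree over such functions, which is what makes rewriting and optimization possible in principle.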
At the core of the algebra are two unary selection operators: Node Selection ($\sigma^N_{\langle C,S \rangle}$) and Link Selection ($\sigma^L_{\langle C,S \rangle}$). Both operators take a condition $C$ and an optional scoring function $S$ as parameters, and a (social content) graph as input. The condition $C$ consists of a list of structural conditions (e.g., { type=‘city’, rating ≥ ‘0.5’ }) and a set of keywords (e.g., ‘Denver attraction’). Satisfaction of the structural conditions by a node is defined in the obvious manner: a node $v$ is said to satisfy a structural condition of the form $att = val_1, \ldots, val_k$ if the set of $v$’s values for $att$ is a superset of the values $\{val_1, \ldots, val_k\}$. When an optional scoring function $S$ is specified as an input parameter, a score is generated using $S$ for each node based on how well its content matches the keywords in $C$. If no scoring function is specified, but $C$ includes keywords, a default scoring function is used for generating the score. Finally, Node Selection outputs a null graph consisting of nodes (and no links) of the input graph that satisfy the node condition $C$, and a score is generated (by $S$) and attached to each node in the output graph. More formally:

DEFINITION (NODE SELECTION). $\sigma^N_{\langle C,S \rangle}(G) = \{\, v,\ v.score = S(v) \mid v \in nodes(G) \wedge v \text{ satisfies } C \,\}$.

Link Selection is defined in an analogous manner, with the same format specification and satisfaction definition for condition $C$. Link Selection outputs a subgraph of the input graph induced by those links satisfying the selection condition $C$, and a score is generated by the optional scoring function $S$ and attached to each link within the output graph. More formally:

DEFINITION (LINK SELECTION). $\sigma^L_{\langle C,S \rangle}(G) = \{\, \ell,\ \ell.score = S(\ell) \mid \ell \in links(G) \wedge \ell \text{ satisfies } C \,\}$.

Note that in the examples that will follow, we often omit the scoring function for clarity.

Also in the core framework are the binary set operators. We define the three common operators, Intersection ($\cap$), Union ($\cup$), Minus ($\setminus$), as follows:

DEFINITION (SET-THEORETIC OPERATORS). Let $G_1$ and $G_2$ be two social content graphs originating from the same social content site. $G_1 \cup G_2$, $G_1 \cap G_2$, and $G_1 \setminus G_2$ are defined as: $nodes(G_1 \oplus G_2) = nodes(G_1) \oplus nodes(G_2)$ and $links(G_1 \oplus G_2) = links(G_1) \oplus links(G_2)$, where $\oplus$ is one of $\cup$, $\cap$, $\setminus$, and nodes and links with the same id are consolidated in the output graph.

As an example, a link belongs to $G_1 \setminus G_2$ if and only if it is in $G_1$ but not in $G_2$. Note that a link in $G_1 \setminus G_2$ must necessarily be incident on nodes which appear in $G_1$ but not in $G_2$.

Remarks: First, in all the definitions above, nodes and links are matched on the basis of their id; as a result, graph isomorphism is not an issue. Second, we note that the operator $\setminus$ can be defined in more than one way. According to the definition above, $G_1 \setminus G_2$ is the subgraph of $G_1$ induced by those nodes of $G_1$ which are not present in $G_2$. Thus, all links in $G_1 \setminus G_2$ are necessarily those for which both endpoints are present in $G_1$ but not in $G_2$. We call this the Node-Driven Minus operator. Below, we give an alternative, link-driven, definition of the Minus operator, denoted ‘$\setminus\!\cdot$’.

DEFINITION (LINK-DRIVEN MINUS). Let $G_1$ and $G_2$ be two social content graphs originating from the same social content site. $G_1 \setminus\!\cdot\; G_2$ is defined as: $links(G_1 \setminus\!\cdot\; G_2) = links(G_1) \setminus links(G_2)$; $nodes(G_1 \setminus\!\cdot\; G_2)$ consists precisely of those nodes which are induced by the set of links in $links(G_1 \setminus\!\cdot\; G_2)$.

As an example of the difference between Node-Driven and Link-Driven Minus operators, consider $G_1 = \{(a,b), (a,c), (b,c)\}$ and $G_2 = \{(a,b)\}$. $G_1 \setminus G_2$ is a null graph containing only node $c$ and no links. On the other hand, $G_1 \setminus\!\cdot\; G_2$ contains all the three nodes $a, b, c$ and the links $(a,c)$ and $(b,c)$. We note in Lemma 1 that the Link-Driven Minus operator can be expressed with a combination of Node-Driven Minus operator and Semi-Join operator (to be described later).

LEMMA 1. Operator $\setminus\!\cdot$ can be expressed using operators $\setminus$ and $\ltimes$. (Proof omitted for clarity.)

Next, we introduce more sophisticated binary operators. Operator Composition $G_1 \odot_{\langle \delta, F \rangle} G_2$ takes a directional condition $\delta$ and a composition function $F$ as parameters and produces a graph induced by new links that are composed from links in $G_1$ and $G_2$. Input links to be composed are selected if they satisfy the directional condition $\delta$, and each new link in the output is attached with attributes generated by the function $F$. The directional condition $\delta$ consists of two link directions, $d_1 \in \{src, tgt\}$ and $d_2 \in \{src, tgt\}$, corresponding to links in $G_1$ and $G_2$, respectively. E.g., $\delta = (src, tgt)$ means two links are composed if and only if the source node of the $G_1$ link matches the target node of the $G_2$ link, where two nodes match if and only if they have the same id.

Composition Function: Intuitively, the composition function $F$ combines the attributes of input links and generates new attributes for the output link produced by composition. Since there can be a variety of user-defined composition functions, we focus on their core requirements here. First, a composition function must accept as input two groups of attributes (and their values) corresponding to the two input links. These attributes may be link attributes or node attributes. Second, a composition function must produce as output a group of uniquely named attributes (and their values) to be associated with the output link. If a function satisfies both requirements, we consider it in the class of $CF$. Formally, the composition operator is defined as follows (note that $\delta_{\bar{d}_i}$ indicates the opposite direction of $\delta_{d_i}$):

DEFINITION (COMPOSITION). Let $G_1$ and $G_2$ be two social content graphs originating from the same social content site. $G = G_1 \odot_{\langle \delta, F \rangle} G_2$, where $F$ is a function in the class $CF$, is defined as:
• $\forall u, v, \ell\ [\, u, v \in nodes(G),\ \ell \in links(G)$ if and only if $\exists\, \ell_1 \in links(G_1),\ \ell_2 \in links(G_2)$ s.t. $u = \ell_1.\delta_{\bar{d}_1} \wedge v = \ell_2.\delta_{\bar{d}_2} \wedge \ell_1.\delta_{d_1} = \ell_2.\delta_{d_2} \wedge \ell.src = u \wedge \ell.tgt = v \,]$.
• $\ell.\{att_1, att_2, \ldots\} = F(\ell_1, \ell_2)$

In contrast with composition, operator Semi-Join, $G_1 \ltimes_{\delta} G_2$, produces a subgraph of $G_1$ induced by the $G_1$ links that match the links in $G_2$. Again, links to be joined are selected if they satisfy the directional condition $\delta$. Both Composition and Semi-Join “connect” links in their input graphs. However, composition generates new links while semi-join simply filters away unwanted links. As a special case, when $G_1$ ($G_2$) is a null graph (i.e., no links), we set $d_1$ (resp., $d_2$) to $src$. Formally, we have:

DEFINITION
EMI -J OIN ). Let G and G be two socialcontent graphs originated from the same social content site. G = G n δ G is defined as: • ∀ ` [ ` src , ` tgt ∈ nodes ( G ) , ` ∈ links ( G ) if and only if ` ∈ links ( G ) ∧ ∃ ` ∈ links ( G ) s.t. `.δ d = ` .δ d ].E XAMPLE
EARCH ). We illustrate how the search task, “Find John’s friends who have visited travel destinations near Den-ver and all their activities” , can be accomplished. Given the socialcontent graph G and John’s node id , we proceed as follows:John’s network is G = σ LC ( G n ( src , src ) σ NC ( G )) , where C is id = and C is type=‘friend’. Users who visited places nearDenver are captured by: G = σ LC ( G n ( tgt , src ) σ NC ( G )) , where C is { type=‘destination’, ‘near Denver’ } and C is type=‘visit’.John’s subset of friends who have visited places near Denver isthen G = G n ( tgt , src ) G , while places near Denver that arevisited by John’s friends are G = G n ( src , tgt ) G . The union G = G ∪ G puts these two together. Activities by these friendsare G = σ LC ( G n ( src , tgt ) G ) , where C is type = ‘act’. Fi-nally, the union G = G ∪ G puts together John, his friendswho have visited places near Denver, the places, and the friends’activities. Aggregation is critical for most analysis tasks on social contentgraphs including model-based methods like Latent Dirichlet Allo-cation (LDA) [8]. One important feature of aggregation is the cre-ation of new information (aggregation results) that need to be storedand maintained. Because of the rich structures of nodes and links,we can naturally incorporate aggregation results as new attributes.We define two operators Node Aggregation ( γ N h C,d,att, Ai ( G ) ) andLink Aggregation ( γ L h C, att , Ai ( G ) ), where C is a link condition asdescribed in Section 5.1, d = src | tgt is a directional constraint, att is the destination attribute whose value will contain the ag-gregation result, and A is the aggregation function. Aggregation Function : Intuitively, the aggregation function takesas input a collection of links (and their associated attributes andvalues) and produces as output a value to be associated with the at-tribute att . We focus on two classes of aggregations: (i) the class
$\mathit{SAF}$ of set aggregation functions that map a set of links to a set of scalars, and (ii) the class $\mathit{NAF}$ of numerical aggregation functions that map a set of links to a numerical scalar value. We formally define both classes below. Note that in the following definition, we use $\$x$ to denote a variable. When $\mathit{att}$ is a set-valued attribute of a link $\ell$, the expression $\ell.\mathit{att} = \$x$ binds $\$x$ to one value of $\ell.\mathit{att}$ at a time.

DEFINITION 7 (SET AGGREGATE FUNCTIONS). Let $L$ be a set of links. An aggregation function $\mathcal{A}$ is in class $\mathit{SAF}$ if and only if it is of the form $\{\, \$x \mid \ell \in L \ \&\ \ell.\mathit{att} = \$x \,\}$, which extracts values of the attribute $\mathit{att}$ from every link in the input set $L$ and forms an output set of scalar values.

As an example, let $L$ be the set of links corresponding to a user's tagging actions. Let $\mathit{tags}$ be the attribute of these links that contains the tags assigned to each item. The function $\{\, \$x \mid \exists \ell \in L : \ell.\mathit{tags} = \$x \,\}$ forms the set of all distinct tags assigned by the user to the items she has tagged.

DEFINITION 8 (NUMERICAL AGGREGATE FUNCTIONS). The class $\mathit{NAF}$ of aggregation functions is defined as follows:
• Every arithmetic operation $+, -, \times, \div$ is in $\mathit{NAF}$;
• The constant functions $\mathbf{0}$ and $\mathbf{1}$, which map every scalar input to the constant 0 and 1 respectively, are in $\mathit{NAF}$;
• Summation over a collection, i.e., $\sum_{x \in X} f(x)$, where $X$ is a collection and $f$ is a function in $\mathit{NAF}$, is in turn in $\mathit{NAF}$;
• Product over a collection, i.e., $\prod_{x \in X} f(x)$, where $X$ is a collection and $f$ is a function in $\mathit{NAF}$, is in turn in $\mathit{NAF}$;
• $\mathit{NAF}$ is closed under composition.

It is easy to see that popular aggregate functions like summation, count, and average can be readily expressed in $\mathit{NAF}$. For example, to count the number of elements in a set $X$, we do the following: $\mathit{COUNT}(X) ::= \sum_{x \in X} \mathbf{1}(x)$. Other aggregate functions like minimum and maximum can also be expressed, although the details of the construction are omitted here for clarity of presentation. Henceforth, we refer to the union of the classes $\mathit{SAF} \cup \mathit{NAF}$ simply as $\mathit{AF}$.

DEFINITION 9 (NODE AGGREGATION). Let $G$ be a social content graph. $\gamma^N_{\langle C, d, \mathit{att}, \mathcal{A} \rangle}(G)$ produces a social content graph $G'$ that is isomorphic to $G$ and $\forall v \in G'$, if $\exists \ell \in G \wedge \ell \text{ satisfies } C \wedge \ell.d = v$, then $v.\mathit{att} = \mathcal{A}(\{\, \ell_i \in \mathit{links}(G) \mid \ell_i \text{ satisfies } C \ \&\ \ell_i.d = v \,\})$.

Notice that the directionality parameter $d$ acts as a group-by attribute, in that all outgoing links from a node (or all incoming links to a node) are grouped together and aggregated. As an example of node aggregation, suppose $\mathcal{A}(L)$ simply counts the number of links in $L$. Let the condition $C$ be type = 'friend'. The expression $\gamma^N_{\langle C, \mathit{src}, \mathit{fnd\_cnt}, \mathcal{A} \rangle}(G)$ produces a graph that is isomorphic to $G$ except for every node that has one or more outgoing 'friend' links. For those nodes, an attribute $\mathit{fnd\_cnt}$ is generated to store the aggregate value, namely the number of friends, as computed by the aggregation function. Similarly, node aggregation can be used to assign an attribute $\mathit{tags\_used}$ to every user node, whose values include all the tags that have been used by the user.

The definition of the Link Aggregation operator is analogous to Node Aggregation, except for two major differences. First, link aggregation changes the structure of an input graph: it replaces a set of links between a given src and tgt node by a new link. Second, the result of the aggregate computation is assigned as a destination attribute of the newly created link.

DEFINITION 10 (LINK AGGREGATION). Let $G$ be a social content graph. $\gamma^L_{\langle C, \mathit{att}, \mathcal{A} \rangle}(G)$ produces a social content graph $G'$ as follows:
1. Partition $\{\, \ell \mid \ell \in \mathit{links}(G) \wedge \ell \text{ satisfies } C \,\}$ on $\ell.\mathit{src}$ and $\ell.\mathit{tgt}$;
2. For each set of links $L_{s,t}$ sharing the same source node $s$ and the same target node $t$, replace $L_{s,t}$ with a new link $\ell_{s,t}$;
3. Attach an attribute $\mathit{att}$ to $\ell_{s,t}$, with its value computed as $\mathcal{A}(L_{s,t})$.

As an example of link aggregation, let $G_1$ be a graph containing users and their friends, and let $G_2$ be a graph containing users and cities that they have visited. Both these subgraphs can be extracted easily from an input social content graph corresponding to a social content site, in a manner similar to that illustrated in Example 4. Furthermore, let $G_3$ be the result of composing $G_1$ and $G_2$, where the composed links contain attribute type = 'user_friend_item' and are results of composing friend links in $G_1$ and visit links in $G_2$. The link aggregation $\gamma^L_{\langle C, \mathit{vst\_cnt}, \mathit{COUNT} \rangle}(G_3)$, where $C$ is the condition type = 'user_friend_item', replaces each set of links sharing the same user node and the same city node by one new link. It then assigns an attribute $\mathit{vst\_cnt}$ to the new links, whose value is computed by counting the number of user_friend_item links from the user node to the city node in $G_3$.

Next, we describe a comprehensive example that represents the collaborative filtering strategy of recommendation.

EXAMPLE 5 (COLLABORATIVE FILTERING). We show how collaborative filtering can be expressed for recommending travel destinations to John. Given the social content graph $G$ and John's node id = 101, we proceed as follows:
1. $G_1 \leftarrow \sigma^L_{\mathit{type} = \text{'visit'}}(G \ltimes_{\mathit{src}, \mathit{src}} \sigma^N_{\mathit{id} = 101}(G))$. $G_1$ now contains user John and the places he has visited.
2. $G_2 \leftarrow \gamma^N_{\langle \mathit{type} = \text{'visit'},\ \mathit{src},\ \mathit{vst},\ \mathcal{A} \rangle}(G_1)$, where $\mathcal{A}$ is a set aggregation function that collects the set of destinations that John has visited and stores it as attribute $\mathit{vst}$ of node John.
3. $G_3 \leftarrow \sigma^L_{\mathit{type} = \text{'visit'}}(G \ltimes_{\mathit{src}, \mathit{src}} \sigma^N_{\mathit{id} \neq 101}(G))$, finding users other than John and the places they have visited.
4. $G_4 \leftarrow \gamma^N_{\langle \mathit{type} = \text{'visit'},\ \mathit{src},\ \mathit{vst},\ \mathcal{A} \rangle}(G_3)$, collecting the set of destinations that each user (other than John) has visited and storing it as attribute $\mathit{vst}$ of the user node.
5. $G_5 \leftarrow G_2 \odot_{\langle \delta, \mathcal{F} \rangle} G_4$, where $\delta = (\mathit{tgt}, \mathit{tgt})$ and $\mathcal{F}$ is a composition function that computes the Jaccard similarity between user John and every other user and assigns the result to the attribute $\mathit{sim}$ on the links produced by composition. The attribute $\mathit{vst}$ of each user contains the necessary information for $\mathcal{F}$ to compute the Jaccard similarity between John and other users. Notice that this step produces one link from John to another user for every common place visited by both. The value of $\mathit{sim}$ on all these links is the same.
6. $G_6 \leftarrow \gamma^L_{\langle \mathit{sim} > \tau,\ \mathit{type},\ \mathcal{A} \rangle}(G_5)$, where $\tau$ is a similarity threshold and $\mathcal{A}$ is an aggregation function that assigns the constant string value 'match' to the destination attribute $\mathit{type}$ and retains the value of $\mathit{sim}$ from any of the input links. This step replaces each set of links from John to another user whose similarity to him exceeds $\tau$ by a new link with type = 'match'. Notice that this is well defined, since all the links being replaced carry the same value of $\mathit{sim}$.
7. $G_7 \leftarrow \sigma^L_{\mathit{type} = \text{'visit'}}(G \ltimes_{\mathit{tgt}, \mathit{src}} \sigma^N_{\mathit{type} = \text{'destination'}}(G))$. This step computes users and the destinations they have visited.
8. $G_8 \leftarrow (G_6 \ltimes_{\mathit{tgt}, \mathit{src}} G_7) \odot_{\langle (\mathit{tgt}, \mathit{src}),\ \mathit{sim\_sc},\ \mathcal{F} \rangle} (G_7 \ltimes_{\mathit{src}, \mathit{tgt}} G_6)$. This step composes the two graphs: John and his similarity network friends (those with similarity above the threshold), and users and the destinations they have visited. For each of John's similarity network friends who has visited a destination, a new link is added between John and that destination. The function $\mathcal{F}$ simply copies the value of attribute $\mathit{sim}$ of the link from John to the user onto the new link from John to the destination node and assigns this value to the attribute $\mathit{sim\_sc}$.
9. $G_9 \leftarrow \gamma^L_{\langle C,\ \mathit{score},\ \mathit{AVERAGE} \rangle}(G_8)$. For each destination node, we replace the set of links from John to the destination node by one new link with an attribute $\mathit{score}$. The value of $\mathit{score}$ is computed by taking the average of the $\mathit{sim\_sc}$ values on the links being aggregated.

Finally, destination nodes so obtained can be recommended to John on the basis of the computed $\mathit{score}$ value.

Often, aggregations can involve multiple links. For example, counting the number of each user's friends who have tagged at least five URLs with the term 'baseball' involves aggregation on friend and tagging links. This leaves us with two alternatives: allowing complex aggregation conditions like a graph pattern and therefore using fewer aggregation steps, or using more aggregation steps and therefore reducing the complexity of aggregation conditions.

As an example, we illustrate the use of graph patterns for expressing aggregations more concisely. In the above example, we used link aggregation confined to aggregating over links between a pair of nodes. As a result, we first had to create links from John to each destination node, one link for each similarity network friend of John that has visited that destination (Step 8). Then we had to perform a separate link aggregation (Step 9) to compute the score of each destination being recommended to John, as the average $\mathit{sim\_sc}$ value of the recommending users. Graph patterns make it possible to achieve these steps more concisely. Figure 2 depicts a graph pattern showing a 'match' link followed by a 'visit' link. First, we compute the union $G_6 \cup G_7$ of the graphs $G_6$, $G_7$ in Example 5, which contains John, his similarity network and the destinations they have visited. The operator $\gamma^L_{\langle GP,\ \mathit{score},\ \mathcal{A} \rangle}(G_6 \cup G_7)$, where $GP$ is the graph pattern in Figure 2, creates a new link between John and a destination node whenever the latter is reachable from John by a match-visit link path. Only one link is created from John to the destination node, and the link is assigned an attribute $\mathit{score}$, whose value is computed as the average value of $\mathit{sim}$ on the match links of the set of match-visit paths from John to the destination node.

Figure 2: Example of graph pattern for collaborative filtering. [Figure labels: nodes \$1, \$2, \$3; id=101; type=match; type=visit; type=destination.]

One of the research challenges we are pursuing is to study the difference between the two approaches and identify the conditions under which one of the two approaches will be more effective.
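The collaborative filtering strategy of Example 5 can also be read procedurally. The following is a minimal Python sketch, not the algebraic implementation itself: it assumes the 'visit' links are available as a dictionary from user id to the set of visited destinations, and the names `jaccard`, `recommend`, and `threshold` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(visits, user, threshold):
    """Score destinations for `user`, mirroring the steps of Example 5.

    visits: dict mapping a user id to the set of destinations that user visited.
    Returns a dict mapping a destination to the average similarity of the
    'match' users who visited it.
    """
    mine = visits.get(user, set())  # Steps 1-2: the vst attribute of the user

    # Steps 3-6: keep a 'match' link to every other user whose Jaccard
    # similarity to `user` exceeds the threshold.
    match = {}
    for other, theirs in visits.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim > threshold:
            match[other] = sim

    # Steps 7-8: one link user -> destination per matching user who visited
    # it, carrying that user's similarity as sim_sc.
    sim_sc = defaultdict(list)
    for other, sim in match.items():
        for destination in visits[other]:
            sim_sc[destination].append(sim)

    # Step 9: link aggregation with AVERAGE.
    return {dest: sum(vals) / len(vals) for dest, vals in sim_sc.items()}


# Hypothetical data: user 101 plays the role of John.
sample_visits = {
    101: {"Denver", "Boulder"},
    102: {"Denver", "Boulder", "Aspen"},
    103: {"Denver", "Vail"},
    104: {"Miami"},
}
scores = recommend(sample_visits, 101, threshold=0.3)
```

As in the example, the sketch does not filter out destinations the user has already visited; whether to do so is a separate design decision.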
6. CONTENT MANAGEMENT
At the core of most social content sites, there are three major categories of data: site content, users' social profiles and connections, and users' site-specific social activities. Intuitively, site content is the content that users are interested in when they visit the social content site. Examples of such content include travel destinations in Y!Travel or URLs in del.icio.us. Social profiles and connections are the information regarding the users themselves (e.g., name, education, etc.) and their explicit social connections (e.g., friends, classmates, colleagues, etc.). Finally, site-specific social activities are the activities users perform on the site content. For example, in Y!Travel, users visit and browse destinations, while in del.icio.us, users bookmark URLs with tags.

How to manage these three categories of data effectively and efficiently is at the heart of the challenges to be addressed by the Content Management layer of our SocialScope system. As a first step toward this goal, we describe and analyze three alternative management models for social content sites in Section 6.1. In Section 6.2, we discuss in detail how the storage of large volumes of data can be optimized.
Logically, the social content graph is a single comprehensive graph that encompasses both content and social information relevant for the site. Physically, however, there are multiple models through which we can implement the social content graph, depending on how we maintain the social information.
Decentralized Model: In this model, each social content site maintains its own social information, including storing the user profiles and social connections, and effectively manages the entire social content graph internally. This was perhaps the dominant model in the early days of Web 2.0, when sites like del.icio.us and Flickr were just starting and were all soliciting users' profiles and social connections on their own. This led to a set of decentralized social graphs, each residing in a different social content site, and collectively forming the global social graph.

This decentralized model provides social content sites with some obvious benefits. These include full control over the entire data, which enables the site to perform comprehensive analysis on the social content graph, and an increased exit cost for users, who have to leave their social connections behind and re-establish (probably the same) connections elsewhere if they decide to switch to a different site. It has, however, a couple of major problems. First, establishing a social graph with critical mass is incredibly difficult. Many social content sites can only provide a strong user experience when they are able to leverage a large underlying social graph. For example, an event planning site is of no practical value if few of your family and friends are using it. This presents a cold start problem that few content sites can overcome. Second, social graph decentralization means that users must establish their social connections multiple times on many different sites, even though most of those connections are the same. This creates unnecessary burdens on the users and deters them from adopting emerging social content sites.
Closed Cartel Model: With the emergence of several dominant social networking sites, the Closed Cartel model has become viable. In this model, users establish and maintain their social profiles and connections at a few of the dominant social sites and let those sites, or third-party applications developed specifically for those sites, fulfill their content needs. Facebook is the prime example of this model. The social sites in this model, like in all cartels, are the biggest winners: they maintain full control over the social content graph and effectively determine which content users will have access to. Content sites in this model are reduced to social applications, with no ability to perform complex analysis on the underlying social graph. More importantly, they are also forced to adapt their user interaction experience to the overall user interaction theme of the host social sites. There are two major implications for users. First, users no longer have to maintain many social profiles and establish the same social connections at many different sites, which is a significant improvement over the Decentralized Model. Second, however, they are forced to have a central online social presence, without which they will not even have access to the contents that would otherwise have been available on content sites.

Table 2: A Comparison between Three Content Management Models for Social Content Sites.

Factor                                      Decentralized Model   Closed Cartel   Open Cartel
Users
  which site to interact with?              content site          social site     content site
  multiple same connections and profiles?   yes                   no              no
Content Sites
  control over content                      yes                   limited         yes
  control over social graph                 yes                   no              limited
  control over activities                   yes                   no              yes
Social Sites
  control over content                      no                    limited         no
  control over social graph                 no                    yes             yes
  control over activities                   no                    yes             limited
Open Cartel Model: The Open Cartel model is an integration of the Decentralized and Closed Cartel models. In this model, a few dominant social sites still maintain the social profiles and connections. However, through open standards, individual content sites are allowed to retrieve social information from them, given users' permission, and integrate it with the content they provide on their own sites. Furthermore, content sites are allowed to propagate social profiles and connections established on their own sites back to the social sites. Given this open access and depending on their levels of expertise, the content sites can now operate at one of three levels. The simplest content sites can choose to delegate the management of both the activities generated on their site and the social connections to the social sites. More sophisticated content sites can manage user activities on their own and simply rely on social sites to provide the social graph. Even more sophisticated content sites can maintain their own social graphs and keep them in sync with the social sites. These social graphs can be considered as focused views on the underlying global social graph. The implications for users are three-fold. First, similar to the Closed Cartel model, there is no need for users to repeat their profiles and connections at many different places. Second, users will have multiple points of interaction where they can consume content powered by their social profiles and connections. Finally, user experience can be fully customized instead of conforming to the look and feel of the social sites.
Discussion: A summary of the comparison between the three models is listed in Table 2. While it is relatively clear that the Decentralized Model is being replaced by the Cartel models, the jury is still out on which Cartel model will eventually come out on top. The core issue here is control over the content and social activities. In the Closed Cartel model, content sites delegate the management of social activities and the presentation of content to the social sites and essentially become applications that cannot survive without the host social site. In the Open Cartel model, social and content sites form symbiotic relationships, where social sites provide valuable information to enhance user experience on content sites, and content sites in turn realize the value of the social graphs on social sites and expand them by providing users with useful contents and engaging them in interesting activities. It is our belief that small niche content sites (e.g., your neighborhood reading group) will prefer the Closed Cartel model for ease of management, while larger content sites (e.g., New York Times and Y!Travel) would prefer the Open Cartel model.
A good understanding of social connections and activities can help cluster users and their associated contents in ways that improve data access performance. We briefly discuss how we can leverage them to cluster users for better query processing.

Consider a social content site similar to del.icio.us, where users connect with other users and tag items with tags. Let $U$ be the set of user nodes. Given a $u \in U$, we use $\mathit{items}(u)$ to denote items tagged by $u$, $\mathit{network}(u)$ to denote users connected to $u$, and $\mathit{taggers}(i, k)$ to denote users who tagged item $i$ with tag $k$.

Queries and Scores: For this study, we are interested in keyword-only queries, $Q_u = k_1, \ldots, k_n$. We first define the score of an item $i$ for user $u$ and a keyword $k_j$ as $\mathit{score}_{k_j}(i, u) = f(\mathit{network}(u) \cap \mathit{taggers}(i, k_j))$, where $f$ is a monotone function. We further define the overall score of an item $i$ for a user query $Q_u$ as $\mathit{score}(i, u) = g(\mathit{score}_{k_1}(i, u), \ldots, \mathit{score}_{k_n}(i, u))$, where $g$ is a monotone aggregate function. While the framework is general enough to permit arbitrary monotone functions $f$ and $g$, we will use $f = \mathit{count}$ and $g = \mathit{sum}$ for ease of exposition.

Indices: Typically, in Information Retrieval, one inverted list index is created for each keyword [6]. Each entry in the list contains the identifier of a document along with its score for that keyword. Storing scores allows the entries in the inverted list to be sorted, thereby enabling top-$k$ pruning [16]. While in classic IR each document has a unique score for a keyword (e.g., tf*idf [6] or probabilistic [18]), in our problem, the score of an item for a tag depends on the network of the user who is asking the query.

One straightforward adaptation to our framework is to store one inverted list per (tag, user) pair and sort items in each list according to their scores for the tag and user. We denote such an index by $\mathit{IL}^u_k$, which contains entries of the form $(i, \mathit{score}_k(i, u))$. Each item will be replicated along with its score in each (tag, user) inverted list. At query time, item scores can be aggregated across all inverted lists relevant to query keywords. However, even for a moderately sized [19] social content site, where each item receives a number of tags from a fraction of the users, the size of such an index reaches the terabyte range. This kind of space requirement can easily become prohibitive as the network and tagging activity expand.

Clustering:
In [5], we explored user clustering strategies which achieve different compromises between storage space and processing time. Here, we formalize these strategies and expand them further. The intuitive idea is to cluster users according to their social connections and activities such that score estimations can be done accurately without blowing up the index size. There are three main strategies: network-based, behavior-based and hybrid.

Given a cluster $C$, the score of an item $i$ in an index $\mathit{IL}^C_k$ is computed as the upper bound of the scores of $i$ for each user $u \in C$:

$$\mathit{score}_k(i, C) = \max_{u \in C} \mathit{score}_k(i, u) \qquad (1)$$

By storing score upper bounds, top-$k$ pruning algorithms can still be used. However, score upper bounds entail having to compute exact scores at query time for a specific user. This computation introduces some processing overhead compared with the straightforward approach, where exact scores are stored for each (tag, user) pair. To better understand this, we formalize the different user clustering methods.

DEFINITION 11 (NETWORK-BASED CLUSTER). Two users $u_1$ and $u_2$ belong to the same network-based cluster if and only if the following predicate is true:

$$\frac{|\mathit{network}(u_1) \cap \mathit{network}(u_2)|}{|\mathit{network}(u_1) \cup \mathit{network}(u_2)|} \geq \theta \qquad (2)$$

where $\theta$ is an application-defined threshold.

Two users fall into the same network-based cluster if their networks are similar enough. Given that item scores depend on user networks, it is natural to assume that an item would have a similar score for two users whose networks overlap substantially. Each user falls into a single cluster and an inverted list is created for each cluster, instead of each user. In [5], we explored the space/time compromise of network-based clustering and showed that it consumes less space than the basic strategy without incurring too much query processing overhead. The applicability of network-based clustering to larger networks, obtained by integrating different social graphs, is the subject of future research.

Unfortunately, network-based clustering may have poor performance in the following scenario. Assume a user $u_1$ whose network contains users $v_1, v_2, \ldots, v_m$ and $v_{m+1}, \ldots, v_n$. Assume another user $u_2$ whose network contains $v_1, v_2, \ldots, v_m$, and that $u_1$ and $u_2$ end up in the same cluster. However, if most of the tagging actions come from users in $v_{m+1}, \ldots, v_n$, item scores for $u_1$ and $u_2$ would be very different. Clustering $u_1$ and $u_2$ would not be beneficial and would in fact incur unnecessary processing overhead. Consequently, we further explored behavior-based clustering.

DEFINITION 12 (BEHAVIOR-BASED CLUSTER). Two users $u_1$ and $u_2$ belong to the same behavior-based cluster if and only if the following predicate is true:

$$\frac{|\mathit{items}(u_1) \cap \mathit{items}(u_2)|}{|\mathit{items}(u_1) \cup \mathit{items}(u_2)|} \geq \theta \qquad (3)$$

Here, two users belong to the same cluster if their tagging behavior is similar. In this case, the network members of a user $u$ may belong to multiple clusters. Therefore, at query time, potentially more clusters will be considered than in the network-based clustering strategy. In [5], we showed that behavior-based clustering achieves better processing time at the expense of space when compared to network-based clustering.

Ideally, one would want to combine the benefits of network-based and behavior-based clustering. We define hybrid clustering, where two users fall into the same cluster if members of their networks tag similarly. Here, we give the definition of a hybrid cluster; exploring the benefits of this strategy is the subject of future work.

DEFINITION 13 (HYBRID CLUSTER). Two users $u_1$ and $u_2$ belong to the same hybrid cluster if and only if the following predicate is true:

$$\frac{|\mathit{items}(v_1) \cap \mathit{items}(v_2)|}{|\mathit{items}(v_1) \cup \mathit{items}(v_2)|} \geq \theta \qquad (4)$$

for all users $v_1 \in \mathit{network}(u_1)$ and $v_2 \in \mathit{network}(u_2)$.

Further Discussion: We explored users' social connections and behaviors to answer a very simple kind of information discovery query: keyword-only queries. However, such social information can potentially be leveraged in many other ways, including guiding information synchronization decisions from remote social sites. For example, a user who is highly connected may require more frequent synchronization of his network from social sites. The development of a framework to guide data storage and synchronization decisions based on users' social connections and activities is an interesting research direction that needs to be explored further.
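The scoring functions and the clustering predicates of this section can be made concrete with a short Python sketch. It assumes $f = \mathit{count}$ and $g = \mathit{sum}$ as in the text, and represents $\mathit{network}$, $\mathit{items}$, and $\mathit{taggers}$ as in-memory dictionaries; all function names and the sample data are hypothetical and serve only as illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(item, user, keyword, network, taggers):
    """score_k(i, u) with f = count: members of u's network who tagged i with k."""
    return len(network[user] & taggers.get((item, keyword), set()))


def query_score(item, user, keywords, network, taggers):
    """score(i, u) with g = sum over the keywords of the query."""
    return sum(score(item, user, k, network, taggers) for k in keywords)


def same_network_cluster(u1, u2, network, theta):
    """Predicate (2): the two users' networks overlap enough."""
    return jaccard(network[u1], network[u2]) >= theta


def same_behavior_cluster(u1, u2, items, theta):
    """Predicate (3): the two users tagged similar sets of items."""
    return jaccard(items[u1], items[u2]) >= theta


def cluster_score(item, cluster, keyword, network, taggers):
    """Equation (1): upper bound of score_k(i, u) over the users in the cluster."""
    return max(score(item, u, keyword, network, taggers) for u in cluster)


# Hypothetical data: three users, their networks, and two tagged items.
network = {
    "u1": {"v1", "v2", "v3"},
    "u2": {"v1", "v2", "v4"},
    "u3": {"v5"},
}
taggers = {
    ("i1", "jazz"): {"v1", "v3"},
    ("i1", "live"): {"v2", "v4"},
    ("i2", "jazz"): {"v4", "v5"},
}
cluster = ["u1", "u2"]
bound = cluster_score("i1", cluster, "jazz", network, taggers)
```

Because the stored value is a maximum over the cluster, it never underestimates the exact score of any member, which is the property top-$k$ pruning relies on; the exact per-user score still has to be recomputed at query time, as noted above.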
7. INFORMATION PRESENTATION
Supporting effective user interactions in social content sites is not only a matter of locating relevant results for the user, but also of identifying the right presentation of results. The right presentation can help a user explore the information more effectively, especially when she is not sure about exactly what she wants, which is often the case, as we learned from the Y!Travel queries. Our vision for the Information Presentation layer is to build a dynamic result exploration framework.

In search, presentation is primarily in the form of a single ranked list of results, where a result's rank reflects its degree of relevance to the input query. In recommender systems, presentation is an important aspect and has direct implications on building users' trust and giving them incentives to participate in more activities [24, 28]. There are many interesting new challenges in information presentation, including those that are related to user interface design. Here, we focus mainly on result grouping, and on providing explanations for results and groups.
Given a set of items $I_{Q_u}$ which have been computed for a user and a query, there are many different mechanisms for grouping items in $I_{Q_u}$: Social Grouping, which defines item groups based on similarity or closeness between users who endorsed the items; Topical Grouping, which defines item groups using the abstract topics each item belongs to; and Structural Grouping, which relies on similarity in items' attributes. A key algorithmic challenge is the dynamic discovery of groups given a query result set $I_{Q_u}$. We provide here a formal definition of social grouping.

DEFINITION 14 (SOCIAL GROUPING). Two items $i_1$ and $i_2$ belong to the same social group if and only if the following predicate is true:

$$\frac{|\mathit{taggers}(i_1) \cap \mathit{taggers}(i_2)|}{|\mathit{taggers}(i_1) \cup \mathit{taggers}(i_2)|} \geq \theta \qquad (5)$$

where $\theta$ is an application-specific threshold.

The groups defined above are user-independent and could be pre-processed. When a query result $I_{Q_u}$ is computed, the task is to partition it into a set of meaningful groups. Group meaningfulness can be defined using a combination of the following criteria. First, total number of groups. Due to limited real estate on a page, the number of groups to display at a time needs to be restricted. Second, group quality, which is defined using the relevance of items in the group. Finally, group size, which is simply the number of items in the group.

Since screen real estate is limited, an interesting presentational alternative is to present the groups hierarchically, i.e., initially present a small number of groups appropriate for the screen area and, upon request, divide a group that the user is interested in into subgroups. Devising a grouping mechanism that dynamically adjusts to zoom-in and zoom-out requests is a promising presentation model that needs to be further explored.
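As an illustration of social grouping, the sketch below partitions a query result using predicate (5). The pairwise predicate is not transitive, so turning it into a partition requires a design choice that the text leaves open; this sketch assumes connected components (two items share a group if a chain of pairwise-similar items links them). The function names and sample data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def social_groups(items, taggers, theta):
    """Partition `items` into groups using predicate (5) on their tagger sets.

    Assumption: groups are the connected components of the relation
    "jaccard(taggers(i1), taggers(i2)) >= theta".
    """
    groups = []
    for item in items:
        linked, rest = [], []
        for group in groups:
            similar = any(
                jaccard(taggers[item], taggers[other]) >= theta for other in group
            )
            (linked if similar else rest).append(group)
        # Merge every group the new item is similar to, then add the item.
        merged = [other for group in linked for other in group] + [item]
        groups = rest + [merged]
    return groups


# Hypothetical query result with the users who tagged each item.
sample_taggers = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {9},
    "d": {3, 4, 5},
}
result_groups = social_groups(["a", "b", "c", "d"], sample_taggers, theta=0.5)
```

Here items "a" and "d" end up together only through "b"; a stricter alternative would require every pair within a group to satisfy the predicate, at the cost of more and smaller groups.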
Another challenge is to provide explanations of the results and descriptions of the groups. Unlike in web search, results from information discovery on social content sites are often endorsed by other users or are connected to other interesting objects, i.e., there exists a so-called social provenance. Making users aware of the social provenance often allows them to make more informed decisions as to what to do with the results. Similarly, providing descriptions of result groups can help them better understand the semantics behind those groups and therefore make better choices on what to explore further.

An explanation for a recommended item depends on the underlying recommendation strategy used [30]. If an item $i$ is recommended to user $u$ by a content-based strategy, then an explanation for recommendation $i$ is defined as:

$$\mathit{Expl}(u, i) = \{\, i' \in I \mid \mathit{ItemSim}(i, i') > 0 \;\wedge\; i' \in \mathit{Items}(u) \,\}$$

i.e., the set of items similar to $i$ that user $u$ has rated in the past. The explanation may contain more information such as the similarity weight $\mathit{ItemSim}(i, i') \times \mathit{rating}(u, i')$. Here, $\mathit{ItemSim}(i, i')$ returns a measure of similarity between two items $i$ and $i'$, and $\mathit{rating}(u, i')$ indicates the rating of item $i'$ by user $u$ (it is 0 if $u$ has not rated $i'$).

If an item $i$ is recommended to user $u$ by a collaborative filtering strategy, then an explanation for a recommendation $i$ is:

$$\mathit{Expl}(u, i) = \{\, u' \in U \mid \mathit{UserSim}(u, u') > 0 \;\wedge\; i \in \mathit{Items}(u') \,\}$$

i.e., the set of users similar to $u$ who have rated item $i$. Similarly to item-based explanations, we can augment each user $u'$ in the explanation with the similarity weight $\mathit{UserSim}(u, u') \times \mathit{rating}(u', i)$. Here, $\mathit{UserSim}(u, u')$ returns a measure of similarity or connectivity between two users $u$ and $u'$ (it is 0 if $u$ and $u'$ are not connected).

In all cases, the explanation of a recommendation is either a set of items or a set of users, possibly together with weights as described above.
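The two explanation definitions can be made concrete with a minimal Python sketch. Everything named here is an assumption for illustration: the `ratings` dictionary (user to item to rating, with Items(u) taken to be the items that u has rated), the toy similarity tables, and the function names. The sketch returns each explanation as a mapping from the explaining item or user to its similarity weight.

```python
def content_based_explanation(u, i, ratings, item_sim):
    """Items i' rated by u with ItemSim(i, i') > 0, weighted by ItemSim * rating."""
    return {
        other: item_sim(i, other) * rating
        for other, rating in ratings[u].items()
        if item_sim(i, other) > 0
    }


def collaborative_explanation(u, i, ratings, user_sim):
    """Users u' with UserSim(u, u') > 0 who rated i, weighted by UserSim * rating."""
    return {
        other: user_sim(u, other) * ratings[other][i]
        for other in ratings
        if i in ratings[other] and user_sim(u, other) > 0
    }


# Hypothetical data: ratings[user][item] = rating; missing entries mean "not rated".
ratings = {
    "ann": {"louvre": 5, "orsay": 4},
    "bob": {"louvre": 3, "eiffel": 5},
    "cy": {"eiffel": 2},
}

# Toy symmetric similarity tables (assumed); unlisted pairs have similarity 0.
ITEM_SIM = {frozenset({"eiffel", "louvre"}): 0.5}
USER_SIM = {frozenset({"ann", "bob"}): 0.8}


def item_sim(a, b):
    return ITEM_SIM.get(frozenset({a, b}), 0.0)


def user_sim(a, b):
    return USER_SIM.get(frozenset({a, b}), 0.0)


# Explain why "eiffel" might be recommended to "ann" under each strategy.
by_items = content_based_explanation("ann", "eiffel", ratings, item_sim)
by_users = collaborative_explanation("ann", "eiffel", ratings, user_sim)
```

In this toy data, the content-based explanation contains only "louvre" (the one item ann rated that is similar to "eiffel"), and the collaborative explanation contains only "bob" (the one connected user who rated "eiffel").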
Given an item explanation, there are many presentation alternatives. The most straightforward option is to list the set of users or items in the explanation of each item. Another alternative is to return aggregate information such as: "60% of your friends endorsed this item" or "This item is similar to 75% of items you visited before". The challenge is when and how to generate such aggregate information efficiently.

We can also define a group explanation, $\mathit{Expl}(u, g)$, as an aggregation over individual item explanations in the group. However, it is more intriguing to explore how we can effectively convert individual explanations for items in a group into a concise explanation at a group level.
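As a rough illustration of the aggregate presentation style and of a group explanation built by aggregation, consider the following Python sketch. The function names, the message wording for the empty case, and the choice of summing weights and keeping the top-k contributors are all assumptions made for this example; the text leaves the actual aggregation strategy open.

```python
from collections import Counter


def friend_endorsement_message(friends: set, endorsers: set) -> str:
    """Summarize a user-based explanation as a share of the user's friends."""
    if not friends:
        return "None of your friends have endorsed this item yet"
    share = len(friends & endorsers) / len(friends)
    return f"{share:.0%} of your friends endorsed this item"


def group_explanation(item_explanations: list, k: int = 3) -> list:
    """Aggregate weighted item explanations for a group.

    ASSUMPTION: weights of the same explaining user (or item) are summed
    across the group, and only the k largest totals are kept.
    """
    totals = Counter()
    for explanation in item_explanations:
        totals.update(explanation)  # adds weights key by key
    return totals.most_common(k)


# Hypothetical inputs.
message = friend_endorsement_message(
    friends={"bob", "cy", "dee", "eve", "fay"},
    endorsers={"bob", "cy", "dee", "zed"},
)
summary = group_explanation(
    [{"bob": 4.0, "cy": 1.0}, {"bob": 2.0, "dee": 3.0}],
    k=2,
)
```

A summation of this kind is simple to compute, but it does not by itself address the harder question raised above of producing a concise, meaningful explanation at the group level.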
8. RELATED WORK
In a series of works, Mendelzon et al. [12, 11, 10] proposed query languages for manipulating graphs. The G+ language [12] was proposed as a complementary language for Datalog, for expressing recursive queries using visual concepts. Later, G+ was extended into Hy+ [11], a hypergraph-based visualization and querying language. In [10], additional primitives were added to support aggregation over edges as well as paths, without explicit recursion, but using transitive closure as a primitive. Amann and Scholl [4] proposed the Gram model and language for querying hypertext data modeled as graphs. The language includes limited support for recursion. All of these languages, however, use graph patterns extensively within their queries. This is in contrast to our algebraic approach, which relies on a set of operators that manipulate nodes and links.

In the context of semi-structured data, substantial work has been done on graph querying (e.g., StruQL [17], UnQL [9], and Lorel [1]). Much of the emphasis was on querying graphs using regular path expressions over edge labels. Such expressions are too heavyweight for our applications. Finally, in the context of object-oriented databases, the GOOD data model and query language were developed by Paredaens et al. [20]. A key distinction between virtually every paper on graph querying and our work is that we do not expect the user to interact with the system using our query language. In addition, none of these previous works considers the integration of search, querying, and recommendation.

Indeed, while search and recommendation have been investigated separately, their combination has received very little attention, with perhaps the only exception being [15], where the authors studied the effectiveness of scoring functions in both search and recommendation. Another closely related work [26] developed OLAP-style algorithms to answer social queries such as returning all the tags of a given user.
Neither paper addresses the challenges of social content analysis, which is substantially more complex than queries.

Several approaches have been developed in the context of Web search result presentation. The approach in [25] is based on clustering results into groups of related topics. Gravano and Dakka [13] describe a hybrid method for summarizing online news articles which leverages pre-computation in order to efficiently compute document clusters at query time. By contrast, our study focuses on result exploration through social, structural and topical groupings. In [22], the authors propose a presentation layer on top of a relational database in order to improve its usability, stressing the importance of provenance. While the idea of presentation is common to ours, their focus is not on information discovery over social content sites.

Finally, faceted search [14, 7] supports richer information discovery tasks over structured data. However, it mainly focuses on exposing hidden data correlations and providing aggregate counts along with each facet. It will be interesting to explore if social provenance can be considered with the faceted search framework.
9. CONCLUSION
We envision that domain-specific social content sites will increasingly become a part of our online life. We motivated information discovery over such (real or virtual) social content sites and identified several major challenges. In particular, we proposed SocialScope, a logical architecture with three layers: Information Discovery, Content Management and Information Presentation. We discussed key issues and contributions in each layer.

In the context of Information Discovery, we proposed an algebraic framework to manipulate social content graphs. To the best of our knowledge, our algebra is the first one that is capable of manipulating social content graphs in a uniform and flexible way. In the context of Content Management, we identified three main categories of data within social content sites: site content, social profiles and connections, and site-specific social activities. We examined three alternative content management models, each defined by how it manages the three categories of data, and compared their benefits and drawbacks. We also discussed how to leverage common user behaviors to optimize data storage and indexing for query processing. Finally, in the context of Information Presentation, we discussed how novel ways of presenting information to users can help them understand the large variety of content discovered from social content sites.

We believe that SocialScope offers a framework in which key challenges in data management in social content sites can be addressed by our research community.
10. REFERENCES
[1] S. Abiteboul et al. The Lorel query language for semistructured data. Intl. J. Digital Libraries, 1:68–88, 1997.
[2] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 17(6), 2005.
[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. In SIGMOD, 1993.
[4] B. Amann and M. Scholl. Gram: a graph data model and query languages. In ECHT, 1992.
[5] S. Amer-Yahia, M. Benedikt, L. Lakshmanan, and J. Stoyanovich. Efficient Network-Aware Search in Collaborative Tagging Sites. In VLDB, 2008.
[6] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[7] O. Ben-Yitzhak et al. Beyond basic faceted search. In WSDM, 2008.
[8] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Machine Learning Research, 3(1):993–1022, 2003.
[9] P. Buneman, M. Fernandez, and D. Suciu. Unql: a query language and algebra for semistructured data based on structural recursion. The VLDB Journal, 9(1):76–110, 2000.
[10] M. P. Consens and A. O. Mendelzon. Low complexity aggregation in graphlog and datalog. In Proceedings of the 1990 International Conference on Database Theory, Paris, France, pages 379–394. Springer Verlag, 1990.
[11] M. P. Consens and A. O. Mendelzon. Hy+: A hygraph-based query and visualization system. In SIGMOD, 1993.
[12] I. F. Cruz, A. O. Mendelzon, and P. T. Wood. A graphical query language supporting recursion. In SIGMOD, 1987.
[13] W. Dakka and L. Gravano. Efficient summarization-aware search for online news articles. JCDL, pages 63–72, 2007.
[14] D. Debabrata et al. Dynamic Faceted Search for Discovery-Driven Analysis. In CIKM, 2008.
[15] R. S. et al. Social wisdom for search and recommendation. Data Engineering Bulletin, pages 40–49, 2008.
[16] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. JCSS, 66(4):614–656, 2003.
[17] M. Fernández, D. Florescu, J. Kang, A. Levy, and D. Suciu. Catching the boat with strudel: experiences with a web-site management system. In SIGMOD, 1998.
[18] N. Fuhr and T. Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32–66, 1997.
[19] S. A. Golder and B. A. Huberman. The structure of collaborative tagging systems. Information Dynamics Lab, HP Labs, 2006. Available from http://arxiv.org/pdf/cs/0508082.
[20] M. Gyssens, J. Paredaens, J. V. den Busche, and D. V. Gucht. A graph-oriented object database model. In PODS, 1990.
[21] Y. E. Ioannidis. Emerging open agoras of data and information. In ICDE, pages 1–5, 2007.
[22] H. V. Jagadish, A. Chapman, A. Elkiss, M. Jayapandian, Y. Li, A. Nandi, and C. Yu. Making database systems usable. In SIGMOD, 2007.
[23] M. E. Kipp and D. G. Campbell. Patterns and inconsistencies in collaborative tagging systems: An examination of tagging practices. Proceedings American Society for Information Science and Technology, 2006.
[24] J. A. Konstan. Introduction to recommender systems. In SIGIR, 2007.
[25] K. Kummamuru et al. A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. In WWW, 2004.
[26] K. Morfonios and G. Koutrika. OLAP cubes for social searches: Standing on the shoulders of giants? In WebDB, 2008.
[27] M. E. J. Newman. Models of the Small World. Journal of Statistical Physics, 101(3-4):819–841, November 2000.
[28] N. Tintarev and J. Masthoff. Effective Explanations of Recommendations: User-Centered Design. In RecSys, 2007.
[29] D. Watts and S. H. Strogatz. Collective Dynamics of "Small-World" Networks. Nature, 393:440–442, 1998.
[30] C. Yu, L. Lakshmanan, and S. Amer-Yahia. It takes variety to make a world: Diversification in recommender systems. In