Ranking Object under Team Context
Xiaolu Lu, Dongxu Li, Xiang Li†, and Ling Feng
School of Software, Nanjing University
{mf1232050,mf1332027,lx}@software.nju.edu.cn
Department of Computer Science and Technology, Tsinghua University
[email protected]
Abstract.
Context-aware databases have drawn increasing attention from both industry and academia recently by taking users’ current situation and environment into consideration. However, most of the literature focuses on individual context, overlooking team users. In this paper, we investigate how to integrate team context into the database query process to help users get top-ranked database tuples and make the team more competitive. We review a naive query algorithm and propose an optimized one to select suitable records, and show that they output the same results while the latter is more computationally efficient. Extensive empirical studies are conducted to evaluate the query approaches and demonstrate their effectiveness and efficiency.
Millions of users carry portable devices in the palm of their hands. This has led to the rapid development of context-aware databases, whose users expect suitable query results based on their ambient environment. At the same time, context-aware querying has been widely explored to tackle the many-answers problem and avoid overwhelming users with information. Essentially, these applications keep context information to predict users’ preferences. Research in context-aware querying has mainly focused on contexts from sensors and user profiles rather than the users’ organization-level context, i.e., the team context.

Recently, the problem of context-aware database querying has drawn increasing attention from both industry and academia. Many approaches have been proposed to cope with the problem, and they can be divided into two categories: qualitative and quantitative. Qualitative approaches model the user preference as a partial order and apply logic tools to reason about the user’s intention [14]. Quantitative approaches, on the other hand, compute the users’ satisfaction with a score function [8]. However, most of them are based on individual context. In [7], group context is considered, but the group cannot change during the query process.

In this paper, we propose the problem of ranking under team context (RTC), which queries the database from a team’s perspective and aims to help the users reach a more competitive context by ranking tuples and replacing some team component with a top-ranked tuple. For example, NBA teams prepare their rosters and select prospective players in the hope of qualifying for the play-offs (the finals of the NBA) in the next season. To this end, we need to consider the team context as a united group when querying the player database for the best player, i.e., which player the team needs to acquire in the coming season to make itself a serious candidate for the play-offs, and which player in the current team should be included in a trade.

† Xiang Li is the corresponding author.
To the best of our knowledge, this work is the first to focus on team context queries, while traditional context-aware methods rely on individual context. When the whole organization’s background is taken into consideration, querying becomes more practical and convenient for company customers. Moreover, team context-aware queries make it easy to get different query results from different layers of a hierarchy, which matches the innate characteristics of many contexts.

The brute force method can be quite inefficient due to excessive I/O overhead. To address this limitation, we introduce an I/O-efficient approach, RTC*, based on nearest neighbour search. RTC* calculates the exact virtual component that the user needs as a replacement and maps it to the database space. With the nearest neighbour technique, we obtain the ranking of query results. We prove that RTC* produces the same results as the brute force method.

We summarize our key contributions as follows:
– We define the RTC problem.
– We propose a solution to the RTC problem based on NN-indexing and prove its correctness.
– We evaluate our algorithms by experiments in terms of effectiveness and efficiency.

The rest of the paper is organized as follows: Section 2 describes related work with a comparison. Section 3 defines the RTC problem, and Section 4 proposes our method with a review of the baseline method. Section 5 presents the experiments with an analysis of the results. Section 6 concludes and discusses future work.
Object ranking under team context is a kind of context-aware query processing, which aims to help systems provide query results after understanding the real intentions behind the queries. More specifically, our work handles contexts that have a team property, with the goal of being more competitive and approaching teams with higher rank. Research in the field of context-aware querying can be roughly divided into two categories: qualitative and quantitative. In quantitative strategies, preferences over database tuples are calculated by score functions [8, 12]. In qualitative strategies, logical rules are hard-coded into the database system to infer the users’ preferences [15]. Group or team context, however, was overlooked for quite a long time. Recently, research on group preferences has been reported. In [13], Stefanidis et al. generalized their previous work on a hierarchical context model to tackle the needs of a group. In [7], Li and Feng propose several methods to meet most of the people’s contexts. However, all these works consider a group as the union of its individuals or of most of its members. In our work, team context is taken in its entirety. The object selection based on team context aims to make the team closer to its rivals. The context tackled in this paper is formed by objects from the object space, which has not been exploited before.

The k-NN algorithm is one of the most widely used approaches in many fields. It was first proposed by [1] and has been continuously improved and refined for specific purposes, especially in spatial databases and sensor networks. Moreover, applying the k-NN approach to high-dimensional data has attracted much attention. [2] proposed a new method for k-NN processing in high-dimensional data and provided a lower bound on the distance between feature vectors. [6, 16] reviewed and put forward a method with hybrid index techniques for addressing the so-called “curse of dimensionality” problem.
The state-of-the-art high-dimensional indexing technique iDistance was proposed by [4] to enhance the efficiency of existing approaches. Recently, [17] proposed the G-tree index for finding the k nearest objects to a given location. [11] carefully reviewed the techniques for partitioning the data space with iDistance.

Consider an example from the NBA. If the number of games won by a team ranks in the top 10 in the regular season, the team is guaranteed to enter the play-offs. What should a team ranked 11th ∼ … do? Suppose a team $C$ ranked 17th in the NBA wants to enter the play-offs. From the team’s view, if it could approach or even supersede one of the top 10 teams, its chance of entering the play-offs becomes greater. We refer to the team to be surpassed by the current one as the target team.

To achieve this goal, usually one player in $C$ is exchanged with another one acquired in a transaction. Which pair of players should be selected to fulfil this goal is a challenging question that needs to be answered.

Similar scenarios also occur in other areas, such as teams of software developers, clusters of computers, etc. Motivated by these scenarios, the problem solved in this paper can be interpreted as ranking the objects and selecting the ones that serve as substitutes for objects in the team.

Given an object space $\mathcal{O}$ with $n$ $d$-dimensional objects, the team context (TC) in our paper is defined as a context $C$ formed by $m$ objects, $C = \{O_1, O_2, \ldots, O_m\}$ with $O_j \in \mathcal{O}$ for every $O_j \in C$, like how teams are formed in the NBA. We also define a target team $T$ for $C$ to approach. Clearly, the contribution of each object differs across team contexts, just as the performance of one player varies in different NBA teams. Thus, for exchanging objects, a set of exchange parameters $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$ is defined for measuring the contribution of $O \in \mathcal{O}$ under the current TC. Formally, our problem is:

Problem 1.
Ranking under Team Context (RTC Problem). Rank the objects in $\mathcal{O}$ and determine a swap-in object which is top-ranked with respect to a swap-out object in $C$. After performing the exchange procedure, $C$ approaches $T$ as closely as possible.

Since the team context $C$ is formed by objects in $\mathcal{O}$, $C$ can also be described by the contributions of its components. There exist different ways of calculating the contributions of components, depending on how different contexts are organized. In this paper, we adopt the method in which the team’s ability is the accumulation of all its components, since it is the most widely used in real scenarios, as shown in (1):

$$c_i = \sum_{j=1}^{m} o_{ji} \qquad (1)$$

where $c_i$ is the value of $C$ on the $i$-th dimension and $o_{ji}$ is the value of the $j$-th object on its $i$-th dimension.

Contributions of Attributes
Although the final ranking of a team depends on the values of all attributes, not all of them weigh equally. To identify the importance of each attribute, we adopt Kendall’s tau ($\tau$) coefficient, a rank correlation coefficient that measures the association between two measured quantities [5]. By calculating the association between each dimension and the final ranking of the team contexts pairwise, a coefficient is obtained for each dimension $i$ and regarded as its weight parameter, denoted $w_i$.

Truncated Distance
Usually, we measure the difference between contexts (or objects) by weighted Euclidean distance. However, a plain non-negative distance fails to represent the overall condition of a TC, especially in the case shown in Fig. 1(a). The 2-D case in Fig. 1(a) shows that $C$ exceeds $T$ on dimension $x$ while lagging behind $T$ on dimension $y$. If we only measure the weighted Euclidean distance, $C$ might be far away from $T$ because of its advantage on dimension $x$; approaching $T$ would then mean losing strength on dimension $x$, which contradicts the team’s goal. In order to preserve the advantage of $C$ while approaching $T$, we adopt the truncated distance as the measurement, as shown in Fig. 1(c). As illustrated, if this first case happens, we only consider the distance between $C'$ and $T$ rather than between $C$ and $T$. The other case, shown in Fig. 1(b), is simpler to handle since $C$ lags behind $T$ on both dimensions $x$ and $y$, so the distance is the ordinary weighted Euclidean distance.

[Figure 1: three 2-D panels (a)–(c) with axes x and y and origin O, each showing the points T and C; panel (c) additionally shows the truncated point C′.]

Fig. 1. Example of Truncated Distance
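The truncation idea in Fig. 1 can be made concrete with a small Python sketch. The function name `truncated_distance` and the 2-D coordinates below are illustrative assumptions, not values taken from the figure; the sketch only assumes that a dimension on which C already exceeds T contributes nothing to the distance.

```python
import math

def truncated_distance(c, t, w):
    """Weighted Euclidean distance from context c to target t that ignores
    every dimension on which c already exceeds t (hypothetical helper)."""
    total = 0.0
    for c_i, t_i, w_i in zip(c, t, w):
        diff = t_i - c_i
        tv = 0 if diff < 0 else 1          # 0-1 truncation flag for this dimension
        total += (w_i * diff * tv) ** 2
    return math.sqrt(total)

# Made-up coordinates for illustration only.
w = (1.0, 1.0)
t = (0.5, 1.0)
c_ahead_on_x = (1.0, 0.5)   # exceeds T on x, lags on y: only the y gap counts
c_behind = (0.2, 0.6)       # lags on both dimensions: both gaps count

d_ahead_on_x = truncated_distance(c_ahead_on_x, t, w)
d_behind = truncated_distance(c_behind, t, w)
```

In the first case the advantage on x is simply dropped, so moving C towards T never requires giving up strength on x.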
For clarity, we define a 0-1 truncating vector $\overrightarrow{TV} = (tv_1, tv_2, \ldots, tv_d)$ to describe the truncated distance. Denote by $\mathrm{diff}_i$ the difference on dimension $i$ and by $\overrightarrow{TV}(i)$ the $i$-th component of $\overrightarrow{TV}$. The truncated difference $\widetilde{\mathrm{diff}}_i$ on the $i$-th dimension is:

$$\widetilde{\mathrm{diff}}_i = \mathrm{diff}_i \times \overrightarrow{TV}(i) = (t_i - c_i) \times \overrightarrow{TV}(i) \qquad (2)$$

Notice that dimensions where $t_i - c_i < 0$ are those on which $C$ already exceeds $T$; they are truncated by setting $\overrightarrow{TV}(i) = 0$. The remaining dimensions are referred to as weak dimensions accordingly. According to (2), the truncated weighted Euclidean distance $\widetilde{\mathrm{dis}}$ is:

$$\widetilde{\mathrm{dis}} = \sqrt{\sum_{i=1}^{d} \left(w_i\, \widetilde{\mathrm{diff}}_i\right)^2} \qquad (3)$$

All distance measurements in this paper are truncated distances, and $\overrightarrow{TV}$ is referred to as the truncating vector henceforth. For ease of presentation, we denote by $\widetilde{oDis}$ the truncated distance between two objects, calculated using (3).

Exchange Procedure
Define the exchange procedure as swapping $R = (r_1, r_2, \ldots, r_d)$ in $C$ with $P = (p_1, p_2, \ldots, p_d)$ in $\mathcal{O}$. The new difference $\mathrm{diff}'_i$ is then:

$$\mathrm{diff}'_i = t_i - \left(c_i - r_i + \frac{\lambda_r}{\lambda_p}\, p_i\right) \qquad (4)$$

where $\lambda_r, \lambda_p \in \Lambda$ are the exchange parameters defined in Section 3.2.

Accordingly, the truncated difference $\widetilde{\mathrm{diff}}'_i$ after the exchange procedure can be calculated by (2) with a new 0-1 truncating vector $\overrightarrow{TV}'$ based on the situation on each dimension. Hence, the new distance $\widetilde{\mathrm{dis}}'$ after exchanging objects can be calculated from $\widetilde{\mathrm{diff}}'_i$ by applying (3).

Before we propose the RTC* method, we define a virtual object as follows:
Definition 1. (Virtual Object) Define a virtual object $V = (v_1, v_2, \ldots, v_d)$ which makes $C$ have the same value as $T$ on each weak dimension after being exchanged with object $R$ in $C$. Thus, the value of the virtual object on dimension $i$ is:

$$v_i = \frac{\mathrm{diff}_i + r_i}{\lambda_r} \times \overrightarrow{TV}'(i) \qquad (5)$$

where $\overrightarrow{TV}'$ is the new truncating vector for the virtual object and $r_i$ is the value of the swap-out object $R$ on dimension $i$.

Corollary 1.
Assume $\forall w_i \in W = (w_1, w_2, \ldots, w_d)$, $w_i > 0$, and denote the truncated distance between objects by $\widetilde{oDis}$. The nearest neighbours of the virtual object measured by $\widetilde{oDis}$ are the top-ranked objects that make $C$ closer to $T$.

Proof. Suppose we can find a nearest neighbour $P$ of $V$, with $\lambda_p P \in \mathcal{O}$. The truncated difference between $V$ and $P$ is represented by $\widetilde{\Delta}$, where $\widetilde{\Delta}(i)$ is the truncated value on dimension $i$. Thus, $\widetilde{\mathrm{diff}}'_i$ is:

$$\widetilde{\mathrm{diff}}'_i = \left(\frac{\mathrm{diff}_i + r_i}{\lambda_r} - p_i\right) \times \overrightarrow{TV}'(i) \qquad (6)$$

where $\overrightarrow{TV}'$ is the truncating vector and $\overrightarrow{TV}'(i) = 0$ iff $\widetilde{\Delta}(i) < 0$, so $\widetilde{\mathrm{dis}}'$ can be represented as:

$$\widetilde{\mathrm{dis}}' = \sqrt{\sum_{i=1}^{d} \left(w_i\, \widetilde{\mathrm{diff}}'_i\right)^2} \qquad (7)$$

Since $\forall w_i \in W = (w_1, w_2, \ldots, w_d)$, $w_i > 0$, (7) can also be represented as:

$$\widetilde{\mathrm{dis}}' = \sqrt{\sum_{i=1}^{d} \left(w_i\,(v_i - p_i)\,\overrightarrow{TV}'(i)\right)^2} = \sqrt{\sum_{i=1}^{d} \left(w_i\,\widetilde{\Delta}(i) \times \overrightarrow{TV}'(i)\right)^2} = \widetilde{oDis} \qquad (8)$$

□

So our problem of ranking objects from the perspective of team context can be mapped into the object space. That is, by considering the nearest neighbours of the virtual objects under the current team context, we can obtain the top-ranked tuples.

We can index the truncated distance between every $O_i \in \mathcal{O}$ and the virtual object for convenience of searching. As presented in Algorithm 1, we first calculate the virtual object based on the current team context and then index the $\widetilde{oDis}$ between each $O \in \mathcal{O}$ and the virtual object with iDistance as presented in [16] to process the query.

The query time depends only on the cardinality of the current context $C$, so RTC* is expected to show high performance and good scalability on very large datasets.

Algorithm 1:
RTC* Method
Input: current context $C$, target context $T$, $\mathcal{O}$
Output: $\langle R_i, P_i \rangle$
foreach $R_i \in C$ do
    Calculate $V_i$ using (5);
    foreach $P_i \in \mathcal{O}$ do
        Index $\widetilde{oDis}$ between $P_i$ and $V_i$;
find $\langle R_i, P_i \rangle$ with Min($\widetilde{oDis}$);
return $\langle R_i, P_i \rangle$;

All the experiments were performed on a machine with an Intel Core(TM) i3 CPU and 4 GB RAM running 32-bit Windows 7.
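Algorithm 1 can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions rather than the authors' implementation: candidate objects are assumed to be already expressed per unit of the exchange parameter (for example, per minute played), a plain linear scan stands in for the iDistance index, and all function names, variable names and data values are hypothetical.

```python
import math

def virtual_object(c, t, r, lam_r):
    """Per-unit values a swap-in object would need so that the team matches
    the target on every lagging dimension after r leaves (cf. (5))."""
    v = []
    for c_i, t_i, r_i in zip(c, t, r):
        need = (t_i - c_i + r_i) / lam_r       # (diff_i + r_i) / lambda_r
        v.append(need if need > 0 else 0.0)    # nothing is needed on this dimension
    return v

def o_dis(v, p, w):
    """Truncated weighted distance between candidate p and virtual object v;
    dimensions where p meets or exceeds v contribute nothing."""
    return math.sqrt(sum((w_i * max(v_i - p_i, 0.0)) ** 2
                         for v_i, p_i, w_i in zip(v, p, w)))

def rtc_star(team, lams, t, candidates, w):
    """For each swap-out object, return (distance, name) of its nearest candidate.
    Assumption: a linear scan replaces the NN index used in the paper."""
    c = [sum(col) for col in zip(*team.values())]   # team context as in (1)
    result = {}
    for r_name, r in team.items():
        v = virtual_object(c, t, r, lams[r_name])
        result[r_name] = min((o_dis(v, p, w), p_name)
                             for p_name, p in candidates.items())
    return result

# Made-up 2-D example: two team members, two per-unit candidates.
team = {"R1": (4.0, 2.0), "R2": (2.0, 6.0)}        # team context C = (6, 8)
lams = {"R1": 2.0, "R2": 2.0}                      # exchange parameters of the members
t = (8.0, 7.0)                                     # target context
candidates = {"P1": (3.0, 0.5), "P2": (1.0, 3.0)}  # already divided by their lambda_p
w = (1.0, 1.0)

best = rtc_star(team, lams, t, candidates, w)
# {'R1': (0.0, 'P1'), 'R2': (1.0, 'P2')}
```

Here swapping R1 for P1 closes the gap to the target completely, which is why its distance to the virtual object is zero.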
Datasets
We perform our experiments on both real and synthetic data. The real dataset is obtained from [9] and consists of the total statistics of the NBA regular season from 2011 to 2012. It contains 400 players with 24 attributes and 30 teams described by 20 dimensions. The size of the player dataset is 39.5 KB and that of the team dataset is 20 KB.

Attributes that can discriminate between season-long successful and unsuccessful basketball teams, according to research on basketball in [3, 10], are FG, 3P, 3PA, BLK, FT, STL, FTA, PTS, AST, DRB and TRB. We use this attribute set for our experiments as well.

The synthetic dataset is generated based on the features of the real dataset, with total 1 . × records and a size of 69 MB. The features of some dimensions are illustrated in Fig. 2. We make the hypothesis H that the values of the dimensions listed in Table 1 follow a negative binomial distribution and fit the distribution accordingly.

[Figure 2: density plots of the player data with fitted curves; (a) Player PTS, (b) Player STL.]

Fig. 2. Partial Review of Data Distribution
We test the hypothesis using the Chi-square goodness-of-fit test with the parameters estimated in Table 1. H is accepted at the 95% confidence level. Synthetic data are generated based on the fitted distribution.

Table 1. Estimated Parameters of Data Distribution

Dimension  r     p        Dimension  r     p        Dimension  r     p
FG         1.44  0.008    TRB        1.62  0.008    BLK        0.91  0.004
DRB        1.67  0.01     FT         1.07  0.013    STL        1.70  0.045
FTA        1.16  0.01     PTS        1.40  0.003    AST        0.93  0.0092
Because each dimension contributes differently to the teams’ final rankings, we adopt Kendall’s $\tau$, a method for measuring the association between attributes [5], to calculate the weight of each dimension. The results are listed in Table 2.

Table 2. Weight for Each Dimension

Dimension  Weight    Dimension  Weight    Dimension  Weight    Dimension  Weight
DRB        0.35      FG         0.2695    3P         0.30      AST        0.24
3PA        0.20      FT         0.2576    TRB        0.1884    STL        0.38
FTA        0.27      PTS        0.4060    BLK        0.24
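As a rough illustration of how a weight such as those in Table 2 could be obtained, the sketch below computes Kendall's τ between one attribute and a team outcome. It uses the simple tau-a variant without tie correction, which is an assumption (the text does not state which variant is used), and the attribute and outcome values are made up for the example.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither; no tie correction is applied."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical data: one attribute total per team and a score where higher is better.
attribute = [10, 20, 30, 40]
outcome = [1, 3, 2, 4]

weight = kendall_tau(attribute, outcome)   # 5 concordant, 1 discordant pair out of 6
```

With five concordant pairs and one discordant pair the coefficient is 4/6, so this made-up attribute would receive a weight of about 0.67.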
First, we select the target team on the real dataset before exchanging players. The results are shown in Table 3, together with the initial truncated distance between $C$ and $T$.

Table 3. Target Context of Each Team

C    T    Distance    C    T    Distance    C    T    Distance
DEN  BOS  6.1096      PHI  BOS  29.5286     UTA  MEM  13.6334
ORL  BOS  51.2774     HOU  ATL  31.2126     DAL  ATL  28.4675
NYK  MEM  23.6467     PHO  BOS  18.3673     MIL  MEM  19.3955
POR  LAC  31.0965
A Brute Force Method
The brute force method is performed for each “mid-class” team (teams ranked 11th ∼ …) on the real dataset as a baseline method. Take the results for HOU listed in Table 4 as an example. Two pairs of players are found that make this team approach its target, measured by truncated distance. Either pair can be selected to give this team a chance of entering the play-offs.

Table 4. Selection For HOU

Roster             Candidate     New Distance
Luis Scola         Josh Smith    0
Patrick Patterson  LeBron James  0
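The brute force baseline can be pictured as an exhaustive search over all swap-out/swap-in pairs. The following Python sketch assumes the exchange rule of (4) and the truncated distance of (3); the function names, the tiny two-object team and the two candidates are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def truncated_distance(c, t, w):
    """Weighted Euclidean distance counting only dimensions where c lags behind t."""
    return math.sqrt(sum((w_i * max(t_i - c_i, 0.0)) ** 2
                         for c_i, t_i, w_i in zip(c, t, w)))

def brute_force(team, lams, t, candidates, cand_lams, w):
    """Try every (swap-out, swap-in) pair and sort the pairs by the team's
    new truncated distance to the target."""
    c = [sum(col) for col in zip(*team.values())]        # team context as in (1)
    pairs = []
    for r_name, r in team.items():
        for p_name, p in candidates.items():
            scale = lams[r_name] / cand_lams[p_name]     # lambda_r / lambda_p
            new_c = [c_i - r_i + scale * p_i
                     for c_i, r_i, p_i in zip(c, r, p)]  # context after the swap
            pairs.append((truncated_distance(new_c, t, w), r_name, p_name))
    return sorted(pairs)

# Made-up 2-D example with raw (not per-unit) candidate totals.
team = {"R1": (4.0, 2.0), "R2": (2.0, 6.0)}
lams = {"R1": 2.0, "R2": 2.0}
candidates = {"P1": (6.0, 1.0), "P2": (4.0, 12.0)}
cand_lams = {"P1": 2.0, "P2": 4.0}
t = (8.0, 7.0)
w = (1.0, 1.0)

ranked = brute_force(team, lams, t, candidates, cand_lams, w)
# ranked[0] is (0.0, 'R1', 'P1'): swapping R1 for P1 reaches the target exactly
```

Every pair requires recomputing the whole team context, which is the cost that an index over virtual objects is meant to avoid.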
[Figure 3: bar chart “Distance to Target Team”; x-axis: Team (DEN, ORL, NYK, DAL, UTA, PHI, HOU, PHO, MIL, POR); y-axis: distance to target (0 to 60); series: Initial Distance, New Distance.]

Fig. 3. Comparison of Team’s Distance to Target
Fig. 3 depicts the distance between each “mid-class” team and its target before and after exchanging players. We can observe that many teams reach the same values as their targets on their weak dimensions, which are marked with squares.
RTC* Method
We also test the RTC* method on the real dataset. In estimating virtual players using (5), minutes played serves as the exchange parameter $\lambda_r$. Again consider HOU (Houston Rockets) as an example. The values on dimensions FG, 3P, 3PA, FT and FTA of the virtual players and the corresponding swap-in players are listed in Table 5.

Table 5. Overview of Virtual Player and Candidate

Name                      FG    3P    3PA   FT    FTA   oDis
Josh Smith                0.22  0.01  0.05  0.09  0.14  0
Virtual Player of Josh    0.18  0.01  0.00  0.08  0.14
LeBron James              0.27  0.02  0.06  0.17  0.22  0
Virtual Player of LeBron  0.11  0.01  0.00  0.03  0.10

As listed in Table 5, both selected players from the player space are at least as good as the corresponding virtual players on those dimensions. According to the definition of truncated distance, the $\widetilde{oDis}$ calculated between Josh Smith and his corresponding virtual player, and between LeBron James and his virtual player, are both 0. Therefore, those two pairs are selected as the result, which is the same as the pairs selected by brute force listed in Table 4.

Notice that both results are the nearest neighbours of the corresponding virtual players measured by $\widetilde{oDis}$, which further supports the rationale behind Corollary 1.

In this section, we mainly focus on analysing the results of the brute force method and RTC* in terms of efficiency and scalability. We fixed the block size at 100 records per block for the real dataset and 10 per block for the synthetic data.

[Figure 4: I/O times of the brute force method and RTC*; (a) I/O Test on real data, (b) I/O Test on synthetic data.]

Fig. 4. I/O Performance of Different Methods
As depicted in Fig. 4, regardless of the data size and block size, I/O is performed only once with RTC* as long as the virtual player index has been set up, while the I/O count of the brute force method varies with the TC, which shows less robustness.

As with the I/O test, we tested the time cost of the selection methods on both real and synthetic data, as illustrated in Fig. 5.

[Figure 5: elapsed time (sec) of selection for Brute Force and RTC*; (a) Time Cost on Real Data, (b) Time Cost on Synthetic Data.]

Fig. 5. Time Cost of Different Methods
The time cost of RTC* grows slowly at a constant rate, while that of the brute force method increases very fast. Unlike the brute force method, which depends heavily on the values of the team context, RTC* is robust and performs better regardless of the data values.
In this paper, we introduce the problem of object selection under team context. This problem is quite practical in many scenarios of selecting objects to improve the competence of a team or organization. We propose the brute force algorithm RTC for the problem. Furthermore, we propose an I/O-efficient algorithm RTC* based on NN-indexing, with a proof that its output is equivalent to that of RTC. Extensive experiments are conducted on both synthetic and real datasets to demonstrate the effectiveness and efficiency of our algorithms.

We would like to extend our work in two directions in the future. First, since probabilistic database tuples are not uncommon, we plan to study probabilistic object selection. Second, we plan to query the database objects based on the teams’ temporal contexts.
References
1. N. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
2. J. Hu, B. Cui, and H. T. Shen. Diagonal ordering: A new approach to high-dimensional knn processing. In ADC, pages 39–47, 2004.
3. S. J. Ibáñez, J. Sampaio, S. Feu, A. Lorenzo, M. A. Gómez, and E. Ortega. Basketball game-related statistics that discriminate between teams season-long success. European Journal of Sport Science, 8(6):369–372, 2008.
4. H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. idistance: An adaptive b+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 30(2):364–397, 2005.
5. M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
6. N. Kouiroukidis and G. Evangelidis. The effects of dimensionality curse in high dimensional knn search. In Panhellenic Conference on Informatics, pages 41–45, 2011.
7. X. Li and L. Feng. Context-aware group top-k query. In ICDIM, pages 149–154, 2012.
8. X. Li, L. Feng, and L. Zhou. Contextual ranking of database querying results: A statistical approach. In EuroSSC, pages 126–139, 2008.
9. S. R. LLC. Player season finder. , November 2013.
10. D. Oliver. Basketball on paper: rules and tools for performance analysis. Potomac Books, Inc., 2004.
11. M. A. Schuh, T. Wylie, J. M. Banda, and R. A. Angryk. A comprehensive study of idistance partitioning strategies for knn queries and high-dimensional data indexing. In BNCOD, pages 238–252, 2013.
12. K. Stefanidis, E. Pitoura, and P. Vassiliadis. Adding context to preferences. In ICDE, pages 846–855, 2007.
13. K. Stefanidis, N. Shabib, K. Nørvåg, and J. Krogstie. Contextual recommendations for groups. In ER Workshops, pages 89–97, 2012.
14. A. H. van Bunningen, L. Feng, and P. M. G. Apers. A context-aware preference model for database querying in an ambient intelligent environment. In DEXA, pages 33–43, 2006.
15. A. H. van Bunningen, M. M. Fokkinga, P. M. G. Apers, and L. Feng. Ranking query results using context-aware preferences. In ICDE Workshops, pages 269–276, 2007.
16. C. Yu, B. Cui, S. Wang, and J. Su. Efficient index-based knn join processing for high-dimensional data. Information and Software Technology, 49(4):332–344, 2007.
17. R. Zhong, G. Li, K.-L. Tan, and L. Zhou. G-tree: an efficient index for knn search on road networks. In