Yingxia Shao | Researchain

Publication

Featured research published by Yingxia Shao.

Conference on Information and Knowledge Management | 2015

Joint Modeling of User Check-in Behaviors for Point-of-Interest Recommendation

Hongzhi Yin; Xiaofang Zhou; Yingxia Shao; Hao Wang; Shazia Wasim Sadiq

Point-of-Interest (POI) recommendation has become an important means to help people discover attractive and interesting locations, especially when users travel out of town. However, the extreme sparsity of the user-POI matrix poses a severe challenge. To cope with this challenge, a growing line of research has exploited the temporal effect, geographical-social influence, content effect and word-of-mouth effect. However, current research lacks an integrated analysis of the joint effect of these factors on data sparsity, especially in the out-of-town recommendation scenario, which most existing work has ignored. In light of the above, we propose a joint probabilistic generative model that mimics the decision-making process behind user check-in behaviors and strategically integrates these factors to overcome data sparsity, especially for out-of-town users. To demonstrate the applicability and flexibility of our model, we investigate how it supports two recommendation scenarios in a unified way, i.e., home-town recommendation and out-of-town recommendation. We conduct extensive experiments to evaluate our model on two large-scale real-world datasets in terms of both recommendation effectiveness and efficiency, and the experimental results show its superiority over other competitors.

International Conference on Management of Data | 2014

Parallel subgraph listing in a large-scale graph

Yingxia Shao; Bin Cui; Lei Chen; Lin Ma; Junjie Yao; Ning Xu

Subgraph listing is a fundamental operation in many graph and network analyses. The problem itself is computationally expensive and has been well studied for centralized processing. However, centralized solutions do not scale to large graphs. Recently, several parallel approaches have been introduced to handle large graphs. Unfortunately, these approaches still rely on expensive join operations and thus cannot achieve high performance. In this paper, we design a novel parallel subgraph listing framework named PSgL. PSgL iteratively enumerates subgraph instances and solves subgraph listing in a divide-and-conquer fashion. The framework relies entirely on graph traversal and avoids explicit join operations. Moreover, to improve its performance, we propose several solutions to balance the workload and reduce the size of intermediate results. Specifically, we prove that the problem of distributing partial subgraph instances for workload balance is NP-hard, and carefully design a set of heuristic strategies. To further reduce the enormous intermediate results, we introduce three independent mechanisms: automorphism breaking of the pattern graph, initial pattern vertex selection based on a cost model, and a pruning method based on a lightweight index. We have implemented a prototype of PSgL and run comprehensive experiments with various graph listing operations on diverse large graphs. The experiments clearly demonstrate that PSgL is robust and can achieve a performance gain of up to 90% over the state-of-the-art solutions.

IEEE Transactions on Knowledge and Data Engineering | 2015

Heterogeneous Environment Aware Streaming Graph Partitioning

Ning Xu; Bin Cui; Lei Chen; Zi Huang; Yingxia Shao

With the increasing availability of graph data and the widely adopted cloud computing paradigm, graph partitioning has become an efficient pre-processing technique to balance the computing workload and cope with the large scale of input data. Since the cost of partitioning the entire graph is prohibitive, some recent tentative works have turned to streaming graph partitioning, which runs faster, is easily parallelized, and can be incrementally updated. Most existing works on streaming partitioning assume that the worker nodes within a cluster are homogeneous. Unfortunately, this assumption does not always hold. Experiments show that these homogeneous algorithms suffer significant performance degradation when running in a heterogeneous environment. In this paper, we propose a novel adaptive streaming graph partitioning approach to cope with heterogeneous environments. We first formally model the heterogeneous computing environment, taking into account the imbalance of computing ability (e.g., the CPU frequency) and communication ability (e.g., the network bandwidth) across nodes. Based on this model, we propose a new graph partitioning objective function that aims to minimize the total execution time of the graph-processing job. We then explore some simple yet effective streaming algorithms for this objective function that achieve balanced and efficient partitioning results. Extensive experiments are conducted on a moderate-sized computing cluster with real-world web and social network graphs. The results demonstrate that the proposed approach achieves significant improvement compared with the state-of-the-art solutions.
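
To make the idea of capacity-aware streaming placement concrete, here is a minimal Python sketch. It is not the objective function or algorithm from the paper: it swaps in a simple greedy rule in the style of linear deterministic greedy partitioning, where each worker's capacity is scaled by a relative speed. The function name `stream_partition`, the `speeds` parameter, and the tie-breaking rule are all assumptions made for illustration.

```python
def stream_partition(vertex_stream, speeds):
    """Greedily place streamed vertices on workers of unequal speed.

    vertex_stream: iterable of (vertex, neighbors) pairs, in arrival order
    speeds: hypothetical relative compute speeds, one per worker
    """
    vertex_stream = list(vertex_stream)
    total_speed = sum(speeds)
    # Assumption: a worker's capacity is proportional to its speed.
    capacity = [len(vertex_stream) * s / total_speed for s in speeds]
    load = [0] * len(speeds)
    placement = {}

    for vertex, neighbors in vertex_stream:
        def score(p):
            # Neighbours already placed on worker p.
            placed = sum(1 for n in neighbors if placement.get(n) == p)
            # Fraction of worker p's capacity that is still free.
            free = 1.0 - load[p] / capacity[p]
            # Prefer co-location with neighbours, then the emptiest worker.
            return (placed * free, free)

        best = max(range(len(speeds)), key=score)
        placement[vertex] = best
        load[best] += 1
    return placement, load


# A 6-vertex path streamed in order; worker 0 is twice as fast as worker 1.
path = [(0, [1]), (1, [0, 2]), (2, [1, 3]), (3, [2, 4]), (4, [3, 5]), (5, [4])]
placement, load = stream_partition(path, speeds=[2.0, 1.0])
```

On this input the faster worker fills up first and the remaining vertices spill over to the slower one, so the final loads follow the 2:1 speed ratio.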

Very Large Data Bases | 2015

An efficient similarity search framework for SimRank over large dynamic graphs

Yingxia Shao; Bin Cui; Lei Chen; Mingming Liu; Xing Xie

SimRank is an important measure of vertex-pair similarity based on graph structure. Similarity search based on SimRank is an important operation for identifying similar vertices in a graph and has been employed in many data analysis applications. Nowadays, real-world graphs are becoming much larger and more dynamic. The existing solutions for similarity search are expensive in terms of time and space cost, and none of them can efficiently support similarity search over large dynamic graphs. In this paper, we propose a novel two-stage random-walk sampling framework (TSF) for SimRank-based similarity search (e.g., top-k search). In the preprocessing stage, TSF samples a set of one-way graphs to index raw random walks in a novel manner within O(NRg) time and space, where N is the number of vertices and Rg is the number of one-way graphs. The one-way graphs can be efficiently updated as the graph is modified, so TSF is well suited to dynamic graphs. During the query stage, TSF can quickly search for similar vertices by naturally pruning unqualified vertices based on the connectivity of one-way graphs. Furthermore, with additional Rq samples, TSF can estimate the SimRank score with probability [EQUATION] if the error of approximation is bounded by 1 -- e. Finally, to guarantee the scalability of TSF, the one-way graphs can also be compactly stored on disk when memory is limited. Extensive experiments have demonstrated that TSF can handle dynamic billion-edge graphs with high performance.

IEEE Transactions on Knowledge and Data Engineering | 2015

PAGE: A Partition Aware Engine for Parallel Graph Computation

Yingxia Shao; Bin Cui; Lin Ma

Graph partition quality affects the overall performance of parallel graph computation systems. The quality of a graph partition is measured by the balance factor and the edge cut ratio. A balanced graph partition with a small edge cut ratio is generally preferred since it reduces the expensive network communication cost. However, according to an empirical study on Giraph, performance over a well-partitioned graph can be as much as two times worse than over simple random partitions. This is because these systems are optimized only for simple partition strategies and cannot efficiently handle the increasing workload of local message processing when a high-quality graph partition is used. In this paper, we propose a novel partition-aware graph computation engine named PAGE, which is equipped with a new message processor and a dynamic concurrency control model. The new message processor concurrently processes local and remote messages in a unified way. The dynamic model adaptively adjusts the concurrency of the processor based on online statistics. The experimental evaluation demonstrates the superiority of PAGE over graph partitions of various qualities.
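
The two quality measures named above can be illustrated with a short Python sketch. The abstract does not spell out the formulas, so the definitions used here are assumptions: the edge cut ratio is taken as the fraction of edges whose endpoints lie in different partitions, and the balance factor as the largest partition size divided by the mean partition size. The helper name `partition_quality` is hypothetical.

```python
from collections import Counter


def partition_quality(edges, assignment):
    """Return (edge_cut_ratio, balance_factor) for a vertex partition.

    edges: iterable of (u, v) pairs
    assignment: dict mapping each vertex to a partition id
    """
    edges = list(edges)
    # An edge is "cut" when its endpoints live in different partitions.
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    edge_cut_ratio = cut / len(edges) if edges else 0.0

    # Assumed definition: largest partition size over mean partition size,
    # so 1.0 means the partitions are perfectly balanced.
    sizes = Counter(assignment.values())
    mean_size = len(assignment) / len(sizes)
    balance_factor = max(sizes.values()) / mean_size
    return edge_cut_ratio, balance_factor


# A 4-cycle split into two halves: two of the four edges cross the cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
assignment = {0: "a", 1: "a", 2: "b", 3: "b"}
ratio, balance = partition_quality(edges, assignment)
```

For this 4-cycle the sketch yields an edge cut ratio of 0.5 and a balance factor of 1.0.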

International Conference on Management of Data | 2014

Efficient cohesive subgraphs detection in parallel

Yingxia Shao; Lei Chen; Bin Cui

A cohesive subgraph is a primary vehicle for massive graph analysis, and a newly introduced cohesive subgraph, the k-truss, which is motivated by a natural observation of social cohesion, has attracted increasing attention. However, the existing parallel solutions for identifying the k-truss are inefficient for very large graphs, as they still suffer from huge communication cost and a large number of iterations during the computation. In this paper, we propose a novel parallel and efficient truss detection algorithm called PeTa. PeTa produces a triangle complete subgraph (TC-subgraph) for every computing node. Based on the TC-subgraphs, PeTa can detect the local k-truss in parallel within a few iterations. We theoretically prove that, within this new paradigm, the communication cost of PeTa is bounded by three times the number of triangles, the total computation complexity of PeTa is of the same order as the best known serial algorithm, and the number of iterations for a given partition scheme is minimized as well. Furthermore, we present a subgraph-oriented model to efficiently express PeTa in parallel graph computing systems. The results of comprehensive experiments demonstrate that, compared with the existing solutions, PeTa saves 2X to 19X in communication cost, reduces the number of iterations by 80% to 95%, and improves the overall performance by 80% across various real-world graphs.
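
For readers unfamiliar with the k-truss, the following Python sketch shows the underlying definition rather than PeTa itself: the k-truss is the largest subgraph in which every edge belongs to at least k - 2 triangles. The code is a naive serial peeling loop, written for clarity and not efficiency; the function name `k_truss` is chosen here for illustration.

```python
def k_truss(edges, k):
    """Return the edge set of the k-truss of an undirected simple graph.

    Naive serial peeling: repeatedly drop edges that lie in fewer than
    k - 2 triangles until no such edge remains.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Support = number of triangles containing edge (u, v),
                # i.e. the number of common neighbours of u and v.
                support = len(adj[u] & adj[v])
                if support < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {frozenset((u, v)) for u in adj for v in adj[u]}


# A 4-clique on vertices 0..3 with one pendant edge (3, 4).
graph = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
truss4 = k_truss(graph, 4)
truss5 = k_truss(graph, 5)
```

In the example, the 4-truss keeps the six clique edges and drops the pendant edge, while the 5-truss is empty because no edge lies in three triangles.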

International Conference on Management of Data | 2016

Tornado: A System For Real-Time Iterative Analysis Over Evolving Data

Xiaogang Shi; Bin Cui; Yingxia Shao; Yunhai Tong

There is an increasing demand for real-time iterative analysis over evolving data. In this paper, we propose a novel execution model to obtain timely results at given instants. We notice that a loop starting from a good initial guess usually converges fast. Hence we organize the execution of iterative methods over evolving data into a main loop and several branch loops. The main loop is responsible for gathering inputs and maintains an approximation to the timely results. When a user requests the results, a branch loop is forked from the main loop and iterates until convergence to produce them. Using the approximation of the main loop, the branch loops can start from a point near the fixed point and converge quickly. Since the inputs not yet reflected in the approximation are related to the approximation error, we develop a novel bounded asynchronous iteration model to enhance the timeliness. The bounded asynchronous iteration model can achieve fine-grained updates while ensuring correctness for general iterative methods. Based on the proposed execution model, we design and implement a prototype system named Tornado on top of Storm. Tornado provides a graph-parallel programming model which eases the programming of most real-time iterative analysis tasks. Reliability is also enhanced by provisioning efficient fault tolerance mechanisms. Empirical evaluation conducted on Tornado validates that various real-time iterative analysis tasks can improve their performance and efficiently tolerate failures with our execution model.

International Conference on Management of Data | 2015

Exploiting Matrix Dependency for Efficient Distributed Matrix Computation

Lele Yu; Yingxia Shao; Bin Cui

Distributed matrix computation is a popular approach for many large-scale data analysis and machine learning tasks. However, existing distributed matrix computation systems generally incur heavy communication cost at runtime, which degrades the overall performance. In this paper, we propose a novel matrix computation system, named DMac, which exploits the matrix dependencies in matrix programs for efficient matrix computation in a distributed environment. We decompose each matrix program into a sequence of operations and reveal the matrix dependencies between operations in the program. We next design a dependency-oriented cost model to select an optimal execution strategy for each operation, and generate a communication-efficient execution plan for the matrix computation program. To facilitate matrix computation in distributed systems, we further divide the execution plan into multiple un-interleaved stages which can run in a distributed cluster with an efficient local execution strategy on each worker. The DMac system has been implemented on a popular general-purpose data processing framework, Spark. The experimental results demonstrate that our techniques can significantly improve the performance of a wide range of matrix programs.

Very Large Data Bases | 2017

An experimental evaluation of SimRank-based similarity search algorithms

Zhipeng Zhang; Yingxia Shao; Bin Cui; Ce Zhang

Given a graph, SimRank is one of the most popular measures of the similarity between two vertices. We focus on efficiently calculating SimRank, which has been studied intensively over the last decade. As a result, researchers have proposed many algorithms that efficiently calculate or approximate SimRank. Despite these abundant research efforts, there is no systematic comparison of these algorithms. In this paper, we conduct a study to compare these algorithms and understand their pros and cons. We first introduce a taxonomy for different algorithms that calculate SimRank and classify each algorithm into one of three classes, namely, iterative, non-iterative, and random walk-based methods. We implement ten algorithms published from 2002 to 2015, and compare them using synthetic and real-world graphs. To ensure the fairness of our study, our implementations use the same data structure and execution framework, and we try our best to optimize each of these algorithms. Our study reveals that none of these algorithms dominates the others: algorithms based on the iterative method often have higher accuracy, while algorithms based on random walks can be more scalable. One non-iterative algorithm has good effectiveness and efficiency on medium-sized graphs. Thus, depending on the requirements of different applications, the optimal choice of algorithm differs. This paper provides an empirical guideline for making such choices.
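
As a point of reference for the iterative class mentioned above, here is a minimal Python sketch of the textbook SimRank iteration. It is not one of the ten implementations evaluated in the paper: a vertex has similarity 1 with itself, and two distinct vertices have similarity equal to the decay factor times the average similarity of their in-neighbour pairs. The function name `simrank`, the decay factor `c=0.6`, and the iteration count are illustrative assumptions.

```python
def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank over all vertex pairs.

    in_neighbors: dict mapping every vertex to a list of its in-neighbours
    c: decay factor (assumed value, chosen for illustration)
    """
    nodes = list(in_neighbors)
    # Start from the identity: each vertex is only similar to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}

    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    # No in-neighbours means no evidence of similarity.
                    new[a][b] = 0.0
                    continue
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


# Vertex "u" points to both "a" and "b", so "a" and "b" share an in-neighbour.
scores = simrank({"u": [], "a": ["u"], "b": ["u"]})
```

Here "a" and "b" share their only in-neighbour, so their score converges to the decay factor itself. The all-pairs loop is quadratic in the number of vertices per iteration, which is the kind of scalability cost the abstract contrasts with random walk-based methods.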

Web Age Information Management | 2018

CUTE: Querying Knowledge Graphs by Tabular Examples

Zichen Wang; Tian Li; Yingxia Shao; Bin Cui

Knowledge graphs and the query language SPARQL have opened up the possibility of retrieving information, acquiring knowledge and building applications over large linked data. However, due to unfamiliarity with both SPARQL and the datasets, users always struggle to write well-expressed queries. To increase the usability of knowledge graphs, we develop a query-by-example system, CUTE, which supports complex query intent. CUTE takes tabular examples as input and returns high-quality results via continuous user interaction.
