Lidan Shou | Researchain

Publication

Featured research published by Lidan Shou.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

Sumblr: continuous summarization of evolving tweet streams

Lidan Shou; Zhenhua Wang; Ke Chen; Gang Chen

With the explosive growth of microblogging services, short-text messages (also known as tweets) are being created and shared at an unprecedented rate. Tweets in their raw form can be incredibly informative, but also overwhelming. For both end-users and data analysts, it is a nightmare to plow through millions of tweets that contain enormous amounts of noise and redundancy. In this paper, we study continuous tweet summarization as a solution to this problem. While traditional document summarization methods focus on static and small-scale data, we aim to deal with dynamic, quickly arriving, and large-scale tweet streams. We propose a novel prototype called Sumblr (SUMmarization By stream cLusteRing) for tweet streams. We first propose an online tweet stream clustering algorithm to cluster tweets and maintain distilled statistics called Tweet Cluster Vectors (TCVs). Then we develop a TCV-Rank summarization technique for generating online summaries and historical summaries of arbitrary time durations. Finally, we describe a topic evolvement detection method, which consumes online and historical summaries to produce timelines automatically from tweet streams. Our experiments on large-scale real tweets demonstrate the efficiency and effectiveness of our approach.
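As a rough illustration of the online clustering step described above, the following sketch assigns each incoming tweet to its most similar cluster and keeps only summary statistics per cluster. It is a minimal sketch under stated assumptions, not the Sumblr algorithm: the abstract does not specify what a Tweet Cluster Vector stores, so the `ClusterSummary` fields, the bag-of-words cosine similarity, and the `threshold` value are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ClusterSummary:
    """Hypothetical per-cluster statistics (not the paper's exact TCV layout)."""
    term_sum: Counter = field(default_factory=Counter)  # summed term counts
    size: int = 0           # number of tweets absorbed so far
    last_seen: float = 0.0  # timestamp of the most recent tweet

    def add(self, terms: Counter, ts: float) -> None:
        """Fold one tweet into the summary without storing the tweet itself."""
        self.term_sum.update(terms)
        self.size += 1
        self.last_seen = max(self.last_seen, ts)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_stream(tweets, threshold=0.3):
    """Single pass over (timestamp, text) pairs; returns the cluster summaries.

    A tweet joins its most similar cluster, or opens a new cluster when
    no existing one reaches the (assumed) similarity threshold.
    """
    clusters = []
    for ts, text in tweets:
        terms = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(terms, c.term_sum)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None or best_sim < threshold:
            best = ClusterSummary()
            clusters.append(best)
        best.add(terms, ts)
    return clusters
```

Because each cluster keeps only aggregate counts and a timestamp, memory grows with the number of clusters rather than the number of tweets, which is the general motivation for maintaining distilled statistics over a stream.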

Conference on Information and Knowledge Management | 2012

Evaluating geo-social influence in location-based social networks

Chao Zhang; Lidan Shou; Ke Chen; Gang Chen; Yijun Bei

The emerging location-based social network (LBSN) services not only allow people to maintain cyber links with their friends, but also enable them to share the events happening to them at different locations. The geo-social correlations among event participants make it possible to quantify mutual user influence for various events. Such a quantification of influence could benefit a wide spectrum of real-life applications such as targeted advertising and viral marketing. In this paper, we perform an in-depth analysis of the geo-social correlations among LBSN users at the event level, based on which we address two problems: user influence evaluation and influential events discovery. To capture the geo-social closeness between LBSN users, we propose a unified influence metric. This metric combines a novel social proximity measure, named penalized hitting time, with a geographical weight function modeled by a power-law distribution. We propose two approximate algorithms, namely global iteration (GI) and dynamic neighborhood expansion (DNE), to efficiently evaluate user influence with tight theoretical error bounds. We then adopt the sampling technique and the threshold algorithm to support efficient retrieval of top-K influential events. Extensive experiments on both real-life and synthetic LBSN data sets confirm that the proposed algorithms are effective, efficient, and scalable.

IEEE Transactions on Knowledge and Data Engineering | 2015

On Summarization and Timeline Generation for Evolutionary Tweet Streams

Zhenhua Wang; Lidan Shou; Ke Chen; Gang Chen; Sharad Mehrotra

Short-text messages such as tweets are being created and shared at an unprecedented rate. Tweets, in their raw form, while being informative, can also be overwhelming. For both end-users and data analysts, it is a nightmare to plow through millions of tweets that contain an enormous amount of noise and redundancy. In this paper, we propose a novel continuous summarization framework called Sumblr to alleviate the problem. In contrast to traditional document summarization methods, which focus on static and small-scale data sets, Sumblr is designed to deal with dynamic, fast-arriving, and large-scale tweet streams. Our proposed framework consists of three major components. First, we propose an online tweet stream clustering algorithm to cluster tweets and maintain distilled statistics in a data structure called tweet cluster vector (TCV). Second, we develop a TCV-Rank summarization technique for generating online summaries and historical summaries of arbitrary time durations. Third, we design an effective topic evolution detection method, which monitors summary-based/volume-based variations to produce timelines automatically from tweet streams. Our experiments on large-scale real tweets demonstrate the efficiency and effectiveness of our framework.

IEEE Transactions on Knowledge and Data Engineering | 2018

SLADE: A Smart Large-Scale Task Decomposer in Crowdsourcing

Yongxin Tong; Lei Chen; Zimu Zhou; H. V. Jagadish; Lidan Shou; Weifeng Lv

Crowdsourcing has been shown to be effective in a wide range of applications, and is seeing increasing use. A large-scale crowdsourcing task often consists of thousands or millions of atomic tasks, each of which is usually a simple task such as binary choice or simple voting. To distribute a large-scale crowdsourcing task to limited crowd workers, a common practice is to pack a set of atomic tasks into a task bin and send it to a crowd worker in a batch. It is challenging to decompose a large-scale crowdsourcing task and execute batches of atomic tasks in a way that ensures reliable answers at a minimal total cost. Large batches lead to unreliable answers to atomic tasks, while small batches incur unnecessary cost. In this paper, we investigate a general crowdsourcing task decomposition problem, called the Smart Large-scAle task DEcomposer (SLADE) problem, which aims to decompose a large-scale crowdsourcing task to achieve the desired reliability at a minimal cost. We prove the NP-hardness of the SLADE problem and propose solutions in both homogeneous and heterogeneous scenarios. For the homogeneous SLADE problem, where all the atomic tasks share the same reliability requirement, we propose a greedy heuristic algorithm and an efficient and effective approximation framework using an optimal priority queue (OPQ) structure with a provable approximation ratio. For the heterogeneous SLADE problem, where the atomic tasks can have different reliability requirements, we extend the OPQ-based framework by leveraging a partition strategy, and also prove its approximation guarantee. Finally, we verify the effectiveness and efficiency of the proposed solutions through extensive experiments on representative crowdsourcing platforms.

International Conference on Data Engineering | 2003

HDoV-tree: the structure, the storage, the speed

Lidan Shou; Zhiyong Huang; Kian-Lee Tan

In a visualization system, one of the key issues is to optimize performance and visual fidelity. This is especially critical for large virtual environments where the models do not fit into the memory. Here, we present a novel structure called HDoV-tree that can be tuned to provide excellent visual fidelity and performance based on the degree of visibility of objects. HDoV-tree also exploits internal level-of-details (LoDs) that represent a collection of objects in a coarser form. We also propose three storage structures for the HDoV-tree. We implemented HDoV-tree in a prototype walkthrough system called VISUAL. We have evaluated the HDoV-tree on visibility queries, and also compared the performance of VISUAL against REVIEW, a walkthrough system based on R-tree. Our results show that the HDoV-tree is an efficient structure. Moreover, VISUAL can lead to high frame rates without compromising visual fidelity.

International Conference on Data Engineering | 2013

An efficient and compact indexing scheme for large-scale data store

Peng Lu; Sai Wu; Lidan Shou; Kian-Lee Tan

The amount of data managed in today's Cloud systems has reached an unprecedented scale. In order to speed up query processing, an effective mechanism is to build indexes on attributes that are used in query predicates. However, conventional indexing schemes fail to provide a scalable service: as the size of these indexes is proportional to the data size, it is not space efficient to build many indexes. As such, it becomes more crucial to develop an effective index to provide scalable database services in the Cloud. In this paper, we propose a compact bitmap indexing scheme for a large-scale data store. The bitmap indexing scheme combines state-of-the-art bitmap compression techniques, such as WAH encoding and bit-sliced encoding. To further reduce the index cost, a novel and query-efficient partial indexing technique is adopted, which dynamically refreshes the index to handle updates and process queries. The intuition of our indexing approach is to maximize the number of indexed attributes, so that a wider range of queries, including range and join queries, can be efficiently supported. Our indexing scheme is lightweight and its creation can be seamlessly grafted onto the MapReduce processing engine without incurring significant running cost. Moreover, the compactness allows us to maintain the bitmap indexes in memory so that the performance overhead of index access is minimal. We implement our indexing scheme on top of the underlying Distributed File System (DFS) and evaluate its performance on an in-house cluster. We compare our index-based query processing with HadoopDB to show its superior performance. Our experimental results confirm the effectiveness, efficiency and scalability of the indexing scheme.
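To make the idea of answering predicates with bitmaps concrete, here is a minimal sketch of a plain, uncompressed, equality-encoded bitmap index. It is not the scheme proposed in the paper: WAH compression, bit-sliced encoding, partial indexing, and the MapReduce integration are all omitted, and the `BitmapIndex` class, its `eq` and `between` methods, and the `rows` helper are hypothetical names introduced only for this example.

```python
from collections import defaultdict


class BitmapIndex:
    """Equality-encoded bitmap index: one integer bitmap per distinct value."""

    def __init__(self, values):
        self.n = len(values)
        self.bitmaps = defaultdict(int)
        for row, v in enumerate(values):
            # Set bit `row` in the bitmap that belongs to value v.
            self.bitmaps[v] |= 1 << row

    def eq(self, v):
        """Bitmap of rows whose value equals v (0 if v never occurs)."""
        return self.bitmaps.get(v, 0)

    def between(self, lo, hi):
        """Bitmap of rows with lo <= value <= hi, as an OR of value bitmaps."""
        out = 0
        for v, bm in self.bitmaps.items():
            if lo <= v <= hi:
                out |= bm
        return out


def rows(bitmap):
    """Decode a bitmap into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Usage: combine predicates on two attributes with a bitwise AND.
age = BitmapIndex([25, 31, 25, 40])
city = BitmapIndex(["HZ", "SG", "SG", "HZ"])
matching = rows(age.between(25, 31) & city.eq("SG"))
```

Predicates on different attributes combine through bitwise AND and OR on the bitmaps, which is why indexing many attributes can pay off for multi-attribute queries; the uncompressed bitmaps here grow with the number of rows and distinct values, which is the cost that compression techniques aim to reduce.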

IEEE Transactions on Knowledge and Data Engineering | 2013

Supporting Pattern-Preserving Anonymization for Time-Series Data

Lidan Shou; Xuan Shang; Ke Chen; Gang Chen; Chao Zhang

Time series is an important form of data available in numerous applications and often contains a vast amount of private personal information. The need to protect privacy in time-series data while effectively supporting complex queries on them poses nontrivial challenges to the database community. We study the anonymization of time series while trying to support complex queries, such as range and pattern matching queries, on the published data. The conventional k-anonymity model cannot effectively address this problem as it may suffer severe pattern loss. We propose a novel anonymization model called (k, P)-anonymity for pattern-rich time series. This model publishes both the attribute values and the patterns of time series in separate data forms. We demonstrate that our model can prevent linkage attacks on the published data while effectively supporting a wide variety of queries on the anonymized data. We propose two algorithms to enforce (k, P)-anonymity on time-series data. Our anonymity model supports customized data publishing, which allows a certain part of the values but a different part of the pattern of the anonymized time series to be published simultaneously. We present estimation techniques to support query processing on such customized data. The proposed methods are evaluated in a comprehensive experimental study. Our results verify the effectiveness and efficiency of our approach.

Information Sciences | 2009

Bottom-up discovery of frequent rooted unordered subtrees

Yijun Bei; Gang Chen; Lidan Shou; Xiaoyan Li; Jinxiang Dong

In the past decade, XML has emerged as the standard language for information exchange over the Internet. Due to its tree-structure paradigm, XML is superior for its capability of storing, querying, and manipulating complex data. Therefore, discovering frequent tree patterns over tree-structured data has become an interesting topic for XML data management. In this paper, we propose a tree mining algorithm, named BUXMiner, for finding a special class of frequent trees, called rooted unordered trees, from a tree-structured database. BUXMiner employs an efficient bottom-up approach to enumerate all candidate trees over a compact global tree guide and computes the frequent trees based on the tree guide. In addition to BUXMiner, we also propose a mining approach called BUMXMiner to discover the maximal frequent rooted unordered trees. We compare BUXMiner with previous tree-structure mining algorithms, namely XQPMinerTID and FastXMiner, which were also proposed to discover rooted unordered trees. The experimental results show that our algorithm outperforms XQPMinerTID and FastXMiner in terms of efficiency. The performance results from real-world applications also indicate the usefulness of our proposed tree mining algorithms in a variety of web applications, such as analysis of web page access patterns and mining frequent XML query patterns for caching.

IEEE Transactions on Knowledge and Data Engineering | 2013

KSQ: Top-k Similarity Query on Uncertain Trajectories

Chunyang Ma; Hua Lu; Lidan Shou; Gang Chen

Similarity search on spatiotemporal trajectories has a wide range of applications. Most existing research focuses on certain trajectories. However, trajectories are often uncertain due to various factors, for example, hardware limitations and privacy concerns. In this paper, we introduce p-distance, a novel and adaptive measure that is able to quantify the dissimilarity between two uncertain trajectories. Based on this measure of dissimilarity, we define the top-k similarity query (KSQ) on uncertain trajectories. A KSQ returns the k trajectories that are most similar to a given trajectory in terms of p-distance. To process such queries efficiently, we design UTgrid for indexing uncertain trajectories and develop query processing algorithms that make use of UTgrid for effective pruning. We conduct an extensive experimental study on both synthetic and real data sets. The results indicate that UTgrid is an effective indexing method for similarity search on uncertain trajectories. Our query processing using UTgrid dramatically improves the query performance and scales well in terms of query time and I/O.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2011

UPS: efficient privacy protection in personalized web search

Gang Chen; He Bai; Lidan Shou; Ke Chen; Yunjun Gao

In recent years, personalized web search (PWS) has demonstrated effectiveness in improving the quality of search service on the Internet. Unfortunately, the need for collecting private information in PWS has become a major barrier to its wide proliferation. We study privacy protection in PWS engines which capture personalities in user profiles. We propose a PWS framework called UPS that can generalize profiles for each query according to user-specified privacy requirements. Two predictive metrics are proposed to evaluate the privacy breach risk and the query utility for hierarchical user profiles. We develop two simple but effective generalization algorithms for user profiles, allowing for query-level customization using our proposed metrics. We also provide an online prediction mechanism based on query utility for deciding whether to personalize a query in UPS. Extensive experiments demonstrate the efficiency and effectiveness of our framework.