Yanyan Shen | Researchain

Publication

Featured research published by Yanyan Shen.

very large data bases | 2012

CDAS: a crowdsourcing data analytics system

Xuan Liu; Meiyu Lu; Beng Chin Ooi; Yanyan Shen; Sai Wu; Meihui Zhang

Some complex problems, such as image tagging and natural language processing, are very challenging for computers, and even state-of-the-art technology is not yet able to provide satisfactory accuracy. Therefore, rather than relying solely on developing new and better algorithms to handle such tasks, we look to the crowdsourcing solution (employing human participation) to make good the shortfall in current technology. Crowdsourcing is a good supplement to many computer tasks. A complex job may be divided into computer-oriented tasks and human-oriented tasks, which are then assigned to machines and humans respectively.

To leverage the power of crowdsourcing, we design and implement a Crowdsourcing Data Analytics System, CDAS. CDAS is a framework designed to support the deployment of various crowdsourcing applications. The core part of CDAS is a quality-sensitive answering model, which guides the crowdsourcing engine to process and monitor the human tasks. In this paper, we introduce the principles of our quality-sensitive model. To satisfy the user-required accuracy, the model guides the crowdsourcing query engine in the design and processing of the corresponding crowdsourcing jobs. It provides an estimated accuracy for each generated result based on the human workers' historical performance. When verifying the quality of the result, the model employs an online strategy to reduce waiting time. To show the effectiveness of the model, we implement and deploy two analytics jobs on CDAS, a Twitter sentiment analytics job and an image tagging job. We use real Twitter and Flickr data as our queries respectively. We compare our approaches with state-of-the-art classification and image annotation techniques. The results show that the human-assisted methods can indeed achieve a much higher accuracy. By embedding the quality-sensitive model into the crowdsourcing query engine, we effectively reduce the processing cost while maintaining the required query answer quality.
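To make the idea of estimating result accuracy from workers' historical performance concrete, the following is a minimal sketch and not the CDAS model itself. It assumes each worker answers independently, that a wrong worker picks uniformly among the other labels, and that every historical accuracy lies strictly between 0 and 1. The function name `estimate_answer_confidence` and its parameters are hypothetical.

```python
def estimate_answer_confidence(votes, accuracy, labels):
    """Return a normalized confidence for each candidate label.

    votes:    dict mapping worker id -> label the worker chose
    accuracy: dict mapping worker id -> historical accuracy in (0, 1)
    labels:   list of all possible labels (at least two)

    Assumption (not from the paper): workers are independent, and a
    wrong worker picks uniformly among the remaining labels.
    """
    scores = {}
    for candidate in labels:
        likelihood = 1.0
        for worker, answer in votes.items():
            p = accuracy[worker]
            if answer == candidate:
                likelihood *= p
            else:
                likelihood *= (1.0 - p) / (len(labels) - 1)
        scores[candidate] = likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


votes = {"w1": "pos", "w2": "pos", "w3": "neg"}
accuracy = {"w1": 0.9, "w2": 0.6, "w3": 0.7}
confidence = estimate_answer_confidence(votes, accuracy, ["pos", "neg"])
```

Under these assumptions, a job could stop collecting answers once the top confidence exceeds the accuracy the user asked for, which is one simple way to trade cost against quality.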

very large data bases | 2012

Efficient processing of k nearest neighbor joins using MapReduce

Wei Lu; Yanyan Shen; Su Chen; Beng Chin Ooi

The k nearest neighbor join (kNN join), designed to find the k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join on a centralized machine efficiently. In this paper, we investigate how to perform kNN join using MapReduce, which is a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
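For reference, the kNN join operation itself can be written as a brute-force, single-machine baseline. This sketch only illustrates the definition (for every object in R, the k closest objects in S) and is not the MapReduce algorithm from the paper; it has none of the grouping or pruning described above. The function name `knn_join` and the use of Euclidean distance are assumptions.

```python
import heapq
import math


def knn_join(R, S, k):
    """Brute-force kNN join: map each index in R to its k nearest points in S.

    Points are tuples of coordinates; distance is Euclidean (an assumption).
    Cost is O(|R| * |S|) distance computations, which is why a distributed
    approach becomes attractive for large inputs.
    """
    result = {}
    for i, r in enumerate(R):
        result[i] = heapq.nsmallest(k, S, key=lambda s: math.dist(r, s))
    return result


R = [(0, 0), (10, 10)]
S = [(1, 0), (0, 2), (9, 9), (20, 20)]
joined = knn_join(R, S, k=2)
```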

international conference on management of data | 2014

Discovering queries based on example tuples

Yanyan Shen; Kaushik Chakrabarti; Surajit Chaudhuri; Bolin Ding; Lev Novik

An enterprise information worker is often aware of a few example tuples (but not the entire result) that should be present in the output of the query. We study the problem of discovering the minimal project join query that contains the given example tuples in its output. Efficient discovery of such queries is challenging. We propose novel algorithms to solve this problem. Our experiments on real-life datasets show that the proposed solution is significantly more efficient compared with naive adaptations of known techniques.

very large data bases | 2015

Dexter: large-scale discovery and extraction of product specifications on the web

Disheng Qiu; Luciano Barbosa; Xin Luna Dong; Yanyan Shen; Divesh Srivastava

The web is a rich resource of structured data. There has been an increasing interest in using web structured data for many applications such as data integration, web search and question answering. In this paper, we present Dexter, a system to find product sites on the web, and detect and extract product specifications from them. Since product specifications exist in multiple product sites, our focused crawler relies on search queries and backlinks to discover product sites. To perform the detection, and handle the high diversity of specifications in terms of content, size and format, our system uses supervised learning to classify HTML fragments (e.g., tables and lists) present in web pages as specifications or not. To perform large-scale extraction of the attribute-value pairs from the HTML fragments identified by the specification detector, Dexter adopts two lightweight strategies: a domain-independent and unsupervised wrapper method, which relies on the observation that these HTML fragments have very similar structure; and a combination of this strategy with a previous approach, which infers extraction patterns by annotations generated by automatic but noisy annotators. The results show that our crawler strategy to locate product specification pages is effective: (1) it discovered 1.46M product specification pages from 3,005 sites and 9 different categories; (2) the specification detector obtains high values of F-measure (close to 0.9) over a heterogeneous set of product specifications; and (3) our efficient wrapper methods for attribute-value extraction get very high values of precision (0.92) and recall (0.95) and obtain better results than a state-of-the-art, supervised rule-based wrapper.

very large data bases | 2014

Fast failure recovery in distributed graph processing systems

Yanyan Shen; Gang Chen; H. V. Jagadish; Wei Lu; Beng Chin Ooi; Bogdan Marius Tudor

Distributed graph processing systems increasingly require many compute nodes to cope with the requirements imposed by contemporary graph-based Big Data applications. However, increasing the number of compute nodes increases the chance of node failures. Therefore, provisioning an efficient failure recovery strategy is critical for distributed graph processing systems. This paper proposes a novel recovery mechanism for distributed graph processing systems that parallelizes the recovery process. The key idea is to partition the part of the graph that is lost during a failure among a subset of the remaining nodes. To do so, we augment the existing checkpoint-based and log-based recovery schemes with a partitioning mechanism that is sensitive to the total computation and communication cost of the recovery process. Our implementation on top of the widely used Giraph system outperforms checkpoint-based recovery by up to 30x on a cluster of 40 compute nodes.

web search and data mining | 2018

Inferring Dockless Shared Bike Distribution in New Cities

Zhaoyang Liu; Yanyan Shen

Recently, dockless shared bike services have achieved great success and reinvented the bike sharing business in China. When expanding a bike sharing business into a new city, start-ups want to find out how to cover the whole city with a suitable bike distribution. In this paper, we study the problem of inferring bike distribution in new cities, which is challenging. As no dockless bikes are deployed in the new city, we propose to learn insights on bike distribution from cities populated with dockless bikes. We exploit multi-source data to identify important features that affect bike distributions and develop a novel inference model combining Factor Analysis and Convolutional Neural Network techniques. Extensive experiments on real-life datasets show that the proposed solution provides significantly more accurate inference results compared with competitive prediction methods.

web search and data mining | 2018

Predicting Multi-step Citywide Passenger Demands Using Attention-based Neural Networks

Xian Zhou; Yanyan Shen; Linpeng Huang

Predicting passenger pickup/dropoff demands based on historical mobility trips has been of great importance towards better vehicle distribution for the emerging mobility-on-demand (MOD) services. Prior works focused on predicting next-step passenger demands at selected locations or hotspots. However, we argue that multi-step citywide passenger demands encapsulate both time-varying demand trends and global statuses, and hence are more beneficial to avoiding demand-service mismatching and developing effective vehicle distribution/scheduling strategies. In this paper, we propose an end-to-end deep neural network solution to the prediction task. We employ the encoder-decoder framework based on convolutional and ConvLSTM units to identify complex features that capture spatiotemporal influences and pickup-dropoff interactions on citywide passenger demands. A novel attention model is incorporated to emphasize the effects of latent citywide mobility regularities. We evaluate our proposed method using real-world mobility trips (taxis and bikes) and the experimental results show that our method achieves higher prediction accuracy than the adaptations of the state-of-the-art approaches.

international joint conference on artificial intelligence | 2018

Cuckoo Feature Hashing: Dynamic Weight Sharing for Sparse Analytics

Jinyang Gao; Beng Chin Ooi; Yanyan Shen; Wang-Chien Lee

Feature hashing is widely used to process large-scale sparse features for learning predictive models. Collisions inherently happen in the hashing process and hurt the model performance. In this paper, we develop a new feature hashing scheme called Cuckoo Feature Hashing (CCFH), which treats feature hashing as a problem of dynamic weight sharing during model training. By leveraging a set of indicators to dynamically decide the weight of each feature based on alternative hash locations, CCFH effectively prevents collisions between features that are important to the model, i.e. predictive features, and thus avoids model performance degradation. Experimental results on prediction tasks with hundreds of millions of features demonstrate that CCFH can achieve the same level of performance by using only 15%-25% of the parameters compared with conventional feature hashing.
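As background, conventional feature hashing (the baseline that the abstract compares against) can be sketched in a few lines. This is not CCFH: it has no indicators and no alternative hash locations. It only shows how named sparse features are folded into a fixed number of buckets, so that features landing in the same bucket share one weight. The helper names `bucket` and `hash_features` and the choice of MD5 as a deterministic hash are assumptions made for illustration.

```python
import hashlib


def bucket(feature, num_buckets, seed=0):
    """Map a feature name to a bucket index with a deterministic hash."""
    digest = hashlib.md5(f"{seed}:{feature}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(features, num_buckets):
    """Fold a sparse {name: value} dict into a dense vector of fixed size.

    Values of features that collide in the same bucket are summed, which
    is the source of the performance loss mentioned in the abstract.
    """
    vector = [0.0] * num_buckets
    for name, value in features.items():
        vector[bucket(name, num_buckets)] += value
    return vector


hashed = hash_features({"user=42": 1.0, "city=SG": 2.0, "hour=9": 3.0}, 4)
```

A scheme with a second hash function (for example, calling `bucket` with a different `seed`) would give each feature an alternative location, which is the kind of choice a cuckoo-style approach can exploit.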

Journal of Parallel and Distributed Computing | 2018

NVHT: An efficient key–value storage library for non-volatile memory

Kaixin Huang; Jie Zhou; Linpeng Huang; Yanyan Shen

Non-Volatile Memory (NVM) promises persistence, byte-addressability and DRAM-like read/write latency. These properties indicate that NVM has the potential to be incorporated with key–value stores to achieve high performance and durability simultaneously. Specifically, data can be stored in NVM inherently without DRAM buffering, which eliminates expensive disk I/Os and data format transformation cost. However, several challenges such as data inconsistency and write endurance arise along with the benefits. We propose a library named NVHT to provide APIs for NVM-based key–value store operations. In NVHT, we introduce a non-volatile pointer to solve the dynamic address mapping problem and design a wear-out-aware memory allocator for NVM. The core of NVHT is a novel NVM-friendly hash table structure. NVHT guarantees consistency using a log-based mechanism. The experimental results show that compared with LevelDB and BerkeleyDB running on a DRAM-based file system, NVHT achieves more than 2x and 4x speedup for insert and search operations respectively. Compared with the in-memory key–value store system Redis, NVHT still achieves higher transaction performance in terms of random update throughput (up to 1.5x and 2.5x for the RDB scheme and the AOF scheme, respectively).

pacific-asia conference on knowledge discovery and data mining | 2018

Cruising or Waiting: A Shared Recommender System for Taxi Drivers

Xiaoting Jiang; Yanyan Shen

Recent efforts have been made on mining the mobility of taxi trajectories and developing recommender systems for taxi drivers. Existing systems focus on recommending seeking routes to the place with the highest passenger pick-up probability. They mostly ignore that waiting at nearby taxi stands may also help increase the profit. Furthermore, the recommended results seldom consider potential competition among drivers and real-time traffic. In this paper, we propose a shared recommender system for taxi drivers by including waiting as one kind of seeking policy. We model a seeking process as a Markov Decision Process, and propose a novel Q-learning algorithm to train the model efficiently based on massive trajectory data. During online recommendation, we update the model using feedback from drivers and recommend the optimal seeking policy by taking competition among drivers and real-time traffic into account. Experimental results show that our system achieves better performance than the state-of-the-art approaches.
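For readers unfamiliar with Q-learning, the standard tabular update rule is sketched below. This is the textbook algorithm, not the novel variant proposed in the paper. The state names, the two actions "cruise" and "wait", the reward values, and the function name `q_update` are all hypothetical and chosen only to echo the cruising-or-waiting setting.

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one standard Q-learning update and return the new Q-value.

    q:       dict mapping (state, action) -> estimated value (default 0.0)
    alpha:   learning rate
    gamma:   discount factor for future reward
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


actions = ["cruise", "wait"]
q = {}
# A driver waits in zone A, earns a reward of 10, and ends up in zone B.
first = q_update(q, "A", "wait", 10.0, "B", actions)
# Suppose cruising from zone B is already believed to be worth 5.
q[("B", "cruise")] = 5.0
second = q_update(q, "A", "wait", 10.0, "B", actions)
```

Here the first update moves the value of waiting in zone A from 0 to 1.0, and the second raises it to 2.35 because the estimated value of the next zone now contributes to the target.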
