Yudian Zheng | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yudian Zheng is active.

Explore More

Publication

Featured researches published by Yudian Zheng.

very large data bases | 2017

Truth inference in crowdsourcing: is the problem solved?

Yudian Zheng; Guoliang Li; Yuanbing Li; Caihua Shan; Reynold Cheng

Crowdsourcing has emerged as a novel problem-solving paradigm, which facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, and a redundancy-based method is widely employed, which first assigns each task to multiple workers and then infers the correct answer (called truth) for the task based on the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database community and data mining community independently study this problem and propose various algorithms. However, these algorithms are not compared extensively under the same framework and it is hard for practitioners to select appropriate algorithms. To alleviate this problem, we provide a detailed survey on 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all codes and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and there is no algorithm that outperforms others consistently. We believe that the truth inference problem is not fully solved, and identify the limitations of existing algorithms and point out promising research directions.

extending database technology | 2015

On Optimality of Jury Selection in Crowdsourcing

Yudian Zheng; Reynold Cheng; Silviu Maniu; Luyi Mo

Recent advances in crowdsourcing technologies enable computationally challenging tasks (e.g., sentiment analysis and entity resolution) to be performed by Internet workers, driven mainly by monetary incentives. A fundamental question is: how should workers be selected, so that the tasks in hand can be accomplished successfully and economically? In this paper, we study the Jury Selection Problem (JSP): Given a monetary budget, and a set of decision-making tasks (e.g., “Is Bill Gates still the CEO of Microsoft now?”), return the set of workers (called jury), such that their answers yield the highest “Jury Quality” (or JQ). Existing JSP solutions make use of the Majority Voting (MV) strategy, which uses the answer chosen by the largest number of workers. We show that MV does not yield the best solution for JSP. We further prove that among all voting strategies (including deterministic and randomizedstrategies), BayesianVoting(BV)canoptimallysolveJSP. We then examine how to solve JSP based on BV. This is technically challenging, since computing the JQ with BV is NP-hard. We solve this problem by proposing an approximate algorithm that is computationally efficient. Our approximate JQ computation algorithm is also highly accurate, and its error is proved to be bounded within 1%. We extend our solution by considering the task owner’s “belief” (or prior) on the answers of the tasks. Experiments on synthetic and real datasets show that our new approach is consistently better than the best JSP solution known.

knowledge discovery and data mining | 2016

Meta Structure: Computing Relevance in Large Heterogeneous Information Networks

Zhipeng Huang; Yudian Zheng; Reynold Cheng; Yizhou Sun; Nikos Mamoulis; Xiang Li

A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures can be used in various applications, including entity resolution, recommendation, and information retrieval. Several studies have investigated the use of HIN information for relevance computation, however, most of them only utilize simple structure, such as path, to measure the similarity between objects. In this paper, we propose to use meta structure, which is a directed acyclic graph of object types with edge types connecting in between, to measure the proximity between objects. The strength of meta structure is that it can describe complex relationship between two HIN objects (e.g., two papers in DBLP share the same authors and topics). We develop three relevance measures based on meta structure. Due to the computational complexity of these measures, we further design an algorithm with data structures proposed to support their evaluation. Our extensive experiments on YAGO and DBLP show that meta structure-based relevance is more effective than state-of-the-art approaches, and can be efficiently computed.

very large data bases | 2016

DOCS: a domain-aware crowdsourcing system using knowledge bases

Yudian Zheng; Guoliang Li; Reynold Cheng

Crowdsourcing is a new computing paradigm that harnesses human effort to solve computer-hard problems, such as entity resolution and photo tagging. The crowd (or workers) have diverse qualities and it is important to effectively model a worker’s quality. Most of existing worker models assume that workers have the same quality on different tasks. In practice, however, tasks belong to a variety of diverse domains, and workers have different qualities on different domains. For example, a worker who is a basketball fan should have better quality for the task of labeling a photo related to ‘Stephen Curry’ than the one related to ‘Leonardo DiCaprio’. In this paper, we study how to leverage domain knowledge to accurately model a worker’s quality. We examine using knowledge base (KB), e.g., Wikipedia and Freebase, to detect the domains of tasks and workers. We develop Domain Vector Estimation, which analyzes the domains of a task with respect to the KB. We also study Truth Inference, which utilizes the domain-sensitive worker model to accurately infer the true answer of a task. We design an Online Task Assignment algorithm, which judiciously and efficiently assigns tasks to appropriate workers. To implement these solutions, we have built DOCS, a system deployed on the Amazon Mechanical Turk. Experiments show that DOCS performs much better than the state-of-the-art approaches.

international conference on management of data | 2017

CDB: Optimizing Queries with Crowd-Based Selections and Joins

Guoliang Li; Chengliang Chai; Ju Fan; Xueping Weng; Jian Li; Yudian Zheng; Yuanbing Li; Xiang Yu; Xiaohang Zhang; Haitao Yuan

Crowdsourcing database systems have been proposed to leverage crowd-powered operations to encapsulate the complexities of interacting with the crowd. Existing systems suffer from two major limitations. Firstly, in order to optimize a query, they often adopt the traditional tree model to select an optimized table-level join order. However, the tree model provides a coarse-grained optimization, which generates the same order for different joined tuples and limits the optimization potential that different joined tuples can be optimized by different orders. Secondly, they mainly focus on optimizing the monetary cost. In fact, there are three optimization goals (i.e., smaller monetary cost, lower latency, and higher quality) in crowdsourcing, and it calls for a system to enable multi-goal optimization. To address the limitations, we develop a crowd-powered database system CDB that supports crowd-based query optimizations, with focus on join and selection. CDB has fundamental differences from existing systems. First, CDB employs a graph-based query model that provides more fine-grained query optimization. Second, CDB adopts a unified framework to perform the multi-goal optimization based on the graph model. We have implemented our system and deployed it on AMT, CrowdFlower and ChinaCrowd. We have also created a benchmark for evaluating crowd-powered databases. We have conducted both simulated and real experiments, and the experimental results demonstrate the performance superiority of CDB on cost, latency and quality.

international conference on management of data | 2017

Crowdsourced Data Management: Overview and Challenges

Guoliang Li; Yudian Zheng; Ju Fan; Jiannan Wang; Reynold Cheng

Many important data management and analytics tasks cannot be completely addressed by automated processes. Crowdsourcing is an effective way to harness human cognitive abilities to process these computer-hard tasks, such as entity resolution, sentiment analysis, and image recognition. Crowdsourced data management has been extensively studied in research and industry recently. In this tutorial, we will survey and synthesize a wide spectrum of existing studies on crowdsourced data management. We first give an overview of crowdsourcing, and then summarize the fundamental techniques, including quality control, cost control, and latency control, which must be considered in crowdsourced data management. Next we review crowdsourced operators, including selection, collection, join, top-k, sort, categorize, aggregation, skyline, planning, schema matching, mining and spatial crowdsourcing. We also discuss crowdsourcing optimization techniques and systems. Finally, we provide the emerging challenges.

conference on information and knowledge management | 2016

On Transductive Classification in Heterogeneous Information Networks

Xiang Li; Ben Kao; Yudian Zheng; Zhipeng Huang

A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Objects are often associated with properties such as labels. In many applications, such as curated knowledge bases for which object labels are manually given, only a small fraction of the objects are labeled. Studies have shown that transductive classification is an effective way to classify and to deduce labels of objects, and a number of transductive classifiers have been put forward to classify objects in an HIN. We study the performance of a few representative transductive classification algorithms on HINs. We identify two fundamental properties, namely, cohesiveness and connectedness, of an HIN that greatly influence the effectiveness of transductive classifiers. We define metrics that measure the two properties. Through experiments, we show that the two properties serve as very effective indicators that predict the accuracy of transductive classifiers. Based on cohesiveness and connectedness we derive (1) a black-box tester that evaluates whether transductive classifiers should be applied for a given classification task and (2) an active learning algorithm that identifies the objects in an HIN whose labels should be sought in order to improve classification accuracy.

conference on information and knowledge management | 2017

Sybil Defense in Crowdsourcing Platforms

Dong Yuan; Guoliang Li; Qi Li; Yudian Zheng

Crowdsourcing platforms have been widely deployed to solve many computer-hard problems, e.g., image recognition and entity resolution. Quality control is an important issue in crowdsourcing, which has been extensively addressed by existing quality-control algorithms, e.g., voting-based algorithms and probabilistic graphical models. However, these algorithms cannot ensure quality under sybil attacks, which leverages a large number of sybil accounts to generate results for dominating answers of normal workers. To address this problem, we propose a sybil defense framework for crowdsourcing, which can help crowdsourcing platforms to identify sybil workers and defense the sybil attack. We develop a similarity function to quantify worker similarity. Based on worker similarity, we cluster workers into different groups such that we can utilize a small number of golden questions to accurately identify the sybil groups. We also devise online algorithms to instantly detect sybil workers to throttle the attacks. Our method also has ability to detect multi-attackers in one task. To the best of our knowledge, this is the first framework for sybil defense in crowdsourcing. Experimental results on real-world datasets demonstrate that our method can effectively identify and throttle sybil workers.

web age information management | 2017

Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

Reynold Cheng; Zhipeng Huang; Yudian Zheng; Jing Yan; Ka Yu Wong; Eddie Ng

A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures, such as PCRW, PathSim, and HeteSim, can be used in various applications, including information retrieval, entity resolution, and product recommendation. These metrics are based on the use of meta-paths, essentially a sequence of node classes and edge types between two nodes in a HIN. In this tutorial, we will give a detailed review of meta-paths, as well as how they are used to define relevance. In a large and complex HIN, retrieving meta paths manually can be complex, expensive, and error-prone. Hence, we will explore systematic methods for finding meta paths. In particular, we will study a solution based on the Query-by-Example (QBE) paradigm, which allows us to discover meta-paths in an effective and efficient manner.

international conference on data engineering | 2017

Crowdsourced Data Management: A Survey

Guoliang Li; Jiannan Wang; Yudian Zheng; Michael J. Franklin

Any important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition can be enhanced through the use of human cognitive ability. Crowdsouring platforms are an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation for such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results so effective techniques are required to achieve high quality; (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency Control: The human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis we then outline key factors that need to be considered to improve crowdsourced data management.

Explore More