Xiaoxin Yin
University of Illinois at Urbana–Champaign
Publication
Featured research published by Xiaoxin Yin.
IEEE Transactions on Knowledge and Data Engineering | 2008
Xiaoxin Yin; Jiawei Han; Philip S. Yu
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem, called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various websites. We design a general framework for the Veracity problem and invent an algorithm, called TRUTHFINDER, which utilizes the relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the trustworthiness of websites and the correctness of information from each other. Our experiments show that TRUTHFINDER successfully finds true facts among conflicting information and identifies trustworthy websites better than the popular search engines.
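The mutual dependence described in this abstract (site trust derived from fact confidence, and fact confidence derived from site trust) can be illustrated with a small fixed-point loop. This is a simplified sketch, not the algorithm from the paper: the function name `truth_finder_sketch`, the starting trust of 0.9, the iteration count, and the assumption that sites err independently are all choices made for illustration only.

```python
def truth_finder_sketch(claims, initial_trust=0.9, iterations=10):
    """Iteratively estimate site trust and fact confidence.

    claims: dict mapping a website name to the set of facts it provides.
    A fact is any hashable value, e.g. a (subject, value) tuple.
    All parameter names and defaults here are illustrative assumptions.
    """
    trust = {site: initial_trust for site in claims}

    # Invert the mapping: which sites provide each fact?
    providers = {}
    for site, facts in claims.items():
        for fact in facts:
            providers.setdefault(fact, set()).add(site)

    confidence = {}
    for _ in range(iterations):
        # A fact is wrong only if every site providing it is wrong
        # (assuming, for this sketch, that sites err independently).
        for fact, sites in providers.items():
            p_all_wrong = 1.0
            for site in sites:
                p_all_wrong *= 1.0 - trust[site]
            confidence[fact] = 1.0 - p_all_wrong

        # A site's trust is the average confidence of the facts it provides.
        for site, facts in claims.items():
            if facts:
                trust[site] = sum(confidence[f] for f in facts) / len(facts)

    return trust, confidence


# Three hypothetical sites disagree on the value of subject "x".
example_claims = {
    "A": {("x", 1)},
    "B": {("x", 1)},
    "C": {("x", 2)},
}
example_trust, example_confidence = truth_finder_sketch(example_claims)
```

In this toy input the value backed by two sites ends up with higher confidence than the value backed by one, and the sites providing it end up with higher trust. The sketch does not model relationships between conflicting facts.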
knowledge discovery and data mining | 2007
Xiaoxin Yin; Jiawei Han; Philip S. Yu
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem, called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various websites. We design a general framework for the Veracity problem and invent an algorithm, called TRUTHFINDER, which utilizes the relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the trustworthiness of websites and the correctness of information from each other. Our experiments show that TRUTHFINDER successfully finds true facts among conflicting information and identifies trustworthy websites better than the popular search engines.
international conference on data engineering | 2007
Xiaoxin Yin; Jiawei Han; Philip S. Yu
Different people or objects may share identical names in the real world, which causes confusion in many applications. It is a nontrivial task to distinguish those objects, especially when there is only very limited information associated with each of them. In this paper, we develop a general object distinction methodology called DISTINCT, which combines two complementary measures for relational similarity: set resemblance of neighbor tuples and random walk probability, and uses SVM to weigh different types of linkages without manually labeled training data. Experiments show that DISTINCT can accurately distinguish different objects with identical names in real databases.
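The two similarity measures named in this abstract can be sketched on a toy neighbor structure. The code below is an assumption-laden illustration, not the DISTINCT implementation: set resemblance is taken here to be the plain Jaccard coefficient, the random walk is limited to two uniform steps through a shared neighbor, and the names `set_resemblance`, `walk_probability`, and the sample references are hypothetical. The SVM-based weighting of linkage types is not shown.

```python
def set_resemblance(neighbors_a, neighbors_b):
    """Jaccard coefficient of two neighbor sets (0.0 if both are empty)."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


def walk_probability(start, end, neighbors):
    """Probability of a two-step uniform random walk start -> neighbor -> end.

    neighbors: dict mapping each reference to the set of tuples it links to.
    """
    # Reverse index: which references link to each neighbor tuple?
    linked_refs = {}
    for ref, linked in neighbors.items():
        for n in linked:
            linked_refs.setdefault(n, set()).add(ref)

    out = neighbors[start]
    if not out:
        return 0.0

    prob = 0.0
    for n in out:
        if end in linked_refs[n]:
            # Step 1: pick neighbor n uniformly; step 2: pick a reference
            # linked to n uniformly (this may include the start itself).
            prob += (1.0 / len(out)) * (1.0 / len(linked_refs[n]))
    return prob


# Hypothetical references that share a name, with their linked co-authors.
example_neighbors = {
    "ref1": {"alice", "bob"},
    "ref2": {"bob", "carol"},
    "ref3": {"dave"},
}
resemblance_12 = set_resemblance(example_neighbors["ref1"], example_neighbors["ref2"])
walk_12 = walk_probability("ref1", "ref2", example_neighbors)
walk_13 = walk_probability("ref1", "ref3", example_neighbors)
```

Here `ref1` and `ref2` share one neighbor and so score above zero on both measures, while `ref1` and `ref3` share none and score zero, which suggests they refer to different objects.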
IEEE Transactions on Knowledge and Data Engineering | 2006
Xiaoxin Yin; Jiawei Han; Jiong Yang; Philip S. Yu
Relational databases are the most popular repository for structured data, and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. Existing approaches of inductive logic programming (recently, also known as relational mining) have proven effective with high accuracy in multirelational classification. Unfortunately, most of them suffer from scalability problems with regard to the number of relations in databases. In this paper, we propose a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification, including 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search among different relations, 2) new definitions for predicates and decision-tree nodes, which involve aggregated information to provide essential statistics for classification, and 3) a selective sampling method for improving scalability with regard to the number of tuples. Based on these techniques, we propose two scalable and accurate methods for multirelational classification: CrossMine-Rule, a rule-based method, and CrossMine-Tree, a decision-tree-based method. Our comprehensive experiments on both real and synthetic data sets demonstrate the high scalability and accuracy of the CrossMine approach.
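The idea of virtually joining relations by propagating tuple IDs can be sketched as follows. This is a minimal illustration under stated assumptions, not the CrossMine implementation: relations are modeled as dictionaries of rows, and the function names, the `label` field, and the loan/account sample data are hypothetical.

```python
from collections import defaultdict


def propagate_tuple_ids(target_rows, other_rows, target_key, other_key):
    """Attach to each row of another relation the IDs of the target tuples
    it joins with, without materializing the joined relation.

    target_rows / other_rows: dict mapping a row ID to a dict of attributes.
    """
    ids_by_key = defaultdict(set)
    for tuple_id, row in target_rows.items():
        ids_by_key[row[target_key]].add(tuple_id)

    return {
        row_id: set(ids_by_key.get(row[other_key], set()))
        for row_id, row in other_rows.items()
    }


def count_labels(tuple_ids, target_rows, label_field="label"):
    """Count (positive, negative) target tuples among the propagated IDs."""
    positive = sum(1 for i in tuple_ids if target_rows[i][label_field])
    return positive, len(tuple_ids) - positive


# Hypothetical target relation (loans) and a related relation (accounts).
example_loans = {
    1: {"account_id": 124, "label": True},
    2: {"account_id": 124, "label": False},
    3: {"account_id": 108, "label": True},
}
example_accounts = {
    "a124": {"account_id": 124},
    "a108": {"account_id": 108},
    "a45": {"account_id": 45},
}
example_joined = propagate_tuple_ids(
    example_loans, example_accounts, "account_id", "account_id"
)
```

Once the IDs are attached, class statistics for a candidate predicate on the related relation can be computed with `count_labels` over the propagated IDs, so the physical join is never built.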
Data Mining and Knowledge Discovery | 2007
Xiaoxin Yin; Jiawei Han; Philip S. Yu
Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, when clustering objects in relational databases there are usually a large number of features conveying very different semantic information, and using all features indiscriminately is unlikely to generate meaningful results. Because the user knows her goal of clustering, we propose a new approach called CrossClus, which performs multi-relational clustering under the user’s guidance. Unlike semi-supervised clustering, which requires the user to provide a training set, we minimize the user’s effort by using a very simple form of user guidance. The user is only required to select one or a small set of features that are pertinent to the clustering goal, and CrossClus searches for other pertinent features in multiple relations. Each feature is evaluated by whether it clusters objects in a way similar to the user-specified features. We design efficient and accurate approaches for both feature selection and object clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of CrossClus.
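One way to picture "clusters objects in a similar way" is to compare how two categorical features group the same objects. The sketch below swaps in the Rand index (the fraction of object pairs on which two groupings agree) as a stand-in; it is not the similarity measure defined in the CrossClus paper, and the function name and sample features are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(feature_a, feature_b):
    """Rand index between the groupings induced by two categorical features.

    feature_a / feature_b: dict mapping each object to its feature value.
    Both dicts are assumed to cover the same objects.
    """
    pairs = list(combinations(sorted(feature_a), 2))
    if not pairs:
        return 0.0
    agree = sum(
        (feature_a[x] == feature_a[y]) == (feature_b[x] == feature_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


# Hypothetical user-selected feature and two candidate features.
research_area = {"s1": "db", "s2": "db", "s3": "ai", "s4": "ai"}
advisor_group = {"s1": "g1", "s2": "g1", "s3": "g2", "s4": "g2"}
birth_month = {"s1": "jan", "s2": "feb", "s3": "jan", "s4": "feb"}

score_advisor = pairwise_agreement(research_area, advisor_group)
score_month = pairwise_agreement(research_area, birth_month)
```

In this toy data the advisor-group feature groups students exactly as the user-selected research area does, so it would be treated as pertinent, while birth month agrees on far fewer pairs.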
knowledge discovery and data mining | 2005
Xiaoxin Yin; Jiawei Han; Philip S. Yu
Clustering is an essential data mining task with numerous applications. However, data in most real-life applications are high-dimensional in nature, and the related information often spreads across multiple relations. To ensure effective and efficient high-dimensional, cross-relational clustering, we propose a new approach, called CrossClus, which performs cross-relational clustering with user guidance. We believe that user guidance, even in very simple forms, could be essential for effective high-dimensional clustering, since a user knows the application requirements and data semantics well. CrossClus is carried out as follows: A user specifies a clustering task and selects one or a small set of features pertinent to the task. CrossClus extracts the set of highly relevant features in multiple relations connected via linkages defined in the database schema, evaluates their effectiveness based on user guidance, and identifies interesting clusters that fit the user's needs. This method takes care of both quality in feature extraction and efficiency in clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of this approach.
Third IEEE International Workshop on Information Assurance (IWIA'05) | 2005
Xiaoxin Yin; William Yurcik; Adam J. Slagell
Visualization of IP-based traffic dynamics on networks is a challenging task due to large data volume and the complex, temporal relationships between hosts. We present the architecture of VisFlowConnect-IP, a powerful new tool to visualize IP network traffic flow dynamics for security situational awareness. VisFlowConnect-IP allows an operator to visually assess the connectivity of large and complex networks on a single screen. It provides an overall view of the entire network and filter/drill-down features that allow operators to request more detailed information. Preliminary reports from several organizations using this tool indicate increased responsiveness to security events as well as new insights into the security dynamics of their networks. In this paper we focus specifically on the design decisions made during the VisFlowConnect development process so that others may learn from our experience. The current VisFlowConnect architecture - the result of these design decisions - is extensible to processing other high-volume multi-dimensional data streams where link connectivity/activity is a focus of study. We report experimental results quantifying the scalability of the underlying algorithms for representing link analysis given continuous high-volume traffic flows as input.
international conference on management of data | 2008
Yizhou Sun; Tianyi Wu; Zhijun Yin; Hong Cheng; Jiawei Han; Xiaoxin Yin; Peixiang Zhao
Online bibliographic databases, such as DBLP in computer science and PubMed in medical sciences, contain abundant information about research publications in different fields. Each such database forms a gigantic information network (hence called BibNet), connecting in complex ways research papers, authors, conferences/journals, and possibly citation information as well, and provides fertile ground for information network analysis. Our BibNetMiner is designed for sophisticated information network mining on such bibliographic databases. In this demo, we take the DBLP database as an example and demonstrate several attractive functions of BibNetMiner, including clustering, ranking and profiling of conferences and authors based on the research subfields. A user-friendly, visualization-enhanced interface will be provided to facilitate interactive exploration of a bibliographic database. This project will serve as an example to demonstrate the power of links in information network mining. Since the dataset is large and the network is heterogeneous, such a study will benefit the research on the analysis of massive heterogeneous information networks.
workshop on internet and network economics | 2005
Wen Ding; William Yurcik; Xiaoxin Yin
Firms hesitate to outsource their network security to outside security providers (called Managed Security Service Providers or MSSPs) because an MSSP may shirk secretly to increase profits. In economics this secret shirking behavior is commonly referred to as the Moral Hazard problem. There is a counterargument that this moral hazard problem is not as significant for the Internet security outsourcing market because MSSPs work hard to build and maintain their reputations, which are crucial to surviving competition. Both arguments make sense and should be considered when writing a successful contract. This paper studies the characteristics of optimal contracts (payment to MSSPs) for the security outsourcing market by setting up an economic framework that combines both effects. It is shown that an optimal contract should be performance-based. The degree of performance dependence decreases if the reputation effect becomes more significant. We also show that if serving a large group of customers helps the provider to improve service quality significantly (which is observed in the Internet security outsourcing market), an optimal contract should always be performance-based even if a strong reputation effect exists.
european conference on machine learning | 2005
Xiaoxin Yin; Jiawei Han
With the fast expansion of computer networks, it has become necessary to study data mining on heterogeneous databases. In this paper we propose MDBM, an accurate and efficient approach for classification on multiple heterogeneous databases. We propose a regression-based method for predicting the usefulness of inter-database links that serve as bridges for information transfer, because such links are automatically detected and may or may not be useful or even valid. Because of the high cost of inter-database communication, MDBM employs a new strategy for cross-database classification, which finds and performs actions with high benefit-to-cost ratios. The experiments show that MDBM achieves high accuracy in cross-database classification, with much higher efficiency than previous approaches.