Xuanhui Wang
University of Illinois at Urbana–Champaign
Publication
Featured research published by Xuanhui Wang.
architectural support for programming languages and operating systems | 2006
Zhenmin Li; Lin Tan; Xuanhui Wang; Shan Lu; Yuanyuan Zhou; ChengXiang Zhai
Software errors are a major cause of system failures. Effectively designing tools and support for detecting and recovering from software failures requires a deep understanding of bug characteristics. Recently, software and its development process have changed significantly in many ways, including more help from bug detection tools, a shift towards multi-threading architectures, the open-source development paradigm, and increasing concerns about security and user-friendly interfaces. Therefore, results from previous studies may not be applicable to present software. Furthermore, many new aspects such as security, concurrency and open-source-related characteristics have not been well studied. Additionally, previous studies were based on a small number of bugs, which may lead to non-representative results. To investigate the impacts of the new factors on software errors, we analyze bug characteristics by first sampling hundreds of real world bugs in two large, representative open-source projects. To validate the representativeness of our results, we use natural language text classification techniques and automatically analyze around 29,000 bugs from the Bugzilla databases of the software. Our study has discovered several new interesting characteristics: (1) memory-related bugs have decreased because quite a few effective detection tools became available recently; (2) surprisingly, some simple memory-related bugs such as NULL pointer dereferences that should have been detected by existing tools in development are still a major component, which indicates that the tools have not been used to their full capacity; (3) semantic bugs are the dominant root causes, as they are application specific and difficult to fix, which suggests that more effort should be put into detecting and fixing them; (4) security bugs are increasing, and the majority of them cause severe impacts.
knowledge discovery and data mining | 2007
Xuanhui Wang; ChengXiang Zhai; Xiao Hu; Richard Sproat
Previous work on text mining has almost exclusively focused on a single stream. However, we often have available multiple text streams indexed by the same set of time points (called coordinated text streams), which offer new opportunities for text mining. For example, when a major event happens, all the news articles published by different agencies in different languages tend to cover the same event for a certain period, exhibiting a correlated bursty topic pattern in all the news article streams. In general, mining correlated bursty topic patterns from coordinated text streams can reveal interesting latent associations or events behind these streams. In this paper, we define and study this novel text mining problem. We propose a general probabilistic algorithm which can effectively discover correlated bursty patterns and their bursty periods across text streams even if the streams have completely different vocabularies (e.g., English vs Chinese). Evaluation of the proposed method on a news data set and a literature data set shows that it can effectively discover quite meaningful topic patterns from both data sets: the patterns discovered from the news data set accurately reveal the major common events covered in the two streams of news articles (in English and Chinese, respectively), while the patterns discovered from two database publication streams match well with the major research paradigm shifts in database research. Since the proposed method is general and does not require the streams to share vocabulary, it can be applied to any coordinated text streams to discover correlated topic patterns that burst in multiple streams in the same period.
international conference on machine learning | 2009
Deng Cai; Xuanhui Wang; Xiaofei He
Dyadic data arises in many real world applications such as social network analysis and information retrieval. In order to discover the underlying or hidden structure in dyadic data, many topic modeling techniques have been proposed. Typical algorithms include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The probability density functions obtained by both algorithms are supported on the Euclidean space. However, many previous studies have shown that naturally occurring data may reside on or close to an underlying submanifold. We introduce a probabilistic framework for modeling both the topical and geometrical structure of dyadic data that explicitly takes into account the local manifold structure. Specifically, the local manifold structure is modeled by a graph. The graph Laplacian, analogous to the Laplace-Beltrami operator on manifolds, is applied to smooth the probability density functions. As a result, the obtained probability distributions are concentrated around the data manifold. Experimental results on real data sets demonstrate the effectiveness of the proposed approach.
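As a rough illustration of the graph-Laplacian smoothing mentioned in this abstract, the sketch below builds the unnormalized Laplacian L = D - W of a small neighbor graph and evaluates the quadratic penalty f^T L f, which is small when neighboring documents have similar topic probabilities. The graph, the probability values, and the function names are made up for illustration; the paper's actual objective and optimization are not reproduced here.

```python
def graph_laplacian(weights):
    """Return the unnormalized Laplacian L = D - W for a symmetric weight matrix."""
    n = len(weights)
    degrees = [sum(row) for row in weights]
    return [
        [(degrees[i] if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness_penalty(values, weights):
    """Compute f^T L f, which equals 0.5 * sum_ij W_ij * (f_i - f_j)^2."""
    lap = graph_laplacian(weights)
    n = len(values)
    return sum(values[i] * lap[i][j] * values[j] for i in range(n) for j in range(n))


# Hypothetical neighbor graph over three documents: 0-1 and 1-2 are neighbors.
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Made-up values of P(topic | document) for a single topic.
smooth = [0.50, 0.55, 0.60]  # varies slowly along the graph
rough = [0.10, 0.90, 0.10]   # neighbors disagree strongly

print(smoothness_penalty(smooth, W))
print(smoothness_penalty(rough, W))
```

A regularizer of this form penalizes the second assignment far more than the first, which is the sense in which the Laplacian keeps the learned distributions concentrated around the data manifold.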
conference on information and knowledge management | 2008
Xuanhui Wang; ChengXiang Zhai
Search engine logs are an emerging new type of data that offers interesting opportunities for data mining. Existing work on mining such data has mostly attempted to discover knowledge at the level of queries (e.g., query clusters). In this paper, we propose to mine search engine logs for patterns at the level of terms through analyzing the relations of terms inside a query. We define two novel term association patterns (i.e., context-sensitive term substitutions and term additions) and propose new methods for mining such patterns from search engine logs. These two patterns can be used to address the mis-specification and under-specification problems of ineffective queries. Experiment results on real search engine logs show that the mined context-sensitive term substitutions can be used to effectively reword queries and improve their accuracy, while the mined context-sensitive term addition patterns can be used to support query refinement in a more effective way.
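To make the idea of a context-sensitive term substitution concrete, here is a minimal sketch under simplifying assumptions: queries are grouped into sessions, and a substitution is counted whenever two consecutive queries of equal length differ in exactly one position, keyed by the neighboring terms. The session data, the `<s>` and `</s>` boundary markers, and the function name are hypothetical; the abstract does not specify the paper's actual mining method.

```python
from collections import Counter


def mine_substitutions(sessions):
    """Count (left, old, new, right) patterns from consecutive same-length
    queries in a session that differ in exactly one position."""
    counts = Counter()
    for queries in sessions:
        for before, after in zip(queries, queries[1:]):
            a, b = before.split(), after.split()
            if len(a) != len(b):
                continue  # term additions or deletions are not substitutions
            diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
            if len(diff) != 1:
                continue
            i = diff[0]
            left = a[i - 1] if i > 0 else "<s>"
            right = a[i + 1] if i + 1 < len(a) else "</s>"
            counts[(left, a[i], b[i], right)] += 1
    return counts


# Made-up query sessions.
sessions = [
    ["cheap auto wash", "cheap car wash"],
    ["auto wash near me", "car wash near me"],
    ["auto trader", "auto trader uk"],
]
patterns = mine_substitutions(sessions)
for pattern, count in patterns.items():
    print(pattern, count)
```

Keying each pattern on its left and right context is what distinguishes "auto" to "car" before "wash" from the same substitution in "auto trader", where it would be inappropriate.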
language and technology conference | 2006
Tao Tao; Xuanhui Wang; Qiaozhu Mei; ChengXiang Zhai
Language model information retrieval depends on accurate estimation of document models. In this paper, we propose a document expansion technique to deal with the problem of insufficient sampling of documents. We construct a probabilistic neighborhood for each document, and expand the document with its neighborhood information. The expanded document provides a more accurate estimation of the document model, and thus improves retrieval accuracy. Moreover, since document expansion and pseudo feedback exploit different corpus structures, they can be combined to further improve performance. Experimental results on several different data sets demonstrate the effectiveness of the proposed document expansion method.
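The expansion step can be sketched as follows, assuming cosine similarity defines the neighborhood weights and a single interpolation parameter `alpha` balances the original counts against the neighbors' counts. Both choices, along with the toy documents and function names, are assumptions for illustration rather than the paper's exact formulation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def expand(doc_id, docs, alpha=0.5):
    """Interpolate a document's counts with similarity-weighted neighbor counts."""
    target = docs[doc_id]
    sims = {d: cosine(target, c) for d, c in docs.items() if d != doc_id}
    total = sum(sims.values())
    if total == 0:
        return Counter(target)  # no similar neighbors: keep the original counts
    expanded = Counter({w: alpha * c for w, c in target.items()})
    for d, sim in sims.items():
        if sim == 0:
            continue
        weight = (1 - alpha) * sim / total  # normalized neighborhood weight
        for w, c in docs[d].items():
            expanded[w] += weight * c
    return expanded


# Made-up three-document collection.
docs = {
    "d1": Counter("information retrieval model".split()),
    "d2": Counter("information retrieval search engine".split()),
    "d3": Counter("graph laplacian manifold".split()),
}
expanded = expand("d1", docs)
print(expanded)
```

After expansion, "d1" gains pseudo-counts for "search" and "engine" from its similar neighbor, so a smoothed language model estimated from it can match queries containing terms the original short document never mentioned.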
international acm sigir conference on research and development in information retrieval | 2008
Xuanhui Wang; Hui Fang; ChengXiang Zhai
Negative relevance feedback is a special case of relevance feedback where we do not have any positive example; this often happens when the topic is difficult and the search results are poor. Although in principle any standard relevance feedback technique can be applied to negative relevance feedback, it may not perform well due to the lack of positive examples. In this paper, we conduct a systematic study of methods for negative relevance feedback. We compare a set of representative negative feedback methods, covering vector-space models and language models, as well as several special heuristics for negative feedback. Evaluating negative feedback methods requires a test set with sufficient difficult topics, but there are not many naturally difficult topics in the existing test collections. We use two sampling strategies to adapt a test collection with easy topics to evaluate negative feedback. Experiment results on several TREC collections show that language model based negative feedback methods are generally more effective than those based on vector-space models, and using multiple negative models is an effective heuristic for negative feedback. Our results also show that it is feasible to adapt test collections with easy topics for evaluating negative feedback methods through sampling.
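For the vector-space side of this comparison, a negative-only variant of Rocchio feedback is one simple possibility: with no positive examples, the query vector is just pushed away from the centroid of the documents judged non-relevant. The sketch below assumes that variant; the weight `gamma`, the toy term vectors, and the function names are illustrative and not taken from the paper.

```python
from collections import Counter


def negative_rocchio(query, negative_docs, gamma=0.5):
    """Move the query vector away from the centroid of the non-relevant documents."""
    centroid = Counter()
    for doc in negative_docs:
        for term, weight in doc.items():
            centroid[term] += weight / len(negative_docs)
    terms = set(query) | set(centroid)
    return {t: query.get(t, 0.0) - gamma * centroid[t] for t in terms}


def score(query_vec, doc):
    """Dot product between the (possibly negative-weighted) query and a document."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query_vec.items())


# Made-up example: the user wants the animal, but the top results are about cars.
query = {"jaguar": 1.0}
negatives = [
    {"jaguar": 1.0, "car": 1.0, "dealer": 1.0},
    {"jaguar": 1.0, "car": 1.0, "price": 1.0},
]
new_query = negative_rocchio(query, negatives)

animal_doc = {"jaguar": 1.0, "cat": 1.0, "habitat": 1.0}
car_doc = {"jaguar": 1.0, "car": 1.0, "engine": 1.0}
print(score(new_query, animal_doc), score(new_query, car_doc))
```

Both unseen documents score the same under the original query; after feedback, the car document is penalized through the negative weight on "car" while the animal document is not.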
international conference on data mining | 2003
Hua-Jun Zeng; Xuanhui Wang; Zheng Chen; Hongjun Lu; Wei-Ying Ma
Semisupervised learning methods construct classifiers using both labeled and unlabeled training data samples. While unlabeled data samples can help to improve the accuracy of trained models to a certain extent, existing methods still face difficulties when labeled data is insufficient and biased against the underlying data distribution. We present a clustering based classification (CBC) approach. Using this approach, training data, including both the labeled and unlabeled data, is first clustered with the guidance of the labeled data. Some of the unlabeled data samples are then labeled based on the clusters obtained. Discriminative classifiers can subsequently be trained with the expanded labeled dataset. The effectiveness of the proposed method is justified analytically. Our experimental results demonstrate that CBC outperforms existing algorithms when the size of the labeled dataset is very small.
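The clustering and labeling steps can be sketched as a seeded k-means, assuming one cluster per class, centroids initialized from the labeled points, and labeled points held fixed in their class cluster. These choices, the `per_class` cutoff, and the 2-D toy data are assumptions for illustration; the paper's clustering algorithm may differ.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def cluster_and_label(labeled, unlabeled, per_class=2, iterations=10):
    """Seeded k-means: labeled points stay in their class cluster and guide the centroids."""
    classes = sorted({y for _, y in labeled})
    centroids = {y: mean([x for x, lab in labeled if lab == y]) for y in classes}
    assign = []
    for _ in range(iterations):
        # Assign each unlabeled point to the nearest class centroid.
        assign = [min(classes, key=lambda y: math.dist(x, centroids[y])) for x in unlabeled]
        # Recompute centroids from labeled points plus assigned unlabeled points.
        for y in classes:
            members = [x for x, lab in labeled if lab == y]
            members += [x for x, a in zip(unlabeled, assign) if a == y]
            centroids[y] = mean(members)
    # Label only the unlabeled points nearest to each centroid.
    expanded = list(labeled)
    for y in classes:
        members = [x for x, a in zip(unlabeled, assign) if a == y]
        members.sort(key=lambda x: math.dist(x, centroids[y]))
        expanded += [(x, y) for x in members[:per_class]]
    return expanded


# Made-up data: one labeled point per class and six unlabeled points.
labeled = [((0.0, 0.0), "neg"), ((10.0, 10.0), "pos")]
unlabeled = [(1.0, 1.0), (0.0, 2.0), (3.0, 3.0), (9.0, 9.0), (10.0, 8.0), (7.0, 7.0)]
expanded = cluster_and_label(labeled, unlabeled)
print(expanded)
```

The returned list is the expanded labeled dataset; any discriminative classifier could then be trained on it in place of the original two labeled points.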
international acm sigir conference on research and development in information retrieval | 2006
Xuanhui Wang; Jian Tao Sun; Zheng Chen; ChengXiang Zhai
Co-occurrence data is quite common in many real applications. Latent Semantic Analysis (LSA) has been successfully used to identify semantic relations in such data. However, LSA can only handle a single co-occurrence relationship between two types of objects. In practical applications, there are many cases where multiple types of objects exist and any pair of these objects could have a pairwise co-occurrence relation. All these co-occurrence relations can be exploited to alleviate data sparseness or to represent objects more meaningfully. In this paper, we propose a novel algorithm, M-LSA, which conducts latent semantic analysis by incorporating all pairwise co-occurrences among multiple types of objects. Based on the mutual reinforcement principle, M-LSA identifies the most salient concepts among the co-occurrence data and represents all the objects in a unified semantic space. M-LSA is general and we show that several variants of LSA are special cases of our algorithm. Experiment results show that M-LSA outperforms LSA on multiple applications, including collaborative filtering, text clustering, and text categorization.
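One plausible reading of the mutual reinforcement principle is sketched below: all pairwise co-occurrence matrices are placed as blocks in a single symmetric matrix over every object, and power iteration finds its dominant eigenvector, so that an object scores highly when it co-occurs with other high-scoring objects of any type. This is only an illustrative approximation of the idea, not the M-LSA algorithm as published; the object types, co-occurrence counts, and function names are hypothetical.

```python
import math


def unified_matrix(blocks, sizes):
    """Assemble a symmetric matrix from pairwise co-occurrence blocks.

    blocks maps (type_a, type_b) to a len(a) x len(b) matrix; sizes maps type -> count.
    """
    offset, start = {}, 0
    for t in sizes:
        offset[t] = start
        start += sizes[t]
    R = [[0.0] * start for _ in range(start)]
    for (a, b), M in blocks.items():
        for i, row in enumerate(M):
            for j, v in enumerate(row):
                R[offset[a] + i][offset[b] + j] = float(v)
                R[offset[b] + j][offset[a] + i] = float(v)  # mirror to keep R symmetric
    return R


def dominant_concept(R, iterations=500):
    """Power iteration: each object's score is reinforced by the objects it co-occurs with."""
    n = len(R)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Made-up co-occurrence counts among three object types.
sizes = {"user": 2, "doc": 2, "word": 2}
blocks = {
    ("user", "doc"): [[3, 1], [1, 2]],
    ("doc", "word"): [[2, 1], [1, 3]],
    ("user", "word"): [[1, 1], [1, 1]],
}
R = unified_matrix(blocks, sizes)
scores = dominant_concept(R)
print(scores)
```

Taking several leading eigenvectors instead of one would place users, documents, and words in a shared low-dimensional space, which is the kind of unified representation the abstract describes.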
Empirical Software Engineering | 2014
Lin Tan; Chen Liu; Zhenmin Li; Xuanhui Wang; Yuanyuan Zhou; ChengXiang Zhai
Designing effective tools for detecting and recovering from software failures requires a deep understanding of software bug characteristics. We study software bug characteristics by sampling 2,060 real world bugs in three large, representative open-source projects: the Linux kernel, Mozilla, and Apache. We manually study these bugs in three dimensions: root causes, impacts, and components. We further study the correlation between categories in different dimensions, and the trend of different types of bugs. The findings include: (1) semantic bugs are the dominant root cause. As software evolves, semantic bugs increase, while memory-related bugs decrease, calling for more research effort to address semantic bugs; (2) the Linux kernel operating system (OS) has more concurrency bugs than its non-OS counterparts, suggesting that more effort should go into detecting concurrency bugs in operating system code; and (3) reported security bugs are increasing, and the majority of them are caused by semantic bugs, suggesting more support to help developers diagnose and fix security bugs, especially semantic security bugs. In addition, to reduce the manual effort in building bug benchmarks for evaluating bug detection and diagnosis tools, we use machine learning techniques to classify 109,014 bugs automatically.
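The abstract does not say which machine learning techniques were used for automatic classification. As a stand-in, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on bug report text to predict a root-cause category. The classifier choice, the whitespace tokenization, and the six training reports are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, reports, labels):
        for text, label in zip(reports, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.class_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up bug report summaries with hypothetical root-cause labels.
reports = [
    "null pointer dereference crash in driver",
    "buffer overflow writes past array bounds",
    "deadlock when two threads acquire locks",
    "data race on shared counter between threads",
    "wrong return value violates specification",
    "incorrect handling of corner case in parser",
]
labels = ["memory", "memory", "concurrency", "concurrency", "semantic", "semantic"]

model = NaiveBayes().fit(reports, labels)
print(model.predict("race between threads on lock"))
print(model.predict("null pointer crash"))
```

A classifier like this would be trained on the manually labeled sample and then applied to the remaining unlabeled reports, which is the general workflow the abstract describes for scaling beyond manual study.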
web search and data mining | 2013
Yi Chang; Xuanhui Wang; Qiaozhu Mei; Yan Liu
Twitter has become one of the most popular platforms for users to share information in real time. However, as an individual tweet is short and lacks sufficient contextual information, users cannot effectively understand or consume information on Twitter, which can make users less engaged or even detached from using Twitter. In order to provide informative context to a Twitter user, we propose the task of Twitter context summarization, which generates a succinct summary from a large but noisy Twitter context tree. Traditional summarization techniques only consider text information, which is insufficient for the Twitter context summarization task, since text information on Twitter is very sparse. Given that there are rich user interactions on Twitter, we study how to improve summarization methods by leveraging such signals. In particular, we study how user influence models, which project user interaction information onto a Twitter context tree, can help Twitter context summarization within a supervised learning framework. To evaluate our methods, we construct a data set by asking human editors to manually select the most informative tweets as a summary. Our experimental results based on this editorial data set show that Twitter context summarization is a promising research topic and pairwise user influence signals can significantly improve the task performance.