
Kui Yu

Hefei University of Technology


Publication

Featured research published by Kui Yu.

international conference on data mining | 2014

Towards Scalable and Accurate Online Feature Selection for Big Data

Kui Yu; Xindong Wu; Wei Ding; Jian Pei

Feature selection is important in many big data applications. There are at least two critical challenges. Firstly, in many applications, the dimensionality is extremely high, in the millions, and keeps growing. Secondly, feature selection has to be highly scalable, preferably in an online manner such that each feature can be processed in a sequential scan. In this paper, we develop SAOLA, a Scalable and Accurate OnLine Approach for feature selection. With a theoretical analysis of a lower bound on the pairwise correlations between features in the currently selected feature subset, SAOLA employs novel online pairwise comparison techniques to address the two challenges and maintain a parsimonious model over time in an online manner. An empirical study using a series of benchmark real data sets shows that SAOLA is scalable on data sets of extremely high dimensionality, and has superior performance over the state-of-the-art feature selection methods.

ACM Transactions on Knowledge Discovery From Data | 2016

Scalable and Accurate Online Feature Selection for Big Data

Kui Yu; Xindong Wu; Wei Ding; Jian Pei

Feature selection is important in many big data applications. Two critical challenges are closely associated with big data. First, in many big data applications, the dimensionality is extremely high, in the millions, and keeps growing. Second, big data applications call for highly scalable feature selection algorithms that work in an online manner such that each feature can be processed in a sequential scan. In this paper, we present SAOLA, a Scalable and Accurate OnLine Approach for feature selection. With a theoretical analysis of bounds on the pairwise correlations between features, SAOLA employs novel pairwise comparison techniques and maintains a parsimonious model over time in an online manner. Furthermore, to deal with upcoming features that arrive in groups, we extend the SAOLA algorithm and propose a new group-SAOLA algorithm for online group feature selection. The group-SAOLA algorithm can maintain online a set of feature groups that is sparse at the levels of both groups and individual features simultaneously. An empirical study using a series of benchmark real datasets shows that our two algorithms, SAOLA and group-SAOLA, are scalable on datasets of extremely high dimensionality and have superior performance over the state-of-the-art feature selection methods.
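
The online pairwise comparison idea described in this abstract can be illustrated with a minimal sketch. This is not the published SAOLA algorithm: the absolute Pearson correlation is a stand-in for whatever correlation measure and bounds the paper uses, and the names `pearson`, `stream_select`, and `relevance_threshold` are hypothetical, chosen only for illustration. Each arriving feature is compared once against the target and once against each currently selected feature, so the selected set stays small as the stream grows.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)

def stream_select(feature_stream, target, relevance_threshold=0.1):
    """Process (name, column) pairs one at a time; return selected names.

    Assumed, simplified rules (not taken from the paper):
    - drop a new feature whose relevance to the target is too low;
    - drop a new feature if a selected, at-least-as-relevant feature is
      correlated with it at least as strongly as it is with the target;
    - symmetrically, evict a selected feature made redundant by the new one.
    """
    selected = {}  # name -> column
    for name, column in feature_stream:
        rel_new = abs(pearson(column, target))
        if rel_new <= relevance_threshold:
            continue
        keep_new = True
        for other, other_col in list(selected.items()):
            rel_other = abs(pearson(other_col, target))
            pair = abs(pearson(column, other_col))
            if rel_other >= rel_new and pair >= rel_new:
                keep_new = False  # existing feature subsumes the new one
                break
            if rel_new > rel_other and pair >= rel_other:
                del selected[other]  # new feature subsumes the existing one
        if keep_new:
            selected[name] = column
    return list(selected)

target = [1, 2, 3, 4, 5, 6]
stream = [
    ("a", [1, 2, 3, 4, 6, 5]),        # relevant, selected first
    ("a_noisy", [1, 2, 3, 4, 6, 4]),  # redundant given "a"
    ("noise", [3, 1, 4, 1, 5, 2]),    # weakly relevant
    ("b", [1, 2, 3, 4, 5.5, 5.5]),    # more relevant than "a", evicts it
]
print(stream_select(stream, target, relevance_threshold=0.3))
```

Because only the currently selected features are revisited for each arrival, the cost per feature depends on the size of the selected set rather than on the total number of features seen so far.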

international symposium on methodologies for intelligent systems | 2006

Triangulation of Bayesian networks using an adaptive genetic algorithm

Hao Wang; Kui Yu; Xindong Wu; Hongliang Yao

The search for an optimal node elimination sequence for the triangulation of Bayesian networks is an NP-hard problem. In this paper, a new method, called the TAGA algorithm, is proposed to search for the optimal node elimination sequence. TAGA adjusts the probabilities of crossover and mutation operators by itself, and provides an adaptive ranking-based selection operator that adjusts the pressure of selection according to the evolution of the population. Therefore the algorithm not only maintains the diversity of the population and avoids premature convergence, but also improves on-line and off-line performances. Experimental results show that the TAGA algorithm outperforms a simple genetic algorithm, an existing adaptive genetic algorithm, and simulated annealing on three Bayesian networks.
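
To make concrete why the elimination sequence matters, here is a minimal sketch of triangulating an undirected graph by node elimination and recording the fill-in edges each ordering requires. It does not reproduce TAGA or its genetic operators; the function name `triangulate` and the use of fill-in count as the cost are assumptions made for illustration (a search method might instead score orderings by clique or table size).

```python
def triangulate(adjacency, order):
    """Eliminate nodes in `order`; return the list of fill-in edges added.

    `adjacency` maps each node to the set of its neighbours in an
    undirected graph (for a Bayesian network, its moral graph).
    """
    adj = {v: set(ns) for v, ns in adjacency.items()}  # work on a copy
    fill_in = []
    for node in order:
        neighbours = sorted(adj[node])
        # Connect every pair of remaining neighbours of the eliminated node.
        for i, u in enumerate(neighbours):
            for w in neighbours[i + 1:]:
                if w not in adj[u]:
                    adj[u].add(w)
                    adj[w].add(u)
                    fill_in.append((u, w))
        # Remove the node from the graph.
        for u in neighbours:
            adj[u].discard(node)
        del adj[node]
    return fill_in

# A star graph: eliminating the hub first is the worst choice.
star = {"X": {"P", "Q", "R"}, "P": {"X"}, "Q": {"X"}, "R": {"X"}}
print(len(triangulate(star, ["X", "P", "Q", "R"])))  # hub first
print(len(triangulate(star, ["P", "Q", "R", "X"])))  # leaves first
```

On the star graph, eliminating the hub first adds three fill-in edges, while eliminating the leaves first adds none, which is the kind of difference an ordering search tries to exploit.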

knowledge discovery and data mining | 2012

Mining emerging patterns by streaming feature selection

Kui Yu; Wei Ding; Dan A. Simovici; Xindong Wu

Building an accurate emerging pattern classifier with a high-dimensional dataset is a challenging issue. The problem becomes even more difficult if the whole feature space is unavailable before learning starts. This paper presents a new technique for mining emerging patterns using streaming feature selection. We model high feature dimensions with streaming features, that is, features arrive and are processed one at a time. As features flow in one by one, we evaluate each incoming feature online to determine whether it is useful for mining predictive emerging patterns (EPs) by exploiting the relationship between feature relevance and EP discriminability (the predictive ability of an EP). We employ this relationship to guide an online EP mining process. This new approach can mine EPs from a high-dimensional dataset, even when its entire feature set is unavailable before learning. The experiments on a broad range of datasets validate the effectiveness of the proposed approach against other well-established methods, in terms of predictive accuracy, pattern numbers and running time.

IEEE Transactions on Knowledge and Data Engineering | 2013

Bridging Causal Relevance and Pattern Discriminability: Mining Emerging Patterns from High-Dimensional Data

Kui Yu; Wei Ding; Hao Wang; Xindong Wu

It is a nontrivial task to build an accurate emerging pattern (EP) classifier from high-dimensional data because we inevitably face two challenges: 1) how to efficiently extract a minimal set of strongly predictive EPs from an explosive number of candidate patterns, and 2) how to handle the highly sensitive choice of the minimal support threshold. To address these two challenges, we bridge causal relevance and EP discriminability (the predictive ability of emerging patterns) to facilitate EP mining and propose a new framework for mining EPs from high-dimensional data. In this framework, we study the relationships between causal relevance in a causal Bayesian network and EP discriminability in EP mining, and then reduce the pattern space of EP mining to direct causes and direct effects, or the Markov blanket (MB) of the class attribute in a causal Bayesian network. The proposed framework is instantiated by two EP-based classifiers, CE-EP and MB-EP, where CE stands for direct Causes and direct Effects, and MB for Markov Blanket. Extensive experiments on a broad range of data sets validate the effectiveness of the CE-EP and MB-EP classifiers against other well-established methods, in terms of predictive accuracy, pattern numbers, running time, and sensitivity analysis.

knowledge discovery and data mining | 2013

Towards long-lead forecasting of extreme flood events: a data mining framework for precipitation cluster precursors identification

Dawei Wang; Wei Ding; Kui Yu; Xindong Wu; Ping Chen; David Small; Shafiqul Islam

The development of disastrous flood forecasting techniques able to provide warnings at a long lead time (5-15 days) is of great importance to society. An extreme flood is usually a consequence of a sequence of precipitation events occurring over several days to several weeks. Though precise short-term forecasting of the magnitude and extent of an individual precipitation event is still beyond our reach, long-term forecasting of precipitation clusters can be attempted by identifying persistent atmospheric regimes that are conducive to the precipitation clusters. However, such forecasting will suffer from an overwhelming number of relevant features and a high imbalance of sample sets. In this paper, we propose an integrated data mining framework for identifying the precursors to precipitation event clusters and use this information to predict extended periods of extreme precipitation and subsequent floods. We synthesize a representative feature set that describes the atmospheric motion, and apply a streaming feature selection algorithm to identify online the precipitation precursors from the enormous feature space. A hierarchical re-sampling approach is embedded in the framework to deal with the imbalance problem. An extensive empirical study is conducted on historical precipitation and associated flood data collected in the State of Iowa. Using our framework, a few physically meaningful precipitation cluster precursor sets are identified from millions of features. More than 90% of extreme precipitation events are captured by the proposed prediction model using precipitation cluster precursors with a lead time of more than 5 days.

knowledge discovery and data mining | 2007

A parallel algorithm for learning Bayesian networks

Kui Yu; Hao Wang; Xindong Wu

Computing the expected statistics is the main bottleneck in learning Bayesian networks in large-scale problem domains. This paper presents a parallel learning algorithm, PL-SEM, for learning Bayesian networks, based on an existing structural EM algorithm (SEM). Since the computation of the expected statistics is in the parametric learning part of the SEM algorithm, PL-SEM exploits a parallel EM algorithm to compute the expected statistics. The parallel EM algorithm parallelizes the E-step and M-step. At the E-step, PL-SEM computes the expected statistics of each sample in parallel; at the M-step, using the conditional independence of Bayesian networks and the expected statistics computed at the E-step, PL-SEM exploits the decomposition property of the likelihood function under the completed data to estimate each local likelihood function in parallel. PL-SEM effectively computes the expected statistics, and greatly reduces the time complexity of learning Bayesian networks.

international conference on data mining | 2011

Causal Associative Classification

Kui Yu; Xindong Wu; Wei Ding; Hao Wang; Hongliang Yao

Associative classifiers have received considerable attention due to their easy-to-understand models and promising performance. However, with a high-dimensional dataset, associative classifiers inevitably face two challenges: (1) how to extract a minimal set of strongly predictive rules from an explosive number of generated association rules, and (2) how to deal with the highly sensitive choice of the minimal support threshold. In order to address these two challenges, we introduce causality into associative classification, and propose a new framework of causal associative classification. In this framework, we use causal Bayesian networks to bridge irrelevant and redundant features with irrelevant and redundant rules in associative classification. Without loss of prediction power, the feature space involved with the antecedent of a classification rule is reduced to the space of the direct causes, direct effects, and direct causes of the direct effects, a.k.a. the Markov blanket, of the consequent of the rule in causal Bayesian networks. The proposed framework is instantiated via baseline classifiers using emerging patterns. Experimental results show that our framework significantly reduces the model complexity while outperforming the other state-of-the-art algorithms.
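
The Markov blanket definition given in this abstract (direct causes, direct effects, and direct causes of the direct effects) can be read directly off a known causal graph. The sketch below assumes the graph is already available as a mapping from each node to its set of parents; the function name `markov_blanket` and the example network are hypothetical, and the sketch says nothing about how the paper learns the blanket from data.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a DAG.

    `parents` maps every node to the set of its direct causes.
    """
    direct_causes = set(parents.get(target, ()))
    # Direct effects: nodes that list the target among their parents.
    direct_effects = {c for c, ps in parents.items() if target in ps}
    # Other direct causes of those effects (often called spouses).
    spouses = {p for c in direct_effects for p in parents[c]} - {target}
    return direct_causes | direct_effects | spouses

# Hypothetical network: A -> C <- B, C -> D <- E, D -> F
dag = {
    "A": set(), "B": set(), "E": set(),
    "C": {"A", "B"},
    "D": {"C", "E"},
    "F": {"D"},
}
print(sorted(markov_blanket(dag, "C")))
```

In this example the blanket of `C` contains its causes `A` and `B`, its effect `D`, and `E` as the other cause of `D`; the node `F` lies outside it.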

international conference on data mining | 2013

Markov Blanket Feature Selection with Non-faithful Data Distributions

Kui Yu; Xindong Wu; Zan Zhang; Yang Mu; Hao Wang; Wei Ding

In faithful Bayesian networks, the Markov blanket of the class attribute is a unique and minimal feature subset for optimal feature selection. However, little attention has been paid to Markov blanket feature selection in a non-faithful environment which widely exists in the real world. To tackle this issue, in this paper, we deal with non-faithful data distributions and propose the concept of representative sets instead of Markov blankets. With a standard sparse group lasso for selection of features from the representative sets, we design an effective algorithm, SRS, for Markov blanket feature Selection via Representative Sets with non-faithful data distributions. Empirical studies demonstrate that SRS outperforms the state-of-the-art Markov blanket feature selectors and other well-established feature selection methods.

knowledge discovery and data mining | 2015

Tornado Forecasting with Multiple Markov Boundaries

Kui Yu; Dawei Wang; Wei Ding; Jian Pei; David Small; Shafiqul Islam; Xindong Wu

Reliable tornado forecasting with a long lead time can greatly support emergency response and is of vital importance for the economy and society. The large number of meteorological variables in spatiotemporal domains and the complex relationships among variables remain the top difficulties for long-lead tornado forecasting. Standard data mining approaches to tackling high dimensionality are usually designed to discover a single set of features, without alternative options for domain scientists to select more reliable and physically interpretable variables. In this work, we provide a new solution that uses the concept of multiple Markov boundaries in local causal discovery to identify multiple sets of precursors for tornado forecasting. Specifically, our algorithm first confines the extremely large feature space to a small core feature space, then mines from the core feature space multiple sets of precursors that may equally contribute to tornado forecasting. With the multiple sets of precursors, we are able to report to domain scientists the predictive but practical set of precursors. An extensive empirical study is conducted on eight benchmark data sets and the historical tornado data near Oklahoma City, OK in the United States. Experimental results show that the tornado precursors we identified can help to improve the reliability of long-lead-time catastrophic tornado forecasting.
