Ruichu Cai
Guangdong University of Technology
Publication
Featured research published by Ruichu Cai.
International Conference on Management of Data | 2009
Zhenjie Zhang; Yin Yang; Ruichu Cai; Dimitris Papadias; Anthony K. H. Tung
The skyline of a d-dimensional dataset consists of all points not dominated by others. The incorporation of the skyline operator into practical database systems necessitates an efficient and effective cardinality estimation module. However, existing theoretical work on this problem is limited to the case where all d dimensions are independent of each other, which rarely holds for real datasets. The state-of-the-art Log Sampling (LS) technique simply applies theoretical results for independent dimensions to non-independent data anyway, sometimes leading to large estimation errors. To solve this problem, we propose a novel Kernel-Based (KB) approach that approximates the skyline cardinality with nonparametric methods. Extensive experiments with various real datasets demonstrate that KB achieves high accuracy, even in cases where LS fails. At the same time, despite its numerical nature, the efficiency of KB is comparable to that of LS. Furthermore, we extend both LS and KB to the k-dominant skyline, which is commonly used instead of the conventional skyline for high-dimensional data.
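As a point of reference for the terms in this abstract, the following is a minimal sketch of exact skyline and k-dominant skyline computation by brute force. It is not the LS or KB estimator from the paper. It assumes that smaller values are better in every dimension, and it uses the common definition of k-dominance (no worse in at least k dimensions and strictly better in at least one of them); the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q: no worse in every dimension, strictly better in at least one.

    Assumption: smaller values are preferred in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """p k-dominates q: no worse in at least k dimensions, strictly better in one of them."""
    no_worse = sum(1 for a, b in zip(p, q) if a <= b)
    strictly_better = sum(1 for a, b in zip(p, q) if a < b)
    return no_worse >= k and strictly_better >= 1


def skyline(points: List[Point]) -> List[Point]:
    """Return all points that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def k_dominant_skyline(points: List[Point], k: int) -> List[Point]:
    """Return all points that no other point k-dominates."""
    return [p for p in points if not any(k_dominates(q, p, k) for q in points)]


points = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
sky = skyline(points)
cardinality = len(sky)  # the quantity a cardinality estimator tries to predict
```

With k equal to the number of dimensions, k-dominance coincides with ordinary dominance, so the two functions return the same result. Smaller k makes domination easier and shrinks the result, which is why the k-dominant skyline is used on high-dimensional data.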
IEEE Transactions on Knowledge and Data Engineering | 2011
Ruichu Cai; Anthony K. H. Tung; Zhenjie Zhang; Zhifeng Hao
In previous studies, association rules have been proven to be useful in classification problems over high dimensional gene expression data. However, due to the nature of such data sets, it is often the case that millions of rules can be derived such that many of them are covered by exactly the same set of training tuples and thus have exactly the same support and confidence. Ranking and selecting useful rules from such equivalent rule groups remains an interesting and unexplored problem. In this paper, we look at two interestingness measures for ranking the interestingness of rules within an equivalent rule group: Max-Subrule-Conf and Min-Subrule-Conf. Based on these interestingness measures, an incremental Apriori-like algorithm is designed to select more interesting rules from the lower bound rules of the group. Moreover, we present an improved classification model to fully exploit the potential of the selected rules. Our empirical studies on our proposed methods over five gene expression data sets show that our proposals improve both the efficiency and effectiveness of the rule extraction and classifier construction over gene expression data sets.
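To make the notion of an equivalent rule group concrete, here is a small sketch on made-up data. It only illustrates support, confidence, and grouping of rule antecedents by the set of training tuples they cover; it does not implement Max-Subrule-Conf, Min-Subrule-Conf, or the paper's Apriori-like algorithm. The toy rows, item names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each row is (set of expressed items, class label).
rows = [
    ({"g1", "g2", "g3"}, "cancer"),
    ({"g1", "g2"}, "cancer"),
    ({"g3"}, "normal"),
    ({"g1", "g2", "g3"}, "normal"),
]


def cover(itemset, rows):
    """Indices of the training tuples that contain every item of the antecedent."""
    return frozenset(i for i, (items, _) in enumerate(rows) if set(itemset) <= items)


def support_confidence(itemset, label, rows):
    """Support (as a count) and confidence of the rule itemset -> label."""
    covered = cover(itemset, rows)
    if not covered:
        return 0, 0.0
    hits = sum(1 for i in covered if rows[i][1] == label)
    return hits, hits / len(covered)


def rule_groups(rows, max_len=2):
    """Group antecedents of up to max_len items by their covering tuple set."""
    items = sorted(set().union(*(r[0] for r in rows)))
    groups = defaultdict(list)
    for n in range(1, max_len + 1):
        for combo in combinations(items, n):
            covered = cover(combo, rows)
            if covered:  # skip antecedents that match no tuple
                groups[covered].append(combo)
    return groups


groups = rule_groups(rows)
same_cover = groups[frozenset({0, 1, 3})]
```

In this toy data the antecedents `("g1",)`, `("g2",)`, and `("g1", "g2")` cover the same three tuples, so any rule built from them has identical support and confidence. Telling such rules apart is the ranking problem the abstract describes.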
Pattern Recognition | 2011
Ruichu Cai; Zhenjie Zhang; Zhifeng Hao
Feature selection is an important preprocessing step for building efficient, generalizable and interpretable classifiers on high dimensional data sets. Under the assumption of sufficient labelled samples, the Markov Blanket provides a complete and sound solution to the selection of optimal features, by exploring the conditional independence relationships among the features. In real-world applications, unfortunately, it is usually easy to get unlabelled samples, but expensive to obtain the corresponding accurate labels on the samples. This leads to the potential waste of valuable classification information buried in unlabelled samples. In this paper, we propose a new BAyesian Semi-SUpervised Method, or BASSUM for short, to exploit the value of unlabelled samples in the feature selection problem for classification. Generally speaking, the inclusion of unlabelled samples helps the feature selection algorithm by (1) pinpointing more specific conditional independence tests involving fewer variable features and (2) improving the robustness of individual conditional independence tests with additional statistical information. Our experimental results show that BASSUM enhances the efficiency of traditional feature selection methods and overcomes the difficulties with redundant features in existing semi-supervised solutions.
Neural Networks | 2013
Ruichu Cai; Zhenjie Zhang; Zhifeng Hao
With the advances of biomedical techniques in the last decade, the costs of human genomic sequencing and genomic activity monitoring are coming down rapidly. To support the huge genome-based business in the near future, researchers are eager to find killer applications based on human genome information. Causal gene identification is one of the most promising applications, which may help potential patients to estimate the risk of certain genetic diseases and locate the target gene for further genetic therapy. Unfortunately, existing pattern recognition techniques, such as Bayesian networks, cannot be directly applied to find the accurate causal relationship between genes and diseases. This is mainly due to the insufficient number of samples and the extremely high dimensionality of the gene space. In this paper, we present the first practical solution to causal gene identification, utilizing a new combinatorial formulation over V-Structures commonly used in conventional Bayesian networks, by exploring the combinations of significant V-Structures. We prove the NP-hardness of the combinatorial search problem under general settings of the significance measure on the V-Structures, and present a greedy algorithm to find sub-optimal results. Extensive experiments show that our proposal is both scalable and effective, particularly with interesting findings on the causal genes over real human genome data.
Information Sciences | 2014
Ruichu Cai; Zhenjie Zhang; Anthony K. H. Tung; Chenyun Dai; Zhifeng Hao
Hierarchical clustering is a traditional topic in computer science, which aims to discover a consistent hierarchy of clusters with different granularities. One of the most important open questions on hierarchical clustering is the identification of the meaningful clustering levels in the hierarchical structure. In this paper, we answer this question from an algorithmic point of view. In particular, we derive a quantitative analysis on the impact of the low-level clustering costs on high-level clusters, when agglomerative algorithms are run to construct the hierarchy. This analysis enables us to find meaningful clustering levels, which are independent of the clusters hierarchically beneath them. We thus propose a general agglomerative hierarchical clustering framework, which automatically constructs meaningful clustering levels. This framework is proven to be generally applicable to any k-clustering problem in any α-relaxed metric space, in which the strict triangle inequality is relaxed within some constant factor α. To fully utilize the hierarchical clustering framework, we conduct some case studies on k-median and k-means clustering problems, in both of which our framework achieves a better approximation factor than the state-of-the-art methods. We also extend our framework to handle the data stream clustering problem, which allows only one scan on the whole data set. By incorporating our framework into Guha's data stream clustering algorithm, the clustering quality is greatly enhanced with only a small extra computation cost incurred. The extensive experiments show that our proposal is superior to the distance-based agglomerative hierarchical clustering and data stream clustering algorithms on a variety of data sets.
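For readers unfamiliar with the term, an α-relaxed metric space is usually understood as follows. The form below is the standard relaxed triangle inequality and is an assumption here, since the abstract does not spell out the exact definition used in the paper.

```latex
% Assumed form of the \alpha-relaxed triangle inequality, for all points x, y, z:
\[
  d(x, z) \le \alpha \,\bigl( d(x, y) + d(y, z) \bigr), \qquad \alpha \ge 1 .
\]
% Example: squared Euclidean distance d(x, y) = \lVert x - y \rVert^2 satisfies it
% with \alpha = 2. Writing a = \lVert x - y \rVert and b = \lVert y - z \rVert:
\[
  \lVert x - z \rVert^2 \le (a + b)^2 \le 2\,(a^2 + b^2) .
\]
```

Setting α = 1 recovers the ordinary triangle inequality, which is the k-median setting. The squared-distance example suggests why a relaxed metric is a natural fit for k-means costs.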
IEEE Transactions on Neural Networks | 2017
Ruichu Cai; Zhenjie Zhang; Zhifeng Hao; Marianne Winslett
The study of social causality in human action sequences is useful and important for improving our understanding of human behavior on online social networks. The redundant indirect causalities and unobserved confounding factors, such as homophily and simultaneity phenomena, make accurate causal discovery on such human actions highly challenging. A causal relationship exists between two persons, if the actions of one person are significantly affected by the actions of the other person, while fairly independent of her/his own prior actions. In this paper, we design a systematic approach based on conditional independence testing to detect such asymmetric relations, even when there are latent confounders underneath the observational action sequences. Technically, a group of asymmetric independence tests is conducted to infer the loose causal directions between action sequence pairs, followed by another group of tests to distinguish different types of relationships, e.g., homophily and simultaneity. Finally, a causal structure learning method is employed to output pairwise causalities with redundant indirect causalities eliminated. Empirical evaluations on simulated data verify the effectiveness and scalability of our proposals. We also present four interesting patterns of causal relations found by our algorithm, on real Sina Weibo feeds, including two new patterns never reported in previous studies.
IEEE Transactions on Neural Networks | 2018
Ruichu Cai; Zhenjie Zhang; Zhifeng Hao; Marianne Winslett
Scalable causal discovery is an essential technology for a wide spectrum of applications, including biomedical studies and social network evolution analysis. To tackle the difficulty of high dimensionality, a number of solutions have been proposed in the literature, generally dividing the original variable domain into smaller subdomains by computation-intensive partitioning strategies. These approaches usually suffer significant structural errors when the partitioning strategies fail to recognize true causal edges across the output subdomains. Such a structural error accumulates quickly with the growing depth of recursive partitioning, due to the lack of a correction mechanism over causally connected variables when they are wrongly divided into two subdomains, finally jeopardizing the robustness of the integrated results. This paper proposes a completely different strategy to solve the problem, powered by a lightweight random partitioning scheme together with a carefully designed merging algorithm over results from the random partitions. Based on the randomness properties of the partitioning scheme, we design a suite of tricks for the merging algorithm, in order to support propagation-based significance enhancement, maximal acyclic subgraph causal ordering, and order-sensitive redundancy elimination. Theoretical studies as well as empirical evaluations verify the genericity, effectiveness, and scalability of our proposal on both simulated and real-world causal structures when the scheme is used in combination with a variety of causal solvers known to be effective on smaller domains.
International Conference on Machine Learning | 2013
Ruichu Cai; Zhenjie Zhang; Zhifeng Hao
Neural Information Processing Systems | 2018
Ruichu Cai; Jie Qiao; Kun Zhang; Zhenjie Zhang; Zhifeng Hao
National Conference on Artificial Intelligence | 2018
Ruichu Cai; Jie Qiao; Zhenjie Zhang; Zhifeng Hao