Bokai Cao
University of Illinois at Chicago
Publication
Featured research published by Bokai Cao.
International Conference on Data Mining | 2014
Bokai Cao; Xiangnan Kong; Philip S. Yu
Link prediction has become an important and active research topic in recent years and is prevalent in many real-world applications. Current research on link prediction focuses on predicting a single type of link, such as friendship links in social networks, or on predicting multiple types of links independently. However, many real-world networks involve more than one type of link, and different types of links are not independent but related through complex dependencies. In such networks, the prediction tasks for different types of links are also correlated, and the links of different types should be predicted collectively. In this paper, we study the problem of collective prediction of multiple types of links in heterogeneous information networks. To address this problem, we introduce the linkage homophily principle and design a relatedness measure, called RM, between different types of objects to compute the existence probability of a link. We also extend conventional proximity measures to heterogeneous links. Furthermore, we propose an iterative framework for heterogeneous collective link prediction, called HCLP, to predict multiple types of links collectively by exploiting diverse and complex linkage information in heterogeneous information networks. Empirical studies on real-world tasks demonstrate that the proposed collective link prediction approach can effectively boost link prediction performance in heterogeneous information networks.
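As a point of reference for the "conventional proximity measures" mentioned in the abstract above, the following is a minimal sketch of one such measure, the Jaccard coefficient, on a single-type network. It is not the paper's RM measure or the HCLP framework; the function name `jaccard_scores` and the toy friendship graph are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard_scores(adjacency):
    """Score every unlinked node pair by the Jaccard coefficient of its neighbor sets."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, so there is nothing to predict
        union = adjacency[u] | adjacency[v]
        if not union:
            continue  # both nodes are isolated; the score is undefined
        common = adjacency[u] & adjacency[v]
        scores[(u, v)] = len(common) / len(union)
    return scores


# Hypothetical undirected friendship network, stored as neighbor sets.
toy_graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}

# Candidate links, highest score first.
ranked = sorted(jaccard_scores(toy_graph).items(), key=lambda kv: -kv[1])
```

A measure like this treats all links as one type, which is exactly the limitation the abstract points out for networks with several interdependent link types.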
Knowledge Discovery and Data Mining | 2013
Xiangnan Kong; Bokai Cao; Philip S. Yu
Multi-label classification is prevalent in many real-world applications, where each example can be associated with a set of multiple labels simultaneously. The key challenge of multi-label classification comes from the large space of all possible label sets, which is exponential in the number of candidate labels. Most previous work focuses on exploiting correlations among different labels to facilitate the learning process. It is usually assumed that the label correlations are given beforehand or can be derived directly from data samples by counting their label co-occurrences. However, in many real-world multi-label classification tasks, the label correlations are not given and can be hard to learn directly from data samples within a moderate-sized training set. Heterogeneous information networks can provide abundant knowledge about relationships among different types of entities including data samples and class labels. In this paper, we propose to use heterogeneous information networks to facilitate the multi-label classification process. By mining the linkage structure of heterogeneous information networks, multiple types of relationships among different class labels and data samples can be extracted. Then we can use these relationships to effectively infer the correlations among different class labels in general, as well as the dependencies among the label sets of data examples inter-connected in the network. Empirical studies on real-world tasks demonstrate that the performance of multi-label classification can be effectively boosted using heterogeneous information networks.
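The baseline that the abstract above contrasts itself with, deriving label correlations by counting label co-occurrences in the training samples, can be sketched in a few lines. This is only that counting baseline, not the network-based method the paper proposes; the function name `label_cooccurrence` and the toy label sets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(label_sets):
    """Count how often each unordered pair of labels appears on the same example."""
    pair_counts = Counter()
    for labels in label_sets:
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical training set: one label set per example.
training_labels = [
    {"sports", "news"},
    {"sports", "health"},
    {"news", "sports", "health"},
    {"health"},
]
counts = label_cooccurrence(training_labels)
```

With only a handful of examples per label pair, counts like these are noisy, which is the small-training-set problem the abstract describes.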
International Conference on Data Mining | 2014
Bokai Cao; Lifang He; Xiangnan Kong; Philip S. Yu; Zhifeng Hao; Ann B. Ragin
In the era of big data, we can easily access information from multiple views which may be obtained from different sources or feature subsets. Generally, different views provide complementary information for learning tasks. Thus, multi-view learning can facilitate the learning process and is prevalent in a wide range of application domains. For example, in medical science, measurements from a series of medical examinations are documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures which are obtained from multiple sources. Specifically, for brain diagnosis, we can have different quantitative analyses which can be seen as different feature subsets of a subject. It is desirable to combine all these features in an effective way for disease diagnosis. However, some measurements from less relevant medical examinations can introduce irrelevant information which can even be exaggerated after view combinations. Feature selection should therefore be incorporated in the process of multi-view learning. In this paper, we explore the tensor product to bring different views together in a joint space, and present a dual method of tensor-based multi-view feature selection (DUAL-TMFS) based on the idea of support vector machine recursive feature elimination. Experiments conducted on datasets derived from neurological disorder demonstrate that the features selected by our proposed method yield better classification performance and are relevant to disease diagnosis.
International Conference on Data Mining | 2015
Bokai Cao; Xiangnan Kong; Jingyuan Zhang; Philip S. Yu; Ann B. Ragin
Mining discriminative subgraph patterns from graph data has attracted great interest in recent years. It has a wide variety of applications in disease diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the graph representation alone. However, in many real-world applications, side information is available along with the graph data. For example, for neurological disorder identification, in addition to the brain networks derived from neuroimaging data, hundreds of clinical, immunologic, serologic and cognitive measures may also be documented for each subject. These measures compose multiple side views encoding a tremendous amount of supplemental information for diagnostic purposes, yet are often ignored. In this paper, we study the problem of discriminative subgraph selection using multiple side views and propose a novel solution to find an optimal set of subgraph features for graph classification by exploring a plurality of side views. We derive a feature evaluation criterion, named gSide, to estimate the usefulness of subgraph patterns based upon side views. Then we develop a branch-and-bound algorithm, called gMSV, to efficiently search for optimal subgraph features by integrating the subgraph mining process and the procedure of discriminative feature selection. Empirical studies on graph classification tasks for neurological disorders using brain networks demonstrate that subgraph patterns selected by the multi-side-view guided subgraph selection approach can effectively boost graph classification performance and are relevant to disease diagnosis.
Web Search and Data Mining | 2016
Bokai Cao; Hucheng Zhou; Guoqiang Li; Philip S. Yu
With the rapidly growing amount of data available on the web, it becomes increasingly likely to obtain data from different perspectives for multi-view learning. Some successful examples of web applications include recommendation and targeted advertising. Specifically, to predict whether a user will click an ad in a query context, there are available features extracted from user profile, ad information and query description, and each of them can only capture part of the task signals from a particular aspect/view. Different views provide complementary information to learn a practical model for these applications. Therefore, an effective integration of the multi-view information is critical to facilitate the learning performance. In this paper, we propose a general predictor, named multi-view machines (MVMs), that can effectively explore the full-order interactions between features from multiple views. A joint factorization is applied to the interaction parameters, which makes parameter estimation more accurate under sparsity and gives the model the capacity to avoid overfitting. Moreover, MVMs can work in conjunction with different loss functions for a variety of machine learning tasks. The advantages of MVMs are illustrated through comparison with other methods for multi-view prediction, including support vector machines (SVMs), support tensor machines (STMs) and factorization machines (FMs). A stochastic gradient descent method and a distributed implementation on Spark are presented to learn the MVM model. Through empirical studies on two real-world web application datasets, we demonstrate the effectiveness of MVMs on modeling feature interactions in multi-view data. A 3.51% accuracy improvement is shown on MVMs over FMs for the problem of movie rating prediction, and 0.57% for ad click prediction.
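To make "full-order interactions" with a "joint factorization" more concrete, here is a minimal sketch of one way such a predictor can be written. It assumes (this is not stated in the abstract) that each view's feature vector is extended with a constant 1 so that lower-order terms and a bias are covered by the same product, and that the interaction weights factorize into one small matrix per view. The names `mvm_predict`, `explicit_predict` and all toy numbers are hypothetical. The second function materialises every interaction weight explicitly, only to show that both forms give the same prediction.

```python
from itertools import product
from math import prod


def mvm_predict(views, factors):
    """Factorized prediction: sum over latent factors of a product over views.

    views[v] is the feature vector of view v.
    factors[v][i][f] is the weight of feature i of view v on latent factor f;
    the last row of factors[v] belongs to the appended constant 1.
    """
    num_factors = len(factors[0][0])
    total = 0.0
    for f in range(num_factors):
        term = 1.0
        for x, a in zip(views, factors):
            z = list(x) + [1.0]  # append the constant feature
            term *= sum(z_i * a_i[f] for z_i, a_i in zip(z, a))
        total += term
    return total


def explicit_predict(views, factors):
    """Same model, but enumerating every interaction weight one by one."""
    num_factors = len(factors[0][0])
    augmented = [list(x) + [1.0] for x in views]
    total = 0.0
    for index in product(*(range(len(z)) for z in augmented)):
        weight = sum(
            prod(factors[v][i][f] for v, i in enumerate(index))
            for f in range(num_factors)
        )
        total += weight * prod(z[i] for z, i in zip(augmented, index))
    return total


# Hypothetical three-view example (user, ad, query) with two latent factors.
toy_views = [[1.0, 0.0], [0.5], [0.0, 2.0, 1.0]]
toy_factors = [
    [[0.1, -0.2], [0.3, 0.4], [1.0, 0.5]],               # user: 2 features + constant
    [[0.7, 0.2], [0.5, 1.0]],                            # ad: 1 feature + constant
    [[0.2, 0.1], [-0.3, 0.6], [0.4, -0.1], [1.0, 0.3]],  # query: 3 features + constant
]
fast = mvm_predict(toy_views, toy_factors)
slow = explicit_predict(toy_views, toy_factors)
```

The factorized form touches each feature once per latent factor, while the explicit form grows with the product of the view sizes, which is why a factorization of this kind matters for sparse, high-dimensional web data.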
NeuroImage: Clinical | 2015
Bokai Cao; Xiangnan Kong; Casey S. Kettering; Philip S. Yu; Ann B. Ragin
To inform an understanding of brain status in HIV infection, quantitative imaging measurements were derived at structural, microstructural and macromolecular levels in three different periods of early infection and then analyzed simultaneously at each stage using data mining. Support vector machine recursive feature elimination was then used for simultaneous analysis of subject characteristics, clinical and behavioral variables, and immunologic measures in plasma and CSF to rank features associated with the most discriminating brain alterations in each period. The results indicate alterations beginning in initial infection and in all periods studied. The severity of immunosuppression in the initial virus host interaction was the most highly ranked determinant of earliest brain alterations. These results shed light on the initial brain changes induced by a neurotropic virus and their subsequent evolution. The pattern of ongoing alterations occurring during and beyond the period in which virus is suppressed in the systemic circulation supports the brain as a viral reservoir that may preclude eradication in the host. Data mining capabilities that can address high dimensionality and simultaneous analysis of disparate information sources have considerable utility for identifying mechanisms underlying onset of neurological injury and for informing new therapeutic targets.
8th International Conference on Brain Informatics and Health, BIH 2015 | 2015
Bokai Cao; Liang Zhan; Xiangnan Kong; Philip S. Yu; Nathalie Vizueta; Lori L. Altshuler; Alex D. Leow
Using sophisticated graph-theoretical analyses, modern magnetic resonance imaging techniques have allowed us to model the human brain as a brain connectivity network or a graph. In a brain network, the nodes of the network correspond to a set of brain regions and the links or edges correspond to the functional or structural connectivity between these regions. The linkage structure in brain networks can encode valuable information about the organizational properties of the human brain as a whole. However, the complexity of such linkage information raises major challenges in the era of big data in brain informatics. Conventional approaches to brain networks primarily focus on local patterns within select brain regions or pairwise connectivity between regions. By contrast, in this study, we proposed a graph mining framework based on state-of-the-art data mining techniques. Using a statistical test based on the G-test, we validated this framework in a sample of euthymic bipolar I subjects, and identified abnormal subgraph patterns in the rsfMRI networks of these subjects relative to healthy controls.
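For readers unfamiliar with the G-test named in the abstract above, the statistic for a contingency table is G = 2 Σ O ln(O / E), where O is an observed count and E the count expected under independence. The sketch below computes it for a table whose rows could be "subgraph present / absent" and whose columns could be "patients / controls". That table layout and the counts are assumptions made for illustration, since the abstract does not describe how the study builds its tables; `g_statistic` is a hypothetical name.

```python
from math import log


def g_statistic(table):
    """G-test statistic of independence for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
            expected = row_totals[i] * col_totals[j] / grand_total
            g += observed * log(observed / expected)
    return 2.0 * g


# Hypothetical counts: rows = pattern present / absent, columns = patients / controls.
associated = g_statistic([[10, 20], [20, 10]])
independent = g_statistic([[10, 20], [20, 40]])
```

A larger G means the pattern's presence depends more strongly on group membership; for a 2x2 table the statistic is usually compared against a chi-squared distribution with one degree of freedom.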
SIAM International Conference on Data Mining | 2016
Jingyuan Zhang; Bokai Cao; Sihong Xie; Chun Ta Lu; Philip S. Yu; Ann B. Ragin
There is considerable interest in mining neuroimage data to discover clinically meaningful connectivity patterns to inform an understanding of neurological and neuropsychiatric disorders. Subgraph mining models have been used to discover connected subgraph patterns. However, it is difficult to capture the complicated interplay among patterns. As a result, classification performance based on these results may not be satisfactory. To address this issue, we propose to learn non-linear representations of brain connectivity patterns from deep learning architectures. This is non-trivial, due to the limited number of subjects and the high cost of acquiring the data. Fortunately, auxiliary information from multiple side views such as clinical, serologic, immunologic, cognitive and other diagnostic testing also characterizes the states of subjects from different perspectives. In this paper, we present a novel Multi-side-View guided AutoEncoder (MVAE) that incorporates multiple side views into the process of deep learning to tackle the bias in the construction of connectivity patterns caused by the scarce clinical data. Extensive experiments show that MVAE not only captures discriminative connectivity patterns for classification, but also discovers meaningful information for clinical interpretation.
Web Search and Data Mining | 2017
Chun Ta Lu; Lifang He; Weixiang Shao; Bokai Cao; Philip S. Yu
Many real-world problems, such as web image analysis, document categorization and product recommendation, often exhibit dual heterogeneity: heterogeneous features are obtained in multiple views, and multiple tasks might be related to each other through one or more shared views. To address these Multi-Task Multi-View (MTMV) problems, we propose a tensor-based framework for learning the predictive multilinear structure from the full-order feature interactions within the heterogeneous data. The tensor structure is used to strengthen and capture the complex relationships between multiple tasks with multiple views. We further develop efficient multilinear factorization machines (MFMs) that can learn the task-specific feature map and the task-view shared multilinear structures, without physically building the tensor. In the proposed method, a joint factorization is applied to the full-order interactions such that the consensus representation can be learned. In this manner, it can deal with partially incomplete data without difficulty, as the learning procedure does not rely on any particular view. Furthermore, the complexity of MFMs is linear in the number of parameters, which makes MFMs suitable for large-scale real-world problems. Extensive experiments on four real-world datasets demonstrate that the proposed method significantly outperforms several state-of-the-art methods in a wide variety of MTMV problems.
International World Wide Web Conferences | 2017
Xiaokai Wei; Linchuan Xu; Bokai Cao; Philip S. Yu
Link prediction has been an important task for social and information networks. Existing approaches usually assume the completeness of network structure. However, in many real-world networks, the links and node attributes are often only partially observable. In this paper, we study the problem of Cross View Link Prediction (CVLP) on partially observable networks, where the focus is to recommend nodes with only links to nodes with only attributes (or vice versa). We aim to bridge the information gap by learning a robust consensus for link-based and attribute-based representations so that nodes become comparable in the latent space. Also, the link-based and attribute-based representations can lend strength to each other via this consensus learning. Moreover, attribute selection is performed jointly with the representation learning to alleviate the effect of noisy high-dimensional attributes. We present two instantiations of this framework with different loss functions and develop an alternating optimization framework to solve the problem. Experimental results on four real-world datasets show that the proposed algorithm significantly outperforms the baseline methods for cross-view link prediction.