Jundong Li | Researchain

Publication

Featured researches published by Jundong Li.

ACM Computing Surveys | 2017

Feature Selection: A Data Perspective

Jundong Li; Kewei Cheng; Suhang Wang; Fred Morstatter; Robert P. Trevino; Jiliang Tang; Huan Liu

Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data-mining and machine-learning problems. The objectives of feature selection include building simpler and more comprehensible models, improving data-mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity-based, information-theoretical-based, sparse-learning-based, and statistical-based methods. To facilitate and promote the research in this community, we also present an open source feature selection repository that consists of most of the popular feature selection algorithms (http://featureselection.asu.edu/). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research.
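As a rough illustration of the filter-style methods the survey groups under statistical-based feature selection, the sketch below ranks features by their variance and keeps the top k. It is a minimal example under stated assumptions, not code from the repository mentioned above: the function names `rank_features_by_variance` and `select_top_k` and the toy data are hypothetical.

```python
from statistics import pvariance


def rank_features_by_variance(rows):
    """Return feature indices sorted from highest to lowest variance."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    scores = [pvariance(col) for col in columns]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def select_top_k(rows, k):
    """Keep only the k highest-variance features of every instance."""
    keep = sorted(rank_features_by_variance(rows)[:k])
    return [[row[j] for j in keep] for row in rows]


# Toy data (hypothetical): feature 0 is constant and carries no information.
data = [
    [1.0, 5.0, 0.0],
    [1.0, 3.0, 10.0],
    [1.0, 4.0, 20.0],
]
ranking = rank_features_by_variance(data)
reduced = select_top_k(data, 2)
```

Variance is only one possible scoring criterion; the same rank-then-select structure applies when the score comes from a similarity-based or information-theoretical measure instead.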

Web Search and Data Mining | 2017

Label Informed Attributed Network Embedding

Xiao Huang; Jundong Li; Xia Hu

Attributed network embedding seeks low-dimensional vector representations for nodes in a network, such that the original network topological structure and node attribute proximity are preserved in the vectors. These learned representations have been demonstrated to be helpful in many learning tasks such as network clustering and link prediction. While existing algorithms are unsupervised, nodes in many real-world attributed networks are often associated with abundant label information, which is potentially valuable in seeking more effective joint vector representations. In this paper, we investigate how labels can be modeled and incorporated to improve attributed network embedding. This is a challenging task since label information could be noisy and incomplete. In addition, labels are completely distinct from the geometrical structure and node attributes. This combination of heterogeneous information makes joint vector representation learning more difficult. To address these issues, we propose a novel Label informed Attributed Network Embedding (LANE) framework. It can smoothly incorporate label information into the attributed network embedding while preserving their correlations. Experiments on real-world datasets demonstrate that the proposed framework achieves significantly better performance compared with the state-of-the-art embedding algorithms.

SIAM International Conference on Data Mining | 2016

Robust unsupervised feature selection on networked data

Jundong Li; Xia Hu; Liang Wu; Huan Liu

Feature selection has shown its effectiveness in preparing high-dimensional data for many data mining and machine learning tasks. Traditional feature selection algorithms are mainly based on the assumption that data instances are independent and identically distributed. However, this assumption is invalid in networked data since instances are not only associated with high-dimensional features but also inherently interconnected with each other. In addition, obtaining label information for networked data is time consuming and labor intensive. Without label information to direct feature selection, it is difficult to assess feature relevance. In contrast to the scarce label information, link information in networks is abundant and could help select relevant features. However, most networked data contains many noisy links, which makes feature selection algorithms less effective. To address the above-mentioned issues, we propose a robust unsupervised feature selection framework NetFS for networked data, which embeds latent representation learning into feature selection. As a result, content information can help mitigate the negative effects of noisy links in learning latent representations, while good latent representations in turn help extract more meaningful features. In other words, both phases cooperate and boost each other. Experimental results on real-world datasets demonstrate the effectiveness of the proposed

Conference on Information and Knowledge Management | 2015

Unsupervised Streaming Feature Selection in Social Media

Jundong Li; Xia Hu; Jiliang Tang; Huan Liu

The explosive growth of social media sites brings about massive amounts of high-dimensional data. Feature selection is effective in preparing high-dimensional data for data analytics. The characteristics of social media present novel challenges for feature selection. First, social media data is not fully structured and its features are usually not predefined, but are generated dynamically. For example, in Twitter, slang words (features) are created every day and quickly become popular within a short period of time. It is hard to directly apply traditional batch-mode feature selection methods to find such features. Second, given the nature of social media, label information is costly to collect. This exacerbates the problem of selecting features without knowing feature relevance. On the other hand, additional data sources also present opportunities; for example, link information is ubiquitous in social media and could be helpful in selecting relevant features. In this paper, we study the novel problem of unsupervised streaming feature selection for social media data. We investigate how to exploit link information in streaming feature selection, resulting in a novel unsupervised streaming feature selection framework USFS. Experimental results on two real-world social media datasets show the effectiveness and efficiency of the proposed framework compared with the state-of-the-art unsupervised feature selection algorithms.

Conference on Information and Knowledge Management | 2017

Attributed Network Embedding for Learning in a Dynamic Environment

Jundong Li; Harsh Dani; Xia Hu; Jiliang Tang; Yi Chang; Huan Liu

Network embedding leverages node proximity to learn a low-dimensional vector representation for each node in the network. The learned embeddings could advance various learning tasks such as node classification, network clustering, and link prediction. Most, if not all, of the existing works are performed in the context of plain and static networks. Nonetheless, in reality, network structure often evolves over time with the addition and deletion of links and nodes. Also, a vast majority of real-world networks are associated with a rich set of node attributes, and their attribute values also change naturally as new content patterns emerge and old content patterns fade. These changing characteristics motivate us to seek an effective embedding representation to capture network and attribute evolving patterns, which is of fundamental importance for learning in a dynamic environment. To the best of our knowledge, we are the first to tackle this problem, which poses the following two challenges: (1) the inherently correlated network and node attributes could be noisy and incomplete, which necessitates a robust consensus representation to capture their individual properties and correlations; (2) the embedding learning needs to be performed in an online fashion to adapt to the changes accordingly. In this paper, we tackle this problem by proposing a novel dynamic attributed network embedding framework, DANE. In particular, DANE first provides an offline method for a consensus embedding and then leverages matrix perturbation theory to maintain the freshness of the end embedding results in an online manner. We perform extensive experiments on both synthetic and real attributed networks to corroborate the effectiveness and efficiency of the proposed framework.
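The abstract says only that DANE relies on matrix perturbation theory for its online updates and gives no further detail. As general background, the sketch below shows the textbook first-order result for a standard symmetric eigenproblem: if a symmetric matrix A with unit eigenvector v and eigenvalue λ changes by a small ΔA, then λ changes by approximately vᵀ ΔA v. This is an illustration under that assumption, not DANE's actual update rule; the 2x2 matrices and helper names are hypothetical.

```python
import math


def quad_form(v, m):
    """Compute v^T m v for a 2-vector v and a 2x2 matrix m."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def largest_eigenvalue_2x2(m):
    """Exact largest eigenvalue of a symmetric 2x2 matrix (closed form)."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return (trace + math.sqrt(trace * trace - 4.0 * det)) / 2.0


# Original matrix with a known top eigenpair (hypothetical toy values).
a = [[2.0, 0.0], [0.0, 1.0]]
v = [1.0, 0.0]   # unit eigenvector of a for eigenvalue 2.0
old_eig = 2.0

# Small symmetric change, standing in for one step of network evolution.
delta = [[0.01, 0.02], [0.02, -0.01]]

# First-order estimate: no eigendecomposition of the updated matrix needed.
estimated = old_eig + quad_form(v, delta)

# Exact value, recomputed from scratch for comparison.
a_new = [[a[i][j] + delta[i][j] for j in range(2)] for i in range(2)]
exact = largest_eigenvalue_2x2(a_new)
```

The estimate and the exact value differ only by a second-order term in the size of the change, which is why such updates are attractive when changes between time steps are small.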

International Conference on Data Mining | 2016

Toward Time-Evolving Feature Selection on Dynamic Networks

Jundong Li; Xia Hu; Ling Jian; Huan Liu

Recent years have witnessed the prevalence of networked data in various domains. Among them, a large number of networks are not only topologically structured but also have a rich set of features on nodes. These node features are usually of high dimensionality with noisy, irrelevant and redundant information, which may impede the performance of other learning tasks. Feature selection is useful to alleviate these critical issues. Nonetheless, a vast majority of existing feature selection algorithms are predominantly designed in a static setting. In reality, real-world networks are naturally dynamic, characterized by both topology and content changes. It is desirable to capture these changes to find relevant features tightly hinged with network structure continuously, which is of fundamental importance for many applications such as disaster relief and viral marketing. In this paper, we study a novel problem of time-evolving feature selection for dynamic networks in an unsupervised scenario. Specifically, we propose a TeFS framework by leveraging the temporal evolution property of dynamic networks to update the feature selection results incrementally. Experimental results show the superiority of TeFS over the state-of-the-art batch-mode unsupervised feature selection algorithms.

International Joint Conference on Artificial Intelligence | 2017

Radar: residual analysis for anomaly detection in attributed networks

Jundong Li; Harsh Dani; Xia Hu; Huan Liu

Attributed networks are pervasive in different domains, ranging from social networks and gene regulatory networks to financial transaction networks. This kind of rich network representation presents challenges for anomaly detection due to the heterogeneity of the two data representations. A vast majority of existing algorithms assume that certain properties of anomalies are given a priori. Since various types of anomalies coexist in real-world attributed networks, the assumption that prior knowledge regarding anomalies is available does not hold. In this paper, we investigate the problem of anomaly detection in attributed networks generally from a residual analysis perspective, which has been shown to be effective in traditional anomaly detection problems. However, it is a non-trivial task in attributed networks as interactions among instances complicate the residual modeling process. Methodologically, we propose a learning framework to characterize the residuals of attribute information and its coherence with network information for anomaly detection. By learning and analyzing the residuals, we detect anomalies whose behaviors are singularly different from the majority. Experiments on real datasets show the effectiveness and generality of the proposed framework.
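To make the idea of residual analysis concrete in its traditional, non-networked form, the sketch below fits a least-squares line to toy data and scores each instance by the absolute size of its residual; the instance that the model explains worst is flagged. It is not the Radar framework, which additionally models coherence with network information. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_scores(xs, ys):
    """Anomaly score per instance: absolute residual under the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]


# Toy data (hypothetical): y roughly follows x except at index 3.
xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.0, 2.1, 9.0, 3.9, 5.0]

scores = residual_scores(xs, ys)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

Note that the outlier also pulls the fitted line toward itself, which inflates the residuals of normal instances; robust residual modeling is one way to limit that effect.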

Geoinformatica | 2016

On discovering co-location patterns in datasets: a case study of pollutants and child cancers

Jundong Li; Aibek Adilmagambetov; Mohomed Shazan Mohomed Jabbar; Osmar R. Zaïane; Alvaro Osornio-Vargas; Osnat Wine

We intend to identify relationships between cancer cases and pollutant emissions by proposing a novel co-location mining algorithm. In this context, we specifically attempt to understand whether there is a relationship between the location of a child diagnosed with cancer and any chemical combinations emitted from various facilities in that particular location. Co-location pattern mining aims to detect sets of spatial features frequently located in close proximity to each other. Most of the previous works in this domain are based on transaction-free apriori-like algorithms which are dependent on user-defined thresholds and are designed for boolean data points. Due to the absence of a clear notion of transactions, it is nontrivial to use association rule mining techniques to tackle the co-location mining problem. Our proposed approach is based on a grid-based transactionization of the geographic space, and is designed to mine datasets with extended spatial objects. It is also capable of incorporating uncertainty about the existence of features to model real-world scenarios more accurately. We eliminate the necessity of using a global threshold by introducing a statistical test to validate the significance of candidate co-location patterns and rules. Experiments on both synthetic and real datasets reveal that our algorithm can detect a considerable number of statistically significant co-location patterns. In addition, we explain the data modelling framework which is used on real datasets of pollutants (PRTR/NPRI) and childhood cancer cases.
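The grid-based transactionization described above can be pictured with a small sketch: the space is cut into square cells, and each non-empty cell becomes one transaction holding the feature types found inside it. This is a simplified illustration assuming point features with certain existence; the paper's approach handles extended spatial objects and uncertainty, and validates patterns with a statistical test rather than a support threshold. The function names, feature labels, and coordinates are hypothetical.

```python
import math
from collections import defaultdict


def grid_transactions(points, cell_size):
    """Map each non-empty grid cell to the set of feature types inside it."""
    cells = defaultdict(set)
    for x, y, feature in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[cell].add(feature)
    return dict(cells)


def pair_support(transactions, a, b):
    """Fraction of non-empty cells that contain both feature a and feature b."""
    hits = sum(1 for items in transactions.values() if a in items and b in items)
    return hits / len(transactions)


# Toy point data (hypothetical): (x, y, feature type).
points = [
    (0.2, 0.3, "pollutant_A"),
    (0.8, 0.1, "cancer_case"),
    (1.5, 0.4, "pollutant_B"),
    (1.1, 0.9, "cancer_case"),
    (2.7, 2.2, "pollutant_A"),
]

transactions = grid_transactions(points, cell_size=1.0)
support = pair_support(transactions, "pollutant_A", "cancer_case")
```

The support value here only shows that ordinary transaction counting becomes possible once cells act as transactions; it is not the significance measure used in the paper.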

International Joint Conference on Artificial Intelligence | 2017

Reconstruction-based Unsupervised Feature Selection: An Embedded Approach

Jundong Li; Jiliang Tang; Huan Liu

Feature selection has been proven to be effective and efficient in preparing high-dimensional data for data mining and machine learning problems. Since real-world data is usually unlabeled, unsupervised feature selection has received increasing attention in recent years. Without label information, unsupervised feature selection needs alternative criteria to define feature relevance. Recently, data reconstruction error emerged as a new criterion for unsupervised feature selection, which defines feature relevance as the capability of features to approximate original data via a reconstruction function. Most existing algorithms in this family assume predefined, linear reconstruction functions. However, the reconstruction function should be data dependent and may not always be linear especially when the original data is high-dimensional. In this paper, we investigate how to learn the reconstruction function from the data automatically for unsupervised feature selection, and propose a novel reconstruction-based unsupervised feature selection framework REFS, which embeds the reconstruction function learning process into feature selection. Experiments on various types of real-world datasets demonstrate the effectiveness of the proposed framework REFS.

Data Warehousing and Knowledge Discovery | 2014

Discovering Statistically Significant Co-location Rules in Datasets with Extended Spatial Objects

Jundong Li; Osmar R. Zaïane; Alvaro Osornio-Vargas

Co-location rule mining is one of the tasks of spatial data mining, which focuses on the detection of sets of spatial features that show spatial associations. Most previous methods are based on transaction-free apriori-like algorithms which are dependent on user-defined thresholds and are designed for boolean data points. Due to the absence of a clear notion of transactions, it is nontrivial to use association rule mining techniques to tackle the co-location rule mining problem. To solve these difficulties, a transactionization approach was recently proposed, designed to mine datasets with extended spatial objects. A statistical test is used instead of global thresholds to detect significant co-location rules. One major shortcoming of this work is that it limits the antecedent of co-location rules to at most three features; as a result, the algorithm is difficult to scale up. In this paper we introduce a new algorithm that fully exploits the property of statistical significance to detect more general co-location rules. We apply our algorithm to real datasets from the National Pollutant Release Inventory (NPRI). A classifier is also proposed to help evaluate the discovered co-location rules.
