Yunming Ye

Harbin Institute of Technology

Publication

Featured research published by Yunming Ye.

Green Computing and Communications | 2010

Security Issues and Challenges for Cyber Physical System

Eric Ke Wang; Yunming Ye; Xiaofei Xu; Siu-Ming Yiu; Lucas Chi Kwong Hui; K. P. Chow

In this paper, we investigate the security challenges and issues of cyber-physical systems. We (1) abstract the general workflow of cyber-physical systems; (2) identify the possible vulnerabilities, attack issues, adversary characteristics, and a set of challenges that need to be addressed; and (3) propose a context-aware security framework for general cyber-physical systems and suggest some potential research areas and problems.

IEEE Transactions on Knowledge and Data Engineering | 2013

TW-k-means: Automated two-level variable weighting clustering algorithm for multiview data

Xiaojun Chen; Xiaofei Xu; Joshua Zhexue Huang; Yunming Ye

This paper proposes TW-k-means, an automated two-level variable weighting clustering algorithm for multiview data, which can simultaneously compute weights for views and individual variables. In this algorithm, a view weight is assigned to each view to identify the compactness of the view and a variable weight is also assigned to each variable in the view to identify the importance of the variable. Both view weights and variable weights are used in the distance function to determine the clusters of objects. In the new algorithm, two additional steps are added to the iterative k-means clustering process to automatically compute the view weights and the variable weights. We used two real-life data sets to investigate the properties of two types of weights in TW-k-means and investigated the difference between the weights of TW-k-means and the weights of the individual variable weighting method. The experiments have revealed the convergence property of the view weights in TW-k-means. We compared TW-k-means with five clustering algorithms on three real-life data sets and the results have shown that the TW-k-means algorithm significantly outperformed the other five clustering algorithms in four evaluation indices.
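
How both weight levels enter the distance function can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the list-of-index-lists view layout, and the squared-difference form of the distance are assumptions made for illustration, and the two weight-update steps of TW-k-means are not shown.

```python
def two_level_distance(x, center, views, view_weights, var_weights):
    """Weighted squared distance with one weight per view and one per variable.

    views: list of lists of variable indices, one inner list per view
    (a layout assumed for this sketch).
    """
    total = 0.0
    for view_idx, var_indices in enumerate(views):
        # Variable weights scale each squared difference inside the view.
        within_view = sum(
            var_weights[j] * (x[j] - center[j]) ** 2 for j in var_indices
        )
        # The view weight scales the whole contribution of that view.
        total += view_weights[view_idx] * within_view
    return total


def assign_cluster(x, centers, views, view_weights, var_weights):
    """Return the index of the closest center under the two-level distance."""
    distances = [
        two_level_distance(x, c, views, view_weights, var_weights) for c in centers
    ]
    return distances.index(min(distances))


# Two views: variables 0 and 1 form view 0, variable 2 forms view 1.
views = [[0, 1], [2]]
view_weights = [0.75, 0.25]
var_weights = [0.5, 0.5, 1.0]
centers = [[0.0, 0.0, 0.0], [4.0, 4.0, 0.0]]
point = [3.0, 3.0, 10.0]
label = assign_cluster(point, centers, views, view_weights, var_weights)
```

In this toy setup the second view contributes equally to both distances, so the assignment is decided by the first view alone.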

Pattern Recognition | 2012

A feature group weighting method for subspace clustering of high-dimensional data

Xiaojun Chen; Yunming Ye; Xiaofei Xu; Joshua Zhexue Huang

This paper proposes a new method to weight subspaces in feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups, based on their natural characteristics. Two types of weights are introduced to the clustering process to simultaneously identify the importance of feature groups and individual features in each cluster. A new optimization model is given to define the optimization process and a new clustering algorithm, FG-k-means, is proposed to solve the model. The new algorithm is an extension to k-means by adding two additional steps to automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data have shown that the FG-k-means algorithm significantly outperformed four k-means type algorithms, i.e., k-means, W-k-means, LAC and EWKM, in almost all experiments. The new algorithm is robust to noise and missing values, which commonly exist in high-dimensional data.

Knowledge Discovery and Data Mining | 2011

MultiRank: co-ranking for objects and relations in multi-relational data

Michael Kwok-Po Ng; Xutao Li; Yunming Ye

The main aim of this paper is to design a co-ranking scheme for objects and relations in multi-relational data. It has many important applications in data mining and information retrieval. However, in the literature, there is a lack of a general framework to deal with multi-relational data for co-ranking. The main contribution of this paper is to (i) propose a framework (MultiRank) to determine the importance of both objects and relations simultaneously based on a probability distribution computed from multi-relational data; (ii) show the existence and uniqueness of such probability distribution so that it can be used for co-ranking for objects and relations very effectively; and (iii) develop an efficient iterative algorithm to solve a set of tensor (multivariate polynomial) equations to obtain such probability distribution. Extensive experiments on real-world data suggest that the proposed framework is able to provide a co-ranking scheme for objects and relations successfully. Experimental results have also shown that our algorithm is computationally efficient, and effective for identification of interesting and explainable co-ranking results.

Pattern Recognition | 2013

Stratified sampling for feature subspace selection in random forests for high dimensional data

Yunming Ye; Qingyao Wu; Joshua Zhexue Huang; Michael K. Ng; Xutao Li

For high dimensional data a large portion of features are often not informative of the class of the objects. Random forest algorithms tend to use a simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups. One group will contain strong informative features and the other weak informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that we can ensure that each subspace contains enough informative features for classification in high dimensional data. Testing on both synthetic data and various real data sets in gene classification, image categorization and face recognition data sets consistently demonstrates the effectiveness of this new method. The performance is shown to better that of state-of-the-art algorithms including SVM, the four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms.
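
The two-group stratification can be sketched roughly as follows. The informativeness scores, the mean-score split rule, the rounding of the proportional allocation, and the function name are assumptions for illustration only; the abstract does not specify them.

```python
import random


def stratified_subspace(scores, subspace_size, rng):
    """Pick feature indices from a strong and a weak group, proportionally to group size."""
    threshold = sum(scores) / len(scores)  # assumed split rule: mean score
    strong = [i for i, s in enumerate(scores) if s > threshold]
    weak = [i for i, s in enumerate(scores) if s <= threshold]

    # Proportional allocation, keeping at least one strong feature when any exist.
    n_strong = round(subspace_size * len(strong) / len(scores))
    n_strong = min(len(strong), max(1 if strong else 0, n_strong))
    n_weak = min(len(weak), subspace_size - n_strong)
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


# Hypothetical informativeness scores for 10 features; features 0 and 1 are strong.
scores = [0.9, 0.8, 0.1, 0.05, 0.02, 0.01, 0.03, 0.04, 0.02, 0.03]
rng = random.Random(0)
subspace = stratified_subspace(scores, 4, rng)
```

With plain random sampling, a subspace of four features drawn from these ten would contain no strong feature in a third of the draws; the stratified draw always includes one.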

Journal of Multimedia | 2014

Short Text Classification: A Survey

Ge Song; Yunming Ye; Xiaolin Du; Xiaohui Huang; Shifu Bie

With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas, and much research now focuses on short text mining. Classifying short text is challenging owing to its inherent characteristics, such as sparseness, large scale, immediacy, and non-standardization. Traditional methods struggle with short text classification mainly because the few words in a short text cannot adequately represent the feature space or the relationship between words and documents. Several studies and reviews on text classification have appeared recently, but only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulty of short text classification. We then introduce the existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification. The evaluation of short text classification is also analyzed. Finally, we summarize existing classification technology and discuss the prospects for the development of short text classification.

IEEE Transactions on Knowledge and Data Engineering | 2014

MultiComm: Finding Community Structure in Multi-Dimensional Networks

Xutao Li; Michael K. Ng; Yunming Ye

The main aim of this paper is to develop a community discovery scheme in a multi-dimensional network for data mining applications. In online social media, networked data consists of multiple dimensions/entities such as users, tags, photos, comments, and stories. We are interested in finding a group of users who interact significantly on these media entities. In a co-citation network, we are interested in finding a group of authors who relate to other authors significantly on publication information in titles, abstracts, and keywords as multiple dimensions/entities in the network. The main contribution of this paper is to propose a framework (MultiComm) to identify a seed-based community in a multi-dimensional network by evaluating the affinity between two items in the same type of entity (same dimension) or different types of entities (different dimensions) from the network. Our idea is to calculate the probabilities of visiting each item in each dimension, and compare their values to generate communities from a set of seed items. To evaluate the quality of the communities generated by the proposed algorithm, we develop and study a local modularity measure of a community in a multi-dimensional network. Experiments based on synthetic and real-world data sets suggest that the proposed framework is able to find a community effectively. Experimental results have also shown that the proposed algorithm is more accurate than the other tested algorithms in finding communities in multi-dimensional networks.

BMC Systems Biology | 2015

Protein functional properties prediction in sparsely-label PPI networks through regularized non-negative matrix factorization

Qingyao Wu; Zhenyu Wang; Chunshan Li; Yunming Ye; Yueping Li; Ning Sun

Background: Predicting functional properties of proteins in protein-protein interaction (PPI) networks presents a challenging problem and has important implications in computational biology. Collective classification (CC), which utilizes both attribute features and relational information to jointly classify related proteins in PPI networks, has been shown to be a powerful computational method for this problem setting. Enabling CC usually increases accuracy when given a fully-labeled PPI network with a large amount of labeled data. However, such labels can be difficult to obtain in many real-world PPI networks, in which there are usually only a limited number of labeled proteins and a large number of unlabeled proteins. In this case, most of the unlabeled proteins may not be connected to the labeled ones, so the supervision knowledge cannot be obtained effectively from local network connections. As a consequence, learning a CC model in sparsely-labeled PPI networks can lead to poor performance.

Results: We investigate a latent graph approach for finding an integration latent graph by exploiting various latent linkages and judiciously integrating the investigated linkages to link (separate) the proteins with similar (different) functions. We develop a regularized non-negative matrix factorization (RNMF) algorithm for CC to make protein functional properties prediction by utilizing various data sources that are available in this problem setting, including attribute features, latent graph, and unlabeled data information. In RNMF, a label matrix factorization term and a network regularization term are incorporated into the non-negative matrix factorization (NMF) objective function to seek a matrix factorization that respects the network structure and label information for classification prediction.

Conclusion: Experimental results on KDD Cup tasks predicting the localization and functions of proteins to yeast genes demonstrate the effectiveness of the proposed RNMF method for predicting the protein properties. In the comparison, we find that the performance of the new method is better than that of the other compared CC algorithms, especially when labeled proteins are scarce.

IEEE Transactions on Nanobioscience | 2012

SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

Qingyao Wu; Yunming Ye; Yang Liu; Michael K. Ng

For high dimensional genome-wide association (GWA) case-control data of complex disease, there is usually a large portion of single-nucleotide polymorphisms (SNPs) that are irrelevant to the disease. A simple random sampling method in random forest, using the default mtry parameter to choose the feature subspace, will select too many subspaces without informative SNPs. An exhaustive search for an optimal mtry is often required in order to include useful and relevant SNPs and discard the vast number of non-informative SNPs. However, such a search is too time-consuming and not favorable in GWA for high-dimensional data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for GWA high-dimensional data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprised of 408 803 SNPs and Alzheimer case-control data comprised of 380 157 SNPs) to demonstrate that the proposed stratified sampling method is effective, and that it can generate a better random forest with higher accuracy and a lower error bound than Breiman's random forest generation method. For Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders for further biological investigations.
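
The equal-width discretization and the equal-count draw from each group might look like the sketch below. The informativeness scores are hypothetical, the function names are invented for the example, and how the paper measures informativeness or handles groups smaller than the requested count is not stated in the abstract, so those details are assumptions.

```python
import random


def equal_width_groups(scores, n_groups):
    """Bin feature indices into n_groups equal-width intervals of the score range."""
    low, high = min(scores), max(scores)
    width = (high - low) / n_groups
    groups = [[] for _ in range(n_groups)]
    for idx, s in enumerate(scores):
        if width == 0:
            b = 0  # all scores identical: everything lands in the first group
        else:
            # The maximum score would index one past the end, so clamp it.
            b = min(int((s - low) / width), n_groups - 1)
        groups[b].append(idx)
    return groups


def sample_subspace(groups, per_group, rng):
    """Draw the same number of indices from each group (fewer if a group is small)."""
    subspace = []
    for g in groups:
        subspace.extend(rng.sample(g, min(per_group, len(g))))
    return subspace


# Hypothetical informativeness scores for 8 SNPs.
snp_scores = [0.0, 0.1, 0.2, 0.5, 0.6, 0.9, 1.0, 0.95]
snp_groups = equal_width_groups(snp_scores, 2)
snp_subspace = sample_subspace(snp_groups, 2, random.Random(0))
```

Drawing a fixed count per group keeps the subspace random within each group while guaranteeing that the high-informativeness bins are represented.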

Journal of Information Science | 2014

Detecting hot topics from Twitter: A multiview approach

Yixiang Fang; Haijun Zhang; Yunming Ye; Xutao Li

Twitter is widely used all over the world, and a huge number of hot topics are generated by Twitter users in real time. These topics are able to reflect almost every aspect of people’s daily lives. Therefore, the detection of topics in Twitter can be used in many real applications, such as monitoring public opinion, hot product recommendation and incidence detection. However, the performance of traditional topic detection methods is still far from perfect largely owing to the tweets’ features, such as their limited length and arbitrary abbreviations. To address these problems, we propose a novel framework (MVTD) for Twitter topic detection using multiview clustering, which can integrate multirelations among tweets, such as semantic relations, social tag relations and temporal relations. We also propose some methods for measuring relations among tweets. In particular, to better measure the semantic similarity of tweets, we propose a new document similarity measure based on a suffix tree (STVSM). In addition, a new keyword extraction method based on a suffix tree is proposed. Experiments on real datasets show that the performance of MVTD is much better than that of a single view, and it is useful for detecting topics from Twitter.
