Publication


Featured research published by Xin Dang.


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2009

Outlier Detection with the Kernelized Spatial Depth Function

Yixin Chen; Xin Dang; Hanxiang Peng; Henry L. Bart

Statistical depth functions provide a center-outward ordering of multidimensional data, starting from the deepest point. In this sense, depth functions can measure the extremeness or outlyingness of a data point with respect to a given data set. Hence, they can detect outliers, that is, observations that appear extreme relative to the rest of the observations. Of the various statistical depths, the spatial depth is especially appealing because of its computational efficiency and mathematical tractability. In this article, we propose a novel statistical depth, the kernelized spatial depth (KSD), which generalizes the spatial depth via positive definite kernels. By choosing a proper kernel, the KSD can capture the local structure of a data set where the spatial depth fails. We demonstrate this on the half-moon data and the ring-shaped data. Based on the KSD, we propose a novel outlier detection algorithm, by which an observation with a depth value less than a threshold is declared an outlier. The proposed algorithm is simple in structure: for a given kernel, the threshold is the only parameter. It applies to a one-class learning setting, in which normal observations are given as the training data, as well as to a missing label scenario, where the training set consists of a mixture of normal observations and outliers with unknown labels. We give upper bounds on the false alarm probability of a depth-based detector. These upper bounds can be used to determine the threshold. We perform extensive experiments on synthetic data and data sets from real applications. The proposed outlier detector is compared with existing methods. The KSD outlier detector demonstrates competitive performance.
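
The depth-and-threshold idea in this abstract can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes the sample spatial depth is one minus the norm of the averaged unit vectors pointing from the data to the query point, and it evaluates that norm in feature space through kernel values only. The function names, the Gaussian bandwidth `sigma=0.3`, the normalization by `len(data)`, and the toy ring data are all assumptions made for the example.

```python
import math


def gaussian_kernel(x, y, sigma=0.3):
    """Gaussian (RBF) kernel; sigma is an illustrative bandwidth, not a tuned value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def linear_kernel(x, y):
    """Inner product; with this kernel the KSD reduces to the ordinary spatial depth."""
    return sum(a * b for a, b in zip(x, y))


def kernelized_spatial_depth(x, data, kernel):
    """Return 1 - ||sum of feature-space unit vectors from data points to x|| / n."""
    k_xx = kernel(x, x)
    neighbours = []  # (point, k(x, point), feature-space distance from x)
    for p in data:
        k_xp = kernel(x, p)
        sq = k_xx + kernel(p, p) - 2.0 * k_xp
        if sq > 1e-12:  # points coinciding with x contribute a zero vector
            neighbours.append((p, k_xp, math.sqrt(sq)))
    total = 0.0
    for p, k_xp, d_p in neighbours:
        for q, k_xq, d_q in neighbours:
            # inner product of (phi(x) - phi(p)) and (phi(x) - phi(q)), normalized
            total += (k_xx + kernel(p, q) - k_xp - k_xq) / (d_p * d_q)
    return 1.0 - math.sqrt(max(total, 0.0)) / len(data)


def is_outlier(x, data, kernel, threshold):
    """Declare x an outlier when its depth falls below the threshold."""
    return kernelized_spatial_depth(x, data, kernel) < threshold


# Ring-shaped toy data: 40 points evenly spaced on the unit circle.
ring = [(math.cos(2 * math.pi * i / 40), math.sin(2 * math.pi * i / 40))
        for i in range(40)]
center = (0.0, 0.0)
print("spatial depth of center: ", kernelized_spatial_depth(center, ring, linear_kernel))
print("Gaussian KSD of center:  ", kernelized_spatial_depth(center, ring, gaussian_kernel))
print("Gaussian KSD on the ring:", kernelized_spatial_depth(ring[0], ring, gaussian_kernel))
```

With the linear kernel the empty center of the ring is the deepest point, which is the failure mode the abstract describes. With the Gaussian kernel and these illustrative settings, the center should receive a lower depth than a point on the ring, so a suitable threshold would flag it.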


Knowledge Based Systems | 2014

Financial ratio selection for business failure prediction using soft set theory

Wei Xu; Zhi Xiao; Xin Dang; Daoli Yang; Xianglei Yang

This paper presents a novel parameter reduction method guided by soft set theory (NSS) to select financial ratios for business failure prediction (BFP). The proposed method integrates statistical logistic regression into soft set decision theory and hence takes advantage of both approaches. The procedure is applied to real data sets from Chinese listed firms. From the financial analysis statement category set and the financial ratio set considered in the previous literature, our proposed method selects nine significant financial ratios. Among them, four ratios are newly recognized as important variables for BFP. For comparison, principal component analysis, traditional soft set theory, and rough set theory are included in the study as alternative reduction methods. The predictive ability of the ratios selected by each reduction method, along with the ratios commonly used in the prior literature, is evaluated by three forecasting tools: support vector machine, neural network, and logistic regression. The results demonstrate superior forecasting performance of the proposed method in terms of accuracy and stability.


BMC Bioinformatics | 2009

Graph ranking for exploratory gene data analysis

Cuilan Gao; Xin Dang; Yixin Chen; Dawn Wilkins

Background: Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes in a single experiment. However, the large number of genes greatly increases the challenges of analyzing, comprehending and interpreting the resulting mass of data. Selecting a subset of important genes is necessary to address this challenge. Gene selection has been investigated extensively over the last decade. Most selection procedures, however, are not sufficient for accurate inference of underlying biology, because what is biologically significant is not necessarily statistically significant. Additional biological knowledge needs to be integrated into the gene selection procedure.

Results: We propose a general framework for gene ranking. We construct a bipartite graph from the Gene Ontology (GO) and gene expression data. The graph describes the relationship between genes and their associated molecular functions. Under a species condition, edge weights of the graph are assigned to be gene expression level. Such a graph provides a mathematical means to represent both species-independent and species-dependent biological information. We also develop a new ranking algorithm to analyze the weighted graph via a kernelized spatial depth (KSD) approach. Consequently, the importance of genes and molecular functions can be simultaneously ranked by a real-valued measure, KSD, which incorporates the global and local structure of the graph. Over-expressed and under-regulated genes can also be ranked separately.

Conclusion: The gene-function bigraph integrates molecular function annotations into gene expression data. The relevance of genes is described in the graph (through a common function). The proposed method provides an exploratory framework for gene data analysis.


BMC Systems Biology | 2014

Learning accurate and interpretable models based on regularized random forests regression

Sheng Liu; Shamitha Dissanayake; Sanjay R. Patel; Xin Dang; Todd E. Mlsna; Yixin Chen; Dawn Wilkins

Background: Many biology-related studies combine data from multiple sources in an effort to understand the underlying problems. It is important to find and interpret the most important information from these sources. Thus, it is beneficial to have an effective algorithm that can simultaneously extract decision rules and select critical features for good interpretation while preserving prediction performance.

Methods: In this study, we focus on regression problems for biological data where target outcomes are continuous. In general, models constructed from linear regression approaches are relatively easy to interpret. However, many practical biological applications are nonlinear in essence, and a direct linear relationship between input and output can hardly be found. Nonlinear regression techniques can reveal nonlinear relationships in data, but are generally hard for humans to interpret. We propose a rule-based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features.

Results: We tested the approach on some biological data sets. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of random forests regression.

Conclusion: The approach demonstrates high potential in aiding prediction and interpretation of nonlinear relationships of the subject being studied.


Journal of Nonparametric Statistics | 2009

Influence functions of some depth functions, and application to depth-weighted L-statistics

Xin Dang; Robert Serfling; Weihua Zhou

Depth functions are increasingly being used in building nonparametric outlier detectors and in constructing useful nonparametric statistics such as depth-weighted L-statistics (DL-statistics). Robustness of a depth function is an essential property for such applications. Here, robustness of three key depth functions, spatial, simplicial, and generalised Tukey, is explored via the influence function (IF) approach. For all three depths, the IFs are derived and found to be bounded, an important robustness property, and are applied to evaluate two other robustness features, gross error sensitivity and local shift sensitivity. These IFs are also used as components of the IFs of associated DL-statistics, for which through a standard approach consistency and asymptotic normality are then derived. In turn, the asymptotic normality is applied to obtain asymptotic relative efficiencies (ARE). For spatial depth, two forms of weight function suggested in the recent literature are considered and AREs in comparison with the mean are obtained. For all three depths and one of these weight functions, finite sample REs are obtained by simulation under normal, contaminated normal, and heavy-tailed t distributions. As a technical tool of general interest, needed here with the simplicial depth, the IF of a general U-statistic is derived.
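
For readers unfamiliar with the robustness measures named in this abstract, the standard (Hampel) definitions are recalled below. The notation is an assumption made for this note and is not taken from the paper: $T$ is a functional, $F$ a distribution, and $\delta_y$ the point mass at $y$.

```latex
% Influence function of T at F, evaluated at the contamination point y
\mathrm{IF}(y;\,T,F) \;=\; \lim_{\varepsilon \downarrow 0}
  \frac{T\big((1-\varepsilon)F + \varepsilon\,\delta_y\big) - T(F)}{\varepsilon}

% Gross error sensitivity: worst-case influence of a single contaminating point
\gamma^{*}(T,F) \;=\; \sup_{y}\,\big\|\mathrm{IF}(y;\,T,F)\big\|

% Local shift sensitivity: worst-case effect of moving a point from z to y
\lambda^{*}(T,F) \;=\; \sup_{y \neq z}\,
  \frac{\big\|\mathrm{IF}(y;\,T,F) - \mathrm{IF}(z;\,T,F)\big\|}{\|y - z\|}
```

A bounded influence function means $\gamma^{*}$ is finite, which is the robustness property the abstract reports for all three depths.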


Statistics in Medicine | 2009

A unified approach for analyzing exchangeable binary data with applications to developmental toxicity studies

Xin Dang; Stephine Lena Keeton; Hanxiang Peng

In this article, we present a general procedure to analyze exchangeable binary data that may also be viewed as realizations of binomial mixtures. Our approach unifies existing models and is practical and computationally easy. We introduce a rich family of parsimonious parametric binomial mixtures derived from completely monotonic functions, including the incomplete Beta-, Gamma-, Normal-, and Poisson-binomial, generalizing the Beta-binomial. We show that the family is closed under convex linear combinations, products, and composites. We also give the moments and the Markov property of the family. With such distributions, we can perform statistical inference on correlated binary data and, in particular, overdispersed data. We propose a regression procedure that generalizes logistic regression. We provide a forward model selection procedure. We run a small simulation to validate the inclusion of the binomial distribution. Finally, we apply the proposed procedure to analyze the 2,4,5-trichlorophenoxyacetic acid and E2 data and compare the results with existing procedures.
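
The Beta-binomial is the classical binomial mixture that the family described above generalizes. As a rough illustration of why such mixtures suit correlated binary data, the sketch below computes the Beta-binomial probability mass function and compares its variance with the plain binomial variance. It covers only this textbook special case, not the paper's wider family; the function names and the parameter values are assumptions chosen for the example.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(K = k) when K | p ~ Binomial(n, p) and p ~ Beta(a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def beta_binomial_moments(n, a, b):
    """Mean and variance; rho is the correlation between any two binary responses."""
    p = a / (a + b)
    rho = 1.0 / (a + b + 1.0)
    mean = n * p
    variance = n * p * (1.0 - p) * (1.0 + (n - 1) * rho)
    return mean, variance


# Hypothetical litter of n = 10 with Beta(2, 3) mixing (illustrative values only).
n, a, b = 10, 2.0, 3.0
pmf = [beta_binomial_pmf(k, n, a, b) for k in range(n + 1)]
mean, variance = beta_binomial_moments(n, a, b)
binomial_variance = n * (a / (a + b)) * (b / (a + b))
print("pmf sums to:", sum(pmf))
print("mean:", mean)
print("beta-binomial variance:", variance, "vs binomial variance:", binomial_variance)
```

The factor `1 + (n - 1) * rho` is the overdispersion relative to the binomial; it equals one only when the responses within a cluster are uncorrelated.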


IEEE Transactions on Knowledge and Data Engineering | 2015

Robust Model-Based Learning via Spatial-EM Algorithm

Kai Yu; Xin Dang; Henry L. Bart; Yixin Chen

This paper presents a new robust EM algorithm for finite mixture learning. The proposed Spatial-EM algorithm uses median-based location and rank-based scatter estimators to replace the sample mean and sample covariance matrix in each M step, hence enhancing the stability and robustness of the algorithm. It is robust to outliers and initial values. Compared with many robust mixture learning methods, Spatial-EM has the advantages of simplicity in implementation and statistical efficiency. We apply Spatial-EM to supervised and unsupervised learning scenarios. More specifically, robust clustering and outlier detection methods based on Spatial-EM are proposed. We apply the outlier detection to taxonomic research on fish species novelty discovery. Two real datasets are used for clustering analysis. Compared with the regular EM and many other existing methods such as K-median, X-EM and SVM, our method demonstrates superior performance and high robustness.
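
To make the "median-based location" ingredient concrete, the sketch below computes a spatial (geometric) median with the classical Weiszfeld iteration and contrasts it with the sample mean on data containing one gross outlier. This illustrates a single building block under stated assumptions and is not the Spatial-EM algorithm itself: the abstract does not say how the median is computed, and the function name, tolerance, and toy data are hypothetical.

```python
import math


def spatial_median(points, tol=1e-9, max_iter=1000):
    """Weiszfeld iteration for the point minimizing the sum of Euclidean distances."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(max_iter):
        numerator = [0.0] * dim
        denominator = 0.0
        for p in points:
            dist = math.dist(p, current)
            if dist < 1e-12:
                continue  # crude safeguard: skip a data point the iterate sits on
            weight = 1.0 / dist
            denominator += weight
            for d in range(dim):
                numerator[d] += weight * p[d]
        if denominator == 0.0:
            return current  # every point coincides with the iterate
        updated = [v / denominator for v in numerator]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current


# Four points around the origin plus one gross outlier (illustrative data).
sample = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (100.0, 100.0)]
mean = [sum(p[d] for p in sample) / len(sample) for d in range(2)]
print("sample mean:   ", mean)
print("spatial median:", spatial_median(sample))
```

The mean is dragged far toward the outlier, while the spatial median stays near the bulk of the data, which is the kind of stability the abstract attributes to replacing the sample mean in the M step.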


Communications in Statistics - Theory and Methods | 2015

Robustness of the Affine Equivariant Scatter Estimator Based on the Spatial Rank Covariance Matrix

Kai Yu; Xin Dang; Yixin Chen

Visuri et al. (2000) proposed a technique for robust covariance matrix estimation based on different notions of multivariate sign and rank. Among them, the spatial rank based covariance matrix estimator that utilizes a robust scale estimator is especially appealing due to its high robustness, computational ease, and good efficiency. Also, it is orthogonally equivariant under any distribution and affinely equivariant under elliptically symmetric distributions. In this paper, we study robustness properties of the estimator with respect to two measures: breakdown point and influence function. More specifically, the upper bound of the finite sample breakdown point can be achieved by a proper choice of univariate robust scale estimator. The influence functions for eigenvalues and eigenvectors of the estimator are derived. They are found to be bounded under some assumptions. Moreover, finite sample efficiency comparisons to popular robust MCD, M, and S estimators are reported.
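
As a reading aid, the construction referred to above can be written out as follows. This is a sketch of the usual form of the spatial rank covariance approach, under the assumption that it matches the estimator studied in the paper; the symbols $S$, $R$, $\hat{U}$, $\hat{\Lambda}$ and the generic scale estimator $\hat{\sigma}$ are notation chosen for this note.

```latex
% Spatial sign and sample spatial rank of x_i among x_1, ..., x_n in R^d
S(x) = \begin{cases} x / \|x\|, & x \neq 0 \\ 0, & x = 0 \end{cases}
\qquad
R(x_i) = \frac{1}{n} \sum_{j=1}^{n} S(x_i - x_j)

% Spatial rank covariance matrix (RCM)
\mathrm{RCM} = \frac{1}{n} \sum_{i=1}^{n} R(x_i)\, R(x_i)^{\top}

% Scatter estimator: eigenvectors \hat{U} = [\hat{u}_1, ..., \hat{u}_d] of the RCM,
% combined with a univariate robust scale \hat{\sigma} (for example the MAD)
% of the data projected onto each eigenvector
\hat{\lambda}_k = \hat{\sigma}\big(\hat{u}_k^{\top} x_1, \dots, \hat{u}_k^{\top} x_n\big)^{2},
\qquad
\hat{\Sigma} = \hat{U}\, \hat{\Lambda}\, \hat{U}^{\top},
\quad
\hat{\Lambda} = \mathrm{diag}\big(\hat{\lambda}_1, \dots, \hat{\lambda}_d\big)
```

In this form the choice of $\hat{\sigma}$ is the "univariate robust scale estimator" whose selection, according to the abstract, determines whether the upper bound of the finite sample breakdown point is attained.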


Bioinformatics and Biomedicine | 2013

Rule based regression and feature selection for biological data

Sheng Liu; Shamitha Dissanayake; Sanjay R. Patel; Xin Dang; Todd E. Mlsna; Yixin Chen; Dawn Wilkins

Regression is widely used in a variety of biological problems involving continuous outcomes. There are a number of methods for building regression models, ranging from linear models to more complex nonlinear ones. While linear regression techniques can identify linear correlations between input and output, in many practical applications the relations are nonlinear. These relations can be modeled effectively by nonlinear regression techniques. However, many models built with nonlinear techniques have limited interpretability, which is crucial in many biological problems. We propose a rule-based regression algorithm that uses 1-norm regularized random forests. The proposed approach simultaneously extracts a small number of rules from generated random forests and eliminates unimportant features, and hence is able to provide a simple interpretation. We tested the approach on a seacoast chemical sensors dataset, a Stockori flowering time dataset, and three datasets from the UCI repository. The proposed approach is able to construct a significantly smaller set of regression rules using a subset of attributes while achieving prediction performance comparable to that of conventional random forests regression. It demonstrates high potential in terms of prediction performance and ease of interpretation for studying nonlinear relationships of the subjects.


Journal of Statistical Computation and Simulation | 2011

A numerical study of multiple imputation methods using nonparametric multivariate outlier identifiers and depth-based performance criteria with clinical laboratory data

Xin Dang; Robert Serfling

It is well known that if a multivariate outlier has one or more missing component values, then multiple imputation (MI) methods tend to impute nonextreme values and make the outlier become less extreme and less likely to be detected. In this paper, nonparametric depth-based multivariate outlier identifiers are used as criteria in a numerical study comparing several established methods of MI as well as a newly proposed one, nine in all, in a setting of several actual clinical laboratory data sets of different dimensions. Two criteria, an ‘outlier recovery probability’ and a ‘relative accuracy measure’, are developed, based on depth functions. Three outlier identifiers, based on Mahalanobis distance, robust Mahalanobis distance, and generalized principal component analysis, are also included in the study. Consequently, not only the comparison of imputation methods but also the comparison of outlier detection methods is accomplished in this study. Our findings show that the performance of an MI method depends on the choice of depth-based outlier detection criterion, as well as the size and dimension of the data and the fraction of missing components. By taking these features into account, an MI method for a given data set can be selected more optimally.

Collaboration


Dive into Xin Dang's collaborations.

Top Co-Authors

Yixin Chen

University of Mississippi

Dawn Wilkins

University of Mississippi

Hanxiang Peng

University of Mississippi

Zhi Xiao

Chongqing University

Christopher Ma

University of Mississippi

Robert Serfling

University of Texas at Dallas

Sheng Liu

University of Mississippi

Hailin Sang

University of Mississippi
