Vladimir Nikulin
University of Queensland
Publications
Featured research published by Vladimir Nikulin.
International Journal of Data Warehousing and Mining | 2008
Vladimir Nikulin
Imbalanced data represent a significant problem because the corresponding classifier tends to ignore patterns that have smaller representation in the training set. We propose to consider a large number of balanced training subsets in which representatives of the larger pattern are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent random subsets and columns represent features. Based on this matrix, we assess how stable the influence of each particular feature is, and propose to keep in the model only features with stable influence. The final model is an average of the single models, which need not be linear regressions. The above model proved to be efficient and competitive during the PAKDD-2007 Data Mining Competition.
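A minimal sketch of the balanced-subset scheme described above, assuming scikit-learn-style inputs (X, y) with a binary, imbalanced target encoded as 0/1; all function names and parameters here are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def balanced_subset_coefficients(X, y, n_subsets=100, seed=0):
    """Fit a linear regression on many balanced random subsets.

    Returns a matrix of shape (n_subsets, n_features): rows represent
    random subsets and columns represent features, as in the abstract.
    """
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    coefs = np.empty((n_subsets, X.shape[1]))
    for i in range(n_subsets):
        # representatives of the larger pattern are selected randomly
        sample = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sample])
        coefs[i] = LinearRegression().fit(X[idx], y[idx]).coef_
    return coefs
```

Feature stability can then be assessed column-wise, for example via the ratio of the mean coefficient to its spread across subsets, as sketched after the next abstract.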
Australasian Joint Conference on Artificial Intelligence | 2009
Vladimir Nikulin; Geoffrey J. McLachlan; Shu Kay Angus Ng
Ensembles are often capable of greater prediction accuracy than any of their individual members, and the diversity between individual base-learners makes an ensemble less prone to overfitting. On the other hand, in many cases we are dealing with imbalanced data, and a classifier built using all of the data tends to ignore the minority class. As a solution to this problem, we propose to consider a large number of relatively small, balanced subsets in which representatives of the larger pattern are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent random subsets and columns represent features. Based on this matrix, we assess how stable the influence of each particular feature is, and propose to keep in the model only features with stable influence. The final model is an average of the base-learners, which need not be linear regressions. Test results on the datasets of the PAKDD-2007 data-mining competition are presented.
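Continuing the sketch from the previous abstract, one plausible reading of the stability criterion and of the final averaged ensemble; the t-statistic-style threshold is an assumption, not the paper's specification.

```python
import numpy as np

def stable_features(coefs, threshold=2.0):
    """Keep columns whose mean coefficient is large and consistent
    relative to its spread across the random subsets."""
    mean = coefs.mean(axis=0)
    std = coefs.std(axis=0) + 1e-12  # avoid division by zero
    return np.flatnonzero(np.abs(mean) / std >= threshold)

def ensemble_predict(models, X):
    """Final model: a plain average of the base-learners' predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)
```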
Computational Intelligence Methods for Bioinformatics and Biostatistics | 2009
Vladimir Nikulin; Geoffrey J. McLachlan
The high dimensionality of microarray data, with the expressions of thousands of genes measured on a much smaller number of samples, presents challenges that affect the validity of analytical results. Hence attention has to be given to some form of dimension reduction, representing the data in terms of a smaller number of variables. The latter are often chosen to be linear combinations of the original variables (genes), called metagenes. One commonly used approach is principal component analysis (PCA), which can be implemented via a singular value decomposition (SVD). However, in the case of a high-dimensional matrix, SVD may be very expensive in terms of computational time. We propose to reduce the SVD task to an ordinary maximisation problem with a Euclidean norm, which may be solved easily using gradient-based optimisation. We demonstrate the effectiveness of this approach for the supervised classification of gene expression data.
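The reduction of the SVD task to a norm-maximisation problem can be sketched as follows: maximise ||Xv||² over the unit sphere by projected gradient ascent, whose maximiser is the top right singular vector. This is a generic sketch of the idea, not the authors' exact algorithm; all names and step sizes are illustrative.

```python
import numpy as np

def leading_metagene(X, n_iter=200, lr=0.1, seed=0):
    """Approximate the top right singular vector of X (samples x genes)
    by maximising f(v) = ||X v||^2 over the unit sphere."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ v)   # gradient of ||Xv||^2
        v = v + lr * grad
        v /= np.linalg.norm(v)       # project back onto the unit sphere
    return v                         # metagene weights over genes
```

Subsequent metagenes can be obtained by deflating X (subtracting the fitted rank-one component) and repeating.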
International Journal of Computational Intelligence and Applications | 2006
Vladimir Nikulin
Signature-based intrusion detection systems look for known, suspicious patterns in the input data. In this paper we explore compression of labeled empirical data using threshold-based clustering with regularization. The main target of clustering is to compress the training dataset to a limited number of signatures, and consequently to minimize the number of comparisons that are necessary to determine the status of an input event. Essentially, the process of clustering includes merging of clusters which are close enough; as a consequence, the original dataset is reduced to a limited number of labeled centroids. In combination with the k-nearest-neighbour (kNN) method, this set of centroids may be used as a multi-class classifier. Clearly, different attributes have different importance depending on the particular training database and the given cost matrix. This importance may be regulated in the definition of the distance using linear weight coefficients, and the paper introduces a special procedure to estimate these weight coefficients. Experiments on the KDD-99 intrusion detection dataset have confirmed the effectiveness of the proposed methods.
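A simplified, single-pass sketch of threshold-based compression to labeled centroids with a weighted distance, roughly in the spirit described above; the merge rule and all parameters are assumptions for illustration (the paper's actual procedure, including the regularization and the weight estimation, is more involved).

```python
import numpy as np

def weighted_dist(a, b, w):
    """Weighted Euclidean distance; w regulates attribute importance."""
    return np.sqrt(np.sum(w * (a - b) ** 2))

def threshold_clustering(X, y, w, merge_threshold):
    """Compress labeled data to a small set of labeled centroids by
    merging same-label points that are closer than merge_threshold."""
    centroids, labels, counts = [], [], []
    for x, lab in zip(X, y):
        for i, c in enumerate(centroids):
            if labels[i] == lab and weighted_dist(x, c, w) < merge_threshold:
                counts[i] += 1
                centroids[i] = c + (x - c) / counts[i]  # running mean
                break
        else:
            centroids.append(x.astype(float))
            labels.append(lab)
            counts.append(1)
    return np.array(centroids), np.array(labels)

def classify(x, centroids, labels, w):
    """Nearest-centroid decision over the compressed labeled set."""
    d = [weighted_dist(x, c, w) for c in centroids]
    return labels[int(np.argmin(d))]
```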
Bioinformatics and Biomedicine | 2009
Vladimir Nikulin; Geoffrey J. McLachlan
We propose a general method for matrix factorization based on decomposition by parts, which can reduce the dimension of expression data from thousands of genes to several factors. Unlike classification and regression, matrix decomposition requires no response variable and thus falls into the category of unsupervised learning methods. We demonstrate the effectiveness of this approach for the supervised classification of gene expression data.
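The phrase "decomposition by parts" suggests a non-negative matrix factorization; whether this matches the authors' exact scheme is an assumption. A minimal sketch using the classical multiplicative updates of Lee and Seung:

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative expression matrix V (genes x samples) into
    W (genes x k factors) and H (k x samples), minimising the Frobenius
    reconstruction error with multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k))
    H = rng.random((k, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The columns of W play the role of factors (metagenes), and H gives each sample's loadings, which can then feed a supervised classifier.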
Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security Conference | 2005
Vladimir Nikulin; Alexander J. Smola
Parametric, model-based algorithms learn generative models from the data, with each model corresponding to one particular cluster. Accordingly, a model-based partitional algorithm selects the most suitable model for any data object (Clustering step), and recomputes the parametric models using the data from the corresponding clusters (Maximization step). This Clustering-Maximization framework has been widely used and has shown promising results in many applications, including complex variable-length data. The paper proposes the Experience-Innovation (EI) method as a natural extension of the Clustering-Maximization framework. The method includes three components: (1) keep the best past experience, making the empirical likelihood trajectory monotonic as a result; (2) find a new model as a function of existing models, so that the corresponding cluster splits those existing clusters with a larger number of elements and lower uniformity; (3) heuristic innovations, for example, several trials with random initial settings. Also, we introduce clustering regularisation based on a balanced complex of two conditions: (1) the significance of any particular cluster; (2) the difference between any two clusters. We illustrate the effectiveness of the proposed methods using a first-order Markov model in application to a large web-traffic dataset. The aim of the experiment is to explain and understand the way people interact with web sites.
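A bare-bones sketch of the Clustering-Maximization loop for variable-length sequences under first-order Markov models, in the spirit of the framework described above; the Experience-Innovation components (keeping the best past assignment, cluster splitting, random restarts) are omitted, and all names and the smoothing constant are illustrative.

```python
import numpy as np

def fit_markov(seqs, n_states, eps=1.0):
    """Maximization step: estimate a first-order transition matrix from
    the sequences assigned to one cluster (eps = additive smoothing)."""
    T = np.full((n_states, n_states), eps)
    for s in seqs:
        for a, b in zip(s[:-1], s[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def log_lik(seq, T):
    return sum(np.log(T[a, b]) for a, b in zip(seq[:-1], seq[1:]))

def clustering_maximization(seqs, k, n_states, n_iter=20, seed=0):
    """Alternate the Clustering step (assign each sequence to the model
    with the highest likelihood) and the Maximization step (refit each
    model on its own cluster)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(k, size=len(seqs))
    for _ in range(n_iter):
        models = [fit_markov([s for s, a in zip(seqs, assign) if a == j],
                             n_states) for j in range(k)]
        assign = np.array([np.argmax([log_lik(s, T) for T in models])
                           for s in seqs])
    return assign, models
```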
International Joint Conference on Neural Networks | 2006
Vladimir Nikulin
We consider several models which employ a gradient-based method as a core optimization tool. Experimental results were obtained in a real-time environment during the WCCI-2006 Performance Prediction Challenge. None of the models proved to be the best across all five datasets. However, we can exploit the actual differences between the models and create an ensemble system as a complex of the base models, where the balance may be regulated using special parameters or confidence levels. Overfitting is a common problem when the dimension is comparable to, or even higher than, the sample size. Using mean-variance filtering we can reduce the difference between training and test results significantly by treating some features as noise.
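A hedged sketch of what mean-variance filtering might look like: score each feature by its between-class mean difference relative to its spread, and drop low-scoring features as noise. The exact statistic and threshold are assumptions, not the paper's specification.

```python
import numpy as np

def mean_variance_filter(X, y, threshold=0.1):
    """Return indices of features whose class-conditional mean
    difference is large relative to the overall standard deviation;
    the remaining features are treated as noise and dropped."""
    m1 = X[y == 1].mean(axis=0)
    m0 = X[y == 0].mean(axis=0)
    s = X.std(axis=0) + 1e-12  # avoid division by zero
    score = np.abs(m1 - m0) / s
    return np.flatnonzero(score >= threshold)
```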
Intelligent Data Engineering and Automated Learning | 2005
Vladimir Nikulin
We propose universal clustering in line with the concepts of universal estimation. In order to illustrate the model of universal clustering, we consider a family of power loss functions in probability space which is marginally linked to the Kullback-Leibler divergence. The model proved to be effective in application to synthetic data. We also consider a large web-traffic dataset; the aim of that experiment is to explain and understand the way people interact with web sites.
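One standard way to make the link to the Kullback-Leibler divergence concrete (a sketch of a common construction, not necessarily the paper's exact family): for the probability p assigned to the observed outcome, consider the power loss

```latex
L_\lambda(p) = \frac{1 - p^\lambda}{\lambda}, \qquad \lambda > 0,
\qquad\text{with}\qquad
\lim_{\lambda \to 0} L_\lambda(p) = -\log p .
```

In the limit the expected loss becomes the cross-entropy, whose minimisation is equivalent to minimising the Kullback-Leibler divergence up to an entropy constant.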
International Symposium on Neural Networks | 2010
Vladimir Nikulin; Geoffrey J. McLachlan
Brain segmentation represents a very complex and challenging problem. Fiber pathways connecting the same functional regions of the brain form a natural anatomical group (bundle), so fiber bundling is a typical clustering problem. Note that the fiber bundles in the human brain take various sizes and shapes, and the measure used to define the spatial proximity between curves is of fundamental importance for clustering. It is not easy (first of all in terms of computational time) to compare different fibers directly, taking into account that they have different lengths and structures. As a solution to this problem, we propose to consider intermediate key-sets containing several very important 3D points: depending on the proximity to one particular set, we can conclude whether or not two different curves are similar. Our method was tested successfully during the International 2009 Pittsburgh Brain Connectivity IEEE ICDM Competition, where we achieved the top score in Challenge 1 (our score was 50.49% higher than the second-highest score). Also, we were placed second in Challenge 2.
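A minimal sketch of the key-set idea as described above: fix a small set of important 3D points and map each variable-length curve to the vector of its distances to those points, so that curves of different lengths become directly comparable. The choice of key points and the downstream clustering are assumptions for illustration.

```python
import numpy as np

def keyset_signature(curve, key_points):
    """Map a variable-length 3D curve (n x 3 array) to a fixed-length
    vector: the distance from each key point to the nearest point on
    the curve."""
    # pairwise distances, shape (n_keys, n_curve_points)
    d = np.linalg.norm(key_points[:, None, :] - curve[None, :, :], axis=2)
    return d.min(axis=1)

# Fibers can then be bundled by clustering their signatures, e.g.:
#   from sklearn.cluster import KMeans
#   sigs = np.array([keyset_signature(c, keys) for c in fibers])
#   bundles = KMeans(n_clusters=50).fit_predict(sigs)
```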
Machine Learning and Data Mining in Pattern Recognition | 2016
Vladimir Nikulin
One way to optimise insurance prices and policies is to collect and analyse driving trajectories: sequences of 2D points where the time interval between any two consecutive points is constant. Suppose that most drivers have a safe driving style with similar statistical characteristics. Using this assumption as the main ground, we go through the list of all drivers available in the database, assuming that the current driver is "bad", and add to the training database several randomly selected drivers assumed to be "good". By comparing the current driver with a few randomly selected "good" drivers, we estimate the probability that the current driver is bad (or has significant deviations from the usual statistical characteristics). A distinguishing feature of the presented method is that it does not require training labels. The database includes 2736 drivers with 200 variable-length driving trajectories each. We tested our model online (with competitive results) during the Kaggle-based AXA Drivers Telematics Challenge in 2015.
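A hedged sketch of the label-free scheme described above, assuming trip-level feature vectors (speeds, accelerations, and so on) have already been extracted from the trajectories; the classifier choice and all names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def driver_outlier_scores(target_feats, other_feats_pool, rng, n_others=5):
    """Label the current driver's trips as 'bad' (1) and trips from a
    few randomly selected other drivers as 'good' (0), then read the
    fitted probabilities back for the current driver's own trips.
    No real training labels are required."""
    picks = rng.choice(len(other_feats_pool), n_others, replace=False)
    others = [other_feats_pool[i] for i in picks]
    X = np.vstack([target_feats] + others)
    y = np.concatenate([np.ones(len(target_feats)),
                        np.zeros(sum(len(o) for o in others))])
    clf = GradientBoostingClassifier().fit(X, y)
    # a high score means the trip is easy to tell apart from "good" drivers
    return clf.predict_proba(target_feats)[:, 1]
```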