Renata Avros
ORT Braude College of Engineering
Publication
Featured research published by Renata Avros.
Machine Learning | 2011
Zeev Volkovich; Zeev Barzily; Gerhard-Wilhelm Weber; Dvora Toledano-Kitai; Renata Avros
In cluster analysis, selecting the number of clusters is an “ill-posed” problem of crucial importance. In this paper we propose a re-sampling method for assessing cluster stability. Our model suggests that samples’ occurrences in clusters can be considered as realizations of the same random variable in the case of the “true” number of clusters. Thus, similarity between different cluster solutions is measured by means of compound and simple probability metrics. Compound criteria result in validation rules employing the stability content of clusters. Simple probability metrics, in particular those based on kernels, provide more flexible geometrical criteria. We analyze several applications of probability metrics combined with methods intended to simulate cluster occurrences. Numerical experiments are provided to demonstrate and compare the different metrics and simulation approaches.
Central European Journal of Operations Research | 2012
Zeev Volkovich; Zeev Barzily; Gerhard-Wilhelm Weber; Dvora Toledano-Kitai; Renata Avros
Among the areas of data and text mining employed today in OR, science, economy and technology, clustering theory serves as a preprocessing step in data analysis. An important component of clustering theory is the determination of the true number of clusters. This problem has not been satisfactorily solved. In our paper, this problem is addressed by the cluster stability approach. For several possible numbers of clusters, we estimate the stability of the partitions obtained from clustering of samples. Partitions are considered consistent if their clusters are stable. Cluster validity is measured by the total number of edges, in the clusters’ minimal spanning trees, connecting points from different samples. Specifically, we use the Friedman and Rafsky two-sample test statistic. Under the homogeneity hypothesis of well-mingled samples within the clusters, the considered statistic is asymptotically normally distributed. Based on this fact, the standard score of this edge count is computed, and the partition quality is represented by the worst cluster, corresponding to the minimal standard score value. It is natural to expect that the true number of clusters can be characterized by the empirical distribution having the shortest left tail. The proposed methodology sequentially creates the described distribution and estimates its left-asymmetry. Several numerical experiments demonstrate the ability of the approach to detect the true number of clusters.
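The edge count described in this abstract can be illustrated with a minimal sketch. It assumes Euclidean distance and builds the minimal spanning tree with Prim's algorithm on the pooled points of one cluster; the function names (`mst_edges`, `cross_sample_edges`, `expected_cross_edges`) are hypothetical and not taken from the paper. The standard score used by the authors is not reproduced here; only the raw count and its expectation under a random relabelling of the points are shown.

```python
import math


def mst_edges(points):
    """Return the edges (i, j) of a Euclidean minimal spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # distance from each vertex to the current tree
    best_from = [-1] * n         # tree vertex realising that distance
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the vertex outside the tree that is closest to it.
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best_dist[k])
        edges.append((best_from[j], j))
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best_dist[k]:
                    best_dist[k] = d
                    best_from[k] = j
    return edges


def cross_sample_edges(sample_a, sample_b):
    """Count MST edges of the pooled points that join points from different samples."""
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])


def expected_cross_edges(m, n):
    """Expected count when the m + n sample labels are assigned at random: 2mn / (m + n)."""
    return 2.0 * m * n / (m + n)
```

Two well-separated samples give a count far below the expectation, while well-mingled samples give a count near or above it, which is the sense in which a low count signals an unstable cluster.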
Archive | 2012
Renata Avros; Oleg N. Granichin; Dmitry S. Shalymov; Zeev Volkovich; Gerhard-Wilhelm Weber
One of the important problems arising in cluster analysis is the estimation of the appropriate number of clusters. In the case when the expected number of clusters is sufficiently large, the majority of the existing methods involve high complexity computations. This difficulty can be avoided by using a suitable confidence interval to estimate the number of clusters. Such a method is proposed in the current chapter.
Communications in Statistics - Theory and Methods | 2011
Zeev Volkovich; Zeev Barzily; Renata Avros; Dvora Toledano-Kitai
K-Nearest Neighbors is a widely used technique for classifying and clustering data. In the current article, we address the cluster stability problem based upon probabilistic characteristics of this approach. We estimate the stability of partitions obtained from clustering pairs of samples. Partitions are presumed to be consistent if their clusters are stable. Cluster validity is quantified through the number of each point’s K nearest neighbors that belong to the point’s own sample. The null hypothesis of well-mixed samples within the clusters suggests a Binomial distribution of this quantity with K trials and success probability 0.5. A cluster is represented by a summarizing index of the p-values calculated over all cluster objects under the null hypothesis, and the partition quality is evaluated via the worst cluster of the partition. The true number of clusters is attained by the empirical index distribution having maximal suitable asymmetry. The proposed methodology produces the index distributions sequentially and assesses their asymmetry. Numerical experiments show that the methodology is well able to expose the true number of clusters.
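As a rough illustration of the quantity described here, the sketch below counts, for every point of a pooled cluster, how many of its K nearest neighbors come from the point's own sample, and converts a count into a Binomial(K, 0.5) tail probability. Euclidean distance, the choice of the upper tail (many same-sample neighbors taken as evidence against good mixing) and the function names are assumptions made for the example; the summarizing index used in the article is not reproduced.

```python
import math


def binom_upper_tail(k, count, p=0.5):
    """P(X >= count) for X ~ Binomial(k, p). The upper tail is an assumption of this sketch."""
    return sum(math.comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(count, k + 1))


def same_sample_knn_counts(sample_a, sample_b, k):
    """For each pooled point, count its k nearest neighbors drawn from the same sample."""
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    counts = []
    for i, p in enumerate(points):
        # All other points, ordered by Euclidean distance to point i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        counts.append(sum(1 for j in order[:k] if labels[j] == labels[i]))
    return counts


def knn_p_values(sample_a, sample_b, k):
    """Per-point p-values under the null hypothesis of well-mixed samples."""
    return [binom_upper_tail(k, c) for c in same_sample_knn_counts(sample_a, sample_b, k)]
```

When the two samples are well mixed inside a cluster, the counts scatter around K/2 and the p-values are unremarkable; when the samples separate, the counts approach K and the p-values shrink.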
Soft Computing | 2013
Dvora Toledano-Kitai; Renata Avros; Zeev Volkovich; Gerhard-Wilhelm Weber; Orly Yahalom
Cluster validation is the task of estimating the quality of a given partition of a data set into clusters of similar objects. Normally, a clustering algorithm requires a desired number of clusters as a parameter. We consider the cluster validation problem of determining the optimal “true” number of clusters. We adopt the stability testing approach, according to which repeated applications of a given clustering algorithm provide similar results when the specified number of clusters is correct. To implement this idea, we draw pairs of independent equal-sized samples, where one sample in each pair is drawn from the data source and the other is drawn from a noised version thereof. We then run the same clustering method on both samples in each pair and test the similarity between the obtained partitions using a general k-Nearest Neighbor Binomial model. These similarity measurements enable us to estimate the correct number of clusters. A series of numerical experiments on both synthetic and real-world data demonstrates the high capability of the proposed approach compared to other methods. In particular, the use of a noised data set is shown to produce significantly better results than using two independent samples that are both drawn from the data source.
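The sampling step described in this abstract could look roughly like the sketch below. The abstract does not specify the noise model, so additive Gaussian noise with standard deviation `noise_sd` is an assumption, as are the function name `draw_sample_pair` and its parameters.

```python
import random


def draw_sample_pair(data, size, noise_sd, rng):
    """Draw one sample from the data and one from a noised copy of the data.

    Assumption: the noised version adds independent Gaussian noise to each coordinate.
    """
    clean = rng.sample(data, size)
    noised_source = [tuple(x + rng.gauss(0.0, noise_sd) for x in p) for p in data]
    noised = rng.sample(noised_source, size)
    return clean, noised


# Hypothetical usage: ten pairs of samples of 5 points each from toy data.
rng = random.Random(0)
toy_data = [(float(i), float(i % 3)) for i in range(20)]
pairs = [draw_sample_pair(toy_data, 5, 0.1, rng) for _ in range(10)]
```

Each pair would then be clustered with the same algorithm and the two partitions compared, as the abstract describes.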
International Journal of Operational Research | 2012
Zeev Volkovich; Gerhard-Wilhelm Weber; Renata Avros; Orly Yahalom
This work addresses the cluster validation problem of determining the ‘right’ number of clusters. We consider a cluster stability property based on the k-nearest neighbour type coincidences model. Quality of a clustering is measured by the deviation from this model, where a small deviation indicates a good clustering. The true number of clusters corresponds to the empirical deviation distribution having the shortest right tail. Experiments carried out on synthetic and real data sets demonstrate the effectiveness of our method.
Automation and Remote Control | 2011
Oleg N. Granichin; Dmitry S. Shalymov; Renata Avros; Zeev Volkovich
Clustering is actively studied in such fields as statistics, pattern recognition, and machine learning, among others. A new randomized algorithm for finding the number of clusters in a data set is proposed and justified; its efficiency is demonstrated by simulation examples on synthetic data with thousands of clusters.
POWER CONTROL AND OPTIMIZATION: Proceedings of the 3rd Global Conference on Power Control and Optimization | 2010
Zeev Volkovich; Gerhard-Wilhelm Weber; Renata Avros
This work addresses the problem of cluster validation, namely determining the right number of clusters. We consider a cluster stability property based on the k-nearest neighbor type coincidences model. Cluster quality is measured by the deviations from this model, such that well-constructed clusters are characterized by small deviation values. The true number of clusters corresponds to the empirical deviation distribution having the shortest right tail. The experiments carried out on synthetic and real databases demonstrate the effectiveness of the approach.
Machine Learning and Data Mining in Pattern Recognition | 2018
Renata Avros; Zeev Volkovich
The paper presents a novel methodology intended to distinguish between real and artificially generated manuscripts. The approach employs inherent differences between human and artificially generated writing styles. Taking into account the nature of the generation process, we suggest that the human style is essentially more “diverse” and “rich” in comparison with an artificial one. In order to assess dissimilarities between fake and real papers, a distance between writing styles is evaluated via the dynamic dissimilarity methodology. From this standpoint, the generated papers are very similar in their own style and differ significantly from human-written documents. A set of fake documents is taken as the training data, so that a real document is expected to appear as an outlier in relation to this collection. Thus, we analyze the proposed task in the context of one-class classification, using a one-class SVM approach compared with a clustering-based procedure. The numerical experiments demonstrate a very high ability of the proposed methodology to recognize artificially generated papers.
Journal of Applied Bioinformatics & Computational Biology | 2018
Valery M. Kirzhner; Zeev Volkovich; Renata Avros; Elena V. Ravve
A metagenome is a mixture of different genomes, and the analysis of its composition is currently a challenging problem in bioinformatics. In the present study, we attempt to solve this problem using DNA-marker primers, which are short nucleic acid fragments. Formally speaking, each primer maps a genome into a finite set of integers. This set is called the genome spectrum for the given primer and is unique for each genome. The union of the genetic material of two genomes is mapped into the union of their spectra. Thus the metagenome spectrum always includes (covers) the spectra of the constituent genomes. A genome whose spectrum is not covered by the metagenome spectrum cannot be part of the metagenome, while the spectrum of a genome that is not included in the metagenome can accidentally be covered by the metagenome spectrum. However, if covering occurs for several different primers, the probability of the genome's inclusion in the metagenome can be estimated, with the accuracy depending on the number of primers used. In the present study, the estimations are made for the case of random primers, and their effectiveness is assessed using a computer simulation of the RAPD technology.
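The covering argument in this abstract maps directly onto set operations. The sketch below represents each spectrum as a Python set of integers; the function names and the idea of keying spectra by primer name are illustrative assumptions, and the probability estimate discussed in the abstract is not attempted.

```python
def metagenome_spectrum(genome_spectra):
    """Spectrum of a mixture: the union of the spectra of its constituent genomes."""
    result = set()
    for spectrum in genome_spectra:
        result |= spectrum
    return result


def is_covered(genome_spectrum, meta_spectrum):
    """True if every element of the genome spectrum appears in the metagenome spectrum."""
    return genome_spectrum <= meta_spectrum


def covered_for_all_primers(genome_by_primer, meta_by_primer):
    """True if the genome is covered for every primer listed for the metagenome.

    Both arguments map a primer name to a spectrum (a set of integers).
    A genome failing for any single primer cannot be part of the metagenome.
    """
    return all(
        is_covered(genome_by_primer[primer], meta_by_primer[primer])
        for primer in meta_by_primer
    )
```

With a single primer, a genome outside the mixture can still pass `is_covered` by accident; requiring coverage across several primers, as in `covered_for_all_primers`, makes such accidental matches less likely.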