Veronika Cheplygina
Erasmus University Rotterdam
Publications
Featured research published by Veronika Cheplygina.
Pattern Recognition | 2015
Veronika Cheplygina; David M. J. Tax; Marco Loog
Multiple instance learning (MIL) is concerned with learning from sets (bags) of objects (instances), where the individual instance labels are ambiguous. In this setting, supervised learning cannot be applied directly. Often, specialized MIL methods learn by making additional assumptions about the relationship of the bag labels and instance labels. Such assumptions may fit a particular dataset, but do not generalize to the whole range of MIL problems. Other MIL methods shift the focus of assumptions from the labels to the overall (dis)similarity of bags, and therefore learn from bags directly. We propose to represent each bag by a vector of its dissimilarities to other bags in the training set, and treat these dissimilarities as a feature representation. We show several alternatives to define a dissimilarity between bags and discuss which definitions are more suitable for particular MIL problems. The experimental results show that the proposed approach is computationally inexpensive, yet very competitive with state-of-the-art algorithms on a wide range of MIL datasets.
Highlights:
A general bag dissimilarities framework for multiple instance learning is explored.
Point set distances and distribution distances are considered.
Metric dissimilarities are not necessarily more informative.
Results are competitive with, or outperform, state-of-the-art algorithms.
Practical suggestions for end-users are provided.
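The core idea of the paper, representing a bag by its vector of dissimilarities to the training bags, can be sketched in a few lines. The mean-of-minimum-distances measure below is one of several possible point-set dissimilarities; the function names and toy bags are illustrative, not taken from the paper:

```python
import numpy as np

def mean_min_dissimilarity(bag_a, bag_b):
    """Average, over instances in bag_a, of the distance to the
    nearest instance in bag_b. One of several possible bag
    dissimilarities; note it is not symmetric in general."""
    # pairwise Euclidean distances between all instances
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def dissimilarity_representation(bags, prototypes):
    """Represent each bag as its vector of dissimilarities to a
    set of prototype bags (here: the training bags themselves)."""
    return np.array([[mean_min_dissimilarity(b, p) for p in prototypes]
                     for b in bags])

# toy example: two bags of 2-D instances
bag1 = np.array([[0.0, 0.0], [1.0, 0.0]])
bag2 = np.array([[5.0, 5.0], [6.0, 5.0]])
rep = dissimilarity_representation([bag1, bag2], [bag1, bag2])
# each bag has zero dissimilarity to itself under this measure
```

Any standard vector-space classifier can then be trained on `rep`, which is what makes the approach computationally inexpensive.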
IEEE Transactions on Neural Networks | 2016
Veronika Cheplygina; David M. J. Tax; Marco Loog
In multiple instance learning, objects are sets (bags) of feature vectors (instances) rather than individual feature vectors. In this paper, we address the problem of how these bags can best be represented. Two standard approaches are to use (dis)similarities between bags and prototype bags, or between bags and prototype instances. The first approach results in a relatively low-dimensional representation, determined by the number of training bags, whereas the second approach results in a relatively high-dimensional representation, determined by the total number of instances in the training set. However, an advantage of the latter representation is that the informativeness of the prototype instances can be inferred. In this paper, a third, intermediate approach is proposed, which links the two approaches and combines their strengths. Our classifier is inspired by a random subspace ensemble, and considers subspaces of the dissimilarity space, defined by subsets of instances, as prototypes. We provide insight into the structure of some popular multiple instance problems and show state-of-the-art performances on these data sets.
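The intermediate representation described above uses subsets of training instances as prototypes, each subset defining a subspace of the full instance-dissimilarity space. A minimal sketch under assumed helper names (the exact ensemble construction in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def instance_dissim_rep(bag, prototype_instances):
    """One bag represented by its distances to prototype instances:
    for each prototype, the distance to the nearest instance in the
    bag (one row of the instance-based dissimilarity space)."""
    d = np.linalg.norm(prototype_instances[:, None, :] - bag[None, :, :], axis=-1)
    return d.min(axis=1)

def random_instance_subspaces(n_instances, n_subspaces, size):
    """Index sets defining random subspaces of the full
    instance-dissimilarity space, as in a random subspace ensemble."""
    return [rng.choice(n_instances, size=size, replace=False)
            for _ in range(n_subspaces)]

# pool of 6 training instances in 2-D, one test bag
pool = np.arange(12, dtype=float).reshape(6, 2)
bag = np.array([[0.0, 1.0], [4.0, 5.0]])
full = instance_dissim_rep(bag, pool)           # length-6 representation
subspaces = random_instance_subspaces(6, 3, 2)  # three 2-D subspaces
views = [full[idx] for idx in subspaces]        # the bag seen in each subspace
```

A base classifier would be trained on each subspace view and the ensemble outputs combined, trading off the low dimensionality of bag prototypes against the interpretability of instance prototypes.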
SIMBAD'11 Proceedings of the First international conference on Similarity-based pattern recognition | 2011
David M. J. Tax; Marco Loog; Robert P. W. Duin; Veronika Cheplygina; Wan-Jui Lee
When objects cannot be represented well by single feature vectors, a collection of feature vectors can be used. This is what is done in multiple instance learning, where such a collection is called a bag of instances. By using a bag of instances, an object gains more internal structure than when a single feature vector is used. This improves the expressiveness of the representation, but also adds complexity to the classification of the object. This paper shows that, in situations where no single instance determines the class label of a bag, simple bag dissimilarity measures can significantly outperform standard multiple instance classifiers. In particular, a measure that computes just the average minimum distance between instances, or a measure that uses the Earth Mover's distance, performs very well.
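For equal-size bags with uniform weights, the Earth Mover's distance mentioned above reduces to a minimum-cost perfect matching between instances. A brute-force sketch (fine for tiny bags, illustration only; a real implementation would use an optimal transport or assignment solver):

```python
import numpy as np
from itertools import permutations

def emd_equal_bags(bag_a, bag_b):
    """Earth Mover's distance between two equal-size, uniformly
    weighted bags: the minimum average ground distance over all
    one-to-one matchings of their instances. Brute force over
    permutations, so only suitable for very small bags."""
    n = len(bag_a)
    cost = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    best = min(sum(cost[i, p[i]] for i in range(n))
               for p in permutations(range(n)))
    return best / n

bag_a = np.array([[0.0], [1.0]])
bag_b = np.array([[2.0], [3.0]])
# optimal matching pairs 0 with 2 and 1 with 3, average distance 2
```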
Pattern Recognition Letters | 2015
Veronika Cheplygina; David M. J. Tax; Marco Loog
Highlights:
Many classification tasks are not entirely suitable for supervised learning.
Instead of individual feature vectors, bags of feature vectors can be considered.
Many learning scenarios with bags in the training and/or test phase have been proposed.
We provide an overview and taxonomy of these learning scenarios.
Many classification problems can be difficult to formulate directly in terms of the traditional supervised setting, where both training and test samples are individual feature vectors. There are cases in which samples are better described by sets of feature vectors, in which labels are only available for sets rather than individual samples, or, if individual labels are available, in which these are not independent. To better deal with such problems, several extensions of supervised learning have been proposed, where either training and/or test objects are sets of feature vectors. However, having been proposed rather independently of each other, their mutual similarities and differences have hitherto not been mapped out. In this work, we provide an overview of such learning scenarios, propose a taxonomy to illustrate the relationships between them, and discuss directions for further research in these areas.
Pattern Recognition | 2015
Ethem Alpaydin; Veronika Cheplygina; Marco Loog; David M. J. Tax
In multiple-instance (MI) classification, each input object or event is represented by a set of instances, named a bag, and it is the bag that carries a label. MI learning is used in different applications where data is formed in terms of such bags and where individual instances in a bag do not have a label. We review MI classification from the point of view of label information carried in the instances in a bag, that is, their sufficiency for classification. Our aim is to contrast MI with the standard approach of single-instance (SI) classification to determine when casting a problem in the MI framework is preferable. We compare instance-level classification, combination by noisy-or, and bag-level classification, using the support vector machine as the base classifier. We define a set of synthetic MI tasks at different complexities to benchmark different MI approaches. Our experiments on these and two real-world bioinformatics applications on gene expression and text categorization indicate that depending on the situation, a different decision mechanism, at the instance- or bag-level, may be appropriate. If the instances in a bag provide complementary information, a bag-level MI approach is useful; but sometimes the bag information carries no useful information at all and an instance-level SI classifier works equally well, or better.
Highlights:
We categorize problems by the amount of label information the instances in a bag carry.
We define synthetic tasks of increasing complexity or intra-bag dependency.
These problems allow us to measure the power of multiple-instance algorithms.
We experiment on two bioinformatics datasets for gene expression and text categorization.
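The noisy-or combination compared in this paper aggregates instance-level probabilities into a bag-level score: the bag is positive if at least one instance is, under an independence assumption. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def noisy_or(instance_probs):
    """Noisy-or combination: the bag is labeled positive if at least
    one instance is positive, treating the instance-level
    predictions as independent."""
    p = np.asarray(instance_probs, dtype=float)
    return 1.0 - np.prod(1.0 - p)

# one confident positive instance dominates the bag score
bag_score = noisy_or([0.9, 0.1, 0.05])  # 1 - (0.1 * 0.9 * 0.95)
```

This is the instance-level decision mechanism; the bag-level alternative classifies a representation of the whole bag directly.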
medical image computing and computer assisted intervention | 2015
Veronika Cheplygina; Lauge Sørensen; David M. J. Tax; Marleen de Bruijne; Marco Loog
We address the problem of instance label stability in multiple instance learning (MIL) classifiers. These classifiers are trained only on globally annotated images (bags), but often can provide fine-grained annotations for image pixels or patches (instances). This is interesting for computer aided diagnosis (CAD) and other medical image analysis tasks for which only a coarse labeling is provided. Unfortunately, the instance labels may be unstable. This means that a slight change in training data could potentially lead to abnormalities being detected in different parts of the image, which is undesirable from a CAD point of view. Despite MIL gaining popularity in the CAD literature, this issue has not yet been addressed. We investigate the stability of instance labels provided by several MIL classifiers on 5 different datasets, of which 3 are medical image datasets (breast histopathology, diabetic retinopathy and computed tomography lung images). We propose an unsupervised measure to evaluate instance stability, and demonstrate that a performance-stability trade-off can be made when comparing MIL classifiers.
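One simple unsupervised way to quantify instance label stability is to retrain a classifier on perturbed training sets and measure how often the resulting instance labelings agree. The sketch below uses mean pairwise agreement as such a proxy; this is an illustrative measure, not necessarily the exact one proposed in the paper:

```python
import numpy as np
from itertools import combinations

def mean_pairwise_agreement(labelings):
    """Average pairwise agreement between the instance labelings of
    classifiers trained on perturbed training sets. 1.0 means the
    instance labels are perfectly stable; no ground-truth instance
    labels are needed, so the measure is unsupervised."""
    L = np.asarray(labelings)
    pairs = combinations(range(L.shape[0]), 2)
    return float(np.mean([np.mean(L[i] == L[j]) for i, j in pairs]))

# three runs labeling the same four instances; only the last
# instance flips between runs
stability = mean_pairwise_agreement([[1, 0, 1, 1],
                                     [1, 0, 1, 0],
                                     [1, 0, 1, 1]])
```

Plotting such a stability score against classification performance makes the performance-stability trade-off between MIL classifiers explicit.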
international conference on pattern recognition | 2014
Veronika Cheplygina; Lauge Sørensen; David M. J. Tax; Jesper Holst Pedersen; Marco Loog; Marleen de Bruijne
Chronic obstructive pulmonary disease (COPD) is a lung disease where early detection benefits the survival rate. COPD can be quantified by classifying patches of computed tomography images, and combining patch labels into an overall diagnosis for the image. As labeled patches are often not available, image labels are propagated to the patches, incorrectly labeling healthy patches in COPD patients as being affected by the disease. We approach quantification of COPD from lung images as a multiple instance learning (MIL) problem, which is more suitable for such weakly labeled data. We investigate various MIL assumptions in the context of COPD and show that although a concept region with COPD-related disease patterns is present, considering the whole distribution of lung tissue patches improves the performance. The best method is based on averaging instances and obtains an AUC of 0.742, which is higher than the previously reported best of 0.713 on the same dataset. Using the full training set further increases performance to 0.776, which is significantly higher (DeLong test) than previous results.
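The best-performing method above averages the instances in each bag, which collapses the whole distribution of lung tissue patches into one summary vector per image. A minimal sketch of that representation (toy data, illustrative names):

```python
import numpy as np

def bag_mean_representation(bags):
    """Summarize each bag (set of patch feature vectors) by the
    average of its instances; classification then proceeds on these
    summary vectors with any standard supervised classifier."""
    return np.array([np.mean(b, axis=0) for b in bags])

# two toy "images" with different numbers of patches
bags = [np.array([[0.0, 0.0], [2.0, 2.0]]),
        np.array([[4.0, 0.0], [4.0, 2.0], [4.0, 4.0]])]
rep = bag_mean_representation(bags)  # one 2-D vector per bag
```

Because the mean uses every patch, it reflects the overall tissue distribution rather than any single concept region, which matches the paper's finding.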
SIMBAD'13 Proceedings of the Second international conference on Similarity-Based Pattern Recognition | 2013
Yenisel Plasencia-Calaña; Veronika Cheplygina; Robert P. W. Duin; Edel García-Reyes; Mauricio Orozco-Alzate; David M. J. Tax; Marco Loog
A widely used approach to cope with asymmetry in dissimilarities is to symmetrize them. Usually, asymmetry is corrected by applying combiners such as the average, minimum or maximum of the two directed dissimilarities. Whether or not these are the best approaches for combining the asymmetry remains an open issue. In this paper we study the performance of the extended asymmetric dissimilarity space (EADS) as an alternative way to represent asymmetric dissimilarities for classification purposes. We show that EADS outperforms the representations found from the two directed dissimilarities, as well as those created by the combiners under consideration, in several cases. This holds especially for small numbers of prototypes; for large numbers of prototypes, however, the EADS may suffer more from overfitting than the other approaches. Prototype selection is recommended to overcome overfitting in these cases.
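Rather than symmetrizing, the EADS keeps both directed dissimilarities by concatenating them, so each object is described by its dissimilarities both from and to the prototypes. A sketch of that construction, assuming all objects serve as prototypes (which is why the dimensionality doubles and overfitting can appear with many prototypes):

```python
import numpy as np

def eads(D):
    """Extended asymmetric dissimilarity space: object i is
    represented by row i of D (dissimilarities from i to all
    prototypes) concatenated with column i (dissimilarities from
    all prototypes to i), keeping both directions instead of
    symmetrizing them."""
    return np.hstack([D, D.T])

# asymmetric toy dissimilarity matrix over 3 objects
D = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 5.0],
              [3.0, 6.0, 0.0]])
rep = eads(D)  # shape (3, 6): twice the number of prototypes
```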
International Journal of Pattern Recognition and Artificial Intelligence | 2012
Wan-Jui Lee; Veronika Cheplygina; David M. J. Tax; Marco Loog; Robert P. W. Duin
Structures and features are opposite approaches to building representations for object recognition. Bridging the two is an essential problem in pattern recognition, as the two opposite types of information are fundamentally different. As dissimilarities can be computed for both, the dissimilarity representation can be used to combine the two. Attributed graphs contain structural as well as feature-based information. Neglecting the attributes yields a purely structural description; isolating the features and neglecting the structure represents objects by a bag of features. In this paper we show that weighted combinations of dissimilarities may perform better than these two extremes, indicating that the two types of information are essentially different and strengthen each other. In addition, we present two integrations more advanced than weighted combining and show that these may improve classification performance even further.
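The weighted combination of the two dissimilarity types is a convex combination of their matrices; the endpoints of the weight range recover the purely structural and purely feature-based extremes. A minimal sketch with toy matrices:

```python
import numpy as np

def combined_dissimilarity(D_struct, D_feat, w):
    """Convex combination of a structural (graph-based) and a
    feature-based dissimilarity matrix; w = 1 and w = 0 recover
    the two extremes being compared."""
    return w * D_struct + (1.0 - w) * D_feat

# toy 2x2 dissimilarity matrices for the same pair of objects
D_s = np.array([[0.0, 2.0], [2.0, 0.0]])
D_f = np.array([[0.0, 4.0], [4.0, 0.0]])
D_half = combined_dissimilarity(D_s, D_f, 0.5)  # entrywise average
```

In practice the weight w would be tuned, e.g. by cross-validation, before building the dissimilarity representation from the combined matrix.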
Pattern Recognition | 2017
Marc-André Carbonneau; Veronika Cheplygina; Eric Granger; Ghyslain Gagnon
Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows weakly labeled data to be leveraged. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research. Code is available on-line at https://github.com/macarbonneau/MILSurvey.