
Publication


Featured research published by Nathalie Japkowicz.


Computational Intelligence | 2004

A Multiple Resampling Method for Learning from Imbalanced Data Sets

Andrew Estabrooks; Taeho Jo; Nathalie Japkowicz

Resampling methods are commonly used for dealing with the class‐imbalance problem. Their advantage over other methods is that they are external and thus, easily transportable. Although such approaches can be very simple to implement, tuning them most effectively is not an easy task. In particular, it is unclear whether oversampling is more effective than undersampling and which oversampling or undersampling rate should be used. This paper presents an experimental study of these questions and concludes that combining different expressions of the resampling approach is an effective solution to the tuning problem. The proposed combination scheme is evaluated on imbalanced subsets of the Reuters‐21578 text collection and is shown to be quite effective for these problems.
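The combination idea lends itself to a short illustration. The following Python sketch is a minimal approximation, not the paper's exact scheme: it trains one classifier per resampling rate, for both oversampling and undersampling, and combines the resulting experts by majority vote. The function names, rates, and the choice of decision trees as base learners are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def resample(X, y, rate, oversample, rng):
    """Randomly resample so minority/majority ratio equals `rate` (labels 0/1, minority = 1)."""
    min_idx = np.flatnonzero(y == 1)
    maj_idx = np.flatnonzero(y == 0)
    if oversample:   # duplicate minority examples up to rate * |majority|
        min_idx = rng.choice(min_idx, size=int(rate * len(maj_idx)), replace=True)
    else:            # drop majority examples down to |minority| / rate
        size = min(len(maj_idx), int(len(min_idx) / rate))
        maj_idx = rng.choice(maj_idx, size=size, replace=False)
    idx = np.concatenate([min_idx, maj_idx])
    return X[idx], y[idx]

def fit_combined(X, y, rates=(0.25, 0.5, 0.75, 1.0), seed=0):
    rng = np.random.default_rng(seed)
    return [DecisionTreeClassifier(random_state=seed).fit(*resample(X, y, r, over, rng))
            for r in rates for over in (True, False)]

def predict_combined(experts, X):
    votes = np.mean([e.predict(X) for e in experts], axis=0)  # majority vote
    return (votes >= 0.5).astype(int)
```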


SIGKDD Explorations | 2004

Class imbalances versus small disjuncts

Taeho Jo; Nathalie Japkowicz

It is often assumed that class imbalances are responsible for significant losses of performance in standard classifiers. The purpose of this paper is to question whether class imbalances are truly responsible for this degradation or whether it can be explained in some other way. Our experiments suggest that the problem is not directly caused by class imbalances, but rather that class imbalances may yield small disjuncts which, in turn, cause degradation. We argue that, in order to improve classifier performance, it may then be more useful to focus on the small disjuncts problem than on the class imbalance problem. We experiment with a method that takes the small disjunct problem into consideration and show that, indeed, it yields performance superior to that obtained using standard or advanced solutions to the class imbalance problem.
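As a rough illustration of the diagnosis (not the paper's exact protocol), the sketch below treats the leaves of an unpruned decision tree as disjuncts, calls leaves covering few training examples "small", and compares their test error with that of large leaves. The size threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def disjunct_error_rates(X_train, y_train, X_test, y_test, small=5):
    tree = DecisionTreeClassifier().fit(X_train, y_train)    # unpruned by default
    leaf_sizes = np.bincount(tree.apply(X_train))            # training coverage per leaf
    errors = tree.predict(X_test) != y_test
    in_small = leaf_sizes[tree.apply(X_test)] <= small       # test points in small disjuncts
    return errors[in_small].mean(), errors[~in_small].mean()
```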


Australian Joint Conference on Artificial Intelligence | 2006

Beyond accuracy, f-score and ROC: a family of discriminant measures for performance evaluation

Marina Sokolova; Nathalie Japkowicz; Stan Szpakowicz

Different evaluation measures assess different characteristics of machine learning algorithms. The empirical evaluation of algorithms and classifiers is a matter of ongoing debate among researchers. Most measures in use today focus on a classifier's ability to identify classes correctly. We note other useful properties, such as failure avoidance or class discrimination, and we suggest measures to evaluate such properties. These measures (Youden's index, likelihood, discriminant power) are used in medical diagnosis. We show that they are interrelated, and we apply them to a case study from the field of electronic negotiations. We also list other learning problems which may benefit from the application of these measures.
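As commonly defined from sensitivity and specificity (the exact formulations used in the paper may differ slightly), the measures can be computed from a binary confusion matrix as follows; the function name is illustrative.

```python
import math

def discriminant_measures(tp, fn, fp, tn):
    """Assumes 0 < sensitivity, specificity < 1 so every ratio is defined."""
    sens = tp / (tp + fn)                    # true positive rate
    spec = tn / (tn + fp)                    # true negative rate
    youden = sens + spec - 1                 # Youden's index
    lr_pos = sens / (1 - spec)               # positive likelihood ratio
    lr_neg = (1 - sens) / spec               # negative likelihood ratio
    dp = (math.sqrt(3) / math.pi) * (        # discriminant power
        math.log(sens / (1 - sens)) + math.log(spec / (1 - spec)))
    return youden, lr_pos, lr_neg, dp
```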


European Conference on Machine Learning | 2004

Applying support vector machines to imbalanced datasets

Rehan Akbani; Stephen Kwek; Nathalie Japkowicz

Support Vector Machines (SVM) have been extensively studied and have shown remarkable success in many applications. However, the success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances (e.g. in gene profiling and detecting credit card fraud). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVM. We then propose an algorithm for overcoming these problems which is based on a variant of the SMOTE algorithm by Chawla et al., combined with Veropoulos et al.'s different error costs algorithm. We compare the performance of our algorithm against these two algorithms, along with undersampling and regular SVM, and show that our algorithm outperforms all of them.
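A hedged sketch of the two ingredients using standard library stand-ins rather than the authors' own variants: SMOTE from the imbalanced-learn package for synthetic minority oversampling, and per-class error costs via scikit-learn's SVC class_weight parameter. The cost ratio is an illustrative value.

```python
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC

def fit_smote_cost_svm(X, y, minority=1, cost_ratio=10.0):
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)        # synthetic minority examples
    clf = SVC(kernel="rbf", class_weight={minority: cost_ratio})   # costlier minority-side errors
    return clf.fit(X_res, y_res)
```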


Machine Learning | 2001

Supervised versus unsupervised binary-learning by feedforward neural networks

Nathalie Japkowicz

Binary classification is typically achieved by supervised learning methods. Nevertheless, it is also possible using unsupervised schemes. This paper describes a connectionist unsupervised approach to binary classification and compares its performance to that of its supervised counterpart. The approach consists of training an autoassociator to reconstruct the positive class of a domain at the output layer. After training, the autoassociator is used for classification, relying on the idea that if the network generalizes to a novel instance, then this instance must be positive, but that if generalization fails, then the instance must be negative. When tested on three real-world domains, the autoassociator proved more accurate at classification than its supervised counterpart, MLP, on two of these domains and as accurate on the third (Japkowicz, Myers, & Gluck, 1995). The paper seeks to generalize these results and concludes that, in addition to learning a concept in the absence of negative examples, 1) autoassociation is more efficient than MLP in multi-modal domains, and 2) it is more accurate than MLP in multi-modal domains for which the negative class creates a particularly strong need for specialization or the positive class creates a particularly weak need for specialization. In multi-modal domains for which the positive class creates a particularly strong need for specialization, on the other hand, MLP is more accurate than autoassociation.
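The classification rule described above is easy to sketch. The version below is a minimal approximation, assuming scikit-learn's MLPRegressor as the autoassociator and a reconstruction-error threshold taken from a quantile of the training errors; the paper's architecture and threshold choice may differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_autoassociator(X_pos, hidden=8, quantile=0.95, seed=0):
    # Train the network to reproduce positive examples at its output layer.
    net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000,
                       random_state=seed).fit(X_pos, X_pos)
    errors = np.mean((net.predict(X_pos) - X_pos) ** 2, axis=1)
    return net, np.quantile(errors, quantile)    # threshold = generalization boundary

def classify(net, threshold, X):
    # Low reconstruction error means the network generalizes: label positive.
    errors = np.mean((net.predict(X) - X) ** 2, axis=1)
    return (errors <= threshold).astype(int)
```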


Neural Computation | 2000

Nonlinear Autoassociation Is Not Equivalent to PCA

Nathalie Japkowicz; Stephen José Hanson; Mark A. Gluck

A common misperception within the neural network community is that even with nonlinearities in their hidden layer, autoassociators trained with backpropagation are equivalent to linear methods such as principal component analysis (PCA). Our purpose is to demonstrate that nonlinear autoassociators actually behave differently from linear methods and that they can outperform these methods when used for latent extraction, projection, and classification. While linear autoassociators emulate PCA, and thus exhibit a flat or unimodal reconstruction error surface, autoassociators with nonlinearities in their hidden layer learn domains by building error reconstruction surfaces that, depending on the task, contain multiple local valleys. This interpolation bias allows nonlinear autoassociators to represent appropriate classifications of nonlinear multimodal domains, in contrast to linear autoassociators, which are inappropriate for such tasks. In fact, autoassociators with hidden unit nonlinearities can be shown to perform nonlinear classification and nonlinear recognition.
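A toy illustration of the claim, with all data and sizes assumed: on a multimodal domain that no single line fits, one-component PCA is compared with a nonlinear autoassociator that has the same bottleneck width. With enough training, the nonlinear network can reach the lower reconstruction error, since its reconstruction manifold is free to bend through all the modes.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Three clusters that no single line passes through: a simple multimodal domain.
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + rng.normal(0, 0.2, (100, 2)) for c in centers])

pca = PCA(n_components=1).fit(X)                     # linear: projects onto one line
pca_err = np.mean((pca.inverse_transform(pca.transform(X)) - X) ** 2)

ae = MLPRegressor(hidden_layer_sizes=(16, 1, 16),    # one-unit nonlinear bottleneck
                  activation="tanh", max_iter=5000, random_state=0).fit(X, X)
ae_err = np.mean((ae.predict(X) - X) ** 2)

print(f"PCA reconstruction MSE: {pca_err:.3f}, autoassociator: {ae_err:.3f}")
```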


International Conference on Data Mining | 2006

A Feature Selection and Evaluation Scheme for Computer Virus Detection

Olivier Henchiri; Nathalie Japkowicz

Anti-virus systems traditionally use signatures to detect malicious executables, but signatures are over-fitted features that are of little use in machine learning. Other more heuristic methods seek to utilize more general features, with some degree of success. In this paper, we present a data mining approach that conducts an exhaustive feature search on a set of computer viruses and strives to obviate over-fitting. We also evaluate the predictive power of a classifier by taking into account dependence relationships that exist between viruses, and we show that our classifier yields high detection rates and can be expected to perform as well in real-world conditions.
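The general shape of such a pipeline, byte n-gram features followed by a feature-selection pass and a classifier, can be sketched as follows. This is a hedged approximation: the paper's feature search is more elaborate, and the hex strings, n-gram length, and selection size here are all illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical hex-encoded byte strings of executables; 1 = virus, 0 = benign.
docs = ["4d5a9000 0300 ebfe", "4d5a5000 ebfe 0300",
        "7f454c46 0200 0101", "7f454c46 0101 0200"]
labels = [1, 1, 0, 0]

pipeline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(4, 4)),  # fixed-length byte n-grams
    SelectKBest(chi2, k=5),                                # keep the most informative features
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(docs, labels)
```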


Canadian Conference on Artificial Intelligence | 2001

Concept-Learning in the Presence of Between-Class and Within-Class Imbalances

Nathalie Japkowicz

In a concept learning problem, imbalances in the distribution of the data can occur either between the two classes or within a single class. Yet, although both types of imbalances are known to negatively affect the performance of standard classifiers, methods for dealing with the class imbalance problem usually focus on rectifying the between-class imbalance, neglecting the imbalance occurring within each class. The purpose of this paper is to extend random re-sampling, the simplest proposed approach for dealing with the between-class imbalance problem, so that it deals with the two problems simultaneously. Although re-sampling is not necessarily the best way to deal with problems of imbalance, the results reported in this paper suggest that addressing both problems simultaneously is beneficial and should be done by more sophisticated techniques as well.
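One way to make the idea concrete, with the caveat that the within-class groups are recovered here by k-means as an assumption rather than taken from the paper: oversample every cluster of every class to the size of the largest cluster, balancing the between-class and within-class distributions at the same time.

```python
import numpy as np
from sklearn.cluster import KMeans

def balance_between_and_within(X, y, clusters_per_class=3, seed=0):
    rng = np.random.default_rng(seed)
    parts = []
    for label in np.unique(y):
        Xc = X[y == label]
        ids = KMeans(n_clusters=clusters_per_class, n_init=10,
                     random_state=seed).fit_predict(Xc)      # assumed subconcepts
        parts += [(label, Xc[ids == k]) for k in range(clusters_per_class)]
    target = max(len(g) for _, g in parts)                   # largest cluster sets the target
    X_out, y_out = [], []
    for label, g in parts:
        idx = rng.choice(len(g), size=target, replace=True)  # oversample every cluster
        X_out.append(g[idx])
        y_out.append(np.full(target, label))
    return np.vstack(X_out), np.concatenate(y_out)
```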


International Symposium on Methodologies for Intelligent Systems | 2008

Boosting support vector machines for imbalanced data sets

Benjamin X. Wang; Nathalie Japkowicz

Real-world data mining applications must address the issue of learning from imbalanced data sets. The problem occurs when the number of instances in one class greatly outnumbers the number of instances in the other class. Such data sets often cause a default classifier to be built due to skewed vector spaces or lack of information. Common approaches for dealing with the class imbalance problem involve modifying the data distribution or modifying the classifier. In this work, we choose to use a combination of both approaches. We use support vector machines with soft margins as the base classifier to solve the skewed vector spaces problem. Then we use a boosting algorithm to get an ensemble classifier that has lower error than a single classifier. We found that this ensemble of SVMs makes an impressive improvement in prediction performance, not only for the majority class, but also for the minority class.
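A hedged approximation using scikit-learn components (the paper's boosting scheme may differ in its reweighting details): AdaBoost over soft-margin SVM base classifiers, which support the per-sample weights boosting requires.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

boosted_svm = AdaBoostClassifier(
    estimator=SVC(kernel="rbf", C=1.0),  # soft margin for the skewed vector space
    algorithm="SAMME",                   # discrete boosting; SVC gives labels, not probabilities
    n_estimators=10,
)
# boosted_svm.fit(X_train, y_train); boosted_svm.predict(X_test)
```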


Knowledge and Information Systems | 2006

Node similarity in the citation graph

Wangzhong Lu; Jeannette C. M. Janssen; Evangelos E. Milios; Nathalie Japkowicz; Yongzheng Zhang

Published scientific articles are linked together into a graph, the citation graph, through their citations. This paper explores the notion of similarity based on connectivity alone, and proposes several algorithms to quantify it. Our metrics take advantage of the local neighborhoods of the nodes in the citation graph. Two variants of link-based similarity estimation between two nodes are described, one based on the separate local neighborhoods of the nodes, and another based on the joint local neighborhood expanded from both nodes at the same time. The algorithms are implemented and evaluated on a subgraph of the citation graph of computer science in a retrieval context. The results are compared with text-based similarity, and demonstrate the complementarity of link-based and text-based retrieval.
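The first variant, similarity from the separate local neighborhoods of two nodes, can be approximated with a Jaccard coefficient over each paper's combined in/out neighborhood; the paper's actual metrics are more refined, and the networkx usage and toy graph below are illustrative.

```python
import networkx as nx

def neighborhood_similarity(g, a, b):
    # A paper's local neighborhood: the papers it cites plus the papers citing it.
    na = set(g.successors(a)) | set(g.predecessors(a))
    nb = set(g.successors(b)) | set(g.predecessors(b))
    return len(na & nb) / len(na | nb) if na | nb else 0.0

g = nx.DiGraph([("p1", "p3"), ("p2", "p3"), ("p1", "p4"), ("p2", "p4")])
print(neighborhood_similarity(g, "p1", "p2"))  # 1.0: identical citation neighborhoods
```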

Collaboration


Dive into Nathalie Japkowicz's collaboration.

Top Co-Authors

Xuan Liu
University of Ottawa

Chris Drummond
National Research Council