Sarah Jane Delany | Researchain

Publication

Featured research published by Sarah Jane Delany.

Knowledge Based Systems | 2005

A case-based technique for tracking concept drift in spam filtering

Sarah Jane Delany; Pádraig Cunningham; Alexey Tsymbal; Lorcan Coyle

Spam filtering is a particularly challenging machine learning task as the data distribution and concept being learned change over time. It exhibits a particularly awkward form of concept drift as the change is driven by spammers wishing to circumvent spam filters. In this paper we show that lazy learning techniques are appropriate for such dynamically changing contexts. We present a case-based system for spam filtering that can learn dynamically. We evaluate its performance as the case-base is updated with new cases. We also explore the benefit of periodically redoing the feature selection process to bring new features into play. Our evaluation shows that these two levels of model update are effective in tracking concept drift.
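The first level of update described in the abstract above, adding new cases to a lazy learner, can be illustrated with a minimal sketch. This is not the authors' system: the token-set representation, the Jaccard similarity, and the `CaseBase` class are assumptions chosen to keep the example short, and periodic feature reselection is not shown.

```python
from collections import Counter


def tokens(text):
    """Represent a message as a set of lowercase tokens (an assumed representation)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CaseBase:
    """A k-nearest-neighbour case base that can be extended at any time."""

    def __init__(self, k=1):
        self.k = k
        self.cases = []  # list of (token_set, label) pairs

    def add(self, text, label):
        # Lazy learning: a new case is usable as soon as it is stored.
        self.cases.append((tokens(text), label))

    def classify(self, text):
        query = tokens(text)
        nearest = sorted(
            self.cases, key=lambda case: jaccard(query, case[0]), reverse=True
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


cb = CaseBase(k=1)
cb.add("meeting at noon tomorrow", "ham")
cb.add("win free prize now", "spam")
first = cb.classify("free prize inside")

# A new kind of spam appears; storing one labelled case updates the filter.
cb.add("cheap meds online", "spam")
second = cb.classify("cheap meds today")
```

Because no model is built ahead of time, the update cost is just the cost of storing the case, which is what makes the approach attractive when the concept keeps moving.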

Expert Systems With Applications | 2012

SMS spam filtering

Sarah Jane Delany; Mark Buckley; Derek Greene

Highlights:
- We motivate the need for content-based SMS spam filtering.
- We discuss similarities/differences between email and SMS spam filtering.
- We review recent research in SMS spam filtering.
- We analyse recent SMS spam messages and make a dataset available.
- Early days, no consensus yet on best techniques but significant challenges exist.

Mobile or SMS spam is a real and growing problem primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.

Artificial Intelligence Review | 2005

An Assessment of Case-Based Reasoning for Spam Filtering

Sarah Jane Delany; Pádraig Cunningham; Lorcan Coyle

Because of the changing nature of spam, a spam filtering system that uses machine learning will need to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to machine learning where induction is delayed to run time. This means that the case base can be updated continuously and new training data is immediately available to the induction process. In this paper we present a detailed description of such a system called ECUE and evaluate design decisions concerning the case representation. We compare its performance with an alternative system that uses Naïve Bayes. We find that there is little to choose between the two alternatives in cross-validation tests on data sets. However, ECUE does appear to have some advantages in tracking concept drift over time.

Lecture Notes in Computer Science | 2004

An Analysis of Case-Base Editing in a Spam Filtering System

Sarah Jane Delany; Pádraig Cunningham

Because of the volume of spam email and its evolving nature, any deployed Machine Learning-based spam filtering system will need to have procedures for case-base maintenance. Key to this will be procedures to edit the case-base to remove noise and eliminate redundancy. In this paper we present a two-stage process to do this. We present a new noise reduction algorithm called Blame-Based Noise Reduction that removes cases that are observed to cause misclassification. We also present an algorithm called Conservative Redundancy Reduction that is much less aggressive than the state-of-the-art alternatives and has significantly better generalisation performance in this domain. These new techniques are evaluated against the alternatives in the literature on four datasets of 1000 emails each (50% spam and 50% non-spam).

The Florida AI Research Society | 2010

Handling Concept Drift in a Text Data Stream Constrained by High Labelling Cost

Patrick Lindstrom; Sarah Jane Delany; Brian Mac Namee

In many real-world classification problems the concept being modelled is not static but rather changes over time - a situation known as concept drift. Most techniques for handling concept drift rely on the true classifications of test instances being available shortly after classification so that classifiers can be retrained to handle the drift. However, in applications where labelling instances with their true class has a high cost this is not reasonable. In this paper we present an approach for keeping a classifier up-to-date in a concept drift domain which is constrained by a high cost of labelling. We use an active learning type approach to select those examples for labelling that are most useful in handling changes in concept. We show how this approach can adequately handle concept drift in a text filtering scenario requiring just 15% of the documents to be manually categorised and labelled.
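The selection step in the abstract above can be illustrated with uncertainty sampling, one common active learning strategy. The abstract does not say which selection criterion the authors use, so the criterion here (distance of a predicted probability from 0.5) and the function name `select_for_labelling` are assumptions.

```python
def select_for_labelling(positive_probs, budget=0.15):
    """Return the indices of the predictions closest to 0.5.

    positive_probs: predicted probability of the positive class per document.
    budget: fraction of the batch that can be sent for manual labelling.
    """
    if not positive_probs:
        return []
    n_select = max(1, int(len(positive_probs) * budget))
    # Rank documents from most to least uncertain.
    by_uncertainty = sorted(
        range(len(positive_probs)), key=lambda i: abs(positive_probs[i] - 0.5)
    )
    return sorted(by_uncertainty[:n_select])


batch = [0.99, 0.52, 0.10, 0.45, 0.80, 0.03, 0.97, 0.60, 0.01, 0.95]
to_label = select_for_labelling(batch, budget=0.2)
```

Only the selected documents would be passed to a human annotator; the rest of the batch is classified without ever being labelled.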

Intelligent Information Systems | 2010

Noise reduction for instance-based learning with a local maximal margin approach

Nicola Segata; Enrico Blanzieri; Sarah Jane Delany; Pádraig Cunningham

To some extent the problem of noise reduction in machine learning has been finessed by the development of learning techniques that are noise-tolerant. However, it is difficult to make instance-based learning noise-tolerant and noise reduction still plays an important role in k-nearest neighbour classification. There are also other motivations for noise reduction, for instance the elimination of noise may result in simpler models or data cleansing may be an end in itself. In this paper we present a novel approach to noise reduction based on local Support Vector Machines (LSVM) which brings the benefits of maximal margin classifiers to bear on noise reduction. This provides a more robust alternative to the majority rule on which almost all the existing noise reduction techniques are based. Roughly speaking, for each training example an SVM is trained on its neighbourhood and if the SVM classification for the central example disagrees with its actual class there is evidence in favour of removing it from the training set. We provide an empirical evaluation on 15 real datasets showing improved classification accuracy when using training data edited with our method as well as specific experiments regarding the spam filtering application domain. We present a further evaluation on two artificial datasets where we analyse two different types of noise (Gaussian feature noise and mislabelling noise) and the influence of different class densities. The conclusion is that LSVM noise reduction is significantly better than the other analysed algorithms for real datasets and for artificial datasets perturbed by Gaussian noise and in the presence of uneven class densities.
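For contrast, the majority rule that the abstract above positions LSVM against can be sketched in a few lines. This is the baseline, not the LSVM method: an example is dropped when its label disagrees with the majority label of its k nearest neighbours. The function name, Euclidean distance, and the toy data are assumptions. LSVM would replace the vote with the prediction of an SVM trained on the same neighbourhood.

```python
import math
from collections import Counter


def majority_rule_edit(points, labels, k=3):
    """Return the indices of examples whose label agrees with their neighbourhood."""
    keep = []
    for i, p in enumerate(points):
        # Neighbourhood: the k closest other examples by Euclidean distance.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return keep


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (0.5, 0.5)]  # the last point sits inside cluster "a"
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]  # but is labelled "b"
kept = majority_rule_edit(points, labels, k=3)
```

An odd k with two classes avoids tied votes, which keeps the rule deterministic in this toy setting.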

Evolving Systems | 2013

Drift detection using uncertainty distribution divergence

Patrick Lindstrom; Brian Mac Namee; Sarah Jane Delany

Data generated from naturally occurring processes tends to be non-stationary. Examples include seasonal and gradual changes in climate data and sudden changes in financial data. In machine learning the degradation in classifier performance due to such changes in the data is known as concept drift and there are many approaches to detecting and handling it. Most approaches to detecting concept drift, however, make the assumption that true classes for test examples will be available at no cost shortly after classification and base the detection of concept drift on measures relying on these labels. The high labelling cost in many domains provides a strong motivation to reduce the number of labelled instances required to detect and handle concept drift. Triggered detection approaches that do not require labelled instances to detect concept drift show great promise for achieving this. In this paper we present Confidence Distribution Batch Detection, an approach that provides a signal correlated to changes in concept without using labelled data. This signal combined with a trigger and a rebuild policy can maintain classifier accuracy which, in most cases, matches the accuracy achieved using classification error based detection techniques but using only a limited amount of labelled data.
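One way to picture a label-free drift signal of the kind described above is to compare the distribution of classifier confidences in a reference batch with that in the current batch. The sketch below is an illustration only: the abstract does not specify the divergence measure or binning, so the Hellinger distance, the five equal-width bins, and the function names are all assumptions.

```python
import math


def histogram(confidences, bins=5):
    """Normalised histogram of confidences in [0, 1]; the batch must be non-empty."""
    counts = [0] * bins
    for c in confidences:
        counts[min(int(c * bins), bins - 1)] += 1
    return [n / len(confidences) for n in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def drift_signal(reference, current, bins=5):
    """Divergence between two batches of classifier confidences; no labels needed."""
    return hellinger(histogram(reference, bins), histogram(current, bins))


reference = [0.95, 0.90, 0.97, 0.92]  # classifier was confident on earlier data
current = [0.55, 0.60, 0.52, 0.58]    # confidence has collapsed on the new batch
signal = drift_signal(reference, current)
```

A trigger would then compare the signal against a threshold and, when it fires, request a limited number of labels to rebuild the classifier.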

AICS'09: Proceedings of the 20th Irish Conference on Artificial Intelligence and Cognitive Science | 2009

Learning without default: a study of one-class classification and the low-default portfolio problem

Kenneth Kennedy; Brian Mac Namee; Sarah Jane Delany

This paper asks at what level of class imbalance one-class classifiers outperform two-class classifiers in credit scoring problems in which class imbalance, referred to as the low-default portfolio problem, is a serious issue. The question is answered by comparing the performance of a variety of one-class and two-class classifiers on a selection of credit scoring datasets as the class imbalance is manipulated. We also include random oversampling as this is one of the most common approaches to addressing class imbalance. This study analyses the suitability and performance of recognised two-class classifiers and one-class classifiers. Based on our study we conclude that the performance of the two-class classifiers deteriorates proportionally to the level of class imbalance. The two-class classifiers outperform one-class classifiers with class imbalance levels down as far as 15% (i.e. the imbalance ratio of minority class to majority class is 15:85). The one-class classifiers, whose performance remains unvaried throughout, are preferred when the minority class constitutes approximately 2% or less of the data. Between imbalance levels of 2% and 15% the results are less conclusive. These results show that one-class classifiers could potentially be used as a solution to the low-default portfolio problem experienced in the credit scoring domain.
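Random oversampling, mentioned in the abstract above, duplicates randomly chosen minority-class examples until the classes are balanced. A minimal sketch follows; the function name, the fixed seed, and the 15:85 toy dataset are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen examples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels


rows = list(range(100))                     # stand-ins for applicant records
labels = ["default"] * 15 + ["good"] * 85   # a 15:85 imbalance
bal_rows, bal_labels = random_oversample(rows, labels)
```

Oversampling should be applied to the training split only, otherwise duplicated examples leak into the evaluation data.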

Expert Systems With Applications | 2014

Dynamic estimation of worker reliability in crowdsourcing for regression tasks: Making it work

Alexey Tarasov; Sarah Jane Delany; Brian Mac Namee

One of the biggest challenges in crowdsourcing is detecting noisy and incompetent workers. A possible way of handling this problem is to dynamically estimate the reliability of workers as they do work and accept only those workers who are deemed to be reliable to date. Although many approaches to dynamic estimation of rater reliability exist, they are often only appropriate for very specific categories of tasks, for example, only for binary classification. They also can make unrealistic assumptions such as requiring access to a large number of gold standard answers or relying on the constant availability of any rater. In this paper, we propose a novel approach to the dynamic estimation of rater reliability in regression (DER³) using multi-armed bandits. This approach is specifically suited for real-life crowdsourcing scenarios, where the task at hand is labelling or rating corpora to be used in supervised machine learning, and the annotations are continuous ratings, although it can be easily generalised to multi-class or binary classification tasks. We demonstrate that DER³ provides high-accuracy results and at the same time keeps the cost of the rating process low. Although our main motivating example is the recognition of emotion in speech, our approach shows similar results in other application areas.
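The multi-armed bandit framing above treats each rater as an arm whose payoff is that rater's reliability. The sketch below shows a generic epsilon-greedy bandit, not the authors' algorithm: the abstract does not state which bandit strategy or reward definition is used, so the `RaterBandit` class, the epsilon value, and the reward scale (0 to 1, higher meaning closer agreement with a consensus rating) are assumptions.

```python
import random


class RaterBandit:
    """Epsilon-greedy selection of raters based on running mean reward."""

    def __init__(self, rater_ids, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {r: 0 for r in rater_ids}
        self.mean_reward = {r: 0.0 for r in rater_ids}

    def choose(self):
        # Explore a random rater with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=lambda r: self.mean_reward[r])

    def update(self, rater, reward):
        # Incremental update of the running mean reward for this rater.
        self.pulls[rater] += 1
        n = self.pulls[rater]
        self.mean_reward[rater] += (reward - self.mean_reward[rater]) / n


bandit = RaterBandit(["a", "b"], epsilon=0.0)
bandit.update("a", 0.2)  # rater "a" disagreed strongly with the consensus
bandit.update("b", 0.9)
bandit.update("b", 0.7)
best = bandit.choose()
```

Setting epsilon above zero keeps occasionally asking raters who currently look unreliable, so their estimates can recover if their early ratings were unrepresentative.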

Archive | 2010

Using Crowdsourcing for Labelling Emotional Speech Assets

Alexey Tarasov; Sarah Jane Delany; Charlie Cullen

The success of supervised learning approaches for the classification of emotion in speech depends highly on the quality of the training data. The manual annotation of emotion speech assets is the primary way of gathering training data for emotional speech recognition. This position paper proposes the use of crowdsourcing for the rating of emotion speech assets. Recent developments in learning from crowdsourcing offer opportunities to determine accurate ratings for assets which have been annotated by large numbers of non-expert individuals. The challenges involved include identifying good annotators, determining consensus ratings and learning the bias of annotators.