
Publication


Featured research published by Georg Krempl.


SIGKDD Explorations | 2014

Open challenges for data stream mining research

Georg Krempl; Indre Žliobaite; Dariusz Brzezinski; Eyke Hüllermeier; Vincent Lemaire; Tino Noack; Ammar Shaker; Sonja Sievi; Myra Spiliopoulou; Jerzy Stefanowski

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.


Intelligent Data Analysis | 2011

The algorithm APT to classify in concurrence of latency and drift

Georg Krempl

Population drift is a challenging problem in classification, and denotes changes in probability distributions over time. Known drift-adaptive classification methods such as incremental learning rely on current, labelled data for classification model updates, assuming that such labelled data are available without verification latency. However, verification latency is a relevant problem in some application domains, where predictions have to be made far into the future. This concurrence of drift and latency requires new approaches in machine learning. We propose a two-stage learning strategy: First, the nature of drift in temporal data needs to be identified. This requires the formulation of explicit drift models for the underlying data generating process. In a second step, these models are used to substitute scarce labelled data for updating classification models. This paper contributes an explicit drift model, which characterises a mixture of independently evolving sub-populations. In this model, the joint distribution is a mixture of arbitrarily distributed sub-populations drifting over time. An arbitrary sub-population tracker (APT) algorithm is presented, which can track and predict the distributions using unlabelled data. Experimental evaluation shows that the presented APT algorithm is capable of tracking and predicting changes in the posterior distribution of class labels accurately.
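
As a rough illustration of the tracking idea described above, the following sketch follows class-labelled sub-populations through purely unlabelled batches. It assumes one Gaussian component per class with diagonal covariance and uses plain EM-style updates; it is a simplified sketch of the general idea, not the published APT procedure, and all names in it are invented for illustration.

# Minimal sketch (not the published APT algorithm): track class-labelled
# Gaussian sub-populations over time using only unlabelled batches.
# Assumptions: one component per class, diagonal covariance, EM-style updates.
import numpy as np

def track_subpopulations(labelled_X, labelled_y, unlabelled_batches, n_iter=10):
    # Initialise one Gaussian component per class from the last labelled batch.
    classes = np.unique(labelled_y)
    means = np.array([labelled_X[labelled_y == c].mean(axis=0) for c in classes])
    var = np.array([labelled_X[labelled_y == c].var(axis=0) + 1e-6 for c in classes])
    priors = np.array([np.mean(labelled_y == c) for c in classes])

    for X in unlabelled_batches:           # batches arrive over time, without labels
        for _ in range(n_iter):
            # E-step: responsibility of each (still class-labelled) component.
            log_p = -0.5 * (((X[:, None, :] - means) ** 2) / var
                            + np.log(2 * np.pi * var)).sum(axis=2)
            log_p += np.log(priors)
            log_p -= log_p.max(axis=1, keepdims=True)
            resp = np.exp(log_p)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: let each component drift towards its new unlabelled members.
            nk = resp.sum(axis=0) + 1e-12
            means = (resp.T @ X) / nk[:, None]
            var = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
            priors = nk / nk.sum()
        yield classes, means, var, priors   # drifted model usable for classification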


Machine Learning | 2015

Optimised probabilistic active learning (OPAL)

Georg Krempl; Daniel Kottke; Vincent Lemaire

In contrast to ever increasing volumes of automatically generated data, human annotation capacities remain limited. Thus, fast active learning approaches that allow the efficient allocation of annotation efforts gain in importance. Furthermore, cost-sensitive applications such as fraud detection pose the additional challenge of differing misclassification costs between classes. Unfortunately, the few existing cost-sensitive active learning approaches rely on time-consuming steps, such as performing self-labelling or tedious evaluations over samples. We propose a fast, non-myopic, and cost-sensitive probabilistic active learning approach for binary classification. Our approach computes the expected reduction in misclassification loss in a labelling candidate’s neighbourhood. We derive and use a closed-form solution for this expectation, which considers the possible values of the true posterior of the positive class at the candidate’s position, its possible label realisations, and the given labelling budget. The resulting myopic algorithm runs in the same linear asymptotic time as uncertainty sampling, while its non-myopic counterpart requires an additional factor of O(m · log m) in the budget size. The experimental evaluation on several synthetic and real-world data sets shows competitive or better classification performance and runtime, compared to several uncertainty sampling- and error-reduction-based active learning strategies, both in cost-sensitive and cost-insensitive settings.
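
To make the expectation concrete, the following sketch computes a myopic, cost-insensitive probabilistic gain for a single labelling candidate by numerical integration over the candidate's unknown true posterior. The paper instead derives a closed-form, cost-sensitive, non-myopic solution, so this is only an illustration of the underlying expectation; the function names and the grid-based integration are assumptions made here.

# Hedged illustration of the probabilistic-gain idea (myopic, cost-insensitive,
# numerical integration instead of the paper's closed form; names are invented).
import numpy as np

def expected_accuracy(n_pos, n_neg, p):
    # Accuracy at this point if we predict the majority of the observed labels;
    # ties give 0.5. p is the (unknown) true posterior P(+ | x) at the candidate.
    if n_pos > n_neg:
        return p
    if n_pos < n_neg:
        return 1.0 - p
    return 0.5

def probabilistic_gain(n_pos, n_neg, grid=2001):
    # Average over the candidate's unknown true posterior p, which given the
    # observed labels is Beta(n_pos + 1, n_neg + 1), and over the label y ~ Ber(p)
    # we would obtain if we paid for this annotation.
    p = np.linspace(0.0, 1.0, grid)
    w = p ** n_pos * (1 - p) ** n_neg           # unnormalised Beta density
    w /= w.sum()
    acc_now = np.array([expected_accuracy(n_pos, n_neg, pi) for pi in p])
    acc_pos = np.array([expected_accuracy(n_pos + 1, n_neg, pi) for pi in p])
    acc_neg = np.array([expected_accuracy(n_pos, n_neg + 1, pi) for pi in p])
    gain = p * (acc_pos - acc_now) + (1 - p) * (acc_neg - acc_now)
    return float((w * gain).sum())

# Candidates with few or conflicting nearby labels promise the largest gain:
print(probabilistic_gain(0, 0), probabilistic_gain(1, 1), probabilistic_gain(4, 0))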


Computational Statistics & Data Analysis | 2013

Drift mining in data

Vera Hofer; Georg Krempl

A novel statistical methodology for analysing population drift in classification is introduced. Drift denotes changes in the joint distribution of explanatory variables and class labels over time. It entails the deterioration of a classifier's performance and requires the optimal decision boundary to be adapted after some time. However, in the presence of verification latency a re-estimation of the classification model is impossible, since in such a situation only recent unlabelled data are available, and the true corresponding labels only become known after some lapse in time. For this reason a novel drift mining methodology is presented which aims at detecting changes over time. It allows us either to understand evolution in the data from an ex-post perspective or, ex-ante, to anticipate changes in the joint distribution. The proposed drift mining technique assumes that the class priors change by a certain factor from one time point to the next, and that the conditional distributions do not change within this time period. Thus, the conditional distributions can be estimated at a time when recent labelled data are available. In subsequent periods the unconditional distribution can be expressed as a mixture of the conditional distributions, where the mixing proportions are equal to the class priors. However, as the unconditional distribution can also be estimated from new unlabelled data, it can then be compared to the mixture representation by means of a least-squares criterion. This allows for easy and fast estimation of the changes in class prior values in the presence of verification latency. The usefulness of this drift mining approach is demonstrated using a real-world dataset from the area of credit scoring.

Highlights: A novel statistical methodology for analysing population drift in classification is presented. Drift mining that aims at modelling changes in distributions over time is introduced. A model of global drift that addresses a change in class priors is introduced.
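
The prior-estimation step lends itself to a compact sketch: fit class-conditional densities on the last labelled period, then choose the class prior whose mixture best matches, in a least-squares sense, a density estimate of the new unlabelled period. Gaussian kernel density estimates and a random evaluation grid are simplifying assumptions made here, not necessarily the choices of the paper.

# Hedged sketch of the prior-drift idea: class-conditional densities are fitted
# on old labelled data, and the new class prior is obtained by a least-squares
# fit of their mixture to a density estimate of new, unlabelled data.
import numpy as np
from scipy.stats import gaussian_kde

def estimate_new_prior(X_old_pos, X_old_neg, X_new_unlabelled, grid_size=200):
    # 1. Class-conditional densities from the last fully labelled period.
    f_pos = gaussian_kde(X_old_pos.T)
    f_neg = gaussian_kde(X_old_neg.T)
    # 2. Unconditional density of the current, unlabelled period.
    f_new = gaussian_kde(X_new_unlabelled.T)
    # 3. Least-squares fit of pi*f_pos + (1 - pi)*f_neg to f_new on a grid.
    lo = X_new_unlabelled.min(axis=0)
    hi = X_new_unlabelled.max(axis=0)
    grid = np.random.uniform(lo, hi, size=(grid_size, X_new_unlabelled.shape[1]))
    a = f_pos(grid.T) - f_neg(grid.T)           # coefficient of pi
    b = f_new(grid.T) - f_neg(grid.T)           # target minus fixed part
    pi_hat = float(a @ b / (a @ a + 1e-12))     # 1-d least squares, then clip
    return float(np.clip(pi_hat, 0.0, 1.0))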


Intelligent Data Analysis | 2013

Correcting the Usage of the Hoeffding Inequality in Stream Mining

Pawel Matuszyk; Georg Krempl; Myra Spiliopoulou

Many stream classification algorithms use the Hoeffding Inequality to identify the best split attribute during tree induction. We show that the prerequisites of the inequality are violated by these algorithms, and we propose corrective steps. The new stream classification core, correctedVFDT, satisfies the prerequisites of the Hoeffding Inequality and thus provides the expected performance guarantees. The goal of our work is not to improve accuracy, but to guarantee a reliable and interpretable error bound. Nonetheless, we show that our solution achieves lower error rates regarding split attributes and earlier split decisions while maintaining a similar level of accuracy.
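
For context, the following sketch shows the kind of Hoeffding-bound split decision that Hoeffding-tree induction relies on and that the paper scrutinises: the tree splits once the observed advantage of the best attribute over the runner-up exceeds the bound epsilon. Treating information gain as bounded by log2 of the number of classes is an assumption made for this illustration, and the function names are invented.

# Hedged sketch of a Hoeffding-bound split decision (illustration only).
import math

def hoeffding_bound(value_range, delta, n):
    # With probability >= 1 - delta, the observed mean of n i.i.d. observations
    # bounded in an interval of width value_range deviates from the true mean
    # by less than this epsilon.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gain_best, gain_second, n_examples, n_classes=2, delta=1e-7):
    epsilon = hoeffding_bound(math.log2(n_classes), delta, n_examples)
    # Split as soon as the advantage of the best attribute exceeds epsilon.
    return (gain_best - gain_second) > epsilon

print(should_split(0.30, 0.20, n_examples=200))    # False: not enough evidence yet
print(should_split(0.30, 0.20, n_examples=5000))   # True: advantage exceeds epsilon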


International Conference on Data Mining | 2011

Classification in Presence of Drift and Latency

Georg Krempl; Vera Hofer

Changes in underlying distributions over time are a challenging problem in supervised learning. While this problem of drift is the subject of increasing research effort, some definitions required for a proper distinction between types of drift remain ambiguous. Furthermore, the approaches discussed in the literature so far require new, labelled data for incremental model updates. However, there are domains in which such data are scarce or only available with a considerable time lag, a so-called verification latency. These issues are addressed in this paper: First, the different notations used in the literature are related, and an overview of types of drift is given. Second, following the change mining paradigm, explicit models of drift are introduced. These drift models can be employed when actual, labelled data are scarce or not available at all, as they allow changes in distributions over time to be anticipated. Third, an exemplary drift-adaptive learning strategy that employs such a drift model is presented: Using an expectation-maximisation algorithm, a mixture of sub-populations is tracked. As a result, the classification model can be updated using solely new, unlabelled data.


European Conference on Machine Learning | 2011

Online clustering of high-dimensional trajectories under concept drift

Georg Krempl; Zaigham Faraz Siddiqui; Myra Spiliopoulou



Discovery Science | 2014

Probabilistic Active Learning: Towards Combining Versatility, Optimality and Efficiency

Georg Krempl; Daniel Kottke; Myra Spiliopoulou



Intelligent Data Analysis | 2015

Probabilistic Active Learning in Datastreams

Daniel Kottke; Georg Krempl; Myra Spiliopoulou



European Conference on Artificial Intelligence | 2014

Probabilistic active learning: a short proposition

Georg Krempl; Daniel Kottke; Myra Spiliopoulou


Collaboration


Dive into Georg Krempl's collaborations.

Top Co-Authors

Myra Spiliopoulou (Otto-von-Guericke University Magdeburg)
Daniel Kottke (Otto-von-Guericke University Magdeburg)
Zaigham Faraz Siddiqui (Otto-von-Guericke University Magdeburg)
Fernando Maestú (Complutense University of Madrid)
José M. Peña (Technical University of Madrid)
Nuria Paul (Complutense University of Madrid)