Publication


Featured research published by Martin Scholz.


knowledge discovery and data mining | 2006

YALE: rapid prototyping for complex data mining tasks

Ingo Mierswa; Michael Wurst; Ralf Klinkenberg; Martin Scholz; Timm Euler

KDD is a complex and demanding task. While a large number of methods have been established for numerous problems, many challenges remain to be solved. New tasks emerge that require the development of new methods or processing schemes. As in software development, building such solutions demands careful analysis, specification, implementation, and testing. Rapid prototyping is an approach that allows crucial design decisions to be made as early as possible. A rapid prototyping system should support maximal re-use and innovative combinations of existing methods, as well as simple and quick integration of new ones. This paper describes Yale, a free open-source environment for KDD and machine learning. Yale provides a rich variety of methods which allows rapid prototyping for new applications and makes costly re-implementations unnecessary. Additionally, Yale offers extensive functionality for process evaluation and optimization, which is a crucial property for any KDD rapid prototyping tool. Following the paradigm of visual programming eases the design of processing schemes. While the graphical user interface supports interactive design, the underlying XML representation enables automated applications after the prototyping phase. After a discussion of the key concepts of Yale, we illustrate the advantages of rapid prototyping for KDD on case studies ranging from data pre-processing to result visualization. These case studies cover tasks like feature engineering, text mining, data stream mining and tracking drifting concepts, ensemble methods, and distributed data mining. This variety of applications is also reflected in a broad user base: we counted more than 40,000 downloads during the last twelve months.
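
The operator-composition idea at the heart of Yale can be illustrated with a few lines of code. The sketch below is purely illustrative: Yale itself is a Java system whose processes are composed of nested operators and serialized to XML, and the names used here (Operator, Chain, Normalize, Threshold) are hypothetical stand-ins rather than Yale's actual API.

```python
# Illustrative sketch of nested-operator composition; class names are hypothetical.

class Operator:
    """A processing step that transforms its input and passes the result on."""
    def apply(self, data):
        raise NotImplementedError

class Chain(Operator):
    """A nested operator that runs its children in sequence, mirroring a
    visually designed processing scheme."""
    def __init__(self, *children):
        self.children = children

    def apply(self, data):
        for child in self.children:
            data = child.apply(data)
        return data

class Normalize(Operator):
    """Scale a list of numbers to the range [0, 1]."""
    def apply(self, data):
        lo, hi = min(data), max(data)
        return [(x - lo) / (hi - lo) for x in data] if hi > lo else data

class Threshold(Operator):
    """Binarize values at a fixed cut-off."""
    def __init__(self, t):
        self.t = t
    def apply(self, data):
        return [int(x >= self.t) for x in data]

# Operators can be re-combined freely, which is what makes rapid prototyping cheap.
process = Chain(Normalize(), Threshold(0.5))
print(process.apply([3.0, 7.0, 5.0, 9.0]))   # [0, 1, 0, 1]
```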


intelligent data analysis | 2007

Boosting classifiers for drifting concepts

Martin Scholz; Ralf Klinkenberg

In many real-world classification tasks, data arrives over time and the target concept to be learned from the data stream may change over time. Boosting methods are well suited for learning from data streams, but do not address this concept drift problem. This paper proposes a boosting-like method to train a classifier ensemble from data streams that naturally adapts to concept drift. Moreover, it allows the drift to be quantified in terms of its base learners. As in regular boosting, examples are re-weighted to induce a diverse ensemble of base models. In order to handle drift, the proposed method continuously re-weights the ensemble members based on their performance on the most recent examples only. The proposed strategy adapts quickly to different kinds of concept drift. The algorithm is empirically shown to outperform learning algorithms that ignore concept drift, and it performs no worse than advanced adaptive time window and example selection strategies that store all the data and are thus not suited for mining massive streams. The proposed algorithm has low computational costs.
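
The core adaptation mechanism, re-weighting ensemble members according to their accuracy on the most recent examples only, can be sketched compactly. The helper below is a minimal illustration under assumed names (reweight, predict_ensemble) and an assumed log-odds weighting; it is not the exact rule from the paper.

```python
import math

def reweight(members, recent_batch):
    """Recompute member weights from accuracy on the most recent examples only,
    so base models fitted to an outdated concept lose influence.
    members: list of (predict_fn, weight) pairs; recent_batch: list of (x, y)."""
    new_members = []
    for predict, _ in members:
        correct = sum(1 for x, y in recent_batch if predict(x) == y)
        acc = min(max(correct / len(recent_batch), 0.01), 0.99)  # clip away 0 and 1
        weight = max(math.log(acc / (1.0 - acc)), 0.0)           # worse than chance -> weight 0
        new_members.append((predict, weight))
    return new_members

def predict_ensemble(members, x):
    """Weighted vote of the ensemble for binary labels in {0, 1}."""
    vote = sum(w if predict(x) == 1 else -w for predict, w in members)
    return 1 if vote >= 0 else 0
```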


Archive | 2004

The MiningMart Approach to Knowledge Discovery in Databases

Katharina Morik; Martin Scholz

Although preprocessing is one of the key issues in data analysis, it is still common practice to address this task by manually entering SQL statements and using a variety of stand-alone tools. The results are not properly documented and are hardly re-usable. The MiningMart system presented in this chapter focuses on setting up and re-using best-practice cases of preprocessing data stored in very large databases. A metadata model named M4 is used to declaratively define and document both all steps of such a preprocessing chain and all the data involved. For the data and the applied operators there is an abstract level, understandable by human users, and an executable level, used by the metadata compiler to run cases on given data sets. An integrated environment allows for rapid development of preprocessing chains. Adaptation to different environments is supported simply by specifying all involved database entities in the target DBMS. This allows the reuse of best-practice cases published on the Internet.
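
The split between a conceptual and an executable level can be made concrete with a small sketch. Everything below, the step dictionary, the concept-to-table mapping, and the emitted SQL, is a hypothetical illustration of the idea rather than the actual M4 schema or compiler output.

```python
# Hypothetical illustration of a declaratively defined preprocessing step (conceptual
# level) plus a relational mapping that makes it executable on a concrete database.

step = {
    "name": "DiscretizeIncome",
    "operator": "EqualWidthBinning",
    "input_concept": "Customer",          # conceptual level, human-readable
    "input_attribute": "income",
    "parameters": {"bins": 5},
    "documentation": "Bin raw income for the sales-prediction case.",
}

# Re-targeting the case to another DBMS only means redefining this mapping,
# not the step declaration above.
mapping = {
    ("Customer", "income"): ("SHOP_DB.CUSTOMERS", "ANNUAL_INCOME"),
}

table, column = mapping[(step["input_concept"], step["input_attribute"])]
print(f"SELECT {column} FROM {table}")    # the kind of SQL a metadata compiler might emit
```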


knowledge discovery and data mining | 2005

Sampling-based sequential subgroup mining

Martin Scholz

Subgroup discovery is a learning task that aims at finding interesting rules from classified examples. The search is guided by a utility function, trading off the coverage of rules against their statistical unusualness. One shortcoming of existing approaches is that they do not incorporate prior knowledge. To this end, a novel generic sampling strategy is proposed that turns pattern mining into an iterative process: in each iteration, the focus of subgroup discovery lies on those patterns that are unexpected with respect to prior knowledge and previously discovered patterns. The result of this technique is a small, diverse set of understandable rules that characterise a specified property of interest. As another contribution, this article derives a simple connection between subgroup discovery and classifier induction. For a popular utility function, this connection allows any standard rule induction algorithm to be applied to the task of subgroup discovery after a step of stratified resampling. The proposed techniques are empirically compared to state-of-the-art subgroup discovery algorithms.
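
A widely used example of such a utility function is weighted relative accuracy (WRA), which multiplies the coverage of a rule by the difference between the positive rate inside the subgroup and the overall positive rate. The sketch below shows this particular utility only for illustration; the abstract does not state which utility the paper's resampling result refers to.

```python
def wra(n_cov, n_cov_pos, n, n_pos):
    """Weighted relative accuracy: coverage * (subgroup positive rate - overall positive rate)."""
    coverage = n_cov / n
    p_subgroup = n_cov_pos / n_cov if n_cov else 0.0
    p_overall = n_pos / n
    return coverage * (p_subgroup - p_overall)

# A rule covering 200 of 1000 examples, 150 of them positive,
# in a dataset with 400 positives overall:
print(wra(200, 150, 1000, 400))   # 0.2 * (0.75 - 0.4) = 0.07
```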


international conference on data mining | 2005

On the tractability of rule discovery from distributed data

Martin Scholz

This paper analyses the tractability of rule selection for supervised learning in distributed scenarios. The selection of rules is usually guided by a utility measure such as predictive accuracy or weighted relative accuracy. A common strategy for tackling rule selection from distributed data is to evaluate rules locally on each dataset. While this works well for homogeneously distributed data, this work proves limitations of this strategy once the distributions are allowed to deviate. The identification of those subsets for which local and global distributions deviate poses a learning task of its own, which is shown to be at least as complex as discovering the globally best rules from local data.
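
A small made-up example (not taken from the paper) shows how local evaluation can mislead once local class distributions deviate: a rule that looks mildly interesting at each of two sites is strongly anti-correlated with the class on the pooled data.

```python
def wra(n_cov, n_cov_pos, n, n_pos):
    """Weighted relative accuracy of a rule, used here as the utility measure."""
    p_sub = n_cov_pos / n_cov if n_cov else 0.0
    return (n_cov / n) * (p_sub - n_pos / n)

# Site A: 100 examples, 90 positive; the rule covers 10 examples, all positive.
# Site B: 100 examples, 10 positive; the rule covers 90 examples, 10 of them positive.
print(wra(10, 10, 100, 90))    # local utility at A:  0.1 * (1.00 - 0.90) = +0.01
print(wra(90, 10, 100, 10))    # local utility at B:  0.9 * (0.11 - 0.10) = +0.01
# Pooled data: 200 examples, 100 positive; the rule covers 100 examples, 20 positive.
print(wra(100, 20, 200, 100))  # global utility:      0.5 * (0.20 - 0.50) = -0.15
```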


LPD'04 Proceedings of the 2004 international conference on Local Pattern Detection | 2004

Knowledge-Based sampling for subgroup discovery

Martin Scholz

Subgroup discovery aims at finding interesting subsets of a classified example set that deviate from the overall distribution. The search is guided by a so-called utility function, trading the size of subsets (their coverage) against their statistical unusualness. By choosing the utility function accordingly, subgroup discovery is well suited to finding interesting rules with much smaller coverage and bias than is possible with standard classifier induction algorithms. Smaller subsets can be considered local patterns, but this work uses yet another definition: global patterns consist of all patterns reflecting the prior knowledge available to a learner, including all previously found patterns, while all further unexpected regularities in the data are referred to as local patterns. To address local pattern mining in this scenario, an extension of subgroup discovery by the knowledge-based sampling approach to iterative model refinement is presented. It is a general, cheap way of incorporating prior probabilistic knowledge in arbitrary form into data mining algorithms that address supervised learning tasks.
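
A minimal sketch of the sampling idea is given below, under assumed helper names and an assumed weighting scheme: examples that the prior knowledge already predicts well are down-weighted, so that the next round of subgroup discovery concentrates on the unexpected part of the data. The paper's exact weighting may differ.

```python
import random

def knowledge_based_weights(examples, prior_prob, class_prior):
    """examples: list of (x, y) with y in {0, 1};
    prior_prob(x): P(y=1 | x) according to the prior knowledge;
    class_prior: overall P(y=1)."""
    weights = []
    for x, y in examples:
        p = prior_prob(x) if y == 1 else 1.0 - prior_prob(x)
        q = class_prior if y == 1 else 1.0 - class_prior
        # Down-weight examples the prior already explains (p large relative to q).
        weights.append(q / max(p, 1e-6))
    return weights

def resample(examples, weights, k, seed=0):
    """Draw a weighted bootstrap sample; subgroup discovery then runs on it
    as if it were an ordinary, unweighted dataset."""
    return random.Random(seed).choices(examples, weights=weights, k=k)
```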


european conference on principles of data mining and knowledge discovery | 2006

Distributed subgroup mining

Michael Wurst; Martin Scholz

Subgroup discovery is a popular form of supervised rule learning, applicable to descriptive and predictive tasks. In this work we study two natural extensions of classical subgroup discovery to distributed settings. In the first variant the goal is to efficiently identify global subgroups, i.e. the rules an analysis would yield after collecting all the data in a single central database. The second variant, in contrast, explicitly takes the locality of the data into account: the aim is to find patterns that point out major differences between individual databases with respect to a specific property of interest (the target attribute). We point out substantial differences between these novel learning problems and other kinds of distributed data mining tasks. These differences motivate new search and communication strategies aimed at minimizing computation time and communication costs. We present and empirically evaluate new algorithms for both variants.
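
For the first variant, finding global subgroups, the key observation is that a rule's global utility depends only on a handful of counts, so nodes can ship sufficient statistics instead of raw data. The sketch below illustrates this count-based aggregation under assumed helper names; the search and pruning strategies of the actual algorithms are considerably more involved.

```python
def local_counts(rule, examples):
    """Sufficient statistics of one candidate rule on one node's data.
    examples: list of (x, y) with y in {0, 1}; rule(x) -> bool."""
    covered = [(x, y) for x, y in examples if rule(x)]
    return {
        "n_cov": len(covered),
        "n_cov_pos": sum(y for _, y in covered),
        "n": len(examples),
        "n_pos": sum(y for _, y in examples),
    }

def global_wra(counts_per_node):
    """Weighted relative accuracy of the rule on the pooled data,
    computed from the transmitted counts only."""
    n_cov = sum(c["n_cov"] for c in counts_per_node)
    n_cov_pos = sum(c["n_cov_pos"] for c in counts_per_node)
    n = sum(c["n"] for c in counts_per_node)
    n_pos = sum(c["n_pos"] for c in counts_per_node)
    p_sub = n_cov_pos / n_cov if n_cov else 0.0
    return (n_cov / n) * (p_sub - n_pos / n)
```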


Sigkdd Explorations | 2004

KDD-Cup 2004: protein homology task

Christophe Foussette; Daniel Hakenjos; Martin Scholz

In this paper we describe the winning model for the performance measure "lowest ranked homologous sequence" (RKL), a subtask of the Protein Homology Prediction task of the KDD Cup 2004. The goal was to predict protein homology under several different performance metrics. The given data was organized in blocks, each of which corresponds to a specific native sequence. The two metrics average precision (APR) and RKL explicitly make use of this block structure. Our solution consists of two parts. The first is a global classification SVM that is not aware of the block structure. The second is a k-nearest-neighbor scheme for block similarity, used to train ranking SVMs on the fly. Furthermore, we sketch our approach to optimizing the root mean squared error and report some alternative solutions that turned out to be suboptimal.
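
The block-wise RKL measure mentioned above can be sketched directly from its description: within each block, candidates are ranked by decreasing predicted score, the rank of the worst-ranked true homolog is recorded, and the per-block ranks are averaged (lower is better). Tie-handling details of the official KDD Cup scorer may differ from this sketch.

```python
def rkl(blocks):
    """blocks: dict mapping block_id -> list of (score, is_homolog) pairs."""
    ranks = []
    for candidates in blocks.values():
        ordered = sorted(candidates, key=lambda c: c[0], reverse=True)
        last_pos = max(i + 1 for i, (_, is_hom) in enumerate(ordered) if is_hom)
        ranks.append(last_pos)
    return sum(ranks) / len(ranks)

example = {
    "block_1": [(0.9, True), (0.8, False), (0.4, True), (0.1, False)],
    "block_2": [(0.7, False), (0.6, True), (0.2, False)],
}
print(rkl(example))   # (3 + 2) / 2 = 2.5
```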


Archive | 2007

Scalable and accurate knowledge discovery in real world databases

Martin Scholz

Abstraction: Meta-data are given at different levels of abstraction, a conceptual (abstract) and a relational (executable) level. This makes an abstract case understandable and re-usable.

Data documentation: All attributes, together with the database tables and views that are input to a preprocessing chain, are explicitly listed in both the conceptual and the relational part of the meta-data level. An ontology allows all data to be organized, e.g. by distinguishing between concepts of the domain and relationships between these concepts. For all entities involved there is a text field for documentation. This makes the data much more understandable, e.g. to human domain experts, than merely referring to the names of specific database objects. Furthermore, statistics and features important for data mining (e.g., the presence of null values) are accessible as well. This augments the meta-data usually found in relational databases and gives a good overview of the data sets at hand.

Case documentation: The chain of preprocessing operators is documented as well. First of all, the declarative definition of an executable case in the M4 model can already be considered documentation. Furthermore, apart from the opportunity to use "speaking names" for steps and data objects, there are text fields to document all steps of a case together with their parameter settings. This helps to quickly figure out the relevance of each step and makes cases reproducible.

Ease of case adaptation: In order to run a given sequence of operators on a new database, only the relational meta-data and their mapping to the conceptual meta-data have to be defined. A sales prediction case can, for instance, be applied to different kinds of shops, and a standard sequence of steps for preparing time series for a specific learner might even serve as a template that applies to very different mining contexts. The same effect eases the maintenance of cases when the database schema changes over time; the user just needs to update the corresponding links from the conceptual to the relational level. This is especially easy when all abstract M4 entities are documented.

The MININGMART project has developed a model for meta-data together with its compiler, and has implemented human-computer interfaces that allow database managers and case designers


european conference on machine learning | 2006

Boosting in PN spaces

Martin Scholz

This paper analyzes boosting in unscaled versions of ROC spaces, also referred to as PN spaces. A minor revision of AdaBoost's reweighting strategy is analyzed, which allows it to be reformulated in terms of stratification and the boosting process to be visualized in nested PN spaces, as known from divide-and-conquer rule learning. The analyzed confidence-rated algorithm is proven to take more advantage of its base models in each iteration, while still searching a space of linear discrete base classifier combinations. The algorithm reduces the training error more quickly without sacrificing any of the advantages of the original AdaBoost. The PN space interpretation allows a lower bound on the area under the ROC curve (AUC) of the resulting ensembles to be derived from the AUC after reweighting. The theoretical findings of this paper are complemented by an empirical evaluation on benchmark datasets.
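
One simple way to read reweighting as stratification, shown below purely as an illustration and not necessarily the exact rule analyzed in the paper, is to rescale example weights so that each confusion-matrix cell of the last base model (TP, FP, FN, TN) carries the same total weight. On the reweighted data that model then performs at chance level, forcing the next base model to contribute something new.

```python
def stratify(examples, weights, predict):
    """Rescale weights so every non-empty confusion-matrix cell of `predict`
    carries equal total weight.
    examples: list of (x, y) with y in {0, 1}; weights: current example weights;
    predict(x) -> 0 or 1."""
    totals = {}
    for (x, y), w in zip(examples, weights):
        key = (predict(x), y)
        totals[key] = totals.get(key, 0.0) + w
    target = sum(weights) / len(totals)          # equal mass per non-empty cell
    return [w * target / totals[(predict(x), y)]
            for (x, y), w in zip(examples, weights)]
```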

Collaboration


Dive into Martin Scholz's collaboration.

Top Co-Authors

Ralf Klinkenberg, Technical University of Dortmund
Katharina Morik, Technical University of Dortmund
Timm Euler, Technical University of Dortmund
Ingo Mierswa, Technical University of Dortmund