Michael Wurst
IBM
Publications
Featured research published by Michael Wurst.
Knowledge Discovery and Data Mining | 2006
Ingo Mierswa; Michael Wurst; Ralf Klinkenberg; Martin Scholz; Timm Euler
KDD is a complex and demanding task. While a large number of methods have been established for numerous problems, many challenges remain to be solved. New tasks emerge which require the development of new methods or processing schemes. As in software development, the development of such solutions demands careful analysis, specification, implementation, and testing. Rapid prototyping is an approach which allows crucial design decisions to be made as early as possible. A rapid prototyping system should support maximal re-use and innovative combinations of existing methods, as well as simple and quick integration of new ones. This paper describes Yale, a free open-source environment for KDD and machine learning. Yale provides a rich variety of methods which allow rapid prototyping for new applications and make costly re-implementations unnecessary. Additionally, Yale offers extensive functionality for process evaluation and optimization, which is a crucial property for any KDD rapid prototyping tool. Following the paradigm of visual programming eases the design of processing schemes. While the graphical user interface supports interactive design, the underlying XML representation enables automated applications after the prototyping phase. After a discussion of the key concepts of Yale, we illustrate the advantages of rapid prototyping for KDD on case studies ranging from data pre-processing to result visualization. These case studies cover tasks like feature engineering, text mining, data stream mining and tracking drifting concepts, ensemble methods, and distributed data mining. This variety of applications is also reflected in a broad user base; we counted more than 40,000 downloads during the last twelve months.
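The paper describes processes composed from operators and specified declaratively in XML; as a rough analogy only (scikit-learn's Pipeline is used here as a stand-in, not Yale's actual API), the same idea of a declaratively assembled, reusable KDD chain with built-in evaluation can be sketched as follows.

```python
# Rough analogy, not Yale itself: a declaratively composed KDD process
# (preprocessing -> feature selection -> learner) evaluated as a whole.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The "process": a chain of operators, assembled rather than hand-coded.
process = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=2)),
    ("learn", SVC(kernel="rbf")),
])

# Process evaluation, analogous to a cross-validation operator wrapping the chain.
scores = cross_val_score(process, X, y, cv=10)
print("mean accuracy: %.3f" % scores.mean())
```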
Genetic and Evolutionary Computation Conference | 2006
Ingo Mierswa; Michael Wurst
In this work we propose a novel, sound framework for evolutionary feature selection in unsupervised machine learning problems. We show that unsupervised feature selection is inherently multi-objective and behaves differently from supervised feature selection in that the number of features must be maximized instead of being minimized. Although this might sound surprising from a supervised learning point of view, we illustrate this relationship on the problem of data clustering and show that existing approaches do not pose the optimization problem in an appropriate way. Another important consequence of this paradigm change is a method which segments the Pareto sets produced by our approach. Inspecting only prototypical points from these segments drastically reduces the amount of work required to select a final solution. We compare our methods against existing approaches on eight data sets.
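A minimal sketch of the multi-objective view, not the authors' evolutionary algorithm: each candidate feature subset receives two scores, clustering quality and the number of selected features, both to be maximized, and only non-dominated (Pareto-optimal) subsets are kept. The data set and quality measure are illustrative choices.

```python
# Enumerate feature subsets, score them on two conflicting objectives,
# and keep the Pareto front (a brute-force stand-in for the GA in the paper).
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)
n_features = X.shape[1]

candidates = []
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        cols = list(subset)
        labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols])
        quality = silhouette_score(X[:, cols], labels)
        candidates.append((subset, quality, len(subset)))

def dominated(a, b):
    """True if candidate a is dominated by b (b >= a in both objectives, > in one)."""
    return (b[1] >= a[1] and b[2] >= a[2]) and (b[1] > a[1] or b[2] > a[2])

pareto = [a for a in candidates if not any(dominated(a, b) for b in candidates)]
for subset, quality, size in sorted(pareto, key=lambda c: c[2]):
    print(size, subset, round(quality, 3))
```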
Lecture Notes in Computer Science | 2005
Jasminko Novak; Michael Wurst
Knowledge exchange between heterogeneous communities of practice has been recognized as a critical source of innovation and of the creation of new knowledge. This paper considers the problem of enabling such cross-community knowledge exchange through knowledge visualization. We discuss the social nature of knowledge construction and describe the main requirements for practical solutions to this problem, as well as existing approaches. Based on this analysis, we propose a model for the collaborative elicitation and visualization of community knowledge perspectives based on the construction of personalised learning knowledge maps and shared concept networks that incorporate implicit knowledge and the personal views of individual users. We show how this model supports explicit and implicit exchange of knowledge between the members of different communities and present its prototypical realization in the Knowledge Explorer, an interactive tool for collaborative visualization and cross-community sharing of knowledge. Concrete application scenarios and evaluation experiences are discussed using the example of the Internet platform netzspannung.org.
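Purely as an illustrative data-structure sketch (not the Knowledge Explorer's implementation, and with hypothetical users and concepts): a shared concept network can be thought of as a merge of individual users' concept associations, remembering who contributed each link.

```python
# Merge per-user concept maps into a shared network; links contributed by
# several users hint at common ground between communities.
from collections import defaultdict

# Hypothetical per-user concept maps: user -> list of (concept, concept) links.
user_maps = {
    "alice": [("interactive art", "visualization"), ("visualization", "semantic maps")],
    "bob":   [("visualization", "semantic maps"), ("agents", "semantic maps")],
}

shared_network = defaultdict(set)   # undirected link -> set of contributing users
for user, links in user_maps.items():
    for a, b in links:
        shared_network[tuple(sorted((a, b)))].add(user)

for link, users in shared_network.items():
    if len(users) > 1:
        print("shared link:", link, "contributed by", sorted(users))
```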
Future Generation Computer Systems | 2007
Michael Wurst; Katharina Morik
Finding the right data representation is essential for virtually every data mining application. In this work we describe an approach to collaborative feature extraction, selection and aggregation in distributed, loosely coupled domains. In contrast to other work in the field of distributed data mining, we focus on scenarios in which a large number of loosely coupled nodes apply data mining to different, usually very small and overlapping, subsets of the entire data space. The aim is not to find a global concept that covers all data, but to learn a set of local concepts. Our prototypical application is a distributed media organization platform, called Nemoz, that assists users in maintaining their media collections. We propose two models for collaborative feature extraction, selection and aggregation for supervised data mining: one based on a centralized p2p architecture, the other on a fully distributed p2p architecture. We compare both models on a real-world data set and discuss their advantages and problems.
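A simplified sketch of the collaborative idea, not Nemoz's actual protocol: each node ranks features on its own small local data set, the rankings are exchanged, and an aggregated ranking drives the local selection. The data, relevance measure and aggregation rule are assumptions for illustration.

```python
# Each node scores features locally, then peers' scores are averaged.
import numpy as np

rng = np.random.default_rng(0)

def local_relevance(X, y):
    """Per-feature relevance on one node, here simply |correlation with the label|."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Hypothetical nodes, each holding a tiny, different sample of a 10-feature space.
node_scores = []
for _ in range(5):
    X = rng.normal(size=(30, 10))
    y = (X[:, 2] + 0.5 * X[:, 7] + rng.normal(scale=0.5, size=30) > 0).astype(int)
    node_scores.append(local_relevance(X, y))

# Aggregation step: average the peers' relevance scores, pick the top k locally.
aggregated = np.mean(node_scores, axis=0)
top_k = np.argsort(aggregated)[::-1][:3]
print("collaboratively selected features:", sorted(top_k.tolist()))
```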
Lecture Notes in Computer Science | 2003
Jasminko Novak; Michael Wurst; Monika Fleischmann; Wolfgang Strauss
This paper presents an agent-based approach to semantic exploration and knowledge discovery in large information spaces by means of capturing, visualizing and making usable the implicit knowledge structures of a group of users. The focus is on the developed conceptual model and system for the creation and collaborative use of personalized learning knowledge maps. We use the paradigm of agents on the one hand as a model for our approach; on the other hand, it serves as a basis for an efficient implementation of the system. We present an unobtrusive model for profiling personalised user agents based on two-dimensional semantic maps that provide (1) a medium of implicit communication between human users and the agents, and (2) a form of visual representation of the resulting knowledge structures. Concerning the implementation, we present an agent architecture consisting of two sets of asynchronously operating agents, which enables both sophisticated processing and the short response times necessary for interactive use in real time.
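A schematic sketch of the two-tier, asynchronous architecture described above (illustrative only, with hypothetical message contents): lightweight interface agents answer interactively from the current map, while heavyweight processing agents update the map in the background.

```python
# Two agent pools decoupled by queues: fast responses vs. slow recomputation.
import queue, threading, time

requests, updates = queue.Queue(), queue.Queue()

def interface_agent():
    # Responds quickly, using whatever map version is currently available.
    while True:
        user_query = requests.get()
        if user_query is None:
            break
        print("interface agent: answering", user_query, "from the current map")

def processing_agent():
    # Recomputes the semantic map asynchronously; may take a long time.
    while True:
        doc = updates.get()
        if doc is None:
            break
        time.sleep(0.1)          # stands in for expensive map recomputation
        print("processing agent: map updated with", doc)

threads = [threading.Thread(target=interface_agent),
           threading.Thread(target=processing_agent)]
for t in threads:
    t.start()

updates.put("new document")
requests.put("query: 'interactive art'")

for q in (requests, updates):
    q.put(None)                  # shut both agent loops down
for t in threads:
    t.join()
```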
Knowledge and Information Systems | 2012
Katharina Morik; Andreas Kaspari; Michael Wurst; Marcin Skirzynski
Large media collections evolve rapidly in the World Wide Web. In addition to targeted retrieval as performed by search engines, browsing and explorative navigation are important issues. Since the collections grow fast and authors most often do not annotate their web pages according to a given ontology, automatic structuring is in demand as a prerequisite for any pleasant human–computer interface. In this paper, we investigate the problem of finding alternative high-quality structures for navigation in a large collection of high-dimensional data. We express the desired properties of frequent termset (FTS) clustering in terms of objective functions. In general, these functions are conflicting. This leads to the formulation of FTS clustering as a multi-objective optimization problem. The optimization is solved by a genetic algorithm. The result is a set of Pareto-optimal solutions. Users may choose their favorite type of structure for navigating through a collection or explore the different views given by the different optimal solutions. We explore the capability of the new approach to produce structures that are well suited for browsing on a social bookmarking data set.
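A toy sketch of the frequent-termset building block only; the paper's genetic, multi-objective optimization over whole clusterings is not reproduced here, and the documents and support threshold are invented for illustration.

```python
# Termsets occurring in enough documents become candidate navigation nodes.
from itertools import combinations
from collections import Counter

docs = [
    {"python", "clustering", "data"},
    {"python", "data", "mining"},
    {"clustering", "data", "mining"},
    {"python", "mining", "web"},
]
min_support = 2   # minimum number of documents a termset must cover

counts = Counter()
for doc in docs:
    for size in (1, 2):
        for termset in combinations(sorted(doc), size):
            counts[termset] += 1

frequent = {ts: c for ts, c in counts.items() if c >= min_support}
# Each frequent termset defines a candidate cluster: the documents containing it.
for termset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(termset, "covers", support, "documents")
```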
European Conference on Machine Learning | 2006
Michael Wurst; Katharina Morik; Ingo Mierswa
Personal media collections are structured in very different ways by different users, and their support by standard clustering algorithms is not sufficient. First, users have personal preferences which they can hardly express as a formal objective function; instead, they might want to select among a set of proposed clusterings. Second, users most often do not want hand-made partial structures to be overwritten by an automatic clustering. Third, given clusterings of others should not be ignored but used to enhance one's own structure. In contrast to other cluster ensemble methods or distributed clustering, a global model (consensus) is not the aim. Hence, we investigate a new learning task, namely learning localized alternative cluster ensembles, where a set of given clusterings is taken into account and a set of proposed clusterings is delivered. This paper proposes an algorithm for solving the new task together with a method for evaluation.
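A minimal sketch of the task, not the paper's algorithm: given a user's hand-made partial clustering and clusterings received from others, rank the others' clusterings by agreement on the commonly covered items and propose the best ones as alternatives; nothing is overwritten. The items, labels and agreement measure are illustrative assumptions.

```python
# Rank peer clusterings by adjusted Rand agreement with the user's partial structure.
from sklearn.metrics import adjusted_rand_score

# Hypothetical data: item -> cluster label.
own_partial = {"song1": "rock", "song2": "rock", "song3": "jazz"}
peer_clusterings = {
    "peer_a": {"song1": 0, "song2": 0, "song3": 1, "song4": 1},
    "peer_b": {"song1": 0, "song2": 1, "song3": 1, "song4": 0},
}

def agreement(own, other):
    shared = sorted(set(own) & set(other))
    if len(shared) < 2:
        return 0.0
    return adjusted_rand_score([own[i] for i in shared], [other[i] for i in shared])

proposals = sorted(peer_clusterings,
                   key=lambda p: agreement(own_partial, peer_clusterings[p]),
                   reverse=True)
print("proposed alternative clusterings, best first:", proposals)
```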
Extending Database Technology | 2010
Benjamin Leonhardi; Bernhard Mitschang; Rubén Pulido; Christoph Sieb; Michael Wurst
Online Analytical Processing (OLAP) is a popular technique for explorative data analysis. Usually, a fixed set of dimensions (such as time, place, etc.) is used to explore and analyze various subsets of a given, multi-dimensional data set. These subsets are selected by constraining one or several of the dimensions, for instance, showing sales only in a given year and geographical location. Still, such aggregates are often not enough. Important information can only be discovered by combining several dimensions in a multidimensional analysis. Most existing approaches allow new dimensions to be added either statically or dynamically. However, these approaches support only the creation of global dimensions that are not interactive for the user running the report. Furthermore, they are mostly restricted to data clustering, and the resulting dimensions cannot be interactively refined. In this paper we propose a technique and an architectural solution based on an interaction concept for dynamically creating OLAP dimensions on subsets of the data, triggered interactively by the user and based on arbitrary multi-dimensional grouping mechanisms. This approach combines the advantages of both OLAP exploration and interactive multidimensional analysis. We demonstrate the industrial strength of our solution architecture using a setup of IBM® InfoSphere™ Warehouse data mining and Cognos® BI as the reporting engine. Use cases and industrial experiences are presented, showing how insight derived from data mining can be transparently presented in the reporting front end, and how data mining algorithms can be invoked from the front end, achieving closed-loop integration.
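A conceptual sketch only, with pandas and scikit-learn standing in for the InfoSphere Warehouse / Cognos stack and an invented fact table: a user-selected slice of the cube is clustered on the fly and the cluster label becomes a new, report-local dimension for further grouping.

```python
# Derive a dynamic dimension by clustering a slice of the fact table.
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical fact table.
facts = pd.DataFrame({
    "year":    [2009, 2009, 2009, 2010, 2010, 2010],
    "region":  ["EU", "EU", "US", "EU", "US", "US"],
    "revenue": [120.0, 95.0, 400.0, 130.0, 420.0, 390.0],
    "units":   [10, 8, 35, 11, 36, 33],
})

# The slice the analyst is currently looking at.
subset = facts[facts["year"] == 2009].copy()

# Dynamically derived dimension: cluster the slice on its measures.
subset["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    subset[["revenue", "units"]])

# The new dimension can immediately be used for aggregation in the report.
print(subset.groupby("segment")["revenue"].sum())
```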
Industrial and Engineering Applications of Artificial Intelligence and Expert Systems | 2006
Sascha Hennig; Michael Wurst
Clustering text documents is a basic enabling technique in a wide variety of Information and Knowledge Management applications. This paper presents an incremental clustering system to organize and manage newsgroup articles. It helps administrators and readers of a newsgroup archive important postings and get a structured overview of current developments and topics. To be practically applicable, such a system must fulfill two conditions. First, it must be able to process rapidly changing text streams, modifying the cluster structure dynamically by adding, deleting and restructuring clusters. Second, it must consider the user in the incremental process. Severe changes in the organization structure are unacceptable for most users, even if they are optimal from the point of view of an abstract clustering criterion. We propose an approach that explicitly models the cost of accommodating changes in the cluster structure. Users may then constrain which changes are acceptable to them.
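A simplified sketch of the incremental idea, not the paper's system: each new article joins its most similar cluster if the match is good enough, otherwise a new cluster is opened. The similarity threshold here is a stand-in for the user-visible cost of changing the existing structure; vectors and the threshold value are invented.

```python
# Incremental, threshold-controlled assignment of documents to clusters.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class IncrementalClusterer:
    def __init__(self, threshold=0.6):
        self.threshold = threshold   # lower threshold -> fewer structural changes
        self.centroids = []          # one centroid vector per cluster
        self.counts = []

    def add(self, vector):
        vector = np.asarray(vector, dtype=float)
        if self.centroids:
            sims = [cosine(vector, c) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Cheap change: fold the article into an existing cluster.
                n = self.counts[best]
                self.centroids[best] = (self.centroids[best] * n + vector) / (n + 1)
                self.counts[best] += 1
                return best
        # Costlier change: open a new cluster.
        self.centroids.append(vector)
        self.counts.append(1)
        return len(self.centroids) - 1

clusterer = IncrementalClusterer(threshold=0.6)
for article in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]):   # toy document vectors
    print("assigned to cluster", clusterer.add(article))
```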
ECML | 2006
Ingo Mierswa; Michael Wurst
Feature construction is essential for solving many complex learning problems. Unfortunately, the construction of features usually implies searching a very large space of possibilities and is often computationally demanding. In this work, we propose a case-based approach to feature construction. Learning tasks are stored together with a corresponding set of constructed features in a case base and can be retrieved to speed up feature construction for new tasks. The essential part of our method is a new representation model for learning tasks and a corresponding distance measure. Learning tasks are compared using relevance weights on a common set of base features only. Therefore, the case base can be built and queried very efficiently. In this respect, our approach is unique and enables us to apply case-based feature construction not only on a large scale, but also in distributed learning scenarios in which communication costs play an important role. We derive a distance measure for heterogeneous learning tasks by stating a set of necessary conditions. Although the conditions are quite basic, they constrain the set of applicable methods to a surprisingly small number. We also provide some experimental evidence for the utility of our method. This work was supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the Collaborative Research Center "Computational Intelligence" (SFB 531).
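An illustrative sketch of the case-based retrieval step, with a hypothetical relevance measure, distance measure and case base (the paper's concrete distance conditions are not reproduced): tasks are compared only via relevance weights over shared base features, and the nearest stored case donates its constructed features to the new task.

```python
# Represent tasks by base-feature relevance weights and retrieve the nearest case.
import numpy as np

def relevance_weights(X, y):
    """Relevance of each base feature for a task, here |correlation with the label|."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Case base: relevance-weight vector -> constructed features that worked for that task.
case_base = [
    (np.array([0.9, 0.1, 0.2]), ["x0 * x0", "log(x0)"]),
    (np.array([0.1, 0.8, 0.7]), ["x1 + x2", "x1 * x2"]),
]

def retrieve(weights):
    """Return the constructed features of the most similar stored task (Euclidean distance)."""
    distances = [np.linalg.norm(weights - w) for w, _ in case_base]
    return case_base[int(np.argmin(distances))][1]

# New task: compute its relevance weights on the common base features, query the case base.
rng = np.random.default_rng(1)
X_new = rng.normal(size=(50, 3))
y_new = (X_new[:, 0] + 0.2 * rng.normal(size=50) > 0).astype(int)
print("reused constructed features:", retrieve(relevance_weights(X_new, y_new)))
```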