Publications


Featured research published by Michal Skubacz.


Web Intelligence | 2007

Content Extraction from News Pages Using Particle Swarm Optimization on Linguistic and Structural Features

Cai-Nicolas Ziegler; Michal Skubacz

Today's Web pages are commonly made up of more than merely one cohesive block of information. For instance, news pages from popular media channels such as Financial Times or Washington Post consist of no more than 30%-50% of textual news, next to advertisements, link lists to related articles, disclaimer information, and so forth. However, for many search-oriented applications such as the detection of relevant pages for an in-focus topic, dissecting the actual textual content from surrounding page clutter is an essential task, so as to maintain appropriate levels of document retrieval accuracy. We present a novel approach that extracts real content from news Web pages in an unsupervised fashion. Our method is based on distilling linguistic and structural features from text blocks in HTML pages, having a particle swarm optimizer (PSO) learn feature thresholds for optimal classification performance. Empirical evaluations and benchmarks show that our approach works very well when applied to several hundreds of news pages from popular media in 5 languages.
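The abstract describes the optimizer only at a high level; the following minimal sketch shows how a particle swarm optimizer can learn per-feature thresholds that separate content blocks from page clutter. The feature set, the toy labelled blocks, and the threshold rule are invented for illustration and are not the paper's actual features.

```python
import random

# Hypothetical feature vectors for HTML text blocks: each entry is
# (features, is_content) with features such as
# [avg_sentence_length, link_density, stopword_ratio].
BLOCKS = [
    ([18.0, 0.05, 0.45], True),
    ([22.5, 0.02, 0.50], True),
    ([3.0, 0.80, 0.10], False),
    ([2.5, 0.90, 0.05], False),
]

def classify(features, thresholds):
    # Toy rule: a block counts as content if sentence length and
    # stopword ratio exceed their thresholds while link density stays below.
    f_len, f_link, f_stop = features
    t_len, t_link, t_stop = thresholds
    return f_len >= t_len and f_link <= t_link and f_stop >= t_stop

def fitness(thresholds):
    hits = sum(classify(f, thresholds) == label for f, label in BLOCKS)
    return hits / len(BLOCKS)

def pso(n_particles=10, n_iter=50, dims=3, seed=0):
    rng = random.Random(seed)
    # Initialise particle positions (threshold vectors) and velocities.
    pos = [[rng.uniform(0, 25), rng.uniform(0, 1), rng.uniform(0, 1)]
           for _ in range(n_particles)]
    vel = [[0.0] * dims for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = max(pbest, key=fitness)
    w, c1, c2 = 0.7, 1.5, 1.5          # inertia and attraction weights
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dims):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = max(pbest + [gbest], key=fitness)
    return gbest

best = pso()
print(fitness(best))  # fraction of blocks classified correctly
```

PSO suits this setting because the fitness landscape over threshold vectors is non-differentiable, so gradient-based learning does not apply.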


Web Intelligence | 2006

Towards Automated Reputation and Brand Monitoring on the Web

Cai-Nicolas Ziegler; Michal Skubacz

The ever-increasing growth of the Web as a principal provider of news and opinions makes it impossible for individuals to manually spot and analyze all information of particular importance for global large-scale corporations. Hence, automated means of identifying upcoming topics of utmost relevance and monitoring the reputation of a brand as well as its competitors are becoming indispensable. In this paper, we present a platform for analyzing Web data for such purposes, adopting different semantic perspectives and providing the market analyst with a flexible suite of instruments. We focus on two of these tools and outline their particular utility for research and exploration.


Web Intelligence | 2006

Relevance and Impact of Tabbed Browsing Behavior on Web Usage Mining

Maximilian Viermetz; Carsten Stolz; Vassil Gedov; Michal Skubacz

The rapid growth of the Internet has pushed the research and development of Web usage mining ever more into focus. Web usage mining and its applications have become critical to the business world. These analyses rest in turn on the ability to develop a clear understanding of the actions a user has taken. So far, the temporal order of clicks has been taken to be equal to the structural order of a session. With the advent of the newest browser generation, where the use of multiple tabs has become a common feature, the above assumption does not necessarily hold any more. It is crucial to understand how the use of multiple tabs affects Web usage mining, especially the understanding of a session and its reconstruction. In order to analyze this new browsing behavior, we introduce a generic browsing model extending the traditional serial or single-window model to cover the use of multiple tabs. Based on this model, we present and analyze an approach to detect the use of multiple tabs within sessions. The existence and increasing prominence of the use of multiple tabs is shown by this approach to be of relevance to business analysis as well as research results.
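One simplified way to operationalize the idea, sketched below under assumed log fields (timestamp, URL, referrer): in strictly serial browsing each click's referrer is the immediately preceding page, so a referrer pointing further back in the session suggests the user branched off into another tab. This is an illustrative heuristic, not the paper's exact detection algorithm.

```python
# Detect parallel (tabbed) browsing inside a clickstream session,
# assuming each click records (timestamp, url, referrer).
def has_tabbed_browsing(session):
    """session: list of (timestamp, url, referrer) sorted by timestamp."""
    visited = []
    branches = 0
    for _, url, referrer in session:
        if (visited and referrer is not None
                and referrer != visited[-1] and referrer in visited):
            # Referrer is an earlier page, not the last one seen:
            # the user likely opened this link from another tab.
            branches += 1
        visited.append(url)
    return branches > 0

serial = [(0, "/home", None), (5, "/news", "/home"), (9, "/a", "/news")]
tabbed = [(0, "/home", None), (5, "/a", "/home"), (9, "/b", "/home")]
print(has_tabbed_browsing(serial), has_tabbed_browsing(tabbed))  # → False True
```

In the second session both /a and /b were opened from /home, so the temporal order of clicks no longer matches the structural order, which is exactly the situation the abstract highlights.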


Web Intelligence | 2005

Guidance Performance Indicator – Web Metrics for Information Driven Web Sites

Carsten Stolz; Maximilian Viermetz; Michal Skubacz; Ralph Neuneier

For the evaluation of Web sites, a multitude of metrics are available. Apart from general statistical measures, success metrics reflect the degree to which a Web site achieves its defined objectives. In particular, metrics for e-commerce sites based on transaction analysis are commonly available and well understood. In contrast to transaction-based sites, the success of Web sites geared toward information delivery is harder to quantify since there is no direct feedback of user intent; user feedback is only directly available on transactional Web sites. We introduce a metric to measure the success of an information-driven Web site in meeting its objective to deliver the desired information in a timely and usable fashion. We propose to assign a value to each click based on the type of transition, duration and semantic distance. These values are then combined into a scoring model describing the success of a Web site in meeting its objectives. The resulting metric is introduced as the guidance performance indicator (GPI), and its applicability is demonstrated on a large corporate Web site.
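A toy sketch of such a guidance-performance-style score: each click gets a value from its transition type, dwell time, and a semantic-distance penalty, and the values are averaged into a session score. The transition types, weights, and caps here are invented for illustration; the paper's actual scoring model differs.

```python
TRANSITION_VALUE = {          # hypothetical base values per transition type
    "overview->detail": 1.0,  # drilling down: user is being guided well
    "detail->detail": 0.5,
    "detail->overview": -0.5, # backing out: possible disorientation
    "search->detail": 1.0,
}

def click_value(transition, dwell_seconds, semantic_distance):
    base = TRANSITION_VALUE.get(transition, 0.0)
    dwell_bonus = min(dwell_seconds / 60.0, 1.0)   # reading time, capped
    return base + dwell_bonus - semantic_distance  # distant jumps penalised

def session_score(clicks):
    if not clicks:
        return 0.0
    return sum(click_value(*c) for c in clicks) / len(clicks)

clicks = [("overview->detail", 45, 0.1),
          ("detail->detail", 120, 0.2),
          ("detail->overview", 5, 0.6)]
print(round(session_score(clicks), 3))  # → 0.644
```

Averaging per-click values keeps the score comparable across sessions of different lengths, which matters when aggregating it into a site-wide metric.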


WIT Transactions on Information and Communication Technologies | 2000

Input Dependent Misclassification Costs for Cost-Sensitive Classifiers

Jaakko Hollmén; Michal Skubacz; Michiaki Taniguchi

In data mining and in classification specifically, cost issues have been undervalued for a long time, although they are of crucial importance in real-world applications. Recently, however, cost issues have received growing attention, see for example [1,2,3]. Cost-sensitive classifiers are usually based on the assumption of constant misclassification costs between given classes, that is, the cost incurred when an object of class j is erroneously classified as belonging to class i. In many domains, the same type of error may have differing costs due to particular characteristics of objects to be classified. For example, loss caused by misclassifying credit card abuse as normal usage is dependent on the amount of uncollectible credit involved. In this paper, we extend the concept of misclassification costs to include the influence of the input data to be classified. Instead of a fixed misclassification cost matrix, we now have a misclassification cost matrix of functions, separately evaluated for each object to be classified. We formulate the conditional risk for this new approach and relate it to the fixed misclassification cost case. As an illustration, experiments in the telecommunications fraud domain are used, where the costs are naturally data-dependent due to the connection-based nature of telephone tariffs. Posterior probabilities from a hidden Markov model are used in classification, although the described cost model is applicable with other methods such as neural networks or probabilistic networks.
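The conditional risk the abstract mentions can be sketched directly: instead of a fixed cost matrix c[i][j], each entry is a function of the input x, evaluated per object. The fraud example below, including the cost amounts and the posterior values, is hypothetical and only mimics the connection-cost intuition from the paper.

```python
def cost(i, j, x):
    """Cost of deciding class i when the true class is j, given input x."""
    if i == j:
        return 0.0
    if i == "normal" and j == "fraud":
        return x["amount"]          # missed fraud costs the amount at risk
    return 10.0                     # false alarm: fixed handling cost

def conditional_risk(i, posteriors, x):
    # R(i | x) = sum_j cost_ij(x) * P(j | x)
    return sum(cost(i, j, x) * p for j, p in posteriors.items())

def decide(posteriors, x):
    # Bayes decision: pick the class with minimal conditional risk.
    return min(("normal", "fraud"),
               key=lambda i: conditional_risk(i, posteriors, x))

posteriors = {"fraud": 0.1, "normal": 0.9}   # e.g. from a hidden Markov model
print(decide(posteriors, {"amount": 500.0}))  # → fraud (large amount at risk)
print(decide(posteriors, {"amount": 20.0}))   # → normal (small amount)
```

Note how the same posteriors yield different decisions depending on x, which is exactly the effect of making costs input-dependent; with a constant cost matrix the decision would be the same for both objects.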


Congress on Evolutionary Computation | 2008

Tracking Topic Evolution in News Environments

Maximilian Viermetz; Michal Skubacz; Cai-Nicolas Ziegler; Dietmar Seipel

For companies acting on a global scale, the necessity to monitor and analyze news channels and consumer-generated media on the Web, such as weblogs and newsgroups, is steadily increasing. In particular, the identification of novel trends and upcoming issues, as well as their dynamic evolution over time, is of utmost importance to corporate communications and market analysts. Automated machine learning systems using clustering techniques have only partially succeeded in addressing these newly arising requirements, failing in their endeavor to properly assign short-term hype topics to long-term trends. We propose an approach that makes it possible to monitor news wire on different levels of temporal granularity, extracting key phrases that reflect short-term topics as well as longer-term trends by means of statistical language modelling. Moreover, our approach allows us to assign windows of smaller scope to those spanning longer intervals.
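In the spirit of the statistical language modelling mentioned above, short-term topics can be surfaced by contrasting a short time window's term distribution against a longer background window. The corpora, the tokenization, and the log-likelihood-style weighting below are illustrative assumptions, not the paper's exact model.

```python
import math
from collections import Counter

def term_scores(foreground_docs, background_docs):
    """Score terms by how bursty they are in the short window."""
    fg = Counter(w for d in foreground_docs for w in d.split())
    bg = Counter(w for d in background_docs for w in d.split())
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for term, n in fg.items():
        p_fg = n / fg_total
        p_bg = (bg.get(term, 0) + 1) / (bg_total + len(fg))  # add-one smoothed
        scores[term] = n * math.log(p_fg / p_bg)  # bursty terms score high
    return scores

week = ["merger bank merger talks", "bank merger rumours"]   # short window
year = ["bank results", "bank loans market", "market outlook bank"]  # trend
scores = term_scores(week, year)
print(max(scores, key=scores.get))  # → merger
```

The background model absorbs chronically frequent terms ("bank"), so only terms that spike in the short window, here "merger", emerge as hype-topic candidates that can then be matched against the longer-term windows.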


Web Intelligence | 2008

Mining and Exploring Unstructured Customer Feedback Data Using Language Models and Treemap Visualizations

Cai-Nicolas Ziegler; Michal Skubacz; Maximilian Viermetz

We propose an approach for exploring large corpora of textual customer feedback in a guided fashion, bringing order to massive amounts of unstructured information. The prototypical system we implemented allows an analyst to assess labelled clusters in a graphical fashion, based on treemaps, and perform drill-down operations to investigate the topic of interest in a more fine-grained manner. Labels are chosen by simple but effective term weighting schemes and lay the foundations for assigning feedback postings to clusters. In order to allow for drill-down operations leading to new clusters of refined information, we present an approach that contrasts foreground and background models of feedback texts when stepping into the currently selected set of feedback messages. The prototype we present is already in use at various Siemens units and has been embraced by marketing analysts.
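As a rough illustration of the treemap side of the system, the sketch below lays out labelled clusters as rectangles whose areas are proportional to cluster size, using a basic slice-and-dice scheme. The cluster names and sizes are invented, and production treemaps typically use squarified layouts with better aspect ratios.

```python
def slice_and_dice(items, x, y, w, h, vertical=True):
    """items: list of (label, weight). Returns (label, (x, y, w, h)) pairs."""
    total = sum(weight for _, weight in items)
    rects = []
    offset = 0.0
    for label, weight in items:
        frac = weight / total
        if vertical:     # split the width into proportional slices
            rects.append((label, (x + offset, y, w * frac, h)))
            offset += w * frac
        else:            # split the height instead
            rects.append((label, (x, y + offset, w, h * frac)))
            offset += h * frac
    return rects

# Hypothetical feedback clusters with posting counts as weights.
clusters = [("delivery", 50), ("billing", 30), ("support", 20)]
for label, rect in slice_and_dice(clusters, 0, 0, 100, 60):
    print(label, rect)
```

A drill-down operation would then re-run the layout on the sub-clusters of the rectangle the analyst clicked, recursing into the currently selected set of feedback messages.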


International World Wide Web Conference | 2004

Matching web site structure and content

Vassil Gedov; Carsten Stolz; Ralph Neuneier; Michal Skubacz; Dietmar Seipel

To keep an overview of a complex corporate web site, it is crucial to understand the relationship of content, structure and user behavior. In this paper, we describe an approach that allows us to compare web page content with the information implicitly defined by the structure of the web site. We start by describing each web page with a set of keywords. We combine this information with the link structure in an algorithm generating a context-based description. By comparing both descriptions, we draw conclusions about the semantic relationship of a web page and its neighbourhood. In this way, we indicate whether a page fits the content of its neighbourhood. Doing this, we implicitly identify topics which span several connected web pages. With our approach, we support redesign processes by assessing the actual structure and content of a web site against the designers' concepts.
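A minimal sketch of this comparison, under assumptions of my own: each page gets a keyword set, a context description is aggregated from the pages it links to and from, and the overlap between the two indicates how well the page fits its neighbourhood. The pages, keywords, links, and the Jaccard measure are illustrative, not the paper's exact algorithm.

```python
def context_description(page, links, keywords):
    """Union of keywords of all pages linked to/from `page`."""
    ctx = set()
    for a, b in links:
        if a == page:
            ctx |= keywords[b]
        elif b == page:
            ctx |= keywords[a]
    return ctx

def fit_score(page, links, keywords):
    """Jaccard overlap between a page's own keywords and its context."""
    ctx = context_description(page, links, keywords)
    own = keywords[page]
    if not own or not ctx:
        return 0.0
    return len(own & ctx) / len(own | ctx)

keywords = {
    "products": {"router", "switch", "network"},
    "routers": {"router", "network", "firmware"},
    "careers": {"jobs", "benefits", "hiring"},
}
links = [("products", "routers"), ("products", "careers")]
print(round(fit_score("routers", links, keywords), 2))  # → 0.5 (fits neighbourhood)
print(round(fit_score("careers", links, keywords), 2))  # → 0.0 (off-topic page)
```

Pages with low fit scores are flagged for the redesign process, since either the content or the linking of such a page contradicts the site's implicit topic structure.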


Web Intelligence | 2007

Using Topic Discovery to Segment Large Communication Graphs for Social Network Analysis

Maximilian Viermetz; Michal Skubacz

The application of social network analysis to graphs found in the World Wide Web and the Internet has received increasing attention in recent years. Networks as diverse as those generated by e-mail communication, instant messaging, link structure in the Internet, as well as citation and collaboration networks, have all been treated with this method. So far these analyses solely utilize graph structure. There is, however, another source of information available in messaging corpora, namely content. We propose to apply the field of content analysis to the process of social network analysis. By extracting relevant and cohesive sub-networks from massive graphs, we obtain information on the actors contained in such sub-networks with much greater confidence than before.
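One simple way to realize such content-based segmentation, sketched under my own assumptions: keep only edges whose messages share a topic, then take the connected components of the remaining graph as topic-cohesive sub-networks. The actors, topics, and edge list are invented for illustration.

```python
def topical_subnetworks(edges):
    """edges: list of (actor_a, actor_b, topics_of_their_messages)."""
    adj = {}
    for a, b, topics in edges:
        if topics:                      # drop edges with no shared topic
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                    # depth-first traversal
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

edges = [("alice", "bob", {"budget"}),
         ("bob", "carol", {"budget"}),
         ("dan", "erin", {"offsite"}),
         ("carol", "dan", set())]       # no shared topic: not cohesive
print([sorted(c) for c in topical_subnetworks(edges)])
# → [['alice', 'bob', 'carol'], ['dan', 'erin']]
```

Dropping the topic-free carol–dan edge splits one structural giant component into two content-cohesive communities, which is the effect the abstract argues pure graph structure cannot achieve.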


Web Information Systems Engineering | 2005

Web performance indicator by implicit user feedback – application and formal approach

Michael Barth; Michal Skubacz; Carsten Stolz

With the growing importance of the Internet, web sites have to be continuously improved. Web metrics help to identify improvement potentials. In particular, success metrics for e-commerce sites based on transaction analysis are commonly available and well understood. In contrast to transaction-based sites, the success of web sites geared toward information delivery is harder to quantify since there is no direct feedback from the user. We propose a generic success measure for information-driven web sites. The idea of the measure is based on the observation of user behaviour in the context of the web site's semantics. In particular, we observe users on their way through the web site and assign positive and negative scores to their actions. The value of the score depends on the transitions between page types and their contribution to the web site's objectives. To derive a generic view on the metric construction, we introduce a formal meta environment that derives success measures from the relations and dependencies of usage, content and structure of a web site.

Collaboration


Dive into Michal Skubacz's collaborations.

Top Co-Authors
Vassil Gedov

University of Würzburg
