Marcus Herzog
Vienna University of Technology
Publication
Featured research published by Marcus Herzog.
symposium on principles of database systems | 2004
Georg Gottlob; Christoph T. Koch; Robert Baumgartner; Marcus Herzog; Sergio Flesca
We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative which is used as the internal wrapper language in the Lixto system. These results include characterizations of both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was inspired by our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.
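To make the idea of monadic datalog over trees concrete, here is a minimal sketch. It assumes a toy document tree and a hand-written program; the names `TREE`, `edges`, and `table_cells` are hypothetical and are not part of Lixto or Elog. All derived predicates are unary (they select nodes), and the tree is accessed through `firstchild` and `nextsibling` relations plus label tests. The rules are evaluated by naive fixpoint iteration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy document: node id -> (label, ordered child ids).
TREE: Dict[int, Tuple[str, List[int]]] = {
    0: ("html", [1, 4]),
    1: ("table", [2, 3]),
    2: ("td", []),
    3: ("td", []),
    4: ("td", []),  # a td that is NOT inside the table
}


def edges(tree):
    """Derive the binary relations firstchild(parent, child) and nextsibling(left, right)."""
    firstchild: Set[Tuple[int, int]] = set()
    nextsibling: Set[Tuple[int, int]] = set()
    for node, (_, kids) in tree.items():
        if kids:
            firstchild.add((node, kids[0]))
        nextsibling.update(zip(kids, kids[1:]))
    return firstchild, nextsibling


def table_cells(tree) -> Set[int]:
    """Evaluate a small monadic datalog program by naive fixpoint iteration.

    desc(x) :- firstchild(y, x), label_table(y).
    desc(x) :- firstchild(y, x), desc(y).
    desc(x) :- nextsibling(y, x), desc(y).
    cell(x) :- desc(x), label_td(x).
    """
    firstchild, nextsibling = edges(tree)

    def label(n: int) -> str:
        return tree[n][0]

    desc: Set[int] = set()
    changed = True
    while changed:  # repeat until no rule derives a new fact
        changed = False
        for y, x in firstchild:
            if (label(y) == "table" or y in desc) and x not in desc:
                desc.add(x)
                changed = True
        for y, x in nextsibling:
            if y in desc and x not in desc:
                desc.add(x)
                changed = True
    return {x for x in desc if label(x) == "td"}


if __name__ == "__main__":
    print(sorted(table_cells(TREE)))
```

The program selects the `td` nodes that are descendants of a `table` node, so the stray `td` directly under `html` is not selected.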
international world wide web conferences | 2005
Bernhard Krüpl; Marcus Herzog; Wolfgang Gatterbauer
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.
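The basic recursive X-Y cut can be sketched as follows. This is the textbook form of the algorithm, not the variant used in the paper, and the box format, the `min_gap` threshold, and the function names are assumptions made for illustration. Boxes are split wherever an empty gap of at least `min_gap` appears along one axis, and each part is then split again along the other axis until no further cut is possible.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def split_on_gaps(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Split boxes into groups separated by empty gaps >= min_gap.

    axis 0 cuts along x (yielding columns), axis 1 cuts along y (yielding rows).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for b in ordered[1:]:
        if b[axis] - reach >= min_gap:
            groups.append([b])  # wide empty strip: start a new group
        else:
            groups[-1].append(b)
        reach = max(reach, b[axis + 2])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 5.0, axis: int = 1) -> List[List[Box]]:
    """Recursively cut the layout; return the leaf blocks as lists of boxes."""
    if not boxes:
        return []
    for a in (axis, 1 - axis):  # try the preferred axis first, then the other
        groups = split_on_gaps(boxes, a, min_gap)
        if len(groups) > 1:
            leaves: List[List[Box]] = []
            for g in groups:
                leaves.extend(xy_cut(g, min_gap, 1 - a))
            return leaves
    return [boxes]  # no cut possible on either axis


if __name__ == "__main__":
    # A 2 x 2 grid of cells with 10-unit gutters.
    cells = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30), (20, 20, 30, 30)]
    for block in xy_cut(cells):
        print(block)
```

On a regular grid of boxes, the leaves correspond to individual cells, and the order of the cuts (rows first, then columns) gives a candidate table structure.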
very large data bases | 2009
Robert Baumgartner; Georg Gottlob; Marcus Herzog
Online market intelligence (OMI), in particular competitive intelligence for product pricing, is a very important application area for Web data extraction. However, OMI presents non-trivial challenges to data extraction technology. Sophisticated and highly parameterized navigation and extraction tasks are required. On-the-fly data cleansing is necessary in order to identify identical products from different suppliers. It must be possible to smoothly define data flow scenarios that merge and filter streams of extracted data stemming from several Web sites and store the resulting data into a data warehouse, where the data is subjected to market intelligence analytics. Finally, the system must be highly scalable, in order to be able to extract and process massive amounts of data in a short time. Lixto (www.lixto.com), a company offering data extraction tools and services, has been providing OMI solutions for several customers. In this paper we show how Lixto has tackled each of the above challenges by improving and extending its original data extraction software. Most importantly, we show how high scalability is achieved through cloud computing. This paper also features a case study from the computers and electronics market.
european semantic web conference | 2005
Robert Baumgartner; Nicola Henze; Marcus Herzog
This paper shows how Semantic Web technologies enable the design and implementation of advanced, personalized information systems. We demonstrate by means of an example application how personalized content syndication can be realized in the Semantic Web. Our approach consists of two main parts: The web data extraction part, providing the information system with real-time, dynamic data, and the personalization part, which deduces – with the aid of ontological domain knowledge – personalized views on the data. The prototype of the system has been realized using the Personal Reader Framework for designing, implementing, and maintaining Web content Readers.
Lecture Notes in Computer Science | 2001
Marcus Herzog; Georg Gottlob
M-Commerce applications are E-Commerce applications that have a mobile terminal on at least one end. Therefore M-Commerce applications share a number of properties with E-Commerce applications while placing additional burdens on the application developer. In this paper we present a conceptual model of an application framework that provides services at the core of M-Commerce applications. We will also present an implementation of this framework and discuss the properties of the information processing involved.
international world wide web conferences | 2006
Bernhard Krüpl; Marcus Herzog
In the AllRight project, we are developing an algorithm for unsupervised table detection and segmentation that uses the visual rendition of a Web page rather than the HTML code. Our algorithm works bottom-up by grouping word bounding boxes into larger groups and uses a set of heuristics. It has already been implemented and a preliminary evaluation on about 6000 Web documents has been carried out.
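Bottom-up grouping of word bounding boxes can be sketched as below. The abstract does not list the project's heuristics, so this example assumes a single illustrative rule: two words belong together if they overlap vertically and their horizontal gap is at most `max_dx`. The function name, the box format, and the threshold are hypothetical. Merging is done with a small union-find structure.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def group_words(boxes: List[Box], max_dx: float = 8.0) -> List[List[Box]]:
    """Merge word boxes into larger groups using one proximity heuristic."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Find the group representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def near(a: Box, b: Box) -> bool:
        v_overlap = min(a[3], b[3]) - max(a[1], b[1])  # shared vertical extent
        h_gap = max(a[0], b[0]) - min(a[2], b[2])      # negative if boxes overlap in x
        return v_overlap > 0 and h_gap <= max_dx

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if near(boxes[i], boxes[j]):
                parent[find(i)] = find(j)  # union the two groups

    groups: Dict[int, List[Box]] = {}
    for i, b in enumerate(boxes):
        groups.setdefault(find(i), []).append(b)
    return list(groups.values())


if __name__ == "__main__":
    words = [(0, 0, 20, 10), (24, 0, 50, 10), (100, 0, 130, 10), (0, 20, 30, 30)]
    for g in group_words(words):
        print(g)
```

In this sketch the first two words merge into one group, while the distant word on the same line and the word on the next line stay separate. A fuller system would apply further heuristics to the resulting groups to find columns and rows.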
symposium on applications and the internet | 2004
Robert Baumgartner; Georg Gottlob; Marcus Herzog; Wolfgang Slany
The World Wide Web is the world's largest database: a huge amount of knowledge, information and services primarily intended for human users. Unfortunately, data on the Web requires intelligent interpretation and cannot be easily used by programs. Processing this data automatically requires advanced data extraction and integration techniques. Lixto technology addresses these issues and enables developers to interactively turn Web pages into Web services.
Lecture Notes in Computer Science | 2005
Robert Baumgartner; Thomas Eiter; Georg Gottlob; Marcus Herzog; Christoph T. Koch
The World Wide Web represents a universe of knowledge and information. Unfortunately, it is not straightforward to query and access the desired information. Languages and tools for accessing, extracting, transforming, and syndicating the desired information are required. The Web should be useful not merely for human consumption but additionally for machine communication. Therefore, powerful and user-friendly tools based on expressive languages for extracting and integrating information from various Web sources, or more generally from heterogeneous sources, are needed. The tutorial gives an introduction to Web technologies required in this context, and presents various approaches and techniques used in information extraction and integration. Moreover, sample applications in various domains motivate the discussed topics, and the provision of data instances for the Semantic Web is illustrated.
international conference on computational science and its applications | 2005
Robert Baumgartner; Christian Enzi; Nicola Henze; Marc Herrlich; Marcus Herzog; Matthias Kriesell; Kai Tomaschewski
In this paper a methodology and a framework for personalized views on data available on the World Wide Web are proposed. We describe its two main ingredients, Web data extraction and ontology-based personalized content presentation. We exemplify the usage of these methodologies with a sample application for personalized publication browsing.
Search Computing | 2010
Robert Baumgartner; Alessandro Campi; Georg Gottlob; Marcus Herzog
Web data extraction is an enabling technique in the search computing scenario. In this chapter, we first review the state of the art in wrapper technologies, focusing on how wrapper generators can be used to create unified services that integrate data from Web applications and Web services in various domains. Next, we describe the Lixto approach and we present the Lixto Suite as one example of Web Process Integration. Finally, application areas, future challenges, and the usage of wrapper technologies in the search computing context are discussed.