Andrew Jon Sellers | Researchain

Publication

Featured research published by Andrew Jon Sellers.

very large data bases | 2013

OXPath: A language for scalable data extraction, automation, and crawling on the deep web

Tim Furche; Georg Gottlob; Giovanni Grasso; Christian Schallhart; Andrew Jon Sellers

The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed—matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin.

international world wide web conferences | 2012

Visual OXPath: robust wrapping by example

Jochen Kranzdorf; Andrew Jon Sellers; Giovanni Grasso; Christian Schallhart; Tim Furche

Good examples are hard to find, particularly in wrapper induction: Picking even one wrong example can spell disaster by yielding overgeneralized or overspecialized wrappers. Such wrappers extract data with low precision or recall, unless adjusted by human experts at significant cost. Visual OXPath is an open-source, visual wrapper induction system that requires minimal examples and eases wrapper refinement: Often it derives the intended wrapper from a single example through sophisticated heuristics that determine the best set of similar examples. To ease wrapper refinement, it offers a list of wrappers ranked by example similarity and robustness. Visual OXPath offers extensive visual feedback for this refinement, which can be performed without any knowledge of the underlying wrapper language. Where further refinement by a human wrapper is needed, Visual OXPath profits from being based on OXPath, a declarative wrapper language that extends XPath with a thin layer of features necessary for extraction and page navigation.

international conference on web engineering | 2011

How the minotaur turned into ariadne: ontologies in web data extraction

Tim Furche; Georg Gottlob; Xiaonan Guo; Christian Schallhart; Andrew Jon Sellers; Cheng Wang

Humans require automated support to profit from the wealth of data nowadays available on the web. To that end, the linked open data initiative and others have been asking data providers to publish structured, semantically annotated data. Small data providers, such as most UK real-estate agencies, however, are overburdened with this task, often just starting to move from simple, table- or list-like directories to web applications with rich interfaces. We argue that fully automated extraction of structured data can help resolve this dilemma. Ironically, automated data extraction has seen a recent revival thanks to ontologies and linked open data to guide data extraction. First results from the DIADEM project illustrate that high-quality, fully automated data extraction at web scale is possible if we combine domain ontologies with a phenomenology describing the representation of domain concepts. We briefly summarise the DIADEM project and discuss a few preliminary results.

extending database technology | 2011

Taking the OXPath down the deep web

Andrew Jon Sellers; Tim Furche; Georg Gottlob; Giovanni Grasso; Christian Schallhart

Although deep web analysis has been studied extensively, there is no succinct formalism to describe user interactions with AJAX-enabled web applications. Toward this end, we introduce OXPath as a superset of XPath 1.0. Beyond XPath, OXPath is able (1) to fill web forms and trigger DOM events, (2) to access dynamically computed CSS attributes, (3) to navigate between visible form fields, and (4) to mark relevant information for extraction. This way, OXPath expressions can closely simulate the human interaction relevant for navigation rather than rely exclusively on the HTML structure. Thus, they are quite resilient against technical changes. We demonstrate the expressiveness and practical efficacy of OXPath to tackle a group flight planning problem. We use the OXPath implementation and visual interface to access the popular, highly scripted travel site Kayak. We show how to formulate OXPath expressions to extract all booking information with just a few lines of code.

international world wide web conferences | 2011

OXPath: little language, little memory, great value

Andrew Jon Sellers; Tim Furche; Georg Gottlob; Giovanni Grasso; Christian Schallhart

Data about everything is readily available on the web, but often only accessible through elaborate user interactions. For automated decision support, extracting that data is essential, but infeasible with existing heavy-weight data extraction systems. In this demonstration, we present OXPath, a novel approach to web extraction, with a system that supports informed job selection and integrates information from several different web sites. By carefully extending XPath, OXPath exploits its familiarity and provides a light-weight interface, which is easy to use and embed. We highlight how OXPath guarantees optimal page buffering, storing only a constant number of pages for non-recursive queries.

international world wide web conferences | 2011

The OXPath to success in the deep web

Andrew Jon Sellers

The world wide web provides access to a wealth of data. Collecting and maintaining such large amounts of data necessitates automated processing for extraction, since appropriate automation can perform extraction tasks that would be otherwise infeasible. Modern web interfaces, however, are generally designed primarily for human users, delivering sophisticated interactions through the use of client-side scripting and asynchronous server communication. To this end, we introduce OXPath, a careful extension of XPath that facilitates data extraction from the deep web. OXPath exploits XPath’s familiarity and theoretical foundations. OXPath, then, achieves favourable evaluation complexity and optimal page buffering, storing only a constant number of pages for non-recursive queries. Further, OXPath provides a lightweight interface, which is easy to use and embed. This paper outlines the motivation, theoretical framework, current implementation, and preliminary results obtained so far. We conclude with proposed future work on OXPath, including an investigation of how to deploy OXPath efficiently in a highly elastic computing framework (cloud).

Proceedings of the 1st International Workshop on Linked Web Data Management | 2011

Exploring the web with OXPath

Tim Furche; Georg Gottlob; Giovanni Grasso; Christian Schallhart; Andrew Jon Sellers

OXPath is a careful extension of XPath that facilitates data extraction from the deep web. It is designed to facilitate the large-scale extraction of data from sophisticated modern web interfaces with client-side scripting and asynchronous server communication. Its main characteristics are (1) a minimal extension of XPath to allow page navigation and action execution, (2) a set-theoretic formal semantics for full OXPath, and (3) sophisticated memory management that minimizes page buffering. In this poster, we briefly review the main features of the language and discuss ongoing and future work.

international world wide web conferences | 2012

DIADEM: domain-centric, intelligent, automated data extraction methodology

Tim Furche; Georg Gottlob; Giovanni Grasso; Omer Gunes; Xiaonan Guo; Andrey Kravchenko; Giorgio Orsi; Christian Schallhart; Andrew Jon Sellers; Cheng Wang

international semantic web conference | 2012

DEQA: deep web extraction for question answering

Jens Lehmann; Tim Furche; Giovanni Grasso; Axel-Cyrille Ngonga Ngomo; Christian Schallhart; Andrew Jon Sellers; Christina Unger; Lorenz Bühmann; Daniel Gerber; Konrad Höffner; David Liu; Sören Auer

Proceedings of The Vldb Endowment | 2011