Publication


Featured research published by Robert Meusel.


international semantic web conference | 2014

The WebDataCommons Microdata, RDFa and Microformat Dataset Series

Robert Meusel; Petar Petrovski

To help web applications understand the content of HTML pages, an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, and Microformats. The annotations are used by Google, Yahoo!, Yandex, Bing and Facebook to enrich search results and to display entity descriptions within their applications. In this paper, we present a series of publicly accessible Microdata, RDFa, and Microformats datasets that we have extracted from three large web corpora dating from 2010, 2012 and 2013. Altogether, the datasets consist of almost 30 billion RDF quads. The most recent of the datasets contains, amongst other data, over 211 million product descriptions, 54 million reviews and 125 million postal addresses originating from thousands of websites. The availability of the datasets lays the foundation for further research on integrating and cleansing the data as well as for exploring its utility within different application contexts. As the dataset series covers four years, it can also be used to analyze the evolution of the adoption of the markup formats.


international semantic web conference | 2013

Deployment of RDFa, Microdata, and Microformats on the Web: A Quantitative Analysis

Kai Eckert; Robert Meusel; Hannes Mühleisen; Michael Schuhmacher; Johanna Völker

More and more websites embed structured data describing for instance products, reviews, blog posts, people, organizations, events, and cooking recipes into their HTML pages using markup standards such as Microformats, Microdata and RDFa. This development has accelerated in the last two years as major Web companies, such as Google, Facebook, Yahoo!, and Microsoft, have started to use the embedded data within their applications. In this paper, we analyze the adoption of RDFa, Microdata, and Microformats across the Web. Our study is based on a large public Web crawl dating from early 2012 and consisting of 3 billion HTML pages which originate from over 40 million websites. The analysis reveals the deployment of the different markup standards, the main topical areas of the published data as well as the different vocabularies that are used within each topical area to represent data. What distinguishes our work from earlier studies, published by the large Web companies, is that the analyzed crawl as well as the extracted data are publicly available. This allows our findings to be verified and to be used as starting points for further domain-specific investigations as well as for focused information extraction endeavors.


web science | 2015

The Graph Structure in the Web: Analyzed on Different Aggregation Levels

Robert Meusel; Sebastiano Vigna; Oliver Lehmberg

Knowledge about the general graph structure of the World Wide Web is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3.5 billion web pages and 128.7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al. (2000) are very dependent on artifacts of the crawling process, whereas others appear to be more structural. We confirm the existence of a giant strongly connected component; however, we find, as observed by other researchers (Donato et al., 2005; Boldi et al., 2002; Baeza-Yates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” as described by Broder et al. is strongly dependent on the crawling process and, to the best of our current knowledge, is not a structural property of the Web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree and sizes of strongly connected components of the page and host graph are not power laws, contrary to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at pay-level domain, however, a power law emerges. We also provide for the first time accurate measurement of distance-based features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).


Journal of Web Semantics | 2015

The Mannheim Search Join Engine

Oliver Lehmberg; Dominique Ritze; Petar Ristoski; Robert Meusel; Heiko Paulheim

A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets. Search Joins are useful within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with attributes containing the headquarters, turnover, and revenue of each company. Or imagine you are a film enthusiast and want to extend a table describing films with attributes like director, genre, and release date of each film. This article presents the Mannheim Search Join Engine which automatically performs such table extension operations based on a large corpus of Web data. Given a local table, the Mannheim Search Join Engine searches the corpus for additional data describing the entities contained in the input table. The discovered data are joined with the local table and are consolidated using schema matching and data fusion techniques. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. We evaluate the Mannheim Search Join Engine using heterogeneous data originating from over one million different websites. The data corpus consists of HTML tables, as well as Linked Data and Microdata annotations which are converted into tabular form. Our experiments show that the Mannheim Search Join Engine achieves a coverage close to 100% and a precision of around 90% for the tasks of extending tables describing cities, companies, countries, drugs, books, films, and songs.
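The table extension described in this abstract can be illustrated with a minimal sketch. It is not the Mannheim Search Join Engine itself: the function name `search_join`, the `_provenance` field, and the sample data are hypothetical. It assumes that schema matching has already happened (corpus tables use the same column names as the local table), that entities match on a lowercased, trimmed key, and it swaps in a simple majority vote for the data fusion step.

```python
from collections import Counter, defaultdict


def search_join(local_rows, key, corpus_tables, attribute):
    """Extend each local row with `attribute`, looked up in the corpus tables."""
    # Collect candidate values per normalized entity key, remembering the source.
    candidates = defaultdict(list)  # normalized key -> [(value, source), ...]
    for source, rows in corpus_tables.items():
        for row in rows:
            if key in row and attribute in row:
                normalized = row[key].strip().lower()
                candidates[normalized].append((row[attribute], source))

    extended = []
    for row in local_rows:
        found = candidates.get(row[key].strip().lower(), [])
        new_row = dict(row)
        if found:
            # Fusion by majority vote (an assumption, chosen for brevity).
            value, _count = Counter(v for v, _ in found).most_common(1)[0]
            new_row[attribute] = value
            # Provenance: the sources that support the chosen value.
            new_row["_provenance"] = sorted(s for v, s in found if v == value)
        else:
            new_row[attribute] = None
            new_row["_provenance"] = []
        extended.append(new_row)
    return extended


# Hypothetical sample data.
local = [{"film": "Alien"}, {"film": "Heat"}]
corpus = {
    "site-a": [{"film": "alien", "director": "Ridley Scott"}],
    "site-b": [
        {"film": "Alien ", "director": "Ridley Scott"},
        {"film": "Heat", "director": "Michael Mann"},
    ],
    "site-c": [{"film": "Alien", "director": "R. Scott"}],
}
result = search_join(local, "film", corpus, "director")
```

Keeping the supporting sources next to each fused value mirrors the abstract's point that the user can examine the provenance of the added data.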


international world wide web conferences | 2016

A Large Public Corpus of Web Tables containing Time and Context Metadata

Oliver Lehmberg; Dominique Ritze; Robert Meusel

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a small subset of the tables is relational, meaning that they contain structured data describing a set of entities [2]. As these relational Web tables cover a very wide range of different topics, there is a growing body of research investigating the utility of Web table data for completing cross-domain knowledge bases [6], for extending arbitrary tables with additional attributes [7, 4], as well as for translating data values [5]. The existing research shows the potential of Web tables. However, comparing the performance of the different systems is difficult, as up to now each system has been evaluated using a different corpus of Web tables and as most of the corpora are owned by large search engine companies and are thus not accessible to the public. In this poster, we present a large public corpus of Web tables which contains over 233 million tables and has been extracted from the July 2015 version of the CommonCrawl. By publishing the corpus as well as all tools that we used to extract it from the crawled data, we intend to provide a common ground for evaluating Web table systems. The main difference of the corpus compared to an earlier corpus that we extracted from the 2012 version of the CommonCrawl as well as the corpus extracted by Eberius et al. [3] from the 2014 version of the CommonCrawl is that the current corpus contains a richer set of metadata for each table. This metadata includes table-specific information such as table orientation, table caption, header row, and key column, but also context information such as the text before and after the table, the title of the HTML page, as well as timestamp information that was found before and after the table. The context information can be useful for recovering the semantics of a table [7]. The timestamp information is crucial for fusing time-dependent data, such as alternative population numbers for a city [8].


web intelligence, mining and semantics | 2015

A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary over Time

Robert Meusel; Heiko Paulheim

Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.


web science | 2014

Graph structure in the web: aggregated by pay-level domain

Oliver Lehmberg; Robert Meusel

Previous research on the overall graph structure of the World Wide Web mostly focused on the page level, meaning that the graph that directly results from hyperlinks between individual web pages was analyzed. This paper aims to provide additional insights about the macroscopic structure of the World Wide Web by analyzing an aggregated version of a recent web graph. The graph covers over 3.5 billion web pages and 128 billion hyperlinks between pages. It was crawled in the first half of 2012. We aggregate this graph by pay-level domain (PLD), meaning that all pages that belong to the same pay-level domain are represented by a single node and that an arc exists between two nodes if there is at least one hyperlink between pages of the corresponding pay-level domains. The resulting PLD graph covers 43 million PLDs and contains 623 million arcs between PLDs. Analyzing this aggregated graph allows us to present findings about linkage patterns between complete websites and not only individual HTML pages. In this paper, we present basic statistics about the PLD graph, such as degree distributions, top-ranked PLDs, distances and diameter. We analyze whether the bow-tie structure introduced by Broder et al. can also be identified in our PLD graph and reveal a backbone of highly interlinked websites within the graph. We group the websites by top-level domain and report findings about the overall linkage within and between different top-level domains. In a last experiment, we use data from the Open Directory Project (DMOZ) to categorize websites by topic and report findings about linkage patterns between websites belonging to different topical categories.
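The aggregation step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `naive_pld` is a hypothetical stand-in that simply keeps the last two host labels, whereas a faithful implementation would consult the Public Suffix List; dropping links within the same PLD is likewise an assumption.

```python
from urllib.parse import urlsplit


def naive_pld(url: str) -> str:
    """Crude pay-level-domain approximation: the last two labels of the host.

    Assumption for illustration only; it is wrong for suffixes such as
    "co.uk", where the Public Suffix List would be needed.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def aggregate_by_pld(page_links):
    """Collapse page-level hyperlinks into a set of PLD-level arcs.

    One arc (a, b) exists if at least one page of PLD a links to a page of PLD b.
    """
    arcs = set()
    for source_url, target_url in page_links:
        a, b = naive_pld(source_url), naive_pld(target_url)
        if a != b:  # assumption: links inside one PLD are ignored
            arcs.add((a, b))
    return arcs


# Hypothetical page-level links.
page_links = [
    ("http://blog.example.com/a", "http://www.example.org/x"),
    ("http://www.example.com/b", "http://example.org/y"),
    ("http://www.example.com/b", "http://www.example.com/c"),
]
pld_arcs = aggregate_by_pld(page_links)
```

Using a set means that many page-level links between the same two domains collapse into a single arc, which is how 128 billion hyperlinks can shrink to a far smaller number of PLD arcs.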


conference on information and knowledge management | 2014

Focused Crawling for Structured Data

Robert Meusel; Peter Mika; Roi Blanco

The Web is rapidly transforming from a pure document collection to the largest connected public data space. Semantic annotations of web pages make it notably easier to extract and reuse data and are increasingly used by both search engines and social media sites to provide better search experiences through rich snippets, faceted search, task completion, etc. In our work, we study the novel problem of crawling structured data embedded inside HTML pages. We describe Anthelion, the first focused crawler addressing this task. We propose new methods of focused crawling specifically designed for collecting data-rich pages with greater efficiency. In particular, we propose a novel combination of online learning and bandit-based explore/exploit approaches to predict data-rich web pages based on the context of the page as well as using feedback from the extraction of metadata from previously seen pages. We show that these techniques significantly outperform state-of-the-art approaches for focused crawling, measured as the ratio of relevant pages and non-relevant pages collected within a given budget.
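The explore/exploit idea mentioned in this abstract can be illustrated with a deliberately simple bandit. The sketch below uses an epsilon-greedy policy, which is a swapped-in textbook technique and not necessarily what Anthelion does; the class name `EpsilonGreedyFrontier`, the choice of hosts as bandit arms, and the host names are all hypothetical. The feedback signal is whether a fetched page turned out to contain structured data.

```python
import random


class EpsilonGreedyFrontier:
    """Pick the next host to crawl, balancing exploration and exploitation."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}  # pages fetched per host
        self.hits = {arm: 0 for arm in arms}   # pages that held structured data

    def mean(self, arm):
        """Observed fraction of data-rich pages for this host."""
        return self.hits[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        # Try every host once before trusting any estimate.
        untried = [arm for arm, n in self.pulls.items() if n == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best host.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))
        return max(self.pulls, key=self.mean)

    def update(self, arm, has_structured_data):
        """Feed back the extraction result for a fetched page."""
        self.pulls[arm] += 1
        self.hits[arm] += int(has_structured_data)


# Hypothetical usage with exploration switched off.
frontier = EpsilonGreedyFrontier(["shop.example", "blog.example"], epsilon=0.0)
frontier.update("shop.example", True)
frontier.update("blog.example", False)
next_host = frontier.choose()
```

With `epsilon=0.0` the policy always exploits, so the usage example is deterministic; a positive epsilon keeps occasionally sampling hosts whose estimates may still be wrong.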


european semantic web conference | 2015

Heuristics for Fixing Common Errors in Deployed schema.org Microdata

Robert Meusel; Heiko Paulheim

Being promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata is most often not free from errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than


Machine Learning | 2015

A decomposition of the outlier detection problem into a set of supervised learning problems

Heiko Paulheim; Robert Meusel

Collaboration


Dive into Robert Meusel's collaborations.

Top Co-Authors

Kai Eckert

University of Mannheim

Stefano Faralli

Sapienza University of Rome
