Heiko Paulheim
University of Mannheim
Publication
Featured research published by Heiko Paulheim.
international semantic web conference | 2014
Max Schmachtenberg; Heiko Paulheim
The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying with a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 State of the LOD Cloud report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among other findings, we observe that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata are still rarely provided by the data sources.
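As a hedged illustration of the kind of best-practice check such a survey relies on (this is not the crawler used in the paper, and the dataset URI is a placeholder), the following Python sketch dereferences a dataset description with rdflib and tests whether license and provenance metadata are present:

```python
# Minimal sketch: check two of the surveyed best practices (license and
# provenance metadata) on a single dereferenced dataset description.
from rdflib import Graph
from rdflib.namespace import DCTERMS

def check_metadata(dataset_uri: str) -> dict:
    """Dereference a dataset description and look for metadata triples."""
    g = Graph()
    g.parse(dataset_uri)  # content negotiation fetches an RDF serialization
    return {
        "license":    any(True for _ in g.objects(None, DCTERMS.license)),
        "provenance": any(True for _ in g.objects(None, DCTERMS.provenance)),
    }

# Placeholder URI; any dereferenceable VoID dataset description would do.
# print(check_metadata("http://example.org/void.ttl"))
```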
web intelligence, mining and semantics | 2012
Heiko Paulheim; Johannes Fürnkranz
The quality of the results of a data mining process strongly depends on the quality of the data it processes. The more useful background knowledge a dataset contains, the more likely a good result becomes. In this paper, we present a fully automatic approach for enriching data with features that are derived from Linked Open Data, a very large, openly available data collection. We identify six different types of feature generators, which are implemented in our open-source tool FeGeLOD. In four case studies, we show that our approach can be applied to different problems, ranging from classical data mining to ontology learning and ontology matching on the semantic web. The results show that features generated from publicly available information may enable data mining in problems where no features are available at all, as well as help improve the results for tasks where some features already exist.
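A minimal sketch of one such feature generator, in the spirit of FeGeLOD's type-based generator but not the tool itself (the DBpedia endpoint is real; the example entity is illustrative), could look like this:

```python
# Enrich a list of entities with binary features derived from their
# rdf:type statements in DBpedia, yielding a propositional table.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def type_features(entity_uri: str) -> set[str]:
    """Return the set of DBpedia ontology types for one entity."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        SELECT ?t WHERE {{
          <{entity_uri}> a ?t .
          FILTER(STRSTARTS(STR(?t), "http://dbpedia.org/ontology/"))
        }}""")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {row["t"]["value"] for row in rows}

# Each entity becomes a row of binary type indicators.
entities = ["http://dbpedia.org/resource/Mannheim"]
features = {e: type_features(e) for e in entities}
all_types = sorted(set().union(*features.values()))
table = [[int(t in features[e]) for t in all_types] for e in entities]
```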
international semantic web conference | 2013
Heiko Paulheim
Type information is very valuable in knowledge bases. However, most large open knowledge bases are incomplete with respect to type information, and, at the same time, contain noisy and incorrect data. That makes classic type inference by reasoning difficult. In this paper, we propose the heuristic link-based type inference mechanism SDType, which can handle noisy and incorrect data. Instead of leveraging T-box information from the schema, SDType takes the actual use of a schema into account and thus is also robust to misused schema elements.
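The following Python sketch illustrates the link-based intuition in simplified form; the published SDType additionally weights each property's vote by its discriminative power and also exploits incoming links:

```python
# Simplified SDType-style inference: each property an entity uses "votes"
# for the entity's type according to the type distribution observed for
# that property across the whole graph.
from collections import Counter, defaultdict

def learn_distributions(triples, types):
    """Estimate P(subject type | property) from the whole graph."""
    dist = defaultdict(Counter)
    for s, p, o in triples:
        for t in types.get(s, ()):
            dist[p][t] += 1
    out = {}
    for p, counts in dist.items():
        total = sum(counts.values())
        out[p] = {t: n / total for t, n in counts.items()}
    return out

def infer_types(entity, triples, dist, threshold=0.4):
    """Average the type distributions of the entity's outgoing properties."""
    props = [p for s, p, o in triples if s == entity]
    votes = Counter()
    for p in props:
        for t, prob in dist.get(p, {}).items():
            votes[t] += prob / len(props)
    return {t: v for t, v in votes.items() if v >= threshold}

# Invented toy data for illustration.
triples = [("Berlin", "capitalOf", "Germany"), ("Paris", "capitalOf", "France")]
types = {"Berlin": {"City"}, "Paris": {"City"}}
dist = learn_distributions(triples, types)
print(infer_types("Berlin", triples, dist))  # {'City': 1.0}
```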
Journal of Web Semantics | 2016
Petar Ristoski; Heiko Paulheim
Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. This survey article gives a comprehensive overview of those approaches in the different stages of the knowledge discovery process. As an example, we show how Linked Open Data can be used at various stages for building content-based recommender systems. The survey shows that, while numerous interesting research works have been carried out, the full potential of the Semantic Web and Linked Open Data for data mining and KDD has yet to be unlocked.
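As a hedged sketch of the recommender example, assuming items are described by sets of LOD-derived features such as DBpedia types or categories (the feature sets below are invented), content-based recommendation can reduce to simple set similarity:

```python
# Content-based recommendation over LOD feature sets via Jaccard similarity.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target: str, item_features: dict, k: int = 3):
    """Rank all other items by feature overlap with the target item."""
    scores = {i: jaccard(item_features[target], f)
              for i, f in item_features.items() if i != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical feature sets; in practice harvested from DBpedia or similar.
items = {
    "film_a": {"dbo:Film", "dbc:Science_fiction_films"},
    "film_b": {"dbo:Film", "dbc:Science_fiction_films", "dbc:1980s_films"},
    "film_c": {"dbo:Film", "dbc:Romance_films"},
}
print(recommend("film_a", items))  # film_b ranked first: largest overlap
```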
extended semantic web conference | 2013
Axel Schulz; Petar Ristoski; Heiko Paulheim
Microblogs are increasingly gaining attention as an important information source in emergency management. Nevertheless, it is still difficult to reuse this information source during emergency situations because of the sheer amount of unstructured data. Especially for detecting small-scale events like car crashes, only small bits of information exist, which complicates the detection of relevant information.
International Journal on Semantic Web and Information Systems | 2014
Heiko Paulheim
Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate and scalable. Both algorithms have been used for building the DBpedia 3.9 release: With SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
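A simplified sketch of the intuition behind SDValidate (not the code used for the DBpedia release): a statement is suspicious if the type of its object is rarely observed with that predicate across the whole graph:

```python
# Score each statement by the relative frequency of its
# (predicate, object-type) combination; low scores flag likely errors.
from collections import Counter

def statement_scores(triples, types):
    combo, per_pred = Counter(), Counter()
    for s, p, o in triples:
        for t in types.get(o, ()):
            combo[(p, t)] += 1
            per_pred[p] += 1

    def score(s, p, o):
        ts = types.get(o, ())
        if not ts or per_pred[p] == 0:
            return 1.0  # nothing to judge the statement against
        return max(combo[(p, t)] / per_pred[p] for t in ts)

    return {(s, p, o): score(s, p, o) for s, p, o in triples}

# Triples whose score falls below a threshold are candidates for removal.
```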
international semantic web conference | 2016
Petar Ristoski; Heiko Paulheim
Linked Open Data has been recognized as a valuable source for background information in data mining. However, most data mining tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. Our evaluation shows that such vector representations outperform existing techniques for the propositionalization of RDF graphs on a variety of different predictive machine learning tasks, and that feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.
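A condensed sketch of the random-walk variant of the pipeline (the Weisfeiler-Lehman kernel variant is omitted, and the toy triples are invented), using gensim's word2vec implementation:

```python
# Generate walks over an RDF graph, treat them as sentences,
# and train word2vec on them to obtain entity embeddings.
import random
from collections import defaultdict
from gensim.models import Word2Vec

def random_walks(triples, depth=4, walks_per_entity=10):
    """triples: iterable of (subject, predicate, object)."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
    walks = []
    for entity in list(adj):
        for _ in range(walks_per_entity):
            walk, node = [entity], entity
            for _ in range(depth):
                if not adj[node]:
                    break
                p, o = random.choice(adj[node])
                walk += [p, o]
                node = o
            walks.append(walk)
    return walks

walks = random_walks([("dbr:Mannheim", "dbo:country", "dbr:Germany"),
                      ("dbr:Germany", "dbo:capital", "dbr:Berlin")])
model = Word2Vec(walks, vector_size=50, window=5, min_count=1, sg=1)
vec = model.wv["dbr:Mannheim"]  # latent numeric representation of the entity
```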
international semantic web conference | 2012
Heiko Paulheim
Statistics are omnipresent in our daily lives. Every day, new statistics are published, showing the perceived quality of living in different cities, the corruption index of different countries, and so on. Interpreting those statistics, on the other hand, is a difficult task. Often, statistics cover only very few attributes, and it is difficult to come up with hypotheses that explain, e.g., why the perceived quality of living in one city is higher than in another. In this paper, we introduce Explain-a-LOD, an approach which uses data from Linked Open Data for generating hypotheses that explain such statistics. We present an implemented prototype and compare different approaches for generating hypotheses by analyzing the perceived quality of those hypotheses in a user study.
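A hedged sketch of the hypothesis-generation idea (not the Explain-a-LOD prototype; the feature name and numbers below are invented): correlate LOD-derived attributes with the target statistic and verbalize the strongest correlations:

```python
# Turn strong correlations between LOD features and a target statistic
# into human-readable hypotheses.
from scipy.stats import pearsonr

def generate_hypotheses(features, target, min_r=0.7):
    hypotheses = []
    for name, values in features.items():
        r, _ = pearsonr(values, target)
        if abs(r) >= min_r:
            direction = "higher" if r > 0 else "lower"
            hypotheses.append(
                f"Cities with higher '{name}' tend to have {direction} "
                f"perceived quality of living (r={r:.2f}).")
    return hypotheses

# Invented toy data: one LOD-derived feature per city.
quality = [8.1, 7.4, 6.0, 5.2]
features = {"dbo:populationDensity": [3.1, 3.9, 5.5, 6.8]}
print(generate_hypotheses(features, quality))
```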
extended semantic web conference | 2013
Heiko Paulheim; Sven Hertling; Dominique Ritze
With a growing number of ontologies used in the semantic web, agents can fully make sense of different datasets only if the correspondences between those ontologies are known. Ontology matching tools have been proposed to find such correspondences. While the current research focus is mainly on fully automatic matching tools, some approaches have been proposed that involve the user in the matching process. However, there are currently no benchmarks and test methods to compare such tools. In this paper, we introduce a number of quality measures for interactive ontology matching tools, and we discuss means to automatically run benchmark tests for such tools. To demonstrate how such evaluations can be designed, we show examples of assessing the quality of interactive matching tools which involve the user in matcher selection and matcher parametrization.
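One way such an automated benchmark can be set up, sketched under the assumption that the user is simulated by an oracle answering from the reference alignment (the correspondences below are invented):

```python
# Simulate user feedback with an oracle and compute standard quality measures.
def make_oracle(reference: set):
    """Simulated user: confirms a correspondence iff it is in the reference."""
    asked = []
    def oracle(correspondence) -> bool:
        asked.append(correspondence)  # track user effort
        return correspondence in reference
    return oracle, asked

def precision_recall(found: set, reference: set):
    p = len(found & reference) / len(found) if found else 0.0
    r = len(found & reference) / len(reference) if reference else 0.0
    return p, r

reference = {("o1:Person", "o2:Human"), ("o1:City", "o2:Town")}
oracle, asked = make_oracle(reference)
candidates = {("o1:Person", "o2:Human"), ("o1:City", "o2:River")}
found = {c for c in candidates if oracle(c)}
print(precision_recall(found, reference), "user interactions:", len(asked))
```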
Journal of Web Semantics | 2015
Petar Ristoski; Heiko Paulheim
Large amounts of data from different domains are published as Linked Open Data (LOD). While there are quite a few browsers for such data, as well as intelligent tools for particular purposes, a versatile tool for deriving additional knowledge by mining the Web of Linked Data is still missing. In this system paper, we introduce the RapidMiner Linked Open Data extension. The extension hooks into the powerful data mining and analysis platform RapidMiner and offers operators for accessing Linked Open Data from within it, allowing it to be used in sophisticated data analysis workflows without the need for expert knowledge in SPARQL or RDF. The extension allows for autonomously exploring the Web of Data by following links, thereby discovering relevant datasets on the fly, as well as for integrating overlapping data found in different datasets. As an example, we show how statistical data from the World Bank on scientific publications, published as an RDF data cube, can be automatically linked to further datasets and analyzed using additional background knowledge from ten different LOD datasets.
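A minimal sketch of the "follow your nose" exploration the extension automates (not the RapidMiner operator itself; error handling is reduced to the bare minimum):

```python
# Dereference an entity, collect its owl:sameAs links, and merge the
# linked descriptions into one graph.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def follow_sameas(entity_uri: str, max_links: int = 3) -> Graph:
    g = Graph()
    g.parse(entity_uri)
    for same in list(g.objects(URIRef(entity_uri), OWL.sameAs))[:max_links]:
        try:
            g.parse(str(same))  # merge the overlapping description
        except Exception:
            pass                # some links will not dereference cleanly
    return g

# merged = follow_sameas("http://dbpedia.org/resource/Mannheim")
```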