Tommaso Soru
Leipzig University
Publications
Featured research published by Tommaso Soru.
International Semantic Web Conference | 2013
Daniel Gerber; Sebastian Hellmann; Lorenz Bühmann; Tommaso Soru; Axel-Cyrille Ngonga Ngomo
The vision behind the Web of Data is to extend the current document-oriented Web with machine-readable facts and structured data, thus creating a representation of general knowledge. However, most of the Web of Data is limited to being a large compendium of encyclopedic knowledge describing entities. A huge challenge, the timely and massive extraction of RDF facts from unstructured data, has remained open so far. The availability of such knowledge on the Web of Data would provide significant benefits to manifold applications including news retrieval, sentiment analysis and business intelligence. In this paper, we address the problem of the actuality of the Web of Data by presenting an approach that allows extracting RDF triples from unstructured data streams. We employ statistical methods in combination with deduplication, disambiguation and unsupervised as well as supervised machine learning techniques to create a knowledge base that reflects the content of the input streams. We evaluate a sample of the RDF we generate against a large corpus of news streams and show that we achieve a precision of more than 85%.
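The pipeline described above turns facts found in text streams into RDF triples and removes duplicates. As a minimal sketch of that last serialization step (not the authors' implementation, and with invented placeholder IRIs), extracted facts can be rendered as N-Triples lines and deduplicated with a set:

```python
# Minimal sketch: serialize extracted (subject, predicate, object) facts as
# N-Triples and drop exact repeats across a stream. The IRIs below are
# hypothetical placeholders, not the namespaces used in the paper.

def to_ntriple(subject_iri, predicate_iri, obj):
    """Render one fact as an N-Triples line; non-IRI objects become literals."""
    if obj.startswith("http://"):
        obj_term = f"<{obj}>"
    else:
        obj_term = '"' + obj.replace('"', '\\"') + '"'
    return f"<{subject_iri}> <{predicate_iri}> {obj_term} ."

def deduplicate(facts):
    """Keep only the first occurrence of each identical fact."""
    seen, unique = set(), []
    for fact in facts:
        if fact not in seen:
            seen.add(fact)
            unique.append(fact)
    return unique

stream = [
    ("http://example.org/BerlinWall", "http://example.org/locatedIn", "http://example.org/Berlin"),
    ("http://example.org/BerlinWall", "http://example.org/locatedIn", "http://example.org/Berlin"),
    ("http://example.org/BerlinWall", "http://example.org/label", "Berlin Wall"),
]
triples = [to_ntriple(*fact) for fact in deduplicate(stream)]
for line in triples:
    print(line)
```

Real extraction additionally needs the disambiguation and learning steps the abstract mentions; this sketch only shows the output format.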
International Conference on Semantic Systems | 2015
Diego Esteves; Diego Moussallem; Ciro Baron Neto; Tommaso Soru; Markus Ackermann; Jens Lehmann
Over the last decades, many machine learning experiments have been published, benefiting scientific progress. To compare machine-learning experiment results with each other and collaborate effectively, experiments need to be performed thoroughly on the same computing environment, using the same sample datasets and algorithm configurations. Moreover, practical experience shows that scientists and engineers tend to produce large output data in their experiments, which is difficult to analyze and archive properly without provenance metadata. However, the Linked Data community still lacks a lightweight specification for interchanging machine-learning metadata across different architectures to achieve a higher level of interoperability. In this paper, we address this gap by presenting a novel vocabulary dubbed MEX. We show that MEX provides a prompt method to describe experiments with a special focus on data provenance and fulfills the requirements for long-term maintenance.
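The idea is to attach provenance metadata to each experiment run as RDF. The following sketch shows the general shape of such a description; the namespace and property names are invented placeholders, not the actual MEX vocabulary terms:

```python
# Sketch: describing one machine-learning execution as RDF triples with a
# MEX-like vocabulary. The namespace and property IRIs are made-up
# placeholders; consult the real MEX vocabulary for the actual terms.

MEX = "http://example.org/mex-like#"  # hypothetical namespace

def describe_execution(run_iri, algorithm, dataset_iri, measure, value):
    """Emit provenance-style triples for one experiment execution."""
    return [
        (run_iri, MEX + "algorithm", algorithm),
        (run_iri, MEX + "dataset", dataset_iri),
        (run_iri, MEX + measure, str(value)),
    ]

triples = describe_execution(
    "http://example.org/run/42", "SVM",
    "http://example.org/data/iris", "accuracy", 0.93)
for s, p, o in triples:
    print(s, p, o)
```

Because the description is plain RDF, runs from different architectures can be merged and queried uniformly, which is the interoperability goal the abstract names.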
European Semantic Web Conference | 2014
Markus Nentwig; Tommaso Soru; Axel-Cyrille Ngonga Ngomo; Erhard Rahm
Links between knowledge bases build the backbone of the Web of Data. Consequently, numerous applications have been developed to compute, evaluate and infer links. Still, the results of many of these applications remain inaccessible to the tools and frameworks that rely upon them. We address this problem by presenting LinkLion, a repository for links between knowledge bases. Our repository is designed as an open-access and open-source portal for the management and distribution of link discovery results. Users are empowered to upload links and specify how these were created. Moreover, users and applications can select and download sets of links via dumps or SPARQL queries. Currently, our portal contains 12.6 million links of 10 different types distributed across 3184 mappings that link 449 datasets. In this demo, we will present the repository as well as different means to access and extend the data it contains. The repository can be found at http://www.linklion.org.
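Links downloaded as dumps are typically N-Triples files of `owl:sameAs` statements. As a small illustration of consuming such a dump (the dump content below is made up; real dumps come from the portal or its SPARQL endpoint), the links can be parsed line by line with a regular expression:

```python
# Sketch: reading an owl:sameAs link dump line by line. The two example
# lines are invented; real LinkLion dumps are obtained from the portal.
import re

SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"
LINE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")

dump = """\
<http://dbpedia.org/resource/Leipzig> <http://www.w3.org/2002/07/owl#sameAs> <http://sws.geonames.org/2879139/> .
<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://sws.geonames.org/2950159/> .
"""

links = []
for line in dump.splitlines():
    m = LINE.match(line)
    if m and m.group(2) == SAMEAS:
        links.append((m.group(1), m.group(3)))

print(len(links), "sameAs links")
```

The same selection could equally be expressed as a SPARQL query against the repository's endpoint, which is the second access path the abstract mentions.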
International World Wide Web Conference | 2015
Tommaso Soru; Edgard Marx; Axel-Cyrille Ngonga Ngomo
The Linked Data principles provide a decentralized approach for publishing structured data in the RDF format on the Web. In contrast to structured data published in relational databases, where a key is often provided explicitly, finding a set of properties that allows identifying a resource uniquely is a non-trivial task. Still, finding keys is of central importance for manifold applications such as resource deduplication, link discovery, logical data compression and data integration. In this paper, we address this research gap by specifying a refinement operator, dubbed ROCKER, which we prove to be finite, proper and non-redundant. We combine the theoretical characteristics of this operator with two monotonicities of keys to obtain a time-efficient approach for detecting keys, i.e., sets of properties that describe resources uniquely. We then utilize a hash index to compute the discriminability score efficiently. Therewith, we ensure that our approach can scale to very large knowledge bases. Results show that ROCKER yields more accurate results, has a comparable runtime, and consumes less memory w.r.t. existing state-of-the-art techniques.
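The discriminability score measures how uniquely a set of properties identifies the resources of a knowledge base. The following is a simplified sketch of computing it with a hash index, assuming single-valued properties; the paper's formalization may differ in detail:

```python
# Sketch: discriminability of a candidate key, computed via a hash index.
# score = (# distinct value combinations) / (# resources); a score of 1.0
# means the property set identifies every resource uniquely, i.e. is a key.
# Simplified single-valued-property model, not the paper's exact definition.

def discriminability(resources, properties):
    index = set()
    for values in resources.values():
        index.add(tuple(values.get(p) for p in properties))
    return len(index) / len(resources)

people = {
    "alice": {"name": "Alice", "city": "Leipzig"},
    "alic2": {"name": "Alice", "city": "Berlin"},
    "bob":   {"name": "Bob",   "city": "Leipzig"},
}

print(discriminability(people, ["name"]))          # two "Alice"s: not a key
print(discriminability(people, ["name", "city"]))  # 1.0: a key
```

Note the monotonicity this example exhibits: adding a property to a set can never decrease the score, which is one of the properties a refinement-based search over candidate keys can exploit for pruning.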
International Conference on Semantic Systems | 2014
Tommaso Soru; Axel-Cyrille Ngonga Ngomo
The detection of links between resources is intrinsic to the vision of the Linked Data Web. Due to the sheer size of current knowledge bases, this task is commonly addressed by using tools. In particular, manifold link discovery frameworks have been developed. These frameworks implement several different machine-learning approaches to discovering links. In this paper, we investigate which of the commonly used supervised machine-learning classifiers performs best on the link discovery task. To this end, we first present our evaluation pipeline. Then, we compare ten different approaches on three artificial and three real-world benchmark data sets. The classification outcomes are subsequently compared with several state-of-the-art frameworks. Our results suggest that while several algorithms perform well, multilayer perceptrons perform best on average. Moreover, logistic regression seems best suited for noisy data.
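Link discovery can be framed as binary classification over candidate resource pairs: each pair has similarity features and a link / no-link label. The sketch below shows the general shape of such an evaluation pipeline on synthetic data, with two deliberately trivial stand-in classifiers rather than the ten approaches the paper actually compares:

```python
# Sketch of an evaluation pipeline for link discovery as binary
# classification. Each candidate pair is reduced to one similarity score
# plus a label; the two "classifiers" are trivial stand-ins.
import random

random.seed(0)
# Toy data: (similarity_score, is_link). Links tend to score higher.
data = [(random.uniform(0.6, 1.0), 1) for _ in range(50)] + \
       [(random.uniform(0.0, 0.5), 0) for _ in range(50)]
random.shuffle(data)
train, test = data[:70], data[70:]

def majority_classifier(train):
    """Baseline: always predict the most frequent training label."""
    majority = round(sum(y for _, y in train) / len(train))
    return lambda x: majority

def threshold_classifier(train):
    """Predict 'link' above a midpoint decision boundary."""
    links = [x for x, y in train if y == 1]
    nonlinks = [x for x, y in train if y == 0]
    theta = (min(links) + max(nonlinks)) / 2
    return lambda x: int(x >= theta)

def accuracy(clf, test):
    return sum(clf(x) == y for x, y in test) / len(test)

for name, factory in [("majority", majority_classifier),
                      ("threshold", threshold_classifier)]:
    print(name, round(accuracy(factory(train), test), 2))
```

A real pipeline would replace the stand-ins with classifiers such as multilayer perceptrons or logistic regression and report results across all six benchmark data sets.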
International World Wide Web Conference | 2015
Michael Martin; Konrad Abicht; Claus Stadler; Axel-Cyrille Ngonga Ngomo; Tommaso Soru; Sören Auer
CubeViz is a flexible exploration and visualization platform for statistical data represented according to the RDF Data Cube vocabulary. For data adhering to this vocabulary, CubeViz exhibits a faceted browsing widget that allows users to interactively filter the observations to be visualized in charts. Based on the selected structural part, CubeViz offers suitable chart types and options for users to configure the visualization. In this demo we present the CubeViz visualization architecture and components, sketch its underlying API and the libraries used to generate the desired output. By employing advanced introspection, analysis and visualization bootstrapping techniques, CubeViz hides the schema complexity of the encoded data in order to support a user-friendly exploration experience.
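The facet-then-chart flow can be reduced to plain data operations: observations carry dimension and measure values, a facet selection filters them, and a rule proposes a chart type. The following sketch is purely illustrative and is not CubeViz's actual API or chart-selection logic:

```python
# Sketch: faceted filtering of Data-Cube-style observations followed by a
# trivial chart-type suggestion. Dimension names and the rule are invented.

observations = [
    {"year": 2013, "country": "DE", "population": 80.6},
    {"year": 2014, "country": "DE", "population": 81.1},
    {"year": 2013, "country": "IT", "population": 60.2},
]

def facet_filter(obs, **selection):
    """Keep observations matching every selected dimension value."""
    return [o for o in obs
            if all(o[dim] == val for dim, val in selection.items())]

def suggest_chart(selected, free_dimension):
    """More than one value on the free dimension -> bar chart, else table."""
    values = {o[free_dimension] for o in selected}
    return "bar" if len(values) > 1 else "table"

german = facet_filter(observations, country="DE")
print(len(german), suggest_chart(german, "year"))
```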
International Semantic Technology Conference | 2012
Tommaso Soru; Axel-Cyrille Ngonga Ngomo
Discovering cross-knowledge-base links is of central importance for manifold tasks across the Linked Data Web. So far, learning link specifications has been addressed by approaches that rely on standard similarity and distance measures such as the Levenshtein distance for strings and the Euclidean distance for numeric values. While these approaches have been shown to perform well, the use of standard similarity measures still hampers their accuracy, as several link discovery tasks can only be solved sub-optimally when relying on standard measures. In this paper, we address this drawback by presenting a novel approach to learning string similarity measures concurrently across multiple dimensions directly from labeled data. Our approach is based on learning linear classifiers which rely on a learned edit distance within an active learning setting. By using this combination of paradigms, we can ensure that we reduce the labeling burden on the experts at hand while achieving superior results on datasets for which edit distances are useful. We evaluate our approach on three different real datasets and show that our approach can improve the accuracy of classifiers. We also discuss how our approach can be extended to other similarity and distance measures as well as different classifiers.
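A learned edit distance replaces the uniform operation costs of the standard Levenshtein distance with costs fitted to the data. The sketch below shows the mechanics with hand-set substitution weights; in the paper the weights are learned from labeled pairs inside an active-learning loop, which is omitted here:

```python
# Sketch: edit distance with a learnable substitution-cost matrix. Costs
# are set by hand here to show the effect of non-uniform weights; the
# paper learns them from labeled link examples.

def weighted_edit_distance(a, b, sub_cost=None, default=1.0):
    sub_cost = sub_cost or {}
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * default
    for j in range(1, n + 1):
        d[0][j] = j * default
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cost = 0.0
            else:
                cost = sub_cost.get((a[i - 1], b[j - 1]), default)
            d[i][j] = min(d[i - 1][j] + default,   # deletion
                          d[i][j - 1] + default,   # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(weighted_edit_distance("color", "colour"))                    # 1.0
# Treat 'c' -> 'k' as cheap, as a learner might discover from data:
print(weighted_edit_distance("color", "kolor", {("c", "k"): 0.1}))  # 0.1
```

Lowering the cost of substitutions that frequently occur between matching resources (spelling variants, transliterations) is exactly what lets the learned measure outperform the uniform one on such datasets.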
IEEE International Conference on Semantic Computing | 2017
Edgard Marx; Saeedeh Shekarpour; Tommaso Soru; Adrian M.P. Braşoveanu; Muhammad Saleem; Ciro Baron; Albert Weichselbraun; Jens Lehmann; Axel-Cyrille Ngonga Ngomo; Sören Auer
Over the last years, the amount of data published as Linked Data on the Web has grown enormously. In spite of the high availability of Linked Data, organizations still encounter an accessibility challenge while consuming it. This is mostly due to the large size of some of the datasets published as Linked Data. The core observation behind this work is that a subset of these datasets suffices to address the needs of most organizations. In this paper, we introduce Torpedo, an approach for efficiently selecting and extracting relevant subsets from RDF datasets. In particular, Torpedo adds optimization techniques to reduce seek operation costs, as well as support for multi-join graph patterns and SPARQL FILTERs, which enable a more granular data selection. We compare the performance of our approach with existing solutions on nine different queries against four datasets. Our results show that our approach is highly scalable and is up to 26% faster than the current state-of-the-art RDF dataset slicing approach.
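To illustrate what FILTER-style granular selection means when slicing a dump (this is only an illustration, not Torpedo's optimized seek-based engine), a single triple pattern plus an object-level regex filter can be applied to a streamed N-Triples file:

```python
# Sketch: slicing an N-Triples stream with a predicate constraint plus a
# FILTER-like regex on the object. Example data and IRIs are invented.
import re

TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.$")

dump = """\
<http://ex.org/p1> <http://ex.org/name> "Anna Schmidt" .
<http://ex.org/p2> <http://ex.org/name> "Bruno Rossi" .
<http://ex.org/p1> <http://ex.org/age> "29" .
"""

def slice_stream(lines, predicate, object_filter):
    """Yield lines matching the predicate whose object passes the filter."""
    for line in lines:
        m = TRIPLE.match(line.strip())
        if m and m.group(2) == predicate and object_filter(m.group(3)):
            yield line

selected = list(slice_stream(dump.splitlines(),
                             "http://ex.org/name",
                             lambda o: re.search(r"Schmidt", o)))
print(len(selected))
```

Multi-join patterns additionally require matching several such patterns that share variables, which is where the seek-cost optimizations the abstract mentions come into play.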
International Journal of Semantic Computing | 2013
Edgard Marx; Tommaso Soru; Saeedeh Shekarpour; Sören Auer; Axel-Cyrille Ngonga Ngomo; Karin Koogan Breitman
Over the last years, a considerable amount of structured data has been published on the Web as Linked Open Data (LOD). Despite recent advances, consuming and using Linked Open Data within an organization is still a substantial challenge. Many of the LOD datasets are quite large, and despite progress in Resource Description Framework (RDF) data management, their loading and querying within a triple store is extremely time-consuming and resource-demanding. To overcome this consumption obstacle, we propose a process inspired by the classical Extract-Transform-Load (ETL) paradigm. In this article, we focus particularly on the selection and extraction steps of this process. We devise a fragment of SPARQL Protocol and RDF Query Language (SPARQL) dubbed SliceSPARQL, which enables the selection of well-defined slices of datasets fulfilling typical information needs. SliceSPARQL supports graph patterns for which each connected subgraph pattern involves a maximum of one variable or Internationalized Resource Identifier (IRI) in its join conditions. This restriction guarantees the efficient processing of the query against a sequential dataset dump stream. Furthermore, we evaluate our slicing approach on three different optimization strategies. Results show that dataset slices can be generated an order of magnitude faster than by using the conventional approach of loading the whole dataset into a triple store.
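The single-variable join restriction is what makes sequential processing possible: a slice like "all triples about resources of type City" needs only two scans over the dump, never a full triple-store load. A minimal sketch under that assumption (IRIs illustrative, not the paper's evaluation data):

```python
# Sketch: a SliceSPARQL-style slice over a sequential dump. Pass 1 collects
# subjects matching a fixed (?s rdf:type :City) pattern; pass 2 emits every
# triple whose subject is in that set. Two scans, no triple store.
import re

TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.$")
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

dump = [
    '<http://ex.org/Leipzig> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ex.org/City> .',
    '<http://ex.org/Leipzig> <http://ex.org/population> "600000" .',
    '<http://ex.org/Soru> <http://ex.org/affiliation> <http://ex.org/Leipzig> .',
]

def parse(line):
    m = TRIPLE.match(line)
    return m.groups() if m else None

# Pass 1: find subjects bound by the anchor pattern.
subjects = {s for s, p, o in map(parse, dump)
            if p == RDF_TYPE and o == "<http://ex.org/City>"}
# Pass 2: emit all triples about those subjects.
slice_ = [line for line in dump if parse(line)[0] in subjects]
print(len(slice_))
```

A pattern whose join conditions involved two variables would instead require joining arbitrary positions in the stream, which is precisely what the SliceSPARQL fragment rules out.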
IEEE International Conference on Semantic Computing | 2017
Edgard Marx; Ciro Baron; Tommaso Soru; Sören Auer
The Semantic Web architecture choices have led to a tremendous amount of information being published on the Web. However, querying and building applications on top of Linked Open Data remains significantly challenging. In this work, we present Knowledge Box (KBox), an approach for transparently shifting query execution on Knowledge Graphs to the edge. We show that our approach makes the consumption of Knowledge Graphs more reliable and faster than previously introduced methods. In practice, KBox demands less time and fewer resources to set up than traditional approaches, while being twice as fast as SPARQL endpoints, even when serving a single client.
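The core idea, answering queries from a local copy of the knowledge graph instead of a remote SPARQL endpoint, can be sketched with a minimal in-memory triple store. This is an illustration of the execution-at-the-edge concept only, not KBox's actual API or storage layer:

```python
# Sketch: answering triple-pattern queries from a local in-memory store,
# illustrating query execution on the client ("edge") rather than over
# the network. Not KBox's actual API; data is invented.

class LocalKB:
    def __init__(self, triples):
        self.triples = list(triples)

    def query(self, s=None, p=None, o=None):
        """None acts as a wildcard, like a SPARQL variable."""
        return [(ts, tp, to) for ts, tp, to in self.triples
                if s in (None, ts) and p in (None, tp) and o in (None, to)]

kb = LocalKB([
    ("ex:Soru", "ex:affiliation", "ex:LeipzigUniversity"),
    ("ex:Soru", "ex:field", "ex:SemanticWeb"),
    ("ex:Marx", "ex:affiliation", "ex:LeipzigUniversity"),
])

print(kb.query(p="ex:affiliation"))         # two matches
print(kb.query(s="ex:Soru", p="ex:field"))  # one match
```

Once the graph is local, availability no longer depends on a remote endpoint, which is the reliability gain the abstract claims.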