Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Andreas Spitz is active.

Publication


Featured research published by Andreas Spitz.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016

Terms over LOAD: Leveraging Named Entities for Cross-Document Extraction and Summarization of Events

Andreas Spitz; Michael Gertz

Real-world events, such as historic incidents, typically contain both spatial and temporal aspects and involve a specific group of persons. This is reflected in the descriptions of events in textual sources, which contain mentions of named entities and dates. Given a large collection of documents, however, such descriptions may be incomplete in a single document, or spread across multiple documents. In these cases, it is beneficial to leverage partial information about the entities that are involved in an event to extract missing information. In this paper, we introduce the LOAD model for cross-document event extraction in large-scale document collections. The graph-based model relies on co-occurrences of named entities belonging to the classes locations, organizations, actors, and dates and puts them in the context of surrounding terms. As such, the model allows for efficient queries and can be updated incrementally in negligible time to reflect changes to the underlying document collection. We discuss the versatility of this approach for event summarization, the completion of partial event information, and the extraction of descriptions for named entities and dates. We create and provide a LOAD graph for the documents in the English Wikipedia from named entities extracted by state-of-the-art NER tools. Based on an evaluation set of historic data that includes summaries of diverse events, we evaluate the resulting graph. We find that the model not only allows for near real-time retrieval of information from the underlying document collection, but also provides a comprehensive framework for browsing and summarizing event data.
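The co-occurrence idea at the core of a LOAD-style graph can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: the sentence window of size two and the 1/(distance+1) weight decay are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations

def build_load_graph(sentences, window=2):
    """Link entity mentions that co-occur within `window` sentences.

    `sentences` is a list of lists of (entity, cls) tuples, where cls
    is one of 'LOC', 'ORG', 'ACT', 'DAT'. Edge weights decay with the
    sentence distance between the two mentions.
    """
    graph = defaultdict(float)
    for i, sent in enumerate(sentences):
        # co-occurrence within the same sentence: full weight
        for a, b in combinations(sorted(set(sent)), 2):
            graph[(a, b)] += 1.0
        # cross-sentence co-occurrence: weight 1 / (distance + 1)
        for d in range(1, window + 1):
            if i + d >= len(sentences):
                break
            for a in set(sent):
                for b in set(sentences[i + d]):
                    if a != b:
                        graph[tuple(sorted((a, b)))] += 1.0 / (d + 1)
    return dict(graph)

# toy "document" of three sentences with annotated mentions
docs = [
    [("Napoleon", "ACT"), ("Waterloo", "LOC")],
    [("1815", "DAT")],
    [("Wellington", "ACT")],
]
g = build_load_graph(docs)
```

Entities in the same sentence receive weight 1.0; mentions one sentence apart contribute 0.5, two sentences apart one third, which gives the proximity-sensitive edges the abstract describes.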


PLOS ONE | 2014

Measuring long-term impact based on network centrality: unraveling cinematic citations.

Andreas Spitz; Emoke Ágnes Horvát

Traditional measures of success for film, such as box-office revenue and critical acclaim, lack the ability to quantify long-lasting impact and depend on factors that are largely external to the craft itself. With the growing number of films that are being created and large-scale data becoming available through crowd-sourced online platforms, an endogenous measure of success that is not reliant on manual appraisal is of increasing importance. In this article we propose such a ranking method based on a combination of centrality indices. We apply the method to a network that contains several types of citations between more than 40,000 international feature films. From this network we derive a list of milestone films, which can be considered to constitute the foundations of cinema. In a comparison to various existing lists of ‘greatest’ films, such as personal favourite lists, voting lists, lists of individual experts, and lists deduced from expert polls, the selection of milestone films is more diverse in terms of genres, actors, and main creators. Our results shed light on the potential of a systematic quantitative investigation based on cinematic influences in identifying the most inspiring creations in world cinema. In a broader perspective, we introduce a novel research question to large-scale citation analysis, one of the most intriguing topics that have been at the forefront of scientific enquiries for the past fifty years and have led to the development of various network analytic methods. In doing so, we transfer widely studied approaches from citation analysis to the newly emerging field of quantification efforts in the arts. The specific contribution of this paper consists in modelling the multidimensional cinematic references as a growing multiplex network and in developing a methodology for the identification of central films in this network.


Advances in Social Networks Analysis and Mining | 2015

Beyond Friendships and Followers: The Wikipedia Social Network

Johanna Geib; Andreas Spitz; Michael Gertz

Most traditional social networks rely on explicitly given relations between users, their friends and followers. In this paper, we go beyond well-structured data repositories and create a person-centric network from unstructured text - the Wikipedia Social Network. To identify persons in Wikipedia, we make use of interwiki links, Wikipedia categories and person-related information available in Wikidata. From the co-occurrences of persons on a Wikipedia page we construct a large-scale person-centric network and provide a weighting scheme for the relationship of two persons based on the distances of their mentions within the text. We extract key characteristics of the network such as centrality, clustering coefficient and component sizes for which we find values that are typical for social networks. Using state-of-the-art algorithms for community detection in massive networks, we identify interesting communities and evaluate them against Wikipedia categories. The Wikipedia social network developed this way provides an important source for future social analysis tasks.
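A distance-based weighting of co-mentions can be illustrated with a toy function. The 1/distance contribution and the 50-token cutoff are assumptions for illustration; the paper defines its own weighting scheme.

```python
from itertools import product

def co_mention_weight(positions_a, positions_b, cutoff=50):
    """Weight the tie between two persons by the proximity of their
    mentions (token offsets) within one article: each pair of mentions
    closer than `cutoff` tokens contributes 1/distance."""
    w = 0.0
    for pa, pb in product(positions_a, positions_b):
        d = abs(pa - pb)
        if 0 < d <= cutoff:
            w += 1.0 / d
    return w

# mentions of two persons at token offsets within one article:
# only the close pairs (3,7) and (3,10) contribute; (120, *) is too far
w = co_mention_weight([3, 120], [7, 10])
```

Close co-mentions dominate the weight, so a pair of persons repeatedly named in the same sentence ends up with a far stronger edge than a pair that merely shares an article.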


International Conference on Management of Data | 2016

So far away and yet so close: augmenting toponym disambiguation and similarity with text-based networks

Andreas Spitz; Johanna Geiß; Michael Gertz

Place similarity has a central role in geographic information retrieval and geographic information systems, where spatial proximity is frequently just a poor substitute for semantic relatedness. For applications such as toponym disambiguation, alternative measures are thus required to answer the non-trivial question of place similarity in a given context. In this paper, we discuss a novel approach to the construction of a network of locations from unstructured text data. By deriving similarity scores based on the textual distance of toponyms, we obtain a kind of relatedness that encodes the importance of the co-occurrences of place mentions. Based on the text of the English Wikipedia, we construct and provide such a network of place similarities, including entity linking to Wikidata as an augmentation of the contained information. In an analysis of centrality, we explore the network's capability to capture the similarity between places. An evaluation of the network for the task of toponym disambiguation on the AIDA CoNLL-YAGO dataset reveals a performance that is in line with state-of-the-art methods.


Geographic Information Retrieval | 2015

The Wikipedia location network: overcoming borders and oceans

Johanna Geiß; Andreas Spitz; Jannik Strötgen; Michael Gertz

In social network analysis and information retrieval, research has recently been devoted to the extraction of implicit relationships between persons from unstructured textual sources. In this paper, we adapt such a person-centric approach to the extraction of locations and build the Wikipedia Location Network based on co-occurrences of place names in the English Wikipedia. We summarize the network's characteristics and demonstrate its value for future location relationship analysis tasks.


PLOS ONE | 2016

Assessing Low-Intensity Relationships in Complex Networks.

Andreas Spitz; Anna Gimmler; Thorsten Stoeck; Katharina Anna Zweig; Emoke Ágnes Horvát

Many large network data sets are noisy and contain links representing low-intensity relationships that are difficult to differentiate from random interactions. This is especially relevant for high-throughput data from systems biology and large-scale ecological data, but also for Web 2.0 data on human interactions. In these networks with missing and spurious links, it is possible to refine the data based on the principle of structural similarity, which assesses the shared neighborhood of two nodes. By using similarity measures to globally rank all possible links and choosing the top-ranked pairs, true links can be validated, missing links inferred, and spurious observations removed. While many similarity measures have been proposed to this end, there is no general consensus on which one to use. In this article, we first contribute a set of benchmarks for complex networks from three different settings (e-commerce, systems biology, and social networks) and thus enable a quantitative performance analysis of classic node similarity measures. Based on this, we then propose a new methodology for link assessment called z*, which assesses the statistical significance of the number of common neighbors of two nodes by comparison with the expected value in a suitably chosen random graph model, and which is a consistently top-performing algorithm for all benchmarks. In addition to a global ranking of links, we also use this method to identify the most similar neighbors of each single node in a local ranking, thereby showing the versatility of the method in two distinct scenarios and augmenting its applicability. Finally, we perform an exploratory analysis on an oceanographic plankton data set and find that the distribution of microbes follows similar biogeographic rules as those of macroorganisms, a result that rejects the global dispersal hypothesis for microbes.
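The general shape of such a significance test can be sketched with a simplified null model. Note the simplification: this sketch scores common neighbors against a hypergeometric null model with fixed degrees and uniformly random neighbors, whereas the paper's z* compares against the fixed degree sequence model.

```python
import math

def z_score_common_neighbors(adj, u, v):
    """z-score of the observed number of common neighbors of u and v
    under a hypergeometric null model: each node keeps its degree but
    picks its neighbors uniformly at random from the other nodes."""
    n = len(adj)
    cooc = len(adj[u] & adj[v])          # observed common neighbors
    ku, kv = len(adj[u]), len(adj[v])
    N = n - 2                            # candidate common neighbors
    mean = ku * kv / N
    var = kv * (ku / N) * (1 - ku / N) * (N - kv) / (N - 1)
    return (cooc - mean) / math.sqrt(var) if var > 0 else 0.0

# small undirected graph as an adjacency-set dict
adj = {
    "a": {"c", "d", "e"},
    "b": {"c", "d", "f"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"a"},
    "f": {"b"},
}
z = z_score_common_neighbors(adj, "a", "b")
```

Ranking all node pairs by this score and keeping the top-ranked ones is the link-assessment step the abstract describes: pairs whose common-neighbor count far exceeds the null expectation are kept as true links.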


International World Wide Web Conference | 2015

Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model

Andreas Spitz; Jannik Strötgen; Thomas Bögel; Michael Gertz

Approaches in support of the extraction and exploration of temporal information in documents provide an important ingredient in many of today's frameworks for text analysis. Methods range from basic techniques, primarily the extraction of temporal expressions and events from documents, to more sophisticated approaches such as ranking of documents with respect to their temporal relevance to some query term or the construction of timelines. Almost all of these approaches operate on the document level, that is, for a collection of documents a timeline is extracted or a ranked list of documents is returned for a temporal query term. In this paper, we present an approach to characterize individual dates, which can be of different granularities, and terms. Given a query date, a ranked list of terms is determined that are highly relevant for that date and best summarize the date. Analogously, for a query term, a ranked list of dates is determined that best characterize the term. Focusing on just dates and single terms as they occur in documents provides a fine-grained query and exploration method for document collections. Our approach is based on a weighted bipartite graph representing the co-occurrences of time expressions and terms in a collection of documents. We present different measures to obtain a ranked list of dates and terms for a query term and date, respectively. Our experiments and evaluation using Wikipedia as a document collection show that our approach provides an effective means in support of date and temporal term summarization.
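The bipartite term-date structure can be sketched as follows. The paper proposes several ranking measures; this sketch ranks by plain co-occurrence weight only, and the sample data is invented for illustration.

```python
from collections import defaultdict

def build_bipartite(cooccurrences):
    """Weighted bipartite graph from (term, date, count) triples,
    indexed from both sides for term->dates and date->terms queries."""
    by_date = defaultdict(lambda: defaultdict(int))
    by_term = defaultdict(lambda: defaultdict(int))
    for term, date, count in cooccurrences:
        by_date[date][term] += count
        by_term[term][date] += count
    return by_date, by_term

def rank_terms_for_date(by_date, date, k=3):
    """Terms that best summarize a query date, by co-occurrence weight."""
    terms = by_date[date]
    return sorted(terms, key=terms.get, reverse=True)[:k]

pairs = [
    ("moon landing", "1969-07-20", 12),
    ("apollo", "1969-07-20", 9),
    ("woodstock", "1969-07-20", 1),
    ("apollo", "1970-04-11", 4),
]
by_date, by_term = build_bipartite(pairs)
top = rank_terms_for_date(by_date, "1969-07-20")
```

The symmetric query, a ranked list of dates for a query term, reads off `by_term` the same way, which is the dual direction the abstract describes.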


Advances in Social Networks Analysis and Mining | 2015

Breaking the News: Extracting the Sparse Citation Network Backbone of Online News Articles

Andreas Spitz; Michael Gertz

Networks of online news articles and blog posts are some of the most commonly used data sets in network science. As a result, they have become a vital piece of network analysis and are used for the evaluation of algorithms that work on large networks, or serve as examples in the analysis of information diffusion and propagation. Similarly, scientific citation networks are part of the bedrock upon which much of modern network analysis is built and have been studied for decades. In this paper, we show that the backbone inherent to networks of online news articles shares significant structural similarities with scientific citation networks once the noise of spurious links is stripped away. We present a data set of news articles that, while it is extremely sparse and lightweight, still contains information relevant to the propagation of information in mass media and is remarkably similar to scientific citation networks, thus opening the door to the use of established methodologies from scientometrics and bibliometrics in the analysis of online news propagation.


International World Wide Web Conference | 2017

EVELIN: Exploration of Event and Entity Links in Implicit Networks

Andreas Spitz; Satya Almasian; Michael Gertz

Implicit networks that describe latent entity relations have been demonstrated to be valuable tools in information retrieval, knowledge extraction, and search in document collections. While such implicit relations offer less insight into the types of connection between entities than traditional knowledge bases, they are much easier to extract from unstructured textual sources. Furthermore, they allow the derivation of relationship strength between entities that can be used to identify and leverage important co-mentions, based on which complex constructs of semantically related entities can be assembled with ease. One example of such implicit networks are LOAD graphs, which encode the textual proximity of location-, organization-, actor-, and date-mentions in document collections for the exploration, identification and summarization of events and entity relations. Here, we present EVELIN as a graphical, web-based interface for the exploration of such implicit networks of entities on the example of a large-scale network constructed from the English Wikipedia. The interface is available for online use at http://evelin.ifi.uni-heidelberg.de/.


Advances in Social Networks Analysis and Mining | 2015

Exploiting Phase Transitions for the Efficient Sampling of the Fixed Degree Sequence Model

Christian Brugger; André Lucas Chinazzo; Alexandre Flores John; Christian de Schryver; Norbert Wehn; Andreas Spitz; Katharina Anna Zweig

Real-world network data is often very noisy and contains erroneous or missing edges. These superfluous and missing edges can be identified statistically by assessing the number of common neighbors of the two incident nodes. To evaluate whether this number of common neighbors, the so-called co-occurrence, is statistically significant, a comparison with the expected co-occurrence in a suitable random graph model is required. For networks with a skewed degree distribution, including most real-world networks, it is known that the fixed degree sequence model, which maintains the degrees of nodes, is preferable to simplified graph models that are based on an independence assumption. However, the use of a fixed degree sequence model requires sampling from the space of all graphs with the given degree sequence and measuring the co-occurrence of each pair of nodes in each of the samples, since there is no known closed formula for this statistic. While there exist log-linear approaches such as Markov chain Monte Carlo sampling, the computational complexity still depends on the length of the Markov chain and the number of samples, which is significant in large-scale networks. In this article, we show based on ground truth data that there are various phase transition-like tipping points that enable us to choose a comparatively low number of samples and to reduce the length of the Markov chains without reducing the quality of the significance test. As a result, the computational effort can be reduced by orders of magnitude.
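The underlying Markov-chain step, a degree-preserving double edge swap, can be sketched as follows. This is only the basic sampling move; the paper's contribution, choosing the chain length and sample count via phase transitions, is not addressed here, and the step count is arbitrary.

```python
import random

def edge_swap_sample(edges, steps, rng):
    """Run a chain of double edge swaps: pick two edges (a,b), (c,d)
    and rewire them to (a,d), (c,b) whenever the swap creates neither
    a self-loop nor a multi-edge. Every node's degree is preserved."""
    edges = [tuple(e) for e in edges]
    edge_set = set(frozenset(e) for e in edges)
    for _ in range(steps):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # swap would create a multi-edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def degrees(edge_list):
    """Degree of every node in an undirected edge list."""
    deg = {}
    for a, b in edge_list:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

rng = random.Random(0)
original = [(0, 1), (2, 3), (4, 5), (0, 2), (1, 3)]
sampled = edge_swap_sample(original, steps=100, rng=rng)
```

Repeating this chain from the observed network yields samples from the fixed degree sequence model, against which the co-occurrence of each node pair is then measured.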

Collaboration


Dive into Andreas Spitz's collaborations.

Top Co-Authors

Katharina Anna Zweig

Kaiserslautern University of Technology


Kai Chen

Heidelberg University
