Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Thomas Gottron is active.

Publication


Featured research published by Thomas Gottron.


Web Science | 2011

Bad news travel fast: a content-based analysis of interestingness on Twitter

Nasir Naveed; Thomas Gottron; Jérôme Kunegis; Arifah Che Alhadi

On the microblogging site Twitter, users can forward any message they receive to all of their followers. This is called a retweet and is usually done when users find a message particularly interesting and worth sharing with others. Thus, retweets reflect what the Twitter community considers interesting on a global scale, and can be used as a function of interestingness to generate a model describing the content-based characteristics of retweets. In this paper, we analyze a set of high- and low-level content-based features on several large collections of Twitter messages. We train a prediction model to forecast, for a given tweet, its likelihood of being retweeted based on its contents. From the parameters learned by the model we deduce which content features contribute to the likelihood of a retweet. As a result we obtain insights into what makes a message on Twitter worth retweeting and, thus, interesting.
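The prediction step can be sketched as a logistic model over content features. The features and weights below are purely illustrative, not the parameters learned in the paper; they only show the shape of such a content-based scorer.

```python
import math
import re

# Hypothetical content features in the spirit of the paper's analysis.
# The weights are illustrative, NOT the ones learned in the paper.
WEIGHTS = {"bias": -1.5, "has_url": 0.9, "has_hashtag": 0.6,
           "has_mention": -0.3, "negative_term": 0.8, "exclamations": 0.2}

NEGATIVE_TERMS = {"bad", "crisis", "dead", "war", "crash"}

def features(tweet):
    tokens = set(re.findall(r"\w+", tweet.lower()))
    return {
        "has_url": 1.0 if "http" in tweet else 0.0,
        "has_hashtag": 1.0 if "#" in tweet else 0.0,
        "has_mention": 1.0 if "@" in tweet else 0.0,
        "negative_term": 1.0 if NEGATIVE_TERMS & tokens else 0.0,
        "exclamations": float(tweet.count("!")),
    }

def retweet_likelihood(tweet):
    # Logistic link: maps the weighted feature sum to a probability.
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features(tweet).items())
    return 1.0 / (1.0 + math.exp(-z))
```

With these toy weights, a negatively phrased tweet carrying a link and a hashtag scores noticeably higher than a plain personal status update, mirroring the "bad news travel fast" finding.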


Database and Expert Systems Applications | 2008

Content Code Blurring: A New Approach to Content Extraction

Thomas Gottron

Most HTML documents on the world wide web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions, we show that for most documents content code blurring delivers the best results.
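The iterative idea can be sketched as follows: encode the document as a content/code signal, blur it repeatedly, and keep the tokens whose smoothed score stays high. This is a minimal reconstruction from the abstract, not the paper's exact algorithm.

```python
def blur(signal, passes=10):
    # Repeatedly average each value with its neighbours (a simple box blur),
    # so long homogeneous text regions keep high scores while isolated
    # text fragments inside markup are smoothed away.
    s = list(signal)
    for _ in range(passes):
        s = [(s[max(i - 1, 0)] + s[i] + s[min(i + 1, len(s) - 1)]) / 3.0
             for i in range(len(s))]
    return s

def content_regions(tokens, threshold=0.7, passes=10):
    # tokens: list of (text, is_content) pairs; is_content is 1 for visible
    # text and 0 for markup/code. Returns the tokens whose blurred content
    # score reaches the threshold, i.e. the main-content region.
    scores = blur([float(c) for _, c in tokens], passes)
    return [text for (text, _), s in zip(tokens, scores) if s >= threshold]
```

A long run of text tokens survives the blurring, while short text snippets embedded in markup (menu entries, banners) fall below the threshold.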


European Conference on Information Retrieval | 2010

A comparison of language identification approaches on short, query-style texts

Thomas Gottron; Nedim Lipka

In a multi-language Information Retrieval setting, knowledge about the language of a user query is important for further processing. Hence, we compare the performance of some typical approaches for language detection on very short, query-style texts. The results show that even for single words an accuracy of more than 80% can be achieved; for slightly longer texts we even observed accuracy values close to 100%.
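A typical approach of the kind compared here is character n-gram profiling: build an n-gram profile per language from training text and assign a query to the language whose profile it overlaps most. The two-language toy corpus below is hypothetical and only illustrates the mechanism.

```python
from collections import Counter

def ngram_profile(text, n=3):
    # Pad with spaces so word boundaries contribute n-grams too.
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def detect_language(query, training_texts):
    # training_texts: {language: sample text} (hypothetical toy corpus).
    # Score each language by the dot product of n-gram count profiles.
    qp = ngram_profile(query)
    def score(lang):
        lp = ngram_profile(training_texts[lang])
        return sum(qp[g] * lp[g] for g in qp)
    return max(training_texts, key=score)
```

Even a single short word usually carries enough character n-grams to separate dissimilar languages, which is consistent with the 80%-for-single-words result.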


Journal of Web Semantics | 2012

SchemEX - Efficient construction of a data catalogue by stream-based indexing of linked data

Mathias Konrath; Thomas Gottron; Steffen Staab; Ansgar Scherp

We present SchemEX, an approach and tool for a stream-based indexing and schema extraction of Linked Open Data (LOD) at web-scale. The schema index provided by SchemEX can be used to locate distributed data sources in the LOD cloud. It serves typical LOD information needs such as finding sources that contain instances of one specific data type, of a given set of data types (so-called type clusters), or of instances in type clusters that are connected by one or more common properties (so-called equivalence classes). The entire process of extracting the schema from triples and constructing an index is designed to have linear runtime complexity. Thus, the schema index can be computed on-the-fly while the triples are crawled and provided as a stream by a linked data spider. To demonstrate the web-scalability of our approach, we have computed a SchemEX index over the Billion Triples Challenge (BTC) dataset 2011 consisting of 2,170 million triples. In addition, we have computed the SchemEX index on a dataset with 11 million triples. We use this smaller dataset for conducting a detailed qualitative analysis. We are capable of locating relevant data sources with recall between 71% and 98% and a precision between 74% and 100% at a window size of 100K triples observed in the stream, depending on the complexity of the query, i.e., whether one wants to find specific data types, type clusters or equivalence classes.
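The stream-based, windowed indexing can be sketched as follows: keep only the most recent subjects in memory, collect each subject's set of types (its type cluster), and on eviction map the frozen cluster to the data source it was seen in. This is a simplified sketch of the idea; the real SchemEX also extracts properties and equivalence classes.

```python
from collections import defaultdict, deque

RDF_TYPE = "rdf:type"

def schemex_index(triple_stream, window_size=1000):
    # triple_stream yields (subject, predicate, object, source) tuples.
    window = deque()              # subjects in arrival order (bounded memory)
    types = {}                    # subject -> set of its rdf:type objects
    sources = {}                  # subject -> data source it came from
    index = defaultdict(set)      # type cluster -> set of data sources

    def evict(subject):
        index[frozenset(types.pop(subject))].add(sources.pop(subject))

    for s, p, o, src in triple_stream:
        if s not in types:
            if len(window) == window_size:
                evict(window.popleft())   # forget the oldest subject
            window.append(s)
            types[s], sources[s] = set(), src
        if p == RDF_TYPE:
            types[s].add(o)
    while window:                          # flush remaining subjects
        evict(window.popleft())
    return dict(index)
```

Because each triple is touched once and the window is bounded, the construction runs in linear time over the stream, matching the linear-complexity design stated above.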


Conference on Information and Knowledge Management | 2011

Insights into explicit semantic analysis

Thomas Gottron; Maik Anderka; Benno Stein

Since its debut, Explicit Semantic Analysis (ESA) has received much attention in the IR community. ESA has been shown to perform surprisingly well in several tasks and in different contexts. However, given the conceptual motivation for ESA, recent work has observed unexpected behavior. In this paper we look at the foundations of ESA from a theoretical point of view and employ a general probabilistic model for term weights which reveals how ESA actually works. Based on this model we explain some of the phenomena that have been observed in previous work and support our findings with new experiments. Moreover, we provide a theoretical grounding on how the size and the composition of the index collection affect the ESA-based computation of similarity values for texts.
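As background for the analysis: ESA represents a term as a vector of term weights over an index collection of concept articles (classically Wikipedia), sums the vectors of a text's terms, and compares texts by cosine similarity. The tiny concept index below is a made-up toy, standing in for the full collection.

```python
import math

# Toy "concept index": hypothetical TF-IDF-style weights of terms in three
# Wikipedia-like concept articles. Real ESA uses the whole index collection.
CONCEPT_INDEX = {
    "bank":  {"Finance": 0.8, "River": 0.5},
    "money": {"Finance": 0.9},
    "water": {"River": 0.9},
}

def esa_vector(text):
    # A text's ESA vector is the sum of its terms' concept vectors.
    vec = {}
    for term in text.lower().split():
        for concept, w in CONCEPT_INDEX.get(term, {}).items():
            vec[concept] = vec.get(concept, 0.0) + w
    return vec

def esa_similarity(a, b):
    # Cosine similarity in concept space.
    va, vb = esa_vector(a), esa_vector(b)
    dot = sum(va[c] * vb.get(c, 0.0) for c in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

The paper's point is precisely that the behavior of this computation hinges on how the term weights in the concept index are distributed, i.e. on the size and composition of the index collection.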


EDBT/ICDT Workshops | 2013

LOVER: support for modeling data using linked open vocabularies

Johann Schaible; Thomas Gottron; Stefan Scheglmann; Ansgar Scherp

Various best practices and principles are provided to guide an ontology engineer when modeling Linked Data. The choice of appropriate vocabularies is one essential aspect in the guidelines, as it leads to better interpretation, querying, and consumption of the data by Linked Data applications and users. In this paper, we propose LOVER: a novel approach to support the ontology engineer in modeling a Linked Data dataset. We illustrate the concept of LOVER, which supports the engineer by recommending appropriate classes and properties from existing and actively used vocabularies. The recommendations are made on the basis of an iterative multimodal search. It uses different, orthogonal information sources for finding vocabulary terms, e.g. based on a best string match or schema information on other datasets published in the Linked Open Data cloud. We describe LOVER's recommendation mechanism in general and illustrate it along a real-life example from the social sciences domain.
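One of the orthogonal information sources mentioned above, the best string match, can be sketched as ranking catalogue terms by string similarity to the engineer's concept label. The vocabulary catalogue below is a hypothetical toy; LOVER combines several such sources iteratively.

```python
import difflib

# Hypothetical vocabulary catalogue: term -> human-readable label.
VOCABULARIES = {
    "foaf:Person": "person", "foaf:name": "name",
    "dcterms:title": "title", "schema:author": "author",
}

def recommend_terms(concept_label, k=2):
    # Rank catalogue terms by string similarity between the engineer's
    # label and the term's label (one source among several in LOVER).
    scored = sorted(
        VOCABULARIES,
        key=lambda t: difflib.SequenceMatcher(
            None, concept_label.lower(), VOCABULARIES[t]).ratio(),
        reverse=True)
    return scored[:k]
```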


European Conference on Research and Advanced Technology for Digital Libraries | 2009

Document word clouds: visualising web documents as tag clouds to aid users in relevance decisions

Thomas Gottron

Information Retrieval systems spend great effort on determining the significant terms in a document. A user looking at a document, however, cannot benefit from this information: he has to read the text to understand which words are important. In this paper we take a look at the idea of enhancing the perception of web documents with visualisation techniques borrowed from the tag clouds of Web 2.0. Highlighting the important words in a document with a larger font size gives a quick impression of the relevant concepts in a text. As this process does not depend on a user query, it can also be used for explorative search. A user study showed that even simple TF-IDF values, used as the notion of word importance, helped users decide more quickly whether or not a document is relevant to a topic.
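The TF-IDF-to-font-size mapping can be sketched as a linear scaling of each term's weight into a point-size range. The smoothed IDF variant and the size range are assumptions for illustration, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_font_sizes(document, corpus, min_pt=10, max_pt=36):
    # Scale each term's font size linearly with its TF-IDF weight in the
    # document, so the visually largest words are the most significant ones.
    tf = Counter(document.lower().split())
    n = len(corpus)
    def idf(term):
        df = sum(1 for d in corpus if term in d.lower().split())
        return math.log((n + 1) / (df + 1)) + 1  # smoothed IDF (assumption)
    weights = {t: c * idf(t) for t, c in tf.items()}
    top = max(weights.values())
    return {t: round(min_pt + (max_pt - min_pt) * w / top)
            for t, w in weights.items()}
```

Terms frequent in the document but rare in the corpus end up largest, which is what lets a reader judge relevance at a glance.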


Conference on Recommender Systems | 2012

Online dating recommender systems: the split-complex number approach

Jérôme Kunegis; Gerd Gröner; Thomas Gottron

A typical recommender setting is based on two kinds of relations: similarity between users (or between objects) and the taste of users towards certain objects. In environments such as online dating websites, these two relations are difficult to separate, as the users can be similar to each other, but also have preferences towards other users, i.e., rate other users. In this paper, we present a novel and unified way to model this duality of the relations by using split-complex numbers, a number system related to the complex numbers that is used in mathematics, physics and other fields. We show that this unified representation is capable of modeling both notions of relations between users in a joint expression and apply it for recommending potential partners. In experiments with the Czech dating website Libimseti.cz we show that our modeling approach leads to an improvement over baseline recommendation methods in this scenario.
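The split-complex numbers used here are of the form a + b·j with j² = +1 (unlike the imaginary unit i, where i² = −1). A minimal sketch of the arithmetic, with the interpretation of the two components as similarity-like and preference-like parts taken from the abstract:

```python
from dataclasses import dataclass

@dataclass
class SplitComplex:
    # a + b*j with j*j = +1. In the paper's setting, one component models
    # similarity between users and the other models preference (ratings).
    a: float  # similarity-like part
    b: float  # preference-like part

    def __add__(self, other):
        return SplitComplex(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        # (a + b*j)(c + d*j) = (a*c + b*d) + (a*d + b*c)*j, since j*j = +1
        return SplitComplex(self.a * other.a + self.b * other.b,
                            self.a * other.b + self.b * other.a)
```

Note the key structural property: composing two pure preference edges (b-components) yields a pure similarity value (a-component), because j·j = +1, which is what lets one expression jointly model both relations.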


Information Integration and Web-based Applications & Services | 2008

Combining content extraction heuristics: the CombinE system

Thomas Gottron

The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting the main content. Ongoing research has spawned several CE heuristics of different quality. However, so far only the Crunch framework combines several heuristics to improve its overall CE performance. Since Crunch, though, many new algorithms have been formulated. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web document.
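One simple way to combine CE heuristics is a majority vote over their outputs. The interface below (each heuristic returns the set of token indices it considers main content) is a hypothetical sketch, not CombinE's actual API.

```python
def combine_heuristics(document, heuristics, quorum=None):
    # Each heuristic maps a document to the set of token indices it
    # considers main content; keep tokens selected by at least `quorum`
    # heuristics (default: a simple majority).
    votes = {}
    for h in heuristics:
        for idx in h(document):
            votes[idx] = votes.get(idx, 0) + 1
    needed = quorum if quorum is not None else len(heuristics) // 2 + 1
    tokens = document.split()
    return " ".join(tokens[i] for i in sorted(votes) if votes[i] >= needed)
```

Voting filters out the idiosyncratic mistakes of any single heuristic, which is the intuition behind combining them in the first place.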


European Semantic Web Conference | 2014

Survey on Common Strategies of Vocabulary Reuse in Linked Open Data Modeling

Johann Schaible; Thomas Gottron; Ansgar Scherp

The choice of which vocabulary to reuse when modeling and publishing Linked Open Data (LOD) is far from trivial. There is no study that investigates the different strategies of reusing vocabularies for LOD modeling and publishing. In this paper, we present the results of a survey with 79 participants that examines the most preferred vocabulary reuse strategies of LOD modeling. The participants, LOD publishers and practitioners, were asked to assess different vocabulary reuse strategies and explain their ranking decision. We found significant differences between the modeling strategies, ranging from reusing popular vocabularies, to minimizing the number of vocabularies, to staying within one domain vocabulary. A very interesting insight is that popularity, in the sense of how frequently a vocabulary is used in a data source, is more important than how often individual classes and properties are used in the LOD cloud. Overall, the results of this survey help in better understanding the strategies by which data engineers reuse vocabularies and may also be used to develop future vocabulary engineering tools.

Collaboration


Dive into Thomas Gottron's collaborations.

Top Co-Authors

Steffen Staab (University of Koblenz and Landau)
Jérôme Kunegis (University of Koblenz and Landau)
Nasir Naveed (University of Koblenz and Landau)
Stefan Scheglmann (University of Koblenz and Landau)
Arifah Che Alhadi (University of Koblenz and Landau)