Thomas Roelleke | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Thomas Roelleke is active.

Explore More

Publication

Featured researches published by Thomas Roelleke.

international acm sigir conference on research and development in information retrieval | 2003

A frequency-based and a poisson-based definition of the probability of being informative

Thomas Roelleke

This paper reports on theoretical investigations about the assumptions underlying the inverse document frequency (idf). We show that an intuitive idf-based probability function for the probability of a term being informative assumes disjoint document events. By assuming documents to be independent rather than disjoint, we arrive at a Poisson-based probability of being informative. The framework is useful for understanding and deciding the parameter estimation and combination in probabilistic retrieval models.

international acm sigir conference on research and development in information retrieval | 2006

A parallel derivation of probabilistic information retrieval models

Thomas Roelleke; Jun Wang

This paper investigates in a stringent athematical formalism the parallel derivation of three grand probabilistic retrieval models: binary independent retrieval (BIR), Poisson model (PM), and language modelling (LM).The investigation has been motivated by a number of questions. Firstly, though sharing the same origin, namely the probability of relevance, the models differ with respect to event spaces. How can this be captured in a consistent notation, and can we relate the event spaces? Secondly, BIR and PM are closely related, but how does LM fit in? Thirdly, how are tf-idf and probabilistic models related? .The parallel investigation of the models leads to a number of formalised results: BIR and PM assume the collection to be a set of non-relevant documents, whereas LM assumes the collection to be a set of terms from relevant documents.PM can be viewed as a bridge connecting BIR and LM.A BIR-LM equivalence explains BIR as a special LM case.PM explains tf-idf, and both, BIR and LM probabilities express tf-idf in a dual way..

very large data bases | 2008

Modelling retrieval models in a probabilistic relational algebra with a new operator: the relational Bayes

Thomas Roelleke; Hengzhi Wu; Jun Wang; Hany Azzam

This paper presents a probabilistic relational modelling (implementation) of the major probabilistic retrieval models. Such a high-level implementation is useful since it supports the ranking of any object, it allows for the reasoning across structured and unstructured data, and it gives the software (knowledge) engineer control over ranking and thus supports customisation. The contributions of this paper include the specification of probabilistic SQL (PSQL) and probabilistic relational algebra (PRA), a new relational operator for probability estimation (the relational Bayes), the probabilistic relational modelling of retrieval models, a comparison of modelling retrieval with traditional SQL versus modelling retrieval with PSQL, and a comparison of the performance of probability estimation with traditional SQL versus PSQL. The main findings are that the PSQL/PRA paradigm allows for the description of advanced retrieval models, is suitable for solving large-scale retrieval tasks, and outperforms traditional SQL in terms of abstraction and performance regarding probability estimation.

international acm sigir conference on research and development in information retrieval | 2005

Relevance information: a loss of entropy but a gain for IDF?

Arjen P. de Vries; Thomas Roelleke

When investigating alternative estimates for term discriminativeness, we discovered that relevance information and idf are much closer related than formulated in classical literature. Therefore, we revisited the justification of idf as it follows from the binary independent retrieval (BIR) model. The main result is a formal framework uncovering the close relationship of a generalised idf and the BIR model. The framework makes explicit how to incorporate relevance information into any retrieval function that involves an idf-component.In addition to the idf-based formulation of the BIR model, we propose Poisson-based estimates as an alternative to the classical estimates, this being motivated by the superiority of Poisson-based estimates for the within-document term frequencies. The main experimental finding is that a Poisson-based idf is superior to the classical idf, where the superiority is particularly evident for long queries.

international acm sigir conference on research and development in information retrieval | 2013

Extractive summarisation via sentence removal: condensing relevant sentences into a short summary

Marco Bonzanini; Miguel Martinez-Alvarez; Thomas Roelleke

Many on-line services allow users to describe their opinions about a product or a service through a review. In order to help other users to find out the major opinion about a given topic, without the effort to read several reviews, multi-document summarisation is required. This research proposes an approach for extractive summarisation, supporting different scoring techniques, such as cosine similarity or divergence, as a method for finding representative sentences. The main contribution of this paper is the definition of an algorithm for sentence removal, developed to maximise the score between the summary and the original document. Instead of ranking the sentences and selecting the most important ones, the algorithm iteratively removes unimportant sentences until a desired compression rate is reached. Experimental results show that variations of the sentence removal algorithm provide good performance.

international acm sigir conference on research and development in information retrieval | 2012

Opinion summarisation through sentence extraction: an investigation with movie reviews

Marco Bonzanini; Miguel Martinez-Alvarez; Thomas Roelleke

In on-line reviews, authors often use a short passage to describe the overall feeling about a product or a service. A review as a whole can mention many details not in line with the overall feeling, so capturing this key passage is important to understand the overall sentiment of the review. This paper investigates the use of extractive summarisation in the context of sentiment classification. The aim is to find the summary sentence, or the short passage, which gives the overall sentiment of the review, filtering out potential noisy information. Experiments on a movie review data-set show that subjectivity detection plays a central role in building summaries for sentiment classification. Subjective extracts carry the same polarity of the full text reviews, while statistical and positional approaches are not able to capture this aspect.

Proceedings of the Third International Workshop on Keyword Search on Structured Data | 2012

A schema-driven approach for knowledge-oriented retrieval and query formulation

Hany Azzam; Sirvan Yahyaei; Marco Bonzanini; Thomas Roelleke

In order to search across factual knowledge and content explicated using different data formats this paper leverages a generic data model (schema) that transforms keyword-based retrieval models and queries to knowledge-oriented models and semantically-expressive queries. As each of the transformed retrieval models capitalises on a specific evidence space (term, classification, relationship and attribute), we demonstrate two possible combinations of these spaces, namely macro-based or micro-based. For bare keyword-based queries we demonstrate how the data model can be used to augment the queries with classifications, relationships, etc. that reflect the underlying constraints and objects found in the heterogeneous knowledge bases. Using the IMDb benchmark the results demonstrate the feasibility and effectiveness of the instantiated retrieval models and the query reformulation process.

string processing and information retrieval | 2011

Cross-lingual text fragment alignment using divergence from randomness

Sirvan Yahyaei; Marco Bonzanini; Thomas Roelleke

This paper describes an approach to automatically align fragments of texts of two documents in different languages. A text fragment is a list of continuous sentences and an aligned pair of fragments consists of two fragments in two documents, which are content-wise related. Cross-lingual similarity between fragments of texts is estimated based on models of divergence from randomness. A set of aligned fragments based on the similarity scores are selected to provide an alignment between sections of the two documents. Similarity measures based on divergence show strong performance in the context of cross-lingual fragment alignment in the performed experiments.

international conference on the theory of information retrieval | 2009