Manos Tsagkias
University of Amsterdam
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Manos Tsagkias.
european conference on information retrieval | 2011
Kamran Massoudi; Manos Tsagkias; Maarten de Rijke; Wouter Weerkamp
We propose a retrieval model for searching microblog posts for a given topic of interest. We develop a language modeling approach tailored to microblogging characteristics, where redundancy-based IR methods cannot be used in a straightforward manner. We enhance this model with two groups of quality indicators: textual and microblog specific. Additionally, we propose a dynamic query expansion model for microblog post retrieval. Experimental results on Twitter data reveal the usefulness of boolean search, and demonstrate the utility of quality indicators and query expansion in microblog search
european conference on information retrieval | 2010
Manos Tsagkias; Wouter Weerkamp; Maarten de Rijke
Online news agents provide commenting facilities for their readers to express their opinions or sentiments with regards to news stories. The number of user supplied comments on a news article may be indicative of its importance, interestingness, or impact. We explore the news comments space, and compare the log-normal and the negative binomial distributions for modeling comments from various news agents. These estimated models can be used to normalize raw comment counts and enable comparison across different news sites. We also examine the feasibility of online prediction of the number of comments, based on the volume observed shortly after publication. We report on solid performance for predicting news comment volume in the long run, after short observation. This prediction can be useful for identifying news stories with the potential to “take off,” and can be used to support front page optimization for news sites.
european conference on information retrieval | 2012
Andrei Oghina; Mathias Breuss; Manos Tsagkias; Maarten de Rijke
We predict IMDb movie ratings and consider two sets of features: surface and textual features. For the latter, we assume that no social media signal is isolated and use data from multiple channels that are linked to a particular movie, such as tweets from Twitter and comments from YouTube. We extract textual features from each channel to use in our prediction model and we explore whether data from either of these channels can help to extract a better set of textual feature for prediction. Our best performing model is able to rate movies very close to the observed values.
international joint conference on natural language processing | 2015
Nikos Voskarides; Edgar Meij; Manos Tsagkias; Maarten de Rijke; Wouter Weerkamp
We study the problem of explaining relationships between pairs of knowledge graph entities with human-readable descriptions. Our method extracts and enriches sentences that refer to an entity pair from a corpus and ranks the sentences according to how well they describe the relationship between the entities. We model this task as a learning to rank problem for sentences and employ a rich set of features. When evaluated on a large set of manually annotated sentences, we find that our method significantly improves over state-of-the-art baseline models.
international acm sigir conference on research and development in information retrieval | 2015
David van Dijk; Manos Tsagkias; Maarten de Rijke
We focus on detecting potential topical experts in community question answering platforms early on in their lifecycle. We use a semi-supervised machine learning approach. We extract three types of feature: (i) textual, (ii) behavioral, and (iii) time-aware, which we use to predict whether a user will become an expert in the longterm. We compare our method to a machine learning method based on a state-of-the-art method in expertise retrieval. Results on data from Stack Overflow demonstrate the utility of adding behavioral and time-aware features to the baseline method with a net improvement in accuracy of 26% for very early detection of expertise.
international acm sigir conference on research and development in information retrieval | 2013
Richard Berendsen; Manos Tsagkias; Wouter Weerkamp; Maarten de Rijke
Recent years have witnessed a persistent interest in generating pseudo test collections, both for training and evaluation purposes. We describe a method for generating queries and relevance judgments for microblog search in an unsupervised way. Our starting point is this intuition: tweets with a hashtag are relevant to the topic covered by the hashtag and hence to a suitable query derived from the hashtag. Our baseline method selects all commonly used hashtags, and all associated tweets as relevance judgments; we then generate a query from these tweets. Next, we generate a timestamp for each query, allowing us to use temporal information in the training process. We then enrich the generation process with knowledge derived from an editorial test collection for microblog search. We use our pseudo test collections in two ways. First, we tune parameters of a variety of well known retrieval methods on them. Correlations with parameter sweeps on an editorial test collection are high on average, with a large variance over retrieval algorithms. Second, we use the pseudo test collections as training sets in a learning to rank scenario. Performance close to training on an editorial test collection is achieved in many cases. Our results demonstrate the utility of tuning and training microblog search algorithms on automatically generated training material.
conference on information and knowledge management | 2009
Katja Hofmann; Manos Tsagkias; Edgar Meij; Maarten de Rijke
Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is a time-consuming process, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Document structure may contain useful information about which parts or phrases of a document are important, but has rarely been considered as a source of information for keyphrase extraction. We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new, large corpus that consists of full-text journal articles, where the rich collection and document structure available at the publishing stage is explicitly annotated. We explore features based on the XML tags contained in the documents, and based on generic section types derived using position and cue words in section titles. For XML tags we find sections, abstract, and title to perform best, but many smaller elements may be beneficial in combination with other features. Of the generic section types, the discussion section is found to be most useful for keyphrase extraction.
international acm sigir conference on research and development in information retrieval | 2014
Aliaksei Severyn; Alessandro Moschitti; Manos Tsagkias; Richard Berendsen; Maarten de Rijke
We tackle the problem of improving microblog retrieval algorithms by proposing a robust structural representation of (query, tweet) pairs. We employ these structures in a principled kernel learning framework that automatically extracts and learns highly discriminative features. We test the generalization power of our approach on the TREC Microblog 2011 and 2012 tasks. We find that relational syntactic features generated by structural kernels are effective for learning to rank (L2R) and can easily be combined with those of other existing systems to boost their accuracy. In particular, the results show that our L2R approach improves on almost all the participating systems at TREC, only using their raw scores as a single feature. Our method yields an average increase of 5% in retrieval effectiveness and 7 positions in system ranks.
international acm sigir conference on research and development in information retrieval | 2011
Manos Tsagkias; Maarten de Rijke; Wouter Weerkamp
Republished article finding is the task of identifying instances of articles that have been published in one source and republished more or less verbatim in another source, which is often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model taking into account the fact that in our setup queries are as long as complete news articles. We argue that in this case, the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on top of these distributions: a log odds model and a bayesian model where document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that for the task of republished article finding, where we deal with queries whose length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
web search and data mining | 2016
David Graus; Manos Tsagkias; Wouter Weerkamp; Edgar Meij; Maarten de Rijke
Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entitys description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that are associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time and learn the best representation as time evolves so as to capture the dynamics of how people search entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.