Vitor R. Carvalho | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Vitor R. Carvalho is active.

Explore More

Publication

Featured researches published by Vitor R. Carvalho.

international acm sigir conference on research and development in information retrieval | 2009

Reducing long queries using query quality predictors

Giridhar Kumaran; Vitor R. Carvalho

Long queries frequently contain many extraneous terms that hinder retrieval of relevant documents. We present techniques to reduce long queries to more effective shorter ones that lack those extraneous terms. Our work is motivated by the observation that perfectly reducing long TREC description queries can lead to an average improvement of 30% in mean average precision. Our approach involves transforming the reduction problem into a problem of learning to rank all sub-sets of the original query (sub-queries) based on their predicted quality, and selecting the top sub-query. We use various measures of query quality described in the literature as features to represent sub-queries, and train a classifier. Replacing the original long query with the top-ranked sub-query chosen by the ranker results in a statistically significant average improvement of 8% on our test sets. Analysis of the results shows that query reduction is well-suited for moderately-performing long queries, and a small set of query quality predictors are well-suited for the task of ranking sub-queries.

european conference on information retrieval | 2008

Ranking users for intelligent message addressing

Vitor R. Carvalho; William W. Cohen

Finding persons who are knowledgeable on a given topic (i.e. Expert Search) has become an active area of recent research [1, 2, 3]. In this paper we investigate the related task of Intelligent Message Addressing, i.e., finding persons who are potential recipients of a message under composition given its current contents, its previously-specified recipients or a few initial letters of the intended recipient contact (intelligent auto-completion). We begin by providing quantitative evidence, from a very large corpus, of how frequently email users are subject to message addressing problems. We then propose several techniques for this task, including adaptations of well-known formal models of Expert Search. Surprisingly, a simple model based on the K-Nearest-Neighbors algorithm consistently outperformed all other methods. We also investigated combinations of the proposed methods using fusion techniques, which leaded to significant performance improvements over the baselines models. In auto-completion experiments, the proposed models also outperformed all standard baselines. Overall, the proposed techniques showed ranking performance of more than 0.5 in MRR over 5202 queries from 36 different email users, suggesting intelligent message addressing can be a welcome addition to email.

international acm sigir conference on research and development in information retrieval | 2010

Exploring reductions for long web queries

Niranjan Balasubramanian; Giridhar Kumaran; Vitor R. Carvalho

Long queries form a difficult, but increasingly important segment for web search engines. Query reduction, a technique for dropping unnecessary query terms from long queries, improves performance of ad-hoc retrieval on TREC collections. Also, it has great potential for improving long web queries (upto 25% improvement in NDCG@5). However, query reduction on the web is hampered by the lack of accurate query performance predictors and the constraints imposed by search engine architectures and ranking algorithms. In this paper, we present query reduction techniques for long web queries that leverage effective and efficient query performance predictors. We propose three learning formulations that combine these predictors to perform automatic query reduction. These formulations enable trading of average improvements for the number of queries impacted, and enable easy integration into the search engines architecture for rank-time query reduction. Experiments on a large collection of long queries issued to a commercial search engine show that the proposed techniques significantly outperform baselines, with more than 12% improvement in NDCG@5 in the impacted set of queries. Extension to the formulations such as result interleaving further improves results. We find that the proposed techniques deliver consistent retrieval gains where it matters most: poorly performing long web queries.

knowledge discovery and data mining | 2006

Single-pass online learning: performance, voting schemes and online feature selection

Vitor R. Carvalho; William W. Cohen

To learn concepts over massive data streams, it is essential to design inference and learning methods that operate in real time with limited memory. Online learning methods such as perceptron or Winnow are naturally suited to stream processing; however, in practice multiple passes over the same training data are required to achieve accuracy comparable to state-of-the-art batch learners. In the current work we address the problem of training an on-line learner with a single passover the data. We evaluate several existing methods, and also propose a new modification of Margin Balanced Winnow, which has performance comparable to linear SVM. We also explore the effect of averaging, a.k.a. voting, on online learning. Finally, we describe how the new Modified Margin Balanced Winnow algorithm can be naturally adapted to perform feature selection. This scheme performs comparably to widely-used batch feature selection methods like information gain or Chi-square, with the advantage of being able to select features on-the-fly. Taken together, these techniques allow single-pass online learning to be competitive with batch techniques, and still maintain the advantages of on-line learning.

international acm sigir conference on research and development in information retrieval | 2011

Crowdsourcing for search evaluation

Vitor R. Carvalho; Matthew Lease; Emine Yilmaz

The Crowdsourcing for Search Evaluation Workshop (CSE 2010) was held on July 23, 2010 in Geneva, Switzerland, in conjunction with the 33rd Annual ACM SIGIR Conference. The workshop addressed the latest advances in theory and empirical methods in crowdsourcing for search evaluation, as well as novel applications of crowdsourcing for evaluating search systems. Three invited talks were presented, along with seven refereed papers. Proceedings from the workshop, along with presentation slides, have been made available online.

international acm sigir conference on research and development in information retrieval | 2010

Predicting query performance on the web

Niranjan Balasubramanian; Giridhar Kumaran; Vitor R. Carvalho

Predicting the performance of web queries is useful for several applications such as automatic query reformulation and automatic spell correction. In the web environment, accurate performance prediction is challenging because measures such as clarity that work well on homogeneous TREC-like collections, are not as effective and are often expensive to compute. We present Rank-time Performance Prediction (RAPP), an effective and efficient approach for online performance prediction on the web. RAPP uses retrieval scores, and aggregates of the rank-time features used by the document- ranking algorithm to train regressors for query performance prediction. On a set of over 12,000 queries sampled from the query logs of a major search engine, RAPP achieves a linear correlation of 0.78 with DCG@5, and 0.52 with NDCG@5. Analysis of prediction accuracy shows that hard queries are easier to identify while easy queries are harder to identify.

conference on information and knowledge management | 2010

Online stratified sampling: evaluating classifiers at web-scale

Paul N. Bennett; Vitor R. Carvalho

Deploying a classifier to large-scale systems such as the web requires careful feature design and performance evaluation. Evaluation is particularly challenging because these large collections frequently change. In this paper we adapt stratified sampling techniques to evaluate the precision of classifiers deployed in large-scale systems. We investigate different types of stratification strategies, and then we derive a new online sampling algorithm that incrementally approximates the theoretical optimal disproportionate sampling strategy. In experiments, the proposed algorithm significantly outperforms both simple random sampling as well as other types of stratified sampling, with an average reduction of about 20% in labeling effort to reach the same confidence and interval-bounds on precision

Archive | 2011

Email “Speech Acts”

Vitor R. Carvalho

One important use of work-related email is negotiating and delegating shared tasks and subtasks. Email task management could be made more efficient if we were able to automatically detect the intent of an email message — for example, to determine if the email contains a request, a commitment by the sender to perform some task, or an amendment to an earlier proposal.

conference on information and knowledge management | 2008

Suppressing outliers in pairwise preference ranking

Vitor R. Carvalho; Jonathan L. Elsas; William W. Cohen; Jaime G. Carbonell

Many of the recently proposed algorithms for learning feature-based ranking functions are based on the pairwise preference framework, in which instead of taking documents in isolation, document pairs are used as instances in the learning process. One disadvantage of this process is that a noisy relevance judgment on a single document can lead to a large number of mis-labeled document pairs. This can jeopardize robustness and deteriorate overall ranking performance. In this paper we study the effects of outlying pairs in rank learning with pairwise preferences and introduce a new meta-learning algorithm capable of suppressing these undesirable effects. This algorithm works as a second optimization step in which any linear baseline ranker can be used as input. Experiments on eight different ranking datasets show that this optimization step produces statistically significant performance gains over state-of-the-art methods.

Archive | 2011

Modeling Intention in Email

Vitor R. Carvalho

Everyday more than half of American adult internet users read or write email messages at least once. The prevalence of email has significantly impacted the working world, functioning as a great asset on many levels, yet at times, a costly liability. In an effort to improve various aspects of work-related communication, this work applies sophisticated machine learning techniques to a large body of email data. Several effective models are proposed that can aid with the prioritization of incoming messages, help with coordination of shared tasks, improve tracking of deadlines, and prevent disastrous information leaks. Carvalho presents many data-driven techniques that can positively impact work-related email communication and offers robust models that may be successfully applied to future machine learning tasks.

Explore More