Rishiraj Saha Roy | Researchain

Publication

Featured research published by Rishiraj Saha Roy.

international world wide web conferences | 2011

Unsupervised query segmentation using only query logs

Nikita Mishra; Rishiraj Saha Roy; Niloy Ganguly; Srivatsan Laxman; Monojit Choudhury

We introduce an unsupervised query segmentation scheme that uses query logs as the only resource and can effectively capture the structural units in queries. We believe that Web search queries have a unique syntactic structure which is distinct from that of English or a bag-of-words model. The segments discovered by our scheme help understand this underlying grammatical structure. We apply a statistical model based on Hoeffding's inequality to mine significant word n-grams from queries and subsequently use them for segmenting the queries. Evaluation against manually segmented queries shows that this technique can detect rare units that are missed by our Pointwise Mutual Information (PMI) baseline.
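For readers unfamiliar with the PMI baseline named in the abstract, the following is a minimal sketch of how pointwise mutual information can be computed for adjacent word pairs in a query log. It is not the authors' implementation: the function name `bigram_pmi`, the probability estimates, and the toy query log are all assumptions made for illustration.

```python
import math
from collections import Counter

def bigram_pmi(queries):
    """Return PMI (in bits) for every adjacent word pair seen in the queries.

    Probabilities are simple relative frequencies; this estimator is an
    assumption for illustration, not the one used in the paper.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        words = query.split()
        word_counts.update(words)
        pair_counts.update(zip(words, words[1:]))

    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    pmi = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        pmi[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return pmi

# Hypothetical toy query log, invented for this example.
toy_log = ["new york hotels", "new york weather", "cheap hotels", "weather forecast"]
scores = bigram_pmi(toy_log)
```

On this toy log, the pair ("new", "york") scores higher than ("york", "hotels"), so a PMI-based segmenter would prefer to keep "new york" together as one segment.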

international acm sigir conference on research and development in information retrieval | 2012

An IR-based evaluation framework for web search query segmentation

Rishiraj Saha Roy; Niloy Ganguly; Monojit Choudhury; Srivatsan Laxman

This paper presents the first evaluation framework for Web search query segmentation based directly on IR performance. In the past, segmentation strategies were mainly validated against manual annotations. Our work shows that the goodness of a segmentation algorithm as judged through evaluation against a handful of human-annotated segmentations hardly reflects its effectiveness in an IR-based setup. In fact, state-of-the-art algorithms are shown to perform as well as, and sometimes even better than, human annotations, a fact masked by previous validations. The proposed framework also provides us with an objective understanding of the gap between the present best and the best possible segmentation algorithm. We draw these conclusions based on an extensive evaluation of six segmentation strategies, including the three most recent algorithms, vis-a-vis segmentations from three human annotators. The evaluation framework also gives insights about which segments must necessarily be detected by an algorithm for achieving the best retrieval results. The meticulously constructed dataset used in our experiments has been made public for use by the research community.

forum for information retrieval evaluation | 2013

Overview of the FIRE 2013 Track on Transliterated Search

Rishiraj Saha Roy; Monojit Choudhury; Prasenjit Majumder; Komal Agarwal

In this paper, we provide an overview of the FIRE 2013 track on transliterated search and describe the datasets released as part of the track. This was the first year that the track was organized. We had proposed two subtasks as part of the challenge. In the first subtask, which we had proposed for Hindi, Bangla, and Gujarati, participants had to devise an algorithm to label the true languages of words in a sentence. Additionally, if a non-English word was identified, the algorithm was also supposed to provide the transliteration of the word in the native script. The second subtask was retrieval-based, where mixed-script documents had to be retrieved and ranked by relevance in response to ad hoc queries. The queries in our dataset were Bollywood Hindi song lyrics, in Roman script. We received a total of 25 run submissions from five different teams across the world (three from India and two from abroad). Conducting this track helped us generate awareness about the importance of transliteration in the context of Indian languages. Results show that there is considerable scope for improvement of transliteration accuracies for the studied languages.

Journal of Web Semantics | 2015

Discovering and understanding word level user intent in Web search queries

Rishiraj Saha Roy; Rahul Katare; Niloy Ganguly; Srivatsan Laxman; Monojit Choudhury

Identifying and interpreting user intent are fundamental to semantic search. In this paper, we investigate the association of intent with individual words of a search query. We propose that words in queries can be classified as either content or intent, where content words represent the central topic of the query, while users add intent words to make their requirements more explicit. We argue that intelligent processing of intent words can be vital to improving the result quality, and in this work we focus on intent word discovery and understanding. Our approach towards intent word detection is motivated by the hypothesis that query intent words satisfy certain distributional properties in large query logs similar to function words in natural language corpora. Following this idea, we first prove the effectiveness of our corpus distributional features, namely, word co-occurrence counts and entropies, towards function word detection for five natural languages. Next, we show that reliable detection of intent words in queries is possible using these same features computed from query logs. To make the distinction between content and intent words more tangible, we additionally provide operational definitions of content and intent words as those words that should match, and those that need not match, respectively, in the text of relevant documents. In addition to a standard evaluation against human annotations, we also provide an alternative validation of our ideas using clickthrough data. Concordance of the two orthogonal evaluation approaches provides further support to our original hypothesis of the existence of two distinct word classes in search queries. Finally, we provide a taxonomy of intent words derived through rigorous manual analysis of large query logs.

Proceedings of the 9th International Conference (EVOLANG9) | 2012

Are Web Search Queries an Evolving Protolanguage?

Rishiraj Saha Roy; Monojit Choudhury; Kalika Bali

Searching for information on the World Wide Web by issuing queries to commercial search engines is one of the most common activities engaged in by almost every Web user. Web search queries have a unique structure, which is more complex than just a bag-of-words, yet simpler than a natural language. This structure has been evolving over the past decade, which is an artefact of the way search engines are evolving and aggressively using feedback from past users to serve current and future users better. In this paper, we argue that queries can be considered as an evolving protolanguage from functional, structural and dynamical points of view. Therefore, Web search logs, a perfectly preserved and rich dataset, can probably reveal several interesting facts about the evolution of protolanguage.

international acm sigir conference on research and development in information retrieval | 2017

Privacy through Solidarity: A User-Utility-Preserving Framework to Counter Profiling

Asia J. Biega; Rishiraj Saha Roy; Gerhard Weikum

Online service providers gather vast amounts of data to build user profiles. Such profiles improve service quality through personalization, but may also intrude on user privacy and incur discrimination risks. In this work, we propose a framework which leverages solidarity in a large community to scramble user interaction histories. While this is beneficial for anti-profiling, the potential downside is that individual user utility, in terms of the quality of search results or recommendations, may severely degrade. To reconcile privacy and user utility and control their trade-off, we develop quantitative models for these dimensions and effective strategies for assigning user interactions to Mediator Accounts. We demonstrate the viability of our framework by experiments in two different application areas (search and recommender systems), using two large datasets.

conference on intelligent text processing and computational linguistics | 2015

Automated Linguistic Personalization of Targeted Marketing Messages Mining User-Generated Text on Social Media

Rishiraj Saha Roy; Aishwarya Padmakumar; Guna Prasaad Jeganathan; Ponnurangam Kumaraguru

Personalizing marketing messages for specific audience segments is vital for increasing user engagement with advertisements, but it becomes very resource-intensive when the marketer has to deal with multiple segments, products or campaigns. In this research, we take the first steps towards automating message personalization by algorithmically inserting adjectives and adverbs that have been found to evoke positive sentiment in specific audience segments, into basic versions of ad messages. First, we build language models representative of linguistic styles from user-generated textual content on social media for each segment. Next, we mine product-specific adjectives and adverbs from content associated with positive sentiment. Finally, we insert extracted words into the basic version using the language models to enrich the message for each target segment, after statistically checking in-context readability. Decreased cross-entropy values from the basic to the transformed messages show that we are able to approach the linguistic style of the target segments. Crowdsourced experiments verify that our personalized messages are almost indistinguishable from similar human compositions. Social network data processed for this research has been made publicly available for community use.
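The cross-entropy comparison described in the abstract can be illustrated with a small sketch. The abstract does not specify the language models used, so the add-one-smoothed unigram model below is an assumption chosen for brevity, and the segment corpus and both messages are invented for this example.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens):
    """Build an add-one-smoothed unigram model (an illustrative assumption)."""
    counts = Counter(corpus_tokens)
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen words
    total = sum(counts.values())

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob

def cross_entropy(prob, message_tokens):
    """Average negative log2 probability per token of the message."""
    return -sum(math.log2(prob(w)) for w in message_tokens) / len(message_tokens)

# Hypothetical user-generated text for one audience segment.
segment_corpus = "amazing camera truly amazing photos amazing deal".split()
segment_lm = train_unigram(segment_corpus)

basic = "buy this camera".split()
transformed = "buy this truly amazing camera".split()

basic_ce = cross_entropy(segment_lm, basic)
transformed_ce = cross_entropy(segment_lm, transformed)
```

Here `transformed_ce` is lower than `basic_ce`, which mirrors the abstract's reading: inserting words typical of the segment moves the message closer to that segment's linguistic style under the model.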

international world wide web conferences | 2018

Never-Ending Learning for Open-Domain Question Answering over Knowledge Bases

Abdalghani Abujabal; Rishiraj Saha Roy; Mohamed Yahya; Gerhard Weikum

Translating natural language questions to semantic representations such as SPARQL is a core challenge in open-domain question answering over knowledge bases (KB-QA). Existing methods rely on a clear separation between an offline training phase, where a model is learned, and an online phase where this model is deployed. Two major shortcomings of such methods are that (i) they require access to a large annotated training set that is not always readily available and (ii) they fail on questions from before-unseen domains. To overcome these limitations, this paper presents NEQA, a continuous learning paradigm for KB-QA. Offline, NEQA automatically learns templates mapping syntactic structures to semantic ones from a small number of training question-answer pairs. Once deployed, continuous learning is triggered on cases where templates are insufficient. Using a semantic similarity function between questions and by judicious invocation of non-expert user feedback, NEQA learns new templates that capture previously-unseen syntactic structures. This way, NEQA gradually extends its template repository. NEQA periodically re-trains its underlying models, allowing it to adapt to the language used after deployment. Our experiments demonstrate NEQA's viability, with steady improvement in answering quality over time, and the ability to answer questions from new domains.

international world wide web conferences | 2015

Probabilistic Deduplication of Anonymous Web Traffic

Rishiraj Saha Roy; Ritwik Sinha; Niyati Chhaya; Shiv Kumar Saini

Cookies and login-based authentication often provide incomplete data for stitching website visitors across multiple sources, necessitating probabilistic deduplication. We address this challenge by formulating the problem as a binary classification task for pairs of anonymous visitors. We compute visitor proximity vectors by converting categorical variables like IP addresses, product search keywords and URLs with very high cardinalities to continuous numeric variables using the Jaccard coefficient for each attribute. Our method achieves about 90% AUC and F-scores in identifying whether two cookies map to the same visitor, while providing insights on the relative importance of available features in Web analytics towards the deduplication process.
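A per-attribute Jaccard coefficient, as described in the abstract, can be sketched as follows. This is only an illustration under stated assumptions: the attribute names, the dictionary layout of a visitor record, and the helper `proximity_vector` are hypothetical, and the resulting vector would still need to be passed to a binary classifier, which is not shown.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def proximity_vector(visitor_a, visitor_b, attributes):
    """One Jaccard score per attribute, in the order given by `attributes`."""
    return [
        jaccard(visitor_a.get(attr, ()), visitor_b.get(attr, ()))
        for attr in attributes
    ]

# Hypothetical attribute names and cookie records, invented for this example.
ATTRIBUTES = ["ip_addresses", "search_keywords", "urls"]

cookie_1 = {
    "ip_addresses": {"10.0.0.1", "10.0.0.2"},
    "search_keywords": {"tent", "stove"},
    "urls": {"/home", "/cart"},
}
cookie_2 = {
    "ip_addresses": {"10.0.0.1"},
    "search_keywords": {"tent", "stove"},
    "urls": {"/help"},
}

vector = proximity_vector(cookie_1, cookie_2, ATTRIBUTES)
```

For these two records the vector is `[0.5, 1.0, 0.0]`: half of the IP addresses overlap, the search keywords are identical, and no URLs are shared. Each high-cardinality categorical attribute is thereby reduced to a single number in [0, 1].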

international acm sigir conference on research and development in information retrieval | 2014

Improving unsupervised query segmentation using parts-of-speech sequence information

Rishiraj Saha Roy; Yogarshi Vyas; Niloy Ganguly; Monojit Choudhury

We present a generic method for augmenting unsupervised query segmentation by incorporating Parts-of-Speech (POS) sequence information to detect meaningful but rare n-grams. Our initial experiments with an existing English POS tagger employing two different POS tagsets and an unsupervised POS induction technique specifically adapted for queries show that POS information can significantly improve query segmentation performance in all these cases.