Mucahid Kutlu
Qatar University
Publication
Featured research published by Mucahid Kutlu.
Information Processing and Management | 2018
Mucahid Kutlu; Tamer Elsayed; Matthew Lease
While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today's massive document collections (e.g., ClueWeb12's 700M+ Web pages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallow judging, while assuming random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or whether one should simply perform shallow judging over many topics. In seeking a rigorous answer to this over-arching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments. Experiments on NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.
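The evaluation-reliability question above can be illustrated with a small sketch: given per-topic effectiveness scores for a set of systems, compare the system ranking induced by a topic subset against the ranking induced by the full topic set using Kendall's tau. The code below is a minimal illustration on fabricated scores, not the paper's topic selection algorithm; all names and data are hypothetical.

```python
# Minimal sketch (not the paper's algorithm): measure how reliably a topic subset
# reproduces the system ranking obtained from the full topic set, via Kendall's tau.
# All systems, topics, and scores below are fabricated for illustration.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_systems, n_topics = 20, 50
scores = rng.random((n_systems, n_topics))   # scores[s, t]: e.g. AP of system s on topic t

def subset_reliability(scores, topic_subset):
    """Kendall's tau between system rankings induced by the subset and by all topics."""
    full_means = scores.mean(axis=1)                  # mean effectiveness over all topics
    sub_means = scores[:, topic_subset].mean(axis=1)  # mean effectiveness over the subset
    tau, _ = kendalltau(full_means, sub_means)        # tau is rank-based, so means suffice
    return tau

random_subset = rng.choice(n_topics, size=10, replace=False)
print("tau with a random 10-topic subset:", round(subset_reliability(scores, random_subset), 3))
```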
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2018
Mucahid Kutlu; Tyler McDonnell; Yassmine Barkallah; Tamer Elsayed; Matthew Lease
While crowdsourcing offers a low-cost, scalable way to collect relevance judgments, lack of transparency with remote crowd work has limited understanding about the quality of collected judgments. In prior work, we showed a variety of benefits from asking crowd workers to provide rationales for each relevance judgment (McDonnell et al., 2016). In this work, we scale up our rationale-based judging design to assess its reliability on the 2014 TREC Web Track, collecting roughly 25K crowd judgments for 5K document-topic pairs. We also study having crowd judges perform topic-focused judging, rather than judging across topics, finding this improves quality. Overall, we show that crowd judgments can be used to reliably rank IR systems for evaluation. We further explore the potential of rationales to shed new light on reasons for judging disagreement between experts and crowd workers. Our qualitative and quantitative analysis distinguishes subjective vs. objective forms of disagreement, as well as the relative importance of each disagreement cause, and we present a new taxonomy for organizing the different types of disagreement we observe. We show that many crowd disagreements seem valid and plausible, with disagreement in many cases due to judging errors by the original TREC assessors. We also share our WebCrowd25k dataset, including: (1) crowd judgments with rationales, and (2) taxonomy category labels for each judging disagreement analyzed.
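With roughly 25K judgments over 5K document-topic pairs (about five labels per pair), each pair's crowd labels must be aggregated into a single judgment before systems can be ranked. The sketch below shows a plain majority-vote aggregation on fabricated labels; it is a common baseline, not necessarily the aggregation scheme used in the paper.

```python
# Hypothetical sketch: aggregate multiple crowd relevance labels per (topic, doc)
# pair by majority vote. A common baseline, not necessarily the paper's scheme.
from collections import Counter, defaultdict

# raw_labels: (topic_id, doc_id, label) tuples from crowd workers -- fabricated data.
raw_labels = [
    ("201", "d1", 1), ("201", "d1", 1), ("201", "d1", 0),
    ("201", "d2", 0), ("201", "d2", 0), ("201", "d2", 1),
]

votes = defaultdict(list)
for topic, doc, label in raw_labels:
    votes[(topic, doc)].append(label)

# One aggregated judgment per (topic, doc) pair.
qrels = {pair: Counter(labels).most_common(1)[0][0] for pair, labels in votes.items()}
print(qrels)   # {('201', 'd1'): 1, ('201', 'd2'): 0}
```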
arXiv: Information Retrieval | 2018
Maram Hasanain; Reem Suwaileh; Tamer Elsayed; Mucahid Kutlu; Hind Almerekhi
This article introduces a new language-independent approach for creating a large-scale high-quality test collection of tweets that supports multiple information retrieval (IR) tasks without running a shared-task campaign. The adopted approach (demonstrated over Arabic tweets) designs the collection around significant (i.e., popular) events, which enables the development of topics that represent frequent information needs of Twitter users for which rich content exists. That inherently facilitates the support of multiple tasks that generally revolve around events, namely event detection, ad-hoc search, timeline generation, and real-time summarization. The key highlights of the approach include diversifying the judgment pool via interactive search and multiple manually-crafted queries per topic, collecting high-quality annotations via crowd-workers for relevancy and in-house annotators for novelty, filtering out low-agreement topics and inaccessible tweets, and providing multiple subsets of the collection for better availability. Applying our methodology on Arabic tweets resulted in EveTAR, the first freely-available tweet test collection for multiple IR tasks. EveTAR includes a crawl of 355M Arabic tweets and covers 50 significant events for which about 62K tweets were judged with substantial average inter-annotator agreement (Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating existing algorithms in the respective tasks. Results indicate that the new collection can support reliable ranking of IR systems that is comparable to similar TREC collections, while providing strong baseline results for future studies over Arabic tweets.
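The reported agreement statistic (an average Kappa of 0.71) can be illustrated with Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. The sketch below computes it on fabricated binary labels; it only illustrates the statistic, not EveTAR's actual annotation pipeline.

```python
# Illustrative only: Cohen's kappa for two annotators' binary relevance labels.
# The labels are fabricated; EveTAR's own agreement computation may differ
# (e.g., averaging over topics or using more than two annotators per tweet).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n            # observed agreement
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohens_kappa(a, b), 3))   # ~0.583 on this toy data
```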
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2016
Reem Suwaileh; Mucahid Kutlu; Nihal Fathima; Tamer Elsayed; Matthew Lease
Web crawls provide valuable snapshots of the Web which enable a wide variety of research, be it distributional analysis to characterize Web properties or use of language, content analysis in social science, or Information Retrieval (IR) research to develop and evaluate effective search algorithms. While many English-centric Web crawls exist, public Arabic Web crawls remain quite limited, which hinders research and development. To remedy this, we present ArabicWeb16, a new public Web crawl of roughly 150M Arabic Web pages with significant coverage of dialectal Arabic as well as Modern Standard Arabic. For IR researchers, we expect ArabicWeb16 to support various research areas: ad-hoc search, question answering, filtering, cross-dialect search, dialect detection, entity search, blog search, and spam detection. Combined use with a separate Arabic Twitter dataset we are also collecting may provide further value.
Very Large Data Bases | 2018
Yousuf Ahmad; Omar Khattab; Arsal Malik; Ahmad Musleh; Mohammad Hammoud; Mucahid Kutlu; Mostafa Shehata; Tamer Elsayed
This paper presents LA3, a scalable distributed system for graph analytics. LA3 couples a vertex-based programming model with a highly optimized linear algebra-based engine. It translates any vertex-centric program into an iteratively executed sparse matrix-vector multiplication (SpMV). To reduce communication and enhance scalability, the adjacency matrix representing an input graph is partitioned into locality-aware 2D tiles distributed across multiple processes. Alongside, three major optimizations are incorporated to preclude redundant computations and minimize communication. First, the link-based structure of the input graph is exploited to classify vertices into different types. Afterwards, vertices of special types are factored out of the main loop of the graph application to avoid superfluous computations. We refer to this novel optimization as computation filtering. Second, a communication filtering mechanism is involved to optimize for the high sparsity of the input matrix due to power-law distributions, common in real-world graphs. This optimization ensures that each process receives only the messages that pertain to non-zero entries in its tiles, substantially reducing communication traffic since most tiles are highly sparse. Lastly, a pseudo-asynchronous computation and communication optimization is proposed, whereby processes progress and communicate asynchronously, consume messages as soon as they become available, and block otherwise. We implemented and extensively tested LA3 on private and public clouds. Results show that LA3 outperforms six related state-of-the-art and popular distributed graph analytics systems by an average of 10×.
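The vertex-centric-to-SpMV translation at the heart of this design can be seen with PageRank: one superstep, in which every vertex sums contributions from its in-neighbors, is a single multiplication of a column-stochastic adjacency matrix by the rank vector. The toy sketch below shows that correspondence on a four-vertex graph; it runs in a single process and does not model LA3's 2D tiling, computation filtering, or communication filtering.

```python
# Toy illustration of the vertex-centric -> SpMV correspondence (single process;
# none of LA3's partitioning or filtering optimizations are modeled here).
import numpy as np
from scipy.sparse import csr_matrix

# Directed toy graph: edge (u, v) means u links to v.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# Column-stochastic transition matrix: M[v, u] = 1/outdegree(u) if u -> v.
rows = [v for u, v in edges]
cols = [u for u, v in edges]
outdeg = np.bincount(cols, minlength=n)
vals = [1.0 / outdeg[u] for u, v in edges]
M = csr_matrix((vals, (rows, cols)), shape=(n, n))

# Each PageRank iteration ("superstep") is one sparse matrix-vector multiplication.
d, rank = 0.85, np.full(n, 1.0 / n)
for _ in range(30):
    rank = (1 - d) / n + d * (M @ rank)

print(np.round(rank, 4))
```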
Conference on Information and Knowledge Management | 2018
Khaled Yasser; Mucahid Kutlu; Tamer Elsayed
Conference on Information and Knowledge Management | 2018
Mucahid Kutlu; Tamer Elsayed; Maram Hasanain; Matthew Lease
Even though Web search engines play an important role in finding documents relevant to user queries, little to no attention has been given to how they perform in terms of usefulness for fact-checking claims. In this paper, we introduce a new research problem that addresses the ability of fact-checking systems to distinguish Web search results that are useful in discovering the veracity of claims from the ones that are not. We also propose a re-ranking method to improve the ranking of search results for fact-checking. To evaluate our proposed method, we conducted a preliminary study for which we developed a test collection that includes 22 claims and 20 manually-annotated Web search results for each. Our experiments show that the proposed method outperforms the baseline represented by the original ranking of search results. The contributions this improvement brings to real-world applications are two-fold: it will help human fact-checkers find useful documents for their task faster, and it will help automated fact-checking systems by pointing out which documents are useful and which are not.
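The re-ranking idea can be pictured as blending each result's original retrieval score with a score for fact-checking usefulness and re-sorting. The sketch below uses a hypothetical usefulness scorer and fabricated results purely for illustration; the paper's actual features and model are not described here.

```python
# Hypothetical sketch of re-ranking search results by "usefulness for fact-checking".
# The scorer and weights below are stand-ins; the paper's re-ranking model may differ.
from typing import Callable, List, Tuple

def rerank(results: List[Tuple[str, float]],
           usefulness: Callable[[str], float],
           alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Blend the original retrieval score with a usefulness score and re-sort."""
    rescored = [(doc, alpha * score + (1 - alpha) * usefulness(doc))
                for doc, score in results]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Fabricated example: original ranking from a Web search engine (doc id, score).
original = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7)]
fake_usefulness = {"d1": 0.1, "d2": 0.9, "d3": 0.8}
print(rerank(original, lambda d: fake_usefulness.get(d, 0.0)))   # d2 now ranks first
```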
International Joint Conference on Artificial Intelligence | 2017
Tyler McDonnell; Mucahid Kutlu; Tamer Elsayed; Matthew Lease
Because it is expensive to construct test collections for Cranfield-based evaluation of information retrieval systems, a variety of lower-cost methods have been proposed. The reliability of these methods is often validated by measuring rank correlation (e.g., Kendall's tau) between known system rankings on the full test collection vs. observed system rankings on the lower-cost one. However, existing rank correlation measures do not consider the statistical significance of score differences between systems in the observed rankings. To address this, we propose two statistical-significance-aware rank correlation measures, one of which is a head-weighted version of the other. We first show empirical differences between our proposed measures and existing ones. We then compare the measures while benchmarking four system evaluation methods: pooling, crowdsourcing, evaluation with incomplete judgments, and automatic system ranking. We show that use of our measures can lead to different experimental conclusions regarding the reliability of alternative low-cost evaluation methods.
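One simplified reading of a significance-aware correlation: when comparing the observed (low-cost) ranking to the known ranking, a disagreeing system pair is only penalized if the two systems' scores differ significantly on the observed collection (e.g., by a paired t-test over per-topic scores). The sketch below implements that simplified reading on fabricated data; it is an illustration, not the authors' exact measures.

```python
# Simplified sketch of a significance-aware rank correlation: a disagreeing system
# pair counts against correlation only if its score difference is significant on
# the observed collection. Illustration only, not the paper's exact measure.
from itertools import combinations
import numpy as np
from scipy.stats import ttest_rel

def sig_aware_tau(full_means, observed_scores, alpha=0.05):
    """full_means[s]: mean score of system s on the full collection.
    observed_scores[s, t]: per-topic scores of system s on the low-cost collection."""
    n = len(full_means)
    obs_means = observed_scores.mean(axis=1)
    num, den = 0.0, 0
    for i, j in combinations(range(n), 2):
        _, p = ttest_rel(observed_scores[i], observed_scores[j])
        agree = np.sign(full_means[i] - full_means[j]) == np.sign(obs_means[i] - obs_means[j])
        den += 1
        if agree:
            num += 1
        elif p < alpha:     # only a significant disagreement is penalized
            num -= 1
        # insignificant disagreements contribute 0
    return num / den

rng = np.random.default_rng(1)
observed = rng.random((10, 25))                      # fabricated per-topic scores, 10 systems
full = observed.mean(axis=1) + rng.normal(0, 0.01, 10)
print(round(sig_aware_tau(full, observed), 3))
```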
National Conference on Artificial Intelligence | 2016
Tyler McDonnell; Matthew Lease; Mucahid Kutlu; Tamer Elsayed
When collecting subjective human ratings of items, it can be difficult to measure and enforce data quality due to task subjectivity and lack of insight into how judges arrive at each rating decision. To address this, we propose requiring judges to provide a specific type of rationale underlying each rating decision. We evaluate this approach in the domain of Information Retrieval, where human judges rate the relevance of Web pages. Cost-benefit analysis over 10,000 judgments collected on Mechanical Turk suggests a win-win: experienced crowd workers provide rationales with no increase in task completion time while providing further benefits, including more reliable judgments and greater transparency.
arXiv: Information Retrieval | 2018
Mucahid Kutlu; Vivek Khetan; Matthew Lease