Abdur Chowdhury | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Abdur Chowdhury is active.

Explore More

Publication

Featured researches published by Abdur Chowdhury.

international conference on management of data | 2006

Effective keyword search in relational databases

Fang Liu; Clement T. Yu; Weiyi Meng; Abdur Chowdhury

With the amount of available text data in relational databases growing rapidly, the need for ordinary users to search such information is dramatically increasing. Even though the major RDBMSs have provided full-text search capabilities, they still require users to have knowledge of the database schemas and use a structured query language to search information. This search model is complicated for most ordinary users. Inspired by the big success of information retrieval (IR) style keyword search on the web, keyword search in relational databases has recently emerged as a new research topic. The differences between text databases and relational databases result in three new challenges: (1) Answers needed by users are not limited to individual tuples, but results assembled from joining tuples from multiple tables are used to form answers in the form of tuple trees. (2) A single score for each answer (i.e. a tuple tree) is needed to estimate its relevance to a given query. These scores are used to rank the most relevant answers as high as possible. (3) Relational databases have much richer structures than text databases. Existing IR strategies to rank relational outputs are not adequate. In this paper, we propose a novel IR ranking strategy for effective keyword search. We are the first that conducts comprehensive experiments on search effectiveness using a real world database and a set of keyword queries collected by a major search company. Experimental results show that our strategy is significantly better than existing strategies. Our approach can be used both at the application level and be incorporated into a RDBMS to support keyword-based search in relational databases.

international acm sigir conference on research and development in information retrieval | 2004

Hourly analysis of a very large topically categorized web query log

Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David A. Grossman; Ophir Frieder

We review a query log of hundreds of millions of queries that constitute the total query traffic for an entire week of a general-purpose commercial web search service. Previously, query logs have been studied from a single, cumulative view. In contrast, our analysis shows changes in popularity and uniqueness of topically categorized queries across the hours of the day. We examine query traffic on an hourly basis by matching it against lists of queries that have been topically pre-categorized by human editors. This represents 13% of the query traffic. We show that query traffic from particular topical categories differs both from the query stream as a whole and from other categories. This analysis provides valuable insight for improving retrieval effectiveness and efficiency. It is also relevant to the development of enhanced query disambiguation, routing, and caching algorithms.

ACM Transactions on Information Systems | 2002

Collection statistics for fast duplicate document detection

Abdur Chowdhury; Ophir Frieder; David A. Grossman; Mary Catherine McCabe

We present a new algorithm for duplicate document detection thatuses collection statistics. We compare our approach with thestate-of-the-art approach using multiple collections. Thesecollections include a 30 MB 18,577 web document collectiondeveloped by Excite@Home and three NIST collections. The first NISTcollection consists of 100 MB 18,232 LA-Times documents, which isroughly similar in the number of documents to theExcite&at;Home collection. The other two collections are both 2GB and are the 247,491-web document collection and the TREC disks 4and 5---528,023 document collection. We show that our approachcalled I-Match, scales in terms of the number of documents andworks well for documents of all sizes. We compared our solution tothe state of the art and found that in addition to improvedaccuracy of detection, our approach executed in roughly one-fifththe time.

ACM Transactions on Information Systems | 2007

Repeatable evaluation of search services in dynamic environments

Eric C. Jensen; Steven M. Beitzel; Abdur Chowdhury; Ophir Frieder

In dynamic environments, such as the World Wide Web, a changing document collection, query population, and set of search services demands frequent repetition of search effectiveness (relevance) evaluations. Reconstructing static test collections, such as in TREC, requires considerable human effort, as large collection sizes demand judgments deep into retrieved pools. In practice it is common to perform shallow evaluations over small numbers of live engines (often pairwise, engine A vs. engine B) without system pooling. Although these evaluations are not intended to construct reusable test collections, their utility depends on conclusions generalizing to the query population as a whole. We leverage the bootstrap estimate of the reproducibility probability of hypothesis tests in determining the query sample sizes required to ensure this, finding they are much larger than those required for static collections. We propose a semiautomatic evaluation framework to reduce this effort. We validate this framework against a manual evaluation of the top ten results of ten Web search engines across 896 queries in navigational and informational tasks. Augmenting manual judgments with pseudo-relevance judgments mined from Web taxonomies reduces both the chances of missing a correct pairwise conclusion, and those of finding an errant conclusion, by approximately 50%.

very large data bases | 2010

Fast incremental and personalized PageRank

Bahman Bahmani; Abdur Chowdhury; Ashish Goel

In this paper, we analyze the efficiency of Monte Carlo methods for incremental computation of PageRank, personalized PageRank, and similar random walk based methods (with focus on SALSA), on large-scale dynamically evolving social networks. We assume that the graph of friendships is stored in distributed shared memory, as is the case for large social networks such as Twitter. For global PageRank, we assume that the social network has n nodes, and m adversarially chosen edges arrive in a random order. We show that with a reset probability of e, the expected total work needed to maintain an accurate estimate (using the Monte Carlo method) of the PageRank of every node at all times is [EQUATION]. This is significantly better than all known bounds for incremental PageRank. For instance, if we naively recompute the PageRanks as each edge arrives, the simple power iteration method needs [EQUATION] total time and the Monte Carlo method needs O(mn/e) total time; both are prohibitively expensive. We also show that we can handle deletions equally efficiently. We then study the computation of the top k personalized PageRanks starting from a seed node, assuming that personalized PageRanks follow a power-law with exponent α q ln n random walks starting from every node for large enough constant q (using the approach outlined for global PageRank), then the expected number of calls made to the distributed social network database is O(k/(R(1-α)/α)). We also present experimental results from the social networking site, Twitter, verifying our assumptions and analyses. The overall result is that this algorithm is fast enough for real-time queries over a dynamic social network.

ACM Transactions on Information Systems | 2007

Automatic classification of Web queries using very large unlabeled query logs

Steven M. Beitzel; Eric C. Jensen; David Lewis; Abdur Chowdhury; Ophir Frieder

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system must route queries to a subset of topic-specific and resource-constrained back-end databases. Successful query classification poses a challenging problem, as Web queries are short, thus providing few features. This feature sparseness, coupled with the constantly changing distribution and vocabulary of queries, hinders traditional text classification. We attack this problem by combining multiple classifiers, including exact lookup and partial matching in databases of manually classified frequent queries, linear models trained by supervised learning, and a novel approach based on mining selectional preferences from a large unlabeled query log. Our approach classifies queries without using external sources of information, such as online Web directories or the contents of retrieved pages, making it viable for use in demanding operational environments, such as large-scale Web search services. We evaluate our approach using a large sample of queries from an operational Web search engine and show that our combined method increases recall by nearly 40% over the best single method while maintaining adequate precision. Additionally, we compare our results to those from the 2005 KDD Cup and find that we perform competitively despite our operational restrictions. This suggests it is possible to topically classify a significant portion of the query stream without requiring external sources of information, allowing for deployment in operationally restricted environments.

international conference on data mining | 2005

Improving automatic query classification via semi-supervised learning

Steven M. Beitzel; Eric C. Jensen; Ophir Frieder; David Lewis; Abdur Chowdhury; Aleksander Kolcz

Accurate topical classification of user queries allows for increased effectiveness and efficiency in general-purpose Web search systems. Such classification becomes critical if the system is to return results not just from a general Web collection but from topic-specific back-end databases as well. Maintaining sufficient classification recall is very difficult as Web queries are typically short, yielding few features per query. This feature sparseness coupled with the high query volumes typical for a large-scale search service makes manual and supervised learning approaches alone insufficient. We use an application of computational linguistics to develop an approach for mining the vast amount of unlabeled data in Web query logs to improve automatic topical Web query classification. We show that our approach in combination with manual matching and supervised learning allows us to classify a substantially larger proportion of queries than any single technique. We examine the performance of each approach on a real Web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of best single approach by nearly 20%, with a 7% improvement in overall effectiveness.

international acm sigir conference on research and development in information retrieval | 2005

Automatic web query classification using labeled and unlabeled training data

Steven M. Beitzel; Eric C. Jensen; Ophir Frieder; David A. Grossman; David Lewis; Abdur Chowdhury; Aleksander Kolcz

Accurate topical categorization of user queries allows for increased effectiveness, efficiency, and revenue potential in general-purpose web search systems. Such categorization becomes critical if the system is to return results not just from a general web collection but from topic-specific databases as well. Maintaining sufficient categorization recall is very difficult as web queries are typically short, yielding few features per query. We examine three approaches to topical categorization of general web queries: matching against a list of manually labeled queries, supervised learning of classifiers, and mining of selectional preference rules from large unlabeled query logs. Each approach has its advantages in tackling the web query classification recall problem, and combining the three techniques allows us to classify a substantially larger proportion of queries than any of the individual techniques. We examine the performance of each approach on a real web query stream and show that our combined method accurately classifies 46% of queries, outperforming the recall of the best single approach by nearly 20%, with a 7% improvement in overall effectiveness.

Journal of the Association for Information Science and Technology | 2004

Fusion of effective retrieval strategies in the same information retrieval system

Steven M. Beitzel; Eric C. Jensen; Abdur Chowdhury; David A. Grossman; Ophir Frieder; Nazli Goharian

Prior efforts have shown that under certain situations retrieval effectiveness may be improved via the use of data fusion techniques. Although these improvements have been observed from the fusion of result sets from several distinct information retrieval systems, it has often been thought that fusing different document retrieval strategies in a single information retrieval system will lead to similar improvements. In this study, we show that this is not the case. We hold constant systemic differences such as parsing, stemming, phrase processing, and relevance feedback, and fuse result sets generated from highly effective retrieval strategies in the same information retrieval system. From this, we show that data fusion of highly effective retrieval strategies alone shows little or no improvement in retrieval effectiveness. Furthermore, we present a detailed analysis of the performance of modern data fusion approaches, and demonstrate the reasons why they do not perform well when applied to this problem. Detailed results and analyses are included to support our conclusions.

knowledge discovery and data mining | 2004

Improved robustness of signature-based near-replica detection via lexicon randomization

Aleksander Kolcz; Abdur Chowdhury; Joshua Alspector

Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are offset by only small increases in computational requirements.

Explore More