Jiaul H. Paik | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Jiaul H. Paik is active.

Explore More

Publication

Featured researches published by Jiaul H. Paik.

international acm sigir conference on research and development in information retrieval | 2013

A novel TF-IDF weighting scheme for effective ranking

Jiaul H. Paik

Term weighting schemes are central to the study of information retrieval systems. This article proposes a novel TF-IDF term weighting scheme that employs two different within document term frequency normalizations to capture two different aspects of term saliency. One component of the term frequency is effective for short queries, while the other performs better on long queries. The final weight is then measured by taking a weighted combination of these components, which is determined on the basis of the length of the corresponding query. Experiments conducted on a large number of TREC news and web collections demonstrate that the proposed scheme almost always outperforms five state of the art retrieval models with remarkable significance and consistency. The experimental results also show that the proposed model achieves significantly better precision than the existing models.

ACM Transactions on Information Systems | 2011

GRAS: An effective and efficient stemming algorithm for information retrieval

Jiaul H. Paik; Mandar Mitra; Swapan K. Parui; Kalervo Järvelin

A novel graph-based language-independent stemming algorithm suitable for information retrieval is proposed in this article. The main features of the algorithm are retrieval effectiveness, generality, and computational efficiency. We test our approach on seven languages (using collections from the TREC, CLEF, and FIRE evaluation platforms) of varying morphological complexity. Significant performance improvement over plain word-based retrieval, three other language-independent morphological normalizers, as well as rule-based stemmers is demonstrated.

international acm sigir conference on research and development in information retrieval | 2011

A novel corpus-based stemming algorithm using co-occurrence statistics

Jiaul H. Paik; Dipasree Pal; Swapan K. Parui

We present a stemming algorithm for text retrieval. The algorithm uses the statistics collected on the basis of certain corpus analysis based on the co-occurrence between two word variants. We use a very simple co-occurrence measure that reflects how often a pair of word variants occurs in a document as well as in the whole corpus. A graph is formed where the word variants are the nodes and two word variants form an edge if they co-occur. On the basis of the co-occurrence measure, a certain edge strength is defined for each of the edges. Finally, on the basis of the edge strengths, we propose a partition algorithm that groups the word variants based on their strongest neighbors, that is, the neighbors with largest strengths. Our stemming algorithm has two static parameters and does not use any other information except the co-occurrence statistics from the corpus. The experiments on TREC, CLEF and FIRE data consisting of four European and two Asian languages show a significant improvement over no-stem strategy on all the languages. Also, the proposed algorithm significantly outperforms a number of strong stemmers including the rule-based ones on a number of languages. For highly inflectional languages, a relative improvement of about 50% is obtained compared to un-normalized words and a relative improvement ranging from 5% to 16% is obtained compared to the rule based stemmer for the concerned language.

ACM Transactions on Information Systems | 2015

A Pólya Urn Document Language Model for Improved Information Retrieval

Ronan Cummins; Jiaul H. Paik; Yuanhua Lv

The multinomial language model has been one of the most effective models of retrieval for more than a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency—that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Pólya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly. We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than that in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.

ACM Transactions on Asian Language Information Processing | 2011

A Fast Corpus-Based Stemmer

Jiaul H. Paik; Swapan K. Parui

Stemming is a mechanism of word form normalization that transforms the variant word forms to their common root. In an Information Retrieval system, it is used to increase the system’s performance, specifically the recall and desirably the precision. Although its usefulness is shown to be mixed in languages such as English, because morphologically complex languages stemming produces a significant performance improvement. A number of linguistic rule-based stemmers are available for most European languages which employ a set of rules to get back the root word from its variants. But for Indian languages which are highly inflectional in nature, devising a linguistic rule-based stemmer needs some additional resources which are not available. We present an approach which is purely corpus based and finds the equivalence classes of variant words in an unsupervised manner. A set of experiments on four languages using FIRE, CLEF, and TREC test collections shows that our approach provides comparable results with linguistic rule-based stemmers for some languages and gives significant performance improvement for resource constrained languages such as Bengali and Marathi.

conference on information and knowledge management | 2014

A Fixed-Point Method for Weighting Terms in Verbose Informational Queries

Jiaul H. Paik; Douglas W. Oard

The term weighting and document ranking functions used with informational queries are typically optimized for cases in which queries are short and documents are long. It is reasonable to assume that the presence of a term in a short query reflects some aspect of the topic that is important to the user, and thus rewarding documents that contain the greatest number of distinct query terms is a useful heuristic. Verbose informational queries, such as those that result from cut-and-paste of example text, or that might result from informal spoken interaction, pose a different challenge in which many extraneous (and thus potentially misleading) terms may be present in the query. Modest improvements have been reported from applying supervised methods to learn which terms in a verbose query deserve the greatest emphasis. This paper proposes a novel unsupervised method for weighting terms in verbose informational queries that relies instead on iteratively estimating which terms are most central to the query. The key idea is to use an initial set of retrieval results to define a recursion on the term weight vector that converges to a fixed point representing the vector that optimally describes the initial result set. Experiments with several TREC news and Web test collections indicate that the proposed method often statistically significantly outperforms state of the art supervised methods.

rough sets and knowledge technology | 2007

Evolutionary fuzzy biclustering of gene expression data

Haider Banka; Jiaul H. Paik

Biclustering or simultaneous clustering attempts to find maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a range of conditions. The possibilistic approach extracts one bicluster at a time, by assigning to it a membership for each gene-condition pair. In this study, a novel evolutionary framework is introduced for generating optimal fuzzy possibilistic biclusters from microarray gene expression data. The different parameters controlling the size of the biclusters are tuned. The experimental results on benchmark datasets demonstrate better performance as compared to existing algorithms available in literature.

ACM Transactions on Information Systems | 2013

Effective and Robust Query-Based Stemming

Jiaul H. Paik; Swapan K. Parui; Dipasree Pal; Stephen E. Robertson

Stemming is a widely used technique in information retrieval systems to address the vocabulary mismatch problem arising out of morphological phenomena. The major shortcoming of the commonly used stemmers is that they accept the morphological variants of the query words without considering their thematic coherence with the given query, which leads to poor performance. Moreover, for many queries, such approaches also produce retrieval performance that is poorer than no stemming, thereby degrading the robustness. The main goal of this article is to present corpus-based fully automatic stemming algorithms which address these issues. A set of experiments on six TREC collections and three other non-English collections containing news and web documents shows that the proposed query-based stemming algorithms consistently and significantly outperform four state of the art strong stemmers of completely varying principles. Our experiments also confirm that the robustness of the proposed query-based stemming algorithms are remarkably better than the existing strong baselines.

north american chapter of the association for computational linguistics | 2015

Using Zero-Resource Spoken Term Discovery for Ranked Retrieval.

Jerome White; Douglas W. Oard; Aren Jansen; Jiaul H. Paik; Rashmi Sankepally

Research on ranked retrieval of spoken content has assumed the existence of some automated (word or phonetic) transcription. Recently, however, methods have been demonstrated for matching spoken terms to spoken content without the need for language-tuned transcription. This paper describes the first application of such techniques to ranked retrieval, evaluated using a newly created test collection. Both the queries and the collection to be searched are based on Gujarati produced naturally by native speakers; relevance assessment was performed by other native speakers of Gujarati. Ranked retrieval is based on fast acoustic matching that identifies a deeply nested set of matching speech regions, coupled with ways of combining evidence from those matching regions. Results indicate that the resulting ranked lists may be useful for some practical similarity-based ranking tasks.

international acm sigir conference on research and development in information retrieval | 2014

Assessing the reliability and reusability of an E-discovery privilege test collection

Jyothi K. Vinjumur; Douglas W. Oard; Jiaul H. Paik

In some jurisdictions, parties to a lawsuit can request documents from each other, but documents subject to a claim of privilege may be withheld. The TREC 2010 Legal Track developed what is presently the only public test collection for evaluating privilege classification. This paper examines the reliability and reusability of that collection. For reliability, the key question is the extent to which privilege judgments correctly reflect the opinion of the senior litigator whose judgment is authoritative. For reusability, the key question is the degree to which systems whose results contributed to creation of the test collection can be fairly compared with other systems that use those privilege judgments in the future. These correspond to measurement error and sampling error, respectively. The results indicate that measurement error is the larger problem.

Explore More