Shoaib Jameel

The Chinese University of Hong Kong

Publication

Featured research published by Shoaib Jameel.

International ACM SIGIR Conference on Research and Development in Information Retrieval | 2013

An unsupervised topic segmentation model incorporating word order

Shoaib Jameel; Wai Lam

We present a new unsupervised topic discovery model for a collection of text documents. In contrast to the majority of the state-of-the-art topic models, our model does not break the document's structure, such as paragraphs and sentences. In addition, it preserves word order in the document. As a result, it can generate two levels of topics of different granularity, namely, segment-topics and word-topics. It can also generate n-gram words in each topic. We develop an approximate inference scheme using the Gibbs sampling method. We conduct extensive experiments using publicly available data from different collections and show that our model improves the quality of several text mining tasks, such as the ability to support fine-grained topics with n-gram words in the correlation graph, the ability to segment a document into topically coherent sections, document classification, and document likelihood estimation.

ACM Transactions on Information Systems | 2015

Web Query Reformulation via Joint Modeling of Latent Topic Dependency and Term Context

Lidong Bing; Wai Lam; Tak-Lam Wong; Shoaib Jameel

An important way to improve users’ satisfaction in Web search is to assist them in issuing more effective queries. One such approach is query reformulation, which generates new queries according to the current query issued by users. A common procedure for conducting reformulation is to generate some candidate queries first and then employ a scoring method to assess these candidates. Currently, most of the existing methods are context based. They rely heavily on the context relation of terms in the history queries and cannot detect and maintain the semantic consistency of queries. In this article, we propose a graphical model to score queries. The proposed model exploits a latent topic space, which is automatically derived from the query log, to detect semantic dependency of terms in a query and dependency among topics. Meanwhile, the graphical model also captures the term context in the history query by skip-bigram and n-gram language models. In addition, our model can be easily extended to consider users’ history search interests when we conduct query reformulation for different users. In the task of candidate query generation, we investigate a social tagging data resource, Delicious bookmarks, to generate addition and substitution patterns that are employed as supplements to the patterns generated from query log data.
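The abstract mentions skip-bigrams as one way of capturing term context in history queries. As a minimal illustration of what a skip-bigram is (not the authors' scoring model), the sketch below extracts ordered term pairs from a tokenized query. The function name `skip_bigrams`, the `max_skip` parameter, and the toy query are assumptions made for this example.

```python
from typing import List, Tuple


def skip_bigrams(tokens: List[str], max_skip: int = 1) -> List[Tuple[str, str]]:
    """Return ordered term pairs with at most `max_skip` terms between them.

    With max_skip=0 this reduces to ordinary adjacent bigrams.
    """
    pairs = []
    for i, left in enumerate(tokens):
        # j covers the positions reachable from i within the skip window.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


# Toy query, made up for illustration.
query = "cheap flights hong kong".split()
print(skip_bigrams(query, max_skip=1))
```

Counting such pairs over a query log would give the raw statistics for a skip-bigram language model; how those counts enter the graphical model is not specified in the abstract.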

Knowledge-Based Systems | 2015

Adaptive Concept Resolution for document representation and its applications in text mining

Lidong Bing; Shan Jiang; Wai Lam; Yan Zhang; Shoaib Jameel

It is well known that synonymous and polysemous terms often introduce noise when we calculate the similarity between documents. Existing ontology-based document representation methods are static, so the semantic concepts selected for representing a document have a fixed resolution. Therefore, they are not adaptable to the characteristics of the document collection and the text mining problem at hand. We propose an Adaptive Concept Resolution (ACR) model to overcome this problem. ACR can learn a concept border from an ontology, taking into consideration the characteristics of the particular document collection. This border then provides a tailor-made semantic concept representation for a document coming from the same domain. Another advantage of ACR is that it is applicable to both classification tasks, where the groups are given in the training document set, and clustering tasks, where no group information is available. The experimental results show that ACR outperforms an existing static method in almost all cases. We also present a method to integrate Wikipedia entities into an expert-edited ontology, namely WordNet, to generate an enhanced ontology named WordNet-Plus, and its performance is also examined under the ACR model. Due to its high coverage, WordNet-Plus can outperform WordNet in classification on data sets that contain fresher documents.

European Conference on Information Retrieval | 2013

An n-gram topic model for time-stamped documents

Shoaib Jameel; Wai Lam

This paper presents a topic model that captures the temporal dynamics in text data along with topical phrases. Previous approaches have relied upon the bag-of-words assumption to model such properties in a corpus. This has resulted in inferior performance with less interpretable topics. Our topic model not only captures how a topic structure changes over time but also maintains important contextual information in the text data. Finding topical n-grams, when possible based on context, instead of always presenting unigrams in topics does away with many ambiguities that individual words may carry. We derive a collapsed Gibbs sampler for posterior inference. Our experimental results show an improvement over the current state-of-the-art Topics over Time model.

Web Intelligence | 2012

Ranking Text Documents Based on Conceptual Difficulty Using Term Embedding and Sequential Discourse Cohesion

Shoaib Jameel; Wai Lam; Xiaojun Qian

We propose a novel framework for determining the conceptual difficulty of a domain-specific text document without using any external lexicon. Conceptual difficulty relates to finding the reading difficulty of domain-specific documents. Previous approaches to tackling the domain-specific readability problem have relied heavily upon an external lexicon, which limits their scalability to other domains. Our model can be readily applied in domain-specific vertical search engines to re-rank documents according to their conceptual difficulty. We develop an unsupervised and principled approach for computing a term's conceptual difficulty in the latent space. Our approach also considers transitions between the segments generated in sequence. It performs better than the current state-of-the-art comparative methods.

ACM/IEEE Joint Conference on Digital Libraries | 2012

An unsupervised technical difficulty ranking model based on conceptual terrain in the latent space

Shoaib Jameel; Wai Lam; Xiaojun Qian; Ching-man Au Yeung

Search results of existing general-purpose search engines usually do not satisfy domain-specific information retrieval tasks, as there is a mismatch between the technical expertise of a user and the results returned by the search engine. In this paper, we investigate the problem of ranking domain-specific documents based on technical difficulty. We propose an unsupervised conceptual terrain model using Latent Semantic Indexing (LSI) for re-ranking search results obtained from a similarity-based search system. We connect the sequences of terms in the latent space by the semantic distance between the terms and compute the traversal cost for a document, which indicates its technical difficulty. Our experiments on a domain-specific corpus demonstrate the efficacy of our method.
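The idea of a traversal cost over a sequence of terms can be illustrated with a small sketch. It is not the paper's cost function: it assumes each term already has a latent vector (for example, from an LSI decomposition), uses Euclidean distance as the semantic distance, and treats a higher total cost as a sign of higher difficulty. The names `traversal_cost`, `toy_vectors`, and `docs`, and all coordinates, are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def traversal_cost(terms: List[str], vectors: Dict[str, Sequence[float]]) -> float:
    """Sum the distances between consecutive in-vocabulary terms of a document."""
    path = [vectors[t] for t in terms if t in vectors]  # unknown terms are skipped
    # Euclidean distance is an assumption standing in for the semantic distance.
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


# Hypothetical 2-D latent coordinates, made up for illustration.
toy_vectors = {
    "cell": (0.0, 0.0),
    "membrane": (3.0, 4.0),
    "protein": (3.0, 0.0),
}

docs = {
    "easy": ["cell", "cell", "cell"],
    "hard": ["cell", "membrane", "protein"],
}

# Rank documents from lowest to highest traversal cost.
ranked = sorted(docs, key=lambda name: traversal_cost(docs[name], toy_vectors))
print(ranked)
```

In this toy setting a document that keeps jumping between distant terms accumulates a larger cost than one that stays in the same region of the latent space.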

Conference on Information and Knowledge Management | 2011

An unsupervised ranking method based on a technical difficulty terrain

Shoaib Jameel; Wai Lam; Ching-man Au Yeung; Sheaujiun Chyan

Users look for information that suits their level of expertise, but it often takes a mammoth effort to trace such information. One has to sift through multiple pages to look for one that fits the appropriate technical background. In this paper, a query-independent ranking system is proposed for technical web pages. The pages returned by the system are sorted by their relative technical difficulty in either ascending or descending order, as specified by the user. The technical difficulty of a document, i.e., its terms in sequence, is first computed by combining each individual term's geometry in the low-dimensional latent semantic indexing (LSI) space, which can be visualized as a conceptual terrain. Then the pages are ranked based on the expected cost to get over the terrain. Results indicate that our terrain-based method outperforms traditional readability measures.

International Conference on the Theory of Information Retrieval | 2016

Who Wants to Join Me?: Companion Recommendation in Location Based Social Networks

Yi Liao; Wai Lam; Shoaib Jameel; Steven Schockaert; Xing Xie

We consider the problem of identifying possible companions for a user who is planning to visit a given venue. Specifically, we study the task of predicting which of the user's current friends, in a location based social network (LBSN), are most likely to be interested in joining the visit. An important underlying assumption of our model is that friendship relations can be clustered based on the kinds of interests that are shared by the friends. To identify these friendship types, we use a latent topic model, which moreover takes into account the geographic proximity of the user to the location of the proposed venue. To the best of our knowledge, our model is the first that addresses the task of recommending companions for a proposed activity. While a number of existing topic models can be adapted to make such predictions, we experimentally show that such methods are significantly outperformed by our model.

Semantics, Knowledge and Grid | 2012

An Unsupervised Technical Readability Ranking Model by Building a Conceptual Terrain in LSI

Shoaib Jameel; Xiaojun Qian

Searching for domain-specific information has become very popular in recent years. Naturally, not everyone is on par when it comes to knowledge about the concepts of a domain. A doctor may be well versed in her field of specialization and would probably search for advanced medical documents on the Internet, but she may look for much simpler material related to computer programming. However, current information retrieval (IR) systems just return a mixed set of results based on the similarity and popularity of the web pages. Existing methods that have tried to address the issue of matching readers with texts in domain-specific IR use either an ontology or some seed concepts, thereby limiting their application to certain domains. Moreover, readability methods cannot address the issue in domain-specific IR ranking because they fail to give precise predictions when applied to web pages. We address this problem in domain-specific search using a conceptual model where the sequence of the terms in a document is modeled as a connected conceptual terrain. Our model has achieved significant improvement in ranking documents by technical readability.

International Conference on Computational Linguistics | 2014

Website Community Mining from Query Logs with Two-Phase Clustering

Lidong Bing; Wai Lam; Shoaib Jameel; Chunliang Lu

A website community refers to a set of websites that concentrate on the same or similar topics. There are two major challenges in the website community mining task. First, the websites on the same topic may not have direct links among them because of competition concerns. Second, one website may contain information about several topics. Accordingly, the website community mining method should be able to capture such phenomena and assign such a website to different communities. In this paper, we propose a method to automatically mine website communities by exploiting the query log data in Web search. Query log data can be regarded as a comprehensive summarization of the real Web. The queries that lead to clicks on a particular website can be regarded as a summarization of that website's content. The websites on the same topic are indirectly connected by the queries that convey an information need in this topic. This observation can help us overcome the first challenge. The proposed two-phase method can tackle the second challenge. In the first phase, we cluster the queries of the same host to obtain different content aspects of the host. In the second phase, we further cluster the obtained content aspects from different hosts. Because of the two-phase clustering, one host may appear in more than one website community.
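The two phases described in the abstract can be sketched on toy data. This is only an illustration of the control flow under loud assumptions: the abstract does not say which clustering algorithm or similarity measure is used, so the sketch substitutes a greedy single-pass clustering with Jaccard overlap of query terms. The names `jaccard`, `greedy_cluster`, `mine_communities`, the threshold, and the host names in `toy_log` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two term sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def greedy_cluster(items: List[Tuple[str, Set[str]]], threshold: float) -> List[dict]:
    """Assign each (label, terms) item to the first sufficiently similar cluster."""
    clusters: List[dict] = []  # each cluster: {"labels": set, "terms": set}
    for label, terms in items:
        for c in clusters:
            if jaccard(terms, c["terms"]) >= threshold:
                c["labels"].add(label)
                c["terms"] |= terms
                break
        else:  # no existing cluster was similar enough
            clusters.append({"labels": {label}, "terms": set(terms)})
    return clusters


def mine_communities(log: Dict[str, List[str]], threshold: float = 0.3) -> List[List[str]]:
    # Phase 1: cluster the queries of each host into content aspects.
    aspects = []
    for host, queries in log.items():
        items = [(host, set(q.split())) for q in queries]
        aspects += [(host, c["terms"]) for c in greedy_cluster(items, threshold)]
    # Phase 2: cluster the aspects across hosts; each cluster is a community.
    return [sorted(c["labels"]) for c in greedy_cluster(aspects, threshold)]


# Toy click log: host -> queries that led to a click on it (made up).
toy_log = {
    "sportsnews.example": ["football scores", "football results"],
    "portal.example": ["football scores today", "stock prices", "stock market prices"],
    "finance.example": ["stock prices today"],
}
communities = mine_communities(toy_log)
print(communities)
```

Because the portal host contributes two separate aspects in the first phase, it ends up in both the sports and the finance community in the second phase, which mirrors the behaviour the abstract describes.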