Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Luo Si is active.

Publication


Featured research published by Luo Si.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2004

An automatic weighting scheme for collaborative filtering

Rong Jin; Joyce Y. Chai; Luo Si

Collaborative filtering identifies the information interests of a particular user based on the information provided by other, similar users. Memory-based approaches for collaborative filtering (e.g., the Pearson correlation coefficient approach) identify the similarity between two users by comparing their ratings on a set of items. In these approaches, different items are weighted either equally or by some predefined function. The impact of rating discrepancies among different users has not been taken into consideration. For example, an item that is highly favored by most users should have a smaller impact on user similarity than an item for which different types of users tend to give different ratings. Even though simple weighting methods such as variance weighting try to address this problem, empirical studies have shown that they are ineffective in improving the performance of collaborative filtering. In this paper, we present an optimization algorithm that automatically computes the weights for different items based on their ratings from training users. More specifically, the new weighting scheme creates a clustered distribution for user vectors in the item space by bringing users of similar interests closer together and pushing users of different interests farther apart. Empirical studies over two datasets show that our new weighting scheme substantially improves the performance of the Pearson correlation coefficient method for collaborative filtering.
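
The item-weighting idea can be sketched with a short Python snippet. This is a minimal illustration of a weighted Pearson similarity, not the paper's optimization algorithm; the `weights` array is assumed to come from a separate training step over training users.

```python
import numpy as np

def weighted_pearson(u, v, weights):
    """Pearson correlation between two users' ratings, with per-item weights.

    u, v    : 1-D arrays of ratings on the same items (NaN = not rated)
    weights : per-item weights, assumed learned from training users
    """
    mask = ~np.isnan(u) & ~np.isnan(v)              # items rated by both users
    if mask.sum() < 2:
        return 0.0
    w = weights[mask]
    du = u[mask] - np.average(u[mask], weights=w)   # weighted mean-centering
    dv = v[mask] - np.average(v[mask], weights=w)
    num = np.sum(w * du * dv)
    den = np.sqrt(np.sum(w * du**2) * np.sum(w * dv**2))
    return num / den if den > 0 else 0.0

# Uniform weights recover the standard Pearson approach; learned weights would
# down-weight items on which most users agree and emphasise discriminative ones.
```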


Conference on Information and Knowledge Management | 2002

A language modeling framework for resource selection and results merging

Luo Si; Rong Jin; James P. Callan; Paul Ogilvie

Statistical language models have been proposed recently for several information retrieval tasks, including the resource selection task in distributed information retrieval. This paper extends the language modeling approach to integrate resource selection, ad-hoc searching, and merging of results from different text databases into a single probabilistic retrieval model. This new approach is designed primarily for Intranet environments, where it is reasonable to assume that resource providers are relatively homogeneous and can adopt the same kind of search engine. Experiments demonstrate that this new, integrated approach is at least as effective as the prior state-of-the-art in distributed IR.
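
As a rough illustration of the language-modeling view of resource selection, the Python sketch below scores a database by the query likelihood of a smoothed unigram model built from its sampled documents. The linear smoothing with parameter `lam` and the unigram formulation are assumptions for illustration, not the paper's exact model.

```python
import math

def resource_score(query_terms, db_term_counts, collection_term_counts, lam=0.5):
    """Rank a database by the query likelihood of its (smoothed) language model.

    db_term_counts         : term -> count in the database's sampled documents
    collection_term_counts : term -> count across all databases (background model)
    lam                    : linear-smoothing weight (assumed value)
    """
    db_total = sum(db_term_counts.values()) or 1
    coll_total = sum(collection_term_counts.values()) or 1
    log_prob = 0.0
    for t in query_terms:
        p_db = db_term_counts.get(t, 0) / db_total
        p_bg = collection_term_counts.get(t, 0) / coll_total
        log_prob += math.log(lam * p_db + (1 - lam) * p_bg + 1e-12)
    return log_prob

# Databases are selected in decreasing order of resource_score, and comparable
# probabilistic scores can then be carried through to results merging.
```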


ACM Multimedia | 2004

Effective automatic image annotation via a coherent language model and active learning

Rong Jin; Joyce Y. Chai; Luo Si

Image annotations allow users to access a large image database with textual queries. There have been several studies on automatic image annotation using machine learning techniques, which learn statistical models from annotated images and apply them to generate annotations for unseen images. One common problem shared by most previous learning approaches is that each annotation word is predicted for an image independently of the other annotation words. In this paper, we propose a coherent language model for automatic image annotation that takes word-to-word correlation into account by estimating a coherent language model for an image. This new approach has two important advantages: 1) it is able to automatically determine the annotation length, which improves the accuracy of retrieval results, and 2) it can be used with active learning to significantly reduce the required number of annotated image examples. Empirical studies with the Corel dataset show the effectiveness of the coherent language model for automatic image annotation.
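
The paper's coherent language model is not reproduced here, but the Python sketch below illustrates, under strong simplifying assumptions, how word-to-word correlation and an automatic stopping rule might shape an annotation. The `cooccurrence` table and `threshold` are hypothetical inputs standing in for quantities a trained model would provide.

```python
def annotate(image_word_scores, cooccurrence, threshold=0.0, max_len=10):
    """Greedy annotation that rewards words consistent with already-chosen words.

    image_word_scores : word -> independent relevance score for this image
    cooccurrence      : (word_a, word_b) -> correlation score from training captions
    threshold         : stop when the best marginal gain falls below it
                        (hypothetical rule for picking the annotation length)
    """
    chosen = []
    candidates = dict(image_word_scores)
    while candidates and len(chosen) < max_len:
        def gain(w):
            coherence = sum(cooccurrence.get((w, c), 0.0) for c in chosen)
            return candidates[w] + coherence
        best = max(candidates, key=gain)
        if gain(best) < threshold:
            break                      # annotation length decided automatically
        chosen.append(best)
        del candidates[best]
    return chosen
```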


ACM Transactions on Information Systems | 2003

A semisupervised learning method to merge search engine results

Luo Si; James P. Callan

The proliferation of searchable text databases on local area networks and the Internet creates the problem of finding information that may be distributed among many disjoint text databases (distributed information retrieval). How to merge the results returned by selected databases is an important subproblem of the distributed information retrieval task. Previous research assumed either that resource providers cooperate to provide normalizing statistics, or that search clients download all retrieved documents and compute normalized scores without cooperation from resource providers. This article presents a semisupervised learning solution to the result-merging problem. The key contribution is the observation that the information used to create resource descriptions for resource selection can also be used to create a centralized sample database that guides the normalization of document scores returned by different databases. At retrieval time, the query is sent to the selected databases, which return database-specific document scores, and to a centralized sample database, which returns database-independent document scores. Documents that have both a database-specific score and a database-independent score serve as training data for learning to normalize the scores of other documents. An extensive set of experiments demonstrates that this method is more effective than the well-known CORI result-merging algorithm under a variety of conditions.
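
A minimal Python sketch of the score-normalization idea follows, assuming a simple linear least-squares fit over the overlapping documents; the function and variable names are illustrative rather than taken from the article.

```python
import numpy as np

def learn_merge_model(db_scores, centralized_scores):
    """Fit a per-database linear mapping from database-specific scores to
    database-independent scores, using documents that appear in both rankings.

    db_scores          : doc_id -> score returned by the selected database
    centralized_scores : doc_id -> score from the centralized sample database
    """
    overlap = [d for d in db_scores if d in centralized_scores]
    x = np.array([db_scores[d] for d in overlap])          # needs >= 2 overlap docs
    y = np.array([centralized_scores[d] for d in overlap])
    a, b = np.polyfit(x, y, deg=1)          # least-squares line y ~ a*x + b
    return lambda s: a * s + b              # normalizer for the remaining docs

# The learned mapping converts every database-specific score onto the
# centralized scale, so results from different databases can be interleaved.
```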


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2011

Composite hashing with multiple information sources

Dan Zhang; Fei Wang; Luo Si

Similarity search applications with large amounts of text and image data demand an efficient and effective solution. One useful strategy is to represent the examples in databases as compact binary codes through semantic hashing, which has attracted much attention due to its fast query/search speed and drastically reduced storage requirements. Current semantic hashing methods only deal with the case in which each example is represented by one type of feature. However, in many real-world applications examples are described by several different information sources. For example, the characteristics of a webpage can be derived from both its content and its associated links. To address the problem of learning good hashing codes in this scenario, we propose a novel research problem: Composite Hashing with Multiple Information Sources (CHMIS). The focus of this new research problem is to design an algorithm for incorporating the features from different information sources into the binary hashing codes efficiently and effectively. In particular, we propose an algorithm, CHMIS-AW (CHMIS with Adjusted Weights), for learning the codes. The proposed algorithm integrates information from several different sources into the binary hashing codes by adjusting the weights on each individual source to maximize the coding performance, and it enables fast conversion from query examples to their binary hashing codes. Experimental results on five different datasets demonstrate the superior performance of the proposed method against several other state-of-the-art semantic hashing techniques.
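
The Python sketch below is not the CHMIS-AW optimization; it only illustrates how learned per-source weights and projection matrices could be combined into a single binary code, with random matrices standing in for trained parameters.

```python
import numpy as np

def composite_hash(features_by_source, projections, source_weights):
    """Map one example, described by several feature vectors, to binary codes.

    features_by_source : list of feature vectors, one per information source
    projections        : list of (code_len x dim_k) projection matrices
    source_weights     : non-negative weights over sources (assumed learned,
                         e.g. by an adjusted-weights scheme)
    """
    combined = sum(w * (P @ x)
                   for w, P, x in zip(source_weights, projections, features_by_source))
    return (combined > 0).astype(np.int8)    # sign thresholding -> {0, 1} bits

# Example with two sources (e.g. page content and link features):
rng = np.random.default_rng(0)
x_text, x_link = rng.normal(size=50), rng.normal(size=20)
P_text, P_link = rng.normal(size=(16, 50)), rng.normal(size=(16, 20))
print(composite_hash([x_text, x_link], [P_text, P_link], [0.7, 0.3]))
```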


Conference on Information and Knowledge Management | 2001

A statistical model for scientific readability

Luo Si; James P. Callan

In this paper, we present a new method of using statistical models to estimate readability [1]. A language model is used to capture content information, and it is combined with a linguistic feature model in a linear form. Experiments show that this new method performs better than the widely used Flesch-Kincaid readability formula.
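
A toy version of such a linear combination is sketched below in Python; the grade-level language model, the choice of average sentence length as the linguistic feature, and the mixing weight `alpha` are all illustrative assumptions rather than the paper's formulation.

```python
import math

def readability_score(text_terms, grade_level_lm, avg_sentence_length, alpha=0.5):
    """Combine a content-based language-model score with a surface linguistic
    feature in a linear form (illustrative weights and features only).

    grade_level_lm      : term -> probability under a grade-level language model
    avg_sentence_length : simple linguistic feature of the document
    alpha               : assumed mixing weight between the two components
    """
    lm_score = sum(math.log(grade_level_lm.get(t, 1e-6)) for t in text_terms)
    lm_score /= max(len(text_terms), 1)                  # per-term log-likelihood
    feature_score = -avg_sentence_length                 # longer sentences = harder
    return alpha * lm_score + (1 - alpha) * feature_score

# A document is assigned the grade level whose model gives the highest score.
```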


Information Retrieval | 2006

A study of mixture models for collaborative filtering

Rong Jin; Luo Si; ChengXiang Zhai

Collaborative filtering is a general technique for exploiting the preference patterns of a group of users to predict the utility of items for a particular user. Three different components need to be modeled in a collaborative filtering problem: users, items, and ratings. Previous research on applying probabilistic models to collaborative filtering has shown promising results. However, there is a lack of systematic studies of different ways to model each of the three components and their interactions. In this paper, we conduct a broad and systematic study of different mixture models for collaborative filtering. We discuss general issues related to using a mixture model for collaborative filtering, and propose three properties that a graphical model is expected to satisfy. Using these properties, we thoroughly examine five different mixture models: Bayesian Clustering (BC), the Aspect Model (AM), the Flexible Mixture Model (FMM), the Joint Mixture Model (JMM), and the Decoupled Model (DM). We compare these models both analytically and experimentally. Experiments over two datasets of movie ratings under different configurations show that, in general, whether a model satisfies the proposed properties tends to be correlated with its performance. In particular, the Decoupled Model, which satisfies all three desired properties, outperforms the other mixture models as well as many other existing approaches for collaborative filtering. Our study shows that graphical models are powerful tools for modeling collaborative filtering, but careful design is necessary to achieve good performance.
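
To make the mixture-model formulation concrete, the Python sketch below shows rating prediction under a simplified aspect-style mixture model, assuming the parameters have already been estimated (e.g., by EM). It is an illustration of the model family, not any specific model evaluated in the paper.

```python
import numpy as np

def predict_rating(p_z_given_u, p_item_given_z, p_rating_given_z, item, ratings):
    """Expected rating under a simplified aspect-style mixture model.

    p_z_given_u      : P(z | user), shape (K,)
    p_item_given_z   : P(item | z), shape (K, num_items)
    p_rating_given_z : P(rating | z), shape (K, num_rating_values)
    ratings          : the discrete rating values, e.g. [1, 2, 3, 4, 5]
    """
    # weight of each latent class for this (user, item) pair
    class_weight = p_z_given_u * p_item_given_z[:, item]
    class_weight /= class_weight.sum()
    # expected rating = sum_z P(z | user, item) * E[rating | z]
    expected_per_class = p_rating_given_z @ np.asarray(ratings, dtype=float)
    return float(class_weight @ expected_per_class)
```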


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2005

Modeling search engine effectiveness for federated search

Luo Si; Jamie Callan

Federated search links multiple search engines into a single, virtual search system. Most prior research on federated search focused on selecting the search engines that have the most relevant content, but ignored the retrieval effectiveness of the individual search engines. This omission can cause serious problems when federating search engines of different quality. This paper proposes a federated search technique that uses utility maximization to model the retrieval effectiveness of each search engine in a federated search environment. The new algorithm ranks the available resources by explicitly estimating the amount of relevant material that each resource can return, instead of the amount of relevant material that each resource contains. An extensive set of experiments demonstrates the effectiveness of the new algorithm.
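
The utility-maximization idea can be illustrated with a small Python sketch that ranks resources by the relevant documents they are expected to return rather than contain; the field names and the simple product estimate are assumptions for illustration only.

```python
def rank_resources(resources):
    """Rank resources by the relevant documents they are expected to RETURN,
    not merely contain (illustrative estimate only).

    Each resource is a dict with assumed fields:
      'relevant_estimate'   : estimated number of relevant documents it holds
      'engine_effectiveness': probability its engine surfaces a relevant
                              document in the top results (assumed, in [0, 1])
    """
    def expected_returned(res):
        return res['relevant_estimate'] * res['engine_effectiveness']
    return sorted(resources, key=expected_returned, reverse=True)

# Example: a smaller collection with a strong engine can outrank a larger
# collection served by a weak engine.
ranked = rank_resources([
    {'name': 'A', 'relevant_estimate': 200, 'engine_effectiveness': 0.2},
    {'name': 'B', 'relevant_estimate': 80,  'engine_effectiveness': 0.9},
])
print([r['name'] for r in ranked])   # ['B', 'A']
```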


Conference on Information and Knowledge Management | 2003

Collaborative filtering with decoupled models for preferences and ratings

Rong Jin; Luo Si; ChengXiang Zhai; James P. Callan

In this paper, we describe a new model for collaborative filtering. The motivation for this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. For example, one user may tend to assign higher ratings to all items than another user. Unlike previous models of collaborative filtering, which determine the similarity between two users based only on their ratings, our model treats a user's preferences on items separately from that user's rating scheme. More specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user, and a rating model capturing how the user would rate an item given the preference information. The similarity of two users is computed based on the underlying preference model instead of the surface ratings. We compare the new model with several representative previous approaches on two data sets. Experimental results show that the new model consistently outperforms all of the previous approaches tested on both data sets.
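
As a rough illustration of decoupling preferences from rating scales (not the paper's probabilistic model), the Python sketch below derives a crude preference signal from each user's own rating average and compares users on that signal.

```python
import numpy as np

def preference_similarity(ratings_u, ratings_v):
    """Similarity based on inferred preferences rather than raw ratings.

    ratings_u, ratings_v : 1-D float arrays of ratings (NaN = unrated).
    A hypothetical preference signal: +1 if an item is rated above the user's
    own average, -1 if below, 0 if unrated. This removes user-specific rating
    scales before comparing users.
    """
    def preferences(r):
        rated = ~np.isnan(r)
        prefs = np.zeros_like(r)
        prefs[rated] = np.sign(r[rated] - r[rated].mean())
        return prefs

    pu, pv = preferences(ratings_u), preferences(ratings_v)
    norm = np.linalg.norm(pu) * np.linalg.norm(pv)
    return float(pu @ pv) / norm if norm > 0 else 0.0

# Two users who like the same items score as similar even if one habitually
# rates everything a point or two higher than the other.
```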


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002

Using sampled data and regression to merge search engine results

Luo Si; James P. Callan

This paper addresses the problem of merging results obtained from different databases and search engines in a distributed information retrieval environment. Prior research on this problem either assumed the exchange of the statistics necessary for normalizing scores (cooperative solutions) or relied on heuristics. Both approaches have disadvantages. We show that the problem in uncooperative environments is simpler when viewed as a component of a distributed IR system that uses query-based sampling to create resource descriptions. Documents sampled for creating resource descriptions can also be used to create a sample centralized index, and this index is a source of training data for adaptive results-merging algorithms. A variety of experiments demonstrates that this new approach is more effective than a well-known alternative, and that it allows query-by-query tuning of the results-merging function.
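
The same regression idea sketched after the ACM Transactions on Information Systems (2003) abstract above applies here; the brief usage note below shows how such a merging function could be re-fit for every query, with hypothetical score dictionaries.

```python
# Re-fit the score mapping for each query using that query's overlap documents.
# learn_merge_model is the illustrative sketch given after the ACM Transactions
# on Information Systems (2003) abstract above; the scores here are hypothetical.
db_scores = {'d1': 12.3, 'd2': 9.8, 'd7': 7.1, 'd9': 5.0}      # from one database
sample_scores = {'d1': 0.81, 'd7': 0.45, 'd9': 0.30}           # from the sample index
normalize = learn_merge_model(db_scores, sample_scores)
normalized = {d: normalize(s) for d, s in db_scores.items()}   # comparable across DBs
```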

Collaboration


Dive into Luo Si's collaborations.

Top Co-Authors

Casey Hord, University of Cincinnati
Yi Fang, Santa Clara University
Ron Tzur, University of Colorado Denver
James P. Callan, Carnegie Mellon University