Huajing Li

Pennsylvania State University

Publication

Featured research published by Huajing Li.

international acm sigir conference on research and development in information retrieval | 2008

Real-time automatic tag recommendation

Yang Song; Ziming Zhuang; Huajing Li; Qiankun Zhao; Jia Li; Wang-Chien Lee; C. Lee Giles

Tags are user-generated labels for entities. Existing research on tag recommendation focuses either on improving accuracy or on automating the process, while ignoring the efficiency issue. We propose a highly automated, novel framework for real-time tag recommendation. The tagged training documents are treated as triplets of (words, docs, tags) and represented in two bipartite graphs, which are partitioned into clusters by Spectral Recursive Embedding (SRE). Tags in each topical cluster are ranked by our novel ranking algorithm. A two-way Poisson Mixture Model (PMM) is proposed to model the document distribution into mixture components within each cluster and to aggregate words into word clusters simultaneously. A new document is classified by the mixture model based on its posterior probabilities, so that tags are recommended according to their ranks. Experiments on large-scale tagging datasets of scientific documents (CiteULike) and web pages (del.icio.us) indicate that our framework is capable of making tag recommendations efficiently and effectively. The average tagging time for a test document is around 1 second, with over 88% of test documents correctly labeled with the top nine tags we suggested.
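The classification step mentioned in this abstract (assigning a new document to a mixture component by posterior probability) can be illustrated with a minimal sketch. It assumes a plain Poisson mixture with independent counts per word cluster, which is simpler than the two-way PMM the paper proposes; the function names, priors, and rates are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def log_poisson(count: int, rate: float) -> float:
    """Log-probability of `count` under a Poisson with mean `rate` (rate > 0)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


def component_posteriors(
    counts: Sequence[int],
    priors: Sequence[float],
    rates: Sequence[Sequence[float]],
) -> List[float]:
    """Posterior P(component | counts) for a simple Poisson mixture.

    counts[j]   : occurrences of word cluster j in the new document
    priors[k]   : mixing weight of component k (assumed to sum to 1)
    rates[k][j] : Poisson mean of word cluster j under component k
    """
    log_joint = [
        math.log(prior) + sum(log_poisson(x, lam) for x, lam in zip(counts, lams))
        for prior, lams in zip(priors, rates)
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_joint)
    unnormalized = [math.exp(v - peak) for v in log_joint]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Two hypothetical components over two word clusters.
    post = component_posteriors([5, 0], [0.5, 0.5], [[5.0, 0.5], [0.5, 5.0]])
    best = max(range(len(post)), key=post.__getitem__)
    print("posteriors:", post, "-> component", best)
```

In a pipeline of the kind the abstract describes, the tags ranked for the winning component would then be offered as recommendations.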

international world wide web conferences | 2006

CiteSeerx: an architecture and web service design for an academic document search engine

Huajing Li; Isaac G. Councill; Wang-Chien Lee; C. Lee Giles

CiteSeer is a scientific literature digital library and search engine which automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, more users, and new features. Its monolithic architecture design prevents it from effectively making use of new web technologies and providing new services. After analyzing the current system problems, we propose a new architecture and data model, CiteSeerx, that will overcome the existing problems as well as provide scalability, better performance, and new services and system features.

very large data bases | 2010

Z-SKY: an efficient skyline query processing framework based on Z-order

Ken C. K. Lee; Wang-Chien Lee; Baihua Zheng; Huajing Li; Yuan Tian

Given a set of data points in a multidimensional space, a skyline query retrieves those data points that are not dominated by any other point in the same dataset. Observing that the properties of Z-order space-filling curves (or Z-order curves) perfectly match the dominance relationships among data points in a geometrical data space, we develop and present in this paper a novel and efficient processing framework, based on Z-order curves, to evaluate skyline queries and their variants and to support skyline result updates. This framework consists of ZBtree, an index structure that organizes a source dataset and skyline candidates, and a suite of algorithms, namely, (1) ZSearch, which processes skyline queries, (2) ZInsert, ZDelete and ZUpdate, which incrementally maintain skyline results in the presence of source dataset updates, (3) ZBand, which answers skyband queries, (4) ZRank, which returns top-ranked skyline points, (5) k-ZSearch, which evaluates k-dominant skyline queries, and (6) ZSubspace, which supports skyline queries on a subset of dimensions. Although derived from coherent ideas and concepts, our approaches are shown through comprehensive experiments to outperform state-of-the-art algorithms that are specialized for particular skyline problems, especially when the number of skyline points is large.
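The link between Z-order and dominance noted in this abstract can be illustrated with a small sketch. It assumes non-negative integer coordinates and that smaller values are better in every dimension; under those assumptions, a point that dominates another always has the smaller Z-address, so scanning points in Z-order means each point only needs to be compared against the skyline found so far. This is a simple sort-and-filter illustration, not the paper's ZBtree or ZSearch, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def z_address(p: Point, bits: int = 16) -> int:
    """Interleave coordinate bits, most significant first (Morton code).

    Assumes every coordinate is a non-negative integer below 2**bits.
    """
    z = 0
    for bit in range(bits - 1, -1, -1):
        for coord in p:
            z = (z << 1) | ((coord >> bit) & 1)
    return z


def skyline(points: Sequence[Point], bits: int = 16) -> List[Point]:
    """Skyline via a Z-order scan.

    A later point in Z-order can never dominate an earlier one, so a
    point that survives the check against `result` is final.
    """
    result: List[Point] = []
    for p in sorted(points, key=lambda pt: z_address(pt, bits)):
        if not any(dominates(s, p) for s in result):
            result.append(p)
    return result


if __name__ == "__main__":
    data = [(1, 9), (2, 2), (3, 3), (9, 1), (5, 5), (2, 8)]
    print(skyline(data))
```

The scan is still quadratic in the worst case; an index over Z-addresses, such as the one the abstract describes, is what allows whole regions to be pruned at once.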

international acm sigir conference on research and development in information retrieval | 2009

A probabilistic topic-based ranking framework for location-sensitive domain information retrieval

Huajing Li; Zhisheng Li; Wang-Chien Lee; Dik Lun Lee

It has been observed that many queries submitted to search engines are location-sensitive. Traditional search techniques fail to interpret the significance of such geographical clues and as such are unable to return highly relevant search results. Although there have been efforts in the literature to support location-aware information retrieval, critical challenges still remain in terms of search result quality and data scalability. In this paper, we propose an innovative probabilistic ranking framework for domain information retrieval where users are interested in a set of location-sensitive topics. Our proposed method recognizes the geographical distribution of topic influence in the process of ranking documents and models it accurately using probabilistic Gaussian Process classifiers. Additionally, we demonstrate the effectiveness of the proposed ranking framework by implementing it in a Web search service for NBA news. Extensive performance evaluation on real Web document collections confirms that our proposed mechanism achieves significantly better ranking quality (around 29.7% better on average under the DCG20 measure) than other popular location-aware information retrieval techniques.
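For readers unfamiliar with the DCG20 measure cited in this abstract, the following sketch computes discounted cumulative gain over the top 20 results using one common formulation (gain divided by log2 of rank + 1). The abstract does not state which DCG variant the authors used, so the formula, the function names, and the relevance grades in the example are assumptions for illustration only.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 20) -> float:
    """Discounted cumulative gain over the first k results.

    relevances[i] is the graded relevance of the result at rank i + 1.
    Uses the rel / log2(rank + 1) formulation; other variants exist.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Fractional gain of new_score over a non-zero baseline_score."""
    return (new_score - baseline_score) / baseline_score


if __name__ == "__main__":
    # Hypothetical relevance grades for two rankings of the same query.
    baseline = [1, 3, 0, 2, 0]
    proposed = [3, 2, 1, 0, 0]
    gain = relative_improvement(dcg_at_k(proposed), dcg_at_k(baseline))
    print(f"relative DCG improvement: {gain:.1%}")
```

Averaging such per-query improvements over a query set is one way a figure like the reported average gain could be produced.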

scalable information systems | 2006

Efficient progressive processing of skyline queries in peer-to-peer systems

Huajing Li; Qingzhao Tan; Wang-Chien Lee

Skyline queries have received a lot of attention from the database and information retrieval research communities. A skyline query returns the set of data objects that are not dominated by any other data object in a given dataset. However, most existing studies focus on skyline query processing in centralized systems. Only recently have skyline queries been considered in a distributed computing environment. Acknowledging the trend toward peer-to-peer (P2P) systems in distributed computing, we examine the problem of skyline query processing in P2P systems and propose innovative solutions. We exploit the data semantics embedded in semantically structured P2P overlay networks to efficiently prune the search space without compromising the quality of the query result. In addition, we propose approximate algorithms to support skyline queries where exact answers are too costly to obtain. These approximate algorithms produce high-quality answers using heuristics based on the local semantics of peer nodes. Extensive experiments validate that our algorithms provide high efficiency and scalability for skyline query processing in P2P systems.

web information and data management | 2008

Personalized ranking for digital libraries based on log analysis

Yang Sun; Huajing Li; Isaac G. Councill; Jian Huang; Wang-Chien Lee; C. Lee Giles

Given the exponential increase of indexable content on the Web, ranking is an increasingly difficult problem in information retrieval systems. Recent research shows that implicit feedback regarding user preferences can be extracted from web access logs in order to increase ranking performance. We analyze the implicit user feedback from access logs in the CiteSeer academic search engine and show how site structure can better inform the analysis of clickthrough feedback, providing accurate personalized ranking services tailored to individual information retrieval systems. Experiments and analysis show that our proposed method is more accurate at predicting user preferences than any non-personalized ranking method when user preferences are stable over time. We compare our method with several non-personalized ranking methods, including ranking SVMlight as well as several ranking functions specific to the academic document domain. The results show that our ranking algorithm can reach 63.59% accuracy, in comparison to 50.02% for ranking SVMlight and below 43% for all other single-feature ranking methods. We also show how the derived personalized ranking vectors can be employed for other ranking-related purposes such as recommendation systems.

international conference on social computing | 2010

Personalized Feed Recommendation Service for Social Networks

Huajing Li; Yuan Tian; Wang-Chien Lee; C. Lee Giles; Meng Chang Chen

Social network systems (SNSs) such as Facebook and Twitter have recently attracted millions of users by providing social-network-based services that support easy message posting, information sharing and inter-friend communication. With the rapid growth of social networks, users of SNSs may easily get overwhelmed by the excessive volume of information feeds and find it challenging to digest them and locate truly valuable information. In this paper, we introduce a personalized feed recommendation service for SNS users based on user interests and social network contexts. Our approach incorporates both the topical preference and the topological locality of a user in determining a feed’s relevance. We propose a popularity diffusion model to propagate feeds in social networks and support our recommendation service with a set of personalized indices for feed-based information retrieval. A suite of efficient index manipulation algorithms is developed in our framework to address the need to manage the dynamics of social networks. We conduct an extensive performance evaluation to compare our proposal with alternative solutions using both real and synthetic social network data; the results suggest that our proposal outperforms the alternatives in both efficiency and relevance.

scalable information systems | 2006

CiteSeerχ: a scalable autonomous scientific digital library

Huajing Li; Isaac G. Councill; Levent Bolelli; Ding Zhou; Yang Song; Wang-Chien Lee; Anand Sivasubramaniam; C. Lee Giles

CiteSeer is a scientific literature digital library and search engine which automatically crawls and indexes scientific documents in the fields of computer and information science. Since its inception in 1997, CiteSeer has grown to index over 730,000 documents and serves over 800,000 requests daily, pushing the limits of the current system's capabilities. In addition, CiteSeer's monolithic architecture complicates system maintenance and reduces the flexibility of the system in terms of new feature development, algorithm updates, and system interoperability. In this paper, we discuss the problems of the current CiteSeer architecture and propose a new architecture for a next-generation CiteSeer application. The new architecture is based on modular web services and pluggable service components. Preliminary results based on a prototype system show that the new architecture enhances flexibility, scalability, and performance for CiteSeer. In addition, new services in development for the next-generation CiteSeer system are discussed.

international conference on web engineering | 2007

A hybrid cache and prefetch mechanism for scientific literature search engines

Huajing Li; Wang-Chien Lee; Anand Sivasubramaniam; C. Lee Giles

CiteSeer, a scientific literature search engine that focuses on documents in the computer science and information science domains, suffers from scalability issues as the number of requests and the size of the indexed document collection have increased dramatically over the years. CiteSeerχ is an effort to re-architect the search engine. In this paper, we present our initial design of a framework for caching query results, indices, and documents. This design is based on an analysis of the logged workload in CiteSeer. Our experiments, based on mock client requests that simulate actual user behavior, confirm that our approach works well in enhancing system performance.
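As a rough illustration of the query-result caching idea in this abstract, the sketch below implements a small least-recently-used (LRU) cache. The abstract does not say which replacement policy or prefetching strategy the authors chose, so LRU is used here purely as a stand-in, prefetching is omitted, and the class and key names are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None on a miss."""
        if key not in self._items:
            return None
        # A hit makes this entry the most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        """Insert or refresh an entry, evicting the oldest if over capacity."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("skyline query", ["doc-1", "doc-7"])
    cache.put("tag recommendation", ["doc-3"])
    cache.get("skyline query")               # refreshes this entry
    cache.put("digital library", ["doc-9"])  # evicts "tag recommendation"
    print(cache.get("tag recommendation"))
```

A design along the lines of the abstract would presumably keep separate caches for query results, indices, and documents, and add a prefetch step driven by the logged workload.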

acm/ieee joint conference on digital libraries | 2007

SearchGen: a synthetic workload generator for scientific literature digital libraries and search engines

Huajing Li; Wang-Chien Lee; Anand Sivasubramaniam; C. Lee Giles

Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve the performance of search services. Existing work has typically focused on generic web workloads without putting emphasis on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to workloads are proposed. Among them, we find that the access intervals show high variance and thus cannot be predicted well with time-series models. On the other hand, client visiting paths and semantics can be well captured with probabilistic models and Zipf's law. Based on the findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
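To make the Zipf's-law ingredient of this abstract concrete, the sketch below draws a synthetic request trace in which the document at popularity rank r is requested with probability proportional to 1 / r^s. It is not SearchGen itself: the exponent, the document identifiers, and the function names are hypothetical, and the visiting-path and access-interval models the abstract mentions are left out.

```python
import random
from typing import List, Sequence


def zipf_weights(n: int, s: float = 1.0) -> List[float]:
    """Normalized Zipf probabilities for ranks 1..n with exponent s."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_requests(
    doc_ids: Sequence[str], count: int, s: float = 1.0, seed: int = 0
) -> List[str]:
    """Draw `count` requests; doc_ids must be ordered most popular first."""
    rng = random.Random(seed)  # seeded so a trace can be regenerated
    weights = zipf_weights(len(doc_ids), s)
    return rng.choices(doc_ids, weights=weights, k=count)


if __name__ == "__main__":
    docs = [f"doc-{i}" for i in range(1, 101)]
    trace = sample_requests(docs, count=1000, s=1.0, seed=42)
    print(trace[:10])
```

Comparing the rank-frequency curve of such a trace with that of real logs is one simple way to check how well a synthetic workload fits.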
