Hongfei Yan | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Hongfei Yan is active.

Explore More

Publication

Featured researches published by Hongfei Yan.

european conference on information retrieval | 2011

Comparing twitter and traditional media using topic models

Wayne Xin Zhao; Jing Jiang; Jianshu Weng; Jing He; Ee-Peng Lim; Hongfei Yan; Xiaoming Li

Twitter as a new form of social media can potentially contain much useful information, but content analysis on Twitter has not been well studied. In particular, it is not clear whether as an information source Twitter can be simply regarded as a faster news feed that covers mostly the same information as traditional news media. In This paper we empirically compare the content of Twitter with a traditional news medium, New York Times, using unsupervised topic modeling. We use a Twitter-LDA model to discover topics from a representative sample of the entire Twitter. We then use text mining techniques to compare these Twitter topics with topics from New York Times, taking into consideration topic categories and types. We also study the relation between the proportions of opinionated tweets and retweets and topic categories and types. Our comparisons show interesting and useful findings for downstream IR or DM applications.

conference on information and knowledge management | 2011

Recommending citations with translation model

Yang Lu; Jing He; Dongdong Shan; Hongfei Yan

Citation Recommendation is useful for an author to find out the papers or books that can support the materials she is writing about. It is a challengeable problem since the vocabulary used in the content of papers and in the citation contexts are usually quite different. To address this problem, we propose to use translation model, which can bridge the gap between two heterogeneous languages. We conduct an experiment and find the translation model can provide much better candidates of citations than the state-of-the-art methods.

conference on information and knowledge management | 2012

Automatic labeling hierarchical topics

Xian-Ling Mao; Zhao-Yan Ming; Zheng-Jun Zha; Tat-Seng Chua; Hongfei Yan; Xiaoming Li

Recently, statistical topic modeling has been widely applied in text mining and knowledge management due to its powerful ability. A topic, as a probability distribution over words, is usually difficult to be understood. A common, major challenge in applying such topic models to other knowledge management problem is to accurately interpret the meaning of each topic. Topic labeling, as a major interpreting method, has attracted significant attention recently. However, previous works simply treat topics individually without considering the hierarchical relation among topics, and less attention has been paid to creating a good hierarchical topic descriptors for a hierarchy of topics. In this paper, we propose two effective algorithms that automatically assign concise labels to each topic in a hierarchy by exploiting sibling and parent-child relations among topics. The experimental results show that the inter-topic relation is effective in boosting topic labeling accuracy and the proposed algorithms can generate meaningful topic labels that are useful for interpreting the hierarchical topics.

ACM Transactions on Information Systems | 2015

A General SIMD-Based Approach to Accelerating Compression Algorithms

Wayne Xin Zhao; Xudong Zhang; Daniel Lemire; Dongdong Shan; Jian-Yun Nie; Hongfei Yan; Ji-Rong Wen

Compression algorithms are important for data-oriented tasks, especially in the era of “Big Data.” Modern processors equipped with powerful SIMD instruction sets provide us with an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art nonvectorized algorithms with respect to decoding speeds.

international joint conference on natural language processing | 2015

User Based Aggregation for Biterm Topic Model

Weizheng Chen; Jinpeng Wang; Yan Zhang; Hongfei Yan; Xiaoming Li

Biterm Topic Model (BTM) is designed to model the generative process of the word co-occurrence patterns in short texts such as tweets. However, two aspects of BTM may restrict its performance: 1) user individualities are ignored to obtain the corpus level words co-occurrence patterns; and 2) the strong assumptions that two co-occurring words will be assigned the same topic label could not distinguish background words from topical words. In this paper, we propose Twitter-BTM model to address those issues by considering user level personalization in BTM. Firstly, we use user based biterms aggregation to learn user specific topic distribution. Secondly, each user’s preference between background words and topical words is estimated by incorporating a background topic. Experiments on a large-scale real-world Twitter dataset show that Twitter-BTM outperforms several stateof-the-art baselines.

conference on information and knowledge management | 2010

Context modeling for ranking and tagging bursty features in text streams

Wayne Xin Zhao; Jing Jiang; Jing He; Dongdong Shan; Hongfei Yan; Xiaoming Li

Bursty features in text streams are very useful in many text mining applications. Most existing studies detect bursty features based purely on term frequency changes without taking into account the semantic contexts of terms, and as a result the detected bursty features may not always be interesting or easy to interpret. In this paper we propose to model the contexts of bursty features using a language modeling approach. We then propose a novel topic diversity-based metric using the context models to find newsworthy bursty features. We also propose to use the context models to automatically assign meaningful tags to bursty features. Using a large corpus of a stream of news articles, we quantitatively show that the proposed context language models for bursty features can effectively help rank bursty features based on their newsworthiness and to assign meaningful tags to annotate bursty features.

conference on information and knowledge management | 2011

Efficient phrase querying with flat position index

Dongdong Shan; Wayne Xin Zhao; Jing He; Rui Yan; Hongfei Yan; Xiaoming Li

A large proportion of search engine queries contain phrases,namely a sequence of adjacent words. In this paper, we propose to use flat position index (a.k.a schema-independent index) for phrase query evaluation. In the flat position index, the entire document collection is viewed as a huge sequence of tokens. Each token is represented by one flat position, which is a unique position offset from the beginning of the collection. Each indexed term is associated with a list of the flat positions about that term in the sequence. To recover DocID from flat positions efficiently, we propose a novel cache sensitive look-up table (CSLT), which is much faster than existing search algorithms. Experiments on TREC GOV2 data collection show that flat position index can reduce the index size and speed up phrase querying substantially, compared with traditional word-level index.

asia information retrieval symposium | 2013

Guess What You Will Cite: Personalized Citation Recommendation Based on Users’ Preference

Ya’ning Liu; Rui Yan; Hongfei Yan

Automatic citation recommendation based on citation context is a highly valued research topic. When writing papers, researchers can save a lot of time with a system which can recommend a paper list for every citation placeholder. The past works all focus on the content based methods only. In this paper, we consider the citation recommendation as a content based analysis combined with personalization, using users’ publication or citation history as users’ profile and conduct to a personalized citation recommendation. After the combination of users’ citing preference with content relevance measurement, we obtain an 27.65% improvement of the performance in terms of MAP and 31.67% improvement in recall@10 compared with state-of-art models for citation recommendation problem.

Computer Networks | 2007

On the peninsula phenomenon in web graph and its implications on web search

Tao Meng; Hongfei Yan

Web masters usually place certain web pages such as home pages and index pages in front of others. Under such a design, it is necessary to go through some pages to reach the destination pages, which is similar to the scenario of reaching an inner town of a peninsula through other towns at the edge of the peninsula. In this paper, we try to validate that peninsulas are a universal phenomenon in the World-Wide Web, and clarify how this phenomenon can be used to enhance web search and study web connectivity problems. For this purpose, we model the web as a directed graph, and give a proper definition of peninsulas based on this graph. We also present an efficient algorithm to find web peninsulas. Using data collected from the Chinese web by Tianwang search engine, we perform an experiment on the distribution of sizes of peninsulas and their correlations with PageRank values, outdegrees, or indegrees of the ties with other outside vertices. The results show that the peninsula structure on a web graph can greatly expedite the computation of PageRank values; and it can also significantly affect the link extraction capability and information coverage of web crawlers.

web intelligence | 2004

The Evolution of Link-Attributes for Pages and Its Implications on Web Crawling

Tao Meng; Hongfei Yan; Jimin Wang; Xiaoming Li

It is important for an incremental crawler to know how web pages evolve and the relation between their changing frequencies and the link-attributes such as indegrees. This paper proposes a model for incremental crawling and performs an experiment to verify the correlation between them, by monitoring the evolution of all the link-attributes of the web pages within one website. Particularly, we look deeply into one special kind of page named Index-pages. From the experiment, we can make four conclusions: (1) Pages which have bigger indegrees, outdegrees or PageRank values change more often, and these link-attributes all approximately obey a power-law distribution. (2) The link-attributes of pages seldom change though the pages change themselves. (3) A small proportion of the pages link to most of the vertexes in the web graph. (4) The Index-pages link to sizeable new pages in a website. These conclusions can be used to greatly enhance the performance of an incremental crawler, which is the foremost component for general search engines and web information stores.

Explore More