Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kai-Fu Lee is active.

Publication


Featured research published by Kai-Fu Lee.


ACM Transactions on Asian Language Information Processing | 2002

Toward a unified approach to statistical language modeling for Chinese

Jianfeng Gao; Joshua T. Goodman; Mingjing Li; Kai-Fu Lee

This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.
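The backbone of the approach described above is a maximum-likelihood trigram model. As a rough illustrative sketch only (the paper's actual system additionally gathers Web training data, learns a lexicon, segments text, and compresses the model), a word-level trigram estimator might look like this:

```python
from collections import defaultdict

class TrigramLM:
    """Minimal maximum-likelihood trigram model (illustrative sketch,
    not the paper's implementation)."""

    def __init__(self):
        self.tri = defaultdict(int)  # counts of (w1, w2, w3)
        self.bi = defaultdict(int)   # counts of the history (w1, w2)

    def train(self, sentences):
        for words in sentences:
            padded = ["<s>", "<s>"] + list(words) + ["</s>"]
            for i in range(len(padded) - 2):
                w1, w2, w3 = padded[i], padded[i + 1], padded[i + 2]
                self.tri[(w1, w2, w3)] += 1
                self.bi[(w1, w2)] += 1

    def prob(self, w1, w2, w3):
        # Maximum-likelihood estimate P(w3 | w1, w2); 0 for unseen histories.
        if self.bi[(w1, w2)] == 0:
            return 0.0
        return self.tri[(w1, w2, w3)] / self.bi[(w1, w2)]
```

For Chinese, the `sentences` fed to `train` would first have to be segmented into words, which is exactly why the paper treats lexicon construction and segmentation as part of the same maximum-likelihood pipeline.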


Meeting of the Association for Computational Linguistics | 2000

A new statistical approach to Chinese Pinyin input

Zheng Chen; Kai-Fu Lee

Chinese input is one of the key challenges for Chinese PC users. This paper proposes a statistical approach to Pinyin-based Chinese input. The approach uses a trigram-based language model and statistically based segmentation. To handle real input, it also includes a typing model that enables spelling correction in sentence-based Pinyin input, and a spelling model for English that enables modeless Pinyin input.
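The statistical formulation above amounts to a noisy-channel decomposition: choose the character string H that maximizes P(H) x P(pinyin | H), where the language model supplies P(H) and the typing model supplies P(pinyin | H). A minimal sketch follows; the greedy left-to-right bigram search and the `lm_prob`/`typing_prob` callback interfaces are illustrative assumptions, not the paper's actual models:

```python
import math

def convert_pinyin(syllables, candidates, lm_prob, typing_prob):
    """Greedy noisy-channel Pinyin conversion (sketch).

    candidates:  dict mapping a Pinyin syllable to candidate characters
    lm_prob:     P(char | previous char), a bigram language model stand-in
    typing_prob: P(syllable | char), a typing/spelling model stand-in
    """
    prev = "<s>"
    out = []
    for syl in syllables:
        best_ch, best_score = None, float("-inf")
        for ch in candidates[syl]:
            # Log-space score; tiny epsilon guards against log(0).
            score = (math.log(lm_prob(prev, ch) + 1e-12)
                     + math.log(typing_prob(syl, ch) + 1e-12))
            if score > best_score:
                best_ch, best_score = ch, score
        out.append(best_ch)
        prev = best_ch
    return "".join(out)
```

A full decoder would search all hypotheses jointly (e.g. Viterbi over the sentence) rather than committing greedily at each syllable; the typing model is what lets the same machinery score mistyped syllables and so correct spelling during conversion.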


Meeting of the Association for Computational Linguistics | 2000

Distribution-based pruning of backoff language models

Jianfeng Gao; Kai-Fu Lee

We propose a distribution-based pruning of n-gram backoff language models. Instead of the conventional approach of pruning n-grams that are infrequent in training data, we prune n-grams that are likely to be infrequent in a new document. Our method is based on the n-gram distribution, i.e., the probability that an n-gram occurs in a new document. Experimental results show that our method performed 7--9% better (in word perplexity reduction) than conventional cutoff methods.
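A minimal sketch of this pruning criterion, under the simplifying assumption that the probability of an n-gram occurring in a new document is approximated by its document frequency in the training collection (the paper models this distribution more carefully):

```python
def document_probability(ngram, docs):
    """Fraction of training documents containing the n-gram -- a simple
    stand-in for the modeled probability that the n-gram occurs in a
    new document."""
    n = len(ngram)
    target = tuple(ngram)
    hits = 0
    for doc in docs:
        grams = {tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)}
        if target in grams:
            hits += 1
    return hits / len(docs)

def prune(ngrams, docs, threshold):
    # Keep only n-grams likely enough to appear in a new document,
    # regardless of their raw training-set frequency.
    return [g for g in ngrams if document_probability(g, docs) >= threshold]
```

The contrast with a conventional count cutoff is that an n-gram concentrated in a single document can have a high raw count yet a low document probability, and so is pruned here where a cutoff would keep it.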


International Conference on Acoustics, Speech, and Signal Processing | 2000

A unified approach to statistical language modeling for Chinese

Jianfeng Gao; Haifeng Wang; Mingjing Li; Kai-Fu Lee

The paper presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigrams to Chinese is challenging because: (1) there is no standard definition of words in Chinese, (2) word boundaries are not marked by spaces, and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, and segments the training data using this lexicon, all using a maximum likelihood principle, which is consistent with the trigram training. We show that each of the methods leads to improvements over standard SLM, and that the combined method yields the best pinyin conversion result reported.


Pacific Rim International Conference on Artificial Intelligence | 2000

Towards a next-generation search engine

Qiang Yang; Haifeng Wang; Ji-Rong Wen; Gao Zhang; Ye Lu; Kai-Fu Lee; HongJiang Zhang

As more information becomes available on the World Wide Web, providing effective search tools for information access has become an acute problem. Previous generations of search engines are mainly keyword-based and cannot satisfy many informational needs of their users: search based on simple keywords returns many irrelevant documents that can easily swamp the user. In this paper, we describe the system architecture of a next-generation search engine that we have built with the goal of providing accurate search results for frequently asked concepts. Our key differentiating factors from other search engines are a natural language user interface, accurate search results, an interactive user interface, and multimedia content retrieval. We describe the architecture, design goals, and experience in developing the search engine.


US Patent | 2000

Search engine with natural language-based robust parsing for user query and relevance feedback learning

Haifeng Wang; Kai-Fu Lee; Qiang Yang


Archive | 2000

System and iterative method for lexicon, segmentation and language model joint optimization

Haifeng Wang; Chang-Ning Huang; Kai-Fu Lee; Shuo Di; Jianfeng Gao; Dong-Feng Cai; Lee-Feng Chien


Conference of the International Speech Communication Association | 2000

Large vocabulary Mandarin speech recognition with different approaches in modeling tones.

Eric Chang; Jian-Lai Zhou; Shuo Di; Chao Huang; Kai-Fu Lee


Conference of the International Speech Communication Association | 2000

Accent modeling based on pronunciation dictionary adaptation for large vocabulary Mandarin speech recognition.

Chao Huang; Eric Chang; Jian-Lai Zhou; Kai-Fu Lee


Conference of the International Speech Communication Association | 2000

Discriminative training on language model.

Zheng Chen; Kai-Fu Lee; Mingjing Li

Collaboration


Dive into Kai-Fu Lee's collaborations.

Top Co-Authors
