K. L. Kwok
Queens College
Publication
Featured research published by K. L. Kwok.
international acm sigir conference on research and development in information retrieval | 1997
K. L. Kwok
Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The retrieval collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, and 28 queries that are long and rich in wordings. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but is as good as short-word indexing in precision, and about 5% better in relevants retrieved. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach.
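The 1-gram and bigram representations described above can be illustrated with a minimal sketch. This is not the paper's indexing code: the function names `char_ngrams` and `build_index` and the two toy documents are assumptions made for illustration, and the sketch ignores punctuation handling and the short-word segmentation step.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return overlapping character n-grams, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs, n):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


# Hypothetical toy collection (not from TREC-5).
docs = {"d1": "信息检索", "d2": "检索系统"}
unigram_index = build_index(docs, 1)  # single-character terms
bigram_index = build_index(docs, 2)   # two contiguous overlapping characters
```

With n = 2 the string 信息检索 yields 信息, 息检, and 检索, so both toy documents share the bigram 检索 and can be matched on it without any word segmentation.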
international acm sigir conference on research and development in information retrieval | 1998
K. L. Kwok; M. Chan
Short queries in an ad-hoc retrieval environment are difficult but unavoidable. We present several methods to try to improve our current strategy of 2-stage pseudo-relevance feedback retrieval in such a situation. They are: 1) avtf query term weighting, 2) variable high frequency Zipfian threshold, 3) collection enrichment, 4) enhancing term variety in raw queries, and 5) using retrieved document local term statistics. Avtf employs collection statistics to weight terms in short queries. Variable high frequency threshold defines and ignores statistical stopwords based on query length. Collection enrichment adds other collections to the one under investigation so as to improve the chance of ranking more relevant documents in the top n for the pseudo-feedback process. Enhancing term variety in raw queries tries to find highly associated terms in a set of documents that is domain-related to the query; making the query longer may improve 1st stage retrieval. Retrieved document local statistics re-weight terms in the 2nd stage using the set of domain-related documents rather than the whole collection as used during the initial stage. Experiments were performed using the TREC 5 and 6 environment. It is found that together these methods perform well for the difficult TREC-5 topics, and also work for the TREC-6 very short topics.
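The general shape of 2-stage pseudo-relevance feedback can be sketched as follows. This is a generic illustration, not the authors' system: the scoring function, the expansion rule, the parameters `top_n`, `add_k`, and `new_weight`, and the toy documents are all assumptions, and none of the five methods listed above (avtf weighting, variable thresholds, collection enrichment, and so on) is implemented here.

```python
from collections import Counter


def score(query_weights, doc_tokens):
    """Sum of query-term weight times term frequency in the document."""
    tf = Counter(doc_tokens)
    return sum(w * tf[t] for t, w in query_weights.items())


def rank(query_weights, docs):
    """Return doc ids sorted by descending score (ties broken by id)."""
    return sorted(docs, key=lambda d: (-score(query_weights, docs[d]), d))


def expand(query_weights, docs, ranked, top_n=2, add_k=1, new_weight=0.5):
    """Add the add_k most frequent new terms found in the top_n documents."""
    pool = Counter()
    for d in ranked[:top_n]:
        pool.update(t for t in docs[d] if t not in query_weights)
    expanded = dict(query_weights)
    for term, _ in pool.most_common(add_k):
        expanded[term] = new_weight
    return expanded


# Hypothetical toy collection and a one-term query.
docs = {
    "d1": ["jaguar", "car", "engine", "car"],
    "d2": ["jaguar", "car", "speed"],
    "d3": ["jaguar", "cat", "jungle"],
    "d4": ["car", "engine", "repair"],
}
query = {"jaguar": 1.0}
first_stage = rank(query, docs)                # stage 1: initial retrieval
expanded = expand(query, docs, first_stage)    # assume top documents are relevant
second_stage = rank(expanded, docs)            # stage 2: retrieval with expanded query
```

In this toy run the expansion adds "car", so d4, which never mentions the original query term, receives a non-zero score in the second stage. The same mechanism explains why a weak first stage hurts: if the top n documents are off-topic, the added terms will be too.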
international acm sigir conference on research and development in information retrieval | 1996
K. L. Kwok
K. L. Kwok, Queens College, City University of New York, Flushing, NY 11367, USA. Ad-hoc retrieval relies on the evidence from a user’s query to provide a sufficient variety of terms as well as different term frequencies for differentiating term importance. Short queries lack both types of information. A new method of automatically weighting query terms for ad-hoc retrieval is introduced that works for short queries. It is based on the term usage statistics in a collection and no training is required. Experiments with both the TREC2 and TREC4 ad-hoc queries show that this weighting scheme can provide significantly better results at the initial retrieval stage. At the expanded query stage, results vary from equal to significantly better than those relying on the original query weights. In particular, this automatic method provides similar improvements for extra-short queries of only two to four content terms.
ACM Transactions on Asian Language Information Processing | 2002
Robert W. P. Luk; K. L. Kwok
With the advent of the Internet and intranets, substantial interest is being shown in Asian language information retrieval; especially in Chinese, which is a good example of an Asian ideographic language (other examples include Japanese and Korean). Since, in this type of language, spaces do not delimit words, an important issue is which index terms should be extracted from documents. This issue also has wider implications for indexing other languages such as agglutinating languages (e.g., Finnish and Turkish), archaic ideographic languages like Egyptian hieroglyphs, and other types of information such as data stored in genomic databases. Although comparisons of indexing strategies for Chinese documents have been made, almost all of them are based on a single retrieval model. This article compares the performance of various combinations of indexing strategies (i.e., character, word, short-word, bigram, and Pircs indexing) and retrieval models (i.e., vector space, 2-Poisson, logistic regression, and Pircs models). We determine which model (and its parameters) achieves the (near) best retrieval effectiveness without relevance feedback, and compare it with the open evaluations (i.e., TREC and NTCIR) for both long and title queries. In addition, we describe a more extensive investigation of retrieval efficiency. In particular, the storage cost of word indexing is only slightly more than character indexing, and bigram indexing is about double the storage cost of other indexing strategies. The retrieval time typically varies linearly with the number of unique terms in the query, which is supported by correlation values above 90%. The Pircs retrieval system achieves robust and good retrieval performance, but it appears to be the slowest method, whereas vector space models were not very effective in retrieval, but were able to respond quickly. 
For robust, near-best retrieval effectiveness, without considering storage overhead, the 2-Poisson model using bigram indexing appears to be a good compromise between retrieval effectiveness and efficiency for both long and title queries.
international acm sigir conference on research and development in information retrieval | 1991
K. L. Kwok
This paper shows how a network view of probabilistic information indexing and retrieval with components may implement query expansion and modification (based on user relevance feedback) by growing new edges and adapting weights between queries and terms of relevant documents. Experimental results with two collections and partial feedback confirm that the process can lead to much improved performance. Learning from irrelevant documents, however, was not effective.
asia information retrieval symposium | 2005
K. L. Kwok; Laszlo Grunfeld; Peter Deng
Users experience frustration when their reasonable queries retrieve no relevant documents. We call these weak queries and retrievals. Improving their effectiveness is an important issue in ad-hoc retrieval and will be most rewarding for these users. We offer an explanation (with experimental support) why data fusion of sufficiently different retrieval lists can improve weak query results. This approach requires sufficiently different retrieval lists for an ad-hoc query. We propose various ways of selecting salient terms from longer queries to probe the web, and define alternate queries from web results. Target retrievals by the original and alternate queries are combined. When compared with normal ad-hoc retrieval, web assistance and data fusion can improve weak query effectiveness by over 100%. Another benefit of this approach is that other queries also improve along with weak ones, unlike pseudo-relevance feedback which works mostly for non-weak queries.
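Combining the retrieval lists of the original and alternate queries can be illustrated with a small score-fusion sketch. The abstract does not say which fusion rule was used, so the rule below (min-max normalization followed by summing scores, commonly called CombSUM) is a swapped-in assumption, as are the function names and the toy scores.

```python
def min_max(scores):
    """Rescale one run's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def comb_sum(*runs):
    """Fuse runs by summing normalized scores; return doc ids, best first."""
    fused = {}
    for run in runs:
        for d, s in min_max(run).items():
            fused[d] = fused.get(d, 0.0) + s
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical retrieval scores for one query.
original_run = {"d1": 3.0, "d2": 2.0, "d3": 1.0}     # original query
alternate_run = {"d3": 10.0, "d4": 5.0, "d2": 0.0}   # web-derived alternate query
fused_ranking = comb_sum(original_run, alternate_run)
```

Here d3 is ranked last by the original query but first by the alternate query, so fusion lifts it to second place. This is the sense in which the lists need to be sufficiently different: two near-identical runs would fuse into the same ranking.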
Information Processing and Management | 1988
K. L. Kwok; William Kuan
The use of document components, such as a term, a sentence, or a whole document, for indexing and retrieval has been investigated employing two medium-size collections. A number of probabilistic similarity measures based on document components are studied, as well as a new method of handling probability estimates involving small sample sizes. It is seen that some of the new similarity measures can provide comparable performance to those methods studied by other investigators. In general, the term and sentence modes result in substantially equal performance, and both are superior to the document mode. However, it may be necessary to use longer documents in order to reveal fully the usefulness of the sentence mode, because documents in our databases may not have sufficient numbers of sentences.
Information Processing and Management | 1988
K. L. Kwok
A recent article by Salton and Zhang compared some retrieval results for document collections indexed using title and abstract only with those of the same indexing enhanced by terms from bibliographically related titles. They observed that the method of adding related title terms is not sufficiently reliable. We feel that their conclusion may be overly pessimistic because it was based on small negative results using a 146-document “soft science” (information science) sample collection and without the benefit of term selection among the bibliographically related titles. On the other hand, we believe that their positive results with a 3204-document computer science collection are significant and may serve as new evidence that the method may actually work for scientific literature. Discussions on some strategies for term selection are given, as well as reasons why cited titles are useful and preferred over other types of related titles.
asia information retrieval symposium | 2006
K. L. Kwok; Peter Deng
A minimal approach to Chinese factoid QA is described. It employs entity extraction software, template matching, and statistical candidate answer ranking via five evidence types, and does not use explicit word segmentation or Chinese syntactic analysis. This simple approach is more portable to other Asian languages, and may serve as a base on which more precise techniques can be used to improve results. Applied to the NTCIR-5 monolingual environment, it delivers medium top-1 accuracy and MRR of .295, .3381 (supported answers) and .41, .4998 (including unsupported) respectively. When applied to English-Chinese cross-language QA (CLQA) with three different forms of English-Chinese question translation, it attains top-1 accuracy and MRR of .155, .2094 (supported) and .215, .2932 (unsupported), about 52% to 62% of monolingual effectiveness. CLQA improvements via successively different forms of question translation are also demonstrated.
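The two measures quoted above, top-1 accuracy and mean reciprocal rank (MRR), follow their standard definitions and can be computed as in the sketch below. The data layout (a ranked candidate list and a set of acceptable answers per question), the function names, and the four toy questions are assumptions; the official NTCIR evaluation applies its own judging rules for supported and unsupported answers, which are not modeled here.

```python
def reciprocal_rank(candidates, gold):
    """Return 1/rank of the first correct candidate, or 0.0 if none is correct."""
    for rank, answer in enumerate(candidates, start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0


def top1_accuracy(ranked, gold):
    """Fraction of questions whose first candidate is correct."""
    hits = sum(1 for q, c in ranked.items() if c and c[0] in gold[q])
    return hits / len(ranked)


def mean_reciprocal_rank(ranked, gold):
    """Average reciprocal rank over all questions."""
    return sum(reciprocal_rank(c, gold[q]) for q, c in ranked.items()) / len(ranked)


# Hypothetical system output and answer key for four questions.
ranked = {
    "q1": ["北京", "上海"],
    "q2": ["1997", "1998", "1999"],
    "q3": ["a", "b"],
    "q4": ["x", "y", "z", "w"],
}
gold = {"q1": {"北京"}, "q2": {"1998"}, "q3": {"c"}, "q4": {"w"}}

accuracy = top1_accuracy(ranked, gold)    # 1 of 4 questions correct at rank 1
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0 + 1/4) / 4
```

MRR is never lower than top-1 accuracy, because a correct answer at rank 1 contributes 1 to both while correct answers at lower ranks add partial credit only to MRR. That matches the pattern in the reported figures, where each MRR exceeds the corresponding top-1 value.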
international acm sigir conference on research and development in information retrieval | 2002
K. L. Kwok
Queries have specific properties, and may need individualized methods and parameters to optimize retrieval. Length is one such property. We look at how two-word queries may attain higher precision by re-ranking using word co-occurrence evidence in retrieved documents. Co-occurrence within document context is not sufficient, but window context, including sentence context evidence, can provide precision improvements of 4 to 10% in the low-recall region using initial retrieval results, and positively affects pseudo-relevance feedback.
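The idea of re-ranking a two-word query by window co-occurrence can be sketched as follows. The abstract does not give the scoring formula, so the additive `boost`, the `window` size, the function names, and the toy documents are all assumptions chosen for illustration.

```python
def window_cooccurrences(tokens, word_a, word_b, window=5):
    """Count (a, b) position pairs at most `window` tokens apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    return sum(1 for i in pos_a for j in pos_b if abs(i - j) <= window)


def rerank(initial_scores, docs, word_a, word_b, window=5, boost=0.1):
    """Add a bonus per close co-occurrence; return doc ids, best first."""
    adjusted = {
        d: s + boost * window_cooccurrences(docs[d], word_a, word_b, window)
        for d, s in initial_scores.items()
    }
    return sorted(adjusted, key=lambda d: (-adjusted[d], d))


# Hypothetical retrieved documents for the query "stock crash".
docs = {
    "d1": "stock market crash hit the stock exchange".split(),
    "d2": "the stock fell while analysts feared a crash".split(),
}
initial_scores = {"d1": 1.0, "d2": 1.05}  # d2 is slightly ahead initially
reranked = rerank(initial_scores, docs, "stock", "crash", window=2)
```

Both documents contain both query words, so whole-document co-occurrence cannot separate them. Only d1 has the two words within two tokens of each other, which is enough to move it ahead of d2 after re-ranking.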