
Publication


Featured research published by Robert W. P. Luk.


ACM Transactions on Information Systems | 2008

Interpreting TF-IDF term weights as making relevance decisions

Ho Chung Wu; Robert W. P. Luk; Kam-Fai Wong; Kui Lam Kwok

A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these “local” relevance decisions as the “document-wide” relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity −log p(r̄ | t ∈ d) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r̄) of term t. Mathematically, we show that this quantity can be approximated by the inverse document-frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.
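The basic ranking formula the abstract refers to can be illustrated with a minimal sketch (this is a generic TF-IDF scorer, not the paper's full probabilistic model): each query term contributes tf(t, d) · idf(t), where idf(t) = log(N / df(t)) plays the role of −log p(r̄ | t ∈ d).

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs):
    """Rank documents by a simple sum of tf(t, d) * idf(t) over query terms.

    Illustrative sketch only: idf(t) = log(N / df(t)) stands in for the
    quantity -log p(rbar | t in d) discussed in the abstract.
    """
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within this document
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in query_terms if df[t])
        scores.append(score)
    return scores

docs = [["tf", "idf", "weight", "idf"],
        ["relevance", "decision", "model"],
        ["tf", "relevance"]]
print(tf_idf_scores(["tf", "idf"], docs))
```

The document containing both query terms (one of them twice) scores highest; the document with neither scores zero.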


Journal of the Association for Information Science and Technology | 2002

A survey in indexing and searching XML documents

Robert W. P. Luk; Hong Va Leong; Tharam S. Dillon; Alvin T. S. Chan; W. Bruce Croft; James Allan

XML holds the promise to yield (1) a more precise search by providing additional information in the elements, (2) a better integrated search of documents from heterogeneous sources, (3) a powerful search paradigm using structural as well as content specifications, and (4) data and information exchange to share resources and to support cooperative search. We survey several indexing techniques for XML documents, grouping them into flat-file, semistructured, and structured indexing paradigms. Searching techniques and supporting techniques for searching are reviewed, including full text search and multistage search. Because searching XML documents can be very flexible, various search result presentations are discussed, as well as database and information retrieval system integration and XML query languages. We also survey various retrieval models, examining how they would be used or extended for retrieving XML documents. To conclude the article, we discuss various open issues that XML poses with respect to information retrieval and database research.


Engineering Applications of Artificial Intelligence | 2007

Stock time series pattern matching: Template-based vs. rule-based approaches

Tak-chung Fu; Fu-Lai Chung; Robert W. P. Luk; Chak-man Ng

One of the major duties of financial analysts is technical analysis: to analyze market behavior, they must locate technical patterns in stock price movement charts. There are two main problems: how to define the preferred (technical) patterns for query, and how to match the defined pattern templates at different resolutions. Defining the similarity between time series (or time series subsequences) is thus of fundamental importance. By identifying the perceptually important points (PIPs) directly from the time domain, time series and templates of different lengths can be compared. Three distance measures for PIP identification, namely Euclidean distance (PIP-ED), perpendicular distance (PIP-PD) and vertical distance (PIP-VD), are compared in this paper. After the PIP identification process, both template- and rule-based pattern-matching approaches are introduced. The proposed methods are distinctive in their intuitiveness, making them particularly user friendly to ordinary data analysts like stock market investors. As demonstrated by the experiments, the template- and rule-based time series matching and subsequence searching approaches provide different directions to achieve the goal of pattern identification.
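The PIP idea can be sketched concisely. The following uses the vertical-distance variant (PIP-VD): starting from the two endpoints, repeatedly add the point lying farthest, vertically, from the line joining its two neighbouring PIPs. This is a minimal sketch, not the paper's implementation of all three measures.

```python
def identify_pips(series, n_pips):
    """Select up to n_pips perceptually important points (PIP-VD variant).

    Starts with the two endpoints; at each step adds the point with the
    largest vertical distance to the line joining its adjacent PIPs.
    """
    n = len(series)
    pips = [0, n - 1]                      # the endpoints are always PIPs
    while len(pips) < min(n_pips, n):
        best_idx, best_dist = None, -1.0
        for a, b in zip(pips, pips[1:]):   # each adjacent pair of PIPs
            for i in range(a + 1, b):
                # y-value of the line (a, series[a])-(b, series[b]) at x = i
                interp = series[a] + (series[b] - series[a]) * (i - a) / (b - a)
                dist = abs(series[i] - interp)
                if dist > best_dist:
                    best_idx, best_dist = i, dist
        pips.append(best_idx)
        pips.sort()
    return pips

prices = [1.0, 1.1, 3.0, 1.2, 1.0, 0.5, 1.0]
print(identify_pips(prices, 4))   # indices of the 4 most important points
```

The sharp peak at index 2 is picked first, then the point that best captures the descent, so the 4-point reduction preserves the visual shape of the series.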


Engineering Applications of Artificial Intelligence | 2008

Representing financial time series based on data point importance

Tak-chung Fu; Fu-Lai Chung; Robert W. P. Luk; Chak-man Ng

Recently, the increasing use of time series data has initiated various research and development attempts in the field of data and knowledge management. Time series data is characterized by large data size, high dimensionality, and continuous updating; moreover, a time series is usually considered as a whole rather than as individual numerical fields. A large amount of time series data comes from the stock market, and stock time series have their own characteristics compared with other time series. Furthermore, dimensionality reduction is an essential step before many time series analysis and mining tasks. For these reasons, research is prompted to augment existing technologies and build new representations to manage financial time series data. In this paper, financial time series are represented according to the importance of their data points. Based on the concept of data point importance, a tree data structure that supports incremental updating is proposed to represent the time series, and an access method for retrieving data points from the tree according to their order of importance is introduced. This technique is capable of presenting the time series at different levels of detail and facilitates multi-resolution dimensionality reduction of the time series data. Different data point importance evaluation methods, a new updating method and two dimensionality reduction approaches are proposed and evaluated through a series of experiments. Finally, an application of the proposed representation in a mobile environment is demonstrated.
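The core retrieval idea, points kept in importance order so that any prefix yields a coarser view of the series, can be sketched as follows. This is a hedged simplification: the importance score here is a crude local-deviation measure standing in for the paper's evaluation methods, and a flat ranked list stands in for the tree structure.

```python
def importance_order(series):
    """Rank points by a crude importance score (deviation from the midpoint
    of the two neighbours); the endpoints are always ranked first.

    A stand-in for the paper's importance evaluation methods and tree.
    """
    scores = []
    for i in range(1, len(series) - 1):
        mid = (series[i - 1] + series[i + 1]) / 2
        scores.append((abs(series[i] - mid), i))
    return [0, len(series) - 1] + [i for _, i in sorted(scores, reverse=True)]

def at_resolution(series, ranked, k):
    """Level-of-detail query: the k most important points, in time order."""
    keep = sorted(ranked[:k])
    return [(i, series[i]) for i in keep]

prices = [1.0, 1.2, 4.0, 1.1, 1.0]
ranked = importance_order(prices)
print(at_resolution(prices, ranked, 3))   # coarse 3-point view of the series
```

Increasing k refines the representation incrementally, which is the multi-resolution dimensionality reduction the abstract describes.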


international conference on data mining | 2002

Evolutionary time series segmentation for stock data mining

Fu-Lai Chung; Tak-chung Fu; Robert W. P. Luk; Vincent T. Y. Ng

Stock data in the form of multiple time series are difficult to process, analyze and mine. However, when they can be transformed into meaningful symbols like technical patterns, it becomes easier. Most recent work on time series queries concentrates only on how to identify a given pattern from a time series. Researchers do not consider the problem of identifying a suitable set of time points for segmenting the time series in accordance with a given set of pattern templates (e.g., a set of technical patterns for stock analysis). On the other hand, using fixed length segmentation is a primitive approach to this problem; hence, a dynamic approach (with high controllability) is preferred so that the time series can be segmented flexibly and effectively according to the needs of users and applications. In view of the fact that such a segmentation problem is an optimization problem and evolutionary computation is an appropriate tool to solve it, we propose an evolutionary time series segmentation algorithm. This approach allows a sizeable set of stock patterns to be generated for mining or query. In addition, defining the similarity between time series (or time series segments) is of fundamental importance in fitness computation. By identifying perceptually important points directly from the time domain, time series segments and templates of different lengths can be compared and intuitive pattern matching can be carried out in an effective and efficient manner. Encouraging experimental results are reported from tests that segment the time series of selected Hong Kong stocks.
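A toy version of the evolutionary segmentation loop is sketched below. It is not the paper's algorithm: a chromosome is a set of cut points, and the fitness used here, negative within-segment variance, is a simple stand-in for the template-matching fitness based on perceptually important points.

```python
import random

def fitness(series, cuts):
    """Negative sum of within-segment variances (higher is better).

    A stand-in fitness; the paper scores segments against pattern templates.
    """
    bounds = [0] + sorted(cuts) + [len(series)]
    total = 0.0
    for a, b in zip(bounds, bounds[1:]):
        seg = series[a:b]
        mean = sum(seg) / len(seg)
        total += sum((x - mean) ** 2 for x in seg)
    return -total

def evolve(series, n_cuts, pop_size=30, generations=60, seed=0):
    """Evolve a set of n_cuts segmentation points by selection + mutation."""
    rng = random.Random(seed)
    interior = range(1, len(series))            # candidate cut positions
    pop = [rng.sample(interior, n_cuts) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(series, c), reverse=True)
        survivors = pop[:pop_size // 2]         # elitist selection
        children = []
        for parent in survivors:
            child = list(parent)
            child[rng.randrange(n_cuts)] = rng.choice(list(interior))  # mutate
            children.append(child if len(set(child)) == n_cuts else list(parent))
        pop = survivors + children
    return sorted(max(pop, key=lambda c: fitness(series, c)))

series = [0, 0, 0, 5, 5, 5, 9, 9, 9]
print(evolve(series, 2))   # cuts near the two level changes
```

On this piecewise-constant toy series, the ideal cuts [3, 6] give zero within-segment variance, so the population converges toward them.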


ACM Transactions on Asian Language Information Processing | 2002

A comparison of Chinese document indexing strategies and retrieval models

Robert W. P. Luk; K. L. Kwok

With the advent of the Internet and intranets, substantial interest is being shown in Asian language information retrieval; especially in Chinese, which is a good example of an Asian ideographic language (other examples include Japanese and Korean). Since, in this type of language, spaces do not delimit words, an important issue is which index terms should be extracted from documents. This issue also has wider implications for indexing other languages such as agglutinating languages (e.g., Finnish and Turkish), archaic ideographic languages like Egyptian hieroglyphs, and other types of information such as data stored in genomic databases. Although comparisons of indexing strategies for Chinese documents have been made, almost all of them are based on a single retrieval model. This article compares the performance of various combinations of indexing strategies (i.e., character, word, short-word, bigram, and Pircs indexing) and retrieval models (i.e., vector space, 2-Poisson, logistic regression, and Pircs models). We determine which model (and its parameters) achieves the (near) best retrieval effectiveness without relevance feedback, and compare it with the open evaluations (i.e., TREC and NTCIR) for both long and title queries. In addition, we describe a more extensive investigation of retrieval efficiency. In particular, the storage cost of word indexing is only slightly more than character indexing, and bigram indexing is about double the storage cost of other indexing strategies. The retrieval time typically varies linearly with the number of unique terms in the query, which is supported by correlation values above 90%. The Pircs retrieval system achieves robust and good retrieval performance, but it appears to be the slowest method, whereas vector space models were not very effective in retrieval, but were able to respond quickly. 
For robust, near-best retrieval effectiveness, without considering storage overhead, the 2-Poisson model using bigram indexing appears to be a good compromise between retrieval effectiveness and efficiency for both long and title queries.
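Bigram indexing, one of the strategies compared above, sidesteps word segmentation entirely: every overlapping pair of characters becomes an index term. A minimal sketch (illustrative, not the article's retrieval system):

```python
def bigram_terms(text):
    """Extract overlapping character bigrams as index terms."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs):
    """Build an inverted index: bigram term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for term in bigram_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "信息检索", 2: "检索模型"}
index = build_index(docs)
print(sorted(index["检索"]))   # bigram shared by both documents
```

Each n-character document yields n − 1 bigram terms, which is why the article finds bigram indexing roughly doubles the storage cost of character or word indexing.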


congress on evolutionary computation | 2001

Evolutionary segmentation of financial time series into subsequences

Tak-Chung Fu; Fu-Lai Chung; Vincent T. Y. Ng; Robert W. P. Luk

Time series data are difficult to manipulate, but when they can be transformed into meaningful symbols, querying and understanding them becomes easy. Most recent work on time series queries concentrates only on how to identify a given pattern from a time series; it does not consider the problem of identifying a suitable set of time points based upon which the time series can be segmented in accordance with a given set of pattern templates, e.g., a set of technical analysis patterns for stock analysis. On the other hand, fixed-length segmentation is only a primitive approach to this kind of problem, and hence a dynamic approach is preferred so that the time series can be segmented flexibly and effectively. In view of the fact that such a segmentation problem is actually an optimization problem and evolutionary computation is an appropriate tool to solve it, we propose an evolutionary segmentation algorithm in this paper. Encouraging experimental results in segmenting the Hong Kong Hang Seng Index using 22 technical analysis patterns are reported.


Computer Speech & Language | 1996

Stochastic phonographic transduction for English

Robert W. P. Luk; Robert I. Damper

This paper introduces and reviews stochastic phonographic transduction (SPT), a trainable (“data-driven”) technique for letter-to-phoneme conversion based on formal language theory, and describes one particularly simple realization of SPT. The spellings and pronunciations of English words are modelled as the productions of a stochastic grammar, inferred from example data in the form of a pronouncing dictionary. The terminal symbols of the grammar are letter–phoneme correspondences, and the rewrite (production) rules of the grammar specify how these are combined to form acceptable English word spellings and their pronunciations. Given the spelling of a word as input, a pronunciation can then be produced as output by parsing the input string according to the letter-part of the terminals and selecting the “best” sequence of corresponding phoneme-parts according to some well-motivated criteria. Although the formalism is in principle very general, restrictive assumptions must be made if practical, trainable systems are to be realized. We have assumed at this stage that the grammar is regular. Further, word generation is modelled as a Markov process in which terminals (correspondences) are simply concatenated. The SPT learning task then amounts to the inference of a set of correspondences and estimation from the training data of their associated transition probabilities. Transduction to produce a pronunciation for a word given its spelling is achieved by Viterbi decoding, using a maximum likelihood criterion. Results are presented for letter–phoneme alignment and transduction for the dictionary training data, unseen dictionary words, unseen proper nouns and novel (pseudo-)words. Two different ways of inferring correspondences are described and compared. It is found that the provision of quite limited information about the alternating vowel/consonant structure of words aids the inference process significantly.
Best transduction performance obtained on unseen dictionary words is 93.7% phonemes correct, conservatively scored. Automatically inferred correspondences also consistently out-perform a published set of manually derived correspondences when used for SPT. Although the comparison is difficult to make, we believe that current results for letter-to-phoneme conversion are at least as good as the best reported so far for a data-driven approach, while being comparable in performance to knowledge-based approaches.
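The transduction step can be sketched with a toy dynamic program. The correspondence table below is entirely hypothetical, and a unigram probability per correspondence stands in for the trained Markov transition probabilities; the decoder maximizes the product of probabilities over all segmentations of the spelling, in the spirit of the maximum-likelihood Viterbi decoding described above.

```python
# Hypothetical toy table: letter-part -> list of (phoneme-part, probability).
# The empty phoneme-part models a silent letter (final "e").
CORRESPONDENCES = {
    "ph": [("f", 0.9)],
    "p":  [("p", 0.8)],
    "h":  [("h", 0.7)],
    "o":  [("@U", 0.5)],
    "ne": [("n", 0.4)],
    "n":  [("n", 0.6)],
    "e":  [("", 0.3)],
}

def transduce(spelling):
    """Best phoneme string for a spelling: maximize the product of
    correspondence probabilities over all segmentations (Viterbi-style DP)."""
    n = len(spelling)
    best = [(0.0, "")] * (n + 1)        # best[i] = (score, phonemes) for prefix i
    best[0] = (1.0, "")
    for i in range(1, n + 1):
        for j in range(max(0, i - 2), i):    # letter-parts of length 1 or 2
            part = spelling[j:i]
            for phoneme, p in CORRESPONDENCES.get(part, []):
                score = best[j][0] * p
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + phoneme)
    return best[n][1]

print(transduce("phone"))
```

The decoder prefers the segmentation ph / o / ne over p / h / o / ne because the "ph" correspondence carries more probability mass than the two single-letter correspondences combined.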


International Workshop on Challenges in Web Information Retrieval and Integration | 2005

XML Document Clustering Using Common XPath

Ho-pong Leung; Fu-Lai Chung; Stephen Chi-fai Chan; Robert W. P. Luk

XML is becoming a common way of storing data. The elements and their arrangement in the document’s hierarchy not only describe the document structure but also imply the data’s semantic meaning, and hence provide valuable information to develop tools for manipulating XML documents. In this paper, we pursue a data mining approach to the problem of XML document clustering. We introduce a novel XML structural representation called common XPath (CXP), which encodes the frequently occurring elements with the hierarchical information, and propose to take the CXPs mined to form the feature vectors for XML document clustering. In other words, data mining acts as a feature extractor in the clustering process. Based on this idea, we devise a path-based XML document clustering algorithm called PBClustering which groups the documents according to their CXPs, i.e. their frequent structures. Encouraging simulation results are observed and reported.
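The feature-extraction idea can be sketched briefly. This is a hedged simplification of the approach: frequently occurring root-to-element paths stand in for the mined common XPaths (the paper's mining step is more elaborate), and each document becomes a binary vector over those paths, ready for any clustering algorithm.

```python
from xml.etree import ElementTree

def paths(elem, prefix=""):
    """All root-to-element paths in an XML tree."""
    path = prefix + "/" + elem.tag
    result = {path}
    for child in elem:
        result |= paths(child, path)
    return result

def feature_vectors(xml_docs, min_support=2):
    """Keep paths occurring in >= min_support documents; encode each
    document as a binary vector over those frequent paths."""
    docs = [paths(ElementTree.fromstring(x)) for x in xml_docs]
    counts = {}
    for doc in docs:
        for p in doc:
            counts[p] = counts.get(p, 0) + 1
    frequent = sorted(p for p, c in counts.items() if c >= min_support)
    return frequent, [[int(p in doc) for p in frequent] for doc in docs]

xml_docs = ["<book><title/><author/></book>",
            "<book><title/><year/></book>",
            "<cd><title/></cd>"]
frequent, vectors = feature_vectors(xml_docs)
print(frequent)
print(vectors)
```

The two book documents share the frequent paths and get identical vectors, while the cd document gets an all-zero vector, so a clustering step would naturally group the books together.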


Information Processing and Management | 2008

Re-examining the effects of adding relevance information in a relevance feedback environment

Wingo Sai-Keung Wong; Robert W. P. Luk; Hong-va Leong; Kei Shiu Ho; Dik Lun Lee

This paper presents an investigation of how to automatically formulate effective queries using full or partial relevance information (i.e., the terms that are in relevant documents) in the context of relevance feedback (RF). The effects of adding relevance information in the RF environment are studied via controlled experiments. The conditions of these controlled experiments are formalized into a set of assumptions that form the framework of our study, called the idealized relevance feedback (IRF) framework. In our IRF settings, we confirm the previous findings of relevance feedback studies. In addition, our experiments show that better retrieval effectiveness can be obtained when (i) we normalize the term weights by their ranks, (ii) we select weighted terms in the top K retrieved documents, (iii) we include terms in the initial title queries, and (iv) we use the best query size for each topic instead of the average best query size; together these produce at most five percentage points of improvement in the mean average precision (MAP) value. We have also achieved a new level of retrieval effectiveness, about 55-60% MAP instead of the 40+% of previous findings. This new level of retrieval effectiveness was found to be similar to a level achieved using a TREC ad hoc test collection with about double the number of documents of the TREC-3 test collection used in previous works.
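A toy version of the query-formulation step can illustrate points (i)-(iv). This sketch is loosely inspired by, not taken from, the paper: terms from the relevant documents are ranked by frequency, weights are rank-normalized with a simple 1/(rank+1) scheme (an assumption, not the paper's formula), initial title terms are included at full weight, and the query is grown to a chosen size.

```python
from collections import Counter

def formulate_query(title_terms, relevant_docs, query_size):
    """Build a weighted query from relevant documents (idealized RF sketch).

    Assumptions: frequency ranking of relevant-document terms, and a
    hypothetical 1/(rank+1) rank-normalized weight.
    """
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)
    ranked = [t for t, _ in tf.most_common()]
    weights = {t: 1.0 / (rank + 1) for rank, t in enumerate(ranked)}
    query = dict.fromkeys(title_terms, 1.0)    # title terms at full weight
    for t in ranked:                            # add terms up to query_size
        if len(query) >= query_size:
            break
        query.setdefault(t, weights[t])
    return query

rel = [["retrieval", "model", "term"], ["retrieval", "weight"]]
print(formulate_query(["tfidf"], rel, 3))
```

Varying `query_size` per topic, rather than fixing one global size, is the knob that point (iv) of the abstract tunes.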

Collaboration


Top co-authors of Robert W. P. Luk and their affiliations.

Fu-Lai Chung (Hong Kong Polytechnic University)
Kam-Fai Wong (The Chinese University of Hong Kong)
Tak-chung Fu (Hong Kong Polytechnic University)
Ho Chung Wu (Hong Kong Polytechnic University)
Edward Kai Fung Dang (Hong Kong Polytechnic University)
Hong Va Leong (The Chinese University of Hong Kong)
Chak-man Ng (Hong Kong Institute of Vocational Education)
Dik Lun Lee (Hong Kong University of Science and Technology)
Kei Shiu Ho (Hong Kong Polytechnic University)