Samuel W. K. Chan
The Chinese University of Hong Kong
Publications
Featured research published by Samuel W. K. Chan.
Decision Support Systems | 2011
Samuel W. K. Chan; James Franklin
Although most quantitative financial data are analyzed using traditional statistical, artificial intelligence or data mining techniques, the abundance of online electronic financial news articles has opened up new possibilities for intelligent systems that can extract and organize relevant knowledge automatically in a usable format. Most information extraction systems require a hand-built dictionary of templates and thus need continual modification to accommodate new patterns that are observed in the text. In this research, we propose a novel text-based decision support system (DSS) that (i) extracts event sequences from shallow text patterns, and (ii) predicts the likelihood of the occurrence of events using a classifier-based inference engine. The prediction relies on two major, but complementary, feature sets: adjacent events and a set of information-theoretic functions. In contrast to other approaches, the proposed text-based DSS gives explanatory hypotheses about its predictions from a coalition of intimations learned by the inference engine, while preserving robustness and without indulging in formalism. We investigate more than 2,000 financial reports comprising 28,000 sentences. Experiments show that the prediction accuracy of our model outperforms similar statistical models by 7% on seen data while significantly improving prediction accuracy on unseen data. Further comparisons substantiate these experimental findings.
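The paper itself ships no code; as a rough, assumed illustration of its second component, the sketch below scores candidate successor events with a pointwise-mutual-information feature computed over adjacent-event counts. The toy event names and sequences are hypothetical, and the paper's classifier combines this kind of signal with others.

```python
import math
from collections import Counter
from itertools import pairwise  # Python 3.10+

# Hypothetical event sequences extracted from shallow text patterns.
sequences = [
    ["profit_warning", "share_drop", "rating_cut"],
    ["profit_warning", "share_drop", "ceo_resign"],
    ["merger_talk", "share_rise", "rating_up"],
]

event_counts = Counter(e for seq in sequences for e in seq)
pair_counts = Counter(p for seq in sequences for p in pairwise(seq))
n_events = sum(event_counts.values())
n_pairs = sum(pair_counts.values())

def pmi(prev_event, next_event):
    """Pointwise mutual information between adjacent events."""
    joint = pair_counts[(prev_event, next_event)] / n_pairs
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((event_counts[prev_event] / n_events)
                              * (event_counts[next_event] / n_events)))

# Rank likely successors of "share_drop" by the information-theoretic score.
scores = {e: pmi("share_drop", e) for e in event_counts}
print(max(scores, key=scores.get))
```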
IEEE Transactions on Neural Networks | 1998
Samuel W. K. Chan; James Franklin
Natural language understanding involves the simultaneous consideration of a large number of different sources of information. Traditional methods employed in language analysis have focused on developing powerful formalisms to represent syntactic or semantic structures, along with rules for transforming language into these formalisms. However, they make use of only small subsets of knowledge. This article describes how to use the whole range of information through a neurosymbolic architecture that hybridizes a symbolic network with subsymbol vectors generated from a connectionist network. Besides initializing the symbolic network with prior knowledge, the subsymbol vectors are used to enhance the system's capability in disambiguation and to provide flexibility in sentence understanding. The model captures a diversity of information, including word associations, syntactic restrictions, case-role expectations, semantic rules and context. It attains highly interactive processing by representing knowledge in an associative network on which actual semantic inferences are performed. An integrated use of previously analyzed sentences in understanding is another important feature of our model. The model dynamically selects one hypothesis among multiple hypotheses. This notion is supported by three simulations, which show that the degree of disambiguation relies on both the linguistic rules and the semantic-associative information available to support the inference processes in natural language understanding. Unlike many similar systems, our hybrid system is more sophisticated in tackling language disambiguation problems, using linguistic clues from disparate sources as well as modeling context effects in the sentence analysis. It is potentially more powerful than systems relying on a single processing paradigm.
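As a minimal sketch of the hybrid idea rather than the authors' architecture, the fragment below ranks two readings of an ambiguous word by combining a symbolic case-role fit score with cosine similarity over "subsymbol" vectors; the vectors, fit scores, and equal weighting are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "subsymbol" vectors for two senses of "bank" and a sentence context.
sense_vectors = {"bank/finance": rng.normal(size=16),
                 "bank/river": rng.normal(size=16)}
context_vector = sense_vectors["bank/finance"] + 0.1 * rng.normal(size=16)

# Assumed symbolic case-role fit: how well each sense satisfies the frame
# of "deposit money in the bank".
symbolic_fit = {"bank/finance": 1.0, "bank/river": 0.2}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_score(sense):
    # Combine symbolic constraints with subsymbolic similarity (equal weights).
    return 0.5 * symbolic_fit[sense] + 0.5 * cosine(sense_vectors[sense],
                                                    context_vector)

print(max(sense_vectors, key=hybrid_score))  # -> bank/finance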
Decision Support Systems | 2006
Samuel W. K. Chan
With the explosion in the quantity of on-line text and multimedia information in recent years, there has been a renewed interest in the automated extraction of knowledge and information in various disciplines. In this paper, we provide a novel quantitative model for the creation of a summary by extracting a set of sentences that represent the most salient content of a text. The model is based on a shallow linguistic extraction technique. What distinguishes it from previous research is that it does not work on the detection of specific keywords or cue-phrases to evaluate the relevance of the sentence concerned. Instead, the attention is focused on the identification of the main factors in the textual continuity. Simulation experiments suggest that this technique is useful because it moves away from a purely keyword-based method of textual information extraction and its associated limitations.
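One crude way to operationalize textual continuity, offered here only as an assumed illustration and not as the paper's model, is to score each sentence by its vocabulary overlap with its neighbours and extract the top scorers:

```python
import re

def continuity_summary(text, k=2):
    """Pick the k sentences that share most vocabulary with their neighbours."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    bags = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    scores = []
    for i, bag in enumerate(bags):
        neighbours = bags[max(0, i - 1):i] + bags[i + 1:i + 2]
        scores.append(sum(len(bag & n) for n in neighbours))
    # Keep the k best-connected sentences, in their original order.
    top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    return " ".join(sentences[i] for i in top)

print(continuity_summary(
    "Markets fell sharply. The fall followed weak earnings. "
    "Earnings were weak across banks. Unrelatedly, it rained."))
```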
Systems, Man and Cybernetics | 2004
Samuel W. K. Chan
Most current information retrieval systems rely solely on lexical item repetition, which is notoriously brittle. In this research, we propose a novel method for the extraction of salient textual patterns. One of our major objectives is to move away from keywords and their associated limitations in textual information retrieval. We identify how individual sentences in a text fit together to be perceived as a salient pattern. A text network that exhibits textual continuity, arising from a connectionist model, is described. The network facilitates the dynamic extraction of salient textual segments by capturing semantics from two different categories of natural language, namely lexical cohesion and contextual coherence. We also present the results of an empirical study designed to compare our model with the performance of human judges in the identification of salient textual patterns. The preliminary results show that our model has the potential for automatic discovery of salient patterns in text.
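To make the text-network idea concrete, here is an assumed toy version: sentences become nodes, edge weights approximate lexical cohesion with Jaccard overlap, and salience is read off the summed edge weights. The actual model also captures contextual coherence, which this sketch omits.

```python
import itertools, re

def salience_by_cohesion(sentences):
    """Rank sentences by summed lexical overlap with all others (toy text network)."""
    bags = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)
    weight = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        # Edge weight: Jaccard overlap, a stand-in for lexical cohesion.
        union = bags[i] | bags[j]
        weight[i][j] = weight[j][i] = (len(bags[i] & bags[j]) / len(union)
                                       if union else 0.0)
    return sorted(range(n), key=lambda i: -sum(weight[i]))

sents = ["The bank raised rates.", "Higher rates hurt borrowers.",
         "Borrowers turned to the bank.", "The weather was fine."]
for i in salience_by_cohesion(sents):
    print(sents[i])
```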
Decision Support Systems | 2017
Samuel W. K. Chan; Mickey W. C. Chong
The growth of financial texts in the wake of big data has challenged most organizations and brought escalating demands for analysis tools. In general, text streams are more challenging to handle than numeric data streams. Text streams are unstructured by nature, but they represent collective expressions that are of value in any financial decision. It can be both daunting and necessary to make sense of unstructured textual data. In this study, we address key questions related to the explosion of interest in how to extract insight from unstructured data and how to determine whether such insight provides any hints concerning the trends of financial markets. A sentiment analysis engine (SAE) is proposed which takes advantage of linguistic analyses based on grammars. This engine extends sentiment analysis not only to the word token level, but also to the phrase level within each sentence. An assessment heuristic is applied to extract the collective expressions shown in the texts. Three evaluations are presented to assess the performance of the engine. First, several standard parsing evaluation metrics are applied on two treebanks. Second, a benchmark evaluation using a dataset of English movie reviews is conducted; results show our SAE outperforms the traditional bag-of-words approach. Third, a financial text stream of twelve million words that aligns with a stock market index is examined. The evaluation results and their statistical significance provide strong evidence of long persistence in the mood time series generated by the engine. In addition, our approach establishes grounds for belief that the sentiments expressed through text streams are helpful for analyzing the trends in a stock market index, although such sentiments and market indices are normally considered to be completely uncorrelated. Highlights: the paper explains a classifier-based sentiment parser for financial texts, demonstrates how to assign the polarity of phrases using an assessment heuristic, and provides statistical tests over twelve million words to attest to its significance.
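The engine itself is grammar-based; the toy fragment below illustrates only the general point of scoring polarity at the phrase level rather than per word, using an invented lexicon and a simple negation/intensifier heuristic that stands in for the paper's assessment heuristic.

```python
LEXICON = {"gain": 1.0, "strong": 0.5, "loss": -1.0, "weak": -0.5}  # toy lexicon
NEGATORS = {"not", "no", "never"}
INTENSIFIERS = {"very": 1.5, "slightly": 0.5}

def phrase_polarity(tokens):
    """Compose word polarities within a phrase instead of summing bag-of-words scores."""
    score, sign, boost = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in NEGATORS:
            sign = -sign                 # negation flips the rest of the phrase
        elif tok in INTENSIFIERS:
            boost = INTENSIFIERS[tok]    # intensifier scales the next sentiment word
        elif tok in LEXICON:
            score += sign * boost * LEXICON[tok]
            boost = 1.0
    return score

print(phrase_polarity("not a very strong gain".split()))  # negative despite "gain"
print(phrase_polarity("no loss".split()))                 # positive despite "loss"
```

A plain bag-of-words scorer would get both examples wrong, which is the contrast the abstract draws.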
Decision Support Systems | 2004
Samuel W. K. Chan; Mickey W. C. Chong
While the breadth of vocabulary used in long documents may mislead traditional keyword-based retrieval systems, the demand for techniques for nontextual Web classification and retrieval from large document collections is mounting. Only a few prototype systems have attempted to classify hypertext on the basis of nontextual elements in order to locate unfamiliar documents. As a result, a large portion of Web documents that are pictorial in nature is far beyond the reach of most current search engines. In this research, we devise a novel quantitative model of nontextual World Wide Web (WWW) classification based on image information. An intelligent, content-sensitive, attribute-rich image classifier is presented. An image similarity measure is used to deduce the likeness among images. Different image feature vectors have been constructed and evaluated. Evaluation shows that images judged to be similar by humans form interesting clusters in our unsupervised learning. Comparison with other clustering techniques, such as hierarchical agglomerative clustering (HAC), demonstrates that our approach is useful in content-based image retrieval.
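As a hedged sketch of the clustering stage (the paper's feature vectors and similarity measure are richer than this), the snippet below builds toy colour-histogram features and groups them with hierarchical agglomerative clustering, the baseline named in the abstract.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)

# Toy "image feature vectors": 8-bin colour histograms for two visual themes.
sky = rng.dirichlet(np.array([8, 6, 1, 1, 1, 1, 1, 1]), size=5)    # blue-heavy
grass = rng.dirichlet(np.array([1, 1, 1, 8, 6, 1, 1, 1]), size=5)  # green-heavy
features = np.vstack([sky, grass])

# Hierarchical agglomerative clustering (HAC) over the histogram vectors.
tree = linkage(features, method="average", metric="euclidean")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # the two themes should fall into separate clusters
```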
Machine Translation | 1999
Samuel W. K. Chan; Benjamin Ka-Yin T'sou
Anaphora is a discourse-level linguistic phenomenon. There is consensus that anaphora resolution should rely on prior sentences within the context of the discourse. We propose to cast anaphora resolution as a semantic inference process in which a combination of multiple strategies, each exploiting different aspects of linguistic knowledge, is employed to provide a coherent resolution of anaphora. A framework which encompasses several salient linguistic parameters such as grammatical role, proximity, repetition, sentence recency and semantic cues is demonstrated. This work also shows how an anaphora-resolution algorithm can be embedded within a framework which captures all the above salient parameters, as well as remedies some of the inadequacies found in any monolithic resolution system. A language-neutral semantic representation characterized by semantic cues is presented in order to capture the distilled information after resolution. The effectiveness of the language-neutral representation, both for machine translation and anaphora resolution, is demonstrated through a set of simulations and evaluations.
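To show how the salience parameters listed above might combine, here is a toy scorer; the weights and feature values are invented, not the paper's.

```python
# Invented weights for the salience parameters named in the abstract.
WEIGHTS = {"grammatical_role": 2.0, "proximity": 1.5, "repetition": 1.0,
           "sentence_recency": 1.0, "semantic_cue": 2.5}

def resolve(candidates):
    """Pick the antecedent with the highest weighted salience score."""
    return max(candidates,
               key=lambda c: sum(WEIGHTS[f] * c.get(f, 0.0) for f in WEIGHTS))

# Candidate antecedents for "she" in "The manager filed the report. She ...".
candidates = [
    {"mention": "the manager", "grammatical_role": 1.0, "proximity": 0.5,
     "repetition": 1.0, "sentence_recency": 0.5, "semantic_cue": 1.0},
    {"mention": "the report", "grammatical_role": 0.0, "proximity": 1.0,
     "repetition": 0.0, "sentence_recency": 1.0, "semantic_cue": 0.0},
]
print(resolve(candidates)["mention"])  # -> the manager
```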
Text, Speech and Dialogue | 2013
Samuel W. K. Chan; Mickey W. C. Chong
This research takes advantage of word structures to produce a good estimate of the part-of-speech tags of Chinese compound words before they are fed into a tagger. The approach relies on a set of features from Chinese morphemes, as well as a set of collocation markers which provide hints about the syntactic categories of compound words. A recursive inferential mechanism is devised to alleviate the ripple effect of changes made at neighboring words during tagging. The approach is justified against a database of more than 53,500 compound words. Experimental results on 500,000 words show the approach outperforms its counterparts.
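A minimal, assumed sketch of the underlying intuition: guess the tag of an unseen compound from the tag distribution of its head (final) morpheme, with a made-up two-entry morpheme table standing in for the 53,500-word database.

```python
from collections import Counter

# Hypothetical tag counts for head (final) morphemes, learned from a word list.
HEAD_TAGS = {
    "器": Counter({"NN": 98, "VV": 2}),    # instrument suffix -> usually a noun
    "化": Counter({"VV": 90, "NN": 10}),   # -ise/-ify suffix -> usually a verb
}

def guess_pos(compound):
    """Estimate the POS tag of a compound from its final morpheme."""
    dist = HEAD_TAGS.get(compound[-1])
    return dist.most_common(1)[0][0] if dist else "NN"  # fall back to noun

print(guess_pos("加速器"))  # "accelerator" -> NN
print(guess_pos("现代化"))  # "modernize"   -> VV
```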
Expert Systems with Applications | 2009
Samuel W. K. Chan
A two-phase annotation method for semantic labeling in natural language processing is proposed. The dynamic programming approach centers on non-exact string matching, which takes full advantage of the underlying grammatical structure of the parse trees in a Treebank. The first phase of the labeling is a coarse-grained syntactic parsing, which is complemented by a semantic dissimilarity analysis in the second phase. The approach goes beyond shallow parsing to a deeper level of case-role identification, while preserving robustness, without being bogged down in a complete linguistic analysis. The paper presents experimental results for recognizing more than 50 different semantic labels in 10,000 sentences. Results show that the approach improves the labeling even with incomplete information. Detailed evaluations are discussed in order to justify its significance.
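The method's core is non-exact matching via dynamic programming; the sketch below shows a plain Levenshtein distance over part-of-speech sequences, with a two-entry toy pattern store standing in for the Treebank. The paper's matching operates over richer parse-tree structure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two tag sequences, one-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (x != y))  # substitution
    return dp[-1]

# Toy pattern store: POS sequences paired with their semantic labels.
PATTERNS = {("NN", "VV", "NN"): ("agent", "action", "patient"),
            ("NN", "VV", "P", "NN"): ("agent", "action", "-", "location")}

def label(tags):
    """Borrow the labels of the nearest stored pattern."""
    nearest = min(PATTERNS, key=lambda p: edit_distance(tags, p))
    return PATTERNS[nearest]

print(label(("NN", "VV", "P", "NN", "NN")))  # nearest: the 4-tag pattern
```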
Intelligent Data Engineering and Automated Learning | 2007
Samuel W. K. Chan
This paper proposes a model of semantic labeling based on the edit distance. The dynamic programming approach centers on a non-exact string matching technique that takes full advantage of the underlying grammatical structure of 65,000 parse trees in a Treebank. Both part-of-speech and lexical similarity serve to identify the possible semantic labels, without getting mired in a full linguistic analysis. The model described has been implemented. We also analyze the tradeoffs between part-of-speech and lexical similarity in the semantic labeling. Experimental results for recognizing various labels in 10,000 sentences are used to justify its significance.
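Following the abstract's tradeoff between part-of-speech and lexical similarity, an assumed variant of the same dynamic program can blend the two in the substitution cost; the weight 0.7 below is arbitrary, not the paper's setting.

```python
def sub_cost(x, y, w=0.7):
    """Weighted substitution cost: POS mismatch vs. lexical mismatch."""
    pos_cost = x[0] != y[0]   # each token is a (pos, word) pair
    lex_cost = x[1] != y[1]
    return w * pos_cost + (1 - w) * lex_cost

def weighted_edit_distance(a, b):
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = float(i)
    for j in range(len(b) + 1):
        dp[0][j] = float(j)
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + sub_cost(x, y))
    return dp[-1][-1]

sent = [("NN", "fund"), ("VV", "rose")]
pattern = [("NN", "stock"), ("VV", "rose")]
print(weighted_edit_distance(sent, pattern))  # 0.3: POS matches, words differ
```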