Shing-Kit Chan
The Chinese University of Hong Kong
Publication
Featured research published by Shing-Kit Chan.
ACM Transactions on Information Systems | 2007
Wai Lam; Shing-Kit Chan; Ruizhang Huang
This article introduces a named entity matching model that makes use of both semantic and phonetic evidence. The matching of semantic and phonetic information is captured by a unified framework via a bipartite graph model. By considering various technical challenges of the problem, including order insensitivity and partial matching, this approach is less rigid than existing approaches and highly robust. One major component is a phonetic matching model which exploits similarity at the phoneme level. Two learning algorithms for learning the similarity information of basic phonemic matching units based on training examples are investigated. By applying the proposed named entity matching model, a mining system is developed for discovering new named entity translations from daily Web news. The system is able to discover new name translations that cannot be found in the existing bilingual dictionary.
knowledge discovery and data mining | 2006
Tak-Lam Wong; Wai Lam; Shing-Kit Chan
Online auction Web sites are fast changing and highly dynamic. It is difficult to digest the vast amount of poorly organized information contained in auction sites. We develop a unified framework that automatically extracts product features and summarizes hot item features across different auction Web sites. One challenge is to extract useful information from the product descriptions provided by sellers, which vary widely in layout and format. We formulate the problem as a single graph labeling problem using conditional random fields, which can model the relationships among neighbouring tokens in a Web page and among tokens from different pages, as well as other information such as the hot item features across different auction sites. We have conducted extensive experiments on several real-world auction Web sites to demonstrate the effectiveness of our framework.
international conference on data mining | 2010
Shing-Kit Chan; Wai Lam
The cascaded approach has long been used to carry out a major task as a sequence of sub-tasks. We put the cascaded approach in a probabilistic framework and analyze possible reasons for cascaded errors. To reduce the occurrence of cascaded errors, we need to add a constraint when performing joint training. We suggest a pseudo Conditional Random Field (pseudo-CRF) approach that models two sub-tasks as two Conditional Random Fields (CRFs). We then present the formulation in the context of a linear chain CRF for solving problems on sequence data. In conducting joint training for a pseudo-CRF, we reuse the existing well-developed, efficient inference algorithms for a linear chain CRF; joint training would otherwise require approximate inference algorithms or simulations that involve long computational time. Interestingly, our experimental results show that a jointly trained CRF model in a pseudo-CRF may perform worse than a separately trained CRF on a sub-task. However, the overall system performance of a pseudo-CRF would still exceed that of a cascaded approach. We implement the implicit constraint as a soft constraint so that users can define the penalty cost for violating it. To handle large-scale datasets, we further suggest a parallel implementation of the pseudo-CRF approach, which can run on a multi-core CPU or on a GPU that supports multi-threading. Our experimental results show that it can achieve a 12-fold speedup.
knowledge discovery and data mining | 2009
Shing-Kit Chan; Wai Lam
Log-linear models have been widely used in text mining tasks because they can incorporate a large number of possibly correlated features. In text mining, these possibly correlated features are generated by conjunctions of features, and they are usually used with log-linear models to estimate robust conditional distributions. To avoid manual construction of feature conjunctions, we propose a new algorithmic framework called F-tree for automatically generating and storing conjunctions of features in text mining tasks. This compact graph-based data structure allows fast one-vs-all matching of features in the feature space, which is crucial for many text mining tasks. Based on this hierarchical data structure, we propose a systematic method for removing redundant features to further reduce memory usage and improve performance. We conduct large-scale experiments on three publicly available datasets and show that this automatic method can match the state-of-the-art performance achieved by manual construction of features.
international joint conference on natural language processing | 2008
Xiaofeng Yu; Wai Lam; Shing-Kit Chan; Yiu Kei Wu; Bo Chen
international conference on data mining | 2007
Shing-Kit Chan; Wai Lam; Xiaofeng Yu
international joint conference on natural language processing | 2008
Xiaofeng Yu; Wai Lam; Shing-Kit Chan
bioinformatics and bioengineering | 2007
Shing-Kit Chan; Wai Lam
siam international conference on data mining | 2006
Tak-Lam Wong; Wai Lam; Shing-Kit Chan
international joint conference on natural language processing | 2008
Shing-Kit Chan; Wai Lam; Xiaofeng Yu