Publication


Featured research published by Benjamin K. Tsou.


Meeting of the Association for Computational Linguistics | 1998

Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data

Maosong Sun; Dayang Shen; Benjamin K. Tsou

Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without using any lexicon or hand-crafted linguistic resource. The statistical data required by the algorithm, namely mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. A preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope the gains from this approach will help improve the performance of existing segmenters (especially their ability to cope with unknown words and to adapt to various domains), though the algorithm itself can also be used as a stand-alone segmenter in some NLP applications.
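The lexicon-free idea in this abstract can be illustrated with a minimal sketch. It covers only the mutual-information half of the statistics (the t-score difference is omitted), and the function names `adjacent_mi` and `segment`, the threshold value, and the rule "split where mutual information is below the threshold" are assumptions made for illustration, not the authors' algorithm.

```python
import math
from collections import Counter

def adjacent_mi(corpus):
    """Pointwise mutual information for every adjacent character pair,
    estimated from raw, unsegmented lines of text."""
    chars = Counter()
    pairs = Counter()
    for line in corpus:
        chars.update(line)
        pairs.update(zip(line, line[1:]))
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    return {
        (a, b): math.log2(
            (count / n_pairs) / ((chars[a] / n_chars) * (chars[b] / n_chars))
        )
        for (a, b), count in pairs.items()
    }

def segment(text, mi, threshold=0.0):
    """Place a word boundary between two characters whenever their
    mutual information is below the threshold (hypothetical rule).
    Pairs never seen in the corpus are always split."""
    if not text:
        return []
    words, current = [], text[0]
    for a, b in zip(text, text[1:]):
        if mi.get((a, b), float("-inf")) >= threshold:
            current += b          # strongly associated: same word
        else:
            words.append(current)  # weakly associated: start a new word
            current = b
    words.append(current)
    return words
```

Calling `adjacent_mi` on a list of raw sentences and passing the result to `segment` is enough to try it out; no dictionary or annotated data is involved, which is the point the abstract makes.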


International Conference on Computational Linguistics | 2008

Active Learning with Sampling by Uncertainty and Density for Word Sense Disambiguation and Text Classification

Jingbo Zhu; Huizhen Wang; Tianshun Yao; Benjamin K. Tsou

This paper addresses two issues of active learning. First, because uncertainty sampling often fails by selecting outliers, this paper presents a new selective sampling technique, sampling by uncertainty and density (SUD), in which a k-Nearest-Neighbor-based density measure is adopted to determine whether an unlabeled example is an outlier. Second, a technique of sampling by clustering (SBC) is applied to build a representative initial training data set for active learning. Finally, we implement a new active learning algorithm that combines the SUD and SBC techniques. The experimental results from three real-world data sets show that our method outperforms competing methods, particularly at the early stages of active learning.
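A rough sketch of how uncertainty and a k-Nearest-Neighbor density might be combined is given below. It assumes entropy as the uncertainty measure, mean cosine similarity to the k most similar examples as the density, and their product as the selection score; these choices and all function names are illustrative assumptions rather than the paper's exact formulation. Feature vectors are assumed to be non-zero.

```python
import math

def entropy(probs):
    """Shannon entropy of one class-posterior distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cosine(u, v):
    """Cosine similarity of two non-zero dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn_density(i, vectors, k):
    """Mean similarity of example i to its k most similar neighbours."""
    sims = sorted(
        (cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i),
        reverse=True,
    )
    return sum(sims[:k]) / k

def sud_select(vectors, posteriors, k=2):
    """Index of the unlabeled example with the highest
    uncertainty * density score (assumed combination)."""
    scores = [
        entropy(p) * knn_density(i, vectors, k)
        for i, p in enumerate(posteriors)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With this scoring, an isolated example with a near-uniform posterior gets a low density and is passed over in favour of an uncertain example that lies in a dense region.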


Conference on Information and Knowledge Management | 2009

Multi-aspect opinion polling from textual reviews

Jingbo Zhu; Huizhen Wang; Benjamin K. Tsou; Muhua Zhu

This paper presents an unsupervised approach to aspect-based opinion polling from raw textual reviews without explicit ratings. The contributions of this paper are three-fold. First, a multi-aspect bootstrapping algorithm is proposed to learn, from unlabeled data, the aspect-related terms of each aspect to be used for aspect identification. Second, an unsupervised segmentation model is proposed to address the challenge of identifying multiple single-aspect units in a multi-aspect sentence. Finally, an aspect-based opinion polling algorithm is presented. Experiments on real Chinese restaurant reviews show that our opinion polling method achieves 75.5% precision.


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Active Learning With Sampling by Uncertainty and Density for Data Annotations

Jingbo Zhu; Huizhen Wang; Benjamin K. Tsou; Matthew Y. Ma

To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses one classifier to identify unlabeled examples with the least confidence. Uncertainty sampling often presents problems when outliers are selected. To solve the outlier problem, this paper presents two techniques, sampling by uncertainty and density (SUD) and density-based re-ranking. Both techniques prefer not only the most informative example in terms of uncertainty criterion, but also the most representative example in terms of density criterion. Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods.


International Conference on Computational Linguistics | 2002

Covering ambiguity resolution in Chinese word segmentation based on contextual information

Xiao Luo; Maosong Sun; Benjamin K. Tsou

Covering ambiguity is one of the two basic types of ambiguity in Chinese word segmentation. We regard its resolution as equivalent to word sense disambiguation, and use the classical vector space model from information retrieval to represent the contexts of ambiguous words. A variant of TF-IDF weighting is proposed, and a Chinese thesaurus is additionally used to cope with the data sparseness problem. We select 90 frequent cases of covering ambiguity as the target. The training set includes 77654 sentences, and the test set includes 19242 sentences. The experimental results show that our model achieves 96.58% accuracy, outperforming the original form of TF-IDF weighting as well as another baseline model, the hidden Markov model.
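The vector space formulation can be sketched as follows. This uses plain TF-IDF with cosine similarity, not the variant weighting or the thesaurus described in the abstract, and the labels, tokens, and function names are hypothetical: each candidate reading of an ambiguous string is represented by the pooled context words seen with it in training, and a new context is assigned to the most similar reading.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Standard TF-IDF vectors (sparse dicts) for tokenized documents."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return [
        {term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(context, class_contexts):
    """Return the label whose pooled training context is most similar
    to the given context (a list of surrounding words)."""
    labels = list(class_contexts)
    docs = [class_contexts[label] for label in labels]
    vectors = tfidf_vectors(docs + [context])  # query weighted with the same IDF
    query = vectors[-1]
    scores = {label: cosine(query, vectors[i]) for i, label in enumerate(labels)}
    return max(labels, key=scores.__getitem__)
```

In this toy setup the labels would stand for the two readings of a covering ambiguity, for example keeping the string as one word versus splitting it into two.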


Conference on Information and Knowledge Management | 2009

Aspect-based sentence segmentation for sentiment summarization

Jingbo Zhu; Muhua Zhu; Huizhen Wang; Benjamin K. Tsou

Aspect-based sentiment summarization systems generally use sentences associated with relevant aspects extracted from the reviews as the basis for summarization. However, in real reviews, a single sentence often exhibits several aspects for opinions. This paper proposes a two-stage segmentation model to address the challenge of identifying multiple single-aspect and single-polarity units in one sentence, namely aspect-based sentence segmentation. Our model deals with both issues of aspect change and polarity change occurring in the input sentence. Experiments on restaurant reviews show that our model outperforms state-of-the-art linear text segmentation methods.


North American Chapter of the Association for Computational Linguistics | 2000

Mining discourse markers for Chinese textual summarization

Samuel W. K. Chan; Tom B. Y. Lai; W. J. Gao; Benjamin K. Tsou

Discourse markers foreshadow the message thrust of texts and saliently guide their rhetorical structure, which are important for content filtering and text abstraction. This paper reports on efforts to automatically identify and classify discourse markers in Chinese texts using heuristic-based and corpus-based data-mining methods, as an integral part of automatic text summarization via rhetorical structure and discourse markers. Encouraging results are reported.


Conference of the European Chapter of the Association for Computational Linguistics | 2003

Categorial fluidity in Chinese and its implications for part-of-speech tagging

Oi Yee Kwong; Benjamin K. Tsou

This paper discusses the theoretical and practical concerns in part-of-speech (POS) tagging for Chinese. Unlike other languages such as English, Chinese lacks morphological marking in association with categorial alternations. We consider such categorial fluidity a continuum, and any categorial shift a transition, with special focus on the verb-noun shift. Preliminary observations are reported on this phenomenon from empirical data, and we suggest that POS tagging should not only be theoretically valid but also sufficiently capture the extent of categorial fluidity as reflected by the data.


Meeting of the Association for Computational Linguistics | 2000

Enhancement of a Chinese Discourse Marker Tagger with C4.5

Benjamin K. Tsou; Tom B. Y. Lai; Samuel W. K. Chan; Weijun Gao; Xuegang Zhan

Discourse markers are complex discontinuous linguistic expressions which are used to explicitly signal the discourse structure of a text. This paper describes efforts to improve an automatic tagging system which identifies and classifies discourse markers in Chinese texts by applying machine learning (ML) to the disambiguation of discourse markers, as an integral part of automatic text summarization via rhetorical structure. Encouraging results are reported.


International Conference on the Computer Processing of Oriental Languages | 2009

A Density-Based Re-ranking Technique for Active Learning for Data Annotations

Jingbo Zhu; Huizhen Wang; Benjamin K. Tsou

One of the popular techniques of active learning for data annotation is uncertainty sampling, which, however, often presents problems when outliers are selected. To solve this problem, this paper proposes a density-based re-ranking technique, in which a density measure is adopted to determine whether an unlabeled example is an outlier. The motivation of this study is to prefer not only the most informative example in terms of the uncertainty measure, but also the most representative example in terms of the density measure. Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets show that our proposed density-based re-ranking technique can improve uncertainty sampling.
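One plausible reading of density-based re-ranking is a two-step selection: shortlist the most uncertain examples, then pick the densest one from that shortlist. The sketch below assumes the uncertainty and density scores have already been computed; the function name `rerank_by_density` and the shortlist size `top_m` are hypothetical, and the paper's actual procedure may differ.

```python
def rerank_by_density(uncertainty, density, top_m):
    """Shortlist the top_m most uncertain examples, then return the
    index of the shortlisted example with the highest density."""
    shortlist = sorted(
        range(len(uncertainty)),
        key=lambda i: uncertainty[i],
        reverse=True,
    )[:top_m]
    return max(shortlist, key=lambda i: density[i])
```

Unlike multiplying the two scores, this keeps uncertainty as the primary criterion and uses density only to break out of outlier selections among the shortlisted candidates.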

Collaboration


Dive into Benjamin K. Tsou's collaborations.

Top Co-Authors

Tom B. Y. Lai

City University of Hong Kong

Oi Yee Kwong

City University of Hong Kong

Samuel W. K. Chan

City University of Hong Kong

King Kui Sin

City University of Hong Kong

Huizhen Wang

Northeastern University (China)

Jingbo Zhu

Northeastern University (China)

Bin Lu

City University of Hong Kong

Caesar Suen Lun

City University of Hong Kong
