Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Keh-Yih Su is active.

Publication


Featured research published by Keh-Yih Su.


Journal of the Acoustical Society of America | 1995

Multiple score language processing system

Keh-Yih Su; Jing-Shin Chang; Jong-Nae Wang; Mei-Hui Su

A language processing system includes a mechanism for measuring the syntax trees of sentences of material to be translated and a mechanism for truncating syntax trees in response to the measuring mechanism. In a particular embodiment, a Score Function is provided for disambiguating or truncating ambiguities on the basis of composite scores, generated at different stages of the processing.
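
To illustrate the scoring idea, here is a minimal sketch of composite-score truncation in Python. The parse representation, the per-stage score values, and the product-of-scores combination are assumptions for illustration only, not the patent's actual formulation.

```python
import heapq

def truncate_ambiguities(parses, k=2):
    """Keep only the k best analyses by composite score.

    Hypothetical sketch: each analysis carries scores generated at
    different processing stages; the composite score is their product,
    and lower-scoring ambiguities are truncated.
    """
    def composite(parse):
        score = 1.0
        for stage_score in parse["scores"]:
            score *= stage_score
        return score
    return heapq.nlargest(k, parses, key=composite)

# Three ambiguous analyses of one sentence, scored at three stages
parses = [
    {"tree": "t1", "scores": [0.9, 0.8, 0.7]},
    {"tree": "t2", "scores": [0.5, 0.9, 0.6]},
    {"tree": "t3", "scores": [0.4, 0.3, 0.9]},
]
print([p["tree"] for p in truncate_ambiguities(parses)])  # ['t1', 't2']
```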


IEEE Transactions on Speech and Audio Processing | 1994

Speech recognition using weighted HMM and subspace projection approaches

Keh-Yih Su; Chin-Hui Lee

A weighted hidden Markov model (HMM) algorithm and a subspace projection algorithm are proposed to address the discrimination and robustness issues for HMM-based speech recognition. A robust two-stage classifier is also proposed to incorporate these two approaches to further improve the performance. The weighted HMM enhances discrimination power by first jointly considering the state likelihoods of different word models, then assigning a weight to the likelihood of each state according to its contribution to discriminating between words. The robustness of this model is then improved by increasing the likelihood difference between the top and second candidates. The subspace projection approach discards unreliable observations on the basis of maximizing the divergence between different word pairs. To improve robustness, the mean of each cluster is then adjusted to obtain maximum separation between different clusters. The performance was evaluated with a highly confusable vocabulary consisting of the nine English E-set words. The test was conducted in a multispeaker (100 talkers), isolated-word mode. The 61.7% word accuracy of the original HMM-based system was improved to 74.9% and 76.6%, respectively, by the weighted HMM and the subspace projection methods. By incorporating the weighted HMM in the first stage and the subspace projection in the second stage, the two-stage classifier achieved a word accuracy of 79.4%.
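
The weighting idea can be sketched compactly. In the toy example below, the per-state log-likelihoods, the weight values, and the dot-product combination are all illustrative assumptions; the paper derives the weights from the states' discriminative contributions rather than fixing them by hand.

```python
import numpy as np

def weighted_hmm_score(state_loglikes, state_weights):
    # Weight each state's log-likelihood along the decoded state path by
    # its (assumed, pre-learned) contribution to discriminating words,
    # instead of summing the states uniformly as a plain HMM would.
    return float(np.dot(state_weights, state_loglikes))

# Per-state log-likelihoods of one utterance under two confusable models
loglikes_b = np.array([-1.2, -0.8, -2.1, -0.5])
loglikes_d = np.array([-1.0, -1.6, -1.5, -0.5])
weights    = np.array([0.5, 2.0, 1.0, 1.5])   # hypothetical learned weights

# Unweighted, the two models tie (both sum to -4.6); the weights
# emphasize the second state, which actually separates the words.
print(weighted_hmm_score(loglikes_b, weights) >
      weighted_hmm_score(loglikes_d, weights))  # True
```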


international conference on computational linguistics | 1992

A new quantitative quality measure for machine translation systems

Keh-Yih Su; Ming-Wen Wu; Jing-Shin Chang

In this paper, an objective quantitative quality measure is proposed to evaluate the performance of machine translation systems. The proposed method compares the raw translation output of an MT system with the final version revised for the customers, and then computes the editing effort required to convert the raw translation into the final version. In contrast to other proposals, the evaluation process can be done quickly and automatically, so it can provide quick feedback on any system change. A system designer can thus quickly identify the advantages or faults of a particular design change. Application of such a measure to improving system performance on-line in a parameterized, feedback-controlled system is also demonstrated. Furthermore, because the revised version is used directly as the reference, the performance measure reflects the real quality gap between system performance and customer expectation. A system designer can thus concentrate on practically important topics rather than on theoretically interesting issues.
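
The core computation can be sketched as a word-level edit distance between the raw output and the revised version. This is a minimal stand-in: the paper's actual cost model for editing effort may weight operations differently.

```python
def edit_effort(raw_tokens, revised_tokens):
    """Word-level Levenshtein distance as a stand-in for the paper's
    editing-effort measure: the number of insertions, deletions, and
    substitutions needed to turn raw MT output into the revised version."""
    m, n = len(raw_tokens), len(revised_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if raw_tokens[i - 1] == revised_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

raw     = "the system translate sentence quickly".split()
revised = "the system translates sentences quickly".split()
print(edit_effort(raw, revised))  # 2 substitutions
```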


Computational Linguistics and Chinese Language Processing | 1997

An Unsupervised Iterative Method for Chinese New Lexicon Extraction

Jing-Shin Chang; Keh-Yih Su

An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list). An augmented dictionary, which includes potential unknown words in addition to known words, is used to segment the input corpus, unlike traditional approaches which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input corpus. Meanwhile, the joint character association metric (which reflects the global character association characteristics across the corpus) is derived by integrating several commonly used word association metrics, such as mutual information and entropy, with a joint Gaussian mixture density function; such integration allows the filter to use multiple features simultaneously to evaluate character association, unlike traditional filters which apply multiple features independently. The proposed method then allows the contextual constraints and the joint character association metric to enhance each other; this is achieved by iteratively applying the joint association metric to truncate unlikely unknown words in the augmented dictionary and using the segmentation result to improve the estimation of the joint association metric. The refined augmented dictionary and improved estimation are then used in the next iteration to acquire better segmentation and carry out more reliable filtering. Experiments show that both the precision and recall rates improve almost monotonically, in contrast to non-iterative segmentation-merging-filtering-and-disambiguation approaches, which often sacrifice precision for recall or vice versa. With a corpus of 311,591 sentences, the performance is 76% (bigram), 54% (trigram), and 70% (quadragram) in F-measure, significantly better than the non-iterative approach's F-measures of 74% (bigram), 46% (trigram), and 58% (quadragram).
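
The iterative structure can be sketched as follows. Greedy longest-match segmentation stands in for the paper's unsupervised Viterbi segmentation, and a raw frequency threshold stands in for the joint Gaussian-mixture filter; the corpus, dictionaries, and threshold are invented for illustration.

```python
from collections import Counter

def segment(sentence, dictionary):
    # Greedy longest-match segmentation with an augmented dictionary
    # (known words plus current unknown-word candidates); a stand-in
    # for the paper's unsupervised Viterbi segmentation.
    i, words = 0, []
    while i < len(sentence):
        for L in range(min(4, len(sentence) - i), 0, -1):
            if L == 1 or sentence[i:i + L] in dictionary:
                words.append(sentence[i:i + L])
                i += L
                break
    return words

def iterate_lexicon(corpus, known, candidates, rounds=3, min_count=2):
    # Alternate between segmenting with the augmented dictionary and
    # truncating weak candidates, so segmentation and filtering refine
    # each other across iterations.
    for _ in range(rounds):
        counts = Counter(w
                         for s in corpus
                         for w in segment(s, known | candidates)
                         if w in candidates)
        candidates = {w for w, c in counts.items() if c >= min_count}
    return candidates

corpus = ["資訊工程研究", "資訊工程教育", "工程研究計畫"]
known = {"研究", "教育", "計畫"}
candidates = {"資訊工程", "資訊", "工程", "程研"}
print(iterate_lexicon(corpus, known, candidates))  # {'資訊工程'}
```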


IEEE Transactions on Knowledge and Data Engineering | 1993

An efficient algorithm for matching multiple patterns

Jang-Jong Fan; Keh-Yih Su

An efficient algorithm for performing multiple pattern match in a string is described. The match algorithm combines the concept of deterministic finite state automata (DFSA) with the Boyer-Moore algorithm to achieve better performance. Experimental results indicate that, in the average case, the algorithm performs pattern match operations sublinearly, i.e., it does not need to inspect every character of the string. The analysis shows that the number of characters to be inspected decreases as the length of the patterns increases, and increases slightly as the total number of patterns increases. To match an eight-character pattern in an English string, the algorithm inspects only about 17% of all characters of the string; when the number of patterns is seven, about 33% of the characters are inspected. In an actual test, the algorithm running on a SUN 3/160 took only 3.7 s to search for seven eight-character patterns in a 1.4-Mbyte English text file.
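
The flavor of the approach can be conveyed with a simplified sketch: a single bad-character table built from the pattern set allows Boyer-Moore-style skipping, with direct verification at each stop. This is an assumption-laden simplification; the published algorithm's DFSA construction is considerably more refined.

```python
def multi_pattern_search(text, patterns):
    """Simplified multi-pattern matcher in the spirit of combining
    Boyer-Moore skipping with a pattern-set automaton. Shifts are
    computed from the shortest pattern so no match can be skipped."""
    m = min(len(p) for p in patterns)
    shift = {}
    for p in patterns:
        for j in range(m - 1):                 # first m-1 chars only
            shift[p[j]] = min(shift.get(p[j], m), m - 1 - j)
    matches = []
    i = m - 1                                  # last index of the window
    while i < len(text):
        start = i - m + 1
        for p in patterns:                     # verify candidates here
            if text.startswith(p, start):
                matches.append((start, p))
        i += shift.get(text[i], m)
    return matches

print(multi_pattern_search("the patterns in the string",
                           ["pattern", "string", "the"]))
# [(0, 'the'), (4, 'pattern'), (16, 'the'), (20, 'string')]
```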


meeting of the association for computational linguistics | 1994

A Corpus-based Approach to Automatic Compound Extraction

Keh-Yih Su; Ming-Wen Wu; Jing-Shin Chang

An automatic compound retrieval method is proposed to extract compounds within a text message. It uses n-gram mutual information, relative frequency count, and parts of speech as the features for compound extraction. The problem is modeled as a two-class classification problem based on the distributional characteristics of n-gram tokens in the compound and non-compound clusters. For a testing corpus of 49,314 words, the recall and precision of the proposed approach are 96.2% and 48.2% for bigram compounds and 96.6% and 39.6% for trigram compounds. A significant reduction in processing time was also observed.
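
A minimal sketch of the feature side for bigram candidates, assuming simple count tables: mutual information and relative frequency are two of the features the paper feeds into its two-class classifier (the part-of-speech features and the classifier itself are omitted here).

```python
import math
from collections import Counter

def bigram_features(corpus_tokens):
    # Compute mutual information and relative frequency for each word
    # bigram; in the paper these (plus part-of-speech features) drive a
    # two-class compound/non-compound classifier, omitted in this sketch.
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    features = {}
    for (w1, w2), c in bigrams.items():
        p12 = c / n_bi
        mi = math.log2(p12 / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        features[(w1, w2)] = (mi, p12)  # (mutual information, relative frequency)
    return features

tokens = "hong kong stocks rose while hong kong stocks fell".split()
for bg, (mi, rf) in bigram_features(tokens).items():
    print(bg, round(mi, 2), round(rf, 3))
```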


meeting of the association for computational linguistics | 1992

GPSM: A Generalized Probabilistic Semantic Model for Ambiguity Resolution

Jing-Shin Chang; Yih-Fen Luo; Keh-Yih Su

In natural language processing, ambiguity resolution is a central issue and can be regarded as a preference assignment problem. In this paper, a Generalized Probabilistic Semantic Model (GPSM) is proposed for preference computation. An effective semantic tagging procedure is proposed for tagging semantic features, and a semantic score function is derived from a score function that integrates lexical, syntactic, and semantic preferences under a uniform formulation. The semantic score measure shows substantial improvement in structural disambiguation over a syntax-based approach.
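
The uniform-formulation idea can be illustrated with an assumed factorization into lexical, syntactic, and semantic probabilities; the paper derives the exact decomposition, which this sketch does not reproduce.

```python
import math

def preference_score(lex_p, syn_p, sem_p):
    # Assumed factorization for illustration: the preference of an
    # analysis decomposes into lexical, syntactic, and semantic factors,
    # combined here as a joint log-probability.
    return math.log(lex_p) + math.log(syn_p) + math.log(sem_p)

# Two candidate analyses of an ambiguous sentence (invented numbers):
# syntax alone prefers the first, but semantics reverses the ranking.
a1 = preference_score(0.30, 0.60, 0.05)
a2 = preference_score(0.30, 0.40, 0.20)
print(a1 < a2)  # True: the semantic factor resolves the ambiguity
```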


Archive | 1991

GLR Parsing with Scoring

Keh-Yih Su; Jong-Nae Wang; Mei-Hui Su; Jing-Shin Chang

In a machine translation system, the number of possible analyses associated with a given sentence is usually very large due to the ambiguous nature of natural languages. But, it is desirable that only the best one or two analyses be translated and passed to the post-editor so as to reduce the required efforts of post-editing. In addition, processing time for a sentence is usually limited when processing a large number of sentences in batch mode. Therefore, it is important, in a practical machine translation system, to obtain the best syntax tree which has the best annotated semantic interpretation within a reasonably short time. This is only possible with an intelligent parsing algorithm which can truncate undesirable analyses as early as possible and avoid wasting time in parsing those ambiguous constructions that will eventually be discarded.
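
A minimal sketch of the early-truncation idea, with invented scores: after each parsing step, only the best-scoring partial analyses are kept, so constructions that would eventually be discarded are never expanded further. The GLR-specific machinery is omitted.

```python
def beam_prune(partial_parses, beam_width=2):
    # Keep only the highest-scoring partial analyses after a parsing
    # step; lower-scoring ambiguities are truncated before they can
    # consume further parsing time.
    return sorted(partial_parses, key=lambda p: p[0], reverse=True)[:beam_width]

# (running log-score, partial analysis) pairs after one step
step = [(-2.1, "NP VP ..."), (-5.7, "NP NP ..."),
        (-2.4, "S ..."),     (-9.0, "VP ...")]
print(beam_prune(step))  # [(-2.1, 'NP VP ...'), (-2.4, 'S ...')]
```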


Machine Translation | 1990

Some key issues in designing MT systems

Keh-Yih Su; Jing-Shin Chang

Development of a machine translation system (MTS) requires many tradeoffs among the variety of available formalisms and control mechanisms. The tradeoffs involve issues in the generative power of the grammar, the formal linguistic power and efficiency of the parser, manipulation flexibility for knowledge bases, knowledge acquisition, the degree of expressiveness and uniformity of the system, integration of the knowledge sources, and so forth. In this paper we discuss some basic decisions which must be made in constructing a large system. Our experience with an operational English-Chinese MTS, ArchTran, is presented to illustrate decision making related to procedural tradeoffs.


meeting of the association for computational linguistics | 1994

An Automatic Treebank Conversion Algorithm for Corpus Sharing

Jong-Nae Wang; Jing-Shin Chang; Keh-Yih Su

An automatic treebank conversion method is proposed in this paper to convert one treebank into another. A new treebank associated with a different grammar can be generated automatically from the old one, such that the information in the original treebank is transformed to the new one and can be shared among different research communities. The simple algorithm achieves a conversion accuracy of 96.4% when tested on 8,867 sentences between two major grammar revisions of a large MT system.

Collaboration


Dive into Keh-Yih Su's collaborations.

Top Co-Authors

Jing-Shin Chang (National Tsing Hua University)
Yi-Chung Lin (National Tsing Hua University)
Chengqing Zong (Chinese Academy of Sciences)
Tung-Hui Chiang (National Tsing Hua University)
Kun Wang (Chinese Academy of Sciences)
Jang-Jong Fan (National Tsing Hua University)
Xiaoqing Li (University of Science and Technology Beijing)