Hung-Shin Lee
National Taiwan University
Publication
Featured research published by Hung-Shin Lee.
Spoken Language Technology Workshop (SLT) | 2010
Meng-Sung Wu; Hung-Shin Lee; Hsin-Min Wang
Topic modeling has been widely applied in a variety of text modeling tasks, as well as in speech recognition systems, to capture the semantic and statistical information in documents or speech utterances. Most topic models rely on the bag-of-words assumption, so the learned latent topics are composed of lists of individual words. Unfortunately, such words may convey topical information yet fail to capture the precise semantics of the text. In this paper, we present the semantic associative topic model (SATM), which extends the concept of semantic association terms to topic modeling: it models the semantic associations that occur among single words by expressing a document as an association of multiple words. Further, the pointwise KL-divergence metric is used to measure the significance of each association. We also integrate the original PLSA and SATM models, which have mixed feature representations. Experimental results on the WSJ and AP datasets show that the proposed approaches achieve higher performance than competing methods.
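As a rough illustration of the association measure described above, the following Python sketch scores word pairs by pointwise KL-divergence using simple document co-occurrence estimates. The probability estimates and function names are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations

def pointwise_kl(p_xy: float, p_x: float, p_y: float) -> float:
    """Pointwise KL-divergence contribution of a word pair (x, y):
    p(x, y) * log(p(x, y) / (p(x) * p(y))).
    Larger values indicate a more significant association."""
    if p_xy == 0.0:
        return 0.0
    return p_xy * math.log(p_xy / (p_x * p_y))

def association_scores(documents):
    """Score all word pairs that co-occur within a document,
    estimating probabilities from document frequencies."""
    n_docs = len(documents)
    word_df = Counter()   # document frequency of single words
    pair_df = Counter()   # document frequency of unordered word pairs
    for doc in documents:
        vocab = set(doc)
        word_df.update(vocab)
        pair_df.update(frozenset(p) for p in combinations(sorted(vocab), 2))
    scores = {}
    for pair, df in pair_df.items():
        x, y = tuple(pair)
        scores[pair] = pointwise_kl(df / n_docs,
                                    word_df[x] / n_docs,
                                    word_df[y] / n_docs)
    return scores

docs = [["stock", "market", "rises"],
        ["stock", "market", "falls"],
        ["weather", "rain"]]
ranked = sorted(association_scores(docs).items(), key=lambda kv: -kv[1])
print(ranked[:3])  # most significant association terms first
```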
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014
Hung-Shin Lee; Yu Tso; Yun-Fan Chang; Hsin-Min Wang; Shyh-Kang Jeng
In this paper, we study the use of two kinds of kernel-based discriminative models, namely the support vector machine (SVM) and the deep neural network (DNN), for speaker verification. We treat verification as a binary classification problem in which a pair of utterances, each represented by an i-vector, is assumed to belong to either the "within-speaker" group or the "between-speaker" group. To solve the problem, we apply various binary operations that retain the basic relationship between any pair of i-vectors, forming a single vector for training the discriminative models. This study also investigates how the achievable performance correlates with the number of training pairs and with various combinations of the basic binary operations for both the SVM and DNN binary classifiers. The experiments are conducted on the male portion of the core task in the NIST 2005 Speaker Recognition Evaluation (SRE), and the results are competitive with, or even better than, those of other non-probabilistic models, such as conventional speaker SVMs and LDA-based cosine distance scoring, in terms of the minimum decision cost function (minDCF) and the equal error rate (EER).
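The pairing scheme can be sketched as follows: combine two i-vectors with a few elementwise binary operations and train a binary classifier on the resulting vectors. The specific operations, the toy data, and the use of scikit-learn's SVC are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def pair_features(x, y):
    """Combine two i-vectors into one vector with simple elementwise
    binary operations (product, absolute difference, sum)."""
    return np.concatenate([x * y, np.abs(x - y), x + y])

# Toy data: 200-dim "i-vectors" for 20 pseudo-speakers, 10 utterances each.
dim, n_spk, n_utt = 200, 20, 10
centers = rng.normal(size=(n_spk, dim))
ivecs = centers[:, None, :] + 0.3 * rng.normal(size=(n_spk, n_utt, dim))

X, y = [], []
for s in range(n_spk):
    for i in range(n_utt - 1):              # within-speaker pairs -> label 1
        X.append(pair_features(ivecs[s, i], ivecs[s, i + 1])); y.append(1)
    t = (s + 1) % n_spk                     # between-speaker pairs -> label 0
    for i in range(n_utt):
        X.append(pair_features(ivecs[s, i], ivecs[t, i])); y.append(0)

clf = SVC(kernel="rbf").fit(np.array(X), np.array(y))
print("training accuracy:", clf.score(np.array(X), np.array(y)))
```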
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017
Hung-Shin Lee; Yu-Ding Lu; Chin-Cheng Hsu; Yu Tsao; Hsin-Min Wang; Shyh-Kang Jeng
This paper presents a learning and scoring framework based on neural networks for speaker verification. The framework employs an autoencoder as its primary structure, and three factors are jointly considered in the objective function for speaker discrimination. The first, relating to the sample reconstruction error, makes the structure essentially a generative model that helps learn the most salient and useful properties of the data. Operating on the middlemost hidden layer, the other two ensure that utterances spoken by the same speaker are mapped to similar identity codes in the speaker-discriminative subspace, while the dispersion of all identity codes is maximized to some extent to avoid over-concentration. Finally, the decision score for each utterance pair is simply the cosine similarity of their identity codes. Using utterances represented by i-vectors, the results of experiments conducted on the male portion of the core task in the NIST 2010 Speaker Recognition Evaluation (SRE) demonstrate the merits of our approach over the conventional PLDA method.
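A minimal sketch of the three-term objective, assuming a small PyTorch autoencoder: a reconstruction term, a term pulling same-speaker identity codes together, and a dispersion term that keeps the codes from collapsing. The architecture, loss weights, and batching are hypothetical, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IdentityAutoencoder(nn.Module):
    """Autoencoder whose middlemost hidden layer is the identity code."""
    def __init__(self, dim=400, code=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, code))
        self.dec = nn.Sequential(nn.Linear(code, 128), nn.Tanh(), nn.Linear(128, dim))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def joint_loss(model, x1, x2, alpha=1.0, beta=0.1):
    """x1, x2: batches of i-vectors from the same speakers, row-aligned."""
    z1, r1 = model(x1)
    z2, r2 = model(x2)
    recon = F.mse_loss(r1, x1) + F.mse_loss(r2, x2)       # reconstruction error
    same = (1 - F.cosine_similarity(z1, z2)).mean()       # same-speaker codes alike
    codes = torch.cat([z1, z2], dim=0)
    disperse = -codes.var(dim=0).mean()                   # avoid over-concentration
    return recon + alpha * same + beta * disperse

def score(model, xa, xb):
    """Decision score: cosine similarity of the two identity codes."""
    with torch.no_grad():
        za, _ = model(xa)
        zb, _ = model(xb)
    return F.cosine_similarity(za, zb)

model = IdentityAutoencoder()
x1, x2 = torch.randn(8, 400), torch.randn(8, 400)   # toy same-speaker pairs
loss = joint_loss(model, x1, x2)
loss.backward()
print(float(loss))
```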
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013
Hung-Shin Lee; Yu-Chin Shih; Hsin-Min Wang; Shyh-Kang Jeng
Phonotactics, which deals with permissible phone patterns and their frequencies of occurrence in a specific language, is acknowledged to be relevant to spoken language recognition (SLR). With the assistance of phone recognizers, each speech utterance can be decoded into an ordered sequence of phone vectors filled with likelihood scores contributed by all possible phone models. In this paper, we propose a novel approach that uncovers the concealed phonotactic structure in the phone-likelihood vectors through a form of multivariate time series analysis: dynamic linear models (DLM). Treating the generation of phone patterns in each utterance as a dynamic system, these models describe the relationship between adjacent vectors linearly and time-invariantly, and introduce unobserved states to capture the temporal coherence intrinsic to the system. Each utterance expressed by a DLM is further transformed into a fixed-dimensional linear subspace so that well-developed distance measures between two subspaces can be applied to linear discriminant analysis (LDA) in a dissimilarity-based fashion. The results of SLR experiments on the OGI-TS corpus demonstrate that the proposed framework outperforms the well-known vector space modeling (VSM)-based methods and achieves performance comparable to our previous subspace-based method.
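A loose sketch of the subspace idea: fit a linear transition matrix to an utterance's sequence of phone-likelihood vectors, span a fixed-dimensional subspace from its powers, and compare utterances by principal angles. The plain least-squares estimator and the sine-based distance below are stand-ins for the paper's actual DLM estimation, which involves unobserved states.

```python
import numpy as np
from scipy.linalg import orth, subspace_angles

def utterance_subspace(frames: np.ndarray, order: int = 3) -> np.ndarray:
    """frames: (T, D) sequence of phone-likelihood vectors.
    Fit a first-order linear model x_{t+1} ~= A x_t by least squares,
    then span a subspace with the stacked matrix [I; A; A^2; ...]."""
    X, Y = frames[:-1].T, frames[1:].T          # each D x (T-1)
    A = Y @ np.linalg.pinv(X)                   # least-squares transition matrix
    blocks = [np.eye(A.shape[0])]
    for _ in range(order - 1):
        blocks.append(blocks[-1] @ A)
    return orth(np.vstack(blocks))              # orthonormal basis of the column space

def subspace_distance(U: np.ndarray, V: np.ndarray) -> float:
    """Dissimilarity between two utterance subspaces via principal angles."""
    return float(np.linalg.norm(np.sin(subspace_angles(U, V))))

rng = np.random.default_rng(1)
U = utterance_subspace(rng.normal(size=(50, 10)))   # utterance 1: 50 frames, 10 phones
V = utterance_subspace(rng.normal(size=(60, 10)))   # utterance 2: different length
print(subspace_distance(U, V))
```

Note that utterances of different lengths map to subspaces of the same ambient dimension, which is what allows the dissimilarity-based LDA to operate on fixed-size representations.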
Conference of the International Speech Communication Association (INTERSPEECH) | 2016
Hung-Shin Lee; Yu Tsao; Chi-Chun Lee; Hsin-Min Wang; Wei-Cheng Lin; Wei-Chen Chen; Shan-Wen Hsiao; Shyh-Kang Jeng
To estimate the degree of sincerity conveyed by a speech utterance and perceived by listeners, we propose an instance-based learning framework with shallow neural networks. The framework acts not only as a regressor that fits the predicted value to the actual value but also as a ranker that preserves the relative target magnitude between each pair of utterances, in an attempt to obtain a higher Spearman's rank correlation coefficient. In addition to describing how to simultaneously minimize the regression and ranking losses, we address how utterance pairs are formed in the training and evaluation phases with two realizations: an intuitive one based on random sampling, and another that seeks representative utterances, named anchors, to form non-stochastic pairs. Our system outperforms the baseline by more than 25% relative improvement on the development set.
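The joint regression-and-ranking objective can be sketched as below, assuming a PyTorch setup; the hinge-style pairwise ranking term and the weighting `lam` are hypothetical choices, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def regression_ranking_loss(pred, target, lam=0.5):
    """MSE regression loss plus a pairwise ranking loss that penalizes
    pairs whose predicted order disagrees with the target order."""
    mse = F.mse_loss(pred, target)
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # all predicted differences
    dt = target.unsqueeze(0) - target.unsqueeze(1)  # all target differences
    # Hinge: the predicted difference should share the sign of the target one.
    rank = F.relu(-dp * torch.sign(dt)).mean()
    return mse + lam * rank

pred = torch.randn(16, requires_grad=True)   # predicted sincerity scores (toy)
target = torch.rand(16)                      # actual scores in [0, 1] (toy)
loss = regression_ranking_loss(pred, target)
loss.backward()
print(float(loss))
```

Minimizing the ranking term directly encourages order preservation across pairs, which is what raises Spearman's rank correlation beyond what the regression term alone achieves.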
International Society for Music Information Retrieval Conference (ISMIR) | 2011
Ju-Chiang Wang; Hung-Shin Lee; Hsin-Min Wang; Shyh-Kang Jeng
Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing | 2013
Kuan Yu Chen; Hung-Shin Lee; Chung-Han Lee; Hsin-Min Wang; Hsin-Hsi Chen
Conference of the International Speech Communication Association (INTERSPEECH) | 2014
Hung-Shin Lee; Yu Tsao; Hsin-Min Wang; Shyh-Kang Jeng
Conference of the International Speech Communication Association (INTERSPEECH) | 2014
How Jing; Ting-yao Hu; Hung-Shin Lee; Wei-Chen Chen; Chi-Chun Lee; Yu Tsao; Hsin-Min Wang
Conference of the International Speech Communication Association (INTERSPEECH) | 2012
Yu-Chin Shih; Hung-Shin Lee; Hsin-Min Wang; Shyh-Kang Jeng