Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hung-yi Lee is active.

Publication


Featured research published by Hung-yi Lee.


Conference of the International Speech Communication Association | 2016

Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder.

Yu-An Chung; Chao-Chung Wu; Chia-Hao Shen; Hung-yi Lee; Lin-Shan Lee

The fixed-dimensional vector representations for words (in text) offered by Word2Vec have been shown to be very useful in many application scenarios, in particular because of the semantic information they carry. This paper proposes a parallel version, Audio Word2Vec, which offers fixed-dimensional vector representations for variable-length audio segments. These vector representations are shown to describe the sequential phonetic structures of the audio segments to a good degree, with very attractive real-world applications such as query-by-example Spoken Term Detection (STD). In this STD application, the proposed approach significantly outperformed conventional Dynamic Time Warping (DTW) based approaches at significantly lower computation requirements. We propose unsupervised learning of Audio Word2Vec from audio data without human annotation using a Sequence-to-sequence Autoencoder (SA). The SA consists of two RNNs equipped with Long Short-Term Memory (LSTM) units: the first RNN (encoder) maps the input audio sequence into a vector representation of fixed dimensionality, and the second RNN (decoder) maps the representation back to the input audio sequence. The two RNNs are jointly trained by minimizing the reconstruction error. A Denoising Sequence-to-sequence Autoencoder (DSA) is further proposed, offering more robust learning.
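As a rough illustration of the encoder-decoder structure described in the abstract (not the authors' implementation), the following PyTorch sketch assumes 39-dimensional MFCC frames as input and a hypothetical 128-dimensional segment embedding:

```python
# Minimal sketch of a sequence-to-sequence autoencoder for audio segments
# (illustrative only; feature choice and dimensions are assumptions).
import torch
import torch.nn as nn

class SeqAutoencoder(nn.Module):
    def __init__(self, feat_dim=39, embed_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, embed_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, feat_dim)

    def forward(self, x):                     # x: (batch, time, feat_dim)
        _, (h, _) = self.encoder(x)           # h: (1, batch, embed_dim)
        z = h[-1]                             # fixed-length segment embedding
        # Repeat the embedding at every time step and decode back to features.
        z_seq = z.unsqueeze(1).expand(-1, x.size(1), -1)
        y, _ = self.decoder(z_seq)
        return self.out(y), z

model = SeqAutoencoder()
x = torch.randn(8, 50, 39)                    # 8 segments, 50 frames of MFCCs
recon, embedding = model(x)
loss = nn.functional.mse_loss(recon, x)       # reconstruction error to minimize
```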


IEEE Transactions on Audio, Speech, and Language Processing | 2015

Spoken content retrieval: beyond cascading speech recognition with text retrieval

Lin-Shan Lee; James R. Glass; Hung-yi Lee; Chun-an Chan

Spoken content retrieval refers to directly indexing and retrieving spoken content based on the audio rather than text descriptions. This potentially eliminates the requirement of producing text descriptions for multimedia content for indexing and retrieval purposes, and makes it possible to precisely locate the exact time at which the desired information appears in the multimedia. Spoken content retrieval has been very successfully achieved with the basic approach of cascading automatic speech recognition (ASR) with text information retrieval: after the spoken content is transcribed into text or lattice format, a text retrieval engine searches over the ASR output to find the desired information. This framework works well when the ASR accuracy is relatively high, but becomes less adequate in more challenging real-world scenarios, since retrieval performance depends heavily on ASR accuracy. This challenge has led to the emergence of another approach to spoken content retrieval: going beyond the basic framework of cascading ASR with text retrieval in order to achieve retrieval performance that is less dependent on ASR accuracy. This article is intended to provide a thorough overview of the concepts, principles, approaches, and achievements of major technical contributions along this line of investigation. This includes five major directions: 1) Modified ASR for Retrieval Purposes: cascading ASR with text retrieval, but with the ASR modified or optimized for spoken content retrieval purposes; 2) Exploiting Information Not Present in ASR Outputs: utilizing the information in speech signals that is inevitably lost when they are transcribed into phonemes and words; 3) Directly Matching at the Acoustic Level without ASR: for spoken queries, the signals can be directly matched at the acoustic level rather than at the phoneme or word levels, bypassing all ASR issues; 4) Semantic Retrieval of Spoken Content: retrieving spoken content that is semantically related to the query but does not necessarily include the query terms themselves; 5) Interactive Retrieval and Efficient Presentation of the Retrieved Objects: with efficient presentation of the retrieved objects, an interactive retrieval process incorporating user actions may produce better retrieval results and user experiences.


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Enhanced Spoken Term Detection Using Support Vector Machines and Weighted Pseudo Examples

Hung-yi Lee; Lin-Shan Lee

Spoken term detection (STD) is a key technology for retrieval of spoken content, which will be very important for retrieving and browsing multimedia content over the Internet. The discriminative capability of machine learning methods has recently been used to facilitate STD. This paper presents a new approach that improves STD using support vector machines (SVM) based on acoustic information. The concept of pseudo-relevance feedback (PRF), widely used in text, image and video retrieval, is used here. The basic idea of using PRF here is to assume that some spoken segments in the first-pass retrieved results are relevant (or pseudo-relevant) and some others irrelevant (or pseudo-irrelevant), and to take these segments as positive and negative examples to train a query-specific SVM. This SVM is then used to re-rank the first-pass retrieved results, and only the re-ranked results are shown to the user. In this paper, feature vectors representing the spoken segments based on acoustic information to be used in the SVM are considered and analyzed. Furthermore, in conventional PRF the items with the highest and lowest scores in the first-pass retrieved results are taken as pseudo-relevant and pseudo-irrelevant respectively, but in this way some incorrect examples are inevitably included in the training data, especially when the recognition accuracy is poor. Here we further propose an enhanced SVM which not only better selects positive/negative examples by considering the reliability of the spoken segments, but also places more emphasis on more reliable training examples by modifying the SVM formulation. Experiments on two different sets of spoken archives with different speaking styles and different levels of recognition accuracy demonstrated significant improvements offered by the proposed approaches.
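The PRF idea sketched above can be illustrated with a toy re-ranking routine. The following sketch is not the paper's code; the feature layout, the assumption that first-pass scores lie in [0, 1], and the reliability weighting are simplifications introduced here:

```python
# Illustrative sketch of PRF-based SVM re-ranking for spoken term detection.
import numpy as np
from sklearn.svm import SVC

def rerank_with_prf(features, first_pass_scores, n_pos=10, n_neg=10):
    """features: (n_segments, d) acoustic feature vectors of retrieved segments.
    first_pass_scores: first-pass relevance scores, assumed normalized to [0, 1]."""
    order = np.argsort(first_pass_scores)[::-1]
    pos, neg = order[:n_pos], order[-n_neg:]          # pseudo-relevant / -irrelevant
    X = np.vstack([features[pos], features[neg]])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    # Weight examples by first-pass score reliability (higher score = more reliable
    # positive, lower score = more reliable negative), echoing the weighted-example idea.
    w = np.concatenate([first_pass_scores[pos],
                        1.0 - first_pass_scores[neg]])
    svm = SVC(kernel="rbf", probability=True)
    svm.fit(X, y, sample_weight=w)
    new_scores = svm.predict_proba(features)[:, 1]    # re-score all segments
    return np.argsort(new_scores)[::-1]               # re-ranked segment indices
```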


International Conference on Acoustics, Speech, and Signal Processing | 2011

Improved spoken term detection with graph-based re-ranking in feature space

Yun-Nung Chen; Chia-ping Chen; Hung-yi Lee; Chun-an Chan; Lin-Shan Lee

This paper presents a graph-based approach for spoken term detection. Each first-pass retrieved utterance is a node on a graph, and the edge between two nodes is weighted by the similarity between the two utterances evaluated in feature space. The score of each node is then modified by contributions from its neighbors via random walk or a modified version of it, because an utterance similar to many other utterances with high scores should itself be given a higher relevance score. In this way the global similarity structure of all first-pass retrieved utterances can be jointly considered. Experimental results show that this new approach offers significantly better performance than the previously proposed pseudo-relevance feedback approach, which considers primarily the local similarity relationships between first-pass retrieved utterances, and that these two different approaches can be cascaded to provide even better results.
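A minimal sketch of the random-walk score propagation described above, assuming a precomputed pairwise similarity matrix with positive entries and an arbitrarily chosen interpolation weight (an illustration, not the paper's exact formulation):

```python
# Sketch of random-walk score propagation over first-pass retrieved utterances.
import numpy as np

def rerank_random_walk(similarity, first_pass_scores, alpha=0.9, n_iter=50):
    """similarity: (n, n) pairwise utterance similarities in feature space.
    first_pass_scores: first-pass relevance scores for the n utterances."""
    # Row-normalize similarities into transition probabilities.
    P = similarity / similarity.sum(axis=1, keepdims=True)
    s = first_pass_scores / first_pass_scores.sum()
    scores = s.copy()
    for _ in range(n_iter):
        # Each utterance keeps part of its own first-pass score and receives
        # the rest from its neighbors, weighted by similarity.
        scores = (1 - alpha) * s + alpha * (P.T @ scores)
    return np.argsort(scores)[::-1]                   # re-ranked utterance indices
```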


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Interactive Spoken Document Retrieval With Suggested Key Terms Ranked by a Markov Decision Process

Yi-cheng Pan; Hung-yi Lee; Lin-Shan Lee

Interaction with users is a powerful strategy that potentially yields better information retrieval for all types of media, including text, images, and videos. While spoken document retrieval (SDR) is a crucial technology for multimedia access in the network era, it is also more challenging than text information retrieval because of the inevitable recognition errors. It is therefore reasonable to consider interactive functionalities for SDR systems. We propose an interactive SDR approach in which, given the user's query, the system returns not only the retrieval results but also a short list of key terms describing distinct topics. The user selects from these key terms to expand the query if the retrieval results are not satisfactory. The entire retrieval process is organized around a hierarchy of key terms that defines the allowable state transitions; this is modeled by a Markov decision process, which is popularly used in spoken dialogue systems. By reinforcement learning with simulated users, the key terms on the short list are properly ranked such that the retrieval success rate is maximized while the number of interactive steps is minimized. Significant improvements over existing approaches were observed in preliminary experiments performed on information needs provided by real users. A prototype system was also implemented.
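A toy illustration of learning a key-term suggestion policy with reinforcement learning and a simulated user; the state encoding, reward values, and user model below are hypothetical simplifications, not the paper's MDP:

```python
# Toy Q-learning sketch for ranking key-term suggestions with a simulated user.
import random
from collections import defaultdict

key_terms = ["speech recognition", "lattice", "retrieval", "summarization"]
Q = defaultdict(float)                 # Q[(state, action)] -> expected return

def simulate_episode(target_term, epsilon=0.2, alpha=0.1, gamma=0.9):
    state = ()                         # key terms suggested so far
    for _ in range(len(key_terms)):
        remaining = [t for t in key_terms if t not in state]
        if random.random() < epsilon:
            action = random.choice(remaining)
        else:
            action = max(remaining, key=lambda a: Q[(state, a)])
        # Simulated user accepts the suggestion only if it matches the
        # information need; reward trades retrieval success against extra steps.
        done = (action == target_term)
        reward = 1.0 if done else -0.1
        next_state = state + (action,)
        next_best = 0.0 if done else max(
            (Q[(next_state, a)] for a in key_terms if a not in next_state), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * next_best - Q[(state, action)])
        if done:
            break
        state = next_state

for _ in range(2000):                  # train against randomly drawn simulated needs
    simulate_episode(random.choice(key_terms))
```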


IEEE Automatic Speech Recognition and Understanding Workshop | 2011

Improved spoken term detection using support vector machines with acoustic and context features from pseudo-relevance feedback

Tsung-wei Tu; Hung-yi Lee; Lin-Shan Lee

This paper reports a new approach to improving spoken term detection that uses support vector machines (SVMs) with acoustic and linguistic features. As SVMs are good at discriminating between different classes of features in vector space, we recently proposed using pseudo-relevance feedback to automatically generate training data for SVM training, and using the SVM to re-rank the first-pass results considering context consistency in the lattices. In this paper, we further extend this concept by considering acoustic features at the word, phone and HMM state levels, and linguistic features of different orders. Extensive experiments under various recognition environments demonstrate significant improvements in all cases. In particular, the acoustic features at the HMM state level offered the most significant improvements, and the improvements achieved by acoustic and linguistic features are shown to be additive.


International Conference on Acoustics, Speech, and Signal Processing | 2012

Utterance-level latent topic transition modeling for spoken documents and its application in automatic summarization

Hung-yi Lee; Yun-Nung Chen; Lin-Shan Lee

In this paper, we propose to use an utterance-level latent topic transition model to estimate the latent topics behind the utterances, and test the performance of such a model in extractive speech summarization. In this model, the latent topic weights behind an utterance are estimated, and these topic weights evolve from one utterance to the next in a spoken document based on a topic transition function represented by a matrix. We explore different ways of obtaining the topic transition matrices used in the model, and find that using a set of matrices estimated from utterances clustered from a training spoken document set is very useful. In preliminary experiments on speech summarization, this model was shown to offer extra performance improvement when used with the popularly used Probabilistic Latent Semantic Analysis (PLSA).
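A minimal sketch of the utterance-level topic-weight evolution described above, assuming a single row-stochastic transition matrix and uniform initial topic weights (both are assumptions made for illustration):

```python
# Sketch of topic weights evolving from one utterance to the next via a
# transition matrix (illustrative values only).
import numpy as np

K = 4                                          # number of latent topics (assumed)
A = np.random.dirichlet(np.ones(K), size=K)    # transition matrix, rows sum to 1
theta = np.full(K, 1.0 / K)                    # topic weights behind the first utterance

topic_trajectory = [theta]
for _ in range(10):                            # propagate weights utterance by utterance
    theta = A.T @ theta                        # next utterance's topic weights
    theta = theta / theta.sum()                # keep a proper distribution
    topic_trajectory.append(theta)
```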


International Conference on Acoustics, Speech, and Signal Processing | 2009

Improved lattice-based spoken document retrieval by directly learning from the evaluation measures

Chao-hong Meng; Hung-yi Lee; Lin-Shan Lee

Lattice-based approaches have been widely used in spoken document retrieval to handle speech recognition uncertainty and errors; Position Specific Posterior Lattices (PSPL) and Confusion Networks (CN) are good examples. It is therefore interesting to derive improved models for spoken document retrieval by properly integrating different versions of lattice-based approaches in order to achieve better performance. In this paper we borrow the framework of 'learning to rank' from text document retrieval and integrate it into the scenario of lattice-based spoken document retrieval. Two approaches are considered here: AdaRank and SVM-map. With these approaches, we are able to learn and derive improved models using different versions of PSPL/CN. Preliminary experiments with broadcast news in Mandarin Chinese showed significant improvements.
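Neither AdaRank nor SVM-map is reproduced here; as a simpler stand-in for the learning-to-rank idea, the following sketch trains a pairwise ranking SVM over hypothetical per-document features drawn from different PSPL/CN variants:

```python
# RankSVM-style sketch of learning to combine lattice-derived retrieval scores
# (a stand-in for AdaRank / SVM-map; the feature layout is an assumption).
import numpy as np
from sklearn.svm import LinearSVC

def train_ranker(features, relevance):
    """features: (n_docs, d) scores from different PSPL/CN variants per document.
    relevance: binary relevance labels for one training query (both classes present)."""
    pos = features[relevance == 1]
    neg = features[relevance == 0]
    # Pairwise differences: a relevant document should outscore an irrelevant one.
    diffs = np.array([p - n for p in pos for n in neg])
    X = np.vstack([diffs, -diffs])
    y = np.concatenate([np.ones(len(diffs)), -np.ones(len(diffs))])
    return LinearSVC().fit(X, y)

# At retrieval time, documents are ranked by the learned linear combination of
# lattice-based scores: ranker.decision_function(features).
```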


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Spoken knowledge organization by semantic structuring and a prototype course lecture system for personalized learning

Hung-yi Lee; Sz-Rung Shiang; Ching-feng Yeh; Yun-Nung Chen; Yu Huang; Sheng-yi Kong; Lin-Shan Lee

It takes a very long time to go through a complete online course. Without proper background, it is also difficult to understand retrieved spoken paragraphs. This paper therefore presents a new approach to spoken knowledge organization for course lectures for efficient personalized learning. Automatically extracted key terms are taken as the fundamental elements of the semantics of the course. A key term graph, constructed by connecting related key terms, forms the backbone of the global semantic structure. Audio/video signals are divided into a multi-layer temporal structure including paragraphs, sections and chapters, each of which includes a summary as the local semantic structure. The interconnection between the semantic structure and the temporal structure, together with spoken term detection, jointly offers learners efficient ways to navigate the course knowledge with personalized learning paths considering their personal interests, available time and background knowledge. A preliminary prototype system has also been successfully developed.


International Conference on Acoustics, Speech, and Signal Processing | 2011

Improved spoken term detection using support vector machines based on lattice context consistency

Hung-yi Lee; Tsung-wei Tu; Chia-ping Chen; Chao-Yu Huang; Lin-Shan Lee

We propose an improved spoken term detection approach that uses support vector machines trained with lattice context consistency. The basic idea is that occurrences of the same term usually have similar contexts, while quite different contexts usually imply that the terms are different. A support vector machine can be trained using query context feature vectors obtained from the lattice to estimate better scores for ranking, and significant improvements can be obtained. This process can be performed iteratively and integrated with the pseudo-relevance feedback in acoustic feature space proposed previously, both offering further improvements.

Collaboration


Dive into Hung-yi Lee's collaborations.

Top Co-Authors

Lin-Shan Lee (National Taiwan University)
Tsung-Hsien Wen (National Taiwan University)
Cheng-Tao Chung (National Taiwan University)
Chia-ping Chen (National Taiwan University)
Ching-feng Yeh (National Taiwan University)
Yun-Nung Chen (Carnegie Mellon University)
Bo-Hsiang Tseng (National Taiwan University)
Sheng-syun Shen (National Taiwan University)
Sz-Rung Shiang (National Taiwan University)
Yuan-ming Liou (National Taiwan University)