Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Winston H. Hsu is active.

Publication


Featured research published by Winston H. Hsu.


IEEE MultiMedia | 2006

Large-scale concept ontology for multimedia

Milind R. Naphade; John R. Smith; Jelena Tesic; Shih-Fu Chang; Winston H. Hsu; Lyndon Kennedy; Alexander G. Hauptmann; Jon Curtis

As increasingly powerful techniques emerge for machine tagging multimedia content, it becomes ever more important to standardize the underlying vocabularies. Doing so provides interoperability and lets the multimedia community focus ongoing research on a well-defined set of semantics. This paper describes a collaborative effort of multimedia researchers, library scientists, and end users to develop a large standardized taxonomy for describing broadcast news video. The large-scale concept ontology for multimedia (LSCOM) is the first of its kind designed to simultaneously optimize utility to facilitate end-user access, cover a large semantic space, make automated extraction feasible, and increase observability in diverse broadcast news video data sets.


ACM Multimedia | 2007

Video search reranking through random walk over document-level context graph

Winston H. Hsu; Lyndon Kennedy; Shih-Fu Chang

Multimedia search over distributed sources often results in recurrent images or videos that are manifested beyond the textual modality. To exploit such contextual patterns while keeping the simplicity of keyword-based search, we propose novel reranking methods that leverage the recurrent patterns to improve the initial text search results. The approach, context reranking, is formulated as a random walk problem along the context graph, where video stories are nodes and the edges between them are weighted by multimodal contextual similarities. The random walk is biased with a preference toward stories with higher initial text search scores - a principled way to consider both initial text search results and their implicit contextual relationships. When evaluated on the TRECVID 2005 video benchmark, the proposed approach improves retrieval by up to 32% on average relative to the baseline text search method in terms of story-level Mean Average Precision. For people-related queries, which usually have recurrent coverage across news sources, the relative improvement reaches up to 40%. Most importantly, the proposed method requires no additional input from users (e.g., example images) and no complex search models for special queries (e.g., named-person search).
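The biased random walk described in the abstract can be sketched as a personalized-PageRank-style iteration. The similarity matrix, damping factor, and toy data below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def context_rerank(similarity, text_scores, alpha=0.85, iters=200, tol=1e-10):
    """Biased random walk over a context graph: nodes are video stories,
    edges are weighted by multimodal contextual similarity, and the walk
    is biased toward stories with high initial text search scores.
    (Sketch only; alpha and the row normalization are assumptions.)"""
    W = similarity / similarity.sum(axis=1, keepdims=True)  # transition matrix
    v = text_scores / text_scores.sum()                     # preference vector
    p = v.copy()
    for _ in range(iters):
        p_next = alpha * (W.T @ p) + (1 - alpha) * v
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p  # stationary scores used to rerank the stories

# Toy graph: stories 0 and 1 are contextually linked; 2 stands alone.
sim = np.array([[0.1, 0.8, 0.1],
                [0.8, 0.1, 0.1],
                [0.1, 0.1, 0.8]])
text = np.array([0.5, 0.1, 0.4])
reranked = context_rerank(sim, text)
```

In this toy run, story 1 gains score from its strong contextual link to the highly ranked story 0, which is exactly the recurrent-pattern effect the reranking exploits.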


ACM Multimedia | 2006

Video search reranking via information bottleneck principle

Winston H. Hsu; Lyndon Kennedy; Shih-Fu Chang

We propose a novel and generic video/image reranking algorithm, IB reranking, which reorders results from text-only searches by discovering the salient visual patterns of relevant and irrelevant shots from the approximate relevance provided by text results. The IB reranking method, based on a rigorous Information Bottleneck (IB) principle, finds the optimal clustering of images that preserves the maximal mutual information between the search relevance and the high-dimensional low-level visual features of the images in the text search results. Evaluating the approach on the TRECVID 2003-2005 data sets shows significant improvement over the text search baseline, with relative increases in average performance of up to 23%. The method requires no image search examples from the user, yet is competitive with other state-of-the-art example-based approaches. The method is also highly generic and performs comparably with sophisticated models that are highly tuned for specific classes of queries, such as named persons. Our experimental analysis also confirms that the proposed reranking method works well when sufficient recurrent visual patterns exist in the search results, as is often the case in multi-source news videos.


ACM Multimedia | 2004

Story boundary detection in large broadcast news video archives: techniques, experience and trends

Tat-Seng Chua; Shih-Fu Chang; Lekha Chaisorn; Winston H. Hsu

The segmentation of news video into story units is an important step towards effective processing and management of large news video archives. In the story segmentation task in TRECVID 2003, a wide variety of techniques were employed by many research groups to segment over 120 hours of news video. The techniques employed range from simple anchor-person detectors to sophisticated machine learning models based on HMM and Maximum Entropy (ME) approaches. The general results indicate that the judicious use of multi-modality features coupled with rigorous machine learning models can produce effective solutions. This paper presents the algorithms and experience learned in TRECVID evaluations. It also points the way towards the development of scalable technology to process large news video corpora.


ACM Multimedia | 2008

ContextSeer: context search and recommendation at query time for shared consumer photos

Yi-Hsuan Yang; Po Tun Wu; Ching Wei Lee; Kuan Hung Lin; Winston H. Hsu; Homer H. Chen

The advent of media-sharing sites like Flickr has drastically increased the volume of community-contributed multimedia resources on the web. However, due to their magnitude, these collections are increasingly difficult to understand, search, and navigate. To tackle these issues, a novel search system, ContextSeer, is developed to improve search quality (by reranking) and recommend supplementary information (i.e., search-related tags and canonical images) by leveraging rich context cues, including the visual content, high-level concept scores, and time and location metadata. First, we propose an ordinal reranking algorithm to enhance the semantic coherence of text-based search results by mining contextual patterns in an unsupervised fashion. A novel feature selection method, wc-tf-idf, is also developed to select informative context cues. Second, to represent the diversity of search results, we propose an efficient algorithm, cannoG, to select multiple canonical images without clustering. Finally, ContextSeer enhances the search experience by further recommending relevant tags. Besides being effective and unsupervised, the proposed methods are efficient and run at query time, which is vital for practical online applications. To evaluate ContextSeer, we collected 0.5 million consumer photos from Flickr and manually annotated a number of queries by pooling to form a new benchmark, Flickr550. Ordinal reranking achieves significant performance gains on both the Flickr550 and TRECVID search benchmarks. In a subjective test, cannoG demonstrates its effectiveness at recommending multiple representative canonical images.


International Conference on Image Processing | 2006

Topic Tracking Across Broadcast News Videos with Visual Duplicates and Semantic Concepts

Winston H. Hsu; Shih-Fu Chang

Videos from distributed sources (e.g., broadcasts, podcasts, blogs, etc.) have grown exponentially. Topic threading is very useful for organizing such large-volume information sources. Current solutions rely primarily on text features and encounter difficulty when text is noisy or unavailable. In this paper, we propose new representations and similarity measures for news videos based on low-level features, visual near-duplicates, and high-level semantic concepts automatically detected from videos. We develop a multi-modal fusion framework for estimating the relevance of a new story to a known topic. Our extensive experiments using the TRECVID 2005 data set (171 hours, 6 channels, 3 languages) confirm that near-duplicates consistently and significantly boost the tracking performance, by up to 25%. In addition, we present an information-theoretic analysis to assess the complexity of each semantic topic and determine the best subset of concepts for tracking each topic.
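The multi-modal relevance estimation can be illustrated with a simple late-fusion scheme. The modality names, equal default weights, and cosine similarity below are illustrative assumptions, not the paper's learned fusion model:

```python
import numpy as np

def story_topic_relevance(story, topic, weights=None):
    """Late fusion of per-modality similarities between a new story and a
    known topic. The modalities (text, near-duplicate, semantic concepts)
    mirror the abstract; cosine similarity and equal weights are
    illustrative assumptions."""
    modalities = ["text", "dup", "concept"]
    if weights is None:
        weights = {m: 1.0 / len(modalities) for m in modalities}

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sum(weights[m] * cosine(story[m], topic[m]) for m in modalities)

story = {"text": np.array([1.0, 0.0, 1.0]),
         "dup": np.array([0.9, 0.1]),
         "concept": np.array([0.2, 0.8, 0.0])}
rel = story_topic_relevance(story, story)  # identical vectors: relevance ~1.0
```

A real system would replace the cosine terms with modality-specific matchers (e.g., a near-duplicate detector score) and learn the weights from labeled topic/story pairs.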


Electronic Imaging | 2003

Discovery and fusion of salient multimodal features toward news story segmentation

Winston H. Hsu; Shih-Fu Chang; Chih-Wei Huang; Lyndon Kennedy; Ching-Yung Lin; Giridharan Iyengar

In this paper, we present our new results in news video story segmentation and classification in the context of the TRECVID video retrieval benchmarking event 2003. We applied and extended the Maximum Entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. We have included various features such as motion, face, music/speech types, prosody, and high-level text segmentation information. The statistical fusion model is used to automatically discover relevant features contributing to the detection of story boundaries. One novel aspect of our method is the use of a feature wrapper to address different types of features -- asynchronous, discrete, continuous, and delta ones. We also developed several novel features related to prosody. Using the large news video set from the TRECVID 2003 benchmark, we demonstrate satisfactory performance (F1 measures up to 0.76 on ABC news and 0.73 on CNN news), show how these multi-level, multi-modal features fit into the probabilistic framework, and, more importantly, observe an interesting opportunity for further improvement.
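For the binary boundary/non-boundary decision, a maximum entropy model over binary outcomes reduces to logistic regression on the fused features. The gradient-descent trainer and toy data below are a generic sketch, not the paper's feature wrapper or training setup:

```python
import numpy as np

def train_me_boundary(X, y, lr=0.5, epochs=2000):
    """Fit P(boundary | fused features) with a maximum-entropy (logistic)
    model via gradient descent on the log-loss. X: (n, d) features at
    candidate points, y: 0/1 boundary labels. (Generic sketch.)"""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # posterior P(boundary)
        w -= lr * X.T @ (p - y) / len(y)        # log-loss gradient step
        b -= lr * (p - y).mean()
    return w, b

# Toy data: feature 0 (e.g., a hypothetical anchor-face cue) signals a boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(float)
w, b = train_me_boundary(X, y)
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
```

The paper's actual model additionally handles asynchronous, discrete, continuous, and delta features through its feature wrapper, which this sketch omits.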


Conference on Image and Video Retrieval | 2005

Visual cue cluster construction via information bottleneck principle and kernel density estimation

Winston H. Hsu; Shih-Fu Chang

Recent research in video analysis has shown a promising direction, in which mid-level features (e.g., people, anchor, indoor) are abstracted from low-level features (e.g., color, texture, motion, etc.) and used for discriminative classification of semantic labels. However, in most systems, such mid-level features are selected manually. In this paper, we propose an information-theoretic framework, visual cue cluster construction (VC3), to automatically discover adequate mid-level features. The problem is posed as mutual information maximization, through which optimal cue clusters are discovered to preserve the highest information about the semantic labels. We extend the Information Bottleneck framework to high-dimensional continuous features and further propose a projection method to map each video into probabilistic memberships over all the cue clusters. The biggest advantage of the proposed approach is to remove the dependence on the manual process in choosing the mid-level features and the huge labor cost involved in annotating the training corpus for training the detector of each mid-level feature. The proposed VC3 framework is general and effective, leading to exciting potential in solving other problems of semantic video analysis. When tested in news video story segmentation, the proposed approach achieves promising performance gain over representations derived from conventional clustering techniques and even the mid-level features selected manually.
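The Information Bottleneck objective, compressing while preserving mutual information with the semantic labels, can be illustrated with a tiny agglomerative variant over a discrete joint count table. This greedy merge rule is a standard agglomerative-IB sketch; the paper's actual contribution extends IB to high-dimensional continuous features, which is not shown here:

```python
import numpy as np

def mutual_info(joint):
    """I(C; Y) in nats from a cluster-by-label count table."""
    p = joint / joint.sum()
    outer = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / outer[mask])).sum())

def agglomerative_ib(joint, n_clusters):
    """Greedily merge the pair of clusters whose merge loses the least
    mutual information with the labels (toy agglomerative-IB sketch)."""
    rows = [joint[i].astype(float) for i in range(len(joint))]
    while len(rows) > n_clusters:
        base = mutual_info(np.array(rows))
        best_loss, best_pair = None, None
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                trial = [r for k, r in enumerate(rows) if k not in (i, j)]
                trial.append(rows[i] + rows[j])
                loss = base - mutual_info(np.array(trial))
                if best_loss is None or loss < best_loss:
                    best_loss, best_pair = loss, (i, j)
        i, j = best_pair
        merged = rows[i] + rows[j]
        rows = [r for k, r in enumerate(rows) if k not in (i, j)] + [merged]
    return np.array(rows)

# Four visual cues with two underlying label profiles collapse to two clusters.
counts = np.array([[9, 1], [8, 2], [1, 9], [2, 8]])
clusters = agglomerative_ib(counts, 2)
```

Cues with similar label distributions are merged first, so the resulting cue clusters retain most of the information about the semantic labels while using fewer mid-level features.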


International Conference on Multimedia and Expo | 2004

Generative, discriminative, and ensemble learning on multi-modal perceptual fusion toward news video story segmentation

Winston H. Hsu; Shih-Fu Chang

News video story segmentation is a critical task for automatic video indexing and summarization. Our prior work has demonstrated promising performance using a generative model, called maximum entropy (ME), which models the posterior probability given the multi-modal perceptual features near the candidate points. In this paper, we investigate alternative statistical approaches based on discriminative models, i.e., support vector machines (SVM), and ensemble learning, i.e., boosting. In addition, we develop a novel approach, called BoostME, which uses the ME classifiers and the associated confidence scores in each boosting iteration. We evaluated these different methods using the TRECVID 2003 broadcast news data set. We found that both the SVM-based and ME-based techniques outperformed the pure boosting techniques, with the SVM-based solutions achieving slightly higher accuracy. Moreover, we summarize extensive analysis results of error sources over distinct news story types to identify future research opportunities.


International Conference on Acoustics, Speech, and Signal Processing | 2004

News video story segmentation using fusion of multi-level multi-modal features in TRECVID 2003

Winston H. Hsu; Lyndon Kennedy; Chih-Wei Huang; Shih-Fu Chang; Ching-Yung Lin; Giridharan Iyengar

We present our new results in news video story segmentation and classification in the context of the TRECVID video retrieval benchmarking event 2003. We applied and extended the maximum entropy statistical model to effectively fuse diverse features from multiple levels and modalities, including visual, audio, and text. We have included various features such as motion, face, music/speech types, prosody, and high-level text segmentation information. The statistical fusion model is used to automatically discover relevant features contributing to the detection of story boundaries. One novel aspect of our method is the use of a feature wrapper to address different types of features - asynchronous, discrete, continuous, and delta ones. We also developed several novel features related to prosody. Using the large news video set from the TRECVID 2003 benchmark, we demonstrate satisfactory performance (F1 measure up to 0.76) and, more importantly, observe an interesting opportunity for further improvement.

Collaboration


Dive into Winston H. Hsu's collaborations.

Top Co-Authors

Chih-Wei Huang, National Central University
Chih-Jen Lin, National Taiwan University
Ching Wei Lee, National Taiwan University
Homer H. Chen, National Taiwan University
Kuan Hung Lin, National Taiwan University
Ming-Fang Weng, National Taiwan University