
Publication


Featured research published by Ziheng Zhou.


Computer Vision and Pattern Recognition | 2011

Towards a practical lipreading system

Ziheng Zhou; Guoying Zhao; Matti Pietikäinen

A practical lipreading system can be considered either as subject-dependent (SD) or subject-independent (SI). An SD system is user-specific, i.e., customized for a particular user, while an SI system has to cope with a large number of users. These two types of systems pose different challenges and have to be treated differently. In this paper, we propose a simple deterministic model to tackle the problem. The model first seeks a low-dimensional manifold where visual features extracted from the frames of a video can be projected onto a continuous deterministic curve embedded in a path graph. Moreover, it can map arbitrary points on the curve back into the image space, making it suitable for temporal interpolation. Based on the model, we develop two separate strategies for SD and SI lipreading. The former is turned into a simple curve-matching problem, while for the latter we propose a video-normalization scheme to improve the system developed by Zhao et al. We evaluated our system on the OuluVS database and achieved recognition rates more than 20% higher than those reported by Zhao et al. in both SD and SI testing scenarios.
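For the subject-dependent case the paper reduces recognition to matching one-dimensional curves. Below is a minimal sketch of that idea, assuming each utterance has already been projected onto its curve and using a standard dynamic time warping alignment; the exact matching procedure in the paper may differ.

```python
import numpy as np

def dtw_distance(curve_a, curve_b):
    """Dynamic-time-warping distance between two 1-D curves
    (per-frame projections of two utterances)."""
    n, m = len(curve_a), len(curve_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(curve_a[i - 1] - curve_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def classify(test_curve, train_curves, train_labels):
    """Label a test utterance by its nearest training curve."""
    dists = [dtw_distance(test_curve, c) for c in train_curves]
    return train_labels[int(np.argmin(dists))]
```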


Image and Vision Computing | 2014

A review of recent advances in visual speech decoding

Ziheng Zhou; Guoying Zhao; Xiaopeng Hong; Matti Pietikäinen

Visual speech information plays an important role in automatic speech recognition (ASR), especially when audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains widely open. This paper provides a detailed review of recent advances in this research area. In comparison with the previous survey [97], which covers the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, there are three questions related to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question is about audio-visual speech fusion, considering the dynamic changes of modality reliabilities encountered in practice. In addition, the state of the art in facial landmark localization is briefly introduced in this paper. These advanced techniques can be used to improve region-of-interest detection, but have been largely ignored when building visual-based ASR systems. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into future research on visual speech decoding.
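On the fusion question, a common baseline is decision fusion with an adaptive stream weight. The sketch below is illustrative only; the weight-setting schemes surveyed in the paper are more elaborate, and the reliability measure used here is an assumption.

```python
import numpy as np

def fuse_scores(log_p_audio, log_p_video, audio_reliability):
    """Weighted decision fusion of audio and visual class scores.
    audio_reliability in [0, 1], e.g. derived from an estimated SNR:
    clean audio lets the audio stream dominate, corrupted audio
    shifts the weight towards the visual stream."""
    lam = float(np.clip(audio_reliability, 0.0, 1.0))
    fused = lam * np.asarray(log_p_audio) + (1.0 - lam) * np.asarray(log_p_video)
    return int(np.argmax(fused))   # index of the recognized class
```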


IEEE International Conference on Automatic Face and Gesture Recognition | 2015

OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis

Iryna Anina; Ziheng Zhou; Guoying Zhao; Matti Pietikäinen

Visual speech constitutes a large part of our non-rigid facial motion and contains important information that allows machines to interact with human users, for instance through automatic visual speech recognition (VSR) and speaker verification. One of the major obstacles to research on non-rigid mouth motion analysis is the absence of suitable databases. Those available for public research either lack a sufficient number of speakers or utterances or contain constrained viewpoints, which limits their representativeness and usefulness. This paper introduces a newly collected multi-view audiovisual database for non-rigid mouth motion analysis. It includes more than 50 speakers uttering three types of utterances and, more importantly, thousands of videos simultaneously recorded by six cameras from five different views spanning the range between the frontal and profile views. Moreover, a simple VSR system has been developed and tested on the database to provide some baseline performance.


IEEE Transactions on Image Processing | 2013

Video Texture Synthesis With Multi-Frame LBP-TOP and Diffeomorphic Growth Model

Yimo Guo; Guoying Zhao; Ziheng Zhou; Matti Pietikäinen

Video texture synthesis is the process of providing a continuous and infinitely varying stream of frames, which plays an important role in computer vision and graphics. However, it still remains a challenging problem to generate high-quality synthesis results. Considering the two key factors that affect the synthesis performance, frame representation and blending artifacts, we improve the synthesis performance from two aspects: 1) Effective frame representation is designed to capture both the image appearance information in spatial domain and the longitudinal information in temporal domain. 2) Artifacts that degrade the synthesis quality are significantly suppressed on the basis of a diffeomorphic growth model. The proposed video texture synthesis approach has two major stages: video stitching stage and transition smoothing stage. In the first stage, a video texture synthesis model is proposed to generate an infinite video flow. To find similar frames for stitching video clips, we present a new spatial-temporal descriptor to provide an effective representation for different types of dynamic textures. In the second stage, a smoothing method is proposed to improve synthesis quality, especially in the aspect of temporal continuity. It aims to establish a diffeomorphic growth model to emulate local dynamics around stitched frames. The proposed approach is thoroughly tested on public databases and videos from the Internet, and is evaluated in both qualitative and quantitative ways.
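In the stitching stage, candidate transition points are found by comparing frame descriptors. A rough sketch of that selection step follows, assuming per-frame descriptors (for instance multi-frame LBP-TOP histograms) have already been computed; the scoring and constraints here are illustrative, not the paper's exact criterion.

```python
import numpy as np

def transition_candidates(descriptors, top_k=5, min_gap=10):
    """Rank frame pairs (i, j) whose descriptors are most similar,
    so the synthesized stream can jump from frame i to frame j.
    descriptors: one feature vector per frame.
    min_gap avoids trivial jumps to immediate neighbours."""
    X = np.asarray(descriptors, dtype=float)
    # pairwise Euclidean distances between frame descriptors
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    pairs = [(d[i, j], i, j)
             for i in range(n) for j in range(n)
             if abs(i - j) >= min_gap]
    pairs.sort()
    return [(i, j) for _, i, j in pairs[:top_k]]
```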


IEEE Transactions on Circuits and Systems for Video Technology | 2012

An Image-Based Visual Speech Animation System

Ziheng Zhou; Guoying Zhao; Yimo Guo; Matti Pietikäinen

An image-based visual speech animation system is presented in this paper. A video model is proposed to preserve the video dynamics of a talking face. The model represents a video sequence by a low-dimensional continuous curve embedded in a path graph and establishes a map from the curve to the image domain. When selecting video segments for synthesis, we loosen the traditional requirement of using triphone as the unit to allow segments to contain longer natural talking motion. Dense videos are sampled from the segments, concatenated, and downsampled to train a video model that enables efficient time alignment and motion smoothing for the final video synthesis. Different viseme definitions are used to investigate the impact of visemes on the video realism of the animated talking face. The system is built on a public database and tested both objectively and subjectively.


International Conference on Pattern Recognition | 2010

Lipreading: A Graph Embedding Approach

Ziheng Zhou; Guoying Zhao; Matti Pietikäinen

In this paper, we propose a novel graph embedding method for the problem of lipreading. To characterize the temporal connections among video frames of the same utterance, a new distance metric is defined on a pair of frames and graphs are constructed to represent the video dynamics based on the distances between frames. Audio information is used to assist in calculating such distances. For each utterance, a subspace of the visual feature space is learned from a well-defined intrinsic and penalty graph within a graph-embedding framework. Video dynamics are found to be well preserved along some dimensions of the subspace. Discriminatory cues are then decoded from curves of the projected visual features to classify different utterances.
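The subspace is obtained with the standard linear graph-embedding machinery: an intrinsic and a penalty graph lead to a generalized eigenvalue problem. A minimal sketch, assuming the two affinity matrices have already been built from the audio-assisted frame distances; the regularization term is an added assumption.

```python
import numpy as np
from scipy.linalg import eigh

def graph_embedding(X, W_intrinsic, W_penalty, dim):
    """Linear graph embedding: find projections that keep
    intrinsic-graph neighbours close while separating penalty-graph pairs.
    X: (d, n) matrix of per-frame visual features (columns = frames)."""
    def laplacian(W):
        return np.diag(W.sum(axis=1)) - W
    L  = laplacian(W_intrinsic)   # intrinsic graph Laplacian
    Lp = laplacian(W_penalty)     # penalty graph Laplacian
    A = X @ L  @ X.T
    B = X @ Lp @ X.T
    # smallest generalized eigenvectors of A v = w B v span the subspace
    w, V = eigh(A, B + 1e-6 * np.eye(B.shape[0]))  # regularize B
    return V[:, :dim]             # (d, dim) projection matrix
```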


International Conference on Acoustics, Speech, and Signal Processing | 2017

Image denoising via group sparsity residual constraint

Zhiyuan Zha; Xin Liu; Ziheng Zhou; Xiaohua Huang; Jingang Shi; Zhenhong Shang; Lan Tang; Yechao Bai; Qiong Wang; Xinggan Zhang

Group sparsity has shown great potential in various low-level vision tasks (e.g., image denoising, deblurring and inpainting). In this paper, we propose a new prior model for image denoising via a group sparsity residual constraint (GSRC). To enhance the performance of group-sparsity-based image denoising, the concept of the group sparsity residual is introduced, and the problem of image denoising is thus translated into one of reducing this residual. To reduce the residual, we first obtain a good estimate of the group sparse coefficients of the original image from a first-pass estimation of the noisy image, and then centralize the group sparse coefficients of the noisy image towards this estimate. Experimental results demonstrate that the proposed method not only outperforms many state-of-the-art denoising methods such as BM3D and WNNM, but is also faster.
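The core step centralizes the noisy image's group sparse coefficients towards a first-pass estimate. A minimal sketch of one such update, assuming an L1 penalty on the group sparsity residual solved by soft thresholding; the exact optimization in the paper may differ.

```python
import numpy as np

def gsrc_update(noisy_codes, estimated_codes, tau):
    """One GSRC-style update: soft-threshold the group sparsity residual
    (noisy codes minus first-pass estimate) and add it back, shrinking
    the noisy coefficients towards the estimate.
    tau is the shrinkage threshold (depends on the noise level)."""
    residual = noisy_codes - estimated_codes
    shrunk = np.sign(residual) * np.maximum(np.abs(residual) - tau, 0.0)
    return estimated_codes + shrunk
```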


Asian Conference on Computer Vision | 2016

Concatenated Frame Image Based CNN for Visual Speech Recognition

Takeshi Saitoh; Ziheng Zhou; Guoying Zhao; Matti Pietikäinen

This paper proposes a novel sequence image representation method called the concatenated frame image (CFI), two types of data augmentation methods for CFI, and a framework of CFI-based convolutional neural networks (CNN) for the visual speech recognition (VSR) task. A CFI is simple, yet it contains the spatial-temporal information of a whole image sequence. The proposed method was evaluated on the public OuluVS2 database, a multi-view audio-visual dataset recorded from 52 subjects. Speaker-independent recognition tasks were carried out under various experimental conditions. As a result, the proposed method obtained high recognition accuracy.
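A CFI simply tiles the frames of an utterance into one large image so that an ordinary 2-D CNN sees the whole sequence at once. A minimal sketch of building such an image; the grid size and the frame subsampling rule are assumptions, not the paper's settings.

```python
import numpy as np

def concatenated_frame_image(frames, grid=(5, 5)):
    """Tile an image sequence into a single concatenated frame image.
    frames: sequence of (h, w) grayscale mouth images (numpy arrays)."""
    rows, cols = grid
    h, w = frames[0].shape
    # repeat or subsample frames so the sequence fills the grid exactly
    idx = np.linspace(0, len(frames) - 1, rows * cols).round().astype(int)
    cfi = np.zeros((rows * h, cols * w), dtype=frames[0].dtype)
    for n, i in enumerate(idx):
        r, c = divmod(n, cols)
        cfi[r * h:(r + 1) * h, c * w:(c + 1) * w] = frames[i]
    return cfi
```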


Indian Conference on Computer Vision, Graphics and Image Processing | 2010

Synthesizing a talking mouth

Ziheng Zhou; Guoying Zhao; Matti Pietikäinen

This paper presents a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from the recorded speech videos and then using the learned models to generate videos for novel utterances. A generative model considers the whole utterance contained in a video as a continuous process and represents it using a set of trigonometric functions embedded within a path graph. The transformation that projects the values of the functions to the image space is found through graph embedding. Such a model allows us to synthesize mouth images at arbitrary positions in the utterance. To synthesize a video for a novel utterance, the utterance is first compared with the existing ones from which we find the phoneme combinations that best approximate the utterance. Based on the learned models, dense videos are synthesized, concatenated and downsampled. A new generative model is then built on the remaining image samples for the final video synthesis.
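Reading "trigonometric functions embedded within a path graph" as the Laplacian eigenvectors of a path graph, those eigenvectors are sampled cosines (the DCT-II basis). A short numerical check of that fact, with the number of frames chosen arbitrarily:

```python
import numpy as np

N = 8                                # number of frames in the utterance
# unnormalized Laplacian of a path graph with N vertices
L = np.diag([1] + [2] * (N - 2) + [1]) \
    - (np.eye(N, k=1) + np.eye(N, k=-1))
vals, vecs = np.linalg.eigh(L)

# closed-form eigenvectors: y_k[i] = cos(pi * k * (2i + 1) / (2N))
k, i = 1, np.arange(N)
analytic = np.cos(np.pi * k * (2 * i + 1) / (2 * N))
analytic /= np.linalg.norm(analytic)
print(np.allclose(np.abs(vecs[:, 1]), np.abs(analytic)))   # prints True
```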


International Conference on Pattern Recognition | 2014

Facial 3D Shape Estimation from Images for Visual Speech Animation

Utpala Musti; Ziheng Zhou; Matti Pietikäinen

In this paper we describe the first version of our system for estimating 3D shape sequences from images of the frontal face. This approach is developed with 3D Visual Speech Animation (VSA) as the target application. In particular, the focus is on making use of an existing state-of-the-art image-based VSA system and on the subsequent on-line estimation of the corresponding 3D facial shape sequence from its output. This has the added advantage of 3D visual speech, mainly the ability to render the face under different poses and illumination conditions. The idea is based on detecting landmarks in the facial image, which are then used to determine the pose and shape. The method belongs to the category of methods that use a prior 3D Morphable Model (3D-MM) trained on 3D facial data. For the time being it is developed for a person-specific domain, i.e., the 3D-MM and the 2D facial landmark detector are trained using the data of a single person and tested with the same person-specific data.
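Once landmarks and camera pose are available, the 3D-MM shape coefficients can be recovered by linear least squares. A minimal sketch under a scaled-orthographic camera; in practice pose and shape are usually estimated alternately and the fit is regularized, and all names below are illustrative.

```python
import numpy as np

def fit_shape_coefficients(landmarks2d, mean_shape, basis, R, s, t):
    """Least-squares estimate of 3D-MM shape coefficients from 2D facial
    landmarks, assuming a known scaled-orthographic camera (R, s, t).
    landmarks2d: (k, 2); mean_shape: (k, 3); basis: (k, 3, m)."""
    P = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])      # orthographic projection
    proj = s * P @ R                     # (2, 3) camera matrix
    k, _, m = basis.shape
    A = np.zeros((2 * k, m))
    b = np.zeros(2 * k)
    for i in range(k):
        A[2 * i:2 * i + 2] = proj @ basis[i]                     # (2, m)
        b[2 * i:2 * i + 2] = landmarks2d[i] - proj @ mean_shape[i] - t
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha                         # shape coefficients
```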
