
Publication


Featured research published by Xiang.


ACM Multimedia | 2017

NormFace: L2 Hypersphere Embedding for Face Verification

Feng Wang; Xiang Xiang; Jian Cheng; Alan L. Yuille

Thanks to recent developments in Convolutional Neural Networks, the performance of face verification methods has increased rapidly. In a typical face verification method, feature normalization is a critical step for boosting performance. This motivates us to introduce and study the effect of normalization during training. We find this is non-trivial, despite normalization being differentiable. We identify and study four issues related to normalization through mathematical analysis, which yields understanding and helps with parameter settings. Based on this analysis, we propose two strategies for training using normalized features. The first is a modification of the softmax loss, which optimizes cosine similarity instead of the inner product. The second is a reformulation of metric learning by introducing an agent vector for each class. We show that both strategies, and small variants, consistently improve performance by between 0.2% and 0.4% on the LFW dataset based on two models. This is significant because the performance of the two models on the LFW dataset is close to saturation at over 98%.
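
As a rough illustration of the first strategy, the sketch below (a minimal PyTorch example, not the authors' released code; the scale value s and the dimensions are assumptions) computes a softmax cross-entropy loss over scaled cosine similarities between L2-normalized features and L2-normalized class weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSoftmax(nn.Module):
    """Softmax loss on L2-normalized features and class weights, so the
    logits are scaled cosine similarities rather than raw inner products."""
    def __init__(self, feat_dim, num_classes, s=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s = s  # scaling factor applied to the cosine logits (assumed value)

    def forward(self, features, labels):
        f = F.normalize(features, dim=1)      # unit-length features
        w = F.normalize(self.weight, dim=1)   # unit-length class weights
        logits = self.s * f @ w.t()           # scaled cosine similarities
        return F.cross_entropy(logits, labels)

# Usage sketch: loss = CosineSoftmax(512, 10000)(embeddings, identity_labels)
```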


arXiv: Computer Vision and Pattern Recognition | 2016

Pose-Selective Max Pooling for Measuring Similarity

Xiang Xiang; Trac D. Tran

In this paper, we deal with two challenges in measuring the similarity of subject identities in practical video-based face recognition: the variation of head pose in uncontrolled environments and the computational expense of processing videos. Since the frame-wise feature mean cannot characterize the pose diversity among frames, we define and preserve the overall pose diversity and closeness in a video; identity then becomes the only source of variation across videos, since pose varies even within a single video. Instead of simply using all the frames, we select those faces whose pose point is closest to the centroid of the K-means cluster containing that pose point. We then represent a video as a bag of frame-wise deep face features, with the number of features reduced from hundreds to K. Since this video representation captures identity well, we measure the subject similarity between two videos as the maximum correlation among all possible pairs in the two bags of features. On the official 5,000 video pairs of the YouTube Faces dataset for face verification, our algorithm achieves performance comparable to VGG-Face, which averages over the deep features of all frames. Other vision tasks can also benefit from the generic idea of employing geometric cues such as 3-D poses to improve the descriptiveness of deep features learned from appearance.
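
The frame-selection and matching step could look roughly like the sketch below (NumPy and scikit-learn; the value of K, the per-frame pose vectors, and the per-frame deep features are assumed inputs, and this is an illustration of the idea rather than the paper's implementation).

```python
import numpy as np
from sklearn.cluster import KMeans

def select_frames(poses, K=9):
    """Cluster frame poses with K-means and keep, for each cluster, the frame
    whose pose point is nearest to the centroid. Assumes len(poses) >= K."""
    km = KMeans(n_clusters=K, n_init=10).fit(poses)
    keep = []
    for k in range(K):
        idx = np.where(km.labels_ == k)[0]
        d = np.linalg.norm(poses[idx] - km.cluster_centers_[k], axis=1)
        keep.append(idx[np.argmin(d)])
    return np.array(keep)

def video_similarity(featsA, posesA, featsB, posesB, K=9):
    """Max correlation over all cross pairs of the two K-frame feature bags."""
    A = featsA[select_frames(posesA, K)]
    B = featsB[select_frames(posesB, K)]
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).max())
```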


International Conference on Acoustics, Speech, and Signal Processing | 2015

Hierarchical Sparse and Collaborative Low-Rank Representation for Emotion Recognition

Xiang Xiang; Minh Dao; Gregory D. Hager; Trac D. Tran

In this paper, we design a Collaborative-Hierarchical Sparse and Low-Rank (C-HiSLR) model that is natural for recognizing human emotion in visual data. Previous attempts require explicit expression components, which are often unavailable and difficult to recover. Instead, our model exploits the low-rank property to subtract neutral faces from expressive facial frames and performs sparse representation on the expression components with group sparsity enforced. On the CK+ dataset, C-HiSLR on raw expressive faces performs as competitively as Sparse Representation based Classification (SRC) applied to manually prepared emotions. Our C-HiSLR performs even better than SRC in terms of true positive rate.
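
A heavily simplified numerical sketch of the two ingredients, low-rank subtraction of the neutral face and group-sparse coding of the residual, is given below; the rank-r SVD shortcut and the group soft-thresholding step are illustrative stand-ins for the joint C-HiSLR optimization, not the model itself.

```python
import numpy as np

def split_neutral(X, rank=1):
    """Approximate the shared neutral face as a low-rank component of the
    frame matrix X (pixels x frames) and return (neutral, expression)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # low-rank neutral part
    return L, X - L                            # residual = expression part

def group_soft_threshold(c, groups, lam):
    """Proximal step for the l2,1 (group-sparsity) penalty on a code vector c.
    `groups` is a list of index lists, one per class/group (assumed given)."""
    out = np.zeros_like(c)
    for g in groups:
        norm = np.linalg.norm(c[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * c[g]  # shrink the whole group
    return out
```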


arXiv | 2017

Transferring Face Verification Nets to Pain and Expression Regression

Feng Wang; Xiang Xiang; Chang Liu; Trac D. Tran; Austin Reiter; Gregory D. Hager; Harry Quon; Jian Cheng; Alan L. Yuille

Limited labeled data are available for research on estimating facial expression intensities. For instance, the ability to train deep networks for automated pain assessment is limited by small datasets with labels of patient-reported pain intensities. Fortunately, fine-tuning from a data-extensive pre-trained domain, such as face verification, can alleviate this problem. In this paper, we propose a network that fine-tunes a state-of-the-art face verification network using a regularized regression loss and additional data with expression labels. In this way, the expression intensity regression task can benefit from the rich feature representations trained on a huge amount of data for face verification. The proposed regularized deep regressor is applied to estimate pain expression intensity and is verified on the widely used UNBC-McMaster Shoulder-Pain dataset, achieving state-of-the-art performance. A weighted evaluation metric is also proposed to address the imbalance of different pain intensities.
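
A hedged sketch of the fine-tuning setup appears below (PyTorch; the backbone interface, feature dimension, smooth L1 loss, and the penalty on the new regression head are assumptions for illustration, not the paper's exact recipe).

```python
import torch
import torch.nn as nn

class IntensityRegressor(nn.Module):
    """A pre-trained face verification backbone with a new regression head
    for expression/pain intensity. The backbone is assumed to map an image
    batch to (batch, feat_dim) embeddings."""
    def __init__(self, backbone, feat_dim=512):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, 1)   # new regression layer

    def forward(self, x):
        return self.head(self.backbone(x)).squeeze(1)

def regularized_loss(pred, target, model, weight_decay=1e-4):
    """Smooth L1 regression loss plus an L2 penalty on the new head, a simple
    stand-in for the regularization that keeps the regressor close to the
    verification features (assumed form, not the paper's loss)."""
    reg = sum(p.pow(2).sum() for p in model.head.parameters())
    return nn.functional.smooth_l1_loss(pred, target) + weight_decay * reg
```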


IEEE Transactions on Circuits and Systems for Video Technology | 2017

Linear Disentangled Representation Learning for Facial Actions

Xiang Xiang; Trac D. Tran

The limited annotated data available for recognizing facial expressions, and particularly action units, makes it hard to train a deep network that can learn disentangled invariant features. However, a supervised linear model is undemanding in terms of training data. In this paper, we propose an elegant linear model to untangle facial actions from expressive face videos, which contain a mixture of linearly representable attributes. Previous attempts require an explicit decoupling of identity and expression, which is practically inexact. Instead, we exploit the low-rank property across frames to implicitly subtract the intrinsic neutral face, which is modeled jointly with a sparse representation of only the residual expression components. On CK+, our one-shot C-HiSLR on raw face pixel intensities performs far more competitively than conventional shape+SVM models with landmark detection and than two-step SRC of the same type applied to manually prepared expression components. It is also comparable with the piecewise linear model DCS and with temporal models such as CRFs and Bayes nets. We apply it to action unit (AU) recognition on MPI-VDB and achieve a decent performance. As an expression is a mixture of AUs, this result gives hope of approximating an expression using a piecewise linear model.
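
For reference, the SRC decision rule that the linear models above are compared against can be sketched as follows (the lasso solver and its parameters are assumptions; SRC assigns the class whose training atoms best reconstruct the test sample).

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(x, D, labels, alpha=0.01):
    """Sparse Representation based Classification.
    x: test sample (dim,); D: training dictionary (dim, n_atoms);
    labels: class id per dictionary atom."""
    labels = np.asarray(labels)
    # Sparse code of x over the whole dictionary (lasso as a convenient solver).
    coef = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(D, x).coef_
    best, best_res = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        res = np.linalg.norm(x - D[:, mask] @ coef[mask])  # class-wise residual
        if res < best_res:
            best, best_res = c, res
    return best
```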


IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction | 2016

Recursively Measured Action Units

Xiang Xiang; Trac D. Tran

Video is a recursively measured signal whose frames are highly correlated, with structured sparsity and low-rankness. A simple example is facial expression: multiple measurements of a face. A few salient facial action units (AUs) are often enough for correct expression recognition. Ideally, AUs are not stored while the face remains neutral and only become salient when an expression occurs, while the recognizer remains able to recall historic salient AUs. Such a temporal memory mechanism is appealing for a real-time system, since it reduces the rich redundancy in information coding. We formulate expression recognition as video Sparse Representation based Classification (SRC) with a Long Short-Term Memory (LSTM) mechanism, which is also applicable to human actions, though this requires a careful design of the sparse representation due to possibly changing scenes. Preliminary experiments are conducted on the MPI Face Video Database (MPI-VDB). We compare the proposed sparse coding with temporal modeling using LSTM against the baseline of sparse coding with simultaneous recursive matching pursuit (SRMP).
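
A speculative sketch of the temporal-modeling half, an LSTM run over per-frame sparse codes with the final hidden state classified into an expression label, is shown below (PyTorch; the dimensions and single-layer architecture are assumptions, not the paper's configuration).

```python
import torch
import torch.nn as nn

class SparseCodeLSTM(nn.Module):
    """LSTM over per-frame sparse codes; the last hidden state is mapped to
    expression-class logits."""
    def __init__(self, code_dim, hidden=128, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(code_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, codes):              # codes: (batch, frames, code_dim)
        _, (h, _) = self.lstm(codes)
        return self.fc(h[-1])              # logits over expression classes
```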


Computer-Assisted and Robotic Endoscopy: First International Workshop, CARE 2014, Held in Conjunction with MICCAI 2014, Boston, MA, USA, September 18, 2014, Revised Selected Papers | 2014

Is Multi-model Feature Matching Better for Endoscopic Motion Estimation?

Xiang Xiang; Daniel J. Mirota; Austin Reiter; Gregory D. Hager

Camera motion estimation is a standard yet critical step in endoscopic visualization. It is affected by the variation in the locations and correspondences of features detected in 2-D images. Feature detectors and descriptors vary, though one of the most widely used remains SIFT. Practitioners usually also adopt its feature matching strategy, which defines inliers as feature pairs consistent with a single global affine transformation. For endoscopic videos, however, we ask whether it is more suitable to cluster features into multiple groups, while still enforcing the same transformation as in SIFT within each group. Such a multi-model idea has recently been examined in the Multi-Affine work, which outperforms Lowe's SIFT in terms of re-projection error on minimally invasive endoscopic images with manually labelled ground-truth matches of SIFT features. Since the difference lies only in matching, the accuracy gain of the estimated motion is attributed to the holistic Multi-Affine feature matching algorithm. More concretely, however, the matching criterion and point searching can be the same as those built into SIFT; we argue that the real variation is only in the motion model verification, where we either enforce a single global motion model or employ a group of multiple local ones. In this paper, we investigate how the estimated motion is affected by the number of motion models assumed in feature matching. While this sensitivity can be evaluated analytically, we present an empirical analysis in a leave-one-out cross validation setting that does not require labels of ground-truth matches. The sensitivity is then characterized by the variance of a sequence of motion estimates. We present a series of quantitative comparisons, such as accuracy and variance, between the Multi-Affine motion models and the global affine model.
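
The contrast between a single global affine model and per-group affine models can be illustrated with the short NumPy sketch below; the grouping of matched points (e.g. from clustering feature locations) is assumed given, and the reprojection-error computation is a simplification of the verification step discussed above.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2-D affine fit: dst ~ [x, y, 1] @ A, with A of shape (3, 2).
    Assumes at least 3 correspondences for a well-determined fit."""
    X = np.hstack([src, np.ones((len(src), 1))])   # (n, 3)
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A

def reprojection_error(src, dst, groups=None):
    """Mean reprojection error under one global model (groups=None) or under
    one affine model per group of matches (multi-model verification)."""
    if groups is None:
        A = fit_affine(src, dst)
        pred = np.hstack([src, np.ones((len(src), 1))]) @ A
        return float(np.mean(np.linalg.norm(pred - dst, axis=1)))
    errs = []
    for g in np.unique(groups):
        m = groups == g
        errs.append(reprojection_error(src[m], dst[m]))
    return float(np.mean(errs))
```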


International Symposium on Biomedical Imaging | 2018

Adversarial deep structured nets for mass segmentation from mammograms

Wentao Zhu; Xiang Xiang; Trac D. Tran; Gregory D. Hager; Xiaohui Xie


International Conference on Image Processing | 2017

Regularizing face verification nets for pain intensity regression

Feng Wang; Xiang Xiang; Chang Liu; Trac D. Tran; Austin Reiter; Gregory D. Hager; Harry Quon; Jian Cheng; Alan L. Yuille


International Conference on Image Processing | 2018

S3D: Stacking Segmental P3D for Action Quality Assessment

Xiang Xiang; Ye Tian; Austin Reiter; Gregory D. Hager; Trac D. Tran

Collaboration


Dive into Xiang's collaborations.

Top Co-Authors

Trac D. Tran, Johns Hopkins University

Austin Reiter, Johns Hopkins University

Alan L. Yuille, Johns Hopkins University

Feng Wang, Johns Hopkins University

Jian Cheng, University of Electronic Science and Technology of China

Chang Liu, Johns Hopkins University

Harry Quon, Johns Hopkins University

Minh Dao, Johns Hopkins University