Featured Research

Audio and Speech Processing

CDPAM: Contrastive learning for perceptual audio similarity

Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.

Systems and Control

Koopman based data-driven predictive control

Sparked by Willems' fundamental lemma, a class of data-driven control methods has been developed for LTI systems. At the same time, Koopman operator theory attempts to cast a nonlinear control problem into a standard linear one, albeit an infinite-dimensional one. Motivated by these two ideas, a data-driven control scheme for nonlinear systems is proposed in this work. The proposed scheme is compatible with most differential regressors, enabling offline learning. In particular, the model uncertainty is considered, enabling a novel data-driven simulation framework based on the Wasserstein distance. Numerical experiments are performed with Bayesian neural networks to show the effectiveness of both the proposed control scheme and the simulation scheme.
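As a rough illustration of the data condition behind Willems' fundamental lemma for LTI systems, the sketch below builds a Hankel matrix from a scalar input sequence and checks persistency of excitation through its row rank. It does not reproduce the Koopman-based scheme of the paper; the function names `hankel`, `rank`, and `is_persistently_exciting` are hypothetical, and the rank routine is a plain Gaussian elimination chosen only to keep the example dependency-free.

```python
from typing import List, Sequence


def hankel(signal: Sequence[float], depth: int) -> List[List[float]]:
    """Hankel matrix with `depth` rows built from a scalar signal."""
    cols = len(signal) - depth + 1
    if depth < 1 or cols < 1:
        raise ValueError("signal too short for the requested depth")
    return [[float(signal[i + j]) for j in range(cols)] for i in range(depth)]


def rank(matrix: List[List[float]], tol: float = 1e-9) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]
    n_cols = len(rows[0]) if rows else 0
    rk = 0
    for c in range(n_cols):
        if rk == len(rows):
            break
        # Pick the largest remaining entry in this column as pivot.
        pivot = max(range(rk, len(rows)), key=lambda r: abs(rows[r][c]))
        if abs(rows[pivot][c]) < tol:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            factor = rows[r][c] / rows[rk][c]
            for j in range(c, n_cols):
                rows[r][j] -= factor * rows[rk][j]
        rk += 1
    return rk


def is_persistently_exciting(signal: Sequence[float], order: int) -> bool:
    """A scalar signal is persistently exciting of order `order` when the
    Hankel matrix with `order` rows has full row rank."""
    return rank(hankel(signal, order)) == order


if __name__ == "__main__":
    constant = [1.0] * 10
    varied = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
    print(is_persistently_exciting(constant, 2))
    print(is_persistently_exciting(varied, 3))
```

A constant input carries too little information to excite more than one direction, which is why data-driven methods of this kind typically require sufficiently rich input sequences.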

Signal Processing

Sparse Factorization-based Detection of Off-the-Grid Moving targets using FMCW radars

In this paper, we investigate the application of continuous sparse signal reconstruction algorithms for the estimation of the ranges and speeds of multiple moving targets using an FMCW radar. Conventionally, to be reconstructed, continuous sparse signals are approximated by a discrete representation. This discretization of the signal's parameter domain leads to mismatches with the actual signal. While increasing the grid density mitigates these errors, it dramatically increases the algorithmic complexity of the reconstruction. To overcome this issue, we propose a fast greedy algorithm for off-the-grid detection of multiple moving targets. This algorithm extends existing continuous greedy algorithms to the framework of factorized sparse representations of the signals. This factorized representation is obtained from simplifications of the radar signal model which, up to a model mismatch, strongly reduce the dimensionality of the problem. Monte-Carlo simulations of a K-band radar system validate the ability of our method to produce more accurate estimates with less computation time than both on-the-grid methods and methods based on non-factorized representations.

Image and Video Processing

Multi-scale GCN-assisted two-stage network for joint segmentation of retinal layers and disc in peripapillary OCT images

An accurate and automated tissue segmentation algorithm for retinal optical coherence tomography (OCT) images is crucial for the diagnosis of glaucoma. However, due to the presence of the optic disc, the anatomical structure of the peripapillary region of the retina is complicated and challenging to segment. To address this issue, we developed a novel graph convolutional network (GCN)-assisted two-stage framework to simultaneously label the nine retinal layers and the optic disc. Specifically, a multi-scale global reasoning module is inserted between the encoder and decoder of a U-shape neural network to exploit anatomical prior knowledge and perform spatial reasoning. We conducted experiments on human peripapillary retinal OCT images. The Dice score of the proposed segmentation network is 0.820 ± 0.001 and the pixel accuracy is 0.830 ± 0.002, both of which outperform those of other state-of-the-art techniques.
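For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a per-class Dice score and a pixel accuracy can be computed from flattened label maps. The function names and the toy label lists are illustrative assumptions; the abstract does not state how the authors average over classes or images, so this shows only the basic definitions.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int], label: int) -> float:
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for a single class label."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    n_pred = sum(1 for p in pred if p == label)
    n_target = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if n_pred + n_target == 0:
        return 1.0  # convention: class absent in both maps counts as a match
    return 2.0 * overlap / (n_pred + n_target)


def pixel_accuracy(pred: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of pixels whose predicted label equals the reference label."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(1 for p, t in zip(pred, target) if p == t) / len(pred)


if __name__ == "__main__":
    # Toy flattened label maps (hypothetical data, three classes 0..2).
    pred = [0, 0, 1, 1, 2, 2, 2, 0]
    target = [0, 1, 1, 1, 2, 2, 0, 0]
    print(dice_score(pred, target, label=1))
    print(pixel_accuracy(pred, target))
```

Dice weights overlap relative to the size of each class, so it is less dominated by large background regions than pixel accuracy, which is one reason segmentation papers often report both.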


Audio and Speech Processing

End-to-End Multi-Channel Transformer for Speech Recognition

Transformers are powerful neural architectures that allow integrating different modalities using attention mechanisms. In this paper, we leverage neural transformer architectures for multi-channel speech recognition systems, where the spectral and spatial information collected from different microphones is integrated using attention layers. Our multi-channel transformer network mainly consists of three parts: channel-wise self-attention layers (CSA), cross-channel attention layers (CCA), and multi-channel encoder-decoder attention layers (EDA). The CSA and CCA layers encode the contextual relationship within and between channels and across time, respectively. The channel-attended outputs from CSA and CCA are then fed into the EDA layers to help decode the next token given the preceding ones. The experiments show that on a far-field in-house dataset, our method outperforms the baseline single-channel transformer, as well as the super-directive and neural beamformers cascaded with the transformers.

Non-linear frequency warping using constant-Q transformation for speech emotion recognition

In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of the speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of the CQT makes it more promising for SER than the standard short-time Fourier transform (STFT). We present a comparative analysis of short-term acoustic features based on STFT and CQT for SER with a deep neural network (DNN) as the back-end classifier. We optimize different parameters for both features. The CQT-based features outperform the STFT-based spectral features in SER experiments. Further experiments with cross-corpora evaluation demonstrate that the CQT-based systems provide better generalization with out-of-domain training data.
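The resolution argument can be made concrete with a small sketch that compares the bin spacing of a constant-Q analysis with that of an STFT. It uses the standard geometric bin spacing f_k = f_min * 2^(k/B) and Q = 1 / (2^(1/B) - 1); the specific values of `f_min`, `bins_per_octave`, the sample rate, and the FFT size are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import List, Tuple


def cqt_bins(f_min: float, bins_per_octave: int, n_bins: int) -> Tuple[float, List[Tuple[float, float]]]:
    """Return the quality factor Q and a list of (center_hz, bandwidth_hz).

    Center frequencies are geometrically spaced, so the bandwidth
    f_k / Q grows in proportion to the center frequency.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bins = []
    for k in range(n_bins):
        center = f_min * 2.0 ** (k / bins_per_octave)
        bins.append((center, center / q))
    return q, bins


def stft_resolution(sample_rate: float, n_fft: int) -> float:
    """Uniform bin spacing of an STFT, identical at every frequency."""
    return sample_rate / n_fft


if __name__ == "__main__":
    # Illustrative settings only: 12 bins per octave starting near 32.7 Hz.
    q, bins = cqt_bins(f_min=32.7, bins_per_octave=12, n_bins=84)
    stft_df = stft_resolution(sample_rate=16000, n_fft=512)
    print(f"Q = {q:.2f}, STFT spacing = {stft_df:.2f} Hz")
    for center, bandwidth in (bins[0], bins[36], bins[-1]):
        print(f"center {center:8.1f} Hz  bandwidth {bandwidth:7.2f} Hz")
```

Under these assumed settings the lowest CQT bins are a few hertz wide, much finer than the fixed STFT spacing, while the highest bins are coarser, which is the trade-off the abstract describes.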

Systems and Control

Stability Analysis and State-Feedback Stabilization of LPV Time-Delay Systems with Piecewise Constant Parameters subject to Spontaneous Poissonian Jumps

This paper discusses the stability analysis of linear parameter varying systems with a parameter-dependent delay where the parameters are assumed to be stochastic piecewise constants under spontaneous Poissonian jumps. Based on stochastic Lyapunov-Krasovskii functionals, we also provide sufficient synthesis conditions for the gain-scheduled state-feedback controller with memory in terms of parameter-dependent linear matrix inequalities (LMIs). Such synthesis conditions are computationally intractable due to the presence of integral terms. However, we show that these LMIs can be equivalently represented by integral-free LMIs, which are computationally tractable. Finally, we illustrate the applicability of the results through examples.

Reduction of the Beam Pointing Error for Improved Free-Space Optical Communication Link Performance

Free-space optical communication is emerging as a low-power, low-cost, and high data rate alternative to radio-frequency communication in short- to medium-range applications. However, it requires a close-to-line-of-sight link between the transmitter and the receiver. This paper proposes a robust $\mathcal{H}_\infty$ control law for free-space optical (FSO) beam pointing error systems under controlled weak turbulence conditions. The objective is to maintain the transmitter-receiver line, which means keeping the center of the optical beam as close as possible to the center of the receiving aperture, within a prescribed disturbance attenuation level. First, we derive an augmented nonlinear discrete-time model for pointing error loss due to misalignment caused by weak atmospheric turbulence. We then investigate the $\mathcal{H}_\infty$-norm optimization problem that guarantees the closed-loop pointing error is stable and ensures the prescribed weak disturbance attenuation. Furthermore, we evaluate the closed-loop outage probability error and bit error rate (BER) that quantify the free-space optical communication performance in fading channels. Finally, the paper concludes with a numerical simulation of the proposed approach to the FSO link's error performance.

Signal Processing

Does Probabilistic Constellation Shaping Benefit IM-DD Systems without Optical Amplifiers?

Probabilistic constellation shaping (PCS) has been widely applied to amplified coherent optical transmissions owing to its shaping gain over uniform signaling and its fine-grained rate adaptation to the underlying fiber channel condition. These merits stimulate the study of applying PCS to short-reach applications dominated by intensity modulation (IM) direct detection (DD) systems. As commercial IM-DD systems typically do not employ optical amplification, in order to save cost and power consumption, they are no longer subject to an average power constraint (APC) but to a peak power constraint (PPC), which poses unique challenges to taking full advantage of PCS. This paper provides a comprehensive investigation of PCS in IM-DD systems without optical amplifiers. In particular, we reveal that if the transmitter enhances the peak-to-average power ratio of the signal, a PPC system can be partially or even fully converted to an APC system in which the classical PCS offers its merits. The findings are verified through an IM-DD experiment using 4- and 8-ary pulse amplitude modulations.
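To illustrate the interplay between shaping, rate, and peak-to-average power ratio, the sketch below applies a Maxwell-Boltzmann distribution to a unipolar 4-level PAM alphabet and reports the entropy and the ratio of peak to mean intensity. This is an assumption-laden toy: the level set `[0, 1, 2, 3]`, the use of a Maxwell-Boltzmann family with parameter `lam`, and the definition of PAPR as peak over mean intensity are choices made here for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence


def maxwell_boltzmann(levels: Sequence[float], lam: float) -> List[float]:
    """Probabilities proportional to exp(-lam * a^2); lam = 0 is uniform."""
    weights = [math.exp(-lam * a * a) for a in levels]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def papr(levels: Sequence[float], probs: Sequence[float]) -> float:
    """Peak intensity divided by mean intensity (assumed definition)."""
    mean = sum(p * a for p, a in zip(probs, levels))
    if mean <= 0.0:
        raise ValueError("mean intensity must be positive")
    return max(levels) / mean


if __name__ == "__main__":
    levels = [0.0, 1.0, 2.0, 3.0]  # hypothetical unipolar PAM-4 intensities
    for lam in (0.0, 0.2, 0.5):
        probs = maxwell_boltzmann(levels, lam)
        print(f"lam={lam:.1f}  H={entropy_bits(probs):.3f} bit  PAPR={papr(levels, probs):.3f}")
```

In this toy model, stronger shaping lowers the entropy and the mean intensity at a fixed peak, so the peak-to-average ratio rises; how that trade-off plays out under a peak power constraint is the question the paper investigates.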

RIGOLETTO -- RIemannian GeOmetry LEarning: applicaTion To cOnnectivity. A contribution to the Clinical BCI Challenge -- WCCI2020

This short technical report describes the approach submitted to the Clinical BCI Challenge-WCCI2020. This submission aims to classify motor imagery tasks from EEG signals and relies on Riemannian geometry, with a twist. Rather than using only the classical covariance matrices, we also rely on measures of functional connectivity. Our approach ranked first on task 1 of the competition.

Image and Video Processing

Editorial: Introduction to the Issue on Deep Learning for Image/Video Restoration and Compression

Recent works have shown that learned models can achieve significant performance gains, especially in terms of perceptual quality measures, over traditional methods. Hence, the state of the art in image restoration and compression is getting redefined. This special issue covers the state of the art in learned image/video restoration and compression to promote further progress in innovative architectures and training methods for effective and efficient networks for image/video restoration and compression.

Attention-Based Neural Networks for Chroma Intra Prediction in Video Coding

Neural networks can be successfully used to improve several modules of advanced video coding schemes. In particular, compression of colour components was shown to greatly benefit from usage of machine learning models, thanks to the design of appropriate attention-based architectures that allow the prediction to exploit specific samples in the reference region. However, such architectures tend to be complex and computationally intense, and may be difficult to deploy in a practical video coding pipeline. This work focuses on reducing the complexity of such methodologies, to design a set of simplified and cost-effective attention-based architectures for chroma intra-prediction. A novel size-agnostic multi-model approach is proposed to reduce the complexity of the inference process. The resulting simplified architecture is still capable of outperforming state-of-the-art methods. Moreover, a collection of simplifications is presented in this paper, to further reduce the complexity overhead of the proposed prediction architecture. Thanks to these simplifications, a reduction in the number of parameters of around 90% is achieved with respect to the original attention-based methodologies. Simplifications include a framework for reducing the overhead of the convolutional operations, a simplified cross-component processing model integrated into the original architecture, and a methodology to perform integer-precision approximations with the aim to obtain fast and hardware-aware implementations. The proposed schemes are integrated into the Versatile Video Coding (VVC) prediction pipeline, retaining compression efficiency of state-of-the-art chroma intra-prediction methods based on neural networks, while offering different directions for significantly reducing coding complexity.

