Soheil Khorram
Sharif University of Technology
Publications
Featured research published by Soheil Khorram.
International Workshop on Machine Learning for Signal Processing | 2011
Sara Bahaadini; Hossein Sameti; Soheil Khorram
Research on Persian speech synthesis over the last ten years has been scarce and scattered, and no comprehensive framework that properly implements and adapts statistical speech synthesis methods for Persian has yet been developed. In this paper, recent statistical parametric speech synthesis methods, including CLUSTERGEN, traditional HMM-based speech synthesis and its STRAIGHT version, are implemented and adapted for the Persian language. A comparison category rating (CCR) test is carried out to compare these methods with each other and with the unit selection method. Listeners score samples on the comparison mean opinion score (CMOS) scale. The methods are ranked by averaging the CCR scores. The results show that the STRAIGHT-based system produces the best quality. The traditional HMM-based and unit selection systems rank second and third, with approximately the same quality. Finally, CLUSTERGEN produces the worst quality among these four systems.
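The ranking step described in this abstract (averaging CCR scores per system) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `rank_systems`, the system labels, and the judgement values are all hypothetical, and the sketch assumes each judgement is a pairwise score on the seven-point CMOS scale from -3 to +3.

```python
from collections import defaultdict
from statistics import mean

def rank_systems(ccr_judgements):
    """Rank systems by their mean comparative score.

    Each judgement is (system_a, system_b, score), where score is on the
    CMOS scale: -3 (A much worse than B) to +3 (A much better than B).
    A score for A against B also counts as the negated score for B.
    """
    scores = defaultdict(list)
    for system_a, system_b, score in ccr_judgements:
        scores[system_a].append(score)
        scores[system_b].append(-score)
    averages = {name: mean(values) for name, values in scores.items()}
    # Highest average first
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Illustrative judgements only; these are NOT results from the paper.
judgements = [
    ("sys_a", "sys_b", 1),
    ("sys_a", "sys_c", 2),
    ("sys_b", "sys_c", 1),
    ("sys_b", "sys_a", -1),
]
ranking = rank_systems(judgements)
```

Counting each score once for the first system and once (negated) for the second keeps the averages symmetric, so the presentation order of a pair does not bias the ranking.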
EURASIP Journal on Audio, Speech, and Music Processing | 2014
Soheil Khorram; Hossein Sameti; Fahimeh Bahmaninezhad; Simon King; Thomas Drugman
Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) the decision tree structure lacks adequate context generalization; (ii) it is unable to express complex context dependencies; (iii) parameters generated from this structure exhibit sudden transitions between adjacent states. To alleviate these limitations, many earlier papers applied multiple decision trees with an additive assumption over those trees. The current study also uses multiple decision trees, but instead of the additive assumption, it proposes to train the smoothest distribution by maximizing an entropy measure. Increasing the smoothness of the distribution improves context generalization. The proposed model, named the hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and the maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure on small training databases (i.e., 50, 100, and 200 sentences). In the second set of experiments, the performance of HMEM with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second set confirm a significant improvement of the proposed system over the conventional HSMM.
EURASIP Journal on Advances in Signal Processing | 2015
Soheil Khorram; Hossein Sameti; Simon King
This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional ‘hard’ decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this ‘divide-and-conquer’ approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
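The idea that internal nodes select both children with membership degrees, so that each context belongs to several overlapping leaves, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's model: the sigmoid gate, the class names `Leaf` and `SoftNode`, and the membership-weighted average used in `predict` are all hypothetical stand-ins (the paper derives a maximum entropy distribution rather than a plain weighted mean).

```python
import math

class Leaf:
    """Terminal node holding one parameter value (e.g. a mean F0 in Hz)."""
    def __init__(self, mean):
        self.mean = mean

class SoftNode:
    """Internal node with a soft decision; the sigmoid gate is an assumption."""
    def __init__(self, weights, bias, left, right):
        self.weights = weights
        self.bias = bias
        self.left = left
        self.right = right

def leaf_memberships(node, context, degree=1.0):
    """Return (leaf, membership) pairs whose memberships sum to `degree`."""
    if isinstance(node, Leaf):
        return [(node, degree)]
    activation = sum(w * x for w, x in zip(node.weights, context)) + node.bias
    go_right = 1.0 / (1.0 + math.exp(-activation))
    # Both children are visited; the membership degree is split between them.
    return (leaf_memberships(node.left, context, degree * (1.0 - go_right))
            + leaf_memberships(node.right, context, degree * go_right))

def predict(tree, context):
    """Membership-weighted average of leaf means (a simplification)."""
    return sum(m * leaf.mean for leaf, m in leaf_memberships(tree, context))

# Hypothetical two-level tree over a one-dimensional context feature.
tree = SoftNode([1.0], 0.0,
                Leaf(100.0),
                SoftNode([2.0], -1.0, Leaf(180.0), Leaf(220.0)))
memberships = leaf_memberships(tree, [0.0])
```

A hard decision tree is the limiting case in which every gate outputs exactly 0 or 1, so that a single leaf receives the whole membership.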
International Symposium on Artificial Intelligence | 2013
Soheil Khorram; Fahimeh Bahmaninezhad; Hossein Sameti
Hidden Markov Model (HMM)-based synthesis (HTS) has recently been confirmed to be the most effective method for generating natural speech. However, it lacks adequate context generalization when the training data is limited. As a solution, the current study provides a new context-dependent speech modeling framework based on Gaussian Conditional Random Field (GCRF) theory. By applying this model, an innovative speech synthesis system has been developed which can be viewed as an extension of the Context-Dependent Hidden Semi-Markov Model (CD-HSMM). A novel Viterbi decoder along with a stochastic gradient ascent algorithm was applied to train the model parameters. Also, a fast and efficient parameter generation algorithm was derived for the synthesis part. Experimental results using objective and subjective criteria show that the proposed system substantially outperforms HSMM on limited speech databases. Moreover, the Mel-cepstral distance of the spectral parameters is reduced considerably for every size of training database.
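The Mel-cepstral distance used as the objective measure above is commonly computed per frame in decibels. The sketch below uses the widely used definition, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences); whether the paper uses exactly this variant is an assumption, and the function name and the choice to pass coefficient vectors that already exclude the 0th (energy) coefficient are hypothetical.

```python
import math

def mel_cepstral_distance(reference, synthesized):
    """Per-frame Mel-cepstral distance in dB.

    Both arguments are Mel-cepstral coefficient vectors of equal length,
    assumed here to exclude the 0th (energy) coefficient.
    """
    if len(reference) != len(synthesized):
        raise ValueError("coefficient vectors must have the same length")
    squared = sum((r - s) ** 2 for r, s in zip(reference, synthesized))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)

def mean_mel_cepstral_distance(reference_frames, synthesized_frames):
    """Average the per-frame distance over time-aligned frame pairs."""
    distances = [mel_cepstral_distance(r, s)
                 for r, s in zip(reference_frames, synthesized_frames)]
    return sum(distances) / len(distances)
```

Lower values indicate that the synthesized spectral envelope is closer to the natural one; the frames must be time-aligned before averaging.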
Information Sciences, Signal Processing and their Applications | 2012
Soheil Khorram; Hossein Sameti; Hadi Veisi
Adaptive Noise Cancellation (ANC) is an effective dual-channel technique for background noise reduction. Due to the presence of uncorrelated noise components at the two inputs in vehicular environments, ANC does not provide sufficient background noise reduction. To alleviate this problem, a complementary linear filter is added to the ANC structure. The filter coefficients are chosen so that the enhanced signal is a minimum mean square error (MMSE) estimate of the speech signal. The ANC structure is thereby modified into a dual-channel Wiener structure. We prove that this structure is identical to an LMS-type ANC followed by a Wiener post-filter. A new method is proposed for noise spectrum estimation in the Wiener post-filter. This method does not require Voice Activity Detectors (VADs) and achieves better speech enhancement in nonstationary noisy environments. Experimental results show that the proposed system overcomes the problem efficiently, at the cost of greater complexity and more speech distortion.
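For orientation, the baseline LMS-type ANC referred to above can be sketched as follows. This shows only the standard dual-channel LMS canceller, not the Wiener post-filter or the noise spectrum estimator proposed in the paper. The function name `lms_anc`, the tap count, the step size, and the synthetic signals are all assumptions chosen for illustration; the demonstration uses the idealized case in which the reference noise is fully correlated with the noise in the primary channel.

```python
import math
import random

def lms_anc(primary, reference, num_taps=4, step=0.01):
    """Standard LMS adaptive noise canceller.

    primary:   speech plus noise (first microphone)
    reference: noise correlated with the noise in `primary`
    Returns the error signal, which serves as the enhanced speech.
    """
    weights = [0.0] * num_taps
    history = [0.0] * num_taps
    enhanced = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]          # newest reference sample first
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate            # enhanced output sample
        weights = [w + 2.0 * step * error * h
                   for w, h in zip(weights, history)]
        enhanced.append(error)
    return enhanced

def residual_power(signal, clean, start):
    """Mean squared deviation from the clean signal after index `start`."""
    errors = [(y - s) ** 2 for y, s in zip(signal[start:], clean[start:])]
    return sum(errors) / len(errors)

# Synthetic demonstration: a sinusoid stands in for speech.
random.seed(0)
num_samples = 4000
speech = [math.sin(2.0 * math.pi * 0.01 * n) for n in range(num_samples)]
noise = [random.gauss(0.0, 1.0) for _ in range(num_samples)]
primary = [s + 0.8 * v for s, v in zip(speech, noise)]
enhanced = lms_anc(primary, noise)

before = residual_power(primary, speech, num_samples // 2)
after = residual_power(enhanced, speech, num_samples // 2)
```

When part of the noise at the two inputs is uncorrelated, as the abstract describes for vehicular environments, this canceller cannot remove that part, which is the gap the complementary filter is meant to address.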
International Symposium on Signal Processing and Information Technology | 2008
Soheil Khorram; Hossein Sameti; Hadi Veisi; Hamid Reza Abutalebi
Adaptive noise cancellation (ANC) is a well-known technique for background noise reduction in automobile and vehicular environments. The noise fields in automobile and other vehicle interiors closely follow the diffuse noise field model. However, ANC does not provide sufficient noise reduction in diffuse noise fields. In this paper, a new multistage post-filter is designed for ANC as a solution for diffuse noise conditions. The designed post-filter is a single-channel linear prediction (LP)-based speech enhancement system. The LP is performed by an adaptive lattice filter and attempts to extract speech components by using intermediate ANC signals. The post-filter introduces no processing delay, which makes it suitable for speech communication systems. We have evaluated the performance of the proposed system in various real-life noise fields recorded in an automobile environment. The experimental results using various quality measures show that the proposed method is superior to both the adaptive noise canceller and LP-based speech enhancement systems.
International Conference on Signal Processing | 2008
Soheil Khorram; Hossein Sameti; Hadi Veisi
Adaptive noise cancellers (ANCs) do not provide sufficient noise reduction in diffuse noise fields. In this paper, a new hybrid structure is proposed as a solution to this problem. The proposed system is a combination of two subsystems, an ANC and a new multistage post-filter. The post-filter is based on linear prediction (LP) and attempts to extract the speech component by using intermediate ANC signals. The system is implemented on an over-sampled DFT filterbank with different analysis and synthesis prototype filters. The experimental results using various quality measures show that the proposed system is superior to both the subband ANC and subband LP-based speech enhancement systems.
International Conference on Signal Processing | 2014
Soheil Khorram; Hossein Sameti; Fahimeh Bahmaninezhad
This article proposes a method to improve the performance of deterministic plus stochastic model (DSM)-based feature extraction by integrating contextual information. One valuable advantage of speech synthesis over speech recognition is that contextual information is available in both the training and testing phases of synthesis. However, as in recognition, this knowledge has been neglected during acoustic feature extraction for speech synthesis. DSM expresses the residual of Mel-cepstral analysis as the sum of two components, one deterministic and one stochastic. This study proposes to model the deterministic component through a novel context-dependent principal component analysis (CD-PCA), and the stochastic component through conventional high-pass filtered noise. Furthermore, because the proposed feature extraction depends strongly on state boundaries, the feature analysis and HMM-based modeling are performed iteratively. Subjective evaluations conducted on a Persian speech database confirm the effectiveness of the proposed synthesis system.
Non-Linear Speech Processing | 2013
Fahimeh Bahmaninezhad; Soheil Khorram; Hossein Sameti
Speaker-adaptive speech synthesis based on the Hidden Semi-Markov Model (HSMM) has been shown to be highly effective when only a limited amount of speech data is available. However, this effectiveness can be increased further by training the average voice model appropriately. This study therefore presents a new method for training the average voice model. The method guarantees that data from every speaker contributes to all the leaves of the decision tree. Small training data and highly diverse contexts across training speakers substantially degrade the quality of the average voice model, and in turn harm the adapted model and the synthetic speech. The proposed method takes these difficulties into account in order to train a tailored, high-quality average voice model. As the experiments indicate, the proposed method outperforms the conventional one not only in the quality of the synthetic speech but also in its similarity to the natural voice. Our experiments show that the proposed method improves the CMOS test score by 0.6 over the conventional method.
International Symposium on Artificial Intelligence | 2013
Fatemeh Sadat Saleh; Boshra Shams; Hossein Sameti; Soheil Khorram
Automatic detection of prosodic events in speech, such as the boundaries of Accentual Phrases (APs) and Intonational Phrases (IPs), has been an attractive subject in recent years for speech technologists and linguists. Prosodic events are important for spoken language applications such as speech recognition and translation. In addition, to generate natural speech in text-to-speech synthesizers, the corpus should be tagged with prosodic events. In this paper, we introduce and implement a prosody recognition system that automatically labels prosodic events and their boundaries at the syllable level in Persian using a Multi-Space Probability Distribution Hidden Markov Model. The system is implemented using acoustic features. Experiments show that the detector achieves about 73.5% accuracy on accentual phrase labeling and 80.08% accuracy on intonational phrase detection. These accuracies are comparable with automatic labeling results for American English, where a system using acoustic features achieved 73.97% accuracy at the syllable level.