Ivan Himawan
Queensland University of Technology
Publication
Featured research published by Ivan Himawan.
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Iain A. McCowan; Mike Lincoln; Ivan Himawan
This correspondence presents a microphone array shape calibration procedure for diffuse noise environments. The procedure estimates intermicrophone distances by fitting the measured noise coherence with its theoretical model and then estimates the array geometry using classical multidimensional scaling. The technique is validated on noise recordings from two office environments.
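The two steps described above (fitting a coherence model to obtain distances, then recovering geometry by classical multidimensional scaling) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the spherically isotropic diffuse-field model sin(2πfd/c)/(2πfd/c), a speed of sound of 343 m/s, and a simple grid search over candidate distances. The function names are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def diffuse_coherence(freqs, d, c=SPEED_OF_SOUND):
    """Theoretical coherence of a spherically isotropic diffuse noise field.

    np.sinc(x) is sin(pi*x)/(pi*x), so this equals sin(2*pi*f*d/c)/(2*pi*f*d/c).
    """
    return np.sinc(2.0 * freqs * d / c)


def fit_distance(freqs, measured, candidates):
    """Pick the candidate distance whose model coherence best fits the
    measured coherence in the least-squares sense (grid search, assumed)."""
    errors = [np.sum((measured - diffuse_coherence(freqs, d)) ** 2)
              for d in candidates]
    return candidates[int(np.argmin(errors))]


def classical_mds(D, dim=2):
    """Classical multidimensional scaling.

    D is an (n, n) matrix of pairwise distances. Returns (n, dim) coordinates,
    defined only up to rotation, reflection, and translation.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, v = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # keep the largest `dim`
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

Because the recovered coordinates are only defined up to a rigid transform, a sensible check is to compare pairwise distances of the result against the fitted distances rather than the coordinates themselves.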
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Ivan Himawan; Iain A. McCowan; Sridha Sridharan
Microphone arrays have been used in various applications to capture conversations, such as in meetings and teleconferences. In many cases, the microphone and likely source locations are known a priori, and calculating beamforming filters is therefore straightforward. In ad-hoc situations, however, when the microphones have not been systematically positioned, this information is not available and beamforming must be achieved blindly. In achieving this, a commonly neglected issue is whether it is optimal to use all of the available microphones, or only an advantageous subset of these. This paper commences by reviewing different approaches to blind beamforming, characterizing them by the way they estimate the signal propagation vector and the spatial coherence of noise in the absence of prior knowledge of microphone and speaker locations. Following this, a novel clustered approach to blind beamforming is motivated and developed. Without using any prior geometrical information, microphones are first grouped into localized clusters, which are then ranked according to their relative distance from a speaker. Beamforming is then performed using either the closest microphone cluster, or a weighted combination of clusters. The clustered algorithms are compared to the full set of microphones in experiments on a database recorded on different ad-hoc array geometries. These experiments evaluate the methods in terms of signal enhancement as well as performance on a large vocabulary speech recognition task.
International Conference on Multimedia and Expo | 2012
Ivan Himawan; Wei Song; Dian Tjondronegoro
Effective streaming of video can be achieved by providing more bits to the most important region in the frame at the cost of reduced bits in the less important regions. This strategy can be beneficial for delivering high quality videos on mobile devices, especially where the available bandwidth is low. While state-of-the-art video codecs such as H.264 may have been optimised for perceived quality, it is hypothesised that users will give more attention to interesting regions or objects when watching videos. Therefore, giving a higher quality to the region of interest (ROI) while reducing the quality of other areas may improve the overall perceived quality without necessarily increasing the bitrate. In this paper, the impact of ROI-based encoded video on perceived quality is investigated by conducting a user study for various target bit rates. The results from the user study demonstrate that ROI-based video coding has superior perceived quality compared to conventionally encoded video at the same bitrate in the lower bitrate range.
International Conference on Acoustics, Speech, and Signal Processing | 2010
Ivan Himawan; Iain A. McCowan; Sridha Sridharan
This paper proposes a clustered approach for blind beamforming from ad-hoc microphone arrays. In such arrangements, microphone placement is arbitrary and the speaker may be close to one, all or a subset of microphones at a given time. Practical issues with such a configuration mean that some microphones might be better discarded due to poor input signal-to-noise ratio (SNR) or undesirable spatial aliasing effects from large inter-element spacings when beamforming. Large inter-microphone spacings may also lead to inaccuracies in delay estimation during blind beamforming. In such situations, using a cluster of microphones (i.e., a sub-array), closely located both to each other and to the desired speech source, may provide more robust enhancement than the full array. This paper proposes a method for blind clustering of microphones based on the magnitude squared coherence function, and evaluates the method on a database recorded using various ad-hoc microphone arrangements.
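As a rough illustration of grouping microphones by magnitude squared coherence (MSC), the sketch below estimates the MSC between channel pairs with Welch-style averaging and links channels whose mean MSC exceeds a threshold. The grouping rule (thresholded connected components), the threshold value, and the function names are assumptions made for this example; the abstract does not specify the clustering rule used in the paper.

```python
import numpy as np


def msc(x, y, nperseg=256):
    """Magnitude squared coherence estimated by averaging Hann-windowed,
    50%-overlapping segments (Welch-style)."""
    step = nperseg // 2
    win = np.hanning(nperseg)
    starts = range(0, len(x) - nperseg + 1, step)
    X = np.array([np.fft.rfft(win * x[s:s + nperseg]) for s in starts])
    Y = np.array([np.fft.rfft(win * y[s:s + nperseg]) for s in starts])
    pxy = np.mean(X * np.conj(Y), axis=0)    # cross-spectrum estimate
    pxx = np.mean(np.abs(X) ** 2, axis=0)    # auto-spectrum of x
    pyy = np.mean(np.abs(Y) ** 2, axis=0)    # auto-spectrum of y
    return np.abs(pxy) ** 2 / (pxx * pyy + 1e-12)


def cluster_by_coherence(signals, threshold=0.5):
    """Assign a cluster label to each channel.

    Two channels are linked when their frequency-averaged MSC exceeds
    `threshold` (an assumed rule); clusters are the connected components.
    """
    n = len(signals)
    parent = list(range(n))                  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if msc(signals[i], signals[j]).mean() > threshold:
                parent[find(j)] = find(i)

    mapping = {}                             # root -> compact label
    return [mapping.setdefault(find(i), len(mapping)) for i in range(n)]
```

Closely spaced microphones observing the same source tend to show high coherence, while widely separated ones do not, which is the intuition behind using MSC as a proximity cue without any geometric information.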
International Conference on Acoustics, Speech, and Signal Processing | 2008
Ivan Himawan; Sridha Sridharan; Iain A. McCowan
This paper investigates robustness to uncertain microphone placements in an array beamformer front-end to a speech recognition system. There are two general approaches to handling the placement uncertainty: using the approximately known geometry in a robust beamforming technique, or using techniques that require no prior knowledge of geometry. Experiments in this paper compare the robustness of different techniques for both of these approaches in terms of speech recognition accuracy. To benefit from existing microphone array speech recognition data corpora for experimentation, microphone placement uncertainty is simulated by introducing random perturbations in the assumed geometry. Experimental results show that robust beamforming yields stable performance to a certain degree of placement error, but thereafter techniques such as automatic calibration are beneficial.
ACM Multimedia | 2014
Wei Song; Dian Tjondronegoro; Ivan Himawan
Effective Quality of Experience (QoE) management for mobile video delivery -- to optimize overall user experience while adapting to heterogeneous use contexts -- is still a big challenge to date. This paper proposes a mobile video delivery system that emphasizes the use of acceptability as the main indicator of QoE to manage the end-to-end factors in delivering mobile video services. The first contribution is a novel framework for a user-centric mobile video system that is based on acceptability-based QoE (A-QoE) prediction models, which were derived from comprehensive subjective studies. The second contribution is the results of a field study that evaluates the user experience of the proposed system under realistic usage circumstances, addressing the impacts of perceived video quality, loading speed, interest in content, viewing locations, network bandwidth, display devices, and different video coding approaches, including region-of-interest (ROI) enhancement and center zooming.
International Conference on Multimedia and Expo | 2014
Ivan Himawan; Andrew J. Zele; Dian Tjondronegoro
This paper proposes a novel approach to video deblocking which performs perceptually adaptive bilateral filtering by considering color, intensity, and motion features in a holistic manner. The method is based on the bilateral filter, which is an effective smoothing filter that preserves edges. The bilateral filter parameters are adaptive: they avoid over-blurring of texture regions and at the same time eliminate blocking artefacts in smooth regions and areas of slow motion content. This is achieved by using a saliency map to control the strength of the filter for each individual point in the image based on its perceptual importance. The experimental results demonstrate that the proposed algorithm is effective in deblocking highly compressed video sequences while avoiding over-blurring of edges and textures in salient regions of the image.
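The idea of letting a saliency map control filter strength per pixel can be sketched as below for a single grayscale frame. This is only an illustration under stated assumptions: the linear mapping from saliency to the range parameter, the parameter values, and the function name are hypothetical, and the color and motion features mentioned in the abstract are not modelled.

```python
import numpy as np


def adaptive_bilateral(img, saliency, radius=2, sigma_s=2.0,
                       sigma_r_min=0.02, sigma_r_max=0.2):
    """Bilateral filter whose range sigma varies per pixel.

    img      : 2-D float array with intensities in [0, 1]
    saliency : 2-D array in [0, 1]; 1 means perceptually important
    High saliency maps to a small range sigma (weak smoothing, detail kept);
    low saliency maps to a large range sigma (strong smoothing).
    The linear mapping is an assumption made for this sketch.
    """
    img = img.astype(np.float64)
    saliency = np.clip(saliency, 0.0, 1.0)
    sigma_r = sigma_r_max - saliency * (sigma_r_max - sigma_r_min)

    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbor at offset (dy, dx) for every pixel at once.
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            spatial_w = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            range_w = np.exp(-((shifted - img) ** 2) / (2.0 * sigma_r ** 2))
            weight = spatial_w * range_w
            num += weight * shifted
            den += weight
    return num / den  # den >= 1 because the center weight is exactly 1
```

With these settings, a small intensity step (such as a block boundary) is smoothed away where saliency is low and largely preserved where saliency is high.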
Computer Speech & Language | 2018
Hafizur Rahman; Ahilan Kanagasundaram; Ivan Himawan; David Dean; Sridha Sridharan
Domain mismatch significantly affects speaker verification performance. Domain-invariant linear discriminant analysis (DI-LDA) for compensating domain mismatch in the LDA subspace. Domain-invariant probabilistic linear discriminant analysis (DI-PLDA) for domain mismatch modelling in the PLDA subspace. DI-LDA approach followed by the DI-PLDA (DI-PLDA[DI-LDA]) to compensate domain mismatch from both LDA and PLDA subspaces. Limited target domain data requirement using domain mismatch compensation techniques.
The performance of state-of-the-art i-vector speaker verification systems relies on a large amount of training data for probabilistic linear discriminant analysis (PLDA) modeling. During the evaluation, it is also crucial that the target condition data is matched well with the development data used for PLDA training. However, in many practical scenarios, these systems have to be developed and trained using data that is often outside the domain of the intended application, since the collection of a significant amount of in-domain data is often difficult. Experimental studies have found that PLDA speaker verification performance degrades significantly due to this development/evaluation mismatch. This paper introduces a domain-invariant linear discriminant analysis (DI-LDA) technique for out-domain PLDA speaker verification that compensates domain mismatch in the LDA subspace. We also propose a domain-invariant probabilistic linear discriminant analysis (DI-PLDA) technique for domain mismatch modeling in the PLDA subspace, using only a small amount of in-domain data. In addition, we propose the sequential and score-level combination of DI-LDA and DI-PLDA to further improve out-domain speaker verification performance. Experimental results show the proposed domain mismatch compensation techniques yield at least 27% and 14.5% improvement in equal error rate (EER) over a pooled PLDA system for telephone-telephone and interview-interview conditions, respectively. Finally, we show that the improvement over the baseline pooled system can be attained even when significantly reducing the number of in-domain speakers, down to 30 in most of the evaluation conditions.
Multimedia Tools and Applications | 2017
Ivan Himawan; Wei Song; Dian Tjondronegoro
At present, the most reliable method to obtain end-user perceived quality is through subjective tests. In this paper, the impact of automatic region-of-interest (ROI) coding on perceived quality of mobile video is investigated. The evidence, which is based on perceptual comparison analysis, shows that the coding strategy improves perceptual quality. This is particularly true in low bit rate situations. Two ROI detection approaches are used in this paper: (1) automatic ROI detection, which analyzes the visual content, and (2) eye-tracking based ROI detection, which aggregates eye-tracking data across many users and is used to evaluate both the accuracy of automatic ROI detection and the subjective quality of automatic ROI encoded video. The perceptual comparison analysis is based on subjective assessments with 54 participants, across different content types, screen resolutions, and target bit rates while comparing the two ROI detection methods. The results from the user study demonstrate that ROI-based video encoding has higher perceived quality compared to normal video encoded at a similar bit rate, particularly in the lower bit rate range.
Conference of the International Speech Communication Association | 2016
Houman Ghaemmaghami; Md. Hafizur Rahman; Ivan Himawan; David Dean; Ahilan Kanagasundaram; Sridha Sridharan; Clinton Fookes
This paper presents the QUT speaker recognition system, as a competing system in the Speakers In The Wild (SITW) speaker recognition challenge. Our proposed system achieved an overall ranking of second place in the main core-core condition evaluations of the SITW challenge. This system uses an i-vector/PLDA approach, with domain adaptation and a deep neural network (DNN) trained to provide feature statistics. The statistics are accumulated by using class posteriors from the DNN, in place of GMM component posteriors in a typical GMM-UBM i-vector/PLDA system. Once the statistics have been collected, the i-vector computation is carried out as in a GMM-UBM based system. We apply domain adaptation to the extracted i-vectors to ensure robustness against dataset variability; PLDA modelling is used to capture speaker and session variability in the i-vector space, and the processed i-vectors are compared using the batch likelihood ratio. The final scores are calibrated to obtain the calibrated likelihood scores, which are then used to carry out speaker recognition and evaluate the performance of the system. Finally, we explore the practical application of our system to the core-multi condition recordings of the SITW data and propose a technique for speaker recognition in recordings with multiple speakers.
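The statistics accumulation step described above can be illustrated with the standard zeroth- and first-order sufficient statistics, where the per-frame posteriors come from DNN output classes rather than GMM components. This is a minimal sketch with a hypothetical function name; it omits details such as centering the first-order statistics around component means, and it is not the QUT system's code.

```python
import numpy as np


def accumulate_stats(features, posteriors):
    """Accumulate sufficient statistics for i-vector extraction.

    features   : (T, D) array of acoustic feature frames
    posteriors : (T, C) array; row t holds the posterior of each class
                 (DNN output class or GMM component) for frame t
    Returns
    N : (C,)   zeroth-order statistics, N[c] = sum_t posteriors[t, c]
    F : (C, D) first-order statistics,  F[c] = sum_t posteriors[t, c] * x_t
    """
    N = posteriors.sum(axis=0)
    F = posteriors.T @ features
    return N, F
```

Because only the source of the posteriors changes, the downstream i-vector computation can stay the same as in a GMM-UBM based system, which matches the description in the abstract.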