Jaco Badenhorst
Council of Scientific and Industrial Research
Publication
Featured research published by Jaco Badenhorst.
Speech Communication | 2014
Nic J. de Vries; Marelie H. Davel; Jaco Badenhorst; Willem D. Basson; Febe de Wet; Etienne Barnard; Alta de Waal
Acoustic data collection for automatic speech recognition (ASR) purposes is a particularly challenging task when working with under-resourced languages, many of which are found in the developing world. We provide a brief overview of related data collection strategies, highlighting some of the salient issues pertaining to collecting ASR data for under-resourced languages. We then describe the development of a smartphone-based data collection tool, Woefzela, which is designed to function in a developing world context. Specifically, this tool is designed to function without any Internet connectivity, while remaining portable and allowing for the collection of multiple sessions in parallel; it also simplifies the data collection process by providing process support to various role players during the data collection process, and performs on-device quality control in order to maximise the use of recording opportunities. The use of the tool is demonstrated as part of a South African data collection project, during which almost 800 hours of ASR data was collected, often in remote, rural areas, and subsequently used to successfully build acoustic models for eleven languages. The on-device quality control mechanism (referred to as QC-on-the-go) is an interesting aspect of the Woefzela tool and we discuss this functionality in more detail. We experiment with different uses of quality control information, and evaluate the impact of these on ASR accuracy. Woefzela was developed for the Android Operating System and is freely available for use on Android smartphones.
Proceedings of the First Workshop on Language Technologies for African Languages | 2009
Jaco Badenhorst; Charl Johannes van Heerden; Marelie H. Davel; Etienne Barnard
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpus. We also report on phoneme distance measures across languages, and describe initial phone recognisers that were developed using this data.
Procedia Computer Science | 2016
Febe de Wet; Jaco Badenhorst; Thipe Modipa
The official languages of South Africa can still be classified as under-resourced with respect to the speech resources that are required for technology development. Harvesting speech data from existing sources is one means to create additional resources. The aim of the study reported on in this paper was to improve the harvesting and transcription accuracy of a corpus derived from parliamentary data. This aim was achieved by improving on the text normalisation process and pronunciation modelling as well as by iteratively training more accurate in-domain acoustic models. In this manner, more data could be harvested with higher confidence than using baseline pronunciation dictionaries and out-of-domain speech data.
Conference of the International Speech Communication Association | 2011
Nic J. de Vries; Jaco Badenhorst; Marelie H. Davel; Etienne Barnard; Alta de Waal
SLTU | 2014
Etienne Barnard; Marelie H. Davel; Charl Johannes van Heerden; Febe de Wet; Jaco Badenhorst
SLTU | 2012
Jaco Badenhorst; Alta de Waal; Febe de Wet
Language Resources and Evaluation | 2011
Jaco Badenhorst; Charl Johannes van Heerden; Marelie H. Davel; Etienne Barnard
Archive | 2011
Jaco Badenhorst; Marelie H. Davel; Etienne Barnard
Archive | 2010
Jaco Badenhorst; Marelie H. Davel; Etienne Barnard
Archive | 2012
Jaco Badenhorst; Marelie H. Davel; Etienne Barnard