Gerrit Botha
University of Cape Town
Publication
Featured research published by Gerrit Botha.
Computer Speech & Language | 2012
Gerrit Botha; Etienne Barnard
The classification accuracy of text-based language identification depends on several factors, including the size of the text fragment to be identified, the amount of training data available, the classification features and algorithm employed, and the similarity of the languages to be identified. To date, no systematic study of these factors and their interactions has been published. We therefore investigate the effects of each of these factors and their relations on the performance of text-based language identification. Our study uses n-gram statistics as features for classification. In particular, we compare support vector machines, Naive Bayesian and difference-in-frequency classifiers on different amounts of input text and various values of n for different amounts of training data. For a fixed value of n the support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. The additional computational complexity of training the support vector machine classifier may not be justified in light of the importance of using a large value of n, except possibly for small sizes of the input window when limited training data is available. Our training and testing corpora consisted of text from the 11 official languages of South Africa. Within these languages distinct language families can be found. We find that it is much more difficult to discriminate languages within language families than languages in different families. The overall accuracy on short input strings is low for this reason, but for input strings of 100 characters or more there is only a slight confusion within families and accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when languages in different families are grouped together, this corresponds to a usable 95.1% accuracy.
The relationship between the amount of training data and the accuracy achieved is found to depend on the window size: for the largest window (300 characters) about 400,000 characters are sufficient to achieve close-to-optimal accuracy, whereas improvements in accuracy are found even beyond 1.6 million characters of training data for smaller windows. Our study concludes that the correlation between the factors studied significantly affects classification accuracy; therefore, to assure credible and comparable results, these factors need to be controlled in any text-based language identification task.
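As a rough illustration of the n-gram approach the abstract describes, the sketch below trains a Naive Bayes classifier on character n-gram counts and labels a short input window. It is not the authors' implementation: the class name `NgramNaiveBayes`, the helper `char_ngrams`, the add-one smoothing, the uniform language prior, and the two-sentence toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram counts
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # all n-grams seen in any language

    def fit(self, samples):
        """Accumulate n-gram counts from (language, text) pairs."""
        for lang, text in samples:
            grams = char_ngrams(text.lower(), self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        return self

    def log_score(self, lang, text):
        """Log-likelihood of the input window under one language.

        Add-one smoothing is an assumption of this sketch; the extra +1
        in the denominator reserves probability mass for unseen n-grams.
        """
        denom = self.totals[lang] + len(self.vocab) + 1
        counts = self.counts[lang]
        return sum(math.log((counts[g] + 1) / denom)
                   for g in char_ngrams(text.lower(), self.n))

    def predict(self, text):
        """Return the language with the highest log-likelihood (uniform prior)."""
        return max(self.counts, key=lambda lang: self.log_score(lang, text))


# Toy training data (illustrative only; far smaller than any real corpus).
toy_corpus = [
    ("en", "the quick brown fox jumps over the lazy dog and the cat sleeps in the house"),
    ("af", "die vinnige bruin jakkals spring oor die lui hond en die kat slaap in die huis"),
]
model = NgramNaiveBayes(n=3).fit(toy_corpus)
print(model.predict("the house is over there"))
print(model.predict("die hond slaap in die huis"))
```

Varying `n`, the length of the string passed to `predict`, and the amount of text passed to `fit` corresponds to the three factors the study varies: n-gram order, input window size, and training set size.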
Nature Communications | 2017
Ananyo Choudhury; Michele Ramsay; Scott Hazelhurst; Shaun Aron; Soraya Bardien; Gerrit Botha; Emile R. Chimusa; Alan Christoffels; Junaid Gamieldien; Mahjoubeh J. Sefid-Dashti; Fourie Joubert; Ayton Meintjes; Nicola Mulder; Raj Ramesar; Jasper Rees; Kathrine Scholtz; Dhriti Sengupta; Himla Soodyall; Philip Venter; Louise Warnich; Michael S. Pepper
The Southern African Human Genome Programme is a national initiative that aspires to unlock the unique genetic character of southern African populations for a better understanding of human genetic diversity. In this pilot study the Southern African Human Genome Programme characterizes the genomes of 24 individuals (8 Coloured and 16 black southeastern Bantu-speakers) using deep whole-genome sequencing. A total of ~16 million unique variants are identified. Despite the shallow time depth since divergence between the two main southeastern Bantu-speaking groups (Nguni and Sotho-Tswana), principal component analysis and structure analysis reveal significant (p < 10⁻⁶) differentiation, and FST analysis identifies regions with high divergence. The Coloured individuals show evidence of varying proportions of admixture with Khoesan, Bantu-speakers, Europeans, and populations from the Indian sub-continent. Whole-genome sequencing data reveal extensive genomic diversity, increasing our understanding of the complex and region-specific history of African populations and highlighting its potential impact on biomedical research and genetic susceptibility to disease. African populations show a high level of genetic diversity and extensive regional admixture. Here, the authors sequence the whole genomes of 24 South African individuals of different ethnolinguistic origin and find substantive genomic divergence between two southeastern Bantu-speaking groups.
Infection and Immunity | 2017
Katie Lennard; Smritee Dabee; Shaun L. Barnabas; Enock Havyarimana; Anna K. Blakney; Shameem Z. Jaumdally; Gerrit Botha; Nonhlanhla N. Mkhize; Linda-Gail Bekker; David A. Lewis; Glenda Gray; Nicola Mulder; Jo-Ann S. Passmore; Heather B. Jaspan; Vincent B. Young
Young African females are at an increased risk of HIV acquisition, and genital inflammation or the vaginal microbiome may contribute to this risk. We studied these factors in 168 HIV-negative South African adolescent females aged 16 to 22 years. Unsupervised clustering of 16S rRNA gene sequences revealed three clusters (subtypes), one of which was strongly associated with genital inflammation. In a multivariate model, the microbiome compositional subtype and hormonal contraception were significantly associated with genital inflammation. We identified 40 taxa significantly associated with inflammation, including those reported previously (Prevotella, Sneathia, Aerococcus, Fusobacterium, and Gemella) as well as several novel taxa (including increased frequencies of bacterial vaginosis-associated bacterium 1 [BVAB1], BVAB2, BVAB3, Prevotella amnii, Prevotella pallens, Parvimonas micra, Megasphaera, Gardnerella vaginalis, and Atopobium vaginae and decreased frequencies of Lactobacillus reuteri, Lactobacillus crispatus, Lactobacillus jensenii, and Lactobacillus iners). Women with inflammation-associated microbiomes had significantly higher body mass indices and lower levels of endogenous estradiol and luteinizing hormone. Community functional profiling revealed three distinct vaginal microbiome subtypes, one of which was characterized by extreme genital inflammation and persistent bacterial vaginosis (BV); this subtype could be predicted with high specificity and sensitivity based on the Nugent score (≥9) or BVAB1 abundance. We propose that women with this BVAB1-dominated subtype may have chronic genital inflammation due to persistent BV, which may place them at a particularly high risk for HIV infection.
PLOS Computational Biology | 2017
C. Victor Jongeneel; Ovokeraye Achinike-Oduaran; Ezekiel Adebiyi; Marion O. Adebiyi; Seun Adeyemi; Bola Akanle; Shaun Aron; Efejiro Ashano; Hocine Bendou; Gerrit Botha; Emile R. Chimusa; Ananyo Choudhury; Ravikiran Donthu; Jenny Drnevich; Oluwadamila Falola; Christopher J. Fields; Scott Hazelhurst; Liesl M. Hendry; Itunuoluwa Isewon; Radhika S. Khetani; Judit Kumuthini; Magambo Phillip Kimuda; Lerato Magosi; Liudmila Sergeevna Mainzer; Suresh Maslamoney; Mamana Mbiyavanga; Ayton Meintjes; Danny Mugutso; Phelelani T. Mpangase; Richard J. Munthali
The H3ABioNet pan-African bioinformatics network, which is funded to support the Human Heredity and Health in Africa (H3Africa) program, has developed node-assessment exercises to gauge the ability of its participating research and service groups to analyze typical genome-wide datasets being generated by H3Africa research groups. We describe a framework for the assessment of computational genomics analysis skills, which includes standard operating procedures, training and test datasets, and a process for administering the exercise. We present the experiences of 3 research groups that have taken the exercise and the impact on their ability to manage complex projects. Finally, we discuss the reasons why many H3ABioNet nodes have declined so far to participate and potential strategies to encourage them to do so.
AAS Open Research | 2018
Azza Elgaili Ahmed; Phelelani T. Mpangase; Sumir Panji; Shakuntala Baichoo; Gerrit Botha; Faisal M. Fadlelmola; Scott Hazelhurst; Peter van Heusden; C. Victor Jongeneel; Fourie Joubert; Liudmila Sergeevna Mainzer; Ayton Meintjes; Don Armstrong; Michael R. Crusoe; Brian O'Connor; Yassine Souilmi; Mustafa Alghali; Shaun Aron; Hocine Bendou; Eugene De Beste; Mamana Mbiyavanga; Oussema Souiai; Long Yi; Jennie Zermeno; Nicola Mulder
The need for portable and reproducible genomics analysis pipelines is growing globally as well as in Africa, especially with the growth of collaborative projects like the Human Heredity and Health in Africa Consortium (H3Africa). The Pan-African H3Africa Bioinformatics Network (H3ABioNet) recognized the need for portable, reproducible pipelines adapted to heterogeneous compute environments, and for the nurturing of technical expertise in workflow languages and containerization technologies. To address this need, in 2016 H3ABioNet arranged its first Cloud Computing and Reproducible Workflows Hackathon, with the purpose of building key genomics analysis pipelines able to run on heterogeneous computing environments and meeting the needs of H3Africa research projects. This paper describes the preparations for this hackathon and reflects upon the lessons learned about its impact on building the technical and scientific expertise of African researchers. The workflows developed were made publicly available in GitHub repositories and deposited as container images on quay.io.
Standards in Genomic Sciences | 2015
Shakuntala Baichoo; Gerrit Botha; Yasmina Jaufeerally-Fakim; Zahra Mungloo-Dilmohamud; Daniel Lundin; Nicola Mulder; Vasilis J. Promponas; Christos A. Ouzounis
In the context of recent international initiatives to bolster genomics research for Africa, and more specifically to develop bioinformatics expertise and networks across the continent, a workshop on computational metagenomics was organized at the end of 2014 at the University of Mauritius. The workshop offered background on various aspects of computational biology, including databases and algorithms, sequence analysis fundamentals, metagenomics concepts and tools, practical exercises, journal club activities and research seminars. We have discovered a strong interest in metagenomics research across Africa, to advance practical applications both for human health and the environment. We have also realized the great potential to develop genomics and bioinformatics through collaborative efforts across the continent, and the need for further reinforcing the untapped human potential and exploring the natural resources for stronger engagement of local scientific communities, with a view to contributing towards the improvement of human health and well-being for the citizens of Africa.
Archive | 2005
Gerrit Botha; Etienne Barnard
Scientific Reports | 2018
Shantelle Claassen-Weitz; Sugnet Gardner-Lubbe; Paul Nicol; Gerrit Botha; Stephanie Mounaud; Jyoti Shankar; William C. Nierman; Nicola Mulder; Shrish Budree; Heather J. Zar; Mark P. Nicol; Mamadou Kaba
Suid-Afrikaanse Tydskrif vir Natuurwetenskap en Tegnologie | 2017
Katie Lennard; Smritee Dabee; Shaun L. Barnabas; Enock Havyarimana; Shameem Z. Jaumdally; Gerrit Botha; Nonhlanhla N. Mkhize; Linda-Gail Bekker; Glenda Gray; Nicola Mulder; Jo-Ann S. Passmore; Heather B. Jaspan