Publication


Featured research published by Kassim Mwitondi.


Journal of Applied Statistics | 2013

Data mining with Rattle and R

Kassim Mwitondi

In this book, Graham Williams presents the reader with a comprehensive treatment of data mining, from data understanding and preparation through model development, evaluation and refinement to practical deployment. Structured in four main parts – exploration, model building, performance and appendices – the book provides a coherent link between data, tools, models and performance. The first seven chapters, focusing on the fundamentals of R and Rattle (a graphical interface for data mining using R), data formats, distributions and visualisation, highlight the exploratory power of R and Rattle. Unsupervised and supervised modelling techniques are detailed in the second part of the book, followed by performance assessment and deployment in the third part. Deriving from this structure is one of the book’s distinctive features: its focus on the hands-on, end-to-end process of data mining using the open-source tools Rattle and R, which makes it particularly interesting to both students and practitioners of data mining. It is quite possible for the R novice to find this book hard to access, given the substantial R graphical-user-interface and programming skills it assumes. However, while this may appear to be a downside at first glance, reading the book reveals that its structure and writing style make it easily adaptable to other software applications. Further, despite both R and Rattle being version-variant, the book is cushioned against version obsolescence by well-balanced and integrated discussions of the data mining process and the adopted tools and methods. Thus, data mining students and practitioners, with or without a working knowledge of R, will find this book to be at least a good supplement to their existing tools and procedures. As a regular R user in a data mining environment, I found the book extremely useful and insightful, with great potential for improvement.
In particular, there is scope for enhancing the discussions of the tuning parameters for each of the models in Part II as they are fundamental to data mining results. For instance, expanding on the role of the cost and sigma parameters on pages 299 and 300 may provide useful intuition to the R novice.
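The cost and sigma parameters discussed in the review belong to Rattle's support-vector-machine model. As a rough illustration of sigma's role only, here is a minimal sketch assuming the common Gaussian parameterisation k(x, z) = exp(-||x - z||² / (2σ²)); note that Rattle's kernlab backend writes the kernel as exp(-sigma · ||x - z||²), so the exact form below is an assumption, not the book's code:

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Similarity of two points under a Gaussian RBF kernel.

    Smaller sigma -> narrower kernel -> more flexible (wigglier) decision
    boundary; larger sigma -> smoother boundary.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The cost parameter, by contrast, is not part of the kernel: it weights the penalty on misclassified training points, trading training accuracy against boundary smoothness.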


Data Science Journal | 2013

A Data-Driven Method for Selecting Optimal Models Based on Graphical Visualisation of Differences in Sequentially Fitted ROC Model Parameters

Kassim Mwitondi; Rida E. Moustafa; Ali S. Hadi

Differences in modelling techniques and model performance assessments typically impinge on the quality of knowledge extraction from data. We propose an algorithm for determining optimal patterns in data by separately training and testing three decision tree models on the Pima Indians Diabetes and Bupa Liver Disorders datasets. Model performance is assessed using ROC curves and the Youden Index. Moving differences between sequentially fitted parameters are then extracted, and their respective probability density estimates are used to track their variability via an iterative graphical data visualisation technique developed for this purpose. Our results show that the proposed strategy separates the groups more robustly than the plain ROC/Youden approach, eliminates obscurity and minimises over-fitting. Further, the algorithm can easily be understood by non-specialists and demonstrates multi-disciplinary compliance.
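The Youden Index referred to in the abstract is J = sensitivity + specificity − 1, maximised over candidate cut-offs on a classifier's scores. A minimal sketch of that selection step (a hypothetical helper, not the authors' implementation):

```python
def youden_optimal_threshold(scores, labels):
    """Pick the score threshold maximising the Youden Index
    J = sensitivity + specificity - 1 (illustration only)."""
    pos = sum(labels)                # number of class-1 cases
    neg = len(labels) - pos         # number of class-0 cases
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):   # each observed score is a candidate cut-off
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg     # = sensitivity - (1 - specificity)
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

On a perfectly separable toy sample the optimal threshold attains J = 1; in practice J lies between 0 and 1, and the paper tracks differences between such sequentially fitted parameters rather than a single optimum.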


International Conference on Neural Information Processing | 2012

A sequential data mining method for modelling solar magnetic cycles

Kassim Mwitondi; Raeed T. Said; Adil Yousif

We propose an adaptive data-driven approach to modelling solar magnetic activity cycles, based on a sequential link between unsupervised and supervised modelling. Monthly sunspot numbers spanning from the mid-18th century to the first quarter of 2012, obtained from the Royal Greenwich Observatory, provide a reliable source of training and validation sets. An indicator variable is used to generate class labels and internal parameters, which are used to separate high from low activity cycles. Our results show that by maximising data-dependent parameters and using them as inputs to a support vector machine model we obtain comparatively more robust and reliable predictions. Finally, we demonstrate how the method can be adapted to other unsupervised and supervised modelling applications.
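The abstract does not specify the indicator variable, so as one plausible sketch, each month could be labelled high or low activity by comparing a centred moving average of the sunspot series against its median; both the window length and the cut-off below are assumptions for illustration, not the paper's choices:

```python
from statistics import median

def label_activity(sunspots, window=13):
    """Label each month 1 (high activity) or 0 (low activity).

    A centred moving average smooths the monthly counts; months whose
    smoothed value exceeds the series median are labelled high.
    """
    half = window // 2
    smoothed = []
    for i in range(len(sunspots)):
        lo, hi = max(0, i - half), min(len(sunspots), i + half + 1)
        smoothed.append(sum(sunspots[lo:hi]) / (hi - lo))
    cut = median(smoothed)
    return [1 if s > cut else 0 for s in smoothed]
```

Labels produced this way could then serve as the target for a supervised model such as a support vector machine, mirroring the unsupervised-to-supervised link the paper describes.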


International Journal of Intelligent Computing and Cybernetics | 2016

Detection of natural structures and classification of HCI-HPR data using robust forward search algorithm

Fatima Isiaka; Kassim Mwitondi; Adamu M. Ibrahim

Purpose – The purpose of this paper is to propose a forward search algorithm for detecting and identifying natural structures arising in human-computer interaction (HCI) and human physiological response (HPR) data. Design/methodology/approach – The paper portrays aspects that are essential to modelling and precision in detection. The method involves a developed algorithm for detecting outliers in data so as to recognise natural patterns in continuous data streams such as HCI-HPR data. The detected categorical data are simultaneously labelled based on their reliance on parametric rules tied to the predictive models used in classification algorithms. Data were also simulated based on the multivariate normal distribution method and used to compare and validate the original data. Findings – Results show that the forward search method provides robust features that are capable of repelling over-fitting in physiological and eye movement data. Research limitations/implications – One of the limitations of the robust forward search algori...


WIT Transactions on Information and Communication Technologies | 2014

A kernel density smoothing method for determining an optimal number of clusters in continuous data

J. Bugrien; Kassim Mwitondi; F. Shuweihdi

While data clustering algorithms are becoming increasingly popular across scientific, industrial and social data mining applications, model complexity remains a major challenge. Most clustering algorithms do not incorporate a mechanism for finding an optimal scale parameter that corresponds to an appropriate number of clusters. We propose a kernel-density smoothing-based approach to data clustering. Its main ideas derive from two unsupervised clustering approaches – kernel density estimation (KDE) and scale-space clustering (SSC). The novel method determines the optimal number of clusters by first finding dense regions in the data before separating them based on data-dependent parameter estimates. The optimal number of clusters is determined from different levels of smoothing, after the inherent number of arbitrary-shape clusters has been detected without a priori information. We demonstrate the applicability of the proposed method under both nested and non-nested hierarchical clustering methodologies. Simulated and real data results are presented to validate the performance of the method, with repeated runs showing high accuracy and reliability.


Journal of Applied Statistics | 2013

Statistical computing in C++ and R

Kassim Mwitondi

The rationale of this text appears to derive from the authors’ perception of statistics and computing as increasingly inseparable as computational power and data acquisition and consumption grow exponentially. The choice of C++ and R is predicated on the popularity of the languages within the computing and statistical communities, respectively. Thus, while interfacing C++ programming with R is common practice among R developers and contributors, this book extends that practice to statistical computing practitioners, presenting the combined power of C++ and R programming skills in tackling statistical problems. Divided into 11 chapters, it addresses a wide range of related topics, from an overview of object-oriented programming to parallel computing in both C++ and R. There are a number of unique features you would typically not find in an off-the-shelf programming or statistical textbook. For instance, its main content embodies a collection of building blocks for converting statistical problems into their computational analogues. This writing style potentially influences the reader to think algorithmically – a skill not necessarily possessed by many statisticians. Further, the book outlines ways of mapping various data types in both directions, forming a useful resource for facilitating data sharing among multi-disciplinary data analysis teams. The book requires programming and statistical maturity and is clearly a good reference for developers of statistical applications. However, the downside is that, despite (or because of) its wide scope of coverage and its high level of intricacy, the book can expect only a limited audience. Despite being related, most chapters do not appear to strongly support the coherent nature upon which the book is predicated. Most importantly, Appendices A–E are only marginally useful for a C++ and R novice seeking to understand the core text.
As a regular R user I found some material completely alien and detached but I also found a lot of useful and readily adaptable material. I believe the same may apply to a regular C++ user and so to widen the target audience, future editions may need to build in more coherence than is currently available. Again, in its current form, the book is an excellent reference for statistical programmers interfacing C++ and R.


Journal of Applied Statistics | 2012

Statistical data mining using SAS applications

Kassim Mwitondi

The book can be viewed as a specialised tool for SAS data analysis. Divided into seven main sections, it addresses a wide range of analytical topics, from an introduction to data mining to core unsupervised and supervised learning techniques. Its key features include the provision of case studies throughout the sections, downloadable macros and instructions on how to run them. A working knowledge of SAS is expected, but there is no requirement for either mathematical or SAS programming maturity. The step-by-step instructions and the graphical representations of data make it particularly useful to those wishing to communicate complex and technical data to a largely non-specialist audience. As a regular SAS user, over the years I have noted a general feeling of confusion between the conventional SAS application and SAS Enterprise Guide (SAS EG), especially among some members of the non-SAS community. In Appendix II, the author highlights the incompatibility of some of the book’s accompanying macros and refers the reader to SAS EG-compatible macros. The macro files, macro-call files and sample data sets used in the examples must all be downloaded from the book’s website. Although this is a good feature, in that future macro updates may be uploaded to the site, including these files on a CD attached to the book would probably have greatly enhanced its scope. Although the book may be viewed as software-specific, with the potential risk of sliding into obsolescence as new SAS versions appear, its examples help to develop a software-independent understanding of data mining. While as a regular SAS user I could easily read and follow the examples, it is very likely that those new to the SAS and SAS EG environments will find navigation through the book somewhat awkward.
Furthermore, the lack of explicit statistical computing examples and the exclusion of procedural routines amount to carrying out a background demonstration of data mining without showing the reader how to do it. This adopted style seems to have obscured the relevance of Chapter 7. It is probably fair to say that, in its current form, the book can only be useful as a reference to people who routinely use SAS applications or as a supplement to a statistics or data mining course with a significant SAS component. Future editions may need to enhance some of the faintly visible graphics such as the screenshots on pages 235 and 238 and typos such as the repeated “p-value value” on page 160.


The Journal of Supercomputing | 2018

A statistical downscaling framework for environmental mapping

Kassim Mwitondi; Farha A. Al-Kuwari; Raed A. Saeed; Shahrzad Zargari

In recent years, knowledge extraction from data has become increasingly popular, with many numerical forecasting models falling mainly into two major categories: chemical transport models (CTMs) and conventional statistical methods. However, due to data and model variability, data-driven knowledge extraction from high-dimensional, multifaceted data in such applications requires generalisation from global to regional or local conditions. Typically, generalisation is achieved via mapping global conditions to local ecosystems and human habitats, which amounts to tracking and monitoring environmental dynamics in various geographical areas and their regional and global implications for human livelihood. Statistical downscaling techniques have been widely used to extract high-resolution information from regional-scale variables produced by CTMs in climate models. Conventional applications of these methods are predominantly dimension-reduction in nature, designed to reduce the spatial dimension of gridded model outputs without loss of essential spatial information. Their downside is twofold: complete dependence on an unlabelled design matrix and reliance on underlying distributional assumptions. We propose a novel statistical downscaling framework for dealing with data and model variability. Its power derives from training and testing multiple models on multiple samples, narrowing down global environmental phenomena to regional discordance through dimension reduction and visualisation. Hourly ground-level ozone observations were obtained from various environmental stations maintained by the US Environmental Protection Agency, covering the summer period (June–August 2005). Regional patterns of ozone are related to local observations via repeated runs and performance assessment of multiple versions of empirical orthogonal functions (or principal components) and principal fitted components, via an algorithm with fully adaptable parameters.
We demonstrate how the algorithm can be extended to weather-dependent and other applications with inherent data randomness and model variability via its built-in interdisciplinary computational power that connects data sources with end-users.
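The empirical orthogonal functions (EOFs) used in the framework are principal components of the anomaly field. As a much-reduced sketch, the leading EOF of just two hypothetical station series can be computed in closed form from the 2x2 covariance matrix; the paper's algorithm operates on full gridded fields with adaptable parameters, so this is illustration only:

```python
import math

def leading_eof(series_a, series_b):
    """Leading principal direction (EOF1) and its eigenvalue for two
    anomaly series, via the closed-form eigendecomposition of the
    2x2 covariance matrix. Degenerate case cab == 0 is not handled."""
    n = len(series_a)
    ma, mb = sum(series_a) / n, sum(series_b) / n
    a = [x - ma for x in series_a]          # anomalies of station A
    b = [x - mb for x in series_b]          # anomalies of station B
    caa = sum(x * x for x in a) / n
    cbb = sum(x * x for x in b) / n
    cab = sum(x * y for x, y in zip(a, b)) / n
    # larger eigenvalue of [[caa, cab], [cab, cbb]]
    tr, det = caa + cbb, caa * cbb - cab * cab
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
    # eigenvector for lam, normalised to unit length
    vx, vy = cab, lam - caa
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam
```

The eigenvalue measures the variance captured by the leading pattern; on gridded data the same decomposition, applied to the full covariance matrix, yields the spatial EOF maps the abstract refers to.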


Journal of Statistics Applications & Probability | 2016

A parameter leveraging method for unsupervised big data modelling

Kassim Mwitondi; Eman Khorsheed

Increasingly sophisticated methods and tools are needed for tracking the dynamics and detecting inherent structures in modern-day, highly voluminous, multi-faceted data. Data scientists have long realised that tackling global challenges such as climate change, terrorism and food security cannot be contained within the frameworks and models of conventional data analysis. For example, separating noise from meaningful data, even in low-dimensional data with heavy tails and/or overlaps, is quite challenging, and standard non-linear approaches do not always succeed. Tracking the dynamics of multi-faceted data involving complex systems is tantamount to tracking agent-based complex systems with many interacting agents. Dimension-reduction methods are commonly used to try to capture structures inherent in data, but they do not generally lead to optimal solutions, mainly because their optimisation functions and theoretical methods typically rely on special structures. We propose a parameter leveraging method for unsupervised big data modelling. The method searches for structures in data and creates a series of sub-structures which are subsequently merged or split. The strategy is to present the algorithm with a set of periodic data as one complex system; it then uses the patterns in the sub-structures to determine the overall behaviour of the complex system. Applications to solar magnetic activity cycles and seismic data show that the proposed method outperforms conventional unsupervised methods. We illustrate how the method can be extended to supervised modelling.


Data Science Journal | 2012

Harnessing Data Flow and Modelling Potentials for Sustainable Development

Kassim Mwitondi; Jamal B. Bugrien

Tackling the global challenges relating to health, poverty, business, and the environment is heavily dependent on the flow and utilisation of data. However, while enhancements in data generation, storage, modelling, dissemination, and the related integration of global economies and societies are fast transforming the way we live and interact, the resulting dynamic, globalised information society remains digitally divided. On the African continent in particular, this division has resulted in a gap between knowledge generation and its transformation into tangible products and services. This paper proposes some fundamental approaches for a sustainable transformation of data into knowledge for the purpose of improving people’s quality of life. Its main strategy is based on a generic data-sharing model providing access to data-utilising and data-generating entities in a multi-disciplinary environment. It highlights the great potential of using unsupervised and supervised modelling in tackling the typically predictive-in-nature challenges we face. Using both simulated and real data, the paper demonstrates how some of the key parameters may be generated and embedded in models to enhance their predictive power and reliability. The paper’s conclusions include a proposed implementation framework setting the scene for the creation of decision support systems capable of addressing the key issues in society. It is expected that a sustainable data flow will forge synergies among the private sector, academic, and research institutions within and among countries. It is also expected that the paper’s findings will help in the design and development of knowledge extraction from data in the wake of cloud computing and, hence, contribute towards the improvement of people’s overall quality of life. To avoid running up high implementation costs, selected open-source tools are recommended for developing and sustaining the system.

Collaboration


Dive into Kassim Mwitondi's collaboration.

Top Co-Authors

Raed Said – Al Ain University of Science and Technology
K. Wang – Sheffield Hallam University
Shahrzad Zargari – Sheffield Hallam University
G I E Ekosse – Walter Sisulu University
Fatima Isiaka – Sheffield Hallam University
Rida E. Moustafa – George Washington University
Ali S. Hadi – American University in Cairo