Karel Dejaeger
Katholieke Universiteit Leuven
Publications
Featured research published by Karel Dejaeger.
European Journal of Operational Research | 2012
Wouter Verbeke; Karel Dejaeger; David Martens; J Hur; Bart Baesens
Customer churn prediction models aim to identify the customers with the highest propensity to attrite, allowing companies to improve the efficiency of customer retention campaigns and to reduce the costs associated with churn. Although cost reduction is their prime objective, churn prediction models are typically evaluated using statistically based performance measures, resulting in suboptimal model selection. Therefore, in the first part of this paper, a novel, profit-centric performance measure is developed by calculating the maximum profit that can be generated by including the optimal fraction of customers with the highest predicted probabilities to attrite in a retention campaign. The novel measure selects the optimal model and fraction of customers to include, yielding a significant increase in profits compared to statistical measures.
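The maximum-profit idea described in this abstract can be sketched as follows: rank customers by predicted churn probability, compute the campaign profit for each inclusion fraction, and keep the fraction (and model) that maximizes it. All monetary parameters below (customer lifetime value, contact cost, offer cost, acceptance rate) are illustrative assumptions, not the paper's calibrated values.

```python
# Sketch of a profit-based evaluation criterion in the spirit of the
# maximum-profit measure. Parameter values are invented for illustration.

def campaign_profit(sorted_is_churner, fraction, clv=200.0, contact_cost=1.0,
                    offer_cost=10.0, accept_rate=0.3):
    """Profit of contacting the top `fraction` of customers ranked by
    predicted churn probability. `sorted_is_churner` holds 1/0 per customer,
    ordered from highest to lowest predicted probability."""
    n = len(sorted_is_churner)
    k = max(1, int(round(fraction * n)))
    contacted = sorted_is_churner[:k]
    true_churners = sum(contacted)
    # Retained churners generate CLV; every contact costs money, and
    # accepted offers cost the incentive regardless of who accepts.
    gain = accept_rate * true_churners * clv
    cost = k * contact_cost + accept_rate * k * offer_cost
    return gain - cost

def maximum_profit(sorted_is_churner, steps=100):
    """Scan inclusion fractions and return (best_profit, best_fraction)."""
    return max((campaign_profit(sorted_is_churner, f / steps), f / steps)
               for f in range(1, steps + 1))
```

With 10 actual churners ranked at the top of 100 customers, the scan settles on contacting exactly the top 10%, which is the behaviour the measure is designed to reward.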
IEEE Transactions on Software Engineering | 2012
Karel Dejaeger; Wouter Verbeke; David Martens; Bart Baesens
A predictive model must be accurate and comprehensible in order to inspire confidence in a business setting. Both aspects have been assessed in a software effort estimation setting by previous studies. However, no univocal conclusion as to which technique is the most suitable has been reached. This study addresses this issue by reporting on the results of a large-scale benchmarking study. Different types of techniques are under consideration, including techniques inducing tree/rule-based models like M5 and CART, linear models such as various types of linear regression, nonlinear models (MARS, multilayered perceptron neural networks, radial basis function networks, and least squares support vector machines), and estimation techniques that do not explicitly induce a model (e.g., a case-based reasoning approach). Furthermore, the aspect of feature subset selection by using a generic backward input selection wrapper is investigated. The results are subjected to rigorous statistical testing and indicate that ordinary least squares regression in combination with a logarithmic transformation performs best. Another key finding is that selecting a subset of highly predictive attributes, such as project size and development- and environment-related attributes, typically yields a significant increase in estimation accuracy.
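The best-performing approach reported above, ordinary least squares on log-transformed data, amounts to fitting log(effort) = a + b·log(size) and back-transforming the prediction. A minimal sketch, using made-up project sizes and efforts rather than a real dataset such as ISBSG:

```python
import math

# Closed-form simple OLS on log-transformed effort data: a sketch of the
# log-linear regression the benchmark found to perform best. The data points
# used with it are illustrative, not taken from the study.

def fit_log_ols(sizes, efforts):
    """Fit log(effort) = a + b * log(size) by closed-form simple OLS."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def predict_effort(a, b, size):
    """Back-transform the log-scale prediction to the original effort scale."""
    return math.exp(a + b * math.log(size))
```

The log transformation turns the multiplicative, right-skewed relation between project size and effort into an additive one that OLS handles well, which is a common explanation for its strong showing in effort estimation benchmarks.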
IEEE Transactions on Software Engineering | 2013
Karel Dejaeger; Thomas Verbraken; Bart Baesens
Software testing is a crucial activity during software development, and fault prediction models assist practitioners herein by providing an upfront identification of faulty software code, drawing upon the machine learning literature. While especially the Naive Bayes classifier is often applied in this regard, citing predictive performance and comprehensibility as its major strengths, a number of alternative Bayesian algorithms that offer the possibility of constructing simpler networks with fewer nodes and arcs remain unexplored. This study contributes to the literature by considering 15 different Bayesian Network (BN) classifiers and comparing them to other popular machine learning techniques. Furthermore, the applicability of the Markov blanket principle for feature selection, which is a natural extension to BN theory, is investigated. The results, both in terms of the AUC and the recently introduced H-measure, are rigorously tested using the statistical framework of Demšar. It is concluded that simple and comprehensible networks with fewer nodes can be constructed using BN classifiers other than the Naive Bayes classifier. Furthermore, it is found that comprehensibility and predictive performance need to be balanced against each other, and that the development context should also be taken into account during model selection.
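The Markov blanket principle mentioned above says that, in a Bayesian network, a target node is shielded from all other nodes by its parents, its children, and its children's other parents, so only those features need to be kept. A toy sketch (the defect-prediction network below is invented, not one from the paper):

```python
# Illustrative sketch of Markov-blanket feature selection: the blanket of the
# target consists of its parents, children, and spouses (children's other
# parents). The network structure here is a hypothetical example.

def markov_blanket(target, parents):
    """`parents` maps each node to the set of its parent nodes."""
    blanket = set(parents.get(target, set()))              # target's parents
    children = {n for n, ps in parents.items() if target in ps}
    blanket |= children                                    # target's children
    for child in children:                                 # spouses
        blanket |= parents[child] - {target}
    return blanket

# Hypothetical structure:
#   loc -> complexity -> defect <- churn,  {defect, testing} -> failures
net = {
    "complexity": {"loc"},
    "defect": {"complexity", "churn"},
    "failures": {"defect", "testing"},
    "loc": set(),
    "churn": set(),
    "testing": set(),
}
```

Given this structure, the blanket of "defect" contains complexity, churn, failures, and testing; "loc" can be discarded, which is exactly the feature-selection effect the study investigates.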
European Journal of Operational Research | 2012
Karel Dejaeger; Frank Goethals; Antonio Giangreco; Lapo Mola; Bart Baesens
As a consequence of the heightened competition in the education market, the management of educational institutions often attempts to collect information on what drives student satisfaction by, for example, organizing large-scale surveys among the student population. Until now, this source of potentially very valuable information has remained largely untapped. In this study, we address this issue by investigating the applicability of different data mining techniques to identify the main drivers of student satisfaction in two business education institutions. In the end, the resulting models are to be used by management to support the strategic decision-making process. Hence, the aspect of model comprehensibility is considered to be at least as important as model performance. It is found that data mining techniques are able to select a surprisingly small number of constructs that require attention in order to manage student satisfaction.
International Conference on e-Business | 2010
Baojun Ma; Karel Dejaeger; Jan Vanthienen; Bart Baesens
In software defect prediction, predictive models are estimated based on various code attributes to assess the likelihood of software modules containing errors. Many classification methods have been suggested to accomplish this task. However, association-based classification methods have not been investigated so far in this context. This paper assesses the use of such a classification method, CBA2, and compares it to other rule-based classification methods. Furthermore, we investigate whether rule sets generated on data from one software project can be used to predict defective software modules in other, similar software projects. It is found that applying the CBA2 algorithm results in both accurate and comprehensible rule sets.
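An associative classifier in the CBA family classifies a module with the first rule, in a confidence-ordered rule list, whose antecedent it satisfies, falling back to a default class. A minimal sketch with invented rules and attribute names (real CBA2 additionally handles rule pruning and multiple minimum supports):

```python
# Sketch of how a CBA-style associative classifier applies its rule list.
# Rules and module attributes are hypothetical examples.

def classify(module_attrs, rules, default="clean"):
    """`rules`: list of (antecedent_set, predicted_class, confidence),
    assumed pre-sorted by confidence (then support), as CBA orders them.
    Returns the class of the first rule whose antecedent matches."""
    for antecedent, label, _conf in rules:
        if antecedent <= module_attrs:    # all rule conditions satisfied
            return label
    return default                        # no rule fired

rules = [
    ({"high_complexity", "many_changes"}, "defective", 0.91),
    ({"low_loc"}, "clean", 0.85),
]
```

Because each rule is a plain itemset-to-class implication, the resulting model is easy to inspect, which matches the comprehensibility finding reported for CBA2.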
Journal of Systems and Software | 2015
Julie Moeyersoms; Enric Junqué de Fortuny; Karel Dejaeger; Bart Baesens; David Martens
Highlights: We argue that comprehensibility is crucial in software effort and fault prediction. We extracted new datasets based on the Android repository. ALPA extracts a tree that mimics the performance of the complex model. The extracted trees are not only comprehensible but also more accurate.

Software fault and effort prediction are important tasks to minimize the costs of a software project. In software effort prediction, the aim is to forecast the effort needed to complete a software project, whereas software fault prediction tries to identify fault-prone modules. In this research both tasks are considered, using different data mining techniques. The predictive models not only need to be accurate but also comprehensible, demanding that the user can understand the motivation behind the model's prediction. Unfortunately, to obtain predictive performance, comprehensibility is often sacrificed and vice versa. To overcome this problem, we extract trees from well-performing Random Forests (RFs) and Support Vector Machines for regression (SVRs) by making use of the rule extraction algorithm ALPA. This method builds trees (using C4.5 and REPTree) that mimic the black-box model (RF, SVR) as closely as possible. The proposed methodology is applied to publicly available datasets, complemented with new datasets that we have put together based on the Android repository. Surprisingly, the trees extracted from the black-box models by ALPA are not only comprehensible and explain how the black-box model makes (most of) its predictions, but are also more accurate than the trees obtained by working directly on the data.
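The core pedagogical idea behind this kind of rule extraction can be shown in miniature: relabel sampled inputs with the black-box model's own predictions, then fit a simple, readable model to those predictions so that it mimics the black box. In the toy sketch below, the black box is a stand-in function (not an actual Random Forest or SVR) and a one-split stump stands in for C4.5/REPTree:

```python
# Toy sketch of pedagogical (ALPA-style) rule extraction: fit a transparent
# model to the *predictions* of a black box rather than to the original
# labels. Both the black box and the stump are illustrative stand-ins.

def black_box(x):
    """Stand-in for a trained black-box model: an opaque decision function."""
    return 1 if x * x - 3 * x + 1 < 0 else 0

def fit_stump(xs, labels):
    """Exhaustively pick the split threshold minimising mimicry error."""
    best = None
    for t in xs:
        for side in (0, 1):  # which label is predicted below the threshold
            err = sum((side if x < t else 1 - side) != y
                      for x, y in zip(xs, labels))
            if best is None or err < best[0]:
                best = (err, t, side)
    return best  # (errors, threshold, label_below_threshold)

xs = [i / 10 for i in range(40)]            # sample the input space
mimic_labels = [black_box(x) for x in xs]   # relabel with black-box output
```

The stump is trained against `mimic_labels`, never against ground truth, so it approximates the black box's decision boundary; ALPA applies the same principle with full trees and actively generated samples.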
Information & Management | 2013
Helen Tadesse Moges; Karel Dejaeger; Wilfried Lemahieu; Bart Baesens
Recent studies have indicated that companies are increasingly experiencing Data Quality (DQ) related problems as more complex data are being collected. To address such problems, the literature suggests the implementation of a Total Data Quality Management Program (TDQM) that should consist of the following phases: DQ definition, measurement, analysis and improvement. To this end, this paper performs an empirical study using a questionnaire that was distributed to financial institutions worldwide to identify the most important DQ dimensions, to assess the DQ level of credit risk databases using the identified DQ dimensions, to analyze DQ issues and to suggest improvement actions in a credit risk assessment context. This questionnaire is structured according to the framework of Wang and Strong and incorporates three additional DQ dimensions that were found to be important to the current context (i.e., actionable, alignment and traceable). Additionally, this paper contributes to the literature by developing a scorecard index to assess the DQ level of credit risk databases using the DQ dimensions that were identified as most important. Finally, this study explores the key DQ challenges and causes of DQ problems and suggests improvement actions. The findings from the statistical analysis delineate the nine most important DQ dimensions for assessing the DQ level, including accuracy and security.
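A scorecard index of the kind described can be sketched as a weighted average of dimension-level scores. The dimension names echo the abstract (accuracy, security, plus the newly proposed actionable, alignment and traceable), but the weights and survey scores below are invented for illustration and are not the paper's:

```python
# Minimal sketch of a scorecard-style DQ index: a weighted average of
# per-dimension scores. Weights and scores are hypothetical examples.

def dq_index(scores, weights):
    """Weighted average of per-dimension scores (e.g. on a 1-5 survey scale)."""
    total_weight = sum(weights.values())
    return sum(scores[d] * w for d, w in weights.items()) / total_weight

weights = {"accuracy": 0.30, "security": 0.25, "actionable": 0.15,
           "alignment": 0.15, "traceable": 0.15}
scores = {"accuracy": 4.0, "security": 3.0, "actionable": 5.0,
          "alignment": 2.0, "traceable": 4.0}
```

A single index like this makes the DQ level of different credit risk databases directly comparable, while the per-dimension scores still show where improvement actions are needed.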
International Conference on Tools with Artificial Intelligence | 2010
Rudy Setiono; Karel Dejaeger; Wouter Verbeke; David Martens; Bart Baesens
Neural networks are often selected as a tool for software effort prediction because of their capability to approximate any continuous function with arbitrary accuracy. A major drawback of neural networks is the complex mapping between inputs and output, which is not easily understood by a user. This paper describes a rule extraction technique that derives a set of comprehensible IF-THEN rules from a trained neural network applied to the domain of software effort prediction. The suitability of this technique is tested on the ISBSG R11 data set by a comparison with linear regression, radial basis function networks, and CART. It is found that the most accurate results are obtained by CART, though the large number of rules limits comprehensibility. Considering comprehensible models only, the concise set of extracted rules outperforms the pruned CART tree, making neural network rule extraction the most suitable technique for software effort prediction when comprehensibility is important.
International Journal of Information Quality | 2012
Helen Tadesse Moges; Karel Dejaeger; Wilfried Lemahieu; Bart Baesens
Recent studies indicated that companies are increasingly experiencing data quality (DQ) related problems resulting from their increased data collection efforts. Addressing these concerns requires a clear definition of DQ, but typically DQ is only broadly defined as 'fitness for use'. While capturing its essence, a more precise interpretation of DQ is required during measurement. While there is a growing consensus on the multi-dimensional nature of DQ, no exact DQ definition has been put forward due to its context dependency. Instead, it is often stated that its constituting dimensions should be identified and defined in relation to the task at hand. Answering this call, we identify the DQ dimensions important to the credit risk assessment environment. In addition, we explore key DQ challenges and report on the causes of DQ problems in financial institutions. Statistical tests indicated the nine most important DQ dimensions.