Wojtek Kowalczyk
Leiden University
Publications
Featured research published by Wojtek Kowalczyk.
Applied Intelligence | 2013
C. Natalie van der Wal; Wojtek Kowalczyk
The goals of this research were: (1) to develop a system that will automatically measure changes in the emotional state of a speaker by analyzing his/her voice, (2) to validate this system with a controlled experiment and (3) to visualize the results to the speaker in 2-d space. Natural (non-acted) human speech of 77 (Dutch) speakers was collected and manually divided into meaningful speech units. Three recordings per speaker were collected, in which he/she was in a positive, neutral and negative state. For each recording, the speakers rated 16 emotional states on a 10-point Likert scale. The Random Forest algorithm was applied to 207 speech features extracted from the recordings to qualify (classification) and quantify (regression) the changes in the speaker's emotional state. Results showed that predicting the direction of change of emotions and predicting the change of intensity, measured by Mean Squared Error, can be done better than the baseline (the most frequent class label and the mean value of change, respectively). Moreover, it turned out that changes in negative emotions are more predictable than changes in positive emotions. A controlled experiment investigated the difference in human and machine performance on judging the emotional states in one's own voice and that of another. Results showed that humans performed worse than the algorithm on both the detection and regression problems. Humans, just like the machine algorithm, were better at detecting changes in negative emotions than in positive ones. Finally, results of applying Principal Component Analysis (PCA) to our data provided a validation of dimensional emotion theories and suggest that PCA is a promising technique for visualizing the user's emotional state in the envisioned application.
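A minimal sketch of the kind of pipeline described above (not the authors' code): Random Forest classification for the direction of emotional change, Random Forest regression for its magnitude, and a PCA projection to 2-d. All data shapes and variable names below are placeholders standing in for the 207 extracted speech features and the self-reported ratings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(231, 207))             # placeholder: 77 speakers x 3 recordings, 207 features
y_direction = rng.integers(0, 2, size=231)  # placeholder: direction of emotional change
y_intensity = rng.normal(size=231)          # placeholder: change in rated intensity

# (1) classification: predict the direction of change of an emotion
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("accuracy:", cross_val_score(clf, X, y_direction, cv=5).mean())

# (2) regression: predict the magnitude of change (evaluated with MSE in the paper)
reg = RandomForestRegressor(n_estimators=200, random_state=0)
print("MSE:", -cross_val_score(reg, X, y_intensity, cv=5,
                               scoring="neg_mean_squared_error").mean())

# (3) PCA projection to a 2-d "emotion space" for visualizing the speaker's state
coords_2d = PCA(n_components=2).fit_transform(X)
```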
Journal of Cheminformatics | 2017
Eelke B. Lenselink; Niels ten Dijke; Brandon Bongers; George Papadatos; Herman W. T. van Vlijmen; Wojtek Kowalczyk; Adriaan P. IJzerman; Gerard J. P. van Westen
The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest-neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks were the top-performing classifiers, highlighting the added value of Deep Neural Networks over other, more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better, at almost one standard deviation above the mean performance. Furthermore, multi-task and PCM implementations were shown to improve performance over single-task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations below the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around the mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with the unoptimized 'DNN_PCM'). By providing the data and the protocols, this study offers a standardized set for testing and evaluating different machine learning algorithms in the context of multi-task learning.
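The contrast between a random split and a temporal split, evaluated with the Matthews Correlation Coefficient, can be sketched as below. The toy table and descriptor matrix are placeholders, not the actual ChEMBL export or the authors' protocol; only the validation idea is illustrated.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": rng.integers(2000, 2016, size=1000),   # placeholder measurement year
    "active": rng.integers(0, 2, size=1000),       # placeholder activity label
})
X = rng.normal(size=(1000, 64))                    # placeholder molecular descriptors

# random split
X_tr, X_te, y_tr, y_te = train_test_split(X, df["active"], test_size=0.3, random_state=0)

# temporal split: train on older measurements, test on the most recent ones
recent = (df["year"] >= 2013).values
X_tr_t, y_tr_t = X[~recent], df.loc[~recent, "active"]
X_te_t, y_te_t = X[recent], df.loc[recent, "active"]

for name, (a, b, c, d) in {"random": (X_tr, y_tr, X_te, y_te),
                           "temporal": (X_tr_t, y_tr_t, X_te_t, y_te_t)}.items():
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(a, b)
    print(name, "MCC:", matthews_corrcoef(d, model.predict(c)))
```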
Intelligent Data Analysis | 2015
Bas van Stein; Hao Wang; Wojtek Kowalczyk; Thomas Bäck; Michael Emmerich
In business and academia we are continuously trying to model and analyze complex processes in order to gain insight and optimize. One of the most popular modeling algorithms is Kriging, also known as Gaussian Processes. A major bottleneck with Kriging is that it requires processing time of at least O(n^3) and memory of O(n^2) when applied to medium to big data sets. On big data sets, which are increasingly available these days, Kriging is not computationally feasible. As a solution to this problem we introduce a hybrid approach in which a number of Kriging models built on disjoint subsets of the data are properly weighted for the predictions. The proposed model is much more efficient than standard global Kriging in both processing time and memory, and performs equally well in terms of accuracy. The proposed algorithm also scales better and is well suited for parallelization.
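A rough sketch of the idea: fit independent Gaussian Process (Kriging) models on disjoint subsets of the data and combine their predictions. The inverse-variance weighting below is a generic choice for illustration, not necessarily the exact optimal weights derived in the paper; the dataset is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, size=(3000, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=len(X))

# partition the data into disjoint subsets and fit one Kriging model per subset
labels = KMeans(n_clusters=10, n_init=10, random_state=1).fit_predict(X)
models = [GaussianProcessRegressor().fit(X[labels == k], y[labels == k])
          for k in range(10)]

def predict(x_new):
    # combine the local predictions, weighting by each model's predictive variance
    mus, sigmas = zip(*[m.predict(x_new, return_std=True) for m in models])
    mus, sigmas = np.array(mus), np.array(sigmas)
    w = 1.0 / np.maximum(sigmas, 1e-12) ** 2
    return (w * mus).sum(axis=0) / w.sum(axis=0)

print(predict(np.array([[0.5, 0.5]])))
```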
Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2013
Rob M. Konijn; Wouter Duivesteijn; Wojtek Kowalczyk; Arno J. Knobbe
In Subgroup Discovery, one is interested in finding subgroups that behave differently from the ‘average’ behavior of the entire population. In many cases, such an approach works well because the general population is rather homogeneous, and the subgroup encompasses clear outliers. In more complex situations, however, the investigated population is a mixture of various subpopulations, and reporting all of these as interesting subgroups is undesirable, as the variation in behavior is explainable. In these situations, one would be interested in finding subgroups that are unusual with respect to their neighborhood. In this paper, we present a novel method for discovering such local subgroups. Our work is motivated by an application in health care fraud detection. In this domain, one is dealing with various types of medical practitioners, who sometimes specialize in specific patient groups (elderly, disabled, etc.), such that unusual claiming behavior in itself is not cause for suspicion. However, unusual claims with respect to a reference group of similar patients do warrant further investigation of the associated medical practitioner. We demonstrate experimentally how local subgroups can be used to capture interesting fraud patterns.
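A toy sketch of the "local subgroup" idea: score a candidate subgroup not against the whole population but against a local reference group of similar cases (here, the pooled nearest neighbours of the subgroup members). Column meanings and the simple mean-difference score are illustrative assumptions, not the quality measure used in the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 5))             # e.g., patient characteristics (placeholder)
claims = rng.gamma(2.0, 100.0, size=2000)  # e.g., claimed amounts (placeholder)

subgroup = X[:, 0] > 1.5                   # a candidate subgroup description

# reference group: pooled neighbourhoods of the subgroup members
nn = NearestNeighbors(n_neighbors=50).fit(X)
_, idx = nn.kneighbors(X[subgroup])
reference = np.unique(idx.ravel())

local_deviation = claims[subgroup].mean() - claims[reference].mean()
global_deviation = claims[subgroup].mean() - claims.mean()
print("local vs. global deviation:", local_deviation, global_deviation)
```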
International Conference on Information Processing | 2016
Bas van Stein; Wojtek Kowalczyk
Real-life datasets from domains such as industrial process control, medical diagnosis, marketing, and risk management often contain missing values. This poses a challenge for many classification and regression algorithms that require complete training sets. In this paper we present a new approach for “repairing” such incomplete datasets by constructing a sequence of regression models that iteratively replace all missing values. Additionally, our approach uses the target attribute to estimate the values of missing data. The accuracy of our method, Incremental Attribute Regression Imputation (IARI), is compared with the accuracy of several popular, state-of-the-art imputation methods by applying them to five publicly available benchmark datasets. The results demonstrate the superiority of our approach.
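A hedged sketch of regression-based imputation in the spirit of IARI: impute one incomplete attribute at a time with a regression model trained on the rows where that attribute is observed, and, unlike most imputers, include the target attribute among the predictors. The attribute ordering and the temporary fillna(0) for remaining gaps are simplifications for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "target"])
df.loc[rng.random(500) < 0.2, "a"] = np.nan   # inject missing values
df.loc[rng.random(500) < 0.2, "b"] = np.nan

# impute attributes in order of increasing missingness
for col in df.drop(columns="target").isna().sum().sort_values().index:
    miss = df[col].isna()
    if not miss.any():
        continue
    predictors = df.columns.drop(col)          # note: includes the target attribute
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(df.loc[~miss, predictors].fillna(0), df.loc[~miss, col])
    df.loc[miss, col] = model.predict(df.loc[miss, predictors].fillna(0))
```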
IEEE International Conference on Fuzzy Systems | 2016
Bas van Stein; Hao Wang; Wojtek Kowalczyk; Michael Emmerich; Thomas Bäck
Kriging, or Gaussian Process Regression, has been successfully applied in many fields. One of its major bottlenecks is its complexity: cubic in processing time and quadratic in memory with respect to the number of data points. To overcome these limitations, a variety of approximation algorithms have been proposed. One of these approximation algorithms is Optimally Weighted Cluster Kriging (OWCK). In this paper, OWCK is extended and enhanced by the use of fuzzy clustering methods in order to increase the accuracy. Several options are proposed and evaluated against both the original OWCK and a variety of other Kriging approximation algorithms.
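A minimal sketch of the fuzzy variant: instead of assigning each point to exactly one cluster, use soft memberships and weight each local Kriging model's prediction by the query point's membership. The softmax-style membership over distances to KMeans centres below is a stand-in for a proper fuzzy c-means clustering, and the data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(4)
X = rng.uniform(-5, 5, size=(1500, 2))
y = np.cos(X[:, 1]) + 0.1 * rng.normal(size=len(X))

km = KMeans(n_clusters=5, n_init=10, random_state=4).fit(X)
models = [GaussianProcessRegressor().fit(X[km.labels_ == k], y[km.labels_ == k])
          for k in range(5)]

def fuzzy_predict(x_new, temperature=1.0):
    # soft membership of each query point in each cluster
    d = np.linalg.norm(x_new[:, None, :] - km.cluster_centers_[None], axis=-1)
    memberships = np.exp(-d / temperature)
    memberships /= memberships.sum(axis=1, keepdims=True)
    preds = np.column_stack([m.predict(x_new) for m in models])
    return (memberships * preds).sum(axis=1)

print(fuzzy_predict(np.array([[1.0, -2.0]])))
```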
International Conference on Information Processing | 2018
Mark Pijnenburg; Wojtek Kowalczyk
In this paper we introduce the concept of singular outliers and provide an algorithm (SODA) for detecting these outliers. Singular outliers are multivariate outliers that differ from conventional outliers by the fact that the anomalous values occur for only one feature (or a relatively small number of features). Singular outliers occur naturally in the fields of fraud detection and data quality, but can be observed in other application fields as well. The SODA algorithm is based on the local Euclidean Manhattan Ratio (LEMR). The algorithm is applied to five real-world data sets and the outliers found by it are qualitatively and quantitatively compared to outliers found by three conventional outlier detection algorithms, showing the different nature of singular outliers.
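The exact LEMR definition is not reproduced here; the sketch below only illustrates the intuition that for a deviation concentrated in a single feature, the Euclidean and Manhattan distances to the local neighbourhood nearly coincide, whereas a deviation spread over many features makes the Euclidean distance relatively smaller. It is an assumption-laden toy, not the SODA algorithm itself.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 10))
X[0, 3] += 8.0                      # a "singular" outlier: one anomalous feature

k = 20
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)

scores = np.empty(len(X))
for i in range(len(X)):
    diffs = X[idx[i, 1:]] - X[i]                # skip the point itself
    l2 = np.linalg.norm(diffs, axis=1)
    l1 = np.abs(diffs).sum(axis=1)
    scores[i] = (l2 / l1).mean()                # closer to 1 => deviation concentrated in few features

print("most 'singular' candidates:", np.argsort(scores)[-5:])
```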
International Conference on Information Processing | 2018
Bas van Stein; Hao Wang; Wojtek Kowalczyk; Thomas Bäck
For most regression models, their overall accuracy can be estimated with help of various error measures. However, in some applications it is important to provide not only point predictions, but also to estimate the “uncertainty” of the prediction, e.g., in terms of confidence intervals, variances, or interquartile ranges. There are very few statistical modeling techniques able to achieve this. For instance, the Kriging/Gaussian Process method is equipped with a theoretical mean squared error. In this paper we address this problem by introducing a heuristic method to estimate the uncertainty of the prediction, based on the error information from the k-nearest neighbours. This heuristic, called the k-NN uncertainty measure, is computationally much cheaper than other approaches (e.g., bootstrapping) and can be applied regardless of the underlying regression model. To validate and demonstrate the usefulness of the proposed heuristic, it is combined with various models and plugged into the well-known Efficient Global Optimization algorithm (EGO). Results demonstrate that using different models with the proposed heuristic can improve the convergence of EGO significantly.
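A hedged sketch of a k-NN based uncertainty estimate: the "uncertainty" of a prediction at a new point is derived from the errors the model makes on the k training points closest to it. The aggregation (RMS of neighbouring residuals) and the use of raw training residuals are illustrative simplifications; cross-validated residuals would be a more faithful choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.normal(size=len(X))

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
residuals = np.abs(y - model.predict(X))      # training residuals (CV residuals would be better)

nn = NearestNeighbors(n_neighbors=10).fit(X)

def predict_with_uncertainty(x_new):
    mu = model.predict(x_new)
    _, idx = nn.kneighbors(x_new)
    sigma = np.sqrt((residuals[idx] ** 2).mean(axis=1))   # k-NN uncertainty estimate
    return mu, sigma   # sigma could feed an expected-improvement criterion in EGO

print(predict_with_uncertainty(np.array([[0.0, 0.0]])))
```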
International Symposium on Methodologies for Intelligent Systems | 2017
Mark Pijnenburg; Wojtek Kowalczyk
Including categorical variables with many levels in a logistic regression model easily leads to a sparse design matrix. This can result in a big, ill-conditioned optimization problem causing overfitting, extreme coefficient values and long run times. Inspired by recent developments in matrix factorization, we propose four new strategies of overcoming this problem. Each strategy uses a Factorization Machine that transforms the categorical variables with many levels into a few numeric variables that are subsequently used in the logistic regression model. The application of Factorization Machines also allows for including interactions between the categorical variables with many levels, often substantially increasing model accuracy. The four strategies have been tested on four data sets, demonstrating superiority of our approach over other methods of handling categorical variables with many levels. In particular, our approach has been successfully used for developing high quality risk models at the Netherlands Tax and Customs Administration.
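A sketch of the overall strategy only: replace a categorical variable with many levels by a few numeric features learned from a factorization, then feed those features to a logistic regression. The TruncatedSVD of a level-by-level co-occurrence matrix below is a plain stand-in for the Factorization Machine used in the paper, and the data are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5000
df = pd.DataFrame({
    "cat_a": rng.integers(0, 500, size=n),    # categorical variable with many levels
    "cat_b": rng.integers(0, 300, size=n),
    "y": rng.integers(0, 2, size=n),          # placeholder binary target
})

# factorize the co-occurrence of the two categoricals into a few latent dimensions
co = coo_matrix((np.ones(n), (df["cat_a"], df["cat_b"])), shape=(500, 300))
emb_a = TruncatedSVD(n_components=4, random_state=0).fit_transform(co.tocsr())

# each level of cat_a is now represented by 4 numeric features
X = emb_a[df["cat_a"].values]
clf = LogisticRegression(max_iter=1000).fit(X, df["y"])
print("train accuracy:", clf.score(X, df["y"]))
```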
International Conference on Information Processing | 2016
Bas van Stein; Wojtek Kowalczyk; Thomas Bäck
Missing values in datasets form a very relevant and often overlooked problem in many fields. Most algorithms are not able to handle missing values when training a predictive model or analyzing a dataset. For this reason, records with missing values are either rejected or repaired. However, both repairing and rejecting affect the dataset and the final results, creating bias and uncertainty. Therefore, knowledge about the nature of missing values and the underlying mechanisms behind them is of vital importance. To gain more in-depth insight into the underlying structures and patterns of missing values, the concept of Monotone Mixture Patterns is introduced and used to analyze the patterns of missing values in datasets. Several visualization methods are proposed to present the “patterns of missingness” in an informative way. Finally, an algorithm to generate missing values in datasets is provided to form the basis of a benchmarking tool. This algorithm can generate a large variety of missing value patterns for testing and comparing different algorithms that handle missing values.
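A small sketch of generating and summarizing missing-value patterns for benchmarking imputation methods; the actual Monotone Mixture Pattern generator of the paper is considerably richer, and the two mechanisms below are generic examples only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))

# MCAR: each value of column "a" is removed with a fixed probability
df.loc[rng.random(len(df)) < 0.1, "a"] = np.nan

# a monotone pattern: whenever "b" is missing, "c" and "d" are missing as well
rows = rng.random(len(df)) < 0.15
df.loc[rows, ["b", "c", "d"]] = np.nan

# summarize the observed patterns of missingness (one row per distinct pattern)
patterns = df.isna().astype(int).value_counts()
print(patterns)
```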