Paweł Teisseyre | Researchain

Publication

Featured research published by Paweł Teisseyre.

Computational Statistics & Data Analysis | 2014

Using random subspace method for prediction and variable importance assessment in linear regression

Jan Mielniczuk; Paweł Teisseyre

A random subset method (RSM) with a new weighting scheme is proposed and investigated for linear regression with a large number of features. Weights of variables are defined as averages of squared values of the pertaining t-statistics over fitted models with randomly chosen features. It is argued that such weighting is advisable as it incorporates two factors: a measure of importance of the variable within the considered model and a measure of goodness-of-fit of the model itself. Asymptotic weights assigned by such a scheme are determined, as well as assumptions under which the method leads to a consistent choice of significant variables in the model. Numerical experiments indicate that the proposed method behaves promisingly when its prediction errors are compared with errors of penalty-based methods such as the lasso, and that it has a much smaller false discovery rate than the other methods considered.
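The weighting idea in this abstract can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept, a fixed subset size, and a plain average of squared t-statistics over the draws in which a variable was selected. The function name `rsm_weights` and its parameters are hypothetical and not taken from the paper, which may differ in details.

```python
import numpy as np

def rsm_weights(X, y, subset_size, n_draws=500, seed=0):
    """Average squared t-statistic of each feature over random sub-models.

    Illustrative sketch only: names and defaults are assumptions,
    not the authors' implementation. Requires n > subset_size + 1.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sums = np.zeros(p)    # running sum of squared t-statistics per feature
    counts = np.zeros(p)  # number of sub-models each feature appeared in
    for _ in range(n_draws):
        idx = rng.choice(p, size=subset_size, replace=False)
        Xs = np.column_stack([np.ones(n), X[:, idx]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        sigma2 = resid @ resid / (n - Xs.shape[1])     # residual variance
        cov = sigma2 * np.linalg.inv(Xs.T @ Xs)        # covariance of beta
        t = beta[1:] / np.sqrt(np.diag(cov)[1:])       # skip the intercept
        sums[idx] += t ** 2
        counts[idx] += 1
    # Features never drawn keep weight 0.
    return sums / np.maximum(counts, 1)
```

Ranking features by the returned weights and refitting on the top-ranked ones would be one way to use the result; the choice of how many to keep is left open here.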

Neurocomputing | 2016

Feature ranking for multi-label classification using Markov networks

Paweł Teisseyre

We propose a simple and efficient method for ranking features in multi-label classification. The method produces a ranking of features showing their relevance in predicting labels, which in turn allows us to choose a final subset of features. The procedure is based on Markov networks and allows us to model the dependencies between labels and features in a direct way. In the first step we build a simple network using only labels and then we test how much adding a single feature affects the initial network. More specifically, in the first step we use the Ising model whereas the second step is based on the score statistic, which allows us to test the significance of added features very quickly. The proposed approach does not require transformation of the label space, gives interpretable results and allows for attractive visualization of the dependency structure. We give a theoretical justification of the procedure by discussing some theoretical properties of the Ising model and the score statistic. We also discuss a feature ranking procedure based on fitting the Ising model using l1-regularized logistic regressions. Numerical experiments show that the proposed methods outperform the conventional approaches on the considered artificial and real datasets.

Journal of Quantitative Linguistics | 2014

Analysing Utterances in Polish Parliament to Predict Speaker’s Background

Piotr Przybyła; Paweł Teisseyre

In this study we use transcripts of the Sejm (Polish parliament) to predict the speaker’s background: gender, education, party affiliation and birth year. We create learning cases consisting of 100 utterances by the same author and, using rich multi-level annotations of the source corpus, extract a variety of features from them. They are either text-based (e.g. mean sentence length, percentage of long words or frequency of named entities of certain types) or word-based (unigrams and bigrams of surface forms, lemmas and interpretations). Next, we apply general-purpose feature selection, regression and classification algorithms and obtain results well over the baseline (97% accuracy for gender, 95% for education, 76–88% for party). A comparative study shows that random forest and the k-nearest neighbours classifier usually outperform other methods commonly used in text mining, such as support vector machines and the naïve Bayes classifier. The evaluation experiments help to understand how these solutions deal with such sparse and high-dimensional data and which of the considered traits influence the language the most. We also address difficulties caused by some of the properties of Polish, typical also for other Slavonic languages.

Intelligent Information Systems | 2017

Diversity of editors and teams versus quality of cooperative work: experiments on Wikipedia

Marcin Sydow; Katarzyna Baraniak; Paweł Teisseyre

We study whether and how the diversity of editors and teams affects the quality of work in a virtual cooperative work environment, using Wikipedia as an example. We propose a measure of the interest diversity of an editor and some measures of team diversity in terms of members’ interests and experience. Statistical and machine learning methods are used to investigate the dependency between diversity and work quality. The presented experimental results confirm our hypothesis that the interest diversity of a single editor and team diversity are positively related to the quality of their work. Interestingly, some of our experiments also indicate that diversity may be more important than such attributes as productivity of an editor or size or experience of the team. Our experimental results demonstrate that it is possible to predict work quality based on diversity, which is an additional statistical signal that diversity is correlated with work quality.

Neurocomputing | 2017

CCnet: Joint multi-label classification and feature selection using classifier chains and elastic net regularization

Paweł Teisseyre

Classifier chains are among the most successful methods in multi-label classification due to their simplicity and promising performance. However, the standard versions of classifier chains described in the literature do not usually perform feature selection. In this paper we propose an algorithm, CCnet, which is a combination of classifier chains and elastic-net regularization. An important advantage of CCnet is that selection of the relevant features is an integral element of the learning process. We show the stability of our algorithm and analyse the generalization error bound. The difference between the generalization error and the empirical error is bounded by a term which scales as \(n^{-1/2}\), where \(n\) is the size of the training data. It follows from experiments that the proposed algorithm outperforms the standard versions of classifier chains as well as other state-of-the-art methods. We also show that the feature selection is stable with respect to the order of fitting the models in the chain.

Communications in Statistics - Theory and Methods | 2012

Selection of Regression and Autoregression Models with Initial Ordering of Variables

Jan Mielniczuk; Paweł Teisseyre

We consider an information criterion for model selection in the random design linear regression and autoregression cases which allows for a general penalty and a general averaging factor for the sum of squared residuals, replacing the reciprocal of the sample size. This leads to a consistent selection of the set of nonzero coefficients. The search over all subsets may be replaced by a search over a nested family when predictors are pre-ordered with respect to their significance in the largest model. We show that such a procedure detects the significant variables in both regression setups even when the number of models increases with the sample size.

Genetic Epidemiology | 2018

A deeper look at two concepts of measuring gene-gene interactions: logistic regression and interaction information revisited

Jan Mielniczuk; Paweł Teisseyre

Detection of gene–gene interactions is one of the most important challenges in genome‐wide case–control studies. Besides traditional logistic regression analysis, entropy‐based methods have recently attracted significant attention. Among entropy‐based methods, interaction information is one of the most promising measures, having many desirable properties. Although both logistic regression and interaction information have been used in several genome‐wide association studies, the relationship between them has not been thoroughly investigated theoretically. The present paper attempts to fill this gap. We show that although certain connections between the two methods exist, in general they refer to two different concepts of dependence, and looking for interactions in those two senses leads to different approaches to interaction detection. We introduce an ordering between interaction measures and specify conditions for independent and dependent genes under which interaction information is a more discriminative measure than logistic regression. Moreover, we show that for so‐called perfect distributions those measures are equivalent. The numerical experiments illustrate the theoretical findings, indicating that interaction information and its modified version are more universal tools for detecting various types of interaction than logistic regression and linkage disequilibrium measures.
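For readers unfamiliar with interaction information, the following minimal sketch computes a plug-in estimate from discrete samples. It assumes the common definition II(X1; X2; Y) = I((X1, X2); Y) − I(X1; Y) − I(X2; Y); the abstract does not state the formula or sign convention, so this choice, like the function names, is an assumption and not necessarily what the paper uses.

```python
from collections import Counter
from math import log

def entropy(*cols):
    """Plug-in joint entropy (in nats) of one or more discrete columns."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum(c / n * log(c / n) for c in counts.values())

def mutual_information(a_cols, b_cols):
    """I(A; B) = H(A) + H(B) - H(A, B) for groups of columns A and B."""
    return entropy(*a_cols) + entropy(*b_cols) - entropy(*a_cols, *b_cols)

def interaction_information(x1, x2, y):
    """Assumed definition: I((X1, X2); Y) - I(X1; Y) - I(X2; Y)."""
    return (mutual_information([x1, x2], [y])
            - mutual_information([x1], [y])
            - mutual_information([x2], [y]))

# XOR-type dependence: neither variable alone is informative about y,
# but the pair determines it, so the interaction term is positive.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y_xor = [a ^ b for a, b in zip(x1, x2)]
ii_xor = interaction_information(x1, x2, y_xor)

# y copies x1: the pair adds nothing beyond x1, so the term is zero.
ii_copy = interaction_information(x1, x2, x1)
```

In the XOR case the estimate equals log 2 nats, while in the copy case it is zero, which shows how the measure separates joint effects from marginal ones.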

Challenges in Computational Statistics and Data Mining | 2016

What Do We Choose When We Err? Model Selection and Testing for Misspecified Logistic Regression Revisited

Jan Mielniczuk; Paweł Teisseyre

The problem of fitting logistic regression to a binary model allowing for misspecification of the response function is reconsidered. We introduce a two-stage procedure which consists first in ordering predictors with respect to deviances of the models with the predictor in question omitted and then choosing the minimizer of the Generalized Information Criterion in the resulting nested family of models. This allows for a large number of potential predictors to be considered, in contrast to an exhaustive method. We prove that the procedure consistently chooses the model \(t^{*}\) which is the closest in the averaged Kullback-Leibler sense to the true binary model \(t\). We then consider the interplay between \(t\) and \(t^{*}\) and prove that for a monotone response function, when there is genuine dependence of the response on predictors, \(t^{*}\) is necessarily nonempty. This implies consistency of a deviance test of significance under misspecification. For a class of distributions of predictors, including the normal family, Rudd’s result asserts that \(t^{*}=t\). Numerical experiments reveal that for normally distributed predictors the probability of correct selection and the power of the deviance test depend monotonically on Rudd’s proportionality constant \(\eta \).

Computational Statistics | 2016