Johan Pensar
Åbo Akademi University
Publication
Featured research published by Johan Pensar.
Twin Research and Human Genetics | 2013
Ada Johansson; Patrick Jern; Pekka Santtila; Bettina von der Pahlen; Elias Eriksson; Lars Westberg; Henrik Nyman; Johan Pensar; Jukka Corander; N. Kenneth Sandnabba
The Genetics of Sexuality and Aggression (GSA) project was launched at Åbo Akademi University in Turku, Finland, in 2005 and has so far undertaken two major population-based data collections involving twins and siblings of twins. To date, it consists of about 14,000 individuals (including 1,147 informative monozygotic twin pairs, 1,042 informative same-sex dizygotic twin pairs, and 741 informative opposite-sex dizygotic twin pairs). Participants were recruited through the Central Population Registry of Finland and were 18-49 years of age at the time of the data collections. Saliva samples for DNA genotyping (n = 4,278) and testosterone analyses (n = 1,168) were collected in 2006. The primary focus of the data collections has been on sexuality (both sexual functioning and sexual behavior) and aggressive behavior. This paper provides an overview of the data collections as well as an outline of the phenotypes and biological data assembled within the project. A detailed overview of publications can be found at the project's Web site: http://www.cebg.fi/.
Data Mining and Knowledge Discovery | 2015
Johan Pensar; Henrik Nyman; Timo Koski; Jukka Corander
We introduce a novel class of labeled directed acyclic graph (LDAG) models for finite sets of discrete variables. LDAGs generalize earlier proposals for allowing local structures in the conditional probability distribution of a node, such that unrestricted label sets determine which edges can be deleted from the underlying directed acyclic graph (DAG) for a given context. Several properties of these models are derived, including a generalization of the concept of Markov equivalence classes. Efficient Bayesian learning of LDAGs is enabled by introducing an LDAG-based factorization of the Dirichlet prior for the model parameters, such that the marginal likelihood can be calculated analytically. In addition, we develop a novel prior distribution for the model structures that can appropriately penalize a model for its labeling complexity. A non-reversible Markov chain Monte Carlo algorithm combined with a greedy hill climbing approach is used for illustrating the useful properties of LDAG models for both real and synthetic data sets.
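The analytic marginal likelihood mentioned in this abstract rests on the standard Dirichlet-multinomial identity. The following is a minimal sketch of that identity for a single node, not the LDAG-based prior factorization from the paper: the function names, the counts, and the choice to sum hyperparameters when two parent configurations are merged are all assumptions made for illustration.

```python
import math

def log_marginal_likelihood(counts, alpha):
    """Log marginal likelihood of an observed sequence of categorical outcomes
    with the given counts, under a Dirichlet(alpha) prior on the probabilities."""
    n_total = sum(counts)
    a_total = sum(alpha)
    score = math.lgamma(a_total) - math.lgamma(a_total + n_total)
    for n, a in zip(counts, alpha):
        score += math.lgamma(a + n) - math.lgamma(a)
    return score

def node_score(count_rows, alpha_rows):
    """Sum the per-row scores; each row is one distinct parent context."""
    return sum(log_marginal_likelihood(c, a)
               for c, a in zip(count_rows, alpha_rows))

# Hypothetical counts for a binary node with one binary parent.
# Two separate parent configurations, each with its own distribution:
separate = node_score([[8, 2], [7, 3]], [[1.0, 1.0], [1.0, 1.0]])
# The same data with both configurations merged into a single context
# (assumption: counts and hyperparameters are simply added):
merged = node_score([[15, 5]], [[2.0, 2.0]])
print(separate, merged)
```

Comparing the two scores shows how a structure with fewer distinct contexts can be weighed against the unrestricted table using closed-form quantities only.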
Bayesian Analysis | 2014
Henrik Nyman; Johan Pensar; Timo Koski; Jukka Corander
Theory of graphical models has matured over more than three decades to provide the backbone for several classes of models that are used in a myriad of applications such as genetic mapping of diseases, credit risk evaluation, reliability and computer security. Despite their generic applicability and wide adoption, the constraints imposed by undirected graphical models and Bayesian networks have also been recognized to be unnecessarily stringent under certain circumstances. This observation has led to the proposal of several generalizations that aim at more relaxed constraints by which the models can impose local or context-specific dependence structures. Here we consider an additional class of such models, termed stratified graphical models. We develop a method for Bayesian learning of these models by deriving an analytical expression for the marginal likelihood of data under a specific subclass of decomposable stratified models. A non-reversible Markov chain Monte Carlo approach is further used to identify models that are highly supported by the posterior distribution over the model space. Our method is illustrated and compared with ordinary graphical models through application to several real and synthetic datasets.
International Journal of Approximate Reasoning | 2016
Johan Pensar; Henrik Nyman; Jarno Lintusaari; Jukka Corander
Bayesian networks are one of the most widely used tools for modeling multivariate systems. It has been demonstrated that more expressive models, which can capture additional structure in each conditional probability table (CPT), may enjoy improved predictive performance over traditional Bayesian networks despite having fewer parameters. Here we investigate this phenomenon for models of varying degrees of expressiveness on both extensive synthetic and real data. To characterize the regularities within CPTs in terms of independence relations, we introduce the notion of partial conditional independence (PCI) as a generalization of the well-known concept of context-specific independence (CSI). To model the structure of the CPTs, we use different graph-based representations which are convenient from a learning perspective. In addition to the previously studied decision trees and graphs, we introduce the concept of PCI-trees as a natural extension of the CSI-based trees. To identify plausible models we use the Bayesian score in combination with a greedy search algorithm. A comparison against ordinary Bayesian networks shows that models with local structures in general enjoy parametric sparsity and improved out-of-sample predictive performance; however, it is often necessary to regulate the model fit with an appropriate model structure prior to avoid overfitting in the learning process. The tree structures, in particular, lead to high quality models and suggest considerable potential for further exploration. We study the effect of including local structures in learning of Bayesian networks. We introduce partial conditional independence to characterize the restrictions. The local structures are modeled using various graph-based representations. The models are learned using a Bayesian score and a greedy search algorithm. In general, local structures improve the predictive accuracy of the learned models.
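To make the idea of context-specific independence in a CPT concrete, here is a small sketch with a hypothetical CPT for a binary node with two binary parents (A, B). The CPT values and the helper names `is_csi` and `free_parameters` are assumptions for illustration; they are not taken from the paper, and PCI and the tree representations are not covered.

```python
def is_csi(cpt, context_index, context_value, tol=1e-12):
    """True if every CPT row whose parent at context_index equals context_value
    holds the same distribution, i.e. the node is independent of the remaining
    parents in that context."""
    rows = [dist for parents, dist in cpt.items()
            if parents[context_index] == context_value]
    if not rows:
        raise ValueError("context does not occur in the CPT")
    first = rows[0]
    return all(abs(p - q) <= tol
               for row in rows[1:] for p, q in zip(row, first))

def free_parameters(cpt):
    """Count free parameters when identical rows are counted once."""
    distinct = {tuple(dist) for dist in cpt.values()}
    return sum(len(dist) - 1 for dist in distinct)

# Hypothetical CPT: keys are (A, B), values are P(X = 0), P(X = 1).
cpt = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.9, 0.1],  # same as (0, 0): X does not depend on B when A = 0
    (1, 0): [0.6, 0.4],
    (1, 1): [0.2, 0.8],
}
print(is_csi(cpt, 0, 0), is_csi(cpt, 0, 1), free_parameters(cpt))
```

In this toy table the full CPT would need four free parameters, while sharing the two rows for A = 0 leaves three, which is the kind of parametric sparsity the abstract refers to.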
Workshop on Logic, Language, Information and Computation | 2016
Jukka Corander; Antti Hyttinen; Juha Kontinen; Johan Pensar; Jouko Väänänen
Bayesian networks constitute a qualitative representation for conditional independence (CI) properties of a probability distribution. It is known that every CI statement implied by the topology of a Bayesian network G is witnessed over G under a graph-theoretic criterion called d-separation. Alternatively, all such implied CI statements have been shown to be derivable using the so-called semi-graphoid axioms. In this article we consider Labeled Directed Acyclic Graphs (LDAGs), the purpose of which is to graphically model situations exhibiting context-specific independence (CSI). We define an analogue of dependence logic suitable to express context-specific independence and study its basic properties. We also consider the problem of finding inference rules for deriving non-local CSI and CI statements that logically follow from the structure of an LDAG but are not explicitly encoded by it.
Computational Statistics | 2016
Henrik Nyman; Johan Pensar; Timo Koski; Jukka Corander
Log-linear models are the popular workhorses of analyzing contingency tables. A log-linear parameterization of an interaction model can be more expressive than a direct parameterization based on probabilities, leading to a powerful way of defining restrictions derived from marginal, conditional and context-specific independence. However, parameter estimation is often simpler under a direct parameterization, provided that the model enjoys certain decomposability properties. Here we introduce a cyclical projection algorithm for obtaining maximum likelihood estimates of log-linear parameters under an arbitrary context-specific graphical log-linear model, which need not satisfy criteria of decomposability. We illustrate that lifting the restriction of decomposability makes the models more expressive, such that additional context-specific independencies embedded in real data can be identified. It is also shown how a context-specific graphical model can correspond to a non-hierarchical log-linear parameterization with a concise interpretation. This observation can pave the way to further development of non-hierarchical log-linear models, which have been largely neglected due to their believed lack of interpretability.
Advances in Data Analysis and Classification | 2016
Henrik Nyman; Jie Xiong; Johan Pensar; Jukka Corander
An inductive probabilistic classification rule must generally obey the principles of Bayesian predictive inference, such that all observed and unobserved stochastic quantities are jointly modeled and the parameter uncertainty is fully acknowledged through the posterior predictive distribution. Several such rules have been recently considered and their asymptotic behavior has been characterized under the assumption that the observed features or variables used for building a classifier are conditionally independent given a simultaneous labeling of both the training samples and those from an unknown origin. Here we extend the theoretical results to predictive classifiers acknowledging feature dependencies either through graphical models or sparser alternatives defined as stratified graphical models. We show through experimentation with both synthetic and real data that the predictive classifiers encoding dependencies have the potential to substantially improve classification accuracy compared with both standard discriminative classifiers and the predictive classifiers based on solely conditionally independent features. In most of our experiments stratified graphical models show an advantage over ordinary graphical models.
Statistics and Computing | 2017
Tomi Janhunen; Martin Gebser; Jussi Rintanen; Henrik Nyman; Johan Pensar; Jukka Corander
Statistical model learning problems are traditionally solved using either heuristic greedy optimization or stochastic simulation, such as Markov chain Monte Carlo or simulated annealing. Recently, there has been an increasing interest in the use of combinatorial search methods, including those based on computational logic. Some of these methods are particularly attractive since they can also be successful in proving the global optimality of solutions, in contrast to stochastic algorithms that only guarantee optimality at the limit. Here we improve and generalize a recently introduced constraint-based method for learning undirected graphical models. The new method combines perfect elimination orderings with various strategies for solution pruning and offers a dramatic improvement both in terms of time and memory complexity. We also show that the method is capable of efficiently handling a more general class of models, called stratified/labeled graphical models, which have an astronomically larger model space.
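A perfect elimination ordering is an ordering of the vertices in which, for every vertex, the neighbors that come later in the ordering are pairwise adjacent; an undirected graph has such an ordering exactly when it is chordal. The sketch below only checks a given ordering and is not the constraint-based learning method from the paper; the function name and the example graphs are assumptions for illustration.

```python
from itertools import combinations

def is_perfect_elimination_ordering(adjacency, ordering):
    """Check that, for each vertex, its neighbors appearing later in the
    ordering form a clique. adjacency maps each vertex to a set of neighbors."""
    position = {v: i for i, v in enumerate(ordering)}
    for v in ordering:
        later = [u for u in adjacency[v] if position[u] > position[v]]
        if any(b not in adjacency[a] for a, b in combinations(later, 2)):
            return False
    return True

# Hypothetical graphs: a chordless 4-cycle a-b-c-d-a, and the same cycle
# with the chord b-d added.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
chorded = {"a": {"b", "d"}, "b": {"a", "c", "d"},
           "c": {"b", "d"}, "d": {"a", "b", "c"}}
order = ["a", "b", "c", "d"]
print(is_perfect_elimination_ordering(cycle, order))
print(is_perfect_elimination_ordering(chorded, order))
```

The chordless cycle fails for this ordering because the later neighbors of `a` (`b` and `d`) are not adjacent, while adding the chord makes the same ordering valid.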
bioRxiv | 2018
Santeri Puranen; Maiju Pesonen; Johan Pensar; Yingying Xu; John A. Lees; Stephen D. Bentley; Nicholas J. Croucher; Jukka Corander
The potential for genome-wide modelling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has previously been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 10^4–10^5 polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here, we introduce a novel inference method (SuperDCA) that employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 10^5 polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA, thus, holds considerable potential in building understanding about numerous organisms at a systems biological level.
Sexual Abuse: A Journal of Research and Treatment | 2017
Alessandro Tadei; Johan Pensar; Jukka Corander; Katarina Finnilä; Pekka Santtila; Jan Antfolk
In assessments of child sexual abuse (CSA) allegations, informative background information is often overlooked or not used properly. We therefore created and tested an instrument that uses accessible background information to calculate the probability of a child being a CSA victim that can be used as a starting point in the following investigation. Studying 903 demographic and socioeconomic variables from over 11,000 Finnish children, we identified 42 features related to CSA. Using Bayesian logic to calculate the probability of abuse, our instrument—the Finnish Investigative Instrument of Child Sexual Abuse (FICSA)—has two separate profiles for boys and girls. A cross-validation procedure suggested excellent diagnostic utility (area under the curve [AUC] = 0.97 for boys and AUC = 0.88 for girls). We conclude that the presented method can be useful in forensic assessments of CSA allegations by adding a reliable statistical approach to considering background information, and to support clinical decision making and guide investigative efforts.