
Publications


Featured research published by Małgorzata Bogdan.


The Annals of Applied Statistics | 2015

SLOPE - Adaptive Variable Selection via Convex Optimization

Małgorzata Bogdan; Ewout van den Berg; Chiara Sabatti; Weijie Su; Emmanuel J. Candès

We introduce a new estimator for the vector of coefficients β in the linear model y = Xβ + z, where X has dimensions n × p with p possibly larger than n. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to the convex program

minimize over b ∈ ℝ^p:  ½‖y − Xb‖²_{ℓ2} + λ1|b|(1) + λ2|b|(2) + … + λp|b|(p),

where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0 and |b|(1) ≥ |b|(2) ≥ … ≥ |b|(p) are the decreasing absolute values of the entries of b. We demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical ℓ1 procedures such as the Lasso. Here, the regularizer is a sorted ℓ1 norm, which penalizes the regression coefficients according to their rank: the higher the rank, that is, the stronger the signal, the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289–300] procedure (BH), which compares more significant p-values with more stringent thresholds. One notable choice of the sequence {λi} is given by the BH critical values λBH(i) = z(i · q/(2p)), where q ∈ (0, 1) and z(α) is the upper α-quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR), defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with λBH provably controls FDR at level q. Moreover, it also appears to have appreciable inferential properties under more general designs X while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.
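
A minimal Python sketch of how the λBH sequence and a plain proximal-gradient SLOPE solver might look. The prox of the sorted ℓ1 norm uses a stack-based pool-adjacent-violators step; the function names and the choice of plain ISTA (rather than the authors' reference solver) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def lambda_bh(p, q=0.1):
    """BH-style critical values: lambda_i = z(i*q/(2p)), the upper
    normal quantile, forming a nonincreasing penalty sequence."""
    i = np.arange(1, p + 1)
    return norm.ppf(1.0 - i * q / (2.0 * p))

def prox_sorted_l1(v, lam):
    """Prox of b -> sum_i lam_i * |b|_(i) for nonincreasing lam >= 0.
    Sort |v|, subtract lam, repair monotonicity by pooling adjacent
    violators, clip at zero, then undo the sort and restore signs."""
    sign = np.sign(v)
    u = np.abs(v)
    order = np.argsort(-u)            # indices of |v| in decreasing order
    w = u[order] - lam
    sums, counts = [], []             # blocks stored as (sum, count)
    for x in w:
        s, c = x, 1
        # merge while the previous block mean is <= the new block mean
        while sums and sums[-1] * c <= s * counts[-1]:
            s += sums.pop()
            c += counts.pop()
        sums.append(s)
        counts.append(c)
    proj = np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])
    proj = np.maximum(proj, 0.0)
    out = np.empty_like(u)
    out[order] = proj
    return sign * out

def slope_ista(X, y, lam, n_iter=500):
    """Plain proximal gradient for 0.5*||y - Xb||^2 + sorted-l1 penalty."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz const of gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = prox_sorted_l1(b - step * X.T @ (X @ b - y), step * lam)
    return b
```

For orthonormal designs the solution is exactly prox_sorted_l1(Xᵀy, λ), which mirrors the BH-style thresholding of the least-squares estimates described in the abstract.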


Computational Statistics & Data Analysis | 2012

Modified versions of Bayesian Information Criterion for genome-wide association studies

Florian Frommlet; Piotr Twaróg; Małgorzata Bogdan

For the vast majority of genome-wide association studies (GWAS), statistical analysis has been performed by testing markers individually. Elementary statistical considerations clearly show that, in the case of complex traits, an approach based on multiple regression or generalized linear models is preferable to testing single markers. A model selection approach to GWAS can be based on modifications of the Bayesian Information Criterion (BIC), where search strategies are necessary to deal with the huge number of potential models. Comprehensive simulations based on real SNP data confirm that model selection has greater power to detect causal SNPs in complex models than single-marker tests. Furthermore, testing single markers leads to substantial problems with the proper ranking of causal SNPs and tends to detect a certain number of false positive SNPs that are not linked to any of the causal mutations. This behavior of single-marker tests is typical in GWAS for complex traits and can be explained by the aggregated influence of many small random sample correlations between the genotypes of the SNP under investigation and other causal SNPs. These findings might at least partially explain the problems with low power and non-replicability of results in GWAS. A real data analysis illustrates the advantages of model selection in practice, in which publicly available gene expression data, used as traits for individuals from the HapMap project, are reanalyzed.
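
A hedged sketch of the idea in Python: score candidate models with a BIC that carries an extra penalty for the size p of the search space, and search greedily. The penalty form n·log(RSS/n) + k·log n + 2k·log(p/c) and the constant c = 4 are assumptions drawn from the mBIC literature, not necessarily the exact criterion used in this paper.

```python
import numpy as np

def mbic(X, y, subset, c=4.0):
    """Modified BIC for a regression on the markers in `subset`:
    n*log(RSS/n) + k*log(n) + 2k*log(p/c). The extra 2k*log(p/c) term
    charges for searching among p candidate markers (form assumed)."""
    n, p = X.shape
    k = len(subset)
    if k:
        beta, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
        rss = float(np.sum((y - X[:, subset] @ beta) ** 2))
    else:
        rss = float(np.sum((y - y.mean()) ** 2))
    return n * np.log(rss / n) + k * np.log(n) + 2 * k * np.log(p / c)

def forward_mbic(X, y):
    """Greedy forward search: repeatedly add the marker that most
    decreases mBIC; stop when no addition improves the criterion."""
    selected, best = [], mbic(X, y, [])
    while True:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            return selected
        score, j = min((mbic(X, y, selected + [j]), j) for j in candidates)
        if score >= best:
            return selected
        selected.append(j)
        best = score
```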


Genetics | 2007

Locating multiple interacting quantitative trait loci using rank-based model selection

Małgorzata Zak; Andreas Baierl; Małgorzata Bogdan; Andreas Futschik

In previous work, a modified version of the Bayesian information criterion (mBIC) was proposed to locate multiple interacting quantitative trait loci (QTL). Simulation studies and real data analysis demonstrate good properties of the mBIC in situations where the error distribution is approximately normal. However, as with other standard techniques of QTL mapping, the performance of the mBIC deteriorates strongly when the trait distribution is heavy-tailed or when the data contain a significant proportion of outliers. In the present article, we propose a suitable robust version of the mBIC that is based on ranks. We investigate the properties of the resulting method on the basis of theoretical calculations, computer simulations, and a real data analysis. Our simulation results show that for the sample sizes typically used in QTL mapping, the methods based on ranks are almost as efficient as standard techniques when the data are normal, and are much better when the data come from a heavy-tailed distribution or include a proportion of outliers.
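
The rank-based variant can be sketched in Python by transforming the trait to normalized ranks before running the same model search; forward_mbic refers to the sketch given after the previous abstract, and the exact rank criterion used in the paper differs in its details.

```python
from scipy.stats import rankdata

def forward_mbic_ranks(X, y):
    """Robust variant: regress on normalized ranks of the trait, so heavy
    tails and outliers have bounded influence on the model search."""
    r = rankdata(y)                        # ranks 1..n, ties averaged
    y_rank = (r - r.mean()) / r.std()      # center and scale the ranks
    return forward_mbic(X, y_rank)         # forward_mbic: sketch above
```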


Journal of Statistical Computation and Simulation | 1995

Data driven versions of Pearson's chi-square test for uniformity

Małgorzata Bogdan

The test statistic of Pearson's chi-square test for uniformity can be seen as the L2-distance between the null density and the histogram density estimator. The power of this test depends heavily on the number of histogram cells. Recommended procedures for choosing this number usually exploit knowledge of the class of alternatives. We show how Schwarz's Bayesian Information Criterion and certain minimum complexity criteria can be used to select the number of classes for Pearson's chi-square test. These criteria allow us to make a choice depending only on the observed data. We compare the powers of the resulting data-driven tests with the power of the chi-square test by means of Monte Carlo simulations. We also investigate a test based on the L2-distance between the null density and the mixture of histogram density estimators introduced by J. Rissanen. This test turns out to be much better than Pearson's test, and it is competitive with other known test procedures.
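
A small Python sketch of the data-driven idea: compute the Pearson statistic for each candidate number of cells and let a Schwarz/BIC-type rule pick the cell count from the data alone. The exact penalty (here T_k − k·log n) is a simplifying assumption, not the paper's precise criterion.

```python
import numpy as np

def pearson_stat(x, k):
    """Pearson chi-square statistic for uniformity on [0,1], k equal cells."""
    n = len(x)
    observed, _ = np.histogram(x, bins=k, range=(0.0, 1.0))
    expected = n / k
    return np.sum((observed - expected) ** 2 / expected)

def data_driven_chisq(x, k_max=20):
    """Choose the number of cells by a Schwarz-type rule, then report the
    chi-square statistic at the selected cell count."""
    n = len(x)
    k_star = max(range(2, k_max + 1),
                 key=lambda k: pearson_stat(x, k) - k * np.log(n))
    return k_star, pearson_stat(x, k_star)
```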


Annals of the Institute of Statistical Mathematics | 2002

A data driven smooth test for circular uniformity

Małgorzata Bogdan; Krzysztof Bogdan; Andreas Futschik

We propose a new omnibus test for uniformity on the circle. The new test is based upon the idea of data driven smooth tests as presented in Ledwina (1994, J. Amer. Statist. Assoc., 89, 1000–1005). Our simulations indicate that the test performs very well for multifarious alternatives. In particular, it seems to outperform other known omnibus tests when testing against multimodal alternatives. We also investigate asymptotic properties of our test and we prove that it is consistent against every departure from uniformity.
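
A rough Python sketch of a data-driven smooth test on the circle, built from empirical Fourier coefficients with a Schwarz-type dimension rule; the penalty 2k·log n for 2k parameters and the plain score statistic are simplifying assumptions, not the paper's exact construction.

```python
import numpy as np

def circular_smooth_stat(theta, k):
    """Score statistic from the first k Fourier pairs (cos jt, sin jt)."""
    n = len(theta)
    j = np.arange(1, k + 1)[:, None]        # shape (k, 1)
    c = np.cos(j * theta).sum(axis=1)       # sum of cos(j*theta_i), per j
    s = np.sin(j * theta).sum(axis=1)
    return (2.0 / n) * np.sum(c ** 2 + s ** 2)

def data_driven_circular_test(theta, k_max=10):
    """Pick the dimension k with a Schwarz-type rule, then report the
    smooth-test statistic at that dimension."""
    n = len(theta)
    k_star = max(range(1, k_max + 1),
                 key=lambda k: circular_smooth_stat(theta, k)
                 - 2 * k * np.log(n))
    return k_star, circular_smooth_stat(theta, k_star)
```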


Annals of Statistics | 2017

False discoveries occur early on the Lasso path

Weijie Su; Małgorzata Bogdan; Emmanuel J. Candès

In regression settings where explanatory variables have very low correlations and there are relatively few effects, each of large magnitude, we expect the Lasso to find the important variables with few errors, if any. This paper shows that in a regime of linear sparsity, meaning that the fraction of variables with a non-vanishing effect tends to a constant, however small, this cannot really be the case, even when the design variables are stochastically independent. We demonstrate that true features and null features are always interspersed on the Lasso path, and that this phenomenon occurs no matter how strong the effect sizes are. We derive a sharp asymptotic trade-off between false and true positive rates or, equivalently, between measures of type I and type II errors along the Lasso path. This trade-off states that if we ever want to achieve a type II error (false negative rate) below a critical value, then anywhere on the Lasso path the type I error (false positive rate) will need to exceed a given threshold, so that we can never have both errors at a low level at the same time. Our analysis uses tools from approximate message passing (AMP) theory as well as novel elements to deal with a possibly adaptive selection of the Lasso regularizing parameter.
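
The trade-off is easy to probe empirically. A short Python sketch using scikit-learn's lasso_path records the false discovery proportion (FDP) and true positive proportion (TPP) along the path for a known support; this is illustrative bookkeeping, not the paper's AMP analysis.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def fdp_tpp_along_path(X, y, true_support):
    """Record FDP and TPP at every knot of the Lasso path."""
    alphas, coefs, _ = lasso_path(X, y)    # coefs: (n_features, n_alphas)
    truth = np.zeros(X.shape[1], dtype=bool)
    truth[list(true_support)] = True
    fdp, tpp = [], []
    for j in range(coefs.shape[1]):
        selected = coefs[:, j] != 0
        fdp.append((selected & ~truth).sum() / max(selected.sum(), 1))
        tpp.append((selected & truth).sum() / max(truth.sum(), 1))
    return alphas, np.array(fdp), np.array(tpp)
```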


Genetics | 2017

Controlling the rate of GWAS false discoveries

Damian Brzyski; Christine B. Peterson; Piotr Sobczyk; Emmanuel J. Candès; Małgorzata Bogdan; Chiara Sabatti

With the rise of both the number and the complexity of traits of interest, control of the false discovery rate (FDR) in genetic association studies has become an increasingly appealing and accepted target for multiple comparison adjustment. While a number of robust FDR-controlling strategies exist, the nature of this error rate is intimately tied to the precise way in which discoveries are counted, and the performance of FDR-controlling procedures is satisfactory only if there is a one-to-one correspondence between what scientists describe as unique discoveries and the number of rejected hypotheses. The presence of linkage disequilibrium between markers in genome-wide association studies (GWAS) often leads researchers to consider the signal associated with multiple neighboring SNPs as indicating the existence of a single genomic locus with possible influence on the phenotype. This a posteriori aggregation of rejected hypotheses results in inflation of the relevant FDR. We propose a novel approach to FDR control that is based on prescreening to identify the level of resolution of distinct hypotheses. We show how FDR-controlling strategies can be adapted to account for this initial selection both with theoretical results and simulations that mimic the dependence structure to be expected in GWAS. We demonstrate that our approach is versatile and useful when the data are analyzed using both tests based on single markers and multiple regression. We provide an R package that allows practitioners to apply our procedure on standard GWAS format data, and illustrate its performance on lipid traits in the North Finland Birth Cohort 66 (NFBC66) study.
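
A toy Python sketch of the aggregate-then-test idea: collapse correlated SNPs into clusters, take one representative p-value per cluster, and run Benjamini-Hochberg at the cluster level so that each rejection corresponds to one locus. The min-p representative is an illustrative simplification (anti-conservative without further adjustment); the paper's procedure and its R package are more careful.

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= q * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        reject[order[: passed[-1] + 1]] = True
    return reject

def clustered_bh(pvals, cluster_ids, q=0.05):
    """Aggregate first, then test: one representative p-value per LD
    cluster, BH across clusters, so each discovery is a locus."""
    pvals = np.asarray(pvals, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids)
    reps = np.array([pvals[cluster_ids == c].min() for c in clusters])
    return clusters[bh_reject(reps, q)]    # clusters declared discoveries
```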


Statistics | 1996

Testing Uniformity Via Log-Spline Modeling

Małgorzata Bogdan; T. Ledwina

The data-driven method of selecting the number of components in Neyman's smooth test for uniformity, introduced by Ledwina (1994) and generalized by Kallenberg and Ledwina (1995a), is extended to cover some log-spline models. The asymptotic null distribution is derived, and consistency of the data-driven test is shown under essentially all alternatives. Some simulation results are also presented.


Genetic Epidemiology | 2001

The importance of connections: joining components of the Hutterite pedigree.

Nicola H. Chapman; Anne-Louise Leutenegger; M. D. Badzioch; Małgorzata Bogdan; Erin M. Conlon; E. W. Daw; F. Gagnon; N. Li; J. M. Maia; Ellen M. Wijsman; E. A. Thompson

We analyzed a quantitative trait (serum IgE levels), and a binary trait (asthma), in four Hutterite sub‐pedigrees. A genome screen for asthma was performed using GENEHUNTER, and interesting regions were followed up using extended pedigrees and the FASTLINK package. Markov chain Monte Carlo (MCMC) methods were used to assess haplotype sharing among affected individuals (MORGAN/AUTOZYG), and to perform a combined oligogenic segregation and linkage analysis (LOKI) for log10(IgE). We found evidence for at least two susceptibility loci for asthma on chromosome 5, and a QTL for log10(IgE) on chromosome 1. Our analyses demonstrate that using the most complete pedigree structure possible is advisable, with attention to the possibility of heterogeneity among subunits of a very large pedigree.


Bioinformatics | 2004

On the significance of sequence alignments when using multiple scoring matrices

Florian Frommlet; Andreas Futschik; Małgorzata Bogdan

MOTIVATION: Pairwise local sequence alignment is commonly used to search databases for sequences related to some query sequence. Alignments are obtained using a scoring matrix that takes into account the different frequencies of occurrence of the various types of amino acid substitutions. Software like BLAST provides the user with a set of scoring matrices to choose from, and the literature sometimes recommends trying several scoring matrices on the sequences of interest. The significance of an alignment is usually assessed by looking at E-values and p-values. While sequence lengths and database sizes enter the standard calculations of significance, it is much less common to take the use of several scoring matrices on the same sequences into account. Altschul proposed corrections of the p-value that account for the simultaneous use of an infinite number of PAM matrices. Here we consider the more realistic situation where the user may choose from a finite set of popular PAM and BLOSUM matrices, in particular the ones available in BLAST. It turns out that the significance of a result can be considerably overestimated if a set of substitution matrices is used in an alignment problem and the most significant alignment is then quoted.

RESULTS: Based on extensive simulations, we study the multiple testing problem that occurs when several scoring matrices for local sequence alignment are used. We consider a simple Bonferroni correction of the p-values and investigate its accuracy. Finally, we propose a more accurate correction based on extreme value distributions fitted to the maximum of the normalized scores obtained from different scoring matrices. For various sets of matrices we provide correction factors which can be easily applied to adjust p- and E-values reported by software packages.
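
The simple correction is easy to state in code. A Python sketch, where the Bonferroni factor is the number m of scoring matrices tried; combining E-values by summation is an assumption in the spirit of the union bound, not a value taken from the paper.

```python
def bonferroni_min_p(pvalues_by_matrix):
    """Bonferroni adjustment when the best of m matrices is reported:
    multiply the smallest p-value by the number of matrices tried."""
    m = len(pvalues_by_matrix)
    return min(1.0, m * min(pvalues_by_matrix))

def combined_e_value(evalues_by_matrix):
    """E-values are expected hit counts, so the expected number of hits
    this good under ANY of the matrices is at most their sum."""
    return sum(evalues_by_matrix)
```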

Collaboration


Dive into Małgorzata Bogdan's collaborations.

Top Co-Authors

David J. Ramsey (Baylor College of Medicine)
Damian Brzyski (Indiana University Bloomington)
Weijie Su (University of Pennsylvania)
Piotr Twaróg (Wrocław University of Technology)