Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Andrew M. Raim is active.

Publication


Featured research published by Andrew M. Raim.


Computational Statistics & Data Analysis | 2016

A flexible zero-inflated model to address data dispersion

Kimberly F. Sellers; Andrew M. Raim

Excess zeroes are often thought of as a cause of data over-dispersion (i.e. when the variance exceeds the mean); this claim is not entirely accurate. In actuality, excess zeroes reduce the mean of a dataset, thus inflating the dispersion index (i.e. the variance divided by the mean). While this results in an increased chance for data over-dispersion, the implication is not guaranteed. Thus, one should consider a flexible distribution that not only can account for excess zeroes, but can also address potential over- or under-dispersion. A zero-inflated Conway-Maxwell-Poisson (ZICMP) regression allows for modeling the relationship between explanatory and response variables, while capturing the effects due to excess zeroes and dispersion. This work derives the ZICMP model and illustrates its flexibility, extrapolates the corresponding likelihood ratio test for the presence of significant data dispersion, and highlights various statistical properties and model fit through several examples.

Highlights: Zero-inflated Conway-Maxwell-Poisson models dispersed datasets with excess zeroes. Hypothesis test detects statistically significant dispersion in light of excess zeroes. Data simulations and examples illustrate flexibility in model fit.
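
A minimal sketch of the model above, assuming a ZICMP distribution with rate lambda, dispersion nu, and zero-inflation probability p: the probability of a zero is p + (1 - p) / Z(lambda, nu) and the probability of a positive count y is (1 - p) lambda^y / ((y!)^nu Z(lambda, nu)). The R code below writes out this log-likelihood for iid counts and maximizes it with optim(); the truncation of the normalizing constant and the simulated data are assumptions for illustration, and packages such as COMPoissonReg on CRAN provide fuller regression implementations.

```r
## Sketch only: iid ZICMP log-likelihood maximized with optim(). The truncation
## point M for the normalizing constant is an illustrative assumption.

# log of the CMP normalizing constant Z(lambda, nu) = sum_j lambda^j / (j!)^nu,
# truncated at M terms and computed with log-sum-exp for stability
log_Z <- function(log_lambda, nu, M = 200) {
  j <- 0:M
  terms <- j * log_lambda - nu * lgamma(j + 1)
  m <- max(terms)
  m + log(sum(exp(terms - m)))
}

# negative log-likelihood of iid ZICMP(lambda, nu, p) counts y
zicmp_nll <- function(par, y) {
  log_lambda <- par[1]          # log(lambda), unconstrained
  nu         <- exp(par[2])     # dispersion parameter, kept positive
  p          <- plogis(par[3])  # zero-inflation probability, kept in (0, 1)
  lZ <- log_Z(log_lambda, nu)
  ll <- ifelse(y == 0,
               log(p + (1 - p) * exp(-lZ)),
               log(1 - p) + y * log_lambda - nu * lgamma(y + 1) - lZ)
  -sum(ll)
}

# example: simulate zero-inflated Poisson-like counts and recover the parameters
set.seed(1)
y <- ifelse(runif(500) < 0.3, 0L, rpois(500, 2))
fit <- optim(c(log(2), 0, 0), zicmp_nll, y = y, method = "BFGS")
c(lambda = exp(fit$par[1]), nu = exp(fit$par[2]), p = plogis(fit$par[3]))
```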


Journal of Statistical Computation and Simulation | 2013

Maximum-likelihood estimation of the random-clumped multinomial model as a prototype problem for large-scale statistical computing

Andrew M. Raim; Matthias K. Gobbert; Nagaraj K. Neerchal; Jorge G. Morel

Numerical methods are needed to obtain maximum-likelihood estimates (MLEs) in many problems. Computation time can be an issue for some likelihoods even with modern computing power. We consider one such problem where the assumed model is a random-clumped multinomial distribution. We compute MLEs for this model in parallel using the Toolkit for Advanced Optimization software library. The computations are performed on a distributed-memory cluster with a low-latency interconnect. We demonstrate that for larger problems, scaling the number of processes improves wall-clock time significantly. An illustrative example shows how parallel MLE computation can be useful in a large data analysis. Our experience with a direct numerical approach indicates that more substantial gains may be obtained by making use of the specific structure of the random-clumped model.
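
The general pattern here, evaluating the log-likelihood over disjoint chunks of observations on separate processes and handing the combined objective to an optimizer, can be sketched in a few lines. The sketch below is a stand-in under stated assumptions: it uses R's parallel package on a single machine in place of the Toolkit for Advanced Optimization with MPI, and a plain multinomial likelihood with a softmax parametrization rather than the random-clumped multinomial itself.

```r
## Sketch only: chunk-wise parallel evaluation of a multinomial log-likelihood,
## fed to optim(). The paper's setting (TAO + MPI on a distributed-memory cluster,
## random-clumped multinomial) is replaced by simpler stand-ins.

library(parallel)

# map k - 1 free parameters to k category probabilities
softmax <- function(theta) {
  e <- exp(c(theta, 0))
  e / sum(e)
}

# negative log-likelihood of multinomial count vectors (rows of Y),
# with each worker summing the contributions of its own chunk of rows
nll_parallel <- function(theta, Y, cl) {
  p <- softmax(theta)
  chunks <- splitIndices(nrow(Y), length(cl))
  pieces <- parLapply(cl, chunks, function(idx, Y, p) {
    -sum(apply(Y[idx, , drop = FALSE], 1, dmultinom, prob = p, log = TRUE))
  }, Y = Y, p = p)
  Reduce(`+`, pieces)
}

# example: simulate counts, then maximize the likelihood over two workers
set.seed(1)
Y <- t(rmultinom(2000, size = 20, prob = c(0.5, 0.3, 0.2)))
cl <- makeCluster(2)
fit <- optim(c(0, 0), nll_parallel, Y = Y, cl = cl, method = "BFGS")
stopCluster(cl)
softmax(fit$par)   # estimated category probabilities, close to (0.5, 0.3, 0.2)
```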


Journal of Statistical Computation and Simulation | 2018

Parallelizing Computation of Expected Values in Recombinant Binomial Trees

Sai K. Popuri; Andrew M. Raim; Nagaraj K. Neerchal; Matthias K. Gobbert

Recombinant binomial trees are binary trees where each non-leaf node has two child nodes, but adjacent parents share a common child node. Such trees arise in option pricing in finance. For example, an option can be valued by evaluating the expected payoffs with respect to random paths in the tree. The cost to exactly compute expected values over random paths grows exponentially in the depth of the tree, rendering a serial computation of one branch at a time impractical. We propose a parallelization method that transforms the calculation of the expected value into an embarrassingly parallel problem by mapping the branches of the binomial tree to the processes in a multiprocessor computing environment. We also discuss a parallel Monte Carlo method and verify the convergence and the variance reduction behavior by simulation study. Performance results from R and Julia implementations are compared on a distributed computing cluster.
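
For a path-independent payoff the recombinant structure is what makes the computation tractable: the 2^n paths collapse onto n + 1 terminal nodes, so the expected value is a binomially weighted sum that can be split into disjoint ranges of nodes and evaluated independently, which is the embarrassingly parallel structure described above. The sketch below shows the serial sum for a European call under Cox-Ross-Rubinstein parameters; the payoff, the parameters, and the single-machine setting are illustrative assumptions rather than the paper's experiments.

```r
## Sketch only: discounted expected payoff of a European call over the n + 1
## terminal nodes of a recombinant binomial tree.

expected_payoff <- function(S0, K, n, u, d, p, r, Tmat) {
  k  <- 0:n                                  # number of up moves along a path
  ST <- S0 * u^k * d^(n - k)                 # terminal prices at the n + 1 nodes
  w  <- dbinom(k, size = n, prob = p)        # binomial weights (path counts x probabilities)
  exp(-r * Tmat) * sum(w * pmax(ST - K, 0))  # discounted expected call payoff
}

# example with Cox-Ross-Rubinstein parameters; the sum over k can be split into
# disjoint ranges assigned to different processes, with partial sums added at the end
S0 <- 100; K <- 100; r <- 0.05; sigma <- 0.2; Tmat <- 1; n <- 1000
dt <- Tmat / n
u  <- exp(sigma * sqrt(dt)); d <- 1 / u
p  <- (exp(r * dt) - d) / (u - d)
expected_payoff(S0, K, n, u, d, p, r, Tmat)   # close to the Black-Scholes price (~10.45)
```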


Journal of Computational and Graphical Statistics | 2018

An Extension of Generalized Linear Models to Finite Mixture Outcome Distributions

Andrew M. Raim; Nagaraj K. Neerchal; Jorge G. Morel

Finite mixture distributions arise in sampling a heterogeneous population. Data drawn from such a population will exhibit extra variability relative to any single subpopulation. Statistical models based on finite mixtures can assist in the analysis of categorical and count outcomes when standard generalized linear models (GLMs) cannot adequately express variability observed in the data. We propose an extension of GLMs where the response follows a finite mixture distribution and the regression of interest is linked to the mixture’s mean. This approach may be preferred over a finite mixture of regressions when the population mean is of interest; here, only one regression must be specified and interpreted in the analysis. A technical challenge is that the mixture’s mean is a composite parameter that does not appear explicitly in the density. The proposed model maintains its link to the regression through a certain random effects structure and is completely likelihood-based. We consider typical GLM cases where means are either real-valued, constrained to be positive, or constrained to be on the unit interval. The resulting model is applied to two example datasets through Bayesian analysis. Accounting for the extra variation is seen to improve residual plots and to produce widened prediction intervals that reflect the added uncertainty. Supplementary materials for this article are available online.
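
To make the composite-mean idea concrete, the toy sketch below uses a two-component "random-clumped" binomial mixture whose mean is m * mu by construction, so that linking logit(mu) to the covariates preserves the usual GLM interpretation of the coefficients. The particular mixture, the parametrization, and the maximum-likelihood fit (rather than the Bayesian analysis used in the paper) are all illustrative assumptions, not the paper's general construction.

```r
## Sketch only: a binomial-type GLM whose response is the two-component mixture
## mu * Binom(m, (1 - rho) * mu + rho) + (1 - mu) * Binom(m, (1 - rho) * mu),
## which has mean m * mu, so the regression logit(mu) = X beta targets the mixture mean.

dmix <- function(y, m, mu, rho, log = FALSE) {
  d <- mu * dbinom(y, m, (1 - rho) * mu + rho) +
    (1 - mu) * dbinom(y, m, (1 - rho) * mu)
  if (log) log(d) else d
}

# negative log-likelihood with logit(mu_i) = x_i' beta and rho on the logit scale
nll <- function(par, y, X, m) {
  beta <- par[-length(par)]
  rho  <- plogis(par[length(par)])
  mu   <- plogis(drop(X %*% beta))
  -sum(dmix(y, m, mu, rho, log = TRUE))
}

# example: simulate from the mixture and recover the regression on the mean scale
set.seed(1)
n <- 1000; m <- 10
X <- cbind(1, rnorm(n))
mu <- plogis(drop(X %*% c(-0.5, 1)))
rho <- 0.4
clump <- rbinom(n, 1, mu)                        # latent component indicator
y <- rbinom(n, m, (1 - rho) * mu + rho * clump)  # draw from the selected component
fit <- optim(c(0, 0, 0), nll, y = y, X = X, m = m, method = "BFGS")
c(beta = fit$par[1:2], rho = plogis(fit$par[3]))
```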


Archive | 2013

Block Cyclic Distribution of Data in pbdR and its Effects on Computational Efficiency

Matthew G. Bachmann; Ashley D. Dyas; Shelby C. Kilmer; Julian Sass; Andrew M. Raim; Nagaraj K. Neerchal; Kofi P. Adragni; George Ostrouchov; Ian F. Thorpe

Programming with big data in R (pbdR), a package used to implement high-performance computing in the statistical software R, uses block cyclic distribution to organize large data across many processes. Because computations performed on large matrices are often not associative, a systematic approach must be used during parallelization to divide the matrix correctly. The block cyclic distribution method stresses a balanced load across processes by allocating sections of data to a corresponding node. This method yields well-divided data that each process can compute on individually, so that the final result is calculated more efficiently. A nontrivial problem occurs when using block cyclic distribution: which combinations of block sizes and grid layouts are most effective? These two factors greatly influence computational efficiency, so it is crucial to study and understand their relationship. To analyze the effects of block size and processor grid layout, we carry out a performance study of the block cyclic process used to compute a principal components analysis (PCA). We apply PCA both to a large simulated data set and to data involving the analysis of single nucleotide polymorphisms (SNPs). We implement analysis of variance (ANOVA) techniques in order to distinguish the variability associated with each grid layout and block distribution. Once the nature of these factors is determined, predictions about the performance for much larger data sets can be made. Our final results demonstrate the relationship between computational efficiency and both block distribution and processor grid layout, and establish a benchmark regarding which combinations of these factors are most effective.
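
The block-cyclic layout itself is a simple index mapping, and the block size and process-grid shape studied above are exactly its two tuning knobs. The sketch below shows the one-dimensional mapping from a 0-based global index to an owning process and a local storage position, following the ScaLAPACK convention that pbdR builds on; applying the same mapping to rows and columns independently gives the two-dimensional block-cyclic layout over a processor grid. The block size and process count in the example are illustrative.

```r
## Sketch only: 1-d block-cyclic mapping of global indices to processes and
## local positions, for block size nb and P processes (0-based indices throughout).

block_cyclic_map <- function(g, nb, P) {
  blk  <- g %/% nb                      # which block the global index falls in
  proc <- blk %% P                      # blocks are dealt out cyclically to processes
  loc  <- (blk %/% P) * nb + g %% nb    # position within the owning process's local storage
  data.frame(global = g, process = proc, local = loc)
}

# example: 20 rows in blocks of 3, dealt over 2 processes
block_cyclic_map(0:19, nb = 3, P = 2)
```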


Archive | 2013

Identifying Nonlinear Correlations in High Dimensional Data with Application to Protein Molecular Dynamics Simulations

William J. Bailey; Claire A. Chambless; Brandynne M. Cho; Jesse D. Smith; Andrew M. Raim; Kofi P. Adragni; Ian F. Thorpe

Complex biomolecules such as proteins can respond to changes in their environment through a process called allostery, which plays an important role in regulating the function of these biomolecules. Allostery occurs when an event at a specific location in a macromolecule produces an effect at a location in the molecule some distance away. An important component of allostery is the coupling of protein sites. Such coupling is one mechanism by which allosteric effects can be transmitted over long distances. To understand this phenomenon, molecular dynamics simulations are carried out with a large number of atoms, and the trajectories of these atoms are recorded over time. Simple correlation methods have been used in the literature to identify coupled motions between protein sites. We implement a recently developed statistical method for dimension reduction called principal fitted components (PFC) in the statistical programming language R to identify both linear and nonlinear correlations between protein sites while dealing efficiently with the high dimensionality of the data. PFC models reduce the dimensionality of data while capturing linear and nonlinear dependencies among predictors (atoms) using a flexible set of basis functions. For faster processing, we implement the PFC algorithm using parallel computing through the Programming with Big Data in R (pbdR) package for R. We demonstrate the method's effectiveness on simulated datasets, and apply the routine to time series data from molecular dynamics (MD) simulations to identify coupled motion among the atoms.
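
A compact way to see what PFC does is the isotropic-error special case: regress the centered predictors on a flexible basis of the response, then take the leading eigenvectors of the covariance of the fitted values as the reduction directions. The sketch below implements that special case with a cubic polynomial basis; the basis choice, the isotropic-error simplification, and the simulated data are assumptions for illustration, and the paper's parallel pbdR implementation targets far larger problems.

```r
## Sketch only: principal fitted components (PFC) under an isotropic error
## covariance, with a polynomial basis f(y).

pfc_iso <- function(X, y, d = 2, degree = 3) {
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)                         # center predictors
  Fy <- scale(poly(y, degree = degree), center = TRUE, scale = FALSE)  # centered basis f(y)
  PF <- Fy %*% solve(crossprod(Fy), t(Fy))                             # projection onto span(Fy)
  Sfit <- crossprod(Xc, PF %*% Xc) / n                                 # covariance of fitted values
  Gamma <- eigen(Sfit, symmetric = TRUE)$vectors[, 1:d, drop = FALSE]
  list(Gamma = Gamma, scores = Xc %*% Gamma)   # estimated directions and reduced predictors
}

# example: 10 predictors, two of which depend nonlinearly on y
set.seed(1)
n <- 500
y <- runif(n, -2, 2)
X <- matrix(rnorm(n * 10), n, 10)
X[, 1] <- X[, 1] + y^2
X[, 2] <- X[, 2] + y^3
fit <- pfc_iso(X, y, d = 2)
round(fit$Gamma, 2)   # the leading directions concentrate on the first two predictors
```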


Technical Report HPCF-2012-15 | 2012

A comparative evaluation of Matlab, Octave, FreeMat, Scilab, R, and IDL on tara

E. I. Coman; Matthew W. Brewster; Sai K. Popuri; Andrew M. Raim; Matthias K. Gobbert


Journal of the Royal Statistical Society, Series A (Statistics in Society) | 2017

Zero-inflated modelling for characterizing coverage errors of extracts from the US Census Bureau's Master Address File

Derek S. Young; Andrew M. Raim; Nancy R. Johnson


arXiv: Computation | 2016

ManifoldOptim: An R Interface to the ROPTLIB Library for Riemannian Manifold Optimization

Sean Martin; Andrew M. Raim; Wen Huang; Kofi P. Adragni


Statistical Methodology | 2014

On the method of approximate Fisher scoring for finite mixtures of multinomials

Andrew M. Raim; Minglei Liu; Nagaraj K. Neerchal; Jorge G. Morel

Collaboration


Dive into Andrew M. Raim's collaborations.

Top Co-Authors

Minglei Liu
University of Maryland

Sean Martin
University of Adelaide

E. I. Coman
Johns Hopkins University Applied Physics Laboratory