Featured Research

Methodology

Optimal Nested Simulation Experiment Design via Likelihood Ratio Method

Nested simulation arises frequently in financial or input uncertainty quantification problems, where the performance measure is defined as a function of the simulation output mean conditional on the outer scenario. The standard nested simulation samples M outer scenarios and runs N inner replications at each. We propose a new experiment design framework for problems whose inner replication inputs are generated from probability distributions parameterized by the outer scenario. This structure lets us pool replications from one outer scenario to estimate another scenario's conditional mean via the likelihood ratio method. We formulate a bi-level optimization problem to decide not only which of the M outer scenarios to simulate and how many times to replicate at each, but also how to pool these replications, such that the total simulation effort is minimized while achieving the same estimation error as the standard nested simulation. The resulting optimal design requires far less simulation effort than MN. We provide asymptotic analyses of the convergence rates of the performance measure estimators computed from the experiment design. Empirical results show that our experiment design significantly reduces the simulation cost compared to both the standard nested simulation and a state-of-the-art design that pools replications via regression.
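
The pooling step can be sketched in a few lines, assuming (purely for illustration) that the inner-level inputs follow a normal distribution whose mean is the outer scenario; this is not the authors' implementation, just the likelihood ratio idea: replications simulated under one scenario are reweighted to estimate the conditional mean under another.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def lr_pooled_estimate(theta_target, theta_source, x_source, g):
    """Estimate E[g(X) | theta_target] from replications simulated
    under theta_source, via likelihood-ratio (importance) weights.
    Inner inputs are assumed N(theta, 1) purely for illustration."""
    w = norm.pdf(x_source, loc=theta_target) / norm.pdf(x_source, loc=theta_source)
    return np.mean(w * g(x_source))

# One outer scenario we actually simulate, and one we do not.
theta_source, theta_target = 0.0, 0.3
x = rng.normal(theta_source, 1.0, size=10_000)   # pooled inner replications
g = lambda x: np.maximum(x, 0.0)                 # e.g. an option-like payoff

est = lr_pooled_estimate(theta_target, theta_source, x, g)
# Reference: direct simulation under the target scenario.
direct = np.mean(g(rng.normal(theta_target, 1.0, size=10_000)))
print(f"LR-pooled: {est:.4f}  direct: {direct:.4f}")
```

With enough pooled replications the reweighted estimate tracks the direct one, which is what makes it possible to skip simulating some outer scenarios altogether.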

Read more
Methodology

Optimal Sampling Regimes for Estimating Population Dynamics

Ecologists are interested in modeling the population growth of species in various ecosystems. Studying population dynamics can assist environmental managers in making better decisions for the environment. Traditionally, the sampling of species and tracking of populations have been recorded on a regular time frequency. However, sampling can be an expensive process, constrained by the resources, money, and time available. Limiting sampling makes it challenging to properly track the growth of a population. Thus, we propose a novel approach to designing sampling regimes based on the dynamics associated with population growth models. This design study minimizes the amount of time ecologists spend in the field, while maximizing the information provided by the data.
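
As a hypothetical illustration of dynamics-informed design, suppose growth follows a logistic model; sampling times can then be chosen greedily to maximize a D-optimal (Fisher information determinant) criterion, concentrating visits where the trajectory is most parameter-sensitive. The model, parameter values, and candidate grid below are invented for the sketch and are not taken from the study.

```python
import numpy as np

def logistic(t, r, K, n0):
    """Closed-form logistic growth trajectory."""
    return K / (1.0 + (K - n0) / n0 * np.exp(-r * t))

def sensitivities(t, r, K, n0, eps=1e-6):
    """Finite-difference sensitivities of N(t) w.r.t. (r, K)."""
    dr = (logistic(t, r + eps, K, n0) - logistic(t, r - eps, K, n0)) / (2 * eps)
    dK = (logistic(t, r, K + eps, n0) - logistic(t, r, K - eps, n0)) / (2 * eps)
    return np.column_stack([dr, dK])

# Assumed parameter values and a candidate grid of field-visit times.
r, K, n0 = 0.5, 100.0, 5.0
candidates = np.linspace(0.0, 30.0, 301)
S = sensitivities(candidates, r, K, n0)

# Greedy D-optimal selection of a small number of sampling times.
chosen = []
for _ in range(5):
    best = max(
        (i for i in range(len(candidates)) if i not in chosen),
        key=lambda i: np.linalg.det(
            S[chosen + [i]].T @ S[chosen + [i]] + 1e-9 * np.eye(2)
        ),
    )
    chosen.append(best)

print("suggested sampling times:", np.sort(candidates[chosen]))
```

The selected times cluster around the steep growth phase and the approach to carrying capacity, rather than being spread on a regular grid.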

Read more
Methodology

Ordinal Trees and Random Forests: Score-Free Recursive Partitioning and Improved Ensembles

Existing ordinal trees and random forests typically use scores that are assigned to the ordered categories, which implies that a higher scale level is used. Versions of ordinal trees are proposed that take the scale level seriously and avoid the assignment of artificial scores. The basic construction principle is based on an investigation of the binary models that are implicitly used in parametric ordinal regression. These building blocks can be fitted by trees and combined in a similar way as in parametric models. The resulting trees use the ordinal scale level only. Since binary trees and random forests are constituent elements of these trees, one can exploit the wide range of binary trees that have already been developed. A further topic is the potentially poor performance of random forests, which seems to have been ignored in the literature. Ensembles that include parametric models are proposed to obtain prediction methods that tend to perform well in a wide range of settings. The performance of the methods is evaluated empirically on several data sets.
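
A minimal sketch of the score-free construction, under one plausible reading of the binary building blocks (the cumulative decomposition I(Y > k)) and using scikit-learn trees as the binary learners; both choices are mine rather than the paper's exact setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class CumulativeOrdinalTrees:
    """Fit K-1 binary trees for the events {Y > k}, k = 1..K-1,
    then recombine the binary predictions into category probabilities."""

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        self.trees_ = [
            DecisionTreeClassifier(max_depth=3).fit(X, (y > k).astype(int))
            for k in self.classes_[:-1]
        ]
        return self

    def predict_proba(self, X):
        # P(Y > k) for each threshold; prepend P(Y > 0) = 1, append 0.
        exceed = np.column_stack(
            [np.ones(len(X))]
            + [t.predict_proba(X)[:, 1] for t in self.trees_]
            + [np.zeros(len(X))]
        )
        # Enforce monotonicity of the cumulative probabilities.
        exceed = np.minimum.accumulate(exceed, axis=1)
        return -np.diff(exceed, axis=1)  # P(Y = k) = P(Y > k-1) - P(Y > k)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
# Ordinal response with categories 1..4, synthetic for the sketch.
y = np.digitize(X[:, 0] + 0.5 * rng.normal(size=500), [-1.0, 0.0, 1.0]) + 1
model = CumulativeOrdinalTrees().fit(X, y)
print(model.predict_proba(X[:3]).round(2))
```

No numeric scores are ever assigned to the categories; only their ordering is used, via the binary threshold events.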

Read more
Methodology

Overcoming bias in representational similarity analysis

Representational similarity analysis (RSA) is a multivariate technique to investigate cortical representations of objects or constructs. While it avoids the ill-posed matrix inversions that plague multivariate approaches in the presence of many outcome variables, it suffers from a confound arising from the non-orthogonality of the design matrix. Here, a partial correlation approach is explored to adjust for this source of bias by partialling out the confound. A formal analysis shows the dependence of this confound on the temporal correlation model of the sequential observations, motivating a data-driven approach that avoids the problem of misspecifying this model. However, where the autocorrelation locally diverges from the volume average, bias may be difficult to control for exactly (local bias), given the difficulties of estimating the precise form of the confound at each voxel. Application to real data shows the effectiveness of the partial correlation approach and suggests that the impact of local bias is minor. However, where the control for bias locally fails, spurious associations with the similarity matrix of the stimuli may emerge. This limitation may be intrinsic to RSA applied to non-orthogonal designs.
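
The partialling-out step itself is a generic partial correlation and can be sketched directly (this is not the paper's full estimator, and the representational dissimilarity matrices below are synthetic): both the data RDM and the model RDM are residualized against the confound RDM before correlating their vectorized lower triangles.

```python
import numpy as np

def partial_rdm_correlation(data_rdm, model_rdm, confound_rdm):
    """Correlate data and model RDMs after partialling out a confound RDM.
    All inputs are square dissimilarity matrices of the same size."""
    tri = np.tril_indices_from(data_rdm, k=-1)
    d, m, c = data_rdm[tri], model_rdm[tri], confound_rdm[tri]

    def residualize(v, conf):
        # Regress v on the confound (plus intercept) and keep residuals.
        Z = np.column_stack([np.ones_like(conf), conf])
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta

    rd, rm = residualize(d, c), residualize(m, c)
    return np.corrcoef(rd, rm)[0, 1]

rng = np.random.default_rng(2)
n = 20
confound = np.abs(rng.normal(size=(n, n))); confound = (confound + confound.T) / 2
model = np.abs(rng.normal(size=(n, n)));    model = (model + model.T) / 2
data = 0.6 * model + 0.8 * confound + 0.1 * rng.normal(size=(n, n))
data = (data + data.T) / 2

tri = np.tril_indices(n, -1)
print("raw r:    ", np.corrcoef(data[tri], model[tri])[0, 1].round(3))
print("partial r:", partial_rdm_correlation(data, model, confound).round(3))
```

The raw correlation is inflated by the shared confound; the partial correlation recovers a value closer to the planted model-data association.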

Read more
Methodology

PCA Rerandomization

The Mahalanobis distance between treatment-group and control-group covariate means is often adopted as a balance criterion when implementing a rerandomization strategy. However, this criterion may not work well in high-dimensional cases because it balances all orthogonalized covariates equally. Here, we propose leveraging principal component analysis (PCA) to identify proper subspaces in which the Mahalanobis distance should be calculated. Not only can PCA effectively reduce the dimensionality in high-dimensional cases while capturing most of the information in the covariates, but it also provides computational simplicity by focusing on the top orthogonal components. We show that our PCA rerandomization scheme has desirable theoretical properties for balancing covariates and thereby improving the estimation of average treatment effects. This conclusion is further supported by numerical studies using both simulated and real examples.
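
A minimal sketch of the scheme as described, with the number of components, the acceptance threshold, and the 1:1 assignment being illustrative choices rather than the paper's recommendations: covariates are projected onto the top principal components, and assignments are redrawn until the Mahalanobis distance between group means in that subspace is small.

```python
import numpy as np

def pca_rerandomize(X, k=3, threshold=1.0, max_tries=10_000, seed=0):
    """Rerandomize a 1:1 treatment assignment until the Mahalanobis
    distance between group means of the top-k principal component
    scores falls below `threshold` (illustrative criterion)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Top-k principal component scores via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    n = len(X)
    for _ in range(max_tries):
        assign = rng.permutation(np.repeat([0, 1], n // 2))
        diff = scores[assign == 1].mean(axis=0) - scores[assign == 0].mean(axis=0)
        # n/4 scaling matches the usual criterion for equal group sizes.
        dist = (n / 4) * diff @ cov_inv @ diff
        if dist < threshold:
            return assign, dist
    raise RuntimeError("no acceptable assignment found")

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 50))          # high-dimensional covariates
assign, dist = pca_rerandomize(X)
print("accepted Mahalanobis distance:", round(float(dist), 3))
```

Balancing only the top components avoids the equal weighting of all 50 orthogonalized directions that makes the full Mahalanobis criterion weak in high dimensions.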

Read more
Methodology

Parameter Restrictions for the Sake of Identification: Is there Utility in Asserting that Perhaps a Restriction Holds?

Statistical modeling can involve a tension between assumptions and statistical identification. The law of the observable data may not uniquely determine the value of a target parameter without invoking a key assumption, and, while plausible, this assumption may not be obviously true in the scientific context at hand. Moreover, many key assumptions are untestable, so we cannot rely on the data to resolve the question of whether the target is legitimately identified. Working in the Bayesian paradigm, we consider the grey zone of situations where a key assumption, in the form of a parameter space restriction, is scientifically reasonable but not incontrovertible for the problem being tackled. Specifically, we investigate the statistical properties that ensue if we structure a prior distribution to assert that "maybe" or "perhaps" the assumption holds. Technically, this simply reduces to using a mixture prior distribution that puts some prior weight on the assumption, or on one of several assumptions, holding. While the construct is straightforward, however, there is very little literature discussing situations where Bayesian model averaging is employed across a mix of fully identified and partially identified models.
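
A toy illustration of the "maybe the assumption holds" construction, in a conjugate normal setting invented here for concreteness: the prior mixes a point restriction delta = 0 (weight w) with an unrestricted normal component, and the posterior weight on the restriction follows from the two marginal likelihoods. (In the paper's setting the unrestricted model is only partially identified; the fully conjugate toy below sidesteps that.)

```python
import numpy as np
from scipy.stats import norm

def posterior_restriction_weight(y_bar, n, sigma, w=0.5, tau=1.0):
    """Posterior probability that the restriction delta = 0 holds,
    given y_i ~ N(delta, sigma^2) and the mixture prior
    delta ~ w * point_mass(0) + (1 - w) * N(0, tau^2)."""
    se = sigma / np.sqrt(n)
    # Marginal likelihood of the sample mean under each prior component.
    m_restricted = norm.pdf(y_bar, loc=0.0, scale=se)
    m_unrestricted = norm.pdf(y_bar, loc=0.0, scale=np.sqrt(se**2 + tau**2))
    return w * m_restricted / (w * m_restricted + (1 - w) * m_unrestricted)

for y_bar in (0.05, 0.5, 2.0):
    p = posterior_restriction_weight(y_bar, n=50, sigma=1.0)
    print(f"y_bar = {y_bar:4.2f} -> P(restriction | data) = {p:.3f}")
```

Data compatible with the restriction push its posterior weight up; data far from it push the weight toward zero, so the analysis averages over "assumption holds" and "assumption fails" rather than committing to either.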

Read more
Methodology

Parameter estimation in nonlinear mixed effect models based on ordinary differential equations: an optimal control approach

We present a parameter estimation method for nonlinear mixed effect models based on ordinary differential equations (NLME-ODEs). The method aims to regularize the estimation problem in the presence of model misspecification, practical identifiability issues, and unknown initial conditions. To do so, we define our estimator as the minimizer of a cost function that incorporates a possible gap between the assumed population-level model and the specific individual dynamics. Computing the cost function leads to formulating and solving optimal control problems at the subject level. This control-theoretic approach bypasses the need to know or estimate initial conditions for each subject and regularizes the estimation problem in the presence of poorly identifiable parameters. Compared to maximum likelihood, we show on simulated examples that our method improves estimation accuracy in possibly partially observed systems with unknown initial conditions or poorly identifiable parameters, with or without model error. We conclude with a real application to antibody concentration data collected after vaccination against Ebola virus in phase 1 trials, using the estimated model discrepancy at the subject level to analyze the presence of model misspecification.
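
A stripped-down sketch of the subject-level idea, using a one-dimensional linear ODE, Euler discretization, and a penalty weight all chosen here for illustration: a control u(t) perturbs the assumed dynamics, and the estimator trades off data fit against the size of the control, so that model error is absorbed by u instead of biasing the parameter and initial-condition estimates.

```python
import numpy as np
from scipy.optimize import minimize

# Observations from a "true" system that the assumed model dx/dt = -theta*x
# does not capture exactly (illustrative misspecification term 0.3*sin(t)).
rng = np.random.default_rng(4)
t_obs = np.linspace(0.0, 5.0, 26)
y_obs = 2.0 * np.exp(-0.8 * t_obs) + 0.3 * np.sin(t_obs) \
        + 0.05 * rng.normal(size=t_obs.size)
dt = t_obs[1] - t_obs[0]

def simulate(theta, x0, u):
    """Euler integration of dx/dt = -theta*x + u(t) on the observation grid."""
    x = np.empty(t_obs.size)
    x[0] = x0
    for i in range(t_obs.size - 1):
        x[i + 1] = x[i] + dt * (-theta * x[i] + u[i])
    return x

def cost(params, lam=5.0):
    theta, x0, u = params[0], params[1], params[2:]
    x = simulate(theta, x0, u)
    # Data misfit plus penalty on the control (the model discrepancy).
    return np.sum((y_obs - x) ** 2) + lam * dt * np.sum(u**2)

# Decision variables: theta, x0, and the discretized control values.
init = np.concatenate([[0.5, 1.0], np.zeros(t_obs.size - 1)])
res = minimize(cost, init, method="L-BFGS-B")
print(f"theta_hat = {res.x[0]:.3f}, x0_hat = {res.x[1]:.3f}")
```

The fitted control series itself estimates the subject-level model discrepancy, which is the quantity the paper inspects to diagnose misspecification.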

Read more
Methodology

Parametric Copula-GP model for analyzing multidimensional neuronal and behavioral relationships

One of the main challenges in current systems neuroscience is the analysis of high-dimensional neuronal and behavioral data that are characterized by different statistics and timescales of the recorded variables. We propose a parametric copula model which separates the statistics of the individual variables from their dependence structure, and escapes the curse of dimensionality by using vine copula constructions. We use a Bayesian framework with Gaussian Process (GP) priors over copula parameters, conditioned on a continuous task-related variable. We validate the model on synthetic data and compare its performance in estimating mutual information against the commonly used non-parametric algorithms. Our model provides accurate information estimates when the dependencies in the data match the parametric copulas used in our framework. When the exact density estimation with a parametric model is not possible, our Copula-GP model is still able to provide reasonable information estimates, close to the ground truth and comparable to those obtained with a neural network estimator. Finally, we apply our framework to real neuronal and behavioral recordings obtained in awake mice. We demonstrate the ability of our framework to 1) produce accurate and interpretable bivariate models for the analysis of inter-neuronal noise correlations or behavioral modulations; 2) expand to more than 100 dimensions and measure information content in the whole-population statistics. These results demonstrate that the Copula-GP framework is particularly useful for the analysis of complex multidimensional relationships between neuronal, sensory and behavioral data.
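
The separation of marginals from dependence structure can be demonstrated with a bivariate Gaussian copula, a much-reduced stand-in for the vine Copula-GP construction (no GP conditioning, no vines): marginals are removed with empirical CDFs, the copula correlation is fitted on normal scores, and mutual information follows in closed form.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_mi(x, y):
    """Fit a bivariate Gaussian copula and return its mutual information.
    Marginals are removed with empirical CDFs (rank transform)."""
    n = len(x)
    u = rankdata(x) / (n + 1)          # pseudo-observations in (0, 1)
    v = rankdata(y) / (n + 1)
    z1, z2 = norm.ppf(u), norm.ppf(v)  # map to standard-normal scores
    rho = np.corrcoef(z1, z2)[0, 1]    # the copula parameter
    return -0.5 * np.log(1.0 - rho**2), rho

rng = np.random.default_rng(5)
# Dependent data with very different marginals (lognormal and cubed normal);
# the monotone transforms change the marginals but not the copula.
z = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=5_000)
x, y = np.exp(z[:, 0]), z[:, 1] ** 3

mi, rho = gaussian_copula_mi(x, y)
print(f"fitted copula rho = {rho:.3f}, MI = {mi:.3f} nats")
print(f"ground-truth  MI = {-0.5 * np.log(1 - 0.7**2):.3f} nats")
```

Because the rank transform strips the marginals, the information estimate is unaffected by the wildly different scales and timescales of the raw variables, which is the property the Copula-GP framework exploits.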

Read more
Methodology

Parsimonious Bayesian Factor Analysis for modelling latent structures in spectroscopy data

In recent years, animal diet has received increased attention, in particular regarding the impact of pasture-based feeding strategies on the quality of milk and dairy products, in line with the increased prevalence of grass-fed dairy products appearing on market shelves. To date, limited testing methods are available for the verification of grass-fed dairy, so these products are susceptible to food fraud and adulteration. Statistical tools that study potential differences among milk samples from animals on different feeding systems are therefore required, providing increased security around the authenticity of the products. Infrared spectroscopy techniques are widely used to collect data on milk samples and to predict milk-related traits. While these data are routinely used to predict the composition of the macro components of milk, each spectrum provides a reservoir of unharnessed information about the sample. The interpretation of these data presents a number of challenges due to their high dimensionality and the relationships amongst the spectral variables. In this work we propose a modification of standard factor analysis to induce a parsimonious summary of spectroscopic data. The procedure maps the observations into a low-dimensional latent space while simultaneously clustering the observed variables. The method indicates possible redundancies in the data and helps disentangle the complex relationships among the wavelengths. A flexible Bayesian estimation procedure is proposed for model fitting, providing reasonable values for the number of latent factors and clusters. The method is applied to milk mid-infrared spectroscopy data from dairy cows on different pasture- and non-pasture-based diets, providing accurate modelling of the data correlation, the clustering of variables, and information on differences between milk samples from cows on different diets.
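
A rough non-Bayesian analogue of the two ingredients, using scikit-learn and synthetic "spectra" (the paper's model is a bespoke Bayesian factor analyzer, not this pipeline): fit a low-dimensional factor model, then cluster wavelengths by their loading profiles to expose groups of redundant spectral variables.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

# Synthetic "spectra": 200 samples, 120 wavelengths driven by 3 latent
# factors, with block-structured loadings so wavelength groups exist.
n, p, q = 200, 120, 3
loadings = np.zeros((p, q))
for j, block in enumerate(np.array_split(np.arange(p), q)):
    loadings[block, j] = rng.uniform(0.5, 1.0, size=block.size)
X = rng.normal(size=(n, q)) @ loadings.T + 0.1 * rng.normal(size=(n, p))

# Step 1: low-dimensional latent representation of the spectra.
fa = FactorAnalysis(n_components=q).fit(X)

# Step 2: cluster wavelengths by their estimated loading profiles.
labels = KMeans(n_clusters=q, n_init=10, random_state=0).fit_predict(
    fa.components_.T
)

for k in range(q):
    idx = np.where(labels == k)[0]
    # Blocks are contiguous in this synthetic example, so a range suffices.
    print(f"cluster {k}: wavelengths {idx.min()}..{idx.max()} ({idx.size} variables)")
```

The recovered clusters mirror the planted wavelength blocks, illustrating how grouping variables by loadings flags redundancy in the spectrum.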

Read more
Methodology

Parsimonious Feature Extraction Methods: Extending Robust Probabilistic Projections with Generalized Skew-t

We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which: (1) accounts for an asymmetric distribution of the observation data; (2) provides a framework for grouped and generalised multiple-degree-of-freedom structures, giving a more flexible approach to modelling groups of marginal tail dependence in the observation data; and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete-data setting to efficiently handle missing values in the observation vector. We discuss various special cases of the algorithm that result from simplified assumptions on the process generating the data. The applicability of the new framework is illustrated on a data set consisting of the cryptocurrencies with the highest market capitalisation.
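
A generative sketch of the kind of structure the framework targets, using a simplified construction of my own rather than the authors' model: latent factors drawn with group-specific degrees of freedom, combined with skewed errors whose tail behavior is decoupled from that of the factors.

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(7)

def grouped_t_factors(n, dofs):
    """Draw latent factors where each factor has its own degrees of
    freedom (a grouped multiple-degree-of-freedom structure)."""
    cols = []
    for df in dofs:
        # Student-t draw as normal / sqrt(chi2/df).
        g = rng.standard_normal(n) / np.sqrt(rng.chisquare(df, size=n) / df)
        cols.append(g)
    return np.column_stack(cols)

n, p = 1_000, 6
factors = grouped_t_factors(n, dofs=[5.0, 30.0])   # heavy- vs light-tailed group
W = rng.normal(scale=0.8, size=(p, 2))             # factor loadings
# Skewed errors: asymmetry and tails of the noise decoupled from the factors.
errors = skewnorm.rvs(a=4.0, scale=0.3, size=(n, p), random_state=8)
X = factors @ W.T + errors

kurt = [((f - f.mean()) ** 4).mean() / f.var() ** 2 for f in factors.T]
print("factor kurtosis (heavy vs light group):", np.round(kurt, 2))
```

The two factor groups show markedly different sample kurtosis, which is the flexibility points (2) and (3) of the abstract are after: tail heaviness can vary across factor groups independently of the skewed error terms.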

Read more
