Featured Research

Data Analysis Statistics And Probability

Impact of non-normal error distributions on the benchmarking and ranking of Quantum Machine Learning models

Quantum machine learning models have been gaining significant traction within atomistic simulation communities. Conventionally, relative model performance is assessed and compared using learning curves (prediction error vs. training set size). This article illustrates the limitations of using the Mean Absolute Error (MAE) for benchmarking, which are particularly relevant in the case of non-normal error distributions. More specifically, we analyze the prediction error distribution of kernel ridge regression with the SLATM representation and L2 distance metric (KRR-SLATM-L2) for effective atomization energies of QM7b molecules calculated at the CCSD(T)/cc-pVDZ level of theory. Error distributions of HF and MP2 with the same basis set, referenced to CCSD(T) values, were also assessed and compared to the KRR model. We show that the true performance of the KRR-SLATM-L2 method over the QM7b dataset is poorly assessed by the Mean Absolute Error, and can be notably improved after adaptation of the learning set.
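
As a rough illustration of why the MAE alone can mislead for non-normal errors, the following sketch compares two synthetic error sets rescaled to the same MAE. The Laplace distribution, the 95th percentile of absolute errors (called `q95` here), and all function names are assumptions chosen for illustration; they are not taken from the article and involve no QM7b data.

```python
import random
import statistics


def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return statistics.fmean([abs(e) for e in errors])


def q95(errors):
    """Empirical 95th percentile of the absolute errors (illustrative statistic)."""
    ordered = sorted(abs(e) for e in errors)
    return ordered[int(0.95 * (len(ordered) - 1))]


def rescale_to_unit_mae(errors):
    """Rescale errors so that both samples share exactly the same MAE."""
    m = mae(errors)
    return [e / m for e in errors]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000

# Normal errors versus heavier-tailed Laplace errors (assumed toy distributions).
normal_errors = rescale_to_unit_mae([rng.gauss(0.0, 1.0) for _ in range(n)])
laplace_errors = rescale_to_unit_mae(
    [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
)

print("MAE normal :", round(mae(normal_errors), 3))
print("MAE laplace:", round(mae(laplace_errors), 3))
print("Q95 normal :", round(q95(normal_errors), 3))
print("Q95 laplace:", round(q95(laplace_errors), 3))
```

Both samples have an MAE of one by construction, yet the heavier-tailed sample has a noticeably larger 95th percentile, so a ranking based on MAE alone would treat the two as equivalent.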

Impact of non-stationarity on hybrid ensemble filters: A study with a doubly stochastic advection-diffusion-decay model

Effects of non-stationarity on the performance of hybrid ensemble filters are studied (by hybrid filters we mean those that blend ensemble covariances with some other regularizing covariances). To isolate effects of non-stationarity from effects due to nonlinearity (and the non-Gaussianity it causes), a new doubly stochastic advection-diffusion-decay model (DSADM) is proposed. The model is hierarchical: it is a linear stochastic partial differential equation whose coefficients are random fields defined through their own stochastic partial differential equations. DSADM generates conditionally Gaussian spatiotemporal random fields with a tunable degree of non-stationarity in space and time. DSADM allows the use of the exact Kalman filter as a baseline benchmark. In numerical experiments with DSADM as the "model of truth", the relative importance of three kinds of covariance blending is studied: blending with static, time-smoothed, and space-smoothed covariances. It is shown that the stronger the non-stationarity, the less useful the static covariance matrix becomes and the more beneficial the time-smoothed covariances are. Time-smoothing of background-error covariances proved to be systematically more useful than space-smoothing. Under non-stationarity, a filter that extends the authors' previously proposed Hierarchical Bayes Ensemble Filter and accommodates the three covariance-blending techniques is shown to outperform all other filter configurations tested. The R code of the model and the filters is available from this http URL.

Implementation of GENFIT2 as an experiment independent track-fitting framework

The GENFIT toolkit, initially developed at the Technische Universitaet Muenchen, has been extended and modified to be more general and user-friendly. The new GENFIT, called GENFIT2, provides track representation, track-fitting algorithms, and graphic visualization of tracks and detectors, and it can be used for any experiment that determines parameters of charged particle trajectories from spatial coordinate measurements. Based on general Kalman filter routines, it can perform extrapolations of track parameters and covariance matrices. It also provides interfaces to Millepede II for alignment purposes and to RAVE for vertex finding. Results of an implementation of GENFIT2 in the basf2 and PandaRoot software frameworks are presented here.

Improved Asymptotic Formulae for Statistical Interpretation Based on Likelihood Ratio Tests

The asymptotic formulae that describe the probability distribution of a test statistic in the paper by G. Cowan et al. rely heavily on Wald's approximation. Wald's approximation is valid if the background is sufficiently large, and it works well in most searches for new physics. In this work, the asymptotic formulae are improved under weaker approximation conditions. The sub-leading contributions due to limited sample size and a non-negligible signal-to-background ratio are considered. The new asymptotic formulae work better than the old ones, especially if the number of events is of the order of 1. A conjecture proposed in the paper by G. Cowan et al. is also clarified.
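
To make the dependence on background size concrete, the sketch below compares the standard asymptotic discovery significance Z = sqrt(q0) for a simple counting experiment with known background against the significance obtained from the exact Poisson p-value. It does not implement the improved formulae of this work; the counting-experiment setup, the function names, and the example numbers are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def asymptotic_significance(n_obs, background):
    """Z = sqrt(q0), with q0 = 2 * (n * ln(n / b) - (n - b)) for n > b, else 0."""
    if n_obs <= background:
        return 0.0
    q0 = 2.0 * (n_obs * math.log(n_obs / background) - (n_obs - background))
    return math.sqrt(q0)


def exact_significance(n_obs, background):
    """Z from the exact one-sided Poisson p-value P(N >= n_obs | background)."""
    term = math.exp(-background)  # P(N = 0)
    cdf = 0.0
    for k in range(n_obs):  # accumulate P(N <= n_obs - 1)
        cdf += term
        term *= background / (k + 1)
    p_value = 1.0 - cdf
    return -NormalDist().inv_cdf(p_value)


# Hypothetical examples: a tiny background and a large one with similar significance.
for n_obs, background in [(5, 1.0), (130, 100.0)]:
    z_asym = asymptotic_significance(n_obs, background)
    z_exact = exact_significance(n_obs, background)
    print(f"b={background:6.1f} n={n_obs:4d} "
          f"Z_asym={z_asym:.3f} Z_exact={z_exact:.3f} gap={z_asym - z_exact:.3f}")

gap_small = asymptotic_significance(5, 1.0) - exact_significance(5, 1.0)
gap_large = asymptotic_significance(130, 100.0) - exact_significance(130, 100.0)
```

In this toy setting, the gap between the asymptotic and exact significance is visibly larger for a background of about one event than for a background of one hundred events, which is the regime the abstract singles out.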

Independent Normalization for γ-ray Strength Functions: The Shape Method

The Shape method, a novel approach to obtain the functional form of the γ-ray strength function (γSF) in the absence of neutron resonance spacing data, is introduced. When used in connection with the Oslo method, the slope of the Nuclear Level Density (NLD) is obtained simultaneously. The foundation of the Shape method lies in the primary γ-ray transitions, which preserve information on the functional form of the γSF. The Shape method has been applied to ⁵⁶Fe, ⁹²Zr, ¹⁶⁴Dy, and ²⁴⁰Pu, which are representative cases for the variety of situations encountered in typical NLD and γSF studies. The comparisons of results from the Shape method to those from the Oslo method demonstrate that the functional form of the γSF is retained regardless of nuclear structure details or J^π values of the states fed by the primary transitions.

Inference of stochastic time series with missing data

Inferring dynamics from time series is an important objective in data analysis. In particular, it is challenging to infer stochastic dynamics given incomplete data. We propose an expectation maximization (EM) algorithm that alternates between two steps: the E-step restores missing data points, while the M-step infers an underlying network model from the restored data. Using synthetic data generated by a kinetic Ising model, we confirm that the algorithm works for restoring missing data points as well as inferring the underlying model. At the initial iteration of the EM algorithm, the model inference shows better model-data consistency with observed data points than with missing data points. As we keep iterating, however, missing data points show better model-data consistency. We find that demanding equal consistency of observed and missing data points provides an effective stopping criterion for the iteration, preventing it from overshooting the most accurate model inference. Armed with this EM algorithm and stopping criterion, we infer missing data points and an underlying network from time-series data of real neuronal activities. Our method recovers collective properties of neuronal activities, such as time correlations and firing statistics, which have previously never been optimized to fit.

Inference of the Kinetic Ising Model with Heterogeneous Missing Data

We consider the problem of inferring a causality structure from multiple binary time series by using the Kinetic Ising Model in datasets where a fraction of observations is missing. We build on recent work on Mean Field methods for the inference of the model with hidden spins and develop a pseudo-Expectation-Maximization algorithm that is able to work even in conditions of severe data sparsity. The methodology relies on the Martin-Siggia-Rose path integral method with a second-order saddle-point solution, which makes it possible to calculate the log-likelihood in polynomial time and gives as output a maximum likelihood estimate of the coupling matrix and of the missing observations. We also propose a recursive version of the algorithm, where at every iteration some missing values are substituted by their maximum likelihood estimate, showing that the method can be used together with sparsification schemes like LASSO regularization or decimation. We test the performance of the algorithm on synthetic data and find interesting properties regarding the dependence on the heterogeneity of the observation frequency of spins, and regarding cases where some of the hypotheses necessary for the saddle-point approximation are violated, such as the small couplings limit and the assumption of statistical independence between couplings.

Information-theoretic measures for non-linear causality detection: application to social media sentiment and cryptocurrency prices

Information transfer between time series is calculated by using the asymmetric information-theoretic measure known as transfer entropy. Geweke's autoregressive formulation of Granger causality is used to find linear transfer entropy, and Schreiber's general, non-parametric, information-theoretic formulation is used to detect non-linear transfer entropy. We first validate these measures against synthetic data. Then we apply these measures to detect causality between social sentiment and cryptocurrency prices. We perform significance tests by comparing the information transfer against a null hypothesis, determined via shuffled time series, and calculate the Z-score. We also investigate different approaches for partitioning in nonparametric density estimation, which can improve the significance of results. Using these techniques on sentiment and price data over a 48-month period to August 2018, for four major cryptocurrencies, namely bitcoin (BTC), ripple (XRP), litecoin (LTC) and ethereum (ETH), we detect significant information transfer, on hourly timescales, in both directions: from sentiment to price and from price to sentiment. We report the scale of non-linear causality to be an order of magnitude greater than linear causality.
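
The shuffle-based significance test described above can be sketched on toy data as follows. This is a minimal plug-in estimator of transfer entropy for discrete series with a history length of one, not the authors' pipeline: the binary synthetic series, the coupling strength, the number of shuffles, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics
from collections import Counter


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1."""
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    pairs_target_source = Counter(zip(target[:-1], source[:-1]))
    pairs_target_target = Counter(zip(target[1:], target[:-1]))
    singles = Counter(target[:-1])
    n = len(target) - 1
    te = 0.0
    for (y_next, y_prev, x_prev), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_target_source[(y_prev, x_prev)]
        p_given_self = pairs_target_target[(y_next, y_prev)] / singles[y_prev]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te


def shuffle_z_score(source, target, rng, n_shuffles=100):
    """Z-score of the observed TE against a null built from shuffled sources."""
    observed = transfer_entropy(source, target)
    null = []
    for _ in range(n_shuffles):
        shuffled = source[:]
        rng.shuffle(shuffled)  # destroys temporal alignment with the target
        null.append(transfer_entropy(shuffled, target))
    return (observed - statistics.fmean(null)) / statistics.stdev(null)


rng = random.Random(42)  # fixed seed for reproducibility
n = 2000
x = [rng.randint(0, 1) for _ in range(n)]
y = [0]
for t in range(n - 1):
    # Assumed toy coupling: y copies the previous x 80% of the time, else is random.
    y.append(x[t] if rng.random() < 0.8 else rng.randint(0, 1))

te_xy = transfer_entropy(x, y)
te_yx = transfer_entropy(y, x)
z_xy = shuffle_z_score(x, y, rng)
print(f"TE x->y = {te_xy:.3f} bits, TE y->x = {te_yx:.3f} bits, Z(x->y) = {z_xy:.1f}")
```

Because the estimator is asymmetric, the information flow from `x` to `y` stands out clearly while the reverse direction stays near zero. For continuous data such as prices, the series would first need to be partitioned into bins, which is where the choice of partitioning mentioned above comes in.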

Integrated VAC: A robust strategy for identifying eigenfunctions of dynamical operators

One approach to analyzing the dynamics of a physical system is to search for long-lived patterns in its motions. This approach has been particularly successful for molecular dynamics data, where slowly decorrelating patterns can indicate large-scale conformational changes. Detecting such patterns is the central objective of the variational approach to conformational dynamics (VAC), as well as the related methods of time-lagged independent component analysis and Markov state modeling. In VAC, the search for slowly decorrelating patterns is formalized as a variational problem solved by the eigenfunctions of the system's transition operator. VAC computes solutions to this variational problem by optimizing a linear or nonlinear model of the eigenfunctions using time series data. Here, we build on VAC's success by addressing two practical limitations. First, VAC can give poor eigenfunction estimates when the lag time parameter is chosen poorly. Second, VAC can overfit when using flexible parameterizations such as artificial neural networks with insufficient regularization. To address these issues, we propose an extension that we call integrated VAC (IVAC). IVAC integrates over multiple lag times before solving the variational problem, making its results more robust and reproducible than VAC's.

Integration with an Adaptive Harmonic Mean Algorithm

Numerically estimating the integral of functions in high-dimensional spaces is a non-trivial task. An oft-encountered example is the calculation of the marginal likelihood in Bayesian inference, in a context where a sampling algorithm such as Markov chain Monte Carlo provides samples of the function. We present an Adaptive Harmonic Mean Integration (AHMI) algorithm. Given samples drawn according to a probability distribution proportional to the function, the algorithm estimates the integral of the function and the uncertainty of the estimate by applying a harmonic mean estimator to adaptively chosen regions of the parameter space. We describe the algorithm and its mathematical properties, and report the results of applying it to multiple test cases.
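
The core idea of a harmonic mean estimator restricted to a region can be sketched in one dimension. For samples drawn from a density proportional to f, the average of 1/f over samples falling inside a region of volume V approaches V divided by the integral of f, so the integral can be estimated as N * V / sum(1/f). The sketch below uses a fixed, hand-picked interval rather than adaptively chosen regions and gives no uncertainty estimate; the target function, the interval, and the function names are assumptions for illustration, not the AHMI implementation.

```python
import math
import random


def target(x):
    """Unnormalized Gaussian; its integral over the real line is sqrt(2 * pi)."""
    return math.exp(-0.5 * x * x)


def harmonic_mean_integral(samples, func, lower, upper):
    """Estimate the integral of func from samples distributed proportionally to func.

    Only samples inside [lower, upper] contribute to the sum of 1 / func, but the
    total sample count is used for normalization.
    """
    volume = upper - lower
    inverse_sum = sum(1.0 / func(x) for x in samples if lower <= x <= upper)
    return len(samples) * volume / inverse_sum


rng = random.Random(1)  # fixed seed for reproducibility
# Stand-in for MCMC output: exact draws from the normalized version of target.
samples = [rng.gauss(0.0, 1.0) for _ in range(50000)]

estimate = harmonic_mean_integral(samples, target, -1.0, 1.0)
print("estimate:", round(estimate, 4), "exact:", round(math.sqrt(2.0 * math.pi), 4))
```

Restricting the sum to a region where the function is not small keeps 1/f bounded, which avoids the large variance that the harmonic mean estimator suffers when applied over the full space.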
