Featured Researches

Data Analysis Statistics And Probability

A New Approach for 4DVar Data Assimilation

Four-dimensional variational data assimilation (4DVar) has become an increasingly important tool in data science, with wide applications in many engineering and scientific fields such as geoscience [1-12], biology [13] and the financial industry [14]. 4DVar seeks a solution that minimizes the departure from the background field and the mismatch between the forecast trajectory and the observations within an assimilation window. The current state of the art offers only two choices, distinguished by the form of the forecast model: the strong- and weak-constrained 4DVar approaches [15,16]. The former ignores the model error and corrects only the initial condition error, at the expense of reduced accuracy; the latter accounts for both the initial and model errors and corrects them separately, which increases computational cost and uncertainty. To overcome these limitations, here we develop an integral correcting 4DVar (i4DVar) approach that treats all errors as a whole and corrects them simultaneously and indiscriminately. To achieve this, a novel exponentially decaying function is proposed to characterize the error evolution and correct it at each time step in the i4DVar. As a result, the i4DVar greatly enhances the ability of the strong-constrained 4DVar to correct model error, while avoiding the prohibitive cost and added uncertainty of the weak-constrained 4DVar. Numerical experiments with the Lorenz model show that the i4DVar significantly outperforms the existing 4DVar approaches. Because of its ease of implementation and superior performance, it has the potential to be applied in many scientific, engineering and industrial fields in the big data era.
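
The variational idea at the heart of 4DVar can be sketched in a few lines: a cost function penalizes departure from the background and the forecast-observation mismatch over the window, and is minimized with respect to the initial condition. The toy scalar model, noise variances, and gradient-descent minimizer below are illustrative assumptions in the strong-constraint setting, not the authors' i4DVar (which additionally corrects an exponentially decaying error term at each time step).

```python
# Toy strong-constraint 4DVar sketch. The scalar model x' = -x^3, the
# variances, and the minimizer are hypothetical, not the authors' i4DVar.

def forecast(x0, steps, dt=0.1):
    """Integrate the toy model with forward Euler, returning the trajectory."""
    traj = [x0]
    for _ in range(steps):
        traj.append(traj[-1] + dt * (-traj[-1] ** 3))
    return traj

def cost(x0, xb, b_var, obs, r_var):
    """Background mismatch plus forecast-observation mismatch in the window."""
    traj = forecast(x0, len(obs))
    jb = (x0 - xb) ** 2 / (2.0 * b_var)
    jo = sum((traj[k + 1] - y) ** 2 for k, y in enumerate(obs)) / (2.0 * r_var)
    return jb + jo

def minimize(xb, b_var, obs, r_var, lr=0.002, iters=400, eps=1e-6):
    """Gradient descent on the initial condition with a numerical gradient."""
    x0 = xb
    for _ in range(iters):
        g = (cost(x0 + eps, xb, b_var, obs, r_var)
             - cost(x0 - eps, xb, b_var, obs, r_var)) / (2.0 * eps)
        x0 -= lr * g
    return x0

truth = 1.0
obs = forecast(truth, 5)[1:]          # perfect observations of the true state
x0 = minimize(xb=0.5, b_var=1.0, obs=obs, r_var=0.01)
```

In operational systems the gradient comes from an adjoint model; the finite-difference gradient here is viable only because the control variable is one-dimensional.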


A Predictive Model for Steady-State Multiphase Pipe Flow: Machine Learning on Lab Data

Engineering simulators for steady-state multiphase pipe flow are commonly used to predict pressure drop. Such simulators are typically based on either empirical correlations or first-principles mechanistic models, and they evaluate the pressure drop in multiphase pipe flow with acceptable accuracy. However, the main shortcoming of these correlations and mechanistic models is their limited range of applicability. To extend the applicability and accuracy of existing accessible methods, we propose a method for calculating the pressure drop along a pipeline. The method is based on well segmentation and calculation of the pressure gradient in each segment using three surrogate models built with machine learning algorithms trained on a representative lab data set from the open literature. The first model predicts the liquid holdup in the segment, the second determines the flow pattern, and the third estimates the pressure gradient. To build these models, several ML algorithms are trained, such as Random Forest, Gradient Boosting Decision Trees, Support Vector Machine, and Artificial Neural Network, and their predictive abilities are cross-compared. The proposed method for pressure gradient calculation yields R² = 0.95 using the Gradient Boosting algorithm, compared with R² = 0.92 for the Mukherjee and Brill correlation and R² = 0.91 for a combination of the Ansari and Xiao mechanistic models. The method for pressure drop prediction is also validated on three real field cases. Validation indicates that the proposed model yields coefficients of determination of R² = 0.806, 0.815 and 0.99, compared with the highest values obtained by commonly used techniques: R² = 0.82 (Beggs and Brill correlation), R² = 0.823 (Mukherjee and Brill correlation) and R² = 0.98 (Beggs and Brill correlation).
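
The segment-wise workflow can be sketched with placeholder surrogates: simple hand-written functions standing in for the trained ML models. The feature names, coefficients, and flow-pattern threshold below are hypothetical; only the structure (holdup prediction, flow-pattern classification, pressure-gradient estimation per segment, integrated to a total drop) follows the proposed method.

```python
# Sketch of the segment-wise pressure-drop workflow; the three placeholder
# functions stand in for the trained ML surrogates described in the paper.

def predict_holdup(seg):                      # surrogate model 1 (placeholder)
    return min(0.9, 0.1 + 0.5 * seg["liquid_rate"])

def predict_flow_pattern(seg, holdup):        # surrogate model 2 (placeholder)
    return "slug" if holdup > 0.3 else "annular"

def predict_pressure_gradient(seg, holdup, pattern):  # surrogate model 3
    base = 120.0 if pattern == "slug" else 80.0       # Pa/m, illustrative
    return base * (1.0 + holdup)

def pressure_drop(segments):
    """Integrate the per-segment gradient estimates into a total drop."""
    total = 0.0
    for seg in segments:
        h = predict_holdup(seg)
        p = predict_flow_pattern(seg, h)
        total += predict_pressure_gradient(seg, h, p) * seg["length"]
    return total

segments = [{"liquid_rate": 0.8, "length": 100.0},
            {"liquid_rate": 0.2, "length": 150.0}]
dp = pressure_drop(segments)
```

Each segment's gradient estimate is chained on the two upstream predictions, mirroring the holdup, flow pattern, pressure gradient ordering of the proposed method.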


A Priori Tests for the MIXMAX Random Number Generator

We define two a priori tests of pseudo-random number generators for the class of linear matrix recursions. The first desirable property of a random number generator is the smallness of serial, or lagged, correlations between generated numbers. For the particular matrix generator called MIXMAX, we find that the serial correlation actually vanishes. Next, we define a more sophisticated measure of correlation: a multiple correlator between elements of the generated vectors. The lowest-order non-vanishing correlator is a four-element correlator and is non-zero at lag s = 1; at lag s ≥ 2 this correlator again vanishes, and for lag s = 2 the lowest non-zero correlator is a six-element correlator. The second desirable property for a linear generator is a favorable structure of the lattice that typically appears in dimensions higher than the dimension of the phase space of the generator, as discovered by Marsaglia. We generalize the spectral index, a measure of the quality of this lattice originally defined for linear congruential generators (LCGs), to matrix generators such as MIXMAX, and find that the spectral index is independent of the matrix size N and is equal to √3.
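
The first property can be checked empirically for any generator with a straightforward estimator of the lagged correlation. This is a generic numerical check, applied here to Python's built-in Mersenne Twister rather than to MIXMAX, whose serial correlation the paper computes analytically.

```python
import random

# Empirical lag-s correlation of a generator's output stream:
# C(s) = ( <u_n u_{n+s}> - <u>^2 ) / Var(u), estimated from a finite sample.

def lagged_correlation(stream, lag):
    n = len(stream) - lag
    mean = sum(stream) / len(stream)
    cov = sum(stream[i] * stream[i + lag] for i in range(n)) / n - mean * mean
    var = sum(u * u for u in stream) / len(stream) - mean * mean
    return cov / var

rng = random.Random(12345)
u = [rng.random() for _ in range(200000)]
c1 = lagged_correlation(u, 1)          # should be ~0 for a good generator
```

For a sample of size n the estimator fluctuates at the 1/√n level, so |C(1)| well below 0.01 here is consistent with vanishing serial correlation.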


A Sheaf Theoretical Approach to Uncertainty Quantification of Heterogeneous Geolocation Information

Integration of heterogeneous sensors is a challenging problem across a range of applications. Prominent among these is multi-target tracking, where one must combine observations from different sensor types in a meaningful way to track multiple targets. Because sensors have differing error models, we seek a theoretically justified quantification of the agreement among ensembles of sensors, both overall for a sensor collection and at a fine-grained level specifying pairwise and multi-way interactions among sensors. We demonstrate that the theory of mathematical sheaves provides a unified answer to this need, supporting both quantitative and qualitative data. The theory provides algorithms to globalize data across the network of deployed sensors and to diagnose issues when the data do not globalize cleanly. We demonstrate the utility of sheaf-based tracking models using experimental data from a wild population of black bears in Asheville, North Carolina. A measurement model involving four sensors, deployed among the bears and the team of scientists charged with tracking their location, is constructed. This provides a sheaf-based integration model that is small enough to interpret fully, yet of sufficient complexity to demonstrate the sheaf's ability to recover a holistic picture of the locations and behaviors of both individual bears and the bear-human tracking system. For comparison, a statistical approach was developed: a dynamic linear model estimated using a Kalman filter. This approach also recovered bear and human locations and sensor accuracies. When the observations are normalized into a common coordinate system, the structure of the dynamic linear observation model recapitulates the structure of the sheaf model, demonstrating the canonicity of the sheaf-based approach; when the observations are not so normalized, the sheaf model remains valid.
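
The comparison model, a dynamic linear model estimated with a Kalman filter, can be sketched in one dimension. The random-walk dynamics, noise variances, and synthetic observations below are illustrative assumptions, not the bear-tracking measurement model.

```python
import random

# One-dimensional dynamic linear model: a latent random-walk position
# observed through additive Gaussian noise, estimated with a Kalman filter.

random.seed(7)
q, r = 0.01, 0.25            # process and measurement noise variances
truth = 0.0                  # latent state
x, p = 0.0, 1.0              # filter estimate and its variance
xs, ps = [], []
for _ in range(500):
    truth += random.gauss(0.0, q ** 0.5)      # latent random walk step
    z = truth + random.gauss(0.0, r ** 0.5)   # noisy sensor observation
    p += q                                    # predict: variance grows
    k = p / (p + r)                           # Kalman gain
    x += k * (z - x)                          # update toward the observation
    p *= (1.0 - k)                            # posterior variance shrinks
    xs.append(x)
    ps.append(p)
```

The posterior variance settles at the steady-state value of the scalar Riccati recursion (about 0.045 for these q and r), and the estimate tracks the latent walk to within a few tenths.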


A binned likelihood for stochastic models

Assessing model goodness-of-fit, comparing models, and estimating model parameters are the main categories of statistical problems in science. Bayesian and frequentist methods that address these questions often rely on a likelihood function, the key ingredient for assessing the plausibility of model parameters given observed data. In some complex systems or experimental setups, the outcome of a model cannot be predicted analytically, and Monte Carlo techniques are used. In this paper, we present a new analytic likelihood that takes into account Monte Carlo uncertainties and is appropriate for use in both the large and small sample size limits. Our formulation performs better than semi-analytic methods, guards against overly strong claims based on biased estimates, and provides improved coverage properties compared to available methods.
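
One common way to fold Monte Carlo uncertainty into a binned Poisson likelihood, shown here as a sketch in the same spirit as, but not necessarily identical to, the paper's formulation, is to marginalize the Poisson rate over a gamma prior whose mean and variance match the MC prediction; the marginal has a closed Poisson-gamma (negative-binomial) form.

```python
from math import lgamma, log

# Per-bin likelihoods for an observed count k. The MC-marginalized form
# integrates Poisson(k | lambda) against Gamma(lambda; alpha, beta) with
# alpha/beta = mc_mean and alpha/beta^2 = mc_var.

def poisson_logpmf(k, mu):
    return k * log(mu) - mu - lgamma(k + 1)

def mc_marginalized_logpmf(k, mc_mean, mc_var):
    beta = mc_mean / mc_var          # gamma prior rate
    alpha = mc_mean * beta           # gamma prior shape (mean-matched)
    return (alpha * log(beta) + lgamma(k + alpha)
            - lgamma(k + 1) - lgamma(alpha)
            - (k + alpha) * log(1.0 + beta))

# With vanishing MC variance the marginalized form reduces to Poisson,
# while a large MC variance fattens the tails of the per-bin probability.
p_exact = poisson_logpmf(7, 5.0)
p_marg = mc_marginalized_logpmf(7, 5.0, 1e-4)
p_tail = mc_marginalized_logpmf(0, 5.0, 5.0)
```

The small-variance limit recovering the exact Poisson probability is the basic sanity check for any such MC-aware likelihood.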


A class of randomized Subset Selection Methods for large complex networks

Most real-world complex networks, such as the Internet, the World Wide Web and collaboration networks, are huge; to infer their structure and dynamics one must handle large connectivity (adjacency) matrices. Moreover, to find the spectra of these networks, one needs to perform the eigenvalue decomposition (or singular value decomposition for bipartite networks) of these large adjacency matrices or their Laplacian matrices. In the present work, we propose randomized versions of existing heuristics to infer the norm and the spectrum of the adjacency matrices. In an earlier work [1], we used a Subset Selection (SS) procedure to obtain the critical network structure, which is smaller in size and retains the properties of the original network in terms of its principal singular vector and eigenvalue spectra. We now present a few randomized versions of SS (RSS) together with their time and space complexity on various benchmark and real-world networks. We find that the RSS based on QR decomposition, instead of the SVD used in deterministic SS, is the fastest. We evaluate correctness and speed by running these randomized SS heuristics on test networks and comparing the results with the deterministic counterpart reported earlier. We find that the proposed methods can be used effectively in large and sparse networks; owing to their reduced time complexity, they can be extended to analyse important network structure in dynamically evolving networks.
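
Norm estimation of the kind discussed above can be illustrated with randomized power iteration, which estimates the principal singular value (the 2-norm) of an adjacency matrix starting from a random vector. This is a generic sketch of randomized norm estimation, not the authors' RSS procedure.

```python
import random

# Randomized power iteration for the spectral norm of a symmetric
# adjacency matrix: repeatedly apply A and renormalize; the growth
# factor converges to the largest singular value.

def matvec(adj, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in adj]

def spectral_norm(adj, iters=200, seed=0):
    rng = random.Random(seed)
    v = [rng.random() for _ in range(len(adj))]   # random start vector
    norm = 0.0
    for _ in range(iters):
        w = matvec(adj, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return norm

# 4-cycle graph: adjacency eigenvalues are 2, 0, 0, -2, so the norm is 2.
c4 = [[0, 1, 0, 1],
      [1, 0, 1, 0],
      [0, 1, 0, 1],
      [1, 0, 1, 0]]
norm = spectral_norm(c4)
```

For sparse networks the matvec is the only expensive step, which is why randomized iterations of this kind scale to large graphs.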


A data-driven convergence criterion for iterative unfolding of smeared spectra

A data-driven convergence criterion for the D'Agostini (Richardson-Lucy) iterative unfolding is presented. It relies on the unregularized spectrum (the limit of an infinite number of iterations) and allows a safe estimation of the bias and undercoverage induced by truncating the algorithm. In addition, situations where the response matrix is not perfectly known are discussed; we show that in most cases the unregularized spectrum is not an unbiased estimator of the true distribution. Whenever a bias is introduced, either by truncation or by poor knowledge of the response, a way to retrieve appropriate coverage properties is proposed.
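
For reference, the D'Agostini (Richardson-Lucy) update that the criterion truncates can be written in a few lines: each iteration rescales the current truth estimate by how well the folded spectrum matches the data. The two-bin response matrix and spectra below are illustrative assumptions.

```python
# One D'Agostini / Richardson-Lucy iteration. R[i][j] is the probability
# of observing bin i given true bin j; eff[j] is the efficiency of true
# bin j (column sum of R); mu is the current estimate of the true spectrum.

def unfold_step(R, data, mu):
    nobs, ntrue = len(R), len(mu)
    folded = [sum(R[i][j] * mu[j] for j in range(ntrue)) for i in range(nobs)]
    eff = [sum(R[i][j] for i in range(nobs)) for j in range(ntrue)]
    return [mu[j] / eff[j] * sum(R[i][j] * data[i] / folded[i]
                                 for i in range(nobs))
            for j in range(ntrue)]

R = [[0.8, 0.2],
     [0.2, 0.8]]                                      # modest smearing
truth = [100.0, 50.0]
data = [0.8 * 100 + 0.2 * 50, 0.2 * 100 + 0.8 * 50]   # folded truth, no noise

mu = [75.0, 75.0]                                     # flat starting prior
for _ in range(200):
    mu = unfold_step(R, data, mu)
```

With noiseless data the iteration converges to the true spectrum; with statistical fluctuations it eventually fits the noise, which is exactly why a truncation (convergence) criterion is needed.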


A deconvolution method for reconstruction of data time series from intermittent systems

In this manuscript, we investigate a deconvolution method for recovering pulse arrival times and amplitudes from synthetic data. For the deconvolution procedure to have any hope of recovering amplitudes and arrival times, the average waiting time between events must be at least ten times the sampling time step.
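
The flavor of the problem can be illustrated with one-sided exponential pulses, for which an exact inverse filter exists on a uniform grid. This is a minimal sketch, not the manuscript's deconvolution procedure; the pulse shape, arrival indices, and amplitudes are assumed.

```python
import math

# A signal built from one-sided exponential pulses, amp * exp(-(t-tk)/tau),
# sampled every dt. The filter f[n] = s[n] - exp(-dt/tau) * s[n-1] inverts
# the pulse shape exactly, returning the amplitude at each arrival sample
# and zero elsewhere (for noiseless data).

dt, tau, n = 0.1, 1.0, 100
arrivals = {10: 2.0, 55: 1.5}            # sample index -> pulse amplitude

signal = [0.0] * n
for k, amp in arrivals.items():
    for i in range(k, n):
        signal[i] += amp * math.exp(-(i - k) * dt / tau)

decay = math.exp(-dt / tau)
recovered = [signal[0]] + [signal[i] - decay * signal[i - 1]
                           for i in range(1, n)]
peaks = {i: round(a, 6) for i, a in enumerate(recovered) if abs(a) > 1e-9}
```

With noise, or with pulses overlapping within a few time steps, this exact inversion degrades rapidly, which is the regime the waiting-time condition above guards against.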


A deep learning approach to multi-track location and orientation in gaseous drift chambers

Accurately measuring the location and orientation of individual particles in a beam monitoring system is of particular interest to researchers in multiple disciplines. Among feasible methods, gaseous drift chambers with hybrid pixel sensors have great potential to realize long-term stable measurement with considerable precision. In this paper, we introduce deep learning to analyze patterns in the beam projection image and facilitate three-dimensional reconstruction of particle tracks. We propose an end-to-end neural network based on segmentation and fitting for feature extraction and regression. Two segmentation branches, binary segmentation and semantic segmentation, perform initial track determination and pixel-track association. Pixels are then assigned to multiple tracks, and a weighted least squares fit is implemented with full back-propagation. In addition, we introduce a center-angle measure that combines two separate factors to judge the precision of location and orientation. The position resolution achieves 8.8 μm for a single track and 11.4 μm (15.2 μm) for 1-3 tracks (1-5 tracks), while the angle resolution achieves 0.15° and 0.21° (0.29°), respectively. These results show a significant improvement in accuracy and multi-track compatibility compared to traditional methods.
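
The fitting stage can be illustrated with an ordinary weighted least squares line fit to pixel hits, where each pixel contributes according to an assignment weight. This is a generic sketch with hypothetical hits and weights, not the network's differentiable fitting layer.

```python
# Weighted least squares fit of a line y = slope * x + intercept to pixel
# hits (xs, ys), with per-pixel weights ws (here imagined as track
# assignment scores from a segmentation stage).

def weighted_line_fit(xs, ys, ws):
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw          # weighted means
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.1, 5.9]      # noisy hits around y = 2x
ws = [1.0, 0.5, 1.0, 0.5]      # hypothetical per-pixel weights
slope, intercept = weighted_line_fit(xs, ys, ws)
```

Because every step of this fit is differentiable in the weights, it can sit inside a network and be trained end to end with back-propagation, as the paper does.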


A deep neural network for simultaneous estimation of b jet energy and resolution

We describe a method to obtain point and dispersion estimates for the energies of jets arising from b quarks produced in proton-proton collisions at an energy of √s = 13 TeV at the CERN LHC. The algorithm is trained on a large simulated sample of b jets and validated on data recorded by the CMS detector in 2017, corresponding to an integrated luminosity of 41 fb⁻¹. A multivariate regression algorithm based on a deep feed-forward neural network employs jet composition and shape information, and the properties of reconstructed secondary vertices associated with the jet. The results of the algorithm are used to improve the sensitivity of analyses that make use of b jets in the final state, such as the observation of Higgs boson decay to bb̄.
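
Simultaneous point-and-dispersion estimation can be sketched with a toy heteroscedastic regression trained under a Gaussian negative log-likelihood, so that one output learns the prediction and another the resolution. The one-feature linear model and synthetic data below are assumptions; the CMS algorithm is a deep feed-forward network on jet features.

```python
import math
import random

# Toy point-and-dispersion regression: predict mean = w*x + b and a global
# log-variance, fitting both by gradient descent on the Gaussian NLL
# 0.5*log(var) + 0.5*(mean - y)^2/var, so log_var learns the resolution.

random.seed(1)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]   # true resolution 0.5

w, b, log_var = 0.0, 0.0, 0.0
lr = 0.1
for _ in range(800):
    var = math.exp(log_var)
    gw = gb = gv = 0.0
    for x, y in zip(xs, ys):
        r = (w * x + b) - y
        gw += r * x / var                 # d NLL / d w
        gb += r / var                     # d NLL / d b
        gv += 0.5 * (1.0 - r * r / var)   # d NLL / d log_var
    n = len(xs)
    w -= lr * gw / n
    b -= lr * gb / n
    log_var -= lr * gv / n

sigma = math.exp(0.5 * log_var)           # learned dispersion estimate
```

The same likelihood-based trick generalizes to per-event dispersion by making the variance output depend on the input features, which is what a point-plus-resolution jet regression needs.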
