# Statistics

###### Featured Researches

## Large-data determinantal clustering

Determinantal consensus clustering is a promising and attractive alternative to partitioning about medoids and k-means for ensemble clustering. Based on a determinantal point process or DPP sampling, it ensures that subsets of similar points are less likely to be selected as centroids. It favors more diverse subsets of points. The sampling algorithm of the determinantal point process requires the eigendecomposition of a Gram matrix. This becomes computationally intensive when the data size is very large. This is particularly an issue in consensus clustering, where a given clustering algorithm is run several times in order to produce a final consolidated clustering. We propose two efficient alternatives to carry out determinantal consensus clustering on large datasets. They consist in DPP sampling based on sparse and small kernel matrices whose eigenvalue distributions are close to that of the original Gram matrix.

Read more## A Text Mining Discovery of Similarities and Dissimilarities Among Sacred Scriptures

The careful examination of sacred texts gives valuable insights into human psychology, different ideas regarding the organization of societies as well as into terms like truth and God. To improve and deepen our understanding of sacred texts, their comparison, and their separation is crucial. For this purpose, we use our data set has nine sacred scriptures. This work deals with the separation of the Quran, the Asian scriptures Tao-Te-Ching, the Buddhism, the Yogasutras, and the Upanishads as well as the four books from the Bible, namely the Book of Proverbs, the Book of Ecclesiastes, the Book of Ecclesiasticus, and the Book of Wisdom. These scriptures are analyzed based on the natural language processing NLP creating the mathematical representation of the corpus in terms of frequencies called document term matrix (DTM). After this analysis, machine learning methods like supervised and unsupervised learning are applied to perform classification. Here we use the Multinomial Naive Bayes (MNB), the Super Vector Machine (SVM), the Random Forest (RF), and the K-nearest Neighbors (KNN). We obtain that among these methods MNB is able to predict the class of a sacred text with an accuracy of about 85.84 %.

Read more## The future of forecasting competitions: Design attributes and principles

Forecasting competitions are the equivalent of laboratory experimentation widely used in physical and life sciences. They provide useful, objective information to improve the theory and practice of forecasting, advancing the field, expanding its usage and enhancing its value to decision and policymakers. We describe ten design attributes to be considered when organizing forecasting competitions, taking into account trade-offs between optimal choices and practical concerns like costs, as well as the time and effort required to participate in them. Consequently, we map all major past competitions in respect to their design attributes, identifying similarities and differences between them, as well as design gaps, and making suggestions about the principles to be included in future competitions, putting a particular emphasis on learning as much as possible from their implementation in order to help improve forecasting accuracy and uncertainty. We discuss that the task of forecasting often presents a multitude of challenges that can be difficult to be captured in a single forecasting contest. To assess the caliber of a forecaster, we, therefore, propose that organizers of future competitions consider a multi-contest approach. We suggest the idea of a forecasting "athlon", where different challenges of varying characteristics take place.

Read more## On structural and practical identifiability

We discuss issues of structural and practical identifiability of partially observed differential equations which are often applied in systems biology. The development of mathematical methods to investigate structural non-identifiability has a long tradition. Computationally efficient methods to detect and cure it have been developed recently. Practical non-identifiability on the other hand has not been investigated at the same conceptually clear level. We argue that practical identifiability is more challenging than structural identifiability when it comes to modelling experimental data. We discuss that the classical approach based on the Fisher information matrix has severe shortcomings. As an alternative, we propose using the profile likelihood, which is a powerful approach to detect and resolve practical non-identifiability.

Read more###### Computation

###### Large-data determinantal clustering

Determinantal consensus clustering is a promising and attractive alternative to partitioning about medoids and k-means for ensemble clustering. Based on a determinantal point process or DPP sampling, it ensures that subsets of similar points are less likely to be selected as centroids. It favors more diverse subsets of points. The sampling algorithm of the determinantal point process requires the eigendecomposition of a Gram matrix. This becomes computationally intensive when the data size is very large. This is particularly an issue in consensus clustering, where a given clustering algorithm is run several times in order to produce a final consolidated clustering. We propose two efficient alternatives to carry out determinantal consensus clustering on large datasets. They consist in DPP sampling based on sparse and small kernel matrices whose eigenvalue distributions are close to that of the original Gram matrix.

More from Computation###### ROBustness In Network (robin): an R package for Comparison and Validation of communities

In network analysis, many community detection algorithms have been developed, however, their implementation leaves unaddressed the question of the statistical validation of the results. Here we present robin(ROBustness In Network), an R package to assess the robustness of the community structure of a network found by one or more methods to give indications about their reliability. The procedure initially detects if the community structure found by a set of algorithms is statistically significant and then compares two selected detection algorithms on the same graph to choose the one that better fits the network of interest. We demonstrate the use of our package on the American College Football benchmark dataset.

More from Computation###### Cosine Series Representation

This short paper is based on Chung et al. (2010), where the cosine series representation (CSR) is used in modeling the shape of white matter fiber tracts in diffusion tensor imaging(DTI) and Wang et al. (2018), where the method is used to denoise EEG. The proposed explicit analytic approach offers far superior flexibility in statistical modeling compared to the usual implicit Fourier transform methods such as the discrete cosine transforms often used in signal processing. The MATLAB codes and sample data can be obtained from this http URL.

More from Computation###### Other Statistics

###### A Text Mining Discovery of Similarities and Dissimilarities Among Sacred Scriptures

The careful examination of sacred texts gives valuable insights into human psychology, different ideas regarding the organization of societies as well as into terms like truth and God. To improve and deepen our understanding of sacred texts, their comparison, and their separation is crucial. For this purpose, we use our data set has nine sacred scriptures. This work deals with the separation of the Quran, the Asian scriptures Tao-Te-Ching, the Buddhism, the Yogasutras, and the Upanishads as well as the four books from the Bible, namely the Book of Proverbs, the Book of Ecclesiastes, the Book of Ecclesiasticus, and the Book of Wisdom. These scriptures are analyzed based on the natural language processing NLP creating the mathematical representation of the corpus in terms of frequencies called document term matrix (DTM). After this analysis, machine learning methods like supervised and unsupervised learning are applied to perform classification. Here we use the Multinomial Naive Bayes (MNB), the Super Vector Machine (SVM), the Random Forest (RF), and the K-nearest Neighbors (KNN). We obtain that among these methods MNB is able to predict the class of a sacred text with an accuracy of about 85.84 %.

More from Other Statistics###### A few statistical principles for data science

In any other circumstance, it might make sense to define the extent of the terrain (Data Science) first, and then locate and describe the landmarks (Principles). But this data revolution we are experiencing defies a cadastral survey. Areas are continually being annexed into Data Science. For example, biometrics was traditionally statistics for agriculture in all its forms but now, in Data Science, it means the study of characteristics that can be used to identify an individual. Examples of non-intrusive measurements include height, weight, fingerprints, retina scan, voice, photograph/video (facial landmarks and facial expressions), and gait. A multivariate analysis of such data would be a complex project for a statistician, but a software engineer might appear to have no trouble with it at all. In any applied-statistics project, the statistician worries about uncertainty and quantifies it by modelling data as realisations generated from a probability space. Another approach to uncertainty quantification is to find similar data sets, and then use the variability of results between these data sets to capture the uncertainty. Both approaches allow 'error bars' to be put on estimates obtained from the original data set, although the interpretations are different. A third approach, that concentrates on giving a single answer and gives up on uncertainty quantification, could be considered as Data Engineering, although it has staked a claim in the Data Science terrain. This article presents a few (actually nine) statistical principles for data scientists that have helped me, and continue to help me, when I work on complex interdisciplinary projects.

More from Other Statistics###### Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

In 2001, Leo Breiman wrote of a divide between "data modeling" and "algorithmic modeling" cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the "data modelers" incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman's own Random Forest methods. While this can be simplistically described as "Breiman won", these same developments also expose the limitations of the prediction-first philosophy that he espoused, making careful statistical analysis all the more important. This paper outlines these exciting recent developments in the random forest literature which, in our view, occurred as a result of a necessary blending of the two ways of thinking Breiman originally described. We also ask what areas statistics and statisticians might currently overlook.

More from Other Statistics###### Applications

###### The future of forecasting competitions: Design attributes and principles

Forecasting competitions are the equivalent of laboratory experimentation widely used in physical and life sciences. They provide useful, objective information to improve the theory and practice of forecasting, advancing the field, expanding its usage and enhancing its value to decision and policymakers. We describe ten design attributes to be considered when organizing forecasting competitions, taking into account trade-offs between optimal choices and practical concerns like costs, as well as the time and effort required to participate in them. Consequently, we map all major past competitions in respect to their design attributes, identifying similarities and differences between them, as well as design gaps, and making suggestions about the principles to be included in future competitions, putting a particular emphasis on learning as much as possible from their implementation in order to help improve forecasting accuracy and uncertainty. We discuss that the task of forecasting often presents a multitude of challenges that can be difficult to be captured in a single forecasting contest. To assess the caliber of a forecaster, we, therefore, propose that organizers of future competitions consider a multi-contest approach. We suggest the idea of a forecasting "athlon", where different challenges of varying characteristics take place.

More from Applications###### Data-adaptive Dimension Reduction for US Mortality Forecasting

Forecasting accuracy of mortality data is important for the management of pension funds and pricing of life insurance in actuarial science. Age-specific mortality forecasting in the US poses a challenging problem in high dimensional time series analysis. Prior attempts utilize traditional dimension reduction techniques to avoid the curse of dimensionality, and then mortality forecasting is achieved through features' forecasting. However, a method of reducing dimension pertinent to ideal forecasting is elusive. To address this, we propose a novel approach to pursue features that are not only capable of representing original data well but also capturing time-serial dependence as most as possible. The proposed method is adaptive for the US mortality data and enjoys good statistical performance. As a comparison, our method performs better than existing approaches, especially in regard to the Lee-Carter Model as a benchmark in mortality analysis. Based on forecasting results, we generate more accurate estimates of future life expectancies and prices of life annuities, which can have great financial impact on life insurers and social securities compared with using Lee-Carter Model. Furthermore, various simulations illustrate scenarios under which our method has advantages, as well as interpretation of the good performance on mortality data.

More from Applications###### A Bayesian spatio-temporal nowcasting model for public health decision-making and surveillance

As COVID-19 spread through the United States in 2020, states began to set up alert systems to inform policy decisions and serve as risk communication tools for the general public. Many of these systems, like in Ohio, included indicators based on an assessment of trends in reported cases. However, when cases are indexed by date of disease onset, reporting delays complicate the interpretation of trends. Despite a foundation of statistical literature to address this problem, these methods have not been widely applied in practice. In this paper, we develop a Bayesian spatio-temporal nowcasting model for assessing trends in county-level COVID-19 cases in Ohio. We compare the performance of our model to the current approach used in Ohio and the approach that was recommended by the Centers for Disease Control and Prevention. We demonstrate gains in performance while still retaining interpretability using our model. In addition, we are able to fully account for uncertainty in both the time series of cases and in the reporting process. While we cannot eliminate all of the uncertainty in public health surveillance and subsequent decision-making, we must use approaches that embrace these challenges and deliver more accurate and honest assessments to policymakers.

More from Applications###### Methodology

###### On structural and practical identifiability

We discuss issues of structural and practical identifiability of partially observed differential equations which are often applied in systems biology. The development of mathematical methods to investigate structural non-identifiability has a long tradition. Computationally efficient methods to detect and cure it have been developed recently. Practical non-identifiability on the other hand has not been investigated at the same conceptually clear level. We argue that practical identifiability is more challenging than structural identifiability when it comes to modelling experimental data. We discuss that the classical approach based on the Fisher information matrix has severe shortcomings. As an alternative, we propose using the profile likelihood, which is a powerful approach to detect and resolve practical non-identifiability.

More from Methodology###### Fisher Scoring for crossed factor Linear Mixed Models

The analysis of longitudinal, heterogeneous or unbalanced clustered data is of primary importance to a wide range of applications. The Linear Mixed Model (LMM) is a popular and flexible extension of the linear model specifically designed for such purposes. Historically, a large proportion of material published on the LMM concerns the application of popular numerical optimization algorithms, such as Newton-Raphson, Fisher Scoring and Expectation Maximization to single-factor LMMs (i.e. LMMs that only contain one "factor" by which observations are grouped). However, in recent years, the focus of the LMM literature has moved towards the development of estimation and inference methods for more complex, multi-factored designs. In this paper, we present and derive new expressions for the extension of an algorithm classically used for single-factor LMM parameter estimation, Fisher Scoring, to multiple, crossed-factor designs. Through simulation and real data examples, we compare five variants of the Fisher Scoring algorithm with one another, as well as against a baseline established by the R package lmer, and find evidence of correctness and strong computational efficiency for four of the five proposed approaches. Additionally, we provide a new method for LMM Satterthwaite degrees of freedom estimation based on analytical results, which does not require iterative gradient estimation. Via simulation, we find that this approach produces estimates with both lower bias and lower variance than the existing methods.

More from Methodology###### Nonparametric C- and D-vine based quantile regression

Quantile regression is a field with steadily growing importance in statistical modeling. It is a complementary method to linear regression, since computing a range of conditional quantile functions provides a more accurate modelling of the stochastic relationship among variables, especially in the tails. We introduce a novel non-restrictive and highly flexible nonparametric quantile regression approach based on C- and D-vine copulas. Vine copulas allow for separate modeling of marginal distributions and the dependence structure in the data, and can be expressed through a graph theoretical model given by a sequence of trees. This way we obtain a quantile regression model, that overcomes typical issues of quantile regression such as quantile crossings or collinearity, the need for transformations and interactions of variables. Our approach incorporates a two-step ahead ordering of variables, by maximizing the conditional log-likelihood of the tree sequence, while taking into account the next two tree levels. Further, we show that the nonparametric conditional quantile estimator is consistent. The performance of the proposed methods is evaluated in both low- and high-dimensional settings using simulated and real world data. The results support the superior prediction ability of the proposed models.

More from Methodology#### Ready to get started?

Join us today