A robust multivariate linear non-parametric maximum likelihood model for ties
Landon Hurley
Yale University School of Public Health & VA Connecticut Healthcare System, West Haven CSP Coordinating Centre
Statistical analysis in applied research, across almost every field (e.g., biomedical, economics, computer science, and psychology), makes use of samples for which the explicit error distribution of the dependent variable is unknown or, at best, difficult to linearly model. Yet these assumptions are extremely common. Misspecified error distributions of course yield biased estimates, compromising the generalisability of our interpretations: the linearly unbiased Euclidean distance is very difficult to correctly identify upon finite samples, and therefore results in an estimator which is neither unbiased nor maximally informative when incorrectly applied. The common alternative solution to the problem, non-parametric statistics, has its own fundamental flaws. In particular, these flaws revolve around the problem of order-statistics and estimation in the presence of ties, which often precludes the introduction of multiple independent variables and the estimation of interactions. We introduce a competitor to the Euclidean norm, the Kemeny norm, which we prove induces a valid Banach space, and construct a multivariate linear expansion of the Kendall-Theil-Sen estimator, which performs without compromising the extensibility of the parameter space, and establish its linear maximum likelihood properties. Empirical demonstrations upon both simulated and empirical data shall be used to demonstrate these properties, such that the new estimator is nearly equivalent in power to the GLM upon Gaussian data, but grossly superior across a vast array of analytic scenarios, including finite ordinal sum-score analysis, thereby aiding in the resolution of replication in the Applied Sciences.

Introduction
Achieving the general construction of a non-parametric linear regression framework, wherein the distribution, linearity, and closed-form expressions of the estimating equations between errors and covariates may be easily presented, has been a long-desired result in applied statistics. The first major development was that of Kendall (1938) τ_a and the corresponding univariate Kendall-Theil-Sen estimator, which developed a locally consistent Gauss-Markov estimator insensitive to outliers, subject only to the requirement of orderability to ensure these properties. This allows the estimator to compete well against least squares even for normally distributed data, while also allowing linear single slopes to be applied to discrete ordinal data. The estimator, however, does not provide all of the necessary properties required in experimental statistical designs, in particular for use in the Applied Social Sciences. Specifically, we refer to higher-order factorial and polynomial design matrices, which cannot be effectively estimated with the introduction of ties (or collisions and surjective mappings), which preclude finite sample (strong) identification and convergence. To resolve this, we introduce a Banach norm metric topological vector space, which possesses the same structure as the Kendall τ-metric, but which does not possess the selection bias upon the sample space, as it is naturally robust to the occurrence of ties, unlike Kendall's τ_b.
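The univariate Kendall-Theil-Sen estimator referenced above has a compact form: the slope is the median of all pairwise slopes, and the intercept is the median residual at that slope. A minimal sketch (the function name and data are ours, not the paper's):

```python
import numpy as np

def theil_sen(x, y):
    """Univariate Kendall-Theil-Sen fit: slope = median of all pairwise
    slopes; intercept = median residual at that slope."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]  # pairs tied in x have no defined slope
    ]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return intercept, slope
```

Because the median of pairwise slopes ignores the magnitude of any single residual, one gross outlier leaves the fit untouched, which is the outlier-insensitivity the text describes.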
Further, this same topology allows for the estimation of finite sample interactions, which are mathematically identical to ties, and therefore a substantial unresolved problem in applied research. We introduce the mathematical properties of a complete metric upon a linear sub-space, and compare performance in numerous scenarios to that of the traditional Gaussian linear regression model, which demonstrate support for the superiority of the Kemeny norm, in particular as an unbiased second-order consistent (i.e., replicable) estimator. In addition, we derive estimating equations similar to those of OLS regression in terms of the variance-covariance matrices for both parameter estimates and standard errors, and the approach has been shown effective at addressing missing-at-random data with an EM algorithm.
In applied analysis, researchers are often presented with a measurement of interest y_m, the dependent variable, along with a covariate set X_nm which we use to estimate and explore a stable relationship between the expected changes in the target relative to the differences observed or controlled upon our sample. In this manuscript, unless otherwise stated, we assume each column vector in {y, X} is of length m, for which y is a scalar while X is a rectangular matrix of order m × n, with the restriction that m ≫ n, with uniform sampling selection independent and identical upon the population wrt the row-space. The goal of a regression framework is to remove the linear dependencies between all n choose 2 features in the design matrix X of n features, and then to project the optimally weighted linear combination of these unique pieces of information onto y. This unbiased Gauss-Markov optimality is such that we may interpret and approximate how the average fixed unit change in the similarity of X ↪ y may be expected to correspond to an estimated observed change in y. Linear systems with complete metrics are extremely beneficial for such applications, both in terms of their parameter flexibility in establishing complicated yet estimable linear relations, and in their ability to establish, upon relatively small samples, learned patterns which strongly generalise outside of the sample at hand. This beneficence comes at a cost, however: the conditional normality and linearity of errors must be correctly established, in order to maintain the orthonormal separability of the bias of incomplete sampling from the bias in the parameters.
Thus, we provide a robust multivariate mathematical framework, in the style of the general linear model, which may be applied to almost any partially orderable probability distribution function definable upon a common population; thus any distribution which is independently sampled, but which does not require linearity to be established across the column space of X. We will also demonstrate how the Kendall τ and similar non-parametric work (e.g., the Wilcoxon rank-sums test and the Friedman test) may be resolved to produce an unbiased linear estimator which is efficient and easily capable of addressing non-parametric multivariate families within a linear sub-space. The Kemeny (1959) metric defines a complete mathematical framework whose methods are shown to be a maximum likelihood estimator, with both probabilistic and closed-form solutions, for almost any sortable distribution. It is further demonstrated to be only mildly less efficient in the presence of a truly normal distribution, and substantially more Gauss-Markov optimal when addressing non-normal data. We introduce a non-parametric linear regression system whose solutions and standard errors are demonstrably and theoretically a maximum likelihood estimator (MLE), robust to non-normality and more informative even under applications to estimation scenarios such as summative scores and even applications of the polychoric correlations.

Contribution and organisation of the paper
Supposing the Kemeny metric ρ_K to be a convex functional for a topological vector space (X, ρ_K), we prove and provide empirical demonstrations that the relation between (α̂_n, ϵ) is both uniquely determined and linear under a relatively weak set of conditions, as well as being an estimator of minimum variance. We define the necessary characteristics of the parametric error family which satisfies this functional linear basis, and demonstrate how it enables minimum uncertainty with respect to α̂_n as compared to other unbiased estimators. We conclude with several simulations and an applied data analysis, all validated under jackknife resampling, to demonstrate that the performance conditions expected under maximum likelihood are validated as a primal-dual characterisation for our introduced methodology, without the introduction of a non-identity link function, as a consequence of the affine relationship upon (X, ρ_K) for a much wider array of the exponential family of distributions.

Motivation and literature review
We posit that the maximum likelihood properties of the Euclidean ℓ2 norm are non-robust in terms of their consistency and breakdown in finite samples. While asymptotic consistency in expectation is provably true, the ability of a finite sample to possess a sub-additive representation of the population, from said subset, is much less forthcoming, especially when the conditional distribution (i.e., the error distribution, ϵ) is non-normally distributed. We argue that this empirical failing is largely a function of the over-generalisation of the normal distribution of errors as a continuous random field orthonormal to the covariate space, which directly implies that the finite sample selection and parameter biases are inconsistent with our asserted inductive interpretations of how to understand a population. A brief introduction to the foundational basis of this error may be found with the James and Stein (1961) lemma, decomposing bias into orthonormal components upon any arbitrary additive norm space, wherein γ_T represents the total estimation bias, and the Bayes error ε denotes the total irreducible error. γ_T may be further expanded to denote bias with respect to the sampling upon the population, γ_m; bias wrt the parameter estimation, γ_n (e.g., scenarios in which the model is not correctly identified, as well as traditional Tikhonov ridge regression or restricted maximum likelihood estimation); and the interaction of these two pieces, γ_n · γ_m. With a complete metric topological vector space (TVS), under the limit wrt m for uniform sampling, it is expected by definition that γ_n is strongly convergent to 0, and therefore that the bias γ_T is solely a function of the proportional representation of the population within the sample. This bias, if reflective of uniform sampling, should tend to 0 as well, revealing the structure of the common population from which γ_m arose.
ϵ = γ_T + ε,  (1)

where ϵ is the sample error, γ_T the bias, and ε the Bayes error;

ϵ = (γ_m + γ_n + γ_m · γ_n) + ε = γ_T + ε;  (2)

‖ϵ − ε‖ = p-lim_{m→∞} (γ_m + γ_n + γ_m · γ_n) = p-lim_{m→∞} γ_T = 0,  (3)

with γ_n and γ_m · γ_n struck to zero in the limit. Early non-parametric work arguably foundered upon the problem of model identification in the presence of ties, which naturally arise in both the sample and parameter spaces, such that the James and Stein (or bias-variance tradeoff) inequality for a Banach norm-space has both: (1) non-zero γ_m, or bias with respect to the sampling, resulting from ties being excluded, and (2) non-ignorable bias with respect to the parameters, γ_n, if the ties are averaged over. For finite samplings on a normal distribution of errors, though, it naturally follows that both γ_n and γ_m converge to 0 in the population as m → ∞. The Kemeny (1959) metric was constructed to explicitly resolve the problem of sub-additivity in the presence of ties, and from this metric space, the Kemeny distance function and a probability density function can be shown to be exponentially related, and in fact may be isometrically embedded.
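The decomposition in equations 1-3 can be checked numerically for the parameter-bias component γ_n: fit a deliberately misspecified linear model to a quadratic truth, and the mean squared sample error separates, up to sampling noise, into the squared bias plus the Bayes-error variance. The setup below is our own illustration, not the paper's simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.uniform(-1.0, 1.0, m)
bayes = rng.normal(0.0, 0.5, m)          # irreducible (Bayes) error
y = 1.0 + 2.0 * x + 3.0 * x**2 + bayes   # true DGP includes a quadratic term

# Misspecified model: the quadratic term is omitted, so gamma_n > 0
X = np.column_stack([np.ones(m), x])
alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
eps = y - X @ alpha                      # sample error (epsilon)

bias = (y - bayes) - X @ alpha           # bias component: truth minus fit
# E[eps^2] ~= E[bias^2] + Var(bayes); the cross term vanishes in expectation
lhs = np.mean(eps**2)
rhs = np.mean(bias**2) + 0.25
```

The near-equality of `lhs` and `rhs` is the orthogonal split of equation 1; shrinking the omitted quadratic coefficient towards zero drives the bias term, and with it γ_T, towards zero.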
This will allow us to characterise the Kemeny distance within the Gaussian probability family, which may be shown to asymptotically converge to the same point. Unsurprisingly, the Euclidean metric is more informative for Gaussian data; however, the linearity of the Kemeny metric, as well as its minimal loss of power when selected, presents a convenient means of constructing a linear regression model space, without the assumption of the normality of errors and without losing the ability to estimate more complex terms, as is otherwise typically observed. When we know the distribution we are interested in modelling (i.e., y), the introduction of such a complete metric, measuring the distance between our predictions (the estimands ŷ) and the true values y, is the definition of a maximum likelihood estimator, solely characterised by the minimisation of ϵ, for which γ_m tends to zero as the sample becomes the population, and ϵ → ε when the regression model is correctly specified. The ability to leverage the inner-product space defined by the Euclidean norm enables a minimisation procedure of approximation for which γ_n strongly tends towards 0, and therefore so does γ_m · γ_n. Consequently, assuming that we have representative sampling upon the population, the function we learn upon our data is very stable, independent of the specific objects recorded in our data: of course, γ_m > 0 for |m| < ∞. Therefore, these approximations are imperfect, but these imperfections do not compromise the estimation of the sample relations, merely the inductive capacity to link the understandings in our sample to the larger population with an unknown 'truth'. As stated, these techniques are valid for Euclidean spaces; however, knowing the appropriate transformations to establish additivity, to allow the decomposition of ϵ, is much more difficult. If γ_n > 0, our ability to learn relations which are approximately correct is suddenly affected by every other uncertainty in the sample, under a model which asserts these terms are correctly fixed to zero. This increases the distance between our ability to fit a sample's data and our ability to understand a population, with the uniqueness of the likelihood function weakened and the sharpness of the convexity diminished as well. When well-posed, all bias in the population is 0, and therefore the correct modelling structure is solvable to produce a unique model solution. However, for any bias which is non-zero, the distance between the sample error ϵ and the true Bayes error ε grows: as the uniqueness of our induced relationships (i.e., the existence of a unique 'truth') is only established under the axiomatic veracity of said conjecture, the generalisability of all our interpretations is unknowingly compromised. If we take Box (1976) in earnest, then ϵ ≠ ε, equivalent to stating that the bias γ_T > 0 includes γ_m, and that the pieces of the bias all interact together to deform the holomorphic mapping onto our parameter space upon our covariates.
Of course, the incorrect application of a non-sub-additive metric introduces a positive third term in equation 2, in which the function learned is inseparable by a singular regularity criterion of error minimisation wrt the unique sample. Our inability to replicate interpretability across independent samplings, without merely relying upon a weakly consistent cheat, is arguably a reason behind the replication crisis in the Social Sciences (Wald, 1949; White, 1982), since ordinal scales are certainly not continuous, let alone normally distributed even upon a population. Therefore, the likelihood tests and partial Wald tests may be presumed not to be strongly consistent under the conditions in which they are commonly published, if the normality of errors is false as well (Wald, 1949; Le Cam, 1953). Weak consistency (under which γ_m → 0 only as m ≈ ∞) is an undesirable solution, since it requires the researchers to accurately characterise the function we are approximating only when the population is exhaustively sampled, which defeats the purpose of inductive argumentation in favour of description, and is therefore meaningless unless the population may be accurately collected. It should also be noted that the utilisation of meta-analysis does not pose an adequate solution, since the bias in the multiple levels amongst a collection of studies is typically not resolved, nor is it clearly addressed that the estimates themselves are biased. However, this presumption remains the current default for non-normality in the use of both Kendall's τ_b and Spearman's ρ.

Consider a data sampling process (DSP) which produces an (m × 1) vector y = (y_1, y_2, ..., y_m)ᵀ of observable real numbers, y ⊂ R^m. Said data is immutably capable of describing, with probability 1, the data in itself, the sample. However, there exists no descriptive capacity to address anything beyond itself: no inductive inferences concerning the characteristics of either the DSP or the data generating process (DGP) are possible (Solomonoff, 1964). Functional data analysis is a process by which we may approximate upon an unknown function space, and a framework allowing us the ability to predict, within our sample. In Statistics, we are often presented with such an unknown data generating function drawn upon a finite sample, which we must approximate in an attempt to understand the population. The identification of a specific error structure (a parametric probability distribution family; pdf) is what, conditionally, allows us to linearly separate a structure of interest (a model space amongst the universe of all possible identifiable parameters, α_n ⊂ Ω) from the complete uncertainty of the system. Traditional solutions of maximum likelihood (ML) and ordinary least squares (OLS) have linearised the ℓ2-metric space (see equation 4), for certain specific conditions:

y = α_0 + α_1 X + ϵ.  (4)

We define the data as independently and identically sampled upon a random variable from an unknown joint probability distribution, whose characteristics will be further expanded upon. Upon these m independently distributed outcomes of this endogenous process y, we wish to calculate estimates and conduct inference about unknown specifications of the relations between X and y. The estimators of focus are upon a single level (constant within the population) vector (α_0, ..., α_n) ∈ R^{n+1}, α ⊂ Ω, wherein Ω denotes the space of all identifiable parameters, and α_0 denotes the intercept.
If we view the Euclidean distance as a characterisation of the Pearson correlation, then it immediately follows that a simple regression is another form of said correlation (see equation 4). Consider then an empirical scenario, wherein X ∼ N(μ, σ²) and y ∼ N(μ, σ²_ϵ) upon m units. Within such a static (fixed) empirical system, the normal MLE is clearly applicable, and a provable minimum variance estimator. However, consider instead the same system of n random variables transformed by a copula, wherein the scores upon X are such that minimising the Euclidean distance no longer satisfies the properties of the minimum variance maximum likelihood estimator which converges to an expected error of 0 for the population (by the smoothing theorem, or the law of total expectation). The specific transformations are completely arbitrary; however, we assume that they continue to maintain the properties of a complete probabilistic metric space, as per Sklar's theorem (Schweizer & Sklar, 2005; Menger, 1942). This ensures, by using a data generating function such as the generalised partial credit model to link the original scores X → X′, that due to the probabilistic mapping, there is no guarantee of satisfying the triangle inequality upon the Euclidean topology. This is because the distances between any adjacent ordinal values are no longer linear given three distinct points of origin (i.e., ρ(x, ·) − ρ(y, ·) ≠ ρ(x′, ·) − ρ(y′, ·)) for all m and n. Therefore, the estimators α̂ and σ̂² are not independent, as required, a contradiction whose resolution is fundamental for construction of the valid classical t- and F-tests with respect to both first order approximations (coefficient bias) and, more importantly, second order (standard error) bias. The estimator α̂ represents the coefficients of the vector decomposition ŷ = Xα̂ = Py = Xα + Pϵ, from which it follows that α̂ is a function of Pϵ.
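The projection argument can be made concrete: with the hat matrix P = X(XᵀX)⁻¹Xᵀ and its annihilator M = I − P, the fitted values are Py and the residuals My, with PM = 0. A small numerical sketch (our own construction, with illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])
y = X @ np.array([0.5, 1.5, -1.0]) + rng.normal(scale=0.3, size=m)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(m) - P                      # projection onto the residual space

y_hat = P @ y                          # equals X @ alpha_hat
resid = M @ y                          # equals y - y_hat
```

P and M are idempotent and mutually orthogonal, so the decomposition y = Py + My is exact; the text's point is that this orthogonality alone does not buy independence of α̂ and σ̂² once joint normality fails.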
Simultaneously, the estimator σ̂² is a norm of the vector Mϵ divided by n, and thus also a function of Mϵ. Now, the random variables (Pϵ, Mϵ) are jointly normal as a linear transformation of ϵ, and orthogonal because PM = 0; absent joint normality, however, this uncorrelatedness does not make Pϵ and Mϵ independent, and therefore the estimators α̂ and σ̂² are also not independent (Hoeffding, 1948). Moreover, given established biases for finite samples, the minimum variance replicability of the Gaussian likelihood function is not a valid presumption, entirely consistent with current research findings. As the error cannot be linearly separated from the regularity parameters, it follows that the interaction from equation 1, as presented in equation 4, is non-zero, from which follows the introduction of non-zero bias wrt γ_n.

To address non-normal data, Nelder and Wedderburn (1972) introduced a linking function between the coefficients, α, and the error, ε, which linearised the function to allow the additive decomposition of ϵ from y as a function of αX. This still maintains the parametric nature of the approximation distribution, such that we may correctly establish sub-additivity upon the parameter space in terms of our objective goal min ϵ, which solely defines our learning process (Vapnik, 2013), by an appropriately selected monotonic transformation. Non-parametric functional families based upon data ranking (so-called order statistics; Thurstone, 1927; Lipovetsky, 2007), which are invariant to the specific distribution, have been a popular alternative resource, seeking to define relations between relative data orderability, rather than the original data scores. However, the definition of an ℓ2 space has been widely preferable, conditional upon the correct selection of the empirical distribution wrt y. Typical order-statistic methods, such as the Kendall τ_b and, to a lesser extent, the Spearman footrule (Kendall, 1938; Spearman, 1906), rely upon a topological sub-space constructed upon a symmetric group S_m, for which each individual sample realisation is unique.
This error structure was assumed to originate upon an explicitly (and without error) observed continuous random variable, such that P(x_i = x_i′) = 0, almost surely, thereby precluding the existence of two subjects with different covariates possessing the same rank (ties). This characterisation excludes many common empirical measurements in both continuous and discrete empirical spaces, and results in a biased functional approximation due to the Heckman selection process, which asserts non-uniform probability of representation within the sampling from the population, as a function of the characteristics of each X_i, demonstrating the non-ignorability of the γ_n · γ_m and γ_m terms. This follows, as many multivariate and univariate probability distributions are consequently incapable of being uniquely embedded upon S_m for finite (and generalisable) learning as a maximum likelihood problem, due to the lack of identifiability with respect to ties. These approaches further preclude higher-order polynomial terms and interactions, a substantive necessity in observational and experimental research. Moreover, the existence of ties is a substantially larger problem within the multivariate space, as they become substantially more frequent. Consider, for instance, the Rubin (1971) causal model, whose counterfactual approach is wholly constructed upon the existence of multivariate surjective mappings onto common points of non-identical multivariate X. The problem of ties (or collisions) in rank and order-statistic methods has been extensively detailed (Diaconis, 1988; Lehmann, 2009; Hollander, Wolfe, & Chicken, 2013; Harlow, 2013). However, these resolutions rely upon asymptotic weak order convergence wrt m, rather than properly defining a complete compact metric space. As interaction terms in the coefficient space are ties, almost all rank-based techniques avoid estimating them, leading to their disuse in common experimental frameworks, which we address here.

An unbiased linear metric estimator upon the expanded permutation space
Complete metrics are an extensively studied field in theoretical statistics and topology (Schechter, 1997), although applied practice often limits use to the ℓ2, or Euclidean, metric space. Non-linear transformations in the form of the Generalised Linear Model (McCullagh & Nelder, 1989) induce linear additive separation between the model and error structures, while satisficing the primal-dual characterisation of error minimisation, from which follows the highly desirable generalisability of the unique learned patterns upon the sample. We prove that the Kemeny (1959) metric is such a linear metric for any sortable cumulative distribution function which is permutation non-invariant, thereby implying, as a necessary characteristic, the ability to sort distributional probability as a direct linear function of the relative ordering in the sample, which grows linearly to become the population, and is therefore an MLE with minimum variance. A realisation upon either (X_ij, y_i) which maps to a non-unique collision under either Spearman's footrule or Kendall's τ_b distances, and their respective correlational measures, fails to satisfy the properties of a complete metric (Fagin, Kumar, & Sivakumar, 2003), due to the uncertainty of the surjective mapping. For such a common empirical scenario, it follows that for finite sample ties, both distances are invalid maximum likelihood estimators. Said distances are finitely biased, due to the correlation which now exists between the error and the specific data realisations (Hoeffding, 1948), and thus the relative sparseness of the sample space restriction enables only weak convergence under the weak law of large numbers. Therefore, the development of a metric topological distance and corresponding quotient space for all reals upon i ∈ {1, ..., m}, ∀ m < ∞, across the column rank spaces remains an unavoidable necessity; however, such a measure has been largely neglected (Kemeny, 1959; Diaconis, 1988; Fagin et al., 2003).
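The size of the expanded space that admits ties can be counted: strict rankings of m items number m!, while rankings with ties (weak orderings) follow the Fubini, or ordered Bell, numbers, computable by the kind of recursive summation over Stirling-type terms that the text attributes to Good (1975). A sketch (the recurrence and naming are standard combinatorics, not taken from the paper):

```python
from math import comb, factorial

def ordered_bell(m):
    """Count weak orderings (rankings with ties allowed) of m items via the
    recurrence a(i) = sum_{k=1..i} C(i, k) * a(i - k), a(0) = 1: choose the
    k items tied in first place, then order the remainder."""
    a = [1] + [0] * m
    for i in range(1, m + 1):
        a[i] = sum(comb(i, k) * a[i - k] for k in range(1, i + 1))
    return a[m]
```

For example, three items admit 3! = 6 strict rankings but 13 weak orderings, and the gap widens rapidly with m, which is why the tie-admitting population space is so much denser than S_m.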
We present the utility of this Kemeny norm in Figure 1, wherein the relative size of the population permutation space is expanded for five observations, from a population of 24 unique permutations upon the sample space, to 256 (Good, 1975). The Kemeny norm is constructed upon a score matrix, which denotes pairwise discretisation across all pairs of observed subjects, as presented in equation 6 for comparison of subjects i and i′ in the space of (m choose 2) pairs. This score matrix describes a relative orderability with respect to all other empirical observations, with the simple image of greater than (a), equal to (0), or less than (−a), upon which the fixed constant a = 1 recovers Kendall's τ, with the adjustment of a valid image for tied elements, which were merely assumed to occur with probability almost surely 0 for continuous non-normal data. However, empirical measure spaces such as ordinal survey responses, which contain fixed ordered sets of possible choices in response to prompts, are substantially more likely to incur such ties, thereby raising the loss of efficient MLE properties to an immediate point of concern; traditional approaches such as the polychoric correlation (Pearson & Pearson, 1922; Olsson, 1979; Savalei, 2011) fail to address this issue, as empty cells upon the cross-tabulation of responses (the inverse need of the original assumption of almost no ties) produce unstable approximations of the correlation matrix.

Figure 1. [Two panels: a monotonically increasing CDF F_n(x), and a monotonically non-decreasing CDF F_n(x).] Comparison of the two rank metrics upon the permutation space S_n, in which ties are first explicitly avoided and then permitted, in order to demonstrate the empirical population space for unbiased estimators. It is seen that the latter, advocated, Kemeny metric is visually more dense, corresponding to faster convergence to the ECDF in the population under the strong law of large numbers.

Unsurprisingly, the summation across the columns of the score matrix (equation 6) results in a complete metric, the Kemeny distance (ρ_K), as found in equation 5 (and re-expressed as a bijective linear cross-product in Emond and Mason (2002)):

ρ_K(A, B) = ½ Σ_i Σ_j |κ^A_ij − κ^B_ij|,  (5)

κ_ii′ = a if f(x_i) > f(x_i′); −a if f(x_i) < f(x_i′); 0 if f(x_i) = f(x_i′),  (6)

for which ρ: R × R → R, where ρ belongs to the space of possible monotonic metric functions which smoothly approximate the Heaviside function in aggregation across m, producing the cumulative distribution function. In Table 1, we demonstrate a repeated random simulation of the bivariate correlation of a bivariate Poisson distribution, with a population correlation of 0, with 100 subjects. It is seen that, consistent with our hypothesis, the Kemeny correlation does possess the smallest standard deviation under replication, with a minimum 25% greater concentration and a maximum ratio nearly 250 times smaller. This serves to demonstrate our contention that numerous alternative metrics are less efficient in comparison to our proposed estimator, in both a univariate and multivariate space. The Kemeny metric may be shown to be a continuous space from Schechter (1997), as it is a complete metric, and further to be strongly convergent wrt m → ∞, presenting a sufficient basis equivalent to conventional Banach (1934)-norm spaces (Cauchy-Schwarz convergence is an explicit consequence upon any complete metric space).
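Equations 5 and 6 translate directly to code. The sketch below uses the ½·|difference| form of the Kemeny distance, which is consistent with an adjacent transposition costing 2 when a = 1, as the text notes; the function names and example data are ours:

```python
import numpy as np

def score_matrix(x, a=1.0):
    """Equation 6: pairwise scores of a, 0, or -a for each ordered pair
    (i, i'), according to whether x_i is above, tied with, or below x_i'."""
    x = np.asarray(x, dtype=float)
    return a * np.sign(np.subtract.outer(x, x))

def kemeny_distance(x, y, a=1.0):
    """Equation 5 (one common form): half the total absolute disagreement
    between the two score matrices."""
    return 0.5 * np.abs(score_matrix(x, a) - score_matrix(y, a)).sum()
```

Note that moving an element into a tied position costs 1 while a full adjacent transposition costs 2, so ties are genuine points of the space rather than excluded events.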
When combined with an empirically observable space measured with the Kemeny metric, the space is closed, complete, and continuous, and therefore compact for finite m ∈ Z⁺. We begin the proof here, with the assumption that any convex continuous metric space must be shown to be connected, as defined upon our population space, which here is the permutation space H_m, X ∈ R^n ≡ H_m. We treat the graph of m² − m distinct bands to reflect the unique distances of the permutations from an arbitrary origin ρ(u, π) ∈ H_m, for any orderable sequence u_i ∈ R, i = {1, . . . , m² − m}, which may be compared to an arbitrary point of origin π upon the norm-space of H_m. However, it is recommended that π = (1, 2, . . . , m) = I_m, the identity permutation, due to its uniqueness for any permutation group upon the sample of m individual units. The inverse identity permutation, I′_m, may also be utilised under the same reasoning.

We call this graph of all non-isolated nodes G, with m(m − 1) unique distances, which contains k elements, for which k may be computed using a recursive summation of Stirling numbers (Good, 1975). k may be less than or equal to m, denoting the presence of unique real-valued measurements under equality, such that a non-zero probability of ties holds. Restricting to the tie-free subspace S_m ⊆ H_m, in which all distances are multiplied by a scalar of 2 for a = 1, obtains a bijection between the Kendall and Kemeny metrics. This corresponds to the demonstrated point-wise equality upon the empirical cumulative distribution function (ECDF) for each specific distance in Figure 1. An adjacency swap of distance 1 (transposition of two rankings) under the Kendall distance now asserts a distance of 2 under the Kemeny distance, as the tied position is occupied and then moved past, indicating two distinct locations from {ρ_K(u, I_m), ρ_K(v, I_m)} → {u ∼ v} → {v, u} upon said graph. Here u ∼ v denotes an equivalence (an incomplete, or partial, ordering) for the two specific elements on a single random variable X_j.

The elements in the compact support upon ρ_K are connected by the underlying commonality of conditional exchangeability, adjusting for the generating function, thereby defining a residual conditional upon the sufficient statistics, as follows from the linearity of the metric. Said linearity enables a connected function upon an exhaustive permutation space H_m, as defined with the Kemeny metric, to converge to a normal distribution as the number of bins within which subjects' measurements may be placed grows to equality with m.

Table 1
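The doubling described above is easy to verify exhaustively on tie-free data. A small sketch (helper names hypothetical), which also checks that the normalised correlation later given in equation 13 recovers Kendall's τ_a whenever no ties are present:

```python
import itertools
import numpy as np

def kemeny_distance(x, y):
    """Kemeny distance with a = 1 and the assumed 1/2 normalisation."""
    sx = np.sign(np.subtract.outer(x, x))
    sy = np.sign(np.subtract.outer(y, y))
    return 0.5 * np.abs(sx - sy).sum()

def kendall_distance(x, y):
    """Classical Kendall distance on S_m: the number of discordant pairs."""
    return sum(1 for i, j in itertools.combinations(range(len(x)), 2)
               if (x[i] - x[j]) * (y[i] - y[j]) < 0)

x = [1, 2, 3, 4]
m = len(x)
for p in itertools.permutations(x):
    y = list(p)
    # tie-free bijection: Kemeny distance = 2 * Kendall distance
    assert kemeny_distance(x, y) == 2 * kendall_distance(x, y)
    # and the normalised correlation coincides with Kendall's tau_a
    r_K = 1 - 2 * kemeny_distance(x, y) / (m * (m - 1))
    tau_a = 1 - 4 * kendall_distance(x, y) / (m * (m - 1))
    assert abs(r_K - tau_a) < 1e-12

adjacent_swap = kemeny_distance([1, 2, 3, 4], [2, 1, 3, 4])  # transposition -> 2
```

The adjacent transposition costing 2 (rather than the Kendall cost of 1) is exactly the "occupied, then moved past" behaviour described in the text.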
Comparison of bias and power for 1,500 randomly generated ordinal responses
Statistic Mean St. Dev. Skew Min 25% 75% Max
ρ_K − − − − − − −
r − − − − − − −
ρ − − − − − − −
τ_b − − − − − − −

The claim holds for n = {1, 2} and, by induction, for all finite sample sets. For n = 2, where points of H_m on the graph G are realised as (u, v) ∈ V(G), we wish to identify a path from u to v, termed the (uv)-path. If u = v, or u ≠ v on the edges of G, E(G), then the distance between the points must be either 0 or a, denoting either an isomorphism or adjacency. It is also trivially obvious that for n = 2 with u ≠ v and uv ∉ E(G), the transitive property of the segments of the graph which do not contain both (u, v) holds. Such a representation may be considered as a neighbourhood of each endpoint:

\[
A = \{a \in V(G) \mid ua \in E(G)\} \tag{7}
\]
\[
B = \{b \in V(G) \mid vb \in E(G)\} \tag{8}
\]

and as long as these neighbourhoods possess a non-empty set intersection, a common element w must connect u ↦ w and w ↦ v; ∴ u ↦ w ↦ v, allowing construction of a continuous path from u ↦ v for any beginning and end point. The inclusion-exclusion principle operating upon the neighbourhood of each node guarantees that a common neighbour must exist for all points (see equation 9) upon all such H_m graphs, for n ≥ 2 and x_j ∈ X_n, as the graph is connected.

\[
|A \cap B| = |A| + |B| - |A \cup B| \tag{9}
\]
\[
= \deg(u) + \deg(v) - |A \cup B| \tag{10}
\]
\[
\geq (n - 1) + (n - 1) - (n - 2) \tag{11}
\]
\[
= n > 0, \;\therefore\; |A \cap B| \neq \emptyset \tag{12}
\]

The metric is therefore uniquely defined (up to the proportional constant a) to be a consistent, bounded distance defining a population topological vector space (Ω, ρ_K), which Kemeny (1959) treated as a functional mapping of the domain X onto a univariate real number. Said functional follows from the nesting of the image of the score matrix upon X (which may be expanded to be considered as the design matrix without complication) within the summation in equation 5, resulting in the measurement of the metric distance between any two points. The diameter of G for all nodes (u, v) ∈ V(G) is found within the realised finite countable interval {0 ≤ ρ_K(u, v) ≤ m(m − 1)}, which is always known and fixed in the sample. As the distance is closed by the existence of the upper bound for all finite realisations of m, the image of the support is both a closed and a bounded set. Therefore, given any two points (u, v) ∈ X on H_m, the pairwise distance is continuous and homogeneous for all finite subsamples of the universe of populations u ∈ V(G) as the sample grows asymptotically under a uniform and independent sampling of all observable permutations in the population, by Slutsky's theorem, assuming a single population is sampled. We have thus shown that the Kemeny metric is a linear convex function upon the compact permutation support of H_m, and is therefore continuous. A simple proof by contradiction may be used to establish distribution over additivity, and it will then be shown that, by homogeneity, the evaluation of ρ = ρ_K commutes with multiplication by a constant vector α_n representing the coefficient parameters.
Assume, as a function of α_n (for fixed intercept α_0), that ρ_K(xα_n + yα_n, I_m) + α_0 = α_n ρ_K(x, I_m) + α_n ρ_K(y, I_m) + α_0, which reduces to α_0 = α_0; from this immediately follows that the solution is a linear equation wrt α_n. For α_0 = 0 the solution is defined wrt I_m, and therefore represents the normalised scores for which the central location is 0 for all variables under analysis: the introduction of the non-zero intercept term as an additive constant merely serves to translate the expectation of the errors in prediction as a Cauchy-Schwarz convergent function series under the limit as m → ∞, demonstrating with probability 1 that the linear estimator is unbiased upon the Kemeny metric.

We next proceed to prove that the Kemeny metric is unbiased with minimum variance. This also demonstrates the conclusion that the Kendall rank distance is biased for finite samples, as a direct result of the restriction to S_m for conventional data collection. However, this may also be seen by noting that all elements of x ∈ X cover ρ_K and therefore X: ∪_{i=1}^{m}(x)_m ↔ Domain(G), from which follows x_m ∩ x_{m+1} = ∅, thereby demonstrating both the completeness and compactness of the metric. Any norm space which establishes a Banach space, for which the three properties of a metric must hold, must also be homogeneous, which provisions several useful properties, including the power-metric property. Kemeny (1959) proved the first three properties for a complete metric for ρ_K; however, we must also prove αρ(P, Q) = ρ(α · P, Q) ∀ α ≠ 0 wrt (α, ε), the regression parameters and their error, as necessary to establish both addition and multiplication as valid functors. Assume an intercept-only regression model, x_j = 1, for which hold both the properties ρ(αx, I_m) = αρ(x, I_m), ∀ x ∈ R, α ∈ R, and ρ(x + y, I_m) = ρ(x, I_m) + ρ(y, I_m), ∀ x ∈ R, y ∈ R. By a ∈ R (see equation 6), the homogeneity property of the Kemeny metric follows such that ρ_K(aX, I_m) ≡ a · ρ_K(X, I_m), by linear scaling of the penalty term, which forces the monoid scalar a to always be a finite non-zero real, but is otherwise unbounded, without affecting the relative ordering of all numbers. Exhaustive enumeration establishes that H_2, the permutation group with repetition, possesses cardinality 4, with ρ_K(H_2, I_2) ∈ {0, a, a, 2a} ∝ a · {0, 1, 1, 2}. Simple induction by m + 1, where m is a finite number, demonstrates that any axiomatic conditions which hold upon S_2 must also hold upon S_m, i.e., S_2 + S_3 + · · · + S_m + S_{m+1}. The validity of this induction is proven by seeking the equivalence from S_2 that all elements in the set {S_{m+1}} = {S_m + S_{m+1}}. The cardinality of these two groups was proven in Good (1975), so it is immediately known that the groups are correctly sized, and always begin at 0, for the ascending sequence m (since, by the property of indiscernibility, any group of size 1 must be equivalent to itself, and hence ρ_K(I_m, I_m) = a · (m − m) = 0), using the established multiplicity.

From these, we see that, under the assumption that there is an element k ∈ S_m for m ≥ 1, the distance from I_m may be calculated according to equations 6 and 5, with I_m as the origin. As already established, ρ_K(x, I_m), for x ∈ S_m, is correctly calculated upon the entire group, from both the left (I_m) and the right (I′_m). As there is no finite number on the reals for which S_m is not capable of calculating the Kemeny distance, due to the connectedness of G, it is immediately seen that lim_{m→∞} S_1(x = aI_1) ⊂ S_2(ax) ⊂ · · · ⊂ S_{m−1}(ax) ≤ S_m(ax) ≡ a(S_1(x = I_1) ⊂ S_2(x) ⊂ · · · ⊂ S_{m−1}(x) ≤ S_m(x)), by the distributivity of linear multiplication. Thus, the Kemeny metric is shown to be a linear Banach space, which allows utilisation of the power-metric property.

From these properties follows a means for consistent estimation of linear parameters (by the Cauchy-Schwarz convexity of all complete metrics), such as interactions, with respect to almost any homogeneous error distribution whose cumulative distribution F is monotonically non-decreasing. For a function F to be monotonically non-decreasing is a complementary extension of the typical assumption of a monotonically increasing cumulative distribution function (as in Mann & Whitney, 1947; Cox, 1972). Under a monotonically increasing function, the ordered indices, or statistics, of the dependent variable are uniquely sortable, such that each of m observations possesses a relative ordering upon the sample with respect to its largest (or smallest) realised value.
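Both the homogeneity in the score constant a and the H_2 enumeration above can be checked mechanically. A hedged sketch, taking the H_2 elements to be rank vectors with repetition (an assumption consistent with the stated cardinality m^m = 4), with a = 1 and the 1/2 normalisation assumed throughout:

```python
import itertools
import numpy as np

def kemeny_distance(x, y, a=1.0):
    """Kemeny distance (equation 5) with score constant a, 1/2-normalised."""
    sx = a * np.sign(np.subtract.outer(x, x))
    sy = a * np.sign(np.subtract.outer(y, y))
    return 0.5 * np.abs(sx - sy).sum()

# homogeneity in the score constant: rho_K(.; a) = a * rho_K(.; 1)
x, y = [3, 1, 2, 2], [1, 2, 3, 4]
for a in (0.5, 2.0, 7.0):
    assert np.isclose(kemeny_distance(x, y, a), a * kemeny_distance(x, y, 1.0))

# exhaustive enumeration of H_2: rank vectors with repetition, cardinality 2^2 = 4
H2 = list(itertools.product([1, 2], repeat=2))   # (1,1), (1,2), (2,1), (2,2)
dists = sorted(kemeny_distance(list(u), [1, 2]) for u in H2)   # [0, 1, 1, 2]
```

The two tied elements (1,1) and (2,2) each sit at distance a from the identity, while the reversal sits at 2a, matching the proportionality a · {0, 1, 1, 2}.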
Assume F(x) explicitly characterises the space under the CDF with the point-wise inequality F_{x_i}(t) < F_{x_{i+1}}(t), wherein t satisfies the properties of the order statistics, which are exchangeable with realisations upon the raw observations x as a consequence of the unique probability metric, justified by Sklar's theorem. A bijective relation therefore exists between the probability and empirically observed measure spaces for each individual x_i, ∀ i ∈ {1, 2, . . . , m}. Under the Kemeny metric, the inequality upon F is replaced with the relation F_{x_i}(t) ≤ F_{x_{i+1}}(t), which induces a transformation of the CDF as visualised in Figure 1. This allows finite-sample first- and second-order convergence, under the substantive increase in density upon the realisable sample space. A simple algebraic adjustment of the Kemeny metric further allows the estimation of a finite-sample correlation estimating equation, which will later be shown to also be a minimum variance maximum likelihood estimator. This expression is provided in equation 13, which, due to the Banach-norm properties of the Kemeny metric, is linearly strongly consistent, unbiased, and invariant to monotonic transformations. The Kemeny correlation is later expanded to demonstrate a variance-covariance matrix, for which it is shown to enable the estimation of a multivariate linear non-parametric regression, for a parameter space which includes the introduction of polynomial terms and interactions, an immense improvement over current non-parametric estimators in terms of γ_n.

\[
r_K = 1 - \frac{2\,\rho_K(x_j, x_{j'})}{m(m-1)}. \tag{13}
\]

U-statistic estimator properties
Let P be an arbitrary family of probability distributions which are homogeneous upon the space (X, ρ_K), restricted only such that each distribution P_j is a vector of length m composed upon the family of weakly-orderable distributions. Said data are permutation non-invariant (i.e., orderable), but include weak orderings such that all pairwise elemental comparisons may be determined to be greater than, lesser than, or equal to an arbitrary point of origin, with Hermitian positive semi-definite distances. As the Kemeny distance is a continuous and convex metric, the sole restriction to functional analysis lies upon the existence of a common and independent generating random probability function for the errors. In extension to the multivariate distribution, it will not be expected that the parametric families be identical, but merely that P_j ∈ P ∀ j = {1, . . . , n}. Let λ(P_j) be a real-valued function defined for P_j ∈ P, which is estimable for the observable data space X, a rectangular matrix of order m × n, upon which X ⊆ X is identically and independently randomly distributed. Further allow λ(P_j) to be an estimable parameter for some integer m, for which there exists an unbiased estimator with property λ(P), which by the linearity of the space defined by ρ_K produces a symmetric function over all permutations upon the row-space of the data, which may be countably infinite as per the Axiom of Choice. The nature of the Kemeny metric as Borel measurable follows from the finite countable nature of H_m ∀ m, as established by Good (1975). The central location which minimises the distribution of distances upon the permutation space H_m, ρ_K(π_j, π) ∀ π ∈ H_m, is satisfied by the midpoint distance of the unique extrema and its inverse, expressed as a point of distance m(m − 1)/2 in the population, about which the distribution of distances is linearly symmetric.
The expectation of the metric for said unique point of origin is equivalent to the exhaustive symmetric sampling amongst all permutation points upon ρ_K defining the order statistics, and therefore the point of symmetry (the first moment, or expectation) is finite for all finite samples, m < ∞. Therefore, U_j upon ρ_K produces a linear functional of the expectation, identical to the one-sample Wilcoxon rank-sum statistic upon S_m = Ω, thus establishing the expectation of the linear operator as the median. The variance for a finite mean is definable by the power-metric property of the ultrametric (or any Hilbert) space as a quadratic Taylor expansion about the expectation. The variance of said U-statistic is expressed wherein Ξ_jj = Var(h(x_1, . . . , x_k)), defined for all finite realisations upon x_m ∈ R^n. We express the variance of the univariate variable x_j as ξ_j = ξ(x_j), and that of a multivariate set of variables as the diagonal of the n × n matrix Ξ. Consider two subsets of the population D for which there are exactly k common subjects between the two subsets. The distinct choices for the construction of both subsets are \(\binom{n}{m}\binom{m}{k}\binom{n-m}{m-k}\); as the estimate ĥ is symmetric and independent of the construction of each subset (by the unbiasedness of said estimand upon the metric), it follows that the point of inflection in the probability distribution of the distance function ρ_K allows for the construction of a minimum variance estimator of order n − k, and therefore that the estimator amongst the exponential family P ∈ P is √n-consistent for a singular random variable. A closed multivariate solution for α_n is based upon Ξ, the Kemeny metric variance-covariance upon the union of the design matrix X with n parameters and the dependent variable y, whose row and column n + 1 correspond to y, which is linearly endogenous wrt the residual ε.
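The subset-counting term above can be evaluated directly. A small sketch with illustrative values (the helper name is hypothetical):

```python
from math import comb

def subset_choices(n, m, k):
    """Ways to draw two size-m subsets of n units sharing exactly k common subjects."""
    return comb(n, m) * comb(m, k) * comb(n - m, m - k)

count = subset_choices(10, 4, 2)   # -> 210 * 6 * 15 = 18900
```

This is the standard combinatorial step in deriving the covariance of two U-statistic kernels sharing k arguments.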
From this matrix the covariate parameters (α_n) and their respective n standard errors may be estimated. The n coefficients α_j for j = {1, 2, . . . , n} may be simultaneously estimated upon the sub-matrix Ξ_{n,n}, expressing the variance-covariance matrix of the feature space, along with the residual error variance ξ_ε as the complement of the linear covariance between the dependent variable and the regression function, subtracted from the total variance, together with the intercept α_0 and the variance of the parameters σ²_α:

\[
\alpha_0 = \nu(y) - \nu(X)_n\,\alpha_n \tag{14}
\]
\[
\alpha_n = \Xi_{n,n}^{-1}\,\Xi_{n+1,n} \tag{15}
\]
\[
\xi_\epsilon = \sqrt{\frac{m}{m-n-1}}\,\bigl(\xi_{jj} - \xi(\alpha_0 + X\alpha_n,\, y)\bigr) \tag{16}
\]
\[
\sigma_\alpha^2 = \xi_\epsilon \cdot \operatorname{diag}\bigl((X^{\top} X)^{-1}_{n,n}\bigr) \tag{17}
\]

From these established properties for linearly unbiased estimators, we conclude a valid realisation of the Gauss-Markov theorem, establishing the minimum variance properties of the unique linear model space parameters solved for under ρ_K, by the continuity of the metric for a vector space of full column rank, from which is justified the application of the Gram-Schmidt solution for m ≫ n. This further allows us to define not only the correlation as a linear rescaling of the compact Kemeny distance about the expectation, but also to scale the correlation by the roots of the likelihood function, thereby defining both the variances and covariances by the inner-product scaling of the compact sufficient statistic. The cross-product X^⊤X requires no additional computational adjustment, due to the linear additive and multiplicative equivalence of the parameter space upon the Kemeny metric, and is therefore validly defined as such upon this topology as well.
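The linear algebra of equations 14 and 15 is a covariance-based normal-equations solve. The sketch below substitutes the ordinary covariance matrix for the Kemeny variance-covariance Ξ purely to illustrate the structure; the paper's estimator would instead populate Ξ with Kemeny covariances, and the sample mean stands in for the median functional ν:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=m)

# Xi: (n+1) x (n+1) variance-covariance of [X, y]; ordinary covariance as stand-in
Z = np.column_stack([X, y])
Xi = np.cov(Z, rowvar=False)

alpha_n = np.linalg.solve(Xi[:n, :n], Xi[:n, n])   # equation 15: slopes
alpha_0 = y.mean() - X.mean(axis=0) @ alpha_n      # equation 14: intercept
```

The same two-step structure (solve the feature-space block, then centre for the intercept) carries over unchanged when Ξ is built from Kemeny covariances.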
Probability distribution of F upon y = f(x)

As the Kemeny metric is henceforth definable in a unique linear metric space with established U-statistic properties, with support f: αX → [0, m² − m] ⊂ R, a specific probability distribution for the population must be definable as well. As was previously shown, the first and second moments of the Kemeny metric are linearly expressible in closed form by the median (ν) and the dispersion about the median (ξ), in lieu of the conventional mean and variance. Since the median naturally converges to the mean upon a population, the finite robustness of the median as the linear convex expectation under independent sampling is greatly beneficial, yet asymptotically equivalent, while possessing a breakdown point of 50%, consistent with expectations regarding a closed-form expression for the median.

We propose that the pdf, given its linear nature, be defined as a Gaussian function, which is strongly consistent as a linear function upon the Kemeny metric, and which may be defined for the population of univariate reals x ∈ R^m, |x| = m:

\[
F(x_j) = \int_{-\infty}^{\infty} \frac{1}{\xi_j\sqrt{2\pi}} \exp\!\left(-\frac{(x_j-\nu)^2}{2\xi_j^2}\right) dx = 1 \tag{18}
\]
\[
\nu = \frac{1}{m}\sum_{i=1}^{m} \rho_K(x_i, I_m) \tag{19}
\]
\[
\xi = \frac{1}{m-1}\sum_{i=1}^{m}\left(\rho_K(x_i, I_m) - \frac{m^2-m}{2}\right)^{2} \tag{20}
\]

for all finite orderable distributions of reals of length m. This implies that a multinomial distribution is explicitly not well-posed, but any monotonic partial ordering would be. From this perspective, we may construct a robust Fisher z-score distribution wherein the expected value ν is 0 and the scale ξ = 1.

Probabilistic MLE
It is assumed, for any MLE, that any real parameter on the interior of the parameter space, α̇ = (α_n ∪ α_0) ∈ Ω, possesses a cumulative distribution function, for which there also exists a probability density function f(x_i; α̇) for the random variable x_i associated with the i-th empirical realisation within a study. We further assume that F(x_j, α̇) is either discrete for all α̇ or absolutely continuous for all α̇.

Under a linear pdf, an estimator α̂̇ is regularly and linearly obtained to satisfy the likelihood score, or estimating equation, which equals 0 and is invariant to logarithmic transformation, possessing a unique and singular optimum. Said optimum is defined by selection of the minimal sufficient statistics as equivalent to the already provided closed-form expressions of ν and ξ, for which the error term then becomes the object of minimisation wrt the joint set α̇ = {α_0, α_n}. Said property is established by demonstrating that the value of the sufficient statistic z, ∀ T(z), may be consistently estimated once the loss function L(α̇ | X) is known. The likelihood function for α̇ arises under the assumption of independent and full-rank data which is linearly and independently distributed in a multivariate field of dimension n, controlling for the sample estimates ν̂|X and Ξ̂|X. To determine for each m the most likely estimate (wrt α̇) or the corresponding m predictions (wrt y), we must demonstrate convergence with probability 1 to the local optimum in the interior of the parameter space which individually characterises α̇, and which is essentially unique (Perlman, 1983) under certain well-established conditions.
We will demonstrate that our characterisation upon the space (X, y, α̇; ρ_K) satisfies these requirements under the Kemeny metric for a matrix of sufficient order, such that even when the likelihood itself may be unbounded, or the Bayes error may not coincide with zero, our proposal will remain a consistent estimator of α̇. The Fisher expected information matrix about the interior parameter set α̇ is defined such that I(α̇) = E_α̇{S(α̇; X) S^⊤(α̇; X)}, for which S(α̇; x_i) = ∂ log L(α̇)/∂α̇ is the gradient score statistic of the log-likelihood, and x_j = (x_1, · · · , x_m) is the j-th column vector of X from which the Fisher information is estimated, and from which follow the standard errors (SE),

\[
\mathrm{SE}(\hat{\dot\alpha}_r) \approx \sqrt{\bigl(I^{-1}(\hat{\dot\alpha})\bigr)_{rr}}, \qquad r = 1, \cdots, n+1, \tag{21}
\]

as the cross-product of the score statistics over the n + 1 diagonal elements of X^⊤X. As ρ(X, ·) is convex, all points α̇ on the interior of the parameter space Ω are well-posed.

The uniqueness clearly follows for any finite selection of α̂̇, wherein for a fixed finite sample space X there exists a function L(α̇; x) = h(x) · f(x; α̇), defined upon the three sequences of partial derivatives of equation 18 wrt α_n, utilising the probability field as indicated in equation 18, which is already established to be a linear function, with both derivatives of the same sign, and thus positive definite. This thereby satisfies the selection of α̂_n, a unique point which maximises the likelihood and minimises the Kemeny metric, invariant to logarithmic transformations of said loss function, which the maximisation of the log of this same equation produces. It therefore immediately follows that log{L(α̇; x, ρ_K)} = l(α̇; x, ρ_K) characterises a uniquely solvable score equation of the log of the unbiased linear estimate in terms of the pdf found in the derivative of equation 18 at point x.
\[
l(\nu, \xi \mid x) = -\frac{m}{2}\bigl(\log 2\pi + \log \xi\bigr) - \frac{1}{2\xi}\sum_{i=1}^{m}(x_i - \nu)^2 \tag{22}
\]

The sufficiency of the two provided statistics may be established for both S_m and the expanded space H_m, wherein m continues to denote the unique set of permutations realisable upon the vector x, which may then be established for all m < ∞ by recurrence. Begin upon H_1 = x = {1}, for which, using equations 19 and 20, we produce the estimates ν_j = ξ_j = 0, respectively. This is immediately observably valid, as there is only one possible permutation upon
H_1: |H_1| = 1. The population probability is then complete, integrable to 1, and ∑_x f(x) · x = 1 · x ≡ x = ν. The variance expression ξ(x_j) is identically established, as f(x) · (x_i − ν)² = 0, thereby realising in expectation over the population both the necessary and sufficient statistics to satisfy our Gaussian characterisation of the probability random error. Application of the score function provides

\[
\frac{\partial l}{\partial \nu} = \frac{1}{\xi}\sum_{i=1}^{m}(x_i - \nu) = \frac{m}{\xi}\bigl(\hat\nu - \nu\bigr),
\]

from which it may be immediately seen that there exists only one optimum, at 0, with ν̂ = ν. For the score function taken wrt ξ, for known ν̂, is obtained

\[
\frac{\partial l}{\partial \xi} = -\frac{m}{2\xi} + \frac{1}{2\xi^2}\sum_{i=1}^{m}(x_i - \nu)^2 = -\frac{m}{2\xi^2}\Bigl(\xi - \frac{1}{m}\sum_{i=1}^{m}(x_i - \nu)^2\Bigr),
\]

which, given the identity ν(x) = E(x), yields

\[
\xi(x) = \frac{1}{m}\sum_{i=1}^{m}(x_i - \hat\nu)^2.
\]

Said estimators are biased upon finite samples, in that neither ν nor ξ is typically known, and therefore a reduction in the degrees of freedom is necessary to correct for finite samples. The variance estimator may be corrected upon finite samples by substitution of m − 1 for m, as likewise for the median. Therefore, both necessary and sufficient statistics are producible from these two estimating score equations, for which the minimal sufficient conditions are thereby satisfied. The variance of the parameters, and the construction of the Fisher information matrix, is thereby produced by taking the derivatives of each equation wrt the target parameters, as already well established.

As a linear function space, the asymptotic sampling distribution of the MLE α̂(x) is expected to be (multivariate) normally distributed, with expectation α̇ and variance I(α̇)^{-1}, wherein the established properties of the linear metric space ensure that a quadratic approximation of I(α̇) is sufficient as the sum of orthonormal variances (as previously utilised to demonstrate the bias of the ℓ2-norm MLE), as calculated using equation 20.
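Both score equations above can be verified numerically; a minimal sketch (with ξ denoting the variance, as in equation 22):

```python
import numpy as np

x = np.array([2.0, 5.0, 1.0, 4.0, 3.0])
m = len(x)

nu_hat = x.mean()                                  # root of the score wrt nu
xi_hat = ((x - nu_hat) ** 2).sum() / m             # root of the score wrt xi (biased)
xi_corr = ((x - nu_hat) ** 2).sum() / (m - 1)      # m - 1 degrees-of-freedom correction

# score wrt nu, (1/xi) * sum(x_i - nu), vanishes at nu_hat
score_nu = (x - nu_hat).sum() / xi_hat
# score wrt xi, -m/(2 xi) + sum((x_i - nu)^2)/(2 xi^2), vanishes at xi_hat
score_xi = -m / (2 * xi_hat) + ((x - nu_hat) ** 2).sum() / (2 * xi_hat ** 2)
```

Both scores evaluate to zero at the stated roots, and the m − 1 correction reproduces the usual unbiased variance estimator.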
A linear regression upon the Kemeny space is a linear equation y_i = α_0 + α_n X_i + ε_i, where ε_i is distributed as an independent normal distribution with median 0 and unknown error variance, as previously established, for elements i = 1, · · · , m in the sample. Therefore, the joint density for ε_i upon (X, y; ρ_K) is as follows, as specified according to the likelihood equation:

\[
\frac{1}{\sqrt{2\pi\xi}}\exp\!\Bigl(-\frac{\epsilon_1^2}{2\xi}\Bigr)\cdot\frac{1}{\sqrt{2\pi\xi}}\exp\!\Bigl(-\frac{\epsilon_2^2}{2\xi}\Bigr)\cdots\frac{1}{\sqrt{2\pi\xi}}\exp\!\Bigl(-\frac{\epsilon_m^2}{2\xi}\Bigr) = \frac{1}{\sqrt{(2\pi\xi)^m}}\exp\!\Bigl(-\frac{1}{2\xi}\sum_{i=1}^{m}\epsilon_i^2\Bigr).
\]

By substituting \(\epsilon_i = y_i - (\alpha_0 + \alpha_n X_i)\), the likelihood function is therefore

\[
L(\alpha_0, \alpha_n, \xi \mid y, x) = \frac{1}{\sqrt{(2\pi\xi)^m}}\exp\!\Bigl(-\frac{1}{2\xi}\sum_{i=1}^{m}\bigl(y_i - (\alpha_0 + \alpha_n x_i)\bigr)^2\Bigr),
\]

from which follows the score function

\[
l(\alpha_0, \alpha_n, \xi \mid y, x) = -\frac{m}{2}\bigl(\log(2\pi) + \log(\xi)\bigr) - \frac{1}{2\xi}\sum_{i=1}^{m}\bigl(y_i - (\alpha_0 + \alpha_n x_i)\bigr)^2.
\]

Consequently, it is seen that optimising the likelihood function for the parameter space α̇ is equivalent to minimising the residual sum of squares as previously defined, with n + 1 estimated parameters and m − (n + 1) residual degrees of freedom for any well-posed regression scenario with an error distribution which is monotonically non-decreasing.

It should be noted that in a regression problem for finite samples, both the expectation and the expected dispersion (i.e., the median, and the variances and covariances) are unknown. Therefore, it is recommended that Student's t statistics be employed, which is reasonable since the sharp convexity previously established for finite samples ensures that the sample mean and variances are both orthonormal and strongly consistent, thereby precluding the typical necessity for asymptotic weak convergence, as typically utilised for the Wilcoxon rank-sum statistic, Kendall's τ_b, and the Theil-Kendall-Sen non-parametric estimators when one or more ties occur. The Hessian matrix is as follows, established separately for the parameters α̇ and the residual sum of squares, for which a second derivative must be taken upon each of the likelihood equations previously provided. These result in the estimators

\[
H = \frac{\partial^2 l}{\partial \dot\alpha\,\partial \dot\alpha^{\top}} =
\begin{pmatrix}
-\dfrac{X^{\top} X}{\xi_\epsilon} & -\dfrac{X^{\top} \epsilon}{\xi_\epsilon^2} \\[2ex]
-\dfrac{\epsilon^{\top} X}{\xi_\epsilon^2} & \dfrac{m}{2\xi_\epsilon^2} - \dfrac{\epsilon^{\top} \epsilon}{\xi_\epsilon^3}
\end{pmatrix}. \tag{23}
\]

The expectation of H(α̇) follows, under which the Gauss-Markov assumptions cancel out the covariances between ε and X on the off-diagonals. The expectation of the model variance (which is a fixed constant) is equivalent to the closed-form expression already provided, therefore establishing asymptotic maximum likelihood estimator candidacy.
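The vanishing off-diagonal expectation in the Hessian of equation 23, and the standard errors implied by the inverse information, can be illustrated on a simulated regression (all names hypothetical, with the ordinary residual variance standing in for ξ_ε):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # intercept + one covariate
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(scale=0.5, size=m)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # least-squares / ML solution
resid = y - X @ beta_hat
xi_eps = resid @ resid / (m - 2)                        # residual variance estimate

# off-diagonal Hessian block -X^T eps / xi^2 vanishes at the optimum
off_block = -(X.T @ resid) / xi_eps ** 2

# standard errors from the inverse expected information (Cramer-Rao bound)
se = np.sqrt(np.diag(xi_eps * np.linalg.inv(X.T @ X)))
```

At the fitted optimum the residuals are exactly orthogonal to the columns of X, which is why the expected Hessian is block diagonal and the information matrix reduces to the familiar closed form.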
The second term, concerning the error variance, may be reduced:

\[
E(H) = E\!\left[\frac{\partial^2 l}{\partial \dot\alpha\,\partial \dot\alpha^{\top}}\right] = -I(\dot\alpha) =
\begin{pmatrix}
-\dfrac{X^{\top} X}{\xi_\epsilon} & 0 \\[2ex]
0 & -\dfrac{m}{2\xi_\epsilon^2}
\end{pmatrix}, \tag{24}
\]

from which follows the typical relation estimator as previously established for linear estimators upon this topological manifold. Thus the information matrix can be seen to be equivalent to the closed-form expressions already provided for the 'least squares' estimator upon the Kemeny metric space. The Cramer-Rao lower bound is further shown to be satisfied, var(α̇) ≥ (−E(H(α̇)))^{-1}, and is in fact equivalent to the formulations provided in equation 24, thereby establishing that this presents a minimum variance estimator, and therefore an efficient MLE in the Gauss-Markov satisfaction of first- and second-order consistency.

Geometric perspective upon a non-parametric interaction
The idea of main effects does not necessarily guarantee that a collision upon the covariate space will occur; however, the multiplication of two features, especially in the common problem of finite observation spaces (demographics or trial conditions), necessitates that ties will occur. We demonstrate that the Kemeny metric maintains its maximum likelihood properties in a linear framework while allowing for interactions to be estimated. This is a compelling improvement over all other currently employed non-parametric techniques, as it allows for the estimation of linearly first- and second-order consistent interactions without a need to conduct sub-study stratification. A brief simulation study was conducted to demonstrate the superiority of the method proposed in this manuscript in comparison to established alternatives and traditional OLS regression, and the results are provided in Table 2. Also of note is the finding, reported in Table 2, that the standard deviation of the estimated parameters is nearly equivalent for all parameters to the closed-form expression, as expected for a linear estimator.

A cursory inspection of Figure 2 reveals the distribution of all coefficients under the Kemeny metric to be both normally distributed and less dispersed (i.e., more informative) compared to the alternative formulations, as expected of a minimum variance estimator. The predominant component of the calculation of the asymptotic standard errors, the sum of squared errors, is geometrically identical to the Kemeny distance between the regression predictions and the target. It also immediately follows from this geometric equivalence that the Kemeny metric is, both empirically and theoretically, a more powerful estimator for any cumulative distribution

Table 2
Estimation of interaction parameters over 2,500 jackknife resamplings under the Kloke (2009), Kemeny (1959), and OLS metrics, with mean bootstrapped linear standard errors reported as α̃, indicating expected higher efficiency of the parametric standard errors

Statistic Mean St. Dev. Min 25% 75% Max
α̃ (Intercept) 0.231 0.082 − − − −

function upon homogeneous but non-Gaussian data samples. This is of course not true for instances in which a proper ℓ2 contraction may be imposed; however, this typically induces a non-linearity, whereas our approach is linear. Computing the distance between the projection and the target, divided by the residual degrees of freedom, provides the mean squared error, as would be expected for any linear functional basis. The product of the MSE with the individual parameter cross-product produces the standard errors as a simple linear function. Further, all matrix multiplications which result in an n × n product are supplanted by the covariance matrix under the Kemeny distance, enabling an exhaustive implementation of a purely ordinal linear framework. This ensures both primal and dual characterisations of a maximum likelihood estimator as theorised, without the loss of generality induced by norming a continuous score function. This is because the units of the Kemeny distance are invariant over all data spaces under the strong consistency, by the Glivenko-Cantelli theorem, of both ECDF → CDF and F̂ → F.

We also must demonstrate that the standard errors are (second-order) minimised with respect to the fixed data input into the model. To address this, we utilise the Anscombe (1973) dataset with all appropriate bivariate pairings, which are bootstrapped with replacement to produce 15,500 datasets of 550 subject measurements, for each pairing.
We would expect, under valid characterisations, that the standard deviations of the parametric formulas would be smaller than the corresponding bootstrapped estimates, unless the parametric assumptions were violated, and would otherwise approach a ratio of 1 between the two metric spaces. The substantive finding reflects the approximately constant scaling difference in the four bivariate data sets between the bootstrapped and formula-estimated standard errors.

Figure 2
Mardia distributional examination of regularity parameters with interaction. [Density estimates (N = 2,500) of the Intercept, Sociological, Gender, Age, and Interaction coefficients under the Kloke, Kemeny, and OLS metrics, together with a normal Q–Q plot.]

Compared to the OLS formulation of the first- and second-order statistics (coefficients and standard errors, respectively, of the intercept and regression slope), which possesses a ratio approaching 1 (the minimum, for (x, y), the bivariate Gaussian distribution) between the bootstrapped standard deviations of the empirical coefficient distribution and the mean standard errors, as expected for the constant (but heavily biased) correlation coefficient r̂. However, as stated, the greater magnitude (by a constant scaling) of the values reported in Table 3 is expected, as the sums of squared errors are in fact non-constant, and this does not affect the comparison of relative variability. In Table 3 the standard errors show that the only approximately unit scaling between the two estimates is found for the sole case in which the Kemeny metric is invalidly applied, i.e., the quadratic function approximated by a single slope upon (x, y). For all other cases, the standard errors are consistently smaller (approximately 92%), as would be expected for the scenario in which the parametric error family was correctly derived. More interesting, though, is the comparison of (x, y; ρ_E, ρ_K), which is the only instance of an unbiased estimator upon the Anscombe dataset. Here, presented in the first results for Table 3, the estimated standard errors are approximately twice the size of the empirically estimated standard deviation, whereas for (x, y) the ρ_E ratio is again approximately 1; however, the ratio upon ρ_K is nearly 5, but under replication the constant scaling is found to approximate √(m − n), the residual rescaling of the degrees of freedom of the variance. Once this is corrected for, the linear relative efficiency we hypothesised is validated, with a ratio of approximately 92% upon two or more coefficients in a linear system of m equations.
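The ratio diagnostic discussed above (bootstrap standard deviation of a coefficient against the mean formula standard error) can be sketched as follows. This is a Python illustration on simulated Gaussian data under the Euclidean norm, with far fewer resamples than the 15,500 used in the text; the sample size of 550 mirrors the bootstrapped Anscombe pairings, but the data are not the Anscombe data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical bivariate sample standing in for one Anscombe-style pairing.
m = 550
x = rng.normal(size=m)
y = 0.5 * x + rng.normal(size=m)
X = np.column_stack([np.ones(m), x])

def fit(Xb, yb):
    # Least-squares projection, MSE over residual df, and formula standard errors.
    b, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    r = yb - Xb @ b
    mse = r @ r / (len(yb) - Xb.shape[1])
    se = np.sqrt(mse * np.diag(np.linalg.inv(Xb.T @ Xb)))
    return b, se

slopes, ses = [], []
for _ in range(500):                      # fewer resamples than in the text
    idx = rng.integers(0, m, m)           # bootstrap with replacement
    b, se = fit(X[idx], y[idx])
    slopes.append(b[1])
    ses.append(se[1])

# Under a correctly specified error family, the ratio of the bootstrap standard
# deviation to the mean formula standard error approaches 1.
ratio = np.std(slopes, ddof=1) / np.mean(ses)
print(round(ratio, 2))
```

A ratio persistently far from 1 signals either a violated parametric assumption or, as discussed for ρ_K above, a constant rescaling such as √(m − n) that must be corrected before comparison.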
This necessary correction however, for (x, y) [...]

Decomposition of Sums-of-Squares
A final connection for any theory of a general linear model requires the demonstration of decomposition by sums of squares. Such a decomposition upon a centred data set X (of dimension m × n), in the form of a square X⊤X or XX⊤ matrix, provides
the same fundamental information necessary to produce, in terms of the design matrix, the total sum of squares and cross-products. As previously stated, the sum of squared errors (SSE) and the mean squared error (MSE) are already defined as the distance, and the expected distance, between the predictions and the target. The intercept-only model in turn defines the origin of the coefficient parameter space as the median, about which the regression terms minimise the distance. Thus, each operation with respect to α̂ upon the linear convex Kemeny distance is expected to minimise the discrepancy in ordination of the prediction set ρ_K(y, ŷ). While we will not exhaustively address comparisons with alternative non-parametric ANOVA models (e.g., Rizzo & Székely, 2010), it should be immediately obvious to the reader that, as a function of the removal of the free parameter p upon the Minkowski distance, our proposed method is superior, requiring fewer parameters and introducing less bias in order to establish the linearity between the decomposed sums of squares and the target.

Table 3
Unscaled estimation of the Kemeny-based non-parametric bootstrapping of the coefficients (α, α) and unscaled standard errors (multiply by (√(m − n))⁻¹ to correct) across 15,500 datasets constructed from 550 bivariate elements resampled from the Anscombe dataset. [The coefficient rows were not recovered; the σ(α) rows are reproduced below.]

            mean    sd      median  min     max     range   skew     kurtosis
(x, y)
  σ(α)      0.2624  0.0083  0.2622  0.2337  0.2972  0.0635  0.1384   0.1517
  σ(α)      0.0275  0.0007  0.0275  0.0247  0.0307  0.0060  0.0413   0.0975
(x, y)
  σ(α)      0.2962  0.0101  0.2957  0.2615  0.3588  0.0973  1.3671   5.9482
  σ(α)      0.0311  0.0009  0.0310  0.0281  0.0371  0.0090  1.7509   8.1898
(x, y)
  σ(α)      0.0915  0.0118  0.0912  0.0542  0.1409  0.0867  0.1885   0.0643
  σ(α)      0.0096  0.0011  0.0096  0.0059  0.0140  0.0081  0.1324   0.0234
(x, y)
  σ(α)      0.3509  0.0035  0.3508  0.3279  0.3771  0.0492  -0.8226  7.5510
  σ(α)      0.0368  0.0009  0.0367  0.0326  0.0428  0.0102  0.4238   1.3680
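The sums-of-squares decomposition described above can be verified numerically. The sketch below is a Python illustration under the Euclidean norm on simulated, centred data (under that metric the intercept-only origin is the mean; under the Kemeny metric it is the median), showing the exact identity SST = SSR + SSE produced by orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated centred design: m observations, n regressors.
m, n = 60, 2
X = rng.normal(size=(m, n))
y = X @ np.array([1.5, -1.0]) + rng.normal(size=m)

# Centre the target and design, so the intercept-only model is the origin.
yc = y - y.mean()
Xc = X - X.mean(axis=0)

beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
yhat = Xc @ beta

sst = yc @ yc                    # total sum of squares about the origin
ssr = yhat @ yhat                # sum of squares captured by the projection
sse = (yc - yhat) @ (yc - yhat)  # sum of squared errors

# Orthogonality of the projection yields the exact decomposition SST = SSR + SSE.
assert np.isclose(sst, ssr + sse)
print(round(sst, 4), round(ssr + sse, 4))
```

The Kemeny-metric version replaces the Euclidean inner products with the Kemeny distance terms, while the structure of the decomposition is unchanged.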
We demonstrate that the one-way ANOVA decompositions under DISCO (Rizzo & Székely, 2010) and Kloke, McKean, and Rashid (2009) are both outperformed by the Kemeny metric, and that our method is further less restrictive in the coefficient space in that it allows interactions. We explore this by repeated jackknife resampling upon the Warpbreaks data set (R Core Team, 2020) for an ANOVA-based decomposition of breakage counts per tension (a three-level factor: L, M, and H) and wool type (binomial levels, A or B), treating tension as both an ordered factor and a multinomial distribution. The dependent measure in the data set is a total count of warp breaks per standardised length of yarn, well represented by a Poisson distribution. The results of these exploratory analyses, treating the outcome as Gaussian, are provided in Table 4, demonstrating bootstrapped consistency and relative stability of all estimates, while providing both omnibus and local Wald tests for each effect under the assumption of a weakly-orderable measurement data space.

More interesting, though, is the recognition that, while the tension variable is an ordered magnitude (Low, Medium, High) which must be categorically coded under the Frobenius norm, the Kemeny norm allows the ordinal nature of the data to be assessed as part of the data. In order to compare this, we recoded the Warpbreaks design matrix to treat the tension variable and all interactions as single variables, reducing the necessary model degrees of freedom from 5 to 3; the results are presented in Table 6.
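The contrast between the two codings can be sketched as follows. This is a Python illustration on a toy stand-in for the warpbreaks factors (not the actual R data set): dummy coding of the ordered tension factor requires two columns plus two interaction columns, whereas a single ordinal score column halves those degrees of freedom.

```python
import numpy as np

# Toy stand-in for the warpbreaks factors: tension in {L, M, H}, wool in {A, B}.
tension = np.array(["L", "M", "H"] * 6)
wool = np.array(["A", "B"] * 9)

# Categorical (Frobenius-norm) coding: two dummies for tension, one for wool,
# plus two interaction columns -> 5 model degrees of freedom beyond the intercept.
tM = (tension == "M").astype(float)
tH = (tension == "H").astype(float)
wB = (wool == "B").astype(float)
X_cat = np.column_stack([tM, tH, wB, wB * tM, wB * tH])

# Ordinal coding: tension as a single ordered score, reducing the model
# degrees of freedom from 5 to 3.
t_ord = np.array([{"L": 0, "M": 1, "H": 2}[t] for t in tension], dtype=float)
X_ord = np.column_stack([t_ord, wB, wB * t_ord])

print(X_cat.shape[1], X_ord.shape[1])  # 5 vs 3 model df
```

The equal spacing of the ordinal scores (0, 1, 2) is an illustrative assumption; the Kemeny norm itself requires only the ordering, not the spacing.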
The interesting result, presented here with the 25,500 bootstrapped replications of the 54 subjects with replacement, was that the conclusions are in fact similar between the OLS and Kemeny approaches: both identified two of the same elements as significant; however, the comparative range and standard deviations for the coefficients of determination and the MSEs are substantially larger for the Frobenius norm.

Provided in Table 5 are the empirical standard deviations of the distribution of the Mean Squares, which show that non-normality presents more widely dispersed ANOVA terms than the proposed Kemeny metric. Interestingly, in all cases of the application of a Wald test, the cross-validated resampling indicates significant non-zero effects which are concordant with the conclusions of the Wald tests under the Kemeny metric space. Comparisons to DISCO analysis (Rizzo & Székely, 2010) provide one-way ANOVA results; however, those tests are shown to be less powerful, for Minkowski adjustment p = .5, demonstrating that the Kemeny metric provides a more robust and consistent alternative to the alternative analyses, for a broader hypothesis space of the models themselves.

Table 4
Effects and interaction upon ρ_K. [Columns: Statistic, Mean, St. Dev., Min, 25%, 75%, Max, σ̃(α); only the (Intercept) row, 0.055 and 0.017, was recovered.]

Table 5
ANOVA Wald test decomposition for ρ_K over 2,500 cross-validations in comparison to two univariate DISCO analyses with p = .5 and traditional ρ_E ANOVA.

Method  Term           df  MS         Min(MS)  25% MS    75% MS     Max(MS)   σ̃(MS)
Kemeny  tension        2   872.13     789.25   856.75    888.25     942.25    90.19
        wool           1   3,437.726  2,699    3,295.8   3,582      4,063     103.79
        tension:wool   2   833.40     756.75   817.50    849.50     909.75    94.69
        Error          48  27.098     2.644    18.67     25.24      36.31     2.67
DISCO   tension        2   127.06     32.40    173.46    318.94     812.00    110.28
        wool           1   134.25     5.69     56.86     187.18     656.73    100.84
        Error(tension) 52  27.275     15.53    24.78     29.71      42.27     3.61
        Error(wool)    53  31.664     18.47    28.45     34.82      46.22     3.56
ANOVA   tension        2   823.40     119.05   1077.43   2083.32    5233.44   785.63
        wool           1   768.45     0.02     299.67    1,097.64   4,883.08  605.23
        tension:wool   2   413.36     0.87     243.59    543.66     1592.54   450.42
        Error          48  105.810    0.81     1.26      1.55       2.18      16.41

Table 6
Empirical comparison of the standard errors upon the ordinal characterisation of the Tension feature.

                  α        se(α)    σ(α)     t        p(t)
(Intercept)       28.1528  1.4780   1.4865   19.0484  0.0000
woolB             -2.8857  1.4780   1.4786   -1.9525  0.0282
tensionM          -5.0127  1.8049   2.0120   -2.7773  0.0038
tensionH          -3.2424  1.0444   0.9212   -3.1046  0.0016
woolB:tensionM    5.2573   1.8049   2.0098   2.9129   0.0027
woolB:tensionH    -0.0000  1.0444   0.9242   -0.0000  0.5000
MSE               106.57            21.52
(Intercept)       33.2425  2.4557   3.7112   13.5371  0.0000
woolB             -1.8232  .9085    1.3234   -2.006   0.0251
tension           -3.7919  1.1441   1.5777   -3.3142  0.0009
woolB:tension     -1.3541  1.6319   0.9925   -0.8298  0.2053
MSE               151.575           18.57

Empirical Demonstrations
We conclude with several empirical data demonstrations, to validate the mathematical fact that the results of several estimators are inconsistent and underpowered in the presence of ties, even with jackknife resampling. In the presence of ties, conventional estimators do not possess the properties of an MLE, which is empirically explored by the application of jackknife resampling and statistical comparison of the results, with the null hypothesis of comparable performance expected to produce no observable differences in the first-order results of the coefficients. The second contention, that the application of a non-metric topology results in inconsistent (and therefore unrealisable) representations of a population, is explored as well, with the null hypothesis of performance less than or equal to that of conventional methods; under this null, the proposed methods would fail to demonstrate any improvement over current methods for scenarios such as normal distributions. These conjectures, if validated, therefore demonstrate that even non-parametric tests do not provide adequate unbiased understanding outside the population when we introduce tied subjects, a point of presumed necessity for empirical studies in any field. All simulations in this section are performed using R v.4.0.3 (R Core Team, 2020) and custom-written software which is available from the authors.

We next compare the Tukey–Siegel, Kendall τ_b, and Wilcoxon rank-sum performance to the Pearson correlation and the proposed estimator, with respect to the point-estimate differences and empirical variance (power) of each estimator.
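The jackknife comparison of estimators on tied data can be sketched as follows. This Python illustration uses simulated, heavily tied ordinal scores and compares only the Pearson and (average-rank) Spearman estimators; it is not the paper's R software, and the Kemeny estimator itself is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated, heavily tied ordinal scores (hypothetical 5-point sum scores).
x = rng.integers(0, 5, 60).astype(float)
y = np.clip(x + rng.integers(-1, 2, 60), 0, 4).astype(float)

def avg_rank(a):
    # Average ranks: tied values receive the mean of their sorted positions.
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and a[order[j + 1]] == a[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

pearson = lambda a, b: np.corrcoef(a, b)[0, 1]
spearman = lambda a, b: pearson(avg_rank(a), avg_rank(b))

def jackknife(est, x, y):
    # Leave-one-out resampling: point estimate and empirical variability.
    vals = [est(np.delete(x, i), np.delete(y, i)) for i in range(len(x))]
    return float(np.mean(vals)), float(np.std(vals, ddof=1))

for name, est in [("pearson", pearson), ("spearman", spearman)]:
    mean, sd = jackknife(est, x, y)
    print(f"{name}: {mean:.3f} (jackknife sd {sd:.4f})")
```

Comparing the jackknife means and standard deviations across estimators is the first-order diagnostic described above; divergence between them on tied data is the behaviour the text attributes to the non-metric rank interpretation.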
We rely upon the natural relationship between a distance measure and a similarity measure as mutual reflections about an arbitrary, but appropriate, point of origin, and thus estimations of similar and comparable bivariate relationships. This is a valid assertion, as otherwise the data must be multinomially distributed, resulting in a substantially divergent perspective as to the nature of the data, for which all data-analytic methods are invalid without a priori data processing in the form of binomial coding, and would therefore be expected to perform poorly by definition.

Minimum variability of rank-tests with ties

Table 7
Empirical distribution of relations estimated upon both jackknife and 7,500 times repeated resampling (CV).
[Columns: Statistic, Mean, St. Dev., Min, 25%, 75%, Max. Rows: ρ_K(x, x), r(x, x), ρ(x, x), τ(x, x), and r − r_K(x, x) (mean 0.042, st. dev. 0.079) for the first pairing, with corresponding rows for the second pairing (r: mean -0.134, st. dev. 0.187); the remaining values were not recovered.]

A second brief demonstration, with a 14-observation bivariate data set, is presented; these data may be equivalently
viewed as an unpaired Wald t-test, an OLS regression, a point-biserial correlation, a rank-sum test of differences between groups, or any other functional linear basis analytic technique; asymptotically, these are all consistent under Wilks' theorem. In effect, each proposes a test of the shift between two groups' measurements for a one-unit change in group membership. Here, we are solely interested in the distributional properties of the various estimators under the conceptual meaning of the unbiased maximum likelihood estimator. We expect that a correctly identified estimator with an appropriately chosen error distribution will possess a global representation of the bivariate relationship, with the point of minimum error in turn being the most informative (smallest standard error) estimate as well. Under these two definitions of the maximum likelihood estimator, the point is made that solving for the minimum error of approximation must best approximate the population value. We demonstrate that this is not true for any estimator which is not bivariate normal, even with standard 'non-parametric' testing procedures, due to the non-metric nature of the rank interpretation with ties. Further, not only does the Kemeny correlation coefficient most accurately approximate the population true value, but it also appears to do so with the expected smallest distribution of scores, thereby demonstrating that this estimator, for any homogeneous score distribution, stochastically dominates not only the Pearson-based estimators (for scores which are approximately normally distributed) upon non-Gaussian spaces but also the bivariate non-parametric tests as well. This provides support for our conjecture that, in the presence of ties, even 'tie-resolving' estimators for both Spearman ρ and Kendall τ_b are biased estimators of the population relationship as tested within a Null Hypothesis framework.

Real world VHA dataset
A data set from a Veterans Affairs Office of Research and Development study approved by the VA Central Institutional Review Board (CIRB) is utilised to demonstrate both finite-sample and bootstrap comparative performance upon an endogenous ordinal summative score in evaluating randomised treatment outcomes for m = 594 Major Depressive Disorder (MDD) complete-case patients. All participants provided written informed consent and privacy authorisation. These data were previously reported in Zisook et al. (2019), where the non-normality of the data was addressed with a Cox survival model of time to remission, remission being defined as a quantised decrease of at least a fixed reduction in depression score. Here we instead provide a method of quantitative assessment of the reduction in overall depressive impairment over time for the three treatment strata, which is compared to an equivalent model estimated under OLS. This allows explicit shifts in the quantity of depressive impairment to be assessed as a continuous measure, without parametric assumptions beyond the existence of a suitable cumulative distribution function. However, the non-normality of the response variable is unresolved both by definition (a finite set of responses demonstrates only weakly consistent normality in the population) and in the overall ECDF. As previously reviewed, traditional alternative techniques explicitly preclude the estimation of substantive parameters of explicit interest, such as interactions, resulting in model-selection choices favouring the general linear model in spite of its inappropriateness. We demonstrate here how these conclusions are inconsistent with the data-learning process, resulting in both false positives and false negatives, for both the complete-cases sample and a bootstrapped resampling with replacement of 15,500 samples of 594 subjects each.

The participants were VHA patients aged 18 years or older who were both diagnosed with and treated for MDD.
Diagnostic eligibility was supplemented with the 9-Item Patient Health Questionnaire, an ordered scale with support [0, [...] difference of 14.83. Controlling for ancillary baseline covariates, it was desired to assess the change in score at week twelve upon the EuroQOL, with explicit focus upon the interaction of social support (a categorical four-level variable) and treatment arm. The hypothesis was held that greater availability of social support would present differential changes in treatment efficacy, and further that these findings would not be found under the traditional use of an OLS regression. The results of the model are presented in Table 8, comparing the two techniques. It should be noted that if the normality assumption were met, all statistics would be expected to replicate with greater stability for the OLS model, and would present with smaller standard errors than those found under the methods proposed in this paper. However, as seen from the disjoint findings between the cross-validation resamplings performed for both models, while the OLS approach does imply a higher coefficient of determination and therefore better fit, the resampling demonstrates that this value is biased, resulting in a lower estimated MSE compared to the true MSE, and therefore also raises concerns with respect to the partial Wald tests, which may be invalidly over-powered as a result.

As was hypothesised, the order-statistic-based technique demonstrated greater power on the non-normally distributed data, and further, these findings replicated with greater success than the cross-validation performed under OLS regression. It was found that the non-parametric linear basis was a better substantive average of all observations, satisfying the replicability assumptions of the model, unlike the parametric regression (White, 1982). In particular, while the coefficients of determination were twice as large for the OLS model, the variability of said statistic under replication was found to be 13 times greater for OLS than for the Kemeny regression under jackknife resampling. The estimates were conducted upon the correlation matrix for the rank-based method, and thus valid comparisons are found upon the Wald tests, which are proportionally similar between the two models (Krane & McDonald, 1978). These findings would indicate substantive sensitivity to which units remained within the sample, and therefore contraindicate the conditional exchangeability assumption upon which the maximum likelihood generalisability principles are founded. Moreover, from the perspective of experimental replicability, consistency for OLS upon non-normally distributed data would appear to be satisfied only under the weak law of large numbers, and therefore does not present evidence of being a valid maximum likelihood estimator for this data set, as would be theoretically expected for both first- and second-order estimator biases. In addition, attention should be drawn to the coefficients of determination estimated for each of the 15,500 replicated samplings of the models, as reported in Table 8. There it can be seen that while the sampled R² values were in fact always substantially higher (by an average scaling of 173), the variability of the estimators upon the bootstrapped samples was 28 times greater for the Gaussian general linear model than for our Kemeny linear model, thereby betraying the performance which would be expected of a maximum likelihood estimator upon a more informative metric topology, as identified under valid linear Gaussian parametric assumptions.

Table 8
VHA data example, conducted upon 594 subjects, for 15,500 cross-validated resamplings. [The F, R², and σ(R²) values reported for ρ_E and ρ_K were not recovered.]

OLS
                          Coefficients                                         Standard errors
                          mean     sd       min        max       range        mean     sd      min     max      range
(Intercept)               55.0681  4.7306   35.9979    72.6146   36.6167      4.704    0.2188  3.8504  5.5423   1.6919
AGE                       0.0039   0.0729   -0.2661    0.3009    0.567        0.068    0.0029  0.0583  0.082    0.0238
TrtcodeB                  -1.1438  2.5716   -10.9435   8.6931    19.6366      2.5958   0.1298  2.1816  3.1298   0.9481
TrtcodeC                  -0.6499  2.4634   -10.8852   9.4516    20.3368      2.6167   0.132   2.1693  3.1337   0.9644
marital_status1           -2.5109  2.9317   -13.8083   9.161     22.9693      2.9556   0.1561  2.3653  3.8533   1.488
marital_status2           -0.9593  3.781    -18.3705   13.344    31.7145      3.966    0.3353  3.0119  6.4659   3.454
marital_status3           -2.4362  6.3579   -36.5888   22.6965   59.2853      6.5796   1.3334  3.9924  17.8907  13.8983
race1                     0.6867   1.7443   -6.6847    7.0143    13.699       1.7369   0.0779  1.4769  2.1382   0.6613
race2                     1.3162   2.6922   -12.2875   12.2162   24.5037      2.6354   0.1898  2.0273  3.6687   1.6414
education                 0.9011   0.5914   -1.399     3.1646    4.5636       0.6038   0.0242  0.5203  0.72     0.1997
EUROHLTH_BASE             0.2965   0.0376   0.1389     0.4393    0.3005       0.0336   0.0014  0.0287  0.04     0.0114
F20TotalACE               -0.021   0.277    -1.2014    0.9195    2.1208       0.2815   0.0121  0.237   0.3318   0.0948
CIRSscore                 -0.4233  0.1739   -1.0077    0.1992    1.2068       0.1561   0.0066  0.1298  0.1835   0.0537
TrtcodeB:marital_status1  3.3191   3.8553   -10.5904   17.7299   28.3203      3.869    0.1603  3.278   4.7756   1.4975
TrtcodeC:marital_status1  0.0388   4.0232   -15.8494   15.2499   31.0992      3.8563   0.156   3.2334  4.5409   1.3075
TrtcodeB:marital_status2  1.1768   4.9905   -23.1863   24.5952   47.7814      5.6276   0.398   4.5075  7.8708   3.3632
TrtcodeC:marital_status2  2.5672   4.9334   -15.2903   24.2221   39.5124      5.359    0.3419  4.3294  7.5138   3.1844
TrtcodeB:marital_status3  -1.5576  12.6209  -45.8561   45.1172   90.9733      12.6189  2.809   7.4795  25.3048  17.8252
TrtcodeC:marital_status3  4.4462   8.063    -32.2378   39.813    72.0508      9.3286   1.4704  6.1888  20.7778  14.5889

Kemeny
                          Coefficients                                         Standard errors
                          mean     sd       min        max       range        mean     sd      min     max      range
(Intercept)               58.4679  2.8408   47.1501    70.3194   23.1693      1.3868   0.0516  1.1896  1.638    0.4484
AGE                       0.0193   0.0404   -0.1465    0.1763    0.3228       0.0207   0.0008  0.0178  0.0245   0.0067
TrtcodeB                  0.1778   0.9585   -3.7916    3.9957    7.7874       0.8034   0.0298  0.6845  0.9481   0.2635
TrtcodeC                  -0.0677  0.8878   -3.968     3.2428    7.2109       0.8177   0.0305  0.673   0.9653   0.2923
marital_status1           -0.969   0.9431   -4.6262    3.3803    8.0065       0.8653   0.0326  0.742   1.065    0.3230
marital_status2           1.0144   2.5456   -10.7434   11.0517   21.7952      1.125    0.0463  0.9282  1.6365   0.7083
marital_status3           -1.6155  10.4665  -69.0194   57.148    126.1674     1.7262   0.108   1.3734  5.144    3.7706
race1                     0.4943   1.2988   -4.8352    5.6206    10.4557      0.5573   0.0212  0.4533  0.6584   0.2051
race2                     1.8235   3.0921   -11.6497   14.3368   25.9865      0.8725   0.0357  0.6624  1.0319   0.3695
education                 0.771    0.4065   -0.743     2.2454    2.9884       0.1962   0.0076  0.1599  0.2319   0.0719
EUROHLTH_BASE             0.2102   0.0253   0.1185     0.3072    0.1887       0.0101   0.0004  0.0086  0.0118   0.0032
F20TotalACE               0.0593   0.1798   -0.6474    0.7208    1.3682       0.0881   0.0033  0.073   0.104    0.0310
CIRSscore                 -0.2781  0.0955   -0.67      0.0715    0.7415       0.0473   0.0017  0.0398  0.0558   0.0160
TrtcodeB:marital_status1  1.6503   1.9457   -5.3977    8.7065    14.1042      1.1722   0.0428  1.0064  1.3822   0.3758
TrtcodeC:marital_status1  0.8088   1.8886   -7.3827    8.5681    15.9508      1.2083   0.0447  1.0313  1.4265   0.3951
TrtcodeB:marital_status2  -3.1926  6.2124   -41.8194   22.7819   64.6013      1.665    0.0637  1.394   2.1636   0.7697
TrtcodeC:marital_status2  2.2548   5.261    -22.3428   27.3877   49.7305      1.5949   0.0604  1.3289  1.9826   0.6536
TrtcodeB:marital_status3  -3.0289  33.3953  -183.8365  189.0291  372.8656     5.501    0.4002  2.3222  7.2848   4.9626
TrtcodeC:marital_status3  3.2242   19.7106  -125.7341  178.219   303.953      2.5136   0.1344  2.0154  5.5258   3.5105
We suggest the focus of attention be upon the effect TrtcodeB:marital_status2, as this term is explicitly significant and indicative of a finding contradictory to that under OLS assumptions. This demonstrates evidence in favour of the interpretation that married individuals were less depressed under treatment B, and further that marriage in general provided a strongly significant positive effect upon the overall assessment of depression at treatment termination. Of particular note is the difference in MSE between the two models: while OLS does in fact establish higher predictive accuracy within sample, the variability of this point estimate under bootstrapping is nearly four times greater, and while under resampling the median coefficient of determination was nearly halved for OLS, it remained nearly constant for the Kemeny methodology proposed herein. As well, the standard errors are found to perform exactly as expected, being in most cases smaller and producing Wald test statistics which are consistently found to be significant, unlike with the OLS approach. The consistency of these results is empirically demonstrated by comparison of the standard deviations across bootstrapped data sets, which are, again, smaller for the Kemeny metric.

Discussion
This paper introduces both a closed-form and maximum likelihood solution to an endogenous unbiased linear learning problem upon an ordinal, and more generally non-parametric, metric topological manifold, with several demonstrations of performance upon arbitrary data sets. Further, this metric space was shown to be linear and unbiased, possessing MLE properties in the presence and absence of ties, thereby encompassing a much wider breadth of empirical applications in terms of the estimable parameters (Mukherjee, 2016; Chatterjee & Mukherjee, 2019). This allows for the calculation of indeterminate points (e.g., polynomial terms and interactions) for incomplete rankings and sum-of-squares decompositions in a similar, but more comprehensive, manner to the Theil–Kendall–Sen estimator in the model regularity parameter space α ∈ Ω, for which an intuitive and computationally simple deconstruction may be applied for finite samples. Pragmatically, this Kemeny distance estimator has been shown to satisfy the requirements of a linear maximum likelihood estimator, in particular for ordinal data, both in the absence and presence of all non-nominal data, for which we have derived and empirically shown that the local and global model parameters may be assessed under the conventional Wald test framework. Further demonstrated in this text are the information superiority and the reduced parametric assumptions necessary to validate the inductive generalisation beyond the sample to a homogeneous and potentially multivariate population. Explicit comparisons were made for nested models, along with the computational stability in the estimation of a linear and well-posed parameter space for joint coefficients of determination, as well as the ability to examine the marginal distributions of the conditionally independent parameter space.
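For concreteness, a common formulation of the Kemeny (pairwise-sign) distance between two score vectors, which handles ties natively through zero entries of the sign matrix, can be sketched as follows. This Python sketch follows the standard Kemeny–Snell construction; the normalisation convention (each unordered pair counted once) is an assumption and may differ from the paper's.

```python
import numpy as np

def kemeny_distance(a, b):
    """Total disagreement between the pairwise sign (concordance) matrices of
    two score vectors; ties enter as zeros, so no tie-breaking is needed."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    sa = np.sign(a[:, None] - a[None, :])
    sb = np.sign(b[:, None] - b[None, :])
    # The full matrix counts each pair twice; halve to count unordered pairs once.
    return float(np.abs(sa - sb).sum() / 2)

print(kemeny_distance([1, 2, 3], [1, 2, 3]))  # 0.0: identical orderings
print(kemeny_distance([1, 2, 3], [3, 2, 1]))  # 6.0: maximal disagreement on 3 items
print(kemeny_distance([1, 2, 2], [1, 2, 3]))  # 1.0: a tie gives half-disagreement
```

Because the distance depends only on the sign matrices, it is invariant under any strictly monotone transformation of the scores, which is the unit-invariance property the text appeals to.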
Over all of these assertions, we have demonstrated the relative power gains which are expected to hold under the strong law of large numbers, under much weaker parametric conditions than those typically associated with maximum likelihood principles.

Importantly, this embodies an important empirical demonstration of the conjecture of the replication crisis, in that even significant empirical findings upon improper error distributions fail to allow replication to validly be a strongly consistent estimator. That is, the variability of replication efforts is greater than the standard errors typically estimated and reported. Since a finite set of ordinal measurements, where the number of responses is greater than 2, was shown to produce biased estimates under the Gaussian estimator, this questions the widespread validity of Gaussian assumptions upon the Frobenius norm, as is commonly taught at both the undergraduate and graduate levels of the Social Sciences. Such a consideration is extremely important in the context of the near-universal prevalence of, and reliance upon, ordinal measurements in the Social Sciences, and more generally the failure to correctly establish distributions of normal errors upon finite samples under the ℓ-space. This mathematical framework demonstrates that the false assumption of the sufficiency of the Euclidean norm upon data spaces can result in scenarios which are markedly similar to the replication crisis of statistical tests, wherein significant but biased point estimates may be found, for which the variability is drastically underestimated. This follows from the simple fact that, with the loss of MLE properties, the strong convexity and consistency of our estimators upon a finite sample are lost, and may therefore only be abstracted to a description of the population when exhaustive sampling has been enacted.
Since this is infeasible, the choice of maximum likelihood procedures which are appropriate to our data is recommended. Similar arguments have been extended to contexts such as latent variable modelling and Psychometrics, a popular methodological topic. It has been demonstrated in another manuscript under submission that the techniques upon which this process is constructed allow us to remove the multivariate normality assumption while maintaining first- and second-order consistency, thereby presenting an acceptable maximum likelihood estimator for ordinal data which avoids the issues described with the polychoric correlation, and addresses the common failure of goodness-of-fit statistics upon non-normal data. Similar work under submission has been employed to address missing data under the expectation–maximisation algorithm for finite samples upon the (X, ρ_K)-space, to construct a non-parametric formulation of the M-statistic, and to address mixture models under non-parametric feature spaces.

References

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, (1), 17–21. doi: 10.2307/
Theory of linear operations (pp. 23–32). Elsevier. doi: 10.1016/s0924-6509(08)70023-0
Box, G. E. P. (1976, December). Science and statistics. Journal of the American Statistical Association, (356), 791–799. doi: 10.1080/
IEEE Transactions on Information Theory, (6), 3525–3539. doi: 10.1109/tit.2019.2893911
Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), (2), 187–202. doi: 10.1111/j.2517-6161.1972.tb00899.x
Diaconis, P. (1988). Group representations in probability and statistics (Vol. 11). Institute of Mathematical Statistics, Hayward, CA.
Emond, E. J., & Mason, D. W. (2002). A new rank correlation coefficient with application to the consensus ranking problem.
Journal of Multi-Criteria Decision Analysis, (1), 17–28. doi: 10.1002/mcda.313
Fagin, R., Kumar, R., & Sivakumar, D. (2003). Comparing top k lists. SIAM Journal on Discrete Mathematics, (1), 134–160. doi: 10.1137/s0895480102412856
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, (200), 675–701. doi: 10.1080/
IEEE Transactions on Information Theory, (11), 5130–5139. doi: 10.1109/tit.2008.929943
Good, I. J. (1975). The number of orderings of n candidates when ties are permitted. Fibonacci Quarterly, (1), 11–18. doi: 10.1080/
What if there were no significance tests? Psychology Press. doi: 10.4324/
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, (3), 293–325. doi: 10.1214/aoms/
Non-parametric statistical methods (Vol. 751). John Wiley & Sons.
James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (Vol. 1, pp. 361–379).
Kemeny, J. G. (1959). Generalized random variables. Pacific Journal of Mathematics, (4), 1179–1189. doi: 10.2140/pjm.1959.9.1179
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 81–93. doi: 10.2307/
Journal of the American Statistical Association, (485), 384–390. doi: 10.1198/jasa.2009.0116
Krane, W. R., & McDonald, R. P. (1978). Scale invariance and the factor analysis of correlation matrices. British Journal of Mathematical and Statistical Psychology, (2), 218–228. doi: 10.1111/j.2044-8317.1978.tb00586.x
Lane, S. M. (1971). Categories for the working mathematician. Springer New York. doi: 10.1007/
Journal of Non-parametric Statistics, (4), 397–405. doi: 10.1080/
Mathematical and Computer Modelling, (7-8), 917–926. doi: 10.1016/j.mcm.2006.09.009
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, (1), 50–60. doi: 10.1214/aoms/
Generalized linear models (Vol. 37). Boca Raton, FL: CRC Press.
Menger, K. (1942). Statistical metrics. Proceedings of the National Academy of Sciences, (12), 535–537. doi: 10.1073/pnas.28.12.535
Mukherjee, S. (2016). Estimation in exponential families on permutations. The Annals of Statistics, (2), 853–875. doi: 10.1214/
Journal of the Royal Statistical Society. Series A (General), (3), 370. doi: 10.2307/
the polychoric correlation coefficient. Psychometrika, (4), 443–460. doi: 10.1007/bf02296207
Owen, A. B. (2001). Empirical likelihood. Boca Raton, FL: CRC Press.
Pearson, K., & Pearson, E. S. (1922). On polychoric coefficients of correlation. Biometrika, (1-2), 127–156. doi: 10.1093/biomet/
Recent advances in statistics (pp. 339–370). Elsevier. doi: 10.1016/b978-0-12-589320-6.50020-4
R Core Team. (2020). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from
Redner, R. (1981, January). Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. The Annals of Statistics, (1), 225–228. doi: 10.1214/aos/
The Annals of Applied Statistics, (2), 1034–1055. doi: 10.1214/
The use of matched sampling and regression adjustment in observational studies (Unpublished doctoral dissertation). Harvard University.
Savalei, V. (2011). What to do about zero frequency cells when estimating polychoric correlations. Structural Equation Modeling: A Multidisciplinary Journal, (2), 253–273. doi: 10.1080/
Handbook of analysis and its foundations. Elsevier. doi: 10.1016/b978-0-12-622760-4.x5000-6
Schweizer, B., & Sklar, A. (2005). Probabilistic metric spaces. Mineola, NY: Dover Publications, Inc.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Part I. Information and Control, 1–22. doi: 10.1016/S0019-9958(64)90223-2
Spearman, C. (1906). 'Footrule' for measuring correlation. British Journal of Psychology, (1), 89–108. doi: 10.1111/j.2044-8295.1906.tb00174.x
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, (4), 273–286. doi: 10.1037/h0070288
Vapnik, V. N. (2013). The nature of statistical learning theory (3rd ed.). New York, NY: Springer Science & Business Media.
Wald, A. (1949). Note on the consistency of the maximum likelihood estimate.
The Annals of Mathemat-ical Statistics , (4), 595–601. doi: 10.1214 / aoms / Econometrica , (1), 1. doi:10.2307 / Biometrics Bulletin , (6), 80. doi: 10.2307 / American Journal of Psychiatry , (5), 348–357. doi: 10.1176 / appi.ajp.2018.18091079 Appendix AKullback-Leibler divergence
The Kullback-Leibler divergence is defined for any correctly posed probability distribution, from which follows a monotonically non-increasing primal problem (error minimisation) convergence sequence which is unbiased and converges to the target when the probability distribution is linear and orthonormal. For any measurable space on the reals which is sortable, it then follows that the Kemeny metric is an unbiased estimator, which is linear but less informative than the logarithmic linearisation undergone by the use of the canonical linking function upon the ℓ₂ metric. For conditions in which the family of distributions is not identified, the likelihood function is almost surely maximised in the neighbourhood of Ω as defined upon the sample, as long as the quotient space is defined (Redner, 1981), and is asymptotically normal as long as, for each α with sufficiently small radius r in the concave neighbourhood about the optima, f(·, α, r) is measurable and the following inequalities hold:

∫ₓ log f*(x, α, r) dα* < ∞  and  ∫ₓ log h*(x, s) dθ < ∞,

where α* denotes the true parameter set. If true, then as δ(θᵢ, θₜ) → 0⁺, f(x, θᵢ) → f(x, θₜ), and the condition is thus satisfied for any sub-additive metric topology. Further assuming that

∫ |log f(x, θ)| dμ_θ < ∞,

and therefore finite, then if θᵢ → θ, and in turn f(x, θᵢ) → f(x, θ), for a non-deterministic functional space, then with probability one the likelihood function converges to the true function, which for any linear function is best approximated by its expectation, which minimises the error for all M. Upon any mixture family, which is linear in δ(X, Ω), it follows that both extensions to Wald's theorems (Redner, 1981, p. 226) are satisfied upon the Kemeny metric, and therefore f(X, θ̄ₘ) → f(X, θ̂) with probability 1. From Redner
Theorem 5, it follows that the Kemeny metric, as a compact ultrametric, is strongly consistent for any orderable and independently sampled distribution, including a multivariate normal distribution. Any distribution for which the ℓ₂ metric is complete and compact is also linearly strongly consistent for the Kemeny metric, as the latter is invariant to any monotonic transformation (non-identity canonical link), and thus the properties generalise without any additional restrictions. However, the selection of an appropriate non-linear transformation is subjective and subject to misjudgement due to idiosyncrasies such as cultural modelling norms, non-uniform sampling between the data and the population, and incomplete data observations. Analysing ordinal data as an endogenous measure space, as opposed to continuous data, requires fewer structural assumptions in the endogenous response to introduce maximum likelihood estimation (Friedman, 1937; Wilcoxon, 1945). However, this must be paired with the introduction of more assumptions upon the sampling procedure. This follows under the conventional construction of both Spearman's ρ and Kendall's τₐ, in which ties are explicitly excluded from the space with probability 1, from which it directly follows, by contradiction, that a tie occurs in any sample upon the population with probability 0.
By the pigeon-hole principle, though, ties are almost surely observed for any ordinal measure space, since the number of empirical paired orderings is always less than the number of possible observations in the sample, as is demonstrated by the construction of a polychoric correlation matrix (Olsson, 1979; Pearson & Pearson, 1922). Unfortunately, the polychoric correlation introduces an assumption concerning the latent variable linearity upon the ℓ₂ metric, and requires greater sample sizes with respect to g responses upon an ordinal scale (i.e., g² cells). Each bivariate relation pairing must possess sufficient cell size to just identify each latent variable, requiring at least two parameters; multidimensional latent spaces in turn require more parameters to be estimated, which quickly inflates the minimum sufficient sample size. As a continuous distribution upon a compact metric is discrete, Gibbs' inequality may be directly leveraged for any two orderable distributions P = {p₁, …, pₘ} and Q = {q₁, …, qₘ} for which P, Q ∈ [0,
1], with equality holding only when P = Q for the metric space δ. Since, by our previous assumptions, the sample is defined for a finite support in the field [0, m(m−1)/(1+m²−m)] in the population, the Kullback-Leibler divergence may be constructed:

D_KL(P ‖ Q) = Σᵢ₌₁ᵐ pᵢ log(pᵢ/qᵢ) ≥ 0.

If one were to expand the definition of P to be a function of parameters α ∈ Ω in the parameter universe, then a Bregman divergence naturally follows for F: Ω → ℝ, wherein the observed space is strictly non-negative, D_F(P, Q) ≥ 0, and which is linear with respect to the endogenous empirical process P, for which the expectation is the optimal sufficient statistic (Frigyik, Srivastava, & Gupta, 2008). A further proof of uniqueness for any F-norm, which includes any Banach norm-space, may be constructed from the law of the excluded middle and dependent choice, given a proper metric upon any random partially orderable field. Thus, the property of dependent choice holds upon it, unlike the sub-graph formed by the traditional rank methods of the Sₘ-space (Diaconis, 1988). A normed space X (which includes Banach spaces, F-spaces, and G-spaces) defined with the topology (X, ρ_K) is provably both linear and continuous with finite dimensionality n upon the stochastic space (ε, αₙ). It is therefore a valid dream space, which is explicitly uniquely defined with an inner-product space (Lane, 1971). Let X be complete and ρ_K be compact, as previously established. It then follows that if Y is any topological vector space and f: X → Y is any linear operator, then f is continuous. It also follows that should Ω be an open convex subset of X, Y a locally full space, and f: Ω → Y a convex operator, then f is continuous. By Schechter (1997, p. 751), any two complete F-norms on a vector space are topologically equivalent; this is proven as the identity mapping X → X is a linear operator.
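Gibbs' inequality above admits a direct numerical check. A minimal sketch for strictly positive discrete distributions follows; the vectors p and q are illustrative only, not drawn from the paper:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), for two strictly
    positive discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Gibbs' inequality: D_KL >= 0, with equality only when P = Q.
assert kl_divergence(p, p) == 0.0
assert kl_divergence(p, q) > 0.0
```

The divergence is not symmetric (D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) in general), which is consistent with its role here as a divergence rather than a metric.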
This proves the uniqueness of the maximum likelihood estimator upon the Kemeny metric, which is also a proper MLE for any partially orderable distribution (Schechter, 1997), as also defined for the linear function space we have established. Further, since the Gaussian nature of the probability space is always monotonically non-decreasing, the Hessian of the likelihood function is always non-negative, and it therefore follows that the likelihood function attains a linear local maximum for any identically and independently distributed error function which is partially orderable.

Appendix B
Positive semi-definiteness of the Kemeny correlation matrix
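Before the formal derivation below, the claim of this appendix admits a quick numerical illustration. The sketch substitutes a τ-like pairwise sign concordance for the paper's ρ_K (equations 13, 19, and 20 lie outside this excerpt), and verifies the defining property x′Ξx ≥ 0 on random vectors:

```python
import random

def sign(x):
    return (x > 0) - (x < 0)

def kemeny_correlation(a, b):
    """Normalised pairwise sign concordance between two score vectors
    (a tau-like correlation in which a tied pair contributes zero)."""
    m = len(a)
    num = sum(sign(a[i] - a[j]) * sign(b[i] - b[j])
              for i in range(m) for j in range(m))
    return num / (m * (m - 1))

def quadratic_form(M, x):
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))

random.seed(1)
# 4 continuous variables observed on 20 subjects (no ties, a.s.).
data = [[random.gauss(0, 1) for _ in range(20)] for _ in range(4)]
Xi = [[kemeny_correlation(u, v) for v in data] for u in data]

# x' Xi x >= 0 for arbitrary x: empirical positive semi-definiteness.
for _ in range(100):
    x = [random.gauss(0, 1) for _ in range(4)]
    assert quadratic_form(Xi, x) >= -1e-12
```

The non-negativity is structural rather than accidental: each matrix entry is an inner product of the two variables' pairwise sign vectors, so Ξ is a (scaled) Gram matrix.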
For m observations upon n measures, the matrix Ξ may be constructed as a square matrix of order m × m or n × n, summarising pairwise similarity over all subjects or variables, respectively. A square real matrix A is positive semi-definite (p.s.d.) when, for any m × 1 vector x, x′Ax ≥ 0. If A and B are p.s.d. upon the space (X, ρ_K), then by the established additive and multiplicative properties of the Kemeny metric, so is A + B. We construct the following bounded equivalence: x′(A + B)x = x′Ax + x′Bx ≥
0; from this follows the general statement that the sum of any p.s.d. matrices is itself positive semi-definite. For a sample of vectors xᵢ = (xᵢ₁, …, xᵢₙ)⊤, with i = 1, …, m, the sample median ν̂ is estimated as per equation 19 and the sample correlation matrix Ξ as given for each bivariate pair in equation 13 (with the covariance scaling following by the use of equation 20). For the non-zero vector ξ ∈ ℝⁿ, it follows that each vector is non-constant, and therefore has positive variance, from which we use equation 20:

ξ⊤ ρ_K(xₙ, xₙ) ξ ∝ ξ⊤ Ξ ξ > 0.

Allow zᵢ to be defined as zᵢ = (xᵢ − ν̂(x)), for i = 1, …, m. Any non-zero x ∈ ℝⁿ is therefore equal to zero if and only if ρ_K(xᵢ⊤, Iₘ) =
0, for each i = 1, …, n. Upon the set {z₁, …, zₙ} spanning ℝⁿ, there exist real numbers β₁, …, βₘ such that z = β₁x₁ + … + βₘxₘ, and we also possess z⊤x = α₁z₁⊤x + ··· + αₙzₙ⊤x =
0, which induces a contradiction. It therefore follows that if the span of any random sampling distribution upon zᵢ spans ℝⁿ, then Ξ is positive definite. Positive semi-definiteness may then be established from zΞₙz′ ≥ 0 for all z. As Ξₙ is 1/m times the distance ρ_K(vᵢ, Iₘ) · ρ_K(vᵢ, Iₘ), the squared length of the vector zvᵢ is m, and as m > 0 ∈ ℤ and a sum of squares is strictly non-negative, zΞₙz′ ≥
0, and thus when vᵢ spans the continuous field, z ≠ 0 and Ξₙ is positive definite, and therefore any Ξ may be inverted for the marginalised sampling distribution with respect to either m or n.

Appendix C
Proof of the probability measure for the Kemeny metric
Consider a measure space (Ω, F, P), where P(x ∈ X) denotes a measure X upon which x occurs with probability P(x), and for which P(Ω) =
1, allowing us to define said measure space as a probability space, with sample space Ω, event space F, and probability measure P. The first Kolmogorov axiom has already been proven, wherein each observed event occurs with positive probability for each element in the event space, as seen in equation 9. Concurrently, the finite and compact nature of the Kemeny measure space ensures that P(x) is also finite, with bounded support P(−∞) = 0 ≤ P(x) ≤ P(∞) = 1. The second axiom, that the probability that at least one of the elementary events in the entire sample space will occur is 1, and therefore that Ω is complete, follows from the finite and cumulative nature of the Stirling numbers with respect to the Kemeny metric, from which we may define the space |H| = I as a partial ranking, which therefore must occur with probability 1. If not, then the identity permutation is not a valid origin for the sample space, and therefore the population is non-uniquely identified (as neither the identity permutation nor its inverse is a validly measurable event). Finally, σ-additivity must be validly present to ensure that a given event occurs with unique probability, and is therefore a function. In equation 9, it is shown that the graph of the metric space is connected, and therefore that the Kemeny metric space is a function, for which each occurrence is uniquely mapped onto a singular distance with respect to the arbitrary origin π for the entire space xₘ. As such, any element in the sample space must be measured upon the probability field, which for equation 18 contains all reals.
Second, since the cumulative distribution is complete and must integrate to 1 over all disjoint events measured upon the Kemeny metric space, all disjoint subsets must also sum to 1, as shown in equation 9, which, together with the finite countability of the symmetric group Hₘ, ensures that the probability measure is always observed with probability 1, thereby satisfying the equality μ(⋃ₘ₌₁^∞ Aₘ) = Σₘ₌₁^∞ μ(Aₘ), to ensure σ-additivity.
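The σ-additivity argument can be illustrated on a small symmetric group. The following sketch assumes the Kemeny–Snell pairwise sign-disagreement distance as a stand-in for the paper's metric (equations 9 and 18 lie outside this excerpt), and partitions S₄ into disjoint distance classes about the identity permutation, whose probabilities then sum to 1:

```python
from itertools import permutations, combinations
from collections import Counter

def sign(x):
    return (x > 0) - (x < 0)

def kemeny_distance(a, b):
    """Pairwise sign-disagreement distance between two rankings."""
    return sum(abs(sign(a[i] - a[j]) - sign(b[i] - b[j]))
               for i, j in combinations(range(len(a)), 2))

m = 4
identity = tuple(range(1, m + 1))

# Disjoint events: the distance classes about the identity origin.
dist = Counter(kemeny_distance(identity, p) for p in permutations(identity))
total = sum(dist.values())               # |S_4| = 4! = 24
pmf = {d: c / total for d, c in dist.items()}

# First axiom: every event has non-negative probability.
assert all(v >= 0 for v in pmf.values())
# Second axiom / additivity over the disjoint classes: they sum to 1.
assert abs(sum(pmf.values()) - 1.0) < 1e-12
```

Only the identity itself sits at distance 0, and the full reversal attains the maximum distance 2·C(4,2) = 12, so the classes exhaust the group and are pairwise disjoint, which is the finite form of the σ-additivity equality above.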