An Intrinsic Treatment of Stochastic Linear Regression
Yu-Lin Chou∗

Abstract
Linear regression is perhaps one of the most popular statistical concepts, permeating almost every scientific field of study. Due to the technical simplicity and wide applicability of linear regression, attention is almost always quickly directed to the algorithmic or computational side of linear regression. In particular, the underlying mathematics of stochastic linear regression itself as an entity usually receives either a peripheral treatment or a relatively in-depth but ad hoc treatment, depending on the type of problems concerned; in other words, compared to the extensiveness of the study of mathematical properties of the "derivatives" of stochastic linear regression, such as the least squares estimator, the mathematics of stochastic linear regression itself seems not yet to have received a due intrinsic treatment. Apart from the conceptual importance, one consequence of an insufficient or possibly inaccurate understanding of stochastic linear regression is that the role of stochastic linear regression in the important (and more sophisticated) context of structural equation modeling recurrently gets misperceived or taught in a misleading way. We believe this pity is rectifiable once the fundamental concepts are correctly classified. Accompanied by some illustrative, distinguishing examples and counterexamples, we intend to pave out the mathematical framework for stochastic linear regression, in a rigorous but non-technical way, by giving new results and pasting together several fundamental known results that are, we believe, both enlightening and conceptually useful, and that have not yet been systematically documented in the related literature. As a minor contribution, the way we arrange the fundamental known results would be the first attempt in the related literature.
Keywords: concept classification; conditional expectation; counterexamples in statistics; orthogonal projection; stochastic linear regression
MSC 2020:
We are attempting to correctly classify the concept of stochastic linear regression and several ubiquitous related concepts, and are much less concerned with problems of practical interest regarding stochastic linear regression. Figuratively, we wish to "embed" the concept of stochastic linear regression in mathematics, in particular in probability theory.

∗ Yu-Lin Chou, Institute of Statistics, National Tsing Hua University, Hsinchu 30013, Taiwan, R.O.C.; Email: [email protected]. The author wishes to express gratitude for the comments received for the first version.

"· · · are not fit for heaven, but on earth they are most useful. · · · . 'T is the same with mules, horses, · · ·" from the great poet Mary A. Evans (George Eliot) [2].

The indicated purpose is not as exotic as it sounds once we see that it is simply a natural part of the development of a mathematical theory; and statistics, as a branch of mathematics, has since Fisher's modern initiation already implicitly moved towards obtaining a unified embedding in the sense that every statistical object is defined as some mathematical object. We invite the reader to think, for reference, about the definition of a random sample, of an estimator, of a test, or of a random field, although the last one is more of a probabilistic flavor. A natural, logical conclusion drawn from this phenomenon is that modern statistics tends to embed its concerned concepts in mathematics. We remark in passing that this embedding property is a prerogative of statistics, not shared by engineering fields or even physics. For an engineering field, the reason is evident: mathematics plays in the field a role that facilitates modeling work. As to physics, although to a great extent it may be embedded in mathematics (e.g. the concept of spacetime), there are many concepts in physics that may not be reasonably taken as a mathematical object (e.g. the concept of mass or of mutual interaction).

On the other hand, for a knowledge system to qualify as a science, a necessary condition is for the system to be developed towards internal unification. We might mention and re-appreciate Euclid's genius — the invention of axiomatics — which is acknowledged as the first complete attempt of the human mind to logically rearrange the then scattered "mathematical facts" in such a way that a sane person may reason out for herself the known or unknown mathematical facts under a given set of few pre-defined rules. Mathematics is then well-eligible for being a science in various senses; the Pythagorean theorem, or any mathematical theorem in general, has since been not just an interesting recurrent phenomenon, nor just a useful trick for engineering purposes, nor just a wise opinion from esteemed scholars, nor some truth that is unfathomable to the civilians and only "owned" by the rich and powerful. Euclid's invention thus demystifies a significant aspect of the nature of mathematics, and renders mathematical activities independent, essentially not exclusively belonging to any social class. Further, we believe, even a person who does not work in any particular scientific field would expect a knowledge system, if generally accepted as a science, to be much more than just a cookbook or a collection of "very useful methods" whose deeper connections are left unorganized.
This non-philosophy — satisfaction with a collection of very useful methods without caring to seek after the deeper connections — seems to be a prerogative of the business, industrial world; after all, by nature they seek profit (no moral judgement is implied), and hence a sense of immediate satisfaction. However, if we acknowledge that a science is supposed to seek truth, then it would be unjustified to stay at the satisfaction level of the business-industrial activities.

This tendency — for a statistical object to be defined as a mathematical object — is also an enlightened, edified movement, as, in terms of mathematics, we can save ourselves from spending energy on the philosophical or semantical queries into what we are really talking about by focusing on the functional properties of the concepts of our concern. For instance, rather than arguing what a random variable really is and then defining it, we define a random variable by requiring what a random variable should do, or, equivalently, by requiring what we can do with a random variable. Evidently, this "epistemological" approach is recurrent in mathematics and a signature therein, and may be fairly referred to as a mathematical approach. The result is more than beneficial; as well-known and well-received, it turns out that the well-established mathematical objects — (probability) measure and measurable function — may be used to define a random variable, and so the statistical object — random variable — is in this sense embedded in mathematics. It then follows that the fundamental statistical object — random sample — also becomes a mathematical object. We also wish to point out another more than familiar event of mathematization by embedding a long-time vague object in mathematics: Kolmogorov's measure-theoretic treatment of probability. The concept of probability had long been arguably a controversial object; but Kolmogorov's mathematical astuteness led him to recognize that the mathematical object — measure — just serves the purpose of delineating what probability should do, and the nice ramifications of this embedding of Kolmogorov's are stunning.

Another kind of benefit obtainable from establishing a suitable embedding concerns conceptual coherency and clarity. From a panoramic view, it is evidently desirable for the theory of any mathematical science to admit as few ambiguities as possible, so that, for example, the understanding of any concept thereof does not depend on the interpretation of any individual therein, which in turn ensures the efficiency and quality of scientific communications.

Although most statistical concepts are defined as some mathematical concept, the important statistical concept — stochastic linear regression — is an exception.
When it comes to stochastic linear regression, the customary treatment seems to be ad hoc, depending on the problem at hand; for instance, sometimes stochastic linear regression is associated with conditional expectation, sometimes it is associated with orthogonal projection, sometimes it is nearly taken to be an arbitrary "linear model", sometimes it is associated with algorithms such as least squares (and hence treated as a technique), and sometimes it is left tacitly understood as a string of symbols representing "the familiar form requiring no further elaboration".

And, usually, in teaching materials the particular aspects of stochastic linear regression are stressed without a caveat or a further elaboration towards a fuller, more complete picture; and none of the partial descriptions establishes stochastic linear regression precisely as a mathematical object in a reasonable way. Besides, these partial descriptions of stochastic linear regression, each of which captures a component of the concept of stochastic linear regression, are independent in the (weak) sense that no two of them are equivalent. It is not difficult to write down a justification for this observation. Among the partial descriptions, a less evident non-equivalence would be that between affine conditional expectation and linear orthogonal projection, the latter being equivalent to the uncorrelatedness between error term and regressor(s) under "very" mild, reasonable assumptions. We will prove this particular non-equivalence later on. It seems that an example of this non-equivalence, apparently heuristically enlightening, rarely appears in the related literature.

Thus the term "stochastic linear regression" seems to be just a placeholder such that, depending on the problem at hand, it could mean different things; the most significant possible meanings of "stochastic linear regression" are as listed above. It is clear that, to embed the concept of stochastic linear regression in mathematics, we cannot rely on the last three of the partial descriptions — stochastic linear regression as a "model", as a technique, and as a string of symbols or "equations"; the first two themselves are not mathematical objects, and the last one is "morally" a mathematical object but out of context. That an equation interpretation is not suitable for describing stochastic linear regression may be seen as follows. An equation in mathematics is taken as a predicate, which is a well-established, clear concept in logic, and to solve an equation means to find some element of a given set such that the predicate is true of the element. For example, a heat equation "∂_t u = ∂_x² u" (considered on a suitable subset of R²) is, according to the convention, precisely a predicate of this kind. In stochastic linear regression, by contrast, one is not solving for "the true β" directly from the given equations; instead, it is solving for "the optimal 'β'" from a moment condition derived from the given equations that is of concern. Moreover, embedding stochastic linear regression in mathematics in terms of moment equations is not advisable, as we are then led back to the non-equivalence between affine conditional expectation and linear orthogonal projection.

Although affine conditional expectation and linear orthogonal projection are not equivalent, and neither of them alone may fully equate to the concept of stochastic linear regression, for a comparison we might add that they both are mathematical objects.
If (Ω, F, P) is a probability space, if Y is an L¹ random variable on Ω, and if X is a random variable on Ω, then, since the P-indefinite integral (∫ Y dP)|σ(X) of Y restricted to the sigma-algebra σ(X) ⊂ F generated by X is absolutely continuous with respect to P|σ(X), the (P|σ(X)-essential) Radon–Nikodym derivative of the measure (∫ Y dP)|σ(X) with respect to P|σ(X) exists, and, upon identifying two P|σ(X)-almost surely equal σ(X)-measurable random variables Ω → R with each other, one may define the conditional expectation of Y given X as the thus obtained Radon–Nikodym derivative D_{P|σ(X)}((∫ Y dP)|σ(X)). If E(Y ∥ X) is not essentially constant, and if the Doob–Dynkin function, or the so-called regression function, of Y given X — a function R → R whose composition with X is the conditional expectation of Y given X — is affine, then we obtain an example of affine conditional expectation. On the other hand, if X, Y ∈ L²(P), and if X is not essentially zero, then the linear orthogonal projection of Y given X is precisely the random variable βX with β ∈ R being the solution of the equation E[X(Y − Xb)] = 0.
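For concreteness, the projection coefficient admits a closed form: since b ↦ E[X(Y − Xb)] = E[XY] − b E[X²] is affine in b, the equation E[X(Y − Xb)] = 0 has the unique solution

β = E[XY] / E[X²],

which is well-defined because X is not essentially zero (so that E[X²] > 0); this one-line computation is worth keeping in mind for the developments below.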
We believe the unsatisfactory or improvable status quo of the concept of stochastic linear regression is rectifiable. Inspecting the various special senses attached to stochastic linear regression, we have noticed that the "definition" of stochastic linear regression seems to depend on the context under consideration and hence on the concerned problem. This suggests that there seems to be no intrinsic treatment of the important concept of stochastic linear regression, which is a pity. Our usage of "intrinsic" coincides with the usual usage in mathematics (and even with philosophy such as the field of epistemology; e.g. Lewis [6]) in a broad sense, and seeking intrinsic properties certainly gains insight into the objects of interest, and hence is itself intrinsically interesting.

For elementary examples in mathematics: in (linear) algebra the dimension of a vector space, as any two bases of the space have the same cardinality, is an intrinsic property of the vector space itself in the sense that the dimension of the vector space does not depend on the choice of bases; in analysis, an L^p-metric may be made well-defined (from being a pseudo-metric to being a metric) in terms of the L^p-norm and the equivalence classes of L^p functions (with respect to the equivalence relation of almost-everywhere equality) by noticing that the resulting metric is intrinsic in the sense that it does not depend on the choice of the representatives; and in geometry, the dimension of a (topological) manifold, as a Euclidean space is homeomorphic to a Euclidean space precisely when their dimensions agree, is an intrinsic property of the manifold in the sense that the dimension of the manifold does not depend on the choice of its atlases.

Without implying any "indoctrination", we intend to suggest a reasonable, intrinsic look at the concept of stochastic linear regression, with the hope that the aforementioned issues may begin to be settled in a satisfactory, unified way, and without claiming a supreme generality encompassing all known types of "regression". At any rate, our framework is general enough to cover the interesting cases so as to be conceptually enlightening, and is at the same time sufficiently special to be tractable and informative without loss of practical meaningfulness.

The generic idea and the nice consequents of the proposed intrinsic treatment of stochastic linear regression may be sketched as follows. By delineating requirements of suitable strength for stochastic linear regression, with the requirements kept as few as possible such that both the problem of "parameter learnability" and that of the quality of estimation are satisfactorily taken care of in theory, we leverage the fact that every L² space is an inner product space to develop the concept of stochastic linear regression as a suitable class of probability measures defined on the Borel sigma-algebra of Euclidean subsets. One may then obtain a somewhat unified viewpoint of stochastic linear regression. In between the developments, we will also give new results and discuss informative, simple examples and counterexamples to clear up some mythologies pertaining to stochastic linear regression.

If m ≤ l are natural numbers, if 1 ≤ t₁ < · · · < t_m ≤ l are natural numbers, and if I := {t₁, . . . , t_m}, we will denote by π_I the natural projection (z₁, . . . , z_l) ↦ (z_{t₁}, . . . , z_{t_m}) from R^l onto R^m. Thus π_{I₁} = π_{I₂} when I₁ = I₂; the definition of π_I arranges the elements of I from the smallest one to the greatest one. If I is a singleton, say I = {t}, we will write π_t for π_{{t}} = π_I. The domain of a natural projection π_I will always be uniquely determined by context. Both the notation for natural projection and the terminology for π_I follow Billingsley [1].

If l ∈ N, we denote by B_{R^l} the Borel sigma-algebra generated by the usual topology of the Euclidean space R^l.

If (Ω, F) is a measurable space, we denote by Π(F) the collection of all probability measures defined on F. Thus Π(B_{R^l}) is for every l ∈ N the collection of all probability measures defined on B_{R^l}. If z ∈ R^l, the symbol D_z denotes the Dirac measure (degenerate distribution) B_{R^l} → {0, 1}, B ↦ 1_B(z), "concentrated on" {z}; so D_z ∈ Π(B_{R^l}) for all z ∈ R^l.

If Ω₁, Ω₂ are arbitrary sets, and if f : Ω₁ → Ω₂, we will write f⁻¹ for the pre-image map 2^{Ω₂} → 2^{Ω₁},
A ↦ {x ∈ Ω₁ | f(x) ∈ A}, induced by f.

If (Ω, F, P) is a probability space, and if Z : Ω → R is a random variable, we will frequently denote by P_Z the induced probability measure of P by Z, i.e. P_Z ≡ P ∘ Z⁻¹. Thus P_Z is the (probability) distribution of Z. If P is the distribution of Z, we may sometimes write Z ∼ P. If P is not the distribution of Z, we sometimes write Z ≁ P.

The symbol R₊ denotes the set {x ∈ R | x ≥ 0}; and R₊₊ denotes {x ∈ R | x > 0}.

If (Ω, F, M) is a measure space, and if A₀ ∈ F, we denote by M⌉A₀ the measure A ↦ M(A ∩ A₀), F → R₊ ∪ {+∞}. The measure M is said to be concentrated on A₀ if and only if M = M⌉A₀ on F. This justifies the statement that a Dirac measure D_z is concentrated on {z} for every suitable z. Our use of "concentrated on" has its roots in Rudin [7]; and the corresponding notation is adapted from Federer [3]. Since M⌉A₀ = (M⌉A₀)⌉A₀, the measure M⌉A₀ is concentrated on A₀. A random variable whose distribution is concentrated on a Borel subset B of R will also be said to be concentrated on B. Thus a random variable not concentrated on any singleton subset of R is precisely a non-degenerate random variable. A random variable not concentrated on a singleton {x} ⊂ R will also be said to be not essentially x. We remark: If X is a random variable, and if P is the distribution of X, then P is not concentrated on a singleton {x} of R if and only if P({x}) < 1, which holds if and only if P ≠ D_x, which holds if and only if X ≁ D_x.
The notation "⌉A" is also applied to collections of sets. If F is a sigma-algebra of subsets of Ω, the symbol F⌉A denotes the relative sigma-algebra {A′ ∩ A | A′ ∈ F} of A.

If l ∈ N, and if a context is in the presence of a matrix operation such as transposition, then an l-tuple is to be taken as an l × 1 matrix. If x, y ∈ R^l, then x⊤y is the sum of the products of the ith components of x and y, and xy⊤ is the matrix whose (i, j)-entry is x_i y_j.

We will identify two random variables (on the same probability space), if equal almost surely, with each other.

As there are different terminologies employed to address a measurable function having finite integral, ranging over {"exist", "integrable", "summable"}, we will often refer to a random variable having finite integral (finite mean) as an L¹ random variable. For a random variable having finite L^p-norm (finite p-th [raw] moment) with 1 ≤ p ≤ +∞, the same rule applies. The underlying probability space may depend on, but can be determined in terms of, the context.

If Ω is a probability space, if Y is an L¹ random variable on Ω, and if X is a random vector on Ω with l components (including the case where l = 1), we will write E(Y ∥ X) := E(Y ∥ σ(X)); and we will denote by E(Y | X) : R^l → R the corresponding Doob–Dynkin (regression) function f : R^l → R such that E(Y ∥ X) = f ∘ X on Ω. Thus E(Y ∥ X) = E(Y | X) ∘ X on Ω. The domain of the function E(Y ∥ X) is Ω, which is not necessarily R^l; but the domain of E(Y | X) is R^l. Although in general we will refer to E(Y ∥ X) as the conditional expectation (of Y given X) and to E(Y | X) as the regression function (of Y given X), sometimes we will also refer to a regression function as a conditional expectation. But this mixed usage will not cause any confusion. Moreover, for what it is worth, the random variable Y is said to be mean independent of X if and only if E(Y ∥ X) = E Y. Thus, if Y is centered, i.e. if E Y = 0, then for Y to be mean independent of X means E(Y ∥ X) = 0. The expectation of a random vector or a matrix of random variables is always understood componentwise. For example, if X is a random vector with each component X_j being L¹, then E X denotes the real vector (E X₁, . . . , E X_k) ∈ R^k.

For our purposes, we will not take a specific definition of linear orthogonal projection; we deliberately let context and our arguments jointly determine its role in stochastic linear regression. For the less experienced, our doing so helps build the idea of the underlying mathematical structure of stochastic linear regression; for the experienced, our doing so will hardly cause any confusion and, hopefully, will clarify some less noticed aspects of stochastic linear regression. We will not always use the modifier "linear" when referring to linear orthogonal projection; but, as usual, context matters. The same considerations apply to the term "coefficient of linear orthogonal projection".

We will denote by K the covariance operator. If Y is a random element of R^q with each component being L², and if X is a random element of R^l with each component being L², then K(X, Y) := E[(X − E X)(Y − E Y)⊤].

Since we intend to connect the known results together whenever suitable, we will use "Fact", instead of the usual "Theorem" or "Proposition", to state known results.
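As a small illustration of the projection notation for later use: if l = 3 and I = {1, 3}, then π_I : R³ → R² is given by π_I(z₁, z₂, z₃) = (z₁, z₃); and if P ∈ Π(B_{R³}), then P_{π₁} is the distribution of the first coordinate under P, while P_{π_{{2,3}}} is the joint distribution of the last two coordinates. These are exactly the marginals that will carry the moment conditions below.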
Regardless of the context where one speaks of stochastic linear regression, the common fundamental material, although usually off-stage, is an unknown distribution P ∈ Π(B_{R^{k+1}}). Here k ∈ N is given by the problem under consideration. We say "given" as we are considering the "population" situation prior to a confrontation with data, so that revising the choice of k is beyond the scope. The probability measure P governs the behavior of the k + 1 variates whose statistical relationship interests the researcher. Specifically, the researcher at least wishes to investigate how a random variable Y with distribution P_{π₁}, which represents the dependent variate of interest to her, depends on a linear combination of k random variables X_j, representing the covariates of interest to her, each of which has distribution P_{π_{j+1}} with 1 ≤ j ≤ k. To make sense, the random variables Y, X₁, . . . , X_k certainly have to be defined on the same probability space at the outset; but to assume so is always realistic, and to do so is always mathematically possible. For our purposes, taking Y to be π₁ and each X_j to be π_{j+1} is mathematically just fine. And we remark that the possible presence of the constant covariate is readily taken care of by employing the Dirac measure D₁ concentrated on {1}.

Now, to ensure that a meaningful result may be obtained out of observations on the k + 1 variates, the probabilistic behavior of the random variables Y, X₁, . . . , X_k cannot be arbitrary. The researcher then needs to seek conditions under which she can be assured that

i) a "meaningful" linear statistical relationship really exists and is actually "learnable" from data, i.e. some linear statistical relationship between Y and X₁, . . . , X_k exists such that its interpretation makes good sense, and this relationship is uniquely determined by the distribution P of the random variables, so that any suitable transformation of data drawn from P will not approximate the relationship vacuously;

ii) the probability that an estimation of the unique relationship makes the correct decision is well-controlled when the data are sufficiently nice and many.

The first requirement is intrinsic to the random variables Y, X₁, . . . , X_k, independently of the probabilistic mechanism governing how data are generated (e.g. stationarity and ergodicity). The second requirement certainly depends more on the probabilistic mechanism that generates data, and so it is more of a technical consideration to allow probability limit theorems such as laws of large numbers to work. In short, the two requirements are the minimum requirements such that the first one prevents the researcher's study from being an alchemy and the second one is necessitated by asking for a reasonable quality of estimation of the linear statistical relationship.

Although sufficient conditions ensuring the two requirements are well-known, most of the existing conditions are too strong for the two requirements (certainly, the existing sufficient conditions are also intended to take care of other desired properties). For the second requirement, requiring Y and each X_j to be L¹ suffices, which allows of an application of a weak law or even of Kolmogorov's strong law for well-behaved data such as independent identically distributed (i.i.d.) data.

A set of weak sufficient conditions for the first requirement is more interesting. For most of the time, the condition E(Y − X⊤β ∥ X) = 0 (when making sense) is used along with a regularity condition (e.g. orthogonality) on {X₁, . . . , X_k} guaranteeing that there is exactly one β ∈ R^k such that E(Y − X⊤β ∥ X) = 0, so that X⊤β is precisely the linear orthogonal projection of Y given X₁, . . . , X_k.
The mean independence assumption on the error term is a possible factor in the etymology of stochastic linear regression, or of linear regression in general.

Nevertheless, although the mean independence assumption on the error term is innocuous for multi-normal random vectors, and might as well be imposed based on a background structural theory whenever suitable, there is no reason why an arbitrary random vector (Y, X⊤) with Y being L¹ should be such that E(Y | X) is affine. For instance, if X is an L², non-degenerate random variable, and if Y := X², then E(Y ∥ X) = E(X² ∥ X) = X²; and the square function x ↦ x² on R is not affine. Further, although E(ε ∥ X) = 0 implies E Xε = 0 provided that ε, X are random variables such that ε, Xε are both L¹, the converse is not true. Before we show the falsehood of the converse, we remark that the falsehood is actually not surprising, as the mixed moment E Xε is in some sense a modulus of linear dependence between X and ε, while the true dependence between X and ε can certainly be wildly nonlinear, so that their regression function E(ε | X) can also take a wild form.

Theorem 1 (orthogonality without mean independence). If X is an L³, non-degenerate random variable, and if E X³ = 0, then there is some L¹, non-degenerate random variable ε on the same probability space such that E Xε = 0 and E(ε ∥ X) = ε.

Proof. Let ε := X². Then ε is σ(X)-measurable, and ε is L¹ by Jensen's inequality. Moreover, we have E(ε ∥ X) = E(X² ∥ X) = ε. But by assumption we also have E Xε = E X³ = 0.

We remark that, under the assumptions of Theorem 1, the random variables ε, Xε are both L¹; so Theorem 1 disproves that orthogonality implies mean independence in a bona fide way.

Thus counterexamples to the statement that orthogonality implies mean independence are in fact abundant:

Example 1. If X ∼ N(0, 1), and if ε := X², then E Xε = E X³ = 0; but E(ε ∥ X) = X² ∼ χ²(1), which is certainly not degenerate.

Slightly wilder examples can be constructed easily. For instance, let Ω := [0, 1] × R, and probabilitize Ω with respect to the evident product Borel sigma-algebra of B_R⌉[0, 1] (the Borel subsets of [0, 1]) and B_R by the product probability measure of the Rademacher distribution and the standard Gaussian distribution, so that π₁ ∼ 2⁻¹(D₋₁ + D₁) and π₂ ∼ N(0, 1), and π₁, π₂ are independent random variables. The existence of a nontrivial Rademacher random variable, i.e. of a random variable having 2⁻¹(D₋₁ + D₁) as its distribution, is well-known; such a random variable may be constructed in a non-artificial way by considering the dyadic expansions of elements of [0, 1]. If X := π₂, and if ε := π₁ + π₂², then E Xε = 0 by the independence of π₁ and π₂ (and the symmetry of N(0, 1)). Moreover, since ε ∈ L¹(Ω) by Minkowski's inequality, from independence we also have E(ε ∥ X) = π₂² ∼ χ²(1), which is never degenerate.
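A quick numerical sanity check of Theorem 1 and Example 1 may be run as follows; this is our own illustrative sketch (sample size and seed arbitrary), approximating E Xε and the regression function of ε given X for X ∼ N(0, 1) and ε := X².

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 10**6

# Example 1: X standard normal, eps := X^2.
X = rng.standard_normal(n)
eps = X**2

# Orthogonality: E[X * eps] = E[X^3] = 0 by symmetry.
print(np.mean(X * eps))  # close to 0 up to Monte Carlo error

# Mean independence fails: E(eps || X) = X^2 is not constant.
# Binning X and averaging eps within bins approximates E(eps | X = x).
bins = np.linspace(-2, 2, 9)
idx = np.digitize(X, bins)
cond_means = [eps[idx == i].mean() for i in range(1, len(bins))]
print(np.round(cond_means, 2))  # varies with x, tracking x^2, not flat
```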
From Theorem 1 it follows immediately that

Corollary 1 (non-equivalence between orthogonal projection and conditional expectation). There are continuum-many random elements (Y, X) of R² such that the orthogonal projection and conditional expectation of Y given X exist and disagree.

Proof. Indeed, the proof of Theorem 1 applies to any L³ random variable with a symmetric probability density function. If X ∼ N(0, σ²), and if Y := X + X², then X is the orthogonal projection of Y given X. Since Y is then L¹ by Minkowski's inequality, the conditional expectation E(Y ∥ X) exists and equals X + X², which differs from X. Since R₊₊ is in bijection with R, there are continuum-many choices of σ.

Therefore, to take care of the first minimum requirement for stochastic linear regression — that there exists exactly one "meaningful" statistical relationship between Y and X₁, . . . , X_k in terms of a linear combination of the covariates X₁, . . . , X_k — the usual mean independence assumption E(Y − X⊤β ∥ X) = 0 is much too strong with respect to mathematical considerations. It follows that, even without any reference to data, the conditional expectation interpretation of stochastic linear regression is distorted. We might add that the interpretation distortion means that, even in the event that the estimated orthogonal projection of Y given X passes all the tests and diagnostics, this estimated orthogonal projection may very well have little to do with the conditional expectation of Y given X, and so it would barely make sense to attach a sense of effect on the (conditional) average behavior of Y to the estimated coefficient of orthogonal projection, although it makes every sense to view the estimated coefficient as an effect on the behavior of Y.

However, the mathematical remarks above do not necessarily always negate the legitimacy of the mean independence assumption on the error term, and hence of the conditional expectation interpretation, of stochastic linear regression. We notice that we did not make any a priori assumption restricting how ε and X are related, which certainly opens up a variety of possibilities. So a moral conveyed by Corollary 1 is this: in practice, what one necessarily learns via stochastic linear regression is not the conditional expectation of the involved random variables, unless there is further information indicating the "true" dependence between the random variables.

This concept of further information is in fact natural when it comes to contexts where there is an acceptable structural theory guiding the researcher to believe that the mean independence assumption is appropriate in a broad sense. For an elementary example, if there is in the researcher's field a generally accepted theory saying that the random variables Y, X₁, . . . , X_k concerning her may jointly admit some (k + 1)-variate normal distribution, then she may rest assured that what she will learn via stochastic linear regression is precisely the conditional expectation (modulo a translation) of Y given X₁, . . . , X_k. Another elementary example is a prototypical context of time series analysis. In a chemical experiment under a suitably controlled environment, let the scalar outcome be recorded according to the natural order of time. Given the relative stability of the experimental environment, the outcome may be described by a discrete-time stationary process in some suitable sense. If Y represents the outcome of interest, if X represents the "lagged version" of Y, and if how Y depends on X is sought after, then, since the environment is relatively stable, it would be reasonable to assume the mean independence of the error term with respect to X, or even the independence of the error term and X with the error term having mean zero.

As an example regarding the appropriateness of the mean independence assumption with respect to a more special structural theory and under a relatively uncontrolled, observational environment, we wish to refer the reader to a field such as mathematical finance.
There is in mathematical finance the so-called efficient market hypothesis, whose empirical validity is generally acknowledged in some special cases, such that, for example, the researcher may consider a process of stock prices as a martingale. This special background structural theory, when appropriately interpreted and applied, then assures that the researcher's study via stochastic linear regression — for Y being, say, the changes in stock prices and for X being, say, a random variable measurable with respect to the "history" or all the "past information" — may reasonably maintain the mean independence assumption.

In contrast with the case where a tenable structural theory is absent, so that a conditional mean interpretation of stochastic linear regression may very well be inappropriate, we see that linear orthogonal projection is potentially indeed affine conditional expectation in the presence of a tenable structural theory; hence, whenever the data suggest the suitability of the estimated orthogonal projection, one may be confident in addressing the estimated coefficient as the effect on the (conditional) average behavior of the dependent variate. Moreover, we have shown that the association of linear orthogonal projection with stochastic linear regression is equivalent to that of affine conditional expectation with stochastic linear regression (which, as seen, may "easily" be false by Corollary 1) if and only if the mean independence of the error term holds in a reasonable way (which may be the case under a suitable background structural hypothesis). Thus, from a purely mathematical viewpoint without any practical consideration such as taking into account a background structural theory, the concept of stochastic linear regression itself need not involve conditional expectation at the outset. In particular, one cannot expect a descriptive data analysis, which is by definition a purely "data-driven" study without any reference to any structural theory, via stochastic linear regression to admit an interpretation of conditional mean.

Structural theory plays a role that goes beyond the aforementioned matter. We might need to stress this: probably due to the fact that many observational studies, in contrast with experimental studies, are implemented with a background structural theory in mind, there is a tendency to mix the structural considerations with the purely mathematical considerations when it comes to stochastic linear regression (e.g. considering the concept of "parameter" or of "error"). By a structural consideration we refer to a situation where a background structural theory, maintained or to be tested, suggests a specific way of dependence between the variates of interest to the researcher. As an immediate example, analyzing financial data is usually and conceivably a priori tied to the background theory regarding the financial variates under consideration, and the structural theory may impose a mathematical dependence structure on the variates.

In view of the previous analysis, we see that, to ensure the two minimum requirements for the researcher's study via stochastic linear regression to be meaningful, it suffices to impose the orthogonality of the error term along with "one and a half" regularity conditions on the variates
Y, X₁, . . . , X_k, i.e. along with the conditions that the random variables Y, X₁, . . . , X_k are L² and, e.g., that the set {X₁, . . . , X_k} is orthogonal. And these sufficient conditions are reasonably mild both from the mathematical viewpoint and from a practical viewpoint. Indeed, besides the simplicity of the conditions, there is a simple but deeper mathematical reason, which is seemingly seldom stressed or even noticed, that justifies the conditions and reveals a further connection among them:

Fact 1. If H is an inner product space over R, if g ∈ H, and if {f₁, . . . , f_k} ⊂ H \ {0} is orthogonal, then there is exactly one (b, h) ∈ R^k × H such that g = Σ_{j=1}^k f_j b_j + h and h is orthogonal to each f_j.

Fact 1 is easily found in, e.g., introductory textbooks of abstract algebra, and a proof of Fact 1 is apparent. If ⟨·, ·⟩ denotes the inner product of H, the unique choice of b is simply b := (⟨f_j, f_j⟩⁻¹⟨f_j, g⟩)_{j=1}^k, and that of h is simply g − f⊤b; here f := ((f_j)_{j=1}^k)⊤.

If H is the L² space of random variables on a given probability space, then H is an inner product space by considering the inner product (f, g) ↦ ∫ fg = E fg defined on H × H. It then follows immediately from Fact 1 that

Proposition 1 ("abundance" of orthogonal projection). If Y, X₁, . . . , X_k are L² random variables, if no X_j is concentrated on {0}, and if {X₁, . . . , X_k} is orthogonal, i.e. if E X_{j₁}X_{j₂} = 0 for all 1 ≤ j₁ ≠ j₂ ≤ k, then there is exactly one β ∈ R^k and there is exactly one L² random variable ε such that Y = Σ_{j=1}^k X_j β_j + ε and E X_j ε = 0 for all 1 ≤ j ≤ k; and (E XX⊤)⁻¹ E XY with X := (X₁, . . . , X_k)⊤ is the unique choice of β.

Proof. The first conclusion is a special case of that of Fact 1; the assumption that no X_j is essentially zero prevents any X_j from vanishing almost surely. For the second conclusion, we notice that E XX⊤ is by the orthogonality assumption a diagonal matrix with each diagonal entry nonzero, and hence invertible.

Since the assumptions of Proposition 1 may be considered mild for practical purposes, in practice we can by Proposition 1 "always" talk about the linear orthogonal projection of a random variable given a random vector. Moreover, we remark that the orthogonality of the error term actually follows from the aforementioned regularity conditions on the involved random variables, and that the unique choice of the coefficient β of linear orthogonal projection is precisely the familiar "population counterpart" of the least squares estimator.

Proposition 1 implies

Corollary 2 (orthogonal projection coefficient as optimizer). Under the assumptions of Proposition 1 with the same notation, there is exactly one β ∈ R^k such that β ∈ argmin_{b ∈ R^k} E(Y − X⊤b)², namely, the minimization problem admits a unique solution, and β = (E XX⊤)⁻¹ E XY.

Proof. Under the given assumptions, writing (E XX⊤)⁻¹ E XY is legitimate. Since a point β of R^k is a solution to the minimization problem only if β = (E XX⊤)⁻¹ E XY, there is at most one such β. But the orthogonal projection coefficient (E XX⊤)⁻¹ E XY is also a solution to the minimization problem, so there is at least one such β.

Corollary 2 says that the coefficient of linear orthogonal projection minimizes the approximation error in the L² or mean-square sense. Owing to results sharing the same conclusion as Corollary 2, some authors would define orthogonal projection as the best linear predictor in the mean-square sense.
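To make the "population counterpart" remark concrete, the following minimal sketch (ours, with arbitrary simulated data) estimates β = (E XX⊤)⁻¹ E XY by replacing the expectations with sample means; under i.i.d. sampling and the stated L² conditions, a law of large numbers makes this consistent.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10**5, 2

# Orthogonal regressors: independent and centered, so E[X_{j1} X_{j2}] = 0.
X = rng.standard_normal((n, k))
beta = np.array([2.0, -1.0])          # arbitrary "true" coefficient
eps = rng.standard_normal(n)          # error orthogonal to each X_j
Y = X @ beta + eps

# Sample analog of (E XX^T)^{-1} E XY.
beta_hat = np.linalg.solve(X.T @ X / n, X.T @ Y / n)
print(beta_hat)  # close to beta
```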
It can easily be shown that a conditional expectation happens to be the best mean-square predictor, and so Corollary 2 may explain the (unjustified, as shown previously) conditional expectation interpretation of stochastic linear regression.

Further, as far as the purpose of ensuring that it is meaningful to talk about learning the orthogonal projection coefficient is concerned, the orthogonality condition on {X₁, . . . , X_k} does not cost one too much generality. If {X₁, . . . , X_k} is as in Proposition 1, and if in addition the elements are centered, then the variance of every nontrivial linear combination of {X₁, . . . , X_k} is > 0.
For the following local purpose(s), a collection of L² random variables X₁, . . . , X_k is said to be essentially linearly independent if and only if the variance of a₁X₁ + · · · + a_kX_k is > 0 for every nonzero (a₁, . . . , a_k) ∈ R^k. Consider the following

Proposition 2 (orthogonalization). If X is a random element of R^k with L² components that are essentially linearly independent, then there is exactly one element A of the classical Lie group SL_k(R) such that the components of the random element AX of R^k form an orthogonal set.

Proof. The proof idea is just an application of conventional wisdom (Gram–Schmidt). For k = 1, taking A to be the matrix [1] having 1 as the single entry suffices, as {X₁} is trivially or vacuously orthogonal.

We prove the case k = 2; the underlying machinery will then be clear for all k ≥ 3.
Let X̃₁ := X₁. If X̃₂ = aX₁ + X₂, then E X̃₁X̃₂ = 0 implies that a = −E X₁X₂ / E X₁². But for this particular choice of a, it evidently holds that X̃₂ := aX₁ + X₂ is orthogonal to X̃₁ = X₁. Therefore, the matrix with rows (1, 0) and (a, 1) is the unique choice of A. Since the unique choice of A has determinant 1, it lies in SL_k(R).
For k ≥ 3, we solve the corresponding k − 1 systems of equations successively in the same manner.

If X₁, . . . , X_k are L² random variables not concentrated on {0}, and if {X₁, . . . , X_k} is not orthogonal, then Proposition 1 is not directly applicable. But, regarding the coefficient-learning purpose, we can by Proposition 2 apply Proposition 1 to the orthogonalized version of X provided that the components of X are essentially linearly independent; then the orthogonal projection coefficient of Y given X is simply the orthogonal projection coefficient of Y given the orthogonalized X, left-multiplied by the transpose of the orthogonalization matrix. That the orthogonality of the error term to X is taken care of is due to the fact that the orthogonalization matrix has constant entries. Thus we arrive at

Theorem 2 ("strengthened" Proposition 1). If Y, X₁, . . . , X_k are L² random variables, and if {X₁, . . . , X_k} is essentially linearly independent, then there is exactly one β ∈ R^k and there is exactly one L² random variable ε such that Y = Σ_{j=1}^k X_j β_j + ε and E X_j ε = 0 for all 1 ≤ j ≤ k.

Proof. We have hinted at the essential considerations. If {X₁, . . . , X_k} is orthogonal, then Proposition 1 implies the desired conclusions. If not, we apply Proposition 2 to orthogonalize it by a unique matrix A ∈ SL_k(R). Write X := (X₁, . . . , X_k)⊤; then the components of AX are all L²; and so by Proposition 1 there is exactly one α ∈ R^k and there is exactly one L² random variable δ such that Y = (AX)⊤α + δ and E[(AX)δ] = ((0)_{j=1}^k)⊤. Since ((0)_{j=1}^k)⊤ = E[(AX)δ] = A(E Xδ), and since A is invertible, we have E Xδ = ((0)_{j=1}^k)⊤. Taking β := A⊤α completes the proof.
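The proof of Proposition 2 is constructive; the following sketch (ours, using sample moments on simulated data in place of the population expectations) carries out the k = 2 step and checks that the orthogonalizing matrix has unit determinant.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**5

# Two correlated (hence non-orthogonal) L^2 regressors.
X1 = rng.standard_normal(n)
X2 = 0.5 * X1 + rng.standard_normal(n)

# Population step for k = 2: a = -E[X1 X2] / E[X1^2], estimated here.
a = -np.mean(X1 * X2) / np.mean(X1**2)
A = np.array([[1.0, 0.0], [a, 1.0]])   # rows (1, 0) and (a, 1)

X2_tilde = a * X1 + X2
print(np.mean(X1 * X2_tilde))          # ~ 0: orthogonality achieved
print(np.linalg.det(A))                # 1: A lies in SL_2(R)
```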
Remark. We have shown that it is quite "easy" to ensure that the researcher's study via stochastic linear regression is meaningful. Since the non-degenerateness of each X_j is almost automatically satisfied with respect to practical purposes, the most "stringent" assumption turns out to be the L²-ness (and, perhaps, the essential linear independence) of the involved random variables! And for the involved random variables to be L², if not automatically true in practice, is a very mild condition.

Nevertheless, Theorem 2 is perhaps only of theoretical interest; without orthogonality, the usual form of the orthogonal projection coefficient is not guaranteed.

Now we can in passing clarify a common condition on the error term in a context of stochastic linear regression: that the mean of the error term is 0. From a mathematical viewpoint, that the error term has zero mean is immaterial. If Y, X₁, . . . , X_k are L² random variables with each X_j not concentrated on {0}, and if there is some 1 ≤ j₀ ≤ k such that X_{j₀} = 1, i.e. if the constant regressor 1 is present, then the corresponding unique error ε is by Theorem 2 orthogonal to X_{j₀}; it follows that E X_{j₀}ε = E ε = 0.

Another closely related unclarity, regarding the uncorrelatedness between ε and each X_j, may now be settled as well. For convenience, we state the following elementary

Fact 2 (equivalence of orthogonality and uncorrelatedness under mean zero). If X, ε are L² random variables, and if E ε = 0, then E Xε = 0 if and only if K(X, ε) = 0.

Thus orthogonality between random variables is equivalent to their uncorrelatedness when one of the random variables has mean zero. Further, we have
Proposition 3 (uncorrelatedness and orthogonal projection). Under the assumptions of Proposition 1 with the same notation, if in addition there is some 1 ≤ j₀ ≤ k such that X_{j₀} = 1, then K(X_j, ε) = 0 for all 1 ≤ j ≤ k.

Proof. As argued in a previous paragraph, Proposition 1 ensures that E ε = 0 in the presence of the constant regressor. The desired conclusion then follows from Fact 2.

We may proceed to the desired clarification. The issue is that sometimes in a context of stochastic linear regression the uncorrelatedness of ε and each X_j is instead stressed without a specific reference to the orthogonality of ε to each X_j. As we have seen, the orthogonality of ε to each X_j is of fundamental concern. And a random variable being centered, i.e. a random variable with mean zero, has nothing to do with the orthogonality of the random variable to a given random variable (e.g. a standard normal random variable is centered but not orthogonal to itself). Indeed, the concept of orthogonality between two random variables is "very independent" of that of uncorrelatedness of the random variables in the sense that they fail to imply each other in abundant cases:

Theorem 3 (non-equivalence between orthogonality and uncorrelatedness). There are continuum-many L² random variables X, ε such that E Xε = 0 and K(X, ε) ≠ 0; and there are continuum-many L² random variables X, ε such that E Xε ≠ 0 and K(X, ε) = 0.
Proof. For both assertions, let t > 0.

For the first assertion, let ξ ∼ N(0, t), let X := ξ + √t and ε := ξ − √t. Then E Xε = (E ξ²) − t = 0. But E Xε − (E X)(E ε) = 0 + t = t > 0;
so K(X, ε) ≠ 0. Since the set R₊₊ is in bijection with R, the first assertion follows.

For the second assertion, let ξ be a Rademacher random variable, i.e. let ξ ∼ 2⁻¹(D₋₁ + D₁). Then E ξ = 0 and E ξ² = 1. These equalities follow directly from the definition of Lebesgue integration. If X := tξ², and if ε := tξ + t, then E Xε = E(t²ξ³ + t²ξ²) = 0 + t · t > 0. On the other hand, we have E tξ² = t and E(tξ + t) = 0 + t = t; so (E X)(E ε) = E Xε, and hence K(X, ε) = 0; this completes the proof.

Given Proposition 1 and Theorem 3, we see the mathematical danger of mixing orthogonality with uncorrelatedness. This mixture seems especially customary when stochastic linear regression is spoken of in the presence of an "obvious" background structural theory with the "understanding" that "error" means both the unobservable random disturbance to the system and the error associated with the orthogonal projection under consideration.

The indicated issue entails a typical source of confusion when it comes to stochastic linear regression — mixing purely mathematical concepts and considerations with colloquial ideas and purpose-specific concerns. We hope that our treatment of stochastic linear regression as a whole would also help to clarify the issues of such a type.
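The first counterexample in Theorem 3 is easy to check numerically; the sketch below (ours, with t = 1 chosen arbitrarily) exhibits a pair that is orthogonal yet correlated.

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 10**6, 1.0

# Theorem 3, first assertion: X = xi + sqrt(t), eps = xi - sqrt(t).
xi = rng.normal(0.0, np.sqrt(t), n)    # xi ~ N(0, t)
X = xi + np.sqrt(t)
eps = xi - np.sqrt(t)

print(np.mean(X * eps))                # ~ 0: E[X eps] = E[xi^2] - t = 0
print(np.cov(X, eps)[0, 1])            # ~ t: K(X, eps) = t != 0
```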
Given the importance and convenience of Proposition 1, let us agree on
Definition 1 (fundamental random vector and canonical error). Let k ∈ N; let Y, X₁, . . . , X_k be random variables defined on the same probability space. Then the random vector (Y, X₁, . . . , X_k) is called a fundamental random vector if and only if i) each component of it is L², ii) each X_j is not concentrated on {0}, and iii) the set {X₁, . . . , X_k} is orthogonal. The difference obtained by subtracting the (unique) orthogonal projection of Y given X₁, . . . , X_k from Y is called the canonical error of the fundamental random vector.

By the orthogonal projection coefficient of a fundamental random vector (Y, X₁, . . . , X_k) we mean the orthogonal projection coefficient of Y, the first component, given X₁, . . . , X_k, the last k components.
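For fixity of ideas, here is a concrete instance of Definition 1 (our own toy example): on a common probability space, let X₁ := 1, let X₂ ∼ N(0, 1), and let Y := X₂². Then each component of (Y, X₁, X₂) is L², neither X₁ nor X₂ is concentrated on {0}, and E X₁X₂ = 0; so (Y, X₁, X₂) is a fundamental random vector, its orthogonal projection coefficient is (1, 0), the orthogonal projection of Y is the constant 1, and the canonical error is X₂² − 1.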
Note. The existence of a fundamental random vector is never a problem; for each k ∈ N, one can always consider at least the multi-normal (Gaussian) distributions on B_{R^{k+1}} with a suitable dependence structure.

Thus, for every fundamental random vector, it holds by Proposition 1 and Corollary 2 that the orthogonal projection coefficient of the fundamental random vector is precisely the optimizer minimizing the L²-norm of the canonical error of the fundamental random vector.

The terminologies introduced in Definition 1 will not be merely nominal; they help to fix the concepts. Their usefulness will be seen.

On the basis of all the previous remarks, in particular of Proposition 1 and Corollary 2, let us also agree on

Definition 2 (stochastic linear regression). Let k ∈ N; let M^{1,k}_{reg} be the collection of all P ∈ Π(B_{R^{k+1}}) such that i) P_{π_j} ≠ P_{π_j}⌉{0} for all 2 ≤ j ≤ k + 1, ii) ∫_R x² dP_{π_j}(x) < +∞ for all 1 ≤ j ≤ k + 1, and iii) ∫_{R²} x₁x₂ dP_{π_{{j₁,j₂}}}(x₁, x₂) = 0 for all 2 ≤ j₁ ≠ j₂ ≤ k + 1. Then M^{1,k}_{reg} is called a stochastic linear regression model. Let P ∈ Π(B_{R^{k+1}}). Then P is called a stochastic linear regression if and only if P ∈ M^{1,k}_{reg}.

The requirements in Definition 2 are precisely and simply translated from the assumptions of Proposition 1:

Theorem 4 (characterizing stochastic linear regression model via fundamental random vector). If k ∈ N, then

M^{1,k}_{reg} = {P ∈ Π(B_{R^{k+1}}) | Z ∼ P for some fundamental random vector Z}.
Proof. We first prove the inclusion relation ⊃. Let P be an element of the right-side collection. Then there are some probability space (Ω, F, P) and some fundamental random vector Z on Ω such that P_Z = P. Since each component of Z lies in L²(P) by Definition 1, we have

‖Z_j‖²_{L²(P)} = E(Z_j²) = ∫_R x² dP_{Z_j}(x) = ∫_R x² dP_{π_j}(x) < +∞

for all 1 ≤ j ≤ k + 1. Here ‖·‖_{L²(P)} denotes the in-context L²-norm.

Since Z_j is not essentially zero for all 2 ≤ j ≤ k + 1 by Definition 1, it follows that P_{π_j} ≠ P_{π_j}⌉{0} for all 2 ≤ j ≤ k + 1.

Moreover, from the orthogonality requirement of Definition 1, we have

0 = E Z_{j₁}Z_{j₂} = ∫_{R²} x₁x₂ dP_{(Z_{j₁}, Z_{j₂})}(x₁, x₂) = ∫_{R²} x₁x₂ dP_{π_{{j₁,j₂}}}(x₁, x₂)

for all 2 ≤ j₁ ≠ j₂ ≤ k + 1. The last equality follows jointly from the facts that σ({B₁ × B₂ | B₁, B₂ ∈ B_R}) = B_{R²} and that the collection {B₁ × B₂ | B₁, B₂ ∈ B_R} is stable with respect to finite intersections. The inclusion ⊃ follows.

For the other inclusion ⊂, let P ∈ M^{1,k}_{reg}. If Ω := R^{k+1}, if F := B_{R^{k+1}}, and if the underlying probability measure is taken to be P itself, then, upon taking Z to be the (k + 1)-tuple (π₁, . . . , π_{k+1})⊤ of natural projections π_j defined on Ω, the desired inclusion relation ⊂ follows.

From a mathematical viewpoint, we have obtained the desired intrinsic treatment of stochastic linear regression in the old sense on the basis of Proposition 1, Corollary 2, and Theorem 4. We have established the concept of a stochastic linear regression in a way involving and only involving pure mathematical considerations. Since the stochastic linear regression model M^{1,k}_{reg} is for every k ∈ N defined as a subcollection of the class Π(B_{R^{k+1}}) of all probability measures on B_{R^{k+1}}, and since a stochastic linear regression, being defined as an element of M^{1,k}_{reg} for some k ∈ N, is simply a probability measure with suitable properties, we have embedded the concept of stochastic linear regression (in the old sense) in mathematics in the sense discussed in the introduction of the present paper.

Besides the elegance and conceptual utilities of defining a stochastic linear regression as a probability measure, we wish to liken the present situation to an existing one in the related literature, although the treatment in the existing situation is much less exotic and nearly requires no further justifications or elaborations: in the literature of machine learning, some authors take a stochastic process to be a probability measure. We refer the reader to, e.g., Ryabko [8] or Khaleghi and Ryabko [5].

With respect to practical purposes, as we have argued previously that a fundamental random vector is the basic ingredient or "regression material" for stochastic linear regression in the old sense, Theorem 4 preserves this important aspect of stochastic linear regression (in the old sense), and hence serves as a further justification for our definition of a stochastic linear regression model.

Nevertheless, in practice there are certainly a large number of situations where one does not and cannot reasonably consider an arbitrary fundamental random vector as the basic "regression material". For studies with a reference to a structural theory, the first component of the fundamental random vector under consideration usually depends on the last k components of the fundamental random vector in a pre-specified way that is suggested by the reference structural theory; in such a case, the first component is then defined in terms of the last k components.
In fact, this pre-specified dependence is the basis of any simulation study involving stochastic linear regression in the old sense: given k orthogonal random variables X₁, . . . , X_k being L² and not concentrated on {0}, some k constants ≠ 0, and an additional "well-behaved" L² random variable η independent of each X_j in some suitable way, one defines a new random variable Y as the linear combination of the random variables X₁, . . . , X_k and the constants plus the random variable η. [If η is L², then Y is also L² by Minkowski's inequality; so (Y, X₁, . . . , X_k) is a fundamental random vector.]

Excluding the simulation studies, a context where a dependence is pre-specified within the fundamental random vector under consideration is another significant source of confusion when it comes to stochastic linear regression in the old sense, especially when the definition of Y in terms of X₁, . . . , X_k and η has the same form as the inherent orthogonal projection of Y given X₁, . . . , X_k plus the inherent canonical error ε. But the concept of "imposed error" or "structural error" η is independent of that of the canonical error ε; a structural error need not be orthogonal to each X_j.
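To see the closing point at work, the following sketch (ours; all numbers arbitrary) specifies Y linear-additively with a structural error η that is deliberately not orthogonal to the regressor: the projection coefficient then differs from the structural constant, and the canonical error differs from η.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10**6

X = rng.standard_normal(n)
eta = 0.5 * X + rng.standard_normal(n)   # structural error, E[X eta] = 0.5 != 0
Y = 2.0 * X + eta                        # structural constant 2

# Orthogonal projection coefficient of Y given X: E[XY] / E[X^2].
b = np.mean(X * Y) / np.mean(X**2)
print(b)                                  # ~ 2.5, not the structural 2

eps = Y - b * X                           # canonical error
print(np.mean(X * eps))                   # ~ 0 by construction
print(np.mean((eps - eta)**2))            # > 0: eps and eta differ
```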
Our notion of a stochastic linear regression model takes care of the case where a fundamental random vector considered in a simulation study is obtained by the linear-additive specification; we have another characterization of a stochastic linear regression model:

Theorem 5 (parametrized representation of stochastic linear regression model). Let k ∈ N. For every β ∈ R^k, let S_β be the collection of all P ∈ Π(B_{R^{k+1}}) such that i) (π₁, . . . , π_{k+1}) is a fundamental random vector on the probability space (R^{k+1}, B_{R^{k+1}}, P) and ii) there is exactly one η ∈ L²(P) such that π₁ = Σ_{j=2}^{k+1} π_j β_{j−1} + η and E π_j η = 0 for all 2 ≤ j ≤ k + 1. Then β₁, β₂ ∈ R^k and β₁ ≠ β₂ imply S_{β₁} ∩ S_{β₂} = ∅, and

M^{1,k}_{reg} = ⋃_{β ∈ R^k} S_β.

Proof. To see the disjointness, let β₁, β₂ ∈ R^k be distinct, and suppose there is some P ∈ Π(B_{R^{k+1}}) such that P ∈ S_{β₁} ∩ S_{β₂}. Then (π₁, . . . , π_{k+1}) is a fundamental random vector with respect to P, and so by Proposition 1 we have β₁ = β₂, a contradiction.

The inclusion ⊃ follows trivially from the definition of S_β, and the inclusion ⊂ follows from Proposition 1 and the proof of Theorem 4.

Theorem 5 shows an additional nice feature of our definition of a stochastic linear regression model. Moreover, it allows us to connect an indexed family of stochastic linear regressions with the notion of parameter identifiability:

Theorem 6 (parametrization injectiveness of stochastic linear regression). Let k ∈ N; let S_β be the same as in Theorem 5 for all β ∈ R^k. If Θ ⊂ R^k, and if (P_β)_{β ∈ Θ} ∈ ×_{β ∈ Θ} S_β, then the map β ↦ P_β, Θ → {P_β | β ∈ Θ}, is an injection.

Proof. First of all, every S_β is nonempty; acknowledging the axiom of choice implies that our assumption is not vacuous. From the first assertion of Theorem 5, it follows that P_{β₁} ≠ P_{β₂} for all β₁, β₂ ∈ Θ such that β₁ ≠ β₂.

Theorem 6 ensures that a collection of fundamental random vectors, obtained by specifying the linear-additive dependence as the constant vector runs through a given subset of R^k, is indeed a collection of distinct stochastic linear regressions:

Corollary 3 (simulation and stochastic linear regression). Let η, X₁, . . . , X_k be L² random variables defined on the same probability space; let X_j ≁ D₀ for all 1 ≤ j ≤ k; let {η, X₁, . . . , X_k} be orthogonal; let Θ ⊂ R^k. For every β ∈ Θ, let Y := Σ_{j=1}^k X_j β_j + η, so that (Y, X₁, . . . , X_k) is a fundamental random vector. If P_β is the distribution of (Y, X₁, . . . , X_k) for all β ∈ Θ, i.e. if P_β is the stochastic linear regression corresponding to (Y, X₁, . . . , X_k) for all β ∈ Θ, then the map β ↦ P_β, Θ → {P_β | β ∈ Θ}, is an injection.

Thus, for any given fundamental random vector subject to the specification given in Corollary 3, different given values of β determine different stochastic linear regressions. This conclusion is certainly desirable for practical purposes.

We have completed our intended treatment.

References

[1] Billingsley, P. (1999). Convergence of Probability Measures. Wiley.
[2] Eliot, G. (1819–1880). The Poems of George Eliot, reprint of the complete edition published by T.Y. Crowell & Co. in 1884. Palala Press.

[3] Federer, H. (1996). Geometric Measure Theory, reprint of the first edition. Springer.

[4] Flexner, A. (1939). The usefulness of useless knowledge. Harpers.

[5] Khaleghi, A. and Ryabko, D.

[6] Lewis, D. Philosophical Studies.

[7] Rudin, W. Real and Complex Analysis, (international) third edition. McGraw-Hill.

[8] Ryabko, D. (2019).