Change Point Detection with Optimal Transport and Geometric Discrepancy
Pronko Nikita Konstantinovich

September 12, 2018
Abstract
We present a novel retrospective change point detection approach based on optimal transport and geometric discrepancy. The method does not require any parametric assumptions about the distributions separated by change points. It can be used both for single and multiple change point detection and estimation, whether the number of change points is known or unknown. This result is achieved by the construction of a certain sliding window statistic from which change points can be derived with elementary convex geometry in a specific Hilbert space. The work is illustrated with computational examples, both artificially constructed and based on actual data.
Introduction
The change point problem was first mentioned by Shewhart [28] as a problem of industrial quality control in the year 1928. This problem was solved by Girshick and Rubin [10] in 1953. The optimal solutions for the parametric online formulation of the problem were provided by Shiryaev and Pollak [29], [25]. An asymptotically optimal a posteriori change point detection method was developed by Borovkov [2]. This method requires knowledge of parametric likelihood functions; however, it can be translated to the nonparametric case with empirical likelihood [15]. The fundamental results in the nonparametric study of the change point problem are due to Brodsky and Darkhovsky [3], and Horváth et al. [6]. These methods are generally tailored for one-dimensional data. Nonparametric change point inference with multidimensional data is still an open problem. New methods are still being developed, usually by applying ideas from other fields to change point data. For example, [21], where a divergence from cluster network analysis is used in order to estimate both the number and the locations of change points, or [20], where a multivariate version of the Wilcoxon rank statistic is computed.

This work is focused on nonparametric change point detection in multivariate data. This problem has a large scope of applications, including finance [31], genomics [23] and signal processing [16].

The problem of optimal transportation was introduced by Monge [22] in the year 1781. The modern statement of the problem is due to Kantorovich, as it was stated in the seminal paper in 1941 [14]. We recommend [33] and [26] as modern references on the subject.

Vector rank and quantile functions were introduced by Serfling [27] as part of the Depth-Outlyingness-Quantile-Rank (DOQR) paradigm. Monge-Kantorovich vector ranks were introduced by Chernozhukov et al. [4].

The study of discrepancy was started by H.
Weyl [34] in 1916, with number theory as the intended area of application. More computational aspects of discrepancy were pioneered by Niederreiter [24] (quasi- and pseudo-random number generation) and Hlawka [13] (numeric integration). We say that discrepancy is geometric when we talk about continuous distributions with a fixed, well-identified support; this is in opposition to the case of combinatorial discrepancy, which studies the nonuniformity of finite distributions. The main contribution to the theory of quadratic generalized geometric discrepancy is by Hickernell [11].

This is a work of a highly synthetic nature, coming from the combination of vector ranks and geometric discrepancy, optimal transport and change point problems. All these concepts are rarely to never seen together in scientific publications; however, in this paper they are intertwined in a single computational methodology.

But why do these results come to existence only now? The answer to this question in our case is rooted in the fact that computing discrepancy used to be a computationally challenging task for non-uniform probability distributions. However, in the recent publication [4] by Chernozhukov, Carlier, Galichon and Hallin the Monge-Kantorovich vector rank functions were introduced. The vector rank function can be understood, in the context of this work, as a homeomorphic extension of the classical one-dimensional cumulative distribution function to multidimensional spaces. The idea of using low-discrepancy sequences as a tool for the estimation of vector rank functions was already present in [4]. As the name suggests, discrepancy is the natural measure of how well a sequence qualifies as a low-discrepancy sequence, in the same way as the order statistic is the natural measure of a sequence being ordered.
Thus, just as classical one-dimensional ranks and order statistics are used in the construction of the classical Kolmogorov-Smirnov and Cramér-von Mises goodness-of-fit tests, the vector rank function and discrepancy can be used in order to construct natural extensions of these tests in the multidimensional context.

Thus, the main objective of the work is the exploration of vector rank discrepancy as a methodology for nonparametric high-dimensional statistics. We select the change point problem as the test subject for this inquiry.

The strict measure-theoretic approach mentioned above is strengthened in this work, as we speak about probability measures instead of probability distributions and densities whenever possible, which contradicts current trends in statistical science. This can be justified by the convenience of understanding measures as points in certain multidimensional spaces bearing intrinsic geometric structure. On the other hand, we view all mathematical constructions in this work as theoretical prototypes of certain concrete algorithms and data structures which can be implemented in order to solve practical tasks. Our approach to the problem is highly inspired by [19], and we use convex geometric intuition whenever possible.

Acknowledgement. The underlying research was conducted during the author's work for a master's degree in the "Mathematical Methods of Optimization and Stochastics" program at NRU Higher School of Economics, Moscow, Russia. The author wants to thank the head of the program Vladimir Spokoiny, his scientific advisor Andrei Sobolevski for suggesting the research topic and for support, Alexey Naumov and Maxim Panov for the discussion at IITP RAS; Yuri Yanovich, Alexandra Suvorikova, Elena Chernousova and Alexey Kroshnin for fruitful discussion at the IUM seminar.
Notation
This is the notation used in the sequel. $\mathbb{R}^d$ is a $d$-dimensional Euclidean space with $d < \infty$. $I = [0,1]$ is the unit interval and $I^d$ is the unit hypercube in $\mathbb{R}^d$. In line with Kolmogorov's axiomatics, the triple $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space, and every observation as a random variable is assumed to be a Borel measurable map from $\Omega$ to $\mathbb{R}^d$. For brevity, $\Omega$ denotes the whole probability structure $(\Omega, \mathcal{F}, \mathbb{P})$.

Every random variable $\xi : \Omega \to \mathbb{R}^d$ defines the probability measure $P = \xi_\#\mathbb{P}$ that is referred to as the probability law of $\xi$, where $\xi_\#\mathbb{P}$ denotes the pushforward of the measure $\mathbb{P}$ by $\xi$. That is, for every Borel set $B \subset \mathbb{R}^d$ it holds that $P(B) = \mathbb{P}(\xi^{-1}(B))$. In this case the notation $\xi \sim P$ is used. In case $\xi$ is a random process over a domain $T$, the notation $\xi \sim P$ means that $P$ defines the hierarchical probability distribution system of $\xi$ as a whole. In this case, for all $t, s \in T$ the measure $P_t$ is the distribution of a single observation $\xi_t$ and $P_{t,s}$ is the joint distribution of $\xi_t$ and $\xi_s$, i.e. $\xi_t \sim P_t$, $(\xi_t, \xi_s) \sim P_{t,s}$. $\xi \overset{\mathrm{i.i.d.}}{\sim} P$ means that $\xi$ is an independent identically distributed (i.i.d.) sample with probability law $P$.

The set of all probability laws with finite first and second moments is denoted by $\mathcal{P}(\mathbb{R}^d)$. The subset of $\mathcal{P}(\mathbb{R}^d)$ of probability measures absolutely continuous with respect to the Lebesgue measure $\lambda$ is denoted by $\mathcal{P}_{\ll\lambda}(\mathbb{R}^d)$. $\mathbb{U}^d$ is the uniform distribution over $I^d$. It is obvious that $\mathbb{U}^d \in \mathcal{P}_{\ll\lambda}(\mathbb{R}^d)$. In the sequel $I^n$ is assumed to be implicitly equipped with $\mathbb{U}^n$ for each $n \in \mathbb{N}$, which makes it into a probability space, and $I^0 = \{\emptyset\}$ is equipped with the counting measure.

If $\xi_n$ is a sequence of random elements in some metric space with distance metric $\rho$, then it is said that $\xi_n$ converges to $x$ in probability if for each $\varepsilon \in \mathbb{R}_{++}$ it holds that $\lim_{n\to\infty} \mathbb{P}(\rho(\xi_n, x) > \varepsilon) = 0$, which is denoted as $\xi_n \overset{\mathbb{P}}{\to} x$. In case $\rho(\xi_n, x)$ is not a measurable function for all $n$, convergence in outer probability may be introduced and denoted by $\xi_n \overset{\mathbb{P}^\bullet}{\to} x$.
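The pushforward notation can be made concrete in the simplest case $d = 1$, where the cdf of a law pushes it forward to the uniform distribution on $I$; this is exactly the role the vector rank function plays later in the paper. The following is our own minimal numerical sketch, not part of the paper, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# An observation xi ~ P = N(0, 1), represented by an empirical sample.
x = rng.normal(size=100_000)

# Pushing the sample through the cdf Phi of P realizes the pushforward
# Phi_# P = U^1: the image sample follows the uniform law on [0, 1].
u = stats.norm.cdf(x)

# Empirical moments of the image match E U = 1/2 and Var U = 1/12.
print(u.min() >= 0.0 and u.max() <= 1.0)
print(abs(u.mean() - 0.5) < 0.01)
print(abs(u.var() - 1.0 / 12.0) < 0.01)
```

In higher dimensions no single cdf plays this role, which is what motivates the Monge-Kantorovich vector rank construction later in the text.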
Having $A^\varepsilon_n = \{\omega \in \Omega : \rho(\xi_n(\omega), x) > \varepsilon\}$, the convergence in outer probability is equivalent to $\lim_{n\to\infty} \inf\{\mathbb{P}(B) \mid B \in \mathcal{F} : A^\varepsilon_n \subset B\} = 0$ for all $\varepsilon > 0$.

If $\xi$ is a discrete time ergodic process with convergence in probability,
$$\frac{1}{n}\sum_{i=1}^n f(\xi_i) \overset{\mathbb{P}}{\to} \int_{\mathbb{R}^d} f \,\mathrm{d}\bar P \quad (1)$$
for every bounded Lipschitz function $f$, we say that $\bar P$ is the weak limit distribution of $\xi$.

Problem Formulation

In this section a brief review of common variations of the change point problem is provided.

An ordered set $\mathcal{T}$ is associated with time. It is possible to discuss the change point problem with continuous time [9]; however, this work deals only with the case of discrete time. It is also possible to consider change point detection for random fields [3].

The two main types of change point problem are offline and online detection. In offline detection $\mathcal{T}$ is assumed to be a finite set of size $|\mathcal{T}| = T$. Without loss of generality, it is assumed that $\mathcal{T} = \{1, \ldots, T\}$. The initial observations $X$ are treated as a time series indexed by $\mathcal{T}$. Firstly, we discuss the case of a single change point.

Definition 1.
The change point of $X$ is an unknown moment of time $\theta \in \mathcal{T}$ such that $(X_t)_{t=1}^{\theta} \sim P_0$ and $(X_t)_{t=\theta+1}^{T} \sim P_1$.

There are two possibilities. Firstly, both subsamples may be i.i.d. with $(X_t)_{t=1}^{\theta} \overset{\mathrm{i.i.d.}}{\sim} \bar P_0$ and $(X_t)_{t=\theta+1}^{T} \overset{\mathrm{i.i.d.}}{\sim} \bar P_1$. It is also possible to consider the unknowns $\bar P_0$ and $\bar P_1$ to be ergodic weak limit distributions of $(X_t)_{t=1}^{\theta}$ and $(X_t)_{t=\theta+1}^{T}$ respectively.

Then the offline single change point detection problem is a hypothesis test of $H_0 : \bar P_0 = \bar P_1$ versus $H_1 : \bar P_0 \neq \bar P_1$ for every possible estimate $\hat\theta$ of $\theta$ having $1 < \hat\theta < T$. Retrieving a valid value of $\hat\theta$ is a related problem, which will be referred to as change point estimation.

Definition 2.
In the case of multiple change points, the existence of up to $N$ change moments $\theta_1 < \ldots < \theta_n < \ldots < \theta_N \in \mathcal{T}$ is assumed. In the simpler formulation of the problem the value of $N$ is assumed to be known, while in the more complicated one $N$ is also an object of estimation. As in the previous variation, the problem is a hypothesis test about the distribution of the samples $(X_t)_{t=\theta_{n-1}+1}^{\theta_n} \sim \bar P_n$, where $\theta_0 = 1$ and $\theta_{N+1} = T$ for simplicity. The problem splits into multiple hypothesis tests of $H^n_0 : \bar P_n = \bar P_{n+1}$ versus $H^n_1 : \bar P_n \neq \bar P_{n+1}$ for all possible estimates $\hat\theta$ of $\theta$ in the case of known $N$. For unknown $N$ the problem is structured around testing against $H_1 : \bar P_1 \neq \bar P_2 \neq \ldots \neq \bar P_{\hat N}$ for all estimates $(\hat N, \hat\theta)$ of $(N, \theta)$, where the possible values of $\hat N$ are constricted to some meaningful finite set.

For the online change point problem $\mathcal{T} \cong \mathbb{Z}$ as an ordered set, and $X$ can be thought of as an infinite time series unfolding in real time. In this case the goal is to identify the change point $\theta$ as soon as possible, which means using a minimal amount of observations $X_t$ with $t \geq \theta$. Branding every moment of time as a change point will achieve zero delay; however, this detection procedure is unacceptable, as it achieves the maximal false alarm rate. This means that a change point detection algorithm needs to be tuned for minimal delay at a fixed false alarm rate. In a more theoretical framework a single change point $\theta$ can be considered, which implies that the infinite set $\{X_t : t < 0\}$ of observations with a prior distribution can be used for detection. However, in the more practical situation a stream of change points $\theta : \mathbb{N} \to \mathcal{T}$ is considered, with only the finite sample $\{X_t : \hat\theta_{n-1} \ldots\}$ available. $\mathrm{Bin}(T, r)$ stands for the binomial distribution. However, the change point problem is a single element sample problem, which means that only minimal properties of the parameter can be inferred from the data.
This does not change the problem, except for treating $\hat\theta$ as an estimate of the mean. Obviously, the observer chooses the strategy which maximizes the chances of victory. This can be interpreted in terms of losses $L$, which are equal to $L(0) = \min\{1-r, r\} = L(1)$ in case the observer claims no change point, and $L(\hat r) = |\hat r - r|$ otherwise. Let $L(\hat r) = \min\{1-\hat r, \hat r\}$ in case the observer gives an estimate when no real change point exists.

If one preserves $r$ while increasing $T$, the empirical distribution of $X$ will converge to the measure $\mu_r = r\bar P_0 + (1-r)\bar P_1$, a convex combination of measures. It is possible to think of $\mu_r$ as contained inside the one-dimensional interval $[\bar P_0, \bar P_1]$ embedded into the infinite-dimensional convex space of probability measures. In order to construct such measures from random variables we will use the random mixing map $I^s_Z : Y \mapsto b_s Y + (1-b_s) Z$ in the sequel, where $b_s \sim \mathrm{Bern}(s)$ and $s \in [0, 1]$.

We consider a map $R : \mathbb{R}^d \to \mathbb{R}^d$ such that $R_\#\mu_r = \mathbb{U}^d$, a known reference distribution of simple structure. For our application $\mathbb{U}^d$ is the uniform distribution over the unit hypercube $I^d$. It was shown in [4] and [8] that $\hat R$, an estimate of $R$, can be recovered from the data without any parametric assumptions.

If the data is i.i.d. with laws $\bar P_0$ and $\bar P_1$ on each side of the change point, then figure 1 exhibits the relations between the probability laws discussed so far.
Figure 1: Diagram with measures as objects and pushforwards as arrows. The top row represents abstract measure spaces which are pushed forward to unknown measures by the observed random variables. The bottom row exhibits the state of the probability laws after application of the vector rank function of $\mu_r$.

The diagram shows that the application of the vector rank function can be thought of as a probabilist's "change of basis". The pragmatic value for the change point problem is in the elimination of the unknown distribution $\mu_r$ from the model.

It is obvious that the diagram depicted in fig. 1 commutes, as the pushforward acts as a linear map of signed measures defined over two different measurable spaces. If $\alpha$ and $\beta$ are measures, $x, y \in \mathbb{R}$, $B$ is a measurable set and $f$ is a measurable function, then
$$f_\#(x\alpha + y\beta)(B) = x\,\alpha(f^{-1}(B)) + y\,\beta(f^{-1}(B)) = x\,f_\#\alpha(B) + y\,f_\#\beta(B).$$
Therefore, by linearity of the pushforward,
$$\mathbb{U}^d = R_\#\mu_r = R_\#\left(r\bar P_0 + (1-r)\bar P_1\right) = r\,R_\#\bar P_0 + (1-r)\,R_\#\bar P_1.$$

The transformation $R$ should not erase the difference between $\bar P_0$ and $\bar P_1$. We introduce the notion of Kullback-Leibler divergence in order to quantify this difference.

Definition 3.
The Kullback-Leibler divergence between measures $P_0$ and $P_1$, admitting densities $p_0$ and $p_1$ respectively, is
$$\mathrm{KL}(P_0, P_1) = \int \log\frac{p_0}{p_1}\,\mathrm{d}P_0.$$
Note that since $P_0, P_1 \in \mathcal{P}_{\ll\lambda}(\mathbb{R}^d)$, they do admit densities. Now we formulate a sufficient condition on $R$ that ensures that our transformations preserve the divergence between distributions.

If $R$ is invertible almost everywhere $P_0$, then $\mathrm{KL}(R_\#P_0, R_\#P_1) = \mathrm{KL}(P_0, P_1)$. Indeed, let $X$ be a random variable such that $X \sim P_0$. Then $R(X) \sim R_\#P_0$, and by the invertibility a.e. of $R$ the distribution $R_\#P_0$ has density $(p_0 \circ R^{-1})\,|\det DR^{-1}|$ a.e.; a similar fact is also true for $P_1$, and the Jacobian factors cancel in the likelihood ratio. Then,
$$\mathrm{KL}(R_\#P_0, R_\#P_1) = \mathbb{E}\log\frac{p_0(R^{-1}RX)}{p_1(R^{-1}RX)} = \mathbb{E}\log\frac{p_0(X)}{p_1(X)} = \mathrm{KL}(P_0, P_1).$$

Formulation of the Method

In this chapter our approach to the construction of the estimates $\hat R_n$ of the vector rank function is presented. The main goal of this approach is the avoidance of approximating the entire function $R$. Note that the aim of this task is actually not a precise estimation of the values $R(X_i)$ for each observation $X_i$, but the construction of an estimate that preserves the relative geometry of the sample $X$ along its change points.

The work [4] presents continuous, semidiscrete, and discrete methods of estimation of $R$. The implementation of change point detection discussed in this section is based solely on the discrete approach. The reason behind this choice is computational simplicity.

The vector rank function in general can be understood as a $(P, \mathbb{U})$-almost sure homeomorphism between the supports of the sample probability distribution $P$ and a reference probability law of choice $\mathbb{U}$ ($R$ is defined almost everywhere $P$, is continuous with respect to the subset topology, and admits a similar continuous inverse existing almost everywhere $\mathbb{U}$), such that $R_\#P = \mathbb{U}$ and both the depth median and symmetries are preserved. These symmetries and the concept of a depth median need to be additionally defined, which goes beyond the scope of this paper.
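The divergence-preservation claim can be checked numerically in a toy case where every quantity is available in closed form. The sketch below is our illustration, not the paper's: both laws are Gaussian and the role of the a.e. invertible map is played by the affine map $R(x) = 2x + 1$:

```python
import numpy as np

def kl_normal(mu0, s0, mu1, s1):
    # Closed-form KL( N(mu0, s0^2) || N(mu1, s1^2) ).
    return np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2.0 * s1**2) - 0.5

# P0 = N(0, 1) and P1 = N(1, 1).
kl_before = kl_normal(0.0, 1.0, 1.0, 1.0)

# The invertible map R(x) = 2x + 1 pushes them to N(1, 4) and N(3, 4);
# the divergence is unchanged because the Jacobian factors cancel.
kl_after = kl_normal(2.0 * 0.0 + 1.0, 2.0, 2.0 * 1.0 + 1.0, 2.0)

assert abs(kl_before - 0.5) < 1e-12
assert abs(kl_after - kl_before) < 1e-12
```

Both values equal $1/2$, matching $\mathrm{KL}(\mathcal{N}(\mu_0,\sigma^2), \mathcal{N}(\mu_1,\sigma^2)) = (\mu_0-\mu_1)^2/(2\sigma^2)$.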
In our application the uniform measure over the unit hypercube $I^d$ is selected as the reference $\mathbb{U}$. In case $d = 1$ the cumulative distribution function (cdf) of $P$ is also the 'vector' rank function of the probability distribution $P$.

The vector rank function is not unique in general. We use the Monge-Kantorovich vector rank developed in [4], which is defined as the optimal transport between $P$ and $\mathbb{U}$:
$$R = \arg\min_{R : R_\#P = \mathbb{U}^d} \int_{\mathbb{R}^d} d(R(x), x)^2 \,\mathrm{d}P(x).$$
The choice of $R$ can be intuitively justified by the equivalence of the above optimization problem to the maximization of the integral
$$\int_{\mathbb{R}^d} \left\langle R(x) - m(\mathbb{U}^d),\, x - m(P) \right\rangle \mathrm{d}P(x),$$
where $m(\mathbb{U}^d)$ and $m(P)$ stand for the depth medians of the distributions $\mathbb{U}^d$ and $P$ respectively. Thus, the optimal transport preserves geometric properties of the distribution $P$ which can be expressed by inner products of points relative to its center of symmetry. This is what is understood as the relative geometry of the data, and what we try to preserve during vector rank estimation.

In this work we are focused on the discrete estimation of the vector rank function. Let $u$ be an equidistributed sequence of points in $I^d$. That is, for every Lipschitz continuous function $f$ defined on $I^d$ it holds that
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n f(u_i) = \int_{I^d} f(x)\,\mathrm{d}x. \quad (3)$$
Then, for $T$ observations define $Y_i = \hat R_T(X_i) = u_{\sigma^*(i)}$, where
$$\sigma^* = \arg\min_{\sigma \in S_T} \sum_{i=1}^T \left\| X_i - u_{\sigma(i)} \right\|^2. \quad (4)$$
In case the convergence in (3) is understood as convergence a.s., or even as convergence in probability, the sequence $u$ can be taken as an i.i.d. sequence of random variables or as an ergodic process with a uniform weak limit distribution. However, this implementation is more suited for a deterministic form of $u$.
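The discrete estimate (4) is exactly an optimal assignment problem and can be sketched with standard tools. In the sketch below, ours rather than the paper's, the reference points $u$ are taken as an i.i.d. uniform sample, which the text explicitly allows in place of a deterministic equidistributed sequence; the solver and the squared-Euclidean cost are our choices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
T, d = 256, 2

# Observations X (a Gaussian blob here) and reference points u in [0,1]^d.
X = rng.normal(size=(T, d))
u = rng.uniform(size=(T, d))

# sigma* = argmin_sigma sum_i ||X_i - u_sigma(i)||^2, i.e. the optimal
# assignment problem (4), solved by SciPy's assignment solver.
cost = cdist(X, u, metric="sqeuclidean")
_, cols = linear_sum_assignment(cost)
Y = u[cols]  # Y_i = u_{sigma*(i)}, the discrete vector rank estimate

assert Y.shape == (T, d)
assert sorted(cols.tolist()) == list(range(T))  # sigma* is a permutation
```

The assignment step is the $O(n^3)$ bottleneck discussed below; approximate solvers can be substituted without changing the surrounding pipeline.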
Discrepancy can be understood as a natural measure of the slackness of condition (3) for a fixed value of $T$.

The classical (one-sided) geometric discrepancy of the sequence $(u_i)_{i=1}^n$ is described by the formula
$$D(u) = \max_{x \in I^d} \left| \frac{\left|\{u_i \mid 1 \le i \le n\} \cap [0, x]\right|}{n} - \prod_{j=1}^d x_j \right|. \quad (5)$$
Note that (5) admits the representation $D(u) = \|\varphi_u\|_\infty$ for a certain function $\varphi_u$. This idea was used in [11] to introduce a generalized quadratic discrepancy, with the supremum norm replaced by a certain quadratic norm in a certain Sobolev space. The result can be expressed as
$$\left(D^\kappa(u)\right)^2 = \int\!\!\int \eta(x, y)\,\mathrm{d}x\,\mathrm{d}y - \frac{2}{n}\sum_{i=1}^n \int \eta(x, u_i)\,\mathrm{d}x + \frac{1}{n^2}\sum_{i,j=1}^n \eta(u_j, u_i), \quad (6)$$
where $\eta$ is the reproducing Hilbert kernel of the Sobolev space,
$$\eta(x, y) = \prod_{i=1}^d \left( M + \beta^2\left( \kappa(x_i) + \kappa(y_i) + \tfrac{1}{2} B_2\left(x_i - y_i \bmod 1\right) + B_1(x_i) B_1(y_i) \right) \right), \quad (7)$$
with $\beta \in \mathbb{R}$ standing for a scale parameter, $\kappa$ standing for a functional parameter with square-integrable derivative and $\int \kappa = 0$, the value $M$ defined by
$$M = 1 - \beta^2 \int (\kappa')^2,$$
and $B_i$ the $i$th Bernoulli polynomial. Note that in case $d = 1$ statistic (5) turns into the Kolmogorov-Smirnov statistic and (6) turns into the Cramér-von Mises statistic of the uniform distribution test.

The function $\kappa$ is a functional parameter defining the exact form of the discrepancy. In this text we use the star discrepancy produced by
$$\kappa^*(x) = \frac{1}{6} - \frac{x^2}{2}$$
and the centred discrepancy produced by selecting
$$\kappa^c(x) = -B_2\left(\left(x - \tfrac{1}{2}\right) \bmod 1\right).$$
The case of $u$ being a low-discrepancy sequence is particularly well suited for our application.

Definition 4. A low-discrepancy sequence is a deterministic sequence $u$ in $I^d$ designed with the goal of minimizing the value $D^\kappa_p\left((u_i)_{i=1}^n\right)$ for each natural number $n$.
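One member of the quadratic family (6) that admits a simple closed form is the squared $L_2$ star discrepancy, for which Warnock's formula removes the integrals. The sketch below, our stand-in rather than the paper's exact kernel (7), illustrates the computation; the paper's choice of $\kappa$ and $\beta$ would change the constants but not the double-sum structure:

```python
import numpy as np

def l2_star_discrepancy_sq(u):
    # Warnock's closed form for the squared L2 star discrepancy of points
    # u in [0,1]^d, a concrete instance of the quadratic form (6).
    n, d = u.shape
    term1 = 3.0 ** (-d)
    term2 = (2.0 / n) * np.sum(np.prod((1.0 - u ** 2) / 2.0, axis=1))
    m = np.maximum(u[:, None, :], u[None, :, :])  # coordinatewise pair maxima
    term3 = np.sum(np.prod(1.0 - m, axis=2)) / n ** 2
    return term1 - term2 + term3

# In d = 1 the statistic is of Cramer-von Mises type and the midpoint grid
# (2i - 1) / (2n) attains the exact value 1 / (12 n^2).
for n in (1, 2, 10):
    grid = ((2.0 * np.arange(1, n + 1) - 1.0) / (2.0 * n)).reshape(-1, 1)
    assert abs(l2_star_discrepancy_sq(grid) - 1.0 / (12.0 * n ** 2)) < 1e-12
```

The $O(n^2 d)$ double sum is the same shape as the kernel sums used in the diphoragram computations later in the text.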
The definition above is rather informal. However, it is postulated firmly that for any low-discrepancy sequence $u$ property (3) holds and $\lim_{n\to\infty} D^\kappa_p\left((u_i)_{i=1}^n\right) = 0$ for any choice of $p$ and $\kappa$. Thus, the rate of convergence of $D^\kappa_p\left((u_i)_{i=1}^n\right)$ can be understood as a measure of efficiency of a low-discrepancy sequence $u$.

In our application we use the Sobol sequence with the Gray code implementation. The Sobol sequence was introduced by Sobol [30] and the Gray code implementation is due to [1]. A detailed investigation of the nature of this sequence goes beyond the scope of this work; however, any other low-discrepancy sequence can be used. For a Sobol sequence the value $D^\kappa\left((u_i)_{i=1}^n\right)$ converges to zero as $O(n^{-1+\varepsilon})$, where $\varepsilon$ depends on $d$ implicitly. While the convergence rate of the discrepancy for a random uniform sequence is $O(1/\sqrt{n})$, for $d \ll 200$ the value of $\varepsilon < 1/2$, which makes a low-discrepancy sequence rather efficient. Moreover, scrambling and effective dimension techniques can increase the rate of convergence [12].

It can be established that the sequences $u$ and $X$ have no repeating values almost surely, as our data is assumed to come from atomless distributions. This means that the problem of finding the permutation $\sigma^*$ in (4) is the optimal assignment problem. The optimal assignment problem is a special case of linear programming, which can be solved in $O(n^3)$ time by the Hungarian algorithm [17][32]. Amortizations based on applications of parallel programming can improve the computational complexity to $O(n^2 \log n)$. Approximate algorithms can be used for faster computations, for example [7].

Note that even if the sample $X$ has an i.i.d. distribution, the resulting transport $Y$ is not independent itself. However, the covariance of the elements converges to zero as $T$ goes to infinity. Let $n \neq m$ be two indices less than or equal to $T$ and let $i, j$ be coordinate indices. As $(X_k)_{k=1}^T$ is assumed to be i.i.d., it follows that $Y^i_n$ has a discrete uniform distribution over $\{u^i_k\}_{k=1}^T$.
Then,
$$\mathrm{Cov}(Y^i_n, Y^j_m) = \mathbb{E}\,Y^i_n Y^j_m - \mathbb{E}\,Y^i_n\,\mathbb{E}\,Y^j_m = \sum_{k=1}^T \sum_{l \neq k}^T \frac{u^i_k u^j_l}{T(T-1)} - \sum_{k=1}^T \sum_{l=1}^T \frac{u^i_k u^j_l}{T^2} = \sum_{k=1}^T \sum_{l \neq k}^T \frac{u^i_k u^j_l}{T^2(T-1)} - \sum_{k=1}^T \frac{u^i_k u^j_k}{T^2},$$
so, bounding from above and below with the equidistribution property of $u$ in the limit,
$$\mathrm{Cov}(Y^i_n, Y^j_m) \le \sum_{k=1}^T \sum_{l=1}^T \frac{u^i_k u^j_l}{T^2(T-1)} \xrightarrow[T\to\infty]{} \lim_{T\to\infty} \frac{\mathbb{E}U \cdot \mathbb{E}U}{T-1} = 0,$$
$$\mathrm{Cov}(Y^i_n, Y^j_m) \ge -\sum_{k=1}^T \frac{u^i_k u^j_k}{T^2} \ge -\sum_{k=1}^T \frac{u^i_k}{T^2} \xrightarrow[T\to\infty]{} \lim_{T\to\infty}\left(-\frac{\mathbb{E}U}{T}\right) = 0,$$
where $U$ is a uniform random variable on $[0,1]$. Thus $\mathrm{Cov}(Y^i_n, Y^j_m)$ converges to zero. As the variance of $Y^i_n$ converges to the variance of the standard uniform distribution on $[0,1]$, we will treat the sample $Y$ as uncorrelated in the asymptotic context under the assumption of an i.i.d. distribution of $X$.

It can easily be seen that the quadratic discrepancy admits a degenerate V-statistic representation with a kernel $K$:
$$\left(D^\kappa(y)\right)^2 = \frac{1}{n^2}\sum_{i,j=1}^n K(y_i, y_j), \quad (8)$$
assuming a sample $y$ of $n$ elements. Properties of V-statistics produce the asymptotic result
$$n\left(D^\kappa(y)\right)^2 \xrightarrow[n\to\infty]{d} V = \sum_{i=1}^\infty \lambda_i Z_i^2 \sim V(\lambda),$$
where $Z \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,1)$ and $\lambda$ are the non-zero eigenvalues of the integral operator $A$ defined by the relation
$$A(f)(x) = \int f(y) K(x, y)\,\mathrm{d}y.$$
It can be shown that $A$ is in fact positive-semidefinite and trace-class, which means that all $\lambda_i > 0$ and $\sum_{i=1}^\infty \lambda_i < \infty$. The eigenvalues of $A$ can be approximated by a finite collection of $N$ numbers $\hat\lambda$ with the Nyström method.
As was shown in [5], a twofold approximation of the cdf of $V|_N = \sum_{i=1}^N \hat\lambda^N_i Z_i^2$ is possible for fixed natural numbers $N$, $K$ and a parameter $\alpha \in (0, 1)$:
$$\mathbb{P}\left(V|_N < x\right) \approx \frac{1}{2} - \sum_{k=0}^K \frac{\sin\left( \frac{1}{2}\sum_{i=1}^N \arctan\left(2\left(k+\frac12\right)\alpha\hat\lambda^N_i\right) - \left(k+\frac12\right)\alpha x \right)}{\pi\left(k+\frac12\right)\prod_{i=1}^N \sqrt[4]{1 + \left(2\left(k+\frac12\right)\alpha\hat\lambda^N_i\right)^2}}. \quad (9)$$
This formula can be used for computing quantiles and critical values of $V(\lambda)$ with a simple one-dimensional zero-seeking algorithm.

Case of a Single Change Point

In this chapter an approach for detecting a single change point is presented. We impose the model assumption that for a fixed $\tau$ the change point $\theta$ either satisfies $\tau < \theta < T - \tau$ or does not exist. The value $\tau$ can be selected in such a way that $1 \ll \tau \ll T$ and is used as a sliding window bandwidth defining a 'control chart'-like object which we refer to as a diphoragram.

Definition 5. The empirical sliding diphoragram for change point data $(X_t)_{t=1}^T$ is defined by
$$\widehat\Delta^{T,\tau}_t = \left(D^\kappa\left( \left(\hat R_T(X_i)\right)_{i=t}^{t+\tau} \right)\right)^2 = \left(D^\kappa\left( (Y_i)_{i=t}^{t+\tau} \right)\right)^2$$
and the ideal sliding diphoragram by
$$\Delta^{T,\tau}_t = \left(D^\kappa\left( \left(R_{\mu_r}(X_i)\right)_{i=t}^{t+\tau} \right)\right)^2$$
for $t$ in the range from $1$ to $T'_\tau = T - \tau$.

If $\kappa$ is a continuous function, then $\widehat\Delta^{T,\tau}_t \xrightarrow{\mathbb{P}} \Delta^{T,\tau}_t$ as $T \to \infty$. With this condition the discrepancy is a continuous function of the data, so the convergence of vector ranks proved in [4] implies the convergence of diphoragrams. Note that with the kernel representation the charts admit a difference representation
$$\widehat\Delta^{T,\tau}_{t+1} - \widehat\Delta^{T,\tau}_t = \frac{1}{\tau^2}\left( \sum_{i=t+1}^{t+1+\tau} K(Y_i, Y_{t+\tau+1}) - \sum_{i=t}^{t+\tau} K(Y_i, Y_t) \right).$$
Thus, computing $\widehat\Delta^{T}$ and $\widehat\Delta^{T,\tau}$ takes only quadratic time $O(dT^2)$ in the number of observations.
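The construction up to this point, rank estimation by assignment followed by a sliding window discrepancy, can be sketched end to end. Everything below reflects our own illustrative choices rather than prescriptions of the text: a scrambled Sobol reference set, the squared $L_2$ star discrepancy standing in for $(D^\kappa)^2$, and a synthetic sample with a mean shift:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from scipy.stats import qmc

rng = np.random.default_rng(3)
T, tau, d, theta = 512, 128, 2, 256  # theta is the true change point, r = 1/2

# Mean shift at theta.
X = np.vstack([rng.normal(0.0, 1.0, size=(theta, d)),
               rng.normal(2.0, 1.0, size=(T - theta, d))])

# Discrete vector rank estimate (4) against a Sobol reference set.
u = qmc.Sobol(d=d, scramble=True, seed=3).random(T)
_, cols = linear_sum_assignment(cdist(X, u, metric="sqeuclidean"))
Y = u[cols]

def disc_sq(p):
    # Squared L2 star discrepancy (Warnock's formula) of points p in [0,1]^d.
    n, dd = p.shape
    m = np.maximum(p[:, None, :], p[None, :, :])
    return (3.0 ** (-dd)
            - (2.0 / n) * np.sum(np.prod((1.0 - p ** 2) / 2.0, axis=1))
            + np.sum(np.prod(1.0 - m, axis=2)) / n ** 2)

# Empirical sliding diphoragram over windows of width tau.
diphoragram = np.array([disc_sq(Y[t:t + tau]) for t in range(T - tau)])
t_star = int(np.argmin(diphoragram))

# The minimizing window should straddle the change point, mixing both
# regimes in roughly the global proportion r.
assert t_star < theta < t_star + tau
```

On this synthetic example the minimizer sits near $(1 - \tau/T)\,\theta$, consistent with the proportionality argument that follows in the text.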
Moreover, if crisp optimal transport with a low-discrepancy sequence $u$ is used in the estimation of the vector rank function, then $A_Y = A_u$, so all values can be precomputed.

For the application to the change point problem, consider two increasing sequences of integers $T_n$ and $\tau_n$ such that $T_n / \tau_n = a > 1$ for all $n \in \mathbb{N}$. Then, as the $Y_i$ are assumed to be independent, $\widehat\Delta^{T_n,\tau_n}_t$ and $\widehat\Delta^{T_n,\tau_n}_{t+\tau_n}$ are also independent random variables.

Definition 6. The mean sliding discrepancy for the sample $(Y_i)_{i=1}^{T_n}$ is computed by
$$\bar\Delta_n = \frac{\tau_n^2}{T_n} \sum_{j=0}^{a-1} \widehat\Delta^{T_n,\tau_n}_{j\tau_n}.$$
Note that the mean sliding discrepancy is not the same as the discrepancy of the whole sample. By independence, in case $H_0$ holds, as $n \to \infty$
$$\bar\Delta_n \xrightarrow[n\to\infty]{d} \sum_{j=1}^{a}\sum_{i=1}^{\infty} \frac{\lambda_i}{a} Z^2_{i,j} \sim V\left( \left(\left(\frac{\lambda_i}{a}\right)_{j=1}^{a}\right)_{i=1}^{\infty} \right),$$
where $Z \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,1)$. In case $H_1$ holds, there is at least one $\widehat\Delta^{T_n,\tau_n}_t$ sampled from the non-uniform data, which goes to $\infty$ in probability as $n \to \infty$. So the whole sum $\bar\Delta_n \xrightarrow[n\to\infty]{\mathbb{P}} \infty$.

In case a single change point exists, the statistic $\widehat\Delta^{T_n,\tau_n}$ is expected to attain its minimal value at such a moment of time $t^*$ when the empirical distribution of $(Y_t)_{t=t^*}^{t^*+\tau_n}$ approaches the empirical distribution of the whole sample, as the empirical distribution of the whole sample is converging to $\mathbb{U}^d$. Let
$$t^*_n = \arg\min_{1 \le t \le T_n - \tau_n} \widehat\Delta^{T_n,\tau_n}_t.$$
It is expected that the ratios of the numbers of elements from both sides of the change point in the subsample used in the computation of $\widehat\Delta^{T_n,\tau_n}_{t^*_n}$ and in the whole sample are equal. This can be represented in the algebraic relation
$$\frac{\tilde\theta_n - t^*_n}{\tau_n} = \frac{\tilde\theta_n}{T_n} = \hat r_n,$$
where $\tilde\theta_n$ is the produced estimate of the change point. This provides the expression for the estimate $\tilde\theta_n = \frac{t^*_n}{1 - a^{-1}}$. We accept $H_0$ at a fixed $\gamma$-level in case
$$p_n = \mathbb{P}\left(V|_N \le \bar\Delta_n\right) < 1 - \gamma, \quad (10)$$
where the cdf is numerically estimated as in (9) for some fixed parameters $N$ and $K$.
Otherwise we reject $H_0$ and state that there was a change, most probably at $\hat\theta_n$.

In order to reason about the properties of the estimate $\hat\theta_n$, let $M(I^d)$ denote the space of finite signed measures over the hypercube $I^d$.

Definition 7. The space of nonuniformities $D$ is defined as a quotient of real vector spaces
$$D = M(I^d) \,/\, \mathbb{R}\mathbb{U}^d$$
endowed with a Hilbert space structure by the inner product defined for $[\nu], [\mu] \in D$ by
$$\left\langle [\nu], [\mu] \right\rangle = \int\!\!\int K(x, y)\,\mathrm{d}\nu(x)\,\mathrm{d}\mu(y). \quad (11)$$
Note that relation (11) is well defined, as for all $a, b \in \mathbb{R}$
$$\left\langle \left[\nu + a\mathbb{U}^d\right], \left[\mu + b\mathbb{U}^d\right] \right\rangle = \int\!\!\int K(x,y)\,\mathrm{d}\nu(x)\,\mathrm{d}\mu(y) + b\int\!\!\int K(x,y)\,\mathrm{d}\nu(x)\,\mathrm{d}y + a\int\!\!\int K(x,y)\,\mathrm{d}x\,\mathrm{d}\mu(y) + ab\int\!\!\int K(x,y)\,\mathrm{d}x\,\mathrm{d}y = \int\!\!\int K(x,y)\,\mathrm{d}\nu(x)\,\mathrm{d}\mu(y),$$
as $\int K(x,y)\,\mathrm{d}x = 0$ for any value of $y$; and since $K$ is symmetric and positive-definite, this is indeed an inner product. As the line $\mathbb{R}\mathbb{U}^d$ intersects the simplex of probability measures $\mathcal{P}(I^d)$ in only one point ($\mathbb{U}^d$ itself), the natural projection $\mu \mapsto [\mu]$ is injective on $\mathcal{P}(I^d)$, so we denote the nonuniformity arising from each $\mu \in \mathcal{P}(I^d)$ just as $\mu$. If every subsample $(Y_i)_{i=t}^{t+\tau_n}$ is associated with an empirical measure $\hat\mu_t = \frac{1}{\tau_n+1}\sum_{i=t}^{t+\tau_n} \delta_{Y_i}$, then
$$\widehat\Delta^{T_n,\tau_n}_t = \|\hat\mu_t\|^2_D = \langle \hat\mu_t, \hat\mu_t \rangle.$$
By construction of the vector rank function $\left\| r R_\#\bar P_0 + (1-r) R_\#\bar P_1 \right\|_D = 0$; hence $r R_\#\bar P_0 + (1-r) R_\#\bar P_1 = 0$ in $D$.
This produces the result
$$R_\#\bar P_1 = -\frac{r}{1-r}\,R_\#\bar P_0 \;\Longrightarrow\; \left\|R_\#\bar P_1\right\|_D = \frac{r}{1-r}\left\|R_\#\bar P_0\right\|_D \;\Longrightarrow\; r = \frac{\left\|R_\#\bar P_1\right\|_D}{\left\|R_\#\bar P_0\right\|_D + \left\|R_\#\bar P_1\right\|_D}. \quad (12)$$
Considering that for $r = 1/2$ we have $[R_\#\bar P_1] = -[R_\#\bar P_0]$, the magnitude of the discrepancy will have the same distribution for samples with equal proportions of elements with the distributions $R_\#\bar P_0$ and $R_\#\bar P_1$. If $t^* = (1 - a^{-1})\theta$, then in this symmetric case it can be shown that the estimate $\hat\theta$ is unbiased:
$$\mathbb{E}\hat\theta_n = \frac{\sum_{t=1}^{T_n-\tau_n} t\,\mathbb{P}\left(t = \arg\min_{t'} \widehat\Delta^{T_n,\tau_n}_{t'}\right)}{1 - a^{-1}} = \frac{t^* + B^+_n - B^-_n}{1 - a^{-1}} = \frac{t^*}{1 - a^{-1}} = \theta,$$
where $B^+_n = B^-_n$ by symmetry and
$$B^-_n = \sum_{t=1}^{\lfloor r(T_n-\tau_n)\rfloor} t\,\mathbb{P}\left(t^* - t = \arg\min_{t'} \widehat\Delta^{T_n,\tau_n}_{t'}\right), \qquad B^+_n = \sum_{t=1}^{\lfloor (1-r)(T_n-\tau_n)\rfloor} t\,\mathbb{P}\left(t^* + t = \arg\min_{t'} \widehat\Delta^{T_n,\tau_n}_{t'}\right).$$
However, when $r \neq 1/2$, $\hat r$ is projected to be biased towards $1/2$: as $r$ decreases, the value $B^+_n$ will only increase and the value $B^-_n$ will only decrease; otherwise, an increase of $r$ increases the value $B^-_n$ and decreases the value of $B^+_n$. In order to prove consistency of the estimator $\hat r_n$ we introduce a family of sequences for each $s \in (0, 1)$:
$$\tilde\Delta^s_n = \widehat\Delta^{T_n,\tau_n}_{t'} \quad\text{with}\quad t' = \arg\min_{1 \le t \le T_n-\tau_n}\left| s - \frac{t}{T_n(1-a^{-1})} \right|.$$
Then, by the weak convergence of vector ranks, for every $s \neq r$ the sequence converges to a value
$$\tilde\Delta^s_n \xrightarrow[n\to\infty]{\mathbb{P}} \Delta^s = \left\| \mathbf{1}_{s < \frac{r-a^{-1}}{1-a^{-1}}}(s)\,R_\#\bar P_0 + \mathbf{1}_{s > \frac{r}{1-a^{-1}}}(s)\,R_\#\bar P_1 + \mathbf{1}_{\frac{r-a^{-1}}{1-a^{-1}} \le s \le \frac{r}{1-a^{-1}}}(s)\left( \frac{r - s(1-a^{-1})}{a^{-1}}\,R_\#\bar P_0 + \frac{a^{-1} - r + s(1-a^{-1})}{a^{-1}}\,R_\#\bar P_1 \right) \right\|^2_D,$$
while $\tilde\Delta^r_n \xrightarrow{\mathbb{P}} 0$. We can assert, by the structural uniformity of $\tilde\Delta_n$, that $\tilde\Delta_n \xrightarrow{\mathbb{P}} \Delta$ in Skorokhod's topology.
Thus, $\hat r_n = \arg\min_s \tilde\Delta^s_n \xrightarrow{\mathbb{P}} \arg\min_s \Delta^s = r$, providing convergence in probability. This suggests that the bias of $\hat\theta_n$ can be bounded by some sequence $\beta_n$ with $\beta_n \xrightarrow[n\to\infty]{} 0$:
\[
\frac{|B^+_n - B^-_n|}{T_n} \le \beta_n.
\]
Without loss of generality assume that $r < 1/2$. Then we separate the positive bias into mixing and non-mixing parts:
\[
\frac{|B^+_n - B^-_n|}{T_n} < \frac{B^+_n}{T_n} < \sum_{t=1}^{\lfloor(1-r)\tau_n\rfloor} \frac{t\,\mathbb{P}\big(\widehat\Delta^{T_n,\tau_n}_{t'} < \tilde\Delta^r_n\big)}{T_n} + \sum_{t=\lfloor(1-r)\tau_n\rfloor+1}^{T_n-\tau_n-t^*} \frac{t\,\mathbb{P}\big(\widehat\Delta^{T_n,\tau_n}_{t'} < \tilde\Delta^r_n\big)}{T_n}.
\]
For the non-mixing part, apply Markov's inequality for each fixed value of $\tilde\Delta^r_n$ and some value $\zeta > 0$:
\[
\mathbb{P}\big(\widehat\Delta^{T_n,\tau_n}_{t'} < \tilde\Delta^r_n\big) = \mathbb{P}\Big(\big(\widehat\Delta^{T_n,\tau_n}_{t'}\big)^{-\zeta} > \big(\tilde\Delta^r_n\big)^{-\zeta}\Big) \le \big(\tilde\Delta^r_n\big)^{\zeta}\,\mathbb{E}\Big(\big(\widehat\Delta^{T_n,\tau_n}_{t'}\big)^{-\zeta}\Big).
\]
It is possible to use the inverse since $\tilde\Delta_n > 0$. With this inequality the non-mixing part can be bounded:
\[
\sum_{t=\lfloor(1-r)\tau_n\rfloor+1}^{T_n-\tau_n-t^*} \frac{t\,\mathbb{P}\big(\widehat\Delta^{T_n,\tau_n}_{t'} < \tilde\Delta^r_n\big)}{T_n} \le T_n\,\mathbb{E}\Big(\big(\tilde\Delta^r_n\big)^{\zeta}\Big)\,\mathbb{E}\Big(\big(\tilde\Delta_n\big)^{-\zeta}\Big) \le C_r\,T_n\,\mathbb{E}\Big(\big(\tilde\Delta^r_n\big)^{\zeta}\Big).
\]
As $\tilde\Delta^r_n = O(\tau_n^{-\varepsilon})$, it is possible to specify the constant $\zeta$ in such a way that $T_n\,\mathbb{E}\big((\tilde\Delta^r_n)^{\zeta}\big) \downarrow 0$, while $\mathbb{E}\big((\tilde\Delta_n)^{-\zeta}\big)$ approaches some constant value by weak convergence and therefore can be bounded. Thus the non-mixing bias approaches zero as $n$ goes to infinity. Note that all moments of the discrepancy exist, as it is bounded on a positive interval for each $n$. This method can be extended to show that the estimate is asymptotically unbiased under an assumption of change point slackness (Definition 8).
Definition 8. A sequence of diphoragrams $\widehat\Delta^{T_n,\tau_n}$ has the slackness property iff for some constant $\gamma\in(0,a)$ and for all $n$ large enough there are time points $t^{**}_n > t^*_n$ such that
\[
\frac{t^{**}_n - t^*_n}{T_n} \ge \gamma, \qquad \sum_{t=1}^{t^{**}_n} t\,\mathbb{P}\big(t = \arg\min_{t'}\widehat\Delta^{T_n,\tau_n}_{t'}\big) - B^-_n \le 0.
\]
Then it is possible to construct a similar bound,
\[
\frac{|B^+_n - B^-_n|}{T_n} \le T_n\,\mathbb{E}\bigg(\big(\tilde\Delta^r_n\big)^{\lambda}\,\mathbb{E}\Big(\big(\tilde\Delta^{\gamma+r}_n\big)^{-\lambda}\,\Big|\,\tilde\Delta^r_n\Big)\bigg),
\]
which converges to zero at infinity.

Alternative Methods. One of the negative properties of the method described in the previous chapter is the required specification of the bandwidth $\tau$, which prevents change point detection in the proximity of the limit points $1$ and $T$. However, as the space of nonuniformities $\mathcal{D}$ is a metric space, it is possible to determine a change point by maximizing the distance between two empirical measures, $\mathrm{dist}(\hat\theta) = \|\hat\mu^+_{\hat\theta} - \hat\mu^-_{\hat\theta}\|_{\mathcal{D}}$, where
\[
\hat\mu^-_{\hat\theta} = \frac{1}{\hat\theta}\sum_{t=1}^{\hat\theta}\delta_{Y_t}, \qquad \hat\mu^+_{\hat\theta} = \frac{1}{T-\hat\theta}\sum_{t=\hat\theta+1}^{T}\delta_{Y_t}. \tag{13}
\]
Then, by definition of the norm, the squared distance is computed as
\[
\mathrm{dist}^2\big(\hat\theta\big) = \frac{1}{\hat\theta^2}\sum_{n,m=1}^{\hat\theta} K(Y_n,Y_m) - \frac{2}{\hat\theta(T-\hat\theta)}\sum_{n=1}^{\hat\theta}\sum_{m=\hat\theta+1}^{T} K(Y_n,Y_m) + \frac{1}{(T-\hat\theta)^2}\sum_{n,m=\hat\theta+1}^{T} K(Y_n,Y_m),
\]
leading to a change point estimate $\hat\theta = \arg\max_{\tilde\theta}\mathrm{dist}(\tilde\theta)$. As the empirical measures of (13) will converge to some points of the interval $[R_\#P_1, R_\#P_2]$ as $T_n$ goes to infinity, the estimate $\hat\theta_n$ converges to the true value $\theta$ in probability. Change point detection methods of this form were explored in [21], so we will not explore them in further depth.
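The three-term expansion of $\mathrm{dist}^2$ can be scanned over all candidate split points. The sketch below is illustrative only: it uses a Gaussian kernel as a stand-in for the paper's discrepancy kernels, and the data, bandwidth and split range are our own arbitrary choices.

```python
import numpy as np

def gauss_gram(A, B, h=1.0):
    # pairwise Gaussian kernel matrix; a stand-in for the kernel K of the text
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * h * h))

def split_dist_sq(Y, theta):
    # dist^2(theta): squared D-norm distance between the empirical measures
    # of Y[:theta] and Y[theta:], via the three-term kernel expansion
    A, B = Y[:theta], Y[theta:]
    return (gauss_gram(A, A).mean()
            - 2 * gauss_gram(A, B).mean()
            + gauss_gram(B, B).mean())

rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(5.0, 1.0, (50, 2))])   # true change point at t = 50
thetas = np.arange(10, 91)
theta_hat = int(thetas[np.argmax([split_dist_sq(Y, t) for t in thetas])])
print(theta_hat)  # expected near the true split t = 50
```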
The important properties of this statistic are that it can still be computed in $O(T^2)$ time and that $\mathrm{dist}(\hat\theta)$ is a U-statistic.

The geometric idea of (12) suggests that the value
\[
\varsigma\big(\hat\theta\big) = \frac{\hat\theta}{T} - \frac{\|\hat\mu^+_{\hat\theta}\|_{\mathcal{D}}}{\|\hat\mu^-_{\hat\theta}\|_{\mathcal{D}} + \|\hat\mu^+_{\hat\theta}\|_{\mathcal{D}}}
\]
approaches zero as the data size grows to infinity iff $\hat r$ approaches the true ratio $r$. Thus, if a change point exists, it can be estimated as $\hat\theta = \arg\min_{\tilde\theta}|\varsigma(\tilde\theta)|$. Alternatively, one can compute $\hat r_m$ by applying the iteration
\[
\hat r_m = \frac{\|\hat\mu^+_{\hat r_{m-1}}\|_{\mathcal{D}}}{\|\hat\mu^-_{\hat r_{m-1}}\|_{\mathcal{D}} + \|\hat\mu^+_{\hat r_{m-1}}\|_{\mathcal{D}}},
\]
where $\hat\mu^+_{\hat r_{m-1}}$ and $\hat\mu^-_{\hat r_{m-1}}$ are the empirical measures corresponding to the estimate obtained at the previous iteration. The initial value $\hat r_0$ can be selected to be equal to $1/2$.

Multiple Change Points. In this chapter the situation of $K$ possible consecutive change points $\theta = (\theta_1,\ldots,\theta_K)$ in the data is considered. Now $r$ denotes a list of positive values
\[
r = (r_1,\ldots,r_K) = \bigg(\frac{\theta_1}{T}, \frac{\theta_2-\theta_1}{T}, \ldots, \frac{\theta_K-\theta_{K-1}}{T}\bigg),
\]
which can be understood as the first $K$ coefficients in the convex combination of $K+1$ probability distributions $P_1,\ldots,P_{K+1}$. That is, define
\[
\mu_r = \sum_{k=1}^{K} r_k P_k + \bigg(1 - \sum_{k=1}^{K} r_k\bigg) P_{K+1}.
\]
Furthermore, the estimates $\hat\theta$ and $\hat r$ are treated as lists of the corresponding structure. By definition of the vector rank function, $R_\#\mu_r = U^d$. However, in order to apply methods similar to the ones developed in the previous chapter we need one more property of the model.

Definition 9. $K$ probability measures $P_1,\ldots,P_K$ are said to be convexly independent if for all lists of $K$ coefficients $\lambda\in\mathbb{R}^K_+$ such that $\sum_{k=1}^{K}\lambda_k = 1$, for each $i$ the equality $P_i = \sum_{k=1}^{K}\lambda_k P_k$ holds only if $\lambda_j = \delta_{i,j}$ for each $j$, where $\delta_{i,j}$ is the Kronecker delta.

Assume that the true value of $K$ belongs to a collection $\mathcal{K}$.
Then, by construction, $\widehat K_{\mathrm{SMA}} \le K$ with probability at least $1-\alpha$. It can be postulated that the zero of $\mathcal{D}$ lies in the convex hull of the $R_\#P_i$, that is, $0 \in \mathrm{conv}\{R_\#P_k\}_{k=1}^{K+1}$. This suggests that there are projections of $0$ onto the edges of the convex polytope, $\pi_k \in [R_\#P_k, R_\#P_{k+1}]$, which minimize $\|\pi_k\|_{\mathcal{D}}$. Hence, if $\tau$ is taken to be small enough, then local minima of $\widehat\Delta^{T,\tau}$ will occur in the proximity of the change points with high probability.

The problem with this method is that the proportions of points from the different sides of a change point at the local minima are given by the relation
\[
\pi_k = \frac{\|R_\#P_{k+1}\|^2_{\mathcal{D}} - \langle R_\#P_k, R_\#P_{k+1}\rangle_{\mathcal{D}}}{d^2_{\mathcal{D}}(R_\#P_k, R_\#P_{k+1})}\,R_\#P_k + \frac{\|R_\#P_k\|^2_{\mathcal{D}} - \langle R_\#P_k, R_\#P_{k+1}\rangle_{\mathcal{D}}}{d^2_{\mathcal{D}}(R_\#P_k, R_\#P_{k+1})}\,R_\#P_{k+1},
\]
which cannot be recovered from the diphoragram. For this reason we propose an iterative procedure. Let $\hat\mu^n_{\hat\theta_k,\hat\theta_{k+1}}$ denote the empirical measure of the points from the interval bounded by the change point estimates $[\hat\theta_k]_n$ and $[\hat\theta_{k+1}]_n$.

1. Start with $\mathcal{T}_1 = \{1,\ldots,T-\tau\}$.
2. For each $k \le K$ select $t^*_k = \arg\min_{t\in\mathcal{T}_k}\tilde\Delta^{T,\tau}_t$ and update $\mathcal{T}_{k+1} = \{t\in\mathcal{T}_k : |t-t^*_k| > \tau\}$.
3. Make the initial change point estimation with the blind adjustment $[\hat\theta_k]_0 = t^*_k + \tau/2$, with the $(n+1)$-th readjustment being $[\hat\theta_k]_{n+1} = t^*_k + [\hat\lambda_k]_n\,\tau$, where
\[
[\hat\lambda_k]_n = \frac{\|\hat\mu^n_{\hat\theta_k,\hat\theta_{k+1}}\|^2_{\mathcal{D}} - \langle\hat\mu^n_{\hat\theta_{k-1},\hat\theta_k}, \hat\mu^n_{\hat\theta_k,\hat\theta_{k+1}}\rangle_{\mathcal{D}}}{d^2_{\mathcal{D}}\big(\hat\mu^n_{\hat\theta_{k-1},\hat\theta_k}, \hat\mu^n_{\hat\theta_k,\hat\theta_{k+1}}\big)},
\]
with surrogate change points $[\hat\theta_0]_n = 0$ and $[\hat\theta_{K+1}]_n = T$.

In order to approach the problem of model misspecification we apply the smallest accepted model (SMA) methodology. For a collection of candidate values $\widehat K$ we acquire the minimizing time points $t^*$.
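Steps 1 and 2 of the procedure above amount to greedily selecting local minimizers of the diphoragram while excluding a $\tau$-neighbourhood of each point already chosen. A minimal sketch; the function name and the toy diphoragram are ours, not the paper's:

```python
import numpy as np

def select_minimizers(delta, tau, K):
    # greedily pick up to K minimizers t*_k of the diphoragram delta,
    # removing a tau-neighbourhood of each pick before the next search
    candidates = np.arange(len(delta))
    picks = []
    for _ in range(K):
        if candidates.size == 0:
            break
        t_star = candidates[np.argmin(delta[candidates])]
        picks.append(int(t_star))
        candidates = candidates[np.abs(candidates - t_star) > tau]
    return sorted(picks)

# toy diphoragram with two dips; the blind adjustment of step 3
# would then estimate each change point as t*_k + tau/2
delta = np.ones(100)
delta[20], delta[70] = 0.1, 0.2
print(select_minimizers(delta, tau=10, K=2))  # [20, 70]
```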
Then test for change points in the intervals bounded by $\hat\theta_i$ and $\hat\theta_{i+1}$, with surrogate values as above, by computing p-value approximations. Thus the model can be estimated while new local minimizers are being recovered, and the process can be stopped once the minimal accepted value of $\widehat K$ has been achieved.

Computational Results. In order to conduct computational experiments, the methods discussed in the previous chapters were implemented in the Python programming language with use of the numpy and scipy libraries. We used the sobol_seq package in order to generate Sobol sequences.

Figure 2: Example of a control chart for 1000 observations of 5-dimensional data with standard normal distribution without change points, computed by the star kernel with $\tau = 100$. The abscissa represents values of shifted time $t$ in a range from 1 to 900 and the ordinate stands for the value of $\widehat\Delta_t$. Successfully, no change points were detected. Note that the whole process $\widehat\Delta_t$ behaves as a martingale with 'small' expected value.

Simulations with Zero Change Points. For simulations with zero change points we are interested in measuring the statistical significance or confidence of the test statistic $T$, which can be understood as
\[
\mathrm{significance}(T) = \mathbb{P}(T \text{ rejects } H_0 \mid H_0 \text{ is true}), \qquad \mathrm{confidence}(T) = \mathbb{P}(T \text{ accepts } H_0 \mid H_0 \text{ is true}).
\]
In simulations with zero change points $H_0$ obviously is true. So, for a run of $n$ simulations we estimate confidence as
\[
\widehat{\mathrm{conf}}(T,n) = \frac{\#\{\text{simulations with no change points detected}\}}{n}.
\]
We run simulations without change points and vary a certain fixed parameter while measuring confidence and inverse $p$-value for each value of the parameter. The only nontrivial results were obtained for change in the data dimension $d$.

Figure 3: Change of mean inverse $p$-value with the growth of dimensionality. We ran 20 simulations for each value $d$ of dimension in the range from 1 to 12. Each simulation produced a sample of 200 standard normal variables and no change points. Then the hypothesis was tested with significance parameter $\alpha$ and $\tau = 30$.
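The Monte Carlo estimate $\widehat{\mathrm{conf}}(T,n)$ is a simple frequency over repeated no-change simulations. A schematic sketch, in which the `detector` and `sampler` callables are placeholders for the actual test and data generator, not the paper's implementation:

```python
import numpy as np

def estimate_confidence(detector, sampler, n_sims, seed=0):
    # fraction of no-change simulations in which the detector reports
    # no change points: the estimate conf-hat(T, n) from the text
    rng = np.random.default_rng(seed)
    clean = sum(1 for _ in range(n_sims) if not detector(sampler(rng)))
    return clean / n_sims

# placeholder pieces, for illustration only
sampler = lambda rng: rng.standard_normal((200, 5))   # H0 data: no change point
never_detect = lambda data: []                        # dummy detector
print(estimate_confidence(never_detect, sampler, 50))  # 1.0
```

The power estimate $\widehat{\mathrm{pow}}$ used later in the text is the complementary frequency computed over simulations that do contain a change point.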
Observe that the inverse $p$-value decreases as the dimension grows from 1 to 6, which can be understood as an increase in statistical confidence in the absence of change points; for larger dimensions it settles near a small value.

Figure 4: Example of a control chart for observations with a change point at $\theta = 1000$, where the distribution switches to a normal distribution with mean 5, computed by the star kernel with $\tau = 100$. The abscissa represents values of shifted time $t$ in a range from 1 to 1900 and the ordinate stands for the value of $\widehat\Delta_t$. Successfully, the change point was estimated at $\hat\theta = 1000$ with $t^* = 950$. Note that the whole process $\widehat\Delta_t$ behaves as a martingale with 'big' expected value at a distance from the change point and goes for a 'swoop' in the proximity of $\theta$.

Simulations with One Change Point. While running simulations with one change point we are naturally interested in the estimation of the statistical power of the test $T$, which can be understood as
\[
\mathrm{power}(T) = \mathbb{P}(T \text{ rejects } H_0 \mid H_0 \text{ is false}),
\]
and estimated for a run of $n$ simulations as
\[
\widehat{\mathrm{pow}}(T,n) = \frac{\#\{\text{simulations with change points detected}\}}{n}.
\]
We investigate the dependence of $\widehat{\mathrm{pow}}(T,n)$ on differences between the distributions on opposite sides of the change point.

Figure 5: Change of inverse p-value and statistical power with the growth of the difference in the mean. We ran 20 simulations for each value of the difference in the range from 0.2 to 1 with the step equal to 0.1. Each simulation contained a sample of 200 normally distributed variables with $d = 3$ and $\theta = 100$, with the corresponding difference in the mean of the distributions on both sides of the change point. The hypothesis was tested with significance parameter $\alpha$ and $\tau = 30$. The graph shows that the star kernel outperforms the symmetric kernel at all values of the difference.

Figure 6: Change of inverse p-value and statistical power with the growth of the difference in the variance. We ran 20 simulations for each value of the difference in the range from 2 to 10 with the step equal to 0.5.
Each simulation contained a sample of 200 normally distributed variables with $d = 5$ and $\theta = 100$, with the corresponding difference in the variance of the distributions on both sides of the change point. The hypothesis was tested with significance parameter $\alpha$ and $\tau = 30$. The graph shows that the star kernel outperforms the symmetric kernel at all values of the difference.

The figures show that the star discrepancy outperforms the symmetric discrepancy in detecting change both in expectation and in variance. However, empirical results in [18] indicate that in some situations the symmetric discrepancy may turn out to be a better tool.

In the situation of an existing change point, not only the power of the test is of interest but also the precision of the change point estimates. As was shown in the previous chapter, the bias of a change point estimate increases as the true ratio $r$ of distributions in the sample diverges from the value $1/2$. We provide computational illustrations:

Figure 7: Discrepancy diphoragram for 300 normal i.i.d. observations with $d = 3$ with a change in expectation at $\theta = 200$. The abscissa represents values of shifted time $t$ in a range from 1 to 250 and the ordinate stands for the value of $\widehat\Delta_t$ constructed with $\tau = 50$. The change point was detected at location $\hat\theta = 197$, which can be interpreted as a weak drift towards the midpoint. Note that the relation of the mean values in the martingale parts of $\widehat\Delta_t$ on the two sides of the change point corresponds to the nonuniformity space theory presented in the previous chapter.

Figure 8: Change of the error $\theta - \hat\theta$ and the absolute error $|\theta - \hat\theta|$ with the shift of the value of $r$. We ran 20 simulations for each value of the ratio, varied with a step equal to $1/60$. Each simulation contained a sample of $r\cdot 300 + (1-r)\cdot 300 = 300$ variables uniformly distributed in the $[-1,1]$ square with $d = 2$ and $\theta = r\cdot 300$. Change points were estimated with $\tau = 50$.
The graph shows that the error indeed grows with the distance from the midpoint and that the estimate has a strong drift towards $1/2$.

Simulations with Multiple Change Points. For the case of multiple change points we provide only an example with diphoragrams.

Figure 9: Discrepancy diphoragrams for 300 observations with $d = 3$. The sample has the distribution model $P_1 = N((3,3,3), I)$, $P_2 = U[10,\ldots]$ and $P_3 = N(-(3,3,3), I)$ with $\theta_1 = 120$ and $\theta_2 = 240$. The abscissa represents values of shifted time $t$ in a range from 1 to 250 and the ordinate stands for the value of $\widehat\Delta_t$ constructed with $\tau = 50$. Change points were detected at locations $\hat\theta_1 = 111$ and $\hat\theta_2 = 237$. However, after application of the iterative correction, the improved estimates $\hat\theta'_1 = 119$ and $\hat\theta'_2 = 241$ were acquired.

Examples with financial data. In order to provide examples which are not artificially constructed, we consider financial data similar to that in [31]. This data was acquired from the Data Library of Kenneth R. French. It contains mean monthly returns from five portfolios, each composed of one of five industrial sectors in the USA, which are:
(A) Finance,
(B) Manufacturing,
(C) Retail, wholesale and some services,
(D) Utilities,
(E) Other.
This provides a data dimension of $d = 5$ and a total number of observations of $T = 1091$, as they were recorded monthly from July of the year 1925 to May of the year 2017.

Figure 10: We run SMA-based change point estimation with the star kernel, bandwidth equal to 50 and the false alarm rate equal to 0.2. The procedure provides 5 change points, at March of 1934, April of 1955, April of 1964, December of 1984 and September of 1988.

In the original paper [31] parametric Bayesian inference was used to estimate nine and seven change points, and the total number of change points was only guessed and not inferred.
Our results differ significantly from those original results; however, a different time range was used.

Figure 11: By running the SMA-procedure with the symmetric kernel, bandwidth equal to 40 and the false alarm rate of 0.2, we get a less conservative estimate. The change points are February 1927, October 1940, July 1954, April 1964, January 1969, May 1976, May 1990 and July 2012.

It can be projected that all the above change points can be attributed to important events in economic history. For example, the change point at July 1954 can be related to the end of the recession of 1953, which itself can be explained by a change in the interrelations of the industrial sectors, leading to a change in the structure of the observed distribution. Hence, we can describe the performance of the SMA-procedure as satisfactory.

Conclusion and Discussion. As a result of this work, a collection of change point detection methods was developed. All these methods are based on the interaction of vector ranks and geometric discrepancy, which is a novel result. Certain basic consistency results were proved for the method in its basic form designed for detecting a single change point. However, the methods can also be applied for detecting and estimating multiple change points.

The computational results show applicability of the method both for simple artificial change point problems and for problems concerning actual data from applications. The resulting experience strongly indicates that the method is much more potent in situations where the distributions are not concentric. Empirical evidence for our theoretical findings was also observed. These theoretical results include the expression of the ratio on the different sides of a change point through the inner product in the Hilbert space $\mathcal{D}$.
Another theoretical result concerns the dependence of the estimate's bias on the ratio in which the change point separates the sample. Another positive result is the discovery of the Hilbert space structure of $\mathcal{D}$, which can be used for establishing alternative iterative procedures defined by relations in this Hilbert space. Furthermore, this Hilbert space structure on empirical measures can be used for proving more theoretical results in the future.

Better proofs which cover ergodic processes and provide exact rates of convergence still need to be worked out for the methods. Furthermore, concentration results for change point estimates might render these algorithms interesting for practical applications. However, they are absent in the current work.

Finally, an online version of the method can be implemented. The only requirement for such an algorithm is fast online computation of the optimal assignment problem. It can be projected that such an algorithm can be derived from the Sinkhorn-distance regularized algorithm designed by Cuturi [7].

References

[1] Ilya A. Antonov and V. M. Saleev. "An economic method of computing LP-tau sequences". In: USSR Computational Mathematics and Mathematical Physics.
[2] A. A. Borovkov. In: Theory of Probability & Its Applications.
[3] Boris Brodsky and Boris Darkhovsky. Nonparametric Methods in Change Point Problems. Vol. 243. Springer Science & Business Media, 1993.
[4] Victor Chernozhukov et al. Monge-Kantorovich depth, quantiles, ranks, and signs. Version 4. 2014. arXiv preprint.
[5] Christine Choirat and Raffaello Seri. "The asymptotic distribution of quadratic discrepancies". In: Monte Carlo and Quasi-Monte Carlo Methods 2004. Springer, 2006, pp. 61–76.
[6] Miklós Csörgő and Lajos Horváth. "20 Nonparametric methods for change-point problems". In: Handbook of Statistics.
[7] Marco Cuturi. "Sinkhorn distances: Lightspeed computation of optimal transport". In: Advances in Neural Information Processing Systems (2013), pp. 2292–2300. arXiv preprint.
[8] Alexis Decurninge. Multivariate quantiles and multivariate L-moments. Version 2. 2014. arXiv preprint.
[9] A. Dvoretzky, Jack Kiefer, Jacob Wolfowitz, et al.
"Sequential decision problems for processes with continuous time parameter. Testing hypotheses". In: The Annals of Mathematical Statistics.
[10] M. A. Girshick and Herman Rubin. In: The Annals of Mathematical Statistics (1952), pp. 114–125.
[11] Fred Hickernell. "A generalized discrepancy and quadrature error bound". In: Mathematics of Computation of the American Mathematical Society.
[12] Fred Hickernell. Error Analysis for Quasi-Monte Carlo Methods. Version 1. 2017. arXiv preprint.
[13] Edmund Hlawka. "Discrepancy and Riemann integration". In: Studies in Pure Mathematics. New York.
[14] Leonid Kantorovich. In: Dokl. Akad. Nauk SSSR. Vol. 37. Moscow, 1942, pp. 227–229.
[15] Yoshinobu Kawahara and Masashi Sugiyama. "Sequential change-point detection based on direct density-ratio estimation". In: Statistical Analysis and Data Mining.
[16] In: Signal Processing.
[17] In: Naval Research Logistics Quarterly.
[18] In: Mathematics of Computation.
[19] NSF-CBMS Regional Conference Series in Probability and Statistics. JSTOR, 1995, pp. i–163.
[20] Alexandre Lung-Yut-Fong, Céline Lévy-Leduc, and Olivier Cappé. Homogeneity and change-point detection tests for multivariate data using rank statistics. Version 3. 2011. arXiv preprint.
[21] David S. Matteson and Nicholas A. James. "A nonparametric approach for multiple change point analysis of multivariate data". Version 2. In: Journal of the American Statistical Association.
[22] Gaspard Monge. "Mémoire sur la théorie des déblais et des remblais". In: Histoire de l'Académie Royale des Sciences de Paris, avec les Mémoires de Mathématique et de Physique pour la même année. De l'Imprimerie Royale, 1781, pp. 666–704.
[23] Vito M. R. Muggeo and Giada Adelfio. "Efficient change point detection for genomic sequences of continuous measurements". In: Bioinformatics.
[24] In: Bulletin of the American Mathematical Society.
[25] Moshe Pollak. In: The Annals of Statistics (1985), pp. 206–227.
[26] Filippo Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 2015.
[27] Robert Serfling. "Quantile functions for multivariate analysis: approaches and applications".
In: Statistica Neerlandica.
[28] Walter A. Shewhart. In: Journal of the American Statistical Association.
[29] Albert N. Shiryaev. In: Theory of Probability & Its Applications.
[30] In: Preprint IPM Akad. Nauk SSSR.
[31] In: Journal of the Royal Statistical Society: Series B (Statistical Methodology).
[32] In: Networks.
[33] Cédric Villani. Optimal Transport: Old and New. Vol. 338. Springer Science & Business Media, 2008.
[34] Hermann Weyl. "Über die Gleichverteilung von Zahlen mod. Eins". In: Mathematische Annalen (1916).