Multiple testing with persistent homology
Mikael Vejdemo-Johansson, Sayan Mukherjee∗

November 11, 2019
Abstract
Multiple hypothesis testing requires a control procedure. Simply increasing the number of simulations or permutations to meet a Bonferroni-style threshold is prohibitively expensive. In this paper we propose a null-model-based approach to testing for acyclicity, coupled with a Family-Wise Error Rate (FWER) control method that does not suffer from these computational costs. We adapt a False Discovery Rate (FDR) control approach to the topological setting, and show it to be compatible both with our null model approach and with previous approaches to hypothesis testing in persistent homology. By extending a limit theorem for persistent homology on samples from point processes, we provide theoretical validation for our FWER and FDR control methods.
Introduction

Hypothesis testing based on topological summaries of data is an area of Topological Data Analysis (TDA) that has seen recent growth, as both applied and mathematical statistics have been developed using TDA. Almost all of the current literature on hypothesis testing in TDA has focused on two-sample tests [22] or extensions to analysis of variance (ANOVA) settings [8], where differences across more than two conditions are considered. Neither of these papers takes multiple testing into account, because the number of hypotheses tested is small, for example one in a two-sample test. However, as the number of groups in an ANOVA increases, multiple testing becomes a concern. In addition, there are many applications where TDA can be applied to many subsets of features or coordinates in a two-sample test, with the goal of finding those subsets which are significantly different between the two groups. When the number of subsets of features runs into the hundreds or thousands, correction for multiple hypothesis testing is crucial.

In this paper we propose a methodology for multiple hypothesis testing (MHT) that addresses both the statistical and the computational concerns that arise. The main computational issue we address is how to efficiently compute a null distribution that can be used in the multiple hypothesis setting.

∗ On behalf of all authors, the corresponding author states that there is no conflict of interest. MVJ would like to acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. SM would like to acknowledge partial funding from HFSP RGP005, NSF DMS 17-13012, NSF BCS 1552848, NSF DBI 1661386, NSF IIS 15-46331, NSF DMS 16-13261, as well as high-performance computing partially supported by grant 2016-IDG-1013 from the North Carolina Biotechnology Center. The simulated data used in this paper are available from Figshare, DOI 10.6084/m9.figshare.10262507.
The key computational idea is that for barcodes one can efficiently compute an empirical null distribution via simulation that is valid modulo normalization across dimension (rank), and this empirical distribution can be used for all of the multiple hypotheses. In contrast, current methodologies rest on applying either permutations or bootstrap sampling to each test individually, which for thousands of tests can be infeasible. The barcode statistics we consider for generating the null distribution include, as examples, the symmetric barcode polynomial [1] and a symmetric tropical barcode polynomial [18].

Our proposed method for controlling MHT uses z-score normalizations to produce comparable quantities from different hypothesis tests. We show that under reasonable assumptions the z-score distributions converge to a distribution dependent only on the underlying null model. It follows that the z-scores are themselves comparable.

The statistical contributions of this paper are in how to use the empirical null distribution for controlling family-wise error rates and false discovery rates, as well as a theoretical foundation for the proposed hypothesis testing method.

Our interest in MHT originates with seeking statistical methods for testing whether a given point cloud is sufficiently acyclic, that is, whether the point cloud is expected to have essentially no persistent homology. Testing for persistence across hundreds or thousands of point clouds makes MHT quickly relevant. Our approach picks a null model that captures what essentially no persistent homology might look like for a given size and density of a point cloud, and then approaches the question using simulation testing: comparing a barcode statistic of the observed point cloud to barcode statistics calculated on a collection of randomly generated point clouds from the null model.

In this paper we propose and approach two fundamental problems:

Problem 1 (Family-wise error rate corrected acyclicity testing).
Given a collection of point clouds, determine if any one of the point clouds has more persistent homology than should be expected from random data. Provide assurances that, with probability 1 − α, the proposed method will not produce a single false positive among all the tested point clouds.

Problem 2 (False discovery rate corrected acyclicity testing). Given a collection of point clouds, determine if any one of the point clouds has more persistent homology than should be expected from random data. Provide assurances that false positives occur at a rate of at most α.

TDA Background Theory
In this paper, we concern ourselves with the statistical behavior of persistent homology.

An abstract simplicial complex is a collection Σ of subsets, called simplices, of a set V of vertices, such that all subsets τ ⊂ σ ∈ Σ of a simplex are also simplices in Σ. A simplex with d + 1 vertices is said to be d-dimensional. To a simplicial complex is associated the chain complex: a graded vector space C_∗Σ with basis elements corresponding to the simplices in Σ. The grading is provided by the dimension of the corresponding simplex. The chain complex is equipped with a linear boundary map

∂([v_0, ..., v_d]) = Σ_{i=0}^{d} (−1)^i [v_0, ..., v_{i−1}, v_{i+1}, ..., v_d].

The homology H_∗Σ is defined as ker ∂ / img ∂. For a comprehensive introduction to homology we refer to Hatcher [14].

A simplicial complex is filtered if it decomposes into a sequence of inclusions Σ_s ↪ Σ_t for s < t. Persistent homology provides a way of gluing together the individual homology vector spaces H_∗Σ_t into a globally consistent structure PH_∗Σ_∗. In this structure, the bases for each H_∗Σ_t are chosen to be compatible: each basis element emerges at some t_b and is preserved through all parameter values up to some t_d. The persistent homology is often represented as a persistence diagram, a multiset in the plane with each basis element represented as its pair of birth and death times: {(t_b, t_d)}.

The class of persistence diagrams admits several interesting metrics. The most commonly used in Topological Data Analysis is the bottleneck distance. Each diagram is a multiset of points in the extended plane where, in addition to the finitely many off-diagonal points, each diagram includes copies of all points on the diagonal. The diagonal is needed so that bijections can be defined between diagrams that have different numbers of off-diagonal points. For a pair
X, Y of persistence diagrams, the bottleneck distance is defined as

d_B(X, Y) = inf_φ sup_x ‖x − φ(x)‖_∞,

where φ ranges over the set of all possible bijections between X and Y. We define the bottleneck norm of a persistence diagram X as the bottleneck distance to the empty diagram, ‖X‖_B = d_B(X, {}). The bottleneck norm is generated by the point in X with the longest distance to the diagonal: a short calculation reveals ‖X‖_B = max(t_d − t_b)/√2.

The idea that topological summaries such as persistence diagrams form a probability space for which formal statistical analysis is well defined was developed in [19]. Further developments on defining useful summary statistics within persistent homology, and on considering means, medians, and variances of persistence diagrams, were pursued in several papers [20, 23, 25]. The main challenge in considering persistence diagrams as a probability space was pointed out in [23, 24]: the space of persistence diagrams is positively curved, which results in non-unique geodesics. As a result, the mean of a set of diagrams need not be unique, which complicates data analysis. To avoid this issue, persistence landscapes were introduced in [7]; persistence landscapes are functions, so they can be considered as random functions in a Banach space, a construction that admits central limit theorems, unique means, and medians. Further examination of bootstrap properties of persistence-based summaries, as well as a notion of confidence intervals for points in a diagram, was developed in [11, 13].
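As a concrete illustration (our own minimal sketch, not part of the paper), the bottleneck norm of a finite diagram reduces to the longest bar, scaled by the Euclidean distance to the diagonal:

```python
import math

def bottleneck_norm(diagram):
    """Bottleneck norm of a persistence diagram, given as a list of
    (birth, death) pairs: the distance from the farthest point to the
    diagonal, max (t_d - t_b) / sqrt(2)."""
    if not diagram:
        return 0.0
    return max(d - b for b, d in diagram) / math.sqrt(2)

# The norm is driven entirely by the longest bar:
print(bottleneck_norm([(0.0, 1.0), (0.2, 0.5)]))  # 1/sqrt(2), about 0.7071
```

Because only the longest bar matters, this invariant is cheap to extract once a diagram has been computed.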
An alternative approach was considered in a series of papers where, instead of considering a persistence diagram as a summary, a probability density was used as a topological summary, an approach called distance to measure [9, 10, 11].

In the context of hypothesis testing, [4] proposed using goodness-of-fit statistics (Kolmogorov-Smirnov, χ², or Mann-Whitney) to compare empirical distributions from two samples of persistence diagrams. The ideas most closely related to the procedures we develop in this paper were to define hypothesis testing procedures directly on persistence diagrams using permutation testing and barcode distances [8, 22]. In this paper we will extend two-sample single hypothesis testing and ANOVA procedures to the multiple hypothesis test setting.

The multiple testing procedures we propose in this paper will work for any valid null model that admits either an empirical or analytic null distribution. The particular null model we consider in this paper is motivated by our problem of testing whether the cycles seen in a data set arise from structure in the data or are random artifacts of sampling. Our choice of a null model is predicated on satisfying the objectives that the null distribution should be efficient to compute and reflective of random data. The particular null model we consider is that random samples are drawn uniformly and independently in a box, with a fixed number n of observations.

The decision to select a uniform distribution for the null model rests on theoretical analysis showing that persistence intervals for simplicial complexes with a uniform distribution as the sampling model will be short [5].
In contrast, it was seen in [2] that using the multivariate normal distribution as the sampling model for the data results in quite large persistence intervals, typically arising from the tails of the distribution.

The specific model we consider is: given a fixed dimension N and a bounding box defined by an interval [a_i, b_i] for each coordinate i = 1, ..., N, we draw n points from a Poisson process with rate parameter λ = ∏_i |b_i − a_i|, conditional on sampling n points in the set [a_1, b_1] × ··· × [a_N, b_N]. Here n is the number of observations in the point clouds that will be compared to the null model. The parameters {a_i, b_i} are estimated from the data {x_1, ..., x_n} via the following uniformly minimum variance unbiased estimators:

â_i = min_i − (max_i − min_i)/(n − 1),    b̂_i = max_i + (max_i − min_i)/(n − 1),

where min_i is the minimum value for the i-th coordinate and max_i is the maximum value. In summary, our null model is a standard spatial model: a conditional Poisson spatial process over the box [a_1, b_1] × ··· × [a_N, b_N] [21]. One could consider the convex hull of the data rather than a bounding box; however, estimation of the convex hull is not as efficient, and an unbiased estimator is much more complex.

One may be faced with a need to perform statistical tests several times to deal with a single research question: for example, given two classes and thousands of sets of features, one may want to know which subsets of features differ between the classes. In this problem of simultaneous testing, individual significance levels compound, so that the probability of a false discovery across all the tests ends up potentially far higher than the level chosen for each test.
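The null model described above (estimate a bounding box from the data, then draw n points uniformly and independently inside it) can be sketched as follows; the helper names are ours, and the per-coordinate endpoint correction assumes the unbiased min/max-based estimator stated earlier:

```python
import random

def estimate_box(points):
    """Estimate the bounding box [a_i, b_i] per coordinate from n sample
    points, using the unbiased estimator based on per-coordinate min/max."""
    n = len(points)
    dims = len(points[0])
    box = []
    for i in range(dims):
        lo = min(p[i] for p in points)
        hi = max(p[i] for p in points)
        pad = (hi - lo) / (n - 1)  # unbiased endpoint correction for uniform samples
        box.append((lo - pad, hi + pad))
    return box

def draw_null(box, n, rng=random):
    """Draw n points uniformly and independently in the estimated box:
    a Poisson process conditioned on having exactly n points."""
    return [tuple(rng.uniform(a, b) for a, b in box) for _ in range(n)]

data = [(0.1, 0.2), (0.9, 0.4), (0.5, 0.8)]
sim = draw_null(estimate_box(data), len(data))
```

Each simulated point cloud has the same size and (estimated) support as the observed one, which is what makes per-cloud null simulation meaningful.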
A typical example of when this type of repeated testing comes up is in the second step of an ANOVA-type analysis, such as the one in [8]: after identifying that not all groups are equal, pairwise comparisons are used to determine which groups do differ.

In the classical Neyman-Pearson paradigm for hypothesis testing, data is used to distinguish between two possible collections of probability distributions, called the null hypothesis and the alternative hypothesis. A statistical test calculates some statistic from the data, and based on that statistic either rejects the null or fails to reject the null.

In multiple hypothesis testing, arising from ANOVA-type analyses such as in [8] or two-sample comparisons on multiple features, a large collection of pairs of hypotheses are tested, some rejected and some not. The fundamental problem is whether the observed rejections are in fact demonstrating a global phenomenon of note, or an artifact of taking random samples.

We introduce the following notation to discuss the different rates and quantities involved. We perform m different hypothesis tests, of which m_0 follow the null as the true distribution and m_1 have the alternative as true. We reject R and accept W of the tests.

              Accept null   Reject null   Total
Null true          U             V          m_0
Null false         T             S          m_1
Total              W             R          m
The quantities U and S measure the frequency with which we accept or reject the null correctly. The error quantity V counts false discoveries (false positives, type I errors) and T counts missed discoveries (false negatives, type II errors).

Rejecting a single hypothesis is typically parametrized by its level α, the probability of a false rejection under the null. The family-wise error rate is defined to be FWER = P(V > 0). The false discovery rate is defined as the expected ratio of false discoveries among all discoveries: FDR = E[V/R], with the convention that if there are no rejections at all, the ratio is set to 0.

Controlling for false discovery rates admits that false discoveries will be made as a matter of course, and aims to keep the rate of false discoveries at the significance level chosen, rather than keeping the probability of making any false discovery at all below the significance level.
Classical FWER control methods adjust the cutoff at which each hypothesis is rejected. The simplest is the Bonferroni correction [6]: if the rejection events were disjoint between all tests, then their probabilities would add. Hence, rejecting at a level of α_FWER = α/m will achieve the requested probability of having any false discoveries at all.

The Bonferroni correction rests on the unrealistic worst case of disjoint rejection events and is known to be overly conservative. There are elaborations on this: Holm's step-down and Hochberg's step-up procedures [16, 17] are commonly used; these too work with quantities k · α/m as rejection thresholds.

When using any of these in a permutation or simulation setting, the number m of simultaneous tests can drive up the number of permutations or simulations required for an acceptable test level dramatically. If computations are expensive, as with persistence diagrams or with bottleneck distances, this quickly becomes prohibitive.

In a previous preprint [26] we proposed a method for multiple testing against the uniform null model that controls the family-wise error rate. Most interesting persistence statistics vary with the overall scale of the point cloud; different point clouds produce statistics that usually are not immediately comparable. If they were, however, we could detect a deviation from the null model behaviour through the existence of a particularly large value for the corresponding barcode-based statistics. We can produce a joint test by first making the statistics comparable, and then performing a simulation test where in each simulation step the largest statistic value across simulated representatives for all the point clouds is extracted.

The approach is rooted in the observation that, having computed test statistics t_1, ..., t_m from each test separately,

FWER = P(V > 0) = P({t_1 > c} ∪ ··· ∪ {t_m > c} | H_0) = P(max_i t_i > c | H_0).

We need no distributional assumptions as long as the null hypothesis sampling distributions are comparable across all test cases.
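For reference, the classical corrections mentioned above can be sketched as follows (a minimal illustration; the function names are ours):

```python
def bonferroni_reject(pvals, alpha):
    """Reject each hypothesis whose p-value clears the Bonferroni
    threshold alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm_reject(pvals, alpha):
    """Holm's step-down procedure: sort p-values, compare the k-th
    smallest against alpha / (m - k), and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject

pvals = [0.001, 0.02, 0.04, 0.3]
print(bonferroni_reject(pvals, 0.05))  # [True, False, False, False]
print(holm_reject(pvals, 0.05))        # [True, False, False, False]
```

Both procedures need p-values resolved to a granularity of roughly α/m, which is what drives up the simulation counts discussed above.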
We will go further into how to build a test statistic for which this assumption is reasonable in Section 5. Based on this we propose the following approach.

Method 1 (Family-wise error rate controlled test for acyclicity). Given a family of point clouds X_1, ..., X_K, an invariant γ : {point clouds} → R, and a null model M of random point clouds, we may reject the null hypothesis of acyclicity in favor of non-acyclicity by:

1. Draw M_i^1, ..., M_i^{N−1} from M for each i ∈ [1, K].
2. Compute all ỹ_i^j = γ(M_i^j) and x̃_i = γ(X_i).
3. For each i ∈ [1, K], use the ỹ_i^j to create a standardization method (i.e., calculate mean and standard deviation for the studentization, or calculate the empirical CDF for histogram equalization), and standardize all ỹ_i^j to y_i^j and x̃_i to x_i.
4. For each j ∈ [1, N−1], calculate y^j = max_i y_i^j. Calculate x = max_i x_i.
5. Compute the rank r of x among x together with all the y^j.

We may then reject the null hypothesis at a level of p = (N − r + 1)/N.

For false discovery rate control, we seek to control q_FDR = E[V/R]: the proportion of false discoveries among all the rejected null hypotheses. By convention, if R = 0 then q_FDR = 0.

Simulated (or permuted) point clouds or diagrams simulate a null model, and thus allow us to estimate V, so that we can calculate a cutoff that achieves the false discovery rate we want. Just as in Section 4.1, our method relies only on the test statistics being exchangeable, that is, on their conditional distributions all being equal.

We can estimate both V and R from data and from simulations: V/N is estimated by the rate of rejections in the null model, while R/N is estimated by the rate of rejections in the data. Their ratio estimates (V/N)/(R/N) = V/R. Based on this we propose the following approach.

Method 2 (False discovery rate controlled test for acyclicity). Given a family of point clouds X_1, ..., X_K, an invariant γ : {point clouds} → R, and a null model M of random point clouds, we may reject the null hypothesis of acyclicity in favor of non-acyclicity by:

1. Draw M_i^1, ..., M_i^{N−1} from M for each i ∈ [1, K].
2. Compute all ỹ_i^j = γ(M_i^j) and x̃_i = γ(X_i).
3. For each i ∈ [1, K], use the ỹ_i^j to create a standardization method (i.e., calculate mean and standard deviation for the studentization, or calculate the empirical CDF for histogram equalization), and standardize all ỹ_i^j to y_i^j and x̃_i to x_i.
4. Rank the x_i to form a sorted sequence x_(i).
5. For each x_(i) = c_i, calculate

   %V(c_i) = #{y_i^j ≥ c_i} / (K(N − 1)),
   %R(c_i) = #{x_i ≥ c_i} / K,
   q̂_FDR(c_i) = %V(c_i) / %R(c_i).
6. Pick the smallest c_i such that q̂_FDR(c_i) ≤ α for the chosen level α.
7. Reject all null hypotheses corresponding to x_(j) ≥ c_i.

If all q̂_FDR(c_i) ≥ α, then min_i q̂_FDR(c_i) is as good a false discovery rate as is attainable.

In [22], Robinson and Turner propose a hypothesis test for persistent homology. In their setup, two groups of persistence diagrams D_{1,1}, ..., D_{1,n}, D_{2,1}, ..., D_{2,m} are sampled, and using a loss function built from within-group p-Wasserstein distances,

F_{p,q}({D_{1,i}}, {D_{2,j}}) = 1/(2n(n−1)) Σ_i Σ_j d_p(D_{1,i}, D_{1,j})^q + 1/(2m(m−1)) Σ_i Σ_j d_p(D_{2,i}, D_{2,j})^q,

they propose a permutation test: the membership in group 1 or 2 is repeatedly permuted and the loss function computed for each permutation. The rank of the loss for the original division produces a p-value estimate for the test.

In a follow-up paper, Cericola et al. [8] propose an extension of Robinson and Turner's two-sample test to an ANOVA-style hypothesis that several groups

D_{1,1}, ..., D_{1,n_1}, ..., D_{h,1}, ..., D_{h,n_h}

are all equal. This paper suggests that after rejecting this type of null hypothesis, we can use Robinson and Turner style testing pairwise on the diagram groups to locate the discrepancies.

Neither of these two papers mentions how to correct for the intrinsic multiple testing. We assume that a total of h two-sample tests are being performed. Hence we are given 2h groups of persistence diagrams

D_{1,1}, ..., D_{1,n_1}, D′_{1,1}, ..., D′_{1,m_1}, ..., D_{h,1}, ..., D_{h,n_h}, D′_{h,1}, ..., D′_{h,m_h}.

Following the basic philosophy used in Section 4.2, we would calculate:

1. X_k = F_{p,q}({D_{k,i}}, {D′_{k,j}}), the K different observed losses.
2. Y_k^ℓ, the (N − 1) permuted losses for a collection of permutations for each test pair.
3. For each X_(k) in sorted order of the X_k values, the estimates

   %V(X_(k)) = #{Y_{k′}^ℓ ≤ X_(k)} / (K(N − 1)),
   %R(X_(k)) = #{X_{k′} ≤ X_(k)} / K,
   q̂_FDR(X_(k)) = %V(X_(k)) / %R(X_(k)).

As a cutoff that guarantees a specified FDR q, pick X_c = max{X_(k) : q̂_FDR(X_(k)) ≤ q}. Reject all hypotheses with X_k ≤ X_c. If no such maximum exists, then min_k q̂_FDR(X_(k)) is the smallest achievable FDR with the observed data.

Many invariants of persistence bars vary with the overall scale of the point cloud, so the invariants are not immediately comparable. From a null model and repeated simulations, however, we can compute empirical estimates of the mean μ̂ and standard deviation σ̂ for each point cloud shape. With these estimates, a studentized standardization x ↦ (x − μ̂)/σ̂ produces values that are comparable between different instances.

As mentioned above in Section 4.1, all our constructions rely on the sampling distributions of the test statistics being equal. To investigate this, we will be showing distributions of the studentized standardizations of both the bottleneck norm ‖D‖_B = max(t_d − t_b)/√2 and its logarithm log(‖D‖_B) = log(max(t_d − t_b)/√2), where D is a diagram with its maximum-length feature living from t_b to t_d.

z-scores

Most of our constructions in Section 4 start out by simulating draws from the null model. The next step uses these to estimate means and variances for the test statistic, to finally compute z-score normalizations and compare these normalized statistics. Inherent in these constructions is an assumption that the resulting z-scores are comparable to each other.

In this section we show that the z-scores are comparable. The two sources of evidence we will use are a law of large numbers and a central limit theorem for properly scaled persistent Betti numbers. What we show is that, for a null model that produces a dense enough sample from a stationary ergodic point process in a window estimated from the data, the z-scores are comparable.
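Putting the studentization and the max-statistic test together, Method 1 can be sketched in miniature as follows. This is our own illustrative implementation: the invariant values are synthetic placeholders and all names are ours.

```python
import random
import statistics

def fwer_simulation_test(observed, null_draws):
    """observed: list of K raw invariant values, one per point cloud.
    null_draws: K lists, each with N-1 invariant values simulated from
    the null model for the matching point cloud.
    Returns the simulation p-value p = (N - r + 1) / N from Method 1."""
    K = len(observed)
    N = len(null_draws[0]) + 1
    # studentize each test using its own null simulations
    z_obs, z_null = [], [[] for _ in range(N - 1)]
    for i in range(K):
        mu = statistics.mean(null_draws[i])
        sd = statistics.stdev(null_draws[i])
        z_obs.append((observed[i] - mu) / sd)
        for j in range(N - 1):
            z_null[j].append((null_draws[i][j] - mu) / sd)
    # max across tests, per simulation round and for the data
    x = max(z_obs)
    y = [max(zs) for zs in z_null]
    r = sum(1 for v in [x] + y if v <= x)  # ascending rank of x among all N maxima
    return (N - r + 1) / N

rng = random.Random(1)
null_draws = [[rng.gauss(0, 1) for _ in range(99)] for _ in range(5)]
observed = [rng.gauss(0, 1) for _ in range(4)] + [8.0]  # one clear signal
print(fwer_simulation_test(observed, null_draws))       # smallest achievable p, 0.01
```

A single outlying point cloud drives the observed maximum, so the joint test detects it at the finest p-value resolution the simulation count allows.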
If we were only considering a box around the data, the results of [15] would apply directly; we extend them to convex windows. We consider L as a convex shape and ℓ a rescaling of the shape. Φ_L is a stationary point process with its support restricted to L, and K(Φ_L) is a filtration of a draw from Φ_L. Lastly, β_q^{r,s}(K(Φ_L)) is the persistent Betti number for births by r and deaths after s, with 0 ≤ r ≤ s < ∞, generated from the filtration K(Φ_L). Given this notation, we would like the following theorems to hold.

Theorem 1.
Assume that Φ is a stationary point process having all finite moments. Then, for any 0 ≤ r ≤ s < ∞ and q ≥ 0, and any convex shape L, there exists a constant b_q^{r,s} such that

E[β_q^{r,s}(K(Φ_{ℓL}))] / Vol(ℓL) → b_q^{r,s}  as ℓ → ∞.

In addition, if Φ is ergodic, then

β_q^{r,s}(K(Φ_{ℓL})) / Vol(ℓL) → b_q^{r,s}  almost surely as ℓ → ∞.

Theorem 2.
Assume that Φ is a homogeneous Poisson point process with unit intensity, supported on a convex shape L. Then, for any 0 ≤ r ≤ s < ∞ and q ≥ 0, there exists a constant σ_{r,s} = σ_{r,s}(q) such that

(β_q^{r,s}(K(Φ_{ℓL})) − E[β_q^{r,s}(K(Φ_{ℓL}))]) / √Vol(ℓL) →d N(0, σ_{r,s})  as ℓ → ∞.

The first theorem is a law of large numbers result and the second is a central limit theorem that specifies a limiting distribution for persistent Betti numbers. Both theorems can be derived from results in [15], where the authors proved the above statements for the case where the convex shape L is restricted to boxes of side length L.

Theorem 1 (1.11 in [15]). Assume that Φ is a stationary point process having all finite moments. Then, for any 0 ≤ r ≤ s < ∞ and q ≥ 0, there exists a constant b_q^{r,s} such that for the box Λ_L = [−L/2, L/2]^N,

E[β_q^{r,s}(K(Φ_{Λ_L}))] / Vol(Λ_L) → b_q^{r,s}  as L → ∞.

In addition, if Φ is ergodic, then β_q^{r,s}(K(Φ_{Λ_L})) / Vol(Λ_L) → b_q^{r,s} almost surely as L → ∞.

Theorem 2 (1.12 in [15]). Assume that Φ is a homogeneous Poisson point process with unit intensity, supported in the box with side length L. Then, for any 0 ≤ r ≤ s < ∞ and q ≥ 0, there exists a constant σ_{r,s} = σ_{r,s}(q) such that

(β_q^{r,s}(K(Φ_{Λ_L})) − E[β_q^{r,s}(K(Φ_{Λ_L}))]) / L^{N/2} →d N(0, σ_{r,s})  as L → ∞.

We now prove Theorem 1.
Proof.
We first state a sketch of the main points in the proof from [15].

The cube Λ_{mM} can be tiled by m^N translated copies of Λ_M. Writing ψ(L) = E[β_q^{r,s}(K(Φ_{Λ_L}))] for the expected persistent Betti number from r to s in q-dimensional persistent homology, the authors show that for sufficiently large M, the contribution from simplices that span a border between the tiles is sufficiently small that

ψ(mM)/(mM)^N = ψ(M)/M^N + O(M^{−1}).

Next, the authors show that when growing the sample cube from L′ to L,

E[|β_q^{r,s}(K(Φ_{Λ_L})) − β_q^{r,s}(K(Φ_{Λ_{L′}}))|] ≤ Σ E[#{q-simplices in Φ_{Λ_L} with at least one vertex in Λ_L \ Λ_{L′}}] = O(|Λ_L \ Λ_{L′}|) = O((L − L′)L^{N−1}).

From this it follows that, for a fixed (large enough) M and m chosen such that mM ≤ L < (m + 1)M,

ψ(L)/L^N = ψ(mM)/(mM)^N + O(M/L).

This shows that the sequence ψ(L)/L^N is Cauchy and hence converges.

We now adapt the proof to apply to general convex shapes.

The first step is showing that, for sufficiently large M, the tile boundaries when tiling a shape with translated copies of [−M/2, M/2]^N are also the relevant quantity to study in the convex setting. The step we need to consider is the second part of the proof argument in [15]: given a sufficiently large M such that the contribution from edges separating M-cubes from each other is small, is the contribution from the outer shell when growing the convex shape from mL to (m + 1)L sufficiently small that we can show the same expected convergence result for the convex shape?

Choose ℓ minimal such that an appropriately translated L fits inside the cube [−ℓM/2, ℓM/2]^N. Write O_L for the minimal set of M-cubes that contains L, and I_L for the maximal set of M-cubes that is contained in L, in a tiling using ℓ^N cubes.
The proportions p_O^{mL} = |O_{mL}|/(mℓ)^N and p_I^{mL} = |I_{mL}|/(mℓ)^N of cubes converge to the same proportion p = Vol(L)/(ℓM)^N. This is because the number of cubes in O_{mL} \ I_{mL} is proportional to the surface area (volume of codimension-1 facets) of mL.

As we scale up our shape from one that fits in a cube with (mℓ)^N tiled M-cubes to a shape that fits in a cube with ((m + 1)ℓ)^N cubes, the volume difference is bounded above by |O_{(m+1)ℓL}| − |I_{mℓL}|. The change in persistent Betti number in turn is bounded above by the volume difference. As m → ∞,

|O_{(m+1)ℓL}| → p((m + 1)ℓM)^N,
|I_{mℓL}| → p(mℓM)^N,
|O_{(m+1)ℓL}| − |I_{mℓL}| → pℓ^N M^N ((m + 1)^N − m^N) = pℓ^N M^N · O(m^{N−1}).

Hence, for k and m such that Λ_{mℓM} ⊆ kL ⊆ Λ_{(m+1)ℓM},

ψ(kL)/(k^N Vol(L)) = ψ(I_{mℓM})/(p(mℓM)^N) + O(m^{−1}).

With M chosen large enough,

ψ(kL)/(k^N Vol(L)) = ψ(I_{mℓM})/(p(mℓM)^N) + O(m^{−1}) = (1/p) ψ(M)/M^N + O(m^{−1}) + O(M^{−1}).

This shows Cauchy convergence for the convex shape. One can use almost identical arguments to prove Theorem 2, starting from Theorem 1.12 in [15].

From the theorem it follows that for dense enough samples from the same underlying ergodic point process, the distributions of persistence diagrams agree. Since the distributions of diagrams agree, so does the distribution of any statistic calculated on these diagrams.
These claims hold up to a scaling factor given by the sample density and size. Since the theorem tells us that the persistent Betti numbers agree after rescaling by the inverse of the (hyper)volume, for different-density samples of a family of ergodic point processes that only differ by a scaling factor (such as constant-density Poisson processes), the scaling factor is composed of first rescaling the sample to unit density, and then scaling the result by the inverse of this rescaled volume. In practice we do not need to know any of these details: since it is a constant scaling factor, the z-score calculation inherently estimates the correct scaling. We only need to know that the distributions are comparable up to a scaling factor to conclude that the z-scores are comparable.

To validate our suggested FWER method and evaluate its performance, we perform simulation tests on null model data input to verify the level, and with a single noisy circle input together with null model data input for a power analysis of each method.

We use the null model of uniformly distributed points in a plane rectangle, and for computational expediency we restrict our testing to two ambient dimensions. Our simulations test all combinations of:

• N ∈ { , } (number of point clouds for each test)
• K ∈ { , , } (number of simultaneous tests to control)

For each box, we draw uniformly at random:

• box side lengths in { . , , }
• point counts for a box in { , , , }
• for the power test: in one of the boxes, points on a circle with added multivariate isotropic Gaussian noise with variance from { . , . }, fitted in a square box with side lengths 1 × 1

The α-complex construction is topologically equivalent to Čech complexes [3], and for speed in our simulations we choose to use the α-complex persistent homology calculation in the R package TDA [12].
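The paper's experiments use the R package TDA. As a rough, self-contained stand-in (our own sketch, not the authors' pipeline), the degree-0 part of the maximum-bar-length invariant for a Vietoris-Rips filtration can be computed from a minimum spanning tree, since the finite H0 bars die at MST edge lengths:

```python
import math

def max_h0_bar_length(points):
    """Longest finite H0 bar of the Vietoris-Rips filtration on `points`:
    the longest edge of a Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # distance from each vertex to the growing tree
    best[0] = 0.0
    longest = 0.0
    for _ in range(n):
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[v] = True
        longest = max(longest, best[v])
        for u in range(n):
            if not in_tree[u]:
                best[u] = min(best[u], math.dist(points[v], points[u]))
    return longest

# Two well-separated pairs: the longest MST edge bridges the gap of length 9.
print(max_h0_bar_length([(0, 0), (1, 0), (10, 0), (11, 0)]))  # 9.0
```

This covers only H0; for the H1 invariants used in the experiments, a persistent homology library (such as the TDA package named above, or GUDHI in Python) would be needed.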
With simulations in place, we perform bootstrap evaluations of the level and power of our methods. We will use the invariant γ(X) = √2 ‖X‖_B = max(t_d − t_b), the maximum bar length, as well as log γ(X), and compare the performance of both.

The theoretical basis for both our proposed control methods, the FWER and the FDR control algorithms, relies on the exchangeability of the test statistics used: the null hypothesis distribution of studentized γ(X) or studentized log γ(X) should be equal for all Xs involved in the comparison.

We provide three overviews to support the claim that studentized maximum bar length and studentized log of maximum bar length are exchangeable in this sense.

First, in Figure 1, we show empirical cumulative distribution functions (ECDFs) for repeated sampling from the null model. For each sample, many point clouds were drawn and their invariants studentized; all the resulting invariant values were pooled to produce one of the ECDFs in the plot. We can see the distributions staying reasonably tightly grouped.

For a clearer visualization, we turn to pairwise QQ-plots. In Figure 2, we see a sample of 20 pairs from these distributions: 5 for each combination of homological dimension (0 or 1) and invariant (γ(X) = √2 ‖X‖_B = max(t_d − t_b) or log γ(X)). The straightness of these QQ-plots supports our assertion that the invariants are exchangeable.

Figure 1: Four collections of empirical cumulative distribution functions (ECDFs). The left column is for H_0 and the right column is for H_1, while the top row is for the max(t_d − t_b) persistence diagram invariant and the bottom row is for log γ(X) = log max(t_d − t_b). Each graph is a collection of 50 samples of different choices of size of bounding box in the plane and number of points in the bounding box.
For each such size and density choice, 1000 point clouds were drawn at random, and the invariants calculated and then studentized as described in Section 5. The ECDFs are of these batches of studentized persistence invariants.

From the tight coupling within each graph, we can see that across different choices of bounding box sizes and point densities, the resulting studentized persistence invariant distributions are largely very similar.

Figure 2: Pairwise QQ-plots for 10 samples of pairs of different choices of size of bounding box in the plane and number of points in the bounding box. For each size and density choice, 1000 point clouds were drawn at random, and the invariants calculated and then studentized as described in Section 5. The QQ-plots are for the resulting studentized invariants.

A QQ-plot is a scatter plot of the quantiles of two distributions against each other. The shape of the QQ-plot reveals how similar the distributions are, and in which ways they differ. A straight line, such as we see in all of these plots, indicates that the distributions are equal up to translation and scaling, while a straight line following the x = y line indicates that the distributions are equal.

FWER Experiments

We validate the FWER control procedures by estimating the probability of false discovery on null model data, and we analyze the power of the proposed methods by attempting to detect a single noisy circle in a family of null model data samples.

For the experiments, we precomputed 160000 point cloud invariants. Since we are working with point clouds in the plane, we computed homology in dimensions 0 and 1, and for each combination of box shapes and point counts, as well as for each noise level and point count combination, we generated 5000 point clouds. All our subsequent results are based on drawing from these precomputed invariants at random, matching box sizes and point counts when producing simulations to match a particular point cloud.
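The matched-draw scheme above can be sketched as a lookup into a precomputed pool keyed by box size and point count. The names and pool layout here are our own illustration, not the paper's code:

```python
import random

# Hypothetical precomputed pool of null-model invariants, keyed by
# (box side length, point count). In the paper each such combination
# has 5000 precomputed point cloud invariants.
pool = {
    (1.0, 100): [random.uniform(0.5, 1.5) for _ in range(5000)],
    (2.0, 500): [random.uniform(0.5, 1.5) for _ in range(5000)],
}

def matched_null_draws(box_side, point_count, m=99):
    """Draw m precomputed null invariants whose box size and point
    count match those of an observed point cloud."""
    return random.sample(pool[(box_side, point_count)], m)

draws = matched_null_draws(1.0, 100)
```

Matching on box size and point count is what makes the drawn invariants a valid null sample for the observed point cloud.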
We evaluate the empirical level of our proposed methods. From 100 simulations drawing from precomputed invariants, the null rejection rates for null model data are summarized in Table 1. For each simulation, a random number, between 2 and 50, of point cloud invariants was drawn from the precomputed data. For each point cloud invariant, another 99 point clouds with matching box sizes and point counts were drawn as a simulated null sample. These batches of 100 point clouds each went through each of our proposed methods, and rejection rates at significance levels of 0.1, 0.05 and 0.01 were calculated.
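With 99 matched null draws per observed invariant, the natural rank-based Monte Carlo p-value has granularity 1/100, which is exactly enough to reject at the 0.01 level. A sketch of this standard construction (our own illustration, not the paper's implementation):

```python
def monte_carlo_p(observed, null_draws):
    """Rank-based Monte Carlo p-value for a one-sided test.

    With m null draws, the smallest attainable p-value is 1/(m + 1),
    so m = 99 permits rejection at the 0.01 level but no lower.
    """
    exceed = sum(1 for v in null_draws if v >= observed)
    return (1 + exceed) / (1 + len(null_draws))

# An observed invariant larger than all 99 null draws gives p = 0.01.
p_extreme = monte_carlo_p(10.0, [1.0] * 99)
# An unremarkable observation gives a large p-value.
p_typical = monte_carlo_p(0.5, [1.0] * 99)
```

Adding 1 to both numerator and denominator counts the observed value among its own null sample, which keeps the test valid at the nominal level.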
For the power analysis we picked precalculated invariants from circles fitted in a 1 × 1 box, at each of the two noise levels (the larger being σ = 0.25). For each of 100 simulations, one circle invariant was picked, and a random number (between 1 and 49) of null model point cloud invariants was added. This collection of point clouds went through the same process of generating 99 null model invariants each, and the collections were run through the described methods. The results of 100 simulations at each of the two noise levels are shown in Table 1.

Examples of the kind of circles we use for the power calculation can be seen in Figure 3.
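Once each point cloud in a collection has a p-value, controlling the FWER across the K simultaneous tests can be done with a classical procedure such as Holm's step-down [17], which is valid under any dependence structure. A sketch (our own illustration, not necessarily the exact procedure evaluated in Table 1):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) to alpha / (K - k), stopping at the first failure."""
    K = len(p_values)
    order = sorted(range(K), key=lambda i: p_values[i])
    reject = [False] * K
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (K - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values also fail
    return reject

# Three tests: the smallest two p-values survive their thresholds
# (0.05/3 and 0.05/2), the largest does not.
decisions = holm_reject([0.001, 0.2, 0.012], alpha=0.05)
```

Holm's procedure is uniformly more powerful than a plain Bonferroni correction while giving the same FWER guarantee.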
In this paper we have discussed multiple hypothesis testing (MHT) against a null model: given a way to generate typical point clouds, we can test whether a numeric statistic on barcodes is consistent with that model. We are particularly interested in using uniformly distributed points as a null model, both because they are known to generate short bars [5] and because this is a null model commonly used in classical spatial statistics.

Figure 3: Noisy circles as used by the power calculation. Top row, σ = 0. ; bottom row, σ = 0.25. The plots have, from left to right, 10, 50, 100 and 500 points.

Table 1: Null rejection rates.

FWER, p <    Studentized length    Studentized log-length
Null model
0.01         0.04                  0.04
0.05         0.10                  0.08
0.10         0.13                  0.13
σ = 0.
σ = 0.25

References

[1] Aaron Adcock, Erik Carlsson, and Gunnar Carlsson. The ring of algebraic functions on persistence bar codes. Homology, Homotopy and Applications, 18(1):381–402, 2016.

[2] Robert J. Adler, Omer Bobrowski, and Shmuel Weinberger. Crackle: The homology of noise.
Discrete & Computational Geometry, 52(4):680–704, 2014.

[3] Ulrich Bauer and Herbert Edelsbrunner. The Morse theory of Čech and Delaunay filtrations. In Proceedings of the thirtieth annual Symposium on Computational Geometry, page 484. ACM, 2014.

[4] Andrew J. Blumberg, Itamar Gal, Michael A. Mandell, and Matthew Pancia. Robust statistics, hypothesis testing, and confidence intervals for persistent homology on metric measure spaces. Foundations of Computational Mathematics, 14(4):745–789, 2014. URL: http://link.springer.com/article/10.1007/s10208-014-9201-4.

[5] Omer Bobrowski, Matthew Kahle, and Primoz Skraba. Maximally persistent cycles in random geometric complexes. The Annals of Applied Probability, 27(4):2032–2060, 2017.

[6] Carlo Bonferroni. Sulle medie multiple di potenze. Bollettino dell'Unione Matematica Italiana, 5(3-4):267–270, 1950.

[7] Peter Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015. URL: http://arxiv.org/abs/1207.6437.

[8] Christopher Cericola, Inga Johnson, Joshua Kiers, Mitchell Krock, Jordan Purdy, and Johanna Torrence. Extending Hypothesis Testing with Persistence Homology to Three or More Groups. Involve, a Journal of Mathematics, 11(1):27–51, January 2018. URL: http://arxiv.org/abs/1602.03760, doi:10.2140/involve.2018.11.27.

[9] Frédéric Chazal, Brittany Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Robust topological inference: Distance to a measure and kernel distance. The Journal of Machine Learning Research, 18(1):5845–5884, 2017. URL: http://arxiv.org/abs/1412.7197.

[10] Frédéric Chazal, Brittany Fasy, Fabrizio Lecci, Bertrand Michel, Alessandro Rinaldo, and Larry Wasserman. Subsampling methods for persistent homology. In International Conference on Machine Learning, pages 2143–2151, 2015. URL: http://arxiv.org/abs/1406.1901.

[11] Frédéric Chazal, Brittany Terese Fasy, Fabrizio Lecci, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. On the Bootstrap for Persistence Diagrams and Landscapes. arXiv:1311.0376 [cs, math, stat], November 2013. URL: http://arxiv.org/abs/1311.0376.

[12] Brittany Terese Fasy, Jisu Kim, Fabrizio Lecci, Clement Maria, and Vincent Rouvreau. TDA: Statistical tools for topological data analysis. Software available at https://cran.r-project.org/package=TDA, 2014.

[13] Brittany Terese Fasy, A. Rinaldo, and L. Wasserman. Stochastic Convergence of Persistence Landscapes and Silhouettes. Convergence, (1/25), 2014.

[14] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002.

[15] Yasuaki Hiraoka, Tomoyuki Shirai, and Khanh Duy Trinh. Limit theorems for persistence diagrams. The Annals of Applied Probability, 28(5):2740–2780, October 2018. doi:10.1214/17-AAP1371.

[16] Yosef Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4):800–802, 1988. doi:10.1093/biomet/75.4.800.

[17] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.

[18] S. Kališnik Verovšek and G. Carlsson. Symmetric and r-symmetric tropical polynomials and rational functions. Journal of Pure and Applied Algebra, 220:3610–3627, 2014.

[19] Yuriy Mileyko, Sayan Mukherjee, and John Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12):124007, 2011. URL: http://stacks.iop.org/0266-5611/27/i=12/a=124007.

[20] Elizabeth Munch, Katharine Turner, Paul Bendich, Sayan Mukherjee, Jonathan Mattingly, and John Harer. Probabilistic Fréchet means for time varying persistence diagrams. Electronic Journal of Statistics, 9(1):1173–1204, 2015.

[21] Brian D. Ripley. Spatial Statistics, volume 575. John Wiley & Sons, 2005.

[22] Andrew Robinson and Katharine Turner. Hypothesis testing for topological data analysis. Journal of Applied and Computational Topology, November 2017. URL: http://link.springer.com/10.1007/s41468-017-0008-7, doi:10.1007/s41468-017-0008-7.

[23] Katharine Turner. Means and medians of sets of persistence diagrams. arXiv e-print 1307.8300, July 2013. URL: http://arxiv.org/abs/1307.8300.

[24] Katharine Turner, Yuriy Mileyko, Sayan Mukherjee, and John Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44–70, 2014.

[25] Katharine Turner, Sayan Mukherjee, and Doug M. Boyer. Sufficient statistics for shapes and surfaces. Annals of Statistics, 2013.

[26] Mikael Vejdemo-Johansson and Alisa Leshchenko. Certified Mapper: Repeated testing for acyclicity and obstructions to the nerve lemma. arXiv:1808.09933 [cs, math, stat], August 2018. URL: http://arxiv.org/abs/1808.09933.