Distributed computation of persistent homology
Ulrich Bauer ∗ Michael Kerber † Jan Reininghaus ‡ October 3, 2013
Abstract
Persistent homology is a popular and powerful tool for capturing topological features of data. Advances in algorithms for computing persistent homology have reduced the computation time drastically – as long as the algorithm does not exhaust the available memory. Following up on a recently presented parallel method for persistence computation on shared memory systems, we demonstrate that a simple adaptation of the standard reduction algorithm leads to a variant for distributed systems. Our algorithmic design ensures that the data is distributed over the nodes without redundancy; this permits the computation of much larger instances than on a single machine. Moreover, we observe that the parallelism at least compensates for the overhead caused by communication between nodes, and often even speeds up the computation compared to sequential and even parallel shared memory algorithms. In our experiments, we were able to compute the persistent homology of filtrations with more than a billion (10^9) elements within seconds on a cluster with 32 nodes using less than 10GB of memory per node.

∗ Institute of Science and Technology Austria, Klosterneuburg, Austria. http://ulrich-bauer.org
† Max-Planck-Center for Visual Computing and Communication, Saarbrücken, Germany. http://mpi-inf.mpg.de/~mkerber
‡ Institute of Science and Technology Austria, Klosterneuburg, Austria. http://ist.ac.at/~reininghaus

1 Introduction
Background
A recent trend in data analysis is to understand the shape of data (possibly in high dimensions) using topological methods. The idea is to interpret the data as a growing sequence of topological spaces (a filtration), such as, for example, sublevel sets of a function with an increasing threshold, or thickenings of a point set. The goal is now to compute a topological summary of the filtration, which can be used, for instance, to identify features, as well as to infer topological properties of a sampled shape. Persistent homology describes how homological features appear and disappear in the filtration (see Section 2 for more details). Besides significant theoretical advances, persistent homology has been used in various applications; see [7] for a survey. The success of persistent homology stems from its generality, which makes it applicable to various forms of data, from its stability with respect to perturbations [2, 5], from its ability to provide information on all scales, and, last but not least, from the availability of efficient algorithms for computing this information.
The standard algorithm for computing persistent homology assumes the input to be a boundary matrix of a chain complex, and proceeds by reducing that matrix using a variant of Gaussian elimination [8, 15]. The running time is cubic in the number of simplices; this can be improved to matrix multiplication time [12] or replaced by an output-sensitive bound [3]. However, on the practical side, it has been observed that the standard algorithm usually performs much better on real-world instances than predicted by the worst-case bounds, and relatively simple optimizations of the standard method yield remarkable speed-ups [4]. Recently, Maria et al. [11] implemented a memory-efficient and comparably fast method for computing persistent cohomology, which yields the same information about birth and death of features as its homology counterpart. Further improvements have been reported by using several processors in a shared memory environment [1] (see also [9, 10] for alternative parallelization schemes). With these optimizations, it is often the case that computing persistence actually takes less time than even reading the input into memory. Therefore, the limiting factor is not so much the time spent for computation but rather the memory available on the computer.

Contribution
We present a scalable algorithm for computing persistent homology in parallel in a distributed memory environment. This method permits the computation of much larger instances than existing state-of-the-art algorithms on a single machine, by using sufficiently many computing nodes such that the data fits into the distributed memory. While overcoming the memory bottleneck is the primary purpose of our approach, we aim for a time-efficient solution at the same time.
As demonstrated by our experiments, our implementation exhibits excellent scaling with respect to memory usage on a single node, and even outperforms similar parallel shared memory code in running time. This result is somewhat surprising, since the computation of topological properties like persistent homology is a global problem, and at first sight it is not obvious at all that the computation can be performed with the very simple and inexpensive pattern of communication that our algorithm exhibits.
Our method closely resembles the spectral sequence algorithm for persistent homology [6, §VII.4]. However, several adaptations are necessary for an efficient implementation in distributed memory. Most importantly, reduced columns are not stored in order of their index in the matrix, but rather according to the order of their pivot, the largest index of a non-zero entry. This allows a node to perform eliminations in its associated rows, and to determine whether a column with pivot in these rows is reduced, without further communication with other nodes. Furthermore, we minimize the number of messages sent through the network by collecting blocks of messages, and we simplify the communication structure by letting node j communicate only with nodes j ± 1. Finally, we incorporate the clearing optimization [4] into the algorithm in order to avoid unnecessary column operations.

Organization
We introduce the necessary background on persistent homology in Section 2, describe our distributed memory algorithm in Section 3, report on the experimental evaluation in Section 4, and conclude in Section 5.
2 Background

This section summarizes the theoretical foundations of persistent homology as needed in this work. We limit our scope to simplicial homology over Z_2 just for the sake of simplicity of the description; our methods, however, easily generalize to chain complexes over arbitrary fields. For a more detailed introduction, we refer to [6, 7, 15].

Homology
Homology is an algebraic invariant for analyzing the connectivity of simplicial complexes. Let K be a finite simplicial complex. For a given dimension d, a d-chain is a formal sum of d-simplices of K with Z_2 coefficients. The d-chains form a group C_d(K) under addition. Equivalently, a d-chain can be interpreted as a subset of the d-simplices, with the group operation being the symmetric set difference. The boundary of a d-simplex σ is the (d − 1)-chain consisting of the faces of σ of codimension 1. This operation extends linearly to a boundary operator ∂_d : C_d(K) → C_{d−1}(K). A d-chain γ is a d-cycle if ∂_d(γ) = 0. The d-cycles form a subgroup of the d-chains, denoted by Z_d(K). A d-chain γ is called a d-boundary if γ = ∂_{d+1}(ξ) for some (d + 1)-chain ξ. Again, the d-boundaries form a subgroup B_d(K) of C_d(K), and since ∂_d(∂_{d+1}(ξ)) = 0, d-boundaries are d-cycles, and so B_d(K) is a subgroup of Z_d(K). The d-th homology group H_d(K) is defined as the quotient group Z_d(K) / B_d(K). In fact, the groups C_d(K), Z_d(K), B_d(K), and H_d(K) are Z_2-vector spaces. The dimension of H_d(K) is called the d-th Betti number β_d. Roughly speaking, the Betti numbers in dimension 0, 1, and 2 count the number of connected components, tunnels, and voids of K, respectively.
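As a toy illustration of these definitions (our own example, not taken from the paper), consider the 2-simplex abc with vertices a, b, c: over Z_2 its boundary is the sum of its three edges, and applying the boundary operator again gives zero, which is why every boundary is a cycle.

```latex
% Boundary of the 2-simplex abc over Z_2, and the identity \partial \circ \partial = 0:
\[
  \partial_2(abc) = ab + bc + ac, \qquad
  \partial_1\bigl(\partial_2(abc)\bigr) = (a+b) + (b+c) + (a+c) = 2a + 2b + 2c = 0 .
\]
```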
Persistence

Consider a simplexwise filtration of K, i.e., a sequence of inclusions ∅ = K_0 ⊂ . . . ⊂ K_n = K such that K_i = K_{i−1} ∪ {σ_i}, where σ_i is a simplex of K. We write H_*(K_i) for the direct sum of the homology groups of K_i in all dimensions. For i ≤ j, the inclusion K_i ↪ K_j induces a homomorphism h_i^j : H_*(K_i) → H_*(K_j) on the homology groups. We say that a class α ∈ H_*(K_i) is born at (index) i if α ∉ im h_{i−1}^i. A class α ∈ H_*(K_i) born at index i dies entering (index) j if h_i^{j−1}(α) ∉ im h_{i−1}^{j−1} but h_i^j(α) ∈ im h_{i−1}^j. In this case, the index pair (i, j) is called a persistence pair, and the difference j − i is the (index) persistence of the pair. The transition from K_{i−1} to K_i either causes the birth or the death of some homology class. This homology class is not unique in general.

Boundary matrix
For a matrix M ∈ (Z_2)^{n×n}, let M_j denote its j-th column and m_{ij} ∈ Z_2 its entry in row i and column j. For a column M_j, we define pivot(M_j) = min{p ≥ 0 : m_{ij} = 0 for all i > p}, i.e., the largest row index of a non-zero entry (and 0 for a zero column), and call it the pivot index of that column. When obvious from the context, we omit explicit mention of the matrix M and write pivot(j) for pivot(M_j).
The boundary matrix D ∈ (Z_2)^{n×n} of a simplexwise filtration (K_i)_{i=0}^n is the n × n matrix of the boundary operator ∂_* : C_*(K) → C_*(K) with respect to the ordered basis (σ_i)_{i=1}^n of C_*(K). We have D_{ij} = 1 if and only if σ_i is a face of σ_j of codimension 1. In other words, the j-th column of D encodes the boundary of σ_j. D is an upper-triangular matrix, since any face of σ_j must precede σ_j in the filtration.
Matrix reduction

A column operation of the form M_j ← M_j + M_k is called a left-to-right addition if k < j. A left-to-right addition is called eliminating if it decreases pivot(j). A column M_j is called reduced if pivot(j) cannot be decreased by applying any sequence of left-to-right additions. In particular, there is no non-zero column M_k with k < j and pivot(k) = pivot(j). Clearly a zero column is reduced. Note that a reduced column remains reduced under eliminating left-to-right column additions.
We call a matrix M reduced if all columns are reduced, or equivalently, if no two non-zero columns have the same pivot index. We call M reduced at index (i, j) if the lower left submatrix of M with rows of index > i and columns of index ≤ j is reduced. A sufficient condition for column M_j to be reduced is that M is reduced at index (i, j) with i = pivot(j).
If R is a reduced matrix obtained by applying left-to-right additions to M, we call it a reduction of M. In this case, we define
P_R := {(i, j) | i = pivot(R_j) > 0}.
Although the reduction matrix R is not unique, the set P_R is the same for any reduction of M; therefore, we can define P_M to be equal to P_R for any reduction R of M.
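For concreteness, here is a small illustrative example (ours, not from the paper): order the simplices of a filled triangle as a, b, c, ab, bc, ac, abc (indices 1–7). Below are the boundary matrix D and a reduction R obtained by left-to-right additions; column 6 (the edge ac) is eliminated by adding columns 5 and 4.

```latex
\[
D = \begin{pmatrix}
0&0&0&1&0&1&0\\
0&0&0&1&1&0&0\\
0&0&0&0&1&1&0\\
0&0&0&0&0&0&1\\
0&0&0&0&0&0&1\\
0&0&0&0&0&0&1\\
0&0&0&0&0&0&0
\end{pmatrix},
\qquad
R = \begin{pmatrix}
0&0&0&1&0&0&0\\
0&0&0&1&1&0&0\\
0&0&0&0&1&0&0\\
0&0&0&0&0&0&1\\
0&0&0&0&0&0&1\\
0&0&0&0&0&0&1\\
0&0&0&0&0&0&0
\end{pmatrix},
\qquad
P_D = \{(2,4),\,(3,5),\,(6,7)\}.
\]
```

The pivots of the non-zero columns of R are 2, 3, and 6, giving the pairs (2, 4), (3, 5), and (6, 7); index 1 is left unmatched, reflecting the connected component that never dies.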
Persistence by reduction

For the boundary matrix D of the filtration (K_i)_{i=0}^n, the first i columns generate the boundary group B_*(K_i). This property is invariant under left-to-right column additions. For a reduction R of D, the non-zero columns among the first i columns actually form a basis of B_*(K_i). Note that
i = dim C_*(K_i) = dim Z_*(K_i) + dim B_*(K_i) and
dim H_*(K_i) = dim Z_*(K_i) − dim B_*(K_i).
Hence, if R_i is zero, we have dim B_*(K_i) = dim B_*(K_{i−1}), dim Z_*(K_i) = dim Z_*(K_{i−1}) + 1, and dim H_*(K_i) = dim H_*(K_{i−1}) + 1, and so some homology class is born at i.
If on the other hand R_j is non-zero with i = pivot(j), we have dim B_*(K_j) = dim B_*(K_{j−1}) + 1, dim Z_*(K_j) = dim Z_*(K_{j−1}), and dim H_*(K_j) = dim H_*(K_{j−1}) − 1. The fact that R_j has pivot i means that R_j ∈ Z_*(K_i) and hence [R_j] ∈ H_*(K_i); the fact that it is reduced means that there is no b ∈ B_*(K_{j−1}) with b + R_j ∈ Z_*(K_{i−1}), and hence [R_j] ∉ im h_{i−1}^i. We conclude that [R_j] is born at i. We even have [R_j] ∉ im h_{i−1}^{j−1}. Moreover, R_j is a boundary in K_j, and so h_i^j([R_j]) = 0 ∈ im h_{i−1}^j; hence [R_j] dies entering j. We conclude that the pairs (i, j) ∈ P_D are the persistence pairs of the filtration.
The standard way to reduce D is to process columns from left to right; for every column, previously reduced columns are added from the left until the pivot index is unique. A lookup table can be used to identify the next column to be added in constant time. The running time is at most cubic in n, and this bound is actually tight for certain input filtrations, as demonstrated in [13].

Clearing optimization

Despite its worst-case behavior, there are techniques to speed up the reduction significantly in practice. A particularly simple yet powerful improvement has been presented in [4]. It is based on the following observations.
First, the reduction of the matrix can be performed separately for each dimension d, by restricting to the submatrix corresponding to columns of dimension d and rows of dimension d − 1.
This submatrix is exactly the matrix of the boundary operator ∂_d : C_d(K) → C_{d−1}(K). The second basic fact to note is that in any reduction of D, if i is the pivot of some column j, then the i-th column is zero.
This leads to the following variant of the reduction algorithm: the boundary matrix is reduced separately in each dimension, in decreasing order of dimension. After the reduction in dimension d, all columns corresponding to pivot indices are set to zero – we call this process clearing. Note that columns corresponding to d-simplices have pivots corresponding to (d − 1)-simplices, so the cleared columns are exactly columns of dimension d − 1, which would otherwise have to be reduced in the next round.
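The following C++ sketch illustrates the standard left-to-right reduction combined with the clearing optimization, processing dimensions in decreasing order. It is a minimal illustration under simplifying assumptions (columns are sorted index vectors over Z_2, and `columns_of_dim[d]` lists the column indices of dimension d in filtration order); it is not the PHAT or DIPHA implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

using Column = std::vector<int64_t>;  // sorted row indices of the non-zero entries

static int64_t pivot(const Column& c) { return c.empty() ? -1 : c.back(); }

static Column add(const Column& a, const Column& b) {  // addition over Z_2 = symmetric difference
    Column r;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(r));
    return r;
}

// boundary[j] is the j-th column of D; columns_of_dim[d] lists the columns of
// dimension d in increasing (filtration) order. Returns the persistence pairs (i, j).
std::vector<std::pair<int64_t, int64_t>> reduce_with_clearing(
    std::vector<Column>& boundary,
    const std::vector<std::vector<int64_t>>& columns_of_dim) {
    std::vector<std::pair<int64_t, int64_t>> pairs;
    std::vector<int64_t> owner(boundary.size(), -1);  // pivot row -> column owning that pivot
    for (int d = static_cast<int>(columns_of_dim.size()) - 1; d >= 1; --d) {
        for (int64_t j : columns_of_dim[d]) {
            Column& col = boundary[j];
            while (!col.empty() && owner[pivot(col)] != -1)
                col = add(col, boundary[owner[pivot(col)]]);  // eliminating left-to-right addition
            if (!col.empty()) {
                owner[pivot(col)] = j;
                pairs.emplace_back(pivot(col), j);
                boundary[pivot(col)].clear();  // clearing: this column must reduce to zero anyway
            }
        }
    }
    return pairs;
}
```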
3 Algorithm

Throughout this section, let (K_i)_{i=0}^n be a filtration of a simplicial complex consisting of n simplices, represented by its boundary matrix D. Our goal is to compute the persistence pairs of (K_i)_i on a cluster of p processor units, called nodes, which are indexed by the integers {1, . . . , p}.

Reduction in blocks

Let 0 = r_0 < · · · < r_i < · · · < r_p = n be an integer partition of the interval {1, . . . , n}. Let the i-th range be the interval of integers k with r_{i−1} < k ≤ r_i. We define the block (i, j) of M as the block submatrix with rows from the i-th row range and columns from the j-th column range. The blocks partition the matrix into p² submatrices. Any block (i, j) with i > j is completely zero, since D is upper triangular.
To simplify notation, we call M reduced at block (i, j) if M is reduced at index (r_{i−1}, r_j). Moreover, we call M reducible in block (i, j) if M is reduced at block (i, j − 1) and at block (i + 1, j). This terminology is motivated by the fact that in order to obtain a matrix that is reduced at block (i, j), only entries in block (i, j) have to be eliminated, as described in Algorithm 1 and shown in the following lemma.

Algorithm 1 Block reduction
Require: input M is reducible in block (i, j)
Ensure: result M is reduced at block (i, j)
1: procedure ReduceBlock(i, j)
2:     for each l in range j in increasing order do
3:         while ∃ k with pivot(k) = pivot(l) in range i do
4:             add column k to column l
5:         if pivot(l) is in range i then
6:             add column l to collection of reduced columns
Lemma 1. Algorithm 1 is correct: if M is reducible in block (i, j), then applying Algorithm 1 yields a matrix which is reduced at block (i, j).

Proof. By induction on l, M is reduced at index (r_{i−1}, l) after each iteration of the main for loop (Line 2). This follows directly from the exit condition of the while loop in Line 3, together with the induction hypothesis and the precondition that M is reduced at index (r_i, r_j) and hence also at index (r_i, l). □
Lemma 2. Algorithm 1 only requires access to the unreduced columns of M in range j and the reduced columns with pivot in range i.

Proof. Let l be in range j and let k < l be such that pivot(k) = pivot(l) is in range i, as in the while loop in Line 3. Then clearly column l is unreduced. Moreover, as shown in the proof of Lemma 1, M is reduced at index (r_{i−1}, l). Since pivot(k) is in range i, we have r_{i−1} < pivot(k), and by assumption k < l. Hence M is also reduced at index (pivot(k), k), i.e., column k is reduced. □
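A possible single-node realization of Algorithm 1 might look as follows. This is a sketch under our own naming, reusing the Column, pivot, and add helpers from the earlier sketch; it is not DIPHA's actual code. The map `reduced` plays the role of the collection of reduced columns with pivot in range i, indexed by pivot, and `package` holds the still-unreduced columns of range j in increasing index order.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Column = std::vector<int64_t>;
int64_t pivot(const Column& c);               // as in the previous sketch
Column add(const Column& a, const Column& b); // as in the previous sketch

// Range i is the half-open index interval (range_begin, range_end].
void reduce_block(int64_t range_begin, int64_t range_end,
                  std::map<int64_t, Column>& reduced,                  // pivot -> reduced column (range i)
                  std::vector<std::pair<int64_t, Column>>& package) {  // (index, column), unreduced (range j)
    std::vector<std::pair<int64_t, Column>> still_unreduced;
    for (auto& [index, col] : package) {
        // eliminate pivots lying in range i using the locally stored reduced columns
        while (!col.empty()) {
            int64_t p = pivot(col);
            if (p <= range_begin || p > range_end || reduced.count(p) == 0)
                break;                                    // no further elimination possible here
            col = add(col, reduced[p]);                   // eliminating left-to-right addition
        }
        if (col.empty())
            continue;                                     // reduced to zero: nothing to keep or forward
        int64_t p = pivot(col);
        if (p > range_begin && p <= range_end)
            reduced[p] = std::move(col);                  // now reduced; keep it on this node
        else
            still_unreduced.emplace_back(index, std::move(col));  // forwarded to node i-1 later
    }
    package.swap(still_unreduced);
}
```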
Parallel reduction

We now describe a parallel algorithm to reduce a boundary matrix D by applying block reduction to all blocks (i, j) with i ≤ j in a certain order.
The algorithm reduces the blocks starting with the diagonal blocks (i, i) with 1 ≤ i ≤ p. Indeed, note that the boundary matrix D is reducible in any diagonal block (i, i). All block reductions for diagonal blocks are independent and can be performed in parallel. Now consider a block of the form (i, j) with i < j. Note that this block can be reduced as soon as blocks (i, j − 1) and (i + 1, j) have been reduced.
This relation defines a partial order on the blocks (i, j) with i ≤ j. If the order of execution of the block reductions is consistent with that partial order, the preconditions of block reduction are satisfied for every block. Note that two blocks (i, j) and (i′, j′) can be reduced independently iff either (i < i′ and j < j′) or (i > i′ and j > j′). After having reduced the block (1, p), the postcondition of Algorithm 1 yields that the resulting matrix is a reduction of the input boundary matrix D.
Note that a special case of this block-reduction scheme is the spectral sequence algorithm presented in [6, §VII.4]. This algorithm sweeps the blocks diagonally, and in each phase r ∈ {1, . . . , p} of the sweep it reduces all blocks (i, j) with j − i = r − 1. The algorithm as described is sequential; however, as discussed above, within a given phase r the blocks can be reduced independently.
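In a shared memory setting the resulting schedule can be expressed directly as a diagonal sweep. The following sketch (with an assumed ReduceBlock routine as in Algorithm 1, and OpenMP for the parallel loop; it is not part of the paper's code) reduces the blocks of each phase in parallel:

```cpp
#include <omp.h>

void ReduceBlock(int i, int j);  // Algorithm 1, assumed to be provided elsewhere

// Reduce all blocks (i, j) with i <= j of a matrix partitioned into p ranges,
// phase by phase: in phase r all blocks with j - i = r - 1 are independent.
void reduce_all_blocks(int p) {
    for (int r = 1; r <= p; ++r) {
        #pragma omp parallel for
        for (int i = 1; i <= p - r + 1; ++i) {
            // preconditions hold: blocks (i, j-1) and (i+1, j) were reduced in phase r-1
            ReduceBlock(i, i + r - 1);
        }
    }
}
```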
Distributed reduction

We now describe how the data and the workload are distributed and transferred between the nodes.
Each node i is assigned a row of blocks for reduction. The blocks are necessarily processed from left to right. Recall that reducing a block (i, j) requires access to the unreduced columns in range j, and to the reduced columns with pivot in range i. During the execution, each node i maintains a collection of all reduced columns with pivot in the i-th range, indexed by pivot. The unreduced columns in a given range j, on the other hand, are passed on from node to node. No data is duplicated among the nodes; each column of the matrix is stored in exactly one node throughout the execution of the algorithm. The union of the locally stored unreduced and reduced columns yields a distributed representation of the partially reduced boundary matrix.
Initially, each node i loads the columns of the input boundary matrix in range i. The following procedure is now repeated, with j ranging from i to p. Node i performs reduction in block (i, j) and retains the reduced columns with pivot in range i in its collection. After that, it sends a package with the remaining unreduced columns in range j to node i − 1 (if i > 1) and receives a package with the unreduced columns in range j + 1 from node i + 1 (if j < p).
Observe that in each iteration, node i has all the information required to perform reduction in block (i, j), namely, the unreduced columns in range j and the reduced columns with pivot in range i. Moreover, the preconditions for block reduction are satisfied, since block (i, j − 1) is reduced on the same node i before block (i, j), and block (i + 1, j) is reduced on node i + 1 before node i receives the unreduced columns in range j from node i + 1. We conclude:
Lemma 3.
If Algorithm 2 is executed on a cluster with p nodes, it computes a reduction of the input matrix.
Note that the structure of communication between the nodes is very simple: each node i only receives data from node i + 1 and only sends data to node i − 1. Moreover, less than p messages are sent between each pair of consecutive nodes. This is highly beneficial for distributed computing, as the communication overhead and the network latency become negligible.

Algorithm 2 Distributed matrix reduction
Require: access to columns of the input boundary matrix D in range i
Ensure: resulting output matrix R is a reduction of D
1: procedure ReduceOnNode(i)
2:     input package with columns of D in range i
3:     for j = i, . . . , p do
4:         ReduceBlock(i, j)
5:         if i > 1 then
6:             send package with unreduced columns in range j to node i − 1
7:         if j < p then
8:             receive package with unreduced columns in range j + 1 from node i + 1
9:     return reduced columns with pivot in range i
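The communication skeleton of Algorithm 2 could be realized with plain blocking MPI calls roughly as follows. This is a sketch only: the serialization of columns into a flat buffer and the block reduction itself are assumed helpers, and DIPHA's actual implementation differs in detail. MPI rank r plays the role of node i = r + 1, so ranges are indexed 0, . . . , p − 1 here.

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// assumed helper: reduces block (r, j), keeps reduced columns with pivot in range r
// on this node, and leaves the remaining unreduced columns serialized in `package`
void reduce_block_serialized(int r, int j, std::vector<int64_t>& package);

void reduce_on_node(std::vector<int64_t> package /* serialized columns of D in range r */) {
    int r = 0, p = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    for (int j = r; j < p; ++j) {
        reduce_block_serialized(r, j, package);
        if (r > 0)  // blocking send of the unreduced columns in range j to node r-1
            MPI_Send(package.data(), static_cast<int>(package.size()), MPI_INT64_T,
                     r - 1, /*tag=*/j, MPI_COMM_WORLD);
        if (j < p - 1) {  // receive the unreduced columns in range j+1 from node r+1
            MPI_Status status;
            MPI_Probe(r + 1, /*tag=*/j + 1, MPI_COMM_WORLD, &status);
            int count = 0;
            MPI_Get_count(&status, MPI_INT64_T, &count);
            package.resize(static_cast<std::size_t>(count));
            MPI_Recv(package.data(), count, MPI_INT64_T, r + 1, j + 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    // the reduced columns with pivot in range r now reside on this node
}
```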
Clearing in parallel

The clearing optimization from Section 2 can be implemented in the distributed reduction algorithm with minor changes. Recall that the clearing optimization iterates over the dimensions d in decreasing order and processes only the columns of a given dimension d at a time. The ranges are defined by a single global partition r_0 < · · · < r_p that does not change per dimension. Note that this might cause initial column packages of different sizes in a given dimension, even if the ranges are all of the same size. However, it has the following advantage: when node i has performed its last block reduction for dimension d, it knows all pivots that fall in the i-th range. All these pivots correspond to (d − 1)-simplices. When node i is then initialized to process the columns of dimension d − 1 in the i-th range, it can simply clear all columns with indices that were pivots in dimension d before it starts the block reduction. In particular, no communication with other nodes is required.
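Per node, the clearing step between two dimensions then amounts to deleting the columns whose indices were recorded as pivots in the previous (higher) dimension. A minimal sketch, with container names that are our own assumptions:

```cpp
#include <cstdint>
#include <map>
#include <unordered_set>
#include <vector>

using Column = std::vector<int64_t>;

// columns_in_range_i: the node's columns of dimension d-1, indexed by column index;
// pivots_of_dim_d: the pivot indices this node collected while reducing dimension d.
void clear_range(std::map<int64_t, Column>& columns_in_range_i,
                 const std::unordered_set<int64_t>& pivots_of_dim_d) {
    for (int64_t index : pivots_of_dim_d)
        columns_in_range_i.erase(index);  // these columns would reduce to zero; drop them locally
}
```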
Design rationale

We justify some design choices in our algorithm and discuss alternatives. First, we implemented the sending of packages in Algorithm 2 in a blocking fashion, i.e., a node does not start receiving the next package until it has sent and discarded its current package. Clearly, this strategy can result in delayed processing of packages, because a sending node has to wait for its predecessor to be ready to receive a package. On the other hand, the strategy guarantees that every node holds at most one package at a time; this prevents a slower node from accumulating more and more packages, possibly causing high memory consumption.
A possible strategy to reduce the overall amount of communication would be to have node i send an unreduced column with pivot in the k-th range to node k directly, instead of to the predecessor node i − 1.
However, this approach would complicate the communication structure and data management significantly. Any node would have to be able to receive unreduced columns at any time, and it would not be possible to bound the number of unprocessed columns a node has to maintain in memory. It would also increase the number of messages sent through the network.
A somewhat dual approach to our communication scheme would be to send the reduced columns from node i to node i + 1, instead of sending the unreduced columns from node i to node i − 1.
In this variant, node i would perform reduction in block (j, i) for j = i, i − 1, . . . , 1.
However, in this approach, the package size would increase towards the end of the reduction, as the number of reduced columns increases, whereas in our implementation the package size decreases as more and more columns become reduced. Since typically most columns are reduced early on, we expect much more data to be sent between the nodes using this variant.

4 Experiments

                 PHAT                  DIPHA
cores / nodes    1        16           2        4        8        16       32
GRF2-256         10.2GB   10.5GB       11.1GB   5.6GB    2.8GB    1.4GB    0.74GB
GRF1-256         10.8GB   11.3GB       11.8GB   6.1GB    3.1GB    1.5GB    0.8GB
GRF2-512         –        –            –        –        –        11.1GB   5.7GB
GRF1-512         –        –            –        –        –        –        9.1GB
vertebra16       –        –            –        –        –        –        9.0GB

Table 1: Peak memory consumption for the sequential and parallel shared memory algorithms (PHAT, left) and our distributed algorithm (DIPHA, right); missing entries could not be computed (see text).

                 PHAT               DIPHA
cores / nodes    1       16         2       4       8       16      32
GRF2-256         14.6s   5.2s       10.1s   5.5s    3.4s    2.2s    1.6s
GRF1-256         28.8s   12.8s      27.2s   20.3s   15.4s   12.1s   9.9s
GRF2-512         –       –          –       –       –       17.9s   11.2s
GRF1-512         –       –          –       –       –       –       95.3s
vertebra16       –       –          –       –       –       –       34.9s

Table 2: Running times for the sequential and parallel shared memory algorithms (PHAT, left) and our distributed algorithm (DIPHA, right).

Since our algorithm is, to the best of our knowledge, the first attempt at computing persistence in a distributed memory context, we concentrate our experimental evaluation on two aspects. First, how does our approach scale with an increasing number of nodes, in running time and memory consumption? Second, how does our algorithm compare with state-of-the-art sequential and parallel shared memory implementations on instances which are still computable in this context?
We implemented Algorithm 2 in C++ using the OpenMPI implementation of the Message Passing Interface (MPI) standard. We ran the distributed algorithm on a cluster with up to 32 nodes, each with two Intel Xeon E5-2670 2.60GHz processors (8 cores each) and 64GB RAM, connected by a 40Gbit Infiniband interconnect.
For comparison, our tests also include results for the PHAT library (http://phat.googlecode.com), which contains efficient sequential and parallel shared memory algorithms for computing persistence. Among the sequential versions, the --twist algorithm option, which is the standard reduction with the clearing optimization described in Section 2, together with the --bit_tree_pivot_column data structure option, showed the overall best performance (see the PHAT documentation for more information). For parallel shared memory, the --block_spectral_sequence algorithm with the --bit_tree_pivot_column data structure showed the overall best performance on the tested examples. We therefore used these two variants for comparison. The sequential and parallel shared memory algorithms were run on a single machine of the cluster. In order to obtain a clear comparison between the shared memory and distributed memory algorithms, only one processor core per node was used in our tests of the distributed algorithm.
For our tests, we focus on filtrations induced by 3D image data. In particular, we used isotropic Gaussian random fields whose power spectral density is given by a power law ‖x‖^(−p). This process is commonly used in physical cosmology as a model for the cosmic microwave background [14]. We consider two image sizes: filtrations of images of size 256³ have a length of n ≈ 1.33 · 10^8 (about 133 million), while images of size 512³ yield a filtration of length n ≈ 1.07 · 10^9 and a binary file size of around 40GB. In addition, we included the 512³ medical image vertebra16 from the VolVis repository (http://volvis.org) in our test set, a rotational angiography scan of a head with an aneurysm.

[Figure 1: Running times for each block reduction, per dimension δ, for the vertebra16 data set on 32 nodes; plots not reproduced.]

Scalability
Tables 1 and 2 show the running time and peak memory consumption of our algorithm for images of size 256³ and 512³. The tables are incomplete; the algorithm was not able to compute a result for the remaining cases because of address space limitations of the OpenMPI I/O API that we plan to circumvent in a forthcoming version. We observe that the memory usage per node is almost exactly halved when doubling the number of nodes. For the running time, the speed-up factor is not quite as high, but still the algorithm terminates faster when using more nodes. In summary, this provides strong evidence that our algorithm scales well with the number of nodes, both regarding time and space complexity.
Comparison
Tables 1 and 2 also list the results for the best sequential and parallel shared memory algorithms of the PHAT library. Both algorithms run out of memory when trying to compute persistence for larger examples on our testing machine, showing that our distributed approach indeed extends the set of feasible instances. Moreover, we observe that the running time on 16 nodes with distributed memory is actually lower than that of the parallel shared memory algorithm on a single machine with 16 processor cores. One reason might be that the distributed system has a much larger total amount of processor cache available than the shared memory system. Since matrix reduction is more memory intensive than processor intensive, this effect may actually outweigh the overhead of communication over the network. This suggests that the distributed approach may be preferable even if the solution is in principle computable in a non-distributed environment.

Communication analysis
We give more details on the amount of data transmitted between the nodes by our algorithm. Table 3 shows the total amount of data exchanged; Table 4 shows the largest total amount of data transmitted between any pair of nodes; Table 5 shows the largest package size. For the more challenging examples, the amount is in the range of gigabytes. Considering the bandwidth of modern interconnects and the fact that communication is bundled into a small number of packages, the running time of the local block reductions dominates the time spent for communication. This is illustrated in Fig. 1, which shows a plot of the running times for each block reduction for the vertebra16 data set on 32 nodes.
5 Conclusion

We presented the first implementation of an algorithm for computing persistent homology in a distributed memory environment. While our algorithm resembles the spectral sequence algorithm for persistence computation to a large extent, several lower-level design choices were necessary for an efficient realization. Our approach permits the computation of instances that were infeasible for previous methods, and the parallelism also speeds up the computation for previously feasible instances.
We plan to extend our experimental evaluation in future work. One problem in benchmarking our new approach is that persistence computation is only the second step in the pipeline: first, one has to generate a filtration that serves as the input for the algorithm. This itself usually requires a massive computation, which at some point becomes infeasible on single machines as well. We are currently working on methods for generating filtrations of large 3D images and Rips filtrations in a distributed memory environment.

References

[1] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Clear and compress: Computing persistent homology in chunks. In TopoInVis 2013, 2013.
[2] F. Chazal, D. Cohen-Steiner, M. Glisse, L. Guibas, and S. Oudot. Proximity of persistence modules and their diagrams. In Proc. 25th ACM Symp. on Comp. Geom., pages 237–246, 2009.
[3] Chao Chen and Michael Kerber. An output-sensitive algorithm for persistent homology. In Proceedings of the 27th Annual Symposium on Computational Geometry, pages 207–215, 2011.
[4] Chao Chen and Michael Kerber. Persistent homology computation with a twist. In 27th European Workshop on Computational Geometry (EuroCG 2011), pages 197–200, 2011. Extended abstract.
[5] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer. Stability of persistence diagrams. Discrete and Computational Geometry, 37:103–120, 2007.
[6] H. Edelsbrunner and J. Harer. Computational Topology, An Introduction. American Mathematical Society, 2010.
[7] Herbert Edelsbrunner and John Harer. Persistent homology — a survey. In Jacob E. Goodman, János Pach, and Richard Pollack, editors, Surveys on Discrete and Computational Geometry: Twenty Years Later, Contemporary Mathematics, pages 257–282. 2008.
[8] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. Discrete and Computational Geometry, 28:511–533, 2002.
[9] R. H. Lewis and A. Zomorodian. Multicore homology. Manuscript, 2012.
[10] David Lipsky, Primoz Skraba, and Mikael Vejdemo-Johansson. A spectral sequence for parallelized persistence. arXiv:1112.1245, 2011.
[11] Clément Maria, Jean-Daniel Boissonnat, and Tamal Dey. The compressed annotation matrix: An efficient data structure for computing persistent cohomology. In ESA 2013, 2013.
[12] Nikola Milosavljević, Dmitriy Morozov, and Primož Škraba. Zigzag persistent homology in matrix multiplication time. In Proceedings of the 27th Annual Symposium on Computational Geometry, pages 216–225, 2011.
[13] Dmitriy Morozov. Persistence algorithm takes cubic time in the worst case. In BioGeometry News. Duke Computer Science, Durham, NC, 2005.
[14] John Peacock. Cosmological Physics. Cambridge University Press, 1999.
[15] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete and Computational Geometry, 33:249–274, 2005.