Cross-Classification Clustering: An Efficient Multi-Object Tracking Technique for 3-D Instance Segmentation in Connectomics
Yaron Meirovitch*, Lu Mi*, Hayk Saribekyan, Alexander Matveev, David Rolnick, Nir Shavit
MIT, Harvard University, Neural Magic Inc., Tel-Aviv University
Abstract
Pixel-accurate tracking of objects is a key element in many computer vision applications, often solved by iterated individual object tracking or instance segmentation followed by object matching. Here we introduce cross-classification clustering (3C), a technique that simultaneously tracks complex, interrelated objects in an image stack. The key idea in cross-classification is to efficiently turn a clustering problem into a classification problem by running a logarithmic number of independent classifications per image, letting the cross-labeling of these classifications uniquely classify each pixel to the object labels. We apply the 3C mechanism to achieve state-of-the-art accuracy in connectomics – the nanoscale mapping of neural tissue from electron microscopy volumes. Our reconstruction system increases scalability by an order of magnitude over existing single-object tracking methods (such as flood-filling networks). This scalability is important for the deployment of connectomics pipelines, since currently the best performing techniques require computing infrastructures that are beyond the reach of most laboratories. Our algorithm may offer benefits in other domains that require pixel-accurate tracking of multiple objects, such as segmentation of videos and medical imagery.
1. Introduction
Object tracking is an important and extensively studied component in many computer vision applications [1, 13, 14, 16, 23, 54, 57, 59]. It occurs both in video segmentation and in 3-D object reconstruction based on 2-D images. Less attention has been given to efficient algorithms performing simultaneous tracking of multiple interrelated objects [14] in order to eliminate the redundancies of tracking multiple objects via repeated use of single-object tracking. This problem is relevant to applications in medical imaging [10, 11, 22, 29, 30, 38] as well as videos [12, 43, 55, 58].

* These authors contributed equally to this work. {yaronm,lumi}@mit.edu

Figure 1. (a) Single-object tracking using flood-filling networks [21]. (b) Multiple-object tracking using our cross-classification clustering (3C) algorithm. (c) The combinatorial encoding of an instance segmentation by 3C: one segmented image with 10 objects is encoded using three images, each with 4 object classes.
The field of connectomics, the mapping of neural tissue at the level of individual neurons and the synapses between them, offers one of the most challenging settings for testing algorithms that track multiple complex objects. Such synaptic-level maps can be made only from high-resolution images taken by electron microscopes, where the sheer volume of data that needs to be processed (petabyte-size image stacks), the desired accuracy and speed (terabytes per hour [33]), and the complexity of the neurons' morphology present a daunting computational task. By analogy to traditional object tracking, imagine that instead of tracking a single sheep through multiple video frames, one must track an entire flock of sheep that intermingle as they move, change shape, disappear and reappear, and obscure each other [32]. As a consequence of this complexity, several highly successful tracking approaches from other domains, such as the "detect and track" approach [14], are less immediately applicable to connectomics.

Figure 2. The raw input electron microscopy (EM) full image stack and our 3C-LSTM-UNET results on the SNEMI3D benchmark.

Certain salient aspects are unique to the connectomics domain: a) All objects are of the same type (biological cells); sub-categorizing them is difficult and has little relevance to the segmentation problem. b) Most of the image is foreground, with tens to hundreds of objects in a single megapixel image. c) Objects have intricate, finely branched shapes and no two are the same. d) Stitching and alignment of images can be imperfect, and the distance between images (z-resolution) is often greater than between pixels of the same image (xy-resolution), sometimes breaking the objects' continuity. e) Some 3-D objects are laid out parallel to the image stack, spanning few images in the z direction and going back and forth in that limited space with extremely large extensions in some image planes.

In this work, we introduce 3C, a technique that achieves volumetric instance segmentation by transferring segmentation knowledge from one image to another, simultaneously classifying the pixels of the target image(s) with the labels of the matching objects from the source image(s). This algorithm is optimized for the setting of connectomics, in which objects frequently branch and come together, but is suitable for a wide range of video-segmentation and medical imaging applications.

The main advantage of our solution is its ability, unlike prior single-object tracking methods for connectomics [21, 37], to simultaneously and jointly segment neighboring, intermingled objects, thereby avoiding redundant computation. In addition, instead of extending single masks, our detectors perform clustering by taking into account information on all visible objects.

The efficiency and accuracy of 3C are demonstrated on four connectomics datasets: the public SNEMI3D benchmark dataset, shown in Figure 2; the widely studied mouse somatosensory cortex dataset [24] (S1); a Lichtman Lab dataset of the V1 region of the rat brain (ECS); and a newly aligned mouse peripheral nervous system dataset (PNS), where possible comparing to other competitive results in the field of connectomics.
A variety of techniques from the past decade have addressed the task of neuron segmentation from electron microscopy volumes. An increasing effort has been dedicated to the problem of densely segmenting all pixels of a volume according to foreground object instances (nerve and support cells), known as saturated reconstruction. Note that unlike everyday images, a typical megapixel electron microscopy image may contain hundreds of object instances, with very little background. Furthermore, the existing single-object detectors in connectomics [21, 37] and in other biomedical domains (e.g. [4, 8, 19, 48]) do not take advantage of the multi-object scene to better understand the spatial correlation between different 3-D objects. (Such approaches thus take time linear in the number of objects and in the number of pixels, with a large constant that depends on the object density.) The approach taken here generalizes the single-object approach in connectomics to achieve simpler and more effective instance segmentation of the entire volume.

We provide a scalable framework for 3-D instance segmentation and multi-object tracking applications, with the following contributions:
• We propose a simple FCN approach, tackling the less studied problem of mapping an instance segmentation between two related images. Our algorithm jointly predicts the shapes of several objects partially observed in the input.
• We propose a novel technique that turns a clustering problem into a classification problem by running a logarithmic number of independent classifications on the pixels of an image with N objects (for possibly large N, bounded only by the number of pixels).
• We show empirically that the simultaneous tracking ability of our algorithm is more efficient than independently tracking all objects.
• We conduct extensive experimentation with four connectomics datasets, under different evaluation criteria and a performance analysis, to show the efficacy and efficiency of our technique on the problem of neuronal reconstruction.
Figure 3. A high-level view of our 3-D instance segmentation pipeline.
2. Methodology
We present cross-classification clustering (henceforth 3C), a technique that extends single-object classification approaches, simultaneously and efficiently classifying all objects in a given image based on a proposed instance segmentation of a context-related image. One can think of the context-related image and its segmentation as a collection of labeled masks to be simultaneously remapped together to the new target image, as in Figure 1(b). The immediate difficulty of such simultaneous settings is that this generalization is a clustering problem: unlike FFNs and MaskExtend (shown in Figure 1(a)), which produce a binary output ("YES" for extending the object and otherwise "NO"), in any volume we really do not know how many classification labels we might need to capture all the objects, or, more importantly, how to represent those instances in ways usable for supervised learning. Overcoming this difficulty is a key contribution of 3C.

Cross-Classification Clustering:
We begin by explaining the main idea behind 3C and differentiating it from single-object methods such as FFNs. We then provide a top-down sketch of our pipeline and describe how it can be adapted to other domains.

Our goal is to extend a single-object classification from one image to the next so as to simultaneously classify pixels for an a priori unknown set of object labels. More formally, suppose that we have images X_prev and X_next, where X_prev has been segmented and X_next must be segmented consistently with X_prev. Given two such images, we seek a classification function f that takes as input a voxel v of X_next and a segmentation s of X_prev (an integer matrix representing object labels) and outputs a decision label. The function f outputs a label if and only if v belongs to the object with that label in s. If s is allowed to be an over-segmentation (i.e., several labels representing the same object), then the output of f should be one of the compatible labels.

For simplicity, let us assume that the input segmentation s has entries from the label set {1, ..., N}. We define a new space of labels, the length-k strings over a predetermined alphabet A (here represented by colors), where |A| = l and n = |A|^k ≥ N is an upper bound on the number of objects we expect in a classification. We use an arbitrary encoding function, χ, that maps labels in {1, ..., N} to distinct random strings over A of length k. In the example in Figure 1(c), A is represented by l = 4 colors and k = 3, so we have a total of 4^3 = 64 possible strings of length 3 to which the N = 10 objects can be mapped. Thus, for example, object 5 is mapped to the string (Green, Purple, Orange) and object 1 to (Green, Green, Purple). We can define the classification function f on string labels as the product of k traditional classifications, each with an input segmentation of labels in A and an output of labels in A. Slightly abusing notation, let the direct product of images χ(s) = χ_1(s) × ... × χ_k(s) be the relabeling of the segmentation s, where each image (or tensor) χ_i(s) is the projection of χ(s) in the i-th location (a single coloring of the segmentation) and × is the concatenation operation on labels in A. Then we can re-define f on χ(s) as

f(v, χ(s)) = f'(v, χ_1(s)) × ... × f'(v, χ_k(s)),   (1)

where each f'(v, χ_i(s)) is a traditional classification function. The key idea is that f' is a classification of v based on an instance segmentation with l predetermined labels. In the example in Figure 1(c), even though objects 5 and 1 are both Green in the map representing the most significant digit, when we perform the classification and take the cross labeling of all three maps, the two objects are classified into distinct labels.
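The encoding and decoding steps are mechanical enough to sketch directly. The following is a minimal NumPy sketch of one plausible implementation of χ and χ^{-1}; it is not the authors' code, and the function names, the random codeword assignment, and the +1 background shift are our own illustrative choices.

```python
import numpy as np

def make_encoding(num_labels, alphabet_size=4, seed=0):
    """Sketch of the 3C encoding rule chi: map each of the N seed labels to a
    distinct random length-k string ("digits" in base l) over an alphabet of
    l colors, with k = ceil(log_l N)."""
    rng = np.random.default_rng(seed)
    k = max(1, int(np.ceil(np.log(num_labels) / np.log(alphabet_size))))
    codewords = rng.choice(alphabet_size ** k, size=num_labels, replace=False)
    digits = np.stack([(codewords // alphabet_size ** i) % alphabet_size
                       for i in range(k)], axis=1)            # shape (N, k)
    return digits, k

def encode(seg, digits):
    """chi: project an integer segmentation s (labels 1..N, 0 = no seed) into k
    single-color images chi_1(s), ..., chi_k(s); digit values are shifted by +1
    so that 0 still means 'no seed' in every projection."""
    k = digits.shape[1]
    out = np.zeros((k,) + seg.shape, dtype=np.int64)
    fg = seg > 0
    out[:, fg] = digits[seg[fg] - 1].T + 1
    return out

def decode(pred, digits, alphabet_size=4):
    """chi^-1: combine the k predicted color maps (the FCN outputs) back into
    instance labels by looking up each pixel's digit string among the codewords."""
    k = pred.shape[0]
    valid = np.all(pred > 0, axis=0)                           # every digit predicted
    code = sum((pred[i] - 1) * alphabet_size ** i for i in range(k))
    lut = {int((d * alphabet_size ** np.arange(k)).sum()): lbl + 1
           for lbl, d in enumerate(digits)}
    out = np.zeros(pred.shape[1:], dtype=np.int64)
    for cw, lbl in lut.items():
        out[valid & (code == cw)] = lbl
    return out

# Toy example in the spirit of Figure 1(c): 10 objects over l = 4 colors
# (k = ceil(log_4 10) = 2 here; the figure uses k = 3).
digits, k = make_encoding(num_labels=10, alphabet_size=4)
seg = np.random.default_rng(1).integers(0, 11, size=(8, 8))    # toy source segmentation
projections = encode(seg, digits)                              # k images, one per FCN run
assert np.array_equal(decode(projections, digits), seg)
```

In a real pipeline, `projections` would be the seed channels fed to k independent FCN runs, and `decode` would be applied to the k predicted label maps rather than to the unchanged encoding.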
Our 3-D reconstruction system consists of the following steps (shown in Figure 3):
1. Seeding and labeling the volume with an initial imperfect 2-D/3-D instance segmentation that overlaps all objects except for their boundaries (over-segmentation).
2. Encoding the labeled seeds into a new space using the 3C rule χ.
3. Applying a fully convolutional network log(N) times to transfer the projected labeled seeds from the source image to the target images, and then taking their cross labeling.
4. Decoding the set of network outputs to the original label space using the inverse 3C rule χ^{-1}.
5. Agglomerating the labels into 3-D consistent objects based on the overlap of the original seeding and the segments predicted from other sections.

To initially seed and label the volume (Step 1), we compute and label 2-D masks that over-segment all objects. For this we follow common practice in connectomics, computing object borders with FCNs and searching for local minima in 2-D on the border elevation maps (a minimal sketch of this step is given below). Subsequently (Step 2), we use χ to encode the seeds of each section, resulting in a k-tuple over the l-color alphabet for each seed (k = 5 and l = 4 in Figure 4).

Figure 4. The instance segmentation transfer mechanism of 3C: encoding the seeded image as k l-color images using the encoding rule χ (k = log(N); here k = 5 and l = 4); applying a fully convolutional network k times to transfer each of the seed images to a respective target; and decoding the set of k predicted seed images using the decoding rule χ^{-1}.
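A minimal sketch of the Step 1 seeding, using SciPy and scikit-image, is given below. It follows the description above (local minima of the border elevation map grown into an over-segmentation) but is not the authors' exact procedure, and the parameters are illustrative.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def seed_section(border_prob, min_size=5, border_threshold=0.5):
    """Derive labeled 2-D seeds from a border probability map (sketch).
    `min_size` and `border_threshold` are illustrative parameters."""
    # Local minima (plateaus included) of the border "elevation" map,
    # restricted to pixels that are unlikely to be border.
    is_min = border_prob == ndi.minimum_filter(border_prob, size=min_size)
    is_min &= border_prob < border_threshold
    markers, _ = ndi.label(is_min)
    # Grow every marker with a watershed on the elevation map, masked to
    # non-border pixels, so each seed stops at high border probability.
    return watershed(border_prob, markers, mask=border_prob < border_threshold)
```

Each section is seeded independently; the resulting 2-D label images are what χ encodes in Step 2.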
Figure 5. A schematic view of a global merge decision. The edge weight between seed i and seed j at sections Z and Z+W, respectively, e_ij^{Z+W|Z}, is calculated by the ratio of the overlapping areas of seed j and the 3C prediction of seed i transferred from image Z to image Z+W. Seeds that over-segment a common object tend to get merged due to a path of strong edges.

A fully convolutional neural network then predicts the correct label mapping between interrelated images of sections Z and Z ± W, which determines which pixels in target image Z ± W belong to which seeds in source image Z (Step 3). Each classification is over the predetermined alphabet of l colors, and prediction is done log(N) times based on Equation 1. For decoding, all log(N) predictions are aggregated for each pixel to determine the original label of the seed using χ^{-1} (Step 4). For training, we use saturated ground truth of the 3-D consistent objects. This approach allows us to formalize the reconstruction problem as a tracking problem between independent images, and to deal with the tendency of objects to disappear/appear in different portions of an enormous 3-D dataset.

Figure 6. A schematic view of the 3C networks. The input layers have channels of the raw image, seed mask, and border probability for W+1 consecutive sections (images). The output is a feature map of seed predictions in section Z ± W (binary or labeled). Top: 3C-LSTM-UNET, the network architecture implemented for the SNEMI3D dataset to optimize for accuracy; the inputs are processed with three consecutive Conv-LSTM modules, followed by a symmetric Residual U-Net structure. Bottom: 3C-Maxout, the network architecture implemented for the Harvard Rodent Cortex and PNS datasets to optimize for speed.

We now describe how the 3C seed transfers are utilized to agglomerate the 2-D masks (as shown in Figure 5). For agglomeration (Step 5), the FCN for 3C is applied from all source images to their target images, which are at most W image sections apart from each other across the image stack (along the z dimension). We collect overlaps between all co-occurring segments, namely those occurring in the original 2-D seeding and those produced by the 3C seed transfer from source to target images. This leaves W+1 instance segmentation cases for each image (including the initial seeding), which directly link seeds of different sections. Formally, the overlaps of different labels define a graph whose nodes are the 2-D seed mask labels and whose directed weighted edges are their overlap ratio from the source to the target. Instead of optimizing this structure (as in the Fusion approach of [25]), we found that agglomerating all masks of sufficient overlap delivers adequate accuracy even for a small W. We do, however, force linking on lower-probability edges to avoid "orphan" objects that are too small, which is biologically implausible. We provide further details in the Supplementary Materials; a minimal sketch of the agglomeration step is given below.

We note that 3C does not attempt to correct potential merge errors in the initial seeding. These can be addressed post hoc by learning morphological features [49, 60] or global constraints [34].
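The sketch below illustrates the agglomeration step under the assumption that the 3C transfers are already available as label images. The edge normalization and the union-find merge are our own simplifications of the procedure described above; global optimization and orphan handling are omitted.

```python
import numpy as np

class UnionFind:
    """Minimal union-find used to merge seed labels connected by strong edges."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def overlap_edges(transferred, seeds_target, z_source, z_target):
    """Directed edge weights e_ij: the fraction of the 3C prediction of seed i
    (transferred from section z_source into z_target) that overlaps seed j of
    section z_target (one plausible normalization of the Figure 5 edge weight)."""
    mask = transferred > 0
    vals, cnts = np.unique(transferred[mask], return_counts=True)
    src_area = dict(zip(vals, cnts))
    pairs, counts = np.unique(np.stack([transferred[mask], seeds_target[mask]]),
                              axis=1, return_counts=True)
    return {((z_source, int(i)), (z_target, int(j))): c / src_area[i]
            for (i, j), c in zip(pairs.T, counts) if j > 0}

def agglomerate(all_edges, threshold=0.1):
    """Merge all seed labels connected by an edge of sufficient overlap; the paper
    reports threshold 0.1 and W = 2 for SNEMI3D (orphan handling omitted here)."""
    uf = UnionFind()
    for (src, tgt), w in all_edges.items():
        if w >= threshold:
            uf.union(src, tgt)
    return {node: uf.find(node) for edge in all_edges for node in edge}
```

In practice `all_edges` would be accumulated over every (source, target) pair within W sections, and the returned node-to-root map relabels the 2-D seeds into 3-D consistent objects.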
Adaptation to other domains: To leverage 3C for multi-object tracking in videos, a domain-specific seeder should precede cross-classification (e.g. with deep coloring [28]). Natural images are likely to introduce spatially consistent object splits across frames, and hence a dedicated agglomeration procedure should follow. The 3C technique can be readily applied to other medical imaging tasks, with seed transfers across different axes for isotropic settings.
3. Experiments
The SNEMI3D challenge is a widely used benchmark for connectomic segmentation algorithms dealing with anisotropic EM image stacks [5]. Although the competition ended in 2014, several leading labs have recently submitted new results on this dataset, improving the state of the art. Recently, Plaza et al. suggested that benchmarking connectomics accuracy on small datasets such as SNEMI3D is misleading, as large-scale "catastrophic errors" are hard to assess [44, 45]. Moreover, clustering metrics such as Variation of Information [36] and Rand Error [53] are inappropriate since they are not centered around the connectomics goal of unraveling neuron shape and inter-neuron connectivity. We therefore conduct experiments on three additional datasets and show the Rand-Error results only on the canonical SNEMI3D dataset. To assess the quality of 3C at large scale, we demonstrate results on the widely studied dataset by Kasthuri et al. [24] (S1 Dataset). To further assess our results in terms of the end goal of connectomics, neuronal connectivity, we evaluate the synaptic connectivity of the 3C objects using the NRI metric [46] (ECS Dataset). In the final experiment we focus on the tracking ability of 3C (PNS Dataset).
In order to implement 3C on the SNEMI3D dataset, we first created an initial set of 2-D labeled seeds over the entire volume. These were generated from the regional 2-D minima of the border probability map. This map was produced by a Residual U-Net, which is known for its excellent average pixel accuracy in border detection [31, 50]. Next, the 3C algorithm was used to transfer 2-D labeled seeds through the volume, as shown in Figure 4. Finally, the original 2-D labeled seeds and transferred labeled seeds were agglomerated if their overlap ratio exceeded 0.1. We found that W = 2 delivers adequate accuracy. All orphans were greedily agglomerated to their best match.

To achieve better accuracy, we tested 3C with various network architectures and evaluated their accuracy. To date, convolutional LSTMs (ConvLSTM) have shown good performance for sequential image data [56]. To adapt these methods to the high pixel accuracy required for connectomics, we combined ConvLSTM and U-Net. The network is trained to learn an instance segmentation of one image based on the proposed instance segmentation of a nearby image with similar context. We found that the LSTM-UNET architecture has a validation accuracy of 0.961, which outperforms other commonly used architectures. A schematic view of our architecture is given in Figure 6. Details are provided in the Supplementary Materials.

To illustrate the accuracy of 3C, we submitted our result to the public SNEMI3D challenge website alongside two common baseline models, the 3-D watershed transform (a region-growing technique) and Neuroproof agglomeration [41]. Our watershed code was adopted from [35]. Similar to other traditional agglomeration techniques, Neuroproof trains a random forest on merge decisions of neighboring objects [40, 41, 42]. These baseline methods were fed the same high-quality border maps used in our 3C reconstruction system. Comparisons of 2-D results with ground truth (section Z = 30) are shown in Figure 7. Our result has fewer merge and split errors and outperforms the two baselines by a large margin. Furthermore, 3C compares favorably to other state-of-the-art work recently published in Nature Methods [6, 21]. On the SNEMI3D challenge leaderboard, the Rand-Error of 3C was 0.041, compared with the 0.06 achieved by a human annotator. Our accuracy (ranked 3rd) outperforms most of the traditional pipelines, many by a large margin, and is slightly behind the slower neuron-by-neuron FFN segmentation for this volume. The leading entry is a UNET-based model learning short- and long-range affinities [31]. The results are summarized in Table 1.

Model       | Rand  | VI   | VI split | VI merge | Complexity
Watershed   | 0.113 | 0.67 | 0.55     | 0.12     | -
Neuroproof  | 0.104 | 0.55 | 0.42     | 0.13     | -
Multicut    | 0.068 | 0.41 | 0.34     | 0.07     | -
3C          | 0.041 | 0.31 | 0.19     | 0.12     | O(V log N)
FFN         | 0.029 | -    | -        | -        | O(V N)
Human val.  | 0.060 | -    | -        | -        | -

Table 1. Comparison of Watershed, Neuroproof [41], Multicut [6], human values, 3C, and FFN [21] on the SNEMI3D dataset for Rand-Error, Variation of Information (VI), VI split, and VI merge. Time complexity: N is the number of objects and V is the number of pixels. For an empirical comparison see the performance section. We do not have access to the FFN and human outputs, hence their VI metrics are missing.

Harvard Rodent Cortex Datasets (ECS, S1)

We describe two additional tests: (1) 3C on datasets with known synaptic connectivity (subsets of ECS and S1), and (2) a lightweight agglomeration-free reconstruction applied to a large-scale dataset (S1).
Connectivity-based test: Following [45], which recently advocated connectivity-based evaluation of connectomics, the accuracy of the pipeline was evaluated using the NRI metric [46]. In a nutshell, the NRI ranges between 0 and 1, measuring how well a given neuronal segmentation preserves the object connectivity between neural synapses (1 being optimal).

For the first test, we used a lightweight yet successful FCN model [35] (Maxout) for the border and 3C computations, reconstructing the test set of [24] (S1) (3C with a FOV of 109 pixels). Maxout is currently the fastest border detector in connectomics and was previously used successfully for single-object tracking [37]. Details of the architecture and training are presented in the Supplementary Materials. The NRI score of the 3C-Maxout segmentation was 0.54, compared to 0.41 for a traditional agglomeration pipeline [35].

For the second test, we were granted permission to reconstruct a recently collected rat cortex dataset of the Lichtman group at Harvard (ECS). This test allowed the comparison of 3C to the excellent agglomerative approach of [42] (4th on SNEMI3D), while using exactly their U-Net [50] border predictions as inputs to our 3C network. On the test set our NRI score was 0.86, compared to 0.73 for the agglomeration pipeline.
Large-scale reconstruction (S1): We also ran a fast version of 3C on the entire S1 dataset (90 gigavoxels: 1840 slices, 6 nm x 6 nm x 30 nm per voxel). In this experiment, we omitted the agglomeration step of the reconstruction algorithm to achieve better scalability and let 3C run on 3-D masks computed from local minima of the border probability maps. This implementation is highly scalable since it has no agglomeration step, while the 3C masks are updated on-the-fly in a streaming fashion every 100 slices. The Maxout implementation is attractive for large-scale systems because it is efficiently parallelized on multi-core systems with excellent cache behavior on CPUs [35]. Figure 8 shows five objects that span the whole volume of S1.

Figure 7. SNEMI3D: the 3C-LSTM-UNET results compared with baseline techniques (Watershed, Neuroproof) and ground truth.

Figure 8. Results on the Kasthuri et al. [24] S1 dataset. Fast lightweight 3C-Maxout operating on 3-D seeds, without agglomeration. Background: segmented section. Foreground: five 3-D reconstructed objects.

Figure 9. 3C-Maxout results of recursively tracking all objects (axons) directly from the PNS raw images (no post-processing).

Peripheral Nervous System (PNS) Dataset
Next, we tested the ability of the 3C framework to track objects recursively based on raw images in a streaming mode, that is, independently of any agglomeration or post-processing steps.

We chose a previously unpublished motor nerve bundle from a (newborn) mouse, contributed by the Lichtman lab at Harvard, for this purpose because it is a closed system in which all objects are visible in the first and last image sections of the 915-image dataset. This dataset is important to neurobiologists since it contains the entire neural input (21 axons) of a complete muscle.

Again, we applied the 3C algorithm using the lightweight FCN Maxout architecture of [35]. 3C was able to track all objects without erroneous merges; results are shown in Figure 9. Out of the 21 axons, 20 were recursively reconstructed to their full extent (split errors in only one object). One extremely thin axon disappeared from the image, reappeared after 7 sections, and was not reconstructed. The run-length for all reconstructed axons was above 70 microns (and > sections) until they exited the volume on the last slice of the image stack.

This benchmark demonstrates that: a) 3C can recursively track objects directly from raw images without agglomeration or post-processing; and b) our training procedures display satisfactory generalization abilities, learning from a relatively small number of examples.
4. Scalability Comparisons
In this section, we compare the relative scalability of 3C to FFN and MaskExtend, as far as possible without having access to the full FFN pipeline. 3C is a generalization of the FFN and MaskExtend techniques [21, 37], which augment the (pixel) support set of each object one at a time. The 3C technique simultaneously augments all the objects in its input image(s) after a logarithmic number of iterations (see Figure 4). This allows us to directly compare the two types of approaches based on the number of iterations required, ignoring details of implementation.
FFN:
We compare the number of FCN calls in FFN and 3C assuming both algorithms reconstruct all objects flawlessly. We assume both algorithms use the same FCN model. Although 3C and FFN invoke FCNs a logarithmic versus a linear number of times, respectively, FFN runs on smaller inputs, centered around small regions of interest. At each iteration, FFN will output an entire 3-D mask of the object around the center pixel. We assume that a fraction of those pixels will require revisiting (zero in the best-case scenario). Figure 10 depicts the number of FCN calls and their ratio for FlyEM [51] and SNEMI3D [5] for FFN and 3C. In 3C, each pixel participates in an FCN call a number of times logarithmic in the number of objects visible in the field of view (the FCN calls for FFN are color-coded red in the data cubes of Figure 10). The y-axis depicts the ratio between the FCN calls by 3C and those by FFN, for several ratios of object pixels found per FCN call. A ratio of zero means that no pixels are found for the object in a single FCN call, whereas 1 means that all object pixels are found and require no further revisiting. We can see from the plot that, assuming error-free reconstruction, 3C is more efficient than FFN whenever a fraction of object pixels requires revisiting after a single call of the FCN. The revisiting of some pixels is also reported by [21], as the 3-D output has greater uncertainty far from the initial pixels. For a revisit ratio of 0.5, 3C is more than 10x faster than FFN on FlyEM.

Figure 10. Compute cost per pixel using FFN-style segmentation. We computed the number of times a pixel participates in object detection (red) for two public datasets (SNEMI3D, FlyEM) and compared it to the number of classification calls in 3C.
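The argument can be made concrete with a toy per-pixel cost model. The sketch below is illustrative only and is not the accounting behind Figure 10: it assumes a fixed number of objects per field of view and folds all FFN revisits into a single found_fraction parameter.

```python
import math

def calls_per_pixel_3c(objects_in_fov, alphabet_size=4):
    """3C: every pixel is classified once per encoded digit, i.e. ceil(log_l N) times."""
    return math.ceil(math.log(max(objects_in_fov, 2), alphabet_size))

def calls_per_pixel_ffn(objects_in_fov, found_fraction):
    """Toy FFN model: a pixel is (re)visited once per object whose detection passes
    over it; finding only a fraction of an object's pixels per call inflates the
    count by 1 / found_fraction (found_fraction in (0, 1])."""
    return objects_in_fov / found_fraction

if __name__ == "__main__":
    for n_objects in (10, 50, 100):
        for frac in (0.25, 0.5, 1.0):
            ratio = calls_per_pixel_ffn(n_objects, frac) / calls_per_pixel_3c(n_objects)
            print(f"objects={n_objects:3d}  found_fraction={frac:.2f}  "
                  f"FFN calls / 3C calls = {ratio:6.1f}")
```

Under this toy model, with half of an object's pixels found per call and on the order of a hundred objects in the field of view, the FFN-to-3C call ratio exceeds an order of magnitude, in line with the FlyEM observation above.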
Figure 11 repeats the above procedure with MaskExtend [37], comparing its FCN calls with those of 3C. MaskExtend is more wasteful than 3C, propagating some pixels into its FCN model 23 times. The instruction and cycle counts, as well as the L1 cache pressure, are larger for MaskExtend (on an equal multi-core infrastructure and inference framework [35]).

Figure 11. Compute cost per pixel with single-object tracking methods [37]. The number of calls per pixel is color-coded in purple. For the most dense areas, 23 calls of the object detector are required. The table depicts the performance-counter statistics for the execution of [37] and the 3C-Maxout FCN on a stack of 100 images.
5. Conclusion
In this paper, we have presented cross-classification clustering (3C), an algorithm that tracks multiple objects simultaneously, transferring a segmentation from one image to the next by composing simpler segmentations. We have demonstrated the power of 3C in the domain of connectomics, which presents an especially difficult task for image segmentation. Within the space of connectomics algorithms, 3C provides an end-to-end approach with fewer "moving parts," improving on the accuracy of many leading connectomics systems. Our solution is computationally cheap, can be achieved with lightweight FCNs, and is at least an order of magnitude faster than its relative, flood-filling networks. Although the main theme of this paper was tackling neuronal reconstruction, our approach also promises scalable, effective algorithms for broader applications in medical imaging and video tracking.
Acknowledgements
We would like to thank Jeff Lichtman and Kai Kang for allowing us to access the PNS dataset, Marco Badwal for alignment, and Daniel Berger and Casimir Wierzynski for insightful comments. This research was supported by the National Science Foundation (NSF) under grants IIS-1607189, IIS-1447786, CCF-1563880, and IIS-1803547, and by a grant from the Intel corporation.
References
[1] Amit Adam, Ehud Rivlin, and Ilan Shimshoni. Robust fragments-based tracking using the integral histogram. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 798–805. IEEE, 2006.
[2] Björn Andres, Ullrich Köthe, Moritz Helmstaedter, Winfried Denk, and Fred A Hamprecht. Segmentation of SBFSEM volume data of neural tissue by hierarchical classification. In Joint Pattern Recognition Symposium, pages 142–152. Springer, 2008.
[3] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):898–916, 2011.
[4] Assaf Arbelle, Jose Reyes, Jia-Yun Chen, Galit Lahav, and Tammy Riklin Raviv. A probabilistic approach to joint cell tracking and segmentation in high-throughput microscopy videos. Medical Image Analysis, 47:140–152, 2018.
[5] Ignacio Arganda-Carreras, Srinivas C Turaga, Daniel R Berger, Dan Cireşan, Alessandro Giusti, Luca M Gambardella, Jürgen Schmidhuber, Dmitry Laptev, Sarvesh Dwivedi, Joachim M Buhmann, et al. Crowdsourcing the creation of image segmentation algorithms for connectomics. Frontiers in Neuroanatomy, 9:142, 2015.
[6] Thorsten Beier, Constantin Pape, Nasim Rahaman, Timo Prange, Stuart Berg, Davi D Bock, Albert Cardona, Graham W Knott, Stephen M Plaza, Louis K Scheffer, et al. Multicut brings automated neurite segmentation closer to human performance. Nature Methods, 14(2):101–102, 2017.
[7] Manuel Berning, Kevin M Boergens, and Moritz Helmstaedter. SegEM: efficient image analysis for high-resolution connectomics. Neuron, 87(6):1193–1206, 2015.
[8] Ryoma Bise, Takeo Kanade, Zhaozheng Yin, and Seung-il Huh. Automatic cell tracking applied to analysis of cell migration in wound healing assay. In , pages 6174–6179. IEEE, 2011.
[9] Dan Ciresan, Alessandro Giusti, Luca M Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Advances in Neural Information Processing Systems, pages 2843–2851, 2012.
[10] Qi Dou, Lequan Yu, Hao Chen, Yueming Jin, Xin Yang, Jing Qin, and Pheng-Ann Heng. 3D deeply supervised network for automated segmentation of volumetric medical images. Medical Image Analysis, 41:40–54, 2017.
[11] Michal Drozdzal, Gabriel Chartrand, Eugene Vorontsov, Mahsa Shakeri, Lisa Di Jorio, An Tang, Adriana Romero, Yoshua Bengio, Chris Pal, and Samuel Kadoury. Learning normalized inputs for iterative estimation in medical image segmentation. Medical Image Analysis, 44:1–13, 2018.
[12] Loïc Fagot-Bouquet, Romaric Audigier, Yoann Dhome, and Frédéric Lerasle. Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In European Conference on Computer Vision, pages 774–790. Springer, 2016.
[13] Jialue Fan, Wei Xu, Ying Wu, and Yihong Gong. Human tracking using convolutional neural networks. IEEE Transactions on Neural Networks, 21(10):1610–1623, 2010.
[14] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-track: Efficient pose estimation in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 350–359, 2018.
[15] Kostas Haris, Serafim N Efstratiadis, Nikolaos Maglaveras, and Aggelos K Katsaggelos. Hybrid image segmentation using watersheds and fast region merging. IEEE Transactions on Image Processing, 7(12):1684–1699, 1998.
[16] David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 FPS with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.
[17] Moritz Helmstaedter. Cellular-resolution connectomics: challenges of dense neural circuit reconstruction. Nature Methods, 10(6):501–507, 2013.
[18] Moritz Helmstaedter, Kevin L Briggman, Srinivas C Turaga, Viren Jain, H Sebastian Seung, and Winfried Denk. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500(7461):168, 2013.
[19] Nathaniel Huebsch, Peter Loskill, Mohammad A Mandegar, Natalie C Marks, Alice S Sheehan, Zhen Ma, Anurag Mathur, Trieu N Nguyen, Jennie C Yoo, Luke M Judge, et al. Automated video-based analysis of contractility and calcium flux in human-induced pluripotent stem cell-derived cardiomyocytes cultured over different spatial scales. Tissue Engineering Part C: Methods, 21(5):467–479, 2015.
[20] Viren Jain, Joseph F Murray, Fabian Roth, Srinivas Turaga, Valentin Zhigulin, Kevin L Briggman, Moritz N Helmstaedter, Winfried Denk, and H Sebastian Seung. Supervised learning of image restoration with convolutional networks. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
[21] Michał Januszewski, Jörgen Kornfeld, Peter H Li, Art Pope, Tim Blakely, Larry Lindsey, Jeremy Maitin-Shepard, Mike Tyka, Winfried Denk, and Viren Jain. High-precision automated reconstruction of neurons with flood-filling networks. Nature Methods, 15(8):605, 2018.
[22] Alexandr A Kalinin, Ari Allyn-Feuer, Alex Ade, Gordon-Victor Fon, Walter Meixner, David Dilworth, Jeffrey R De Wet, Gerald A Higgins, Gen Zheng, Amy Creekmore, et al. 3D cell nuclear morphology: microscopy imaging dataset and voxel-based morphometry classification results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2272–2280, 2018.
[23] Kai Kang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 817–825, 2016.
[24] Narayanan Kasthuri, Kenneth Jeffrey Hayworth, Daniel Raimund Berger, Richard Lee Schalek, José Angel Conchello, Seymour Knowles-Barley, Dongil Lee, Amelio Vázquez-Reina, Verena Kaynig, Thouis Raymond Jones, et al. Saturated reconstruction of a volume of neocortex. Cell, 162(3):648–661, 2015.
[25] Verena Kaynig, Amelio Vazquez-Reina, Seymour Knowles-Barley, Mike Roberts, Thouis R Jones, Narayanan Kasthuri, Eric Miller, Jeff Lichtman, and Hanspeter Pfister. Large-scale automatic reconstruction of neuronal processes from electron microscopy images. Medical Image Analysis, 22(1):77–88, 2015.
[26] Jinseop S Kim, Matthew J Greene, Aleksandar Zlateski, Kisuk Lee, Mark Richardson, Srinivas C Turaga, Michael Purcaro, Matthew Balkam, Amy Robinson, Bardia F Beabadi, et al. Space–time wiring specificity supports direction selectivity in the retina. Nature, 509(7500):331, 2014.
[27] Seymour Knowles-Barley, Verena Kaynig, Thouis Ray Jones, Alyssa Wilson, Joshua Morgan, Dongil Lee, Daniel Berger, Narayanan Kasthuri, Jeff W Lichtman, and Hanspeter Pfister. RhoanaNet pipeline: Dense automatic neural annotation. arXiv preprint arXiv:1611.06973, 2016.
[28] Victor Kulikov, Victor Yurchenko, and Victor Lempitsky. Instance segmentation by deep coloring. arXiv preprint arXiv:1807.10007, 2018.
[29] Avisek Lahiri, Kumar Ayush, Prabir Kumar Biswas, and Pabitra Mitra. Generative adversarial learning for reducing manual annotation in semantic segmentation on large scale miscroscopy images: Automated vessel segmentation in retinal fundus image as test case. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 42–48, 2017.
[30] June-Goo Lee, Sanghoon Jun, Young-Won Cho, Hyunna Lee, Guk Bae Kim, Joon Beom Seo, and Namkug Kim. Deep learning in medical imaging: general overview. Korean Journal of Radiology, 18(4):570–584, 2017.
[31] Kisuk Lee, Jonathan Zung, Peter Li, Viren Jain, and H Sebastian Seung. Superhuman accuracy on the SNEMI3D connectomics challenge. arXiv preprint arXiv:1706.00120, 2017.
[32] Jeff W Lichtman and Winfried Denk. The big and the small: challenges of imaging the brain's circuits. Science, 334(6056):618–623, 2011.
[33] Jeff W Lichtman, Hanspeter Pfister, and Nir Shavit. The big data challenges of connectomics. Nature Neuroscience, 17(11):1448–1454, 2014.
[34] Brian Matejek, Daniel Haehn, Haidong Zhu, Donglai Wei, Toufiq Parag, and Hanspeter Pfister. Biologically-constrained graphs for global connectomics reconstruction. CVPR, 2019.
[35] Alexander Matveev, Yaron Meirovitch, Hayk Saribekyan, Wiktor Jakubiuk, Tim Kaler, Gergely Odor, David Budden, Aleksandar Zlateski, and Nir Shavit. A multicore path to connectomics-on-demand. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '17, pages 267–281, New York, NY, USA, 2017. ACM.
[36] Marina Meilă. Comparing clusterings. Journal of Multivariate Analysis, 98(5):873–895, 2007.
[37] Yaron Meirovitch, Alexander Matveev, Hayk Saribekyan, David Budden, David Rolnick, Gergely Odor, Seymour Knowles-Barley, Thouis Raymond Jones, Hanspeter Pfister, Jeff William Lichtman, and Nir Shavit. A multi-pass approach to large-scale connectomics. arXiv preprint arXiv:1612.02120, 2016.
[38] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In , pages 565–571. IEEE, 2016.
[39] Laurent Najman and Michel Schmitt. Geodesic saliency of watershed contours and hierarchical segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1163–1173, 1996.
[40] Juan Nunez-Iglesias, Ryan Kennedy, Toufiq Parag, Jianbo Shi, and Dmitri B Chklovskii. Machine learning of hierarchical clustering to segment 2D and 3D images. PloS One, 8(8):e71715, 2013.
[41] Toufiq Parag, Anirban Chakraborty, Stephen Plaza, and Louis Scheffer. A context-aware delayed agglomeration framework for electron microscopy segmentation. PloS One, 10(5):e0125825, 2015.
[42] Toufiq Parag, Fabian Tschopp, William Grisaitis, Srinivas C Turaga, Xuewen Zhang, Brian Matejek, Lee Kamentsky, Jeff W Lichtman, and Hanspeter Pfister. Anisotropic EM segmentation by 3D affinity learning and agglomeration. arXiv preprint arXiv:1707.08935, 2017.
[43] AG Amitha Perera, Chukka Srinivas, Anthony Hoogs, Glen Brooksby, and Wensheng Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In , volume 1, pages 666–673. IEEE, 2006.
[44] Stephen M Plaza and Stuart E Berg. Large-scale electron microscopy image segmentation in Spark. arXiv preprint arXiv:1604.00385, 2016.
[45] Stephen M Plaza and Jan Funke. Analyzing image segmentation for connectomics. Frontiers in Neural Circuits, 12:102, 2018.
[46] Elizabeth P Reilly, Jeffrey S Garretson, William R Gray Roncal, Dean M Kleissas, Brock A Wester, Mark A Chevillet, and Matthew J Roos. Neural reconstruction integrity: A metric for assessing the connectivity accuracy of reconstructed neural networks. Frontiers in Neuroinformatics, 12:74, 2018.
[47] Xiaofeng Ren and Jitendra Malik. Learning a classification model for segmentation. In null, page 10. IEEE, 2003.
[48] Aurélien Rizk, Grégory Paul, Pietro Incardona, Milica Bugarski, Maysam Mansouri, Axel Niemann, Urs Ziegler, Philipp Berger, and Ivo F Sbalzarini. Segmentation and quantification of subcellular structures in fluorescence microscopy images using Squassh. Nature Protocols, 9(3):586, 2014.
[49] David Rolnick, Yaron Meirovitch, Toufiq Parag, Hanspeter Pfister, Viren Jain, Jeff W Lichtman, Edward S Boyden, and Nir Shavit. Morphological error detection in 3D segmentations. arXiv preprint arXiv:1705.10882, 2017.
[50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pages 234–241. Springer, 2015.
[51] Shin-ya Takemura, C Shan Xu, Zhiyuan Lu, Patricia K Rivlin, Toufiq Parag, Donald J Olbris, Stephen Plaza, Ting Zhao, William T Katz, Lowell Umayam, et al. Synaptic circuits and their variations within different columns in the visual system of Drosophila. Proceedings of the National Academy of Sciences, 112(44):13711–13716, 2015.
[52] Srinivas C Turaga, Joseph F Murray, Viren Jain, Fabian Roth, Moritz Helmstaedter, Kevin Briggman, Winfried Denk, and H Sebastian Seung. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2):511–538, 2010.
[53] Ranjith Unnikrishnan, Caroline Pantofaru, and Martial Hebert. Toward objective evaluation of image segmentation algorithms. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):929–944, 2007.
[54] Mengmeng Wang, Yong Liu, and Zeyi Huang. Large margin object tracking with circulant feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4021–4029, 2017.
[55] Yu Xiang, Alexandre Alahi, and Silvio Savarese. Learning to track: Online multi-object tracking by decision making. In Proceedings of the IEEE International Conference on Computer Vision, pages 4705–4713, 2015.
[56] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[57] SYJCY Yoo, Kimin Yun, Jin Young Choi, K Yun, and JY Choi. Action-decision networks for visual tracking with deep reinforcement learning. CVPR, 2017.
[58] Li Zhang, Yuan Li, and Ramakant Nevatia. Global data association for multi-object tracking using network flows. In , pages 1–8. IEEE, 2008.
[59] Tianzhu Zhang, Changsheng Xu, and Ming-Hsuan Yang. Multi-task correlation particle filter for robust object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4335–4343, 2017.
[60] Jonathan Zung, Ignacio Tartavull, Kisuk Lee, and H Sebastian Seung. An error detection and correction framework for connectomics. In