Towards Annotation-free Instance Segmentation and Tracking with Adversarial Simulations

Quan Liu, Isabella M. Gaeta, Mengyang Zhao, Ruining Deng, Aadarsh Jha, Bryan A. Millis, Anita Mahadevan-Jansen, Matthew J. Tyska, and Yuankai Huo, Member, IEEE
Abstract—The quantitative analysis of microscope videos often requires instance segmentation and tracking of cellular and subcellular objects. The traditional method is composed of two stages: (1) performing instance object segmentation on each frame, and (2) associating objects frame-by-frame. Recently, pixel-embedding-based deep learning approaches provide single-stage holistic solutions to tackle instance segmentation and tracking simultaneously. However, such deep learning methods require consistent annotations not only spatially (for segmentation), but also temporally (for tracking). In computer vision, obtaining annotated training data with consistent segmentation and tracking is resource intensive, and the burden can be multiplied in microscopy imaging due to (1) dense objects (e.g., overlapping or touching), and (2) high dynamics (e.g., irregular motion and mitosis). To alleviate the lack of such annotations in dynamic scenes, adversarial simulations have provided successful solutions in computer vision, such as using simulated environments (e.g., computer games) to train real-world self-driving systems. In this paper, we propose an annotation-free synthetic instance segmentation and tracking (ASIST) method with adversarial simulation and single-stage pixel-embedding based learning. The contribution of this paper is three-fold: (1) the proposed method aggregates adversarial simulations and single-stage pixel-embedding based deep learning; (2) the method is assessed with both cellular (i.e., HeLa cells) and subcellular (i.e., microvilli) objects; and (3) to the best of our knowledge, this is the first study to explore annotation-free instance segmentation and tracking for microscope videos. The ASIST method achieved an important step forward when compared with fully supervised approaches.
Index Terms—Annotation-free, segmentation, tracking
I. INTRODUCTION
Holistic instance object segmentation and tracking is an essential analytics tool in microscope video analysis. Capturing the cellular and subcellular dynamics of microscope videos helps domain experts characterize biological processes [1] in a quantitative manner, leading to advanced biomedical applications (e.g., drug discovery) [2].

Due to the importance of quantifying cellular and subcellular dynamics, numerous image processing approaches have been proposed for precise instance object segmentation and tracking. Most previous solutions [3], [4], [5] follow a similar "two-stage" strategy: (1) segmentation on each frame, and (2) frame-by-frame association across the video. In recent years, a new family of "single-stage" algorithms was enabled by cutting-edge pixel-embedding based deep learning [6], [7]. Such methods enforce spatio-temporally consistent pixel-wise feature embeddings for the same cellular or subcellular objects across video frames, addressing both instance segmentation and tracking within a holistic model.
Fig. 1. The upper panel shows the existing pixel-embedding deep learning based single-stage instance segmentation and tracking method, which is trained with real microscope videos and manual annotations. The lower panel presents our proposed annotation-free ASIST method, with synthesized data and annotations from adversarial simulations.

However, such methods are limited by a substantial hurdle: pixel-wise annotations are required through a fully supervised design, with spatial (for segmentation) and temporal (for tracking) consistency. Such labeling efforts are typically expensive, and potentially unscalable, for microscope videos due to (1) dense objects (e.g., overlapping or touching), and (2) high dynamics (e.g., irregular motion and mitosis). Therefore, better learning strategies are desired beyond the current human annotation based supervised learning.

Adversarial simulation, an emerging computing scheme that creates realistic synthetic environments using adversarial deep learning, has provided a scalable option to model complex dynamic systems without extensive human annotations. Particularly striking examples include (1) using computer games such as Grand Theft Auto to train self-driving deep learning models [8], (2) using the Gazebo simulation environment to train robotics [9], and (3) using the SUMO simulator to train traffic management artificial intelligence (AI) [10]. Inspired by these successful studies, we propose to build biological simulation algorithms, with deep adversarial learning, to characterize the high spatial-temporal dynamics of cellular and subcellular structures.

In this paper, we propose an annotation-free synthetic instance segmentation and tracking (ASIST) method with adversarial simulation and single-stage pixel-embedding based learning. Briefly, the ASIST framework consists of three major steps: (1) unsupervised image-annotation synthesis, (2) video and temporal annotation synthesis, and (3) pixel-embedding based instance segmentation and tracking. As opposed to traditional manual annotation based pixel-embedding deep learning, the proposed ASIST method is annotation-free (Fig. 1).

Fig. 2. Real and synthetic videos of HeLa cells and microvilli, characterized by three aspects: shape, appearance, and dynamics. The "shape" is defined as the underlying shape of the manual annotations. The "appearance" is defined by the various appearances of objects. The "dynamics" indicates the migration of cellular and subcellular objects.

To achieve the annotation-free solution, we simulated cellular or subcellular structures with three important aspects: shape, appearance, and dynamics (Fig. 2). To evaluate our proposed ASIST method, microscope videos of both cellular (i.e., HeLa cell videos from the ISBI Cell Tracking Challenge [11], [12]) and subcellular (i.e., microvilli videos from in-house data) objects were included in this study. The HeLa cell videos have larger shape variations compared with the microvilli videos. From the results, our ASIST method achieved promising accuracy compared with fully supervised approaches.

In summary, this paper has three major contributions:

• We propose the ASIST annotation-free framework, aggregating adversarial simulations and single-stage pixel-embedding based deep learning.

• We propose a novel annotation refinement approach to simulate shape variations of cellular objects, with circles as middle representations.

• To the best of our knowledge, our proposed approach is the first annotation-free solution for single-stage pixel-embedding deep learning based cell instance segmentation and tracking.
II. RELATED WORK
A. Image synthesis
The simplest approach to synthesizing new images is to perform image transformations, including flipping, rotation, resizing, and cropping. Such synthetic images have improved the accuracy of image quantification on benchmark datasets [13] as well as in biomedical applications [14].

More sophisticated than these image transformations are generative adversarial networks (GAN) [15], which opened a new window for synthesizing highly realistic images and have been widely used in different computer vision and biomedical imaging applications. For example, retinal images have been synthesized using GAN, mapping between retinal images and binary retinal vessel trees [16]. Synthetic images can also be generated from random noise [17], with geometry constraints [18], and even in high dimensional space [19]. To remove the requirement of paired training data, CycleGAN [20] was proposed to extend the GAN technique to broader applications. CycleGAN has shown promise in cross-modality synthesis [21] and microscope image synthesis [22]. DeepSynth [23] demonstrated that CycleGAN can be applied to 3D medical image synthesis.
B. Microscope image segmentation and tracking
Historically, early approaches utilized intensity-based thresholding to segment a region of interest (ROI) from the background. Ridler et al. [24] used a dynamically updated threshold, based on the mean intensities of the foreground and the background, to segment objects. Otsu [25] selected the threshold by minimizing the intra-class variance. To avoid sensitivity to all image pixels, Pratt [26] proposed growing a segmented area from a seed point, determined by texture similarity. Based on rough annotations, energy functions can be abstracted, and images segmented by minimizing those energy functions [27]. Among such methods, the watershed segmentation approaches are arguably the most widely used for intensity based cell image segmentation [28].

Object tracking in microscope videos is challenging due to the complex dynamics and vague instance boundaries at cellular or subcellular resolutions. Gerlich et al. [29] used optical flow from microscope videos to track cell motion. Ray et al. [30] tracked leukocytes by computing gradient vectors of cell motions based on active contours. Sato et al. [31] designed orientation-selective filters to generate spatio-temporal information enhancing the motion of cells. [32], [33] also tracked cell motion by applying spatio-temporal analysis to microscope videos.

Recent studies have employed machine learning, especially deep learning approaches, for instance-level cell segmentation and tracking. Jain et al. [34] showed the superior performance of a well-trained convolutional network. Baghli et al. [35] achieved 97% prediction accuracy by employing supervised machine learning approaches. To avoid relying on image annotation, Yu et al. [36] trained a convolutional neural network without annotation to track large-scale fibers in material images acquired via microscopy. However, to the best of our knowledge, no existing studies have investigated the challenging problem of quantifying cellular and subcellular dynamics via pixel-wise instance segmentation and tracking with embedding based deep learning.
III. METHODS
The proposed ASIST framework consists of three stages: unsupervised image-annotation synthesis, video synthesis, and instance segmentation and tracking (Fig. 3).
A. Unsupervised image-annotation synthesis
Fig. 3. The proposed ASIST method. First, CycleGAN based image-annotation synthesis is trained using real microscope images and simulated annotations. Second, synthesized microscope videos are generated from simulated annotation videos. Last, an embedding based instance segmentation and tracking algorithm is trained using the synthetic training data. For HeLa cell videos, a new annotation refinement step is introduced to capture the larger shape variations.

The first step is to train a CycleGAN based approach [37] to directly synthesize annotations from microscope images, and vice versa. Compared with typical computer vision tasks, the objects in microscope images are often repetitive, with more homogeneous shapes. Therefore, with knowledge of the shapes associated with microvilli (stick-shaped) and HeLa cells (ball-shaped), we randomly generate fake annotations with repetitive sticks and circles to model the shapes of microvilli and HeLa cells, respectively. The network structure, training process, and parameters follow [38].
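As a rough illustration of this step, the sketch below rasterizes random stick and circle annotations into an instance label map with NumPy. The function name, object counts, and size ranges are illustrative assumptions; the paper specifies only the stick and circle shape priors.

```python
# Minimal sketch (assumptions: 512x512 frames; object counts and size
# ranges are illustrative, not the authors' exact parameters).
import numpy as np

def simulate_annotation(shape="circle", n_objects=100, size=(512, 512), rng=None):
    """Draw n_objects random circles or sticks into an instance label map."""
    rng = rng or np.random.default_rng()
    label = np.zeros(size, dtype=np.uint16)
    yy, xx = np.mgrid[0:size[0], 0:size[1]]
    for idx in range(1, n_objects + 1):
        cy, cx = rng.integers(0, size[0]), rng.integers(0, size[1])
        if shape == "circle":                      # ball-shaped HeLa cells
            r = rng.integers(15, 30)               # illustrative radius range
            mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
        else:                                      # stick-shaped microvilli
            length, width = rng.integers(20, 60), rng.integers(2, 5)
            theta = rng.uniform(0, np.pi)
            u = (xx - cx) * np.cos(theta) + (yy - cy) * np.sin(theta)
            v = -(xx - cx) * np.sin(theta) + (yy - cy) * np.cos(theta)
            mask = (np.abs(u) <= length / 2) & (np.abs(v) <= width / 2)
        label[mask] = idx                          # later objects overwrite earlier ones
    return label
```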
B. Video synthesis
Using the annotation-to-image generator (marked as Generator B) from the above CycleGAN model, synthetic intensity images can be generated from simulated annotations. As a video is composed of images (i.e., video frames), we extend the usage of the trained Generator B from "annotation-to-image" to "annotation frames-to-video". Briefly, simulated annotation videos are generated by our annotation simulator with variations in shape and dynamics. Then, each annotation video frame is used to generate a synthetic microscope image frame. After repeating this process for the entire simulated annotation video, a synthetic microscope video is obtained for microvilli and HeLa cells, respectively.
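A minimal sketch of this frame-by-frame reuse of Generator B, assuming a trained PyTorch generator (here called `generator_b`) and a pix2pix-style [-1, 1] normalization, neither of which is stated in the paper:

```python
# Minimal sketch; tensor layout and normalization are illustrative,
# not the authors' exact pipeline.
import torch

@torch.no_grad()
def synthesize_video(generator_b, annotation_frames, device="cuda"):
    """Map each simulated annotation frame to a synthetic microscope frame."""
    generator_b.eval().to(device)
    video = []
    for frame in annotation_frames:                # frame: (H, W) float in [0, 1]
        x = torch.as_tensor(frame, dtype=torch.float32)
        x = x[None, None].to(device) * 2 - 1       # to (1, 1, H, W), scaled to [-1, 1]
        y = generator_b(x)                         # synthetic intensity frame
        video.append(((y[0, 0] + 1) / 2).cpu().numpy())
    return video
```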
1) Microvilli video simulation:
As shown in Fig. 4, we model the shape of microvilli as sticks (narrow rectangles) to simulate microvilli videos. The simulated microvilli annotation videos are determined by the following operations (a minimal per-frame sketch follows the list):

Object number: Different numbers of objects are evaluated when simulating microvilli videos. The details are presented in § Experimental design.

Translation: Instance annotations are translated by 1 pixel at 50% probability.

Rotation: Each instance label is randomly rotated by 1 degree at 50% probability.

Shortening/Lengthening: Each object has a 50% probability of becoming longer or shorter by 1 pixel. Each object can only become longer or shorter across the video.

Moving in/out: To simulate instances moving in and out of the video scope, we generate frames at a larger size (550 × 550 pixels) and center-crop them to the target size (512 × 512 pixels).
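A minimal sketch of the per-frame updates above, keeping each stick as a parametric state that is rasterized afterwards; the state layout and helper name are assumptions, while the probabilities follow the list:

```python
# Minimal sketch; each stick is a dict {"cy", "cx", "theta", "length",
# "grow_dir"}. Frames would be rasterized at 550x550 and center-cropped
# to 512x512 so objects can move in and out of the scope.
import numpy as np

def step_microvilli(objects, rng):
    """Advance every stick state by one frame."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for obj in objects:
        if rng.random() < 0.5:                     # Translation: 1 pixel at 50%
            dy, dx = steps[int(rng.integers(4))]
            obj["cy"] += dy
            obj["cx"] += dx
        if rng.random() < 0.5:                     # Rotation: 1 degree at 50%
            obj["theta"] += np.deg2rad(int(rng.choice([-1, 1])))
        if rng.random() < 0.5:                     # Shorten/lengthen: 1 pixel at 50%
            obj["length"] += obj["grow_dir"]       # direction fixed across the video
    return objects
```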
2) HeLa cell video simulation:
The HeLa cells have higher degrees of freedom in terms of shape variation, compared with microvilli. In this study, we propose an annotation refinement strategy to generate shape-consistent synthetic HeLa cell videos and annotations, using circles as middle representations (Fig. 5), without introducing manual annotations. The simulated videos and annotations of HeLa cells are determined by the following operations (a minimal sketch of the event logic follows the figure captions):
Object number: The numbers of objects are evaluated when simulating HeLa cell videos. The details are presented in § Experimental design.

Translation: The instance annotation center can be moved by N pixels. N is described in § Experimental design.

Radius changing: The radius of an annotation has a 10% probability of becoming larger or smaller by 1 pixel.

Disappearing: Existing instance cells are randomly deleted from certain frames in the videos.

Appearing: New instance cells show up at random frames. New cells are added to the video from the appearing frame onward.

Mitosis: To simulate HeLa cell mitosis, we randomly define "mother cells" at the n-th frame. At the (n+1)-th frame, we delete the "mother cells" and randomly create two new cells nearby. Based on biological knowledge, these two new instances are typically smaller than normal instances, and will grow bigger and move randomly like other instance annotations.

Overlapping: We allow partial overlap between cells. The minimum distance between two cells is set to 70% of the total diameter of the two cells.

Fig. 4. The left panel shows real microscope videos as well as manual annotations. The right panel presents our synthetic videos and simulated annotations.

Fig. 5. The upper panel shows the CycleGAN trained with real images and simulated annotations with Gaussian blurring. The lower panel shows the CycleGAN trained with the same data without Gaussian blurring. Generator B is used to generate synthetic videos with larger shape variations from circle representations, while Generator A* generates sharp segmentations for the annotation registration.
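A minimal sketch of the event logic above; the translation range, event probabilities, and daughter-cell sizes are illustrative assumptions (the paper fixes the behaviors, not these exact numbers):

```python
# Minimal sketch of the HeLa event logic; ranges and probabilities other
# than the 10% radius change are illustrative assumptions.
import numpy as np

def step_hela(cells, next_id, rng, p_disappear=0.01, p_appear=0.01, mitosis_ids=()):
    """cells: dict id -> {"cy", "cx", "r"}; advance the population one frame."""
    for cid in list(cells):
        c = cells[cid]
        c["cy"] += int(rng.integers(-2, 3))        # Translation by N pixels
        c["cx"] += int(rng.integers(-2, 3))
        if rng.random() < 0.1:                     # Radius changing: 10%, +/- 1 pixel
            c["r"] += int(rng.choice([-1, 1]))
        if rng.random() < p_disappear:             # Disappearing
            del cells[cid]
        elif cid in mitosis_ids:                   # Mitosis: mother -> two daughters
            mother = cells.pop(cid)
            for _ in range(2):                     # daughters start smaller, nearby
                cells[next_id] = {"cy": mother["cy"] + int(rng.integers(-5, 6)),
                                  "cx": mother["cx"] + int(rng.integers(-5, 6)),
                                  "r": max(3, mother["r"] // 2)}
                next_id += 1
    if rng.random() < p_appear:                    # Appearing: new cell, random spot
        cells[next_id] = {"cy": int(rng.integers(0, 512)),
                          "cx": int(rng.integers(0, 512)),
                          "r": int(rng.integers(15, 30))}
        next_id += 1
    return cells, next_id
```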
C. Annotation refinement for HeLa cell video simulation
After training the initial CycleGAN synthesis, we are able to build simulated videos (with circle representations) as well as their corresponding synthetic microscope videos. However, circles are not the exact shapes of the annotations for the synthetic videos. To further achieve consistent synthetic videos and annotations, we propose an annotation refinement framework, whose workflow is shown in Fig. 6.

Fig. 6. The workflow of the annotation refinement approach. The simulated circle annotations are fed into Generator B to synthesize cell images. We use Generator A* in Fig. 5 to generate sharp binary masks from the synthetic images. Then, we register the simulated circle annotations to the binary masks to match the shapes of the cells in the synthetic images. Last, an annotation cleaning step is introduced to delete the inconsistent annotations between the deformed instance object masks and the binary masks.
1) Binary mask generation:
We trained a CycleGAN to generate binary masks of synthetic cell images. Unlike the CycleGAN in § Unsupervised image-annotation synthesis, we used training data without applying Gaussian blurring and used the model from an early epoch. From our experiments, we observed that the early epochs of CycleGAN training focus more on intensity adaptation rather than shape adaptation. The trained Generator A* is used to generate sharp binary masks as templates in the following annotation registration step.
2) Annotation deformation (AD):
To bridge the gap between circle representations and HeLa cell shape annotations, a non-rigid registration approach from ANTs [39] is used to deform the circle shapes to the HeLa cell shapes. Briefly, we use Generator B to synthesize cell images based on our simulated annotations. In the mask generation, we use Generator A* to generate binary masks and register the circle shape annotations to the binary masks. In that case, we keep the label numbers of the circle representations and deform their shapes to fit the synthetic cells.
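One plausible realization of this step uses ANTsPy, the Python wrapper of ANTs [39]; the SyN transform and the genericLabel interpolator are reasonable choices for warping instance labels, though not necessarily the authors' exact configuration:

```python
# Minimal sketch with ANTsPy; inputs are 2D numpy arrays (circle instance
# labels and the Generator A* binary mask). 'SyN' and 'genericLabel' are
# assumptions, not confirmed settings from the paper.
import ants
import numpy as np

def deform_annotation(circle_labels, binary_mask):
    """Non-rigidly warp circle instance labels onto the binary cell mask."""
    moving = ants.from_numpy((circle_labels > 0).astype(np.float32))
    fixed = ants.from_numpy(binary_mask.astype(np.float32))
    reg = ants.registration(fixed=fixed, moving=moving, type_of_transform="SyN")
    warped = ants.apply_transforms(
        fixed=fixed,
        moving=ants.from_numpy(circle_labels.astype(np.float32)),
        transformlist=reg["fwdtransforms"],
        interpolator="genericLabel",               # preserves instance ids
    )
    return warped.numpy().astype(circle_labels.dtype)
```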
3) Annotation cleaning (AC):
When performing image-annotation synthesis using CycleGAN without paired training data, it is very likely that the HeLa cell images and annotations have slightly different numbers of objects. To make the synthetic videos and simulated annotations have more consistent numbers of objects, we introduce an annotation cleaning step (Fig. 6). First, we generate binary masks of the simulated images using Generator A*. Second, we clean up the inconsistent objects and annotations by comparing the deformed simulated annotations with the binary masks. Briefly, pseudo-instance annotations are obtained from the binary masks by treating any connected component as an instance. Third, if an instance object in the deformed simulated annotations is not 90% covered by the binary masks, we re-assign its label as background. On the other hand, if a pseudo-instance object from the binary masks is not 90% covered by the deformed simulated annotations, we re-assign the corresponding region in the intensity image to the average background intensity value. In sum, consistent synthetic videos and deformed simulated instance annotations are achieved with annotation cleaning (Fig. 6).
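A minimal sketch of this 90% coverage rule, using SciPy connected components for the pseudo-instances; the function name and background fill strategy are assumptions consistent with the description above:

```python
# Minimal sketch of the 90% coverage rule.
import numpy as np
from scipy import ndimage

def clean_annotations(deformed_labels, binary_mask, image, threshold=0.9):
    """Drop labels the mask does not support; erase unmatched mask objects."""
    labels = deformed_labels.copy()
    img = image.copy()
    # (1) Remove deformed instances insufficiently covered by the binary mask.
    for lbl in np.unique(labels)[1:]:              # skip background label 0
        inst = labels == lbl
        if (binary_mask[inst] > 0).mean() < threshold:
            labels[inst] = 0                       # re-assign as background
    # (2) Erase pseudo-instances (connected components) lacking an annotation.
    components, n = ndimage.label(binary_mask > 0)
    bg_value = img[binary_mask == 0].mean()        # average background intensity
    for comp in range(1, n + 1):
        region = components == comp
        if (labels[region] > 0).mean() < threshold:
            img[region] = bg_value
    return labels, img
```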
D. Instance segmentation and tracking
From the above stages, synthetic videos and corresponding annotations are obtained frame-by-frame. The next step is to train our instance segmentation and tracking model. We used the recurrent stacked hourglass network (RSHN) [7] as the instance segmentation and tracking backbone to encode the embedding vector of each pixel. The ideal pixel-embedding has two properties: (1) the embeddings of pixels belonging to the same object should be similar across the entire video, and (2) the embeddings of pixels belonging to different objects should be different. For a testing video, we employed the Faster Mean-shift algorithm [6] to cluster pixels into objects as the instance segmentation and tracking results (a minimal clustering sketch follows). The embedding-based deep learning methods treat instance segmentation and tracking as a "single-stage" problem, which is a simple and generalizable solution across different applications [7], [6].
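For illustration, the clustering step could look like the sketch below, using scikit-learn's MeanShift as a slower stand-in for the Faster Mean-shift of [6]; the bandwidth and array shapes are assumptions. Clustering all frames jointly is what keeps instance ids consistent over time, since pixels of the same object share similar embeddings across the whole video.

```python
# Minimal sketch; sklearn's MeanShift stands in for Faster Mean-shift [6].
import numpy as np
from sklearn.cluster import MeanShift

def cluster_embeddings(embeddings, bandwidth=0.5):
    """embeddings: (T, H, W, D) per-pixel vectors -> (T, H, W) instance ids."""
    t, h, w, d = embeddings.shape
    flat = embeddings.reshape(-1, d)               # pool pixels from all frames
    ids = MeanShift(bandwidth=bandwidth).fit_predict(flat)
    return ids.reshape(t, h, w)                    # same cluster id = same instance
```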
IV. EXPERIMENTAL DESIGN
A. Instance segmentation and tracking on microvilli video

1) Data:
Two microvilli videos captured by fluorescence microscopy, at 1.1 µm pixel resolution, were used. The training data is one microvilli video with 512 × 512 pixel resolution. The testing data is another microvilli video of size 328 × 238 pixels. Due to the heavy load of manually annotating video frames, we only annotated the first ten frames of both videos as the gold standard. The annotation work includes two parts: (1) annotating each microvilli structure, including overlapping or densely distributed areas; and (2) assigning each instance a consistent label across all frames of the same video. The manual annotation labor for both training and testing data took roughly a week of work from a graduate student. This long manual annotation process shows the value of annotation-free solutions in quantifying cellular and subcellular dynamics.
2) Experimental design:
To assess the performance of our annotation-free instance segmentation and tracking model, the proposed method is compared with models trained with manual annotations on the same testing microvilli video. The different experimental settings are as follows:
Self: The testing video with manual annotations was used as both training and testing data.

Real: Another real microvilli video with manual annotations was used as training data.

Microvilli-1: One simulated video consisting of 100 instances at 512 × 512 pixels was used as training data. "Microvilli-1 10 frames" indicates that only 10 frames were used, while the other simulated data used 50 frames.

Microvilli-5: Five simulated videos with 512 × 512 pixel resolution were used as training data. The numbers of objects were empirically chosen to be between 80 and 220.

Microvilli-20: We further spatially split each 512 × 512 video in Microvilli-5 into four 256 × 256 videos, forming a total of 20 simulated videos at half resolution.
B. Instance segmentation and tracking on HeLa cell video

1) Data:
HeLa cell videos (N2DL-HeLa) were obtained from the ISBI Cell Tracking Challenge [11], [12]. The cohort has two 92-frame HeLa cell videos of size 1100 × 700 pixels with annotations. The second video, with complete manual annotations, is used as the testing data for all experiments.
2) Experimental design:
For experiments using the annotation-free framework, synthetic videos and simulated annotations are used for training. As comparison experiments, models trained with annotated data used the two N2DL-HeLa videos with annotations as training data. Our experiment settings are described as follows:
Self: The testing video with manual annotations was used as both training and testing data. The patch size of 256 × 256 was used, following [7], [6].

Self-HW: The testing video with manual annotations was used as both training and testing data. The patch size of 128 × 128 was used, as a half window (HW) size.

HeLa: Our training data was 10 simulated videos with 512 × 512 resolution containing approximately 150 objects, including 20 cell appearing events, 20 cell disappearing events, and 5 or 10 mitosis events. The numbers were empirically chosen. This experiment employed the circle annotations directly as the baseline performance. The patch size of 256 × 256 was used.

HeLa-AD: The above simulated data was used for training, with an extra annotation deformation (AD) step.

HeLa-AD+AC: The above simulated data was used for training, with extra AD and annotation cleaning (AC) steps.

HeLa-AD+AC+HW: The above simulated data was used for training, with extra AD and AC steps. The patch size of 128 × 128 was used, as a half window (HW) size.
TABLE I
DET, SEG, AND TRA VALUES OF DIFFERENT EXPERIMENTS ON MICROVILLI VIDEO.

Exp.                    T.V.  T.F.  DET    SEG    TRA
RSHN (Self) [7]         1     10    0.662  0.298  0.629
RSHN (Real) [7]         1     10    0.357  0.169  0.334
ASIST (Microvilli-1)    1     10    0.580  0.306  0.551
ASIST (Microvilli-1)    1     50    0.586  0.311  0.556
ASIST (Microvilli-5)    5     50    0.660  …      …
ASIST (Microvilli-20)   20    50    …      …      …

T.V. is the number of training videos. T.F. is the number of training frames of each video. RSHN (Self) uses the testing video for training. RSHN (Real) is the standard testing accuracy of using another independent video as training data.

TABLE II
DET, SEG, AND TRA VALUES OF DIFFERENT EXPERIMENTS ON HeLa CELL VIDEO.

Exp.                     T.V.  T.F.  DET    SEG    TRA
RSHN (Self) [7]          2     92    …      …      …
RSHN (Self-HW)           2     92    0.956  0.809  0.951
ASIST (HeLa)             10    50    0.858  0.656  0.849
ASIST (HeLa-AD)          10    50    0.853  0.718  0.844
ASIST (HeLa-AD+AC)       10    50    0.919  0.755  0.911
ASIST (HeLa-AD+AC+HW)    10    50    0.939  0.796  0.928

T.V. is the number of training videos. T.F. is the number of training frames per video. RSHN (Self) is the upper bound of RSHN using the testing video for training.
C. Evaluation metrics
TRA, DET, and SEG are the standard metrics of the ISBI Cell Tracking Challenge [40], evaluating the performance of tracking, detection, and segmentation, respectively. The ISBI Cell Tracking Challenge uses these three metrics as the de facto measurement standard. Larger values of TRA, DET, and SEG indicate better performance.
V. RESULTS
A. Instance segmentation and tracking on microvilli videos
The qualitative and quantitative results are presented in Fig. 7 and Table I. From the quantitative results shown in Table I, the best performance according to the evaluation metric scores was achieved by Microvilli-20, without using manual annotations. By contrast, it took one week of manual annotation labor from a graduate student to annotate only the 10 frames used by RSHN (Self) and RSHN (Real).

Fig. 7. Instance segmentation and tracking results on the real testing microvilli video.
B. Instance segmentation and tracking on HeLa cell videos
Instance segmentation and tracking results of HeLa cell videos are presented in Fig. 8. Based on the performance in Table II, HeLa-AD+AC+HW achieved superior performance compared with the other ASIST settings. The best performance of our annotation-free ASIST method is 5% to 9% lower than the manual annotation baseline.

Fig. 8. Instance segmentation and tracking results on the real HeLa cell testing video.
VI. DISCUSSION
In this paper, we aim to evaluate the feasibility of performing pixel-embedding based instance object segmentation and tracking in an annotation-free manner, with adversarial simulations. According to our experimental results, though not perfect, our annotation-free instance segmentation and tracking model achieved superior performance on the microvilli dataset as well as comparable results on the HeLa dataset. Such encouraging results provide a new path to leverage the currently unscalable human annotation based pixel-embedding deep learning approaches in an annotation-free manner.

This study presented our methodological strategies to achieve annotation-free instance segmentation and tracking with different appearances, shapes, and dynamics. One major limitation is that both microvilli and HeLa cells have relatively homogeneous shape and appearance variations. In the future, it will be valuable to explore more complicated cell lines and more heterogeneous microscope videos. Meanwhile, the registration based method is introduced to capture shape variations for ball-shaped HeLa cells. For more complicated cellular and subcellular objects, deep learning based solutions might be needed, such as a shape auto-encoder.

Following the proposed ASIST framework, our long-term goal is to develop more general and comprehensive algorithms that can be applied to a variety of microscope videos with pixel-level instance segmentation and tracking. This would provide new analytical tools for domain experts to characterize the high spatio-temporal dynamics of cells and subcellular structures.
VII. CONCLUSION
In this paper, we propose the ASIST method, an annotation-free instance segmentation and tracking solution to characterize cellular and subcellular dynamics in microscope videos. Our method consists of unsupervised image-annotation synthesis, video synthesis, and instance segmentation and tracking. According to the experiments on subcellular (microvilli) videos and cellular (HeLa cell) videos, ASIST achieved comparable performance to manual annotation based strategies. The proposed approach is a novel step towards annotation-free quantification of cellular and subcellular dynamics for microscope biology.
REFERENCES

[1] L. M. Meenderink, I. M. Gaeta, M. M. Postema, C. S. Cencer, C. R. Chinowsky, E. S. Krystofiak, B. A. Millis, and M. J. Tyska, "Actin dynamics drive microvillar motility and clustering during brush border assembly," Developmental Cell, vol. 50, no. 5, pp. 545–556, 2019.
[2] A. Arbelle, J. Reyes, J.-Y. Chen, G. Lahav, and T. R. Raviv, "A probabilistic approach to joint cell tracking and segmentation in high-throughput microscopy videos," Medical Image Analysis, vol. 47, pp. 140–152, 2018.
[3] Y. Al-Kofahi, A. Zaltsman, R. Graves, W. Marshall, and M. Rusu, "A deep learning-based algorithm for 2-D cell segmentation in microscopy images," BMC Bioinformatics, vol. 19, no. 1, pp. 1–11, 2018.
[4] N. Korfhage, M. Mühling, S. Ringshandl, A. Becker, B. Schmeck, and B. Freisleben, "Detection and segmentation of morphologically complex eukaryotic cells in fluorescence microscopy images via feature pyramid fusion," PLOS Computational Biology, vol. 16, no. 9, p. e1008179, 2020.
[5] D. A. Van Valen, T. Kudo, K. M. Lane, D. N. Macklin, N. T. Quach, M. M. DeFelice, I. Maayan, Y. Tanouchi, E. A. Ashley, and M. W. Covert, "Deep learning automates the quantitative analysis of individual cells in live-cell imaging experiments," PLoS Computational Biology, vol. 12, no. 11, p. e1005177, 2016.
[6] M. Zhao, A. Jha, Q. Liu, B. A. Millis, A. Mahadevan-Jansen, L. Lu, B. A. Landman, M. J. Tyska, and Y. Huo, "Faster mean-shift: GPU-accelerated embedding-clustering for cell segmentation and tracking," arXiv preprint arXiv:2007.14283, 2020.
[7] C. Payer, D. Štern, T. Neff, H. Bischof, and M. Urschler, "Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 3–11.
[8] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan, "Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?" arXiv preprint arXiv:1610.01983, 2016.
[9] I. Zamora, N. G. Lopez, V. M. Vilches, and A. H. Cordero, "Extending the OpenAI Gym for robotics: A toolkit for reinforcement learning using ROS and Gazebo," arXiv preprint arXiv:1608.05742, 2016.
[10] N. Kheterpal, K. Parvate, C. Wu, A. Kreidieh, E. Vinitsky, and A. Bayen, "Flow: Deep reinforcement learning for control in SUMO," EPiC Series in Engineering, vol. 2, pp. 134–151, 2018.
[11] M. Maška, V. Ulman, D. Svoboda, P. Matula, P. Matula, C. Ederra, A. Urbiola, T. España, S. Venkatesan, D. M. Balak et al., "A benchmark for comparison of cell tracking algorithms," Bioinformatics, vol. 30, no. 11, pp. 1609–1617, 2014.
[12] V. Ulman, M. Maška, K. E. Magnusson, O. Ronneberger, C. Haubold, N. Harder, P. Matula, P. Matula, D. Svoboda, M. Radojevic et al., "An objective comparison of cell-tracking algorithms," Nature Methods, vol. 14, no. 12, pp. 1141–1152, 2017.
[13] P. Y. Simard, D. Steinkraus, J. C. Platt et al., "Best practices for convolutional neural networks applied to visual document analysis," in ICDAR, vol. 3, 2003.
[14] M. Drozdzal, G. Chartrand, E. Vorontsov, M. Shakeri, L. Di Jorio, A. Tang, A. Romero, Y. Bengio, C. Pal, and S. Kadoury, "Learning normalized inputs for iterative estimation in medical image segmentation," Medical Image Analysis, vol. 44, pp. 1–13, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[16] P. Costa, A. Galdran, M. I. Meyer, M. D. Abràmoff, M. Niemeijer, A. M. Mendonça, and A. Campilho, "Towards adversarial retinal image synthesis," arXiv preprint arXiv:1701.08974, 2017.
[17] Q. Zhang, H. Wang, H. Lu, D. Won, and S. W. Yoon, "Medical image synthesis with generative adversarial networks for tissue recognition," 2018, pp. 199–207.
[18] J. Zhuang and D. Wang, "Geometrically matched multi-source microscopic image synthesis using bidirectional adversarial networks," arXiv preprint arXiv:2010.13308, 2020.
[19] S. Liu, E. Gibson, S. Grbic, Z. Xu, A. A. A. Setio, J. Yang, B. Georgescu, and D. Comaniciu, "Decompose to manipulate: Manipulable object synthesis in 3D medical images with structured image decomposition," arXiv preprint arXiv:1812.01737, 2018.
[20] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[21] Y. Huo, Z. Xu, S. Bao, A. Assad, R. G. Abramson, and B. A. Landman, "Adversarial synthesis learning enables segmentation without target modality ground truth," IEEE, 2018, pp. 1217–1220.
[22] S. J. Ihle, A. M. Reichmuth, S. Girardin, H. Han, F. Stauffer, A. Bonnin, M. Stampanoni, K. Pattisapu, J. Vörös, and C. Forró, "Unsupervised data to content transformation with histogram-matching cycle-consistent generative adversarial networks," Nature Machine Intelligence, vol. 1, no. 10, pp. 461–470, 2019.
[23] K. W. Dunn, C. Fu, D. J. Ho, S. Lee, S. Han, P. Salama, and E. J. Delp, "DeepSynth: Three-dimensional nuclear segmentation of biological images using neural networks trained with synthetic data," Scientific Reports, vol. 9, no. 1, pp. 1–15, 2019.
[24] T. Ridler, S. Calvard et al., "Picture thresholding using an iterative selection method," IEEE Transactions on Systems, Man, and Cybernetics, vol. 8, no. 8, pp. 630–632, 1978.
[25] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[26] W. Pratt, "Digital image processing: PIKS Scientific inside," Wiley-Interscience, John Wiley & Sons, Inc., 2007.
[27] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, no. 4, pp. 321–331, 1988.
[28] A. S. Kornilov and I. V. Safonov, "An overview of watershed algorithm implementations in open source libraries," Journal of Imaging, vol. 4, no. 10, p. 123, 2018.
[29] D. Gerlich, J. Mattes, and R. Eils, "Quantitative motion analysis and visualization of cellular structures," Methods, vol. 29, no. 1, pp. 3–13, 2003.
[30] N. Ray and S. T. Acton, "Motion gradient vector flow: An external force for tracking rolling leukocytes with shape and size constrained active contours," IEEE Transactions on Medical Imaging, vol. 23, no. 12, pp. 1466–1478, 2004.
[31] Y. Sato, J. Chen, R. A. Zoroofi, N. Harada, S. Tamura, and T. Shiga, "Automatic extraction and measurement of leukocyte motion in microvessels using spatiotemporal image analysis," IEEE Transactions on Biomedical Engineering, vol. 44, no. 4, pp. 225–236, 1997.
[32] C. De Hauwer, I. Camby, F. Darro, I. Migeotte, C. Decaestecker, C. Verbeek, A. Danguy, J.-L. Pasteels, J. Brotchi, I. Salmon et al., "Gastrin inhibits motility, decreases cell death levels and increases proliferation in human glioblastoma cell lines," Journal of Neurobiology, vol. 37, no. 3, pp. 373–382, 1998.
[33] C. De Hauwer, F. Darro, I. Camby, R. Kiss, P. Van Ham, and C. Decaesteker, "In vitro motility evaluation of aggregated cancer cells by means of automatic image processing," Cytometry: The Journal of the International Society for Analytical Cytology, vol. 36, no. 1, pp. 1–10, 1999.
[34] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung, "Supervised learning of image restoration with convolutional networks," IEEE, 2007, pp. 1–8.
[35] I. Baghli, M. Benazzouz, and M. A. Chikh, "Plasma cell identification based on evidential segmentation and supervised learning," International Journal of Biomedical Engineering and Technology, vol. 32, no. 4, pp. 331–350, 2020.
[36] H. Yu, D. Guo, Z. Yan, W. Liu, J. Simmons, C. P. Przybyla, and S. Wang, "Unsupervised learning for large-scale fiber detection and tracking in microscopic material images," arXiv preprint arXiv:1805.10256, 2018.
[37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
[38] Q. Liu, I. M. Gaeta, B. A. Millis, M. J. Tyska, and Y. Huo, "GAN based unsupervised segmentation: Should we match the exact number of objects," arXiv preprint arXiv:2010.11438, 2020.
[39] B. B. Avants, N. J. Tustison, G. Song, P. A. Cook, A. Klein, and J. C. Gee, "A reproducible evaluation of ANTs similarity metric performance in brain image registration," NeuroImage, vol. 54, no. 3, pp. 2033–2044, 2011.
[40] P. Matula, M. Maška, D. V. Sorokin, P. Matula, C. Ortiz-de-Solórzano, and M. Kozubek, "Cell tracking accuracy measurement based on comparison of acyclic oriented graphs," PLOS ONE, vol. 10, no. 12, p. e0144959, 2015.