Multi-Task Deep Learning with Dynamic Programming for Embryo Early Development Stage Classification from Time-Lapse Videos
Zihan Liu, Bo Huang, Yuqi Cui, Yifan Xu, Bo Zhang, Lixia Zhu, Yang Wang, Lei Jin and Dongrui Wu
Abstract—Time-lapse is a technology used to record the development of embryos during in-vitro fertilization (IVF). Accurate classification of embryo early development stages can provide embryologists valuable information for assessing the embryo quality, and hence is critical to the success of IVF. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach for this purpose. It first uses MTDL to pre-classify each frame in the time-lapse video to an embryo development stage, and then DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. Different MTDL frameworks, e.g., one-to-many, many-to-one, and many-to-many, are investigated. It is shown that the one-to-many MTDL framework achieved the best compromise between performance and computational cost. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
Index Terms—Multi-task learning, in-vitro fertilization, convolutional neural networks, dynamic programming, image classification
I. INTRODUCTION
In-vitro fertilization (IVF) [1]–[3] is a frequently used technology for treating infertility. The process involves the collection of multiple follicles for fertilization and in-vitro culture. Cultivation, selection and transplantation of embryos are the key steps in determining a successful implantation during IVF [4], [5]. During the development of embryos, the morphological characteristics [6] and kinetic characteristics [7] are highly correlated with the outcome of transplantation. Time-lapse videos have been widely used in various reproductive medicine centers during the cultivation of embryos [8] to monitor them. A time-lapse video records the embryonic development process in real time by taking photos of the embryos at short time intervals [9]. Thus, a large amount of time series image data for each embryo are produced in this process. At the final stage of embryo selection, an embryologist reviews the entire embryo development process to score
Z. Liu, Y. Cui, Y. Xu, Y. Wang and D. Wu are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: [email protected], [email protected], [email protected], wangyang [email protected], [email protected].
B. Huang, B. Zhang, L. Zhu and L. Jin are with the Reproductive Medicine Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China. Email: [email protected], [email protected], [email protected], [email protected].
Zihan Liu and Bo Huang contributed equally to this work. Lei Jin and Dongrui Wu are the corresponding authors.
and sort them. Studies with different time-lapse equipment reported improved prediction accuracy of embryo implantation potential by analyzing the morphokinetics of human embryos at early cleavage stages [8]–[12]. These features have been shown to be statistically significant to the final outcome of the transplantation [7].
There have been only a few approaches to analyze time-lapse image data [9], [13]–[18]. Due to the limitations of the time-lapse technology, stereoscopic cells of different heights overlap in the images when photographed. It is difficult for even an experienced embryologist to accurately count the number of cells in a single time-lapse image when there are more than eight cells. Therefore, most research focused on the early development stages of embryos. Wong et al. [9] identified several key parameters that can predict blastocyst formation at the 4-cell stage from time-lapse images, and employed sequential Monte Carlo based probabilistic model estimation to monitor these parameters and track the cells. Wang et al. [13] presented a multi-level embryo stage classification approach, using both hand-crafted and automatically learned embryo features to identify the number of cells in a time-lapse video. Conaghan et al.
[14] used an automated and proprietary image analysis software, EEVA™ (Early Embryo Viability Assessment), which exhibited high image contrast through the use of darkfield illumination, to track cell divisions from the one-cell stage to the four-cell stage. Their experiments verified that the EEVA Test can significantly improve embryologists' ability to identify embryos that would develop into usable blastocysts. There are also several other studies on embryo selection using EEVA™ [19]–[22], but they did not provide the details of the EEVA Test used. Jonaitis et al. [15] compared the performance of neural networks, support vector machines and nearest neighbor classifiers in detecting cell division time. Khan et al. [18] used a deep convolutional neural network (CNN) to classify the number of cells, and also semantic segmentation to extract the cell regions in a time-lapse image [16]. Ng et al. [17] combined late fusion networks with dynamic programming (DP) to predict different cell development stages and obtained better results than a single-frame model.
Multi-task learning has been successfully used in many applications, such as natural language processing [23], speech recognition [24], and computer vision [25]. Its basic idea is to share representations among related tasks, so that each trained model may have better generalization ability [26]. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach, which first uses MTDL to pre-classify each frame in the time-lapse video to an embryo development stage, and then DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
The remainder of this paper is organized as follows: Section II introduces four classification frameworks for time-lapse video analysis. Section III proposes our MTDL-DP approach. Section IV presents the experimental results. Finally, Section V draws conclusions.

II. CLASSIFICATION FRAMEWORKS
This section introduces four frameworks for embryo early development stage classification from time-lapse videos. We first describe our dataset and the baseline network architecture, and then extend it to the many-to-one, one-to-many and many-to-many MTDL frameworks.
A. Dataset
The time-lapse video dataset used in our experiments came from the Reproductive Medicine Center of Tongji Hospital, Huazhong University of Science and Technology, Wuhan, China. It consisted of 170 time-lapse videos extracted from incubators, using an EmbryoScope+ time-lapse microscope system at a 10-minute sampling interval. Each frame in a video is a grayscale 800 × 800 image, with a well number in the lower left corner and a time marker after fertilization in the lower right corner, as shown in Fig. 1. The embryo is surrounded by some granulosa cells in the microscope field. The scale bar in the upper right corner indicates the size of the cells. Each video began about 0-2 hours after fertilization, and ended about 140 hours after fertilization. We only used the first N = 350 frames of each video, which were manually labeled for the embryo development stages. Therefore, we had a total of 170 × 350 = 59,500 labeled frames in the experiment.
As in [17], we focused on the first six embryo development stages: initialization (tStart), the appearance and breakdown of the male and female pronuclei (tPNf), and the appearance of 2 through 4+ cells (t2, t3, t4, t4+). We counted the number of images in each embryo development stage in the dataset, and show the summary in Fig. 2. Note that t3 was rarely observed in our dataset.

B. The Baseline One-to-One Classification Framework
Let x_n be the n-th frame in a time-lapse video. For image classification, a standard one-to-one classification framework learns a mapping:

f : x_n → y_n ∈ L, (1)

where y_n is the stage label of x_n, and L the label set of the embryo development stages.
When information from the previous and future frames is used, the standard one-to-one classification framework can be extended to the many-to-one, one-to-many and many-to-many MTDL frameworks, as illustrated in Fig. 3.
We used ResNet [27], which won the 2015 ImageNet classification competition, to process individual video frames. Table I shows our baseline ResNet50 model. The input image had three channels (RGB), each with 224 × 224 pixels (the 800 × 800 images were resized). The model was initialized with the ResNet weights pre-trained on ImageNet [28], which can help reduce overfitting on small datasets.
TABLE I
THE BASELINE RESNET50 MODEL.

layer name            50-layer                                 output size
conv1                 7×7, 64, stride 2                        112×112
res2                  3×3 max pool, stride 2;                  56×56
                      [1×1, 64; 3×3, 64; 1×1, 256] × 3
res3                  [1×1, 128; 3×3, 128; 1×1, 512] × 4       28×28
res4                  [1×1, 256; 3×3, 256; 1×1, 1024] × 6      14×14
res5                  [1×1, 512; 3×3, 512; 1×1, 2048] × 3      7×7
global average pool                                            1×2048
fc                                                             1×|L|
softmax                                                        1×|L|

C. The Many-to-One MTDL Framework
The many-to-one MTDL framework, shown in Fig. 3(b), is frequently used in video understanding [29]–[31], because multiple frames in the same video usually have the same label, and hence they can be considered together to predict the final label. Many-to-one can make better use of input context information than one-to-one. Many-to-one performs the following mapping:

f : (x_{n−τ}, ..., x_n, ..., x_{n+τ}) → y_n ∈ L, (2)

where τ is the number of neighboring frames before and after the current frame (the input context window size is hence 2τ+1).
There are two common approaches to fuse the time domain information from the 2τ+1 frames: Conv Pooling [32] and Late Fusion [30].
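The sliding context window in (2) can be sketched as follows; the function name and the edge-replication padding at the video boundaries are our own illustrative choices, not specified in the paper.

```python
import numpy as np

def context_windows(frames: np.ndarray, tau: int) -> np.ndarray:
    """Stack each frame with its tau left and tau right neighbors.

    frames: array of shape (N, ...), one entry per video frame.
    Returns an array of shape (N, 2*tau + 1, ...); the sequence is
    padded by edge replication so every frame gets a full window.
    """
    padded = np.concatenate([frames[:1].repeat(tau, axis=0),
                             frames,
                             frames[-1:].repeat(tau, axis=0)], axis=0)
    return np.stack([padded[n:n + 2 * tau + 1] for n in range(len(frames))])

# Tiny demo with scalar "frames" 0..4 and tau = 1:
windows = context_windows(np.arange(5), tau=1)
print(windows[0])  # [0 0 1] -- left edge replicated
print(windows[2])  # [1 2 3]
```

In practice each entry would be an image rather than a scalar; the windowing logic is identical.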
1) Conv Pooling:
This is a convolutional temporal feature pooling architecture, which has been extensively used for video classification, especially for bag-of-words representations [33]. Image features are computed for each frame and then max pooled. The pooled features can then be sent to fully connected layers for the final classification. A major advantage of this approach is that the spatial information in multiple frames, output by the convolutional layers, is preserved through a max pooling operation in the time domain. Experiments [32] verified that Conv Pooling outperformed all other feature pooling approaches on the Sports-1M dataset, using a 120-frame AlexNet model [34].

Fig. 1. Sample frames from a time-lapse video (scale bar: 50 μm). (a) 1-cell stage; (b) 2-cell stage; (c) 4-cell stage; (d) 4+-cell stage.

Fig. 2. Percentage of frames in different embryo development stages.
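The Conv Pooling operation can be sketched as follows, with random arrays standing in for the per-frame convolutional feature maps (all shapes are illustrative, not the paper's actual layer sizes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame convolutional feature maps for a context window
# of 2*tau + 1 = 3 frames, each of shape (channels, height, width).
tau = 1
frame_features = rng.random((2 * tau + 1, 64, 7, 7))

# Conv Pooling: element-wise max over the time axis, which preserves
# the spatial layout of the convolutional features.
pooled = frame_features.max(axis=0)  # shape (64, 7, 7)

# The pooled features are then flattened and sent to fully connected
# layers for the final classification (not shown here).
assert pooled.shape == (64, 7, 7)
```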
2) Late Fusion:
In Late Fusion, all frames in the input context window are encoded via identical ConvNets. The final representations after all convolutional layers are concatenated and passed through a fully connected layer to generate the classifications. The concatenation can happen to a subset of the frames in the input context window [30], or to all frames in that window [17]. Previous research [17] demonstrated that Late Fusion ConvNets using 15 frames and a DP-based decoder outperformed Early Fusion for predicting embryo morphokinetics in time-lapse videos.
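A minimal sketch of the Late Fusion step, with random vectors standing in for the per-frame ConvNet representations and random weights standing in for a trained fully connected layer (all names and sizes are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

tau, feat_dim, n_classes = 1, 128, 6   # six embryo stages in our label set

# Hypothetical final ConvNet representations of the 2*tau + 1 frames
# in the input context window (identical ConvNets, one per frame).
frame_reprs = rng.random((2 * tau + 1, feat_dim))

# Late Fusion: concatenate the per-frame representations ...
fused = frame_reprs.reshape(-1)        # shape ((2*tau+1) * feat_dim,)

# ... and pass them through a fully connected layer + softmax.
W = rng.standard_normal((n_classes, fused.size))
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()

assert fused.shape == (3 * 128,) and abs(probs.sum() - 1.0) < 1e-9
```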
D. The One-to-Many MTDL Framework

One-to-many, shown in Fig. 3(c), means each input is mapped to multiple outputs; such a model is also called a multi-task net [35] in deep learning. This paper uses hard parameter sharing of hidden layers [26], as illustrated in Fig. 4. The parameters of the convolutional layers are shared among the different tasks, but those of the fully connected layers are trained separately.
In one-to-many, each x_n is used in classifying the 2τ+1 stages centered at n, i.e., it learns the following one-to-many mapping:

f : x_n → (y_{n−τ}, ..., y_{n+τ}) ∈ L^{2τ+1}. (3)

Fig. 3. Different classification frameworks. (a) one-to-one; (b) many-to-one; (c) one-to-many; (d) many-to-many. The convolutional layers are denoted by 'C'. Blue and red rectangles denote the flatten layer and the max-pooling layer, respectively. Orange rectangles denote the fully connected and softmax layers.

Fig. 4. Hard parameter sharing for MTDL.

x_n's classification for the stage at time index t ∈ [n−τ, n+τ] is a probability vector p̂_t(x_n) ∈ R^{|L|×1}. At each Frame Index n, the corresponding label is estimated by the 2τ+1 neighboring frames x_t, t ∈ [n−τ, n+τ]. We need to aggregate them to obtain the final classification.
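A hard-parameter-sharing multi-task net of this kind can be sketched as follows; the shared features, the per-head weights, and all sizes are random illustrative stand-ins, and the loss shown is the summed per-head cross-entropy used for training.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A hypothetical one-to-many multi-task net: one shared feature vector
# for frame x_n feeds 2*tau + 1 separate fully connected heads, one per
# stage label y_{n-tau}, ..., y_{n+tau}. Weights are random stand-ins
# for trained parameters.
rng = np.random.default_rng(0)
tau, feat_dim, n_stages = 1, 32, 6

shared_features = rng.random(feat_dim)              # shared conv output
heads = [rng.standard_normal((n_stages, feat_dim))  # per-task fc layers
         for _ in range(2 * tau + 1)]

# Each head outputs a probability vector over the stages.
probs = [softmax(W @ shared_features) for W in heads]

# Training sums the per-head cross-entropy losses (weights w_t = 1).
true_labels = [1, 1, 2]                             # y_{n-1}, y_n, y_{n+1}
loss = sum(-np.log(p[y]) for p, y in zip(probs, true_labels))

assert len(probs) == 2 * tau + 1 and loss > 0
```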
This can be done by an ensemble approach.
Because each frame x_n is involved in 2τ+1 outputs, the total loss on a training frame x_n is computed as the sum of the losses on all involved outputs:

ℓ(x_n) = Σ_{t=n−τ}^{n+τ} w_t · ℓ(y_t, p̂_t(x_n)), (4)

where w_t is the weight for the t-th output, and y_t is the true label for Frame t. w_t = 1 and the cross-entropy loss were used in this paper. The cross-entropy loss on the t-th output can be written as:

ℓ(y_t, p̂_t(x_n)) = −log(p̂_{t,y_t}(x_n)), (5)

where p̂_{t,y_t}(x_n) is the y_t-th element of p̂_t(x_n).

E. The Many-to-Many MTDL Framework

Many-to-many can be viewed as a combination of one-to-many and many-to-one. Each input frame is processed by a separate CNN. Late Fusion was used, and the parameters of the fully connected layers were also trained separately, as shown in Fig. 3(d).

III. MULTI-TASK DEEP LEARNING WITH DYNAMIC PROGRAMMING (MTDL-DP)

This section introduces our proposed MTDL-DP approach.

Fig. 5. Ensemble of the multi-task net's predictions at Frame Index n, made by neighboring frames x_t, t ∈ [n−τ, n+τ].

A. Ensemble Learning for MTDL
As mentioned in Section II-D, a multi-task net has multiple outputs. The easiest approach to get the final classification corresponding to a specific frame is to choose the middle output of the network. A more sophisticated approach is ensemble learning [36]. We consider two common probabilistic aggregation approaches in this paper: additive mean and multiplicative mean.
Let p̂_n(x_t) be the predicted probability vector at Frame Index n, given Frame x_t, t ∈ [n−τ, n+τ], as illustrated in Fig. 5. The ensemble probability p̂_n at Frame Index n, aggregated by the additive mean, is:

p̂_n = (1/(2τ+1)) Σ_{t=n−τ}^{n+τ} p̂_n(x_t). (6)

If the multiplicative mean is used,

p̂_n = (1/(2τ+1)) ∏_{t=n−τ}^{n+τ} p̂_n(x_t). (7)

Since each p̂_n(x_t) is a vector, the summation in (6) and the multiplication in (7) are element-wise operations.
The final classification label ŷ_n for Frame x_n is obtained by probability maximization:

ŷ_n = arg max_{1≤l≤|L|} p̂_{n,l}, (8)

where p̂_{n,l} is the l-th element of p̂_n.

B. Post-processing with DP
The number of cells in a developing embryo is almost always non-decreasing [37]. However, this is not guaranteed in the classification outputs of MTDL. We use DP to adjust the classifications so that this constraint is satisfied.
For each video, the groundtruth stages {y_n}_{n=1}^{N} form a sequence. MTDL outputs a probability vector p̂_n = [p̂_{n,1}, ..., p̂_{n,|L|}]^T before likelihood maximization at Frame n, where p̂_{n,l} is the estimated probability that Frame n is at Stage l. We define E(ŷ, P̂) as the total loss for an estimated prediction ŷ = {ŷ_n}_{n=1}^{N}, given the model output probability matrix P̂ = [p̂_1, ..., p̂_N]. The total loss is the sum of the per-frame losses Σ_{n=1}^{N} e(ŷ_n, p̂_n), which must be optimized subject to the monotonicity constraint ŷ_{n+1} ≥ ŷ_n, ∀n.
Two common per-frame losses [17] were used. The first is the negative label likelihood (LL), defined as:

e_LL(ŷ_n, p̂_n) = −log(p̂_{n,ŷ_n}). (9)

The second is the earth mover (EM) distance, defined as:

e_EM(ŷ_n, p̂_n) = Σ_{l=1}^{|L|} p̂_{n,l} |ŷ_n − l|. (10)

The final classification stage sequence ŷ* = {ŷ_n}_{n=1}^{N} can then be obtained as:

ŷ* = arg min_{ŷ = {ŷ_n}_{n=1}^{N}} Σ_{n=1}^{N} e(ŷ_n, p̂_n), s.t. ŷ_{n+1} ≥ ŷ_n, ∀n, (11)

which can be easily solved by DP, as shown in Algorithm 1.

C. MTDL-DP
Our proposed MTDL-DP consists of three steps: 1) construct a multi-task net with the one-to-many or many-to-many MTDL framework; 2) use the multiplicative mean to aggregate the predictions of the multi-task net; and 3) post-process with DP using the EM distance per-frame loss. Its pseudocode is given in Algorithm 2. The one-to-many MTDL framework can also be replaced by the many-to-many MTDL framework.

IV. EXPERIMENTAL RESULTS

This section investigates the performance of our proposed MTDL-DP.
Algorithm 1: Pseudocode of dynamic programming (DP).

Input: N, the number of frames in a time-lapse video;
       L, the label set of embryo development stages;
       P̂ = [p̂_1, ..., p̂_N] ∈ R^{|L|×N}, the MTDL model output probability matrix for the N frames.
Output: ŷ*, the optimized stage sequence.

for n = 1, ..., N do
    for ŷ = 1, ..., |L| do
        Compute e(ŷ, p̂_n) in (10);
    end
end
Set E(ŷ, p̂_1) = e(ŷ, p̂_1), ∀ŷ ∈ [1, |L|];
for n = 2, ..., N do
    for ŷ = 1, ..., |L| do
        E(ŷ, p̂_n) = e(ŷ, p̂_n) + min_{1 ≤ l ≤ ŷ} E(l, p̂_{n−1});
    end
end
k = |L|;
for n = N, ..., 1 do
    ŷ_n = arg min_{1 ≤ l ≤ k} E(l, p̂_n);
    if ŷ_n < k then
        k = ŷ_n;
    end
end
ŷ* = {ŷ_n}_{n=1}^{N};
Return the optimized stage sequence ŷ*.

A. Experimental Setup
We created training/validation/test data partitions by randomly selecting 70%/10%/20% of the videos from the dataset, i.e., 41,650/5,950/11,900 frames, respectively. We resized each frame to 224 × 224 so that it could be used by ResNet50, our baseline model. Random rotation and flip data augmentation was used. All MTDL frameworks were initialized with the weights trained by one-to-one (ResNet50). Then, the convolutional layer parameters were frozen, and the fully connected layers were further tuned.
We used the cross-entropy loss function, the Adam optimizer [38], and early stopping to reduce overfitting in all experiments. The multiplicative mean and the EM distance per-frame loss were used in MTDL-DP. All experiments were repeated five times, and the mean results are reported.
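The decoding used in these experiments, i.e., multiplicative-mean aggregation (Eq. (7)) followed by DP post-processing with the EM per-frame loss (Eq. (10), Algorithm 1), can be sketched as follows. This is a minimal illustration on hand-made probabilities with 0-based stage indices, not the full training code.

```python
import numpy as np

def mul_mean(preds):
    """Multiplicative-mean ensemble of Eq. (7); preds has shape
    (2*tau + 1, |L|). The 1/(2*tau + 1) factor does not affect the
    subsequent argmax or DP decoding."""
    return preds.prod(axis=0) / len(preds)

def em_loss(y, p):
    """Earth mover per-frame loss of Eq. (10), with 0-based stage y."""
    return float(np.sum(p * np.abs(y - np.arange(len(p)))))

def dp_decode(P):
    """Algorithm 1: decode a monotonically non-decreasing stage sequence.

    P: shape (N, |L|); row n holds the (aggregated) stage probabilities
    of frame n. Returns the 0-based sequence minimizing the total EM
    loss subject to y[n+1] >= y[n]."""
    N, L = P.shape
    e = np.array([[em_loss(y, P[n]) for y in range(L)] for n in range(N)])
    E = np.zeros((N, L))
    E[0] = e[0]
    for n in range(1, N):                  # forward pass
        # best cumulative loss over all labels <= current label
        E[n] = e[n] + np.minimum.accumulate(E[n - 1])
    y_hat = np.empty(N, dtype=int)
    k = L - 1
    for n in range(N - 1, -1, -1):         # backtracking
        y_hat[n] = int(np.argmin(E[n, :k + 1]))
        k = y_hat[n]
    return y_hat

# A noisy 4-frame example over 3 stages: the frame-wise argmax sequence
# is [0, 1, 0, 2], which is not monotonic; DP repairs it.
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.4, 0.35, 0.25],
              [0.1, 0.1, 0.8]])
print(dp_decode(P))  # [0 1 1 2]
```

In the full pipeline, each row of `P` would come from `mul_mean` applied to the 2τ+1 multi-task predictions for that frame.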
B. Classification Accuracy
First, we considered MTDL only, without using DP. The classification accuracies are shown in the left panel of Table II, with τ ∈ {1, 4, 7} (the output context window size was 2τ+1). All MTDL frameworks outperformed the one-to-one framework, suggesting that using neighboring input or label information in multi-task learning was indeed beneficial.
For the many-to-one MTDL framework, when τ increased, the performance of Late Fusion also increased, whereas the performance of Conv Pooling decreased. This is intuitive, because more input information was ignored in Conv Pooling when τ increased.

Algorithm 2: MTDL-DP

Input: N, the number of frames in a time-lapse video;
       D, a set of labeled time-lapse videos;
       {x_n}_{n=1}^{N}, the frames to be labeled;
       τ, the number of left and right neighboring frames in the context window.
Output: ŷ*, the labeled stage sequence.

Use the one-to-one framework to train a baseline model f from D;
Initialize an MTDL model, whose convolutional layer parameters are identical to those of f;
Fine-tune the fully connected layer parameters of the MTDL model on D;
for n = 1, ..., N do
    Use the MTDL model to compute p̂_t(x_n), t = n−τ, ..., n+τ;
end
for n = 1, ..., N do
    Compute p̂_n by (7);
    Compute the per-frame loss e_EM(ŷ_n, p̂_n) in (10);
end
Solve for ŷ* in (11) by Algorithm 1;
Return the optimized stage sequence ŷ*.

The classification accuracies with DP post-processing are shown in the right panel of Table II. Post-processing increased the classification accuracies for all classifiers and all τ; e.g., the five classifiers achieved 2.3%, 1.2%, 2.1%, 1.5%, and 2.0% performance improvements when τ = 1, respectively. However, as τ increased, the classification performance improvements became less obvious. After post-processing, the many-to-many and one-to-many frameworks had higher accuracies than the many-to-one framework, and only many-to-many consistently outperformed one-to-one for all τ, suggesting that post-processing may be more beneficial when more input and output information is utilized.

C. Root Mean Squared Error (RMSE)
We also computed the root mean squared error (RMSE) between the true video label sequences and the classifications. The RMSEs without DP post-processing are shown in the left panel of Table III. All MTDL frameworks had lower RMSEs than the one-to-one framework, suggesting again that using neighboring input or label information in multi-task learning was beneficial.
The results after DP post-processing are shown in the right panel of Table III. DP post-processing reduced the RMSE for all MTDL frameworks and all τ, suggesting that DP was indeed beneficial. Though all MTDL frameworks outperformed the one-to-one framework only at τ = 1, the many-to-many framework consistently outperformed one-to-one for all τ.

TABLE II
CLASSIFICATION ACCURACIES FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ, BEFORE AND AFTER DP POST-PROCESSING.

Framework      Method                  Accuracy without DP      Accuracy with DP
                                       τ=1    τ=4    τ=7        τ=1    τ=4    τ=7
One-to-one     ResNet50                83.8%  83.8%  83.8%      86.1%  86.1%  86.1%
Many-to-one    Conv Pooling            84.7%  84.4%  83.8%      85.9%  85.1%  84.5%
               Late Fusion             83.9%  84.6%  85.1%      86.0%  85.2%  85.2%
One-to-many    Multi-Task Nets (ours)

TABLE III
RMSES FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ, BEFORE AND AFTER DP POST-PROCESSING.

Framework      Method                  RMSE without DP             RMSE with DP
                                       τ=1     τ=4     τ=7         τ=1     τ=4     τ=7
One-to-one     ResNet50                0.4840  0.4840  0.4840      0.4199  0.4199  0.4199
Many-to-one    Conv Pooling            0.4728  0.4690  0.4795      0.4066  0.4432  0.4419
               Late Fusion             0.4761
D. Training Time
The training time of the different models, averaged over five runs, is shown in Table IV. The training time of the many-to-one and many-to-many MTDL frameworks increased roughly linearly with the input context size; however, the training time of the one-to-many MTDL framework was insensitive to τ, which is an advantage.

TABLE IV
TRAINING TIME FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ.

Framework      Method                    Training time (s)
                                         τ=1     τ=4     τ=7
One-to-one     ResNet50
Many-to-one    Conv Pooling              5318    15378   29139
               Late Fusion               4892    17390   27534
One-to-many    Multi-Task Nets (ours)    2246    2265    2542
Many-to-many                             5759    16182   27808
E. Comparison of Different Ensemble Approaches
We also compared the performances of the different ensemble approaches introduced in Section III-A, without considering DP post-processing. The CNN models were constructed using the one-to-many and many-to-many MTDL frameworks. The results are shown in Figs. 6 and 7. Both the additive mean and the multiplicative mean achieved performance improvements, and the multiplicative mean slightly outperformed the additive mean. As τ increased, the performance of the many-to-many MTDL framework improved. The one-to-many MTDL framework had the best performance when τ = 4.

F. Comparison of Different Losses in DP Post-Processing
Next, we studied the effect of different per-frame losses in DP post-processing. The RMSEs for different τ and different MTDL frameworks are shown in Fig. 8. The EM loss always gave smaller RMSEs than the LL loss.
The true stage labels, and the classified labels before and after DP in two time-lapse videos, are shown in Fig. 9. Clearly, DP smoothed the classifications, and its outputs were closer to the groundtruth labels.

Fig. 6. Classification accuracies with and without ensemble learning. (a) One-to-many; (b) Many-to-many.

Fig. 7. RMSEs with and without ensemble learning. (a) One-to-many; (b) Many-to-many.

The confusion matrix for the one-to-many MTDL framework, using the multiplicative mean and τ = 1, is shown in Fig. 10(a) before DP post-processing, and in Fig. 10(b) after DP post-processing. The diagonal shows the classification accuracy of each individual cell stage. Post-processing improved the accuracy of all embryonic stages except t3, whose classification accuracy before DP (16%) was much lower than the others. There may be two reasons for this: 1) Stage t3 had much fewer training examples in our dataset (see Fig. 2), and hence was not trained adequately; and 2) the low accuracy of t3 may also be due to multipolar cleavages from the zygote stage, which occur in 12.2% of human embryos [39].

V. CONCLUSION
Accurate classification of embryo early development stages can provide embryologists valuable information for assessing embryo quality, and hence is critical to the success of IVF. This paper has proposed an MTDL-DP approach for automatic embryo development stage classification from time-lapse videos. The one-to-many and many-to-many MTDL frameworks performed the best. Considering the trade-off between training time and classification accuracy, we recommend the one-to-many MTDL framework in MTDL-DP, because it achieves comparable performance with the many-to-many MTDL framework at a much lower computational cost.
To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
Fig. 8. RMSEs of different per-frame losses in DP. (a) τ = 1; (b) τ = 4; (c) τ = 7. The numbers on the horizontal axis denote different MTDL frameworks: 1–One-to-one; 2–Many-to-one (Conv Pooling); 3–Many-to-one (Late Fusion); 4–One-to-many; 5–Many-to-many.

Fig. 9. True stage labels, and classifications before and after DP, in two time-lapse videos. One-to-many and τ = 1 were used.

Fig. 10. Confusion matrices (a) before and (b) after DP post-processing.

REFERENCES

[1] B. Huang, X. Ren, L. Wu, L. Zhu, B. Xu, Y. Li, J. Ai, and L. Jin, "Elevated progesterone levels on the day of oocyte maturation may affect top quality embryo IVF cycles," PLoS One, vol. 11, no. 1, p. e0145895, 2016.
[2] B. Huang, D. Hu, K. Qian, J. Ai, Y. Li, L. Jin, G. Zhu, and H. Zhang, "Is frozen embryo transfer cycle associated with a significantly lower incidence of ectopic pregnancy? An analysis of more than 30,000 cycles," Fertility and Sterility, vol. 102, no. 5, pp. 1345–1349, 2014.
[3] B. Huang, K. Qian, Z. Li, J. Yue, W. Yang, G. Zhu, and H. Zhang, "Neonatal outcomes after early rescue intracytoplasmic sperm injection: an analysis of a 5-year period," Fertility and Sterility, vol. 103, no. 6, pp. 1432–1437, 2015.
[4] A. S. in Reproductive Medicine and E. S. I. G. of Embryology, "The Istanbul consensus workshop on embryo assessment: Proceedings of an expert meeting," Human Reproduction, vol. 26, no. 6, pp. 1270–1283, 2011.
[5] B. Tomasz, K. Rafal, and G. Wojciech, "Methods of embryo scoring in in vitro fertilization," Reproductive Biology, vol. 4, no. 1, pp. 5–22, 2004.
[6] J. Holte, L. Berglund, K. Milton, C. Garello, G. Gennarelli, A. Revelli, and T. Bergh, "Construction of an evidence-based integrated morphology cleavage embryo score for implantation potential of embryos scored and transferred on day 2 after oocyte retrieval," Human Reproduction, vol. 22, no. 2, pp. 548–557, 2006.
[7] J. Lemmen, I. Agerholm, and S. Ziebe, "Kinetic markers of human embryo quality using time-lapse recordings of IVF/ICSI-fertilized oocytes," Reproductive Biomedicine Online, vol. 17, no. 3, pp. 385–391, 2008.
[8] K. Kirkegaard, I. E. Agerholm, and H. J. Ingerslev, "Time-lapse monitoring as a tool for clinical embryo assessment," Human Reproduction, vol. 27, no. 5, pp. 1277–1285, 2012.
[9] C. C. Wong, K. E. Loewke, N. L. Bossert, B. Behr, C. J. De Jonge, T. M. Baer, and R. A. R. Pera, "Non-invasive imaging of human embryos before embryonic genome activation predicts development to the blastocyst stage," Nature Biotechnology, vol. 28, no. 10, pp. 1115–1121, 2010.
[10] J. Herrero, A. Tejera, C. Albert, C. Vidal, M. J. De Los Santos, and M. Meseguer, "A time to look back: Analysis of morphokinetic characteristics of human embryo development," Fertility and Sterility, vol. 100, no. 6, pp. 1602–1609, 2013.
[11] A. A. Chen, L. Tan, V. Suraj, R. R. Pera, and S. Shen, "Biomarkers identified with time-lapse imaging: Discovery, validation, and practical application," Fertility and Sterility, vol. 99, no. 4, pp. 1035–1043, 2013.
[12] M. Meseguer, J. Herrero, A. Tejera, K. M. Hilligsoe, N. B. Ramsing, and J. Remohi, "The use of morphokinetics as a predictor of embryo implantation," Human Reproduction, vol. 26, no. 10, pp. 2658–2671, 2011.
[13] Y. Wang, F. Moussavi, and P. Lorenzen, "Automated embryo stage classification in time-lapse microscopy video of early human embryo development," in Proc. 16th Int'l Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Nagoya, Japan, Sep. 2013, pp. 460–467.
[14] J. Conaghan, A. A. Chen, S. P. Willman, K. Ivani, P. E. Chenette, R. Boostanfar, V. L. Baker, G. D. Adamson, M. E. Abusief, M. Gvakharia et al., "Improving embryo selection using a computer-automated time-lapse image analysis test plus day 3 morphology: results from a prospective multicenter trial," Fertility and Sterility, vol. 100, no. 2, pp. 412–419, 2013.
[15] D. Jonaitis, V. Raudonis, and A. Lipnickas, "Application of numerical intelligence methods for the automatic quality grading of an embryo development," International Journal of Computing, vol. 15, no. 3, pp. 177–183, 2016.
[16] A. Khan, S. Gould, and M. Salzmann, "Segmentation of developing human embryo in time-lapse microscopy," in Proc. 13th Int'l Symposium on Biomedical Imaging (ISBI), Prague, Czech Republic, April 2016, pp. 930–934.
[17] N. H. Ng, J. McAuley, J. A. Gingold, N. Desai, and Z. C. Lipton, "Predicting embryo morphokinetics in videos with late fusion nets & dynamic decoders," May 2018. [Online]. Available: https://openreview.net/forum?id=By1QAYkvz
[18] A. Khan, S. Gould, and M. Salzmann, "Deep convolutional neural networks for human embryonic cell counting," in Proc. 14th European Conf. on Computer Vision (ECCV), Amsterdam, The Netherlands, October 2016, pp. 339–348.
[19] M. D. VerMilyea, L. Tan, J. T. Anthony, J. Conaghan, K. Ivani, M. Gvakharia, R. Boostanfar, V. L. Baker, V. Suraj, A. A. Chen et al., "Computer-automated time-lapse analysis results correlate with embryo implantation and clinical pregnancy: a blinded, multi-centre study," Reproductive Biomedicine Online, vol. 29, no. 6, pp. 729–736, Dec. 2014.
[20] M. P. Diamond, V. Suraj, E. J. Behnke, X. Yang, M. J. Angle, J. C. Lambe-Steinmiller, R. Watterson, K. A. Wirka, A. A. Chen, and S. Shen, "Using the Eeva Test adjunctively to traditional day 3 morphology is informative for consistent embryo assessment within a panel of embryologists with diverse experience," Journal of Assisted Reproduction and Genetics, vol. 32, no. 1, pp. 61–68, Jan. 2015.
[21] B. Aparicio-Ruiz, N. Basile, S. P. Albalá, F. Bronet, J. Remohí, and M. Meseguer, "Automatic time-lapse instrument is superior to single-point morphology observation for selecting viable embryos: retrospective study in oocyte donation," Fertility and Sterility, vol. 106, no. 6, pp. 1379–1385, Nov. 2016.
[22] D. C. Kieslinger, S. De Gheselle, C. B. Lambalk, P. De Sutter, E. H. Kostelijk, J. W. R. Twisk, J. van Rijswijk, E. Van den Abbeel, and C. G. Vergouw, "Embryo selection using time-lapse analysis (Early Embryo Viability Assessment) in conjunction with standard morphology: a prospective two-center pilot study," Human Reproduction, vol. 31, no. 11, pp. 2450–2457, Nov. 2016.
[23] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. 25th Int'l Conf. on Machine Learning (ICML), Helsinki, Finland, July 2008, pp. 160–167.
[24] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," in Proc. 38th Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 8599–8603.
[25] R. Girshick, "Fast R-CNN," in
Proc. Int’l Conf. on Computer Vision(ICCV) , Santiago, Chile, December 2015.[26] S. Ruder, “An overview of multi-task learning in deep neuralnetworks,”
CoRR , vol. abs/1706.05098, 2017. [Online]. Available:http://arxiv.org/abs/1706.05098[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) , Las Vegas, NV, June 2016, pp. 770–778.[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:A large-scale hierarchical image database,” in
Proc. IEEE Conf. onComputer Vision and Pattern Recognition (CVPR) , Miami Beach, FL,June 2009, pp. 248–255.[29] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of101 human actions classes from videos in the wild,”
CoRR , vol.abs/1212.0402, 2012. [Online]. Available: http://arxiv.org/abs/1212.0402[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, andL. Fei-Fei, “Large-scale video classification with convolutional neuralnetworks,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) . Columbus, OH: IEEE, June 2014, pp. 1725–1732.[31] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles,“Activitynet: A large-scale video benchmark for human activity un-derstanding,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) . Boston, MA: IEEE, June 2015, pp. 961–970.[32] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga,and G. Toderici, “Beyond short snippets: Deep networks for videoclassification,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) . Boston, MA: IEEE, June 2015, pp. 4694–4702.[33] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learningnatural scene categories,” in
Proc. IEEE Conf. on Computer Vision andPattern Recognition (CVPR) , vol. 2, San Diego, CA, June 2005, pp.524–531.[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classificationwith deep convolutional neural networks,” in
Proc. Advances in NeuralInformation Processing Systems , Lake Tahoe, NV, December 2012, pp.1097–1105.[35] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, “Multi-task neural networksfor qsar predictions,” arXiv preprint arXiv:1406.1231 , 2014.[36] Z.-H. Zhou,
Ensemble methods: foundations and algorithms . BocaRaton, FL: CRC press, 2012.[37] Y. Liu, V. Chapple, P. Roberts, and P. Matson, “Prevalence, consequence,and significance of reverse cleavage by human embryos viewed with theuse of the embryoscope time-lapse video system,”
Fertility Sterility , vol.102, no. 5, pp. 1295–1300, 2014.[38] D. P. Kingma and J. Ba, “Adam: A method for stochasticoptimization,”
CoRR , vol. abs/1412.6980, 2014. [Online]. Available:https://arxiv.org/abs/1412.6980[39] B. Kalatova, R. Jesenska, D. Hlinka, and M. Dudas, “Tripolar mitosisin human cells and embryos: occurrence, pathophysiology and medicalimplications,”