Multi-Task Deep Learning with Dynamic Programming for Embryo Early Development Stage Classification from Time-Lapse Videos
Zihan Liu, Bo Huang, Yuqi Cui, Yifan Xu, Bo Zhang, Lixia Zhu, Yang Wang, Lei Jin and Dongrui Wu
Abstract—Time-lapse is a technology used to record the development of embryos during in-vitro fertilization (IVF). Accurate classification of embryo early development stages can provide embryologists valuable information for assessing the embryo quality, and hence is critical to the success of IVF. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach for this purpose. It first uses MTDL to pre-classify each frame in the time-lapse video to an embryo development stage, and then DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. Different MTDL frameworks, e.g., one-to-many, many-to-one, and many-to-many, are investigated. It is shown that the one-to-many MTDL framework achieved the best compromise between performance and computational cost. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
Index Terms—Multi-task learning, in-vitro fertilization, convolutional neural networks, dynamic programming, image classification
I. INTRODUCTION
In-vitro fertilization (IVF) [1]–[3] is a frequently used technology for treating infertility. The process involves the collection of multiple follicles for fertilization and in-vitro culture. Cultivation, selection and transplantation of embryos are the key steps in determining a successful implantation during IVF [4], [5]. During the development of embryos, the morphological characteristics [6] and kinetic characteristics [7] are highly correlated with the outcome of transplantation. Time-lapse videos have been widely used in various reproductive medicine centers during the cultivation of embryos [8] to monitor them. A time-lapse video records the embryonic development process in real time by taking photos of the embryos at short time intervals [9]. Thus, a large amount of time series image data for each embryo are produced in this process. At the final stage of embryo selection, an embryologist reviews the entire embryo development process to score
Z. Liu, Y. Cui, Y. Xu, Y. Wang and D. Wu are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: [email protected], [email protected], [email protected], wangyang [email protected], [email protected].
B. Huang, B. Zhang, L. Zhu and L. Jin are with the Reproductive Medicine Center, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China. Email: [email protected], [email protected], [email protected], [email protected].
Zihan Liu and Bo Huang contributed equally to this work. Lei Jin and Dongrui Wu are the corresponding authors.
and sort them. Studies with different time-lapse equipment reported improved prediction accuracy of embryo implantation potential by analyzing the morphokinetics of human embryos at early cleavage stages [8]–[12]. These features have been shown to be statistically significant to the final outcome of the transplantation [7].
There have been only a few approaches to analyze time-lapse image data [9], [13]–[18]. Due to the limitations of the time-lapse technology, stereoscopic cells of different heights overlap in the images when photographed. It is difficult for even an experienced embryologist to accurately count the number of cells in a single time-lapse image when there are more than eight cells. Therefore, most research focused on the early development stages of embryos. Wong et al. [9] identified several key parameters that can predict blastocyst formation at the 4-cell stage from time-lapse images, and employed sequential Monte Carlo based probabilistic model estimation to monitor these parameters and track the cells. Wang et al. [13] presented a multi-level embryo stage classification approach, using both hand-crafted and automatically learned embryo features to identify the number of cells in a time-lapse video. Conaghan et al.
[14] used an automated and proprietary image analysis software, EEVA™ (Early Embryo Viability Assessment), which exhibited high image contrast through the use of darkfield illumination, to track cell divisions from the one-cell stage to the four-cell stage. Their experiments verified that the EEVA Test can significantly improve embryologists' ability to identify embryos that would develop into usable blastocysts. There are also several other studies on embryo selection using EEVA™ [19]–[22], but they did not provide the details of the EEVA Test used. Jonaitis et al. [15] compared the performance of neural networks, support vector machines and nearest neighbor classifiers in detecting cell division time. Khan et al. [18] used a deep convolutional neural network (CNN) to classify the number of cells, and also semantic segmentation to extract the cell regions in a time-lapse image [16]. Ng et al. [17] combined late fusion networks with dynamic programming (DP) to predict different cell development stages and obtained better results than a single-frame model.
Multi-task learning has been successfully used in many applications, such as natural language processing [23], speech recognition [24], and computer vision [25]. Its basic idea is to share representations among related tasks, so that each trained model may have better generalization ability [26]. This paper proposes a multi-task deep learning with dynamic programming (MTDL-DP) approach, which first uses MTDL to pre-classify each frame in the time-lapse video to an embryo development stage, and then DP to optimize the stage sequence so that the stage number is monotonically non-decreasing, which usually holds in practice. To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
The remainder of this paper is organized as follows: Section II introduces four classification frameworks for time-lapse video analysis. Section III proposes our MTDL-DP approach. Section IV presents the experimental results. Finally, Section V draws conclusions.

II. CLASSIFICATION FRAMEWORKS
This section introduces four frameworks for embryo early development stage classification from time-lapse videos. We first describe our dataset and the baseline network architecture, and then extend it to the many-to-one, one-to-many and many-to-many MTDL frameworks.
A. Dataset
The time-lapse video dataset used in our experiments came from the Reproductive Medicine Center of Tongji Hospital, Huazhong University of Science and Technology, Wuhan, China. It consisted of 170 time-lapse videos extracted from incubators, using an EmbryoScope+ time-lapse microscope system at a 10-minute sampling interval. Each frame in a video is a grayscale 800 × 800 image, with a well number in the lower left corner and a time marker after fertilization in the lower right corner, as shown in Fig. 1. The embryo is surrounded by some granulosa cells in the microscope field. The scale bar in the upper right corner indicates the size of the cells. Each video began about 0-2 hours after fertilization, and ended about 140 hours after fertilization. We only used the first N = 350 frames of each video, which were manually labeled for the embryo development stages. Therefore, we had a total of 170 × 350 = 59,500 labeled frames in the experiment.
As in [17], we focused on the first six embryo development stages: initialization (tStart), the appearance and breakdown of the male and female pronuclei (tPNf), and the appearance of 2 through 4+ cells (t2, t3, t4, t4+). We counted the number of images in each embryo development stage in the dataset, and show the summary in Fig. 2. Note that t3 was rarely observed in our dataset.

B. The Baseline One-to-One Classification Framework
Let x_n be the n-th frame in a time-lapse video. For image classification, a standard one-to-one classification framework learns a mapping:

f : x_n → y_n ∈ L, (1)

where y_n is the stage label of x_n, and L the label set of the embryo development stages.
When information from the previous and future frames is used, the standard one-to-one classification framework can be extended to the many-to-one, one-to-many and many-to-many MTDL frameworks, as illustrated in Fig. 3.
We used ResNet [27], which won the 2015 ImageNet classification competition, to process individual video frames. Table I shows our baseline ResNet50 model. The input image had three channels (RGB), each with 224 × 224 pixels (the 800 × 800 images were resized). The model was initialized with the ResNet weights pre-trained on ImageNet [28], which can help reduce overfitting on small datasets.
TABLE I
THE BASELINE RESNET50 MODEL.

layer name            50-layer                                 output size
conv1                 7×7, 64, stride 2                        112×112
res2                  3×3 max pool, stride 2;                  56×56
                      [1×1, 64; 3×3, 64; 1×1, 256] × 3
res3                  [1×1, 128; 3×3, 128; 1×1, 512] × 4       28×28
res4                  [1×1, 256; 3×3, 256; 1×1, 1024] × 6      14×14
res5                  [1×1, 512; 3×3, 512; 1×1, 2048] × 3      7×7
global average pool                                            1×2048
fc                                                             1×|L|
softmax                                                        1×|L|

C. The Many-to-One MTDL Framework
The many-to-one MTDL framework, shown in Fig. 3(b), is frequently used in video understanding [29]–[31], because multiple frames in the same video usually have the same label, and hence they can be considered together to predict the final label. Many-to-one can make better use of input context information than one-to-one. Many-to-one performs the following mapping:

f : (x_{n−τ}, ..., x_n, ..., x_{n+τ}) → y_n ∈ L, (2)

where τ is the number of neighboring frames before and after the current frame (the input context window size is hence 2τ+1).
There are two common approaches to fuse the time domain information from the 2τ+1 frames: Conv Pooling [32] and Late Fusion [30].
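The sliding context window in (2) can be sketched as follows; the function name and the edge-replication padding at the video boundaries are our own illustrative choices, not specified in the paper.

```python
import numpy as np

def context_windows(frames: np.ndarray, tau: int) -> np.ndarray:
    """Stack each frame with its tau left and tau right neighbors.

    frames: array of shape (N, ...), one entry per video frame.
    Returns an array of shape (N, 2*tau + 1, ...); the sequence is
    padded by edge replication so every frame gets a full window.
    """
    padded = np.concatenate([frames[:1].repeat(tau, axis=0),
                             frames,
                             frames[-1:].repeat(tau, axis=0)], axis=0)
    return np.stack([padded[n:n + 2 * tau + 1] for n in range(len(frames))])

# Tiny demo with scalar "frames" 0..4 and tau = 1:
windows = context_windows(np.arange(5), tau=1)
print(windows[0])  # [0 0 1] -- left edge replicated
print(windows[2])  # [1 2 3]
```

In practice each entry would be an image rather than a scalar; the windowing logic is identical.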
1) Conv Pooling:
This is a convolutional temporal feature pooling architecture, which has been extensively used for video classification, especially for bag-of-words representations [33]. Image features are computed for each frame and then max pooled. The pooled features can then be sent to fully connected layers for the final classification. A major advantage of this approach is that the spatial information in multiple frames, output by the convolutional layers, is preserved through a max pooling operation in the time domain. Experiments [32] verified that Conv Pooling outperformed all other feature pooling approaches on the Sports-1M dataset, using a 120-frame AlexNet model [34].

Fig. 1. Sample frames from a time-lapse video (scale bar: 50 μm). (a) 1-cell stage; (b) 2-cell stage; (c) 4-cell stage; (d) 4+-cell stage.

Fig. 2. Percentage of frames in different embryo development stages.
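The Conv Pooling operation can be sketched as follows, with random arrays standing in for the per-frame convolutional feature maps (all shapes are illustrative, not the paper's actual layer sizes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame convolutional feature maps for a context window
# of 2*tau + 1 = 3 frames, each of shape (channels, height, width).
tau = 1
frame_features = rng.random((2 * tau + 1, 64, 7, 7))

# Conv Pooling: element-wise max over the time axis, which preserves
# the spatial layout of the convolutional features.
pooled = frame_features.max(axis=0)  # shape (64, 7, 7)

# The pooled features are then flattened and sent to fully connected
# layers for the final classification (not shown here).
assert pooled.shape == (64, 7, 7)
```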
2) Late Fusion:
In Late Fusion, all frames in the input context window are encoded via identical ConvNets. The final representations after all convolutional layers are concatenated and passed through a fully connected layer to generate the classifications. The concatenation can happen to a subset of the frames in the input context window [30], or to all frames in that window [17]. Previous research [17] demonstrated that Late Fusion ConvNets using 15 frames and a DP-based decoder outperformed Early Fusion for predicting embryo morphokinetics in time-lapse videos.
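A minimal sketch of the Late Fusion step, with random vectors standing in for the per-frame ConvNet representations and random weights standing in for a trained fully connected layer (all names and sizes are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

tau, feat_dim, n_classes = 1, 128, 6   # six embryo stages in our label set

# Hypothetical final ConvNet representations of the 2*tau + 1 frames
# in the input context window (identical ConvNets, one per frame).
frame_reprs = rng.random((2 * tau + 1, feat_dim))

# Late Fusion: concatenate the per-frame representations ...
fused = frame_reprs.reshape(-1)        # shape ((2*tau+1) * feat_dim,)

# ... and pass them through a fully connected layer + softmax.
W = rng.standard_normal((n_classes, fused.size))
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()

assert fused.shape == (3 * 128,) and abs(probs.sum() - 1.0) < 1e-9
```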
D. The One-to-Many MTDL Framework

One-to-many, shown in Fig. 3(c), means each input is mapped to multiple outputs; such a model is also called a multi-task net [35] in deep learning. This paper uses hard parameter sharing of hidden layers [26], as illustrated in Fig. 4. The parameters of the convolutional layers are shared among the different tasks, but those of the fully connected layers are trained separately.
In one-to-many, each x_n is used in classifying the 2τ+1 stages centered at n, i.e., it learns the following one-to-many mapping:

f : x_n → (y_{n−τ}, ..., y_{n+τ}) ∈ L^{2τ+1}. (3)

Fig. 3. Different classification frameworks. (a) one-to-one; (b) many-to-one; (c) one-to-many; (d) many-to-many. The convolutional layers are denoted by 'C'. Blue and red rectangles denote the flatten layer and the max-pooling layer, respectively. Orange rectangles denote the fully connected and softmax layers.

Fig. 4. Hard parameter sharing for MTDL.

x_n's classification for the stage at time index t ∈ [n−τ, n+τ] is a probability vector p̂_t(x_n) ∈ R^{|L|×1}. At each Frame Index n, the corresponding label is estimated by the 2τ+1 neighboring frames x_t, t ∈ [n−τ, n+τ]. We need to aggregate them to obtain the final classification.
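A hard-parameter-sharing multi-task net of this kind can be sketched as follows; the shared features, the per-head weights, and all sizes are random illustrative stand-ins, and the loss shown is the summed per-head cross-entropy used for training.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A hypothetical one-to-many multi-task net: one shared feature vector
# for frame x_n feeds 2*tau + 1 separate fully connected heads, one per
# stage label y_{n-tau}, ..., y_{n+tau}. Weights are random stand-ins
# for trained parameters.
rng = np.random.default_rng(0)
tau, feat_dim, n_stages = 1, 32, 6

shared_features = rng.random(feat_dim)              # shared conv output
heads = [rng.standard_normal((n_stages, feat_dim))  # per-task fc layers
         for _ in range(2 * tau + 1)]

# Each head outputs a probability vector over the stages.
probs = [softmax(W @ shared_features) for W in heads]

# Training sums the per-head cross-entropy losses (weights w_t = 1).
true_labels = [1, 1, 2]                             # y_{n-1}, y_n, y_{n+1}
loss = sum(-np.log(p[y]) for p, y in zip(probs, true_labels))

assert len(probs) == 2 * tau + 1 and loss > 0
```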
This can be done by an ensemble approach.
Because each frame x_n is involved in 2τ+1 outputs, the total loss on a training frame x_n is computed as the sum of the losses on all involved outputs:

ℓ(x_n) = Σ_{t=n−τ}^{n+τ} w_t · ℓ(y_t, p̂_t(x_n)), (4)

where w_t is the weight for the t-th output, and y_t is the true label for Frame t. w_t = 1 and the cross-entropy loss were used in this paper. The cross-entropy loss on the t-th output can be written as:

ℓ(y_t, p̂_t(x_n)) = −log(p̂_{t,y_t}(x_n)), (5)

where p̂_{t,y_t}(x_n) is the y_t-th element of p̂_t(x_n).

E. The Many-to-Many MTDL Framework

Many-to-many can be viewed as a combination of one-to-many and many-to-one. Each input frame is processed by a separate CNN. Late Fusion was used, and the parameters of the fully connected layers were also trained separately, as shown in Fig. 3(d).

III. MULTI-TASK DEEP LEARNING WITH DYNAMIC PROGRAMMING (MTDL-DP)

This section introduces our proposed MTDL-DP approach.

Fig. 5. Ensemble of the multi-task net's predictions at Frame Index n, made by neighboring frames x_t, t ∈ [n−τ, n+τ].

A. Ensemble Learning for MTDL
As mentioned in Section II-D, a multi-task net has multiple outputs. The easiest approach to get the final classification corresponding to a specific frame is to choose the middle output of the network. A more sophisticated approach is ensemble learning [36]. We consider two common probabilistic aggregation approaches in this paper: additive mean and multiplicative mean.
Let p̂_n(x_t) be the predicted probability vector at Frame Index n, given Frame x_t, t ∈ [n−τ, n+τ], as illustrated in Fig. 5. The ensemble probability p̂_n at Frame Index n, aggregated by the additive mean, is:

p̂_n = (1/(2τ+1)) Σ_{t=n−τ}^{n+τ} p̂_n(x_t). (6)

If the multiplicative mean is used,

p̂_n = (1/(2τ+1)) ∏_{t=n−τ}^{n+τ} p̂_n(x_t). (7)

Since each p̂_n(x_t) is a vector, the summation in (6) and the multiplication in (7) are element-wise operations.
The final classification label ŷ_n for Frame x_n is obtained by probability maximization:

ŷ_n = arg max_{1≤l≤|L|} p̂_{n,l}, (8)

where p̂_{n,l} is the l-th element of p̂_n.

B. Post-processing with DP
The number of cells in a developing embryo is almost always non-decreasing [37]. However, this is not guaranteed in the classification outputs of MTDL. We use DP to adjust the classifications so that this constraint is satisfied.
For each video, the groundtruth stages {y_n}_{n=1}^{N} form a sequence. MTDL outputs a probability vector p̂_n = [p̂_{n,1}, ..., p̂_{n,|L|}]^T before likelihood maximization at Frame n, where p̂_{n,l} is the estimated probability that Frame n is at Stage l. We define E(ŷ, P̂) as the total loss for an estimated prediction ŷ = {ŷ_n}_{n=1}^{N}, given the model output probability matrix P̂ = [p̂_1, ..., p̂_N]. The total loss is the sum of the per-frame losses Σ_{n=1}^{N} e(ŷ_n, p̂_n), which must be optimized subject to the monotonicity constraint ŷ_{n+1} ≥ ŷ_n, ∀n.
Two common per-frame losses [17] were used. The first is the negative label likelihood (LL), defined as:

e_LL(ŷ_n, p̂_n) = −log(p̂_{n,ŷ_n}). (9)

The second is the earth mover (EM) distance, defined as:

e_EM(ŷ_n, p̂_n) = Σ_{l=1}^{|L|} p̂_{n,l} |ŷ_n − l|. (10)

The final classification stage sequence ŷ* = {ŷ_n}_{n=1}^{N} can then be obtained as:

ŷ* = arg min_{ŷ = {ŷ_n}_{n=1}^{N}} Σ_{n=1}^{N} e(ŷ_n, p̂_n), s.t. ŷ_{n+1} ≥ ŷ_n, ∀n, (11)

which can be easily solved by DP, as shown in Algorithm 1.

C. MTDL-DP
Our proposed MTDL-DP consists of three steps: 1) construct a multi-task net with the one-to-many or many-to-many MTDL framework; 2) use the multiplicative mean to aggregate the predictions of the multi-task net; and 3) post-process with DP using the EM distance per-frame loss. Its pseudocode is given in Algorithm 2. The one-to-many MTDL framework can also be replaced by the many-to-many MTDL framework.

IV. EXPERIMENTAL RESULTS

This section investigates the performance of our proposed MTDL-DP.
Algorithm 1: Pseudocode of dynamic programming (DP).

Input: N, the number of frames in a time-lapse video;
       L, the label set of embryo development stages;
       P̂ = [p̂_1, ..., p̂_N] ∈ R^{|L|×N}, the MTDL model output probability matrix for the N frames.
Output: ŷ*, the optimized stage sequence.

for n = 1, ..., N do
    for ŷ = 1, ..., |L| do
        Compute e(ŷ, p̂_n) in (10);
    end
end
Set E(ŷ, p̂_1) = e(ŷ, p̂_1), ∀ŷ ∈ [1, |L|];
for n = 2, ..., N do
    for ŷ = 1, ..., |L| do
        E(ŷ, p̂_n) = e(ŷ, p̂_n) + min_{1 ≤ l ≤ ŷ} E(l, p̂_{n−1});
    end
end
k = |L|;
for n = N, ..., 1 do
    ŷ_n = arg min_{1 ≤ l ≤ k} E(l, p̂_n);
    if ŷ_n < k then
        k = ŷ_n;
    end
end
ŷ* = {ŷ_n}_{n=1}^{N};
Return the optimized stage sequence ŷ*.

A. Experimental Setup
We created training/validation/test data partitions by randomly selecting 70%/10%/20% of the videos from the dataset, i.e., 41,650/5,950/11,900 frames, respectively. We resized each frame to 224 × 224 so that it could be used by ResNet50, our baseline model. Random rotation and flip data augmentation was used. All MTDL frameworks were initialized with the weights trained by one-to-one (ResNet50). Then, the convolutional layer parameters were frozen, and the fully connected layers were further tuned.
We used the cross-entropy loss function, the Adam optimizer [38], and early stopping to reduce overfitting in all experiments. The multiplicative mean and the EM distance per-frame loss were used in MTDL-DP. All experiments were repeated five times, and the mean results are reported.
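The decoding used in these experiments, i.e., multiplicative-mean aggregation (Eq. (7)) followed by DP post-processing with the EM per-frame loss (Eq. (10), Algorithm 1), can be sketched as follows. This is a minimal illustration on hand-made probabilities with 0-based stage indices, not the full training code.

```python
import numpy as np

def mul_mean(preds):
    """Multiplicative-mean ensemble of Eq. (7); preds has shape
    (2*tau + 1, |L|). The 1/(2*tau + 1) factor does not affect the
    subsequent argmax or DP decoding."""
    return preds.prod(axis=0) / len(preds)

def em_loss(y, p):
    """Earth mover per-frame loss of Eq. (10), with 0-based stage y."""
    return float(np.sum(p * np.abs(y - np.arange(len(p)))))

def dp_decode(P):
    """Algorithm 1: decode a monotonically non-decreasing stage sequence.

    P: shape (N, |L|); row n holds the (aggregated) stage probabilities
    of frame n. Returns the 0-based sequence minimizing the total EM
    loss subject to y[n+1] >= y[n]."""
    N, L = P.shape
    e = np.array([[em_loss(y, P[n]) for y in range(L)] for n in range(N)])
    E = np.zeros((N, L))
    E[0] = e[0]
    for n in range(1, N):                  # forward pass
        # best cumulative loss over all labels <= current label
        E[n] = e[n] + np.minimum.accumulate(E[n - 1])
    y_hat = np.empty(N, dtype=int)
    k = L - 1
    for n in range(N - 1, -1, -1):         # backtracking
        y_hat[n] = int(np.argmin(E[n, :k + 1]))
        k = y_hat[n]
    return y_hat

# A noisy 4-frame example over 3 stages: the frame-wise argmax sequence
# is [0, 1, 0, 2], which is not monotonic; DP repairs it.
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.4, 0.35, 0.25],
              [0.1, 0.1, 0.8]])
print(dp_decode(P))  # [0 1 1 2]
```

In the full pipeline, each row of `P` would come from `mul_mean` applied to the 2τ+1 multi-task predictions for that frame.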
B. Classification Accuracy
First, we considered MTDL only, without using DP. The classification accuracies are shown in the left panel of Table II, with τ ∈ {1, 4, 7} (the output context window size was 2τ+1). All MTDL frameworks outperformed the one-to-one framework, suggesting that using neighboring input or label information in multi-task learning was indeed beneficial.
For the many-to-one MTDL framework, when τ increased, the performance of Late Fusion also increased, whereas the performance of Conv Pooling decreased. This is intuitive, because more input information was ignored in Conv Pooling when τ increased.

Algorithm 2: MTDL-DP

Input: N, the number of frames in a time-lapse video;
       D, a set of labeled time-lapse videos;
       {x_n}_{n=1}^{N}, the frames to be labeled;
       τ, the number of left and right neighboring frames in the context window.
Output: ŷ*, the labeled stage sequence.

Use the one-to-one framework to train a baseline model f from D;
Initialize an MTDL model, whose convolutional layer parameters are identical to those of f;
Fine-tune the fully connected layer parameters of the MTDL model on D;
for n = 1, ..., N do
    Use the MTDL model to compute p̂_t(x_n), t = n−τ, ..., n+τ;
end
for n = 1, ..., N do
    Compute p̂_n by (7);
    Compute the per-frame loss e_EM(ŷ_n, p̂_n) in (10);
end
Solve for ŷ* in (11) by Algorithm 1;
Return the optimized stage sequence ŷ*.

The classification accuracies with DP post-processing are shown in the right panel of Table II. Post-processing increased the classification accuracies for all classifiers and all τ; e.g., the five classifiers achieved 2.3%, 1.2%, 2.1%, 1.5%, and 2.0% performance improvements when τ = 1, respectively. However, as τ increased, the classification performance improvements became less obvious. After post-processing, the many-to-many and one-to-many frameworks had higher accuracies than the many-to-one framework, and only many-to-many consistently outperformed one-to-one for all τ, suggesting that post-processing may be more beneficial when more input and output information is utilized.

C. Root Mean Squared Error (RMSE)
We also computed the root mean squared error (RMSE) between the true video label sequences and the classifications. The RMSEs without DP post-processing are shown in the left panel of Table III. All MTDL frameworks had lower RMSEs than the one-to-one framework, suggesting again that using neighboring input or label information in multi-task learning was beneficial.
The results after DP post-processing are shown in the right panel of Table III. DP post-processing reduced the RMSE for all MTDL frameworks and all τ, suggesting that DP was indeed beneficial. Though all MTDL frameworks outperformed the one-to-one framework only at τ = 1, the many-to-many framework consistently outperformed one-to-one for all τ.

TABLE II
CLASSIFICATION ACCURACIES FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ, BEFORE AND AFTER DP POST-PROCESSING.

Framework      Method                  Accuracy without DP      Accuracy with DP
                                       τ=1    τ=4    τ=7        τ=1    τ=4    τ=7
One-to-one     ResNet50                83.8%  83.8%  83.8%      86.1%  86.1%  86.1%
Many-to-one    Conv Pooling            84.7%  84.4%  83.8%      85.9%  85.1%  84.5%
               Late Fusion             83.9%  84.6%  85.1%      86.0%  85.2%  85.2%
One-to-many    Multi-Task Nets (ours)

TABLE III
RMSES FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ, BEFORE AND AFTER DP POST-PROCESSING.

Framework      Method                  RMSE without DP             RMSE with DP
                                       τ=1     τ=4     τ=7         τ=1     τ=4     τ=7
One-to-one     ResNet50                0.4840  0.4840  0.4840      0.4199  0.4199  0.4199
Many-to-one    Conv Pooling            0.4728  0.4690  0.4795      0.4066  0.4432  0.4419
               Late Fusion             0.4761
D. Training Time
The training time of the different models, averaged over five runs, is shown in Table IV. The training time of the many-to-one and many-to-many MTDL frameworks increased roughly linearly with the input context size; however, the training time of the one-to-many MTDL framework was insensitive to τ, which is an advantage.

TABLE IV
TRAINING TIME FOR DIFFERENT CLASSIFICATION FRAMEWORKS AND τ.

Framework      Method                    Training time (s)
                                         τ=1     τ=4     τ=7
One-to-one     ResNet50
Many-to-one    Conv Pooling              5318    15378   29139
               Late Fusion               4892    17390   27534
One-to-many    Multi-Task Nets (ours)    2246    2265    2542
Many-to-many                             5759    16182   27808
E. Comparison of Different Ensemble Approaches
We also compared the performances of the different ensemble approaches introduced in Section III-A, without considering DP post-processing. The CNN models were constructed using the one-to-many and many-to-many MTDL frameworks. The results are shown in Figs. 6 and 7. Both the additive mean and the multiplicative mean achieved performance improvements, and the multiplicative mean slightly outperformed the additive mean. As τ increased, the performance of the many-to-many MTDL framework improved. The one-to-many MTDL framework had the best performance when τ = 4.

F. Comparison of Different Losses in DP Post-Processing
Next, we studied the effect of different per-frame losses in DP post-processing. The RMSEs for different τ and different MTDL frameworks are shown in Fig. 8. The EM loss always gave smaller RMSEs than the LL loss.
The true stage labels, and the classified labels before and after DP in two time-lapse videos, are shown in Fig. 9. Clearly, DP smoothed the classifications, and its outputs were closer to the groundtruth labels.

Fig. 6. Classification accuracies with and without ensemble learning. (a) One-to-many; (b) Many-to-many.

Fig. 7. RMSEs with and without ensemble learning. (a) One-to-many; (b) Many-to-many.

The confusion matrix for the one-to-many MTDL framework, using the multiplicative mean and τ = 1, is shown in Fig. 10(a) before DP post-processing, and in Fig. 10(b) after DP post-processing. The diagonal shows the classification accuracy of each individual cell stage. Post-processing improved the accuracy of all embryonic stages except t3, whose classification accuracy before DP (16%) was much lower than the others. There may be two reasons for this: 1) Stage t3 had much fewer training examples in our dataset (see Fig. 2), and hence was not trained adequately; and 2) the low accuracy of t3 may also be due to multipolar cleavages from the zygote stage, which occur in 12.2% of human embryos [39].

V. CONCLUSION
Accurate classification of embryo early development stages can provide embryologists valuable information for assessing embryo quality, and hence is critical to the success of IVF. This paper has proposed an MTDL-DP approach for automatic embryo development stage classification from time-lapse videos. The one-to-many and many-to-many MTDL frameworks performed the best. Considering the trade-off between training time and classification accuracy, we recommend the one-to-many MTDL framework in MTDL-DP, because it achieves comparable performance with the many-to-many MTDL framework at a much lower computational cost.
To our knowledge, this is the first study that applies MTDL to embryo early development stage classification from time-lapse videos.
Fig. 8. RMSEs of different per-frame losses in DP. (a) τ = 1; (b) τ = 4; (c) τ = 7. The numbers on the horizontal axis denote different MTDL frameworks: 1–One-to-one; 2–Many-to-one (Conv Pooling); 3–Many-to-one (Late Fusion); 4–One-to-many; 5–Many-to-many.

Fig. 9. True stage labels, and classifications before and after DP, in two time-lapse videos. One-to-many and τ = 1 were used.

Fig. 10. Confusion matrices (a) before and (b) after DP post-processing.

REFERENCES

[1] B. Huang, X. Ren, L. Wu, L. Zhu, B. Xu, Y. Li, J. Ai, and L. Jin, "Elevated progesterone levels on the day of oocyte maturation may affect top quality embryo IVF cycles," PLoS One, vol. 11, no. 1, p. e0145895, 2016.
[2] B. Huang, D. Hu, K. Qian, J. Ai, Y. Li, L. Jin, G. Zhu, and H. Zhang, "Is frozen embryo transfer cycle associated with a significantly lower incidence of ectopic pregnancy? An analysis of more than 30,000 cycles," Fertility and Sterility, vol. 102, no. 5, pp. 1345–1349, 2014.
[3] B. Huang, K. Qian, Z. Li, J. Yue, W. Yang, G. Zhu, and H. Zhang, "Neonatal outcomes after early rescue intracytoplasmic sperm injection: an analysis of a 5-year period," Fertility and Sterility, vol. 103, no. 6, pp. 1432–1437, 2015.
[4] A. S. in Reproductive Medicine and E. S. I. G. of Embryology, "The Istanbul consensus workshop on embryo assessment: Proceedings of an expert meeting," Human Reproduction, vol. 26, no. 6, pp. 1270–1283, 2011.
[5] B. Tomasz, K. Rafal, and G. Wojciech, "Methods of embryo scoring in in vitro fertilization," Reproductive Biology, vol. 4, no. 1, pp. 5–22, 2004.
[6] J. Holte, L. Berglund, K. Milton, C. Garello, G. Gennarelli, A. Revelli, and T. Bergh, "Construction of an evidence-based integrated morphology cleavage embryo score for implantation potential of embryos scored and transferred on day 2 after oocyte retrieval," Human Reproduction, vol. 22, no. 2, pp. 548–557, 2006.
[7] J. Lemmen, I. Agerholm, and S. Ziebe, "Kinetic markers of human embryo quality using time-lapse recordings of IVF/ICSI-fertilized oocytes," Reproductive Biomedicine Online, vol. 17, no. 3, pp. 385–391, 2008.
[8] K. Kirkegaard, I. E. Agerholm, and H. J. Ingerslev, "Time-lapse monitoring as a tool for clinical embryo assessment," Human Reproduction, vol. 27, no. 5, pp. 1277–1285, 2012.
[9] C. C. Wong, K. E. Loewke, N. L. Bossert, B. Behr, C. J. De Jonge, T. M. Baer, and R. A. R. Pera, "Non-invasive imaging of human embryos before embryonic genome activation predicts development to the blastocyst stage," Nature Biotechnology, vol. 28, no. 10, pp. 1115–1121, 2010.
[10] J. Herrero, A. Tejera, C. Albert, C. Vidal, M. J. De Los Santos, and M. Meseguer, "A time to look back: Analysis of morphokinetic characteristics of human embryo development," Fertility and Sterility, vol. 100, no. 6, pp. 1602–1609, 2013.
[11] A. A. Chen, L. Tan, V. Suraj, R. R. Pera, and S. Shen, "Biomarkers identified with time-lapse imaging: Discovery, validation, and practical application," Fertility and Sterility, vol. 99, no. 4, pp. 1035–1043, 2013.
[12] M. Meseguer, J. Herrero, A. Tejera, K. M. Hilligsoe, N. B. Ramsing, and J. Remohi, "The use of morphokinetics as a predictor of embryo implantation," Human Reproduction, vol. 26, no. 10, pp. 2658–2671, 2011.
[13] Y. Wang, F. Moussavi, and P. Lorenzen, "Automated embryo stage classification in time-lapse microscopy video of early human embryo development," in Proc. 16th Int'l Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Nagoya, Japan, Sep. 2013, pp. 460–467.
[14] J. Conaghan, A. A. Chen, S. P. Willman, K. Ivani, P. E. Chenette, R. Boostanfar, V. L. Baker, G. D. Adamson, M. E. Abusief, M. Gvakharia et al., "Improving embryo selection using a computer-automated time-lapse image analysis test plus day 3 morphology: results from a prospective multicenter trial," Fertility and Sterility, vol. 100, no. 2, pp. 412–419, 2013.
[15] D. Jonaitis, V. Raudonis, and A. Lipnickas, "Application of numerical intelligence methods for the automatic quality grading of an embryo development," International Journal of Computing, vol. 15, no. 3, pp. 177–183, 2016.
[16] A. Khan, S. Gould, and M. Salzmann, "Segmentation of developing human embryo in time-lapse microscopy," in Proc. 13th Int'l Symposium on Biomedical Imaging (ISBI), Prague, Czech Republic, April 2016, pp. 930–934.
[17] N. H. Ng, J. McAuley, J. A. Gingold, N. Desai, and Z. C. Lipton, "Predicting embryo morphokinetics in videos with late fusion nets & dynamic decoders," May 2018. [Online]. Available: https://openreview.net/forum?id=By1QAYkvz
[18] A. Khan, S. Gould, and M. Salzmann, "Deep convolutional neural networks for human embryonic cell counting," in Proc. 14th European Conf. on Computer Vision (ECCV), Amsterdam, The Netherlands, October 2016, pp. 339–348.
[19] M. D. VerMilyea, L. Tan, J. T. Anthony, J. Conaghan, K. Ivani, M. Gvakharia, R. Boostanfar, V. L. Baker, V. Suraj, A. A. Chen et al., "Computer-automated time-lapse analysis results correlate with embryo implantation and clinical pregnancy: a blinded, multi-centre study," Reproductive Biomedicine Online, vol. 29, no. 6, pp. 729–736, Dec. 2014.
[20] M. P. Diamond, V. Suraj, E. J. Behnke, X. Yang, M. J. Angle, J. C. Lambe-Steinmiller, R. Watterson, K. A. Wirka, A. A. Chen, and S. Shen, "Using the Eeva Test adjunctively to traditional day 3 morphology is informative for consistent embryo assessment within a panel of embryologists with diverse experience," Journal of Assisted Reproduction and Genetics, vol. 32, no. 1, pp. 61–68, Jan. 2015.
[21] B. Aparicio-Ruiz, N. Basile, S. P. Albalá, F. Bronet, J. Remohí, and M. Meseguer, "Automatic time-lapse instrument is superior to single-point morphology observation for selecting viable embryos: retrospective study in oocyte donation," Fertility and Sterility, vol. 106, no. 6, pp. 1379–1385, Nov. 2016.
[22] D. C. Kieslinger, S. De Gheselle, C. B. Lambalk, P. De Sutter, E. H. Kostelijk, J. W. R. Twisk, J. van Rijswijk, E. Van den Abbeel, and C. G. Vergouw, "Embryo selection using time-lapse analysis (Early Embryo Viability Assessment) in conjunction with standard morphology: a prospective two-center pilot study," Human Reproduction, vol. 31, no. 11, pp. 2450–2457, Nov. 2016.
[23] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. 25th Int'l Conf. on Machine Learning (ICML), Helsinki, Finland, July 2008, pp. 160–167.
[24] L. Deng, G. Hinton, and B. Kingsbury, "New types of deep neural network learning for speech recognition and related applications: an overview," in Proc. 38th Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 8599–8603.
[25] R. Girshick, "Fast R-CNN," in
Proc. Int’l Conf. on Computer Vision(ICCV) , Santiago, Chile, December 2015.[26] S. Ruder, “An overview of multi-task learning in deep neuralnetworks,”
CoRR , vol. abs/1706.05098, 2017. [Online]. Available:http://arxiv.org/abs/1706.05098[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning forimage recognition,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) , Las Vegas, NV, June 2016, pp. 770–778.[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet:A large-scale hierarchical image database,” in
Proc. IEEE Conf. onComputer Vision and Pattern Recognition (CVPR) , Miami Beach, FL,June 2009, pp. 248–255.[29] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of101 human actions classes from videos in the wild,”
CoRR , vol.abs/1212.0402, 2012. [Online]. Available: http://arxiv.org/abs/1212.0402[30] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, andL. Fei-Fei, “Large-scale video classification with convolutional neuralnetworks,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) . Columbus, OH: IEEE, June 2014, pp. 1725–1732.[31] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles,“Activitynet: A large-scale video benchmark for human activity un-derstanding,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) . Boston, MA: IEEE, June 2015, pp. 961–970.[32] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga,and G. Toderici, “Beyond short snippets: Deep networks for videoclassification,” in
Proc. IEEE Conf. on Computer Vision and PatternRecognition (CVPR) . Boston, MA: IEEE, June 2015, pp. 4694–4702.[33] L. Fei-Fei and P. Perona, “A bayesian hierarchical model for learningnatural scene categories,” in
Proc. IEEE Conf. on Computer Vision andPattern Recognition (CVPR) , vol. 2, San Diego, CA, June 2005, pp.524–531.[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classificationwith deep convolutional neural networks,” in
Proc. Advances in NeuralInformation Processing Systems , Lake Tahoe, NV, December 2012, pp.1097–1105.[35] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, “Multi-task neural networksfor qsar predictions,” arXiv preprint arXiv:1406.1231 , 2014.[36] Z.-H. Zhou,
Ensemble methods: foundations and algorithms . BocaRaton, FL: CRC press, 2012.[37] Y. Liu, V. Chapple, P. Roberts, and P. Matson, “Prevalence, consequence,and significance of reverse cleavage by human embryos viewed with theuse of the embryoscope time-lapse video system,”
Fertility Sterility , vol.102, no. 5, pp. 1295–1300, 2014.[38] D. P. Kingma and J. Ba, “Adam: A method for stochasticoptimization,”
CoRR , vol. abs/1412.6980, 2014. [Online]. Available:https://arxiv.org/abs/1412.6980[39] B. Kalatova, R. Jesenska, D. Hlinka, and M. Dudas, “Tripolar mitosisin human cells and embryos: occurrence, pathophysiology and medicalimplications,”