Bounding boxes for weakly supervised segmentation: Global constraints get close to full supervision
Hoel Kervadec, Jose Dolz, Shanshan Wang, Eric Granger, Ismail Ben Ayed
Proceedings of Machine Learning Research 1–16, 2020. Full Paper – MIDL 2020
Hoel Kervadec ([email protected]), ÉTS Montréal
Jose Dolz, ÉTS Montréal
Shanshan Wang, Shenzhen Institutes of Advanced Technology
Eric Granger, ÉTS Montréal
Ismail Ben Ayed, ÉTS Montréal
Abstract
We propose a novel weakly supervised learning paradigm for segmentation, based on several global constraints derived from box annotations. In particular, we leverage a classical tightness prior in a deep learning setting by imposing a set of constraints on the network outputs. Such a powerful topological prior prevents solutions from excessive shrinking by enforcing that any horizontal or vertical line within the bounding box contains at least one pixel of the foreground region. Furthermore, we integrate our deep tightness prior with a global background emptiness constraint, guiding training with information outside the bounding box. We demonstrate experimentally that such a global constraint is much more powerful than standard cross-entropy for the background class. Our optimization problem is challenging, as it takes the form of a large set of inequality constraints on the outputs of deep networks. We solve it with a sequence of unconstrained losses based on a recent, powerful extension of the log-barrier method, which is well known in the context of interior-point methods. This accommodates standard stochastic gradient descent (SGD) for training deep networks, while avoiding computationally expensive and unstable Lagrangian dual steps and projections. Extensive experiments over two different public datasets and applications (prostate and brain lesions) demonstrate that the synergy between our global tightness and emptiness priors yields very competitive performances, approaching full supervision and significantly outperforming DeepCut. Furthermore, our approach removes the need for computationally expensive proposal generation. Our code is shared anonymously.
Keywords: CNN, image segmentation, weak supervision, bounding boxes, global constraints, Lagrangian optimization, log-barriers
1. Introduction
Semantic segmentation is of paramount importance in the understanding and interpretation of medical images, as it plays a crucial role in the diagnosis, treatment and follow-up of many diseases. Even though the problem has been widely studied during the last decades, we have witnessed tremendous progress in recent years with the advent of deep convolutional neural networks (CNNs) (Litjens et al., 2017; Ronneberger et al., 2015; Rajchl et al., 2016; Dolz et al., 2018). Nevertheless, a main limitation of these models is the need for large annotated datasets, which hampers the performance and limits the scalability of deep CNNs in the medical domain, where pixel-wise annotations are prohibitively time-consuming. Weakly supervised learning has gained popularity as a way to alleviate the need for large amounts of pixel-labeled images. Weak labels can come in the form of image tags (Pathak et al., 2015), scribbles (Lin et al., 2016), points (Bearman et al., 2016), bounding boxes (Dai et al., 2015; Khoreva et al., 2017; Hsu et al., 2019) or global constraints (Jia et al., 2017; Kervadec et al., 2019b). A common paradigm in the weakly supervised setting is to employ weak annotations to generate pseudo-masks, or proposals. These proposals are "fake" labels, which are generated iteratively to refine the parameters of deep CNNs, thereby mimicking full supervision. Unfortunately, as discussed in several recent works (Tang et al., 2018; Kervadec et al., 2019b), proposals contain errors, which may be propagated during training, severely affecting segmentation performance. Furthermore, iterative proposal generation significantly increases the computational load of training. More recently, several studies investigated global loss functions, e.g., in the form of constraints on the target-region size (Pathak et al., 2015; Jia et al., 2017; Kervadec et al., 2019b; Bateson et al., 2019).
This can be done by constraining the softmax outputs of deep networks, leveraging unlabeled data with a single loss function and removing the need for iterative proposal generation. Nevertheless, despite the good performances achieved by these works in certain practical scenarios, their applicability might be limited by the assumptions underlying such global constraints, e.g., precise knowledge of the target-region size.

Among the different weak supervision approaches, bounding box annotations are an appealing alternative due to their simplicity and low annotation cost. In practice, bounding boxes can be defined with two corner coordinates, allowing fast placement and light storage. Furthermore, they provide localization awareness, which spatially constrains the problem. This form of supervision has indeed become popular in computer vision to initialize shallow segmentation models, whose outputs are later used to train deep networks, as in full supervision (Dai et al., 2015; Papandreou et al., 2015; Khoreva et al., 2017; Pu et al., 2018). A naive use of bounding boxes amounts to generating pseudo-labels by simply considering each pixel within the bounding box as a positive sample for the respective class (Papandreou et al., 2015; Rajchl et al., 2016). However, in a realistic scenario, a bounding box also contains background pixels. To account for this, more advanced foreground extraction methods are employed. In particular, the very popular GrabCut (Rother et al., 2004) is a standard choice to generate segmentation masks from bounding boxes, even though alternative approaches such as Multiscale Combinatorial Grouping (MCG) (Pont-Tuset et al., 2017) were recently used for the same purpose (Dai et al., 2015).

Contributions:
We propose a novel weakly supervised learning paradigm based on several global constraints derived from box annotations. First, we leverage the classical tightness prior of (Lempitsky et al., 2009) in a deep learning setting, and re-formulate the problem by imposing a set of constraints on the network outputs. Such a powerful topological prior prevents solutions from excessive shrinking by enforcing that any horizontal or vertical line within the bounding box contains at least one pixel of the foreground region. Furthermore, we integrate our deep tightness prior with a global background emptiness constraint, guiding training with information outside the bounding box. As we will see in our experiments, such a global constraint is much more powerful than standard cross-entropy for the background class. Our optimization problem is challenging, as it takes the form of a large set of inequality constraints, which are difficult to handle in the context of deep networks. We solve it with a sequence of unconstrained losses based on a recent powerful extension of the log-barrier method (Kervadec et al., 2019c), which is well known in the context of interior-point methods. This accommodates standard stochastic gradient descent (SGD) for training deep networks, while avoiding computationally expensive and unstable Lagrangian dual steps and projections. Extensive experiments over two different public datasets and applications (prostate and brain lesions) demonstrate that the synergy between our global tightness and emptiness priors yields very competitive performances, approaching full supervision and significantly outperforming DeepCut (Rajchl et al., 2016). Furthermore, our approach removes the need for computationally expensive proposal generation.

Figure 1: Example of weak labels on two different tasks: prostate segmentation and stroke lesion segmentation.
2. Related works
Weakly supervised medical image segmentation.
Despite the increasing interest in weakly supervised segmentation models in the computer vision community, the literature on these models in medical imaging remains scarce. The authors of (Qu et al., 2019) leverage point annotations in the context of histopathology images. From labeled points, they derive additional information in the form of a Voronoi diagram, so as to generate coarse labels for nuclei segmentation. Their objective function integrates the cross-entropy with coarse labels and the conditional random field (CRF) loss in (Tang et al., 2018). Similarly to previous works in computer vision, (Nguyen et al., 2019) used class activation maps (CAMs) derived from the network as pseudo-masks to train a CNN in a fully supervised manner. To constrain the location of the target, they employed an Active Shape Model (ASM) as prior information. Nevertheless, this method presents two limitations. First, as in similar works, inaccuracies of the pseudo-masks may lead to sub-optimal performances. Second, the ASM is tailored to this specific application, as its generation for novel classes depends on the segmentation masks. More recently, (Wu et al., 2019) proposed to refine the generated CAMs with attention, with the goal of producing more reliable pseudo-masks. Alternatively, other recent methods investigated how to constrain network predictions with global statistics, for instance, the size of the target region (Jia et al., 2017; Kervadec et al., 2019a,b; Bateson et al., 2019). This type of prior information can be imposed as an equality (Jia et al., 2017) or inequality (Kervadec et al., 2019b; Bateson et al., 2019) constraint. Although such constrained-CNN predictions achieved outstanding performances in a few weakly supervised learning scenarios, their applicability remains limited to certain assumptions.

Bounding box supervision.
Most CNN-based methods under the umbrella of bounding-box supervision fall under the category of proposal-based methods. In these approaches, the bounding box annotations are exploited to obtain initial pseudo-masks, or proposals, typically with a shallow segmentation method, e.g., the very popular GrabCut method (Rother et al., 2004). Then, training typically follows an iterative scheme, which involves two steps: one updating the network parameters and the other adjusting the pseudo-labels (Dai et al., 2015; Papandreou et al., 2015; Khoreva et al., 2017). To further refine the pseudo-labels generated at each iteration, several works (Rajchl et al., 2016; Song et al., 2019) used the popular DenseCRF (Krähenbühl and Koltun, 2011) or other heuristics. While this might be very effective on some datasets, DenseCRF typically assumes that all the training images have consistent and strong contrast between the foreground and background regions. Finding the optimal DenseCRF parameters is difficult when the contrast of the object edges varies significantly within the same dataset.¹ Moreover, the ensuing training is not end-to-end, as it still relies on DenseCRF post-processing, even at inference time. Another drawback of these bounding-box based learning approaches, which is also shared by other proposal-based methods in general, is that early mistakes re-enforce themselves during training. For example, in DeepCut (Rajchl et al., 2016), while the pseudo-labels cannot grow beyond the bounding box, the inner foreground may gradually disappear. More recently, Hsu et al. (Hsu et al., 2019) employed a Multiple Instance Learning (MIL) framework to impose a tightness prior in the context of instance segmentation of natural images. Focusing on instance segmentation, the method used bounding boxes generated by R-CNN. In this MIL framework, positive bags are composed of box lines, while negative bags correspond to lines outside the box. The MIL loss function is defined so as to push the maximum predicted probability within each positive bag to 1, and the maximum predicted probability within each negative bag to 0. This MIL loss is integrated with a GridCRF loss (Marin et al., 2019) to ensure consistency between neighboring pixels. As in many other works, the final predictions are refined with DenseCRF (Krähenbühl and Koltun, 2011).

1. Several hyper-parameters control the edge sensitivity of the popular DenseCRF (Krähenbühl and Koltun, 2011), mostly θ_β and θ_γ, but also ω^(1), ω^(2) and θ_α to some extent.

Figure 2: (a) Illustration of the tightness prior: any vertical (red) or horizontal (blue) line will cross at least one (1) pixel of the camel. (b) This can be generalized, where segments of width w cross at least w pixels of the camel.
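For contrast with the constraint-based formulation proposed in this paper, the max-based MIL loss described above can be sketched in a few lines. This NumPy version is our own paraphrase of Hsu et al.'s formulation, not their code; the function name and bag representation (arrays of flat pixel indices) are our own conventions:

```python
import numpy as np

def mil_bag_loss(probs, pos_bags, neg_bags):
    """Sketch of a max-based MIL loss in the spirit of Hsu et al. (2019).

    probs: (H, W) predicted foreground probabilities.
    pos_bags: list of flat-index arrays (lines crossing the box); the max
              prediction in each bag is pushed toward 1.
    neg_bags: list of flat-index arrays (lines outside the box); the max
              prediction in each bag is pushed toward 0.
    """
    eps = 1e-8
    flat = probs.ravel()
    loss = 0.0
    for bag in pos_bags:
        loss += -np.log(flat[bag].max() + eps)        # max prob -> 1
    for bag in neg_bags:
        loss += -np.log(1.0 - flat[bag].max() + eps)  # max prob -> 0
    return loss / (len(pos_bags) + len(neg_bags))
```

Note that only the maximum of each bag receives a gradient, whereas the constraint formulation of Section 3 distributes the signal over whole segments.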
3. Method
Let X : Ω ⊂ ℝ² → ℝ denote a training image, and Ω its corresponding spatial domain. In a standard fully supervised setting, we can denote the training set as D = {(X, Y)}, where X ∈ ℝ^Ω are input images and Y ∈ {0, 1}^Ω their corresponding pixel-wise labels. In the context of this work, however, labels Y take the form of bounding boxes (as shown in Figure 1, third column). Thus, we use Ω_O and Ω_I to denote the areas outside and inside the bounding box, respectively, with Ω_O ∪ Ω_I = Ω. Let s_θ ∈ [0, 1]^Ω denote the probabilities predicted by the CNN, where 0 and 1 represent background and foreground, respectively. In a fully supervised setting, one would typically optimize the standard cross-entropy loss:

min_θ L_CE(θ) := − ∑_{p ∈ Ω} [ y_p log(s_θ(p)) + (1 − y_p) log(1 − s_θ(p)) ].

As shown in Figure 1, we know with certainty that all pixels p outside a given bounding box (Ω_O) belong to the background. A straightforward solution would be to employ the cross-entropy, but only partially, for each of those pixels outside the bounding box:

L_MCE := − ∑_{p ∈ Ω_O} log(1 − s_θ(p)).

Alternatively, notice that the size of the predicted foreground², when computed over the background pixels (Ω_O), should be equal to zero. This gives the following global constraint for our optimization problem, which enforces that the background region is empty:

∑_{p ∈ Ω_O} s_θ(p) ≤ 0.   (1)

We will refer to this constraint as the emptiness constraint, L_EMP. L_O will denote either L_MCE or L_EMP.

Uncertainty inside the box.
While bounding box annotations provide cues about the spatial location of the target regions, pixel-wise information inside the box remains uncertain. However, the bounding box can be further exploited to impose a powerful topological prior, referred to as the tightness prior (Lempitsky et al., 2009). This global prior assumes that the target region should be sufficiently close to each of the sides of the bounding box. Therefore, we can expect that each horizontal or vertical line will cross at least one pixel of the target region (as illustrated in Figure 2), for any region shape. Furthermore, we can regroup the lines into segments of width w, each containing w lines. In this case, we can assume that at least w pixels of the object will be crossed by each segment. Formally, we can write this as a set of inequality constraints:

∑_{p ∈ s_l} y_p ≥ w,  ∀ s_l ∈ S_L,   (2)

where S_L := {s_l} is the set of segments parallel to the sides of the bounding boxes. This can easily be translated into inequality constraints on the outputs of the CNN, where the sum of the softmax probabilities for each segment should be greater than or equal to its width. The set of segments S_L can be efficiently pre-computed; only the masked softmax sum is required during training.

The first two parts of the loss are biased toward opposed, trivial solutions: the trivial solution for L_O is to predict the whole image as background, while the easiest way to satisfy the tightness constraints (2) is to predict everything as foreground. But there is more information that we can exploit from the boxes: their total size gives an upper bound on the object size. We can also assume that a small fraction ε of the box belongs to the target region, which yields a lower bound. This takes the form of a region-size constraint similar to (Kervadec et al., 2019b):

min_θ L_1(θ) + ... + L_n(θ)   (3)
s.t.  ε|Ω_I| ≤ ∑_{p ∈ Ω} s_θ(p) ≤ |Ω_I|.
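The pre-computation of the segment set S_L, and the corresponding check of constraints (2), can be sketched as follows. This is a minimal NumPy illustration under our own conventions (the paper's actual code may differ): boxes are given as (y0, y1, x0, x1) with exclusive upper bounds, and the effective width of a band is taken as the smaller of its two dimensions, a simplification that handles boundary bands narrower than w as long as the box is longer than w in both directions:

```python
import numpy as np

def box_segments(box, w):
    """Pre-compute the segments S_L of width w inside a bounding box:
    horizontal and vertical bands of up to w consecutive lines."""
    y0, y1, x0, x1 = box
    segs = []
    for y in range(y0, y1, w):  # horizontal bands
        segs.append((slice(y, min(y + w, y1)), slice(x0, x1)))
    for x in range(x0, x1, w):  # vertical bands
        segs.append((slice(y0, y1), slice(x, min(x + w, x1))))
    return segs

def tightness_slacks(s, segs):
    """Constraint (2) requires the probability mass in each segment to be
    at least its width; returns width - sum for every segment (a positive
    value means the constraint is violated)."""
    return [min(s[rs, cs].shape) - s[rs, cs].sum() for rs, cs in segs]
```

During training, only the masked sums over these pre-computed slices are needed, which keeps the per-iteration cost low.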
2. Here we refer to the size as the sum of the softmax probabilities, as it is easy to compute and differentiable. Therefore, it accommodates standard stochastic gradient descent.

Optimizing L_O with the constraints from Sections 3.2 and 3.3 gives the following constrained optimization problem:

min_θ L_O(θ)   (4)
s.t.  ∑_{p ∈ s_l} s_θ(p) ≥ w,  ∀ s_l ∈ S_L,
      ε|Ω_I| ≤ ∑_{p ∈ Ω} s_θ(p) ≤ |Ω_I|.

This formulation involves a large number of competing constraints. Recent optimization works on constrained CNNs (Kervadec et al., 2019c) suggest that, in the case of multiple competing constraints, log-barrier extensions provide approximations of Lagrangian optimization in the form of sequences of unconstrained losses, which completely removes the expensive and unstable primal-dual steps in the context of deep networks, handling the multiple constraints fully within SGD. Therefore, log-barriers can accommodate the interplay between multiple competing constraints, unlike naive penalty-based methods. These desirable properties are consistent with well-established interior-point and log-barrier methods in convex optimization (Boyd and Vandenberghe, 2004). For an inequality constraint of the form z ≤
0, the log-barrier extension can be defined as follows:

ψ̃_t(z) = { −(1/t) log(−z)                  if z ≤ −1/t²,
          { t z − (1/t) log(1/t²) + 1/t     otherwise,       (5)

where t is a parameter that raises the barrier over time (i.e., during training). The main difference with a penalty (such as max(0, z), used by (Kervadec et al., 2019b)) is that (5) acts as a barrier even when the constraint is satisfied (z ≤ 0). Using the log-barrier extension, we obtain the final unconstrained optimization problem, which can be optimized with standard SGD:

min_θ  L_O(θ) + λ ∑_{s_l ∈ S_L} ψ̃_t( w − ∑_{p ∈ s_l} s_θ(p) )
       + ψ̃_t( ε|Ω_I| − ∑_{p ∈ Ω} s_θ(p) ) + ψ̃_t( ∑_{p ∈ Ω} s_θ(p) − |Ω_I| ).   (6)

λ is a real number balancing the tightness prior with respect to the other parts of the loss. Notice that all log-barrier extensions ψ̃_t use the same t, with a common scheduling strategy for t. This limits the number of hyper-parameters and simplifies the model.
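Putting the pieces together, the barrier (5) and the unconstrained objective (6) can be sketched in a few lines of NumPy. This is our own illustrative reimplementation (function names and argument conventions are ours, not the paper's code), computing forward loss values only; in the paper's PyTorch setting the same expressions would be written on tensors so that autograd handles the gradients:

```python
import numpy as np

def log_barrier_ext(z, t):
    """Log-barrier extension (5): a differentiable penalty for z <= 0 that
    still acts as a (vanishing) barrier once the constraint is satisfied.
    The two branches meet continuously at z = -1/t^2."""
    if z <= -1.0 / t ** 2:
        return -np.log(-z) / t
    return t * z - np.log(1.0 / t ** 2) / t + 1.0 / t

def total_loss(s, outside, segments, box_area, w, t, lam, eps_frac=0.01):
    """Sketch of loss (6): emptiness term L_O plus log-barrier terms for the
    tightness constraints (2) and the box-size bounds (3). `outside` holds
    the flat indices of Omega_O, `segments` a list of (row, col) slices;
    eps_frac plays the role of epsilon."""
    # L_EMP: barrier on constraint (1), sum over Omega_O <= 0
    loss = log_barrier_ext(s.ravel()[outside].sum(), t)
    # tightness: each segment must contain at least w foreground pixels
    for rs, cs in segments:
        loss += lam * log_barrier_ext(w - s[rs, cs].sum(), t)
    # box size: eps|Omega_I| <= sum_p s(p) <= |Omega_I|
    size = s.sum()
    loss += log_barrier_ext(eps_frac * box_area - size, t)
    loss += log_barrier_ext(size - box_area, t)
    return loss
```

Unlike a max(0, z) penalty, the first branch keeps pushing strictly feasible solutions away from the constraint boundary, and raising t over training progressively hardens all barriers at once.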
4. Experiments
We evaluate our method on two different tasks: prostate segmentation in MR-T2 and brain lesion segmentation in MR-T1. Of the two, lesion segmentation is particularly challenging, due to the heterogeneity of the lesions and the high imbalance between the numbers of foreground and background pixels.
Prostate segmentation on MR-T2.
The first dataset that we use was made available at the MICCAI 2012 prostate MR segmentation challenge (Litjens et al., 2014). It contains the transversal T2-weighted MR images of 50 patients, acquired at different centers with multiple MRI vendors and different scanning protocols. The images include patients with benign diseases, as well as with prostate cancer. Image resolution ranges from 15 × … × … to … × … × 512 voxels, with a spacing ranging from 2 × … × 0.27 to 4 × … × … mm. We employed 40 patients for training and 10 for validation.

Brain lesion segmentation on MR-T1.
We also evaluated the proposed method on the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset (Liew et al., 2018), an open-source dataset of stroke lesions. It contains 229 T1-weighted MR images, coming from different cohorts and different scanners. All the images have a resolution of 197 × 233 × 189 voxels, with an isotropic spacing of 1 × 1 × 1 mm.

Evaluation.
To compare quantitatively the performances of the different methods, we employed the Dice similarity coefficient (DSC), a standard performance metric in medical image segmentation. In addition to the baseline models, we also perform comprehensive comparisons with DeepCut (Rajchl et al., 2016), whose learning setting is also based on bounding box annotations.
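For reference, the Dice similarity coefficient between two binary masks A and B is DSC = 2|A ∩ B| / (|A| + |B|). A minimal NumPy version (our own helper, not the paper's evaluation code):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Dice similarity coefficient between two binary masks:
    2 * |intersection| / (|pred| + |target|), with eps guarding
    the degenerate case of two empty masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```

Identical masks give a DSC of 1, disjoint masks a DSC of (essentially) 0.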
To evaluate our method under different settings, we experimented with a different network architecture for each task. We employ a residual version of the well-known UNet (Ronneberger et al., 2015) to segment the prostate, whereas ENet (Paszke et al., 2016) was the backbone architecture in the stroke lesion segmentation experiments. The models were trained with ADAM (Kingma and Ba, 2014), an initial learning rate of 5 × 10^(−…), and a batch size of 4 for the prostate and 32 for stroke lesions. While we employed offline data augmentation (i.e., mirroring, flipping, rotation) to augment the PROMISE12 dataset, no augmentation was performed on the ATLAS dataset. The reason for this is the low number of images in the PROMISE12 dataset compared to ATLAS. The log-barrier parameters were set following (Kervadec et al., 2019c), and were shared across all the log-barrier instances. We set λ (from Eq. (6)) to 0.… We found that changes in the width w of the segments for the tightness constraints did not have a significant impact on the results. Therefore, w was set to 5 in all the experiments. All methods are implemented in PyTorch, with the exception of the DenseCRF (Krähenbühl and Koltun, 2011), which uses the Python wrapper PyDenseCRF. To speed up the proposal generation of DeepCut, the CRF inference is parallelized using the standard Python multiprocessing module, with careful use of SharedArrays to avoid unnecessary and costly copies of arrays between the processes. The code is available online.

https://promise12.grand-challenge.org

While the main experiments are performed on tight boxes (i.e., the gap between the target regions and the bounding-box sides is not significant), we perform additional experiments where a margin m of 10 pixels is added on each side. This enables us to evaluate the robustness of each model to imprecise bounding-box placement.
Robustness to placement is of significant importance, since perfect annotation of all bounding boxes might be unrealistic. Furthermore, robustness to imprecision also alleviates the problem of annotator subjectivity.
5. Results
The results of the segmentation experiments are reported in Table 1. We can observe that the proposed method consistently outperforms DeepCut (Rajchl et al., 2016) across the two datasets. The differences in performance range from 1% on the PROMISE12 dataset to 10% in the case of ATLAS. Furthermore, the results obtained with the two loss functions designed to deal with the background constraints indicate that the proposed global emptiness constraint is more effective in our setting. We hypothesize this is due to several factors. First, employing the emptiness constraint on background pixels results in all the constraint losses being on the same scale, which has very nice properties from an optimization perspective. Second, the imbalanced nature of the segmentation task in the ATLAS dataset makes the use of the cross-entropy over all the background pixels a suboptimal alternative, encouraging empty segmentations. Finally, we can observe that the proposed method achieves performances comparable to full supervision, particularly in the task of stroke lesion segmentation. Using only a subset of the losses does not give optimal results, showing their synergy.

Figure 3 depicts the validation results over training for the different models. Even though DeepCut achieves similar results to the proposed approach on the PROMISE12 dataset, we can see that it is very unstable during training, as is generally the case for proposal-based methods. Additionally, its performance degrades over time. This effect is even more noticeable on the ATLAS dataset, where it collapses to empty segmentations after 25 epochs. This behaviour is a clear example of the instability of proposal-based methods, since we observed similar findings on the training images. More details about this issue are provided in Appendix A.
https://github.com/lucasb-eyer/pydensecrf
https://github.com/LIVIAETS/boxes_tightness_prior

Method                                      PROMISE12 DSC    ATLAS DSC
DeepCut (Rajchl et al., 2016)               0.827 (0.085)    0.375 (0.246)
Tightness prior w/ emptiness constraint     NA               0.161 (0.145)
Tightness prior + box size                  0.620 (0.100)    0.146 (0.134)
  w/ masked cross-entropy (L_MCE)           0.774 (0.045)    0.159 (0.203)
  w/ emptiness constraint (L_EMP)
Full supervision (cross-entropy)            0.901 (0.025)    0.489 (0.294)
Table 1: Results on the validation set for the proposed method and the different baselines on both the PROMISE12 and ATLAS datasets. The best results in the weakly supervised setting are highlighted in bold. NA means that the network did not learn to segment anything meaningful.

Figure 3: Evolution of the validation DSC values over time for both PROMISE12 and ATLAS, and for the different methods.

Qualitative segmentation results are depicted in Fig. 4. We can observe how the proposed method with masked CE achieves satisfactory visual results on the prostate (first two rows), but fails to properly segment stroke lesions (last two rows). In contrast, when background segmentations are optimized with the proposed emptiness constraint, we observe how the segmentation results approach full-supervision performance on both datasets. This is in line with the results reported in Table 1. On the other hand, DeepCut succeeds in segmenting the prostate but is not able to obtain satisfactory segmentations for brain lesions. Looking closer at these segmentations, we can observe that they do not reliably follow the target boundaries. This can be explained by the fact that DenseCRF assumes strong contrast between foreground and background regions, which is not the case in many of these images. Furthermore, the results provided by DenseCRF are sensitive to its hyper-parameters θ_β, θ_γ, ω^(1) and ω^(2), which control the edge sensitivity. Since the set of hyper-parameters was fixed across all the images in the whole dataset, an optimal set of hyper-parameters for a given image may perform sub-optimally for another image.

Figure 4: Predicted segmentations on the validation set for the two tasks.
Results of the sensitivity study on the box precision are reported in Table 2. While all methods were able to reach similar performances when the bounding box annotation is nearly perfect (despite stability issues for some methods), their performance degrades as the margin between the region of interest and the borders of the bounding box increases. Specifically, if a margin m of 10 pixels is added on each side, the performance of the proposed method only drops by 5% in terms of DSC, whereas DeepCut performance decreases by 14%.

Method                         Margin=0         Margin=10
DeepCut                        0.827 (0.085)    0.684 (0.069)
Ours (emptiness constraint)
Table 2: Sensitivity study w.r.t. the box margins on the PROMISE12 dataset. Best results highlighted in bold.

Finally, the computational cost of the different methods is discussed in more detail in Appendix B.
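The margin experiment above only requires dilating each annotated box before training. A trivial helper (our own, with (y0, y1, x0, x1) exclusive-bound boxes, not part of the paper's code) would be:

```python
def dilate_box(box, m, shape):
    """Add a margin of m pixels on each side of a (y0, y1, x0, x1) box,
    clipped to the image shape, to simulate imprecise box placement."""
    y0, y1, x0, x1 = box
    h, w = shape
    return (max(0, y0 - m), min(h, y1 + m), max(0, x0 - m), min(w, x1 + m))
```

All loss terms (emptiness, tightness segments, size bounds) are then derived from the dilated box instead of the tight one.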
6. Conclusion
In this paper, we proposed a novel weakly supervised learning paradigm based on several global constraints, which are derived from bounding box annotations. First, the classical tightness prior is integrated into a deep learning framework by reformulating the problem as a set of constraints on the outputs of the network. Second, a global background emptiness constraint is employed to enforce empty segmentations outside the bounding box, which is demonstrated to be more powerful than standard cross-entropy for handling the background class. Integrating such a large set of inequality constraints into deep networks represents a challenging optimization problem. We solve it with a sequence of unconstrained losses, which are based on a recent extension of the log-barrier method. Since this formulation accommodates standard stochastic gradient descent, it can easily be used to train deep networks. We performed comprehensive experiments on two public benchmarks for the challenging tasks of prostate and brain stroke lesion segmentation, and demonstrated that the proposed approach outperforms state-of-the-art approaches with bounding-box supervision. Furthermore, quantitative and qualitative results indicate that the proposed approach has the potential to close the gap between bounding-box annotations and full supervision in semantic segmentation tasks. The sensitivity study showed that the proposed method is resilient to imprecision in the box tightness. Future works will investigate the use of 3D bounding boxes as annotations, which will make the corresponding 2D boxes looser. Such a workflow could further speed up the annotation process. The proposed framework could also be extended to 3D CNNs, by generating segments for the tightness prior along the three axes. Furthermore, our approach is also compatible with multi-class segmentation problems, even when bounding boxes of different classes overlap.
Acknowledgments
This work is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), via its Discovery Grant program. We also thank NVIDIA for the GPU donation.
References
Mathilde Bateson, Hoel Kervadec, Jose Dolz, Hervé Lombaert, and Ismail Ben Ayed. Constrained domain adaptation for segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 326–334. Springer, 2019.

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Fei-Fei Li. What's the point: Semantic segmentation with point supervision. In European Conference on Computer Vision (ECCV), pages 549–565, 2016.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In IEEE International Conference on Computer Vision (ICCV), pages 1635–1643, 2015.

Jose Dolz, Christian Desrosiers, and Ismail Ben Ayed. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. NeuroImage, 170:456–470, 2018.

Cheng-Chun Hsu, Kuang-Jui Hsu, Chung-Chi Tsai, Yen-Yu Lin, and Yung-Yu Chuang. Weakly supervised instance segmentation using the bounding box tightness prior. In Advances in Neural Information Processing Systems, pages 6582–6593, 2019.

Zhipeng Jia, Xingyi Huang, I Eric, Chao Chang, and Yan Xu. Constrained deep weak supervision for histopathology image segmentation. IEEE Transactions on Medical Imaging, 36(11):2376–2388, 2017.

Hoel Kervadec, Jose Dolz, Eric Granger, and Ismail Ben Ayed. Curriculum semi-supervised segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2019a.

Hoel Kervadec, Jose Dolz, Meng Tang, Eric Granger, Yuri Boykov, and Ismail Ben Ayed. Constrained-CNN losses for weakly supervised segmentation. Medical Image Analysis, 2019b.

Hoel Kervadec, Jose Dolz, Jing Yuan, Christian Desrosiers, Eric Granger, and Ismail Ben Ayed. Constrained deep networks: Lagrangian optimization via log-barrier extensions. arXiv preprint arXiv:1904.04205, 2019c.

Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pages 109–117, 2011.

Victor Lempitsky, Pushmeet Kohli, Carsten Rother, and Toby Sharp. Image segmentation with a bounding box prior. In IEEE International Conference on Computer Vision (ICCV), pages 277–284. IEEE, 2009.

Sook-Lei Liew, Julia M Anglin, Nick W Banks, Matt Sondag, Kaori L Ito, Hosung Kim, Jennifer Chan, Joyce Ito, Connie Jung, Nima Khoshab, et al. A large, open source dataset of stroke anatomical brain images and manual lesion segmentations. Scientific Data, 5:180011, 2018.

Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3159–3167, 2016.

Geert Litjens, Robert Toth, Wendy van de Ven, Caroline Hoeks, Sjoerd Kerkstra, Bram van Ginneken, Graham Vincent, Gwenael Guillard, Neil Birbeck, Jindang Zhang, et al. Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical Image Analysis, 18(2):359–373, 2014.

Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.

D. Marin, M. Tang, I. Ben Ayed, and Y. Boykov. Beyond gradient descent for regularized segmentation losses. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10187–10196, 2019.

Huu-Giao Nguyen, Alessia Pica, Francesco La Rosa, Jan Hrbacek, Damien C Weber, Ann Schalenbourg, Raphael Sznitman, and Meritxell Bach Cuadra. A novel segmentation framework for uveal melanoma based on magnetic resonance imaging and class activation maps. In International Conference on Medical Imaging with Deep Learning, 2019.

George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1742–1750, 2015.

Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.

Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In IEEE International Conference on Computer Vision (ICCV), pages 1796–1804, 2015.

Jordi Pont-Tuset, Pablo Arbelaez, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation.
IEEE transactions on pattern analysis and machine intelligence , 39(1):128–140,2017.Mengyang Pu, Yaping Huang, Qingji Guan, and Qi Zou. GraphNet: Learning image pseudoannotations for weakly-supervised semantic segmentation. In , pages 483–491. ACM, 2018.Hui Qu, Pengxiang Wu, Qiaoying Huang, Jingru Yi, Gregory M Riedlinger, SubhajyotiDe, and Dimitris N Metaxas. Weakly supervised deep nuclei segmentation using pointsannotation in histopathology images. In
International Conference on Medical Imagingwith Deep Learning , pages 390–400, 2019.Martin Rajchl, Matthew CH Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A Rutherford, Joseph V Hajnal, Bern-hard Kainz, et al. Deepcut: Object segmentation from bounding box annotations using ounding boxes for weakly supervised segmentation convolutional neural networks. IEEE transactions on medical imaging , 36(2):674–683,2016.Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks forbiomedical image segmentation. In
International Conference on Medical image computingand computer-assisted intervention , pages 234–241. Springer, 2015.Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foregroundextraction using iterated graph cuts. In
ACM transactions on graphics (TOG) , volume 23,pages 309–314. ACM, 2004.Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Box-driven class-wise regionmasking and filling rate guided loss for weakly supervised semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages3136–3145, 2019.Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers,and Yuri Boykov. On regularized losses for weakly-supervised cnn segmentation. In
Proceedings of the European Conference on Computer Vision (ECCV) , pages 507–522,2018.Kai Wu, Bowen Du, Man Luo, Hongkai Wen, Yiran Shen, and Jianfeng Feng. Weaklysupervised brain lesion segmentation via attentional representation learning. In
Inter-national Conference on Medical Image Computing and Computer-Assisted Intervention ,pages 211–219. Springer, 2019. ervadec Dolz Wang Granger Ben Ayed Appendix A. DeepCut training instability
We investigated the pseudo-labels generated by DeepCut (shown in Figure 5); the main culprit is proposals that under-segment the object inside the box. At the next training step, this pushes the network to label the object as background. Such conflicting feedback (other proposals label similar-looking patches as foreground) makes the training unstable and slowly skews the network toward empty predictions. This in turn causes the next batch of proposals to be even smaller, until the network outputs an empty foreground for all the images.

Figure 5: Progression of the pseudo-labels from DeepCut: only a few such cases can make the training very unstable.
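This failure mode is essentially a geometric feedback loop. A deliberately crude toy model (the shrinkage ratio below is made up for illustration, not measured from DeepCut) shows how a constant under-segmentation factor at each self-training round drives the foreground area toward empty predictions:

```python
def proposal_shrinkage(initial_area=1.0, under_seg_ratio=0.85, rounds=20):
    """Toy caricature of the DeepCut feedback loop described above.

    Assumes each self-training round produces proposals covering only a
    fraction (under_seg_ratio) of the currently predicted foreground, and
    that the network then fits that smaller target. Both numbers are
    placeholders chosen for illustration.
    """
    areas = [initial_area]
    for _ in range(rounds):
        areas.append(areas[-1] * under_seg_ratio)
    return areas

areas = proposal_shrinkage()
# The foreground area decays geometrically; after 20 rounds it is
# a few percent of the original, i.e. effectively empty.
```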
Appendix B. Implementation and performances
Performances were measured on a machine equipped with an AMD Ryzen 1700X, 32 GB of RAM (frequency did not affect speed) and an NVIDIA Titan RTX. They are reported in Table 3. The settings and hyper-parameters are the same as described in Section 4.2.

Most of the extra time introduced by our model comes from the naive log-barrier implementation that we used: instead of leveraging if/else switching and code vectorization, we used a standard Python for loop over all constraints. This could be improved using the recent developments of PyTorch's JIT compiler. The width parameter of the segments also affects the overhead of our method: wider segments mean fewer of them, which in turn results in fewer constraints to handle.

Notice that implementing the DenseCRF post-processing in a parallel and efficient fashion requires a lot of software engineering that is uncommon in modern learning frameworks. While the DenseCRF implementation itself is highly efficient, it remains a single process that can handle only one image at a time. Parallelizing it should be easy in theory, but is actually not very efficient with Python's standard multiprocessing tools: all the arrays (containing either the image or the probabilities) are pickled and copied across processes. Those back-and-forth copies can add up quickly and slow down the processing substantially, on top of filling the computer's memory faster. The solution is to carefully use SharedArray, which holds the whole batch in a single object. The sub-processes then read and write only the subset of those SharedArrays corresponding to their assigned batch item.
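The vectorization mentioned above can be sketched as follows. This is a minimal NumPy sketch (not the paper's actual PyTorch code) of the log-barrier extension of Kervadec et al. (2019c), replacing the per-constraint Python loop with array-wide branching via np.where; the value of t is arbitrary here:

```python
import numpy as np

def log_barrier_extension(z, t=5.0):
    """Vectorized log-barrier extension over an array of constraint values z.

    A constraint is satisfied when z <= 0. For z <= -1/t**2 the standard
    log barrier -log(-z)/t applies; beyond that threshold it is extended
    linearly so the loss stays finite and differentiable when constraints
    are violated.
    """
    z = np.asarray(z, dtype=float)
    threshold = -1.0 / t**2
    # Clamp the argument of the log so the barrier branch never sees
    # non-negative values (np.where evaluates both branches eagerly).
    safe = np.where(z <= threshold, z, threshold)
    barrier = -np.log(-safe) / t
    linear = t * z - np.log(1.0 / t**2) / t + 1.0 / t
    return np.where(z <= threshold, barrier, linear)

def log_barrier_loop(z, t=5.0):
    """Naive per-constraint loop, as in the implementation discussed above."""
    out = []
    for zi in z:
        if zi <= -1.0 / t**2:
            out.append(-np.log(-zi) / t)
        else:
            out.append(t * zi - np.log(1.0 / t**2) / t + 1.0 / t)
    return np.array(out)
```

The two branches agree at the threshold z = -1/t², so the vectorized version is a drop-in replacement for the loop while letting NumPy (or PyTorch) process all constraints in one pass.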
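The shared-memory pattern described above can be sketched with the standard library's multiprocessing.shared_memory module, used here as a stand-in for the SharedArray package the paper mentions. The batch shape and the halving "post-processing" are placeholders, and the worker function is called sequentially rather than from a process pool, to keep the sketch self-contained:

```python
import numpy as np
from multiprocessing import shared_memory

# Placeholder batch of per-image probability maps (B, H, W).
batch = np.random.rand(4, 64, 64).astype(np.float32)

# Allocate one shared block for the whole batch, so workers can attach to
# it by name instead of receiving pickled copies of every array.
shm = shared_memory.SharedMemory(create=True, size=batch.nbytes)
shared = np.ndarray(batch.shape, dtype=batch.dtype, buffer=shm.buf)
shared[:] = batch  # one copy in, instead of one copy per worker

def process_item(name, shape, dtype, index):
    """What each worker would do: attach by name, touch only its slice."""
    existing = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
    view[index] *= 0.5  # stand-in for writing back post-processed maps
    del view            # release the buffer before closing the handle
    existing.close()

for i in range(batch.shape[0]):  # a process pool would map over this
    process_item(shm.name, batch.shape, batch.dtype, i)

result = shared.copy()
del shared
shm.close()
shm.unlink()
```

Each worker writes to a disjoint slice of the block, which is why the footnote's caveat matters: nothing here is concurrency safe if slices overlap.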
6. Carefully, because they are not concurrency safe.

Table 3: Time per epoch, proposal-update time, and total training time on the two data sets (Pr: prostate, At: brain lesions).

                   Time per epoch (s)   Proposals update (s)   Total (h)
Method             Pr      At           Pr       At            Pr     At
Full supervision   150     235          -        -             4.2    3.3
Ours               170     325          -        -             4.7    4.5
DeepCut            150     235          440      3120          6.6    11.9