Weakly Supervised Estimation of Shadow Confidence Maps in Fetal Ultrasound Imaging

Abstract

Detecting acoustic shadows in ultrasound images is important in many clinical and engineering applications. Real-time feedback of acoustic shadows can guide sonographers to a standardized diagnostic viewing plane with minimal artifacts and can provide additional information for other automatic image analysis algorithms. However, automatically detecting shadow regions using learning-based algorithms is challenging because pixel-wise ground truth annotation of acoustic shadows is subjective and time-consuming. In this paper we propose a weakly supervised method for automatic confidence estimation of acoustic shadow regions. Our method is able to generate a dense shadow-focused confidence map. In our method, a shadow-seg module is built to learn general shadow features for shadow segmentation, based on global image-level annotations as well as a small number of coarse pixel-wise shadow annotations. A transfer function is introduced to extend the obtained binary shadow segmentation to a reference confidence map. Additionally, a confidence estimation network is proposed to learn the mapping between input images and the reference confidence maps. This network is able to predict shadow confidence maps directly from input images during inference. We use evaluation metrics such as DICE and inter-class correlation to verify the effectiveness of our method. Our method is more consistent than human annotation, and outperforms the state-of-the-art quantitatively in shadow segmentation and qualitatively in confidence estimation of shadow regions. We further demonstrate the applicability of our method by integrating shadow confidence maps into tasks such as ultrasound image classification, multi-view image fusion and automated biometric measurements.


Qingjie Meng, Matthew Sinclair, Veronika Zimmer, Benjamin Hou, Martin Rajchl, Nicolas Toussaint, Ozan Oktay, Jo Schlemper, Alberto Gomez, James Housden, Jacqueline Matthew, Daniel Rueckert, Fellow, IEEE, Julia A. Schnabel, Senior Member, IEEE, and Bernhard Kainz, Senior Member, IEEE


Index Terms: Ultrasound imaging, deep learning, weakly supervised, shadow detection, confidence estimation.

Q. Meng, M. Sinclair, B. Hou, M. Rajchl, O. Oktay, J. Schlemper, D. Rueckert and B. Kainz are with the Biomedical Image Analysis Group, Department of Computing, Imperial College London, London SW7 2AZ, UK (e-mail: [email protected]). V. Zimmer, N. Toussaint, A. Gomez, J. Housden, J. Matthew and J. A. Schnabel are with the School of Biomedical Engineering and Imaging Sciences, King's College London, London WC2R 2LS, UK.

To appear in IEEE TRANSACTIONS ON MEDICAL IMAGING, https://ieeexplore.ieee.org/document/8698843, DOI: 10.1109/TMI.2019.2913311.

Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

I. INTRODUCTION

Ultrasound (US) imaging is a medical imaging technique based on reflection and scattering of high-frequency sound in tissues. Compared with other imaging techniques (e.g. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT)), US imaging has various advantages including portability, low cost, high temporal resolution and real-time imaging capability. With these advantages, US is an important medical imaging modality that is utilized to examine a range of anatomical structures in both adults and fetuses. In most countries, US imaging is an essential part of clinical routine for pregnancy health screening between 11 and 22 weeks of gestation [29].

Although US imaging is capable of providing real-time images of anatomy, diagnostic accuracy is limited by the relatively low image quality. Artifacts such as noise [1], distortions [34] and acoustic shadows [12] make interpretation challenging and highly dependent on experienced operators. These artifacts are unavoidable in clinical practice due to the low energies used and the physical nature of sound wave propagation in human tissues. Better hardware and advanced image reconstruction algorithms have been developed to reduce speckle noise [9, 10]. Prior anatomical expertise [21] and extensive sonographer training are the only ways to handle distortions and shadows to date.

Sound-opaque occluders, including bones and calcified tissues, block the propagation of sound waves by strongly absorbing or reflecting sound waves during scanning. The regions behind these sound-opaque occluders return little to no reflections to the US transducer. Thus these areas have low intensity but very high acoustic impedance gradients at their boundaries (e.g. Fig. 1(a) left column). Reducing acoustic shadows and correctly interpreting images that contain shadows rely heavily on sonographer experience. Experienced sonographers avoid shadows by moving the probe to a preferable viewing direction during scanning or, if no shadow-free viewing direction can be found, by compounding a mental map from iterative acquisitions at different orientations.

With less anatomical information in shadow regions, especially when shadows cut through the anatomy of interest, images containing strong shadows can be problematic for automatic real-time image analysis methods such as biometric measurements [32], anatomy segmentation [5] and US image classification [3]. Moreover, the shortage of experienced sonographers [8] exacerbates the challenges of accurate US image-based screening and diagnostics. Therefore, shadow-aware US image analysis is greatly needed and would be beneficial, both for engineers who work on medical image analysis and for sonographers in clinical practice.

Contribution: We propose a novel method based on convolutional neural networks (CNNs) to automatically estimate pixel-wise confidence maps of acoustic shadows in 2D US images. Our method learns an initial latent space of shadow regions from images consisting of multiple anatomies and with global image-level labels (“has shadow” and “shadow-free”), e.g. Fig. 1(a). The basic latent space is then estimated by learning from fewer images of a single anatomy (fetal brain) with coarse pixel-wise shadow annotations (a small fraction of the number of images with global image-level labels), e.g. Fig. 1(b). The resulting latent space is then refined by learning shadow intensity distributions using fetal brain images so that the latent space is suitable for confidence estimation of shadow regions. By using shadow intensity information, our method can detect more shadow regions than the coarse manual segmentation, especially relatively weak shadow regions.

Fig. 1: Examples of data sets. (a) Images with global image-level labels (“has shadow” and “shadow-free”), and (b) images with coarse pixel-wise annotations from two annotators.

The proposed training process is able to build a direct mapping between input images and the corresponding shadow confidence maps in any given anatomy, which allows real-time application through direct inference.

In contrast to our preliminary work [22], which uses separate, heuristically linked components, here we establish a pipeline to make full use of existing data sets and annotations. During inference our method can predict both a binary shadow segmentation and a dense shadow-focused confidence map. The shadow segmentation is not limited by hyperparameters such as thresholds in [22], and the segmentation accuracy as well as shadow confidence maps are greatly improved compared to the state-of-the-art.

We have demonstrated in [22] that shadow confidence maps can improve the performance of an automatic biometric measurement task. In this study, we further evaluate the usefulness of the shadow confidence estimation for other automatic image analysis algorithms such as a US image classification task and a multi-view image fusion task.

Related work

Automatic US shadow detection: Acoustic shadows have a significant impact on US image quality, and thus a serious effect on robustness and accuracy of image processing methods. In clinical literature, US artifacts including shadows have been well studied and reviewed [6, 19, 24]. However, the shadow problem is not well covered in automated US image analysis literature. Automatic estimation of acoustic shadows has rarely been the focus within the medical image analysis community.

Identifying shadow regions in US images has been utilized as a preprocessing step for extracting relevant image content and improving image analysis accuracy in some applications. Penney et al. [27] have identified shadow regions by thresholding the accumulated intensity along each scanning beam line. Afterwards, these shadow regions have been masked out from US images for US to MRI hepatic image registration. Instead of excluding shadow regions, Kim et al. [17] focused on accurate attenuation estimation, and aimed to use attenuation properties for determination of the anatomical properties which can help diagnose diseases. They proposed a hybrid attenuation estimation method that combines spectral difference and spectral shift methods to reduce the influence of local spectral noise and backscatter variations in Radio Frequency (RF) US data. To detect shadow regions in B-Mode scans directly and automatically, Hellier et al. [15] used the probe's geometric properties and statistically modelled the US B-Mode cone. Compared with previous statistical shadow detection methods such as [27], their method can automatically estimate the probe's geometry as well as other hyperparameters, and has shown improvements in 3D reconstruction, registration and tracking. However, the method can only detect a subset of ‘deep’ acoustic shadows because of the probe geometry-dependent sampling strategy.

To improve the accuracy of US attenuation estimation and shadow detection, Karamalis et al. [16] proposed a more general solution using the Random Walks (RW) algorithm to predict a per-pixel confidence of US images. In [16], confidence maps represent the uncertainty of US images resulting from shadows, and thus show the acoustic shadow regions. The confidence maps obtained by this work can improve the accuracy of US image processing tasks, such as intensity-based US image reconstruction and multi-modal registration. However, such confidence maps are sensitive to US transducer settings and limited by the US formation process. Klein et al. [18] have further extended the RW method to generate distribution-based confidence maps and applied it to RF US data. This method is more robust since the confidence prediction is no longer intensity-based.

Some studies have utilized acoustic shadow detection as additional information in their pipeline for other US image processing tasks. Broersen et al. [7] used acoustic shadow detection for the characterization of dense calcium tissue in intravascular US virtual histology, and Berton et al. [5] automatically and simultaneously segmented vertebrae, spinous process and acoustic shadow in US images for a better assessment of scoliosis progression. In these applications, acoustic shadow detection is task-specific, and is mainly based on heuristic image intensity features as well as special anatomical constraints.

The aforementioned literature relies heavily on manually selected relevant features, intensity information or a probe-specific US formation process. With the advances in deep learning, US image analysis algorithms have gained better semantic image interpretation abilities. However, current deep learning segmentation methods require a large amount of pixel-wise, manually labelled ground truth images. This is challenging in the US imaging domain because of (a) a lack of experienced annotators and (b) weakly defined structural features that cause a high inter-observer variability.


Weakly supervised image segmentation: Weakly supervised automatic detection of class differences has been explored in other imaging domains (e.g. MRI). For example, Baumgartner et al. [4] proposed to use a generative adversarial network (GAN) to highlight class differences only from global image-level labels (Alzheimer's disease or healthy). We used a similar idea in [22] and initialized potential shadow areas based on saliency maps [33] from a classification task between images containing shadows and those without. Inspired by recent weakly supervised deep learning methods that have drastically improved semantic image analysis [20, 28, 37] and to overcome the limitations of [22], we develop a confidence estimation algorithm that takes advantage of both types of weak labels, including global image-level labels and a sparse set of coarse pixel-wise labels. Our method is able to predict dense, shadow-focused confidence maps directly from input US images, effectively in real time.

II. METHOD

In our proposed method, a shadow-seg module is first trained to produce a semantic segmentation of shadow regions. In this module, shadow features are initialized by training a shadow/shadow-free classification network and generalized by training a shadow segmentation network. After obtaining the shadow segmentation, a transfer function is used to extend the predicted binary shadow segmentation to a confidence map based on the intensity distribution within suspected shadow regions. This confidence map is regarded as a reference confidence map for the next confidence estimation network. Lastly, a confidence estimation network is trained to learn the mapping between the input shadow-containing US images and the corresponding reference confidence maps. The outline for the training process is shown in Fig. 2. During inference, we use the confidence estimation network to predict a dense shadow confidence map directly from the input image. Additionally, we integrate attention mechanisms [30] into our method to enhance the shadow features extracted by the networks.

Shadow-seg Module: We propose a shadow-seg module to extract generalized shadow features for a large range of shadow types in fetal US images under limited weak manual annotations. Since shadow regions have different shapes, various intensity distributions and uncertain edges, the pixel-wise annotation of shadow regions is time-consuming and relies heavily on the annotator's experience (e.g. various annotations in Fig. 1(b)). This generally results in manual annotations of limited quantity and quality. Compared with pixel-wise shadow annotations, global image-level labels (“has shadow” and “shadow-free” in our case) are easier to obtain, and shadow images with global image-level labels can contain a larger variety of shadow types. Therefore, we use a shadow-seg module that combines unreliable pixel-wise annotations and global image-level labels as weak annotations.

The proposed shadow-seg module contains two tasks, (1) shadow/shadow-free classification using image-level labels, and (2) shadow segmentation that uses few coarse pixel-wise manual annotations (a small fraction of the number of global image-level labels). Shadow features can be extracted during simple shadow/shadow-free classification and subsequently optimized for the more challenging shadow segmentation task. In our case, shadow features extracted by the classification network cover various shadow types in a range of anatomical structures. These shadow features become suitable for the shadow segmentation after being optimized by a shadow segmentation network.

Network Architecture: We build two sub-networks from residual-blocks [14] as shown in Fig. 3. Residual-blocks can reduce the training error when using deeper networks and support better network optimization [14]. They have been widely used for various image processing algorithms [13, 36, 38].

The first and initially trained network is a shadow/shadow-free classification network that learns to distinguish images containing shadows from shadow-free images, and thus learns the defining features of acoustic shadow. This classification network consists of a feature encoder followed by a global average pooling layer. The feature encoder uses six residual-blocks (Fig. 3) to extract shadow features that define shadow-containing images in the classifier. We refer to $l = 1$ as the label of the shadow-containing class and $l = 0$ as the label of the shadow-free class. Image set $X^C = \{x^C_1, x^C_2, ..., x^C_K\}$ and their corresponding labels $L = \{l_1, l_2, ..., l_K\}$ s.t. $l_i \in \{0, 1\}$ are used to train the feature encoder as well as the global average pooling layer. We use softmax cross-entropy loss as the cost function $L_C$ between the predicted labels and the true labels.

Representative shadow features extracted by the feature encoder of the shadow/shadow-free classification network are then optimized by the shadow segmentation network with a limited number of densely segmented US images. The feature encoder of the segmentation network has the same architecture as the classification network. The weights of the feature encoder in the segmentation network are initialized with those of the classification network and are further fine-tuned for the segmentation task. Therefore, the extracted shadow features are suitable for the segmentation in addition to classification. The decoder of the segmentation network is symmetrical to the feature encoder. Feature layers from the feature encoder are concatenated to the corresponding layers in the decoder by skip connections.
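As a rough illustration of the classification cost $L_C$ described above, the following NumPy sketch applies global average pooling to the last feature maps and computes a softmax cross-entropy loss for one image. It assumes, purely for illustration, that the encoder output has one channel per class (shadow-free, has-shadow); the function name `classification_loss` and this channel layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def classification_loss(feature_maps, label):
    """Softmax cross-entropy after global average pooling (illustrative only).

    feature_maps: array of shape (2, H, W); ASSUMPTION: one channel per class,
                  index 0 = shadow-free, index 1 = has shadow.
    label:        integer class label l in {0, 1}.
    """
    feature_maps = np.asarray(feature_maps, dtype=np.float64)
    # Global average pooling: one scalar score (logit) per channel.
    logits = feature_maps.mean(axis=(1, 2))
    # Numerically stable log-softmax.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # Cross-entropy with a one-hot target reduces to -log p(label).
    return -log_probs[label]
```

With identical scores for both classes the loss equals log 2, and it approaches zero as the score of the correct class dominates.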
Here, we denote the image set used to train the shadow segmentation with $X^S = \{x^S_1, x^S_2, ..., x^S_M\}$ and the corresponding pixel-wise manual segmentation with $Y^S = \{y^S_1, y^S_2, ..., y^S_M\}$. The shadow segmentation provides a pixel-wise binary prediction $\hat{Y}^S = \{\hat{y}^S_1, \hat{y}^S_2, ..., \hat{y}^S_M\}$ for shadow regions and the cost function $L_{seg}$ is the softmax cross-entropy between $\hat{Y}^S$ and $Y^S$.

Fig. 2: Training framework of the proposed method. (a) The shadow-seg module containing a shadow/shadow-free classification network and a shadow segmentation network. (b) The transfer function that expands a binary mask to a reference confidence map. (c) The confidence estimation network which establishes direct mapping between input images and confidence maps.

Fig. 3: The architecture of the residual-block. $BN(x)$ refers to a batch normalization layer and $f(x)$ is a convolutional layer.

Transfer Function: Binary masks lack information about inherent uncertainties at the boundaries of shadow regions. Therefore, we use a transfer function to extend the binary segmentation prediction to a confidence map, which is more appropriate to describe shadow regions. The main task of the transfer function is to learn the intensity distribution of shadow regions so as to estimate confidence of pixels in false positive (FP) regions of the predicted binary shadow segmentation. This transfer function is built and only used during training to provide reference confidence maps for the confidence estimation network.

When comparing the manual segmentation $y^S$ and the predicted segmentation $\hat{y}^S$ of shadow regions in image $x$, we define the true positive (TP) regions $x_{TP}$ as shadow regions with full confidence, $C_{x_{ij}} = 1, x_{ij} \in x_{TP}$. Here, $C_{x_{ij}}$ refers to the confidence of pixel $x_{ij}$ being shadow.

For each pixel $x_{ij}$ in the FP regions ($x_{FP}$), the confidence of belonging to a shadow region is computed by a transfer function $T(x_{ij} \mid x_{ij} \in x_{FP})$ based on the intensity of the pixel ($I_{x_{ij}}$) and the mean intensity of $x_{TP}$ ($I_{mean}$). $I_{mean}$ is defined in Eq. 1. With weak signals in the shadow regions, the average intensity of shadow pixels is lower than the maximum intensity ($I_{max} = \max(x)$) but not lower than the minimum intensity ($I_{min} = \min(x)$), that is $I_{mean} \in [I_{min}, I_{max})$.

$$I_{mean} = \begin{cases} \operatorname{mean}(y^S \cap \hat{y}^S), & y^S \cap \hat{y}^S \neq \emptyset, \\ \operatorname{mean}(y^S), & y^S \cap \hat{y}^S = \emptyset, \end{cases} \tag{1}$$

The transfer function $T(\cdot)$ computing $C_{x_{ij}}$ for pixels in $x_{FP}$ is defined according to the range of $I_{mean}$. For $I_{mean} \in (I_{min}, I_{max})$, $T(\cdot)$ is shown in Eq. 2. For $I_{mean} = I_{min}$, $T(\cdot)$ is shown in Eq. 3.

$$T(x_{ij} \mid x_{ij} \in x_{FP}) = \begin{cases} \dfrac{I_{x_{ij}} - I_{min}}{I_{mean} - I_{min}}, & I_{min} \le I_{x_{ij}} < I_{mean}, \\ \dfrac{I_{max} - I_{x_{ij}}}{I_{max} - I_{mean}}, & I_{mean} < I_{x_{ij}} \le I_{max}, \\ 1, & I_{x_{ij}} = I_{mean}, \end{cases} \tag{2}$$

$$T(x_{ij} \mid x_{ij} \in x_{FP}) = \begin{cases} \dfrac{I_{x_{ij}} - I_{mean}}{I_{max} - I_{mean}}, & I_{mean} < I_{x_{ij}}, \\ 1, & I_{x_{ij}} = I_{mean}, \end{cases} \tag{3}$$

After using the transfer function, the binary map of the predicted segmentation $\hat{y}^S$ is extended to a confidence map $y^C$. $y^C$ acts as a reference (“ground truth”) for the training of the next confidence estimation network.

Confidence Estimation Network: After obtaining reference confidence maps from the predicted binary segmentation, a confidence estimation network is trained to map an image with shadows ($x$) to the corresponding reference confidence map ($y^C$).
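To make the construction of these reference confidence maps more concrete, the following NumPy sketch applies Eq. 1 and Eq. 2 to a single image. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reference_confidence_map` is hypothetical, pixels outside the predicted segmentation are assumed to receive zero confidence, and only the case $I_{min} < I_{mean} < I_{max}$ (Eq. 2) is covered.

```python
import numpy as np

def reference_confidence_map(x, y_manual, y_pred):
    """Extend a binary shadow prediction to a confidence map (Eq. 1 and Eq. 2).

    x:        2D intensity image.
    y_manual: binary manual segmentation y^S.
    y_pred:   binary predicted segmentation (y^S with a hat).
    ASSUMPTION: pixels outside y_pred get confidence 0.
    """
    x = np.asarray(x, dtype=np.float64)
    y_manual = np.asarray(y_manual, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = y_manual & y_pred    # true positive region x_TP
    fp = y_pred & ~y_manual   # false positive region x_FP

    i_min, i_max = x.min(), x.max()
    # Eq. 1: mean intensity over the overlap, or over y^S if there is no overlap.
    region = tp if tp.any() else y_manual
    i_mean = x[region].mean()
    if not (i_min < i_mean < i_max):
        raise ValueError("this sketch only covers I_min < I_mean < I_max (Eq. 2)")

    # Eq. 2: confidence rises linearly up to I_mean and falls linearly above it.
    below = (x - i_min) / (i_mean - i_min)
    above = (i_max - x) / (i_max - i_mean)
    t = np.where(x < i_mean, below, above)  # equals 1 exactly at I_mean

    conf = np.zeros_like(x)
    conf[tp] = 1.0        # full confidence in TP regions
    conf[fp] = t[fp]      # transfer function in FP regions
    return conf
```

Under these assumptions, a false positive pixel whose intensity is close to the mean shadow intensity keeps a confidence near 1, while much darker or much brighter false positives are down-weighted.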
This confidence estimation network can be independently used to directly predict a dense shadow confidence map for an input image during inference.

The confidence estimation network consists of a down-sampling encoder, a symmetric up-sampling decoder, and skip connections between feature layers from the encoder and the decoder at different resolution levels. Both the encoder and the decoder are composed of six residual-blocks. The cost function of the confidence estimation network is defined as the mean squared error between the predicted confidence map $\hat{Y}^C$ and the reference confidence map $Y^C$ ($L_{conf} = \|\hat{Y}^C - Y^C\|$).

Fig. 4: The architecture of the shadow/shadow-free classification network with attention mechanism. $f_R(\cdot)$ refers to residual-blocks. $g(\cdot)$ refers to a global average pooling layer.

Attention Gates: Attention gates are believed to generally highlight relevant features according to image context and thus improve network performance for medical image analysis [25]. We integrate attention gates [30] into our approach to explore whether attention mechanisms can further improve the confidence estimation of shadow regions in 2D ultrasound. In our case, we connect the self-attention gating modules proposed in [25] to the feature maps before the last two down-sampling operations in the encoders of all three networks. For the shadow/shadow-free classification network, the global average pooling layer is modified when adding this self-attention gating module. In detail, as shown in Fig. 4, global average pooling is applied separately to the two attention-gated feature maps as well as the original last feature map to obtain three average feature maps. These three average feature maps are then concatenated, followed by a fully connected layer to compute the final classification prediction.

III. IMPLEMENTATION

All the residual-blocks used in the proposed method are implemented as proposed in [26], which provides a convenient interface to realize residual-blocks.

We optimize the different modules separately and consecutively in three steps. First we train ∼ epochs for the parameters of the shadow/shadow-free classification network, and then ∼ epochs for the pixel-wise shadow segmentation network. After obtaining a well-trained shadow segmentation network, we train the confidence estimation network for another 700 epochs.

For all networks, we use Stochastic Gradient Descent (SGD) with a momentum optimizer to update the parameters since SGD has better generalization capability than adaptive optimizers [35]. The parameters of the optimizer are momentum = 0 . , with a learning rate of − . We apply L2 regularization to all weights during training to help prevent network over-fitting. The scale of the regularizer is set as − . The training batch size is 25 and our networks are trained on a Nvidia Titan X GPU with 12 GB of memory.

IV. EVALUATION

The proposed method is trained and evaluated using two data sets, (1) a multi-class data set consisting of 13 classes of 2D US fetal anatomy with global image-level labels (“has shadow” or “shadow-free”) and including 48 non-brain images with manual shadow segmentations, and (2) a single-class data set containing 2D US fetal brain images with coarse pixel-wise manual shadow segmentations. To reduce the variance in parameter estimation during training, we use relatively large training splits. In the multi-class data set, all images except 491 shadow images and 502 clear images are used for training; of these held-out images, 443 shadow images and 502 clear images are used for validation and the 48 non-brain images for testing. In the single-class data set we use 500 images for training, 50 for validation and 93 for testing.

To verify the effectiveness of the proposed method and the importance of the shadow/shadow-free classification network in the shadow-seg module, we compare the variants of our method to a baseline which only contains a shadow segmentation network and a confidence estimation network.

We use standard measurements such as Dice coefficient (DICE) [11], recall, precision and Mean Squared Error (MSE) for shadow segmentation evaluation, and use the Interclass Correlation (ICC) [31] as well as soft DICE [2] for confidence estimation evaluation. In order to verify the performance of our method, we also compute quantitative measurements between the chosen manual annotation (weak ground truth) and another manual annotation from a different annotator to show the human performance for the shadow detection task. Lastly, we show the practical benefits of shadow confidence maps for different applications such as a standard plane classification task, an image fusion task from multiple views and a segmentation task for automatic biometric measurements.

Multi-class Data Set: This data set consists of ∼ . k − weeks (iFIND Project). Eight different ultrasound systems of identical make and model (GE Voluson E8) were used for the acquisitions. Various image settings based on different sonographers' personal preference for scanning are included in this data set. The images have been classified by expert observers as containing strong shadow, being shadow-free, or being corrupted, e.g. by poor tissue contact caused by a lack of acoustic impedance gel. Corrupted images ( < ) have been excluded as discussed in Section VI with Fig. 10.

Single-class Data Set: This data set comprises 643 fetal brain images and has no overlap with the multi-class data set. Shadow regions in this data set have been coarsely segmented by two bio-engineering students using trapezoid-shaped segmentation masks for individual shadow regions.

Training Data:

Validation and Test Data: The remaining 491 shadow images and 502 clear images in the multi-class data set are used for testing and validation. Here, a subset ($M_{test}$) comprising 48 randomly selected images from the 491 shadow images is used for testing. These 48 images contain various fetal anatomies (except fetal brain), such as abdominal, kidney and cardiac views. Shadow regions in these images have been manually segmented to provide ground truth. The remaining 443 shadow images and 502 clear images are used for the validation of the shadow/shadow-free classification. Similarly, the remaining 143 fetal brain images of the single-class data set are split into two subsets, where $S_{val}$ contains 50 images for validation of the shadow segmentation, binary-to-confidence transformation and the confidence estimation, and $S_{test}$ contains 93 images for testing. For all images from the single-class data set, we randomly choose one group of annotations from two different existing groups of annotations as ground truth for training, validation and testing.

A. Baseline

The baseline method is used to demonstrate that the shadow-seg module is important for capturing generalized shadow features and obtaining accurate confidence estimation of shadow regions. It comprises a shadow segmentation network and a confidence estimation network, which have the same architectures as shown in Fig. 2. We first train the shadow segmentation network in the baseline method using the 500 fetal brain images from the single-class data set. After applying the transfer function on the binary segmentation prediction, we train the confidence estimation network for a direct mapping between shadow images and reference confidence maps.

B. Evaluation Metrics

In this section, we define the aforementioned statistical metrics and the computation of the inter-observer variability between two pixel-wise manual annotations of shadow regions.
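The overlap and agreement metrics used in this evaluation (DICE, recall, precision, soft DICE and the ICC of Eq. 4) can be illustrated with a short NumPy sketch. The function names are hypothetical, and two points are assumptions rather than statements from the paper: soft DICE is written as $2\sum pg / (\sum p + \sum g)$ on real-valued maps, and $R_{MS}$, $C_{MS}$ and $M_{MS}$ are read as the row, column and residual mean squares of a two-way layout with pixels as rows and the two annotations as columns. Empty masks are not handled.

```python
import numpy as np

def overlap_metrics(pred, ref):
    """DICE, recall and precision for binary masks P (pred) and G (ref)."""
    p = np.asarray(pred, dtype=bool)
    g = np.asarray(ref, dtype=bool)
    inter = np.logical_and(p, g).sum()
    dice = 2.0 * inter / (p.sum() + g.sum())
    recall = inter / g.sum()
    precision = inter / p.sum()
    return dice, recall, precision

def soft_dice(pred_conf, ref_conf):
    """DICE with real-valued confidence maps (ASSUMED product-based form)."""
    p = np.asarray(pred_conf, dtype=np.float64)
    g = np.asarray(ref_conf, dtype=np.float64)
    return 2.0 * (p * g).sum() / (p.sum() + g.sum())

def icc_two_annotations(a, b):
    """ICC in the form of Eq. 4; each pixel is a target, a and b are two annotations."""
    data = np.stack([np.ravel(a), np.ravel(b)], axis=1).astype(np.float64)
    n, k = data.shape  # n targets (pixels), k = 2 annotations
    grand = data.mean()
    ss_rows = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((data - grand) ** 2).sum()
    r_ms = ss_rows / (n - 1)                                      # rows (targets)
    c_ms = ss_cols / (k - 1)                                      # columns (annotations)
    m_ms = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))   # interaction/residual
    return (r_ms - m_ms) / (r_ms + m_ms + 2.0 * (c_ms - m_ms) / n)
```

For binary inputs, `soft_dice` reduces to the ordinary DICE value, which gives a quick consistency check between the two definitions.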

DICE, Recall, Precision and MSE: We refer to the binary prediction of shadow segmentation as $P$ and the binary manual segmentation as $G$. $\mathrm{DICE} = 2|P \cap G| / (|P| + |G|)$, $\mathrm{Recall} = |P \cap G| / |G|$, $\mathrm{Precision} = |P \cap G| / |P|$ and $\mathrm{MSE} = |P - G|$.

ICC: We use ICC as proposed by [31] (Eq. 4) to measure the agreement between two annotations. Each pixel in an image is regarded as a target. $R_{MS}$, $C_{MS}$ and $M_{MS}$ are respectively the mean squared values of rows, columns and interaction. $N$ is the number of targets.

$$\mathrm{ICC} = \frac{R_{MS} - M_{MS}}{R_{MS} + M_{MS} + 2 \times (C_{MS} - M_{MS})/N}. \tag{4}$$

Soft DICE: Soft DICE can be used to handle probability maps. We use real values in the DICE definition to compute soft DICE between the predicted shadow confidence maps $\hat{Y}^C$ and reference confidence maps $Y^C$.

Human Performance: We consider another binary segmentation of shadow regions from a different annotator as $Y^S_{new}$. The computed metrics between $Y^S_{new}$ and the chosen manual segmentation $Y^S$ reflect the human inter-observer variability.

C. Shadow Segmentation Analysis

We compare the segmentation performance of the state-of-the-art ([16] and [22]), the proposed methods and the human performance. This comparison is used to examine the importance of the shadow-seg module for the shadow segmentation, and further, for the confidence estimation of shadow regions.

TABLE I: Shadow segmentation performance (µ ± σ) of different methods on test data $S_{test}$. RW and RW∗ are the Random Walk algorithm [16] with different sets of parameters. Pilot [22] is our previous work. Baseline, the proposed method (abbreviated as “Proposed”) and the proposed method with attention gates (abbreviated as “Proposed + AG” in the rest of the paper) are our proposed methods. Anno∗ refers to the human inter-observer variability, thus expected human performance on the shadow segmentation task. Best results are shown in bold.

Methods          DICE       Recall     Precision   MSE
RW [16]          (0.099)    (0.2047)   (0.0675)    (7.6734)
RW∗ [16]         (0.1123)   (0.2196)   (0.0771)    (8.3484)
Pilot [22]       (0.1398)   (0.201)    (0.1352)    (14.837)
Baseline         (0.212)    (0.2255)   (0.2326)    (12.2885)
Proposed         (0.1988)   (0.2131)   (0.2255)    (11.867)
Proposed + AG    (0.2014)   (0.2169)   (0.2247)    (12.6317)
Anno∗            (0.2635)   (0.3196)   (0.3124)    (23.0339)

Table I shows DICE, recall, precision and MSE of different methods on $S_{test}$. RW and RW∗ are results of [16] with various parameters. For fair comparison, we run 24 tests on both test sets using the RW algorithm with different parameter combinations (α ∈ { , , }; β ∈ { , }; γ ∈ { . , . , . , . }). Because the likelihood of shadows and the confidence in [16] are negatively related, and to compare all methods consistently, we use $1 - S$ instead of $S$ to display the results of RW and RW∗ in all comparison experiments. Here $S$ is a confidence map obtained by [16]. To generate shadow segmentation, we threshold the obtained confidence maps by T ∈ { . , . } so that pixels with confidence higher than T are shadows.
We chose the parameters and the threshold which achieve the highest average DICE on all samples in both test sets. The chosen RW parameters and the threshold are α = 1; β = 90; γ = 0 . ; T = 0 . . We also applied the parameters and the threshold in [16] (α = 2; β = 90; γ = 0 . ; T = 0 . ) in our experiments, which is denoted as RW∗. Note that we use the public Matlab code of [16] (http://campar.in.tum.de/Main/AthanasiosKaramalisCode) to test RW and RW∗.

As shown in Table I, the baseline, the proposed method and the proposed + AG greatly outperform the state-of-the-art. Among all methods, the proposed method achieves the highest DICE. Recall and precision of the proposed method are respectively . and . higher than those of the baseline, while MSE of the proposed method is . lower than that of the baseline. After adding attention gates to the proposed method (the proposed + AG), the shadow segmentation performance is nearly the same as that of the proposed method without attention gates, but better than the baseline. Additionally, the relatively low scores of Anno∗ indicate high inter-observer variability and how ambiguous human annotation can be for this task. A mean DICE of . shows that the proposed method performs better and more consistently than human annotation.

We further conduct the same experiments on another non-brain test data set $M_{test}$ to verify the feature generalization ability of the shadow-seg module. Results are shown in Table II. Similarly, the proposed weakly supervised methods and the baseline outperform all state-of-the-art methods.

To statistically evaluate the difference among various methods, we use the paired sample t-test on two test data sets $S_{test}$ and $M_{test}$. Here, we compare the evaluation metrics (DICE, Recall, Precision and MSE) of the proposed method and the Pilot [22] because the Pilot [22] outperforms other state-of-the-art methods in Table I and Table II. We also compare the evaluation metrics of the proposed method and the baseline. The obtained corresponding p-values are shown in Table III. Using . as the threshold for statistical significance, Table III shows that the proposed method greatly improves the shadow segmentation performance compared with the Pilot [22] and the baseline.

TABLE II: Comparison of shadow segmentation performance (µ ± σ) of different methods on test data $M_{test}$. Best results are shown in bold. The symbols of the methods are the same as in Table I.

Methods          DICE       Recall     Precision   MSE
RW [16]          (0.0855)   (0.1241)   (0.0592)    (7.866)
RW∗ [16]         (0.0871)   (0.1528)   (0.0602)    (7.5643)
Pilot [22]       (0.1079)   (0.137)    (0.1308)    (17.0491)
Baseline         (0.1798)   (0.2233)   (0.1712)    (18.3773)
Proposed         (0.155)    (0.2335)   (0.1357)    (17.2147)
Proposed + AG    (0.1544)   (0.2035)   (0.1562)    (17.6628)

TABLE III: The p-value of the Proposed method vs. Pilot [22] and of the Proposed method vs. Baseline. Statistically significant results (p < . ) are shown in bold. † indicates that the proposed method performs worse; otherwise the proposed method is better.

$S_{test}$       DICE       Recall     Precision   MSE
Pilot [22]
Baseline
$M_{test}$       DICE       Recall     Precision   MSE
Pilot [22] †
Baseline

Fig. 5: Results of shadow confidence estimation. (a) Soft DICE of the baseline, the proposed method and the proposed method with attention gates (proposed + AG) on $S_{test}$ and $M_{test}$. (b) Interclass correlation (ICC) of the baseline, the proposed method and the proposed + AG on $S_{test}$ and $M_{test}$. Additionally, ICC of the human performance is shown as Anno∗ for $S_{test}$.

D. Shadow Confidence Estimation

In this part, we evaluate the performance of the confidence estimation by comparing the shadow confidence maps of different methods.

Fig. 5 (a) shows the soft DICE evaluation on $S_{test}$ and $M_{test}$. The proposed method and the proposed + AG method achieve higher soft DICE on both test sets than the baseline, and are more robust than the baseline on $M_{test}$. The baseline fails in this experiment on $M_{test}$ because it is unable to obtain accurate shadow segmentation in the previous step (shown in Table II). With less accurate shadow segmentation, the shadow confidence estimation can hardly establish a valid mapping between input images and reference confidence maps. This demonstrates that the shadow-seg module is beneficial for shadow segmentation and confidence estimation.

We additionally evaluate the reliability of the shadow confidence estimation by measuring the agreement between the decision of each method and the manual segmentation. Regarding the baseline, the proposed and the proposed + AG as different judges and the manual segmentation of shadow regions as a contrasting judge, we use the ICC to measure the agreement between each different judge and the contrasting judge. Fig. 5 (b) shows the ICC evaluation on two test data sets, which indicates that the proposed method and the proposed + AG are more consistent in estimating shadow confidence maps than the baseline. When considering another manual segmentation of shadow regions as an extra judge, we can evaluate the agreement of human annotations. Fig. 5 (b) shows that the ICC of two human annotations (shown as Anno) is normally . . The proposed method with an ICC of . is more consistent than annotations from two human annotators.

Fig. 6 compares the shadow confidence maps of the state-of-the-art methods and the proposed methods. RW and RW∗ have the same parameters as used for Table I.
The shadow confidence maps of the baseline, the proposed method and the proposed + AG method are generated directly from input shadow images by confidence estimation networks. Overall, the proposed method and the proposed + AG method achieve more visually reasonable shadow confidence estimation than the baseline and the state-of-the-art on different anatomical structures shown in Fig. 6. The proposed method and the proposed + AG method are able to highlight multiple shadow regions while the RW algorithm shows limitations for most cases, especially for disjoint shadow regions.

Fig. 6: Confidence estimation of shadow regions using the state-of-the-art methods and our methods. Panels: (a) Image, (b) RW [16], (c) RW∗ [16], (d) Pilot [22], (e) Baseline, (f) Proposed, (g) Proposed + AG, (h) Weak GT. Rows I-IV show four examples: Brain (top), Lip (second), Abdominal (third) and Cardiac (bottom). Column (a) is the original US image. Columns (b-d) are shadow confidence maps from the RW algorithm [16] and our previous work [22]. Columns (e-g) show the shadow confidence maps of the baseline, the proposed method and the proposed + AG method. Column (h) is the binary map of the manual shadow segmentation. The color bar on the top of this figure shows that the more yellow/brighter (closer to 1), the higher the confidence of being shadow regions.

Row I in Fig. 6 shows a fetal brain image from $S_{test}$. The confidence estimations of shadow regions from the baseline, the proposed method and the proposed + AG method are similarly accurate since we use fetal brain images to train the confidence estimation networks in these three methods. These outperform [16] and [22]. Rows (II-IV) in Fig. 6 show shadow confidence maps of non-brain anatomy from $M_{test}$, including lips, abdominal and cardiac views. The baseline failed on unseen data during inference. However, the proposed methods are able to generate accurate shadow confidence maps because of the generalized shadow features obtained by the shadow-seg module. Furthermore, the “Lips” example shows that our method is capable of detecting weaker shadow regions that have not been annotated in manual segmentation.
This indicates that the confidence estimation network has learned general properties of shadow regions.

E. Transfer Function Performance

We show two illustrative examples in Fig. 7 to demonstrate the performance of the transfer function. Fig. 7 (c) and (d) show that the transfer function computes the confidence of each pixel in the false positive areas of the predicted segmentation, thereby extending a binary segmentation to a reference confidence map.

Fig. 7: Two examples showing the performance of the transfer function. Panels: (a) Image, (b) Weak GT, (c) $\hat{Y}^{S}$, (d) $Y^{C}$. (a) is the input image and (b) is the binary manual segmentation. (c) is the predicted segmentation before applying the transfer function while (d) is the corresponding reference confidence map after the transfer function.
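The exact transfer function is given by Eq. 2 and Eq. 3 of the paper and is not reproduced here. The sketch below only illustrates the general idea described above: true positive pixels keep full confidence, and each false positive pixel receives a confidence derived from its intensity relative to the mean intensity of the true positive region. The function name, the linear fall-off and the handling of edge cases are all assumptions made for illustration, not the authors' definition.

```python
import numpy as np


def reference_confidence(image, pred_mask, gt_mask):
    """Illustrative binary-to-confidence transfer (hypothetical form, not Eq. 2/3)."""
    img = np.asarray(image, dtype=float)
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)

    tp = pred & gt        # predicted shadow that is also annotated
    fp = pred & ~gt       # predicted shadow outside the coarse annotation

    conf = np.zeros_like(img)
    conf[tp] = 1.0        # assumption: full confidence on true positives

    if tp.any() and fp.any():
        i_mean = img[tp].mean()                  # mean intensity of TP regions
        scale = max(img.max() - i_mean, 1e-8)    # guard against division by zero
        # Assumption: FP pixels no brighter than the TP mean count as shadow;
        # brighter FP pixels lose confidence linearly up to the image maximum.
        excess = np.clip(img[fp] - i_mean, 0.0, None)
        conf[fp] = 1.0 - excess / scale
    return conf
```

Pixels outside the predicted segmentation stay at zero, so the output has the same support as the binary prediction but carries graded values in the false positive areas.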

F. Runtime

The RW algorithm [16] is implemented in Matlab (CPU Xeon E5-2643) while the previous work [22] and the proposed methods use Tensorflow and run on an Nvidia Titan X GPU. For the RW algorithm [16] and the previous work [22], the inference times are . s and . s respectively. Since the baseline, the proposed method and the proposed + AG method have the same confidence estimation networks, they have the same inference time, which is . s. A system-independent evaluation can be performed by estimating the required Giga-floating point operations (GFlops, fused multiply-adds) during inference. Our method requires ∼ . − GFlops (estimated from conv-layers including ReLU activation, Appendix I; supplementary materials are available in the supplementary files/multimedia tab) while the RW algorithm [16] requires ∼ − . GFlops (according to the built-in Matlab profiler) and the previous work [22] requires ∼ GFlops (estimated from conv-layers including ReLU activation, Appendix I).

V. APPLICATIONS

To verify the practical benefits of our method, we integrate the shadow confidence maps into different applications such as 2D US standard plane classification, multi-view image fusion and automated biometric measurements.
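Two of the applications below (standard plane classification and biometric measurement) pass the shadow confidence map to a network as an additional input channel. The following is a minimal sketch of how such a two-channel input could be assembled. The function names and the channels-last layout are assumptions, as is leaving the confidence map unnormalized. The whitening follows the description in the classification experiment, which divides by the variance.

```python
import numpy as np


def whiten(image):
    """Subtract the mean intensity and divide by the variance, as the text describes."""
    img = np.asarray(image, dtype=float)
    return (img - img.mean()) / max(img.var(), 1e-8)   # epsilon guard is an assumption


def stack_with_confidence(image, confidence):
    """Return an (H, W, 2) array: whitened image plus shadow confidence map."""
    img = whiten(image)
    conf = np.asarray(confidence, dtype=float)
    if img.shape != conf.shape:
        raise ValueError("image and confidence map must have the same shape")
    return np.stack([img, conf], axis=-1)
```

A network trained without confidence maps would take only the first channel, which matches the one-channel versus two-channel comparison in the experiments.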

A. Ultrasound Standard Plane Classification

Classifying 2D fetal standard planes is of great importance for early detection of abnormalities during mid-pregnancy [29]. However, distinguishing different standard planes is a challenging task and requires intense operator training and experience. Baumgartner et al. [3] have proposed a deep learning method for the detection of various fetal standard planes. We extend [3] and utilize shadow confidence maps to provide extra information for standard plane classification.

The data is the same as used in [3], which is a set of 2D ultrasound examinations between 18-22 weeks of gestation (iFIND Project 1). We select nine classes of standard planes including Three Vessel View (3VV), Four Chamber View (4CH), Abdominal, Brain View at the level of the cerebellum (Brain (cb.)), Brain view at posterior horn of the ventricle (Brain (tv.)), Femur, Lips, Left Ventricular Outflow Tract (LVOT) and Right Ventricular Outflow Tract (RVOT). The data set is split into training ( ), validation ( ), and testing ( ) images, similar to [3] (see appendix E for individual class split numbers). We use image whitening (subtracting the mean intensity and dividing by the variance) on each image to preprocess the whole data set.

Four networks based on SonoNet-32 [3] are trained and tested in order to verify the utility of shadow confidence maps. The first network is trained with the standard plane images from the training data. The next three networks are separately trained with standard plane images and their corresponding shadow confidence maps obtained by the baseline, the proposed method and the proposed + AG method. Thus, the training data in the first network has one channel while the remaining networks have two input channels. We train these networks for epochs with a learning rate of . .

Table IV shows the standard plane classification performance of the four networks. Networks with shadow confidence maps achieve higher classification accuracy on almost all classes (except Abdominal, LVOT and RVOT), as well as on average classification accuracy. $CM_{PAG}$ achieves the highest classification accuracies for five classes (3VV, 4CH, Brain(Cb.), Brain(Tv.) and Femur). Of particular note, the accuracies of the 3VV and 4CH classes increase over the baseline by . and . respectively. Five other classes (Abdominal, Brain(Cb.), Brain(Tv.), Femur and Lips) achieve near accuracy in both the baseline and $CM_{PAG}$, while LVOT and RVOT classes see modest decreases in $CM_{PAG}$ compared with the baseline, . and . respectively. Therefore, when comparing $CM_{PAG}$ with the baseline, the increase in average classification accuracy across all classes ( . to . ) is primarily driven by the large improvements in 3VV and 4CH. These results indicate that shadow confidence maps are able to provide extra information and improve the performance of another automatic medical image analysis algorithm.

TABLE IV: Classification accuracy (%) with vs. without shadow confidence maps. w/o CM is the network without shadow confidence maps while $CM_B$, $CM_P$, $CM_{PAG}$ are networks with shadow confidence maps from the baseline, the proposed method and the proposed + AG method. Best results are shown in bold.

Class: w/o CM, $CM_B$, $CM_P$, $CM_{PAG}$
Abdominal:
Brain(Tv.): 99.11 99.78 99.78
Femur: 99.04 99.81 99.81
Lips: 98.29 99.81
Avg.:

We additionally explore the importance of estimating confidence maps over binary segmentation of shadow regions. We compare the classification accuracy between using shadow confidence maps and directly using binary shadow segmentations generated from different methods. Fig. 8 shows that for classes with high classification accuracy such as Abdominal, Brain(Cb.), Brain(Tv.), Femur and Lips, integrating shadow confidence maps into the classification task yields minor improvement. However, for classes with relatively low classification accuracy such as 3VV and LVOT, classification with shadow confidence maps achieves higher accuracy than classification with only binary shadow segmentations.

B. Multi-view Image Fusion

Routine US screening is usually performed using a single 2D probe. However, the position of the probe and the resulting tomographic view through the anatomy have great impact on diagnosis. Zimmer et al. [39] proposed a multi-view image reconstruction method, which compounds different images of the same anatomical structure acquired from different view directions. They use a Gaussian weighting strategy to blend intensity information from different views. Here, we combine predicted shadow confidence maps from these multi-view images as additional image fusion weights to investigate if these confidence maps can further improve image quality.

The proposed method generally outperforms the baseline and the proposed + AG method, thus we only integrate the shadow confidence maps generated by the proposed method ($CM_P$) into the weighting strategy in [39]. In detail, the probability value of each pixel in a shadow confidence map is multiplied by the original weight of the same pixel computed in [39]. The generated new weights are normalized as described in [39] and then are used for image fusion. The data set in this experiment is the same as that used for [39].

Fig. 8: Comparison of classification accuracy (%) between using shadow confidence maps and using shadow segmentation. $BI_B$, $BI_P$ and $BI_{PAG}$ are networks with binary shadow segmentation from the baseline, the proposed method and the proposed + AG method. $CM_B$, $CM_P$ and $CM_{PAG}$ are the same networks as in Table IV.

Fig. 9 qualitatively shows that shadow confidence maps are able to improve the performance of US image fusion algorithms with different weighting strategies. Fig. 9 also shows the difference between adding two different types of confidence maps. These two types of confidence maps are generated by confidence estimation networks which are separately trained by either MSE or Sigmoid loss. Fig. 9 (a) to (d) illustrate image fusion results for the same case using different combinations of weighting strategies and loss functions. The difference maps indicate that shadow confidence maps are capable of improving image fusion performance. Fig. 9 (e) to (h) show image fusion results on four different cases. We randomly select two positively affected cases (Fig. 9 (e) and (f)) to show visual improvement. We additionally show two randomly selected examples (Fig. 9 (g) and (h)) that do not show perceptually significant improvements after adding shadow confidence maps. Quantitative evaluation for image fusion is not possible because there is no ground truth for US compounding tasks.

C. Automated Biometric Measurements

We integrate our shadow confidence maps into an automatic biometric measurement approach [32], and show the biometric measurement performance (measured by DICE) before and after adding shadow confidence maps.

Similar to the ultrasound standard plane classification, shadow confidence maps are integrated into a biometric estimation model described in [32] as an extra channel. Specifically, we train and test four fully convolutional networks with the same hyper-parameters as detailed in [32], and use the same ellipse fitting algorithm described therein. The first network is trained only on the image data used in [32]. The other three networks are trained with an additional input channel for shadow confidence maps that are separately generated by the baseline, the proposed, and the proposed + AG method.

TABLE V: Biometric measurement performance (DICE) with vs. without shadow confidence maps. Columns: w/o CM, $CM_B$, $CM_P$, $CM_{PAG}$.

We show three examples that are affected by shadows, and show their biometric measurement results in Table V. From this experiment, we find that biometric measurement performance is boosted by up to for problematic failure cases after adding shadow confidence maps. The average performance on the entire test data set stays almost the same since only a small proportion of the test images are affected by strong shadows, mainly because the images were acquired by highly skilled sonographers.

VI. DISCUSSION

In this paper, we propose a weakly supervised method to tackle the ill-defined problem of shadow detection in US. A naïve alternative to our method would be to train a fully supervised shadow segmentation network using pixel-wise annotation of shadow regions. However, pixel-wise annotation is infeasible because (a) accurately annotating a large number of images requires a vast amount of labour and time and has scanner dependencies, (b) binary annotations of shadow regions would lead to high inter-observer variability as shadow features are poorly defined, and (c) real-valued annotations of shadow regions are affected by subjectivity of annotators.

The performance of shadow region confidence estimation on different anatomical structures can be improved after integrating attention mechanisms. For example, the soft DICE is increased on $S_{test}$. This also results in improved ultrasound classification (Table IV). However, the quantitative results show that attention mechanisms are not essential. Networks with attention mechanisms are sometimes outperformed by networks without attention mechanisms. This may be caused by the way we integrate the attention mechanism. Since we add attention gates to encoders of all networks, the shadow features are emphasized for the shadow/shadow-free classification, which increases the difficulty of generalizing shadow features from classification to shadow segmentation.

We use MSE as the loss function of the confidence estimation network, but this loss can also be measured by other functions. Practically this choice has no effect on our quantitative results. However, in the image fusion task, we observe qualitative differences, which we show in Fig. 9 for Sigmoid cross-entropy loss.

In the standard plane classification task, we use only a subset of target standard planes compared to [3] because (1) we aim at verifying the usefulness of our method rather than improving performance of [3], (2) it is desirable to keep inter-class balance to avoid side-effects from under-represented classes, and (3) we chose standard planes for which [3] did not show optimal classification performance.

Fig. 9: Results of image fusion based on different weighting strategies and loss functions (Gaussian weighting vs. Intensity-and-Gaussian weighting (Int. & Gaussian), MSE loss vs. Sigmoid loss). Panels: (a) Gaussian, MSE; (b) Int. & Gaussian, MSE; (c) Gaussian, Sigmoid; (d) Int. & Gaussian, Sigmoid; (e) Gaussian, MSE, +; (f) Int. & Gaussian, Sigmoid, +; (g) Gaussian, Sigmoid, −; (h) Int. & Gaussian, Sigmoid, −. Note that the MSE loss and the Sigmoid loss are used for training of the confidence estimation network, which generates the shadow confidence maps. (a-d) are the image fusion results of the same case. (a,c) are the image fusion of Gaussian weighting with MSE loss and Sigmoid loss respectively and (b,d) are the results of Intensity-and-Gaussian weighting with MSE loss and Sigmoid loss respectively. (e-h) show the image fusion results on four different cases. (e,f) are examples for visually improved cases (+) showing notable positive differences of image fusion before and after adding $CM_P$, confirmed by our sonographers, while (g,h) are cases with less change (−). For each sub-figure (e.g. (a)), in the first column, the top row is the result without integrating a shadow confidence map $CM_P$ and the bottom row is the result with integrated $CM_P$. The second column shows the corresponding enlarged framed areas of the images. The third column is the difference map of corresponding framed areas. The color bar on the top (from low difference to high difference) shows that the more yellow/brighter, the higher the difference between the two framed areas.

$T(\cdot)$, as defined in Eq. 2 or Eq. 3, is one example of how prior knowledge can be integrated into the training process. If $T(\cdot)$ is chosen to be a continuous non-trainable function, e.g. quadratic or Gaussian, further weight relaxation can be introduced for joint refinement of both the shadow-seg module in Fig. 2a and the confidence estimation in Fig. 2c. However, since probabilistic ground truth does not exist for our applications, evaluation would become purely subjective, thus we decide to use direct but discontinuous integration of shadow-intensity assumptions for $T(\cdot)$.

Task-specific deep networks, e.g. for classification, may inadvertently learn to ignore weak shadows in some cases, but the learning capacity of shadow properties is unknown. By estimating confidence of shadow regions independently, our method guarantees that shadow property information is separately extracted and can be seamlessly integrated into other image analysis algorithms. With additional shadow property information, our method can improve steerability and interpretability for deep neural networks, and also enables extensions for non-deep learning algorithms. As shown in the experiments, prior knowledge provided by shadow confidence maps can improve the performance of various applications.

Binary shadow segmentation generated by the shadow-seg module (Fig. 1a) may provide shadow information to some extent. The easiest way to utilize shadow information is integrating this binary shadow segmentation into other applications. However, a binary segmentation of shadow regions cannot adequately describe the inherent ambiguity of acoustic shadows caused by various attenuation of sound waves.
Compared with binary shadow segmentation, a real-valued shadow confidence map is more reasonable to represent shadows, especially uncertain boundaries. With this more accurate representation, shadow confidence maps are able to improve the performance of other applications compared to using simple binary segmentation.

Corrupted images such as images with shadows caused by insufficient acoustic impedance gel are excluded in the training. This type of shadow can be regarded as background since signals can hardly reach the tissues, and corrupted images with these shadows contain incomplete anatomical information. Additionally, during scanning, regions of missing signals caused by insufficient gel can be discovered and avoided, in contrast to shadows generated by the interaction between signals and tissues. Therefore, our work excludes the corrupt images and focuses on shadows within valid anatomy. Nevertheless, Fig. 10 further shows that our proposed method is capable of indicating regions suffering from signal decay, especially on the boundaries.

Fig. 10: Qualitative performance of our methods for detecting signal lacking regions caused by insufficient gel. Panels: (a) Image, (b) Baseline, (c) Proposed, (d) Proposed + AG.

We use the coarse pixel-wise binary manual segmentation as ground truth for the shadow segmentation network and the transfer function since accurate manual annotation for shadow regions is unavailable as we discussed before. However, the inaccuracy of the coarse ground truth can hardly affect the quantitative assessments and the generation of reference confidence maps, because (1) DICE, recall, precision and MSE are still positively related to the effectiveness of the methods, (2) soft DICE and ICC are not related to the coarse ground truth, and (3) reference confidence maps are generated based on $I_{mean}$ (Eq. 1), which smooths the influence of coarse ground truth by using the mean intensity of TP regions. Additionally, we use human inter-observer variability, which is computed from two coarse binary manual annotations, to further fairly assess the effectiveness of the methods.

Acoustic shadows are caused by absorption, refraction or reflection of sound waves, each of which leads to a different degree of signal attenuation. Our method is predominantly trained on fetal US images containing shadow regions with an elongated shape and a relatively strong drop of intensity. These are the shadow features that we have observed in a majority of images in our data sets. However, our method might be less effective for shadows arising from other acquisition-related causes, which are less well represented in our current training data.

VII. CONCLUSION

We propose a CNN-based, weakly supervised method for automatic confidence estimation of shadow regions in 2D US images. By learning and transferring shadow features from weakly-labelled images, our method can predict dense, continuous shadow confidence maps directly from input images.

We evaluate the performance of our method by comparing it to the state-of-the-art and human performance. Our experiments show that our method is quantitatively better than the state-of-the-art and human annotation for shadow segmentation. For confidence estimation of shadow regions, our method is also qualitatively better than the state-of-the-art and is more consistent than human annotation. More importantly, our method is capable of detecting disjoint multiple shadow regions without being limited by the correlation between adjacent pixels as in [16], and the heuristically selected hyperparameters in [22].

We further demonstrate that our method improves the performance of other automatic image analysis algorithms when integrating the obtained shadow confidence maps into other US applications such as standard plane classification, image fusion and automated biometric measurements.

Our method has a short inference time, which enables effective real-time feedback of local image properties. This feedback can guide inexperienced sonographers to find diagnostically valuable viewing directions and pave the way for standardized image acquisition training.

ACKNOWLEDGMENT

We thank the volunteers, sonographers and experts for providing manually annotated datasets and NVIDIA for their GPU donations. This work was supported by the Wellcome Trust IEH Award [102431], EPSRC grants (EP/L016796/1, EP/P001009/1), ERC 319456, and the Wellcome/EPSRC Center for Medical Engineering [WT 203148/Z/16/Z]. The research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Center based at Guy’s and St Thomas’ NHS Foundation Trust, King’s College London and the NIHR Clinical Research Facility (CRF) at Guy’s and St Thomas’. Q. Meng is funded by the CSC-Imperial Scholarship. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

REFERENCES

[1] J. Abbott and F. Thurstone. Acoustic speckle: Theory and experimental analysis. Ultrasonic Imaging, 1(4):303–324, 1979.
[2] P. Anbeek, K. L. Vincken, G. S. van Bochove, M. J. van Osch, and J. van der Grond. Probabilistic segmentation of brain tissue in MR imaging. NeuroImage, 27(4):795–804, 2005.
[3] C. Baumgartner, K. Kamnitsas, J. Matthew, T. P. Fletcher, S. Smith, L. M. Koch, B. Kainz, and D. Rueckert. SonoNet: Real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans. Med. Imag., 36:2204–2215, 2017.
[4] C. Baumgartner, L. Koch, K. Tezcan, J. Ang, and E. Konukoglu. Visual feature attribution using Wasserstein GANs. CoRR, abs/1711.08998, 2017.
[5] F. Berton, F. Cheriet, M. M., and C. Laporte. Segmentation of the spinous process and its acoustic shadow in vertebral ultrasound images. Computers in Biology and Medicine, 72:201–211, 2016.
[6] B. Bouhemad, M. Zhang, Q. Lu, and J. Rouby. Clinical review: bedside lung ultrasound in critical care practice. Critical Care, 11(1):205, 2007.
[7] A. Broersen, M. Graaf, J. Eggermont, R. Wolterbeek, P. Kitslaar, J. Dijkstra, J. Bax, J. Reiber, and A. Scholte. Enhanced characterization of calcified areas in intravascular ultrasound virtual histology images by quantification of the acoustic shadow: validation against computed tomography coronary angiography. Int J Cardiovasc Imaging, 32:543–552, 2015.
[8] Centre for Workforce Intelligence. Securing the future workforce supply sonography workforce review. 2017.
[9] H. Choi, J. Lee, S. Kim, and S. Park. Speckle noise reduction in ultrasound images using a discrete wavelet transform-based image fusion technique. Bio-Medical Materials and Engineering, 26(1):1587–1597, 2015.
[10] P. Coupé, P. Hellier, C. Kervrann, and C. Barillot. Nonlocal means-based speckle filtering for ultrasound images. IEEE Trans. Image Process., 18(10):2221–2229, 2009.
[11] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[12] M. K. Feldman, S. Katyal, and M. S. Blackwood. US artifacts. RadioGraphics, 29:1179–1189, 2009.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR’16, 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[15] P. Hellier, P. Coupé, X. Morandi, and D. Collins. An automatic geometrical and statistical method to detect acoustic shadows in intraoperative ultrasound brain images. Med Image Anal, 14(2):195–204, 2010.
[16] A. Karamalis, W. Wein, T. Klein, and N. Navab. Ultrasound confidence maps using random walks. Med Image Anal, 16(6):1101–1112, 2012.
[17] H. Kim and T. Varghese. Hybrid spectral domain method for attenuation slope estimation. Ultrasound Med Biol, 34:1808–1819, 2008.
[18] T. Klein and W. Wells. RF ultrasound distribution-based confidence maps. In MICCAI’15, pages 595–602. Springer, 2015.
[19] F. W. Kremkau and K. Taylor. Artifacts in ultrasound imaging. J Ultrasound Med, 5(4):227–237, 1986.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS’12, pages 1097–1105, 2012.
[21] T. Lange, N. Papenberg, S. Heldmann, J. Modersitzki, B. Fischer, H. Lamecker, and P. Schlag. 3D ultrasound-CT registration of the liver using combined landmark-intensity information. Int J Comput Assist Radiol Surg, 4(1):79–88, 2009.
[22] Q. Meng, C. Baumgartner, M. Sinclair, J. Housden, M. Rajchl, A. Gomez, B. Hou, N. Toussaint, V. Zimmer, J. Tan, et al. Automatic shadow detection in 2D ultrasound images. In

MICCAI Workshop on PIPPI , 2018.[23] NHS.

Fetal anomaly screening programme: programmehandbook June 2015 . Public Health England, 2015.[24] J. A. Noble. Ultrasound image segmentation and tissuecharacterization.

Proc Inst Mech Eng H. , 224(2):307–316, 2010.[25] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. P.Heinrich, K. Misawa, K. Mori, S. G. McDonagh, N. Y.Hammerla, B. Kainz, et al. Attention u-net: Learningwhere to look for the pancreas.

CoRR , abs/1804.03999,2018.[26] N. Pawlowski, S. I. Ktena, M. Lee, B. Kainz, D. Rueck- ert, B. Glocker, and M. Rajchl. Dltk: State of the artreference implementations for deep learning on medicalimages. arXiv preprint arXiv:1711.06853 , 2017.[27] G. P. Penney, J. M. Blackall, M. S. Hamady, T. Sab-harwal, A. Adam, and D. J. Hawkes. Registration offreehand 3d ultrasound and magnetic resonance liverimages.

Med Image Anal , 8:81–91, 2004.[28] M. Rajchl, M. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. Rutherford, J. Ha-jnal, B. Kainz, et al. Deepcut: Object segmentation frombounding box annotations using convolutional neuralnetworks.

IEEE Trans. Med. Imag. , 36(2):674–683, 2017.[29] L. J. Salomon, Z. Alﬁrevic, V. Berghella, C. Bilardo,E. Hernandez-Andrade, S. L. Johnsen, K. Kalache,K. Leung, G. Malinger, H. Munoz, et al. Practiceguidelines for performance of the routine midtrimesterfetal ultrasound scan.

Ultrasound Obst Gyn , 37:116–126,2011.[30] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang.Disan: Directional self-attention network for rnn/cnn-freelanguage understanding. In

AAAI , 2018.[31] P. E. Shrout and J. L. Fleiss. Intraclass correlations: Usesin assessing rater reliability.

Psychol Bull. , 86(2):420–428, 1979.[32] M. Sinclair, C. Baumgartner, J. Matthew, W. Bai, J. Cer-rolaza, Y. Li, S. Smith, C. Knight, B. Kainz, J. Hajnal,et al. Human-level performance on automatic headbiometrics in fetal ultrasound using fully convolutionalneural networks. In

EMBC’18 , 2018.[33] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Ried-miller. Striving for simplicity: The all convolutional net.

CoRR , abs/1412.6806, 2014.[34] R. Steel, T. L. Poepping, R. S.Thompson, andC. Macaskill. Origins of the edge shadowing artefactin medical ultrasound imaging.

Ultrasound Med Biol ,39:1153–1162, 2005.[35] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, andB. Recht. The marginal value of adaptive gradientmethods in machine learning.

CoRR , abs/1705.08292,2018.[36] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu.Residual dense network for image super-resolution. In

CVPR’18 , 2018.[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Tor-ralba. Learning deep features for discriminative localiza-tion. In

CVPR’16 , pages 2921–2929. IEEE, 2016.[38] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpairedimage-to-image translation using cycle-consistent adver-sarial networks. In

ICCV’17 , 2017.[39] A. V. Zimmer, A. Gomez, Y. Noh, N. Toussaint,B. Khanal, R. Wright, L. Peralta, V. M. Poppel, E. Skel-ton, J. Matthew, et al. Multi-view image reconstruction:Application to fetal ultrasound compounding. In

MICCAIWorkshop on PIPPI , 2018.

APPENDIX

A. Shadow/Shadow-free Classification Network

In this section, we use Python-inspired pseudo code to present the detailed network architecture of the shadow/shadow-free classification network (shown in Fig. 11). The conv layer function performs a standard 2D convolution without an activation layer, and the global average pool performs spatial averaging over the feature maps. The residual block is realized by DLTK [26].

Fig. 11: Shadow/shadow-free classification network architecture.
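The two operations named above can be illustrated with a minimal pure-Python sketch. It is not the network shown in Fig. 11 and does not use DLTK; the function names `conv_layer` and `global_average_pool`, the single-channel input, and the zero padding are assumptions made for illustration only.

```python
def conv_layer(image, kernel, stride=1, padding=0):
    """Single-channel 2D convolution (cross-correlation) without activation.

    Hypothetical helper: image and kernel are lists of lists of numbers,
    the kernel is square, and the input is zero-padded on all four sides.
    """
    k = len(kernel)
    h, w = len(image), len(image[0])
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    h_out = (h + 2 * padding - k) // stride + 1
    w_out = (w + 2 * padding - k) // stride + 1
    out = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            acc = 0.0
            for u in range(k):
                for v in range(k):
                    acc += padded[i * stride + u][j * stride + v] * kernel[u][v]
            row.append(acc)  # linear output: no ReLU is applied here
        out.append(row)
    return out


def global_average_pool(feature_maps):
    """Average every feature map over its spatial extent (one value per map)."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in feature_maps]


image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
fmap = conv_layer(image, kernel, stride=1, padding=1)
pooled = global_average_pool([fmap])
```

In a classification network of this kind, the pooled vector (one value per feature map) would typically feed the final shadow/shadow-free decision, which is why the pooling removes the spatial dimensions entirely.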

B. Shadow Segmentation Network

Fig. 12 shows the detailed architecture of the segmentation network. The conv layer function performs a standard 2D convolution without an activation layer. The residual block and the upsample concat (the upsampling and concatenation layer) are realized by DLTK [26].

Fig. 12: Shadow segmentation network architecture.
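To make the upsampling and concatenation step concrete, the following sketch upsamples low-resolution decoder feature maps and stacks them with encoder skip features along the channel axis. It assumes nearest-neighbour upsampling and a channels-first list layout; the DLTK layer used in the paper may upsample differently, and the names `upsample_nearest` and `upsample_concat` are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling of one 2D feature map (assumed scheme)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def upsample_concat(low_res, skip, factor=2):
    """Upsample each low-resolution map, then concatenate with the skip maps.

    Both inputs are lists of 2D feature maps (channels first). The result
    holds the upsampled channels followed by the skip channels.
    """
    upsampled = [upsample_nearest(fmap, factor) for fmap in low_res]
    for fmap in upsampled:
        if len(fmap) != len(skip[0]) or len(fmap[0]) != len(skip[0][0]):
            raise ValueError("spatial sizes must match after upsampling")
    return upsampled + skip


low_res = [[[1, 2], [3, 4]]]                   # one 2x2 decoder feature map
skip = [[[0] * 4 for _ in range(4)]]           # one 4x4 encoder feature map
merged = upsample_concat(low_res, skip)
```

The concatenated output has the spatial size of the skip features and the sum of both channel counts, which is the property a U-Net-style decoder relies on.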

C. Confidence Estimation Network

Fig. 13 shows the detailed architecture of the shadow confidence estimation network. Similarly, the conv layer function performs a standard 2D convolution without an activation layer. The residual block and the upsample concat (the upsampling and concatenation layer) are realized by DLTK [26].

Fig. 13: Shadow confidence estimation network architecture.

D. Alternative Examples of Shadow Confidence Estimation

We show an alternative group of examples for the confidence estimation of shadow regions (shown in Fig. 14). These examples include fetal brain from $M_{test}$, and cardiac, lips and kidney from $S_{test}$. Similar to Fig. 6 in the main paper, Fig. 14 shows that the baseline fails to handle unseen data while the proposed method and the proposed + AG method are able to predict pixel-wise confidence of multiple shadow regions. These examples demonstrate that the shadow-seg module is able to generalize the shadow representation and transfer it from the shadow/shadow-free classification task to a confidence estimation task.

E. Data in Ultrasound Classification

Table VI shows the exact amount of data used in the application of 2D US standard plane classification (Section V, Part A). The amount of training data is almost the same for each class, so that class balance is maintained during training.

F. Class Confusion Matrix

Fig. 17 additionally shows the class confusion matrix of 2D US standard plane classification in Section V, Part A. This class confusion matrix demonstrates that fewer 3VV images are misclassified as RVOT images and fewer 4CH images are misclassified as LVOT images after adding shadow confidence maps. However, as we discussed in the Discussion section, the shadow confidence maps can also introduce redundant information for similar anatomical structures in this classification task. For example, more LVOT images are wrongly classified as RVOT images and more RVOT images are classified as 3VV images.

[Fig. 14 panels: (a) Image, (b) Baseline, (c) Proposed, (d) Proposed + AG, (e) Weak GT]

Fig. 14: Shadow confidence maps of different methods on various anatomical US images. Rows I-IV show four examples of shadow confidence estimation: Brain (first), Cardiac (second), Lips (third) and Kidney (fourth). Columns (b-d) are shadow confidence maps from the baseline, the proposed method and the proposed method with attention gate (Proposed + AG). Column (e) is the binary map of manual segmentation.

TABLE VI: Summary of the Data Set used in Ultrasound Standard Plane Classification Task.

Class Training Validation Testing

Sum

G. Examples for Image Fusion

Fig. 15 shows more examples of the multi-view image fusion task, including the original multi-view images. Columns (a-b) of Fig. 15 show that the original images contain strong shadow artifacts that can affect the anatomical analysis. The image fusion task aims to use complementary information from images with different views to reduce artifacts and increase anatomical information. Columns (e-f) enlarge the areas within the bounding boxes in columns (c-d). Column (g) shows the difference masks between columns (e) and (f). The difference masks clearly indicate the improved performance of image fusion after adding shadow confidence maps, both for the Gaussian weighting strategy and for the Intensity and Gaussian weighting strategy.

H. Examples for Biometric Measurement

We visualize the biometric measurement of the three examples shown in Table V. Fig. 16 demonstrates that, for the cases affected by shadow artifacts, the segmentation performance (“EI seg DICE”) is improved after adding shadow confidence maps as an extra channel. From the first row to the third row in Fig. 16, these three samples are respectively

[Fig. 15 panels: (a) Image Component 1, (b) Image Component 2, (c) Without $CM_P$, (d) With $CM_P$, (e) Enlarged, (f) Enlarged, (g) D map; color bar from low difference to high difference]

Fig. 15: The results of the multi-view image fusion. (a-b) The multi-view images, (c) image fusion without shadow confidence maps ($CM_P$), (d) image fusion with shadow confidence maps ($CM_P$), (e-f) enlarged areas of (c-d) respectively, and (g) difference maps of (e) and (f). Rows (1-2) use MSE loss to train networks for generating shadow confidence maps while Rows (3-4) use Sigmoid loss. Row 2 uses the Gaussian weighting for image fusion while Rows (1, 3, 4) use the Intensity and Gaussian weighting. The color bar on the top shows that the more yellow/brighter the color, the higher the difference between the two framed areas.

I. Equations for estimating floating point operations (Flops) for convolutional layers

We use Eq. 5 to estimate the required Flops for convolution layers including ReLU activation. Here, $W$ and $H$ are the width and height of the input image respectively, $K$ is the kernel size, $P$ is the padding, $S$ is the stride and $F$ is the number of filters. $n$ is the size of the convolution layer (channels $\cdot K \cdot K$).

$$\mathrm{Flops} \approx \left(\frac{W - K + 2P}{S} + 1\right) \cdot \left(\frac{H - K + 2P}{S} + 1\right) \cdot n \cdot (n - 1) \cdot F + F \cdot W \cdot H \tag{5}$$

Eq. 5 evaluates the number of Flops ($n$ multiplications and $n - 1$ additions) for $W \times H$ filter convolutions adjusted for padding $P$ and stride $S$. ReLU activation is assumed to be $F \cdot W \cdot H$ Flops (one comparison and one multiplication).
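As a worked illustration, the estimate in Eq. 5 can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the factor depending on $n$ is read as $n \cdot (n - 1)$, the division by the stride is taken as integer division, and the function names `conv_output_size` and `conv_flops` are hypothetical.

```python
def conv_output_size(size, kernel, padding, stride):
    """Output extent along one axis: (size - K + 2P) / S + 1 (integer division assumed)."""
    return (size - kernel + 2 * padding) // stride + 1


def conv_flops(width, height, channels, kernel, padding, stride, filters):
    """Approximate Flops of one convolution layer with ReLU, following Eq. 5."""
    n = channels * kernel * kernel               # size of the convolution layer
    out_w = conv_output_size(width, kernel, padding, stride)
    out_h = conv_output_size(height, kernel, padding, stride)
    conv_term = out_w * out_h * n * (n - 1) * filters
    relu_term = filters * width * height         # ReLU cost as assumed in Eq. 5
    return conv_term + relu_term


# Example: 8x8 single-channel input, 3x3 kernel, padding 1, stride 1, 2 filters.
example_flops = conv_flops(width=8, height=8, channels=1, kernel=3,
                           padding=1, stride=1, filters=2)
print(example_flops)  # 8 * 8 * 9 * 8 * 2 + 2 * 8 * 8 = 9344
```

Summing `conv_flops` over all convolution layers of a network gives a rough total that can be used to compare the computational cost of different architectures.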

[Fig. 16 panels: (a) w/o CM, (b) $CM_B$, (c) $CM_P$, (d) $CM_{PAG}$]

Fig. 16: Biometric measurement with vs. without shadow confidence maps. The yellow circles refer to the ground truth, the green curves are segmentation predictions, and the red circles are the ellipses fitted to the segmentation predictions.


Fig. 17: Class confusion matrices for 2D ultrasound standard plane classification. Upper row left: the class confusion matrix without shadow confidence maps. Upper row right: the class confusion matrix with the shadow confidence maps generated by the baseline. Lower row left: the class confusion matrix with shadow confidence maps obtained by the proposed method. Lower row right: the class confusion matrix with the shadow confidence maps produced by the proposed ++