Weakly Supervised Estimation of Shadow Confidence Maps in Fetal Ultrasound Imaging
Qingjie Meng, Matthew Sinclair, Veronika Zimmer, Benjamin Hou, Martin Rajchl, Nicolas Toussaint, Ozan Oktay, Jo Schlemper, Alberto Gomez, James Housden, Jacqueline Matthew, Daniel Rueckert, Julia Schnabel, Bernhard Kainz
Qingjie Meng, Matthew Sinclair, Veronika Zimmer, Benjamin Hou, Martin Rajchl, Nicolas Toussaint, Ozan Oktay, Jo Schlemper, Alberto Gomez, James Housden, Jacqueline Matthew, Daniel Rueckert, Fellow, IEEE, Julia A. Schnabel, Senior Member, IEEE, and Bernhard Kainz, Senior Member, IEEE
Abstract: Detecting acoustic shadows in ultrasound images is important in many clinical and engineering applications. Real-time feedback of acoustic shadows can guide sonographers to a standardized diagnostic viewing plane with minimal artifacts and can provide additional information for other automatic image analysis algorithms. However, automatically detecting shadow regions using learning-based algorithms is challenging because pixel-wise ground truth annotation of acoustic shadows is subjective and time consuming. In this paper we propose a weakly supervised method for automatic confidence estimation of acoustic shadow regions. Our method is able to generate a dense shadow-focused confidence map. In our method, a shadow-seg module is built to learn general shadow features for shadow segmentation, based on global image-level annotations as well as a small number of coarse pixel-wise shadow annotations. A transfer function is introduced to extend the obtained binary shadow segmentation to a reference confidence map. Additionally, a confidence estimation network is proposed to learn the mapping between input images and the reference confidence maps. This network is able to predict shadow confidence maps directly from input images during inference. We use evaluation metrics such as DICE and interclass correlation to verify the effectiveness of our method. Our method is more consistent than human annotation, and outperforms the state-of-the-art quantitatively in shadow segmentation and qualitatively in confidence estimation of shadow regions. We further demonstrate the applicability of our method by integrating shadow confidence maps into tasks such as ultrasound image classification, multi-view image fusion and automated biometric measurements.
Index Terms: Ultrasound imaging, deep learning, weakly supervised, shadow detection, confidence estimation.
Q. Meng, M. Sinclair, B. Hou, M. Rajchl, O. Oktay, J. Schlemper, D. Rueckert and B. Kainz are with the Biomedical Image Analysis Group, Department of Computing, Imperial College London, London SW7 2AZ, UK (e-mail: [email protected]).
V. Zimmer, N. Toussaint, A. Gomez, J. Housden, J. Matthew and J. A. Schnabel are with the School of Biomedical Engineering and Imaging Sciences, King’s College London, London WC2R 2LS, UK.
To appear in IEEE TRANSACTIONS ON MEDICAL IMAGING, https://ieeexplore.ieee.org/document/8698843, DOI: 10.1109/TMI.2019.2913311.

I. INTRODUCTION
Ultrasound (US) imaging is a medical imaging technique based on reflection and scattering of high-frequency sound in tissues. Compared with other imaging techniques (e.g. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT)), US imaging has various advantages including portability, low cost, high temporal resolution and real-time imaging capability. With these advantages, US is an important medical imaging modality that is utilized to examine a range of anatomical structures in both adults and fetuses. In most countries, US imaging is an essential part of clinical routine for pregnancy health screening between 11 and 22 weeks of gestation [29].
Although US imaging is capable of providing real-time images of anatomy, diagnostic accuracy is limited by the relatively low image quality. Artifacts such as noise [1], distortions [34] and acoustic shadows [12] make interpretation challenging and highly dependent on experienced operators. These artifacts are unavoidable in clinical practice due to the low energies used and the physical nature of sound wave propagation in human tissues. Better hardware and advanced image reconstruction algorithms have been developed to reduce speckle noise [9, 10]. Prior anatomical expertise [21] and extensive sonographer training are the only ways to handle distortions and shadows to date.
Sound-opaque occluders, including bones and calcified tissues, block the propagation of sound waves by strongly absorbing or reflecting them during scanning. The regions behind these sound-opaque occluders return little to no reflections to the US transducer. Thus these areas have low intensity but very high acoustic impedance gradients at their boundaries (e.g. Fig. 1(a) left column).
Reducing acoustic shadows and correctly interpreting images containing shadows rely heavily on sonographer experience. Experienced sonographers avoid shadows by moving the probe to a more preferable viewing direction during scanning or, if no shadow-free viewing direction can be found, by compounding a mental map from iterative acquisitions at different orientations.
With less anatomical information in shadow regions, especially when shadows cut through the anatomy of interest, images containing strong shadows can be problematic for automatic real-time image analysis methods such as biometric measurements [32], anatomy segmentation [5] and US image classification [3]. Moreover, the shortage of experienced sonographers [8] exacerbates the challenges of accurate US image-based screening and diagnostics. Therefore, shadow-aware US image analysis is greatly needed and would be beneficial both for engineers who work on medical image analysis and for sonographers in clinical practice.
Contribution: We propose a novel method based on convolutional neural networks (CNNs) to automatically estimate pixel-wise confidence maps of acoustic shadows in 2D US images. Our method learns an initial latent space of shadow
regions from images consisting of multiple anatomies and with global image-level labels (“has shadow” and “shadow-free”), e.g. Fig. 1(a). The basic latent space is then estimated by learning from fewer images of a single anatomy (fetal brain) with coarse pixel-wise shadow annotations (approximately of the images with global image-level labels), e.g. Fig. 1(b). The resulting latent space is then refined by learning shadow intensity distributions using fetal brain images so that the latent space is suitable for confidence estimation of shadow regions. By using shadow intensity information, our method can detect more shadow regions than the coarse manual segmentation, especially relatively weak shadow regions.
The proposed training process is able to build a direct mapping between input images and the corresponding shadow confidence maps in any given anatomy, which allows real-time application through direct inference.
In contrast to our preliminary work [22], which uses separate, heuristically linked components, here we establish a pipeline to make full use of existing data sets and annotations. During inference our method can predict both a binary shadow segmentation and a dense shadow-focused confidence map. The shadow segmentation is not limited by hyperparameters such as thresholds in [22], and the segmentation accuracy as well as shadow confidence maps are greatly improved compared to the state-of-the-art.
We have demonstrated in [22] that shadow confidence maps can improve the performance of an automatic biometric measurement task. In this study, we further evaluate the usefulness of the shadow confidence estimation for other automatic image analysis algorithms such as a US image classification task and a multi-view image fusion task.

Fig. 1: Examples of data sets. (a) Images with global image-level labels (“has shadow” and “shadow-free”), and (b) images with coarse pixel-wise annotations from two annotators.

Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].
Related work
Automatic US shadow detection: Acoustic shadows have a significant impact on US image quality, and thus a serious effect on the robustness and accuracy of image processing methods. In clinical literature, US artifacts including shadows have been well studied and reviewed [6, 19, 24]. However, the shadow problem is not well covered in automated US image analysis literature. Automatic estimation of acoustic shadows has rarely been the focus within the medical image analysis community.
Identifying shadow regions in US images has been utilized as a preprocessing step for extracting relevant image content and improving image analysis accuracy in some applications. Penney et al. [27] have identified shadow regions by thresholding the accumulated intensity along each scanning beam line. Afterwards, these shadow regions have been masked out from US images for US to MRI hepatic image registration. Instead of excluding shadow regions, Kim et al. [17] focused on accurate attenuation estimation, and aimed to use attenuation properties for determination of the anatomical properties which can help diagnose diseases. They proposed a hybrid attenuation estimation method that combines spectral difference and spectral shift methods to reduce the influence of local spectral noise and backscatter variations in Radio Frequency (RF) US data. To detect shadow regions in B-Mode scans directly and automatically, Hellier et al. [15] used the probe’s geometric properties and statistically modelled the US B-Mode cone. Compared with previous statistical shadow detection methods such as [27], their method can automatically estimate the probe’s geometry as well as other hyperparameters, and has shown improvements in 3D reconstruction, registration and tracking. However, the method can only detect a subset of ‘deep’ acoustic shadows because of the probe geometry-dependent sampling strategy.
To improve the accuracy of US attenuation estimation and shadow detection, Karamalis et al.
[16] proposed a more general solution using the Random Walks (RW) algorithm to predict a per-pixel confidence of US images. In [16], confidence maps represent the uncertainty of US images resulting from shadows, and thus show the acoustic shadow regions. The confidence maps obtained by this work can improve the accuracy of US image processing tasks, such as intensity-based US image reconstruction and multi-modal registration. However, such confidence maps are sensitive to US transducer settings and limited by the US formation process. Klein et al. [18] have further extended the RW method to generate distribution-based confidence maps and applied it to RF US data. This method is more robust since the confidence prediction is no longer intensity-based.
Some studies have utilized acoustic shadow detection as additional information in their pipeline for other US image processing tasks. Broersen et al. [7] combined acoustic shadow detection for the characterization of dense calcium tissue in intravascular US virtual histology, and Berton et al. [5] automatically and simultaneously segment vertebrae, spinous process and acoustic shadow in US images for a better assessment of scoliosis progression. In these applications, acoustic shadow detection is task-specific, and is mainly based on heuristic image intensity features as well as special anatomical constraints.
The aforementioned literature relies heavily on manually selected relevant features, intensity information or a probe-specific US formation process. With the advances in deep learning, US image analysis algorithms have gained better semantic image interpretation abilities. However, current deep learning segmentation methods require a large amount of pixel-wise, manually labelled ground truth images. This is challenging in the US imaging domain because of (a) a lack of experienced annotators and (b) weakly defined structural features that cause a high inter-observer variability.
Weakly supervised image segmentation:
Weakly supervised automatic detection of class differences has been explored in other imaging domains (e.g. MRI). For example, Baumgartner et al. [4] proposed to use a generative adversarial network (GAN) to highlight class differences only from global image-level labels (Alzheimer’s disease or healthy). We used a similar idea in [22] and initialized potential shadow areas based on saliency maps [33] from a classification task between images containing shadows and those without. Inspired by recent weakly supervised deep learning methods that have drastically improved semantic image analysis [20, 28, 37] and to overcome the limitations of [22], we develop a confidence estimation algorithm that takes advantage of both types of weak labels, namely global image-level labels and a sparse set of coarse pixel-wise labels. Our method is able to predict dense, shadow-focused confidence maps directly from input US images, effectively in real time.

II. METHOD
In our proposed method, a shadow-seg module is first trained to produce a semantic segmentation of shadow regions. In this module, shadow features are initialized by training a shadow/shadow-free classification network and generalized by training a shadow segmentation network. After obtaining the shadow segmentation, a transfer function is used to extend the predicted binary shadow segmentation to a confidence map based on the intensity distribution within suspected shadow regions. This confidence map is regarded as a reference confidence map for the next confidence estimation network. Lastly, a confidence estimation network is trained to learn the mapping between the input shadow-containing US images and the corresponding reference confidence maps. The outline of the training process is shown in Fig. 2. During inference, we use the confidence estimation network to predict a dense shadow confidence map directly from the input image. Additionally, we integrate attention mechanisms [30] into our method to enhance the shadow features extracted by the networks.
Shadow-seg Module: We propose a shadow-seg module to extract generalized shadow features for a large range of shadow types in fetal US images under limited weak manual annotations. Since shadow regions have different shapes, various intensity distributions and uncertain edges, the pixel-wise annotation of shadow regions is time consuming and relies heavily on the annotator’s experience (e.g. the differing annotations in Fig. 1(b)). This generally results in manual annotations of limited quantity and quality. Compared with pixel-wise shadow annotations, global image-level labels (“has shadow” and “shadow-free” in our case) are easier to obtain, and shadow images with global image-level labels can contain a larger variety of shadow types. Therefore, we use a shadow-seg module that combines unreliable pixel-wise annotations and global image-level labels as weak annotations.
The proposed shadow-seg module contains two tasks, (1) shadow/shadow-free classification using image-level labels, and (2) shadow segmentation that uses few coarse pixel-wise manual annotations ( of the global image-level labels). Shadow features can be extracted during simple shadow/shadow-free classification and subsequently optimized for the more challenging shadow segmentation task. In our case, shadow features extracted by the classification network cover various shadow types in a range of anatomical structures. These shadow features become suitable for the shadow segmentation after being optimized by a shadow segmentation network.
Network Architecture: We build two sub-networks from residual-blocks [14] as shown in Fig. 3. Residual-blocks can reduce the training error when using deeper networks and support better network optimization [14]. They have been widely used for various image processing algorithms [13, 36, 38].
The first and initially trained network is a shadow/shadow-free classification network that learns to distinguish images containing shadows from shadow-free images, and thus learns the defining features of acoustic shadow. This classification network consists of a feature encoder followed by a global average pooling layer. The feature encoder uses six residual-blocks (Fig. 3) to extract shadow features that define shadow-containing images in the classifier. We refer to $l = 1$ as the label of the shadow-containing class and $l = 0$ as the label of the shadow-free class. Image set $X^C = \{x^C_1, x^C_2, \dots, x^C_K\}$ and their corresponding labels $L = \{l_1, l_2, \dots, l_K\}$ s.t. $l_i \in \{0, 1\}$ are used to train the feature encoder as well as the global average pooling layer. We use softmax cross-entropy loss as the cost function $L_C$ between the predicted labels and the true labels.
Representative shadow features extracted by the feature encoder of the shadow/shadow-free classification network are then optimized by the shadow segmentation network with a limited number of densely segmented US images. The feature encoder of the segmentation network has the same architecture as that of the classification network. The weights of the feature encoder in the segmentation network are initialized with those of the classification network and are further fine-tuned for the segmentation task. Therefore, the extracted shadow features are suitable for segmentation in addition to classification. The decoder of the segmentation network is symmetrical to the feature encoder. Feature layers from the feature encoder are concatenated to the corresponding layers in the decoder by skip connections.
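The classification head described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the linear layer (`weights`, `bias`) that maps pooled features to two class scores, the array shapes and all function names are hypothetical and are not taken from the paper, which specifies only a feature encoder followed by global average pooling and a softmax cross-entropy loss $L_C$.

```python
import numpy as np


def global_average_pool(features):
    """Average each channel of a (channels, height, width) feature map."""
    return np.asarray(features, dtype=float).mean(axis=(1, 2))


def softmax_cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; label 0 = shadow-free, 1 = shadow."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def classification_loss(features, weights, bias, label):
    """Hypothetical head: pooled features -> linear layer -> cross-entropy."""
    pooled = global_average_pool(features)  # shape (channels,)
    logits = weights @ pooled + bias        # shape (2,)
    return softmax_cross_entropy(logits, label)
```

Pooling over the spatial axes makes the head independent of the input resolution, which is one reason the same encoder can later be reused for segmentation.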
Here, we denote the image set used to train the shadow segmentation with $X^S = \{x^S_1, x^S_2, \dots, x^S_M\}$ and the corresponding pixel-wise manual segmentation with $Y^S = \{y^S_1, y^S_2, \dots, y^S_M\}$. The shadow segmentation provides a pixel-wise binary prediction $\hat{Y}^S = \{\hat{y}^S_1, \hat{y}^S_2, \dots, \hat{y}^S_M\}$ for shadow regions and the cost function $L_{seg}$ is the softmax cross-entropy between $\hat{Y}^S$ and $Y^S$.

Fig. 2: Training framework of the proposed method. (a) The shadow-seg module containing a shadow/shadow-free classification network and a shadow segmentation network. (b) The transfer function that expands a binary mask to a reference confidence map. (c) The confidence estimation network which establishes direct mapping between input images and confidence maps.

Fig. 3: The architecture of the residual-block. $BN(x)$ refers to a batch normalization layer and $f(x)$ is a convolutional layer.

Transfer Function: Binary masks lack information about inherent uncertainties at the boundaries of shadow regions. Therefore, we use a transfer function to extend the binary segmentation prediction to a confidence map, which is more appropriate to describe shadow regions. The main task of the transfer function is to learn the intensity distribution of shadow regions so as to estimate confidence of pixels in false positive (FP) regions of the predicted binary shadow segmentation. This transfer function is built and only used during training to provide reference confidence maps for the confidence estimation network.
When comparing the manual segmentation $y^S$ and the predicted segmentation $\hat{y}^S$ of shadow regions in image $x$, we define the true positive (TP) regions $x_{TP}$ as shadow regions with full confidence, $C_{x_{ij}} = 1$, $x_{ij} \in x_{TP}$. Here, $C_{x_{ij}}$ refers to the confidence of pixel $x_{ij}$ being shadow.
For each pixel $x_{ij}$ in the FP regions ($x_{FP}$), the confidence of belonging to a shadow region is computed by a transfer function $T(x_{ij} \mid x_{ij} \in x_{FP})$ based on the intensity of the pixel ($I_{x_{ij}}$) and the mean intensity of $x_{TP}$ ($I_{mean}$). $I_{mean}$ is defined in Eq. 1. With weak signals in the shadow regions, the average intensity of shadow pixels is lower than the maximum intensity ($I_{max} = \max(x)$) but not lower than the minimum intensity ($I_{min} = \min(x)$), that is $I_{mean} \in [I_{min}, I_{max})$.
\[
I_{mean} =
\begin{cases}
\mathrm{mean}(y^S \cap \hat{y}^S), & y^S \cap \hat{y}^S \neq \emptyset, \\
\mathrm{mean}(y^S), & y^S \cap \hat{y}^S = \emptyset.
\end{cases}
\tag{1}
\]
The transfer function $T(\cdot)$ computing $C_{x_{ij}}$ for pixels in $x_{FP}$ is defined according to the range of $I_{mean}$. For $I_{mean} \in (I_{min}, I_{max})$, $T(\cdot)$ is shown in Eq. 2. For $I_{mean} = I_{min}$, $T(\cdot)$ is shown in Eq. 3.
\[
T(x_{ij} \mid x_{ij} \in x_{FP}) =
\begin{cases}
\dfrac{I_{x_{ij}} - I_{min}}{I_{mean} - I_{min}}, & I_{min} \leq I_{x_{ij}} < I_{mean}, \\[2ex]
\dfrac{I_{max} - I_{x_{ij}}}{I_{max} - I_{mean}}, & I_{mean} < I_{x_{ij}} \leq I_{max}, \\[2ex]
1, & I_{x_{ij}} = I_{mean},
\end{cases}
\tag{2}
\]
\[
T(x_{ij} \mid x_{ij} \in x_{FP}) =
\begin{cases}
\dfrac{I_{x_{ij}} - I_{mean}}{I_{max} - I_{mean}}, & I_{mean} < I_{x_{ij}}, \\[2ex]
1, & I_{x_{ij}} = I_{mean},
\end{cases}
\tag{3}
\]
After using the transfer function, the binary map of the predicted segmentation $\hat{y}^S$ is extended to a confidence map $y^C$. $y^C$ acts as a reference (“ground truth”) for the training of the next confidence estimation network.
Confidence Estimation Network: After obtaining reference confidence maps from the predicted binary segmentation, a confidence estimation network is trained to map an image with shadows ($x$) to the corresponding reference confidence map ($y^C$).
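As an illustration of how such reference confidences are obtained, the following NumPy sketch evaluates the transfer function of Eq. 2 for the intensities of false-positive pixels. It covers only the case $I_{min} < I_{mean} < I_{max}$; the special case $I_{mean} = I_{min}$ (Eq. 3) is not handled. The function name, the argument names and the example intensities are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def transfer_confidence(intensity, i_min, i_mean, i_max):
    """Confidence of false-positive pixels following Eq. 2.

    Assumes i_min < i_mean < i_max and i_min <= intensity <= i_max.
    """
    if not (i_min < i_mean < i_max):
        raise ValueError("this sketch requires i_min < i_mean < i_max")
    intensity = np.asarray(intensity, dtype=float)
    below = (intensity - i_min) / (i_mean - i_min)  # rises from 0 to 1
    above = (i_max - intensity) / (i_max - i_mean)  # falls from 1 to 0
    # The 'above' branch evaluates to 1 at intensity == i_mean, so the
    # pixel whose intensity equals the mean shadow intensity gets 1.
    return np.where(intensity < i_mean, below, above)


# Hypothetical normalised intensities of five false-positive pixels.
fp_intensities = np.array([0.0, 0.1, 0.2, 0.6, 1.0])
confidence = transfer_confidence(fp_intensities, i_min=0.0, i_mean=0.2, i_max=1.0)
```

In this sketch the confidence peaks at the mean shadow intensity and decays linearly towards both the darkest and the brightest intensity in the image.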
This confidence estimation network can be independently used to directly predict a dense shadow confidence map for an input image during inference.
The confidence estimation network consists of a down-sampling encoder, a symmetric up-sampling decoder, and skip connections between feature layers from the encoder and the decoder at different resolution levels. Both the encoder and the decoder are composed of six residual-blocks. The cost function of the confidence estimation network is defined as the mean squared error between the predicted confidence map $\hat{Y}^C$ and the reference confidence map $Y^C$ ($L_{conf} = \| \hat{Y}^C - Y^C \|$).
Attention Gates: Attention gates are believed to generally highlight relevant features according to image context and thus improve network performance for medical image analysis [25]. We integrate attention gates [30] into our approach to explore whether attention mechanisms can further improve the confidence estimation of shadow regions in 2D ultrasound. In our case, we connect the self-attention gating modules proposed in [25] to the feature maps before the last two down-sampling operations in the encoders of all three networks. For the shadow/shadow-free classification network, the global average pooling layer is modified when adding this self-attention gating module. In detail, as shown in Fig. 4, global average pooling is applied separately to the two attention-gated feature maps as well as the original last feature map to obtain three average feature maps. These three average feature maps are then concatenated, followed by a fully connected layer to compute the final classification prediction.

Fig. 4: The architecture of the shadow/shadow-free classification network with attention mechanism. $f_R(\cdot)$ refers to residual-blocks. $g(\cdot)$ refers to a global average pooling layer.

III. IMPLEMENTATION
All the residual-blocks used in the proposed method are implemented as proposed in [26], which provides a convenient interface to realize residual-blocks.
We optimize the different modules separately and consecutively in three steps. First we train ∼ epochs for the parameters of the shadow/shadow-free classification network, and then ∼ epochs for the pixel-wise shadow segmentation network. After obtaining a well-trained shadow segmentation network, we train the confidence estimation network for another 700 epochs.
For all networks, we use Stochastic Gradient Descent (SGD) with momentum to update the parameters, since SGD has better generalization capability than adaptive optimizers [35]. The parameters of the optimizer are momentum = 0. , with a learning rate of − . We apply L2 regularization to all weights during training to help prevent network over-fitting. The scale of the regularizer is set as − . The training batch size is 25 and our networks are trained on an Nvidia Titan X GPU with 12 GB of memory.

IV. EVALUATION
The proposed method is trained and evaluated using two data sets, (1) a multi-class data set consisting of 13 classes of 2D US fetal anatomy with global image-level labels (“has shadow” or “shadow-free”) and including 48 non-brain images with manual shadow segmentations, and (2) a single-class data set containing 2D US fetal brain images with coarse pixel-wise manual shadow segmentations. To reduce the variance in parameter estimation during training, we assign a relatively large share of each data set to training. In the multi-class data set, we use of the data for training, for validation and the 48 non-brain images for testing, while in the single-class data set we use of the data for training, for validation and for testing.
To verify the effectiveness of the proposed method and the importance of the shadow/shadow-free classification network in the shadow-seg module, we compare the variants of our method to a baseline which only contains a shadow segmentation network and a confidence estimation network.
We use standard measurements such as Dice coefficient (DICE) [11], recall, precision and Mean Squared Error (MSE) for shadow segmentation evaluation, and use the Interclass Correlation (ICC) [31] as well as soft DICE [2] for confidence estimation evaluation. In order to verify the performance of our method, we also compute quantitative measurements between the chosen manual annotation (weak ground truth) and another manual annotation from a different annotator to show the human performance for the shadow detection task. Lastly, we show the practical benefits of shadow confidence maps for different applications such as a standard plane classification task, an image fusion task from multiple views and a segmentation task for automatic biometric measurements.
Multi-class Data Set: This data set consists of ∼ . k − weeks (iFIND Project). Eight different ultrasound systems of identical make and model (GE Voluson E8) were used for the acquisitions.
Various image settings based on different sonographers’ personal preferences for scanning are included in this data set. The images have been classified by expert observers as containing strong shadow, being shadow-free, or being corrupted, e.g. by poor tissue contact caused by a lack of acoustic impedance gel. Corrupted images ( < ) have been excluded as discussed in Section VI with Fig. 10.
Single-class Data Set: This data set comprises 643 fetal brain images and has no overlap with the multi-class data set. Shadow regions in this data set have been coarsely segmented by two bio-engineering students using trapezoid-shaped segmentation masks for individual shadow regions.
Training Data:
Validation and Test Data: The remaining 491 shadow images and 502 clear images in the multi-class data set are used for testing and validation. Here, a subset ($M_{test}$) comprising 48 randomly selected images from the 491 shadow images is used for testing. These 48 images contain various fetal anatomies (except fetal brain), such as abdominal, kidney and cardiac views. Shadow regions in these images have been manually segmented to provide ground truth. The remaining 443 shadow images and 502 clear images are used for the validation of the shadow/shadow-free classification. Similarly, the remaining 143 fetal brain images of the single-class data set are split into two subsets, where $S_{val}$ contains 50 images for validation of the shadow segmentation, the binary-to-confidence transformation and the confidence estimation, and $S_{test}$ contains 93 images for testing. For all images from the single-class data set, we randomly choose one of the two existing groups of annotations as ground truth for training, validation and testing.

A. Baseline
The baseline method is used to demonstrate that the shadow-seg module is important for capturing generalized shadow features and obtaining accurate confidence estimation of shadow regions. It comprises a shadow segmentation network and a confidence estimation network, which have the same architectures as shown in Fig. 2. We first train the shadow segmentation network in the baseline method using the 500 fetal brain images from the single-class data set. After applying the transfer function to the binary segmentation prediction, we train the confidence estimation network for a direct mapping between shadow images and reference confidence maps.
B. Evaluation Metrics
In this section, we define the aforementioned statistical metrics and the computation of the inter-observer variability between two pixel-wise manual annotations of shadow regions.
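The overlap metrics and the ICC defined in this subsection can be sketched in NumPy as follows. This is an illustration rather than the evaluation code used in the paper: the function names are hypothetical, the handling of empty masks is left to the caller, and `icc_two_raters` assumes that Eq. 4 is the two-way formulation of [31] for two annotators, with one row per pixel.

```python
import numpy as np


def overlap_metrics(pred, ref):
    """DICE, recall and precision for two non-empty binary masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    inter = np.logical_and(pred, ref).sum()
    dice = 2.0 * inter / (pred.sum() + ref.sum())
    recall = inter / ref.sum()
    precision = inter / pred.sum()
    return dice, recall, precision


def soft_dice(pred_conf, ref_conf):
    """DICE with real-valued confidences in place of binary labels."""
    pred_conf = np.asarray(pred_conf, dtype=float)
    ref_conf = np.asarray(ref_conf, dtype=float)
    return 2.0 * (pred_conf * ref_conf).sum() / (pred_conf.sum() + ref_conf.sum())


def icc_two_raters(a, b):
    """ICC of Eq. 4 for two annotations; every pixel is one target (row)."""
    x = np.stack([np.ravel(a), np.ravel(b)], axis=1).astype(float)  # (N, 2)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1, keepdims=True)
    col_means = x.mean(axis=0, keepdims=True)
    r_ms = k * ((row_means - grand) ** 2).sum() / (n - 1)  # rows (targets)
    c_ms = n * ((col_means - grand) ** 2).sum() / (k - 1)  # columns (raters)
    resid = x - row_means - col_means + grand
    m_ms = (resid ** 2).sum() / ((n - 1) * (k - 1))        # interaction
    return (r_ms - m_ms) / (r_ms + m_ms + 2.0 * (c_ms - m_ms) / n)
```

For binary inputs `soft_dice` reduces to the ordinary DICE, so the same function can score both segmentations and confidence maps.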
DICE, Recall, Precision and MSE:
We refer to the binary prediction of shadow segmentation as $P$ and the binary manual segmentation as $G$. $\mathrm{DICE} = 2|P \cap G| / (|P| + |G|)$, $\mathrm{Recall} = |P \cap G| / |G|$, $\mathrm{Precision} = |P \cap G| / |P|$ and $\mathrm{MSE} = |P - G|$.
ICC: We use ICC as proposed by [31] (Eq. 4) to measure the agreement between two annotations. Each pixel in an image is regarded as a target. $R_{MS}$, $C_{MS}$ and $M_{MS}$ are respectively the mean squared values of rows, columns and interaction. $N$ is the number of targets.
\[
\mathrm{ICC} = \frac{R_{MS} - M_{MS}}{R_{MS} + M_{MS} + 2 \times (C_{MS} - M_{MS}) / N}.
\tag{4}
\]
Soft DICE: Soft DICE can be used to handle probability maps. We use real values in the DICE definition to compute soft DICE between the predicted shadow confidence maps $\hat{Y}^C$ and reference confidence maps $Y^C$.
Human Performance: We consider another binary segmentation of shadow regions from a different annotator as $Y^S_{new}$. The computed metrics between $Y^S_{new}$ and the chosen manual segmentation $Y^S$ reflect the human inter-observer variability.

C. Shadow Segmentation Analysis
We compare the segmentation performance of the state-of-the-art ([16] and [22]), the proposed methods and the human performance. This comparison is used to examine the importance of the shadow-seg module for the shadow segmentation, and further, for the confidence estimation of shadow regions.

TABLE I: Shadow segmentation performance (µ ± σ) of different methods on test data $S_{test}$. RW and RW* are the Random Walk algorithm [16] with different sets of parameters. Pilot [22] is our previous work. Baseline, the proposed method (abbreviated as “Proposed”) and the proposed method with attention gates (abbreviated as “Proposed + AG” in the rest of the paper) are our proposed methods. Anno* refers to the human inter-observer variability, thus expected human performance on the shadow segmentation task. Best results are shown in bold.
Methods, µ (σ)     DICE       Recall     Precision  MSE
RW [16]            (0.099)    (0.2047)   (0.0675)   (7.6734)
RW* [16]           (0.1123)   (0.2196)   (0.0771)   (8.3484)
Pilot [22]         (0.1398)   (0.201)    (0.1352)   (14.837)
Baseline           (0.212)    (0.2255)   (0.2326)   (12.2885)
Proposed           (0.1988)   (0.2131)   (0.2255)   (11.867)
Proposed + AG      (0.2014)   (0.2169)   (0.2247)   (12.6317)
Anno*              (0.2635)   (0.3196)   (0.3124)   (23.0339)

Table I shows DICE, recall, precision and MSE of different methods on $S_{test}$. RW and RW* are results of [16] with various parameters. For fair comparison, we run 24 tests on both test sets using the RW algorithm with different parameter combinations (α ∈ { , , }; β ∈ { , }; γ ∈ { . , . , . , . }). With a negative relationship between the likelihood of shadows and the confidence in [16] and to consistently compare all methods, we use $1 - S$ instead of $S$ to display the results of RW and RW* in all comparison experiments. Here $S$ is a confidence map obtained by [16]. To generate shadow segmentation, we threshold the obtained confidence maps by T ∈ { . , . } so that pixels with confidence higher than T are shadows.
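The conversion of a Random Walks confidence map $S$ into a binary shadow mask, as described above, amounts to inverting the map and thresholding it. The sketch below illustrates this step only. It assumes that the inverted map is $1 - S$ for confidences in [0, 1]; the function name and the example threshold of 0.5 are placeholders, not the values used in the experiments.

```python
import numpy as np


def shadow_mask_from_confidence(confidence, threshold):
    """Invert an RW confidence map S (values in [0, 1]) and threshold it."""
    shadow_likelihood = 1.0 - np.asarray(confidence, dtype=float)
    # Pixels whose inverted confidence exceeds the threshold count as shadow.
    return shadow_likelihood > threshold


# Hypothetical 2x2 confidence map; low confidence marks likely shadow.
rw_confidence = np.array([[0.9, 0.2],
                          [0.1, 0.8]])
mask = shadow_mask_from_confidence(rw_confidence, threshold=0.5)
```

The resulting mask can then be compared with a manual segmentation using the same overlap metrics as for the learned methods.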
We chose the parameters and the threshold that achieve the highest average DICE on all samples in both test sets. The chosen RW parameters and threshold are α = 1; β = 90; γ = 0. ; T = 0. . We also applied the parameters and the threshold in [16] (α = 2; β = 90; γ = 0. ; T = 0. ) in our experiments, which is denoted as RW∗. Note that we use the public Matlab code of [16] (http://campar.in.tum.de/Main/AthanasiosKaramalisCode) to test RW and RW∗.

As shown in Table I, the baseline, the proposed method and the proposed + AG greatly outperform the state-of-the-art. Among all methods, the proposed method achieves the highest DICE. Recall and precision of the proposed method are respectively . and . higher than those of the baseline, while the MSE of the proposed method is . lower than that of the baseline. After adding attention gates to the proposed method (the proposed + AG), the shadow segmentation performance is nearly the same as that of the proposed method without attention gates, but better than the baseline. Additionally, the relatively low scores of Anno∗ indicate high inter-observer variability and how ambiguous human annotation can be for this task. A mean DICE of . shows that the proposed method performs better and more consistently than human annotation.

TO APPEAR IN IEEE TRANSACTIONS ON MEDICAL IMAGING DOI: 10.1109/TMI.2019.2913311

TABLE II: Comparison of shadow segmentation performance (µ ± σ) of different methods on test data M_test. Best results are shown in bold. The method names are the same as in Table I.

Methods               DICE      Recall    Precision  MSE
RW [16]        µ (σ)  (0.0855)  (0.1241)  (0.0592)   (7.866)
RW∗ [16]       µ (σ)  (0.0871)  (0.1528)  (0.0602)   (7.5643)
Pilot [22]     µ (σ)  (0.1079)  (0.137)   (0.1308)   (17.0491)
Baseline       µ (σ)  (0.1798)  (0.2233)  (0.1712)   (18.3773)
Proposed       µ (σ)  (0.155)   (0.2335)  (0.1357)   (17.2147)
Proposed + AG  µ (σ)  (0.1544)  (0.2035)  (0.1562)   (17.6628)

We further conduct the same experiments on another non-brain test data set M_test to verify the feature generalization ability of the shadow-seg module. Results are shown in Table II. Similarly, the proposed weakly supervised methods and the baseline outperform all state-of-the-art methods.

TABLE III: The p-value of the Proposed method vs. Pilot [22] and of the Proposed method vs. Baseline. Statistically significant results (p < . ) are shown in bold. † indicates that the proposed method performs worse; otherwise the proposed method is better.

S_test       DICE  Recall  Precision  MSE
Pilot [22]
Baseline
M_test       DICE  Recall  Precision  MSE
Pilot [22] †
Baseline

To statistically evaluate the difference among the various methods, we use the paired sample t-test on the two test data sets S_test and M_test. Here, we compare the evaluation metrics (DICE, recall, precision and MSE) of the proposed method and the Pilot [22], because the Pilot [22] outperforms the other state-of-the-art methods in Table I and Table II. We also compare the evaluation metrics of the proposed method and the baseline. The corresponding p-values are shown in Table III. Using . as the threshold for statistical significance, Table III shows that the proposed method greatly improves the shadow segmentation performance compared with the Pilot [22] and the baseline.

Fig. 5: Results of shadow confidence estimation. (a) Soft DICE of the baseline, the proposed method and the proposed method with attention gates (proposed + AG) on S_test and M_test. (b) Interclass correlation (ICC) of the baseline, the proposed method and the proposed + AG on S_test and M_test. Additionally, the ICC of the human performance is shown as Anno∗ for S_test.

D. Shadow Confidence Estimation
In this part, we evaluate the performance of the confidence estimation by comparing the shadow confidence maps of different methods.

Fig. 5 (a) shows the soft DICE evaluation on S_test and M_test. The proposed method and the proposed + AG method achieve higher soft DICE on both test sets than the baseline, and are more robust than the baseline on M_test. The baseline fails in this experiment on M_test because it is unable to obtain accurate shadow segmentation in the previous step (shown in Table II). With less accurate shadow segmentation, the shadow confidence estimation can hardly establish a valid mapping between input images and reference confidence maps. This demonstrates that the shadow-seg module is beneficial for shadow segmentation and confidence estimation.

We additionally evaluate the reliability of the shadow confidence estimation by measuring the agreement between the decision of each method and the manual segmentation. Regarding the baseline, the proposed and the proposed + AG as different judges and the manual segmentation of shadow regions as a contrasting judge, we use the ICC to measure the agreement between each different judge and the contrasting judge. Fig. 5 (b) shows the ICC evaluation on the two test data sets, which indicates that the proposed method and the proposed + AG are more consistent in estimating shadow confidence maps than the baseline. When considering another manual segmentation of shadow regions as an extra judge, we can evaluate the agreement of human annotations. Fig. 5 (b) shows that the ICC of two human annotations (shown as Anno∗) is normally . . The proposed method with an ICC of . is more consistent than annotations from two human annotators.

Fig. 6 compares the shadow confidence maps of the state-of-the-art methods and the proposed methods. RW and RW∗ have the same parameters as used for Table I.
The shadow confidence maps of the baseline, the proposed method and the proposed + AG method are generated directly from input shadow images by confidence estimation networks. Overall, the proposed method and the proposed + AG method achieve more visually reasonable shadow confidence estimation than the baseline and the state-of-the-art on the different anatomical structures shown in Fig. 6. The proposed method and the proposed + AG method are able to highlight multiple shadow regions, while the RW algorithm shows limitations for most cases, especially for disjoint shadow regions.

Fig. 6: Confidence estimation of shadow regions using the state-of-the-art methods and our methods. Panels: (a) Image, (b) RW [16], (c) RW∗ [16], (d) Pilot [22], (e) Baseline, (f) Proposed, (g) Proposed + AG, (h) Weak GT. Rows I-IV show four examples: Brain (top), Lip (second), Abdominal (third) and Cardiac (bottom). Column (a) is the original US image. Columns (b-d) are shadow confidence maps from the RW algorithm [16] and our previous work [22]. Columns (e-g) show the shadow confidence maps of the baseline, the proposed method and the proposed + AG method. Column (h) is the binary map of the manual shadow segmentation. The color bar on the top of this figure shows that the more yellow/brighter (closer to 1), the higher the confidence of being shadow regions.

Row I in Fig. 6 shows a fetal brain image from S_test. The confidence estimations of shadow regions from the baseline, the proposed method and the proposed + AG method are similarly accurate, since we use fetal brain images to train the confidence estimation networks in these three methods. These outperform [16] and [22]. Rows II-IV in Fig. 6 show shadow confidence maps of non-brain anatomy from M_test, including lips, abdominal and cardiac views. The baseline failed on unseen data during inference. However, the proposed methods are able to generate accurate shadow confidence maps because of the generalized shadow features obtained by the shadow-seg module. Furthermore, the “Lips” example shows that our method is capable of detecting weaker shadow regions that have not been annotated in the manual segmentation.
This indicates that the confidence estimation network has learned general properties of shadow regions.

E. Transfer Function Performance
We show two illustrative examples in Fig. 7 to demonstrate the performance of the transfer function. Fig. 7 (c) and (d) show that the transfer function computes the confidence of each pixel in the false positive areas of the predicted segmentation, thereby extending a binary segmentation to a reference confidence map.

Fig. 7: Two examples showing the performance of the transfer function. (a) is the input image and (b) is the binary manual segmentation (Weak GT). (c) is the predicted segmentation Ŷ^S before applying the transfer function, while (d) is the corresponding reference confidence map Y^C after the transfer function.
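To make the idea of extending a binary segmentation to a confidence map concrete, the sketch below assigns full confidence to true positive pixels and an intensity-dependent confidence to false positive pixels, using the mean intensity of the true positive region as a reference. The specific mapping is a hypothetical stand-in chosen for illustration only; it is not Eq. 2 or Eq. 3 of this paper, and the function name `reference_confidence`, the assumption that intensities lie in [0, 1], and the choice to leave all pixels outside the prediction at zero are assumptions of this sketch.

```python
import numpy as np


def reference_confidence(image, pred_mask, weak_gt):
    """Illustrative (hypothetical) transfer from a binary prediction to a
    confidence map. Intensities are assumed to be normalised to [0, 1]."""
    image = np.asarray(image, dtype=float)
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(weak_gt, dtype=bool)

    tp = pred & gt      # predicted shadow that agrees with the weak annotation
    fp = pred & ~gt     # predicted shadow outside the weak annotation

    conf = np.zeros_like(image)
    conf[tp] = 1.0
    if not tp.any():
        # The reference intensity is undefined without a TP region.
        return conf

    i_mean = image[tp].mean()  # mean intensity of the TP region
    # Hypothetical rule: FP pixels as dark as the TP mean get confidence 1,
    # and confidence falls linearly to 0 as the pixel approaches intensity 1.
    denom = max(1.0 - i_mean, 1e-8)
    conf[fp] = np.clip((1.0 - image[fp]) / denom, 0.0, 1.0)
    return conf


if __name__ == "__main__":
    img = np.array([[0.1, 0.3], [0.55, 0.9]])
    pred = np.array([[1, 1], [1, 0]], dtype=bool)
    gt = np.array([[1, 1], [0, 0]], dtype=bool)
    print(reference_confidence(img, pred, gt))
```

In this toy case the two annotated pixels keep confidence 1, the darker false positive pixel receives an intermediate value, and the pixel outside the prediction stays at 0.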
F. Runtime
The RW algorithm [16] is implemented in Matlab (CPU Xeon E5-2643), while the previous work [22] and the proposed methods use Tensorflow and run on an Nvidia Titan X GPU. For the RW algorithm [16] and the previous work [22], the inference times are . s and . s respectively. Since the baseline, the proposed method and the proposed + AG method have the same confidence estimation networks, they have the same inference time, which is . s. A system-independent evaluation can be performed by estimating the required giga floating point operations (GFlops, fused multiply-adds) during inference. Our method requires ∼ . − GFlops (estimated from conv-layers including ReLU activation, Appendix I; supplementary materials are available in the supplementary files/multimedia tab), while the RW algorithm [16] requires ∼ − . GFlops (according to the built-in Matlab profiler) and the previous work [22] requires ∼ GFlops (estimated from conv-layers including ReLU activation, Appendix I).

V. APPLICATIONS
To verify the practical benefits of our method, we integrate the shadow confidence maps into different applications such as 2D US standard plane classification, multi-view image fusion and automated biometric measurements.
A. Ultrasound Standard Plane Classification
Classifying 2D fetal standard planes is of great importance for early detection of abnormalities during mid-pregnancy [29]. However, distinguishing different standard planes is a challenging task and requires intense operator training and experience. Baumgartner et al. [3] have proposed a deep learning method for the detection of various fetal standard planes. We extend [3] and utilize shadow confidence maps to provide extra information for standard plane classification.

The data is the same as used in [3], which is a set of 2D ultrasound examinations between 18-22 weeks of gestation (iFIND Project 1). We select nine classes of standard planes: Three Vessel View (3VV), Four Chamber View (4CH), Abdominal, Brain view at the level of the cerebellum (Brain (cb.)), Brain view at the posterior horn of the ventricle (Brain (tv.)), Femur, Lips, Left Ventricular Outflow Tract (LVOT) and Right Ventricular Outflow Tract (RVOT). The data set is split into training ( ), validation ( ), and testing ( ) images, similar to [3] (see appendix E for individual class split numbers). We use image whitening (subtracting the mean intensity and dividing by the variance) on each image to preprocess the whole data set.

Four networks based on SonoNet-32 [3] are trained and tested in order to verify the utility of shadow confidence maps. The first network is trained with the standard plane images from the training data. The next three networks are separately trained with standard plane images and their corresponding shadow confidence maps obtained by the baseline, the proposed method and the proposed + AG method. Thus, the training data of the first network has one channel while the remaining networks have two input channels. We train these networks for epochs with a learning rate of . .

TABLE IV: Classification accuracy (%) with vs. without shadow confidence maps. w/o CM is the network without shadow confidence maps, while CM_B, CM_P and CM_PAG are networks with shadow confidence maps from the baseline, the proposed method and the proposed + AG method. Best results are shown in bold.

Class  w/o CM  CM_B  CM_P  CM_PAG
Abdominal
Brain(Tv.) 99.11 99.78 99.78
Femur 99.04 99.81 99.81
Lips 98.29 99.81
Avg.

Table IV shows the standard plane classification performance of the four networks. Networks with shadow confidence maps achieve higher classification accuracy on almost all classes (except Abdominal, LVOT and RVOT), as well as higher average classification accuracy. CM_PAG achieves the highest classification accuracies for five classes (3VV, 4CH, Brain(Cb.), Brain(Tv.) and Femur). Of particular note, the accuracies of the 3VV and 4CH classes increase over the baseline by . and . respectively. Five other classes (Abdominal, Brain(Cb.), Brain(Tv.), Femur and Lips) achieve near accuracy in both the baseline and CM_PAG, while the LVOT and RVOT classes see modest decreases in CM_PAG compared with the baseline, . and . respectively. Therefore, when comparing CM_PAG with the baseline, the increase in average classification accuracy across all classes ( . to . ) is primarily driven by the large improvements in 3VV and 4CH. These results indicate that shadow confidence maps are able to provide extra information and improve the performance of another automatic medical image analysis algorithm.

We additionally explore the importance of estimating confidence maps over binary segmentation of shadow regions. We compare the classification accuracy between using shadow confidence maps and directly using binary shadow segmentations generated by the different methods. Fig. 8 shows that for classes with high classification accuracy, such as Abdominal, Brain(Cb.), Brain(Tv.), Femur and Lips, integrating shadow confidence maps into the classification task yields minor improvement. However, for classes with relatively low classification accuracy, such as 3VV and LVOT, classification with shadow confidence maps achieves higher accuracy than classification with only binary shadow segmentations.

B. Multi-view Image Fusion
Routine US screening is usually performed using a single 2D probe. However, the position of the probe and the resulting tomographic view through the anatomy have a great impact on diagnosis. Zimmer et al. [39] proposed a multi-view image reconstruction method, which compounds different images of the same anatomical structure acquired from different view directions. They use a Gaussian weighting strategy to blend intensity information from different views. Here, we combine predicted shadow confidence maps from these multi-view images as additional image fusion weights to investigate whether these confidence maps can further improve image quality.

The proposed method generally outperforms the baseline and the proposed + AG method, thus we only integrate the shadow confidence maps generated by the proposed method (CM_P) into the weighting strategy in [39]. In detail, the probability value of each pixel in a shadow confidence map is multiplied with the original weight of the same pixel computed in [39]. The new weights are normalized as described in [39] and are then used for image fusion. The data set in this experiment is the same as used for [39].

Fig. 8: Comparison of classification accuracy (%) between using shadow confidence maps and using shadow segmentation. BI_B, BI_P and BI_PAG are networks with binary shadow segmentation from the baseline, the proposed method and the proposed + AG method. CM_B, CM_P and CM_PAG are the same networks as in Table IV.

Fig. 9 qualitatively shows that shadow confidence maps are able to improve the performance of US image fusion algorithms with different weighting strategies. Fig. 9 also shows the difference between adding two different types of confidence maps. These two types of confidence maps are generated by confidence estimation networks that are trained separately with either the MSE or the Sigmoid loss. Fig. 9 (a) to (d) illustrate image fusion results for the same case using different combinations of weighting strategies and loss functions. The difference maps indicate that shadow confidence maps are capable of improving image fusion performance. Fig. 9 (e) to (h) show image fusion results on four different cases. We randomly select two positively affected cases (Fig. 9 (e) and (f)) to show visual improvement. We additionally show two randomly selected examples (Fig. 9 (g) and (h)) that do not show perceptually significant improvements after adding shadow confidence maps. Quantitative evaluation for image fusion is not possible because a ground truth is lacking for US compounding tasks.

C. Automated Biometric Measurements
We integrate our shadow confidence maps into an automatic biometric measurement approach [32], and show the biometric measurement performance (measured by DICE) before and after adding shadow confidence maps.

Similar to the ultrasound standard plane classification, shadow confidence maps are integrated into the biometric estimation model described in [32] as an extra channel. Specifically, we train and test four fully convolutional networks with the same hyper-parameters as detailed in [32], and use the same ellipse fitting algorithm described therein. The first network is trained only on the image data used in [32]. The other three networks are trained with an additional input channel for shadow confidence maps that are separately generated by the baseline, the proposed and the proposed + AG method.

TABLE V: Biometric measurement performance (DICE) with vs. without shadow confidence maps.

w/o CM  CM_B  CM_P  CM_PAG

We show three examples that are affected by shadows, and show their biometric measurement results in Table V. From this experiment, we find that biometric measurement performance is boosted by up to for problematic failure cases after adding shadow confidence maps. The average performance on the entire test data set stays almost the same, since only a small proportion of the test images are affected by strong shadows, mainly because the images were acquired by highly skilled sonographers.

VI. DISCUSSION
In this paper, we propose a weakly supervised method to tackle the ill-defined problem of shadow detection in US. A naïve alternative to our method would be to train a fully supervised shadow segmentation network using pixel-wise annotation of shadow regions. However, pixel-wise annotation is infeasible because (a) accurately annotating a large number of images requires a vast amount of labour and time and has scanner dependencies, (b) binary annotations of shadow regions would lead to high inter-observer variability as shadow features are poorly defined, and (c) real-valued annotations of shadow regions are affected by the subjectivity of annotators.

The performance of shadow region confidence estimation on different anatomical structures can be improved after integrating attention mechanisms. For example, the soft DICE is increased on S_test. This also results in improved ultrasound classification (Table IV). However, the quantitative results show that attention mechanisms are not essential. Networks with attention mechanisms are sometimes outperformed by networks without attention mechanisms. This may be caused by the way we integrate the attention mechanism. Since we add attention gates to the encoders of all networks, the shadow features are emphasized for the shadow/shadow-free classification, which increases the difficulty of generalizing shadow features from classification to shadow segmentation.

We use MSE as the loss function of the confidence estimation network, but this loss can also be measured by other functions. Practically, this choice has no effect on our quantitative results. However, in the image fusion task, we observe qualitative differences, which we show in Fig. 9 for the Sigmoid cross-entropy loss.

In the standard plane classification task, we use only a subset of target standard planes compared to [3] because (1) we aim at verifying the usefulness of our method rather than improving the performance of [3], (2) it is desirable to keep inter-class balance to avoid side-effects from under-represented classes, and (3) we chose standard planes for which [3] did not show optimal classification performance.

Fig. 9: Results of image fusion based on different weighting strategies and loss functions (Gaussian weighting vs. Intensity-and-Gaussian weighting (Int. & Gaussian), MSE loss vs. Sigmoid loss). Panels: (a) Gaussian, MSE; (b) Int. & Gaussian, MSE; (c) Gaussian, Sigmoid; (d) Int. & Gaussian, Sigmoid; (e) Gaussian, MSE, +; (f) Int. & Gaussian, Sigmoid, +; (g) Gaussian, Sigmoid, −; (h) Int. & Gaussian, Sigmoid, −. Note that the MSE loss and the Sigmoid loss are used for training the confidence estimation network, which generates the shadow confidence maps. (a-d) are the image fusion results of the same case. (a,c) are the image fusion results of Gaussian weighting with MSE loss and Sigmoid loss respectively, and (b,d) are the results of Intensity-and-Gaussian weighting with MSE loss and Sigmoid loss respectively. (e-h) show the image fusion results on four different cases. (e,f) are examples of visually improved cases (+) showing notable positive differences of image fusion before and after adding CM_P, confirmed by our sonographers, while (g,h) are cases with less change (−). For each sub-figure (e.g. (a)), in the first column, the top row is the result without integrating a shadow confidence map CM_P and the bottom row is the result with integrated CM_P. The second column shows the corresponding enlarged framed areas of the images. The third column is the difference map of the corresponding framed areas. The color bar on the top (low difference to high difference) shows that the more yellow/brighter, the higher the difference between the two framed areas.

T(·), as defined in Eq. 2 or Eq. 3, is one example of how prior knowledge can be integrated into the training process. If T(·) is chosen to be a continuous non-trainable function, e.g. quadratic or Gaussian, further weight relaxation can be introduced for joint refinement of both the shadow-seg module in Fig. 2a and the confidence estimation in Fig. 2c. However, since probabilistic ground truth does not exist for our applications, evaluation would become purely subjective; thus we decide to use direct but discontinuous integration of shadow-intensity assumptions for T(·).

Task-specific deep networks, e.g. for classification, may inadvertently learn to ignore weak shadows in some cases, but their learning capacity for shadow properties is unknown. By estimating the confidence of shadow regions independently, our method guarantees that shadow property information is separately extracted and can be seamlessly integrated into other image analysis algorithms. With additional shadow property information, our method can improve steerability and interpretability for deep neural networks, and also enables extensions for non-deep learning algorithms. As shown in the experiments, prior knowledge provided by shadow confidence maps can improve the performance of various applications.

Binary shadow segmentation generated by the shadow-seg module (Fig.1a) may provide shadow information to some extent. The easiest way to utilize shadow information is to integrate this binary shadow segmentation into other applications. However, a binary segmentation of shadow regions is unsuitable for describing the inherent ambiguity of acoustic shadows caused by various attenuation of sound waves.
Compared with binary shadow segmentation, a real-valued shadow confidence map is more reasonable for representing shadows, especially uncertain boundaries. With this more accurate representation, shadow confidence maps are able to improve the performance of other applications compared to using simple binary segmentation.

Corrupted images, such as images with shadows caused by insufficient acoustic impedance gel, are excluded from the training. This type of shadow can be regarded as background since signals can hardly reach the tissues, and corrupted images with these shadows contain incomplete anatomical information. Additionally, during scanning, regions of missing signals caused by insufficient gel can be discovered and avoided, in contrast to shadows generated by the interaction between signals and tissues. Therefore, our work excludes the corrupted images and focuses on shadows within valid anatomy. Nevertheless, Fig. 10 further shows that our proposed method is capable of indicating regions suffering from signal decay, especially on the boundaries.

Fig. 10: Qualitative performance of our methods for detecting signal-lacking regions caused by insufficient gel. Panels: (a) Image, (b) Baseline, (c) Proposed, (d) Proposed + AG.

We use the coarse pixel-wise binary manual segmentation as ground truth for the shadow segmentation network and the transfer function, since accurate manual annotation of shadow regions is unavailable, as discussed before. However, the inaccuracy of the coarse ground truth can hardly affect the quantitative assessments and the generation of reference confidence maps, because (1) DICE, recall, precision and MSE are still positively related to the effectiveness of the methods, (2) soft DICE and ICC are not related to the coarse ground truth, and (3) reference confidence maps are generated based on I_mean (Eq. 1), which smooths the influence of the coarse ground truth by using the mean intensity of TP regions.
Additionally, we use the human inter-observer variability, which is computed from two coarse binary manual annotations, to further fairly assess the effectiveness of the methods.

Acoustic shadows are caused by absorption, refraction or reflection of sound waves, each of which leads to a different degree of signal attenuation. Our method is predominantly trained on fetal US images containing shadow regions with an elongated shape and a relatively strong drop of intensity. These are the shadow features that we have observed in a majority of images in our data sets. However, our method may be less effective for shadows arising from other acquisition-related causes that are less well represented in our current training data.

VII. CONCLUSION
We propose a CNN-based, weakly supervised method for automatic confidence estimation of shadow regions in 2D US images. By learning and transferring shadow features from weakly-labelled images, our method can predict dense, continuous shadow confidence maps directly from input images.

We evaluate the performance of our method by comparing it to the state-of-the-art and human performance. Our experiments show that our method is quantitatively better than the state-of-the-art and human annotation for shadow segmentation. For confidence estimation of shadow regions, our method is also qualitatively better than the state-of-the-art and is more consistent than human annotation. More importantly, our method is capable of detecting multiple disjoint shadow regions without being limited by the correlation between adjacent pixels as in [16], or by the heuristically selected hyperparameters in [22].

We further demonstrate that our method improves the performance of other automatic image analysis algorithms when integrating the obtained shadow confidence maps into other US applications such as standard plane classification, image fusion and automated biometric measurements.

Our method has a short inference time, which enables effective real-time feedback of local image properties. This feedback can guide inexperienced sonographers to find diagnostically valuable viewing directions and pave the way for standardized image acquisition training.

ACKNOWLEDGMENT

We thank the volunteers, sonographers and experts for providing manually annotated datasets and NVIDIA for their GPU donations. This work was supported by the Wellcome Trust IEH Award [102431], EPSRC grants (EP/L016796/1, EP/P001009/1), ERC 319456, and the Wellcome/EPSRC Center for Medical Engineering [WT 203148/Z/16/Z]. The research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Center based at Guy’s and St Thomas’ NHS Foundation Trust, King’s College London and the NIHR Clinical Research Facility (CRF) at Guy’s and St Thomas’. Q. Meng is funded by the CSC-Imperial Scholarship. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

REFERENCES

[1] J. Abbott and F. Thurstone. Acoustic speckle: Theory and experimental analysis.
Ultrasonic Imaging, 1(4):303–324, 1979.
[2] P. Anbeek, K. L. Vincken, G. S. van Bochove, M. J. van Osch, and J. van der Grond. Probabilistic segmentation of brain tissue in MR imaging. NeuroImage, 27(4):795–804, 2005.
[3] C. Baumgartner, K. Kamnitsas, J. Matthew, T. P. Fletcher, S. Smith, L. M. Koch, B. Kainz, and D. Rueckert. SonoNet: Real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans. Med. Imag., 36:2204–2215, 2017.
[4] C. Baumgartner, L. Koch, K. Tezcan, J. Ang, and E. Konukoglu. Visual feature attribution using Wasserstein GANs. CoRR, abs/1711.08998, 2017.
[5] F. Berton, F. Cheriet, M. M., and C. Laporte. Segmentation of the spinous process and its acoustic shadow in vertebral ultrasound images. Computers in Biology and Medicine, 72:201–211, 2016.
[6] B. Bouhemad, M. Zhang, Q. Lu, and J. Rouby. Clinical review: bedside lung ultrasound in critical care practice. Critical Care, 11(1):205, 2007.
[7] A. Broersen, M. Graaf, J. Eggermont, R. Wolterbeek, P. Kitslaar, J. Dijkstra, J. Bax, J. Reiber, and A. Scholte. Enhanced characterization of calcified areas in intravascular ultrasound virtual histology images by quantification of the acoustic shadow: validation against computed tomography coronary angiography. Int J Cardiovasc Imaging, 32:543–552, 2015.
[8] Centre for Workforce Intelligence. Securing the future workforce supply sonography workforce review. 2017.
[9] H. Choi, J. Lee, S. Kim, and S. Park. Speckle noise reduction in ultrasound images using a discrete wavelet transform-based image fusion technique. Bio-Medical Materials and Engineering, 26(1):1587–1597, 2015.
[10] P. Coupé, P. Hellier, C. Kervrann, and C. Barillot. Nonlocal means-based speckle filtering for ultrasound images. IEEE Trans. Image Process., 18(10):2221–2229, 2009.
[11] L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[12] M. K. Feldman, S. Katyal, and M. S. Blackwood. US artifacts. RadioGraphics, 29:1179–1189, 2009.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR’16, 2016.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
[15] P. Hellier, P. Coupé, X. Morandi, and D. Collins. An automatic geometrical and statistical method to detect acoustic shadows in intraoperative ultrasound brain images. Med Image Anal, 14(2):195–204, 2010.
[16] A. Karamalis, W. Wein, T. Klein, and N. Navab. Ultrasound confidence maps using random walks. Med Image Anal, 16(6):1101–1112, 2012.
[17] H. Kim and T. Varghese. Hybrid spectral domain method for attenuation slope estimation. Ultrasound Med Biol, 34:1808–1819, 2008.
[18] T. Klein and W. Wells. RF ultrasound distribution-based confidence maps. In MICCAI’15, pages 595–602. Springer, 2015.
[19] F. W. Kremkau and K. Taylor. Artifacts in ultrasound imaging. J Ultrasound Med, 5(4):227–237, 1986.
[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS’12, pages 1097–1105, 2012.
[21] T. Lange, N. Papenberg, S. Heldmann, J. Modersitzki, B. Fischer, H. Lamecker, and P. Schlag. 3D ultrasound-CT registration of the liver using combined landmark-intensity information. Int J Comput Assist Radiol Surg, 4(1):79–88, 2009.
[22] Q. Meng, C. Baumgartner, M. Sinclair, J. Housden, M. Rajchl, A. Gomez, B. Hou, N. Toussaint, V. Zimmer, J. Tan, et al. Automatic shadow detection in 2D ultrasound images. In MICCAI Workshop on PIPPI, 2018.
[23] NHS. Fetal anomaly screening programme: programme handbook June 2015. Public Health England, 2015.
[24] J. A. Noble. Ultrasound image segmentation and tissue characterization. Proc Inst Mech Eng H., 224(2):307–316, 2010.
[25] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. P. Heinrich, K. Misawa, K. Mori, S. G. McDonagh, N. Y. Hammerla, B. Kainz, et al. Attention U-Net: Learning where to look for the pancreas. CoRR, abs/1804.03999, 2018.
[26] N. Pawlowski, S. I. Ktena, M. Lee, B. Kainz, D. Rueckert, B. Glocker, and M. Rajchl. DLTK: State of the art reference implementations for deep learning on medical images. arXiv preprint arXiv:1711.06853, 2017.
[27] G. P. Penney, J. M. Blackall, M. S. Hamady, T. Sabharwal, A. Adam, and D. J. Hawkes. Registration of freehand 3D ultrasound and magnetic resonance liver images. Med Image Anal, 8:81–91, 2004.
[28] M. Rajchl, M. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. Rutherford, J. Hajnal, B. Kainz, et al. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. IEEE Trans. Med. Imag., 36(2):674–683, 2017.
[29] L. J. Salomon, Z. Alfirevic, V. Berghella, C. Bilardo, E. Hernandez-Andrade, S. L. Johnsen, K. Kalache, K. Leung, G. Malinger, H. Munoz, et al. Practice guidelines for performance of the routine midtrimester fetal ultrasound scan. Ultrasound Obst Gyn, 37:116–126, 2011.
[30] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In AAAI, 2018.
[31] P. E. Shrout and J. L. Fleiss. Intraclass correlations: Uses in assessing rater reliability. Psychol Bull., 86(2):420–428, 1979.
[32] M. Sinclair, C. Baumgartner, J. Matthew, W. Bai, J. Cerrolaza, Y. Li, S. Smith, C. Knight, B. Kainz, J. Hajnal, et al. Human-level performance on automatic head biometrics in fetal ultrasound using fully convolutional neural networks. In EMBC’18, 2018.
[33] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
[34] R. Steel, T. L. Poepping, R. S. Thompson, and C. Macaskill. Origins of the edge shadowing artefact in medical ultrasound imaging.
Ultrasound Med Biol ,39:1153–1162, 2005.[35] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, andB. Recht. The marginal value of adaptive gradientmethods in machine learning.
CoRR , abs/1705.08292,2018.[36] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu.Residual dense network for image super-resolution. In
CVPR’18 , 2018.[37] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Tor-ralba. Learning deep features for discriminative localiza-tion. In
CVPR’16 , pages 2921–2929. IEEE, 2016.[38] J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpairedimage-to-image translation using cycle-consistent adver-sarial networks. In
ICCV’17 , 2017.[39] A. V. Zimmer, A. Gomez, Y. Noh, N. Toussaint,B. Khanal, R. Wright, L. Peralta, V. M. Poppel, E. Skel-ton, J. Matthew, et al. Multi-view image reconstruction:Application to fetal ultrasound compounding. In
MICCAIWorkshop on PIPPI , 2018.
TO APPEAR IN IEEE TRANSACTIONS ON MEDICAL IMAGING DOI: 10.1109/TMI.2019.2913311
APPENDIX
A. Shadow/Shadow-free Classification Network
In this section, we use Python-inspired pseudo code to present the detailed network architecture of the shadow/shadow-free classification network (shown in Fig. 11). The conv layer function performs a standard 2D convolution without an activation layer, and the global average pool performs spatial averaging over the feature maps. The residual block is provided by DLTK [26].
Fig. 11: Shadow/shadow-free classification network architecture.
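The spatial averaging performed by the global average pool can be illustrated with a minimal sketch. It is written in plain Python rather than with DLTK or TensorFlow, and the function name, the nested-list layout (channels, height, width) and the toy values are assumptions made for illustration only; they are not taken from the network in Fig. 11.

```python
def global_average_pool(feature_maps):
    """Average each channel over its spatial dimensions.

    feature_maps: list of channels, each a 2D list indexed [row][column]
    (an assumed layout, chosen only to keep the sketch dependency-free).
    Returns one scalar per channel.
    """
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled


# Toy input: two 2x2 feature maps (hypothetical values).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(features)
print(pooled)  # [2.5, 2.0]
```

Each feature map collapses to a single value, so the output length depends only on the number of channels and not on the spatial size of the input.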
B. Shadow Segmentation Network
Fig. 12 shows the detailed architecture of the segmentation network. The conv layer function performs a standard 2D convolution without an activation layer. The residual block and the upsample concat (the upsampling and concatenation layer) are provided by DLTK [26].
Fig. 12: Shadow segmentation network architecture.
C. Confidence Estimation Network
Fig. 13 shows the detailed architecture of the shadow confidence estimation network. As before, the conv layer function performs a standard 2D convolution without an activation layer. The residual block and the upsample concat (the upsampling and concatenation layer) are provided by DLTK [26].
Fig. 13: Shadow confidence estimation network architecture.
D. Alternative Examples of Shadow Confidence Estimation
We show an alternative group of examples for the confidence estimation of shadow regions (Fig. 14). These examples include fetal brain images from M_test, and cardiac, lips and kidney images from S_test. Similar to Fig. 6 in the main paper, Fig. 14 shows that the baseline fails to handle unseen data, while the proposed method and the proposed + AG method are able to predict pixel-wise confidence of multiple shadow regions. These examples demonstrate that the shadow-seg module is able to generalize the shadow representation and transfer it from the shadow/shadow-free classification task to a confidence estimation task.
E. Data in Ultrasound Classification
Table VI shows the exact number of images used in the 2D US standard plane classification application (Section V, Part A). The amount of training data is almost the same for every class, so that the classes remain balanced during training.
F. Class Confusion Matrix
Fig. 17 additionally shows the class confusion matrices for the 2D US standard plane classification in Section V, Part A. These matrices show that fewer 3VV images are misclassified as RVOT images and fewer 4CH images are misclassified as LVOT images after adding shadow confidence maps. However, as noted in the Discussion section, the shadow confidence maps can also introduce redundant information for similar anatomical structures in this classification task. For example, more LVOT images are wrongly classified as RVOT images and more RVOT images are classified as 3VV images.
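For readers who want to reproduce this kind of analysis, the following is a minimal sketch of how a class confusion matrix can be tallied from true and predicted labels. The function name, the row/column convention (rows are true classes, columns are predictions) and the toy labels are assumptions for illustration; they are not the data or code behind Fig. 17.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[index[true]][index[predicted]] += 1
    return matrix


# Hypothetical labels for four of the cardiac view classes named in the text.
classes = ["3VV", "RVOT", "4CH", "LVOT"]
y_true = ["3VV", "3VV", "RVOT", "4CH", "4CH", "LVOT"]
y_pred = ["3VV", "RVOT", "RVOT", "4CH", "LVOT", "LVOT"]

cm = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, cm):
    print(name, row)
```

Off-diagonal entries are the misclassifications; in this toy example one 3VV image is predicted as RVOT and one 4CH image is predicted as LVOT, the two kinds of error discussed above.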
Fig. 14: Shadow confidence maps of different methods on various anatomical US images. Columns: (a) Image, (b) Baseline, (c) Proposed, (d) Proposed + AG, (e) Weak GT. Rows I-IV show four examples of shadow confidence estimation: Brain (top), Cardiac (second), Lips (third) and Kidney (bottom). Columns (b-d) are shadow confidence maps from the baseline, the proposed method and the proposed method with attention gate (Proposed + AG). Column (e) is the binary map of the manual segmentation.
TABLE VI: Summary of the Data Set used in Ultrasound Standard Plane Classification Task.
Class Training Validation Testing
Sum
G. Examples for Image Fusion
Fig. 15 shows more examples of the multi-view image fusion task, including the original multi-view images. Columns (a-b) of Fig. 15 show that the original images contain strong shadow artifacts that can affect anatomical analysis. The image fusion task aims to use complementary information from images with different views to reduce artifacts and increase anatomical information. Columns (e-f) enlarge the areas within the bounding boxes in columns (c-d). Column (g) shows the difference maps between columns (e) and (f). The difference maps clearly indicate the improved performance of image fusion after adding shadow confidence maps, both for the Gaussian weighting strategy and for the Intensity and Gaussian weighting strategy.
H. Examples for Biometric Measurement
We visualize the biometric measurement of the three exam-ples shown in Table V. Fig. 16 demonstrates that, for the casesaffected by shadow artifacts, the segmentation performance(“EI seg DICE”) is improved after adding shadow confidencemaps as an extra channel. From the first row to the third rowin Fig. 16, these three samples are respectively
Fig. 15: The results of the multi-view image fusion. (a-b) The multi-view images (Image Component 1 and Image Component 2), (c) Image fusion without shadow confidence maps (CM_P), (d) Image fusion with shadow confidence maps (CM_P), (e-f) Enlarged areas of (c-d) respectively, and (g) Difference maps (D map) of (e) and (f). Rows (1-2) use MSE loss to train networks for generating shadow confidence maps while Rows (3-4) use Sigmoid loss. Row 2 uses the Gaussian weighting for image fusion while Rows (1, 3, 4) use the Intensity and Gaussian weighting. The color bar on the top (from low difference to high difference) shows that the more yellow/brighter, the higher the difference between the two framed areas.
I. Equations for estimating floating point operations (Flops) for convolutional layers
We use Eq. 5 to estimate the required Flops for convolution layers including ReLU activation. Here, W and H are the width and height of the input image respectively, K is the kernel size, P is the padding, S is the stride and F is the number of filters. n is the size of one convolution kernel (channels * K * K).

$$\mathrm{Flops} \approx \left(\frac{W - K + 2P}{S} + 1\right) \left(\frac{H - K + 2P}{S} + 1\right) \left(n + (n - 1)\right) F + F \cdot W \cdot H. \tag{5}$$

Eq. 5 counts n multiplications and n - 1 additions for every output position of every filter, where the number of output positions is adjusted for padding P and stride S. ReLU activation is assumed to cost F * W * H Flops (one comparison and one multiplication).
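The estimate can be turned into a few lines of Python. This is a minimal sketch under stated assumptions, not the script used for the paper: the function name and argument names are hypothetical, each output position is taken to cost n multiplications plus n - 1 additions as described in the text, and the output size is rounded down with integer division, which the text does not specify.

```python
def conv_flops(width, height, kernel, padding, stride, channels, filters):
    """Estimate Flops of one convolution layer followed by ReLU.

    Assumes n multiplications and n - 1 additions per output position,
    with n = channels * kernel * kernel, and floor division for the
    output size (an assumption of this sketch).
    """
    n = channels * kernel * kernel
    out_w = (width - kernel + 2 * padding) // stride + 1
    out_h = (height - kernel + 2 * padding) // stride + 1
    conv = out_w * out_h * (n + (n - 1)) * filters
    relu = filters * width * height
    return conv + relu


# Hypothetical layer: 224x224 RGB input, 64 filters of size 3x3, padding 1, stride 1.
total = conv_flops(width=224, height=224, kernel=3, padding=1,
                   stride=1, channels=3, filters=64)
print(total)  # 173408256
```

For this example layer the convolution term contributes 224 * 224 * 53 * 64 = 170,196,992 Flops and the ReLU term 64 * 224 * 224 = 3,211,264 Flops, so the convolution dominates the estimate.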
Fig. 16: Biometric measurement with vs. without shadow confidence maps. Columns: (a) w/o CM, (b) CM_B, (c) CM_P, (d) CM_PAG. The yellow circles refer to the ground truth, the green curves are segmentation predictions, and the red circles are the ellipses of the segmentation prediction.
Fig. 17: Class confusion matrices for 2D ultrasound standard plane classification. Upper row left: the class confusion matrix without shadow confidence maps. Upper row right: the class confusion matrix with the shadow confidence maps generated by the baseline. Lower row left: the class confusion matrix with shadow confidence maps obtained by the proposed method. Lower row right: the class confusion matrix with the shadow confidence maps produced by the proposed + AG method.