Multiple Instance Segmentation in Brachial Plexus Ultrasound Image Using BPMSegNet
Yi Ding, Member, IEEE, Qiqi Yang, Guozheng Wu, Jian Zhang, Zhiguang Qin, Member, IEEE
Abstract—The identification of nerves is difficult, as nerve structures are challenging to image and to detect in ultrasound images. Nevertheless, nerve identification in ultrasound images is a crucial step for improving the performance of regional anesthesia. In this paper, a network called the Brachial Plexus Multi-instance Segmentation Network (BPMSegNet) is proposed to identify different tissues (nerves, arteries, veins, muscles, etc.) in ultrasound images. The BPMSegNet has three novel modules. The first is the spatial local contrast feature, which computes contrast features at different scales. The second is the self-attention gate, which reweighs the channels in feature maps by their importance. The third is the addition of a skip concatenation with transposed convolution within a feature pyramid network. The proposed BPMSegNet is evaluated by conducting experiments on our constructed Ultrasonic Brachial Plexus Dataset (UBPD). Quantitative experimental results show the proposed network can segment multiple tissues from ultrasound images with good performance.
Index Terms—Brachial plexus segmentation, ultrasound image, deep learning, instance segmentation.
I. INTRODUCTION
Ultrasound is one of the core diagnostic imaging modalities and is widely applied in diagnosis and treatment, for example, the ultrasound-guided injection of botulinum toxin A [13], the use of ultrasound to diagnose neuralgic amyotrophy [40], and peripheral nerve blockade (PNB) [2]. For PNB, ultrasound guidance has become the indispensable guidance modality [8]. During PNB, anesthesiologists perform regional anesthesia around the target nerve, and ultrasound images are used to locate the nerve to promote pain management [39]. Hence, the accurate localization of nerve structures in ultrasound images is a critical step for effectively performing PNB procedures [9]. Clinicians, and in particular novices, can benefit from assistance in interpreting these images.

Nowadays, deep learning algorithms, in particular convolutional networks, have rapidly become a methodology of choice for analyzing medical images [26], [37]. For example, [7], [10], [11] design unique network architectures to segment brain tumors in MRI images. For ultrasound image analysis, [32] has proposed an approach for midbrain segmentation, and [47], [50] use improved convolutional networks for nerve segmentation.

However, there are challenges in nerve segmentation. First, the nerve is very small and inconspicuous. Second, the noise disturbance in ultrasound imaging reduces image quality and degrades details such as texture. Third, the low contrast with neighboring tissues blurs the boundaries of anatomical structures. Beyond the challenges in ultrasound imaging, identifying nerves in ultrasound images is also difficult for anesthesiologists: it requires them to fully understand the nerve and its surrounding anatomic landmarks (muscle, vein, and artery), and to have extensive clinical experience [44].

To address these challenges, we have worked from two aspects. First, we merge features from different scales to combine local and semantic information, and enhance the contrast information to highlight small and inconspicuous objects. In addition, we are inspired by the clinical PNB procedure: anesthesiologists tend to identify nerves by referring to their surrounding anatomic landmarks rather than locating the nerves directly. The identification of muscle, vein, and artery helps recognize the nerve in clinical practice. As a result, instead of focusing only on nerves, it is also important for the network to learn the relationships among those tissues.
Corresponding authors: Yi Ding ([email protected]) and Jian Zhang ([email protected]). Yi Ding is with the School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China; he is also with the Institute of Electronic and Information Engineering of UESTC in Guangdong, Guangdong, 523808, China (e-mail: [email protected]). Qiqi Yang and Zhiguang Qin are with the School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China (e-mail: [email protected]; [email protected]). Guozheng Wu is with the National Natural Science Foundation of China, Beijing, China (e-mail: [email protected]). Jian Zhang is with the Center of Anaesthesia Surgery, Sichuan Provincial Hospital for Women and Children/Affiliated Women and Children's Hospital of Chengdu Medical College, Chengdu, China (e-mail: [email protected]).

Fig. 1. (a) shows the raw ultrasound images collected from different patients. (b) is the medical ground truth marked by anesthesiologists.
In this paper, a Brachial Plexus Multi-instance Segmentation Network (BPMSegNet) is proposed based on the following insights: (1) following the clinical procedure of anesthesiologists, the positional relationship between the nerve and its surrounding landmarks can help segment the nerve; (2) the segmentation of inconspicuous and small objects can be improved by multi-scale contrast features. The BPMSegNet is implemented by following the Mask R-CNN [16] framework; in addition, we propose some simple yet effective modules to improve segmentation performance. First, the Spatial Local Contrast Feature (SLCF) module aggregates contrast features at spatial and local resolution. Second, to select channels among feature maps and preserve useful information, the Self Attention Gate (SAG) is proposed. Third, the original cascade upsampling process in the Feature Pyramid Network (FPN) [24] is improved with transposed convolution and skip concatenation. Last, an ultrasound image dataset, the Ultrasonic Brachial Plexus Dataset (UBPD), is constructed to evaluate the BPMSegNet; it contains nerves, muscles, veins, arteries, and their corresponding masks.

The contributions of this work are summarized as follows:

1) A novel brachial plexus segmentation network, BPMSegNet, is developed to identify multiple instances in ultrasound images by applying a deep learning instance segmentation algorithm and integrating prior medical knowledge from the clinical nerve identification process.

2) We design modules to improve nerve segmentation performance. The multi-scale contrast feature module combined with the self-attention gate generates tailored features for ultrasound images. In addition, the upsampling mechanism in the FPN is improved by adopting skip connections to fine-tune the information flow and increase nonlinearity. Benefiting from these designs, our method achieves higher nerve detection and localization accuracy than state-of-the-art methods.

3) We have built an ultrasound image dataset, UBPD. This dataset is dedicated to segmenting the brachial plexus and its surrounding anatomy. It consists of 1052 ultrasound images with different targets and their corresponding labeled masks.

The remainder of the paper is organized as follows. In Section II we briefly review related work in instance segmentation and ultrasound image segmentation. In Section III we introduce the construction of our new dataset, UBPD. The mechanism of the proposed method is elaborated in Section IV. The experimental results are presented in Section V.
II. RELATED WORK
In this section, we briefly review work on instance segmentation and related improvements.
A. Instance Segmentation
The idea of instance segmentation is to identify, localize, and segment objects while distinguishing different instances of the same category. Approaches can be roughly divided into two main families: segmentation-based methods and detection-based methods.

Detection-based instance segmentation methods extend object detection methods (e.g., Faster R-CNN [35]) to obtain detected instances, then add a mask branch to predict the segmentation mask. In Mask R-CNN, the network uses one additional branch to predict instance segmentation masks for the detected instances generated by Faster R-CNN [35]. PANet [27] proposed several improvements over Mask R-CNN by adding bottom-up paths to facilitate feature propagation. YOLACT [3] achieves 29.8 mAP on the Microsoft COCO dataset [25] by breaking instance segmentation into two parallel subtasks: generating a set of prototype masks and predicting per-instance mask coefficients.

Segmentation-based methods first predict the category label of each pixel and then group pixels together to obtain instance-level labels. [23] used spectral clustering to cluster the pixels. [48] uses depth estimation and adds boundary detection information during the clustering procedure.
B. Attention Mechanism
Nowadays, the attention mechanism has received extensive attention in both natural language processing and computer vision. The non-local neural network [42] first explored attention mechanisms in image processing and gave intuitive guidance on the design of attention operations. The recently published DANet [12], OCNet [45], CCNet [20], PSANet [49], and Local Relation Net [19] then explored the application of attention mechanisms in semantic segmentation.

In attention mechanisms, capturing long-term dependencies [18] is of vital importance for deep neural networks. We can establish a connection between two pixels at a certain distance on the image (or on the feature map), and this can be applied along space and time. However, the convolutional and recurrent neural network [30] operations commonly used in deep neural networks are performed on local areas; they are typical local operations. [42] discusses non-local operations, which can be designed to capture long-term dependencies. Non-local operations consider the weighting of features between all locations when calculating the response at a certain position. Compared to the general Euclidean distance, this dependency may reflect the relationship and connection between different positions, or the continuity of the same position in different dimensions. The design of the non-local operation is summarized in Non-local Neural Networks as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (1)$$

where $i$ is the index of an output position whose response is to be computed, $j$ is the index of all other possible positions, $x$ is the input (image, video, features), and $y_i$ is the computed output signal at position $i$. The function $f$ computes a scalar that represents the relationship between position $i$ and every other position $j$, the function $g$ computes a representation of the input signal at position $j$, and $C(x)$ is a normalization factor.
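As a concrete illustration, the following NumPy sketch implements Eq. (1) with the embedded-Gaussian choice of $f$, so that the $1/C(x)$ normalization becomes a softmax over positions. The projection matrices `w_theta`, `w_phi`, and `w_g` are hypothetical stand-ins for learned weights, not anything specified in [42].

```python
# A minimal NumPy sketch of the non-local operation in Eq. (1), assuming the
# embedded-Gaussian form f(x_i, x_j) = exp(theta(x_i) . phi(x_j)), so that
# 1/C(x) * sum_j f(x_i, x_j) reduces to a softmax over positions j.
import numpy as np

def non_local(x, w_theta, w_phi, w_g):
    """x: (N, C) flattened feature map with N positions and C channels."""
    theta = x @ w_theta          # (N, C') query embedding of position i
    phi = x @ w_phi              # (N, C') key embedding of position j
    g = x @ w_g                  # (N, C') value: representation g(x_j)
    scores = theta @ phi.T       # (N, N) pairwise relationship f(x_i, x_j)
    # softmax over j implements the 1/C(x) normalization
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ g              # y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))                 # 8x8 map flattened, 32 channels
w = [rng.standard_normal((32, 16)) * 0.1 for _ in range(3)]
y = non_local(x, *w)
print(y.shape)                                    # (64, 16)
```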
Fig. 2. The schematic of BPMSegNet. The Spatial Local Contrast Feature (SLCF) module learns contrast features for ultrasound images, then the Self Attention Gate (SAG) assigns adaptive weights to different channels. Last, the refined features are utilized by the region proposal network (RPN) and the head architecture.
C. Ultrasound Image Segmentation
The work [33] has reviewed techniques developed for ultrasound image segmentation, including thresholding, level sets, active contours, and other model-based methods. For example, in [5], longitudinal and transverse view ultrasound image sequences of brachial arteries were segmented by a feed-forward active contour (FFAC).

For deep learning methods, many studies have shown the advantages of adopting deep learning algorithms for ultrasound image segmentation, especially in accuracy and efficiency. In recent papers, segmentation methods based on machine learning have been proposed to achieve pixel-wise segmentation, and various types of neural networks have been designed and applied for ultrasound image segmentation. [21] uses a boundary-regularized convolutional encoder-decoder network to segment breast anatomy, [34] and [6] design handcrafted features for ultrasound images, and [4], [31] use deep learning based methods to meet different clinical needs. Segmentation in 3D ultrasound images has also been studied in recent work [29], [43].

For specific deep learning solutions to brachial plexus segmentation, there are studies that use semantic segmentation methods. For example, [1], [15], [47], [50] propose networks to segment nerves in ultrasound images. [47] combines features from shallow and deep layers through multi-path information fusion, and uses dilated convolution to enlarge the receptive field in deep layers. [50] also proposes a U-Net-like network, and takes advantage of inception modules and batch normalization instead of ordinary convolutional layers. There are a few limitations in these works. First, these studies transfer neural network structures commonly used on natural images to ultrasound image segmentation, and do not consider the low resolution and low contrast of ultrasound images. Moreover, in clinical practice, brachial plexus identification is highly related to the identification of its surroundings; these works focus only on the nerve, and do not take the important relevance between the nerve and other tissues into account. In clinical practice, brachial plexus identification in ultrasound images is very important for visual navigation. Hence, we construct a dataset for the brachial plexus and its surrounding anatomical structures, and explore the application of deep learning to brachial plexus segmentation to assist anesthesiologists in the PNB process.
III. DATASET
We have constructed an Ultrasonic Brachial Plexus Dataset (UBPD) that focuses on segmenting the brachial plexus and its surrounding anatomical tissues.

The dataset is collected from 101 patients by professional anesthesiologists. Two ultrasound devices (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) are used for data collection, and the anesthesiologists use a high-frequency probe to scan the targets. The acquisition method is the same as in clinical practice. The anesthesiologists initially place the probe at the middle of the right neck to find a suitable view, with a capturing depth of 4 cm. When the video shows the internal jugular vein, common carotid artery, anterior scalene muscle, and brachial plexus coexisting, the anesthesiologist slowly slides the probe down. Each ultrasound video is recorded for a total of 8 seconds. Following this method, we collected a total of 101 ultrasound videos. Next, ultrasound images are captured by extracting frames from these videos: we randomly extract 10 to 15 frames from each video while ensuring there is at least one target in each selected frame. Together we have obtained an ultrasound dataset with a total of 1052 images.

Each ultrasound image contains targets and their corresponding annotated masks. Labelme [36] is adopted for manual labeling, and the ground truth labels are marked by the anesthesiologists. Part of the masks and images in UBPD are shown in Figure 1; there are 4 categories, including nerve, muscle, vein, and artery. These tissues vary in shape and size, and have different characteristics. The vein and artery are salient in ultrasound imaging: their edges are clear and they are large in size. On the contrary, the nerve is small and inconspicuous.

During the labeling process, the anesthesiologists concluded that the nerve has the following characteristics: bright edges, dark interior, a diameter within 3 mm, and a continuous beaded or honeycomb appearance. In clinical practice, the anesthesiologist will first identify the arteries and veins, then the muscle, and finally the nerve. The target recognition process of the anesthesiologists is consistent with their examination procedures in clinical anesthesia.
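The frame-sampling step described above might be sketched as follows, assuming OpenCV is used for video decoding. The `has_target` predicate that checks a frame for at least one visible target is a hypothetical placeholder, since the paper does not specify how this check was automated.

```python
# A hedged sketch of sampling 10-15 random frames per ultrasound video.
import random
import cv2

def sample_frames(video_path, k_min=10, k_max=15, has_target=lambda f: True):
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    k = min(random.randint(k_min, k_max), n)
    picks = sorted(random.sample(range(n), k))      # random frame indices
    frames = []
    for idx in picks:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)       # seek to the chosen frame
        ok, frame = cap.read()
        if ok and has_target(frame):                # keep frames with a target
            frames.append(frame)
    cap.release()
    return frames
```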
In our experiments, the dataset is divided into training and testing sets. The training set consists of 955 images collected from 91 patients, and the testing set consists of 97 images collected from 10 patients.

IV. THE BPMSEGNET

In this paper, the Brachial Plexus Multi-instance Segmentation Network (BPMSegNet) is proposed for brachial plexus segmentation. The shape of the nerve shown in ultrasound images is highly variable, and nerves with clear edges and brightness contrast are easier to recognize. This observation suggests that the successful identification of nerves requires highlighting and enhancing the contrast information. In addition, in the clinical nerve identification process, anesthesiologists rely on the positions of the surrounding tissues to locate a suitable view for the nerves. This indicates that considering the positional relationships among these anatomical targets can help nerve identification. Inspired by the above observations, we design several components for the proposed network. First, the SLCF module is proposed to preserve the contrast information at spatial and local resolution for multi-scale feature aggregation. Second, to further select useful information among the large number of channels produced by the SLCF, the SAG is proposed to filter redundant information. The detailed description of the proposed framework can be found in the following subsections.
A. The Structure of BPMSegNet
The structure of BPMSegNet is illustrated in Figure 2. It takes a single image as input and extracts multi-level features (C2, C3, C4, C5) with the feature extraction network. In this paper, the output stride is defined as the ratio of the image resolution to the feature map resolution. For example, the size of C5 is 32 times smaller than the input (because of 5 downsampling operations), so its output stride is 32. Although C5 has the lowest resolution, its semantic information is the most abundant, while the low-level feature map C2 carries local and detailed information. The BPMSegNet then uses a U-shaped structure with skipping transposed convolutions, akin to an FPN, to fuse multi-scale features. This upsampling branch outputs feature maps with output strides of 32, 16, 8, and 4, denoted P5, P4, P3, and P2, respectively.

The feature map P2 is then fed into the SLCF module to learn multi-scale contrast features. The output feature maps of the SLCF are further selected and filtered by the SAG. Next, these feature maps are input into the region proposal network (RPN) [35] to scan the image and locate objects. The RPN generates candidate bounding boxes that are likely to contain foreground objects, and multi-class classification and bounding box refinement are then performed on the candidates with non-maximum suppression (NMS) [14]. Lastly, the FCN network [28] in the mask branch predicts the segmentation masks for the detected boxes.
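The output-stride bookkeeping can be made concrete with a few lines of arithmetic. For the 640 × 640 inputs used later in the experiments, the levels work out as follows:

```python
# Output stride is the ratio of input resolution to feature-map resolution,
# so for a 640x640 input, C2..C5 (and the fused P2..P5) live at these sizes.
input_size = 640
for name, stride in [("C2/P2", 4), ("C3/P3", 8), ("C4/P4", 16), ("C5/P5", 32)]:
    side = input_size // stride
    print(f"{name}: stride {stride:2d} -> {side}x{side} feature map")
# C5 at stride 32 is the most semantic but coarsest level (20x20 here),
# while P2 at stride 4 (160x160) keeps the local detail consumed by SLCF.
```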
B. Spatial Local Contrast Feature
We observe that human eyes are sensitive to contrast (brightness, shape) and adaptively adjust the size of the pupils to perceive different visual information. It is easier for us to identify items with clear edges and high contrast, which makes the concept of contrast essential for computer vision and image recognition.

There are two types of contrast features adopted in the segmentation task: spatial contrast and local contrast features. Spatial contrast refers to the contrast between large regions, and it reveals the global and semantic information of the image. Local contrast refers to the contrast between smaller regions. Salient objects (e.g., the artery and vein in Figure 1 (b)) have clear edges, large shapes, and other histology features; in addition, they respond more strongly to semantic information and have an advantage in the segmentation task [22]. However, the advantages of salient objects create a challenge for segmenting inconspicuous and small objects. When aggregating features of different scales using upsampling and element-wise summation, the features of pixels in inconspicuous objects are dominated by the features of salient objects. Some information about inconspicuous objects is then ignored in the final prediction, resulting in incorrect labels for pixels at certain locations. For the nerve segmentation task, object sizes span a large range of scales, and it is important to consider both spatial and local contrast features. Therefore, the Spatial Local Contrast Feature (SLCF) module is proposed to effectively aggregate spatial and local contrast features.

The structure of the proposed SLCF module is shown in Figure 3. Given the input feature maps, the SLCF aggregates contrast features at four different scales with four convolutional branches (with dilation rates of 2, 4, 8, and 16). The SLCF is only applied to P2: the sizes of P3, P4, and P5 are relatively small, so applying convolutions with large effective kernel sizes to them would produce a lot of useless information. P2 has the largest resolution and the most detailed features, which is desirable for generating contrast features.

There are four parallel contrasting blocks in the SLCF module. In each branch, contrast features are generated by subtracting feature maps produced by convolutions with different receptive fields, which is expressed as follows:
$$\mathrm{SLCF} = (f *_r k)(p) - (f * k)(p). \qquad (2)$$

The dilated convolution $*_r$ between input $f$ and kernel $k$ with dilation factor $r$ is defined as

$$(f *_r k)(p) = \sum_{s + rt = p} f(s)\, k(t). \qquad (3)$$

The normal convolution $*$ is defined as

$$(f * k)(p) = \sum_{s + t = p} f(s)\, k(t). \qquad (4)$$

Here $f$ is the input feature map and $k$ represents the convolution kernel. The operator $*_r$ refers to $r$-dilated convolution, and the operator $*$ is the plain 1-dilated convolution. The subtraction result, SLCF, is the expected contrast feature.
Fig. 3. The Spatial Local Contrast Feature (SLCF) module.

Fig. 4. The structure of the SAG module, in which the features are combined linearly with adaptive weights.

In the SLCF module, the contrast features generated by the different convolutions have four different scales. A convolution with a larger dilation rate corresponds to a larger receptive field, and the size of the receptive field reflects the degree of abstraction of the features. As shown in Figure 3, in the four parallel contrasting blocks the dilation rate r is set to 2, 4, 8, and 16, respectively, to obtain feature maps with a long range of receptive fields. When the dilation rate is small, the SLCF contrasts the local features, which carry more discriminative and detailed information. When the dilation rate is large, the SLCF contrasts the spatial features, which contain more global and semantic information. By fusing the outputs of the four parallel contrasting blocks, the SLCF generates contrast features tailored to multi-scale objects, and these features constrain the network to generate customized details for both salient and inconspicuous objects. Finally, the multi-scale contrast feature maps are concatenated together and then refined by the SAG.
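The following Keras sketch shows one way to realize the SLCF design just described: four parallel branches with dilation rates 2, 4, 8, and 16, each computing Eq. (2) as a dilated convolution minus a plain convolution, with the outputs concatenated. The 3 × 3 kernel and the 128 channels per branch (which yield the 512-channel output mentioned in the next subsection) are assumptions, not values taken from the paper.

```python
# A minimal sketch of the SLCF module under the stated assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def slcf_block(x, filters=128, kernel_size=3, rates=(2, 4, 8, 16)):
    branches = []
    for r in rates:
        dilated = layers.Conv2D(filters, kernel_size, padding="same",
                                dilation_rate=r)(x)                # (f *_r k)(p)
        plain = layers.Conv2D(filters, kernel_size, padding="same")(x)  # (f * k)(p)
        branches.append(layers.Subtract()([dilated, plain]))       # Eq. (2)
    return layers.Concatenate()(branches)   # multi-scale contrast features

inp = layers.Input((160, 160, 256))          # e.g. the P2 feature map
out = slcf_block(inp)                        # -> (160, 160, 512)
model = tf.keras.Model(inp, out)
```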
C. Self Attention Gate

The output of the SLCF is a feature map with 512 channels. However, too many channels cause the number of network parameters to increase sharply, resulting in a large amount of computation. In addition, over-parameterization can reduce training effectiveness and even degrade performance. In this case, obtaining more useful and streamlined information from the feature maps can improve the performance of the network and reduce the computational overhead.

To address this problem, the Self Attention Gate (SAG) is proposed to establish connections between positions in different channels and highlight useful features. The SAG captures the relationships of different positions in different channels by adopting the non-local design mechanism. Compared to continuously stacking convolution operations, the SAG computes long-term dependencies [41] (the relationships between channels) rather than the relationships of pixels within one channel.

In fact, such a method of establishing connections between channels can be regarded as a self-attention [46] mechanism. The self-attention mechanism calculates aggregates of attention scores at certain positions by interacting within the feature maps (here, the channels). It attends to all positions and takes their weighted average in a high-dimensional embedding space. The different channels can be considered a high-dimensional space that expresses features in a more abstract and diverse way. The SAG follows the design of self-attention: it focuses on all positions in a feature map and takes their weighted average along the channel dimension.

The detailed structure of the SAG is shown in Figure 4, and the output of the SAG, $F'$, is defined as

$$F' = W(F) \otimes F, \qquad (5)$$

$$W_i(F) = \sigma\left(\Theta\left(F_i^{GAP}\right) + \Theta\left(F_i^{GMP}\right)\right), \qquad (6)$$

where $F$ is the input feature map, $W(F)$ represents the updated weights of the channels in the input $F$, and $W_i(F)$ is the weight of the $i$-th channel. $\Theta$ represents the parameters of the dense layers, and $\sigma$ is the sigmoid activation. The final output $F'$ is the channel-wise product of the input $F$ and $W(F)$.
First, each channel is compressed into a scalar by Global Average Pooling (GAP) and Global Max Pooling (GMP), yielding two vectors that contain mean and peak information. The GAP and GMP are defined as

$$F^{GAP} = \frac{1}{k \times k} \sum_{m,n}^{k} x_{m,n}, \qquad (7)$$

$$F^{GMP} = \max_{m,n}\left(x_{m,n}\right), \qquad (8)$$

where $(m, n)$ is the pixel index in the feature map (of size $k \times k$), and $x_{m,n}$ is the value at location $(m, n)$.

Then, the two vectors are input to a simple neural network to update the weights separately. This network consists of two densely connected layers with different numbers of units: 64 units in the first layer and 512 units in the second layer. The network reassigns the weights to $F^{GAP}$ and $F^{GMP}$ by updating its parameters. The SAG increases the weights of effective channels and reduces the weights of invalid ones, so that the network can choose useful channels from the feature maps. The output of the SAG has the same size as its input, so the SAG can be applied at multiple layers and easily embedded into other network architectures. More importantly, the SAG can discover the internal relationships among pixels in different channels without adding many parameters.
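A hedged Keras sketch of the SAG follows, assembling Eqs. (5)-(8): per-channel GAP and GMP vectors pass through a two-layer MLP (64 units, then one unit per channel, following the text), the two outputs are summed and squashed by a sigmoid, and the result rescales the input channels. Sharing the MLP weights between the two pooled vectors, and the ReLU on the first layer, are assumptions here.

```python
# A minimal sketch of the SAG under the stated assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def sag_block(x, hidden=64):
    c = x.shape[-1]
    mlp = tf.keras.Sequential([layers.Dense(hidden, activation="relu"),
                               layers.Dense(c)])        # Theta: 64 -> c units
    gap = layers.GlobalAveragePooling2D()(x)             # F^GAP, Eq. (7)
    gmp = layers.GlobalMaxPooling2D()(x)                 # F^GMP, Eq. (8)
    w = layers.Activation("sigmoid")(layers.Add()([mlp(gap), mlp(gmp)]))  # Eq. (6)
    w = layers.Reshape((1, 1, c))(w)                     # one weight per channel
    return layers.Multiply()([x, w])                     # F' = W(F) (x) F, Eq. (5)

inp = layers.Input((160, 160, 512))   # the 512-channel SLCF output
out = sag_block(inp)                  # same shape as the input
model = tf.keras.Model(inp, out)
```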
D. Skip Concatenation with Transpose Convolution
Different layers compute feature maps at different levels and scales. A larger output stride indicates low resolution and high semantic information. Conversely, a feature map with a smaller output stride (e.g., P2) has high resolution and finer information, which is important for restoring the edges and boundaries of objects. The feature maps produced at different layers of the network thus live at different scales and have large semantic gaps. Low-level features can damage the representational capacity of high-resolution maps for object recognition, so we need to combine both high-resolution and low-resolution features to perform robust segmentation and detection at different scales. Therefore, a skip concatenation with transposed convolution (SC) is proposed to combine low-resolution feature maps carrying high-level semantic information with high-resolution ones.

As shown in Figure 5 (a), in the FPN, the low-resolution feature maps are upsampled and then connected with the larger-resolution ones. The feature maps are upsampled by bilinear interpolation, which can be considered a linear decoder module. Based on the FPN, Figure 5 (b) enhances feature reuse by adding another connection between the lowest-resolution and highest-resolution feature maps.

The proposed improvement is illustrated in Figure 5 (c). We change the connection between P4 and P3, and re-connect P5 and P3. The feature map P5 has the highest level of semantic information and globality. There is a huge semantic gap between P5 and P2, and connecting P5 directly to P3 propagates high-level semantic information while preventing the information from being discarded by many upsampling operations. Besides, instead of bilinear interpolation, transposed convolution is adopted to increase nonlinearity. The output size of a transposed convolution is $o = s \times (i - 1) - 2p + k$, where $s$ is the upsampling stride, $i$ is the input size, $p$ is the padding size, and $k$ is the kernel size. Here $p$ is set to 0, and $k$ and $s$ are set to 2, except for the skip concatenation from P5 to P3, in which $s$ and $k$ are set to 4. By combining the bottom-up pathway with the top-down pathway using lateral connections, the high semantic information in the low-resolution feature maps is connected to the high-resolution feature maps, and the skip concatenation preserves the high-level information while keeping a balanced budget between efficiency and effectiveness.
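The arithmetic and one possible wiring are sketched below: with $p = 0$ the output size $s(i - 1) - 2p + k$ reduces to exactly 2× or 4× the input, and the P5-to-P3 skip can be expressed with a stride-4 transposed convolution. The channel widths and the exact lateral inputs are assumptions based on one reading of Figure 5 (c).

```python
# A hedged sketch of the SC upsampling path with transposed convolutions.
import tensorflow as tf
from tensorflow.keras import layers

def out_size(i, s, k, p=0):
    # o = s*(i-1) - 2p + k, the transposed-convolution output size from the text
    return s * (i - 1) - 2 * p + k

print(out_size(20, 2, 2))   # 40: a plain 2x upsampling step
print(out_size(20, 4, 4))   # 80: the 4x P5-to-P3 skip connection

p5 = layers.Input((20, 20, 256))   # stride-32 level for a 640x640 input
c4 = layers.Input((40, 40, 256))   # stride-16 lateral feature
c3 = layers.Input((80, 80, 256))   # stride-8 lateral feature
p4 = layers.Concatenate()([layers.Conv2DTranspose(256, 2, strides=2)(p5), c4])
p3 = layers.Concatenate()([layers.Conv2DTranspose(256, 2, strides=2)(p4),
                           layers.Conv2DTranspose(256, 4, strides=4)(p5),  # skip
                           c3])
model = tf.keras.Model([p5, c4, c3], p3)
```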
V. EXPERIMENT

This section presents the performance of the proposed method through different experiments. All experiments are performed on UBPD, which is divided into a training dataset (955 ultrasound images) and a test set (97 ultrasound images) and contains 4 categories plus the background. To evaluate the performance of the proposed network, extensive experiments are conducted on the test set for both the detection and segmentation tasks.
A. Implementation Details
Setup.
The proposed network is implemented with a TensorFlow backend and runs on an Nvidia GTX-1080Ti. The backbones used in the feature extraction network are ResNet-101, ResNet-50, and VGG-19, respectively. In addition, to constrain the parameters of the whole network, the number of channels in each block is reduced to half of that used in the standard network, thereby greatly reducing the overall number of parameters.
Training.
The network is trained end-to-end on the training dataset of UBPD with 2 images per GPU. The SGD optimizer with a fixed momentum of 0.9 and weight decay of 0.0001 is adopted. The learning rate is initialized to 0.01 and decays by a factor of 10 every 10 epochs. Since the ultrasound images are collected from different ultrasonic devices and the image sizes vary, the images are first resized to 640 × 640 using bilinear interpolation during training. The training dataset is relatively small compared with natural image datasets, so the network is trained for a total of 100 epochs to avoid overfitting. No module is pre-trained on any dataset or specially initialized. The feature extraction network consists of five blocks. Each block contains a series of convolution and downsampling operations that reduce the size of the feature map to half of its original size. The multi-scale outputs of the feature extraction network are saved and denoted C1, C2, C3, C4, and C5, respectively.
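For reference, this training configuration might be set up as follows in TensorFlow. How the 0.0001 weight decay was realized (e.g., as L2 regularization on the layers) is not stated in the paper, so it is noted only as a comment.

```python
# A hedged sketch of the stated training configuration.
import tensorflow as tf

def lr_schedule(epoch, lr=None):
    # ignore the incoming lr; recompute from the initial 0.01,
    # decaying by a factor of 10 every 10 epochs
    return 0.01 * (0.1 ** (epoch // 10))

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
# weight decay 0.0001 would additionally be applied, e.g. via L2 regularizers
lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

def preprocess(image):
    # resize variable-size scans from different devices to a fixed 640x640
    return tf.image.resize(image, (640, 640), method="bilinear")
```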
B. Experimental Evaluation
In this work, the performance of the proposed network is evaluated from three aspects: object detection, segmentation, and computational efficiency. For the detection task, the goal is to predict the bounding box (B-box) of each object of each class in a test image. For the segmentation task, the goal is to predict the class of each pixel in a test image, or background if the pixel does not belong to one of the 4 specified categories.
Fig. 5. (a) is the Feature Pyramid Network (FPN), where the feature maps are gradually upsampled by bilinear interpolation with a factor of 2. (b) enhances feature reuse in the FPN. (c) is our improved upsampling structure: the feature maps are upsampled by transposed convolution, and P5 is directly connected to P3 with a factor of 4.

To measure the detection and segmentation precision, we use the average precision (AP) (i.e., the area under the precision/recall curve) over all categories as the metric. The mask AP (AP_mask) is the metric for segmentation, and the B-box AP (AP_B-box) is the metric for detection. To compute this metric, a group of Intersection over Union (IoU) thresholds (0.5, 0.6, and 0.7) is used. The IoU is the overlap ratio between the prediction and the ground truth; if the overlap ratio is lower than the IoU threshold, the prediction is considered false. The IoU is defined as follows:

$$IoU = \frac{I_g \cap I_p}{I_g \cup I_p}, \qquad (9)$$

where $I_g$ is the area of the ground truth mask (ground truth B-box), and $I_p$ denotes the area of the predicted mask (predicted B-box).

As shown in Tables I, II, III, and V, we report AP_50, AP_60, AP_70, and AP. Specifically, AP_50, AP_60, and AP_70 denote the AP computed at IoU thresholds of 0.5, 0.6, and 0.7, respectively, and AP in the last column is the AP averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
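The IoU test of Eq. (9), specialized to axis-aligned boxes, and the thresholding at 0.5, 0.6, and 0.7 can be sketched as follows; the full AP additionally integrates precision and recall over all ranked predictions.

```python
# A minimal sketch of the IoU check behind the AP_50/AP_60/AP_70 metrics.
def box_iou(a, b):
    """a, b: (x1, y1, x2, y2). Returns intersection over union."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (10, 10, 50, 50), (20, 20, 60, 60)   # hypothetical boxes
iou = box_iou(pred, gt)
for t in (0.5, 0.6, 0.7):
    verdict = "true" if iou >= t else "false"
    print(f"IoU={iou:.2f} -> {verdict} positive at threshold {t}")
```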
C. Ablation Study

To verify the validity of the proposed modules, ablation studies are implemented by removing one or more modules. The standard metrics consist of AP_50, AP_60, AP_70, and AP for both B-boxes and segmentation masks. To reveal the contribution of each proposed component, all settings other than the differences specified in each ablation experiment are kept consistent (following the training method in Implementation Details). The results of the different studies are shown in Table II and Table III. In the following, the performance of each proposed module is discussed in detail.
TABLE I
THE COMPARISON OF EFFECTS OF DIFFERENT FEATURE EXTRACTION NETWORKS ON DETECTION AP (AP_B-box) AND MASK AP (AP_mask).

(a) Mask AP of different feature extraction networks

Conditions            | AP_50 | AP_60 | AP_70 | AP
ResNet-50 backbone    | 43.81 | 39.22 | 31.67 | 28.31
ResNet-101 backbone   | 45.47 | 39.45 | 31.94 | 28
VGG-19 backbone       | 49.33 | 45.19 | 35.21 | 30.97

(b) Detection AP of different feature extraction networks

Conditions            | AP_50 | AP_60 | AP_70 | AP
ResNet-50 backbone    | 50.77 | 45.6  | 38.63 | 32.58
ResNet-101 backbone   | 58    | 48.71 | 38.4  | 33.11
VGG-19 backbone       | 59.09 | 51.45 | 34.57 | 33.54
Feature extraction network.
The performance of different backbones (VGG-19 [38], ResNet-50 [17], and ResNet-101) as the feature extraction network is evaluated first. Both the VGG-19 backbone and the ResNet-101 backbone are used as baselines in the experiments. ResNet has been shown to be more useful and efficient than VGG-19 in general, but as Table I shows, VGG-19 performs better here. With the fewest parameters, the VGG-19 backbone achieves segmentation performance of 49.33, 45.19, 35.21, and 30.97 at AP_50, AP_60, AP_70, and AP, respectively. For detection, the VGG-19 backbone also reaches the best scores of 59.09, 51.45, 34.57, and 33.54 at AP_50, AP_60, AP_70, and AP, respectively. The main reason is that the ultrasound image has low resolution and less semantic information; when many convolutional layers are stacked, the edges and shapes of targets may be discarded.
TABLE II
THE SEGMENTATION RESULTS OF ABLATION EXPERIMENTS EVALUATED ON THE UBPD DATASET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD.

(a) SLCF performance

Conditions           | AP_50 | AP_60 | AP_70 | AP
ResNet-101 backbone  | 45.47 | 39.45 | 31.94 | 28
VGG-19 backbone      | 49.33 | 45.19 | 35.21 | 30.97
ResNet-101+SLCF      | 49.45 | 41.06 | 33.75 | 30.43
VGG-19+SLCF          | 54.53 | 46.27 | 36.15 | 33.21

(b) SAG performance

Conditions           | AP_50 | AP_60 | AP_70 | AP
ResNet-101 backbone  | 45.47 | 39.45 | 31.94 | 28
VGG-19 backbone      | 49.33 | 45.19 | 35.21 | 30.97
ResNet-101+SAG       | 51.1  | 44.13 | 34.44 | 31.8
VGG-19+SAG           | 52.05 | 46.34 | 36.07 | 33.9

(c) SLCF+SAG performance

Conditions            | AP_50 | AP_60 | AP_70 | AP
ResNet-101 backbone   | 45.47 | 39.45 | 31.94 | 28
VGG-19 backbone       | 49.33 | 45.19 | 35.21 | 30.97
ResNet-101+SLCF+SAG   | 51.1  | 44.13 | 34.44 | 31.8
VGG-19+SLCF+SAG       | 54.06 | 48.62 | 36.55 | 35.2

(d) SLCF+SAG+SC performance

Conditions              | AP_50 | AP_60 | AP_70 | AP
ResNet-101 baseline     | 45.47 | 39.45 | 31.94 | 28
VGG-19 backbone         | 49.33 | 45.19 | 35.21 | 30.97
ResNet-101+SLCF+SAG+SC  | 50.55 | 46.19 | 39.53 | 33.56
VGG-19+SLCF+SAG+SC      |       |       |       |
Spatial Local Contrast Feature (SLCF).
This experiment discusses the improvement brought by the SLCF. As shown in Table II (a) and Table III (a), compared with the baseline with the VGG-19 backbone, the SLCF increases the segmentation results by 5.2, 1.06, 0.94, and 2.24 at AP_50, AP_60, AP_70, and AP, respectively. For detection, the SLCF improves the results by 4.73, 3.43, 1.34, and 4.23 at AP_50, AP_60, AP_70, and AP, respectively. This indicates that considering both local and spatial contrast is important and useful for segmentation and detection in ultrasound images.

Self Attention Gate (SAG).
As shown in Table IV, adding the SAG to the VGG-19 backbone increases the parameters by only 0.78 M. The results of using the SAG module alone are shown in Tables II and III (b). Comparing the results in Table II (a), (b), and (c), it can be seen that the SLCF and SAG promote each other when used together, performing better than either the SLCF or the SAG alone.
Skip Concatenation with Transpose Convolution (SC).
TABLE III
THE DETECTION RESULTS OF ABLATION EXPERIMENTS EVALUATED ON THE UBPD DATASET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD.

(a) SLCF performance

Conditions           | AP_50 | AP_60 | AP_70 | AP
ResNet-101 backbone  | 58    | 48.71 | 38.4  | 33.31
VGG-19 backbone      | 59.09 | 51.45 | 34.57 | 33.54
ResNet-101+SLCF      | 58.99 | 52.26 | 40.45 | 33.91
VGG-19+SLCF          | 61.64 | 53.69 | 39.05 | 33.91

(b) SAG performance

Conditions           | AP_50 | AP_60 | AP_70 | AP
ResNet-101 backbone  | 58    | 48.71 | 38.4  | 33.31
VGG-19 backbone      | 59.09 | 51.45 | 34.57 | 33.54
ResNet-101+SAG       | 58.08 | 50.22 | 39.37 | 33.88
VGG-19+SAG           | 59.33 | 51.53 | 36.29 | 34.26

(c) SLCF+SAG performance

Conditions            | AP_50 | AP_60 | AP_70 | AP
ResNet-101 backbone   | 58    | 48.71 | 38.4  | 33.31
VGG-19 backbone       | 59.09 | 51.45 | 34.57 | 33.54
ResNet-101+SLCF+SAG   | 59.53 | 53.18 | 39.33 | 35.57
VGG-19+SLCF+SAG       | 62.85 | 53.46 | 38.93 | 36.81

(d) SLCF+SAG+SC performance

Conditions              | AP_50 | AP_60 | AP_70 | AP
ResNet-101 baseline     | 58    | 48.71 | 38.4  | 33.31
VGG-19 backbone         | 59.09 | 51.45 | 34.57 | 33.54
ResNet-101+SLCF+SAG+SC  | 60.01 | 53.87 | 39.28 | 35.67
VGG-19+SLCF+SAG+SC      |       |       |       |
TABLE IV
COMPUTATION COMPLEXITY COMPARISON BETWEEN THE DIFFERENT COMBINATIONS OF PROPOSED MODULES

Methods                 | Parameters (M) | FLOPs (M)
ResNet-50 backbone      | 26.54          | 53.04
ResNet-101 backbone     | 31.34          | 62.59
VGG-19 backbone         | 25.3           | 50.6
ResNet-101+SLCF         | 34.29          | 68.49
ResNet-101+SAG          | 34.29          | 68.49
ResNet-101+SLCF+SAG     | 34.36          | 68.62
VGG-19+SLCF             | 28.24          | 56.5
VGG-19+SAG              | 26.08          | 52.17
VGG-19+SLCF+SAG         | 28.31          | 56.63
ResNet-101+SLCF+SAG+SC  |                |
VGG-19+SLCF+SAG+SC      |                |
To further demonstrate the effectiveness of the synergy of SC with SLCF and SAG, the SC module is combined with SLCF and SAG in this experiment. Because transposed convolution is adopted instead of bilinear interpolation to upsample the feature maps, the parameters increase slightly. The computation complexity comparison in Table IV shows that ResNet-101+SLCF+SAG+SC adds only 0.36 M parameters compared with ResNet-101+SLCF+SAG. But comparing their segmentation results in Table II (c) and (d), SC greatly improves the performance by 3.96, 0.3, 5.51, and 1.98 at AP_50, AP_60, AP_70, and AP, respectively, which indicates that the three modules cooperate well with each other. As shown in Figure 7, combining all the proposed modules greatly improves performance on both the segmentation and detection tasks.

It is important to constrain the parameters of the network while improving performance, so the parameters and FLOPs (floating point operations) of the proposed method and the baselines are compared. As shown in Table IV, the proposed network adds only a little computational complexity but greatly improves performance on both segmentation and detection. This also shows that the proposed SLCF and SAG are very lightweight and easy to use.
TABLE V
PERFORMANCE COMPARISON TO DIFFERENT STATE-OF-THE-ART METHODS ON THE TEST SET OF UBPD. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD.

(a) Mask AP (AP_mask)

Conditions  | AP_50 | AP_60 | AP_70 | AP
Mask R-CNN  | 43.96 | 39.45 | 30.41 | 27.89
YOLACT      | 57.15 | 45.95 |       |
OURS        |       |       |       |

(b) Detection AP (AP_B-box)

Conditions  | AP_50 | AP_60 | AP_70 | AP
Mask R-CNN  | 58    | 48.71 | 38.4  | 33.11
YOLACT      | 62.01 | 51.58 | 41.28 | 30.39
OURS        | 62.97 | 56.46 | 42.01 | 37.62
D. Main Results
Comparison experiments with Mask R-CNN and YOLACT have also been conducted. Both are state-of-the-art methods for image instance segmentation, and all methods are implemented from the originally provided code. Table V presents the comparison results. For the detection task, our method achieves the best results of 62.97, 56.46, 42.01, and 37.62 at AP_50, AP_60, AP_70, and AP. The proposed method also achieves strong results on the segmentation task, exceeding the other methods with the highest scores of 58.02, 48.92, and 37.18. Compared with Mask R-CNN and YOLACT, the proposed method shows consistent stability and reliability on both the segmentation and detection tasks.

The visualization of segmentation results on the test set of UBPD is shown in Figure 6; each row has 7 frames sampled from different videos. Figure 6 (a) shows the raw ultrasound images blurred by noise interference, in which the nerves are too inconspicuous to identify. The segmentation results of the proposed network are shown in Figure 6 (c); they are very close to the ground truth. This indicates that the proposed network can effectively detect and segment different tissues (nerves, arteries, veins, muscles) in ultrasound images, and generate segmentation results with finer details and smoother edges. However, compared with the proposed network, Mask R-CNN fails to effectively segment the nerve and other tissues. As shown in Figure 6 (d), there are instances with wrong segmentation categories, and some are not identified at all. These comparison results demonstrate the effectiveness of the proposed network: the BPMSegNet achieves stable and better performance on brachial plexus instance segmentation tasks.

Fig. 6. Visualization of part of the segmentation results using different networks. (c) and (d) are the segmentation results of the proposed network and Mask R-CNN.

Fig. 7. Comparing the effect of the network with all proposed modules and without.
VI. CONCLUSION

This paper addresses the problems in brachial plexus ultrasound image segmentation to effectively assist anesthesiologists in PNB. Following the nerve identification process in clinical PNB, the identification of the brachial plexus can be converted into segmenting the nerve and its surrounding tissues. In ultrasound images, however, these targets have different scales and are intertwined with each other, which makes ultrasound image segmentation difficult. Therefore, a novel method is proposed to segment multiple instances in ultrasound images. Specifically, the first ultrasound image dataset of the brachial plexus (UBPD) and a novel network (BPMSegNet) are proposed for the segmentation of the brachial plexus and its surrounding anatomy. Quantitative experiments are carried out to verify the superiority of the proposed network on the test set of UBPD, and the network consistently achieves state-of-the-art performance in both detection and segmentation tasks.
ACKNOWLEDGMENT
This work was supported by the Natural Science Foundation of Guangdong Province (Grant No. 2018A030313354), the Neijiang Intelligent Showmanship Service Platform Project (No. 180589), the Sichuan Science-Technology Support Plan Program (No. 2019YJ0636, No. 2018GZ0236, No. 18ZDYF2558), and the National Science Foundation of China - Guangdong Joint Foundation (No. U1401257).
REFERENCES

[1] M. Baby and A. S. Jereesh, "Automatic nerve segmentation of ultrasound images," 2017.
[2] M. J. Barrington and R. Kluger, "Ultrasound guidance reduces the risk of local anesthetic systemic toxicity following peripheral nerve blockade," Regional Anesthesia & Pain Medicine, vol. 38, no. 4, pp. 289-297, 2013.
[3] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," CoRR, vol. abs/1904.02689, 2019. [Online]. Available: http://arxiv.org/abs/1904.02689
[4] G. Carneiro, J. C. Nascimento, and A. Freitas, "The segmentation of the left ventricle of the heart from ultrasound data using deep learning architectures and derivative-based search methods," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 968-982, 2011.
[5] T. W. Cary, C. B. Reamer, L. R. Sultan, E. R. Mohler III, and C. M. Sehgal, "Brachial artery vasomotion and transducer pressure effect on measurements by active contour segmentation on ultrasound," Medical Physics, vol. 41, no. 2, p. 022901, 2014.
[6] A. Chauhan, L. R. Sultan, E. E. Furth, L. P. Jones, V. Khungar, and C. M. Sehgal, "Diagnostic accuracy of hepatorenal index in the detection and grading of hepatic steatosis," Journal of Clinical Ultrasound, vol. 44, no. 9, pp. 580-586, 2016.
[7] H. Chen, Z. Qin, Y. Ding, L. Tian, and Z. Qin, "Brain tumor segmentation with deep convolutional symmetric neural network," Neurocomputing.
[8] S. Choi and C. J. L. McCartney, "Evidence base for the use of ultrasound for upper extremity blocks: 2014 update," Regional Anesthesia & Pain Medicine, vol. Online First, no. 2, p. 242, 2014.
[9] N. Denny and W. Harrop-Griffiths, "Editorial I: Location, location, location! Ultrasound imaging in regional anaesthesia," British Journal of Anaesthesia.
[10] IEEE Access, vol. PP, pp. 1-1, Jul. 2019.
[11] Y. Ding, C. Li, Q. Yang, Z. Qin, and Z. Qin, "How to improve the deep residual network to segment multi-modal brain tumor images," IEEE Access, vol. PP, pp. 1-1, Oct. 2019.
[12] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," CoRR, vol. abs/1809.02983, 2018. [Online]. Available: http://arxiv.org/abs/1809.02983
[13] A. García Ron, R. Gallardo, and B. Huete Hernani, "Utilidad del tratamiento con infiltraciones ecoguiadas de toxina botulínica A en el desequilibrio muscular de niños con parálisis obstétrica del plexo braquial. Descripción del procedimiento y protocolo de actuación," Neurología, p. S0213485317300221.
[14] R. Girshick, F. Iandola, T. Darrell, and J. Malik, "Deformable part models are convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 437-446.
[15] A. Hafiane, P. Vieyres, and A. Delbos, "Deep learning with spatiotemporal consistency for nerve segmentation in ultrasound images," CoRR, vol. abs/1706.05870, 2017. [Online]. Available: http://arxiv.org/abs/1706.05870
[16] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," CoRR, vol. abs/1703.06870, 2017. [Online]. Available: http://arxiv.org/abs/1703.06870
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[18] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.
[19] H. Hu, Z. Zhang, Z. Xie, and S. Lin, "Local relation networks for image recognition," CoRR, vol. abs/1904.11491, 2019. [Online]. Available: http://arxiv.org/abs/1904.11491
[20] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-cross attention for semantic segmentation," CoRR, vol. abs/1811.11721, 2018. [Online]. Available: http://arxiv.org/abs/1811.11721
[21] B. Lei, S. Huang, R. Li, C. Bian, H. Li, Y.-H. Chou, and J.-Z. Cheng, "Segmentation of breast anatomy for automated whole breast ultrasound images with boundary regularized convolutional encoder-decoder network," Neurocomputing.
[22] G. Li and Y. Yu, "Deep contrast learning for salient object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 478-487.
[23] X. Liang, L. Lin, Y. Wei, X. Shen, J. Yang, and S. Yan, "Proposal-free network for instance-level object segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 2978-2991, 2017.
[24] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, "Feature pyramid networks for object detection," CoRR, vol. abs/1612.03144, 2016. [Online]. Available: http://arxiv.org/abs/1612.03144
[25] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[26] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical Image Analysis.
[27] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759-8768.
[28] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431-3440.
[29] P. Looney, G. N. Stevenson, K. H. Nicolaides, W. Plasencia, M. Molloholli, S. Natsis, and S. L. Collins, "Automatic 3D ultrasound segmentation of the first trimester placenta using deep learning." IEEE, 2017, pp. 279-282.
[30] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[31] F. Milletari, S.-A. Ahmadi, C. Kroll, A. Plate, V. Rozanski, J. Maiostre, J. Levin, O. Dietrich, B. Ertl-Wagner, K. Bötzel et al., "Hough-CNN: Deep learning for segmentation of deep brain regions in MRI and ultrasound," Computer Vision and Image Understanding, vol. 164, pp. 92-102, 2017.
[32] F. Milletari, S.-A. Ahmadi, C. Kroll, A. Plate, V. Rozanski, J. Maiostre, J. Levin, O. Dietrich, B. Ertl-Wagner, K. Bötzel, and N. Navab, "Hough-CNN: Deep learning for segmentation of deep brain regions in MRI and ultrasound," Computer Vision and Image Understanding.
[33] J. A. Noble and D. Boukerroui, "Ultrasound image segmentation: A survey," IEEE Transactions on Medical Imaging, vol. 25, no. 8, pp. 987-1010, 2006.
[34] M. H. Noe, O. Rodriguez, L. Taylor, L. Sultan, C. Sehgal, S. Schultz, J. M. Gelfand, M. A. Judson, and M. Rosenbach, "High frequency ultrasound: a novel instrument to quantify granuloma burden in cutaneous sarcoidosis," Sarcoidosis, Vasculitis, and Diffuse Lung Diseases: Official Journal of WASOG, vol. 34, no. 2, p. 136, 2017.
[35] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[36] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157-173, 2008.
[37] D. Shen, G. Wu, and H. I. Suk, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, no. 1, 2017.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[39] B. D. Sites and R. Brull, "Ultrasound guidance in peripheral regional anesthesia: philosophy, evidence-based medicine, and techniques," Current Opinion in Anesthesiology, vol. 19, no. 6, pp. 630-639, 2006.
[40] M. van Rosmalen, D. Lieba-Samal, S. Pillen, and N. van Alfen, "Ultrasound of peripheral nerves in neuralgic amyotrophy," Muscle & Nerve, 2018.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[42] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794-7803.
[43] X. Yang, P. J. Rossi, A. B. Jani, H. Mao, W. J. Curran, and T. Liu, "3D transrectal ultrasound (TRUS) prostate segmentation based on optimal feature learning framework," in Medical Imaging 2016: Image Processing, vol. 9784. International Society for Optics and Photonics, 2016, p. 97842F.
[44] J. M. Youngner, K. Matsuo, T. Grant, A. Garg, J. Samet, and I. M. Omar, "Sonographic evaluation of uncommonly assessed upper extremity peripheral nerves: anatomy, technique, and clinical syndromes," Skeletal Radiology, vol. 48, no. 1, pp. 57-74, 2019. [Online]. Available: https://doi.org/10.1007/s00256-018-3028-z
[45] Y. Yuan and J. Wang, "OCNet: Object context network for scene parsing," CoRR, vol. abs/1809.00916, 2018. [Online]. Available: http://arxiv.org/abs/1809.00916
[46] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-attention generative adversarial networks," arXiv preprint arXiv:1805.08318, 2018.
[47] Q. Zhang, Z. Cui, X. Niu, S. Geng, and Y. Qiao, "Image segmentation with pyramid dilated convolution based on ResNet and U-Net," in International Conference on Neural Information Processing. Springer, 2017, pp. 364-372.
[48] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun, "Monocular object instance segmentation and depth ordering with CNNs," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2614-2622.
[49] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia, "PSANet: Point-wise spatial attention network for scene parsing," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267-283.
[50] H. Zhao and N. Sun, "Improved U-Net model for nerve segmentation."