A Universal Deep Learning Framework for Real-Time Denoising of Ultrasound Images
Simone Cammarasana, Paolo Nicolardi, Giuseppe Patanè
CNR-IMATI & ESAOTE SpA
January 25, 2021
Abstract
Ultrasound images are widespread in medical diagnosis for muscle-skeletal, cardiac, and obstetrical diseases, due to the efficiency and non-invasiveness of the acquisition methodology. However, ultrasound acquisition introduces a speckle noise in the signal, which corrupts the resulting image and affects further processing operations, as well as the visual analysis that medical experts conduct to estimate patient diseases. Our main goal is to define a universal deep learning framework for real-time denoising of ultrasound images. We analyse and compare state-of-the-art methods for the smoothing of ultrasound images (e.g., spectral, low-rank, and deep learning denoising algorithms), in order to select the best one in terms of accuracy, preservation of anatomical features, and computational cost. Then, we propose a tuned version of the selected state-of-the-art denoising method (i.e., WNNM), to improve the quality of the denoised images and extend its applicability to ultrasound images. To handle large data sets of ultrasound images with respect to applications and industrial requirements, we introduce a denoising framework that exploits deep learning and HPC tools, and allows us to replicate the results of state-of-the-art denoising methods in a real-time execution.
Ultrasound imaging, or sonography, uses high-frequency sound waves (above the range of human hearing) to visualise soft tissues, such as internal organs. Ultrasonic sound waves are reflected off the different layers of body tissues. The transducer converts the echoes into electrical signals that are used to create an image and to display it on a screen. The output 2D image is based on the frequency and strength of the sound signal and the time the echoes take to return. Ultrasound images are widespread in medical diagnosis for muscle-skeletal, cardiac, and obstetrical diseases, due to the efficiency and non-invasiveness of the acquisition methodology. The main issues of the ultrasound techniques are: a significant loss of information during the reconstruction of the signal, the dependency of the signal on the direction of acquisition, and a speckle noise that corrupts the resulting image and significantly affects the evaluation of the morphology of the anatomical district. For these reasons, in the last decades several denoising methods for ultrasound images have been proposed (Sect. 2). Ultrasound acquisition also introduces several practical challenges, such as the large size (e.g., 600 × 485 pixels) of the images to be processed.

Overview and contribution
Our main goal is to define a universal deep learning framework for real-time denoising of ultrasound images. Our approach aims at providing an accurate image of the analysed area to medical experts or to post-processing steps, such as classification (Cheng et al. 2010), feature extraction (Iakovidis et al. 2008), segmentation (Huang et al. 2017, Milletari et al. 2017), and quantitative analysis (Kovalski et al. 2000), for a more accurate healthcare monitoring, diagnosis, and estimation of patient diseases (e.g., the dimension of a tumour, the identification of a certain tissue). The proposed framework combines three main elements: low-rank denoising, deep learning, and High Performance Computing (HPC). Our main contributions are:

• the comparison of state-of-the-art methods for the denoising of images affected by speckle noise, with a focus on ultrasound images (Sect. 3). To this end, we perform a quantitative and qualitative evaluation (by ultrasound experts) of the selected denoising methods on generic data sets and on large sets of ultrasound images acquired from different anatomical districts (e.g., muscle-skeletal, obstetric, abdominal);

• a specialisation of denoising algorithms (e.g., WNNM) from generic images to ultrasound images, based on the tuning of the algorithm parameters and a specialisation of the denoising filters;

• the design and implementation of a deep learning framework for real-time denoising of ultrasound images through the learning of the best smoothing algorithms (Sect. 4);

• the design and implementation of an HPC framework for the proposed deep learning framework;

• a discussion on the main results, conclusions, and future work (Sect. 5).

Given a noisy image Y = X + NX, where X is the normalised ground-truth image, we define the multiplicative noise N(x) = √σ u, where u ∼ U(−0.5, 0.5) follows a uniform distribution, σ is the noise intensity, and x is a pixel of the image. Most of the methods presented here share a common approach for recovering the ground-truth image.
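As a concrete reference, the noise model above can be sketched in a few lines of NumPy; the interval (−0.5, 0.5) for u and the [0, 1] normalisation of X are assumptions, since the exact bounds are not fully legible in the source.

```python
import numpy as np

def add_speckle(X, sigma, rng=None):
    """Corrupt a normalised image X (values in [0, 1]) with the
    multiplicative model Y = X + N X, where the per-pixel noise is
    N(x) = sqrt(sigma) * u and u is uniform (bounds assumed here)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(-0.5, 0.5, size=X.shape)
    return X + np.sqrt(sigma) * u * X

X = np.full((64, 64), 0.5)                      # flat grey ground-truth
Y = add_speckle(X, sigma=0.1, rng=np.random.default_rng(0))
```

Since the noise is multiplicative, brighter regions are corrupted more strongly than darker ones, which is the defining difficulty of speckle compared to additive Gaussian noise.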
Given a pixel x, the patch P_x is the set of pixels in the neighbourhood of x; each pixel of the image has a related patch. The search window is the set of patches considered for searching the closest ones to a reference patch, under a certain metric. The stack, or 3D block, is the set of patches that are similar to a reference patch; these patches are stored in a 3D structure, and the redundancy of the stack is exploited to remove the noise.
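A minimal sketch of the patch-stacking step described above, assuming a plain L2 distance and a square search window (common choices; the methods surveyed below vary the metric):

```python
import numpy as np

def build_stack(img, ref, patch=7, search=21, k=8):
    """Collect the k patches most similar (L2 distance) to the patch
    centred at `ref`, restricted to a search x search window, and
    store them as a (k, patch, patch) 3D block."""
    h, s = patch // 2, search // 2
    y0, x0 = ref
    ref_patch = img[y0 - h:y0 + h + 1, x0 - h:x0 + h + 1]
    candidates = []
    for y in range(max(h, y0 - s), min(img.shape[0] - h, y0 + s + 1)):
        for x in range(max(h, x0 - s), min(img.shape[1] - h, x0 + s + 1)):
            p = img[y - h:y + h + 1, x - h:x + h + 1]
            candidates.append((np.sum((p - ref_patch) ** 2), y, x))
    candidates.sort(key=lambda t: t[0])          # closest patches first
    return np.stack([img[y - h:y + h + 1, x - h:x + h + 1]
                     for _, y, x in candidates[:k]])
```

The reference patch itself always has distance zero, so it is the first slice of the returned block; the remaining slices provide the redundancy that the denoising step exploits.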
Non-local methods
The Non-Local Means (NLM) method (Buades et al. 2005) exploits the pattern redundancy of images: each patch is restored with a weighted average of all the other patches, where each weight is proportional to the similarity among the patches. The Bayesian non-local mean filter (Kervrann et al. 2007) improves the NLM with the introduction of a Bayesian estimator as distance measure among the patches, which allows the user to better determine the amount of smoothing from the noise variance of the patch. The anisotropic neighbourhood in NLM (Maleki et al. 2013) uses the image gradient to estimate the orientation of the edges, and then adapts the patches to match the local edges. Finally, the structure of the search window is improved through the computation of an optimal search window for each pixel (Verma & Pandey 2017), according to the smoothing degree of the related patch.
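The NLM idea can be condensed into a deliberately unoptimised sketch; the Gaussian patch-similarity weighting with bandwidth h follows Buades et al., while the parameter values are illustrative assumptions:

```python
import numpy as np

def nlm(img, patch=5, search=11, h=0.1):
    """Minimal Non-Local Means: each pixel becomes a weighted average
    of the pixels in its search window, where each weight decays with
    the L2 distance between the surrounding patches."""
    r, s = patch // 2, search // 2
    pad = np.pad(img, r + s, mode='reflect')
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            ci, cj = i + r + s, j + r + s
            ref = pad[ci - r:ci + r + 1, cj - r:cj + r + 1]
            num = den = 0.0
            for di in range(-s, s + 1):
                for dj in range(-s, s + 1):
                    p = pad[ci + di - r:ci + di + r + 1,
                            cj + dj - r:cj + dj + r + 1]
                    w = np.exp(-np.sum((p - ref) ** 2) / h ** 2)
                    num += w * pad[ci + di, cj + dj]
                    den += w
            out[i, j] = num / den
    return out
```

On a constant image all weights are equal and the filter is the identity; on noisy images, self-similar regions average each other out while dissimilar patches contribute almost nothing.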
Anisotropic methods
In (Perona & Malik 1990), the smoothed image is computed as the solution to an anisotropic diffusion equation, using the gradient of the image to guide the diffusion process. The variant in (Yu & Acton 2002) exploits the Lee (Lee 1980) and Frost (Frost et al. 1982) filters, which are edge-sensitive to speckle noise. An improvement of the previous results is achieved by (Aja-Fernández & Alberola-López 2006), applying the Kuan filter (Kuan et al. 1985) in the diffusion equation and improving the criteria for the selection of the neighbourhood used for the estimation of the statistical parameters. The anisotropic method in (Bai & Feng 2007) introduces a class of fractional-order anisotropic diffusion equations, using the Fourier transform to compute the fractional derivatives and the discrete Fourier transform to compute the fractional-order differences.
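A minimal explicit scheme for Perona-Malik diffusion, using the exponential conductance function (one of the two proposed in the original paper); the step size and iteration count are illustrative, and periodic boundaries are used for brevity:

```python
import numpy as np

def perona_malik(img, n_iter=20, kappa=0.1, dt=0.2):
    """Explicit Perona-Malik anisotropic diffusion: the conductance
    g = exp(-(|grad|/kappa)^2) vanishes near strong edges, so the
    smoothing acts inside homogeneous regions and stops at boundaries."""
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)
    for _ in range(n_iter):
        # one-sided differences towards the four neighbours (periodic)
        dN = np.roll(u, -1, axis=0) - u
        dS = np.roll(u,  1, axis=0) - u
        dE = np.roll(u, -1, axis=1) - u
        dW = np.roll(u,  1, axis=1) - u
        u += dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
    return u
```

With dt ≤ 0.25 the update is a non-negative averaging of neighbours, so flat regions are left untouched while the variance of a noisy image decreases.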
Spectral decomposition methods
Spectral decomposition transforms a signal into its spectral domain, and exploits the sparsity of the transformed signal to remove noise through a threshold operation. Several transformations have been applied to image denoising, such as Wavelets (Mihcak et al. 1999, Chang et al. 2000, Portilla et al. 2003, Liu et al. 2017), Curvelets (Starck et al. 2002), Contourlets (Da Cunha et al. 2006), and Shearlets (Yang et al. 2014). The 3D block-matching method (Dabov et al. 2006) computes and stacks similar patches through NLM; each stack is transformed into its spectral domain with a wavelet decomposition, denoised through a hard/soft thresholding, and reconstructed in the space domain. Then, the smoothed patches are aggregated by a collaborative filter, in order to reconstruct the smoothed image. The synthetic aperture radar block matching 3-D (SAR-BM3-D) method (Parrilli et al. 2011) introduces a speckle-based variant of 3D block matching; the similarity among the patches is computed by considering the probability distribution of the speckle noise as distance metric; furthermore, the hard/soft thresholding of the wavelet-transformed signal is replaced by a Local Linear Minimum Mean Square Error (LLMMSE) filter. The principal component analysis block matching 3-D (PCA-BM3-D) method (Dabov et al. 2009) improves the stacking operation of 3D block-matching by using shape-adaptive neighbourhoods, which enable its local adaptability to image features. Furthermore, the 3D transformation of each stack to the spectral domain is performed through the PCA (Wold et al. 1987) and an orthogonal 1-D transformation in the third dimension.
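The transform-threshold-invert pipeline common to these methods can be illustrated with a global 2D DCT standing in for the wavelet/PCA transforms used by the methods above (a simplification for illustration only):

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_denoise(img, thresh):
    """Transform-domain denoising in one shot: map the image to a
    sparse spectral basis (here a 2D DCT), zero out the small
    coefficients (hard thresholding), and transform back."""
    coeffs = dctn(img, norm='ortho')
    coeffs[np.abs(coeffs) < thresh] = 0.0
    return idctn(coeffs, norm='ortho')
```

Since natural images concentrate energy in few coefficients while noise spreads evenly over all of them, a threshold proportional to the noise level suppresses mostly noise; with a zero threshold the round trip is the identity.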
Low rank methods
Low-rank approximation methods compute the denoised image as the solution to a weighted minimisation problem, whose cost function is the Frobenius norm (Srebro & Jaakkola 2003) or the ℓ1 norm (Eriksson & Van Den Hengel 2010) between the input and the target images. The relation between local and non-local information (Dong, Shi & Li 2012) allows us to estimate signal variances, by interpreting the Singular Value Decomposition (SVD) through a bilateral variance estimation. In (Rajwade et al. 2012), a high-order singular value decomposition is applied to 3D blocks, and the smoothed image is achieved with a hard thresholding of the decomposed signal. The Weighted Nuclear Norm Minimisation (WNNM) (Gu et al. 2017) computes the stacks as in the 3D block-matching method, performs an SVD on the stacks, and applies a weighted thresholding to the singular values, where lower weights correspond to higher singular values. In fact, the lower singular values capture the noisy component of the image, so their reduction must be stronger. Finally, the collaborative filtering for the aggregation of the smoothed patches is performed as in the 3D block-matching method. The weighted nuclear norm and the histogram preservation (Zhang & Desrosiers 2018) are combined in a single constrained optimisation problem, which is solved through the alternating direction method of multipliers (Boyd et al. 2011).
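The core WNNM shrinkage step on a single stack can be sketched as follows; the weighting rule used here (a single constant divided by each singular value) is a simplified stand-in for the full WNNM weights, not the exact formula of Gu et al.:

```python
import numpy as np

def weighted_sv_shrinkage(stack, noise_level, eps=1e-8):
    """WNNM-style step on a 3D block: flatten the k similar patches
    into a k x p matrix, take its SVD, and soft-threshold the singular
    values with weights inversely proportional to their magnitude, so
    the small (noise-dominated) values shrink the most."""
    k = stack.shape[0]
    M = stack.reshape(k, -1)                      # one patch per row
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    w = noise_level / (s + eps)                   # large sigma -> small weight
    s_shrunk = np.maximum(s - w, 0.0)
    return (U * s_shrunk) @ Vt

rng = np.random.default_rng(1)
stack = rng.random((8, 5, 5))                     # 8 similar 5x5 patches
denoised = weighted_sv_shrinkage(stack, noise_level=0.5)
```

Every singular value is reduced, so the result always has a smaller nuclear norm than the input matrix, which is exactly the low-rank bias the text describes.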
External learning methods
In the K-SVD algorithm (Aharon et al. 2006), the signal is represented as a linear combination of an over-complete dictionary of atoms, which are iteratively updated through an SVD of the representation error, in order to better fit the data. A learned simultaneous sparse coding method (Mairal et al. 2009) integrates sparse dictionary learning with non-local self-similarities of natural images. The non-locally centralised sparse representation method (NCSR) (Dong, Zhang, Shi & Li 2012) exploits the non-local redundancies, combined with local sparsity properties, to estimate the coefficients of the sparse representation of the input image; the dictionary is learned by clustering the patches of the image into K clusters through the K-means method (MacQueen et al. 1967), and then learning a PCA sub-dictionary for each cluster. This method has been further improved in (Xu et al. 2017), by proposing a fast version based on a pre-learned dictionary, and achieving an improvement of the computational efficiency.
Deep learning methods for denoising
In the Noise2Noise algorithm (Lehtinen et al. 2018), the network learns to denoise images considering only the noisy data, without any knowledge of the ground-truth. The Noise2Void algorithm (Krull et al. 2019) further expands this idea, and does not require couples of noisy images for the training phase. This approach is relevant in biomedical fields, where there are no ground-truth images. The Noise2Self method (Batson & Royer 2019) proposes a self-supervised algorithm that does not require any prior information on the input image, estimation of the noise, or ground-truth data. The smoothing of images in (Fang & Zeng 2020) is achieved through the extraction of the edge features from the noisy image with a convolutional neural network (CNN), and the combination of the edge regularisation with the total variation regularisation. The block-matching Convolutional Neural Network (BM-CNN) (Ahn & Cho 2017) integrates a deep learning approach with the 3D block-matching method; the denoising of the stacks is predicted through a DnCNN (Zhang et al. 2017), a feed-forward CNN that smooths images independently of the noise level by exploiting residual learning and batch normalisation, and that is trained with a data set of 400 images, corresponding to more than 250K training samples. Then, the blocks are aggregated and the image is reconstructed, as in the 3D block-matching algorithm.
Deep learning methods for image-to-image regression
The VGG19 network (Simonyan & Zisserman 2014) increases the depth of Convolutional Neural Networks (CNNs) to 16-19 weight layers, using small 3 × 3 convolution filters. The Pix2Pix (Isola et al. 2017) method is a Generative Adversarial Network (GAN), where the generator is a U-net (Ronneberger et al. 2015), the discriminator is an encoding network (Krizhevsky et al. 2012), and the loss function is based on the binary cross entropy. The deep convolutional generative adversarial network (Radford et al. 2015) applies unsupervised learning to image classification tasks and to the generation of natural images, exploiting strided convolutions, batch normalisation, rectified linear unit activations, and removing fully connected hidden layers.
We compare five state-of-the-art denoising methods belonging to the spectral, low-rank, and deep learning classes, identifying their differences and analogies. The selected algorithms are the WNNM, SAR-BM3-D, BM-CNN, NCSR, and BM3D-PCA (Sect. 2) filters. Then, we identify the best method(s) with respect to quantitative metrics and qualitative evaluation, in terms of smoothing, edge preservation and enhancement, and feature preservation, and with respect to different data sets (i.e., generic images with different noise intensity, and ultrasound images from different anatomical districts). We introduce the data sets (Sect. 3.1) and the quantitative metrics for the evaluation of the denoising quality (Sect. 3.2); then, we discuss the quantitative (Sect. 3.3) and qualitative (Sect. 3.4) results of these tests, and their computation time (Sect. 3.5). Finally, we discuss the proposed improvements to WNNM (Sect. 3.6).
We consider two data sets for the comparison of the selected denoising methods. The SIPI data set (Weber 1997) is composed of 44 ground-truth images of different sizes and belonging to different classes (e.g., humans, landscapes); we add an artificial speckle noise, with different levels of noise intensity, in order to evaluate the efficiency of the denoising methods. The Esaote data set contains more than 3K ultrasound images at different resolutions, acquired from different (e.g., muscle-skeletal, obstetric, abdominal) districts. This data set is used to verify the performance of the denoising methods when applied to ultrasound images, and to visually analyse the smoothing effects and the preservation of the greyscale values, edges, and anatomical features.
The selected denoising methods are compared through quantitative metrics and a qualitative evaluation. As quantitative metrics, we consider the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), which are applied to compare the input ground-truth with the smoothed images on the SIPI data set. However, the use of quantitative metrics is not sufficient to evaluate the quality of the denoising algorithms, especially for ultrasound images; in fact, a good quantitative result may not correspond to a satisfactory preservation of the anatomical features. For this reason, we integrate these quantitative metrics with a qualitative and visual evaluation of the smoothed images, in terms of edge and feature preservation. In particular, the qualitative evaluation on the Esaote data set has been performed through the assessment of experts of ultrasound image denoising.
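PSNR follows directly from the mean squared error; a minimal NumPy version is given below (SSIM is more involved, and in practice an off-the-shelf implementation such as skimage.metrics.structural_similarity can be used):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio (in dB) between a ground-truth image
    `ref` and a denoised image `test`, both normalised to [0, peak]."""
    mse = np.mean((ref - test) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore a PSNR of 20 dB, which helps calibrate the values reported in Tables 1 and 3.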
Table 1 summarises the quantitative metrics (Sect. 3.2) of the denoising methods on the SIPI data set; we compute the average value of each metric (i.e., PSNR and SSIM) among the 44 images; furthermore, we report the average values when varying the intensity of the speckle noise. SAR-BM3D has the best results under these metrics; in particular, when considering the PSNR index, SAR-BM3D outperforms all the other methods. The NCSR, WNNM, and BM3D-PCA methods have good and similar results in terms of the PSNR and SSIM indices.

Figure 1: Input (SIPI data set, van image), noisy, and denoised images (WNNM, SAR-BM3D, BM3D-PCA, NCSR, BM-CNN).

The use of quantitative metrics alone, such as PSNR and SSIM, is generally not enough to evaluate the quality of the denoising methods. A visual evaluation is mandatory, to detect possible blurring effects, artefacts, or poor edge preservation. In particular, this analysis is relevant in the ultrasound field, where the preservation of the anatomical features is significant for a correct medical diagnosis. For this reason, we evaluate the denoising results on both the SIPI and the Esaote data sets. According to the denoising results on the SIPI data set (Figs. 1, 2), WNNM has very good results in terms of noise removal and edge and feature preservation (e.g., the vehicles' shape (Fig. 1), and the hat's feathers (Fig. 2)). SAR-BM3D has the best results in terms of noise removal; however, it does not correctly preserve the pixel intensity (e.g., the boy's sleeve (Fig. 2)) and it generates a blurred effect (e.g., the grass and bushes (Fig. 1)). BM3D-PCA and NCSR show a minor preservation of edges and details with respect to WNNM (e.g., the boy's face (Fig. 2)). Finally, BM-CNN is not able to correctly remove the noise; this result underlines the importance of the training data set (e.g., the type and the intensity of the applied noise) when using a deep learning approach, and the necessity of using data-specific networks, instead of a single general-purpose network.

Figure 2: Input (SIPI data set, man image), noisy, and denoised images (WNNM, SAR-BM3D, BM3D-PCA, NCSR, BM-CNN).

According to the denoising results on the Esaote data set (Figs. 3, 4, 5), WNNM, NCSR, and BM3D-PCA are the best methods in terms of smoothing, and WNNM outperforms all the other methods in terms of edge preservation and enhancement. In particular, WNNM well preserves the edges of the muscular fibres (Fig. 4) and of the internal organs (Figs. 3, 5). The output of SAR-BM3D shows a granular effect, which negatively affects the preservation of the anatomical features, and BM-CNN generates artefacts, which are typical of a deep learning approach. According to these results, we select WNNM as the best method for ultrasound image denoising.
Our tests (Table 2) are executed with Matlab R2020a, on a workstation with 2 Intel i9-9900KF CPUs (3.60GHz) and 32 GB RAM. None of these methods achieves real-time computation on the aforementioned workstation. In particular, WNNM takes more than three minutes to process a 600 × 485 image, and the fastest method (i.e., SAR-BM3D) takes about one minute; however, real-time computation in an ecographic environment requires a processing time in the order of a few milliseconds. This result motivates our decision to develop a deep learning framework, further optimised with an HPC framework (Sect. 4.6).

Figure 3: Raw data set (Esaote images of an abdominal district), and denoised images (WNNM, SAR-BM3D, BM3D-PCA, NCSR, BM-CNN).
In order to improve the quality of the denoised image, we propose a novel approach to the tuning of the WNNM parameters; we refer to this method as tuned-WNNM. More precisely, we vary the stack size, the number of iterations, the number of selected patches, the dimension of the search window, and the number of times the block-matching is applied. According to this new configuration of WNNM, and comparing the baseline with the tuned-WNNM, we improve the denoising quality (Fig. 6) in terms of quantitative metrics; in fact, the output of WNNM has a PSNR value of 26.67, while the output of tuned-WNNM has a PSNR value of 26.74. However, the execution time grows from 94 seconds for WNNM to 260 seconds for tuned-WNNM.

We also specialise the tuned-WNNM method to ultrasound images. To this end, we vary the smoothing intensity through a parameter that affects the thresholding of the singular values of the SVD; increasing this parameter, the method improves in terms of removed noise, though introducing a blurring effect. According to the medical experts' evaluation, we select the output image that best fits the medical requirements; among three different levels of denoising intensity (Fig. 7), (b) shows the best result as a compromise between noise removal, edge preservation, and blurring effects; in fact, it preserves the geometry of the internal tissues, while enhancing the edges of the anatomical structures, therefore allowing the doctor to better identify/monitor patient diseases.

Figure 4: Raw data set (Esaote images of a muscle-skeletal district), and denoised images (WNNM, SAR-BM3D, BM3D-PCA, NCSR, BM-CNN).
We introduce the proposed deep learning approach (Sect. 4.1) and architecture (Sect. 4.2) for denoising, the data sets (Sect. 4.3), and the qualitative (Sect. 4.4), quantitative (Sect. 4.5), and computational (Sect. 4.6) aspects.
The main requirements of a denoising algorithm for ultrasound images are the magnitude of the removed noise, edge and feature preservation, edge enhancement, and real-time computation. The tuned-WNNM (Sect. 3.6) satisfies all these requirements, except for the execution time, which does not respect the real-time need of the ultrasound application (Sect. 3.5). In order to achieve a real-time computation while maintaining the good results of the WNNM method in terms of denoising and edge preservation, we identify two strategies: (i) the development of a computationally optimised version of the WNNM method, exploiting HPC tools, CPUs and GPUs, and low-level programming languages; (ii) the design and implementation of a deep learning framework that uses WNNM as an instance of denoising methods. The implementation of a computationally optimised, and potentially real-time, version of the tuned-WNNM is a very tough requirement, since the main iterative cycle of the algorithm is not parallelisable.

Figure 5: Raw data set (Esaote images of an obstetric district), and denoised images (WNNM, SAR-BM3D, BM3D-PCA, NCSR, BM-CNN).

Figure 6: Input (256 × 256) image, noisy, and denoised images with WNNM and tuned-WNNM.

The main properties of the proposed deep learning framework are:

• its generality with respect to the input data, i.e., the type of noise (e.g., speckle, Gaussian noise), the resolution of the 2D images (e.g., isotropic, anisotropic images), the dimension of the images (i.e., 2D, 3D images), the acquisition methodology, and the anatomical district;

• its generality in terms of building blocks and parameters of the deep learning framework, i.e., the denoising algorithms (e.g., WNNM, SAR-BM3D, custom methods) and the deep learning architecture (e.g., Pix2Pix, VGG19);

• the development of a universal framework, according to its generality with respect to the input data and in terms of building blocks and parameters of the deep learning framework. In case several denoising methods are available, our framework allows the user to compare different denoising algorithms, in terms of smoothing quality and edge preservation; furthermore, any modification to the denoising algorithm only needs a new offline training of the neural network;

• the option to specialise the training phase to specific anatomical districts, types of noise, etc. For instance, a specific network can be trained for each district, thus allowing the user to obtain a more precise result when predicting the denoised image, as each network is specialised for a unique anatomical district;

• the real-time denoising based on an off-line training step; in fact, the real-time computation depends only on the execution time of the network prediction;

• the possibility of improving the offline training with new data, a-priori and/or additional information on the input data (e.g., input anatomical district, noise type/intensity, image resolution, acquisition methodology/protocol). The integration of the existing training data set for the definition of a new training set is always addressed offline. Furthermore, the training data set can be periodically updated with the denoised images after experts' validation of the denoising results;

• the improvement of the WNNM denoising algorithm, in terms of range of applicability, criteria for parameter tuning, and selected smoothing parameters.

Table 1: PSNR and SSIM metrics of the denoising methods tested on the SIPI data set. For each σ value (i.e., the intensity of the speckle noise), we report the average metric computed on the 44 images of the data set.

Method | PSNR (increasing σ) | SSIM (increasing σ)
WNNM | — — — — | — — — —
SAR-BM3D | — — — — | — — — —
BM3D-PCA | 25.09 24.36 23.05 22.10 | 0.652 0.640 0.614 0.585
NCSR | 26.60 25.36 23.61 22.36 | 0.665 0.669 0.619 0.588
BM3D-CNN | 26.85 24.20 20.21 17.26 | — — — —

Figure 7: Ultrasound image of an abdominal district, denoised with tuned-WNNM, varying the intensity of the smoothing from low (a) to high (c).

To evaluate the proposed framework, we analyse several networks that allow us to perform an image-to-image regression; among them (Sect. 2), we select Pix2Pix, which guarantees good results in terms of learning. We specialise Pix2Pix to ultrasound images with two refinements: (i) the introduction of a validation data set of the same district as the training data set, which allows us to force the exit condition when the validation error increases, and (ii) the introduction of padding and masking pre-processing operations, which allow us to deal with images of different resolutions.
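The padding and masking pre-processing could look like the following sketch; the fixed network input shape, the zero-padding policy, and the corner placement are assumptions, since the exact scheme is not detailed in the text:

```python
import numpy as np

def pad_to(img, shape):
    """Zero-pad an image to a fixed network input `shape` and return
    the padded image together with a boolean mask marking the valid
    pixels, so images of different resolutions can share one network."""
    out = np.zeros(shape, dtype=img.dtype)
    mask = np.zeros(shape, dtype=bool)
    h, w = img.shape
    out[:h, :w] = img
    mask[:h, :w] = True
    return out, mask
```

At training time the mask can be used to restrict the loss to valid pixels, so the padded region does not bias the network towards predicting black borders.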
We generate and test different data sets, varying the number of images for the training phase and the anatomical district for the prediction phase. More precisely, the custom Pix2Pix network is trained on three data sets of obstetric images, with respectively: (a) 500, (b) 1500, and (c) 3500 images. Each data set is composed of the input images (i.e., the raw Esaote data set) and the target images (i.e., the corresponding images denoised with the tuned-WNNM). Then, we evaluate each of the three networks (i.e., the networks trained with the different data sets) with two test data sets of 50 images each, respectively from the (i) obstetric and (ii) muscle-skeletal anatomical districts.

Figure 8: Raw, target, and prediction images, related to the obstetric data set (i). Training set: (a) 500 images, (b) 1500 images, (c) 3500 images (Sect. 4.3).

Table 2: Execution time computed as an average value on a set of Esaote images at 600 × 485 resolution.

Method | Execution time [s]
WNNM | 215
SAR-BM3D | —
BM-CNN | 356
NCSR | 380
BM3D-PCA | 95

For each test data set, we compute the quantitative PSNR and SSIM metrics between the prediction of the network and the expected target; furthermore, the experts visually evaluate the prediction results.
Fig. 8 shows the prediction results of the three networks, when tested with obstetric images (i). The predicted images are very close to the target image in all three cases; the edges, the main features, and the greyscale values are well reproduced by the network. Furthermore, the predictions do not generate artefacts or patterns. Varying the number of images of the training data set from 500 to 3500 (i.e., Fig. 8 (a), (b), and (c)), the predicted images slightly improve in terms of similarity to the target image. Nevertheless, the results are good even with a small training data set of 500 images. Fig. 9 shows the prediction results of the three networks, when tested with muscle-skeletal images (ii). When predicting the output images with the networks trained on obstetric images (i.e., Fig. 9 (a), (b), and (c)), the results are slightly worse with respect to the corresponding case of Fig. 8, even if the predicted images do not show any artefacts or pattern repetition. In fact, these networks are trained with images from a different (i.e., obstetric) district, with different anatomical features. This result confirms that each district requires a specific network, and that the use of a single network for all the districts gives lower quality results.

Figure 9: Raw, target, and prediction images, related to the muscle-skeletal data set (ii). Training set: (a) 500 images, (b) 1500 images, (c) 3500 images (Sect. 4.3).
Table 3 summarises the quantitative metrics (Sect. 3.2) computed between the target and the predicted images. The metrics confirm that the best results are reached when using a specific network for each district. In fact, the network (c) tested with obstetric images (i) has average PSNR and SSIM values of 36.07 and 0.962, respectively, while the same network tested with muscle-skeletal images (ii) has average PSNR and SSIM values of 26.31 and 0.878. Both metrics have a very slight improvement when passing from a training set of 500 to a training set of 3500 images, confirming the results of the qualitative analysis. Finally, Fig. 10 shows the box plot of the PSNR and SSIM metrics for the three training data sets and the two test data sets. Increasing the number of images of the training data set, the range of the metrics tends to decrease; this behaviour allows us to have a lower variability in the prediction of the output image.

Table 3: PSNR and SSIM metrics, computed between the target and the prediction images, for the three training data sets and the two test data sets (Sect. 4.3). The results are computed as average values among the 50 test images.

Training data set | Test (i) PSNR | Test (i) SSIM | Test (ii) PSNR | Test (ii) SSIM
(a) | 35.93 | 0.973 | 25.88 | 0.886
(b) | 34.52 | 0.957 | 26.33 | 0.854
(c) | 36.07 | 0.962 | 26.31 | 0.878

Figure 10: PSNR and SSIM box plots for the three training data sets and the two test data sets (Sect. 4.3). (top-left) PSNR, obstetric test data set; (top-right) SSIM, obstetric test data set; (bottom-left) PSNR, muscle-skeletal test data set; (bottom-right) SSIM, muscle-skeletal test data set. Each sub-figure shows the metric result for each of the three training data sets (i.e., (a), (b), and (c)).
We define an HPC framework for the implementation of the deep learning framework previously introduced, taking advantage of a large ultrasound data set (Sect. 4.3) and the CINECA-Marconi100 cluster, exploiting both CPUs (IBM POWER9 AC922) and GPUs (NVIDIA Volta V100). Given a training data set, composed of raw images and the corresponding denoised images, we design our deep learning framework with TensorFlow 2, which allows us to implement a parallel and distributed version of our framework. Then, we define a batch file for the execution of the deep learning framework on the cluster; through the batch file, we specify the number of nodes, CPUs, GPUs, and the memory of the cluster. For the training phase, we exploit 8 nodes, each composed of 32 cores and 4 accelerators, for a theoretical computational performance of 260 TFLOPS and 220 GB of memory per node.

The parallel implementation of the deep learning framework and the high hardware performance allow us to reduce the computation time of the training phase by a factor of at least 100 with respect to a serial implementation on a standard workstation. Through the proposed HPC framework, we are able to train multiple networks with large data sets in a reasonable time with respect to the target medical application, thus increasing the specialisation to anatomical districts and, consequently, the accuracy of the deep learning framework. Furthermore, we can improve the offline training with new data and with a-priori and/or additional information on the input data (e.g., input anatomical district, noise type/intensity, image resolution, acquisition methodology/protocol); also, the training data set can be periodically updated with the denoised images, after the experts' validation of the denoising results. The HPC framework generates a network model, which is stored and used for predicting the output results. The average execution time for the output prediction of a 600 × 500 ultrasound image is 25 milliseconds on Esaote hardware that replicates an ultrasound scanner currently in use; this result confirms that we achieve the real-time computation target.
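As an illustration of the batch file mentioned above, the following is a hypothetical SLURM script requesting the stated resources (8 nodes, 32 cores, 4 GPUs, and 220 GB per node); the partition, module, and script names are placeholders, not the actual CINECA configuration:

```bash
#!/bin/bash
#SBATCH --job-name=us-denoise-train
#SBATCH --nodes=8                  # 8 nodes, as in the training setup
#SBATCH --ntasks-per-node=4        # one task per GPU
#SBATCH --cpus-per-task=8          # 32 cores per node in total
#SBATCH --gres=gpu:4               # 4 NVIDIA V100 accelerators per node
#SBATCH --mem=220G                 # memory per node
#SBATCH --time=24:00:00
#SBATCH --partition=m100_usr_prod  # placeholder partition name

module load tensorflow             # placeholder module name
srun python3 train_denoiser.py     # hypothetical training script
```

The `srun` launcher starts one task per GPU on every node, which matches the multi-worker distribution model that TensorFlow 2 supports.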
We have presented a universal deep learning framework for the real-time denoising of ultrasound images, which preserves the main features of the underlying image (e.g., edges, greyscale) while respecting industrial and production requirements (e.g., real-time computation, memory overhead, hardware configurations). We have shown the main novelties and contributions of the proposed framework, including the analysis and tuning of the denoising algorithm and the contributions on the machine learning and HPC aspects. We have discussed the generality of the framework with respect to the input data, the noise properties, the denoising algorithm, and the deep learning architecture. Finally, we have presented the results of the framework on ultrasound images belonging to different anatomical districts, analysing both the qualitative and the quantitative results. As future work, we plan to extend the deep learning framework to different types of data, such as meshes (e.g., through segmentation and iso-surface extraction) extracted from ultrasound and magnetic resonance images.
Acknowledgements
This research is carried out as part of an Industrial PhD project funded by CNR-IMATI and Esaote S.p.A. under the CNR-Confindustria agreement.