RCoNet: Deformable Mutual Information Maximization and High-order Uncertainty-aware Learning for Robust COVID-19 Detection
Shunjie Dong, Qianqian Yang, Yu Fu, Mei Tian, Cheng Zhuo,
Senior Member, IEEE
Abstract—The novel 2019 Coronavirus (COVID-19) infection has spread worldwide and is currently a major healthcare challenge around the world. Chest Computed Tomography (CT) and X-ray images are well recognized as two effective techniques for clinical COVID-19 diagnosis. Owing to faster imaging time and considerably lower cost than CT, detecting COVID-19 in chest X-ray (CXR) images is preferred for efficient diagnosis, assessment and treatment. However, considering the similarity between COVID-19 and pneumonia, CXR samples with deep features distributed near category boundaries are easily misclassified by the hyper-planes learned from limited training data. Moreover, most existing approaches for COVID-19 detection focus on the accuracy of prediction and overlook uncertainty estimation, which is particularly important when dealing with noisy datasets. To alleviate these concerns, we propose a novel deep network named RCoNet^k_s for robust COVID-19 detection, which employs Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL). With DeIM, the mutual information (MI) between input data and the corresponding latent representations can be well estimated and maximized to capture compact and disentangled representational characteristics. Meanwhile, MHMF can fully explore the benefits of using high-order statistics and extract discriminative features of complex distributions in medical imaging. Finally, MUL creates multiple parallel dropout networks for each CXR image to evaluate uncertainty and thus prevent the performance degradation caused by noise in the data. The experimental results show that RCoNet^k_s achieves state-of-the-art performance on an open-source COVIDx dataset of 15134 original CXR images across several metrics. Crucially, our method is shown to be more effective than existing methods in the presence of noise in the data.

Index Terms—Chest X-rays, COVID-19, RCoNet^k_s, DeIM, MHMF, MUL, Noisy Data, Uncertainty

I. INTRODUCTION

CORONAVIRUS disease 2019 (COVID-19) causes an ongoing pandemic that has significantly impacted everyone's life since it was first reported, with hundreds of thousands of deaths and millions of infections emerging in over 200 countries [1], [2]. As indicated by the World Health Organization (WHO), due to its highly contagious nature and the lack of corresponding vaccines, the most effective methods to control the spread of COVID-19 infection are social distancing and contact tracing. Hence, early and fast diagnosis of COVID-19 has become essential to control further spreading, so that patients can be hospitalized and receive proper treatment in time. Since the emergence of COVID-19, reverse transcription polymerase chain reaction (RT-PCR), as a viral nucleic
Fig. 1. Visual illustration of chest X-ray images, including normal, pneumonia and COVID-19.

acid detection method by gene sequencing, is the accepted standard for COVID-19 detection [3]. However, because of the low accuracy of RT-PCR and the limited medical test kits in many hyper-endemic regions or countries, it is challenging to rapidly detect every individual affected by COVID-19 [4], [5]. Therefore, alternative testing methods, which are faster and more reliable than RT-PCR, are urgently needed to combat the disease. Since most COVID-19 positive patients are diagnosed with pneumonia, radiological examinations can help detect and assess the disease. Recently, chest computed tomography (CT) has been shown to be efficient and reliable for achieving a real-time clinical diagnosis of COVID-19, outperforming RT-PCR in terms of accuracy. Moreover, several deep learning based methods have been proposed for COVID-19 detection using chest CT images [6], [7], [8], [9]. For example, an adaptive feature selection approach was proposed in [10] for COVID-19 detection based on a trained deep forest model. In [11], an uncertainty vertex-weighted hypergraph learning method was designed to identify COVID-19 from community-acquired pneumonia (CAP) using CT images. However, the routine use of CT, which relies on expensive equipment, takes considerably more time than X-ray imaging and brings a massive burden on radiology departments. Compared to CT, X-rays can significantly speed up disease screening, and have hence become a preferred method for disease diagnosis. Accordingly, deep learning based methods for detecting COVID-19 with chest X-ray (CXR) images have been developed and shown to achieve accurate and speedy detection [12], [13]. For instance, a tailored convolutional neural network platform called COVID-Net [14], trained on an open-source dataset, was proposed for the detection of COVID-19 cases from CXR. Oh et al.
[15] proposed a novel probabilistic gradient-weighted class activation map to enable infection segmentation and detection of COVID-19 on CXR images. Fig. 1 shows three samples from the
COVIDx dataset [14], which contains three different classes: normal, pneumonia and COVID-19. However, due to the similar pathological information between pneumonia and COVID-19 in the early stage, the CXR samples may have latent features distributed near the category boundaries, which can be easily misclassified by the hyper-plane learned from the limited training data. Moreover, to the best of our knowledge, most of the existing methods for COVID-19 detection are designed to extract lower-dimensional latent representations, which may not be able to fully capture the statistical characteristics of complex distributions (i.e., non-Gaussian distributions). Furthermore, quantifying uncertainty in COVID-19 detection is still a major yet challenging task for doctors, especially in the presence of noise in the training samples (i.e., label noise and image noise). To address the above problems, we propose a novel deep network architecture, referred to as RCoNet^k_s, for robust COVID-19 detection, which in particular contains the following three modules, i.e., Deformable mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL):

• The Deformable mutual Information Maximization (DeIM) module estimates and maximizes the mutual information (MI) between input data and learned high-level representations, which pushes the model to learn discriminative and compact features. We employ deformable convolution layers in this module, which are able to explore disentangled spatial features and mitigate the negative effect of similar samples across different categories.

• The Mixed High-order Moment Feature (MHMF) module, inspired by [16], fully explores the benefits of using a mix of high-order moment statistics to better characterize the feature distributions in medical imaging.

• The Multi-expert Uncertainty-aware Learning (MUL) module creates multiple parallel dropout networks, each of which can be treated as an expert, to derive a multi-expert based diagnosis similar to clinical practice, which improves the prediction accuracy. MUL also quantifies the prediction uncertainty by obtaining the variance in prediction across different experts.

• The experimental results show that our proposal achieves state-of-the-art performance in terms of most metrics, both on the open-source COVIDx dataset of 15134 original CXR images and on its noisy counterpart.

The remainder of this paper is organized as follows: In Section II, we review related works on mutual information estimation and uncertainty learning. In Section III, after an overview of our proposed approach, we discuss the main components of RCoNet^k_s. In Section IV, we compare our proposed architecture with existing deep learning based methods on a publicly available dataset of CXR images, as well as on the same dataset under noisy conditions. We also conduct extensive experiments to demonstrate the benefits of DeIM, MHMF and MUL on the performance of the system. Finally, we conclude this paper in Section V.

II. BACKGROUND AND RELATED WORKS
In this section, we introduce related works on mutual information estimation and uncertainty learning that lay the foundation of this paper.
A. Mutual Information Estimation
Mutual information (MI), as a fundamental concept in information theory, is widely applied in unsupervised feature learning to quantify the correlation between random variables. MI has been exploited in a wide range of domains and tasks, including the biomedical sciences [17], blind source separation (BSS, e.g., independent component analysis [18]), feature selection [19], [20] and causal inference [21]. For example, the object tracking task considered in [22] was treated as a problem of optimizing the mutual information between features extracted from a video with most color information removed and those from the original full-color video. Closely related work presented in [23] considered learning representations to predict cross-modal correspondence by maximizing MI between features from the multi-view encoders and the content of the held-out view. Moreover, Mutual Information Neural Estimation (MINE), proposed by [24], was designed to learn a general-purpose estimator of the MI between continuous variables based on dual representations of the KL-divergence, which is scalable, flexible and, most crucially, trainable via back-propagation. Based on MINE, our proposal estimates and maximizes the MI between the CXR image inputs and the corresponding latent representations to improve diagnostic performance.
B. Uncertainty in Deep Learning
Aiming to combat the significant negative effects of uncertainty in deep neural networks, uncertainty learning has been receiving much research attention, as it facilitates reliability assessment and solves risk-based decision-making problems [25], [26], [27]. In recent years, various frameworks have been proposed to characterize the uncertainty in the model parameters of deep neural networks, referred to as model uncertainty, due to the limited size of training data [28], [29], which can be reduced by collecting more training data [26], [30], [31]. Meanwhile, another kind of uncertainty in deep learning, referred to as data uncertainty, measures the noise inherent in the given training data, and hence cannot be eliminated by having more training data [32]. To combat these two kinds of uncertainty, many works on various computer vision tasks, e.g., face recognition [25], semantic segmentation [33], object detection [34], person re-identification [35], etc., have introduced deep uncertainty learning to improve the robustness and interpretability of deep learning models. For the face recognition task in [26], an uncertainty-aware probabilistic face embedding (PFE) was proposed to represent face images as distributions by utilizing data uncertainty. Exploiting the advantages of Bayesian deep neural networks, one recent study [36] leveraged model uncertainty for the analysis and learning of face representations. To our knowledge, our proposal is the first work that utilizes high-order moment statistics and multiple expert networks to estimate uncertainty for COVID-19 detection using CXR images.
[Fig. 2. The architecture of RCoNet^k_s for COVID-19 detection: a feature encoder (Conv STAGE0 plus deformable blocks Conv STAGE1-4, each with batch normalization, deformable convolution, grouped convolution and max pooling), the DeIM module with a shared-weight global discriminator estimating $\mathcal{L}_I$ from the input $X$, a fake sample $X'$ and the encoding $E_\psi(X)$, the MHMF module producing $\mathcal{J}(E_\psi(X))$, and the MUL module with multiple dropout experts (different masks) sharing FC-layer weights, whose averaged outputs give the final prediction, $\mathcal{L}_M$ and the uncertainty $\sigma^2$.]

III. METHOD
In this section, we introduce the novel RCoNet^k_s for robust COVID-19 detection, which incorporates Deformable mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL), as illustrated in Fig. 2. Here, k is the number of levels of moment features that are combined in MHMF, and s is the number of expert networks in MUL, which will be further clarified in the sequel. The CXR images are first processed by DeIM, which consists of a stack of deformable convolution layers extracting discriminative features. The compact features are then fed into the MHMF module to generate high-order moment latent features, reducing the negative effects caused by similar images. The proposed MUL utilizes the learned high-order features to generate the final diagnoses.

A. Deformable Mutual Information Estimation and Maximization
Due to the similarity between COVID-19 and pneumonia in the latent space, we propose Deformable mutual Information Maximization (DeIM) to extract discriminative and informative features, reducing the negative influence caused by the lack of distinctiveness in the deep features. In particular, we train the model by maximizing the mutual information between the input and the corresponding latent representation.

We use a stack of five convolutional stages, as shown in Fig. 2, to encode inputs into latent representations, which is denoted by a differentiable parametric function $E_\psi$:
$$E_\psi : \mathcal{X} \to \mathcal{Z}, \quad (1)$$
where $\psi$ denotes the set of all the trainable parameters in these layers, and $\mathcal{X}$ and $\mathcal{Z}$ denote the input and output spaces, respectively.

The detailed architecture of each convolutional stage is presented in Fig. 2, which consists of several convolutional layers, each followed by a batch normalization layer. Note that we employ deformable convolutional layers, which can better extract spatial information of the irregular infected area compared to conventional convolutional layers. More specifically, regular convolution operates on a pre-defined rectangular grid from an input image or a set of input feature maps, while deformable convolution operates on deformable grids in which each grid point is moved by a learnable offset. For example, the receptive grid $\mathcal{P}$ of a regular convolution with kernel size $3 \times 3$ is fixed and given by:
$$\mathcal{P} = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}, \quad (2)$$
while, for deformable convolution, the receptive grid is moved by the learned offsets $\Delta p_n \in \mathbb{R}^2$ and the output is given as follows:
$$b(p) = \sum_{p_n \in \mathcal{P}} w(p_n) \cdot a(p + p_n + \Delta p_n), \quad (3)$$
where $b(p)$ denotes the value at location $p$ on the output feature map $b$, $p_n$ enumerates the locations in $\mathcal{P}$, $w(p_n)$ represents the weight at location $p_n$ of the kernel, and $a(\cdot)$ is the value at the given location on the input feature map.
We can see that with the introduction of the offsets $\Delta p_n$, the receptive grid is no longer fixed to be a rectangle, and is instead deformable.

We optimize $E_\psi$ by maximizing the mutual information between the input and the output, i.e., $I(X;Z)$, where $Z \triangleq E_\psi(X)$. Computing the precise mutual information requires knowledge of the probability density functions (PDFs) of $X$ and $Z$, which is intractable in practice. To overcome this issue, Mutual Information Neural Estimation (MINE), proposed in [24], estimates mutual information by using a lower bound based on the Donsker-Varadhan representation [37] of the KL-divergence:
$$I(X;Z) := D_{\mathrm{KL}}(\mathbb{J} \,\|\, \mathbb{M}) \ge \widehat{I}^{(\mathrm{DV})}_\theta(X;Z) := \mathbb{E}_{\mathbb{J}}[T_\theta(x,z)] - \log \mathbb{E}_{\mathbb{M}}[e^{T_\theta(x,z)}], \quad (4)$$
where $\mathbb{J}$ represents the joint probability of $X$ and $Z$, i.e., $\mathbb{J} \triangleq P(X,Z)$, and $\mathbb{M}$ denotes the product of the marginal probabilities of $X$ and $Z$, $\mathbb{M} \triangleq P(X)P(Z)$. $T_\theta : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ denotes a global discriminator modeled by a neural network with parameters $\theta$, which is trained to maximize $\widehat{I}^{(\mathrm{DV})}_\theta(X;Z)$ so as to approximate the actual mutual information. Hence, we can simultaneously estimate and maximize $I(X; E_\psi(X))$ by maximizing $\widehat{I}^{(\mathrm{DV})}_\theta(X;Z)$:
$$(\widehat{\theta}, \widehat{\psi}) = \operatorname*{argmax}_{\theta,\psi} \widehat{I}^{(\mathrm{DV})}_\theta(X; E_\psi(X)). \quad (5)$$
Since the encoder $E_\psi$ and the mutual information estimator $T_\theta$ are optimized simultaneously with the same objective function, we can share some layers between them and replace $T_\theta$ with $T_{\theta,\psi}$ to account for this fact.

Since we are primarily interested in maximizing the mutual information rather than estimating its precise value, we can alternatively use a Jensen-Shannon MI estimator (JSD) [38], which offers a more interpretable trade-off:
$$\widehat{I}^{(\mathrm{DeJSD})}_{\theta,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\big[-\mathrm{sp}\big(-T_{\theta,\psi}(x, E_\psi(x))\big)\big] - \mathbb{E}_{\mathbb{P} \times \widetilde{\mathbb{P}}}\big[\mathrm{sp}\big(T_{\theta,\psi}(x', E_\psi(x))\big)\big], \quad (6)$$
where $\mathrm{sp}(z) = \log(1 + e^z)$ is the softplus function, $x$ is an input sample of an empirical probability distribution $\mathbb{P}$, and $x'$ denotes a fake sample from a distribution $\widetilde{\mathbb{P}} = \mathbb{P}$. This estimator is illustrated by the DeIM block shown in Fig. 2, which takes the latent representation $E_\psi(x)$, the input sample $x$ and the fake sample $x'$ as input, and outputs the difference between the two softplus terms as the estimate of MI.

Another alternative MI estimator is the Noise-Contrastive Estimator (NCE) [39], which is defined as:
$$\widehat{I}^{(\mathrm{DeNCE})}_{\theta,\psi}(X; E_\psi(X)) := \mathbb{E}_{\mathbb{P}}\Big[T_{\theta,\psi}(x, E_\psi(x)) - \mathbb{E}_{\widetilde{\mathbb{P}}}\Big[\log \sum_{x'} e^{T_{\theta,\psi}(x', E_\psi(x))}\Big]\Big]. \quad (7)$$
Our experiments found that the NCE estimator outperforms the JSD estimator in some cases, but the two behave quite similarly most of the time.

Existing works [40] that implement these estimators use some latent representation of $x$, which is then merged with randomly generated features to obtain "fake" samples that satisfy $\mathbb{P} = \widetilde{\mathbb{P}}$. In contrast, we use samples from other categories as the "fake" samples $x'$ instead. For example, if the input is a pneumonia sample, then the fake sample is either a normal or a COVID-19 sample.
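In code, the JSD estimator of Eq. (6) reduces to a difference of softplus terms over matched and mismatched (cross-category) pairs. The sketch below assumes the discriminator scores are already computed; the function name and toy values are illustrative.

```python
import torch
import torch.nn.functional as F

def jsd_mi_estimate(t_joint, t_marginal):
    """Eq. (6): E_P[-sp(-T(x, E(x)))] - E_{P x P~}[sp(T(x', E(x)))],
    where sp is the softplus and x' is a sample from another category."""
    return (-F.softplus(-t_joint)).mean() - F.softplus(t_marginal).mean()

# toy discriminator scores on matched and cross-category pairs
t_joint = torch.tensor([2.0, 1.5, 3.0])       # T(x, E(x))
t_marginal = torch.tensor([-1.0, -0.5, 0.2])  # T(x', E(x))
mi_hat = jsd_mi_estimate(t_joint, t_marginal)
```

Maximizing this scalar jointly over the encoder and discriminator parameters pushes matched pairs to score high and cross-category pairs to score low.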
We note that this can push the learned encoder to derive more distinguishable features for samples from different categories.

B. Mixed High-order Moment Feature
The presence of image noise and label noise in CXR datasets may cause the image latent representations generated by deep neural networks to be scattered in the entire feature space. To deal with this issue, [25], [26], [35] represent each image as a Gaussian distribution, defined by a mean (a standard feature vector) and a variance. However, the deep features of the CXR samples considered in this paper typically follow a complex, non-Gaussian distribution [41], [42], which cannot be fully captured by first-order (mean) or second-order (variance) statistics.

[Fig. 3. Data points from three Gaussian distributions and the corresponding moment features of order 1 to 4.]

We seek a better combination of different orders of statistics to more precisely characterize the latent representation of the CXR images. We illustrate the moment features of different orders [16] in Fig. 3, where we plot 350 data points in $\mathbb{R}^2$ sampled from a distribution that combines three different Gaussian distributions. We can observe that the high-order moment features are more expressive of statistical characteristics compared to low-order ones; more specifically, they capture the shape of the cloud of samples more accurately. Therefore, we include the Mixed High-order Moment Feature (MHMF) module in the proposed model, as shown in Fig. 2, which outputs a combination of high-order moment features with the latent representation $E_\psi(X)$ as input. This will potentially solve the scattering problem and, more importantly, capture the subtle differences between CXR images of similar categories, i.e., pneumonia and COVID-19 in our case.

We show how to obtain the complicated high-order moment features in the following. Define the $r$-th order moment feature as $\phi_r(a)$, where $a \in \mathbb{R}^{H \times W \times C}$ denotes a latent feature map of dimension $H \times W \times C$.
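Anticipating the recursive construction detailed below (learnable $1 \times 1$ projectors combined by Hadamard products and concatenated across orders), a minimal PyTorch sketch of such mixed moment features might look as follows; the module name and sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MixedMomentFeature(nn.Module):
    """Sketch: phi_1 = K_1(a); phi_r = phi_{r-1} * K_r(a) (Hadamard),
    with learnable 1x1 convolutions K_r as projectors; the output is
    the concatenation of orders 1..k along the channel axis."""
    def __init__(self, channels, k):
        super().__init__()
        self.projectors = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(k))

    def forward(self, a):
        feats, phi = [], None
        for K in self.projectors:
            phi = K(a) if phi is None else phi * K(a)  # recursive Hadamard product
            feats.append(phi)
        return torch.cat(feats, dim=1)                 # k*C output channels

a = torch.randn(2, 32, 7, 7)            # latent feature map (N, C, H, W)
out = MixedMomentFeature(32, k=4)(a)
print(out.shape)  # torch.Size([2, 128, 7, 7])
```

Since the $1 \times 1$ projectors preserve the spatial size, each moment order keeps the $H \times W$ layout and only the channel dimension grows with $k$.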
Many recent works adopt the Kronecker product to compute high-order moment features [42]. However, calculating the Kronecker product of high-dimensional feature maps is computationally intensive, and hence infeasible for real-world applications. Inspired by [43], [44], [45], we approximate $\phi_r(a)$ by exploiting $r$ random projectors, which relies on certain factorization schemes, such as Random Maclaurin [46]. We use $1 \times 1$ convolution kernels as the random projectors to estimate the expectations of the high-order moment features. That is,
$$\phi_r(a) \approx K_1(a) \odot K_2(a) \odot \cdots \odot K_r(a) \in \mathbb{R}^{H \times W \times C}, \quad (8)$$
where $\odot$ represents the Hadamard (element-wise) product, and $K_1, K_2, \ldots, K_r$ are $1 \times 1$ convolution kernels with random weights.

Note that Random Maclaurin produces an estimator that is independent of the input distribution, which causes the estimated high-order moments to contain non-informative components. We eliminate these components by learning the weights of the projectors, i.e., the $1 \times 1$ convolution kernels, from the data. Also note that the Hadamard product of a number of random projectors may cause the estimated high-order moment features to end up similar to low-order ones. To solve this problem, we instead estimate the high-order moments in a recursive way:
$$\phi_r(a) = \phi_{r-1}(a) \odot K_r(a). \quad (9)$$
Since moments of different orders capture different informative statistics, we design the MHMF module to keep the estimated moments of different levels of order, as shown in Fig. 2, the output of which is given as:
$$\mathcal{J}(a) = [\phi_1(a); \phi_2(a); \cdots; \phi_r(a)] \in \mathbb{R}^{H \times W \times rC}. \quad (10)$$
Hence, $\mathcal{J}(a)$ is rich enough to capture the complicated statistics and produce discriminative features for inputs of different categories.

C. Multi-expert Uncertainty-aware Learning
The MHMF module, as described in Section III-B, generates mixed high-order moment features of each sample in the latent space, which we aim to further exploit to derive compact and disentangled information for COVID-19 detection. Meanwhile, quantifying uncertainty in disease detection is undoubtedly significant for understanding the confidence level of computer-based diagnoses. Motivated by clinical practice, we present a novel neural network in this section, referred to as Multi-expert Uncertainty-aware Learning (MUL), which takes in the mixed high-order moment features and outputs both the prediction and a quantification of the diagnostic uncertainty caused by the noise in the data.

The structure of the MUL module is shown in Fig. 2. It consists of multiple dropout layers that process the output from MHMF in parallel, each of which, together with the following fully connected layers, can be regarded as an expert for COVID-19 detection. We note that each dropout layer uses a different mask, which results in different subsets of latent information being kept, while the following fully connected layers share the same weights across the different experts. The masks for the dropout layers are generated randomly at each iteration during training, but fixed at inference time. We denote the input-output function of each expert by $C^j_e(\cdot)$, $j = 1, \ldots, N$, where $N$ is the total number of experts. Hence, the classification loss $\mathcal{L}^j_e$ of the $j$-th expert is given as follows:
$$\mathcal{L}^j_e = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_w\big(C^j_e(\mathcal{J}(E_\psi(x_i))), y_i\big), \quad (11)$$
where $n$ represents the total number of labeled CXR samples, $y_i$ denotes the one-hot representation of the class label, $i = 1, \ldots, n$, and we recall that $\mathcal{J}(\cdot)$ denotes the MHMF operation given in Eq. (10) and $E_\psi(\cdot)$ is the preprocessing step on the CXR samples. Note that the total number of COVID-19 cases is much smaller than that of non-COVID cases, i.e., normal and pneumonia cases.
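Before turning to the class-imbalance correction, the multi-expert structure just described can be sketched in PyTorch: parallel dropout masks over the same features feed a shared fully connected classifier, and the spread of per-expert losses serves as an uncertainty signal. All names, sizes and class weights below are illustrative, and labels are passed as class indices rather than one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiExpertHead(nn.Module):
    """Sketch of MUL: N dropout masks over shared features, one shared
    FC classifier; the variance of per-expert losses quantifies
    uncertainty."""
    def __init__(self, dim, n_classes, n_experts=4, p=0.5):
        super().__init__()
        self.drops = nn.ModuleList(nn.Dropout(p) for _ in range(n_experts))
        self.fc = nn.Linear(dim, n_classes)   # weights shared across experts

    def forward(self, z, y, class_weights):
        # per-expert weighted cross-entropy over the batch
        losses = torch.stack([
            F.cross_entropy(self.fc(drop(z)), y, weight=class_weights)
            for drop in self.drops
        ])
        loss_m = losses.mean()                     # average expert loss
        sigma2 = ((loss_m - losses) ** 2).mean()   # loss variance as uncertainty
        return loss_m, sigma2

z = torch.randn(8, 64)                 # mixed moment features of a batch
y = torch.randint(0, 3, (8,))          # labels: normal / pneumonia / COVID-19
w = torch.tensor([1.0, 1.0, 2.0])      # heavier weight on the rare class
loss, uncertainty = MultiExpertHead(64, 3)(z, y, w)
```

Because all experts share the classifier weights, disagreement between them stems only from the dropout masks, so a large loss variance indicates noise-sensitive inputs.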
This imbalance in the dataset leads to a high ratio of false-negative classifications. To mitigate this negative effect, we employ a weighted cross-entropy $\mathcal{L}_w(\cdot)$ given as follows:
$$\mathcal{L}_w(\widehat{y}_i, y_i) = -\frac{1}{C} \sum_{c=1}^{C} \lambda_c \cdot y_{i,c} \log \widehat{y}_{i,c}, \quad (12)$$
where $C$ is the total number of classes, $y_{i,c}$ is the $c$-th element of $y_i$, and $\widehat{y}_{i,c}$ denotes the corresponding prediction. $\lambda_c$ represents the weight that controls how much the error on class $c$ contributes to the loss, $c = 1, \ldots, C$. Finally, the loss $\mathcal{L}_M$ of the whole MUL module is derived by averaging the loss values of all the experts:
$$\mathcal{L}_M = \frac{1}{N} \sum_{j=1}^{N} \mathcal{L}^j_e. \quad (13)$$
We use the variance of the classification losses $\mathcal{L}^j_e$ with respect to the average loss $\mathcal{L}_M$ to quantify the uncertainty, denoted by $\sigma^2$, which is given as:
$$\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} \big(\mathcal{L}_M - \mathcal{L}^j_e\big)^2. \quad (14)$$
The proposed MUL module improves the diagnostic accuracy, as the final prediction combines the results from multiple experts, and also mitigates the negative effects caused by the noise in the data by introducing the dropout layers. Moreover, our experiments reveal that the more experts there are in the MUL module, the faster the system converges during training.

D. Training
The whole architecture of RCoNet^k_s is presented in Fig. 2, where the CXR images are first processed by a stack of deformable convolution layers, then transformed into high-order moment latent features by the MHMF module, which are finally fed to the MUL module to generate the diagnoses. The loss used to optimize RCoNet^k_s is given as follows:
$$\mathcal{L}_{total} = \mathcal{L}_M - \alpha \mathcal{L}_I, \quad (15)$$
where $\mathcal{L}_M$ is the prediction loss given by Eq. (13), and $\mathcal{L}_I$ denotes the mutual information between the input $X$ and the latent representation $E_\psi(X)$ estimated by either Eq. (6) or Eq. (7). $\alpha$ is a positive hyper-parameter that governs how much $\mathcal{L}_M$ and $\mathcal{L}_I$ contribute to the total loss. During training, the trainable parameters of the whole system are updated iteratively to minimize $\mathcal{L}_{total}$, which jointly minimizes the prediction loss $\mathcal{L}_M$, thus improving the accuracy, and maximizes the mutual information $\mathcal{L}_I$.

IV. EXPERIMENTS AND RESULTS
A. Dataset
We use a public chest X-ray dataset, referred to as COVIDx, to evaluate the proposed model, which was published by the authors of COVID-Net [14]. This dataset contains a total of 13975 CXR images from 13870 patients in 3 classes: (a) normal (no infections); (b) pneumonia (non-COVID-19 pneumonia); (c) COVID-19. It contains samples from five open-source data repositories (https://github.com/lindawangg/COVID-Net/blob/master/docs/COVIDx.md). Three random CXR samples of these three classes are shown in Fig. 1. To reduce the negative effect caused by the extremely unbalanced training samples, i.e., the very limited number of COVID-19 positive cases compared to the other two categories, we further include other open-source
TABLE I
DETAILS OF PATIENT DATA USED FOR TRAINING AND TESTING

Data    Normal    Pneumonia    COVID-19    Total Patients
Train   7966      5451         207         13624
Test    885       594          31          1510
CXR datasets. Following [14], [47], the dataset is finally divided into 13624 training and 1510 test samples. The numbers of samples from the different categories used for training and testing are summarized in Table I. Moreover, we also adopt various data augmentation techniques to generate more COVID-19 training samples, such as flipping, translation, and rotation using five different random angles, to tackle the data imbalance issue so that the proposed model can learn an effective mechanism for detecting COVID-19.
B. Evaluation Metrics
In our experiments, we use the following six metrics to evaluate the COVID-19 detection performance of the different approaches:

• Accuracy (ACC): ACC calculates the proportion of images that are correctly identified. $ACC = \frac{TP + TN}{TP + TN + FP + FN}$.

• Sensitivity (SEN): SEN is the ratio of the positive cases that have been correctly detected to all the positive cases. $SEN = \frac{TP}{TP + FN}$.

• Specificity (SPE): SPE is the ratio of the negative cases that have been correctly classified to all the negative cases. $SPE = \frac{TN}{TN + FP}$.

• Balanced Accuracy (BAC): BAC is the mean value of SEN and SPE. $BAC = \frac{SEN + SPE}{2}$.

• Positive Predictive Value (PPV): PPV is the ratio of correctly detected positive cases to all cases that are detected to be positive. $PPV = \frac{TP}{TP + FP}$.

• F1-score (F1): F1 uses a combination of accuracy and sensitivity to calculate a balanced average result. $F1 = \frac{2 \times ACC \times SEN}{ACC + SEN}$.

Here, $TN$, $TP$, $FN$ and $FP$ represent the total numbers of true negatives, true positives, false negatives, and false positives, respectively.
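The six metrics above can be computed directly from the confusion counts; the helper below follows the formulas as listed, including this paper's F1 definition based on ACC and SEN.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute ACC, SEN, SPE, BAC, PPV and F1 from confusion counts,
    following the definitions listed above."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                 # sensitivity (recall on positives)
    spe = tn / (tn + fp)                 # specificity
    bac = (sen + spe) / 2                # balanced accuracy
    ppv = tp / (tp + fp)                 # positive predictive value
    f1 = 2 * acc * sen / (acc + sen)     # this paper's F1: combines ACC and SEN
    return acc, sen, spe, bac, ppv, f1

# toy confusion counts for a single class treated as "positive"
acc, sen, spe, bac, ppv, f1 = detection_metrics(tp=90, tn=80, fp=20, fn=10)
print(round(acc, 2), round(sen, 2))  # 0.85 0.9
```

For the three-class problem, these counts would be derived per class (one-vs-rest) from the confusion matrix and then averaged.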
C. Compared Methods
We compare the proposed RCoNet^k_s with the following five existing deep learning methods for COVID-19 detection:

• PbCNN [15]: A patch-based convolutional neural network with a relatively small number of trainable parameters.

• COVID-Net [14]: A tailored deep convolutional neural network that uses a projection-expansion-projection design pattern.

• DenseNet-121 [48]: A densely connected convolutional network that connects each layer to every other layer in a feed-forward fashion.
TABLE II
DETAILS OF NOISY PATIENT DATA USED FOR TRAINING

Training Data   Clean   Noise                       Total
Normal          7170    796 (Pneumonia+COVID-19)    7966
Pneumonia       4906    545 (COVID-19+Normal)       5451
COVID-19        187     20 (Pneumonia+Normal)       207

• CoroNet [49]: A deep convolutional neural network model based on the Xception architecture pre-trained on the ImageNet dataset.

• ReCoNet [47]: A residual image-based COVID-19 detection network that exploits a CNN-based multi-level preprocessing filter block and a multi-task learning loss.
D. Implementation
We implement our RCoNet^k_s using the PyTorch library and apply ResNeXt [50] as the backbone network. We train the model with the Adam optimizer, with an initial learning rate of ×10− and a weight decay factor of ×10−. All the experiments are run on an NVIDIA GeForce GTX 1080Ti GPU. We set the batch size to 8, and resize all images to × pixels. The hyperparameter α in the loss function given in Eq. (15) is set within the range [0, .]. The drop rate of each dropout layer in the MUL module is randomly chosen from {., ., .}. The loss weight λ_c for each category, which is used to calculate the weighted sum of the loss as given in Eq. (12), is set to , , and for the normal, pneumonia and COVID-19 samples, respectively, corresponding to the number of training samples in each. We adopt 5-fold cross-validation training: we randomly divide the training set into five equal-size subsets and train the model five times, each time using four different subsets for training and the remaining one for validation. We also evaluate our proposed model with different numbers of moment orders k for the MHMF module, and different numbers of experts s.

To evaluate the performance of the proposed model in the presence of label noise, we derive a noisy dataset from the given dataset in the following way: we randomly select a given percentage of training samples in each category, and assign wrong labels to these samples. In particular, to ensure that the fake COVID-19 samples are fewer than the real ones, we assign COVID-19 labels to selected normal and pneumonia samples such that the number of normal and pneumonia samples assigned the COVID-19 label equals the number of COVID-19 samples assigned either a normal or a pneumonia label. We show a realization of the derived noisy dataset when the percentage of fake samples is set to 10% in Table II.

E. Results and Discussions
Performance on Clean Data: The numerical results on the clean dataset, without any artificial noise added, are shown in Table III. The results are presented in the form a ± b, where a and b denote the average and variance of each metric over five independent experiments, respectively. We can see that RCoNet, i.e., the proposed model with k = 4 levels

TABLE III: PERFORMANCE COMPARISON OF DIFFERENT APPROACHES FOR COVID-19 DETECTION ON THE COVIDX DATASET
Method ACC (%) SEN (%) SPE (%) BAC (%) PPV (%) F1 (%) Param (M) FLOPs (G)
PbCNN [15] 88.90 ± ...
[The remaining rows and numerical entries of Table III are not recoverable from the extracted text.]

Fig. 4. Confusion matrices of the proposed RCoNet ks trained on noisy datasets with different percentages of noisy samples: (a) clean, (b)-(d) noisy.

TABLE IV: PERFORMANCE COMPARISON OF DIFFERENT APPROACHES ON THE COVIDX DATASET WITH NOISY SAMPLES
Noise | Method | ACC (%) | SEN (%) | SPE (%)
10 %
  PbCNN [15] 83.22 81.98 89.01
  COVID-Net [14] 91.03 87.94 90.62
  DenseNet-121 [48] 91.97 87.94 92.17
  CoroNet [49] 89.45 88.74 90.06
  ReCoNet [47] 91.63 90.82 91.16
  RCoNet (entries not recoverable)
  RCoNet (entries not recoverable)
 %
  PbCNN [15] 78.42 75.90 80.29
  COVID-Net [14] 82.51 82.77 81.95
  DenseNet-121 [48] 82.16 81.01 82.21
  CoroNet [49] 82.33 81.10 81.89
  ReCoNet [47] 83.26 82.72 83.17
  RCoNet (entries not recoverable)
  RCoNet (entries not recoverable)
 %
  PbCNN [15] 67.76 66.47 70.61
  COVID-Net [14] 71.98 70.13 71.55
  DenseNet-121 [48] 72.74 72.36 72.96
  CoroNet [49] 71.87 72.02 71.54
  ReCoNet [47] 73.26 72.53 73.11
  RCoNet (entries not recoverable)
  RCoNet (entries not recoverable)

of mixed moment features and s = 4 experts, achieves a notable performance improvement over the comparison methods in terms of most of the metrics considered, including ACC, SPE, BAC, PPV and the F1 score. We note that the performance of RCoNet ks can be further improved with a different setting of k and s; for instance, RCoNet achieves a better SEN and F1 score than RCoNet. The higher ACC and F1 score validate that RCoNet ks is able to obtain latent features, i.e., mixed moment features of different orders, that maintain inter-class separability and intra-class compactness better than the other models. Note that RCoNet leads to a higher SEN than all other methods, which is particularly important for COVID-19 detection, since successfully detecting COVID-19-positive cases is the key to controlling the spread of this highly contagious disease.
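The metrics reported above can be computed from a multi-class confusion matrix. The sketch below assumes one-vs-rest macro-averaging over the three classes; the paper does not spell out its exact averaging convention, so this is one plausible reading rather than the authors' implementation.

```python
import numpy as np

def one_vs_rest_metrics(cm):
    """Macro-averaged metrics from a K x K confusion matrix `cm`,
    with rows = actual class and columns = predicted class (assumed layout).
    Returns the ACC, SEN, SPE, BAC, PPV and F1 used in Tables III and IV."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # actual class, predicted elsewhere
    fp = cm.sum(axis=0) - tp          # other classes predicted as this one
    tn = cm.sum() - tp - fn - fp
    sen = (tp / (tp + fn)).mean()     # sensitivity (recall)
    spe = (tn / (tn + fp)).mean()     # specificity
    ppv = (tp / (tp + fp)).mean()     # positive predictive value (precision)
    return {
        "ACC": tp.sum() / cm.sum(),   # overall accuracy
        "SEN": sen,
        "SPE": spe,
        "BAC": (sen + spe) / 2,       # balanced accuracy
        "PPV": ppv,
        "F1": 2 * sen * ppv / (sen + ppv),
    }
```

For example, the confusion matrix [[8, 2, 0], [1, 9, 0], [0, 0, 10]] yields an overall accuracy of 0.9 and a macro sensitivity of 0.9.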
Moreover, it can be observed that RCoNet ks has a smaller variance than the others, which demonstrates the robustness and stability of our model.

We also evaluate the complexity of the proposed model in terms of the number of parameters and the computational cost, i.e., floating-point operations (FLOPs), as presented in Table III. It can be observed that the proposed model has far fewer parameters than several existing methods, except ReCoNet. However, we note that the FLOPs of RCoNet ks are quite close to those of ReCoNet, which means the two models take a similar amount of time to diagnose COVID-19 from CXR images. We can also observe that increasing k and s, i.e., the number of mixed moment features and the number of experts in MUL, causes only a small, even negligible, increase in the number of parameters and FLOPs. This suggests that we can improve the performance of the proposed model by optimizing k and s without a significant increase in complexity.

Performance on Noisy Data: We further compare the proposed model to the existing ones when noise is present in the training dataset. We generate three noisy training datasets from the clean dataset in the aforementioned way, with , and samples with wrong labels, respectively. The results, averaged over five independent experiments, are presented in Table IV. It can easily be seen that the more fake samples we add, the more it degrades

Fig. 5. The t-SNE visualization of the latent features generated by different methods: (a) baseline, (b) RCoNet-D, (c) RCoNet-M, (d) RCoNet-DM, (e)-(h) RCoNet. Blue, green and red dots represent normal, pneumonia and COVID-19 samples, respectively.
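A two-dimensional projection like the one in Fig. 5 can be produced with scikit-learn's t-SNE implementation. The sketch below uses synthetic 10-dimensional "bottleneck features" as stand-ins for the real latent features; the shapes and cluster centers are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical latent features: 30 samples, 10-dimensional bottleneck outputs,
# drawn as three loose clusters standing in for the three classes.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(c, 0.5, size=(10, 10)) for c in (-2, 0, 2)])

# Project to 2-D for plotting, as done for Fig. 5.
emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(feats)
print(emb.shape)  # one 2-D point per sample
```

Each row of `emb` is then scatter-plotted with a color per class to inspect inter-class separability.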
[Fig. 6 panels: left, Actual: COVID-19, Prediction: COVID-19, Uncertainty: 0.0094; right, Actual: COVID-19, Prediction: Normal, Uncertainty: 0.4792]
Fig. 6. Example CXR samples with their predictions and the corresponding uncertainty levels given by RCoNet.

the performance of all the methods. Note that the proposed RCoNet still achieves state-of-the-art results in all considered cases with different percentages of noisy samples in the training dataset. Moreover, the performance gain over the existing methods slightly increases with the ratio of noisy samples, verifying that our model is more robust to noise. Note that the extreme case of noisy samples leads to great performance degradation for all models. In practice, the percentage of label noise is usually around to . We present the confusion matrices in Fig. 4 to summarize the prediction accuracy for each category. We can observe that, although the number of COVID-19 samples is very limited, our model still maintains a high accuracy in detecting COVID-19 cases, even in the presence of noisy samples.

Uncertainty Estimation: One remarkable advantage of our model is its ability to quantify the uncertainty in the final prediction, which is crucial for COVID-19 detection. This is done by computing the variance of the outputs of the different experts in MUL, as described in Section III-C. The larger the variance, the more the experts disagree with each other, and hence the more uncertain the model is about the final prediction. We present two CXR samples in Fig. 6, including the predictions and the corresponding uncertainty levels given by RCoNet ks. We can see that the correctly classified CXR image has a low uncertainty level, i.e., 0.0094, while the misclassified CXR sample has a high uncertainty level, i.e., 0.4792, suggesting that an alternative way of diagnosis should be sought to correct this prediction. This greatly improves the reliability of the predictions by RCoNet ks and reduces the chance of misdiagnosis.

Fig. 7. Comparison of the uncertainty levels of the predictions by RCoNet.

We also show in Fig. 7 the average uncertainty levels of RCoNet ks trained on the clean and noisy datasets with different ratios of noisy samples. It can be observed that the uncertainty level increases almost linearly with the percentage of noisy samples in the dataset, which highlights the negative impact of noise on model training.

F. Analysis
We further numerically analyse the benefits of the three key modules of RCoNet ks, i.e., the DeIM, MHMF and MUL modules, in this section.

Effectiveness of DeIM: We utilize the t-SNE method [51] to visualize the latent features, presented in Fig. 5, which are
TABLE V: IMPACT OF THE MHMF AND MUL ON THE MODEL PERFORMANCE (ACCURACY, %)

RCoNet ks  s=1   s=2   s=3   s=4   s=5   s=6   s=7
k=1        95.4  95.7  95.9  96.1  96.1  96.0  95.8
k=2        96.3  96.4  96.6  96.8  96.8  96.7  96.4
k=3        97.2  97.2  97.3  97.5  97.4  97.3  97.3
k=4        97.4  97.6  97.8  (remaining entries not recoverable)

generated by the bottleneck layers of the baseline model, i.e., ResNeXt, by RCoNet ks, and by three variants of RCoNet ks: (a) RCoNet-D, a model containing only DeIM; (b) RCoNet-M, a model containing only MUL; (c) RCoNet-DM, a model containing DeIM and MUL but not MHMF. Comparing the latent feature distribution of the baseline model shown in Fig. 5(a) with that of RCoNet-D presented in Fig. 5(b), we can tell that the introduction of DeIM leads to better class separation in the latent space.

Effectiveness of MHMF: We can observe in Fig. 5(a)-(d) that the latent features of the COVID-19 samples generated by the models without MHMF always distribute around the category boundary and are not well separable from those of some pneumonia samples. Meanwhile, the latent feature distributions presented in Fig. 5(e)-(h), derived from the models with MHMF, show significant separability between categories, which implies that MHMF can extract discriminative features. We also include in Table V numerical results, in terms of accuracy, of RCoNet ks trained and tested on the COVIDx dataset with regard to different values of k, i.e., the number of levels of moment features to be mixed, and s, i.e., the number of experts. We can observe that, for a given value of s, the accuracy first increases with the value of k but decreases after k is larger than . This demonstrates that including more levels of moment features can improve the model performance; however, overly high-order moments may lead to performance degradation, possibly because these features are not useful for COVID-19 detection.

Effectiveness of MUL: From Table V, we observe that, for a given value of k, the accuracy first increases with the value of s but saturates around s = 5.
This implies that having more experts in MUL can increase the prediction accuracy, but it is not necessary to have too many.

Parameter Sensitivity and Convergence: We evaluate how sensitive the model performance, in terms of accuracy, is to the value of α. We show in Fig. 8 the average accuracy over five independent experiments by RCoNet trained on the dataset with different ratios of noisy samples. As we can see, a larger α, which means the prediction loss, i.e., L M, contributes less to the total loss, does not necessarily lead to degradation in accuracy. This suggests that maximizing the mutual information between the input and the latent features keeps useful information within the latent features, thus improving the prediction accuracy. We also show the learning curves of different models in Fig. 9, where RCoNet converges slightly faster than the others, including COVID-Net, ReCoNet and CoroNet.

Fig. 8. The prediction accuracy of RCoNet with regard to different values of α.

Fig. 9. Comparison of the learning trajectories (test error vs. epoch) of different models: COVID-Net, ReCoNet, CoroNet and RCoNet.
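To make the multi-expert mechanism analysed above concrete, the sketch below combines the softmax outputs of s parallel dropout experts into a final prediction and a variance-based uncertainty score. This is a minimal sketch of one plausible reading of MUL; the helper name and the choice of scoring only the predicted class are our assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def experts_predict(probs_per_expert):
    """Combine the softmax outputs of s parallel dropout experts.

    `probs_per_expert` has shape (s, n_classes). The diagnosis is the class
    with the highest mean probability; the uncertainty score is the variance
    of the expert outputs for that class, so expert disagreement maps to
    high uncertainty.
    """
    probs = np.asarray(probs_per_expert, dtype=float)
    mean = probs.mean(axis=0)                 # average over experts
    pred = int(mean.argmax())                 # final diagnosis
    uncertainty = float(probs[:, pred].var()) # disagreement on that class
    return pred, uncertainty

# Agreeing experts yield low uncertainty; disagreeing experts yield high
# uncertainty, mirroring the two CXR examples in Fig. 6.
agree = [[0.05, 0.05, 0.90], [0.04, 0.06, 0.90], [0.06, 0.04, 0.90]]
disagree = [[0.70, 0.20, 0.10], [0.10, 0.20, 0.70], [0.10, 0.30, 0.60]]
```

In practice each row would come from one stochastic forward pass through a dropout sub-network; a threshold on the uncertainty score can then flag predictions that need an alternative means of diagnosis.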
V. CONCLUSIONS

In this paper, we proposed a novel deep network model, named RCoNet ks, for robust COVID-19 detection, which contains three key components: Deformable Mutual Information Maximization (DeIM), Mixed High-order Moment Feature (MHMF) and Multi-expert Uncertainty-aware Learning (MUL). DeIM estimates and maximizes the mutual information between the input data and the latent representations simultaneously, to obtain category separability in the latent space. We proposed MHMF to overcome the limited expressive capability of low-order statistics, using instead a combination of both low- and high-order moment features to extract more informative and discriminative features. MUL generates the final diagnosis and the uncertainty estimation by combining the outputs of multiple parallel dropout networks, each acting as an expert. We numerically validated that the proposed RCoNet, trained on either the public COVIDx dataset or its noisy version, outperforms the existing methods in terms of all the metrics considered. We note that these three modules can easily be incorporated into other frameworks for different tasks.
REFERENCES

[1] K. Zhang, X. Liu, J. Shen, Z. Li, Y. Sang, X. Wu, Y. Zha, W. Liang, C. Wang, K. Wang et al., "Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography," Cell, 2020.
[2] Z. Han, B. Wei, Y. Hong, T. Li, J. Cong, X. Zhu, H. Wei, and W. Zhang, "Accurate screening of COVID-19 using attention based deep 3D multiple instance learning," IEEE Transactions on Medical Imaging, 2020.
[3] X. Mei, H.-C. Lee, K.-y. Diao, M. Huang, B. Lin, C. Liu, Z. Xie, Y. Ma, P. M. Robson, M. Chung et al., "Artificial intelligence-enabled rapid diagnosis of patients with COVID-19," Nature Medicine, pp. 1–5, 2020.
[4] W. Xie, C. Jacobs, J.-P. Charbonnier, and B. van Ginneken, "Relational modeling for robust and efficient pulmonary lobe segmentation in CT scans," IEEE Transactions on Medical Imaging, 2020.
[5] X. Ouyang, J. Huo, L. Xia, F. Shan, J. Liu, Z. Mo, F. Yan, Z. Ding, Q. Yang, B. Song et al., "Dual-sampling attention network for diagnosis of COVID-19 from community acquired pneumonia," IEEE Transactions on Medical Imaging, 2020.
[6] H. X. Bai, R. Wang, Z. Xiong, B. Hsieh, K. Chang, K. Halsey, T. M. L. Tran, J. W. Choi, D.-C. Wang, L.-B. Shi et al., "AI augmentation of radiologist performance in distinguishing COVID-19 from pneumonia of other etiology on chest CT," Radiology, p. 201491, 2020.
[7] A. A. Ardakani, A. R. Kanafi, U. R. Acharya, N. Khadem, and A. Mohammadi, "Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks," Computers in Biology and Medicine, p. 103795, 2020.
[8] H. Kang, L. Xia, F. Yan, Z. Wan, F. Shi, H. Yuan, H. Jiang, D. Wu, H. Sui, C. Zhang et al., "Diagnosis of coronavirus disease 2019 (COVID-19) with structured latent multi-view representation learning," IEEE Transactions on Medical Imaging, 2020.
[9] D.-P. Fan, T. Zhou, G.-P. Ji, Y. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, "Inf-Net: Automatic COVID-19 lung infection segmentation from CT images," IEEE Transactions on Medical Imaging, 2020.
[10] L. Sun, Z. Mo, F. Yan, L. Xia, F. Shan, Z. Ding, W. Shao, F. Shi, H. Yuan, H. Jiang et al., "Adaptive feature selection guided deep forest for COVID-19 classification with chest CT," arXiv preprint arXiv:2005.03264, 2020.
[11] D. Donglin, S. Feng, Y. Fuhua, X. Liming, M. Zhanhao, D. Zhongxiang, S. Fei, L. Shengrui, W. Ying, S. Ying, H. Miaofei, G. Yaozong, S. He, G. Yue, and S. Dinggang, "Hypergraph learning for identification of COVID-19 with CT imaging," 2020.
[12] Z. Y. Zu, M. D. Jiang, P. P. Xu, W. Chen, Q. Q. Ni, G. M. Lu, and L. J. Zhang, "Coronavirus disease 2019 (COVID-19): a perspective from China," Radiology, p. 200490, 2020.
[13] M. Siddhartha and A. Santra, "CovidLite: A depth-wise separable deep neural network with white balance and CLAHE for detection of COVID-19," arXiv preprint arXiv:2006.13873, 2020.
[14] L. Wang and A. Wong, "COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images," arXiv preprint arXiv:2003.09871, 2020.
[15] Y. Oh, S. Park, and J. C. Ye, "Deep learning COVID-19 features on CXR using limited training data sets," IEEE Transactions on Medical Imaging, 2020.
[16] E. Pauwels and J. B. Lasserre, "Sorting out typicality with the inverse moment matrix SOS polynomial," in Advances in Neural Information Processing Systems, 2016, pp. 190–198.
[17] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, "Multimodality image registration by maximization of mutual information," IEEE Transactions on Medical Imaging, vol. 16, no. 2, pp. 187–198, 1997.
[18] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
[19] N. Kwak and C.-H. Choi, "Input feature selection by mutual information based on Parzen window," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
[20] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[21] A. J. Butte and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," in Biocomputing 2000. World Scientific, 1999, pp. 418–429.
[22] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, "Tracking emerges by colorizing videos," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 391–408.
[23] R. Arandjelovic and A. Zisserman, "Look, listen and learn," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 609–617.
[24] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, "Mutual information neural estimation," in International Conference on Machine Learning, 2018, pp. 531–540.
[25] J. Chang, Z. Lan, C. Cheng, and Y. Wei, "Data uncertainty learning in face recognition," arXiv preprint arXiv:2003.11339, 2020.
[26] Y. Shi and A. K. Jain, "Probabilistic face embeddings," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6902–6911.
[27] A. Kendall, V. Badrinarayanan, and R. Cipolla, "Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding," arXiv preprint arXiv:1511.02680, 2015.
[28] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," arXiv preprint arXiv:1505.05424, 2015.
[29] Y. Gal, "Uncertainty in deep learning," University of Cambridge, vol. 1, p. 3, 2016.
[30] D. J. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, vol. 4, no. 3, pp. 448–472, 1992.
[31] R. M. Neal, Bayesian Learning for Neural Networks. Springer Science & Business Media, 2012, vol. 118.
[32] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[33] S. Isobe and S. Arai, "Deep convolutional encoder-decoder network with model uncertainty for semantic segmentation," in . IEEE, 2017, pp. 365–370.
[34] J. Choi, D. Chun, H. Kim, and H.-J. Lee, "Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 502–511.
[35] T. Yu, D. Li, Y. Yang, T. M. Hospedales, and T. Xiang, "Robust person re-identification by modelling feature uncertainty," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 552–561.
[36] U. Zafar, M. Ghafoor, T. Zia, G. Ahmed, A. Latif, K. R. Malik, and A. M. Sharif, "Face recognition with Bayesian convolutional networks for robust surveillance systems," EURASIP Journal on Image and Video Processing, vol. 2019, no. 1, p. 10, 2019.
[37] M. D. Donsker and S. S. Varadhan, "Asymptotic evaluation of certain Markov process expectations for large time, I," Communications on Pure and Applied Mathematics, vol. 28, no. 1, pp. 1–47, 1975.
[38] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[39] M. U. Gutmann and A. Hyvärinen, "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics," Journal of Machine Learning Research, vol. 13, no. Feb, pp. 307–361, 2012.
[40] P. Bachman, R. D. Hjelm, and W. Buchwalter, "Learning representations by maximizing mutual information across views," in Advances in Neural Information Processing Systems, 2019, pp. 15509–15519.
[41] J. Xu, P. Ye, Q. Li, H. Du, Y. Liu, and D. Doermann, "Blind image quality assessment based on high order statistics aggregation," IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4444–4457, 2016.
[42] C. Chen, Z. Fu, Z. Chen, S. Jin, Z. Cheng, X. Jin, and X.-S. Hua, "HoMM: Higher-order moment matching for unsupervised domain adaptation," arXiv preprint arXiv:1912.11976, 2019.
[43] P. Jacob, D. Picard, A. Histace, and E. Klein, "Metric learning with HORDE: High-order regularizer for deep embeddings," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6539–6548.
[44] H. Jégou and O. Chum, "Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening," in European Conference on Computer Vision. Springer, 2012, pp. 774–787.
[45] M. Opitz, G. Waltner, H. Possegger, and H. Bischof, "BIER - boosting independent embeddings robustly," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5189–5198.
[46] P. Kar and H. Karnick, "Random feature maps for dot product kernels," pp. 583–591, 2012.
[47] S. Ahmed, M. H. Yap, M. Tan, and M. K. Hasan, "ReCoNet: Multi-level preprocessing of chest X-rays for COVID-19 detection using convolutional neural networks," medRxiv, 2020.
[48] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[49] A. I. Khan, J. L. Shah, and M. M. Bhat, "CoroNet: A deep neural network for detection and diagnosis of COVID-19 from chest X-ray images," Computer Methods and Programs in Biomedicine, p. 105581, 2020.
[50] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," pp. 5987–5995, 2017.
[51] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in