AIforCOVID: predicting the clinical outcomes in patients with COVID-19 applying AI to chest-X-rays. An Italian multicentre study
Paolo Soda, Natascha Claudia D'Amico, Jacopo Tessadori, Giovanni Valbusa, Valerio Guarrasi, Chandra Bortolotto, Muhammad Usman Akbar, Rosa Sicilia, Ermanno Cordelli, Deborah Fazzini, Michaela Cellina, Giancarlo Oliva, Giovanni Callea, Silvia Panella, Maurizio Cariati, Diletta Cozzi, Vittorio Miele, Elvira Stellato, Gian Paolo Carrafiello, Giulia Castorani, Annalisa Simeone, Lorenzo Preda, Giulio Iannello, Alessio Del Bue, Fabio Tedoldi, Marco Alì, Diego Sona, Sergio Papa
Unit of Computer Systems and Bioinformatics, Department of Engineering, University Campus Bio-Medico of Rome, Via Alvaro del Portillo 21, 00128 Rome, Italy
Department of Diagnostic Imaging and Stereotactic Radiosurgery, Centro Diagnostico Italiano S.p.A., Via S. Saint Bon 20, 20147 Milan, Italy
Pattern Analysis and Computer Vision, Istituto Italiano di Tecnologia, Via Morego 30, 16163 Genoa, Italy
Bracco Imaging S.p.A., Via Caduti di Marcinelle 13, 20134 Milan, Italy
Department of Computer, Control, and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
Radiology Institute, Fondazione IRCCS Policlinico San Matteo, Viale Golgi 19, 27100 Pavia, Italy
Department of Naval, Electrical, Electronic and Telecommunications Engineering, University of Genova, Via All'Opera Pia 11A, 16145 Genoa, Italy
Radiology Department, ASST Fatebenefratelli Sacco, Piazza Principessa Clotilde 3, 20121 Milan, Italy
Diagnostic and Interventional Radiology Unit, ASST Santi Paolo e Carlo - San Paolo Hospital, Via Antonio di Rudinì 8, 20142 Milan, Italy
Department of Advanced Diagnostic Technologies - Therapeutic, Diagnostic and Radiology Units, ASST Santi Paolo e Carlo - San Paolo Hospital, Via Antonio di Rudinì 8, 20142 Milan, Italy
Department of Emergency Radiology, Careggi University Hospital, Largo Piero Palagi 1, 50139 Florence, Italy
Operative Unit of Radiology, Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico of Milan, Via della Commenda 10, 20122 Milan, Italy
Department of Health Sciences, University of Milan, Via Festa del Perdono 7, 20122 Milan, Italy
Diagnostic Imaging, Postgraduate Medical School, University of Foggia, Via Antonio Gramsci 89, 71122 Foggia, Italy
Department of Diagnostic Imaging, IRCCS Ospedale Casa Sollievo della Sofferenza, Viale Cappuccini 1, 71013 San Giovanni Rotondo, Italy
Radiology Unit, Department of Clinical, Surgical, Diagnostic, and Pediatric Sciences, University of Pavia, Corso Str. Nuova 65, 27100 Pavia, Italy
Neuroinformatics Laboratory, Fondazione Bruno Kessler, Via Sommarive 18, 38123 Trento, Italy

* Corresponding author: Giovanni Valbusa, email: [email protected]

December 14, 2020

ABSTRACT
Recent epidemiological data report that worldwide more than 53 million people have been infected by SARS-CoV-2, resulting in 1.3 million deaths. The disease has been spreading very rapidly, and a few months after the identification of the first infected patients, the shortage of hospital resources quickly became a problem. In this work we investigate whether chest X-ray (CXR) can be used as a possible tool for the early identification of patients at risk of a severe outcome, such as intensive care or death. Compared to computed tomography (CT), CXR is simpler, faster, more widely available, and delivers a lower radiation dose. We present a dataset including data collected from 820 patients by six Italian hospitals in spring 2020, during the first COVID-19 emergency. The dataset includes CXR images, several clinical attributes and clinical outcomes. We investigate the potential of artificial intelligence to predict the prognosis of such patients, distinguishing between severe and mild cases, thus offering a baseline reference for other researchers and practitioners. To this goal, we present three approaches that use features extracted from CXR images, either handcrafted or computed automatically by convolutional neural networks, which are then integrated with the clinical data. Exhaustive evaluation shows promising performance both in 10-fold and leave-one-centre-out cross-validation, implying that clinical data and images have the potential to provide useful information for the management of patients and hospital resources.

Keywords: COVID-19 · Artificial Intelligence · Deep Learning · Prognosis
According to data reported by the European Centre for Disease Prevention and Control, as of 13 November 2020 almost 53 million patients worldwide have been infected with the new coronavirus SARS-CoV-2, causing 1.3 million deaths. Since the identification of patient zero in China, the situation has dramatically worsened worldwide, saturating healthcare system resources. With a shortage of beds available in intensive and sub-intensive care, the need for a quick and effective triage system became urgent.
Chest imaging examinations, such as chest X-ray (CXR) [1] and computed tomography (CT) [2], play a pivotal role in different settings. Indeed, imaging is used during triage in case of unavailability, delay, or a first negative result of the reverse transcriptase-polymerase chain reaction (RT-PCR) test [3]. Moreover, imaging is used to stratify disease severity. In general, the findings on chest imaging in COVID-19 are not specific and overlap with those of other infections. CT should not be used to screen for or as a first-line test to diagnose COVID-19; it should be used sparingly, reserved for hospitalized, symptomatic patients with specific clinical indications [4]. The most frequent CXR lesions in COVID-19 patients are reticular alterations (up to 5 days from symptom onset) and ground-glass opacities (after more than 5 days from symptom onset). In COVID-19 patients, consolidation gradually increases over time, and bilateral, peripheral, middle/lower zones are the most frequent locations [5]. In some hospitals, the CXR examination is replaced or accompanied by a CT scan, which showed a sensitivity of 97% for COVID-19 diagnosis [2], albeit with a limited specificity of 25%. Both CXR and CT have specific pros and cons, but the latter poses several logistic issues, such as the limited availability of machine slots, the difficulty of moving bedridden patients, and the long sanitization times.
Furthermore, patient follow-up with CXR is simpler because the examination can be acquired at the patient's bed and, when required, directly at home [6].
Recently, artificial intelligence (AI) has been widely adopted to analyse CXR for several purposes, such as tuberculosis detection [7], abnormality classification and image annotation [8], pneumonia screening in pediatric and non-pediatric patients [9], and edema and fibrosis assessment [10]. The challenge of the COVID-19 pandemic has obviously boosted research efforts on AI in medical imaging and, according to the work presented by Greenspan et al. [11], such applications may have an impact along three main directions, namely detection and diagnosis, patient management, and predictive modelling. Regarding detection and diagnosis, AI is mainly used to detect the presence of COVID-19 patterns by processing CXR and/or CT images with deep neural networks (DNNs), such as convolutional neural networks (CNNs). DNNs have also been applied to lesion segmentation or to produce a coarse localization map of the important regions in the image. For instance, Zhang et al. [12] analysed CT scans collected from 4695 patients to differentiate novel coronavirus pneumonia from other types of pneumonia (bacterial, viral and mycoplasma) and from healthy subjects. The classification was based on the combination of the segmented lung-lesion map and the normalized CT volumes. Experimental tests were performed on 260 patients, achieving an overall accuracy of 92.49%. Minaee et al. [13] analysed 5,000 chest X-ray images from publicly available datasets using four well-known convolutional neural networks: ResNet-18, ResNet-50, SqueezeNet, and DenseNet-121.
Two thousand images were used for training, whilst the models were tested on the other 3,000, attaining a sensitivity of 98% and a specificity of around 90% in detecting COVID-19 patients from their CXR.
The development of systems supporting patient management during hospitalization is mainly concerned with monitoring disease evolution over time. For instance, Gozes et al. [14] proposed an image-based tool supporting the measurement of disease extent within the lungs. This severity biomarker is intended to help physicians in the decision-making process by tracking disease severity over time.
Finally, predictive modelling mainly concerns the development of models able to predict the progression of the disease. These approaches usually make use of both imaging and clinical data to predict the severity of the infection or the progression time, i.e. the time from initial hospital admission to severe or critical illness, defined by death, the need for mechanical ventilation, or transfer to the intensive care unit (ICU) [12]. Few applications have been developed within this category so far. For example, Greenspan et al. [11] in their position paper presented preliminary and unconsolidated results on predicting the probability for a patient to be admitted to the ICU, exploiting quantitative features extracted from the lung region of the CXR images, vital parameters, comorbidities, and other clinical parameters. These data fed a random forest, which attained an area under the ROC curve (AUC) equal to 0.83. A survey by Wynants et al. [15] compared 16 papers presenting prognostic models (8 for mortality, 5 for progression to a severe/critical state and 3 for length of stay), whose AUC ranged from 0.85 up to 0.99. Nine of these papers used only clinical data for the analysis, whilst the others used clinical data and features extracted from CT images. The authors also argued that all 16 papers have a high risk of bias [16] due to the high probability of model overfitting and unclear reporting on the intended use of the models. Still using CT images, two multicentric studies have been recently presented by Yue et al. [17] and Chassagnon et al. [18]. The former included a cohort of 52 patients from five hospitals to predict short- or long-term hospital stay in patients with COVID-19 pneumonia. First, the CT scans were semi-automatically segmented and then, for each lesion patch, the authors extracted 1,218 features, accounting for first-order, shape, second-order and wavelet measures. Second, a logistic model and a random forest were trained on the data from four hospitals and tested on patients belonging to the fifth clinic, attaining balanced accuracies equal to 0.94 and 0.87, respectively. The work presented by Chassagnon et al.
[18] aims to predict patient outcomes (severe vs non-severe) prior to mechanical ventilation support and to suggest a possible prognosis among three available groups (short-term deceased, long-term deceased, long-term recovered). To these goals, they searched for a subset of discriminative features among several image texture descriptors computed from CT scans and a few clinical data (i.e. age, gender, high blood pressure, diabetes, body mass index). On a cohort of 693 patients, an ensemble of classifiers separated patients with severe vs non-severe outcomes and correctly identified the prognosis with balanced accuracies equal to 70% and 71%, respectively.
This analysis of the literature shows that the development of AI-based models predicting the outcomes of COVID-19 patients still deserves further research effort. On the one side, sharing patient data from studies, as well as creating new datasets collected in clinical practice, is fundamental for the AI community [19], since many researchers do not have the possibility of collecting clinical data and images from different clinical centres. On the other side, in the context of COVID-19 prognosis, except for the very preliminary results anticipated by Greenspan et al. [11], all the works in the literature used CT scans. To address both concerns, this work introduces a novel dataset including clinical data and CXR images from 820 patients with COVID-19 who were hospitalized in six hospitals in Italy. To each patient we associated prognostic information related to the clinical outcome. This data repository will be made publicly available to encourage research in this field, where most resources are collected using CT scans, as already mentioned. We also investigated three AI-based approaches to predict the clinical outcome by integrating clinical and imaging data, thus offering a first analysis that can be used by other researchers and practitioners as a baseline reference. In addition to clinical data, these approaches use quantitative information extracted from the CXR images, also referred to as image features or quantitative biomarkers in the following. The first approach computes handcrafted texture features to be used by a common classifier, the second approach automatically extracts image descriptors by using a CNN, while the third approach is fully based on DNNs, processing both clinical and image data (Figure 1).
In synthesis, the main objectives of this work are:
1. to boost the research on AI-based prognostic models to support healthcare systems in the fight against the COVID-19 pandemic by making publicly available a repository of CXR images and clinical data (general information, laboratory data and comorbidities) collected in a real environment during the first wave of the pandemic emergency, which includes common real-world issues such as missing data, outliers, different imaging devices, and poorly standardized data. The repository will also facilitate external validation of learning models developed in this field;
2. to present an evaluation of three state-of-the-art learning approaches to predict future severe cases at the time of hospitalization, which are specifically designed to use either handcrafted or learned image features, together with clinical data.
The rest of this manuscript is organized as follows: the next section describes the dataset we collected and are making publicly available. Section 3 introduces the methodology we adopted, whilst section 4 presents the classification results achieved. Section 5 discusses our findings, providing also concluding remarks.
This study includes the images and clinical data collected in six Italian hospitals at the time of hospitalization of symptomatic patients with COVID-19, during the first wave of the emergency in the country (March-June 2020). These data were generated during the clinical activity, with the primary purpose of managing COVID-19 patients within the

Figure 1: Overview of the method for the automatic prognosis of COVID-19 in two classes, namely mild and severe. Our work includes data collected in 6 independent cohorts, resulting in 820 COVID-19 patients. For each patient, we collected several clinical attributes, combined with quantitative imaging biomarkers computed as handcrafted features or automatically extracted by CNNs.

Table 1: Patient distribution across the hospitals where the data were collected.
Hospital   Number of patients   Mild class prior probability
A          120                  29.2%
B          104                  56.7%
C          31                   25.8%
D          139                  54.7%
E          101                  54.5%
F          325                  46.5%
Total      820                  46.8%

daily practice, and they were retrospectively reviewed and collected after patients' anonymization. Ethics Committee approval was obtained (Trial-ID: 1507; approval date: April 7th, 2020) and all data were managed in accordance with the GDPR regulation. Furthermore, we randomly assigned to each centre a symbolic label, from A up to F.
The 820 CXR examinations reviewed in this study were performed in COVID-19-positive adult patients at the time of hospital admission (Table 1): all the patients tested positive for SARS-CoV-2 infection at the RT-PCR test [20]. In 5% of these cases the positivity to the swab was obtained only at the second RT-PCR examination. In the different centres, CXR examinations were performed using different analog and digital units, and the acquisition parameters were set according to the patient's condition. Paired with the CXR examinations, we also collected the relevant clinical parameters listed in Table 2.
According to the clinical outcome, each patient was assigned to the mild or the severe group. The former contains the patients sent back to domiciliary isolation or hospitalized without ventilatory support, whereas the latter is composed of patients who required non-invasive ventilation support or intensive care unit (ICU) admission, and deceased patients. Figure 2 shows four difficult examples of CXR images within the dataset: indeed, panels A and B show two images of patients with severe outcome whilst radiological visual inspection may suggest severe and mild prognoses, respectively. Similarly, panels C and D show two images of patients with mild outcome whilst a radiologist may report severe and mild prognoses, respectively.
Table 2: Description of the clinical data available within the repository. The first and second columns report each variable's label and description. Summary statistics for the overall population and for the two patient groups are reported in the following columns, together with the rate of missing data: for continuous variables, median and interquartile range are reported; for categorical variables, proportions are reported. Feature names followed by '+' were not used for the analysis described in this work. P-values lower than 0.05 were considered significant (* Mann-Whitney U test; † z-test for proportions with Yates continuity correction; ‡ Fisher exact test).

[The numeric entries of Table 2 could not be recovered from this copy; the variables it describes are:]
Active cancer in the last … years: patient had active cancer (% reported)
Age: patient's age (years)
Atrial fibrillation: patient had atrial fibrillation (% reported)
Body temperature (°C): patient's temperature at admission
Cardiovascular disease: patient had cardiovascular diseases (% reported)
Chronic kidney disease: patient had chronic kidney disease (% reported)
COPD: chronic obstructive pulmonary disease (% reported)
Cough: cough (% yes)
CRP: C-reactive protein concentration (mg/dL)
Days fever: days of fever up to admission (days)
D-dimer: D-dimer amount in blood
Death+: death of the patient occurred during hospitalization, for any cause
Dementia: patient had dementia (% reported)
Diabetes: patient had diabetes (% reported)
Dyspnea: patient had intense tightening in the chest, air hunger, difficulty breathing, breathlessness or a feeling of suffocation (% yes)
Fibrinogen: fibrinogen concentration in blood (mg/dL)
Glucose: glucose concentration in blood (mg/dL)
Heart failure: patient had heart failure (% reported)
Hypertension: patient had high blood pressure (% reported)
INR: International Normalized Ratio
Ischemic heart disease: patient had ischemic heart disease (% reported)
LDH: lactate dehydrogenase concentration in blood (U/L)
O2 (%): oxygen percentage in blood (%)
Obesity: patient had obesity (% reported)
PaCO2: partial pressure of carbon dioxide in arterial blood (mmHg)
PaO2: partial pressure of oxygen in arterial blood (mmHg)
PCT: platelet count (ng/mL)
pH: blood pH
Position+: patient position during chest X-ray (% supine)
Positivity at admission: positivity to the SARS-CoV-2 swab at admission time (% positive)
Prognosis: patient outcome, see section 2 (% cases)
RBC: red blood cell count
Respiratory failure: patient had respiratory failure (% reported)
SaO2: arterial oxygen saturation (%)
Sex: patient's sex (% males)
Stroke: patient had stroke (% reported)
Therapy Anakinra+: patient was treated with Anakinra (% yes)
Therapy anti-inflammatory+: patient was treated with anti-inflammatory drugs (% yes)
Therapy antiviral+: patient was treated with antiviral drugs (% yes)
Therapy heparin+: patient was treated with heparin (no; yes; prophylactic treatment; therapeutic treatment)
Therapy hydroxychloroquine+: patient was treated with hydroxychloroquine (% yes)
Therapy Tocilizumab+: patient was treated with Tocilizumab (% yes)
WBC: white blood cell count
Figure 2: Examples of CXR images of patients with COVID-19 available within the dataset. Panels A and B show two images of patients with severe outcome whilst radiological visual inspection may suggest severe and mild prognoses, respectively. Similarly, panels C and D show two images of patients with mild outcome whilst a radiologist may suggest severe and mild prognoses, respectively, on the basis of the visual interpretation.

During an initial data quality cleaning, we double-checked with the clinical partners the anomalous data and the outliers, i.e. those values lying outside the expected clinical range or identified by applying the interquartile range method, which were then corrected when needed. Categorical variable values were homogenized to a coherent coding, such as 0 and 1 values for binary variables like comorbidities and sex, and we adopted the string "NaN" to denote missing data. No exclusion rule was applied to images based on device type or brand (e.g. digital or analog devices) or patient position (standing or at bed), whereas X-ray images taken with lateral projection were excluded. In the case of multiple CXR images delivered for the same patient, the dataset contains only the first one. It is worth noting that the presence of missing entries in the clinical data mostly depends upon the procedures carried out in the individual hospitals, as well as upon the pressure due to the overwhelming number of patients hospitalized during the COVID-19 emergency. For the sake of completeness, the rate of missing data is reported in the last column of Table 2.
CXR images were collected in DICOM format and, due to anonymization constraints, all the fields in the DICOM header were blanked except for a set of selected metadata related to acquisition parameters (e.g. image modality, allocated bits, pixel spacing, etc.).
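The interquartile-range screening used during data cleaning can be sketched as follows; the conventional 1.5×IQR fences and the toy column below are illustrative assumptions, not the study's exact protocol.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN entries are ignored."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(v, [25, 75])
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Comparisons involving NaN are False, so missing entries are never flagged
    return (v < low) | (v > high)

# Example: a glucose-like column with one implausible entry and a missing value
col = [92, 88, 101, 97, 110, 950, np.nan, 105]
mask = iqr_outliers(col)  # only the 950 entry is flagged
```

Flagged entries would then be reviewed with the clinical partners rather than dropped automatically, as described above.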
All the images in the repository are currently stored using 16 bits, while the acquisition precision varies: 13.5% were acquired at 10-bit precision, 35.4% at 12 bits, 46.6% at 14 bits and 4.5% using the full 16-bit precision. Furthermore, all the images were acquired with isotropic pixel spacing ranging from 0.1 mm to 0.2 mm; the most common pixel spacings are 0.15 mm, 0.1 mm and 0.16 mm, for 43.9%, 13.7% and 13.6% of the images respectively. Image sizes, in pixels, vary as well; the most common size, 2336 × …, accounts for 33.4% of the images.
We performed a statistical analysis applying the Mann-Whitney U test to compare the mild and severe groups for continuous variables, whereas we used the z-test with Yates continuity correction to analyse proportions. Summary statistics are reported in Table 2. For continuous variables, median and interquartile range (IQR) are reported; for categorical variables, we report patients' proportions expressed as percentages. A p-value lower than 0.05 was considered significant.
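Both tests are available in SciPy; the sketch below uses illustrative toy arrays, not values from the study, and applies `chi2_contingency` with Yates correction, which for a 2×2 table is equivalent to the continuity-corrected z-test for two proportions.

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

# Continuous variable: compare mild vs severe groups (illustrative toy data)
mild = np.array([55, 61, 48, 70, 66, 59, 52])
severe = np.array([68, 74, 80, 71, 77, 65, 83])
u_stat, p_cont = mannwhitneyu(mild, severe, alternative="two-sided")

# Categorical variable: 2x2 contingency table (rows: groups, cols: yes/no).
# correction=True applies the Yates continuity correction.
table = np.array([[30, 70],
                  [55, 45]])
chi2, p_prop, dof, _ = chi2_contingency(table, correction=True)
```

With these toy numbers both p-values fall below the 0.05 threshold used in the paper.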
The analysis evidenced that females represented 32% (n=266) of the total population and that they were significantly older (median 63 years, IQR 50-76 years) than males (median 59 years, IQR 48-69 years). The severe group comprised 436 out of 820 (53%) patients, of whom 109 (25%) were females.
We investigated three AI-based prognostic approaches covering well-known methodologies, with the intent to offer researchers and practitioners a reference baseline for processing the data available within the AIforCOVID dataset. Furthermore, for the sake of an easy and fair comparison, and to foster further research in this field, we also detail the adopted validation procedures, recommending others to measure model performance at least as reported here.
As schematically depicted in Figure 1, the first learning approach employs first-order and texture features computed from the images, which are mined together with the clinical data to feed a supervised learner. In the following, it is shortly referred to as the handcrafted approach, and it is presented in section 3.6.
In the last decade we have witnessed the rise of deep artificial neural networks, which have attained outstanding performance in many fields. Recently, DNNs such as convolutional neural networks have also been applied to COVID-19 imaging, mostly for diagnostic purposes [11]. On this basis, the second approach presented here mixes automatic features computed by a CNN with the clinical data. In short, we used a pre-trained CNN as a CXR feature extractor; the output of the last fully-connected layer was then provided as input to an SVM classifier, together with the clinical features. In the following, it is shortly referred to as the hybrid approach, and it is presented in section 3.7.
The third approach exploits the clinical data and the raw CXR together, using a multi-input convolutional network to predict the patients' outcome. In order to handle data from such different sources, the network consists of two dedicated input branches, while higher-level features from both sources are concatenated in the last layers before the classification output. In the following, this approach is shortly referred to as the end-to-end deep learning approach, and it is detailed in section 3.8.
Note that none of these approaches uses the therapy-related variables included in the dataset because, albeit therapy could influence the final outcome, it is also dependent on the outcome (i.e. patients who required intensive care were administered specific therapies). Furthermore, the classification task considers only the data collected at the time of hospitalization and, therefore, in a true clinical scenario, information on the administered therapy would not be available. For this reason, the use of those variables could be misleading.
Before presenting each of the three approaches in detail, the following sections 3.1 and 3.2 describe data imputation and image standardization. Furthermore, section 3.3 presents the framework used to segment the lungs, whereas section 3.4 describes the feature selection approach and the classifiers adopted, which are the same across the three methods to facilitate their comparison.
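The final stage of the hybrid approach can be sketched as below: image descriptors and clinical features are concatenated and fed to a linear SVM. The arrays are randomly generated stand-ins (in the actual pipeline the 64 image descriptors would come from the last fully-connected layer of a pre-trained CNN), and the feature counts are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_patients = 120

# Stand-ins: CNN-derived image descriptors and 34 clinical features per patient
cnn_features = rng.normal(size=(n_patients, 64))
clinical = rng.normal(size=(n_patients, 34))
y = rng.integers(0, 2, size=n_patients)        # 0 = mild, 1 = severe

# The hybrid approach feeds both sources to a single SVM classifier
X = np.hstack([cnn_features, clinical])
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X, y)
pred = clf.predict(X)
```

The concatenation keeps the two modalities in one feature space, so the same classifier and feature-selection machinery can be reused across the handcrafted and hybrid approaches.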
Table 3 summarizes the operations common to the three AI approaches, and section 3.5 introduces the procedure adopted to validate the learning models.

Table 3: Summary of the operations common to the three AI approaches.

Method          Data imputation   Image standardization   Lung segmentation   Feature selection
Handcrafted     ✓                 ✓                       ✓                   ✓
Hybrid          ✓                 ✓                       ✓                   ✓
End-to-end DL   ✓                 ✓
To deal with missing data, univariate data imputation estimates the missing entries using the mean of the column in which the missing values are located. We preferred this approach to multivariate or prediction-based imputation methods since it is known to work well when the data size is not very large, and it prevents the data loss that results from brute-force removal of rows and columns. Furthermore, preliminary results not shown here confirmed these observations. As reported in the second column of Table 3, imputation was performed before each learning paradigm worked on the data.
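Column-mean imputation of this kind can be sketched with scikit-learn's `SimpleImputer`; the toy matrix below is illustrative, assuming the clinical table has already been coded numerically with NaN marking missing entries.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy clinical matrix: rows are patients, columns are variables; NaN = missing
X = np.array([[63.0, 7.2, np.nan],
              [58.0, np.nan, 1.0],
              [71.0, 6.8, 0.0]])

# Univariate imputation: each NaN is replaced by its column's mean. In the
# cross-validation setup the means are fitted on the training fold only.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

In a cross-validated pipeline one would call `fit` on the training fold and `transform` on the test fold, mirroring the validation procedure of section 3.5.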
CXR images collected for this study were acquired with different devices and under different acquisition conditions, as mentioned in section 2. For this reason, we applied an image normalization that, to a large extent, is the same for all three methods. Indeed, for the handcrafted approach pixel values were normalized to have zero mean and unit standard deviation, whilst the images were resized to 1024 × 1024 pixels. For the CNN-based approaches, the images were resized to 224 × 224 pixels, without prior cropping, and normalized as in the previous case.
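The resize-and-z-score standardization can be sketched as follows; the use of bilinear resampling (and of `scipy.ndimage.zoom` in particular) is an assumption for illustration, while the two target sizes mirror those given in the text.

```python
import numpy as np
from scipy.ndimage import zoom

def standardize(img, out_size):
    """Resize a 2-D CXR to out_size x out_size and z-score its intensities."""
    img = img.astype(float)
    scale = (out_size / img.shape[0], out_size / img.shape[1])
    img = zoom(img, scale, order=1)          # bilinear resampling
    return (img - img.mean()) / img.std()    # zero mean, unit std deviation

# Toy 16-bit image standing in for a CXR stored at some native resolution
cxr = np.random.default_rng(0).integers(0, 2**16, size=(512, 448))
hand = standardize(cxr, 1024)   # handcrafted-approach input size
deep = standardize(cxr, 224)    # CNN input size
```

Fitting the normalization statistics per image (rather than per dataset) makes the pipeline robust to the different acquisition devices mentioned above.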
When needed, to segment the lungs we apply a semi-automatic approach that initially delineates the lung borders using a U-Net, a convolutional neural network architecture for fast and precise image segmentation. The network was already trained on non-COVID-19 lung CXR datasets, namely the Montgomery County CXR set (MC) presented by Jaeger et al. [22] and the Japanese Society of Radiological Technology (JSRT) repository presented by Shiraishi et al. [23]. The former contains 7,470 CXR images collected by the National Library of Medicine within the Open-i service, whereas the latter is composed of 247 chest radiographs with and without a lung nodule. The network is available as detailed in reference [21]. The U-Net requires input images represented as 3-channel 256 × 256 matrices; hence, the grayscale images were copied to all three channels and then resized. Furthermore, we normalized the pixel intensities as detailed in section 3.2. After these transformations, each image was passed through the convolutional network and every pixel was classified as foreground (i.e. lung) or background. To verify that the network performed well, the automatic segmentations were compared against hand-made masks delineated by expert radiologists (Figure 4).
Figure 4: Example of the lung segmentation results. Green line: manual segmentation; red line: segmentation returned by the U-Net; blue line: bounding box from the U-Net segmentation.
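Deriving a binary lung mask and the bounding box drawn in Figure 4 from the network's per-pixel foreground map can be sketched as below; the 0.5 threshold and the rectangular toy map are illustrative assumptions.

```python
import numpy as np

def mask_and_bbox(prob_map, thr=0.5):
    """Binarize a foreground probability map and return (mask, bounding box)."""
    mask = prob_map > thr
    rows = np.flatnonzero(mask.any(axis=1))   # rows containing lung pixels
    cols = np.flatnonzero(mask.any(axis=0))   # columns containing lung pixels
    bbox = (rows[0], rows[-1], cols[0], cols[-1])  # (top, bottom, left, right)
    return mask, bbox

# Toy 256x256 "probability map" with a rectangular high-probability region
prob = np.zeros((256, 256))
prob[60:200, 40:220] = 0.9
mask, bbox = mask_and_bbox(prob)
```

The bounding box can then be used to crop the lung region before feature computation.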
In general, we had a large number of descriptors, which led us to apply a feature selection stage set up in two steps. The first is a coarse step that runs a univariate filter, using mutual information as the score function, to pre-select a reduced set of image descriptors, whatever the approach used for their computation. The mutual information between continuous features and the discrete class variable was computed by estimating the entropy from the k-nearest-neighbour distances [24].

The second feature selection step merges the pre-selected imaging features with the clinical data. To this end, we applied a wrapper approach, namely the Recursive Feature Elimination and Cross-Validated selection (RFECV) method [25], which receives as input the pre-selected imaging descriptors and the 34 clinical features. The RFECV was fed with an increasing number of pre-selected imaging descriptors (D_pr): fine-grained sampling was carried out for D_pr ≤ 10, applying a step of 2; beyond 10, D_pr was sampled with a step of 5; finally, RFECV was fed with all the image features. RFECV applies a pruning procedure that starts by considering all features in the dataset and recursively eliminates the least important ones according to an importance score calculated using a classifier. Note that the optimal number of features is selected by RFECV using nested 5-fold cross-validation on the training set.

With reference to the base learner, we evaluated three different computational paradigms: Logistic Regression (LGR), Support Vector Machines with a linear kernel (SVM), and Random Forests (RF). For all parameters in the adopted models we used the default values provided by the libraries, without any fine-tuning, since we were not interested in the best absolute performance. Moreover, Arcuri et al. [26] empirically observed that in many cases tuned parameters do not significantly outperform the default values suggested in the literature.

Model validation for the three tested methods consists of k-fold and leave-one-centre-out cross-validation. For each cross-validation run, the training fold was used for data normalization, parameter estimation and/or feature selection, depending on the applied method. Classification performance was assessed using the testing fold only; k-fold cross-validation was repeated with k equal to 3 and 10, with 20 repetitions. In leave-one-centre-out (LOCO) cross-validation, in each run the test set is composed of all the samples belonging to one centre only, while the others were assigned to the training set. When needed, the validation set was extracted from the training set using suitable policies (such as random selection, hold-out, nested cross-validation, etc.), considering also the computational burden.

Performance of the learning models was measured in terms of accuracy, sensitivity and specificity, reporting the average and standard deviation over the experiments. When comparing the results of two methods we ran the pairwise two-sided Mann-Whitney U test, whereas for multiple comparisons we performed the Kruskal-Wallis test followed by Dunn's test with Bonferroni correction. In the rest of the manuscript we assume the pairwise two-sided Mann-Whitney U test by default; otherwise, we specify the test used.
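The two-step selection described above (univariate mutual-information filter followed by the RFECV wrapper) can be sketched with scikit-learn; the data below are a synthetic stand-in, and all sizes and names are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real cohort: 200 "imaging" descriptors and
# 34 "clinical" features (the clinical block here is pure noise).
X_img, y = make_classification(n_samples=150, n_features=200,
                               n_informative=8, random_state=0)
X_clin, _ = make_classification(n_samples=150, n_features=34,
                                n_informative=4, random_state=1)

# Step 1 (coarse): univariate filter on the imaging descriptors, scored by
# mutual information estimated from k-nearest-neighbour distances.
mi = mutual_info_classif(X_img, y, n_neighbors=3, random_state=0)
D_pr = 10                                      # number of pre-selected imaging features
pre_selected = np.argsort(mi)[::-1][:D_pr]

# Step 2 (fine): RFECV wrapper on the pre-selected imaging features
# concatenated with the clinical ones, with nested 5-fold CV.
X = np.hstack([X_img[:, pre_selected], X_clin])
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_)   # optimal subset size chosen by RFECV
```

In the actual protocol this whole procedure runs inside the training fold only, and D_pr is swept over the grid described above.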
The handcrafted approach first computes parametric maps of the lungs segmented in the CXR image; second, it extracts several features that are then provided, together with the clinical data, to a supervised learner.

To segment the lungs we applied the approach presented in section 3.3, also comparing the segmentation masks provided by the U-Net with those manually annotated. We found that the network provides a Jaccard index and a Dice score equal to 0.896 and 0.942, respectively. We deem such performance satisfactory when it is only needed to recover the bounding box, as in the hybrid approach presented below, while it would not be sufficient for exact lung delineation. For this reason, the lung masks obtained so far were reviewed by an expert radiologist and then used to compute the handcrafted features as follows.

From the segmented lungs we computed the parametric maps using a pixel-based approach, as proposed by Penny et al. [27]. Pixel values of the parametric maps were obtained by computing first- and second-order radiomic features on a 21 × 21 sliding window running over each pixel of the entire lung region. First-order measures describe the statistical distribution of tissue density inside the kernel; from its grey-level histogram, we extracted 18 descriptors, whose formal presentation is offered in A. Second-order descriptors are based on the Grey Level Co-occurrence Matrix (GLCM): at each location, we got a GLCM image, on which we computed 24 Haralick descriptors [28], detailed in the same Appendix. This procedure returned 42 parametric images (18 first-order + 24 GLCM) for each CXR image, on which we finally computed seven statistics, namely: mean, median, variance, skewness, kurtosis, energy and entropy. This resulted in 294 image features (i.e. 7 statistics by 42 parametric maps).

To cope with the large number of descriptors we proceeded as described in section 3.4, adopting the base learners already described there. Then, for each tested classifier, given the set and number of descriptors selected by the wrapper approach in the nested cross-validation fashion, we trained the same classifier on the whole training fold and measured recognition performance on the test fold.
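A minimal sketch of the parametric-map computation, restricted for brevity to two first-order maps and the seven per-map summary statistics (our simplified NumPy version, not the study's implementation, which also derives GLCM-based maps):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def first_order_maps(lung, k=21):
    """Slide a k x k window over the (already segmented) lung region and
    compute first-order statistics at each position. Only the mean and
    variance maps are shown; the full pipeline derives 18 first-order and
    24 GLCM descriptors per window."""
    win = sliding_window_view(lung, (k, k))        # (H-k+1, W-k+1, k, k)
    windows = win.reshape(*win.shape[:2], -1)
    return windows.mean(axis=-1), windows.var(axis=-1)

def summarize(pmap):
    """The seven per-map summary statistics used as image features."""
    v = pmap.ravel().astype(float)
    m, s = v.mean(), v.std()
    hist, _ = np.histogram(v, bins=64, density=True)
    p = hist[hist > 0]
    p = p / p.sum()
    return [m, np.median(v), v.var(),
            ((v - m) ** 3).mean() / s ** 3,        # skewness
            ((v - m) ** 4).mean() / s ** 4 - 3,    # excess kurtosis
            np.sum(v ** 2),                        # energy
            -np.sum(p * np.log2(p))]               # histogram entropy

mean_map, var_map = first_order_maps(np.random.default_rng(0).random((64, 64)))
print(mean_map.shape, len(summarize(mean_map)))   # (44, 44) 7
```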
The hybrid approach integrates the output of a pre-trained deep network with the set of clinical measures. The pipeline works as follows: first, we applied a pre-trained deep neural network to segment the lungs; second, a convolutional neural network was trained to extract relevant features from the CXR images; third, we concatenated the deep features with the clinical ones; fourth, we performed a feature selection step as reported in section 3.4; fifth, we trained a supervised classifier to accomplish the binary classification task. In the following we illustrate these steps.

As mentioned before, the image repository is composed of CXR images collected in multiple hospitals, using different machines with different acquisition parameters. This resulted in a certain degree of variability among the images, in which the lungs also have different sizes. To cope with this issue, we adopted the segmentation net already discussed in section 3.3, which boosts the performance of the feature extraction network by locating the lungs. Differently from before, where the U-Net was used to pre-segment the lungs and the borders were manually refined, here we adopted a fully automated approach: the segmentation mask given by the network was used to extract the rectangular bounding box containing the ROI. No manual intervention was needed, since the performance at the level of ROI bounding box segmentation was satisfactory when compared with human annotation; indeed, the Jaccard index and the Dice score were now equal to 0.929 and 0.960, respectively.

Next, each ROI was resized to a square so that the longest side of the ROI was mapped to the square side, and the other side was resized accordingly. Each cropped image was then passed to a deep neural network to extract the features, where we performed a transfer learning process as follows; indeed, preliminary experiments showed that such an approach gave better results than training from scratch.
On our image dataset we trained several state-of-the-art network architectures previously initialized on other repositories. In a first stage we tested these networks in ten-fold cross-validation: AlexNet [29], VGG-11, VGG-11 BN, VGG-13, VGG-13 BN, VGG-16, VGG-16 BN, VGG-19, VGG-19 BN [30], ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152 [31], ResNeXt [32], Wide ResNet-50 v2 [33], SqueezeNet-1.0, SqueezeNet-1.1 [34], DenseNet-121, DenseNet-169, DenseNet-161, DenseNet-201 [35], GoogleNet [36], ShuffleNet v2 [37] and MobileNet v2 [38]. Then, to reduce the computational burden, only the top-five networks (i.e. VGG-11, VGG-19, ResNet-18, Wide ResNet-50 v2, and GoogleNet) underwent all the experiments described in section 3.5. In all cases, we changed the output layer of the CNNs, using two neurons, one for each class. Moreover, image standardization as described in section 3.2 was performed. In only one case the segmentation network did not segment the lungs; in this case, the entire original image was used. We also augmented the training data by independently applying each of the following transformations with a probability of 30%: vertical and horizontal shift (-7, +7), y-axis flip, rotation (-175°, +175°) and elastic deformation (σ = 7, α = [20, 40]). Training parameters were: a batch size of 32 with a cross-entropy loss, an SGD optimizer with a learning rate of 0.001 and momentum of 0.9, a maximum of 300 epochs, and an early stopping criterion of 25 epochs based on the accuracy on the validation set.

Once the deep networks were trained, we integrated the automatic features they computed with the clinical information. To this end, we extracted the last fully connected layer of each network, which was used as a vector of features for each patient; accordingly, depending on the network, the number of automatic features varied between 512 and 4096 (i.e. 512 for ResNet-18, 1024 for GoogleNet, 2048 for Wide ResNet-50 v2, and 4096 for VGG-11 and VGG-19). Each such set of automatically computed descriptors was combined with the clinical data and, to avoid overwhelming the latter, the number of features in the former was reduced by a coarse selection stage using the univariate approach already described in section 3.4.
Furthermore, we then applied the same wrapper approach to investigate whether the combination of automatic and clinical features had a degree of redundancy. To avoid any bias, all the operations described so far respected the training, validation and test splits introduced before, ensuring that the test set was not used in any stage except for the final evaluation. Finally, the selected features were used to classify each patient into the two classes already mentioned, i.e. mild and severe, as reported in the last part of the previous subsection, and using the same learners already mentioned.

The end-to-end DL approach was designed so that clinical information could influence the generation of useful features in image classification and vice versa. Two different variants were tested: in the first, CXR images were modified with the addition of an extra layer, in which pixels in fixed positions coded properly normalized clinical information, while the remaining pixels were filled with uniform white noise. The second variant consists of a multi-input network, which received CXR images and clinical information separately. Eventually, the latter variant proved to perform slightly better and is described in the following.

The architecture we adopted is composed of three main sections: one branch for each input accepts raw data and processes them to obtain a small number of relevant features, while a final common path concatenates the output features of the previous branches and uses them to provide the actual classification. A representation of the network can be found in Figure 5.

Figure 5: Workflow of the end-to-end deep learning approach.

Several image classification networks were tested for the CXR branch (VGG, ResNet, Inception and Xception variants). In general, smaller models appeared to perform better, and the best results were observed with a ResNet50. The network was adopted up to the last convolutional layer, while the final fully-connected layer and classification section were removed. The number of generated output features was reduced by a dropout layer with probability 0.5, followed by a 256-neuron fully-connected layer, a leaky ReLU and a final dropout layer with probability 0.25.
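At the shape level, the image-branch head and the fusion of the two branches can be sketched in NumPy as follows; random weights stand in for the trained ResNet50 and the clinical branch (both are ours, illustrative only), and the real model would of course be built in a DL framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)

def dropout(x, p):
    """Inverted dropout: zero activations with probability p, rescale the rest."""
    return x * (rng.random(x.shape) > p) / (1 - p)

def cxr_branch_head(conv_features, W):
    """Image-branch head: dropout(0.5) on the pooled convolutional features,
    a 256-neuron fully-connected layer with leaky ReLU, then dropout(0.25)."""
    x = dropout(conv_features, 0.5)
    x = leaky_relu(x @ W)            # -> 256 features
    return dropout(x, 0.25)

# toy shapes: 2048 pooled ResNet50 features and a 34-dim clinical vector
img_feat = cxr_branch_head(rng.random((1, 2048)),
                           rng.standard_normal((2048, 256)) * 0.01)
clin_feat = rng.random((1, 34))      # clinical-branch output (stub)

# final common path: concatenate the branch outputs before classification
fused = np.concatenate([img_feat, clin_feat], axis=1)
print(fused.shape)   # (1, 290)
```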
All networks tested were pre-trained on the ImageNet dataset, then on a publicly available repository of CXR images with the task of discriminating between healthy subjects and pneumonia patients [39]; finally, the network was trained on the dataset presented here. We found that, with our architecture, data augmentation did not improve final classification performance and was therefore excluded from the processing pipeline. On the other hand, pre-training led to more consistent and slightly better results. As mentioned in section 3.2, the only image modification applied was resizing all input images to 224 × 224 pixels, while the batch size was set to 16.

Table 4: Best recognition performance attained by each of the learning methods when the experiments were executed according to the 10-fold cross-validation (20 repetitions). In the second column, ML and DL stand for Machine Learning and Deep Learning, respectively. The last column reports the learners providing the results shown here.

Input data                    Approach      Accuracy  Sensitivity  Specificity  Learner
Clinical data                 ML            . ± .     . ± .        . ± .        SVM
                              DL            . ± .     . ± .        . ± .        MLP
CXR images                    Handcrafted   . ± .     . ± .        . ± .        LGR
                              Hybrid        . ± .     . ± .        . ± .        VGG-11 + RF
                              End-to-end    . ± .     . ± .        . ± .        ResNet50
Clinical data and CXR images  Handcrafted   . ± .     . ± .        . ± .        SVM
                              Hybrid        . ± .     . ± .        . ± .        GoogleNet + SVM
                              End-to-end    . ± .     . ± .        . ± .        ResNet50 + MLP
Table 5: Best recognition performance attained by each of the learning methods when the experiments were executed according to the LOCO cross-validation. In the second column, ML and DL stand for Machine Learning and Deep Learning, respectively. The last column reports the learners providing the results shown here.
Input data Approach Accuracy Sensitivity Specificity Learner
Clinical data                 ML            . ± .     . ± .        . ± .        SVM
                              DL            . ± .     . ± .        . ± .        MLP
CXR images                    Handcrafted   . ± .     . ± .        . ± .        SVM
                              Hybrid        . ± .     . ± .        . ± .        VGG-11 + SVM
                              End-to-end    . ± .     . ± .        . ± .        ResNet50
Clinical data and CXR images  Handcrafted   . ± .     . ± .        . ± .        LGR
                              Hybrid        . ± .     . ± .        . ± .        GoogleNet + LGR
                              End-to-end    . ± .     . ± .        . ± .        ResNet50 + MLP
This section reports the results attained using the three approaches mentioned so far in staging patients with COVID-19 into severe and mild classes. The goal is to provide a baseline characterization of the performance achieved by integrating quantitative image data with clinical information using state-of-the-art approaches.

Tables 4 and 5 present the best recognition performance attained by each of the learning methods when the experiments were executed according to the 10-fold and LOCO cross-validation, respectively (see section 3.5 for further details). In the former case, the results are averaged over the 20 repetitions. Furthermore, for the sake of readability we omit the results achieved using 3-fold cross-validation, since they are consistent with those obtained in the 10-fold fashion.

The first two rows in both tables report the performance in discriminating between patients with mild and severe prognosis attained using clinical data only. In this respect, the row denoted by Machine Learning (ML) shows the best performance achieved by the RFECV and the learners described in the last part of section 3.6, whereas the row denoted by Deep Learning (DL) reports the performance returned by the multi-layer perceptron described in section 3.8. In the case of the experiments performed in 10-fold cross-validation (Table 4), the best accuracy, up to 75.7%, is attained by an SVM retaining on average 11 clinical features, with almost balanced sensitivity and specificity. This latter observation is expected, since the a-priori class distribution is not skewed.
We also notice that the use of a deep network is sub-optimal in the classification task based on clinical information alone: this is likely because, in contrast with the image case, pre-training of the network was impossible due to the custom nature of the input data. As a consequence, it is possible that the available number of samples was not sufficient to train the network to optimal performance. The same observations also hold in the case of the experiments performed in the LOCO modality (Table 5), and it is worth noticing the performance drop for both the ML and DL approaches. This can be due to the variation of the data distribution among the centres, limiting the generalization capability of the learners.

In both Tables 4 and 5, the next two sections report the performance attained by the three methods described in section 3 using only the CXR images and merging the images with the clinical data, respectively. The results in the section "CXR images" show that the use of images alone does not achieve the same performance obtained using the clinical data, whatever the method applied (Table 4). Furthermore, the fact that the end-to-end DL attains better results than the hybrid approach suggests that the fully connected portion of the CNN exploits the information provided by the convolutional layers better than a supervised classifier does. In the case of the experiments performed in the LOCO modality, there are still gaps with respect to the results achieved using clinical data only, suggesting that all the learners suffer from the variability induced by the different centres. Turning our attention to the results in the section "Clinical data and CXR images", in the case of the experiments performed in 10-fold cross-validation we notice that the integration of the two sources of information provides some benefits, in some cases improving the classification performance. Indeed, the hybrid approach achieves an accuracy up to 76.9%, using the automatic features computed by the convolutional layers of GoogleNet and an SVM classifier. The end-to-end DL approach only slightly improves the performance with respect to that attained using the images alone, suggesting that an approach fully based on DNNs is not clearly beneficial in this case and needs further investigation. In the case of the experiments run in LOCO mode, we found that the integration of clinical data and CXR images is beneficial, as the largest accuracy reaches 75.2%, with improvements in terms of sensitivity and specificity.
This study originated during the first wave of infection in Italy, in early spring 2020, when thousands of people arrived every day at hospitals. Despite apparently similar conditions, some experienced the infection as a seasonal flu while others rapidly deteriorated, making intensive care necessary. This situation is common worldwide and, to fight the pandemic, in recent months the whole scientific community has carried out relevant research efforts in different fields of knowledge.

Artificial intelligence is one of the scientific disciplines attracting the most attention, offering the possibility to process and extract knowledge and insights from the massive amount of data generated during the pandemic, and it has mostly impacted prediction, diagnosis and treatment. Within this context, large efforts have been directed towards the analysis of radiological images and, according to the analysis presented by Greenspan et al. [11], detection of COVID-19 pneumonia in both CT and CXR [12, 13] is the field towards which most research has been directed. Recently, there has been growing interest in the development of AI models to predict the severity of COVID-19 infections because of the pressure on hospitals, where even during the second pandemic wave we have witnessed an increasing demand for beds in both ordinary wards and intensive care units. The few papers available in this field use CT images, but several guidelines and statements do not encourage the use of CT over CXR [40]; moreover, for several practical reasons CXR imaging is preferred, given the difficulty of moving bedridden patients, the lack of CT machine slots, the risk of cross-infection, etc.

To deal with this issue, here we have presented a multicentre retrospective study in which we collected the CXR examinations and clinical data of 820 patients from 6 Italian hospitals. Furthermore, we have investigated different AI approaches to provide researchers and practitioners with a baseline performance reference to foster further studies.

Figure 6: Clinical feature importance represented by the rate at which each descriptor was selected by the RFECV wrapper during both the 10-fold and LOCO cross-validation experiments using the three classifiers (LGR, SVM and RF series). The DL series represents feature importance estimated as the maximum absolute value of the weights in the first layer of the perceptron of the DL network, after averaging over folds and repetitions and rescaling to the [0,1] interval. Moreover, an "*" or a "+" before a feature name means that it is included in the feature set used to obtain the best results reported in the first section of Table 4 and 5, respectively.

With reference to the population characteristics, we found interesting observations on the age and gender distribution. Women were both fewer and older, suggesting that they become ill less often and suffer from serious conditions at an older age compared with men; mortality among women was also lower, as 72% of the deceased were male, confirming the male mortality reported in China (73%) by Chen et al. [41]. The male-related susceptibility and the higher male mortality rate were also reported by Borges et al. [42], who analysed the data of 59,254 patients from 11 different countries. The second main finding was that 87% of patients had at least one comorbidity (Figure 3), suggesting that, in most cases, the conditions leading to hospitalisation occur in patients with coexisting disorders.
The most common disease (in 45% of cases) was hypertension, confirming the results reported by Yang et al. [43], who meta-analysed the data of 1,576 infected patients from seven studies and reported a hypertension prevalence of 21%.

Let us now turn our attention to the results attained by the AI approaches that process the clinical data only. Using a normalized unitary scale, Figure 6 shows the rate at which each clinical descriptor was included in the feature subset selected by the RFECV wrapper, distinguishing also per classifier used. The figure shows the cumulative results observed running both the 10-fold and LOCO cross-validation experiments; we opted for this cumulative representation since the trend is very similar in both experiments. Furthermore, the figure also marks the set of biomarkers providing the best performance shown in the first section of Tables 4 and 5, denoted by a preceding "*" or "+" for the 10-fold and LOCO cross-validation experiments, respectively. Interestingly, Figure 6 shows that age, LDH and O2 were chosen in every fold for all the classifiers. Using only these three descriptors, the average classification accuracy attained by the learners in 10-fold and LOCO cross-validation is equal to . ± . and . ± . , respectively. Moreover, sex, dyspnoea and WBC were always selected by the wrapper with the SVM and RF, whereas the D-dimer was always selected by the logistic regressor and the SVM. Conversely, heart failure and cough were scarcely selected. Notably, some features such as LDH, D-dimer and SaO2 were selected very frequently despite a high fraction of their values being obtained by imputation (see Table 2). We deem this is mostly related to the strong differences in the distributions of these features between the two classes.

Figure 6 also shows, in dark blue, the feature relevance estimated by the deep learning approach (DL series). In this case the feature relevance was estimated as the maximum, across neurons, of the absolute value of the weights in the first layer of the perceptron. Results were averaged over cross-validation folds and repetitions and rescaled to the [0,1] interval to match the other three series. Comparing these results with those obtained with the RFECV wrapper, it is clear that the only feature with maximum relevance for all approaches is LDH, while sex, dyspnoea, WBC and CRP present a score higher than 0.5 in all series. The impact of the other clinical attributes appears to vary significantly depending on the adopted approach. For example, high values of D-dimer and WBC have been shown to be important risk factors for a negative outcome [44, 45, 46]. Furthermore, D-dimer, WBC and other clinical features like dyspnoea and LDH are indicators of pulmonary compromise, infection, tissue damage [47] and a pro-thrombotic state [48], respectively.
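The DL-series relevance score described above (maximum absolute first-layer weight per input feature, rescaled to [0, 1]) reduces to a few lines; the toy weight matrix below is illustrative only:

```python
import numpy as np

def mlp_feature_relevance(W_first):
    """Relevance of each input feature estimated as the maximum absolute
    first-layer weight across hidden neurons, rescaled to [0, 1] so it can
    be compared with the RFECV selection rates (as done for the DL series
    in Figure 6). W_first: (n_features, n_hidden) weight matrix."""
    r = np.abs(W_first).max(axis=1)
    return (r - r.min()) / (r.max() - r.min() + 1e-12)

# toy first-layer weights: 3 input features, 2 hidden neurons
W = np.array([[0.90, -0.10],    # feature 0: strongly weighted
              [0.20,  0.30],    # feature 1: moderate
              [-0.05, 0.05]])   # feature 2: weak
rel = mlp_feature_relevance(W)
print(rel)   # ≈ [1.0, 0.294, 0.0]
```

In the paper this score is additionally averaged over cross-validation folds and repetitions before rescaling.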
Finally, from our first statistical analysis of the dataset (section 2.1) and from the results shown in Figure 6, patient gender appears to play an important role in classifying patient severity. The reasons behind this difference appear to be related to the stark difference in immune system responses, with females mounting stronger immune responses to pathogens. This difference can be a major contributing factor to viral load, disease severity, and mortality. Furthermore, differences in sex hormone environments could also be a determinant of viral infections, as oestrogen has immune-stimulating effects while testosterone has immune-suppressive effects [49].

With reference to the results attained by the handcrafted approach, we found that the best results in terms of accuracy are statistically significantly lower than those attained by the clinical descriptors.
(a) Handcrafted approach (b) Hybrid approach
Figure 7: Importance of clinical and handcrafted (panel A) or automatically learnt (panel B) features, measured as the rate at which each descriptor was selected by the RFECV wrapper during the 10-fold and LOCO cross-validation experiments, considering all three classifiers employed. The y-axis scale is normalized to one. Moreover, we add an "*" or a "+" before a feature name if it is included in the feature set used to obtain the best handcrafted or hybrid results reported in the last section of Table 4 and 5, respectively.
(a) Handcrafted approach (b) Hybrid approach
Figure 8: Variation of the average classification accuracy (blue bars) with the number of features feeding the RFECV wrapper. The red and green curves show the number of clinical and texture features selected by the RFECV wrapper, respectively. The experiments plotted here refer to the best results shown in Table 4 integrating clinical and imaging features for the handcrafted (panel A) and hybrid (panel B) approaches.

We now analyse how the performance of the handcrafted approach varies with the number of features selected by the coarse step, which feeds the fine selection based on the RFECV method, as described in section 3.6. To this end, Figure 8a reports on the x-axis the number of features input to the RFECV, which ranges from 36 (i.e. 34 clinical plus 2 texture measures) up to 84 (i.e. 34 clinical plus 50 texture measures), plus a last value where the RFECV received all the clinical and all the image features. The bars show the average classification accuracy (y-axis, left side), while the red and green curves show the average number of clinical and handcrafted texture features selected by the RFECV, respectively (y-axis, right side). As already noticed in Table 4, the use of texture measures does not improve the performance attained using the clinical descriptors alone; this is also confirmed by observing that, as the number of input features increases, the wrapper tends to select more imaging biomarkers than clinical ones, and the performance drops. This underlines the importance of using both clinical and imaging biomarkers, since they may provide complementary information: while the former, and especially comorbidities, refer to the functional reserve of the patient, the latter may quantify the actual impact on the lungs. Indeed, fit patients with severe infection and damage are as likely as unfit patients with less severe infections to have a poor prognosis. Although not reported, similar considerations can be derived in the case of LOCO cross-validation, where we noticed that the best performance is attained by an almost balanced number of clinical and imaging features.

(a) Mild class, all neurons (b) Mild class, 40 most selected neurons (c) Severe class, all neurons (d) Severe class, 40 most selected neurons

Figure 9: Two examples of the activation maps provided by the Grad-CAM approach, using all the neurons in the dense layer of the CNN or the 40 neurons selected by the RFECV wrapper.

With reference to the results attained by the hybrid approach on the CXR images only, we found that the best results are statistically significantly lower than those attained by the clinical descriptors in 10-fold cross-validation. Among the three learners used with the hybrid approach, the best results with 10-fold cross-validation are obtained with RF and, arguably, for the LOCO case as well.

Furthermore, as already mentioned, classification accuracy from images alone is better than with the other methods, confirming the well-established finding that CNNs are powerful approaches for image classification. Conversely, a neural-network-based approach struggled to achieve good performance with clinical information as input. The most likely cause for this under-performance is that the clinical information structure is not standard and, therefore, it was impossible to adopt already-tested network models and, more importantly, to pre-train the network on other datasets. It is likely that further fine-tuning of the design and training procedure of the custom multi-layer perceptron adopted for clinical-information classification could further improve results, both with this specific input source and for the combined model.
A similar improvement is expected with an increase in the size of the available dataset, as this section of the network did not undergo any pre-training, as mentioned above.

This study also has some limitations. First, patient enrolment was not globally randomized but was instead conducted to populate the two classes with a roughly homogeneous number of cases. This implies that the training data do not reflect the true a-priori probabilities of the target classes. On the other hand, sampling within the mild and severe classes is unbiased, because patients were randomly enrolled. Although this may bias the estimate of classification accuracy, there exist methods for adjusting the outputs of a trained classifier with respect to different prior probabilities without retraining the model, even when these probabilities are not known in advance [51].

A further limitation, but from another point of view a key feature of the study, is the lack of full standardization of the images and clinical data in the dataset. The dataset was built during spring 2020, when Italy was under lockdown and Italian hospitals and doctors were under pressure due to the number of patients requiring hospitalization. Under these circumstances, full standardization of clinical data collection and image acquisition could not be achieved, and we decided to collect CXR images gathered under any conditions, together with the clinical data most commonly acquired at the time of the patients' hospitalization. This led to a dataset that reflects these circumstances, with many missing values among the clinical data and images acquired with non-standardized clinical protocols (i.e. patient position and breath holding) and various devices. Although on the one hand this may represent a limitation, on the other it may be an advantage, because this dataset can challenge the AI community with real data collected under critical circumstances. Another limitation may be the ever-changing landscape of the pandemic. Compared to the first wave, in many countries, and especially in Europe, the second wave has been characterised by younger patients with early symptoms in the emergency department. This may suggest periodically re-training the learners to follow the disease evolution, or investigating methods able to cope with concept drift [52].
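The prior-adjustment idea mentioned above, in the simple case where the deployment priors are known, amounts to reweighting the classifier's posteriors by the ratio of new to training priors and renormalizing (methods such as [51] also cover estimating unknown priors iteratively); a minimal sketch with illustrative numbers:

```python
import numpy as np

def adjust_priors(posteriors, train_priors, new_priors):
    """Correct a trained classifier's posteriors for deployment-time class
    priors that differ from the (roughly balanced) training priors, without
    retraining: reweight by the new/train prior ratio and renormalize."""
    w = np.asarray(new_priors) / np.asarray(train_priors)
    adjusted = posteriors * w
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# trained on ~50/50 mild/severe, deployed where severe is rarer (20%)
p = np.array([[0.45, 0.55]])                       # raw posterior for one patient
p_adj = adjust_priors(p, [0.5, 0.5], [0.8, 0.2])   # ≈ [[0.766, 0.234]]
```

Note how a borderline "severe" prediction flips to "mild" once the rarer deployment prevalence of the severe class is accounted for.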
In this preliminary analysis, the use of image-derived data provides a limited improvement in predictive performance with respect to the use of clinical data alone. The analysis of clinical data, instead, showed that a number of measures have robust predictive potential. Clinical data such as Age, LDH, O₂, Dyspnea, Sex, WBC, D-dimer and SaO₂ are consistently selected across the different validation conditions and classifiers tested in this work, representing a set of biomarkers that can have an impact in clinical practice, helping physicians and care managers plan bed allocation.

The poor standardization of images in the dataset could be a possible cause of the results attained here, as it has led to a classification problem that is hard for the tested approaches to address. Indeed, beyond the variability introduced by non-standardized acquisition conditions, such as patient position and imaging device, the various medical devices, metal objects and other artefacts (e.g. pacemakers, catheters, prostheses, etc.) that can be observed within the field of view are additional sources of difficulty for the learners. To further explore the dataset, this suggests using methods that can manage such variability, for instance by disregarding those images not meeting some quality criterion that can be learnt in parallel with the classification task. With reference to the approaches investigated here, deepening how data augmentation impacts network training, and performing ablation studies on the hybrid approach as well as on network sizes for the end-to-end DL procedure, are future directions of investigation.
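The consistency of such biomarker selection can be probed with a simple subsampling experiment. The sketch below uses synthetic data and a plain correlation-based ranking, both of which are illustrative assumptions rather than the selection procedures used in the study, and counts how often each variable lands in the top ranks across random subsamples:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy cohort: 200 patients, 8 clinical variables; only the first three
# actually carry signal about the binary severe/mild outcome
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2]
     + rng.normal(scale=0.5, size=200) > 0).astype(int)

def top_k_by_correlation(X, y, k=3):
    """Rank features by absolute correlation with the outcome."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return set(np.argsort(scores)[-k:])

# selection frequency across random 80% subsamples: stable markers
# (here features 0-2) should be chosen in nearly every repetition
counts = np.zeros(X.shape[1])
for _ in range(50):
    idx = rng.choice(len(y), size=160, replace=False)
    for j in top_k_by_correlation(X[idx], y[idx]):
        counts[j] += 1
```

Variables whose selection frequency stays high under resampling, as Age, LDH or D-dimer did across validation conditions here, are the ones worth reporting as candidate biomarkers.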
Furthermore, to improve the quality of the DNNs, we deem that joint learning could be another direction of investigation, enabling the possibility to extract correlated information across clinical and imaging data, to be used to enforce the network weights to be shared across these networks.

In conclusion, the dataset presented here is unique, offering a large number of CXRs for prognostic purposes, placing itself side by side with similar efforts that use CT images [18]. While this repository lets the machine learning community challenge their methods with poorly standardized data, the effort to collect a large repository cannot be afforded by that community alone, calling for the collaboration of researchers from different backgrounds, clinicians, and institutes. Furthermore, the quantitative results reported offer a preliminary evaluation of the prognostic performance attainable using AI approaches, spanning from the use of handcrafted image descriptors to a fully automatic approach based on DNNs. The use of AI in this domain can open the chance to develop fast and low-cost clinical protocols.

Data availability
The dataset generated and analysed in this study is publicly available to members of the scientific community upon request at aiforcovid.radiomica.it. Beyond that, we encourage other hospitals and clinical centres to join the network to share their data; in this case, contacts for data sharing are also available on the website. As mentioned in section 2, the dataset contains the CXR images, the clinical data listed in Table 2, the labels, the blind association between each image and the acquisition centre, and the acquisition information. The manual segmentation masks mentioned in section 3.6 are not publicly available at the time this manuscript is submitted; they will be added later on.
Acknowledgments
The authors wish to thank Amazon Web Services (AWS) and the AWS Diagnostic Development Initiative for the support in putting in place the data management infrastructure.
References

[1] Simone Schiaffino, Stefania Tritella, Andrea Cozzi, Serena Carriero, Lorenzo Blandi, Laurenzia Ferraris, and Francesco Sardanelli. Diagnostic performance of chest X-ray for COVID-19 pneumonia during the SARS-CoV-2 pandemic in Lombardy, Italy. Journal of Thoracic Imaging, 35(4):W105–W106, 2020.
[2] Tao Ai, Zhenlu Yang, Hongyan Hou, Chenao Zhan, Chong Chen, Wenzhi Lv, Qian Tao, Ziyong Sun, and Liming Xia. Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology, page 200642, 2020.
[3] Hongzhou Lu, Charles W Stratton, and Yi-Wei Tang. Outbreak of pneumonia of unknown etiology in Wuhan, China: the mystery and the miracle. Journal of Medical Virology, 92(4):401–402, 2020.
[4] American College of Radiology. ACR recommendations for the use of chest radiography and computed tomography (CT) for suspected COVID-19 infection. Online; accessed November 30, 2020.
[5] Sergio Giuseppe Vancheri, Giovanni Savietto, Francesco Ballati, Alessia Maggi, Costanza Canino, Chandra Bortolotto, Adele Valentini, Roberto Dore, Giulia Maria Stella, Angelo Guido Corsico, et al. Radiographic findings in 240 patients with COVID-19 pneumonia: time-dependence after the onset of symptoms. European Radiology, page 1, 2020.
[6] Moreno Zanardo, Simone Schiaffino, and Francesco Sardanelli. Bringing radiology to patient's home using mobile equipment: a weapon to fight COVID-19 pandemic. Clinical Imaging, 68:99–101, 2020.
[7] Chang Liu, Yu Cao, Marlon Alcantara, Benyuan Liu, Maria Brunette, Jesus Peinado, and Walter Curioso. TX-CNN: detecting tuberculosis in chest X-ray images using convolutional neural network. In , pages 2314–2318. IEEE, 2017.
[8] Fengqi Yan, Xin Huang, Yao Yao, Mingming Lu, and Maozhen Li. Combining LSTM and DenseNet for automatic annotation and classification of chest X-ray images. IEEE Access, 7:74181–74189, 2019.
[9] Radiological Society of North America. RSNA Pneumonia Detection Challenge (2018). Online; accessed 15 November 2020.
[10] Xiuyuan Xu, Quan Guo, Jixiang Guo, and Zhang Yi. DeepCXray: automatically diagnosing diseases on chest X-rays using deep neural networks. IEEE Access, 6:66972–66983, 2018.
[11] Hayit Greenspan, Raúl San José Estépar, Wiro J Niessen, Eliot Siegel, and Mads Nielsen. Position paper on COVID-19 imaging and AI: from the clinical needs and technological challenges to initial AI solutions at the lab and national level towards a new era for AI in healthcare. Medical Image Analysis, 66:101800, 2020.
[12] Kang Zhang, Xiaohong Liu, Jun Shen, Zhihuan Li, Ye Sang, Xingwang Wu, Yunfei Zha, Wenhua Liang, Chengdi Wang, Ke Wang, et al. Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell, 2020.
[13] Shervin Minaee, Rahele Kafieh, Milan Sonka, Shakib Yazdani, and Ghazaleh Jamalipour Soufi. Deep-COVID: predicting COVID-19 from chest X-ray images using deep transfer learning. Medical Image Analysis, 2020.
[14] Ophir Gozes, Maayan Frid-Adar, Hayit Greenspan, Patrick D Browning, Huangqi Zhang, Wenbin Ji, Adam Bernheim, and Eliot Siegel. Rapid AI development cycle for the coronavirus (COVID-19) pandemic: initial results for automated detection & patient monitoring using deep learning CT image analysis. arXiv preprint arXiv:2003.05037, 2020.
[15] Laure Wynants, Ben Van Calster, Marc MJ Bonten, Gary S Collins, Thomas PA Debray, Maarten De Vos, Maria C Haller, Georg Heinze, Karel GM Moons, Richard D Riley, et al. Prediction models for diagnosis and prognosis of COVID-19 infection: systematic review and critical appraisal. British Medical Journal, 369, 2020.
[16] Karel GM Moons, Robert F Wolff, Richard D Riley, Penny F Whiting, Marie Westwood, Gary S Collins, Johannes B Reitsma, Jos Kleijnen, and Sue Mallett. PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration. Annals of Internal Medicine, 170(1):W1–W33, 2019.
[17] Hongmei Yue, Qian Yu, Chuan Liu, Yifei Huang, Zicheng Jiang, Chuxiao Shao, Hongguang Zhang, Baoyi Ma, Yuancheng Wang, Guanghang Xie, et al. Machine learning-based CT radiomics method for predicting hospital stay in patients with pneumonia associated with SARS-CoV-2 infection: a multicenter study. Annals of Translational Medicine, 8(14), 2020.
[18] Guillaume Chassagnon, Maria Vakalopoulou, Enzo Battistella, Stergios Christodoulidis, Trieu-Nghi Hoang-Thi, Severine Dangeard, Eric Deutsch, Fabrice Andre, Enora Guillo, Nara Halm, et al. AI-driven quantification, staging and outcome prediction of COVID-19 pneumonia. Medical Image Analysis, page 101860, 2020.
[19] Artuur M Leeuwenberg and Ewoud Schuit. Prediction models for COVID-19 clinical decision making. The Lancet Digital Health, 2(10):e496–e497, 2020.
[20] Xiaobo Yang, Yuan Yu, Jiqian Xu, Huaqing Shu, Hong Liu, Yongran Wu, Lu Zhang, Zhui Yu, Minghao Fang, Ting Yu, et al. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study. The Lancet Respiratory Medicine, 2020.
[21] Imlab-UIIP. Lung Segmentation (2D). https://github.com/imlab-uiip/lung-segmentation-2d. Online; accessed 19 October 2020.
[22] Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery, 4(6):475, 2014.
[23] Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Komatsu, Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules. American Journal of Roentgenology, 174(1):71–74, 2000.
[24] Brian Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9:e87357, 2014.
[25] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
[26] Andrea Arcuri and Gordon Fraser. Parameter tuning or default values? An empirical investigation in search-based software engineering. Empirical Software Engineering, 18(3):594–623, 2013.
[27] William D Penny, Karl J Friston, John T Ashburner, Stefan J Kiebel, and Thomas E Nichols. Statistical Parametric Mapping: The Analysis of Functional Brain Images. Elsevier, 2011.
[28] Robert M Haralick, Karthikeyan Shanmugam, and Its'Hak Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, (6):610–621, 1973.
[29] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
[30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[32] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
[33] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[34] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[35] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.
[36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[37] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[38] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[39] Paul Mooney. Chest X-ray images (Pneumonia). 2017. Online; accessed 16 October 2020.
[40] GD Rubin, CJ Ryerson, LB Haramati, N Sverzellati, JP Kanne, et al. The role of chest imaging in patient management during the COVID-19 pandemic: a multinational consensus statement from the Fleischner Society. Chest, 2020.
[41] Tao Chen, Di Wu, Huilong Chen, Weiming Yan, Danlei Yang, Guang Chen, Ke Ma, Dong Xu, Haijing Yu, Hongwu Wang, et al. Clinical characteristics of 113 deceased patients with coronavirus disease 2019: retrospective study. BMJ, 368, 2020.
[42] Israel Júnior Borges do Nascimento, Nensi Cacic, Hebatullah Mohamed Abdulazeem, Thilo Caspar von Groote, Umesh Jayarajah, Ishanka Weerasekara, Meisam Abdar Esfahani, Vinicius Tassoni Civile, Ana Marusic, Ana Jeroncic, et al. Novel coronavirus infection (COVID-19) in humans: a scoping review and meta-analysis. Journal of Clinical Medicine, 9(4):941, 2020.
[43] Jing Yang, Ya Zheng, Xi Gou, Ke Pu, Zhaofeng Chen, Qinghong Guo, Rui Ji, Haojia Wang, Yuping Wang, and Yongning Zhou. Prevalence of comorbidities and its effects in patients infected with SARS-CoV-2: a systematic review and meta-analysis. International Journal of Infectious Diseases, 94:91–95, 2020.
[44] Litao Zhang, Xinsheng Yan, Qingkun Fan, Haiyan Liu, Xintian Liu, Zejin Liu, and Zhenlu Zhang. D-dimer levels on admission to predict in-hospital mortality in patients with COVID-19. Journal of Thrombosis and Haemostasis, 18(6):1324–1329, 2020.
[45] Christopher M Petrilli, Simon A Jones, Jie Yang, Harish Rajagopalan, Luke O'Donnell, Yelena Chernyak, Katie A Tobin, Robert J Cerfolio, Fritz Francois, and Leora I Horwitz. Factors associated with hospital admission and critical illness among 5279 people with coronavirus disease 2019 in New York City: prospective cohort study. BMJ, 369, 2020.
[46] Brandon Michael Henry, Maria Helena Santos De Oliveira, Stefanie Benoit, Mario Plebani, and Giuseppe Lippi. Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (COVID-19): a meta-analysis. Clinical Chemistry and Laboratory Medicine (CCLM), 58(7):1021–1028, 2020.
[47] Chang Li, Jianfang Ye, Qijian Chen, Weihua Hu, Lingling Wang, Yameng Fan, Zhanjin Lu, Jie Chen, Zaishu Chen, Shiyan Chen, et al. Elevated lactate dehydrogenase (LDH) level as an independent risk factor for the severity and mortality of COVID-19. Aging (Albany NY), 12(15):15670, 2020.
[48] Leonard Naymagon, Nicole Zubizarreta, Jonathan Feld, Maaike van Gerwen, Mathilda Alsen, Santiago Thibaud, Alaina Kessler, Sangeetha Venugopal, Iman Makki, Qian Qin, et al. Admission D-dimer levels, D-dimer trends, and outcomes in COVID-19. Thrombosis Research, 196:99–105, 2020.
[49] Ajay Pradhan and Per-Erik Olsson. Sex differences in severity and mortality from COVID-19: are males more vulnerable? Biology of Sex Differences, 11(1):1–11, 2020.
[50] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
[51] Patrice Latinne, Marco Saerens, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities may significantly improve classification accuracy: evidence from a multi-class problem in remote sensing. In ICML, volume 1, pages 298–305, 2001.
[52] Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang. Learning under concept drift: a review. IEEE Transactions on Knowledge and Data Engineering, 31(12):2346–2363, 2018.
A First- and second-order features
We present in Tables A.1 and A.2 the formal definition of the first- and second-order features introduced in section 3.6. The content of these tables adopts the following notation:

• $X$ is the set of $N_p$ pixels in a ROI;
• $S(i)$ is the first-order histogram of the ROI using $N_g$ discrete intensity levels, equally spaced from 0 with a defined width of 0.1;
• $s(i) = S(i)/N_p$ is the normalized first-order histogram;
• $V_{pixel}$ is the volume of a pixel in mm$^3$;
• $X_p$ is the $p$th percentile of $X$ (e.g. $X_{10}$, $X_{25}$, $X_{75}$, $X_{90}$);
• $X_{10-90}$ is the image array with gray levels in between, or equal to, the 10th and 90th percentiles of $X$;
• $\bar{X}$ is the mean value of the image array;
• $P(i,j)$ is the co-occurrence matrix with a defined distance ($\delta = 1$) and angle ($\theta = 0$);
• $p(i,j) = P(i,j)/\sum P(i,j)$ is the normalized co-occurrence matrix;
• $p_x(i) = \sum_{j=1}^{N_g} p(i,j)$ and $p_y(j) = \sum_{i=1}^{N_g} p(i,j)$ are the marginal probabilities per row and per column, respectively;
• $\mu_x$ and $\mu_y$ are the mean grey-level intensities, defined as the Joint Average, of $p_x$ and $p_y$ respectively. If $P(i,j)$ is symmetrical, $p_x = p_y$;
• $\sigma_x$ and $\sigma_y$ are the standard deviations of $p_x$ and $p_y$ respectively;
• $p_{x+y}(k) = \sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)$, where $i + j = k$, and $k = 2, 3, \dots, 2N_g$;
• $p_{x-y}(k) = \sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)$, where $|i - j| = k$, and $k = 0, 1, \dots, N_g - 1$;
• $HX$, $HY$ and $HXY$ are the entropies of $p_x$, $p_y$ and $p(i,j)$, respectively;
• $HXY1 = -\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)\log[p_x(i)\,p_y(j)]$ is an auxiliary quantity;
• $HXY2 = -\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p_x(i)\,p_y(j)\log[p_x(i)\,p_y(j)]$ is an auxiliary quantity;
• $DA$ is the Difference Average, used to obtain the Difference Variance.
Feature: Definition
Energy: $\sum_{i=1}^{N_p} X(i)^2$
Total Energy: $V_{pixel} \cdot \sum_{i=1}^{N_p} X(i)^2$
Entropy: $-\sum_{i=1}^{N_g} s(i)\log[s(i)]$, for $s(i) > 0$
Minimum: $\min(X)$
Maximum: $\max(X)$
Mean: $\frac{1}{N_p}\sum_{i=1}^{N_p} X(i)$
Median: median grey-level intensity
Interquartile Range: $X_{75} - X_{25}$
Range: $\max(X) - \min(X)$
Mean Absolute Deviation: $\frac{1}{N_p}\sum_{i=1}^{N_p} |X(i) - \bar{X}|$
Robust Mean Absolute Deviation: $\frac{1}{N_{10-90}}\sum_{i=1}^{N_{10-90}} |X_{10-90}(i) - \bar{X}_{10-90}|$
Root Mean Squared: $\sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p} X(i)^2}$
Skewness: $\frac{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\bar{X})^3}{\left(\sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\bar{X})^2}\right)^3}$
Kurtosis: $\frac{\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\bar{X})^4}{\left(\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\bar{X})^2\right)^2}$
Variance: $\frac{1}{N_p}\sum_{i=1}^{N_p}(X(i)-\bar{X})^2$
Uniformity: $\sum_{i=1}^{N_g} s(i)^2$

Table A.1: Definition of the first-order statistical measures.
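As a sketch of how the first-order measures map to code, the fragment below computes a few of them for a 1-D array of ROI intensities. It is an illustrative implementation only: base-2 logarithms are assumed for the entropy, and the bin width of 0.1 follows the notation above.

```python
import numpy as np

def first_order_features(roi, bin_width=0.1):
    """Compute a subset of the first-order measures defined in Table A.1
    for a flat array of ROI pixel values (illustrative sketch)."""
    x = np.asarray(roi, dtype=float)
    n = x.size
    # first-order histogram with bins of width 0.1 starting at 0
    edges = np.arange(0, x.max() + bin_width, bin_width)
    s = np.histogram(x, bins=edges)[0] / n      # normalized histogram s(i)
    s = s[s > 0]                                # entropy is taken over s(i) > 0
    return {
        "energy": np.sum(x ** 2),
        "entropy": -np.sum(s * np.log2(s)),     # base-2 log assumed here
        "mean": x.mean(),
        "range": x.max() - x.min(),
        "mad": np.mean(np.abs(x - x.mean())),   # mean absolute deviation
        "uniformity": np.sum(s ** 2),
    }

feats = first_order_features(np.array([0.05, 0.15, 0.15, 0.25]))
```

The remaining rows of the table (skewness, kurtosis, robust MAD, and so on) follow the same pattern of plain NumPy reductions over the pixel array or the normalized histogram.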
Feature: Definition
Sum Squares: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} (i - \mu_x)^2 \, p(i,j)$
Sum Entropy: $-\sum_{k=2}^{2N_g} p_{x+y}(k)\log[p_{x+y}(k)]$, for $p_{x+y}(k) > 0$
Sum Average: $\sum_{k=2}^{2N_g} p_{x+y}(k)\, k$
MCC (Maximal Correlation Coefficient): $\sqrt{\text{second largest eigenvalue of } Q}$, where $Q(i,j) = \sum_{k=1}^{N_g} \frac{p(i,k)\,p(j,k)}{p_x(i)\,p_y(k)}$
Maximum Probability: $\max(p(i,j))$
Joint Entropy: $-\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)\log[p(i,j)]$, for $p(i,j) > 0$
Joint Energy: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)^2$
Joint Average: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)\, i$
Inverse Variance: $\sum_{k=1}^{N_g-1} \frac{p_{x-y}(k)}{k^2}$
IMC (Informational Measure of Correlation) 2: $\sqrt{1 - e^{-2(HXY2 - HXY)}}$
IMC (Informational Measure of Correlation) 1: $\frac{HXY - HXY1}{\max\{HX, HY\}}$
IDN (Inverse Difference Normalized): $\sum_{k=0}^{N_g-1} \frac{p_{x-y}(k)}{1 + k/N_g}$
IDM (Inverse Difference Moment): $\sum_{k=0}^{N_g-1} \frac{p_{x-y}(k)}{1 + k^2}$
ID (Inverse Difference): $\sum_{k=0}^{N_g-1} \frac{p_{x-y}(k)}{1 + k}$
Difference Variance: $\sum_{k=0}^{N_g-1} (k - DA)^2 \, p_{x-y}(k)$
Difference Entropy: $-\sum_{k=0}^{N_g-1} p_{x-y}(k)\log[p_{x-y}(k)]$, for $p_{x-y}(k) > 0$
Difference Average: $\sum_{k=0}^{N_g-1} k \, p_{x-y}(k)$
Correlation: $\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)\, i\, j - \mu_x \mu_y}{\sigma_x \sigma_y}$
Contrast: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} (i - j)^2 \, p(i,j)$
Cluster Tendency: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} (i + j - \mu_x - \mu_y)^2 \, p(i,j)$
Cluster Shade: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} (i + j - \mu_x - \mu_y)^3 \, p(i,j)$
Cluster Prominence: $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} (i + j - \mu_x - \mu_y)^4 \, p(i,j)$

Table A.2: Definition of the second-order statistical measures.
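All of the second-order features above are derived from the normalized co-occurrence matrix $p(i,j)$. A minimal sketch for one offset (distance 1, angle 0, i.e. the right-hand neighbour) and the Contrast feature follows; note it uses 0-based gray levels, whereas the table indexes levels from 1.

```python
import numpy as np

def glcm(img, levels, delta=(0, 1)):
    """Normalized gray-level co-occurrence matrix p(i, j) for a single
    offset; delta=(0, 1) pairs each pixel with its right-hand neighbour,
    matching distance 1 and angle 0 in the notation above."""
    P = np.zeros((levels, levels))
    rows, cols = img.shape
    dr, dc = delta
    for r in range(rows - dr):
        for c in range(cols - dc):
            P[img[r, c], img[r + dr, c + dc]] += 1  # count the (i, j) pair
    return P / P.sum()                              # normalize to p(i, j)

def contrast(p):
    """Haralick contrast: sum over (i - j)^2 * p(i, j)."""
    i, j = np.indices(p.shape)
    return np.sum((i - j) ** 2 * p)

img = np.array([[0, 0, 1],
                [0, 1, 1],
                [2, 2, 2]], dtype=int)
p = glcm(img, levels=3)
```

The other Table A.2 rows reduce to further weighted sums over `p` or over the marginal and diagonal distributions $p_x$, $p_y$, $p_{x+y}$, $p_{x-y}$ built from it.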