Detecting ulcerative colitis from colon samples using efficient feature selection and machine learning

Abstract

Ulcerative colitis (UC) is one of the most common forms of inflammatory bowel disease (IBD) characterized by inflammation of the mucosal layer of the colon. Diagnosis of UC is based on clinical symptoms, and then confirmed based on endoscopic, histologic and laboratory findings. Feature selection and machine learning have been previously used for creating models to facilitate the diagnosis of certain diseases. In this work, we used a recently developed feature selection algorithm (DRPT) combined with a support vector machine (SVM) classifier to generate a model to discriminate between healthy subjects and subjects with UC based on the expression values of 32 genes in colon samples. We validated our model with an independent gene expression dataset of colonic samples from subjects in active and inactive periods of UC. Our model perfectly detected all active cases and had an average precision of 0.62 in the inactive cases. Compared with results reported in previous studies and a model generated by a recently published software for biomarker discovery using machine learning (BioDiscML), our final model for detecting UC shows better performance in terms of average precision.


Hanieh Marvi Khorasani, Hamid Usefi, and Lourdes Peña-Castillo
Department of Computer Science, Memorial University, St. John's, NL, A1B3X5, Canada
Department of Mathematics and Statistics, Memorial University, St. John's, NL, A1C5S7, Canada
* usefi@mun.ca, [email protected]


Introduction

Inflammatory bowel disease (IBD) is a chronic inflammatory condition of the gut with an increasing health burden. Ulcerative colitis (UC) and Crohn's disease are the two most common forms of chronic IBD, with UC being more widespread than Crohn's disease. There is no cure for UC, and people with the disease alternate between periods of remission (inactive) and active inflammation. The underlying causes of UC are not yet completely understood, but the disease is thought to arise from a combination of genetic, environmental and psychological factors that disrupt the microbial ecosystem of the colon. Genome-wide association studies (GWAS) have identified 240 risk loci for IBD and 47 risk loci specifically associated with UC. However, the lower concordance rate in identical twins of 15% in UC compared with 30% in Crohn's disease indicates that the genetic contribution in UC is weaker than in Crohn's disease. Thus, using gene expression data for disease diagnosis might be more appropriate for UC than using GWAS data, as has been done for Crohn's disease.

Several features are used for the clinical diagnosis of UC, including patient symptoms and laboratory, endoscopic and histological findings. Boland et al. carried out a proof-of-concept study on using gene expression measurements from colon samples as a tool for clinical decision support in the treatment of UC. The purpose of their study was to discriminate between active and inactive UC cases. Even though they considered the expression of only eight inflammatory genes instead of assessing the discriminatory power of many groups of genes, they concluded that mRNA analysis in UC is a feasible approach to measure quantitative response to therapy.

Machine learning-based models have considerable potential to be incorporated into clinical practice, especially in the area of medical image analysis. Supervised machine learning has already proved useful in disease diagnosis and prognosis as well as in personalized medicine. In IBD, machine learning has been used to classify paediatric IBD patients using endoscopic and histological data, to distinguish UC colonic samples from control and Crohn's disease colonic samples, and to discriminate between healthy subjects, UC patients, and Crohn's disease patients. Here we apply a machine learning classifier to gene expression data to generate a model that differentiates UC cases from controls. Unlike previous studies, we combined a number of independent gene expression datasets instead of using a single dataset to train our model and, by using feature selection, we identified 32 genes out of the thousands of genes for which expression measurements were available. The expression values of these 32 genes are sufficient to generate an SVM model that effectively discriminates between UC cases and controls. Our proposed model perfectly detected all active cases and had an average precision of 0.62 in the inactive cases.

arXiv [q-bio.QM]

Methods

Data gathering

We searched the NCBI Gene Expression Omnibus database (GEO) for expression profiling studies using colonic samples from UC subjects (in active and inactive state) and controls (healthy donors). We identified five datasets (accession numbers GSE1152, GSE11223, GSE22619, GSE75214 and GSE9452). As both healthy and Crohn's disease subjects were used as controls in GSE9452, this dataset was excluded from our study. We used three of the datasets for model selection using 5-fold cross-validation, and left one dataset for independent validation (Table 1). We partitioned the validation dataset into two datasets: active UC vs controls, and inactive UC vs controls.

Accession | Controls | UC | Tissue | Platform | Genes | Use
GSE11223 | 24 | 25 | Biopsies from uninflamed sigmoid colon | Agilent-012391 Whole Human Genome Oligo Microarray G4112A | 18,626 | Model selection
GSE22619 | 10 | 10 | Mucosal colonic tissue from discordant twins | Affymetrix Human Genome U133 Plus 2.0 Array | 22,189 | Model selection
GSE75214-active | 11 | 74 | Mucosal colonic biopsies from active UC patients and from controls | Affymetrix Human Gene 1.0 ST Array | 20,358 | Model evaluation
GSE75214-inactive | 11 | 23 | Mucosal colonic biopsies from inactive UC patients and from controls | Affymetrix Human Gene 1.0 ST Array | 20,358 | Model evaluation

Table 1. Summary of datasets used in this study.

For each dataset, GEO2R was used to retrieve the mapping between probe IDs and gene symbols. Probe IDs without a gene mapping were removed from further processing. Expression values for the mapped probe IDs were obtained using the Python package GEOparse. The expression values obtained were as provided by the corresponding authors.

Data Pre-processing

We performed the following steps for data pre-processing: (i) calculating expression values per gene by taking the average of the expression values of all probes mapped to the same gene; (ii) handling missing values with the K-Nearest Neighbours (KNN) imputation method (KNNImputer) from the "missingpy" library in Python. KNNImputer uses KNN to fill in missing values by utilizing the values from the nearest neighbours. We set the number of neighbours to 2 (n_neighbors=2) and used uniform weights.

To get our final training dataset we merged datasets GSE1152, GSE11223, and GSE22619 by taking the genes present in all of them. The merged dataset has 39 UC samples and 38 controls, and 16,313 genes. These same genes were selected from GSE75214 for validation. As the ranges of expression values differed across datasets, we normalized the expression values of the final merged dataset and of the validation dataset by calculating Z-scores per sample.

Model Generation

To create a model to discriminate UC patients from healthy subjects, we selected the features (genes) using the dimension reduction through perturbation theory (DRPT) feature selection method. Let $D = [A \mid b]$ be a dataset where $b$ is the class label and $A$ is an $m \times n$ matrix with $n$ columns (genes) and $m$ rows (samples). Only a limited number of genes are associated with the disease, and as such, a majority of genes are considered irrelevant. DRPT considers the solution $x$ of the linear system $Ax = b$ with the smallest 2-norm. Hence, $b = \sum_{i=1}^{n} x_i F_i$, where $F_i$ is the $i$-th column of $A$. Each component $x_i$ of $x$ is then viewed as a weight assigned to the feature $F_i$, so the bigger $|x_i|$ is, the more important $F_i$ is in connection with $b$. DRPT then filters out features whose weights are very small compared to the average of the local maximums over the $|x_i|$'s. After removing irrelevant features, DRPT uses perturbation theory to detect correlations between genes of the reduced dataset. Finally, the remaining genes are sorted based on their entropy.

Selected features were assessed using 5-fold cross-validation and support vector machines (SVMs) as the classifier. First, we performed DRPT 100 times on the training dataset to generate 100 subsets of features. Then, to find the best subsets, we performed 3 repetitions of stratified 5-fold cross-validation (CV) on the training dataset. We utilized average precision (AP), as calculated by the function average_precision_score from the Python library scikit-learn (version 0.22.1), as the evaluation metric to determine the best subsets of genes among those 100 generated subsets. The four subsets with the highest mean AP over the cross-validation folds were chosen for creating the candidate models. For each of the four selected subsets of features, we created a candidate SVM model using all samples in the training dataset.
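The weighting step of DRPT described above can be illustrated with a short sketch. It computes the minimum 2-norm solution of $Ax = b$ with NumPy and ranks features by $|x_i|$. This is not the authors' DRPT implementation: the data are synthetic, the function names are hypothetical, and the fixed top-k cut is a stand-in assumption for DRPT's own filtering rule (the comparison with the average of local maximums, the perturbation step and the entropy sort are not reproduced).

```python
import numpy as np


def min_norm_weights(A, b):
    """Return the minimum 2-norm solution x of A x = b.

    np.linalg.lstsq returns the minimum-norm least-squares solution,
    which is what we want when there are fewer samples than genes.
    """
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x


def rank_features(A, b, keep=5):
    """Rank features by |x_i| and keep the `keep` largest (hypothetical cut)."""
    x = min_norm_weights(A, b)
    order = np.argsort(-np.abs(x))  # indices sorted by decreasing |x_i|
    return order[:keep], x


# Synthetic stand-in for an expression matrix: m samples, n genes, m < n.
rng = np.random.default_rng(0)
m, n = 20, 200
A = rng.normal(size=(m, n))
b = np.sign(A[:, 3] + A[:, 17])  # hypothetical class labels in {-1, +1}

top, x = rank_features(A, b)
print("largest-weight feature indices:", top)
```

Because the system is underdetermined here, infinitely many vectors satisfy $Ax = b$; picking the one with the smallest 2-norm is what makes the weights well defined.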
To generate the models, we used the SVM implementation available in the function SVC with parameter kernel='linear' from the Python library scikit-learn. To evaluate the prediction performance of each candidate model, we validated it on the GSE75214-active and GSE75214-inactive datasets. In this step, we utilized the precision-recall curve (PRC) to assess the performance of the candidate models on unseen data. An additional candidate model was created using the most frequently selected genes.

BioDiscML

BioDiscML is a biomarker discovery software that uses machine learning methods to analyze biological datasets. To compare the prediction performance of our models with BioDiscML, we ran the software on our training dataset. Two thirds of the samples (N=52) were utilized for training and the remaining third (N=25) for testing. Since the software generates thousands of models, and we required only one model, we specified the number of best models as 1 in the config file (numberOfBestModels=1). The best model was selected based on the 10-fold cross-validated Area Under the Precision-Recall Curve on the training set (numberOfBestModelsSortingMetric=TRAIN-10CV-AUPRC). We used Weka 3.8 to evaluate the performance of the model generated by BioDiscML on the GSE75214-active and GSE75214-inactive datasets. The features selected by BioDiscML are C3orf36, ADAM30, SLC6A3, FEZF2, and GCNT3. To be able to use the model in Weka, we loaded the training dataset as created by BioDiscML, which is one of the outputs of the software. This dataset has six features, including the selected genes and the class label, and 52 samples. We also modified our validation datasets by extracting the features selected by BioDiscML. After loading the training and test datasets in Weka Explorer, we loaded the model and entered the classifier configuration as "weka.classifiers.misc.InputMappedClassifier -I -trim -W weka.classifiers.trees.RandomTree -- -K 3 -M 1.0 -V 0.001 -S 1", which is the classifier setup in the model generated by BioDiscML.

Results

Feature selection significantly reduced the number of genes required to construct a classification model

We performed DRPT 100 times on the training dataset to select 100 subsets of features. Then we performed 5-fold cross-validation to find the subsets with the highest mean average precision (AP) over the folds. The AP of the 100 subsets ranges between 0.82 and 0.97, with an average of 0.[…] ± […].03. Table 2 shows the ten subsets with the highest cross-validated AP and the number of selected features (genes) in each subset. On average, DRPT selected 37.[…] ± […].84 genes per subset.

Subset | AP | # of features
Subset 10 | 0.97 | 42
Subset 51 | 0.97 | 47
Subset 58 | 0.97 | 32
Subset 83 | 0.97 | 39
Subset 5 | 0.96 | 37
Subset 16 | 0.96 | 30
Subset 33 | 0.96 | 27
Subset 55 | 0.96 | 22
Subset 62 | 0.96 | 46
Subset 74 | 0.96 | 50

Table 2. Ten top subsets of genes with the highest cross-validated average AP.
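The ranking in Table 2 rests on cross-validated AP. The sketch below shows one way such a score could be computed with scikit-learn: 3 repetitions of stratified 5-fold cross-validation, a linear-kernel SVC, and average_precision_score, as named in the text. Several details are assumptions rather than the authors' code: the data are synthetic (only the 39/38 class sizes come from the text), the subset names and the helper cv_mean_ap are hypothetical, and scoring test samples with decision_function is a choice the paper does not specify.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC


def cv_mean_ap(X, y, feature_idx, n_splits=5, n_repeats=3, seed=0):
    """Mean AP of a linear SVM over repeated stratified K-fold CV."""
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    Xs = X[:, feature_idx]  # keep only the genes in this subset
    fold_ap = []
    for train, test in cv.split(Xs, y):
        clf = SVC(kernel="linear").fit(Xs[train], y[train])
        # Signed distance to the hyperplane serves as the ranking score.
        scores = clf.decision_function(Xs[test])
        fold_ap.append(average_precision_score(y[test], scores))
    return float(np.mean(fold_ap))


# Synthetic stand-in: 39 "UC" samples (label 1) and 38 controls (label 0).
rng = np.random.default_rng(0)
y = np.array([1] * 39 + [0] * 38)
X = rng.normal(size=(77, 300))
X[y == 1, :10] += 2.0  # make the first 10 features informative

subsets = {
    "informative": np.arange(10),     # hypothetical candidate subset
    "noise_only": np.arange(100, 130),  # hypothetical candidate subset
}
scores = {name: cv_mean_ap(X, y, idx) for name, idx in subsets.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Stratification keeps the UC/control ratio roughly constant in every fold, which matters for AP because its chance level depends on the fraction of positive samples.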

Top five models are able to perfectly discriminate between active UC patients and controls

We selected the four top subsets with the highest mean AP, which are subsets 10, 51, 58, and 83 (Table 2), and created candidate models based on them. Each candidate model was created using all samples in the training dataset and the features of the corresponding subset. To identify the genes most relevant to discriminate between healthy and UC subjects, we looked at the number of times each gene was selected by DRPT. Over 100 DRPT runs, 211 genes were selected at least once. The upper plot in Fig. 1 shows the number of times each gene was selected, and the lower plot shows the normal quantile-quantile (QQ) plot.

Figure 1. Identifying the most frequently selected genes. Top: Number of times each gene was selected (x-axis: Index; y-axis: Number of times selected). Genes were sorted based on the number of times they were selected by DRPT. Bottom: Normal QQ-plot (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles). Horizontal line at 31 indicates the threshold selected to deem a gene as frequently chosen.

Based on this plot, we saw that the observed distribution of the number of times a gene was selected deviates the most from a Gaussian distribution above 31 times. We considered the genes selected by DRPT more than 31 times as highly relevant and created a fifth model using the 32 genes selected by DRPT at least 32 times over 100 runs.

In order to evaluate the prediction performance of the candidate models, each model was tested on the validation datasets, and the PRC was plotted for model assessment (Figs. 2 and 3). As the AP approximates the AUPRC, we used AP to summarize and compare the performance of these five models. All five candidate models achieved high predictive performance on the validation dataset GSE75214-active with an average AP of 0.[…] ± […].03, while the average AP of these five models on the validation dataset GSE75214-inactive was 0.[…] ± […].06. The models with the best performance were the model created with the 32 most frequently selected genes and subset 83, with an AP of 1 and 0.68 on GSE75214-active and GSE75214-inactive, respectively. However, based on a Friedman test (p-value = […]) […]

Our top models outperformed the model generated by BioDiscML.

The average AUPRC achieved by the model created by BioDiscML on the GSE75214-active and GSE75214-inactive datasets was 0.798 and 0.544, respectively. Comparing the performance of our candidate models and the model created by BioDiscML on the two validation datasets, we observed that our models achieved a better AUPRC on both datasets (AUPRC = 1 on the active dataset, AUPRC = 0.68 on the inactive dataset). In terms of running time, subset selection by DRPT together with final model creation and validation took 3 minutes, while BioDiscML took 1,890 minutes to create all the models and output the best final model.
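To make the AP and AUPRC values compared above concrete, the following sketch computes average precision by hand as $\mathrm{AP} = \sum_k (R_k - R_{k-1}) P_k$, the step-wise summary of the precision-recall curve that scikit-learn documents for average_precision_score. The helper name and the toy labels and scores are hypothetical, and the sketch assumes there are no tied scores (scikit-learn treats ties by threshold, which this simple version does not).

```python
import numpy as np


def average_precision(y_true, scores):
    """AP = sum over ranks k of (R_k - R_{k-1}) * P_k, assuming no tied scores."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores), kind="stable")  # best score first
    hits = y_true[order]                                     # 1 where a positive is ranked
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, len(hits) + 1)       # P_k at each rank
    # Recall only increases (by 1 / n_positives) at ranks holding a positive,
    # so AP is the mean of the precision values at those ranks.
    return float(precision[hits == 1].sum() / hits.sum())


# Toy example: positives sit at ranks 1, 3 and 4 of 5.
y = [1, 0, 1, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5]
ap = average_precision(y, s)            # (1/1 + 2/3 + 3/4) / 3
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(ap, 4), perfect)
```

Under this definition, an AP of 1 means that every positive sample received a higher score than every negative sample, which is how the result on the active dataset can be read.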

Links between the most frequently selected genes and UC.

We used Ensembl REST API (Version 11.0) to find the phenotypes associated with each gene belonging to the subset of the 32 most frequently selected genes (Table 3). Among these 32 genes, FAM118A is the only one with a known phenotypic association with IBD and its subtypes. The evidence supporting the association of some of the other 31 genes with UC based on phenotype is more indirect. For example, long-term IBD patients are more susceptible to developing colorectal cancer, and one of the 32 genes, TFRC, is associated with colorectal cancer. IBD patients are more prone to develop cardiovascular disease, which is associated with blood pressure and cholesterol, and four of the most frequently selected genes (LIPF, MMP2, DMTN and PPP1CB) are associated with blood pressure and cholesterol.

We looked at whether some of the 32 most frequently selected genes contained any of the 241 known IBD-associated SNPs. To do this, we utilized Ensembl's BioMart website (Ensembl Release version 98 - September 2019) to retrieve the genomic location of the 32 genes. We then used the intersectBed utility in BEDtools to find any overlap between the 241 IBD risk loci and the genomic location of the 32 genes. None of the IBD-associated SNPs was located on our 32 genes. Similarly, gene set enrichment analysis found no enriched GO term or pathway among these 32 genes. Additionally, these 32 genes are not listed as top differentially expressed genes in previous studies on UC.

We searched the literature for links between the 32 genes and UC, and we found the following. MMP2 expression has been found significantly increased in colorectal neoplasia in a mouse model of UC, and MMP2 levels are elevated in IBD. TFRC has been found to have an anti-inflammatory effect on a murine colitis model. KRT8 genetic variants have been observed in IBD patients and it was suggested that these variants are a risk factor for IBD. DUOXA2 has been shown to be critical in the production of hydrogen peroxide within the colon and to be upregulated in active UC.

Figure 2. Precision-Recall Curve of Top Selected Subsets on GSE75214-active.

Gene Symbol | Associated Phenotypes | # of times selected
CWF…L… | Spinocerebellar ataxia, autosomal recessive; depressive disorder, Major | …
FCER… | Blood protein levels; post bronchodilator FEV | …
MMP2 | Multicentric Osteolysis-Nodulosis-Arthropathy (MONA) spectrum disorders; cholesterol, HDL; lip and oral cavity carcinoma; body height; winchester syndrome | …
PPP1CB | Noonan Syndrome-like disorder with loose anagen hair; Heel bone mineral density; Blood pressure; basophils; rasopathy with developmental delay; short stature and sparse slow-growing hair | …
RPL…AP… | Attention deficit disorder with hyperactivity; body Height | …
ZNF… | None | …
REG…B | Contrast sensitivity; Body Mass Index | 93
TFRC | Breast ductal adenocarcinoma; esophageal adenocarcinoma; thyroid carcinoma; clear cell renal carcinoma; prostate carcinoma; pancreatic cancer; gastric adenocarcinoma; hepatocellular carcinoma; lung adenocarcinoma; rectal adenocarcinoma; basal cell carcinoma; colorectal adenocarcinoma; squamous cell lung carcinoma; head and neck squamous cell carcinoma; colon adenocarcinoma; iron status biomarkers (transferrin levels); mean corpuscular hemoglobin concentration; red cell distribution width; combined immunodeficiency; red blood cell traits; high light scatter reticulocyte percentage of red cells; reticulocyte fraction of red cells; Immunodeficiency 46 | 91
FAM118A | Chronic inflammatory diseases (ankylosing spondylitis, Crohn's disease, psoriasis, primary sclerosing cholangitis, ulcerative colitis); Glucose; Peanut allergy (maternal genetic effects); Heel bone mineral density | 89
CFHR… | Macular degeneration; blood protein levels; feeling miserable; alanine aminotransferase (ALT) levels after remission induction therapy in acute lymphoblastic leukaemia (ALL); asthma | …
KRT8 | Cirrhosis; familial cirrhosis; hepatitis C virus; susceptibility to, cirrhosis, cryptogenic cirrhosis, noncryptogenic cirrhosis; susceptibility to, gamma glutamyl transferase levels, cancer (pleiotropy) | …
PRELID… | Body fat distribution; heel bone mineral density; activated partial thromboplastin time | …
ZNF… | None | …
ABHD… | Itch intensity from mosquito bite adjusted by bite size; gut microbiota; Obesity-related traits; coronary artery disease; advanced age related macular degeneration; squamous cell lung carcinoma; pulse pressure | …
C…orf… | None | …
CAB…L | Hemoglobin S; erythrocyte count; pancreatic neoplasms | …
SPATC…L | None | …
DUOXA2 | Familial thyroid dyshormonogenesis; thyroglobulin synthesis defect | …
MESP… | None | …
MAML… | Social science traits; intelligence (MTAG); chronic mucus hypersecretion; borderline personality disorder; congenital heart malformation | 65
PITX… | Axenfeld-Rieger syndrome; ring dermoid of cornea; iridogoniodysgenesis type; peters anomaly; familial atrial fibrillation; rieger anomaly; stroke; ischemic stroke; cataract; PITX…-related eye abnormalities; phosphorus; cognitive decline rate in late mild cognitive impairment; creatinine; intraocular pressure; incident atrial fibrillation; wolff-parkinson-white pattern; parkinson disease; early onset atrial fibrillation; anterior segment dysgenesis | …
DMTN | Total cholesterol levels; LDL cholesterol | …
ASF…B | None | …
PGF | Mood instability; blood protein levels | …
BEX… | None | …
ODF… | Body weight; body mass index; glucose; IgA nephropathy; Chronic lymphocytic leukaemia; type … diabetes; erythrocyte indices | …
PTGR… | Body height; menarche; monocyte count; blood protein levels | …
ZNF… | None | …
LIPF | Maximal midexpiratory flow rate; blood protein levels; respiratory function tests; blood pressure | …
SLC…A… | Citrullinemia type II; neonatal intrahepatic cholestasis due to citrin deficiency; citrin deficiency; citrullinemia type I; bone mineral density | 38
BARX… | Type … diabetes; breast cancer; night sleep phenotypes; response to cyclophosphamide in systemic lupus erythematosus with lupus nephritis; stroke | …
C…orf… | None | …

Table 3. Phenotypes associated with the most frequently selected genes by DRPT.

Figure 3. Precision-Recall Curve of Top Selected Subsets on GSE75214-inactive.

Discussion

In a previous study where machine learning was employed to perform a risk assessment for Crohn's disease and UC using GWAS data, a two-step feature selection strategy was used on a dataset containing 17,000 Crohn's disease cases, 13,000 UC cases, and 22,000 controls with 178,822 SNPs. In that study, Wei et al. reduced the number of features by filtering out SNPs with p-values greater than 10^−[…] and then applied a penalized feature selection with an L[…] penalty to select a subset of SNPs. We decided against filtering out genes based on an arbitrary p-value of statistical significance of differential expression, as researchers are strongly advised against the use of p-values and statistical significance in relation to the null hypothesis. To avoid systematic experimental bias in the training data, we used three transcriptomic datasets from three separate studies, and used an independent dataset to validate our top performing models.

Our 32-gene model achieved an AP of 1 and 0.62 when discriminating active UC patients from healthy donors, and inactive UC patients from healthy donors, respectively. We found direct or indirect links to UC for about a quarter of the 32 most frequently chosen genes. The remaining genes should be further investigated to find associations with UC. To put the performance of our 32-gene model into perspective, we looked at previous studies applying machine learning to create models for the diagnosis of UC. Maeda et al. extracted 312 features from endocytoscopy images to train an SVM to classify UC patients as active or healing. This approach achieved 90% precision at 74% recall, which is lower than that achieved by our 32-gene model (Figs. 2 and 3). Yuan et al. applied incremental feature selection and an SMO classifier (a type of SVM) to gene expression data from blood samples to discriminate between healthy subjects, UC patients, and Crohn's disease patients. The 10-fold cross-validation accuracy of their best model, which used the expression values of 1170 genes to classify UC patients, was 92.31%, while our method obtained better accuracy than this with substantially fewer genes. In terms of potential for clinical translation of a machine learning-based model, a model that requires quantifying the expression levels of fewer genes is more suitable for the development of a new diagnostic test than one that requires quantifying the expression levels of thousands of genes. Using an efficient feature selection method such as DRPT and an SVM classifier on gene expression data, we generated a model that could facilitate the diagnosis of UC from expression measurements of 32 genes from colonic samples.

References

Kaplan, G. G. The global burden of IBD: from 2015 to 2025. Nat Rev Gastroenterol Hepatol, 720–7, DOI: 10.1038/nrgastro.2015.150 (2015).
Ordás, I., Eckmann, L., Talamini, M., Baumgart, D. C. & Sandborn, W. J. Ulcerative colitis. Lancet, 1606–19, DOI: 10.1016/S0140-6736(12)60150-0 (2012).
Eisenstein, M. Ulcerative colitis: towards remission. Nature, S33, DOI: 10.1038/d41586-018-07276-2 (2018).
Khan, I. et al. Alteration of gut microbiota in inflammatory bowel disease (IBD): Cause or consequence? IBD treatment targeting the gut microbiome. Pathogens, DOI: 10.3390/pathogens8030126 (2019).
de Lange, K. M. et al. Genome-wide association study implicates immune activation of multiple integrin genes in inflammatory bowel disease. Nat Genet., 256–261, DOI: 10.1038/ng.3760 (2017).
Anderson, C. A. et al. Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nat Genet., 246–52, DOI: 10.1038/ng.764 (2011).
Conrad, K., Roggenbuck, D. & Laass, M. W. Diagnosis and classification of ulcerative colitis. Autoimmun Rev, 463–6, DOI: 10.1016/j.autrev.2014.01.028 (2014).
Romagnoni, A. et al. Comparative performances of machine learning methods for classifying Crohn disease patients using genome-wide genotyping data. Sci Rep, 10351, DOI: 10.1038/s41598-019-46649-z (2019).
Boland, B. S. et al. Validated gene expression biomarker analysis for biopsy-based clinical trials in ulcerative colitis. Aliment. Pharmacol Ther, 477–85, DOI: 10.1111/apt.12862 (2014).
Shah, P. et al. Artificial intelligence and machine learning in clinical development: a translational perspective. NPJ Digit. Med, 69, DOI: 10.1038/s41746-019-0148-3 (2019).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 115–118, DOI: 10.1038/nature21056 (2017).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature, 89–94, DOI: 10.1038/s41586-019-1799-6 (2020).
Molla, M., Waddell, M., Page, D. & Shavlik, J. Using machine learning to design and interpret gene-expression microarrays. AI Mag., 23–23 (2004).
Xu, J. et al. Translating cancer genomics into precision medicine with artificial intelligence: applications, challenges and future perspectives. Hum. genetics, 109–124 (2019).
Mossotto, E. et al. Classification of paediatric inflammatory bowel disease using machine learning. Sci Rep, 2427, DOI: 10.1038/s41598-017-02606-2 (2017).
Olsen, J. et al. Diagnosis of ulcerative colitis before onset of inflammation by multivariate modeling of genome-wide gene expression data. Inflamm Bowel Dis, 1032–8, DOI: 10.1002/ibd.20879 (2009).
Yuan, F., Zhang, Y.-H., Kong, X.-Y. & Cai, Y.-D. Identification of candidate genes related to inflammatory bowel disease using minimum redundancy maximum relevance, incremental feature selection, and the shortest-path approach. Biomed Res Int, 5741948, DOI: 10.1155/2017/5741948 (2017).
Zahn, A. et al. Aquaporin-8 expression is reduced in ileum and induced in colon of patients with ulcerative colitis. World J. Gastroenterol. WJG, 1687 (2007).
Noble, C. L. et al. Regional variation in gene expression in the healthy colon is dysregulated in ulcerative colitis. Gut, 1398–1405 (2008).
Lepage, P. et al. Twin study indicates loss of interaction between microbiota and mucosa of patients with ulcerative colitis. Gastroenterology, 227–236 (2011).
Vancamelbeke, M. et al. Genetic and transcriptomic bases of intestinal epithelial barrier dysfunction in inflammatory bowel disease. Inflamm. bowel diseases, 1718–1729 (2017).
Häsler, R. et al. A functional methylome map of ulcerative colitis. Genome research, 2130–2137 (2012).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic acids research, D991–D995 (2012).
Gumienny, R. GEOparse. https://pypi.org/project/GEOparse/.
Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics, 520–5, DOI: 10.1093/bioinformatics/17.6.520 (2001).
Afshar, M. & Usefi, H. High-Dimensional Feature Selection for Genomics Datasets. ArXiv 2002.12104.
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 2825–2830 (2011).
Leclercq, M. et al. Large-scale automatic feature selection for biomarker discovery in high-dimensional omics data. Front. genetics, 452 (2019).
Holmes, G., Donkin, A. & Witten, I. H. Weka: A machine learning workbench. In Proceedings of ANZIIS '94 - Australian New Zealand Intelligent Information Systems Conference, 357–361 (1994).
Hall, M. et al. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 10–18 (2009).
Witten, I. H., Frank, E., Hall, M. A. & Pal, C. J. Data Mining: Practical machine learning tools and techniques (Morgan Kaufmann, 2016).
Müller, A. C., Guido, S. et al. Introduction to machine learning with Python: a guide for data scientists (O'Reilly Media, Inc., 2016).
Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. learning research, 1–30 (2006).
Yates, A. et al. The Ensembl REST API: Ensembl data for any language. Bioinformatics, 143–145 (2014).
Kim, E. R. & Chang, D. K. Colorectal cancer in inflammatory bowel disease: the risk, pathogenesis, prevention and diagnosis. World journal gastroenterology: WJG, 9872 (2014).
Schulte, D. et al. Small dense LDL cholesterol in human subjects with different chronic inflammatory diseases. Nutr. Metab. Cardiovasc. Dis., 1100–1105 (2018).
Smedley, D. et al. Biomart–biological queries made easy. BMC genomics, 22 (2009).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 841–2, DOI: 10.1093/bioinformatics/btq033 (2010).
Román, J. et al. Evaluation of responsive gene expression as a sensitive and specific biomarker in patients with ulcerative colitis. Inflamm Bowel Dis, 221–9, DOI: 10.1002/ibd.23020 (2013).
Song, R. et al. Identification and analysis of key genes associated with ulcerative colitis based on DNA microarray data. Medicine (Baltimore), e10658, DOI: 10.1097/MD.0000000000010658 (2018).
Schwegmann, K. et al. Detection of early murine colorectal cancer by MMP-2/-9-guided fluorescence endoscopy. Inflamm Bowel Dis, 82–91, DOI: 10.1097/MIB.0000000000000605 (2016).
Oliveira, L. G. d. et al. Positive correlation between disease activity index and matrix metalloproteinases activity in a rat model of colitis. Arq Gastroenterol, 107–12, DOI: 10.1590/s0004-28032014000200007 (2014).
Shin, J.-S. et al. Anti-inflammatory effect of a standardized triterpenoid-rich fraction isolated from Rubus coreanus on dextran sodium sulfate-induced acute colitis in mice and LPS-induced macrophages. J Ethnopharmacol 158 Pt A, 291–300, DOI: 10.1016/j.jep.2014.10.044 (2014).
Owens, D. W. & Lane, E. B. Keratin mutations and intestinal pathology. J Pathol, 377–85, DOI: 10.1002/path.1646 (2004).
MacFie, T. S. et al. DUOX2 and DUOXA2 form the predominant enzyme system capable of producing the reactive oxygen species H2O2 in active ulcerative colitis and are modulated by 5-aminosalicylic acid. Inflamm Bowel Dis, 514–24, DOI: 10.1097/01.MIB.0000442012.45038.0e (2014).
Wei, Z. et al. Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. The Am. J. Hum. Genet., 1008–1012 (2013).
Amrhein, V., Greenland, S. & McShane, B. Scientists rise up against statistical significance (2019).
Wasserstein, R. L., Schirm, A. L. & Lazar, N. A. Moving to a world beyond "p< 0.05" (2019).
Maeda, Y. et al. Fully automated diagnostic system with artificial intelligence using endocytoscopy to identify the presence of histologic inflammation associated with ulcerative colitis (with video). Gastrointest Endosc, 408–415, DOI: 10.1016/j.gie.2018.09.024 (2019).

Acknowledgements

This research was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) to H.U. (grant number RGPIN: 2019-05650) and to L.P.-C. (grant number RGPIN: 2019-05247). H.M.K. was partially supported by funding from Memorial University's School of Graduate Studies.

Author contributions statement

Conceptualisation H.U. and L.P.-C.; Methodology H.M.K., H.U. and L.P.-C.; Analysis H.M.K. and L.P.-C.; Writing H.M.K., H.U. and L.P.-C.; Experiments H.M.K.; Supervision H.U. and L.P.-C.

Additional information

Competing interests

The author(s) declare no competing interests.

Use of experimental animals, and human participants

This research did not involve human participants or experimental animals.

Informed consent

Not applicable.

Ethics approval