A robust and generalizable transcriptomic signature for sepsis diagnostics
Yueran Yang, Yu Zhang, Shuai Li, Xubin Zheng, Man-Hon Wong, Kwong-Sak Leung, Lixin Cheng
AA robust and generalizable transcriptomic signature for sepsis diagnostics
Yueran Yang , Yu Zhang , Shuai Li , Xubin Zheng , Man-Hon Wong , Kwong-Sak Leung , Lixin Cheng * Shenzhen People's Hospital, The Second Clinical Medicine College of Jinan University, Shenzhen, China John Hopcroft Center for Computer Science, Shanghai Jiao Tong University, Shanghai, China Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
These authors contribute equally to this study. *To whom correspondence should be addressed.
Abstract
Motivation:
High-throughput sequencing can detect tens of thousands of genes in parallel, providing opportunities for improving the diagnostic accuracy of multiple diseases including sepsis, which is an aggressive inflammatory response to infection that can cause organ failure and death. Early screening of sepsis is essential in clinic, but no effective diagnostic biomarkers are available yet.
Results:
We present a novel method, Recurrent Logistic Regression (RLR), to identify diagnostic biomarkers for sepsis from the blood transcriptome data. Immune-related gene expression profiles are studied among 1,144 sepsis samples and 240 normal samples across three detection platforms, including Affymetrix Human Genome U133 Plus 2.0, Affymetrix Human Genome U219 Array, and Agilent Human Gene Expression 4x44K v2 Microarray. A panel including five genes, LRRN3, IL2RB, FCER1A, TLR5, and S100A12, are determined as diagnostic biomarkers (LIFTS) for sepsis on two discovery cohorts. LIFTS discriminates patients with sepsis from normal controls in high accuracy (AUROC = 0.9959 on average; IC = [0.9722-1.0]) on nine validation cohorts across three independent platforms, which outperforms existing markers. Each individual gene has certain distinguishability for the detection of sepsis but cannot achieve as high performance as the entire gene panel. Network and functional analysis illustrated that the interactors of the five genes are closely interacted and are commonly involved in several biological functions, including growth hormone synthesis, chemokine signaling pathway, B cell receptor signaling pathway, etc.
Availability and implementation:
Source code is available at https://github.com/bio-LIFTS/LIFTS.
Contact: [email protected] . Introduction
Sepsis is a life-threatening organ dysfunction caused by a hostβs unbalanced response to an infection. It is one of the most severe diseases in the intensive care unit (ICU) and one of the worldβs leading lethal diseases (Angus and van der Poll, 2013; van der Poll, 2016; Weng, et al., 2018). Its common clinical manifestations include abnormalities in body temperature, heart rate, breathing, and peripheral white blood cell counts. Besides, sepsis is often accompanied by multiple organ dysfunction syndromes, such as hemodynamic instability, respiratory failure, and disseminated intravascular coagulation. In the past few decades, the high morbidity and mortality caused by sepsis have made the society to endure huge economic burden (Angus and van der Poll, 2013; van der Poll, 2016; Weng, et al., 2018). The prevalent methods of the diagnosis of sepsis are microbiological culture and taxonomic identification of the pathogen. However, the methods based on bacterial culture techniques have several shortcomings: (1) it usually takes 24 hours to obtain a positive result; (2) only one-third of the blood cultures are positive in clinically diagnosed sepsis cases, so negative results in culture do not mean negative cases; (3) the chances of a positive culture are reduced in patients who have already used antibiotics; (4) false positives are frequently caused by contamination; (5) short-term bacteremia can lead to a positive blood culture without a severe inflammatory response. Therefore, the sensitivity and specificity of the methods based on microbiological culture are quite low, and hence fails to diagnose sepsis effectively. In recent years, with the rise of high-throughput sequencing technology, tens of thousands of genes can be detected in parallel (Cheng, et al., 2020; Liu, et al., 2020; Liu, et al., 2020), providing opportunities for precise molecular diagnosis using machine learning methods (Wang, et al., 2020; Wang, et al., 2020). Several gene markers have been developed for the diagnostic prediction of sepsis. For instance, Scicluna et al. proposed the FAIM3:PLAC8 ratio as a candidate biomarker to assist in the rapid diagnosis of community-acquired pneumonia (CAP) (Scicluna, et al., 2015), which accounts for a high proportion of intensive care unit (ICU) admissions for respiratory failure and sepsis. McHugh et al. designed a classifier SeptiCyte Lab composed of four mRNAs of CEACAM4, LAMP1, PLA2G7, and PLAC8 by applying Support Vector Machine (SVM) and Random Forest (McHugh, et al., 2015). Scicluna et al. developed a sNIP score for sepsis diagnosis based on the expression abundance of three genes using SVM and bootstrapping (Scicluna, et al., 2018). However, the above-mentioned mRNA biomarkers cannot obtain consistent results in multiple independent data sets. In this paper, we introduced a novel recurrent logistic regression (RLR) as an automatic detection for the diagnostic biomarkers of sepsis. Since patients with sepsis have a severely dysregulated immune system (McHugh, et al., 2015; Wynn, et al., 2016), we principally concentrated on the immune-related genes (IRGs) and regarded them as the key molecular vents involved in sepsis. Based on IRGs, the RLR model was trained and the less significant genes were filtered during each iteration until no gene is eliminated. Regularization and elimination of insignificant features were applied simultaneously in RLR to avoid overfitting and hence reduce the complexity of the discriminative model. The biomarkers identified by RLR were verified on nine independent expression cohorts across three different detection platforms. We also evaluated the classification performance of each individual gene in the identified biomarkers. Finally, network and functional analyses were carried out for the genes interacting with these biomarkers.
2. Method
Eleven different gene expression cohorts were collected from the Gene Expression Omnibus (GEO) database with both sepsis samples and healthy controls, including three adult datasets and eight pediatric datasets (Table 1). In total, 1,384 samples were analyzed from three microarray platforms, Affymetrix Human Genome U133 Plus 2.0 (AffyU133P2), Affymetrix Human Genome U219 (AffyU219), and Agilent Human Gene Expression 4x44K v2 Microarray (AgilentV2). The raw data were preprocessed and normalized using the robust multichip average (RMA) algorithm (Cheng, et al., 2016; Cheng, et al., 2016; Liu, et al., 2019). GSE57065 is adopted for training the recurrent logistic regression (RLR) and GSE26378 for tuning the hyperparameter. Seven datasets (GSE95233, GSE28750, GSE8121, GSE13904, GSE26440, GSE9692, and GSE4607) from AffyU133P2 serve as the validation cohorts β to evaluate the diagnostic performance. GSE65682 and E-MTAB-1548 detected by other platforms are the validation cohorts β ‘ for evaluating the cross-platform capability. Figure 1a ). Recurrent logistic regression contains many iterations with model optimization and automatical feature selection, since each iteration involves regression step and elimination step (
Figure 1b ). egression step: Logistic regression is employed to the candidate gene set (initially 608 IRGs). The expression abundance of genes in each sample, is represented by a vector denoted as π = (π , β― , π π ) π where π π is the i-th gene expression. To construct a classifier involving as fewer genes as possible based on the expression vector π of a sample, a function π: β π β {0,1} is build, where represents healthy controls and 1 represents sepsis. The logistic model applied in RLR is a binary classifier expressed by π(π) = 11 + π βππ = 11 + π β(π€ +π€ π +β―+π€ π π π ) , where π is an expression vector and π(π) is a diagnostic risk score used to predict the probability of having sepsis. π = (π€ , π€ β― , π€ π ) π are parameters optimized by the cost function with regularization, π½(π; π) = π π π + β log (1 + exp (βπ¦ π (π ππ π + π€ ))) ππ=1 , where π is the collection of samples π , π , β¦ , π π in discovery cohort and π¦ β {0,1} is the label for each sample. Elimination step:
After optimizing the regression model, the minor genes regarded as less significant are eliminated. A gene π π is defined as minor gene if the absolute value of its corresponding weight π€ π is less than a proportion of the absolute maximum weight, i.e., |π€ π | < πΆ max(|π€ |, β― , |π€ π |) where πΆ β [0,1] is a hyperparameter. Instead of using the traditional way that chooses a fixed threshold such as P value < 0.01, this step selects features adaptively based on the maximum weight trained by the model. The regression step and elimination step are repeated iteratively until it converges, specifically, no more minor gene remained. In this sense, the algorithm is named the recurrent logistic regression (RLR). The RLR is first trained on the discovery cohort β and evaluated by the AUROC on discovery cohort β ‘ ( Figure 1c ). We exhaustedly tried possible values of the hyperparameter πΆ with the search space between 0.75 and 0.9 and each interval equaling to 0.01. The hyperparameter πΆ can therefore be determined by the optimal performance on discovery cohort β ‘ . Receiver operating characteristic (ROC) curve was applied for performance evaluation, which is the function image of True Positive Rate (TPR) with respect to False Positive Rate (FPR), here TPR represents the positive correctly classified samples to the total number of positive samples and FPR represents the ratio between the incorrectly classified negative samples to the total number of negative samples. Area Under the Curve (AUC) means the area under the ROC curve ranging from 0 to 1. Higher AUC indicates a more discriminative model. We use AUC to quantify the discrimination ability of the models on seven cohorts measured with the same platform and compare to the existing biomarkers. Moreover, the cross-platform capability is also evaluated on two cohorts from different platforms. Meta-analysis was conducted for our gene panel LIFTS and the Standardized Mean Difference (SMD) are demonstrated in forest plot (
Figure 3 ). Four graphical elements are presented including the estimated SMD (solid block), the respective 95% confidence intervals for each cohort (horizontal line), the non-effect size (vertical line), and overall estimation of all cohorts (red rhombus) (Suchmacher and Geller, 2012). To analyze the role of the five genes in LFTS, we presented the human protein interaction and constructed the protein interaction network.
Protein interactions were obtained from the database InWeb_InBioMap (Cheng and Leung, 2018; Li, et al., 2017), which is the most comprehensive resource for human protein interactome. Around 57% of the interactions are experimentally validated. The interaction network was conducted and illustrated using the R package igraph . Function enrichment was carried out using the R package clusterprofiler (Yu, et al., 2012).
3. Results
Patients with sepsis have a severely dysregulated immune system, which impairs clearance of the infection and leaves the body susceptible to new infections with an increased risk of death. Thus, we principally concentrated on the immune-related genes (IRGs) and regarded them as the key molecular events involved in sepsis. After taking the intersection across different platforms, 608 IRGs are screened as potential biomarkers for further analysis. The recurrent logistic regression (RLR) was then applied on the discovery cohort GSE57065. Different hyperparameter results in multiple gene panels. To select the best gene panel and the hyperparameter, we tried a series of thresholds and evaluated their performance on the independent discovery cohort β ‘ , GSE26378. Figure 2 displays the AUROC of these gene panels. We finalized the model with the highest AUROC up to 0.9951 when πΆ equals to 0.83. The discriminative function of the diagnostic model is π¦(π₯) = β0.9305ππππππβ0.9692ππππππ β0.7378 ππ πππππ+0.8460 πππππ+0.8905ππππππππβ0.0153 , which contains five genes, LRRN3, IL2RB, FCER1A, TLR5, and S100A12. We abbreviated the biomarkers by LIFTS, which is composed by the initial letters of each gene. The Genome characteristics of the five genes are listed in Table 2 . Since the value of logistic model is too concentrated to illustrate, i.e., ranging from 0 to 1, we used the corresponsive part in logits of our diagnostic model to represent the diagnostic ability of each gene and LIFTS. Specifically, the logit is πππππ‘ = β0.9305 π
LRRN3 β 0.9692 π
IL2RB β0.7378 π
FCER1A + 0.8460 π
TLR5 + 0.8905 π
S100A12 . The standard difference mean
π β π in effect size between the sepsis and control subjects is displayed in
Figure 3 , where π is the mean of logits for the sepsis samples and π corresponds to normal samples. The five genes individually are qualified to distinguish sepsis samples with the average standard mean difference (SMD) ranging from 1.5 to 3.5 and their confidence intervals do not cross zero. Compared with the individual genes, LIFTS achieves a much higher average SMD of 11.6, suggesting that the weighted gene panel has better diagnostic capability than each of the five genes. LIFTS was evaluated on the nine independent cohorts and compared to existing biomarkers.
Figure 4 shows the ROC curves comparison between the LIFTS and three known transcriptome biomarkers, i.e., FAIM3:PLAC8, SeptiCyte Lab, and sNIP. SeptiCyte Lab includes four genes and its risk score is PLAC8/PLA2G7*LAMP1/CEACAM4. sNIP contains three genes and it is represented as (NLRP1-IDNK)/PLAC8. The genes of all these four biomarkers are detectable on the AffyU133P2. Overall, LIFTS outperforms the other biomarkers on all the validation cohorts except GSE95233. The area under ROC curve (AUC) of LIFTS on each dataset is consistently close to 1. The lowest score, 0.9722 on GSE13904, is still much higher than the other biomarkers. NLRP1 and PLAC8 are not detected on either the AffyU219 or the Agilent platform, so sNIP cannot be applied on dataset GSE65682. Since NLRP1 does not exist on GSE65682 and PLAC8 is not available on E-MTAB-1548, some previous biomarkers could not be evaluated on these two datasets. The AUC of LIFTS on GSE65682 and E-MTAB-1548 are 0.9994 and 1.0, which are superior to its counterparts, indicating the portability of LIFTS among different platforms in diagnostic prediction (
Figure 5 ). The AUC curves are related to the standard difference means. Considering LIFTS performance shown in
Figure 3 and
Figure 4 , the higher AUC value always corresponds to higher standard difference mean. For example, in GSE13904, the AUC value is relatively low and correspondingly the standard difference mean is relatively closer to zero. However, focusing on E-MTAB-1548, the AUC is 1.0 and the standard difference mean is far from zero. .4. Topological and functional analysis of LIFTS
Proteins usually group together as modules to implement in particular cellular functions through interactions (Cheng, et al., 2017; Cheng and Leung, 2018; Cheng, et al., 2018). The interference in protein interactions and new undesired protein interactions can cause diseases (Cheng, et al., 2019; Li, et al., 2020; Yin, et al., 2020).To explore the functions of the genes in LIFTS, we further studied the topological property of the genes physically interacted with the five genes by network analysis (
Figure 6A ). Specifically, 1, 45, 19, 7, and 3 partner genes interact with LRRN3, IL2RB, FCER1A, TLR5, and S100A12, respectively (
Figure 6B ). These genes are closely connected and involved in specific biological processes, including growth hormone synthesis, secretion and action, chemokine signaling pathway, B cell receptor signaling pathway, T cell receptor signaling pathway, etc. (
Figure 6C ). For the protein interaction network, the connections among genes are significantly dense than the simulated networks (P < e-26, Rank sum test,
Figure 6D ), where we randomly picked up the same number of proteins 10,000 times and calculated their network density distribution. The densities of the simulated networks are mainly less than 0.05 whereas the density of the curated network is 0.2854, indicating the five genes tend to function together with higher network connectivity than other genes.
4. Discussion
We developed a novel model to screen the diagnostic biomarkers of sepsis based on the logistic regression. Five genes were identified as a prediction model (LIFTS) with an average AUROC of 0.9959 among 11 cohorts containing in total 1,384 samples. LIFTS demonstrated its robust portability across three different transcriptome platforms, which is much better than its counterparts such as SeptiCyte Lab (Scicluna, et al., 2018). Our analysis thus determined an accurate prediction model and reproducible transcriptome biomarkers that can lay a foundation for clinical diagnostic tests and biological mechanistic studies. The classical logistic regression can only return the classifier with a given number of genes. Mathematically, our goal is to maximize the performance of classification and minimize the complexity of the diagnostic model simultaneously, requiring a competent algorithm with the ability to filter out irrelevant genes automatically. To this end, an enhanced version of logistic regression, recurrent logistic regression (RLR), was developed using the weight of each term as a measure of the gene importance. Importantly, the selection of features and the construction of classifiers are usually regarded as two independent tasks, but we combined these two tasks together. Thus, the biomarkers are more adaptive to the base model and superior to other models, which use a specific algorithm on previously selected biomarkers. Since sepsis is immune-related, the immune-related genes were used as a priori knowledge and considered as candidate genes in this study. After filtering the genes, the umber of candidate genes was greatly shortlisted, which is also an efficient preprocessing step for feature selection. Some researchers made use of differentially expressed (DE) genes as diagnostic signatures. However, DE genes are extremely inconsistent among different datasets and platforms. Only a few overlapping DE genes were obtained among the 11 datasets we used (
Supplementary Figure S1, 2 ), leading to obstacles to find a robust classifier based on DE genes. During the training process, interestingly, we observed that running logistic regression using different coding languages may lead to different results. In this study, the function LogisticRegression in the sklearn package of python was used. Technically, we applied πΏ regularization in our logistic regression processes, which is commonly used in machine learning to reduce model overfitting. In some studies, logistic regression with πΏ regularization is called ridge regression (Friedman, et al., 2010). The advantage of regularization is that it improves numerical stability, not only forces weights to shrink but also copes with the case sophisticatedly when the number of features is larger than the number of samples. Despite 11 public datasets were taken advantage, all of them were detected using microarray technology. No RNA-seq datasets were included, thereby making our model not applicable for all transcriptome platforms. Therefore, we call for the detection of sepsis using RNA-seq technologies in the near future. In conclusion, the diagnostic biomarkers LIFTS shows higher accuracy and robustness compared to the existing biomarkers when differentiating the sepsis patients from the normal controls. Further clinical trials are needed to confirm the findings in the paper. Funding
This work was supported by the Basic and Applied Basic Research Programs Foundation of Guangdong Province (2019A1515110097).
Conflict of Interest: none declared.
Reference
Angus, D.C. and van der Poll, T. Severe sepsis and septic shock. N Engl J Med 2013;369(9):840-851. Cheng, L., et al. Full Characterization of Localization Diversity in the Human Protein Interactome. J Proteome Res 2017;16(8):3019-3029. Cheng, L. and Leung, K.S. Identification and characterization of moonlighting long non-coding RNAs based on RNA and protein interactome. Bioinformatics 2018;34(20):3519-3528. Cheng, L. and Leung, K.S. Quantification of non-coding RNA target localization diversity and its application in cancers. J Mol Cell Biol 2018;10(2):130-138. Cheng, L., Liu, P. and Leung, K.S. SMILE: a novel procedure for subcellular module identification with localisation xpansion. IET Syst Biol 2018;12(2):55-61. Cheng, L., et al. Exploiting locational and topological overlap model to identify modules in protein interaction networks. BMC Bioinformatics 2019;20(1):23. Cheng, L., et al. CrossNorm: a novel normalization strategy for microarray data in cancers. Sci Rep 2016;6:18898. Cheng, L., et al. Whole blood transcriptomic investigation identifies long non-coding RNAs as regulators in sepsis. J Transl Med 2020;18(1):217. Cheng, L., et al. ICN: a normalization method for gene expression data considering the over-expression of informative genes. Mol Biosyst 2016;12(10):3057-3066. Friedman, J., Hastie, T. and Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of statistical software 2010;33(1):1-22. Li, L., et al. Host-Guest Protein Assembly for Affinity Purification of Methyllysine Proteomes. Anal Chem 2020;92(13):9322-9329. Li, T., et al. A scored human protein β protein interaction network to catalyze genomic interpretation. Nature Methods 2017;14(1):61-64. Liu, X., et al. Normalization Methods for the Analysis of Unbalanced Transcriptome Data: A Review. Front Bioeng Biotechnol 2019;7:358. Liu, X., et al. A network-based algorithm for the identification of moonlighting noncoding RNAs and its application in sepsis. Brief Bioinform 2020. Liu, X., et al. A long non-coding RNA signature for diagnostic prediction of sepsis upon ICU admission. Clin Transl Med 2020. McHugh, L., et al. A Molecular Host Response Assay to Discriminate Between Sepsis and Infection-Negative Systemic Inflammation in Critically Ill Patients: Discovery and Validation in Independent Cohorts. PLoS Med 2015;12(12):e1001916. Scicluna, B.P., et al. A molecular biomarker to diagnose community-acquired pneumonia on intensive care unit admission. Am J Respir Crit Care Med 2015;192(7):826-835. Scicluna, B.P., et al. Molecular Biomarker to Assist in Diagnosing Abdominal Sepsis upon ICU Admission. Am J Respir Crit Care Med 2018;197(8):1070-1073. Suchmacher, M. and Geller, M. Chapter 13 - Systematic Reviews and Meta-Analyses. In: Suchmacher, M. and Geller, M., editors, Practical Biostatistics. San Diego: Academic Press; 2012. p. 159-166. van der Poll, T. Future of sepsis therapies. Crit Care 2016;20(1):106. Wang, J., et al. GNL-Scorer: A generalized model for predicting CRISPR on-target activity by machine learning and featurization. J Mol Cell Biol 2020. Wang, J., et al. An overview and metanalysis of machine and deep learning-based CRISPR gRNA design tools. RNA Biol 2020;17(1):13-22. Weng, L., et al. Sepsis-related mortality in China: a descriptive analysis. Intensive Care Med 2018;44(7):1071-1080. Wynn, J.L., et al. Targeting IL-17A attenuates neonatal sepsis mortality induced by IL-18. Proc Natl Acad Sci U S A 2016;113(19):E2627-2635. Yin, R., et al. Up-regulation of autophagy by low concentration of salicylic acid delays methyl jasmonate-induced leaf senescence. Sci Rep 2020;10(1):11472. Yu, G., et al. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 2012;16(5):284-287. ables Table 1. Summary of the gene expression cohorts used in this study.
Series
Gene Number Normal Sepsis Cell type Age Platform Discovery Cohort β GSE57065 23521 25 82 Whole blood Adult Affymetrix Human Genome U133 Plus 2.0
Discovery Cohort β ‘ GSE26378 23521 21 82 Whole blood Children
Validation Cohorts β GSE95233 23521 22 102 Whole blood Adult GSE28750 23521 20 10 Whole blood Adult GSE8121 23521 15 60 Whole blood Children GSE13904 23521 18 52 Whole blood Children GSE26440 23521 32 98 Whole blood Children GSE9692 23521 15 30 Whole blood Children GSE4607 23521 15 69 Whole blood Children
Validation Cohorts β ‘ GSE65682 19040 42 479 Whole blood Adult Affymetrix Human Genome U219 Array E-MTAB-1548 17028 15 80 Peripheral blood Adult Agilent Human Gene Expression 4x44K v2 Microarray
Table 2.
Genome characteristics of the genes in LIFTS.
Gene symbol Gene name Alignments Chromosomal Location Degree
LRRN3 leucine rich repeat neuronal 3 chr7:110731149-110765507 (+) chr7q31.1 1 IL2RB interleukin 2 receptor, beta chr22:37521886-37545962 (-) chr22q13.1 45 FCER1A Fc fragment of IgE, high affinity I, receptor for; alpha polypeptide chr1:159272125-159277991 (+) chr1q23 19 TLR5 toll-like receptor 5 chr1:223283646-223316624 (-) chr1q41-q42 7 S100A12 S100 calcium binding protein A12 chr1:153346183-153348075 (-) chr1q21 3 igures
Figure 1.
Workflow for the identification of sepsis biomarkers. a) statistics of immune-related genes of all cohorts in three platforms. b) the flow chart of the recurrent logistic regression algorithm including the regression step and the elimination step. c) determining the hyperparameter c with another discovery cohort and finalizing the biomarkers. d ) validating and comparing diagnostic capability with distinct cohorts and platforms. Figure 2 . The line graph of the AUROC with respect to different hyperparameter C between 0.7 and 0.99 with an interval of 0.01. The red dot represents the optimal model with the corresponding C=0.83, N=5 and AUROC=0.9994
Figure 3.
Forest plots of LIFTS and each genes in LIFTS. The x-axis represents the standardized mean difference between sepsis patients and healthy controls. The blue square is the average value of the difference and the size corresponds to the concentration of the data. The blue line represents the 95% confidence interval. The red diamond represents the average difference of a given gene or LIFTS for all cohorts. The width of the diamond represents the 95% confidence interval of the overall mean difference.
Figure 4.
Results of LIFTS in discovery and validation cohorts. LIFTS outperforms three existing models according to performance of measured by AUROC.
Figure 5.
The performance of LIFTS based on two independent cohorts and platforms. NLRP1 does not exist on GSE65682 and PLAC8 is not available on E-MTAB-1548, resulting in the absence of some biomarkers.
Figure 6.