General DeepLCP model for disease prediction : Case of Lung Cancer
A PREPRINT
Mayssa Ben Kahla ∗
University of Sousse, Higher Institute of Applied Science and Technology, 4003, Sousse, Tunisia
[email protected]

Dalel Kanzari
University of Sousse, Higher Institute of Applied Science and Technology, 4003, Sousse, Tunisia
University of Manouba, National School of Computer Sciences, RIADI Laboratory 2010, Manouba, Tunisia
[email protected]

Ahmed Maalel
University of Sousse, Higher Institute of Applied Science and Technology, 4003, Sousse, Tunisia
University of Manouba, National School of Computer Sciences, RIADI Laboratory 2010, Manouba, Tunisia
[email protected]

Abstract
According to the Global Health Observatory (GHO), ischaemic heart disease, stroke, lung cancer, and lower respiratory infections have remained the top killers during the past decade. The growth in the number of deaths caused by these diseases is largely due to the very late detection of symptoms: in the early stages the symptoms are insignificant and similar to those of benign diseases (e.g. the flu), so the disease can only be detected at an advanced stage. In addition, the high frequency of practices that are harmful to health, hereditary factors, and stressful living conditions can increase death rates. Many studies have dealt with these fatal diseases, and most of them applied advanced machine learning models to image-based diagnosis. The drawback is that imaging only allows the disease to be detected at a very late stage, when the patient can hardly be saved. In this paper we present a new approach, "DeepLCP", to predict fatal diseases that threaten people's lives. It is mainly based on raw and heterogeneous data about the person concerned (or under test). "DeepLCP" results from a combination of Natural Language Processing (NLP) and the deep learning paradigm. The experimental results of the proposed model in the case of lung cancer prediction show high accuracy and a low loss during validation of the disease prediction.
Keywords
Deep learning · NLP · Disease · Prevention · Lung Cancer · CNN
arXiv [cs.AI]

According to the GHO, ischaemic heart disease, stroke, lung cancer, and lower respiratory infections have remained the top killers during the past decade. They cause increasing numbers of deaths worldwide. For example, lung cancer is the fourth most common cancer in France according to the National Cancer Institute, with an estimated 49,109 new cases. Lung cancer is also the leading cause of cancer death in Canada according to the Canadian Cancer Society: 28,600 Canadians were diagnosed with lung cancer, which represents 14% of all new cancer cases in 2017, and 21,100 Canadians died from lung cancer, accounting for 26% of all cancer deaths in 2017.

The growth in the number of deaths caused by these fatal diseases is generally due to the very late detection of symptoms.

The objective of our approach is to estimate the probability of having the disease from the person's general living conditions, clinical symptoms such as "chest pain" and "persistent cough", and risk factors such as "family history of the disease" and "smoking". To achieve this goal we propose a new approach, "DeepLCP", that combines natural language processing and the deep learning paradigm to highlight hidden information with a significant impact on the disease.

Many works have used deep learning in the field of medicine. In [1], the authors applied deep learning to structural and functional brain imaging data. They used the Deep Belief Network (DBN) and its building block, the restricted Boltzmann machine (RBM), to describe high-dimensional data. The problem with this approach is its quadratic time complexity. Talathi [2] proposed a deep learning framework based on gated recurrent units (GRUs) for seizure detection. The proposed method can detect about 98% of epileptic seizures within the first 5 seconds of the overall duration of the seizure. According to Webb's study [3], Steve Finkbeiner's lab used a convolutional neural network (CNN) to identify the dead neurons in a living cell population.
A weakness of this work is that the researchers can no longer control the classification process or precisely explain the software's outputs. Rajpurkar et al. [4] used a convolutional neural network (CNN) architecture to detect musculoskeletal abnormalities of the elbow, forearm, hand, humerus, shoulder, and wrist from a musculoskeletal X-ray data set.

Many works are also interested in deep learning for cancer detection. For example, Gruetzemacher and Gupta [5] used image recognition and deep learning methods to distinguish between large and small potentially malignant pulmonary nodules. Esteva and colleagues [6] proposed a classification of skin lesions using a single CNN on biopsy images to detect cancer. In [7], the authors used a deep learning algorithm to predict the presence or absence of lung cancer in a chest X-ray, using twelve thousand cases of biopsy-proven lung cancer. Yang et al. [8] implemented a DCNN-based classification system to detect lung cancer. Bychkov et al. [9] combined convolutional (CNN) and recurrent (RNN) architectures into a deep network to predict colorectal cancer outcomes from images of tumor tissue samples.

In the field of general medicine, many works have combined deep learning with natural language processing. For example, Hughes and colleagues [10] introduced a new approach to classify medical documents at the sentence level. They showed that CNNs can represent the semantics of clinical text, allowing semantic classification at the sentence level. The approach of [11] applied a convolutional neural network (CNN) to biomedical text classification in order to identify the hallmarks of cancer, such as "sustaining proliferative signaling" and "resisting cell death". The authors based their CNN architecture on the simple model of Kim [12].
Qiu and colleagues [13] studied deep learning, in particular the convolutional neural network (CNN), to extract ICD-O-3 topographic codes from a corpus of breast and lung cancer pathology reports.

Table 1 summarizes the different works dealing with deep learning in general medicine and in cancer detection, and also classifies them according to their input type. Despite these advanced works and their significant contributions, we note certain limits:
• The first weakness is related to the performance of these models in detecting disease, which is lower than that obtained from radiographic image analysis.
• The second limitation is the weak precision of the works that combined NLP with CNN in the field of medicine.
• The third limit is related to the training time, which is too long.
• Also, most of the related works base their research on clinical data and do not emphasize the patient's quality of life.
| Reference | Algorithm | Architecture | Input Data | Goal | Accuracy | Error rate |
|---|---|---|---|---|---|---|
| [1] | Sigmoid / SVM / error backpropagation algorithm / iterative "divide and compete" (DC) algorithm | DBN / RBM (unsupervised) | MRI image | Detection of brain disease | 90% | _ |
| [2] | Backpropagation through time (BPTT) / logistic regression | GRU-RNN | EEG recording image | Detection of epileptic seizures | 99.6% | _ |
| [3] | CNN (Deep thoughts) | CNN (supervised) | Annotated image | Detection of dead neurons | 90%-99% | _ |
| [4] | _ | CNN | X-ray image | Detection of musculoskeletal abnormalities | 95% | _ |
| [5] | Max pooling, softmax | DNN | CT scan image | Classification of the characteristics of malignant lung nodules | | |
Table 1: Synthesis of Related Works

We note that few approaches use text as an input; most of them apply a deep learning model to X-ray images. Unfortunately, these techniques can only detect the disease at an advanced stage.

To address these issues, we propose the "DeepLCP" model, which combines Natural Language Processing (NLP) and deep learning to prevent fatal disease by estimating the probability of having the disease.
Our "DeepLCP" model is mainly composed of two sub-processes, as shown in Figure 1: Data Preprocessing and Data Deep Processing.
• Data Preprocessing consists of data extraction and data cleaning.
• Data Deep Processing is made up of three processes: semantic transformation, semantic classification, and the deep learning algorithm.
Both sub-processes run in sequence in order to predict the probability of having the disease.
Figure 1: General DeepLCP Model
Figure 2: Data Extraction
As shown in Figure 2, in the data extraction step we collect two types of data:
• Person's Clinical Data: data regarding the patient's symptoms.
• Person's General Data: data on major and minor risk factors for the disease.
In this stage, we consider three methods of data cleaning:
• Deleting irrelevant data: we use this method to handle data that does not fit the specific problem. It is based on the suggestions of experts.
• Fixing typos: we use this method when strings can be entered in many different ways and mistakes can be made.
• Standardizing data: with this method, we not only recognize typos but also put each value in the same standardized format. E.g. for strings, we make sure that all values are either in lowercase or in uppercase.
Data cleaning helps to eliminate inaccurate information that may lead to bad decision making.
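The three cleaning methods above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the field names, the list of irrelevant fields, the vocabulary of accepted values, and the similarity cutoff are all hypothetical, and typo fixing is approximated here with the standard library's fuzzy string matching.

```python
import difflib

# Hypothetical field names and vocabularies, for illustration only.
IRRELEVANT_FIELDS = {"record_id"}
KNOWN_VALUES = {"gender": ["male", "female"], "smoker": ["yes", "no"]}


def clean_record(record):
    """Apply the three cleaning steps to one raw record (a dict of strings)."""
    cleaned = {}
    for field, value in record.items():
        # Step 1: delete data that is irrelevant to the problem.
        if field in IRRELEVANT_FIELDS:
            continue
        # Step 3 (done early so matching is case-insensitive):
        # standardize by trimming whitespace and lowercasing.
        value = value.strip().lower()
        # Step 2: fix typos by snapping to the closest known value, if any.
        vocabulary = KNOWN_VALUES.get(field)
        if vocabulary and value not in vocabulary:
            matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=0.6)
            if matches:
                value = matches[0]
        cleaned[field] = value
    return cleaned


raw = {"record_id": "17", "gender": " Mlae ", "smoker": "YES"}
print(clean_record(raw))
```

In practice the list of irrelevant fields would come from the experts mentioned above, and values that match no known entry would probably be flagged for manual review rather than kept silently.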
Figure 3: DeepLCP Architecture
Figure 4: Semantic Transformation.
The core of the DeepLCP architecture is composed of two parts, NLP (natural language processing) and the deep learning algorithm, as shown in Figure 3.
In the semantic transformation process, we use natural language processing (NLP), in particular the word2vec model, to convert sentences into vectors. Each vector is generated according to the user's response and the semantic transformation rules. The output of the semantic transformation is the raw semantic matrix (MsB), as illustrated in Figure 4. The semantic transformation converts each row of the database into a raw semantic matrix. Each entry of the raw semantic matrix (MsB) represents an individual clinical observation.
Figure 5: Semantic Classification.
Figure 6: DeepLCP Model
From the MsB, the semantic classification generates the reduced semantic matrix (MsR), which has a restricted number of rows. The purpose of this step is to reveal the symptoms and major risks of the disease. Figure 5 gives a general overview of the semantic classification.
In this process, we apply a deep learning algorithm to the reduced semantic matrix in order to predict the probability of having the disease. Our objective is to extract features that characterize the disease; from these features the model calculates the probability of disease.
We instantiate the DeepLCP model experimentally using a convolutional neural network (CNN) in the case of lung cancer prevention. Figure 6 represents the DeepLCP model in the case of lung cancer.
As shown in Figure 6, in the data extraction step we collect data from three different sources:
• The first source is the archived patient files of the Farhat Hached Hospital.
• The second source is our online survey of non-sick individuals.
• The third source is our online form for individuals wishing a clinical examination for the disease.
The data extracted from the first and second sources contain informal descriptive data about the individuals under examination. An individual is represented by a set of characteristics such as age, gender, profession, etc.
Our database is fed from the archived files of patients with lung cancer at Farhat Hached Hospital in Sousse. We consulted and entered 355 records of patients with the disease. For unaffected individuals, we launched an online survey and collected 246 descriptions. All data are archived in a CSV file.
The data is subdivided into two subsets:
• A learning set: 490 real cases, divided into 315 lung cancer patients and 175 individuals unaffected by lung cancer.
• A test set: 111 real cases, divided into 40 patients affected by lung cancer and 71 cases not affected by the disease.

As described in section 3.2.1, the semantic transformation generates vectors according to the user's response and the semantic transformation rules.
• Semantic Transformation Rules: we use the formal language Z [14] to represent 31 semantic transformation rules. We use these rules to associate each piece of information with an incidence weight. E.g. Figure 7 shows rule 1:
• Rule 1 (Figure 7): this rule assigns a weight to the vector associated with gender information. The weight intervals were suggested by Pr. Bouaouina Noureddine, head of the radiotherapy department at Farhat Hached Hospital, Sousse, and Dr. Jalel Knani, pulmonologist at Tahar Sfar Hospital. According to the two doctors, men have a higher risk of having the disease than women.
The output of the semantic transformation part is the raw semantic matrix, as illustrated in Figure 8. The raw semantic matrix has a size of [31 * 13].

Figure 7: Rule 1 of transformation in Z language.
Figure 8: Raw Matrix.
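To make the idea of Rule 1 concrete, the following sketch shows one way a gender answer could be mapped to an incidence weight and applied to a 13-dimensional vector. It is only an illustration under stated assumptions: the weight intervals, the use of the interval midpoint, and the element-wise scaling are hypothetical, since the actual intervals are specified in Z in Figure 7 and are not reproduced in the text. The only property taken from the paper is that the weight for men is higher than the weight for women.

```python
# Hypothetical weight intervals (low, high). The real intervals were
# suggested by the consulted physicians and are not given in the text.
GENDER_WEIGHT_INTERVALS = {"male": (0.5, 1.0), "female": (0.0, 0.5)}

VECTOR_SIZE = 13  # number of columns of the raw semantic matrix


def gender_weight(answer):
    """Return the midpoint of the assumed weight interval for a gender answer."""
    low, high = GENDER_WEIGHT_INTERVALS[answer.strip().lower()]
    return (low + high) / 2


def weighted_row(vector, weight):
    """Scale one answer vector by its incidence weight (assumed combination)."""
    return [weight * x for x in vector]


# A placeholder vector stands in for the word2vec embedding of the answer.
placeholder_vector = [1.0] * VECTOR_SIZE
row_male = weighted_row(placeholder_vector, gender_weight("Male"))
row_female = weighted_row(placeholder_vector, gender_weight("female"))
print(row_male[0], row_female[0])
```

Applying one such rule per item would yield one row per rule, which is consistent with a raw semantic matrix of 31 rows and 13 columns.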
Figure 9: Semantic Matrix Reduced.
The purpose of this part is to extract the symptoms and the minor and major risk factors of lung cancer. In the semantic classification part we apply two types of classification to the raw semantic matrix in order to obtain the reduced semantic matrix:
• Classification by categories: classify the data into three classes: minor risk factors, major risk factors, and symptoms. These three categories are the main factors of lung cancer.
• Classification by theme: classify the data into six themes: Thoracic signs, Cough, Feeding, Consumer, Personal history, Residence.
After classification, we obtain the reduced semantic matrix with size [18 * 13], as illustrated in Figure 9. We use the classification to optimize the number of features. The reduction does not lose any information.
Our CNN architecture is composed of a convolution layer, a pooling layer, and a fully connected layer.
As illustrated in Figure 10, the convolution layer has 3 filter region sizes: 5, 6, and 7. Region size 5 corresponds to the "minor risk factors" part, region size 6 to the "major risk factors" part, and region size 7 to the "symptoms" part. Each region size has 2 filters, so in total there are 6 filters in our architecture. All filters scan the reduced semantic matrix, which is our input, with a stride of 1, producing one feature map per filter. We set the stride to 1 in order to extract all the characteristics of the matrix.
Then, in the pooling layer, we apply max pooling to each feature map to obtain its maximum value, and we concatenate the 6 values obtained.
Finally, in the fully connected layer, we apply the softmax activation function to the concatenated values to obtain our output, which is composed of two probabilities:
• The probability of having the disease.
• The probability of not having lung cancer.
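The forward pass described above can be sketched in plain Python as follows. This is an untrained illustration, not the authors' model: the weights are random, the filters are assumed to span the full width of 13 columns (as in the sentence-classification CNN of Kim [12]), and the ReLU activation after the convolution is an assumption, since the text does not name one.

```python
import math
import random

ROWS, COLS = 18, 13        # size of the reduced semantic matrix
REGION_SIZES = (5, 6, 7)   # filter heights given in the text
FILTERS_PER_REGION = 2     # 3 region sizes x 2 filters = 6 filters


def convolve(matrix, kernel):
    """Slide a full-width kernel down the rows with stride 1, then apply ReLU."""
    height = len(kernel)
    feature_map = []
    for top in range(len(matrix) - height + 1):
        total = sum(
            matrix[top + i][j] * kernel[i][j]
            for i in range(height)
            for j in range(len(kernel[0]))
        )
        feature_map.append(max(0.0, total))  # ReLU is an assumption
    return feature_map


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(matrix, kernels, dense_weights, dense_bias):
    """Convolution, max pooling per feature map, concatenation, dense + softmax."""
    pooled = [max(convolve(matrix, kernel)) for kernel in kernels]
    scores = [
        sum(w * p for w, p in zip(weights, pooled)) + bias
        for weights, bias in zip(dense_weights, dense_bias)
    ]
    return pooled, softmax(scores)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
kernels = [
    [[rng.uniform(-0.1, 0.1) for _ in range(COLS)] for _ in range(height)]
    for height in REGION_SIZES
    for _ in range(FILTERS_PER_REGION)
]
dense_weights = [[rng.uniform(-0.1, 0.1) for _ in kernels] for _ in range(2)]
dense_bias = [0.0, 0.0]
matrix = [[rng.random() for _ in range(COLS)] for _ in range(ROWS)]

pooled, probabilities = forward(matrix, kernels, dense_weights, dense_bias)
print(len(pooled), probabilities)
```

With 18 input rows and a stride of 1, the feature maps have 14, 13, and 12 entries for region sizes 5, 6, and 7 respectively, and max pooling reduces each of them to a single value before the softmax layer.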
We obtain the simulation results shown in Figure 11. The validation accuracy is 94.59% and the training accuracy is 93.88%, with a loss of 0.1699 on the validation set and 0.1773 on the training set.
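As a rough illustration of how a percentage such as the validation accuracy relates to a count of correctly classified cases, the sketch below computes accuracy and error rate from counts. The total of 111 is the test-set size stated earlier; the count of 105 correct predictions is an assumption chosen because it reproduces 94.59%, not a figure reported in the paper.

```python
def accuracy_and_error(correct, total):
    """Return (accuracy %, error rate %) rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    accuracy = 100.0 * correct / total
    return round(accuracy, 2), round(100.0 - accuracy, 2)


# 111 is the test-set size from the text; 105 correct is an assumed count.
print(accuracy_and_error(105, 111))
```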
Figure 10: Architecture DeepLCP
Figure 11: Result of the accuracy and loss values for the train part and the test part

From the AUC (Area Under the ROC (Receiver Operating Characteristic) Curve) shown in Figure 12, we find that our model achieves considerable performance with a very high true positive rate, and that our classification is strong. The AUC value is 0.99 (very close to 1), which means that the model has a good measure of separability. The AUC is one of the most important evaluation parameters for verifying the performance of a classification model.

Figure 12: Result of the accuracy and loss values for the train part and the test part

We test our dataset with four machine learning algorithms:
• First, we test our database with the k-nearest neighbors (KNN) algorithm with 5 neighbors. With this algorithm we obtain 86.48% accuracy and a 13.52% error rate.
• Second, we test the proposed dataset with a Decision Tree. With this algorithm we obtain 93.69% accuracy and a 6.31% error rate.
• Third, we test with the Random Forest algorithm. With this algorithm we obtain 91.89% accuracy and an 8.11% error rate.
• Finally, we test the model with the ANN [15] algorithm. We use 18 nodes in the input layer and 10 nodes in the hidden layer with a 'relu' activation function, followed by a flatten layer; to avoid overfitting we use dropout = 0.75, with an output layer of 2 nodes. For this model we obtain 85.59% accuracy and a 14.41% error rate after 200 epochs.
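For readers unfamiliar with the AUC used in the evaluation above, the following sketch computes it from its probabilistic definition: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are made-up toy values, not the paper's predictions.

```python
def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative case."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = affected, 0 = unaffected; scores are predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(auc(labels, scores))
```

In this toy example, 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9. A value near 1, such as the 0.99 reported, means almost every affected case is scored above almost every unaffected one.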
Conclusion
In this article, we present a new generic model for the prevention and detection of fatal diseases. Our model, called "DeepLCP", consists mainly of a combination of two approaches: natural language processing (NLP) and the deep learning paradigm. Our model is based on semantic transformation. In the experimentation, we instantiate the "DeepLCP" model using formal transformation rules and a convolutional neural network (CNN). The validation accuracy is 94.59%, which confirms that our model gives effective results. As future work, we plan to enrich our model with incremental learning algorithms.
References
[1] Sergey Plis, R. Devon Hjelm, Ruslan Salakhutdinov, Elena Allen, H. Jeremy Bockholt, Jeffrey Long, Hans Johnson, Jane Paulsen, Jessica Turner, and Vince Calhoun. Deep learning for neuroimaging: A validation study. Frontiers in Neuroscience, 8, 229, 2014.
[2] Sachin Talathi. Deep Recurrent Neural Networks for seizure detection and early seizure detection systems, 2017.
[3] Sarah Webb. Deep learning for biology. Nature, 2018.
[4] Pranav Rajpurkar, Jeremy Irvin, Aarti Bagul, Daisy Yi Ding, Tony Duan, Hershel Mehta, Brandon J. Yang, Kaylie Zhu, Dillon Laird, Robyn L. Ball, Curtis P. Langlotz, Katie S. Shpanskaya, Matthew P. Lungren, and Andrew Y. Ng. MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs, 2017.
[5] Richard Gruetzemacher and Ashish Gupta. Using Deep Learning for Pulmonary Nodule Detection & Diagnosis. In AMCIS, 2016.
[6] A. Esteva, B. Kuprel, R.A. Novoa, J. Ko, S.M. Swetter, H.M. Blau, and S. Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 2017.
[7] Hunter Park and Connor Monahan. Genetic Deep Learning for Lung Cancer Screening. CoRR, 2017.
[8] D. Yang, C.A. Powell, C. Bai, J. Hu, S. Lu, W. Shi, N. Wang, P. Li, J. Wang, D. Gao, and X. Zhong. Deep Convolutional Neural Networks Based Artificial Intelligence System for Pulmonary Nodule Detection and Diagnosis in United States and Chinese Dataset. American Journal of Respiratory and Critical Care Medicine, 2018.
[9] Dmitrii Bychkov, Nina Linder, Riku Turkki, Stig Nordling, Panu Kovanen, Clare Verrill, Margarita Walliander, Mikael Lundin, Caj Haglund, and Johan Lundin. Deep learning based tissue analysis predicts outcome in colorectal cancer. Scientific Reports, 8, 2018.
[10] Mark Hughes, Irene Li, Spyros Kotoulas, and Toyotaro Suzumura. Medical Text Classification Using Convolutional Neural Networks. Studies in Health Technology and Informatics, 235, 2017.
[11] Simon Baker, Anna Korhonen, and Sampo Pyysalo. Cancer Hallmark Text Classification Using Convolutional Neural Networks. In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pages 1–9, The COLING 2016 Organizing Committee, 2016.
[12] Yoon Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Association for Computational Linguistics, 2014.
[13] John X. Qiu, Hong-Jun Yoon, Paul A. Fearn, and Georgia D. Tourassi. Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports. IEEE J. Biomedical and Health Informatics, 22, pages 244-251, 2018.
[14] Jonathan Bowen. Formal Specification and Documentation using Z: A Case Study Approach, ISBN 1-850-32230-9, 2003.
[15] M. Shili, Hamza Gharsellaoui, and Dalel Kanzari. Distributed Intelligent Systems for Network Security Control. In