Data Mining Techniques in Predicting Breast Cancer
OPEN ACCESS
Journal of Applied Sciences
ISSN 1812-5654
DOI: 10.3923/jas.2020.124.133
Research Article
Hamza Saad and Nagendra Nagarur
Department of System Sciences and Industrial Engineering, The State University of New York, Binghamton, New York, USA
Abstract
Background and Objective:
Breast cancer, which accounts for 23% of all cancers, threatens communities in developing countries because of poor awareness and treatment. Early diagnosis greatly helps the treatment of the disease. The present study was conducted to improve the prediction process and to identify the main factors that affect breast cancer.
Materials and Methods:
Data were collected on eight attributes for 130 Libyan women in the clinical stages of the disease. Data mining was applied using six algorithms to predict the disease based on clinical stages. All the algorithms achieved high accuracy, but the decision tree provided the highest. The decision tree diagram was used to build a rule from each leaf node. Variable ranking was applied to extract the significant variables and to support the final rules for predicting the disease.
Results:
All applied algorithms achieved high prediction performance with different accuracies. Rules 1, 3, 4, 5 and 9 provided pure subsets and were confirmed as significant rules. Only five input variables contributed to building the rules, and not all of them have a significant impact.
Conclusion:
Tumor size plays a vital role in constructing all rules and has a significant impact. The variables inheritance, breast side and menopausal status have an insignificant impact in this analysis, but they may yield remarkable findings under a different data analysis strategy.
Key words: Data mining, predictor screening, rules extraction, breast cancer, tumor size, clinical stages
Citation: Hamza Saad and Nagendra Nagarur, 2020. Data mining techniques in predicting breast cancer. J. Applied Sci., 20: 124-133.
Corresponding Author: Hamza Saad, Department of System Sciences and Industrial Engineering, The State University of New York, Binghamton, New York, USA
Copyright: © 2020 Hamza Saad and Nagendra Nagarur. This is an open access article distributed under the terms of the creative commons attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.
Competing Interest: The authors have declared that no competing interest exists.
Data Availability: All relevant data are within the paper and its supporting information files.
J. Applied Sci., 20 (4): 124-133, 2020
INTRODUCTION
Breast cancer is a common health problem affecting women worldwide. It is one of the most widely known malignancies, accounting for 23% of all cancers, with over one million new cases detected per year. Roughly 4.4 million women are living with breast cancer and more than 400,000 die annually from the disease, which accounts for 14% of all cancer deaths. It is the most common cause of female death in industrialized countries, the second most common cause in the world and the third most common in developing countries. Protection from the disease comes from an early physical exam, which makes therapy more beneficial. Despite developments in treatment strategies, advanced breast cancer remains incurable and the goals of therapy range from symptom palliation to extending survival.

Breast cancer is the uncontrolled growth of cells in the breast. The disease mostly affects females, but males also suffer from it. Different factors can indicate the occurrence of the disease. The most important factor associated with breast cancer is family history (inheritance). Other risk factors that can lead to breast cancer are food, environment, demographics, marital status, health condition, breast feeding, menarche, menopause, age and number of children. The rate of breast cancer differs between areas based on specific factors. Similarly, mortality rates vary between regions; in industrialized countries, the mortality rate is lower than in developing countries.

Small tumors are more likely to be treated successfully through early detection. Delayed detection of breast cancer is correlated with dangerous clinical stages and a low survival percentage. Reports in developed countries indicated that the median time to consultation was 14-61 days. A delay of more than three months before a physician's examination occurred in 14-53% of cases. Minority ethnicity status, low socio-economic status and younger age were correlated with a longer duration of symptoms. Diagnosis delay was also correlated with older age, minor symptoms and fear of informing anyone.

In developing countries, the management of breast cancer faces significant social, medical and economic problems. Patients with breast cancer usually present at a dangerous clinical stage, are predominantly premenopausal and young, have early disease recurrence and are associated with high mortality. Despite advances in treatment, the mortality rate is still high. Therefore, it is necessary to secure good cancer control by applying different strategies, such as improving the understanding of early detection and finding prognostic variables which, applied together with traditional factors, can predict the outcome of the individual patient and allow selection of appropriate therapy.

In Libya, as an example of a developing country, the statistics show 18.8 new cases for every 100,000 women annually. Most patients in Libya present with a dangerous case because they fear early detection or do not have enough knowledge about the disease. The patients are usually younger than in Europe, in line with the pattern typical of North Africa and the Middle East. To improve breast cancer care, a better understanding of the predicting causes, factors, attributes and treatment delay is needed.

In recent years, the number of breast cancer patients has increased dramatically in developing countries for many reasons, such as increased pollution and poor awareness of the seriousness of the disease. Many women, especially in Libya, fear early detection, which is necessary for treating the disease. Most of the cases admitted to Libyan hospitals are in a dangerous health condition that responds slowly to treatment and cancer medicine. Also, cancer treatment in Libya is still weak and does not work well for patients whose bodies have been exhausted by the disease, leading to the patient's death. Some women in developing countries who are affected by breast cancer survive with one breast for a limited life.

This study analyzed the disease by detailing some variables based on the collected data in order to gain more understanding of the main causes and to analyze the breast cancer data to reach a logical decision that helps new cases based on the final prediction. The study presents an integrated strategy using data mining as the main step to predict and extract important relationships from supervised attributes of breast cancer.

MATERIALS AND METHODS
Study area:
The study was carried out between January and November, 2019 at the Watson School, Department of System Sciences and Industrial Engineering, State University of New York at Binghamton.
Data mining:
Data mining and artificial intelligence have become widely applied in medicine to help physicians extract relevant information for decision-making, especially in critical cases that depend on main causes which may be understood through data mining (Fig. 1).

Fig. 1: Framework of data analysis by using data mining
[Workflow diagram. Widgets: Dataset, Feature statistics, Rank, kNN, Naive Bayes, Neural network, Logistic regression, Tree, SVM, Test and score, Confusion matrix, ROC analysis, Tree viewer, Rules extraction from tree diagram, Confirm rules.]

Many studies have used statistics as a traditional method of capturing and understanding data by focusing on the p-value. Moreover, early diagnosis is covered by several papers and it is proven that early diagnosis of any disease, even cancer, may increase the likelihood of successful treatment. The study focused on the application of data mining in predicting breast cancer using clinical stages, applying KNN, Tree, SVM, Neural Network, Naïve Bayes and Logistic Regression to achieve this goal. Although all the algorithms produced high accuracy, the decision tree was chosen to complete and extend the solution. Any accuracy produced by an algorithm is due to the influence of the input variables. To understand these variables, the decision tree was drawn and each split was followed to its decision leaf to build the corresponding rule. One rule was constructed per decision leaf, and only a decision leaf containing a pure subset (100% classification from the same clinical stage) was confirmed as a strong rule. Knowing only the variables that built the decision tree may not be enough; to understand each variable, feature selection was used to distinguish influential variables from weak variables and support the final solution.
Software:
Orange is a component-based visual programming package for data mining, data visualization, machine learning and data analysis. Orange components are called widgets and they range from preprocessing, subset selection and simple data visualization to empirical evaluation of predictive modeling and learning algorithms. Visual programming is conducted through an interface in which workflows are established by linking predefined or user-designed widgets, while advanced users can use the Orange package as a Python library for data manipulation and widget alteration.
Summary of applied algorithms
KNN:
K-nearest neighbors (KNN) is a simple algorithm that stores all available instances and predicts the target based on a similarity measure (distance functions). KNN was already applied in pattern recognition and statistical estimation in the early 1970s as a non-parametric technique.
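As a minimal sketch of the idea described above, the following pure-Python function classifies a query point by majority vote among its k nearest stored instances under Euclidean distance. The function name `knn_predict`, the value k = 3 and the toy points are illustrative assumptions, not the configuration used in this study.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k stored instances closest to query.

    train: list of (feature_vector, label) pairs; query: a feature vector.
    """
    # Sort the stored instances by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Vote among the labels of the k nearest instances.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (feature 1, feature 2) -> class label.
toy_train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
             ((8.0, 8.0), "B"), ((9.0, 8.5), "B"), ((8.5, 9.0), "B")]
```

Because features measured on different scales (for example age and tumor size) would dominate the distance unevenly, a practical use would normally rescale the features first.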
Tree:
The decision tree builds classification or regression models in the form of a tree diagram. It splits a dataset into smaller and smaller subsets while an associated decision tree is gradually developed. The outcome is a tree with decision nodes and leaf nodes. A decision node (e.g., tumor size) has 2 or more branches (e.g., tumor size and inheritance), each representing values of the tested attribute. A leaf node (e.g., clinical stages) represents the final decision on the categorical target. The topmost decision node in the tree, which corresponds to the best predictor or input variable, is called the root node. Decision trees can handle both numerical and categorical data.

A decision tree was applied in the study to classify data by clinical stage because of its accuracy. The decision tree diagram is used to build the rules based on pure subsets (100% at a leaf node). Orange software was applied to analyze the data and generate the tree graph because it draws the tree with full details and simple splitting.
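The notion of a pure subset can be made concrete with a small helper. The sketch below computes the purity of a leaf as the share of its majority class. The helper name `leaf_purity` is an assumption for illustration; the first leaf mirrors the 36-of-36 stage IV leaf reported in this study, while the class of the single minority patient in the second leaf is hypothetical.

```python
def leaf_purity(class_counts):
    """Return (majority_class, purity) for a leaf given its per-class counts."""
    total = sum(class_counts.values())
    majority = max(class_counts, key=class_counts.get)
    return majority, class_counts[majority] / total

# A pure leaf: every instance belongs to the same clinical stage.
pure_leaf = {"Stage IV": 36}
# An impure leaf: 24 of 25 instances share the majority stage
# (the minority class shown here is an assumed placeholder).
mixed_leaf = {"Stage III": 24, "Stage II": 1}
```

A rule read from `pure_leaf` would be confirmed as a strong rule under the criterion used here, whereas a rule read from `mixed_leaf` would not.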
SVM:
A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the 2 classes. The cases (vectors) that define the hyperplane are called the support vectors.
Neural network:
An Artificial Neural Network (ANN) is a system based on biological neural networks, like a brain. An ANN is composed of a network of artificial neurons (known as "nodes"). These nodes are connected in a network, and each connection to another node is assigned a value based on its strength: inhibition (maximum being -1.0) or excitation (maximum being +1.0). A transfer function is built into the design of each node. The three types of neurons in an artificial neural network are input nodes, hidden nodes and output nodes. The input nodes take in information in a form that can be expressed numerically. The information is presented as activation values, where each node is given a number; the higher the number, the greater the activation. This information is then passed throughout the network. Based on the connection weights (strengths), transfer functions and excitation or inhibition, the activation value is passed from node to node. Each node sums the activation values it receives and then modifies the value based on its transfer function. The activation flows through the network, through the hidden layer, until it reaches the output nodes. The output nodes then reflect the input in a meaningful way to the outside world.
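The summing and transfer steps described above can be sketched for a single node and for a network with one hidden layer. The logistic sigmoid used as the transfer function, the function names and the example weights are assumptions for illustration; the text does not specify which transfer function or architecture was used.

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Sum the weighted incoming activations, then apply a sigmoid transfer function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def forward(inputs, hidden_weights, output_weights):
    """Pass activations from the input nodes through one hidden layer to one output node."""
    # Each hidden node has its own list of connection weights to the input nodes.
    hidden = [node_output(inputs, w) for w in hidden_weights]
    return node_output(hidden, output_weights)

# Hypothetical weights: positive values excite, negative values inhibit.
example = forward([1.0, 0.5], [[0.8, -0.4], [0.3, 0.9]], [1.0, -1.0])
```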
Naïve Bayes:
The Naive Bayes classifier is derived from Bayes' theorem with independence assumptions between the variables (predictors). A Naive Bayes classifier is easy to build, with no complicated iterative parameter estimation, which makes it useful for large medical datasets. Despite its simplicity, the algorithm often does surprisingly well and is widely applied because it often outperforms more sophisticated classification methods. It can predict only a categorical output.
Logistic regression:
Logistic regression (LR) predicts the probability of an outcome that can take only two values (a dichotomy). The prediction is based on one or more predictors (categorical or numerical). Linear regression is not suited to predicting the value of a binary variable because it will predict values outside the acceptable range (between 0 and 1); moreover, since dichotomous experiments can take only one of two possible values for each test, the residuals will not be normally distributed about the predicted line.
Predictor screening:
For big data with a large number of variables, applying data mining algorithms becomes difficult; for example, neural networks become impossible to manage when the number of input variables exceeds a few hundred or even fewer. Therefore, it is a practical necessity to screen and choose, from a large set of predictor variables, the essential ones that are most likely to be useful for predicting the outputs of interest. The objective of the Predictor Screening module is to choose a set of predictor variables, based on the dependent variable, from an extensive list of candidates, allowing the analysis to focus on a more manageable set. The module handles categorical and continuous predictors and estimates their predictive power, which can improve accuracy by using influential predictors. This study includes few variables, but the rules will be built from some of them, whether weak or significant. Predictor screening is used to support the rules by ranking the significant variables in the dataset.
Data collection:
The data include 130 patients diagnosed with one of the dangerous clinical stages of breast cancer. They were collected between 2017 and 2018 at a single site (Oncology Hospital, Tripoli, Libya) based on psychological and physical aspects. Eight input variables are used to predict the clinical stages.

Age is a numerical variable between 25 and 66 years, with mean 44.1 and standard deviation 10.325. Menopausal status is divided into two categories: perimenopause with 84 patients (64.61%) and postmenopausal with 46 patients (35.38%). Tumor size is between 1.9 and 34, with a mean of 13.88 and a standard deviation of 7.605. The breast side includes 2 categories: the right side with 72 patients (55.38%) and the left side with 58 patients (44.61%). Lymph Node (LN) status includes two categories: LN- with 28 patients (21.53%) and LN+ with 102 patients (78.46%). The histological grade is a numerical variable between 1 and 3, with mean 2.415 and standard deviation 0.581. For early detection, 41 women were diagnosed early, before a dangerous stage, and 89 patients were not diagnosed early. Inheritance includes 2 categories: Yes with 98 patients (75.38%) and No with 32 patients (24.61%). The clinical stage is the response variable to be predicted by data mining; it is divided into four stages: stage I with 6 patients (4.61%), stage II with 44 patients (33.84%), stage III with 44 patients (33.84%) and stage IV with 36 patients (27.69%). The stages are illustrated in Fig. 2 (from Johns Hopkins Medicine). Table 1 presents the statistics of the data.

Fig. 2: Clinical stages of breast cancer
Source: Gwen Shockey/Science photo library

Table 1: Data collection
Name | Center | Dispersion | Minimum | Maximum | Category I | Category II | Missing (%)
Age | 44.0 | 0.23 | 25.00 | 66.00 | - | - | 0
Breast | Right side | 0.69 | - | - | Right, 72 | Left, 58 | 0
Clinical stages | Stage 2 | 1.23 | - | - | Stage I, 6; Stage II, 44 | Stage III, 44; Stage IV, 36 | 0
Early detection | No | 0.62 | - | - | Yes, 41 | No, 89 | 0
Histological grade | 2.42 | 0.24 | 1.00 | 3.00 | - | - | 0
Inheritance | Yes | 0.56 | - | - | Yes, 98 | No, 32 | 0
Lymph node status | LN+ | 0.52 | - | - | LN+, 102 | LN-, 28 | 0
Menopausal status | Perimenopause | 0.65 | - | - | Perimenopause, 84 | Postmenopausal, 46 | 0
Tumor size | 13.88 | 0.55 | 1.90 | 34.00 | - | - | 0
LN+ and LN-: Status of Lymph Node (LN), whether positive or negative

This study did not consider stage 0 because all the patients in the case-study hospital were ranked from I to IV:
Stage 0:
The cancer appears only inside the milk duct. This stage is non-invasive and includes ductal carcinoma in situ
Stage I:
Includes small tumors that affect only a small area of the sentinel lymph node
Stage II:
Includes large tumors that affect some nearby lymph nodes
Stage III:
The tumors grow into surrounding tissues such as muscle, breast skin and lymph nodes
Stage IV:
The tumors start in the breast and spread to other parts of the body (Medical News Today)

As time is lost without detecting the disease, the risk of death increases, especially if the patient neglects herself by not getting treatment or by taking ineffective medicine, which is common in developing countries. Therefore, early detection should be an essential requirement for all women, especially those with a family history of the disease.

Data were collected around the clinical stage variable, which is used as the output, with eight predictors used to classify future patients into the correct clinical stage. The clinical stages variable was tested before confirming the final solution to extract high accuracy from the dataset.

The performance of the algorithms ranked from high to low as follows: Tree, Logistic Regression, Naïve Bayes, KNN, Neural Network and SVM, which gives the lowest performance.

Table 2: Results for different algorithms based on the response of clinical stages
Algorithm | AUC | CA | F1 | Precision | Recall | Algorithm ranking
KNN | 0.926 | 0.815 | 0.797 | 0.784 | 0.815 | 4
Tree | 0.949 | 0.900 | 0.900 | 0.901 | 0.900 | 1
SVM | 0.950 | 0.792 | 0.772 | 0.755 | 0.792 | 6
Neural network | 0.928 | 0.792 | 0.776 | 0.761 | 0.792 | 5
Naïve Bayes | 0.956 | 0.823 | 0.829 | 0.849 | 0.823 | 3
Logistic regression | 0.951 | 0.838 | 0.827 | 0.844 | 0.838 | 2
AUC: Area under the curve, CA: Classification accuracy, F1: F measure

Table 3: Confusion matrix
Stages | Stage I | Stage II | Stage III | Stage IV | Sum
I | 5 | 1 | 0 | 0 | 6
II | 0 | 37 | 7 | 0 | 44
III | 0 | 5 | 39 | 0 | 44
IV | 0 | 0 | 0 | 36 | 36
Sum | 5 | 43 | 46 | 36 | 130

Some equations applied to measure and validate algorithm performance are:
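$$\text{Recall} = \frac{Tp}{Tp + Fn}$$

$$\text{Precision} = \frac{Tp}{Tp + Fp}$$

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{True negative rate} = \frac{Tn}{Tn + Fp}$$

where Tp is the number of true positives, Fn the number of false negatives, Fp the number of false positives and Tn the number of true negatives.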
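As a worked check of these measures, the sketch below computes per-stage recall, precision and F1 from the decision tree confusion matrix in Table 3, treating each stage in turn as the positive class, and then averages them weighted by the number of patients in each stage. Reading the rows as actual stages and the columns as predicted stages is an assumption (the row sums 6, 44, 44 and 36 match the stage counts in the data), as is the support-weighted averaging; the function names are illustrative.

```python
STAGES = ["I", "II", "III", "IV"]

# Rows: actual stage, columns: predicted stage (values from Table 3).
CONFUSION = [
    [5, 1, 0, 0],
    [0, 37, 7, 0],
    [0, 5, 39, 0],
    [0, 0, 0, 36],
]

def per_class_metrics(matrix):
    """Return recall, precision, F1 and support for each class (one-vs-rest)."""
    n = len(matrix)
    metrics = []
    for i in range(n):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                        # actual i, predicted other
        fp = sum(matrix[r][i] for r in range(n)) - tp   # predicted i, actual other
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append({"recall": recall, "precision": precision,
                        "f1": f1, "support": tp + fn})
    return metrics

def weighted_average(metrics, key):
    """Average a metric over classes, weighting each class by its support."""
    total = sum(m["support"] for m in metrics)
    return sum(m[key] * m["support"] for m in metrics) / total

tree_metrics = per_class_metrics(CONFUSION)
```

Under these assumptions the weighted averages round to a recall of 0.900, a precision of 0.901 and an F1 of 0.900, which agree with the Tree row of Table 2.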
RESULTS AND DISCUSSION
Data mining in prediction:
Six algorithms were tested to obtain the best accuracy and to characterize the classification using 10-fold cross-validation. When little data is available, K-fold cross-validation is used to achieve an unbiased estimate of model performance by dividing the limited data into K subsets of equal size. Each time, one of the subsets is left out of the training set and used as the test set. The breast cancer data were split into folds of the same size. The best accuracy was obtained from the decision tree. The results of the algorithms are shown in Table 2.

To evaluate, analyze and validate the dataset, five measures are shown in Table 2: Area Under the Curve (AUC), Classification Accuracy (CA), F measure (F1), precision and recall. The data mining algorithms applied in this study handled the breast cancer data without preprocessing because it is real-world data that is complete and consistent, with no missing instances. The study data were collected based on the clinical stages, with limited scope and number of patients. The highest AUC was recorded by Naïve Bayes (0.956) and the lowest by KNN (0.926). However, all algorithms provided high performance. In CA, the tree algorithm generated the highest accuracy with 90% and the lowest CA (0.792) was recorded by 2 algorithms, SVM (Support Vector Machine) and NN (Neural Network). On the precision indicator in Table 2, the tree algorithm was likewise confirmed as the best algorithm, with 90%. All applied algorithms provided predictions with different performances.

In the confusion matrix of the decision tree (Table 3), the output variable includes four clinical stages and each stage is classified with a particular accuracy value; the accuracy of each stage is calculated to support reliable results. The confusion matrix is presented as actual and predicted values; according to these values, the accuracy of each stage has been calculated.
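The 10-fold cross-validation procedure described at the start of this section can be sketched as follows. This is a plain, unstratified split written for illustration; the function name `k_fold_indices` and the fixed random seed are assumptions, and the software used in the study may assign patients to folds differently.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training set: every fold except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# With 130 patients and 10 folds, each round tests on 13 and trains on 117.
splits = list(k_fold_indices(130, k=10))
```

Each model is fitted on the training indices and scored on the held-out fold, and the ten fold scores are then combined into one performance estimate.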
Fig. 3: Tree graph from decision tree
[Tree diagram. The root node (Stage II, 33.8%, 44/130) is split on tumor size at 19.00; further splits use tumor size, early detection, breast side, inheritance and age.]

Each stage gets a specific prediction in classification and, according to each prediction, a vital relationship between inputs and output can be defined.

In Stage I, 5 instances are classified correctly and only 1 instance is misclassified, so the accuracy for stage I is 83.33%. In Stage II, 37 instances are classified correctly and 7 instances are misclassified, so the accuracy for this stage is 84.09%. For Stage III, 39 instances are classified correctly and 5 instances are misclassified, so the accuracy for stage III is 88.63%. In Stage IV, 36 instances are classified correctly and none is misclassified, so the accuracy for stage IV is 100%.

The data cover 130 patients. Summing over the stages gives 1+7+5+0 = 13 misclassified instances and 5+37+39+36 = 117 correctly classified instances. Then:
$$\text{Total accuracy} = \frac{\text{Correctly classified}}{\text{Total patients' sample}} = \frac{117}{130} = 0.9 = 90\%$$
So, the overall accuracy using the decision tree is 90%. Figure 3 shows the graph of the decision tree that is used to extract the rules in order to support the final decision.

The predictor that provides the most splits is tumor size, as shown in Fig. 3. The split starts from stage II as the response and tumor size as the predictor. From the first split, the full category of Stage IV is predicted and ends with a pure classification (36 of 36). The decision tree then used the tumor size predictor again to build a new split between tumor size and early detection. The early detection predictor extended the tree to many splits. However, tumor size provided pure subsets at Stage I and Stage II, where splitting ended. The breast and tumor size predictors split from early detection: breast provides more splits, while tumor size ends the splitting. At the tumor size split, there is a pure subset at Stage II, and there is an impure classification at Stage III from the left side of the breast. On the right side, the split starts again with inheritance to get an impure classification at stage II. Stage II got another impure classification before another split is started by the age predictor. The split ends at the age predictor with a pure classification at stage III and an impure classification at stage III. (Sometimes leaf nodes have the same category because the split is generated based on the predictors or input variables used to predict the clinical stages.)

Table 4: Impact of each variable in the dataset (the first five variables are the significant ones)
Input predictor | Chi-square | p-value | Variable number
Histological grade | 74.5812 | 0.000000 | 6
Early detection | 68.2713 | 0.000000 | 9
Tumor size | 211.9512 | 0.000000 | 3
Lymph Node (LN) status | 48.8155 | 0.000000 | 5
Age | 34.7038 | 0.000520 | 1
Breast side | 5.8531 | 0.118980 | 4
Inheritance | 3.5891 | 0.309390 | 8
Menopausal status | 1.8924 | 0.595040 | 2

Two predictors, histological grade and lymph node status, ranked among the most significant variables in the dataset, but the decision tree algorithm did not use them to split or classify because they do not fit the splitting strategy. Furthermore, the predictors inheritance and breast side ranked as insignificant, having a weak impact on the clinical stages, yet the decision tree used them to split and build more decision leaves. They build more splits with impure classification.
Predictor screening:
Before building rules from decision leaves or using a decision tree to build a tree structure, the impact of each variable in the dataset must be determined. To that end, the chi-square statistic and p-value were calculated to rank each variable against the output (clinical stages). Based on the results, the variables breast side, inheritance and menopausal status did not have a significant impact on the output. However, the variables histological grade, early detection, tumor size, Lymph Node (LN) status and age had a significant effect on the output. This ranking was applied to support the rules extracted from the decision tree. Each rule was constructed using some of these variables, and the significance of each variable then helps to support the final rules and results. Table 4 shows the rank of the effect of each variable based on chi-square and p-value.

Data mining does not end with predicting the next patient from existing data; its role continues in extracting the significant variables that may help to support the decision. Some variables contributed to predicting the clinical stages to varying degrees, from weak to strong. By neglecting weak variables and focusing on strong ones, more robust predictions can be built.
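The chi-square ranking can be illustrated with a minimal sketch of the Pearson chi-square statistic for a contingency table of predictor categories against clinical stages. The function name `chi_square` and both example tables are hypothetical; the study's cross-tabulated counts are not reported, and how the screening module bins numerical predictors is not stated.

```python
def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom.

    table: list of rows of observed counts (rows = predictor categories,
    columns = outcome categories).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical two-category predictor against four stages: no association.
independent = [[10, 10, 10, 10], [10, 10, 10, 10]]
# Hypothetical 2 x 2 table with a perfect association.
dependent = [[20, 0], [0, 20]]
```

A two-category predictor such as breast side, crossed with four clinical stages, has (2-1) x (4-1) = 3 degrees of freedom. The p-value is the upper-tail probability of the chi-square distribution at the computed statistic, which a statistics library can supply (for example `scipy.stats.chi2.sf`).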
Rules extraction:
From the graph of the decision tree, more knowledge can be obtained by extracting rules and supporting them with the accuracy and effect of each variable on the final prediction:

1. If tumor size >19, then clinical stage = stage IV (100% pure subset)
2. If tumor size <= 19 and >8, early detection = Yes, tumor size >9.5, then clinical stage = stage III (75% pure)
3. If tumor size <= 19 and >8, early detection = Yes, tumor size <= 9.5, then clinical stage = stage II (100% pure subset)
4. If tumor size <= 19 and <= 8, tumor size >2.9, then clinical stage = stage II (100% pure subset)
5. If tumor size <= 19 and <= 8, tumor size <= 2.9, then clinical stage = stage I (100% pure subset)
6. If tumor size <= 19 and >8, early detection = No, breast = left side, then clinical stage = stage III (96% pure)
7. If tumor size <= 19 and >8, early detection = No, breast side = right, inheritance = Yes, tumor size >9, age >53, then clinical stage = stage III (66.7% pure)
8. If tumor size <= 19 and >8, early detection = No, breast side = right, inheritance = No, then clinical stage = stage II (66.7% pure)
9. If tumor size <= 19 and >8, early detection = No, breast = right side, inheritance = Yes, tumor size >9, age <= 53, then clinical stage = stage III (100% pure subset)
10. If tumor size <= 19 and >8, early detection = No, breast = right side, inheritance = Yes, tumor size >9, age >53, then clinical stage = stage III (66.7% pure)

Few predictors were used by the algorithm to generate 90% accuracy. Rule extraction and predictor screening are joined in Table 5 to measure how the significant predictors affect the clinical stages.

Table 5: Effect of each variable in rules construction
Variables | Rule 1 | Rule 3 | Rule 4 | Rule 5 | Rule 9
Tumor size | Significant | Significant | Significant | Significant | Significant
Early detection | - | Significant | - | - | Significant
Breast side | - | - | - | - | Insignificant
Age | - | - | - | - | Significant
Inheritance | - | - | - | - | Insignificant
Prediction | Stage IV | Stage II | Stage II | Stage I | Stage III

Rules 1, 3, 4, 5 and 9 have a 100% pure subset for their category of the clinical stage. However, the final accuracy of the algorithm does not exceed 90% because it is affected by all input variables, whether significant or insignificant (weak predictors increased the splits in the tree with impure subsets). According to Rule 1, only one variable affects stage IV: tumor size, which must exceed 19. Hence, patients with tumor size greater than 19 can be predicted to enter one of the dangerous stages of breast cancer. For Rule 3, 2 predictors affect stage II, tumor size and early detection, but tumor size has the higher impact and the patient must get an early diagnosis. For Rule 4 and Rule 5, only tumor size affects stage II and stage I: tumor size is at most 8 and greater than 2.9 for stage II, or at most 2.9 for stage I. However, in Rule 9, many variables affect the final classification of stage III: tumor size must be greater than 8, early detection is No (late diagnosis), the breast side is right and there are close blood relatives in the same family with the disease (inheritance). Figure 4 shows how the tumor damages the breast if the disease is not detected early.

Fig. 4: Tumor size for each stage
Source: Medical news today
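The five pure rules can be written directly as a small function. This is a sketch of rules 1, 3, 4, 5 and 9 only; the function name `predict_stage`, its argument names and the choice to return `None` when no pure rule applies are assumptions for illustration, and the function is not a substitute for the full decision tree.

```python
def predict_stage(tumor_size, early_detection=None, breast_side=None,
                  inheritance=None, age=None):
    """Apply the five pure rules (1, 3, 4, 5 and 9); return None if none fires."""
    if tumor_size > 19:                                    # Rule 1
        return "Stage IV"
    if tumor_size <= 2.9:                                  # Rule 5
        return "Stage I"
    if tumor_size <= 8:                                    # Rule 4
        return "Stage II"
    # From here on: 8 < tumor_size <= 19.
    if early_detection == "Yes" and tumor_size <= 9.5:     # Rule 3
        return "Stage II"
    if (early_detection == "No" and breast_side == "right"
            and inheritance == "Yes" and tumor_size > 9
            and age is not None and age <= 53):            # Rule 9
        return "Stage III"
    return None  # Only impure rules cover this case.
```

For example, `predict_stage(12.0, "No", "right", "Yes", 40)` follows Rule 9 and returns "Stage III", while a patient with tumor size 12, late detection and a left-side tumor falls under the impure Rule 6 and yields `None` here.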
CONCLUSION
Breast cancer is a common cancer among females in developing countries. Over time, the cancer progresses to a dangerous stage in which the tumor spreads through the body. Early detection leads to control of the disease, while neglecting it means an increased chance of death. Many factors are related to the clinical stages, but tumor size is a major factor that requires the patient to take greater care with regular check-ups (every 3 months to a year). In developing countries, tumor size is a big problem threatening patients who are less educated about dealing with cancer. The accuracy of each algorithm is high because the data are strongly related to the output (clinical stages).
SIGNIFICANCE STATEMENT
More factors may affect the disease. Family history is the big challenge for people considering early detection. Tumor size increases from stage 0 to stage IV when early detection is neglected or ineffective medicine is used because of the high price of cancer medicine. There are significant and direct relationships between tumor size and other factors. Tumor size can play a main role in predicting a specific clinical stage.
REFERENCES
1. Benamer, H., 2012. Healthcare system in Libya. Annually factual report.
2. Siegel, R., J. Ma, Z. Zou and A. Jemal, 2014. Cancer statistics, 2014. CA: Cancer J. Clin., 64: 9-29.
3. Ermiah, E., F. Abdalla, A. Buhmeida, E. Larbesh, S. Pyrhönen and Y. Collan, 2012. Diagnosis delay in Libyan female breast cancer. BMC Res. Notes, 10.1186/1756-0500-5-452.
4. Ermiah, A., 2013. Libyan breast cancer: Health services and biology. PhD Thesis, University of Turku, Turku, Finland.
5. Saad, H. and N. Nagarur, 2017. Data analysis of early detection and clinical stages of breast cancer in Libya. The 6th Annual World Conference of the Society for Industrial and Systems Engineering.
6. Salem, M., 2011. Data mining techniques and breast cancer prediction: A case study of Libya. PhD Thesis, http://shura.shu.ac.uk/20611/1/10701258.pdf.
7. Kim, J., S. Lee, S.Y. Bae, M.Y. Choi and J. Lee et al., 2012. Comparison between screen-detected and symptomatic breast cancers according to molecular subtypes. Breast Cancer Res. Treat., 131: 527-540.
8. Ramirez, A.J., A.M. Westcombe, C.C. Burgess, S. Sutton, P. Littlejohns and M.A. Richards, 1999. Factors predicting delayed presentation of symptomatic breast cancer: A systematic review. Lancet, 353: 1127-1131.
9. Burgess, C.C., A.J. Ramirez, M.A. Richards and S.B. Love, 1998. Who and what influences delayed presentation in breast cancer? Br. J. Cancer, 77: 1343-1348.
10. Arndt, V., T. Sturmer, C. Stegmaier, H. Ziegler, G. Dhom and H. Brenner, 2002. Patient delay and stage of diagnosis among breast cancer patients in Germany: A population based study. Br. J. Cancer, 86: 1034-1040.
11. Meechan, G., J. Collins and J. Petrie, 2003. The relationship of symptoms and psychological factors to delay in seeking medical care for breast symptoms. Preventive Med., 36: 374-378.
12. Velikova, G., L. Booth, C. Johnston, D. Forman and P. Selby, 2004. Breast cancer outcomes in South Asian population of West Yorkshire. Br. J. Cancer, 90: 1926-1932.
13. Richardson, L., B. Langholz, L. Bernstein, C. Burciaga, K. Danley and K. Ross, 1992. Stage and delay in breast cancer diagnosis by race, socioeconomic status, age and year. Br. J. Cancer, 65: 922-926.
14. Jenner, C., A. Middleton, M. Webb, R. Ommen and T. Bates, 2000. In-hospital delay in the diagnosis of breast cancer. Br. J. Surg., 87: 914-919.
15. Ahmad, G., T. Eshlaghy, A. Poorebrahimi, M. Ebrahimi and R. Razavi, 2013. Using three machine learning techniques for predicting breast cancer recurrence. J. Health Med. Inform., 10.4172/2157-7420.1000124.
16. El-Mistiri, M., A. Verdecchia, I. Rashid, N. El-Sahli, M. El-Mangush and M. Federico, 2006. Cancer incidence in Eastern Libya: The first report from the Benghazi cancer registry. Int. J. Cancer, 120: 392-397.
17. Najjar, H. and A. Easson, 2010. Age at diagnosis of breast cancer in Arab nations. Int. J. Surg., 8: 448-452.
18. Montazeri, A., M. Ebrahimi and N. Mehrdad, 2003. Delayed presentation in breast cancer: A study in Iranian women. BMC Womens Health, 3: 4-4.
19. Chagpar, A.B., C.R. Crutcher, L.B. Cornwell and K.M. McMasters, 2011. Primary tumor size, not race, determines outcomes in women with hormone-responsive breast cancer. Surgery, 150: 796-801.
20. Saad, H., 2018. Application of data mining to improve evaluation process. 2nd Edn., Scholars' Press Australia, ISBN-10: 6202318422, Pages: 82.
21. Saad, H. and N. Nagarur, 2018. Decision tree-based rules extraction to predict breast cancer using clinical stages as a dependent variable. The 7th Annual World Conference of the Society for Industrial and Systems Engineering.