Predicting Participation in Cancer Screening Programs with Machine Learning
DONGHYUN (ETHAN) KIM
Gyeonggi Suwon International School, 451 YeongTong-Ro, YeongTong-Gu, Suwon-Si, Gyeonggi-Do, Republic of Korea
Abstract.
In this paper, we present machine learning models based on random forest classifiers, support vector machines, gradient boosted decision trees, and artificial neural networks to predict participation in cancer screening programs in South Korea. The top performing model was based on gradient boosted decision trees and achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.8706 and average precision of 0.8776. The results of this study are encouraging and suggest that with further research, these models can be directly applied to Korea’s healthcare system, thus increasing participation in Korea’s National Cancer Screening Program.
1. Introduction
In South Korea, cancer is the leading cause of death. In 2018, cancer was responsible for 26.5% of all deaths in the country ([9]). As the early detection and diagnosis of cancers increase a patient’s chance of survival considerably, the Korean government established a National Cancer Screening Program covering 6 major cancers: gastric, colorectal, breast, cervical, liver, and lung cancer. To increase participation in this program, the government offers free screening tests for National Health Insurance (NHI) beneficiaries in the lower 50% income bracket; for those in the upper 50% income bracket, 90% of associated costs are covered by the NHI ([7]).
However, even with such policies, the participation rate in cancer screening programs was only 55.6% in 2019, a modest improvement from 50.1% in 2015 ([8]).
In this paper, we aim to expand upon previous research on the factors associated with participation in cancer screening programs by developing machine learning models that predict participation.
First, section 2 discusses related studies and associated factors. Next, section 3 presents the method, including the data used, selected variables, data pre-processing steps, and chosen algorithms. Finally, section 4 presents experimental results, and sections 5 and 6 discuss the results and the implications of the findings.
2. Related Studies
A number of studies have been conducted on the factors associated with cancer screening participation. Using various statistical tools and models, these studies found that factors including education level, income level, and smoking habits are significantly correlated with participation in cancer screening ([6],[2],[3]).
However, very few studies have applied machine learning to predict cancer screening participation. One study used a hybrid neural network to predict breast cancer screening attendance in the United Kingdom and achieved 80% accuracy ([1]). Another study used machine learning (support vector machines, random forests, etc.) to predict hospital attendance and achieved an area under the receiver operating characteristic curve of 0.852 ([4]).
3. Method
3.1. Data.
Data was obtained from the Seventh Korea National Health and Nutrition Examination Survey, which covers 2016 to 2018. A total of 24,269 individuals from 10,611 households participated in the survey, and participants were chosen through stratified cluster sampling. More specifically, primary sampling units were selected based on results from the annual Population and Housing Census.
The available data for the 3 years ranges from demographic information to dietary habits and includes a total of 852 variables ([5]).
3.2. Features/Variables.
Variables were selected based on findings from past studies, and their relevance was verified with the Chi-Squared test ([6],[2],[3]). The 55 chosen variables (89 after one-hot encoding during pre-processing) are listed below.

Variable Description                              Variable Type
------------------------------------------------  -------------
Income Level                                      Ordinal
Education Level                                   Ordinal
Self Perception of Health                         Ordinal
High Blood Pressure Diagnosis                     Binary
Hyperlipidemia Diagnosis                          Binary
Stroke Diagnosis                                  Binary
Myocardial Infarction/Angina Pectoris Diagnosis   Binary
Myocardial Infarction Diagnosis                   Binary
Angina Pectoris Diagnosis                         Binary
Arthritis Diagnosis                               Binary
Osteoarthritis Diagnosis                          Binary
Rheumatoid Arthritis Diagnosis                    Binary
Osteoporosis Diagnosis                            Binary
Tuberculosis Diagnosis                            Binary
Asthma Diagnosis                                  Binary
Diabetes Diagnosis                                Binary
Thyroid Gland Disease Diagnosis                   Binary
Stomach Cancer Diagnosis                          Binary
Liver Cancer Diagnosis                            Binary
Colon Cancer Diagnosis                            Binary
Breast Cancer Diagnosis                           Binary
Cervical Cancer Diagnosis                         Binary
Lung Cancer Diagnosis                             Binary
Thyroid Cancer Diagnosis                          Binary
Other-1 Cancer Diagnosis                          Binary
Other-2 Cancer Diagnosis                          Binary
Depression Diagnosis                              Binary
Atopic Dermatitis Diagnosis                       Binary
Allergic Rhinitis Diagnosis                       Binary
Sinusitis Diagnosis                               Binary
Otitis Media Diagnosis                            Binary
Cataract Diagnosis                                Binary
Glaucoma Diagnosis                                Binary
Macular Degeneration Diagnosis                    Binary
Renal Failure Diagnosis                           Binary
Hepatitis B Diagnosis                             Binary
Hepatitis C Diagnosis                             Binary
Liver Cirrhosis Diagnosis                         Binary
Influenza Vaccination                             Binary
Regular Health Check-up                           Binary
Limited Daily/Social Life                         Binary
Employment                                        Binary
Self Perception of Stress                         Ordinal
Regular Exercise                                  Binary
Nutrition Education Status                        Binary
Private Health Insurance                          Binary
Type of Health Insurance                          Nominal
Occupation Type                                   Nominal
Region                                            Nominal
Unmet Healthcare Needs                            Nominal
Causes of Unmet Healthcare Needs                  Nominal
Self Perception of Body Image                     Ordinal
Hospital Admission in Past Year                   Binary
Drinking Level                                    Ordinal
Smoking Level                                     Ordinal

Table 1. Selected Variables: Description and Type
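The relevance check described above can be illustrated with a minimal sketch of Pearson's chi-squared test of independence between one binary variable and screening participation. The counts are invented for illustration and are not taken from the survey. The paper does not state which implementation was used, so the function names here are hypothetical, and this version applies no continuity correction.

```python
import math

def chi_squared_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_2x2(stat):
    """Upper-tail probability for 1 degree of freedom (valid for 2 x 2 tables only)."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts (not survey data):
# rows = regular health check-up (no, yes), columns = screened (no, yes).
table = [[300, 200], [150, 350]]
stat = chi_squared_statistic(table)
p = p_value_2x2(stat)
print(f"chi-squared = {stat:.2f}, p = {p:.3g}")
```

A small p-value would suggest that the variable and participation are not independent, which is the sense in which a variable is treated as relevant here. Nominal variables with more than two categories give larger tables with more degrees of freedom, for which the 2 x 2 p-value shortcut above does not apply.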
3.3. Data Pre-processing.
First, all rows containing unavailable (marked as "unknown") or missing (null) values were removed from the dataset. Next, all ordinal categorical variables were label encoded and all nominal categorical variables were one-hot encoded. All variables were then scaled to a fixed range (0 to 1) with Min-Max scaling.
Finally, the dataset was shuffled and split into training and test sets (ratio of 80/20 respectively).
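The pre-processing steps above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: the column names, category levels, and the `preprocess` helper are hypothetical, and the toy rows are invented.

```python
import random

# Hypothetical schema: one ordinal, one binary, and one nominal variable.
ORDINAL_LEVELS = {"income": ["low", "mid_low", "mid_high", "high"]}
BINARY_COLUMNS = ["checkup"]
NOMINAL_COLUMNS = ["region"]

def preprocess(rows, seed=0):
    # 1. Drop rows with missing (None) or "unknown" values.
    clean = [r for r in rows
             if all(v is not None and v != "unknown" for v in r.values())]
    # 2. Label-encode ordinal variables by rank; one-hot encode nominal ones.
    categories = {c: sorted({r[c] for r in clean}) for c in NOMINAL_COLUMNS}
    encoded = []
    for r in clean:
        features = [float(ORDINAL_LEVELS[c].index(r[c])) for c in ORDINAL_LEVELS]
        features += [float(r[c]) for c in BINARY_COLUMNS]
        for c in NOMINAL_COLUMNS:
            features += [1.0 if r[c] == cat else 0.0 for cat in categories[c]]
        encoded.append((features, r["screened"]))
    # 3. Min-max scale every feature column to the range [0, 1].
    columns = list(zip(*(f for f, _ in encoded)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = [
        ([(x - lo) / (hi - lo) if hi > lo else 0.0
          for x, lo, hi in zip(f, lows, highs)], y)
        for f, y in encoded
    ]
    # 4. Shuffle, then split 80/20 into training and test sets.
    random.Random(seed).shuffle(scaled)
    cut = len(scaled) * 8 // 10
    return scaled[:cut], scaled[cut:]

# Invented toy rows: 10 valid, 2 invalid.
incomes = ["low", "mid_low", "mid_high", "high"]
regions = ["Seoul", "Busan", "Gyeonggi"]
rows = [
    {"income": incomes[i % 4], "region": regions[i % 3],
     "checkup": i % 2, "screened": i % 2}
    for i in range(10)
]
rows.append({"income": "unknown", "region": "Seoul", "checkup": 1, "screened": 1})
rows.append({"income": "low", "region": None, "checkup": 0, "screened": 0})

train, test = preprocess(rows)
print(len(train), len(test))
```

The sketch scales before splitting, following the order in which the steps are listed. A common alternative is to compute the scaling range on the training set only and reuse it for the test set, so that no information from the test set enters pre-processing.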
3.4. Algorithms.
Algorithms chosen for this task include random forest classifiers, support vector machines, gradient boosted decision trees (XGBoost), and artificial neural networks (with back-propagation).
For all algorithms, grid search and 5-fold cross-validation were performed to find optimal hyper-parameters. Each model was then evaluated on the held-out test set. Note that all experiments were conducted in Google Colaboratory.
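The tuning procedure described above (grid search scored by mean AUC-ROC over 5 folds) can be sketched in plain Python. To keep the example self-contained, a small k-nearest-neighbours scorer stands in for the model; it is not one of the four algorithms used in the study. The stratified fold assignment, the hyper-parameter grid, and the synthetic data are also assumptions made for illustration.

```python
import itertools
import random

def auc_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def knn_scores(train_x, train_y, test_x, k):
    """Stand-in model: score = share of positives among the k nearest neighbours."""
    scores = []
    for x in test_x:
        nearest = sorted(
            range(len(train_x)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
        )[:k]
        scores.append(sum(train_y[i] for i in nearest) / k)
    return scores

def stratified_folds(y, n_folds=5, seed=0):
    """Deal shuffled positives and negatives round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):
        members = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % n_folds].append(i)
    return folds

def grid_search_cv(x, y, grid, n_folds=5):
    """Return the parameter combination with the best mean cross-validated AUC-ROC."""
    folds = stratified_folds(y, n_folds)
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(len(x)) if i not in held]
            scores = knn_scores([x[i] for i in train_idx],
                                [y[i] for i in train_idx],
                                [x[i] for i in held_out], **params)
            fold_scores.append(auc_roc([y[i] for i in held_out], scores))
        mean_score = sum(fold_scores) / n_folds
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Synthetic one-feature data: positives are shifted upward by 1.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
x = [[label + rng.gauss(0.0, 0.5)] for label in y]
best_params, best_score = grid_search_cv(x, y, {"k": [1, 5, 15]})
print(best_params, round(best_score, 4))
```

In this scheme the test set plays no part in choosing hyper-parameters. It is used once, after tuning, to evaluate the selected model.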
4. Results
The performance of the 4 algorithms can be found below in Table 2, and the corresponding plots for the Receiver Operating Characteristic curve (ROC) and Precision-Recall curve (PR Curve) can be found in Figures 1, 2, 3, and 4.
To evaluate each model, 3 metrics were chosen: Area under the Receiver Operating Characteristic curve (AUC-ROC), Average Precision (Area under the Precision-Recall curve), and Accuracy. Note that AUC-ROC was used as the scoring metric for hyper-parameter tuning.

Algorithm                   AUC-ROC   Average Precision   Accuracy
-------------------------   -------   -----------------   --------
Random Forest               0.8613    0.8694              0.8053
Support Vector Machine      0.8340    0.8327              0.8118
XGBoost                     0.8706    0.8776              0.8171
Artificial Neural Network   0.8590    0.8605              0.8118

Table 2. Accuracy Metrics for Trained Models
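For reference, the three metrics reported in Table 2 can be computed from true labels and predicted scores as in the sketch below. The labels and scores are invented, the 0.5 decision threshold used for accuracy is an assumption (the paper does not state one), and average precision is computed as the recall-weighted sum of precisions over score thresholds.

```python
def auc_roc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Sum over thresholds of (gain in recall) x (precision at that threshold)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, i in enumerate(order):
        tp += labels[i]
        fp += 1 - labels[i]
        # Evaluate only at the last item of a group of tied scores.
        last_of_tie = rank + 1 == len(order) or scores[order[rank + 1]] != scores[i]
        if last_of_tie:
            recall = tp / n_pos
            ap += (recall - prev_recall) * tp / (tp + fp)
            prev_recall = recall
    return ap

def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions when scores >= threshold count as positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Invented example: 1 = participated in screening, 0 = did not.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_roc(labels, scores)
ap = average_precision(labels, scores)
acc = accuracy(labels, scores)
print(f"AUC-ROC = {auc:.4f}, AP = {ap:.4f}, accuracy = {acc:.4f}")
```

AUC-ROC and average precision depend only on how the scores rank individuals, whereas accuracy also depends on the chosen threshold. This is one reason a model can rank best on the first two metrics without a proportionally large lead in accuracy.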
Figure 1. Random Forest: (a) ROC Curve, (b) Precision-Recall Curve
Figure 2. Support Vector Machines: (a) ROC Curve, (b) Precision-Recall Curve
Figure 3. XGBoost: (a) ROC Curve, (b) Precision-Recall Curve
Figure 4. Artificial Neural Network: (a) ROC Curve, (b) Precision-Recall Curve
5. Discussion
All 4 models achieved scores between 0.8 and 0.9 on the 3 chosen metrics. The top performing model was based on gradient boosted decision trees (XGBoost) and achieved an AUC-ROC of 0.8706 and average precision of 0.8776.
One point to note is that the 4 models did not differ substantially in their scores. As such, one option to improve prediction accuracy would be to incorporate more features/variables from the original dataset to form more complex models.
Furthermore, there was minimal discrepancy between the models’ performance during cross-validation and testing, which suggests that over-fitting was avoided.
6. Conclusion
One limitation of the models developed in this paper is that the Korea National Health and Nutrition Examination Survey contains self-reported data. As such, portions of the data used (especially answers to subjective survey questions) may have been erroneous, which would reduce the accuracy of the models.
Nevertheless, the models developed in this paper can be directly applied to Korea’s healthcare system. Regional public health officials could use these models, or variations of them depending on data availability, to identify individuals who are unlikely to participate in cancer screening programs. Officials could then contact these individuals to inform them of screening locations and dates. Both public and private hospitals could make use of these models as well, depending on the availability of data.
If such a system is implemented, Korea’s National Cancer Screening Program may see a rise in participation.
Further research on this topic with models of greater complexity and additional features may lead to higher prediction accuracy and a rise in overall cancer screening program participation. For instance, varying the total number of features used may lead to more efficient models. The use of complex neural networks with various types of layers may result in models of greater accuracy as well.
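As a rough illustration of the outreach use described above, the sketch below turns a model's predicted participation probabilities into a contact list. The `select_for_outreach` helper, the identifiers, the probabilities, and the 0.5 cut-off are all hypothetical; a real deployment would choose the cut-off and list size based on outreach capacity.

```python
def select_for_outreach(predictions, threshold=0.5, limit=None):
    """Return IDs whose predicted participation probability is below threshold,
    ordered from least to most likely to participate."""
    flagged = sorted((p, pid) for pid, p in predictions.items() if p < threshold)
    ids = [pid for _, pid in flagged]
    return ids if limit is None else ids[:limit]

# Hypothetical predicted probabilities of participating in screening.
predictions = {"A": 0.91, "B": 0.22, "C": 0.48, "D": 0.05, "E": 0.67}
contact_list = select_for_outreach(predictions, threshold=0.5)
print(contact_list)  # ['D', 'B', 'C']
```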
References
1. Baskaran, V., Guergachi, A., Bali, R.K., & Gorgui-Naguib, R. (2011). Predicting Breast Screening Attendance Using Machine Learning Techniques. IEEE Transactions on Information Technology in Biomedicine, 15, 251-259.
2. Hahm MI, Chen HF, Miller T, O’Neill L, Lee HY. Why Do Some People Choose Opportunistic Rather Than Organized Cancer Screening? The Korean National Health and Nutrition Examination Survey (KNHANES) 2010-2012. Cancer Res Treat. [...]
3. [...] arXiv preprint arXiv:2008.01600.
4. Nelson, A., Herron, D., Rees, G., & Nachev, P. (2019). Predicting scheduled hospital attendance with artificial intelligence. NPJ digital medicine, 2, [...]
5. [...]
6. [...] Asian Pacific Journal of Cancer Prevention, 13(8), 3773–3779. https://doi.org/10.7314/apjcp.2012.13.8.3773
7. National Cancer Screening Cost Support (국가암검진 비용지원). (n.d.). 정부 [...]
8. National Health Insurance Service (국민건강보험공단). (2019). National Cancer Early Screening Program Participation Rate (국가암조기검진사업 수검률).
9. Statistics Korea (통계청). (2019). Trends in Death Rates by Cause of Death (사망원인별 사망률 추이).