

PREDICTING PARTICIPATION IN CANCER SCREENING PROGRAMS WITH MACHINE LEARNING

DONGHYUN (ETHAN) KIM

Gyeonggi Suwon International School, 451 YeongTong-Ro, YeongTong-Gu, Suwon-Si, Gyeonggi-Do, Republic of Korea

Abstract.

In this paper, we present machine learning models based on random forest classifiers, support vector machines, gradient boosted decision trees, and artificial neural networks to predict participation in cancer screening programs in South Korea. The top performing model was based on gradient boosted decision trees and achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.8706 and average precision of 0.8776. The results of this study are encouraging and suggest that with further research, these models can be directly applied to Korea's healthcare system, thus increasing participation in Korea's National Cancer Screening Program.

1. Introduction

In South Korea, cancer is the leading cause of death. In 2018, cancer was responsible for 26.5% of all deaths in the country ([9]). As the early detection and diagnosis of cancer considerably increases a patient's chance of survival, the Korean government established a National Cancer Screening Program covering 6 major cancers: gastric cancer, colorectal cancer, breast cancer, cervical cancer, liver cancer, and lung cancer. To increase participation in this program, the government offers free screening tests for National Health Insurance (NHI) beneficiaries in the lower 50% income bracket; for those in the upper 50% income bracket, 90% of associated costs are covered by the NHI ([7]).

However, even with such policies, the participation rate in cancer screening programs was 55.6% in 2019, a minimal improvement from 50.1% in 2015 ([8]).

In this paper, we aim to expand upon previous research on the factors associated with participation in cancer screening programs by developing machine learning models to predict participation.

First, section 2 discusses related studies and associated factors. Next, section 3 presents the method, including the data used, selected variables, data pre-processing steps, and chosen algorithms. Finally, section 4 presents experimental results, and sections 5 and 6 discuss the results and the implications of the findings.

2. Related Studies

A number of studies have been conducted on the factors associated with cancer screening participation. Using various statistical tools and models, these studies found that factors including education level, income level, and smoking habits are significantly correlated with participation in cancer screening ([6],[2],[3]).

However, very few studies have applied machine learning to predict cancer screening participation. One study utilized a hybrid neural network to predict breast cancer screening attendance in the United Kingdom and achieved 80% accuracy ([1]). Another study utilized machine learning (support vector machines, random forests, etc.) to predict hospital attendance and achieved an area under the receiver operating characteristic curve of 0.852 ([4]).

3. Method

3.1. Data.

Data was obtained from the Seventh Korea National Health and Nutrition Examination Survey, covering 2016 to 2018. A total of 24,269 individuals (10,611 households) participated in the survey, and participants were chosen through stratified cluster sampling. More specifically, primary sampling units were selected based on results from the annual Population and Housing Census.

Available data for the 3 years ranges from demographic information to dietary habits and includes a total of 852 variables ([5]).

3.2. Features/Variables.

Variables were selected based on findings from past studies, and their relevance was verified with the Chi-Squared test ([6],[2],[3]). The 55 chosen variables (89 after one-hot encoding during pre-processing) are listed in Table 1.

Variable Description                              Variable Type
Income Level                                      Ordinal
Education Level                                   Ordinal
Self Perception of Health                         Ordinal
High Blood Pressure Diagnosis                     Binary
Hyperlipidemia Diagnosis                          Binary
Stroke Diagnosis                                  Binary
Myocardial Infarction/Angina Pectoris Diagnosis   Binary
Myocardial Infarction Diagnosis                   Binary
Angina Pectoris Diagnosis                         Binary
Arthritis Diagnosis                               Binary
Osteoarthritis Diagnosis                          Binary
Rheumatoid Arthritis Diagnosis                    Binary
Osteoporosis Diagnosis                            Binary
Tuberculosis Diagnosis                            Binary
Asthma Diagnosis                                  Binary
Diabetes Diagnosis                                Binary
Thyroid Gland Disease Diagnosis                   Binary
Stomach Cancer Diagnosis                          Binary
Liver Cancer Diagnosis                            Binary
Colon Cancer Diagnosis                            Binary
Breast Cancer Diagnosis                           Binary
Cervical Cancer Diagnosis                         Binary
Lung Cancer Diagnosis                             Binary
Thyroid Cancer Diagnosis                          Binary
Other-1 Cancer Diagnosis                          Binary
Other-2 Cancer Diagnosis                          Binary
Depression Diagnosis                              Binary
Atopic Dermatitis Diagnosis                       Binary
Allergic Rhinitis Diagnosis                       Binary
Sinusitis Diagnosis                               Binary
Otitis Media Diagnosis                            Binary
Cataract Diagnosis                                Binary
Glaucoma Diagnosis                                Binary
Macular Degeneration Diagnosis                    Binary
Renal Failure Diagnosis                           Binary
Hepatitis B Diagnosis                             Binary
Hepatitis C Diagnosis                             Binary
Liver Cirrhosis Diagnosis                         Binary
Influenza Vaccination                             Binary
Regular Health Check-up                           Binary
Limited Daily/Social Life                         Binary
Employment                                        Binary
Self Perception of Stress                         Ordinal
Regular Exercise                                  Binary
Nutrition Education Status                        Binary
Private Health Insurance                          Binary
Type of Health Insurance                          Nominal
Occupation Type                                   Nominal
Region                                            Nominal
Unmet Healthcare Needs                            Nominal
Causes of Unmet Healthcare Needs                  Nominal
Self Perception of Body Image                     Ordinal
Hospital Admission in Past Year                   Binary
Drinking Level                                    Ordinal
Smoking Level                                     Ordinal

Table 1. Selected Variables: Description and Type
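The Chi-Squared relevance check mentioned in this section could look roughly like the sketch below. It assumes pandas and SciPy are available; the column names, the toy values, and the helper `chi_squared_relevance` are hypothetical and are not taken from the survey or from the paper's own code.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy stand-in for survey data; column names and values are hypothetical.
df = pd.DataFrame({
    "education_level": [1, 1, 2, 2, 3, 3, 1, 2, 3, 3, 1, 2],
    "screened":        [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0],
})


def chi_squared_relevance(data, feature, target):
    """Return (chi2 statistic, p-value, degrees of freedom) for feature vs. target."""
    # Contingency table of observed counts: one row per feature category,
    # one column per target class.
    table = pd.crosstab(data[feature], data[target])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return chi2, p_value, dof


chi2, p_value, dof = chi_squared_relevance(df, "education_level", "screened")
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")
```

A small p-value suggests that the candidate variable and the participation label are not independent, which is one reasonable reading of "relevance" here; the paper does not state the significance threshold it used.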


3.3. Data Pre-processing.

First, all rows containing unavailable (marked as "unknown") or missing (null) values were removed from the dataset. Next, all ordinal categorical variables were label encoded and all nominal categorical variables were one-hot encoded. All variables were then scaled to a fixed range (0 to 1) with Min-Max scaling.

Finally, the dataset was shuffled and split into training and test sets (80% and 20%, respectively).
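The pre-processing steps above can be sketched as follows, assuming pandas and scikit-learn. The miniature data frame, its column names, and the category order in `income_order` are hypothetical; the real survey uses different variable names and codes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature data set; real survey column names and codes differ.
raw = pd.DataFrame({
    "income_level": ["low", "mid", "high", "unknown", "mid", "high",
                     "low", "mid", "high", "low", "mid", None],
    "region": ["Seoul", "Busan", "Seoul", "Daegu", "Daegu", "Busan",
               "Seoul", "Busan", "Daegu", "Seoul", "Busan", "Seoul"],
    "regular_exercise": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    "screened": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# 1. Remove rows that contain "unknown" or missing (null) values.
bad_rows = raw.isna().any(axis=1) | raw.eq("unknown").any(axis=1)
clean = raw[~bad_rows].reset_index(drop=True).copy()

# 2. Label-encode ordinal variables using an explicit category order.
income_order = {"low": 0, "mid": 1, "high": 2}  # assumed ordering
clean["income_level"] = clean["income_level"].map(income_order)

# 3. One-hot encode nominal variables (one indicator column per category).
clean = pd.get_dummies(clean, columns=["region"], dtype=float)

# 4. Scale every feature to the range 0 to 1 with Min-Max scaling.
X = clean.drop(columns="screened")
y = clean["screened"]
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# 5. Shuffle and split into training and test sets (80/20).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, shuffle=True, random_state=0
)
print(X_train.shape, X_test.shape)
```

The sketch follows the order described in the text (scale, then split). Fitting the scaler on the training set only and reusing it on the test set is a common alternative that keeps test-set statistics out of pre-processing.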

3.4. Algorithms.

Algorithms chosen for this task include random forest classifiers, support vector machines, gradient boosted decision trees (XGBoost), and artificial neural networks (with back-propagation).

For all algorithms, grid search and 5-fold cross validation were performed to find optimal hyper-parameters. Each model was then evaluated with the held-out test set. Note that all experiments were conducted via Google Colaboratory.
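A minimal sketch of grid search with 5-fold cross validation followed by a held-out evaluation is shown below, assuming scikit-learn. It uses a random forest as a stand-in for any of the four model families; the synthetic data and the values in `param_grid` are hypothetical, since the searched hyper-parameter values are not listed in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the pre-processed survey features.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid; the values actually searched are not reported.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # AUC-ROC is the tuning metric reported in the paper
    cv=5,               # 5-fold cross validation on the training set
)
search.fit(X_train, y_train)

# Evaluate the refitted best model once on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_)
print(f"cv AUC={search.best_score_:.4f}, test AUC={test_auc:.4f}")
```

Comparing `search.best_score_` with `test_auc` gives a quick check on the gap between cross-validation and test performance.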

4. Results

The performance of the 4 algorithms is reported in Table 2, and the corresponding Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves are shown in Figures 1, 2, 3, and 4.

To evaluate each model, 3 metrics were chosen: Area under the Receiver Operating Characteristic curve (AUC-ROC), Average Precision (area under the Precision-Recall curve), and Accuracy. Note that AUC-ROC was used as the scoring metric for hyper-parameter tuning.

Algorithm                   AUC-ROC   Average Precision   Accuracy
Random Forest               0.8613    0.8694              0.8053
Support Vector Machine      0.8340    0.8327              0.8118
XGBoost                     0.8706    0.8776              0.8171
Artificial Neural Network   0.8590    0.8605              0.8118

Table 2. Accuracy Metrics for Trained Models
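For reference, the three metrics in Table 2 can be computed with scikit-learn as in the sketch below. The labels and scores are made-up toy values, and the 0.5 decision threshold used for accuracy is an assumption; the paper does not state its threshold.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    roc_auc_score,
)

# Toy ground-truth labels (1 = participated) and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.65, 0.55]

# AUC-ROC and average precision are computed from the ranking of scores.
auc_roc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

# Accuracy needs hard class labels; 0.5 is an assumed threshold.
y_pred = [int(score >= 0.5) for score in y_score]
accuracy = accuracy_score(y_true, y_pred)

print(f"AUC-ROC={auc_roc:.4f}")
print(f"Average precision={avg_precision:.4f}")
print(f"Accuracy={accuracy:.4f}")
```

AUC-ROC and average precision do not depend on a threshold, which is one reason they are often preferred over accuracy when comparing classifiers.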

Figure 1. Random Forest: (a) ROC Curve, (b) Precision-Recall Curve

Figure 2. Support Vector Machines: (a) ROC Curve, (b) Precision-Recall Curve

Figure 3. XGBoost: (a) ROC Curve, (b) Precision-Recall Curve

Figure 4. Artificial Neural Network: (a) ROC Curve, (b) Precision-Recall Curve

5. Discussion

We see that all 4 models achieved scores between 0.8 and 0.9 on the 3 chosen metrics. The top performing model was based on gradient boosted decision trees (XGBoost) and achieved an AUC-ROC of 0.8706 and average precision of 0.8776.

One point to note is that the 4 models did not differ substantially in their scores. As such, one option to improve prediction accuracy would be to incorporate more features/variables from the original dataset to form more complex models.

Furthermore, there was minimal discrepancy between the models' performance during cross-validation and testing, suggesting that over-fitting was avoided.

6. Conclusion

One limitation of the models developed in this paper is that the Korea National Health and Nutrition Examination Survey contains self-reported data. As such, portions of the data used (especially answers to subjective survey questions) may have been erroneous, which would reduce the accuracy of the models.

Nevertheless, the models developed in this paper can be directly applied to Korea's healthcare system. Regional public health officials could use these models (or variations of these models, depending on data availability) to identify individuals who are unlikely to participate in cancer screening programs. Officials could then contact these individuals, informing them of screening locations and dates. Both public and private hospitals could make use of these models as well, depending on the availability of data.

If such a system is implemented, Korea's National Cancer Screening Program may see a rise in participation.

Further research on this topic with models of greater complexity and additional features may lead to higher prediction accuracy and a rise in overall cancer screening program participation. For instance, varying the total number of features used may lead to more efficient models. The use of complex neural networks with various types of layers may result in models of greater accuracy as well.


References

1. Baskaran, V., Guergachi, A., Bali, R.K., & Gorgui-Naguib, R. (2011). Predicting Breast Screening Attendance Using Machine Learning Techniques. IEEE Transactions on Information Technology in Biomedicine, 15, 251-259.

2. Hahm MI, Chen HF, Miller T, O'Neill L, Lee HY. Why Do Some People Choose Opportunistic Rather Than Organized Cancer Screening? The Korean National Health and Nutrition Examination Survey (KNHANES) 2010-2012. Cancer Res Treat.

3. [...] arXiv preprint arXiv:2008.01600.

4. Nelson, A., Herron, D., Rees, G., & Nachev, P. (2019). Predicting scheduled hospital attendance with artificial intelligence. NPJ digital medicine, 2, [...]

[...] Asian Pacific Journal of Cancer Prevention, 13(8), 3773–3779. https://doi.org/10.7314/apjcp.2012.13.8.3773

7. National cancer screening cost support [국가암검진 비용지원]. (n.d.). Government [정부].

8. National Health Insurance Service [국민건강보험공단]. (2019). National early cancer screening program participation rate [국가암조기검진사업 수검률].

9. Statistics Korea [통계청]. (2019). Trends in death rates by cause of death [사망원인별 사망률 추이].
