2019 5th International Conference on Computing Engineering and Design (ICCED) | 2019

The Identification of Negative Content in Websites by Using Machine Learning Approaches


Abstract


The massive spread of negative content on the internet has become a national problem in Indonesia, and websites are one of the media used to spread it. This research aims to classify negative content into several categories, such as gambling, pornography, and fraud, based on existing resources from the TRUST+™ Positif database. The classification uses keywords from URLs together with the entire content of each website. The research also generates a lexicon of Bahasa Indonesia unigrams and bigrams that characterize the gambling, pornography, and fraud categories. As a result, the classification process can be completed faster and with less dependence on human labor. The methodology consists of web crawling, web scraping, feature extraction, training, testing, and data validation. Of the three classification algorithms evaluated in this research (Naive Bayes, K-Nearest Neighbor, and Support Vector Machine), the Support Vector Machine (SVM) model achieves the highest accuracy (95.3886%). Data validation is performed on 20 new URLs that are not part of the training dataset; 17 of them (85%) are classified correctly. The experiments show that the system makes many mistakes when predicting gambling and fraud content, because these two categories have similar word representations. Based on these results, this research concludes that a number of improvements can still be made to negative content identification.
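As an illustration of the feature extraction and classification steps described in the abstract, the following is a minimal sketch and not the authors' implementation. It builds unigram and bigram features from URL keywords and page text, then trains a small multinomial Naive Bayes classifier with add-one smoothing, using only the Python standard library. The training pages, the `example.com` URLs, and all function and class names are hypothetical placeholders; the actual study uses data from the TRUST+™ Positif database, whose format is not given here, and reports its best result with SVM rather than Naive Bayes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_features(url, body):
    """Return unigrams and bigrams taken from the URL keywords and the page text."""
    features = []
    for source in (url, body):
        tokens = tokenize(source)
        features.extend(tokens)                                   # unigrams
        features.extend(" ".join(p) for p in zip(tokens, tokens[1:]))  # bigrams
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()                # documents seen per category
        self.feature_counts = defaultdict(Counter)  # n-gram counts per category
        self.vocab = set()

    def fit(self, samples):
        for features, label in samples:
            self.class_docs[label] += 1
            self.feature_counts[label].update(features)
            self.vocab.update(features)
        return self

    def predict(self, features):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for f in features:
                if f in self.vocab:                # ignore n-grams never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical toy pages (URL, page text, category); invented for illustration only.
TRAIN = [
    ("http://example.com/judi-online", "agen judi online taruhan bola terpercaya", "gambling"),
    ("http://example.com/togel", "bandar togel taruhan judi online", "gambling"),
    ("http://example.com/investasi-cepat", "investasi untung cepat transfer dana sekarang", "fraud"),
    ("http://example.com/hadiah", "selamat anda menang hadiah transfer dana admin", "fraud"),
    ("http://example.com/video-dewasa", "video dewasa gratis konten dewasa", "pornography"),
    ("http://example.com/foto-dewasa", "foto dewasa konten dewasa terbaru", "pornography"),
]

model = NaiveBayes().fit((ngram_features(url, body), label) for url, body, label in TRAIN)
prediction = model.predict(
    ngram_features("http://example.net/taruhan-bola", "situs judi online taruhan bola")
)
```

On this toy data, `prediction` comes out as `"gambling"`, since the unseen page shares unigrams such as "judi" and bigrams such as "judi online" with the gambling examples only. The `accuracy` helper mirrors the validation figure quoted above: 17 correct predictions out of 20 URLs gives 0.85.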

Pages 1-6
DOI 10.1109/ICCED46541.2019.9161105
Language English
Journal 2019 5th International Conference on Computing Engineering and Design (ICCED)
