Abdulmohsen Al-Thubaity
King Abdulaziz City for Science and Technology
Publication
Featured research published by Abdulmohsen Al-Thubaity.
software engineering, artificial intelligence, networking and parallel/distributed computing | 2013
Abdulmohsen Al-Thubaity; Norah Abanumay; Sara Al-Jerayyed; Aljoharah Alrukban; Zarah Mannaa
Feature selection is one of several factors affecting text classification systems. Feature selection aims to choose a representative subset of all features to reduce the complexity of classification problems. Usually a single method is used for feature selection. For English, several attempts were reported examining the combination of different feature selection methods. To the best of our knowledge no such attempts were reported for Arabic text classification. In this study, we examined the effect of combining five feature selection methods, namely CHI, IG, GSS, NGL and RS, on Arabic text classification accuracy. Two combination approaches were used, intersection (AND) and union (OR). The NB classification algorithm was used to classify a Saudi Press Agency dataset which comprised 6,300 texts divided evenly into six classes. Three feature representation schemas were used, namely Boolean, TF-IDF and LTC. The experiments show a slight improvement in classification accuracy when two or three feature selection methods are combined. No improvement in classification accuracy was seen when four or all five feature selection methods were combined.
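The AND/OR combination described above can be sketched as set operations over the top-ranked features of each method. This is a minimal illustration, not the authors' implementation: the function names, the cutoff `k`, and the toy scores are all assumptions made for the example.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring features (ties broken by name)."""
    ranked = sorted(scores, key=lambda feature: (-scores[feature], feature))
    return set(ranked[:k])


def combine(selections, mode):
    """Combine feature sets by intersection ("AND") or union ("OR")."""
    sets = list(selections)
    if not sets:
        return set()
    if mode == "AND":
        return set.intersection(*sets)
    if mode == "OR":
        return set.union(*sets)
    raise ValueError("mode must be 'AND' or 'OR'")


# Hypothetical scores from two selection methods (values are made up).
chi_scores = {"bank": 9.1, "goal": 7.4, "oil": 6.0, "vote": 1.2}
ig_scores = {"bank": 0.8, "oil": 0.7, "vote": 0.5, "goal": 0.1}

selected = [top_k(chi_scores, 2), top_k(ig_scores, 2)]
and_features = combine(selected, "AND")  # features both methods agree on
or_features = combine(selected, "OR")    # features chosen by either method
```

Intersection yields a smaller, higher-agreement feature set, while union yields a larger one, which is one plausible reason the benefit could fade as more methods are combined.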
software engineering, artificial intelligence, networking and parallel/distributed computing | 2015
Abdulmohsen Al-Thubaity; Muneera Alhoshan; Itisam Hazzaa
The feature type (FT) chosen for extraction from the text and presented to the classification algorithm (CAL) is one of the factors affecting text classification (TC) accuracy. Character N-grams, word roots, word stems, and single words have been used as features for Arabic TC (ATC). A survey of current literature shows that no prior studies have been conducted on the effect of using word N-grams (N consecutive words) on ATC accuracy. Consequently, we have conducted 576 experiments using four FTs (single words, 2-grams, 3-grams, and 4-grams), four feature selection methods (document frequency (DF), chi-squared, information gain, and Galavotti, Sebastiani, Simi) with four thresholds for numbers of features (50, 100, 150, and 200), three data representation schemas (Boolean, term frequency-inverse document frequency (TF-IDF), and log-weighted term frequency-inverse document frequency with cosine normalization (LTC)), and three CALs (naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM)). Our results show that the use of single words as a feature provides greater classification accuracy (CA) for ATC compared to N-grams. Moreover, CA decreases by 17% on average when the number of N-grams increases. The data also show that the SVM CAL provides greater CA than NB and KNN; however, the best CA for 2-grams, 3-grams, and 4-grams is achieved when the NB CAL is used with Boolean representation and the number of features is 200.
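As a rough illustration of the setup above, the sketch below extracts word N-grams from a token list and enumerates the experimental grid, confirming that 4 x 4 x 4 x 3 x 3 gives 576 configurations. The function name and the label strings are assumptions chosen for the example.

```python
from itertools import product


def word_ngrams(tokens, n):
    """Return every run of n consecutive words, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Labels are illustrative; only the counts come from the abstract.
feature_types = ["1-gram", "2-gram", "3-gram", "4-gram"]
selection_methods = ["DF", "CHI", "IG", "GSS"]
thresholds = [50, 100, 150, 200]
representations = ["Boolean", "TF-IDF", "LTC"]
classifiers = ["NB", "KNN", "SVM"]

grid = list(product(feature_types, selection_methods, thresholds,
                    representations, classifiers))
bigrams = word_ngrams(["a", "b", "c", "d"], 2)
```

Each element of `grid` is one experiment configuration; a text shorter than `n` words simply yields no N-grams.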
international conference on asian language processing | 2015
Abdulmohsen Al-Thubaity; Abdullah Al-Subaie
The preprocessing stage in text classification is one of the factors affecting the accuracy of text classification. Text preprocessing involves several steps such as removing stop words, punctuation, and numerals. For Arabic text classification, stemming and root extraction were proposed as additional preprocessing steps. The resulting stems and roots are then used as features for Arabic text classification. In this study, we propose word segmentation as an additional preprocessing step. We used a dataset comprising 4,900 newspaper articles evenly distributed into seven classes. We conducted our experiments on segmented and nonsegmented versions of this dataset. We used chi-squared to select top-ranked features, LTC as a representation schema, and SVM as a classifier. By measuring the accuracy, precision, recall, and F-measure, we evaluated the use of word orthography as a feature for Arabic text classification before and after segmentation. In all of the experiments we conducted, the classification performance for the segmented dataset outperformed the nonsegmented dataset with the same number of features. Furthermore, the segmented dataset can attain the same classification performance as the nonsegmented dataset using fewer features.
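The chi-squared ranking mentioned above is commonly computed per term and class from a 2x2 document-count table. The sketch below assumes that standard formulation; the abstract does not state which variant was used, and the argument names are chosen for the example.

```python
def chi_squared(a, b, c, d):
    """Chi-squared score for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or class is empty or universal
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts: the term appears mostly inside the class.
associated = chi_squared(40, 10, 10, 40)
# Hypothetical counts: the term is spread evenly, so it carries no signal.
independent = chi_squared(25, 25, 25, 25)
```

Terms are then sorted by score and the top-ranked ones are kept as features.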
international conference on asian language processing | 2012
Abdulmohsen Al-Thubaity; Albandari Alanazi; Itisam Hazzaa; Haya Al-Tuwaijri
Given the importance of organizing and managing the rapid growth in knowledge of Arabic electronic content, this study introduces the Weirdness Coefficient (W) as a new feature selection method for Arabic special domain text classification. The proposed method was used to classify a dataset comprising five Islamic topics using Naive Bayes (NB) and K-nearest neighbor (K-NN) classifiers, and three representation schemas. The results were also compared with a well-known feature selection method, Chi-squared. In addition to its simplicity in computation, the Weirdness Coefficient showed promising classification accuracy.
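The abstract does not give the formula for W. In terminology extraction, weirdness is usually taken as the ratio of a term's relative frequency in a special-domain corpus to its relative frequency in a general corpus, and the sketch below assumes that definition. The add-one smoothing for terms missing from the general corpus, the function name, and the toy tokens are further assumptions made for this example.

```python
from collections import Counter


def weirdness(special_tokens, general_tokens):
    """Score each special-corpus term by relative-frequency ratio."""
    special = Counter(special_tokens)
    general = Counter(general_tokens)
    n_special = len(special_tokens)
    n_general = len(general_tokens)
    scores = {}
    for term, f_special in special.items():
        # Assumed add-one smoothing so unseen general-corpus terms
        # do not cause a division by zero.
        f_general = general.get(term, 0) + 1
        scores[term] = (f_special / n_special) / (f_general / (n_general + 1))
    return scores


# Toy corpora (made up): "zakat" is domain-specific, "the" is not.
special_corpus = ["zakat", "zakat", "the", "the"]
general_corpus = ["the"] * 8 + ["car", "road"]
w = weirdness(special_corpus, general_corpus)
```

Terms with a high score are over-represented in the special domain, which is what makes the measure usable as a feature ranking.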
Journal of Quantitative Linguistics | 2018
Abdulmohsen Al-Thubaity; Sultan Almujaiwel
This study primarily addresses the influence of a reference corpus in terms of its size, topics, time, and geographical area on keyword extraction. It sets out the underlying principles of the key statistical measures, and shows the influence of key statistical measures on keywords and the similarities and differences between such measures. The study provides simple insight into the words that can be easily excluded, with a core focus on the type of study used, and the number of keywords selected for further consideration. We have explored the influences indicated above using an empirical method, with an emphasis on the quantitative measures of the keywords in question. These statistical measures are found to influence the results of the keywords differently. We have also conducted an analysis of keywords retrieved during experiments, taking into consideration keywords’ complexity and associations. The results show the major influence of the reference corpus is related to its topics, whereas size, time, and geographical area are less influential.
Proceedings of the 2nd International Conference on Information System and Data Mining | 2018
Abdulmohsen Al-Thubaity; Aali Alqarni; Ahmad Alnafessah
Feature extraction - the process of choosing feature types that can represent and discriminate between dataset topics - is one of the critical steps in text classification and varies with the language of the texts. Different feature types have been proposed for Arabic text classification, ranging from features based on word orthography (single word and character and word N-grams) to features based on linguistic analysis (roots, stems). To the best of our knowledge, little attention has been paid to investigating the performance of Arabic text classification when Part of Speech (POS) tagging information is used to extract features. In this study, we used a corpus comprising 4,900 newspaper texts distributed evenly over seven topics to investigate the effect on Arabic text classification of using POS tag distribution and words that belong to certain POS tags, namely nouns, verbs and adjectives. For feature selection, feature representation and text classification we used Chi-squared, Log-Weighted Term Frequency Inverse Document Frequency with Cosine Normalization (LTC) and support vector machine (SVM) respectively. We used four metrics, namely accuracy, precision, recall and F-measure to measure classification performance. Experiment data suggest that the words achieved the best classification performance when the number of features was low; however, the classification performance can be marginally increased when nouns, verbs and adjectives are used as features, given that the number of features is increased.
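The LTC representation named above can be sketched as follows. The exact variant used in the study is not stated, so this assumes the common form (1 + ln tf) x ln(N / df) followed by cosine (unit-length) normalization; the function name and toy documents are made up for the example.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """Return one {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: (1 + math.log(f)) * math.log(n_docs / df[t])
               for t, f in tf.items()}
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        # Cosine normalization; an all-zero vector is left as zeros.
        vectors.append({t: (weight / norm if norm else 0.0)
                        for t, weight in raw.items()})
    return vectors


# Toy tokenized documents (made up).
documents = [["oil", "oil", "price"], ["goal", "match"], ["oil", "match"]]
vectors = ltc_weights(documents)
```

A term that occurs in every document gets weight zero under this form, because ln(N / df) is zero for it.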
International Conference on Advanced Intelligent Systems and Informatics | 2018
Abdulmohsen Al-Thubaity; Muneera Alhoshan
The growth of electronically readable Arabic content available on the web has become a rich source from which to build new corpora or update the existing ones. The availability of such corpora will be beneficial for Arabic corpus linguistics, computational linguistics, and natural language processing. In this paper, we present ARARSS, a tool capable of automatically constructing and updating textual corpora benefiting from the Rich Site Summary (RSS) feeds. ARARSS is capable of collecting the texts in a properly categorized manner according to user needs, in addition to their metadata (for example, location, time, and topic) as provided by RSS sources. We used ARARSS to construct a modern standard Arabic corpus comprising 117,819 texts and more than 28 million words. ARARSS is an open source tool and freely available to download (http://corpus.kacst.edu.sa/more_info.jsp) along with the constructed corpus.
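To illustrate the general idea of harvesting categorized texts and metadata from RSS, the sketch below parses an RSS 2.0 document with Python's standard library and groups items by category. It is not ARARSS code: the sample feed, the function name, and the "uncategorized" fallback are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up RSS 2.0 feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Headline one</title><link>http://example.com/1</link>
<category>Economy</category><pubDate>Mon, 01 Jan 2018 10:00:00 GMT</pubDate></item>
<item><title>Headline two</title><link>http://example.com/2</link>
<category>Sports</category><pubDate>Mon, 01 Jan 2018 11:00:00 GMT</pubDate></item>
<item><title>Headline three</title><link>http://example.com/3</link>
<pubDate>Mon, 01 Jan 2018 12:00:00 GMT</pubDate></item>
</channel></rss>"""


def collect_by_category(xml_text):
    """Group feed items by their <category>, keeping basic metadata."""
    root = ET.fromstring(xml_text)
    corpus = defaultdict(list)
    for item in root.iter("item"):
        category = item.findtext("category", default="uncategorized")
        corpus[category].append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return dict(corpus)


corpus = collect_by_category(SAMPLE_FEED)
```

A real collector would also fetch each linked page and extract its body text; that step is omitted here.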
The Scientific World Journal | 2014
Abdulmohsen Al-Thubaity; Hend S. Al-Khalifa; Reem Alqifari; Manal Almazrua
Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first propose a framework for the evaluation of standalone corpora processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems while taking into consideration their suitability for Arabic corpora. While the results show that most of the evaluated systems exhibited comparable usability scores, the scores for functionality and performance were substantially different with respect to support for the Arabic language and N-grams profile generation. The results of our evaluation will help potential users of the evaluated systems to choose the system that best meets their needs. More importantly, the results will help the developers of the evaluated systems to enhance their systems and developers of new corpora processing systems by providing them with a reference framework.
language resources and evaluation | 2013
Mohammad S. Khorsheed; Abdulmohsen Al-Thubaity
language resources and evaluation | 2015
Abdulmohsen Al-Thubaity