Mohd Juzaiddin Ab Aziz

Publication

Featured research published by Mohd Juzaiddin Ab Aziz.

Rough Sets and Knowledge Technology | 2010

Automatic part of speech tagging for Arabic: an experiment using Bigram hidden Markov model

Mohammed Albared; Nazlia Omar; Mohd Juzaiddin Ab Aziz; Mohd Zakree Ahmad Nazri

Part-of-speech (POS) tagging is the task of computationally determining which POS a word takes in a particular context. A POS tagger is a useful preprocessing tool in many natural language processing (NLP) applications, such as information extraction and information retrieval. In this paper, we present preliminary results for a bigram hidden Markov model (HMM) applied to the POS tagging problem for Arabic. In addition, we use different smoothing algorithms with the HMM to overcome the data sparseness problem. The Viterbi algorithm is used to assign the most probable tag to each word in the text. Furthermore, several lexical models have been defined and implemented to handle unknown-word POS guessing based on word substrings, i.e. prefix probability, suffix probability, or the linear interpolation of both. The average overall accuracy of this tagger is 95.8%.
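As an illustration of the decoding step described in this abstract, the following is a minimal sketch of Viterbi decoding over a bigram HMM. It is not the authors' implementation: the add-one smoothing, the `<s>` start symbol, the function names and the English toy corpus are all assumptions made for the example, and the paper's own smoothing methods and unknown-word models are not reproduced.

```python
import math
from collections import defaultdict

def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical sentence-start symbol
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counts, context, event, num_outcomes):
    """Add-one smoothed log probability (an illustrative smoothing choice)."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + num_outcomes))

def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words
    # best[i][t]: best log score of any tag path ending in tag t at position i
    best = [{t: log_prob(trans, "<s>", t, n_tags)
                + log_prob(emit, t, words[0], n_words) for t in tags}]
    back = [{t: None for t in tags}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_prob(trans, p, t, n_tags))
            best[i][t] = (best[i - 1][prev]
                          + log_prob(trans, prev, t, n_tags)
                          + log_prob(emit, t, words[i], n_words))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy English corpus, used only to exercise the code.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
model = train_bigram_hmm(corpus)
print(viterbi(["the", "dog", "sleeps"], *model))
```

Scores are kept in log space so that long sentences do not underflow; the unseen-word slot in the emission smoothing is a crude stand-in for the substring-based lexical models the abstract mentions.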

Journal of Computer Science | 2013

Arabic person names recognition by using a rule based approach

Mohammed Aboaoga; Mohd Juzaiddin Ab Aziz

Named entity recognition is a very important task in many natural language processing applications, such as machine translation, question answering, information extraction, text summarization, semantic applications and word sense disambiguation. The rule-based approach is one of the techniques used in named entity recognition to identify named entities such as person names, location names and organization names. Recent rule-based methods have been applied to recognize person names in the political domain, but they ignored other named entity types such as locations and organizations. We have used the rule-based approach to recognize one named entity type (person names) for Arabic. We have developed four rules for identifying person names depending on the position of the name. We have used an in-house Arabic corpus collected from newspaper archives. The evaluation compares the output of the system with manually annotated text in order to compute precision, recall and F-measure. In the experiments of this study, the average F-measure for recognizing person names is 92.66%, 92.04% and 90.43% in the sports, economic and political domains, respectively. The experimental results showed that our rule-based method achieved its highest F-measure in the sports domain compared with the political and economic domains.
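The evaluation described in this abstract compares system output with manual annotation. A minimal sketch of that computation follows, assuming each recognized name is represented as a `(start, end)` token span and that only exact span matches count as correct; the span representation, the exact-match criterion and the sample data are assumptions, not details taken from the paper.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, F-measure) for predicted vs. gold spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical spans: four annotated names, three system guesses, two correct.
gold_spans = [(0, 2), (5, 6), (9, 11), (14, 15)]
system_spans = [(0, 2), (5, 6), (9, 10)]
p, r, f = evaluate(system_spans, gold_spans)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```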

Knowledge-Based Systems | 2016

RFBoost: An improved multi-label boosting algorithm and its application to text categorisation

Bassam Al-Salemi; Shahrul Azman Mohd Noah; Mohd Juzaiddin Ab Aziz

The AdaBoost.MH boosting algorithm is considered to be one of the most accurate algorithms for multi-label classification. AdaBoost.MH works by iteratively building a committee of weak hypotheses of decision stumps. In each round of AdaBoost.MH learning, all features are examined, but only one feature is used to build a new weak hypothesis. This learning mechanism may entail a high degree of computational time complexity, particularly in the case of a large-scale dataset. This paper describes a way to manage the learning complexity and improve the classification performance of AdaBoost.MH. We propose an improved version of AdaBoost.MH, called RFBoost. The weak learning in RFBoost is based on filtering a small fixed number of ranked features in each boosting round rather than using all features, as AdaBoost.MH does. We propose two methods for ranking the features: One Boosting Round and Labeled Latent Dirichlet Allocation (LLDA), a supervised topic model based on Gibbs sampling. Additionally, we investigate the use of LLDA as a feature selection method for reducing the feature space based on the maximal conditional probabilities of words across labels. Our experimental results on eight well-known benchmarks for multi-label text categorisation show that RFBoost is significantly more efficient and effective than the baseline algorithms. Moreover, the LLDA-based feature ranking yields the best performance for RFBoost.
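One possible reading of the feature selection step mentioned in this abstract is sketched below: each word is scored by its largest conditional probability over all labels, and only the top-scoring words are kept. The input table `word_given_label` is assumed to come from an already trained LLDA model; its shape, the function name and the numbers are hypothetical, and the paper's exact selection rule may differ.

```python
def select_features(word_given_label, k):
    """Keep the k words with the highest max-over-labels P(word | label)."""
    scores = {}
    for distribution in word_given_label.values():
        for word, prob in distribution.items():
            scores[word] = max(scores.get(word, 0.0), prob)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]

# Hypothetical P(word | label) values for two labels.
word_given_label = {
    "sport": {"goal": 0.5, "match": 0.3, "the": 0.2},
    "economy": {"bank": 0.6, "market": 0.25, "the": 0.15},
}
print(select_features(word_given_label, 3))
```

In this toy table the word "the" is moderately likely under every label but never dominant, so it is ranked last and dropped when only three features are kept.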

Journal of Information Science | 2015

LDA-AdaBoost.MH: Accelerated AdaBoost.MH based on latent Dirichlet allocation for text categorization

Bassam Al-Salemi; Mohd Juzaiddin Ab Aziz; Shahrul Azman Mohd Noah

AdaBoost.MH is a boosting algorithm that is considered to be one of the most accurate algorithms for multi-label classification. It works by iteratively building a committee of weak hypotheses of decision stumps. To build the weak hypotheses, in each iteration AdaBoost.MH takes the full set of extracted features and examines them one by one to check their ability to characterize the appropriate category. Using a bag-of-words text representation dramatically increases the computational time of AdaBoost.MH learning, especially for large-scale datasets. In this paper we demonstrate how to improve the efficiency and effectiveness of AdaBoost.MH using latent topics rather than words. A well-known probabilistic topic modelling method, Latent Dirichlet Allocation (LDA), is used to estimate the latent topics in the corpus as features for AdaBoost.MH. To evaluate LDA-AdaBoost.MH, four datasets were used: Reuters-21578-ModApte, WebKB, 20-Newsgroups and a collection of Arabic news. The experimental results confirmed that representing the texts as a small number of latent topics, rather than a large number of words, significantly decreased the computational time of AdaBoost.MH learning and improved its performance for text categorization.

Asian Conference on Intelligent Information and Database Systems | 2011

Developing a competitive HMM Arabic POS tagger using small training corpora

Mohammed Albared; Nazlia Omar; Mohd Juzaiddin Ab Aziz

Part-of-speech (POS) tagging is the task of computationally determining which POS a word takes in a particular context. POS tagging is one of the important processing steps for many natural language systems, such as information extraction and question answering. This paper presents a study that aims to find an appropriate strategy for developing a fast and accurate Arabic statistical POS tagger when only a limited amount of training material is available. This is an essential factor when dealing with languages like Arabic, for which annotated resources are scarce and not easily available. Different configurations of an HMM tagger are studied: bigram and trigram models are tested, as well as different smoothing techniques. In addition, a new lexical model has been defined to handle unknown-word POS guessing based on the linear interpolation of word suffix probability and word prefix probability. Several experiments are carried out to determine the performance of the different HMM configurations with two small training corpora. The first corpus includes about 29,300 words from both Modern Standard Arabic and Classical Arabic. The second corpus is the Quranic Arabic Corpus, which consists of 77,430 words of Quranic Arabic.
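The unknown-word model in this abstract interpolates a suffix-based and a prefix-based estimate. A minimal sketch of that idea follows, assuming fixed-length affixes, relative-frequency estimates of P(tag | affix), a uniform fallback for unseen affixes and an interpolation weight `lam`; all of these choices, the function names and the English toy lexicon are assumptions, not the formulation used in the paper.

```python
from collections import Counter, defaultdict

def affix_tag_counts(tagged_words, length, from_end):
    """Count tags per affix (suffix if from_end is True, else prefix)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        if len(word) >= length:
            affix = word[-length:] if from_end else word[:length]
            counts[affix][tag] += 1
    return counts

def tag_distribution(counts, affix, tags):
    """Relative-frequency P(tag | affix); uniform if the affix is unseen."""
    row = counts.get(affix, Counter())
    total = sum(row.values())
    if total == 0:
        return {t: 1 / len(tags) for t in tags}
    return {t: row[t] / total for t in tags}

def guess_tag_probs(word, prefix_counts, suffix_counts, tags, length=2, lam=0.5):
    """Interpolate: lam * P(tag | suffix) + (1 - lam) * P(tag | prefix)."""
    p_pre = tag_distribution(prefix_counts, word[:length], tags)
    p_suf = tag_distribution(suffix_counts, word[-length:], tags)
    return {t: lam * p_suf[t] + (1 - lam) * p_pre[t] for t in tags}

# Toy English lexicon, used only to exercise the code.
lexicon = [("walking", "VERB"), ("talking", "VERB"), ("unhappy", "ADJ"),
           ("unkind", "ADJ"), ("king", "NOUN")]
tags = ["ADJ", "NOUN", "VERB"]
prefixes = affix_tag_counts(lexicon, 2, from_end=False)
suffixes = affix_tag_counts(lexicon, 2, from_end=True)
probs = guess_tag_probs("unwinding", prefixes, suffixes, tags)
print(probs)
```

With `lam=1.0` the guess relies on the suffix alone and with `lam=0.0` on the prefix alone, so the two pure models mentioned in the abstract are the endpoints of the interpolation.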

International Conference on Asian Language Processing | 2010

Anaphora Resolution of Malay Text: Issues and Proposed Solution Model

Noorhuzaimi Karimah Mohd Noor; Shahrul Azman Mohd Noah; Mohd Juzaiddin Ab Aziz; Mohd Pouzi Hamzah

Anaphora resolution (AR) is the process of identifying the appropriate antecedent of an anaphor, where the antecedent occurs before the anaphor. AR can improve most NLP applications, such as question answering, short-answer examination systems and information extraction. Most AR systems deal with the English language. In the 1990s, AR research was extended to other languages, such as Arabic, Chinese, Hindi and Norwegian. There has, however, been limited or no effort in dealing with Malay text. An AR system for one language cannot simply be adapted for use in other languages, because different languages have different sets of syntactic and semantic rules. This paper proposes a model for resolving anaphora in Malay text. The model consists of three elements: an anaphora resolution process, a syntactic knowledge process and a semantic-world knowledge process. The elements are defined based on phenomena observed in the Malay language.

Natural Language Engineering | 2017

Mapping Arabic WordNet synsets to Wikipedia articles using monolingual and bilingual features

Abdulgabbar Saif; Mohd Juzaiddin Ab Aziz; Nazlia Omar

The alignment of WordNet and Wikipedia has received wide attention from researchers in computational linguistics who are building new lexical knowledge sources or enriching the semantic information of WordNet entities. The main challenge of this alignment is how to handle synonymy and ambiguity in the contents of two units from different sources. This paper therefore introduces a mapping method that links an Arabic WordNet synset to its corresponding article in Wikipedia. The method uses monolingual and bilingual features to overcome the lack of semantic information in Arabic WordNet. To evaluate the method, an Arabic mapping data set containing 1,291 synset–article pairs was compiled. The experimental analysis shows that the proposed method achieves promising results and outperforms the state-of-the-art methods that depend only on monolingual features. The mapping method has also been used to increase the coverage of Arabic WordNet by inserting new synsets from Wikipedia.

Journal of Information Science | 2015

Boosting algorithms with topic modeling for multi-label text categorization

Bassam Al-Salemi; Mohd Juzaiddin Ab Aziz; Shahrul Azman Mohd Noah

Boosting algorithms have received significant attention over the past several years and are considered to be the state-of-the-art classifiers for multi-label classification tasks. The disadvantage of using boosting algorithms for text categorization (TC) is the vast number of features generated by the traditional bag-of-words (BOW) text representation, which dramatically increases the computational complexity. In this paper, an alternative text representation method that uses topic modeling to enhance and accelerate multi-label boosting algorithms is considered. An extensive empirical comparison of eight multi-label boosting algorithms using topic-based and BOW representation methods was undertaken. For the evaluation, three well-known multi-label TC datasets were used. Furthermore, to justify the boosting algorithms' performance, three well-known instance-based multi-label algorithms were included in the evaluation. For completely credible evaluations, all algorithms were evaluated using their native software tools, except for data formats and user settings. The experimental results demonstrated that the topic-based representation significantly accelerated all algorithms and slightly enhanced the classification performance, especially for near-balanced and balanced datasets. For the imbalanced dataset, BOW representation led to the best performance. The MP-Boost algorithm is the most efficient and effective algorithm for imbalanced datasets using BOW representation. For topic-based representation, AdaBoost.MH with the meta base learners Hamming Tree (AdaMH-Tree) and Product (AdaMH-Product) achieved the best performance; however, with respect to computational time, these algorithms are the slowest overall. Moreover, the results indicated that topic-based representation is more significant for instance-based algorithms; nevertheless, boosting algorithms such as MP-Boost, AdaMH-Tree and AdaMH-Product notably outperform them.

Knowledge-Based Systems | 2016

Reducing explicit semantic representation vectors using Latent Dirichlet Allocation

Abdulgabbar Saif; Mohd Juzaiddin Ab Aziz; Nazlia Omar

Explicit Semantic Analysis (ESA) is a knowledge-based method that builds the semantic representation of words from the textual descriptions of the concepts in a given knowledge source. Due to its simplicity and success, ESA has received wide attention from researchers in computational linguistics and information retrieval. However, the representation vectors formed by the ESA method are generally very large and high-dimensional, and may contain many redundant concepts. In this paper, we introduce a reduced semantic representation method that constructs the semantic interpretation of words as vectors over latent topics derived from the original ESA representation vectors. To model the latent topics, Latent Dirichlet Allocation (LDA) is adapted to the ESA vectors so that topics are extracted as probability distributions over concepts rather than over words, as in the traditional model. The proposed method is applied to two knowledge sources widely used in computational semantic analysis: WordNet and Wikipedia. For evaluation, we use the proposed method in two natural language processing tasks: measuring the semantic relatedness between words/texts and text clustering. The experimental results indicate that the proposed method overcomes the limitations of the ESA representation.

Asian Conference on Intelligent Information and Database Systems | 2012

Malay anaphor and antecedent candidate identification: a proposed solution

Noorhuzaimi Karimah Mohd Noor; Shahrul Azman Mohd Noah; Mohd Juzaiddin Ab Aziz; Mohd Pouzi Hamzah

This paper discusses anaphor and antecedent candidate determination for the Malay language using knowledge-poor techniques. Determining the anaphor and antecedent candidates is important because a pronoun in a text is not always an anaphor. Sometimes a pronoun refers to something outside the context or does not refer to anything in the text. This situation also arises in the use of pronouns in Malay. Therefore, certain rules must be defined to identify the antecedent and anaphor candidates. Pronoun usage in Malay does not indicate the gender of the person, but instead distinguishes the status of the person, such as royalty, honorable people and common people. Thus, generic rules that have been used for other languages cannot simply be adapted for Malay. The proposed solution uses the distance of each candidate and its location in the Subject-Verb-Object (SVO) structure to determine the anaphor candidate. As such, syntactic information, semantic information and anaphor-antecedent distance are seen as important in determining the antecedent candidate.
