Publication


Featured research published by Houda Benbrahim.


Systems, Man and Cybernetics | 2012

An empirical study to address the problem of Unbalanced Data Sets in sentiment classification

Asmaa Mountassir; Houda Benbrahim; Ilham Berrada

With the emergence of Web 2.0, Sentiment Analysis is receiving more and more attention. Several interesting works have addressed different issues in Sentiment Analysis. Nevertheless, the problem of unbalanced data sets has not been sufficiently tackled within this research area. This paper presents the study we carried out to address the problem of unbalanced data sets in supervised sentiment classification in a multi-lingual context. We propose three different methods to under-sample the majority class documents: Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of the proposed methods with common random under-sampling. We also aim to evaluate the behavior of the classifiers toward different under-sampling rates. We use three common classifiers, namely Naïve Bayes, Support Vector Machines and k-Nearest Neighbors. The experiments are carried out on two Arabic data sets and an English data set. We show that the four under-sampling methods are typically competitive. Naïve Bayes proves insensitive to unbalanced data sets, whereas Support Vector Machines is highly sensitive to them; k-Nearest Neighbors shows only slight sensitivity to imbalance in comparison with Support Vector Machines.
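The under-sampling methods are only named in the abstract, so the following Python sketch is a hypothetical illustration of majority-class under-sampling: a random strategy plus one plausible similarity-based reading of "Remove Similar" / "Remove Farthest" relative to the minority-class centroid. The function name and the use of TF-IDF document vectors are assumptions, not the authors' implementation.

```python
# Illustrative majority-class under-sampling (not the paper's exact procedures).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def undersample_majority(X_major, X_minor, rate=0.5, strategy="random", seed=0):
    """Return the row indices of the majority-class matrix to KEEP.

    rate     -- fraction of majority documents to remove (0..1)
    strategy -- "random", or "similar"/"farthest" relative to the minority-class
                centroid (one plausible reading of the paper's method names)
    """
    n = X_major.shape[0]
    n_remove = int(rate * n)
    rng = np.random.default_rng(seed)

    if strategy == "random":
        remove = rng.choice(n, size=n_remove, replace=False)
    else:
        centroid = np.asarray(X_minor.mean(axis=0)).reshape(1, -1)
        sim = cosine_similarity(X_major, centroid).ravel()
        order = np.argsort(sim)                      # ascending similarity
        remove = order[n - n_remove:] if strategy == "similar" else order[:n_remove]

    return np.setdiff1d(np.arange(n), remove)
```

The removed indices could then be swept over several rates to reproduce the kind of rate-sensitivity comparison described above.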


International Conference on Innovative Techniques and Applications of Artificial Intelligence | 2012

A cross-study of Sentiment Classification on Arabic corpora

Asmaa Mountassir; Houda Benbrahim; Ilham Berrada

Sentiment Analysis is a research area whose studies focus on processing and analyzing the opinions available on the web. Several interesting and advanced works have been performed on English. In contrast, very few works have been conducted on Arabic. This paper presents the study we carried out to investigate supervised sentiment classification in an Arabic context. We use two Arabic corpora which differ in many aspects. We use three common classifiers known for their effectiveness, namely Naive Bayes, Support Vector Machines and k-Nearest Neighbor. We investigate several settings to identify those that yield the best results; these settings concern stemming type, term-frequency thresholding, term weighting and word n-grams. We show that Naive Bayes and Support Vector Machines are competitively effective, whereas k-Nearest Neighbor's effectiveness depends on the corpus. Through this study, we recommend using light stemming rather than stemming, removing terms that occur once, combining unigram and bigram words, and using presence-based rather than frequency-based weighting. Our results also show that classification performance may be influenced by document length, document homogeneity and the nature of document authors. However, the size of the data sets does not have an impact on classification results.
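As a rough illustration of the recommended settings (unigrams plus bigrams, removal of terms that occur once, presence-based weighting), a minimal scikit-learn pipeline could look like the sketch below. The pipeline and variable names are assumptions, and Arabic light stemming is assumed to be applied to the documents beforehand.

```python
# Minimal sketch of the recommended feature settings, assuming pre-stemmed text.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentiment_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2),   # combine unigram and bigram words
                    min_df=2,             # drop terms that occur only once
                    binary=True),         # presence-based rather than frequency-based weighting
    MultinomialNB(),
)

# sentiment_clf.fit(train_docs, train_labels)   # train_docs: light-stemmed strings
# predictions = sentiment_clf.predict(test_docs)
```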


Systems, Man and Cybernetics | 2004

An empirical study for hypertext categorization

Houda Benbrahim; Max Bramer

As the Web expands exponentially, the need to put some order to its content becomes apparent. Hypertext categorization, that is, the automatic classification of web documents into predefined classes, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and metadata provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) what representation to use for documents and which extra information hidden in HTML pages to take into consideration to improve the classification task, and (ii) how to deal with the very high number of features of texts. A hypertext dataset and four well-known learning algorithms (Naive Bayes, k-Nearest Neighbor, Support Vector Machines and C4.5) were used to exploit the enriched text representation along with feature reduction. The results showed that enhancing the basic text content with HTML page keywords, title and anchor links improved the accuracy of the classification algorithms.
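The sketch below is a hedged reconstruction of the enriched-text idea: concatenate a page's body text with its title, meta keywords and anchor text before vectorization. It uses BeautifulSoup and is an illustration under those assumptions, not the paper's exact preprocessing.

```python
# Illustrative "enriched text" extraction from an HTML page (requires beautifulsoup4).
from bs4 import BeautifulSoup

def enriched_representation(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")

    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    meta = soup.find("meta", attrs={"name": "keywords"})
    keywords = meta.get("content", "") if meta else ""
    anchors = " ".join(a.get_text(" ", strip=True) for a in soup.find_all("a"))
    body = soup.get_text(" ", strip=True)

    # Simply concatenate the extra fields with the body text; in practice each
    # field could also be weighted differently before vectorization.
    return " ".join([body, title, keywords, anchors])
```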


International Conference on Innovative Techniques and Applications of Artificial Intelligence | 2005

Neighbourhood Exploitation in Hypertext Categorization

Houda Benbrahim; Max Bramer

The exponential growth of the web has led to the necessity to put some order to its content. The automatic classification of web documents into predefined classes, that is, hypertext categorization, emerged to relieve humans of that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and the linked neighbourhood provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) which extra information hidden in HTML tags and linked neighbourhood pages to take into consideration to improve the classification task, and (ii) how to deal with the high level of noise in linked pages. A hypertext dataset and four well-known learning algorithms (Naive Bayes, k-Nearest Neighbour, Support Vector Machine and C4.5) were used to exploit the enriched text representation. The results showed that the clever use of the information in the linked neighbourhood and HTML tags improved the accuracy of the classification algorithms.
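As a hedged sketch of one way to exploit the linked neighbourhood while limiting noise, the snippet below extends a page's own text with only the titles of the pages it links to. The `pages` dictionary (URL to raw HTML of crawled pages) and the choice to use neighbour titles are illustrative assumptions, not the paper's noise-handling scheme.

```python
# Illustrative neighbourhood exploitation: augment a page with its neighbours' titles.
from bs4 import BeautifulSoup

def with_neighbourhood(url: str, pages: dict) -> str:
    soup = BeautifulSoup(pages[url], "html.parser")
    own_text = soup.get_text(" ", strip=True)

    neighbour_titles = []
    for a in soup.find_all("a", href=True):
        target_html = pages.get(a["href"])
        if target_html:
            title = BeautifulSoup(target_html, "html.parser").title
            if title:
                neighbour_titles.append(title.get_text(" ", strip=True))

    # Using only titles (rather than full neighbour texts) is one simple way to
    # keep the added information while limiting the noise from linked pages.
    return own_text + " " + " ".join(neighbour_titles)
```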


2012 Colloquium in Information Science and Technology | 2012

Some methods to address the problem of unbalanced sentiment classification in an Arabic context

Asmaa Mountassir; Houda Benbrahim; Ilham Berrada

The rise of social media (such as online web forums and social networking sites) has attracted interest in mining and analyzing the opinions available on the web. Online opinion has become an object of study in many research areas, especially the one called "Opinion Mining and Sentiment Analysis". Several interesting and advanced works have been performed on a few languages (in particular English), but there have been very few studies on languages such as Arabic. This paper presents the study we carried out to address the problem of unbalanced data sets in supervised sentiment classification in an Arabic context. We propose three different methods to under-sample the majority class documents. Our goal is to compare the effectiveness of the proposed methods with common random under-sampling. We also aim to evaluate the behavior of the classifiers toward different under-sampling rates. We use two common classifiers, namely Naïve Bayes and Support Vector Machines. The experiments are carried out on an Arabic data set that we built from Aljazeera's website and labeled manually. The results show that Naïve Bayes is sensitive to data set size: the more we reduce the data, the more the results degrade. However, it is not sensitive to unbalanced data sets, in contrast to Support Vector Machines, which is highly sensitive to them. The results also show that we can rely on the proposed techniques and that they are typically competitive with random under-sampling.


Document numérique | 2013

Sentiment classification on Arabic corpora. A preliminary cross-study

Asmaa Mountassir; Houda Benbrahim; Ilham Berrada

The rise of social media (such as online web forums and social networking sites) has attracted interest in mining and analyzing the opinions available on the web. Online opinion has become an object of study in many research areas, especially the one called "Opinion Mining and Sentiment Analysis". Several interesting and advanced works have been performed on a few languages (in particular English). However, there have been very few studies on morphologically rich languages such as Arabic. This paper presents the study we carried out to investigate supervised sentiment classification in an Arabic context. We use two Arabic corpora which differ in many aspects. We use three common classifiers known for their effectiveness, namely Naïve Bayes, Support Vector Machines and k-Nearest Neighbor. We investigate several settings to identify those that yield the best results; these settings concern stemming type, term-frequency thresholding, term weighting and word n-grams. We show that Naïve Bayes and Support Vector Machines are competitively effective, whereas k-Nearest Neighbor's effectiveness depends on the corpus. Through this study, we recommend using light stemming rather than stemming, removing terms that occur once, combining unigram and bigram words, and using presence-based rather than frequency-based weighting. Our results also show that classification performance can be influenced by document length, document homogeneity and the nature of document authors. However, the size of the data sets does not have an impact on classification results.


Artificial Intelligence Applications and Innovations | 2004

Impact on Performance of Hypertext Classification of Selective Rich HTML Capture

Houda Benbrahim; Max Bramer

Hypertext categorization is the automatic classification of web documents into predefined classes. It poses new challenges for automatic categorization because of the rich information in a hypertext document. Hyperlinks, HTML tags, and metadata all provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) what representation to use for documents and which extra information hidden in HTML pages to take into consideration to improve the classification task, and (ii) how to deal with the very high number of features of texts. A hypertext dataset and three well-known learning algorithms (Naive Bayes, K-Nearest Neighbour and C4.5) were used to exploit the enriched text representation along with feature reduction. The results showed that enhancing the basic text content with HTML page keywords, title and anchor links improved the accuracy of the classification algorithms.


Machine Learning and Data Mining in Pattern Recognition | 2014

The Nearest Centroid Based on Vector Norms: A New Classification Algorithm for a New Document Representation Model

Asmaa Mountassir; Houda Benbrahim; Ilham Berrada

In this paper, we present a novel model that we propose for document representation. In contrast with the classical Vector Space Model, which represents each document by a unique vector in the feature space, our model consists of representing each document by a vector in the space of the training documents of each category. For this novel model, we develop a discriminative classifier based on the norms of the vectors generated by our model. We call this algorithm the Nearest Centroid based on Vector Norms. Our major goal, in proposing such a new classification framework, is to overcome the problems related to huge dimensionality and vector sparsity that are commonly faced in Text Classification problems. We evaluate the performance of the proposed framework by comparing its effectiveness and efficiency with those of some standard classifiers when used with the classical document representation. The studied classifiers are Naive Bayes (NB), Support Vector Machines (SVM) and k-Nearest Neighbors (kNN). We conduct our experiments on multi-lingual balanced and unbalanced binary data sets. Our results show that our algorithm typically performs well, since it is competitive with the classical methods and, at the same time, dramatically faster, especially in comparison with NB and kNN. We also apply our model on the Reuters-21578 corpus so as to evaluate its performance in a multi-class environment. The obtained result (85.4% in terms of micro-F1) is promising and can be improved in future work.
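Since only the outline of the algorithm is given above, the following is a speculative sketch of one reading of the representation: each test document is described, per category, by its cosine similarities to that category's training documents, and the predicted class is the one whose similarity vector has the largest size-normalized norm. The function name and normalization are assumptions; the authors' exact NCVN formulation may differ.

```python
# Speculative sketch of a norm-based nearest-centroid-style classifier.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def ncvn_predict(X_test, X_train, y_train):
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    scores = np.zeros((X_test.shape[0], classes.size))

    for j, c in enumerate(classes):
        X_c = X_train[y_train == c]                # training docs of class c
        sims = cosine_similarity(X_test, X_c)      # one similarity vector per test doc
        # Norm of the per-class similarity vector, normalized by class size so
        # that larger classes do not dominate on unbalanced data.
        scores[:, j] = np.linalg.norm(sims, axis=1) / np.sqrt(X_c.shape[0])

    return classes[np.argmax(scores, axis=1)]
```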


International Conference on Innovative Techniques and Applications of Artificial Intelligence | 2013

Inferring Context from Users’ Reviews for Context Aware Recommendation

Fatima Zahra Lahlou; Houda Benbrahim; Asmaa Mountassir; Ismail Kassou

Context Aware Recommendation Systems are Recommender Systems that provide recommendations based not only on users and items, but also on other information related to the context. A first challenge in building these systems is to obtain the contextual information. In this paper, we explore how accurately contextual information can be inferred from users' reviews. For this purpose, we use Text Classification techniques and conduct several experiments to identify the appropriate text-representation settings and classification algorithm for the context inference problem. We carry out our experiments on two datasets containing reviews related to hotels and cars, and aim to infer the contextual information 'intent of purchase' from these reviews. To infer context from reviews, we recommend removing terms that occur once in the data set, combining unigrams, bigrams and trigrams, adopting a TF-IDF weighting scheme and using the Multinomial Naive Bayes algorithm rather than Support Vector Machines.
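A minimal sketch of the recommended configuration (uni/bi/trigrams, singleton terms removed, TF-IDF weighting, Multinomial Naive Bayes) could look like the scikit-learn pipeline below; the names and the commented usage lines are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the recommended setup for inferring context from review text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

context_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3),   # combine unigrams, bigrams and trigrams
                    min_df=2),            # drop terms that occur only once
    MultinomialNB(),
)

# context_clf.fit(review_texts, context_labels)      # labels such as intent-of-purchase classes
# context_clf.predict(["Looking for a family car for weekend road trips"])
```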


International Conference on Artificial Intelligence in Theory and Practice | 2008

A Fuzzy Semi-Supervised Support Vector Machines Approach to Hypertext Categorization

Houda Benbrahim; Max Bramer

Hypertext/text domains are characterized by several tens or hundreds of thousands of features. This represents a challenge for supervised learning algorithms, which have to learn accurate classifiers from a small set of available training examples. In this paper, a fuzzy semi-supervised support vector machines (FSS-SVM) algorithm is proposed. It tries to overcome the need for a large labelled training set by using both labelled and unlabelled data for training, while modulating the effect of the unlabelled data in the learning process. Empirical evaluations with two real-world hypertext datasets showed that, by additionally using unlabelled data, FSS-SVM requires less labelled training data than its supervised version, support vector machines, to achieve the same level of classification performance. Also, incorporating fuzzy membership values for the unlabelled training patterns in the learning process positively influenced classification performance in comparison with the crisp variant.
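The FSS-SVM modifies the SVM training problem itself, which is not reproduced here. As a very rough approximation of the idea, the sketch below pseudo-labels the unlabelled data with an initial SVM, derives a fuzzy membership from the classifier's confidence, and retrains with those memberships as sample weights; the function name, the use of LinearSVC, and the capping of memberships at 1 are all assumptions.

```python
# Rough, simplified approximation of the fuzzy semi-supervised idea (not the paper's
# actual optimization problem). Assumes dense feature arrays and binary labels.
import numpy as np
from sklearn.svm import LinearSVC

def fuzzy_semi_supervised_svm(X_lab, y_lab, X_unlab, C=1.0):
    base = LinearSVC(C=C).fit(X_lab, y_lab)

    # Pseudo-label the unlabelled data and derive a fuzzy membership in (0, 1]
    # from the distance to the separating hyperplane (capped at 1).
    scores = base.decision_function(X_unlab)
    pseudo = base.classes_[(scores > 0).astype(int)]
    membership = np.minimum(np.abs(scores), 1.0)

    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, pseudo])
    weights = np.concatenate([np.ones(len(y_lab)), membership])

    # Unlabelled examples contribute to training proportionally to their membership.
    return LinearSVC(C=C).fit(X_all, y_all, sample_weight=weights)
```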

Collaboration


Dive into Houda Benbrahim's collaborations.

Top Co-Authors


Max Bramer

University of Portsmouth
