Halima Bensmail
Qatar Computing Research Institute
Publication
Featured research published by Halima Bensmail.
Statistics and Computing | 1997
Halima Bensmail; Gilles Celeux; Adrian E. Raftery; Christian P. Robert
A new approach to cluster analysis has been introduced based on parsimonious geometric modelling of the within-group covariance matrices in a mixture of multivariate normal distributions, using hierarchical agglomeration and iterative relocation. It works well and is widely used via the MCLUST software available in S-PLUS and StatLib. However, it has several limitations: there is no assessment of the uncertainty about the classification, the partition can be suboptimal, parameter estimates are biased, the shape matrix has to be specified by the user, prior group probabilities are assumed to be equal, the method for choosing the number of groups is based on a crude approximation, and no formal way of choosing between the various possible models is included. Here, we propose a new approach which overcomes all these difficulties. It consists of exact Bayesian inference via Gibbs sampling, and the calculation of Bayes factors (for choosing the model and the number of groups) from the output using the Laplace–Metropolis estimator. It works well in several real and simulated examples.
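To illustrate the flavor of the approach, here is a minimal Gibbs sampler for a two-component, one-dimensional Gaussian mixture with known unit variances. This is a toy sketch, not the paper's implementation (which handles parsimonious covariance structures, unequal prior group probabilities and Bayes factors); all priors and constants below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated 1-D Gaussian clusters (true means -3 and 3).
x = np.concatenate([rng.normal(-3, 1, 60), rng.normal(3, 1, 60)])
n, K = len(x), 2

z = rng.integers(0, K, n)            # cluster labels
mu = np.array([-1.0, 1.0])           # cluster means (unit variance assumed known)
pi = np.full(K, 1.0 / K)             # mixing proportions

for _ in range(200):
    # 1) Sample labels from their full conditional (a categorical distribution).
    logp = np.log(pi) - 0.5 * (x[:, None] - mu[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=row) for row in p])
    # 2) Sample means from their conjugate normal posteriors (prior N(0, 10^2)).
    for k in range(K):
        xk = x[z == k]
        prec = len(xk) + 1.0 / 100.0
        mu[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))
    # 3) Sample mixing proportions from their Dirichlet posterior.
    pi = rng.dirichlet(1 + np.bincount(z, minlength=K))
```

Each sweep draws labels, means and mixing proportions from their full conditionals; computing Bayes factors from the output via the Laplace–Metropolis estimator is omitted here.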
Pattern Recognition | 2013
Jim Jing-Yan Wang; Halima Bensmail; Xin Gao
Non-negative matrix factorization (NMF) has been widely used as a component-based data representation method. To overcome the disadvantage of NMF in failing to consider the manifold structure of a data set, graph-regularized NMF (GrNMF) was proposed by Cai et al., which constructs an affinity graph and searches for a matrix factorization that respects the graph structure. Selecting a graph model and its corresponding parameters is critical for this strategy. This selection is usually carried out by cross-validation or discrete grid search, which are time consuming and prone to overfitting. In this paper, we propose a variant of GrNMF, called MultiGrNMF, in which the intrinsic manifold is approximated by a linear combination of several graphs with different models and parameters, inspired by ensemble manifold regularization. The factorization matrices and the linear combination coefficients of the graphs are determined simultaneously within a unified objective function and optimized alternately in an iterative algorithm, resulting in a novel data representation algorithm. Extensive experiments on a protein subcellular localization task and an Alzheimer's disease diagnosis task demonstrate the effectiveness of the proposed algorithm.
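The alternating scheme can be sketched as follows, under simplifying assumptions: two random affinity graphs stand in for different graph models, the factorization uses standard multiplicative GrNMF updates, and the graph coefficients are reweighted by a softmax-style heuristic (smoother codes on a graph earn it more weight) rather than the paper's exact update.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data (features x samples) and two candidate affinity graphs standing in
# for different graph models/parameters; everything here is illustrative.
X = np.abs(rng.normal(size=(8, 20)))
graphs = []
for density in (0.2, 0.4):
    W = (rng.random((20, 20)) < density).astype(float)
    W = np.triu(W, 1)
    graphs.append(W + W.T)                 # symmetric, zero diagonal

r, lam = 3, 0.1                            # rank and regularization weight
U = np.abs(rng.normal(size=(8, r)))
V = np.abs(rng.normal(size=(20, r)))       # per-sample codes
mu = np.full(len(graphs), 1.0 / len(graphs))  # graph combination coefficients

for _ in range(100):
    Wc = sum(m * W for m, W in zip(mu, graphs))   # combined affinity
    Dc = np.diag(Wc.sum(1))                        # combined degree matrix
    # Multiplicative updates for graph-regularized NMF.
    U *= (X @ V) / (U @ V.T @ V + 1e-9)
    V *= (X.T @ U + lam * Wc @ V) / (V @ U.T @ U + lam * Dc @ V + 1e-9)
    # Reweight graphs: small tr(V'LV) means V is smooth on that graph.
    smooth = np.array([np.trace(V.T @ (np.diag(W.sum(1)) - W) @ V)
                       for W in graphs])
    mu = np.exp(-smooth / (smooth.mean() + 1e-9))
    mu /= mu.sum()

err = np.linalg.norm(X - U @ V.T) / np.linalg.norm(X)
```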
BioMed Research International | 2005
Zhenqiu Liu; Dechang Chen; Halima Bensmail
One important feature of gene expression data is that the number of genes M far exceeds the number of samples N. Standard statistical methods do not work well when N < M, so new methodologies, or modifications of existing ones, are needed for the analysis of microarray data. In this paper, we propose a novel procedure for classifying gene expression data. It combines dimension reduction using kernel principal component analysis (KPCA), a nonlinear generalization of principal component analysis, with classification by logistic regression (discrimination). The proposed algorithm was applied to five gene expression datasets involving human tumor samples. Comparison with popular classification methods such as support vector machines and neural networks shows that our algorithm is very promising for classifying gene expression data.
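A minimal version of the two-stage procedure (KPCA followed by logistic regression) might look like this; the RBF kernel, bandwidth choice, number of components and gradient-descent fit are illustrative simplifications, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "expression" data: N = 40 samples, M = 200 genes (N << M), two classes
# that differ on the first 20 genes. All sizes are illustrative.
N, M = 40, 200
y = np.repeat([0, 1], N // 2)
X = rng.normal(size=(N, M)) + 4.0 * y[:, None] * (np.arange(M) < 20)

# Kernel PCA: eigendecompose the double-centred RBF kernel matrix.
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
Kmat = np.exp(-sq / sq.mean())                        # mean sq. dist. bandwidth
J = np.eye(N) - np.ones((N, N)) / N
vals, vecs = np.linalg.eigh(J @ Kmat @ J)
top = np.argsort(vals)[::-1][:5]                      # keep 5 components
Z = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))  # nonlinear scores

# Logistic regression on the KPCA scores, fit by plain gradient descent.
w, b = np.zeros(Z.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
    w -= 0.1 * Z.T @ (p - y) / N
    b -= 0.1 * (p - y).mean()

acc = ((Z @ w + b > 0) == (y == 1)).mean()            # training accuracy
```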
BMC Bioinformatics | 2012
Jim Jing-Yan Wang; Halima Bensmail; Xin Gao
Background: Protein domain ranking is a fundamental task in structural biology. Most protein domain ranking methods rely on the pairwise comparison of protein domains while neglecting the global manifold structure of the protein domain database. Recently, graph regularized ranking, which exploits the global structure of the graph defined by the pairwise similarities, has been proposed. However, the existing graph regularized ranking methods are very sensitive to the choice of the graph model and parameters, and this remains a difficult problem for most protein domain ranking methods.
Results: To tackle this problem, we have developed the Multiple Graph regularized Ranking algorithm, MultiG-Rank. Instead of using a single graph to regularize the ranking scores, MultiG-Rank approximates the intrinsic manifold of the protein domain distribution by combining multiple initial graphs for the regularization. Graph weights are learned jointly and automatically with the ranking scores by alternately minimizing an objective function in an iterative algorithm. Experimental results on a subset of the ASTRAL SCOP protein domain database demonstrate that MultiG-Rank achieves better ranking performance than single graph regularized ranking methods and pairwise similarity based ranking methods.
Conclusion: The problem of graph model and parameter selection in graph regularized protein domain ranking can be solved effectively by combining multiple graphs. This generalization opens a new avenue for applying multiple graphs to protein domain ranking applications.
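The alternating scheme can be sketched as follows, under simplifying assumptions: k-NN graphs over toy 2-D points stand in for protein domain similarity graphs, the ranking step uses the closed-form minimizer of a standard graph-regularized objective, and the graph-weight update is a softmax-style heuristic rather than the paper's exact rule.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "domain" database: 30 items in 2-D; item 0 plays the query.
X = rng.normal(size=(30, 2))
y = np.zeros(30)
y[0] = 1.0                         # relevance vector: only the query is labeled

def knn_graph(X, k):
    """Symmetric k-nearest-neighbour affinity graph (0/1 weights)."""
    d = ((X[:, None] - X[None]) ** 2).sum(-1)
    W = np.zeros_like(d)
    for i in range(len(X)):
        for j in np.argsort(d[i])[1:k + 1]:   # skip self at position 0
            W[i, j] = W[j, i] = 1.0
    return W

graphs = [knn_graph(X, k) for k in (3, 5, 10)]  # candidate graph models
mu = np.full(len(graphs), 1.0 / len(graphs))     # graph weights
alpha = 1.0                                      # regularization strength

for _ in range(20):
    Wc = sum(m * W for m, W in zip(mu, graphs))
    L = np.diag(Wc.sum(1)) - Wc
    # Ranking step: closed-form minimizer of ||f - y||^2 + alpha * f'Lf.
    f = np.linalg.solve(np.eye(30) + alpha * L, y)
    # Weight step: graphs on which the scores are smooth gain weight.
    s = np.array([f @ (np.diag(W.sum(1)) - W) @ f for W in graphs])
    mu = np.exp(-s / (s.mean() + 1e-9))
    mu /= mu.sum()

order = np.argsort(-f)             # items ranked by score, query first
```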
IIE Transactions | 2006
Chih-Hsuan Wang; Way Kuo; Halima Bensmail
The detection of process problems and parameter drift at an early stage is crucial to successful semiconductor manufacturing. Defect patterns on the wafer are an important source of information for quality engineers, allowing them to isolate production problems. Traditionally, defect recognition is performed by quality engineers using a scanning electron microscope. This manual approach is not only expensive and time-consuming but also leads to high misidentification rates. In this paper, an automatic approach consisting of a spatial filter, a classification module and an estimation module is proposed and validated on both real and simulated data. Experimental results show that three typical defect patterns, (i) a linear scratch, (ii) a circular ring and (iii) an elliptical zone, can be successfully extracted and classified. A Gaussian EM algorithm is used to estimate the elliptic and linear patterns, and a spherical-shell algorithm is used to estimate ring patterns. Furthermore, both convex and nonconvex defect patterns can be recognized simultaneously via a hybrid clustering method. The proposed method has the potential to be applied in other industries.
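As one concrete piece of the estimation module, a circular-ring defect pattern can be recovered from noisy points with an algebraic circle fit. This Kasa-style least-squares fit is a simple stand-in for the spherical-shell algorithm mentioned above; all data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated ring defect: points near a circle of radius 3 centred at (1, -2).
t = rng.uniform(0, 2 * np.pi, 200)
pts = np.c_[1 + 3 * np.cos(t), -2 + 3 * np.sin(t)] + rng.normal(0, 0.05, (200, 2))

# Algebraic (Kasa) circle fit: solve x^2 + y^2 = 2a*x + 2b*y + c in least
# squares; the centre is (a, b) and the radius is sqrt(c + a^2 + b^2).
A = np.c_[2 * pts, np.ones(len(pts))]
rhs = (pts ** 2).sum(1)
(a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
radius = np.sqrt(c + a * a + b * b)

print(a, b, radius)   # centre ≈ (1, -2), radius ≈ 3
```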
Pattern Recognition | 2013
Jim Jing-Yan Wang; Halima Bensmail; Xin Gao
Automated classification of tissue types in Regions of Interest (ROIs) of medical images is an important application in Computer-Aided Diagnosis (CAD). Recently, bag-of-feature methods, which treat each ROI as a set of local features, have shown their power in this field. This paper investigates two key issues of the bag-of-feature strategy for tissue classification: visual vocabulary learning and visual word weighting, which are usually handled independently in traditional methods, neglecting the inner relationship between the visual words and their weights. To overcome this problem, we develop a novel algorithm, Joint-ViVo, which learns the vocabulary and the visual word weights jointly. A unified large-margin objective function is defined over both the visual vocabulary and the visual word weights, and optimized alternately in an iterative algorithm. We test our algorithm on three tissue classification tasks: classifying breast tissue density in mammograms, classifying lung tissue in High-Resolution Computed Tomography (HRCT) images, and identifying brain tissue type in Magnetic Resonance Imaging (MRI). The results show that Joint-ViVo outperforms state-of-the-art methods on tissue classification problems.
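A stripped-down bag-of-feature pipeline conveys the two quantities involved. Here the vocabulary comes from plain k-means and the word weights from a crude class-difference heuristic; both are illustrative stand-ins for the joint large-margin optimization of Joint-ViVo, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Each "ROI" is a set of local 2-D feature vectors; two toy classes.
def make_roi(shift):
    return rng.normal(shift, 0.3, size=(30, 2))

rois = [make_roi(0.0) for _ in range(10)] + [make_roi(1.0) for _ in range(10)]
labels = np.array([0] * 10 + [1] * 10)

# Step 1: learn a visual vocabulary with plain k-means on the pooled features.
feats = np.vstack(rois)
k = 4
centers = feats[rng.choice(len(feats), k, replace=False)]
for _ in range(20):
    assign = ((feats[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    centers = np.array([feats[assign == j].mean(0) if (assign == j).any()
                        else centers[j] for j in range(k)])

# Step 2: encode each ROI as a weighted visual-word histogram.
def encode(roi, weights):
    a = ((roi[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    h = np.bincount(a, minlength=k).astype(float)
    return weights * h / h.sum()

# Crude discriminative word weights: words whose frequency differs most
# between the two classes get larger weight.
H = np.array([encode(r, np.ones(k)) for r in rois])
weights = np.abs(H[labels == 0].mean(0) - H[labels == 1].mean(0)) + 1e-6
H = np.array([encode(r, weights) for r in rois])

# Nearest-class-mean classification on the weighted histograms.
m0, m1 = H[labels == 0].mean(0), H[labels == 1].mean(0)
pred = (((H - m1) ** 2).sum(1) < ((H - m0) ** 2).sum(1)).astype(int)
acc = (pred == labels).mean()
```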
Expert Review of Proteomics | 2006
Abdelali Haoudi; Halima Bensmail
Proteomic studies involve the identification as well as qualitative and quantitative comparison of proteins expressed under different conditions, and elucidation of their properties and functions, usually in a large-scale, high-throughput format. The high dimensionality of data generated from these studies will require the development of improved bioinformatics tools and data-mining approaches for efficient and accurate data analysis of biological specimens from healthy and diseased individuals. Mining large proteomics data sets provides a better understanding of the complexities between the normal and abnormal cell proteome of various biological systems, including environmental hazards, infectious agents (bioterrorism) and cancers. This review will shed light on recent developments in bioinformatics and data-mining approaches, and their limitations when applied to proteomics data sets, in order to strengthen the interdependence between proteomic technologies and bioinformatics tools.
Bioinformatics | 2005
Halima Bensmail; Jennifer Lynn Golek; Michelle M. Moody; John O. Semmes; Abdelali Haoudi
MOTIVATION: Bioinformatics clustering tools are useful at all levels of proteomic data analysis. Proteomics studies rapidly generate large quantities of data from the analysis of biological specimens, and the high dimensionality of these data requires improved bioinformatics tools for efficient and accurate analysis. Proteome profiling of a particular system or organism calls for a number of specialized software tools, and significant advances are needed in the informatics and software required to analyze and manage these massive amounts of data. Clustering algorithms based on probabilistic and Bayesian models provide an alternative to heuristic algorithms: choosing the number of clusters (diseased and non-diseased groups) reduces to choosing the number of components of a mixture of underlying probability distributions, and the Bayesian approach offers a way to incorporate information from the data into the analysis, along with estimates of the uncertainties in the data and the parameters involved.
RESULTS: We present novel algorithms that can organize, cluster and derive meaningful patterns of expression from large-scale proteomics experiments. Raw data were processed by transforming each spectrum from a real-space to a complex-space representation using the discrete Fourier transform, followed by a thresholding step to denoise the spectrum and reduce its length; Bayesian clustering was then applied to the reconstructed data. In comparison with several other algorithms used in this study, including K-means, the Kohonen self-organizing map (SOM) and linear discriminant analysis, the Bayesian-Fourier model-based approach consistently displayed superior performance in selecting the correct model and the number of clusters, thus providing a novel approach for accurate diagnosis of the disease.
Using this approach, we were able to successfully denoise proteomic spectra, achieving up to a 99% reduction in the number of peaks relative to the original data. In addition, the Bayesian-based approach achieved a better classification rate than the other classification algorithms. These findings will allow us to apply the Fourier transform to select the protein profile of each sample and to develop a novel bioinformatic strategy, based on Bayesian clustering, for biomarker discovery and optimal diagnosis.
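The Fourier-thresholding step described in the results can be sketched on a synthetic spectrum; the peak shapes, noise level and 10%-of-maximum threshold below are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic mass-spectrum-like signal: two smooth peaks plus white noise.
t = np.linspace(0, 1, 512)
signal = (np.exp(-((t - 0.3) / 0.02) ** 2)
          + 0.6 * np.exp(-((t - 0.7) / 0.03) ** 2))
noisy = signal + rng.normal(0, 0.05, t.size)

# Discrete Fourier transform, then keep only the large coefficients.
F = np.fft.rfft(noisy)
kept = np.abs(F) >= 0.1 * np.abs(F).max()   # threshold: 10% of the peak magnitude
denoised = np.fft.irfft(np.where(kept, F, 0), n=t.size)

reduction = 1 - kept.sum() / kept.size      # fraction of coefficients discarded
err_noisy = np.abs(noisy - signal).mean()
err_clean = np.abs(denoised - signal).mean()
```

Because a smooth spectrum concentrates its energy in few Fourier coefficients while white noise spreads evenly, thresholding removes most of the noise and shortens the representation at the same time.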
Knowledge-Based Systems | 2013
Jim Jing-Yan Wang; Halima Bensmail; Nan Yao; Xin Gao
Sparse coding has been widely used as an effective data representation method in applications such as computer vision, medical imaging and bioinformatics. However, conventional sparse coding algorithms and their manifold-regularized variants (graph sparse coding and Laplacian sparse coding) learn codebooks and codes in an unsupervised manner and neglect the class information available in the training set. To address this problem, we propose a novel discriminative sparse coding method based on multi-manifolds that learns discriminative class-conditioned codebooks and sparse codes from both data feature spaces and class labels. First, the entire training set is partitioned into multiple manifolds according to the class labels. Then, we formulate sparse coding as a manifold-manifold matching problem and learn class-conditioned codebooks and codes that maximize the manifold margins between different classes. Lastly, we present a data sample-manifold matching-based strategy to classify unlabeled data samples. Experimental results on somatic mutation identification and breast tumor classification from ultrasonic images demonstrate the efficacy of the proposed data representation and classification approach.
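The idea of class-conditioned codebooks can be conveyed with a much-simplified sketch: one k-means codebook per class, and 1-sparse (nearest-atom) reconstruction in place of the paper's multi-manifold large-margin optimization. All data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

# Two-class toy data; each class gets its own small codebook.
X0 = rng.normal(-1, 0.3, size=(50, 2))
X1 = rng.normal(+1, 0.3, size=(50, 2))

def kmeans(X, k, iters=25):
    """Plain k-means; its centroids serve as a crude dictionary of atoms."""
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        a = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([X[a == j].mean(0) if (a == j).any() else C[j]
                      for j in range(k)])
    return C

# Class-conditioned "codebooks".
C0, C1 = kmeans(X0, 3), kmeans(X1, 3)

def recon_error(x, C):
    # 1-sparse coding: reconstruct x with its single nearest atom.
    return ((C - x) ** 2).sum(1).min()

# Classify held-out points by whichever class codebook reconstructs them best.
test_pts = np.array([[-1.1, -0.9], [0.8, 1.2], [-0.7, -1.3]])
pred = [0 if recon_error(x, C0) < recon_error(x, C1) else 1 for x in test_pts]
```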
BioMed Research International | 2003
Halima Bensmail; Abdelali Haoudi
Now that the human genome has been completed, the characterization of the proteins encoded by the sequence remains a challenging task. The study of the complete protein complement of the genome, the “proteome,” referred to as proteomics, will be essential if new therapeutic drugs and new disease biomarkers for early diagnosis are to be developed. Research efforts are already underway to develop the technology necessary to compare the specific protein profiles of diseased versus nondiseased states. These technologies provide a wealth of information and rapidly generate large quantities of data. Processing these large amounts of data will lead to useful predictive mathematical descriptions of biological systems, permitting rapid identification of novel therapeutic targets and of metabolic disorders. Here, we present an overview of the current status of, and future research approaches to, defining the cancer cell proteome, in combination with different bioinformatics and computational biology tools, toward a better understanding of health and disease.