Marcel Brun

Translational Genomics Research Institute

Publication

Featured research published by Marcel Brun.

Journal of Computational Biology | 2002

Inference from Clustering with Application to Gene-Expression Microarrays

Edward R. Dougherty; Junior Barrera; Marcel Brun; Seungchan Kim; Roberto M. Cesar; Yidong Chen; Michael L. Bittner; Jeffrey M. Trent

There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. 
A large amount of generated output is available over the web.
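The evaluation loop described in this abstract (model each class as a mean plus independent noise, generate points, cluster them, count the misclustered points) can be sketched in a few lines. This is a minimal illustration and not the authors' toolbox: the function names, the Gaussian noise model, the plain K-means implementation, and the choice to score the best matching of cluster ids to classes are all assumptions made for the example.

```python
import itertools
import random


def generate_points(means, sigma, per_class, rng):
    """Draw per_class points around each mean with independent Gaussian noise."""
    points, labels = [], []
    for label, mean in enumerate(means):
        for _ in range(per_class):
            points.append([m + rng.gauss(0.0, sigma) for m in mean])
            labels.append(label)
    return points, labels


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iterations=50):
    """Plain K-means; returns the cluster index assigned to each point."""
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment


def clustering_error(true_labels, assignment, k):
    """Misclustered points under the best matching of cluster ids to classes."""
    best = len(true_labels)
    for perm in itertools.permutations(range(k)):
        errors = sum(1 for t, a in zip(true_labels, assignment) if perm[a] != t)
        best = min(best, errors)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical model: two "expression profiles" over three conditions.
    means = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
    points, labels = generate_points(means, sigma=0.8, per_class=30, rng=rng)
    assignment = kmeans(points, k=2, rng=rng)
    print("misclustered points:", clustering_error(labels, assignment, k=2))
```

Repeating the run over several noise levels or replicate counts, and averaging the error, gives the kind of error table the abstract mentions. The permutation search is only practical for a small number of clusters.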

Pattern Recognition | 2007

Model-based evaluation of clustering validation measures

Marcel Brun; Chao Sima; Jianping Hua; James Lowey; Brent Carroll; Edward Suh; Edward R. Dougherty

A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not.
We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.
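The comparison between a validity index and the clustering error rests on Kendall's rank correlation. As a rough sketch, the function below computes the tau-a variant (concordant minus discordant pairs, divided by the number of pairs); the abstract does not say which variant the authors used, and the sample values in the usage section are made up for illustration.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Hypothetical values: one validity score and one measured error per run.
    validity_scores = [0.91, 0.75, 0.60, 0.42, 0.38]
    clustering_errors = [0.02, 0.10, 0.08, 0.25, 0.31]
    print("tau:", kendall_tau(validity_scores, clustering_errors))
```

For an index where higher means better, a value near -1 against the error would indicate that the index tracks the error well, while a value near 0 would indicate that it carries little information about it.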

Signal Processing | 2005

Steady-state probabilities for attractors in probabilistic boolean networks

Marcel Brun; Edward R. Dougherty; Ilya Shmulevich

Boolean networks form a class of disordered dynamical systems that have been studied in physics owing to their relationships with disordered systems in statistical mechanics and in biology as models of genetic regulatory networks. Recently they have been generalized to probabilistic Boolean networks (PBNs) to facilitate the incorporation of uncertainty in the model and to represent cellular context changes in biological modeling. In essence, a PBN is composed of a family of Boolean networks between which the PBN switches in a stochastic fashion. In whatever framework Boolean networks are studied, their most important attribute is their attractors. Left to run, a Boolean network will settle into one of a collection of state cycles called attractors. The set of states from which the network will transition into a specific attractor forms the basin of the attractor. The attractors represent the essential long-run behavior of the network. In a classical Boolean network, the network remains in an attractor once there; in a Boolean network with perturbation, the states form an ergodic Markov chain and the network can escape an attractor, but it will return to it or a different attractor unless interrupted by another perturbation; in a probabilistic Boolean network, so long as the PBN remains in one of its constituent Boolean networks it will behave as a Boolean network with perturbation, but upon a switch it will move to an attractor of the new constituent Boolean network. Given the ergodic nature of the model, the steady-state probabilities of the attractors are critical to network understanding. Heretofore they have been found by simulation; in this paper we derive analytic expressions for these probabilities, first for Boolean networks with perturbation and then for PBNs.
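The notions of attractor and basin can be made concrete by exhaustively simulating a small network. The sketch below assumes a classical, synchronously updated Boolean network without perturbation; the three-gene update rule is a hypothetical example, and the code does not attempt the analytic steady-state expressions that the paper derives.

```python
from itertools import product


def find_attractors(update, n_genes):
    """Map each attractor (a frozenset of states) to the set of states in its basin."""
    basins = {}
    for state in product((0, 1), repeat=n_genes):
        seen = []
        current = state
        # Follow the trajectory until a state repeats; the repeat closes a cycle.
        while current not in seen:
            seen.append(current)
            current = update(current)
        cycle = frozenset(seen[seen.index(current):])
        basins.setdefault(cycle, set()).add(state)
    return basins


def example_update(state):
    """Hypothetical rules: a' = b AND c, b' = a OR c, c' = a."""
    a, b, c = state
    return (b & c, a | c, a)


if __name__ == "__main__":
    for attractor, basin in find_attractors(example_update, 3).items():
        print(sorted(attractor), "basin size:", len(basin))
```

Exhaustive enumeration visits all 2^n states, so it is only practical for small networks.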

Nucleic Acids Research | 2007

Three methods for optimization of cross-laboratory and cross-platform microarray expression data

Phillip Stafford; Marcel Brun

Microarray gene expression data becomes more valuable as our confidence in the results grows. Guaranteeing data quality becomes increasingly important as microarrays are being used to diagnose and treat patients (1–4). The MAQC Quality Control Consortium, the FDA's Critical Path Initiative, NCI's caBIG and others are implementing procedures that will broadly enhance data quality. As GEO continues to grow, its usefulness is constrained by the level of correlation across experiments and general applicability. Although RNA preparation and array platform play important roles in data accuracy, pre-processing is a user-selected factor that has an enormous effect. Normalization of expression data is necessary, but the methods have specific and pronounced effects on precision, accuracy and historical correlation. As a case study, we present a microarray calibration process using normalization as the adjustable parameter. We examine the impact of eight normalizations across both Agilent and Affymetrix expression platforms on three expression readouts: (1) sensitivity and power, (2) functional/biological interpretation and (3) feature selection and classification error. The reader is encouraged to measure their own discordant data, whether cross-laboratory, cross-platform or across any other variance source, and to use their results to tune the adjustable parameters of their laboratory to ensure increased correlation.

Pattern Recognition | 2004

A probabilistic theory of clustering

Edward R. Dougherty; Marcel Brun

Data clustering is typically considered a subjective process, which makes it problematic. For instance, how does one make statistical inferences based on clustering? The matter is different with pattern classification, for which two fundamental characteristics can be stated: (1) the error of a classifier can be estimated using “test data,” and (2) a classifier can be learned using “training data.” This paper presents a probabilistic theory of clustering, including both learning (training) and error estimation (testing). The theory is based on operators on random labeled point processes. It includes an error criterion in the context of random point sets and representation of the Bayes (optimal) cluster operator for a given random labeled point process. Training is illustrated using a nearest-neighbor approach, and trained cluster operators are compared to several classical clustering algorithms.

Current Genomics | 2009

Clustering Algorithms: On Learning, Validation, Performance, and Applications to Genomics

Lori A. Dalton; Virginia L. Ballarin; Marcel Brun

The development of microarray technology has enabled scientists to measure the expression of thousands of genes simultaneously, resulting in a surge of interest in several disciplines throughout biology and medicine. While data clustering has been used for decades in image processing and pattern recognition, in recent years it has joined this wave of activity as a popular technique to analyze microarrays. To illustrate its application to genomics, clustering applied to genes from a set of microarray data groups together those genes whose expression levels exhibit similar behavior throughout the samples, and when applied to samples it offers the potential to discriminate pathologies based on their differential patterns of gene expression. Although clustering has now been used for many years in the context of gene expression microarrays, it has remained highly problematic. The choice of a clustering algorithm and validation index is not a trivial one, more so when applying them to high throughput biological or medical data. Factors to consider when choosing an algorithm include the nature of the application, the characteristics of the objects to be analyzed, the expected number and shape of the clusters, and the complexity of the problem versus computational power available. In some cases a very simple algorithm may be appropriate to tackle a problem, but many situations may require a more complex and powerful algorithm better suited for the job at hand. In this paper, we will cover the theoretical aspects of clustering, including error and learning, followed by an overview of popular clustering algorithms and classical validation indices. We also discuss the relative performance of these algorithms and indices and conclude with examples of the application of clustering to computational biology.

Cancer Letters | 2003

Differential expression of IGFBP-5 and two human ESTs in thyroid glands with goiter, adenoma and papillary or follicular carcinomas.

Beatriz S. Stolf; Alex F. Carvalho; Waleska K. Martins; Franco B. Runza; Marcel Brun; Roberto Hirata; Eduardo Jordão Neves; Fernando Augusto Soares; Juan Postigo-Dias; L.P. Kowalski; Luiz F. L. Reis

Here, we describe the identification of three human genes with altered expression in thyroid diseases. One of them corresponds to insulin-like growth factor binding protein 5 (IGFBP5), which has already been described as overexpressed in other cancers and, for the first time, is identified as overexpressed in thyroid tumors. The other genes, named 44 and 199, are ESTs with yet unknown function and were mapped on human chromosomes seven and four, respectively. We determined by RT-PCR the expression level of these genes in ten samples of disease-free thyroid, ten of goiter, nine of papillary carcinoma, ten of adenoma and seven of follicular carcinoma and the significance of observed differences was statistically determined. IGFBP-5 and gene 44 were significantly overexpressed in papillary carcinoma when compared to normal and goiter. Genes 44 and 199 were differentially expressed in follicular carcinoma and adenoma when compared to normal thyroid tissue.

Journal of Mathematical Imaging and Vision | 2001

Multiresolution Analysis for Optimal Binary Filters

Edward R. Dougherty; Junior Barrera; Gerard Mozelle; Seungchan Kim; Marcel Brun

The performance of a designed digital filter is measured by the sum of the errors of the optimal filter and the estimation error. Viewing an image at a high resolution results in optimal filters having smaller errors than at lower resolutions; however, higher resolutions bring increased estimation error. Hence, choosing an appropriate resolution for filter design is important. The present paper provides expressions for both the error of the optimal filter and the design error for estimating optimal filters in a pyramidal multiresolution framework. The analysis is facilitated by a general characterization of suitable sequences of resolution-constraint mappings. The error expressions are generated from resolution to resolution in a telescoping manner. To take advantage of data at all resolutions, one can use a hybrid multiresolution design to arrive at a multiresolution filter. A sequence of filters is designed using data at increasing resolutions, each filter serves as a prior filter for the next, and the last filter is taken as the designed filter. The value of the multiresolution filter at a given observation is based on the highest resolution at which conditioning by the observation is considered significant.

International Journal of Computational Intelligence and Applications | 2011

Arithmetic Mean Based Compensatory Fuzzy Logic

Agustina Bouchet; Juan Ignacio Pastore; Rafael Espin Andrade; Marcel Brun; Virginia L. Ballarin

Fuzzy Logic is a multi-valued logic model based on fuzzy set theory, which may be considered as an extension of Boolean Logic. One of the fields of this theory is the Compensatory Fuzzy Logic, based on the removal of some axioms in order to achieve a sensitive and idempotent multi-valued system. This system is based on a quadruple of continuous operators: conjunction, disjunction, order and negation. In this work we present a new model of Compensatory Fuzzy Logic based on a different set of operators, conjunction and disjunction, than the ones used in the original definition, and then prove that this new model satisfies the required axioms. As an example, we present an application to decision-making, comparing the results against the ones based on the original model.

Pattern Recognition | 2005

The coefficient of intrinsic dependence (feature selection using el CID)

Tailen Hsing; Li Yu Liu; Marcel Brun; Edward R. Dougherty

Measuring the strength of dependence between two sets of random variables lies at the heart of many statistical problems, in particular, feature selection for pattern recognition. We believe that there are some basic desirable criteria for a measure of dependence not satisfied by many commonly employed measures, such as the correlation coefficient. Briefly stated, a measure of dependence should: (1) be model-free and invariant under monotone transformations of the marginals; (2) fully differentiate different levels of dependence; (3) be applicable to both continuous and categorical distributions; (4) not necessarily have the dependence of X on Y be the same as the dependence of Y on X; (5) be readily estimated from data; and (6) be straightforwardly extended to multivariate distributions. The new measure of dependence introduced in this paper, called the coefficient of intrinsic dependence (CID), satisfies these criteria. The main motivating idea is that Y is strongly (weakly, resp.) dependent on X if and only if the conditional distribution of Y given X is significantly (mildly, resp.) different from the marginal distribution of Y. We measure the difference by the normalized integrated square difference distance so that the full range of dependence can be adequately reflected in the interval [0, 1]. The paper treats estimation of the CID, provides simulations and comparisons, and applies the CID to gene prediction and cancer classification based on gene-expression measurements from microarrays.
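To make the motivating idea concrete, here is a small plug-in sketch for a categorical X. It is one plausible reading of "normalized integrated square difference" and not necessarily the estimator used in the paper: it averages, over the observed values of Y, the squared gap between the conditional and marginal empirical distribution functions, then divides by the same quantity for a fully dependent Y. The function names and the handling of a constant Y are assumptions.

```python
from collections import defaultdict


def empirical_cdf(values, y):
    """Fraction of values less than or equal to y."""
    return sum(1 for v in values if v <= y) / len(values)


def cid(xs, ys):
    """Plug-in estimate of the dependence of Y on a categorical X, in [0, 1].

    Assumed form: sum over observed y of E_X[(F(y|X) - F(y))^2],
    divided by the sum over observed y of F(y) * (1 - F(y)).
    """
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(ys)
    numerator = denominator = 0.0
    for y in ys:
        f_marginal = empirical_cdf(ys, y)
        denominator += f_marginal * (1.0 - f_marginal)
        for members in groups.values():
            weight = len(members) / n  # empirical probability of this X value
            numerator += weight * (empirical_cdf(members, y) - f_marginal) ** 2
    # A constant Y has no variation to explain; report zero dependence.
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    xs = [0, 1, 2, 3]
    ys = [0, 0, 1, 1]  # Y is a function of X, but X is not a function of Y
    print("dependence of Y on X:", cid(xs, ys))
    print("dependence of X on Y:", cid(ys, xs))
```

Under this reading the measure is asymmetric, in line with criterion (4): swapping the roles of the two variables generally gives a different value.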
