Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Tijl De Bie is active.

Publication


Featured research published by Tijl De Bie.


Bioinformatics | 2004

A statistical framework for genomic data fusion

Gert R. G. Lanckriet; Tijl De Bie; Nello Cristianini; Michael I. Jordan; William Stafford Noble

MOTIVATION During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. RESULTS This paper describes a computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Each dataset is represented via a kernel function, which defines generalized similarity relationships between pairs of entities, such as genes or proteins. The kernel representation is both flexible and efficient, and can be applied to many different types of data. Furthermore, kernel functions derived from different types of data can be combined in a straightforward fashion. Recent advances in the theory of kernel methods have provided efficient algorithms to perform such combinations in a way that minimizes a statistical loss function. These methods exploit semidefinite programming techniques to reduce the problem of finding the optimal kernel combination to a convex optimization problem. Computational experiments performed using yeast genome-wide datasets, including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions, demonstrate the utility of this approach. A statistical learning algorithm trained from all of these data to recognize particular classes of proteins--membrane proteins and ribosomal proteins--performs significantly better than the same algorithm trained on any single type of data. AVAILABILITY Supplementary data at http://noble.gs.washington.edu/proj/sdp-svm
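A minimal sketch of the kernel-combination idea behind the paper. Here the kernel weights are fixed by hand purely for illustration; the paper learns them with semidefinite programming. The data, kernels, and weights below are all assumptions, not the paper's experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for two genome-wide "views" of the same entities.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Each view is represented by a kernel matrix.
K_linear = X @ X.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq_dists / X.shape[1])

# Convex combination of the kernels; the weights here are fixed,
# whereas the paper's SDP approach would optimize them.
mu = [0.5, 0.5]
K = mu[0] * K_linear + mu[1] * K_rbf

# A kernel machine trained on the combined kernel.
clf = SVC(kernel="precomputed").fit(K, y)
train_acc = clf.score(K, y)
```

Because kernel matrices over the same entities can simply be added, heterogeneous data types (sequences, expression profiles, interactions) integrate without any shared feature representation.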


Bioinformatics | 2006

CAFE: a computational tool for the study of gene family evolution

Tijl De Bie; Nello Cristianini; Jeffery P. Demuth; Matthew W. Hahn

SUMMARY We present CAFE (Computational Analysis of gene Family Evolution), a tool for the statistical analysis of the evolution of the size of gene families. It uses a stochastic birth and death process to model the evolution of gene family sizes over a phylogeny. For a specified phylogenetic tree, and given the gene family sizes in the extant species, CAFE can estimate the global birth and death rate of gene families, infer the most likely gene family size at all internal nodes, identify gene families that have accelerated rates of gain and loss (quantified by a p-value) and identify which branches cause the p-value to be small for significant families. AVAILABILITY Software is available from http://www.bio.indiana.edu/~hahnlab/Software.html
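The core quantity in this kind of analysis is the probability that a family of size c at the top of a branch has size k at the bottom. A sketch of that transition probability under a simple birth-and-death model with equal birth and death rate (the model family CAFE builds on; parameter values below are illustrative, not from the paper):

```python
from math import comb

def bd_transition(c, k, lam, t):
    """P(family size c -> k) along a branch of length t, birth rate =
    death rate = lam (linear birth-death process)."""
    a = lam * t / (1.0 + lam * t)
    return sum(
        comb(c, j)
        * comb(c + k - j - 1, c - 1)
        * a ** (c + k - 2 * j)
        * (1 - 2 * a) ** j
        for j in range(min(c, k) + 1)
    )

# Sanity check: probabilities over all descendant sizes sum to one
# (truncated at k = 200, which is ample for these parameters).
total = sum(bd_transition(2, k, lam=0.002, t=50.0) for k in range(200))
```

Multiplying such probabilities over the branches of a phylogeny, and maximizing over the rate, gives the likelihood machinery behind CAFE's rate estimates and p-values.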


PLOS ONE | 2006

The Evolution of Mammalian Gene Families

Jeffery P. Demuth; Tijl De Bie; Jason E. Stajich; Nello Cristianini; Matthew W. Hahn

Gene families are groups of homologous genes that are likely to have highly similar functions. Differences in family size due to lineage-specific gene duplication and gene loss may provide clues to the evolutionary forces that have shaped mammalian genomes. Here we analyze the gene families contained within the whole genomes of human, chimpanzee, mouse, rat, and dog. In total we find that more than half of the 9,990 families present in the mammalian common ancestor have either expanded or contracted along at least one lineage. Additionally, we find that a large number of families are completely lost from one or more mammalian genomes, and a similar number of gene families have arisen subsequent to the mammalian common ancestor. Along the lineage leading to modern humans we infer the gain of 689 genes and the loss of 86 genes since the split from chimpanzees, including changes likely driven by adaptive natural selection. Our results imply that humans and chimpanzees differ by at least 6% (1,418 of 22,000 genes) in their complement of genes, which stands in stark contrast to the oft-cited 1.5% difference between orthologous nucleotide sequences. This genomic “revolving door” of gene gain and loss represents a large number of genetic differences separating humans from our closest relatives.


intelligent systems in molecular biology | 2007

Kernel-based data fusion for gene prioritization

Tijl De Bie; Léon-Charles Tranchevent; Liesbeth van Oeffelen; Yves Moreau

MOTIVATION Hunting disease genes is a problem of primary importance in biomedical research. Biologists usually approach this problem in two steps: first a set of candidate genes is identified using traditional positional cloning or high-throughput genomics techniques; second, these genes are further investigated and validated in the wet lab, one by one. To speed up discovery and limit the number of costly wet lab experiments, biologists must test the candidate genes starting with the most probable candidates. So far, biologists have relied on literature studies, extensive queries to multiple databases and hunches about expected properties of the disease gene to determine such an ordering. Recently, we have introduced the data mining tool ENDEAVOUR (Aerts et al., 2006), which performs this task automatically by relying on different genome-wide data sources, such as Gene Ontology, literature, microarray, sequence and more. RESULTS In this article, we present a novel kernel method that operates in the same setting: based on a number of different views on a set of training genes, a prioritization of test genes is obtained. We furthermore provide a thorough learning-theoretical analysis of the method's guaranteed performance. Finally, we apply the method to the disease data sets on which ENDEAVOUR (Aerts et al., 2006) has been benchmarked, and report a considerable improvement in empirical performance. AVAILABILITY The MATLAB code used in the empirical results will be made publicly available.
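A toy sketch of the prioritization setting: candidate genes are ranked by their similarity, under a combined kernel, to the known disease genes. This simple mean-similarity score with fixed kernel weights is an assumption standing in for the paper's one-class kernel method with learned weights; all data below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic data "views" of 8 genes; genes 0-2 play the role of
# the known disease (training) genes.
view1 = rng.normal(size=(8, 5))
view2 = rng.normal(size=(8, 4))

def rbf_kernel(X, gamma=0.5):
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

# Fixed convex combination of per-view kernels (the paper learns the
# weights; equal weights here are purely illustrative).
K = 0.5 * rbf_kernel(view1) + 0.5 * rbf_kernel(view2)

train = [0, 1, 2]
candidates = [3, 4, 5, 6, 7]

# Score each candidate by its mean kernel similarity to the training
# genes, then rank in decreasing order of score.
scores = {g: K[g, train].mean() for g in candidates}
ranking = sorted(candidates, key=scores.get, reverse=True)
```

The output is an ordering of the candidates, so the wet-lab validation budget can be spent on the most promising genes first.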


european conference on machine learning | 2010

Flu detector: tracking epidemics on twitter

Vasileios Lampos; Tijl De Bie; Nello Cristianini

We present an automated tool with a web interface for tracking the prevalence of Influenza-like Illness (ILI) in several regions of the United Kingdom using the contents of Twitter's microblogging service. Our dataset comprises a daily average of approximately 200,000 geolocated tweets collected by targeting 49 urban centres in the UK over a period of 40 weeks. Official ILI rates from the Health Protection Agency (HPA) form our ground truth. Bolasso, the bootstrapped version of LASSO, is applied in order to extract a consistent set of features, which are then used for learning a regression model.
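A sketch of the Bolasso step on synthetic data (not tweets): LASSO is fit on bootstrap resamples, and only the features selected in every run are kept as the consistent feature set. Data dimensions, the regularization strength, and the number of bootstraps are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ true_coef + 0.1 * rng.normal(size=n)

n_boot = 20
supports = []
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)       # bootstrap resample
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    supports.append(set(np.flatnonzero(model.coef_)))

# Consistent feature set = intersection over all bootstrap runs.
selected = sorted(set.intersection(*supports))
```

Intersecting supports across resamples filters out features that a single LASSO fit picks up by chance, which is exactly why Bolasso yields a stable vocabulary of flu-related terms for the regression model.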


Handbook of geometric computing | 2005

Eigenproblems in Pattern Recognition

Tijl De Bie; Nello Cristianini; Roman Rosipal

The task of studying the properties of configurations of points embedded in a metric space has long been a central task in pattern recognition, but has acquired even greater importance after the recent introduction of kernel-based learning methods. These methods work by virtually embedding general types of data in a vector space, and then analyzing the properties of the resulting data cloud. While a number of techniques for this task have been developed in fields as diverse as multivariate statistics, neural networks, and signal processing, many of them show an underlying unity. In this chapter we describe a large class of pattern analysis methods based on the use of generalized eigenproblems, which reduce to solving the equation Aw = λBw.
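Solving Aw = λBw numerically is a one-liner for symmetric A and symmetric positive-definite B, which is the setting shared by PCA, CCA, Fisher discriminant analysis, and spectral clustering. A minimal sketch with random matrices:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M + M.T                      # symmetric
B = M @ M.T + 5 * np.eye(5)      # symmetric positive definite

# eigh(A, B) solves the generalized eigenproblem A w = lambda B w.
eigvals, eigvecs = eigh(A, B)

# Check the defining equation for the leading eigenpair.
w = eigvecs[:, -1]
residual = np.linalg.norm(A @ w - eigvals[-1] * (B @ w))
```

Different choices of A and B (covariance, cross-covariance, scatter, or Laplacian matrices) recover the different pattern analysis methods the chapter unifies.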


european conference on principles of data mining and knowledge discovery | 2007

MINI: Mining Informative Non-redundant Itemsets

Arianna Gallo; Tijl De Bie; Nello Cristianini

Frequent itemset mining assists the data mining practitioner in searching for strongly associated items (and transactions) in large transaction databases. Since the number of frequent itemsets is usually extremely large and unmanageable for a human user, recent works have sought to define condensed representations of them, e.g. closed or maximal frequent itemsets. We argue that not only do these methods often still fall short in sufficiently reducing the output size, but they also output many redundant itemsets. In this paper we propose a philosophically new approach that resolves both these issues in a computationally tractable way. We present and empirically validate a statistically founded approach called MINI, which compresses the set of frequent itemsets down to a list of informative and non-redundant itemsets.
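To make concrete the objects MINI compresses, here is a brute-force enumeration of frequent itemsets on a tiny toy database. This is plain frequent itemset mining, not the MINI compression step itself; transactions and the support threshold are made up.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"milk", "beer"},
    {"bread", "milk", "beer"},
]
min_support = 2  # frequent = occurs in at least 2 transactions

items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    found = False
    for itemset in combinations(items, size):
        support = sum(set(itemset) <= t for t in transactions)
        if support >= min_support:
            frequent[itemset] = support
            found = True
    if not found:  # no frequent itemset of this size => none larger
        break
```

Even this 4-transaction database yields 7 frequent itemsets, several of them redundant given the others; on realistic data the blow-up is what motivates condensed and, in MINI's case, statistically informative summaries.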


knowledge discovery and data mining | 2011

An information theoretic framework for data mining

Tijl De Bie

We formalize the data mining process as a process of information exchange, defined by the following key components. The data miner's state of mind is modeled as a probability distribution, called the background distribution, which represents the uncertainty and misconceptions the data miner has about the data. This model initially incorporates any prior (possibly incorrect) beliefs a data miner has about the data. During the data mining process, properties of the data (to which we refer as patterns) are revealed to the data miner, either in batch, one by one, or even interactively. This acquisition of information in the data mining process is formalized by updates to the background distribution to account for the presence of the found patterns. The proposed framework can be motivated using concepts from information theory and game theory. Understanding it from this perspective, it is easy to see how it can be extended to more sophisticated settings, e.g. where patterns are probabilistic functions of the data (thus allowing one to account for noise and errors in the data mining process, and allowing one to study data mining techniques based on subsampling the data). The framework then models the data mining process using concepts from information geometry, and I-projections in particular. The framework can be used to help in designing new data mining algorithms that maximize the efficiency of the information exchange from the algorithm to the data miner.
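A toy sketch of the I-projection update: the background distribution is revised to satisfy a revealed pattern (here, a constraint on the mean over a finite domain) while moving as little as possible in KL divergence from the prior. The solution is an exponential tilting of the prior; the Lagrange multiplier is found by bisection. Domain, prior, and constraint are all illustrative assumptions.

```python
import numpy as np

values = np.arange(6)                 # finite domain {0, ..., 5}
prior = np.full(6, 1 / 6)             # uniform background distribution
target_mean = 3.5                     # the "pattern" revealed to the miner

def tilted(lam):
    """Exponentially tilted prior, renormalized."""
    w = prior * np.exp(lam * values)
    return w / w.sum()

# The tilted mean increases monotonically in lam, so bisect for the
# multiplier that makes the constraint hold exactly.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    if tilted(mid) @ values < target_mean:
        lo = mid
    else:
        hi = mid

posterior = tilted((lo + hi) / 2)
achieved_mean = posterior @ values
```

The posterior is the I-projection of the prior onto the set of distributions consistent with the pattern; iterating this update as patterns arrive is the framework's model of the miner's evolving state of mind.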


Genome Biology | 2009

DISTILLER: a data integration framework to reveal condition dependency of complex regulons in Escherichia coli

Karen Lemmens; Tijl De Bie; Thomas Dhollander; Sigrid De Keersmaecker; Inge Thijs; Geert Schoofs; Ami De Weerdt; Bart De Moor; Jos Vanderleyden; Julio Collado-Vides; Kristof Engelen; Kathleen Marchal

We present DISTILLER, a data integration framework for the inference of transcriptional module networks. Experimental validation of predicted targets for the well-studied fumarate nitrate reductase regulator showed the effectiveness of our approach in Escherichia coli. In addition, the condition dependency and modularity of the inferred transcriptional network were studied. Surprisingly, the level of regulatory complexity seemed lower than that which would be expected from RegulonDB, indicating that complex regulatory programs tend to decrease the degree of modularity.


international conference on pattern recognition | 2004

Learning from General Label Constraints

Tijl De Bie; Johan Suykens; Bart De Moor

Most machine learning algorithms are designed either for supervised or for unsupervised learning, notably classification and clustering. Practical problems in bioinformatics and in vision however show that this setting often is an oversimplification of reality. While label information is of course invaluable in most cases, it would be a huge waste to ignore the information on the cluster structure that is present in an (often much larger) unlabeled sample set. Several recent contributions deal with this topic: given partially labeled data, exploit all information available. In this paper, we present an elegant and efficient algorithm that allows one to deal with very general types of label constraints in class learning problems. The approach is based on spectral clustering, and leads to an efficient algorithm based on a simple eigenvalue problem.
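For context, here is unconstrained spectral clustering on a toy two-cluster dataset, the starting point that the paper extends with general label constraints. The affinity function and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters of 10 points each.
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])

# Gaussian affinity matrix and unnormalized graph Laplacian.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2)
L = np.diag(W.sum(1)) - W

# The eigenvector of the second-smallest eigenvalue (Fiedler vector)
# separates the two clusters by sign.
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
labels = (fiedler > 0).astype(int)
```

Because the clustering reduces to an eigenvalue problem on the Laplacian, partial label information can be folded in as constraints on the eigenvector, which is the direction the paper develops.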

Collaboration


Dive into Tijl De Bie's collaborations.

Top Co-Authors

Bart De Moor

Katholieke Universiteit Leuven

Raul Santos-Rodriguez

Charles III University of Madrid