Lauri Eronen
University of Helsinki
Publication
Featured research published by Lauri Eronen.
Data Integration in the Life Sciences | 2006
Petteri Sevon; Lauri Eronen; Petteri Hintsanen; Kimmo Kulovesi; Hannu Toivonen
Public biological databases contain vast amounts of rich data that can also be used to create and evaluate new biological hypotheses. We propose a method for link discovery in biological databases, i.e., for prediction and evaluation of implicit or previously unknown connections between biological entities and concepts. In our framework, information extracted from available databases is represented as a graph, where vertices correspond to entities and concepts, and edges represent known, annotated relationships between vertices. A link, an (implicit and possibly unknown) relation between two entities, is manifested as a path or a subgraph connecting the corresponding vertices. We propose measures for link goodness that are based on three factors: edge reliability, relevance, and rarity. We handle these factors with a proper probabilistic interpretation. We give practical methods for finding and evaluating links in large graphs and report experimental results with Alzheimer genes and protein interactions.
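To make the idea of a path-based link concrete, the following is a minimal sketch, not the authors' method: it assumes each edge carries a single reliability probability and that edges are independent, so the reliability of a path is the product of its edge probabilities. The function name `best_path_probability` and the vertex labels are hypothetical, and relevance and rarity are ignored.

```python
import heapq
import math


def best_path_probability(edges, source, target):
    """Return (probability, path) for the most reliable path between two vertices.

    `edges` is a list of (u, v, p) triples for an undirected graph, where p is an
    assumed edge reliability in (0, 1]. Returns (0.0, []) if no path exists.
    """
    graph = {}
    for u, v, p in edges:
        graph.setdefault(u, []).append((v, p))
        graph.setdefault(v, []).append((u, p))

    # Maximizing a product of probabilities equals minimizing the sum of -log(p),
    # so Dijkstra's algorithm applies with non-negative costs.
    cost_to = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    finished = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in finished:
            continue
        finished.add(u)
        if u == target:
            break
        for v, p in graph.get(u, []):
            new_cost = cost - math.log(p)
            if v not in cost_to or new_cost < cost_to[v]:
                cost_to[v] = new_cost
                previous[v] = u
                heapq.heappush(heap, (new_cost, v))

    if target not in cost_to:
        return 0.0, []

    # Walk the predecessor links back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return math.exp(-cost_to[target]), path


# Hypothetical toy graph: the A-B-C path (0.9 * 0.5 = 0.45) beats the direct edge (0.3).
toy_edges = [("A", "B", 0.9), ("B", "C", 0.5), ("A", "C", 0.3)]
probability, path = best_path_probability(toy_edges, "A", "C")
```

A single best path is only one possible goodness measure; evaluating a whole connecting subgraph, as the abstract also mentions, would require combining several paths.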
BMC Bioinformatics | 2006
Lauri Eronen; Floris Geerts; Hannu Toivonen
Background: Haplotypes extracted from human DNA can be used for gene mapping and other analyses of genetic patterns within and across populations. A fundamental problem is, however, that current practical laboratory methods do not give haplotype information. Estimation of phased haplotypes of unrelated individuals given their unphased genotypes is known as the haplotype reconstruction or phasing problem. Results: We define three novel statistical models and give an efficient algorithm for haplotype reconstruction, jointly called HaploRec. HaploRec is based on exploiting local regularities conserved in haplotypes: it reconstructs haplotypes so that they have maximal local coherence. This approach, which does not assume statistical dependence for remotely located markers, has two useful properties: it is well-suited for sparse marker maps, such as those used in gene mapping, and it can actually take advantage of long maps. Conclusion: Our experimental results with simulated and real data show that HaploRec is a powerful method for the large-scale haplotyping needed in association studies. With sample sizes large enough for gene mapping, it appeared to be the best of all tested methods (Phase, fastPhase, PL-EM, Snphap, Gerbil; simulated data); with small samples, it was competitive with the best available methods (real data). HaploRec is several orders of magnitude faster than Phase and comparable to the other methods; the running times are roughly linear in the number of subjects and the number of markers. HaploRec is publicly available at http://www.cs.helsinki.fi/group/genetics/haplotyping.html.
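The phasing problem itself can be illustrated with a short sketch. This is not HaploRec; it only enumerates the haplotype pairs that are consistent with one unphased genotype. The encoding is an assumption made for illustration: each biallelic marker is coded 0 or 2 for the two homozygous states and 1 for heterozygous, and the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    `genotype` uses an assumed encoding per marker: 0 = homozygous for allele 0,
    2 = homozygous for allele 1, 1 = heterozygous.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed; heterozygous sites are filled in below.
    base = [g // 2 for g in genotype]

    # Fixing the phase of the first heterozygous site avoids listing mirror-image pairs.
    free_sites = het_sites[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(free_sites)):
        h1, h2 = list(base), list(base)
        if het_sites:
            h1[het_sites[0]], h2[het_sites[0]] = 0, 1
        for site, bit in zip(free_sites, bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# A genotype with k heterozygous markers has 2**(k - 1) candidate pairs (one if k is 0).
pairs = compatible_haplotype_pairs([1, 0, 1, 2])
```

The number of candidates doubles with each additional heterozygous marker, which is why a statistical model is needed to choose among them rather than exhaustive inspection.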
BMC Bioinformatics | 2012
Lauri Eronen; Hannu Toivonen
Background: Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases. Results: Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes. Conclusions: The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available. The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with focus on human genetics. Some of its functionalities are available in a public query interface at http://biomine.cs.helsinki.fi, allowing searching for and visualizing connections between given biological entities.
Pacific Symposium on Biocomputing | 2003
Lauri Eronen; Floris Geerts; Hannu Toivonen
Haplotypes are important for association-based gene mapping, but there are no practical laboratory methods for obtaining them directly from DNA samples. We propose simple Markov models for reconstruction of haplotypes for a given sample of multilocus genotypes. The models are aimed specifically at long marker maps, where linkage disequilibrium between markers may vary and be relatively weak. Such maps are ultimately used in chromosome or genome-wide association studies. Haplotype reconstruction with standard Markov chains is based on linkage disequilibrium (LD) between neighboring markers. Markov chains of higher order can capture LD in a neighborhood of a given size. We introduce a more flexible and robust model, MC-VL, which is based on a Markov chain of variable order. Experimental validation of the Markov chain methods on both a wide range of simulated data and real data shows that they clearly outperform previous methods on genetically long marker maps and are highly competitive with short maps, too. MC-VL performs well across different data sets and settings while avoiding the problem of manually choosing an appropriate order for the Markov chain, and it has low computational complexity.
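As a rough illustration of the standard Markov chain idea mentioned above (and not of MC-VL, which uses a variable order), the sketch below scores a haplotype under a first-order chain whose transition probabilities are estimated per marker interval from a set of known haplotypes. Biallelic 0/1 markers, the add-one pseudocount, and all function names are assumptions made for this example.

```python
from collections import Counter


def fit_first_order_chain(haplotypes, pseudocount=1.0):
    """Count start alleles and neighboring-allele transitions for 0/1 haplotypes."""
    n_markers = len(haplotypes[0])
    start = Counter(h[0] for h in haplotypes)
    # One transition table per pair of neighboring markers (position-specific).
    transitions = [
        Counter((h[i], h[i + 1]) for h in haplotypes) for i in range(n_markers - 1)
    ]
    return start, transitions, len(haplotypes), pseudocount


def haplotype_probability(model, haplotype):
    """Probability of a haplotype: P(first allele) times the chained P(next | current)."""
    start, transitions, n, pc = model
    prob = (start[haplotype[0]] + pc) / (n + 2 * pc)
    for i in range(len(haplotype) - 1):
        a, b = haplotype[i], haplotype[i + 1]
        from_a = transitions[i][(a, 0)] + transitions[i][(a, 1)]
        prob *= (transitions[i][(a, b)] + pc) / (from_a + 2 * pc)
    return prob


# Hypothetical sample: a haplotype seen often should score higher than an unseen one.
sample = [(0, 0, 1), (0, 0, 1), (1, 1, 0), (0, 0, 1)]
model = fit_first_order_chain(sample)
common = haplotype_probability(model, (0, 0, 1))
unseen = haplotype_probability(model, (1, 0, 0))
```

In a reconstruction setting, scores like these would be combined over the candidate haplotype pairs of each genotype; choosing the order of the chain is the manual step that a variable-order model is meant to avoid.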
Journal of Medical Genetics | 2005
Petteri Hintsanen; Petteri Sevon; Päivi Onkamo; Lauri Eronen; Hannu Toivonen
Motivated by high-throughput genotyping technology, our aim in this study was to experimentally compare the power and accuracy of case-control and family trio based approaches for haplotype-based, large-scale association gene mapping. We compared trio based and case-control study designs in different disease models, and partitioned the performance differences into separate components: those from the sample ascertainment, the effective sample size, and the haplotyping approaches. For systematic and controlled tests, we simulated a rapidly expanding and relatively young isolated population. The experiments were also replicated with real asthma data. We used computationally efficient methods that scale up to large numbers of both markers and individuals. Mapping is based on a haplotype association test for haplotypes of 1–10 markers. For population based haplotype reconstruction, we use HaploRec, and compare it to both a simple trio based inference and true haplotypes. Firstly and surprisingly, statistically inferred population based haplotypes can be as powerful as true haplotypes. Secondly, as expected, the effective sample size has a clear effect on both gene detection power and mapping accuracy. Thirdly, the sample ascertainment method does not have much effect on mapping accuracy. Finally, an interesting side result is that the simple haplotype association test clearly outperformed exhaustive allelic transmission disequilibrium tests. The results suggest that the case-control design is a powerful alternative to the more laborious family based ascertainment approach, especially for large datasets, and wherever population stratification can be controlled.
BMC Bioinformatics | 2007
Niels Landwehr; Taneli Mielikäinen; Lauri Eronen; Hannu Toivonen; Heikki Mannila
Background: Haplotype reconstruction is the problem of resolving the hidden phase information in genotype data obtained from laboratory measurements. Solving this problem is an important intermediate step in gene association studies, which seek to uncover the genetic basis of complex diseases. We propose a novel approach for haplotype reconstruction based on constrained hidden Markov models. Models are constructed by incrementally refining and regularizing the structure of a simple generative model for genotype data under Hardy-Weinberg equilibrium. Results: The proposed method is evaluated on real-world and simulated population data. Results show that it is competitive with other recently proposed methods in terms of reconstruction accuracy, while offering a particularly good trade-off between computational costs and quality of results for large datasets. Conclusion: Relatively simple probabilistic approaches for haplotype reconstruction based on structured hidden Markov models are competitive with more complex, well-established techniques in this field.
Journal of Integrative Bioinformatics | 2008
Petteri Sevon; Lauri Eronen
We describe a method for querying vertex- and edge-labeled graphs using context-free grammars to specify the class of interesting paths. We introduce a novel problem: finding the connection subgraph induced by the set of matching paths between two given vertices or two sets of vertices. Such a subgraph provides a concise summary of the relationship between the vertices. We also present novel algorithms for parsing subgraphs directly without enumerating all the individual paths. We experimentally evaluate the presented parsing algorithms on a set of real graphs derived from publicly available biomedical databases and on randomly generated graphs. The results indicate that parsing the connection subgraph directly is much more effective than parsing individual paths separately. Furthermore, we show that a bidirectional parsing algorithm, in most cases, allows searching for paths twice as long as those reachable with a unidirectional search strategy.
Archive | 2010
Lauri Eronen; Atte Hinkka; Petteri Hintsanen; Melissa Kasari; Kimmo Kulovesi; Laura Langohr; Petteri Sevon; Hannu Toivonen
Bisociative Knowledge Discovery | 2012
Lauri Eronen; Petteri Hintsanen; Hannu Toivonen
Archive | 2006
Petteri Sevon; Lauri Eronen; Petteri Hintsanen; Kimmo Kulovesi; Hannu Toivonen