Juho Rousu | Researchain

Publication

Featured research published by Juho Rousu.

Machine Learning | 1999

General and Efficient Multisplitting of Numerical Attributes

Tapio Elomaa; Juho Rousu

Often in supervised learning numerical attributes require special treatment and do not fit the learning scheme as well as one could hope. Nevertheless, they are common in practical tasks and, therefore, need to be taken into account. We characterize the well-behavedness of an evaluation function, a property that guarantees the optimal multi-partition of an arbitrary numerical domain to be defined on boundary points. Well-behavedness reduces the number of candidate cut points that need to be examined in multisplitting numerical attributes. Many commonly used attribute evaluation functions possess this property; we demonstrate that the cumulative functions Information Gain and Training Set Error as well as the non-cumulative functions Gain Ratio and Normalized Distance Measure are all well-behaved. We also devise a method of finding optimal multisplits efficiently by examining the minimum number of boundary point combinations that is required to produce partitions which are optimal with respect to a cumulative and well-behaved evaluation function. Our empirical experiments validate the utility of optimal multisplitting: it produces consistently better partitions than alternative approaches do and it only requires comparable time. In top-down induction of decision trees the choice of evaluation function has a more decisive effect on the result than the choice of partitioning strategy; optimizing the value of most common attribute evaluation functions does not raise the accuracy of the produced decision trees. In our tests the construction time using optimal multisplitting was, on average, twice that required by greedy multisplitting, which in turn required on average twice the time of binary splitting.
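
As an illustration of the boundary-point idea mentioned in the abstract, here is a minimal Python sketch. It assumes the common definition under which a cut between two adjacent distinct attribute values is a boundary point unless all examples at both values belong to one and the same class. The function name `boundary_cut_points` and the midpoint placement of cuts are choices made for this example, not taken from the paper.

```python
from collections import defaultdict


def boundary_cut_points(values, labels):
    """Return candidate cut points that lie on class boundaries.

    Hypothetical helper for illustration only: a cut between two adjacent
    distinct values is kept unless both values are pure in the same class.
    """
    # Collect the set of classes observed at each distinct attribute value.
    classes_at = defaultdict(set)
    for value, label in zip(values, labels):
        classes_at[value].add(label)

    distinct = sorted(classes_at)
    cuts = []
    for low, high in zip(distinct, distinct[1:]):
        left, right = classes_at[low], classes_at[high]
        same_pure_class = len(left) == 1 and left == right
        if not same_pure_class:
            # Place the cut halfway between the two values (an assumption).
            cuts.append((low + high) / 2)
    return cuts


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "a"]
cuts = boundary_cut_points(values, labels)
print(cuts)
```

In this toy input only two of the five possible cuts survive, which is the kind of reduction in candidate cut points that the abstract refers to.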

Rapid Communications in Mass Spectrometry | 2008

FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data.

Markus Heinonen; Ari Rantanen; Taneli Mielikäinen; Juha Kokkonen; Jari Kiuru; Raimo A. Ketola; Juho Rousu

We present FiD (Fragment iDentificator), a software tool for the structural identification of product ions produced with tandem mass spectrometric measurement of low molecular weight organic compounds. Tandem mass spectrometry (MS/MS) has proven to be an indispensable tool in modern, cell-wide metabolomics and fluxomics studies. In such studies, the structural information of the MS(n) product ions is usually needed in the downstream analysis of the measurement data. The manual identification of the structures of MS(n) product ions is, however, a nontrivial task requiring expertise, and calls for computer assistance. Commercial software tools, such as Mass Frontier and ACD/MS Fragmenter, rely on fragmentation rule databases for the identification of MS(n) product ions. FiD, on the other hand, conducts a combinatorial search over all possible fragmentation paths and outputs a ranked list of alternative structures. This gives the user an advantage in situations where the MS/MS data of compounds with less well-known fragmentation mechanisms are processed. FiD software implements two fragmentation models, the single-step model that ignores intermediate fragmentation states and the multi-step model, which allows for complex fragmentation pathways. The software works for MS/MS data produced both in positive- and negative-ion modes. The software has an easy-to-use graphical interface with built-in visualization capabilities for structures of product ions and fragmentation pathways. In our experiments involving amino acids and sugar-phosphates, often found, e.g., in the central carbon metabolism of yeasts, FiD software correctly predicted the structures of product ions on average in 85% of the cases. The FiD software is free for academic use and is available for download from www.cs.helsinki.fi/group/sysfys/software/fragid.

International Conference on Machine Learning | 2005

Learning hierarchical multi-category text classification models

Juho Rousu; Craig Saunders; Sandor Szedmak; John Shawe-Taylor

We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. The algorithm's predictive accuracy is competitive with other recently introduced hierarchical multi-category or multilabel classification learning algorithms.

BMC Systems Biology | 2009

Inferring branching pathways in genome-scale metabolic networks.

Esa Pitkänen; Paula Jouhten; Juho Rousu

Background: A central problem in computational metabolic modelling is how to find biochemically plausible pathways between metabolites in a metabolic network. Two general, complementary frameworks have been utilized to find metabolic pathways: constraint-based modelling and graph-theoretical path finding approaches. In constraint-based modelling, one aims to find pathways where metabolites are balanced in a pseudo steady-state. Constraint-based methods, such as elementary flux mode analysis, typically have a high computational cost stemming from a large number of steady-state pathways in a typical metabolic network. On the other hand, graph-theoretical approaches avoid the computational complexity of constraint-based methods by solving a simpler problem of finding shortest paths. However, while scaling well with network size, graph-theoretic methods generally tend to return more false positive pathways than constraint-based methods.

Results: In this paper, we introduce a computational method, ReTrace, for finding biochemically relevant, branching metabolic pathways in an atom-level representation of metabolic networks. The method finds compact pathways which transfer a high fraction of atoms from source to target metabolites by considering combinations of linear shortest paths. In contrast to current steady-state pathway analysis methods, our method scales up well and is able to operate on genome-scale models. Further, we show that the pathways produced are biochemically meaningful by an example involving the biosynthesis of inosine 5'-monophosphate (IMP). In particular, the method is able to avoid typical problems associated with graph-theoretic approaches such as the need to define side metabolites or pathways not carrying any net carbon flux appearing in results. Finally, we discuss an application involving reconstruction of amino acid pathways of a recently sequenced organism demonstrating how measurement data can be easily incorporated into ReTrace analysis. ReTrace is licensed under GPL and is freely available for academic use at http://www.cs.helsinki.fi/group/sysfys/software/retrace/.

Conclusion: ReTrace is a useful method in metabolic path finding tasks, combining some of the best aspects in constraint-based and graph-theoretic methods. It finds use in a multitude of tasks ranging from metabolic engineering to metabolic reconstruction of recently sequenced organisms.
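
The graph-theoretical baseline that the abstract contrasts with constraint-based modelling can be sketched as a plain breadth-first shortest-path search. The Python snippet below is a generic illustration, not ReTrace itself: it works on a metabolite-level graph rather than an atom-level one, and the toy network, the function name `shortest_path` and the adjacency-dict representation are all assumptions made for this example.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search for a shortest path in an unweighted directed graph.

    `graph` maps each node to an iterable of successor nodes.
    Returns the path as a list of nodes, or None if target is unreachable.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for successor in graph.get(node, ()):
            if successor not in parents:
                parents[successor] = node
                queue.append(successor)
    return None


# Toy network for illustration only; not a curated metabolic model.
toy_network = {
    "glucose": ["G6P"],
    "G6P": ["F6P", "6PG"],
    "F6P": ["FBP"],
    "6PG": [],
}
print(shortest_path(toy_network, "glucose", "FBP"))
```

A search like this scales well but knows nothing about which atoms are actually transferred along the path, which is one way to see why such methods can return biochemically implausible pathways.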

BMC Bioinformatics | 2008

An analytic and systematic framework for estimating metabolic flux ratios from 13C tracer experiments

Ari Rantanen; Juho Rousu; Paula Jouhten; Nicola Zamboni; Hannu Maaheimo; Esko Ukkonen

Background: Metabolic fluxes provide invaluable insight into the integrated response of a cell to environmental stimuli or genetic modifications. Current computational methods for estimating the metabolic fluxes from 13C isotopomer measurement data rely either on manual derivation of analytic equations constraining the fluxes or on the numerical solution of a highly nonlinear system of isotopomer balance equations. In the first approach, analytic equations have to be tediously derived for each organism, substrate or labelling pattern, while in the second approach, the global nature of an optimum solution is difficult to prove and comprehensive measurements of external fluxes to augment the 13C isotopomer data are typically needed.

Results: We present a novel analytic framework for estimating metabolic flux ratios in the cell from 13C isotopomer measurement data. In the presented framework, equation systems constraining the fluxes are derived automatically from the model of the metabolism of an organism. The framework is designed to be applicable with all metabolic network topologies, 13C isotopomer measurement techniques, substrates and substrate labelling patterns. By analyzing nuclear magnetic resonance (NMR) and mass spectrometry (MS) measurement data obtained from the experiments on glucose with the model micro-organisms Bacillus subtilis and Saccharomyces cerevisiae we show that our framework is able to automatically produce the flux ratios discovered so far by the domain experts with tedious manual analysis. Furthermore, we show by in silico calculability analysis that our framework can rapidly produce flux ratio equations, as well as predict when the flux ratios are unobtainable by linear means, also for substrates not related to glucose.

Conclusion: The core of the 13C metabolic flux analysis framework introduced in this article consists of flow and independence analysis of metabolic fragments and techniques for manipulating isotopomer measurements with vector space techniques. These methods facilitate efficient, analytic computation of the ratios between the fluxes of pathways that converge to a common junction metabolite. The framework can be seen as a generalization and formalization of the existing tradition for computing metabolic flux ratios where equations constraining flux ratios are manually derived, usually without explicitly showing the formal proofs of the validity of the equations.

Data Mining and Knowledge Discovery | 2004

Efficient Multisplitting Revisited: Optima-Preserving Elimination of Partition Candidates

Tapio Elomaa; Juho Rousu

We consider multisplitting of numerical value ranges, a task that is encountered as a discretization step preceding induction and also embedded into learning algorithms. We are interested in finding the partition that optimizes the value of a given attribute evaluation function. For most commonly used evaluation functions this task takes quadratic time in the number of potential cut points in the numerical range. Hence, it is a potential bottleneck in data mining algorithms.

We present two techniques that speed up the optimal multisplitting task. The first one aims at discarding cut point candidates in a quick linear-time preprocessing scan before embarking on the actual search. We generalize the definition of boundary points by Fayyad and Irani to allow us to merge adjacent example blocks that have the same relative class distribution. We prove for several commonly used evaluation functions that this processing removes only suboptimal cut points. Hence, the algorithm does not lose optimality.

Our second technique tackles the quadratic-time dynamic programming algorithm, which is the best schema for optimizing many well-known evaluation functions. We present a technique that dynamically (i.e., during the search) prunes partitions of prefixes of the sorted data from the search space of the algorithm. The method works for all convex and cumulative evaluation functions.

Together the use of these two techniques speeds up the multisplitting process considerably. Compared to the baseline dynamic programming algorithm the speed-up is around 50 percent on the average and up to 90 percent in some cases. We conclude that optimal multisplitting is fully feasible on all benchmark data sets we have encountered.
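
The baseline quadratic-time dynamic programming scheme mentioned in the abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: it uses Training Set Error as the cumulative evaluation function, takes the data as a list of already sorted example blocks (one class-count mapping per block), and omits both speed-up techniques described above. The name `optimal_multisplit` and the return format are hypothetical.

```python
from collections import Counter


def optimal_multisplit(blocks, max_intervals):
    """Minimise Training Set Error over partitions into at most max_intervals.

    blocks: non-empty list of {class_label: count} mappings in sorted order.
    Returns (error, cuts), where a cut index c separates blocks c-1 and c.
    """
    n = len(blocks)
    # prefix[j] holds the class counts of blocks 0..j-1.
    prefix = [Counter()]
    for block in blocks:
        prefix.append(prefix[-1] + Counter(block))

    def error(i, j):
        # Examples in blocks i..j-1 that are outside the majority class.
        counts = prefix[j] - prefix[i]
        return sum(counts.values()) - max(counts.values(), default=0)

    inf = float("inf")
    k = min(max_intervals, n)
    # best[m][j]: minimum error of splitting blocks 0..j-1 into m intervals.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                cost = best[m - 1][i] + error(i, j)
                if cost < best[m][j]:
                    best[m][j], back[m][j] = cost, i

    # Prefer the fewest intervals among those reaching the minimum error.
    m = min(range(1, k + 1), key=lambda parts: (best[parts][n], parts))
    total, cuts, j = best[m][n], [], n
    while m > 0:
        i = back[m][j]
        if i > 0:
            cuts.append(i)
        j, m = i, m - 1
    return total, sorted(cuts)


blocks = [{"a": 3}, {"b": 2}, {"a": 1}]
print(optimal_multisplit(blocks, 2))
```

The inner loops over `j` and `i` are what make the search quadratic in the number of blocks for each interval count, which is the cost that merging blocks and pruning prefixes aim to reduce.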

Current Opinion in Biotechnology | 2010

Computational methods for metabolic reconstruction.

Esa Pitkänen; Juho Rousu; Esko Ukkonen

In the wake of numerous sequenced genomes becoming available, computational methods for the reconstruction of metabolic networks have received considerable attention. Here, we review recent methods and software tools useful along the reconstruction workflow, from sequence annotation and network assembly to model verification and testing against experimental data. Reconstruction methods can be divided into three categories, depending on the magnitude of network context which is taken into account in the process of assembling the metabolic model: First, each enzyme may be predicted independently by annotation transfer or machine learning methods. Second, the presence of a metabolic pathway may be detected from genome and experimental evidence, often utilizing a reference pathway database. Third, the method may attempt to directly reconstruct a consistent metabolic network without relying on predefined reference pathways. Regardless of the chosen context, all methods strive to reconstruct genome-scale metabolic reconstructions. Currently a gap exists between software platforms dedicated to genome annotation and computational tools for automatically repairing network inconsistencies and validating against measurement data. We argue that to accelerate the reconstruction efforts, computational tools need to be developed that bridge the phases of the reconstruction workflow. In particular, the goal of finding consistent metabolic models suitable for computational analysis should be taken into account already in the beginning phases of reconstruction.

Journal of Computational Biology | 2011

Computing Atom Mappings for Biochemical Reactions without Subgraph Isomorphism

Markus Heinonen; Sampsa Lappalainen; Taneli Mielikäinen; Juho Rousu

The ability to trace the fate of individual atoms through the metabolic pathways is needed in many applications of systems biology and drug discovery. However, this information is not immediately available from the most common metabolome studies and needs to be separately acquired. Automatic discovery of correspondence of atoms in biochemical reactions is called the atom mapping problem. We suggest an efficient approach for solving the atom mapping problem exactly, that is, finding mappings of minimum edge edit distance. The algorithm is based on A* search equipped with sophisticated heuristics for pruning the search space. This approach has clear advantages over the commonly used heuristic approach of the iterative maximum common subgraph (MCS) algorithm: we explicitly minimize an objective function, and we produce solutions that typically require less manual curation. The two methods are similar in computational resource demands. We compare the performance of the proposed algorithm against several alternatives on data obtained from the KEGG LIGAND and RPAIR databases: greedy search, bi-partite graph matching, and the MCS approach. Our experiments show that alternative approaches often fail to find mappings with minimum edit distance.
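
To make the objective concrete, the following Python sketch scores an atom mapping by a simple edge edit distance, assumed here to be the number of bonds that are broken or formed under the mapping (bond orders are ignored). It then finds a minimum-cost mapping by exhaustive enumeration, which is a plainly swapped-in stand-in for the A* search of the paper and is only practical for tiny molecules. All names and the toy reaction are hypothetical.

```python
from itertools import permutations


def edge_edit_distance(reactant_bonds, product_bonds, mapping):
    """Count bonds broken or formed when reactant atoms are renamed by mapping."""
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in reactant_bonds}
    target = {frozenset(bond) for bond in product_bonds}
    # Symmetric difference: bonds present on exactly one side of the reaction.
    return len(mapped ^ target)


def best_mapping_bruteforce(r_elems, p_elems, r_bonds, p_bonds):
    """Try every element-preserving bijection and keep the cheapest one.

    r_elems / p_elems map atom identifiers to element symbols.
    Returns (cost, mapping), or None if no element-preserving bijection exists.
    """
    r_atoms = sorted(r_elems)
    p_atoms = sorted(p_elems)
    best = None
    for perm in permutations(p_atoms):
        # Skip assignments that would map an atom to a different element.
        if any(r_elems[a] != p_elems[b] for a, b in zip(r_atoms, perm)):
            continue
        mapping = dict(zip(r_atoms, perm))
        cost = edge_edit_distance(r_bonds, p_bonds, mapping)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best


# Toy example: the same C-C-O chain with atoms labelled differently.
r_elems = {"r1": "C", "r2": "C", "r3": "O"}
p_elems = {"p1": "C", "p2": "C", "p3": "O"}
r_bonds = [("r1", "r2"), ("r2", "r3")]
p_bonds = [("p1", "p2"), ("p1", "p3")]
print(best_mapping_bruteforce(r_elems, p_elems, r_bonds, p_bonds))
```

The enumeration grows factorially with the number of atoms, which is why an exact method needs strong pruning of the kind the abstract attributes to its A* heuristics.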

Bioinformatics | 2006

Planning optimal measurements of isotopomer distributions for estimation of metabolic fluxes

Note: A preliminary version of this article appeared in the proceedings of the German Conference on Bioinformatics 2005, Lecture Notes in Informatics Vol. P-71 (2005), pp. 177-191.

Ari Rantanen; Taneli Mielikäinen; Juho Rousu; Hannu Maaheimo; Esko Ukkonen

Motivation: Flux estimation using isotopomer information of metabolites is currently the most reliable method to obtain quantitative estimates of the activity of metabolic pathways. However, the development of isotopomer measurement techniques for intermediate metabolites is a demanding task. Careful planning of isotopomer measurements is thus needed to maximize the available flux information while minimizing the experimental effort.

Results: In this paper we study the question of finding the smallest subset of metabolites to measure that ensures the same level of isotopomer information as the measurement of every metabolite in the metabolic network. We study the computational complexity of this optimization problem in the case of the so-called positional enrichment data, give methods for obtaining exact and fast approximate solutions, and evaluate empirically the efficacy of the proposed methods by analyzing a metabolic network that models the central carbon metabolism of Saccharomyces cerevisiae.

Journal of Food Engineering | 2003

Novel computational tools in bakery process data analysis: a comparative study

Juho Rousu; Laura Flander; Marjaana Suutarinen; Karin Autio; Petri Kontkanen; Ari Rantanen

We studied the potential of various machine learning and statistical methods in the prediction of product quality in industrial bakery processes. The methods included classification and regression tree, decision list, neural network, support vector machine and Bayesian learning algorithms as well as statistical multivariate methods. Our data originated from two industrial bakery processes: a sourdough rye bread and a Danish pastry process. In our studies, the Naive Bayesian algorithm turned out to be the best classifier building algorithm while the partial least squares (PLS) method was the best regression method. The prediction accuracy of these models improved significantly by pruning the original set of variables. In this study, two response variables could be predicted at a level that justifies further study: rye bread pH could be predicted with high accuracy with the Naive Bayesian classifier, and Danish pastry height could be predicted with a moderately high correlation with PLS.
