Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jake Lever is active.

Publication


Featured research published by Jake Lever.


Nature Methods | 2016

Points of Significance: Model selection and overfitting

Jake Lever; Martin Krzywinski; Naomi Altman

With four parameters I can fit an elephant and with five I can make him wiggle his trunk.—John von Neumann


Nature Methods | 2016

Points of Significance: Classification evaluation

Jake Lever; Martin Krzywinski; Naomi Altman

It is important to understand both what a classification metric expresses and what it hides.


Nature Methods | 2017

Points of Significance: Principal component analysis

Jake Lever; Martin Krzywinski; Naomi Altman

Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features. High-dimensional data are very common in biology and arise when multiple features, such as expression of many genes, are measured for each sample. This type of data presents several challenges that PCA mitigates: computational expense and an increased error rate due to multiple-test correction when testing each feature for association with an outcome.

PCA is an unsupervised learning method and is similar to clustering1—it finds patterns without reference to prior knowledge about whether the samples come from different treatment groups or have phenotypic differences. PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), with the goal of finding the best summary of the data using a limited number of PCs. The first PC is chosen to minimize the total distance between the data and their projection onto the PC (Fig. 1a). By minimizing this distance, we also maximize the variance of the projected points, σ² (Fig. 1b). The second (and subsequent) PCs are selected similarly, with the additional requirement that they be uncorrelated with all previous PCs. For example, projection onto PC1 is uncorrelated with projection onto PC2, and we can think of the PCs as geometrically orthogonal. This requirement of no correlation means that the maximum number of PCs possible is either the number of samples or the number of features, whichever is smaller.

The PC selection process has the effect of maximizing the correlation (r²) (ref. 2) between the data and their projection and is equivalent to carrying out multiple linear regression3,4 on the projected data against each variable of the original data. For example, the projection onto PC2 has maximum r² when used in multiple regression with PC1. The PCs are defined as a linear combination of the data's original variables, and in our two-dimensional (2D) example, PC1 = x/√2 + y/√2 (Fig. 1c). These coefficients are stored in a 'PCA loading matrix', which can be interpreted as a rotation matrix that rotates the data such that the projection with greatest variance goes along the first axis. At first glance, PC1 closely resembles the linear regression line3 of y versus x or x versus y (Fig. 1c). However, PCA differs from linear regression in that PCA minimizes the perpendicular distance between a data point and the principal component, whereas linear regression minimizes the distance between the response variable and its predicted value.

To illustrate PCA on biological data, we simulated expression profiles for nine genes that fall into one of three patterns across six samples (Fig. 2a). We find that the variance is fairly similar across samples (Fig. 2a), which tells us that no single sample captures the patterns in the data appreciably more than another. In other words, we need all six sample dimensions to express the data fully. Let's now use PCA to see whether a smaller number of combinations of samples can capture the patterns. We start by finding the six PCs (PC1–PC6), which become our new axes (Fig. 2b). We next transform the profiles so that they are expressed as linear combinations of PCs—each profile is now a set of coordinates on the PC axes—and calculate the variance (Fig. 2c). As expected, PC1 has the largest variance, with 52.6% of the variance captured by PC1 and 47.0% by PC2.

A useful interpretation of PCA is that the r² of the regression is the percent variance (of all the data) explained by the PCs. As additional PCs are added to the prediction, the difference in r² corresponds to the variance explained by that PC. However, not all of the PCs are typically used, because the majority of the variance, and hence the patterns in the data, will be limited to the first few PCs. In our example, we can ignore PC3–PC6, which contribute little (0.4%) to explaining the variance, and express the data in two dimensions instead of six. Figure 2d verifies visually that we can faithfully reproduce the profiles using only PC1 and PC2. For example, the root mean square (r.m.s.) distances of the original profile A from its 1D, 2D and 3D reconstructions are 0.29, 0.03 and 0.01, respectively. Approximations using two or three PCs are useful, because we …

Figure 2 | PCA reduction of nine expression profiles from six to two dimensions. (a) Expression profiles for nine genes (A–I) across six samples (a–f), coded by color on the basis of shape similarity, and the expression variance of each sample. (b) PC1–PC6 of the profiles in a. PC1 and PC2 reflect clearly visible trends, and the remaining PCs capture only small fluctuations. (c) Transformed profiles, expressed as PC scores, and the σ² of each component score. (d) The profiles reconstructed using PC1–PC3. (e) The 2D coordinates of each profile based on the scores of the first two PCs.
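The projection, variance and reconstruction steps described above can be reproduced in a few lines of Python. The sketch below is illustrative only, not code from the column: it uses randomly generated stand-in profiles rather than the nine simulated genes, centers the 9 × 6 matrix, derives the PCs by singular value decomposition, reports the fraction of variance captured by each PC, and reconstructs each profile from the first two PCs.

# Illustrative sketch: PCA of a small expression matrix via NumPy's SVD,
# followed by reconstruction from the first two PCs. Data are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 6))          # 9 gene profiles measured across 6 samples

Xc = X - X.mean(axis=0)              # center each sample (column) at zero
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the PCs (loadings); variance explained follows from the singular values.
explained = s**2 / np.sum(s**2)
print("fraction of variance per PC:", np.round(explained, 3))

# Scores = projection of each profile onto the PCs.
scores = Xc @ Vt.T

# Reconstruction using only the first k PCs, plus r.m.s. error per profile.
k = 2
X2 = scores[:, :k] @ Vt[:k, :] + X.mean(axis=0)
rms = np.sqrt(np.mean((X - X2)**2, axis=1))
print("r.m.s. reconstruction error per profile (k=2):", np.round(rms, 3))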


Nature Methods | 2016

Points of Significance: Logistic regression

Jake Lever; Martin Krzywinski; Naomi Altman

In recent columns we showed how linear regression can be used to predict a continuous dependent variable given other independent variables1,2. When the dependent variable is categorical, a common approach is to use logistic regression, a method that takes its name from the type of curve it uses to fit data.

Categorical variables are commonly used in biomedical data to encode a set of discrete states, such as whether a drug was administered or whether a patient has survived. Categorical variables may have more than two values, which may have an implicit order, such as whether a patient never, occasionally or frequently smokes. In addition to predicting the value of a variable (e.g., a patient will survive), logistic regression can also predict the associated probability (e.g., the patient has a 75% chance of survival).

There are many reasons to assess the probability of a state of a categorical variable, and a common application is classification—predicting the class of a new data point. Many methods are available, but regression has the advantage of being relatively simple to perform and interpret. First a training set is used to develop a prediction equation, and then the predicted membership probability is thresholded to predict the class membership for new observations, with each point classified to the most probable class. If the costs of misclassification differ between the two classes, alternative thresholds may be chosen to minimize misclassification costs estimated from the training sample (Fig. 1). For example, in the diagnosis of a deadly but readily treated disease, it is less costly to falsely assign a patient to the treatment group than to the no-treatment group.

In our example of simple linear regression1, we saw how one continuous variable (weight) could be predicted on the basis of another continuous variable (height). To illustrate classification, here we extend that example to use height to predict the probability that an individual plays professional basketball. Let us assume that professional basketball players have a mean height of 200 cm and that those who do not play professionally have a mean height of 170 cm, with both populations being normal and having an s.d. of 15 cm. First, we create a training data set by randomly sampling the heights of 5 individuals who play professional basketball and 15 who do not (Fig. 2a). We then assign categorical classifications of 1 (plays professional basketball) and 0 (does not play professional basketball). For simplicity, our example is limited to two classes, but more are possible.

Let us first approach this classification using linear regression, which minimizes least squares1, and fit a line to the data (Fig. 2a). Each data point has one of two distinct y-values (0 and 1), which correspond to the probability of playing professional basketball, and the fit represents the predicted probability as a function of height, increasing from 0 at 159 cm to 1 at 225 cm. The fit line is truncated outside the [0, 1] range because it cannot be interpreted as a probability. Using a probability threshold of 0.5 for classification, we find that 192 cm should be the decision boundary for predicting whether an individual plays professional basketball. This gives reasonable classification performance—only one point is misclassified as a false positive, and one point as a false negative (Fig. 2a).

Unfortunately, our linear regression fit is not robust. Consider a child of height H = 100 cm who does not play professional basketball (Fig. 2a). This height is below the threshold of 192 cm and would be classified correctly. However, if this data point is part of the training set, it will greatly influence the fit3 and increase the classification threshold to 197 cm, which would result in an additional false negative.

To improve the robustness and general performance of this classifier, we could fit the data to a curve other than a straight line. One very simple option is the step function (Fig. 2b), which is 1 when greater than a certain value and 0 otherwise. An advantage of the step function is that it defines a decision boundary (185 cm) that is not affected by the outlier (H = 100 cm), but it cannot provide class probabilities other than 0 and 1. This turns out to be sufficient for the purpose of classification—many classification algorithms do not provide probabilities. However, the step function also does not differentiate between the more extreme observations, which are far from the decision boundary and more likely to be correctly assigned, and those near the decision boundary, for which membership in …
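A minimal sketch of the basketball example follows, assuming scikit-learn is available; it is not the column's code, and the fitted boundary will differ from the 192 cm and 185 cm figures above because the random sample differs. It simulates the 5 + 15 heights, fits a logistic regression (with the penalty effectively switched off via a large C), and solves for the height at which the predicted probability crosses 0.5.

# Illustrative sketch: logistic regression on simulated player/non-player heights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
players = rng.normal(200, 15, size=5)     # class 1: plays professional basketball
others = rng.normal(170, 15, size=15)     # class 0: does not

heights = np.concatenate([players, others]).reshape(-1, 1)
labels = np.array([1] * 5 + [0] * 15)

# Large C ~ no penalty, so the fit approximates plain maximum likelihood.
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(heights, labels)

# P(class 1) = 0.5 where intercept + coefficient * height = 0.
boundary = -model.intercept_[0] / model.coef_[0, 0]
print(f"decision boundary: {boundary:.1f} cm")

# Predicted probability that a 190 cm individual plays professionally.
print("P(plays | 190 cm) =", round(model.predict_proba([[190.0]])[0, 1], 2))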


Proceedings of the 4th BioNLP Shared Task Workshop | 2016

VERSE: Event and Relation Extraction in the BioNLP 2016 Shared Task.

Jake Lever; Steven J.M. Jones

We present the Vancouver Event and Relation System for Extraction (VERSE)1 as a competing system for three subtasks of the BioNLP Shared Task 2016. VERSE performs full event extraction including entity, relation and modification extraction using a feature-based approach. It achieved the highest F1-score in the Bacteria Biotope (BB3) event subtask and the third highest F1-score in the Seed Development (SeeDev) binary subtask.


Oncotarget | 2016

Small molecule epigenetic screen identifies novel EZH2 and HDAC inhibitors that target glioblastoma brain tumor-initiating cells

Natalie Grinshtein; Constanza Rioseco; Richard Marcellus; David Uehling; Ahmed Aman; Xueqing Lun; Osamu Muto; Lauren Podmore; Jake Lever; Yaoqing Shen; Michael D. Blough; Greg Cairncross; Stephen M. Robbins; Steven J.M. Jones; Marco A. Marra; Rima Al-awar; Donna L. Senger; David R. Kaplan

Glioblastoma (GBM) is the most lethal and aggressive adult brain tumor, requiring the development of efficacious therapeutics. Towards this goal, we screened five genetically distinct patient-derived brain-tumor initiating cell lines (BTIC) with a unique collection of small molecule epigenetic modulators from the Structural Genomics Consortium (SGC). We identified multiple hits that inhibited the growth of BTICs in vitro, and further evaluated the therapeutic potential of EZH2 and HDAC inhibitors due to the high relevance of these targets for GBM. We found that the novel SAM-competitive EZH2 inhibitor UNC1999 exhibited low micromolar cytotoxicity in vitro on a diverse collection of BTIC lines, synergized with dexamethasone (DEX) and suppressed tumor growth in vivo in combination with DEX. In addition, a unique brain-penetrant class I HDAC inhibitor exhibited cytotoxicity in vitro on a panel of BTIC lines and extended survival in combination with TMZ in an orthotopic BTIC model in vivo. Finally, a combination of EZH2 and HDAC inhibitors demonstrated synergy in vitro by augmenting apoptosis and increasing DNA damage. Our findings identify key epigenetic modulators in GBM that regulate BTIC growth and survival and highlight promising combination therapies.


Nature Methods | 2016

Points of Significance: Regularization

Jake Lever; Martin Krzywinski; Naomi Altman

Constraining the magnitude of the parameters of a model can control its complexity.

Last month we examined the challenge of selecting a predictive model that generalizes well, and we discussed how a model's ability to generalize is related to its number of parameters and its complexity1. An appropriate level of complexity is needed to avoid both underfitting and overfitting. An underfitted model is usually a poor fit to the training data, and an overfitted model is a good fit to the training data but not to new data. This month we explore the topic of regularization, a method that controls a model's complexity by penalizing the magnitude of its parameters. Regularization can be used with any type of predictive model. We will illustrate this method using multiple linear regression applied to the analysis of simulated biomarker data to predict disease severity on a continuous scale.

For the ith patient, let yᵢ be the known disease severity and xᵢⱼ be the value of the jth biomarker. Multiple linear regression finds the parameter estimates β̂ⱼ that minimize the sum of squared errors SSE = Σᵢ (yᵢ − ŷᵢ)², where ŷᵢ = Σⱼ β̂ⱼ xᵢⱼ is the ith patient's predicted disease severity. For simplicity, we exclude the intercept, β̂₀, which is a constant offset. Recall that the hat on β̂ⱼ indicates that the value is an estimate of the corresponding parameter βⱼ in the underlying model.

The complexity of a model is related to the number and magnitude of its parameters1. With too few parameters the model underfits, missing predictable systematic variation. With too many parameters it overfits, fitting training-data noise with lower SSE but increasing variability in the prediction of new data. Generally, as the number of parameters increases, so does the total magnitude of their estimated values (Fig. 1a). Thus, rather than limiting the number of parameters, we can control model complexity by constraining the magnitude of the parameters …

Figure 1 | Regularization controls model complexity by imposing a limit on the magnitude of its parameters. (a) Complexity of polynomial models of orders 1 to 5 fit to the data points in b, as measured by the sum of squared parameters. The highest-order polynomial is the most complex (blue bar) and drastically overfits, fitting the data exactly (blue trace …
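As an illustration of penalizing parameter magnitude, the sketch below implements ridge regression, one common form of such a penalty; it is not the column's code, and the simulated biomarker data and λ values are placeholders. The penalized objective is SSE + λ Σⱼ β̂ⱼ², which has the closed-form solution β̂ = (XᵀX + λI)⁻¹ Xᵀy.

# Illustrative sketch: closed-form ridge regression on simulated biomarker data.
import numpy as np

rng = np.random.default_rng(2)
n_patients, n_biomarkers = 30, 10
X = rng.normal(size=(n_patients, n_biomarkers))              # biomarker values
true_beta = np.array([2.0, -1.5, 1.0] + [0.0] * 7)
y = X @ true_beta + rng.normal(scale=1.0, size=n_patients)   # disease severity

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (intercept omitted, as in the column)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Larger lambda shrinks the coefficients, reducing model complexity.
for lam in (0.0, 1.0, 10.0, 100.0):
    beta_hat = ridge_fit(X, y, lam)
    print(f"lambda = {lam:>5}: sum of squared coefficients = {np.sum(beta_hat**2):.2f}")

With λ = 0 this reduces to ordinary least squares; increasing λ trades a slightly worse fit to the training data for smaller, more stable coefficients.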


Bioinformatics | 2018

A collaborative filtering-based approach to biomedical knowledge discovery

Jake Lever; Sitanshu Gakkhar; Michael Gottlieb; Tahereh Rashnavadi; Santina Lin; Celia Siu; Maia Smith; Martin R. Jones; Martin Krzywinski; Steven J.M. Jones

Motivation: The increase in publication rates makes it challenging for an individual researcher to stay abreast of all relevant research in order to find novel research hypotheses. Literature-based discovery methods make use of knowledge graphs built using text mining and can infer future associations between biomedical concepts that will likely occur in new publications. These predictions are a valuable resource for researchers to explore a research topic. Current methods for prediction are based on the local structure of the knowledge graph. A method that uses global knowledge from across the knowledge graph needs to be developed in order to make knowledge discovery a frequently used tool by researchers.

Results: We propose an approach based on the singular value decomposition (SVD) that is able to combine data from across the knowledge graph through a reduced representation. Using cooccurrence data extracted from published literature, we show that SVD performs better than the leading methods for scoring discoveries. We also show the diminishing predictive power of knowledge discovery as we compare our predictions with real associations that appear further into the future. Finally, we examine the strengths and weaknesses of the SVD approach against another well-performing system using several predicted associations.

Availability and implementation: All code and results files for this analysis can be accessed at https://github.com/jakelever/knowledgediscovery.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
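A conceptual sketch of the scoring idea is shown below; it is not the paper's pipeline (see the GitHub repository above for that), and the toy co-occurrence matrix, density and rank are placeholders. A truncated SVD gives a low-rank reconstruction of the co-occurrence matrix, and the reconstructed value for a currently unobserved concept pair serves as its discovery score.

# Conceptual sketch: score candidate concept associations from a truncated SVD
# of a toy co-occurrence matrix.
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

n_concepts = 200
cooc = sparse_random(n_concepts, n_concepts, density=0.02, random_state=3)
cooc = cooc + cooc.T                 # symmetric toy co-occurrence counts

k = 20                               # reduced rank
U, s, Vt = svds(cooc, k=k)

def score(i, j):
    """Predicted association strength for concepts i and j from the low-rank reconstruction."""
    return float(U[i] @ (s * Vt[:, j]))

# Rank the currently unobserved pairs involving concept 0 by predicted score.
dense = cooc.toarray()
candidates = [j for j in range(1, n_concepts) if dense[0, j] == 0]
top = sorted(candidates, key=lambda j: score(0, j), reverse=True)[:5]
print("top predicted new associations for concept 0:", top)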


Journal of Inherited Metabolic Disease | 2018

Text-based phenotypic profiles incorporating biochemical phenotypes of inborn errors of metabolism improve phenomics-based diagnosis

Jessica Lee; Michael Gottlieb; Jake Lever; Steven J.M. Jones; Nenad Blau; Clara van Karnebeek; Wyeth W. Wasserman

Phenomics is the comprehensive study of phenotypes at every level of biology: from metabolites to organisms. With high throughput technologies increasing the scope of biological discoveries, the field of phenomics has been developing rapid and precise methods to collect, catalog, and analyze phenotypes. Such methods have allowed phenotypic data to be widely used in medical applications, from assisting clinical diagnoses to prioritizing genomic diagnoses. To channel the benefits of phenomics into the field of inborn errors of metabolism (IEM), we have recently launched IEMbase, an expert-curated knowledgebase of IEM and their disease-characterizing phenotypes. While our efforts with IEMbase have realized benefits, taking full advantage of phenomics requires a comprehensive curation of IEM phenotypes in core phenomics projects, which is dependent upon contributions from the IEM clinical and research community. Here, we assess the inclusion of IEM biochemical phenotypes in a core phenomics project, the Human Phenotype Ontology. We then demonstrate the utility of biochemical phenotypes using a text-based phenomics method to predict gene-disease relationships, showing that the prediction of IEM genes is significantly better using biochemical rather than clinical profiles. The findings herein provide a motivating goal for the IEM community to expand the computationally accessible descriptions of biochemical phenotypes associated with IEM in phenomics resources.
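As a generic illustration of profile-based prediction (not the method or data used in this study), the sketch below ranks candidate genes by the cosine similarity between a patient's phenotype-term vector and gene-associated phenotype profiles; the gene names, terms and profiles are made up for the example.

# Generic illustration: rank hypothetical candidate genes by phenotype-profile similarity.
import numpy as np

terms = ["hyperammonemia", "lactic acidosis", "seizures",
         "hypoglycemia", "developmental delay"]

# Hypothetical binary phenotype profiles over the term vocabulary above.
gene_profiles = {
    "GENE_A": np.array([1, 0, 1, 0, 1], dtype=float),
    "GENE_B": np.array([0, 1, 1, 1, 1], dtype=float),
    "GENE_C": np.array([0, 1, 0, 1, 0], dtype=float),
}
patient = np.array([1, 0, 1, 0, 0], dtype=float)   # observed patient phenotypes

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

ranking = sorted(gene_profiles,
                 key=lambda g: cosine(patient, gene_profiles[g]),
                 reverse=True)
print("candidate genes ranked by profile similarity:", ranking)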


bioRxiv | 2017

LIONS: Analysis Suite for Detecting and Quantifying Transposable Element Initiated Transcription from RNA-seq

Artem Babaian; Jake Lever; Liane Gagnier; Dixie L. Mager

Summary: Transposable Elements (TEs) influence the evolution of novel transcriptional networks yet the specific and meaningful interpretation of how TE-initiation events contribute to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of paired-end RNA-seq data to specifically detect and quantify TE-initiated transcripts.

Availability: Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/LIONS.

Contact: [email protected] or [email protected] or [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

Collaboration


Dive into Jake Lever's collaborations.

Top Co-Authors

Steven J.M. Jones
University of British Columbia

Naomi Altman
Pennsylvania State University

Martin R. Jones
University of British Columbia

Nicolas Fiorini
National Institutes of Health

Ahmed Aman
Ontario Institute for Cancer Research

Artem Babaian
University of British Columbia

Clara van Karnebeek
University of British Columbia