Saharon Rosset
Tel Aviv University
Publications
Featured research published by Saharon Rosset.
Annals of Statistics | 2004
Bradley Efron; Trevor Hastie; Iain M. Johnstone; Robert Tibshirani; Hemant Ishwaran; Keith Knight; Jean-Michel Loubes; Pascal Massart; David Madigan; Greg Ridgeway; Saharon Rosset; J. Zhu; Robert A. Stine; Berwin A. Turlach; Sanford Weisberg
DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL.
By Jean-Michel Loubes and Pascal Massart, Université Paris-Sud
The issue of model selection has drawn the attention of both applied and theoretical statisticians for a long time. Indeed, there has been an enormous range of contribution in model selection proposals, including work by Akaike (1973), Mallows (1973), Foster and George (1994), Birgé and Massart (2001a) and Abramovich, Benjamini, Donoho and Johnstone (2000). Over the last decade, modern computer-driven methods have been developed such as All Subsets, Forward Selection, Forward Stagewise or Lasso. Such methods are useful in the setting of the standard linear model, where we observe noisy data and wish to predict the response variable using only a few covariates, since they provide automatically linear models that fit the data. The procedure described in this paper is, on the one hand, numerically very efficient and, on the other hand, very general, since, with slight modifications, it enables us to recover the estimates given by the Lasso and Stagewise.
1. Estimation procedure. The “LARS” method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response y. In the case of an orthogonal design, the estimates can then be viewed as an l…
DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL.
By Berwin A. Turlach, University of Western Australia
I would like to begin by congratulating the authors (referred to below as EHJT) for their interesting paper in which they propose a new variable selection method (LARS) for building linear models and show how their new method relates to other methods that have been proposed recently. I found the paper to be very stimulating and found the additional insight that it provides about the Lasso technique to be of particular interest.
My comments center around the question of how we can select linear models that conform with the marginality principle [Nelder (1977, 1994) and McCullagh and Nelder (1989)]; that is, the response surface is invariant under scaling and translation of the explanatory variables in the model. Recently one of my interests was to explore whether the Lasso technique or the nonnegative garrote [Breiman (1995)] could be modified such that it incorporates the marginality principle. However, it does not seem to be a trivial matter to change the criteria that these techniques minimize in such a way that the marginality principle is incorporated in a satisfactory manner. On the other hand, it seems to be straightforward to modify the LARS technique to incorporate this principle. In their paper, EHJT address this issue somewhat in passing when they suggest toward the end of Section 3 that one first fit main effects only and interactions in a second step to control the order in which variables are allowed to enter the model. However, such a two-step procedure may have a somewhat less than optimal behavior as the following, admittedly artificial, example shows. Assume we have a vector of explanatory variables X = (X…
The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods.
Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.
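The Forward Stagewise idea mentioned in property (2) can be illustrated in its naive incremental form: at each step, find the covariate most correlated with the current residual and nudge its coefficient by a small fixed amount. The sketch below is not the LARS modification the abstract refers to; the function name `forward_stagewise`, the step size and the iteration cap are illustrative assumptions, and the columns of `X` are assumed to be centered and comparably scaled.

```python
def forward_stagewise(X, y, step=0.01, n_steps=1000):
    """Naive incremental forward stagewise regression (illustrative sketch).

    X is a list of rows, y a list of responses. Columns of X are assumed
    to be centered and on a comparable scale, so the inner product with
    the residual serves as a stand-in for correlation.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)
    for _ in range(n_steps):
        # Inner product of every column with the current residual.
        corr = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        if abs(corr[j]) < 1e-12:
            break  # residual is (numerically) orthogonal to all columns
        delta = step if corr[j] > 0 else -step
        beta[j] += delta
        # Update the residual for the small move in coefficient j.
        for i in range(n):
            residual[i] -= delta * X[i][j]
    return beta
```

With two orthogonal columns and a response that depends only on the first, the coefficient of the first column creeps toward its least squares value in steps of size `step`, while the second stays at zero.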
Human Genetics | 2010
Shay Tzur; Saharon Rosset; Revital Shemer; Guennady Yudkovsky; Sara Selig; Ayele Tarekegn; Endashaw Bekele; Neil Bradman; Walter G. Wasser; Doron M. Behar; Karl Skorecki
MYH9 has been proposed as a major genetic risk locus for a spectrum of nondiabetic end stage kidney disease (ESKD). We use recently released sequences from the 1000 Genomes Project to identify two western African-specific missense mutations (S342G and I384M) in the neighboring APOL1 gene, and demonstrate that these are more strongly associated with ESKD than previously reported MYH9 variants. The APOL1 gene product, apolipoprotein L-1, has been studied for its roles in trypanosomal lysis, autophagic cell death, lipid metabolism, as well as vascular and other biological activities. We also show that the distribution of these newly identified APOL1 risk variants in African populations is consistent with the pattern of African ancestry ESKD risk previously attributed to MYH9.
Mapping by admixture linkage disequilibrium (MALD) localized an interval on chromosome 22, in a region that includes the MYH9 gene, which was shown to contain African ancestry risk variants associated with certain forms of ESKD (Kao et al. 2008; Kopp et al. 2008). MYH9 encodes nonmuscle myosin heavy chain IIa, a major cytoskeletal nanomotor protein expressed in many cell types, including podocyte cells of the renal glomerulus. Moreover, 39 different coding region mutations in MYH9 have been identified in patients with a group of rare syndromes, collectively termed the Giant Platelet Syndromes, with clear autosomal dominant inheritance, and various clinical manifestations, sometimes also including glomerular pathology and chronic kidney disease (Kopp 2010; Sekine et al. 2010). Accordingly, MYH9 was further explored in these studies as the leading candidate gene responsible for the MALD signal.
Dense mapping of MYH9 identified individual single nucleotide polymorphisms (SNPs) and sets of such SNPs grouped as haplotypes that were found to be highly associated with a large and important group of ESKD risk phenotypes, which as a consequence were designated as MYH9-associated nephropathies (Bostrom and Freedman 2010). These included HIV-associated nephropathy (HIVAN), primary nonmonogenic forms of focal segmental glomerulosclerosis, and hypertension affiliated chronic kidney disease not attributed to other etiologies (Bostrom and Freedman 2010). The MYH9 SNP and haplotype associations observed with these forms of ESKD yielded the largest odds ratios (OR) reported to date for the association of common variants with common disease risk (Winkler et al. 2010). Two specific MYH9 variants (rs5750250 of S-haplotype and rs11912763 of F-haplotype) were designated as most strongly predictive on the basis of Receiver Operating Characteristic analysis (Nelson et al. 2010). These MYH9 association studies were then also extended to earlier stage and related kidney disease phenotypes and to population groups with varying degrees of recent African ancestry admixture (Behar et al. 2010; Freedman et al. 2009a, b; Nelson et al. 2010), and led to the expectation of finding a functional African ancestry causative variant within MYH9. However, despite intensive efforts including re-sequencing of the MYH9 gene no suggested functional mutation has been identified (Nelson et al. 2010; Winkler et al. 2010). This led us to re-examine the interval surrounding MYH9 and to the detection of novel missense mutations with predicted functional effects in the neighboring APOL1 gene, which are significantly more associated with ESKD than all previously reported SNPs in MYH9.
Annals of Statistics | 2007
Saharon Rosset; J. Zhu
We consider the generic regularized optimization problem β(λ) = argmin_β L(y, Xβ) + λJ(β). Efron, Hastie, Johnstone and Tibshirani [Ann. Statist. 32 (2004) 407-499] have shown that for the LASSO (that is, if L is squared error loss and J(β) = ‖β‖₁ is the ℓ₁ norm of β) the optimal coefficient path is piecewise linear, that is, ∂β(λ)/∂λ is piecewise constant. We derive a general characterization of the properties of (loss L, penalty J) pairs which give piecewise linear coefficient paths. Such pairs allow for efficient generation of the full regularized coefficient paths. We investigate the nature of efficient path following algorithms which arise. We use our results to suggest robust versions of the LASSO for regression and classification, and to develop new, efficient algorithms for existing problems in the literature, including Mammen and van de Geer's locally adaptive regression splines.
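Piecewise linearity is easiest to see in a special case. The sketch below assumes an orthonormal design and the loss L = ½‖y − Xβ‖², neither of which is stated in the abstract; under those assumptions each LASSO coefficient is the soft-thresholded value of z_j = x_jᵀy, so it moves linearly in λ until it reaches zero and then stays there. The names `soft_threshold` and `lasso_path_orthonormal` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, clipping at zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_path_orthonormal(z, lambdas):
    """LASSO coefficients at each lambda, assuming an orthonormal design.

    z holds the inner products x_j^T y. Each coefficient has slope -1 or +1
    in lambda while it is nonzero and slope 0 afterwards, so the whole
    path is piecewise linear with knots at the values |z_j|.
    """
    return [[soft_threshold(zj, lam) for zj in z] for lam in lambdas]
```

For z = [3.0, -1.0] the path has knots at λ = 1 and λ = 3; between knots every coefficient changes at a constant rate, which is the piecewise constant derivative described in the abstract.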
American Journal of Human Genetics | 2008
Doron M. Behar; Richard Villems; Himla Soodyall; Jason Blue-Smith; Luísa Pereira; Ene Metspalu; Rosaria Scozzari; Heeran Makkan; Shay Tzur; David Comas; Jaume Bertranpetit; Lluis Quintana-Murci; Chris Tyler-Smith; R. Spencer Wells; Saharon Rosset
The quest to explain demographic history during the early part of human evolution has been limited because of the scarce paleoanthropological record from the Middle Stone Age. To shed light on the structure of the mitochondrial DNA (mtDNA) phylogeny at the dawn of Homo sapiens, we constructed a matrilineal tree composed of 624 complete mtDNA genomes from sub-Saharan Hg L lineages. We paid particular attention to the Khoi and San (Khoisan) people of South Africa because they are considered to be a unique relic of hunter-gatherer lifestyle and to carry paternal and maternal lineages belonging to the deepest clades known among modern humans. Both the tree phylogeny and coalescence calculations suggest that Khoisan matrilineal ancestry diverged from the rest of the human mtDNA pool 90,000-150,000 years before present (ybp) and that at least five additional, currently extant maternal lineages existed during this period in parallel. Furthermore, we estimate that a minimum of 40 other evolutionarily successful lineages flourished in sub-Saharan Africa during the period of modern human dispersal out of Africa approximately 60,000-70,000 ybp. Only much later, at the beginning of the Late Stone Age, about 40,000 ybp, did introgression of additional lineages occur into the Khoisan mtDNA pool. This process was further accelerated during the recent Bantu expansions. Our results suggest that the early settlement of humans in Africa was already matrilineally structured and involved small, separately evolving isolated populations.
Nature | 2010
Doron M. Behar; Bayazit Yunusbayev; Mait Metspalu; Ene Metspalu; Saharon Rosset; Jüri Parik; Siiri Rootsi; Gyaneshwer Chaubey; Ildus Kutuev; Guennady Yudkovsky; Elza Khusnutdinova; Oleg Balanovsky; Ornella Semino; Luísa Pereira; David Comas; David Gurwitz; Batsheva Bonne-Tamir; Tudor Parfitt; Michael F. Hammer; Karl Skorecki; Richard Villems
Contemporary Jews comprise an aggregate of ethno-religious communities whose worldwide members identify with each other through various shared religious, historical and cultural traditions. Historical evidence suggests common origins in the Middle East, followed by migrations leading to the establishment of communities of Jews in Europe, Africa and Asia, in what is termed the Jewish Diaspora. This complex demographic history imposes special challenges in attempting to address the genetic structure of the Jewish people. Although many genetic studies have shed light on Jewish origins and on diseases prevalent among Jewish communities, including studies focusing on uniparentally and biparentally inherited markers, genome-wide patterns of variation across the vast geographic span of Jewish Diaspora communities and their respective neighbours have yet to be addressed. Here we use high-density bead arrays to genotype individuals from 14 Jewish Diaspora communities and compare these patterns of genome-wide diversity with those from 69 Old World non-Jewish populations, of which 25 have not previously been reported. These samples were carefully chosen to provide comprehensive comparisons between Jewish and non-Jewish populations in the Diaspora, as well as with non-Jewish populations from the Middle East and north Africa. Principal component and structure-like analyses identify previously unrecognized genetic substructure within the Middle East. Most Jewish samples form a remarkably tight subcluster that overlies Druze and Cypriot samples but not samples from other Levantine populations or paired Diaspora host populations. In contrast, Ethiopian Jews (Beta Israel) and Indian Jews (Bene Israel and Cochini) cluster with neighbouring autochthonous populations in Ethiopia and western India, respectively, despite a clear paternal link between the Bene Israel and the Levant. 
These results cast light on the variegated genetic architecture of the Middle East, and trace the origins of most Jewish Diaspora communities to the Levant.
Genome Research | 2012
Melissa Gymrek; David E. Golan; Saharon Rosset; Yaniv Erlich
Short tandem repeats (STRs) have a wide range of applications, including medical genetics, forensics, and genetic genealogy. High-throughput sequencing (HTS) has the potential to profile hundreds of thousands of STR loci. However, mainstream bioinformatics pipelines are inadequate for the task. These pipelines treat STR mapping as gapped alignment, which results in cumbersome processing times and a biased sampling of STR alleles. Here, we present lobSTR, a novel method for profiling STRs in personal genomes. lobSTR harnesses concepts from signal processing and statistical learning to avoid gapped alignment and to address the specific noise patterns in STR calling. The speed and reliability of lobSTR exceed the performance of current mainstream algorithms for STR profiling. We validated lobSTR's accuracy by measuring its consistency in calling STRs from whole-genome sequencing of two biological replicates from the same individual, by tracing Mendelian inheritance patterns in STR alleles in whole-genome sequencing of a HapMap trio, and by comparing lobSTR results to traditional molecular techniques. Encouraged by the speed and accuracy of lobSTR, we used the algorithm to conduct a comprehensive survey of STR variations in a deeply sequenced personal genome. We traced the mutation dynamics of close to 100,000 STR loci and observed more than 50,000 STR variations in a single genome. lobSTR's implementation is an end-to-end solution. The package accepts raw sequencing reads and provides the user with the genotyping results. It is written in C/C++, includes multi-threading capabilities, and is compatible with the BAM format.
PLOS Genetics | 2007
Doron M. Behar; Saharon Rosset; Jason Blue-Smith; Oleg Balanovsky; Shay Tzur; David Comas; R. John Mitchell; Lluis Quintana-Murci; Chris Tyler-Smith; R. Spencer Wells
The Genographic Project is studying the genetic signatures of ancient human migrations and creating an open-source research database. It allows members of the public to participate in a real-time anthropological genetics study by submitting personal samples for analysis and donating the genetic results to the database. We report our experience from the first 18 months of public participation in the Genographic Project, during which we have created the largest standardized human mitochondrial DNA (mtDNA) database ever collected, comprising 78,590 genotypes. Here, we detail our genotyping and quality assurance protocols including direct sequencing of the mtDNA HVS-I, genotyping of 22 coding-region SNPs, and a series of computational quality checks based on phylogenetic principles. This database is very informative with respect to mtDNA phylogeny and mutational dynamics, and its size allows us to develop a nearest neighbor–based methodology for mtDNA haplogroup prediction based on HVS-I motifs that is superior to classic rule-based approaches. We make available to the scientific community and general public two new resources: a periodically updated database comprising all data donated by participants, and the nearest neighbor haplogroup prediction tool.
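The nearest neighbor approach to haplogroup prediction can be sketched roughly as follows. The abstract does not describe the distance measure or data layout, so everything here is an assumption made for illustration: each HVS-I motif is represented as a set of variant labels, the distance between two motifs is the number of variants present in one but not the other, and the query receives the haplogroup of the closest reference motif. The labels and haplogroup names in the usage note are placeholders, not real data.

```python
def motif_distance(a, b):
    """Number of variants found in exactly one of the two motifs."""
    return len(set(a) ^ set(b))


def predict_haplogroup(query, reference):
    """Return the haplogroup label of the nearest reference motif.

    reference is a list of (motif, haplogroup) pairs, where each motif is
    a collection of variant labels. Ties go to the earliest entry.
    """
    nearest = min(reference, key=lambda item: motif_distance(query, item[0]))
    return nearest[1]
```

Given a toy reference such as `[({"v1", "v2"}, "hgA"), ({"v3", "v4"}, "hgB")]`, a query motif `{"v1"}` is one variant away from the first entry and three from the second, so it is assigned `"hgA"`. A rule-based classifier would instead require specific diagnostic variants to be present.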
Knowledge Discovery and Data Mining | 1999
Saharon Rosset; Uzi Murad; Einat Neumann; Yizhak Idan; Gadi Pinkas
Many fraud analysis systems have at their heart a rule-based engine for generating alerts about suspicious behaviors. The rules in the system are usually based on expert knowledge. Automatic rule discovery aims at using past examples of fraudulent and legitimate usage to find new patterns and rules to help distinguish between the two. Some aspects of the problem of finding rules suitable for fraud analysis make this problem unique. Among them are the following: the need to find rules combining both the properties of the customer (e.g., credit rating) and properties of the specific “behavior” which indicates fraud (e.g., number of international calls in one day); and the need for a new definition of accuracy: We need to find rules which do not necessarily classify correctly each individual “usage sample” as either fraudulent or not, but ensure the identification, with a minimum of wasted cost and effort, of most of the fraud “cases” (i.e., defrauded customers). These aspects require a special-purpose rule discovery system. We present as an example a two-stage system based on adaptation of the C4.5 rule generator, with an additional rule selection mechanism. Our experimental results indicate that this route is very promising.
Bioinformatics | 2009
Aurelie C. Lozano; Naoki Abe; Yan Liu; Saharon Rosset
We consider the problem of discovering gene regulatory networks from time-series microarray data. Recently, graphical Granger modeling has gained considerable attention as a promising direction for addressing this problem. These methods apply graphical modeling methods on time-series data and invoke the notion of ‘Granger causality’ to make assertions on causality through inference on time-lagged effects. Existing algorithms, however, have neglected an important aspect of the problem: the group structure among the lagged temporal variables naturally imposed by the time series they belong to. Specifically, existing methods in computational biology share this shortcoming, as well as additional computational limitations, prohibiting their effective applications to the large datasets including a large number of genes and many data points. In the present article, we propose a novel methodology which we term ‘grouped graphical Granger modeling method’, which overcomes the limitations mentioned above by applying a regression method suited for high-dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data. The simulation results show that the proposed methodology generally exhibits higher accuracy in recovering the underlying causal structure. Those on the gene expression data demonstrate that it leads to improved accuracy with respect to prediction of known links, and also uncovers additional causal relationships uncaptured by earlier works.
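The group structure among lagged variables can be made concrete with a small data-preparation sketch. It builds, for one target series, a design matrix whose columns are lags 1..max_lag of every series, and records which columns belong to which source series. The function name `build_lagged_design` and the data layout are assumptions for illustration, and the regression step itself is not shown; a group-wise penalized regression (for example a group lasso, which the abstract does not name) could then keep or drop all lags of a source series together.

```python
def build_lagged_design(series, target, max_lag):
    """Build lagged predictors for one target series, grouped by source.

    series maps a name to a list of equally long observations.
    Returns (rows, responses, groups), where groups maps each source name
    to the column indices that hold its lags 1..max_lag.
    """
    names = sorted(series)
    length = len(series[target])
    groups = {}
    for position, name in enumerate(names):
        start = position * max_lag
        groups[name] = list(range(start, start + max_lag))
    rows, responses = [], []
    for t in range(max_lag, length):
        row = []
        for name in names:
            # Most recent lag first: value at t-1, then t-2, and so on.
            row.extend(series[name][t - lag] for lag in range(1, max_lag + 1))
        rows.append(row)
        responses.append(series[target][t])
    return rows, responses, groups
```

In a Granger-style reading, a source series would be linked to the target only if its whole group of lag columns is retained by the regression, rather than judging each lag separately.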
Data Mining and Knowledge Discovery | 2003
Saharon Rosset; Einat Neumann; Uri Eick; Nurit Vatnik
We present and discuss the important business problem of estimating the effect of marketing activities on the Lifetime Value of a customer in the Telecommunications industry. We discuss the components of this problem, in particular customer value and length of service (or tenure) modeling, and present a novel segment-based approach, motivated by the segment-level view marketing analysts usually employ. We describe in detail how we build on this approach to estimate the effects of retention campaigns on Lifetime Value, and also discuss its application in other situations. Our solution has been successfully implemented by the Business Insight (BI) Professional Services.
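The two components named above, customer value and tenure, combine naturally into a segment-level Lifetime Value figure. The sketch below is one simple way to do this and is not taken from the paper: it assumes a constant value per period for the segment, a survival curve giving the probability that a customer is still active in each period, and an optional per-period discount rate. The names `segment_ltv` and `campaign_effect` and all numbers in the usage note are hypothetical.

```python
def segment_ltv(value_per_period, survival, discount=0.0):
    """Expected discounted value of a customer in one segment.

    survival[t] is the assumed probability that the customer is still
    active in period t (period 0 is the present).
    """
    return sum(
        value_per_period * s / (1.0 + discount) ** t
        for t, s in enumerate(survival)
    )


def campaign_effect(value_per_period, survival_before, survival_after, discount=0.0):
    """Change in segment LTV if a retention campaign shifts the survival curve."""
    before = segment_ltv(value_per_period, survival_before, discount)
    after = segment_ltv(value_per_period, survival_after, discount)
    return after - before
```

For instance, with a value of 10 per period and no discounting, a survival curve of [1.0, 0.5, 0.25] gives an LTV of 17.5, and a campaign that lifts the curve to [1.0, 0.75, 0.5] adds 5.0 per customer in that segment.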