Arif Harmanci
Yale University
Publication
Featured research published by Arif Harmanci.
Nature | 2012
Mark Gerstein; Anshul Kundaje; Manoj Hariharan; Stephen G. Landt; Koon Kiu Yan; Chao Cheng; Xinmeng Jasmine Mu; Ekta Khurana; Joel Rozowsky; Roger P. Alexander; Renqiang Min; Pedro Alves; Alexej Abyzov; Nick Addleman; Nitin Bhardwaj; Alan P. Boyle; Philip Cayting; Alexandra Charos; David Chen; Yong Cheng; Declan Clarke; Catharine L. Eastman; Ghia Euskirchen; Seth Frietze; Yao Fu; Jason Gertz; Fabian Grubert; Arif Harmanci; Preti Jain; Maya Kasowski
Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
Science | 2013
Ekta Khurana; Yao Fu; Vincenza Colonna; Xinmeng Jasmine Mu; Hyun Min Kang; Tuuli Lappalainen; Andrea Sboner; Lucas Lochovsky; Jieming Chen; Arif Harmanci; Jishnu Das; Alexej Abyzov; Suganthi Balasubramanian; Kathryn Beal; Dimple Chakravarty; Daniel Challis; Yuan Chen; Declan Clarke; Laura Clarke; Fiona Cunningham; Uday S. Evani; Paul Flicek; Robert Fragoza; Erik Garrison; Richard A. Gibbs; Zeynep H. Gümüş; Javier Herrero; Naoki Kitabayashi; Yong Kong; Kasper Lage
Introduction: Plummeting sequencing costs have led to a great increase in the number of personal genomes. Interpreting the large number of variants in them, particularly in noncoding regions, is a current challenge. This is especially the case for somatic variants in cancer genomes, a large proportion of which are noncoding.
Prioritization of candidate noncoding cancer drivers based on patterns of selection. (Step 1) Filter somatic variants to exclude 1000 Genomes polymorphisms; (2) retain variants in noncoding annotations; (3) retain those in “sensitive” regions; (4) prioritize those disrupting a transcription-factor binding motif and (5) residing near the center of a biological network; (6) prioritize ones in annotation blocks mutated in multiple cancer samples.
Methods: We investigated patterns of selection in DNA elements from the ENCODE project using the full spectrum of variants from 1092 individuals in the 1000 Genomes Project (Phase 1), including single-nucleotide variants (SNVs), short insertions and deletions (indels), and structural variants (SVs). Although we analyzed broad functional annotations, such as all transcription-factor binding sites, we focused more on highly specific categories such as distal binding sites of factor ZNF274. The greater statistical power of the Phase 1 data set compared with earlier ones allowed us to differentiate the selective constraints on these categories. We also used connectivity information between elements from protein-protein-interaction and regulatory networks. We integrated all the information on selection to develop a workflow (FunSeq) to prioritize personal-genome variants on the basis of their deleterious impact. As a proof of principle, we experimentally validated and characterized a few candidate variants.
Results: We identified a specific subgroup of noncoding categories with almost as much selective constraint as coding genes: “ultrasensitive” regions. We also uncovered a number of clear patterns of selection. Elements more consistently active across tissues and both maternal and paternal alleles (in terms of allele-specific activity) are under stronger selection. Variants disruptive because of mechanistic effects on transcription-factor binding (i.e. “motif-breakers”) are selected against. Higher network connectivity (i.e. for hubs) is associated with higher constraint. Additionally, many hub promoters and regulatory elements show evidence of recent positive selection. Overall, indels and SVs follow the same pattern as SNVs; however, there are notable exceptions. For instance, enhancers are enriched for SVs formed by nonallelic homologous recombination. We integrated these patterns of selection into the FunSeq prioritization workflow and applied it to cancer variants, because they present a strong contrast to inherited polymorphisms. In particular, application to ~90 cancer genomes (breast, prostate and medulloblastoma) reveals nearly a hundred candidate noncoding drivers.
Discussion: Our approach can be readily used to prioritize variants in cancer and is immediately applicable in a precision-medicine context. It can be further improved by incorporation of larger-scale population sequencing, better annotations, and expression data from large cohorts.
Identifying Important Identifiers: Each of us has millions of sequence variations in our genomes. Signatures of purifying or negative selection should help identify which of those variations is functionally important. Khurana et al. (1235587) used sequence polymorphisms from 1092 humans across 14 populations to identify patterns of selection, especially in noncoding regulatory regions. Noncoding regions under very strong negative selection included binding sites of some chromatin and general transcription factors (TFs) and core motifs of some important TF families. Positive selection in TF binding sites tended to occur in network hub promoters. Many recurrent somatic cancer variants occurred in noncoding regulatory regions and thus might indicate mutations that drive cancer.
Regions under strong selection in the human genome identify noncoding regulatory elements with possible roles in disease.
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
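The six prioritization steps listed in this abstract amount to a filter followed by a ranking. The sketch below illustrates that shape in Python. It is not the FunSeq implementation: the `Variant` fields, the one-point-per-feature score, and the example variants are all hypothetical, and the annotations are assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    # Hypothetical, precomputed annotations (not FunSeq's data model).
    name: str
    in_1000g: bool       # step 1: known 1000 Genomes polymorphism
    noncoding: bool      # step 2: falls in a noncoding annotation
    sensitive: bool      # step 3: falls in a "sensitive" region
    breaks_motif: bool   # step 4: disrupts a TF binding motif
    near_hub: bool       # step 5: near the center of a network
    recurrent: bool      # step 6: block mutated in multiple samples


def prioritize(variants):
    """Filter by steps 1-3, then rank by steps 4-6 (one point each, an assumption)."""
    kept = [v for v in variants
            if not v.in_1000g and v.noncoding and v.sensitive]

    def score(v):
        return int(v.breaks_motif) + int(v.near_hub) + int(v.recurrent)

    # sorted() is stable, so variants with equal scores keep their input order.
    return sorted(kept, key=score, reverse=True)


variants = [
    Variant("v1", False, True, True, True, True, False),
    Variant("v2", True, True, True, True, True, True),     # dropped: polymorphism
    Variant("v3", False, True, True, True, True, True),
    Variant("v4", False, True, False, True, False, False),  # dropped: not sensitive
]
ranked = [v.name for v in prioritize(variants)]
print(ranked)  # ['v3', 'v1']
```

A real workflow would presumably weight these features rather than count them equally; equal weights are used here only to keep the example short.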
Molecular Systems Biology | 2014
Joel Rozowsky; Alexej Abyzov; Jing Wang; Pedro Alves; Debasish Raha; Arif Harmanci; Jing Leng; Robert D. Bjornson; Yong Kong; Naoki Kitabayashi; Nitin Bhardwaj; Mark A. Rubin; Michael Snyder; Mark Gerstein
To study allele‐specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele‐specific events with significant differences in the number of mapped reads between maternal and paternal alleles. There are many technical challenges in the construction and alignment of reads to a personal diploid genome sequence that we address, for example, bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA‐Seq and ChIP‐Seq data sets generated for this purpose. In addition to observing fairly widespread allele‐specific behavior within individual functional genomic data sets (including results consistent with X‐chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB from multiple transcription factor events using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.
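One common way to decide whether maternal and paternal read counts at a site differ significantly is an exact two-sided binomial test against an expected 50:50 split. The abstract does not state which test the pipeline uses, so the sketch below is an illustrative assumption; the function name and the example counts are hypothetical, and no multiple-testing correction is shown.

```python
from math import comb


def binomial_two_sided_p(maternal: int, paternal: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed maternal count. Intended for modest read counts only: very large
    totals can overflow or underflow in floating point.
    """
    n = maternal + paternal
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[maternal]
    # The small relative tolerance guards against floating-point ties.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


balanced = binomial_two_sided_p(10, 10)   # no allelic imbalance
skewed = binomial_two_sided_p(20, 0)      # all reads from one allele
print(balanced, skewed)
```

With 10 reads on each allele the p-value is 1, while a 20:0 split gives a p-value of about 2e-6, which would be flagged as allele-specific under any conventional threshold.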
Nature | 2014
Mark Gerstein; Joel Rozowsky; Koon Kiu Yan; Daifeng Wang; Chao Cheng; James B. Brown; Carrie A. Davis; LaDeana W. Hillier; Cristina Sisu; Jingyi Jessica Li; Baikang Pei; Arif Harmanci; Michael O. Duff; Sarah Djebali; Roger P. Alexander; Burak H. Alver; Raymond K. Auerbach; Kimberly Bell; Peter J. Bickel; Max E. Boeck; Nathan Boley; Benjamin W. Booth; Lucy Cherbas; Peter Cherbas; Chao Di; Alexander Dobin; Jorg Drenkow; Brent Ewing; Gang Fang; Megan Fastuca
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a ‘universal model’ based on a single set of organism-independent parameters.
Nature Communications | 2015
Dmitri D. Pervouchine; Sarah Djebali; Alessandra Breschi; Carrie A. Davis; Pablo Prieto Barja; Alexander Dobin; Andrea Tanzer; Julien Lagarde; Chris Zaleski; Lei Hoon See; Meagan Fastuca; Jorg Drenkow; Huaien Wang; Giovanni Bussotti; Baikang Pei; Suganthi Balasubramanian; Jean Monlong; Arif Harmanci; Mark Gerstein; Michael Beer; Cedric Notredame; Roderic Guigó; Thomas R. Gingeras
Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.
Nature Methods | 2016
Arif Harmanci; Mark Gerstein
Studies on genomic privacy have traditionally focused on identifying individuals using DNA variants. In contrast, molecular phenotype data, such as gene expression levels, are generally assumed to be free of such identifying information. Although there is no explicit genotypic information in phenotype data, adversaries can statistically link phenotypes to genotypes using publicly available genotype-phenotype correlations such as expression quantitative trait loci (eQTLs). This linking can be accurate when high-dimensional data (i.e., many expression levels) are used, and the resulting links can then reveal sensitive information (for example, the fact that an individual has cancer). Here we develop frameworks for quantifying the leakage of characterizing information from phenotype data sets. These frameworks can be used to estimate the leakage from large data sets before release. We also present a general three-step procedure for practically instantiating linking attacks and a specific attack using outlier gene expression levels that is simple yet accurate. Finally, we describe the effectiveness of this outlier attack under different scenarios.
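The idea of linking outlier expression levels to genotypes through eQTLs can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's procedure: the function names, the single global pair of outlier thresholds, the identifiers (`G1`, `snp1`, and so on), and the toy data are all hypothetical.

```python
def predict_genotypes(expression, eqtls, low, high):
    """Guess a homozygous genotype (0 or 2 alternate alleles) for outlier genes only.

    eqtls maps gene -> (variant_id, direction); direction is +1 when expression
    rises with the alternate-allele count and -1 when it falls.
    """
    predicted = {}
    for gene, (variant, direction) in eqtls.items():
        level = expression.get(gene)
        if level is None:
            continue
        if level >= high:
            predicted[variant] = 2 if direction > 0 else 0
        elif level <= low:
            predicted[variant] = 0 if direction > 0 else 2
        # Non-outlier expression levels yield no prediction.
    return predicted


def link(predicted, genotype_db):
    """Return the individual whose genotypes agree with the most predictions."""
    def matches(genotypes):
        return sum(genotypes.get(v) == g for v, g in predicted.items())
    return max(genotype_db, key=lambda name: matches(genotype_db[name]))


eqtls = {"G1": ("snp1", +1), "G2": ("snp2", -1), "G3": ("snp3", +1)}
expression = {"G1": 9.5, "G2": 9.0, "G3": 5.0}   # one anonymized sample
genotype_db = {
    "A": {"snp1": 2, "snp2": 0, "snp3": 1},
    "B": {"snp1": 0, "snp2": 2, "snp3": 1},
    "C": {"snp1": 2, "snp2": 1, "snp3": 0},
}
predicted = predict_genotypes(expression, eqtls, low=1.0, high=9.0)
best = link(predicted, genotype_db)
print(predicted, best)  # {'snp1': 2, 'snp2': 0} A
```

Only the two outlier genes contribute predictions here, which mirrors the point that a small number of extreme values, combined across many genes, can be enough to single out one individual.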
Genome Biology | 2014
Arif Harmanci; Joel Rozowsky; Mark Gerstein
We present MUSIC, a signal processing approach for identification of enriched regions in ChIP-Seq data, available at http://www.music.gersteinlab.org. MUSIC first filters the ChIP-Seq read-depth signal for systematic noise from non-uniform mappability, which fragments enriched regions. Then it performs a multiscale decomposition, using median filtering, identifying enriched regions at multiple length scales. This is useful given the wide range of scales probed in ChIP-Seq assays. MUSIC performs favorably in terms of accuracy and reproducibility compared with other methods. In particular, analysis of RNA polymerase II data reveals a clear distinction between the stalled and elongating forms of the polymerase.
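To illustrate why median filtering at several window sizes helps, the sketch below smooths a toy read-depth signal at two scales and reports the intervals above a threshold. It is a minimal illustration of the general idea, not the MUSIC algorithm: the function names, the fixed threshold, and the toy signal are hypothetical.

```python
from statistics import median


def median_filter(signal, half_window):
    """Sliding-window median; the window is truncated at the ends of the signal."""
    n = len(signal)
    return [
        median(signal[max(0, i - half_window): min(n, i + half_window + 1)])
        for i in range(n)
    ]


def enriched_regions(signal, half_window, threshold):
    """Half-open (start, end) intervals where the smoothed signal exceeds threshold."""
    smoothed = median_filter(signal, half_window)
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(smoothed)))
    return regions


def multiscale_regions(signal, half_windows, threshold):
    """Call regions independently at each scale (each half-window size)."""
    return {w: enriched_regions(signal, w, threshold) for w in half_windows}


# Toy signal: one enriched region split by a single-bin dropout at index 4.
signal = [0, 0, 5, 5, 0, 5, 5, 0, 0, 0, 0, 0]
result = multiscale_regions(signal, half_windows=[0, 1], threshold=2)
print(result)  # {0: [(2, 4), (5, 7)], 1: [(2, 7)]}
```

At the finest scale the dropout fragments the region into two pieces; at the next scale the median bridges the gap and a single region is reported, which is the kind of fragmentation the abstract attributes to non-uniform mappability.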
bioRxiv | 2014
Dmitri D. Pervouchine; Sarah Djebali; Alessandra Breschi; Carrie A. Davis; Pablo Prieto Barja; Alexander Dobin; Andrea Tanzer; Julien Lagarde; Chris Zaleski; Lei-Hoon See; Meagan Fastuca; Jorg Drenkow; Huaien Wang; Giovanni Bussotti; Baikang Pei; Suganthi Balasubramanian; Jean Monlong; Arif Harmanci; Mark Gerstein; Michael Beer; Cedric Notredame; Roderic Guigó; Thomas R. Gingeras
We characterized by RNA-seq the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles obtained in human cell lines reveals substantial conservation of transcriptional programs, and uncovers a distinct class of genes with levels of expression across cell types and species that have been constrained early in vertebrate evolution. This core set of genes captures a substantial and constant fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with strong and conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory program in which sub-cellular localization and alternative splicing play comparatively large roles.
Nature Communications | 2018
Arif Harmanci; Mark Gerstein
Functional genomics experiments, such as RNA-seq, provide non-individual specific information about gene expression under different conditions such as disease and normal. There is great desire to share these data. However, privacy concerns often preclude sharing of the raw reads. To enable safe sharing, aggregated summaries such as read-depth signal profiles and levels of gene expression are used. Projects such as GTEx and ENCODE share these because they ostensibly do not leak much identifying information. Here, we attempt to quantify the validity of this statement, measuring the leakage of genomic deletions from signal profiles. We present information theoretic measures for the degree to which one can genotype these deletions. We then develop practical genotyping approaches and demonstrate how to use these to identify an individual within a large cohort in the context of linking attacks. Finally, we present an anonymization method removing much of the leakage from signal profiles. Functional genomics data from many studies are widely shared publicly for their value in biomedical and disease research. Here, the authors show sensitive information leakage is possible by analyzing functional genomics signal profiles, and develop an anonymization procedure for privacy protection.
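A rough intuition for how a read-depth signal profile can reveal a deletion: inside a deleted segment the signal drops toward zero while the flanking signal stays high. The sketch below encodes that intuition in Python. It is an assumption made for illustration, not the genotyping approach developed in the paper; the function name, the flank size, the ratio cutoff, and the toy profiles are hypothetical.

```python
from statistics import mean


def call_deletion(signal, start, end, flank=5, ratio_cutoff=0.1):
    """Call a candidate deletion over the half-open interval [start, end).

    Returns True when the mean signal inside is below ratio_cutoff times the
    mean signal in the flanks, False when it is not, and None when the flanks
    carry no signal (no evidence either way).
    """
    inside = signal[start:end]
    flanks = signal[max(0, start - flank):start] + signal[end:end + flank]
    if not inside or not flanks:
        return None
    flank_mean = mean(flanks)
    if flank_mean == 0:
        return None
    return mean(inside) / flank_mean < ratio_cutoff


dip_profile = [10, 12, 11, 10, 0, 0, 0, 0, 11, 10, 12, 11]
flat_profile = [10] * 12
has_deletion = call_deletion(dip_profile, 4, 8, flank=4)
no_deletion = call_deletion(flat_profile, 4, 8, flank=4)
print(has_deletion, no_deletion)  # True False
```

Returning None for uninformative flanks keeps unexpressed regions from being misread as deletions, since a flat zero signal carries no genotype information.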
ETHICS '14 Proceedings of the IEEE 2014 International Symposium on Ethics in Engineering, Science, and Technology | 2014
Dov Greenbaum; Arif Harmanci; Mark Gerstein
The issues of privacy and disclosure are two sides of a weighty coin. Computational biologists and other scientists involved in genomic research need to be constantly cognizant of the push and pull of these two important concepts. Clinical genomics research in particular raises a number of particularly poignant concerns as society struggles between invasions of privacy such as recent efforts by the FBI and the NSA, and our own (surprisingly) personal disclosures on social media sites or via apathetic acquiescence to large data collection efforts. With regard to privacy there are numerous computational efforts that have heretofore offered to provide both the robustness of protection and the ease of use to be effective in manipulating the terabytes of data before the genomics researcher. Unfortunately, algorithms alone have thus far failed to provide either the necessary strength to foil those intent on obtaining information or the promised agility to manipulate the vast datasets. While technical solutions advance, they cannot stand on their own, and this paper proposes and outlines a licensing scheme, similar to those used by professional organizations, that not only enforces a code of conduct and punishes those who fail to live up to that code, but also mandates continuing education to limit the possibility that the code will be violated inadvertently. It is the use of the social and the technological advances together that will likely create not only an environment that fosters research and innovation, but also one that is responsive to privacy needs and norms.