Jan F. Prins

University of North Carolina at Chapel Hill

Publication

Featured research published by Jan F. Prins.

Nucleic Acids Research | 2010

MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery

Kai Wang; Darshan Singh; Zheng Zeng; Stephen J. Coleman; Yan Huang; Gleb L. Savich; Xiaping He; Piotr A. Mieczkowski; Sara A. Grimm; Charles M. Perou; James N. MacLeod; Derek Y. Chiang; Jan F. Prins; Jinze Liu

The accurate mapping of reads that span splice junctions is a critical component of all analytic techniques that work with RNA-seq data. We introduce a second generation splice detection algorithm, MapSplice, whose focus is high sensitivity and specificity in the detection of splices as well as CPU and memory efficiency. MapSplice can be applied to both short (<75 bp) and long reads (≥75 bp). MapSplice does not depend on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. MapSplice leverages the quality and diversity of read alignments of a given splice to increase accuracy. We demonstrate that MapSplice achieves higher sensitivity and specificity than TopHat and SpliceMap on a set of simulated RNA-seq data. Experimental studies also support the accuracy of the algorithm. Splice junctions derived from eight breast cancer RNA-seq datasets recapitulated the extensiveness of alternative splicing on a global level as well as the differences between molecular subtypes of breast cancer. These combined results indicate that MapSplice is a highly accurate algorithm for the alignment of RNA-seq reads to splice junctions. Software download URL: http://www.netlab.uky.edu/p/bioinfo/MapSplice.

International Conference on Data Mining | 2003

Efficient mining of frequent subgraphs in the presence of isomorphism

Jun Huan; Wei Wang; Jan F. Prins

Frequent subgraph mining is an active research topic in the data mining community. A graph is a general model to represent data and has been used in many domains like cheminformatics and bioinformatics. Mining patterns from graph databases is challenging since graph-related operations, such as subgraph testing, generally have higher time complexity than the corresponding operations on itemsets, sequences, and trees, which have been studied extensively. We propose a novel frequent subgraph mining algorithm, FFSM, which employs a vertical search scheme within an algebraic graph framework we have developed to reduce the number of redundant candidates proposed. Our empirical study on synthetic and real datasets demonstrates that FFSM achieves a substantial performance gain over the current state-of-the-art subgraph mining algorithm gSpan.

ACM Transactions on Programming Languages and Systems | 1989

Integrating noninterfering versions of programs

Susan Horwitz; Jan F. Prins; Thomas W. Reps

The need to integrate several versions of a program into a common one arises frequently, but it is a tedious and time consuming task to integrate programs by hand. To date, the only available tools for assisting with program integration are variants of text-based differential file comparators; these are of limited utility because one has no guarantees about how the program that is the product of an integration behaves compared to the programs that were integrated. This paper concerns the design of a semantics-based tool for automatically integrating program versions. The main contribution of the paper is an algorithm that takes as input three programs A, B, and Base, where A and B are two variants of Base. Whenever the changes made to Base to create A and B do not “interfere” (in a sense defined in the paper), the algorithm produces a program M that integrates A and B. The algorithm is predicated on the assumption that differences in the behavior of the variant programs from that of Base, rather than differences in the text, are significant and must be preserved in M. Although it is undecidable whether a program modification actually leads to such a difference, it is possible to determine a safe approximation by comparing each of the variants with Base. To determine this information, the integration algorithm employs a program representation that is similar (although not identical) to the dependence graphs that have been used previously in vectorizing and parallelizing compilers.
The algorithm also makes use of the notion of a program slice to find just those statements of a program that determine the values of potentially affected variables. The program-integration problem has not been formalized previously. It should be noted, however, that the integration problem examined here is a greatly simplified one; in particular, we assume that expressions contain only scalar variables and constants, and that the only statements used in programs are assignment statements, conditional statements, and while-loops.
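
As a rough illustration of the program-slice idea mentioned in this abstract, the Python sketch below computes a backward slice over a small hand-written dependence graph. The graph encoding, the statement labels `s1` to `s6`, and the function name `backward_slice` are assumptions made for this example; it is not the paper's integration algorithm.

```python
from collections import deque


def backward_slice(deps, criterion):
    """Return every statement that `criterion` transitively depends on.

    `deps` maps a statement label to the labels it depends on (data or
    control dependences). This encoding is hypothetical.
    """
    seen = {criterion}
    work = deque([criterion])
    while work:
        stmt = work.popleft()
        for pred in deps.get(stmt, ()):
            if pred not in seen:
                seen.add(pred)
                work.append(pred)
    return seen


# Toy program (hypothetical):
#   s1: total = 0
#   s2: i = 1
#   s3: while i < 11:
#   s4:     total = total + i
#   s5:     i = i + 1
#   s6: prod = 1          (unrelated to total)
DEPS = {
    "s3": ["s2", "s5"],
    "s4": ["s1", "s2", "s3", "s4", "s5"],
    "s5": ["s2", "s3", "s5"],
    "s6": [],
}

slice_for_total = backward_slice(DEPS, "s4")
```

Here the slice for `s4` contains every statement except `s6`, which never influences `total`.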

Knowledge Discovery and Data Mining | 2004

SPIN: mining maximal frequent subgraphs from graph databases

Jun Huan; Wei Wang; Jan F. Prins; Jiong Yang

One fundamental challenge for mining recurring subgraphs from semi-structured data sets is the overwhelming abundance of such patterns. In large graph databases, the total number of frequent subgraphs can become too large to allow a full enumeration using reasonable computational resources. In this paper, we propose a new algorithm that mines only maximal frequent subgraphs, i.e. subgraphs that are not a part of any other frequent subgraphs. This may exponentially decrease the size of the output set in the best case; in our experiments on practical data sets, mining maximal frequent subgraphs reduces the total number of mined patterns by two to three orders of magnitude. Our method first mines all frequent trees from a general graph database and then reconstructs all maximal subgraphs from the mined trees. Using two chemical structure benchmarks and a set of synthetic graph data sets, we demonstrate that, in addition to decreasing the output size, our algorithm can achieve a five-fold speed up over the current state-of-the-art subgraph mining algorithms.
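
The notion of maximality can be illustrated in a deliberately simplified setting. The Python sketch below assumes each frequent pattern is given as a set of labeled edges with fixed vertex identities, so that "is a part of" reduces to a plain subset test. Real subgraph mining must handle isomorphism, which this sketch ignores, and it is not the tree-based method the abstract describes. The name `maximal_patterns` and the sample data are hypothetical.

```python
def maximal_patterns(frequent):
    """Keep only patterns that are not a proper subset of another pattern.

    Assumption: patterns are edge sets over fixed vertex labels, so the
    subset operator stands in for the subgraph relation.
    """
    unique = {frozenset(p) for p in frequent}
    return {p for p in unique if not any(p < q for q in unique)}


# Hypothetical frequent patterns, each a set of labeled edges.
FREQUENT = [
    {("C", "O")},
    {("C", "N")},
    {("C", "O"), ("C", "N")},
    {("N", "H")},
]

result = maximal_patterns(FREQUENT)
```

The two single-edge patterns contained in the two-edge pattern are dropped, which is the source of the output reduction the abstract reports.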

Languages and Compilers for Parallel Computing | 2006

UTS: an unbalanced tree search benchmark

Stephen L. Olivier; Jun Huan; Jinze Liu; Jan F. Prins; James Dinan; P. Sadayappan; Chau-Wen Tseng

This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to UPC's shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult for applications such as UTS and performance degrades on PC clusters. By varying key work stealing parameters, we expose important tradeoffs between the granularity of load balance, the degree of parallelism, and communication costs.
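
The combination of an unbalanced tree and work stealing can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the tree parameters `B0`, `M`, `Q`, and `MAX_DEPTH`, the use of a SHA-1 hash of the node's path to decide its child count, and the sequential round-based simulation of workers. It is not the benchmark's OpenMP or UPC implementation and does not measure parallel performance.

```python
import hashlib
from collections import deque

# Hypothetical tree parameters: root fan-out, child fan-out, branching
# probability, and a depth cap that only guards this sketch.
B0, M, Q, MAX_DEPTH = 8, 4, 0.2, 64


def children(path):
    """Children of a node identified by its path from the root (tuple of ints)."""
    if not path:
        n = B0                                   # the root always branches
    elif len(path) >= MAX_DEPTH:
        n = 0
    else:
        digest = hashlib.sha1(repr(path).encode()).digest()
        u = int.from_bytes(digest[:4], "big") / 2**32   # value in [0, 1)
        n = M if u < Q else 0                    # most nodes are leaves
    return [path + (i,) for i in range(n)]


def count_sequential():
    """Count all tree nodes with a single explicit stack."""
    stack, visited = [()], 0
    while stack:
        node = stack.pop()
        visited += 1
        stack.extend(children(node))
    return visited


def count_with_stealing(workers=4):
    """Simulate workers in rounds; return the nodes processed by each worker."""
    queues = [deque() for _ in range(workers)]
    queues[0].append(())
    done = [0] * workers
    while any(queues):
        for w in range(workers):
            if queues[w]:
                node = queues[w].pop()           # newest node of own queue
            else:
                victim = max(range(workers), key=lambda v: len(queues[v]))
                if not queues[victim]:
                    continue                     # nothing to steal this round
                node = queues[victim].popleft()  # steal the victim's oldest node
            done[w] += 1
            queues[w].extend(children(node))
    return done
```

Comparing `sum(count_with_stealing(4))` with `count_sequential()` confirms that every node is processed exactly once, while the per-worker counts show how the imbalance is spread.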

Genome Research | 2014

Variation in chromatin accessibility in human kidney cancer links H3K36 methyltransferase loss with widespread RNA processing defects.

Jeremy M. Simon; Kathryn E. Hacker; Darshan Singh; A. Rose Brannon; Joel S. Parker; Matthew Weiser; Thai H. Ho; Pei Fen Kuan; Eric Jonasch; Terrence S. Furey; Jan F. Prins; Jason D. Lieb; W. Kimryn Rathmell; Ian J. Davis

Comprehensive sequencing of human cancers has identified recurrent mutations in genes encoding chromatin regulatory proteins. For clear cell renal cell carcinoma (ccRCC), three of the five commonly mutated genes encode the chromatin regulators PBRM1, SETD2, and BAP1. How these mutations alter the chromatin landscape and transcriptional program in ccRCC or other cancers is not understood. Here, we identified alterations in chromatin organization and transcript profiles associated with mutations in chromatin regulators in a large cohort of primary human kidney tumors. By associating variation in chromatin organization with mutations in SETD2, which encodes the enzyme responsible for H3K36 trimethylation, we found that changes in chromatin accessibility occurred primarily within actively transcribed genes. This increase in chromatin accessibility was linked with widespread alterations in RNA processing, including intron retention and aberrant splicing, affecting ∼25% of all expressed genes. Furthermore, decreased nucleosome occupancy proximal to misspliced exons was observed in tumors lacking H3K36me3. These results directly link mutations in SETD2 to chromatin accessibility changes and RNA processing defects in cancer. Detecting the functional consequences of specific mutations in chromatin regulatory proteins in primary human samples could ultimately inform the therapeutic application of an emerging class of chromatin-targeted compounds.

Nucleic Acids Research | 2013

DiffSplice: The genome-wide detection of differential splicing events with RNA-seq

Yin Hu; Yan Huang; Ying Du; Christian F. Orellana; Darshan Singh; Amy R. Johnson; Anaïs Monroy; Pei Fen Kuan; Scott M. Hammond; Liza Makowski; Scott H. Randell; Derek Y. Chiang; D. Neil Hayes; Corbin D. Jones; Yufeng Liu; Jan F. Prins; Jinze Liu

The RNA transcriptome varies in response to cellular differentiation as well as environmental factors, and can be characterized by the diversity and abundance of transcript isoforms. Differential transcription analysis, the detection of differences between the transcriptomes of different cells, may improve understanding of cell differentiation and development and enable the identification of biomarkers that classify disease types. The availability of high-throughput short-read RNA sequencing technologies provides in-depth sampling of the transcriptome, making it possible to accurately detect the differences between transcriptomes. In this article, we present a new method for the detection and visualization of differential transcription. Our approach does not depend on transcript or gene annotations. It also circumvents the need for full transcript inference and quantification, which is a challenging problem because of short read lengths, as well as various sampling biases. Instead, our method takes a divide-and-conquer approach to localize the difference between transcriptomes in the form of alternative splicing modules (ASMs), where transcript isoforms diverge. Our approach starts with the identification of ASMs from the splice graph, constructed directly from the exons and introns predicted from RNA-seq read alignments. The abundance of alternative splicing isoforms residing in each ASM is estimated for each sample and is compared across sample groups. A non-parametric statistical test is applied to each ASM to detect significant differential transcription with a controlled false discovery rate. The sensitivity and specificity of the method have been assessed using simulated data sets and compared with other state-of-the-art approaches. 
Experimental validation using qRT-PCR confirmed a selected set of genes that are differentially expressed in a lung differentiation study and a breast cancer data set, demonstrating the utility of the approach applied on experimental biological data sets. The software of DiffSplice is available at http://www.netlab.uky.edu/p/bioinfo/DiffSplice.
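
The abstract mentions testing each ASM while controlling the false discovery rate, but does not say which procedure is used. As one common possibility, the Python sketch below applies the Benjamini-Hochberg step-up procedure to a list of per-ASM p-values. The choice of this procedure, the function name `benjamini_hochberg`, and the sample p-values are all assumptions for illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of hypotheses rejected at FDR level `alpha`.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most alpha * k / m, then reject the k smallest.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values, one per alternative splicing module.
P_VALUES = [0.001, 0.03, 0.035, 0.5]
rejected = benjamini_hochberg(P_VALUES)
```

In this example the second p-value misses its own threshold but is still rejected, because the third one passes and the rule rejects everything ranked at or below the last passing rank.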

Research in Computational Molecular Biology | 2004

Mining protein family specific residue packing patterns from protein structure graphs

Jun Huan; Wei Wang; Deepak Bandyopadhyay; Jack Snoeyink; Jan F. Prins; Alexander Tropsha

Finding recurring residue packing patterns, or spatial motifs, that characterize protein structural families is an important problem in bioinformatics. We apply a novel frequent subgraph mining algorithm to three graph representations of protein three-dimensional (3D) structure. In each protein graph, a vertex represents an amino acid. Vertex-residues are connected by edges using three approaches: first, based on a simple distance threshold between contact residues; second, using the Delaunay tessellation from computational geometry; and third, using the recently developed almost-Delaunay tessellation approach. Applying a frequent subgraph mining algorithm to a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, we typically identify several hundred common subgraphs equivalent to common packing motifs found in the majority of proteins in the family. We also use the counts of motifs extracted from proteins in two different SCOP families as input variables in a binary classification experiment. The resulting models are capable of predicting the protein family association with accuracy exceeding 90 percent. Our results indicate that graphs based on both almost-Delaunay and Delaunay tessellations are sparser than the contact distance graphs; yet they are robust and efficient for mining protein spatial motifs.

Journal of Computational Biology | 2005

Comparing Graph Representations of Protein Structure for Mining Family-Specific Residue-Based Packing Motifs

Jun Huan; Deepak Bandyopadhyay; Wei Wang; Jack Snoeyink; Jan F. Prins; Alexander Tropsha

We find recurring amino-acid residue packing patterns, or spatial motifs, that are characteristic of protein structural families, by applying a novel frequent subgraph mining algorithm to graph representations of protein three-dimensional structure. Graph nodes represent amino acids, and edges are chosen in one of three ways: first, using a threshold for contact distance between residues; second, using Delaunay tessellation; and third, using the recently developed almost-Delaunay edges. For a set of graphs representing a protein family from the Structural Classification of Proteins (SCOP) database, subgraph mining typically identifies several hundred common subgraphs corresponding to spatial motifs that are frequently found in proteins in the family but rarely found outside of it. We find that some of the large motifs map onto known functional regions in two protein families explored in this study, i.e., serine proteases and kinases. We find that graphs based on almost-Delaunay edges significantly reduce the number of edges in the graph representation and hence present a computational advantage, yet the patterns extracted from such graphs have a biological interpretation approximately equivalent to that of those extracted from distance-based graphs.
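
The first of the three edge definitions, a contact-distance threshold, is simple enough to sketch directly. The Python example below assumes each residue is reduced to a single 3D point and a residue-type label. The coordinates, the 6.5 threshold, and the function name `contact_graph` are hypothetical and not taken from the paper; the Delaunay-based variants are not shown.

```python
from itertools import combinations
from math import dist


def contact_graph(residues, threshold):
    """Build a labeled graph with an edge between residues within `threshold`.

    `residues` is a list of (label, (x, y, z)) pairs; nodes are list indices.
    """
    nodes = {i: label for i, (label, _) in enumerate(residues)}
    edges = {
        (i, j)
        for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2)
        if dist(p, q) <= threshold
    }
    return nodes, edges


# Hypothetical residues with made-up coordinates.
RESIDUES = [
    ("SER", (0.0, 0.0, 0.0)),
    ("HIS", (3.0, 4.0, 0.0)),    # distance 5.0 from SER
    ("ASP", (3.0, 4.0, 6.0)),    # distance 6.0 from HIS, about 7.8 from SER
    ("GLY", (30.0, 0.0, 0.0)),   # far from every other residue
]

nodes, edges = contact_graph(RESIDUES, threshold=6.5)
```

Raising the threshold adds edges quickly, which is consistent with the abstract's observation that tessellation-based graphs are sparser than distance-based ones.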

IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

OpenMP task scheduling strategies for multicore NUMA systems

Stephen L. Olivier; Allan Porterfield; Kyle Wheeler; Michael Spiegel; Jan F. Prins

The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In order to evaluate scheduling strategies, we extended the open source Qthreads threading library to implement different scheduler designs, accepting OpenMP programs through the ROSE compiler. Our comprehensive performance study of diverse OpenMP task-parallel benchmarks compares seven different task-parallel run-time scheduler implementations on an Intel Nehalem multi-socket multicore system: our proposed hierarchical work-stealing scheduler, a per-core work-stealing scheduler, a centralized scheduler, and LIFO and FIFO versions of the Qthreads round-robin scheduler. In addition, we compare our results against the Intel and GNU OpenMP implementations. Our hierarchical scheduling strategy leverages different scheduling methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, the scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks as well as between a parent task and its newly created child tasks. In the performance evaluation, our Qthreads hierarchical scheduler is competitive on all benchmarks tested. On five of the seven benchmarks, it demonstrates speedup and absolute performance superior to both the Intel and GNU OpenMP run-time systems. 
Our run-time also demonstrates similar performance benefits on AMD Magny Cours and SGI Altix systems, enabling several benchmarks to successfully scale to 192 CPUs of an SGI Altix.
