Is this you? Create Your Porfile

Agnieszka Danek

Silesian University of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Agnieszka Danek is active.

Explore More

Publication

Featured researches published by Agnieszka Danek.

Bioinformatics | 2013

Genome compression: a novel approach for large collections

Sebastian Deorowicz; Agnieszka Danek; Szymon Grabowski

MOTIVATION Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms. RESULTS We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ∼400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions. AVAILABILITY http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license. Supplementary data: Supplementary data are available at Bioinformatics online.

PLOS ONE | 2014

Indexes of Large Genome Collections on a PC

Agnieszka Danek; Sebastian Deorowicz; Szymon Grabowski

The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries (of average length 150 bp) are handled in average time of 39 µs and with up to 3 mismatches in 373 µs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.

Scientific Reports | 2015

GDC 2: Compression of large collections of genomes

Sebastian Deorowicz; Agnieszka Danek; Marcin Niemiec

The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid in the personalized medicine. One of the significant side effects of this change is a necessity of storage and transfer of huge amounts of genomic data. In this paper we deal with the problem of compression of large collections of complete genomic sequences. We propose an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what is offered by the other existing compressors. Moreover, our algorithm is very fast as it processes the data with speed 200 MB/s on a modern workstation. In a consequence the proposed algorithm allows storing the complete genomic collections at low cost, e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, what can be compared to about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.

BMC Plant Biology | 2014

Evaluation and integration of functional annotation pipelines for newly sequenced organisms: the potato genome as a test case

David Amar; Itziar Frades; Agnieszka Danek; Tatyana Goldberg; Sanjeev Kumar Sharma; Peter E. Hedley; Estelle Proux-Wéra; Erik Andreasson; Ron Shamir; Oren Tzfadia; Erik Alexandersson

BackgroundFor most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, ‘omics’, and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available.ResultsWe show that the automatic gene annotations of potato have low accuracy when compared to a “gold standard” based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard.ConclusionsWe find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of ‘-omics’ datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.

PLOS Genetics | 2015

Inter-population Differences in Retrogene Loss and Expression in Humans.

Magdalena Regina Kubiak; Agnieszka Danek; Wojciech Rosikiewicz; Sebastian Deorowicz; Andrzej Polanski; Izabela Makalowska

Gene retroposition leads to considerable genetic variation between individuals. Recent studies revealed the presence of at least 208 retroduplication variations (RDVs), a class of polymorphisms, in which a retrocopy is present or absent from individual genomes. Most of these RDVs resulted from recent retroduplications. In this study, we used the results of Phase 1 from the 1000 Genomes Project to investigate the variation in loss of ancestral (i.e. shared with other primates) retrocopies among different human populations. In addition, we examined retrocopy expression levels using RNA-Seq data derived from the Ilumina BodyMap project, as well as data from lymphoblastoid cell lines provided by the Geuvadis Consortium. We also developed a new approach to detect novel retrocopies absent from the reference human genome. We experimentally confirmed the existence of the detected retrocopies and determined their presence or absence in the human genomes of 17 different populations. Altogether, we were able to detect 193 RDVs; the majority resulted from retrocopy deletion. Most of these RDVs had not been previously reported. We experimentally confirmed the expression of 11 ancestral retrogenes that underwent deletion in certain individuals. The frequency of their deletion, with the exception of one retrogene, is very low. The expression, conservation and low rate of deletion of the remaining 10 retrocopies may suggest some functionality. Aside from the presence or absence of expressed retrocopies, we also searched for differences in retrocopy expression levels between populations, finding 9 retrogenes that undergo statistically significant differential expression.

International Journal of Foundations of Computer Science | 2013

BIT-PARALLEL ALGORITHMS FOR THE MERGED LONGEST COMMON SUBSEQUENCE PROBLEM

Sebastian Deorowicz; Agnieszka Danek

It is often a necessity to compare some sequences to find out how similar they are. There are many similarity measures that can be used, e.g., longest common subsequence, edit distance, sequence alignment. Recently a merged longest common subsequence (MergedLCS) problem was formulated with applications in bioinformatics. We propose the bit-parallel algorithms for the MergedLCS problem and evaluate them in practice showing that they are usually tens times faster than the already published methods.

ICMMI | 2014

Bit-Parallel Algorithm for the Block Variant of the Merged Longest Common Subsequence Problem

Agnieszka Danek; Sebastian Deorowicz

The problem of comparison of genomic sequences is of great importance. There are various measures of similarity of sequences. One of the most popular is the length of the longest common subsequence (LCS). We propose the first bit-parallel algorithm for the variant of the LCS problem, block merged LCS, which was recently formulated in the studies on the whole genome duplication hypothesis. Practical experiments show that our proposal is from 10 to over 100 times faster than existing algorithms.

Bioinformatics | 2018

Kmer-db: instant evolutionary distance estimation

Sebastian Deorowicz; Adam Gudyś; Maciej Długosz; Marek Kokot; Agnieszka Danek

Summary: Kmer‐db is a new tool for estimating evolutionary relationship on the basis of k‐mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40 715 pathogens in <7 min (on a modern workstation), 26 times faster than Mash, its main competitor. Availability and implementation: https://github.com/refresh‐bio/kmer‐db and http://sun.aei.polsl.pl/REFRESH/kmer‐db. Supplementary information: Supplementary data are available at Bioinformatics online.

bioRxiv | 2017

GTC: An Attempt To Maintenance Of Huge Genome Collections Compressed

Agnieszka Danek; Sebastian Deorowicz

Motivation Results We present GTC, a novel compressed data structure for representation of huge collections of genetic variation data. GTC significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest of publicly available database of about 60 thousand haplotypes at about 40 million SNPs can be stored in less than 4 Gbytes, while the queries related to variants are answered in a fraction of a second. Availability GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/GTC. Contact [email protected]

World Academy of Science, Engineering and Technology, International Journal of Biological, Biomolecular, Agricultural, Food and Biotechnological Engineering | 2012