Dimitar Vassilev
Sofia University
Publication
Featured research published by Dimitar Vassilev.
Biology Direct | 2015
Ola Spjuth; Erik Bongcam-Rudloff; Guillermo Carrasco Hernández; Lukas Forer; Mario Giovacchini; Roman Valls Guimera; Aleksi Kallio; Eija Korpelainen; Maciej M. Kańduła; Milko Krachunov; David P. Kreil; Ognyan Kulev; Paweł P. Łabaj; Samuel Lampa; Luca Pireddu; Sebastian Schönherr; Alexey Siretskiy; Dimitar Vassilev
Abstract High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks at large scale. Workflow systems can be useful to simplify the construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present our experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The participating organizations are working on similar problems, but have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences, we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution. Reviewers: This article was reviewed by Dr Andrew Clark.
Journal of Computational Science | 2014
Milko Krachunov; Dimitar Vassilev
Abstract Metagenomics is a rapidly growing field, which has been greatly driven by the ongoing advancements in high-throughput sequencing technologies. As a result, both the data preparation and the subsequent in silico experiments pose unsolved technical and theoretical challenges, as there are no well-established approaches, and new expertise and software are constantly emerging. Our project's main focus is the creation and evaluation of a novel error detection and correction approach to be used inside a metagenomic processing workflow. The approach, together with an indirect validation technique and the empirical results obtained so far, is described in detail in this paper. To aid development and testing, we are also building a workflow execution system to run our experiments; it is designed to be extensible beyond the scope of error detection and will be released as a free/open-source software package.
Biotechnology & Biotechnological Equipment | 2012
Peter Petrov; Milko Krachunov; Elena Todorovska; Dimitar Vassilev
ABSTRACT Recent years have seen a vast amount of data generated by various biological and biomedical experiments. The storage, management and analysis of these data are done by means of modern bioinformatics applications and tools. One of the bioinformatics instruments used for solving these tasks is ontologies and the apparatus they provide. An ontology, as a modeling tool, is a specification of a conceptualization, meaning that an ontology is a formal description of the concepts and relationships that can exist for a given software system or software agent (8, 10). Anatomical (phenotypic) ontologies of various species nowadays typically contain from a few thousand to a few tens of thousands of terms and relations (a very small number compared to the count of objects and the amount of data produced by biological experiments at the molecular level, for example), but the semantics employed in them is usually enormous in scale. The major problem when using such ontologies is that they lack intelligent tools for cross-species literature searches (text mining), as well as tools aiding the design of new biological and biomedical experiments with other (not yet tested) species/organisms based on available information about experiments already performed on certain model species/organisms. This is where the process of merging anatomical ontologies comes into use. Using specific models and algorithms for merging such ontologies is a matter of choice. In this work a novel approach for solving this task, based on two directed acyclic graph (DAG) models and three original algorithmic procedures, is presented. Based on them, an intelligent software system for merging two (and possibly more) input/source anatomical ontologies into one output/target super-ontology was designed and implemented. This system was named AnatOM (an abbreviation of “Anatomical Ontologies Merger”). In this work a short overview of ontologies is provided, describing what ontologies are and why they are widely used as a tool in bioinformatics. The problem of merging anatomical ontologies of two or more different organisms is introduced, and some effort has been put into explaining why it is important. A general outline is presented of the models and the method that have been developed for solving the ontology merging problem. A high-level overview of the AnatOM program implemented by the authors as part of this work is also provided. To achieve the degree of intelligence that is needed, the AnatOM program utilizes the large amount of high-quality data (knowledge) available in several widely popular and generally recognized knowledge bases such as UMLS, FMA and WordNet. The last of these is a general-purpose, i.e. non-specialized, knowledge source; the first two are biological/biomedical ones. Their choice was based on the fact that they provide a very good foundation for building an intelligent system that performs certain comparative anatomy tasks, including mapping and merging of anatomical ontologies (23).
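As a rough illustration of the merging idea described in this abstract, the sketch below collapses equivalent terms from two toy DAG ontologies into a single super-ontology. The exact-match/synonym lookup is only a stand-in for AnatOM's scoring against UMLS, FMA and WordNet; all names and data structures in the snippet are hypothetical, not taken from the paper.

```python
# Toy sketch of the merging idea only (assumption: a simple synonym table
# stands in for AnatOM's knowledge-base-driven term matching).
# Each ontology is a DAG given as {term: set(parent terms)}.

def merge_ontologies(ont_a: dict[str, set[str]],
                     ont_b: dict[str, set[str]],
                     synonyms: dict[str, str]) -> dict[str, set[str]]:
    """Build a super-ontology: terms judged equivalent (same label or listed
    as synonyms) collapse into one node that inherits parents from both."""
    def canonical(term: str) -> str:
        return synonyms.get(term, term)

    merged: dict[str, set[str]] = {}
    for ontology in (ont_a, ont_b):
        for term, parents in ontology.items():
            node = canonical(term)
            merged.setdefault(node, set())
            merged[node] |= {canonical(p) for p in parents}
    return merged

# Example: "heart" in ontology A and "cor" in ontology B map to one node.
a = {"heart": {"organ"}, "organ": set()}
b = {"cor": {"thoracic organ"}, "thoracic organ": set()}
print(merge_ontologies(a, b, synonyms={"cor": "heart"}))
```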
International Conference on Conceptual Structures | 2017
Milko Krachunov; Maria Nisheva; Dimitar Vassilev
Abstract In high-variation genomics datasets, such as those found in metagenomics or complex polyploid genome analysis, error detection and variant calling are impeded by the difficulty of discerning sequencing errors from actual biological variation. Confirming base candidates by a high frequency of occurrence is no longer a reliable measure, because of the natural variation and the presence of rare bases. This work employs machine learning models to classify bases into erroneous and rare variations, after preselecting potential error candidates with a weighted frequency measure, which aims to focus on unexpected variations by using inter-sequence pairwise similarity. Different similarity measures are used to account for different types of datasets. Four machine learning models are tested.
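The snippet below is a minimal sketch of the preselection step as described in this abstract: each read's vote for a base is weighted by its mean pairwise similarity to the other reads, and bases with a low weighted frequency become error candidates for the downstream classifier. The identity-based similarity, the threshold and all function names are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the authors' code): similarity-weighted base
# frequencies for preselecting error candidates in an alignment column.
from collections import defaultdict

def pairwise_similarity(a: str, b: str) -> float:
    """Fraction of matching positions between two aligned reads (assumption:
    plain identity; the paper tests several similarity measures)."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != '-')
    return matches / max(len(a), 1)

def weighted_frequencies(reads: list[str], column: int) -> dict[str, float]:
    """Weight each read's vote for a base by its mean similarity to the
    other reads, so unexpected variation stands out."""
    weights = {}
    for i, r in enumerate(reads):
        others = [pairwise_similarity(r, q) for j, q in enumerate(reads) if j != i]
        weights[i] = sum(others) / max(len(others), 1)
    freq: dict[str, float] = defaultdict(float)
    total = 0.0
    for i, r in enumerate(reads):
        base = r[column]
        if base != '-':
            freq[base] += weights[i]
            total += weights[i]
    return {b: f / total for b, f in freq.items()} if total else {}

# Bases whose weighted frequency falls below a threshold become candidates
# for the downstream classifier (erroneous vs. rare-but-real variant).
def error_candidates(reads: list[str], column: int, threshold: float = 0.05):
    return [b for b, f in weighted_frequencies(reads, column).items() if f < threshold]
```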
Archaea | 2014
Galina Radeva; Anelia Kenarova; Velina Bachvarova; Katrin Flemming; Ivan Popov; Dimitar Vassilev; Sonja Selenska-Pobell
Uranium mining and milling activities adversely affect the microbial populations of impacted sites. The negative effects of uranium on soil bacteria and fungi are well studied, but little is known about the effects of radionuclides and heavy metals on archaea. The composition and diversity of archaeal communities inhabiting the waste pile of the Sliven uranium mine and the soil of the Buhovo uranium mine were investigated using 16S rRNA gene retrieval. A total of 355 archaeal clones were selected, and their 16S rDNA inserts were analysed by restriction fragment length polymorphism (RFLP), discriminating 14 different RFLP types. All evaluated archaeal 16S rRNA gene sequences belong to the 1.1b/Nitrososphaera cluster of Crenarchaeota. The composition of the archaeal community is distinct for each site of interest and dependent on environmental characteristics, including pollution levels. Since members of the 1.1b/Nitrososphaera cluster have been implicated in the nitrogen cycle, the archaeal communities from these sites were probed for the presence of the ammonia monooxygenase gene (amoA). Our data indicate that amoA gene sequences are distributed in a manner similar to that of Crenarchaeota, suggesting that archaeal nitrification processes in uranium mining-impacted locations are under the control of the same key factors controlling archaeal diversity.
Biotechnology & Biotechnological Equipment | 2009
I. Popov; A. Nenov; Peter Petrov; Dimitar Vassilev
ABSTRACT It is often said that bioinformatics is a knowledge-based discipline. This means that many of the search and prediction methods that have been used to greatest effect in bioinformatics exploit information that has already been accumulated about the problem of interest, rather than working from first principles. Most of the methods and algorithms discussed in this paper adopt these knowledge-based approaches for protein studies. Typically we have some given examples, i.e. data of a given class or function, and we try to identify patterns in those data that characterize these sequences or structures and distinguish them from others that are not in this class. The purpose of this paper is to describe the basic conceptual methods, and the associated algorithms and applications, that are used to obtain better and more reliable information about the studied characteristic patterns.
Euphytica | 2017
Miroslav Zorić; Sreten Terzić; Vladimir Sikora; Milka Brdar-Jokanovic; Dimitar Vassilev
Tuber yield, tuber number and tuber size are basic agronomic and breeding traits in Jerusalem artichoke and can be significantly affected by environmental factors. We report the results of a long-term trial on the performance of 20 Jerusalem artichoke cultivars. A random model for means with the restricted maximum likelihood (REML) procedure was used to estimate the overall effects of genotype, environment and genotype by environment (GE) interaction on the traits. A partial least squares regression (PLSR) model was used to model the GE interaction variance components with a set of available correlated environmental variables. The REML variance component estimates revealed that tuber number and yield are more dependent on GE interaction, which allowed identification of the best genotypes for specific environments. The PLSR model revealed that the most important climatic variables for optimal emergence, canopy development, high tuber number and high yield are adequate soil and air temperatures in April. For larger tuber mass, precipitation variables and an even distribution of rainfall were the most important factors, together with soil and air temperature in June, when tuber growth is initiated. The knowledge obtained in this study is valuable for identifying and understanding the key environmental factors that contribute to the performance of Jerusalem artichoke.
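For readers unfamiliar with the PLSR step mentioned in this abstract, the sketch below shows the general pattern of regressing genotype by environment interaction effects on environmental covariates with scikit-learn. The matrix shapes, variable names and synthetic data are assumptions for illustration only and do not reflect the trial data or the exact model used in the paper.

```python
# Rough sketch of a PLSR step (assumption: synthetic data; rows are
# environments, X columns are climatic covariates such as April soil
# temperature, Y columns are genotype-specific GE interaction effects).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_env, n_covariates, n_genotypes = 12, 6, 20
X = rng.normal(size=(n_env, n_covariates))   # environmental variables
Y = rng.normal(size=(n_env, n_genotypes))    # GE interaction effects

pls = PLSRegression(n_components=2, scale=True)
pls.fit(X, Y)

# Loadings on the X side indicate which environmental variables drive the
# interaction pattern captured by each latent component.
print(pls.x_loadings_.shape)  # (n_covariates, n_components)
```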
International Conference on Conceptual Structures | 2015
Milko Krachunov; Dimitar Vassilev; Maria Nisheva; Ognyan Kulev; Valeriya Simeonova; Vladimir Dimitrov
NGS data processing in metagenomics studies has to deal with noisy data that can contain a large number of reading errors, which are difficult to detect and account for. This work introduces a fuzzy indicator of reliability technique to facilitate solutions to this problem. It includes modified Hamming and Levenshtein distance functions that are intended to be used as drop-in replacements in NGS analysis procedures that rely on distances, such as phylogenetic tree construction. The distances utilise fuzzy sets of reliable bases, or an equivalent fuzzy logic, potentially aggregating multiple sources of base reliability.
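A minimal sketch of the kind of distance described in this abstract is given below: a Hamming distance in which each position carries a reliability score in [0, 1], so mismatches at unreliable bases are discounted. The min-based aggregation and all names are assumptions for illustration; the paper's exact fuzzy formulation may differ.

```python
# Minimal sketch of a reliability-weighted Hamming distance (assumption:
# mismatch weight = min of the two base reliabilities).

def fuzzy_hamming(seq_a: str, seq_b: str,
                  rel_a: list[float], rel_b: list[float]) -> float:
    """Reliability-weighted Hamming distance between equal-length sequences.

    A mismatch contributes min(reliability_a, reliability_b), so a
    disagreement involving an unreliable base is partially discounted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("fuzzy_hamming expects equal-length sequences")
    return sum(min(ra, rb)
               for a, b, ra, rb in zip(seq_a, seq_b, rel_a, rel_b)
               if a != b)

# Usage: a distance like this can replace the plain Hamming distance in a
# neighbour-joining or UPGMA tree construction step.
d = fuzzy_hamming("ACGT", "ACTT", [0.9, 0.9, 0.2, 0.9], [0.9, 0.9, 0.8, 0.9])
print(d)  # 0.2 -- the single mismatch is discounted by the low reliability
```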
Journal of Integrative Bioinformatics | 2013
Peter Petrov; Milko Krachunov; Dimitar Vassilev
This paper presents a study in the domain of semi-automated and fully-automated ontology mapping. A process for inferring additional cross-ontology links within the domain of anatomical ontologies is presented and evaluated on pairs from three model organisms. The results of experiments performed with various external knowledge sources and scoring schemes are discussed.
Biotechnology & Biotechnological Equipment | 2012
Valeriya Simeonova; Ivan Popov; Dimitar Vassilev
ABSTRACT The quality of next-generation sequencing data is a major problem in today's bioinformatics. The validation of sequences, either by re-sequencing or by purely statistical error evaluation, is the tool needed to ensure the correctness of all subsequent research done with the data. Estimating the error rates in genome databases gives an idea of the level of inherited errors in genome sequences. This is important because these kinds of errors have a cumulative effect on every subsequent step of sequence analysis. Here we present a way to define the error level in a genome using two different databases: the National Center for Biotechnology Information (NCBI) as a verified source and Resources for Plant Comparative Genomics (PlantGDB) as a reference. Based on the most conserved regions in every genome, the donor/acceptor splice sites (the canonical forms are the dinucleotides GT or GC and AG), we applied statistical methods to derive the NCBI error level for the Oryza sativa (japonica cultivar) genome.
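The sketch below illustrates the underlying idea only: because donor/acceptor splice sites are highly conserved, the fraction of annotated intron boundaries that deviate from the canonical dinucleotides can serve as a rough proxy for the error level in a sequence database. The coordinate convention, function names and the simple deviation ratio are assumptions; they do not reproduce the statistical methods actually used in the paper.

```python
# Hypothetical sketch: estimate an error level from non-canonical splice
# sites (canonical donors GT/GC, canonical acceptor AG).

CANONICAL_DONORS = {"GT", "GC"}
CANONICAL_ACCEPTOR = "AG"

def splice_site_error_rate(genome: str, introns: list[tuple[int, int]]) -> float:
    """Fraction of intron boundaries with non-canonical dinucleotides.

    `introns` holds 0-based (start, end) coordinates of each intron on the
    forward strand; the donor is genome[start:start+2] and the acceptor is
    genome[end-2:end].
    """
    deviations = 0
    for start, end in introns:
        donor = genome[start:start + 2].upper()
        acceptor = genome[end - 2:end].upper()
        if donor not in CANONICAL_DONORS:
            deviations += 1
        if acceptor != CANONICAL_ACCEPTOR:
            deviations += 1
    return deviations / (2 * len(introns)) if introns else 0.0
```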