Vladimir Filkov | Researchain

Publication

Featured research published by Vladimir Filkov.

Plant Physiology | 2011

Discovery of Rare Mutations in Populations: TILLING by Sequencing

Helen Tsai; Tyson Howell; Rebecca Nitcher; Victor Missirian; Brian Watson; Kathie J. Ngo; Meric Lieberman; Joseph Fass; Cristobal Uauy; Robert K. Tran; Asif Ali Khan; Vladimir Filkov; Thomas H. Tai; Jorge Dubcovsky; Luca Comai

Discovery of rare mutations in populations requires methods, such as TILLING (for Targeting Induced Local Lesions in Genomes), for processing and analyzing many individuals in parallel. Previous TILLING protocols employed enzymatic or physical discrimination of heteroduplexed from homoduplexed target DNA. Using mutant populations of rice (Oryza sativa) and wheat (Triticum durum), we developed a method based on Illumina sequencing of target genes amplified from multidimensionally pooled templates representing 768 individuals per experiment. Parallel processing of sequencing libraries was aided by unique tracer sequences and barcodes allowing flexibility in the number and pooling arrangement of targeted genes, species, and pooling scheme. Sequencing reads were processed and aligned to the reference to identify possible single-nucleotide changes, which were then evaluated for frequency, sequencing quality, intersection pattern in pools, and statistical relevance to produce a Bayesian score with an associated confidence threshold. Discovery was robust both in rice and wheat using either bidimensional or tridimensional pooling schemes. The method compared favorably with other molecular and computational approaches, providing high sensitivity and specificity.

research in computational molecular biology | 2001

Analysis techniques for microarray time-series data

Vladimir Filkov; Steven Skiena; Jizu Zhi

We introduce new methods for the analysis of short-term time-series data, and apply them to gene expression data in yeast. These include (1) methods for automated period detection in a predominately cycling data set and (2) phase detection between phase-shifted cyclic data sets. We show how to properly correct for the problem of comparing correlation coefficients between pairs of sequences of different lengths and small alphabets. In particular, we show that the correlation coefficient of sequences over alphabets of size two can exhibit very counter-intuitive behavior when compared with the Hamming distance. Finally, we address the predictability of known regulators via time-series analysis, and show that less than 20% of known regulatory pairs exhibit strong correlations in the Cho/Spellman data sets. By analyzing known regulatory relationships, we designed an edge detection function which identified candidate regulations with greater fidelity than standard correlation methods.
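
The remark about correlation and Hamming distance on two-letter alphabets can be illustrated with a small sketch. This is not the paper's analysis. The sequences and the helper names `hamming` and `pearson` are assumptions chosen for illustration. The two binary sequences below agree in 6 of 8 positions, yet their Pearson correlation is negative.

```python
from math import sqrt

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical binary sequences (alphabet of size two), mostly identical.
x = [1, 1, 1, 1, 1, 1, 1, 0]
y = [1, 1, 1, 1, 1, 1, 0, 1]

distance = hamming(x, y)      # the sequences differ in 2 of 8 positions
correlation = pearson(x, y)   # negative, despite 75% agreement
```

Because both sequences are almost constant, the two mismatched positions carry nearly all of the variance, which is why the correlation turns negative even though the Hamming distance is small.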

foundations of software engineering | 2014

A large scale study of programming languages and code quality in github

Baishakhi Ray; Daryl Posnett; Vladimir Filkov; Premkumar T. Devanbu

What is the effect of programming languages on software quality? This question has been a topic of much debate for a very long time. In this study, we gather a very large data set from GitHub (729 projects, 80 million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static vs. dynamic typing and strong vs. weak typing on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. Most notably, it does appear that strong typing is modestly better than weak typing, and among functional languages, static typing is also somewhat better than dynamic typing. We also find that functional languages are somewhat better than procedural languages. It is worth noting that these modest effects arising from language design are overwhelmingly dominated by process factors such as project size, team size, and commit size. However, we hasten to caution the reader that even these modest effects might quite possibly be due to other, intangible process factors, e.g., the preference of certain personality types for functional, static and strongly typed languages.

research in computational molecular biology | 1999

Identifying gene regulatory networks from experimental data

Ting Chen; Vladimir Filkov; Steven Skiena

As biology enters an era where the genomes of several organisms have been completely sequenced, the next great challenge is determining gene regulatory networks. Every gene has one or more activators, biochemical signals which are necessary to start transcription of the gene. Without the presence of the activator, only low level expression of the given gene can occur. Genes also have inhibitors, biochemical signals which prevent the expression of a particular gene even in the presence of an appropriate activator. Only a small number of genes function as activators or inhibitors, but identifying them is an important and difficult problem. A gene regulatory network defines the complicated structure of gene products which activate/inhibit other gene products. Identifying gene regulatory networks from experimental data is now an area of extremely active research. New experimental technologies in molecular biology (particularly oligonucleotide arrays [11] and microarrays) now make it possible to quickly obtain vast amounts of data on gene expression in a particular organism under particular conditions. For example, Cho et al. [3] recently published a 17-point time series data set measuring the expression level of each of 6601 different genes for the yeast Saccharomyces cerevisiae, obtained using an Affymetrix hybridization array. Wen et al. [16] have generated 9-point time series for the expression levels, using RT-PCR, of each of 112 genes involved in rat nervous system development. Associating functions to genes based on this huge amount of data is an important and challenging problem.

International Journal on Artificial Intelligence Tools | 2004

INTEGRATING MICROARRAY DATA BY CONSENSUS CLUSTERING

Vladimir Filkov; Steven Skiena

With the exploding volume of microarray experiments comes increasing interest in mining repositories of such data. Meaningfully combining results from varied experiments on an equal basis is a challenging task. Here we propose a general method for integrating heterogeneous data sets based on the consensus clustering formalism. Our method analyzes source-specific clusterings and identifies a consensus set-partition which is as close as possible to all of them. We develop a general criterion to assess the potential benefit of integrating multiple heterogeneous data sets, i.e. whether the integrated data is more informative than the individual data sets. We apply our methods on two popular sets of microarray data yielding gene classifications of potentially greater interest than could be derived from the analysis of each individual data set.
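
The idea of a consensus set-partition that is "as close as possible" to all source clusterings can be sketched as follows. The abstract does not say which distance or search procedure is used, so this sketch makes two assumptions: it uses a pair-counting (symmetric difference) distance, and it applies a simple "pick the best input" heuristic instead of the authors' algorithm. The function names and the toy data are hypothetical.

```python
from itertools import combinations

def partition_distance(p, q):
    """Count item pairs that are co-clustered in exactly one of the two partitions.

    Each partition is a list of cluster labels, one label per item.
    """
    if len(p) != len(q):
        raise ValueError("partitions must cover the same items")
    return sum(
        (p[i] == p[j]) != (q[i] == q[j])
        for i, j in combinations(range(len(p)), 2)
    )

def best_of_inputs(partitions):
    """Return the input partition with the smallest total distance to all inputs.

    This is only a heuristic stand-in for a true consensus partition, which
    need not be one of the inputs.
    """
    return min(
        partitions,
        key=lambda c: sum(partition_distance(c, p) for p in partitions),
    )

# Hypothetical clusterings of four genes from three different experiments.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
]
consensus = best_of_inputs(clusterings)
```

Here the second clustering has the lowest total distance to the others (3, against 4 and 5), so it is returned as the consensus candidate.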

conference on computer supported cooperative work | 2014

How social Q&A sites are changing knowledge sharing in open source software communities

Bogdan Vasilescu; Alexander Serebrenik; Premkumar T. Devanbu; Vladimir Filkov

Historically, mailing lists have been the preferred means for coordinating development and user support activities. With the emergence and popularity growth of social Q&A sites such as the StackExchange network (e.g., StackOverflow), this is beginning to change. Such sites offer different socio-technical incentives to their participants than mailing lists do, e.g., rich web environments to store and manage content collaboratively, or a place to showcase their knowledge and expertise more vividly to peers or potential recruiters. A key difference between StackExchange and mailing lists is gamification, i.e., StackExchange participants compete to obtain reputation points and badges. In this paper, we use a case study of R (a widely-used tool for data analysis) to investigate how mailing list participation has evolved since the launch of StackExchange. Our main contribution is the assembly of a joint data set from the two sources, in which participants in both the r-help mailing list and StackExchange are identifiable. This permits their activities to be linked across the two resources and also over time. With this data set we found that user support activities show a strong shift away from r-help. In particular, mailing list experts are migrating to StackExchange, where their behaviour is different. First, participants active both on r-help and on StackExchange are more active than those who focus exclusively on only one of the two. Second, they provide faster answers on StackExchange than on r-help, suggesting they are motivated by the gamified environment. To our knowledge, our study is the first to directly chart the changes in behaviour of specific contributors as they migrate into gamified environments, and has important implications for knowledge management in software engineering.

Journal of Computational Biology | 2002

Analysis techniques for microarray time-series data

Vladimir Filkov; Steven Skiena; Jizu Zhi

We address possible limitations of publicly available data sets of yeast gene expression. We study the predictability of known regulators via time-series analysis, and show that less than 20% of known regulatory pairs exhibit strong correlations in the Cho/Spellman data sets. By analyzing known regulatory relationships, we designed an edge detection function which identified candidate regulations with greater fidelity than standard correlation methods. We develop general methods for integrated analysis of coarse time-series data sets. These include 1) methods for automated period detection in a predominately cycling data set and 2) phase detection between phase-shifted cyclic data sets. We show how to properly correct for the problem of comparing correlation coefficients between pairs of sequences of different lengths and small alphabets. Finally, we note that the correlation coefficient of sequences over alphabets of size two can exhibit very counterintuitive behavior when compared with the Hamming distance.

human factors in computing systems | 2015

Gender and Tenure Diversity in GitHub Teams

Bogdan Vasilescu; Daryl Posnett; Baishakhi Ray; Mark van den Brand; Alexander Serebrenik; Premkumar T. Devanbu; Vladimir Filkov

Software development is usually a collaborative venture. Open Source Software (OSS) projects are no exception; indeed, by design, the OSS approach can accommodate teams that are more open, geographically distributed, and dynamic than commercial teams. This, we find, leads to OSS teams that are quite diverse. Team diversity, predominantly in offline groups, is known to correlate with team output, mostly with positive effects. How about in OSS? Using GitHub, the largest publicly available collection of OSS projects, we studied how gender and tenure diversity relate to team productivity and turnover. Using regression modeling of GitHub data and the results of a survey, we show that both gender and tenure diversity are positive and significant predictors of productivity, together explaining a sizable fraction of the data variability. These results can inform decision making on all levels, leading to better outcomes in recruiting and performance.

foundations of software engineering | 2015

Quality and productivity outcomes relating to continuous integration in GitHub

Bogdan Vasilescu; Yue Yu; Huaimin Wang; Premkumar T. Devanbu; Vladimir Filkov

Software processes comprise many steps; coding is followed by building, integration testing, system testing, deployment, operations, among others. Software process integration and automation have been areas of key concern in software engineering, ever since the pioneering work of Osterweil; market pressures for Agility, and open, decentralized, software development have provided additional pressures for progress in this area. But do these innovations actually help projects? Given the numerous confounding factors that can influence project performance, it can be a challenge to discern the effects of process integration and automation. Software project ecosystems such as GitHub provide a new opportunity in this regard: one can readily find large numbers of projects in various stages of process integration and automation, and gather data on various influencing factors as well as productivity and quality outcomes. In this paper we use large, historical data on process metrics and outcomes in GitHub projects to discern the effects of one specific innovation in process automation: continuous integration. Our main finding is that continuous integration improves the productivity of project teams, who can integrate more outside contributions, without an observable diminishment in code quality.

international conference on tools with artificial intelligence | 2003

Integrating microarray data by consensus clustering

Vladimir Filkov; Steven Skiena

With the exploding volume of microarray experiments comes increasing interest in mining repositories of such data. Meaningfully combining results from varied experiments on an equal basis is a challenging task. In this paper we propose a general method for integrating heterogeneous data sets based on the consensus clustering formalism. Our method analyzes source-specific clusterings and identifies a consensus set-partition which is as close as possible to all of them. We develop a general criterion to assess the potential benefit of integrating multiple heterogeneous data sets, i.e. whether the integrated data is more informative than the individual data sets. We apply our methods on two popular sets of microarray data yielding gene classifications of potentially greater interest than could be derived from the analysis of each individual data set.
