Valentina Boeva
PSL Research University
Publication
Featured research published by Valentina Boeva.
Bioinformatics | 2012
Valentina Boeva; Tatiana Popova; Kevin Bleakley; Pierre Chiche; Julie Cappo; Gudrun Schleiermacher; Isabelle Janoueix-Lerosey; Olivier Delattre; Emmanuel Barillot
Summary: More and more cancer studies use next-generation sequencing (NGS) data to detect various types of genomic variation. However, even when researchers have such data at hand, single-nucleotide polymorphism arrays have been considered necessary to assess copy number alterations and especially loss of heterozygosity (LOH). Here, we present the tool Control-FREEC that enables automatic calculation of copy number and allelic content profiles from NGS data, and consequently predicts regions of genomic alteration such as gains, losses and LOH. Taking as input aligned reads, Control-FREEC constructs copy number and B-allele frequency profiles. The profiles are then normalized, segmented and analyzed in order to assign genotype status (copy number and allelic content) to each genomic region. When a matched normal sample is provided, Control-FREEC discriminates somatic from germline events. Control-FREEC is able to analyze overdiploid tumor samples and samples contaminated by normal cells. Low mappability regions can be excluded from the analysis using provided mappability tracks. Availability: C++ source code is available at: http://bioinfo.curie.fr/projects/freec/ Contact: freec@curie.fr Supplementary information: Supplementary data are available at Bioinformatics online.
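The two profiles named in this summary can be illustrated with a minimal sketch. This is not Control-FREEC's code: the function names, the use of a matched normal sample for scaling, and the toy counts are assumptions made only for illustration.

```python
def copy_number_ratio(tumor_counts, normal_counts):
    """Per-window tumor/normal read-count ratio after matching library sizes.

    Hypothetical helper: windows with no reads in the normal sample yield None.
    """
    scale = sum(normal_counts) / sum(tumor_counts)
    return [t * scale / n if n > 0 else None
            for t, n in zip(tumor_counts, normal_counts)]


def b_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a SNP position that support the alternative (B) allele."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


# Toy data (assumed): the third window carries extra copies in the tumor.
ratios = copy_number_ratio([100, 100, 200, 100], [100, 100, 100, 100])
baf_het = b_allele_frequency(10, 10)   # balanced heterozygous site
baf_loh = b_allele_frequency(20, 0)    # one allele lost
```

A ratio clearly above the others suggests a gain, and a B-allele frequency far from 0.5 at a germline heterozygous site is the kind of signal associated with LOH.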
Bioinformatics | 2011
Valentina Boeva; Andrei Zinovyev; Kevin Bleakley; Jean-Philippe Vert; Isabelle Janoueix-Lerosey; Olivier Delattre; Emmanuel Barillot
Summary: We present a tool for control-free copy number alteration (CNA) detection using deep-sequencing data, particularly useful for cancer studies. The tool deals with two frequent problems in the analysis of cancer deep-sequencing data: absence of a control sample and possible polyploidy of cancer cells. FREEC (control-FREE Copy number caller) automatically normalizes and segments copy number profiles (CNPs) and calls CNAs. If ploidy is known, FREEC assigns absolute copy number to each predicted CNA. To normalize raw CNPs, the user can provide a control dataset if available; otherwise GC content is used. We demonstrate that for Illumina single-end, mate-pair or paired-end sequencing, GC-content normalization provides smooth profiles that can be further segmented and analyzed in order to predict CNAs. Availability: Source code and sample data are available at http://bioinfo-out.curie.fr/projects/freec/. Contact: freec@curie.fr Supplementary information: Supplementary data are available at Bioinformatics online.
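The idea of normalizing a raw copy number profile by GC content can be sketched as follows. The sketch deliberately uses a simple per-bin median correction rather than whatever model FREEC itself fits; the function names, the bin count and the toy windows are assumptions.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """GC fraction of a window, ignoring non-ACGT characters such as N."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else None


def gc_normalize(read_counts, gc_values, n_bins=20):
    """Divide each window's read count by the median count of windows with similar GC."""
    def bin_of(gc):
        return min(int(gc * n_bins), n_bins - 1)

    by_bin = defaultdict(list)
    for count, gc in zip(read_counts, gc_values):
        by_bin[bin_of(gc)].append(count)
    medians = {b: median(counts) for b, counts in by_bin.items()}

    normalized = []
    for count, gc in zip(read_counts, gc_values):
        m = medians[bin_of(gc)]
        normalized.append(count / m if m > 0 else None)
    return normalized


# Toy windows (assumed): coverage rises with GC content; the last window is a gain.
counts = [100, 110, 90, 200, 210, 190, 400]
gcs = [0.31, 0.32, 0.33, 0.51, 0.52, 0.53, 0.52]
profile = gc_normalize(counts, gcs)
```

After normalization the GC-driven difference between the first and second group of windows disappears, while the last window still stands out at roughly twice the local median.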
Bioinformatics | 2010
Ivan V. Kulakovskiy; Valentina Boeva; Alexander V. Favorov; Vsevolod J. Makeev
SUMMARY ChIP-Seq data are a new challenge for motif discovery. Such data typically consist of thousands of DNA segments with base-specific coverage values. We present a new version of our DNA motif discovery software ChIPMunk adapted for ChIP-Seq data. ChIPMunk is an iterative algorithm that combines greedy optimization with bootstrapping and uses coverage profiles as motif positional preferences. ChIPMunk does not require truncation of long DNA segments and it is practical for processing up to tens of thousands of data sequences. Comparison with traditional (MEME) or ChIP-Seq-oriented (HMS) motif discovery tools shows that ChIPMunk identifies the correct motifs with the same or better quality but works dramatically faster. AVAILABILITY AND IMPLEMENTATION ChIPMunk is freely available within the ru_genetika Java package: http://line.imb.ac.ru/ChIPMunk. A web-based version is also available. CONTACT ivan.kulakovskiy@gmail.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Bioinformatics | 2006
Valentina Boeva; Mireille Régnier; Dmitri Papatsenko; Vsevolod J. Makeev
MOTIVATION Genomic sequences are highly redundant and contain many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of particular interest. They are found in regulatory regions of eukaryotic genes and are reported to interact with transcription factors. However, accurate assessment of FTR occurrences in different genome segments requires a specific algorithm for efficient FTR identification and classification. RESULTS We have obtained formulas for P-values of FTR occurrence and developed an FTR identification algorithm implemented in TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp) in coding and non-coding regions including UTRs, heterochromatic, intergenic and enhancer sequences of Drosophila melanogaster and Drosophila pseudoobscura. Tandems with period three and its multiples were found in coding segments, whereas FTRs with periods that are multiples of six are overrepresented in all non-coding segments. Periods equal to 5-7 and 11-14 were characteristic of the enhancer regions and other non-coding regions close to genes. AVAILABILITY TandemSWAN web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/projects/swan/www/ SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
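What makes a tandem repeat "fuzzy" can be shown with a small sketch that scores how well a sequence matches itself when shifted by a candidate period. It is not the TandemSWAN algorithm and does not compute its P-values; the function names and the toy sequence are assumptions, and only the 24 bp upper bound on the period is taken from the abstract.

```python
def period_match_fraction(seq, period):
    """Fraction of positions i for which seq[i] equals seq[i + period]."""
    pairs = len(seq) - period
    if period < 1 or pairs < 1:
        raise ValueError("period must be between 1 and len(seq) - 1")
    matches = sum(seq[i] == seq[i + period] for i in range(pairs))
    return matches / pairs


def period_profile(seq, max_period=24):
    """Self-match score for every period from 1 up to max_period."""
    return {p: period_match_fraction(seq, p)
            for p in range(1, max_period + 1) if p < len(seq)}


# Toy sequence (assumed): five copies of ACG with one substitution (ACT).
toy = "ACGACGACTACGACG"
profile = period_profile(toy)
```

For this toy sequence the score at period 3 is 10/12 despite the substitution, while the score at period 2 is zero. Multiples of the true period also score highly on such a short sequence, which is one reason a statistical assessment like the one described in the abstract is needed.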
Nucleic Acids Research | 2010
Valentina Boeva; Didier Surdez; Noëlle Guillon; Franck Tirode; Anthony P. Fejes; Olivier Delattre; Emmanuel Barillot
Dramatic progress in the development of next-generation sequencing technologies has enabled accurate genome-wide characterization of the binding sites of DNA-associated proteins. This technique, baptized as ChIP-Seq, uses a combination of chromatin immunoprecipitation and massively parallel DNA sequencing. Other published tools that predict binding sites from ChIP-Seq data use only positional information of mapped reads. In contrast, our algorithm MICSA (Motif Identification for ChIP-Seq Analysis) combines this source of positional information with information on motif occurrences to better predict binding sites of transcription factors (TFs). We proved the greater accuracy of MICSA with respect to several other tools by running them on datasets for the TFs NRSF, GABP, STAT1 and CTCF. We also applied MICSA on a dataset for the oncogenic TF EWS-FLI1. We discovered >2000 binding sites and two functionally different binding motifs. We observed that EWS-FLI1 can activate gene transcription when (i) its binding site is located in close proximity to the gene transcription start site (up to ∼150 kb), and (ii) it contains a microsatellite sequence. Furthermore, we observed that sites without microsatellites can also induce regulation of gene expression—positively as often as negatively—and at much larger distances (up to ∼1 Mb).
Nature Genetics | 2015
Thomas G. P. Grunewald; Virginie Bernard; Pascale Gilardi-Hebenstreit; Virginie Raynal; Didier Surdez; Marie Ming Aynaud; Olivier Mirabeau; Florencia Cidre-Aranaz; Franck Tirode; Sakina Zaidi; Gaëlle Pérot; Anneliene H. Jonker; Carlo Lucchesi; Marie Cécile Le Deley; Odile Oberlin; Perrine Marec-Berard; Amelie S. Veron; Stéphanie Reynaud; Eve Lapouble; Valentina Boeva; Thomas Rio Frio; Javier Alonso; Smita Bhatia; Gaëlle Pierron; Geraldine Cancel-Tassin; Olivier Cussenot; David G. Cox; Lindsay M. Morton; Mitchell J. Machiela; Stephen J. Chanock
Deciphering the ways in which somatic mutations and germline susceptibility variants cooperate to promote cancer is challenging. Ewing sarcoma is characterized by fusions between EWSR1 and members of the ETS gene family, usually EWSR1-FLI1, leading to the generation of oncogenic transcription factors that bind DNA at GGAA motifs. A recent genome-wide association study identified susceptibility variants near EGR2. Here we found that EGR2 knockdown inhibited proliferation, clonogenicity and spheroidal growth in vitro and induced regression of Ewing sarcoma xenografts. Targeted germline deep sequencing of the EGR2 locus in affected subjects and controls identified 291 Ewing-associated SNPs. At rs79965208, the A risk allele connected adjacent GGAA repeats by converting an interspaced GGAT motif into a GGAA motif, thereby increasing the number of consecutive GGAA motifs and thus the EWSR1-FLI1–dependent enhancer activity of this sequence, with epigenetic characteristics of an active regulatory element. EWSR1-FLI1 preferentially bound to the A risk allele, which increased global and allele-specific EGR2 expression. Collectively, our findings establish cooperation between a dominant oncogene and a susceptibility variant that regulates a major driver of Ewing sarcomagenesis.
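The effect described for the risk allele, where a single base change joins two GGAA runs into a longer one, can be made concrete with a short sketch. The sequences below are toy strings, not the actual sequence around rs79965208, and the function name is hypothetical.

```python
import re


def longest_ggaa_run(seq):
    """Largest number of consecutive GGAA motifs found anywhere in the sequence."""
    runs = re.findall(r"(?:GGAA)+", seq.upper())
    return max((len(run) // 4 for run in runs), default=0)


# Toy alleles (assumed): an interspaced GGAT separates two GGAA pairs.
non_risk = "GGAAGGAA" + "GGAT" + "GGAAGGAA"
risk = "GGAAGGAA" + "GGAA" + "GGAAGGAA"   # T -> A turns GGAT into GGAA

run_non_risk = longest_ggaa_run(non_risk)
run_risk = longest_ggaa_run(risk)
```

In the toy example the longest run grows from two to five consecutive motifs after the substitution.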
Bioinformatics | 2014
Valentina Boeva; Tatiana Popova; Maxime Lienard; Sebastien Toffoli; Maud Kamal; Christophe Le Tourneau; David Gentien; Nicolas Servant; Pierre Gestraud; Thomas Rio Frio; Philippe Hupé; Emmanuel Barillot; Jean-François Laes
Motivation: Because of its low cost, amplicon sequencing, also known as ultra-deep targeted sequencing, is now becoming widely used in oncology for detection of actionable mutations, i.e. mutations influencing cell sensitivity to targeted therapies. Amplicon sequencing is based on the polymerase chain reaction amplification of the regions of interest, a process that considerably distorts the information on copy numbers initially present in the tumor DNA. Therefore, additional experiments such as single nucleotide polymorphism (SNP) or comparative genomic hybridization (CGH) arrays often complement amplicon sequencing in clinics to identify copy number status of genes whose amplification or deletion has direct consequences on the efficacy of a particular cancer treatment. So far, there has been no proven method to extract the information on gene copy number aberrations based solely on amplicon sequencing. Results: Here we present ONCOCNV, a method that includes a multifactor normalization and annotation technique enabling the detection of large copy number changes from amplicon sequencing data. We validated our approach on high and low amplicon density datasets and demonstrated that ONCOCNV can achieve a precision comparable with that of array CGH techniques in detecting copy number aberrations. Thus, ONCOCNV applied on amplicon sequencing data would make the use of additional array CGH or SNP array experiments unnecessary. Availability and implementation: http://oncocnv.curie.fr/ Contact: valentina.boeva@curie.fr Supplementary information: Supplementary data are available at Bioinformatics online.
Algorithms for Molecular Biology | 2007
Valentina Boeva; Julien Clement; Mireille Régnier; Mikhail A. Roytberg; Vsevolod J. Makeev
Background: cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. Results: We developed and implemented an algorithm computing the p-value that $s$ different motifs occur respectively $k_1, \ldots, k_s$ or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. Method: The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the $O(n|\Sigma|(m|\mathcal{H}| + K|\sigma|^K)\prod_i k_i)$ time complexity, where $n$ is the length of the text, $|\Sigma|$ is the alphabet size, $m$ is the maximal motif length, $|\mathcal{H}|$ is the total number of words in motifs, $K$ is the order of Markov model, and $k_i$ is the number of occurrences of the $i$th motif. Conclusion: The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. Availability: Project web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/AhoPro/
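The kind of computation described in this abstract can be illustrated with a much reduced sketch: one motif given as a set of words, a zero-order (independent letters) text model, and a dynamic program over the states of a prefix-tree automaton. This is a simplification written for illustration, not the AhoPro implementation; all function names are hypothetical, and the automaton is built by brute force rather than with proper Aho-Corasick failure links.

```python
from collections import defaultdict


def build_automaton(words, alphabet):
    """States are all prefixes of the words; delta maps (state, letter) to the
    longest suffix of state + letter that is again a prefix of some word."""
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}

    def longest_known_suffix(s):
        while s not in prefixes:   # terminates: the empty string is a prefix
            s = s[1:]
        return s

    delta = {(p, a): longest_known_suffix(p + a)
             for p in prefixes for a in alphabet}
    # Number of distinct motif words ending exactly at each state.
    hits = {p: sum(p.endswith(w) for w in set(words)) for p in prefixes}
    return delta, hits


def occurrence_p_value(words, k, n, letter_probs):
    """P(at least k possibly overlapping occurrences in a random text of length n)."""
    delta, hits = build_automaton(words, letter_probs)
    dist = {("", 0): 1.0}          # (automaton state, capped count) -> probability
    for _ in range(n):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for letter, p_letter in letter_probs.items():
                q = delta[(state, letter)]
                nxt[(q, min(k, count + hits[q]))] += p * p_letter
        dist = nxt
    return sum(p for (_, count), p in dist.items() if count >= k)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_once = occurrence_p_value(["AA"], k=1, n=3, letter_probs=uniform)
p_twice = occurrence_p_value(["AA"], k=2, n=3, letter_probs=uniform)
```

For the word AA in a uniform text of length 3, seven of the 64 possible texts contain it at least once and only AAA contains it twice (overlapping), so the sketch returns 7/64 and 1/64. Capping the count at k keeps the number of tracked (state, count) pairs proportional to k, which mirrors the role of the occurrence counts in the complexity bound quoted in the abstract.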
Bioinformatics | 2012
Valentina Boeva; Alban Lermine; Camille Barette; Christel Guillouf; Emmanuel Barillot
MOTIVATION ChIP-seq consists of chromatin immunoprecipitation and deep sequencing of the extracted DNA fragments. It is the technique of choice for accurate characterization of the binding sites of transcription factors and other DNA-associated proteins. We present a web service, Nebula, which allows inexperienced users to perform a complete bioinformatics analysis of ChIP-seq data. RESULTS Nebula was designed for both bioinformaticians and biologists. It is based on the Galaxy open source framework. Galaxy already includes a large number of functionalities for mapping reads and peak calling. We added the following to Galaxy: (i) peak calling with FindPeaks and a module for immunoprecipitation quality control, (ii) de novo motif discovery with ChIPMunk, (iii) calculation of the density and the cumulative distribution of peak locations relative to gene transcription start sites, (iv) annotation of peaks with genomic features and (v) annotation of genes with peak information. Nebula generates the graphs and the enrichment statistics at each step of the process. During steps (iii)-(v), Nebula optionally repeats the analysis on a control dataset and compares these results with those from the main dataset. Nebula can also incorporate gene expression (or gene modulation) data during these steps. In summary, Nebula is an innovative web service that provides an advanced ChIP-seq analysis pipeline with ready-to-publish results. AVAILABILITY Nebula is available at http://nebula.curie.fr/ SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
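The cumulative distribution of peak locations relative to transcription start sites, item (iii) in the list above, can be sketched in a few lines. This is not Nebula's code: the function names and coordinates are assumptions, and the sketch uses unsigned distances on a single chromosome, ignoring gene strand.

```python
from bisect import bisect_left


def distance_to_nearest_tss(peak_pos, sorted_tss):
    """Absolute distance from a peak to the closest TSS in a sorted, non-empty list."""
    i = bisect_left(sorted_tss, peak_pos)
    neighbours = sorted_tss[max(i - 1, 0):i + 1]   # TSS just before and just after
    return min(abs(peak_pos - t) for t in neighbours)


def cumulative_fraction(distances, thresholds):
    """Fraction of peaks lying within each distance threshold of a TSS."""
    n = len(distances)
    return [sum(d <= t for d in distances) / n for t in thresholds]


# Toy coordinates (assumed) on one chromosome.
tss = [1000, 5000, 9000]
peaks = [1200, 4900, 9500, 500]
dists = [distance_to_nearest_tss(p, tss) for p in peaks]
cdf = cumulative_fraction(dists, [150, 250, 1000])
```

Running the same two functions on a control set of positions gives a second curve to compare against, in the spirit of the optional control comparison mentioned in the abstract.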
BMC Bioinformatics | 2013
Toby Dylan Hocking; Gudrun Schleiermacher; Isabelle Janoueix-Lerosey; Valentina Boeva; Julie Cappo; Olivier Delattre; Francis R. Bach; Jean-Philippe Vert
Background: Many models have been proposed to detect copy number alterations in chromosomal copy number profiles, but it is usually not obvious to decide which is most effective for a given data set. Furthermore, most methods have a smoothing parameter that determines the number of breakpoints and must be chosen using various heuristics. Results: We present three contributions for copy number profile smoothing model selection. First, we propose to select the model and degree of smoothness that maximizes agreement with visual breakpoint region annotations. Second, we develop cross-validation procedures to estimate the error of the trained models. Third, we apply these methods to compare 17 smoothing models on a new database of 575 annotated neuroblastoma copy number profiles, which we make available as a public benchmark for testing new algorithms. Conclusions: Whereas previous studies have been qualitative or limited to simulated data, our annotation-guided approach is quantitative and suggests which algorithms are fastest and most accurate in practice on real data. In the neuroblastoma data, the equivalent pelt.n and cghseg.k methods were the best breakpoint detectors, and exhibited reasonable computation times.
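The first contribution, choosing the degree of smoothness that best agrees with annotated regions, can be sketched as a simple error count. The label names ("breakpoint" for a region expected to contain at least one breakpoint, "normal" for a region expected to contain none), the function names and the toy values are assumptions for illustration, not the paper's code.

```python
def annotation_errors(breakpoints, annotations):
    """Count annotated regions that a set of predicted breakpoints disagrees with."""
    errors = 0
    for start, end, label in annotations:
        inside = sum(start <= b <= end for b in breakpoints)
        if label == "breakpoint" and inside == 0:
            errors += 1          # missed an annotated change
        elif label == "normal" and inside > 0:
            errors += 1          # predicted a change in a flat region
    return errors


def select_smoothing(candidates, annotations):
    """Return the smoothing parameter whose breakpoints give the fewest errors.

    candidates maps each parameter value to its predicted breakpoint positions;
    ties go to the first minimal entry in dictionary order.
    """
    return min(candidates,
               key=lambda lam: annotation_errors(candidates[lam], annotations))


# Toy annotations and candidate segmentations (assumed values).
regions = [(10, 20, "breakpoint"), (30, 50, "normal")]
models = {0.1: [15, 35, 40], 1.0: [15], 10.0: []}
best = select_smoothing(models, regions)
```

Here the weakest smoothing over-segments the flat region, the strongest misses the annotated change, and the intermediate value agrees with both annotations.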