Gene family amplification facilitates adaptation in freshwater Unionid bivalve Megalonaias nervosa
Rebekah L. Rogers, Stephanie L. Grizzard, James E. Titus-McQuillan, Katherine Bockrath, Sagar Patel, John P. Wares, Jeffrey T. Garner, Cathy C. Moore
GGene family amplification facilitates adaptationin freshwater Unionid bivalve
Megalonaias nervosa
Rebekah L. Rogers ∗ , Stephanie L. Grizzard , , Katherine Bockrath , , Sagar Patel , , ,John P. Wares , , Jeffrey T Garner , Cathy C. Moore Author Affiliations:1. Department of Bioinformatics and Genomics, University of North Carolina, Charlotte,NC2. Department of Biological Sciences, Old Dominion University, Norfolk, VA3. Department of Genetics, University of Georgia, Athens, GA4. U.S. Fish and Wildlife Service, Midwest Fisheries Center Whitney Genetics Lab,Onalaska, WI5. Department of Biology, Saint Louis University, St. Louis, MO6. Donald Danforth Plant Science Center, St. Louis, MO7. Odum School of Ecology, University of Georgia, Athens, GA8. Division of Wildlife and Freshwater Fisheries, Alabama Department of Conservation andNatural Resources, Florence, AL*Corresponding author: Department of Bioinformatics and Genomics, University of NorthCarolina, Charlotte, NC. [email protected]: Unionidae,
M. nervosa , Evolutionary genomics, gene family expansion,transposable element evolution, Cytochrome P450, Population genomics, reverse ecologicalgeneticsShort Title: Gene family expansion in
Megalonaias nervosa a r X i v : . [ q - b i o . GN ] A ug ntroduction As organisms are faced with intense rapidly changing selective pressures, new genetic materialis required to facilitate adaptation. Among sources of genetic novelty, gene duplications andtransposable elements (TEs) offer new genes or new regulatory patterns that can facilitateevolutionary change (Oliver and GreeneOliver and Greene, 20092009; FeschotteFeschotte, 20082008; Conant and WolfeConant and Wolfe, 20082008;OhnoOhno, 19701970). These mutations can offer a substrate of novel genetic variation that mayproduce new functions to help organisms adapt when innovative changes are needed forsurvival. In evolutionary systems that are faced with sudden shifts in selective pressures,these sources of new genes are expected to be all the more essential as large-effect mutationsfacilitate moves to novel fitness peaks (OrrOrr, 20052005).Historically, duplicate genes and TEs have been well studied in modelorganisms where it has been possible to estimate prevalence (Adams et al.Adams et al., 20002000;
Drosophila
12 Genomes Consortium et al.
Drosophila
12 Genomes Consortium et al., 20072007; Consortium et al.Consortium et al., 20012001; Goffeau et al.Goffeau et al.,19961996), mutation rates (Bennett et al.Bennett et al., 20042004; Adrion et al.Adrion et al., 20172017; Demuth et al.Demuth et al., 20062006;Hahn et al.Hahn et al., 20072007; Rogers et al.Rogers et al., 20092009), and identify adaptive changes (Aminetzach et al.Aminetzach et al.,20052005; Han et al.Han et al., 20092009; Schmidt et al.Schmidt et al., 20102010). With advances in genome sequencing it ispossible to gain a broader view of how gene family proliferation and TE content evolveacross the tree of life. Most model organisms are either short lived with large, persistentpopulations (
Drosophila , yeast) or long lived with modest population sizes and lowergenetic diversity (Humans, mammals) (LynchLynch, 20072007). It has only recently become tractableto generate reference genomes and evaluate the role of genetic novelty in alternativeevolutionary systems. An understanding of how genomes evolve in populations that haverecently faced strong shifts in selective pressures and sharp population declines that amplifygenetic drift will clarify whether these mutations might play an outsized role in specificevolutionary scenarios.The tempo of evolution in response to selection depends directly on population sizes,mutation rates, and generation time (Hermisson and PenningsHermisson and Pennings, 20052005; GillespieGillespie, 19911991;Maynard SmithMaynard Smith, 19711971). It is therefore important to understand how genetic novelty respondsin systems with alternative demography and life history in order to discern how these locicontribute to rapid evolutionary change. Moreover, the potential for threatened species torespond to selective challenges may depend on life history, as maturation time, growth andreproductive strategies may contribute to risk of extinction (Haag and RypelHaag and Rypel, 20112011; HaagHaag,20122012). Freshwater unionid bivalves (Unionidae) offer a long-lived evolutionary model thathas recently encountered environmental challenges, with intense changes in population sizes,and strong selective pressures. These bivalves have experienced strong selective pressuresin modern history, where 70% of Unionidae species are threatened, endangered, or extinct(Williams et al.Williams et al., 19931993; PlattPlatt, 20182018). Understanding of how duplications and TEs evolve inthe face of modern anthropogenic challenges will be key to understanding how they mightcontribute to species survival.Unionid bivalves are benthic filter feeders that form the backbone of freshwaterecosystems (Lopez et al.Lopez et al., 20202020; Williams et al.Williams et al., 20082008; PlattPlatt, 20182018). They are notable fortheir unusual life history. The Unionidae are dioecious, with spermcast mating and internalfertilization (Williams et al.Williams et al., 20082008). A single gravid female can hold up to 1 million offspringin a single brood (Haggerty et al.Haggerty et al., 20052005). Gravid females house offspring in gill marsupia until2aturity. The mature larvae (glochidia) are released as obligate parasites that use fish hosts(Williams et al.Williams et al., 20082008). Females of some species develop mantle modifications to lure hostsor produce conglutinates of glochidia with markings to mimic small fish prey, while otherspecies simply broadcast glochidia into the water column (Williams et al.Williams et al., 20082008). The larvaeare external parasites and become encysted on the gills or fins of the fish and remain thereuntil maturation, when they drop off to become benthic filter feeders that remove algae andnutrients from the water column (Williams et al.Williams et al., 20082008). Unionidae are known for biparentalmitochondria inheritance, where males receive mitochondria from the paternal lineage andfemales receive them from maternal lineages, a genetic factor in sex determination (Liu et al.Liu et al.,19961996; Wen et al.Wen et al., 20172017). Because of this unusual life history, Unionidae serve as an interestingevolutionary model system for parental care, parasite-host coevolution, and evolution of lifehistory strategies. It may also offer a system to study the genetic response to strong andsuddenly shifting selective pressures (Lopez et al.Lopez et al., 20202020; Williams et al.Williams et al., 20082008; PlattPlatt, 20182018)in a long-lived species (Haag and RypelHaag and Rypel, 20112011; HaagHaag, 20122012) that departs from the standardpopulation genetic models.These populations have recently been threatened by a number of human induced selectivepressures. In the modern industrial era, waterways have been dammed throughout theUnited States limiting dispersal (OrtmannOrtmann, 19241924). Alongside these changes to rivers, watersare compromised with pesticides, industrial pollution, and agricultural runoff (WoodsideWoodside,20042004). One unintended consequence has been limited dispersal of aquatic life throughwaterways, leading to a marked reduction in species diversity and prevalence (OrtmannOrtmann,19241924). Many freshwater bivalves in the Southeast struggle to compete with invasive speciessuch as Zebra mussels and Asiatic clams (StrayerStrayer, 19991999). Unionidae are also harvestedfor shells and pearls from wild populations. The Tennessee Valley Authority (TVA) hasmonitored unionids in the Tennessee River for decades as an indicator of water quality andto track impacts of commercial mussel capture, providing a rich historical record (CahnCahn,19361936; Isom et al.Isom et al., 19731973). Improved environmental regulations and targeted conservationefforts have encouraged recovery, but the historic mark of these environmental challenges isnot fully known.Given the shifting landscape of selective pressures and extreme environmental changesin recent history, Unionidae are an ideal model to study how new gene formation creates“hopeful monsters” within the genome and contributes to adaptive changes. Genomesequencing offers a means to bring these fascinating organisms into the modern geneticera, where reverse ecological genomics may reveal the genetic responses to these changingenvironmental and coevolutionary pressures (Marmeisse et al.Marmeisse et al., 20132013). As yet, these types ofgenetic studies have not been tractable in Unionids due to the limitation of genetic resources.As a first step to facilitate evolutionary genetics with sufficient power to resolve geneticpatterns of adaptation in Unionidae, sequencing and assembly of reference genomes can helpfill this gap, with broad impacts on conservation and management for this imperiled group.Among Unionidae,
Megalonaias nervosa is one widespread, prevalent species whosepopulations have remained robust, so far, in the face of environmental challenges.
M.nervosa is thick-shelled, spawns in the fall, and produces large brood sizes of up to 1million glochidia per individual (Williams et al.Williams et al., 20082008; Haggerty et al.Haggerty et al., 20052005). Earliestmaturation is between 5-8 years of age (Woody and Holland-BartelsWoody and Holland-Bartels, 19931993) and the speciesis long lived with individuals identified up to 43 years of age (Haag and RypelHaag and Rypel, 20112011).3 . nervosa larvae are generalist parasites that use channel catfish, sunfish, and bullhead(Woody and Holland-BartelsWoody and Holland-Bartels, 19931993) as well as crappie, perch, gar, darters and bass(Williams et al.Williams et al., 20082008). These mussels are present throughout the Southeastern UnitedStates, in the Mississippi, Ohio, and Tennessee River waterways (Williams et al.Williams et al., 20082008).The Tennessee River offers a focal location where 79 species of Unionid bivalves arepresent, the most diverse location in Alabama (Garner and McGregorGarner and McGregor, 20012001). In NorthernAlabama, TVA has documented as much as 6,700 tons of shells harvested in a singleyear at peak demand (Williams et al.Williams et al., 20082008) and up to 45% of these were
M. nervosa (Ahlstedt and McDonoughAhlstedt and McDonough, 19921992).To determine how the genome of
M. nervosa has responded to recent selective pressures,we have generated a reference genome for a single individual collected from Pickwick Lake,Muscle Shoals, Alabama. We explore the relative contribution of gene family expansion,TE proliferation, and single nucleotide mutations to genetic variation and its influence onevolutionary potential. This reference genome shows strong signatures of TE proliferationand adaptation through gene family expansion in
M. nervosa . We further analyze geneticdiversity in comparison with recently sequenced bivalves, and how it is expected to influencesubsequent evolutionary processes in this species.
MethodsAssembly and annotation
A single gravid female roughly 11 years of age based on ring counts was collected fromPickwick Lake in Muscle Shoals, Alabama in 2017 (JGPickwickLake2017 Specimen
Quadrula quadrula mtDNA sequence obtained from NCBI (accession NC 013658.1, downloaded Aug 15 2018).In addition we obtained transcriptomes from 6 tissues: glochidia, gills, mucus glands,4antle, adductor muscle, and muscular foot. Tissues were flash frozen in liquid nitrogenimmediately on dissection, then RNA was extracted using Zymo DirectZol RNA microPrepswithout DNase. RNA was converted to cDNA using the Illumina TruSeq mRNA HTLibrary Prep. Libraries were prepared using 8M fragmentation (Covaris) and 15 cycles ofamplification. We used secondary cleanup step with Ampure XP beads to remove lowermolecular weight primer-dimers. Transcriptomes were sequenced on a single lane of aHiSeq4000 using the 150bp paired end (PE) Cluster kit (DUGSIM).We assembled transcriptomes using the Oyster River Protocol v2.3.1 (MacManesMacManes, 20182018),then aligned to the reference genome using BLAT v. 36 (KentKent, 20022002). Alignmentswere converted to hints and used as input for de novo gene prediction in Augustus v3.3(Stanke and WaackStanke and Waack, 20032003), allowing isoform prediction with evidence (Stanke et al.Stanke et al., 20082008).Species-specific gene prediction models for
M. nervosa produced during BUSCO trainingwere used for Augustus gene prediction. We used Augustus gene models to create CDSand protein fasta files using gffread (Pertea and PerteaPertea and Pertea, 20202020). We produced functionalannotation for gene models using Interproscan v5.34-73.0 (Quevillon et al.Quevillon et al., 20052005) basedon identified Pfam (El-Gebali et al.El-Gebali et al., 20192019) conserved domains. Sequences were comparedagainst gene models from
Crassotrea gigas (the Pacific Oyster) (Zhang et al.Zhang et al., 20122012) using ablastp at an E-value of 10 − . One to one orthologs were defined as reciprocal best hits.Annotated sequences were compared to the RepBase database (Jurka et al.Jurka et al., 20052005) of TEsin a blastn search at an E-value of 10 − . Annotations with blast hits in RepBase or withInterproscan results suggesting TE or viral activities were removed from the annotations.We note that these criteria could allow for some novel selfish genetic elements if sufficientlydistant from previously annotated TEs in other species. To establish a final high qualityannotation data set, we required that de novo predictions be supported by at least one lineof evidence: results from Interproscan, blast comparison against C. gigas or support in thetranscriptome assembly.
Transcriptome annotation for other Unionidae
Comparisons of gene content across species are complicated by varying assembly qualityand limited genome availability for Unionidae. Transcriptome data for
Elliptio crassidens originally collected for other purposes were incidentally available. Fresh gonad tissue wasdissected for a single individual from the Apalachicola River Basin (Georgia, USA) in 2013.Gonad tissue was fixed in RNAlater and prepped using a Trizol RNA isolation reaction.The reactions were quantified on a qubit at 1000 ng/ml RNA. RNA was prepared for RNAsequencing at the Georgia Genomic Core Facility. Sequencing was performed on a MiSeq with150 bp paired end reads. Previously published transcriptome sequences were available for
Venustaconcha ellipsiformis (Renaut et al.Renaut et al., 20182018). To offer a comparison for gene contentin
M. nervosa that could be verified independently from genome assembly, we de novo assembled and annotated transcriptomes for
E. crassidens , V. ellipsiformis , and
M. nervosa .Transcriptomes were assembled using the Oyster River Protocol (MacManesMacManes, 20182018) thentranslated into six open reading frames. Translated proteins were functionally annotatedusing Interproscan (Quevillon et al.Quevillon et al., 20052005), which identifies conserved domains in translatedprotein sequences. These results offer comparable annotations for all three species, thoughwith less stringent criteria than employed when mapping to genome assemblies for official5nnotations.
Gene Families
Gene family expansion is an important mode of new gene formation that can lead toevolutionary innovation (Hahn et al.Hahn et al., 20072007; Conant and WolfeConant and Wolfe, 20082008; HahnHahn, 20092009; OhnoOhno,19701970). The rates at which new genes are formed and lost dictates evolution of gene contentand the substrate of genetic novelty (Lynch and ConeryLynch and Conery, 20002000; Rogers et al.Rogers et al., 20092009; HahnHahn,20092009).To determine the genomic responses of new gene formation in
M. nervosa , we identifiedgene families and used birth-death models to identify rates of evolution through new geneformation. These models assume new genes form in clock-like fashion according to Poissonprocesses. The history of gene birth and death is recorded in the demographic history ofgene family members, where gene ages are recorded in genetic sequence data. We used aFuzzy Reciprocal Best Hit algorithm (Hahn et al.Hahn et al., 20072007) on an all-by-all blastn comparisonat an E-value of 10 − of gene annotations within M. nervosa to identify gene families. Weallowed for a difference in E-values within two orders of magnitude. If orthogroup orderwas ambiguous, the maximum rank was chosen. The distribution of duplicate gene ages isrecorded in the divergence among copies at synonymous sites (dS) and the relative rate ofamino acid changing mutations is recorded in divergence at non-synonymous sites (dN).We collected sequences for each orthogroup, translated them, then aligned amino acidsequences in clustalw2 (Thompson et al.Thompson et al., 19941994; Larkin et al.Larkin et al., 20072007). Alignments wereback-translated to the original nucleotide sequence to produce in-frame alignments. Weused PAML’s codeml (YangYang, 19971997) to estimate dN and dS, estimating codon frequenciesfrom nucleotide frequencies (F1x4) under the guide tree from clustalw2. We excluded allgenes past the mutation saturation point with dS > .
0. We then fit a birth-death modelusing maximum likelihood to the distribution of dS for gene family members according to λe − µdS similar to previous work on duplicate genes (Lynch and ConeryLynch and Conery, 20002000; Rogers et al.Rogers et al.,20092009). We established the 95% CI using a jackknife of 75% resampling with 10,000 replicates.Mutation rates for duplicate gene formation and the half-life of mutations were inferredbased on birth-death parameters. These can be rescaled to years assuming copies diverge bytwice the mutation rate of 5 × − (Pogson and ZourosPogson and Zouros, 19941994). Interproscan conserveddomain annotations identified orthogroups belonging to large gene families. We fittedseparate birth-death models for the most common gene families in the M. nervosa genome:
Cytochrome P450 genes,
Hsp70s , Mitochondria-eating genes, Chitin metabolism genes, vonWillebrand proteins,
ABC transporters and opsin genes.
TE content
Transposable elements are known as sources of innovation that can remodel genomes toproduce innovation (Oliver and GreeneOliver and Greene, 20092009; FeschotteFeschotte, 20082008). They are often associatedwith genome remodeling not solely because of new insertions but also due to their abilityto facilitate rearrangements and copy gene sequences. To determine how TEs mighthave reshaped genome content in
M. nervosa , we identified putative transposable elementsin
M. nervosa using a three pronged approach. Repeatscout 1.0.5 (Price et al.Price et al., 20052005),6epDeNovo (Chu et al.Chu et al., 20162016), and homology-based detection using either Interproscan5.34-73.0 (Quevillon et al.Quevillon et al., 20052005) or blast comparisons to RepBase (Bao et al.Bao et al., 20152015).RepeatScout identifies repetitive sequences in genomes where TE copies have sufficientdivergence from one another to successfully assemble into independent contigs. We thenmapped these repeats back to the
M. nervosa reference assembly using a blastn at
E < − with dust filters turned off. These methods effectively identify repeats with greater than 2%divergence. They do depend on genome assembly quality, and may not be appropriate forcross-species comparisons when assembly quality is variable.RepDeNovo does not rely on pre-existing genome assemblies but rather assembles contigsfrom high copy number reads in raw fastq files. This program identifies recently proliferatedtransposons that have less than 2% divergence from one another. We mapped Illuminasequencing reads onto these repetitive contigs identified with RepDeNovo and estimatedcopy number using sequence depth. TE types were identified using a tblastx against theRepBase Database. Finally, we identified transposon proteins among gene annotations usinghomology based searches in Interproscan or using blastn comparisons with the RepBasedatabase. Together, these methods generate a comprehensive portrait of TE content in thegenome of M. nervosa .For comparison, Illumina sequences are incidentally available for
V. ellipsiformis (42Xcoverage, SRR6689532) (Renaut et al.Renaut et al., 20182018) and
Elliptio hopetonensis (14X coverage), bothoriginally collected for other purposes.
V. ellipsiformis is present throughout the NorthernMississippi River, while
E. hopetonensis is an endemic species with limited numbers locatedin the Altahama River. These distantly related taxa offer the nearest comparisons amongUnionid mussels where genetic data are available. We used the assembly-free methods ofRepDeNovo (Chu et al.Chu et al., 20162016) to assay TE content in these two species. These data offerdirectly comparable estimates of TE content, independently from quality of genome assembly.We used downsampled data for
M. nervosa at 14X coverage to ascertain that differences inTE content were not driven by sequence coverage.
Low coverage assembly for
E. hopetonensis
Illumina sequencing data were incidentally available for
Elliptio hopetonensis taken froma single individual collected from the Ocmulgee River in 2012 (Field ID JMW120329-2,Ocmulgee R, site 2 of day, JMW, MJH, DAW). Mantle tissues was dissected and preparedfor DNA sequencing on a HiSeq2000 at the University of Colorado sequencing core. Usingthis 14X coverage Illumina sequencing data for
E. hopetonensis , we have generated a cursoryreference assembly in Abyss using default parameters. The assembly N50 is only 1 kb,and the maximum contig size is 22 kb. Genome size estimates are 1 Gb, with 500 Mbin scaffolds 1kb or larger. BUSCO (Waterhouse et al.Waterhouse et al., 20172017; Sim˜ao et al.Sim˜ao et al., 20152015) analysissuggests the assembly is largely incomplete with only 15% of conserved metazoan genesidentified. However, in a tblastn, we can find partial matches for 15,010
C. gigas genes. Thenumber of
C. gigas orthologs identified in this cursory reference is similar to the numberidentified in
M. nervosa . While this cursory reference genome is not of sufficient quality toanchor genes to regions or create syntenic maps, it can offer a comparison against
M. nervosa for genome size, and effective population size, as well as surveys of TE content describedabove. 7 ffective population size
The tempo of evolution depends directly on the effective population size, mutation rate,and generation times. To evaluate the timescale of adaptation, we used heterozygosity inthe reference specimen to estimate population genetic parameters such as θ = 4 N e µ , theexpected coalescent time for neutral alleles of 2N generations, and t MRCA of 4N generations(WakeleyWakeley, 20092009). To further identify the genetic potential available through standing geneticvariation, we estimated the time to establishment of selective sweeps on new mutations T e ,and the probability of adaptation from standing variation P sgv . These parameters offera description of theoretical expectations for how evolution should proceed given currentpopulation sizes and estimates of diversity. They offer a means to determine the limits ofwhat might be possible given strong selection acting on these species.We mapped Illumina reads from the genome assembly to the M. nervosa reference genomeusing bwa aln with default parameters. We identified SNPs using GATK (McKenna et al.McKenna et al.,20102010) haplotype caller v. 3.8-0-ge9d806836 assuming diploid genomes. We identifiedheterozygous sites to estimate genome wide heterozygosity for
M. nervosa . We performedidentical analysis for the cursory assembly for
E. hopetonensis . These data provide estimatesof θ = 4 N e µ for these freshwater bivalves. Using these data we calculate N e based on amutation rate of µ = 5 × − in bivalves (Pogson and ZourosPogson and Zouros, 19941994). We are able to usethis parameter to estimate coalescent timescale of 2 N generations and the t MRCA = 4 N generations. Timescales are offered in generations and converted to years assuming anaverage 20 year generation time based on the long lifespan and late maturation of M. nervosa at 8 years of age (Woody and Holland-BartelsWoody and Holland-Bartels, 19931993; Haag and RypelHaag and Rypel, 20112011). Alternativescaling from generations to years would scale linearly with generation time.We estimate the time to establishment of a sweep on a new mutation T e = θs (GillespieGillespie,19911991; Maynard SmithMaynard Smith, 19711971) for each species. These models describe time for new mutationsto arise de novo and escape stochastic forces of drift to establish selective sweeps. Weestimated the probability of adaptation from standing variation rather than new mutation P sgv = 1 − e − θ ∗ ln (1+2 N e s ) (Hermisson and PenningsHermisson and Pennings, 20052005). P sgv estimates the probability thata mutation is both present among standing genetic variation and at a sufficient frequency thatit is able to establish a deterministic sweep through the population in the face of geneticdrift. It depends on the population size, selective coefficient, and mutation rate. Theseparameters are estimated assuming moderate selective coefficients of s = 0 .
01 and for verystrong selection of s = 0 . ResultsGenome Assembly
We generated a reference genome sequence for
M. nervosa to facilitate evolutionary andecological genomics in Unionidae. We collected 40X Illumina short read sequencing and5X Oxford Nanopore long molecule sequences for a single gravid female
M. nervosa fromPickwick Lake, Muscle Shoals, Alabama. We assembled these sequences to produce areference genome using the Masurca hybrid assembler with default parameters (Zimin et al.Zimin et al.,20132013). Genome size estimates prior to filtering are 2.7 Gb, with a scaffold N50 of 52.7 kband a contig N50 of 51.5 kb (Table S1S1). These show higher contiguity than the previously8ssembled genome of
V. ellipsiformis (Renaut et al.Renaut et al., 20182018). We compared these resultswith one high quality marine bivalve assembly for Pacific oyster,
Crassotrea gigas . C. gigas diverged from
M. nervosa by 200 million years (Bolotov et al.Bolotov et al., 20172017) but is the nearestnon-Unionid bivalve with a high quality assembly and annotation. The contig N50 is superiorto the recent
C. gigas genome of 19.4 kb, but poorer than the scaffold N50 of 401 kb(Zhang et al.Zhang et al., 20122012). The distribution of kmers is continuous with no “shoulders” of high orlow kmer frequency, indicating no sign of polyploidy or genomes misassembled in the face ofheterozygosity (Figure S1S1).Some 92% of the genome is assembled into contigs over 10 kb and 50% is assembled intocontigs 50 kb or larger. Contaminating sequences were removed from the bivalve referenceassembly using a blastn, producing a total assembly of 2.36 Gb, 500 Mb (28%) larger thanestimates for one other Unionid mussel sequenced to date,
V. ellipsiformis . V. ellipsiformis has a genome size of 1.80 Gb (Renaut et al.Renaut et al., 20182018). BUSCO analysis (Sim˜ao et al.Sim˜ao et al., 20152015)suggests 83% complete orthologs and 9% fragmented orthologs, consistent with previousgenome assemblies for bivalves. Only 2.1% of BUSCOs are duplicated, again confirming theabsence of polyploidy or high heterozygosity that would lead to separation of homologouschromosomes in the assembly. Such sequence quality captures the majority of gene contenteither wholly or partially, and is sufficient for many applications in evolutionary andpopulation genomics.
Gene Annotations
Gene content is expected to reflect the biology of the organism. By identifying genesand searching for genetic signatures of rapid evolutionary events, we can facilitate reverseecological genomics and discover evolutionary processes that are important to the organismwithout pre-existing human biases.To identify genes and gene families that have emerged in freshwater Unionidae, wegenerated gene annotations for
M. nervosa . We produced gene annotations using RNAsequencing data for 6 tissues from the same individual: Gills, mantle, muscular foot, mucosalpalps, adductor muscle, and glochidia (offspring) (Figure 11). We used the Oyster RiverProtocol (MacManesMacManes, 20182018) to de novo assemble transcriptomes for six tissues. Thesetranscriptome sequences provided hints for Augustus de novo gene prediction to annotategene sequences in
M. nervosa . The Oyster River Protocol identifies 66.7-87.1% of BUSCOsacross each of the 6 transcriptomes. We identify 49,149 genes: 17,982 singleton genes with21,596 transcripts and 31,167 members of gene families with 32,967 transcripts. A total of 801transcripts belonging to 733 genes contain premature stop codons suggesting pseudogenes.Genes encompass approximately 477 kb of sequence. Average intron length is 3714 bp, (min7, max 145,991). In a reciprocal best hit blast comparison, we identify 8811 one-to-oneorthologs between
M. nervosa and
C. gigas .We used Interproscan (Quevillon et al.Quevillon et al., 20052005) to identify conserved protein domainsin comparisons against the Pfam database (El-Gebali et al.El-Gebali et al., 20192019) and thereby generatefunctional annotations for gene models. We placed no a priori criteria for what functionalcategories should be represented in the
M. nervosa genome. However, among interproscanannotations, we note large gene families related to biological processes expected to beimportant for Unionidae survival and reproduction. We identify 78 glutathione transferase9enes, 44 thioredoxins, 143 cytochrome P450 genes, and 106 ABC transporters expected tointeract with cytochromes (Figure 22). In addition we observe up to 500 fragmented copiesof both cytochrome P450 genes and
ABC transporter interacting regions. We find multiplecopies of genes involved in anticoagulation processes in
M. nervosa , consistent with bloodfeeding during the parasitic larval stage. We find 138 von Willebrand factor proteins (99type A and 39 type D), 15 Xa-binding genes, and 33 fibrinogen binding proteins.We further observe 96
Hsp70 s and 6
Hsp90 s, consistent with previously publishedresults from
C. gigas showing expansion of heat shock proteins (Zhang et al.Zhang et al., 20122012), butsubstantially increased over only 52 observed in
V. ellipsiformis (Renaut et al.Renaut et al., 20182018). Thesegenetic differences are consistent with thermal tolerance through
Hsp activity in otherorganisms (Sørensen et al.Sørensen et al., 20032003), though pleiotropic functions of
Hsp70 genes may alsoinfluence gene family expansion.
V. ellipsiformis has a range in the Northern US, coolerclimates compared with
M. nervosa . M. nervosa is found throughout the SoutheasternUnited States, and the present sample’s home location lies in Muscle Shoals, Alabama.We also note 238 rhodopsin genes and 117 Mitochondria-eating protein genes. These arein addition to large numbers of conserved domains with less clear functional implicationssuch as zinc finger domains, WD-40 domains, short chain dehydrogenase domains, Ankyrinrepeats, etc. In a targeted identification, we find 11 chitin synthase genes and 55 Chitinbinding Peritrophin-A genes, putatively important for shell formation. We identify 7 mucins,proteins important for defense and release of glochidia offspring. These numbers are manytimes higher than listed in official annotations in
V. ellipsiformis (Table S2S2), though we notethat annotation methods differ, especially with respect to repeat masking.Previously published annotations for
V. ellipsiformis show fewer members of theselarge gene families (Table S2S2) (Renaut et al.Renaut et al., 20182018). However, these annotations throughdifferent methods and with different assembly qualities could lead to a false portrait ofspecies difference. To generate annotations for direct comparisons independently fromgenome assembly, we de novo assembled RNAseq data in the Oyster River Protocol andused Interproscan to functionally annotate transcripts for
M. nervosa , E. crassidens , and
V. ellipsiformis . The
M. nervosa transcriptome shows 72.9% complete, 9.5% missingBUSCOs.
E. crassidens has 78% complete, only 4.2% missing BUSCOs.
V. ellipsiformis shows 96% complete, 0.3% missing BUSCOs. These transcriptome assemblies suggest aslight bias toward identifying full transcripts in
V. ellipsiformis but should be more directlycomparable than results from genome assemblies of varying qualities with differing methodsof annotation.
M. nervosa and
E. crassidens both show over 200 cytochrome P450 transcripts while theassembled
V. ellipsiformis transcriptome has only 129, in spite of its higher BUSCO score.These results point to large cytochrome P450 content across Unionidae, but with variationin the extent of gene family expansion (Figure S2S2, Table S3S3). Transcriptome annotations freeof reference assembly also show an excess of Mitochondria-eating proteins, von Willebrandfactors, and chitin metabolism genes, though differences are generally 2-fold or less (TableS3S3), in contrast to official genome annotations previously published. In many categories
E.crassidens shows a parallel increase in gene copy number, suggesting
M. nervosa is unlikelyto be a unique outlier among the Unionidae phylogeny. Genomic analysis on additionalspecies in the future will be necessary to fully assay detox gene variation among Unionidae.Genome annotation often masks highly repetitive sequences prior to annotation to reduce10he likelihood of identifying TEs as false gene sequences. This standard practice of maskingrepetitive sequences prior to annotation would miss many members of these important genefamilies relevant to the unique biology of freshwater Unionidae. Instead a more focused,precise study of transposable elements is able to sort out this adaptive variation, and offera more complete description of evolutionary change in
M. nervosa , an approach that shouldbe noted in future genome annotation efforts.
Gene Family Evolution
Gene family expansion through duplication can offer a substrate of genetic novelty that mayfacilitate adaptation to new environments. To identify signatures of adaptive changes andrapid evolution of copy number, we estimated dN/dS and fit birth-death models genomewide and for large gene families in
M. nervosa . Cytochrome P450 family orthogroups have 26 branches suggesting no divergence atsynonymous sites and 36 branches in the tree indicating dN/dS >> .
0, a strong signature ofnatural selection driving amino acid substitutions. These results suggest strong adaptationat detox gene families in
M. nervosa . We also identify 11 incidences of Hsp70 formation, 21branches for Opsin genes, 11 mitochondrial-eating protein duplications, 15 branches for vonWillebrand proteins, and 5 chitin duplications associated with dN/dS >
1. These signaturesof rapid amino acid substitution in excess of rates expected under neutrality indicate that newgene formation through gene family expansion makes a significant contribution to adaptationin
M. nervosa .It is difficult to resolve evolutionary events in near time using a single genome. However,we can infer that gene copies where dS < − occurred within the last 100 generations(10 − / (2 ∗ × − )) or 2000 years. Even under the minimum age at maturation, thesetimescales would fall within the last 500 years. Determining whether these changes precededthe modern industrial era and damming of the river in 1918 would require data from moreindividuals, potentially from historic specimens. Regardless it is clear that detox geneamplification in M. nervosa has expanded gene content and aided adaptation in recentevolutionary time, whether before or after the anthropogenic era.We fit a birth-death model to 6002 first order paralogs for all gene families using dS as aproxy for age according to λe − µdS (Table S4S4, Figure 33). The distribution of dS values followsa pattern of exponential decay as expected under birth-death processes (Figure 33). Usingmaximum likelihood, we fit a death rate, µ = 6 . ± (95% CI 6.7-6.9) and a birth rate λ =40 ,
859 genes per 1.0 dS (95% CI 40,262-41,455). Assuming gene copies appear in clock-likefashion and each accumulate point mutations at a rate of 5 × − (Pogson and ZourosPogson and Zouros,19941994), this birth rate scales to a mutation rate of 1 . × − duplications per gene pergeneration. Such mutation rates are equivalent to copying each gene in the genome once onaverage every 11 million generations. The half-life of a newly formed gene family member is0.1 dS, corresponding to 1 million generations or roughly 200 million years. We note that thisis far longer than we found for duplicate gene pairs in our previous work in deletion-biased Drosophila (Rogers et al.Rogers et al., 20092009). The distribution of dS also does not show evidence of longterm preservation suggesting an escape from birth-death processes (Figure 33).We identified 88 cytochrome P450 genes that appear to be members of large gene familiesin a first-order Fuzzy Reciprocal Best Hit Blast (Hahn et al.Hahn et al., 20072007). We estimated dN and dS11n PAML and fit birth-death models to these large gene families (Figure 44). For cytochromeP450 genes, λ = 898, µ = 10 . P < . µ to compensate for higher rates of gene birth.Hence, we conclude that cytochromes are operating under enhanced birth-death processeswith rapid expansion (and possibly contraction) in M. nervosa compared with genomewide expectations. Such expansion occurs in addition to observed signatures of amino acidsubstitutions driven by selection. In aggregate, these results of gene family evolution pointto widespread adaptation through gene family expansion, especially in detox genes.
Burst of Recent Transposable Element Activity
TEs are selfish genetic elements that copy themselves often at the expense of their hostorganisms. Transposable elements are key drivers of gene family expansion as theycan facilitate ectopic recombination, create new retrogenes, and duplicate adjacent genesequences (Oliver and GreeneOliver and Greene, 20092009; FeschotteFeschotte, 20082008). The way that these modes ofmutation interact to reshape genomes during adaptation to environmental challenges is keyto understanding how organisms evolve in nature.These high copy number repeats can be identified in next generation sequence dataindependently from genome assemblies or in fully assembled reference genomes. We usedRepDeNovo (Chu et al.Chu et al., 20162016) to identify young, newly propagated transposable elements inthree species where sequence data are currently available
M. nervosa , V. ellipsiformis , and
E. hopetonensis . This algorithm uses sequence data without requiring genome assembly toidentify high copy number reads. It then de novo assembles repetitive contigs, and mergesTE copies with roughly less than 1.5-2% sequence divergence. TE copy number and total TEcontent can be estimated based on sequencing coverage across these contigs and TE familiescan be identified through comparisons with RepBase (Bao et al.Bao et al., 20152015). We observe evidenceof a burst of TE activity in
M. nervosa , resulting in an expansion of genome content. Weobserve roughly two fold greater TE content in
M. nervosa compared with the other species(Figure 55). Because these methods do not rely on genome assembly quality, results fromRepDeNovo should be directly comparable across species.These TEs account for a 21% (382 Mb) increase in genome size compared with previouslysequenced bivalve
V. ellipsiformis . A total of 21% of the increase in TE content in
M. nervosa compared with
E. hopetonensis is explained by expansion of
Polinton -like DNA transposons,and 21% is explained by expansion of
Gypsy
LTR retroelements (Figure 55). Remainingclasses each explain 0-5% of TE expansion. Repeat content in
V. ellipsiformis roughlyresembles that of
E. hopetonensis with the exception of the 4-fold increase in
Polinton
DNA12ransposons and a reduced number of hAT repeats (Figure 55). Notably, the
Neptune familyof
Penelope -like elements is identified in
M. nervosa but absent in
E. hopetonensis and
V.ellipsiformis , explaining 5% of the expansion in TE content (Figure 55). Roughly 45% ofthe repetitive content in each genome could not be matched in a tblastx, suggesting somerepeat families are highly diverged from annotated repeats in RepBase (Bao et al.Bao et al., 20152015).The difference in genome size for
V. ellipsiformis and
M. nervosa is primarily explainedby the proliferation of 382 Mb of repeats.
M. nervosa and
E. hopetonensis are from 42Xand 14X coverage data, respectively. Downsampling genomic data to 14X, total TE contentincreases marginally to 465 Mb. Hence, we do not believe the observed differences in TEsbetween species can be explained by sequence depth but rather reflect biological differencesbetween
M. nervosa and the two other bivalve species.RepeatScout identifies older repetitive elements with sufficient divergence to be assembledcorrectly into genome scaffolds (Price et al.Price et al., 20052005). We identify 3686 repeats in
M. nervosa ,1827 of which are greater than 200bp long. We are able to classify 573 of these into familiesusing a tblastx comparison against RepBase at
E < − . We find 98 Polintons, 75 Gypsy elements, 73 hATs , 66
RTE s, 59
DIRS , , and 33
Mariner elements (Figure 66). Otherfamilies represented by multiple elements include
Crack , Kolobok , Nimb , Chapaev , piggyBac and Penelope . These elements encompass 275 Mb of sequence. Diversity for these oldergroups of TEs compared with results of RepDeNovo would suggest that hAT , Mariner , and
RTE elements may have been more active in the ancient past but silent in more recentgenerations. Meanwhile
Gypsy , and
Polinton elements have maintained steady rates ofactivity. RepeatScout depends directly on the quality of genome assembly, and may not beappropriate for cross-species comparisons unless assembly quality is equivalent.We identify only 7
LINE elements, with limited diversity to L1 and L2 retroelements,raising questions about when these selfish TEs may have appeared in the phylogeny ofUnionidae. Interproscan analysis reveals hundreds of open reading frames for TE-associatedprotein domains, many with support in the transcriptome (Table S5S5). These ORFs havebeen removed from gene annotations described above. In total, the results of RepDeNovoand RepeatScout suggest that
M. nervosa harbors a total of 575 Mb of transposons, roughly25% of genome content.
Effective Population Size
The tempo and outcomes of evolution via selection and drift depend directly on the effectivepopulation size. Coalescent theory suggests that heterozygosity, the number of differencesbetween chromosomes in diploid individuals per site, will be distributed according to H = θ = 4 N e µ . Here, N e is the effective population size necessary to produce genetic diversity inan idealized model and µ is the mutation rate. Modern population genetic theory is based onthese parameters and the ways that they shape the response to forces of drift and selection innatural populations. We can use theoretical models to explore the limits of current geneticdiversity and the likelihood of adaptation from SNPs and gene duplications.We have used heterozygosity in reference specimens for three species of bivalves toestimate θ . We observe high heterozygosity suggesting θ M.nervosa = 0 . θ V.ellipsiformis =0 . θ E.hopetonensis = 0 . E. hopetonensis is an endemic to Georgia and South Carolina with alimited range but with census sizes estimated to be over 10 individuals. It is not surprisingthat it would have such lower N e than M. nervosa and
V. ellipsiformis , which are widespreadthroughout the Eastern United States. These observed levels of heterozygosity lead toestimates for N e of 388,500 individuals in M. nervosa . In
V. ellipsiformis N e is estimatedat 300,000 individuals and in E. hopetonensis − is seen in Drosophila(Karasov et al.Karasov et al., 20102010) or 10 or greater in unicellular eukaryotes (LynchLynch, 20072007).We estimate coalescent times based on 2 N e generations of 777,00 generations for M.nervosa , 600,000 generations for
V. ellipsiformis , and 576,000 generations for
E. hopetonensis (Table 11) with t MRCA scaling by 2 times higher for 4 N e generations (WakeleyWakeley, 20092009). Thisevolutionary system is unusual in comparison with life history of organisms commonly usedfor population genomics like yeast and Drosophila (large N e , short generation time) orhumans (small N e , long generation time) (LynchLynch, 20072007). It is rarer to find systems wherepopulations have housed millions of individuals with long generation times. The observedhigh N e in a long lived, late maturing species is expected to influence the tempo of evolution.Given a mutation rate of 5 × − (Pogson and ZourosPogson and Zouros, 19941994) and a modest estimate of 20years per generation, the coalescent times under neutrality for such diversity would span upto 15 million years and total population diversity by 30 million years based on t MRCA .We can estimate the time necessary for a new allele to appear and establish a selectivesweep in a population, T e . We observe waiting times for strong selective sweeps ( s = 0 . M. nervosa and 56,180 generationin
E. hopetonensis . Using an estimated generation time of 20 years, these numberscorrespond to 257,000 years in
M. nervosa and 347,000 years in
E. hopetonensis (Table11), exceptionally long wait times to respond to fluctuating environments. Under strongerselection coefficients of s = 0 .
10, we would expect wait times of 25,700 years in
M. nervosa and 34,700 years in
E. hopetonensis (Table S6S6). Hence, we would suggest that any response tomodern anthropogenic selective pressures in Unionidae will necessarily proceed from standingvariation.If adaptation from new mutations requires long timescales, standing genetic variation isthe major source of adaptive changes in the near term. If standing variation is sufficientto offer genetic solutions to selective pressures, organisms will be able to readily adapt. Ifgenetic diversity is unavailable, animals faced with strong selection will be unable to realizeadaptive change and may be at greater risk of extinction. Evaluating the substrate ofgenetic variation is essential to understand the limits that these species face, especially givenwidespread extinction of freshwater Unionids.The probability of adaptation from standing variation at single nucleotide sites is givenby P sgv = 1 − e − θ ∗ ln (1+2 N e s ) (Hermisson and PenningsHermisson and Pennings, 20052005). Given estimates of N e and θ observed in these three cursory reference genomes, we estimate probabilities of fixation fromstanding variation under a selection coefficient of 0.01 of 4.8% for E. hopetonensis , 5.1% for
V.ellipsiformis , and 6.7% for
M. nervosa . M. nervosa , with its high genetic diversity will holda 33% advantage in the ability to respond to sudden shifts in selective pressures comparedwith endemic species like
E. hopetonensis . Using the mutation rate for new gene formationestimated from birth-death models, we would estimate P sgv for gene family expansion to be148.5%, considerably higher. Hence, gene family expansion is expected to be an importantsubstrate of novelty in M. nervosa . DiscussionGene family expansion and natural selection
Gene family evolution is an important source of novel genetic material that producesinnovation in the face of shifting selective pressures (Conant and WolfeConant and Wolfe, 20082008; OhnoOhno, 19701970;HahnHahn, 20092009). New genes arise through duplication, then accumulate mutations thatmodify original functions (Conant and WolfeConant and Wolfe, 20082008). They may also modify expressionpatterns, especially when regulatory elements are shuffled in the face of partial duplication(Rogers et al.Rogers et al., 20172017). In the face of rapidly changing selective pressures, these substratesof genetic novelty are expected to be all the more essential as organisms require adaptiveinnovation. In
M. nervosa , we estimate rates of gene duplication equivalent to 1 . × − per gene per generation. Such a high mutation rate combined with the large populationsizes estimated for M. nervosa suggest a high probability of adaptation from standinggenetic variation of P sgv = 0 . P sgv for singlenucleotide changes (Table 11). Hence, the data suggest a greater likelihood of adaptationfrom standing genetic variation for any given gene duplication compared with any givenSNP. This theoretical rate suggests that adaptation through gene family amplification mayoffer a rich substrate to respond to environmental changes, especially when new functionsare needed.The M. nervosa genome houses an unusually high rate of copy number expansion indetox genes compared with genome wide background rates of duplication, with a birth rate7.6 times higher. This enhanced birth rate for new gene formation is likely to be responsiblefor the observed excess of cytochrome P450 genes compared with many other annotatedanimals. Moreover, 26 cytochrome P450 genes appear with no divergence at synonymoussites, suggesting recent origin within the past 100 generations (roughly 2000 years). TheTennessee River has been historically subject to high pesticide loads from TVA’s mosquitocontrol programs and from cotton farming (WoodsideWoodside, 20042004). Toxicity assays in Unionidae,including
M. nervosa have identified high tolerance for multiple pesticides and herbicides(Milam et al.Milam et al., 20052005; Loganathan et al.Loganathan et al., 19981998), consistent with high rates of exposure expectedin filter feeders that process large water volumes (Bril et al.Bril et al., 20172017). Toxicity levels inbivalve tissues have been noted in excess of 0.2 ppm (Bedford et al.Bedford et al., 19681968). Given therecent amplification for gene copies and the reduced number of cytochrome P450 genes in
V.ellipsiformis , we would suggest that convergent evolution in marine and freshwater bivalvesrather than ancestral pre-adaptation or shared ancient gene family expansion are responsiblefor adaptation at cytochrome genes.While we placed no prior constraints on functional categories, we note many large genefamilies associated with Unionidae biology. Large gene families with new copies are identifiedamong anticoagulation genes consistent with blood feeding in the parasitic larval stage.We also note expanded families of chitin metabolism genes, mitochondria-eating proteins,Opsins, and heat shock proteins. These large gene families have 3.5-4.0 fold higher birth ratesthan genome background rates, except ABC transporters, suggesting rapid gene gains. Many15opies appear with few mutations distinguishing gene copies (Figure 44), suggesting recentorigin. They further show many copies with high dN/dS, a signature of strong selectivepressures driving amino acid substitutions.Globally, these results suggest that new gene formation has had a substantial impacton the genome of
M. nervosa . It has experienced widespread creation of new genes withstrong signatures of adaptive evolution and recent evolutionary changes. The ways thatthese mutations contribute to adaptive changes in the near term are well worth furtherstudy to determine how they contribute to Unionidae evolution as well as their importancein evolutionary theory.
Transposable elements and genome remodeling
TE mobilization can lead to rapid and dynamic changes in genomes (Oliver and GreeneOliver and Greene,20092009; FeschotteFeschotte, 20082008). Selfish elements can copy, modify, or break gene sequences as well asinduce regulatory changes in neighboring genes (Oliver and GreeneOliver and Greene, 20092009; FeschotteFeschotte, 20082008).These mutations are most often detrimental (LynchLynch, 20072007). They can however incidentallycreate new genetic variation within genomes that shapes evolutionary outcomes. TEs areknown to activate in times of organism stress. It has been proposed that TE activation canbe a source of gene expression remodeling to modify suites of genes in few mutational steps(Oliver and GreeneOliver and Greene, 20092009; Laudencia-Chingcuanco and FowlerLaudencia-Chingcuanco and Fowler, 20122012; Lynch et al.Lynch et al., 20112011).These TEs are related to the high rates of gene family expansion observed in
M. nervosa , asmobile elements are expected to facilitate new gene copies through gene capture, retrogeneformation, and ectopic recombination. The fact that both TEs and gene families exhibitdynamic evolution suggests that these mutations have an exceptional impact on genomeremodeling and evolutionary change.We observe a recent burst of TE mobilization for
Polinton and
Gypsy elements in
M. nervosa that resulted in a 382Mb expansion in genome size. Comparisons witholder transposons in the genome assembly suggest that these families have had ongoingproliferation throughout the history of
M. nervosa . In contrast, hAT , Mariner , and
RTE elements appear to have been active in the more ancient past but largely quiescent inthe recent past. The causes and impacts of these changing TE dynamics deserve furtherexploration in future cross-species comparisons as well as searches for variation withinpopulations.It remains unclear to what extent the recent TE mobilization in
M. nervosa may haveproduced beneficial changes or resulted in widespread damage to genomes. With modernpopulation size reduction (Ahlstedt and McDonoughAhlstedt and McDonough, 19921992), the nearly-neutral coefficientwill have changed by more than one order of magnitude. It is possible that this shift innearly-neutral impacts (OhtaOhta, 19731973) has allowed some TEs to proliferate unchecked (LynchLynch,20072007), even while absolute species numbers have remained very large. It is equally possiblethat strong selection has favored some insertions or TE mediated genome remodeling as ameans to respond to environmental challenges. The observed adaptive signatures and rapidevolution of gene families may point to an interaction between TEs and adaptation throughduplication, a prospect that deserves future exploration.Regardless of whether this TE expansion has produced beneficial or detrimental changes,it is clear that recent proliferation of TEs has altered genome structure and expanded genome16ontent in
M. nervosa compared with its relatives. It is possible that TEs have proliferated inthe face of environmental stressors, declining populations, or even due to horizontal transferfrom other organisms. However, it is also possible that some of these new TE insertionsreflect adaptive changes as selfish genetic elements remodel genomes and incidentally createnew genes or new expression profiles across the
M. nervosa genome. Sorting out the adaptiveTE insertions from detrimental TE insertions will require full population genomic datasets,a prospect that will become tractable with this new reference genome.
Tempo of Evolution
The tempo of evolution depends directly on effective population sizes and generation times.We have observed high heterozygosity in Unionidae, with θ = 0 . M.nervosa . Under these parameters, we expect neutrally evolving variation will experience longcoalescent times of 777,00 generations, up to 15 million years assuming constant generationtimes of 20 years. Population genetic data can be used to extract information concerningpopulation history up until the t MRCA of all individuals in the population, which is expectedto be 4N generations (WakeleyWakeley, 20092009), or 30 million years. This timescale for neutral variationspans ice ages and mass extinctions, encompassing historical environmental changes thathave reshaped the history of the earth. Populations of bivalves from Northern and SouthernAlabama are separated by the Fall Line, where river systems flow from rocky upland provincesto the less consolidated Coastal Plain (Williams et al.Williams et al., 20082008). Periodic connections acrosswaterways as glaciers formed and receded allowed for species migration far in the past,potentially influencing genetic diversity. In light of these numbers, it appears likely thatthe historic and geological forces that have shaped these species may indeed still be writtenin their DNA. This long-standing genetic diversity may offer the chance to use populationgenetics to explore evolutionary theory in deeper timescales than most standard populationgenetic models.Phenotypic studies in freshwater bivalves have noted rampant convergence of formsacross lineages (Williams et al.Williams et al., 20082008), and phylogenetic analysis has been challenging inthe face of variation rather than adherence to strict phenotypes (Campbell et al.Campbell et al., 20052005).It is further possible that simplified mtDNA trees (Campbell et al.Campbell et al., 20052005) may departsignificantly from patterns in whole-genome sequencing. The population genetic parametersfor these three species of Unionidae imply that the coalescent time scale may be greaterthan many species splits. Under these circumstances we may expect rampant trans-speciespolymorphism and widespread convergence from adaptation via shared ancestral variation.Future sequencing efforts in this diverse clade will be essential to understand how populationgenetic and evolutionary forces have shaped the genetic underpinnings of phenotypic diversityin Unionidae.The genetic diversity of
M. nervosa and other Unionidae is shaped by ancestral variation,and is expected to decay gradually over time as populations diminish or become fragmented(WrightWright, 19311931, 19511951). Still, in the face of modern anthropogenic selective pressures inUnionidae, this genetic variation will be essential for adaptation. Population genetic theorypredicts that selective sweeps on new mutations will be slow to appear, requiring tensof thousands of generations or hundreds of millions of years for desired genetic variationto appear (GillespieGillespie, 19911991; Maynard SmithMaynard Smith, 19711971). For adaptation to occur on nearer17imescales, standing variation is expected to offer the substrate for initial adaptive responses.The probability that a genetic variant exists and is able to establish a selective sweepin a population P sgv , depends directly on θ (Hermisson and PenningsHermisson and Pennings, 20052005). In thesebivalves, M. nervosa has the highest level of genetic variation, 35% higher than endemic
E.hopetonensis . This will result in a 33% higher chance of adaptation from standing variation,a key advantage in the response to ongoing selective events. With modern populationreductions (Ahlstedt and McDonoughAhlstedt and McDonough, 19921992), this ancestral standing variation will be evenmore essential than these estimates from historic population sizes might reflect.Census of bivalves noted smaller classes of
M. nervosa below the 4 inch commercialharvest limit were noted to be scarce in Wheeler Lake (Ahlstedt and McDonoughAhlstedt and McDonough, 19921992),though the authors have noted healthy numbers of juveniles near Pickwick Lake. It isunlikely that N e fully reflects the ongoing mussel declines, nor do we expect it reflectsequilibrium levels for reduced populations. It does, however, indicate that the species stillhouse substantial genetic diversity, if populations can be preserved. We would expect adecay in this diversity over time as population sizes drop according to H t = H (1 − N ) t (GillespieGillespie, 20042004). Even assuming the minimum age of reproduction at 5 years of age, it hasbeen fewer than 20 generations since the damming of the Tennessee River began in 1918and even fewer since Pickwick Dam was established in 1938. Hence, we expect that genomewide H will reflect historic rather than modern diversity. The extent to which ongoingtrace migration across the geographic range, especially incidental or intentional transportwith humans, might reseed population diversity remains unknown. Future work sequencingmultiple individuals across the geographic range might clarify whether subpopulations withdistinct genetic and phenotypic features may be emerging in isolated locations or whetherpopulations remain panmictic.Here, Unionidae offer the prospect to study how adaptation occurs in long-lived butnumerous species that is notably different from many population genetic models. Yet,the implications of these differences go far beyond abstract concepts in population genetictheory. Over 70% of Unionidae threatened or endangered (Williams et al.Williams et al., 19931993). They facenumerous environmental challenges, invasive species, and predation or harvesting from thefreshwater pearl industry. Those organisms that house the greatest diversity may indeedhave the best chances for adaptive evolution that could help them survive when new geneticvariation is needed. Genetic resources for freshwater Unionidae
Unionidae currently face a modern day mass extinction (Williams et al.Williams et al., 19931993). They havefaced strong recent selective pressures of pollution, barriers on waterways, invasive species,and loss of fish hosts. These organisms badly need genetic resources to assess populationhealth, monitor diversity, and identify unique groups across geographic locations. Moreover,evolutionary analysis is currently limited by a lack of genetic resources. At present, a singlefragmented whole genome sequence exists across all Unionidae for comparison (Renaut et al.Renaut et al.,20182018), and the nearest outgroup,
C. gigas is 200 million years divergent (Bolotov et al.Bolotov et al.,20172017). Previous work has used limited mtDNA sequencing, targeted single genes, or fewmicrosatellite markers. These single gene studies offer only a limited glimpse at evolutionaryprocesses for these unique animals. The reference genome for
M. nervosa represents a step18oward evolutionary genomics of gene families and TEs in a high-risk clade. This referencegenome offers a portrait of dynamically changing genomes that have responded to strongselection in the past. This work represents a first step to understanding of genetic responsesin a robust species within a larger clade of vulnerable species.These genome sequences can further help target conservation efforts where they areneeded most. Unionidae with lowest sequence diversity can be marked as first priority forconservation. Monitoring whole genome genetic variation can help identify a baseline so thatgenetic diversity is preserved as wildlife management efforts survey freshwater Unionidaepopulations. In the future, even low-level whole genome sequencing to identify nucleargenetic variation is likely to facilitate phylogenetic and evolutionary analysis in Unionidaeand offer a more complete account of genetic diversity in these threatened species. For lowdiversity species, adaptation to newly arising challenges will be increasingly difficult with alesser substrate for adaptation. In these cases, early intervention may be warranted so thattheir modest diversity remains preserved. Additionally, when evaluating which efforts aremost likely to succeed, species that retain moderate to high diversity may have the greatestchances of surviving the modern anthropogenic challenges. Similar genetic studies in thefuture will be increasingly essential for evolutionary theory and conservation in Unionidae.
Data availability
Venustaconcha ellipsiformis transcriptome data are available at SRR6279374 and genomicsequence data were taken from SRR6689532 and SRR6689533.
Megalonaias nervosa genomicsequence data are available in PRJNA646917 and transcriptome data are available at SRAPRJNA646778.
Elliptio crassidens transcriptomes are deposited under SRA PRJNA647326.
Elliptio hopetonensis genomic reads are deposited at SRA PRJNA647325. Full annotations,de novo assembled transcriptomes, and interproscan results are available in SupportingInformation, currently on (DropboxDropbox).
Acknowledgements
We thank Rob Reid and James Baldwin-Brown for helpful suggestions about genomeassembly software and annotation. Jason Wisniewski provided
E. hopetonensis specimens,and Robert Bringolf provided
E. crassidens . Jon Halter, Michael Moseley, and ChadDeWitt provided help with software installation and functionality. James Titus-McQuillanprovided helpful comments on the manuscript draft. We thank the Duke University Schoolof Medicine for the use of the Sequencing and Genomic Technologies Shared Resource, whichprovided Illumina sequencing services for
M. nervosa genomes and transcriptomes. The NCState Sequencing core provided HiC library services. De Novo Genomics generated OxfordNanopore sequences for this study. the University of Colorado Sequencing Core generatedsequence data
E. hopetonensis and the University of Georgia Genomic Core Facility generatedRNA sequencing data for
E. crassidens . All analyses were run on the UNCC CopperheadHigh Performance Computing cluster, supported by UNC Charlotte and the Department ofBioinformatics. 19 unding
This work was supported by startup funding from University of North Carolina, CharlotteDepartment of Bioinformatics and Genomics. Rebekah L. Rogers is funded in part by NIHNIGMS R35 GM133376. The funders had no role in study design, data collection andanalysis, decision to publish, or preparation of the manuscript. The findings and conclusionsin this article are those of the author(s) and do not necessarily represent the views of theU.S. Fish and Wildlife Service.
Author contributions
RLR, SP, CCM and JPW designed experiments and analysesJG collected specimens from wild populationsRLR, SLG, KB, and CCM performed experimentsRLR, SP, and SLG performed analysesRLR, JPW, CCM and JG wrote and edited the manuscript20 eferences
Adams, M. D., Celniker, S. E., Holt, R. A., Evans, C. A., Gocayne, J. D., Amanatides, P. G.,Scherer, S. E., Li, P. W., Hoskins, R. A., Galle, R. F., et al. (2000). The genome sequenceof
Drosophila melanogaster . Science , 287(5461):2185–2195.Adrion, J. R., Song, M. J., Schrider, D. R., Hahn, M. W., and Schaack, S. (2017).Genome-wide estimates of transposable element insertion and deletion rates in
Drosophilamelanogaster . Genome biology and evolution , 9(5):1329–1340.Ahlstedt, S. A. and McDonough, T. A. (1992). Quantitative evaluation of commercial musselpopulations in the Tennessee river portion of Wheeler Reservoir, Alabama. In
Conservationand management of freshwater mussels. Proceedings of a UMRCC symposium , pages 12–14.Aminetzach, Y. T., Macpherson, J. M., and Petrov, D. A. (2005). Pesticideresistance via transposition-mediated adaptive gene truncation in
Drosophila . Science ,309(5735):764–767.Bao, W., Kojima, K. K., and Kohany, O. (2015). Repbase Update, a database of repetitiveelements in eukaryotic genomes.
Mobile DNA , 6(1):11.Bedford, J., Roelofs, E., and Zabik, M. (1968). The freshwater mussel as a biologicalmonitor of pesticide concentrations in a lotic environment 1.
Limnology and Oceanography ,13(1):118–126.Bennett, E. A., Coleman, L. E., Tsui, C., Pittard, W. S., and Devine, S. E. (2004). Naturalgenetic variation caused by transposable elements in humans.
Genetics , 168(2):933–951.Bolotov, I. N., Kondakov, A. V., Vikhrev, I. V., Aksenova, O. V., Bespalaya, Y. V., Gofarov,M. Y., Kolosova, Y. S., Konopleva, E. S., Spitsyn, V. M., Tanmuangpak, K., et al.(2017). Ancient river inference explains exceptional oriental freshwater mussel radiations.
Scientific Reports , 7(1):1–14.Bril, J. S., Langenfeld, K., Just, C. L., Spak, S. N., and Newton, T. J. (2017). Simulatedmussel mortality thresholds as a function of mussel biomass and nutrient loading.
PeerJ ,5:e2838.Cahn, A. R. (1936).
The Molluscan Fauna of the Clinch River Below Norris Dam Upon theCompletion of that Structure . Tennessee Valley Authority.Campbell, D. C., Serb, J. M., Buhay, J. E., Roe, K. J., Minton, R. L., and Lydeard, C. (2005).Phylogeny of North American amblemines (
Bivalvia, Unionoida ): prodigious polyphylyproves pervasive across genera.
Invertebrate Biology , 124(2):131–164.Chu, C., Nielsen, R., and Wu, Y. (2016). REPdenovo: inferring de novo repeat motifs fromshort sequence reads.
PloS One , 11(3):e0150719.Conant, G. C. and Wolfe, K. H. (2008). Turning a hobby into a job: how duplicated genesfind new functions.
Nature Reviews Genetics , 9(12):938–950.21onsortium, I. H. G. S. et al. (2001). Initial sequencing and analysis of the human genome.
Nature , 409:860–921.Demuth, J. P., De Bie, T., Stajich, J. E., Cristianini, N., and Hahn, M. W. (2006). Theevolution of mammalian gene families.
PloS One , 1(1):e85.El-Gebali, S., Mistry, J., Bateman, A., Eddy, S. R., Luciani, A., Potter, S. C., Qureshi, M.,Richardson, L. J., Salazar, G. A., Smart, A., et al. (2019). The pfam protein familiesdatabase in 2019.
Nucleic acids research , 47(D1):D427–D432.
Drosophila
12 Genomes Consortium et al. (2007). Evolution of genes and genomes on the
Drosophila phylogeny.
Nature , 450(7167):203.Feschotte, C. (2008). Transposable elements and the evolution of regulatory networks.
NatureReviews Genetics , 9(5):397.Garner, J. and McGregor, S. (2001). Current status of freshwater mussels (Unionidae,Margaritiferidae) in the Muscle Shoals area of Tennessee River in Alabama (Muscle Shoalsrevisited again).
American Malacological Bulletin , 16(1-2):155–170.Gillespie, J. H. (1991).
The Causes of Molecular Evolution , chapter 5, page 232. OxfordUniversity Press.Gillespie, J. H. (2004).
Population genetics: a concise guide . JHU Press.Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert,F., Hoheisel, J. D., Jacq, C., Johnston, M., et al. (1996). Life with 6000 genes.
Science ,274(5287):546–567.Haag, W. R. (2012).
North American freshwater mussels: natural history, ecology, andconservation . Cambridge University Press.Haag, W. R. and Rypel, A. L. (2011). Growth and longevity in freshwater mussels:evolutionary and conservation implications.
Biological Reviews , 86(1):225–247.Haggerty, T. M., Garner, J. T., and Rogers, R. L. (2005). Reproductive phenologyin
Megalonaias nervosa (Bivalvia: Unionidae) in Wheeler Reservoir, Tennessee River,Alabama, USA.
Hydrobiologia , 539(1):131–136.Hahn, M. W. (2009). Distinguishing among evolutionary models for the maintenance of geneduplicates.
The Journal of Heredity , 100(5):605–617.Hahn, M. W., Han, M. V., and Han, S.-G. (2007). Gene family evolution across 12
Drosophila genomes.
PLoS Genetics , 3(11):e197.Han, M. V., Demuth, J. P., McGrath, C. L., Casola, C., and Hahn, M. W. (2009). Adaptiveevolution of young gene duplicates in mammals.
Genome Research , 19(5):859–867.Hermisson, J. and Pennings, P. S. (2005). Soft sweeps: molecular population genetics ofadaptation from standing genetic variation.
Genetics , 169(4):2335–2352.22som, B. G., Yokley, P., and Gooch, C. H. (1973). Mussels of the Elk River Basin in Alabamaand Tennessee-1965-1967.
American Midland Naturalist , pages 437–442.Jurka, J., Kapitonov, V. V., Pavlicek, A., Klonowski, P., Kohany, O., and Walichiewicz, J.(2005). RepBase update, a database of eukaryotic repetitive elements.
Cytogenetic andGenome Research , 110(1-4):462–467.Karasov, T., Messer, P. W., and Petrov, D. A. (2010). Evidence that adaptation in
Drosophila is not limited by mutation at single sites.
PLoS Genetics , 6(6):e1000924.Kent, W. J. (2002). BLAT the BLAST-like alignment tool.
Genome Research , 12(4):656–664.Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam,H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., et al. (2007). Clustal w and clustalx version 2.0. bioinformatics , 23(21):2947–2948.Laudencia-Chingcuanco, D. and Fowler, D. B. (2012). Genotype-dependent burst oftransposable element expression in crowns of hexaploid wheat (
Triticum aestivum L. )during cold acclimation.
Comparative and Functional Genomics , 2012.Li, H. and Durbin, R. (2011). Inference of human population history from individualwhole-genome sequences.
Nature , 475(7357):493.Liu, H.-P., Mitton, J. B., and Wu, S.-K. (1996). Paternal mitochondrial DNA differentiationfar exceeds maternal mitochondrial DNA and allozyme differentiation in the freshwatermussel,
Anodonta grandis grandis . Evolution , 50(2):952–957.Loganathan, B., Neale, J., Sickel, J., Sajwan, K., and Owen, D. (1998). Persistentorganochlorine concentrations in sediment and mussel tissues from the lowermostTennessee River and Kentucky lake, USA.
Organohalogen Compounds , 39:121–124.Lopez, J. W., Parr, T. B., Allen, D. C., and Vaughn, C. C. (2020). Animal aggregationspromote emergent aquatic plant production at the aquatic-terrestrial interface.
Ecology ,page e03126.Lynch, M. (2007).
The Origins of Genome Architecture , page 494. Sinauer Associates,Sunderland, Mass.Lynch, M. and Conery, J. S. (2000). The evolutionary fate and consequences of duplicategenes.
Science , 290(5494):1151–1155.Lynch, V. J., Leclerc, R. D., May, G., and Wagner, G. P. (2011). Transposon-mediatedrewiring of gene regulatory networks contributed to the evolution of pregnancy inmammals.
Nature Genetics , 43(11):1154.MacManes, M. D. (2018). The Oyster River Protocol: a multi-assembler and kmer approachfor de novo transcriptome assembly.
PeerJ , 6:e5428.Marmeisse, R., Nehls, U., ¨Opik, M., Selosse, M.-A., and Pringle, A. (2013). Bridgingmycorrhizal genomics, metagenomics and forest ecology.
New Phytologist , 198(2):343–346.23aynard Smith, J. (1971). What use is sex?
Journal of Theoretical Biology , 30(2):319–335.McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A.,Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al. (2010). The genome analysistoolkit: a mapreduce framework for analyzing next-generation DNA sequencing data.
Genome Research , 20(9):1297–1303.Milam, C., Farris, J., Dwyer, F., and Hardesty, D. (2005). Acute toxicity of six freshwatermussel species (glochidia) to six chemicals: Implications for daphnids and
Utterbackiaimbecillis as surrogates for protection of freshwater mussels (Unionidae).
Archives ofEnvironmental Contamination and Toxicology , 48(2):166–173.Ohno, S. (1970).
Evolution by Gene Duplication . Springer-Verlag, Berlin, New York.Ohta, T. (1973). Slightly deleterious mutant substitutions in evolution.
Nature ,246(5428):96–98.Oliver, K. R. and Greene, W. K. (2009). Transposable elements: powerful facilitators ofevolution.
Bioessays , 31(7):703–714.Orr, H. A. (2005). The genetic theory of adaptation: a brief history.
Nature Reviews Genetics ,6(2):119–127.Ortmann, A. (1924). Mussel Shoals.
Science , 60(1564):565–566.Pertea, G. and Pertea, M. (2020). Gff utilities: GFFRead and GFFCompare.
F1000Research ,9(304):304.Platt, J. R. (2018). America’s freshwater mussels are going extinct–here’s why that sucks.
Scientific American .Pogson, G. H. and Zouros, E. (1994). Allozyme and rflp heterozygosities as correlatesof growth rate in the scallop
Placopecten magellanicus : a test of the associativeoverdominance hypothesis.
Genetics , 137(1):221–231.Price, A. L., Jones, N. C., and Pevzner, P. A. (2005).
De novo identification of repeatfamilies in large genomes.
Bioinformatics , 21(suppl 1):i351–i358.Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., andLopez, R. (2005). Interproscan: protein domains identifier.
Nucleic Acids Research ,33(suppl 2):W116–W120.Renaut, S., Guerra, D., Hoeh, W. R., Stewart, D. T., Bogan, A. E., Ghiselli, F., Milani,L., Passamonti, M., and Breton, S. (2018). Genome survey of the freshwater mussel
Venustaconcha ellipsiformis (Bivalvia: Unionida) using a hybrid de novo assemblyapproach.
Genome Biology and Evolution , 10(7):1637–1646.Rogers, R. L., Bedford, T., and Hartl, D. L. (2009). Formation and longevity of chimericand duplicate genes in
Drosophila melanogaster . Genetics , 181(1):313–322.24ogers, R. L., Shao, L., and Thornton, K. R. (2017). Tandem duplications lead tonovel expression patterns through exon shuffling in
Drosophila yakuba . PLoS genetics ,13(5):e1006795.Schmidt, J. M., Good, R. T., Appleton, B., Sherrard, J., Raymant, G. C., Bogwitz, M. R.,Martin, J., Daborn, P. J., Goddard, M. E., Batterham, P., et al. (2010). Copy numbervariation and transposable elements feature in recent, ongoing adaptation at the cyp6g1locus.
PLoS Genet , 6(6):e1000998.Sim˜ao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V., and Zdobnov, E. M.(2015). BUSCO: assessing genome assembly and annotation completeness with single-copyorthologs.
Bioinformatics , 31(19):3210–3212.Sørensen, J. G., Kristensen, T. N., and Loeschcke, V. (2003). The evolutionary and ecologicalrole of heat shock proteins.
Ecology Letters , 6(11):1025–1037.Stanke, M., Diekhans, M., Baertsch, R., and Haussler, D. (2008). Using native andsyntenically mapped cDNA alignments to improve de novo gene finding.
Bioinformatics ,24(5):637–644.Stanke, M. and Waack, S. (2003). Gene prediction with a hidden markov model and a newintron submodel.
Bioinformatics , 19(suppl 2):ii215–ii225.Strayer, D. L. (1999). Effects of alien species on freshwater mollusks in North America.
Journal of the North American Benthological Society , 18(1):74–98.Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). Clustal w: improvingthe sensitivity of progressive multiple sequence alignment through sequence weighting,position-specific gap penalties and weight matrix choice.
Nucleic Acids Research ,22(22):4673–4680.Wakeley, J. (2009).
Coalescent Theory: An Introduction , page 97. Roberts & CompanyPublishers.Waterhouse, R. M., Seppey, M., Sim˜ao, F. A., Manni, M., Ioannidis, P., Klioutchnikov,G., Kriventseva, E. V., and Zdobnov, E. M. (2017). BUSCO applications from qualityassessments to gene prediction and phylogenomics.
Molecular Biology and Evolution ,35(3):543–548.Wen, H. B., Cao, Z. M., Hua, D., Xu, P., Ma, X. Y., Jin, W., Yuan, X. H., and Gu, R. B.(2017). The complete maternally and paternally inherited mitochondrial genomes of afreshwater mussel
Potamilus alatus (Bivalvia: Unionidae) . PloS One , 12(1):e0169749.Williams, J. D., Bogan, A. E., Garner, J. T., et al. (2008).
Freshwater mussels of Alabamaand the Mobile basin in Georgia, Mississippi, and Tennessee . University of Alabama Press.Williams, J. D., Warren Jr, M. L., Cummings, K. S., Harris, J. L., and Neves, R. J. (1993).Conservation status of freshwater mussels of the United States and Canada.
Fisheries ,18(9):6–22. 25oodside, M. D. (2004).
Water quality in the lower Tennessee River Basin, Tennessee,Alabama, Kentucky, Mississippi, and Georgia, 1999-2001 , volume 1233. US GeologicalSurvey.Woody, C. A. and Holland-Bartels, L. (1993). Reproductive characteristics of a populationof the washboard mussel
Megalonaias nervosa (Rafinesque 1820) in the upper MississippiRiver.
Journal of Freshwater Ecology , 8(1):57–66.Wright, S. (1931). Evolution in mendelian populations.
Genetics , 16(2):97.Wright, S. (1951). The genetical structure of populations.
Annals of Eugenics , 15(1):323–354.Yang, Z. (1997). PAML: a program package for phylogenetic analysis by maximum likelihood.
Bioinformatics , 13(5):555–556.Zhang, G., Fang, X., Guo, X., Li, L., Luo, R., Xu, F., Yang, P., Zhang, L., Wang, X., Qi,H., et al. (2012). The oyster genome reveals stress adaptation and complexity of shellformation.
Nature , 490(7418):49–54.Zimin, A. V., Mar¸cais, G., Puiu, D., Roberts, M., Salzberg, S. L., and Yorke, J. A. (2013).The masurca genome assembler.
Bioinformatics , 29(21):2669–2677.26able 1: Population genetic parameters of three species of bivalves under moderate selection.Species θ = 4 N e µ N e P sgv T e (gens) M. nervosa
V. ellipsiformis
E. hopetonensis s = 0 .
01 27igure 1: A 127,552 bp contig containing 3 cytochrome P450 genes. Exons are representedwith rectangles, introns with connecting lines. Repetitive elements identified usingRepeatScout are marked in green. Genes on the reverse strand are marked below and geneson the forward strand are above. 28igure 2: Detox genes among annotations for
M. nervosa and
V. ellipsiformis .29igure 3: Gene family member ages across the genome of
M. nervosa , are fit with abirth-death model. The distribution of duplicate ages, given by dS, shows larger numbers ofyounger gene family members, tapering off exponentially as expected for a Poisson process.Model fit suggests a birth rate λ = 40 ,
859 (95% CI 40,262-41,455) and death rate µ = 6 . ytochromes nu m be r o f b r an c he s dS Opsins nu m be r o f b r an c he s dS Mitochondria Eating nu m be r o f b r an c he s dS Chitin nu m be r o f b r an c he s dS Hsp 70 nu m be r o f b r an c he s dS von Willebrand nu m be r o f b r an c he s dS ABC transporters nu m be r o f b r an c he s dS Figure 4: Gene family member ages for high copy number gene families in
M. nervosa .We fitted a birth-death models (red lines) to identify families with rapid gain and losscompared with genomic background (dashed lines). Cytochrome P450 genes show highrates of birth and death compared with background levels with over 7.6-fold increased birthrates (
P < . M. nervosa has the highestaggregate TE content across all families with 382 Mb of new genome content.
Gypsy elementsand
Polinton elements each explain 21% of the expansion in genome content. A burst of
Polinton amplification is observed in
V. ellipsiformis .32igure 6: Number of repeat sequences identified in the
M. nervosa reference assembly usingRepeatScout for the 6 most common TE families in
M. nervosa . Polinton and
Gypsy elements harbor the greatest number of independent repeats, consistent with results showingongoing proliferation in assembly-free methods. Other prevalent elements include hAT , RTE , DIRS and
Mariner elements.
LINE elements showed little representation among repeats.33igure 7: Heterozygosity in
M. nervosa , V. ellipsiformis , and
E. hopetonensis . M. nervosa shows the highest heterozygosity, consistent with its expansive range across the SoutheasternUnited States and Northern Mississippi River.34 upporting Information
M. nervosa reference genome qualitycontig N50 51,552scaffold N50 52,791max contig length 588,638max scaffold length 588,638Greater than 10 kb 92%36able S2: Number of Gene Family Members by Functional Class
Function
M. nervosa V. ellipsiformis
Cytochrome P450s 143 48ABC transporters 106 53Hsp70s 96 52Opsins 238 5Mitochondria eating proteins 117 6Chitin metabolism 66 14von Willebrand factors 138 4 de novo assembled transcriptomes
Function
V. ellipsiformis E. crassidens M. nervosa
Cytochrome P450s 129 280 242ABC transporters 262 243 122Glutathione transferase 81 76 131Thioredoxins 154 76 114Hsp70s 52 313 117Mitochondrial eating 90 487 210Chitin metabolism 92 175 104von Willebrand Factors 130 378 209
Class Total Paralogs Death rate µ µ fold-change Birth rate λ λ fold-changeAll 49,149 6002 6.8 - 40859 -Cytochrome P450 143 88 10.2 1.5 898 7.6Opsin 238 110 7.2 1.1 793 4.0Chitin metabolism 66 35 6.4 0.9 224 4.0Mt Eating protein 117 55 6.4 0.9 353 3.6Hsp70 96 61 4.6 0.7 283 3.5VonWillebrand factor 138 52 7.8 1.1 406 3.5ABC transporter 106 28 4.3 0.6 122 1.4 θ = 4 N e µ N e P sgv T e (gens) M. nervosa
V. ellipsiformis
E. hopetonensis s = 0 . Distribution of kmer counts for one sequencing lane of
M. nervosa . The distribution shows nosuggestion of polyploidy.
M. nervosa , E. crassidens , and
V. ellipsiformis . M. nervosa and
E. crassidens house a greater number of cytochrome P450 transcripts, glutathione transferases, and thioredoxins.