
Bradysia coprophila genome annotations Bcop_v1.0

Download (96.8 MB)
dataset
posted on 2024-02-16, 15:39 authored by John Urban
<p>This dataset presents the <em>Bradysia coprophila</em> genome annotations Bcop_v1.0. It will be used as a starting point for manually improving the annotations.</p> <p>The annotations were generated using Maker2. Detailed bioinformatic methods are described in the supplemental material of our preprint titled "Single-molecule sequencing of long DNA molecules allows high contiguity de novo genome assembly for the fungus fly, <em>Sciara coprophila</em>" (doi: <a href="https://doi.org/10.1101/2020.02.24.963009">https://doi.org/10.1101/2020.02.24.963009</a>). See the Table of Contents therein. A far briefer description is given below. Note that <em>Sciara coprophila</em> is synonymous with <em>Bradysia coprophila</em>, and was used in the title of our publication for historical reasons.</p> <p>Repeat library used for masking: Species-specific repeat libraries were built using RepeatModeler. A more comprehensive repeat library was created by adding previously known repeat sequences from <em>Bradysia coprophila</em> and all Arthropod repeats in the RepeatMasker Combined Database: Dfam_Consensus-20181026, RepBase-20181026. The comprehensive repeat library was used with RepeatMasker as part of the Maker2 pipeline.</p> <p>Automated gene finding: To predict protein-coding genes, Maker2 was used to take advantage of three sources of evidence: RNA-seq expression evidence, homology, and gene prediction. RNA-seq data from both male and female embryos, larvae, pupae, and adults were combined to create transcriptome assemblies using Trinity (de novo) and HiSat2 followed by StringTie (genome-guided). The transcriptome assemblies were used as EST evidence in Maker2. Transcript and protein sequences from related species were used as homology evidence. Three gene predictors were used: Augustus, SNAP, and GeneMark-ES. 
See the supplemental materials in our preprint for more information on iterative Maker2 rounds, training each gene predictor, RNA-seq methods, and transcriptome assembly generation. The Maker2 gene annotations of the final round were evaluated using annotation edit distances, BUSCO, RSEM-Eval, and TransRate.</p> <p>Functional information: InterProScan was used to identify Pfam domains and GO terms from predicted protein sequences, and BLASTp was used to find best matches to curated proteins in the UniProtKB/Swiss-Prot database.</p> <div><br>Resources in this dataset:</div><br><ul><li><p>Resource Title: Bradysia coprophila genome annotations Bcop_v1.0.</p> <p>File Name: bradysia_coprophila.bcop_v1.0.tar.gz</p><p>Resource Description:</p>
<p>Primary file:</p>
<ul><li>Bradysia_coprophila.Bcop_v1.0_gene_set.gff: the main file in this tar archive. Contains automated annotations from Maker2 (described in <a href="https://doi.org/10.1101/2020.02.24.963009" target="_blank">https://doi.org/10.1101/2020.02.24.963009</a>).</li>
<li>The reference genome FASTA is available from GenBank: <a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1" target="_blank">https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1</a></li>
<li>The Seqid in column 1 of this GFF3 file corresponds to the 'Sequence name' in the GenBank assembly report: <a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt" target="_blank">https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt</a></li></ul>
<p>Supplementary files:</p>
<ul><li>Bradysia_coprophila.Bcop_v1.0_evidence.rnd3.gff: contains the aligned evidence used by Maker2.</li>
<li>Bradysia_coprophila.Bcop_v1.0_masked_genome.rnd3.gff: contains coordinates of the masked regions of the genome as seen by Maker2.</li>
<li>Bradysia_coprophila.Bcop_v1.0_proteins_with_putative_function.fasta: contains predicted protein sequences.</li>
<li>Bradysia_coprophila.Bcop_v1.0_transcripts_with_putative_function.fasta: contains predicted transcript sequences.</li></ul></li></ul>
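Because column 1 of the GFF3 file uses the 'Sequence name' from the GenBank assembly report rather than the GenBank accession, users pairing the annotations with the GenBank FASTA may need to translate sequence IDs. The following is a minimal sketch of one way to do that, not part of the dataset itself. It assumes the assembly report is tab-delimited with the sequence name in the first column and the GenBank accession in the fifth column, and that header lines start with "#"; check the report header before relying on these positions. The function names are hypothetical.

```python
def load_seqname_to_accession(report_lines):
    """Build a {sequence name: GenBank accession} map from assembly report lines.

    Assumption: tab-delimited rows, column 0 = sequence name,
    column 4 = GenBank accession; lines starting with '#' are headers.
    """
    mapping = {}
    for line in report_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) > 4:
            mapping[fields[0]] = fields[4]
    return mapping


def rename_gff_seqids(gff_lines, mapping):
    """Yield GFF3 lines with column 1 replaced by the mapped accession.

    Comment and directive lines (starting with '#') pass through unchanged,
    so any '##sequence-region' directives keep their original IDs.
    Seqids missing from the mapping are left as they are.
    """
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        fields[0] = mapping.get(fields[0], fields[0])
        yield "\t".join(fields)
```

In use, one would open the assembly report and Bradysia_coprophila.Bcop_v1.0_gene_set.gff as text files, pass the file objects to these two functions in turn, and write the yielded lines to a new GFF3 file. Leaving unmapped IDs untouched is a deliberate choice so that a mismatched report fails visibly downstream rather than silently dropping features.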

Funding

National Science Foundation: MCB-1607411

National Institutes of Health: GM121455

National Institutes of Health: T32-GM007601

National Science Foundation: EPSCoR #1004057

National Science Foundation: GRFP-DGE-1058262

Data contact name

Urban, John

Data contact email

dr.john.urban@gmail.com

Publisher

Ag Data Commons

Temporal Extent Start Date

2013-01-01

Temporal Extent End Date

2016-12-31

Theme

  • Not specified

Geographic Coverage

{"type":"FeatureCollection","features":[{"geometry":{"type":"Polygon","coordinates":[[[-76.625551823527,39.33137246469],[-76.624420434237,39.32985385411],[-76.623266665265,39.330517889657],[-76.623146468773,39.331253300693],[-76.624360503629,39.331666289305],[-76.625551823527,39.33137246469]]]},"type":"Feature","properties":{}},{"geometry":{"type":"Point","coordinates":[-76.624924354255,39.331016171513]},"type":"Feature","properties":{}},{"geometry":{"type":"Point","coordinates":[-73.469181396067,40.860168798284]},"type":"Feature","properties":{}},{"geometry":{"type":"Point","coordinates":[-71.401475393213,41.828731518494]},"type":"Feature","properties":{}}]}

Geographic location - description

USA: Northeast

ISO Topic Category

  • biota

Ag Data Commons Group

  • Insects - i5K

National Agricultural Library Thesaurus terms

genomics; data collection; bioinformatics; DNA; genome assembly; fungi; arthropods; databases; automation; prediction; males; females; larvae; pupae; adults; transcriptome; expressed sequence tags; amino acid sequences; proteins; Bradysia; Sciara coprophila

Pending citation

  • No

Public Access Level

  • Public

Preferred dataset citation

Urban, John (2021). Bradysia coprophila genome annotations Bcop_v1.0. Ag Data Commons. https://doi.org/10.15482/USDA.ADC/1522618
