<p>This dataset presents the <em>Bradysia coprophila</em> genome annotations Bcop_v1.0. It will be used as a starting point to manually improve annotations. </p>
<p>The annotations were generated using Maker2. Detailed bioinformatic methods are given in the supplemental material of our preprint, "Single-molecule sequencing of long DNA molecules allows high contiguity de novo genome assembly for the fungus fly, <em>Sciara coprophila</em>" (doi: <a href="https://doi.org/10.1101/2020.02.24.963009">https://doi.org/10.1101/2020.02.24.963009</a>); see the Table of Contents therein. A much briefer description follows. Note that <em>Sciara coprophila</em> is synonymous with <em>Bradysia coprophila</em>; the former name was used in the title of our publication for historical reasons.</p>
<p>Repeat library used for masking: Species-specific repeat libraries were built using RepeatModeler. A more comprehensive repeat library was created by adding previously known repeat sequences from <em>Bradysia coprophila</em> and all arthropod repeats in the RepeatMasker Combined Database: Dfam_Consensus-20181026, RepBase-20181026. The comprehensive repeat library was used with RepeatMasker as part of the Maker2 pipeline.</p>
<p>Automated gene finding: To predict protein-coding genes, Maker2 was used to take advantage of three sources of evidence: RNA-seq expression evidence, homology, and gene prediction. RNA-seq data from both male and female embryos, larvae, pupae, and adults were combined to create transcriptome assemblies using Trinity (de novo) and HiSat2 followed by StringTie (genome-guided). The transcriptome assemblies were used as EST evidence in Maker2. Transcript and protein sequences from related species were used for homology evidence. Three gene predictors were used: Augustus, SNAP, and GeneMark-ES. See the supplemental materials in our preprint for more information on iterative Maker2 rounds, training each gene predictor, RNA-seq methods, and transcriptome assembly generation. The Maker2 gene annotations of the final round were evaluated using annotation edit distances, BUSCO, RSEM-Eval, and TransRate.</p>
<p>Functional information: InterProScan was used to identify Pfam domains and GO terms from predicted protein sequences, and BLASTp was used to find best matches to curated proteins in the UniProtKB/Swiss-Prot database.</p>
<div><br>Resources in this dataset:</div><br><ul><li><p>Resource Title: Bradysia coprophila genome annotations Bcop_v1.0.</p> <p>File Name: bradysia_coprophila.bcop_v1.0.tar.gz</p><p>Resource Description: Primary file:
- Bradysia_coprophila.Bcop_v1.0_gene_set.gff
- Contains automated annotations from Maker2 (described in <a href="https://doi.org/10.1101/2020.02.24.963009" target="_blank">https://doi.org/10.1101/2020.02.24.963009</a>).
- This is the main file in this tar archive.
- The reference genome fasta is available from GenBank: <a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1" target="_blank">https://www.ncbi.nlm.nih.gov/assembly/GCA_014529535.1</a>.
- The Seqid in Column 1 of this gff3 file corresponds to the 'Sequence name' in the GenBank assembly report: <a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt" target="_blank">https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/529/535/GCA_014529535.1_BU_Bcop_v1/GCA_014529535.1_BU_Bcop_v1_assembly_report.txt</a>
Supplementary files:
- Bradysia_coprophila.Bcop_v1.0_evidence.rnd3.gff
- Contains aligned evidence Maker2 used.
- Bradysia_coprophila.Bcop_v1.0_masked_genome.rnd3.gff
- Contains coordinates for masked regions of the genome as seen by Maker2.
- Bradysia_coprophila.Bcop_v1.0_proteins_with_putative_function.fasta
- Contains predicted protein sequences.
- Bradysia_coprophila.Bcop_v1.0_transcripts_with_putative_function.fasta
- Contains predicted transcript sequences.
</p></li></ul>
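<p>Because the Seqid in Column 1 of the primary gff file matches the 'Sequence name' in the GenBank assembly report rather than the GenBank accession, users who work with the GenBank fasta may want to translate one to the other. The following is a minimal sketch of one way to do that, not part of the dataset itself. It assumes the assembly report is tab-delimited with comment lines starting with "#", the sequence name in column 1, and the GenBank accession in column 5; check the header line of the downloaded report before relying on this. The function names and the <code>convert</code> helper are hypothetical.</p>

```python
def read_name_to_accession(report_lines):
    """Build a {sequence name: GenBank accession} map from assembly report lines.

    Assumption: tab-delimited rows, '#' comment lines, sequence name in
    column 1 and GenBank accession in column 5.
    """
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        mapping[fields[0]] = fields[4]
    return mapping


def rename_gff_seqids(gff_lines, mapping):
    """Yield GFF3 lines with column 1 translated through `mapping`.

    Comment/directive lines, blank lines, and anything after a ##FASTA
    directive are passed through unchanged. Seqids missing from the
    mapping are left as they are.
    """
    in_fasta = False
    for line in gff_lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        if in_fasta or line.startswith("#") or not line.strip():
            yield line
            continue
        seqid, tab, rest = line.partition("\t")
        yield mapping.get(seqid, seqid) + tab + rest


def convert(report_path, gff_path, out_path):
    """Hypothetical helper: write a copy of the GFF3 with accession seqids."""
    with open(report_path) as report:
        mapping = read_name_to_accession(report)
    with open(gff_path) as gff, open(out_path, "w") as out:
        out.writelines(rename_gff_seqids(gff, mapping))
```

<p>For example, <code>convert</code> could be called with the path to the downloaded assembly report, the path to Bradysia_coprophila.Bcop_v1.0_gene_set.gff extracted from the tar archive, and an output path of your choosing. Directive lines such as <code>##sequence-region</code> are not rewritten in this sketch, so they would need separate handling if downstream tools depend on them.</p>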