Salmonella enterica pangenome graph and variant call data for 539,283 genomes
Description:
Salmonella enterica causes human disease and decreases agricultural production. The overall goal of this project is to generate a large database of S. enterica variants, with 539,283 samples and 236,069 features, for applications in machine learning and genomics. We transformed single nucleotide polymorphism (SNP) data into reduced-dimensional representations that tolerate missing data, using disentangled variational autoencoders. TFRecord files were made with custom Python scripts that parsed the variant call format (VCF) files into sparse tensors and combined them with the Salmonella In Silico Typing Resource (SISTR) serotype data.
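To make the sparse representation concrete, here is a minimal sketch of how one sample's genotype calls could be turned into the kind of sparse record this dataset describes (tensor shape, indices of non-zero variants, and values, with 99 standing in for '.' calls). It assumes haploid calls given as plain allele indices. The function names `encode_sparse` and `decode_sparse` are hypothetical and are not taken from the project's scripts.

```python
MISSING_CODE = 99  # the dataset description assigns 99 to '.' (missing) calls


def encode_sparse(genotypes):
    """Turn per-site genotype strings into (shape, indices, values).

    Assumption: each call is a haploid allele index such as "0", "1", "2",
    or "." for a missing call. Reference calls ("0") are not stored.
    """
    indices, values = [], []
    for site, call in enumerate(genotypes):
        value = MISSING_CODE if call == "." else int(call)
        if value != 0:
            indices.append(site)
            values.append(value)
    return (len(genotypes),), indices, values


def decode_sparse(shape, indices, values):
    """Rebuild the dense vector; None marks missing calls."""
    dense = [0] * shape[0]
    for site, value in zip(indices, values):
        dense[site] = None if value == MISSING_CODE else value
    return dense


calls = ["0", "1", ".", "0", "2"]  # toy example, not real data
shape, indices, values = encode_sparse(calls)
print(shape, indices, values)
print(decode_sparse(shape, indices, values))
```

Keeping missing calls as an explicit code, rather than dropping them, lets a downstream model distinguish "reference allele" from "no call", which matters for methods meant to tolerate missing data.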
The data directory contains:
- The tar file of TFRecords: tfrecords.tar (103 GB). The TFRecords are organized first by how they were genotyped: the mpileup records were created with Mpileup, and the gvg records were created with graph variant calling. Each of these directories contains batches of ~10,000 samples named Sra10k_XX.tfrecord.gz (00-54). File Sra10k_99.tfrecord.gz contains incomplete SRAs. Each TFRecord contains the shape of the tensor, the indices of non-zero variants, sample name, serotype, and sparse values. Value 99 was assigned to '.' records.
- The file output.tar (11.4 TB) contains the .vcf files used to create the TFRecords above. The same data is stored more succinctly in the TFRecord format, so this file will not normally be needed.
- A tar file of metadata files for the samples, metadata (95 MB). Sequence Read Archive (SRA) accessions were downloaded using edirect/eutilities and saved as SraAccList.txt.
esearch -db sra -query "txid28901[Organism:exp] AND (cluster_public[prop] AND 'biomol dna'[Properties] AND 'library layout paired'[Properties] AND 'platform illumina'[Properties] AND 'strategy wgs'[Properties] OR 'strategy wga'[Properties] OR 'strategy wcs'[Properties] OR 'strategy clone'[Properties] OR 'strategy finishing'[Properties] OR 'strategy validation'[Properties])" | efetch -format runinfo -mode xml | xtract -pattern Row -element Run > SraAccList.txt
Google BigQuery was used to download metadata for the SRA accessions from the National Institutes of Health (NIH).
SELECT * FROM `nih-sra-datastore.sra.metadata` as metadata INNER JOIN `{table_id}` as leiacc ON metadata.acc = leiacc.accID;
Files were processed into batches of ~10,000 and named Sra_completed_XX.csv (00-53).
- A VCF document mapping the TFRecord data to positions in the graph relative to the type strain LT2: mapping/DRR452337.gvg.vcf-with_TFRecord_in_1st_column.txt
- Scripts for creating and reading TFRecord data: code.
  - reading_and_parsing_fns.py defines functions for converting VCFs of variants called using gvg to sparse tensors and makes the TFRecord files.
  - gvg_to_tfrecord.py creates TFRecords from the sparse tensors.
- Tutorial for using the TFRecords: Example_logistic_regression.md
- Pangenome graph files and references used for variant calling and genotyping: pangenome.
  - refPlus100.fasta.gz: the genomes of the 101 Salmonella strains, without plasmids, used for construction of the pangenome graph.
  - salm.100.NC_003197_v2.d2_complete.gfa.gz: the complete 101 Salmonella strain pangenome graph in Graphical Fragment Assembly Format 2.0 (GFA2), including alt nodes used for genotyping.
  - salm.100.NC_003197_v2.full.gfa.gz: the full graph including alt nodes.
  - salm.100.NC_003197_v2.full.vcf.gz: a VCF of the file above.
  - genotyped.gvg.vcf: the genotype calls in VCF format.
  - paths.txt: the paths of the graph.
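As a quick sanity check after downloading one of the GFA files listed above, the sketch below streams a (possibly gzipped) GFA file and tallies lines by record type, which is the first tab-separated field on each line. It makes no assumption about which record types the graph uses, so it should behave the same for GFA1 or GFA2 content. The function name `count_gfa_records` is hypothetical and is not part of the dataset's code directory.

```python
import gzip
from collections import Counter


def count_gfa_records(path):
    """Count GFA lines by record type (the first tab-separated field).

    Reads line by line, so large graphs are never loaded fully into memory.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and '#' comment lines.
            if not line or line.startswith("#"):
                continue
            counts[line.split("\t", 1)[0]] += 1
    return counts
```

For example, `count_gfa_records("salm.100.NC_003197_v2.full.gfa.gz")` would return a `Counter` keyed by record type (such as `S` for segments), assuming the file has been extracted to the working directory.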
SCINet users: The data folder can be accessed or retrieved with a valid SCINet account at this location: /LTS/ADCdatastorage/NAL/published/node28083194/
See the SCINet File Transfer guide for more information on moving large files: https://scinet.usda.gov/guides/data/datatransfer
Globus users: The files can also be accessed through Globus by following this data link. The user will need to log in to Globus in order to access this data. User accounts are free of charge with several options for signing on. Instructions for creating an account are on the login page.
Funding
FACT: Salmonella Typing and Phenotypic Prediction From Genomes and Metagenomes Using Population Genomics and Machine Learning
National Institute of Food and Agriculture
USDA-ARS: 6066-21310-006-000D
Data contact name
Rivers, Adam
Data contact email
adam.rivers@usda.gov
Publisher
Ag Data Commons
Intended use
The variant calls can be used in machine learning applications, including identifying the variants most predictive of serotype to create a rapid test that does not involve sequencing or culturing, prediction of antimicrobial resistance, vaccine development, and developing a pangenome graph.
Use limitations
Singleton variants are filtered out, so this data set is more appropriate for predictive modeling than variant discovery.
Temporal Extent Start Date
2000-01-01
Frequency
- notPlanned
Theme
- Non-geospatial
ISO Topic Category
- farming
- health
National Agricultural Library Thesaurus terms
Salmonella enterica; genome; food safety; traceability; certification; product authenticity; agricultural biotechnology; diagnostic techniques; biosensors; engineering; nucleic acids; proteins; animals; pests; pathogens; human diseases; databases; artificial intelligence; single nucleotide polymorphism; computer software; serotypes; genotyping; metadata; phenotype; prediction; metagenomics; rapid methods; antibiotic resistance; vaccine development; models
OMB Bureau Code
- 005:20 - National Institute of Food and Agriculture
- 005:18 - Agricultural Research Service
OMB Program Code
- 005:040 - National Research
ARS National Program Number
- 301
ARIS Log Number
426978
Pending citation
- No
Public Access Level
- Public