Salmonella enterica pangenome graph and variant call data for 539,283 genomes

dataset
posted on 2025-06-25, 17:45 authored by Adam Rivers, Lei Ma, Justin Vaughn, Brian Abernathy, Brian Nadon, Annette Hynes

Description:

Salmonella enterica causes human disease and decreases agricultural production. The overall goal of this project is to generate a large database of S. enterica variants with 539,283 samples and 236,069 features for applications in machine learning and genomics. We transformed single nucleotide polymorphism (SNP) data into reduced-dimensional representations that tolerate missing data, based on disentangled variational autoencoders. TFRecord files were made with custom Python scripts that parsed the variant call format (VCF) files into sparse tensors and combined them with the Salmonella In Silico Typing Resource (SISTR) serotype data.

The data directory contains:

  1. The tar file of TFRecords: tfrecords.tar (103 GB). The TFRecords are organized first by how they were genotyped: the mpileup records were created with Mpileup, and the gvg records were created with graph variant calling. Each of these directories contains batches of ~10,000 sequence reads named Sra10k_XX.tfrecord.gz (00--54). File Sra10k_99.tfrecord.gz contains incomplete SRAs. Each TFRecord contains the shape of the tensor, the indices of non-zero variants, sample name, serotype, and sparse values. Value 99 was assigned to '.' records.
  2. The file output.tar (11.4 TB) contains the .vcf files used to create the TFRecords above. The same data is stored more succinctly in the TFRecord format, so this file will not normally be needed.
  3. A tar file of metadata files for the samples, metadata (95 MB). Sequence read archive (SRA) accessions were downloaded using edirect/eutilities and saved as SraAccList.txt.
esearch -db sra -query "txid28901[Organism:exp] AND (cluster_public[prop] AND 'biomol dna'[Properties] AND 'library layout paired'[Properties] AND 'platform illumina'[Properties] AND 'strategy wgs'[Properties] OR 'strategy wga'[Properties] OR 'strategy wcs'[Properties] OR 'strategy clone'[Properties] OR 'strategy finishing'[Properties] OR 'strategy validation'[Properties])" | efetch -format runinfo -mode xml | xtract -pattern Row -element Run > SraAccList.txt

Google BigQuery was used to download metadata for the SRA accessions from the National Institutes of Health (NIH).

SELECT * FROM `nih-sra-datastore.sra.metadata` as metadata INNER JOIN `{table_id}` as leiacc ON metadata.acc = leiacc.accID;
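The `{table_id}` placeholder in the query above looks like a template field. A minimal sketch of filling it in Python follows, assuming the placeholder stands for the ID of a user-owned BigQuery table holding the accession list in a column named accID (as the join condition implies). The table ID used here is hypothetical, and the sketch only builds the query string; it does not contact BigQuery.

```python
# Query text copied from the description, with {table_id} left as a template field.
QUERY_TEMPLATE = (
    "SELECT * FROM `nih-sra-datastore.sra.metadata` as metadata "
    "INNER JOIN `{table_id}` as leiacc "
    "ON metadata.acc = leiacc.accID;"
)


def build_query(table_id: str) -> str:
    """Return the metadata query with the accession table ID substituted."""
    if "`" in table_id:
        # A backtick would close the quoted identifier early.
        raise ValueError("table_id must not contain backticks")
    return QUERY_TEMPLATE.format(table_id=table_id)


# Hypothetical table ID, for illustration only.
query = build_query("my-project.my_dataset.sra_accessions")
print(query)
```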

Files were processed into batches of ~10,000 and named Sra_completed_XX.csv (00--53).
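The batch numbering used for the TFRecord and metadata files can be expanded into concrete file names. The sketch below assumes that XX is a zero-padded two-digit index and that the stated ranges are inclusive; the helper name batch_names is introduced here for illustration only.

```python
def batch_names(prefix: str, suffix: str, first: int, last: int) -> list:
    """Build file names for batch indices first..last (inclusive), zero-padded to two digits."""
    return [f"{prefix}{index:02d}{suffix}" for index in range(first, last + 1)]


# TFRecord batches 00--54, plus batch 99 for incomplete SRAs.
tfrecord_batches = batch_names("Sra10k_", ".tfrecord.gz", 0, 54)
tfrecord_batches.append("Sra10k_99.tfrecord.gz")

# Metadata batches 00--53.
metadata_batches = batch_names("Sra_completed_", ".csv", 0, 53)

print(len(tfrecord_batches), tfrecord_batches[0], tfrecord_batches[-1])
print(len(metadata_batches), metadata_batches[0], metadata_batches[-1])
```

Under these assumptions the lists hold 56 TFRecord file names and 54 metadata file names, which can be checked against the contents of the extracted tar files.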

  4. A VCF document mapping the TFRecord data to the positions in the graph subjected to the Type strain LT2: mapping/DRR452337.gvg.vcf-with_TFRecord_in_1st_column.txt
  5. Scripts for creating and reading TFRecord data: code.
  • reading_and_parsing_fns.py defines functions for converting VCFs of variants called using gvg to sparse tensors and makes the TFRecord files.
  • gvg_to_tfrecord.py creates TFRecords from the sparse tensors.
  6. Tutorial for using the TFRecords: Example_logistic_regression.md
  7. Pangenome graph files and references used for variant calling and genotyping: pangenome.
  • refPlus100.fasta.gz which contains the genomes of the 101 Salmonella strains without plasmids used for construction of the pangenome graph.
  • salm.100.NC_003197_v2.d2_complete.gfa.gz The complete 101 Salmonella strain pangenome graph in Graphical Fragment Assembly (GFA2) Format 2.0 including alt nodes used for genotyping
  • salm.100.NC_003197_v2.full.gfa.gz the full graph including alt nodes.
  • salm.100.NC_003197_v2.full.vcf.gz A VCF of the file above
  • genotyped.gvg.vcf the genotype calls in vcf format
  • paths.txt the paths of the graph
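The TFRecord description above (tensor shape, indices of non-zero variants, sparse values, and 99 for '.' records) corresponds to a standard sparse encoding. The following is a minimal pure-Python sketch of that idea, not the project's own code (which lives in reading_and_parsing_fns.py and gvg_to_tfrecord.py). It assumes each variant site is given as either '.' or an integer allele code where 0 means reference; the function names and the allele coding are assumptions made for illustration.

```python
from typing import List, Tuple

MISSING_CODE = 99  # the description states that '.' records were assigned 99


def encode_sparse(calls: List[str]) -> Tuple[List[int], List[int], int]:
    """Return (indices, values, length) for one sample's calls.

    Assumed input: one string per variant site, either '.' (missing) or an
    integer allele code such as '0' (reference) or '1' (first alternate).
    """
    indices, values = [], []
    for position, call in enumerate(calls):
        code = MISSING_CODE if call == "." else int(call)
        if code != 0:  # only non-zero entries are stored
            indices.append(position)
            values.append(code)
    return indices, values, len(calls)


def decode_dense(indices: List[int], values: List[int], length: int) -> List[int]:
    """Rebuild the dense vector, filling unstored positions with 0."""
    dense = [0] * length
    for position, code in zip(indices, values):
        dense[position] = code
    return dense


calls = ["0", "1", ".", "0", "2"]
indices, values, length = encode_sparse(calls)
dense = decode_dense(indices, values, length)
print(indices, values, length)
print(dense)
```

Because missing calls are stored as 99 rather than dropped, a downstream model can distinguish a reference call (absent from the index list) from a missing one; Example_logistic_regression.md shows how the actual records are read.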

SCINet users: The data folder can be accessed or retrieved with a valid SCINet account at this location: /LTS/ADCdatastorage/NAL/published/node28083194/

See the SCINet File Transfer guide for more information on moving large files: https://scinet.usda.gov/guides/data/datatransfer

Globus users: The files can also be accessed through Globus by following this data link. The user will need to log in to Globus in order to access this data. User accounts are free of charge with several options for signing on. Instructions for creating an account are on the login page.

Funding

FACT: Salmonella Typing and Phenotypic Prediction From Genomes and Metagenomes Using Population Genomics and Machine Learning

National Institute of Food and Agriculture

USDA-ARS: 6066-21310-006-000D

History

Data contact name

Rivers, Adam

Data contact email

adam.rivers@usda.gov

Publisher

Ag Data Commons

Intended use

The variant calls can be used in machine learning applications, including identifying the variants most predictive of serotype to create a rapid test that does not involve sequencing or culturing, predicting antimicrobial resistance, vaccine development, and developing a pangenome graph.

Use limitations

Singleton variants are filtered out, so this data set is more appropriate for predictive modeling than variant discovery.

Temporal Extent Start Date

2000-01-01

Frequency

  • notPlanned

Theme

  • Non-geospatial

ISO Topic Category

  • farming
  • health

National Agricultural Library Thesaurus terms

Salmonella enterica; genome; food safety; traceability; certification; product authenticity; agricultural biotechnology; diagnostic techniques; biosensors; engineering; nucleic acids; proteins; animals; pests; pathogens; human diseases; databases; artificial intelligence; single nucleotide polymorphism; computer software; serotypes; genotyping; metadata; phenotype; prediction; metagenomics; rapid methods; antibiotic resistance; vaccine development; models

OMB Bureau Code

  • 005:20 - National Institute of Food and Agriculture
  • 005:18 - Agricultural Research Service

OMB Program Code

  • 005:040 - National Research

ARS National Program Number

  • 301

ARIS Log Number

426978

Pending citation

  • No

Public Access Level

  • Public