mercury

Description

Mercury prepares and formats metadata and sequencing files located in GCP buckets for submission to national & international databases, currently NCBI & GISAID. The default organism (set with --organism) is "sars-cov-2" although "mpox" and "flu" are also accepted.

Important note: Mercury was designed to work with metadata tables that were processed after running the TheiaCoV workflows. If you are using a different pipeline, please ensure that the metadata table is formatted correctly.

For all organisms:

Required and optional metadata fields are retrieved from the Metadata.py file, depending on the optional --organism and --skip_ncbi arguments. Additional metadata population arguments overwrite and populate the column that each argument references. Note that --metadata_organism populates the organism column in the input_table; --organism does NOT populate this column.
The input TSV file is read (from the positional input_table and table_name arguments) and the required and optional metadata is extracted for only the samples specified in the positional samplenames argument.
The metadata is formatted according to the requirements of each database, dependent on the specified --organism argument.
If SRA submission is not skipped (that is, if --skip_ncbi is not indicated), the sequencing read files (fastq files) are uploaded to a Google Cloud Storage bucket (specified by --gcp_bucket_uri) for temporary storage until they can be retrieved by NCBI (specifically, SRA) during submission.
For BankIt, GenBank, and/or GISAID, the assembly files (fasta files) are concatenated, their header lines are renamed, and the combined files are made available for local download and submission to the respective databases.
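
The sample-selection step described above can be pictured with a small sketch. This is not Mercury's own code: the function name `select_samples`, the example table, the `collection_date` column, and the bucket path are all hypothetical, and the standard-library `csv` module is used here only to keep the example dependency-free.

```python
import csv
import io


def select_samples(tsv_text, table_name, samplenames):
    """Return the rows whose identifier column matches a requested sample.

    table_name is the header of the first column (cell A1) and
    samplenames is a comma-delimited string, as in the Mercury arguments.
    """
    wanted = [name.strip() for name in samplenames.split(",") if name.strip()]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows_by_id = {row[table_name]: row for row in reader}
    missing = [name for name in wanted if name not in rows_by_id]
    if missing:
        raise KeyError(f"samples not found in table: {missing}")
    # Preserve the order in which the samples were requested
    return [rows_by_id[name] for name in wanted]


# Hypothetical three-sample table; only two samples are requested
example_table = (
    "mercury_demo_id\tcollection_date\tassembly_fasta\n"
    "sample01\t2023-01-05\tgs://hypothetical-bucket/sample01.fasta\n"
    "sample02\t2023-01-06\tgs://hypothetical-bucket/sample02.fasta\n"
    "sample03\t2023-01-07\tgs://hypothetical-bucket/sample03.fasta\n"
)
selected = select_samples(example_table, "mercury_demo_id", "sample01,sample03")
print([row["mercury_demo_id"] for row in selected])
```

Rows that are not named in the sample list are simply ignored, which mirrors the "only the samples specified" behaviour described above.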

Default databases by organism:

"sars-cov-2": BioSample, GenBank, GISAID, SRA
"mpox": BankIt, BioSample, GISAID, SRA
"flu": BioSample, SRA

Installation

Docker

We highly recommend using the following Docker image to run Mercury:

docker pull us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.1.1

The entrypoint for this Docker image is the Mercury help message. To run this container interactively, use the following command:

docker run -it --entrypoint=/bin/bash us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.1.1
# Once inside the container interactively, you can run the mercury tool
mercury.py -v
# v1.1.1

Locally with Python

Mercury is not yet available with pip or conda. To run Mercury in your local command-line environment, install the following dependencies:

Python 3.9+
pandas >= 1.4.2
Google Cloud SDK 479.0.0+ and all its dependencies
numpy >= 1.22.4

Outputs

Each organism will produce different output files. See the table below:

| Output file | Database | `"mpox"` | `"sars-cov-2"` | `"flu"` |
| --- | --- | --- | --- | --- |
| `<output_prefix>_bankit_combined.fasta` | BankIt | ✓ | | |
| `<output_prefix>.src` | BankIt | ✓ | | |
| `<output_prefix>_biosample_metadata.tsv` | BioSample | ✓ | ✓ | ✓ |
| `<output_prefix>_genbank_metadata.tsv` | GenBank | | ✓ | |
| `<output_prefix>_genbank_combined.fasta` | GenBank | | ✓ | |
| `<output_prefix>_gisaid_metadata.csv` | GISAID | ✓ | ✓ | |
| `<output_prefix>_gisaid_combined.fasta` | GISAID | ✓ | ✓ | |
| `<output_prefix>_sra_metadata.tsv` | SRA | ✓ | ✓ | ✓ |
| `<output_prefix>_excluded_samples.tsv` | N/A | ✓ | ✓ | ✓ |
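
As a quick way to check which files to expect from a run, the table above can be transcribed into a lookup. The sketch below is illustrative only: `OUTPUT_FILES` and `expected_outputs` are hypothetical names, not part of Mercury, and the file-name prefix is assumed to be the value passed to --output_prefix (default "mercury").

```python
# (suffix, database, organisms that produce the file), transcribed from the table above
OUTPUT_FILES = [
    ("_bankit_combined.fasta", "BankIt", {"mpox"}),
    (".src", "BankIt", {"mpox"}),
    ("_biosample_metadata.tsv", "BioSample", {"mpox", "sars-cov-2", "flu"}),
    ("_genbank_metadata.tsv", "GenBank", {"sars-cov-2"}),
    ("_genbank_combined.fasta", "GenBank", {"sars-cov-2"}),
    ("_gisaid_metadata.csv", "GISAID", {"mpox", "sars-cov-2"}),
    ("_gisaid_combined.fasta", "GISAID", {"mpox", "sars-cov-2"}),
    ("_sra_metadata.tsv", "SRA", {"mpox", "sars-cov-2", "flu"}),
    ("_excluded_samples.tsv", None, {"mpox", "sars-cov-2", "flu"}),
]


def expected_outputs(organism, output_prefix="mercury"):
    """List the output file names the table associates with an organism."""
    names = [
        output_prefix + suffix
        for suffix, _database, organisms in OUTPUT_FILES
        if organism in organisms
    ]
    if not names:
        raise ValueError(f"unsupported organism: {organism!r}")
    return names


for name in expected_outputs("flu"):
    print(name)
```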

Explanation of Arguments

Usage & Help Message

usage: python3 /mercury/mercury/mercury.py <input_table.tsv> <table_name> <samplenames> [<args>]

Mercury prepares and formats metadata for submission to national & international genomic databases

positional arguments:
  input_table
          The table containing the metadata for the samples to be submitted
  table_name
          The name of the first column in the table (A1); include the `_id` if data table is downloaded from Terra.bio
  samplenames
          The sample names to be extracted from the table

optional arguments:
  -h, --help
          show this help message and exit
  -v, --version
          show program's version number and exit
  -o, --output_prefix 
          The prefix for the output files
          default="mercury"
  -b, --gcp_bucket_uri 
          The GCP bucket URI to store the temporarily store the read files (required)

submission type arguments:
  options that determine submission type

  --organism 
          The organism type of the samples in the table
          default="sars-cov-2"
  --skip_ncbi
          Add to skip NCBI metadata preparation; prep only for GISAID submission

metadata customization arguments:
  options that customize the metadata configuration

  --skip_county
          Add to skip adding county to location in GISAID metadata
  --usa_territory
          Add if the country is a USA territory to use the territory name in the state column
  --using_clearlabs_data
          Add if using Clearlabs-generated data and metrics
  --using_reads_dehosted
          Add if using reads_dehosted instead of clearlabs data
  --single_end
          Add if the data is single-end

metadata population arguments:
  options that populate metadata fields

  --amplicon_primer_scheme [AMPLICON_PRIMER_SCHEME ...]
          Amplicon primer scheme
  --amplicon_size [AMPLICON_SIZE ...]
          Amplicon size
  --authors [AUTHORS ...]
          Authors of the study
  --bioproject_accession [BIOPROJECT_ACCESSION ...]
          Bioproject accession number
  --continent [CONTINENT ...]
          Continent of the sample
  --country [COUNTRY ...]
          Country of the sample
  --gisaid_submitter [GISAID_SUBMITTER ...]
          GISAID submitter
  --host_disease [HOST_DISEASE ...]
          Disease of the host
  --instrument_model [INSTRUMENT_MODEL ...]
          Instrument model
  --isolation_source [ISOLATION_SOURCE ...]
          Source of isolation
  --library_layout [LIBRARY_LAYOUT ...]
          Library layout
  --library_selection [LIBRARY_SELECTION ...]
          Library selection method
  --library_source [LIBRARY_SOURCE ...]
          Library source
  --library_strategy [LIBRARY_STRATEGY ...]
          Library strategy
  --metadata_organism [METADATA_ORGANISM ...]
          Organism name for metadata population
  --purpose_of_sequencing [PURPOSE_OF_SEQUENCING ...]
          Purpose of sequencing
  --seq_platform [SEQ_PLATFORM ...]
          Sequencing platform
  --state [STATE ...]
          State of the sample
  --submitter_email [SUBMITTER_EMAIL ...]
          Submitter email
  --submitting_lab [SUBMITTING_LAB ...]
          Submitting laboratory
  --submitting_lab_address [SUBMITTING_LAB_ADDRESS ...]
          Address of the submitting laboratory

quality control arguments:
  options that control quality thresholds (currently only for SARS-CoV-2 samples)

  -a, --vadr_alert_limit 
          The maximum number of VADR alerts allowed for SARS-CoV-2 samples
          default=0
  -n, --number_n_threshold 
          The maximum number of Ns allowed in SARS-CoV-2 assemblies
          default=5000

logging arguments:
  options that change the verbosity of the stdout logging

  --verbose
          Add to enable verbose logging
  --debug
          Add to enable debug logging; overwrites --verbose

Please contact support@theiagen.com or sage.wright@theiagen.com with any questions

Positional & Required Arguments

To successfully run Mercury, these arguments are required.

input_table: The table containing the metadata for the samples to be submitted in TSV format
table_name: The name of the first column in the table (A1) in its entirety
samplenames: The sample names to be extracted from the table (or in other words, the names of the rows in the table) in a comma-delimited list
--gcp_bucket_uri: The GCP bucket URI used to temporarily store the read files (such as gs://bucket_with_sra_access_permissions; contact support@theiagen.com if you would like to use the GCP bucket we use for this purpose)
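
To show how the required arguments fit together, here is a minimal sketch that assembles (but does not run) a Mercury command line from Python. The helper `build_mercury_command`, the table file name, the table name, the sample names, and the bucket URI are all hypothetical; the script path is the one shown in the usage line of the help message.

```python
import shlex


def build_mercury_command(input_table, table_name, samplenames, gcp_bucket_uri,
                          script="/mercury/mercury/mercury.py"):
    """Assemble the argument list for a Mercury run without executing it."""
    return [
        "python3", script,
        input_table,
        table_name,
        ",".join(samplenames),  # samplenames is passed as a comma-delimited list
        "--gcp_bucket_uri", gcp_bucket_uri,
    ]


command = build_mercury_command(
    "terra_table.tsv",            # hypothetical TSV exported from Terra
    "mercury_demo_id",            # hypothetical name of the first column (A1)
    ["sample01", "sample02"],     # hypothetical sample names
    "gs://hypothetical-bucket",   # hypothetical bucket URI
)
print(shlex.join(command))
```

Building the command as a list makes it straightforward to hand to `subprocess.run` later without worrying about shell quoting.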

Optional Arguments

These arguments provide helpful information, or help customize the output file names.

-h, --help: Show the help message and exit
-v, --version: Show the program's version number and exit
-o, --output_prefix: The prefix for the output files (default is "mercury")

Submission Type Arguments

These arguments change the type of submission Mercury prepares.

--organism: The organism type of all the samples in the table (default is "sars-cov-2"; options include "sars-cov-2", "mpox", and "flu"; contact us if you would like any additional organisms/databases supported)
--skip_ncbi: Add to skip NCBI metadata preparation; prepare metadata and sequencing files only for GISAID submission

Metadata Customization Arguments

These arguments customize the configuration of the required and/or optional metadata.

--skip_county: Add to skip adding county to location in GISAID metadata
--usa_territory: Add if the country is a USA territory to use the territory name in the state column; this is useful for territories like Puerto Rico (e.g., instead of "North America/USA/Puerto Rico", the location will be "North America/Puerto Rico")
--using_clearlabs_data: Add if using Clearlabs-generated data and metrics
--using_reads_dehosted: Add if using reads_dehosted instead of clearlabs data
--single_end: Add if the data is single-end; this ensures that the read2 column is not included in the metadata
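
The effect of --skip_county and --usa_territory on the GISAID location can be sketched as follows. This is an assumption-laden illustration rather than Mercury's implementation: the function `gisaid_location` is hypothetical, the "/" separator and field order follow the Puerto Rico example above, and placing the county last is a guess.

```python
def gisaid_location(continent, country, state, county="",
                    skip_county=False, usa_territory=False):
    """Build a location string in the style of the documented example."""
    parts = [continent]
    if not usa_territory:
        # For a USA territory, the territory name (held in the state
        # column) takes the place of the country
        parts.append(country)
    parts.append(state)
    if county and not skip_county:
        # Assumption: county is appended after the state when present
        parts.append(county)
    return "/".join(parts)


print(gisaid_location("North America", "USA", "Puerto Rico"))
print(gisaid_location("North America", "USA", "Puerto Rico", usa_territory=True))
```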

Metadata Population Arguments

--amplicon_primer_scheme: Add and populate to overwrite amplicon_primer_scheme column with input
--amplicon_size: Add and populate to overwrite amplicon_size column with input
--authors: Add and populate to overwrite authors column with input
--bioproject_accession: Add and populate to overwrite bioproject_accession column with input
--continent: Add and populate to overwrite continent column with input
--country: Add and populate to overwrite country column with input
--gisaid_submitter: Add and populate to overwrite gisaid_submitter column with input
--host_disease: Add and populate to overwrite host_disease column with input
--instrument_model: Add and populate to overwrite instrument_model column with input
--isolation_source: Add and populate to overwrite isolation_source column with input
--library_layout: Add and populate to overwrite library_layout column with input
--library_selection: Add and populate to overwrite library_selection column with input
--library_source: Add and populate to overwrite library_source column with input
--library_strategy: Add and populate to overwrite library_strategy column with input
--metadata_organism: Add and populate to overwrite organism column with input
--purpose_of_sequencing: Add and populate to overwrite purpose_of_sequencing column with input
--seq_platform: Add and populate to overwrite seq_platform column with input
--state: Add and populate to overwrite state column with input
--submitter_email: Add and populate to overwrite submitter_email column with input
--submitting_lab: Add and populate to overwrite submitting_lab column with input
--submitting_lab_address: Add and populate to overwrite submitting_lab_address column with input
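
The `[AUTHORS ...]` style shown in the help message is how argparse displays an option declared with `nargs="*"`, so each of these arguments accepts several space-separated tokens. The sketch below shows one way such an option could overwrite a column for every sample. It is not Mercury's code: `apply_overrides` is a hypothetical helper, only two of the options are declared, and joining the tokens with a single space is an assumption.

```python
import argparse

parser = argparse.ArgumentParser()
# Two of the metadata population options, declared the way the help text suggests
parser.add_argument("--authors", nargs="*")
parser.add_argument("--submitting_lab", nargs="*")


def apply_overrides(rows, args):
    """Overwrite each supplied column in every row; leave other columns alone."""
    overrides = {
        column: " ".join(tokens)  # assumption: tokens are joined with spaces
        for column, tokens in vars(args).items()
        if tokens
    }
    return [{**row, **overrides} for row in rows]


args = parser.parse_args(["--submitting_lab", "Example", "Public", "Health", "Lab"])
rows = [{"sample": "sample01", "submitting_lab": "", "authors": "A. Author"}]
updated = apply_overrides(rows, args)
print(updated[0]["submitting_lab"])
```

Options that are not supplied on the command line parse to `None` and are skipped, so existing column values are kept.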

A note on `--using_clearlabs_data` & `--using_reads_dehosted`

The --using_clearlabs_data and --using_reads_dehosted arguments change the default values for the read1_column_name, assembly_fasta_column_name, and assembly_mean_coverage_column_name metadata columns. The table below shows the default values and what they change to depending on which arguments are used.

| Variable | Default Value | with `--using_clearlabs_data` | with `--using_reads_dehosted` | with both `--using_clearlabs_data` *and* `--using_reads_dehosted` |
| --- | --- | --- | --- | --- |
| `read1_column_name` | `"read1_dehosted"` | `"clearlabs_fastq_gz"` | `"reads_dehosted"` | `"reads_dehosted"` |
| `assembly_fasta_column_name` | `"assembly_fasta"` | `"clearlabs_fasta"` | `"assembly_fasta"` | `"clearlabs_fasta"` |
| `assembly_mean_coverage_column_name` | `"assembly_mean_coverage"` | `"clearlabs_assembly_coverage"` | `"assembly_mean_coverage"` | `"clearlabs_assembly_coverage"` |
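
The table can also be read as a small decision rule: --using_clearlabs_data swaps all three column names, and --using_reads_dehosted then replaces only the read1 column. The sketch below encodes that reading; the function `resolve_column_names` is hypothetical and is not taken from Mercury's source.

```python
def resolve_column_names(using_clearlabs_data=False, using_reads_dehosted=False):
    """Return the column names from the table for a combination of flags."""
    columns = {
        "read1_column_name": "read1_dehosted",
        "assembly_fasta_column_name": "assembly_fasta",
        "assembly_mean_coverage_column_name": "assembly_mean_coverage",
    }
    if using_clearlabs_data:
        columns = {
            "read1_column_name": "clearlabs_fastq_gz",
            "assembly_fasta_column_name": "clearlabs_fasta",
            "assembly_mean_coverage_column_name": "clearlabs_assembly_coverage",
        }
    if using_reads_dehosted:
        # Only the read column changes; it also wins when both flags are set
        columns["read1_column_name"] = "reads_dehosted"
    return columns


print(resolve_column_names(using_clearlabs_data=True, using_reads_dehosted=True))
```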

Quality Control Arguments

These arguments are currently only implemented for SARS-CoV-2. If any samples do not meet the quality thresholds, they will not be submitted to the respective databases and will be found in the <output_prefix>_excluded_samples.tsv file. If provided for other organisms, they will be ignored.

--vadr_alert_limit: The maximum number of VADR alerts allowed for SARS-CoV-2 samples (default is 0)
--number_n_threshold: The maximum number of Ns allowed in SARS-CoV-2 assemblies (default is 5000)
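
A rough sketch of how these two thresholds could separate passing samples from excluded ones is shown below. It is illustrative only: the function `split_by_quality` and the dictionary keys `name`, `vadr_num_alerts`, and `number_n` are hypothetical, and treating a value equal to the limit as passing is an assumption based on the word "maximum".

```python
def split_by_quality(samples, vadr_alert_limit=0, number_n_threshold=5000):
    """Partition sample names into (passed, excluded) using both thresholds."""
    passed, excluded = [], []
    for sample in samples:
        within_limits = (
            sample["vadr_num_alerts"] <= vadr_alert_limit
            and sample["number_n"] <= number_n_threshold
        )
        (passed if within_limits else excluded).append(sample["name"])
    return passed, excluded


# Hypothetical QC values for three SARS-CoV-2 samples
samples = [
    {"name": "sample01", "vadr_num_alerts": 0, "number_n": 120},
    {"name": "sample02", "vadr_num_alerts": 2, "number_n": 300},
    {"name": "sample03", "vadr_num_alerts": 0, "number_n": 7500},
]
passed, excluded = split_by_quality(samples)
print("passed:", passed)
print("excluded:", excluded)
```

In this picture, the excluded names are the ones that would be listed in the excluded-samples file.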

Logging Arguments

These arguments control the amount of logging that is output to the console.

--verbose: Add to enable verbose logging
--debug: Add to enable debug logging; overwrites --verbose

Happy submissions!
