Mercury prepares and formats metadata and sequencing files located in GCP buckets for submission to national & international databases, currently NCBI & GISAID. The default organism (set with --organism) is "sars-cov-2" although "mpox" and "flu" are also accepted.
Important note: Mercury was designed to work with metadata tables that were processed after running the TheiaCoV workflows. If you are using a different pipeline, please ensure that the metadata table is formatted correctly.
For all organisms:
- Required & optional metadata fields are retrieved from the `Metadata.py` file, dependent on the optional `--organism` and `--skip_ncbi` arguments. Additional metadata population arguments overwrite and populate the column that each argument references. Note that `--metadata_organism` will populate the `organism` column in the `input_table`, while `--organism` will NOT populate this column.
- The input TSV file is read (from the positional `input_table` and `table_name` arguments) and the required/optional metadata is extracted for only the samples specified in the positional `samplenames` argument.
- The metadata is formatted according to the requirements of each database, dependent on the specified `--organism` argument.
- If SRA submission is not skipped (that is, if `--skip_ncbi` is not indicated), the sequencing read files (fastq files) are uploaded to a Google Cloud Storage bucket (specified by `--gcp_bucket_uri`) for temporary storage until they can be retrieved by NCBI (specifically, SRA) during submission.
- For BankIt, GenBank, and/or GISAID, the assembly files (fasta files) are concatenated, their header lines are renamed, and the combined files are made available for local download and submission to the respective databases.
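The table-reading step above can be illustrated with a minimal sketch. This is not Mercury's implementation: the function name `extract_samples`, the example table, and the `collection_date` column are hypothetical, and only the standard library is used so that the example stays self-contained.

```python
import csv
import io


def extract_samples(tsv_text, table_name, samplenames):
    """Return the rows whose first-column value appears in the comma-delimited samplenames."""
    wanted = [name.strip() for name in samplenames.split(",") if name.strip()]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Index every row by the value in the column named by table_name (cell A1).
    rows_by_name = {row[table_name]: row for row in reader}
    missing = [name for name in wanted if name not in rows_by_name]
    if missing:
        raise KeyError(f"samples not found in table: {missing}")
    # Preserve the order in which the samples were requested.
    return [rows_by_name[name] for name in wanted]


# Hypothetical table content; real tables come from the input_table TSV file.
example_tsv = (
    "mercury_demo_id\tcollection_date\tstate\n"
    "sample01\t2024-01-05\tCalifornia\n"
    "sample02\t2024-01-06\tNevada\n"
    "sample03\t2024-01-07\tUtah\n"
)
selected = extract_samples(example_tsv, "mercury_demo_id", "sample01,sample03")
print([row["state"] for row in selected])
```

Only the requested rows are returned, which mirrors how the `samplenames` argument restricts the metadata that is extracted.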
Default databases by organism:
"sars-cov-2": BioSample, GenBank, GISAID, SRA"mpox": BankIt, BioSample, GISAID, SRA"flu": BioSample, SRA
We highly recommend using the following Docker image to run Mercury:

    docker pull us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.1.1

The entrypoint for this Docker image is the Mercury help message. To run this container interactively, use the following command:

    docker run -it --entrypoint=/bin/bash us-docker.pkg.dev/general-theiagen/theiagen/mercury:1.1.1
    # Once inside the container interactively, you can run the mercury tool
    mercury.py -v
    # v1.1.1

Mercury is not yet available with pip or conda. To run Mercury in your local command-line environment, install the following dependencies:
- Python 3.9+
- pandas >= 1.4.2
- Google Cloud SDK 479.0.0+ and all its dependencies
- numpy >= 1.22.4
Each organism will produce different output files. See the table below:

| | `<output_name>_bankit_combined.fasta` | `<output_name>.src` | `<output_name>_biosample_metadata.tsv` | `<output_name>_genbank_metadata.tsv` | `<output_name>_genbank_combined.fasta` | `<output_name>_gisaid_metadata.csv` | `<output_name>_gisaid_combined.fasta` | `<output_name>_sra_metadata.tsv` | `<output_name>_excluded_samples.tsv` |
|---|---|---|---|---|---|---|---|---|---|
| organism | BankIt | BankIt | BioSample | GenBank | GenBank | GISAID | GISAID | SRA | N/A |
| `"mpox"` | ✓ | ✓ | ✓ | | | ✓ | ✓ | ✓ | ✓ |
| `"sars-cov-2"` | | | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| `"flu"` | | | ✓ | | | | | ✓ | ✓ |

usage: python3 /mercury/mercury/mercury.py <input_table.tsv> <table_name> <samplenames> [<args>]
Mercury prepares and formats metadata for submission to national & international genomic databases
positional arguments:
input_table
The table containing the metadata for the samples to be submitted
table_name
The name of the first column in the table (A1); include the `_id` if data table is downloaded from Terra.bio
samplenames
The sample names to be extracted from the table
optional arguments:
-h, --help
show this help message and exit
-v, --version
show program's version number and exit
-o, --output_prefix
The prefix for the output files
default="mercury"
-b, --gcp_bucket_uri
The GCP bucket URI to store the temporarily store the read files (required)
submission type arguments:
options that determine submission type
--organism
The organism type of the samples in the table
default="sars-cov-2"
--skip_ncbi
Add to skip NCBI metadata preparation; prep only for GISAID submission
metadata customization arguments:
options that customize the metadata configuration
--skip_county
Add to skip adding county to location in GISAID metadata
--usa_territory
Add if the country is a USA territory to use the territory name in the state column
--using_clearlabs_data
Add if using Clearlabs-generated data and metrics
--using_reads_dehosted
Add if using reads_dehosted instead of clearlabs data
--single_end
Add if the data is single-end
metadata population arguments:
options that populate metadata fields
--amplicon_primer_scheme [AMPLICON_PRIMER_SCHEME ...]
Amplicon primer scheme
--amplicon_size [AMPLICON_SIZE ...]
Amplicon size
--authors [AUTHORS ...]
Authors of the study
--bioproject_accession [BIOPROJECT_ACCESSION ...]
Bioproject accession number
--continent [CONTINENT ...]
Continent of the sample
--country [COUNTRY ...]
Country of the sample
--gisaid_submitter [GISAID_SUBMITTER ...]
GISAID submitter
--host_disease [HOST_DISEASE ...]
Disease of the host
--instrument_model [INSTRUMENT_MODEL ...]
Instrument model
--isolation_source [ISOLATION_SOURCE ...]
Source of isolation
--library_layout [LIBRARY_LAYOUT ...]
Library layout
--library_selection [LIBRARY_SELECTION ...]
Library selection method
--library_source [LIBRARY_SOURCE ...]
Library source
--library_strategy [LIBRARY_STRATEGY ...]
Library strategy
--metadata_organism [METADATA_ORGANISM ...]
Organism name for metadata population
--purpose_of_sequencing [PURPOSE_OF_SEQUENCING ...]
Purpose of sequencing
--seq_platform [SEQ_PLATFORM ...]
Sequencing platform
--state [STATE ...]
State of the sample
--submitter_email [SUBMITTER_EMAIL ...]
Submitter email
--submitting_lab [SUBMITTING_LAB ...]
Submitting laboratory
--submitting_lab_address [SUBMITTING_LAB_ADDRESS ...]
Address of the submitting laboratory
quality control arguments:
options that control quality thresholds (currently only for SARS-CoV-2 samples)
-a, --vadr_alert_limit
The maximum number of VADR alerts allowed for SARS-CoV-2 samples
default=0
-n, --number_n_threshold
The maximum number of Ns allowed in SARS-CoV-2 assemblies
default=5000
logging arguments:
options that change the verbosity of the stdout logging
--verbose
Add to enable verbose logging
--debug
Add to enable debug logging; overwrites --verbose
Please contact support@theiagen.com or sage.wright@theiagen.com with any questions
To successfully run Mercury, these arguments are required.
- `input_table`: The table containing the metadata for the samples to be submitted, in TSV format
- `table_name`: The name of the first column in the table (A1), in its entirety
- `samplenames`: The sample names to be extracted from the table (in other words, the names of the rows in the table), in a comma-delimited list
- `--gcp_bucket_uri`: The GCP bucket URI used to temporarily store the read files (such as `gs://bucket_with_sra_access_permissions`; contact support@theiagen.com if you would like to use the GCP bucket we use for this purpose)
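To show how the required arguments fit together, the sketch below assembles an invocation that follows the usage line in the help message. The helper name `build_mercury_command`, the table and sample names, and the bucket URI are placeholders chosen for illustration; the script path is taken from the usage line and may differ in a local installation.

```python
import shlex


def build_mercury_command(input_table, table_name, samplenames, gcp_bucket_uri, extra_args=()):
    """Assemble a Mercury command line as a list of arguments (hypothetical helper)."""
    command = [
        "python3",
        "/mercury/mercury/mercury.py",  # path shown in the usage line
        input_table,
        table_name,
        ",".join(samplenames),  # samplenames is passed as a comma-delimited list
        "--gcp_bucket_uri",
        gcp_bucket_uri,
    ]
    command.extend(extra_args)
    return command


command = build_mercury_command(
    "demo_table.tsv",
    "demo_table_id",
    ["sample01", "sample02"],
    "gs://example-bucket/mercury",  # placeholder bucket URI
    ["--organism", "mpox"],
)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without further quoting.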
These arguments provide helpful information, or help customize the output file names.
- `-h, --help`: Show the help message and exit
- `-v, --version`: Show the program's version number and exit
- `-o, --output_prefix`: The prefix for the output files (default is `"mercury"`)
These arguments change the type of submission Mercury prepares.
- `--organism`: The organism type of all the samples in the table (default is `"sars-cov-2"`; options include `"sars-cov-2"`, `"mpox"`, and `"flu"`; contact us if you would like any additional organisms/databases supported)
- `--skip_ncbi`: Add to skip NCBI metadata preparation; prepare metadata and sequencing files only for GISAID submission
These arguments customize the configuration of the required and/or optional metadata.
- `--skip_county`: Add to skip adding county to location in GISAID metadata
- `--usa_territory`: Add if the country is a USA territory to use the territory name in the state column; this is useful for territories like Puerto Rico (e.g., instead of "North America/USA/Puerto Rico", the location will be "North America/Puerto Rico")
- `--using_clearlabs_data`: Add if using Clearlabs-generated data and metrics
- `--using_reads_dehosted`: Add if using reads_dehosted instead of clearlabs data
- `--single_end`: Add if the data is single-end; this ensures that the `read2` column is not included in the metadata
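The effect of `--usa_territory` and `--skip_county` on the GISAID location string can be sketched as follows. This is an interpretation of the descriptions above rather than Mercury's code: the function name, the `/` separator (taken from the Puerto Rico example), and the assumption that the county is appended as the last element are all illustrative.

```python
def gisaid_location(continent, country, state, county="", skip_county=False, usa_territory=False):
    """Join location fields into a single string (hypothetical helper)."""
    parts = [continent]
    if not usa_territory:
        # For a USA territory, the country is dropped and the territory name,
        # held in the state column, follows the continent directly.
        parts.append(country)
    parts.append(state)
    if county and not skip_county:
        # Assumption: the county, when present, is the final element.
        parts.append(county)
    return "/".join(parts)


print(gisaid_location("North America", "USA", "Puerto Rico", usa_territory=True))
print(gisaid_location("North America", "USA", "California", county="Alameda", skip_county=True))
```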
These arguments populate metadata fields, overwriting the referenced column with the provided input.
- `--amplicon_primer_scheme`: Add and populate to overwrite `amplicon_primer_scheme` column with input
- `--amplicon_size`: Add and populate to overwrite `amplicon_size` column with input
- `--authors`: Add and populate to overwrite `authors` column with input
- `--bioproject_accession`: Add and populate to overwrite `bioproject_accession` column with input
- `--continent`: Add and populate to overwrite `continent` column with input
- `--country`: Add and populate to overwrite `country` column with input
- `--gisaid_submitter`: Add and populate to overwrite `gisaid_submitter` column with input
- `--host_disease`: Add and populate to overwrite `host_disease` column with input
- `--instrument_model`: Add and populate to overwrite `instrument_model` column with input
- `--isolation_source`: Add and populate to overwrite `isolation_source` column with input
- `--library_layout`: Add and populate to overwrite `library_layout` column with input
- `--library_selection`: Add and populate to overwrite `library_selection` column with input
- `--library_source`: Add and populate to overwrite `library_source` column with input
- `--library_strategy`: Add and populate to overwrite `library_strategy` column with input
- `--metadata_organism`: Add and populate to overwrite `organism` column with input
- `--purpose_of_sequencing`: Add and populate to overwrite `purpose_of_sequencing` column with input
- `--seq_platform`: Add and populate to overwrite `seq_platform` column with input
- `--state`: Add and populate to overwrite `state` column with input
- `--submitter_email`: Add and populate to overwrite `submitter_email` column with input
- `--submitting_lab`: Add and populate to overwrite `submitting_lab` column with input
- `--submitting_lab_address`: Add and populate to overwrite `submitting_lab_address` column with input
The --using_clearlabs_data and --using_reads_dehosted arguments change the default values for the read1_column_name, assembly_fasta_column_name, and assembly_mean_coverage_column_name metadata columns. The default values are shown in the table below in addition to what they are changed to depending on what arguments are used.

| Variable | Default Value | with `--using_clearlabs_data` | with `--using_reads_dehosted` | with both `--using_clearlabs_data` and `--using_reads_dehosted` |
|---|---|---|---|---|
| `read1_column_name` | `"read1_dehosted"` | `"clearlabs_fastq_gz"` | `"reads_dehosted"` | `"reads_dehosted"` |
| `assembly_fasta_column_name` | `"assembly_fasta"` | `"clearlabs_fasta"` | `"assembly_fasta"` | `"clearlabs_fasta"` |
| `assembly_mean_coverage_column_name` | `"assembly_mean_coverage"` | `"clearlabs_assembly_coverage"` | `"assembly_mean_coverage"` | `"clearlabs_assembly_coverage"` |

These arguments are currently only implemented for SARS-CoV-2. If any samples do not meet the quality thresholds, they will not be submitted to the respective databases and will be found in the <output_prefix>_excluded_samples.tsv file. If provided for other organisms, they will be ignored.
- `--vadr_alert_limit`: The maximum number of VADR alerts allowed for SARS-CoV-2 samples (default is `0`)
- `--number_n_threshold`: The maximum number of Ns allowed in SARS-CoV-2 assemblies (default is `5000`)
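A minimal sketch of how the two thresholds could separate passing samples from excluded ones is shown below. It assumes that both limits are inclusive maximums, as the descriptions suggest; the field names `vadr_num_alerts` and `number_n` and the function name are hypothetical and may not match the columns Mercury reads.

```python
def split_by_quality(samples, vadr_alert_limit=0, number_n_threshold=5000):
    """Split samples into (passed, excluded) using inclusive maximum thresholds."""
    passed, excluded = [], []
    for sample in samples:
        within_limits = (
            sample["vadr_num_alerts"] <= vadr_alert_limit
            and sample["number_n"] <= number_n_threshold
        )
        (passed if within_limits else excluded).append(sample)
    return passed, excluded


# Hypothetical sample records for illustration.
samples = [
    {"name": "sample01", "vadr_num_alerts": 0, "number_n": 120},
    {"name": "sample02", "vadr_num_alerts": 2, "number_n": 300},
    {"name": "sample03", "vadr_num_alerts": 0, "number_n": 7500},
]
passed, excluded = split_by_quality(samples)
print([s["name"] for s in passed], [s["name"] for s in excluded])
```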
These arguments control the amount of logging that is output to the console.
- `--verbose`: Add to enable verbose logging
- `--debug`: Add to enable debug logging; overwrites `--verbose`
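The precedence between the two flags can be sketched with Python's standard `logging` module. The mapping of `--verbose` to `INFO` and of the default to `WARNING` is an assumption made for illustration; only the rule that `--debug` overrides `--verbose` comes from the description above.

```python
import logging


def pick_log_level(verbose=False, debug=False):
    """Choose a logging level; debug takes precedence over verbose."""
    if debug:
        return logging.DEBUG
    if verbose:
        return logging.INFO  # assumed level for --verbose
    return logging.WARNING  # assumed default level


logging.basicConfig(level=pick_log_level(verbose=True, debug=True))
print(logging.getLevelName(pick_log_level(verbose=True, debug=True)))
```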
Happy submissions!