MegaSSR v3.0

High-throughput SSR Identification and Primer Design Pipeline

Features • Installation • Quick Start • Usage • Output • Citation

Overview

MegaSSR is a comprehensive bioinformatics pipeline for identifying Simple Sequence Repeats (SSRs/microsatellites) in genomic sequences, classifying them as genic or intergenic, and designing PCR primers for molecular marker development.

The pipeline integrates MISA for SSR detection, custom annotation scripts for genic/intergenic classification, and Primer3 for automated primer design.

Features

🔬 SSR Detection - Identifies mono- to hexa-nucleotide repeats using MISA
🧬 Gene Annotation - Classifies SSRs as genic (within genes) or intergenic
🎯 Primer Design - Automated primer design using Primer3
⚡ Parallel Processing - Multi-threaded execution (set with --threads, default 4)
📊 Statistics & Visualization - Comprehensive analysis reports and plots
💾 Checkpoint System - Resume interrupted analyses
📦 Gzip Support - Works with compressed FASTA and GFF files
🌐 NCBI Download - Direct download from NCBI FTP links
🔧 Broad Compatibility - Automatic header normalization for FASTA/GFF files from NCBI, Ensembl, and custom assemblies

Installation

Prerequisites

Linux/macOS - Native support
Windows - Supported via WSL2 (Windows Subsystem for Linux)
Conda or Mamba

Windows Users (WSL2)

1. Install WSL2: open PowerShell as Administrator and run:
```
wsl --install -d Ubuntu
```
2. Restart your computer.
3. Open Ubuntu from the Start menu and follow the Linux setup below.

Setup

# Clone the repository
git clone https://github.com/Bioinformatics-UM6P/megassr-v3.git
cd megassr-v3

# Create conda environment (includes all dependencies)
conda env create -f config/megassr.yml

# Activate the environment
conda activate megassr

# Verify installation
./MegaSSR.sh --help

Dependencies (auto-installed via conda)

Perl 5.32+ with BioPerl
Python 3.11+ with pandas, matplotlib
Primer3
MISA

Quick Start

Test with Sample Data

# Quick SSR identification test (~30 seconds)
make test

# Full analysis with primer design (~2-5 minutes)
make test-full

Run Your Analysis

# SSR identification only
./MegaSSR.sh -A 1 -F your_genome.fa -P my_project

# Full analysis with gene annotation and primers
./MegaSSR.sh -A 2 -F your_genome.fa -G annotation.gff -P my_project

# With compressed files
./MegaSSR.sh -A 2 -F genome.fa.gz -G annotation.gff.gz -P my_project

Usage

Command Line Options

USAGE:
    MegaSSR.sh --analysis <type> --fasta <file> --project <name> [options]

REQUIRED:
    --analysis, -A <1|2>    Analysis type
                            1 = SSR identification only
                            2 = SSR + gene annotation + primer design
    --fasta, -F <file>      Input FASTA file (genome sequence), or use --ncbi-fasta
    --project, -P <name>    Project name (used for output directories)

OPTIONAL:
    --gff, -G <file>        GFF annotation file (required for -A 2 unless --ncbi-gff is used)
    --feature, -g <type>    Genic feature type (default: gene)
    --threads, -T <n>       Number of threads (default: 4)
    
    NCBI Download:
    --ncbi-fasta <url>      Download FASTA from NCBI FTP
    --ncbi-gff <url>        Download GFF from NCBI FTP

SSR Mining Parameters

Parameter	Default	Description
`--mono`	20	Minimum repeats for mononucleotide SSRs
`--di`	6	Minimum repeats for dinucleotide SSRs
`--tri`	5	Minimum repeats for trinucleotide SSRs
`--tetra`	4	Minimum repeats for tetranucleotide SSRs
`--penta`	3	Minimum repeats for pentanucleotide SSRs
`--hexa`	3	Minimum repeats for hexanucleotide SSRs
`--compound`	100	Max distance between compound SSRs
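
To make the thresholds concrete, here is a minimal Python sketch of how minimum repeat counts define a perfect SSR. It is an illustration only and not MISA's algorithm: `find_ssrs` and `MIN_REPEATS` are hypothetical names, compound SSRs (`--compound`) are not handled, and coordinates are assumed to be 1-based and inclusive.

```python
import re

# Minimum repeat counts per motif length, mirroring the defaults in the table above.
MIN_REPEATS = {1: 20, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # One motif of `size` bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # so a homopolymer run is not also reported as a dinucleotide SSR.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start() + 1, m.end(), motif, len(m.group(0)) // size))
    return sorted(hits)


print(find_ssrs("GGC" + "AT" * 6 + "CCG"))  # [(4, 15, 'AT', 6)]
```

With the defaults, a run of five `AT` copies would not be reported, while passing `{2: 5}` as `min_repeats` would report it; this is the effect of lowering `--di`.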

Primer Design Parameters

Parameter	Short	Default	Description
`--primer-min`	`-s`	18	Minimum primer length (bp)
`--primer-opt`	`-o`	20	Optimal primer length (bp)
`--primer-max`	`-S`	22	Maximum primer length (bp)
`--product-min`	`-r`	250	Minimum PCR product size (bp)
`--product-max`	`-R`	500	Maximum PCR product size (bp)
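
These options correspond naturally to standard Primer3 input tags. The sketch below builds one Primer3 Boulder-IO record from them; how MegaSSR itself passes the values to Primer3 is an assumption here, and `primer3_record` is a hypothetical helper, not part of the pipeline.

```python
def primer3_record(seq_id, template, ssr_start, ssr_len,
                   primer_min=18, primer_opt=20, primer_max=22,
                   product_min=250, product_max=500):
    """Build one Primer3 Boulder-IO record with the SSR as the target region."""
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        # Interpret positions as 1-based (Primer3 uses 0-based unless told otherwise).
        ("PRIMER_FIRST_BASE_INDEX", 1),
        # Primers must flank this region, given as "<start>,<length>".
        ("SEQUENCE_TARGET", f"{ssr_start},{ssr_len}"),
        ("PRIMER_MIN_SIZE", primer_min),    # --primer-min
        ("PRIMER_OPT_SIZE", primer_opt),    # --primer-opt
        ("PRIMER_MAX_SIZE", primer_max),    # --primer-max
        ("PRIMER_PRODUCT_SIZE_RANGE", f"{product_min}-{product_max}"),
    ]
    # Each record is a list of TAG=VALUE lines terminated by a line holding "=".
    return "\n".join(f"{key}={value}" for key, value in tags) + "\n=\n"


print(primer3_record("ssr_0001", "ACGT" * 150, ssr_start=290, ssr_len=12))
```

The resulting text could be piped to `primer3_core` on standard input.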

Checkpoint/Resume Options

Parameter	Description
`--resume`	Resume from last completed phase
`--restart`	Clear checkpoints and start fresh

NCBI FTP Download Options

Parameter	Description
`--ncbi-fasta <url>`	Download FASTA file from NCBI FTP
`--ncbi-gff <url>`	Download GFF file from NCBI FTP

Note: URLs must start with https://ftp.ncbi.nlm.nih.gov/. You can mix local files and NCBI downloads (e.g., local FASTA + NCBI GFF).
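
If you generate download URLs in a script, a quick pre-check of the prefix rule can save a failed run. This is a small sketch; `is_ncbi_ftp_url` is a hypothetical helper and may be stricter or looser than the check MegaSSR performs.

```python
NCBI_FTP_PREFIX = "https://ftp.ncbi.nlm.nih.gov/"


def is_ncbi_ftp_url(url):
    """Return True if `url` is an HTTPS path below the NCBI FTP host."""
    return url.startswith(NCBI_FTP_PREFIX) and len(url) > len(NCBI_FTP_PREFIX)


print(is_ncbi_ftp_url("ftp://ftp.ncbi.nlm.nih.gov/genomes/x.fna.gz"))  # False
```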

Input Formats

FASTA File

Standard FASTA format for genome sequences:

>chromosome1
ATCGATCGATCGATCGATCG...
>chromosome2
GCTAGCTAGCTAGCTAGCTA...

GFF File

Standard GFF3 format for gene annotations:

##gff-version 3
chr1    NCBI    gene    1000    2000    .    +    .    ID=gene1;Name=ABC
chr1    NCBI    mRNA    1000    2000    .    +    .    ID=mrna1;Parent=gene1

Supported Extensions

Format	Uncompressed	Compressed
FASTA	`.fa`, `.fasta`, `.fna`	`.fa.gz`, `.fasta.gz`, `.fna.gz`
GFF	`.gff`, `.gff3`	`.gff.gz`, `.gff3.gz`
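
For your own scripts that read the same inputs, compressed and uncompressed files can be opened through one helper. This is a sketch under the assumption that compression is detected from the `.gz` extension; `open_text` is a hypothetical name and MegaSSR may detect compression differently.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading, chosen by extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decodes the stream to text
    return open(path, "r")
```

Usage: `with open_text("genome.fa.gz") as handle:` then iterate over `handle` line by line.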

Note: MegaSSR automatically normalizes FASTA headers by replacing spaces with underscores, so files from NCBI, Ensembl, and custom assemblies can be used without manual preprocessing.
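
The header normalization can be pictured as the following transformation. This is a minimal sketch of the space-to-underscore rule only; `normalize_fasta_headers` is a hypothetical name, and the pipeline's own script may apply further rules.

```python
def normalize_fasta_headers(lines):
    """Yield FASTA lines with spaces in header lines replaced by underscores."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Only header lines are changed; sequence lines pass through untouched.
            yield ">" + line[1:].strip().replace(" ", "_")
        else:
            yield line


records = [">NC_003070.9 Arabidopsis thaliana chromosome 1\n", "ACGTACGT\n"]
print(list(normalize_fasta_headers(records)))
# ['>NC_003070.9_Arabidopsis_thaliana_chromosome_1', 'ACGTACGT']
```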

Input Validation

MegaSSR automatically validates all input files before processing to prevent errors:

FASTA Validation

✅ File existence and readability
✅ Non-empty file
✅ Valid FASTA headers (starting with >)
✅ Valid nucleotide sequences (A, C, G, T, N only)
✅ No invalid characters in sequences
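
The content checks above can be sketched in a few lines of Python. This is an illustration, not the pipeline's validator: `validate_fasta` is a hypothetical name, and treating lowercase (soft-masked) bases as valid is an assumption.

```python
def validate_fasta(lines):
    """Return a list of error messages; an empty list means all checks passed."""
    errors = []
    seen_header = False
    count = 0
    for count, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            seen_header = True
        elif not seen_header:
            errors.append(f"line {count}: sequence data before the first '>' header")
        else:
            # Assumption: case-insensitive match against A, C, G, T, N.
            bad = sorted(set(line.upper()) - set("ACGTN"))
            if bad:
                errors.append(f"line {count}: invalid characters {''.join(bad)}")
    if count == 0:
        errors.append("file is empty")
    elif not seen_header:
        errors.append("no '>' header found")
    return errors


print(validate_fasta([">chr1\n", "ACGTN\n", "AC7GT\n"]))
# ['line 3: invalid characters 7']
```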

GFF Validation

✅ File existence and readability
✅ Non-empty file
✅ Correct number of fields (9 tab-separated columns)
✅ Numeric coordinates (start and end positions)
✅ Valid coordinate ranges (start ≤ end)
✅ Valid strand information (+, -, or .)
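
A per-line version of these GFF checks might look like the sketch below. `validate_gff_line` is a hypothetical helper that follows the list above; the pipeline's actual messages and edge-case handling may differ.

```python
def validate_gff_line(line):
    """Return an error message for one GFF line, or None if it passes."""
    if not line.strip() or line.startswith("#"):
        return None  # blank lines and comments/directives are not feature lines
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return f"expected 9 tab-separated fields, found {len(fields)}"
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError:
        return "start and end must be numeric"
    if start > end:
        return f"start ({start}) is greater than end ({end})"
    if fields[6] not in {"+", "-", "."}:
        return f"invalid strand '{fields[6]}'"
    return None


good = "chr1\tNCBI\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=ABC"
print(validate_gff_line(good))  # None
```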

Chromosome ID Consistency

✅ Cross-validation between FASTA and GFF chromosome IDs
✅ Warns if GFF contains chromosomes not found in FASTA
⚠️ The pipeline continues after this warning (it does not fail) so that header normalization can resolve mismatches
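
The cross-check amounts to a set difference between the sequence IDs in the two files. In this sketch the FASTA ID is assumed to be the first whitespace-delimited token of each header, and all function names are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    return {line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and len(line.strip()) > 1}


def gff_seqids(gff_lines):
    """Collect column-1 sequence IDs from GFF feature lines."""
    return {line.split("\t")[0] for line in gff_lines
            if line.strip() and not line.startswith("#")}


def missing_in_fasta(fasta_lines, gff_lines):
    """Return GFF sequence IDs with no matching FASTA record (warning candidates)."""
    return sorted(gff_seqids(gff_lines) - fasta_ids(fasta_lines))


fasta = [">chr1 test\n", "ACGT\n"]
gff = ["##gff-version 3\n",
       "chr1\tNCBI\tgene\t1\t4\t.\t+\t.\tID=g1\n",
       "chr2\tNCBI\tgene\t1\t4\t.\t+\t.\tID=g2\n"]
print(missing_in_fasta(fasta, gff))  # ['chr2']
```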

If validation fails, you'll see a clear error message indicating the problem. Common issues:

Invalid characters in sequences (e.g., numbers, special characters)
Malformed GFF coordinates
Missing required fields
Empty files

Output

Results are saved in results/<project_name>/:

results/my_project/
├── my_project-MegaSSR_Results/     # Final results
│   ├── my_project.Genic-primers.txt        # Genic SSR primers
│   ├── my_project.interGenic-primers.txt   # Intergenic SSR primers
│   ├── my_project.Genic_SSR_with_feature.txt
│   ├── Genic_SSR_flanking_regions.fa
│   ├── Intergenic_SSR_flanking_regions.fa
│   ├── SSR_statistics.txt
│   └── *.csv                        # Statistical tables
├── intermediate/                    # Intermediate files
├── fasta/                          # MISA output
└── .megassr_checkpoint             # Checkpoint file

Primer Output Format

The primer output files contain tab-separated columns:

Column	Description
Chromosome	Chromosome/contig ID
SSR_ID	Unique SSR identifier
SSR_Type	Repeat type (mono, di, tri, etc.)
SSR_Motif	Repeat motif sequence
SSR_Start	SSR start position
SSR_End	SSR end position
Forward_Primer	Forward primer sequence
Reverse_Primer	Reverse primer sequence
Product_Size	Expected PCR product size
Tm_Forward	Forward primer melting temperature
Tm_Reverse	Reverse primer melting temperature
GC_Forward	Forward primer GC content
GC_Reverse	Reverse primer GC content
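
For downstream filtering, the primer tables can be read with Python's standard `csv` module. The sketch assumes the columns appear in the order listed above and that the file starts with a header row; both are assumptions, so check one of your own output files first. `read_primers` is a hypothetical helper.

```python
import csv
import io

PRIMER_COLUMNS = [
    "Chromosome", "SSR_ID", "SSR_Type", "SSR_Motif", "SSR_Start", "SSR_End",
    "Forward_Primer", "Reverse_Primer", "Product_Size",
    "Tm_Forward", "Tm_Reverse", "GC_Forward", "GC_Reverse",
]


def read_primers(handle, has_header=True):
    """Yield one dict per primer row from a tab-separated primer file."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) == len(PRIMER_COLUMNS):  # ignore malformed or blank rows
            yield dict(zip(PRIMER_COLUMNS, row))


# Illustrative data only; these values are made up for the example.
demo = io.StringIO(
    "\t".join(PRIMER_COLUMNS) + "\n"
    "chr1\tSSR1\tdi\tAT\t100\t111\tACGTACGTACGTACGTAC\tTGCATGCATGCATGCATG\t280\t59.8\t60.1\t50.0\t50.0\n"
)
short = [p["SSR_ID"] for p in read_primers(demo) if int(p["Product_Size"]) <= 300]
print(short)  # ['SSR1']
```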

Examples

Basic SSR Identification

./MegaSSR.sh \
    --analysis 1 \
    --fasta genome.fa.gz \
    --project rice_ssr \
    --threads 8

Full Analysis with Custom Parameters

./MegaSSR.sh \
    --analysis 2 \
    --fasta genome.fa.gz \
    --gff annotation.gff.gz \
    --project rice_markers \
    --threads 8 \
    --mono 15 --di 8 --tri 6 \
    --primer-min 18 --primer-opt 20 --primer-max 25 \
    --product-min 150 --product-max 400 \
    --feature gene

Download from NCBI FTP

# Download both FASTA and GFF from NCBI
./MegaSSR.sh \
    --analysis 2 \
    --project ncbi_analysis \
    --ncbi-fasta https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.fna.gz \
    --ncbi-gff https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.gff.gz

# Mix local FASTA with NCBI GFF
./MegaSSR.sh \
    --analysis 2 \
    --fasta my_genome.fa \
    --ncbi-gff https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/.../xxx_genomic.gff.gz \
    --project mixed_analysis

# NCBI FASTA with local GFF
./MegaSSR.sh \
    --analysis 2 \
    --ncbi-fasta https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/.../xxx_genomic.fna.gz \
    --gff my_annotation.gff \
    --project mixed_analysis

Resume After Interruption

# If pipeline was interrupted, resume from checkpoint
./MegaSSR.sh \
    --analysis 2 \
    --fasta genome.fa.gz \
    --gff annotation.gff.gz \
    --project rice_markers \
    --resume

Using Makefile

# Show all commands
make help

# Run with custom parameters
make run-full \
    FASTA=genome.fa.gz \
    GFF=annotation.gff.gz \
    PROJECT=my_analysis \
    THREADS=8 \
    MONO=15 \
    DI=8

# Check project status
make status

# Clean results
make clean-project PROJECT=my_analysis

Pipeline Phases

MegaSSR runs in 8 phases:

Phase	Description	Output
1	Initialization	Directory structure, validation
2	SSR Detection	`.misa` files, SSR list
3	Gene Annotation	Genic/intergenic classification
4	Genic Primer Design	Primers for genic SSRs
5	Intergenic Primer Design	Primers for intergenic SSRs
6	Post-processing	Final formatted outputs
7	Statistics	Frequency tables, comparisons
8	Visualization	PNG plots (if enabled)
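
The classification step in phase 3 can be illustrated with a simple interval test against the GFF features selected by `--feature`. This is a sketch, not the pipeline's annotation script: the function names are hypothetical, and counting an SSR as genic only when it lies entirely inside a feature is an assumption (partial overlaps may be treated differently by MegaSSR).

```python
from collections import defaultdict


def gene_intervals(gff_lines, feature="gene"):
    """Map each sequence ID to a list of (start, end) for the chosen feature type."""
    genes = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == feature:
            genes[fields[0]].append((int(fields[3]), int(fields[4])))
    return genes


def classify_ssr(chrom, start, end, genes):
    """Return 'genic' if the SSR lies fully within a feature, else 'intergenic'."""
    for g_start, g_end in genes.get(chrom, []):
        if g_start <= start and end <= g_end:
            return "genic"
    return "intergenic"


gff = ["##gff-version 3\n",
       "chr1\tNCBI\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=ABC\n",
       "chr1\tNCBI\tmRNA\t1000\t2000\t.\t+\t.\tID=mrna1;Parent=gene1\n"]
genes = gene_intervals(gff)
print(classify_ssr("chr1", 1200, 1223, genes))  # genic
print(classify_ssr("chr1", 2500, 2520, genes))  # intergenic
```

The linear scan keeps the example short; for whole genomes a sorted list with binary search or an interval tree would scale better.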

Troubleshooting

Common Issues

1. "FASTA file not found"

# Check file path and ensure it exists
ls -la your_genome.fa

2. "GFF file required for analysis type 2"

# Add --gff parameter for full analysis
./MegaSSR.sh -A 2 -F genome.fa -G annotation.gff -P project

3. "No SSRs found"

Try lowering the threshold parameters (e.g., --mono 10 --di 5)
Verify your FASTA file contains valid sequences

4. "No primers designed"

Check if SSRs were found in the previous phase
MegaSSR v3.0 normalizes FASTA headers automatically, so missing primers caused by header formatting should no longer occur
If primers are still missing, check the log file in logs/<project>.log for detailed error messages

Resume Failed Analysis

# Check what phase completed
cat results/my_project/.megassr_checkpoint

# Resume from last checkpoint
./MegaSSR.sh -A 2 -F genome.fa -G annotation.gff -P my_project --resume

Test Data

Sample data is included for testing:

Species: Arabidopsis thaliana (chromosome 1)
FASTA: data/arabidopsis_chr1/arabidopsis_chr1.fa.gz (8.8 MB)
GFF: data/arabidopsis_chr1/arabidopsis_chr1.gff.gz (5.6 MB)

Expected results:

~5,446 SSRs identified
~3,184 genic SSRs
~2,373 intergenic SSRs
~2,790 genic primers
~1,429 intergenic primers
7 visualization PNG plots
10+ CSV output files

Project Structure

MegaSSR/
├── MegaSSR.sh              # Main pipeline script
├── Makefile                # Build/run automation
├── README.md               # This file
├── LICENSE                 # MIT License
├── config/
│   ├── megassr.yml         # Conda environment
│   └── megassr.env         # Default settings
├── bin/Script/             # Core processing scripts
│   ├── misa.pl             # MISA SSR detection
│   ├── extractseq-*.pl     # Sequence extraction
│   ├── *primer*.py         # Primer design scripts
│   └── *.py                # Statistics/visualization
├── data/
│   └── arabidopsis_chr1/   # Test dataset
├── utils/
│   └── functions.sh        # Utility functions
├── results/                # Output directory
├── logs/                   # Log files
└── tmp/                    # Temporary files

Citation

If you use MegaSSR in your research, please cite:

Morad M. Mokhtar, Alsamman M. Alsamman, and Achraf El Allali (2023). MegaSSR: A webserver for large scale SSR identification, classification, and marker development. Frontiers in Plant Science, 14. https://doi.org/10.3389/fpls.2023.1219055

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Zakaria Mahmoud

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

MISA - Microsatellite identification tool
Primer3 - Primer design tool
BioPerl - Perl tools for bioinformatics
