High-throughput SSR Identification and Primer Design Pipeline
MegaSSR is a comprehensive bioinformatics pipeline for identifying Simple Sequence Repeats (SSRs/microsatellites) in genomic sequences, classifying them as genic or intergenic, and designing PCR primers for molecular marker development.
The pipeline integrates MISA for SSR detection, custom annotation scripts for genic/intergenic classification, and Primer3 for automated primer design.
- 🔬 SSR Detection - Identifies mono- to hexa-nucleotide repeats using MISA
- 🧬 Gene Annotation - Classifies SSRs as genic (within genes) or intergenic
- 🎯 Primer Design - Automated primer design using Primer3
- ⚡ Parallel Processing - Multi-threaded for high performance
- 📊 Statistics & Visualization - Comprehensive analysis reports and plots
- 💾 Checkpoint System - Resume interrupted analyses
- 📦 Gzip Support - Works with compressed FASTA and GFF files
- 🌐 NCBI Download - Direct download from NCBI FTP links
- 🔧 Universal Compatibility - Automatic header normalization works with any FASTA/GFF format
- Linux/macOS - Native support
- Windows - Supported via WSL2 (Windows Subsystem for Linux)
- Conda or Mamba
- Install WSL2: Open PowerShell as Administrator and run:
wsl --install -d Ubuntu
- Restart your computer
- Open Ubuntu from Start menu and follow the Linux setup below
# Clone the repository
git clone https://github.com/Bioinformatics-UM6P/megassr-v3.git
cd megassr-v3
# Create conda environment (includes all dependencies)
conda env create -f config/megassr.yml
# Activate the environment
conda activate megassr
# Verify installation
./MegaSSR.sh --help
- Perl 5.32+ with BioPerl
- Python 3.11+ with pandas, matplotlib
- Primer3
- MISA
# Quick SSR identification test (~30 seconds)
make test
# Full analysis with primer design (~2-5 minutes)
make test-full
# SSR identification only
./MegaSSR.sh -A 1 -F your_genome.fa -P my_project
# Full analysis with gene annotation and primers
./MegaSSR.sh -A 2 -F your_genome.fa -G annotation.gff -P my_project
# With compressed files
./MegaSSR.sh -A 2 -F genome.fa.gz -G annotation.gff.gz -P my_project
USAGE:
MegaSSR.sh --analysis <type> --fasta <file> --project <name> [options]
REQUIRED:
--analysis, -A <1|2> Analysis type
1 = SSR identification only
2 = SSR + gene annotation + primer design
--fasta, -F <file> Input FASTA file (genome sequence)
--project, -P <name> Project name (used for output directories)
OPTIONAL:
--gff, -G <file> GFF annotation file (required for -A 2)
--feature, -g <type> Genic feature type (default: gene)
--threads, -T <n> Number of threads (default: 4)
NCBI Download:
--ncbi-fasta <url> Download FASTA from NCBI FTP
--ncbi-gff <url> Download GFF from NCBI FTP
| Parameter | Short | Default | Description |
|---|---|---|---|
| --mono | | 20 | Minimum repeats for mononucleotide SSRs |
| --di | | 6 | Minimum repeats for dinucleotide SSRs |
| --tri | | 5 | Minimum repeats for trinucleotide SSRs |
| --tetra | | 4 | Minimum repeats for tetranucleotide SSRs |
| --penta | | 3 | Minimum repeats for pentanucleotide SSRs |
| --hexa | | 3 | Minimum repeats for hexanucleotide SSRs |
| --compound | | 100 | Maximum distance between compound SSRs (bp) |
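To illustrate how the minimum-repeat thresholds behave, here is a minimal Python sketch that finds perfect SSRs with a regular expression. It is not MISA's algorithm and not part of MegaSSR; `find_ssrs` and `MIN_REPEATS` are hypothetical names, and the thresholds simply mirror the defaults listed in the table.

```python
import re

# Default minimum repeat counts keyed by motif length (values from the table above).
MIN_REPEATS = {1: 20, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, repeat_count) tuples, 1-based inclusive coordinates."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # One motif of `size` bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA" reported as a dinucleotide).
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start() + 1, m.end(), motif, len(m.group(0)) // size))
    return hits
```

With the defaults, `find_ssrs("GGG" + "AT" * 6 + "CCC")` reports one dinucleotide SSR, while the same sequence with only five `AT` copies reports none because it falls below the `--di` threshold of 6.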
| Parameter | Short | Default | Description |
|---|---|---|---|
| --primer-min | -s | 18 | Minimum primer length (bp) |
| --primer-opt | -o | 20 | Optimal primer length (bp) |
| --primer-max | -S | 22 | Maximum primer length (bp) |
| --product-min | -r | 250 | Minimum PCR product size (bp) |
| --product-max | -R | 500 | Maximum PCR product size (bp) |
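For orientation, the sketch below shows how primer settings like these could map onto a Primer3 Boulder-IO input record. The tag names are standard Primer3 tags; the function `primer3_record` is hypothetical, and whether MegaSSR builds its Primer3 input exactly this way is an assumption.

```python
def primer3_record(seq_id, template, ssr_start, ssr_len,
                   primer_min=18, primer_opt=20, primer_max=22,
                   product_min=250, product_max=500):
    """Build one Primer3 Boulder-IO record that targets an SSR.

    `ssr_start` is 1-based within `template`; PRIMER_FIRST_BASE_INDEX=1
    tells Primer3 to interpret positions that way.
    """
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={template}",
        f"SEQUENCE_TARGET={ssr_start},{ssr_len}",  # primers must flank this region
        "PRIMER_FIRST_BASE_INDEX=1",
        f"PRIMER_MIN_SIZE={primer_min}",
        f"PRIMER_OPT_SIZE={primer_opt}",
        f"PRIMER_MAX_SIZE={primer_max}",
        f"PRIMER_PRODUCT_SIZE_RANGE={product_min}-{product_max}",
        "=",  # record terminator
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be piped to `primer3_core` on standard input, one record per SSR flanking sequence.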
| Parameter | Description |
|---|---|
| --resume | Resume from last completed phase |
| --restart | Clear checkpoints and start fresh |
| Parameter | Description |
|---|---|
| --ncbi-fasta <url> | Download FASTA file from NCBI FTP |
| --ncbi-gff <url> | Download GFF file from NCBI FTP |
Note: URLs must start with https://ftp.ncbi.nlm.nih.gov/. You can mix local files and NCBI downloads (e.g., local FASTA + NCBI GFF).
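The rule in the note can be sketched as a small check. This is an illustration only: `is_ncbi_ftp_url` and `pick_input` are hypothetical helpers, and it is an assumption that a local file and an NCBI URL for the same input are mutually exclusive.

```python
NCBI_PREFIX = "https://ftp.ncbi.nlm.nih.gov/"

def is_ncbi_ftp_url(url):
    """True if the URL points at the NCBI FTP site over HTTPS."""
    return url.startswith(NCBI_PREFIX)

def pick_input(local_path=None, ncbi_url=None):
    """Decide where one input (FASTA or GFF) comes from.

    Returns ("local", path) or ("ncbi", url). Exactly one source must be given.
    """
    if (local_path is None) == (ncbi_url is None):
        raise ValueError("provide either a local file or an NCBI URL, not both or neither")
    if ncbi_url is not None:
        if not is_ncbi_ftp_url(ncbi_url):
            raise ValueError(f"URL must start with {NCBI_PREFIX}")
        return ("ncbi", ncbi_url)
    return ("local", local_path)
```

Because each input is resolved independently, a local FASTA can be combined with a downloaded GFF and vice versa.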
Standard FASTA format for genome sequences:
>chromosome1
ATCGATCGATCGATCGATCG...
>chromosome2
GCTAGCTAGCTAGCTAGCTA...
Standard GFF3 format for gene annotations:
##gff-version 3
chr1 NCBI gene 1000 2000 . + . ID=gene1;Name=ABC
chr1 NCBI mRNA 1000 2000 . + . ID=mrna1;Parent=gene1
| Format | Uncompressed | Compressed |
|---|---|---|
| FASTA | .fa, .fasta, .fna | .fa.gz, .fasta.gz, .fna.gz |
| GFF | .gff, .gff3 | .gff.gz, .gff3.gz |
Note: MegaSSR automatically normalizes FASTA headers (replaces spaces with underscores) to ensure compatibility with all genome databases (NCBI, Ensembl, custom assemblies). No manual preprocessing required!
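The normalization described in the note amounts to something like the following sketch. The function name `normalize_headers` is hypothetical, and any handling beyond replacing spaces with underscores is an assumption.

```python
def normalize_headers(lines):
    """Yield FASTA lines with spaces in header lines replaced by underscores."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Only header lines are touched; sequence lines pass through unchanged.
            line = ">" + line[1:].strip().replace(" ", "_")
        yield line
```

For example, the header `>NC_003070.9 Arabidopsis thaliana chromosome 1` becomes `>NC_003070.9_Arabidopsis_thaliana_chromosome_1`.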
MegaSSR automatically validates all input files before processing to prevent errors:
FASTA files:
- ✅ File existence and readability
- ✅ Non-empty file
- ✅ Valid FASTA headers (starting with >)
- ✅ Valid nucleotide sequences (A, C, G, T, N only)
- ✅ No invalid characters in sequences

GFF files:
- ✅ File existence and readability
- ✅ Non-empty file
- ✅ Correct number of fields (9 tab-separated columns)
- ✅ Numeric coordinates (start and end positions)
- ✅ Valid coordinate ranges (start ≤ end)
- ✅ Valid strand information (+, -, or .)

FASTA and GFF together:
- ✅ Cross-validation between FASTA and GFF chromosome IDs
- ✅ Warns if GFF contains chromosomes not found in FASTA
- ⚠️ Pipeline continues (doesn't fail) to allow normalization to fix mismatches
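The GFF and cross-validation checks listed above can be sketched in a few lines of Python. This is an illustrative reimplementation, not MegaSSR's validator; `validate_gff_line` and `missing_seqids` are hypothetical names.

```python
def validate_gff_line(line):
    """Return a list of problems found in one GFF line (empty list if it is valid)."""
    if not line.strip() or line.startswith("#"):
        return []  # blank lines, comments, and directives are not checked
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return [f"expected 9 tab-separated fields, found {len(fields)}"]
    problems = []
    start, end, strand = fields[3], fields[4], fields[6]
    if not (start.isdigit() and end.isdigit()):
        problems.append("start and end must be integers")
    elif int(start) > int(end):
        problems.append("start is greater than end")
    if strand not in {"+", "-", "."}:
        problems.append(f"invalid strand {strand!r}")
    return problems

def missing_seqids(fasta_ids, gff_ids):
    """Sequence IDs used in the GFF but absent from the FASTA (warning only)."""
    return sorted(set(gff_ids) - set(fasta_ids))
```

A non-empty result from `missing_seqids` corresponds to the warning case: it is reported, but it does not stop the run.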
If validation fails, you'll see a clear error message indicating the problem. Common issues:
- Invalid characters in sequences (e.g., numbers, special characters)
- Malformed GFF coordinates
- Missing required fields
- Empty files
Results are saved in results/<project_name>/:
results/my_project/
├── my_project-MegaSSR_Results/ # Final results
│ ├── my_project.Genic-primers.txt # Genic SSR primers
│ ├── my_project.interGenic-primers.txt # Intergenic SSR primers
│ ├── my_project.Genic_SSR_with_feature.txt
│ ├── Genic_SSR_flanking_regions.fa
│ ├── Intergenic_SSR_flanking_regions.fa
│ ├── SSR_statistics.txt
│ └── *.csv # Statistical tables
├── intermediate/ # Intermediate files
├── fasta/ # MISA output
└── .megassr_checkpoint # Checkpoint file
The primer output files contain tab-separated columns:
| Column | Description |
|---|---|
| Chromosome | Chromosome/contig ID |
| SSR_ID | Unique SSR identifier |
| SSR_Type | Repeat type (mono, di, tri, etc.) |
| SSR_Motif | Repeat motif sequence |
| SSR_Start | SSR start position |
| SSR_End | SSR end position |
| Forward_Primer | Forward primer sequence |
| Reverse_Primer | Reverse primer sequence |
| Product_Size | Expected PCR product size |
| Tm_Forward | Forward primer melting temperature |
| Tm_Reverse | Reverse primer melting temperature |
| GC_Forward | Forward primer GC content |
| GC_Reverse | Reverse primer GC content |
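A primer file with these columns can be filtered with Python's standard `csv` module, as in the sketch below. It assumes the file starts with a header row using the column names from the table; check your own output before relying on that. `load_primers` is a hypothetical helper.

```python
import csv

def load_primers(handle, min_product=0, max_product=float("inf")):
    """Read tab-separated primer rows and keep those within a product-size window.

    `handle` is any iterable of lines, such as an open file.
    Assumes a header row that includes a Product_Size column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader
            if min_product <= int(row["Product_Size"]) <= max_product]
```

Typical use would be `with open("results/my_project/my_project-MegaSSR_Results/my_project.Genic-primers.txt") as fh: primers = load_primers(fh, max_product=400)`.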
./MegaSSR.sh \
--analysis 1 \
--fasta genome.fa.gz \
--project rice_ssr \
--threads 8
./MegaSSR.sh \
--analysis 2 \
--fasta genome.fa.gz \
--gff annotation.gff.gz \
--project rice_markers \
--threads 8 \
--mono 15 --di 8 --tri 6 \
--primer-min 18 --primer-opt 20 --primer-max 25 \
--product-min 150 --product-max 400 \
--feature gene
# Download both FASTA and GFF from NCBI
./MegaSSR.sh \
--analysis 2 \
--project ncbi_analysis \
--ncbi-fasta https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.fna.gz \
--ncbi-gff https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.gff.gz
# Mix local FASTA with NCBI GFF
./MegaSSR.sh \
--analysis 2 \
--fasta my_genome.fa \
--ncbi-gff https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/.../xxx_genomic.gff.gz \
--project mixed_analysis
# NCBI FASTA with local GFF
./MegaSSR.sh \
--analysis 2 \
--ncbi-fasta https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/.../xxx_genomic.fna.gz \
--gff my_annotation.gff \
--project mixed_analysis
# If pipeline was interrupted, resume from checkpoint
./MegaSSR.sh \
--analysis 2 \
--fasta genome.fa.gz \
--gff annotation.gff.gz \
--project rice_markers \
--resume
# Show all commands
make help
# Run with custom parameters
make run-full \
FASTA=genome.fa.gz \
GFF=annotation.gff.gz \
PROJECT=my_analysis \
THREADS=8 \
MONO=15 \
DI=8
# Check project status
make status
# Clean results
make clean-project PROJECT=my_analysis
MegaSSR runs in 8 phases:
| Phase | Description | Output |
|---|---|---|
| 1 | Initialization | Directory structure, validation |
| 2 | SSR Detection | .misa files, SSR list |
| 3 | Gene Annotation | Genic/intergenic classification |
| 4 | Genic Primer Design | Primers for genic SSRs |
| 5 | Intergenic Primer Design | Primers for intergenic SSRs |
| 6 | Post-processing | Final formatted outputs |
| 7 | Statistics | Frequency tables, comparisons |
| 8 | Visualization | PNG plots (if enabled) |
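Phase 3 can be pictured as an interval lookup: an SSR is genic if it falls within a feature of the selected type (`gene` by default) on the same sequence, and intergenic otherwise. The sketch below is a simplified illustration, not MegaSSR's annotation script. `index_genes` and `classify_ssr` are hypothetical names, and requiring the SSR to lie fully inside the gene (rather than merely overlap it) is an assumption.

```python
from collections import defaultdict

def index_genes(gff_lines, feature="gene"):
    """Collect (start, end) intervals per sequence ID for one GFF feature type."""
    genes = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == feature:
            genes[fields[0]].append((int(fields[3]), int(fields[4])))
    return genes

def classify_ssr(chrom, start, end, genes):
    """Return "genic" if the SSR lies fully inside any indexed gene, else "intergenic"."""
    inside = any(g_start <= start and end <= g_end
                 for g_start, g_end in genes.get(chrom, []))
    return "genic" if inside else "intergenic"
```

The linear scan keeps the example short; for whole genomes, sorting the intervals and using binary search or an interval tree would be the usual choice.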
1. "FASTA file not found"
# Check file path and ensure it exists
ls -la your_genome.fa
2. "GFF file required for analysis type 2"
# Add --gff parameter for full analysis
./MegaSSR.sh -A 2 -F genome.fa -G annotation.gff -P project
3. "No SSRs found"
- Try lowering the threshold parameters (e.g., --mono 10 --di 5)
- Verify your FASTA file contains valid sequences
4. "No primers designed"
- Check if SSRs were found in the previous phase
- MegaSSR v3.0 normalizes FASTA headers automatically, so header-related failures should no longer occur
- If primers are still missing, check the log file in logs/<project>.log for detailed error messages
# Check what phase completed
cat results/my_project/.megassr_checkpoint
# Resume from last checkpoint
./MegaSSR.sh -A 2 -F genome.fa -G annotation.gff -P my_project --resume
Sample data is included for testing:
- Species: Arabidopsis thaliana (chromosome 1)
- FASTA: data/arabidopsis_chr1/arabidopsis_chr1.fa.gz (8.8 MB)
- GFF: data/arabidopsis_chr1/arabidopsis_chr1.gff.gz (5.6 MB)
Expected results:
- ~5,446 SSRs identified
- ~3,184 genic SSRs
- ~2,373 intergenic SSRs
- ~2,790 genic primers
- ~1,429 intergenic primers
- 7 visualization PNG plots
- 10+ CSV output files
MegaSSR/
├── MegaSSR.sh # Main pipeline script
├── Makefile # Build/run automation
├── README.md # This file
├── LICENSE # MIT License
├── config/
│ ├── megassr.yml # Conda environment
│ └── megassr.env # Default settings
├── bin/Script/ # Core processing scripts
│ ├── misa.pl # MISA SSR detection
│ ├── extractseq-*.pl # Sequence extraction
│ ├── *primer*.py # Primer design scripts
│ └── *.py # Statistics/visualization
├── data/
│ └── arabidopsis_chr1/ # Test dataset
├── utils/
│ └── functions.sh # Utility functions
├── results/ # Output directory
├── logs/ # Log files
└── tmp/ # Temporary files
If you use MegaSSR in your research, please cite:
Morad M. Mokhtar, Alsamman M. Alsamman, and Achraf El Allali (2023). MegaSSR: A webserver for large scale SSR identification, classification, and marker development. Frontiers in Plant Science, 14. https://doi.org/10.3389/fpls.2023.1219055
This project is licensed under the MIT License - see the LICENSE file for details.
Zakaria Mahmoud
Contributions are welcome! Please feel free to submit a Pull Request.