
MegaSSR v3.0

High-throughput SSR Identification and Primer Design Pipeline

Features | Installation | Quick Start | Usage | Output | Citation


Overview

MegaSSR is a comprehensive bioinformatics pipeline for identifying Simple Sequence Repeats (SSRs/microsatellites) in genomic sequences, classifying them as genic or intergenic, and designing PCR primers for molecular marker development.

The pipeline integrates MISA for SSR detection, custom annotation scripts for genic/intergenic classification, and Primer3 for automated primer design.

Features

  • 🔬 SSR Detection - Identifies mono- to hexa-nucleotide repeats using MISA
  • 🧬 Gene Annotation - Classifies SSRs as genic (within genes) or intergenic
  • 🎯 Primer Design - Automated primer design using Primer3
  • Parallel Processing - Multi-threaded for high performance
  • 📊 Statistics & Visualization - Comprehensive analysis reports and plots
  • 💾 Checkpoint System - Resume interrupted analyses
  • 📦 Gzip Support - Works with compressed FASTA and GFF files
  • 🌐 NCBI Download - Direct download from NCBI FTP links
  • 🔧 Universal Compatibility - Automatic FASTA header normalization (spaces replaced with underscores) for files from NCBI, Ensembl, or custom assemblies

Installation

Prerequisites

  • Linux/macOS - Native support
  • Windows - Supported via WSL2 (Windows Subsystem for Linux)
  • Conda or Mamba

Windows Users (WSL2)

  1. Install WSL2: Open PowerShell as Administrator and run:
    wsl --install -d Ubuntu
  2. Restart your computer
  3. Open Ubuntu from Start menu and follow the Linux setup below

Setup

# Clone the repository
git clone https://github.com/Bioinformatics-UM6P/megassr-v3.git
cd megassr-v3

# Create conda environment (includes all dependencies)
conda env create -f config/megassr.yml

# Activate the environment
conda activate megassr

# Verify installation
./MegaSSR.sh --help

Dependencies (auto-installed via conda)

  • Perl 5.32+ with BioPerl
  • Python 3.11+ with pandas, matplotlib
  • Primer3
  • MISA

Quick Start

Test with Sample Data

# Quick SSR identification test (~30 seconds)
make test

# Full analysis with primer design (~2-5 minutes)
make test-full

Run Your Analysis

# SSR identification only
./MegaSSR.sh -A 1 -F your_genome.fa -P my_project

# Full analysis with gene annotation and primers
./MegaSSR.sh -A 2 -F your_genome.fa -G annotation.gff -P my_project

# With compressed files
./MegaSSR.sh -A 2 -F genome.fa.gz -G annotation.gff.gz -P my_project

Usage

Command Line Options

USAGE:
    MegaSSR.sh --analysis <type> --fasta <file> --project <name> [options]

REQUIRED:
    --analysis, -A <1|2>    Analysis type
                            1 = SSR identification only
                            2 = SSR + gene annotation + primer design
    --fasta, -F <file>      Input FASTA file (genome sequence)
    --project, -P <name>    Project name (used for output directories)

OPTIONAL:
    --gff, -G <file>        GFF annotation file (required for -A 2)
    --feature, -g <type>    Genic feature type (default: gene)
    --threads, -T <n>       Number of threads (default: 4)
    
    NCBI Download:
    --ncbi-fasta <url>      Download FASTA from NCBI FTP
    --ncbi-gff <url>        Download GFF from NCBI FTP

SSR Mining Parameters

Parameter     Short     Default   Description
--mono        (none)    20        Minimum repeats for mononucleotide SSRs
--di          (none)    6         Minimum repeats for dinucleotide SSRs
--tri         (none)    5         Minimum repeats for trinucleotide SSRs
--tetra       (none)    4         Minimum repeats for tetranucleotide SSRs
--penta       (none)    3         Minimum repeats for pentanucleotide SSRs
--hexa        (none)    3         Minimum repeats for hexanucleotide SSRs
--compound    (none)    100       Maximum distance (bp) between two SSRs reported as one compound SSR
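To make the minimum-repeat thresholds concrete, the sketch below finds perfect repeats with a regular expression using the default values listed above. It is an illustration only, not MISA's implementation: the names `MIN_REPEATS` and `find_perfect_ssrs` are hypothetical, and the sketch does not merge neighbouring SSRs into compound SSRs.

```python
import re

# Default minimum repeat counts from the parameter table, keyed by motif length.
MIN_REPEATS = {1: 20, 2: 6, 3: 5, 4: 4, 5: 3, 6: 3}


def find_perfect_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeat_count) tuples, 1-based inclusive."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_count in sorted(min_repeats.items()):
        # One motif of motif_len bases followed by at least min_count - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_count - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "ATAT" is really the dinucleotide "AT".
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            repeat_count = len(match.group(0)) // motif_len
            hits.append((match.start() + 1, match.end(), motif, repeat_count))
    return sorted(hits)


if __name__ == "__main__":
    # Six copies of "AT" meet the default dinucleotide threshold of 6.
    print(find_perfect_ssrs("GGG" + "AT" * 6 + "CCC"))
```

Lowering a threshold (for example `--di 5`) corresponds to passing a modified dictionary, which is why smaller values report more SSRs.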

Primer Design Parameters

Parameter        Short   Default   Description
--primer-min     -s      18        Minimum primer length (bp)
--primer-opt     -o      20        Optimal primer length (bp)
--primer-max     -S      22        Maximum primer length (bp)
--product-min    -r      250       Minimum PCR product size (bp)
--product-max    -R      500       Maximum PCR product size (bp)
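As a rough illustration of how these parameters could translate into Primer3 input, the sketch below builds one Boulder-IO record. The tag names are standard Primer3 tags, but the mapping from MegaSSR flags to tags and the function name `primer3_record` are assumptions; the pipeline's own scripts may build their input differently.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length,
                   primer_min=18, primer_opt=20, primer_max=22,
                   product_min=250, product_max=500):
    """Build one Primer3 Boulder-IO record that targets an SSR.

    ssr_start is 1-based; PRIMER_FIRST_BASE_INDEX=1 tells Primer3 to
    interpret positions that way.
    """
    tags = [
        ("SEQUENCE_ID", seq_id),
        ("SEQUENCE_TEMPLATE", template),
        ("SEQUENCE_TARGET", f"{ssr_start},{ssr_length}"),
        ("PRIMER_FIRST_BASE_INDEX", 1),
        ("PRIMER_MIN_SIZE", primer_min),
        ("PRIMER_OPT_SIZE", primer_opt),
        ("PRIMER_MAX_SIZE", primer_max),
        ("PRIMER_PRODUCT_SIZE_RANGE", f"{product_min}-{product_max}"),
    ]
    # Each record is a list of TAG=VALUE lines terminated by a lone "=".
    return "\n".join(f"{tag}={value}" for tag, value in tags) + "\n=\n"


if __name__ == "__main__":
    # Placeholder template; a real run would use the SSR flanking region.
    print(primer3_record("example_ssr", "ACGT" * 150, 290, 12), end="")
```

The primers are placed in the flanking sequence on either side of the target, so the template must extend far enough beyond the SSR to allow a product within the requested size range.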

Checkpoint/Resume Options

Parameter    Description
--resume     Resume from last completed phase
--restart    Clear checkpoints and start fresh
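The resume logic can be pictured as follows. This is a hypothetical sketch: the README does not document the checkpoint file's format, so the example assumes only that the number of the last completed phase is known, and `phases_to_run` is an illustrative name. The phase names are taken from the Pipeline Phases section.

```python
PHASES = [
    "Initialization",
    "SSR Detection",
    "Gene Annotation",
    "Genic Primer Design",
    "Intergenic Primer Design",
    "Post-processing",
    "Statistics",
    "Visualization",
]


def phases_to_run(last_completed, resume):
    """Return the phase numbers (1-8) still to execute.

    last_completed is the number of the last finished phase, or 0 if none.
    With resume=False (the --restart case) every phase runs again.
    """
    first = last_completed + 1 if resume else 1
    return list(range(first, len(PHASES) + 1))


if __name__ == "__main__":
    for number in phases_to_run(last_completed=3, resume=True):
        print(number, PHASES[number - 1])
```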

NCBI FTP Download Options

Parameter             Description
--ncbi-fasta <url>    Download FASTA file from NCBI FTP
--ncbi-gff <url>      Download GFF file from NCBI FTP

Note: URLs must start with https://ftp.ncbi.nlm.nih.gov/. You can mix local files and NCBI downloads (e.g., local FASTA + NCBI GFF).
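The URL rule in the note above amounts to a prefix check. The sketch below shows one way to express it and to derive a local file name from the URL; both function names are hypothetical and the pipeline may validate or name downloads differently.

```python
from urllib.parse import urlparse

NCBI_FTP_PREFIX = "https://ftp.ncbi.nlm.nih.gov/"


def is_ncbi_ftp_url(url):
    """True if the URL starts with the required NCBI FTP prefix."""
    return url.startswith(NCBI_FTP_PREFIX)


def local_filename(url):
    """Return the last path component of the URL."""
    return urlparse(url).path.rsplit("/", 1)[-1]


if __name__ == "__main__":
    url = ("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/"
           "GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.fna.gz")
    print(is_ncbi_ftp_url(url), local_filename(url))
```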

Input Formats

FASTA File

Standard FASTA format for genome sequences:

>chromosome1
ATCGATCGATCGATCGATCG...
>chromosome2
GCTAGCTAGCTAGCTAGCTA...

GFF File

Standard GFF3 format for gene annotations:

##gff-version 3
chr1    NCBI    gene    1000    2000    .    +    .    ID=gene1;Name=ABC
chr1    NCBI    mRNA    1000    2000    .    +    .    ID=mrna1;Parent=gene1

Supported Extensions

Format    Uncompressed         Compressed
FASTA     .fa, .fasta, .fna    .fa.gz, .fasta.gz, .fna.gz
GFF       .gff, .gff3          .gff.gz, .gff3.gz

Note: MegaSSR automatically normalizes FASTA headers (replacing spaces with underscores) so that files from NCBI, Ensembl, and custom assemblies work without manual preprocessing.
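The header normalization described in the note can be sketched in a few lines. This shows only the space-to-underscore replacement stated above; the function names are hypothetical, and any further clean-up the pipeline performs is not covered here.

```python
def normalize_header(line):
    """Replace spaces with underscores in a FASTA header line.

    Sequence lines are returned unchanged apart from the trailing newline.
    """
    if line.startswith(">"):
        return ">" + line[1:].strip().replace(" ", "_")
    return line.rstrip("\n")


def normalize_fasta(lines):
    """Yield normalized lines from any iterable of FASTA lines."""
    for line in lines:
        yield normalize_header(line)


if __name__ == "__main__":
    example = [">chr1 Arabidopsis thaliana\n", "ATCGATCG\n"]
    print("\n".join(normalize_fasta(example)))
```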

Input Validation

MegaSSR automatically validates all input files before processing to prevent errors:

FASTA Validation

  • ✅ File existence and readability
  • ✅ Non-empty file
  • ✅ Valid FASTA headers (starting with >)
  • ✅ Valid nucleotide sequences (A, C, G, T, N only)
  • ✅ No invalid characters in sequences
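The content checks in this list can be approximated as below. This is a minimal sketch rather than the pipeline's validator: `validate_fasta_lines` is a hypothetical name, and treating lowercase (soft-masked) bases as valid is an assumption.

```python
def validate_fasta_lines(lines, allowed=frozenset("ACGTN")):
    """Return a list of error strings; an empty list means the input passed."""
    errors = []
    seen_header = False
    n_records = 0
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            seen_header = True
            n_records += 1
        elif not seen_header:
            errors.append(f"line {lineno}: sequence data before first '>' header")
        else:
            # Assumption: case-insensitive comparison against the allowed set.
            bad = set(line.upper()) - allowed
            if bad:
                errors.append(f"line {lineno}: invalid characters {sorted(bad)}")
    if n_records == 0:
        errors.append("no FASTA records found")
    return errors


if __name__ == "__main__":
    print(validate_fasta_lines([">chr1\n", "ATCG7ATCG\n"]))
```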

GFF Validation

  • ✅ File existence and readability
  • ✅ Non-empty file
  • ✅ Correct number of fields (9 tab-separated columns)
  • ✅ Numeric coordinates (start and end positions)
  • ✅ Valid coordinate ranges (start ≤ end)
  • ✅ Valid strand information (+, -, or .)
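The per-line GFF checks listed above can be expressed as a small function. The sketch follows the rules exactly as the list states them (nine tab-separated fields, integer coordinates, start ≤ end, strand one of +, -, or .); `validate_gff_line` is a hypothetical name, not a function from the pipeline.

```python
def validate_gff_line(line):
    """Return None if a GFF line passes the checks, else an error string."""
    if not line.strip() or line.startswith("#"):
        return None  # blank lines, comments, and directives are not features
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return f"expected 9 tab-separated fields, found {len(fields)}"
    try:
        start, end = int(fields[3]), int(fields[4])
    except ValueError:
        return "start and end must be integers"
    if start > end:
        return f"start {start} is greater than end {end}"
    if fields[6] not in {"+", "-", "."}:
        return f"invalid strand {fields[6]!r}"
    return None


if __name__ == "__main__":
    good = "chr1\tNCBI\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=ABC"
    bad = "chr1\tNCBI\tgene\t2000\t1000\t.\t+\t.\tID=gene1"
    print(validate_gff_line(good), "|", validate_gff_line(bad))
```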

Chromosome ID Consistency

  • ✅ Cross-validation between FASTA and GFF chromosome IDs
  • ✅ Warns if GFF contains chromosomes not found in FASTA
  • ⚠️ Pipeline continues (doesn't fail) to allow normalization to fix mismatches
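The cross-check between the two files is essentially a set difference. In the sketch below, taking the first whitespace-delimited token of each FASTA header as its ID is an assumption, and `gff_ids_missing_from_fasta` is a hypothetical name. A non-empty result corresponds to a warning, not a failure.

```python
def gff_ids_missing_from_fasta(fasta_lines, gff_lines):
    """Return sorted GFF sequence IDs that have no matching FASTA record."""
    fasta_ids = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }
    gff_ids = {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }
    return sorted(gff_ids - fasta_ids)


if __name__ == "__main__":
    fasta = [">chr1 example\n", "ACGT\n", ">chr2\n", "GGCC\n"]
    gff = [
        "##gff-version 3\n",
        "chr1\tNCBI\tgene\t1\t4\t.\t+\t.\tID=gene1\n",
        "chr3\tNCBI\tgene\t1\t4\t.\t+\t.\tID=gene2\n",
    ]
    for seq_id in gff_ids_missing_from_fasta(fasta, gff):
        print(f"WARNING: GFF sequence {seq_id} not found in FASTA")
```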

If validation fails, you'll see a clear error message indicating the problem. Common issues:

  • Invalid characters in sequences (e.g., numbers, special characters)
  • Malformed GFF coordinates
  • Missing required fields
  • Empty files

Output

Results are saved in results/<project_name>/:

results/my_project/
├── my_project-MegaSSR_Results/     # Final results
│   ├── my_project.Genic-primers.txt        # Genic SSR primers
│   ├── my_project.interGenic-primers.txt   # Intergenic SSR primers
│   ├── my_project.Genic_SSR_with_feature.txt
│   ├── Genic_SSR_flanking_regions.fa
│   ├── Intergenic_SSR_flanking_regions.fa
│   ├── SSR_statistics.txt
│   └── *.csv                        # Statistical tables
├── intermediate/                    # Intermediate files
├── fasta/                          # MISA output
└── .megassr_checkpoint             # Checkpoint file

Primer Output Format

The primer output files contain tab-separated columns:

Column            Description
Chromosome        Chromosome/contig ID
SSR_ID            Unique SSR identifier
SSR_Type          Repeat type (mono, di, tri, etc.)
SSR_Motif         Repeat motif sequence
SSR_Start         SSR start position
SSR_End           SSR end position
Forward_Primer    Forward primer sequence
Reverse_Primer    Reverse primer sequence
Product_Size      Expected PCR product size
Tm_Forward        Forward primer melting temperature
Tm_Reverse        Reverse primer melting temperature
GC_Forward        Forward primer GC content
GC_Reverse        Reverse primer GC content
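For downstream filtering, the primer files can be read with Python's standard `csv` module. The sketch below assumes the columns appear in the order listed above and that the file begins with a header row; neither is confirmed by this README, so check a real output file first. The sample row is made-up data for illustration only.

```python
import csv
import io

PRIMER_COLUMNS = [
    "Chromosome", "SSR_ID", "SSR_Type", "SSR_Motif", "SSR_Start", "SSR_End",
    "Forward_Primer", "Reverse_Primer", "Product_Size",
    "Tm_Forward", "Tm_Reverse", "GC_Forward", "GC_Reverse",
]


def read_primers(handle, has_header=True):
    """Read tab-separated primer rows into a list of dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumption: first row holds column names
    return [dict(zip(PRIMER_COLUMNS, row)) for row in reader if row]


if __name__ == "__main__":
    # Made-up example row, not real MegaSSR output.
    sample = "\t".join(PRIMER_COLUMNS) + "\n" + "\t".join([
        "chr1", "SSR_1", "di", "AT", "1200", "1211",
        "ACGTACGTACGTACGTAC", "TGCATGCATGCATGCATG", "300",
        "58.0", "58.5", "50.0", "50.0",
    ]) + "\n"
    rows = read_primers(io.StringIO(sample))
    short = [r for r in rows if int(r["Product_Size"]) <= 400]
    print(len(short), short[0]["SSR_ID"])
```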

Examples

Basic SSR Identification

./MegaSSR.sh \
    --analysis 1 \
    --fasta genome.fa.gz \
    --project rice_ssr \
    --threads 8

Full Analysis with Custom Parameters

./MegaSSR.sh \
    --analysis 2 \
    --fasta genome.fa.gz \
    --gff annotation.gff.gz \
    --project rice_markers \
    --threads 8 \
    --mono 15 --di 8 --tri 6 \
    --primer-min 18 --primer-opt 20 --primer-max 25 \
    --product-min 150 --product-max 400 \
    --feature gene

Download from NCBI FTP

# Download both FASTA and GFF from NCBI
./MegaSSR.sh \
    --analysis 2 \
    --project ncbi_analysis \
    --ncbi-fasta https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.fna.gz \
    --ncbi-gff https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/846/995/GCF_002846995.1_ASM284699v1/GCF_002846995.1_ASM284699v1_genomic.gff.gz

# Mix local FASTA with NCBI GFF
./MegaSSR.sh \
    --analysis 2 \
    --fasta my_genome.fa \
    --ncbi-gff https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/.../xxx_genomic.gff.gz \
    --project mixed_analysis

# NCBI FASTA with local GFF
./MegaSSR.sh \
    --analysis 2 \
    --ncbi-fasta https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/.../xxx_genomic.fna.gz \
    --gff my_annotation.gff \
    --project mixed_analysis

Resume After Interruption

# If pipeline was interrupted, resume from checkpoint
./MegaSSR.sh \
    --analysis 2 \
    --fasta genome.fa.gz \
    --gff annotation.gff.gz \
    --project rice_markers \
    --resume

Using Makefile

# Show all commands
make help

# Run with custom parameters
make run-full \
    FASTA=genome.fa.gz \
    GFF=annotation.gff.gz \
    PROJECT=my_analysis \
    THREADS=8 \
    MONO=15 \
    DI=8

# Check project status
make status

# Clean results
make clean-project PROJECT=my_analysis

Pipeline Phases

MegaSSR runs in 8 phases:

Phase   Description                 Output
1       Initialization              Directory structure, validation
2       SSR Detection               .misa files, SSR list
3       Gene Annotation             Genic/intergenic classification
4       Genic Primer Design         Primers for genic SSRs
5       Intergenic Primer Design    Primers for intergenic SSRs
6       Post-processing             Final formatted outputs
7       Statistics                  Frequency tables, comparisons
8       Visualization               PNG plots (if enabled)
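The genic/intergenic classification in phase 3 is an interval-containment test. The sketch below is one possible formulation, not the pipeline's annotation script: it assumes an SSR counts as genic only when it lies entirely within a gene interval, and the names `build_gene_index` and `classify_ssr` are hypothetical.

```python
from bisect import bisect_right


def build_gene_index(genes):
    """Merge (start, end) gene intervals on one chromosome into a sorted list.

    Coordinates are 1-based and inclusive, as in GFF. Overlapping genes are
    merged so that each position belongs to at most one interval.
    """
    merged = []
    for start, end in sorted(genes):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def classify_ssr(ssr_start, ssr_end, gene_index):
    """Return 'genic' if the SSR lies fully inside a gene interval."""
    starts = [start for start, _ in gene_index]  # rebuilt per call for brevity
    i = bisect_right(starts, ssr_start) - 1  # last interval starting at or before the SSR
    if i >= 0 and ssr_end <= gene_index[i][1]:
        return "genic"
    return "intergenic"


if __name__ == "__main__":
    index = build_gene_index([(1000, 2000), (1500, 2500), (5000, 6000)])
    print(classify_ssr(1200, 1230, index), classify_ssr(3000, 3020, index))
```

Changing `--feature` would change which GFF rows feed the interval list, for example `mRNA` rows in place of `gene` rows.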

Troubleshooting

Common Issues

1. "FASTA file not found"

# Check file path and ensure it exists
ls -la your_genome.fa

2. "GFF file required for analysis type 2"

# Add --gff parameter for full analysis
./MegaSSR.sh -A 2 -F genome.fa -G annotation.gff -P project

3. "No SSRs found"

  • Try lowering the threshold parameters (e.g., --mono 10 --di 5)
  • Verify your FASTA file contains valid sequences

4. "No primers designed"

  • Check if SSRs were found in the previous phase
  • MegaSSR v3.0 normalizes FASTA headers automatically, so header-related failures should no longer occur
  • If primers are still missing, check the log file in logs/<project>.log for detailed error messages

Resume Failed Analysis

# Check what phase completed
cat results/my_project/.megassr_checkpoint

# Resume from last checkpoint
./MegaSSR.sh -A 2 -F genome.fa -G annotation.gff -P my_project --resume

Test Data

Sample data is included for testing:

  • Species: Arabidopsis thaliana (chromosome 1)
  • FASTA: data/arabidopsis_chr1/arabidopsis_chr1.fa.gz (8.8 MB)
  • GFF: data/arabidopsis_chr1/arabidopsis_chr1.gff.gz (5.6 MB)

Expected results:

  • ~5,446 SSRs identified
  • ~3,184 genic SSRs
  • ~2,373 intergenic SSRs
  • ~2,790 genic primers
  • ~1,429 intergenic primers
  • 7 visualization PNG plots
  • 10+ CSV output files

Project Structure

MegaSSR/
├── MegaSSR.sh              # Main pipeline script
├── Makefile                # Build/run automation
├── README.md               # This file
├── LICENSE                 # MIT License
├── config/
│   ├── megassr.yml         # Conda environment
│   └── megassr.env         # Default settings
├── bin/Script/             # Core processing scripts
│   ├── misa.pl             # MISA SSR detection
│   ├── extractseq-*.pl     # Sequence extraction
│   ├── *primer*.py         # Primer design scripts
│   └── *.py                # Statistics/visualization
├── data/
│   └── arabidopsis_chr1/   # Test dataset
├── utils/
│   └── functions.sh        # Utility functions
├── results/                # Output directory
├── logs/                   # Log files
└── tmp/                    # Temporary files

Citation

If you use MegaSSR in your research, please cite:

Morad M. Mokhtar, Alsamman M. Alsamman, and Achraf El Allali (2023). MegaSSR: A webserver for large scale SSR identification, classification, and marker development. Frontiers in Plant Science, 14. https://doi.org/10.3389/fpls.2023.1219055

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Zakaria Mahmoud

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Acknowledgments

  • MISA - Microsatellite identification tool
  • Primer3 - Primer design tool
  • BioPerl - Perl tools for bioinformatics
