scbirlab/nf-promotermap maps Illumina sequences to bacterial genomes and calls peaks.
The pipeline carries out the following steps, given a sample sheet (see below):
- Downloads reference genome and annotations from NCBI
- Trims adapter sequences from Illumina reads using cutadapt.
- Aligns to reference genome using either bowtie2 or minimap2.
- Plot gene start coverage with deeptools.
- Call peaks across all samples with MACS3.
- Annotate peaks with nearest genes.
- Generate FASTA of peak sequences.
- Calculate coverage of each peak for each bin.
- Calculate coverage variance across bins.
- Calculate per-base coverage within each peak for each bin, plus the mean and variance across bins.
- Identify elements associated with strength and variance.
- Identify common sequence motifs in those elements.
- Get FASTQ quality metrics with fastqc.
- Calculate coverage and other metrics with samtools.
- Compile the logs of processing steps into an HTML report with multiqc.
You need to have Nextflow and either Anaconda, Singularity, or Docker installed on your system.
If you're at the Crick or your shared cluster has it already installed, try:
module load Nextflow Singularity
Otherwise, if it's your first time using Nextflow on your system and you have Conda installed, you can install it using conda:
conda install -c bioconda nextflow
You may need to set the NXF_HOME environment variable. For example,
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
Make a sample sheet (see below) and, optionally, a nextflow.config file in the directory where you want the pipeline to run. Then run Nextflow.
nextflow run scbirlab/nf-promotermap -latest
If you want to run a particular tagged version of the pipeline, such as v0.0.3, you can do so using
nextflow run scbirlab/nf-promotermap -r v0.0.3
For help, use nextflow run scbirlab/nf-promotermap --help.
The first time you run the pipeline, the software dependencies
in environment.yml will be installed. This may take several minutes.
The following parameters are required:
- sample_sheet: path to a CSV with information about the samples and FASTQ files to be processed
- fastq_dir: path to where FASTQ files are stored
- control_label: the bin ID (from sample sheet) of background controls
The following parameters have default values which can be overridden if necessary.
- inputs = "inputs": The folder containing your inputs.
- outputs = "outputs": The folder to contain the pipeline outputs.
- trim_qual = 5: Minimum base-call quality for trimming.
- min_length = 9: Discard reads shorter than this number of bases after trimming.
- mapper = "bowtie2": Alignment tool.
The parameters can be provided either in the nextflow.config file or on the nextflow run command.
Here is an example of the nextflow.config file:
params {
sample_sheet = "/path/to/sample-sheet.csv"
inputs = "/path/to/inputs"
fastq_dir = "/path/to/fastq"
control_label = "U" // bin_id of your background control
mapper = "minimap2"
}
Alternatively, you can provide the parameters on the command line:
nextflow run scbirlab/nf-promotermap \
--sample_sheet /path/to/sample-sheet.csv \
--inputs /path/to/inputs \
--fastq_dir /path/to/fastq \
--control_label U \
--mapper minimap2
The sample sheet is a CSV file providing information about which FASTQ files belong to which sample.
The file must have a header with the column names below (in any order), and one line per sample to be processed. You can have additional columns with extra information if you like.
- expt_id: Unique name of a peak-calling experiment. Peaks will be called across all samples with the same experiment ID.
- sample_id: Unique name of the sample within an experiment. FASTQ files under the same sample ID will be combined.
- bin_id: Unique name of a bin within an experiment. Sample IDs under the same bin will be pooled before coverage analysis.
- fastq_pattern: Partial filename that matches at least both R1 and R2 FASTQ files for a sample in the fastq_dir (defined above).
- genome_accession: The NCBI assembly accession number for the genome for alignment and annotation. This number starts with "GCF_" or "GCA_".
- adapter_read1_3prime: the 3' adapter on the forward read to trim. The adapter itself and sequences downstream will be removed.
- adapter_read2_3prime: the 3' adapter on the reverse read to trim. The adapter itself and sequences downstream will be removed.
- adapter_read1_5prime: the 5' adapter on the forward read to trim. The adapter itself and sequences upstream will be removed.
- adapter_read2_5prime: the 5' adapter on the reverse read to trim. The adapter itself and sequences upstream will be removed.
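Because the pipeline expects all of these column names in the header, a quick check before launching can catch a misspelled or missing column. The following is a minimal sketch, not part of the pipeline: the function name missing_columns is hypothetical, and it assumes a plain comma-separated header with no quoting or surrounding spaces.

```shell
# Hypothetical helper (not part of nf-promotermap): list required
# sample-sheet columns that are absent from the CSV header line.
required="expt_id sample_id bin_id fastq_pattern genome_accession adapter_read1_3prime adapter_read2_3prime adapter_read1_5prime adapter_read2_5prime"

missing_columns() {
    # $1: path to the sample sheet. Only the first line (the header) is read.
    # Assumes unquoted column names; strips a Windows carriage return if present.
    header=$(head -n 1 "$1" | tr -d '\r')
    for col in $required; do
        # Wrap both sides in commas so a name only matches a whole field.
        case ",$header," in
            *",$col,"*) ;;
            *) echo "$col" ;;
        esac
    done
}
```

Running missing_columns /path/to/sample-sheet.csv prints one line per missing column and nothing when the header is complete. Column order and extra columns do not matter, in line with the description above.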
Here is an example of the sample sheet:
| expt_id | sample_id | bin_id | fastq_pattern | genome_accession | adapter_read1_3prime | adapter_read2_3prime | adapter_read1_5prime | adapter_read2_5prime |
|---|---|---|---|---|---|---|---|---|
| expt-01 | 01-Unsorted | U | G5512A22_R | GCF_904425475.1 | ATTAACCTCCTAATCGTGCGT | CTACCGCCTTGCTGCTGCGT | ACGCAGCAGCAAGGCGG | ACGCACGATTAGGA |
| expt-01 | 01-Red1 | Red1 | G5512A23_R | GCF_904425475.1 | ATTAACCTCCTAATCGTGCGT | CTACCGCCTTGCTGCTGCGT | ACGCAGCAGCAAGGCGG | ACGCACGATTAGGA |
Read more here.
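The example sample sheet is displayed as a table, but the pipeline reads a CSV file. As a sketch, the same two rows could be written to disk like this. The file name sample-sheet.csv is only an assumption borrowed from the config example, and the field-count check is an optional extra rather than something the pipeline requires.

```shell
# Write the two example rows as a CSV file in the current directory.
# The file name is an assumption; use whatever path you pass as sample_sheet.
cat > sample-sheet.csv <<'EOF'
expt_id,sample_id,bin_id,fastq_pattern,genome_accession,adapter_read1_3prime,adapter_read2_3prime,adapter_read1_5prime,adapter_read2_5prime
expt-01,01-Unsorted,U,G5512A22_R,GCF_904425475.1,ATTAACCTCCTAATCGTGCGT,CTACCGCCTTGCTGCTGCGT,ACGCAGCAGCAAGGCGG,ACGCACGATTAGGA
expt-01,01-Red1,Red1,G5512A23_R,GCF_904425475.1,ATTAACCTCCTAATCGTGCGT,CTACCGCCTTGCTGCTGCGT,ACGCAGCAGCAAGGCGG,ACGCACGATTAGGA
EOF

# Optional sanity check: every row should have as many fields as the header.
# Prints offending lines and exits non-zero if any row differs.
awk -F',' 'BEGIN { bad = 0 }
    NR == 1 { n = NF }
    NF != n { bad = 1; print "line " NR ": " NF " fields, expected " n }
    END { exit bad }' sample-sheet.csv
```

The simple comma split in the check assumes no field contains a quoted comma, which holds for the example rows.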
You can find some examples in the test directory of this repository.
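Since each fastq_pattern has to match both the R1 and R2 files in fastq_dir, it can be worth counting the matches before a run. This is a rough sketch under stated assumptions: the function name count_fastq_matches is hypothetical, and it assumes the pattern is matched anywhere in the file name, which may differ from how the pipeline itself resolves patterns.

```shell
# Hypothetical pre-flight check (not part of nf-promotermap): count the files
# directly inside a FASTQ directory whose names contain a given pattern.
count_fastq_matches() {
    # $1: fastq_dir
    # $2: fastq_pattern, embedded in a glob as *pattern* (an assumption about
    #     matching; glob characters in the pattern are interpreted by find).
    find "$1" -maxdepth 1 -type f -name "*$2*" | wc -l | tr -d ' '
}
```

For a paired-end sample, count_fastq_matches /path/to/fastq G5512A22_R would be expected to print at least 2. A count of 0 suggests a typo in the pattern or the wrong fastq_dir.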
Outputs are saved in the directory specified by --outputs (outputs by default).
They are organised into these directories:
- bigwig: Coverage bigwig files
- coverage: Coverage of peaks per bin
- genome: Reference genomes and annotations
- mapped: BAM files of mapped Illumina reads
- multiqc: HTML reports from the outputs of intermediate steps
- peaks: peak calls
- samtools: Coverage and other metrics
- trimmed: Trimming logs and FASTQ files.
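After a run, one way to confirm that every stage wrote something is to check for the directories listed above. This is a small sketch with a hypothetical function name, check_outputs. It assumes a completed run creates all eight directories, which is not guaranteed for partial or failed runs.

```shell
# Hypothetical post-run check (not part of nf-promotermap): report which of
# the documented output subdirectories are absent.
check_outputs() {
    # $1: outputs directory; falls back to "outputs", the documented default.
    for d in bigwig coverage genome mapped multiqc peaks samtools trimmed; do
        [ -d "${1:-outputs}/$d" ] || echo "missing: $d"
    done
}
```

Calling check_outputs with no argument inspects ./outputs. Pass the same path you gave to --outputs if you changed it.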
If you run into problems not covered here, please open an issue on the issue tracker.
Here are the help pages of the software used by this pipeline.