A rule describes how to transform one file into another:
Rule 1: file.bam → file.vcf
Rule 2: file.fastq → file.bam
Rules are not executed if output exists and is newer than input
When run, snakemake expects a Snakefile
in the current directory. An example:
rule:
input: 'slides.md'
output: 'slides.html'
shell:
'pandoc -t revealjs -o {output} {input}'
Running snakemake slides.html
generates slides.html
from slides.md
.
rule pandoc:
input:
md='slides.md',
template='default.revealjs'
output:
html='slides.html'
shell:
'pandoc --template={input.template} -o {output.html} {input.md}'
Multiple output files are also possible (unlike make)!
"""
for multi-line strings{...}
to access parameters{
and }
if part of the commandrule:
input: 'file.txt'
output: 'file.sum'
shell:
"""
awk '{{ s += $3 }} END {{ print s }}' {input} > {output}
echo "finished"
"""
rule targets:
input: 'slides.html', 'talk.html'
rule pandoc:
input:
md='{base}.md',
template='default.revealjs'
output:
html='{base}.html'
shell:
'pandoc --template={input.template} -o {output.html} {input.md}'
{base}
in input and output file names are allowed.Running snakemake
generates both slides.html
and talk.html
.
snakemake -j 2
uses two cores to generate slides.html
and talk.html
in parallel.
Specify required resources with params
:
rule pandoc:
input: md='{base}.md'
output: html='{base}.html'
params: runtime='00:00:01' # one minute
shell: 'pandoc --template={input.template} -o {output.html} {input.md}'
Then submit to SLURM:
snakemake -j --cluster "sbatch -A b1999111 -t {params.runtime}"
Using pandas and matplotlib to plot a coverage histogram:
import pandas as pd
import matplotlib.pyplot as plt
rule plot_coverage_histogram:
output: pdf='{base}.covhist.pdf'
input: txt='{base}.covhist.txt'
run:
d = pd.read_csv(input.txt, sep='\t', index_col=0, dtype=int)
fig = plt.figure()
ax = fig.gca()
ax.plot(d.index, d.frequency, 'bo')
fig.savefig(output.pdf)
input.txt
and output.pdf
are usedTask: Map five files with reads to two genome references.
Define datasets and genomes as Python lists:
GENOMES = ['Danio_rerio_Zv9', 'Danio_rerio_GRCz10']
DATASETS = ['miseq1', 'miseq2', 'pacbio17', 'pacbio18', 'hiseq201412']
Use the built-in expand
function to create list of targets:
rule targets:
input:
expand('mapped/{genome}-{dataset}.bam',
genome=GENOMES, dataset=DATASETS)
expand()
creates a list of 2·5=10 file names: mapped/Danio_rerio_Zv9-miseq1.bam
etc.
Use BWA-MEM to map the reads:
rule bwa_mem_single:
output: bam='mapped/{genome}-{dataset}.bam' # multiple wildcards
input:
ref='reference/{genome}.fasta',
bwt='reference/{genome}.fasta.bwt',
fastq='reads/{dataset}.fastq.gz'
shell:
"""
bwa mem {input.ref} {input.fastq} | \
samtools view -uS - | samtools sort -f - {output.bam}
"""
With output file mapped/Danio_rerio_GRCz10-mydata.bam
:
Danio_rerio_GRCz10
mydata
Use threads:
to define the maximum number of threads:
rule bwa_mem_single:
output: ...
input: ...
threads: 8
shell:
"""
bwa mem -t {threads} {input.ref} {input.fastq} ...
"""
snakemake -j 16
.bwa_mem_single
twice in parallel with 8 threads each.Indexing the reference:
rule bwa_index:
output:
'{base}.amb', '{base}.ann', '{base}.bwt', '{base}.pac', '{base}.sa'
input: '{base}'
resources: time=6*60
shell:
'bwa index {input}'
This works even though {base}
is set to reference/Danio_rerio_Zv9.fasta
and contains a slash.
Finally, run snakemake -j 16
to create all 10 BAM files.
snakemake --dag
(DAG: directed acyclic graph)
PRIMER = 'ACGTTAGT'
rule trim_primers:
output: fastq='trimmed.fastq.gz'
input: fastq='merged.fastq.gz'
log: 'cutadapt.log'
shell:
'cutadapt -g ^{PRIMER} -o {output.fastq} {input.fastq} > {log}'
Use the log:
directive in the rule and {log}
in your command.
PRIMERS = [ 'ACGTTAGT', 'TTGGAACC']
rule trim_primers:
output: fastq='trimmed.fastq.gz'
input: fastq='merged.fastq.gz'
log: 'cutadapt.log'
run:
primers = ' '.join('-g ^{}'.format(seq) for seq in PRIMERS)
shell('cutadapt {primers} -o {output.fastq} {input.fastq} > {log}')
Use the shell()
function to run both Python and shell code in a single rule.
params:
PRIMERS = [ 'ACGTTAGT', 'TTGGAACC']
rule trim_primers:
output: fastq='trimmed.fastq.gz'
input: fastq='merged.fastq.gz'
log: 'cutadapt.log'
params: primers=' '.join('-g ^{}'.format(seq) for seq in PRIMERS)
shell:
'cutadapt {params.primers} -o {output.fastq} {input.fastq} > {log}'
Create pipeline.conf
:
PRIMERS = [ 'ACGTTAGT', 'TTGGAACC']
And put include: "pipeline.conf"
in your main Snakefile.
(There is also the configfile:
directive.)
if
for conditional rulesif MERGE_PROGRAM == 'flash':
rule flash_merge:
"""Use FLASH to merge paired-end reads"""
output: 'merged.fastq.gz'
input: 'reads.1.fastq', 'reads.2.fastq'
threads: 8
shell:
"""
flash -t {threads} -c -M {MAX_OVERLAP} {input} | gzip > {output}
"""
elif MERGE_PROGRAM == 'pear':
rule pear_merge:
"""Use pear to merge paired-end reads"""
...
else:
sys.exit("MERGE_PROGRAM not recognized")
To use the unofficial bash strict mode, put this at the beginning of your Snakefile:
shell.prefix("set -euo pipefail;")
If a command fails, the entire rule will fail – this is what you want.
Start it with snakemake --gui
Use our snakemake module on Uppmax:
module use /proj/b2013006/sw/modules
module load snakemake
Read the Snakemake documentation
Lakes wake snakes
Snakemake’s namesakes fake mistakes
Makesnake mate quakes
Matesnakes shake cornflakes
Snakemake ... bakes cakes.