Snakemake is a workflow tool

Developed by Johannes Köster
Started in 2011 at TU Dortmund in our bioinformatics group
Now part of his PhD thesis
Used by a lot of groups now

Snakemake is inspired by make

but with Python-derived, readable syntax
and a focus on bioinformatics pipelines

A workflow is a set of rules

A rule describes how to transform one file into another:

Rule 1: file.bam → file.vcf

Rule 2: file.fastq → file.bam
Filenames determine which rule to apply
Execution order is figured out automatically
Rules are not executed if output exists and is newer than input
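The last point can be sketched in plain Python. This is a simplified illustration of a make-style timestamp check, not Snakemake's actual implementation; the function name needs_update is made up for this example.

```python
import os
import tempfile


def needs_update(output, inputs):
    """Return True if `output` has to be (re)built from `inputs`."""
    if not os.path.exists(output):
        return True  # output is missing, so the rule has to run
    out_mtime = os.path.getmtime(output)
    # Rerun if any input was modified after the output was written.
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


# Demo with throwaway files; timestamps are set explicitly with os.utime().
with tempfile.TemporaryDirectory() as tmp:
    fastq = os.path.join(tmp, 'file.fastq')
    bam = os.path.join(tmp, 'file.bam')
    open(fastq, 'w').close()
    print(needs_update(bam, [fastq]))  # output does not exist yet
    open(bam, 'w').close()
    os.utime(fastq, (1000, 1000))
    os.utime(bam, (2000, 2000))
    print(needs_update(bam, [fastq]))  # output is newer than input
    os.utime(fastq, (3000, 3000))
    print(needs_update(bam, [fastq]))  # input changed after the output
```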

Rules are written in a “Snakefile”

When run, snakemake expects a Snakefile in the current directory. An example:

rule:
    input: 'slides.md'
    output: 'slides.html'
    shell:
        'pandoc -t revealjs -o {output} {input}'

Running snakemake slides.html generates slides.html from slides.md.

Rules and parameters can have names

rule pandoc:
    input:
        md='slides.md',
        template='default.revealjs'
    output:
        html='slides.html'
    shell:
        'pandoc --template={input.template} -o {output.html} {input.md}'

Multiple output files are also possible (unlike make)!

Shell commands must be quoted

Use """ for multi-line strings
Use {...} to access parameters
Double { and } if part of the command

rule:
    input: 'file.txt'
    output: 'file.sum'
    shell:
        """
        awk '{{ s += $3 }} END {{ print s }}' {input} > {output}
        echo "finished"
        """

Wildcards generalize rules

rule targets:
    input: 'slides.html', 'talk.html'

rule pandoc:
    input:
        md='{base}.md',
        template='default.revealjs'
    output:
        html='{base}.html'
    shell:
        'pandoc --template={input.template} -o {output.html} {input.md}'

The first rule is executed if no target is given on the command line.
Wildcards such as {base} in input and output file names are allowed.

Running snakemake generates both slides.html and talk.html.

Rules can run in parallel

... on a cluster

Specify required resources with params:

rule pandoc:
    input:
        md='{base}.md',
        template='default.revealjs'
    output: html='{base}.html'
    params: runtime='00:01:00'  # one minute (HH:MM:SS)
    shell: 'pandoc --template={input.template} -o {output.html} {input.md}'

Then submit to SLURM:

snakemake -j --cluster "sbatch -A b1999111 -t {params.runtime}"

Snakefiles can contain Python code

Using pandas and matplotlib to plot a coverage histogram:

import pandas as pd
import matplotlib.pyplot as plt

rule plot_coverage_histogram:
    output: pdf='{base}.covhist.pdf'
    input: txt='{base}.covhist.txt'
    run:
        d = pd.read_csv(input.txt, sep='\t', index_col=0, dtype=int)
        fig = plt.figure()
        ax = fig.gca()
        ax.plot(d.index, d.frequency, 'bo')
        fig.savefig(output.pdf)

Note how input.txt and output.pdf are available as Python objects inside run:
Snakefiles are Python programs with some extra syntax for rules!

Read mapping example

Define datasets and genomes as Python lists:

GENOMES = ['Danio_rerio_Zv9', 'Danio_rerio_GRCz10']
DATASETS = ['miseq1', 'miseq2', 'pacbio17', 'pacbio18', 'hiseq201412']

Use the built-in expand function to create list of targets:

rule targets:
    input:
        expand('mapped/{genome}-{dataset}.bam',
            genome=GENOMES, dataset=DATASETS)

expand() creates a list of 2·5=10 file names: mapped/Danio_rerio_Zv9-miseq1.bam etc.
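What expand() does here can be approximated with itertools.product. This is only a sketch of the idea; expand_sketch is a made-up name, and the real function has more options.

```python
from itertools import product

GENOMES = ['Danio_rerio_Zv9', 'Danio_rerio_GRCz10']
DATASETS = ['miseq1', 'miseq2', 'pacbio17', 'pacbio18', 'hiseq201412']


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


targets = expand_sketch('mapped/{genome}-{dataset}.bam',
                        genome=GENOMES, dataset=DATASETS)
print(len(targets))
print(targets[0])
```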

Use BWA-MEM to map the reads:

rule bwa_mem_single:
    output: bam='mapped/{genome}-{dataset}.bam'  # multiple wildcards
    input:
        ref='reference/{genome}.fasta',
        bwt='reference/{genome}.fasta.bwt',
        fastq='reads/{dataset}.fastq.gz'
    shell:
        """
        bwa mem {input.ref} {input.fastq} | \
            samtools view -uS - | samtools sort -f - {output.bam}
        """

With output file mapped/Danio_rerio_GRCz10-mydata.bam:

genome is set to Danio_rerio_GRCz10
dataset is set to mydata
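One way to picture how wildcard values are derived from a file name is to turn the output pattern into a regular expression. The sketch below is an assumption about the general idea, not Snakemake's code; match_wildcards is a hypothetical helper and expects each wildcard name to appear only once in the pattern.

```python
import re


def match_wildcards(pattern, filename):
    """Return a dict of wildcard values if `filename` fits `pattern`, else None."""
    regex = ''
    pos = 0
    for m in re.finditer(r'\{(\w+)\}', pattern):
        # Literal text between wildcards must match exactly.
        regex += re.escape(pattern[pos:m.start()])
        # Each wildcard becomes a named group matching one or more characters.
        regex += '(?P<{}>.+)'.format(m.group(1))
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None


print(match_wildcards('mapped/{genome}-{dataset}.bam',
                      'mapped/Danio_rerio_GRCz10-mydata.bam'))
```

Because the group is `.+`, a value may also contain a slash.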

Use threads: to define the maximum number of threads:

rule bwa_mem_single:
    output: ...
    input: ...
    threads: 8
    shell:
        """
        bwa mem -t {threads} {input.ref} {input.fastq} ...
        """

Tell snakemake how many cores to use with snakemake -j 16.
It will then run up to two bwa_mem_single jobs in parallel with 8 threads each.
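The arithmetic behind this, as a rough sketch. It assumes that threads is capped at the number of available cores and ignores everything else the scheduler considers (other resources, how many jobs are ready). plan_parallel_jobs is a made-up name.

```python
def plan_parallel_jobs(cores, threads):
    """Return (threads per job, number of jobs that fit side by side)."""
    # A job cannot use more threads than there are cores.
    effective_threads = min(threads, cores)
    # As many jobs run at once as fit into the core budget.
    concurrent_jobs = cores // effective_threads
    return effective_threads, concurrent_jobs


print(plan_parallel_jobs(cores=16, threads=8))
print(plan_parallel_jobs(cores=4, threads=8))
```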

Indexing the reference:

rule bwa_index:
    output:
        '{base}.amb', '{base}.ann', '{base}.bwt', '{base}.pac', '{base}.sa'
    input: '{base}'
    resources: time=6*60
    shell:
        'bwa index {input}'

This works even though {base} is set to reference/Danio_rerio_Zv9.fasta and contains a slash.

Visualize the dependencies

snakemake --dag prints the DAG (directed acyclic graph) of jobs in Graphviz dot format, for example: snakemake --dag | dot -Tsvg > dag.svg
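The graph is also what determines the execution order. A minimal sketch of that idea, using a depth-first walk over a hand-written dependency table; execution_order and the table are illustrative assumptions, not Snakemake internals.

```python
def execution_order(target, producers):
    """List the files to create, dependencies first.

    `producers` maps each output file to the input files of its rule.
    """
    order = []
    seen = set()

    def visit(filename):
        if filename in seen:
            return
        seen.add(filename)
        for dep in producers.get(filename, []):
            visit(dep)  # make sure inputs come first
        if filename in producers:
            order.append(filename)  # only files that some rule creates

    visit(target)
    return order


# The two rules from the beginning: fastq -> bam -> vcf.
producers = {
    'file.vcf': ['file.bam'],
    'file.bam': ['file.fastq'],
}
print(execution_order('file.vcf', producers))
```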

Primer trimming example

Log output to a file

PRIMER = 'ACGTTAGT'

rule trim_primers:
    output: fastq='trimmed.fastq.gz'
    input: fastq='merged.fastq.gz'
    log: 'cutadapt.log'
    shell:
        'cutadapt -g ^{PRIMER} -o {output.fastq} {input.fastq} > {log}'

Use the log: directive in the rule and {log} in your command.

Use Python to create a command line dynamically

PRIMERS = ['ACGTTAGT', 'TTGGAACC']

rule trim_primers:
    output: fastq='trimmed.fastq.gz'
    input: fastq='merged.fastq.gz'
    log: 'cutadapt.log'
    run:
        primers = ' '.join('-g ^{}'.format(seq) for seq in PRIMERS)
        shell('cutadapt {primers} -o {output.fastq} {input.fastq} > {log}')

Use the shell() function to run both Python and shell code in a single rule.
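To see what command this produces, the string handling can be tried in plain Python. The file names are the ones from the rule above; str.format() stands in for the formatting that shell() performs.

```python
PRIMERS = ['ACGTTAGT', 'TTGGAACC']

# One "-g ^SEQUENCE" option per primer, separated by spaces.
primers = ' '.join('-g ^{}'.format(seq) for seq in PRIMERS)

command = 'cutadapt {primers} -o {output} {input} > {log}'.format(
    primers=primers,
    output='trimmed.fastq.gz',
    input='merged.fastq.gz',
    log='cutadapt.log',
)
print(command)
```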

It becomes simpler with `params:`

PRIMERS = ['ACGTTAGT', 'TTGGAACC']

rule trim_primers:
    output: fastq='trimmed.fastq.gz'
    input: fastq='merged.fastq.gz'
    log: 'cutadapt.log'
    params: primers=' '.join('-g ^{}'.format(seq) for seq in PRIMERS)
    shell:
        'cutadapt {params.primers} -o {output.fastq} {input.fastq} > {log}'

Configure the workflow with an external file

Create pipeline.conf:

PRIMERS = ['ACGTTAGT', 'TTGGAACC']

And put include: "pipeline.conf" in your main Snakefile.

(There is also the configfile: directive.)

Bonus section

Use `if` for conditional rules

import sys

if MERGE_PROGRAM == 'flash':
    rule flash_merge:
        """Use FLASH to merge paired-end reads"""
        output: 'merged.fastq.gz'
        input: 'reads.1.fastq', 'reads.2.fastq'
        threads: 8
        shell:
            """
            flash -t {threads} -c -M {MAX_OVERLAP} {input} | gzip > {output}
            """
elif MERGE_PROGRAM == 'pear':
    rule pear_merge:
        """Use pear to merge paired-end reads"""
        ...
else:
    sys.exit("MERGE_PROGRAM not recognized")

Simplify debugging

To use the unofficial bash strict mode, put this at the beginning of your Snakefile:

shell.prefix("set -euo pipefail;")

If a command fails, the entire rule will fail – this is what you want.
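The effect of the prefix can be checked outside Snakemake. The sketch below assumes bash is installed and runs the same pipeline with and without the strict-mode prefix; exit_code is a made-up helper.

```python
import subprocess


def exit_code(command, prefix=''):
    """Run `command` in bash behind an optional prefix and return its exit status."""
    return subprocess.run(['bash', '-c', prefix + command]).returncode


PIPELINE = 'false | cat'        # first command fails, last one succeeds
STRICT = 'set -euo pipefail;'   # same string as passed to shell.prefix()

print(exit_code(PIPELINE))          # plain bash reports the status of cat
print(exit_code(PIPELINE, STRICT))  # pipefail passes on the failure of false
```

Without the prefix, the failure of the first command in the pipe goes unnoticed.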

Snakemake has a browser interface

Start it with snakemake --gui

Try it yourself

Use our snakemake module on Uppmax:

module use /proj/b2013006/sw/modules
module load snakemake

... or use a virtualenv (with Python 3)
Read the Snakemake documentation

snakemake makes ... snakes?

Marcel Martin

March 25, 2015