Bioinformaticians build specialized scientific workflows that make raw experimental data interpretable for wet lab biologists. This software is difficult to develop and plagued by a unique set of challenges, from orchestrating large compute clusters to managing enormous files and deploying sparsely documented academic tools and libraries.
While myriad frameworks and tools have emerged to aid the development of these workflows, most bioinformaticians use Snakemake or Nextflow. Both frameworks are DSLs with dedicated language constructs that address common problems in bioinformatics: file I/O, dependency management and resource scheduling. Both are surrounded by burgeoning open-source communities that curate public examples of workflows for common assays and are actively developing new language features.
Bioinformaticians who use these languages are often part of multidisciplinary scientific organizations that create new challenges. The workflows they develop need to be run at scale to process new volumes of data flowing from modern high-throughput techniques. Workflow executions and data must be tracked for provenance and reproducibility in a central source of truth. A history of every analysis from the conception of a drug program will be required when filing an IND; results cannot be scattered on local machines.
Most importantly, data and analyses need to be made accessible to biologists, not just those with computational abilities, for rapid experimentation and integration into large scale studies. This empowers scientists to explore data and hypothesize independently, increasing the productivity of research teams.
This set of problems has led to bioinformatics platform solutions with managed cloud infrastructure and wet-lab friendly workflow interfaces. While industry Nextflow users have found a solution with Nextflow Tower, there remains no similar option for bioinformaticians who prefer Snakemake.
And there is a need.
Today, LatchBio is releasing native support for Snakemake, offering graphical interfaces, managed infrastructure and downstream analysis solutions to this Python framework.
Before analyzing the technical tradeoffs between these competing languages and diving into the mechanics of the integration, our team would first like to extend our gratitude to Johannes Köster, the creator of the Snakemake project, as well as the broader community, for building and maintaining this framework. We love developing in Snakemake and hope to see the community continue to thrive.
Disclaimer: LatchBio is not affiliated with Johannes or the Snakemake project.
Why Snakemake is better than Nextflow
Python is the lingua franca of bioinformatics
Snakemake uses Python, and Python is the language of bioinformatics. It is a modern, expressive and versatile language: easy to pick up for scripting and bootstrapping small projects, yet capable of powering industry-grade systems and library code (scanpy, biopython, scikit-learn).
Groovy, the language used in Nextflow, is an archaic scripting language developed for the JVM in the early 2000s that few have heard of or understand.
This alone makes a strong case for Snakemake. It is easier to find talent to build and maintain bioinformatics projects over a long period of time if the language they use is modern and widespread. Along with its ubiquity, Python’s growing popularity amongst software engineers has produced excellent dependency management tools and testing frameworks.
Snakemake is Python all the way down
Nextflow’s DSL is overbearing and conflicts with Groovy in unintuitive ways. Nextflow sports channel operators that can be confused with native Groovy collections and that carry different semantic meanings elsewhere in programming. Nextflow’s use of Groovy is also inconsistent: utility methods are not written in Groovy, and such code is hard to test and maintain. Because Nextflow is neither a complete DSL (like WDL) nor consistent Groovy, it makes for a confusing development experience.
The Snakemake DSL is minimal and reads as Python. Wherever the DSL isn’t used, Python is used. Utility methods are written in Python and this code is easily testable and maintainable as packages. It is easier to reason about and develop.
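For example, shared helpers can live in an ordinary Python module that the Snakefile imports, and be unit-tested like any other package code. A minimal sketch (the module and function names here are hypothetical, not from an actual project):

```python
# utils.py - plain Python, importable from a Snakefile with
# `from utils import fastq_pairs` and testable with pytest.

def fastq_pairs(samples):
    """Map each sample name to its paired-end FASTQ paths."""
    return {
        s: [f"fastq/{s}_R1.fq.gz", f"fastq/{s}_R2.fq.gz"]
        for s in samples
    }
```

Because this is ordinary Python, it can be versioned, packaged and tested independently of any workflow run.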
Managing the structure of data in Nextflow is difficult
In Nextflow, structured data is passed around in tuples without static type checking. Nextflow also lacks typing on channels, making it difficult to share and reuse code without referring to the source.
Snakemake instead relies on wildcards, with lambdas defined on parameters to retrieve additional metadata for each job.
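A sketch of what this looks like in practice (the rule, file names and `metadata` dict are illustrative): the `{sample}` wildcard names each job’s files, and a lambda on `params` pulls per-sample metadata from an ordinary Python dict when the job is constructed.

```snakemake
rule align:
    input:
        "fastq/{sample}.fq.gz"
    output:
        "aligned/{sample}.sam"
    params:
        # look up per-sample metadata in a plain Python dict
        read_group=lambda wildcards: metadata[wildcards.sample]["read_group"]
    shell:
        "bwa mem -R '{params.read_group}' ref.fa {input} > {output}"
```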
Snakemake is easy to debug
Workflow authoring is more efficient and less error prone in Snakemake. Nextflow cannot identify which line in your file has the extra comma; Snakemake leverages Python’s parser to point you straight to it.
Nextflow .command files are difficult to find and understand. Errors and logs are much easier to find and interpret in Snakemake.
Nextflow is difficult to configure
Nextflow claims to have zero configuration, but really relies on multiple layers of config files. There are config files for processes, modules, workflows, labels, publishDir, parameters and so on. You can even put code inside config files, which invites bugs. With such nested configuration, it is hard to understand which config file is the source of an issue.
Snakemake configuration is simple and not nested, so it does not lead to these problems.
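For instance, a typical Snakefile loads a single flat YAML file, and every value lives under one `config` dict (an illustrative sketch; the keys are hypothetical):

```snakemake
configfile: "config.yaml"  # e.g. contains `samples: [a, b]` and `genome: hs38DH`

rule all:
    input:
        expand("results/{sample}.bam", sample=config["samples"])
```

There is one place to look when a value is wrong.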
Snakemake has sensible outputs
Snakemake tends to produce a sensible output directory structure, thanks to its directives (output, log, benchmark) and the explicit naming of files with wildcards.
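A hypothetical rule showing how these directives pin down the on-disk layout per sample:

```snakemake
rule sort:
    input:
        "aligned/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    log:
        "logs/sort/{sample}.log"
    benchmark:
        "benchmarks/sort/{sample}.tsv"
    shell:
        "samtools sort {input} -o {output} 2> {log}"
```

Every output, log and benchmark file has an explicit, wildcard-derived path before the workflow ever runs.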
Nextflow’s work directory is difficult to interpret. Its publishDir mechanism is opt-in and can be hard to configure and get working correctly.
Nextflow has more foot guns
Nextflow will not fail a process when a command in the middle of a pipe chain fails. Snakemake runs shell commands with set -euo pipefail by default.
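Concretely, because Snakemake wraps shell commands in strict bash mode, a failure anywhere in a pipe chain fails the job, whereas the equivalent Nextflow process succeeds as long as the final command exits cleanly (the rule and tool invocation here are illustrative):

```snakemake
rule call_variants:
    input:
        "sorted/{sample}.bam"
    output:
        "calls/{sample}.vcf"
    shell:
        # under pipefail, an mpileup failure fails the whole job
        # rather than producing a silently truncated VCF
        "bcftools mpileup -f ref.fa {input} | bcftools call -mv > {output}"
```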
Nextflow’s choice to cache and/or re-run processes is opaque and sometimes confusing. Snakemake is explicit and predictable.
Temporary and intermediate files can be cleaned up automatically in Snakemake; in Nextflow, not so much.
Nextflow has many ways of doing the same thing. This is a blessing and a curse, especially when working in a team where setting standards is important and debugging is difficult (which config file is causing the process to run this way).
Some operators have non-intuitive defaults and behavior, silently drop data from channels when combining or merging, or have names that imply something other than their actual behavior.
What Nextflow is good at
Nextflow is easier to conceptualize at a high level due to its procedural nature, but harder to reason about granularly due to the difficulty of reasoning about what each channel contains.
nf-core sports a diverse and actively developed repository of best-practices workflows for popular bioinformatics assays. These workflows are useful and powerful.
Nextflow has first-class support for S3 and Azure files that "just works". Snakemake requires a special decoration.
Nextflow has better native support for cloud deployment, irrespective of Nextflow Tower. Many Snakefiles are incompatible with cloud execution due to implicit file dependencies between rules.
Nextflow has better conditional processing due to its data flow model. Snakemake’s use of wildcard outputs and checkpoints can be more terse.
Developing Snakemake on Latch
Latch is a biological cloud built to bring together wet and dry lab teams at biotech organizations to store, process and visualize their multi-omics data. The platform now allows you to drop existing Snakemake projects into a data infrastructure that has been refined over years of collaboration with biotech companies across the industry, such as Gensaic, Elsie Bio, and AtlasXOmics.
The platform solves organizational problems in data analysis that generalize across diverse biology, facilitating collaborative exploration and analysis of biological data with a focus on usability for scientists.
With Latch, you can bootstrap end-to-end bioinformatics analysis quickly for a large team, developing your workflows natively in Snakemake and integrating them into a broader ecosystem of data management and analysis tooling.
Generate graphical interfaces from code
When an existing Snakemake project is registered to Latch, frontend interfaces are dynamically generated directly from code. The behavior of frontend components constructed for each parameter, such as validation rules and hover-over descriptions, as well as the layout of the interface are defined in a metadata file that sits next to the Snakefile.
Scale out execution on managed infrastructure
As modern high throughput techniques move towards building and characterizing increasingly large libraries of biological designs, the assays that measure them require clusters of big computers to process the data they generate. Managing this infrastructure, either on-premise or with a cloud provider, requires dedicated personnel and resource expenditure.
Snakemake provides keywords to annotate workflow rules with resources arbitrarily (“this assembly step needs 96 cores and 80 GB of RAM”). Latch will provision the necessary infrastructure to satisfy these resource requests when such workflow rules are scheduled. It will do so in a way that is cost effective and scalable, allowing batched runs totalling thousands of cores, terabytes of RAM and hundreds of GPUs to run smoothly when needed.
```snakemake
rule a:
    input:
        ...
    output:
        ...
    resources:
        mem_gb=80
    threads: 96
    shell:
        "..."
```
Plug into the Latch ecosystem
Developing and deploying bioinformatics workflows is a small piece of a broader analysis ecosystem. Large files need to be stored, versioned and tracked. Relational schemas associating raw assay outputs with experimental context need to be managed and made accessible to downstream analysis. Infrastructure to support exploratory data science in R and python need to pull in workflow outputs, files and metadata to facilitate analysis. These components are all a necessary part of the lifecycle of biological data and are part of a broader biological cloud.
Bootstrapping a Snakemake workflow with Latch gives you access to these other prebuilt components, allowing you to focus on differentiating biology rather than reinventing the wheel.
How Snakemake works on Latch
We want our integration to require as little boilerplate and configuration as possible for existing Snakemake projects. Bioinformaticians should be able to drop in existing repositories and have them behave as expected on Latch, yet still be able to easily eject and run them in other contexts.
```python
from pathlib import Path

from latch.types.directory import LatchDir
from latch.types.file import LatchFile
from latch.types.metadata import LatchAuthor, SnakemakeFileParameter, SnakemakeMetadata

SnakemakeMetadata(
    display_name="fgbio Best Practise FASTQ -> Consensus Pipeline",
    author=LatchAuthor(
        name="Fulcrum Genomics",
    ),
    parameters={
        "r1_fastq": SnakemakeFileParameter(
            display_name="Read 1 FastQ",
            type=LatchFile,
            path=Path("tests/r1.fq.gz"),
        ),
        "r2_fastq": SnakemakeFileParameter(
            display_name="Read 2 FastQ",
            type=LatchFile,
            path=Path("tests/r2.fq.gz"),
        ),
        "genome": SnakemakeFileParameter(
            display_name="Reference Genome",
            type=LatchDir,
            path=Path("tests/hs38DH"),
        ),
    },
)
```
A latch_metadata.py file that sits next to the Snakefile describes parameters and how they should be displayed. A Dockerfile defines the environment that each job will run in, allowing configuration of arbitrary rule dependencies.
JIT (Just-in-Time) compilation
Latch workflows are immutable. They are defined by their nodes (jobs, in Snakemake verbiage), their typed input and output parameters, and the “edges” between them. Any change to the structure of the workflow, e.g. the number of nodes or a parameter type, requires a new workflow version to be registered. This requirement gives us strong reproducibility guarantees, but makes support for dynamic workflows, where the workflow graph can only be determined from input values, somewhat difficult.
Snakemake workflows are dynamic. Jobs are often generated for each file in a directory, so directories of different sizes generate workflows of different structure. To support this on Latch, we’ve introduced a “just-in-time” workflow construction step.
How to try Snakemake on Latch today
Try developing your first workflow by following our documentation here: https://docs.latch.bio/manual/snakemake.html
All of your feedback and requests are encouraged. Please contact me at kenny@latch.bio or via twitter @kenbwork to get involved.
We plan to continue developing Snakemake support alongside bioinformatics experts and tailoring it to the needs of industry.
Acknowledgements
A special thanks to Nils Homer and the team at Fulcrum Genomics for their feedback and co-development of this integration.
Thank you to Kyle Giffin, Hannah Le, Aidan Abdulali, Nathan Manske at Latch and James McGann at AtlasXOmics for their comments and edits.